In this series of blog articles, I am going to talk about modern Data Lakehouses, a.k.a. Open Data Platforms, which do the heavy lifting (and shifting) of reshaping data into new formats and moving it to new places as requirements change.
Let me start Part 1 (this article) with basic definitions of the terms used here, for the sake of clarity.
Conceptualizing a Data Lakehouse
A Data Lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.
We plan to leverage Apache Spark for data processing, using popular open-source big data formats: open file formats such as Apache Parquet, ORC, and Avro, and open table formats such as Delta Lake, Apache Iceberg, and Apache Hudi.
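To make the file-format vs. table-format distinction concrete, here is a minimal PySpark sketch (not our final setup): it writes the same rows once as plain Parquet files and once as a Delta table. The paths are illustrative, and the delta-spark package is assumed to be installed.

```python
# A minimal sketch: the same data as an open file format (Parquet) and as
# an open table format (Delta). Paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-formats-demo")
    # Delta Lake wiring; Iceberg and Hudi plug in via similar catalog configs.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

# Open file format: columnar Parquet files, readable by virtually any engine.
df.write.mode("overwrite").parquet("/tmp/demo/parquet")

# Open table format: Delta layers a transaction log (ACID, time travel)
# on top of the same kind of Parquet files.
df.write.format("delta").mode("overwrite").save("/tmp/demo/delta")

spark.read.format("delta").load("/tmp/demo/delta").show()
```

The point of the table formats is exactly the lakehouse promise from the definition above: they add warehouse-style transactions on top of lake-style open files.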
Here, we are going to turn object storage into a scalable, high-performance, and secure data lakehouse solution.
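As a sketch of what that looks like in practice, the snippet below points Spark at an S3-compatible object store through the s3a:// connector. The endpoint and credentials are placeholders, and the hadoop-aws package must be on the Spark classpath.

```python
# A sketch, assuming an S3-compatible store such as MinIO; endpoint and
# credentials below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-object-storage")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.com:9000")  # hypothetical
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")  # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")  # typical for MinIO
    .getOrCreate()
)

# A bucket then behaves like any other path.
spark.range(10).write.mode("overwrite").parquet("s3a://demo-bucket/bronze/numbers")
```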
We will benchmark the performance of our data lakehouse (DLH) using TPC-DS (https://www.tpc.org/tpcds/). The Transaction Processing Performance Council (TPC) defines transaction processing and database benchmarks and delivers trusted results to the industry.
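To illustrate what a benchmark run looks like, here is a hedged sketch that times one star-join query modeled on TPC-DS Q3 with Spark SQL. It assumes the TPC-DS data has already been generated (e.g., with the TPC-DS dsdgen tool, at a scale factor of your choosing) and registered as tables named store_sales, date_dim, and item.

```python
# A sketch of timing a single TPC-DS-style query; data generation and table
# registration are assumed to have happened already.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-q3-timing").getOrCreate()

# Star-join query modeled on TPC-DS Q3: sales by brand for one month.
query = """
SELECT dt.d_year, item.i_brand_id AS brand_id, item.i_brand AS brand,
       SUM(ss_ext_sales_price) AS sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 128
  AND dt.d_moy = 11
GROUP BY dt.d_year, item.i_brand_id, item.i_brand
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100
"""

start = time.time()
spark.sql(query).collect()  # collect() forces full execution, not just planning
print(f"Query finished in {time.time() - start:.1f}s")
```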
While setting up the system and running the experiments, I realized that the combination of Apache Spark, Hive Metastore, and open file/table formats on cloud object storage creates a powerful solution for data analytics, one that emphasizes performance, scalability, and ease of management.
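The Hive Metastore is the glue piece here. A minimal sketch, assuming an external metastore service is already running (the thrift URI below is a placeholder for your own deployment):

```python
# Attach Spark to a shared Hive Metastore so that table definitions live
# in one catalog. The URI is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-hms")
    .config("hive.metastore.uris", "thrift://metastore.example.com:9083")  # hypothetical
    .enableHiveSupport()
    .getOrCreate()
)

# Tables created here land in the shared catalog, so any other engine
# pointed at the same metastore sees identical table definitions.
spark.sql("CREATE DATABASE IF NOT EXISTS lakehouse_demo")
spark.sql("SHOW DATABASES").show()
```

This shared catalog is what lets multiple engines (Spark, Trino, Hive, and so on) query the same tables without copying metadata around.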
Next, we will set up the environment so that we can conduct our experiments using the aforementioned open-source tech stack.