Transforming Open Data Lakehouses with Apache Spark

Processing Open Table / File formats for Benchmarking Performance

In this series of blog articles, I am going to talk about modern Data Lakehouses, a.k.a. Open Data Platforms, doing all the heavy lifting and shifting and thus reshaping data to meet new requirements for both format and location!

Let me start Part 1 (this article) with basic definitions of the terms used here, for the sake of clarity.

Data Lakehouses

  • Combine the best elements of data lakes & data warehouses
  • Are built on open source & open standards
  • Simplify the data estate by eliminating silos, which impede enterprise data value creation
  • Reduce costs and deliver Data/AI initiatives faster!

Conceptualizing a Data Lakehouse

A Data Lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.

We plan to leverage Apache Spark for data processing, using open file and table formats such as the following (a minimal session-configuration sketch comes after the list):

Delta Lake

  • Parquet

Iceberg

  • Parquet
  • ORC
  • Avro
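
To make this concrete, here is a minimal PySpark sketch of one Spark session wired for both table formats. The Maven coordinates, the catalog name (ice), and the warehouse/table paths are illustrative assumptions, so match the package versions to your own Spark build:

```python
from pyspark.sql import SparkSession

# A sketch, not a definitive setup: one session configured for Delta Lake
# and Iceberg side by side. Versions and paths below are placeholders.
spark = (
    SparkSession.builder
    .appName("open-table-formats")
    # Illustrative coordinates; align them with your Spark/Scala version
    .config("spark.jars.packages",
            "io.delta:delta-spark_2.12:3.1.0,"
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # SQL extensions for both table formats
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension,"
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Delta takes over the default catalog; Iceberg gets a named catalog
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "hadoop")
    .config("spark.sql.catalog.ice.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

df = spark.range(10)

# Delta Lake stores data as Parquet files plus a transaction log
df.write.format("delta").mode("overwrite").save("/tmp/delta/numbers")

# Iceberg defaults to Parquet; ORC or Avro can be chosen per table
(df.writeTo("ice.db.numbers").using("iceberg")
   .tableProperty("write.format.default", "parquet")
   .createOrReplace())
```

Because Iceberg's file format is a per-table property (write.format.default), the same catalog can hold Parquet, ORC, and Avro tables side by side, which is exactly what we need for format-level benchmarking.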

Popular Big Data Formats (Open Source)

Here we're going to turn object storage into a scalable, high-performance, and secure data lakehouse solution.

We will benchmark the performance of our DLH using TPC-DS (https://www.tpc.org/tpcds/). The Transaction Processing Performance Council (TPC) defines transaction processing and database benchmarks and delivers trusted results to the industry.
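
To give a flavor of the benchmark harness, here is a hedged timing sketch. It assumes the TPC-DS tables are already registered in the catalog and that single-statement query files from the TPC-DS kit live in a local queries/ directory; the directory and file names are placeholders:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-timing").getOrCreate()

def run_query(path: str) -> float:
    """Execute one TPC-DS query file and return wall-clock seconds."""
    with open(path) as f:
        sql = f.read()
    start = time.perf_counter()
    spark.sql(sql).collect()  # collect() forces full execution, not just planning
    return time.perf_counter() - start

# Placeholder file names; in practice, loop over all 99 TPC-DS queries
for q in ["q3.sql", "q7.sql", "q19.sql"]:
    print(f"{q}: {run_query(f'queries/{q}'):.2f}s")
```

In real runs you would also do warm-up executions and repeat each query several times to smooth out caching effects.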

While setting up the system and running the experiments, I realized that the combination of Apache Spark, Hive Metastore, and open file/table formats on cloud object storage creates a powerful solution for data analytics that emphasizes performance, scalability, and ease of management:

  • Apache Spark provides a distributed computing engine for processing large datasets
  • Hive Metastore acts as a centralized metadata repository that manages table schemas and ensures consistency
  • Cloud object storage holding the open file/table data provides a scalable, fast storage backend capable of high-throughput data operations, making the stack ideal for modern data lakehouse architectures (see the configuration sketch right after this list)
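
Here is a minimal sketch of how the three pieces meet in one SparkSession. The metastore URI, the S3-compatible endpoint, the credentials, and the bucket name are all placeholder assumptions, and the hadoop-aws module is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

# Sketch: Spark on S3-compatible object storage, with table schemas tracked
# in a shared Hive Metastore. All endpoints/credentials are placeholders.
spark = (
    SparkSession.builder
    .appName("lakehouse-on-object-storage")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .config("spark.hadoop.fs.s3a.endpoint", "http://object-store:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .enableHiveSupport()  # use the Hive Metastore as Spark's catalog
    .getOrCreate()
)

# Table metadata lives in the metastore; data files live in the bucket
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id BIGINT, amount DOUBLE)
    USING parquet
    LOCATION 's3a://lakehouse/sales'
""")
spark.sql("SHOW TABLES").show()
```

Any Spark cluster pointing at the same metastore sees the same tables, which is what eliminates per-engine metadata silos.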

Next, we're going to set up the environment so that we can conduct our experiments using the aforementioned open source tech stack.