Dipankar Mazumdar: Apache Iceberg: enabling an open lakehouse architecture for large-scale analytics

Anastasia Khomyakova
Jun 20, 2023

Updated: Nov 6, 2023

Dipankar is currently a Staff Data Advocate at Onehouse.ai, where he focuses on open-source projects such as Apache Hudi and Onetable to help engineering teams build and scale robust data analytics platforms. Before this, he worked on critical open-source projects such as Apache Iceberg and Apache Arrow at Dremio. For most of his career, Dipankar has worked at the intersection of data visualization and machine learning. He also holds a Master's in Computer Science, with research focused on Explainable AI.

Apache Iceberg: enabling an open lakehouse architecture for large-scale analytics

Data lakes were built to democratize data: to let more people, tools, and applications make use of it.

A key capability for achieving this is hiding the complexity of the underlying data structures and physical storage from users. The de facto standard has been the Hive table format, originally developed at Facebook, which addresses some of these problems but falls short as data volume, user count, and the number of applications grow.

Apache Iceberg is a foundational technology for implementing an open data lakehouse, an architecture that addresses the limitations of traditional data architecture patterns. These limitations include having to ETL the data into each tool, which creates data drift and data silos; high costs, which make it prohibitive to offer warehouse features across all of your data; and a lack of flexibility, which forces you to adjust your workflow to the tool your data is locked in.

Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. In this talk, we will go through:

- What is a Lakehouse architecture?

- What are table formats in a data lake?

- Architecture of an Iceberg table

- Benefits of this architecture (cost savings, etc.) and how it enables workloads such as BI and ML