DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Getting Started With Apache Cassandra
  • Looking for the Best Java Data Computation Layer Tool
  • The Magic of Apache Spark in Java
  • A Deep Dive into Apache Doris Indexes

Trending

  • Dear Micromanager: Your Distrust Has a Job; It’s Just Not the One You’re Doing
  • One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
  • From AI Chaos to Control: Building Enterprise-Grade LLM Gateways With MuleSoft Anypoint
  • From Indicators to Insights: Automating IOC Enrichment Using Python and Threat Feeds
  1. DZone
  2. Data Engineering
  3. Databases
  4. A Short Introduction to Apache Iceberg

A Short Introduction to Apache Iceberg

This tutorial shows how to use Apache Iceberg in order to address data consistency and performance issues. Read on to see how it can help you!

By 
Gautam Goswami user avatar
Gautam Goswami
DZone Core CORE ·
Aug. 20, 21 · Tutorial
Likes (3)
Comment
Save
Tweet
Share
9.6K Views

Join the DZone community and get the full member experience.

Join For Free

Apache Iceberg introduces the concept of hidden partitioning where the reading of unnecessary partitions can be avoided automatically. Data consumers that fire the queries don’t need to know how the table is partitioned and add extra filters to their queries.

A table can be defined as an arrangement of data in rows and columns and in a similar fashion, if you visualize from the Big Data perspective, the large number of individual files that hold the actual data can be organized in a tabular manner too.

We are already familiar with Apache Hive that works as a data warehouse system to query and analyze large datasets stored in the HDFS (Hadoop Distributed File System) or Amazon S3. Also, an integral part of the big data ecosystem. Hive is a simple directory-based design where actual data files are getting stored at the folder/directory level in HDFS. If interested, you can read here how to import data into Hive tables. Hive keeps track of data at the folder level not in actual data files. 

Because of the directory-based model in Hive, listings are much slower, renames are not atomic, and results are eventually consistent. To work with data in a table, Hive needs to perform file list operations and this causes a performance bottleneck while executing SQL queries. Apache Iceberg is a new table format for storing large, slow-moving tabular data and can improve on the more standard table layout built into Hive, Trino, and Spark.

The giant OTT platform Netflix originally developed Iceberg to decode their established issues related to managing/storing huge volumes of data in tables probably in petabyte-scales. Later, in 2018, Iceberg was open-sourced as an Apache Incubator project.

The main aim to designed and developed Apache Iceberg was basically to address the data consistency and performance issues that Hive having. Below we can see few major issues that Hive holding as said above and how resolved by Apache Iceberg.

Schema Evolution

In nutshell, Schema evolution permits us to update the schema used to write new data while maintaining backward compatibility with the schemas of our old data. To make schema evolution support in Hive, actual data files need to be modified or rewritten. As an example, if we want to handle schema changes/evolution in Hive ORC tables like column deletions occurring at source database, which is MySQL by leveraging Flume to import data, here are few major steps we need to follow:

  1. Taking backup of old schema file.
  2. Move the New AVSC Schema File to HDFS.
  3. Create AVRO table with new location set and scheme location set.
  4. Verify the data in AVRO after Schema changes in MySQL.
  5. Take a Backup of the current ORC table and Drop the Original ORC table.
  6. Create ORC table with a new location set 
  7. Insert the data into the ORC table and eventually verify the ORC table after the schema changes. 
  8. Then, continue the incremental loads from the next day onwards with the new target directory which was created after the schema changes.

But, Apache Iceberg schema updates are metadata changes only and because of that, no data files need to be rewritten to perform the update.

Using unique IDs, Iceberg tracks each column in a table. While we add a new column, a new ID would assign to it to avoid any existing data usage by mistake.

Following schema, evolution changes are currently supporting by Apache Iceberg:

  • Add – add a new column to the table or to a nested struct.
  • Drop – remove an existing column from the table or a nested struct.
  • Rename – rename an existing column or field in a nested struct.
  • Update – widen the type of a column, struct field, map key, map value, or list element.
  • Reorder – change the order of columns or fields in a nested struct.

To ensure schema evolution changes are unfettered and free of side-effects as well as without rewriting files, Apache Iceberg never read existing values from another column while adding a new column. Similarly for dropping or updating a column or field, Iceberg does not change the values in any other column.

Partition Evolution

In Apache, Hive partitioning can be done by dividing a table into related groups based on the values of a particular column like date, city, country, etc. Partitioning reduces the query response time in Apache Hive as data is stored in horizontal slices. In Hive partitioning, partitions are explicit and appear as a column and must be given partition values. Due to this approach, Hive having several issues like not being able to validate partition values is so fully dependent on the writer to produce the correct value, 100% dependent on the user to write queries correctly, Working queries are tightly coupled with the table’s partitioning scheme, so partitioning configuration cannot be changed without breaking queries, etc.

Apache Iceberg introduces the concept of hidden partitioning where the reading of unnecessary partitions can be avoided automatically. Data consumers that fire the queries don’t need to know how the table is partitioned and add extra filters to their queries. Iceberg partition layouts can evolve as needed. Iceberg can hide partitioning because it does not require user-maintained partition columns. Iceberg produces partition values by taking a column value and optionally transforming it.

Apache Iceberg is used in production where a single table can contain tens of petabytes of data and can read these huge tables without leveraging distributed SQL engine. It was developed for gigantic tables. By using a set of Java API that Iceberg produces, we can manage table metadata, like schema, partition spec, metadata, and data files that store table data.

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.

Reference

Database Big data sql Schema file IO Partition (database)

Published at DZone with permission of Gautam Goswami. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Getting Started With Apache Cassandra
  • Looking for the Best Java Data Computation Layer Tool
  • The Magic of Apache Spark in Java
  • A Deep Dive into Apache Doris Indexes

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook