A Short Introduction to Apache Iceberg

This tutorial shows how to use Apache Iceberg to address data consistency and performance issues. Read on to see how it can help you!

By Gautam Goswami · Aug. 20, 21 · Tutorial

A table can be defined as an arrangement of data in rows and columns. In a similar fashion, viewed from the big data perspective, the large number of individual files that hold the actual data can be organized in a tabular manner too.

We are already familiar with Apache Hive, which works as a data warehouse system to query and analyze large datasets stored in HDFS (Hadoop Distributed File System) or Amazon S3, and which is an integral part of the big data ecosystem. Hive follows a simple directory-based design in which the actual data files are stored at the folder/directory level in HDFS. Hive keeps track of data at the folder level, not at the level of the actual data files.

Because of this directory-based model, listings in Hive are slow, renames are not atomic, and results are only eventually consistent. To work with the data in a table, Hive needs to perform file list operations, and this causes a performance bottleneck when executing SQL queries. Apache Iceberg is a newer table format for storing large, slow-moving tabular data that improves on the standard table layout built into Hive, Trino, and Spark.

The giant OTT platform Netflix originally developed Iceberg to resolve long-standing issues around managing and storing huge volumes of data in tables, often at petabyte scale. Later, in 2018, Iceberg was open-sourced as an Apache Incubator project.

Apache Iceberg was designed and developed primarily to address the data consistency and performance issues that Hive suffers from. Below, we look at a few of the major issues Hive has, as noted above, and how Apache Iceberg resolves them.

Schema Evolution

In a nutshell, schema evolution permits us to update the schema used to write new data while maintaining backward compatibility with the schemas of our old data. To support schema evolution in Hive, the actual data files need to be modified or rewritten. As an example, if we want to handle a schema change in a Hive ORC table, such as a column deletion occurring at the source database (MySQL, with data imported by leveraging Flume), here are the major steps we need to follow (a code sketch of the rewrite steps appears after the list):

  1. Take a backup of the old schema file.
  2. Move the new AVSC schema file to HDFS.
  3. Create the Avro table with the new location and schema location set.
  4. Verify the data in the Avro table after the schema changes in MySQL.
  5. Take a backup of the current ORC table and drop the original ORC table.
  6. Create the ORC table with a new location set.
  7. Insert the data into the ORC table and verify the ORC table after the schema changes.
  8. Then, continue the incremental loads from the next day onwards with the new target directory that was created after the schema changes.
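
For illustration only, here is a minimal Java sketch of the table-recreation steps (5 through 7) above, run as HiveQL over a HiveServer2 JDBC connection. The connection URL, table names, columns, and paths are all hypothetical placeholders, and the Hive JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSchemaChange {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {

            // Step 5: back up the current ORC table, then drop the original.
            stmt.execute("CREATE TABLE orders_orc_bkp AS SELECT * FROM orders_orc");
            stmt.execute("DROP TABLE orders_orc");

            // Step 6: recreate the ORC table at a new location with the new
            // schema; the column dropped at the source is simply omitted.
            stmt.execute("CREATE EXTERNAL TABLE orders_orc (id BIGINT, amount DOUBLE) "
                    + "STORED AS ORC LOCATION '/data/orders_orc_v2'");

            // Step 7: rewrite all of the data into the new table from the
            // Avro staging table; this full rewrite is the expensive part.
            stmt.execute("INSERT INTO TABLE orders_orc SELECT id, amount FROM orders_avro");
        }
    }
}

Note how every schema change forces a full rewrite of the table's data. This is exactly the cost Iceberg avoids.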

In Apache Iceberg, by contrast, schema updates are metadata changes only, and because of that, no data files need to be rewritten to perform the update.

Iceberg tracks each column in a table using a unique ID. When we add a new column, a new ID is assigned to it so that existing data is never used by mistake.

The following schema evolution changes are currently supported by Apache Iceberg:

  • Add – add a new column to the table or to a nested struct.
  • Drop – remove an existing column from the table or a nested struct.
  • Rename – rename an existing column or field in a nested struct.
  • Update – widen the type of a column, struct field, map key, map value, or list element.
  • Reorder – change the order of columns or fields in a nested struct.

To ensure that schema evolution changes are unfettered and free of side effects, and never require rewriting files, Apache Iceberg never reads existing values from another column while adding a new column. Similarly, when dropping or updating a column or field, Iceberg does not change the values in any other column.
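
As a concrete illustration, here is a minimal sketch of these operations using Iceberg's Java UpdateSchema API. The table location and column names are hypothetical, and each commit() below is a metadata-only change.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class SchemaEvolutionDemo {
    public static void main(String[] args) {
        // Load an existing Iceberg table from a hypothetical HDFS location.
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://namenode/warehouse/db/orders");

        // Add: the new column gets a fresh unique ID, so it can never
        // accidentally pick up values written for an older column.
        table.updateSchema()
             .addColumn("discount", Types.DoubleType.get())
             .commit();

        // Rename: only the name in the metadata changes; the ID is stable.
        table.updateSchema()
             .renameColumn("amount", "total_amount")
             .commit();

        // Update: widen an int column to long without touching data files.
        table.updateSchema()
             .updateColumn("quantity", Types.LongType.get())
             .commit();

        // Reorder: move a column within the schema.
        table.updateSchema()
             .moveAfter("discount", "total_amount")
             .commit();

        // Drop: remove a column from the schema; data files stay untouched.
        table.updateSchema()
             .deleteColumn("obsolete_flag")
             .commit();
    }
}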

Partition Evolution

In Apache Hive, partitioning can be done by dividing a table into related groups based on the values of a particular column like date, city, or country. Partitioning reduces the query response time in Apache Hive, as data is stored in horizontal slices. However, Hive partitions are explicit: they appear as ordinary columns and must be given partition values. This approach gives Hive several issues: it cannot validate partition values, so it is fully dependent on the writer to produce the correct values; it is 100% dependent on the user to write queries correctly; and working queries are tightly coupled to the table's partitioning scheme, so the partitioning configuration cannot be changed without breaking queries.

Apache Iceberg introduces the concept of hidden partitioning, where the reading of unnecessary partitions is avoided automatically. Data consumers that fire the queries don't need to know how the table is partitioned or add extra filters to their queries, and Iceberg partition layouts can evolve as needed. Iceberg can hide partitioning because it does not require user-maintained partition columns; instead, it produces partition values by taking a column value and, optionally, transforming it.
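
Here is a minimal sketch of how this looks with the Java API. The table location and columns are hypothetical, and the updateSpec() call assumes a reasonably recent Iceberg release. The writer declares a transform once; readers simply filter on event_ts, and Iceberg prunes partitions automatically.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class HiddenPartitioning {
    public static void main(String[] args) {
        Schema schema = new Schema(
                Types.NestedField.required(1, "id", Types.LongType.get()),
                Types.NestedField.required(2, "event_ts", Types.TimestampType.withZone()));

        // Hidden partitioning: Iceberg derives a day value from event_ts
        // itself; readers never see or filter on a partition column.
        PartitionSpec spec = PartitionSpec.builderFor(schema)
                .day("event_ts")
                .build();

        // Create the table at a hypothetical HDFS location.
        Table table = new HadoopTables(new Configuration())
                .create(schema, spec, "hdfs://namenode/warehouse/db/events");

        // Partition evolution: later switch to hourly granularity. Existing
        // data files stay as they are; only new writes use the new layout.
        table.updateSpec()
             .removeField(Expressions.day("event_ts"))
             .addField(Expressions.hour("event_ts"))
             .commit();
    }
}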

Apache Iceberg is used in production where a single table can contain tens of petabytes of data, and such huge tables can be read without leveraging a distributed SQL engine; it was developed for gigantic tables. Using the set of Java APIs that Iceberg provides, we can manage table metadata, such as the schema, the partition spec, metadata files, and the data files that store the table data.
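
For instance, here is a small, hypothetical sketch that reads that metadata back through the Java API, using the same assumed table location as the examples above.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class InspectTable {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://namenode/warehouse/db/events"); // hypothetical path

        System.out.println("Schema:         " + table.schema());
        System.out.println("Partition spec: " + table.spec());
        System.out.println("Location:       " + table.location());
        // The current snapshot points at the manifest files that track
        // every data file in the table.
        System.out.println("Snapshot:       " + table.currentSnapshot());
    }
}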

I hope you have enjoyed this read. Please like and share if you find this article valuable.


