Handling Dynamic Data Using Schema Evolution in Delta

This article is intended to help data practitioners learn an effective way of dealing with dynamically changing data through Delta Lake.

By Hanza Parayil Salim · Mar. 03, 25 · Analysis

One of the biggest challenges of managing data nowadays is its constantly changing nature. We recently came across a requirement to ingest survey data and send it for sentiment analysis. The survey data is highly dynamic: even for existing surveys, new questions may be added, modified, or deleted.

As the data source evolves, the challenge is representing this dynamically changing data in a table format. We solved this by leveraging the schema evolution feature of Delta tables.

Why Delta Lake?

Delta Lake is designed to bring reliability and robustness to data lakes. It is an open-source table format that supports ACID transactions, scalable metadata handling, enhanced data reliability, faster query performance, and seamless integration with Spark.

In data management, the ability to change the structure of data over time, known as schema evolution, is a common requirement, and Delta Lake supports it out of the box.

Source Data and Ingestion

The source data is semi-structured; in this case, we receive it as a text file, although such data is usually available in XML or JSON format. The first step is to read the data from the source and flatten it according to the data governance standards, roughly as sketched below.
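
To make this step concrete, here is a minimal sketch of reading and flattening the survey file, assuming (purely for illustration) that it is JSON with CustomerID, Tier, and a nested Answers array of QuestionID/Response pairs; the path and field names are hypothetical, not the production layout.

Python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, first

spark = SparkSession.builder.getOrCreate()

# Read the raw survey responses (hypothetical landing path and layout).
RawDF = spark.read.json("/mnt/blobname/SurveyLanding/batch1.json")

# Flatten the nested question/answer pairs into one column per question,
# e.g., CampaignQn1, CampaignQn2, ... keyed by QuestionID.
SourceDF = (
    RawDF.select("CustomerID", "Tier", explode("Answers").alias("Answer"))
         .select("CustomerID", "Tier",
                 col("Answer.QuestionID").alias("QuestionID"),
                 col("Answer.Response").alias("Response"))
         .groupBy("CustomerID", "Tier")
         .pivot("QuestionID")
         .agg(first("Response"))
)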

In this example, we receive source data in four batches, and each batch is ingested into the Delta table one after the other. The Delta table is initially created with this structure.

SQL
 
CREATE TABLE IF NOT EXISTS surveydb.CampaignTable
(
   CustomerID STRING
  ,Tier STRING
  ,CampaignQn1 STRING
) 
USING DELTA
LOCATION '/mnt/blobname/DeltaLake/CampaignTable'


This is the PySpark code used to load the incoming DataFrame into the Delta table, and it is executed for every batch. By default, Delta does not allow us to append data with a mismatched schema, a feature known as schema enforcement. So we first do an empty DataFrame append with schema merge and then run the regular merge.

Python
 
from delta.tables import DeltaTable

# Columns present in the incoming batch; only these will be inserted.
MergeColumns = [Col for Col in SourceDF.columns]

# Step 1: append an empty DataFrame with mergeSchema so any new columns
# are added to the Delta table (schema evolution without writing rows).
SourceDF.limit(0).write.format("delta") \
        .mode("append") \
        .option("mergeSchema", True) \
        .save(TargetPath)

# Step 2: regular merge; insert only the columns available in this batch.
DeltaTable.forPath(spark, TargetPath).alias("TGT") \
          .merge(SourceDF.alias("STG"),
                 "TGT.CustomerID = STG.CustomerID") \
          .whenNotMatchedInsert(values={Col: f"STG.{Col}" for Col in MergeColumns}) \
          .execute()
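
Because the same two steps run for every batch, one convenient way to organize them is a small helper called once per batch. This is only a sketch: spark is assumed to be the active session, and the Batch1DF–Batch4DF names in the driver loop are hypothetical.

Python

from delta.tables import DeltaTable


def load_batch(SourceDF, TargetPath, key="CustomerID"):
    """Evolve the schema if needed, then merge the batch into the Delta table."""
    MergeColumns = [Col for Col in SourceDF.columns]

    # Schema-only append: limit(0) writes no rows but adds any new columns.
    SourceDF.limit(0).write.format("delta") \
            .mode("append") \
            .option("mergeSchema", True) \
            .save(TargetPath)

    # Insert only the columns that exist in this batch.
    DeltaTable.forPath(spark, TargetPath).alias("TGT") \
              .merge(SourceDF.alias("STG"), f"TGT.{key} = STG.{key}") \
              .whenNotMatchedInsert(values={Col: f"STG.{Col}" for Col in MergeColumns}) \
              .execute()


# Hypothetical driver loop: one call per incoming batch.
for BatchDF in [Batch1DF, Batch2DF, Batch3DF, Batch4DF]:
    load_batch(BatchDF, "/mnt/blobname/DeltaLake/CampaignTable")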


Schema Evolution in Every Batch

Let’s examine the source data and the Delta table structure after the execution of each batch. In the initial batch, the data is received in the same structure as the table, but that is not mandatory in this approach.

Batch 1

Source data for the initial batch.

CustomerID | Tier  | CampaignQn1
1001       | Tier2 | Q1_1001

Target table after the initial batch is loaded.

Batch 2

In the second batch, a new column appears in the source data. The write command with the mergeSchema option and limit(0) appends only the schema, so the new column CampaignQn2 is automatically added to the table.

The merge command that follows inserts only the columns in the MergeColumns list, which contains just the columns present in the incoming DataFrame. In this way, the new column is created and loaded with data (a quick schema check is sketched after the table below).

CustomerID | Tier  | CampaignQn1 | CampaignQn2
1003       | Tier1 | Q1_1003     | Q2_1003

Target table after the second batch is loaded (customer 1003).
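
A quick way to confirm the schema-only append worked is to inspect the table schema after the batch is loaded; a minimal check, reusing the TargetPath from the load code above:

Python

# Inspect the evolved schema of the Delta table (spark is the active session).
spark.read.format("delta").load(TargetPath).printSchema()

# Expected to now include the new column, roughly:
# root
#  |-- CustomerID: string (nullable = true)
#  |-- Tier: string (nullable = true)
#  |-- CampaignQn1: string (nullable = true)
#  |-- CampaignQn2: string (nullable = true)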

Batch 3

Unlike the previous batches, the incoming data here is missing the CampaignQn1 column. The mergeSchema append still runs but produces no schema change, and the merge command with the MergeColumns list inserts only the columns available in the source DataFrame.

CustomerID | Tier  | CampaignQn2
1005       | Tier1 | Q2_1005

Target table after the third batch is loaded (customer 1005).
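
Because the merge specifies values only for the columns present in the batch, the columns missing from batch 3 are left as NULL for the inserted row. A quick check against the table created earlier:

Python

# CampaignQn1 was not in batch 3, so it should be NULL for customer 1005.
spark.sql("""
    SELECT CustomerID, CampaignQn1, CampaignQn2
    FROM surveydb.CampaignTable
    WHERE CustomerID = '1005'
""").show()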

Batch 4

Here, the column positions are shuffled, and there is a new column, CampaignQn3. The empty-DataFrame append with mergeSchema creates the new column.

The change in column order is handled by naming the columns explicitly during the merge using the MergeColumns list, so the data lands in the correct columns of the Delta table (see the sketch after the table below).

CustomerID | Tier  | CampaignQn3 | CampaignQn2
1007       | Tier1 | Q3_1007     | Q2_1007

Target table after the fourth batch is loaded (customer 1007).
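
To illustrate why the shuffled column order is harmless, here is a small sketch with a hypothetical batch-4 DataFrame: the insert values are keyed by column name via the MergeColumns list, so the position of a column in the DataFrame never matters.

Python

# Hypothetical batch 4 with columns in a different order (spark is the active session).
Batch4DF = spark.createDataFrame(
    [("1007", "Q3_1007", "Tier1", "Q2_1007")],
    ["CustomerID", "CampaignQn3", "Tier", "CampaignQn2"],
)

# The merge maps target columns to STG.<column> by name, not by position.
MergeColumns = [Col for Col in Batch4DF.columns]
print({Col: f"STG.{Col}" for Col in MergeColumns})
# {'CustomerID': 'STG.CustomerID', 'CampaignQn3': 'STG.CampaignQn3',
#  'Tier': 'STG.Tier', 'CampaignQn2': 'STG.CampaignQn2'}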

This allows us to change the schema of the table without a full data rewrite. The feature offers a lot of flexibility, but it must be used carefully. In Delta, the delta log (transaction log) tracks all schema and data changes, and we can inspect the history of these changes with the DESCRIBE HISTORY command.

SQL
 
DESCRIBE HISTORY surveydb.CampaignTable
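
The same history is also available through the Python API; a brief equivalent, assuming TargetPath points to the same table as in the load code:

Python

from delta.tables import DeltaTable

# Show the table's change history (version, timestamp, operation, ...).
DeltaTable.forPath(spark, TargetPath).history() \
          .select("version", "timestamp", "operation") \
          .show(truncate=False)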


Delta Lake allows us to add, change, or delete columns in a table without breaking data pipelines, which simplifies schema evolution. It also tracks these changes, making it easier to understand how the data has evolved over time. Together, schema enforcement and schema evolution help maintain data consistency and integrity by preventing errors, resulting in improved data quality and more efficient data governance.

Conclusion

Automatic schema evolution in Delta is a very useful feature when dealing with dynamic data sources, and it supports most scenarios, even complex ones. When data sources evolve, we can avoid manually updating the table schema: a straightforward command executed as part of the load process handles it. Schema evolution is especially useful when consuming multiple data sources with dynamically changing data.


Opinions expressed by DZone contributors are their own.
