Delta Change Data Feed Deep Dive: Building Incremental Pipelines Without Complexity

Delta CDF in Databricks enables pipelines to process only changed rows with commit metadata, simplifying incremental ETL without full scans.

Apr. 01, 26 · Analysis

Likes (1)

Comment

Save

3.0K Views

Delta Lake’s Change Data Feed (CDF) is a key feature for building incremental pipelines. When enabled on a Delta table, CDF tracks row-level changes between versions of that table. In practice, this means your pipelines can process only the rows that changed since the last run, instead of scanning entire tables. For example, rather than comparing two multi-terabyte snapshots, you can quickly retrieve just the handful of rows that were updated. This greatly simplifies ETL/ELT workloads by avoiding full-table scans.

Enabling Change Data Feed

Before you can read changes, CDF must be enabled on the table. In Databricks, you set the table property delta.enableChangeDataFeed = true when creating or altering a Delta table. For instance, in PySpark, you might run:

    Python
   
 

   # Create a new Delta table with CDF enabled
spark.sql("""
  CREATE TABLE people (id INT, name STRING, value DOUBLE)
  USING delta
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
  

You can also enable CDF on an existing table with an ALTER TABLE … SET TBLPROPERTIES command. Note that only changes after enabling CDF are recorded; past changes are not captured. You can even set a Spark session default so all new tables have CDF enabled.

Once enabled, CDF adds hidden metadata columns to the table’s history. The change feed output includes each row’s data plus _change_type, as well as _commit_version and _commit_timestamp to mark when the change occurred.

Reading Changes With Structured Streaming

Databricks recommends using Structured Streaming to process CDF changes incrementally. In a streaming query, set the readChangeFeed option to true. For example:

    Python
   
 

   from pyspark.sql.functions import col

# Read the change feed as a streaming DataFrame
deltaStreamDF = (
    spark.readStream
         .format("delta")
         .option("readChangeFeed", "true")
         .table("people")
)
  

This streaming source will, by default, emit the current table snapshot as INSERT events on first run, and then continuously yield only new inserts, updates, or deletes as they occur. You can also specify a starting version or timestamp to skip initial data.

Once you have the streaming DataFrame, you can apply transformations and write it to a sink as usual. For instance, to capture only the deleted rows into another Delta table, you might do:

    Python
   
 

   (deltaStreamDF
   .filter(col("_change_type") == "delete")
   .select("id", "name")
   .writeStream
   .format("delta")
   .outputMode("append")
   .option("checkpointLocation", "/path/to/checkpoint")
   .trigger(availableNow=True)
   .table("deleted_records")
)
  

This uses the availableNow trigger, which processes all currently available changes and then stops automatically. The availableNow=True trigger is ideal for finite pipelines, such as hourly or daily jobs, because it processes all pending changes and then exits. Using availableNow=True is especially useful for serverless clusters that don’t support continuous streaming.

As shown, you simply treat the change feed like any other streaming source. In fact, CDF is plug-and-play with Databricks features such as Autoloader, Delta Live Tables, Workflows, and Databricks SQL. You just add .option("readChangeFeed","true") and Spark handles the rest.

Batch Reads (Optional)

While streaming is recommended for continuous pipelines, Delta also supports batch reads of change data. In a batch query, you can use spark.read with readChangeFeed=true and specify a starting and/or ending version. For example:

    Python
   
 

   # Batch read of all changes from version 0 to 10
spark.read \
     .format("delta") \
     .option("readChangeFeed", "true") \
     .option("startingVersion", 0) \
     .option("endingVersion", 10) \
     .table("people") \
     .show()
  

This returns all row-level deltas (with _change_type, etc.) in the specified range. Batch mode is useful for ad-hoc audits or backfills.

Building an Incremental Pipeline

With CDF enabled and read in streaming mode, your pipeline logic typically looks like any other Spark streaming ETL. The big difference is that only changed rows flow through, making the logic simpler and faster. Your code might:

Read changes from the source Delta table.
Filter or transform the change DataFrame as needed.
Write results to the target table or downstream system using writeStream.
Checkpoint and trigger as usual.

This approach is almost identical to normal streaming code, but under the hood, Spark/Delta ensures exactly-once processing of each row-level change.

Because CDF ties changes to Delta’s transaction log, it guarantees correctness even across concurrent writes, schema evolution, or cluster restarts. Unlike external CDC systems, you don’t need to manage offsets or watermarks manually; the versioning in Delta provides a consistent source of truth.

Key Benefits

Efficiency: Only modified rows are read and processed. This avoids expensive full-table scans and speeds up incremental ETL.
Simplicity: You avoid complex logic to compare snapshots or track change timestamps. The delta table’s own log provides the changes.
Rich metadata: Each change record carries _change_type commit version/timestamp, and even pre-image for updates, so you can implement SCD logic or downstream updates easily.
Integration: CDF works naturally with Spark Structured Streaming and Databricks features. It is queryable via Spark or via SQL (table_changes(...) function).

Best Practices and Considerations

Retention: By default, Delta will eventually remove old log entries (and CDF data) based on retention settings. To ensure you don’t miss changes, set a long enough delta.logRetentionDuration so downstream jobs have time to catch up. If you need a permanent audit, write out changes to a separate table as shown.
Vacuum: Running VACUUM on the table will delete old change files. Be careful not to vacuum out change data prematurely; align vacuum retention with your CDF processing window.
Schema changes: Simple schema evolutions are supported, but if you do a non-additive change (rename/drop column), you cannot read CDF for that transaction range.
Unsupported scenarios: CDF only works on Delta tables, and once enabled, it cannot be disabled without rewriting the table. If you do a bulk reload occasionally, CDF may be overkill. It’s best for use cases where incremental changes are common and downstream systems can handle CDC.
Output modes: You will typically use outputMode("append") with CDF, since each change is output once. No special handling of updates is needed an update_postimage row just appears as a new record to be applied by your target logic.

Conclusion

Delta Lake’s Change Data Feed makes incremental pipelines straightforward and efficient. By enabling CDF, engineers can build streaming or micro-batch pipelines that consume only the changed rows with little extra code. The infrastructure takes care of tracking changes while Spark Structured Streaming provides fault tolerance and exactly once delivery. In practice, enabling CDF incurs minimal write overhead but can drastically reduce downstream ETL costs.

In short, CDF turns your Delta Lake into a built-in CDC engine: you can set it up with a single table property and then simply read changes as a stream. This paradigm shift lets you replace complex incremental logic with simple Spark code, enabling near-real-time and efficient analytics pipelines

Pipeline (software) Data lake

Opinions expressed by DZone contributors are their own.

Related

Trending