DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • 5 Critical Databricks Performance Hacks That Most Engineers Miss (100x Faster Queries)
  • Spark Job Optimization
  • Big Data Realtime Data Pipeline Architecture
  • Building Robust Real-Time Data Pipelines With Python, Apache Kafka, and the Cloud

Trending

  • Integrating AI-Driven Decision-Making in Agile Frameworks: A Deep Dive into Real-World Applications and Challenges
  • Building a Production-Ready AI Agent in 2026: Beyond the Hello World Demo
  • Observability in Spring Boot 4
  • The Serverless Illusion: When “Pay for What You Use” Becomes Expensive
  1. DZone
  2. Data Engineering
  3. Big Data
  4. End-to-End Streaming Optimization: Kafka to Delta With Exactly-Once Guarantees

End-to-End Streaming Optimization: Kafka to Delta With Exactly-Once Guarantees

Kafka feeds the stream, Spark tracks progress via checkpoints, and Delta's transaction log ensures every event lands exactly once, even across failures and restarts.

By 
Seshendranath Balla Venkata user avatar
Seshendranath Balla Venkata
·
Apr. 01, 26 · Tutorial
Likes (2)
Comment
Save
Tweet
Share
2.4K Views

Join the DZone community and get the full member experience.

Join For Free

Modern data applications often rely on real-time pipelines that ingest events from systems like Apache Kafka into data lakes. Ensuring exactly once delivery is critical; each event should be processed and stored only once, even across failures. Apache Spark Structured Streaming on Databricks, together with Delta Lake, provides end-to-end exactly once fault tolerance.

Key benefits of this approach include:

  • Exactly-once processing: Delta’s transaction log coordinates commits so each event ends up in the table only once.
  • Fault tolerance: Spark’s checkpointing plus Delta’s idempotency mean failed batches can be retried safely.
  • Built-in optimizations: Features like auto-compact and OPTIMIZE commands merge small files on the fly, improving query throughput.

Reading From Kafka

Use Spark Structured Streaming with the Kafka source to continuously read events. Configure the Kafka bootstrap.servers and topic:

Python
 
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "input_topic")
    .option("startingOffsets", "latest")
    .load()
)


This sets up a stream that pulls new messages from input_topic. Spark will track Kafka offsets in its checkpoint so that, on a restart, it resumes where it left off. After loading, convert the binary value into structured columns, for example, by parsing JSON with a predefined schema.

Writing Streaming Data to Delta Lake

Delta Lake is a transactional storage layer that works as a streaming sink. You can write the DataFrame in append mode to a Delta table. For example:

Python
 
schema = StructType([...])  # define schema for JSON payload
parsed = df.selectExpr("CAST(value AS STRING) as json") \
    .select(from_json(col("json"), schema).alias("data")) \
    .select("data.*")

(parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/events/")
    .option("path", "/mnt/delta/events/")
    .start()
)


Here we cast the Kafka value to string, parse it, and write the result into a Delta table at /mnt/delta/events. The checkpointLocation stores Spark’s progress. The Delta transaction log then applies each batch atomically. Databricks notes that the transaction log guarantees exactly-once processing even if multiple streams or batch jobs write to the same table. Delta tables also enforce the schema by default to protect table integrity. You can enable schema evolution if your message schema changes.

Achieving Exactly-Once Semantics

Spark Structured Streaming is designed for end-to-end exactly once processing. With Kafka as a replayable source and Delta as a transactional sink, Spark can recover from failures without duplicating records. Delta Lake enhances this by recording two pieces of metadata for each micro-batch: a unique txnAppId (query ID) and txnVersion. If Spark replays a batch after a failure, Delta checks the transaction log if an entry with the same (txnAppId, txnVersion) exists, it skips writing it again otherwise, it writes and commits the batch. This idempotent behavior ensures exactly-once semantics even if Spark reprocesses data.

Checkpointing and Fault Tolerance

Always provide a persistent checkpoint location. Spark writes an offsets file before each batch and a commits file after writing to Delta. If a failure occurs after writing but before recording the commit, Spark will see offsets/N exists but not commits/N, and will retry batch N. On restart, Delta’s transaction log prevents duplicates: it recognizes that batch N was already applied, so it skips the duplicate write. 

In other words, the checkpoint directory tracks intent, and Delta’s log tracks completion, forming a distributed two-phase commit. This guarantees that progress is safely recorded and no data is lost or duplicated.

Checkpoint safety: Do not delete or overwrite the streaming checkpoint directory. If the checkpoint is lost or reset, Spark may restart with a new query ID and reapply old batches, breaking the exactly once guarantee.

Performance Optimizations

Streaming ingestion can create many small files, which slows downstream queries. To optimize performance:

  • Auto-optimize and auto-compact: Delta Lake can merge small files on write. Enable table properties delta.autoOptimize.optimizeWrite = true and delta.autoOptimize.autoCompact = true (set them via Spark config). With these enabled, each streaming micro-batch produces larger, compacted files automatically.
  • Databricks auto-tuning: For Unity Catalog managed tables, Databricks can enable auto-optimization on write, compacting files during streaming writes.
  • Manual OPTIMIZE: Periodically run OPTIMIZE on the Delta table to coalesce small files. Delta Lake documentation recommends that runs a single-line command to rearrange the small files into properly sized files. For example: OPTIMIZE '/mnt/delta/events'.
  • Partitioning: Use sensible partition columns to avoid excessive file churn. For streaming workloads, partition by coarse time (date or hour) and avoid high-cardinality values. Good partitioning reduces the number of files written per batch and speeds queries.
  • Trigger tuning: Adjust the query trigger to balance latency and throughput. Use .trigger(processingTime='10 seconds') to process data at fixed intervals, or use Trigger.AvailableNow to process all available data immediately (for periodic batch runs). You can also throttle the read rate with options like .option("maxOffsetsPerTrigger", N).
  • State store tuning: If your streaming query uses large state (aggregations or joins), configure Spark to use the RocksDB state store provider. For example:
    Python
     
    spark.conf.set(
        "spark.sql.streaming.stateStore.providerClass",
        "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
    )
    This local key-value store scales better under load.

Putting It All Together

Here is a complete PySpark example combining these elements:

Python
 
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Read from Kafka topic
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "input_topic")
    .option("startingOffsets", "latest")
    .load())

# Define schema and parse JSON value
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType())
])
parsed = df.selectExpr("CAST(value AS STRING) as json") \
           .select(from_json(col("json"), schema).alias("data")) \
           .select("data.*")

# Write to Delta with checkpointing
parsed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoints/events/") \
    .start("/mnt/delta/events/")


This continuous query ingests Kafka records and appends them to the Delta table at /mnt/delta/events/. Thanks to Delta’s ACID commits and Spark’s offset tracking, each Kafka message is stored exactly once. If the job restarts, Spark resumes from the last offsets, and Delta’s log ensures no batch is double-applied.

Conclusion

Building a robust streaming pipeline from Kafka to Delta Lake on Databricks involves leveraging Spark Structured Streaming with checkpoints and Delta’s transactional sink. Delta Lake’s transaction log provides built-in idempotency so that each batch is applied atomically, yielding true exactly once semantics. Tuning triggers and enabling Delta’s file-optimization features can enable low-latency, high-throughput ingestion. Following these best practices ensures that streaming pipelines are both correct and efficient in production. Databricks documentation emphasizes that using Delta Lake as the sink delivers the most reliable exactly once processing pipeline. 

Additionally, Delta’s versioning lets engineers query historical states or roll back a table if needed, further improving auditability and correctness. By focusing on these practices, engineers can build Kafka-to-Delta streams with confidence that each record is processed exactly once and performance remains high

Apache Spark Transaction log kafka optimization

Opinions expressed by DZone contributors are their own.

Related

  • 5 Critical Databricks Performance Hacks That Most Engineers Miss (100x Faster Queries)
  • Spark Job Optimization
  • Big Data Realtime Data Pipeline Architecture
  • Building Robust Real-Time Data Pipelines With Python, Apache Kafka, and the Cloud

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook