DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Precision at Scale: Unveiling the Secrets of Quality Engineering in Data Engineering
  • Automating Data Quality Check in Data Pipelines
  • How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It
  • Ingesting Fixed-Width Mainframe Files Into Delta Lake: The Details Nobody Writes Down

Trending

  • Edge Computing in Utility IoT: Two Architecture Patterns That Actually Work
  • Master-Class: Understanding Database Replication (Single, Multi, and Leaderless)
  • LLM-Powered Deep Parsing for Industrial Inventory Search
  • Solving the Mystery: Why Java RSS Grows in Docker on M1 Macs
  1. DZone
  2. Data Engineering
  3. Data
  4. Data Quality at Write Time: Engineering Reliability With Delta Expectations

Data Quality at Write Time: Engineering Reliability With Delta Expectations

Ensure reliable data from the start. Learn how Delta Expectations in Databricks enforce real-time, write-time validation to prevent bad data before it lands.

By 
Rambabu Bandam user avatar
Rambabu Bandam
·
Oct. 23, 25 · Analysis
Likes (1)
Comment
Save
Tweet
Share
2.4K Views

Join the DZone community and get the full member experience.

Join For Free

Data quality failures don't announce themselves. They compound silently — a malformed timestamp here, a negative revenue figure there — until a quarterly board deck shows impossible numbers or an ML model degrades into uselessness. A 2023 Gartner study pegged the cost at $12.9 million annually per organization, but that figure misses the hidden expense: engineering time spent firefighting data incidents instead of building features.

The traditional approach treats validation as a post-processing step. You write data to storage, then run Great Expectations or Deequ checks, discover failures, and either fix the pipeline or quarantine bad records. This pattern creates a fundamental gap: the window between data landing and validation completion. In high-throughput lakehouses processing terabytes daily, that window can represent millions of corrupted records propagating downstream before anyone notices.

Delta Expectations — a feature of Databricks' Delta Live Tables (DLT) — collapses that window to zero by enforcing validation during the write transaction itself. This isn't just faster validation; it's an architectural shift from reactive data quality to proactive data contracts.

Why Write-Time Validation Changes the Game

Traditional ETL validation operates as an external auditor. You extract data, transform it, load it into a Delta table, and then check if it's valid:

  • Extract → Transform → Load → Validate → (Quarantine/Repair)

This sequence has consequences:

  • Storage pollution: Invalid data physically lands in your lakehouse. Even if you delete it later, it exists in the transaction log and occupies storage until VACUUM runs.
  • Downstream propagation: Between write completion and validation failure, downstream consumers may have already read the bad data. Scheduled jobs don't wait for validation results.
  • Cascading failures: If validation discovers issues hours after ingestion, you're debugging a cold trail. Which upstream system sent bad data? Was it a transient API failure or a schema change?

Delta Expectations inverts this model by embedding validation into the write path:

  • Extract → Transform → Validate+Load (atomic)

The validation logic executes during Spark's DataFrame evaluation, before the Delta transaction commits. A failed expectation can abort the write entirely, drop invalid records, or log violations while continuing — but critically, the validation decision happens before data persistence.

The Architecture: How Expectations Actually Work

Delta Expectations aren't modifications to Delta Lake's transaction protocol itself. They're a DLT framework feature that leverages Spark's lazy evaluation and Delta's atomicity guarantees.

When you define an expectation like:

Python
 
@dlt.expect_or_drop("valid_price", "price > 0")


DLT injects a filter operation into the Spark logical plan. During execution:

  1. Evaluation phase: Spark computes the expectation predicate (price > 0) for each record as part of the standard DataFrame transformation graph
  2. Action phase: Records are partitioned based on validation results:
    • FAIL mode: If any record fails, Spark throws an exception before the Delta write API is invoked
    • DROP mode: Failed records are filtered from the DataFrame before being passed to .write.format("delta")
    • WARN mode: All records proceed to write, but DLT logs metrics about failures
  3. Persistence phase: Only the valid subset (or all records in WARN mode) participates in the Delta transaction

This means validation happens at Spark execution time, not Delta commit time. The distinction matters: Delta's ACID guarantees ensure the write is atomic, but the expectation logic runs earlier in the execution pipeline. The phrase "atomic validation" refers to the fact that validation and write are part of the same Spark job — not that expectations are integrated into Delta's transaction protocol.

Critical implication: Expectations operate on data in motion (DataFrames) rather than data at rest (Delta tables). This is why they can prevent invalid data from ever being written, but also why they can't validate data already in a table without reprocessing it.

Implementation Patterns: From Basic to Production-Grade

Pattern 1: Layered Validation With Quarantine

Production pipelines don't just drop bad data — they preserve it for debugging. The Bronze-Silver-Gold medallion architecture naturally supports this:

Python
 
# Bronze: Accept everything, no expectations

@dlt.table(
    comment="Raw ingestion, no quality gates"
)
def bronze_orders():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load("/mnt/landing/orders")

# Silver: Strict validation with quarantine

@dlt.table(comment="Validated orders meeting business rules")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("price_positive", "price > 0 AND price < 1000000")
@dlt.expect_or_drop("valid_date", "order_date <= current_date()")
@dlt.expect("suspicious_quantity", "quantity < 10000")  # WARN only
def silver_orders():
    return dlt.read_stream("bronze_orders")

# Quarantine: Capture failed records

@dlt.table(comment="Orders that failed Silver validation")
def quarantine_orders():
    bronze = dlt.read_stream("bronze_orders")
    return bronze.filter(
        (col("order_id").isNull()) |
        (col("price") <= 0) |
        (col("price") >= 1000000) |
        (col("order_date") > current_date())
    ).withColumn("quarantine_timestamp", current_timestamp()) \
     .withColumn(
        "failure_reason",
        when(col("order_id").isNull(), "null_order_id")
        .when(col("price") <= 0, "negative_price")
        .when(col("price") >= 1000000, "price_too_high")
        .otherwise("invalid_date")
    )


Why this works:

  • Bronze preserves source fidelity for audit and replay
  • Silver expectations ensure downstream consumers never see invalid data
  • Quarantine tables enable post-mortem analysis without data loss
  • DLT's metrics UI shows expectation pass rates in real time

Operational note: The quarantine pattern requires duplicating expectation logic. In practice, extract these to a shared function to maintain DRY code.

Pattern 2: Streaming Watermarks and Late Data

Delta Expectations behave differently in streaming contexts. With late-arriving data, you need to coordinate watermarks and validation:

Python
 
@dlt.table
@dlt.expect_or_drop("within_watermark", "event_timestamp > current_timestamp() - interval 24 hours")
def streaming_events():
    return (spark.readStream
            .format("kafka")
            .option("subscribe", "events")
            .load()
            .withWatermark("event_timestamp", "1 hour")
            .select("event_timestamp", "user_id", "event_type"))


Critical detail: The expectation runs after the watermark is applied. Records older than the watermark are already dropped by Spark Structured Streaming before expectations are evaluated. This creates a layered filter:

  1. Watermark drops events older than the threshold
  2. Expectations validate remaining records
  3. Delta writes persist only valid, timely data

Pattern 3: Cross-Table Validation and Referential Integrity

Expectations can enforce relationships across tables using joins:

Python
 
@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IN (SELECT customer_id FROM LIVE.dim_customers)")
def orders_with_referential_integrity():
    return dlt.read("bronze_orders")


Performance caveat: This triggers a broadcast join at every micro-batch. If dim_customers is large, this becomes expensive. Prefer precomputed lookup tables or alternative validation strategies for large dimensions.

Performance Considerations: What Expectations Actually Cost

Delta Expectations aren't free. Each expectation adds a predicate evaluation to your Spark execution plan. The overhead depends on:

  • Predicate complexity: Simple column comparisons add negligible cost; regex or UDFs can be expensive.
  • Record volume: Expectations scale linearly with data volume.
  • Execution mode:
    • WARN — logs all failures: metrics overhead but no impact on persisted data.
    • DROP — filters early, which may reduce shuffle.
    • FAIL — aborts transaction immediately on failure.

Optimization strategies:

  • Order DROP expectations to filter as early as possible.
  • Combine similar validations (price > 0 AND price < 1000000).
  • Avoid UDFs; use vectorized SQL expressions.
  • Broadcast dimension tables wisely.

Integration Beyond DLT: Orchestration and Observability

Airflow Integration

DLT pipelines expose REST APIs for orchestration. When FAIL expectations trigger, the API returns a non-zero exit code, enabling alerting via tools like PagerDuty.

Metrics Export

DLT stores expectation metrics in event logs. Query and export these to observability platforms for real-time monitoring and alerting.

Unity Catalog and Data Lineage

Register DLT tables in Unity Catalog for governance, access control, and lineage tracing. Expectation metadata in cataloged tables builds trust and traceability.

When Not to Use Delta Expectations

  • Non-DLT pipelines: Expectations are DLT features only.
  • Complex statistical validation: For distributional anomalies or advanced data profiling, use Deequ or Great Expectations.
  • Historical data: Expectations only run on new writes; reprocessing is required for validation of existing data.
  • Multi-table atomicity: Can't coordinate transactional constraints across multiple tables.
  • Extreme performance constraints: Expectations add compute overhead — test with your SLA.

Alternative Approaches: Understanding Trade-offs

Approach Timing Best For Limitations
Delta Expectations Write-time (DLT) Standard validations, quarantine, atomicity DLT only, not open-source Delta
Great Expectations Post-write Profiling, docs, contract CI/CD Reactive; data already written
Delta CHECK Constraints Commit time Basic per-column invariants Limited expression power
Deequ Post-write Large-scale stats/anomaly detection JVM/Scala only; more setup
Custom Spark Filters Write-time Fully custom cases No built-in metrics


Production Readiness Checklist

  • Testing:
    • Inject failure scenarios to verify enforcement.
    • Confirm quarantine and warning behaviors.
    • Test with evolving schemas.
  • Monitoring:
    • Alert on fail-rate spikes.
    • Observe the expectation evaluation time as complexity grows.
  • Governance:
    • Document purposes of all expectations.
    • Version control expectation logic.
  • Operational:
    • Map escalation paths for FAIL triggers.
    • Procedures for backfills and expectation changes.

The Future: From Validation to Contracts

Delta Expectations represent a step toward automated, write-time data contracts. The next evolution will be:

  • Auto-generated expectations informed by historic profiles.
  • Contract versioning tied to pipeline and schema releases.
  • Organization-wide contract publishing for data mesh domains.
  • Integration with schema registries for full data lifecycle coverage.

In modern data architecture, velocity without reliability is just expensive noise. Delta Expectations transform data quality from a post-mortem exercise into a real-time guarantee — ensuring that the data powering your analytics, ML models, and business decisions meets the standards required before it ever reaches production. That shift from reactive validation to proactive contracts is the cornerstone of trustworthy data systems.

Data quality Engineering DELTA (taxonomy)

Opinions expressed by DZone contributors are their own.

Related

  • Precision at Scale: Unveiling the Secrets of Quality Engineering in Data Engineering
  • Automating Data Quality Check in Data Pipelines
  • How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It
  • Ingesting Fixed-Width Mainframe Files Into Delta Lake: The Details Nobody Writes Down

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook