Data Quality at Write Time: Engineering Reliability With Delta Expectations
Ensure reliable data from the start. Learn how Delta Expectations in Databricks enforce real-time, write-time validation to prevent bad data before it lands.
Join the DZone community and get the full member experience.
Join For FreeData quality failures don't announce themselves. They compound silently — a malformed timestamp here, a negative revenue figure there — until a quarterly board deck shows impossible numbers or an ML model degrades into uselessness. A 2023 Gartner study pegged the cost at $12.9 million annually per organization, but that figure misses the hidden expense: engineering time spent firefighting data incidents instead of building features.
The traditional approach treats validation as a post-processing step. You write data to storage, then run Great Expectations or Deequ checks, discover failures, and either fix the pipeline or quarantine bad records. This pattern creates a fundamental gap: the window between data landing and validation completion. In high-throughput lakehouses processing terabytes daily, that window can represent millions of corrupted records propagating downstream before anyone notices.
Delta Expectations — a feature of Databricks' Delta Live Tables (DLT) — collapses that window to zero by enforcing validation during the write transaction itself. This isn't just faster validation; it's an architectural shift from reactive data quality to proactive data contracts.
Why Write-Time Validation Changes the Game
Traditional ETL validation operates as an external auditor. You extract data, transform it, load it into a Delta table, and then check if it's valid:
Extract → Transform → Load → Validate → (Quarantine/Repair)
This sequence has consequences:
- Storage pollution: Invalid data physically lands in your lakehouse. Even if you delete it later, it exists in the transaction log and occupies storage until VACUUM runs.
- Downstream propagation: Between write completion and validation failure, downstream consumers may have already read the bad data. Scheduled jobs don't wait for validation results.
- Cascading failures: If validation discovers issues hours after ingestion, you're debugging a cold trail. Which upstream system sent bad data? Was it a transient API failure or a schema change?
Delta Expectations inverts this model by embedding validation into the write path:
Extract → Transform → Validate+Load (atomic)
The validation logic executes during Spark's DataFrame evaluation, before the Delta transaction commits. A failed expectation can abort the write entirely, drop invalid records, or log violations while continuing — but critically, the validation decision happens before data persistence.
The Architecture: How Expectations Actually Work
Delta Expectations aren't modifications to Delta Lake's transaction protocol itself. They're a DLT framework feature that leverages Spark's lazy evaluation and Delta's atomicity guarantees.
When you define an expectation like:
@dlt.expect_or_drop("valid_price", "price > 0")
DLT injects a filter operation into the Spark logical plan. During execution:
- Evaluation phase: Spark computes the expectation predicate (
price > 0) for each record as part of the standard DataFrame transformation graph - Action phase: Records are partitioned based on validation results:
- FAIL mode: If any record fails, Spark throws an exception before the Delta write API is invoked
- DROP mode: Failed records are filtered from the DataFrame before being passed to
.write.format("delta") - WARN mode: All records proceed to write, but DLT logs metrics about failures
- Persistence phase: Only the valid subset (or all records in WARN mode) participates in the Delta transaction
This means validation happens at Spark execution time, not Delta commit time. The distinction matters: Delta's ACID guarantees ensure the write is atomic, but the expectation logic runs earlier in the execution pipeline. The phrase "atomic validation" refers to the fact that validation and write are part of the same Spark job — not that expectations are integrated into Delta's transaction protocol.
Critical implication: Expectations operate on data in motion (DataFrames) rather than data at rest (Delta tables). This is why they can prevent invalid data from ever being written, but also why they can't validate data already in a table without reprocessing it.
Implementation Patterns: From Basic to Production-Grade
Pattern 1: Layered Validation With Quarantine
Production pipelines don't just drop bad data — they preserve it for debugging. The Bronze-Silver-Gold medallion architecture naturally supports this:
# Bronze: Accept everything, no expectations
@dlt.table(
comment="Raw ingestion, no quality gates"
)
def bronze_orders():
return spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.load("/mnt/landing/orders")
# Silver: Strict validation with quarantine
@dlt.table(comment="Validated orders meeting business rules")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("price_positive", "price > 0 AND price < 1000000")
@dlt.expect_or_drop("valid_date", "order_date <= current_date()")
@dlt.expect("suspicious_quantity", "quantity < 10000") # WARN only
def silver_orders():
return dlt.read_stream("bronze_orders")
# Quarantine: Capture failed records
@dlt.table(comment="Orders that failed Silver validation")
def quarantine_orders():
bronze = dlt.read_stream("bronze_orders")
return bronze.filter(
(col("order_id").isNull()) |
(col("price") <= 0) |
(col("price") >= 1000000) |
(col("order_date") > current_date())
).withColumn("quarantine_timestamp", current_timestamp()) \
.withColumn(
"failure_reason",
when(col("order_id").isNull(), "null_order_id")
.when(col("price") <= 0, "negative_price")
.when(col("price") >= 1000000, "price_too_high")
.otherwise("invalid_date")
)
Why this works:
- Bronze preserves source fidelity for audit and replay
- Silver expectations ensure downstream consumers never see invalid data
- Quarantine tables enable post-mortem analysis without data loss
- DLT's metrics UI shows expectation pass rates in real time
Operational note: The quarantine pattern requires duplicating expectation logic. In practice, extract these to a shared function to maintain DRY code.
Pattern 2: Streaming Watermarks and Late Data
Delta Expectations behave differently in streaming contexts. With late-arriving data, you need to coordinate watermarks and validation:
@dlt.table
@dlt.expect_or_drop("within_watermark", "event_timestamp > current_timestamp() - interval 24 hours")
def streaming_events():
return (spark.readStream
.format("kafka")
.option("subscribe", "events")
.load()
.withWatermark("event_timestamp", "1 hour")
.select("event_timestamp", "user_id", "event_type"))
Critical detail: The expectation runs after the watermark is applied. Records older than the watermark are already dropped by Spark Structured Streaming before expectations are evaluated. This creates a layered filter:
- Watermark drops events older than the threshold
- Expectations validate remaining records
- Delta writes persist only valid, timely data
Pattern 3: Cross-Table Validation and Referential Integrity
Expectations can enforce relationships across tables using joins:
@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IN (SELECT customer_id FROM LIVE.dim_customers)")
def orders_with_referential_integrity():
return dlt.read("bronze_orders")
Performance caveat: This triggers a broadcast join at every micro-batch. If dim_customers is large, this becomes expensive. Prefer precomputed lookup tables or alternative validation strategies for large dimensions.
Performance Considerations: What Expectations Actually Cost
Delta Expectations aren't free. Each expectation adds a predicate evaluation to your Spark execution plan. The overhead depends on:
- Predicate complexity: Simple column comparisons add negligible cost; regex or UDFs can be expensive.
- Record volume: Expectations scale linearly with data volume.
- Execution mode:
- WARN — logs all failures: metrics overhead but no impact on persisted data.
- DROP — filters early, which may reduce shuffle.
- FAIL — aborts transaction immediately on failure.
Optimization strategies:
- Order DROP expectations to filter as early as possible.
- Combine similar validations (
price > 0 AND price < 1000000). - Avoid UDFs; use vectorized SQL expressions.
- Broadcast dimension tables wisely.
Integration Beyond DLT: Orchestration and Observability
Airflow Integration
DLT pipelines expose REST APIs for orchestration. When FAIL expectations trigger, the API returns a non-zero exit code, enabling alerting via tools like PagerDuty.
Metrics Export
DLT stores expectation metrics in event logs. Query and export these to observability platforms for real-time monitoring and alerting.
Unity Catalog and Data Lineage
Register DLT tables in Unity Catalog for governance, access control, and lineage tracing. Expectation metadata in cataloged tables builds trust and traceability.
When Not to Use Delta Expectations
- Non-DLT pipelines: Expectations are DLT features only.
- Complex statistical validation: For distributional anomalies or advanced data profiling, use Deequ or Great Expectations.
- Historical data: Expectations only run on new writes; reprocessing is required for validation of existing data.
- Multi-table atomicity: Can't coordinate transactional constraints across multiple tables.
- Extreme performance constraints: Expectations add compute overhead — test with your SLA.
Alternative Approaches: Understanding Trade-offs
| Approach | Timing | Best For | Limitations |
|---|---|---|---|
| Delta Expectations | Write-time (DLT) | Standard validations, quarantine, atomicity | DLT only, not open-source Delta |
| Great Expectations | Post-write | Profiling, docs, contract CI/CD | Reactive; data already written |
| Delta CHECK Constraints | Commit time | Basic per-column invariants | Limited expression power |
| Deequ | Post-write | Large-scale stats/anomaly detection | JVM/Scala only; more setup |
| Custom Spark Filters | Write-time | Fully custom cases | No built-in metrics |
Production Readiness Checklist
- Testing:
- Inject failure scenarios to verify enforcement.
- Confirm quarantine and warning behaviors.
- Test with evolving schemas.
- Monitoring:
- Alert on fail-rate spikes.
- Observe the expectation evaluation time as complexity grows.
- Governance:
- Document purposes of all expectations.
- Version control expectation logic.
- Operational:
- Map escalation paths for FAIL triggers.
- Procedures for backfills and expectation changes.
The Future: From Validation to Contracts
Delta Expectations represent a step toward automated, write-time data contracts. The next evolution will be:
- Auto-generated expectations informed by historic profiles.
- Contract versioning tied to pipeline and schema releases.
- Organization-wide contract publishing for data mesh domains.
- Integration with schema registries for full data lifecycle coverage.
In modern data architecture, velocity without reliability is just expensive noise. Delta Expectations transform data quality from a post-mortem exercise into a real-time guarantee — ensuring that the data powering your analytics, ML models, and business decisions meets the standards required before it ever reaches production. That shift from reactive validation to proactive contracts is the cornerstone of trustworthy data systems.
Opinions expressed by DZone contributors are their own.
Comments