Stabilizing ETL Pipelines With Airflow, Presto, and Metadata Contracts
Silent data drift broke our metrics: no errors, just lies. We fixed it with schema contracts, validation, lineage, and loud failures. Now, trust is engineered.
Wednesday. 10:04 AM.
The dashboard says conversions dropped 18%. Product’s panicking. Marketing’s quiet-slacking you. But nothing’s failed—Airflow’s green, Hive tables are updating, and your pipeline logs look squeaky clean. That’s when it hits you: this isn’t a failure. It’s something worse. It’s silent data drift.
This isn’t just a cautionary tale—it’s a breakdown of how we diagnosed, fixed, and hardened a data pipeline after it quietly compromised downstream metrics. If you’ve ever relied on JSON event data, this story might feel familiar. The fix wasn’t a fancy rewrite. It was contracts, observability, and a cultural shift in how we treat our pipelines.
When All Systems Are Green and Still Wrong
We maintained a behavioral analytics dashboard powered by Hive, queried via Presto, and refreshed by Airflow. The metrics were the heartbeat of growth conversations across product and marketing. One Monday morning, they showed an abrupt drop in sign-up conversions. Engineering metrics looked normal. DAGs ran on schedule. Tables had updated. No errors. But something was off.
After some raw data pulls and sanity checks, a pattern emerged: device values were suddenly NULL for large segments of traffic. Queries were working, but the data wasn’t. Eventually, we traced it to an upstream event structure change. The JSON payload had shifted.
Expected
json_extract_scalar(event, '$.metadata.device') AS device
What arrived
json_extract_scalar(event, '$.metadata.device_info.device') AS device
No one was alerted. Nothing failed. But decisions made on top of this data? Completely compromised.
Root Cause: No Contracts, Just Hope
The ETL parsed event data straight from JSON using SQL. No schemas. No protobuf. No validation. We trusted upstream teams, and that trust was implicit, informal, and unenforced.
The change wasn’t malicious or even careless. It was a product improvement to support richer device metadata. The problem was that no contract existed between event producers and data consumers. The shift propagated through ingestion without warning and broke logic that no longer aligned with reality.
Worse still, the break wasn’t uniform. Some clients hadn’t adopted the new structure yet. So data inconsistencies blended together just enough to avoid immediate detection. Monitoring didn’t flag anomalies because everything was technically still working. Except it wasn’t.
The Fix: Schema Contracts, Validation, and Visibility
Our first move was to validate what we assumed. We integrated Great Expectations directly into Airflow. Each DAG that touched raw JSON data gained a pre-transformation validation task, enforcing presence, type, and structure of key fields. We validated not just that a field existed, but that its values matched expected formats (e.g., device as one of a known enum list), that timestamps were within expected bounds, and that field cardinality wasn’t anomalous.
# Validation task added to each DAG that touches raw JSON, gated on the
# checkpoint that encodes the event contract.
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

validate_schema = GreatExpectationsOperator(
    task_id='validate_raw_event_schema',
    checkpoint_name='raw_event_contract',
    data_context_root_dir='/opt/airflow/great_expectations'
)
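Behind that checkpoint sits an expectation suite. Here is a minimal sketch of the kinds of checks it encodes, written in the expectation_type/kwargs form Great Expectations uses for suite definitions; the column names, enum values, timestamp bounds, and cardinality thresholds below are illustrative, not our production contract.
RAW_EVENT_EXPECTATIONS = [
    # The field must exist and be populated.
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "device"}},
    # Values must come from a known enum, so a renamed or re-nested field fails loudly.
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "device", "value_set": ["ios", "android", "web"]}},
    # Timestamps must land inside the partition being processed.
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "event_ts",
                "min_value": "2025-04-01T00:00:00", "max_value": "2025-04-01T23:59:59"}},
    # Cardinality guard: a sudden collapse or explosion of distinct values is suspicious.
    {"expectation_type": "expect_column_unique_value_count_to_be_between",
     "kwargs": {"column": "device", "min_value": 2, "max_value": 10}},
]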
To support schema evolution, we stored JSON schemas in Git, tied to semantic versions, and pinned our validation checkpoints to specific schema tags. When upstream producers needed to update, they submitted pull requests that included schema diffs and sample payloads. That alone created an accountability trail and slowed down the firehose of surprise changes.
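To make the pin concrete, here is a minimal sketch of how a CI check on those pull requests might validate sample payloads against a tagged schema, assuming the jsonschema library; the repository layout, version tag, and field names are illustrative.
import json

from jsonschema import ValidationError, validate

SCHEMA_VERSION = "v2.1.0"  # hypothetical Git tag the validation checkpoint is pinned to

with open(f"schemas/events/{SCHEMA_VERSION}/raw_event.schema.json") as f:
    RAW_EVENT_SCHEMA = json.load(f)

def check_sample_payload(event: dict) -> bool:
    """Return True if the sample payload satisfies the pinned contract."""
    try:
        validate(instance=event, schema=RAW_EVENT_SCHEMA)
        return True
    except ValidationError:
        return False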
When expectations failed, the DAG did too, loudly. We wired Slack and PagerDuty alerts to validation errors. Broken events no longer slipped quietly into the warehouse.
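A minimal sketch of the Slack side, assuming an incoming-webhook URL and Airflow's on_failure_callback hook; the message format is a placeholder, and a PagerDuty integration can be wired the same way through its Events API.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook

def alert_on_validation_failure(context):
    """Airflow on_failure_callback: fires when the validation task fails."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Schema contract violated: {ti.dag_id}.{ti.task_id} ({context['ts']})"},
        timeout=10,
    )

# Attached to the validation task so a broken contract pages a human:
# GreatExpectationsOperator(..., on_failure_callback=alert_on_validation_failure)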
Then came lineage. We brought in Marquez to map data dependencies across tables, jobs, and dashboards. For every field, we could now trace its origin and ripple effects. Schema changes weren’t just tracked; they were contextualized.
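Marquez ingests OpenLineage events, so lineage metadata has a standard shape. The sketch below, assuming the OpenLineage Python client and an illustrative Marquez endpoint and namespace, shows what emitting one such event by hand looks like; in practice the Airflow integration emits them automatically for every task run.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez.internal:5000")  # hypothetical Marquez endpoint

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="analytics-pipelines", name="validate_raw_event_schema"),
        producer="https://github.com/OpenLineage/OpenLineage",
    )
)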
Finally, we updated our DAG design pattern: validation became a branch task after ingestion and before transformation. If validation failed, the rest of the DAG halted gracefully. Logs were clean. Errors were actionable. Data quality became a first-class citizen in Airflow.
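Simplified to its shape, the pattern looks like the sketch below; a recent Airflow 2.x API is assumed, and the DAG id, schedule, and placeholder tasks are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

with DAG(
    dag_id="behavioral_events",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest_raw_events")          # stand-in for the real ingestion task
    transform = EmptyOperator(task_id="build_events_validated")  # stand-in for the real transformation

    validate = GreatExpectationsOperator(
        task_id="validate_raw_event_schema",
        checkpoint_name="raw_event_contract",
        data_context_root_dir="/opt/airflow/great_expectations",
    )

    # Validation sits between ingestion and transformation; if it fails,
    # nothing downstream ever sees unvalidated data.
    ingest >> validate >> transform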
The Result: Fewer Backfills, More Confidence
We didn’t stop at validation. We refactored how queries were written and how transformations were staged. Instead of parsing JSON at query time, we staged validated tables that enforced a fixed schema. This change alone eliminated a category of runtime errors that had previously been hard to reproduce or explain.
SELECT user_id, device_type
FROM events_validated
WHERE ds = '2025-04-01';
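The query above reads from the staged table; the staging job itself does the JSON parsing exactly once, at write time, against the new payload structure. A minimal sketch, assuming Airflow's PrestoHook and illustrative table names and JSON paths:
from airflow.providers.presto.hooks.presto import PrestoHook

STAGE_VALIDATED_EVENTS = """
INSERT INTO events_validated
SELECT
    json_extract_scalar(event, '$.user.id')                     AS user_id,
    json_extract_scalar(event, '$.metadata.device_info.device') AS device_type,
    ds
FROM events_raw
WHERE ds = '2025-04-01'
"""

def stage_validated_events() -> None:
    # Runs only after the validation task has passed for this partition,
    # using the default Presto connection configured in Airflow.
    PrestoHook().run(STAGE_VALIDATED_EVENTS)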
The impact was immediate. Query latency dropped by over 40% due to reduced parsing and better partitioning. Analysts and product managers noticed that metrics loaded faster and felt more consistent. Operationally, we saw a significant decline in manual interventions. The number of backfill requests, once a weekly ritual, shrank to almost zero because the validation step blocked bad data before it could do damage.
Even incident response changed. Before, tracing a corrupted metric to its root cause took hours, often involving multiple teams. With lineage from Marquez and field-level schema enforcement, we could pinpoint upstream changes in minutes. This made post-mortems cleaner, faster, and more actionable.
In one case, a 97% null rate on a critical field was caught by validation within two minutes of ingestion, and stopped cold. Previously, that would have silently corrupted dashboards for days.
Finally, the biggest win was cultural: stakeholders regained confidence. They knew that if a field changed, someone would catch it before dashboards broke, not after. And engineers had confidence that the pipelines they owned wouldn’t betray them quietly.
What changed wasn’t just the tooling. It was the posture. We went from reactive cleanup to proactive guarantees.
Lessons: Green DAGs Don’t Mean Good Data
This wasn’t about a failing tool. Airflow, Presto, Hive: they all worked as designed. The problem was a lack of intent around the data itself. We treated pipelines as jobs, not products. But metrics aren’t valuable because a table exists. They’re valuable because the logic behind them holds up under change.
A few lessons stuck with us:
First, contracts matter. A schema isn’t optional; it’s the API between teams. If you don’t validate it, you’ve outsourced reliability to luck. Second, catch problems early. Validate on ingestion, not in dashboards. By the time a metric breaks, it’s too late. Third, observability is more than DAG status. You need lineage, ownership, and visibility into field-level semantics.
Most of all, we learned to engineer trust. Because that’s what data pipelines ultimately serve: trust in the numbers we use to build, ship, and decide.
Loud Failures Are Better Than Quiet Lies
This incident didn’t knock out infrastructure. It didn’t corrupt millions of rows. What it did was worse: it let us believe something that wasn’t true. Quietly. Persistently.
If your pipelines can silently drift out of sync with your business logic, that’s not resilience, it’s risk disguised as stability.
So, build your pipelines to fail fast. Instrument for human visibility. Validate what matters. And when something breaks, make sure it breaks loud enough to fix before the next decision gets made.
Want to prevent silent failures in your own stack? Start with contracts, layer in validation, and don’t wait until a dashboard goes dark to find out what your data’s been doing behind the scenes.