Stabilizing ETL Pipelines With Airflow, Presto, and Metadata Contracts
Silent data drift broke our metrics: no errors, just lies. We fixed it with schema contracts, validation, lineage, and loud failures. Now, trust is engineered.
Wednesday. 10:04 AM.
The dashboard says conversions dropped 18%. Product’s panicking. Marketing’s quiet-slacking you. But nothing’s failed—Airflow’s green, Hive tables are updating, and your pipeline logs look squeaky clean. That’s when it hits you: this isn’t a failure. It’s something worse. It’s silent data drift.
This isn’t just a cautionary tale—it’s a breakdown of how we diagnosed, fixed, and hardened a data pipeline after it quietly compromised downstream metrics. If you’ve ever relied on JSON event data, this story might feel familiar. The fix wasn’t a fancy rewrite. It was contracts, observability, and a cultural shift in how we treat our pipelines.
When All Systems Are Green and Still Wrong
We maintained a behavioral analytics dashboard powered by Hive, queried via Presto, and refreshed by Airflow. The metrics were the heartbeat of growth conversations across product and marketing. One Monday morning, they showed an abrupt drop in sign-up conversions. Engineering metrics looked normal. DAGs ran on schedule. Tables had updated. No errors. But something was off.
After some raw data pulls and sanity checks, a pattern emerged: device values were suddenly NULL for large segments of traffic. Queries were working, but the data wasn’t. Eventually, we traced it to an upstream event structure change. The JSON payload had shifted.
Expected
json_extract_scalar(event, '$.metadata.device') AS device
What arrived
json_extract_scalar(event, '$.metadata.device_info.device') AS device
No one was alerted. Nothing failed. But decisions made on top of this data? Completely compromised.
Root Cause: No Contracts, Just Hope
The ETL parsed event data straight from JSON using SQL. No schemas. No protobuf. No validation. We trusted upstream teams, and that trust was implicit, informal, and unenforced.
The change wasn’t malicious or even careless. It was a product improvement to support richer device metadata. The problem was that no contract existed between event producers and data consumers. The shift propagated through ingestion without warning and broke logic that no longer aligned with reality.
Worse still, the break wasn’t uniform. Some clients hadn’t adopted the new structure yet. So data inconsistencies blended together just enough to avoid immediate detection. Monitoring didn’t flag anomalies because everything was technically still working. Except it wasn’t.
The Fix: Schema Contracts, Validation, and Visibility
Our first move was to validate what we assumed. We integrated Great Expectations directly into Airflow. Each DAG that touched raw JSON data gained a pre-transformation validation task, enforcing presence, type, and structure of key fields. We validated not just that a field existed, but that its values matched expected formats (e.g., device as one of a known enum list), that timestamps were within expected bounds, and that field cardinality wasn’t anomalous.
# Validation task added to each DAG that touches raw JSON, gated on the
# checkpoint that encodes the event contract.
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

validate_schema = GreatExpectationsOperator(
    task_id='validate_raw_event_schema',
    checkpoint_name='raw_event_contract',
    data_context_root_dir='/opt/airflow/great_expectations'
)
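Behind that checkpoint sits an expectation suite. Here is a minimal sketch of the kinds of checks it encodes, written in the expectation_type/kwargs form Great Expectations uses for suite definitions; the column names, enum values, timestamp bounds, and cardinality thresholds below are illustrative, not our production contract.
RAW_EVENT_EXPECTATIONS = [
    # The field must exist and be populated.
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "device"}},
    # Values must come from a known enum, so a renamed or re-nested field fails loudly.
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "device", "value_set": ["ios", "android", "web"]}},
    # Timestamps must land inside the partition being processed.
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "event_ts",
                "min_value": "2025-04-01T00:00:00", "max_value": "2025-04-01T23:59:59"}},
    # Cardinality guard: a sudden collapse or explosion of distinct values is suspicious.
    {"expectation_type": "expect_column_unique_value_count_to_be_between",
     "kwargs": {"column": "device", "min_value": 2, "max_value": 10}},
]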
To support schema evolution, we stored JSON schemas in Git, tied to semantic versions, and pinned our validation checkpoints to specific schema tags. When upstream producers needed to update, they submitted pull requests that included schema diffs and sample payloads. That alone created an accountability trail and slowed down the firehose of surprise changes.
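To make the pin concrete, here is a minimal sketch of how a CI check on those pull requests might validate sample payloads against a tagged schema, assuming the jsonschema library; the repository layout, version tag, and field names are illustrative.
import json

from jsonschema import ValidationError, validate

SCHEMA_VERSION = "v2.1.0"  # hypothetical Git tag the validation checkpoint is pinned to

with open(f"schemas/events/{SCHEMA_VERSION}/raw_event.schema.json") as f:
    RAW_EVENT_SCHEMA = json.load(f)

def check_sample_payload(event: dict) -> bool:
    """Return True if the sample payload satisfies the pinned contract."""
    try:
        validate(instance=event, schema=RAW_EVENT_SCHEMA)
        return True
    except ValidationError:
        return False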
When expectations failed, the DAG did too, loudly. We wired Slack and PagerDuty alerts to validation errors. Broken events no longer slipped quietly into the warehouse.
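A minimal sketch of the Slack side, assuming an incoming-webhook URL and Airflow's on_failure_callback hook; the message format is a placeholder, and a PagerDuty integration can be wired the same way through its Events API.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook

def alert_on_validation_failure(context):
    """Airflow on_failure_callback: fires when the validation task fails."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Schema contract violated: {ti.dag_id}.{ti.task_id} ({context['ts']})"},
        timeout=10,
    )

# Attached to the validation task so a broken contract pages a human:
# GreatExpectationsOperator(..., on_failure_callback=alert_on_validation_failure)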
Then came lineage. We brought in Marquez to map data dependencies across tables, jobs, and dashboards. For every field, we could now trace its origin and ripple effects. Schema changes weren’t just tracked; they were contextualized.
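Marquez ingests OpenLineage events, so lineage metadata has a standard shape. The sketch below, assuming the OpenLineage Python client and an illustrative Marquez endpoint and namespace, shows what emitting one such event by hand looks like; in practice the Airflow integration emits them automatically for every task run.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez.internal:5000")  # hypothetical Marquez endpoint

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="analytics-pipelines", name="validate_raw_event_schema"),
        producer="https://github.com/OpenLineage/OpenLineage",
    )
)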
Finally, we updated our DAG design pattern: validation became a branch task after ingestion and before transformation. If validation failed, the rest of the DAG halted gracefully. Logs were clean. Errors were actionable. Data quality became a first-class citizen in Airflow.
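Simplified to its shape, the pattern looks like the sketch below; a recent Airflow 2.x API is assumed, and the DAG id, schedule, and placeholder tasks are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

with DAG(
    dag_id="behavioral_events",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest_raw_events")          # stand-in for the real ingestion task
    transform = EmptyOperator(task_id="build_events_validated")  # stand-in for the real transformation

    validate = GreatExpectationsOperator(
        task_id="validate_raw_event_schema",
        checkpoint_name="raw_event_contract",
        data_context_root_dir="/opt/airflow/great_expectations",
    )

    # Validation sits between ingestion and transformation; if it fails,
    # nothing downstream ever sees unvalidated data.
    ingest >> validate >> transform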
The Result: Fewer Backfills, More Confidence
We didn’t stop at validation. We refactored how queries were written and how transformations were staged. Instead of parsing JSON at query time, we staged validated tables that enforced a fixed schema. This change alone eliminated a category of runtime errors that had previously been hard to reproduce or explain.
SELECT user_id, device_type
FROM events_validated
WHERE ds = '2025-04-01';
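The query above reads from the staged table; the staging job itself does the JSON parsing exactly once, at write time, against the new payload structure. A minimal sketch, assuming Airflow's PrestoHook and illustrative table names and JSON paths:
from airflow.providers.presto.hooks.presto import PrestoHook

STAGE_VALIDATED_EVENTS = """
INSERT INTO events_validated
SELECT
    json_extract_scalar(event, '$.user.id')                     AS user_id,
    json_extract_scalar(event, '$.metadata.device_info.device') AS device_type,
    ds
FROM events_raw
WHERE ds = '2025-04-01'
"""

def stage_validated_events() -> None:
    # Runs only after the validation task has passed for this partition,
    # using the default Presto connection configured in Airflow.
    PrestoHook().run(STAGE_VALIDATED_EVENTS)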
The impact was immediate. Query latency dropped by over 40% due to reduced parsing and better partitioning. Analysts and product managers noticed that metrics loaded faster and felt more consistent. Operationally, we saw a significant decline in manual interventions. The number of backfill requests, once a weekly ritual, shrank to almost zero because the validation step blocked bad data before it could do damage.
Even incident response changed. Before, tracing a corrupted metric to its root cause took hours, often involving multiple teams. With lineage from Marquez and field-level schema enforcement, we could pinpoint upstream changes in minutes. This made post-mortems cleaner, faster, and more actionable.
In one case, a 97% null rate on a critical field was caught by validation within two minutes of ingestion, and stopped cold. Previously, that would have silently corrupted dashboards for days.
Finally, the biggest win was cultural: stakeholders regained confidence. They knew that if a field changed, someone would catch it before dashboards broke, not after. And engineers had confidence that the pipelines they owned wouldn’t betray them quietly.
What changed wasn’t just the tooling. It was the posture. We went from reactive cleanup to proactive guarantees.
Lessons: Green DAGs Don’t Mean Good Data
This wasn’t about a failing tool. Airflow, Presto, Hive: they all worked as designed. The problem was a lack of intent around the data itself. We treated pipelines as jobs, not products. But metrics aren’t valuable because a table exists. They’re valuable because the logic behind them holds up under change.
A few lessons stuck with us:
First, contracts matter. A schema isn’t optional; it’s the API between teams. If you don’t validate it, you’ve outsourced reliability to luck. Second, catch problems early. Validate on ingestion, not in dashboards. By the time a metric breaks, it’s too late. Third, observability is more than DAG status. You need lineage, ownership, and visibility into field-level semantics.
Most of all, we learned to engineer trust. Because that’s what data pipelines ultimately serve: trust in the numbers we use to build, ship, and decide.
Loud Failures Are Better Than Quiet Lies
This incident didn’t knock out infrastructure. It didn’t corrupt millions of rows. What it did was worse: it let us believe something that wasn’t true. Quietly. Persistently.
If your pipelines can silently drift out of sync with your business logic, that’s not resilience, it’s risk disguised as stability.
So, build your pipelines to fail fast. Instrument for human visibility. Validate what matters. And when something breaks, make sure it breaks loud enough to fix before the next decision gets made.
Want to prevent silent failures in your own stack? Start with contracts, layer in validation, and don’t wait until a dashboard goes dark to find out what your data’s been doing behind the scenes.