Stabilizing ETL Pipelines With Airflow, Presto, and Metadata Contracts

Silent data drift broke our metrics: no errors, just lies. We fixed it with schema contracts, validation, lineage, and loud failures. Now, trust is engineered.

By Beverly Dsouza · Jul. 02, 25 · Opinion · 1.6K Views


Wednesday. 10:04 AM.

The dashboard says conversions dropped 18%. Product’s panicking. Marketing’s quiet-slacking you. But nothing’s failed—Airflow’s green, Hive tables are updating, and your pipeline logs look squeaky clean. That’s when it hits you: this isn’t a failure. It’s something worse. It’s silent data drift.

This isn’t just a cautionary tale—it’s a breakdown of how we diagnosed, fixed, and hardened a data pipeline after it quietly compromised downstream metrics. If you’ve ever relied on JSON event data, this story might feel familiar. The fix wasn’t a fancy rewrite. It was contracts, observability, and a cultural shift in how we treat our pipelines.

When All Systems Are Green and Still Wrong

We maintained a behavioral analytics dashboard powered by Hive, queried via Presto, and refreshed by Airflow. The metrics were the heartbeat of growth conversations across product and marketing. One Wednesday morning, they showed an abrupt drop in sign-up conversions. Engineering metrics looked normal. DAGs ran on schedule. Tables had updated. No errors. But something was off.

After some raw data pulls and sanity checks, a pattern emerged: device values were suddenly NULL for large segments of traffic. Queries were working, but the data wasn’t. Eventually, we traced it to an upstream event structure change. The JSON payload had shifted.

Expected

SQL

json_extract_scalar(event, '$.metadata.device') AS device

What arrived

SQL

json_extract_scalar(event, '$.metadata.device_info.device') AS device


No one was alerted. Nothing failed. But decisions made on top of this data? Completely compromised.

Root Cause: No Contracts, Just Hope

The ETL parsed event data straight from JSON using SQL. No schemas. No protobuf. No validation. We trusted upstream teams, and that trust was implicit, informal, and unenforced.

The change wasn’t malicious or even careless. It was a product improvement to support richer device metadata. The problem was that no contract existed between event producers and data consumers. The shift propagated through ingestion without warning and broke logic that no longer aligned with reality.

Worse still, the break wasn’t uniform. Some clients hadn’t adopted the new structure yet. So data inconsistencies blended together just enough to avoid immediate detection. Monitoring didn’t flag anomalies because everything was technically still working. Except it wasn’t.

The Fix: Schema Contracts, Validation, and Visibility

Our first move was to validate what we assumed. We integrated Great Expectations directly into Airflow. Each DAG that touched raw JSON data gained a pre-transformation validation task, enforcing presence, type, and structure of key fields. We validated not just that a field existed, but that its values matched expected formats (e.g., device as one of a known enum list), that timestamps were within expected bounds, and that field cardinality wasn't anomalous.

Python

from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

validate_schema = GreatExpectationsOperator(
    task_id='validate_raw_event_schema',
    checkpoint_name='raw_event_contract',
    data_context_root_dir='/opt/airflow/great_expectations'
)
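
Behind that checkpoint sit the expectations themselves. As a minimal sketch, here is roughly what they look like, exercised against a pandas sample rather than the live batch (field names mirror the article; the value sets, bounds, and sample file are illustrative assumptions):

Python

import great_expectations as gx
import pandas as pd

# Validate a sample of raw events against the same kinds of expectations
# the checkpoint enforces. Value sets and bounds are illustrative assumptions.
df = pd.read_json("raw_events_sample.json", lines=True)  # hypothetical sample file
validator = gx.from_pandas(df)

# Presence: the field must exist and be populated.
validator.expect_column_values_to_not_be_null("device")
# Format: values must come from a known enum list.
validator.expect_column_values_to_be_in_set("device", ["ios", "android", "web"])
# Bounds: timestamps must fall inside the partition being loaded.
validator.expect_column_values_to_be_between(
    "event_ts", min_value="2025-04-01", max_value="2025-04-02"
)
# Cardinality: a sudden explosion or collapse of distinct values is a red flag.
validator.expect_column_unique_value_count_to_be_between(
    "device", min_value=1, max_value=10
)

results = validator.validate()
assert results.success, "raw event contract violated"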


To support schema evolution, we stored JSON schemas in Git, tied to semantic versions, and pinned our validation checkpoints to specific schema tags. When upstream producers needed to update, they submitted pull requests that included schema diffs and sample payloads. That alone created an accountability trail and slowed down the firehose of surprise changes.
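
A minimal sketch of that pinning, using jsonschema as the enforcement library (the repo layout, tag name, and file path are assumptions for illustration):

Python

import json
from jsonschema import validate  # pip install jsonschema

# Pin ingestion to a semantically versioned schema checked into Git.
# The tag and path are illustrative; bumping the tag requires a reviewed PR.
SCHEMA_TAG = "raw-event-v2.1.0"

with open(f"/opt/schemas/{SCHEMA_TAG}/raw_event.schema.json") as f:
    RAW_EVENT_SCHEMA = json.load(f)

def enforce_contract(event: dict) -> None:
    """Raise jsonschema.ValidationError if an event violates the pinned version."""
    validate(instance=event, schema=RAW_EVENT_SCHEMA)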

When expectations failed, the DAG did too, loudly. We wired Slack and PagerDuty alerts to validation errors. Broken events no longer slipped quietly into the warehouse.
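
As a sketch, an Airflow on_failure_callback posting to a Slack incoming webhook is enough to make failures loud; the webhook URL and message format below are assumptions, and PagerDuty can be triggered the same way through its Events API:

Python

import requests

from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder; keep in a secrets backend

def alert_on_validation_failure(context):
    """Airflow on_failure_callback: page humans the moment a contract breaks."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Schema validation failed: {ti.dag_id}.{ti.task_id}"},
        timeout=10,
    )

validate_schema = GreatExpectationsOperator(
    task_id='validate_raw_event_schema',
    checkpoint_name='raw_event_contract',
    data_context_root_dir='/opt/airflow/great_expectations',
    on_failure_callback=alert_on_validation_failure,
)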

Then came lineage. We brought in Marquez to map data dependencies across tables, jobs, and dashboards. For every field, we could now trace its origin and ripple effects. Schema changes weren't just tracked; they were contextualized.
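
For reference, Marquez consumes lineage through the OpenLineage Airflow integration; the configuration sketch below assumes a Marquez instance on an internal host, and the URL and namespace values are illustrative:

Python

# Minimal sketch: point the OpenLineage Airflow integration at Marquez.
# Install the integration:  pip install openlineage-airflow
# Then configure the emitter through the environment (values are assumptions):
#   OPENLINEAGE_URL=http://marquez.internal:5000
#   OPENLINEAGE_NAMESPACE=analytics
# Each DAG run then emits job and dataset events that Marquez stitches into a
# lineage graph, letting a changed field be traced from producer to dashboard.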

Finally, we updated our DAG design pattern: validation became a branch task after ingestion and before transformation. If validation failed, the rest of the DAG halted gracefully. Logs were clean. Errors were actionable. Data quality became a first-class citizen in Airflow.
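
Put together, the DAG shape looks roughly like the sketch below; the dag_id, schedule, and Empty placeholders for ingestion and transformation are illustrative assumptions:

Python

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

# Sketch of the redesigned DAG: validation gates transformation, so a
# contract violation stops everything downstream.
with DAG(
    dag_id="behavioral_events",
    start_date=datetime(2025, 4, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest_raw_events = EmptyOperator(task_id="ingest_raw_events")

    validate_raw_event_schema = GreatExpectationsOperator(
        task_id="validate_raw_event_schema",
        checkpoint_name="raw_event_contract",
        data_context_root_dir="/opt/airflow/great_expectations",
    )

    transform_events = EmptyOperator(task_id="transform_events")

    # If validation fails, transform_events never runs: the DAG halts at the
    # contract boundary instead of propagating bad data.
    ingest_raw_events >> validate_raw_event_schema >> transform_events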

The Result: Fewer Backfills, More Confidence

We didn’t stop at validation. We refactored how queries were written and how transformations were staged. Instead of parsing JSON at query time, we staged validated tables that enforced a fixed schema. This change alone eliminated a category of runtime errors that had previously been hard to reproduce or explain.

SQL

SELECT user_id, device_type
FROM events_validated
WHERE ds = '2025-04-01';
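
The query above is the read side. The staging side, sketched below, parses the JSON exactly once per partition; the connection details and the trino client are assumptions, and any Presto/Trino client would do:

Python

import trino  # pip install trino

# Illustrative connection details. Parse JSON once per partition into a
# typed, partitioned table that downstream queries can trust.
conn = trino.dbapi.connect(
    host="presto.internal", port=8080, user="airflow",
    catalog="hive", schema="analytics",
)
cur = conn.cursor()
cur.execute("""
    INSERT INTO events_validated
    SELECT
        json_extract_scalar(event, '$.user_id')                     AS user_id,
        json_extract_scalar(event, '$.metadata.device_info.device') AS device_type,
        ds
    FROM events_raw
    WHERE ds = '2025-04-01'
""")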


The impact was immediate. Query latency dropped by over 40% due to reduced parsing and better partitioning. Analysts and product managers noticed that metrics loaded faster and felt more consistent. Operationally, we saw a significant decline in manual interventions. The number of backfill requests, once a weekly ritual, shrank to almost zero because the validation step blocked bad data before it could do damage.

Even incident response changed. Before, tracing a corrupted metric to its root cause took hours, often involving multiple teams. With lineage from Marquez and field-level schema enforcement, we could pinpoint upstream changes in minutes. This made post-mortems cleaner, faster, and more actionable.

In one case, a 97% null rate on a critical field was caught by validation within two minutes of ingestion, and stopped cold. Previously, that would have silently corrupted dashboards for days.

Finally, the biggest win was cultural: stakeholders regained confidence. They knew that if a field changed, someone would catch it before dashboards broke, not after. And engineers had confidence that the pipelines they owned wouldn't betray them quietly.

What changed wasn’t just the tooling. It was the posture. We went from reactive cleanup to proactive guarantees.

Lessons: Green DAGs Don’t Mean Good Data

This wasn't about a failing tool. Airflow, Presto, and Hive all worked as designed. The problem was a lack of intent around the data itself. We treated pipelines as jobs, not products. But metrics aren't valuable because a table exists; they're valuable because the logic behind them holds up under change.

A few lessons stuck with us:

First, contracts matter. Schema isn't optional; it's the API between teams. If you don't validate it, you've outsourced reliability to luck.

Second, catch problems early. Validate on ingestion, not in dashboards. By the time a metric breaks, it's too late.

Third, observability is more than DAG status. You need lineage, ownership, and visibility into field-level semantics.

Above all, we learned to engineer trust, because that's what data pipelines ultimately serve: trust in the numbers we use to build, ship, and decide.

Loud Failures Are Better Than Quiet Lies

This incident didn’t knock out infrastructure. It didn’t corrupt millions of rows. What it did was worse: it let us believe something that wasn’t true. Quietly. Persistently.

If your pipelines can silently drift out of sync with your business logic, that’s not resilience, it’s risk disguised as stability.

So, build your pipelines to fail fast. Instrument for human visibility. Validate what matters. And when something breaks, make sure it breaks loud enough to fix before the next decision gets made.

Want to prevent silent failures in your own stack? Start with contracts, layer in validation, and don’t wait until a dashboard goes dark to find out what your data’s been doing behind the scenes.
