DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Why Clean Data Is the Foundation of Successful AI Systems
  • AI Agents in Java: Architecting Intelligent Health Data Systems
  • Manual Investigation: The Hidden Bottleneck in Incident Response
  • Hallucination Has Real Consequences — Lessons From Building AI Systems

Trending

  • MuleSoft IDP: Enhancing Efficiency and Accuracy in Data Extraction
  • LLM-Powered Deep Parsing for Industrial Inventory Search
  • Feature Flag Debt: Performance Impact in Enterprise Applications
  • Compliance Automated Standard Solution (COMPASS), Part 10: How OSCAL Mapping Paves the Way for Continuous Compliance Scalability
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Improving DAG Failure Detection in Airflow Using AI Techniques

Improving DAG Failure Detection in Airflow Using AI Techniques

A practical approach to enhancing DAG failure detection using AI to improve pipeline reliability and enable proactive intervention in large-scale data environments.

By 
Bruno Bocardo Guzoni user avatar
Bruno Bocardo Guzoni
·
May. 19, 26 · Analysis
Likes (1)
Comment
Save
Tweet
Share
1.5K Views

Join the DZone community and get the full member experience.

Join For Free

Apache Airflow is widely used to orchestrate ETL pipelines, but failure handling in large-scale environments remains largely reactive. While Airflow provides strong scheduling and execution primitives, identifying root causes and detecting silent data issues still requires significant manual effort.

This article presents an approach implemented in a production data platform to improve failure detection and diagnosis using a combination of large language models (LLMs), statistical methods, and traditional machine learning. The system focuses on three areas: log-based failure classification, data integrity anomaly detection, and predictive failure modeling.

In practice, this reduced triage time for recurring failures, improved detection of silent data issues, and introduced a more proactive operational model for managing data pipelines.

Problem Context

Airflow works well as an orchestrator, but operational challenges emerge as pipelines scale across multiple datasets, teams, and environments.

In our case, the primary difficulty was not detecting that failures occurred, but understanding them quickly and consistently. A few recurring issues stood out:

  • Logs were available but inconsistent across operators and difficult to aggregate
  • Root cause analysis required navigating multiple systems (Airflow UI, logs, external services)
  • Retries often obscured whether failures were transient or systemic
  • Data quality issues rarely triggered task failures, but still affected downstream systems

In pipelines processing millions of records per run, these gaps led to delayed incident response and, occasionally, incorrect reporting. The lack of structured failure information also made it difficult to identify recurring patterns.

System Overview

To address these issues, we introduced an AI-driven observability layer on top of Airflow. The system was designed as a set of loosely coupled services consuming metadata from DAG executions:

  • Task logs
  • Execution metadata
  • Historical run data
  • Data quality metrics

Rather than modifying Airflow internals, integration was done via callbacks and external services. This allowed the system to evolve independently and scale across multiple environments.

The solution focuses on three capabilities:

  1. Failure classification and root cause analysis
  2. Data integrity anomaly detection
  3. Predictive failure modeling

Each component can operate independently, but the combined effect provides a more complete view of pipeline health.

Failure Classification and Root Cause Analysis

We started with a simple observation: Most failures are not unique. They tend to follow patterns, even if the log messages are slightly different.

Initial Approach: Embeddings and Similarity

The first iteration used embeddings to represent log messages and retrieve similar past failures. This worked well for recurring issues and provided immediate value with minimal complexity.

Python
 
from openai import OpenAI
import numpy as np 

client = OpenAI() 
def generate_embedding(text: str):
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding 

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


However, this approach struggled with new or slightly different failures where similarity alone was not sufficient.

Introducing LLM-Based Classification

To handle unseen errors, we introduced an LLM layer to classify failures and suggest likely causes.

One key lesson: Free-form responses were not useful operationally. Early versions of the system produced inconsistent outputs, which made automation difficult. Constraining the model to return structured JSON significantly improved reliability.

Python
 
def classify_with_llm(log_text, similar_cases):
    prompt = f""" You are an expert in distributed systems and Apache Airflow. 
Analyze the following task failure log and return a structured JSON response. 
Log: {log_text} 
Similar past failures: {similar_cases} 
Return JSON with:
 - category (infra | data | application | external)
 - root_cause
 - confidence (0-1)
 - suggested_fix
"""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content


Over time, the taxonomy was refined to align with how engineers actually triaged issues. This reduced ambiguity and made the output directly actionable.

Practical Impact

This combination of similarity search and LLM classification reduced the time required to diagnose recurring failures from roughly 20–30 minutes to under 5 minutes in many cases. More importantly, it standardized how failures were described across teams.

Data Integrity Anomaly Detection

Not all failures are explicit. Some of the most impactful issues were silent data problems that passed through the pipeline undetected.

To address this, we introduced lightweight anomaly detection based on metrics collected during each DAG run:

  • Row counts at each stage
  • Null ratios per column
  • Aggregated metrics (e.g., sums, averages)
  • Execution time
  • Distribution characteristics

Approach

We avoided overly complex models here and focused on simple, explainable methods:

  • Rolling baselines per dataset
  • Threshold-based deviation detection
  • Basic time-series anomaly detection

This approach proved sufficient for identifying common issues such as:

  • Sudden drops in processed records
  • Unexpected spikes in null values
  • Significant deviations in aggregate metrics

Why Not Use LLMs Here?

LLMs were not a good fit for this layer. The problem is primarily numerical and benefits from deterministic, explainable methods. Keeping this component simple also reduced operational overhead and false positives.

Practical Impact

This layer surfaced several issues that previously went unnoticed until downstream systems failed or produced incorrect outputs. In some cases, anomalies were detected within the same DAG run, allowing for faster intervention.

Predictive Failure Modeling

The final component focuses on anticipating failures before they occur.

Feature Engineering

We derived features from historical DAG runs, including:

  • Task duration trends
  • Retry counts
  • Recent failure frequency
  • Data volume
  • Dependency reliability

One challenge here was signal quality. For example, task duration alone was not reliable due to variability in upstream systems. Combining multiple features was necessary to produce useful predictions.

Model Selection

We experimented with several models and found that random forests provided a good balance between performance and interpretability for our dataset.

Python
 
from sklearn.ensemble import RandomForestClassifier 
model = RandomForestClassifier()
model.fit(X, y)


Inference

Python
 
def extract_features(run):
    return [
        run["duration"],
        run["retry_count"],
        run["prev_failures"],
        run["data_volume"],
    ] 

def predict_failure(run):
    features = extract_features(run)
    return model.predict_proba([features])[0][1]


Predictions were used to trigger alerts when the probability of failure exceeded a defined threshold.

Practical Use

This enabled:

  • Early warning signals before failures occurred
  • Selective retries or fallback strategies
  • Better prioritization of operational attention

Integration With Airflow

Integration was implemented using Airflow callbacks, avoiding changes to core components.

Python
 
def analyze_callback(context):
    ti = context["task_instance"]
    payload = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "execution_date": str(ti.execution_date),
        "log_url": ti.log_url,
        "state": ti.state,
        "task_metrics": {}
    }
    send_to_pipeline(payload)


This approach allowed the system to capture both success and failure events and process them asynchronously.

Operational Impact

In production, the system produced measurable improvements:

  • Triage time for recurring failures reduced from ~25 minutes to under 5 minutes
  • Earlier detection of data quality issues before the downstream impact
  • Reduced manual log inspection across engineering teams
  • More consistent failure reporting and classification

The most significant improvement was not just speed, but consistency in how failures were understood and addressed.

Limitations and Trade-offs

The system is not without challenges:

  • Model performance depends on the quality and consistency of historical data
  • Anomaly detection requires tuning to avoid false positives
  • LLM outputs must be constrained to remain reliable in automated workflows
  • Initial setup requires instrumentation and data collection across pipelines

In practice, incremental adoption worked better than a full rollout.

Conclusion

As data platforms scale, orchestration alone is not sufficient. Observability, diagnosis, and proactive failure handling become equally important.

Combining LLM-based reasoning with simpler statistical and machine learning techniques provided a practical path toward improving pipeline reliability. Not every component requires complex models, but selectively applying them where they add the most value can significantly reduce operational overhead.

This approach is not a drop-in solution, but it demonstrates how AI can be integrated into existing data platforms to improve reliability measurably and practically.

AI Data quality systems

Opinions expressed by DZone contributors are their own.

Related

  • Why Clean Data Is the Foundation of Successful AI Systems
  • AI Agents in Java: Architecting Intelligent Health Data Systems
  • Manual Investigation: The Hidden Bottleneck in Incident Response
  • Hallucination Has Real Consequences — Lessons From Building AI Systems

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook