Improving DAG Failure Detection in Airflow Using AI Techniques

A practical approach to enhancing DAG failure detection using AI to improve pipeline reliability and enable proactive intervention in large-scale data environments.

May. 19, 26 · Analysis

Apache Airflow is widely used to orchestrate ETL pipelines, but failure handling in large-scale environments remains largely reactive. While Airflow provides strong scheduling and execution primitives, identifying root causes and detecting silent data issues still requires significant manual effort.

This article presents an approach implemented in a production data platform to improve failure detection and diagnosis using a combination of large language models (LLMs), statistical methods, and traditional machine learning. The system focuses on three areas: log-based failure classification, data integrity anomaly detection, and predictive failure modeling.

In practice, this reduced triage time for recurring failures, improved detection of silent data issues, and introduced a more proactive operational model for managing data pipelines.

Problem Context

Airflow works well as an orchestrator, but operational challenges emerge as pipelines scale across multiple datasets, teams, and environments.

In our case, the primary difficulty was not detecting that failures occurred, but understanding them quickly and consistently. A few recurring issues stood out:

Logs were available but inconsistent across operators and difficult to aggregate
Root cause analysis required navigating multiple systems (Airflow UI, logs, external services)
Retries often obscured whether failures were transient or systemic
Data quality issues rarely triggered task failures, but still affected downstream systems

In pipelines processing millions of records per run, these gaps led to delayed incident response and, occasionally, incorrect reporting. The lack of structured failure information also made it difficult to identify recurring patterns.

System Overview

To address these issues, we introduced an AI-driven observability layer on top of Airflow. The system was designed as a set of loosely coupled services consuming metadata from DAG executions:

Task logs
Execution metadata
Historical run data
Data quality metrics

Rather than modifying Airflow internals, integration was done via callbacks and external services. This allowed the system to evolve independently and scale across multiple environments.

The solution focuses on three capabilities:

Failure classification and root cause analysis
Data integrity anomaly detection
Predictive failure modeling

Each component can operate independently, but the combined effect provides a more complete view of pipeline health.

Failure Classification and Root Cause Analysis

We started with a simple observation: Most failures are not unique. They tend to follow patterns, even if the log messages are slightly different.

Initial Approach: Embeddings and Similarity

The first iteration used embeddings to represent log messages and retrieve similar past failures. This worked well for recurring issues and provided immediate value with minimal complexity.

    Python
   
 

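from openai import OpenAI
import numpy as np

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()


def generate_embedding(text: str) -> list[float]:
    """Embed a log message so it can be compared with past failures."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))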
  

However, this approach struggled with new or slightly different failures where similarity alone was not sufficient.
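
The retrieval step itself, finding the closest past failures for a new log embedding, can be sketched as follows. This is a minimal illustration and not the production code: the `PastFailure` record, the `find_similar` helper, and the `top_k` and `min_score` defaults are assumptions, and toy three-dimensional vectors stand in for real embeddings. It uses only the standard library so it runs without dependencies.

```python
import math
from dataclasses import dataclass


@dataclass
class PastFailure:
    """A previously triaged failure (hypothetical record shape)."""
    summary: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query: list[float], history: list[PastFailure],
                 top_k: int = 3, min_score: float = 0.8):
    """Return up to top_k (score, failure) pairs scoring at least min_score, best first."""
    scored = [(cosine_similarity(query, f.embedding), f) for f in history]
    matches = [pair for pair in scored if pair[0] >= min_score]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return matches[:top_k]


# Toy vectors stand in for real embeddings.
history = [
    PastFailure("S3 connection timeout", [0.9, 0.1, 0.0]),
    PastFailure("Schema mismatch in load step", [0.0, 0.2, 0.9]),
]
matches = find_similar([1.0, 0.0, 0.0], history)
```

An empty result from a helper like this is a natural signal that the failure is new and similarity alone will not explain it.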

Introducing LLM-Based Classification

To handle unseen errors, we introduced an LLM layer to classify failures and suggest likely causes.

One key lesson: Free-form responses were not useful operationally. Early versions of the system produced inconsistent outputs, which made automation difficult. Constraining the model to return structured JSON significantly improved reliability.

    Python
   
 

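def classify_with_llm(log_text, similar_cases):
    """Ask the model for a structured diagnosis of a task failure log."""
    # Reuses the `client` created in the embedding example.
    prompt = f"""You are an expert in distributed systems and Apache Airflow.
Analyze the following task failure log and return a structured JSON response.
Log: {log_text}
Similar past failures: {similar_cases}
Return JSON with:
 - category (infra | data | application | external)
 - root_cause
 - confidence (0-1)
 - suggested_fix
"""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        # JSON mode: the reply is constrained to a single JSON object.
        response_format={"type": "json_object"},
    )
    # The JSON is returned as a string; parse and validate it before automating on it.
    return response.choices[0].message.content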
  

Over time, the taxonomy was refined to align with how engineers actually triaged issues. This reduced ambiguity and made the output directly actionable.
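
Because downstream automation depends on the structure of the response, it helps to validate the model output before acting on it. The sketch below checks a raw response string against the four categories and four fields named in the prompt. The `parse_classification` name and the choice to raise `ValueError` are assumptions for illustration, not the production implementation.

```python
import json

ALLOWED_CATEGORIES = {"infra", "data", "application", "external"}
REQUIRED_FIELDS = {"category", "root_cause", "confidence", "suggested_fix"}


def parse_classification(raw: str) -> dict:
    """Parse and validate a model response; raise ValueError if it is unusable."""
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_FIELDS - result.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if result["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {result['category']!r}")
    confidence = result["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return result
```

Rejected responses can be routed to manual triage rather than retried blindly, which keeps malformed output from entering automated workflows.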

Practical Impact

This combination of similarity search and LLM classification reduced the time required to diagnose recurring failures from roughly 20–30 minutes to under 5 minutes in many cases. More importantly, it standardized how failures were described across teams.

Data Integrity Anomaly Detection

Not all failures are explicit. Some of the most impactful issues were silent data problems that passed through the pipeline undetected.

To address this, we introduced lightweight anomaly detection based on metrics collected during each DAG run:

Row counts at each stage
Null ratios per column
Aggregated metrics (e.g., sums, averages)
Execution time
Distribution characteristics
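
As an illustration of how two of these metrics might be captured, the sketch below computes a row count and per-column null ratios from an in-memory batch. The `collect_metrics` helper and the record shape are hypothetical. At the data volumes described here, the same quantities would more plausibly be computed in the warehouse or processing engine than in Python.

```python
def collect_metrics(rows: list[dict], columns: list[str]) -> dict:
    """Compute the row count and per-column null ratio for one stage's output."""
    total = len(rows)
    null_ratios = {}
    for col in columns:
        # A missing key is treated the same as an explicit None.
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_ratios[col] = nulls / total if total else 0.0
    return {"row_count": total, "null_ratios": null_ratios}


sample = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 7.5},
    {"order_id": 4, "amount": None},
]
metrics = collect_metrics(sample, ["order_id", "amount"])
```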

Approach

We avoided overly complex models here and focused on simple, explainable methods:

Rolling baselines per dataset
Threshold-based deviation detection
Basic time-series anomaly detection

This approach proved sufficient for identifying common issues such as:

Sudden drops in processed records
Unexpected spikes in null values
Significant deviations in aggregate metrics
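
A rolling baseline with threshold-based deviation detection can be sketched as below. This is one simple way to implement the idea, using a z-score over a fixed window. The `RollingBaseline` class, the window size, the threshold of 3 standard deviations, and the minimum history length are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Track recent values of one metric and flag large deviations."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0, min_history: int = 5):
        self.values = deque(maxlen=window)  # oldest values drop off automatically
        self.z_threshold = z_threshold
        self.min_history = min_history

    def check(self, value: float) -> bool:
        """Return True if value deviates from the window, then record it."""
        anomalous = False
        if len(self.values) >= self.min_history:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma == 0:
                # A perfectly flat history: any change counts as a deviation.
                anomalous = value != mu
            else:
                anomalous = abs(value - mu) / sigma > self.z_threshold
        self.values.append(value)
        return anomalous


baseline = RollingBaseline()
row_counts = [1000, 1020, 990, 1010, 1005, 998, 120]
flags = [baseline.check(count) for count in row_counts]
```

Only the final value, a sharp drop in row count, is flagged. Note that this sketch records anomalous values in the window too; excluding them keeps the baseline clean but can delay adaptation to a legitimate shift, so the choice is a tuning decision.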

Why Not Use LLMs Here?

LLMs were not a good fit for this layer. The problem is primarily numerical and benefits from deterministic, explainable methods. Keeping this component simple also reduced operational overhead and false positives.

Practical Impact

This layer surfaced several issues that previously went unnoticed until downstream systems failed or produced incorrect outputs. In some cases, anomalies were detected within the same DAG run, allowing for faster intervention.

Predictive Failure Modeling

The final component focuses on anticipating failures before they occur.

Feature Engineering

We derived features from historical DAG runs, including:

Task duration trends
Retry counts
Recent failure frequency
Data volume
Dependency reliability

One challenge here was signal quality. For example, task duration alone was not reliable due to variability in upstream systems. Combining multiple features was necessary to produce useful predictions.
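
To make the idea of combining signals concrete, the sketch below derives a few features from a window of recent runs. The `build_features` helper, the run record keys, and the window size are hypothetical. Dependency reliability is omitted because its definition depends on how upstream health is tracked.

```python
def build_features(history: list[dict], window: int = 5) -> dict:
    """Derive features from the most recent `window` runs.

    Each entry is a hypothetical dict with keys: duration (seconds),
    retry_count, failed (bool), data_volume (rows). Ordered oldest to newest.
    """
    recent = history[-window:]
    if not recent:
        raise ValueError("at least one historical run is required")
    durations = [r["duration"] for r in recent]
    avg_duration = sum(durations) / len(durations)
    return {
        # A ratio above 1 means the latest run was slower than the recent average.
        "duration_ratio": durations[-1] / avg_duration if avg_duration else 0.0,
        "avg_retries": sum(r["retry_count"] for r in recent) / len(recent),
        "failure_rate": sum(1 for r in recent if r["failed"]) / len(recent),
        "latest_volume": recent[-1]["data_volume"],
    }


runs = [
    {"duration": 100, "retry_count": 0, "failed": False, "data_volume": 1000},
    {"duration": 100, "retry_count": 1, "failed": False, "data_volume": 1100},
    {"duration": 200, "retry_count": 2, "failed": True, "data_volume": 900},
    {"duration": 400, "retry_count": 1, "failed": True, "data_volume": 950},
]
features = build_features(runs)
```

Expressing duration as a ratio to the recent average, rather than as a raw value, is one way to reduce sensitivity to the upstream variability mentioned above.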

Model Selection

We experimented with several models and found that random forests provided a good balance between performance and interpretability for our dataset.

    Python
   
from sklearn.ensemble import RandomForestClassifier

# X: one row of features per historical run; y: 1 if that run failed, else 0 (assumed labels).
model = RandomForestClassifier()
model.fit(X, y)

Inference

    Python
   
 

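def extract_features(run):
    """Feature order must match the columns used to train the model."""
    return [
        run["duration"],
        run["retry_count"],
        run["prev_failures"],
        run["data_volume"],
    ]


def predict_failure(run):
    """Return the predicted probability that this run fails."""
    features = extract_features(run)
    # Column 1 of predict_proba is the positive class (failure, assuming 0/1 labels).
    return model.predict_proba([features])[0][1]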
  

Predictions were used to trigger alerts when the probability of failure exceeded a defined threshold.
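
The alerting decision can be sketched as a thin wrapper around the predictor. The `evaluate_run` helper and the 0.7 default threshold are illustrative assumptions; the article does not state the threshold used in production. A stub predictor stands in for `predict_failure` so the example runs on its own.

```python
from typing import Callable


def evaluate_run(run: dict, predict: Callable[[dict], float],
                 threshold: float = 0.7) -> dict:
    """Score a run and decide whether to alert. The default threshold is illustrative."""
    probability = predict(run)
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    return {
        "dag_id": run.get("dag_id"),
        "failure_probability": probability,
        # Alert only when the score strictly exceeds the threshold.
        "alert": probability > threshold,
    }


# Stub predictor standing in for predict_failure; returns a fixed score.
decision = evaluate_run({"dag_id": "daily_orders"}, predict=lambda run: 0.85)
```

Returning the probability alongside the decision lets the threshold be tuned later against observed false positives without changing callers.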

Practical Use

This enabled:

Early warning signals before failures occurred
Selective retries or fallback strategies
Better prioritization of operational attention

Integration With Airflow

Integration was implemented using Airflow callbacks, avoiding changes to core components.

    Python
   
 

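def analyze_callback(context):
    """Send task outcome metadata to the analysis pipeline."""
    ti = context["task_instance"]
    payload = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        # Airflow 2.2+ prefers the term logical_date; check which attribute
        # your Airflow version exposes.
        "execution_date": str(ti.execution_date),
        "log_url": ti.log_url,
        "state": ti.state,
        "task_metrics": {},  # left empty in this example
    }
    # send_to_pipeline is defined elsewhere and is not shown here.
    send_to_pipeline(payload)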
  

This approach allowed the system to capture both success and failure events and process them asynchronously.
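
One way to wire this up is to register the same callable for both outcomes and have it hand the payload off without doing analysis work inline. In the sketch below, `on_success_callback` and `on_failure_callback` are Airflow's callback parameter names, which can be supplied through a DAG's `default_args`. The in-memory queue is an assumption standing in for whatever transport is used in practice (a message broker or an HTTP endpoint), and a `SimpleNamespace` stands in for a real task instance so the example runs without Airflow. The payload is trimmed to three fields for brevity.

```python
import queue
from types import SimpleNamespace

# In-memory stand-in for a message broker or HTTP endpoint (assumption).
event_queue = queue.Queue()


def send_to_pipeline(payload: dict) -> None:
    """Hand the payload off without blocking the task on analysis work."""
    event_queue.put_nowait(payload)


def analyze_callback(context: dict) -> None:
    """Trimmed version of the callback: forward the task outcome."""
    ti = context["task_instance"]
    send_to_pipeline({
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "state": ti.state,
    })


# Using the same callable for both hooks captures successes and failures.
# This dict would be passed as default_args when constructing the DAG.
default_args = {
    "on_success_callback": analyze_callback,
    "on_failure_callback": analyze_callback,
}

# Simulate Airflow invoking the callback with a fake task instance.
fake_ti = SimpleNamespace(dag_id="daily_orders", task_id="load", state="failed")
analyze_callback({"task_instance": fake_ti})
event = event_queue.get_nowait()
```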

Operational Impact

In production, the system produced measurable improvements:

Triage time for recurring failures fell from ~25 minutes to under 5 minutes
Earlier detection of data quality issues, before they affected downstream systems
Reduced manual log inspection across engineering teams
More consistent failure reporting and classification

The most significant improvement was not just speed, but consistency in how failures were understood and addressed.

Limitations and Trade-offs

The system is not without challenges:

Model performance depends on the quality and consistency of historical data
Anomaly detection requires tuning to avoid false positives
LLM outputs must be constrained to remain reliable in automated workflows
Initial setup requires instrumentation and data collection across pipelines

In practice, incremental adoption worked better than a full rollout.

Conclusion

As data platforms scale, orchestration alone is not sufficient. Observability, diagnosis, and proactive failure handling become equally important.

Combining LLM-based reasoning with simpler statistical and machine learning techniques provided a practical path toward improving pipeline reliability. Not every component requires complex models, but selectively applying them where they add the most value can significantly reduce operational overhead.

This approach is not a drop-in solution, but it demonstrates how AI can be integrated into existing data platforms to improve reliability measurably and practically.

Opinions expressed by DZone contributors are their own.
