Improving DAG Failure Detection in Airflow Using AI Techniques
A practical approach to enhancing DAG failure detection using AI to improve pipeline reliability and enable proactive intervention in large-scale data environments.
Join the DZone community and get the full member experience.
Join For FreeApache Airflow is widely used to orchestrate ETL pipelines, but failure handling in large-scale environments remains largely reactive. While Airflow provides strong scheduling and execution primitives, identifying root causes and detecting silent data issues still requires significant manual effort.
This article presents an approach implemented in a production data platform to improve failure detection and diagnosis using a combination of large language models (LLMs), statistical methods, and traditional machine learning. The system focuses on three areas: log-based failure classification, data integrity anomaly detection, and predictive failure modeling.
In practice, this reduced triage time for recurring failures, improved detection of silent data issues, and introduced a more proactive operational model for managing data pipelines.
Problem Context
Airflow works well as an orchestrator, but operational challenges emerge as pipelines scale across multiple datasets, teams, and environments.
In our case, the primary difficulty was not detecting that failures occurred, but understanding them quickly and consistently. A few recurring issues stood out:
- Logs were available but inconsistent across operators and difficult to aggregate
- Root cause analysis required navigating multiple systems (Airflow UI, logs, external services)
- Retries often obscured whether failures were transient or systemic
- Data quality issues rarely triggered task failures, but still affected downstream systems
In pipelines processing millions of records per run, these gaps led to delayed incident response and, occasionally, incorrect reporting. The lack of structured failure information also made it difficult to identify recurring patterns.
System Overview
To address these issues, we introduced an AI-driven observability layer on top of Airflow. The system was designed as a set of loosely coupled services consuming metadata from DAG executions:
- Task logs
- Execution metadata
- Historical run data
- Data quality metrics
Rather than modifying Airflow internals, integration was done via callbacks and external services. This allowed the system to evolve independently and scale across multiple environments.
The solution focuses on three capabilities:
- Failure classification and root cause analysis
- Data integrity anomaly detection
- Predictive failure modeling
Each component can operate independently, but the combined effect provides a more complete view of pipeline health.
Failure Classification and Root Cause Analysis
We started with a simple observation: Most failures are not unique. They tend to follow patterns, even if the log messages are slightly different.
Initial Approach: Embeddings and Similarity
The first iteration used embeddings to represent log messages and retrieve similar past failures. This worked well for recurring issues and provided immediate value with minimal complexity.
from openai import OpenAI
import numpy as np
client = OpenAI()
def generate_embedding(text: str):
response = client.embeddings.create(
model="text-embedding-3-large",
input=text
)
return response.data[0].embedding
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
However, this approach struggled with new or slightly different failures where similarity alone was not sufficient.
Introducing LLM-Based Classification
To handle unseen errors, we introduced an LLM layer to classify failures and suggest likely causes.
One key lesson: Free-form responses were not useful operationally. Early versions of the system produced inconsistent outputs, which made automation difficult. Constraining the model to return structured JSON significantly improved reliability.
def classify_with_llm(log_text, similar_cases):
prompt = f""" You are an expert in distributed systems and Apache Airflow.
Analyze the following task failure log and return a structured JSON response.
Log: {log_text}
Similar past failures: {similar_cases}
Return JSON with:
- category (infra | data | application | external)
- root_cause
- confidence (0-1)
- suggested_fix
"""
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return response.choices[0].message.content
Over time, the taxonomy was refined to align with how engineers actually triaged issues. This reduced ambiguity and made the output directly actionable.
Practical Impact
This combination of similarity search and LLM classification reduced the time required to diagnose recurring failures from roughly 20–30 minutes to under 5 minutes in many cases. More importantly, it standardized how failures were described across teams.
Data Integrity Anomaly Detection
Not all failures are explicit. Some of the most impactful issues were silent data problems that passed through the pipeline undetected.
To address this, we introduced lightweight anomaly detection based on metrics collected during each DAG run:
- Row counts at each stage
- Null ratios per column
- Aggregated metrics (e.g., sums, averages)
- Execution time
- Distribution characteristics
Approach
We avoided overly complex models here and focused on simple, explainable methods:
- Rolling baselines per dataset
- Threshold-based deviation detection
- Basic time-series anomaly detection
This approach proved sufficient for identifying common issues such as:
- Sudden drops in processed records
- Unexpected spikes in null values
- Significant deviations in aggregate metrics
Why Not Use LLMs Here?
LLMs were not a good fit for this layer. The problem is primarily numerical and benefits from deterministic, explainable methods. Keeping this component simple also reduced operational overhead and false positives.
Practical Impact
This layer surfaced several issues that previously went unnoticed until downstream systems failed or produced incorrect outputs. In some cases, anomalies were detected within the same DAG run, allowing for faster intervention.
Predictive Failure Modeling
The final component focuses on anticipating failures before they occur.
Feature Engineering
We derived features from historical DAG runs, including:
- Task duration trends
- Retry counts
- Recent failure frequency
- Data volume
- Dependency reliability
One challenge here was signal quality. For example, task duration alone was not reliable due to variability in upstream systems. Combining multiple features was necessary to produce useful predictions.
Model Selection
We experimented with several models and found that random forests provided a good balance between performance and interpretability for our dataset.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
Inference
def extract_features(run):
return [
run["duration"],
run["retry_count"],
run["prev_failures"],
run["data_volume"],
]
def predict_failure(run):
features = extract_features(run)
return model.predict_proba([features])[0][1]
Predictions were used to trigger alerts when the probability of failure exceeded a defined threshold.
Practical Use
This enabled:
- Early warning signals before failures occurred
- Selective retries or fallback strategies
- Better prioritization of operational attention
Integration With Airflow
Integration was implemented using Airflow callbacks, avoiding changes to core components.
def analyze_callback(context):
ti = context["task_instance"]
payload = {
"dag_id": ti.dag_id,
"task_id": ti.task_id,
"execution_date": str(ti.execution_date),
"log_url": ti.log_url,
"state": ti.state,
"task_metrics": {}
}
send_to_pipeline(payload)
This approach allowed the system to capture both success and failure events and process them asynchronously.
Operational Impact
In production, the system produced measurable improvements:
- Triage time for recurring failures reduced from ~25 minutes to under 5 minutes
- Earlier detection of data quality issues before the downstream impact
- Reduced manual log inspection across engineering teams
- More consistent failure reporting and classification
The most significant improvement was not just speed, but consistency in how failures were understood and addressed.
Limitations and Trade-offs
The system is not without challenges:
- Model performance depends on the quality and consistency of historical data
- Anomaly detection requires tuning to avoid false positives
- LLM outputs must be constrained to remain reliable in automated workflows
- Initial setup requires instrumentation and data collection across pipelines
In practice, incremental adoption worked better than a full rollout.
Conclusion
As data platforms scale, orchestration alone is not sufficient. Observability, diagnosis, and proactive failure handling become equally important.
Combining LLM-based reasoning with simpler statistical and machine learning techniques provided a practical path toward improving pipeline reliability. Not every component requires complex models, but selectively applying them where they add the most value can significantly reduce operational overhead.
This approach is not a drop-in solution, but it demonstrates how AI can be integrated into existing data platforms to improve reliability measurably and practically.
Opinions expressed by DZone contributors are their own.
Comments