When Airflow Tasks Get Stuck in Queued: A Real-World Debugging Story
Fixed Airflow 2.2.2 tasks stuck in "queued" state by backporting a patch from v2.6.0, optimizing scheduler config, and deploying temporary workarounds.
Recently, my team encountered a critical production issue in which Apache Airflow tasks were getting stuck in the "queued" state indefinitely. As someone who has worked extensively with Airflow and its scheduler, I've handled my share of DAG failures, retries, and scheduler quirks, but this particular incident stood out both for its technical complexity and the organizational coordination it demanded.
The Symptom: Tasks Stuck in Queued
It began when one of our business-critical Directed Acyclic Graphs (DAGs) failed to complete. Upon investigation, we discovered several tasks were stuck in the "queued" state — not running, failing, or retrying, just permanently queued.
First Steps: Isolating the Problem
A teammate and I immediately began our investigation with the fundamental checks:
- Examined Airflow UI logs: Nothing unusual beyond standard task submission entries
- Reviewed scheduler and worker logs: The scheduler was detecting the DAGs, but nothing was reaching the workers
- Confirmed worker health: All Celery workers showed as active and running
- Restarted both scheduler and workers: Despite this intervention, tasks remained stubbornly queued
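For anyone triaging something similar, it can help to see exactly which task instances are sitting in queued and when they got there. The snippet below is a rough diagnostic sketch against Airflow's metadata database using its own ORM helpers (attribute names assume Airflow 2.2+); it wasn't part of our original incident response, but it surfaces the same information we were digging out of the UI and logs:
# Rough diagnostic sketch: list task instances currently stuck in QUEUED,
# including when each one entered the queued state.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    queued = session.query(TaskInstance).filter(TaskInstance.state == State.QUEUED).all()
    for ti in queued:
        print(ti.dag_id, ti.task_id, ti.run_id, ti.queued_dttm)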
Deep Dive: Uncovering a Scheduler Bottleneck
We soon suspected a scheduler issue. We observed that the scheduler was queuing tasks but not dispatching them. This led us to investigate:
- Slot availability across workers
- Message queue health (RabbitMQ in our environment)
- Heartbeat communication logs
We initially hypothesized that the scheduler machine might be overloaded by its dual responsibility of scheduling tasks and parsing DAG files, so we increased min_file_process_interval to 2 minutes. While this reduced CPU utilization by limiting how frequently the scheduler parsed DAG files, it didn't resolve our core issue — tasks remained stuck in the queued state.
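For reference, that change is a single key under the [scheduler] section of airflow.cfg (or, equivalently, the AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL environment variable). The excerpt below is just a minimal illustration of the setting, not our full configuration:
[scheduler]
# Re-parse each DAG file at most every 120 seconds instead of the 30-second
# default, trading slower pickup of DAG changes for lower scheduler CPU usage.
min_file_process_interval = 120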
After further research, we discovered that our Airflow version (2.2.2) contained a known issue causing tasks to become trapped in the queued state under specific scheduler conditions. This bug was fixed in Airflow 2.6.0, with the solution documented in PR #30375.
However, upgrading wasn't feasible in the short term. The migration from 2.2.2 to 2.6.0 would require extensive testing, custom plugin adjustments, and deployment pipeline modifications — none of which could be implemented quickly without disrupting other priorities.
Interim Mitigations and Configuration Optimizations
While working on the backported fix, we implemented several tactical measures to stabilize the system:
- Increased parsing_processes to 8 to parallelize DAG parsing and improve parsing time
- Increased scheduler_heartbeat_sec to 30s and increased min_file_process_interval to 120s (up from the default setting of 30s) to reduce scheduler load
- Implemented continuous monitoring to ensure tasks were being processed appropriately
- We also deployed a temporary workaround using a script referenced in this GitHub comment. This script forcibly transitions tasks from queued to running state. We scheduled it via a cron job with an additional filter targeting only task instances that had been queued for more than 10 minutes. This approach provided temporary relief while we finalized our long-term solution.
However, we soon discovered limitations with the cron job. While effective for standard tasks that could eventually reach completion once moved from queued to running, it was less reliable for sensor-related tasks. After being pushed to the running state, sensor tasks would often transition to up_for_reschedule and then back to queued, becoming stuck again. This required the cron job to repeatedly advance these tasks, essentially functioning as an auxiliary scheduler.
We suspect this behavior stems from inconsistencies between the scheduler's in-memory state and the actual task states in the database. This unintentionally made our cron job responsible for orchestrating part of the sensor lifecycle — clearly not a sustainable solution.
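We can't reproduce the exact script from the linked comment here, but the general pattern is straightforward to sketch: find task instances that have been queued longer than some threshold and reset them so the scheduler picks them up again. The sketch below is a simplified illustration, not the script we actually ran; it sends stuck tasks back to the scheduled state rather than forcing them to running, and the 10-minute threshold mirrors the filter we added to our cron job. Treat it as a starting point and test it against a non-production metadata database first.
# Simplified illustration (not the exact workaround script referenced above):
# reset task instances that have sat in QUEUED for more than 10 minutes so the
# scheduler re-dispatches them on its next loop.
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

STUCK_FOR = timedelta(minutes=10)  # mirrors the 10-minute filter from our cron job

with create_session() as session:
    cutoff = timezone.utcnow() - STUCK_FOR
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED)
        .filter(TaskInstance.queued_dttm < cutoff)
        .all()
    )
    for ti in stuck:
        print(f"Resetting {ti.dag_id}.{ti.task_id}, queued since {ti.queued_dttm}")
        # Returning the task to SCHEDULED lets the scheduler re-queue it itself;
        # the original script pushed tasks straight to RUNNING instead.
        ti.state = State.SCHEDULED
    # create_session() commits on exit, persisting the state changes.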
The Fix: Strategic Backporting
After evaluating our options, we decided to backport the specific fix from Airflow 2.6.0 to our existing 2.2.2 environment. This approach allowed us to implement the necessary correction without undertaking a full upgrade cycle.
We created a targeted patch by cherry-picking the fix from the upstream PR and applying it to our forked version of Airflow; the patch itself is linked from the download step below.
How to Apply the Patch
Important disclaimer: The patch referenced in this article is specifically designed for Airflow deployments using the Celery executor. If you're using a different executor (such as Kubernetes, Local, or Sequential), you'll need to backport the appropriate changes for your specific executor from the original PR (#30375). The file paths and specific code changes may differ based on your executor configuration.
If you're facing similar issues, here's how to apply this patch to your Airflow 2.2.2 installation:
Download the Patch File
First, download the patch from the GitHub link provided above. You can download the file in a browser or fetch it with wget:
wget -O airflow-queued-fix.patch https://github.com/gurmeetsaran/airflow/pull/1.patch
Navigate to Your Airflow Installation Directory
This is typically where your Airflow Python package is installed.
cd /path/to/your/airflow/installation
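If you're not sure where that directory is, you can ask the Python environment that runs Airflow. Note that, depending on the path prefixes inside the patch, you may need to apply it from the airflow package directory itself or from its parent site-packages directory:
python -c "import airflow, os; print(os.path.dirname(airflow.__file__))"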
Apply the Patch Using git
Use the git apply command to apply the patch:
git apply --check airflow-queued-fix.patch # Test if the patch can be applied cleanly
git apply airflow-queued-fix.patch # Actually apply the patch
- Restart your Airflow scheduler to apply the changes.
- Monitor task states to verify that newly queued tasks are being properly processed by the scheduler.
Note that this approach should be considered a temporary solution until you can properly upgrade to a newer Airflow version that contains the official fix.
Organizational Lessons
Resolving the technical challenge was only part of the equation. Equally important was our approach to cross-team communication and coordination:
- We engaged our platform engineering team early to validate our understanding of Airflow's architecture.
- We maintained transparent communication with stakeholders so they could manage downstream impacts.
- We meticulously documented our findings and remediation steps to facilitate future troubleshooting.
- We learned the value of designating a dedicated communicator — someone not involved in the core debugging but responsible for tracking progress, taking notes, and providing regular updates to leadership, preventing interruptions to the engineering team.
We also recognized the importance of assembling the right team — collaborative problem-solvers focused on solutions rather than just identifying issues. Establishing a safe, solution-oriented environment significantly accelerated our progress.
I was grateful to have the support of a thoughtful and effective manager who helped create the space for our team to stay focused on diagnosing and resolving the issue, minimizing external distractions.
Key Takeaways
This experience reinforced several valuable lessons:
- Airflow is powerful but sensitive to scale and configuration parameters
- Comprehensive monitoring and detailed logging are indispensable diagnostic tools
- Sometimes the issue isn't a failing task but a bottleneck in the orchestration layer
- Version-specific bugs can have widespread impact — staying current helps, even when upgrades require planning
- Backporting targeted patches can be a pragmatic intermediate solution when complete upgrades aren't immediately feasible
- Effective cross-team collaboration can dramatically influence incident response outcomes
This incident reminded me that while technical expertise is fundamental, the ability to coordinate and communicate effectively across teams is equally crucial. I hope this proves helpful to others who find themselves confronting a mysteriously stuck Airflow task and wondering, "Now what?"