How I Fixed a Silent Production Bug in Apache Airflow That Affected Thousands of Deployments

Pool names in Airflow can contain any characters, but the metrics system can't handle them. I traced the root cause and got a backwards-compatible fix merged.

Pradeep Kalluri

Apr. 14, 26 · Analysis

The Issue That Stopped Me

I was browsing Apache Airflow's open issues one evening — something I do when I want to understand the parts of the codebase I don't use every day. Issue #59935 caught my attention immediately.

The report was simple: pool names in Airflow can contain any characters — spaces, emojis, anything. But the metrics reporting system requires stats names to contain only ASCII letters, numbers, underscores, dots, and dashes. When you created a pool with a name like 'pool name with whitespace and emoji,' Airflow would happily accept it — and then crash silently when it tried to report metrics for that pool.

The error looked like this:

The error from issue #59935

    airflow.exceptions.InvalidStatsNameException: The stat name
    (pool.running_slots.pool name with whitespace and emoji)
    has to be composed of ASCII alphabets, numbers, or the
    underscore, dot, or dash characters.
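
To make the constraint concrete, here is a minimal sketch of a character check of this kind. It is not Airflow's validator: `check_stat_name` and `ALLOWED_STAT_CHARS` are hypothetical names, and a plain `ValueError` stands in for `InvalidStatsNameException`.

```python
import string

# Hypothetical stand-in for the rule quoted in the error message:
# ASCII letters, digits, underscore, dot, and dash.
ALLOWED_STAT_CHARS = frozenset(string.ascii_letters + string.digits + "_.-")


def check_stat_name(stat_name: str) -> str:
    """Return the name unchanged, or raise if it contains a disallowed character."""
    bad = sorted({ch for ch in stat_name if ch not in ALLOWED_STAT_CHARS})
    if bad:
        raise ValueError(f"Invalid stat name {stat_name!r}: disallowed characters {bad}")
    return stat_name


# A conventional pool name passes through untouched.
print(check_stat_name("pool.running_slots.default_pool"))

# A pool name with spaces is accepted at creation time but fails here.
try:
    check_stat_name("pool.running_slots.pool name with whitespace")
except ValueError as exc:
    print(exc)
```

The point of the sketch is the timing: nothing complains when the pool is created, and the failure only surfaces once the name is turned into a metric key.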

A pool gets created. Everything looks fine in the UI. Then the metrics start crashing. No immediate feedback to the user, no validation at creation time — just a silent failure downstream. I knew I wanted to fix this. What I didn't know yet was that my first idea would be rejected, and the process of understanding why would teach me something important about how production systems should handle invalid data.

My First Instinct: Validate at Creation Time

When I read the issue, I was unsure whether the right fix was validation — rejecting invalid pool names at creation time — or normalization — accepting any name but sanitizing it before passing it to the metrics system. Both felt reasonable. I went with validation first because it seemed cleaner. Reject the bad input early. Don't let the invalid state into the system at all.

I added a validate_pool_name() function in airflow/models/pool.py and wired it into create_or_update_pool():

First approach — validation at creation time

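    # airflow/models/pool.py (excerpt; Base, Session and NEW_SESSION are
    # assumed to be imported elsewhere in the module)
    from __future__ import annotations  # lets the "-> Pool" annotation resolve lazily

    import re

    # Matches names made up only of ASCII letters, digits, "_", "-", and ".".
    VALID_POOL_NAME_PATTERN = re.compile(r"[a-zA-Z0-9_\-.]+")


    def validate_pool_name(pool_name: str) -> None:
        """Validate pool name contains only stats-compatible characters."""
        # fullmatch() checks the whole string, so a trailing newline is rejected too.
        if not VALID_POOL_NAME_PATTERN.fullmatch(pool_name):
            raise ValueError(
                f"Pool name '{pool_name}' contains invalid characters. "
                "Pool names may only contain ASCII letters, numbers, "
                "underscores, dots, and dashes."
            )


    class Pool(Base):  # existing model class; other members omitted
        @classmethod
        def create_or_update_pool(
            cls,
            name: str,
            slots: int,
            description: str,
            include_deferred: bool,
            session: Session = NEW_SESSION,
        ) -> Pool:
            validate_pool_name(name)  # <-- added validation
            ...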

Listing 1: The original validation approach — reject pool names with invalid characters at creation time.

I submitted the PR feeling reasonably confident. The logic was tight, the error message was clear, and the fix was at the right layer — preventing invalid data from entering the system rather than dealing with it downstream.

Then the review came back.

Why My Approach Was Wrong

A maintainer left a comment that initially confused me. The core of it was: raising an error at pool creation time is the wrong behavior. If a user already has a pool with an invalid name — and plenty of people do, since Airflow has been accepting any pool name for years — they would suddenly find their existing pools broken after an upgrade. That's a much worse user experience than the original bug.

The maintainer outlined two real options. The first was to prevent invalid names and migrate existing ones, but that is enormously complex and essentially impossible for users who have invalid names embedded in their DAGs, their configurations, and their operational runbooks. The second was to normalize the name at the point where it gets passed to the metrics system: accept any pool name, sanitize it automatically, and log a warning so the user knows their stats are being reported under a modified name.

When I first read this, I was confused. My instinct was still that validation was cleaner. But as I thought about it more, it clicked: the problem isn't the pool name itself. The pool can be called whatever the user wants. The problem is only at the boundary where that name gets handed to the metrics system. Fix it at that boundary, not at the creation boundary. Leave existing users' pools untouched.

That's a meaningful distinction. Validation at creation time solves the problem for new users but breaks existing ones. Normalization at the metrics boundary solves the problem for everyone without breaking anything.
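
The trade-off is easy to see on a handful of names. The sketch below is illustrative only: the pool names are made up, and `is_valid_for_stats` and `to_stats_name` are hypothetical helpers rather than Airflow functions.

```python
import re

# Any character outside this set is not allowed in a stats name.
INVALID_STATS_CHARS = re.compile(r"[^a-zA-Z0-9_\-.]")

# Made-up pools that an existing deployment might already contain.
existing_pools = ["default_pool", "etl pool", "reporting/daily"]


def is_valid_for_stats(name: str) -> bool:
    """Validation: a name is acceptable only if it has no invalid characters."""
    return INVALID_STATS_CHARS.search(name) is None


def to_stats_name(name: str) -> str:
    """Normalization: replace each invalid character with an underscore."""
    return INVALID_STATS_CHARS.sub("_", name)


# Validation would reject pools that already exist and are already in use.
rejected = [name for name in existing_pools if not is_valid_for_stats(name)]
print("validation rejects:", rejected)

# Normalization keeps every pool and only changes the name used for metrics.
stats_names = {name: to_stats_name(name) for name in existing_pools}
print("normalization maps:", stats_names)
```

With validation, two of the three pools would need to be renamed before an upgrade. With normalization, all three keep working and only their metric names differ.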

The New Approach: Normalize at the Stats Boundary

I rewrote the fix. Instead of rejecting invalid pool names, I added a normalization step inside the stats reporting path. Any character that isn't valid for a stats name gets replaced with an underscore. A warning gets logged so the user knows their pool is being reported under a different name in metrics.

Second approach — normalization at stats reporting boundary

    # airflow/models/pool.py (excerpt; Base is assumed to be imported
    # elsewhere in the module)
    import logging
    import re

    log = logging.getLogger(__name__)

    # Matches every character that is NOT allowed in a stats name.
    STATS_NAME_PATTERN = re.compile(r"[^a-zA-Z0-9_\-.]")


    class Pool(Base):  # existing model class; other members omitted
        @staticmethod
        def normalise_pool_name_for_stats(pool_name: str) -> str:
            """
            Normalise pool name for use in stats reporting.
            Characters not compatible with stats naming are replaced
            with underscores. A warning is logged if normalisation occurs
            so users know their pool is reported under a modified name.
            """
            normalised = STATS_NAME_PATTERN.sub("_", pool_name)
            if normalised != pool_name:
                log.warning(
                    "Pool '%s' contains characters that are not valid in stats names. "
                    "It will be reported in metrics as '%s'. "
                    "Rename the pool to remove this warning.",
                    pool_name,
                    normalised,
                )
            return normalised

Listing 2: The revised approach — normalize the pool name at the stats boundary, log a warning, leave pool creation unchanged.

The key difference is where the fix lives. The original fix was in create_or_update_pool() — the pool creation path. The new fix is in the stats reporting path. Pool names are untouched everywhere else in the system. Users with existing invalid pool names don't need to do anything. They just start seeing a warning in their logs that tells them exactly what is happening and what to do about it.
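
To illustrate what "in the stats reporting path" means, the following sketch sanitizes the name only at the moment the metric key is built. `FakeStats` and `report_running_slots` are hypothetical stand-ins, not Airflow's stats client or scheduler code; the `pool.running_slots.` prefix is taken from the error message in the issue.

```python
import re

INVALID_STATS_CHARS = re.compile(r"[^a-zA-Z0-9_\-.]")


class FakeStats:
    """Hypothetical in-memory stand-in for a stats client."""

    def __init__(self) -> None:
        self.gauges: dict[str, int] = {}

    def gauge(self, name: str, value: int) -> None:
        self.gauges[name] = value


def report_running_slots(stats: FakeStats, pool_name: str, running: int) -> None:
    """Emit a gauge for a pool, sanitizing the name only for the metric key."""
    safe_name = INVALID_STATS_CHARS.sub("_", pool_name)
    stats.gauge(f"pool.running_slots.{safe_name}", running)


stats = FakeStats()
pool_name = "nightly etl"  # stored and displayed exactly as the user typed it
report_running_slots(stats, pool_name, 3)

print(pool_name)     # the pool itself keeps its original name
print(stats.gauges)  # only the metric key uses the sanitized form
```

Because the sanitized string never leaves the reporting function, nothing that looks up the pool by name has to change.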

Then Came the CI Battles

If you have contributed to a large open source project before, you know that getting the code right is often the easy part. Getting CI to agree with you is a different kind of challenge.

This PR involved all three of the most common CI pain points simultaneously. First, the Ruff formatter kept failing. Ruff is extremely precise about blank lines, import ordering, and quote styles. I would fix one issue, push, and a different formatting rule would fail. The cycle repeated more times than I would like to admit.

Second, GPG signing. Apache Airflow requires all commits to be GPG signed. I had dealt with this before, but on Windows, the configuration requires pointing git explicitly at the right gpg.exe path, and the email on the GPG key has to match the email in your git config exactly. One character difference, and every commit silently fails to sign. The error messages are not always clear about what went wrong.

Third — and this one was genuinely unexpected — my test file got deleted by another PR mid-review. A maintainer had opened a separate PR to clean up some unused test files, and mine got caught in that cleanup. One morning, I checked the PR, and the tests were just gone. I had to trace through the linked PRs to understand what had happened and then coordinate on where the tests should live.

Sixteen commits to get this merged. Most of them were not logical changes. They were formatting fixes, import reorders, GPG re-signs, and test file relocations. This is normal for large open source projects, and it's worth knowing going in.

What the Final Fix Actually Does

The merged code does three things:

1. When Airflow reports pool metrics, it normalizes the pool name by replacing any character that isn't an ASCII letter, number, underscore, dot, or dash with an underscore.
2. If normalization changes the name, it logs a warning at the WARNING level that tells the user the original pool name, the normalized name being used in metrics, and a suggestion to rename the pool.
3. Pool creation is completely unchanged. Existing pools with any kind of name continue to work. No migration required, no breaking change.

The result: a pool called 'pool name with whitespace and emoji' now reports metrics as 'pool_name_with_whitespace_and_emoji__' in your monitoring system, with a log warning that tells you exactly what happened. Every invalid character, spaces and emoji alike, is replaced by an underscore, which is where the trailing underscores come from. The crash is gone. The user has actionable information. Existing deployments are unaffected.
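
A quick way to check this behavior is to run the normalization on a couple of names and capture the log output. The sketch below re-creates the logic of Listing 2 as a standalone function so that it runs without Airflow, with a shortened warning message. The pool name ending in a space and an emoji (U+1F60E) is an example input chosen here, not necessarily the exact name from the issue, and `ListHandler` is a hypothetical helper.

```python
import logging
import re

log = logging.getLogger("pool_stats_sketch")
STATS_NAME_PATTERN = re.compile(r"[^a-zA-Z0-9_\-.]")


def normalise_pool_name_for_stats(pool_name: str) -> str:
    """Standalone copy of the Listing 2 logic, for experimentation."""
    normalised = STATS_NAME_PATTERN.sub("_", pool_name)
    if normalised != pool_name:
        log.warning("Pool '%s' will be reported in metrics as '%s'.", pool_name, normalised)
    return normalised


class ListHandler(logging.Handler):
    """Collects log messages in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
log.addHandler(handler)
log.propagate = False  # keep the sketch's warnings out of the root logger

clean = normalise_pool_name_for_stats("default_pool")
messy = normalise_pool_name_for_stats("pool name with whitespace and emoji \U0001F60E")

print(clean)             # unchanged, and no warning is logged
print(messy)             # each space and the emoji become "_"
print(handler.messages)  # one warning, for the second pool only
```

The same pattern works as the core of a unit test: assert on the returned name, then assert that a warning was recorded only for names that actually changed.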

What I Took Away From This

1. Where You Fix a Bug Matters as Much as How You Fix It

My first fix was technically correct for new users but wrong for the system as a whole. The right question wasn't 'how do I prevent invalid pool names?' It was 'where in the system does the invalid name actually cause a problem, and what is the least disruptive place to fix it?' Answering that question led to a completely different — and better — solution.

2. Breaking Changes in Open Source Require a Much Higher Bar

In a codebase you own, you can make breaking changes and migrate everything yourself. In a project used by thousands of teams in production, a breaking change affects people you will never talk to, running versions you cannot control. That constraint fundamentally changes how you design fixes. Normalization was better not because it was more elegant, but because it imposed zero cost on existing users.

3. The Review Process Is Where the Real Learning Happens

I came in with what I thought was a clean solution and left with a better understanding of how to think about backwards compatibility in production systems. That would not have happened if I had just merged my first PR without review. The pushback was the most valuable part of the contribution.

4. CI Is Part of the Contribution — Not a Separate Problem

Sixteen commits feel like a lot. But formatting consistency, signed commits, and properly located tests are not bureaucratic overhead — they are how a project with hundreds of contributors stays maintainable. The discipline that makes CI annoying is the same discipline that makes the codebase readable. Accepting that early makes the process much less frustrating.

Final Thoughts

The fix that got merged was not the fix I wrote first. It was the fix I wrote after someone more experienced pointed out why my first approach, while locally correct, was globally wrong. That is a pattern I have seen repeat across every open source contribution I have made: the code you submit is a starting point, not a finished product. The review process is collaborative, not adversarial.

If you use Apache Airflow and you have ever seen an InvalidStatsNameException related to pool names, this is the fix. Pool metrics now normalize automatically, and a warning in your logs will tell you exactly which pools need renaming if you want clean metric names going forward.

And if you are thinking about contributing to open source for the first time, go find an issue that makes you unsure which approach is right. That uncertainty is exactly where the learning happens.

Opinions expressed by DZone contributors are their own.
