Building SRE Error Budgets for AI/ML Workloads: A Practical Framework

ML systems decay gradually instead of breaking suddenly, so we need error budgets for model accuracy, data freshness, and fairness — not just uptime.

Feb. 03, 26 · Analysis


Here's a problem I've seen happen far too often: your recommendation system is up, returning results in milliseconds, and meeting all its infrastructure SLAs. Everything looks rosy on the dashboards. Yet engagement has plummeted by 40% because your model has been stale for several weeks.

According to your traditional error budget? You're golden. According to your product team? The system is broken.

ML systems fail in ways that classical SRE practices never accounted for. A model does not "go down"; it gradually deteriorates. Data pipelines can be "working" while feeding garbage to the model. And you won't even realize this until users start to complain or, worse, quietly leave.

The past few years spent breaking and fixing ML systems have taught me that we need to rethink how we define error budgets. Here's how it works.

Understanding the Limitations of Conventional Error Budgets

The challenge here is that "reliability" in ML does not live on a one-dimensional spectrum. Your API could be functioning correctly even if your model is not working. Your model could be working correctly even if your data pipeline is providing stale features to your model. You could be doing great on your aggregate numbers even if you're treating some users unfairly.

What I've found is that you need to break reliability down into four separate error budgets.

Mapping These to Actual Error Budgets

Before delving into each dimension, I should clarify how these map onto conventional SRE error budgets, so that they are real budgets and not merely health checks.

For each dimension, you require:

• SLI (service level indicator): What you're measuring

• SLO (service level objective): Your target over time

• Error budget: How much you can miss the SLO before you take action

Here's what that looks like for model quality, with concrete numbers:

SLI: Model accuracy relative to the baseline, measured hourly

SLO: Accuracy ≥ 92% of baseline over a rolling 7 days

Error budget: Up to 8% accuracy loss relative to baseline within those 7 days

Burn rate: Monitor hourly; warn when more than 10% of the budget burns in a day

The main difference versus a conventional error budget is that you're measuring degradation relative to a known-good state rather than counting successes and failures. The math is the same in both cases: a budget over time that gets spent whenever you fall short of your SLO.
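To make the burn-rate line concrete, here is a minimal sketch under one possible reading of those numbers: the budget is the 8% of relative accuracy headroom, and consumption is the rolling average degradation divided by that headroom. The function names and this interpretation are assumptions for illustration, not a standard definition.

```python
from statistics import mean

ALLOWED_DEGRADATION = 0.08  # SLO: accuracy stays >= 92% of baseline
DAILY_BURN_WARNING = 0.10   # warn when a single day burns > 10% of the budget


def budget_consumed(hourly_ratios):
    """Fraction of the budget used, given hourly accuracy / baseline ratios."""
    degradation = max(0.0, 1.0 - mean(hourly_ratios))
    return degradation / ALLOWED_DEGRADATION


def daily_burn(hourly_ratios):
    """Budget consumption added by the most recent 24 hourly readings."""
    if len(hourly_ratios) <= 24:
        return budget_consumed(hourly_ratios)
    return budget_consumed(hourly_ratios) - budget_consumed(hourly_ratios[:-24])


# Six healthy days followed by one day at 90% of baseline.
window = [1.0] * 144 + [0.90] * 24
burn = daily_burn(window)
should_warn = burn > DAILY_BURN_WARNING
```

In this example a single bad day consumes a little under 18% of the weekly budget, so the warning fires well before the budget itself is exhausted.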

Now, let's consider every dimension one by one:

1. Infrastructure Error Budget

These are your standard SRE metrics: uptime, latency, and success rate of requests. It's old news, but you should have this as your baseline.

What I monitor: 99.95% availability, p95 latency under 150 ms, 99.9% request success rate

2. Model Quality Error Budget

This is where it gets interesting. You must decide how much model degradation you will tolerate before it triggers an alert.

What I track:

• Model accuracy vs baseline accuracy (typically up to 8% loss)

• Percentage of low-confidence predictions

• Feature distribution drift, detected via statistical tests
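For the low-confidence bullet, a minimal sketch of the metric could look like the following. It assumes a classifier that outputs per-class probabilities; the 0.6 cutoff and the function name are hypothetical and should be tuned to your model.

```python
def low_confidence_rate(class_probabilities, threshold=0.6):
    """Share of predictions whose top class probability is below `threshold`."""
    if not class_probabilities:
        return 0.0
    low = sum(1 for probs in class_probabilities if max(probs) < threshold)
    return low / len(class_probabilities)


batch = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.50, 0.50],
    [0.80, 0.20],
]
rate = low_confidence_rate(batch)  # two of the four rows fall below 0.6
```

A rising rate is often visible before labeled ground truth arrives, which makes it a useful leading indicator alongside accuracy.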

Here's how I calculate degradation:

    Python
   
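# Compare current performance with your baseline (values from the example below)
baseline_accuracy, current_accuracy, acceptable_degradation = 0.95, 0.93, 0.08
accuracy_degradation = (baseline_accuracy - current_accuracy) / baseline_accuracy
# Fraction of the budget consumed; 1.0 means the budget is fully spent
budget_burn_rate = accuracy_degradation / acceptable_degradation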

Real example: Accuracy decreased from 95% to 93% against my 8% threshold. That is a relative degradation of about 2.1%, which consumes roughly 26% of the budget.

As for drift detection, I employ the Kolmogorov-Smirnov test:

    Python
   
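# Check whether a feature's distribution has changed
# (baseline_features and current_features are 1-D samples of one feature)
from scipy.stats import ks_2samp

statistic, p_value = ks_2samp(baseline_features, current_features)
drift_alert = p_value < 0.05  # reject "same distribution" at the 5% level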
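For intuition about what `ks_2samp` measures, the sketch below computes the two-sample KS statistic by hand: the largest gap between the two empirical CDFs. The helper names are illustrative, and the critical value uses the standard large-sample approximation, so treat it as a teaching aid; in practice the SciPy call is the better choice.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the rejection threshold."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


baseline_sample = list(range(100))
shifted_sample = [x + 30 for x in range(100)]  # same shape, shifted by 30
d = ks_statistic(baseline_sample, shifted_sample)
drifted = d > ks_critical_value(len(baseline_sample), len(shifted_sample))
```

Here the shift produces a gap of 0.3 between the CDFs, above the threshold of roughly 0.19 for two samples of 100 points, so the feature would be flagged.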

One thing that bit me: Tie your model accuracy metrics to business metrics. Rather than accuracy percentages, track something your PM cares about — for example, "click-through rate stays within 95% of baseline."

3. Data Quality Error Budget

Garbage in, garbage out. For ML systems, however, "garbage" needs a broader definition.

What matters:

• Feature completeness score (my target is 99%+)

• Feature freshness (how many features are stale?)

• Schema violations

Simple quality check:

    Python
   
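def simple_quality_check(missing_features, stale_features, total_features):
    # Arguments are counts of missing, stale, and total feature values in the batch
    missing_rate = missing_features / total_features
    stale_rate = stale_features / total_features
    # The worse of the two rates determines the score
    data_quality_score = min(1 - missing_rate, 1 - stale_rate)
    meets_sli = data_quality_score >= 0.99
    return data_quality_score, meets_sli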

Traditional data pipelines only cared about having a correct schema. With machine learning, you also need to ensure that your features are fresh enough and that their distributions look reasonable. I've been burned by pipelines that "worked" but passed along day-old data, making our model irrelevant.
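As a rough sketch of how the missing and stale rates might be derived from timestamps, assume each feature value carries the time it was last updated. The data layout, the feature names, and the 24-hour cutoff are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def quality_rates(feature_values, now, max_age=timedelta(hours=24)):
    """Return (missing_rate, stale_rate) for one feature vector.

    `feature_values` maps a feature name to a (value, updated_at) pair;
    a value of None counts as missing.
    """
    total = len(feature_values)
    if total == 0:
        return 0.0, 0.0
    missing = sum(1 for value, _ in feature_values.values() if value is None)
    stale = sum(
        1
        for value, updated_at in feature_values.values()
        if value is not None and now - updated_at > max_age
    )
    return missing / total, stale / total


now = datetime(2026, 2, 3, 12, 0, tzinfo=timezone.utc)
features = {
    "txn_count_7d": (14, now - timedelta(hours=2)),
    "avg_basket": (52.3, now - timedelta(hours=30)),   # stale
    "account_age_days": (410, now - timedelta(hours=1)),
    "device_score": (None, now - timedelta(hours=1)),  # missing
}
missing_rate, stale_rate = quality_rates(features, now)
```

Keeping missing and stale as separate rates makes it easier to tell a broken join (missing values) from a stalled upstream job (stale values).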

4. Fairness Error Budget

Depending on your domain, fairness is either a nice-to-have or a legal requirement. Either way, it should be tracked.

What I monitor:

• Differences in accuracy across demographic groups (I keep this under 5%)

• False positive rate parity across segments

To measure the gap in positive prediction rates between groups:

    Python
   
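# Gap in positive prediction rates between two groups (a demographic parity check)
# predictions and group are NumPy arrays of equal length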
group_A_rate = predictions[group == 'A'].mean()
group_B_rate = predictions[group == 'B'].mean()

disparity = abs(group_A_rate - group_B_rate)
violation = disparity > 0.05  # flag if over 5%
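The snippet above compares rates by subtraction. A ratio form is also common: divide the lowest group selection rate by the highest, and treat values below 0.8 as a signal to investigate (the "four-fifths rule" from US employment guidance). The sketch below is illustrative only; the function names are hypothetical, and whether this threshold applies to your domain is a question for your legal and policy teams.

```python
def selection_rates(predictions, groups):
    """Positive prediction rate per group, for binary 0/1 predictions."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred
    return {g: positives[g] / totals[g] for g in totals}


def disparate_impact_ratio(predictions, groups):
    """Lowest group selection rate divided by the highest."""
    rates = selection_rates(predictions, groups)
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
ratio = disparate_impact_ratio(preds, groups)
needs_review = ratio < 0.8
```

Group A is selected 75% of the time and group B 25%, so the ratio is about 0.33 and the batch would be flagged for review.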

Traditional SRE has no such dimension because traditional systems don't make decisions about people. However, as soon as your machine learning system starts approving loans or ranking job candidates, you need to know whether it is treating people fairly.

Critical Caveats

Fairness metrics are extremely domain-specific and legally complex. The metrics I present here are only examples, and demographic parity is not necessarily the right goal for every problem. Before using fairness budgets:

• Discuss with lawyers how fairness is defined in your regulatory environment

• Coordinate with the product and policy teams to identify acceptable tradeoffs

• Confirm whether you are allowed to store, process, or use sensitive attributes for monitoring purposes

• Do not use simplistic parity checks as the sole indicators of fairness

In regulated industries such as finance, healthcare, or hiring, you need expertise that goes beyond what any framework can provide.

How to Actually Implement This

Step 1: Determine How Reliability Applies in Your Business

Don't begin with metrics in mind. Begin with conversations instead. "What is a broken model in the eyes of my PM?" "What will make my users grumble?"

For an ML-driven search feature, you might choose:

• Infrastructure: p95 latency under 200 ms

• Model quality: Relevance scores above 0.85 as judged by human assessors

• Data quality: Fewer than 1% of queries missing critical features

• Fairness: Search diversity preserved across different user categories
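Targets like the ones above can be written down as data so that dashboards and alerts read from one definition. This is a minimal sketch; the `Slo` class, the SLI names, and the observed values are hypothetical, and the fairness target is left out because it has no numeric threshold yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    dimension: str
    sli: str
    target: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        """Compare an observed SLI value against the target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


SEARCH_SLOS = [
    Slo("infrastructure", "p95_latency_ms", 200.0, higher_is_better=False),
    Slo("model_quality", "relevance_score", 0.85, higher_is_better=True),
    Slo("data_quality", "missing_critical_feature_rate", 0.01, higher_is_better=False),
]

observed = {
    "p95_latency_ms": 180.0,
    "relevance_score": 0.82,
    "missing_critical_feature_rate": 0.004,
}
missed = [s.dimension for s in SEARCH_SLOS if not s.is_met(observed[s.sli])]
```

With these sample readings only the model quality SLO is missed, even though latency and data quality look healthy.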

Step 2: Establish Your Baseline

Run your system in a stable state for 30 days. Observe what "good" looks like.

    Python
   
 

import numpy as np

# Calculate your baseline during a stable period
baseline = {
    'accuracy': np.percentile(stable_metrics['accuracy'], 50),  # median
    'p95_latency': np.percentile(stable_metrics['latency'], 95),
    # calculate_drift_threshold is your own helper (not shown here)
    'drift_threshold': calculate_drift_threshold(stable_features)
}
  

This becomes your north star. Everything else is measured against it.

Step 3: Define Ownership

This is crucial. Each dimension must have a clear owner with the authority to make decisions and take action:

Infrastructure budget → SRE owns:

• Right to suspend deployments

• Authority to roll back changes

• Infrastructure scaling authority

Model quality budget → ML engineering owns:

• Authority for triggering retraining

• Authority to roll back to previous model version

• Power to increase monitoring frequency

Data quality budget → data engineering owns:

• Power to halt data pipelines

• Authority to enable fallback data sources

• Right to reject upstream data

Fairness budget → ML + product + legal own together:

• Needs a multi-stakeholder decision for any actions

• Product evaluates business impact

• Legal specifies compliance requirements

• ML applies technical solutions

If budgets conflict, for example model quality is satisfactory but fairness is violated, the more restrictive budget prevails. Once you have exhausted your fairness budget, good accuracy does not justify continuing to serve predictions.

Step 4: Monitor Everything

Set up dashboards that track all four dimensions. Here's how I calculate a composite health score:

    Python
   
 

# Current health across dimensions
dimensions = {
    'infrastructure': 0.95,  # meeting 95% of SLO
    'model_quality': 0.88,   # at 88% of baseline
    'data_quality': 0.98,
    'fairness': 0.96
}

# Weight them according to what is important to your business
weights = {
    'infrastructure': 0.3,
    'model_quality': 0.35,
    'data_quality': 0.2,
    'fairness': 0.15
}

composite_score = sum(dimensions[d] * weights[d] for d in dimensions)
  

Critical note: The composite score is solely for executive visibility. Hard enforcement always happens per dimension. A 90% composite score does not override a violation in any single dimension: if you blow your fairness budget, you are in violation.
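A small sketch shows how a healthy-looking composite can hide a violation. The per-dimension floors of 0.90 are hypothetical placeholders; substitute the thresholds that come out of your own SLOs.

```python
def enforcement_status(health, floors):
    """Return the dimensions whose health is below their own floor."""
    return sorted(d for d, value in health.items() if value < floors[d])


health = {
    "infrastructure": 0.99,
    "model_quality": 0.97,
    "data_quality": 0.99,
    "fairness": 0.80,
}
weights = {
    "infrastructure": 0.3,
    "model_quality": 0.35,
    "data_quality": 0.2,
    "fairness": 0.15,
}
floors = {d: 0.90 for d in health}  # hypothetical per-dimension minimums

composite = sum(health[d] * weights[d] for d in health)
violations = enforcement_status(health, floors)
```

The composite lands above 95% while fairness sits well below its floor, which is exactly the case where reporting and enforcement need to stay separate.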

Step 5: Know What to Do When Budgets Blow Up

Write this playbook down before you have an incident on your hands:

• Infrastructure budget exhausted: Stop deployments, roll back recent changes, and check whether you need to scale

• Model quality budget exhausted: Kick off retraining, consider reverting to the previous model version, and look at what changed in your dataset

• Data quality budget exhausted: Check your upstream data sources, validate your ETL pipeline, and turn on feature fallbacks if you have them

• Fairness budget exhausted: If it's severe, stop making predictions for the affected subgroups. Don't turn them back on until you have figured out where the bias was introduced and retrained.
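One way to keep this playbook next to the alerting code is to store it as data, combining the owners from Step 3 with the first actions listed here. The structure and the `respond` helper are illustrative assumptions, not a prescribed format.

```python
PLAYBOOK = {
    "infrastructure": {
        "owner": "SRE",
        "actions": ["stop deployments", "roll back recent changes", "check scaling"],
    },
    "model_quality": {
        "owner": "ML engineering",
        "actions": ["trigger retraining", "consider model rollback", "review dataset changes"],
    },
    "data_quality": {
        "owner": "data engineering",
        "actions": ["check upstream sources", "validate ETL pipeline", "enable feature fallbacks"],
    },
    "fairness": {
        "owner": "ML + product + legal",
        "actions": ["pause predictions for affected subgroups", "find the source of bias", "retrain"],
    },
}


def respond(exhausted):
    """List (dimension, owner, first action) for each exhausted budget."""
    return [(d, PLAYBOOK[d]["owner"], PLAYBOOK[d]["actions"][0]) for d in exhausted]


steps = respond(["data_quality", "fairness"])
```

An alert handler can then attach the owner and the first action to the page, so nobody has to look up who decides what during an incident.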

A Real Example: Fraud Detection

Let me illustrate what I mean with a system for preventing fraud that I built for a fintech company.

Our error budgets:

• Infrastructure: 99.99% uptime, p95 latency under 100 ms

• Model quality: Precision above 95%, recall above 90%, false positive rate (FPR) below 2%

• Data quality: Feature completeness of 99.5% or higher, fewer than 1% stale features

• Fairness: FPR differences across merchant types below 3%

Here's what our code for monitoring looked like:

    Python
   
 

from sklearn.metrics import precision_score

# Validate the health of each batch of predictions
def check_fraud_detection_health(predictions, features, ground_truth, baseline_precision):
    # Did model quality degrade? Flag a relative precision drop above 2%.
    current_precision = precision_score(ground_truth, predictions)
    precision_violation = (baseline_precision - current_precision) / baseline_precision > 0.02

    # Are features getting stale? (features is a DataFrame with an 'age_hours' column)
    stale_rate = features[features['age_hours'] > 24].shape[0] / len(features)
    data_violation = stale_rate > 0.01

    # Fairness issues across merchant types?
    # calculate_fpr_by_category is a project helper (not shown) returning {category: FPR}
    fprs = calculate_fpr_by_category(predictions, ground_truth)
    fairness_violation = max(fprs.values()) - min(fprs.values()) > 0.03

    return any([precision_violation, data_violation, fairness_violation])
  

The interesting part: every prediction batch is checked against all three of these dimensions. That helps you detect issues early, because data quality problems often become evident before they affect model performance.
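The per-category FPR helper is not shown in the monitoring code, so here is one hypothetical way it could work. This version takes the merchant category of each transaction as an explicit third argument, which is an assumption; the original presumably gets that information from elsewhere.

```python
def fpr_by_category(predictions, ground_truth, categories):
    """False positive rate per category for binary 0/1 labels.

    Categories with no true negatives are left out of the result.
    """
    false_pos, negatives = {}, {}
    for pred, truth, cat in zip(predictions, ground_truth, categories):
        if truth == 0:
            negatives[cat] = negatives.get(cat, 0) + 1
            false_pos[cat] = false_pos.get(cat, 0) + (1 if pred == 1 else 0)
    return {cat: false_pos[cat] / negatives[cat] for cat in negatives}


preds = [1, 0, 0, 0, 1, 1, 0, 0]
truth = [0, 0, 0, 1, 0, 0, 0, 1]
cats = ["retail"] * 4 + ["travel"] * 4
rates = fpr_by_category(preds, truth, cats)
fairness_gap = max(rates.values()) - min(rates.values())
gap_violation = fairness_gap > 0.03
```

In this toy batch, legitimate travel transactions are flagged twice as often as retail ones, so the gap is far beyond the 3% limit.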

A Few Things I've Learned

Use Rolling Windows Where Time-Based Budgets Are Required

Monthly budgets don't work well for ML. One bad week while you're retraining your model shouldn't burn through the budget for the rest of the month.

I use 7-day rolling windows instead — still time budgets, but with a sliding window.

    Python
   
from collections import deque

# One measurement per hour: 7 days * 24 hours = 168 slots
# (now, current_accuracy, and target_accuracy come from your monitoring job)
measurements = deque(maxlen=168)
measurements.append({'timestamp': now, 'accuracy': current_accuracy})

avg_accuracy = sum(m['accuracy'] for m in measurements) / len(measurements)
budget_ok = avg_accuracy >= target_accuracy

This provides some buffer for recovering from transient problems without having to declare bankruptcy for the month. You're still measuring reliability over time (the point of error budgets), but the window slides smoothly rather than restarting each month.
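A fixed-length deque assumes exactly one measurement per hour. If measurements can be skipped or arrive at irregular intervals, a window keyed on timestamps is a safer variant. The sketch below is one way to do that; the class name is hypothetical, and it assumes measurements arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingAccuracy:
    """Average accuracy over a sliding time window."""

    def __init__(self, window=timedelta(days=7)):
        self.window = window
        self.samples = deque()  # (timestamp, accuracy), oldest first

    def add(self, timestamp, accuracy):
        """Record a measurement and drop everything older than the window."""
        self.samples.append((timestamp, accuracy))
        cutoff = timestamp - self.window
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def average(self):
        return sum(acc for _, acc in self.samples) / len(self.samples)


start = datetime(2026, 1, 1)
tracker = RollingAccuracy()
tracker.add(start, 0.80)                      # evicted once the window moves on
tracker.add(start + timedelta(days=6), 0.94)
tracker.add(start + timedelta(days=8), 0.92)
```

After the third measurement the first one is more than seven days old and drops out, so the average reflects only the two recent readings.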

Adjust Budgets Based on What Is Happening

During a large product rollout, I'll tighten model quality budgets (can't have the model embarrassing us during peak traffic) while relaxing latency requirements slightly. It's fine to adjust these based on context; just be sure to record the reasoning behind each adjustment as it happens.
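One lightweight way to record the reasoning is to make every adjustment an explicit override with a reason and an end date. This is a sketch under assumed names and dates; the thresholds are the allowed relative degradation for the dimension.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BudgetOverride:
    dimension: str
    threshold: float
    reason: str
    starts: date
    ends: date

    def active_on(self, day: date) -> bool:
        return self.starts <= day <= self.ends


def effective_threshold(dimension, default, overrides, day):
    """Use the first active override for the dimension, else the default."""
    for override in overrides:
        if override.dimension == dimension and override.active_on(day):
            return override.threshold
    return default


launch = BudgetOverride(
    dimension="model_quality",
    threshold=0.04,  # tighter than the usual 0.08
    reason="Product launch: tighter quality bar during peak traffic",
    starts=date(2026, 3, 1),
    ends=date(2026, 3, 14),
)
during = effective_threshold("model_quality", 0.08, [launch], date(2026, 3, 5))
after = effective_threshold("model_quality", 0.08, [launch], date(2026, 3, 20))
```

Because the override expires on its own, the budget returns to its default without anyone having to remember to undo the change.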

Be Alert for Cascading Failures

"Garbage in, garbage out" applies here, too: bad input data leads to bad model output, which in turn causes more retries and fallbacks, and therefore more load on the infrastructure. This is where per-dimension budgets come in handy: they let you zero in on where the problem actually started.
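If each dimension records when it first violated its budget, ordering those timestamps gives a quick starting point for tracing a cascade. The function name and the sample timestamps are hypothetical, and the earliest violation is a lead to investigate, not proof of root cause.

```python
from datetime import datetime


def violation_order(first_violation_at):
    """Dimensions that violated their budget, earliest first."""
    violated = {d: t for d, t in first_violation_at.items() if t is not None}
    return sorted(violated, key=violated.get)


incident = {
    "infrastructure": datetime(2026, 1, 10, 14, 30),
    "model_quality": datetime(2026, 1, 10, 11, 0),
    "data_quality": datetime(2026, 1, 10, 9, 15),
    "fairness": None,  # never violated during this incident
}
order = violation_order(incident)
```

Here the data quality budget tripped first, followed by model quality and then infrastructure, which matches the cascade described above.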

Wrapping Up

Conventional error budgets account for infrastructure failures, such as servers becoming unavailable and requests timing out. They do not account for the ways ML fails: model drift, pipelines serving stale features, and predictions that are biased against particular user segments.

This framework identifies these failures early. By monitoring the degradation of model quality with time, you address the issue before it affects users. By monitoring the freshness of the data, you identify the pipeline failures before their impact affects your predictions. By monitoring fairness, you identify bias before it turns into a compliance issue.

The actual gains in reliability come from the following three sources:

• Earlier detection: You spot degradation trends before they become outages

• Root cause clarity: When quality drops, you know whether the cause is the infrastructure or the data

• Clear accountability: Every dimension has a clear owner with clear authority to act

Start with the infrastructure and model quality budgets. Get comfortable tracking the baseline and calculating the burn rate. Once that is working, add data quality tracking. Leave fairness tracking for last: it is the most complex dimension and the most domain-dependent.

Your metrics will differ from mine in the specifics. A recommendation system can tolerate more variation in accuracy than a fraud detection system. However, the combination of four dimensions, time-based budgets, and clearly stated ownership has held up across the ML systems I have worked on.

The aim is not to prevent all model deterioration. It is to detect it, understand why it happens, and have the authority to correct it before it shatters user trust.


Opinions expressed by DZone contributors are their own.
