Failure Handling in AI Pipelines: Designing Retries Without Creating Chaos
AI systems require disciplined retry strategies. By applying bounded backoff policies, teams can prevent retry storms and control cost amplification.
Join the DZone community and get the full member experience.
Join For FreeRetries have become an integral part of the AI tools or systems. In most systems I have seen, teams usually approach failures with blanket retrying. This often yields duplicate work, cost spikes, wasted compute, and operational instability. Every unnecessary retry triggers another inference call, an embedding request, or a downstream write, without improving the outcome.
In most early-stage AI tools, the pattern is that if a request fails, a retry is added. If the retry succeeds intermittently, then the logic is considered sufficient. This approach works fine until the application is in the test environment or in low-user-usage mode; as soon as the application sees higher traffic and concurrent execution, retries begin to dominate system behavior.
The consequences like these become visible:
- Increased token usage and cost
- Inconsistent latency
- Repeated processing of the same Job
- Workers look busy, but the queues are not draining
- Non-meaningful logs
To avoid these consequences, AI tools must treat failures as structured states and respond appropriately to their nature. At a minimum, failures should be categorized into 3 broad categories.
1. Transient (Retryable)
Temporary failures should be retried with appropriate backoff. For example, timeouts, HTTP 429 rate limits, 5xx upstream errors, short-lived network interruptions, etc.
2. Permanent (Non-Retryable)
For these, retries won't change the outcome and should not be retried. For example, invalid payload, schema mismatch, missing required fields, authentication errors, incorrect model configuration, API key failures, policy violations, etc.
3. Unknown (Quarantine)
Any failures that cannot be confidently classified into the two categories above should be marked as unknown. These should not be retried indefinitely. These require controlled handling, often through quarantine or dead-letter routing. For example, inconsistent upstream data, unexpected response structures, edge cases, exceptions, etc.
Let's understand this with a real-world AI application.
Consider an AI-based data enrichment workflow inside a multi-tenant SaaS platform. A typical job within this workflow is structured as:
- Step 1: The system receives source data
- Step 2: An LLM is invoked to normalize or enrich selected fields.
- Step 3: The enriched output is written to a database.
- Step 4: An event is emitted for downstream indexing or analytics.
This flow appears to be straightforward. The complexity arises when any of the individual step fails.
These complexities can be anything. The ideal response to these complexities should depend on the nature of the failure. A few examples of these complexities are:
- LLM returns a 429 rate-limit response. In this case, the workflow should retry with bounded backoff.
- LLM returns a 503 temporary outage. In this case, retrying may also be reasonable.
- Payload is missing a required field, such as title. In this case, retrying will not resolve the issue; the job should be marked failed with a clear reason.
- Tenant configuration lacks a required model name. In this case, it is a configuration error rather than a transient failure, so no retry is needed.
- Database write times out. In this case, retry behavior depends on idempotency guarantees and write semantics.
Simple and Powerful Production-Friendly Failure Model
We should have failure records that operators can read and understand. For example:
{
"job_id": "job_84721",
"tenant_id": "tenant_A",
"stage": "LLM_CALL",
"status": "FAILED",
"failure_type": "TRANSIENT",
"reason": "RATE_LIMIT",
"http_status": 429,
"attempts": 3,
"next_action": "RETRY",
"timestamp": "2026-02-12T16:10:00Z"
}
To understand it better, let’s look at this code defining failure classification and retry policy.
Step 1: Defining the Failure Types and Classification
import random
import time
from dataclasses import dataclass
from typing import Optional, Dict, Any
class FailureType:
TRANSIENT = "TRANSIENT" # retryable
NON_RETRYABLE = "NON_RETRYABLE"
UNKNOWN = "UNKNOWN"
@dataclass
class Failure:
failure_type: str
reason: str
http_status: Optional[int] = None
detail: Optional[str] = None
def classify_failure(err: Exception, http_status: Optional[int] = None) -> Failure:
"""
Classify failures into TRANSIENT / NON_RETRYABLE / UNKNOWN.
Keep this logic small and explicit.
"""
# Common transient HTTP statuses
if http_status in (408, 429, 500, 502, 503, 504):
reason = "RATE_LIMIT" if http_status == 429 else "UPSTREAM_UNAVAILABLE"
return Failure(FailureType.TRANSIENT, reason, http_status=http_status)
# Auth/config errors are usually permanent until fixed
if http_status in (401, 403):
return Failure(FailureType.NON_RETRYABLE, "AUTH_OR_PERMISSION", http_status=http_status)
# Bad request / schema problems are usually permanent
if http_status in (400, 404, 422):
return Failure(FailureType.NON_RETRYABLE, "BAD_REQUEST_OR_SCHEMA", http_status=http_status)
# Known local validation errors
if isinstance(err, ValueError):
return Failure(FailureType.NON_RETRYABLE, "INPUT_VALIDATION", detail=str(err))
# Everything else: quarantine unless you have a reason to retry
return Failure(FailureType.UNKNOWN, "UNCLASSIFIED_EXCEPTION", detail=str(err))
Step 2: Retry Policy With Exponential Backoff and Jitter
@dataclass
class RetryPolicy:
max_attempts: int = 5
base_delay_sec: float = 0.5 # initial delay
max_delay_sec: float = 15.0 # cap
jitter_ratio: float = 0.2 # +/- 20% randomness
def compute_backoff(policy: RetryPolicy, attempt: int) -> float:
# Exponential backoff: base * 2^(attempt-1), capped
delay = min(policy.base_delay_sec * (2 ** (attempt - 1)), policy.max_delay_sec)
# Add jitter to avoid synchronized retries
jitter = delay * policy.jitter_ratio
return max(0.0, delay + random.uniform(-jitter, jitter))
Step 3: A Wrapper That Applies Classification and Policy
def run_with_failure_handling(
*,
job_id: str,
tenant_id: str,
stage: str,
policy: RetryPolicy,
fn,
fn_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
"""
Runs a single stage (e.g., LLM call) with:
- classification
- bounded retries
- backoff + jitter
"""
last_failure: Optional[Failure] = None
for attempt in range(1, policy.max_attempts + 1):
try:
return fn(**fn_kwargs)
except Exception as e:
# If your fn can provide http_status, pass it in explicitly.
http_status = getattr(e, "http_status", None)
failure = classify_failure(e, http_status=http_status)
last_failure = failure
# Decide what to do next
if failure.failure_type == FailureType.NON_RETRYABLE:
return {
"job_id": job_id,
"tenant_id": tenant_id,
"stage": stage,
"status": "FAILED",
"failure_type": failure.failure_type,
"reason": failure.reason,
"http_status": failure.http_status,
"attempts": attempt,
"next_action": "STOP"
}
if failure.failure_type == FailureType.UNKNOWN:
# Conservative choice: do not retry unknown failures forever.
# Quarantine after 1 attempt (or 2 if you prefer).
return {
"job_id": job_id,
"tenant_id": tenant_id,
"stage": stage,
"status": "FAILED",
"failure_type": failure.failure_type,
"reason": failure.reason,
"http_status": failure.http_status,
"attempts": attempt,
"next_action": "QUARANTINE"
}
# Transient: retry if attempts remain
if attempt < policy.max_attempts:
delay = compute_backoff(policy, attempt)
time.sleep(delay)
continue
# Ran out of attempts
return {
"job_id": job_id,
"tenant_id": tenant_id,
"stage": stage,
"status": "FAILED",
"failure_type": failure.failure_type,
"reason": failure.reason,
"http_status": failure.http_status,
"attempts": attempt,
"next_action": "DLQ"
}
# Should not reach here, but return last known state
return {
"job_id": job_id,
"tenant_id": tenant_id,
"stage": stage,
"status": "FAILED",
"failure_type": (last_failure.failure_type if last_failure else FailureType.UNKNOWN),
"reason": (last_failure.reason if last_failure else "UNKNOWN"),
"attempts": policy.max_attempts,
"next_action": "DLQ"
}
Failure Handling and Idempotency
Failure handling and idempotency are a pair. Idempotency prevents duplicates from retries, whereas failure handling prevents retries from becoming chaotic.
If the retry logic is aggressive and jobs are not idempotent, the usage cost will be high, as there will be duplicate inference calls and duplicate DB writes, leading to a confusing state.
If the retry logic is disciplined and jobs are idempotent, the system becomes predictable: retries resolve to state checks, only one execution wins, and operators can reprocess failures intentionally.
Closing Thoughts
In summary, retries are not the enemy for any AI tool; uncontrolled retries are. A production-grade AI tool shouldn’t just retry because of failure; it should understand why the job failed and should retry with discipline when retry proves to be beneficial and stops when it doesn’t.
Opinions expressed by DZone contributors are their own.
Comments