When Retries Become a Denial-of-Wallet

Retries can silently DDoS your wallet — amplifying failures into massive costs. Without limits, jitter, and circuit breakers, “resilience” becomes self-inflicted damage.

May. 06, 26 · Opinion

There's a particular kind of incident that doesn't show up in your error dashboards. No alerts fire. Latency looks fine, actually — or fine-ish, in that flickering, indeterminate way that makes you suspicious but not certain. What shows up, days later, is a billing anomaly. A line item that's 4x what you budgeted. And when you dig, you find it: retries. Hundreds of thousands of them. Loyal, tireless, utterly pointless retries, hammering a dependency that was never going to recover within the retry window, each one spinning up a Lambda invocation, writing to CloudWatch, touching the database, accruing egress. The system was "retrying" its way into insolvency.

This is what I mean when I call uncontrolled retries a self-inflicted Denial-of-Wallet attack. Not metaphorically. Mechanically.

The Seductive Logic of "Just Try Again"

The impulse is almost irresistible. Networks are flaky. Downstream services hiccup. Transient faults are real, they are common, and a single retry genuinely does rescue a meaningful fraction of requests that would otherwise fail. Every distributed systems textbook will tell you this. The problem is that the textbook version of a retry — lone request, momentary fault, clean recovery — bears almost no resemblance to what retries actually do inside a system operating at load under a real failure.

Under real failure, the math inverts.

Say Service A depends on Service B. B starts returning 500s — maybe a deployment went sideways, maybe a database connection pool saturated. A is configured with what seems reasonable: three retries, linear backoff, no jitter. What happens next is not three polite attempts and a graceful degradation. What happens is multiplication. Every original request to A becomes four requests to B (the original plus three retries). If A is receiving 1,000 RPS, B is now absorbing 4,000 RPS — on top of the load it was already failing to handle. Each of those extra requests touches middleware, writes a log line, maybe hits a queue. B, already struggling, gets worse. A's retries accelerate B's failure. The snowball rolls.
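The multiplication is easy to work out by hand, but a tiny sketch makes the shape clearer. The function name `downstream_rps` and its parameters are mine, and it assumes each attempt fails independently with the same probability, which is a simplification of a real outage:

```python
def downstream_rps(incoming_rps: float, max_retries: int, failure_rate: float) -> float:
    """Expected requests per second reaching B when every failed attempt is retried.

    Attempt k (0-based) is only made if the previous k attempts all failed,
    so it happens with probability failure_rate ** k.
    """
    attempts_per_request = sum(failure_rate ** k for k in range(max_retries + 1))
    return incoming_rps * attempts_per_request


# B is fully down: every request costs the original call plus three retries.
print(downstream_rps(1000, max_retries=3, failure_rate=1.0))  # 4000.0
# B fails half the time: 1 + 0.5 + 0.25 + 0.125 attempts per request.
print(downstream_rps(1000, max_retries=3, failure_rate=0.5))  # 1875.0
```

Even a partial failure nearly doubles the load on B, and that is before the extra load pushes the failure rate higher.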

The Stanford RetryGuard researchers have a name for this: the retry storm. It's not exotic. It's what happens when you deploy reasonable-looking retry policies without thinking about what they do in aggregate.

What the Cost Actually Looks Like

People underestimate the surface area of a retry. They think: one extra HTTP call. They don't think about what's attached to that HTTP call.

In a Lambda-backed architecture, each retry is an invocation — billed separately. Each invocation likely emits structured logs to CloudWatch, which charges per GB ingested. If the function hits a DynamoDB table, that's another read unit consumed, possibly another write. If there's an API Gateway in front, that's another API call counted against your tier. If the response is large, there's egress cost. And this happens in parallel across however many concurrent requests are in flight.

Now consider the timeline. Service B fails at 2 AM. The on-call engineer doesn't see it until 2:17. During those 17 minutes, if A was receiving 500 RPS and each request retried three times, you've generated roughly 1.5 million additional requests to B, on top of the half million original ones, for about 2 million in total. You've paid for every one of them. You've gotten nothing back. The original failure wasn't solved; the retries just made the failure expensive.
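Working the timeline through explicitly: 17 minutes at 500 RPS is about half a million original calls, the three retries each add about 1.5 million more, and the total lands near 2 million. The sketch below is just that arithmetic. The function names are mine, and the per-request price is a placeholder you would replace with a blended figure from your own bill:

```python
def outage_request_counts(rps: int, outage_seconds: int, retries_per_request: int) -> dict:
    """Count calls made to a dependency that fails for the whole outage."""
    original = rps * outage_seconds
    retries = original * retries_per_request
    return {"original": original, "retries": retries, "total": original + retries}


def retry_cost_usd(retry_count: int, cost_per_request_usd: float) -> float:
    """Multiply the retry count by a blended per-request cost you supply."""
    return retry_count * cost_per_request_usd


counts = outage_request_counts(rps=500, outage_seconds=17 * 60, retries_per_request=3)
print(counts)  # {'original': 510000, 'retries': 1530000, 'total': 2040000}

# 0.00002 USD is a made-up placeholder, not a real price.
print(round(retry_cost_usd(counts["retries"], 0.00002), 2))
```

The blended cost should include everything attached to a call (invocation, logs, database units, egress), which is why it tends to be higher than people guess.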

One way to think about this: retries without circuit breakers are paying a premium to prolong a failure.

The Hidden Feedback Loops Nobody Draws on the Architecture Diagram

The simple A-calls-B diagram is almost always wrong. What's usually true is that A, B, and C all call each other in some configuration, and several of them share infrastructure. So when B degrades:

A retries B, increasing load on B's shared database connection pool. The pool saturates. Now C, which also reads from that database, starts timing out. C's callers — let's say D and E — start retrying. D and E's retries hit the same pool. The pool is now so saturated that even requests that have nothing to do with the original B failure are timing out.

This is the cascade that the RetryGuard paper captures: service A experiences a retry storm and pays the price, but the price is actually distributed across the whole graph. The bulkhead patterns — isolating thread pools, rate-limiting per-dependency — exist precisely to prevent this. Most systems don't have them, or have them configured with defaults that were never tuned for actual traffic.

The other feedback loop worth naming is the log-based one. Your observability stack is probably downstream of your services. If it's Elasticsearch or Loki or CloudWatch, it absorbs your logs. Under a retry storm, log volume can spike 5–10x. That means your observability system — the thing you're depending on to diagnose the problem — is now also under load. I've been in incidents where the logging pipeline itself started dropping messages at exactly the moment we needed full fidelity. The retry storm ate its own evidence.

Exponential Backoff Is Not Enough (and Jitter Matters More Than You Think)

Backoff is the first thing people reach for. Double the wait between attempts. It's better than nothing. But standard exponential backoff without jitter has a subtle and nasty property: it synchronizes retries.

Suppose 500 requests arrive simultaneously. They all fail. They all back off by 1 second. They all retry simultaneously at T+1. They all fail again. They all back off by 2 seconds. They all retry simultaneously at T+3. You've turned continuous load into synchronized bursts — which are, in some ways, worse than continuous load, because they create spike conditions that can exceed per-second rate limits and overwhelm autoscaling that hasn't had time to provision.
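A small simulation shows the lockstep effect. This is a toy model under my own assumptions (500 clients failing at the same instant, a 1-second base, two retries, 100 ms buckets), not a measurement of any real system:

```python
import random
from collections import Counter


def retry_times(base: float, attempts: int, jitter: bool, rng: random.Random) -> list:
    """Times (seconds after the initial failure) at which one client retries."""
    t, times = 0.0, []
    for attempt in range(attempts):
        backoff = base * 2 ** attempt
        # With jitter the wait is uniform in [0, backoff]; without it, exactly backoff.
        t += rng.uniform(0, backoff) if jitter else backoff
        times.append(t)
    return times


def busiest_bucket(clients: int, jitter: bool, bucket: float = 0.1) -> int:
    """Largest number of retries landing in any single `bucket`-second window."""
    rng = random.Random(42)  # fixed seed so runs are repeatable
    hits = Counter()
    for _ in range(clients):
        for t in retry_times(base=1.0, attempts=2, jitter=jitter, rng=rng):
            hits[int(t / bucket)] += 1
    return max(hits.values())


print(busiest_bucket(500, jitter=False))  # 500: everyone retries at T+1, then at T+3
print(busiest_bucket(500, jitter=True))   # much smaller; exact value depends on the seed
```

Without jitter the peak equals the number of clients, no matter how long the backoff is. Backoff alone moves the spike; it doesn't flatten it.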

Jitter — adding a random offset to the backoff interval — breaks this synchronization. The AWS Architecture Blog's "Exponential Backoff and Jitter" post from 2015 remains one of the clearest explications of why, and the "full jitter" strategy (where the wait is uniformly random between zero and the calculated backoff) outperforms "equal jitter" in most workloads. The math isn't complicated. The intuition is: you want your retriers to spread out across time, not march in lockstep.

The formula you actually want:

    wait = random_between(0, min(cap, base * 2^attempt))

That min(cap, ...) is important. Without a ceiling, your backoff can grow to minutes or hours, which creates its own problems — held connections, stale state, zombie sessions that reconnect long after the original context is gone.
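In Python the formula is a couple of lines. The function name and the default values for `base` and `cap` are illustrative choices of mine, not recommendations:

```python
import random


def full_jitter_backoff(attempt: int, base: float = 0.1, cap: float = 20.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), using full jitter."""
    ceiling = min(cap, base * 2 ** attempt)  # exponential growth, bounded by the cap
    return random.uniform(0, ceiling)        # anywhere between zero and the ceiling


for attempt in range(9):
    ceiling = min(20.0, 0.1 * 2 ** attempt)
    print(f"attempt {attempt}: wait somewhere in [0, {ceiling:.1f}] s")
```

With these defaults the ceiling stops growing at attempt 8, where 0.1 * 2^8 would be 25.6 seconds and the cap holds it at 20.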

Retry Budgets: The Underused Primitive

Here's where Linkerd gets something importantly right that most service meshes and client libraries don't foreground: the retry budget.

The idea is simple. Instead of configuring retries per-request ("retry up to N times"), you configure retries per-traffic-volume ("retries may not exceed X% of requests"). Linkerd's default is 20% (plus a small fixed allowance of 10 retries per second), meaning if your service is handling 1,000 RPS, it will allow roughly 200 retry requests per second, regardless of how many individual requests are failing. Once the budget is exhausted, requests fail fast.

This is a fundamentally different mental model. Per-request retry limits think locally — this request failed, try it again. Retry budgets think globally — the system is under stress, we cannot afford to amplify that stress beyond this threshold. The budget makes the cost of retrying explicit at the system level.
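To make the mental model concrete, here is a minimal in-process sketch of a ratio-based budget. It is not Linkerd's implementation: the `RetryBudget` class, its sliding-window bookkeeping, and the 10-second window are assumptions I've made for illustration, and it is single-threaded:

```python
import time
from collections import deque


class RetryBudget:
    """Allows retries only while they stay under `ratio` of recent original requests."""

    def __init__(self, ratio: float = 0.2, window_seconds: float = 10.0, clock=time.monotonic):
        self.ratio = ratio
        self.window = window_seconds
        self.clock = clock
        self.requests = deque()  # timestamps of original requests
        self.retries = deque()   # timestamps of retries that were allowed

    def _expire(self, now: float) -> None:
        # Drop entries that have aged out of the sliding window.
        for q in (self.requests, self.retries):
            while q and now - q[0] > self.window:
                q.popleft()

    def record_request(self) -> None:
        now = self.clock()
        self._expire(now)
        self.requests.append(now)

    def try_acquire_retry(self) -> bool:
        """Return True and spend budget if a retry is allowed, else False (fail fast)."""
        now = self.clock()
        self._expire(now)
        if len(self.retries) + 1 > self.ratio * len(self.requests):
            return False
        self.retries.append(now)
        return True


budget = RetryBudget(ratio=0.2)
for _ in range(1000):
    budget.record_request()
allowed = sum(budget.try_acquire_retry() for _ in range(1000))
print(allowed)  # 200: the other 800 failures are rejected without a retry
```

The caller checks `try_acquire_retry()` before every retry and fails fast when it returns False, so the worst-case amplification is 1.2x instead of 4x.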

The Istio equivalent is less elegant but workable. In a VirtualService HTTP route you can cap attempts and set aggressive perTryTimeout values to bound the worst-case amplification, though you're still thinking per-route rather than per-budget. A rough YAML configuration:

    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: "5xx,connect-failure,refused-stream"

Notice retryOn. This matters. You should not retry on every error code. A 400 Bad Request doesn't get better with retries: the request is malformed and will fail identically on every attempt. Retrying most 4xx errors is particularly wasteful because they're usually client-side problems that the server will consistently reject. The codes worth retrying are: transient network failures, 503 Service Unavailable, 429 Too Many Requests (the one 4xx exception, and only with appropriate backoff), and sometimes 502 Bad Gateway. Even 504 Gateway Timeout deserves scrutiny: if B is genuinely overwhelmed, retrying a timed-out request doesn't help B recover.
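If you're writing the retry decision in client code rather than mesh config, the same rule fits in a few lines. The function name and the exact set below are my reading of the list above. Leaving 504 out is a judgment call you may want to revisit per dependency:

```python
from typing import Optional

# 429, 502 and 503 only. 504 is left out on purpose: retrying a timeout
# against an overwhelmed dependency usually adds load without helping.
RETRYABLE_STATUSES = frozenset({429, 502, 503})


def should_retry(status: Optional[int]) -> bool:
    """Decide whether a failed attempt is worth repeating.

    `status` is the HTTP status code, or None when no response arrived at all
    (connection refused, reset, and similar transport failures).
    """
    if status is None:
        return True
    return status in RETRYABLE_STATUSES


print(should_retry(400))   # False
print(should_retry(503))   # True
print(should_retry(504))   # False
print(should_retry(None))  # True
```

This only answers "is the error retryable"; whether the request is safe to repeat is a separate question that depends on idempotency.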

Circuit Breakers: The Pattern Everyone Claims to Use and Almost Nobody Tunes

Resilience4j, Hystrix (RIP), Polly, Istio's outlier detection — the options are plentiful. The implementations, in my experience, are often misconfigured to the point of uselessness.

A circuit breaker has three states: closed (passing requests through), open (failing fast), and half-open (letting a probe request through to test recovery). The transitions between states are governed by parameters: failure rate threshold, minimum number of calls before the threshold applies, wait duration in open state, permitted calls in half-open state.

The defaults in most libraries are conservative in a way that makes them nearly inert. A failure rate threshold of 50% sounds aggressive, but if your minimum call count is 100, the breaker won't evaluate the failure rate at all until it has recorded 100 calls, at least 50 of them failures. Even with a small sliding window of, say, 10 calls, half of the calls in the window have to fail before it trips. In practice, by the time the breaker opens, you've already generated substantial unnecessary load.

The tuning questions nobody asks at configuration time:

What's the expected recovery time for this dependency? Set your waitDurationInOpenState to something meaningful relative to that. If your downstream service typically recovers in 30 seconds, a 5-second open window means the breaker will half-open and immediately re-trip multiple times before recovery, adding noise to your metrics and extending the incident.
What's the right sampling window? A count-based window (last N calls) can be gamed by low-traffic services where N takes minutes to fill. Time-based windows (last N seconds) are usually more appropriate for production.
What should happen when the circuit is open? This is the graceful degradation question. Returning an error is fine. Returning a cached response is better. Returning a sensible default is sometimes correct. The teams I've seen handle this best define the fallback behavior explicitly, in code, with the same rigor they'd apply to the happy path.
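The three questions above map directly onto constructor parameters and a fallback argument. What follows is a deliberately small, single-threaded sketch, not a substitute for Resilience4j or Polly. The class, its parameter names, and the default values are all assumptions of mine:

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, min_calls=20, window_seconds=30.0,
                 open_seconds=30.0, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.min_calls = min_calls
        self.window = window_seconds      # time-based sampling window
        self.open_seconds = open_seconds  # should track expected recovery time
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.calls = deque()              # (timestamp, succeeded) pairs

    def _record(self, ok: bool) -> None:
        now = self.clock()
        self.calls.append((now, ok))
        while self.calls and now - self.calls[0][0] > self.window:
            self.calls.popleft()

    def _trip(self) -> None:
        self.state, self.opened_at = "open", self.clock()
        self.calls.clear()

    def call(self, func, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()         # fail fast, no load on the dependency
            self.state = "half_open"      # wait elapsed: let one probe through
        try:
            result = func()
        except Exception:                 # a real breaker would filter error types
            if self.state == "half_open":
                self._trip()              # probe failed: back to open
            else:
                self._record(False)
                failures = sum(1 for _, ok in self.calls if not ok)
                if (len(self.calls) >= self.min_calls
                        and failures / len(self.calls) >= self.failure_rate):
                    self._trip()
            return fallback()
        if self.state == "half_open":
            self.state = "closed"         # probe succeeded: resume normal traffic
            self.calls.clear()
        else:
            self._record(True)
        return result


def flaky():
    raise ConnectionError("B is down")


breaker = CircuitBreaker(min_calls=5)
for _ in range(6):
    breaker.call(flaky, fallback=lambda: "cached response")
print(breaker.state)  # open
```

Note that the fallback is a required argument. Forcing the caller to supply one is the cheapest way I know to make the degradation question impossible to skip.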

The half-open state is where circuit breakers most often fail in practice. Probe requests succeed in the test environment because the test environment has predictable load. In production, the first probe arrives when the downstream service has just recovered and is still warming up. The probe succeeds. The breaker closes. The 200 requests that were held back behind the open breaker hit simultaneously. The service tips over again. Repeat.

The fix is to close the circuit gradually: allow, say, 5% of traffic through in half-open state, ramp to 25%, ramp to 100%. Most libraries don't do this natively. Istio's outlier detection is closer to this model, ejecting individual hosts rather than binary-tripping a per-service breaker.
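Since most libraries don't ship this, here is one way the 5% / 25% / 100% ramp might be sketched as a gate placed in front of a recovering dependency. The `RampingGate` class and the rule for advancing (a fixed number of successes per step) are assumptions of mine, not a pattern taken from any particular library:

```python
import random

RAMP_STEPS = (0.05, 0.25, 1.0)  # fraction of traffic admitted at each stage


class RampingGate:
    """Admits a growing fraction of traffic after recovery instead of all of it at once."""

    def __init__(self, steps=RAMP_STEPS, successes_per_step: int = 20, rng=random.random):
        self.steps = steps
        self.successes_per_step = successes_per_step
        self.rng = rng
        self.step = 0
        self.successes = 0

    @property
    def admit_fraction(self) -> float:
        return self.steps[self.step]

    def allow(self) -> bool:
        """Randomly admit roughly `admit_fraction` of calls; reject the rest."""
        return self.rng() < self.admit_fraction

    def on_success(self) -> None:
        self.successes += 1
        if self.successes >= self.successes_per_step and self.step < len(self.steps) - 1:
            self.step += 1        # enough healthy responses: widen the gate
            self.successes = 0

    def on_failure(self) -> None:
        self.step = 0             # any failure during the ramp: start over
        self.successes = 0


gate = RampingGate(successes_per_step=2)
print(gate.admit_fraction)  # 0.05
gate.on_success()
gate.on_success()
print(gate.admit_fraction)  # 0.25
```

Rejected calls should go to the same fallback as an open breaker. Resetting to the first step on any failure is the strictest option; a gentler variant would drop back one step instead.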

What You Actually Change on Monday Morning

Not everything. The systems are running. You don't get to redesign the retry architecture from scratch during business hours.

But some things are cheap and high-value:

Audit your retry configurations. Find every place in your codebase where retries are configured — client libraries, service mesh configs, SDK defaults you didn't know were there. AWS SDKs retry by default. Many HTTP clients retry on timeout by default. The retry behavior you didn't configure is often more dangerous than the retry behavior you did.

Add jitter to anything that doesn't have it. If you have backoff = base * 2^attempt, change it to backoff = random(0, min(cap, base * 2^attempt)). Twenty minutes of work. Immediate improvement in thundering herd conditions.

Turn on retry rate monitoring. Your APM or service mesh almost certainly exposes retry counts. Surface them. Add a dashboard. Set an alert at, say, a 1% retry rate if that sits above your normal baseline, so that abnormal elevations flag incipient retry storms before they become billing anomalies.

Identify your non-idempotent paths and either remove retries or add idempotency keys. POST endpoints that create resources cannot be safely retried without idempotency controls. If you're retrying a payment or an order creation, you're potentially creating duplicates. This is its own class of disaster, separate from cost — but it compounds cost because you're now also writing extra records.
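The idempotency-key idea is simple enough to show with a toy in-memory service. `OrderService` and its fields are hypothetical. A real implementation would persist the keys, expire them, and handle concurrent duplicates:

```python
import uuid


class OrderService:
    """Toy in-memory server that deduplicates creates by idempotency key."""

    def __init__(self):
        self.orders = {}  # order_id -> payload
        self.seen = {}    # idempotency key -> order_id

    def create_order(self, payload: dict, idempotency_key: str) -> str:
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]  # replay: return the original result
        order_id = str(uuid.uuid4())
        self.orders[order_id] = payload
        self.seen[idempotency_key] = order_id
        return order_id


service = OrderService()
key = str(uuid.uuid4())  # generated once per logical operation, reused on every retry
first = service.create_order({"sku": "A1", "qty": 2}, key)
retry = service.create_order({"sku": "A1", "qty": 2}, key)
print(first == retry, len(service.orders))  # True 1
```

The important detail is on the client side: the key is created before the first attempt and reused for every retry, so the server can tell a retry from a new order.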

Define your fallbacks. For each service your system depends on, what should happen when it's unavailable? The answer "retry indefinitely" is almost never correct. "Return a cached response" or "return a degraded but valid result" or "queue for later processing" are usually better. The fallback should be in code, tested, and not a surprise to the on-call engineer at 2 AM.
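As a sketch of what "in code, tested" might look like: a bounded number of attempts with capped full jitter, then an explicit degradation path. The function name, the returned status labels, and the plain dict cache are illustrative assumptions:

```python
import random
import time


def fetch_with_fallback(fetch, cache: dict, key: str, max_attempts: int = 3,
                        base: float = 0.1, cap: float = 2.0, sleep=time.sleep):
    """Try `fetch` a bounded number of times, then degrade to the last good value."""
    for attempt in range(max_attempts):
        try:
            value = fetch()
            cache[key] = value  # remember the last good answer
            return value, "live"
        except Exception:       # a real client would retry only transient errors
            if attempt < max_attempts - 1:
                sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    if key in cache:
        return cache[key], "cached"  # degraded but valid
    return None, "unavailable"       # explicit failure, never an endless retry


def broken():
    raise TimeoutError("dependency unavailable")


cache = {"prices": {"A1": 9.99}}
print(fetch_with_fallback(broken, cache, "prices", sleep=lambda s: None))
# ({'A1': 9.99}, 'cached')
```

Returning the status alongside the value lets callers and dashboards tell a live answer from a degraded one, so the fallback doesn't hide the outage.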

The Broader Frame

There's something philosophically interesting about retry storms that I keep coming back to. Each individual retry is rational. From the perspective of a single request that failed due to a transient network glitch, retrying is exactly the right behavior. The emergence of a retry storm from individually-rational retries is a classic collective action problem — something that's good for each agent is destructive when everyone does it simultaneously.

Circuit breakers and retry budgets are collective action solutions. They impose a global constraint that each individual caller would have no incentive to impose on itself. This is, incidentally, why they work better when implemented in the mesh layer (where they can see aggregate traffic) than in individual client libraries (where they can only see their own requests).

The Denial-of-Wallet framing is useful because it names the threat model correctly. You don't need an external attacker. You don't need a misconfigured adversary. You need one failure, one reasonable-looking retry policy, and enough traffic that the multiplication matters. The attack surface is your own response to your own failures.

That's the part that's hard to internalize. The retries feel like resilience. They feel like diligence. They are, under the wrong conditions, the instrument of your own undoing.


Opinions expressed by DZone contributors are their own.
