Retries Will Bankrupt You Before Any Attacker Gets the Chance
Bad retries turn minor failures into major outages and bigger bills. Resilience without limits, jitter, and idempotency is just expensive failure.
I've watched a $40,000 AWS bill materialize in a weekend. No breach, no botnet, no disgruntled ex-employee with root access. Just a misconfigured retry policy on a Lambda-backed payment processor that hit a flaky downstream vendor API during a Saturday night deployment. Every timeout spawned three children. Each child could time out too.
That’s the thing nobody tells you when they hand you the Polly documentation and say, “Add resilience.” Resilience, implemented carelessly, is just a different failure mode with a credit card attached.
The Mechanism Nobody Draws on the Whiteboard
Here’s what actually happens inside a retry storm, at the level that matters.
Your service gets a 503 from an upstream dependency. Standard advice: back off and try again. So your code waits 200ms and retries. Meanwhile, every other in-flight request to that same dependency is doing the same thing — because you deployed the same library, with the same defaults, to forty pods simultaneously.
That’s not backoff. That’s a synchronized chorus.
The upstream, already struggling, now receives a coherent spike at the 200ms mark. Another failure. Retry again — 400ms this time. Another spike, equally coherent, at 400ms. You haven’t reduced load on a sick system; you’ve imposed a series of artillery barrages on a hospital.
This is the thundering herd problem, and the jitter fix is almost insultingly simple: multiply your backoff interval by a random float between 0.5 and 1.5. Desynchronize the clients. The math is noncontroversial, and the implementation is three lines. Yet I still find production systems in 2024 with deterministic exponential backoff and no jitter because nobody read past the first code example in the docs.
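A minimal sketch of that jitter, assuming a 200ms base delay as in the example above. The function name, the 30-second cap, and the `rng` parameter are illustrative choices, not taken from any particular library.

```python
import random

def backoff_with_jitter(attempt, base=0.2, cap=30.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    # Deterministic exponential part: 200ms, 400ms, 800ms, ... up to the cap.
    delay = min(cap, base * (2 ** attempt))
    # Jitter: scale by a random factor in [0.5, 1.5] so clients desynchronize.
    return delay * rng.uniform(0.5, 1.5)
```

Note that the cap is applied before the jitter, so the longest possible wait in this sketch is 1.5 times the cap. Forty pods calling this independently spread their second attempt across 200ms to 600ms instead of all landing at 400ms.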
But jitter alone doesn’t save you when your retry budget is unbounded, when you’re retrying the wrong things, or when each retry is itself a unit of billable compute.
Serverless Specifically Is a Trap
Lambda’s execution model charges per invocation, per millisecond of duration. When a Lambda function retries synchronously — inside the handler, before returning — each attempt runs on your bill.
It gets worse with asynchronous invocations. SNS-to-Lambda, EventBridge rules, and async invoke calls use AWS’s internal retry machinery, which by default retries a failed async invocation twice, with delays between attempts. If all three attempts fail, the event goes to a dead-letter queue or on-failure destination if you configured one. If you didn’t, it is discarded silently.
More practically: if your function is timing out because the downstream is overwhelmed, you now have three invocations per logical request, each running to the full timeout duration. Say your timeout is 15 seconds and you have 1,000 events queued. That’s potentially 45,000 Lambda-seconds of compute from a single batch of failures. One batch is cheap in isolation. The damage comes from the multiplier, which applies to every event that arrives for as long as the downstream stays sick, and that is before accounting for concurrency limits and application-level damage from dropped or duplicated operations.
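The arithmetic in that example can be written out as a rough worst-case estimate. The function names and the 512 MB memory size below are assumptions for illustration. Lambda bills duration in GB-seconds, so the configured memory scales the result.

```python
def worst_case_lambda_seconds(events, timeout_s, attempts=3):
    # Each logical event is invoked `attempts` times (1 original + 2 async
    # retries), and each invocation is assumed to run until the timeout.
    return events * attempts * timeout_s

def gb_seconds(lambda_seconds, memory_mb):
    # Billed duration is seconds scaled by configured memory in GB.
    return lambda_seconds * memory_mb / 1024

seconds = worst_case_lambda_seconds(events=1_000, timeout_s=15)
billed = gb_seconds(seconds, memory_mb=512)  # 512 MB is a hypothetical setting
```

Multiply the GB-seconds by your region's current per-GB-second rate and by how many such batches arrive while the downstream is unhealthy to get a ceiling on the duration charge.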
Container environments aren’t immune — they’re just slower to explode. Kubernetes with HPA scales out under load, which means it scales out under retry load too. Your cluster autoscaler faithfully provisions new nodes to handle the surge of retries hammering a service that cannot recover because it’s being hammered by retries. The financial feedback loop runs through your node pool instead of your invocation counter, but it converges on the same outcome.
What “Idempotency” Actually Requires
Everyone says, “Make your retries idempotent,” and then doesn’t think hard enough about what that entails.
Idempotency isn’t a property of your endpoint in isolation. It’s a property of the interaction between your endpoint, your datastore, and your client’s retry behavior.
If a POST /orders returns a 500 after writing to the database but before sending the response — which happens, because distributed systems don’t commit and respond atomically — a naive retry will create a duplicate order. Your endpoint may be stateless and pure, but the operation isn’t, because the first attempt succeeded at the storage layer.
The fix is idempotency keys. The client generates a UUID for the logical operation, sends it as a header or body field, and your service uses it as a deduplication token before processing.
Stripe does this. Braintree does this. Most internal services at most companies do not, because it requires coordination between the caller (who must generate and persist the key) and the server (who must store and check it). That’s two teams and one JIRA ticket that never gets prioritized.
Without idempotency keys, you’re in a probabilistic regime: most retries are harmless; occasional ones cause double charges or duplicate shipments; and you find out on a Tuesday when a customer emails. That’s a trade-off some products can absorb. A payments system cannot.
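Here is a toy sketch of the idempotency-key handshake described in this section. `OrderService`, its in-memory dictionaries, and the payload fields are all hypothetical stand-ins. A real service would keep the key in durable storage and record it in the same transaction as the order write (or enforce it with a unique constraint), which this sketch does not attempt.

```python
import uuid

class OrderService:
    """Toy server: `_seen` stands in for a durable store keyed by idempotency key."""

    def __init__(self):
        self._seen = {}    # idempotency key -> stored response
        self.orders = []   # stands in for the orders table

    def create_order(self, idempotency_key, payload):
        # Known key: replay the stored response instead of writing again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        order = {"id": len(self.orders) + 1, **payload}
        self.orders.append(order)
        self._seen[idempotency_key] = order
        return order

# Client: generate the key once per logical operation, reuse it on every retry.
service = OrderService()
key = str(uuid.uuid4())
first = service.create_order(key, {"sku": "A-1", "qty": 2})
retry = service.create_order(key, {"sku": "A-1", "qty": 2})  # e.g. after a lost response
```

The retry returns the original order and the table still holds one row. The important part on the client side is that the key is persisted with the operation, not regenerated per attempt.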
The Circuit Breaker Is More Nuanced Than the Diagram Suggests
The circuit breaker pattern looks elegant in diagrams: Closed → Open → Half-Open. A clean state machine. Everyone nods.
The implementation details are where systems go wrong.
What threshold opens the circuit — error rate or error count?
An error rate of 50% means nothing if you’re receiving two requests per minute. A fixed error count has the opposite problem: under very high traffic it trips on background noise, and under low traffic it may never trip at all. Most teams pick one, deploy it, and never validate whether the thresholds make sense at actual traffic volumes.
I’ve seen circuits configured to open at “10 failures in 10 seconds” on a service receiving 10,000 requests per second. That’s 10 failures out of 100,000 requests, a 0.01% failure rate, so sensitive that it trips on normal variance. The circuit was effectively always open. Engineers disabled it.
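One way to avoid both failure modes is to require a minimum call volume before the failure rate counts. The sketch below is an illustration of that rule, not any framework's implementation. The function name and both default thresholds are assumptions you would tune against real traffic.

```python
def should_open(failures, total, min_calls=100, failure_rate_threshold=0.5):
    """Decide whether to open the circuit for one sliding window."""
    # Too few calls in the window: the rate is statistically meaningless.
    if total < min_calls:
        return False
    # Enough volume: judge by rate, so high traffic doesn't trip on noise.
    return failures / total >= failure_rate_threshold

low_traffic = should_open(failures=1, total=2)           # 50% of two calls
noisy_high = should_open(failures=10, total=100_000)     # 0.01% failure rate
real_outage = should_open(failures=60_000, total=100_000)
```

With these defaults, only the third window opens the circuit.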
What does “Open” mean to callers?
If your circuit opens and returns 503 immediately, callers see a fast failure instead of a slow one. Good — it saves timeout-related compute. But if the caller is itself a synchronous API handler, it now returns 503 to its caller, which may retry. Your circuit is now absorbing retry pressure instead of the downstream, and you’ve moved the thundering herd one hop upstream.
The circuit works only if the outermost client has its own retry budget and respects Retry-After. It doesn’t work if you have four service layers each retrying independently without coordination.
The Half-Open probe is usually a single request. One. Most frameworks allow one request through to test whether the downstream has recovered. If that probe fails, the circuit re-opens.
This works — unless the downstream is flapping, recovering just long enough to pass the probe before collapsing again under load. You’ve now admitted traffic into a system that still can’t handle it.
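This is an argument for gradual traffic restoration (10%, then 50%, then 100%) rather than single-probe half-open logic. Resilience4j gets you part of the way: its permittedNumberOfCallsInHalfOpenState setting lets the half-open state evaluate a batch of calls instead of one. Most teams don’t configure it.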
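A rough sketch of what a stepped ramp could look like, assuming the 10% / 50% / 100% steps mentioned above. The class name, the successes-per-step count, and the injectable `rng` are hypothetical. The sketch also leaves out the timer that would move the circuit from open back to the start of the ramp.

```python
import random

class RampingHalfOpen:
    STEPS = (0.10, 0.50, 1.00)  # fraction of traffic admitted at each stage

    def __init__(self, successes_per_step=20, rng=random.random):
        self.successes_per_step = successes_per_step
        self.rng = rng
        self.step = 0
        self.successes = 0
        self.open = False

    def allow(self):
        # Admit only the current fraction of traffic; reject the rest fast.
        return not self.open and self.rng() < self.STEPS[self.step]

    def record(self, ok):
        if not ok:
            # Any failure during the ramp re-opens the circuit and resets it.
            self.open, self.step, self.successes = True, 0, 0
            return
        self.successes += 1
        last_step = len(self.STEPS) - 1
        if self.successes >= self.successes_per_step and self.step < last_step:
            self.step += 1
            self.successes = 0
```

A flapping downstream that passes a single probe still has to survive a sustained run at 10% and then 50% before it sees full load again.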
What You Actually Do Monday Morning
Pull up your service metrics. If you aren’t tracking retry rate separately from error rate, fix that first. Add a retries_attempted_total Prometheus counter labeled by upstream and error class.
If retries exceed 10–15% of outbound request volume, something is wrong — either with your retry policy, your dependency health, or both.
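To make the ratio concrete, here is a standard-library sketch that tracks retries separately from total attempts, keyed by upstream and error class the way the counter labels above would be. `RetryStats`, the `vendor-api` name, and the traffic numbers are made up for illustration. In production you would export these as metrics instead of holding them in memory.

```python
from collections import Counter

class RetryStats:
    def __init__(self):
        self.attempts = Counter()  # upstream -> all outbound attempts
        self.retries = Counter()   # (upstream, error_class) -> retries only

    def record_attempt(self, upstream, retry_cause=None):
        """Count one outbound attempt; pass the error class if it is a retry."""
        self.attempts[upstream] += 1
        if retry_cause is not None:
            self.retries[(upstream, retry_cause)] += 1

    def retry_ratio(self, upstream):
        total = self.attempts[upstream]
        retried = sum(n for (u, _), n in self.retries.items() if u == upstream)
        return retried / total if total else 0.0

stats = RetryStats()
for _ in range(80):
    stats.record_attempt("vendor-api")
for _ in range(20):
    stats.record_attempt("vendor-api", retry_cause="timeout")
ratio = stats.retry_ratio("vendor-api")
```

Here 20 of 100 outbound attempts are retries, which is above the 10–15% range and worth investigating.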
Then inspect your retry configuration and ask three questions:
1. What’s the maximum number of attempts?
If the answer is “unlimited” or you don’t know, that’s a production incident waiting to happen. Three to five attempts is defensible. More than that usually means you’re retrying a system that needs fixing.
2. Does your backoff include jitter?
Not “in theory.” Look at the actual code. You’d be surprised.
3. Which error classes are retried?
Timeouts and 503s: yes.
401s: never. You’ll just hammer auth.
400s: never.
429s: only if you strictly respect the Retry-After header. A 429 is the dependency telling you its capacity. Ignoring that is rude and expensive.
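The three questions can be folded into one small decision function. This is a sketch under stated assumptions: `next_delay`, `MAX_ATTEMPTS = 4`, and the 200ms base are illustrative, and `retry_after` is assumed to be the delay-in-seconds form of the Retry-After header (the HTTP-date form is not handled here).

```python
import random

MAX_ATTEMPTS = 4  # an assumed value inside the three-to-five range

def next_delay(status, attempt, retry_after=None, timed_out=False):
    """Seconds to wait before the next attempt, or None to stop retrying.

    `attempt` is the number of attempts already made (1 after the first call).
    """
    if attempt >= MAX_ATTEMPTS:
        return None
    if status == 429:
        # Retry only when the server said how long to wait, and honor it.
        return float(retry_after) if retry_after is not None else None
    if timed_out or status == 503:
        # Exponential backoff from 200ms with a 0.5x to 1.5x jitter factor.
        return 0.2 * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
    # 400, 401, and anything else are not retried in this sketch.
    return None
```

A caller loops while `next_delay(...)` returns a number, sleeps for that long, and surfaces the error as soon as it returns `None`.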
Then configure cost anomaly alerts (AWS Cost Anomaly Detection or AWS Budgets, or budget alerts in Google Cloud Billing) if you’re in a serverless or aggressively autoscaled environment. Set the threshold at 2× normal daily spend. It won’t catch a fast-moving incident, but it will catch the slow retry storm that builds overnight and greets you Monday morning with a number you have to explain to your CFO.
The deeper fix is cultural: treat runaway retries like a memory leak. They are reliability bugs, not acceptable trade-offs. Not “we’ll get to it.” Not something normalized because they haven’t caused a visible outage.
Retries are the mechanism that converts your upstream’s problem into your bill.
That deserves more respect than it typically gets.
Opinions expressed by DZone contributors are their own.