Retries Are a Denial-of-Wallet Attack Waiting to Happen

How uncontrolled retries can crash systems and inflate costs — and how jitter, circuit breakers, and idempotency prevent self-inflicted failures.

Mar. 09, 26 · Opinion

The invoice arrived on a Tuesday. Forty-seven thousand dollars for Lambda invocations across a weekend nobody was working. The team lead stared at CloudWatch metrics — normal traffic Friday afternoon, then a cliff of timeouts starting around 21:00. What followed wasn't an attack. No credential leak, no bot swarm. Just the application eating itself alive through retries, each failed request spawning three more, those spawning nine, the exponential curve steepening until AWS started provisioning containers faster than anyone could hit the "stop" button.

This is what we mean by Denial-of-Wallet. Not malicious. Self-inflicted.

The Anatomy of a Retry Storm

Systems retry because networks lie. A timeout doesn't mean the downstream service is dead: it might be slow, it might have processed your request and choked sending the response, it might be shedding load. So you retry. Reasonable. The HTTP client library you pulled from npm does it by default. Three retries, hardcoded. Your API gateway has its own retry policy. The message queue has redelivery logic. Suddenly a single user request, encountering a hiccup, becomes four network calls: the original plus three retries. Nine if the gateway and the client both retry twice, because three attempts at one layer multiply by three attempts at the other.
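The multiplication is easy to underestimate, so here is a minimal sketch of the worst case. The function name and the example layer counts are illustrative, not taken from any real system; each number is the total attempts (first try plus retries) a layer makes when every attempt fails.

```python
from math import prod

def worst_case_calls(attempts_per_layer):
    """Worst-case downstream calls caused by one user request.

    Layers multiply: every attempt at an outer layer re-runs the whole
    retry loop of the layer beneath it.
    """
    return prod(attempts_per_layer)

# A single client that tries once and retries three times.
print(worst_case_calls([4]))        # 4
# Gateway and client each try once and retry twice.
print(worst_case_calls([3, 3]))     # 9
# Hypothetical: a queue that delivers up to five times sits on top.
print(worst_case_calls([5, 3, 3]))  # 45
```

Nobody configured "45 calls per request". It falls out of three individually reasonable defaults.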

Now scale that. A service handling ten thousand requests per second hits a transient database slowdown. Connection pool exhausted, maybe a bad index scan. Requests start timing out. Each timeout triggers retries — suddenly you're attempting thirty thousand database connections. The database, already struggling, collapses entirely. More timeouts. More retries. The spiral tightens.

I watched this happen to a payment processor once. Their reconciliation service called a third-party fraud check API. That API started returning 503s (legitimate overload on their side, nothing nefarious). The reconciliation service had no circuit breaker, just a retry loop: five attempts with fixed two-second delays. Traffic wasn't huge, maybe a hundred reconciliations per minute. But five attempts per failing reconciliation meant up to five hundred API calls a minute hammering an already-saturated endpoint. Their bill for that service, normally twelve dollars a day, hit eight hundred in six hours. The fraud vendor started rate-limiting them, which caused more retries, which triggered their own internal alerting, which woke someone up at 3 AM to discover they'd been essentially DDoS-ing a partner while paying for the privilege.

The Thundering Herd Problem

Synchronized retries are worse. Picture this: a thousand microservice instances all querying the same backing service. That service hiccups — garbage collection pause, network blip, doesn't matter. All thousand clients experience the timeout simultaneously. They all retry. Together. No jitter, no backoff, just a synchronized wave of duplicate requests landing exactly two seconds later (or whatever the hardcoded delay is).

The backing service, barely recovered, takes a second hit. Bigger than the first because now it's handling both new legitimate traffic and a thousand retries. It falls over again. The next wave arrives. This is the thundering herd, and it kills systems that would've survived the original fault if clients had just... waited. Spread out. Stopped trying for a minute.

I've seen engineers add jitter after incidents like this. Random delays between 100ms and 5 seconds. It's unsexy, literally just sleep(100 + random() * 4900), but it prevents the herd. One study I read clocked a 98% reduction in retry traffic volume just by adding jitter and a max-attempt cap. Ninety-eight percent. We're talking about changing three lines of code to avoid quadrupling your bill.
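To see why something that dumb works, here is a small simulation sketch. It assumes a thousand clients that all time out at the same instant and asks how many retries land in the busiest 100ms slice. The function names, the bucket size, and the seed are my own choices for illustration.

```python
import random

def fixed_delay(_rng):
    """Every client waits exactly two seconds."""
    return 2.0

def jittered_delay(rng, low=0.1, high=5.0):
    """Every client waits a random time between 100ms and 5 seconds."""
    return rng.uniform(low, high)

def peak_retries_per_100ms(delay_fn, clients=1000, seed=42):
    """Largest number of retries that land in any single 100ms bucket."""
    rng = random.Random(seed)
    buckets = {}
    for _ in range(clients):
        bucket = int(delay_fn(rng) * 10)  # index of the 100ms slice
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return max(buckets.values())

print(peak_retries_per_100ms(fixed_delay))     # 1000: the whole herd at once
print(peak_retries_per_100ms(jittered_delay))  # a small fraction of that
```

The total number of retries is identical in both runs. Jitter doesn't reduce the work, it flattens the spike, and the spike is what knocks the backing service over a second time.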

What Gets Retried (And What Shouldn't)

Not all errors are retry-worthy. A 401 Unauthorized won't magically become a 200 OK if you ask again. A 400 Bad Request means your payload is malformed; retrying it is just spam. Yet I've audited client libraries that retry everything. Every non-2xx status. Including redirects. Including client errors. One team was retrying 413 Payload Too Large responses, sending the same enormous JSON blob over and over, racking up egress charges while accomplishing nothing.

Transient failures — timeouts, 503s, 429 rate-limits, certain 500s — those are retry candidates. Even then you need backoff. Exponential is standard: wait one second, then two, then four, then eight. But that's still not enough if everyone's clock started at the same moment. You need jitter. You need a maximum ceiling, too. Retrying for ten minutes is madness if your user already navigated away or the job deadline passed.
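Put together, those rules fit in a short function. This is a sketch under stated assumptions, not a drop-in client: `send` is a hypothetical callable that returns an HTTP status code, the set of retryable statuses is my guess at a sane default (which 500s to include depends on the API), and the delay uses the "full jitter" variant, a random wait between zero and the exponential ceiling.

```python
import random
import time

# Assumption: tune per API. Client errors such as 400, 401, 413 are never here.
RETRYABLE_STATUSES = {429, 502, 503, 504}

def backoff_delay(attempt, base=1.0, cap=8.0, rng=random):
    """Full-jitter wait before the retry that follows attempt `attempt` (1-based)."""
    ceiling = min(cap, base * 2 ** (attempt - 1))  # 1s, 2s, 4s, 8s, 8s, ...
    return rng.uniform(0, ceiling)

def call_with_retries(send, max_attempts=4, deadline_s=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call `send` until it returns a non-retryable status or a budget runs out.

    Returns (last_status, attempts_made).
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in RETRYABLE_STATUSES:
            return status, attempt
        if attempt == max_attempts:
            break  # attempt budget spent
        delay = backoff_delay(attempt)
        if clock() - start + delay > deadline_s:
            break  # the caller has stopped caring by then
        sleep(delay)
    return status, attempt
```

Two budgets, not one: the attempt cap bounds the amplification, and the deadline stops you from retrying on behalf of a user who already left. A real client would also treat timeouts as retryable and honor a Retry-After header on 429s. Both are left out here to keep the sketch short.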

And you absolutely need idempotency. If your retry logic can't guarantee that replaying a request doesn't double-charge a credit card or duplicate a database row, you shouldn't be retrying at all. Idempotency tokens — UUIDs attached to requests, checked server-side — are the fix. Generate once, include in every retry attempt, deduplicate on the backend. Simple in theory. Painful in practice because it means state, coordination, probably a Redis cache or a database column.
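The server side of that contract can be sketched in a few lines. Everything here is hypothetical: `PaymentService` is a toy, and the in-memory dict stands in for the Redis cache or unique database column mentioned above.

```python
import uuid

class PaymentService:
    """Toy server that deduplicates writes on an idempotency key."""

    def __init__(self):
        self._results = {}  # stand-in for Redis or a unique DB column
        self.charges = []   # the side effect we must not duplicate

    def charge(self, idempotency_key, account, amount_cents):
        # Replay: return the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result

service = PaymentService()
key = str(uuid.uuid4())                      # generated once, before attempt 1
first = service.charge(key, "acct-1", 1999)
retry = service.charge(key, "acct-1", 1999)  # same key on the retry
print(first == retry, len(service.charges))  # True 1
```

The painful part the sketch skips is concurrency. Two retries arriving at once must not both pass the "have I seen this key" check, which in a real system means an atomic set-if-absent or a unique constraint, plus an expiry policy for old keys.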

Circuit Breakers (Or: When to Stop Digging)

A circuit breaker is just a state machine with a timer. Start in "closed" (requests flow normally). If failures exceed a threshold — say, ten consecutive errors — trip to "open" (fail fast, don't even try the downstream call, return an error immediately). After some timeout (thirty seconds, maybe), transition to "half-open" (try one request; if it succeeds, close the circuit; if it fails, back to open).
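That paragraph translates almost line for line into code. The sketch below is a minimal single-threaded version, with class and parameter names of my own choosing. It is not how Hystrix or Resilience4j are implemented, just the three states and the timer.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses to make the downstream call."""

class CircuitBreaker:
    def __init__(self, failure_threshold=10, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0      # consecutive failures while closed
        self._opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"  # timer expired: let one probe through
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        # Any success closes the circuit and resets the count.
        self._failures = 0
        self.state = "closed"
        return result

    def _on_failure(self):
        self._failures += 1
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Wrap each external dependency in its own instance, for example `breaker.call(lambda: client.get(url))`, where `client` is whatever HTTP client you already use. A production version needs a lock around the state transitions so that only one thread gets to be the half-open probe.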

Netflix's Hystrix popularized this. Resilience4j carries the torch now that Hystrix is in maintenance mode. The idea is dead simple: stop hitting a broken service. Give it time to recover. Prevent retry storms by not retrying at all once you've detected sustained failure.

But circuit breakers have edge cases. What's the threshold? Ten failures sounds reasonable until you realize that's ten failed user requests. If your traffic is bursty, you might trip the circuit during a legitimate spike in load, then fail everything for thirty seconds even though the backend is fine. You need sliding windows, percentile-based thresholds, maybe separate breakers per endpoint or per dependency. The abstraction is elegant; the tuning is fiddly.
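One way to get away from "ten consecutive errors" is a count-based sliding window with a minimum sample size. The sketch below shows only the trip decision. The class name, window size, and thresholds are illustrative assumptions, and a time-based window would work just as well.

```python
from collections import deque

class FailureRateWindow:
    """Trip on failure *rate* over the last `size` calls, not a raw count."""

    def __init__(self, size=100, min_calls=20, threshold=0.5):
        self._outcomes = deque(maxlen=size)  # True = success, False = failure
        self.min_calls = min_calls
        self.threshold = threshold

    def record(self, ok):
        self._outcomes.append(ok)

    def should_trip(self):
        n = len(self._outcomes)
        if n < self.min_calls:
            return False  # too little data to judge
        failures = n - sum(self._outcomes)
        return failures / n >= self.threshold

window = FailureRateWindow()
for _ in range(10):
    window.record(False)
print(window.should_trip())  # False: ten failures, but not enough calls yet
```

Ten failures out of ten thousand calls and ten failures out of ten calls are different situations. A rate with a sample floor can tell them apart, where a bare counter cannot.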

One team I worked with set their circuit breaker timeout to five minutes. Reasonable for a batch job, catastrophic for user-facing API traffic. They'd trip the breaker on a transient issue, then return errors to every user for five minutes straight while the downstream service sat there, recovered, waiting for traffic that never came. Monitoring showed the database was healthy. The application was just refusing to talk to it.

Observability (Because You Can't Fix What You Can't See)

You need metrics. Not just error rates — retry rates. Separate them. If your error rate is 5% but your retry rate is 40%, you have a problem. That means most of your traffic is re-attempts, which means your observability is lying to you about load.

Prometheus is fine. CloudWatch is fine. Datadog will happily ingest retry counts. The key is: instrument your client libraries. Tag requests with attempt=1, attempt=2, etc. Count them. Chart them. Alert when retries dominate. I've seen dashboards where the error rate looked acceptable (3%, within SLA) but retries were chewing through 60% of Lambda invocations. Nobody noticed until the bill came.
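The metric itself is trivial once the attempt tag exists. A sketch, assuming each request has been tagged with its attempt number as described above (the function name is mine):

```python
from collections import Counter

def retry_share(attempt_tags):
    """Fraction of all calls that were retries (attempt number above 1)."""
    counts = Counter(attempt_tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts[1]) / total

# 40 first attempts, then 30 second attempts, 20 third, 10 fourth.
tags = [1] * 40 + [2] * 30 + [3] * 20 + [4] * 10
print(retry_share(tags))  # 0.6
```

In practice you would emit the attempt number as a label on a counter in Prometheus, CloudWatch, or Datadog and compute the ratio in the query layer. The point is that the ratio only exists if the client records which attempt it is on.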

Distributed tracing helps. AWS X-Ray, Jaeger, Honeycomb — anything that lets you follow a request ID through multiple hops. When a retry storm hits, you can see the fan-out. One root request spawning dozens of child spans. The trace graph looks like a fractal. You need to be able to see that in real-time, not just in postmortem.

And for serverless especially: monitor concurrency. Lambda bills per invocation and per millisecond of execution time. If your concurrency spikes because of retries, you'll scale up, hit your account limit (or blow through it if you've raised limits), and pay for every one of those milliseconds. One lab experiment found default retry policies spiking Lambda costs by over 1000%. A thousand percent. That's not a typo. A service that should've cost ten dollars for the day cost more than a hundred and ten because of naive retry logic.

Bulkheads, Dead Letters, and Priority Queues

The bulkhead pattern is borrowed from shipbuilding. Compartmentalize. If one section floods, seal it off, don't let it sink the whole vessel. In software: if retries for one endpoint are spiraling, don't let them consume all your worker threads or all your API quota. Limit retries per dependency. Use separate connection pools. Fail one subsystem without cascading.
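A semaphore per dependency is the smallest bulkhead I know of. The sketch below is an illustration with made-up dependency names and limits. It rejects immediately when a compartment is full rather than queueing, which is one design choice among several.

```python
import threading
from contextlib import contextmanager

class BulkheadFullError(Exception):
    """Raised when a dependency already has its maximum calls in flight."""

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: shed the call instead of tying up a worker.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("dependency at capacity")
        try:
            yield
        finally:
            self._slots.release()

# Hypothetical limits: a flaky partner API gets a small compartment.
bulkheads = {"fraud-api": Bulkhead(2), "ledger-db": Bulkhead(10)}
```

Usage is `with bulkheads["fraud-api"].slot(): ...` around the outbound call. If retries against the fraud API pile up, they exhaust two slots, not the whole worker pool, and calls to the ledger keep flowing.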

Dead-letter queues (DLQs) are where messages go to die. After five delivery attempts, say, stop retrying. Drop the message into a DLQ. Alert someone. Investigate later. Don't let poison messages (malformed events, requests to nonexistent resources) bounce forever, burning cycles and money.
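Managed queues do this for you, but the logic is worth seeing once. A minimal in-process sketch follows. The function name, the tuple layout, and the in-memory `deque` are assumptions for illustration, not any broker's API.

```python
from collections import deque

def drain(queue, handler, dead_letters, max_deliveries=5):
    """Process (message, deliveries_so_far) pairs until the queue is empty.

    A message whose handler keeps failing is redelivered until it has been
    attempted `max_deliveries` times, then parked in `dead_letters`.
    """
    while queue:
        message, deliveries = queue.popleft()
        try:
            handler(message)
        except Exception as exc:
            deliveries += 1
            if deliveries >= max_deliveries:
                dead_letters.append((message, repr(exc)))  # keep the reason
            else:
                queue.append((message, deliveries))        # back of the line

def handler(message):
    if message == "poison":
        raise ValueError("malformed event")

queue = deque([("order-1", 0), ("poison", 0), ("order-2", 0)])
dead_letters = []
drain(queue, handler, dead_letters)
print(dead_letters)  # [('poison', "ValueError('malformed event')")]
```

The poison message costs exactly five handler invocations and then stops costing anything. Without the cap, the `while` loop never ends, which is the in-process version of the bill that never stops growing.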

Priority queues let you triage. High-value work goes to the front. Retries go to the back. If you're slammed, process new user orders before you retry a failed analytics job. This requires infrastructure: Kafka topics with different consumer groups, RabbitMQ priority queues, SQS with separate queues and weighted polling. But it prevents retries from starving real work.
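Stripped of the infrastructure, the triage rule is a heap with two priority levels. This is an in-process sketch with names I made up. The brokers listed above each have their own mechanism for the same idea.

```python
import heapq
import itertools

NEW_WORK, RETRY = 0, 1  # lower number is served first

class TriageQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per level

    def put(self, job, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

q = TriageQueue()
q.put("retry analytics job", RETRY)
q.put("new order A", NEW_WORK)
q.put("new order B", NEW_WORK)
print([q.get() for _ in range(len(q))])
# ['new order A', 'new order B', 'retry analytics job']
```

Strict priority has its own failure mode: under sustained load the retries never run at all. That may be exactly what you want during an incident, but it is worth pairing with the delivery cap and dead-letter queue described earlier so starved retries eventually go somewhere.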

The Human Element (Where Most of This Falls Apart)

Engineers inherit code. You pick up a microservice written in 2019 by someone who left the company. It uses an HTTP client from a tutorial. That client retries three times by default. Nobody documented it. Nobody changed it. It works fine until it doesn't.

Or: you deploy a new feature. It introduces a latency spike. Suddenly timeouts are more common. Retries kick in. The spike worsens. Feedback loop. You roll back the feature, but the retries are still in flight. The system takes twenty minutes to stabilize even though you reverted in two.

Or: cost anomalies don't trigger the same urgency as downtime. A spike in errors wakes the on-call engineer. A spike in spend waits until the monthly finance review. By then you've bled thousands of dollars to a retry loop nobody noticed because the service technically stayed up.

This is why you treat excessive retrying as a reliability bug, not a billing problem. Put it in your incident retrospectives. Did retries amplify the outage? Did they cause it? If your monitoring showed the downstream service recovered in thirty seconds but your application kept erroring for ten minutes because of circuit breaker tuning or retry queue backlog, that's a gap. Close it.

What to Do on Monday

Audit your dependencies. List every HTTP client, every message queue consumer, every RPC stub. What's their retry policy? Is there one? Is it bounded? Does it have jitter? If you don't know, dig into the code or the library defaults. I've found Spring RestTemplate configurations with infinite retries. I've found gRPC clients with no backoff. These are ticking bombs.

Implement circuit breakers. Start with Resilience4j or Polly (.NET) or your language's equivalent. Wrap external calls. Set thresholds conservatively at first — you can tighten later once you have data on actual failure rates. Don't cargo-cult Netflix's settings; your traffic patterns are different.

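Add idempotency tokens to anything that mutates state. Generate a UUID as early as possible (ideally in the calling client, otherwise at the API gateway or load balancer), thread it through your system, check it before executing writes. A key minted at the edge only deduplicates retries that happen behind the edge, so a client that retries needs to send its own. Yes, this is extra work. Yes, it's worth it the first time a network blip causes a double-debit and you have to explain to a user why their account is wrong.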

Monitor retry rates as a first-class metric. If your observability platform doesn't support it, add it. Tag requests, count attempts, alert when retries exceed some percentage of total traffic. Start at 20%, tighten to 10% once you've fixed the low-hanging fruit.

For serverless workloads, consider asynchronous invocation patterns. If you don't need synchronous responses, use SNS or SQS. Let AWS handle retries with built-in exponential backoff and concurrency limits. You can cap max retries at the queue level. Costs are more predictable because you're not spawning hundreds of Lambdas in parallel when something breaks.

And finally: put retry policies in your architecture docs. Not buried in code comments. In the actual runbooks and design docs. New hires should know that the payment service has a circuit breaker with a 15-second timeout and 50% error threshold. They should know that the recommendation engine retries up to three times with jittered backoff. Make it legible. Make it intentional.
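One way to keep the docs honest is to declare the policy as data and generate the runbook line from it. The sketch below is an assumption about how that might look, not a real library. The field names are mine, and the payment service's retry count is a placeholder because only its breaker settings are given above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int
    jittered_backoff: bool
    breaker_open_seconds: Optional[float] = None
    breaker_error_threshold: Optional[float] = None

    def describe(self, service):
        """Render one human-readable runbook line for `service`."""
        parts = [f"{service}: up to {self.max_retries} retries"]
        parts.append("jittered backoff" if self.jittered_backoff else "NO jitter")
        if self.breaker_open_seconds is not None:
            parts.append(
                f"circuit breaker opens at {self.breaker_error_threshold:.0%} "
                f"errors for {self.breaker_open_seconds:g}s"
            )
        else:
            parts.append("no circuit breaker")
        return ", ".join(parts)

POLICIES = {
    # max_retries=2 is a placeholder assumption for illustration.
    "payment-service": RetryPolicy(2, True, breaker_open_seconds=15,
                                   breaker_error_threshold=0.5),
    "recommendation-engine": RetryPolicy(3, True),
}

for name, policy in POLICIES.items():
    print(policy.describe(name))
```

If the client code is configured from the same `POLICIES` table, the document and the behavior cannot drift apart, which is most of what "intentional" means here.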

The forty-seven-thousand-dollar invoice? They eventually got a partial credit from AWS. Goodwill gesture. The real fix was adding jitter, capping retries at three, and implementing a circuit breaker around the Lambda invocation that triggered the storm. Took two engineers four days. The alternative was just accepting that sometimes the bill explodes for no reason.

Nobody chose that alternative. But a lot of systems are running on defaults that might.

Opinions expressed by DZone contributors are their own.
