Overcoming the Retry Dilemma in Distributed Systems
Retries help increase service availability. However, if not done right, they can have a devastating impact on the service and prolong recovery time.
“Insanity is doing the same thing over and over again, but expecting different results” - Source unknown
As the quote above suggests, humans have a tendency to retry things even when the result is not going to change, and we carry that bias into the systems we design. If you look closely, failures fall into two broad categories:
- Failures where a retry makes sense
- Failures where a retry does not make sense
In the first category are transient failures, such as network glitches or intermittent service overloads, for which retrying makes sense. In the second category, the failure stems from something a retry will not fix: the request itself is malformed, requests are being throttled (429s), or the service is load-shedding (5xx). Retrying these does not make much sense.
While both of these categories need special attention and appropriate handling, in this article, we will primarily focus on category 1, where it makes sense to retry.
The Case for Retries
In the modern world, a typical customer-facing service is composed of many microservices. A customer query goes through a complex call graph and typically touches many services. On a good day, every hop in that graph succeeds, and the customer sees the illusion of a single successful call; a failure in any one hop is enough to break it.
For example, a service that depends on 5 services, each with 99.99% availability, can achieve at most about 99.95% availability for its own callers. The key point is that even though each individual dependency has excellent availability, the cumulative effect of depending on 5 of them lowers the overall availability of the main service, because the probability of all 5 dependency calls succeeding on a single request is the product of their individual probabilities.
Overall Availability =
(Individual Availability) ^ (# of Dependencies)
≈ 1 - (# of Dependencies) × (1 - Individual Availability)
Using the same formula, we can see that to maintain 99.99% availability at the main service without any retries, every dependency would need an availability higher than 99.998%. So this raises the question: how do you achieve 99.99% availability at the main service? The answer is retries!
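As a quick sanity check, here is a small Python sketch of the arithmetic above, using the same numbers (5 dependencies at 99.99% each) and assuming failures are independent:

```python
# Availability of a service that calls N dependencies in series,
# assuming independent failures and no retries.
N = 5
individual = 0.9999

overall = individual ** N
print(f"Overall availability with {N} dependencies: {overall:.4%}")  # ~99.95%

# Per-dependency availability needed to hit a 99.99% overall target.
target = 0.9999
required = target ** (1 / N)
print(f"Required individual availability: {required:.5%}")  # ~99.998%
```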
The Retry Sweet Spot
We saw above that the maximum availability you can achieve with those 5 dependencies and no retries is about 99.95%. To model a 99.99% target for the main service, we extend the formula to account for retries. If a failed dependency call is retried up to a fixed number of times, and the failures are independent, the effective availability of each dependency becomes 1 - (1 - Individual Availability) ^ (# of Retries + 1), so the formula becomes:
Overall Availability =
[1 - (1 - Individual Availability) ^ (# of Retries + 1)] ^ (# of Dependencies)
Plugging in the numbers: with a single retry, each dependency's effective availability is 1 - (0.0001) ^ 2 = 99.999999%, and the overall availability works out to roughly 99.99999%. In other words, even one retry per failed dependency call comfortably clears the 99.99% bar, and a second retry adds further headroom on top of that.
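The same calculation as a short Python sketch (again assuming independent failures, which is the best case for retries):

```python
# Overall availability of a service with `deps` dependencies, where each
# failed dependency call is retried up to `retries` times and failures
# between attempts are independent.
def overall_availability(individual: float, deps: int, retries: int) -> float:
    per_dependency = 1 - (1 - individual) ** (retries + 1)
    return per_dependency ** deps

for r in range(3):
    print(f"{r} retries -> {overall_availability(0.9999, 5, r):.6%}")
# 0 retries -> ~99.9500%
# 1 retry   -> ~99.99999%
# 2 retries -> effectively 100% to six decimal places
```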
This demonstrates how retries can be an effective strategy to overcome the cumulative availability loss that comes from relying on multiple dependencies and to achieve the desired overall service-level objectives. This is great and fairly obvious, so why this article?
Retry Storm and Call Escalation
While these retries help, they also bring trouble with them. The real problem shows up when one of the services you depend on is already in trouble or having a bad day. Retrying against a service that is already down is like kicking it while it is down, and it can delay that service's recovery.
Now think of a scenario where the call graph is multiple levels deep; for example, the main service depends on 5 sub-services, each of which depends on another 5 services. A single retry at the top now fans out into 5 * 5 = 25 extra calls. If those services in turn depend on 5 more services, one retry at the top turns into 5 * 5 * 5 extra calls, and so on. Worse, if every level of an N-level-deep call graph adds even one retry of its own, the call volume at the lowest level is multiplied by 2 ^ N. These retries do not speed up recovery; the extra load can instead take down a service that was still partially serving traffic. As those calls fail, they trigger yet more retries, a retry storm builds, and the result is a long-lasting outage that would have resolved much faster had there been no retries at all.
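To make the amplification concrete, here is a small sketch of how bottom-level call volume grows with depth when every level retries once, using the fan-out of 5 from the example above:

```python
# Call volume reaching the bottom of a call graph of a given depth,
# with `fanout` downstream services per level and `retries` extra
# attempts per failed call at every level.
def calls_at_bottom(depth: int, fanout: int, retries: int) -> int:
    return (fanout * (1 + retries)) ** depth

for depth in (1, 2, 3):
    baseline = calls_at_bottom(depth, fanout=5, retries=0)
    storming = calls_at_bottom(depth, fanout=5, retries=1)
    print(f"depth={depth}: {baseline} calls -> {storming} calls "
          f"({storming // baseline}x amplification)")
# depth=1: 5 -> 10 (2x), depth=2: 25 -> 100 (4x), depth=3: 125 -> 1000 (8x)
```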
This brings us to the meat of the article: we like retries, but they are also dangerous if not done right. So, how can we do them right?
Strategies for Limiting Retries
To benefit from retries while keeping retry storms and cascading failures at bay, we need a way to stop excessive retries. As highlighted above, even a single retry per level can do enormous damage when the call graph is deep (2 ^ N amplification). We need to limit retries in aggregate. Here are some practical ways to do that:
Bounded Retries
The idea behind bounded retries is to put an upper bound on how many retries you are allowed to make. The bound can be time-based; for example, you may make at most 10 retries per minute. It can also be based on the success rate: for every 1,000 successful calls, the client earns a single retry credit, and credits accumulate only up to a fixed cap. A sketch of such a retry budget follows.
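Here is a minimal sketch of a success-rate-based retry budget, assuming one retry credit per 1,000 successes and a cap of 10 stored credits (both numbers are illustrative, not recommendations):

```python
import threading

class RetryBudget:
    """Retry budget sketch: successful calls earn retry credits, and a
    retry is allowed only while credits remain."""

    def __init__(self, credit_per_success: float = 0.001, max_credits: float = 10.0):
        self._credit_per_success = credit_per_success  # 1 credit per 1,000 successes
        self._max_credits = max_credits                # upper bound on stored credits
        self._credits = max_credits
        self._lock = threading.Lock()

    def record_success(self) -> None:
        with self._lock:
            self._credits = min(self._max_credits,
                                self._credits + self._credit_per_success)

    def try_acquire_retry(self) -> bool:
        """Spend one credit and return True if a retry is allowed right now."""
        with self._lock:
            if self._credits >= 1.0:
                self._credits -= 1.0
                return True
            return False
```

A client calls record_success() after every successful request and checks try_acquire_retry() before retrying a failed one; once the budget is exhausted, failures are surfaced to the caller instead of being retried.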
Circuit Breaking
The philosophy of circuit breaking is simple: don't hammer what's already down. With the circuit breaker pattern, once errors cross a threshold you open the circuit, stop making calls to the downstream service, and give it breathing room. To check for recovery, a background probe makes a single call to the service periodically; once the probe succeeds, you gradually ramp traffic back up and return to normal operation. This gives the receiving service the much-needed time to recover. The topic is covered in much more detail in Martin Fowler's article, and a minimal sketch follows.
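A minimal, illustrative circuit breaker might look like the following; the failure threshold and cooldown are made-up numbers, and a production implementation would also limit in-flight probes and track error rates rather than raw counts:

```python
import time

class CircuitBreaker:
    """Circuit breaker sketch: closed -> open after repeated failures,
    half-open after a cooldown, closed again after a successful probe."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._failures = 0
        self._state = "closed"
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self._state == "open":
            if time.monotonic() - self._opened_at >= self._reset_timeout_s:
                self._state = "half_open"  # cooldown elapsed; allow probes through
                return True
            return False                   # still open; fail fast
        return True                        # closed or half_open

    def record_success(self) -> None:
        self._failures = 0
        self._state = "closed"

    def record_failure(self) -> None:
        self._failures += 1
        if self._state == "half_open" or self._failures >= self._failure_threshold:
            self._state = "open"
            self._opened_at = time.monotonic()
```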
Retry Strategies
There are techniques from TCP congestion control, such as AIMD (Additive Increase, Multiplicative Decrease), that can be employed here. AIMD slowly increases the traffic you send to a service (think of it as allowed concurrency + 1 on success), but immediately cuts the traffic when you hit an error (think of it as allowed concurrency / 2). You keep doing that until you settle into an equilibrium around what the service can actually handle; a sketch follows.
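A small sketch of an AIMD-style concurrency limit, with illustrative initial and bounding values:

```python
class AimdLimit:
    """AIMD concurrency limit sketch: grow the allowed number of in-flight
    requests additively on success, cut it multiplicatively on failure."""

    def __init__(self, initial: int = 10, minimum: int = 1, maximum: int = 1000):
        self._limit = initial
        self._min = minimum
        self._max = maximum

    @property
    def limit(self) -> int:
        return self._limit

    def on_success(self) -> None:
        self._limit = min(self._max, self._limit + 1)   # additive increase

    def on_failure(self) -> None:
        self._limit = max(self._min, self._limit // 2)  # multiplicative decrease
```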
Exponential backoff is another technique: when you hit an error, you back off for a period of time, and you increase that backoff on each subsequent failure up to some maximum. The progression is typically 1, 2, 4, 8, ... (2 ^ retryCount), but it could also be Fibonacci-based (1, 2, 3, 5, 8, ...). There is another gotcha while retrying: add jitter, so that many clients do not back off and retry in lockstep. Marc Brooker has a great article that goes into much more depth on exponential backoff and retries; a sketch of backoff with jitter follows.
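Here is a short sketch of exponential backoff with "full jitter", where the actual sleep is drawn uniformly between zero and the exponentially growing cap (the base and cap values are illustrative):

```python
import random

def backoff_with_jitter(retry_count: int,
                        base_s: float = 1.0,
                        cap_s: float = 30.0) -> float:
    """Return the delay before the next retry: a random value between 0
    and min(cap, base * 2^retry_count)."""
    ceiling = min(cap_s, base_s * (2 ** retry_count))
    return random.uniform(0.0, ceiling)

# Example: delays chosen before the 1st, 2nd, and 3rd retry.
for attempt in range(3):
    print(f"retry {attempt + 1}: sleeping {backoff_with_jitter(attempt):.2f}s")
```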
While these techniques cover client-side protection, we can also employ guardrails on the server side. Some of the things to consider are:
- Explicit backpressure contract: When under load, reject the caller's request and include metadata saying that you are failing because of overload, so the caller can propagate that signal to its own upstream callers instead of retrying.
- Avoid wasted work: When you expect the service to be under load, avoid doing work nobody will consume. If you can tell that the caller has already timed out and no longer needs an answer, drop the work altogether.
- Load shedding: When under stress, shed load aggressively. Invest in mechanisms that can identify duplicate requests and discard one of them, and use signals like CPU and memory utilization to complement the load-shedding decision. (A small sketch of the last two ideas follows this list.)
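A minimal sketch of the deadline check and utilization-based shedding described above, assuming the caller propagates a request deadline and the server derives a rough CPU-pressure signal from the load average (the threshold is illustrative):

```python
import os
import time

def cpu_pressure() -> float:
    """Rough CPU pressure signal: 1-minute load average divided by core
    count. A real service would use a proper metrics pipeline instead."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)

def should_process(request_deadline_s: float, shed_threshold: float = 0.9) -> bool:
    """Decide whether to do the work at all."""
    if time.time() >= request_deadline_s:
        return False  # caller has already timed out; the answer would be wasted work
    if cpu_pressure() >= shed_threshold:
        return False  # overloaded; reject early and signal backpressure to the caller
    return True
```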
Conclusion
In distributed systems, transient failures in remote interactions are unavoidable. To build highly available systems, we rely on retries. Done carelessly, however, retries can lead to big outages because of call escalation. Clients can adopt the approaches above to avoid overloading a system that is already failing, and services should employ their own techniques to protect themselves in case a client goes rogue.
Disclaimer
The guidance provided here offers principles and practices that could broadly improve the reliability of services in many conditions. However, I would advise that you do not view this as a one-size-fits-all mandate. Instead, I suggest that you and your team evaluate these recommendations through the lens of your specific needs and circumstances.