How to Make Services Resilient in a Microservices Environment

Learn about techniques and open-source tools to ensure resiliency and performance in your microservices applications.

Microservices architecture breaks a problem down into smaller pieces, which makes applications easier to maintain and test. Decoupling the components also lets software providers develop faster and shorten time to market, which in turn leads to better revenue. To get these benefits, many large-scale websites and applications have evolved from a monolithic to a microservices architecture.

Even though a microservices architecture isolates failures through defined boundaries, network, hardware, database, or application issues are still likely, and any of them can make a component temporarily unavailable. To avoid or minimize this kind of outage, we have to build in fault tolerance mechanisms. Nowadays, this is easier with the help of Spring Boot components and libraries such as Hystrix and Resilience4j.


Retry

Retry provides the ability to re-invoke failed operations, which is very helpful when errors are transient in nature. It retries the failed operation a configured number of times and then proceeds to the fallback (recovery) to return data from the cache or a default value. For example, in a service-to-service call, service B may not respond because of high load at that moment but become available again after a few seconds. In this case, a retry helps in getting the result.

Retrying a service continuously without any interval can itself start a cascading effect. To reduce that risk, use an exponential backoff that keeps increasing the delay between retries until a maximum limit is reached. It is also very important that retried operations are idempotent, because the client may end up sending the same request more than once.

There should also be a recovery mechanism that does one of the following: returns a response from the cache, returns a standard error that tells users to try again after some time, returns a default value, or sends the request to a queue broker so that the services listening to it can process it later. The recovery method is called after the n configured retries have all thrown an exception.
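The retry, exponential backoff, and recovery steps described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `RetryWithBackoff`, its `call` method, and all parameter names are hypothetical and are not part of Spring Retry or Resilience4j.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not a Spring Retry or Resilience4j API.
final class RetryWithBackoff {

    /**
     * Calls the operation up to maxAttempts times. The delay doubles after
     * each failure, capped at maxDelayMs. If every attempt fails (or the
     * thread is interrupted while waiting), the fallback supplies the result.
     */
    static <T> T call(Callable<T> operation, int maxAttempts,
                      long initialDelayMs, long maxDelayMs,
                      Supplier<T> fallback) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    break; // retries exhausted, go to recovery
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying when interrupted
                }
                delay = Math.min(delay * 2, maxDelayMs); // exponential backoff
            }
        }
        return fallback.get();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service B is busy");
            }
            return "response from service B";
        }, 5, 10, 100, () -> "cached value");
        // prints "response from service B after 3 calls"
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The fallback here returns a cached or default value; a queue-based recovery would replace the `Supplier` with code that publishes the request to a broker.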

We can use either the Spring Retry module or the Resilience4j retry component.


Using the Spring Retry module:

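1. Add the spring-retry dependency (groupId org.springframework.retry, artifactId spring-retry) to the pom.xml file. Spring Retry's annotations also need Spring AOP on the classpath, for example through spring-boot-starter-aop.

2. Add the annotation @EnableRetry to a configuration class.

3. Sample code: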

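import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;

/**
 * Try the method up to 9 times with a 2-second delay between attempts.
 * @return the name
 * @throws Exception the exception
 */
@Retryable(maxAttempts = 9, value = Exception.class, backoff = @Backoff(delay = 2000))
public String springReTryTest() throws Exception {
    throw new Exception();
}

/**
 * This method is called to recover once all attempts have failed.
 * @param e the exception thrown by the last attempt
 * @return the fallback value
 */
@Recover
public String recover(Exception e) {
    return "Test";
}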


In the above example, the method is attempted at most 9 times, with a fixed backoff delay of 2000 ms between attempts, and it is retried on any Exception. Setting the multiplier attribute on @Backoff would turn the fixed delay into an exponential one. The recovery method is called if there is still no success after the last attempt. We can also specify on which exceptions a retry should happen.

The Resilience4j library also supports this feature.

Circuit Breakers

We normally use timeouts to limit the duration of operations, which prevents an operation from hanging. But in a dynamic environment it is hard to predict a single timeout value that suits every operation, so relying on fixed timeouts alone is often considered an anti-pattern.

Circuit breakers were introduced to deal with the above problem and are very helpful in distributed systems, where repeated failures can bring the whole system down. Let us assume that the circuit is closed and service-to-service calls succeed. If a service keeps throwing errors of a particular type over a short period, the circuit breaker opens the circuit so that no further calls are sent to that service until it becomes stable again. One more important thing to remember is that not all errors trigger the circuit breaker.

There is also a half-open state, in which a single trial request is let through to check the service's status: if it succeeds, the circuit is closed again; otherwise, it stays open.
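The closed, open, and half-open states can be sketched as a small state machine. This is a minimal, single-threaded illustration under stated assumptions: the class `SimpleCircuitBreaker`, its parameters, and the injected clock are hypothetical, and this is not how Spring Retry or Resilience4j implement their circuit breakers.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, single-threaded sketch; not a library implementation.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutMs;
    private final LongSupplier clock; // injected so time can be controlled
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long resetTimeoutMs, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMs = resetTimeoutMs;
        this.clock = clock;
    }

    State state() {
        return state;
    }

    <T> T call(Supplier<T> operation, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < resetTimeoutMs) {
                return fallback.get(); // fail fast without calling the service
            }
            state = State.HALF_OPEN; // let one trial request through
        }
        try {
            T result = operation.get();
            state = State.CLOSED; // success closes the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            // A failed trial, or too many failures in a row, opens the circuit.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`, and the state fields would need synchronization because many threads call the breaker at once.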


Using the Spring Retry module:

/**
 * The circuit breaker opens the circuit after 2 failures within openTimeout
 * and closes (resets) it again after resetTimeout.
 */
@CircuitBreaker(include = {}, openTimeout = 10_000, resetTimeout = 20_000, maxAttempts = 2)
public String doProcess(String value) {
    // If the value is "FAIL", throw an error; the recover method handles it.
    if (value.contentEquals("FAIL")) {
        throw new RestClientException("");
    }
    // The real value is processed.
    System.out.println("real service called");
    return value;
}

@Recover
public String recover(Exception e) {
    return "Test";
}


In the above example, the @CircuitBreaker parameter "include" lists the exceptions that are retryable, and "exclude" lists the exceptions that are not. "maxAttempts" defaults to 3, and we can change that value. "openTimeout" (default 5000 ms) is the window within which maxAttempts failures cause a closed circuit to open, and "resetTimeout" (default 20000 ms) is the time before an open circuit is reset.

The Resilience4j library also supports this feature.

Rate Limiters

Rate limiting is the technique of defining how many requests an application can receive or process, overall or per customer, during a given time interval. With the help of a rate limiter, we can track the number of requests made by each customer and protect the service from overload by holding back requests until the application load balancer steps in and another instance is scaled up. We can also hold low-priority tasks to leave enough resources for high-priority tasks.

A rate limiter is very helpful when one of your users has a misbehaving script that accidentally sends you a lot of requests, or when a user is intentionally trying to overwhelm your servers.

Spring Boot doesn't have a rate limiter module, so we have to provide a custom implementation, for example with the help of interceptors.

The Resilience4j library supports this feature.
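One possible shape for such a custom implementation is a fixed-window counter per client, sketched below in plain Java. The class `FixedWindowRateLimiter`, the client id key, and the injected clock are all hypothetical names chosen for illustration; a real limiter might instead use a sliding window or token bucket.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical fixed-window limiter for illustration; not a Spring or Resilience4j API.
final class FixedWindowRateLimiter {
    private final int maxRequests;
    private final long windowMs;
    private final LongSupplier clock; // injected so time can be controlled
    // clientId -> {window start time, requests counted in this window}
    private final Map<String, long[]> windows = new HashMap<>();

    FixedWindowRateLimiter(int maxRequests, long windowMs, LongSupplier clock) {
        this.maxRequests = maxRequests;
        this.windowMs = windowMs;
        this.clock = clock;
    }

    /** Returns true if the client may proceed, false if it is over its limit. */
    synchronized boolean tryAcquire(String clientId) {
        long now = clock.getAsLong();
        long[] window = windows.computeIfAbsent(clientId, id -> new long[] {now, 0});
        if (now - window[0] >= windowMs) {
            window[0] = now; // start a new window
            window[1] = 0;
        }
        if (window[1] >= maxRequests) {
            return false;
        }
        window[1]++;
        return true;
    }
}
```

In a Spring MVC application, an interceptor's `preHandle` method could call `tryAcquire` with a client identifier taken from the request and answer with HTTP 429 (Too Many Requests) when it returns false; that wiring is an assumption and is not shown here.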


Bulkheads

Bulkheads keep a fault in one part of a system from taking the entire system down by limiting the number of concurrent calls to a component. They are mainly used to segregate resources.

Say there is an application with two components. If requests to component 1 start hanging and both components share the same threads, all threads can end up hanging. To avoid this, a bulkhead gives each component its own allocation of threads, so only the threads allocated to component 1 hang; the others remain available to process component 2's requests.

The Resilience4j library and Hystrix support this feature.
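The idea of capping concurrent calls per component can be sketched with a `java.util.concurrent.Semaphore`. Note that this is a semaphore-based variant rather than the separate thread allocations described above, and the class `SemaphoreBulkhead` and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore-based bulkhead; not the Hystrix or Resilience4j implementation.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the operation if a permit is free; otherwise returns the fallback at once. */
    <T> T call(Supplier<T> operation, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // compartment is full: reject instead of queueing
        }
        try {
            return operation.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving component 1 and component 2 their own `SemaphoreBulkhead` instances means that hanging calls to component 1 can exhaust only component 1's permits.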

Correlation Id and Log Aggregator

These are most helpful when debugging a microservices architecture. Say service A communicates with service B: how can we identify the flow of a request from A to B in the logs? A correlation id helps trace a request as it flows through different microservices.

Spring doesn't provide an implementation for correlation id. We can provide our own implementation to generate the correlation id and pass it to downstream services without being invasive into the code.

Create a class that implements a filter, so that it intercepts all requests, and check whether the request header contains a correlation id. If it does not, generate one; in either case, store the id in a thread local. For service-to-service calls, pass the correlation id as part of the header. Don't forget to unset the thread local when the request completes.

A log aggregator collects the logs from different microservices, which makes them searchable together. Logstash and Kibana are open-source tools for this: Logstash aggregates logs into one location, and Kibana lets us search them.

We can use the Chaos Monkey resiliency tool to test failure scenarios.
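The filter logic above can be sketched without any servlet dependency by modelling request headers as a plain map. Everything named here is an assumption for illustration: the class `CorrelationContext`, the header name `X-Correlation-Id`, and the `handle` method that stands in for a real filter's `doFilter`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; a real filter would implement the servlet Filter interface.
final class CorrelationContext {
    static final String HEADER = "X-Correlation-Id"; // assumed header name
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Binds the incoming id (or a newly generated one) to the thread while the handler runs. */
    static void handle(Map<String, String> requestHeaders, Runnable handler) {
        String id = requestHeaders.get(HEADER);
        if (id == null || id.isEmpty()) {
            id = UUID.randomUUID().toString();
        }
        CURRENT.set(id);
        try {
            handler.run();
        } finally {
            CURRENT.remove(); // unset: pooled threads are reused for other requests
        }
    }

    /** The id bound to the current thread, or null outside a request. */
    static String currentId() {
        return CURRENT.get();
    }

    /** Headers to attach to a downstream service-to-service call. */
    static Map<String, String> outgoingHeaders() {
        Map<String, String> headers = new HashMap<>();
        String id = CURRENT.get();
        if (id != null) {
            headers.put(HEADER, id);
        }
        return headers;
    }
}
```

Logging code can then include `currentId()` in every log line, so that a log aggregator can group the entries belonging to one request across services.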
