DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Engineering Closed-Loop Graph-RAG Systems, Part 3: Closing the Loop in Graph-RAG Systems
  • Engineering Closed-Loop Graph-RAG Systems, Part 1: From Retrieval to Reasoning
  • Engineering Agentic AI for Production: A Distributed Systems Perspective
  • Why Queues Don’t Fix Scaling Problems

Trending

  • 5 Failure Patterns That Break AI Chatbots in Production
  • Why Round-Robin Won't Save You: Load Balancing Challenges in Data Streaming Services With Heterogeneous Traffic
  • How to Build an Agentic AI SRE Co-Pilot for Incident Response
  • Amazon Quick: AWS's Agentic Workspace, Explained for Engineers
  1. DZone
  2. Software Design and Architecture
  3. Microservices
  4. Designing Resilient Systems: What Every Engineering Manager Should Know

Designing Resilient Systems: What Every Engineering Manager Should Know

In this article, you'll learn how to build resilient systems through well-configured retries, circuit breakers, and bulkheads.

By 
Shushyam Malige Sharanappa user avatar
Shushyam Malige Sharanappa
·
Aug. 11, 25 · Analysis
Likes (3)
Comment
Save
Tweet
Share
2.1K Views

Join the DZone community and get the full member experience.

Join For Free

Introduction: Why Resiliency Matters

If you've ever launched a major feature, only to watch it spectacularly break under pressure, you know exactly why resiliency matters. As a Software Development Manager (SDM) at Amazon, I've been on the front lines of creating systems that handle real-world chaos gracefully.

Let's talk resiliency — focusing on critical architectural patterns in plain language: Retries, Circuit Breakers, and Bulkheads.

Resiliency Patterns Explained

1. Retry Logic: Giving Your Code a Second Chance

Think of retry logic as attempting to call your friend again if they don't answer the first time. It’s a straightforward, powerful way to handle temporary failures, like transient network hiccups or brief outages.

Key Points

  • Exponential backoff: Increase wait times between retries to avoid overloading services.
  • Limit retries: Cap your retry attempts (usually 3–5 attempts).
  • Idempotency: Ensure operations can safely retry without side effects.

Illustrative Java Example:

(Disclaimer: This simplified snippet illustrates a general concept. Real-world implementations are significantly more intricate.)

Java
 
public DeliveryPromise fetchDeliveryPromiseWithRetry(Cart cart, Address address) {
    int maxRetries = 3;
    int waitTimeMs = 1000;
    
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            DeliveryPromise promise = deliveryService.getPromise(cart, address);
            if (promise.isValid()) {
                return promise;
            }
            Thread.sleep(waitTimeMs);
            waitTimeMs *= 2; // Exponential backoff
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break; 
        }
    }
    // Graceful fallback
    return DeliveryPromise.unavailable("Cannot deliver to selected address.");
}
-


2. Circuit Breakers: Knowing When to Stop Trying

Circuit breakers prevent systems from repeatedly trying — and failing — to call unresponsive services, much like the electrical breakers in your home that protect against overloads.

How they work:

  • Monitor error rates.
  • Temporarily halt calls when error thresholds are crossed.
  • Gradually resume traffic after the system recovers.

Circuit Breaker illustration

Circuit Breaker illustration

(Source: Martin Fowler)


3. Bulkheads: Containing the Damage

Bulkheads compartmentalize critical system parts. If one service crashes, bulkheads limit the impact, keeping your whole system afloat.

Best Practices

  • Separate critical from non-critical paths.
  • Provide dedicated resources and isolate failures.

Bulkhead illustration

Bulkhead illustration

Real-Life Story: A Minor Bug With Major Impact

Let me illustrate this with an actual scenario I encountered at Amazon. (Note: The following is an intentionally simplified and generalized example. Our actual scenario involved complex interactions across multiple services and teams.)

The Problem

An obscure, deprecated code path resurfaced after a routine platform migration. Customers browsing with a default delivery address would occasionally hit a rare but disastrous scenario:

  • The customer selects products based on promises calculated for their default address.
  • At checkout, they switch to a different delivery address.
  • The delivery promise is no longer valid for the newly selected address.
  • Instead of gracefully handling this edge case, our system spiraled into unnecessary retries, ending with a cryptic error message.

Imagine you're shopping for your friend's birthday gift. You trust the promise, proceed confidently, but at the last moment, you hit a confusing, looping error. Even if just 1% of shoppers experience this, the cumulative impact — especially on a scale of thousands of daily orders — is enormous. This seemingly small oversight was costing potential sales and damaging customer trust.

Technical Culprit: Misconfigured Retry Logic (Illustration)

Faulty Java Code Example (Highly Simplified):

Java
 
int retries = 5;
while (retries-- > 0) {
    DeliveryPromise promise = legacyService.getPromise(cart, address);
    if (promise.isValid()) return promise;
    // No backoff, no break, repetitive calls
}
// Cryptic error message after long delay
throw new DeliveryException("Delivery unavailable");


This loop would repeatedly attempt the same call without meaningful delays or graceful exits, causing prolonged confusion.

Rapid Resolution: Graceful Fallbacks and Circuit Breakers

We quickly fixed the issue by introducing proper retry logic and graceful fallbacks, supported by circuit breakers:

Updated Java Example (Simplified for Illustration):

Java
 
public DeliveryPromise safeFetchPromise(Cart cart, Address address) {
    int retries = 3;
    int waitMs = 500;

    while (retries-- > 0) {
        DeliveryPromise promise = deliveryService.getPromise(cart, address);
        if (promise.isValid()) return promise;
        
        try {
            Thread.sleep(waitMs);
            waitMs *= 2; // Exponential backoff
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    // Immediate and clear user feedback
    return DeliveryPromise.unavailable("Sorry, this item can't be delivered to the selected address.");
}


Circuit Breaker Integration (Conceptual Example):

Circuit breakers helped detect failing downstream services instantly and reduced wasted retries:

Java
 
if (circuitBreaker.isOpen()) {
    return DeliveryPromise.unavailable("Delivery estimates temporarily unavailable.");
}


Core Architectural Patterns Summarized:

  • Retries: A simple way to handle temporary failures; use with caution and proper configuration.
  • Circuit breakers: Fail-fast mechanism preventing cascading failures.
  • Bulkheads: Compartmentalize your architecture to limit failure scope.

Key Takeaways for Every SDM

  • Expect failure: It's inevitable; build resilience in.
  • Judiciously implement retries: Useful, but avoid overwhelming your systems.
  • Implement circuit breakers early: Essential in distributed systems to maintain stability.
  • Design with bulkheads: Prevent total system failure by isolating critical services.
  • Always prioritize user experience: Clearly inform users about system issues and failures.

Conclusion: Resiliency Is More Than Just Good Design

Building resilient systems means proactively protecting your customers' experience and your team's sanity. Resiliency isn’t just technical — it's empathetic leadership, critical foresight, and practical risk management.

Disclaimer on Illustrative Examples

All code examples provided here are significantly simplified for illustrative purposes. The actual scenario encountered at Amazon involved deeply intertwined services across multiple code packages and teams. I cannot disclose the exact implementation or proprietary details, but the lessons presented here are generic enough that many will recognize and relate to these patterns and solutions.

References and Further Reading

  • Martin Fowler: Circuit Breaker
  • Resiliency Patterns in Microservices

Have you faced similar resilience challenges? I'd love to hear your stories and solutions.

systems Circuit Breaker Pattern Engineering

Opinions expressed by DZone contributors are their own.

Related

  • Engineering Closed-Loop Graph-RAG Systems, Part 3: Closing the Loop in Graph-RAG Systems
  • Engineering Closed-Loop Graph-RAG Systems, Part 1: From Retrieval to Reasoning
  • Engineering Agentic AI for Production: A Distributed Systems Perspective
  • Why Queues Don’t Fix Scaling Problems

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook