Building a Fault-Tolerant Microservices Architecture With Kubernetes, gRPC, and Circuit Breakers
Learn how to build resilient microservices with Kubernetes, gRPC, and the Circuit Breaker pattern to prevent cascading failures and improve reliability.
Over the last decade, microservice architectures have become commonplace for building scalable, maintainable, and independently deployable applications. By breaking a system down into multiple domain-focused services, development teams can ship quickly, choose a different technology stack per service, and scale an application's constituent pieces independently.
But this flexibility has its cost: operational complexity and failure propagation. Unlike a monolith, whose failures may be localized to one runtime, microservices communicate over networks. Each service-to-service invocation introduces the possibility of latency, partial failure, or total unavailability. When this happens in a critical dependency, it can cause cascading failures, where one service's downtime propagates through the system, degrading the user experience or even causing complete outages.
To build genuinely resilient distributed systems, engineers must adopt patterns and infrastructure decisions that not only recover from failures but also prevent them from spreading. In this article, we examine a widely used, battle-tested combination of tools and patterns: Kubernetes, gRPC, and the Circuit Breaker pattern.
Through a working Go-based implementation, we'll design an architecture with three services:
- Order service – Handles incoming orders from clients.
- Payment service – Processes payment requests.
- Inventory service – Manages stock levels (not fully implemented in the example).
The order service calls the payment and inventory services via gRPC. A circuit breaker wraps calls to the payment service to prevent overload when it's struggling.
We will also be deploying the system to Kubernetes with YAML manifests, simulating dependency failures, and observing how the circuit breaker safeguards the system’s stability.
With this article, you’ll not only understand the theory behind circuit breakers and fault-tolerant communication but also have practical, production-ready code that you can adapt for your own microservices environment.
The Problem: Cascading Failures
In a microservice architecture, a typical service is usually one link in a longer transactional chain. These chains are rarely linear; fan-out calls, dependency graphs, and multi-hop requests that span multiple data centers are the norm. Although this decentralized design lets each service be deployed and scaled independently, it also couples services at runtime through network calls. As a result, if one service degrades through poor performance or downtime, the effect can radiate outward and take down an otherwise healthy system entirely. The failure typically unfolds like this:
- Service A depends on B.
- B slows down or fails.
- A keeps retrying or waiting, tying up resources.
- Traffic queues build, and soon A fails too.
In an order-processing workflow, this plays out as follows. Order Service receives an order request and calls Payment Service to complete the transaction. Payment Service, in turn, calls an external bank API or gateway. Once the payment is confirmed, Payment Service responds to Order Service, which then calls Inventory Service to reserve stock. If Payment Service is slow or unresponsive, Order Service's gRPC calls hang until they time out. The waiting calls accumulate, consuming memory and CPU. Kubernetes may scale out Order Service in response, but the newly started replicas block on the same unresponsive Payment Service, adding load without adding capacity. Clients then start receiving timeouts or HTTP 500 errors, and any service waiting on Order Service stalls or fails. What started as a localized issue in a single service quickly escalates into system-wide degradation.
This is where the circuit breaker pattern comes in. Like an automatic gate, a circuit breaker tracks the recent rate of failures and successes for calls to a dependency. Once failures get high enough, the breaker "opens" and immediately rejects further calls to the flaky service. The upstream service then fails fast instead of hanging, and it stops contributing to a bigger outage. After a cooldown, the breaker moves to a half-open state to probe whether the dependency has recovered and, if so, returns to the closed state and normal operation. This isolates failures gracefully before they turn into total outages.
Solution Overview
We combine three powerful tools:
- Kubernetes – Automates deployment, scaling, and recovery of services.
- gRPC – Offers high-performance, strongly typed, low-latency communication.
- Circuit breaker pattern – Detects repeated failures in a dependency and stops calling it until it recovers.
Kubernetes forms the foundation. Its orchestration lets us run containerized services from deployment manifests and use built-in health checks, rolling updates, and automatic scaling. For service-to-service communication, this solution uses gRPC, an open-source, high-performance remote procedure call (RPC) framework that uses Protocol Buffers (Protobuf) as its interface definition language. gRPC provides strong typing, code generation for multiple programming languages, and a compact binary protocol that saves bandwidth and improves performance.
To prevent cascading failures, a circuit breaker is applied at the application level, on calls from Order Service to Payment Service. The circuit breaker acts as a sentinel that tracks how often recent calls succeed and how often they fail. If failures reach a threshold within a specified time window, the breaker opens and rejects further requests to the failing dependency entirely.
Then, after a timeout period, the breaker enters a "half-open" state in which a limited number of trial requests are allowed through to determine whether the dependency has recovered. Only if they succeed does the breaker close again and normal traffic resume; if they fail, the breaker reopens.
System Architecture
Our example system has three microservices:
- Order Service – This service will handle incoming orders from clients.
- Payment Service – This service will process payment requests.
- Inventory Service – Manages stock availability (which is not fully implemented in the example).
The Order Service calls the Payment and Inventory services via gRPC. A circuit breaker wraps the Payment Service calls to prevent overload when it’s failing. Now let’s implement the system.
Defining gRPC Service Contracts
We can begin by creating .proto files for each service API:
order.proto
syntax = "proto3";

package order;

option go_package = "orderpb";

service OrderService {
  rpc PlaceOrder (OrderRequest) returns (OrderResponse) {}
}

message OrderRequest {
  string _order_product_id = 1;
  int32 _order_quantity = 2;
  string _order_payment_method = 3;
}

message OrderResponse {
  string resp_order_id = 1;
  string resp_status = 2;
}
payment.proto
syntax = "proto3";

package payment;

option go_package = "paymentpb";

service PaymentService {
  rpc ProcessPayment (PaymentRequest) returns (PaymentResponse) {}
}

message PaymentRequest {
  double amount = 1;
  string method = 2;
}

message PaymentResponse {
  bool success = 1;
  string transaction_id = 2;
}
Implementing the Circuit Breaker
The Order Service uses the gobreaker library in Go.
Key settings:
- MaxRequests = 5 (When half-open, allow up to 5 trial requests through; the circuit closes after 5 consecutive successes and reopens on any failure.)
- Interval = 60s (While the circuit is closed, the request and failure counts are reset every 60 seconds.)
- Timeout = 10s (How long the circuit stays open before moving to half-open.)
- ReadyToTrip = trip when there are ≥ 5 requests and ≥ 50% of them failed.
This logic ensures that if the Payment Service is unstable, the Order Service won’t waste resources waiting on it.
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "PaymentCB",
    MaxRequests: 5,
    Interval:    60 * time.Second,
    Timeout:     10 * time.Second,
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        failRatio := float64(counts.TotalFailures) / float64(counts.Requests)
        return counts.Requests >= 5 && failRatio >= 0.5
    },
})
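To see how the trip rule behaves at its boundaries, the predicate can be pulled out into a plain function and run against a few counts. This sketch uses a local counts struct that mirrors only the two gobreaker.Counts fields the rule reads, so it runs without the library; the struct and the shouldTrip name are illustrative assumptions.

```go
package main

import "fmt"

// counts mirrors only the two gobreaker.Counts fields used by the rule.
// It is a local stand-in so the example runs without the library.
type counts struct {
	Requests      uint32
	TotalFailures uint32
}

// shouldTrip applies the same rule as ReadyToTrip: at least 5 requests and
// a failure ratio of 50% or more. Checking the request count first avoids
// computing 0/0 when nothing has been recorded yet.
func shouldTrip(c counts) bool {
	if c.Requests < 5 {
		return false
	}
	return float64(c.TotalFailures)/float64(c.Requests) >= 0.5
}

func main() {
	fmt.Println(shouldTrip(counts{Requests: 4, TotalFailures: 4}))  // false: too few requests
	fmt.Println(shouldTrip(counts{Requests: 10, TotalFailures: 4})) // false: 40% failures
	fmt.Println(shouldTrip(counts{Requests: 10, TotalFailures: 5})) // true: 50% failures
}
```

The minimum request count keeps a single early failure from opening the circuit on a service that has barely received traffic.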
Order Service Flow
- Receive order request from client.
- Calculate price (simple quantity × unit price).
- Call Payment Service inside cb.Execute():
  - If Payment succeeds → confirm order.
  - If Payment fails → fall back to “Payment Pending” status.
- Return response to client.
Payment Service Behavior
To simulate real-world instability:
- Adds 200ms artificial latency per request.
- Randomly fails in 25% of requests.
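// Simulate slow processing.
time.Sleep(200 * time.Millisecond)
// Fail roughly one request in four.
if rand.Intn(4) == 0 {
    return &paymentpb.PaymentResponse{
        Success: false,
    }, nil
}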
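This gives the circuit breaker in the Order Service failures to react to during tests; the breaker opens whenever at least half of the requests in a counting window fail. Note that the handler reports failure through the Success field with a nil error, so the Order Service must turn that into an error inside cb.Execute() for the breaker to count it.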
Kubernetes Deployment
Two YAMLs are provided — one for Payment Service and one for Order Service.
Example: payment-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      containers:
        - name: payment
          image: yourrepo/payment-service:latest
          ports:
            - containerPort: 50051
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment
  ports:
    - protocol: TCP
      port: 50051
      targetPort: 50051
Conclusion
Building resilient microservices takes more than decomposing an application into small, independently deployable pieces. While the microservice pattern offers clear advantages in scalability, maintainability, and flexibility, it also creates a web of interdependencies whose failures can have unintended, widespread consequences. Unless they are managed with deliberate resilience patterns, these interdependencies become vulnerabilities, and isolated issues are amplified into system-wide collapse.
Here, we have shown how Kubernetes, gRPC, and the circuit breaker pattern combine to provide a stable foundation for fault-tolerant distributed systems. Kubernetes provides operational fault tolerance through auto-scaling, self-healing, and declarative deployments. gRPC provides high-performance, type-safe service-to-service communication with low serialization overhead and latency. The circuit breaker pattern adds a further layer of defense, guarding against flaky services, forestalling cascading failures, and letting the system fail gracefully rather than collapse.
Our example system, with an Order Service, a Payment Service, and simulated failures, showed not only how the building blocks work but also the architectural concepts behind them. By running the system on Kubernetes and simulating instability in the Payment Service, we can observe how the circuit breaker keeps the Order Service healthy and responsive while its dependency misbehaves.
In practice, this design can be supplemented with further resilience mechanisms such as service mesh-based timeouts and retries, distributed tracing, rate limiting, and real-time monitoring with Prometheus and Grafana. These reinforce one another so that when issues do arise, the system recovers quickly and the user experience remains positive.
Ultimately, the lesson is that resilience needs to be built into microservices from the beginning. With proper orchestration, efficient communication protocols, and runtime fault isolation, engineering teams can create architectures that not only perform well in ideal scenarios but also hold up under the unforeseen conditions of production.