Preventing Systemic Failure in Microservices With Circuit Breaking (Part 2)
This is the second of a two-part series on circuit breaking. In part one, we covered the pattern and how it is approached differently by developers and operators. In this post, we’ll explore its typical use cases and how it is implemented in modern service middleware.
Typical Microservice Use Cases
Developers and operators typically use circuit breaking for different purposes. Being primarily concerned with protecting their code, developers look to circuit breaking as a way to compensate for upstream failures. Operators, on the other hand, are responsible for the stability and availability of the entire service landscape and thus use circuit breaking primarily to monitor and remediate.
Developers: Compensating for Upstream Failures
Besides merely “breaking the circuit” and moving on, developers care mainly about three benefits of circuit breakers. First, because circuit breakers let developers deal with service failures, clients can adapt gracefully to changes in service availability over time. Second, circuit breakers that share their state across a service architecture provide network effects that can significantly improve responsiveness in the face of failures. Third, circuit breakers coupled with intelligent routing and load balancing can automatically substitute healthy service instances for failed ones, thus promoting self-healing.
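These mechanics can be sketched with a minimal breaker that trips after consecutive failures, serves a fallback while open, and lets a probe request through once a reset timeout elapses. This is an illustrative sketch only; the names and thresholds are invented, and production libraries add half-open bookkeeping, metrics, and shared state:

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker (closed -> open -> probe).

    A sketch of the pattern only, not any particular library's API.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before probing again
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, fallback):
        # While open, short-circuit to the fallback until the reset
        # timeout elapses; then allow a single probe call through.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        # A success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A client would wrap each remote call, e.g. `breaker.call(fetch_profile, fallback=cached_profile)`, so a tripped breaker degrades gracefully instead of hammering a failing upstream.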
Operators: Monitoring and Remediation
Circuit breakers are a great way for operations teams to spot trouble before it cascades into bigger problems. When a circuit breaker is tripped, operators might decide to divert some or most traffic away from a service while the responsible engineering team investigates the relevant logs and metrics. Because diverting traffic or shedding load relieves systems of acute stress, this is the most popular use of circuit breaking among operators.
A closely related variant is to define circuit breakers as predetermined breaking points in the architecture. Ideally, such breakers are set up in places known to bear loads in direct proportion to the loads on critical systems. In essence, these breakers act as canaries in the architecture that, again, lead to remediation through load shedding.
Advanced Circuit Breaking
As circuit breakers evolved from client-side libraries to middleware, shared-state breakers, and platforms, their definition also became increasingly diverse. The developer and operator use cases of circuit breaking diverged, and its definitions came to involve an increasing number of parameters. Circuit breaking as provided today by cloud traffic controllers such as Glasnostic can be applied to traffic links defined by arbitrary sets of endpoints and combined with several complementary patterns such as timeouts, backpressure, or brownouts. These combinations of patterns are then refined over time using parameters such as request rate, concurrency, bandwidth, or latency.
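As a sketch of one such combination, a concurrency allowance can be placed in front of a breaker so that excess callers are shed immediately instead of queueing, a simple form of backpressure. The class below is illustrative only and not any particular product's API:

```python
import threading

class ConcurrencyLimiter:
    """Illustrative sketch: shed load above a concurrency allowance.

    In a traffic controller such limits would be applied per link
    between sets of endpoints; here a non-blocking semaphore stands
    in for that. All names are invented for illustration.
    """

    def __init__(self, max_concurrent=10):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, fallback):
        # Refuse immediately instead of queueing: callers above the
        # allowance get the fallback, which keeps a breaker composed
        # behind this limiter from seeing timeout-driven failures.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        finally:
            self._slots.release()
```

Composing the two patterns is then just nesting: `limiter.call(lambda: breaker.call(remote, fallback), fallback)` sheds excess load first and circuit-breaks what remains.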
Circuit Breaking with Hystrix
Netflix's Hystrix was the first service middleware dedicated exclusively to circuit breaking. When it was released to the public in 2012 to provide microservice architectures with “greater tolerance of latency and failure,” it had already been used extensively at Netflix for over a year. Hystrix continued to serve as a fundamental part of the Netflix service middleware until it entered maintenance mode in late 2018, marking, according to the project, a “shift [in focus] towards more adaptive implementations that react to an application's real-time performance rather than pre-configured settings.”
Hystrix is a Java library that developers can use to wrap service calls with circuit breaking logic. It is based on thresholds and can fail calls immediately and perform fallback logic as shown in part 1. Besides providing timeouts and concurrency limits, it can also publish metrics to monitoring tools. Finally, when used in conjunction with the Archaius library, it can also support dynamic configuration changes.
Although Hystrix supported refinements such as combining circuit breaking with timeouts and concurrency pools, it ultimately proved not flexible enough for the increasingly dynamic interaction behaviors of modern organic architectures. The ability to set thresholds and client-side concurrency pools gives service developers sufficient control to isolate their code from upstream failures, but it ceases to be useful where systemic, operational concerns gain importance. As such, the decline of Hystrix is a direct consequence of the limitations of circuit breaking as a developer pattern.
Circuit Breaking in Service Meshes
Istio is a service mesh that supports circuit breaking based on connection pool, requests-per-connection, and failure detection parameters. It does this with the help of so-called “destination rules,” which tell each Envoy sidecar proxy which policy to apply to traffic, and how. This step happens after routing has occurred, which is not always ideal. Destination rules may specify limits on load balancing and connection pool size, as well as the parameters that determine what qualifies as an “outlier,” so that unhealthy hosts can be removed from the load balancing pool. This type of circuit breaking is great at insulating clients from service failures, but because destination rules are always applied cluster-wide, it lacks a way to limit breakers to only a subset of clients. To combine circuit breakers with, for example, quality-of-service patterns, multiple client-specific routing rules must be created, each with its own destination rule.
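A destination rule of this kind might look like the following sketch. The service name and numeric values are illustrative; field names follow Istio's `DestinationRule` schema, but the exact outlier detection fields vary by Istio release, so consult the reference for the version in use:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-breaker
spec:
  host: reviews                      # applies cluster-wide to this service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap on concurrent connections
      http:
        http1MaxPendingRequests: 10  # pending-request queue limit
        maxRequestsPerConnection: 1
    outlierDetection:                # the "failure detection" parameters
      consecutive5xxErrors: 5        # errors before ejecting a host
      interval: 10s                  # analysis sweep interval
      baseEjectionTime: 30s
      maxEjectionPercent: 50         # never eject more than half the pool
```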
Circuit breaking in Linkerd is somewhat complicated, reflecting the generally conflicted state of circuit breaking as a developer pattern. While Linkerd 1 continues to support robust circuit breaking courtesy of the original Finagle code, Linkerd 2, a complete, lightweight rewrite in Rust and Go, does not support it directly. Instead, it offers related functionality in its proxy, which originated as the Conduit project and has since been merged into Linkerd 2, albeit without support for retries and timeouts.
To implement retry and timeout support, Linkerd 2.1 introduced the concept of “service profiles,” custom Kubernetes resources that provide Linkerd with extra information about a service. Using service profiles, operators can now define routes as being retryable or as having a specific timeout. While this provides some functions essential to circuit breaking, full support in Linkerd is still a ways off.
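A service profile of this kind might look like the following sketch; the service name, route, and values are illustrative, with field names following the Linkerd 2 `ServiceProfile` resource:

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # Service profiles are named after the service's FQDN.
  name: books.default.svc.cluster.local
  namespace: default
spec:
  routes:
  - name: GET /books
    condition:
      method: GET
      pathRegex: /books
    isRetryable: true   # mark this route as safe to retry
    timeout: 300ms      # per-route timeout
```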
Circuit Breaking with Glasnostic
Glasnostic is a cloud traffic controller that enables operations teams to control the complex emergent behaviors that their organic architectures exhibit. This enables companies to run diverse architectures in an agile manner, without costly revalidation on every change. As a result, development and operations are ideally positioned to adapt to their company’s rapidly changing business needs.
Unlike Hystrix and service meshes, which implement circuit breaking from a developer’s perspective, Glasnostic implements circuit breaking as an operational pattern, designed for operators.
Glasnostic's control plane provides high-level visibility into large-scale, complex, and dynamic interaction behaviors, enabling operators to remediate issues quickly. Operators can apply tried-and-tested, predictable operational patterns such as circuit breaking by exerting fine-grained control over interactions across arbitrary sets of service endpoints. Because operational patterns may be readily combined to form highly refined, compound patterns, circuit breakers can likewise be easily refined by combining them with, for example, backpressure based on request rate, bandwidth, or concurrency.
For example, Figure 3 shows a channel set up to monitor and control intermittently recurring latency spikes across a set of otherwise unrelated services. Without looking for a putative root cause, operators decide to first control the situation by circuit-breaking the more extreme long-running requests. They achieve this by first defining a new channel covering the services in question, as well as any potential clients, and then imposing a suitable latency limit on the interactions governed by the channel. This allows the operations team to control the situation until engineering can provide a fix.
Of course, initial policies are often just that, first attempts to remediate a situation, and need to remain open to adjustment. Adjusting or complementing policies in Glasnostic is both fast and easy. For instance, the operations team may find that the initial channel policy can be further refined by first circuit-breaking non-mission-critical clients, leaving mission-critical clients unaffected as long as possible. To accomplish this, they could define a refinement channel covering only non-mission-critical clients and add a policy that circuit-breaks them based on connection and request allowances. Figure 4 shows such an auxiliary refinement channel set up with both concurrency and request policies to circuit-break non-mission-critical clients before the original latency breaker is tripped, thus increasing availability for mission-critical systems.
Unlike the circuit breakers typically offered by service middleware such as API gateways and service meshes, Glasnostic supports circuit breaking as an operational pattern, between arbitrary sets of endpoints and in real time, as opposed to via static deployment descriptors. This allows operators to specify circuit breakers that are not just tactical adjustments to local interactions but steps towards improving stability and availability that are meaningful for the entire service landscape. For instance, while Istio implements circuit breaking based on destination rules, Glasnostic can apply circuit breaking to any set of interactions, clients, or services, past, present, or future. As a result, operators can set separate policies for different traffic classes.
Summary
Circuit breaking is a fundamental pattern designed to minimize the impact of failures, prevent them from cascading and compounding, and ensure end-to-end performance. Because it can be leveraged both as a developer pattern and as an operational pattern, it is applied broadly, which often causes confusion.
As a developer pattern, it is predominantly used as a fairly rudimentary compensation strategy that is difficult to refine without considering each specific call. As an operational pattern, on the other hand, circuit breaking aims to relieve distressed systems of pressure in order to manage both systemic stability and performance. Its behavior is often further refined by combining it with other stability patterns such as timeouts or backpressure. Operational circuit breakers used to depend on separately deployed service middleware such as API gateways or service meshes. However, because service meshes primarily address developer concerns, their support for circuit breaking as an operational pattern is limited and inconsistent across implementations. As a result, operational circuit breaking is best done using a cloud traffic controller like Glasnostic.
Published at DZone with permission of YuHan Lin. See the original article here.
Opinions expressed by DZone contributors are their own.