What Is Availability? Theory, Problems, Tools, and Best Practices

What is availability? Today, I will answer that question, dive into challenges, and share best practices related to availability.

Bartłomiej Żyliński

Jul. 24, 25 · Analysis

Availability is the measure of a system’s ability to stay up and running despite the failures of its parts. Today, I will explore this core trait of distributed systems. I will cover theory, challenges, tools, and best practices to ensure your system stays up and running against all odds.

Let's start with theory.

What Is Availability?

Availability describes how our systems handle failures and determines the system’s uptime. Usually, we describe the availability of a system in “nines” notation. 99% availability allows at most 14.40 minutes of downtime per day, while 99.999% (the so-called five nines) reduces this time to 864 milliseconds.

Most cloud services have an SLA that guarantees end users between three (99.9%) and five (99.999%) nines of availability.

Availability (%)	Downtime per day (~)	Downtime per month (~)	Downtime per year (~)
90	144 minutes (2.4 hours)	73 hours	36.53 days
99	14 minutes	7 hours	3.65 days
99.9	1.5 minutes	44 minutes	8.77 hours
99.99	9 seconds	4.4 minutes	52.6 minutes
99.999	864 milliseconds	26 seconds	5.3 minutes
99.9999	86.40 milliseconds	2.6 seconds	31.5 seconds
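The numbers in the table are plain arithmetic, so you can recompute them for any target. Below is a minimal Python sketch; the function name `allowed_downtime` and the `PERIODS` mapping are my own illustrative choices, and the month and year use average lengths (365.25 days per year), which is an assumption on my side.

```python
from datetime import timedelta

# Period lengths. Month and year are averages (365.25 days per year), an assumption.
PERIODS = {
    "day": timedelta(days=1),
    "month": timedelta(days=365.25 / 12),
    "year": timedelta(days=365.25),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    for target in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        budget = {name: allowed_downtime(target, p) for name, p in PERIODS.items()}
        print(f"{target}%: day={budget['day']}, month={budget['month']}, year={budget['year']}")
```

The table rounds these values, so expect small differences in the last digit.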

Additionally, the term high availability or HA is used to describe services that have at least 3 nines of availability guarantees.

There is a famous struggle related to availability and consistency. The common notion is that in case of a failure, we can have either one or the other. While in most cases this is true, the topic as a whole is vastly more nuanced and complex. For example, CRDTs put this whole statement into question; the same is true for Google’s internal Spanner.

Moreover, we can use various techniques to balance both of these traits. A system may favor one over the other in certain places but not in others. Just remember: this struggle exists and is one of the most important areas of study in distributed systems research.

How To Measure Availability

Availability is probably the simplest trait to measure, at least for a single service. You probably already have uptime or downtime metrics in one of your dashboards. Take the daily uptime value, divide it by the length of a day in the same unit (24 hours, 1,440 minutes, or 86,400 seconds), and multiply by 100. Et voilà, you have your service's daily uptime percentage ready, and you can easily see how many nines you achieved.
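As a rough illustration, the calculation can be sketched in Python like this. The function names are hypothetical, and the small epsilon in `count_nines` is my own workaround for floating-point rounding, not part of any standard definition.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_uptime_percent(uptime_seconds: float) -> float:
    """Convert one day's uptime in seconds into an uptime percentage."""
    return 100 * uptime_seconds / SECONDS_PER_DAY


def count_nines(availability_percent: float) -> int:
    """Count leading nines: 99.9 -> 3, 99.95 -> 3, 99.99 -> 4."""
    if not 0 <= availability_percent < 100:
        raise ValueError("availability must be in the range [0, 100)")
    unavailable_fraction = 1 - availability_percent / 100
    # The epsilon absorbs float noise, e.g. 1 - 0.999 is slightly above 0.001.
    return math.floor(-math.log10(unavailable_fraction) + 1e-9)


if __name__ == "__main__":
    uptime = 86_313.6  # seconds the service was up today (made-up sample value)
    percent = daily_uptime_percent(uptime)
    print(f"{percent:.3f}% uptime, {count_nines(percent)} nines")
```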

Things get more complicated when our service has multiple dependencies, or when we want to measure availability on the scale of the whole system.

As an example, consider service A with two dependencies: a DB and an Email Service.

Service A has an uptime of 99.99%.
DB has an uptime of 99.9%.
The Email Service has an uptime of 99%.

Thus, the availability of service A is not 99.99% but in fact 98.89%: 0.9999 × 0.999 × 0.99 ≈ 0.9889, or 98.89%.

In a more readable format:

Component	Availability (%)	Availability (decimal)
Service A (own uptime)	99.99%	0.9999
Database	99.9%	0.9990
Email service	99%	0.9900
Composite A	98.89%	0.9999 × 0.9990 × 0.9900 ≈ 0.9889

While the final difference is not big, it clearly illustrates the point. The availability of a service does not stand alone; it is the product of its own availability and the availability of every dependency it cannot work without.
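The same multiplication can be written as a short Python sketch. It assumes that every dependency is a hard one (the service cannot work without it) and that components fail independently; `composite_availability` is a name I made up for this example.

```python
from math import prod


def composite_availability(*availabilities_percent: float) -> float:
    """Availability (in %) of a chain of components that must all be up at once."""
    if not availabilities_percent:
        raise ValueError("provide at least one component")
    return 100 * prod(a / 100 for a in availabilities_percent)


if __name__ == "__main__":
    # Service A itself, its database, and the email service.
    print(f"{composite_availability(99.99, 99.9, 99.0):.2f}%")
```

Adding another dependency to the call can only keep the result the same or lower it, never raise it.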

The same principle applies to the system. Availability of a system as a whole is a product of all its services and tools. Even a single poorly available component can bring the whole system down.

Weakest link	Best product you can ever reach
99 % (two nines)	< 99 %
99.9 % (three nines)	< 99.9 %
99.99 % (four nines)	< 99.99 %

Here is a quick note on how you can structure your availability-related metrics:

Tier	Example in an availability context
SLI (Indicator)	`http_request_success_ratio` = successful requests ÷ total requests
SLO (Objective)	`http_request_success_ratio ≥ 99.95 % over 30 days`
SLA (Agreement)	“We guarantee 99.9 % monthly availability; otherwise, you get service credits.”
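To make the first two tiers more tangible, here is a small Python sketch that computes the SLI from request counters and checks it against the SLO. The `SloReport` type and the `evaluate_slo` function are hypothetical names, and the error budget field is an extra I added on top of the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float                # measured success ratio, between 0.0 and 1.0
    slo_met: bool             # True when the SLI reaches the objective
    error_budget_left: float  # share of allowed failures still unused (negative = overspent)


def evaluate_slo(successful: int, total: int, slo_target: float = 0.9995) -> SloReport:
    """Compare successful/total requests against an SLO such as 99.95%."""
    if total <= 0 or not 0 <= successful <= total:
        raise ValueError("need total > 0 and 0 <= successful <= total")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = successful / total
    allowed_failures = (1 - slo_target) * total
    actual_failures = total - successful
    return SloReport(sli, sli >= slo_target, 1 - actual_failures / allowed_failures)


if __name__ == "__main__":
    # Made-up counters for a 30-day window.
    print(evaluate_slo(successful=999_700, total=1_000_000))
```

In a real setup, the counters would come from your metrics system and cover the same window as the SLO (30 days in the table).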

Signs That the System Has Poor Availability

There are a couple of behaviors we can notice that indicate availability problems with our service. Additionally, some of those are similar to the signs of poor scalability.

Low uptime percentage: The most obvious sign of all. It directly shows that the service is down and users cannot access it.
Service “flapping”: The service oscillates between up and down as automated restarts or failovers repeatedly flip it in and out of rotation.
Health-check failures: Persistent probe timeouts under normal load mean the service is down or will be down in the near future.
High Mean Time To Recover: Outages last for hours before the team can resolve them and bring the system back online.
Traffic suddenly drops to zero: The service is either down, or users have given up trying to connect.
Direct feedback: An important client calls the CTO/CIO (or whoever else) complaining that everything is down, alerts start firing, and other interesting events follow.

The Availability Game Changers

In my opinion, the game changer for availability is automatic and graceful failover. While it sounds simple, it is actually quite complex. To achieve it, we need to combine multiple different concepts and make them work together. Nonetheless, it is crucial for providing a zero-downtime experience.

The anatomy of a state-of-the-art zero-downtime failover:

Stage	What happens	Typical target time
1. Detect	Health probe sees anomalies (5× timeouts/60 s).	≤ 5 s
2. Decide	Orchestrator marks node unhealthy, stops scheduling it.	≤ 1 s
3. Redirect	Load balancer removes endpoint from pool; sticky sessions migrate.	≤ 2 s
4. Restore	Replacement pod/VM starts and passes readiness checks.	≤ 40 s (hot standby: ≈ 0 s)
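The Detect stage can be sketched as a sliding window over probe timeouts. The Python example below is only an illustration: the class name is hypothetical, the defaults mirror the "5 timeouts in 60 s" rule from the table, and timestamps are passed in explicitly so the logic is easy to test (in a real probe you would likely use `time.monotonic()`).

```python
from collections import deque


class TimeoutWindowDetector:
    """Flags a node as unhealthy after `max_timeouts` probe timeouts within `window_s` seconds."""

    def __init__(self, max_timeouts: int = 5, window_s: float = 60.0) -> None:
        self.max_timeouts = max_timeouts
        self.window_s = window_s
        self._timeouts = deque()  # timestamps (in seconds) of recent probe timeouts

    def record_timeout(self, now: float) -> None:
        """Register a probe timeout observed at time `now`."""
        self._timeouts.append(now)
        self._evict(now)

    def is_unhealthy(self, now: float) -> bool:
        """True when enough timeouts fall inside the current window."""
        self._evict(now)
        return len(self._timeouts) >= self.max_timeouts

    def _evict(self, now: float) -> None:
        # Drop timeouts that are older than the window.
        while self._timeouts and now - self._timeouts[0] > self.window_s:
            self._timeouts.popleft()
```

The window keeps a single slow response from triggering a failover, while a burst of timeouts still gets noticed quickly.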

Of course, automatic failover is not a silver bullet and comes with drawbacks. The two most significant ones are the higher complexity of the design and increased costs. Redundancy is responsible for increased costs, while failover itself adds complexity.

It may sound bad, but unfortunately, without such a mechanism, we will not be able to provide high availability.

Tools For Availability

I have already covered automatic failover as a key tool for building available systems. However, it is not the only one. There are more, and you can find them below.

Replication

Replication is a method of implementing redundancy. The key difference is that redundancy affects all layers of our system, from software to hardware, while replication is mostly related to the data layer.

We provide multiple up-to-date copies of the same dataset, usually split across multiple nodes. Thus, in case one of the nodes fails, the data is still available for the user.
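To get a feeling for how much extra copies help, we can estimate the availability of a replicated dataset. The Python sketch below assumes that replicas fail independently and that a single live replica is enough to serve the data; both are simplifications, and the function name is my own.

```python
def replicated_availability(replica_availability_percent: float, replicas: int) -> float:
    """Availability (in %) of a dataset that stays reachable while at least one replica is up."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    if not 0 <= replica_availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    p_replica_down = 1 - replica_availability_percent / 100
    # The data is unavailable only when every replica is down at the same time.
    return 100 * (1 - p_replica_down ** replicas)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "replica(s):", f"{replicated_availability(99, n):.4f}%")
```

Real replicas often share failure causes (the same rack, region, or bad deploy), so treat the result as an optimistic upper bound.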

There are two main types of replication:

Single-master/single-leader: Only one of the replica nodes, the leader, handles incoming writes. The rest of the nodes provide read access and can be used to offload part of the incoming traffic. The leader propagates changes to the other nodes, usually using some type of Write-Ahead Log (WAL). If the leader node fails or becomes unavailable for some reason, a leader election takes place, and a new leader is selected from the up-and-running nodes.
Multi-master/multi-leader: All the nodes accept both reads and writes at the same time. Writes are then propagated to the other nodes. The biggest problem in this case is that conflicting writes to the same data can land on two different nodes at the same time. Thus, it requires a separate conflict resolution mechanism.

The concept of replication is a very extensive one. A good walkthrough and comparison of these two approaches is out of the scope of this article. However, I promise to dive deeper into replication in a separate article.

For now, remember the following table:

Single-master	Multi-master
Only one node accepts writes	Multiple nodes accept writes
Propagation via WAL	Conflict resolution and propagation
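Below is a toy, in-memory Python sketch of the single-leader variant. Everything in it is a simplification of mine: the `Node` and `SingleLeaderCluster` names are hypothetical, the list of writes stands in for a WAL, replication is synchronous, and the "election" simply picks the live node with the longest log. Real databases use proper consensus protocols for that step.

```python
class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.log = []    # ordered (key, value) writes, a stand-in for a WAL
        self.data = {}   # current state built by applying the log
        self.alive = True

    def apply(self, entry) -> None:
        key, value = entry
        self.log.append(entry)
        self.data[key] = value


class SingleLeaderCluster:
    def __init__(self, names) -> None:
        self.nodes = [Node(n) for n in names]
        self.leader = self.nodes[0]

    def write(self, key, value) -> None:
        """Only the leader accepts writes; it then ships the entry to live followers."""
        if not self.leader.alive:
            self.elect_leader()
        entry = (key, value)
        self.leader.apply(entry)
        for node in self.nodes:
            if node is not self.leader and node.alive:
                node.apply(entry)

    def read(self, key):
        """Any live replica can serve reads."""
        for node in self.nodes:
            if node.alive:
                return node.data.get(key)
        raise RuntimeError("no live replicas")

    def elect_leader(self) -> None:
        """Naive election: promote the live node with the most log entries."""
        candidates = [n for n in self.nodes if n.alive]
        if not candidates:
            raise RuntimeError("no live nodes to promote")
        self.leader = max(candidates, key=lambda n: len(n.log))


if __name__ == "__main__":
    cluster = SingleLeaderCluster(["a", "b", "c"])
    cluster.write("x", 1)
    cluster.leader.alive = False  # simulate a leader crash
    cluster.write("x", 2)         # triggers an election, then the write
    print(cluster.leader.name, cluster.read("x"))
```

The sketch does not cover a failed node coming back with a stale log; catching up from the leader's log is left out on purpose.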

Automatic Failover

An automatic and graceful (not noticeable by the user) failover mechanism is the key to availability.

To build a good automatic failover, we need to combine at least three concepts:

Redundancy — We need more than one node to even start thinking of building any failover.
Health checks — We need properly defined health checks to detect if nodes are down or should not handle user requests.
Load-balancer/actual failover — We need a way to change the failing components and redirect the traffic to the up-and-running ones.

Each piece alone is insufficient; all must work together.
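Here is a minimal Python sketch of how the three pieces can fit together: a redundant pool of nodes, a health-check callback, and a round-robin balancer that skips unhealthy nodes. `FailoverBalancer` and its parameters are hypothetical names for illustration; production load balancers do considerably more (connection draining, retries, weighting).

```python
from typing import Callable, Sequence


class FailoverBalancer:
    """Round-robin over a redundant pool, routing only to nodes that pass the health check."""

    def __init__(self, nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> None:
        if len(nodes) < 2:
            raise ValueError("failover needs at least two nodes (redundancy)")
        self._nodes = list(nodes)
        self._is_healthy = is_healthy
        self._next = 0

    def pick(self) -> str:
        """Return the next healthy node, or raise when the whole pool is down."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if self._is_healthy(node):
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    down = set()  # nodes currently failing their health checks
    balancer = FailoverBalancer(["n1", "n2", "n3"], lambda n: n not in down)
    print([balancer.pick() for _ in range(3)])
    down.add("n2")  # n2 starts failing; traffic moves to the remaining nodes
    print([balancer.pick() for _ in range(4)])
```

Remove any one of the three pieces and the sketch stops working: with a single node there is nowhere to redirect to, and without the health check the balancer keeps sending traffic to a dead node.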

Isolating Failure

Another way to increase the availability of our system is to isolate failures. By doing so, we can ensure that a failure of one component will not cause the cascade failure of the other components involved in the same processing flow.

As with most concepts in this section, there is no single tool or method to achieve that. Instead, we can follow one of the patterns below. We can also mix different patterns.

Let's dive into them:

Circuit breaker: One of the most common microservices patterns in existence. It implements the fail-fast concept in a way similar to an electrical circuit breaker. If multiple consecutive calls to another service fail within a certain period, the circuit breaker trips. Then, for the duration of a timeout period, all attempts to invoke that service fail immediately. This reduces the load on the possibly faulty service and gives it time to recover. It also avoids introducing potential timeouts at other stages of the flow.
Bulkhead: According to this pattern, components and resources in our system should be compartmentalized. Partitioning should be done in such a way that components do not share any resources. For example, each partition should have its own thread pools, connection pools, and CPU or memory limits. Such a split decreases the chances of one component overusing resources and impacting the other components in the system.
Error kernel: We split our system into two types of components: core and side ones. The core ones must not fail for any reason. The side ones may fail, and we should be able to restart them easily. Then we can move the side ones to the “outskirts” of the system. Thus, we end up with a reliable core and easy-to-restart leaf components.
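To show the first pattern in action, here is a minimal circuit breaker sketch in Python. It is a simplified illustration, not how Resilience4j or any other library implements it: it counts consecutive failures without a time period, and after the timeout it lets a single trial call through. All names and default values are my own assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock        # injectable so the breaker can be tested without sleeping
        self._failures = 0         # consecutive failures seen so far
        self._opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap every remote invocation, for example `breaker.call(fetch_user, user_id)`, where `fetch_user` is a placeholder for your own client function, and treat `CircuitOpenError` as a signal to use a fallback.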

Multi-Region or Multi-Cloud Deployment

Multi-Availability Zone or Multi-Region deployment will protect us from the least expected type of failures: the ones that wipe out a whole data center, or multiple data centers located in a particular region, like the fire at the OVH data center in France or the GCP electrical problem in Iowa.

We can go even further and build a multi-cloud failover. If your core cloud provider is down, you can switch to a backup. While it adds a ton of extra complexity to your system, it reduces the probability of a system-wide failure even further. Region-wide failures are rare by themselves. Provider-wide failures are even rarer. Nevertheless, both may happen. Being able to handle them will probably not decide the difference between 99.99% and lower availability tiers.

However, being able to handle such events has a few advantages:

You stay alive when others are down.
It indicates how good your architecture is.

Chaos Engineering/Fault Injection

Chaos engineering will not actually help you build an available system by itself. Rather, it helps you verify that your system is, in fact, available. By introducing deliberate and trackable failures, you can identify weaknesses and problems that would not show up in any other way. I also mentioned this concept here.

Just remember that it is not fully safe. Before you inject failures, double-check that your system will be able to handle them.
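At the smallest scale, fault injection can be as simple as wrapping a call so that it fails on purpose some of the time. The Python sketch below shows the idea; `with_fault_injection` and `InjectedFault` are hypothetical names, and dedicated chaos tools work at the infrastructure level rather than inside a single function.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """A deliberate failure, kept as a separate type so it is easy to track in logs."""


def with_fault_injection(func, failure_rate: float, rng=None):
    """Wrap `func` so that roughly `failure_rate` of its calls raise InjectedFault."""
    if not 0 <= failure_rate <= 1:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


if __name__ == "__main__":
    def get_price(item: str) -> int:
        return 42  # placeholder for a real downstream call

    flaky_get_price = with_fault_injection(get_price, failure_rate=0.2)
    outcomes = []
    for _ in range(10):
        try:
            outcomes.append(flaky_get_price("book"))
        except InjectedFault:
            outcomes.append("fault")
    print(outcomes)
```

Keeping the failure rate configurable lets you start with a tiny value in a test environment and turn it off instantly by setting it to 0.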

Why We Fail To Achieve High Availability

After what, how, and why, it is time for why we fail. In my opinion and experience, there are a few factors that lead to our failure in building available systems.

Some reasons will be the same as in the case of my article on scalability.

Ignoring the trade-offs: Every decision we make has short- and long-lasting consequences we have to be aware of. Of course, we can ignore them; still, we have to know them first and be conscious of why we are ignoring some potential drawbacks.
Incorrect health checks: They react either too slowly or too quickly. Restarting a service too early or too late increases the likelihood of users experiencing the failure.
Lack of redundancy: Critical components do not have properly configured redundancy.
Badly designed failover: We are unable to redirect the traffic to the up-and-running nodes fast enough.

Below is a simple checklist to improve your chances of achieving high availability:

Do today	Impact
Add a health check to every component.	30 minutes of work slashes 502 errors during deploys and failovers.
Track the availability product.	Makes hidden single points of failure painfully obvious.
Set a written SLO.	Aligns the team on what “good enough” means.
Run a failover drill.	Checks your design in practice.

Summary

I have shared a number of concepts and approaches for building highly available systems.

Let's do a quick recap of key takeaways:

Building highly available systems requires combining different concepts, such as redundancy, health checks, and failover.
Proper health checks will help you keep up with the state of your components.
Isolating failures and preventing their propagation will keep the system running even if some components fail.
Multi-region deployment will save you in the most unexpected moment.

Some concepts discussed here can’t be implemented using a single tool. They require architectural thinking and coordination across layers of the stack.

Concept	Tool
Replication	Usually part of the database product you are using
Automatic failover	K8s probes, cloud autoscaling products
Failure isolation	Resilience4j, K8s Namespaces
Multi-AZ	Cloud providers' Availability Zones

High availability isn't just a metric — it is a mindset.
Build for failure. Monitor everything. And treat availability as a first-class feature.

I wish you luck in your struggle with availability. Thank you for your time.

Published at DZone with permission of Bartłomiej Żyliński. See the original article here.

Opinions expressed by DZone contributors are their own.