What Is Availability? Theory, Problems, Tools, and Best Practices
What is availability? Today, I will answer that question, dive into challenges, and share best practices related to availability.
Availability is the measure of a system’s ability to stay up and running despite the failures of its parts. Today, I will explore this core trait of distributed systems. I will cover theory, challenges, tools, and best practices to ensure your system stays up and running against all odds.
Let's start with theory.
What Is Availability?
Availability describes how well our systems handle failures and determines the system’s uptime. Usually, we describe the availability of a system in “nines” notation. 99% availability allows a maximum of 14.40 minutes of downtime per day, while 99.999% (the so-called five nines) reduces this time to 864 milliseconds.
Most cloud services have an SLA that guarantees between three (99.9%) and five (99.999%) nines of availability for end users.
| Availability (%) | Downtime per day (~) | Downtime per month (~) | Downtime per year (~) |
|---|---|---|---|
| 90 | 144 minutes (2.4 hours) | 73 hours | 36.53 days |
| 99 | 14 minutes | 7 hours | 3.65 days |
| 99.9 | 1.5 minutes | 44 minutes | 8.77 hours |
| 99.99 | 9 seconds | 4.4 minutes | 52.6 minutes |
| 99.999 | 864 milliseconds | 26 seconds | 5.3 minutes |
| 99.9999 | 86.40 milliseconds | 2.6 seconds | 31.5 seconds |
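The numbers in the table follow from simple arithmetic, which can be sketched as below. This is a minimal illustration; the function name `downtime_budget_seconds` is hypothetical and not taken from any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def downtime_budget_seconds(availability_percent: float,
                            period_seconds: float = SECONDS_PER_DAY) -> float:
    """Return the maximum downtime (in seconds) allowed within the period."""
    return period_seconds * (1 - availability_percent / 100)


# Print the daily downtime budget for each availability level from the table.
for level in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    budget = downtime_budget_seconds(level)
    print(f"{level}% -> {budget:.4f} s of downtime per day")
```

Passing a different `period_seconds` (for example, the number of seconds in a month or a year) gives the other columns.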
Additionally, the term high availability or HA is used to describe services that have at least 3 nines of availability guarantees.
There is a famous struggle related to availability and consistency. The common notion is that in case of a failure, we can have either one or the other. While in most cases this is true, the topic as a whole is vastly more nuanced and complex. For example, CRDTs put this whole statement into question; the same is true for Google’s internal Spanner.
Moreover, we can use various techniques to balance both of these traits. A system may favor one over the other in certain places while not in others. Just remember: this struggle exists and is one of the most important areas of study in distributed systems research.
How To Measure Availability
Availability is probably the simplest trait to measure, at least for a single service. You probably already have uptime or downtime metrics in one of your dashboards. Just divide the daily uptime value you have there by 24 (if it is in hours), 1,440 (minutes), or 86,400 (seconds), and multiply the result by 100. Et voilà, you have your service’s daily uptime percentage ready, and you can easily see how many nines you achieved.
Things get more complicated when our service has multiple dependencies, or when we want to measure availability on the scale of the whole system.
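As a rough sketch of that calculation, assuming the dashboard reports daily downtime in seconds (the function names are hypothetical):

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_availability_percent(downtime_seconds: float) -> float:
    """Convert one day's downtime into an uptime percentage."""
    uptime_seconds = SECONDS_PER_DAY - downtime_seconds
    return 100 * uptime_seconds / SECONDS_PER_DAY


def count_nines(availability_percent: float) -> int:
    """Count the nines, e.g. 99.95 -> 3 and 99.999 -> 5."""
    unavailability = 1 - availability_percent / 100
    if unavailability <= 0:
        raise ValueError("100% availability has no finite number of nines")
    # round() guards against floating-point noise such as 2.9999999999999996.
    return math.floor(round(-math.log10(unavailability), 9))


availability = daily_availability_percent(downtime_seconds=43)
print(f"{availability:.3f}% -> {count_nines(availability)} nines")
```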
As an example, consider Service A with two dependencies: DB and Email Service.
- Service A has an uptime of 99.99%.
- DB has an uptime of 99.9%.
- The Email Service has an uptime of 99%.
Thus, the availability of Service A is not 99.99% but in fact 98.89%: 0.9999 × 0.999 × 0.99 ≈ 0.9889, or 98.89%.
In a more readable format:
| Component | Uptime (%) | Availability (decimal) |
|---|---|---|
| Service A (on its own) | 99.99% | 0.9999 |
| Database | 99.9% | 0.9990 |
| Email service | 99% | 0.9900 |
| Composite A | 98.89% | 0.9999 × 0.9990 × 0.9900 ≈ 0.9889 |
While the final difference may not look big, it clearly illustrates the point. The availability of a service is not a standalone value but a product of the availability of all its dependencies.
The same principle applies to the system. The availability of a system as a whole is a product of all its services and tools. Even a single poorly available component can bring the whole system down.
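The multiplication above is easy to automate. The sketch below assumes every dependency is a hard one (each request needs all of them) and that failures are independent; `composite_availability` is a hypothetical helper name.

```python
from math import prod


def composite_availability(*availabilities: float) -> float:
    """Availability of a chain of hard dependencies, as a fraction (0..1)."""
    return prod(availabilities)


# Service A on its own, its database, and the email service.
service_a = composite_availability(0.9999, 0.999, 0.99)
print(f"Composite availability: {service_a:.4%}")  # prints 98.8911%
```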
| Weakest link | Best product you can ever reach |
|---|---|
| 99 % (two nines) | ≤ 99 % |
| 99.9 % (three nines) | ≤ 99.9 % |
| 99.99 % (four nines) | ≤ 99.99 % |
Here is a quick note on how you can structure your availability-related metrics:
| Tier | Example in an availability context |
|---|---|
| SLI (Indicator) | http_request_success_ratio = successful requests ÷ total requests |
| SLO (Objective) | http_request_success_ratio ≥ 99.95 % over 30 days |
| SLA (Agreement) | “We guarantee 99.9 % monthly availability; otherwise, you get service credits.” |
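To make the three tiers concrete, here is a small sketch that computes the SLI and checks it against the SLO and SLA targets from the table. The request counts are made-up sample values, and the function names are hypothetical.

```python
SLO_PERCENT = 99.95  # internal objective from the table
SLA_PERCENT = 99.9   # contractual guarantee from the table


def success_ratio(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of successful requests. No traffic counts as available."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests


def meets_target(sli: float, target_percent: float) -> bool:
    """Check an SLI (a fraction) against a target expressed in percent."""
    return sli * 100 >= target_percent


# Sample values only: 1,000,000 requests in the window, 300 of them failed.
sli = success_ratio(successful_requests=999_700, total_requests=1_000_000)
print(f"SLI = {sli:.4%}")
print("SLO met:", meets_target(sli, SLO_PERCENT))
print("SLA met:", meets_target(sli, SLA_PERCENT))
```

Keeping the SLO stricter than the SLA gives the team a warning before the contractual guarantee is at risk.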
Signs That the System Has Poor Availability
There are a couple of behaviors we can notice that indicate availability problems with our service. Additionally, some of those are similar to the signs of poor scalability.
- Low uptime percentage: The most obvious sign of all. It directly shows that the service is down and users cannot access it.
- Service “flapping”: The service oscillates between up and down as automated restarts or failovers repeatedly flip it in and out of rotation.
- Health-check failures: Persistent probe timeouts under normal load mean the service is down or will be down in the near future.
- High Mean Time To Recover: Outages last for hours before the team can resolve them and bring the system back online.
- Traffic suddenly drops to zero: The service is either down, or users have given up trying to connect.
- Direct feedback: An important client calls the CTO/CIO (or whoever else) complaining that everything is down, alerts start firing, and other interesting events follow.
The Availability Game Changers
In my opinion, the game changer for availability is automatic and graceful failover. While it sounds simple, it is actually quite complex. To achieve it, we need to combine multiple different concepts and make them work together. Nonetheless, it is crucial for providing a zero-downtime experience.
The anatomy of a state-of-the-art zero-downtime failover:
| Stage | What happens | Typical target time |
|---|---|---|
| 1. Detect | Health probe sees anomalies (5× timeouts/60 s). | ≤ 5 s |
| 2. Decide | Orchestrator marks node unhealthy, stops scheduling it. | ≤ 1 s |
| 3. Redirect | Load balancer removes endpoint from pool; sticky sessions migrate. | ≤ 2 s |
| 4. Restore | Replacement pod/VM starts and passes readiness checks. | ≤ 40 s (hot standby: ≈ 0 s) |
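The Detect stage can be sketched as a sliding-window counter. The example below reads “5× timeouts/60 s” as “five probe timeouts within any 60-second window”, which is my interpretation rather than something the table spells out. The class name and parameters are hypothetical.

```python
from collections import deque


class TimeoutWindowDetector:
    """Flags a node as unhealthy after too many probe timeouts in a window."""

    def __init__(self, max_timeouts: int = 5, window_seconds: float = 60.0):
        self.max_timeouts = max_timeouts
        self.window_seconds = window_seconds
        self._timeouts = deque()  # timestamps of recent probe timeouts

    def record_timeout(self, now: float) -> bool:
        """Record a timeout at time `now`; return True if the node is unhealthy."""
        self._timeouts.append(now)
        # Forget timeouts that fell out of the sliding window.
        while now - self._timeouts[0] > self.window_seconds:
            self._timeouts.popleft()
        return len(self._timeouts) >= self.max_timeouts


detector = TimeoutWindowDetector()
for second in (0, 10, 20, 30, 40):
    unhealthy = detector.record_timeout(now=second)
    print(f"t={second}s unhealthy={unhealthy}")
```

Timestamps are passed in explicitly so the detector can be tested without waiting; in a real probe loop they would come from a monotonic clock.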
Of course, automatic failover is not a silver bullet and comes with drawbacks. The two most significant ones are the higher complexity of the design and increased costs. Redundancy is responsible for increased costs, while failover itself adds complexity.
It may sound bad. Unfortunately, without such a mechanism, we will not be able to provide high availability.
Tools For Availability
I have already covered automatic failover as a key tool to build available systems. However, it is not the only one. There are more tools and concepts, and you can find them below.
Replication
Replication is a method of implementing redundancy. The key difference is that redundancy impacts all layers of our system, from software to hardware, while replication is mostly related to the data layer.
We provide multiple up-to-date copies of the same dataset, usually split across multiple nodes. Thus, in case one of the nodes fails, the data is still available for the user.
There are two main types of Replication:
- Single-master/single-leader: Only one of the replica nodes, the leader, handles incoming writes. The rest of the nodes provide read access and can be used to offload part of the incoming traffic. The leader propagates changes to the other nodes, usually using some type of Write Ahead Log (WAL). If the leader node fails or becomes unavailable for some reason, a leader election takes place, and a new leader is selected from the up-and-running nodes.
- Multi-master/multi-leader: All the nodes accept both reads and writes at the same time. Writes are then propagated to the other nodes. The biggest problem in this case is that concurrent writes to the same data can end up on two different nodes at the same time. Thus, it requires a separate conflict resolution mechanism.
The concept of replication is a very extensive one. A good walkthrough and comparison of these two approaches is out of the scope of this article. However, I promise to dive deeper into replication in a separate article.
For now, remember the following table:
| Single-master | Multi-master |
|---|---|
| Only one node accepts writes | Multiple nodes accept writes |
| Propagation via WAL | Conflict resolution and propagation |
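The single-leader flow can be illustrated with a toy, in-memory model. This is only a sketch under strong assumptions: replication here is synchronous and in-process, and the election rule (pick the most up-to-date replica) is one possible choice, not the rule any particular database uses. All class and function names are hypothetical.

```python
class Replica:
    """A read-only node that replays the leader's log."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.applied_index = 0  # how many log entries were applied so far

    def apply(self, log: list) -> None:
        # Replay only the entries this node has not seen yet.
        for key, value in log[self.applied_index:]:
            self.data[key] = value
        self.applied_index = len(log)


class Leader(Replica):
    """The only node that accepts writes."""

    def __init__(self, name: str, followers: list):
        super().__init__(name)
        self.log = []  # stands in for the write-ahead log
        self.followers = followers

    def write(self, key: str, value) -> None:
        self.log.append((key, value))   # 1. record the change in the log
        self.apply(self.log)            # 2. apply it locally
        for follower in self.followers:
            follower.apply(self.log)    # 3. propagate it to the followers


def elect_new_leader(candidates: list) -> Replica:
    """Assumed rule: promote the replica that applied the most log entries."""
    return max(candidates, key=lambda replica: replica.applied_index)


followers = [Replica("replica-1"), Replica("replica-2")]
leader = Leader("leader", followers)
leader.write("user:1", "Alice")
print(followers[0].data)  # the follower can now serve this read
```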
Automatic Failover
An automatic and graceful (not noticeable by the user) failover mechanism is the key to availability.
To build a good automatic failover, we need to combine at least three concepts:
- Redundancy: We need more than one node to even start thinking of building any failover.
- Health checks: We need properly defined health checks to detect if nodes are down or should not handle user requests.
- Load balancer/actual failover: We need a way to take the failing components out of rotation and redirect the traffic to the up-and-running ones.
Each piece alone is insufficient; all must work together.
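A minimal sketch of how the three pieces fit together is shown below: several nodes (redundancy), a health-check callback, and a round-robin picker that skips unhealthy nodes (the failover). `FailoverPool`, `is_healthy`, and `pick` are hypothetical names, and a real load balancer would cache probe results instead of checking on every request.

```python
from itertools import count
from typing import Callable, Iterable


class FailoverPool:
    """Round-robin over redundant nodes, skipping the unhealthy ones."""

    def __init__(self, nodes: Iterable[str], is_healthy: Callable[[str], bool]):
        self.nodes = list(nodes)       # redundancy: more than one node
        self.is_healthy = is_healthy   # health check supplied by the caller
        self._counter = count()

    def pick(self) -> str:
        """Return the next healthy node, or fail if none is left."""
        healthy = [node for node in self.nodes if self.is_healthy(node)]
        if not healthy:
            raise RuntimeError("no healthy nodes available")
        return healthy[next(self._counter) % len(healthy)]


down = set()  # nodes currently failing their health check (simulated)
pool = FailoverPool(["node-a", "node-b"], is_healthy=lambda n: n not in down)

print(pool.pick())   # node-a
down.add("node-a")   # node-a starts failing its health check
print(pool.pick())   # node-b, traffic is redirected
```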
Isolating Failure
Another way to increase the availability of our system is to isolate failures. By doing so, we can ensure that a failure of one component will not cause the cascade failure of the other components involved in the same processing flow.
As with most concepts from this section, there is no single tool or method to achieve that. Instead, we can follow one of the patterns below. We can also mix different patterns.
Let's dive into them below:
- Circuit breaker: One of the most common microservices patterns in existence. It implements the fail-fast concept in a way similar to an electrical circuit breaker. If multiple consecutive calls to another service fail in a certain period, the circuit breaker trips. Then, for the duration of a timeout period, all attempts to invoke that service fail immediately. This reduces the load on the possibly faulty service and gives it time to recover. It also avoids introducing potential timeouts at other stages of the flow.
- Bulkhead: According to this pattern, components and resources in our system should be compartmentalized. Partitioning should be done in such a way that components do not share any resources. For example, each partition should have its own thread pools, connection pools, and CPU or memory limits. Such a split decreases the chances of one component overusing shared resources and impacting the other components in the system.
- Error kernel: We split our system into two types of components, core and side ones. The core ones must not fail for any reason. The side ones may fail, and we should be able to easily restart them. Then we can move the side ones to the “outskirts” of the system. Thus, we end up with a reliable core and easy-to-restart leaf components.
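The circuit breaker is the easiest of these patterns to show in code. The sketch below is a simplified version: it counts consecutive failures without a time period, and it lets a single trial call through once the timeout has passed (a common “half-open” refinement that the description above does not mention). Libraries such as Resilience4j offer production-grade versions; the class and parameter names here are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock        # injectable so tests do not have to sleep
        self._failures = 0         # consecutive failures seen so far
        self._opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast: do not touch the possibly faulty service.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(send_email, message)` where `send_email` is whatever function talks to the remote service, and treat `CircuitOpenError` as a signal to use a fallback.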
Multi-Region or Multi-Cloud Deployment
Multi-Availability Zone or Multi-Region deployment will protect us from the least expected type of failures: the ones that wipe out a whole data center, or multiple data centers located in a particular region, like the fire at the OVH data center in France or the GCP electrical problem in Iowa.
We can go even further and build a Multi-Cloud failover. If your core cloud provider is down, you can switch to a backup. While it adds a ton of extra complexity to your system, it reduces the probability of a system-wide failure even further. Region-wide failures are rare by themselves. Provider-wide failures are even rarer. Nevertheless, both may happen. Being able to handle them will probably not decide the difference between 99.99% and lower availability tiers.
However, being able to handle such events has a few advantages:
- You stay alive when others are down.
- It indicates how good your architecture is.
Chaos Engineering/Fault Injection
Chaos engineering will not actually help you build an available system by itself. Rather, it helps you ensure that your system is, in fact, available. By introducing deliberate and trackable failure, you can identify weaknesses and problems that will not show up in any other case. I also mentioned this concept here.
Just remember it is not fully safe, and double-check that your system will be able to handle it.
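At its simplest, fault injection can be a wrapper that makes a call fail with a configurable probability. The sketch below only illustrates the idea; dedicated chaos tooling does far more (killing nodes, adding latency, partitioning networks). `with_fault_injection` and `InjectedFault` are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """A deliberate, clearly labeled failure, easy to find in logs."""


def with_fault_injection(func, failure_rate: float, rng=None):
    """Wrap `func` so that it fails with probability `failure_rate`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id: int) -> dict:
    # Stand-in for a real downstream call.
    return {"id": user_id}


# Fail roughly 20% of calls to see how the caller copes.
chaotic_fetch = with_fault_injection(fetch_profile, failure_rate=0.2)
```

Starting with a low failure rate in a non-production environment is the safer way to try it out.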
Why We Fail To Achieve High Availability
After what, how, and why, it is time for why we fail. In my opinion and experience, there are a few factors that lead to our failure in building available systems.
Some reasons will be the same as in the case of my article on scalability.
- Ignoring the trade-offs: Every decision we make has short- and long-lasting consequences we have to be aware of. Of course, we can ignore them; still, we have to know them first and be conscious of why we are ignoring some potential drawbacks.
- Incorrect health checks: They react either too slowly or too quickly. Restarting a service too early or too late increases the likelihood of users experiencing the failure.
- Lack of redundancy: Critical components do not have properly configured redundancy.
- Badly designed failover: We are unable to redirect the traffic to the up-and-running nodes fast enough.
Below is a simple checklist that increases your chances of getting availability right:
| Do today | Impact |
|---|---|
| Add a health check to every component. | 30 min of work slashes 502 errors during deploys/failovers. |
| Track the availability product. | Makes hidden single points of failure painfully obvious. |
| Set a written SLO. | Aligns the team on what “good enough” means. |
| Run a failover drill. | Checks your design in practice. |
Summary
I have shared a number of concepts and approaches for building highly available systems.
Let's do a quick recap of key takeaways:
- Building highly available systems requires mixing different concepts, like redundancy, health checks, and failover.
- Proper health checks will help you keep up with the state of your components.
- Isolating failures and preventing their propagation will keep the system running even if some components fail.
- Multi-region deployment will save you in the most unexpected moment.
Some concepts discussed here can’t be implemented using a single tool. They require architectural thinking and coordination across layers of the stack.
| Concept | Tool |
|---|---|
| Replication | Usually part of the database product you are using |
| Automatic failover | K8s probes, cloud autoscaling products |
| Failure isolation | Resilience4j, K8s namespaces |
| Multi-AZ | Cloud providers’ Availability Zones |
High availability isn't just a metric — it is a mindset.
Build for failure. Monitor everything. And treat availability as a first-class feature.
I wish you luck in your struggle with availability. Thank you for your time.
Published at DZone with permission of Bartłomiej Żyliński. See the original article here.