What Is Availability? Theory, Problems, Tools, and Best Practices

What is availability? Today, I will answer that question, dive into challenges, and share best practices related to availability.

By Bartłomiej Żyliński · Jul. 24, 25 · Analysis

Availability is the measure of a system’s ability to stay up and running despite the failures of its parts. Today, I will explore this core trait of distributed systems. I will cover theory, challenges, tools, and best practices to ensure your system stays up and running against all odds.

Let's start with theory.

What Is Availability?

Availability describes how our systems handle failures and determines the system’s uptime. Usually, we describe the availability of a system in “nines” notation. For example, 99% availability allows at most 14.4 minutes of downtime per day, while 99.999% (the so-called five nines) reduces this to 864 milliseconds.

Most cloud services have an SLA that guarantees between three (99.9%) and five (99.999%) nines of availability for end users.

Availability (%) | Downtime per day (~) | Downtime per month (~) | Downtime per year (~)
90 | 144 minutes (2.4 hours) | 73 hours | 36.53 days
99 | 14 minutes | 7 hours | 3.65 days
99.9 | 1.5 minutes | 44 minutes | 8.77 hours
99.99 | 9 seconds | 4.4 minutes | 52.6 minutes
99.999 | 864 milliseconds | 26 seconds | 5.3 minutes
99.9999 | 86.4 milliseconds | 2.6 seconds | 31.5 seconds
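The figures in the table follow directly from the definition of availability. Below is a minimal sketch of that arithmetic. The function name `downtime_budget` and the average period lengths (30.44 days per month, 365.25 days per year) are my assumptions for illustration, chosen so the results line up with the approximate values above.

```python
from datetime import timedelta

# Average period lengths (assumed for this sketch).
PERIODS = {
    "day": timedelta(days=1),
    "month": timedelta(days=30.44),
    "year": timedelta(days=365.25),
}


def downtime_budget(availability_percent: float, period: str = "day") -> timedelta:
    """Return the maximum downtime allowed in `period` for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction
```

For example, `downtime_budget(99.999, "day")` gives roughly 0.864 seconds, which is the five-nines daily budget.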


Additionally, the term high availability or HA is used to describe services that have at least 3 nines of availability guarantees.

There is a famous struggle related to availability and consistency. The common notion is that in case of a failure, we can have either one or the other. While in most cases this is true, the topic as a whole is vastly more nuanced and complex. For example, CRDTs put this whole statement into question; the same is true for Google’s internal Spanner.

Moreover, we can use various techniques to balance both of these traits. A system may favor one over the other in certain places while not in others. Just remember: this struggle exists and is one of the most important cases of study in distributed systems research.

How To Measure Availability

Availability is probably the simplest trait to measure, at least for a single service. You probably already have uptime or downtime metrics in one of your dashboards. Take the daily uptime and divide it by the length of a day in the same unit: 24 (hours), 1,440 (minutes), or 86,400 (seconds). Et voilà, you have your service's daily uptime ratio (multiply it by 100 to get a percentage), and you can easily see how many nines you achieved.
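As a rough sketch of that calculation, assuming the dashboard gives you uptime in seconds per day: the helper names `daily_uptime_percent` and `count_nines` are hypothetical, and the small epsilon exists only to absorb floating-point noise around exact boundaries such as 99.9%.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_uptime_percent(uptime_seconds: float) -> float:
    """Convert one day's uptime in seconds into an availability percentage."""
    return 100 * uptime_seconds / SECONDS_PER_DAY


def count_nines(availability_percent: float) -> int:
    """Count the leading nines of an availability figure below 100%."""
    if not 0 <= availability_percent < 100:
        raise ValueError("availability must be in the range [0, 100)")
    unavailable_fraction = 1 - availability_percent / 100
    # -log10 of the unavailable fraction is the number of nines;
    # the epsilon guards against values like 1.9999999999999996.
    return math.floor(-math.log10(unavailable_fraction) + 1e-9)
```

A service that was down for 8.64 seconds in a day has `daily_uptime_percent(86400 - 8.64)` of about 99.99, which is four nines.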

Things get more complicated when our service has multiple dependencies, or when we want to measure availability on the scale of the whole system.

As an example, consider service A with two dependencies: DB and Email Service.

  • Service A has an uptime of 99.99%.
  • DB has an uptime of 99.9%.
  • The Email Service has an uptime of 99%.

Thus, the availability of service A is not 99.99% but in fact 98.89%: 0.9999 × 0.999 × 0.99 = 0.9889, or 98.89%.

In a more readable format:

Component | SLA (nines) | Availability (decimal)
Service A (front-end API) | 99.99% | 0.9999
Database | 99.9% | 0.9990
Email service | 99% | 0.9900
Composite A | 0.9999 × 0.9990 × 0.9900 = 0.9889 → 98.89% | 0.9889
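The same multiplication can be sketched in a few lines. The function name `composite_availability` is hypothetical, and the sketch assumes that the service needs every dependency to be up and that the components fail independently of each other.

```python
from math import prod


def composite_availability(*availabilities_percent: float) -> float:
    """Availability (in %) of a chain where every component must be up.

    Assumes independent failures, so the individual ratios simply multiply.
    """
    return 100 * prod(a / 100 for a in availabilities_percent)
```

Calling `composite_availability(99.99, 99.9, 99)` reproduces the roughly 98.89% from the example. Note that the result can never be higher than the lowest input.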


While the final difference may not look big, it clearly illustrates the point: the availability of a service is not a standalone value but the product of the availability of all its dependencies.

The same principle applies to the system. Availability of a system as a whole is a product of all its services and tools. Even a single poorly available component can bring the whole system down.

Weakest link | Best product you can ever reach
99 % (two nines) | < 99 %
99.9 % (three nines) | < 99.9 %
99.99 % (four nines) | < 99.99 %


Here is a quick note on how you can structure your availability-related metrics:

Tier | Example in an availability context
SLI (Indicator) | http_request_success_ratio = successful requests ÷ total requests
SLO (Objective) | http_request_success_ratio ≥ 99.95 % over 30 days
SLA (Agreement) | “We guarantee 99.9 % monthly availability; otherwise, you get service credits.”
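To show how the SLI and SLO tiers relate in practice, here is a minimal sketch. Only the metric name `http_request_success_ratio` and the 99.95% objective come from the table; `RequestWindow`, `meets_slo`, and the decision to treat a window without traffic as compliant are my assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts collected over the SLO window (for example, 30 days)."""
    successful: int
    total: int


def http_request_success_ratio(window: RequestWindow) -> float:
    """SLI: the share of requests that succeeded in the window."""
    if window.total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return window.successful / window.total


def meets_slo(window: RequestWindow, slo_percent: float = 99.95) -> bool:
    """SLO: check whether the SLI reached the objective for the window."""
    return 100 * http_request_success_ratio(window) >= slo_percent
```

Keeping the SLO stricter than the SLA, as in the table, leaves a safety margin before any service credits are due.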


Signs That the System Has Poor Availability

There are a couple of behaviors that indicate availability problems with our service. Some of them are similar to the signs of poor scalability.

  • Low uptime percentage: The most obvious of all. It directly shows that the service is down and users cannot access it.
  • Service “flapping”: The service oscillates between up and down as automated restarts or failovers repeatedly flip it in and out of rotation.
  • Health-check failures: Persistent probe timeouts under normal load mean the service is down or will be down in the near future.
  • High Mean Time To Recover: Outages last hours before the team can resolve them and bring the system back online.
  • Traffic suddenly drops to zero: The service is either down, or users gave up trying to connect.
  • Direct feedback: An important client is calling the CTO/CIO (or whoever else) complaining that everything is down, alerts start firing, and other interesting events follow.
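Two of the signs above, flapping and a high Mean Time To Recover, are easy to compute from an outage log. The sketch below assumes the log is a list of (went down, came back up) timestamp pairs. The function names and the default thresholds (three outages within ten minutes) are hypothetical and should be tuned to your system.

```python
from datetime import datetime, timedelta
from statistics import mean

Outage = tuple[datetime, datetime]  # (went down, came back up)


def mean_time_to_recover(outages: list[Outage]) -> timedelta:
    """Average duration of the recorded outages."""
    if not outages:
        return timedelta(0)
    return timedelta(seconds=mean((up - down).total_seconds() for down, up in outages))


def is_flapping(
    outages: list[Outage],
    window: timedelta = timedelta(minutes=10),
    threshold: int = 3,
) -> bool:
    """True if at least `threshold` outages started within any single `window`."""
    starts = sorted(down for down, _ in outages)
    for i in range(len(starts) - threshold + 1):
        if starts[i + threshold - 1] - starts[i] <= window:
            return True
    return False
```

Many short outages give a low MTTR but trigger the flapping check, while a single long outage does the opposite, so it is worth tracking both.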

The Availability Game Changers

In my opinion, the game changer for availability is automatic and graceful failover. While it sounds simple, it is actually quite complex. To achieve it, we need to combine multiple different concepts and make them work together. Nonetheless, it is crucial for providing a zero-downtime experience.

The anatomy of a state-of-the-art zero-downtime failover:

Stage | What happens | Typical target time
1. Detect | Health probe sees anomalies (5× timeouts/60 s). | ≤ 5 s
2. Decide | Orchestrator marks node unhealthy, stops scheduling it. | ≤ 1 s
3. Redirect | Load balancer removes endpoint from pool; sticky sessions migrate. | ≤ 2 s
4. Restore | Replacement pod/VM starts and passes readiness checks. | ≤ 40 s (hot standby: ≈ 0 s)


Of course, automatic failover is not a silver bullet and comes with drawbacks. The two most significant ones are the higher complexity of the design and increased costs. Redundancy is responsible for increased costs, while failover itself adds complexity.

It may sound bad. Unfortunately, without such a mechanism, we will not be able to provide high availability.

Tools For Availability

I have already covered automatic failover as a key tool to build available systems. However, it is not the only one. There are more, and you can find them below.

Replication

Replication is a method to implement redundancy. The key difference is that redundancy impacts all layers of our system, from software to hardware, while replication is mostly related to the data layer.

We provide multiple up-to-date copies of the same dataset, usually split across multiple nodes. Thus, in case one of the nodes fails, the data is still available for the user.

There are two main types of replication:

  • Single-master/single-leader: Only one of the replica nodes, the leader, handles incoming writes. The rest of the nodes provide read access and can be used to offload part of the incoming traffic. The leader propagates changes to the other nodes, usually using some type of Write-Ahead Log (WAL). If the leader node fails or becomes unavailable for some reason, a leader election takes place, and a new leader is selected from the up-and-running nodes.
  • Multi-master/multi-leader: All the nodes accept both reads and writes at the same time. Writes are then propagated to the other nodes. The biggest problem in this case is that conflicting writes to the same data can arrive at two different nodes at the same time. Thus, it requires a separate conflict resolution mechanism.

The concept of replication is a very extensive one. A good walkthrough and comparison of these two approaches is out of the scope of this article. However, I promise to dive deeper into replication in a separate article.

For now, remember the following table:

Single-master | Multi-master
Only one node accepts writes | Multiple nodes accept writes
Propagation via WAL | Conflict resolution and propagation
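To make the single-leader column more tangible, here is a toy in-memory sketch. All class and method names are hypothetical. The list of writes stands in for a WAL, propagation is synchronous, and the "election" simply promotes the live replica with the longest log. Real databases rely on consensus protocols for that step, so treat this only as an illustration of the flow.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    log: list[tuple[str, str]] = field(default_factory=list)  # ordered writes
    data: dict[str, str] = field(default_factory=dict)
    alive: bool = True

    def apply(self, entry: tuple[str, str]) -> None:
        """Append the write to the log, then apply it to the local data."""
        self.log.append(entry)
        key, value = entry
        self.data[key] = value


class SingleLeaderCluster:
    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = list(replicas)
        self.leader = self.replicas[0]

    def elect_leader(self) -> None:
        """Simplified election: promote the live replica with the longest log."""
        candidates = [r for r in self.replicas if r.alive]
        if not candidates:
            raise RuntimeError("no live replica to promote")
        self.leader = max(candidates, key=lambda r: len(r.log))

    def write(self, key: str, value: str) -> None:
        """Only the leader accepts writes; followers receive them in log order."""
        if not self.leader.alive:
            self.elect_leader()
        entry = (key, value)
        self.leader.apply(entry)
        for replica in self.replicas:
            if replica is not self.leader and replica.alive:
                replica.apply(entry)
```

If the current leader is marked as not alive, the next `write` triggers an election and continues on the promoted follower, so the data stays available for users.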


Automatic Failover

An automatic and graceful (not noticeable by the user) failover mechanism is the key to availability.

To build a good automatic failover, we need to combine at least three concepts:

  1. Redundancy: We need more than one node to even start thinking of building any failover.
  2. Health checks: We need properly defined health checks to detect if nodes are down or should not handle user requests.
  3. Load balancer/actual failover: We need a way to take the failing components out of rotation and redirect the traffic to the up-and-running ones.

Each piece alone is insufficient; all must work together.
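The sketch below shows how the three pieces can fit together in their simplest form: a pool of redundant nodes, health-check results fed into the balancer, and round-robin routing that skips unhealthy nodes. The class names and the threshold of three consecutive failed probes are assumptions for illustration, not the behavior of any particular load balancer.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    consecutive_failures: int = 0
    healthy: bool = True


class LoadBalancer:
    """Round-robin balancer that only routes to nodes marked healthy."""

    def __init__(self, nodes: list[Node], failure_threshold: int = 3) -> None:
        self.nodes = list(nodes)
        self.failure_threshold = failure_threshold
        self._next = 0

    def record_probe(self, node: Node, success: bool) -> None:
        """Feed one health-check result and update the node's state."""
        if success:
            node.consecutive_failures = 0
            node.healthy = True
            return
        node.consecutive_failures += 1
        if node.consecutive_failures >= self.failure_threshold:
            node.healthy = False  # stop sending traffic to this node

    def pick(self) -> Node:
        """Return the next healthy node, or raise if none is left."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.healthy:
                return node
        raise RuntimeError("no healthy nodes available")
```

Requiring several consecutive failures before removing a node is one way to avoid reacting to a single slow probe. A node returns to the pool as soon as a probe succeeds again.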

Isolating Failure

Another way to increase the availability of our system is to isolate failures. By doing so, we can ensure that a failure of one component will not cause a cascading failure of the other components involved in the same processing flow.

As with most concepts from this section, there is no single tool or method to achieve that. Instead, we can follow one of the patterns below. We can also mix different patterns.

Let's dive into them below:

  • Circuit breaker: One of the most common microservices patterns in existence. It implements the fail-fast concept in a way similar to an electrical circuit breaker. If multiple consecutive calls to another service fail within a certain period, the circuit breaker trips. Then, for the duration of a timeout period, all attempts to invoke that service fail immediately. This reduces the load on the possibly faulty service and gives it time to recover. It also avoids introducing potential timeouts in other stages of the flow.
  • Bulkhead: According to this pattern, components and resources in our system should be compartmentalized. Partitioning should be done in such a way that components do not share any resources. For example, each partition should have its own thread pools, connection pools, and CPU or memory limits. Such a split decreases the chances of one component overusing resources and impacting the other components in the system.
  • Error kernel: We split our system into two types of components: core and side ones. The core ones must not fail for any reason. The side ones may fail, and we should be able to easily restart them. Then we can move the side ones to the “outskirts” of the system. Thus, we end up with a reliable core and easy-to-restart leaf components.
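Of the three patterns, the circuit breaker is the easiest to show in code. Below is a minimal sketch, not the implementation used by Resilience4j or any other library. The class names, the default thresholds, and the injectable `clock` parameter are my assumptions. For brevity it counts consecutive failures without the time-period condition, and after the timeout it lets a single trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to keep failing fast
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast without touching the possibly faulty service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call, for example `breaker.call(fetch_user, user_id)` where `fetch_user` is whatever function talks to the other service, and treat `CircuitOpenError` as a signal to use a fallback instead of waiting for a timeout.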

Multi-Region or Multi-Cloud Deployment

Multi-Availability Zone or Multi-Region deployment will protect us from the least expected type of failures: the ones that wipe out a whole data center, or multiple data centers located in a particular region, like the fire at the OVH data center in France or the GCP electrical problem in Iowa.

We can go even further and build a Multi-Cloud failover. If your core cloud provider is down, you can switch to a backup. While it adds a ton of extra complexity to your system, it drastically reduces the probability of a system-wide failure. Region-wide failures are rare by themselves. Provider-wide failures are even rarer. Nevertheless, both may happen. Being able to handle them probably will not decide the difference between 99.99% and lower availability tiers.

However, being able to handle such events has a few advantages:

  • You stay alive when others are down.
  • It indicates how good your architecture is.
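For a rough feel of what redundant regions can buy in theory, the availability of components working in parallel can be estimated as one minus the probability that all of them are down at once. The function name `redundant_availability` is hypothetical. The sketch assumes independent failures and instant, perfect failover, so real-world numbers will be lower.

```python
from math import prod


def redundant_availability(*availabilities_percent: float) -> float:
    """Availability (in %) when a single healthy replica is enough.

    Theoretical upper bound: assumes independent failures and ignores
    the time needed to detect a failure and switch over.
    """
    all_down = prod(1 - a / 100 for a in availabilities_percent)
    return 100 * (1 - all_down)
```

Two independent regions at 99.9% each give `redundant_availability(99.9, 99.9)` of about 99.9999%, which is an idealized ceiling rather than a figure to promise in an SLA.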

Chaos Engineering/Fault Injection

Chaos engineering will not actually help you build an available system by itself. Rather, it helps you ensure that your system is, in fact, available. By introducing deliberate and trackable failures, you can identify weaknesses and problems that will not show up in any other case. I also mentioned this concept here.

Just remember that it is not fully safe, so double-check that your system will be able to handle the injected failures.
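At the smallest scale, fault injection can be a wrapper that makes a call fail on purpose with a configured probability. The sketch below is only an illustration of the idea and not a replacement for a chaos engineering tool. `InjectedFault`, `with_fault_injection`, and the `enabled` switch are hypothetical names.

```python
import random


class InjectedFault(RuntimeError):
    """Marks a failure that was injected on purpose, so it can be tracked."""


def with_fault_injection(func, failure_rate, rng=None, enabled=lambda: True):
    """Wrap `func` so that a share of calls fails with InjectedFault."""
    if not 0 <= failure_rate <= 1:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()
    stats = {"injected": 0, "calls": 0}

    def wrapper(*args, **kwargs):
        stats["calls"] += 1
        # `enabled` acts as a kill switch to stop the experiment at any time.
        if enabled() and rng.random() < failure_rate:
            stats["injected"] += 1
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    wrapper.stats = stats  # exposed so injected failures stay trackable
    return wrapper
```

Using a dedicated exception type and a counter keeps the injected failures distinguishable from real ones, and the kill switch is there because such experiments are not fully safe.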

Why We Fail To Achieve High Availability

After what, how, and why, it is time for why we fail. In my opinion and experience, there are a few factors that lead to our failure in building available systems.

Some reasons will be the same as in the case of my article on scalability.

  • Ignoring the trade-offs: Every decision we make has short- and long-lasting consequences we have to be aware of. Of course, we can ignore them; still, we have to know them first and be conscious of why we are ignoring some potential drawbacks.
  • Incorrect health checks: They react either too slowly or too quickly. Restarting a service too early or too late increases the likelihood of users experiencing the failure.
  • Lack of redundancy: Critical components do not have properly configured redundancy.
  • Badly designed failover: We are unable to redirect the traffic to the up-and-running nodes fast enough.

Below is a simple checklist that increases your chances of achieving high availability:

Do today | Impact
Add a health check to every component. | 30 min of work slashes 502 errors during deploys/failovers.
Track the availability product. | Makes hidden single points of failure painfully obvious.
Set a written SLO. | Aligns the team on what “good enough” means.
Run a failover drill. | Checks your design in practice.


Summary

I have shared a number of concepts and approaches for building highly available systems.

Let's do a quick recap of key takeaways:

  • Building highly available systems requires mixing different concepts like redundancy, health checks, and failover.
  • Proper health checks will help you keep up with the state of your components.
  • Isolating failures and preventing their propagation will keep the system running even if some components fail.
  • Multi-region deployment will save you in the most unexpected moment.

Some concepts discussed here can’t be implemented using a single tool. They require architectural thinking and coordination across layers of the stack.

Concept | Tool
Replication | Usually part of the database product you are using
Automatic failover | K8s probes, cloud autoscaling products
Failure isolation | Resilience4j, K8s namespaces
Multi-AZ | Cloud providers' availability zones


High availability isn't just a metric — it is a mindset.
Build for failure. Monitor everything. And treat availability as a first-class feature.


I wish you luck in your struggle with availability. Thank you for your time.


Published at DZone with permission of Bartłomiej Żyliński. See the original article here.