Setting the Reliability Standard
Your customers depend on you, their service provider, to stay up and running. Learn how to set a high-reliability standard in this article.
Thomas Reid once wrote, “a chain is no stronger than its weakest link.” This holds true for any system with interdependent links, whether it’s a literal chain or a chain of dependencies in a software application. If one link breaks, the load comes crashing down.
For SaaS, PaaS, IaaS, and other service providers, this concept can make or break a business. Customers of these services expect providers to offer high levels of availability and performance, as any problems could cause their own services to fail. In other words, their ability to serve customers and generate revenue is directly dependent on the ability of their service provider to do the same. This puts significant pressure on providers to maximize the availability of their services or risk losing customers to more reliable competitors.
As a SaaS provider, you are your customers' infrastructure. Downtime doesn't just hurt your bottom line; it also hurts your customers and your reputation.
To avoid this, service providers need to proactively set a high standard for availability. We’ll explain how in this blog post.
What Is 'The Reliability Standard'?
Every link in a chain has a certain point of failure. If too much weight is added, or if a link is in poor shape (rust, casting imperfections, bad welding, etc.), it will break. Since links are directly connected to each other, a failure in one will break the entire chain.
This also applies to software. If our customers depend on our service, then any outage in our service will cause theirs to fail. We need to ensure that we provide a consistently reliable and performant experience for all of our customers. This might sound obvious, but let's consider a real-world example of a SaaS outage that significantly impacted customers.
Slack is a digital communications platform used by over 750,000 companies. In a recent incident, slow performance made it hard for teams to chat, share files, and search messages, all essential tasks in today's remote-first work environments. Slack holds itself to a reliability standard of 99.99% availability or higher, which allows for roughly 13 minutes of downtime per quarter. As a result, incidents like this one are rare and addressed quickly.
How Do You Set a Reliability Standard?
Setting a standard for reliability is an organization-wide effort that involves:
Fostering a culture of reliability.
Setting reliability targets.
Identifying potential failure modes in our services.
Validating our incident response and disaster recovery plans.
Let’s look at each of these in more detail.
Fostering a Culture of Reliability
Reliability isn’t just the responsibility of engineering teams, but of the whole organization, and needs to be an executive-level KPI. An unreliable product or service can have significant consequences for the entire business. Because of this, everyone needs to understand the importance of reliability in terms of business success. Indicators that are important to the business include:
Reliability improvements tracked against budget spend.
Changes in net promoter score (NPS) and the customer experience.
Latency percentiles, such as the median (p50) and 90th percentile (p90) of request latency (see the sketch after this list).
Uptime and latency.
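For teams starting to track these indicators, here is a minimal sketch of computing percentile latency from raw request samples using the nearest-rank method. The function name and sample data are illustrative, not taken from any particular monitoring tool:

```python
import math

def latency_percentile(samples_ms, percentile):
    """Return the given percentile of a list of latency samples (in ms).

    Uses the nearest-rank method: sort the samples, then pick the value
    at rank ceil(percentile/100 * N).
    """
    if not samples_ms:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples_ms)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[rank - 1]

# Illustrative request latencies (ms) from a recent time window
samples = [12, 15, 14, 110, 13, 16, 250, 14, 15, 13]
print(f"p50: {latency_percentile(samples, 50)} ms")  # typical request
print(f"p90: {latency_percentile(samples, 90)} ms")  # tail latency
```

Percentiles matter because averages hide the tail: in the sample above the median is 14 ms, but the slowest tenth of requests is roughly an order of magnitude slower.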
Committing to reliability might require the organization to think differently about engineering velocity and efficiency. For example, a common challenge is when leadership prioritizes other tasks, such as building new features, at the expense of reliability. In other words, reliability is sacrificed in favor of feature velocity. This creates friction and can lead to situations where reliability testing is pushed to the end of development, increasing the risk of customer-impacting incidents and outages. At the end of the day, features are only as valuable as they are reliable for customers. We want to maximize feature velocity while also meeting our reliability standards.
One way we have seen organizations have this conversation is by creating dashboards for applications that show the relationship between their reliability and their efficiency. This empowers the entire organization to identify when to invest in reliability efforts to maximize business outcomes, and it aligns business stakeholders with reliability initiatives.
Setting Reliability Targets
Next, we need a way to define reliability targets and measure progress towards those targets. This helps us demonstrate to our customers that we’re meeting the standards we created.
Many service providers use service level agreements (SLAs), which contractually promise a minimum quality of service to customers. They also provide customers with recourse if we fail to meet that level, typically as a credit on their account. SLAs often define service quality as uptime: for example, an SLA promising 99% availability means that our service can only have around 7 hours of downtime each month.
If our service is a critical component of our customers’ stack, the implication here is that our customers can only be as reliable as we are. In other words, they can promise at most 99% availability to their customers. Customers demanding high reliability will likely avoid services like these in favor of more reliable providers. This is why major providers like AWS, GCP, and Azure tend to offer SLAs of 99.99% or higher. To make sure we’re meeting our SLAs, we should set our internal reliability targets (our service level objectives, or SLOs) higher than our SLAs. If our SLA is 99%, our SLO should be at least 99.5% (3.65 hours of downtime per month). This lets us go above and beyond our customers’ expectations, while also providing leeway in case of a major outage.
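To make the arithmetic concrete, here is a minimal sketch that converts availability targets into downtime budgets. It assumes a 30-day month for simplicity:

```python
def downtime_budget_hours(availability_pct, period_hours):
    """Allowed downtime (hours) for an availability target over a period."""
    return period_hours * (1 - availability_pct / 100)

MONTH_HOURS = 30 * 24  # simplifying assumption: a 30-day month

for target in (99.0, 99.5, 99.9, 99.99):
    hours = downtime_budget_hours(target, MONTH_HOURS)
    print(f"{target}% availability -> {hours:.2f} h "
          f"({hours * 60:.0f} min) of downtime per month")
```

Running this shows how quickly the budget shrinks: each added nine cuts the allowance by a factor of ten, from about 7 hours per month at 99% down to roughly 4 minutes at 99.99%.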
Identifying Potential Failure Modes
Modern applications are complex and can fail in unpredictable ways. This is especially true as engineering teams increasingly migrate to distributed, cloud-native systems like Kubernetes and OpenShift. These systems provide significant benefits over traditional application architectures, but they also add unknown variables and failure modes (more links in our already fragile chain of dependencies). To reduce our risk of outages, we need to proactively find and address failure modes before putting our services or our customers at risk.
This is where Chaos Engineering helps. Chaos Engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of harm for the purpose of improving its resilience. By observing how the system responds to this harm, we can implement changes and harden the system against these types of failures. We also learn more about how our systems behave, which is especially helpful when migrating to a new platform or architecture.
Chaos Engineering helps uncover the weaknesses in our links that we couldn’t have found through traditional testing. Chaos Engineering solutions like Gremlin let us run experiments in a controlled way and see how our systems react. Our engineers can then address these weaknesses, strengthening the chain and letting us promise a higher reliability standard for our customers.
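The following is a generic sketch of that experiment loop, not Gremlin's API: it measures a baseline, injects a fixed amount of network latency with Linux's `tc netem` (root privileges required), checks the service against a hypothetical latency objective, and always rolls the fault back. The URL and thresholds are illustrative:

```python
import subprocess
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"  # hypothetical service under test
LATENCY_SLO_MS = 500                          # hypothetical per-request objective

def measure_latency_ms(url):
    """Time a single request to the service and return the latency in ms."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as response:
        response.read()
    return (time.monotonic() - start) * 1000

# Steady state: confirm the service is healthy before injecting any harm.
print(f"baseline latency: {measure_latency_ms(SERVICE_URL):.0f} ms")

# Inject a precise, measured amount of harm: 300 ms of delay on loopback.
subprocess.run(
    ["tc", "qdisc", "add", "dev", "lo", "root", "netem", "delay", "300ms"],
    check=True,
)
try:
    degraded = measure_latency_ms(SERVICE_URL)
    print(f"latency under fault: {degraded:.0f} ms")
    # The experiment's hypothesis: the service still meets its objective.
    if degraded > LATENCY_SLO_MS:
        print("finding: service breached its latency objective under fault")
finally:
    # Always roll back the injected fault, even if the experiment errors out.
    subprocess.run(["tc", "qdisc", "del", "dev", "lo", "root"], check=True)
```

The important properties are the small blast radius (one host, one fault type), the explicit hypothesis, and the guaranteed rollback; dedicated tooling automates these safeguards at fleet scale.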
Creating and Validating Response Plans
No matter how resilient our applications are, incidents will happen. Links will break, but instead of letting the load fall, we need to be ready to quickly fix the chain. Our first step is to develop incident response plans (also called playbooks or runbooks) to guide engineers through resolving incidents quickly and effectively. In the case of disasters (such as data center outages and floods), we use disaster recovery plans to restore service as soon as possible after an unprecedented event. Having response plans helps us reclaim uptime by preventing extensive outages and reduces the stress placed on our engineers during incidents.
The challenge with planning for incidents is validating that our plans work. Having an untested playbook is just as bad as having no playbook, since we can’t guarantee it will work in a real-world incident. Of course, nobody wants to wait for an incident only to find that our plans are missing a crucial step. Instead, we can use Chaos Engineering to proactively test and validate our plans without having to put our systems, customers, or business at risk.
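One lightweight way to keep a playbook from going stale is to encode its verification steps as an executable check and run it during scheduled GameDays. This sketch assumes a hypothetical playbook whose steps can be verified with shell commands; the hostnames and paths are made up for illustration:

```python
import subprocess

# Hypothetical playbook: each step pairs a description with a command that
# must succeed for the recovery procedure to be considered valid.
PLAYBOOK = [
    ("database failover target is reachable", ["ping", "-c", "1", "db-replica.internal"]),
    ("latest backup snapshot exists", ["test", "-f", "/backups/latest.snapshot"]),
    ("restore script is executable", ["test", "-x", "/opt/runbooks/restore.sh"]),
]

def validate_playbook(steps):
    """Run each verification command; report steps that would fail in a real incident."""
    failures = []
    for description, command in steps:
        result = subprocess.run(command, capture_output=True)
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"[{status}] {description}")
        if result.returncode != 0:
            failures.append(description)
    return failures

if __name__ == "__main__":
    broken = validate_playbook(PLAYBOOK)
    if broken:
        raise SystemExit(f"playbook has {len(broken)} stale step(s); fix them before the next incident")
```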
Reliability is an incremental process, not a one-and-done deal. It starts with getting our teams focused on building reliable systems, then moves towards ensuring our systems are resilient and recoverable even in the most extreme conditions. Throughout this process, our goal is to provide the best possible quality of service for our customers so that they can feel confident putting their trust in our platforms. By proactively testing our applications and systems, we can strengthen each link and build a stronger chain.