On February 28th, Amazon Web Services had several issues that led to global outages in the AWS stack. This started at around 9:40 AM Pacific time, and was resolved by 2:00 PM Pacific. During this outage, many popular websites faced availability and functionality issues, as more than 30% of the web relies upon AWS for some aspect of its functionality. While major AWS outages are rare, when they happen they can be devastating. As such, we wanted to explore why the AWS outage was such a problem for many websites, and look at mitigation strategies to prevent the next failure.
All Your Functionality Under One Roof
One of the benefits of using AWS to manage your web application is that it gives you access to all of the functionality you need to build a scalable and robust application, from load balancing through to storage. Let’s take a look at a sample architecture for a Ruby on Rails web application, built on top of AWS:
- Domain name resolution resolves to an AWS Elastic Load Balancer (ELB).
- The AWS ELB directs the traffic to one of several Elastic IPs.
- Each Elastic IP points to an Amazon EC2 instance running a popular web server, like nginx or Apache, which serves the application server code.
- The application server interacts with a database in Amazon RDS or Amazon DynamoDB to store application data.
- The application can also interact with Amazon Elasticsearch Service to ease searching and record location.
- The application can also integrate with a third party logging or user action tracking provider (which has a high probability of also being deployed on AWS).
This combination of tools gives developers and DevOps engineers a powerful set of functionality that they can use to build, maintain, and deploy their web applications. By having everything in one location, you reduce the knowledge load on your R&D and DevOps teams and increase overall velocity, while the sterling reputation of AWS gives you a measure of confidence that you’ve made the right choices in your architecture.
The problem with this setup should be obvious. If your app is built entirely on AWS, any outage, no matter how minor, has the chance to affect your application’s user base. For example, if Amazon Elasticsearch Service goes down, your users may no longer be able to search for products in your catalog, losing you sales. Or if an EC2 instance behind a load balancer goes down, you can lose current session data, forcing a user to restart a purchase process. While even minor outages can have an effect, the real problem is the compounding that occurs when two or more AWS services go down at once. During the recent outage, Amazon itself was unable to effectively update its own status dashboard.
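To make the compounding effect concrete, here is a minimal sketch that maps user-facing features to the services they depend on and reports which features are affected when services fail. The feature names and the dependency map are hypothetical, chosen to mirror the sample architecture above; a real application would need its own map.

```python
# Hypothetical map of user-facing features to the services each one needs.
# The service names mirror the sample architecture; they are not AWS identifiers.
FEATURE_DEPENDENCIES = {
    "browse_catalog": {"elb", "ec2", "rds"},
    "search_products": {"elb", "ec2", "elasticsearch"},
    "checkout": {"elb", "ec2", "rds"},
    "activity_tracking": {"elb", "ec2", "third_party_logging"},
}


def affected_features(down_services):
    """Return the sorted names of features that need at least one failed service."""
    down = set(down_services)
    return sorted(
        feature
        for feature, needs in FEATURE_DEPENDENCIES.items()
        if needs & down  # non-empty intersection: a required service is down
    )


if __name__ == "__main__":
    # One service down affects one feature...
    print(affected_features(["elasticsearch"]))
    # ...while two services down at once affect most of the application.
    print(affected_features(["elasticsearch", "rds"]))
```

In this toy map, losing search alone affects a single feature, while losing search and the database together affects three of the four.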
So at this point, we can easily demonstrate that while having everything under one roof can provide development speed and maintenance gains, it does expose your app to risks of cascading failures. The AWS outage led to popular communications providers like Slack experiencing issues with uploading and sharing files, for example. When the web is this interconnected, it pays to have backup plans. Let’s look at two different mitigation strategies – one within AWS, and one leveraging multiple providers.
Mitigating Disaster Within AWS
Tuesday’s failure was primarily located in the US-East-1 region. This region is often used by Amazon to roll out and test its new features, and posters on Hacker News have reported that it has historically had issues around new software rollouts and old hardware. Many DevOps engineers use a single region to ease maintenance and support of a website. When you are front-line support at a startup, this is often sufficient to get things running so that your engineers can move on to more pressing business problems. However, if you host all of your services within the same AWS region, you are highly susceptible to that region’s stability issues, and when it goes down you lose access to your entire application. So the most obvious strategy is to avoid tying your application to a region with a history of problems. Simply choose one of the other AWS regions available to your account, and update your internal practices to reflect the change.
Another strategy is to duplicate your stack across multiple AWS regions and use the other regions as a fallback when one goes down. Routing traffic between EC2 instances hosted in separate AWS regions can often mitigate the failures that occur when a single region is unavailable. Note that an Amazon ELB distributes traffic across Availability Zones within a single region, so switching between regions is usually handled one level up, for example with DNS failover in Amazon Route 53. By dispersing your risk among different AWS regions, you can reduce the overall impact of a single region’s failure.
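The fallback decision itself can be sketched independently of whichever service performs the routing. The example below is a minimal illustration, not AWS’s implementation: `pick_region` and the `is_healthy` callback are hypothetical names, and the health check is assumed to be something you supply, such as an HTTP probe against each region’s copy of the stack.

```python
from typing import Callable, Iterable, Optional


def pick_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, whose health check passes.

    Returns None when no region is healthy.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    return None


if __name__ == "__main__":
    # Stubbed health status standing in for real probes (assumption).
    status = {"us-east-1": False, "us-west-2": True}
    print(pick_region(["us-east-1", "us-west-2"], lambda r: status[r]))
```

Listing the regions in preference order keeps traffic in the primary region whenever it is healthy and only moves it when the check fails.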
Diversifying Your Risk
While spreading your components across multiple regions can provide you with a more reliable recovery scenario when things go wrong, it does ignore one major potential risk: if the entirety of AWS is unavailable, it doesn’t matter what region your app is located in. This is the risk of keeping all of your technical infrastructure under one vendor. While Amazon takes excellent, proactive steps to ensure that its infrastructure is redundant and can fail over gracefully, you’re still subject to issues that affect the AWS stack as a whole. To fully prevent this kind of outage, you’ll need to work on exposing alternatives at each point in the tech stack:
- For load balancing, you’ll need an option outside of AWS. This can be as straightforward as using another provider’s load balancer (such as Google Cloud Load Balancing), or as complex as building your own load balancer using proprietary tools.
- For machine hosting, again you’ll want to look at alternatives to EC2 (or AWS Lambda, for serverless applications). These can be other cloud service providers like Microsoft Azure (which can also serve as a replacement for container services like Heroku, since many run on AWS), or even machine hosting providers like Rackspace.
- For data storage, you have a variety of options. Plenty of other cloud providers give you access to a scalable and reliable database accessible from everywhere, but you can also build your own DB server using a service like Rackspace to provide hosting for your DBMS.
The key is that for every element of your tech stack, there is an alternative to AWS that can serve your needs just as well. Ensuring full resilience and availability of your application requires spreading your risk across multiple providers, and handling the case when one (or many) of these providers fail.
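Handling the case where a provider fails usually comes down to trying an alternative and recording why the first attempt did not work. The following is a minimal sketch of that pattern; `call_with_fallback`, `AllProvidersFailed`, and the stubbed operations are hypothetical names for illustration, and a real version would wrap your actual client calls for each provider.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[], T]]],
) -> Tuple[str, T]:
    """Try each (name, operation) pair in order and return the first success.

    Returns the name of the provider that answered along with its result.
    """
    errors = []
    for name, operation in providers:
        try:
            return name, operation()
        except Exception as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def read_from_primary() -> bytes:
        # Stub simulating an outage at the primary provider (assumption).
        raise ConnectionError("region unavailable")

    def read_from_secondary() -> bytes:
        # Stub standing in for a second provider's storage client (assumption).
        return b"cached catalog"

    print(call_with_fallback([
        ("primary", read_from_primary),
        ("secondary", read_from_secondary),
    ]))
```

This only covers reads from a second provider. Keeping data in sync between providers is a separate problem that the sketch does not address.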
If you’re running a web application, the availability of your app is paramount to your project’s success. Keeping everything under one roof with a provider like AWS gives you the ability to focus exclusively on one stack, letting your engineers be more efficient, but it also exposes you to downtime whenever that provider has problems. While high-availability options can be expensive, if your sole source of revenue is traffic to your website then you are ultimately faced with the following question: how much is not having downtime worth to me? If you’re willing to invest the time, effort, and money to ensure you have alternatives at every level of your app’s tech stack, then you can build a high-availability application that you can feel confident will survive most outages.
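One rough way to answer that question is to put numbers on it. The sketch below uses the outage window reported above (about 9:40 AM to 2:00 PM Pacific). The revenue, outage frequency, and redundancy cost in the usage example are purely hypothetical placeholders, and the model assumes that an outage stops all revenue for its full duration.

```python
from datetime import timedelta

# Outage window from the article, as offsets from midnight Pacific time.
OUTAGE_START = timedelta(hours=9, minutes=40)
OUTAGE_END = timedelta(hours=14)


def outage_minutes(start: timedelta = OUTAGE_START, end: timedelta = OUTAGE_END) -> float:
    """Length of the outage window in minutes."""
    return (end - start).total_seconds() / 60


def downtime_cost(revenue_per_hour: float, minutes: float) -> float:
    """Revenue lost if no traffic can be served for the given number of minutes."""
    return revenue_per_hour * minutes / 60


def redundancy_pays_off(
    revenue_per_hour: float,
    minutes: float,
    outages_per_year: float,
    yearly_redundancy_cost: float,
) -> bool:
    """True when the expected yearly loss exceeds the yearly cost of redundancy."""
    expected_loss = downtime_cost(revenue_per_hour, minutes) * outages_per_year
    return expected_loss > yearly_redundancy_cost


if __name__ == "__main__":
    minutes = outage_minutes()
    # Hypothetical figures: 1,200 per hour in revenue, one such outage a year,
    # and 10,000 per year spent on a second provider.
    print(minutes, downtime_cost(1200, minutes))
    print(redundancy_pays_off(1200, minutes, 1, 10_000))
```

With these placeholder figures a single outage of this length costs roughly 5,200 in lost revenue, so the redundancy spend would not pay for itself on lost revenue alone. The answer changes quickly as revenue or outage frequency goes up.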