Incident Review – AWS Outage Led To Spikes In Response Times For Applications Using AWS Services
The AWS incident on Tuesday lasted approximately four hours. This post aims to give you a clearer picture of what happened, when, and how.
On Tuesday, August 31, users across large parts of the West coast (US-West-2 region) were impacted by major spikes in response time. Some of AWS’ most critical services were affected, including Lambda and Kinesis.
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are at the core of SRE practice. However, unlike Google, where SRE originated, most companies rely on outside providers such as AWS and GCP, along with one or more CDNs, for their infrastructure and services.
This means that you not only need SLIs and SLOs for your own applications and services, you also need to take a close look at your providers and vendors, because your SLIs and SLOs depend on them. If you are hosted in the cloud, an issue at your cloud vendor can translate directly into you missing your SLOs.
By monitoring your vendor SLOs, you can understand their impact on your SLOs and system architecture in order to properly deliver the level of experience you are aiming for.
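To make that dependency concrete, here is a minimal Python sketch of the arithmetic. It assumes your service is up only when every dependency is up, treats dependency failures as independent, and uses a 30-day window. The function names and the SLO targets are illustrative assumptions, not figures published by AWS; only the roughly four-hour duration comes from this incident.

```python
from math import prod


def composite_availability(availabilities):
    """Best-case availability of a service that needs every dependency up.

    Assumes independent failures and a strictly serial dependency chain.
    """
    return prod(availabilities)


def error_budget_minutes(slo, window_days=30):
    """Minutes of downtime an SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(outage_minutes, slo, window_days=30):
    """Fraction of the window's error budget used by one outage (1.0 = all of it)."""
    return outage_minutes / error_budget_minutes(slo, window_days)


# Hypothetical targets: your own stack at 99.95%, a vendor at 99.9%.
ceiling = composite_availability([0.9995, 0.999])

# A four-hour (240-minute) vendor outage measured against a 99.9% monthly SLO.
used = budget_consumed(240, 0.999)
```

Under these assumptions, a 99.9% monthly SLO allows about 43.2 minutes of downtime, so a four-hour vendor outage uses more than five times that budget by itself. The `ceiling` value also shows that your achievable availability sits below that of your weakest dependency, which is why vendor SLOs belong on the same dashboard as your own.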
The AWS incident on Tuesday lasted approximately four hours and created a widespread series of headaches across the entire US-West-2 region, from websites being down to site features being unusable to difficulties logging into applications. Companies, developers, and DevOps teams shared their angst on social media and news sites. Those commenting included The Seattle Times, major gaming company Zwift, and SaaS platform Ubiquiti.
Catchpoint Detects and Alerts on AWS Outages First
This blog post aims to give you a clearer picture of what happened, when, and how.
Through our proactive monitoring platform, we first detected issues for our customers at 11:00 AM PDT on Tuesday. Our data analysis revealed widespread connectivity failures in the US-West-2 region. We immediately triggered our first alert, a full 25 minutes before AWS recognized the issue. AWS’ first mention on their status page that they were investigating the issue came at 11:25 AM PDT.
Unlike other observability platforms, Catchpoint is not hosted on a cloud provider, so when a cloud provider has an incident impacting their solutions, we are not impacted. Our platform will continue to work, alerting you as soon as we detect any problem.
AWS Status Dashboard Showed Increased Latencies And Connectivity Issues
The AWS Service Health Dashboard revealed increased provisioning latencies to Amazon Elastic Load Balancing in Oregon and AWS Internet connectivity issues in the same region.
Impacted AWS Services
Impacted AWS services included Lambda, ELB, Kinesis, RDS, CloudWatch, and ECS.
Incident Hit US-WEST-2 Region
Only users of US-WEST-2, the Oregon region, were impacted (including users in Seattle, where Amazon is headquartered). There are two other AWS Regions on the West Coast: Northern California and AWS GovCloud. Neither of them was affected.
Root Cause Identified: Network Connectivity Issues
At 2:26 PM PDT, the root cause of the issue affecting network connectivity in the US-WEST-2 region was identified by AWS as, “a component within the subsystem responsible for the processing of network packets for Network Load Balancer.” This led to impairment of the NAT Gateway and PrivateLink services, “no longer processing health checks successfully” and further performance degradation.
Going back to our dataset, we can also include additional metrics to validate that the cause of the outage was a network connectivity issue. Catchpoint offers 50+ metrics that allow you to narrow down issues to a specific component. You can then answer the question, “Is it the network or is it the application that is causing the problem?”
In this case, you can see that the overall response time spiked because of an increase in connect time to the servers, which is driven by the network. However, the load and wait times, which reflect server processing time and are therefore indicative of application or server-side issues, are flat with no spikes.
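The reasoning above can be sketched as a small Python check that compares a baseline window with an incident window. This is an illustration under stated assumptions, not Catchpoint's implementation or API: the field names `connect_ms` and `wait_ms`, the use of medians, and the 2x threshold are all hypothetical choices.

```python
from statistics import median


def classify_spike(baseline, incident, ratio=2.0):
    """Guess whether a response-time spike is network- or application-side.

    baseline, incident: lists of dicts holding "connect_ms" and "wait_ms"
    samples (hypothetical field names). A metric counts as spiking when its
    median in the incident window is at least `ratio` times the baseline median.
    """

    def spiked(metric):
        before = median(sample[metric] for sample in baseline)
        during = median(sample[metric] for sample in incident)
        return during > before and during >= ratio * before

    connect_up = spiked("connect_ms")
    wait_up = spiked("wait_ms")

    if connect_up and wait_up:
        return "both"
    if connect_up:
        return "network"      # connect time rose, server wait stayed flat
    if wait_up:
        return "application"  # server wait rose, connect time stayed flat
    return "none"
```

With made-up samples where connect time jumps from roughly 20 ms to 900 ms while wait time stays near 100 ms, the function returns `"network"`, which matches the pattern described for this outage. Medians are used here so that a single slow sample does not flip the result; a production check would likely also look at percentiles and sample counts.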
Do You Have End-to-End Monitoring In Place To Detect Such Outages?
The latest outage of the summer serves as a reminder for organizations to evaluate and verify their own infrastructure setup, including their monitoring, observability, and failover strategies. It’s also worth taking a beat to ensure you don’t rely on cloud-only monitoring strategies, which can lead to blind spots.
You can reduce noise and save time and resources by deploying a holistic monitoring and observability strategy, one that lets you detect outages and performance issues from anywhere, in real time.
Prevent Single Points of Failure
Ultimately, Service Level Indicators and Service Level Objectives are not just for your own services. They also apply to your third-party providers and to everything in your infrastructure that is a single point of failure. This is why legal departments make sure there are SLA clauses in contracts with cloud providers.
Published at DZone with permission of Navya Dwarakanath, DZone MVB. See the original article here.