Aftermath of the AWS S3 Outage—An Interview With Nick Kephart
AWS S3 suffered an extended outage on Tuesday, Feb. 28th, 2017. DZone interviewed Nick Kephart of ThousandEyes to bring some clarity to the situation.
Earlier this week (Tuesday, Feb. 28th, 2017), AWS S3 experienced a complete outage from 9:40 AM to 12:36 PM PST in the US-EAST-1 region. A large number of online services and websites were disrupted during this period—DZone.com was one of them.
Can you begin by briefly providing us an overview of what happened?
- Starting at 9:40 AM PST, the availability of S3 immediately dropped from normal levels to 0%. At the same time, packet loss jumped to 100%. Both availability and packet loss remained at those levels for the entirety of the outage.
Can you tell us what S3 means for AWS, noting the impact of the outage on specific AWS services that were affected during this time?
- AWS S3 (Simple Storage Service) is a cloud object storage solution that many services rely on to store and retrieve files from anywhere on the web. Many other AWS services that depend on S3 — Elastic Load Balancers, Redshift data warehouse, Relational Database Service, and others — also had limited to no functionality.
- Because S3 stores files, one of the fundamental building blocks of AWS, many diverse AWS services were impacted.
What do you think the root cause of this outage was and can you tell us about why some of the suspected culprits (DDoS attacks and data center outages) are not likely?
- Because of the immediacy of the performance impacts, the root cause of the S3 outage looks much more likely to be an internal network issue. It could be an internal misconfiguration or infrastructure failure, with symptoms that manifest themselves on the network layer.
- This outage does not match the patterns typically seen in outages caused by DDoS attacks. Generally, with a DDoS attack, we would see packet loss and availability issues increase over time, and then be variable as mitigation efforts are implemented. Rather, in this case we saw packet loss and availability immediately peak and then remain consistent.
- Also, we saw traffic terminating within the AWS US-East infrastructure, rather than terminating at peering connections with other networks. The latter is a symptom that would be more indicative of a DDoS attack.
What were some of the major companies and online services negatively affected by this outage? And what would you estimate a day of outages means, monetarily, for some of the larger companies?
- We don't have an estimate of the number of sites affected in this case because the failures were so diverse. Some of the impacted sites include media (National Geographic, iCloud), retail (Ann Taylor), education (Coursera), manufacturing (GE), and software (Slack).
What about the monetary repercussions for Amazon… any estimates there? What kind of guarantees do they offer and must they deliver on them?
- Based on the S3 SLA, this outage (roughly 3 hours) likely pushed S3 below the 99.9% availability threshold. AWS may therefore be on the hook for credits of up to 10% of monthly charges for US-East-1 S3, which is likely one of their most popular services, as well as for other affected AWS services. It's possible this SLA impact is in the millions or tens of millions of dollars.
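As a rough illustration of that SLA math, the sketch below assumes a 30-day billing month and the credit tiers published in the S3 SLA at the time (10% service credit below 99.9% uptime, 25% below 99.0%); the function names are illustrative, not an AWS API:

```python
# Rough sketch of S3 SLA credit math for a single customer.
# Assumptions: a 30-day billing month and the 2017-era S3 SLA tiers
# (10% service credit below 99.9% uptime, 25% below 99.0%).

HOURS_IN_MONTH = 30 * 24  # 720

def monthly_availability(downtime_hours: float) -> float:
    """Fraction of the month the service was up."""
    return (HOURS_IN_MONTH - downtime_hours) / HOURS_IN_MONTH

def sla_credit_pct(availability: float) -> int:
    """Service credit as a percent of the monthly S3 bill."""
    if availability < 0.990:
        return 25
    if availability < 0.999:
        return 10
    return 0

# A ~3-hour outage (9:40 AM to 12:36 PM PST):
avail = monthly_availability(3)
print(f"availability: {avail:.4%}")             # ~99.58%, below the 99.9% tier
print(f"credit: {sla_credit_pct(avail)}% of the monthly bill")
```

Three hours of downtime out of 720 leaves the month at about 99.58% availability, which lands in the 10% credit tier.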
Though the use of S3 services generally occurs on the back-end and isn’t necessarily apparent to end users, many services’ dependencies on S3 left them vulnerable during this outage, and the end users suffered from a variety of failures. Can you talk about the different ways in which services may depend on S3 (directly hosted vs. object hosting vs. critical sub-services as mentioned here) and how the outage affected these different types of dependencies?
- S3 is one of the core building blocks of a lot of Amazon services. Whether it’s used for simply storing files or objects, or for serving stored content to a website or application, complex dependency chains can form, and a failure in S3 cascades through them.
- Critical sub-services are where the dependencies become more complicated. Because S3 is an object store for files, it can impact many parts of an application, including:
- User session management (cookies, load balancers, stateful firewalls)
- Media storage (images, videos, PDFs, documents, music)
- Content storage (text that is directly displayed on page)
- User data (profiles, records)
- Third-party objects (scripts, fonts, ads)
- Automation (scaling, alerting, monitoring services)
In your opinion, are companies relying on S3 too heavily? And, what alternatives are available to those providing services that depend on S3—is it the service itself, the way companies are using the service, or both that caused the problem here? Are there setups and/or backups for S3 that you would recommend for mitigating any negative effects of future outages?
- IT needs to plan for redundancy for critical services wherever possible, and this outage is an example of why that’s a best practice.
- AWS’s system was built to be redundant, automatically replicating stored objects and files across data centers. Another level of redundancy would require leveraging additional AWS regions or alternative cloud providers. This would add more management complexity and cost overhead, as the user would be responsible for synchronizing data. Most organizations have not taken this step, aside from data backups, which are often not useful in the short term.
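One way that extra level of redundancy shows up in application code is a read path that falls back to a replica in another region (or another provider) when the primary store fails. A minimal sketch, with the storage clients abstracted as plain callables; the names are hypothetical, not an AWS API, and in practice each fetcher could wrap an S3 client pinned to a different region:

```python
# Hypothetical cross-region read fallback. Each "fetcher" is any
# callable that returns the object bytes for a key, or raises on
# failure -- e.g. wrappers around storage clients for different regions.

def get_with_fallback(fetchers, key):
    """Try each store in order; return the first successful read."""
    errors = []
    for name, fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # real code would catch the client's error type
            errors.append((name, exc))
    raise RuntimeError(f"all stores failed for {key!r}: {errors}")

# Stubbed usage: the primary region is down, the replica serves the read.
def primary_down(key):
    raise ConnectionError("503 Service Unavailable")

replica = {"logo.png": b"\x89PNG..."}
stores = [("us-east-1", primary_down), ("us-west-2", replica.__getitem__)]
print(get_with_fallback(stores, "logo.png"))  # b'\x89PNG...'
```

The trade-off mentioned above applies: someone has to keep the replica in sync, which is exactly the management overhead most organizations have declined to take on.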
What can we learn from this experience?
- The move to cloud has delivered huge improvements in stability and resiliency. However, it has also multiplied non-obvious dependencies, making it likely that a failure in one service will impact many others. Cloud developers and operations teams should review key dependencies on major cloud providers, develop a monitoring strategy for affected services, and determine architectural changes to improve redundancy for critical services.
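A monitoring strategy for those dependencies can start very simply: enumerate the external services you rely on and probe each one. A minimal sketch, with the probe injected as a callable so it could wrap an HTTP health check, a read of a canary object in S3, or anything else; the dependency names are illustrative:

```python
# Minimal dependency health-check sketch. `probe` is any callable that
# returns True when the dependency is healthy; a probe that raises is
# treated as the dependency being down.

def check_dependencies(endpoints, probe):
    """Return a {endpoint: healthy?} map for the given dependencies."""
    status = {}
    for ep in endpoints:
        try:
            status[ep] = bool(probe(ep))
        except Exception:
            status[ep] = False
    return status

# Stubbed usage with an illustrative dependency list,
# simulating the S3 outage:
deps = ["s3.us-east-1", "cdn", "auth-provider"]
fake_probe = lambda ep: ep != "s3.us-east-1"
print(check_dependencies(deps, fake_probe))
# {'s3.us-east-1': False, 'cdn': True, 'auth-provider': True}
```

Even a probe this crude would have flagged the S3 dependency within seconds of the outage starting, rather than leaving teams to infer it from downstream failures.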
Thanks for the interview, Nick.
Opinions expressed by DZone contributors are their own.