Lessons Learned from the November AWS Outage
An analysis and recommendations for cloud-based architecture, based on data from the recent Amazon Web Services infrastructure outage.
Context, Analysis, and Impact
- Amazon’s internet infrastructure service experienced a multi-hour outage on Wednesday, November 25th, that affected a large portion of the internet.
- More than 50 companies were impacted, including Roku, Adobe, Flickr, Twilio, Tribune Publishing, and Amazon’s smart security division, Ring. The outage occurred in the AWS region covering the eastern U.S.
- Business impacts, as reported by The Washington Post, included:
  - New account activation and the mobile app for streaming media service Roku were hampered.
  - Target-owned delivery service Shipt could still receive and process some orders, though it said it was taking steps to manage capacity because of the outage.
  - Photo storage service Flickr tweeted that customers couldn’t log in or create an account because of the AWS outage.
- Root Cause Analysis by AWS: The outage started with Amazon Kinesis and then spread to a long list of services that depend on it. AWS has published an RCA document with the full details.
#1: Don't Put All Your Eggs in One Basket
- Relying on a single cloud service provider leaves you exposed when that provider has an outage like this one.
- Consider a hybrid-cloud, private-cloud, or multi-cloud strategy, particularly for peak season.
#2: Hope for the Best and Plan for the Worst
- Don't just rely on a cloud provider's availability and multi-region fail-over strategy; build your own resiliency and disaster recovery approach.
- Practice disaster recovery in production, or in production-like systems, for example by running an active-active setup across multi-cloud or hybrid-cloud environments.
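A disaster recovery drill of the kind described above can be rehearsed in miniature. The following is a minimal sketch, not a production design: `call_with_failover`, `primary_cloud_put`, and `secondary_cloud_put` are hypothetical names, and the two functions stand in for real clients of two independent providers. The primary is made to fail on purpose so the failover path gets exercised.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails the request."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, result)
    from the first one that succeeds."""
    errors: Dict[str, Exception] = {}
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # real code should catch narrower errors
            errors[name] = exc
    raise AllProvidersFailed(errors)


# Hypothetical stand-ins for clients of two independent providers.
def primary_cloud_put() -> str:
    # Simulated outage, injected deliberately for the drill.
    raise ConnectionError("primary region unavailable")


def secondary_cloud_put() -> str:
    return "stored"


used, result = call_with_failover(
    [("primary", primary_cloud_put), ("secondary", secondary_cloud_put)]
)
print(used, result)  # prints: secondary stored
```

In a real drill the injected failure would more likely be a blocked network route or a disabled endpoint, but the check is the same: the request must still succeed through the second provider.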
#3: Monitoring and Observability Are Not Static
- Be innovative in exploring monitoring and observability patterns. For example, if AWS reports an outage on its status page, your monitoring system should spring into action and alert the incident resolution team to start analyzing the impact.
- Keep your service dependency graph ready. Tools can generate most of it, but you should keep it up to date so that, when an outage happens, you can assess the impact, map it to business functionality, and report it accurately to your business team.
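As an illustration of the dependency-graph point above, here is a minimal sketch of impact analysis. All service names, the `DEPENDS_ON` graph, and the `BUSINESS_FUNCTIONS` set are hypothetical; in practice the graph would come from your tracing or service-catalog tooling. Given one failed service, the code walks the graph in reverse to find everything that depends on it, directly or transitively.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical graph: each key depends directly on the services listed.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["orders-api", "payments-api"],
    "orders-api": ["event-stream"],
    "payments-api": ["auth"],
    "auth": ["event-stream"],
    "product-search": ["search-index"],
    "event-stream": ["aws-kinesis"],
}

# Hypothetical set of nodes that correspond to business functionality.
BUSINESS_FUNCTIONS: Set[str] = {"checkout", "product-search"}


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that depends on `failed`, directly or not."""
    # Invert the edges: for each dependency, list the services that use it.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk upward from the failed service.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


impacted = impacted_by("aws-kinesis", DEPENDS_ON)
print(sorted(impacted))
# prints: ['auth', 'checkout', 'event-stream', 'orders-api', 'payments-api']
print(sorted(impacted & BUSINESS_FUNCTIONS))  # prints: ['checkout']
```

The second result is what the business team needs: in this made-up graph, checkout is affected while product search is not.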
#4: Invest in Emerging Techniques, like Chaos Engineering
- This failure indicates that even internet giants like AWS are still maturing in practices like chaos engineering. So, start putting chaos engineering practices into your roadmap.
- For example, had a bulkhead pattern been in place in the AWS outage scenario, the outage might have been limited to the Kinesis services.
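For readers unfamiliar with the bulkhead pattern mentioned above, the following is a minimal sketch of the general idea, not a description of how AWS services are built. Each dependency gets its own small pool of concurrency slots, so a hung dependency can exhaust only its own pool. The `Bulkhead` class and the `stream` and `auth` names are hypothetical.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFull(Exception):
    """Raised when a bulkhead has no free slot for another call."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Fail fast instead of queueing behind a dependency that is stuck.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            return fn()
        finally:
            self._slots.release()


stream = Bulkhead("stream", max_concurrent=1)
auth = Bulkhead("auth", max_concurrent=1)

entered = threading.Event()
release = threading.Event()


def hung_stream_call() -> str:
    # Simulates a call that hangs while holding the only "stream" slot.
    entered.set()
    release.wait(timeout=5)
    return "stream-ok"


worker = threading.Thread(target=stream.call, args=(hung_stream_call,))
worker.start()
entered.wait(timeout=5)  # make sure the slot is really taken

try:
    stream.call(lambda: "stream-ok")
    stream_rejected = False
except BulkheadFull:
    stream_rejected = True

# The "auth" compartment has its own slots and is unaffected.
auth_result = auth.call(lambda: "auth-ok")

release.set()
worker.join()
print(stream_rejected, auth_result)  # prints: True auth-ok
```

A chaos experiment can then verify the compartments hold: inject a hang into one dependency and confirm that calls to the others keep succeeding.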
To conclude, responding proactively when outages occur, having a response team equipped for unplanned outages, and continuously improving from lessons learned are essential to keeping the impact limited. A multi-cloud or hybrid-cloud strategy is also worth considering to keep the business running.
All data and information provided on this site is for informational purposes only. This site makes no representations as to accuracy, completeness, correctness, suitability, or validity of any information on this site and will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use. All information is provided on an as-is basis. This is a personal weblog. The opinions expressed here represent my own and not those of my employer or any other organization.
Published at DZone with permission of Ankur Kumar, DZone MVB. See the original article here.