Azure Outage Post-Mortem — Part 3
As Microsoft publishes the root cause analysis from the recent Azure outage, we discuss what happened and why recovery wasn't prioritized.
My previous blog posts, Azure Outage Post-Mortem - Part 1 and Azure Outage Post-Mortem Part 2, made some assumptions based upon limited information coming from blog posts and Twitter. I just attended a session at Ignite which gave a little more clarity as to what actually happened. Sometime soon, you should be able to view the session for yourself.
BRK3075 — Preparing for the Unexpected: Anatomy of an Azure Outage
They said the official Root Cause Analysis will be published soon, but in the meantime, here are some tidbits of information gleaned from the session.
The outage was NOT caused by a lightning strike, as previously reported. Instead, the storm caused electrical sags and swells, which locked out a chiller plant in the first datacenter. During this first outage, they were able to recover the chiller quickly with no noticeable impact. Shortly thereafter, a second outage hit a second datacenter. That one was not recovered properly, and it began an unfortunate series of events.
During this second outage, Microsoft states that "engineers didn't triage alerts correctly — chiller plant recovery was not prioritized." There were numerous alerts being triggered at this time, and unfortunately, the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.
Microsoft states that, of course, redundant chiller systems are in place. However, the cooling systems were not set to fail over automatically. Newly installed equipment had not been fully tested, so it was set to manual mode until testing was complete.
After 45 minutes, the ambient cooling failed, hardware shut down, air handlers shut down because they thought there was a fire, and staff were evacuated due to the false fire alarm. During this time, the temperature in the datacenter kept rising, and some hardware was not shut down properly, causing damage to some storage and networking equipment.
After staff manually reset the chillers and opened the air handlers, the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.
The biggest issue was the damage to storage. Microsoft's primary concern is data protection, so short of the entire datacenter sinking into a sinkhole or a meteor strike taking out the datacenter, Microsoft will work to recover data to ensure no data loss. This, of course, took some time, which extended the overall length of the outage. The good news is that no customer data was lost. The bad news is that it seemed to take 24-48 hours for things to return to normal, based upon what I read on Twitter from customers complaining about the prolonged outage.
Everyone expected that this outage would impact customers hosted in the South Central Region, but what they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discusses some of the extended reach of the outage.
Azure Service Manager (ASM) - This controls Azure "Classic" resources, AKA, pre-ARM resources. Anyone relying on ASM could have been impacted. It wasn't clear to me why this happened, but it appears that South Central Region hosts some important components of that service which became unavailable.
Visual Studio Team Services (VSTS) - Again, it appears that many resources that support this service are hosted in the South Central Region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps, in this blog post.
Azure Active Directory (AAD) - When the South Central region failed, AAD did what it was designed to do and started directing authentication requests to other regions. As the East Coast started to wake up and come online, authentication traffic started picking up. Normally, AAD would handle this increase in traffic through autoscaling, but the autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exacerbating the situation was a bug in Office clients that gave them very aggressive retry logic and no backoff logic. This additional authentication traffic eventually brought AAD to its knees.
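To make the retry problem concrete, here is a minimal sketch of what client-side backoff could look like: capped exponential backoff with full jitter. This is an illustration only, not the Office clients' code or Microsoft's fix. The names `backoff_delays` and `call_with_backoff`, the default values, and the choice of `ConnectionError` as the retryable failure are all assumptions made for the example.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield one wait time in seconds per retry (hypothetical helper).

    The ceiling doubles on every attempt up to `cap`, and the actual wait
    is a random fraction of that ceiling so clients do not retry in lockstep.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_backoff(operation, max_retries=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, waiting longer after each ConnectionError.

    Assumption: ConnectionError stands in for "the auth service is
    overloaded". Once the retries are used up, the last error is re-raised.
    """
    delays = backoff_delays(max_retries, base, cap, rng)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: give up instead of hammering the service
            sleep(delay)
```

A client without backoff retries immediately and indefinitely, so every failed request turns into more load on a service that is already struggling. With a scheme like the one sketched here, each failure pushes the next attempt further out, and the jitter spreads the retries of many clients over time.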
They ran out of time to discuss this further during the Ignite session, but one feature they plan to introduce is the ability for users to fail over Storage Accounts manually. So in the case where recovery time objective (RTO) is more important than recovery point objective (RPO), the user will be able to recover their asynchronously replicated geo-redundant storage in an alternate datacenter should Microsoft experience another extended outage.
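The RTO-versus-RPO trade-off behind a manual failover can be sketched as a simple check. This is a hypothetical illustration, not an Azure API: `RecoveryObjectives`, `should_fail_over`, and the minute-based inputs are names invented for the example. It assumes you can estimate how far behind the asynchronous replica is.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Hypothetical container for a workload's recovery targets."""
    rto_minutes: float  # longest downtime the business can tolerate
    rpo_minutes: float  # largest window of lost writes it can tolerate


def should_fail_over(objectives, outage_minutes, replication_lag_minutes):
    """Return True when failing over to the replica is the better trade.

    Failing over to an asynchronously replicated copy loses whatever had
    not replicated yet, so it only makes sense when the outage has already
    exceeded the RTO and the replica's lag still fits within the RPO.
    """
    downtime_exceeded = outage_minutes > objectives.rto_minutes
    loss_acceptable = replication_lag_minutes <= objectives.rpo_minutes
    return downtime_exceeded and loss_acceptable
```

For example, a workload with a 60-minute RTO and a 15-minute RPO that has been down for three hours with a replica five minutes behind would fail over, while the same workload with a replica 30 minutes behind would keep waiting for the primary to be recovered.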
Until that time, you will have to rely on other replication solutions such as SIOS DataKeeper, Azure Site Recovery, or application-specific replication solutions which give you the ability to replicate data across regions and put the ability to enact your disaster recovery plan in your control.
Published at DZone with permission of David Bermingham. See the original article here.
Opinions expressed by DZone contributors are their own.