Azure Outage Post-Mortem — Part 3

As Microsoft publishes the root cause analysis from the recent Azure outage, we discuss what happened and why recovery wasn't prioritized.

By David Bermingham · Oct. 01, 18 · News

My previous blog posts, Azure Outage Post-Mortem Part 1 and Azure Outage Post-Mortem Part 2, made some assumptions based on the limited information available from blog posts and Twitter. I just attended a session at Ignite that gave a little more clarity as to what actually happened. Sometime soon, you should be able to view the session for yourself:

BRK3075 — Preparing for the Unexpected: Anatomy of an Azure Outage

The presenters said the official Root Cause Analysis will be published soon, but in the meantime, here are some tidbits of information gleaned from the session.

The outage was NOT caused by a lightning strike, as previously reported. Instead, the electrical storm caused power sags and swells that locked out a chiller plant in the first datacenter. During this first outage, they were able to recover the chiller quickly with no noticeable impact. Shortly thereafter, there was a second outage at a second datacenter that was not recovered properly, and that began an unfortunate series of events.

During this second outage, Microsoft states that "engineers didn't triage alerts correctly — chiller plant recovery was not prioritized." Numerous alerts were being triggered at the time, and unfortunately, the offline chiller did not receive the priority it should have. The root cause of why that happened is still being investigated.

Microsoft states that, of course, redundant chiller systems are in place. However, the cooling systems were not set to fail over automatically. Recently installed equipment had not been fully tested, so it had been left in manual mode until testing was complete.

After 45 minutes, the ambient cooling failed, hardware shut down, the air handlers shut down because they detected what appeared to be a fire, and staff were evacuated due to the false fire alarm. During this time, the temperature in the datacenter kept rising, and some hardware was not shut down properly, causing damage to some storage and networking equipment.

After manually resetting the chillers and opening the air handlers, the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was the damage to storage. Microsoft's primary concern is data protection, so short of the entire datacenter sinking into a sinkhole or being taken out by a meteor strike, Microsoft will work to recover data to ensure no data loss. This, of course, took some time, which extended the overall length of the outage. The good news is that no customer data was lost; the bad news is that it seemed to take 24-48 hours for things to return to normal, based on what I read on Twitter from customers complaining about the prolonged outage.

Everyone expected this outage to impact customers hosted in the South Central region, but what they did not expect was that it would have an impact outside of that region. In the session, Microsoft discussed some of the extended reach of the outage.

Azure Service Manager (ASM) - This controls Azure "Classic" resources, i.e., pre-ARM resources. Anyone relying on ASM could have been impacted. It wasn't clear to me why this happened, but it appears that the South Central region hosts some important components of that service, which became unavailable.

Visual Studio Team Services (VSTS) - Again, it appears that many of the resources that support this service are hosted in the South Central region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering for Azure DevOps, in this blog post.

Azure Active Directory (AAD) - When the South Central region failed, AAD did what it was designed to do and started directing authentication requests to other regions. As the East Coast started to wake up and come online, authentication traffic picked up. Normally, AAD would handle this increase in traffic through autoscaling, but autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exacerbating the situation was a bug in Office clients that gave them very aggressive retry logic and no backoff logic. This additional authentication traffic eventually brought AAD to its knees.
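The "no backoff logic" detail is worth dwelling on, because it is a classic way for clients to amplify an outage. The Office client code isn't public, but as a rough illustration of the difference, here is a minimal Python sketch of retrying an authentication call with exponential backoff and jitter; authenticate() is a hypothetical placeholder for whatever token request the client makes, not a real Office or AAD API.

import random
import time

def authenticate_with_backoff(authenticate, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Retry a flaky call with exponential backoff and jitter.

    authenticate is a placeholder callable standing in for the client's
    token request; any exception is treated as a transient failure.
    """
    for attempt in range(max_attempts):
        try:
            return authenticate()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Back off exponentially (1s, 2s, 4s, ...) capped at max_delay,
            # with random jitter so thousands of clients don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

An aggressive client, by contrast, retries immediately and repeatedly with no delay, which is exactly the kind of traffic that piled onto an AAD service that could no longer scale out.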

They ran out of time to discuss this further during the Ignite session, but one feature they will be introducing is the ability for users to fail over Storage Accounts manually. So in cases where the recovery time objective (RTO) is more important than the recovery point objective (RPO), users will be able to recover their asynchronously replicated, geo-redundant storage in an alternate datacenter should Microsoft experience another extended outage.
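If the feature ships the way it was described in the session, triggering a failover from the Azure management SDK for Python might look roughly like the sketch below. This is an assumption for illustration, not a documented API at the time of writing; the subscription, resource group, and account names are placeholders.

# Sketch only: trigger a customer-initiated failover of a geo-redundant
# storage account to its secondary region. Assumes the azure-identity and
# azure-mgmt-storage packages; values in angle brackets are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
account_name = "<storage-account>"

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Failover is a long-running operation. Because geo-replication is
# asynchronous, any writes not yet replicated to the secondary are lost;
# you are trading RPO for RTO, exactly as described above.
poller = client.storage_accounts.begin_failover(resource_group, account_name)
poller.result()  # blocks until the secondary has been promoted to primary

The point of the sketch is the trade-off in the comments: promoting the asynchronously replicated secondary accepts some data loss in exchange for getting back online sooner.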

Until that time, you will have to rely on other replication solutions, such as SIOS DataKeeper, Azure Site Recovery, or application-specific replication, which let you replicate data across regions and put control of your disaster recovery plan in your own hands.


Published at DZone with permission of David Bermingham. See the original article here.

Opinions expressed by DZone contributors are their own.
