What the COVID-19 Pandemic Has Taught Us About Cloud Disaster Recovery

In this article, I will share my story of how the realities of the cloud became apparent during the COVID-19 pandemic.

Cloud computing provides opportunities for organizations to respond to unexpected situations with on-demand infrastructure and “unlimited” scale. However, in a true disaster situation, the physical practicalities of “unlimited scale” start to show.

The date is sometime in March 2020, and I’ve just given the green light for the deployment of another HDInsight cluster (a managed Hadoop offering from Microsoft) to our production environment. I am the platform architect on the team, which is preparing for the long-awaited go-live of the next phase of the Azure-based data platform. Suddenly I get a phone call from the DevOps lead that goes a bit like this:

Him: “Hi Jonny, I’ve got a provisioning error on the new production cluster, I can’t deploy it.”

Me: “What sort of error?”

Him: “The cluster won’t scale up, it can’t get enough nodes.”

Me: “We checked the quotas yesterday, right?”

Him: “Yes, we have enough quota, but we just can’t get enough nodes to scale the cluster.”

The situation, it turned out, was that the Azure North Europe data centre was full. Microsoft, like all cloud providers, oversubscribes its physical infrastructure: it can grant VM and CPU core quotas to customers knowing that they will not all try to consume that capacity at once... until suddenly they did.
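To make the gap between quota and physical capacity concrete, here is a minimal sketch of a region that has granted more quota than it has hardware. The customer names, the numbers and the first-come, first-served allocation rule are all assumptions made for illustration; this is not a description of how Azure actually allocates capacity.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    quota_vcpus: int      # what the provider has granted (hypothetical)
    requested_vcpus: int  # what the customer is trying to run right now


def allocate(customers, physical_vcpus):
    """Grant requests first come, first served until the hardware runs out.

    Returns a dict mapping each customer name to the vCPUs actually allocated.
    """
    remaining = physical_vcpus
    allocated = {}
    for c in customers:
        wanted = min(c.requested_vcpus, c.quota_vcpus)  # quota is only an upper bound
        granted = min(wanted, remaining)                # hardware is the real limit
        allocated[c.name] = granted
        remaining -= granted
    return allocated


# Hypothetical region: 1,000 physical vCPUs, but 1,500 vCPUs of quota granted.
normal = [Customer("a", 500, 200), Customer("b", 500, 250), Customer("c", 500, 300)]
surge = [Customer("a", 500, 500), Customer("b", 500, 500), Customer("c", 500, 400)]

print(allocate(normal, 1000))  # {'a': 200, 'b': 250, 'c': 300}
print(allocate(surge, 1000))   # {'a': 500, 'b': 500, 'c': 0}
```

In the surge case, customer "c" is well within quota and still receives nothing, which is the position we found ourselves in.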

Countries all around Europe are suddenly being put into lockdown and organizations have to quickly react to their entire workforce being asked to work from home. With only a few days’ notice, IT departments have to respond to an unprecedented peak in demand for VDIs and collaboration tooling — and they turn to the cloud in droves — after all, that’s what cloud computing is for.

Microsoft Windows Virtual Desktop (a cloud-based Windows 10 remote working solution) had, with fortunate timing, recently reached General Availability, and IT departments rushed to deploy remote desktop solutions. Microsoft Teams offers a scalable and seamless collaboration and teleconferencing solution, but with all meetings suddenly moving online, that spike had to be served by physical infrastructure somewhere.

The result was a massive spike in demand for compute in the Azure data centres, and that demand could not be met for all customers. In addition to being unable to deploy new resources, some customers had difficulty starting existing ones: for example, a VM that shut down overnight and started on a schedule could not be brought up again in the morning.

I spoke to the Microsoft account team that looks after this particular client, and they told me that the capacity management team was aware of the situation and was prioritising capacity for customers in healthcare and emergency services. The situation looked bleak: I was told that more hardware was on order, but problems with supply chains from China were affecting delivery times.

Fortunately, the account manager was able to represent my client at the daily capacity management meetings and make a case for providing the necessary capacity. I was also told that Microsoft moved 20,000 vCPUs’ worth of internal workloads out of the Azure North Europe data centre. The HDInsight cluster was successfully deployed a week later.

I am currently working on a Disaster Recovery (DR) strategy for another client, based on failing over services running in one Azure region to another Azure region in the unlikely event of a region-wide failure. This is a standard pattern based on Microsoft’s own architectural recommendations. However, it occurs to me that if an entire Azure region does go down, there would be another sudden spike in demand for resources in the remaining Azure regions. An RTO (Recovery Time Objective) that was achievable in DR testing may be unachievable in a real event because of capacity constraints.
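A rough way to reason about this is to treat the wait for capacity as an extra term in the recovery time. The sketch below assumes a simple additive model, and the function names and figures (a 4-hour RTO, a 2-hour tested failover) are hypothetical values chosen only for illustration.

```python
from datetime import timedelta


def effective_recovery_time(tested_failover, capacity_wait):
    """Recovery time once the wait for capacity in the secondary region is included."""
    return tested_failover + capacity_wait


def meets_rto(rto, tested_failover, capacity_wait):
    """True if the effective recovery time is within the RTO."""
    return effective_recovery_time(tested_failover, capacity_wait) <= rto


rto = timedelta(hours=4)              # hypothetical objective
tested_failover = timedelta(hours=2)  # hypothetical figure measured in a DR test

# In a DR test nobody else is failing over, so there is no wait for capacity.
print(meets_rto(rto, tested_failover, timedelta(0)))       # True

# In a real regional outage, assume a wait of a week, as in the story above.
print(meets_rto(rto, tested_failover, timedelta(days=7)))  # False
```

The point is not the arithmetic but the missing input: a DR test measures `tested_failover` and silently assumes the capacity wait is zero.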

My recommendations when designing your Azure DR strategy are:

  • Ensure quota increases are requested in the secondary region, although you cannot rely on quotas to ensure availability of resources

  • Be prepared to speak to the Microsoft capacity management team in case of a failover to discuss capacity issues 

  • Make use of the Microsoft account team — they can help secure capacity in case of a failover — frame your case in terms of impact to the customers and reputational impact to Microsoft

  • Understand that your RTOs are going to be on a best-endeavours basis when failing over to a cloud provider — ultimately you don’t own the infrastructure and the cloud provider may not be able to meet demand

  • Consider a multi-cloud approach, allowing you to fail over to AWS, GCP or another cloud environment, or even to an on-premises environment
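As a sketch of how the last recommendation might look in a failover runbook, the function below walks an ordered list of targets and picks the first one that reports enough capacity. Everything here is an assumption for illustration: the target names, the `has_capacity` callback and the stubbed capacity figures are hypothetical, and in practice the check would have to be an attempted deployment or a conversation with the provider, since a quota alone does not tell you whether capacity is available.

```python
from typing import Callable, Optional, Sequence


def choose_failover_target(
    targets: Sequence[str],
    has_capacity: Callable[[str, int], bool],
    required_vcpus: int,
) -> Optional[str]:
    """Return the first target, in priority order, that reports enough capacity."""
    for target in targets:
        if has_capacity(target, required_vcpus):
            return target
    return None  # nothing available: time to escalate to the account team


# Hypothetical free capacity per target during a regional outage.
free_vcpus = {"azure-secondary-region": 0, "aws-region": 64, "on-premises": 256}


def stub_has_capacity(target: str, vcpus: int) -> bool:
    """Stand-in for a real capacity check (hypothetical)."""
    return free_vcpus.get(target, 0) >= vcpus


priority = ["azure-secondary-region", "aws-region", "on-premises"]
print(choose_failover_target(priority, stub_has_capacity, 128))  # on-premises
```

Keeping the priority order as data rather than logic makes it easy to rehearse different scenarios in DR tests, including the one where the preferred secondary region has nothing to give.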
