Disaster recovery is, or should be, a must for many production applications. Having the ability to recover your application in a separate geographic location, should a major incident occur, is vital to the continued availability of your service. Microsoft has offered a DR service called Azure Site Recovery (ASR) for some time now, but it has been focused on taking on-premises applications and providing a DR solution for these in Azure. Customers whose primary application site is in Azure haven’t been able to take advantage of this service, and so building DR solutions for services that are already in Azure has been very difficult.
This is set to change with the announcement of the public preview of Azure Site Recovery for Azure Virtual Machines. I’ve had access to the private preview of this solution, and here are my thoughts.
This new service allows you to take your existing Azure production workloads and configure them for replication and recovery into a separate Azure Region. Once configured, ASR will continuously replicate your virtual machines and allow you to orchestrate the recovery of these VMs into another region in the event of a disaster. This PaaS provides a number of benefits for users with production workloads in Azure that need a DR solution and don’t want to build their own or use third-party tools. Out of the box, ASR provides:
First class, native support for Azure VMs.
Replication to any supported region in Azure.
Cost savings: If your DR VMs aren’t active, then you are only paying the ASR fee plus storage and network egress. You are not paying for any running VMs, which is the bulk of the cost in any DR environment.
The ability to boot VMs in a specific order and run Azure automation scripts natively integrated with ASR recovery plans.
Low RTO and RPO with application-level consistency.
One-click testing of your recovery plans without interrupting production workloads.
Creation of virtual networks, availability sets, and storage accounts in the DR region.
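To make the cost point concrete, here is a rough Python sketch of the arithmetic. The class and function names are my own, and every price you pass in is your own input; nothing here is an Azure list price.

```python
from dataclasses import dataclass


@dataclass
class DrCostInputs:
    """Monthly cost inputs. Every figure is supplied by you; none are real Azure prices."""
    vm_count: int
    asr_fee_per_vm: float    # ASR protection fee per VM per month
    storage_cost: float      # replica storage in the DR region per month
    egress_cost: float       # network egress for replication traffic per month
    running_vm_cost: float   # compute cost per VM per month if it were running


def standby_cost(c: DrCostInputs) -> float:
    """Cost while the DR VMs are not active: ASR fee plus storage and egress."""
    return c.vm_count * c.asr_fee_per_vm + c.storage_cost + c.egress_cost


def avoided_compute_cost(c: DrCostInputs) -> float:
    """Compute cost you are not paying because the DR VMs are not running."""
    return c.vm_count * c.running_vm_cost
```

Comparing `standby_cost` with `avoided_compute_cost` for your own numbers gives a quick feel for how much of a traditional DR bill is the running VMs.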
As far as I am aware, the ability to replicate VMs between two regions using a PaaS service is unique to Azure and could be a significant time and cost saver when looking at how to deal with DR for production Azure workloads, particularly when combined with Azure Automation to run pre- and post-recovery tasks.
Full documentation on the service is available here.
Despite all the benefits listed above, we do need to remember this is a preview, so there are still some limitations to the service you should be aware of before trying to use it:
No support for VMs with managed disks yet.
No support for the Server 2016 OS.
Linux support limited to certain distributions.
Management is currently only through the Azure Portal — no support for command line, PowerShell, or REST yet.
Virtual Machine Scale Sets are not supported.
Azure Disk Encryption is not supported.
Replication groups (the ability to group VMs so they can be replicated and recovered to the same recovery point) are not yet available.
Support for Sovereign Clouds coming soon.
The full list of supported and unsupported configurations can be found here.
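Since management is portal-only for now, a check like this has to run against your own inventory data. The following is a minimal Python sketch of a pre-flight check built from the limitations above. `VmProfile`, its fields, and `preview_blockers` are hypothetical names of mine, and the set of supported Linux distributions is something you would fill in from the support matrix.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class VmProfile:
    """Hypothetical description of a VM; field names are illustrative only."""
    name: str
    os_name: str                         # e.g. "Windows Server 2012 R2"
    is_linux: bool = False
    uses_managed_disks: bool = False
    disk_encryption_enabled: bool = False
    in_scale_set: bool = False


def preview_blockers(vm: VmProfile, supported_linux: Set[str]) -> List[str]:
    """Return the reasons this VM cannot be protected under the preview limits."""
    reasons = []
    if vm.uses_managed_disks:
        reasons.append("managed disks are not supported")
    if not vm.is_linux and "2016" in vm.os_name:
        reasons.append("Server 2016 is not supported")
    if vm.is_linux and vm.os_name not in supported_linux:
        reasons.append("Linux distribution is not in the supported list")
    if vm.in_scale_set:
        reasons.append("scale sets are not supported")
    if vm.disk_encryption_enabled:
        reasons.append("Azure Disk Encryption is not supported")
    return reasons
```

An empty list means the VM passes these particular checks; it does not replace reading the full support matrix.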
Obviously, as it’s a preview, it also does not have a full GA SLA, though support is available and production workloads are supported if they are within the qualified support matrix.
It should be noted that the purpose of ASR is to replicate your VMs (and the storage they use for disks) to your DR region; it does not cover any other resources. So if your application relies on load balancers, public IPs, Azure SQL, Key Vault, web apps, etc., you will need to make sure that you have either replicated or pre-created these in your DR region using other methods so that they are available if you fail over.
It should also be made clear that ASR does not provide built-in methods to configure access to your environment. For example, if you are using a public IP to access your resources, then you will need to configure your recovery plan to run an Azure Automation script to associate the public IP with your resources. The same applies to a load balancer. We’ll cover using recovery plans to set these things up in more detail in a future post.
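One way to find the gaps is to run your resource inventory through a simple filter. This Python sketch assumes you already have a list of resource type strings for your application; the function name and the choice of which types count as "handled" are my own reading of this article, so anything it does not name lands in the manual bucket for you to check by hand.

```python
from typing import Dict, Iterable, List

# Types this article describes ASR dealing with: the VMs, plus the networks,
# storage accounts and availability sets it creates in the DR region.
# Storage accounts holding anything other than VM disks still need your attention.
HANDLED_BY_ASR = {
    "Microsoft.Compute/virtualMachines",
    "Microsoft.Compute/availabilitySets",
    "Microsoft.Network/virtualNetworks",
    "Microsoft.Storage/storageAccounts",
}


def split_inventory(resource_types: Iterable[str]) -> Dict[str, List[str]]:
    """Split resource types into those ASR handles and those needing another method."""
    lookup = {t.lower() for t in HANDLED_BY_ASR}
    handled, manual = [], []
    for rtype in resource_types:
        # Compare case-insensitively so differently cased type strings still match.
        (handled if rtype.lower() in lookup else manual).append(rtype)
    return {"handled_by_asr": handled, "needs_other_method": manual}
```

Everything under `needs_other_method` is a candidate for pre-creation or its own replication approach in the DR region.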
Summary and Setup
Overall, I’m excited about the Azure-to-Azure recovery service. I think it’s going to provide a significantly simplified process to be able to do DR with environments hosted inside Azure, which is something that has been pretty hard to do for a while. Right now, the preview is missing a number of key components that I personally need, mainly managed disks, encrypted disks, and Server 2016, but these will come eventually. I particularly like the ability to team this up with Azure Automation to completely automate the whole process. This preview offers an opportunity to get up to speed with the service so you are ready when those components arrive, but until they are present, I’ll have to hold off using it in production.
Given all of that, let’s take a look at how you set up ASR to protect your Azure VMs.
To start using ASR, the first thing you need to do is create a recovery vault to store your recovery metadata. You can find this under “Backup and Site Recovery (OMS)”.
Important note: You should create your recovery vault in the region you want to use for DR, not your primary region. In this example, my production VMs are in the West Europe Region, so my vault is in North Europe. You are not restricted to using the regional pairs for your vault.
Once you have created your vault, you can protect your VMs. Go into your vault and click “Replicate”. In the source, select “Azure”, and then complete the fields to select the resource group where your production workload exists.
On the next screen, we select which virtual machines we want to replicate. Note that VMs with managed disks will be grayed out, as they are not currently supported.
The final screen will ask you to confirm the region you wish to replicate to. This must be the same region as your vault. It will also show you what resources it is going to create in the DR region to support your deployment. These will be resource groups, vNets, storage accounts, and availability sets. They will use the same name as your prod resources with -asr appended.
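The region rules and the naming convention can be written down as a small Python sketch. `plan_dr_targets` is a hypothetical helper of mine, not part of any Azure SDK; it only predicts names based on the "-asr appended" rule described above.

```python
from typing import Dict, List


def plan_dr_targets(source_region: str, vault_region: str, target_region: str,
                    resource_names: List[str]) -> Dict[str, str]:
    """Map each production resource name to the name expected in the DR region."""
    src, vault, target = (r.casefold() for r in (source_region, vault_region, target_region))
    if vault == src:
        raise ValueError("the vault should live in the DR region, not the primary region")
    if target != vault:
        raise ValueError("the target region must be the same region as the vault")
    # Storage account names cannot contain hyphens, so expect the real name of
    # any storage account to differ from this simple pattern; check the portal.
    return {name: f"{name}-asr" for name in resource_names}
```

For example, with production in West Europe and the vault in North Europe, a resource group called `prod-rg` would map to `prod-rg-asr`.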
On this screen, you will also see the default replication policy that has been selected. This has two values:
Recovery point retention: how long your snapshots are retained for. The default here is 24 hours.
App Consistent snapshot frequency: ASR has two types of consistency: file level and application level. File-level recovery points are taken all the time. Application-level consistency uses Volume Shadow Copy to ensure that applications are in a consistent state when a snapshot is taken. This can have an impact on performance, so you will want to decide how often you want to take these app-level snapshots. The lowest you can go is every hour.
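The two policy values and their limits can be modelled in a few lines of Python. `ReplicationPolicy` is a hypothetical class of mine; the 24-hour retention default and the one-hour minimum come from the description above, while the default I give the snapshot frequency is simply that minimum, not necessarily what the portal pre-selects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    retention_hours: int = 24             # default retention described above
    app_consistent_every_hours: int = 1   # the stated minimum, assumed here as default

    def __post_init__(self):
        if self.retention_hours < 1:
            raise ValueError("retention must be at least one hour")
        if self.app_consistent_every_hours < 1:
            raise ValueError("app-consistent snapshots cannot be more frequent than hourly")

    def app_consistent_points_retained(self) -> int:
        """Roughly how many app-consistent points fit inside the retention window."""
        return self.retention_hours // self.app_consistent_every_hours
```

Taking app-consistent snapshots every four hours with the default retention would leave about six such points to choose from at any time, which is the trade-off against the performance impact.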
Once you're happy with this, click “create resources” to create the required resources in the ASR resource group. Then click “enable replication”, and your replication policy will be enabled and VMs will start being on-boarded. This can take some time to create the required resources and protect the machines. You can check on the status of this in the Jobs tab.
The process may sit for a while in the “Enable Replication” state. You can see more details on where in that process it is by clicking on the step.
Eventually, all the tasks will complete and, if you go to the “Replicated Items” section, you will see your VMs listed as protected.
Your VMs are now being replicated into your recovery vault, with continuous recovery points (one about every 5 minutes) and application consistent snapshots on your requested schedule.
Now that your VMs are protected, you can fail over individual VMs. However, in a disaster, it is unlikely you will need to fail over a single VM. Instead, it will be a number of VMs that run your application between them. To avoid having to go through each VM individually, you can create recovery plans. These allow you to group VMs together to be failed over as a group. To set this up, you select the “Recovery Plans (Site Recovery)” option and then create a recovery plan.
Recovery plans also allow you to run scripts, using Azure Automation, before and after your recovery process. These can be used for things like associating public IPs or load balancers with your VMs or making changes to DNS to transfer traffic to your DR site. To add these, you need to go to your recovery plan and then select the group where you want the scripts to be run. Right-click on the group and select a pre or post action.
You can then configure the Azure Automation account and scripts to use in these tasks.
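To picture how a recovery plan hangs together, here is a small Python model of groups with pre and post actions. The class names and the runbook names you pass in are hypothetical; this is a way of reasoning about ordering, not how ASR stores or executes a plan.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    name: str
    vms: List[str]
    pre_actions: List[str] = field(default_factory=list)    # hypothetical runbook names
    post_actions: List[str] = field(default_factory=list)


@dataclass
class RecoveryPlan:
    name: str
    groups: List[Group]

    def steps(self) -> List[str]:
        """Flatten the plan: groups run in order, each wrapped by its actions."""
        ordered: List[str] = []
        for group in self.groups:
            ordered += [f"run {action}" for action in group.pre_actions]
            # The order of VMs inside one group carries no meaning in this model.
            ordered += [f"fail over {vm}" for vm in group.vms]
            ordered += [f"run {action}" for action in group.post_actions]
        return ordered
```

Putting the database VMs in the first group and the web VMs in a second group with a post action that attaches the public IP gives you the boot order and the access fix-up in one plan.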
We’ll cover using these tasks in your recovery process in more detail in another article.
We’re now at the point where our VMs are replicated, we have a recovery plan defined, and we are ready to test whether site recovery actually works.
Testing your DR process is vital to ensuring you are ready to handle a real event, should it occur. ASR provides a simple way to test your failover without impacting your production environment, so this can be done at any time. You can test failing over a single VM, or a whole recovery plan.
To test a single VM:
- Go to “Replicated Items” in the Recovery Vault in the portal.
- Select the VM you wish to test.
- Click the “Test Failover” button.
- You will be presented with a window that asks which recovery point you wish to test. You will have the option of:
  - Latest processed: This is the very latest recovery point available for that VM. It may not be an app consistent recovery point.
  - Latest app consistent: As the name suggests, this is the latest recovery point that is app consistent.
  - Custom: If you need to use an older recovery point.
- Finally, you need to select a virtual network to fail over to. The portal will encourage you not to use your production network to avoid impacting production workloads.
- When you are ready, click OK to begin the test.
- You can view the status of the test in the jobs section. Once the job is showing complete, you should be able to access your VM in your recovery network and test it out.
To test a recovery plan, the process is identical to the above, except you go to the Recovery Plans (Site Recovery) option in the menu, and select a recovery plan to test.
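The three recovery point options boil down to a simple selection rule, sketched here in Python. The option strings and the `RecoveryPoint` type are my own, and for the custom case I have assumed "the newest point at or before a time you choose", whereas the portal lets you pick a specific point from a list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool


def choose_recovery_point(points: Sequence[RecoveryPoint], option: str,
                          custom_time: Optional[datetime] = None) -> RecoveryPoint:
    """Pick a recovery point using one of the three options described above."""
    if option == "latest_processed":
        candidates = list(points)
    elif option == "latest_app_consistent":
        candidates = [p for p in points if p.app_consistent]
    elif option == "custom":
        if custom_time is None:
            raise ValueError("custom selection needs a time")
        candidates = [p for p in points if p.taken_at <= custom_time]
    else:
        raise ValueError(f"unknown option: {option}")
    if not candidates:
        raise LookupError("no recovery point matches the request")
    return max(candidates, key=lambda p: p.taken_at)
```

The practical consequence is that choosing the app consistent option can land you on a noticeably older point than the latest one, depending on your snapshot frequency.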
Cleanup Test Failover
Once you have tested your failover, you are going to want to clean up the test resources so you are not charged for the VMs. ASR provides a simple process to do this.
- Select the resource you tested (VM or Recovery Plan).
- Click the “Cleanup test failover” button.
- Add any notes you wish and check the “delete virtual machines” checkbox.
- Click OK, and the resources will be deleted. Note this will only delete the VMs in the test — vNets and Storage accounts for recovery will remain in place.
Now that we have tested the failover process, we know that replication and failover are working. Hopefully, that is the limit of what you need to do, aside from repeating your failover tests regularly. However, if the time comes when you need to do a real failover, the process is much the same as for a test:
- Select the resource you want to fail over.
- Click the Failover button.
- Select which recovery point you wish to recover to.
- You also have the option to attempt to shut down the VM before failover. Obviously, this will only work if the live site is still active and the VMs are accessible. If the process cannot shut down the VMs, it will continue anyway.
- Click OK.
- Again, you can monitor the process in Jobs.
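The "try to shut down, but carry on regardless" behaviour in the steps above is worth having clear in your head, so here is a minimal Python sketch of it. `fail_over` and the `shut_down_source` callback are hypothetical stand-ins; the real work is done by the service, not by code you write.

```python
from typing import Callable, Iterable, List


def fail_over(vm_names: Iterable[str],
              shut_down_source: Callable[[str], None],
              try_shutdown: bool = True) -> List[str]:
    """Simulate a failover that attempts a source shutdown but never blocks on it."""
    log: List[str] = []
    for vm in vm_names:
        if try_shutdown:
            try:
                shut_down_source(vm)
                log.append(f"{vm}: source shut down")
            except Exception as exc:
                # The primary site may be unreachable in a real disaster,
                # so a failed shutdown is recorded and the failover continues.
                log.append(f"{vm}: shutdown failed ({exc}), continuing")
        log.append(f"{vm}: failed over")
    return log
```

The point of the design is that an unreachable primary site never prevents you from bringing the DR copy up.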
Once the failover job has completed, you can go in and test that everything is up and running. Once you are happy with this, you then need to commit the failover. Doing so indicates that you have now failed over and are live in your DR location. Once you commit, the machines' previous recovery points are removed from the vault, as these are no longer valid.
Once you have failed over to the DR region and committed, your machines are no longer protected by recovery services. You will need to go through the process to protect the machines again now that they are in the new region. Unfortunately, there is no way to automatically re-protect machines once they fail over, but you are able to re-protect at any time. If you re-protect into your primary region, then you will be able to re-use the existing data from before your failover.
If you wish to fail back to your original region, you essentially need to do a failover in reverse. Make sure you re-protect your machines in the DR region, then initiate a failover back to your main region. Once committed, make sure you re-protect the VMs again.
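The full cycle of fail over, commit, re-protect, and fail back is easiest to follow as a small state machine. This Python sketch uses state names and a `ReplicatedVm` class of my own invention to show which step is allowed when; it does not talk to Azure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicatedVm:
    name: str
    active_region: str
    replica_region: Optional[str]
    state: str = "protected"   # protected -> failed_over -> unprotected -> protected

    def fail_over(self) -> None:
        if self.state != "protected" or self.replica_region is None:
            raise RuntimeError("VM must be protected before it can fail over")
        # The replica becomes the live copy; there is no replica until re-protect.
        self.active_region, self.replica_region = self.replica_region, None
        self.state = "failed_over"

    def commit(self) -> None:
        if self.state != "failed_over":
            raise RuntimeError("nothing to commit")
        self.state = "unprotected"   # old recovery points are gone at this stage

    def reprotect(self, target_region: str) -> None:
        if self.state != "unprotected":
            raise RuntimeError("only a committed, unprotected VM can be re-protected")
        self.replica_region = target_region
        self.state = "protected"
```

Failing back is then just the same three calls again with the regions swapped: re-protect into the original region, fail over, commit, and re-protect once more.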