Maintaining physical Disaster Recovery (DR) datacenters grows more cost-prohibitive each year. By moving your DR data center to the Amazon Web Services (AWS) Cloud, you enable faster disaster recovery and greater resiliency without the cost of a second physical datacenter. In this article, we’ll discuss how to use AWS to manage disaster recovery in the cloud. We’ll give a brief overview of how AWS disaster recovery works, the benefits of using AWS DR, an overview of DR architectures and a customer example.
Watch the video below and access the slides, or continue on to read the article:
Before we begin, there are four terms you should be familiar with when discussing disaster recovery:
- Business Continuity: Ensuring that your organization’s mission critical business functions continuing to operate or they recover pretty quickly from a serious incidence.
- Disaster Recovery: Disaster recovery is all about preparing for and recovering from a disaster, so any event that has a negative impact on your business; things like hardware failures; software failures; power outages; physical damages to your building like fire, flooding, and hurricanes or even human error. Disaster recovery is all about planning for those incidents.
- Recovery Point Objectives (RPO): RPO is basically the acceptable amount of data loss measured in time. If your disaster hits at 12:00 PM and your RPO is one hour, your system should recover all the data that was in the system before 11:00 AM. Your data loss will only span from 11:00 AM to 12:00 PM.
- Recovery Time Objective (RTO): RTO is the time it takes after a disruption to restore your business processes back to their agreed upon service levels. If your disaster occurs at 12 p.m. and your RTO is eight hours, you should be back up and running by no later than 8:00 p.m.
Keep your Primary Datacenter, but Shift Disaster Recovery to AWS
We’re not saying that you need to shut down all of your datacenters and migrate them to AWS. You can keep your primary datacenter, but close your DR datacenter and migrate those workloads to AWS.
In the image above, we have our traditional DR architecture on the left. You have your primary datacenter, and then you have your DR datacenter. With the DR datacenter, there is replication between the primary and the DR datacenter. You can recover as soon as a disaster happens in the primary datacenter. This way your users are still able to be up and running without too much of an impact.
With traditional DR, you have the infrastructure that’s required to support the duplicate environment. Physical location, power, cooling, security to ensure that the place is protected, procuring storage and enough server capacity to run all of missing critical services — including user authentication, DNS monitoring and alerting.
On the right, we have Disaster Recovery with managed on AWS. What you see is the main datacenter, but you can set up replication to the AWS Cloud using a couple of services including Amazon S3, Route 53, Amazon EC2, and SoftNAS Cloud NAS.
You’ll get all the benefits of your current DR datacenter but without having to maintain any additional hardware or have to worry about over provisioning your current datacenter.
Physical Datacenter vs. AWS
Let’s compare a physical DR datacenter vs. AWS Disaster Recovery. With your DR datacenter, there’s a high cost to build and maintain it. You’re responsible for storage, power, networking, internet and more. There’s a lot of capital expenditures involved in maintaining the DR datacenter.
Storage, backup tools and retrieval tools are expensive. It often takes weeks to add in more capacity because planning, procurement, and deployment just take time with physical datacenters. It’s also challenging to verify your DR plans. Testing for DR on-site is time-consuming and takes a lot of effort to make sure it’s working correctly.
On the right, we have the benefits of using AWS for DR. There’s not a lot of capital expenditures when using AWS. The beauty of using AWS to manage DR is it’s all on-demand so you’re only going to paying for what you use.
There’s also a consistent experience across the AWS environments. AWS is just highly durable and highly available. There’s a lot of protection in making sure that your AWS DR is going to be up and running to go. You can also automate your recovery. Finally, you can even set up disaster recovery per application or business unit. So different business units within an organization can have different recovery objectives and goals.
Managing a DR Infrastructure
The beauty of using AWS to manage your DR is you’re really only responsible for your snapshot storage. AWS is handling the routers, firewalls, operating systems and more. AWS just takes all that off your hands, so you can focus on more important tasks and projects.
How does AWS disaster recovery mirror your DR datacenter? The image above shows traditional DR services and AWS services. For example: with DNS, AWS has Route 53; for load balancers, AWS has Elastic Load Balancing; web servers can be EC2 or auto scaling. Data centers are managed by availability zones.
Because everything is on AWS, enterprise security standards are definitely met. AWS is always up to date with certification, whether it’s ISO, HIPAA, ITAR or other compliance standards.
There is also the physical security aspect. AWS datacenters are highly secure, located in nondescript facilities and physical access is strictly controlled. They log all the physical access to their datacenters. With hardware and software networking, they have systematic change management. Updates are phased and they do safe storage decommission. There is also automated monitoring, self-audits and advanced network protection.
AWS Disaster Recovery Architectures
Let’s look at some DR architectures and scenarios for AWS. There are four main DR architectures:
- Backup and Restore
- Pilot Light
- Hot Standby
Before we discuss the architectures, let’s look at some of the AWS services involved with DR. For backup and restore, you’re not using too many of AWS core services. You’re going to be using Amazon S3, Glacier, SoftNAS Cloud NAS (for replication), Route 53 and VPN.
As you move on to Pilot Light, you’re going to add in CloudFormation, EC2, EBS volumes, Amazon VPCs and DirectConnect.
For Hot Standby, you add in Auto Scaling, Elastic Load Balancing and setup also multiple Direct Connects. For Multi-site, there’s a whole host of AWS services to add in.
Backup and Restore Architecture
The way backup and restore works is shown on the image below. We have an on-premises datacenter on the left and AWS DR infrastructure on the right.
On the left, you have a datacenter using the iSCSI file protocol, and then you’ve got a virtual NAS on top of that managing file storage. What you can do with AWS DR is use a combination of SoftNAS Cloud NAS, Amazon S3, EC2 and EBS to go in and manage your backup architecture.
In traditional environments, data is backed up to the DR datacenter. It’s offsite, so if something fails it’s going to take a long time to restore your system because you have to pull them and then pull the backup data from them.
Amazon S3 and Glacier are really good for this. Using a service like SoftNAS Cloud NAS enables you to use snapshots of your on-premises data and copy them into S3 for backup. The benefit of this is you can also arm your snapshot data volumes to give you the highly durable backup.
It’s pretty cost effective since you’re not paying a lot of money for it. In case of disaster, you’re going to retrieve your backups from S3 and bring up the required infrastructures. So these are the EC2 instances with prepared AMIs, load balancing, etc. You restore the system from a backup.
With the backup and restore architecture, it’s a little bit more time consuming. It’s not instant, but there is a work around for that.
With SoftNAS Cloud NAS, you can set up replication with SnapReplicate. It makes your data instantly available instead of having to wait for it to download and back it up. It’s now all instantly available to you. So your RTO and your RPO go from hours or days into minutes or just one or two hours.
Pilot Light Architecture
Moving on to the Pilot Light architecture, this is a scenario in which a minimal version of your primary datacenter’s architecture is always running in the cloud.
It’s pretty similar to a Backup and Restore scenario. With AWS, you can maintain the power of the pilot light by configuring and running your most critical core elements over your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
To prepare for the Pilot Light architecture, you replicate all of your critical data to AWS. You prepare all of your required resources for your automatic start. Including the AMI, network settings, load balancing. We even recommend reserving a few instances too.
In case of disaster, you automatically bring up the resources around the replicated core data set. Then you can scale the system as needed to handle your current production traffic. Again, one of the benefits of AWS is you can scale higher or lower based on your current needs.
Hot Standby Architecture
Moving on to the Hot Standby architecture, this is a DR scenario which is a scaled down version of a fully-functional environment. It’s always running in the cloud. So basically a warm standby kind of extends to Pilot Light elements and it kind of further reduces the recovery because some of your services are always running in AWS – they’re not idle and there’s no downtime with them. By identifying your business critical systems, you can fully duplicate them on AWS and have them always on.
Hot Standby handles production workloads pretty well. In order to prepare, you again replicate all of your critical data to AWS. You prepare all of your required resources and your reserved instances. In case of disaster, you automatically bring up the resource around the replicated core dataset. You scale the system as needed to handle your current production traffic.
The objective of the Hot Standby is to get you up and running almost instantly. Your RTO can be about 15 minutes and your RPO can vary from one 1 to 4 hours.
Finally, we have the multi-site architecture. This is where your AWS DR infrastructure is running alongside your existing on-site structure. Instead of it being active-inactive, it’s going to be an active-active configuration. This is the most instantaneous architecture for your DR needs. At any moment, once your on-premises datacenter goes down, AWS will go ahead and pick up the workload almost immediately.
Basically, you’ll be running your full production load without any kind of decrease in performance so it immediately fills over all your production load. All you have to do is kind of just adjust your DNS records to point to AWS. Your RTO and your RPO are within minutes so no need to worry about spending time re-architecting everything.
Customer AWS Disaster Recovery Example
Here’s an example showing how our customers are using disaster recovery on AWS. The customer is using AWS to kind of manage their business applications and they’ve broken them down into Tier 1, Tier 2, and Tier 3 apps.
For Tier 1 apps that need to be up and running 24/7, they’ve got their EC2 instances for all services running at all times. Their in-house and their AWS infrastructure are load balanced and configured for auto-failover. They do the initial data synchronization using in-house backup software or FTP. Finally, they set up replication with SoftNAS Cloud NAS to automatically failover in minutes.
With Tier 2 apps, they’re configuring only the critical core elements of the system – they don’t configure everything. Again, they’ve got their EC2 instances running only for the critical services. They pre-configured their AMIs for the Tier 2 apps that can be quickly provisioned. Their cloud infrastructure is load balanced and configured for AMI failover. They did the initial data sync with their backup software. Finally, replication was set up with SoftNAS Cloud NAS.
For Tier 3 apps, where the RPO and RTO aren't too strict, they replicated all their data into S3 using SoftNAS Cloud NAS. Agin, they did a sync with their backup software. They went ahead and pre-configured their AMIs. And then also their EC2 incidents are spun up from objects within S3 to a manual process, but they’re able to get there pretty tier pretty quickly.