The Sky Is Falling, the Sky Is Falling!
The sky isn’t really falling, but it can feel that way when your cloud provider fails. We’ve seen some spectacular failures by cloud providers in recent years, each equivalent to taking down a very large traditional data center and having all of its backups (backup network, power, and cooling) fail at the same time.
These failures caused outages at some major network service providers, such as Netflix. That’s a scary thought: if a provider like Netflix can experience a serious service outage that takes many hours to rectify, what chance do you have of surviving such an outage?
Because a failure of this type is so expensive, it’s a foregone conclusion that the data center has been designed to be fault-tolerant, with one or more backups at each failure point. Unfortunately, the failures we’ve seen appear to result more often from software than from hardware. It’s more difficult to insulate against software failures, but not impossible.
Solution 1: Back up the site with AWS S3 buckets or similar
The traditional solution to a data center failure, such as the destruction of a data center due to fire or weather, has been to keep a fully functional backup site. Data is synchronized frequently between the production site and the backup site, so that if the production site fails, the backup site can take over without impacting customers.
Having a live backup site used to be incredibly expensive when it meant duplicating your entire production infrastructure to a second site, where it would sit idle, waiting for a failure.
In the cloud, it’s less expensive, but not free, to keep a backup site. But how do you do it?
The key is to create the backup site in a geographically separate data center (or “region” in AWS parlance) and to design your network so that client traffic can be easily shunted to the backup site. As with traditional backup sites, you’ll need to ensure that your data is available at both sites and kept in sync between them.
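As a rough illustration of shunting traffic, here is a minimal Python sketch of the decision itself: check the production site first and fall back to the backup site if it doesn’t answer. The function names and the health URLs are hypothetical, and in practice this decision usually lives in DNS or a load balancer rather than in your own code.

```python
from typing import Callable, Sequence
from urllib.error import URLError
from urllib.request import urlopen


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the site answers its health URL with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors, and bad URLs all count as "down".
        return False


def choose_site(sites: Sequence[str],
                is_healthy: Callable[[str], bool] = http_health_check) -> str:
    """Return the first healthy site; `sites` is ordered production first."""
    for site in sites:
        if is_healthy(site):
            return site
    raise RuntimeError("no healthy site available")
```

Calling choose_site with a list such as ["https://prod.example.com/health", "https://backup.example.com/health"] (placeholder URLs) returns the production URL while it is healthy and the backup URL once it is not.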
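Amazon S3 can help with the data side of this. S3 already stores each object redundantly within a region, and once you enable its cross-region replication feature on a bucket, it copies new objects to a bucket in a second region automatically. So, if your data is hosted in S3, you don’t have to build and run the replication machinery yourself, although you do pay for the storage in both regions and for the transfer between them. Of course, replication is asynchronous, so it can take some time for one region to see updates from work done in another. But still, it’s often better than managing this replication yourself.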
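For the curious, here is a hedged sketch of what turning on S3 replication can look like from Python using the boto3 library. The bucket names, the rule ID, and the IAM role are placeholders you would supply, and the sketch assumes the destination bucket already exists in the other region with versioning enabled and that the role has permission to replicate objects.

```python
def build_replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build a replication configuration that copies every new object."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate (placeholder)
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket: str, role_arn: str,
                       destination_bucket: str) -> None:
    """Enable versioning on the source bucket, then attach the replication rule."""
    import boto3  # third-party AWS SDK; imported here so the builder works without it

    s3 = boto3.client("s3")
    # Replication requires versioning on the source bucket. The destination
    # bucket must have versioning enabled too (assumed to be done already).
    s3.put_bucket_versioning(
        Bucket=source_bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=build_replication_config(role_arn, destination_bucket),
    )
```

Keeping the configuration in its own function makes it easy to review or test the rule without touching AWS at all.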
Solution 2: Fast recovery and DevOps in the cloud
DevOps practices offer another way to solve the problem: quick, automated deployments. In this model, if you’ve got your data available, you can spin up an entirely new environment in a matter of minutes or hours and start serving requests from it.
The benefit is that you don’t pay for idle cloud instances: you spin up the new environment only when it becomes necessary. If you take this even further, you can make your deployments cloud-agnostic, so that even if your whole provider fails, you can still rebuild in a new cloud in a matter of hours.
Of course, that hinges on your data being available, so you’ll need to plan ahead to ensure that you can access your data even if your provider completely fails. This is easier said than done, and can be expensive if you move a lot of data. A number of cloud backup services exist to help meet this need, but again, they’re expensive.
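One way to picture the fast-recovery approach is as an ordered, automated runbook: restore the data, deploy the environment, then switch traffic over. The Python sketch below shows only the skeleton of such a runbook. The step names are illustrative, and the actions are stand-ins for whatever your own deployment tooling actually does.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus a function that performs it.
Step = Tuple[str, Callable[[], None]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and say what completed."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Later steps (such as switching DNS) must not run after a failure.
            raise RuntimeError(
                f"step {name!r} failed after completing {completed}"
            ) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; real ones would call your backup and deploy tools.
    recovery = [
        ("restore data", lambda: print("restoring data from backup")),
        ("deploy environment", lambda: print("deploying servers and software")),
        ("switch DNS", lambda: print("pointing DNS at the new environment")),
    ]
    print(run_runbook(recovery))
```

Stopping at the first failed step matters here: you don’t want to send customers to a new environment whose data restore never finished.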
Solution 3: The Fully Distributed Service Model
If you can justify the cost, the most resilient solution is the fully distributed service model. In this solution, your services are distributed across multiple providers and multiple geographic locations, all serving traffic at the same time. This is expensive and complex to build, and costly to run, but it provides additional scalability and much better uptime for your customers.
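To make the idea concrete, here is a small Python sketch of spreading requests across several providers at once while skipping any that are currently down. The endpoint names and the health check are hypothetical, and a real deployment would normally do this with DNS or a global load balancer rather than application code.

```python
from typing import Callable, Iterator, Sequence


def route_requests(endpoints: Sequence[str],
                   is_healthy: Callable[[str], bool]) -> Iterator[str]:
    """Yield endpoints in rotation, skipping any that fail the health check."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    index = 0
    while True:
        # Try each endpoint once, starting from where the rotation left off.
        for offset in range(len(endpoints)):
            position = (index + offset) % len(endpoints)
            candidate = endpoints[position]
            if is_healthy(candidate):
                index = (position + 1) % len(endpoints)
                yield candidate
                break
        else:
            # Every endpoint failed its health check on this pass.
            raise RuntimeError("no healthy endpoint available")
```

Unlike a backup site that sits idle, every endpoint here carries a share of the traffic, so losing one provider reduces capacity instead of taking the service down.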
The Bottom Line with Cloud Providers
Only you can decide which path fits your business best. No solution is one-size-fits-all. You have to balance cost, complexity, and your level of risk tolerance to pick the right solution for you. JumpCloud can help you prepare for a cloud outage by letting you build a workflow of tasks to assist in cutting over from one site to another. This can include things like calling into APIs to switch your DNS over, helping you install software, and running scripts to restore backup files from S3 buckets. Whatever server automation you need, if it needs to happen as a result of an event occurring in your environment, JumpCloud can help.