I just spent the last 24 hours in lower Manhattan while Hurricane Sandy rolled through, and it’s offered some first-hand lessons on disaster recovery. Watching the responses of city and state officials, Con Edison, first responders, and hospitals offers useful insights into protecting infrastructure and recovering it after a disaster.
1. What are your essentials?
Planning for disaster isn’t easy. Thinking about essentials is a good first question. In a real-life disaster scenario, that might mean food, water, heat, and power. What about backup power? Are your foods non-perishable? Do you have a hands-free flashlight or lamp? Have you thought about communication and coordination with your loved ones? Do you have an alternate cellular provider if your main one goes out?
With business continuity, coordination between business units, operations teams, and datacenter admins is crucial. Running through essential services, understanding how they interoperate, who needs to be involved in which decisions, and so forth is key.
2. What can you turn off?
While power is being restored, or some redundant services are offline, how can you still provide limited or degraded service? In the case of Sandy, can we move people to unaffected areas? Can we reroute power to population centers? Can we provide cellular service even while regular power is out?
For web applications and datacenters, this can mean applications built with feature flags, as we’ve mentioned before on this blog.
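As a rough illustration of the idea, here is a minimal feature-flag sketch in Python. The `FeatureFlags` class, the flag names, and `render_page` are all hypothetical, and the in-memory dictionary stands in for whatever config store a real application would use.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store; a real one would read config or a database."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing entry fails safe.
        return self.flags.get(name, False)

    def disable(self, name: str) -> None:
        self.flags[name] = False


def render_page(flags: FeatureFlags) -> list:
    """Return the page sections to render, skipping any that are switched off."""
    sections = ["article"]  # core content is always served
    if flags.is_enabled("comments"):
        sections.append("comments")
    if flags.is_enabled("recommendations"):
        sections.append("recommendations")
    return sections


flags = FeatureFlags({"comments": True, "recommendations": True})
full_page = render_page(flags)

# During an incident, shed the expensive feature without a deploy.
flags.disable("recommendations")
degraded_page = render_page(flags)
```

The point of the design is that turning a feature off is a data change, not a code change, so it can be done quickly while other services are still being restored.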
Also very important: architect your application to have a browse-only mode. This allows you to serve customers from multiple webservers in various zones or regions, using lots of read-replicas or read-only MySQL slave databases. It’s easy to build lots of read-only copies of your data while there are no changes or transactions taking place.
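One way a browse-only mode might look in application code is sketched below. This is an assumption-laden toy: the `Router` class, the server names, and the "starts with SELECT" test for reads are all invented for illustration, and a real application would classify queries more carefully.

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the site is in browse-only mode."""


class Router:
    """Hypothetical query router: reads go to replicas, writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.browse_only = False
        self._next = 0

    def route(self, sql):
        # Naive read detection, good enough for a sketch.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self.replicas:
            # Round-robin across the read-only copies.
            target = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return target
        if self.browse_only:
            raise ReadOnlyModeError("site is in browse-only mode")
        return self.primary


router = Router("primary-db", ["replica-a", "replica-b"])
first_read = router.route("SELECT * FROM articles")
second_read = router.route("SELECT * FROM authors")
normal_write = router.route("UPDATE articles SET title = 'x'")

# The primary is unavailable: keep serving reads, refuse writes.
router.browse_only = True
read_during_outage = router.route("SELECT * FROM articles")
```

With the flag set, customers can still browse from any replica while writes are refused cleanly instead of timing out against a dead primary.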
More redundancy equals more uptime.
3. Did we test the plan?
A disaster is never predictable, but the city’s emergency services showed what a well-prepared response looks like. They outlined mandatory evacuation zones where flooding was expected to be worst.
In a datacenter, fire drills can make a big difference. Running through them gives you a sense of the time it takes to restore service, what type of hurdles you’ll face, and a checklist to summarize things. In real life, expect things to take longer than you planned.
Probably the hardest part of testing is devising scenarios. What happens if this server dies? What happens if this service fails? Be conservative with your estimates and allow extra time, since things tend to unravel in an actual disaster.
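To make "be conservative" concrete, here is one simple way to turn drill timings into a planning number. The 1.5 safety factor and the sample timings are assumptions chosen for illustration, not measured values.

```python
def padded_estimate(drill_minutes, safety_factor=1.5):
    """Return a conservative recovery estimate in minutes.

    Takes the slowest observed drill and multiplies it by a safety factor.
    """
    if not drill_minutes:
        raise ValueError("run at least one drill before estimating")
    return max(drill_minutes) * safety_factor


# Hypothetical results from three database-restore drills, in minutes.
restore_drills = [40, 55, 47]
estimate = padded_estimate(restore_drills)
print(f"Plan for about {estimate:.0f} minutes to restore the database")
```

Basing the estimate on the slowest drill rather than the average reflects the expectation that a real incident will go worse than a rehearsal.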
4. Redundancy
In a disaster, redundancy is everything. Since you don’t know what the future will hold, it’s better to be prepared. Have more water than you think you’ll need. Have additional power sources, bathrooms, or a plan B for shelter if you become flooded.
With Amazon’s recent outage, quite a number of internet firms failed. In our view, AirBNB, FourSquare, and Reddit Didn’t Have to Fail. Spreading your virtual components and services across zones and regions would help. Going further and spreading them across multiple cloud providers (not just Amazon Web Services, but also Joyent, Rackspace, or other third-party providers) would insure you against a failure at any single provider.
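A sketch of what failing over between providers could look like follows. The endpoint list, URLs, and the `is_healthy` callback are hypothetical; a real health check would make a network request with a timeout rather than consult a set.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical deployment spread across regions and providers, in priority order.
ENDPOINTS = [
    {"provider": "aws", "region": "us-east-1", "url": "https://east.example.com"},
    {"provider": "aws", "region": "us-west-2", "url": "https://west.example.com"},
    {"provider": "rackspace", "region": "ord", "url": "https://ord.example.com"},
]

# Simulate a provider-wide outage with a stand-in health check.
providers_down = {"aws"}
chosen = pick_endpoint(ENDPOINTS, lambda e: e["provider"] not in providers_down)
```

With only zones and regions of one provider in the list, a provider-wide outage leaves nothing to fail over to; the third entry is what keeps `chosen` from being `None` here.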
Redundancy also means providing multiple paths through the system. From load balancers to webservers and database servers, object caches and search servers, do you have any single points of failure? A single network path? A single place where some piece of data resides?
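A first pass at answering those questions can be as simple as walking an inventory and flagging any tier with fewer than two distinct hosts. The inventory below and its host names are made up for illustration, and this check ignores network paths and data placement.

```python
def single_points_of_failure(inventory):
    """Return the tiers that have fewer than two distinct hosts."""
    return sorted(
        tier for tier, hosts in inventory.items() if len(set(hosts)) < 2
    )


# Hypothetical inventory mapping each tier to the hosts that serve it.
inventory = {
    "load_balancer": ["lb1"],
    "webserver": ["web1", "web2", "web3"],
    "database": ["db1", "db2"],
    "object_cache": ["cache1", "cache1"],  # listed twice, still one host
    "search": [],
}

at_risk = single_points_of_failure(inventory)
```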
5. Remember the big picture
While chaos is swirling and everyone is screaming, it’s important that everyone keeps sight of the big picture. Having a central authority projecting a sense of calm and togetherness doesn’t hurt. It’s also important that multiple departments, agencies, or parts of the organization continue to coordinate towards a common goal. This coordinated effort could be seen clearly during Sandy, as federal, state, and city authorities worked together.
In the datacenter, it’s easy to obsess over details and lose sight of the big picture. Technical solutions and decisions need to be aligned with ultimate business needs. This also goes for business units. If a decision is made unilaterally that publishing cannot be offline for even five minutes, such a tight constraint might cause errors and lead to larger outages.
Coordinate, and make sure everyone keeps sight of the big picture: keeping the business running.