Recently Jason Boche posted about the need for a good script to start and stop your datacenter. I’m sure a lot of people read it and said, “I don’t need one of those since I don’t shut everything down.” Well, the truth is you might need one sooner or later. We all plan for disasters (or should), no matter how big or small we are. During that planning we often come up with playbooks for how to start and stop certain services. We might even come up with a playbook for which services or application stacks to start, and in which order. The question is: do we ever really check that it’s all correct? That’s pretty tricky to do, since we don’t want to shut everything down and restart it just to run the test. However, when the rubber meets the road and your datacenter does go down, it’s good to know that you can get it back up and running, and exactly how long that will take.
About a month ago I was with a customer that had a major power failure in their datacenter. They have about 2,800 servers running in the datacenter. They have 2 huge generators outside, several banks of battery power, and several UPS systems to control it all. What they didn’t have was a very good regulator switch in the building, and when it blew it cut off all supplies of power to the datacenter. The UPS systems, the batteries, and the generators were all isolated from the power whips that ran throughout the datacenter. Not good. When they finally did get to powering everything back up, it took nearly 5 days. They would start one service only to realize something else needed to be started first. It was a true disaster. Luckily, their remote site came online. Unluckily, they had no connectivity to it, since their telco gear had lost power along with the datacenter. All in all, not a good day.
There are a lot of lessons we can learn here about preparing for a disaster and about datacenter design. This all brings me back to the article that Jason pointed out and the spreadsheet that he’s created for his own datacenter. Take the time to think through all of the dependencies of your applications. Then think through how people get to those applications. Make sure you know who needs what and when, how these applications get powered up, and how they get powered down as well. If you’re really advanced you can start adding some automation to the whole thing. This can be as simple as some scripts or something more elegant such as VMware Site Recovery Manager. Whatever it is, be prepared for the unthinkable, or get really good at navigating Dice.com for a new job.
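
To make the "some scripts" end of that spectrum a little more concrete, here is a minimal sketch of how a dependency list can be turned into a start order and a stop order. It is only an illustration, not Jason's spreadsheet or anything the customer ran: the service names in `DEPENDS_ON` are made up, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency map (made-up service names): each key lists the
# services that must already be running before it can be started.
DEPENDS_ON = {
    "storage": [],
    "network": [],
    "dns": ["network"],
    "directory": ["dns"],
    "database": ["storage", "directory"],
    "app_server": ["database"],
    "web_frontend": ["app_server", "dns"],
}


def startup_order(depends_on):
    """Return the services ordered so every dependency comes before its dependents."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the list of services that form the loop.
        raise ValueError(f"circular dependency in playbook: {err.args[1]}") from err


def shutdown_order(depends_on):
    """Return the reverse of the startup order, so dependents stop first."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("Power up:  ", " -> ".join(startup_order(DEPENDS_ON)))
    print("Power down:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

The useful part is less the ordering itself than the failure case: if two services each claim to need the other first, the script refuses to produce a playbook at all, which is a much cheaper way to find that out than discovering it on day three of a restart.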