No Procedure Survives First Contact With a Production Outage
When outage alerts start flying and sales are dropping at your feet, no one is reading your policies and procedures wiki page...
The phone starts ringing, and my phone only rings when something goes wrong. Sure enough, production is down, sales are being lost, and people have emerged from their corner offices to find out what is going on. “Battle Stations! Battle Stations! This is not a drill!”
But fortunately, because this is an enterprise, we have a policy and a procedure for that.
First the support guys get involved and do some basic troubleshooting. Or they would if they had any idea where this particular production system was hosted. That was documented and emailed around a while ago, but staff changes mean the people on call today have no idea where that documentation is, or that it even exists.
Once that document is found and details of the host OS are found, our trusty support soldiers attempt to log in, only to find that the usual credentials don’t work on this particular system. Those passwords are in the central password manager, but the new employees weren’t given access to the particular folder that holds them.
Once the administrator for the password manager has been located and access has been granted, the support team can finally log into the system to start troubleshooting the issue, only to find out that 30 minutes ago one of the developers completely ignored the policies and procedures, found and fixed the problem, and everything is good again.
Policies and procedures... They look so lovely, and are so well intentioned, but:
No procedure survives first contact with a production outage.
The only procedure that is worth a damn is one that has been successfully followed in the heat of battle. When money is being lost and the suits start looking for answers, no one really cares about the steps detailed in an untested procedure written 6 months ago. Things just have to be fixed, and they have to be fixed right now.
Especially in a field like IT, where environments, requirements and implementations change on a weekly basis, handwritten and disconnected procedures are often so out of date that they are more of a hindrance than a help in an outage. It is hard enough keeping automated tests up to date and integrated with continuous deployment methodologies, let alone some wiki page that no one looks at.
In reality most procedures for dealing with an unexpected outage boil down to:
- Who can fix the problem?
- Who needs to be kept informed of the progress of the solution?
- How can this problem be prevented in future?
Beyond that, it is just a case of getting the right people in the room and then getting out of the way.
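To show how little structure those three questions really need, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `OutageResponse` name, its fields, and the example names are hypothetical and do not come from any real tool or from the incident described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutageResponse:
    """Hypothetical record of the three questions for a single outage."""
    system: str
    fixers: List[str]        # who can fix the problem
    stakeholders: List[str]  # who needs to be kept informed of progress
    # How to prevent a repeat; usually filled in once the fire is out.
    prevention_notes: List[str] = field(default_factory=list)

    def status_update(self, message: str) -> List[str]:
        """Build one progress line per stakeholder; sending is left to the caller."""
        return [f"To {name}: [{self.system}] {message}" for name in self.stakeholders]


# Example values are invented for illustration only.
response = OutageResponse(
    system="checkout",
    fixers=["developer-on-call"],
    stakeholders=["sales-lead", "support-desk"],
)
updates = response.status_update("fix deployed, monitoring")
response.prevention_notes.append("give new on-call staff access to the password folder")
```

The record deliberately holds nothing about how to fix the problem. That part stays with the people in the room.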
Procedures are too often written by personality types that gain a sense of security and control from lists of steps and clearly outlined responsibilities. These procedures have their place, just not in an unplanned outage.
Published at DZone with permission of Matthew Casperson, DZone MVB. See the original article here.