Rolling Forward and other Deployment Myths
Change is change?
The author tries to make the point that “Change is neither good or bad. It’s just change.” Therefore we do not need to be afraid of making changes.
I don’t agree. This attitude to change ignores the fact that once a system is up and running and customers are relying on it to conduct their business, whatever change you are making to the system is almost never as important as making sure that the system keeps running properly. Unfortunately, changes often lead to problems. We know from Visible Ops that based on studies of hundreds of companies, 80% of operational failures are caused by mistakes made during changes. This is where heavyweight process control frameworks like ITIL and COBIT, and detective change control tools like Tripwire came from. To help companies get control over IT change, because people had to find some way to stop shit from breaking.
Yes, I get the point that in IT we tend to over-compensate, and I agree that calling in the Release Police and trying to put up a wall around all changes isn’t sustainable. People don’t have to work this way. But trivializing change, pretending that changes don't lead to problems, is dangerous.
Deploys are not risky?
You can be smart and careful and break changes down into small steps and try to automate code pushes and configuration changes, and plan ahead and stage and review and test all your changes, and after all of this you can still mess up the deploy. Even if you make frequent small changes and simplify the work and practice it a lot.
For systems like Facebook and online games and a small number of other cases, maybe deployment really is a non-issue. I don’t care if Facebook deploys in the middle of the day – I can usually tell when they are doing a “zero downtime” deploy (or maybe they are “transparently” recovering from a failure) because data disappears temporarily or shows up in the wrong order, functions aren’t accessible for a while, forms don’t resolve properly, and other weird shit happens, and then things come back later or they don’t. As a customer, do I care? No. It’s an inconvenience, and it’s occasionally unsettling ("WTF just happened?"), but I get used to it and so do millions of others. That’s because most of us don’t use Facebook or systems like this for anything important.
For business-critical systems handling thousands of transactions a second that are tied into hundreds of other company’s systems (the world that I work in) this doesn’t cut it. Maybe I spend too much time at this extreme, where even small problems with compatibility that only affect a small number of customers, or slight and temporary performance slow downs, are a big deal. But most people I work with and talk to in software development and maintenance and system operations agree that deployment is a big deal and needs to be done with care and attention, no matter how simple and small the changes are and no matter how clean and simple and automated the deployment process is.
Rollbacks are a myth?
Vincent wants us to “understand that it’s typically more risky to rollback than rolling forward. Always be rolling forward.”
Not even the Continuous Deployment advocates (who are often some of the most radical – and I think some of the most irresponsible – voices in the Devops community) agree with this – they still roll back if they find problems with changes.
“Rollbacks are a myth” is an echo of the “real men fail forward” crap I heard at Velocity last year and it is where I draw the line. It's one thing to state an extreme position for argument's sake or put up a straw man – but this is just plain wrong.
If you're going to deploy, you have to anticipate roll back and think about it when you make changes and you have to test rolling back to make sure that it works. All of this is hard. But without a working roll back you have no choice other than to fail forward (whatever this means, because nobody who talks about it actually explains how to do it), and this is putting your customers and their business at unnecessary risk. It’s not another valid way of thinking. It’s irresponsible.
James Hamilton wrote an excellent paper on Designing and Delivering Internet-Scale Services when he was at Microsoft (now he’s a an executive and Distinguished Engineer at Amazon). Hamilton’s paper remains one of the smartest things that anyone has written about how to deal with deployment and operational problems at scale. Everyone who designs, builds, maintains or operates an online system should read it. His position on roll back is simple and obvious and right:
Reverting to the previous version is a rip cord that should always be available on any deployment.Everything fails. Embrace failure.
I agree that everything can and will fail some day, that we can’t pretend that we can prevent failures in any system. But I don’t agree with embracing failure, at least in business-critical enterprise systems, where recovering from a failure means lost business and requires unraveling chains of transactions between different upstream and downstream systems and different companies, messing up other companies' businesses as well as your own and dealing the follow-on compliance problems. Failures in these kinds of systems, and a lot of other systems, are ugly and serious, and they should be treated seriously.
We do whatever we can to make sure that failures are controlled and isolated, and we make sure that we can recover quickly if something goes wrong (which includes being able to roll back!). But we also do everything that we can to prevent failures. Embracing failure is fine for online consumer web site startups – let’s leave it to them.
I wanted to respond to the points about SLAs, but it’s not clear to me what the author was trying to say. SLAs are not about servers. Umm, yes that’s right of course...
SLAs are important to set business expectations with your customers (the people who are using the system) and with your partners and suppliers. So that your partners and suppliers know what you need from them and what you are paying them for, and so that you know if you can depend on them when you have to. So that your customers know what they are paying you for. And SLAs (not just high-level uptime SLAs, but SLAs for Recovery Time and Recovery Point goals and incident response and escalation) are important so that your team understands the constraints that they need to work under, what trade-offs to make in design and implementation and in operations.
Under-compensating is worse than Over-compensating
I spent more time than I thought I would responding to this post, because some of what the author says is right – especially in the second part of his post, Deploy All the Things where he provides some good practical advice on how to reduce risk in deployment. He’s right that Operations main purpose isn’t to stop change – it can’t be. We have to be able to keep changing, and developers and operations have to work together to do this in safe and efficient ways. But trivializing the problems and risks of change and over-simplifying how to deal with these risks and how to deal with failures, isn’t the way to do this. There has to be a middle way between the ITIL and COBIT world of controls and paper and process, and cool Web startups failing forward, a way that can really work for the rest of us.