Breaking my Production Website: A Post-Mortem
The Really Short Story
I had a failed deployment. I know how to handle complex deployments, and didn’t follow my own advice. In the future, I should change my automation so that it is easier to do the right thing than to mess up in this way. I should also use higher-bandwidth communication when discussing complex deployments with my developers.
The Longer Story
DEPLOYMENT GOAL / DEVELOPMENT / TESTING
We recently put the finishing touches on a new white-paper, “Deployment Automation Basics”, and wanted to post it to the website. Unfortunately, we discovered that the backend CMS for that type of new content had been broken in our move from anthillpro.com to urbancode.com in the summer. As a retired developer, I punted the bug fix work over to an active developer (Mike). Mike figured out the problem, made the required back-end and front-end changes and delivered them to the Test environment quickly. I tested the behavior there and after an iteration or two, we had something that would work well. My Test environment was now correct. All I had to do was promote what was in that environment to Production and all would be well.
Poor Dev / Ops Communication / Planning Prod Deploy
At this point, I asked Mike for the scope of the changes over instant message. I learned that the secure content upload was missing and a number of configuration changes were required. As I read through the list, the changes sounded like they were confined to the white-paper management system. Mike agreed that there were exactly two impacted components:
- Urbancode-com-content (the website content)
- Urbancode-com-app (the backend system) via a change to its build-time dependency LC-CMS (content management widgets).
Everything else seemed consistent with that scope:
- Deployments of those apps had targeted Test in between my tests being broken and my tests working.
- The developer had actually made changes impacting those components.
- No other source code changes fed into those components during that time.
- The developer’s other source code changes during that time were not at all related to the website – he’d actually been working on uDeploy.
The Production Deployment
Updating both the front and back-end concurrently felt unsafe to me. I’d never done it. So I started with the simple deployment I do several times a week: I pushed updated content out the door. It is a simple secondary process in AnthillPro executed against the version currently in Test. It takes three minutes across the WAN, so I checked email. When I got my “deployment complete” instant message, I wrapped up the email I was reading and checked Prod.
Even before I got to the white-paper area. Disaster. No website at urbancode.com. Just a stack-trace. Rational thought left me, and sheer animal terror set in. Rollback!
Over the years, I’ve demoed executing a simple rollback in AnthillPro dozens of times. I quickly looked up the previous production version of the content, and re-pushed it. Website was back up three minutes later and working perfectly. Our total outage was under five minutes. Given the role and traffic loads of our site, that qualifies as “Bad, but not tragic.”
Still, what the !@#$ happened?
Post-Mortem and Success
A politer version of that question went to my developer when he was back in the office. It turns out that the back-end changes actually impacted the whole site, not just the white-paper area. I should have pushed the back-end first, then the front. This, it turns out, is always the expected order when both elements change. Changing both components is quite rare for us, though, and I’d never been responsible for a migration where that took place.
We executed the deployments in the correct order that day with perfect success.
- As I preach in “Mastering Complex Application Deployments”, the whole deployment process, including all components, should be defined once, with partial deployments executing a subset of it.
- We should have migrated this deployment from AnthillPro (which promotes components / builds) to uDeploy (which deploys the whole system). We ate our own dog food, but the wrong flavor.
- Mike was in Cleveland while I was in Denver. Since we couldn’t sit and talk about this change, we should have had a phone or Skype conversation rather than instant message. We could have talked through the release a little better and caught the order dependency I had missed.
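To make the first lesson concrete, here is a minimal sketch of what "define the whole process, let partial deployments execute a subset" could look like. It is not how AnthillPro or uDeploy model this. The component names come from this post, but the `DEPENDS_ON` map and the helpers `full_order`, `plan` and `blocked_by` are hypothetical names made up for illustration, and the single dependency encoded is simply the "back-end first, then the front" rule described above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumption: the content must be deployed after the back-end app,
# mirroring the "back-end first, then the front" rule from the post.
DEPENDS_ON = {
    "Urbancode-com-app": [],
    "Urbancode-com-content": ["Urbancode-com-app"],
}


def full_order(depends_on=DEPENDS_ON):
    """Order for deploying the whole system: dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def plan(changed, depends_on=DEPENDS_ON):
    """Partial deployment: the changed components, in full-system order."""
    unknown = set(changed) - set(depends_on)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [name for name in full_order(depends_on) if name in changed]


def blocked_by(component, changed, deployed, depends_on=DEPENDS_ON):
    """Changed dependencies of `component` that are not deployed yet."""
    return [
        dep
        for dep in depends_on[component]
        if dep in changed and dep not in deployed
    ]


if __name__ == "__main__":
    changed = {"Urbancode-com-content", "Urbancode-com-app"}
    print(plan(changed))
    print(blocked_by("Urbancode-com-content", changed, deployed=set()))
```

With a guard like `blocked_by` in front of the content push, the routine content-only deployment would still go straight through on a normal day, while a release that also changed the back-end would be stopped until the back-end had gone out first.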
At the end of the day, intellectually knowing what to do isn’t enough. Doing things correctly always needs to be more natural and easier than doing things wrong. A “standard operating procedure” would have helped encourage the communication that was lacking, and moving the dependency knowledge from our heads into our automation would have prevented the outage outright.