The wrong answer to: "How does UrbanCode do DevOps internally?"
Last week during an interview on DevOps, I was asked how UrbanCode has “done DevOps”. I spoke a bit about how we streamlined the releases of our website. However, to some degree I felt that we don’t have significant enough “Ops” to really do DevOps. After all, the vast majority of the software we create is installed by our customers and runs in our customers’ environments.
Upon further reflection, it hit me. UrbanCode faces a massive Dev / Ops silo problem. “Ops” is hundreds of system administration or tool owner teams working for different companies. The operations teams aren’t on staff at UrbanCode; they work for our customers and are responsible for delivering our product, among others, to end users in a highly reliable way.
Time-consuming or risky upgrades make operations teams less willing to upgrade the software, so they upgrade less often. Therefore, if our goal is to continuously deliver new capabilities and fixes to our customers, we must make it fast and safe to upgrade. Our rate of delivering new versions has zero customer impact if the customers’ system administrators do not install the updates.
How We Really Do DevOps at UrbanCode
We work to fulfill the “Dev” side of the partnership as best we can: actively engaging Ops teams, and building a development infrastructure that maximizes productivity. Given that we have been building continuous integration and delivery tools for the past twelve years, it should be no surprise that our development infrastructure is pretty solid.
The hard part is servicing our distributed operations teams to make upgrading, maintaining, and troubleshooting easier. Breaking down the barriers between development and operations is also tricky. Here are some of the practices and technologies we have learned to adopt.
Automatic Database Upgrades
Updating a schema by hand is never fun. Early in the development of AnthillPro 3.x we put considerable effort into building a toolkit that would perform all the database updates needed to move from any version to any later version. We’ve used it in every product since, and offer a uDeploy plugin for customers who use it for their own apps. This also helps in our test labs as we make small upgrades to prior versions.
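To make the "any version to any later version" idea concrete, here is a minimal sketch of one common way to structure such a toolkit: a chain of single-step migrations that are applied in order. This is not UrbanCode's actual implementation. The `MIGRATIONS` table, the version numbers, the SQL statements, and the `plan_upgrade` function are all hypothetical and exist only for illustration.

```python
# Hypothetical registry: the entry for version N holds the statements that
# move a schema from version N to version N + 1. The SQL is illustrative only.
MIGRATIONS = {
    1: ["ALTER TABLE build ADD COLUMN status VARCHAR(32)"],
    2: ["CREATE INDEX idx_build_status ON build (status)"],
    3: ["ALTER TABLE agent ADD COLUMN last_seen TIMESTAMP"],
}
LATEST_VERSION = max(MIGRATIONS) + 1


def plan_upgrade(current, target=LATEST_VERSION):
    """Return the ordered statements that move a schema from current to target."""
    if current > target:
        raise ValueError("downgrades are not supported in this sketch")
    if current < min(MIGRATIONS) or target > LATEST_VERSION:
        raise ValueError("unknown schema version")
    statements = []
    # Walk every intermediate version so that no single-step migration is skipped.
    for version in range(current, target):
        statements.extend(MIGRATIONS[version])
    return statements


if __name__ == "__main__":
    # A database at version 2 receives the 2 -> 3 and 3 -> 4 steps, in order.
    for statement in plan_upgrade(2):
        print(statement)
```

Because every step only knows how to go from one version to the next, supporting a new release means adding one entry rather than writing an upgrade path from each older version.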
Centralized Agent Upgrades
Many of our products use a server-agent architecture. A central server (or cluster) decides what should be done, and agents do the work. For every central server that needs to be upgraded, there may be hundreds or thousands of agents. No matter how easy your upgrader is to run, doing any chore a few thousand times stinks. Fortunately, we build automation tools, so we taught ours to upgrade themselves on command. The central server distributes new versions of the agent code to the agents, which then upgrade themselves.
Avoiding Agent Upgrades
Even centralized automatic upgrades proved to be a barrier. With a 99.5 percent success rate, any upgrade of thousands of agents would leave tens of servers offline. This could block releases and was generally a pain to diagnose and repair. So we moved our protocols away from serialization, and most of our application upgrades no longer require agent upgrades.
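The arithmetic behind that claim is easy to check. The sketch below uses the 99.5 percent figure from above. The fleet size of 2,000 agents is an assumed example, and the second function additionally assumes that agent upgrades fail independently of one another, which real fleets may not satisfy.

```python
def expected_failures(agent_count, success_rate):
    """Expected number of agents left broken after one upgrade run."""
    return agent_count * (1 - success_rate)


def probability_all_succeed(agent_count, success_rate):
    """Chance that every agent upgrades cleanly, assuming independent failures."""
    return success_rate ** agent_count


if __name__ == "__main__":
    agents = 2000   # assumed fleet size, for illustration only
    rate = 0.995    # per-agent success rate quoted in the text
    print(f"expected failures: {expected_failures(agents, rate):.1f}")
    print(f"chance of a clean run: {probability_all_succeed(agents, rate):.6f}")
```

At that fleet size the expected number of failed agents is about ten, and a run in which nothing fails is extremely unlikely, so someone has to diagnose broken agents after nearly every upgrade.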
Some customers still have lengthy validation programs for new versions. That prevented them from taking minor updates containing fixes for bugs that bothered them. End users would get frustrated and file tickets about bugs that we had fixed months earlier, but that their system administrators had declined to pick up via an upgrade. Clearly, we had angry customers and higher support costs because we had not made our “Ops” teams nimble enough. To drive down the cost of taking fixes, we added a special patch capability that could be installed (and uninstalled) in minutes, in some cases without any downtime.
We found that the features and updates end users most commonly wanted were related to integrations. We introduced plugins in 2009 to decouple updating an integration from a full upgrade. Plugins, each representing a single integration, are versioned, uploaded into the tools through the web interface, and transparently distributed to agents. Power users can upgrade integrations on a regular basis without any involvement from the sys-admins. Plugins run on any version of the application that supports their schema level, and most plugins run on any vaguely recent version of the products. As a bonus, the standardization that plugins required considerably reduced our development effort per unit of value.
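A compatibility rule like "runs on any version supporting its schema level" can be sketched as a simple comparison. The `Plugin` class, its field names, and the numeric schema levels below are hypothetical and are not taken from the products; they only illustrate how a plugin's release cadence can be decoupled from the application's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plugin:
    name: str
    version: int
    min_schema_level: int  # hypothetical: lowest plugin schema level required


def can_install(plugin, server_schema_level):
    """A plugin is installable when the server supports its schema level."""
    return server_schema_level >= plugin.min_schema_level


def latest_compatible(plugins, server_schema_level):
    """Pick the newest plugin version this server can run, or None."""
    compatible = [p for p in plugins if can_install(p, server_schema_level)]
    return max(compatible, key=lambda p: p.version, default=None)


if __name__ == "__main__":
    releases = [
        Plugin("example-integration", version=3, min_schema_level=1),
        Plugin("example-integration", version=4, min_schema_level=1),
        Plugin("example-integration", version=5, min_schema_level=2),
    ]
    # An older server (schema level 1) still gets the newest release it supports.
    print(latest_compatible(releases, server_schema_level=1))
```

Under this rule the plugin schema level only needs to change when the plugin contract itself changes, so most integration updates can ship without a server upgrade.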
From a certain perspective, plugins are architecturally similar to micro-services. Each is a discrete collection of functionality that may be pulled into a larger system that has a mostly unrelated release cadence.
Developers Working Support
We have always had a close working relationship between development and support. From time to time, we have even rotated developers into support roles for a week or two at a time. This is a common DevOps strategy for a reason. There’s nothing like tracing a customer problem back to your own code to nudge you toward quality.
We also encouraged sys-admins who planned upgrades during weekend and evening hours to let us know. We made sure to have people familiar with the upgrade standing by in support. Any risks to a successful upgrade would be partially borne by the developers responsible for creating it.
Over the years, we’ve added more and more diagnostic information to the tools themselves. Some of it is highly technical, such as thread timers and memory utilization. Some of it is within the problem domain: uBuild, for instance, will report on the relationships between builds to help diagnose why a build happened (or was skipped). These tools help the sys-admins inspect the software and provide the support team with key information.
Encourage Customers to Have Test Environments
Larger customers often put in place test environments for our products where they can run upgrades using a copy of their data and play with the new features before using them in production. Running the same process against the same data prior to the official cut-over helped boost confidence and facilitate more updates.
Key Lessons Learned
- Yes, there is an “Ops” role for your product. Understand who that is. Consider, for example, what problems an app store solves.
- Developers must cater to anyone between them and the end user.
- Any part of an upgrade that could be automated should be.
- Decouple. It’s not enough to be multi-tier; you need to decouple upgrades and changes as much as possible to make upgrades cheap and low risk.
- If you build it, you help run it. Expose developers to support and regular communication with operations teams. More diagnostics built into the tool is a natural result.