Embracing Downtime: Why 99.999…% Availability is Not Always Better
A couple of weeks ago, my ever-active colleagues Marco Mulder and Serge Beaumont organised an nlscrum meetup about "Combining Scrum and Operations", with presentations by Jeroen Bekaert and devopsdays organiser Patrick Debois.
Unfortunately, I was late and only managed to catch the tail end of Patrick's well-delivered talk explaining how Dev/ops can become Devops. Thankfully, the lively open space discussions that followed provided plenty of interesting insights, comments and general food for thought.
One recurring theme that particularly struck me was the comment,
uttered with regret by many in Operations, that they would very much like
to help and coordinate with the development teams but inevitably were
always too busy keeping the production environment up and running.
In other words, helping prepare for new releases might be desirable, but achieving the five nines, or whatever SLA Operations has committed to [1], will always be paramount.
This is a fallacy! Indeed, one of the core realisations of the "Devops mindset", to me, is that 99.999...% uptime is not an end in itself, but a means to an end: delivering the greatest business value possible. And aiming for the highest possible availability may not be the best way to go about it! [2]
For instance, imagine a day's downtime in production costs $500k, and you have a new feature coming up for release that is estimated to bring in an extra $1m per day. Then every day by which you can speed up the release pays for up to two days of downtime: you break even at exactly two days, and anything less is a net gain! [2]
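The arithmetic behind that example can be written out as a small sketch. The function names are made up for illustration, and the model is deliberately naive: it assumes downtime cost and feature revenue are both constant per day, which real systems rarely are.

```python
def affordable_downtime_days(days_saved, feature_value_per_day, downtime_cost_per_day):
    """Break-even point: days of downtime whose cost equals the extra
    revenue gained by releasing `days_saved` days earlier."""
    if downtime_cost_per_day <= 0:
        raise ValueError("downtime_cost_per_day must be positive")
    return days_saved * feature_value_per_day / downtime_cost_per_day


def net_value(days_saved, downtime_days, feature_value_per_day, downtime_cost_per_day):
    """Extra revenue from the earlier release minus the cost of the downtime."""
    return days_saved * feature_value_per_day - downtime_days * downtime_cost_per_day


# Figures from the example: $500k per day of downtime, $1m per day from the feature.
break_even = affordable_downtime_days(1, 1_000_000, 500_000)
print(break_even)  # 2.0

# Releasing one day earlier at the price of a day and a half of downtime
# still comes out ahead.
print(net_value(1, 1.5, 1_000_000, 500_000))  # 250000.0
```

The numbers going in are the business owner's to supply; the sketch only makes the trade-off explicit.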
The point is: the ability to maintain a stable current environment cannot be considered independently of the ability to rapidly deliver change. Rather, they need to be balanced against each other to determine which combination is likely to deliver the greatest value. This is a decision only the business owner or customer can make. And naturally, the balance needs to be continuously monitored and updated in light of new requirements and experience.
There is a residual belief that the tasks and responsibilities
of developers and Operations are sufficiently different that they can't
possibly benefit from each other's input. But whether it's the effects
of placing nodes of a distributed system in different segments of the
production network, or how the sharding and replication strategies of
the database affect query performance, or even just knowing which
version (and vendor!) of the JVM and container will be supported in
production when the application goes live [3] - developers need input from Operations, and the earlier, the better.
And only developers can add the internal health checks, debugging and tracing information, integration points for monitoring tools etc. that can mean the difference between a five-minute fix and a week's frustrated log trawling for the support team. It's revealing to see how quickly this crucial, yet often neglected, feature of an application is improved if developers are also responsible for support - generally, the first callout at three in the morning makes a world of difference. [3]
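To make "internal health checks" a little more concrete, here is a minimal sketch of the kind of hook a developer might add. Everything in it is hypothetical: `run_health_checks`, the two example checks and the report layout are invented for illustration, and a real application would probe its actual dependencies and expose the report through whatever monitoring integration Operations uses.

```python
import json
import time


def run_health_checks(checks):
    """Run each named check and collect a report a monitoring tool could poll.

    `checks` maps a name to a zero-argument callable that raises on failure.
    """
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            check()
            results[name] = {"status": "ok"}
        except Exception as exc:  # a failing check must not take down the report
            results[name] = {
                "status": "failed",
                "error": f"{type(exc).__name__}: {exc}",
            }
        results[name]["elapsed_ms"] = round((time.monotonic() - started) * 1000, 1)
    healthy = all(r["status"] == "ok" for r in results.values())
    return {"status": "ok" if healthy else "degraded", "checks": results}


# Hypothetical checks standing in for real dependency probes.
def check_database():
    pass  # e.g. run a trivial query against the production database


def check_message_queue():
    raise ConnectionError("broker unreachable")


report = run_health_checks({
    "database": check_database,
    "message_queue": check_message_queue,
})
print(json.dumps(report, indent=2))
```

A report that names the failing dependency and the error is the sort of detail that turns a week of log trawling into a five-minute fix.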
It goes without saying that the acceptable balance between stability
and change will differ from customer to customer, and from application
to application. Globally shared infrastructure can cause problems here,
because it's hard to meet the requirements of the most
demanding application without forcing all the others to pay the price.
In other words, modularity is an important goal architecturally, and if you're interacting with shared infrastructure it should be tunable to your requirements. Amazon's Dynamo and, indeed, most of the cloud and distributed platforms out there exemplify this trend. But I'd like to defer a detailed discussion of the technical implications to a later blog [4].
My colleague Robert van Loghem and I will also be talking about this and related topics in our upcoming webinar (plug!).
Going back to the nlscrum meetup, the takeaway message for me was
clear: setting up two independent entities, Development and Operations,
giving them opposing goals (delivering change on the one hand, ensuring
stability on the other) and expecting them to fight it out when the
inevitable conflict happens is not the way to best deliver
business value. We should be looking to organise our teams and
activities to deliver the balance between new features and running
systems that is most appropriate for a given application.
And we can only do that if we first go to the customer, explain that there is a trade-off to be made and work together to make it!
Addendum: in the unexpectedly long time it's taken me to finish off this post, my colleague Gero Vermaas described a client scenario that featured a real-life version of this challenge. It's good to see the client finally came round to accepting the concept, hopefully with the expected positive results!