I recently attended DevOpsDays Seattle 2016, which was a huge success, with lots of leading-edge thinking from people who are truly passionate about improving the state of the art of software development.
One of the presenters put forth some provocative thoughts on the value of root cause analysis in complex adaptive systems. These thoughts can be summarized as:
- The nature of complex adaptive systems is such that root cause analysis (RCA) is a complete waste of your time;
- Various cognitive biases further exacerbate the difficulty in determining problem root causes;
- Even if you do find the root causes of problems, you can’t predict and prevent the next problem;
- Teams should focus on improving their mean time to repair (MTTR) instead of focusing on improving mean time between failures (MTBF);
- High-performance teams aren’t even bothering with RCA anymore.
All this got me thinking about my own experience with high-performance teams and how they learn and adapt to changing conditions to deliver the right results.
If we don’t dive in and try to understand the nature of things, we have no basis for improvement.
RCA is an important part of the kaizen process. There is, however, a proper time and place for it, and it is not a silver bullet for preventing future problems. Will we ever really find the cause of all problems? It’s unlikely that we will. It’s also true that humans tend to fall prey to cognitive biases, such as confirmation bias, which cause us to jump to conclusions or make incorrect correlations based on our own preconceptions. That said, the kind of critical thinking used in RCA techniques like Five Whys and A3 Problem Solving plays an important role in learning why things are the way they are.
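To make the Five Whys technique concrete, here is a purely hypothetical chain (not drawn from any real incident): each answer becomes the subject of the next “why,” until you bottom out at something you can act on.

```
Problem: The service returned errors to customers for 30 minutes.
1. Why? The database connection pool was exhausted.
2. Why? Connections were held open far longer than usual.
3. Why? A recent schema change introduced a slow query.
4. Why? The change wasn’t load-tested against production-scale data.
5. Why? The deployment process has no performance-testing step for schema changes.
Corrective action: add performance testing of schema changes to the pipeline.
```

Note that the final “why” lands on a process gap rather than a person, which is what keeps the exercise blameless and actionable.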
In the process of exploring these ideas, I’ll be drawing on some of my recent experiences as a software development manager (SDM) at Amazon where I managed the development operations for a mission-critical platform service.
There’s no way I’m going to argue with the assertion that teams need to continuously improve their ability to recover quickly from disastrous situations. This idea is largely common sense, but sadly most teams don’t learn this lesson until it’s much too late (i.e. they experience a catastrophic failure). When technology services fail in a business enterprise, there can be serious financial and reputational consequences on the line, so MTTR is definitely not just an abstract concept.
Having lived through disaster recovery efforts in a large enterprise, I can attest that trying to understand why things got into that state is both practically impossible and counter-productive in the haze and heat of the moment. Indeed, one of the best practices for a service owner is that you always fix the immediate problem and restore service before trying to understand why the problem occurred in the first place.
If you knew enough about how to fix a potential problem to document the remedy in a runbook, you would likely have already put many of the preventative measures in place to begin with.
It is true that even the most comprehensive of runbooks are largely useless in the majority of disaster scenarios. The most insidious problems are the ones we could never have conceived of before they happen (the Rumsfeldian “unknown unknowns”), and unfortunately far more things fall into this category than into the known category.
So there’s a lot to be said for not wasting too much time trying to predict everything that could happen and putting preventative measures in place for all of it. If we did that, we’d have bullet-proof systems that never actually get delivered.
The greatest opportunities for learning occur when things go wrong, so it’s foolish not to capitalize on those moments.
At one point, the service I managed experienced a critical outage. Without going into the details, it manifested slowly at first, but then quickly spread to affect hundreds of downstream systems across the enterprise. We had not delivered a new build to production in a while, so that wasn’t the likely culprit. Many outages are related to some change in the software, and can be repaired by simply rolling back to the previous build or quickly rolling forward a change that undoes the changes in the current build. However, this case was truly a black swan - something for which no one on the team or the larger organization had a heuristic. The recovery effort involved dozens of people working for hours before we finally isolated and fixed the immediate symptoms of the problem.
Neither during nor after the incident did anyone from leadership come down and beat me up for letting it occur on my watch. No one wanted to punish me or my team. There was no blame. The emphasis was squarely on understanding the full impact of the problem, how the system was able to get into the error state, what (if anything) could be done to prevent such occurrences, and what could be done to decrease the time to recover in future incidents. In short, what actionable things could we learn from what happened? This attitude of continuous learning and blameless retrospectives is one of the reasons Amazon is crushing it and innovating at an unprecedented pace.
Teams that embrace RCA-based ways of thinking tend to be higher-performing because they learn and adapt more effectively as a unit.
One of the mechanisms Amazon uses to learn and improve when things go wrong is called the Correction of Error (COE) process. I’m not spilling the beans here as this is something Amazon talks openly about in forums like AWS re:Invent as a key component of its ability to innovate rapidly. The main components of a COE document are: the problem, the customer impact, a root cause analysis of the problem (using Five Whys), and corrective actions. The Five Whys part of the COE is where the real learning occurs. Some folks took a punitive view of the COE process owing to the level of effort involved in doing it right, but in my experience it is an incredibly effective learning and team-building tool.
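A minimal sketch of what such a document’s skeleton might look like, with section names approximated from the components listed above (this is an illustration, not an official Amazon template):

```
# Correction of Error: <incident title>
## Problem statement    - what happened, when, and for how long
## Customer impact      - who was affected, and how badly
## Five Whys            - the causal chain, asked until it bottoms out
## Corrective actions   - follow-up items, each with an owner and a date
```

The discipline of filling in every section, rather than stopping at the first plausible cause, is what drives the learning.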
In our case, the COE was a pretty large effort. It required a lot of investigation and took a good amount of time at the expense of other things we could have otherwise been working on. However, it revealed a lot of very valuable insights (including some tangential to the actual incident) that we otherwise would not have identified or had a strong business case to pursue. The Five Whys component was both wide and deep in nature and turned up a plethora of underlying problems with our software, its operating environment, and our processes. The value of solving many of the identified problems exceeded the value of most of the other features we had previously planned to work on next. So going through the process helped us gain greater clarity on where to spend effort, via a backlog of operational improvements that could be prioritized against all the other stuff our customers wanted.
MTTR is just one measure of a team’s effectiveness. Some of the actions from the COE were absolutely focused on improving our MTTR, and we seriously reduced the startup time of our service as a result, which served us well down the line. Interestingly, in order to improve MTTR, you need to do some analysis to understand the current MTTR, why it is what it is, and what you can try in order to improve it.
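As a minimal sketch of that kind of analysis, MTTR and MTBF can both be computed directly from an incident log of start and resolution timestamps (the incident data and function names below are hypothetical, invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start, resolved) timestamp pairs.
incidents = [
    (datetime(2016, 1, 4, 9, 30),   datetime(2016, 1, 4, 11, 0)),
    (datetime(2016, 2, 17, 22, 15), datetime(2016, 2, 18, 0, 45)),
    (datetime(2016, 3, 29, 14, 0),  datetime(2016, 3, 29, 14, 40)),
]

def mttr(incidents):
    """Mean time to repair: average duration from start to resolution."""
    repairs = [resolved - start for start, resolved in incidents]
    return sum(repairs, timedelta()) / len(repairs)

def mtbf(incidents):
    """Mean time between failures: average gap between successive incident starts."""
    starts = sorted(start for start, _ in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)

print("MTTR:", mttr(incidents))  # average repair time across incidents
print("MTBF:", mtbf(incidents))  # average time between incident starts
```

Even a crude baseline like this gives you something concrete to improve against, and the “why it is what it is” question then becomes a Five Whys exercise on the slowest repairs.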
We also observed the following additional benefits from doing RCA on various problems in our space:
- Our team, which had heretofore been somewhat fragmented, pulled together and worked as a much more effective unit, both to perform the RCA and to implement the resulting corrective actions;
- The team started to embrace RCA and COEs as a learning tool in the normal course of work. For example, on-call engineers put together COEs to identify improvements after many types of problems occurred;
- On-call engineers started to devote a larger portion of their time to complete items in the operational improvement backlog, which resulted in a decrease in recurrence of classes of problems. This, in turn, freed up engineers to focus on improving things instead of reacting to problems;
- We learned that many of the problems we encountered could be prevented with an increased emphasis on engineering excellence so we invested in higher levels of automation in the build pipeline to prevent “known” classes of problems from reaching production;
- There is rarely, if ever, a single root cause for software problems. More often, a confluence of events manifests in some unpredictable behavior;
- While some corrective actions are obvious, others require some amount of experimentation to determine the best approach;
- You have to learn how far to take an RCA before you hit diminishing returns: far enough to maximize the learning, but no further.
Root cause analysis is not a waste of time. It’s true that we’ll never be able to predict and prevent every type of software malady, but that’s not really the point. The point is that the performance of teams that take the time to understand the nature of problems as a unit will almost always exceed that of teams that don’t, and root cause analysis is one of many techniques teams can use to amplify their learning.