DevOps: Improving Root Cause Analysis
Root Cause Analysis is the default problem-solving system. Let's see how DevOps culture and methodologies can improve this process.
We have all been there in a postmortem when someone says, "Let's get to the root of the problem." And, we all know what that means: who or what is to blame?
We also all know that no one wants to play the blame game, yet we all do. But it isn't our fault (no blame, see what I did there?). It has been the default system for solving problems in business for decades. It is called root cause analysis (RCA).
We can change — for the better.
There Is No Root Cause: Emergent Behavior in Complex Systems
I recently watched a presentation from Matthew Boeckman (@matthewboeckman) entitled There Is No Root Cause: Emergent Behavior in Complex Systems. Matthew is a Developer Advocate with VictorOps and a Technology Strategist with Dryan.io. He grew up a systems guy and jokes that he has been in DevOps for 18 years, even though DevOps wasn't around back then, because he has always been nice to developers.
Digging in (pun intended), RCA focuses on what went wrong, and how we can prevent it from happening again.
The core problems with RCA for development are that it doesn't account for enough complexity and that its natural focus is blame, which can undermine a positive DevOps culture.
RCA was more applicable when Waterfall was the development methodology because states stayed consistent for months or even years at a time. In the age of Agile, DevOps, CI/CD, microservices, etc., states of work are in constant flux. RCA can't provide solutions quickly enough. As Matthew notes, in RCA, things are either good or bad, working or broken, uptime or failure. The reality is that our world is more nuanced.
What Matthew recommends is to look at it through the principle of emergence because it "separates judgment from the good and the bad binary approach to our system health, and instead focuses on behaviors and interactions, patterns and complexities of our system. With practice and effort, we can manage them to more desirable states."
But what does this look like in practice?
Getting back to the analogy of the tree and its roots, the answer is more of a forest than a tree. A tree is a single living organism; a forest is an ecosystem.
Matthew takes this philosophy and mental picture and gives us a better system: Cynefin. Cynefin is a Welsh word that means habitat. The framework was created by Dave Snowden (@snowded), originally for managing IBM's intellectual capital. It draws on research in systems, complexity, network, and learning theories.
The framework is drawn as quadrants. Starting in the bottom right and working counter-clockwise, they go from simple to more complex: simple, complicated, complex, and chaotic.
Simple: These are patterns or behaviors that don't require a great deal of understanding. DevOps is increasingly setting up automated systems to respond to simple issues.
Complicated: These are known unknowns. You can imagine a set of realities where they can occur, and they are probable, but not certain. For instance, a busy harbor might get a storm that damages boats, docks, etc. That is hard for the harbor manager to plan for. It requires people to do some thinking, and it is difficult, if not impossible, to automate.
Complex: This is where we start to see emergent behaviors occur. We don't have the metrics needed to understand or manage these problems, or we haven't looked at the relevant metric before. We start with probing: going into the system and exploring. Think of any collection of humans at any scale. Things are still in the scope of probable, but things change quickly. There are many moving parts that aren't predictable and that we didn't fully encounter in our test methodology.
Chaotic: This is, well, chaos. Matthew's real-world example was an entire AWS region going down, causing other regions to be overloaded as system admins moved services. In chaos, you act, then get a sense of where things are, and then respond.
Disorder: In DevOps, this is where you have a lack of communication and collaboration. Here teams need to reduce (figure out what you agree on), analyze (build consensus), and iterate (move to a quadrant and continue).
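To make the mapping concrete, here is a minimal Python sketch that pairs each state described above with the first move the talk suggests for it. The names `Quadrant`, `FIRST_MOVE`, and `first_move` are hypothetical, invented for this illustration, and the wording of each move is a paraphrase of the descriptions above rather than anything prescribed by Cynefin itself. The fifth state, the lack of communication and collaboration, is commonly called disorder in Cynefin.

```python
from enum import Enum


class Quadrant(Enum):
    """The Cynefin states described above (hypothetical names for illustration)."""
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"  # the team has not yet agreed where the problem sits


# First move per state, paraphrased from the article text (an assumption,
# not an official Cynefin definition).
FIRST_MOVE = {
    Quadrant.SIMPLE: "automate the response",
    Quadrant.COMPLICATED: "have people think it through",
    Quadrant.COMPLEX: "probe and explore the system",
    Quadrant.CHAOTIC: "act first, then sense and respond",
    Quadrant.DISORDER: "reduce, analyze, iterate toward a quadrant",
}


def first_move(quadrant: Quadrant) -> str:
    """Look up the suggested first move for a given state."""
    return FIRST_MOVE[quadrant]


if __name__ == "__main__":
    for q in Quadrant:
        print(f"{q.value}: {first_move(q)}")
```

A lookup table like this is only a conversation starter for an on-call runbook; deciding which state an incident actually belongs to still takes human judgment.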
Matthew notes that knowledge and practice move patterns towards more favorable quadrants, but complacency erodes the process. Complex systems left poorly managed will create increasingly complex processes to manage.
How to Adopt Cynefin
- In the moment, ask, what quadrant does this map to?
- In the post-incident report: How did we manage the pattern? Was it complicated, complex, simple? What can we do to change it?
- In your sprint planning: Devote time to manage your patterns clockwise. What can we move with a little bit of work?
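As a rough sketch of how the steps above might feed sprint planning, the Python below tallies incidents by the quadrant assigned in the post-incident report and names the next, more favorable quadrant to aim for. The `Incident` class, the `ORDER` list, and both helper functions are hypothetical, and the incident titles are made-up sample data, not examples from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    quadrant: str  # "simple", "complicated", "complex", or "chaotic"


# Most favorable first; moving a pattern "clockwise" means moving it
# one step toward the front of this list (an assumed simplification).
ORDER = ["simple", "complicated", "complex", "chaotic"]


def next_target(quadrant: str) -> str:
    """Return the next more favorable quadrant; simple stays simple."""
    index = ORDER.index(quadrant)
    return ORDER[max(index - 1, 0)]


def summarize(incidents: list[Incident]) -> Counter:
    """Count incidents per quadrant for the sprint-planning discussion."""
    return Counter(incident.quadrant for incident in incidents)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    log = [
        Incident("disk full on build agent", "simple"),
        Incident("intermittent latency spike", "complex"),
        Incident("cache stampede after deploy", "complex"),
    ]
    for quadrant, count in summarize(log).items():
        print(f"{quadrant}: {count} incident(s), aim for {next_target(quadrant)}")
```

The tally shows where patterns cluster; which of them can move "with a little bit of work" is still a team decision.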
The reality is that RCA is really only present after the fact. Cynefin calls us to action.
Convinced that Cynefin might be just what your organization needs or want to dig a little deeper? Share and watch Matthew's full talk above or check it out here. You can watch any of the 2017 AllDayDevOps sessions free-of-charge here.
All Day DevOps 2018
The free, online conference goes live on October 17th, offering 100 different practitioner-led sessions, each one 30 minutes long. With 5 separate tracks (CI/CD, Cloud-Native Infrastructure, DevSecOps, Cultural Transformations, and Site Reliability Engineering) and 100 speakers, there's sure to be something for everyone.
And speaking of everyone, if you're part of an organization with 20+ people that want to attend the conference (again, it's free!) then you should consider joining the Club 20 program so that you might get your company logo added to the ADDO site. Check out some of the Club 20 participants here and consider joining them.
Hope to see you online at the show!
Published at DZone with permission of Derek Weeks, DZone MVB. See the original article here.