DevOps: Improving Root Cause Analysis

Root Cause Analysis is the default problem-solving system. Let's see how DevOps culture and methodologies can improve this process.

By Derek Weeks · Aug. 11, 18 · Opinion
We have all been there in a postmortem when someone says, "Let's get to the root of the problem." And, we all know what that means: who or what is to blame?

We also all know that no one wants to play the blame game, yet we all do. But it isn't our fault (no blame, see what I did there?). It has been the default system for solving problems in business for decades. It is called root cause analysis (RCA).

We can change — for the better.

There Is No Root Cause: Emergent Behavior in Complex Systems

I recently watched a presentation from Matthew Boeckman (@matthewboeckman) entitled There Is No Root Cause: Emergent Behavior in Complex Systems. Matthew is a Developer Advocate with VictorOps and a Technology Strategist with Dryan.io. He grew up a systems guy and jokes that he has been in DevOps for 18 years, even though DevOps wasn't around back then, because he has always been nice to developers.

Digging in (pun intended): RCA focuses on what went wrong and how we can prevent it from happening again.

The core problems with RCA for development are that it doesn't account for enough complexity and that its natural focus is blame, which can undermine a positive DevOps culture.

RCA was more applicable when Waterfall was the development methodology because states stayed consistent for months or even years at a time. In the age of Agile, DevOps, CI/CD, microservices, etc., states of work are in constant flux. RCA can't provide solutions quickly enough. As Matthew notes, in RCA, things are either good or bad, working or broken, uptime or failure. The reality is that our world is more nuanced.

What Matthew recommends is to look at it through the principle of emergence because it "separates judgment from the good and the bad binary approach to our system health, and instead focuses on behaviors and interactions, patterns and complexities of our system. With practice and effort, we can manage them to more desirable states."

But what does this look like in practice?

Getting back to the analogy of the tree and its roots, the answer is more of a forest than a tree. A tree is one living organism; a forest is an ecosystem.

Matthew takes this philosophy and mental picture and gives us a better system: Cynefin. Cynefin is a Welsh word that means habitat. The framework was created by Dave Snowden (@snowded), originally for managing IBM's intellectual capital. It draws on research in systems, complexity, network, and learning theories.

Starting in the bottom right quadrant, working counter-clockwise, it goes from simple to more complex.

Simple

These are patterns or behaviors that don't require a great deal of understanding. DevOps is increasingly setting up automated systems to respond to simple issues.

Complicated

These are known unknowns. You can imagine a set of realities where they can occur, and they are probable, but not certain. For instance, a busy harbor might get a storm that causes damage to boats, docks, etc. It is hard for the harbor manager to manage and they need to think about it. This requires people to do some thinking, and it is difficult, if not impossible, to automate.

Complex

This is where we start to see emergent behaviors occur. We don't have the metrics needed to understand or manage these problems, or we haven't looked at that metric before. We start with probing, going into the system, and exploring. Think of any collection of humans at any scale. Things are still in the scope of probable, but things change quickly. There are many moving parts that aren't predictable and that we didn't fully encounter in our test methodology.

Chaotic

This is, well, chaos. Matthew's real-world example was an entire AWS region going down, causing other regions to be overloaded as system admins moved services. In chaos, you act, then get a sense of where things are, and then respond.

Disorder

In DevOps, this is where you have a lack of communication and collaboration. Here teams need to reduce (figure out what you agree on), analyze (build consensus), and iterate (move to a quadrant and continue).

Matthew notes that knowledge and practice move patterns towards more favorable quadrants. But complacency erodes the process. Complex systems left poorly managed will create increasingly complex processes to manage.
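
The first moves described for each domain above can be captured in a small lookup table, the kind of thing a team might keep next to its runbooks. This is a minimal sketch and not something from Matthew's talk: the `Domain` enum, the `FIRST_MOVES` table, and the `first_moves` helper are hypothetical names, and the step wording is paraphrased from the descriptions in this article.

```python
from enum import Enum


class Domain(Enum):
    """The five Cynefin domains as described in this article."""
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


# Hypothetical lookup table: the steps paraphrase the article's descriptions
# and are not an official Cynefin vocabulary.
FIRST_MOVES = {
    Domain.SIMPLE: ("respond with automation",),
    Domain.COMPLICATED: ("have people think it through",),
    Domain.COMPLEX: ("probe", "explore the system"),
    Domain.CHAOTIC: ("act", "sense where things are", "respond"),
    Domain.DISORDER: ("reduce", "analyze", "iterate"),
}


def first_moves(domain: Domain) -> tuple[str, ...]:
    """Return the ordered first moves for an incident in the given domain."""
    return FIRST_MOVES[domain]


if __name__ == "__main__":
    for domain in Domain:
        print(f"{domain.value}: {' -> '.join(first_moves(domain))}")
```

Keeping the moves as ordered tuples preserves the point that order matters: in chaos you act before you sense, while in a complex situation you probe first.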

How to Adopt Cynefin

  • In the moment, ask, what quadrant does this map to?
  • In the post-incident report: How did we manage the pattern? Was it complicated, complex, simple? What can we do to change it?
  • In your sprint planning: Devote time to manage your patterns clockwise. What can we move with a little bit of work?
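
One way to make these three questions routine is to tag each incident pattern with its quadrant and derive the next more favorable quadrant as a sprint-planning target. The sketch below rests on an assumption: it reads "clockwise" as moving back toward simple, since the quadrants run counter-clockwise from simple to more complex. The `IncidentPattern` record, the `clockwise_target` helper, and the sample incidents are all hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Quadrants ordered from most to least favorable. Assumption: counter-clockwise
# runs simple -> chaotic, so managing "clockwise" means moving toward simple.
QUADRANTS = ["simple", "complicated", "complex", "chaotic"]


@dataclass
class IncidentPattern:
    """Hypothetical post-incident record; field names are illustrative."""
    summary: str
    quadrant: str  # must be one of QUADRANTS

    def __post_init__(self) -> None:
        if self.quadrant not in QUADRANTS:
            raise ValueError(f"unknown quadrant: {self.quadrant!r}")


def clockwise_target(pattern: IncidentPattern) -> Optional[str]:
    """Return the next more favorable quadrant, or None if already simple."""
    index = QUADRANTS.index(pattern.quadrant)
    return QUADRANTS[index - 1] if index > 0 else None


if __name__ == "__main__":
    # Made-up backlog entries, for illustration only.
    backlog = [
        IncidentPattern("disk fills up on log host", "complicated"),
        IncidentPattern("latency spikes under mixed load", "complex"),
    ]
    for pattern in backlog:
        print(f"{pattern.summary}: {pattern.quadrant} -> {clockwise_target(pattern)}")
```

Disorder is left out of `QUADRANTS` on purpose: per the description above, a team in disorder first has to agree on which quadrant it is in before it can plan a move.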

The reality is that RCA is really only present after the fact. Cynefin calls us to action.


Convinced that Cynefin might be just what your organization needs or want to dig a little deeper? Share and watch Matthew's full talk above or check it out here. You can watch any of the 2017 AllDayDevOps sessions free-of-charge here.

All Day DevOps 2018

All Day DevOps 2018 is just around the corner! Registration is available here.

The free, online conference goes live on October 17th, offering 100 different practitioner-led sessions, each one 30 minutes long. With 5 separate tracks (CI/CD, Cloud-Native Infrastructure, DevSecOps, Cultural Transformations, and Site Reliability Engineering) and 100 speakers, there's sure to be something for everyone.

And speaking of everyone, if you're part of an organization with 20+ people who want to attend the conference (again, it's free!), then you should consider joining the Club 20 program so that you might get your company logo added to the ADDO site. Check out some of the Club 20 participants here and consider joining them.

Hope to see you online at the show!

Published at DZone with permission of Derek Weeks, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
