DevOps: Improving Root Cause Analysis

Root Cause Analysis is the default problem-solving system. Let's see how DevOps culture and methodologies can improve this process.

By Derek Weeks · Aug. 11, 18 · Opinion

We have all been there in a postmortem when someone says, "Let's get to the root of the problem." And we all know what that means: who or what is to blame?

We also all know that no one wants to play the blame game, yet we all do. But it isn't our fault (no blame, see what I did there?). It has been the default system for solving problems in business for decades. It is called root cause analysis (RCA).

We can change — for the better.

There Is No Root Cause: Emergent Behavior in Complex Systems

I recently watched a presentation from Matthew Boeckman (@matthewboeckman) entitled There Is No Root Cause: Emergent Behavior in Complex Systems. Matthew is a Developer Advocate with VictorOps and a Technology Strategist with Dryan.io. He grew up a systems guy and jokes that he has been in DevOps for 18 years, even though DevOps wasn't around back then, because he has always been nice to developers.

Digging in (pun intended), RCA focuses on what went wrong, and how we can prevent it from happening again.

The core problems with RCA for development are that it doesn't account for enough complexity and that its natural focus is blame, which can undermine a positive DevOps culture.

RCA was more applicable when Waterfall was the development methodology because states stayed consistent for months or even years at a time. In the age of Agile, DevOps, CI/CD, microservices, etc., states of work are in constant flux. RCA can't provide solutions quickly enough. As Matthew notes, in RCA, things are either good or bad, working or broken, uptime or failure. The reality is that our world is more nuanced.

What Matthew recommends is to look at it through the principle of emergence because it "separates judgment from the good and the bad binary approach to our system health, and instead focuses on behaviors and interactions, patterns and complexities of our system. With practice and effort, we can manage them to more desirable states."

But what does this look like in practice?

Getting back to the analogy of the tree and its roots, the answer is more of a forest than a tree. A tree is a single living organism; a forest is an ecosystem.

Matthew takes this philosophy and mental picture and gives us a better system: Cynefin. Cynefin is a Welsh word that means habitat. The framework was created by Dave Snowden (@snowded), originally for managing IBM's intellectual capital. It draws on research in systems, complexity, network, and learning theories.

Starting in the bottom-right quadrant and working counter-clockwise, the framework goes from simple to more complex.

Simple

These are patterns or behaviors that don't require a great deal of understanding. DevOps is increasingly setting up automated systems to respond to simple issues.
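As a rough illustration of what "automated systems to respond to simple issues" could look like, here is a minimal sketch of a runbook lookup. The alert names, the actions, and the `auto_respond` helper are all hypothetical and not taken from the talk; the point is only that a simple pattern maps a known cause to a known response, and anything unrecognized is handed to people.

```python
from typing import Callable, Dict, Optional


# Hypothetical actions: they only describe what would be done.
def restart_service(target: str) -> str:
    return f"restart {target}"


def clear_tmp(target: str) -> str:
    return f"clear temp files on {target}"


# Hypothetical runbook: known alert -> known response.
RUNBOOK: Dict[str, Callable[[str], str]] = {
    "service_down": restart_service,
    "disk_full": clear_tmp,
}


def auto_respond(alert: str, target: str) -> Optional[str]:
    """Return the automated action for a known alert, or None to escalate."""
    action = RUNBOOK.get(alert)
    return action(target) if action else None
```

An alert that is not in the runbook returns `None`, which is the signal that the pattern is probably not simple and needs a human.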

Complicated

These are known unknowns. You can imagine a set of realities where they can occur, and they are probable, but not certain. For instance, a busy harbor might get a storm that damages boats, docks, etc. It is hard for the harbor manager to manage and requires people to think it through, so it is difficult, if not impossible, to automate.

Complex

This is where we start to see emergent behaviors occur. We don't have the metrics needed to understand or manage these problems, or we haven't looked at the relevant metric before. We start with probing, going into the system, and exploring. Think of any collection of humans at any scale. Things are still in the scope of probable, but things change quickly. There are many moving parts that aren't predictable and that we didn't fully encounter in our test methodology.

Chaotic

This is, well, chaos. Matthew's real-world example was an entire AWS region going down, which overloaded other regions as system admins moved services. In chaos, you act, then get a sense of where things are, and then respond.

Disorder

In DevOps, this is where you have a lack of communication and collaboration. Here, teams need to reduce (figure out what you agree on), analyze (build consensus), and iterate (move to a quadrant and continue).
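The five domains and their starting moves can be summarized in a small lookup. This is a sketch, not something from the talk: the "probe" and "act" sequences and the disorder steps follow the descriptions in this article, while the sequences for simple and complicated are filled in from the general Cynefin framework and should be read as an assumption here. `Domain`, `RESPONSE`, and `first_move` are illustrative names.

```python
from enum import Enum


class Domain(Enum):
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


# Order of moves per domain. SIMPLE and COMPLICATED are assumptions taken
# from the general Cynefin framework, not from the article itself.
RESPONSE = {
    Domain.SIMPLE: ("sense", "categorize", "respond"),
    Domain.COMPLICATED: ("sense", "analyze", "respond"),
    Domain.COMPLEX: ("probe", "sense", "respond"),
    Domain.CHAOTIC: ("act", "sense", "respond"),
    Domain.DISORDER: ("reduce", "analyze", "iterate"),
}


def first_move(domain: Domain) -> str:
    """Return the first thing to do once an incident is mapped to a domain."""
    return RESPONSE[domain][0]
```

For example, `first_move(Domain.CHAOTIC)` returns `"act"`, while `first_move(Domain.COMPLEX)` returns `"probe"`.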

Matthew notes that knowledge and practice move patterns towards more favorable quadrants. But complacency erodes the process. Complex systems left poorly managed will create increasingly complex processes to manage.

How to Adopt Cynefin

  • In the moment, ask, what quadrant does this map to?
  • In the post-incident report: How did we manage the pattern? Was it complicated, complex, simple? What can we do to change it?
  • In your sprint planning: Devote time to manage your patterns clockwise. What can we move with a little bit of work?
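One way to make these three habits concrete is to tag each incident with a domain in the post-incident report and use the tags in sprint planning. The sketch below assumes a hypothetical `Incident` record and takes "clockwise" to mean one step toward simple, the reverse of the counter-clockwise order the article describes. It is an illustration, not a prescribed process.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Ordered from simple to chaotic (the counter-clockwise order in the article).
ORDER = ["simple", "complicated", "complex", "chaotic"]


@dataclass
class Incident:
    title: str
    domain: str  # one of ORDER, assigned by the team in the post-incident report


def clockwise_target(domain: str) -> Optional[str]:
    """Next domain clockwise (one step simpler), or None if already simple."""
    position = ORDER.index(domain)
    return ORDER[position - 1] if position > 0 else None


def sprint_candidates(incidents: List[Incident]) -> Counter:
    """Count incidents per domain that could still be moved clockwise."""
    return Counter(
        inc.domain for inc in incidents if clockwise_target(inc.domain)
    )
```

A team could review the counts at sprint planning and pick the domain where a little work would move the most patterns one step clockwise.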

The reality is that RCA is really only present after the fact. Cynefin calls us to action.


Convinced that Cynefin might be just what your organization needs, or want to dig a little deeper? Share and watch Matthew's full talk above or check it out here. You can watch any of the 2017 AllDayDevOps sessions free of charge here.

All Day DevOps 2018

All Day DevOps 2018 is just around the corner! Registration is available here.

The free, online conference goes live on October 17th, offering 100 different practitioner-led sessions, each one 30 minutes long. With 5 separate tracks (CI/CD, Cloud-Native Infrastructure, DevSecOps, Cultural Transformations, and Site Reliability Engineering) and 100 speakers, there's sure to be something for everyone.

And speaking of everyone, if you're part of an organization with 20+ people who want to attend the conference (again, it's free!), then you should consider joining the Club 20 program so that you might get your company logo added to the ADDO site. Check out some of the Club 20 participants here and consider joining them.

Hope to see you online at the show!

Published at DZone with permission of Derek Weeks, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
