DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. DevOps and CI/CD
  4. The Problem With MTTR: Learning From Incident Reports

The Problem With MTTR: Learning From Incident Reports

While at the DevOps Enterprise Summit in Las Vegas, we caught up with Courtney Nash at Verica to discuss her position as an Internet Incident Librarian.

Dan Lines user avatar by
Dan Lines
CORE ·
Mar. 07, 23 · Interview
Like (3)
Save
Tweet
Share
2.41K Views

Join the DZone community and get the full member experience.

Join For Free

Tracking Mean Time To Restore (MTTR) is standard industry practice for incident response and analysis, but should it be?

Courtney Nash, an Internet Incident Librarian, argues that MTTR is not a reliable metric — and we think she's got a point.

We caught up with Courtney at the DevOps Enterprise Summit in Las Vegas, where she was making her case against MTTR in favor of alternative metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organization better learn from their incidents. 

Episode Highlights

  • (1:54) The end of MTTR?
  • (4:50) Library of incidents
  • (13:20) What is an incident?
  • (19:41) Cost of coordination
  • (22:13) Near misses
  • (24:21) Mental models
  • (28:16) Role of language in shaping public discourse
  • (29:33) Learnings from The Void

Episode Excerpt

Dan: Hey, everyone; welcome to Dev Interrupted. My name is Dan lines, and I'm here with Courtney Nash, who has one of the coolest possibly made-up titles, but possibly real: Internet Incident Librarian.

Courtney: Yep, that's right, yeah, you got it.

Dan: Welcome to the show.

Courtney: Thank you for having me on. 

Dan: I love that title 

Courtney: Still possibly made up, possibly, possibly...

Dan: Still possibly made up. 

Courtney: We'll just leave that one out there for the listeners to decide.

Dan: Let everyone decide what that could possibly mean. We have a, I think, maybe a spicy show, a spicy topic. 

Courtney: It's a hot topic show.

Dan: Hot topic, especially since we're at DevOps Enterprise Summit, where we hear a lot about the DORA metrics, one of them being MTTR. 

Courtney: Yes. 

Dan: And you might have a hot take on that. The end of MTTR? Or how would you describe it?

Courtney: Yeah, I feel a little like the fox in the henhouse here, but Gene accepted the talk. So you know, there's that.

Dan: So it's on him.

Courtney: [laughing] It's all Gene's fault! So I have been interested in complex systems for a long time; I used to study the brain. And I got sucked down an internet rabbit hole quite a lot quite a while ago. And I've had beliefs for a long time that I haven't had data to back up necessarily. And we see these sort of perverted behaviors, not that kind of perverted, but where we take metrics in the industry, and then with Goddard's Law, pick whatever you pick up, people incentivize them, and then weird things happen. But I think we spend too little time looking at the humans in the system and a lot of time focusing on the technical aspects and the data that come out of the technical side of systems. So, I started a project about a year ago called The Void. It's the Verica Open Incident Database, actually a real, not made-up name. And it's the largest collection of public incident reports. So, if you all have an outage, and you hopefully go and figure out and talk about what happened, and then you write that up, but that's out in the world, so I'm not writing these, I'm curating them and collecting. I'm a librarian. So, I have about 10,000 of them now. And a bunch of metadata associated with all these incident reports.

Engineering Insights before anyone else...

The Weekly Interruption is a newsletter designed for engineering leaders by engineering leaders.

We get it. You're busy. So are we. That's why our newsletter is light, informative, and oftentimes irreverent. No BS or fluff. Each week we deliver actionable advice to help make you - whether you're a CTO, VP of Engineering, team lead, or IC  — a better leader.

It's also the best way to stay up-to-date on all things Dev Interrupted — from our podcast to trending articles, Interact, and our community Discord. 

Get interrupted.

DevOps dev systems

Published at DZone with permission of Dan Lines, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Full Lifecycle API Management Is Dead
  • NoSQL vs SQL: What, Where, and How
  • Spring Boot, Quarkus, or Micronaut?
  • Introduction to Containerization

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: