Reducing MTTR

A lot of people have been talking about MTTR recently. Read on for a quick introduction to the topic, and some best practices.

By Dawn Parzych · Jul. 23, 17 · Opinion

There’s been a lot of discussion lately about the need to track and improve MTTR, but what exactly does MTTR mean? MTTR can stand for several different things: mean time to repair, mean time to recovery, and mean time to resolve. Not only do the definitions differ, but the points at which the timer starts and stops can vary as well.

Incident resolution can be broken down into four main steps: detect, identify, fix, and verify. All of these actions should be included in MTTR, regardless of which terminology you decide to go with.
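As a rough illustration of counting all four steps, the sketch below computes MTTR from per-incident timestamps. The `Incident` class, its field names, the sample data, and the choice to start the timer at `started_at` are all assumptions made for this example; they do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical timestamps, one per step of incident resolution.
    started_at: datetime     # assumed timer start: when the problem began
    detected_at: datetime    # an alert fired or a user complained
    identified_at: datetime  # the cause was found
    fixed_at: datetime       # the fix was rolled out
    verified_at: datetime    # systems confirmed to be operating as expected

    def phase_minutes(self) -> dict:
        """Minutes spent in each of the four steps."""
        marks = [self.started_at, self.detected_at, self.identified_at,
                 self.fixed_at, self.verified_at]
        names = ["detect", "identify", "fix", "verify"]
        return {name: (end - start).total_seconds() / 60
                for name, start, end in zip(names, marks, marks[1:])}

    def time_to_resolve_minutes(self) -> float:
        """Full duration, with the timer stopped at verification."""
        return (self.verified_at - self.started_at).total_seconds() / 60


def mttr_minutes(incidents) -> float:
    """Mean time to resolve across incidents, in minutes."""
    return mean(i.time_to_resolve_minutes() for i in incidents)


# Made-up sample data for illustration only.
t0 = datetime(2017, 7, 1, 12, 0)
m = lambda minutes: t0 + timedelta(minutes=minutes)
incidents = [
    Incident(m(0), m(10), m(50), m(70), m(80)),
    Incident(m(0), m(5), m(25), m(35), m(40)),
]

print(incidents[0].phase_minutes())
print(mttr_minutes(incidents))
```

Breaking the total down by step makes it easier to see which step dominates the total for a given incident.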

Detect the Problem

How long it takes to detect a problem depends on the tools and solutions used to alert on issues, and is closely tied to alert fatigue. When ops teams are bombarded with non-stop alerts, many of which turn out to be false positives, it doesn’t take long for the alerts to start getting ignored. When this happens, it’s likely that many incidents will go undetected until a user complains. Even with a variety of toolsets in place, 36% of IT organizations find out about application-related problems via calls from users, according to a research study conducted by Enterprise Management Associates.
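One way to keep an eye on alert fatigue is to track how many alerts turn out to be false positives. The following is a minimal sketch under assumed names: the `Alert` class, its `actionable` flag, and the sample alerts are hypothetical, and a real team would pull this data from its own alerting history.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    actionable: bool  # False means the alert turned out to be a false positive


def false_positive_rate(alerts) -> float:
    """Fraction of alerts that were false positives (0.0 for an empty list)."""
    if not alerts:
        return 0.0
    return sum(not a.actionable for a in alerts) / len(alerts)


# Made-up sample data for illustration only.
alerts = [
    Alert("cpu_high", actionable=False),
    Alert("disk_full", actionable=True),
    Alert("latency_spike", actionable=False),
    Alert("checkout_errors", actionable=True),
]

print(false_positive_rate(alerts))
```

A rate that keeps climbing may suggest that alert thresholds need tuning before the team starts ignoring alerts.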

Identify What Caused the Problem

Once an issue has been detected, the next hurdle is to identify the problem. This can often be the most time-consuming aspect of incident resolution as IT ops teams sift through massive amounts of data from a variety of sources on a quest to find the cause. Data needs to be analyzed to determine if the issue is at the network, regional, system, or third-party level.

Fix the Problem

Once you know what needs to be done, the necessary teams can be tasked with fixing it. Some organizations may stop the MTTR timer once the fix has been rolled out, but there is one more step that should be included.

Verify the Problem Is Resolved

What good is rolling out a fix if you don’t actually verify that it resolves the incident? It is possible that one issue was masking others, which would mean there are still problems to resolve. The MTTR timer should be stopped only when there is verification that all systems are once again operating as expected and end users are no longer negatively affected.


Published at DZone with permission of Dawn Parzych, DZone MVB.

Opinions expressed by DZone contributors are their own.
