A lot of people have been talking about MTTR recently. Read on for a quick introduction to the topic, and some best practices.
There’s been a lot of discussion lately regarding the need to track and improve MTTR, but what exactly does MTTR mean? Well, MTTR can stand for several different things: mean time to repair, mean time to recovery, and mean time to resolve. Not only are there different definitions, but when the timer starts and stops can vary as well.
Incident resolution can be broken down into four main steps: detect, identify, fix, and verify. All of these actions should be included in MTTR, regardless of which terminology you decide to go with.
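To make the four steps concrete, here is a minimal Python sketch of one way to compute MTTR so that every phase is counted. The `Incident` class, its field names, and the sample timestamps are all hypothetical and chosen for illustration; the sketch assumes the timer starts when users are first affected and stops at verification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical timestamps, one per step of incident resolution.
    started: datetime      # when users were first affected (assumed timer start)
    detected: datetime     # when the problem was detected
    identified: datetime   # when the cause was identified
    fixed: datetime        # when the fix was rolled out
    verified: datetime     # when the fix was verified (assumed timer stop)

    def phases(self) -> dict:
        """Time spent in each of the four steps."""
        return {
            "detect": self.detected - self.started,
            "identify": self.identified - self.detected,
            "fix": self.fixed - self.identified,
            "verify": self.verified - self.fixed,
        }


def mttr(incidents) -> timedelta:
    """Mean of the full start-to-verification time across incidents."""
    total = sum((i.verified - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_per_phase(incidents) -> dict:
    """Mean time spent in each step, showing where the time goes."""
    totals = {}
    for incident in incidents:
        for name, duration in incident.phases().items():
            totals[name] = totals.get(name, timedelta()) + duration
    return {name: total / len(incidents) for name, total in totals.items()}


# Made-up sample data, not taken from any real system.
incidents = [
    Incident(
        started=datetime(2024, 1, 8, 10, 0),
        detected=datetime(2024, 1, 8, 10, 10),
        identified=datetime(2024, 1, 8, 10, 40),
        fixed=datetime(2024, 1, 8, 10, 55),
        verified=datetime(2024, 1, 8, 11, 0),
    ),
    Incident(
        started=datetime(2024, 1, 9, 14, 0),
        detected=datetime(2024, 1, 9, 14, 5),
        identified=datetime(2024, 1, 9, 14, 50),
        fixed=datetime(2024, 1, 9, 15, 10),
        verified=datetime(2024, 1, 9, 15, 20),
    ),
]

print("MTTR:", mttr(incidents))
for phase, average in mean_time_per_phase(incidents).items():
    print(f"  mean time to {phase}: {average}")
```

Because the four phase durations add up to the full start-to-verification time, the per-phase means sum to the MTTR, which makes it easy to see which step dominates.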
Detect the Problem
How long it takes to detect a problem depends on the tools and solutions used to alert on issues, and those same tools have a direct impact on alert fatigue. When ops teams are bombarded with non-stop alerts, many of which turn out to be false positives, it doesn’t take long for the alerts to start getting ignored. When this happens, it’s likely that many incidents will go undetected until a user complains. Even with a variety of toolsets in place, 36% of IT organizations find out about application-related problems via calls from users, according to a research study conducted by Enterprise Management Associates.
Identify What Caused the Problem
Once an issue has been detected, the next hurdle is to identify the problem. This can often be the most time-consuming aspect of incident resolution as IT ops teams sift through massive amounts of data from a variety of sources on a quest to find the cause. Data needs to be analyzed to determine if the issue is at the network, regional, system, or third-party level.
Fix the Problem
Once you know what needs to be done, the necessary teams can be tasked with fixing the problem. Some organizations may stop the MTTR timer once the fix has been rolled out, but there is one more step that should be included.
Verify the Problem Is Resolved
What good is rolling out a fix if you don’t actually verify that the fix resolves the incident? It is possible that one issue was masking others, which would mean there are still issues that need to be resolved. The MTTR timer should be stopped only when there is verification that all systems are once again operating as expected and end users are no longer negatively affected.
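The choice of when to stop the timer can change the reported number considerably. The short Python sketch below uses made-up figures, including one hypothetical incident where the first fix exposed a masked issue, to compare an MTTR that stops at fix rollout with one that stops at verification.

```python
from statistics import mean

# Hypothetical incidents, in minutes from the start of user impact:
# (fix rolled out, resolution verified). Figures are illustrative only.
incidents = [
    (55, 60),
    (70, 80),
    (25, 160),  # the first fix exposed a second, masked issue
]

# Timer stopped as soon as the fix is rolled out.
mttr_stop_at_fix = mean(fix for fix, _ in incidents)

# Timer stopped only once resolution has been verified.
mttr_stop_at_verify = mean(verified for _, verified in incidents)

print("MTTR, timer stopped at fix rollout:", mttr_stop_at_fix, "minutes")
print("MTTR, timer stopped at verification:", mttr_stop_at_verify, "minutes")
```

In this invented data set, stopping at rollout reports half the time that users were actually affected, because the masked issue is invisible until verification.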
Published at DZone with permission of Dawn Parzych, DZone MVB. See the original article here.