How Well Does Your Infrastructure Support Major Incident Management?
Learn how automation can help you improve your incident management process.
Effective major incident management depends on many things, including planning and execution under fire. Traditional major incident management wisdom, as seen in ITIL, covers the remediation process, but it doesn't address how to configure your IT infrastructure.
However, if you take the time to prepare your infrastructure, you'll be able to reduce the sometimes debilitating impact of major incidents.
How to Prepare for Managing Major Incidents
Before you can configure your infrastructure to support major incident management, you must define what a major incident is in your organization. The IT industry hasn't developed a standard definition for a major incident, but most agree that a major incident is a failure that impacts customers and their ability to complete their work.
Given that definition, it's easy to see why resolving major incidents gets so much attention. The cost of downtime, according to the Ponemon Institute's latest publication on the topic, is $8,851 per minute.
That cost, along with related costs such as damage to an organization's reputation or customer remediation, puts resolving a major incident at the top of every organization's priority list.
Make sure that your organization's definition is clear. It will help you determine how to structure an effective major incident management (MIM) process.
How to Manage Your Infrastructure to Support Major Incident Management
The way you set up your infrastructure will have a big impact on your ability to manage major incidents effectively. You can optimize your infrastructure to support MIM in several ways.
Apply filters to your monitoring alerts. Alerts, notifications, and updates come in at an overwhelming rate. Sifting through them is impossible for even the most efficient service desk or NOC.
Give your service desk a break! Apply filters to narrow that list to only those alerts that could relate to a major incident.
Collect the right data. Collect data from your systems and applications that will allow your technicians to diagnose a problem and start to resolve it.
Act when performance is degraded. In many organizations, the major incident process starts only when a service becomes unavailable. You can reduce the number of major incidents by taking management actions earlier, when performance is degraded.
By "shifting left," you may be able to avoid an incident entirely by reacting to a loss of performance.
Track issues centrally. Communication is critical during a major incident. Even more important is communicating accurate information. You can use a central issue tracking service to allow all stakeholders to share information and to ensure that everyone is working with the most accurate information.
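The filtering and "act when degraded" ideas above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular monitoring product: the Alert fields, the severity scale, and both thresholds are assumptions made up for the example, and you would replace them with whatever your monitoring tools actually emit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; map them to your monitoring tool's payload.
    service: str
    severity: str          # assumed scale: "info", "warning", "critical"
    customer_facing: bool
    latency_ms: float
    error_rate: float      # fraction of failed requests, 0.0 to 1.0


# Assumed thresholds; tune them to your own service-level objectives.
DEGRADED_LATENCY_MS = 800.0
DEGRADED_ERROR_RATE = 0.02


def could_be_major(alert: Alert) -> bool:
    """Keep only alerts that touch customers and are at least warnings."""
    return alert.customer_facing and alert.severity in {"warning", "critical"}


def is_degraded(alert: Alert) -> bool:
    """Flag a service that is slow or failing more often than the thresholds allow."""
    return (alert.latency_ms >= DEGRADED_LATENCY_MS
            or alert.error_rate >= DEGRADED_ERROR_RATE)


def triage(alerts):
    """Return (candidates, early_warnings).

    candidates: alerts that could relate to a major incident.
    early_warnings: non-critical candidates that already show degradation,
    so someone can act before the service becomes unavailable.
    """
    candidates = [a for a in alerts if could_be_major(a)]
    early_warnings = [a for a in candidates
                      if a.severity != "critical" and is_degraded(a)]
    return candidates, early_warnings


SAMPLE_ALERTS = [
    Alert("checkout", "critical", True, 5000.0, 0.40),
    Alert("search", "warning", True, 950.0, 0.01),
    Alert("batch-reports", "critical", False, 100.0, 0.0),
    Alert("login", "info", True, 120.0, 0.0),
    Alert("profile", "warning", True, 200.0, 0.001),
]

if __name__ == "__main__":
    candidates, early_warnings = triage(SAMPLE_ALERTS)
    print("Possible major incidents:", [a.service for a in candidates])
    print("Degraded, act early:", [a.service for a in early_warnings])
```

In this sketch, the internal batch job and the informational alert never reach the service desk, while the slow search service is surfaced as an early warning even though it is still up.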
How to Automate Your Process to Improve Outcomes
One of the biggest stumbling blocks to managing major incidents is that so much of the process is performed in static environments, on spreadsheets or whiteboards. Static information encourages human error and leads to duplicate and conflicting data. Automated systems are available to make that obstacle a thing of the past.
Automate applying filters to alerts. Use a system that can apply filters to monitoring alerts. Then, the service desk technician can use the click of a mouse to reduce alert lists to only those that relate to a major incident.
Automate root cause analysis. Use an APM solution that can crawl applications and systems to collect the data needed to pinpoint the root cause of your problems.
Automate sharing information. The best way to ensure that there is one central source containing accurate information is to integrate your monitoring, service desk, collaboration, and chat solutions. When all those applications work together, the chance of wasting time because people are working on the wrong issues, or making assumptions based on the wrong information, is reduced significantly.
Automate a link between sharing tools and incident tracking. Once you establish a central source for sharing information, you can go to the next step of integrating those sharing tools with your incident tracking solution. That integration will further reduce errors and the time required to address a major incident.
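As a rough sketch of what "one central source" linked to incident tracking could look like, the Python below keeps a single timeline per incident that every tool posts to and stores the tracking ticket alongside it. The class names, source labels, and ticket ID format are all hypothetical; a real integration would go through the APIs of your monitoring, chat, and tracking products.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Update:
    source: str      # hypothetical labels: "monitoring", "service-desk", "chat"
    author: str
    message: str
    at: datetime


@dataclass
class IncidentRecord:
    """One shared record per incident, so everyone reads the same facts."""
    incident_id: str
    tracking_ticket: Optional[str] = None
    timeline: List[Update] = field(default_factory=list)

    def post(self, source: str, author: str, message: str,
             at: Optional[datetime] = None) -> Update:
        """Append an update from any integrated tool to the shared timeline."""
        update = Update(source, author, message,
                        at or datetime.now(timezone.utc))
        self.timeline.append(update)
        return update

    def link_ticket(self, ticket_id: str) -> None:
        """Tie the shared record to the ticket in the incident tracking tool."""
        self.tracking_ticket = ticket_id

    def latest_status(self) -> str:
        """Return the most recent update, whichever tool it came from."""
        if not self.timeline:
            return "no updates yet"
        last = max(self.timeline, key=lambda u: u.at)
        return f"[{last.source}] {last.author}: {last.message}"


if __name__ == "__main__":
    record = IncidentRecord("INC-1")
    record.post("monitoring", "probe", "checkout latency high")
    record.post("chat", "alice", "rolling back last deploy")
    record.link_ticket("OPS-42")  # made-up ticket ID
    print(record.tracking_ticket, record.latest_status())
```

Because every update lands in the same record, a stakeholder asking for status gets the latest entry without having to reconcile a spreadsheet, a chat thread, and a ticket.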
In 2017, xMatters and Atlassian surveyed DevOps organizations. The respondents reported inconsistent MIM processes that delayed both spotting a major incident and responding to it. They also weren't satisfied with the level of collaboration they saw while dealing with a major incident. The result of those delays and that lack of communication can be devastating.
No one can eliminate major incidents, but smart organizations work hard to prepare themselves to manage those incidents as quickly as possible. xMatters can help you configure your infrastructure and improve collaboration with automated systems. Eliminating manual processes is key to reducing the impact of service disruptions. To find out more, visit our website or try xMatters for free.
Published at DZone with permission of Dan Goldberg, DZone MVB. See the original article here.