
Major Incident Process Is at the Heart of Effectiveness

Developing and automating a major incident handling process is integral to effective development workflows.

All businesses need to be prepared for a major technological incident in order to minimize their losses and protect their client information. Experts don't always agree on what constitutes a major incident, but you can create a definition for your company.

A major incident interrupts your company's ability to function, sometimes completely shutting you down. Most often, the problems are man-made. They can include hackers stealing data, emails infecting your system with ransomware, or employee errors that introduce catastrophic failures. When these major incidents occur, you need to immediately set your team in motion.

Identify Your Major Incidents

First, decide which issues constitute a major incident for your company, and list the situations that would qualify. Be specific so that you can quickly determine whether an incident really needs to be treated as a major one.
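One way to keep that list specific is to write the criteria down as explicit checks. The following is a minimal sketch, not a recommended policy: the `Incident` fields, the `is_major` function, and the `MAJOR_USER_THRESHOLD` value are all hypothetical names and numbers chosen for illustration, and each company would substitute its own.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A reported problem, described by a few illustrative attributes."""
    description: str
    customer_data_exposed: bool = False
    core_service_down: bool = False
    users_affected: int = 0


# Hypothetical threshold; each company would choose its own value.
MAJOR_USER_THRESHOLD = 1000


def is_major(incident: Incident) -> bool:
    """Return True if the incident meets any of the illustrative criteria."""
    return (
        incident.customer_data_exposed
        or incident.core_service_down
        or incident.users_affected >= MAJOR_USER_THRESHOLD
    )


if __name__ == "__main__":
    ransomware = Incident("Ransomware on file server", core_service_down=True)
    typo = Incident("Typo on marketing page", users_affected=12)
    print(ransomware.description, "->", is_major(ransomware))
    print(typo.description, "->", is_major(typo))
```

Writing the criteria as named checks makes it easy to review them with the team and to see exactly why a given incident was, or was not, escalated.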

DevOps teams and Site Reliability Engineers often assign budgets for allowable downtime to certain services so they know how many people to allocate when there is a problem. Getting these downtime budgets right can take many iterations, so be patient if you're just starting.
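The arithmetic behind a downtime budget is simple enough to sketch. The example below assumes the budget is expressed as an availability target over a fixed period; the function names and the 99.9% and 30-day figures are illustrative assumptions, not values from any particular team.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_target)


def remaining_budget_minutes(
    availability_target: float, downtime_so_far: float, period_days: int = 30
) -> float:
    """Budget left after subtracting downtime already incurred (may be negative)."""
    return allowed_downtime_minutes(availability_target, period_days) - downtime_so_far


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    budget = allowed_downtime_minutes(0.999)
    print(f"Budget: {budget:.1f} minutes")
    print(f"Remaining after a 30-minute outage: {remaining_budget_minutes(0.999, 30):.1f} minutes")
```

A negative remaining budget is a signal that the service has already exceeded its allowance for the period, which is one input a team might use when deciding how many people to allocate.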

Create an Incident Team

Instead of relying on the usual company hierarchy, many organizations establish a major incident team composed of key players from across the organization. Someone from the IT side of the house usually leads the team, and some businesses create a dedicated Major Incident Director role. The team often includes the head or second-ranking employee of each major department. Some members will need to drop all other responsibilities in order to address the incident, so be certain that someone else is in place to handle their other duties.

The complexity and urgency of these situations create a need to use a collaboration platform to contact the right on-call resource immediately, and to escalate quickly to the next person if necessary.
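The escalation logic itself can be sketched in a few lines. This is a simplified illustration that assumes an ordered on-call chain and a `notify` callable that returns whether the person acknowledged; both are hypothetical, and a real collaboration platform would handle paging, timeouts, and retries itself.

```python
from typing import Callable, Optional, Sequence


def escalate(on_call_chain: Sequence[str], notify: Callable[[str], bool]) -> Optional[str]:
    """Notify each person in order and return the first who acknowledges.

    `notify` is a stand-in for whatever paging mechanism is in use. It should
    return True if the person acknowledged within the allowed time.
    Returns None if nobody in the chain acknowledged.
    """
    for person in on_call_chain:
        if notify(person):
            return person
    return None


if __name__ == "__main__":
    # Simulated acknowledgements: the primary does not respond, the secondary does.
    responses = {"primary": False, "secondary": True, "manager": True}
    owner = escalate(["primary", "secondary", "manager"], lambda p: responses[p])
    print("Incident acknowledged by:", owner)
```

Keeping the chain as plain data means the order of escalation can be reviewed and changed without touching the logic.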

Develop a Process

Experts stress that your company must develop a process to deal with major incidents that is separate from your other business protocols. Doing so will help limit the fallout by allowing your staff to more quickly identify the problem and then take steps to fix it. The major incident process relies on a set of clearly defined steps for addressing the problem.

An effective collaboration platform should also share data between systems so team members can access and use it, whether they are used to working in a service desk, monitoring system, incident management system, or other tool. If team members have to access information from a system they don't ordinarily use, the process will slow. Experts suggest that your major incident workflow also include the following steps:

  • Identify the problem and determine if it meets your criteria for a major incident.
  • Locate and meet with the incident response team.
  • Have the team diagnose the cause of the problem.
  • Notify all stakeholders of the situation and offer regular updates.
  • Put temporary and/or permanent fixes in place.
  • Resume business processes as soon as possible, starting with the most vital ones.
  • Investigate the cause of the incident and implement preventative actions so the situation will not be repeated.
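The steps above can be sketched as an ordered checklist that a tool tracks for each incident. This is only an illustration: the `Step` and `MajorIncident` names are hypothetical, and the strict one-after-another ordering is a simplification, since in practice stakeholder updates continue throughout the response.

```python
from enum import IntEnum
from typing import List, Optional


class Step(IntEnum):
    """The workflow steps from the list above, in order."""
    IDENTIFY = 1
    ASSEMBLE_TEAM = 2
    DIAGNOSE = 3
    NOTIFY_STAKEHOLDERS = 4
    APPLY_FIX = 5
    RESUME_OPERATIONS = 6
    REVIEW_AND_PREVENT = 7


class MajorIncident:
    """Tracks which workflow steps have been completed for one incident."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.completed: List[Step] = []

    def next_step(self) -> Optional[Step]:
        """Return the next step to perform, or None when all are done."""
        remaining = [s for s in Step if s not in self.completed]
        return remaining[0] if remaining else None

    def complete(self, step: Step) -> None:
        """Mark a step done; refuse steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


if __name__ == "__main__":
    incident = MajorIncident("Payment service outage")
    incident.complete(Step.IDENTIFY)
    incident.complete(Step.ASSEMBLE_TEAM)
    print("Next step:", incident.next_step().name)
```

Making the sequence explicit gives everyone on the team the same answer to "what happens next," which is the point of defining the process ahead of time.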

Your main focus should be to limit the time that your employees and clients are impacted by the technological issue. The longer the problem continues, the more long-term problems your company will experience, including loss of revenue, reputation damage, and potential loss of clients. An organized response will do much to lessen the impact of the incident.

[Figure: How to automate a major incident process.]

The Aftermath

After the problem has been corrected and you've determined the cause and taken steps to prevent a recurrence, you also must deal with potential fallout from clients and employees. Major incident recovery may take weeks or months.

In 2017, companies lost around $100,000 for every hour of downtime on their business site. Larger companies can easily lose millions, while even small businesses can take a significant financial hit. The lost revenue can have long-term effects, causing you to make unexpected cuts in company services or in your workforce. Your company will undoubtedly experience serious downtime at some point, so an aftermath plan should be in place that deals with these issues.
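To make the hourly figure concrete, a rough cost estimate is just duration multiplied by an hourly rate. The sketch below uses the $100,000 per hour figure quoted above as a default; the function name is hypothetical, and a real estimate would use your own company's numbers.

```python
# Hourly figure quoted in the article for 2017; replace with your own estimate.
HOURLY_DOWNTIME_COST_USD = 100_000


def downtime_cost(minutes: float, hourly_cost: float = HOURLY_DOWNTIME_COST_USD) -> float:
    """Estimate the direct cost of an outage lasting the given number of minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes / 60 * hourly_cost


if __name__ == "__main__":
    # A 90-minute outage at the quoted rate.
    print(f"${downtime_cost(90):,.0f}")  # prints "$150,000"
```

This covers only direct losses during the outage. It does not account for the longer-term effects described here, such as reputation damage or lost clients.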

Depending on the incident's severity and length, your company brand may also suffer. Customers and suppliers may be initially sympathetic, but if problems linger, you may easily lose both groups of stakeholders. You need a communication plan in place to address the "outside" effects of your major incident. You may also consider some small gesture, such as a discount or special perk, to regain trust and goodwill. People will stick with you as long as you demonstrate caring and commitment.

Your company experiences small glitches and issues every week, if not every day. A major incident is one that severely compromises your company's ability to function and, in some instances, shuts it down completely. You need a major incident process in place that includes a specially selected and trained team. When a disaster strikes, you will be able to address it immediately and minimize the damage to your workflow and your customer base. Rather than being mired in disarray, your employees will be able to take logical, meaningful steps to get your business back up and running more quickly.

Preserving the remediation steps and chat conversations in service desk and other systems will help with post-incident analysis, improve future responses, and even help prevent some major incidents.
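As a small illustration of what preserving those steps might look like, the sketch below records timestamped actions and exports them as JSON for later review. The `IncidentTimeline` class and its field names are hypothetical and are not the format of any particular service desk or chat tool.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


class IncidentTimeline:
    """Collects remediation steps and notes for post-incident analysis."""

    def __init__(self, incident_id: str) -> None:
        self.incident_id = incident_id
        self.entries: List[Dict[str, str]] = []

    def record(self, author: str, action: str, when: Optional[datetime] = None) -> None:
        """Append one entry, timestamped in UTC unless a time is supplied."""
        when = when or datetime.now(timezone.utc)
        self.entries.append(
            {"time": when.isoformat(), "author": author, "action": action}
        )

    def to_json(self) -> str:
        """Serialize the timeline so it can be stored or shared."""
        return json.dumps(
            {"incident_id": self.incident_id, "entries": self.entries}, indent=2
        )


if __name__ == "__main__":
    timeline = IncidentTimeline("INC-001")  # hypothetical identifier
    timeline.record("alice", "Restarted the payment service")
    timeline.record("bob", "Confirmed transactions are processing again")
    print(timeline.to_json())
```

Recording entries as they happen, rather than reconstructing them afterward, keeps the post-incident analysis grounded in what was actually done and when.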

Process has to be matched with an infrastructure that supports it. If people critical to incident management can't easily access information when they need it, the process will break down, or at least slow down. Integrations between systems enable your people to work in the systems they already use. More on that when we talk about infrastructure.

Topics: devops, crisis management, incident response, it operations, software development

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
