Major Incident Process Is at the Heart of Effectiveness

Developing and automating a major incident handling process is integral to effective development workflows.

Dan Goldberg
Oct. 08, 18 · Opinion

All businesses need to be prepared for a major technological incident in order to minimize their losses and protect their client information. Experts don't always agree on what constitutes a major incident, but you can create a definition for your company.

A major incident interrupts your company's ability to function, sometimes completely shutting you down. Most often, the problems are man-made. They can include hackers stealing data, emails infecting your system with ransomware, or employee errors that introduce catastrophic failures. When these major incidents occur, you need to immediately set your team in motion.

Identify Your Major Incidents

First, decide which issues constitute a major incident for your company and list the situations that would qualify. Be specific so that you can quickly determine whether an incident really needs to be treated as a major issue.
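One way to keep the definition specific is to write the criteria down as explicit, testable rules. The sketch below is only an illustration: the `Incident` fields, the `MajorIncidentCriteria` class, and the customer threshold are hypothetical names and values, not part of any particular tool, and your own list of qualifying situations will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """A reported problem, described by a few hypothetical attributes."""
    summary: str
    customers_affected: int
    data_breach: bool = False
    core_service_down: bool = False


@dataclass(frozen=True)
class MajorIncidentCriteria:
    # Hypothetical threshold; tune it to your own business.
    customer_threshold: int = 1000

    def is_major(self, incident: Incident) -> bool:
        """Return True if any single criterion marks the incident as major."""
        return (
            incident.data_breach
            or incident.core_service_down
            or incident.customers_affected >= self.customer_threshold
        )


criteria = MajorIncidentCriteria(customer_threshold=500)
typo = Incident("Typo on pricing page", customers_affected=0)
ransomware = Incident("Ransomware on file server", customers_affected=20, data_breach=True)

print(criteria.is_major(typo))
print(criteria.is_major(ransomware))
```

Writing the rules this way means the person on call does not have to debate the definition during an outage; the debate happens in advance, when the criteria are agreed.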

DevOps teams and Site Reliability Engineers often assign budgets for allowable downtime for certain services so they know how many people to allocate when there is a problem. Getting the technical budget right can take many iterations, so be patient if you're just starting.
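As a rough illustration of how such a downtime budget can be worked out, the sketch below derives the allowable downtime from an availability target over a time window. The function names and the 99.9% target over 30 days are assumptions chosen for the example, not figures from this article.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Downtime a service may accrue in `window` while still meeting the target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return window * (1.0 - availability_target)


def budget_remaining(
    availability_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Budget left after subtracting downtime already used; negative means overspent."""
    return allowed_downtime(availability_target, window) - downtime_so_far


# Hypothetical example: a 99.9% target over a 30-day window.
budget = allowed_downtime(0.999, timedelta(days=30))
left = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=30))

print(f"Budget: {budget.total_seconds() / 60:.1f} minutes")
print(f"Remaining: {left.total_seconds() / 60:.1f} minutes")
```

A 99.9% target over 30 days leaves roughly 43 minutes of allowable downtime, which shows how quickly a single major incident can consume the whole budget.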

Create an Incident Team

Instead of relying on the usual company hierarchy, many organizations establish a major incident team made up of key players from across the business. Someone from the IT side of the house usually leads the team, and some businesses create a dedicated Major Incident Director role. Businesses often include the head or second-ranking employee of each major department. Some members of the team will need to drop all other responsibilities in order to address the incident, so be certain that someone else is in place to handle their other duties.

The complexity and urgency of these situations create a need to use a collaboration platform to contact the right on-call resource immediately, and to escalate quickly to the next person if necessary.
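The escalation behavior described above can be sketched as a simple loop over an ordered on-call chain. This is a minimal illustration, not how any specific collaboration platform works: `Responder`, `escalate`, and the `page` callback are hypothetical, and a real system would also handle acknowledgement timeouts, retries, and multiple contact channels.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    name: str
    role: str


def escalate(
    chain: Sequence[Responder], page: Callable[[Responder], bool]
) -> Optional[Responder]:
    """Page each responder in order; return the first who acknowledges, else None.

    `page` is a caller-supplied function that contacts a responder and
    returns True only if that person acknowledged the alert.
    """
    for responder in chain:
        if page(responder):
            return responder
    return None


# Hypothetical on-call chain and a fake pager that records who was contacted.
chain = [
    Responder("Avery", "primary on-call"),
    Responder("Blake", "secondary on-call"),
    Responder("Casey", "Major Incident Director"),
]
paged = []


def fake_page(responder: Responder) -> bool:
    paged.append(responder.name)
    return responder.name == "Blake"  # only Blake acknowledges in this example


owner = escalate(chain, fake_page)
print(owner.name if owner else "nobody acknowledged")
```

Passing the paging mechanism in as a function keeps the escalation order separate from the delivery channel, so the same chain can be exercised in a drill without contacting anyone.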

Develop a Process

Experts stress that your company must develop a process to deal with major incidents that is separate from your other business protocols. Doing so will help limit the fallout by allowing your staff to more quickly identify the problem and then take steps to fix it. The major incident process relies on a set of clearly defined steps for addressing the problem.

An effective collaboration platform should also share data between systems so team members can access it and use it, whether they are used to working in a service desk, monitoring system, incident management system, or other tool. If team members have to access information from a system they don't ordinarily use, the process will slow. Experts suggest the following steps also be included in your major incident workflow:

  • Identify the problem and determine if it meets your criteria for a major incident.
  • Locate and meet with the Incident Response team.
  • Have the team diagnose the cause of the problem.
  • Notify all stakeholders of the situation and offer regular updates.
  • Put temporary and/or permanent fixes in place.
  • Resume business processes as soon as possible, starting with the most vital ones.
  • Investigate the cause of the incident and implement preventative actions so the situation will not be repeated.
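The steps above can be sketched as an ordered checklist that a tool enforces. The example below is one possible simplification: the `Step` names and the `IncidentWorkflow` class are hypothetical, and it assumes the steps are completed strictly in order, whereas in practice stakeholder updates continue throughout the incident.

```python
from enum import IntEnum
from typing import List, Optional


class Step(IntEnum):
    IDENTIFY = 1
    ASSEMBLE_TEAM = 2
    DIAGNOSE = 3
    NOTIFY_STAKEHOLDERS = 4
    APPLY_FIX = 5
    RESUME_OPERATIONS = 6
    REVIEW_AND_PREVENT = 7


class IncidentWorkflow:
    """Tracks progress through the major incident steps in a fixed order."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.completed: List[Step] = []

    @property
    def next_step(self) -> Optional[Step]:
        """The next step to perform, or None once every step is done."""
        remaining = [s for s in Step if s not in self.completed]
        return remaining[0] if remaining else None

    def complete(self, step: Step) -> None:
        """Mark a step done; reject steps taken out of order."""
        if step != self.next_step:
            raise ValueError(f"expected {self.next_step!r}, got {step!r}")
        self.completed.append(step)


workflow = IncidentWorkflow("Payment service outage")
workflow.complete(Step.IDENTIFY)
workflow.complete(Step.ASSEMBLE_TEAM)
print(workflow.next_step.name)
```

The list of completed steps also doubles as a simple record of what was done, which is useful input for the post-incident review.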

Your main focus should be to limit the time that your employees and clients are impacted by the technological issue. The longer the problem continues, the more long-term problems your company will experience, including loss of revenue, reputation damage, and potential loss of clients. An organized response will do much to lessen the impact of the incident.

Figure: How to automate a major incident process.

The Aftermath

After the problem has been corrected and you've determined the cause and taken steps to prevent a recurrence, you also must deal with potential fallout from clients and employees. Major incident recovery may take weeks or months.

In 2017, companies lost around $100,000 for every hour of downtime on their business site. Larger companies can easily lose millions, while even small businesses can take a significant financial hit. The lost revenue can have long-term effects, forcing unexpected cuts in company services or in your workforce. Your company will undoubtedly experience serious downtime at some point, so an aftermath plan should be in place that deals with these issues.

Depending on the incident's severity and length, your company brand may also suffer. Customers and suppliers may be initially sympathetic, but if problems linger, you may easily lose both groups of stakeholders. You need a communication plan in place to address the "outside" effects of your major incident. You may also consider some small gesture, such as a discount or special perk, to regain trust and goodwill. People will stick with you as long as you demonstrate caring and commitment.

Your company experiences small glitches and issues every week if not every day. A major incident is one that severely compromises your company's ability to function and, in some instances, shuts it down completely. You need a major incident process in place that includes a specially selected and groomed team. When a data disaster strikes, you will be able to address it immediately and minimize the damage to your workflow and your customer base. Rather than be mired in disarray, your employees will be able to take logical, meaningful steps to get your business back up and running more quickly.

Preserving the remediation steps and chat conversations in service desk and other systems will help with post-incident analysis, improve future responses, and even help prevent some major incidents.

Process has to be matched with an infrastructure that supports it. If people critical to incident management can't easily access information when they need it, the process will break down, or at least slow down. Integrations between systems enable your people to work in the systems they already use. More on that when we talk about infrastructure.

To get your hands dirty with the xMatters Integration Platform, try xMatters Free today.

Published at DZone with permission of Dan Goldberg, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
