Maximizing Your IT Team’s Incident Response Speed

Downtime is inevitable, so it's good to have a plan in place for quick incident response. Read on for 5 steps to get you going with rapid incident management.

by Orlee Berlove · Mar. 13, 18 · Opinion

The reality of IT support is that engineers cannot avoid downtime. No matter how diligent managers are about regular maintenance and repair, incidents will happen. Sites will go down. Servers will fill up. APIs will fail. When these incidents occur, IT teams need to be well trained and have the necessary tools to ensure a rapid incident response.

However, incident response is not as easy as simply creating a checklist for teams to follow. When incidents occur, there are often conflicting priorities between restoring availability and investigating the causes of the incident. For example, security incident response teams and infrastructure teams operate with different sets of assumptions and priorities when resolving issues. If these separate priorities are not effectively managed beforehand, they can lead to duplication of work, delays in handoffs, and faulty results.

The goal of this blog is to look beyond ITSM and focus on how to handle IT failures, highlighting some of the key steps that incident response teams should take to effectively prepare for responding to IT issues in a way that avoids duplication, delay, and error.

Step 1: Establish Teams

Effective response to incidents does not start when the incident arises. Instead, it begins long before there is any knowledge of a problem at all. The first step in effective incident response is establishing teams that include members from the various groups within the company, such as security, infrastructure and development.

Some believe that incidents are best handled by those with the most expertise and capability. However, according to Gartner, response is actually best when teams bring together security and I&O (infrastructure and operations) personnel. These teams can coordinate the response and bring the most relevant staff, skills and resources to an incident.

Together, these individuals from the various teams will develop a shared framework for responding to incidents and leverage their individual skills to improve response. The goal is not to create a new entity but rather to allow an organization to draw from its existing strengths.

Step 2: Priorities, Planning, and Preparation

Each company will have a different understanding of what functionalities are important to the company and to the individual teams. Around these priorities, teams need to create metrics that determine the level of deviation from normal that is acceptable.

By defining these metrics, teams will know from the outset whether an issue is high priority and what coordination it requires. Teams should also assess their current ability to respond and identify the training and technology they need to reach the level of preparedness they desire.
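
As a rough illustration of how such metrics could be turned into a priority level, here is a minimal Python sketch. The metric name, the baseline, and the percentage thresholds are all hypothetical values chosen for the example; each team would substitute the figures it agrees on during planning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    """Acceptable deviation for one metric (all values are hypothetical)."""
    name: str
    baseline: float      # normal value of the metric
    warn_pct: float      # deviation (% of baseline) that counts as low priority
    critical_pct: float  # deviation (% of baseline) that counts as high priority


def classify(threshold: MetricThreshold, observed: float) -> str:
    """Return 'ok', 'low', or 'high' based on deviation from the baseline."""
    deviation_pct = abs(observed - threshold.baseline) / threshold.baseline * 100
    if deviation_pct >= threshold.critical_pct:
        return "high"
    if deviation_pct >= threshold.warn_pct:
        return "low"
    return "ok"


# Example: API latency with a 200 ms baseline (assumed numbers).
latency = MetricThreshold("api_latency_ms", baseline=200.0,
                          warn_pct=25.0, critical_pct=100.0)
print(classify(latency, 210.0))  # 5% deviation   -> ok
print(classify(latency, 260.0))  # 30% deviation  -> low
print(classify(latency, 450.0))  # 125% deviation -> high
```

Keeping the thresholds in data rather than in code makes it easier for the combined team to review and adjust them without touching the logic.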

A key part of planning and preparation is having teams prepare for potential scenarios and develop a response based on these potential incidents. When a predetermined metric shows that a technology has failed, teams need to be ready to react. Teams should be prepared with runbooks and anticipate common scenarios.

Step 3: Monitoring & Alerting

As noted in the introduction, systems will break or be attacked. There is no way to prevent this outcome from happening. Consequently, it is imperative that incident response teams have ways to monitor technologies and learn of these eventualities as quickly as possible through proper alerting.

There are multiple ways that teams can monitor their technologies. They can monitor through the use of logs or end-user reports. This information should be collected and filtered. Additionally, teams can learn of incidents through their NOC or SOC.
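
To make the idea of collecting and filtering log information concrete, the following Python sketch counts error-level entries per service. The log line format (`LEVEL service message`) and the service names are assumptions made for this example, not a standard any particular tool emits.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line format: "<LEVEL> <service> <free-text message>"
LOG_PATTERN = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$")
ALERT_LEVELS = {"ERROR", "CRITICAL"}


def error_counts(lines: Iterable[str]) -> Counter:
    """Count ERROR and CRITICAL entries per service, ignoring everything else."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("level") in ALERT_LEVELS:
            counts[match.group("service")] += 1
    return counts


sample = [
    "ERROR checkout-api timeout calling payment gateway",
    "INFO checkout-api request served",
    "CRITICAL database disk 98% full",
    "ERROR checkout-api timeout calling payment gateway",
]
for service, count in error_counts(sample).most_common():
    print(service, count)
```

A count per service is only one possible filter; the same loop could just as well forward matching lines to whatever alerting tool the team uses.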

With proper preparation, teams know which incidents are priorities and require rapid resolution. In order to quickly learn about these incidents, teams need incident management platforms. Incident management platforms like those provided by OnPage are ideal in this instance, as they enable teams to quickly learn when technologies have failed and subsequently jump on conference bridges to discuss resolutions. Incident management platforms provide immediate notifications on the user's smartphone, so even if an incident happens after hours, on-call teams are able to learn about it quickly.

With proper preparation, alerts can be assigned to the teams responsible for resolving the issue. While there will be pressure to restore functionality first for whichever team shouts the loudest, that pressure should not drive teams away from their focus.
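
The assignment of alerts to responsible teams could be sketched as follows. The ownership map, team names, and default team are hypothetical, and this is plain Python rather than the API of any incident management platform. Sorting by severity, not by who complains most, is the point of the example.

```python
from typing import Dict, List, NamedTuple


class Alert(NamedTuple):
    source: str    # system that raised the alert, e.g. "database"
    severity: str  # "low" or "high"
    message: str


# Hypothetical ownership map, agreed on during planning.
OWNERS: Dict[str, str] = {
    "database": "infrastructure",
    "firewall": "security",
    "checkout-api": "development",
}
DEFAULT_TEAM = "infrastructure"  # assumed fallback for unknown sources


def route(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Group alerts by owning team, with high-severity alerts first."""
    queues: Dict[str, List[Alert]] = {}
    for alert in alerts:
        team = OWNERS.get(alert.source, DEFAULT_TEAM)
        queues.setdefault(team, []).append(alert)
    for queue in queues.values():
        # Stable sort: False (high severity) sorts before True (everything else).
        queue.sort(key=lambda a: a.severity != "high")
    return queues


alerts = [
    Alert("database", "low", "disk 80% full"),
    Alert("firewall", "high", "port scan detected"),
    Alert("database", "high", "primary unreachable"),
    Alert("unknown-box", "low", "ping lost"),
]
for team, queue in route(alerts).items():
    print(team, [a.message for a in queue])
```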

Step 4: Collaboration and Unified Communications

Teams need to think of response in terms of objectives beyond their individual team goals. As noted earlier, teams need to have a sense of shared organizational goals and objectives. To reach these objectives, teams need ways to collaborate and communicate during incidents.

Strong collaboration platforms that enable communications once alerts are received are best. Ideally, the alerting and communications platforms are unified so that once alerted, teams do not need to switch devices to exchange messages with colleagues. The more robust the communications platform, the better. The ability to exchange voicemails, images and documents is ideal, as these capabilities are often important to the quick resolution of incidents.

Step 5: Post Mortems

Post-incident analysis is an integral part of the process. It can uncover information that should have been available but was unknown or delayed, resulting in harm to a system. Without a post-mortem, improvement will be very difficult to achieve.

Reporting should also look at which systems were affected, and the number of users impacted as well as which notifications were issued. The resulting incident report should be reviewed by participants as well as key stakeholders. This process is where leaders engage the response team to help codify organizational lessons learned.
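
One way to capture the reporting fields mentioned above is a small record type. The field names and the sample incident below are hypothetical and serve only to show the shape such a report might take.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentReport:
    """Post-mortem record; field names are illustrative, not a standard."""
    title: str
    detected_at: datetime
    resolved_at: datetime
    affected_systems: List[str]
    users_impacted: int
    notifications_sent: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)

    @property
    def time_to_resolve(self) -> timedelta:
        """Elapsed time between detection and resolution."""
        return self.resolved_at - self.detected_at

    def summary(self) -> str:
        """One-line summary for participants and stakeholders."""
        minutes = int(self.time_to_resolve.total_seconds() // 60)
        return (f"{self.title}: {len(self.affected_systems)} system(s), "
                f"{self.users_impacted} users impacted, "
                f"{len(self.notifications_sent)} notification(s), "
                f"resolved in {minutes} min")


# Made-up incident used only to exercise the record.
report = IncidentReport(
    title="Checkout outage",
    detected_at=datetime(2018, 3, 13, 2, 0),
    resolved_at=datetime(2018, 3, 13, 2, 45),
    affected_systems=["checkout-api", "database"],
    users_impacted=1200,
    notifications_sent=["on-call page", "status page update"],
)
report.lessons_learned.append("Add a disk-usage alert for the database host")
print(report.summary())
```

Recording the same fields for every incident makes it easier for stakeholders to compare reports over time.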

From this analysis, teams can also learn the root causes of incidents and what might need to happen in the future to prevent them from recurring. The process will also highlight faulty processes or knowledge that needs to be available next time.

Conclusions

IT teams need to think of incident response as a process. If it is treated as a single step, incident response will either be less effective than it could be or fail outright.

Communication and planning are the key underlying themes required for effective incident response. Incident management platforms like OnPage provide engineers in IT with the tools they need to organize their group, communicate between teams during inciden

Opinions expressed by DZone contributors are their own.
