Maximizing Your IT Team’s Incident Response Speed
Downtime is inevitable, so it's good to have a plan in place for quick incident response. Read on for 5 steps to get you going with rapid incident management.
The reality of IT support is that engineers cannot avoid downtime. No matter how responsible managers are in ensuring regular maintenance and repair, incidents will happen. Sites will fail. Servers will fill up. APIs will fail. When these incidents do occur, it is important that IT teams are well trained and have the necessary equipment to ensure a rapid incident response.
However, incident response is not as easy as simply creating a checklist for teams to follow. When incidents occur, there are often conflicting priorities between restoring availability and investigating the causes of the incident. For example, security incident response teams and infrastructure teams operate with different sets of assumptions and priorities when resolving issues. If these separate priorities are not effectively managed beforehand, they can lead to duplication of work, delays in handoffs, and faulty results.
The goal of this blog is to look beyond ITSM and focus on how to handle IT failures, highlighting some of the key steps that incident response teams should take to effectively prepare for responding to IT issues in a way that avoids duplication, delay, and error.
Step 1: Establish Teams
Effective response to incidents does not start when the incident arises. Instead, effective response begins long before there is any knowledge of a problem at all. The first step in effective incident response is establishing teams that include members from the various groups within the company such as security, infrastructure and development.
Some believe that incidents are best handled by those with the most expertise and capability. However, according to Gartner, response is actually best when teams bring together security and I&O personnel. These teams can coordinate their response and bring the most relevant staff, skills and resources to an incident.
Together, these individuals from the various teams will develop a shared framework for responding to incidents and leverage their individual skills to improve response. The goal is not to create a new entity but rather to allow an organization to draw from its existing strengths.
Step 2: Priorities, Planning, and Preparation
Each company will have a different understanding of what functionalities are important to the company and to the individual teams. Around these priorities, teams need to create metrics that determine the level of deviation from normal that is acceptable.
By determining metrics, teams will automatically have a sense of whether an issue is high priority and what coordination is required from the beginning. Teams should also analyze where they stand in terms of their ability to respond, and what training and technology they need to achieve the level of preparedness they desire.
A key part of planning and preparation is having teams prepare for potential scenarios and develop a response based on these potential incidents. If technologies fail based on predetermined metrics, then teams need to be ready to react. Teams should be prepared with runbooks and anticipate common scenarios.
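As a rough illustration of the "acceptable deviation from normal" idea, the sketch below classifies an observed metric value against a baseline. The `MetricPolicy` type, the threshold values, and the priority labels are assumptions made for this example and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPolicy:
    """Hypothetical policy: a baseline and the deviations a team will tolerate."""
    name: str
    baseline: float            # expected value under normal operation (must be > 0)
    warn_deviation: float      # relative deviation that warrants attention
    critical_deviation: float  # relative deviation that needs immediate response


def classify(policy: MetricPolicy, observed: float) -> str:
    """Return 'ok', 'low', or 'high' priority from the relative deviation."""
    if policy.baseline <= 0:
        raise ValueError("baseline must be positive")
    deviation = abs(observed - policy.baseline) / policy.baseline
    if deviation >= policy.critical_deviation:
        return "high"
    if deviation >= policy.warn_deviation:
        return "low"
    return "ok"


# Illustrative values only: 200 ms baseline, warn at 25% off, critical at 100% off.
latency_policy = MetricPolicy("api_latency_ms", 200.0, 0.25, 1.0)
```

A team could pair each (metric, priority) result with the runbook that covers it, so the agreed response is already decided before an incident occurs.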
Step 3: Monitoring & Alerting
As noted in the introduction, systems will break or be attacked. There is no way to prevent this outcome from happening. Consequently, it is imperative that incident response teams have ways to monitor technologies and learn of these eventualities as quickly as possible through proper alerting.
There are multiple ways that teams can monitor their technologies. They can monitor through the use of logs or end-user reports. This information should be collected and filtered. Additionally, teams can learn of incidents through their NOC or SOC.
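To make "collected and filtered" concrete, here is a minimal sketch that keeps only log lines severe enough to alert on. The line format (`<timestamp> <LEVEL> <message>`) and the choice of levels are assumptions for illustration; real log formats vary widely.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumed log line shape: "<timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

# Assumed set of levels worth alerting on.
ALERT_LEVELS = {"ERROR", "CRITICAL"}


def filter_alertable(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield parsed entries whose level is in ALERT_LEVELS; skip everything else."""
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") in ALERT_LEVELS:
            yield match.groupdict()
```

Lines that do not match the assumed format are silently skipped here; a production pipeline would probably count or report them instead.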
With proper preparation, teams know which incidents are priorities and require rapid resolution. In order to quickly learn about these incidents, teams need incident management platforms. Incident management platforms like those provided by OnPage are ideal in this instance, as they enable teams to quickly learn when technologies have failed and subsequently jump on conference bridges to discuss resolutions. Incident management platforms provide immediate notifications on the user's smartphone, so even if an incident happens after hours, on-call teams are able to learn about it quickly.
With proper preparation, alerts can be assigned to the teams who have the responsibility to resolve the issue. While there will be pressure to restore functionality first for whichever team shouts the loudest, that pressure should not drive teams away from their focus.
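One simple way to picture assigning alerts by responsibility is a lookup table agreed on in advance. The categories, team names, and the `Alert` type below are hypothetical; an incident management platform would provide its own routing configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record used only for this sketch."""
    source: str    # system that raised the alert
    category: str  # e.g. "security", "infrastructure", "application"
    summary: str


# Assumed ownership map, decided during planning rather than during the incident.
OWNERS = {
    "security": "security-on-call",
    "infrastructure": "infra-on-call",
    "application": "dev-on-call",
}
DEFAULT_OWNER = "noc"  # fallback when no team owns the category


def route(alert: Alert) -> str:
    """Return the team responsible for this alert's category."""
    return OWNERS.get(alert.category, DEFAULT_OWNER)
```

Because ownership is fixed by category, the routing does not change based on who is applying the most pressure at the moment.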
Step 4: Collaboration and Unified Communications
Teams need to think of response in terms of objectives beyond their individual team goals. As has been stated at other points in this document, teams need to have a sense of shared organizational goals and objectives. To enable teams to reach these objectives, teams need ways to collaborate and communicate during incidents.
Strong collaboration platforms that enable communications once alerts are received are best. Ideally, the alerting and communications platforms are unified so that once alerted, teams do not need to switch devices to exchange messages with colleagues. The more robust the communications platform, the better. The ability to exchange voicemails, images and documents is ideal, as these capabilities are often important to the quick resolution of incidents.
Step 5: Post Mortems
Post-incident analysis is an integral part of the process and can uncover knowledge that should have been available but was not known or was delayed and resulted in harm to a system. Without a post-mortem, improvement will be very difficult to achieve.
Reporting should also look at which systems were affected, and the number of users impacted as well as which notifications were issued. The resulting incident report should be reviewed by participants as well as key stakeholders. This process is where leaders engage the response team to help codify organizational lessons learned.
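The report fields mentioned above could be captured in a small record like the following. This is only a sketch: the `IncidentReport` name and its fields are assumptions chosen to mirror the text, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical post-mortem record mirroring the items listed in the text."""
    title: str
    systems_affected: List[str]
    users_impacted: int
    notifications_issued: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line overview for participants and stakeholders to review."""
        return (
            f"{self.title}: {len(self.systems_affected)} system(s) affected, "
            f"{self.users_impacted} user(s) impacted, "
            f"{len(self.notifications_issued)} notification(s) issued"
        )
```

Keeping `lessons_learned` on the same record gives reviewers one place to codify what the organization takes away from the incident.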
From this analysis, teams can also learn the root causes of incidents and what might need to happen in the future to prevent them from recurring. The process will also highlight faulty processes or knowledge that needs to be available next time.
IT teams need to think of incident response as a process. If it is thought of as just one step, incident response will either be less effective than it could be or will fail.
Communication and planning are the key underlying themes required for effective incident response. Incident management platforms like OnPage provide engineers in IT with the tools they need to organize their group, communicate between teams during inciden
Opinions expressed by DZone contributors are their own.