6 Ways to Deal With IT Incidents
These tips will help you structure your IT incident management strategy and improve your team's readiness.
According to Steve McConnell’s book “Code Complete,” the industry average is about 15-50 errors per 1,000 lines of delivered code.
This statistic should reaffirm how likely teams are to encounter IT incidents and how quickly those incidents can grow into prominent systemic concerns. Instead of letting these concerns cause significant delays, it’s best to prepare in advance and set up a response that’s rapid and effective.
For the sake of simplicity, let’s assume an incident is anything that impacts the running, usage or performance of a production system.
Here are the top six ways to deal with modern IT incidents.
1) Running Diagnostics
In general, IT incidents require a diagnosis before a comprehensive plan can be implemented. The traditional incident management setup consists of a tried-and-tested “Break in case of emergency” method that is only hauled out once in a blue moon and offers a limited picture of what has transpired. Instead, it’s best to run a complete gamut of diagnostics that includes multiple tests.
These post-incident diagnostics should include:
- A Post-Mortem Analysis
- Static Monitoring
- Time Series
- Log Analysis
The incoming data from this in-depth analysis will offer a full-fledged picture of the IT incident and its resulting consequences. All of this information can be used to diagnose and deal with any related error.
Please note that assessing a system relies on conducting a blameless analysis and on understanding the system’s core uses, goals, and general load. The assessment should also cover the system’s various inputs (i.e., NPM) and how they impact the organization as a whole. Without this information, the diagnostics will fall flat or fail to yield appropriate data with which to address the incident.
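As an illustration of how log analysis can feed a time series, here is a minimal Python sketch. It assumes a hypothetical log format of "timestamp LEVEL message"; the sample lines and the function name `errors_per_minute` are made up for this example and do not come from any particular tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log format (an assumption): "<ISO timestamp> <LEVEL> <message>"
SAMPLE_LOG = """\
2024-01-01T10:00:05 INFO request served
2024-01-01T10:00:41 ERROR database timeout
2024-01-01T10:01:02 ERROR database timeout
2024-01-01T10:01:30 ERROR upstream unavailable
2024-01-01T10:02:10 INFO request served
"""


def errors_per_minute(log_text):
    """Count ERROR lines per minute; returns a sorted list of (minute, count)."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        timestamp, level, _message = parts
        if level != "ERROR":
            continue
        # Truncate the timestamp to the minute so errors fall into buckets.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        counts[minute] += 1
    return sorted(counts.items())


for minute, count in errors_per_minute(SAMPLE_LOG):
    print(minute.isoformat(), count)
```

A per-minute series like this makes it easier to see when an incident started and whether a fix actually reduced the error rate.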
2) Use the ChatOps Model With Slack
ChatOps is a communication model for handling daily software development work, as well as IT incidents, and it can be integrated with Slack for additional system status updates. It offers a well-rounded way of dealing with any noteworthy IT issue. By incorporating the ChatOps model into team communication channels, devs and ops can create, organize, and filter deployment notifications and gain a searchable record of an incident from start to finish.
All transcripts and notifications are accessible, easy to read, and offer tremendous insight into corrective changes. Such clear-cut records of communication and development actions also help additional parties come in and immediately get up to speed on the problem at hand without needing to start from square one.
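One way a deployment notification could reach a Slack channel is through a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is only an illustration: the service name, version, message wording, and both function names are assumptions, and you would need to supply your own webhook URL.

```python
import json
import urllib.request


def build_deploy_message(service, version, status):
    """Build an incoming-webhook payload; the message wording is illustrative."""
    return {"text": f"[deploy] {service} {version}: {status}"}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical service and version, used only to show the payload shape.
payload = build_deploy_message("checkout-api", "v1.4.2", "rolled out to production")
print(json.dumps(payload))

# To actually send it, pass your own webhook URL (not included here):
# post_to_slack(webhook_url, payload)
```

Keeping the message format consistent (for example, always starting with a tag such as `[deploy]`) is what makes the channel history easy to search and filter later.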
3) Active Monitoring and Feedback
Use monitoring systems (e.g., threat monitoring systems or intrusion detection systems) to actively search for incidents in your system and provide continuous feedback on its health and performance. The right option for your organization will monitor and assess detailed metrics, alerts, systematic explanations, escalation points, potential symptoms, and general updates.
Using monitoring systems to actively look for signs of a potential incident helps your team follow up on newly discovered breaches as soon as they happen.
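To make the idea of metrics, alerts, and escalation points concrete, here is a minimal Python sketch of a threshold check. The metric names, threshold values, and the `Threshold` and `evaluate` names are all assumptions chosen for illustration; a real monitoring system would collect the samples for you.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    warn: float      # value at which a warning is raised
    critical: float  # value at which the alert should escalate


def evaluate(sample, thresholds):
    """Return (metric, severity, value) for each value at or above a threshold."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric missing from this sample
        if value >= t.critical:
            alerts.append((t.metric, "critical", value))
        elif value >= t.warn:
            alerts.append((t.metric, "warning", value))
    return alerts


# Hypothetical thresholds and a single made-up sample of metrics.
THRESHOLDS = [
    Threshold("error_rate", warn=0.01, critical=0.05),
    Threshold("p95_latency_ms", warn=500, critical=1500),
]
sample = {"error_rate": 0.07, "p95_latency_ms": 620}

for alert in evaluate(sample, THRESHOLDS):
    print(alert)
```

Separating a warning level from a critical level gives the team a natural escalation point: warnings can go to a channel, while critical alerts page someone.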
4) Create Post-Solution Documentation for Ongoing Organizational Learning
One of the most significant challenges for IT teams to overcome is the urge to lay blame at a colleague’s feet. This is challenging because it’s human nature to hide culpable actions, whereas openness follows when there is no fear of reprisal. With a post-solution report, it’s best to avoid the blame game. Instead, place emphasis on general improvement, on implementing a system that reduces the risk of the incident happening again, and on getting the issue fixed as swiftly as possible.
The goal should be to determine whether the current response system is robust, how the metrics helped, and what can be done to prevent similar problems before they occur again. This information is only helpful for future reference and learning if a post-solution report is written and distributed. Do not use the report as a means to lay blame, as that can lead to inaction, mistrust, and unnecessary rifts within a team. IT incidents need to be treated systematically, and issues should be seen as a way to improve what’s already in place.
5) Prevention Is Key
The post-solution report can offer tremendous insight into what’s happening in the system at the moment and what went wrong. This information can then be used to set up preventative solutions to keep the system running. If not, the issue is likely to happen again and further IT incidents will arise.
Changing the way you deploy code into production is another preventative measure that helps reduce the risk of IT incidents occurring. Canary deployments are one method of rolling out code that reduces deployment risk and can increase overall confidence in the output and release.
In addition, canary deployments help contain failure quickly and improve the recovery time from any incidents that do occur, as you can easily roll back to a previous healthy code state. The process involves rolling out code to a limited subset of servers/users at a time; it’s sometimes referred to as a phased or incremental rollout. (For more on canary release patterns, check out our article on The DevOps Handbook Series Part 3: Practice Continuous Integration: https://caylent.com/devops-handbook-part-3-continuous-integration/)
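The two halves of a canary rollout, limiting exposure and deciding when to roll back, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any specific deployment tool works: the hashing scheme, the 10% share, the `tolerance` multiplier, and both function names are hypothetical.

```python
import hashlib


def in_canary(user_id, percent):
    """Deterministically place a user in the canary group for a given percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


def should_roll_back(canary_errors, canary_requests, baseline_error_rate, tolerance=2.0):
    """Roll back if the canary error rate exceeds tolerance times the baseline."""
    if canary_requests == 0:
        return False  # no canary traffic yet, nothing to judge
    return canary_errors / canary_requests > baseline_error_rate * tolerance


# Made-up user IDs; roughly 10% should land in the canary group.
users = [f"user-{i}" for i in range(1000)]
canary_users = [u for u in users if in_canary(u, 10)]
print(len(canary_users), "of", len(users), "users routed to the canary")

# 12 errors in 200 requests is a 6% error rate against a 1% baseline.
print(should_roll_back(canary_errors=12, canary_requests=200, baseline_error_rate=0.01))
```

Hashing the user ID keeps each user on the same side of the rollout between requests, and raising the percentage only ever adds users to the canary group.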
A team has to be meticulous when it comes to prevention. All recommended changes should be “actionable” and easy to implement quickly, or they will get ignored. Ignored recommendations neither help the system nor add value to the post-solution report.
6) Incident Management Systems
It doesn’t matter how rigorous a coding team’s credentials are or how large the organization. Mistakes happen and it’s important to understand this as a largely inevitable circumstance. If not, a team can be completely taken aback when issues creep up in their app.
Even some of the largest and most noteworthy teams on the planet have had bugs in their code. This is normal, and it’s best to have a plan that covers likely issues so they can be addressed quickly. This is where a world-class incident management system can help automate a good portion of the process.
Good examples include:
- Jira Service Desk
The premise of an incident management system is to offer outside, comprehensive control over the initial alerts and resulting metrics. As mentioned before, this data is critical to the diagnosis of potential bugs in the system and having it in place before anything happens is essential so the whole organization knows how to react to an incident accordingly. Any system that doesn’t make use of a solution such as this is going to be at the mercy of massive, complex mistakes that are difficult to handle. Instead of being reactive and hoping for the best, it’s better to implement an incident management system as soon as possible.
For example, PagerDuty is able to collate all incoming alerts and put them in context so the team can act immediately. The recovery team can then process this information, analyze the symptoms, and make adjustments to the code before things get worse. The team is also able to spot potential patterns and pinpoint where the issue is coming from and why. Any pattern can tip the team off to the changes that are necessary and how to go about making them. Take advantage of such analysis in a preventative manner.
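The general idea of collating alerts to surface patterns can be sketched in plain Python. This is not PagerDuty's API or algorithm; the alert fields, the sample data, and the `collate` function are assumptions made purely for illustration.

```python
from collections import defaultdict

# Made-up alerts; the "service", "symptom", and "time" fields are assumptions.
ALERTS = [
    {"service": "checkout", "symptom": "timeout", "time": "10:00"},
    {"service": "search", "symptom": "high latency", "time": "10:01"},
    {"service": "checkout", "symptom": "timeout", "time": "10:02"},
    {"service": "checkout", "symptom": "timeout", "time": "10:04"},
]


def collate(alerts):
    """Group alerts by (service, symptom), most frequent group first."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["service"], alert["symptom"])].append(alert["time"])
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


for (service, symptom), times in collate(ALERTS):
    print(f"{service}: {symptom} x{len(times)} at {', '.join(times)}")
```

Grouping repeated alerts this way turns a noisy stream into a short list in which the recurring symptom stands out first.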
Having an incident management system in place is one of the best ways to remain on top of IT issues while keeping them at bay for as long as necessary.
Coding is one of those tasks that require ongoing knowledge, patience, and the ability to prepare for future incidents. Any form of ill-prepared project work is going to lead to major problems down the road. Keep these six approaches in mind when considering your team’s incident response approach.
Published at DZone with permission of Stefan Thorpe, DZone MVB.