Better Incident Management While Working Remotely
Remote incident management has become the norm worldwide. We share 8 best practices that helped us address remote incidents and make on-call less stressful.
With the onset of remote work due to COVID-19, remote incident management has become the norm for businesses worldwide. Organizations that were used to having war rooms now find themselves having to coordinate teams through Slack, MS Teams, or other collaboration tools. This unexpected and unplanned transition has created a unique set of problems.
Now that we have had a few months of experience in dealing with incident management remotely, here are some best practices we found to be effective. These practices are already recommended for effective incident management in general, but in times of remote working we believe this list is a great starting point to stay on top of incidents and prevent major outages.
In this post, we list some ideas that you can implement immediately, including better communication among stakeholders, detailed plans for dealing with outages, documentation, and learning from past failures. Here are the ways you can make this transition work in your favor and ensure that on-call remains as stress-free as possible.
1. Have a Strong Communication Plan
This includes using Slack, MS Teams, or any other collaboration tool to communicate about incidents. It is essential to have a contingency plan in place in case your usual communication software goes down. No one wants to spend hours making phone calls to fix issues. A remote incident management team is like a pit-stop crew, but situated miles apart and sometimes in different time zones.
The Slack outage in the first week of 2021 underlines how important it is to keep communication channels open. Private status pages are invaluable to the engineers already working on fixing the issue (especially in larger teams). They also help your PR and communications team by providing an accurate picture of the size of the outage and the progress being made. A public status page lets your customers know whether parts of your product are still operational and indicates the progress being made on returning to full functionality.
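The contingency plan described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the channel names and the `send_via_chat` and `send_via_sms` functions are hypothetical stand-ins, not the API of any real collaboration tool.

```python
def notify_with_fallback(message, channels):
    """Try each (name, send) pair in order and return the name of the
    first channel that accepts the message."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # a failed channel must not stop the page
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


# Hypothetical senders, used only to demonstrate the fallback order.
def send_via_chat(message):
    raise ConnectionError("chat tool is down")


delivered = []


def send_via_sms(message):
    delivered.append(message)


used = notify_with_fallback(
    "Database outage, please join the bridge",
    [("chat", send_via_chat), ("sms", send_via_sms)],
)
print(used)  # prints "sms"
```

The order of the list is the plan itself: agreeing on it before an outage is what saves time during one.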
2. Have an Information Repository of Your System in Hand
Earlier, if you needed any piece of information about your system, it was as simple as moving a few desks over and asking the person concerned. Now, if that person is unavailable on Slack, the information you need to quickly fix the outage is hard to get. Having a centralized information system with all the essential information is invaluable. Before the pandemic hit, too many organizations kept their important information on post-it notes stuck all over the place. Needless to say, this won't work when your team is working remotely. You need a searchable repository of vital information to save precious time and effort.
3. Have Dry-runs/Simulations of Catastrophic Failures
Running a dry-run or simulation to see how effectively your team can handle a severe failure while remote is a good idea. It can reveal areas for improvement in your incident response strategy.
4. Automate More
Some tasks are quick fixes or easy to tackle when you are physically present in the office. These may be scripts that are run manually or meetings that can be avoided. Reducing toilsome activities is a long-term goal that assumes greater importance when working remotely. Burnout from working remotely is a serious issue, and tackling toil with automation should be a high priority. Automation should ideally include running scripts, monitoring clusters, scheduling maintenance, and auto-configuring cloud-based virtual machines when the need arises.
Having detailed runbooks will be of great help when a major incident occurs. Automated runbooks can be a game-changer when it comes to diagnosing and fixing systems that have gone offline. Whether you are using Ansible, Rundeck, or any other tool, even the simplest runbook is better than fixing things manually and starting from scratch every time.
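To make the idea of "even the simplest runbook" concrete, here is a minimal sketch in plain Python. It does not use Ansible or Rundeck; the `run_runbook` function and the step names are hypothetical, and a real runbook would call your own diagnostic and repair scripts.

```python
def run_runbook(steps):
    """Run (description, action) steps in order and stop at the first failure.

    Returns a list of (description, status) tuples so the outcome can be
    pasted into the incident channel.
    """
    results = []
    for description, action in steps:
        try:
            action()
        except Exception as exc:
            results.append((description, f"failed: {exc}"))
            break  # later steps usually depend on earlier ones
        results.append((description, "ok"))
    return results


# Hypothetical steps; each action is just a Python callable.
def check_disk_space():
    pass  # assume the check passed


def restart_service():
    raise RuntimeError("timeout")


def verify_health():
    pass


report = run_runbook([
    ("check disk space", check_disk_space),
    ("restart service", restart_service),
    ("verify health", verify_health),
])
for description, status in report:
    print(f"{description}: {status}")
```

Stopping at the first failed step is a deliberate choice here: it keeps the runbook from running a verification against a service that never restarted.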
5. Fight Alert Fatigue (Even More Proactively)
Alert fatigue is arguably even more damaging for remote teams than it is in the office. Configuring monitoring tools and tweaking alerting thresholds play a very important role in reducing alert noise. Additionally, our team proactively reduces alert noise by creating deduplication rules, event routing rules, and tagging rules. Mandatory off days for on-call engineers also help considerably in avoiding burnout.
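As an illustration of what a deduplication rule can look like, the sketch below suppresses repeat alerts for the same service and check inside a time window. The `Alert` fields, the `(service, check)` key, and the 300-second window are all assumptions made for this example, not the rules of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts, window_seconds=300):
    """Keep an alert only if no alert with the same (service, check) key
    was kept within the previous `window_seconds`."""
    last_kept = {}  # dedup key -> timestamp of the last alert we kept
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


alerts = [
    Alert("api", "latency", 0),
    Alert("api", "latency", 60),   # duplicate inside the window, dropped
    Alert("db", "disk", 90),       # different key, kept
    Alert("api", "latency", 400),  # window has passed, kept again
]
kept = deduplicate(alerts)
print(len(kept))  # prints 3
```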
6. Coordinate With Dev Teams Before Deployment
Monitor your infrastructure during major deployments, and have rollbacks in place in case things go wrong. Because some of the most catastrophic failures happen during deployments, you need a way to monitor system health during that time and initiate rollbacks if required.
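One way to turn "monitor and roll back if required" into a rule is to watch a health signal during the deployment and trigger a rollback after several bad readings in a row. The sketch below assumes the signal is an error rate sampled at regular intervals; the function name, the 5% threshold, and the three-sample limit are illustrative values, not recommendations.

```python
def monitor_deployment(error_rates, threshold=0.05, max_consecutive_bad=3):
    """Return "rollback" once `max_consecutive_bad` samples in a row exceed
    `threshold`; return "keep" if that never happens."""
    consecutive_bad = 0
    for error_rate in error_rates:
        if error_rate > threshold:
            consecutive_bad += 1
            if consecutive_bad >= max_consecutive_bad:
                return "rollback"
        else:
            consecutive_bad = 0  # a healthy sample resets the streak
    return "keep"


print(monitor_deployment([0.01, 0.20, 0.01, 0.30, 0.40, 0.50]))  # prints "rollback"
print(monitor_deployment([0.01, 0.20, 0.01, 0.20]))              # prints "keep"
```

Requiring consecutive bad samples rather than a single one is a simple guard against rolling back on a momentary spike.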
7. Have a Clear Incident Chain of Command and Roles
Have you planned for contingencies when your usual leadership is on leave or unreachable? An incident chain of command mitigates any last-minute confusion in a time-sensitive and stressful situation.
8. Invest in an Incident Management Platform
If you haven't done it already, investing in a dedicated incident management platform will go a long way in making on-call less stressful with the help of features like escalation policies and alert deduplication rules. Furthermore, many such platforms have dashboards that let you track the performance of your on-call team as well as the quality of service. There are still on-call teams that use spreadsheets to track schedules. While this was manageable (though not recommended) in pre-COVID times, the situation now requires more clarity and efficiency. Easy-to-use on-call schedules in incident management platforms can be a great help for your team in planning their workload. Since engineers know beforehand whether they will be on-call, they can plan their other activities accordingly. A healthy rotation in on-call schedules also helps prevent burnout.
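For teams still on spreadsheets, the core of an on-call rotation is small enough to sketch. The example below assumes a simple weekly round-robin; the `build_rotation` function and the engineer names are hypothetical, and a real platform would also handle overrides, time zones, and escalation.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week in round-robin order.

    Returns a list of (week_start_date, engineer) tuples.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


schedule = build_rotation(["asha", "ben", "chen"], date(2021, 1, 4), 4)
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```

Publishing a schedule like this several weeks ahead is what lets engineers plan around their shifts.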
After a major outage occurs, automated incident timelines are invaluable for remote teams to figure out measures that were taken to fix things. At Squadcast, we rely on the automated incident timeline to have a real-time view of the progress towards incident resolution. Automated timelines are also of great help when creating incident postmortems subsequently. It becomes much easier to figure out the strengths and weaknesses of your on-call response if you are armed with a detailed timeline of events.
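To show what an incident timeline captures, here is a minimal sketch of one. It is not how Squadcast or any other platform implements timelines; the `IncidentTimeline` class, its methods, and the sample events are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTimeline:
    events: list = field(default_factory=list)

    def record(self, actor, action, at=None):
        """Append an event; defaults to the current UTC time."""
        at = at or datetime.now(timezone.utc)
        self.events.append((at, actor, action))

    def render(self):
        """Return the events in chronological order, one per line."""
        ordered = sorted(self.events, key=lambda event: event[0])
        return "\n".join(
            f"{at.isoformat()} {actor}: {action}" for at, actor, action in ordered
        )


timeline = IncidentTimeline()
timeline.record("ben", "rolled back the deployment",
                at=datetime(2021, 1, 4, 10, 15, tzinfo=timezone.utc))
timeline.record("asha", "acknowledged the alert",
                at=datetime(2021, 1, 4, 10, 0, tzinfo=timezone.utc))
print(timeline.render())
```

Because every entry carries a timestamp and an actor, the rendered output can be dropped straight into a postmortem draft.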
As stated earlier, an incident response team during a major outage is like the pit crew of a Formula 1 team, trying to get as much done as possible in the shortest amount of time. Like a pit crew, incident management teams do their best work when each member knows exactly what they need to look after.
We hope this list is as useful to you as it has been to us.
Published at DZone with permission of Nir Sharma. See the original article here.