DevOps Postmortems: Why and How to Use Them
When things go wrong, as they sometimes will, it's important to take a step back and evaluate why and how they did.
To err is human. Making sure we learn from our errors and adapt requires discipline. In this post, we cover the motivation behind introducing a postmortem culture into your DevOps organization and complement it with an example of how to roll postmortems out in your DevOps/SRE team.
Why Do I Need Postmortem Discipline?
Changes to a system introduce instability, and instability causes incidents. Migration to DevOps has enabled organizations all across the world to release in smaller increments and with greater frequency. This reduces the risk of failure in any specific release. More releases, however, do not necessarily mean fewer incidents for the on-call teams to respond to.
The main responsibility of the incident response team is to quantify and, if necessary, mitigate the impact so that the service returns to normal operating conditions. Analyzing the root cause and implementing preventive measures are not part of this process. If that learning and analysis never take place, the root causes are left untreated and preventive measures are not implemented. The outcome: incidents start multiplying and cascading errors become part of the weekly routine. Eventually, the DevOps team spends more and more of its time on incident response while service quality keeps decreasing.
Conducting a Postmortem
To avoid such a death spiral, your team must acknowledge the need to learn from the past to build a better future. This learning process is called a postmortem (or post-mortem). A postmortem should be triggered whenever an incident requires a response from an on-call engineer. A typical postmortem starts by registering the objective evidence:
- Trigger for the incident
- Impact of the incident
- Time to detect and mitigate
- Steps taken to mitigate
- Root cause analysis
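The evidence listed above can be captured in a small record, which also makes the time to detect and the time to mitigate easy to compute. What follows is a minimal sketch, not a prescribed format: the `IncidentEvidence` class, its field names, and the sample incident are all hypothetical, and we assume that time to mitigate is measured from detection rather than from the start of the incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class IncidentEvidence:
    """Objective evidence recorded at the start of a postmortem (hypothetical schema)."""
    trigger: str
    impact: str
    started_at: datetime    # when the problem began affecting the service
    detected_at: datetime   # when the on-call engineer was alerted
    mitigated_at: datetime  # when the service returned to normal operation
    mitigation_steps: list[str] = field(default_factory=list)
    root_cause: str = ""

    def time_to_detect(self) -> timedelta:
        # Assumption: detection time is measured from the start of the incident.
        return self.detected_at - self.started_at

    def time_to_mitigate(self) -> timedelta:
        # Assumption: mitigation time is measured from detection, not from start.
        return self.mitigated_at - self.detected_at


# Illustrative values only; they do not describe a real incident.
evidence = IncidentEvidence(
    trigger="Error-rate alert on the checkout service",
    impact="Some checkout requests failed",
    started_at=datetime(2024, 1, 15, 14, 0),
    detected_at=datetime(2024, 1, 15, 14, 12),
    mitigated_at=datetime(2024, 1, 15, 14, 47),
    mitigation_steps=["Rolled back the latest release"],
)

print("Time to detect:", evidence.time_to_detect())
print("Time to mitigate:", evidence.time_to_mitigate())
```

Keeping the three timestamps separate, instead of storing only the durations, lets the team later revisit how the durations are defined without re-collecting the evidence.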
Based on the evidence above, an analysis should be conducted. The analysis is typically carried out by the on-call team member who responded to the incident and might include other team members who helped either to mitigate the incident or to analyze the root cause. The analysis process needs to find answers to the following questions:
- How many alerts did we receive for the incident?
- Was the trigger timely or could we have registered it earlier?
- Was the impact sufficient to trigger an incident in the first place? Or should we calibrate the triggers?
- Were steps taken to mitigate the impact adequate and did they follow the process? If not, do we need to invest in training or improve the guidelines?
- Did we manage to mitigate the impact fast enough? Is there anything we can do to shrink the mitigation time?
- What was the root cause?
- Will the root cause be resolved or will we have to live with it?
- If the root cause will be resolved, what exactly do we need to do to resolve it?
Based on the analysis, a summary should be composed that captures the lessons learned, and follow-up tasks should be registered and prioritized. The follow-up tasks typically include:
- Tasks for engineering to resolve the root cause
- Tasks for DevOps engineers to improve the monitoring setup
- Tasks for managers to improve the processes
Introducing postmortems to an organization that historically has not conducted any is not as easy as it might sound. As with every new or changing process, introducing the change and making it stick requires time and effort at all levels of the organization. However, following a few key principles makes the change easier:
- Make sure you stay away from blame games and finger pointing. This is the most crucial aspect of getting things right out of the gate. If the analysis focuses on blaming the people who caused the incident instead of making sure the team learns and improves, the initiative will do more harm than good.
- Appoint a dedicated lead who ensures that each and every incident response finishes with a postmortem. These leads tend to come from DevOps/on-call teams, and most often they are the team leads themselves.
- Collaborate and share. Make sure to capture the postmortems in a medium suitable for sharing and learning, such as wikis. Use the postmortems from last month as regular learning material for your team. Allow collaboration and commenting during and after the postmortem.
- Involve management. Showing support from management makes evangelizing and education among engineers easier. To keep the management engaged, plan ahead with objectives and show progress along the way. You know, managers like nothing more than charts pointing up and to the right.
- Start small. If the organization is large, starting with just a few services and one team is enough to build an example that will motivate other teams to follow. The initial team celebrating their wins is often enough to get the other teams to jump on the bandwagon. It is much harder to introduce the change without a positive example from inside the organization.
We've prepared a checklist of the questions you need to be asking yourself to conduct your DevOps postmortem in the best way possible.
- Impact on end users
- Impact on productivity
- Impact on infrastructure
- Time to mitigation
- Mitigation step #1
- Mitigation step #2
- Root cause analysis
- Lessons learned
- Task #1 (detect/mitigate/process)
- Task #2 (detect/mitigate/process)
- Task #3 (detect/mitigate/process)
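One way to put such a checklist to work is to verify that a postmortem draft covers every item before it is shared. The sketch below rests on a few assumptions: the draft is a plain dictionary keyed by checklist item, the individual mitigation steps and tasks are collapsed into single "Mitigation steps" and "Follow-up tasks" entries, and the `missing_items` helper and the sample draft are hypothetical.

```python
# Checklist items taken from the list above; the numbered mitigation steps
# and tasks are collapsed into one entry each (an assumption of this sketch).
CHECKLIST = [
    "Impact on end users",
    "Impact on productivity",
    "Impact on infrastructure",
    "Time to mitigation",
    "Mitigation steps",
    "Root cause analysis",
    "Lessons learned",
    "Follow-up tasks",
]


def missing_items(postmortem: dict[str, str]) -> list[str]:
    """Return the checklist items that are absent or blank in the draft."""
    return [item for item in CHECKLIST if not postmortem.get(item, "").strip()]


# Illustrative draft only; it does not describe a real incident.
draft = {
    "Impact on end users": "Some checkout requests failed",
    "Time to mitigation": "35 minutes",
    "Root cause analysis": "   ",  # whitespace only, so it counts as missing
}

for item in missing_items(draft):
    print("Still to fill in:", item)
```

A check like this only confirms that each item has been addressed; whether the answers are good enough remains a question for the team's review.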
A good postmortem culture is only as strong as the team and tools available. Plumbr real-user monitoring can help you identify how many customers are affected by an issue, how long they were affected, and where the bug is. Armed with this information, your postmortem culture will grow faster and stronger.
Published at DZone with permission of Ivo Magi, DZone MVB. See the original article here.