DevOps Postmortems: Why and How to Use Them (+Checklist)
When things go wrong, as they sometimes will, it's important to take a step back and evaluate why and how they did.
In DevOps, any problem that occurs in the CI/CD process is fixed as soon as possible. The information needed is gathered from the feedback loop of the CI/CD pipeline and fed into the rectification process. Postmortem analysis of a problem plays a crucial role in planning the fix. A postmortem culture brings the development team together to figure out the cause and focus on a probable solution. The written postmortem describes the impact of the incident, the steps taken to resolve it, the root cause of the problem, and the follow-up actions to prevent such incidents in the future. A complete analysis not only improves the overall quality of production but also reduces the stress and risk in the development process.
Why Do I Need Postmortem Discipline?
Typically, any change to a system can cause instability, with repercussions. A DevOps CI/CD pipeline promotes frequent releases in smaller increments. On one hand, this reduces the risk of failure in any specific release; on the other hand, it increases the number of incidents that on-call teams need to respond to.
The incident response team tries to quantify and mitigate the impact so that the service returns to a normal state. This is exactly where postmortem analysis matters: without it, no rectification or preventive measures can be taken. In the worst case, incidents multiply with a cascading effect. As a result, the time a DevOps team spends on incident response surpasses the time it spends on actual development. This situation can be avoided by building a postmortem culture into the DevOps team.
How to Conduct a Postmortem Analysis
To avoid such a death spiral, your team must acknowledge the need to learn from the past to build a better future. This learning process is called a postmortem. A postmortem should be triggered whenever an incident requires a response from an on-call engineer.
Step 1: Registering the Evidence
- Trigger for the incident
- Impact of the incident
- Time to detect and mitigate
- Steps taken to mitigate
- Root cause analysis
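The evidence items above can be captured in a small structured record, so that every postmortem starts from the same fields. The following is a minimal sketch, not a prescribed format: the class name, field names, and sample values are all hypothetical, and it assumes that time to mitigate is measured from detection rather than from the start of the incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class IncidentEvidence:
    """Evidence registered for one incident (all names are hypothetical)."""
    trigger: str                # what set off the incident
    impact: str                 # who or what was affected
    started_at: datetime        # when the problem began
    detected_at: datetime       # when the first alert fired
    mitigated_at: datetime      # when service returned to normal
    mitigation_steps: list[str] = field(default_factory=list)
    root_cause: str = "unknown"  # filled in once the analysis is done

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from detection, not from incident start.
        return self.mitigated_at - self.detected_at


# Sample values are invented for illustration only.
evidence = IncidentEvidence(
    trigger="Error-rate alert on checkout service",
    impact="Some users could not complete purchases",
    started_at=datetime(2024, 1, 10, 9, 0),
    detected_at=datetime(2024, 1, 10, 9, 12),
    mitigated_at=datetime(2024, 1, 10, 9, 47),
    mitigation_steps=["Rolled back latest release", "Restarted workers"],
)
print(evidence.time_to_detect)    # 0:12:00
print(evidence.time_to_mitigate)  # 0:35:00
```

Keeping the timestamps rather than hand-entered durations means the detection and mitigation times are derived the same way for every incident.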
Step 2: The Analysis
Based on the evidence above, an analysis should be conducted. The analysis is typically carried out by the on-call team member who responded to the incident and might include other team members who either helped to mitigate or analyze the root cause. The analysis process needs to find answers to the following questions:
- Trigger
  - How many alerts did we receive for the incident?
  - Was the trigger timely or could we have registered it earlier?
- Impact
  - Was the impact sufficient to trigger an incident in the first place? Or should we calibrate the triggers?
  - Were steps taken to mitigate the impact adequate and did they follow the process? If not, do we need to invest in training or improve the guidelines?
  - Did we manage to mitigate the impact fast enough? Is there anything we can do to shrink the mitigation time?
- Root cause
  - Will the root cause be resolved or will we have to live with it?
  - If the root cause will be resolved, what exactly do we need to do to resolve it?
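Some of these questions, such as the number of alerts, the delay before the first alert, and whether mitigation was fast enough, can be answered directly from timestamps. Here is a rough sketch of that arithmetic; the function names, the sample times, and the 30-minute target are assumptions made for illustration, and a real team would set its own threshold.

```python
from datetime import datetime, timedelta


def summarize_alerts(incident_start, alert_times):
    """Return (number of alerts, delay until the first alert or None)."""
    if not alert_times:
        return 0, None
    first_alert = min(alert_times)
    return len(alert_times), first_alert - incident_start


def mitigated_fast_enough(detected_at, mitigated_at, target):
    """True if mitigation took no longer than the target duration."""
    return (mitigated_at - detected_at) <= target


# Sample values are invented for illustration only.
start = datetime(2024, 1, 10, 9, 0)
alerts = [
    datetime(2024, 1, 10, 9, 13),
    datetime(2024, 1, 10, 9, 12),
    datetime(2024, 1, 10, 9, 20),
]
alert_count, detection_delay = summarize_alerts(start, alerts)
print(alert_count, detection_delay)  # 3 0:12:00

on_target = mitigated_fast_enough(
    detected_at=datetime(2024, 1, 10, 9, 12),
    mitigated_at=datetime(2024, 1, 10, 9, 47),
    target=timedelta(minutes=30),  # hypothetical target
)
print(on_target)  # False
```

The numbers only inform the discussion. Whether twelve minutes to the first alert is acceptable is still a judgment the team has to make.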
Step 3: Compose a Summary
Based on the analysis, a summary should be composed, including the lessons learned and follow-up tasks registered and prioritized. The follow-up tasks typically include:
- Tasks for engineering to resolve the root cause
- Tasks for DevOps engineers to improve the monitoring setup
- Tasks for managers to improve the processes
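One way to register and prioritize the follow-up tasks is to tag each one with an owner and a priority, then sort and group them. This is only a sketch: the `FollowUp` class, the owner labels, the priority scale (1 is most urgent), and the sample tasks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    description: str
    owner: str      # e.g. "engineering", "devops", "management"
    priority: int   # 1 = most urgent (hypothetical scale)


def prioritize(tasks):
    """Return tasks ordered from most to least urgent."""
    return sorted(tasks, key=lambda task: task.priority)


def by_owner(tasks):
    """Group task descriptions by owner, most urgent first within each group."""
    grouped = {}
    for task in prioritize(tasks):
        grouped.setdefault(task.owner, []).append(task.description)
    return grouped


# Sample tasks are invented for illustration only.
tasks = [
    FollowUp("Add alert on queue depth", "devops", 2),
    FollowUp("Fix connection leak in checkout service", "engineering", 1),
    FollowUp("Update the on-call runbook", "management", 3),
    FollowUp("Add regression test for the leak", "engineering", 2),
]

for owner, descriptions in by_owner(tasks).items():
    print(owner, descriptions)
```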
5 Tips for Introducing Postmortems
Introducing postmortems to an organization that has never conducted them is not as easy as it might sound. As with every new or changing process, introducing and sustaining the change requires time and effort at all levels of the organization. However, following a few key principles makes the change easier:
- Make sure you stay away from blame games and finger pointing. This is the most crucial aspect of getting things right out of the gate. If the analysis focuses on blaming the people who caused the incident instead of making sure the team learns and improves, the initiative will do more harm than good.
- Appoint a dedicated lead who makes sure that each and every incident response finishes with a postmortem. These leads tend to come from DevOps or on-call teams, and most often they are the team leads themselves.
- Collaborate and share. Make sure to capture the postmortems in a medium suitable for sharing and learning, such as wikis. Use the postmortems from last month as regular learning material for your team. Allow collaboration and commenting during and after the postmortem.
- Involve management. Showing support from management makes evangelizing and education among engineers easier. To keep the management engaged, plan ahead with objectives and show progress along the way. You know, managers like nothing more than charts pointing up and to the right.
- Start small. If the organization is large, starting with just a few services and one team is enough to build an example that will motivate other teams to follow. The initial team celebrating their wins is often enough to have the other teams join the bandwagon. It is much harder to introduce the change without having a positive example from inside the organization.
Postmortem Checklist
We've prepared a checklist of the questions you need to be asking yourself to conduct your DevOps postmortem in the best way possible.
- Detection
- Impact
  - Impact on end users
  - Impact on productivity
  - Impact on infrastructure
- Mitigation
  - Time to mitigation
  - Mitigation step #1
  - Mitigation step #2
- Root cause analysis
- Lessons learned
- Follow-ups
  - Task #1 (detect/mitigate/process)
  - Task #2 (detect/mitigate/process)
  - Task #3 (detect/mitigate/process)
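If postmortems are stored as structured documents, the checklist can double as a completeness check before a postmortem is marked as done. The sketch below assumes a postmortem is represented as a plain dictionary keyed by the top-level checklist sections; the key names and the sample draft are hypothetical.

```python
# Top-level checklist sections, written as hypothetical dictionary keys.
REQUIRED_SECTIONS = [
    "detection",
    "impact",
    "mitigation",
    "root_cause_analysis",
    "lessons_learned",
    "follow_ups",
]


def missing_sections(postmortem):
    """Return the checklist sections that are absent or left empty."""
    return [name for name in REQUIRED_SECTIONS if not postmortem.get(name)]


# Sample draft, invented for illustration only.
draft = {
    "detection": "Error-rate alert fired 12 minutes after the release",
    "impact": {"end_users": "Checkout failures", "productivity": "", "infrastructure": ""},
    "mitigation": ["Rolled back latest release"],
    "root_cause_analysis": "",
    "follow_ups": [],
}

print(missing_sections(draft))
# ['root_cause_analysis', 'lessons_learned', 'follow_ups']
```

The check only looks at the top level. It does not verify that each impact category or mitigation step has been filled in.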
A good postmortem culture is only as strong as the team and tools available. Plumbr real-user monitoring can help you identify how many customers are affected by an issue, how long they were affected for and where the bug is. Armed with this information your postmortem culture will grow faster and stronger.
Published at DZone with permission of Ivo Magi, DZone MVB. See the original article here.