The Art of Postmortem
Top tech companies have a meticulous postmortem process for analyzing outages. In this article, we shed light on the art of writing a good postmortem report.
How an individual responds to setbacks defines their character, and the same is true of companies. Authoring a postmortem is the traditional way to learn from failures. The postmortem is a core foundation of company culture.
Successful tech companies have rigorous postmortem processes that have been refined meticulously over time. A company that learns from its mistakes will protect an incalculable amount of business value by preventing future outages.
This article will provide a condensed view of how top tech companies perform postmortems. You can apply these principles as a postmortem author, reviewer, or company leader.
The Purpose of Postmortem
The goal of the postmortem is to learn from your mistakes and identify corrective actions. To learn from your mistakes, you must understand them first. Therefore, much of the postmortem is exploratory. Approach the process without preconceptions about where it might lead.
Any customer-impacting outage should require a postmortem. But postmortems have much broader applicability beyond outages. For example, if a security exploit is discovered in your system, you can analyze gaps in your security processes using a postmortem.
Encourage a company culture where postmortems are seen as an opportunity to learn, not as a punitive exercise. Teams should feel comfortable undertaking a postmortem or asking another team to do so.
We’ll now discuss how to approach the essential components of a postmortem. The postmortem is divided into two parts: incident analysis and post-incident analysis. The first part analyzes the outage, and the second part reflects on it.
Incident Analysis
An outage (or incident) has four key dimensions against which it should be analyzed: impact, detection, mitigation, and root cause.
Typically, each dimension is accompanied by a thought experiment designed to help identify improvements. Some of the most effective ones are mentioned below.
Impact
Describe the incident from the perspective of your customer. Discuss how much of your customer base and your business was affected. The impact on a handful of large customers can be more damaging to the business than the impact on a hundred small customers. Bring out such nuance in your analysis.
Thought experiment: What architectural changes could you make to cut the blast radius in half?
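To make the point about large versus small customers concrete, here is a minimal sketch that compares the share of customers affected with the share of revenue affected. The customer records, the `revenue` field, and the function name `impact_summary` are all hypothetical, chosen only for illustration; your own measure of business impact may be something other than revenue.

```python
def impact_summary(customers, affected_ids):
    """Compare the share of customers affected with the share of revenue affected."""
    total_revenue = sum(c["revenue"] for c in customers)
    affected = [c for c in customers if c["id"] in affected_ids]
    return {
        "customer_share": len(affected) / len(customers),
        "revenue_share": sum(c["revenue"] for c in affected) / total_revenue,
    }


# Hypothetical customer base: two large accounts and a hundred small ones.
customers = [
    {"id": "big-1", "revenue": 500_000},
    {"id": "big-2", "revenue": 300_000},
] + [{"id": f"small-{n}", "revenue": 2_000} for n in range(100)]

# An outage that touches only the two large accounts.
summary = impact_summary(customers, {"big-1", "big-2"})
print(summary)
```

In this made-up data set, fewer than 2% of customers are affected, yet they account for 80% of revenue. Reporting both numbers side by side keeps that nuance visible in the write-up.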
Detection
Discuss how the incident was detected. If a customer had to tell you your product wasn’t working, you have significant gaps in your automated alerting.
Thought experiment: How could you cut the detection time in half?
Mitigation
Describe how the incident was mitigated, including actions that did not work.
Thought experiment: How could you mitigate the issue twice as fast?
Root Cause
Describe how the incident was root caused. In complex distributed systems, it is easy to misdiagnose issues. Talk about red herrings that came up and how you could rule them out more quickly.
Thought experiment: How could you root cause the issue twice as fast?
Notice the recurring theme of time in the thought experiments. Time is the most valuable currency during an outage. High-quality postmortems have minute-level precision about how long each step took.
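One way to get minute-level precision is to record a timestamp for each milestone and derive the durations from them. The sketch below assumes a hypothetical timeline with four events (`started`, `detected`, `root_caused`, `mitigated`); the event names, the timestamps, and the choice to measure root cause and mitigation from the moment of detection are illustrative assumptions, not a standard.

```python
from datetime import datetime

# Hypothetical timeline; event names and timestamps are made up for illustration.
timeline = {
    "started": datetime(2024, 1, 15, 9, 0),
    "detected": datetime(2024, 1, 15, 9, 18),
    "root_caused": datetime(2024, 1, 15, 9, 50),
    "mitigated": datetime(2024, 1, 15, 10, 5),
}


def minutes_between(timeline, start, end):
    """Return whole minutes elapsed between two named timeline events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


def incident_durations(timeline):
    """Derive the three durations that the thought experiments ask you to halve."""
    return {
        "time_to_detect": minutes_between(timeline, "started", "detected"),
        "time_to_root_cause": minutes_between(timeline, "detected", "root_caused"),
        "time_to_mitigate": minutes_between(timeline, "detected", "mitigated"),
    }


durations = incident_durations(timeline)
print(durations)
```

With these numbers in hand, "cut the detection time in half" becomes a concrete target (here, 9 minutes instead of 18) rather than a vague aspiration.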
Post-Incident Analysis
The purpose of the post-incident analysis is to take the information surfaced by the incident analysis and investigate more introspectively what went wrong and how to fix it.
The Five Whys
At the heart of major companies' postmortem processes is the ‘five whys.’ The process involves asking a series of ‘why’ questions (there can be more than five), where each ‘why’ is formed out of the ‘because’ in the previous question. The idea is to dive progressively deeper into the root causes of the incident until you have satisfactory answers.
The five whys will help you identify issues that can be glossed over by a conventional analysis of the incident. Consider the following set of ‘whys,’ which reveals an important detail about the team’s decision-making processes:
- ‘Why was the new feature unable to autoscale to handle the increase in traffic?’
- Answer: ‘The feature had issues in the autoscaling logic.’
- ‘Why were the scaling issues in the new feature not caught before launch?’
- Answer: ‘The feature was launched without a load test.’
- ‘Why was the feature launched without a routine load test?’
- Answer: ‘The team was asked to ship the feature against a tight deadline. The team, together with management, decided to skip load testing to make the deadline.’
If you’re a leader, encourage postmortem authors to surface candid feedback on management, and even your leadership. This will earn you trust from your engineers, and set the right example for other leaders in the company.
Lessons Learned
Everything can be traced back to human error because humans built the system and maintained it. Every ‘mistake’ identified in the five whys should map onto gaps in knowledge or processes. Articulate such realizations as lessons learned.
Lessons shouldn’t just reiterate the problems in the system. ‘Our fallback system doesn’t work’ is a bad lesson. A better version is, ‘Fallback strategies can only be relied upon if they’re exercised regularly. We should build automation to continually switch between the primary and the fallback system so that both are in active use.’
Action Plan
In previous sections, we discussed various improvement ideas. The action plan is a distilled list of surgical actions that will address the most pressing deficiencies.
A good postmortem action item has the following properties:
- It is a well-defined work item typically achievable within a fortnight or a month of the incident.
- It has a clear target date, derived from its size and urgency.
As a leader, track postmortem action items at an organizational level to make sure they don’t get neglected.
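Tracking can be as simple as a list of items with target dates and a periodic check for the ones that have slipped. The following is a minimal sketch under that assumption; the `ActionItem` fields (including `owner`), the `overdue` helper, and the sample items are hypothetical, and in practice this data would more likely live in your issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str          # hypothetical field: who is responsible for the item
    target_date: date   # derived from the item's size and urgency
    done: bool = False


def overdue(items, today):
    """Return open items whose target date has passed, oldest first."""
    late = [i for i in items if not i.done and i.target_date < today]
    return sorted(late, key=lambda i: i.target_date)


# Hypothetical action items from a postmortem.
items = [
    ActionItem("Add load test to launch checklist", "alice", date(2024, 2, 1)),
    ActionItem("Alert on autoscaling failures", "bob", date(2024, 1, 20)),
    ActionItem("Document rollback procedure", "carol", date(2024, 1, 10), done=True),
]

for item in overdue(items, today=date(2024, 2, 5)):
    print(f"{item.target_date}  {item.owner}: {item.title}")
```

Completed items drop out of the report even if their target date has passed, so the output lists only the work that still needs attention.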
Conclusion
In this article, we explored the essential properties of a postmortem. Naturally, your process should be a superset of what is described here. A ‘Timeline’ section is always useful for making sense of complex incidents. A brief overview of the impacted service will help readers unfamiliar with the system understand the write-up.
Share particularly instructive postmortems with a larger audience. Top tech companies sometimes post public-friendly versions of their postmortems for major outages. This is a great way to begin rebuilding trust with your customers.
Most importantly, recognize the contributions of the team in fixing the outage. Outages are damaging to a team’s morale. Set up a culture where the postmortem is seen as a cathartic process that ushers in a more resilient product.