The Art of Postmortem

Top tech companies have a meticulous post-mortem process for analyzing outages. In this article, we shed light on the art of writing a good post-mortem report.

By Aditya Visweswaran · Mar. 28, 25 · Analysis

How an individual responds to setbacks defines their character, and the same is true of companies. Authoring a postmortem is the traditional way to learn from failures. The postmortem is a core foundation of company culture. 

Successful tech companies have rigorous postmortem processes that have been refined meticulously over time. A company that learns from its mistakes will protect an incalculable amount of business value by preventing future outages. 

This article will provide a condensed view of how top tech companies perform postmortems. You can apply these principles as a postmortem author, reviewer, or company leader. 

The Purpose of Postmortem

The goal of the postmortem is to learn from your mistakes and identify corrective actions. To learn from your mistakes, you must understand them first. Therefore, much of the postmortem is exploratory. Approach the process without preconceptions about where it might lead. 

Any customer-impacting outage should require a postmortem. But postmortems have much broader applicability beyond outages. For example, if a security exploit is discovered in your system, you can analyze gaps in your security processes using a postmortem.

Encourage a company culture where postmortems are seen as an opportunity to learn, not as a punitive exercise. Teams should feel comfortable undertaking a postmortem or asking another team to do so.

We’ll now discuss how to approach the essential components of a postmortem. The postmortem is divided into two parts: incident analysis and post-incident analysis. The first part analyzes the outage, and the second part reflects on what it revealed.

Incident Analysis

An outage (or incident) has four key dimensions against which it should be analyzed: impact, detection, mitigation, and root cause.

Typically, each dimension is accompanied by a thought experiment designed to help identify improvements. Some of the most effective ones are mentioned below. 

Impact

Describe the incident from the perspective of your customer. Discuss how much of your customer base and your business was affected. The impact on a handful of large customers can be more damaging to the business than the impact on a hundred small customers. Bring out such nuance in your analysis. 
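
To make this nuance concrete, it can help to report the share of customers affected alongside the share of revenue affected. The following is a minimal sketch, not a prescribed method: the `Customer` class, the `impact_shares` helper, and every figure in it are hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    annual_revenue: float  # hypothetical figure, in a single currency
    affected: bool


def impact_shares(customers):
    """Return (share of customers affected, share of revenue affected)."""
    total_revenue = sum(c.annual_revenue for c in customers)
    affected = [c for c in customers if c.affected]
    customer_share = len(affected) / len(customers)
    revenue_share = sum(c.annual_revenue for c in affected) / total_revenue
    return customer_share, revenue_share


# Invented customer base: 100 small accounts and 5 large ones.
# Only 3 of the large accounts were hit by the incident.
customers = [Customer(f"small-{i}", 1_000, affected=False) for i in range(100)]
customers += [Customer(f"large-{i}", 100_000, affected=(i < 3)) for i in range(5)]

customer_share, revenue_share = impact_shares(customers)
print(f"Customers affected: {customer_share:.1%}")
print(f"Revenue affected:   {revenue_share:.1%}")
```

In this invented scenario, fewer than 3% of customers were affected, yet they account for half of the revenue. Reporting both numbers keeps the impact section from understating the damage.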

Thought experiment: What architectural changes could you make to cut the blast radius in half?

Detection

Discuss how the incident was detected. If a customer had to tell you your product wasn’t working, you have significant gaps in your automated alerting. 

Thought experiment: How could you cut the detection time in half? 

Mitigation

Describe how the incident was mitigated, including actions that did not work. 

Thought experiment: How could you mitigate the issue twice as fast? 

Root Cause

Describe how the incident was root caused. In complex distributed systems, it is easy to misdiagnose issues. Talk about red herrings that came up and how you could rule them out more quickly.

Thought experiment: How could you root cause the issue twice as fast? 

Notice the recurring theme of time in the thought experiments. Time is the most valuable currency during an outage. High-quality postmortems have minute-level precision about how long things take.
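
As an illustration of minute-level precision, the durations of each phase can be derived directly from a timestamped timeline. This is a sketch under stated assumptions: the event names, the timestamps, and the choice to measure root-causing and mitigation from the moment of detection are all hypothetical, and your process may define these intervals differently.

```python
from datetime import datetime

# Hypothetical incident timeline; every timestamp is invented for illustration.
timeline = {
    "started": datetime(2025, 3, 1, 14, 2),
    "detected": datetime(2025, 3, 1, 14, 19),
    "root_caused": datetime(2025, 3, 1, 14, 47),
    "mitigated": datetime(2025, 3, 1, 15, 5),
}


def minutes_between(events, start, end):
    """Return whole minutes elapsed between two named timeline events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Summarize how long detection, root-causing, and mitigation took."""
    return {
        "time_to_detect": minutes_between(events, "started", "detected"),
        # Assumption: both intervals below are measured from detection.
        "time_to_root_cause": minutes_between(events, "detected", "root_caused"),
        "time_to_mitigate": minutes_between(events, "detected", "mitigated"),
    }


durations = phase_durations(timeline)
for name, minutes in durations.items():
    print(f"{name}: {minutes} min")
```

With numbers like these in hand, each "twice as fast" thought experiment has a concrete baseline to halve.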

Post-Incident Analysis

The purpose of the post-incident analysis is to take the information surfaced by the incident analysis and investigate more introspectively what went wrong and how to fix it.

The Five Whys

At the heart of major companies' postmortem processes is the ‘five whys.’ The process involves asking a series of ‘why’ questions (there can be more than five), where each ‘why’ is formed out of the ‘because’ in the previous answer. The idea is to dive progressively deeper into the root causes of the incident until you have satisfactory answers.

The five whys will help you identify issues that can be glossed over by a conventional analysis of the incident. Consider the following set of ‘whys,’ which reveals an important detail about the team’s decision-making processes:

  • ‘Why was the new feature unable to autoscale to handle the increase in traffic?’
    • Answer: ‘The feature had issues in the autoscaling logic.’
  • ‘Why were the scaling issues in the new feature not caught before launch?’
    • Answer: ‘The feature was launched without a load test.’
  • ‘Why was the feature launched without a routine load test?’
    • Answer: ‘The team was asked to ship the feature against a tight deadline. The team, together with management, decided to skip load testing to make the deadline.’
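
One way to keep such a chain easy to review is to record it as data, so that each question sits next to the answer it produced. The sketch below is illustrative only: the `Why` class and the `render` and `deepest_cause` helpers are hypothetical names, not part of any tool described here, and the entries are abridged from the example above.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One link in a five-whys chain (hypothetical structure)."""
    question: str
    because: str


# The chain from the example above, recorded in order.
chain = [
    Why("Why was the new feature unable to autoscale to handle the increase in traffic?",
        "The feature had issues in the autoscaling logic."),
    Why("Why were the scaling issues in the new feature not caught before launch?",
        "The feature was launched without a load test."),
    Why("Why was the feature launched without a routine load test?",
        "The team, together with management, decided to skip load testing "
        "to make the deadline."),
]


def deepest_cause(whys):
    """Return the last answer in the chain, i.e. the deepest cause found so far."""
    if not whys:
        raise ValueError("the chain is empty")
    return whys[-1].because


def render(whys):
    """Format the chain as numbered text for inclusion in a write-up."""
    lines = []
    for depth, step in enumerate(whys, start=1):
        lines.append(f"{depth}. {step.question}")
        lines.append(f"   Because: {step.because}")
    return "\n".join(lines)


print(render(chain))
print("Deepest cause so far:", deepest_cause(chain))
```

Keeping the chain ordered makes it easy for a reviewer to check that each question genuinely follows from the answer before it, and to ask one more ‘why’ if the last answer is unsatisfying.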

If you’re a leader, encourage postmortem authors to surface candid feedback on management, and even on your own leadership. This will earn you trust from your engineers and set the right example for other leaders in the company.

Lessons Learned

Everything can be traced back to human error because humans built the system and maintained it. Every ‘mistake’ identified in the five whys should map onto gaps in knowledge or processes. Articulate such realizations as lessons learned. 

Lessons shouldn’t just reiterate the problems in the system. ‘Our fallback system doesn’t work’ is a bad lesson. A better version is, ‘Fallback strategies can only be relied upon if they’re exercised regularly. We should build automation to continually switch between the primary and the fallback system so that both are in active use.’ 

Action Plan

In previous sections, we discussed various improvement ideas. The action plan is a distilled list of surgical actions that will address the most pressing deficiencies.

A good postmortem action item has the following properties:

  1. It is a well-defined work item typically achievable within a fortnight or a month of the incident.
  2. It has a clear target date, derived from its size and urgency.
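
These two properties are simple enough to check mechanically when action items are tracked as structured records. The following is a rough sketch, not a standard: the `ActionItem` fields, the `problems` helper, the sample items, and the 30-day approximation of "a month" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# "A month" approximated as 30 days; this threshold is an assumption.
MAX_DAYS_AFTER_INCIDENT = 30


@dataclass
class ActionItem:
    title: str
    owner: str
    target_date: Optional[date]  # None means no date has been set


def problems(item, incident_date):
    """Return the reasons an item falls short of the two properties."""
    issues = []
    if item.target_date is None:
        issues.append("no target date")
    elif (item.target_date - incident_date).days > MAX_DAYS_AFTER_INCIDENT:
        issues.append("target date is more than 30 days after the incident")
    return issues


# Hypothetical incident date and action items.
incident_date = date(2025, 3, 1)
items = [
    ActionItem("Add load test to launch checklist", "team-a", date(2025, 3, 14)),
    ActionItem("Rewrite autoscaling logic", "team-a", date(2025, 5, 1)),
    ActionItem("Improve alerting", "team-b", None),
]

for item in items:
    found = problems(item, incident_date)
    status = "ok" if not found else "; ".join(found)
    print(f"{item.title}: {status}")
```

An item flagged as too far out is often a sign that it should be split into a smaller, well-defined first step plus longer-term follow-up work.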

As a leader, track postmortem action items at an organizational level to make sure they don’t get neglected. 

Conclusion

In this article, we explored the essential properties of a postmortem. Naturally, your process should be a superset of what is described here. A ‘Timeline’ section is always useful to make sense of complex incidents. A brief overview of the impacted service will help readers unfamiliar with the system understand the write-up.

Share particularly instructive postmortems with a larger audience. Top tech companies sometimes post public-friendly versions of their postmortems for major outages. This is a great way to begin rebuilding trust with your customers. 

Most importantly, recognize the contributions of the team in fixing the outage. Outages are damaging to a team’s morale. Set up a culture where the postmortem is seen as a cathartic process that ushers in a more resilient product.
