5 Best Practices on Nailing Postmortems

Here are five practices that can boost the effectiveness of your postmortems, with examples of postmortems or procedures that demonstrate these methods.

Reading about postmortem best practices can sometimes be quite different from seeing them in action. Postmortems are like snowflakes; no two will ever look the same. There isn’t a definitive template for success that will work in every situation, but there are some practices and procedures when writing postmortems that can help. 

Use Visuals

As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Days, weeks, or even years after a postmortem is written, graphs still give an engineer a quick, in-depth explanation of what was happening at the time.

In Cloudflare’s postmortem of an incident occurring on July 2, 2019, the authors use visuals to help readers understand both the background of the incident and what happened when a bad update caused an outage. The postmortem reads, “Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.” This description of the issue is followed by a graph showing CPU usage during the incident.

Visuals embedded within the postmortem benefit readers in two major ways. First, they allow new hires to visualize the problem and feel like they’re working through the incident with the engineers who mitigated it. Second, they allow engineers who may deal with a similar issue to quickly find the information they’re looking for and share it easily with other team members.

Be a Historian

Using timelines when writing postmortems is very valuable. However, there’s an art to crafting them. 

As Steve McGhee says, “There is little utility to including the entire chat log of an incident. Instead, consider illustrating a timeline of the important inflection points (e.g. actions that turned the situation around). This may prove to be very helpful for troubleshooting future incidents.” 

Postmortem timelines require the perfect balance of information. Too much to sift through, and the postmortem will become cluttered. Too little and it’s vague. 

Twilio’s “Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause” strikes this balance well. What stands out in this postmortem is its clarity: the timeline and the root cause analysis are kept separate.

In the entry for 1:35 AM July 18, the timeline note simply reads, “We experienced a loss of network connectivity between all of our billing Redis-slaves and our Redis-master. This caused all Redis-slaves to reconnect and request full synchronization with the master at the same time.” 

However, in the root cause analysis, the postmortem authors further expound on this timestamp with pertinent background information by explaining that the loss of network connectivity “caused all Redis-slaves to reconnect and request full synchronization with the master at the same time,” and how this affected the Redis-master. 

Though the timeline entry is half the word count of the explanation in the analysis, it still relays the most crucial information. The benefit of this is speed. If the billing Redis-slaves simultaneously disconnect again, an engineer might want to look back on this postmortem as a clue. 

When postmortem timelines are streamlined to include only the most important moments, while all background information is included in the root cause analysis, an engineer can see what actions they should consider taking next without having to use precious time sifting through clutter.
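One way to picture this streamlining is as a filter over the raw incident log. The sketch below is only an illustration under stated assumptions: the `Event` structure, the `inflection_point` flag, and the sample entries are all invented here, and in practice someone still has to decide by hand which moments count as inflection points.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One raw entry from an incident log (hypothetical structure)."""
    at: datetime
    note: str
    inflection_point: bool = False  # True when the action changed the course of the incident


def build_timeline(events: list[Event]) -> list[str]:
    """Return only the inflection points, oldest first, one formatted line each."""
    key_events = sorted((e for e in events if e.inflection_point), key=lambda e: e.at)
    return [f"{e.at:%Y-%m-%d %H:%M} {e.note}" for e in key_events]


# Invented sample data, not taken from any real postmortem.
raw_log = [
    Event(datetime(2024, 1, 1, 1, 52), "Restarted replicas one at a time", inflection_point=True),
    Event(datetime(2024, 1, 1, 1, 35), "Replicas lost connectivity to the primary", inflection_point=True),
    Event(datetime(2024, 1, 1, 1, 41), "Anyone else seeing alerts?"),
]

for line in build_timeline(raw_log):
    print(line)
```

The chat message is dropped and the two remaining entries come out in chronological order, which is the shape a reader scanning for next steps would want. The full log can still be linked from the root cause analysis for anyone who needs it.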

Publish Promptly

As the Google SRE book says, “A prompt postmortem tends to be more accurate because the information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!” 

Promptness has two main benefits: first, it lets the authors report on the incident while the details are still fresh, and second, it reassures affected customers, leaving less opportunity for churn.

Google certainly practices what it preaches, as do many best-in-class companies such as Uber. These companies often publish postmortems within 48 hours, a discipline that leads to more accurate postmortems. After two months, will your team remember exactly what happened during an incident, even after looking at the logs? It’s not likely.

By creating a routine where postmortems are published within two days of mitigation, you keep the information fresher and more useful for teaching, onboarding, and reference when similar incidents occur.

Furthermore, prompt postmortems are crucial for fostering a culture of transparency that maintains customer trust. If an incident affects your customers, they’ll likely be upset. In the case of an incident involving critical features, billing, or data breaches, customers will often be on edge waiting for an explanation. 

Some of your customers may even have SLAs set for the promptness of a postmortem detailing the incident. Waiting to publish only increases customer dissatisfaction. However, if the incident is promptly explained via a detailed and accurate postmortem, customers don’t have to linger in anxiety. 
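A team that adopts the two-day routine could track it with a small helper like the one below. This is a minimal sketch, assuming the clock starts at mitigation and that 48 hours is the target; `PUBLISH_WINDOW`, `publish_deadline`, and `is_overdue` are hypothetical names, and the dates are invented. A customer SLA may define a different window.

```python
from datetime import datetime, timedelta

# Assumed target: publish within 48 hours of mitigation.
PUBLISH_WINDOW = timedelta(hours=48)


def publish_deadline(mitigated_at: datetime) -> datetime:
    """Latest time the postmortem should be published."""
    return mitigated_at + PUBLISH_WINDOW


def is_overdue(mitigated_at: datetime, now: datetime) -> bool:
    """True once the publishing window has passed."""
    return now > publish_deadline(mitigated_at)


# Invented example: incident mitigated on a Monday morning.
mitigated = datetime(2024, 3, 4, 9, 30)
print(publish_deadline(mitigated))                         # 2024-03-06 09:30:00
print(is_overdue(mitigated, datetime(2024, 3, 5, 12, 0)))  # False
print(is_overdue(mitigated, datetime(2024, 3, 6, 10, 0)))  # True
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test and avoids surprises around time zones.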

Be Blameless

Blameless postmortems come up often in discussions of best practices. But what does a blameless culture look like? When writing postmortems, there are three important things to keep in mind to promote blamelessness.

  • People are not points of failure. Pinning an incident on one person or a group of people is counterproductive. It creates an environment where people are afraid to take risks, innovate, and solve problems. This leads to stagnation and avoidance.
  • Everyone on the team is working with good intentions. People make mistakes. It’s extremely rare for a team member to cause problems maliciously. Everyone is simply doing what makes the most sense to them at the time to be helpful.
  • Failure will happen. There’s no way around it. However, by having a good incident resolution and postmortem practice in place, failure can be a benefit to your team, as it uncovers areas to focus on to improve resiliency. As long as you learn from an incident, you’ve made progress.

Many teams choose to have a meeting after an incident to talk through what happened. Etsy created an introduction to this meeting that voices the three points above for all attendees. 

Etsy’s Debriefing Facilitation Guide states, “The goal for our time together today is to recreate the event, talking through what happened for each person at each stage to create as robust a portrait as possible of what happened, and what the circumstances in play were at each juncture (when decisions were made, and actions were taken) that made it make sense for people to do what they did at the moment. If one of you gains an insight into the complexity of another person’s role, this was an hour well spent.”

Sentry’s postmortem from a security incident occurring on July 12, 2016, demonstrates this well. First, the postmortem uses the collective pronoun “we” to avoid naming people as problems. Additionally, it states, “It’s been a valuable experience for our product team, albeit one we wish we could have avoided.”

The point here is that this was a learning experience. Failure happened, and it will happen again. Sure, incidents are painful, but they’re one of the best ways to learn and become better.

Tell a Story

An incident is a story. To tell a story well, many components must work together. 

  • Without sufficient background knowledge, this story loses depth and context. 
  • Without a plan to rectify outstanding action items, the story loses a resolution. 
  • Without a timeline dictating what happened during an incident, the story loses its plot.

Make sure that your postmortems have all the necessary parts to create a compelling and helpful narrative.
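A quick completeness check can catch a draft that is missing one of these parts before it goes out. The sketch below is only an illustration: it assumes each section starts with a heading on a line of its own, the section names in `REQUIRED_SECTIONS` are an assumed mapping of the three components above, and the draft text is invented.

```python
# Assumed section names, one per story component listed above.
REQUIRED_SECTIONS = ("background", "timeline", "action items")


def missing_sections(postmortem_text: str) -> list[str]:
    """Return required section names that never appear as a line of their own."""
    headings = {line.strip().lower() for line in postmortem_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


# Invented draft used only to exercise the check.
draft = """Background
Builds run in two phases.

Timeline
10:02 Queue times begin to climb.
"""

print(missing_sections(draft))  # ['action items']
```

A check like this only confirms that the sections exist. Whether they add up to a compelling and helpful narrative is still a judgment call for the author and reviewers.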

In Travis CI’s postmortem on high queue times on OSX builds, the author begins by giving an overview of the incident itself. Next comes a background section, which explains its own relevance to the incident by stating, “Understanding this separation of the creation/build run and the cleanup parts of the life-cycle becomes important in understanding what contributed to this incident.”

After the background, we get into the incident itself. The author walks us step by step through what happened during the timeline, using timestamps to show us the duration. 

After sharing how the incident was mitigated, the authors explain what the team intends to do going forward. Three main objectives are listed to strengthen infrastructure. The story closes with an excellent, blameless summary, which includes, “We always use problems like these as an opportunity for us to improve, and this will be no exception.”

By learning from example, taking the best parts of what others do, and applying them to your organizational context, you and your team can write better postmortems for each incident. Postmortems shouldn’t be done simply as a checkbox item, but rather as a way to catalyze introspection and action to prevent further incidents. Again, there’s no one-size-fits-all approach, but your team can apply any one (or all) of the above practices starting today. 

If you want more reading, check out

Topics:
agile, best practices, devops, postmortems, sre

Published at DZone with permission of Hannah Culver. See the original article here.

Opinions expressed by DZone contributors are their own.
