DevOps Postmortems: Why and How to Use Them (+Checklist)

When things go wrong, as they sometimes will, it's important to take a step back and evaluate why and how they did.

By Ivo Magi · Manoj Debnath · Updated Mar. 07, 24 · Analysis
In DevOps, problems that occur in the CI/CD process are fixed as soon as possible. The information needed to fix them comes from the feedback loop of the CI/CD pipeline and feeds into the remediation process. Postmortem analysis plays a crucial role in planning the fix. A postmortem culture brings the development team together to identify the cause of an incident and agree on a probable solution. The written postmortem describes the impact of the incident, the steps taken to resolve it, the root cause of the problem, and the follow-up actions that will prevent similar incidents in the future. A thorough analysis can not only improve the overall quality of what goes to production but also reduce the stress and risk in the development process.

Why Do I Need Postmortem Discipline?

Typically, any change to a system can cause instability, with repercussions. The DevOps CI/CD pipeline promotes frequent releases in small increments. On the one hand, this reduces the risk of failure in any specific release; on the other hand, it increases the number of incidents that on-call teams need to respond to.

The incident response team tries to quantify and mitigate the impact so that the service returns to a normal state. This is exactly where postmortem analysis matters: without it, no corrective or preventive measures can be taken. In the worst case, incidents multiply with a cascading effect, and the time a DevOps team spends on incident response surpasses the time it spends on actual development. This situation can be avoided by building a postmortem culture into the DevOps team.

How to Conduct a Postmortem Analysis

To avoid such a death spiral, your team must acknowledge the need to learn from the past to build a better future. This learning process is called a postmortem. A postmortem should be triggered whenever an incident requires a response from an on-call engineer.

Step 1: Registering the Evidence

A typical postmortem starts by registering the objective evidence:
  • Trigger for the incident
  • Impact of the incident
  • Time to detect and mitigate
  • Steps taken to mitigate
  • Root cause analysis
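
The evidence listed above can be captured in a small record so that every postmortem starts from the same fields. The following Python sketch is only an illustration: the class name, the field names, and the sample incident are hypothetical, and it assumes that time to detect is measured from the start of the incident and time to mitigate from the moment of detection. Your team may define these intervals differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentEvidence:
    """Objective evidence registered at the start of a postmortem."""
    trigger: str            # what raised the incident (alert, user report, ...)
    impact: str             # who or what was affected
    started_at: datetime    # when the problem began
    detected_at: datetime   # when the on-call engineer was alerted
    mitigated_at: datetime  # when the service returned to normal
    mitigation_steps: List[str] = field(default_factory=list)
    root_cause: str = "unknown"  # filled in once the analysis is done

    @property
    def time_to_detect(self) -> timedelta:
        # Assumption: measured from the start of the incident.
        return self.detected_at - self.started_at

    @property
    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from detection, not from the start.
        return self.mitigated_at - self.detected_at


# Hypothetical incident, used only to show the record in use.
evidence = IncidentEvidence(
    trigger="HTTP 5xx rate alert",
    impact="Checkout unavailable for some users",
    started_at=datetime(2024, 3, 7, 10, 0),
    detected_at=datetime(2024, 3, 7, 10, 12),
    mitigated_at=datetime(2024, 3, 7, 10, 47),
    mitigation_steps=["Rolled back the latest release"],
)
print(evidence.time_to_detect)    # 0:12:00
print(evidence.time_to_mitigate)  # 0:35:00
```

Keeping timestamps rather than precomputed durations makes it easy to recalculate the intervals if the team later changes how it defines them.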

Step 2: The Analysis

Based on the evidence above, an analysis should be conducted. The analysis is typically carried out by the on-call team member who responded to the incident and might include other team members who either helped to mitigate or analyze the root cause. The analysis process needs to find answers to the following questions:

  • Trigger
    • How many alerts did we receive for the incident?
    • Was the trigger timely or could we have registered it earlier?
  • Impact
    • Was the impact sufficient to trigger an incident in the first place? Or should we calibrate the triggers?
    • Were steps taken to mitigate the impact adequate and did they follow the process? If not, do we need to invest in training or improve the guidelines?
    • Did we manage to mitigate the impact fast enough? Is there anything we can do to shrink the mitigation time?
  • Root cause
    • Will the root cause be resolved or will we have to live with it?
    • If the root cause will be resolved, what exactly do we need to do to resolve it?

Step 3: Compose a Summary

Based on the analysis, compose a summary that includes the lessons learned and the follow-up tasks, registered and prioritized. The follow-up tasks typically include:

  • Tasks for engineering to resolve the root cause
  • Tasks for DevOps engineers to improve the monitoring setup
  • Tasks for managers to improve the processes
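
One way to keep follow-up tasks registered and prioritized is to store them with an owner and a priority and then group them per owner. The sketch below assumes a hypothetical `FollowUp` record, three owner labels that mirror the list above, and a priority scale where 1 is the most urgent. None of these names come from a specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FollowUp:
    title: str
    owner: str     # "engineering", "devops", or "management"
    priority: int  # 1 = most urgent; the scale is an assumption


def group_by_owner(tasks):
    """Group follow-up tasks by owner, most urgent first within each group."""
    grouped = defaultdict(list)
    for task in sorted(tasks, key=lambda t: t.priority):
        grouped[task.owner].append(task)
    return dict(grouped)


# Hypothetical follow-ups from a single postmortem.
tasks = [
    FollowUp("Update the on-call runbook", "management", 3),
    FollowUp("Fix connection pool leak", "engineering", 1),
    FollowUp("Alert on pool saturation", "devops", 2),
    FollowUp("Add regression test for the leak", "engineering", 2),
]

for owner, items in group_by_owner(tasks).items():
    print(owner, [t.title for t in items])
```

In practice the same information usually lives in an issue tracker; the point of the sketch is only that each task has an explicit owner and priority before the postmortem is closed.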

5 Tips for Introducing Postmortems

Introducing postmortems to an organization that historically has not conducted any is not as easy as it might sound. As with every new or changing process, introducing and sustaining the change requires time and effort at all levels of the organization. However, a few key principles make the change easier:

  1. Make sure you stay away from blame games and finger pointing. This is the most crucial aspect of getting things right out of the gate. If the analysis focuses on blaming the people who caused the incident instead of making sure the team learns and improves, the initiative will do more harm than good.
  2. Appoint a dedicated lead who makes sure that each and every incident response finishes with a postmortem. These leads tend to come from DevOps/on-call teams, and most often they are the team leads themselves.
  3. Collaborate and share. Make sure to capture the postmortems in a medium suitable for sharing and learning, such as wikis. Use the postmortems from last month as regular learning material for your team. Allow collaboration and commenting during and after the postmortem.
  4. Involve management. Showing support from management makes evangelizing and education among engineers easier. To keep the management engaged, plan ahead with objectives and show progress along the way. You know, managers like nothing more than charts pointing up and to the right.
  5. Start small. If the organization is large, starting with just a few services and one team is enough to build an example that will motivate other teams to follow. The initial team celebrating their wins is often enough to have the other teams join the bandwagon. It is much harder to introduce the change without having a positive example from inside the organization.

Postmortem Checklist

We've prepared a checklist of the items you need to cover to conduct your DevOps postmortem in the best way possible.

  • Detection
  • Impact
    • Impact on end users
    • Impact on productivity
    • Impact on infrastructure
  • Mitigation
    • Time to mitigation
    • Mitigation step #1
    • Mitigation step #2
  • Root cause analysis
  • Lessons learned
  • Follow-ups
    • Task #1 (detect/mitigate/process)
    • Task #2 (detect/mitigate/process)
    • Task #3 (detect/mitigate/process)
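
The checklist above can also be used as a simple completeness check before a postmortem is published. The following Python sketch is a hypothetical helper, not part of any tool mentioned in this article: it assumes the postmortem is held as a dictionary keyed by the top-level checklist sections and reports the sections that are missing or empty.

```python
# Top-level sections taken from the checklist above.
CHECKLIST = [
    "Detection",
    "Impact",
    "Mitigation",
    "Root cause analysis",
    "Lessons learned",
    "Follow-ups",
]


def missing_sections(postmortem: dict) -> list:
    """Return the checklist sections that are absent or empty."""
    return [section for section in CHECKLIST if not postmortem.get(section)]


# Hypothetical draft: three sections filled in, three still incomplete.
draft = {
    "Detection": "Alert fired 12 minutes after the first errors",
    "Impact": {
        "end users": "Checkout unavailable for some users",
        "productivity": "Two engineers pulled off planned work",
        "infrastructure": "None",
    },
    "Mitigation": ["Rolled back the latest release"],
    "Root cause analysis": "",
    "Follow-ups": [],
}

print(missing_sections(draft))
# ['Root cause analysis', 'Lessons learned', 'Follow-ups']
```

A check like this fits the lead's role described in the tips above: it makes it easy to see at a glance whether an incident response has actually finished with a complete postmortem.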

A good postmortem culture is only as strong as the team and tools available. Plumbr real-user monitoring can help you identify how many customers are affected by an issue, how long they were affected, and where the bug is. Armed with this information, your postmortem culture will grow faster and stronger.

Published at DZone with permission of Ivo Magi, DZone MVB. See the original article here.
