Runbooks vs. Playbooks: Explaining the Difference

The terms runbooks and playbooks are often used interchangeably, but there are important differences between the two, and each has its place.

By Austin Gunter · May. 01, 21 · Opinion

The terms runbooks and playbooks are often used interchangeably. They are similar: both offer a method of documenting how your organization executes its goals and processes, tactically and strategically. But there are important differences between the two, and each has its place. Once you understand those differences, you will be able to use each one effectively on its own and, more importantly, pair the two techniques to create a powerful weapon in your arsenal of operational excellence.

So let's look at the differences between a runbook and a playbook, what each is used for, and how they can be used together.

What Is a Runbook

Runbooks are best defined as a tactical method of completing a task: the series of steps needed to complete some process for a known end goal. Examples range from “Restarting the web services on frontend servers” to “Deploying the newest build of staging application”.

Runbooks are particularly useful when defining a specific action for an identified problem. They define the exact steps that make that action repeatable and usable as a programmatic approach to problem-solving. A well-written runbook not only lowers the difficulty of execution and ensures repeatability, but also points toward the end goal of automating the action, at which point the runbook itself is no longer necessary.
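As a rough illustration of this idea, a runbook can be modeled as an ordered list of named steps that run the same way every time. The sketch below is a minimal Python example under that assumption; the `Runbook` and `Step` names and the placeholder step actions are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One action in a runbook; `action` returns True on success."""
    name: str
    action: Callable[[], bool]


@dataclass
class Runbook:
    """An ordered, repeatable series of steps toward a known end goal."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[str]:
        """Execute steps in order, stopping at the first failure.

        Returns the names of the steps that completed successfully.
        """
        completed: List[str] = []
        for step in self.steps:
            if not step.action():
                break
            completed.append(step.name)
        return completed


# Hypothetical placeholder actions; a real runbook would call out to
# your own tooling here instead of returning True.
restart_web = Runbook(
    title="Restarting the web services on frontend servers",
    steps=[
        Step("drain connections", lambda: True),
        Step("restart service", lambda: True),
        Step("verify health check", lambda: True),
    ],
)

completed_steps = restart_web.run()
```

Once the steps are written down in this form, replacing each placeholder with a real command is one possible path from a documented runbook to an automated one.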

What Is a Playbook

A playbook, on the other hand, is a little broader. It is the culmination of those tactical processes, creating a larger plan focused on strategic action. A playbook is a checklist of formal steps and actions. This can be anything from “Upgrading fleet-wide OS images” to “Managing a production incident.” Playbooks contain actions that can be automated, but also decisions that need to be made by a human.

This playbook methodology of thinking about a holistic process allows for identifying where runbook-type processes are used and can be replaced by simpler tools or automation. Developers call this approach of reducing copy & paste actions “DRY” or Don’t Repeat Yourself. DRY can be adopted by ops teams as well by defining the goal of a process in such a way that it can be summarized by a consistent set of runbooks.

One way to think about the difference between runbooks and playbooks--a playbook is like a book with chapters, and some of those chapters are runbooks.
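To make the "book with chapters" picture a little more concrete, here is a small Python sketch in which a playbook is a list of chapters, and each chapter is either a reference to a runbook or a decision left to a human. All names here (`RunbookRef`, `HumanDecision`, `RUNBOOK_CATALOG`, and the step lists) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class RunbookRef:
    """Points at a runbook that is defined once in a shared catalog."""
    name: str


@dataclass(frozen=True)
class HumanDecision:
    """A chapter that needs a person's judgment."""
    question: str


Chapter = Union[RunbookRef, HumanDecision]

# Hypothetical shared catalog: each runbook is written once (DRY) and
# referenced by name from any playbook that needs it.
RUNBOOK_CATALOG: Dict[str, List[str]] = {
    "Restart Application Servers": ["stop service", "start service", "check health"],
    "Collect Server Logs": ["locate logs", "archive logs", "attach to ticket"],
}

incident_playbook: List[Chapter] = [
    HumanDecision("Is the impact limited to a single server?"),
    RunbookRef("Collect Server Logs"),
    RunbookRef("Restart Application Servers"),
    HumanDecision("Do customers need to be notified?"),
]


def automatable_chapters(playbook: List[Chapter]) -> List[str]:
    """Names of the chapters that are runbooks, i.e. automation candidates."""
    return [c.name for c in playbook if isinstance(c, RunbookRef)]


def missing_runbooks(playbook: List[Chapter]) -> List[str]:
    """Runbook references that have no definition in the catalog yet."""
    return [
        c.name
        for c in playbook
        if isinstance(c, RunbookRef) and c.name not in RUNBOOK_CATALOG
    ]
```

Keeping the runbooks in one catalog and referring to them by name is one way an ops team might apply DRY: a second playbook can reuse "Restart Application Servers" without redefining its steps.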

A Newly Discovered Process

Let’s take a deeper look at how a process can be broken down into these concepts, and some benefits that can come out of this exercise.

We get a report from our customer service teams that our web page is intermittently no longer responding and that the users are complaining.
To start, we must first identify a way to reproduce the problem to know where to start the investigation. Then, we need to determine the cause of the unavailability. Assuming there is a single server with an issue, we must decide how to mitigate this impact and what the appropriate actions are to ensure that leadership, partners, and customers are kept informed of the impact.

Along with deciding on mitigation, we need to determine if we have enough capacity to handle the load across the remaining servers; we may even need to wait for our subject matter experts to come up to speed on the problem and propose potential fixes for the issue.

We may decide to mitigate the issue by updating our load balancers to remove the reference to the server that is not serving healthy responses.
If the right infrastructure owners are available, they can take an action to remove the service.

We may decide to collect logs to debug the issue further after the issue is resolved.

The infrastructure owners are likely not as well versed in the applications and servers as the developer or operations team, so there may be a wait to get the right teams engaged.

Then the servers should be restarted and observed to ensure they are serving pages successfully.

Only then can we return our server to our load balancer.
Some follow-up must take place to debug the gathered logs and ensure our customers are updated with the details they need about our outage.

After this, we can take the time to understand what went wrong and what to do moving forward. There may be a period of time during which this manual process must be followed, until the root cause is resolved and processes are in place to speed up future investigations.

This series of events can easily be converted into a series of runbooks per task and an overall playbook of managing customer-impacting incidents. Furthermore, addressing the root cause of this particular problem doesn’t invalidate the playbook or runbooks because they can be recycled for future problems and processes.

Applying Our Books To Our Process

Now let’s look at that same experience with a defined playbook and its corresponding runbooks!

Customer service teams report that our web page is intermittently no longer responding and users are complaining again.

Our first responders can refer to our newly minted playbook:

    **Playbook:** Managing Website Outage
        **Playbook:** Mitigate Frontend Application Impact
            Execute **runbook:** Inspect Load Balancer Logs
            If Load Balancer logs report a single server:
                Execute **runbook:** Remove server from Load Balancer configuration
            Execute **runbook:** Collect Server Logs
            Execute **runbook:** Restart Application Servers
            Execute **runbook:** Test Application Server In Isolation
            If Application Server healthy:
                Execute **runbook:** Return Server to Load Balancer Configuration
    **Playbook:** Post-Mortem
        Execute **runbook:** Create Ticket with Server Logs
        Execute **runbook:** Create Chat with Infrastructure & Application Teams
        Execute **runbook:** Communicate with Customers

This curated set of instructions can now be applied to future outages of this nature. This will greatly reduce:

  • The need to wait for experts to be engaged before mitigating
  • The time-to-mitigation
  • The overall stress and the risk of human error
  • The overall toil of re-learning the mitigations

Should there be new methodologies, new impact or actions, or even updated processes for handling incidents post-mortem, this playbook can be modified accordingly to ensure there is always a plan, even for the most complex situations.

As the incident lifecycle process continues and the root cause is identified and resolved, these same steps apply for any issue that involves future impacts of this nature, and these same runbooks can be linked in other playbooks to reduce the number of times “Inspect Load Balancer Logs” needs to be redefined.

The best part of embracing this methodology is that these same tools can be used as a springboard for maturing your operations processes. You can use these runbooks to build automations that execute these defined processes with the ultimate goal of automating entire playbooks.
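As one possible first step toward that goal, the "Mitigate Frontend Application Impact" playbook from earlier could be expressed as a function that calls each runbook in order and applies the two conditions. This is only a sketch: the `run`, `find_single_bad_server`, and `server_is_healthy` hooks are hypothetical stand-ins for whatever tooling you already have, and the hostname in the usage example is made up.

```python
from typing import Callable, List, Optional

# A runner executes one named runbook; what it actually does is up to
# your own tooling (hypothetical hook).
Runner = Callable[[str], None]


def mitigate_frontend_impact(
    run: Runner,
    find_single_bad_server: Callable[[], Optional[str]],
    server_is_healthy: Callable[[str], bool],
) -> List[str]:
    """Walk the mitigation playbook and return the runbooks executed, in order."""
    executed: List[str] = []

    def execute(runbook: str) -> None:
        run(runbook)
        executed.append(runbook)

    execute("Inspect Load Balancer Logs")
    # Returns a server name if the logs point at a single server, else None.
    bad_server = find_single_bad_server()
    if bad_server is not None:
        execute("Remove server from Load Balancer configuration")
    execute("Collect Server Logs")
    execute("Restart Application Servers")
    execute("Test Application Server In Isolation")
    # Assumption: a server is only returned if it was removed earlier.
    if bad_server is not None and server_is_healthy(bad_server):
        execute("Return Server to Load Balancer Configuration")
    return executed


# Usage with stand-in hooks that perform no real work.
steps = mitigate_frontend_impact(
    run=lambda name: None,
    find_single_bad_server=lambda: "web-03",  # hypothetical hostname
    server_is_healthy=lambda server: True,
)
```

Because the runner is passed in, the same function can start out by printing instructions for a human and later be pointed at real automation, one runbook at a time.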

Conclusion

Runbooks and playbooks are tools best used in tandem. Together, they also move your organization toward the greater goal of removing human toil altogether by embracing practices that have been built into development workflows for years.


Published at DZone with permission of Austin Gunter. See the original article here.

Opinions expressed by DZone contributors are their own.
