
On-Call Handoffs: Empowering Adaptability in Incident Response

When dealing with incident response in a DevOps team, it's critical that you’re able to effectively and quickly dial into the current reality of your environment.

Managing on-call teams has always been a challenge in complex environments. With the continued adoption of Continuous Delivery, the challenges are squared. Now, not only do you have to manage a complex environment, but that environment is also changing dozens of times per day.

On-call today has to be less about a strict execution of predefined procedures and more about adaptability.

Smart people acting with good situational context tend to make the best decisions, and those same smart people must be empowered with the necessary skills and tools. Let’s accept for now that this is already true for your team. The challenge, then, is context, especially for an on-call team that hasn’t been on rotation in four weeks!

It becomes critical that you’re able to effectively and quickly dial them into the current reality of your environment. I like to think of this in three parts:

  • What unexpected things happened? Those are incidents.
  • What expected things happened that have changed the environment? Those are deploys.
  • What expected things will be happening during this rotation? Those are the plans (new deployments, sales promotions, audits, penetration tests, etc).

Incident Review

Handoff sessions have long been a mainstay of team rotations. They run the spectrum from a few Slack messages to a formal postmortem of every incident in the preceding period, and in any form a handoff is key. Keeping those meetings focused and efficient is where the VictorOps Timeline feature really adds value. The timeline is a quick dashboard view of all incidents occurring in your environment for a given time period. It enables a team to quickly review incidents and easily create a postmortem report covering a single incident or many incidents over a period of time.

Reviewing Postmortems

When you sit down for that handoff session, reviewing postmortems is a required practice. A live discussion with both shifts (the one leaving rotation and the one coming online) can be meaningful in many ways. Certainly, the color of a situation is communicated far better verbally than through a postmortem report, but the discussion should also be a collegial critique of the postmortem itself. Was the incident covered in sufficient detail? Too much detail? Were the appropriate run books updated? Were post-action tasks completed?

Deploy Review

Incident review gives you a basis for situational context, but the reality is that it provides only a portion of the picture. It covers only the changes introduced by incidents, and you’re left to find some other process to bring a team up to speed on the code, system, or architectural changes that may have occurred in the intervening time. This can be accomplished in a variety of ways, to be sure. Changelogs, ticketing systems, deployment pipelines, and more are all effective at detailing intentional changes introduced to an environment. However, I prefer to keep team tool-switching limited and provide as much of the necessary information as possible in the same system.
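To make the "same system" idea concrete, here is a minimal sketch of folding deploys and incidents into one chronological list that a team could walk through during a handoff. Everything in it is an assumption for illustration: the `TimelineEvent` type, the `merge_timeline` helper, and the sample events are made up, and it does not represent any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in a combined handoff timeline (hypothetical type)."""
    when: datetime
    kind: str      # "incident" or "deploy" (illustrative labels)
    summary: str


def merge_timeline(incidents, deploys):
    """Return incidents and deploys as one list, oldest first."""
    return sorted([*incidents, *deploys], key=lambda event: event.when)


# Made-up sample data, purely for illustration.
incidents = [
    TimelineEvent(datetime(2024, 1, 3, 14, 5), "incident", "Checkout latency alert"),
]
deploys = [
    TimelineEvent(datetime(2024, 1, 3, 13, 50), "deploy", "checkout-service release"),
    TimelineEvent(datetime(2024, 1, 4, 9, 30), "deploy", "search-service release"),
]

timeline = merge_timeline(incidents, deploys)
for event in timeline:
    print(f"{event.when:%Y-%m-%d %H:%M}  {event.kind:<8}  {event.summary}")
```

Reading the merged list top to bottom puts each incident next to the deploys that preceded it, which is the context an incoming shift would otherwise have to piece together from several tools.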

Plan Review

I’ve left the most difficult for last. Predicting the future with any certainty is an intractable problem. That said, approaches exist that can be effective at empowering your teams with excellent situational context:

  • Invite members of Product teams to the handoff meetings to discuss big or risky projects going live in-period.
  • Invite members of Marketing or Sales to similarly discuss planned promotional activities, sends, or events.

The power of effective handoffs really comes to bear once you’re able to record those planned events in a system. 

Incident #527 – Sales Kick-Off begins 09:00 AM

Incident #603 – Holiday Sale is Live 06:00 AM

While ancillary to the specific act of responding to incidents, this kind of information keeps your teams dialed into the reality of your environment. Heightened awareness empowers those teams to make good decisions, and adapt!
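As a rough sketch of what recording planned events buys you, the snippet below filters a list of planned events down to those that start during an upcoming rotation. The `PlannedEvent` type and `events_in_rotation` helper are hypothetical, and the dates are invented, since the examples above give only times of day.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PlannedEvent:
    """A planned, non-incident event recorded for on-call awareness (hypothetical type)."""
    starts_at: datetime
    title: str


def events_in_rotation(events, rotation_start, rotation_end):
    """Return events starting within [rotation_start, rotation_end), earliest first."""
    in_window = [e for e in events if rotation_start <= e.starts_at < rotation_end]
    return sorted(in_window, key=lambda e: e.starts_at)


# Dates are invented for illustration; only the titles and times echo the examples above.
planned = [
    PlannedEvent(datetime(2024, 12, 6, 6, 0), "Holiday Sale is Live"),
    PlannedEvent(datetime(2024, 12, 12, 10, 0), "Penetration test"),
    PlannedEvent(datetime(2024, 12, 3, 9, 0), "Sales Kick-Off begins"),
]

upcoming = events_in_rotation(
    planned,
    rotation_start=datetime(2024, 12, 2),
    rotation_end=datetime(2024, 12, 9),
)
for event in upcoming:
    print(f"{event.starts_at:%Y-%m-%d %H:%M}  {event.title}")
```

The half-open window means an event that starts exactly at the rotation boundary belongs to the incoming shift, not the outgoing one. That is a design choice, and your team may prefer to show boundary events to both shifts.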

Iterations On Handoffs

As with anything in a DevOps or Agile environment, iteration is key. Whether you implement the ideas I’ve laid out here or they spur ideas of your own, implement, test, and iterate! A monthly retro on how handoffs are working, with a willingness to implement change, is how to make handoffs most effective for your team.

Topics:
devops, on-call, incident response, devops implementation

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
