I spent a bit of time on Reddit the other day and was struck by how many posts focused on IT on-call and on-call scheduling. Some posts were rants about horrible customers (who hasn’t had some of those?). A few actually described positive interactions from being on-call, though those posts were rare. However, many engineers in DevOps and IT posted about their trepidation over being on-call. They wondered:
- What is the best way for my team to create an IT on-call schedule?
- How do I ensure that I wake up if I am alerted?
- Should my growing on-call team use an on-call cell phone and hand it off between rotations?
- How do I manage being on-call and then having to show up at 8 a.m. the next morning?
- Is it reasonable to expect on-call duty 24/7?
The answers to these questions, though, don’t need to cause trepidation. While on-call can be anxiety-producing, having the right tools and management goes a long way towards creating reasonable expectations and outcomes.
Why IT On-Call Is Necessary for All
If I were to ask you about why on-call is necessary, you might think me a bit of a dunce (go ahead; I’ve been called worse). Isn’t it obvious that on-call is needed to answer customer questions about the product? Duh!
The truth is that answering customer product questions is not the only reason IT on-call exists. In the realm of product development, on-call is a necessity. You cannot develop a product effectively if its development is disconnected from testing its resilience. And you cannot know the product’s resilience unless you put it in front of your customers, allow them to test it, and let them call you when it breaks.
Additionally, on-call rotations allow Dev, Ops, and all of your IT team to see how well the product or set-up they have created is working. Many I have spoken to in the DevOps world call this eating your own dog food. Yuck. This statement is meant to illustrate that no one in the IT family can simply create their perceived technical masterpiece and walk away. Instead, they need to take responsibility for their creation. Being part of the on-call family helps ensure this level of responsibility.
Traditional Problems With IT Alerting
Beyond the burden of being on-call itself, alerting has problems of its own. Issues often come in after hours, and they often lack context. These problems come in many flavors. For example:
- A call comes in but the engineer cannot escalate the issue if they need to.
- There’s a hand-off of a customer problem from regular hours to after-hours on-call and the issue gets muddled because there’s no audit trail on the alert.
- For overnight on-call, alerts are not sufficiently persistent to get engineers out of bed.
- Poor management of IT on-call and alerting causes engineer burnout.
A much better idea is to create an actual IT on-call schedule with a dedicated tool designed to handle effective alerting, auditing, and messaging. A tool like OnPage can address these on-call issues as well as many of the worries engineers have about being on-call.
Improving Life On-Call
Effective management of after-hours on-call needs to be premeditated. That is, the process needs to be thought through and cannot be ad hoc. While most DevOps teams and IT teams have a schedule, they haven’t thought through the whole process. Instead, teams should create on-call schedules that do the following.
1. Enable Escalation
You cannot expect one person to be on-call 24/7 without having an escalation procedure. Everyone needs a back-up in case they cannot attend to a call. People have lives and stuff happens. So, make sure there’s an escalation procedure.
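To make the idea concrete, here is a minimal Python sketch of an escalation chain. It is only an illustration, not how OnPage or any other tool implements escalation: the `Responder` class, the `escalate` function, and the timeout values are all hypothetical, and the `acknowledges` callback stands in for whatever your paging tool reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Responder:
    name: str
    ack_timeout_min: int  # how long to wait for this person before escalating


def escalate(
    chain: List[Responder], acknowledges: Callable[[str], bool]
) -> Tuple[Optional[str], int]:
    """Page each responder in order until someone acknowledges.

    Returns (name, paged_at_min), where paged_at_min is how many minutes
    after the alert that responder was paged. If nobody acknowledges,
    returns (None, total minutes waited).
    """
    elapsed = 0
    for responder in chain:
        if acknowledges(responder.name):
            return responder.name, elapsed
        # No answer within the timeout: move on to the next person.
        elapsed += responder.ack_timeout_min
    return None, elapsed


# Hypothetical chain: primary, then a back-up, then the team lead.
chain = [
    Responder("primary", 10),
    Responder("backup", 15),
    Responder("team-lead", 30),
]
who, paged_at = escalate(chain, lambda name: name == "backup")
print(who, paged_at)
```

The point of the sketch is simply that the back-up is defined ahead of time, so a missed page turns into a delay rather than a dropped issue.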
2. Provide Time Off After Being On-Call Overnight
When a team member has been actively on-call overnight, it is only fair to give that person a reasonable amount of time off before they are expected back at work.
3. Make Schedules
Make sure all of your team members take a turn being on-call. Create a schedule that rotates through the team members equitably.
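As a rough illustration of an equitable rotation, the Python sketch below assigns shifts round-robin. The function name `build_rotation`, the one-week shift length, and the team names are assumptions made for the example; a real scheduling tool would also handle vacations, swaps, and time zones.

```python
from datetime import date, timedelta
from itertools import cycle
from typing import List, Tuple


def build_rotation(
    team: List[str], start: date, shifts: int, shift_days: int = 7
) -> List[Tuple[date, date, str]]:
    """Assign consecutive shifts to team members in round-robin order.

    Returns a list of (first_day, last_day, member) tuples.
    """
    members = cycle(team)  # loops back to the first member after the last
    schedule = []
    for i in range(shifts):
        first_day = start + timedelta(days=i * shift_days)
        last_day = first_day + timedelta(days=shift_days - 1)
        schedule.append((first_day, last_day, next(members)))
    return schedule


# Hypothetical three-person team covering six one-week shifts.
for first_day, last_day, member in build_rotation(
    ["Ana", "Ben", "Chen"], date(2024, 1, 1), shifts=6
):
    print(first_day, "to", last_day, "->", member)
```

With three people and six shifts, each person is on-call exactly twice, which is the kind of evenness the schedule should aim for.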
4. Run Books and Defined Procedures
When your on-call engineer is alerted in the middle of the night, help them out by having run books available that provide solutions to problems that have cropped up in the past. This is really helpful when the engineer has been woken up at 2 a.m. and their thinking is somewhat clouded.
5. Include Prominent and Persistent Alerts
OnPage provides persistent alerting that will continue for up to 8 hours until answered. The alerts are also designed to be hard to sleep through, so they really do wake you up.
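For readers who want to see what "persistent" means in practice, here is a generic Python sketch of a repeat-until-acknowledged loop. It is not OnPage's implementation or API. The function name, the 5-minute repeat interval, and the two callbacks are hypothetical, and time is simulated with a minute counter instead of real sleeping so the example runs instantly. Only the 8-hour window comes from the description above.

```python
from typing import Callable, Tuple


def alert_until_acknowledged(
    send: Callable[[int], None],
    is_acknowledged: Callable[[int], bool],
    max_minutes: int = 8 * 60,
    interval_minutes: int = 5,
) -> Tuple[bool, int]:
    """Re-send an alert at a fixed interval until it is acknowledged.

    Stops once the window (max_minutes) has passed.
    Returns (acknowledged, number_of_alerts_sent).
    """
    minute = 0
    attempts = 0
    while minute < max_minutes:
        send(minute)  # e.g. push notification, SMS, phone call
        attempts += 1
        if is_acknowledged(minute):
            return True, attempts
        minute += interval_minutes
    return False, attempts


# Simulated engineer who wakes up and acknowledges 15 minutes in.
sent_at = []
acked, attempts = alert_until_acknowledged(sent_at.append, lambda m: m >= 15)
print(acked, attempts, sent_at)
```

In a real system, an unacknowledged alert would normally be escalated to a back-up well before the full window runs out.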
6. Ensure Audit Trails to Help With Hand-Offs
Provide an audit trail for alerts so it is clear who on the team is working on an existing issue. Audit trails also give you the timestamps behind metrics such as mean time to resolution (MTTR), which helps your team track how it is doing.
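The Python sketch below shows, under simple assumptions, how an audit trail answers both questions: who currently owns an alert, and what the mean time to resolution is. The event names (`opened`, `acknowledged`, `handed_off`, `resolved`), the dictionary layout, and both function names are made up for illustration, and the trail is assumed to be in chronological order.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Optional


def current_owner(trail: List[Dict], alert_id: str) -> Optional[str]:
    """Return whoever most recently acknowledged or was handed the alert."""
    owner = None
    for entry in trail:
        if entry["alert_id"] == alert_id and entry["event"] in (
            "acknowledged",
            "handed_off",
        ):
            owner = entry["who"]
    return owner


def mttr_minutes(trail: List[Dict]) -> Optional[float]:
    """Mean minutes from 'opened' to 'resolved' across resolved alerts."""
    opened_at = {}
    durations = []
    for entry in trail:
        if entry["event"] == "opened":
            opened_at[entry["alert_id"]] = entry["at"]
        elif entry["event"] == "resolved" and entry["alert_id"] in opened_at:
            delta = entry["at"] - opened_at.pop(entry["alert_id"])
            durations.append(delta.total_seconds() / 60)
    return mean(durations) if durations else None


# Hypothetical trail: alert A is handed from the night shift to the day shift.
trail = [
    {"alert_id": "A", "event": "opened", "at": datetime(2024, 1, 1, 1, 0)},
    {"alert_id": "A", "event": "acknowledged", "who": "ana",
     "at": datetime(2024, 1, 1, 1, 5)},
    {"alert_id": "A", "event": "handed_off", "who": "ben",
     "at": datetime(2024, 1, 1, 8, 0)},
]
print(current_owner(trail, "A"))
```

Because every hand-off is recorded, the day-shift engineer can see at a glance that the alert is now theirs and what happened overnight.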
7. Make It Based on a Communal App
Ensure your team has an alerting app on their smartphones so there is no need to physically hand off pagers or a shared on-call phone. With a smartphone application like OnPage, scheduling is much easier, and so is making sure the right person responds every time.
While IT on-call might cause trepidation initially, the time spent planning will definitely pay dividends. Again, use a scheduling tool that will allow your team to work together effectively and more like, well, a team.