AWS: Operations Health and Best Practices
Tired of being woken up in the middle of the night because of an alert? Take a look at how AWS deals with the issue.
The ITOps world is a harsh working environment where ITOps personnel are expected to minimize the business impact of incidents at all hours of the day—regardless of the impact to themselves or their families. As more companies undergo digital transformation, the number of alerts and interruptions flowing to IT first responders will continue to increase.
This constant and growing pressure to keep business systems running around the clock is leading to higher-than-ever responder burnout, which in turn increases employee attrition and hurts the customer experience. In September 2018, we inspected 85,000 services to determine which monitoring systems were generating interrupting notifications (defined as SMS, voice, and push notifications) on each service.
We then analyzed how each service affected an organization’s operations health score, a metric that lets IT and DevOps decision-makers quantify the impact of digital interruptions on their teams. The operations health score gives those decision-makers actionable telemetry about their most valuable and critical asset: their people. It is calculated by combining machine learning with PagerDuty’s domain expertise and peer benchmark data to quantify the impact of digital operations on an organization’s people, teams, and services.
AWS Integrated Services
The results? We found that services integrated with AWS had a consistently higher health score on every day of the first 7 months of 2018. On average, the daily health score of AWS integrated services was more than 3 points higher, as shown below.
We also found that AWS integrated services had:
45% fewer daily notifications on average
52% fewer notifications during sleeping hours on average
60% fewer interrupt notifications during weekends
Lower proportions of daily notifications during off-work and sleeping hours
Fewer days in any given period (e.g., a week or month) with off-work and sleeping hours notifications
So what is AWS doing to generate less noise and, therefore, less alert fatigue?
Short answer: we can’t definitively answer that. We can only speculate about why AWS users experience better-than-average health compared to users of other DevOps tools. For example, AWS may offer general resiliency across both its service offerings and its instances, or EC2 instance auto recovery and the highly available nature of most AWS services may enable greater operations efficiency and generate fewer alerts. What we do know, however, is that our data, collected from over 10,500 companies in the past decade, has yielded proven best practices that you can implement to attain measurable improvement in all three facets of operations health: people, efficiency, and maturity.
Best Practices for Operations Health
Run a Transient Notifications Analysis
One of the easiest ways to improve operations health is to run an analysis of transient notifications, which are alerts that auto-close or auto-resolve shortly after they’re generated.
Let’s say you’re an on-call responder who’s been awakened in the middle of the night by an SMS interrupt notification. You groggily acknowledge the event on your phone, then get out of bed and head to your laptop to begin remediation efforts. But the management system has already closed the incident, so it now shows as resolved and there is nothing left for you to do. Now you’re grouchy. Being woken up by an on-call alert is part of the job, but being awakened for something that has already resolved itself is incredibly frustrating, especially if it occurs multiple times a night.
To help prevent such scenarios, you should run transient notification analyses to determine how many alerts on each service resolve themselves in under two minutes. Then, depending on the percentage of transients, a notification buffer of two minutes can be added to absorb those transients while the upstream issue causing them is being addressed. Any incident that remains open past the two-minute buffer is sent to whoever is on call. Absorbing transients in this manner increases the health of your teams, as well as the overall effectiveness of your operations, by eliminating a significant source of false positives.
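As a rough illustration of what such an analysis could look like, here is a minimal Python sketch. It assumes you can export incidents as records with a service name, a creation time, a resolution time, and a flag saying whether the incident closed automatically. The field names (`service`, `created_at`, `resolved_at`, `auto_resolved`), the helper names, and the 25% threshold are all hypothetical choices made for this example; they do not come from any particular tool's API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# "Under two minutes", as described in the text above.
TRANSIENT_WINDOW = timedelta(minutes=2)


def transient_rates(incidents, window=TRANSIENT_WINDOW):
    """Return {service: fraction of incidents that auto-resolved within `window`}."""
    totals = defaultdict(int)
    transients = defaultdict(int)
    for inc in incidents:
        service = inc["service"]
        totals[service] += 1
        resolved_at = inc["resolved_at"]
        if (
            inc["auto_resolved"]
            and resolved_at is not None
            and resolved_at - inc["created_at"] < window
        ):
            transients[service] += 1
    return {service: transients[service] / totals[service] for service in totals}


def services_needing_buffer(rates, threshold=0.25):
    """List services whose transient rate meets a (hypothetical) threshold."""
    return sorted(service for service, rate in rates.items() if rate >= threshold)


# Made-up sample data, for illustration only.
t0 = datetime(2018, 9, 1, 3, 0)
sample_incidents = [
    {"service": "checkout", "created_at": t0,
     "resolved_at": t0 + timedelta(seconds=45), "auto_resolved": True},
    {"service": "checkout", "created_at": t0,
     "resolved_at": t0 + timedelta(seconds=90), "auto_resolved": True},
    {"service": "checkout", "created_at": t0,
     "resolved_at": t0 + timedelta(minutes=30), "auto_resolved": False},
    {"service": "checkout", "created_at": t0,
     "resolved_at": None, "auto_resolved": False},
    {"service": "search", "created_at": t0,
     "resolved_at": t0 + timedelta(minutes=10), "auto_resolved": True},
]

rates = transient_rates(sample_incidents)
print(rates)
print(services_needing_buffer(rates))
```

Services that show up in the second list would be candidates for a two-minute notification buffer. What percentage justifies a buffer is a judgment call for each team.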
Humans are good at many things, but attempting to determine the scope of an incident by looking at a table of alerts gathered from a myriad of sources is not one of them.
With alert grouping, two great things happen together:
1) Alerts are automatically associated and grouped into incidents that provide much better situational awareness when compared to doing so manually, and
2) The on-call responder will receive 1 interrupt notification for an incident that includes 50 alerts as opposed to receiving 51 separate notifications for 50 alerts and 1 incident.
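To make the arithmetic concrete, here is a small Python sketch of one very simple grouping strategy: alerts for the same service join an existing incident if they arrive within a few minutes of that incident's most recent alert. This is only an assumption about how grouping might work. Real alert grouping may rely on alert content or machine learning, and the field names (`service`, `at`), the function name, and the five-minute window are invented for this example.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts into incidents, per service, by arrival time.

    An alert joins the latest incident for its service when it arrives
    within `window` of that incident's most recent alert. Otherwise it
    opens a new incident.
    """
    incidents = []
    latest = {}  # service -> index of that service's most recent incident
    for alert in sorted(alerts, key=lambda a: a["at"]):
        idx = latest.get(alert["service"])
        if idx is not None and alert["at"] - incidents[idx]["last_at"] <= window:
            incidents[idx]["alerts"].append(alert)
            incidents[idx]["last_at"] = alert["at"]
        else:
            incidents.append(
                {"service": alert["service"], "alerts": [alert], "last_at": alert["at"]}
            )
            latest[alert["service"]] = len(incidents) - 1
    return incidents


# Made-up data: 50 alerts on one service, five seconds apart.
t0 = datetime(2018, 9, 1, 3, 0)
alerts = [
    {"service": "checkout", "at": t0 + timedelta(seconds=5 * i)} for i in range(50)
]

incidents = group_alerts(alerts)
notifications = len(incidents)  # one interrupt notification per incident
print(notifications, len(incidents[0]["alerts"]))
```

With this sample data, all 50 alerts land in a single incident, so the responder is interrupted once rather than once per alert.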
Having a consistent taxonomy for your Teams, Schedules, Escalation Policies, and Services is another important best practice. Why? Because properly named services can shave crucial minutes off of incident response times by giving the responder context around what’s broken—making it easier to escalate incidents, bring in more subject matter experts, and, most importantly, decrease the business impact of incidents.
How Are You Thinking About Operations Health?
Keep in mind that one of the most important aspects of achieving better operations health is to work on continuous and measurable improvement. There are numerous other best practices you can use to help your IT and DevOps teams improve their operations.