The Art of Structuring Alerts: Part 2
Learn more about structuring and testing a machine intelligence-based alert management system designed to produce more relevant alerts and increase uptime.
This is part two of our three-part series that looks into creating a robust and effective alerting system for DevOps teams. The first in the series looked at the burden of false positives and how to avoid them. This article takes this to the next level and looks at how to structure alerts and test out the system design.
As a baseline to work from, pages and alarms should only sound for events that are urgent, important, and actionable. By verifying the importance of each alert, you can eliminate alert fatigue and ensure that every page is quickly investigated, acted upon, or fine-tuned. The result is increased uptime and, ultimately, a happier on-call team.
It’s All About Timing and Optimization: 5 Tips to Alert Success
An effective monitoring, metrics, and alert system is one of the fundamental tools of an efficient DevOps operation. When you are working with small, iterative, and often fast-to-production releases, alerts about problems become a key requirement for maintaining the production environment. Alert systems are the heartbeat of the entire operation; without them, downtime will persist. Designing an alert system to be optimal, with minimal false positives, is the key to that effectiveness.
Here are my top five tips for "Effective Alerts by Design" (EAbD):
The Mindfulness of Alerts
When an alert is pushed out to an email system or third party platform, it can end up being missed. Instead of just passing the alert off, keeping it within workflow control will make sure it stays visible and has a valid lifecycle.
Defcon 5: Keeping it Subcritical
Managing alerts will give you the control you need to focus on the important ones and not go off chasing unicorns. Write sub-critical rules for your system. These will be specific to your production environment; an example may be that your database is close to capacity. The rules can also be prioritized and alerts sent based on that priority. It means you don’t have to react to sub-critical events at 2 am.
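As a rough illustration of priority-based routing, here is a minimal Python sketch. The priority labels, the working-hours window, the capacity thresholds, and the function names are all assumptions made for the example; they are not taken from any particular alerting product.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Alert:
    name: str
    priority: str  # hypothetical labels: "critical" or "subcritical"


def db_capacity_alert(used_fraction: float,
                      warn_at: float = 0.80,
                      critical_at: float = 0.95):
    """Turn a database capacity reading into an alert, or None if healthy.

    The 80% and 95% thresholds are illustrative assumptions.
    """
    if used_fraction >= critical_at:
        return Alert("db-capacity", "critical")
    if used_fraction >= warn_at:
        return Alert("db-capacity", "subcritical")
    return None


def route(alert: Alert, now: time) -> str:
    """Decide how an alert is delivered, based on priority and time of day."""
    if alert.priority == "critical":
        return "page"  # wake someone up
    # Sub-critical: notify during working hours, otherwise hold until morning.
    if time(9, 0) <= now < time(18, 0):
        return "notify"
    return "queue"
```

With these assumptions, a database at 85% capacity at 2 am is queued for the morning rather than paged, while the same database at 97% pages immediately.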
Fixing the Symptom, Not the Cause
Alerts can stay consistent, even when the underlying architecture changes, if you page on symptoms rather than causes. It is easier to capture problems using user-facing symptoms or other dependable services.
Keep it Simple, Simon (KISS)
Create scope-aware alerts. This allows you to combine variables so that instead of two or more alerts for one object, you get a single alert. For example, if alerts on disk usage are split into forecast and usage levels, combine the two into a single alert.
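The disk example could be sketched in Python roughly as follows. The function name, the 90% usage limit, and the seven-day forecast limit are assumptions chosen for illustration; the point is only that both conditions feed one alert for the same mount point.

```python
from typing import Optional


def disk_alert(mount: str,
               used_pct: float,
               days_until_full: Optional[float],
               used_limit: float = 90.0,
               forecast_limit_days: float = 7.0) -> Optional[str]:
    """Return a single combined alert message for a disk, or None if healthy.

    Usage level and forecast are checked together, so one disk produces at
    most one alert that lists every reason it fired.
    """
    reasons = []
    if used_pct >= used_limit:
        reasons.append(f"usage {used_pct:.0f}% >= {used_limit:.0f}%")
    if days_until_full is not None and days_until_full <= forecast_limit_days:
        reasons.append(f"forecast full in {days_until_full:.1f} days")
    if not reasons:
        return None
    return f"disk {mount}: " + "; ".join(reasons)
```

A disk that is both nearly full and forecast to fill within days then yields one message carrying both reasons, instead of two separate pages.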
Putting False Alarms to Good Use
When you do get false alarms, put them to work by using them as a basis for tightening up the alert condition or removing it from the paging list. When designing your effective alert system, the use of an expressive language, rather than simple object/value UI widgets, is key; it provides more flexibility and reduces errors.
Extending the Structure of Alerts
The art (and science) of alerts extends to their structure too. Human-curated alert rules should be the baseline upon which your structure depends. Designing the structure of an alert comes down to some basic prerequisites, including:
1. Natural grouping of environmental components.
2. The correct aggregation.
3. The correct information attached to the alert to give the most detail.
4. Combining metrics to simplify alerts, whilst maintaining maximum detail.
5. Using Boolean conditions such as negative events (look for things that are NOT happening which might lead to a problem).
6. Avoiding use of fixed alerts – give yourself flexibility and build in historical context to provide predictive analysis.
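Points 5 and 6 in the list above lend themselves to a short Python sketch. One function checks for a negative event (something that should have happened but did not), and the other replaces a fixed threshold with one derived from recent history. The function names, the three-standard-deviation band, and the sample numbers are assumptions for illustration; real systems may use more sophisticated baselines.

```python
from statistics import mean, stdev


def is_missing(last_seen_ts: float, now_ts: float, max_silence_s: float) -> bool:
    """Negative event: fire when an expected signal (heartbeat, backup
    completion, and so on) has NOT been seen for too long."""
    return now_ts - last_seen_ts > max_silence_s


def is_anomalous(history: list, current: float, k: float = 3.0) -> bool:
    """Historical context instead of a fixed threshold: fire when the
    current value is more than k standard deviations from the recent mean."""
    if len(history) < 2:
        return False  # not enough history to build a baseline
    mu = mean(history)
    sigma = stdev(history)
    return abs(current - mu) > k * sigma


# Example with made-up request-rate samples:
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 103))  # within the normal band
print(is_anomalous(recent, 150))  # far outside the normal band
```

Because the band is computed from the data, the same rule adapts as normal traffic levels drift, which a hard-coded limit cannot do.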
Fine Tuning Alerts
Like all good ideas, your efficient alert system needs to be tested. Simulation is a good place to start: create simulation rules based on previous events. The goal is to reduce the noise, because less noise means you are more likely to produce relevant alerts. Simulation and noise reduction are not one-off exercises. You need to keep carrying them out, fine-tuning your alerts until they are as meaningful as possible. And, of course, reviews should be periodic as environments change. I also suggest making it a weekly habit (before the weekend starts) to review what the false positive ratio was during the previous week. Spending an hour on tuning before the weekend can save you and your team a great headache during the on-call weekend shift.
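A simulation of this kind can be as simple as replaying a candidate rule over labelled past events. The Python sketch below assumes each past event is recorded as a metric value plus a flag saying whether it turned out to be a real incident; that data shape, the function name, and the sample values are all hypothetical.

```python
def simulate(rule, events):
    """Replay a rule over past events and summarize how noisy it would be.

    rule:   callable taking a metric value and returning True if it fires.
    events: iterable of (metric_value, was_real_incident) pairs.
    """
    events = list(events)
    fired = [(value, real) for value, real in events if rule(value)]
    false_positives = sum(1 for _, real in fired if not real)
    missed = sum(1 for value, real in events if real and not rule(value))
    ratio = false_positives / len(fired) if fired else 0.0
    return {
        "fired": len(fired),
        "false_positive_ratio": ratio,
        "missed_incidents": missed,
    }


# Made-up CPU readings from last week, labelled after the fact.
last_week = [(95, True), (85, False), (82, False), (97, True), (60, False)]

loose = simulate(lambda cpu: cpu > 80, last_week)
tight = simulate(lambda cpu: cpu > 90, last_week)
print(loose)
print(tight)
```

Comparing the two summaries shows whether tightening the condition removes noise without missing real incidents, which is the question the weekly review is trying to answer.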
Similarly, paging events should be reviewed, including those ignored by administrators – this data can help you to refine rules to prevent false positives. Here are some tips for fine tuning your alerts so you can make sure they’re spot on, and as effective as possible.
A Rule of Rules
Alerts that are less than 50% accurate are broken; rules with a 10% false positive threshold are okay to go.
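One way to apply this rule of thumb during a review is sketched below in Python. It reads "10% false positive threshold" as "at most 10% of firings were false positives", which is an interpretation, and the middle "needs tuning" band for rules that fall between the two limits is an assumption added for the example.

```python
def rule_health(true_alerts: int, false_alerts: int) -> str:
    """Classify a rule from its reviewed firing history."""
    total = true_alerts + false_alerts
    if total == 0:
        return "no data"
    accuracy = true_alerts / total
    if accuracy < 0.5:
        return "broken"  # wrong more often than right
    if false_alerts / total <= 0.10:
        return "ok"  # at or under the 10% false positive limit
    return "needs tuning"  # assumed middle band, not from the article


print(rule_health(4, 6))   # broken
print(rule_health(19, 1))  # ok
print(rule_health(7, 3))   # needs tuning
```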
A Page Too Far
Get rid of extraneous paging events. If a page fires and investigation shows nothing wrong, adjust the rule.
The Rise of the Machine
Machine learning is perfectly placed to optimize alerts. Use human-curated rules enhanced with machine learning algorithms to create rules and fine-tune alerts.
Take regular events, such as backups, into account when fine-tuning rules. If you have known maintenance going on, suppress alerts associated with that.
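Suppressing alerts during known maintenance might look something like the following Python sketch. Representing a maintenance window as a start time, an end time, and a set of tags is an assumption made for the example, as are the tag names and dates.

```python
from datetime import datetime


def is_suppressed(alert_time: datetime, alert_tags: set, windows: list) -> bool:
    """Return True if the alert falls inside a maintenance window that
    shares at least one tag with it.

    windows: list of (start, end, tags) tuples, with end exclusive.
    """
    for start, end, window_tags in windows:
        if start <= alert_time < end and alert_tags & window_tags:
            return True
    return False


# Hypothetical nightly backup window, 01:00 to 03:00.
windows = [
    (datetime(2024, 1, 6, 1, 0), datetime(2024, 1, 6, 3, 0), {"backup"}),
]

# A backup-related disk I/O alert at 02:00 is suppressed...
print(is_suppressed(datetime(2024, 1, 6, 2, 0), {"backup", "db"}, windows))
# ...but an unrelated web alert at the same time still goes through.
print(is_suppressed(datetime(2024, 1, 6, 2, 0), {"web"}, windows))
```

Matching on tags as well as time keeps the suppression narrow, so a genuine problem elsewhere is not hidden just because a backup happens to be running.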
Set metrics for the on-call team and limit them to a set amount, through review, by differentiating between generated events and the events triggered by them.
Why It Pays to Structure Alerts Properly
The art of creating effective alert systems is down to using an intelligent approach. Keeping things simple, combining variables, and dampening down noise, coupled with prudent and mindful testing, will naturally result in improved alerts. Adding machine learning based on human curation into that mix will allow you to develop an optimized alert system that works for you, rather than against you.
In our final post in the series, we will look at improving alert response times. Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow response times result in downtime, which ends up in lost time, money, and often customers, too. Stay tuned.
Published at DZone with permission of Guy Fighel, DZone MVB. See the original article here.