
The Art of Structuring Alerts: Part 2

Learn more about structuring and testing a machine intelligence-based alert management system designed to produce more relevant alerts and increase uptime.

by Guy Fighel · May. 03, 17 · Opinion

This is part two of our three-part series that looks into creating a robust and effective alerting system for DevOps teams. The first in the series looked at the burden of false positives and how to avoid them. This article takes this to the next level and looks at how to structure alerts and test out the system design.

As a baseline to work from, pages and alarms should only sound for events that are urgent, important, and actionable. By verifying the level of importance of an alert, you can eliminate alert fatigue and ensure that every page is quickly investigated, acted upon, or fine-tuned. The result is increased uptime and, ultimately, a happier on-call team.

It’s All About Timing and Optimization: 5 Tips to Alert Success

An effective monitoring, metrics, and alert system is one of the fundamental tools of an efficient DevOps operation. When you are working with small, iterative, and often fast-to-production releases, alerts about problems become a key requirement for maintaining the production environment. Alert systems are the heartbeat of the entire operation, without which downtime will persist. Designing an alert system to be optimal, with minimal false positives, is the key to that effectiveness.

Here are my top five tips for "Effective Alerts by Design" (EAbD):

The Mindfulness of Alerts

When an alert is pushed out to an email system or third-party platform, it can end up being missed. Instead of just passing the alert off, keep it within workflow control so that it stays visible and has a valid lifecycle.

Defcon 5: Keeping It Subcritical

Managing alerts will give you the control you need to focus on the important ones and not go off chasing unicorns. Write sub-critical rules for your system. These will be specific to your production environment; an example may be that your database is close to capacity. The rules can also be prioritized and alerts sent based on that priority. It means you don’t have to react to sub-critical events at 2 am.
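As a minimal sketch of prioritized sub-critical rules, the following assumes a hypothetical `Rule` record and a `db_disk_used_pct` metric (neither comes from any particular monitoring product). Sub-critical rules open a ticket for business hours, while critical rules page.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    SUBCRITICAL = 1  # handle during business hours
    CRITICAL = 2     # page the on-call engineer now


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule shape, for illustration only.
    name: str
    metric: str
    threshold: float
    priority: Priority


def evaluate(rules, metrics):
    """Return (rule name, action) for every rule whose threshold is met."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value >= rule.threshold:
            action = "page" if rule.priority is Priority.CRITICAL else "ticket"
            fired.append((rule.name, action))
    return fired


# Assumed thresholds; tune them to your own production environment.
RULES = [
    Rule("db_near_capacity", "db_disk_used_pct", 80.0, Priority.SUBCRITICAL),
    Rule("db_full", "db_disk_used_pct", 98.0, Priority.CRITICAL),
]
```

With these assumed thresholds, a database at 85% capacity only produces a ticket, so nobody is woken at 2 am for it.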

Fixing the Symptom, Not the Cause

Alerts stay consistent, even when the underlying architecture changes, if you page on the symptoms rather than the cause. It is easier to capture problems using user-facing symptoms or other dependable services.

Keep it Simple, Simon (KISS)

Create scope-aware alerts. This will allow you to combine variables so that instead of two or more alerts for one object, you get a single alert. For example, if disk alerts are split into forecast and usage levels, combine the two into a single alert.
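The disk example can be sketched as a single function that folds both conditions into one message. The function name, parameters, and default limits below are assumptions chosen for illustration.

```python
from typing import Optional


def disk_alert(
    volume: str,
    used_pct: float,
    hours_until_full: Optional[float],
    used_limit: float = 90.0,            # assumed usage limit
    forecast_limit_hours: float = 24.0,  # assumed forecast horizon
) -> Optional[str]:
    """Return one combined alert for a volume, or None if nothing is wrong."""
    reasons = []
    if used_pct >= used_limit:
        reasons.append(f"usage at {used_pct:.0f}%")
    if hours_until_full is not None and hours_until_full <= forecast_limit_hours:
        reasons.append(f"forecast to fill in {hours_until_full:.0f}h")
    if not reasons:
        return None
    # One alert per volume, carrying every reason that applies.
    return f"disk {volume}: " + "; ".join(reasons)
```

A volume that breaches both limits then produces one page with both reasons attached, rather than two separate pages.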

Putting False Alarms to Good Use

When you do get false alarms, put them to work by using them as a basis for tightening up the alert condition or removing it from the paging list. When designing your effective alert system, the use of an expressive language, rather than simple object/value UI widgets, is key; it provides more flexibility and reduces errors.

Extending the Structure of Alerts

The art (and science) of alerts extends to their structure too. Human-curated alert rules should be the baseline upon which your structure depends. Designing the structure of an alert comes down to some basic prerequisites, including:

1. Natural grouping of environmental components.

2. The correct aggregation.

3. The correct information attached to the alert to give the most detail.

4. Combining metrics to simplify alerts, whilst maintaining maximum detail.

5. Using Boolean conditions such as negative events (look for things that are NOT happening which might lead to a problem).

6. Avoiding use of fixed alerts – give yourself flexibility and build in historical context to provide predictive analysis.
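Items 5 and 6 above can be sketched in a few lines. The first function checks for a negative event (something expected that has not happened), and the second compares a value against its own history instead of a fixed threshold. Both function names and the `k` multiplier are assumptions, and the mean and standard deviation baseline is only a simple stand-in for real predictive analysis.

```python
from statistics import mean, stdev


def is_missing(last_seen: float, now: float, max_silence: float) -> bool:
    """Negative event: True when an expected event has NOT happened recently.

    All arguments are in the same time unit (for example, epoch seconds).
    """
    return (now - last_seen) > max_silence


def is_anomalous(history: list, current: float, k: float = 3.0) -> bool:
    """Flag values far from the historical baseline rather than a fixed limit."""
    if len(history) < 2:
        return False  # not enough context to judge
    baseline = mean(history)
    spread = stdev(history)
    return abs(current - baseline) > k * spread
```

For instance, `is_missing` could watch for a backup job that has stopped reporting, which a rule that only reacts to error events would never catch.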

Fine Tuning Alerts

Like all good ideas, your efficient alert system needs to be tested. Simulation is a good place to start: create simulation rules based on previous events. The goal is to reduce the noise, because reducing noise means you are more likely to produce relevant alerts. Simulation and noise reduction are not a one-off event. You need to keep carrying out these exercises, fine-tuning your alerts until they are as meaningful as possible. And, of course, reviews should be periodic, as environments change. I also suggest making it a weekly habit (before the weekend starts) to review the false positive ratio for the previous week. Spending an hour on tuning before the weekend can save you and your team a great headache during the on-call weekend shift.
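One way to run such a simulation is to replay labelled past events through a candidate rule and count the noise it would have produced. This is a sketch under assumptions: the event dictionaries, the `real_incident` label, and the `cpu` field are all hypothetical.

```python
def simulate(rule, past_events):
    """Replay labelled past events through a candidate rule.

    `rule` is any callable that takes an event and returns True to page.
    Each event is a dict with a boolean "real_incident" label (assumed format).
    """
    fired = [e for e in past_events if rule(e)]
    false_positives = [e for e in fired if not e["real_incident"]]
    missed = [e for e in past_events if e["real_incident"] and not rule(e)]
    return {
        "fired": len(fired),
        "false_positives": len(false_positives),
        "missed": len(missed),
    }


# Hypothetical history from a previous week.
EVENTS = [
    {"cpu": 95, "real_incident": True},
    {"cpu": 85, "real_incident": False},
    {"cpu": 70, "real_incident": False},
    {"cpu": 92, "real_incident": True},
]


def loose_rule(event):
    return event["cpu"] > 80


def tight_rule(event):
    return event["cpu"] > 90
```

Comparing `simulate(loose_rule, EVENTS)` with `simulate(tight_rule, EVENTS)` shows whether tightening a condition removes noise without missing real incidents.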

Similarly, paging events should be reviewed, including those ignored by administrators – this data can help you to refine rules to prevent false positives. Here are some tips for fine tuning your alerts so you can make sure they’re spot on, and as effective as possible.

A Rule of Rules

Alerts that are less than 50% accurate are broken; rules with a false positive rate of 10% or less are good to go.
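Read as thresholds on the share of pages that turn out to be false, this rule of thumb could be sketched as follows. The function name, the labels, and the "needs tuning" middle band are assumptions, as is the reading of the 10% figure as an upper bound.

```python
def rule_health(true_pages: int, false_pages: int) -> str:
    """Classify a rule by the share of its pages that were false alarms."""
    total = true_pages + false_pages
    if total == 0:
        return "no data"
    false_ratio = false_pages / total
    if false_ratio > 0.5:    # less than 50% accurate
        return "broken"
    if false_ratio <= 0.10:  # at or under the 10% threshold
        return "ok"
    return "needs tuning"    # assumed middle band, not from the article
```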

A Page Too Far

Get rid of extraneous paging events. If a page fires and investigation shows nothing wrong, adjust the rule.

The Rise of the Machine

Machine learning is perfectly placed to optimize alerts. Use human-curated rules enhanced with machine learning algorithms to create rules and fine-tune alerts.

Repeat Business

Take regular events, such as backups, into account when fine-tuning rules. If you have known maintenance going on, suppress alerts associated with that.
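A minimal sketch of suppression during known maintenance follows. It assumes alerts carry a set of tags and that maintenance windows are daily time ranges keyed by tag; both conventions are hypothetical.

```python
from datetime import datetime, time


def in_window(now: datetime, start: time, end: time) -> bool:
    """True if `now` falls inside a daily window, including ones past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window crosses midnight


def should_page(alert_tags: set, now: datetime, windows: dict) -> bool:
    """Suppress the page if any tag on the alert is under maintenance."""
    for tag, (start, end) in windows.items():
        if tag in alert_tags and in_window(now, start, end):
            return False
    return True


# Assumed nightly backup window from 01:00 to 03:00.
WINDOWS = {"backup": (time(1, 0), time(3, 0))}
```

Alerts tagged `backup` are then held back between 01:00 and 03:00, while unrelated alerts still page as normal.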

Keeping Control

Set metrics for the on-call team and limit them to a set amount, through review, by differentiating between generated events and the events triggered by them.

Why It Pays to Structure Alerts Properly

The art of creating effective alert systems is down to using an intelligent approach. Keeping things simple, combining variables, and dampening down noise, coupled with prudent and mindful testing, will naturally result in improved alerts. Adding machine learning based on human curation into that mix will allow you to develop an optimized alert system that works for you, rather than against you.

In our final post in the series, we will look at improving alert response times. Response times are part of the critical path to a robust production system and are one of the key factors in an efficient support operation. Slow response times result in downtime, which ends up in lost time, money, and often customers, too. Stay tuned.


Published at DZone with permission of Guy Fighel, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
