
Alert Fatigue (Part 5): Fine-Tuning and Silencing

Learn how to further fine-tune your Sensu alerting to cut down on alert fatigue.

by Ben Abrams · Nov. 13, 18 · Tutorial

This is part 5 in a series on alert fatigue. Catch up on parts 1, 2, 3, and 4.

By now you’ve learned about reducing the sheer number of alerts you’re getting, as well as automated triage and remediation. In this post, I’ll go into some extra steps you can take to further fine-tune Sensu and cut down on alert fatigue.

You’ll learn about:

  • Flap detection, or detecting hosts and services that are "flapping," AKA changing state too frequently
  • Silencing the checks and clients you know you're addressing
  • Safe Mode, which reduces alerting on non-issues
  • Extending handler configurations, AKA customizing Sensu's default handler configs

Flap Detection

Honestly, I should know more about flap detection, but in my experience it’s tuned more by instinct and observation than by pure math. Sensu uses the same flap detection algorithm as Nagios.

There are two levers to tweak until you’re happy with the results, low_flap_threshold and high_flap_threshold:

{ 
  "checks": { 
    "check_cpu": { 
      "command": "check-cpu.rb -w 80 -c 90 --sleep 5", 
      "subscribers": ["base"], 
      "interval": 30, 
      "low_flap_threshold": ":::cpu.low_flap_threshold|25:::",      
      "high_flap_threshold": ":::cpu.high_flap_threshold|50:::" 
    } 
  }
}


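As with other settings, you can set default thresholds (25 and 50 in the example above) and override them through the cpu.low_flap_threshold and cpu.high_flap_threshold client attributes on specific clients with different workloads.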
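To build some intuition for what the two thresholds do, here is a minimal Python sketch of a Nagios-style flap calculation. It is an illustration, not Sensu's actual code: the 21-result window, the 0.8 to 1.2 weighting, and the function names are assumptions on my part, so check them against the version you run.

```python
WINDOW = 21  # assumed: number of most recent check results considered


def total_state_change(history):
    """Return the weighted percent state change, or None if history is too short."""
    if len(history) < WINDOW:
        return None
    recent = history[-WINDOW:]
    changes = 0.0
    weight = 0.8  # assumed weighting: older transitions count less
    previous = recent[0]
    for status in recent:
        if status != previous:
            changes += weight
        weight += 0.02  # newer transitions count slightly more
        previous = status
    # WINDOW results allow at most WINDOW - 1 transitions
    return int(changes / (WINDOW - 1) * 100)


def is_flapping(history, low, high, was_flapping=False):
    """Apply hysteresis: start flapping at `high`, stop once at or below `low`."""
    change = total_state_change(history)
    if change is None:
        return False
    if was_flapping:
        return change > low
    return change >= high
```

The gap between the two thresholds is what keeps a check from bouncing in and out of the flapping state: a check that changes state moderately often stays "flapping" if it already was, but does not become flapping on that basis alone.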

Silencing/Maintenance

Maintenance is part of our everyday lives, and while we always strive for zero-downtime maintenance, sometimes downtime is unavoidable. Be a good citizen on your team and silence the checks and clients you know you’re updating to avoid alerting the on-call engineer. Failure to do so may result in your teammates being unhappy with you and branding you an “asshole” (or worse). Sensu provides an API for silencing subscriptions and checks, and from version 1.2 on, it lets you specify a start time for scheduled maintenance.

A maintenance window should typically start with a request like this:

$ curl -s -i -X POST \
-H 'Content-Type: application/json' \
-d '{"subscription": "load-balancer", "check": "check_haproxy", "expire": 3600, "begin": "TIME_IN_EPOCH_FORMAT", "reason": "Rolling LB restart" }' \
http://localhost:4567/silenced

HTTP/1.1 201 Created
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Connection: close
Content-length: 0


The above curl command illustrates how easy it is to create a silence. Please note the expire key: never submit a silence without a specific deadline, as it will surely come back to bite you later. I have seen incidents where we had customer impact but no one was alerted. I recommend silencing for no more than 24 hours at a time. You can also leverage the expire_on_resolve key, which clears the silence once the check resolves.

When you set a maintenance or silence without an expire, you leave yourself open to exactly that kind of unnoticed outage.
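If you script your maintenance windows, it can help to make a missing or oversized expire impossible to submit. The following Python sketch wraps the same /silenced request shown above; the helper names and the hard 24-hour ceiling are my own assumptions, not part of Sensu.

```python
import json
import urllib.request

MAX_EXPIRE_SECONDS = 24 * 60 * 60  # the 24-hour ceiling recommended above


def build_silence(subscription, check, reason, expire, begin=None):
    """Build a silence payload, refusing a missing or overly long expire."""
    if not isinstance(expire, int) or expire <= 0:
        raise ValueError("expire must be a positive number of seconds")
    if expire > MAX_EXPIRE_SECONDS:
        raise ValueError("refusing to silence for more than 24 hours")
    payload = {
        "subscription": subscription,
        "check": check,
        "expire": expire,
        "reason": reason,
    }
    if begin is not None:
        payload["begin"] = int(begin)  # start time as epoch seconds
    return payload


def post_silence(payload, url="http://localhost:4567/silenced"):
    """POST the payload to the Sensu API and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, build_silence("load-balancer", "check_haproxy", "Rolling LB restart", 3600, begin=int(time.time()) + 600) would describe a one-hour silence starting ten minutes from now (with the standard time module imported), and passing the result to post_silence submits it.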

Of course, you can also use Uchiwa to schedule silences if you are one of those GUI-inclined folks.

Safe Mode

This is a feature I sadly had to omit from the talk to make it fit in the time allotted. It’s honestly more of a security feature, but it has a useful side effect that helps with alert fatigue. Safe Mode tells the client that it may only execute a subscription check from the server if the check definition on the server also exists on the client side. This helps prevent an attacker with a foothold in your environment from using Sensu to execute malicious checks and spread to other nodes.

Wondering how this relates to alert fatigue? Let’s say you have a process where machines start from a base image and additional provisioning tools such as Chef, Puppet, or Ansible then bring the node into the desired state. That process may not be instantaneous: when the Sensu client sees the node and matches the subscription, it starts scheduling checks immediately, perhaps before provisioning has finished updating the check definitions, monitoring plugins, or other services the checks require. Safe Mode prevents checks, mutators, and handlers from firing until the definition is in place, which narrows the window of opportunity to alert on a non-issue. It’s a great feature that solves multiple problems at once.

Configuring Safe Mode is quite easy: you just enable the following in your client file (typically located in /etc/sensu/conf.d/client.json):

{ 
  "client": { 
    "name": "i-424242", 
    "address": "8.8.8.8", 
    "subscriptions": ["dns_lb"], 
    "safe_mode": true 
  }
}


And then add your check definitions to the server and the appropriate clients.
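Conceptually, the client-side decision looks something like the Python sketch below. This only illustrates the rule described above; the function name and data shapes are hypothetical and do not mirror Sensu's internals.

```python
def should_execute(check_request, local_checks, safe_mode):
    """Decide whether a client runs a check requested by the server.

    check_request: dict with at least a "name" key (hypothetical shape)
    local_checks:  set of check names defined in the client's own config
    safe_mode:     the client's "safe_mode" setting
    """
    if not safe_mode:
        return True  # without Safe Mode, any requested check is executed
    # with Safe Mode, only checks that are also defined locally may run
    return check_request["name"] in local_checks
```

On a freshly booted node whose provisioning has not yet written any check definitions, local_checks is empty, so nothing requested by the server runs until the definitions arrive.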

Handler Config

Sensu comes with good defaults for handler configuration, but you can override them. Here’s an example that modifies some defaults so that a particular handler acts on additional events:

{ 
  "handlers": { 
    "single_pane": { 
      "type": "pipe", 
      "command": "single_pane.rb --message 'sensu event' https://domain.tld:port", 
      "handle_silenced": true, 
      "handle_flapping": true 
    } 
  }
}


In this scenario, we’re not alerting on or remediating anything; we’re feeding a single-pane-of-glass service (such as BigPanda), so we want it to receive flapping and silenced events as well.
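To make the effect of those two flags concrete, here is a small Python sketch of the gating they imply. It is a simplified model under my own assumptions: the event fields ("silenced" and an "action" of "flapping") and the function name are illustrative, and the real filtering in Sensu involves more conditions than this.

```python
def handler_accepts(handler, event):
    """Return True if the handler should receive the event.

    handler: dict of handler settings, e.g. {"handle_silenced": True}
    event:   dict with an optional "silenced" flag and an "action" string
             (assumed shape for illustration)
    """
    if event.get("silenced") and not handler.get("handle_silenced", False):
        return False  # silenced events are dropped unless opted in
    if event.get("action") == "flapping" and not handler.get("handle_flapping", False):
        return False  # flapping events are dropped unless opted in
    return True
```

With both flags left at their defaults, a pager-style handler never sees silenced or flapping events, while the single_pane handler above opts in to both.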

Closing Thoughts

I hope that you found these tips useful for reducing your alerts and improving your engineers’ happiness at work. While this series offers a curated tour of Sensu capabilities targeted at reducing or eradicating alert fatigue, there are a lot of other great Sensu features to explore. I wrote this series in the context of Sensu 1.x, but many of the features have been carried over to Sensu Go and in many cases improved upon. I hope to go over these in a future post. The one feature that changes drastically in power and ease of use is filters: you can’t leverage Ruby’s eval or a similar function, since Go is a compiled language. The Sensu community and engineering team are working to make this both easier and better to use, and you can accomplish most of the same things by writing a gRPC client.

Stay tuned for how to reduce alert fatigue with Sensu Go, and for now — happy monitoring!

Published at DZone with permission of Ben Abrams, DZone MVB. See the original article here.
