Alerts Should Work for You, Not the Other Way Around

By helping us understand what users are experiencing with an application, monitoring is an undeniable game-changer for organizations willing to embrace and use it.

By Leon Adato · Nov. 22, 23 · Tutorial

“A few years back, I was tuning an alert rule and accidentally triggered the alert, which created 772 tickets. Twice.”

This (all too true) story serves as the introduction to the main thesis of my talk in the video below (and the reason for its title): that alerts — contrary to the popular opinion held by IT practitioners across the spectrum of tech — don’t inherently suck. The problem lies in how alerts are typically created, which causes them to be… well, let’s just say “sub-optimal” and leave it at that.

I’ve given this talk frequently at conferences such as DevOpsDays BRUM, DevOpsDays TLV, Monitorama, and others. I believe its popularity is largely due to its fun approach to a frustrating issue.

I’d like to take a few moments of your time here to emphasize points I make in the talk but then extend those ideas in ways that don’t fit the limitations of time or format common in conference presentations.

The Slippery Slope To “Monitoring Engineer”

If you’ve read this far, there’s a good chance you care about alerts for more than just your own personal reasons. You probably have people — whether on your immediate team or in the larger organization — who look to and rely on you for help designing, implementing, maintaining, and fixing alerts.

While most of us first encounter monitoring solutions because we want to know more about our own sh… tuff, it quickly follows that we’re helping others set up monitoring for themselves. Before long, we find ourselves in the “resident expert” role. Once that reputation gets around, the job (whether official or not) is irrevocably added to our responsibilities.

The good news is that this is a huge opportunity for those who enjoy the work. Monitoring is an undeniable game-changer in organizations willing to embrace and use it.

Alerts ≠ Monitoring

One of my first encounters with alerting that was completely off the rails was at a company that defined uptime as “100% minus the # of alerts” in a given period. It was utterly unhinged.

While it was an extreme example, the underlying issue — confusing alerting with monitoring — isn’t rare at all. For many individuals (and teams, departments, and entire companies), the raison d’être for monitoring is to have alerts, which is simply not helpful or effective.

Monitoring is nothing more (and nothing less) than the consistent, persistent collection of data from a set of systems. Everything else that a monitoring and observability solution provides — dashboards, widgets, reports, automation, and alerts — is merely a happy by-product of having monitoring in the first place.

As a monitoring engineer, I know something is amiss when I see people hyper-focusing on alerts to the exclusion (if not the detriment) of monitoring.

Alerts Need Proof of Their Value

An alert should only exist if it has a proven, measurable, meaningful impact. The best way to validate that is to see if an alert is intended to cause

  • someone
  • to do something
  • RIGHT. NOW.
  • about a problem

If all of those conditions aren’t met, you’re looking at an alert that is trying to replace some other monitoring structure — a dashboard, a report, etc.
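
As a throwaway illustration, that four-part test is easy to encode as a triage check you can run over your alert definitions. The field names here are mine, invented for the sketch, not any real tool’s schema:

```python
# Hypothetical triage check for the four-part test above.
# The fields are invented for illustration; map them to however your
# team actually documents its alerts.

from dataclasses import dataclass

@dataclass
class AlertDefinition:
    name: str
    has_named_responder: bool     # someone
    has_runbook_action: bool      # to do something
    needs_immediate_action: bool  # RIGHT. NOW.
    describes_a_problem: bool     # about a problem

def belongs_as_an_alert(a: AlertDefinition) -> bool:
    """If any condition fails, it probably wants to be a dashboard or report."""
    return all([
        a.has_named_responder,
        a.has_runbook_action,
        a.needs_immediate_action,
        a.describes_a_problem,
    ])

weekly_disk_trend = AlertDefinition(
    name="disk usage creeping up",
    has_named_responder=True,
    has_runbook_action=False,      # nobody does anything when it fires
    needs_immediate_action=False,  # it can wait for the capacity review
    describes_a_problem=True,
)
print(belongs_as_an_alert(weekly_disk_trend))  # False -> make it a report instead
```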

But that merely proves that an alert is actionable, and perhaps important, not that it’s valuable. And I must be clear: “important” isn’t the same as “valuable.” Importance implies that the alert is technically, intellectually, or (believe it or not) emotionally meaningful to some person or group.

“Valuable” is much more particular: The existence of the alert can be directly tied to a financial outcome.

How does one establish this? Start with what the world would look like without the alert:

  • How would the people who can fix the issue find out about the problem? And more to the point, how LONG would it take for the people who can resolve the issue to find out?
  • Are there any inherent losses while the problem is happening? An online sales system that generates $1,000 an hour loses that amount every hour it’s unavailable.
  • How long would it take to fix the problem? In some cases, it’s the same amount of time, alert or not. But in far more circumstances, if the problem were left unaddressed for the length of time identified in the first bullet, it would take longer (possibly significantly longer) to resolve.
  • What is the regular (“total loaded”) rate for the staff who can fix the issue?
  • What is the “interruption cost” for that staff? This means the staff is (ostensibly) not sitting around waiting for this particular error. So what is the value of their normal work? Because they will NOT be doing it during the time they are addressing this issue.

You are welcome to take the formula above and, as the saying goes, “salt to taste.”

Once you have this, recalculate all of the above WITH the alert. The difference between the first calculation and the second is the dollar value of the alert.

Now, you can set up a simple report showing the number of times the alert triggered, multiplied by the value per occurrence. That is the amount this one alert has saved the company during that time.
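
If it helps to see that math end to end, here’s a minimal sketch in Python. Every number and name in it is an illustrative assumption (the $1,000-an-hour shop from the bullets above, made-up staff rates, made-up detection and repair times), not figures from the talk:

```python
# Hypothetical back-of-the-envelope calculator for the value of one alert.
# All inputs are illustrative assumptions; plug in your own organization's numbers.

def incident_cost(detection_hours, repair_hours, revenue_loss_per_hour,
                  responder_rate_per_hour, interruption_cost_per_hour,
                  responders=1):
    """Total cost of one incident: lost revenue while down, plus the loaded
    cost of the people fixing it, plus the normal work they were pulled from."""
    downtime_hours = detection_hours + repair_hours
    revenue_loss = downtime_hours * revenue_loss_per_hour
    staff_cost = repair_hours * responders * (responder_rate_per_hour
                                              + interruption_cost_per_hour)
    return revenue_loss + staff_cost

# Scenario 1: no alert -- the problem lingers until someone notices,
# and the longer it lingers, the longer it takes to fix.
without_alert = incident_cost(detection_hours=4, repair_hours=3,
                              revenue_loss_per_hour=1_000,
                              responder_rate_per_hour=120,
                              interruption_cost_per_hour=80)

# Scenario 2: with the alert -- detection is near-immediate, repair is shorter.
with_alert = incident_cost(detection_hours=0.25, repair_hours=1,
                           revenue_loss_per_hour=1_000,
                           responder_rate_per_hour=120,
                           interruption_cost_per_hour=80)

value_per_occurrence = without_alert - with_alert
occurrences_this_quarter = 6  # pulled from your alert history

print(f"Value per occurrence: ${value_per_occurrence:,.2f}")
print(f"Saved this quarter:   ${value_per_occurrence * occurrences_this_quarter:,.2f}")
```

As with the prose version, salt to taste. The shape of the calculation matters far more than the exact inputs.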

Observability Enables Us To Change Our Focus

Back when I started working with monitoring solutions (yeah, yeah, Grampa. When dinosaurs ruled the earth and you had to chisel each bit into the hard drive by hand with a lodestone), we had to guess at the user’s experience from an array of much lower-level data. We’d look at network traffic, disk I/O, server connections, and other metrics and use them to infer what was happening at the top of the OSI model.

We didn’t do it because we thought it was the best option. We did it because it was the ONLY option. Tracing didn’t really come onto the scene — in terms of true application monitoring — until 2010. And it only took hold because of the fundamental change in application architecture.

The widespread adoption of cloud computing (AWS EC2 launched in 2006) and mobile phones (the first iPhone came on the scene in 2007) radically changed how we interacted with applications. Facebook had an unbelievable (for the time) 600 million users in 2010. That number grew to 800 million in 2011 and over 1 billion in 2012.

Against THAT backdrop, application tracing and real user monitoring went from something we could only do in carefully controlled QA environments to a technique that was not only possible but game-changing.

Because the entire reason we have monitoring — the whole damn point — is to understand what users are experiencing with an application. That’s it. That’s the whole enchilada.

So, I will go on record as saying that alerting should focus on that aspect first and foremost. If the user experience is impacted, sound the alarm and get people out of bed if necessary.

At that point, all the other telemetry — metrics, events, and logs — can be used to understand the details of the impact. But those lower-level data points no longer have to be the trigger point for alerts. Not in most cases.
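
To make that concrete, here’s a small, hypothetical sketch of the idea. The metric names, thresholds, and alert payload shape are all invented for illustration (not any vendor’s API); the point is simply that only the user-facing signal decides whether anyone gets paged, while the lower-level telemetry rides along as context for the responder:

```python
# Hypothetical sketch: page on user experience; attach everything else as context.
# Metric names, thresholds, and the payload shape are assumptions for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class UserExperience:
    error_rate: float      # fraction of requests failing, as users see it
    p95_latency_ms: float  # 95th-percentile response time in milliseconds

def should_page(ux: UserExperience,
                max_error_rate: float = 0.02,
                max_p95_ms: float = 1500.0) -> bool:
    """Only user-visible degradation wakes a human."""
    return ux.error_rate > max_error_rate or ux.p95_latency_ms > max_p95_ms

def build_alert(ux: UserExperience, low_level_context: dict) -> Optional[dict]:
    """CPU, disk I/O, and connection counts are diagnostic context,
    never the trigger."""
    if not should_page(ux):
        return None
    return {
        "summary": "User-facing experience breached its threshold",
        "user_experience": ux,
        "context": low_level_context,  # metrics, events, logs for the responder
    }

alert = build_alert(
    UserExperience(error_rate=0.05, p95_latency_ms=2300.0),
    low_level_context={"cpu_pct": 97, "db_connections": 480},
)
print(alert)  # a payload when users are hurting; None when they are not
```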

Where Do We Go From Here?

Hopefully, you have enough time between this blog and my talk to reflect on your existing alerts with an eye toward real improvement. You may find yourself deleting alerts you once thought essential. You will also undoubtedly spend time tweaking your alerts to make them more actionable, meaningful, and valuable.

Just ensure you don’t trigger an alert storm in the process, or you’ll end up in the helpdesk, manually closing 1,544 tickets.

Don’t ask me how I know.

Published at DZone with permission of Leon Adato.

Opinions expressed by DZone contributors are their own.
