
How AIOps Revolutionizes Alarm Management

Let's take a look at AIOps and explore how it revolutionizes alarm management. What is AIOps and how can it help IT teams?

By Chris Riley · Oct. 08, 18 · Opinion

If you work in IT Ops, alarm management is likely one of your greatest and most persistent challenges. Your monitoring tools send you alerts all the time, and figuring out which alarms to prioritize (and which ones don’t actually require attention at all) can be tough.

Fortunately, there is a promising new way to keep IT Ops teams from drowning in a sea of alerts and failing to manage alarms effectively. It’s called AIOps. Here’s an overview of how AIOps applies to alarm management.

What Is AIOps?

Put simply, AIOps refers to the integration of Artificial Intelligence, or AI, into IT Ops processes. (Okay...maybe you guessed that from the term.)

Gartner popularized the term, and AIOps is becoming an increasingly common strategy among organizations seeking to take automation to the next level.

To a limited extent, companies have been using AIOps tools for a long time even though they did not call them that. Monitoring software like Splunk essentially uses data-driven analytics to help IT Ops teams perform their jobs.

But the difference between basic platforms like Splunk and fully modern AIOps tools is the extent to which the tools automate decision-making and action. Splunk is designed primarily to help you make sense of all of your monitoring data. Apart from some basic automated response features, Splunk doesn’t do much in the way of performing actual operations.

In contrast, AIOps tools both analyze data and take action in response to it. In this way, they significantly reduce the amount of manual effort required on the part of IT Ops teams.

It’s worth noting that AIOps is not a total replacement for human engineers. AIOps tools will always require management and manual intervention from time to time. But AIOps helps your IT Ops team to do more with fewer humans, which is a crucial advantage as software environments and infrastructure become ever more complex, and IT Ops teams struggle to keep up with them.

Using AIOps for Alarm Correlation and Root-Cause Analysis

If you manage infrastructure or software environments of any appreciable size, you probably already take advantage of tools that help you to automate monitoring and alerting. You might use open source solutions like Nagios to collect monitoring data, and/or commercial tools that provide some features for sorting through alarms and identifying those that demand immediate action.

When it comes to triaging and responding to alarms, however, most IT teams still rely on a great deal of manual effort. Automated tools help them collect monitoring data and make sense of it, but they don’t solve problems for them, or make specific recommendations about how to respond to issues.

This is where AIOps comes in. AIOps tools can help IT teams to triage alarms more intelligently by using advanced data analytics and taking previous patterns into account to identify which types of alarms require the greatest attention. AIOps can also help to correlate related issues and perform root-cause analysis, even when it is not obvious on the surface how distinct alarms relate to each other.

To understand why this matters in practice, consider a scenario in which your IT Ops team receives three alarms within a short span of time:

  1. An alarm warning about low memory availability on a Kubernetes node.
  2. An alarm indicating that a storage microservice has ceased to respond.
  3. An alarm about a sudden spike in network traffic to your application.

How would you prioritize and respond to these alarms? Based on the information available above, you’d probably prioritize issue 2, because an unavailable service is more pressing than low memory or an increase in network traffic. But when you begin investigating issue 2, you might discover that the microservice failed because your Kubernetes node no longer had sufficient memory. You did not receive an earlier warning about the memory issue because your alarm threshold for low memory was not properly configured.

Meanwhile, the spike in network traffic, which may appear to be unrelated to issues 1 and 2, turns out to be the result of the failed microservice, which has made your web app unavailable and is causing users to reload it repeatedly in the hope that it will start responding again.

Working manually, you’d eventually be able to figure out how these three issues are interrelated, and that issue 1 is actually the root cause. But determining all of this manually would be slow, and by the time you sorted it all out, your users would likely have experienced a fair bit of downtime.

With AIOps, the resolution of these alarms could be much faster and require little or no manual effort. Your AI algorithms could automatically work out how each of the issues relates to the others. They might even be able to remediate them automatically by shifting the Kubernetes workload to a new node that has more memory, or by otherwise making more memory available.
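
The correlation step in the scenario above can be illustrated with a small Python sketch. It is a simplification and not how any particular AIOps product works. It assumes a hand-written map of which component is affected by which (a real tool would more likely infer these relationships from data), and the component names, the `DEPENDS_ON` map, and the five-minute window are all hypothetical. The idea is to group alarms that arrive close together and treat an alarm as a likely root cause only if nothing upstream of it is also alarming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str       # component that raised the alarm (hypothetical names)
    message: str
    timestamp: float  # seconds, relative to any fixed starting point


# Hypothetical map: component -> components whose failure can affect it.
DEPENDS_ON = {
    "app-ingress": ["storage-service"],  # traffic spike follows a storage outage
    "storage-service": ["k8s-node"],     # service runs on the node
    "k8s-node": [],
}


def root_causes(alarms, depends_on, window_seconds=300.0):
    """Return the alarms in the window that have no alarming upstream component."""
    if not alarms:
        return []
    start = min(a.timestamp for a in alarms)
    recent = [a for a in alarms if a.timestamp - start <= window_seconds]
    alarming = {a.source for a in recent}

    def has_alarming_upstream(component, seen=None):
        if seen is None:
            seen = set()
        for dep in depends_on.get(component, []):
            if dep in seen:  # guard against cycles in the map
                continue
            seen.add(dep)
            if dep in alarming or has_alarming_upstream(dep, seen):
                return True
        return False

    return [a for a in recent if not has_alarming_upstream(a.source)]


# The three alarms from the scenario, in the order they might arrive.
alarms = [
    Alarm("storage-service", "service not responding", 0.0),
    Alarm("k8s-node", "low memory available", 30.0),
    Alarm("app-ingress", "sudden spike in network traffic", 60.0),
]

for alarm in root_causes(alarms, DEPENDS_ON):
    print(f"likely root cause: {alarm.source} ({alarm.message})")
```

In this sketch the storage and traffic alarms are treated as symptoms because each has an alarming component upstream, which leaves the low-memory alarm on the node as the one to act on first.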

Alarm Management and AIOps

Another key advantage of AIOps for alarm management is the ability of AIOps to automate alarm thresholds. Currently, you probably set alarm thresholds manually. You configure policy files that tell your monitoring tools how long an application can remain unresponsive before an alarm is triggered, or how low available disk space can get before the tool notifies you. Alarm thresholds are critical for keeping alarms manageable by avoiding unnecessary alerts.

The problem with alarm management in modern software environments is that those environments tend to be highly dynamic. The amount of available disk space that is acceptable on a low-traffic day, when you have 10 nodes running and 1,000 users, could be very different from what is acceptable during a period of heavy load, when you have to support 10,000 users with 1,000 nodes.

Is your IT Ops team able to update alarm thresholds manually every time the norms of your environment change? Probably not. But AIOps tools can. Using data and analytics, they can determine which thresholds are the right fit for your environment at a given moment and adjust configurations accordingly.
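
One simple way to picture an automatically adjusted threshold is a rolling baseline. The following Python sketch is a minimal illustration and does not describe the method of any specific AIOps tool. It assumes a metric where higher values are worse (for example, disk usage as a percentage), and the class name `AdaptiveThreshold`, the window size, and the multiplier `k` are all hypothetical choices. The threshold is recomputed from recent samples as the mean plus `k` standard deviations, so it moves as the environment's normal behavior moves.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveThreshold:
    """Alarm threshold derived from a rolling window of recent samples."""

    def __init__(self, window=60, k=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # oldest samples drop out automatically
        self.k = k
        self.min_samples = min_samples       # must be at least 2 for stdev()

    def threshold(self):
        # Mean plus k standard deviations of the recent samples.
        return mean(self.samples) + self.k * stdev(self.samples)

    def observe(self, value):
        """Record a sample; return True if it breaches the current threshold."""
        breach = False
        if len(self.samples) >= self.min_samples:
            breach = value > self.threshold()
        self.samples.append(value)
        return breach


monitor = AdaptiveThreshold(window=60, k=3.0, min_samples=10)

# Quiet period: usage hovers around 41 percent.
for i in range(20):
    monitor.observe(40.0 + 2 * (i % 2))
print("quiet-period threshold:", round(monitor.threshold(), 2))

# Heavy load: usage settles around 81 percent and the threshold follows it.
for i in range(60):
    monitor.observe(80.0 + 2 * (i % 2))
print("heavy-load threshold:", round(monitor.threshold(), 2))
```

A static threshold tuned for the quiet period would fire continuously under heavy load, while the rolling version treats the new level as normal once enough samples have accumulated. The trade-off is that a slow, genuine degradation can also be absorbed into the baseline, which is one reason production systems tend to combine approaches like this with fixed safety limits.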

Conclusion

It’s a safe bet that the infrastructure and software environments that your IT Ops team has to manage will become more complex over time. No one’s stack is getting simpler.

As complexity increases, AIOps will prove critical for helping IT Ops teams to stay ahead of the chaos by managing alarms effectively. AIOps reduces the manual management burden on engineers while also helping to avoid unnecessary alerts.


Opinions expressed by DZone contributors are their own.
