
Avoiding Application Outages: Five Key Takeaways


With all the monitoring tools available today, how and why do outages still happen? Read more to see how you can avoid them.



Application outages are more than inconvenient or brand-damaging – they are quite expensive as well. The Ponemon report, “Cost of Data Center Outages,” calculates that an organization loses, on average, $7.39 million per year, with outages lasting an average of 90 minutes.

The question is: With all the monitoring tools available today, how and why do outages still happen? The answer is, surprisingly, in the question. All those monitoring tools send out so many alarms that they create a cacophony of information that duplicates itself without any correlation, resulting in confusion as to the root cause of the incident. What begins as application performance degradation can quickly become a major outage.

To make matters worse, once the crisis is over and the application is up and running again, the post-mortem analysis isn’t any better in terms of process. Typically, subject matter experts enter a war room and go through multiple product consoles and logs to identify the cause of an incident, and the blame gets passed around. This method is putting companies at a severe disadvantage. However, deep machine learning-based, root-cause analytics and predictive analytics technologies are helping organizations prevent such incidents and dramatically reduce mean time to repair.

Painful Transitions

The pace of digitization requires teams to manage unparalleled amounts of data while predicting and preventing outages in real time, all while maintaining and delivering agile, reliable applications. The problem is that most organizations must tap several siloed vendor products to assist in the monitoring, identification, mitigation, and remediation of incidents, and hope that they speak to each other, which traditionally hasn't happened.

Digitization applies to infrastructure, too, and as IT continues the transformation from physical to hybrid and multi-cloud environments, it is becoming impossible for IT administrators to keep up with the multitude of objects, with thousands of metrics generating data in near-real time.

The digital, virtualized, and hybrid-cloud environments we now operate within require availability, reliability, performance and security of applications. For this to happen, new approaches must be employed to provide intelligence. Automated, self-learning solutions that analyze and provide insight into ever-changing applications and infrastructure topologies are essential in this transformation.

Are the Buzzwords Worth the Buzz?

Application performance today is complex and dynamic. Organizations understand that catchphrases like “big data” and “machine learning” actually do have a lot to offer underneath all that marketing hype. However, what vendors say they have and what they actually mean by those phrases don’t always jibe. Let’s define our terms:

  • Machine Learning: Vendors sometimes use this term loosely. Machine learning refers to self-learning algorithms, supervised or unsupervised, that can be based on neural networks, statistics, digital signal processing, and other techniques.
  • Big Data Architecture: The ability to ingest, manage, and automate the processing of massive amounts of structured and unstructured data at scale.
  • Domain Knowledge: This refers to gathering the collective insights of TechOps and DevOps to help answer these and other questions: What just happened? What caused it? How do we remediate it? How do we not have it happen again?

Preventing and Managing Incidents

The financial implications alone of an application outage are far too significant a business issue to leave to chance or slick marketing. Before a company moves forward with a solution, there are a few points they should keep in mind:

  • Smarter, Not Harder, Processes: Using multiple siloed tools causes IT workers to become fatigued and apathetic – so when the system is actually disrupted, no one is paying attention. Instead of tools that only raise alarms, look for solutions that provide answers by pooling and correlating data, and that can identify and reject the majority of false alarms.
  • Help in the Trenches: Admin and IT support need a solution that is automated in a way that quickly determines and pinpoints the root cause of the problem and identifies how to fix it, rather than relying on expensive domain experts.
  • Big Data-Ready: The solution must be scalable so it can handle millions of objects; legacy solutions are not adequate for today’s big data.
  • Prediction for Prevention: The key to preventing outages is to predict issues before they become problems, but traditional monitoring tools trigger alerts only after a problem has already occurred or a preset threshold has been violated. Look for a solution that can alert you to anomalous trends or potentially dangerous issues before they impact your application.
  • Tribal Knowledge Plus: It’s difficult and time-consuming to remediate incidents using only in-house tribal knowledge. Having access to vendor knowledge bases, discussion forums, and the latest state-of-the-art technologies is important. Ideally, you should be able to curate tribal knowledge for repeatability but also be able to integrate crowdsourced knowledge into the mix.
  • Consolidate to Remediate: Because user expectations are high and applications must not go down, organizations have deployed numerous monitoring tools to detect and remediate incidents. However, these tools are often siloed, sending out multiple alarms about the same issue. The result can be a digital sea of red that overwhelms and eventually numbs IT teams to alerts, leading to missed signals and the dreaded outages. The solution lies in preventing and managing incidents by obtaining consolidated, coordinated, real-time insights and making use of collective, actionable knowledge for remediation.
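The “pool and correlate” idea behind these points can be sketched in a few lines: collapse duplicate alarms about the same service and symptom into a single incident when they arrive close together. This is a minimal illustration, not any vendor’s API; the alert fields (`source`, `service`, `symptom`, `timestamp`) and the five-minute window are assumptions chosen for the example.

```python
# Minimal sketch of alert deduplication and correlation.
# Alert fields and the correlation window are illustrative assumptions.
from collections import defaultdict

CORRELATION_WINDOW = 300  # seconds: alerts this close count as one incident


def correlate(alerts):
    """Group alerts by (service, symptom); alerts within the window
    join the current incident, otherwise they open a new one."""
    incidents = defaultdict(list)  # key -> list of incidents (lists of alerts)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = (alert["service"], alert["symptom"])
        groups = incidents[key]
        if groups and alert["timestamp"] - groups[-1][-1]["timestamp"] <= CORRELATION_WINDOW:
            groups[-1].append(alert)  # duplicate of an ongoing incident
        else:
            groups.append([alert])    # a genuinely new incident
    return incidents
```

With this grouping, three monitors firing about the same database latency within a few minutes surface as one incident instead of three alarms, which is the consolidation these bullet points call for.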
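The contrast between threshold-based alerting and prediction-oriented detection can also be shown concretely. The toy sketch below compares a fixed limit, which fires only after the damage is done, with a rolling z-score that flags a metric deviating sharply from its own recent baseline. The window size and 3-sigma cutoff are illustrative assumptions, not recommendations.

```python
# Toy contrast: static-threshold alerting vs. baseline-relative
# anomaly detection on a metric time series. Parameters are illustrative.
from statistics import mean, stdev


def static_threshold_alerts(series, limit):
    """Fire only once a sample exceeds a fixed limit."""
    return [i for i, value in enumerate(series) if value > limit]


def rolling_zscore_alerts(series, window=10, sigma=3.0):
    """Flag samples that deviate sharply from their recent baseline,
    catching anomalous trends well below any static limit."""
    alerts = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sd = mean(baseline), stdev(baseline)
        if sd > 0 and (series[i] - mu) / sd > sigma:
            alerts.append(i)
    return alerts
```

A latency metric hovering near 100 ms that suddenly jumps to 150 ms never trips a 200 ms static limit, but the rolling z-score flags it immediately, which is the kind of early warning the “Prediction for Prevention” point describes.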



