Application outages are more than inconvenient or brand-damaging – they are quite expensive as well. The Ponemon report, “Cost of Data Center Outages,” calculates that an organization loses, on average, $7.39 million per year, with outages lasting an average of 90 minutes.
The question is: With all the monitoring tools available today, how and why do outages still happen? The answer is, surprisingly, in the question. All those monitoring tools send out so many alarms that they create a cacophony of information that duplicates itself without any correlation, resulting in confusion as to the root cause of the incident. What begins as application performance degradation can quickly become a major outage.
To make matters worse, once the crisis is over and the application is up and running again, the post-mortem analysis isn’t any better in terms of process. Typically, subject matter experts enter a war room and go through multiple product consoles and logs to identify the cause of an incident, and the blame gets passed around. This method puts companies at a severe disadvantage. However, root-cause and predictive analytics technologies based on deep machine learning are helping organizations prevent such incidents and dramatically reduce mean time to repair.
The pace of digitization requires teams to manage unparalleled amounts of data while predicting and preventing outages in real time and maintaining and delivering agile, reliable applications. The problem is that most organizations must tap several different siloed vendor products to assist in the monitoring, identification, mitigation and remediation of incidents and hope that they speak to each other, which traditionally hasn’t happened.
Digitization applies to infrastructure as well, and as IT continues to transform from physical to hybrid and multi-cloud environments, it is becoming impossible for IT administrators to keep up with the multitude of objects, each with thousands of metrics generating data in near real time.
The digital, virtualized and hybrid-cloud environments we now operate within demand applications that are available, reliable, performant and secure. For this to happen, new approaches must be employed to provide intelligence about what is happening across those environments. Automated, self-learning solutions that analyze and provide insight into ever-changing applications and infrastructure topologies are essential in this transformation.
Are the Buzzwords Worth the Buzz?
Application performance today is complex and dynamic. Organizations understand that catchphrases like “big data” and “machine learning” actually do have a lot to offer underneath all that marketing hype. However, what vendors say they have and what they actually mean by those phrases don’t always jibe. Let’s define our terms:
- Machine Learning: Vendors sometimes use this term loosely. Machine learning refers to self-learning algorithms, supervised or unsupervised, that can be based on neural networks, statistics, digital signal processing and other techniques.
- Big Data Architecture: The ability to ingest and manage massive amounts of structured and unstructured data, and to automate at scale.
- Domain Knowledge: This refers to gathering the collective insights of TechOps and DevOps to help answer these and other questions: What just happened? What caused it? How do we remediate it? How do we keep it from happening again?
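To make the "self-learning" idea in the first definition concrete, here is a minimal sketch of its simplest statistics-based form: a baseline learned from recent metric values, with new values flagged when they stray too far from it. The function name `find_anomalies`, the window size and the threshold are assumptions chosen for illustration, not any vendor's actual method.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=20, threshold=3.0):
    """Return the indexes of values that lie more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    history = deque(maxlen=window)  # rolling baseline of recent values
    anomalies = []
    for index, value in enumerate(values):
        # Only score a value once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat baseline (sigma == 0) is skipped rather than scored.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
response_times = [100, 102, 98, 101, 99] * 4 + [250, 100, 101]
spikes = find_anomalies(response_times)
```

The baseline adapts as new data arrives, which is the sense in which even this small example "learns". Production systems would also have to account for seasonality, trends and correlated metrics.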
Preventing and Managing Incidents
The financial implications alone of an application outage are far too significant a business issue to leave to chance or slick marketing. Before a company moves forward with a solution, it should keep a few points in mind:
Consolidate to Remediate
Because user expectations are high and applications must not go down, organizations have deployed numerous monitoring tools to detect and remediate incidents. However, these tools are often siloed, sending out multiple alarms about the same issue. This can result in a digital sea of red that overwhelms and eventually numbs IT teams to alerts, which can lead to missed signals and the dreaded outages. The solution to this problem lies in preventing and managing incidents by obtaining consolidated, coordinated, real-time insights and making use of collective, actionable knowledge for remediation.
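As a rough illustration of what consolidating alarms can mean in practice, the sketch below groups alerts from different tools that describe the same symptom on the same resource within a short time window, so that a team sees one incident instead of many alarms. The `Alert` fields, the `consolidate` function and the five-minute window are hypothetical; real products correlate on far richer signals, such as topology and dependencies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical field)
    resource: str     # affected host or service
    symptom: str      # normalized symptom label, e.g. "high_latency"
    timestamp: float  # seconds since the epoch


def consolidate(alerts, window_seconds=300.0):
    """Group alerts for the same resource and symptom that arrive within
    `window_seconds` of the first alert in the group."""
    by_key = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_key[(alert.resource, alert.symptom)].append(alert)

    incidents = []
    for group in by_key.values():
        current = [group[0]]
        for alert in group[1:]:
            if alert.timestamp - current[0].timestamp <= window_seconds:
                current.append(alert)      # same incident, another tool reporting it
            else:
                incidents.append(current)  # window expired: close this incident
                current = [alert]
        incidents.append(current)
    return incidents


# Three tools report the same database problem; a fourth alert is unrelated.
alerts = [
    Alert("tool_a", "db01", "high_latency", 0.0),
    Alert("tool_b", "db01", "high_latency", 30.0),
    Alert("tool_c", "db01", "high_latency", 1000.0),
    Alert("tool_a", "web01", "cpu_saturation", 10.0),
]
incidents = consolidate(alerts)
```

Here four raw alarms collapse into three incidents, and the first incident carries both tools' reports of the same database slowdown. The fixed window is a deliberate simplification: it is easy to reason about, but it cannot tell that two different symptoms share one root cause.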