Using Machine Learning to Find Root Cause of App Failure Changes Everything
The root cause of most problems can usually be found somewhere among millions of log events from a large number of different sources. This is why we need ML.
It is inevitable that a website or app will fail or encounter problems from time to time, ranging from broken functionality to performance issues or even complete outages. Development cycles are too fast, conditions too dynamic, and infrastructure and code too complex to expect flawless operations all the time. When a problem does occur, it creates high-pressure urgency that sends teams scurrying for a solution. The root cause of most problems can usually be found somewhere among millions (or even billions) of log events from a large number of different sources. The ensuing investigation is usually slow and painful and can take valuable hours away from already busy engineering teams. It also involves handoffs between experts in different aspects or components of the app, particularly when interconnected microservices and third-party services are in play, which can produce a wide range of failure permutations.
Finding the root cause and the solution takes both time and experience. At the same time, development teams are usually short-staffed and overworked, so the urgent “fire drill” of dropping everything to find the cause of an app problem stalls other important development work. Observability tools, such as APM, tracing, monitoring, and log management solutions, help team productivity, but they are not enough. These tools still require knowing what to look for and significant time to interpret the results they uncover.
Such a challenge is well-suited for machine learning (ML), which can examine vast amounts of data and find correlated patterns of rare and bad (high-severity) events that reveal the root cause. However, performing ML on logs is challenging: logs are mostly unstructured, noisy, and highly varied in format. Log volumes are typically huge, and the data comes from many different log sources. Furthermore, anomaly detection alone is not enough, since its results can still be noisy. What is also needed is the ability to find correlations across the anomalies to pinpoint the root cause with high fidelity. Anomaly detection finds the dots; correlating those anomalies across the logs connects the dots, bringing context and a more precise understanding.
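To make the "connecting the dots" idea concrete, here is a minimal sketch in Python (standard library only) of the correlation side: given anomaly candidates already flagged per log stream, it bundles anomalies that occur close together in time across different sources into incident candidates. The timestamps, service names, and the 30-second window are illustrative assumptions, not output from any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical anomaly records (timestamp, source, log line), as might be
# produced by per-stream anomaly detectors; all values are illustrative.
anomalies = [
    (datetime(2023, 1, 1, 12, 0, 3), "payment-svc", "ERROR connection pool exhausted"),
    (datetime(2023, 1, 1, 12, 0, 5), "api-gateway", "upstream timeout after 30000ms"),
    (datetime(2023, 1, 1, 12, 0, 7), "db-proxy", "too many connections"),
    (datetime(2023, 1, 1, 14, 30, 0), "cron-worker", "retrying job 42"),
]

def correlate(events, window=timedelta(seconds=30)):
    """Group anomalies that occur close together in time across log streams.

    Anomaly detection finds the dots; this step connects them by bundling
    anomalies from different sources into a single incident candidate.
    """
    events = sorted(events, key=lambda e: e[0])
    groups, current = [], []
    for event in events:
        if current and event[0] - current[-1][0] > window:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    # Keep only groups spanning more than one source: cross-stream correlation
    # is what separates a likely incident from an isolated blip.
    return [g for g in groups if len({src for _, src, _ in g}) > 1]

for incident in correlate(anomalies):
    print("Correlated incident candidate:")
    for ts, source, line in incident:
        print(f"  {ts:%H:%M:%S} [{source}] {line}")
```

In this toy run, the three anomalies from different services within seconds of each other surface as one incident candidate, while the isolated, single-source event is discarded as noise.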
Of course, humans can also detect log anomalies and find correlations, but doing so is time-consuming, requires skill and intuition, and does not scale easily. Consider a single person performing the task: it demands identifying anomalies in each log source and then determining whether and how they correlate with one another. A single human has limited bandwidth, so more likely a team will need to comb through the logs, and correlating all the findings across team members requires time-consuming coordination. It’s no wonder troubleshooting can take hours or days. The advantage ML has over humans is that ML can scale almost infinitely.
The only effective way to perform unsupervised ML on logs is to use a pipeline that takes a multi-stage approach to the different parts of the process. ML begins by self-learning how to structure and categorize the logs. This is a critical, foundational step where earlier approaches have fallen short: if a system can’t learn and categorize log events extremely well (particularly the rare ones), it can’t detect anomalies reliably. Next, an ML system must learn the patterns of each type of log event. After this foundational learning, the ML system can identify anomalous log events within each log stream. Finally, it looks for correlations between anomalies and errors across multiple log streams. In the end, the process uncovers the sequence of log lines that describes the problem and its root cause. As an added bonus, it can even summarize the problem in natural language and highlight the keywords within the logs that have the most diagnostic value (the rare and “bad” ones). This approach ensures accurate detection of new failure modes and surfaces the information needed to identify the root cause.
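As an illustration of what the early stages of such a pipeline might look like, the sketch below (a simplified assumption, not any vendor's actual algorithm) structures raw lines into templates by masking variable tokens, learns how often each template occurs per stream, and flags rare templates as anomaly candidates. Those candidates would then feed a cross-stream correlation step like the one sketched earlier.

```python
import re
from collections import Counter

class LogStreamModel:
    """Minimal sketch of the first stages of an unsupervised log-ML pipeline:
    (1) structure raw lines into templates, (2) learn how often each template
    occurs, (3) flag rare templates as anomaly candidates. The masking rules
    and rarity threshold are illustrative assumptions."""

    def __init__(self, rarity_threshold=0.001):
        self.rarity_threshold = rarity_threshold
        self.template_counts = Counter()
        self.total = 0

    @staticmethod
    def template(line):
        # Stage 1: categorize by masking variable parts (hex ids, IPs, numbers)
        # so that "timeout after 30000ms" and "timeout after 45000ms" collapse
        # into the same event type.
        line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
        line = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<IP>", line)
        line = re.sub(r"\b\d+\b", "<NUM>", line)
        return line

    def learn(self, line):
        # Stage 2: learn the normal pattern (here, just the frequency)
        # of each event type in this stream.
        self.template_counts[self.template(line)] += 1
        self.total += 1

    def is_anomalous(self, line):
        # Stage 3: an event whose template is rare relative to the stream's
        # history becomes an anomaly candidate for the correlation stage.
        count = self.template_counts[self.template(line)]
        return self.total == 0 or count / self.total < self.rarity_threshold
```

In a real system, the learned per-template statistics would be far richer (sequencing, timing, parameter distributions), but the point of the sketch is the stage boundaries: categorize first, learn normal patterns second, and only then detect and correlate anomalies.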
A complete ML system should not require any manual training or intervention for reviewing correlations to tune algorithms or adjust data sets. With an unsupervised ML system, the DevOps team should only have to respond to actual findings of the root cause, rather than hunt and research. A few hours of ingesting log data should be sufficient for an ML system to become productive and achieve accurate results.
Larger development and DevOps teams favor increasing levels of specialization to cope with the speed, complexity, and efficiency demands they face. An ML system for determining the root cause of app problems or failures complements this trend, letting teams focus on development and operations rather than dropping everything to deal with a crisis. Fast, efficient identification of problems through ML enables teams to continue the kind of “develop as we fly the plane” cycles that today’s business demands require. ML can also work proactively, finding problematic conditions before they become big problems. In a world that keeps pushing for faster, more productive development with little tolerance for downtime and problems, using ML on logs for root cause analysis changes everything.