The Search for Root Cause – Is Log Management Still the Best Approach?
With the limitations of traditional “search-based” log managers, it has become clear that a better approach to finding the root cause is required.
A Brief Discussion
Historically, companies relied on homegrown scripts and tools to search logs to find the root cause of an application problem. For many developers, these are the tools they grew up with. Experienced engineers used Perl scripts, vi, grep, and awk to make log searches more efficient. Then came log managers, providing log aggregation and making logs easier to search. Splunk, when it was first released, was known as "Google Search for logs," and it was followed by others such as Elasticsearch and Sumo Logic.
These historical approaches worked for a different era of software applications: ones more monolithic in design, changing relatively infrequently (once every few weeks or months) and with only a handful of log types to monitor. With the advent of Kubernetes, cloud-native development, and distributed architectures, the picture looks very different. Now, it is common for apps to have dozens of microservices that produce billions of log lines a day across thousands of log streams. At the same time, continuous delivery practices mean the application might experience multiple changes each day, each one potentially introducing new failure modes and bugs. Times have changed.
Using yesterday’s methodologies to troubleshoot today’s applications costs engineers untold precious hours. The volume of data to review is orders of magnitude higher, and so is the ever-expanding horizon of possible software failure modes. Without a modernized approach, teams face multiple issues.
First, brute-force searching through large volumes of log data is slow, and the time involved carries a substantial opportunity cost for software development. Second, the urgency mandates a drop-everything mentality that disrupts productivity and creativity; developers generally disdain interruptions. Third, and perhaps most importantly, the quest to find the true root cause turns into a rush to find an immediate fix. In this mode, the underlying problem is unlikely to be fully understood or resolved, and it could resurface later.
Log management still has valid use cases when dealing with software incidents. If the symptom of a problem is well known, an engineer can tell whether it has occurred with the right search query. These types of problems can even be detected automatically by building alert rules. But alert rules are brittle. If an alert rule looks for exact text such as "database not responding," it will silently stop working when a newer version of the software changes the message to "database is not responding." In addition, it is difficult to build alert rules for problems characterized by more complex sequences of log events (e.g., log event X followed by log event Y within 60 seconds).
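To make the brittleness concrete, here is a minimal Python sketch. The log lines, message text, and the 60-second window are all hypothetical, and the rules are illustrative only, not the alerting syntax of any particular log manager:

```python
import re
from datetime import datetime, timedelta

# Hypothetical log stream: (timestamp, message) pairs.
LOGS = [
    ("2024-05-01 12:00:01", "INFO  connection pool initialized"),
    ("2024-05-01 12:00:05", "ERROR database is not responding"),
    ("2024-05-01 12:00:40", "ERROR request timeout on /checkout"),
]

# Brittle rule: exact substring match. A small wording change in a new
# release ("database is not responding" vs. "database not responding")
# silently defeats it.
def exact_alert(message: str) -> bool:
    return "database not responding" in message

# A slightly more tolerant regex that allows the optional "is". Still
# hand-maintained, and still broken by any larger rewording.
DB_DOWN = re.compile(r"database\s+(is\s+)?not\s+responding")

def regex_alert(message: str) -> bool:
    return bool(DB_DOWN.search(message))

# Sequence rule: event X followed by event Y within a time window. Even
# this simple two-event correlation needs stateful logic, which is why
# such rules are hard to express as plain search queries.
def sequence_alert(logs, x_pattern, y_pattern, window_secs=60):
    fmt = "%Y-%m-%d %H:%M:%S"
    x_time = None
    for ts, msg in logs:
        t = datetime.strptime(ts, fmt)
        if re.search(x_pattern, msg):
            x_time = t
        elif x_time and re.search(y_pattern, msg):
            if t - x_time <= timedelta(seconds=window_secs):
                return True
    return False

print(exact_alert(LOGS[1][1]))                              # False: rule silently missed it
print(regex_alert(LOGS[1][1]))                              # True
print(sequence_alert(LOGS, r"not responding", r"timeout"))  # True
```

Even this toy version shows the maintenance burden: every wording change and every new event sequence requires another hand-edited rule.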
When it comes to finding the root cause, there are even more fundamental challenges. A given software problem can have one symptom but many different root causes. For example, a log event signifying that a database is not responding could be due to a network problem, a database problem, an incorrectly formed query, and so on. In this case, a log manager could use an alert rule to automatically detect the symptom, but determining its root cause would still be a manual, painful process.
In addition, there is a class of new/unknown problems for which alert rules don't exist at all. These tend to be the most difficult class of problems to troubleshoot and, invariably, keep engineers working late into the night. Although log managers make the logs easier to search, a skilled engineer still needs to spend considerable time determining what to search for and how to pinpoint the root cause. The classic workflow involves looking for clusters of rare events and errors and trying to understand the correlations between them. None of this work (identifying rare events, unusual clusters, or correlations between them) is well suited to a search-based paradigm.
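As a concrete illustration of that workflow, here is a toy Python sketch: normalize log lines into templates, surface the rare templates, and group rare events that occur close together in time. The log lines, the rarity threshold, and the 30-second window are all hypothetical; real streams are vastly larger, and production systems use far more robust log parsing than a single regex:

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical log stream: (timestamp, message) pairs.
LOGS = [
    ("2024-05-01 12:00:00", "request 42 served in 13 ms"),
    ("2024-05-01 12:00:01", "request 43 served in 9 ms"),
    ("2024-05-01 12:00:02", "request 44 served in 11 ms"),
    ("2024-05-01 12:00:03", "replica lag 8341 ms on shard 7"),
    ("2024-05-01 12:00:04", "ERROR database is not responding"),
    ("2024-05-01 12:00:05", "request 45 served in 10 ms"),
]

FMT = "%Y-%m-%d %H:%M:%S"

def template(msg: str) -> str:
    # Crude normalization: mask numbers so variants of the same event
    # collapse into one template.
    return re.sub(r"\d+", "<N>", msg)

# Step 1: count how often each template occurs.
counts = Counter(template(m) for _, m in LOGS)

# Step 2: flag rare templates (here, seen only once) as candidate clues.
rare = [(datetime.strptime(ts, FMT), m) for ts, m in LOGS
        if counts[template(m)] == 1]

# Step 3: group rare events that occur close together in time; such a
# cluster is exactly the correlated evidence an engineer hunts for.
WINDOW = timedelta(seconds=30)
clusters, current = [], []
for t, m in sorted(rare):
    if current and t - current[-1][0] > WINDOW:
        clusters.append(current)
        current = []
    current.append((t, m))
if current:
    clusters.append(current)

for cluster in clusters:
    print("correlated rare events:", [m for _, m in cluster])
```

Here the replica-lag message and the database error surface together as one correlated cluster, which is the kind of starting point an engineer would otherwise spend hours searching for, and the kind of analysis the ML approach described below would automate at scale.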
Given these limitations of traditional search-based log managers, it has become clear that a better approach to finding the root cause is required, one where skilled humans are not the bottleneck. This seems like a perfect application for machine learning (ML). However, traditional ML approaches for logs require weeks or months of manual training against carefully labeled log data sets. That doesn't work for new failure modes, and it demands ongoing effort as new applications and software versions are deployed.
An ideal solution for automated log analysis would use unsupervised machine learning and adapt quickly to changing applications and environments. Such a system would uncover the root cause by using ML to mimic the troubleshooting process of a skilled engineer, automatically identifying correlated clusters of rare events and errors. This would allow developers to offload the time-consuming and wearisome tasks of log analysis and focus on new features and code improvements. In addition, the insights found by the ML could be used to proactively detect new failure modes before they wreak havoc in production.
Log management still has its place, but today's rate and degree of change, combined with the cost of the human troubleshooting bottleneck, necessitate a new approach. Yesterday's solutions alone are no longer sufficient.