Using Machine Learning to Find Root Cause of App Failure Changes Everything

The root cause of most problems lies buried somewhere among millions of log events from a large number of different sources. This is why we need ML.

By Ajay Singh · Updated Nov. 08, 22 · Opinion

It is inevitable that a website or app will fail or encounter problems from time to time, ranging from broken functionality to performance issues or even complete outages. Development cycles are too fast, conditions too dynamic, and infrastructure and code too complex to expect flawless operations all the time. When a problem does occur, it creates a high-pressure urgency that sends teams scurrying to find a solution. The root cause of most problems can usually be found somewhere among millions (or even billions) of log events from a large number of different sources. The ensuing investigation is usually slow and painful and can take valuable hours away from already busy engineering teams. It also involves handoffs between experts in different components of the app, particularly when interconnected microservices and third-party services are in play, which creates a wide range of possible failure permutations.

Finding the root cause and solution takes both time and experience. At the same time, development teams are usually quite short-staffed and overworked, so the urgent “fire drill” of dropping everything to find the cause of an app problem stalls other important development work. Using observability tools, such as APM, tracing, monitoring, and log management solutions, helps team productivity, but it's not enough. These tools still require knowing what to look for and significant time to interpret the results that are uncovered.

Such a challenge is well suited for machine learning (ML), which can examine vast amounts of data and find correlated patterns of rare and bad (high-severity) events that reveal the root cause. However, performing ML on logs is challenging since logs are mostly unstructured, noisy, and greatly varied in format. In addition, log volumes are typically huge, and the data comes from many different log sources. Furthermore, anomaly detection alone is not enough, since the results can still be noisy. What is also needed is the ability to find correlations across the anomalies to pinpoint the root cause with a high level of fidelity. Anomaly detection finds the dots; correlating those anomalies across the logs connects the dots to bring context and a more precise understanding.
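To make the "finding the dots" step concrete, here is a minimal sketch of rarity-based anomaly detection on a single unstructured log stream. It is an illustration under stated assumptions, not any vendor's implementation: the helper names (templatize, rare_events), the masking regexes, and the frequency threshold are all hypothetical choices.

```python
import re
from collections import Counter

def templatize(line: str) -> str:
    """Collapse variable tokens (IPs, hex IDs, numbers) into placeholders
    so that lines describing the same kind of event share one template."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

def rare_events(lines, threshold=0.001):
    """Return (index, line) pairs whose template frequency falls below
    `threshold`; rare templates are candidate anomalies worth correlating."""
    templates = [templatize(l) for l in lines]
    counts = Counter(templates)
    total = max(len(templates), 1)
    rare = {t for t, c in counts.items() if c / total < threshold}
    return [(i, line)
            for i, (line, t) in enumerate(zip(lines, templates))
            if t in rare]

# Example (hypothetical file name):
# anomalies = rare_events(open("app.log").read().splitlines())
```

A real system would learn far richer templates and event patterns than these regexes, but the principle is the same: categorize first, then flag what is rare within each stream.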

Of course, humans can also detect log anomalies and find correlations, but doing so is time-consuming, requires skill and intuition, and does not scale easily. Consider a single person performing the task: it demands identifying anomalies in each log source and then determining if and how they correlate with each other. A single human has limited bandwidth, so more likely a team will need to comb through the logs, and correlating all the findings across team members requires time-consuming coordination. It's no wonder troubleshooting can take hours or days. The advantage ML has over humans is that ML can scale almost infinitely.

The only effective way to perform unsupervised ML on logs is to use a pipeline that takes a multi-stage approach to the different parts of the process. ML begins by self-learning how to structure and categorize the logs. This is a critical, foundational step where earlier approaches have fallen short – if a system can't learn and categorize log events extremely well (particularly the rare ones), then it can't detect anomalies reliably. Next, an ML system must learn the patterns of each type of log event. After this foundational learning, the ML system can identify anomalous log events within each log stream. Finally, the ML system looks for correlations between anomalies and errors across multiple log streams. In the end, the process uncovers the sequence of log lines that describes the problem and its root cause. As an added bonus, it can even summarize the problem in natural language and highlight the keywords within the logs that have the most diagnostic value (the rare and "bad" ones). This approach ensures accurate detection of new types of failure modes and surfaces the information needed to identify the root cause.
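As a rough illustration of the final correlation stage, the sketch below assumes each log stream has already produced flagged anomalies (as in the earlier sketch), each carrying a timestamp and a source name. The Anomaly structure, the 60-second grouping window, the two-source requirement, and the keyword summary are illustrative assumptions, not a description of any specific product.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Anomaly:
    ts: float      # epoch seconds
    source: str    # e.g., "api-gateway", "payments-db" (hypothetical names)
    message: str   # the log line that was flagged as anomalous

def correlate(anomalies: List[Anomaly], window: float = 60.0) -> List[List[Anomaly]]:
    """Group anomalies that occur close together in time; keep only the
    groups that span more than one source, since those clusters are the
    ones most likely to describe a single cross-service incident."""
    ordered = sorted(anomalies, key=lambda a: a.ts)
    clusters, current = [], []
    for a in ordered:
        if current and a.ts - current[-1].ts > window:
            clusters.append(current)
            current = []
        current.append(a)
    if current:
        clusters.append(current)
    return [c for c in clusters if len({a.source for a in c}) > 1]

def summarize(cluster: List[Anomaly], top_k: int = 5) -> dict:
    """Crude summary: the sources involved, the most frequent non-trivial
    words across the correlated lines, and the earliest flagged event."""
    words = Counter(w.lower()
                    for a in cluster
                    for w in re.findall(r"[A-Za-z]{4,}", a.message))
    return {
        "sources": sorted({a.source for a in cluster}),
        "keywords": [w for w, _ in words.most_common(top_k)],
        "first_event": cluster[0].message,
    }
```

Time-windowed grouping is only one way to connect the dots; a production system would also weigh event severity and learned co-occurrence patterns, and would phrase the summary in natural language rather than as a keyword list.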

A complete ML system should not require manual training, nor human intervention to review correlations, tune algorithms, or adjust data sets. With an unsupervised ML system, the DevOps team should only have to respond to actual findings of the root cause rather than hunt and research. A few hours of ingesting log data should be sufficient for an ML system to become productive and achieve accurate results.

Larger development and DevOps teams favor increasing levels of specialization to cope with demands for speed and efficiency amid growing complexity. An ML system for determining the root cause of app problems or failures complements this trend, enabling teams to focus on development and operations rather than dropping everything to deal with a crisis. Fast, efficient identification of problems through ML lets teams continue the kind of "develop as we fly the plane" cycles that today's business demands require. ML can also work proactively, finding conditions before they become big problems. In a world that keeps pushing for faster, more productive development with little tolerance for downtime, using ML on logs for root cause analysis changes everything.

Opinions expressed by DZone contributors are their own.
