Machine Learning and Analytics in IT Ops
Ticket data conceals a treasure trove of meaningful information. Here is how machine learning and big data analytics can help uncover it.
Though the ticket data investigation and reporting practices that enterprises currently employ can usually obtain statistics and information on ticket count, ownership, categories, severities, etc., they fall short when it comes to extracting and comprehending the wealth of information concealed in the text comments of ticket data.
This is precisely where IT ops analytics (ITOA) — using big data, machine learning, and natural language processing (NLP) techniques — is gaining momentum. ITOA with big data analytics and machine learning algorithms provides the flexibility to examine larger datasets and determine repetitive patterns and trends more accurately.
Text mining, topic modeling, and clustering techniques are used to examine the unstructured text in ticket data fields such as the title and to classify or cluster tickets into problem patterns or groups. The advantage is that large volumes of ticket data are categorized and condensed into a countable set of topics and functional areas that SMEs and support and operations teams can easily comprehend.
Text analytics of information in ticket data can assist IT managers to:
- Understand BAU activities; identify areas where resources and time are being spent
- Understand hotspots by studying frequently occurring issues within the hybrid IT landscape
- Understand the volume distribution of incidents across various categories, applications, and infra towers
- Gain valuable insights on the historic trend of issues occurring in hybrid IT
- Detect and predict anomalies that could impact business well in advance
- Understand probable areas of automation by amalgamating data from diverse and heterogeneous platforms
The analytics can also be extended to provide a unified view of health metrics, logs, alerts, and incidents across the IT stack, with multi-dimensional problem analysis using statistical tools such as R. Some application areas would involve collective analytics to enrich IT support and operations.
Correlation analysis is a statistical method for measuring the strength of the relationship between two continuous numeric variables. Multiple systems may generate thousands of events each day, but not every alert is converted into an incident. Alert-to-incident correlation is one of the major metrics for identifying the effort spent by the Ops team. Many alerts but few incidents might indicate that the team is spending too much effort on alert categorization. This effort can be reduced by introducing proactive monitoring methodologies that correlate events sharing a common root cause and eliminate false alerts.
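As a rough illustration of this metric, the sketch below computes a Pearson correlation between daily alert and incident counts, along with the overall share of alerts that became incidents. The daily figures and the function names `pearson` and `conversion_ratio` are made up for this example; they are not taken from any real environment or tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

def conversion_ratio(alerts, incidents):
    """Share of alerts that were converted into incidents over the period."""
    total_alerts = sum(alerts)
    return sum(incidents) / total_alerts if total_alerts else 0.0

# Hypothetical daily counts for one week (illustrative values only).
daily_alerts = [1200, 950, 1430, 1100, 1600, 700, 650]
daily_incidents = [14, 11, 19, 12, 22, 8, 7]

r = pearson(daily_alerts, daily_incidents)
ratio = conversion_ratio(daily_alerts, daily_incidents)
print(f"alert/incident correlation: {r:.2f}")
print(f"alert-to-incident conversion: {ratio:.2%}")
```

A high correlation combined with a very low conversion ratio would suggest that alert volume tracks real problems but that most individual alerts are noise, which is the case where event correlation and false-alert suppression pay off.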
Top problem patterns in the incident data can be identified using latent semantic analysis (LSA). LSA is a natural language processing technique (specifically, from distributional semantics) for analyzing the relationships between a set of documents and the terms they contain by producing a set of concepts related to both.
Identification using the LSA algorithm involves:
- Data cleansing: building the bag of words (term-document or document-term matrix, TDM/DTM)
- Data cleansing: stop-word removal
- Data cleansing: stemming
- Term frequency and correlation analysis
- Concept modeling using singular value decomposition (SVD)
- Dimensionality reduction
- Ticket categorization and correlation analysis
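The steps above can be sketched in a few dozen lines of dependency-free Python. This is a minimal illustration under several assumptions, not the pipeline used for the analysis in this article: the ticket titles are invented, the stop-word list and the suffix-stripping `stem` function are deliberately crude, raw term counts stand in for a weighted term frequency, and the truncated SVD is approximated with power iteration and deflation rather than a library routine.

```python
import re
from collections import Counter
from math import sqrt

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "from", "in", "is", "of", "on", "the", "to"}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words, stem the rest."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per term, one column per ticket."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[float(c[term]) for c in counts] for term in vocab]

def leading_singular_triplet(matrix, iterations=300):
    """Approximate the largest singular value and its vectors by power iteration."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    v = [1.0 / sqrt(n_docs)] * n_docs
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
    sigma = sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return u, sigma, v

def truncated_svd(matrix, k):
    """Rank-k SVD: extract one concept at a time, then deflate the matrix."""
    residual = [row[:] for row in matrix]
    concepts = []
    for _ in range(k):
        u, sigma, v = leading_singular_triplet(residual)
        concepts.append((u, sigma, v))
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
                    for i in range(len(u))]
    return concepts

# Hypothetical ticket titles (invented for illustration).
tickets = [
    "MSSQL transaction log backup failed on database server",
    "Transaction log full on MSSQL database",
    "Oracle listener connection refused",
    "Oracle connection timeout from listener",
    "File system threshold exceeded on server",
]

vocab, tdm = term_document_matrix(tickets)
concepts = truncated_svd(tdm, k=2)  # dimensionality reduction to two concepts

# Terms with the largest weight on each concept describe the problem pattern.
top_terms = []
for u, sigma, v in concepts:
    ranked = sorted(zip(vocab, u), key=lambda pair: abs(pair[1]), reverse=True)
    top_terms.append([term for term, _ in ranked[:4]])

# Categorize each ticket by the concept on which it has the largest coordinate.
assignments = [
    max(range(len(concepts)), key=lambda c: abs(concepts[c][1] * concepts[c][2][j]))
    for j in range(len(tickets))
]
for ticket, concept in zip(tickets, assignments):
    print(concept, top_terms[concept], "<-", ticket)
```

On real ticket volumes, a library SVD routine and a proper stemmer would be the natural choice; the hand-rolled versions here only keep the example self-contained.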
In the example below, the word cloud matrix created after LSA processing indicates that log, database, server, threshold, MSSQL, and name are very significant terms.
When analyzing data about frequently occurring problems, Pareto analysis charts (or bubble charts) help to focus on the most significant issues and to analyze broad causes by looking at their specific components. Drilling further into the word cloud scenario above shows that the most prominent issues relate to MSSQL transaction log backup, log space issues, Oracle connection or listener issues, file system, and job failures.
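The arithmetic behind a Pareto chart is simple enough to sketch: rank the problem patterns by ticket count, accumulate their share of the total, and stop at the "vital few" that cover a chosen threshold. The counts below are hypothetical numbers attached to the patterns named above, and the 80% threshold and the `pareto` function are assumptions made for this example.

```python
def pareto(counts, threshold=0.8):
    """Rank categories by count and find the 'vital few' reaching the threshold.

    Returns (rows, vital_few), where rows holds
    (category, count, cumulative_share) tuples in descending count order.
    """
    total = sum(counts.values())
    if total == 0:
        return [], []
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, vital_few, running = [], [], 0
    for category, count in ranked:
        # Keep adding categories until the share covered so far hits the threshold.
        if running / total < threshold:
            vital_few.append(category)
        running += count
        rows.append((category, count, running / total))
    return rows, vital_few

# Hypothetical ticket counts per problem pattern (illustrative values only).
pattern_counts = {
    "MSSQL transaction log backup": 420,
    "Log space": 310,
    "Oracle connection/listener": 180,
    "File system": 60,
    "Job failures": 30,
}

rows, vital_few = pareto(pattern_counts)
for category, count, cumulative in rows:
    print(f"{category:<32}{count:>5}{cumulative:>8.0%}")
print("Focus first on:", ", ".join(vital_few))
```

With these made-up numbers, the first three patterns account for 91% of tickets, which is the kind of concentration a Pareto chart is meant to expose.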
Plotting the problem patterns against a time series to project how often repeated issues are likely to occur helps the IT ops manager plan, organize, and staff resources accordingly.
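One simple way to make such a projection is to bucket the tickets of a problem pattern into weekly counts and extend a straight-line trend by one week. The sketch below assumes a least-squares linear trend, which is only one of many possible forecasting choices and is not prescribed by anything above; the ticket dates and the helper names `weekly_series`, `linear_trend`, and `project` are hypothetical.

```python
from datetime import date

def weekly_series(opened_dates, start, n_weeks):
    """Count tickets in n_weeks consecutive 7-day windows beginning at start."""
    series = [0] * n_weeks
    for opened in opened_dates:
        index = (opened - start).days // 7
        if 0 <= index < n_weeks:
            series[index] += 1
    return series

def linear_trend(series):
    """Least-squares slope and intercept of the series against its index."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def project(series, steps_ahead=1):
    """Extrapolate the linear trend; ticket counts cannot go below zero."""
    slope, intercept = linear_trend(series)
    return max(0.0, intercept + slope * (len(series) - 1 + steps_ahead))

# Hypothetical open dates for tickets matching one problem pattern.
log_space_tickets = [date(2024, 1, day)
                     for day in (2, 4, 9, 10, 12, 15, 16, 18, 20, 23, 24, 25, 26, 27)]

series = weekly_series(log_space_tickets, start=date(2024, 1, 1), n_weeks=4)
forecast = project(series)
print("weekly counts:", series)
print(f"projected tickets next week: {forecast:.1f}")
```

A steadily rising projection for one pattern is a cue to staff for it or, better, to fix the root cause before the volume arrives.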
There are many more scenarios in which machine learning, big data, and R techniques can yield meaningful insights. Unquestionably, ITOA with big data analytics and machine learning is gaining momentum in the world of IT ops.