Three years ago, when we started exploring the use of machine learning techniques in IT Ops, we were completely overwhelmed by and tangled in the colossal number of technologies and jargon surrounding it. The immense volume of documentation, tutorials, and samples available only added to our confusion. Despite all these resources, we couldn’t get any closer to a meaningful account of how to implement a proactive management function using ML.
ML has its roots in the 1950s, and the vision of making intelligent machines is finally gaining momentum now. The past few decades have seen massive growth in data and information. One key thing to understand is that data has now become a commodity: large amounts of many different kinds of data are widely available, allowing for much more accurate predictions than were ever possible in the long history of machine learning. With the advent of cloud and big data technologies, access to powerful machines and the analysis of unstructured data have become much easier and more economical.
In the first part of this series, we explained that it is almost impossible to build a perfect system, and that your best bet is to minimize the occurrence of failures. Prediction is an obvious way to know when systems will fail, and ML should play a key role here. Machine learning solves problems by relying on the same fact: the availability of existing data. At the simplest level, machine learning systems try to predict a value for something based on the historical trends observed. Accomplishing that prediction requires three things:
- A way of describing the subject for prediction.
- A query that can be responded to.
- An algorithm that can take the description and provide an answer to the query.
In ML, a feature is an individual measurable property or characteristic of a phenomenon being observed. The set of features describing one subject is represented as a “feature vector.”
OK, fine. But what are the subjects in our IT for which a phenomenon can be measured?
In the IT space, we call the subject a resource. A resource can be a physical server, VM, virtual desktop infrastructure (VDI), storage, SharePoint site, application, rack, cooling unit, printer, data center, etc. Even a user and their AD, mailbox, or Lync account is a resource for us. Each resource has its own set of descriptive attributes and measurable features. Different instances of each type of object will have different values for those features.
In business intelligence (BI) terminology, we call these facts (the measurable properties) and dimensions (the descriptive attributes). For instance, an apple can have descriptive properties such as red, round, fruit, etc. and measurable properties like sweetness, freshness, and weight. However, although they come from the same apple object family, a Swiss apple might have different measurable features from a U.S. apple in terms of sweetness and size, for example.
A simple illustration of the descriptive and measurable properties of some IT resources is depicted below, where:
- DIM_ = the dimension (descriptive) attributes
- FACT_ = the measurable properties that vary with time and date
A VM resource can have descriptive attributes like name, IP address, allotted CPU, OS, memory, storage, and template used, and measurable properties like current CPU, memory, and sent or received bytes/sec for that monitored interval of time. Similarly, an application resource can have descriptive properties like name, version, environment details, department, or unit it belongs to, as well as measurable properties like hits per day, active sessions, total visits, etc.
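The VM description above can be sketched in Python as follows. This is only an illustration under assumptions: the class names, field names, and sample values are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VmDimension:
    """Descriptive (DIM_) attributes of a VM; these change rarely."""
    name: str
    ip_address: str
    os: str
    allotted_cpu_cores: int
    allotted_memory_gb: int


@dataclass(frozen=True)
class VmFact:
    """Measurable (FACT_) properties for one monitored interval."""
    timestamp: datetime
    cpu_percent: float
    memory_percent: float
    bytes_sent_per_sec: float
    bytes_received_per_sec: float


def to_feature_vector(fact: VmFact) -> list[float]:
    """Flatten one interval's measurements into a numeric feature vector."""
    return [
        fact.cpu_percent,
        fact.memory_percent,
        fact.bytes_sent_per_sec,
        fact.bytes_received_per_sec,
    ]


# Hypothetical sample values, for illustration only.
vm = VmDimension("app-vm-01", "10.0.0.12", "Linux", 4, 16)
sample = VmFact(datetime(2024, 1, 1, 9, 0), 72.5, 61.0, 1200.0, 3400.0)
vector = to_feature_vector(sample)
print(vm.name, vector)
```

Keeping the descriptive attributes separate from the time-varying measurements mirrors the dimension and fact split: the dimension record identifies the resource, while each fact record becomes one feature vector.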
So, while an IT resource can have many, many features, not all of them are relevant to a given problem. But how do we know about a given problem or recognize normal behavior?
All IT operations have service-level agreements (SLAs) for different support areas based on criticality. In general, a breach is said to have occurred when the delivered service deviates from the agreement or contract. Apart from industry-standard best practices, it is largely the SLAs agreed with the customer that define a problem area.
Some generic operational SLAs are:
- Availability of the production systems should be 99.95%.
- CPU usage by critical applications should be <80%.
- Data has to be transmitted in an encrypted and secure manner.
- Application should handle 200 additional users without affecting response time.
- Providers must optimize the network traffic performance across geographies/regions to achieve sustained metrics between two geographies/regions.
- The IT system (especially servers) must be protected from unauthorized physical access and from negative environmental influences like fires, earthquakes, floods, pests, power outages, etc.
- The cloud compute service must offer at least one tier of service with uptime availability SLA of 99.95% or higher.
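Two of the SLAs listed above (99.95% availability and CPU usage below 80%) can be turned into simple numeric checks. The sketch below assumes a 30-day month and uses hypothetical function names and sample readings; real SLA contracts define their own measurement windows.

```python
def allowed_downtime_minutes(availability_percent: float, period_minutes: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (1 - availability_percent / 100)


def measured_availability(downtime_minutes: float, period_minutes: float) -> float:
    """Availability (in percent) actually achieved over the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


def cpu_breaches(samples: list[float], limit: float = 80.0) -> list[float]:
    """CPU readings that violate a 'usage should be < limit' SLA."""
    return [s for s in samples if s >= limit]


MONTH_MINUTES = 30 * 24 * 60  # assumption: a 30-day month

# A 99.95% target leaves roughly 21.6 minutes of downtime per 30-day month.
budget = allowed_downtime_minutes(99.95, MONTH_MINUTES)

# Hypothetical month with 30 minutes of downtime: the budget is exceeded.
availability = measured_availability(30, MONTH_MINUTES)
availability_breached = availability < 99.95

# Hypothetical CPU readings (percent) for a critical application.
breaches = cpu_breaches([45.0, 79.9, 80.0, 93.5])

print(round(budget, 1), availability_breached, breaches)
```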
Availability can be a measurable feature for one requirement, whereas CPU or memory might be the feature required for another. It is important that the machine learning process being designed uses features that differentiate among the states being looked for, which in turn hinges on the question being asked. Selecting the most appropriate features for a given problem is called “feature selection,” and it is a core part of the broader practice of “feature engineering.”
So, on careful consideration, all the questions to be posed narrow down to the following. If there is a deviation from the intended "normal behavior," then:
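One simple way to shortlist features is to rank them by how strongly they correlate with the condition of interest. The sketch below uses Pearson correlation as a filter; this is just one of many selection techniques, and the feature names and readings are made up for illustration.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical readings for six monitoring intervals.
features = {
    "cpu_percent": [35, 42, 88, 91, 40, 95],
    "memory_percent": [50, 55, 70, 72, 52, 80],
    "allotted_cpu_cores": [4, 4, 4, 4, 4, 4],  # constant, so uninformative
}
# 1 = the response-time SLA was breached in that interval, 0 = it was not.
slow_response = [0, 0, 1, 1, 0, 1]

# Rank features by the strength of their relationship with the breach flag.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], slow_response)),
    reverse=True,
)
print(ranking)
```

A constant attribute such as the allotted core count cannot distinguish a breached interval from a healthy one, so it falls to the bottom of the ranking. Correlation only captures linear, one-feature-at-a-time relationships, so treat the ranking as a starting point rather than a final answer.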
- What are the anomalies or unusual patterns detected?
- What will be the impact of the anomalies now and in future?
- What are the chances of reoccurrence of the unusual behaviors?
- How should these risks be mitigated?
Choosing informative, discriminating, and independent features is a crucial step for effective algorithms in pattern recognition, classification, and regression. Feature engineering helps you build dynamic sets of rules and scenarios as required by customer SLAs. Supervised learning systems need to be given examples of what is “good” and what is “bad.” The algorithms have to be trained, enriched, and adequately tested with sets of rules and training data. In contrast, unsupervised techniques are generally simpler in that they need no labeled examples: they try to find patterns within a set of given observations, patterns that you didn’t previously know existed. Still, even these algorithms require preliminary observations to work from.
Pattern matching and anomaly detection also play a crucial role in feature engineering. When analyzing data about frequently occurring problems or tracking failures, Pareto analysis charts or bubble charts help focus on the most significant data and help analyze broad causes by looking at their specific components. In our example, the document-term matrix (DTM) indicates that log, database, server, threshold, MS SQL, and name are very significant terms. On drilling further into the DTM, the issues related to MS SQL, including transaction log backup, log-on length mismatch, job failure, and Oracle connection or listener issues, appear to be prominent areas of investigation.
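A document-term matrix of the kind mentioned above can be built from ticket descriptions with a few lines of Python. The ticket texts and stop-word list below are invented for illustration; they are not the data behind the example in this article.

```python
import re
from collections import Counter

# Hypothetical incident descriptions.
tickets = [
    "MS SQL transaction log backup failed on database server",
    "SQL job failure log threshold exceeded on server",
    "Database server name mismatch raised threshold alert",
]

STOP_WORDS = {"on", "the", "a", "of"}  # assumption: a tiny stop-word list


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic terms that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


# One term-frequency counter per ticket (document).
doc_counts = [Counter(tokenize(t)) for t in tickets]

# Columns of the matrix: every distinct term, in alphabetical order.
vocabulary = sorted(set().union(*doc_counts))

# Rows are tickets, columns are terms, cells are term frequencies.
dtm = [[counts[term] for term in vocabulary] for counts in doc_counts]

# Column totals show which terms dominate across all tickets.
term_totals = Counter()
for counts in doc_counts:
    term_totals.update(counts)

print(term_totals.most_common(5))
```

Sorting the column totals in descending order gives the ordering a Pareto chart would plot, which is how the most significant terms surface.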
Symptom monitoring assumes that symptoms are side effects of errors, so these approaches evaluate monitoring data that reflects those side effects. The idea behind failure prediction based on monitoring data is that errors like memory leaks can be detected by their side effects on the system, such as unusual memory usage, CPU load, disk I/O, or unusual function calls in the system.
For instance, the acceptable range of normal body temperature is generally between 97°F (36.1°C) and 99°F (37.2°C). Anyone can tell whether somebody is ill just by looking at their body temperature. But if the same observer is asked for the cause of the sickness or the kind of illness, the answer will be to visit a medic or a physician. What is the difference between these two scenarios?
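As a rough illustration of symptom monitoring, a memory leak can show up as a sustained upward trend in memory usage. The sketch below fits a least-squares line to recent samples and flags a steep slope; the threshold and the readings are assumptions chosen for the example, not recommended values.

```python
def slope_per_interval(samples: list[float]) -> float:
    """Least-squares slope of the samples against their interval index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples: list[float], min_slope: float = 0.5) -> bool:
    """Flag a steady climb of at least min_slope percentage points per interval."""
    return slope_per_interval(samples) >= min_slope


# Hypothetical memory usage (percent) over six monitoring intervals.
steady = [61.0, 62.5, 60.8, 61.9, 61.2, 62.0]
leaking = [55.0, 57.1, 59.4, 61.0, 63.2, 65.5]

print(looks_like_leak(steady), looks_like_leak(leaking))
```

A trend test like this reacts before a fixed threshold such as 80% is crossed, which is the point of predicting a failure from its symptoms rather than waiting for the failure itself.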
A physician's brain is trained to predict health conditions based on symptoms, diagnosis reports, medical experience, and area of study. Performance engineering architects are medics in IT. What if we want to build a platform that can predict the health condition of a system, process, service, application, or database?
We're talking about creating a system with the brain of a performance engineering architect.
Artificial neural networks (ANNs), which are most commonly trained with supervised learning, are a technique for building systems that mimic the human brain. The concept of the neural network has been around for decades, but it is only relatively recently that its true power has been realized. A neural network is made up of artificial neurons, with each neuron connected to other neurons. As different training examples are presented to the network along with the expected output of the system, the network works out which neurons it needs to activate in order to achieve the desired output under different circumstances. Consider a simple neural network analogy for the slow performance of an application.
Large object heap size might lead to excessive memory usage, which in turn causes timeout errors. Similarly, high contention rates may lead to thread lock, which would increase the loading time of the application.
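To make the idea concrete, here is a minimal sketch of a single artificial neuron (not a full network) that learns to flag slow performance from two symptoms. The two inputs, the scaled readings, and the labels are all invented for illustration; a real model would need far more data and more careful validation.

```python
from math import exp


def sigmoid(z: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def train_neuron(samples, labels, epochs=2000, learning_rate=0.5):
    """Train one logistic neuron with per-sample gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in zip(samples, labels):
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = output - label  # gradient of the log loss w.r.t. the weighted sum
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


def predict_slow(weights, bias, inputs) -> bool:
    """True when the neuron's activation is at least 0.5."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias) >= 0.5


# Hypothetical readings scaled to 0..1: (large object heap size, contention rate).
samples = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.8, 0.7), (0.9, 0.6), (0.7, 0.9)]
labels = [0, 0, 0, 1, 1, 1]  # 1 = the application was slow

weights, bias = train_neuron(samples, labels)
predictions = [predict_slow(weights, bias, s) for s in samples]
print(predictions, predict_slow(weights, bias, (0.85, 0.8)))
```

The weights start at zero, so the run is deterministic. A multi-layer network stacks many such neurons so that intermediate symptoms, such as excessive memory usage or thread locks, can be learned as hidden activations.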
After reading this article, you should understand the role of features, feature selection, and feature engineering in building a predictive management function in IT. Artificial neural networks are one of the learning techniques used to build systems that mimic humans and act as predictive management functions.
My objective in this series is not to describe the numerous ML algorithms and modeling techniques available, but to examine how ML can be implemented at the various stages of proactive fault management, or AI-led operations.
We will cover the proactive fault management function in more detail in my next set of articles. Stay tuned!