Machine Learning in IT Language (Part 1)
Prediction is an obvious solution for IT-enabled services to know when systems fail, and ML will play a key role here. But there's little documentation on using ML in the IT space.
Nowadays, predictions are used in a myriad of industries. From forecasting financial markets and stocks to projecting the path of meteorites, predictions are used everywhere. Even in IT and IT-enabled services (ITES), prediction techniques are becoming common practice and are frequently used to anticipate system failures. There is no debate that the failure of a critical application will have a huge impact on business; nevertheless, computer systems have reached a level of complexity that precludes the design of an absolutely perfect and accurate system. Therefore, failures in a system cannot be eliminated altogether, but their likelihood can be minimized.
This perception is laying the groundwork for an approach called proactive fault management: a procedure for dealing with faults before a failure has occurred. Proactive procedures are more effective when a failure scenario is detected well before it happens, which is possible when it can be identified from the health state of the monitored system. If a failure can be predicted, preventive action can be taken to reduce the consequences of the pending failure. This suggests that proactive fault management is an effective approach to enhancing reliability, and that prediction is the natural way to know when a failure is about to happen.
Evolving from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning (ML) is concerned with the study and construction of algorithms that can learn from and make predictions on data. Machine learning is a core sub-area of AI that enables computer systems to learn without being explicitly programmed. At its heart are algorithms, which are sets of rules to be followed in problem-solving. At a high level, the main aim of any ML program is either to predict an outcome or to classify an item. In the context of ML, an algorithm can be thought of as a formula applied to incoming data.
(Figure: a quick glance at various ML algorithms in use.)
There are a lot of articles on the various algorithms in use, as well as comprehensive material explaining ML techniques with examples such as predicting salaries, forecasting election results, and generating weather reports. But there is limited documentation on how these techniques are used in the IT space. My objective with this series is not to describe the numerous ML algorithms available, modeling techniques, or procedures to build algorithms, but to discuss ML implementation in the various stages of proactive fault management, or AI-led operations, in the context of IT.
Below is a layered view of an enterprise hybrid IT platform, which might include traditional silo-based client-server models as well as applications hosted in virtual environments or in private, public, or hybrid clouds. An issue in any of the layers, including the core building blocks (storage, memory, network, or compute), might cause systems to fail.
Well! There are a lot of scenarios that might cause system failures. Some of them are:
- Hardware problems
- Environment problems
- Third-party components: open-source software and commercial off-the-shelf (COTS) products
- Unproven design
- Bad configuration
- Attacks and threats
- Novice users
- Software faults
All IT operations have service-level agreements (SLAs) for different support areas based on criticality. For example, a production instance might have an SLA of 99% availability, in contrast to a development box, which would have a less stringent target. The SLA for an important system may say that the system is expected to be available essentially 24/7, and if this requirement is not met, there will be some penalty. In general, a breach is said to have occurred when the delivered service deviates from the agreement or contract.
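To make the availability figure concrete, here is a minimal sketch that converts an availability target into a downtime budget and checks for a breach. The function names and the 30-day measurement window are assumptions chosen for illustration; real SLAs define their own measurement periods and exclusions.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Return the downtime budget (in hours) implied by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1.0 - availability_pct / 100.0)


def is_breached(observed_downtime_hours: float,
                availability_pct: float,
                period_hours: float) -> bool:
    """A breach occurs when observed downtime exceeds the budget."""
    return observed_downtime_hours > allowed_downtime_hours(
        availability_pct, period_hours
    )


# Assumed measurement window: 30 days.
HOURS_IN_30_DAYS = 30 * 24

budget = allowed_downtime_hours(99.0, HOURS_IN_30_DAYS)
print(f"99% over 30 days allows about {budget:.1f} hours of downtime")
print("8 hours down, breached?", is_breached(8.0, 99.0, HOURS_IN_30_DAYS))
```

Under these assumptions, a 99% target leaves roughly 7.2 hours of downtime per 30 days, which is why critical systems often carry tighter targets.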
Often, incidents are reported by business users, and a root cause analysis is done after the failure has occurred in order to find and repair a bug. The monitoring team also makes short-term predictions on the basis of run-time monitoring. These are conventional modes of operation, or rather reactive ways of performing support operations (repairing after damage occurs), and therefore are not capable of reflecting the dynamics of run-time systems and the failure of processes. A snapshot of the people's view of the DevOps model, along with the key BAU (business as usual) activities, is depicted below:
Monitoring plays a key role in building a proactive fault management function. The function’s core responsibility is to predict during run-time whether a failure will occur in a short period of time based on an assessment of the monitored running resource state. The proactive function must allow:
- The selection of the most noteworthy variables for predicting failure, possibly only a few out of the many hundreds that could be observed.
- A technique to interpret the gathered data and diagnose erroneous system states in order to predict future failures.
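The two requirements above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method this series will develop: variables are ranked by the absolute Pearson correlation between each metric and a failure label, and a midpoint threshold on the top-ranked metric stands in for the interpretation technique. The metric names and readings are synthetic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_variables(samples, failed):
    """Rank metrics by |correlation| with the 0/1 failure label, strongest first."""
    scores = {name: abs(pearson(vals, failed)) for name, vals in samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def learn_threshold(values, failed):
    """Midpoint between the mean healthy reading and the mean pre-failure reading."""
    healthy = [v for v, f in zip(values, failed) if f == 0]
    failing = [v for v, f in zip(values, failed) if f == 1]
    return (sum(healthy) / len(healthy) + sum(failing) / len(failing)) / 2


# Synthetic monitoring data (hypothetical metric names, made-up readings).
samples = {
    "memory_used_pct": [40, 45, 50, 88, 92, 95, 42, 90],
    "disk_io_wait_ms": [5, 6, 4, 5, 7, 6, 5, 4],
    "cpu_load": [0.3, 0.5, 0.4, 0.9, 0.8, 0.95, 0.35, 0.85],
}
failed = [0, 0, 0, 1, 1, 1, 0, 1]  # 1 = a failure followed this reading

ranking = rank_variables(samples, failed)
top_metric = ranking[0][0]
threshold = learn_threshold(samples[top_metric], failed)

# Assumes higher readings indicate trouble, which holds for this synthetic data.
new_reading = 93
print("ranking:", [name for name, _ in ranking])
print(f"{top_metric} = {new_reading}: failure predicted?", new_reading > threshold)
```

Correlation ranking and a single threshold are deliberately simple stand-ins; in practice the selection and interpretation steps would use whichever ML technique suits the monitored system.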
What we can understand from the above is that it is almost impossible to build a perfect system, and the way out is to minimize the occurrence of failures. Prediction is an obvious solution to know when systems will fail, and ML will play a key role here. If prediction can be done correctly, then countermeasures can be taken, such as starting or restarting a service, manual or automatic reconfiguration, optimization, healing tasks, protection, etc. We will look in more detail at the proactive fault management function, and at what ML can add to it, in my next set of articles.
Stay tuned for the rest of the series!