A Roadmap to AIOps — Part 1
In my conversations with customers about AIOps, I frequently hear concerns about maturity. Customers may believe, for example, they aren’t mature enough to implement analytics or that there is a linear progression for AIOps capabilities and they must start from a certain point corresponding to their own maturity self-assessment. Oftentimes, they say something like, "I have to get X in place first before I can even think about Y." Usually, the “X” they are talking about is getting a handle on exploding amounts of events and alerts or unifying dispersed monitoring.
I understand and empathize with their concerns. At the same time, I think that decades of ITIL training, with its rigid and regimented processes reinforced by analysts and vendors, have made it difficult for all of us to see the possible or envision alternative solutions to our long-standing problems. AIOps holds the promise of step-function improvement without the strictures of ITIL, but there is very little practical guidance about what that might look like.
In this article, I want to propose some concrete steps that I believe are required or highly desirable to build an AIOps practice. I will then offer a "roadmap" for taking these steps in an AIOps implementation, indicating which are prerequisites for others, which can be pursued simultaneously, and which have dependencies.
A Quick AIOps Refresher
Gartner has identified an emerging IT market trend: traditional IT processes and tools are not suited to dealing with the challenges of modern digital business (more information here). This has to do with the velocity, variety, and volume of digital data; the distribution of responsibility and budget in the broader organization outside of IT; and the need to move from offline, historical analysis to real-time analytics.
Gartner’s response to this trend is AIOps: the merging of IT Service Management (ITSM), IT Operations Management (ITOM), and IT Automation at the data layer. That data must reside in a big data platform that supports the application of real-time analytics as well as deep historical queries. The analytics must be managed by machine learning that supports both supervised and unsupervised processing as the data streams in.
The idea is that tools in the IT silos remain sovereign, e.g. Service Management still handles requests, incidents, etc. and Performance Management still monitors metrics, events, and logs, but that their data is joined and subjected to machine-driven analysis for the purposes of enabling a) better, faster decisions and b) process as well as task automation.
Keep the End State in Mind
Remember that the end state is a system where data streams freely from multiple IT data sources into a big data platform; that data is analyzed upon ingestion and post-processed with data from other sources and types; machine learning is used to manage and modify the analytics and algorithms; and automated workflows are triggered, whose output also becomes a data feed into the system. The system adapts and responds as data volumes, types, and sources change, automatically adjusting response and informing administrators as needed.
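To make that end state concrete, here is a minimal sketch of the loop it describes: records stream in, each one is analyzed on ingestion, a workflow may be triggered, and the workflow's output is fed back in as new data. Everything here is an assumption for illustration: the `Record` fields, the fixed CPU threshold standing in for machine-managed analytics, and the workflow name `restart_service` are hypothetical, and the in-memory list only stands in for a big data platform.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Record:
    source: str   # hypothetical source label, e.g. "monitoring" or "service_desk"
    kind: str     # hypothetical record type, e.g. "metric" or "incident"
    payload: dict = field(default_factory=dict)


def analyze(record):
    """Toy analysis on ingestion: a fixed threshold stands in for real analytics."""
    if record.kind == "metric" and record.payload.get("cpu_pct", 0) > 90:
        return "restart_service"  # hypothetical workflow name
    return None


def run_workflow(name, trigger):
    """Pretend to run an automated workflow and report its outcome as data."""
    return Record(
        source="automation",
        kind="workflow_result",
        payload={"workflow": name, "host": trigger.payload.get("host"), "status": "ok"},
    )


def process(stream):
    """Ingest records, analyze each one, and feed workflow output back in."""
    queue = deque(stream)
    store = []  # stands in for the big data platform
    while queue:
        record = queue.popleft()
        store.append(record)
        action = analyze(record)
        if action is not None:
            # The workflow's result becomes another data feed into the system.
            queue.append(run_workflow(action, record))
    return store


if __name__ == "__main__":
    incoming = [
        Record("monitoring", "metric", {"host": "web-1", "cpu_pct": 97}),
        Record("service_desk", "incident", {"id": "INC-1"}),
    ]
    for stored in process(incoming):
        print(stored.source, stored.kind, stored.payload)
```

The point of the sketch is the feedback loop, not the analysis itself: in a real system the threshold check would be replaced by analytics that adapt as the data changes.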
Early Stage: Identify Your Current Use Cases
In a situation of change, transformation, and fluidity, the best place to start is with what you know. Most customers have initiatives around solving for use cases that they can’t currently accommodate, or around adapting how they currently solve for a use case so that it is more responsive, more scalable, able to accommodate new technologies, etc.
I always encourage customers to enumerate the list of use cases that they currently address or want to address. Having disclosure and transparency around current "desired" state opens the dialogue to:
- Questioning the "why" of those desired outcomes
- Assessing the priority of specific use cases
- Highlighting gaps in capability, tools, skills, or process
This is a terrific starting point for developing an AIOps strategy that will be successful. Emphasis on “starting.” We don’t know what we don’t know — new use cases will come up, new desired outcomes will emerge, and priorities will shift as your business and technologies change. New AIOps approaches will open new possibilities and pose new challenges.
The important thing is to start down a path with a purpose that bridges where you are to where you want to be. If where you want to be changes, no problem, you can course correct. However, if you don’t know where you are and don’t have a realistic understanding of what is needed to get to the desired state, you will end up unfocused and likely unsuccessful.
Early Stage: Assess Your Data Freedom
The foundational element for AIOps is the free flow of data from disparate tools into the big data repository. Accordingly, you must assess the ease and frequency with which you can get data out of your IT systems. The optimal model is streaming — being able to send data continuously in real-time.
Few IT monitoring and service desk tools support streaming of outbound data. They may support programmatic interaction via REST API in more current versions or iterations. However, if they are based on traditional relational (SQL) databases such as Oracle, even having a programmatic interface doesn’t mean that they will be able to support streaming. The performance impact on production systems using relational databases may be too great, as they are not designed to support the continuous outflow of data.
Getting clear on your data streaming capabilities is an early and high-priority activity in developing an AIOps strategy. Answer these questions for each data source:
- How do I get data out of my current IT tools?
- What data can I get?
- Can I do it programmatically?
- How frequently can I do it?
The constraints you discover may cause you to change your data consolidation strategy (e.g. start with batch uploads vs streaming) or consider replacing your IT tools with ones that will support real-time data streaming.
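One way to keep the answers to these questions comparable across tools is to record them in a small structure per data source and derive a starting ingestion approach from it. The following is only a sketch: the field names, the example tools, and the three outcome labels are my own assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataSourceAssessment:
    name: str
    export_method: str            # how data gets out, e.g. "rest_api" or "csv_export"
    data_available: List[str]     # what data can be extracted
    programmatic: bool            # can extraction be scripted?
    supports_streaming: bool      # can the tool send data continuously?
    min_interval_seconds: Optional[int] = None  # how often extraction is safe, if known


def suggest_ingestion(assessment):
    """Map the assessment answers to a starting ingestion approach."""
    if assessment.supports_streaming:
        return "streaming"
    if assessment.programmatic:
        return "scheduled batch"
    return "manual export (consider replacing the tool)"


if __name__ == "__main__":
    # Hypothetical tools, for illustration only.
    sources = [
        DataSourceAssessment("metrics-tool", "event_bus", ["metrics"], True, True),
        DataSourceAssessment("service-desk", "rest_api", ["incidents", "changes"],
                             True, False, min_interval_seconds=900),
        DataSourceAssessment("legacy-monitor", "csv_export", ["events"], False, False),
    ]
    for source in sources:
        print(f"{source.name}: {suggest_ingestion(source)}")
```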
Early Stage: Agree on a System of Record
A second foundational element for AIOps is organizational alignment and communication. Suggesting that IT Operations and IT Service Management come together to review joint data requires that the teams agree on a "source of truth" and establish a regular cadence of interaction with clear roles and responsibilities. The latter is a larger topic that requires a longer conversation that I will pursue at a later date. Here, I want to focus on making joint decisions based on shared data.
The data I’m speaking of here is not all the data that might flow into the AIOps big data store for analysis. It is the data required for IT leaders and practitioners to understand what is happening in their environment, understand what actions have been or can be taken, make decisions, and ultimately track their effectiveness. With respect to an agreement on data, teams must determine:
- A minimum set of data that is required to overcome the limitations of the status quo
- Where the data is to reside
- The joint view/access that teams will share
In many mature IT organizations, that system is the Service Desk because in the traditional ITIL model, the Service Desk is where request, incident, and change data was expected to co-exist. This model gets challenged, however, when DevOps teams use Jira to log defects and enhancements, when they use APM tools whose events and telemetry aren’t captured by IT Operations, or when Security teams are working independently to identify threats.
Preparing to implement AIOps means identifying all of the effective causes and resultant indicators in your application, service, or business value chain and putting a plan in place to bring that data together. You may leverage the big data platform if you can build meaningful dashboards on top of it that filter the mass aggregate of data for the specific uses of different IT audiences. Single data source, multiple views. However, it may make more sense in your environment to select a subset of data (e.g. Jira tickets, APM events, etc.) and feed it into your established system of record.
Early Stage: Determine Success Criteria and Begin Tracking Them
Successful management of any business, and certainly IT, begins with an understanding of what key performance indicators (KPIs) or metrics best indicate success or failure. It seems facile to say but is worth repeating that:
- Understanding what to measure
- Implementing consistent and robust measurement
- Regularly reporting out or providing visibility to the performance measures and
- Holding responsible parties accountable
are all required for an actionable understanding of your business.
Most organizations measure lots of things. Most IT tools come with several measurement tools and templates, but what is frequently missing is the understanding of the business needed to identify which of those measures are important. I have been in many situations where teams report out to me on "performance," but when I ask why such a measure is important or what is driving it, the response is a blank stare or "I’ll get back to you."
Quantity doesn’t trump quality in measurement. It may be that there is one thing that needs to be measured — assuming you know what drives that measure up or down. Those things too may need to be measured, but without understanding causal relationships, simply throwing graphs on a chart is unhelpful and more often detrimental. Understanding your KPIs is understanding your business.
Also often neglected is a comprehensive process for sharing information, engaging stakeholders, determining actions, and holding people accountable. Visibility is primary, but visibility without action or response is empty. When action is required, people and teams need to make commitments with timelines and execute against them. These need to be documented and measured as well to ensure that the business, and hence the KPIs, move in the right direction.
Mid-Stage: Assess Current and Future State Data Models
This step is critical, but few customers understand it or feel comfortable addressing it. Essentially, you must take stock of the data model for each data source you want to use in your AIOps solution, identify the data model required to realize your AIOps use cases, and determine how the data from different sources will interact to deliver the desired results.
The reason this is challenging is that the data model in most IT tools is hidden from the user. Few organizations have an idea about how big data platforms (NoSQL) differ from traditional databases (SQL), and fewer still have data analyst/science expertise. I have written a separate article here on big data for AIOps that gives some background and context. Here, I want to address the idea of data "relationships" for the purposes of analytics.
The AIOps approach is to join data from different IT (and non-IT) sources in a single big data repository. The idea is then to make that data "talk to each other" to find relationships in the data that will yield insights unattainable when the data sits separately in different silos. But what are those relationships? How can diverse data from different sources with different structures be brought together for analysis, and who can do it?
There are a number of shared data structures that can be processed by an AIOps system without additional modification from AIOps practitioners:
- Timestamps: events, logs, and metrics all have time signatures that can be used to bring them together around a point in time or a time window. Timestamps can be used to correlate events with each other and with time-series data for causal analysis.
- Properties: using the term loosely for key-value pairs (key:value) of information associated with an event, log, or metric such as "status," "source," "submitter," etc. Properties can be used to create relationship models between different datasets.
- Historicity: the past performance of time-series or event activity data. This can be used to forecast future performance or predict when a threshold will be reached (e.g. saturation, degradation, etc.).
- Seasonality: the shape or regularity of time-series data over a day, week, month, etc. Seasonality can be used, for example, to correlate multiple data sets or to anticipate resource requirements for scalability.
- Application, service, and business models: if you have a robust and regular discovery and configuration management practice, you can leverage these to inform an AIOps platform with asset relationship information for grouping, correlation, suppression, de-duplication, etc.
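As an illustration of the first two structures in the list above, the sketch below pairs events with metric anomalies when they share a property and fall within the same time window. The record layout, the `host` property, and the five-minute window are assumptions chosen for the example, and a match of this kind is a candidate relationship to investigate, not proof of cause.

```python
from datetime import datetime, timedelta


def correlate(events, anomalies, window=timedelta(minutes=5), key="host"):
    """Pair events and anomalies that share a property and are close in time."""
    pairs = []
    for event in events:
        for anomaly in anomalies:
            same_property = event.get(key) == anomaly.get(key)
            close_in_time = abs(event["timestamp"] - anomaly["timestamp"]) <= window
            if same_property and close_in_time:
                pairs.append((event["id"], anomaly["metric"]))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data.
    events = [
        {"id": "E1", "host": "web-1", "timestamp": datetime(2024, 1, 1, 12, 0)},
        {"id": "E2", "host": "db-1", "timestamp": datetime(2024, 1, 1, 12, 30)},
    ]
    anomalies = [
        {"metric": "cpu", "host": "web-1", "timestamp": datetime(2024, 1, 1, 12, 3)},
        {"metric": "latency", "host": "web-1", "timestamp": datetime(2024, 1, 1, 12, 20)},
        {"metric": "disk", "host": "db-1", "timestamp": datetime(2024, 1, 1, 12, 31)},
    ]
    print(correlate(events, anomalies))  # [('E1', 'cpu'), ('E2', 'disk')]
```

The nested loop is fine for a handful of records; at real event volumes you would sort or bucket by time first.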
In general, IT time-series data is well formed and structured. Correlating, analyzing, and forecasting time-series data is a fairly well-established practice in IT Operations monitoring and management tools. What changes for AIOps implementation is the need to bring together IT and non-IT data (e.g. user counts + performance, latency + conversions, etc.); to increase the granularity of data, e.g. from five minutes to under one minute; and to apply analytics to streaming data, in "real-time" or on ingestion, rather than through ad-hoc historical queries.
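To show what forecasting on well-formed time-series data can look like at its simplest, here is a sketch that fits a straight line to evenly spaced samples and estimates how many more samples it will take to reach a threshold. A least-squares linear trend is my own simplification for the example; real tools typically account for seasonality and noise, and the sample values are hypothetical.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def samples_until_threshold(values, threshold):
    """Estimate how many more samples until the trend line reaches the threshold.

    Returns None when the trend is flat or falling, and 0.0 when the trend line
    is already at or above the threshold.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing_index = (threshold - intercept) / slope
    return max(0.0, crossing_index - (len(values) - 1))


if __name__ == "__main__":
    disk_used_pct = [50, 55, 60, 65, 70]  # hypothetical samples, one per interval
    print(samples_until_threshold(disk_used_pct, 90))  # 4.0
```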
For IT events that have structured, semi-structured, or unstructured properties, AIOps represents a paradigm shift. To begin with, most IT event data is not well-formed. Human-generated events are inconsistent, with large amounts of missing or unstructured data. Machine-generated events have more consistency, but are often incomplete and have large amounts of repetitive, semi-structured data. They also arrive at a volume an order of magnitude larger than human-generated events. Machine logs, seen as events, are essentially blobs of semi-structured data. For AIOps analysis of events to be effective, AIOps systems must overcome the challenges of poor, missing, incomplete, incorrect, and unstructured data.
This is why much of the current activity in the AIOps space is centered on event management, analysis, and correlation. Once data begins to flow into an AIOps platform, customers must consider how they will approach data structure and integrity to support machine analysis. One strategy is to perform "ETL" (Extract, Transform, Load) on incoming data. Specifically, normalizing and transforming data as it flows in, to adhere to centralized standards so the data can be correlated and analyzed.
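A minimal sketch of that transform step might look like the following, with one small function per source mapping raw events onto a shared schema. The raw field names (`hostname`, `sev`, `epoch`, `ci`, `priority`, `opened_at`), the target schema, and the severity mapping are all hypothetical; they stand in for whatever your own tools actually emit.

```python
from datetime import datetime, timezone

# Hypothetical centralized standard: every source's severity maps to one vocabulary.
SEVERITY_MAP = {
    "crit": "critical", "p1": "critical",
    "warn": "warning", "p3": "warning",
}


def from_monitoring(raw):
    """Transform a (hypothetical) monitoring event into the common schema."""
    return {
        "source": "monitoring",
        "host": raw["hostname"].lower(),
        "severity": SEVERITY_MAP.get(raw["sev"].lower(), "unknown"),
        "timestamp": datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        "message": raw["msg"].strip(),
    }


def from_service_desk(raw):
    """Transform a (hypothetical) service desk ticket into the common schema."""
    return {
        "source": "service_desk",
        "host": raw["ci"].lower(),
        "severity": SEVERITY_MAP.get(raw["priority"].lower(), "unknown"),
        "timestamp": datetime.fromisoformat(raw["opened_at"]),
        "message": raw["short_description"].strip(),
    }


if __name__ == "__main__":
    alert = {"hostname": "WEB-1", "sev": "CRIT", "epoch": 1704110400, "msg": " CPU high "}
    ticket = {"ci": "web-1", "priority": "P1",
              "opened_at": "2024-01-01T12:00:00+00:00",
              "short_description": "Site is slow"}
    for event in (from_monitoring(alert), from_service_desk(ticket)):
        print(event)
```

Once both events share the same `host`, `severity`, and `timestamp` fields they can be compared directly, but note that every new source or new severity value means another manual change to the mapping.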
This approach suffers from limitations that will likely make it untenable for many enterprises. First, the amount of processing required to transform the data on ingestion but before analysis will likely either render the system not real-time or be cost prohibitive. Second, any centralized standard that is manually managed will require constant maintenance that will not be able to keep up with changes and will only comprehend the known, not the unknown or new.
A more promising strategy is “tagging,” which is what is employed as a best practice in most cloud services. Tagging allows the hashing of variable attributes of different types of objects, which can then be referenced, sorted, correlated, and analyzed using the tags, regardless of what the object is or how it is tagged. Instead of requiring mapping of pre-defined properties with common values, tags are fluid and can change with the data. Tagging is how NoSQL databases handle attributes and how hyper-scale analytics tools like Elasticsearch are enabled. Additionally, tagging can be done in real-time by machines as data flows in, which overcomes blindness to the unknown and human-scale limitations.
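To illustrate the idea, here is a small sketch in which tags are derived automatically from whatever attributes an object happens to carry, and a hash-based index lets you look up objects of any type by tag. This is a toy in-memory model, not how any particular NoSQL database or Elasticsearch implements it; the `TagIndex` class, the `key:value` tag format, and the sample objects are assumptions for the example.

```python
from collections import defaultdict


def auto_tags(obj):
    """Derive tags from whatever simple key/value attributes the object has."""
    return {f"{key}:{value}" for key, value in obj.items()
            if isinstance(value, (str, int, bool))}


class TagIndex:
    """Toy in-memory index from tag to object ids (dicts and sets are hash-based)."""

    def __init__(self):
        self._ids_by_tag = defaultdict(set)

    def add(self, obj_id, obj, extra_tags=()):
        # No predefined schema: each object contributes whatever tags it yields.
        for tag in auto_tags(obj) | set(extra_tags):
            self._ids_by_tag[tag].add(obj_id)

    def find(self, *tags):
        """Return the sorted ids of objects that carry all of the given tags."""
        if not tags:
            return []
        id_sets = [self._ids_by_tag.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*id_sets))


if __name__ == "__main__":
    index = TagIndex()
    # Three different kinds of objects with different attributes.
    index.add("evt-1", {"host": "web-1", "status": "critical"})
    index.add("log-7", {"host": "web-1", "level": "error"}, extra_tags=["team:payments"])
    index.add("inc-3", {"ci": "db-1", "status": "critical"})

    print(index.find("host:web-1"))                     # ['evt-1', 'log-7']
    print(index.find("host:web-1", "status:critical"))  # ['evt-1']
```

Unlike the ETL approach, nothing here needs to change when a new object type or attribute shows up; it simply produces new tags.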
For customers looking to adopt an AIOps strategy, understanding current and desired data structures is a critical but secondary consideration. First, you need to get the data flowing together. Any big data platform that supports an AIOps practice will have the capability to support the ETL or tagging approach. After data is flowing, you can determine which one works best for your business needs and budget.
This is part one of a two-part article. You can find part 2 here.
Published at DZone with permission of Seth Paskin, DZone MVB. See the original article here.