With the recent surge of web-scale applications, the problem of monitoring such systems has attracted a lot of attention. The scale and complexity of such systems, coupled with the non-trivial interactions between constituent components, make it extremely challenging to ascertain the health and stability of the various services and components that make up such a system. These challenges also make it difficult to zero in on the root cause of failures or performance degradations. The widely studied methods of anomaly detection have been suggested as a potential approach for monitoring complex IT environments. [Ref.]
Statistical Models for Anomaly Detection
Before applying anomaly detection methods we need to ask ourselves: "what is an anomaly?" To define an anomaly we first need to define what is normal. Any behavior that does not fall within the bounds of normalcy is then deemed anomalous. The well-known "μ ± 3σ" approach is widely used in app and infrastructure monitoring and elsewhere. It comes in several flavours, including simple moving average (SMA), simple moving median (SMM), exponential smoothing and Holt-Winters (a higher-order exponential smoothing that can potentially account for trend and seasonality). All of these methods make an implicit assumption about what represents the usual or acceptable behaviour of the services or metrics being monitored, i.e. that all the metrics follow the normal (or Gaussian) distribution. This assumption clearly does not hold for IT systems and metrics, which makes the application of these methods questionable and prone to both false positives and false negatives. The figure below shows the histogram for a set of points drawn from a normal distribution. The yellow highlighted region shows the μ ± 3σ region.
Where Statistical Models Start to Fail
As you can see, the μ ± 3σ rule does seem to make good sense for this distribution. Only around 0.27% of the data lies outside the μ ± 3σ region. However, if your metric reports one datapoint every 30 seconds (which is fairly typical in the IT monitoring space), 0.27% amounts to roughly one datapoint in every 370, so you can expect one datapoint to lie outside the μ ± 3σ region about every 3 hours. In other words, even when the underlying metric follows a normal distribution, using the μ ± 3σ rule is bound to yield a lot of false positives, resulting in alert fatigue among DevOps engineers and SREs. This problem is further aggravated by the fact that most, if not all, IT metrics do not follow a normal distribution. Many metrics show distributions that are multimodal, asymmetric or fat-tailed, as shown in the example plots below.
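The arithmetic behind the "one datapoint every 3 hours" estimate can be checked with a few lines of Python. This is only a sketch of the calculation: it assumes a perfectly Gaussian metric with independent datapoints and the 30-second reporting interval mentioned above, and the variable names are ours.

```python
import math

# P(|Z| > 3) for a standard normal variable, via the complementary error function.
p_outside = math.erfc(3 / math.sqrt(2))

# Reporting interval taken from the text: one datapoint every 30 seconds.
interval_seconds = 30

# On average, one datapoint in every 1/p falls outside the band.
points_between_alerts = 1 / p_outside
hours_between_alerts = points_between_alerts * interval_seconds / 3600

print(f"fraction outside mu +/- 3 sigma: {p_outside:.4%}")
print(f"datapoints between false alerts: {points_between_alerts:.0f}")
print(f"hours between false alerts: {hours_between_alerts:.2f}")
```

Note that this is the rate for a single metric. Every additional metric monitored with the same rule adds its own stream of out-of-band points.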
For example, the plot below shows the CPU usage for Apache Storm over a period of 7 days. As expected, the CPU usage can have different baselines depending on the pipelines or other processes that are running on the machine. Most of the time this dynamic baseline is of little concern to an Ops engineer monitoring the system, since it is expected behavior for such systems. But it results in a multimodal distribution (one with multiple peaks) in the metric, as seen below, and the use of a single mean (μ) and a single standard deviation (σ) to define normal behavior is no longer justified.
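To make the problem with a single μ and σ concrete, here is a small sketch using synthetic data. The two baselines (20% and 70% CPU) and their spread are hypothetical values chosen for illustration, not measurements from the Storm cluster discussed above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical CPU usage (%) with two baselines: ~20% and ~70%.
low_baseline = [random.gauss(20, 2) for _ in range(5000)]
high_baseline = [random.gauss(70, 2) for _ in range(5000)]
cpu = low_baseline + high_baseline

# One mean and one standard deviation for the whole bimodal series.
mu = statistics.fmean(cpu)
sigma = statistics.pstdev(cpu)
lower, upper = mu - 3 * sigma, mu + 3 * sigma

print(f"mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"mu +/- 3 sigma band: [{lower:.1f}, {upper:.1f}]")

# A reading of 45% is far from both baselines, yet it sits inside the band.
suspicious = 45.0
print("45% flagged as anomalous:", not (lower <= suspicious <= upper))
```

Because the gap between the two baselines inflates σ, the resulting band is wider than the entire 0 to 100% range of the metric, so the rule can never fire on this data, not even for a value that matches neither baseline.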
Even when the CPU usage follows a single baseline, as seen in the plot below, the distribution is clearly non-Gaussian. It is asymmetric and has a long tail at the higher end, which will cause a lot of false alerts if we were to use a μ ± 3σ type rule.
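How much a long tail inflates the alert rate can be estimated analytically. The sketch below uses a log-normal distribution as a stand-in for a long-tailed metric. That choice, and its parameters, are our assumption for illustration; the CPU metric above is not claimed to be log-normal.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability that a Gaussian metric exceeds mu + 3 sigma.
normal_tail = 1 - std_normal.cdf(3)

# Assumed stand-in for a long-tailed metric: log-normal with
# log-mean 0 and log-standard-deviation s.
s = 1.0
mean = math.exp(s**2 / 2)
sd = math.sqrt((math.exp(s**2) - 1) * math.exp(s**2))
threshold = mean + 3 * sd

# X > threshold is equivalent to ln(X) > ln(threshold); ln(X) ~ N(0, s^2).
lognormal_tail = 1 - std_normal.cdf(math.log(threshold) / s)

print(f"Gaussian upper-tail probability:   {normal_tail:.4%}")
print(f"log-normal upper-tail probability: {lognormal_tail:.4%}")
print(f"ratio: {lognormal_tail / normal_tail:.1f}x")
```

Under these assumptions the upper threshold is crossed more than ten times as often as it would be for a Gaussian metric, while the lower threshold (μ − 3σ) is negative and can never be crossed at all.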
The same issues arise with application metrics that depend on customer interactions and engagement, for example the total requests to a server, as shown below. The distribution in this case looks more like a Poisson distribution, which is frequently used in queueing theory to model arrival processes. The metric is clearly not a good fit for any of the μ ± 3σ type rules.
Another layer of complexity is introduced by the presence of trend and/or seasonality, as seen in the plot below, which shows the number of requests received by a server that exhibits strong seasonality.
Many approaches have been proposed in the literature to capture trend and seasonality. In our experience, these approaches do not fare well in practice, for a variety of reasons. For instance, many techniques assume that the period of the seasonal component is constant, but this need not be the case: in the context of garbage collection, the collection frequency is dynamic, as seen below.
An Alternative Approach for Detecting Anomalies
In addition to the statistical issues discussed above, the application of anomaly detection to Ops is hindered by the fact that anomalies are highly contextual in nature. A true statistical anomaly in a metric, i.e. a datapoint that would be considered an outlier by most statistical methods, may not necessarily be useful in an Ops monitoring sense. From an Ops perspective, the engineer is more interested in anomalies that can have a material impact on the health, stability, performance or cost of operation of the system. Metric deviations that have little or no effect on the key business and operational factors only add to the alert noise, making the alerting system less useful. An Ops engineer brings a tremendous amount of insight into the behavior and patterns that have a meaningful effect on the performance and stability of a system.
Limiting ourselves to purely statistical methods prevents us from incorporating the wealth of knowledge brought in by domain experts. We at OpsClarity believe that any anomaly detection on complex IT systems should start with a deep understanding of the system, its characteristics, its behavior and its normal modes of operation. Only with deep operational knowledge of the system can one begin to define what is meant by behavior that is anomalous, or behavior that is a precursor to problems in the system.
Thus, at OpsClarity, we built a suite of models to detect anomalies in key metrics of services used by data-first applications. These models incorporate domain expert insights for the specific service that is being monitored. More specifically, we have built custom models to monitor important operational and business-impacting metrics like latency, throughput, requests, errors, etc. for a number of services that are of prime importance to data-first applications, including Kafka, Storm, Redis, Memcached, Spark, Cassandra, etc. Some of these models include:
- Latency model: This model detects unusual behavior in latency metrics for a number of services.
- Connection count model: This model tracks the number of connections to various services and alerts on unusual behavior.
- Error count model: This model dynamically adapts to an acceptable baseline for errors and alerts if there are more errors than usual, for example 4xx and 5xx errors on ELBs.
- Queue size model: This model alerts if the consumer is not able to keep up with the producer in a queue service like Kafka, AWS SQS, etc.
- Disk Full model: This model alerts if disk usage starts growing at an increased rate or if the disk is predicted to be full within a preset amount of time.
- Seasonality corrected threshold: This model detects unusual behavior, in a seasonally adjusted fashion, in metrics related to incoming traffic for a number of services.
- Mean shift model: This model alerts if there are successive occurrences of sudden shifts in the level of the metric under consideration.
- Dynamic threshold model: This model alerts if a metric is persistently above a temporal dynamic threshold.
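To give a flavour of what a domain-specific check can look like, here is a minimal sketch of the "predicted to be full within a preset amount of time" idea from the Disk Full model. It is not OpsClarity's implementation: the least-squares linear extrapolation, the function name `hours_until_full`, the sample readings and the 24-hour window are all assumptions made for illustration.

```python
def hours_until_full(samples, capacity=100.0):
    """Project when disk usage reaches capacity by fitting a straight line.

    samples: list of (hours, used_percent) pairs, oldest first.
    Returns the projected hours from the last sample until the disk is
    full, or None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp: no trend to fit
    # Ordinary least-squares slope (percentage points per hour).
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # usage is not growing
    intercept = mean_u - slope * mean_t
    full_at = (capacity - intercept) / slope
    return max(full_at - samples[-1][0], 0.0)


# Hypothetical readings: usage grows 2 percentage points per hour from 60%.
readings = [(h, 60.0 + 2.0 * h) for h in range(6)]
ALERT_WINDOW_HOURS = 24  # hypothetical preset horizon

eta = hours_until_full(readings)
should_alert = eta is not None and eta <= ALERT_WINDOW_HOURS
print(f"projected hours until full: {eta:.1f}, alert: {should_alert}")
```

The point of the sketch is that the alert condition is phrased in terms an Ops engineer cares about (time left before the disk fills up) rather than in terms of how far a datapoint lies from a mean.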
In the near future we will delve deeper into the details of these models and how they can help reduce alert noise and increase efficiency in your Ops monitoring setup. In our next blog post we will dig into the latency model and show how it is more robust and less susceptible to false alerts than generic statistical models.
Acknowledgments: We would like to thank Arun Kejariwal for his help in building our anomaly detection models and for his help in preparing this blog.