In a report last year from MarketsAndMarkets, it was forecasted that IT operations analytics (ITOA) market will grow from $2.17B in 2015 to $9.79B by 2020, at a Compound Annual Growth Rate (CAGR) of 35.2% from 2015 to 2020. This, coupled with the decreasing cost of HDDs on a per-GB basis (as evidenced from the plots below) and the maturing of data collection technology, seems to have created an "arms race" with respect to metric collection.
The excerpts below reflect the aforementioned metric collection arms race:
- “…scaling it to about two million distinct time series.” (Netflix)
- “As we have hundreds of systems exposing multiple data items, the write rate might easily exceed tens of millions of data points each second.” (Facebook)
- “…we are taking tens of billions of measurements…” (Google)
- “…with over a million metrics flowing in constantly…” (Etsy)
- “The observability stack collects 170 million individual metrics (time series) …” (Twitter)
- “…serving over 50 million distinct time series.” (Spotify)
"One interesting trend in IT monitoring is the emergence of a 'size contest.' People are so proud to be collecting millions of metrics per second and monitoring databases that require petabytes."
Collecting a very large number of metrics may perhaps be warranted if and only if the metrics are subsequently used for analysis. The reality is that in most cases, over 95% of the metrics collected are never read for any analysis! This has direct implications on Opex — be it storage cost in one's data center and/or S3 cost on AWS and maintenance costs. In light of the above, it is befitting to recap what Mehdi remarked earlier:
"We need to stop comparing the sizes of our monitoring systems and databases, and start talking about how a monitoring project or tool deployment saved time, money, and business, increased revenue, impacted brand, and helped engineers and ops work faster and more efficiently and sleep better at night!"
The metrics "waste" is not limited to the raw metrics such CPU utilization, latency, minor GC time, etc. In many cases, in the interest of shorter TTR (Time-To-Resolve), derived metrics such as 99/95/90 percentiles of latency are stored instead of computing the derived metrics on the fly. The selection on the derived metrics is often ad hoc; further, most of these derived metrics are never read for any analysis.
Having said the above, there are early signs of leveraging the massive operations data for data center optimization. Concretely, in 2004, Google published a research report wherein they reported they developed a neural network framework that learns from actual operations data to model plant performance and predicted PUE (Power Usage Effectiveness) within a range of 0.004 +/0.005 (mean absolute error +/- 1 standard deviation), or 0.4% error for a PUE of 1.1.
The aforementioned metrics "waste" has direct implications for the monitoring services. Given the steep decay in cost/node (see the illustration below from Adrian Cockcroft last year at Monitorama 2016), the monitoring cost must commensurately decline. However, this is hard to achieve in light of the explosion in the number of metrics being collected.
Besides volume of metrics being collected, one needs to also pay attention to the velocity of metrics collection. The spectrum corresponding to the latter spans from, for example, secondly granularity to daily granularity. It has become fashionable to tout metric collection at a fine grain; however, rarely, there is any discussion about how metric collection at a fine grain helps, if at all, to shorten TTD (time to detect) and/or TTR (time to resolve). With the advent of IoT, the number of data streams shall grow tremendously — as per ABI Research forecasts, the volume of data captured by IoT-connected devices shall surpass 1.6 zettabytes in 2020; further, the data being captured for storage or further analysis accounts for only a tiny fraction of the data being generated which is typically in the order of yottabytes) and consequently, figuring the "optimum" level of velocity of collection shall be paramount.
In the case of real-user monitoring (RUM), a webpage may have millions of page views on a daily basis. Typical metrics monitored for each page view are TDNS, TConnect, TResponse time, TOnload, and #JS errors. A subset of other metrics that are typically collected during a page view are listed in the table below. The second column in the table corresponds to the maximum number of unique values for each metric (the maximum value is, of course, a function of the monitoring service being used).
Multiple metrics are often monitored, say, per page view so as to help localize an availability/performance issue. For instance, an American Express customer in NYC may experience high response times whereas an American Express customer in LA may have a very smooth experience. However, in densely populated regions that drive a large volume of page views, does one need to collect metrics for each and every page view?
To address the above, sampling is commonly employed to contain the storage requirements. A straightforward, but effective, approach is to vary the sampling rate depending on the "importance" of the metric. Periodic sampling with low sampling rates reduces the storage requirements and, in many cases, does not adversely root-cause analysis.
Let's consider the plot below, which corresponds to webpage response time of the American Express mobile webpage. The data was collected every five minutes.
Granularity: 5 minutes.
From the plot above we note that there are five anomalies — defined as webpage response time of more than seven seconds — in webpage response time series. Early detection of such anomalies is critical to minimize the impact on end-user experience. In line with the discussion above, how would the time series look if the data were sampled every 10 minutes? From the plot below, we note that two out of the five anomalies observed in the previous plot are masked when the sampling rate is ten minutes.
Granularity: 10 minutes.
How would the time series look at even lower sampling rate? The plot below corresponds to a sampling rate of 30 minutes. From the plot below, we note that none of the five anomalies observed in the first plot get surfaced when the sampling rate is 30 minutes.
Granularity: 30 minutes.
The above should not be a surprise owing to the fact that anomalies are sparse and hence a low sampling rate would mask the anomalies. Having said that, in case of high throughput systems, the sparsity of anomalies is no longer a limitation and consequently and a low sampling rate can be used to contain the data volume. In fact, sampling has been employed in large scale systems such as Dapper. The authors of the research report say:
...we have found sampling to be necessary for low overhead, especially in highly optimized Web services which tend to be quite latency sensitive.
In Dapper, both uniform and adaptive sampling probability was employed. In particular, the authors share the following regarding the use of uniform sampling probability:
This simple scheme was effective for our high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured. However, lower traffic workloads may miss important events at such low sampling rates, while tolerating higher sampling rates with acceptable performance overheads.
In a similar vein, the authors share the following regarding the use of adaptive sampling probability:
...is parameterized not by a uniform sampling probability, but by a desired rate of sampled traces per unit time. This way, workloads with low traffic automatically increase their sampling rate while those with very high traffic will lower it so that overheads remain under control.
Lowering the sampling rate for a metric as-is discounts the recency of the incoming data stream. Exploiting recency is key to surface issues impacting end-user experience currently or in the recent past. To this end, at Catchpoint, we do not sample for the past three days and turn on sampling for data beyond the past three days.
To summarize, the high-level pros and cons of sampling are the following.
- Reduces the Opex associate with storage.
- Help filter out the noise and capture the underlying trend.
- Sample summarization is faster than its population counterpart — this is particularly important when there are a large number of metrics as is the case in today's operations world — which facilitates faster decision making
- A sampling error is incurred. As the sample does not include all members of the population, sample statistics such as means and quantiles, generally differ from the characteristics of the entire population. This may result in false negatives that in turn may adversely impact en-user experience. Sampling error can be contained by taking a large enough random sample from the population. The precision of sample statistics can be estimated by either using subsets of available data (referred to as jackknifing) or drawing randomly with replacement from a set of data points (referred to as bootstrapping).
- It is non-trivial to determine the "optimal" sampling technique and the sampling rate.