Subsampling and Non-Uniform Sampling at Work
If you want to extract actionable insights from ops time series data, you'll probably need to perform subsampling and non-uniform sampling.
This is the last post in our series on sampling. The previous articles gave an overview of the metrics arms race and the role sampling plays in it, covered the margin of error associated with sampling and the different types of sampling, and discussed Simpson's paradox. In this post, we walk through how subsampling and non-uniform sampling can be leveraged in the world of web operations.
Subsampling has a rich research history. Early work dates back more than five decades to Quenouille's and Tukey's jackknife. Mahalanobis suggested, albeit in a different context, using subsamples to estimate standard errors when studying crop yields, though he used the term interpenetrating samples. In two studies, Hartigan leveraged subsamples to construct confidence intervals and to approximate standard errors. Other use cases of subsampling include, but are not limited to:
- Variance estimation
- Estimation of the underlying distribution
In the ops world, we mostly deal with time series data. This is exemplified by the plot below (the data was obtained using the Catchpoint portal).
Given a time series, draw m observations without replacement. This constitutes one subsample. Repeat this N times. So, we have N subsamples, each of size m. Denote these subsamples by S1, S2, ..., SN. Note that observations drawn for a subsample need not be contiguous. For instance, given a month long minutely time series, a subsample can potentially correspond to observations from the weekends. In a similar vein, one can create subsamples based on day of the week or hour of the day and day of the week (the latter would result in 168 non-overlapping subsamples).
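The two ways of forming subsamples described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the article: the function names `random_subsamples` and `calendar_subsamples` are hypothetical, and the four-week minutely series is synthetic rather than data from the Catchpoint portal.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def random_subsamples(series, m, n_subsamples, seed=0):
    """Draw n_subsamples subsamples, each of m observations without replacement."""
    rng = random.Random(seed)
    # random.sample draws without replacement within a single subsample;
    # the drawn observations need not be contiguous in time.
    return [rng.sample(series, m) for _ in range(n_subsamples)]


def calendar_subsamples(timestamps, values):
    """Group observations by (day of week, hour of day): at most 7 * 24 = 168 groups."""
    groups = defaultdict(list)
    for ts, value in zip(timestamps, values):
        groups[(ts.weekday(), ts.hour)].append(value)
    return groups


# Synthetic four-week minutely series (illustrative values, not real measurements).
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(28 * 24 * 60)]
data_rng = random.Random(42)
values = [200 + 50 * data_rng.random() for _ in timestamps]

subs = random_subsamples(values, m=100, n_subsamples=20)
groups = calendar_subsamples(timestamps, values)
print(len(subs), len(subs[0]), len(groups))
```

The calendar-based grouping yields non-overlapping subsamples by construction, whereas the random subsamples may share observations with one another.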
For instance, let's consider two subsamples corresponding to a weekend day (Sunday) and a weekday (Monday) in the plot above. For ease of visualization, we limit the number of news outlets to three. The corresponding plots for the two days are shown below.
The tables below summarize the key stats for the plots above. From the tables, we note that, perhaps expectedly, the 95p and 99p for Bloomberg and CNN are much higher on Monday than on Sunday. Surprisingly, for BBC News, all three stats are in the same ballpark across Sunday and Monday. This may very well be an artifact of these two particular days, so one should not generalize.
An analysis similar to the above can be done at the hour of the day level. The insights learned can potentially serve as a valuable input to dynamic capacity planning.
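As a rough sketch of what an hour-of-day analysis could look like, the snippet below groups observations by hour and computes the median, 95p, and 99p for each group. The helper names and the synthetic one-week series (with an assumed daytime bump) are illustrative only, and percentile values depend slightly on the interpolation method chosen.

```python
import random
import statistics
from collections import defaultdict
from datetime import datetime, timedelta


def percentile_summary(values):
    """Return the median, 95th and 99th percentile of a list of observations."""
    cuts = statistics.quantiles(values, n=100)  # 99 cut points: cuts[k-1] is the k-th percentile
    return {"median": statistics.median(values), "p95": cuts[94], "p99": cuts[98]}


def summarize_by_hour(timestamps, values):
    """Build one subsample per hour of the day and summarize each of them."""
    by_hour = defaultdict(list)
    for ts, value in zip(timestamps, values):
        by_hour[ts.hour].append(value)
    return {hour: percentile_summary(obs) for hour, obs in sorted(by_hour.items())}


# Synthetic one-week minutely series; the daytime bump is an assumption for illustration.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(7 * 24 * 60)]
data_rng = random.Random(7)
values = [
    200 + (80 if 9 <= ts.hour < 18 else 0) + data_rng.expovariate(1 / 30)
    for ts in timestamps
]

hourly = summarize_by_hour(timestamps, values)
for hour in (3, 12):
    print(hour, {name: round(stat, 1) for name, stat in hourly[hour].items()})
```

Comparing the per-hour 95p and 99p side by side is one simple way to spot the hours that drive peak capacity needs.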
Commonly used subsampling techniques cannot be applied directly to time series data because they assume the observations to be IID (independent and identically distributed). This assumption clearly does not hold for time series data. Consequently, extensions to the traditional subsampling techniques are needed.
A common way to address this limitation is to apply the bootstrap to the residuals of a general regression model, which approximates the IID setup. However, this approach is restricted to situations where such a regression model can be relied upon. In an ops context, it can be challenging to find a regression model that fits the data well, for reasons such as non-normal distributions and heteroskedasticity. A more general approach is to resample the original data sequence in blocks of data rather than as single data points, as is done in the IID setup. The rationale is that the dependence structure of the underlying model is preserved within each block; further, if the block size is allowed to tend to infinity with the sample size, asymptotically correct inference can be made.
The blocking method mentioned above was employed by Carlstein to estimate the variance of a general statistic. Specifically, he divided the original sequence into non-overlapping blocks of size b < n (where n is the total number of observations), recomputed the statistic of interest on each block, and used the sample variance of the block statistics after suitable normalization. The "moving blocks" bootstrap can be used not only for variance estimation but also for estimating the sampling distribution of a statistic, so that confidence intervals or regions for unknown parameters can be constructed. Akin to Efron's IID bootstrap, the method constructs pseudo-data sequences whose (known) data-generating mechanism mimics the (unknown) data-generating mechanism that gave rise to the observed sequence. The key difference is that blocks of size b (< n), resampled with replacement from the data, are concatenated to form each pseudo-sequence, rather than single data points. Note that, in contrast to Carlstein's approach, the moving blocks bootstrap uses overlapping blocks and is generally more efficient.
As discussed in an earlier post, it is common to measure multiple metrics to gauge performance and reliability. In such a multivariate context, a common problem is drawing simultaneous inference on the regression coefficients. Fortunately, subsampling can also be leveraged in higher dimensions. Last but not least, ops time series obtained from production are often non-stationary. Sources of non-stationarity include seasonality, drift, and an underlying trend. Several methods for subsampling from non-stationary time series have been proposed. Readers may refer to Chapter 4 of the book by Politis et al.
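Both blocking schemes can be sketched compactly for the special case where the statistic is the sample mean. The code below is a simplified illustration under stated assumptions, not a reproduction of the original papers: the function names are hypothetical, the (b / n) normalization is the one appropriate for the mean, the percentile interval is one of several possible interval constructions, and the AR(1) series is synthetic.

```python
import math
import random
import statistics


def carlstein_variance_of_mean(series, b):
    """Non-overlapping blocks: estimate Var(sample mean) from the block means.

    Assumption: the statistic is the mean, for which the sample variance of the
    block means is rescaled by b / n.
    """
    n = len(series)
    k = n // b  # number of complete non-overlapping blocks
    block_means = [statistics.fmean(series[i * b:(i + 1) * b]) for i in range(k)]
    return (b / n) * statistics.variance(block_means)


def moving_blocks_bootstrap(series, b, n_boot, stat=statistics.fmean, seed=0):
    """Resample overlapping blocks of length b with replacement and concatenate them."""
    rng = random.Random(seed)
    n = len(series)
    blocks_needed = math.ceil(n / b)
    replicates = []
    for _ in range(n_boot):
        pseudo = []
        for _ in range(blocks_needed):
            s = rng.randrange(n - b + 1)  # any of the n - b + 1 overlapping blocks
            pseudo.extend(series[s:s + b])
        replicates.append(stat(pseudo[:n]))  # trim the pseudo-sequence to length n
    return replicates


def percentile_interval(replicates, alpha=0.05):
    """Simple percentile interval from the bootstrap replicates."""
    ordered = sorted(replicates)
    lo = ordered[int((alpha / 2) * len(ordered))]
    hi = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lo, hi


# Synthetic dependent data: an AR(1) process (illustrative, not real ops data).
data_rng = random.Random(1)
series, x = [], 0.0
for _ in range(1000):
    x = 0.7 * x + data_rng.gauss(0, 1)
    series.append(x)

var_hat = carlstein_variance_of_mean(series, b=25)
replicates = moving_blocks_bootstrap(series, b=25, n_boot=500)
print(var_hat, percentile_interval(replicates))
```

The block size b is a tuning choice: blocks that are too short break the dependence structure, while blocks that are too long leave few distinct blocks to resample from.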
It is not uncommon for metrics to be measured in production at different granularities, e.g., secondly or minutely. Further, there can be skew between time series (in other words, multiple time series can be out of phase). Additionally, data may be missing from a time series for a multitude of reasons, such as a network error. As a result, the time series differ in length, which inhibits correlation analysis. To this end, non-uniform sampling across metrics is often employed. Non-uniform sampling has been studied for over five decades in a wide variety of fields, including signal processing, communication theory, magnetic resonance imaging, astronomy, and chemistry. In The Origins of the Sampling Theorem, Lüke remarked the following:
The numerous different names to which the sampling theorem is attributed in the literature - Shannon, Nyquist, Kotelnikov, Whittaker, to Someya - gave rise to the above discussion of its origins. However, this history also reveals a process which is often apparent in theoretical problems in technology or physics: first the practicians put forward a rule of thumb, then the theoreticians develop the general solution, and finally someone discovers that the mathematicians have long since solved the mathematical problem which it contains, but in "splendid isolation."
The roots of modern non-uniform sampling interpolation can be traced to the seminal work by Yen. In particular, he presented interpolation formulas for special cases of irregular sampling, such as migration of a finite number of uniform samples, a single gap in otherwise uniform samples, and periodic non-uniform sampling. For a deep dive into non-uniform sampling, readers may refer to the surveys by Jerri (The Shannon Sampling Theorem - Its Various Extensions and Applications: A Tutorial Review) and Butzer et al. (The Origins of the Sampling Theorem), or the book by Marvasti (Nonuniform Sampling: Theory and Practice).
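To make the alignment problem from earlier in this section concrete (different granularities, phase skew, missing data), here is one simple way to bring two metrics onto a common time grid before correlating them: average each series into fixed-width buckets and keep only the buckets present in both. This is an illustrative assumption about how such alignment might be done, not the interpolation formulas of Yen or the method used in the article; the function names and both series are made up.

```python
import math
from collections import defaultdict


def bucket_mean(samples, width):
    """Average (timestamp_seconds, value) samples into buckets of `width` seconds."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[t // width * width].append(v)
    return {t: sum(vs) / len(vs) for t, vs in buckets.items()}


def align(a, b, width=60):
    """Keep only buckets observed in both series; the resulting grid may have gaps."""
    ba, bb = bucket_mean(a, width), bucket_mean(b, width)
    common = sorted(ba.keys() & bb.keys())
    return common, [ba[t] for t in common], [bb[t] for t in common]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)


# Metric A: secondly samples over 10 minutes, with seconds 120-149 missing.
metric_a = [(t, float(t)) for t in range(600) if not 120 <= t < 150]
# Metric B: minutely samples, 7 seconds out of phase, with minute 3 missing.
metric_b = [(60 * k + 7, 2.0 * k + 1.0) for k in range(10) if k != 3]

common, xa, xb = align(metric_a, metric_b)
r = pearson(xa, xb)
print(len(common), round(r, 4))
```

After alignment both series have the same length, so correlation analysis becomes possible, at the cost of discarding buckets that only one metric observed.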
To summarize, subsampling and non-uniform sampling can come in handy (and in some cases, are required) to extract actionable insights from ops time series data.
Published at DZone with permission of Mehdi Daoudi, DZone MVB. See the original article here.