# Data Sampling: Types and Example Use Cases

· Big Data Zone ·

In Part 1, we provided an overview of the metrics arms race and walked through the use of sampling in that context. We also discussed the importance of accounting for the recency of data during sampling, along with pitfalls such as sampling error. Recall that sampling error cannot be measured exactly because the true population values are generally unknown; hence, sampling errors are usually estimated by probabilistic modeling of the sample.

Random sampling error is often measured using the margin of error statistic, which expresses how close the result from a sample is likely to be to the value one would get if the whole population had been measured. When a single, global margin of error is reported, it refers to the maximum margin of error using the full sample. The margin of error is defined for any desired confidence level; typically, a confidence level of 95% is chosen. For a simple random sample from a large population, the maximum margin of error $E_m$ at confidence level $c$ is computed as follows:

$$E_m = \frac{\operatorname{erf}^{-1}(c)}{\sqrt{2n}}$$

where $\operatorname{erf}^{-1}$ is the inverse error function and $n$ is the sample size. For a 95% confidence level, this is approximately $0.98/\sqrt{n}$. In line with intuition, the maximum margin of error decreases as the sample size grows. Note that the margin of error only accounts for random sampling error and is blind to systematic errors.
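As a rough illustration of the margin-of-error computation for a simple random sample, the sketch below uses the two-sided normal critical value z with the worst-case proportion of 0.5, which is algebraically the same as erf⁻¹(c)/√(2n) for confidence level c. The function name `max_margin_of_error` is made up for this example, and the code assumes a large population.

```python
from math import sqrt
from statistics import NormalDist


def max_margin_of_error(n, confidence=0.95):
    """Maximum margin of error for a simple random sample of size n.

    Assumes a large population and the worst-case proportion p = 0.5.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Two-sided critical value; z equals sqrt(2) * erfinv(confidence).
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return z * 0.5 / sqrt(n)


# The margin shrinks with the square root of the sample size.
for n in (100, 1_000, 10_000):
    print(n, round(max_margin_of_error(n), 4))
```

At 95% confidence, the three sample sizes give margins of roughly 9.8%, 3.1%, and 1%: a 100-fold larger sample buys only a 10-fold smaller margin.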

An underlying assumption of the formula above is that the population is infinitely large, so the margin of error does not depend on the size of the population of interest (N). Given the large volume of real-time operational data, the assumption can be made to hold for practical purposes. That said, the assumption is valid only when the sampling fraction n/N is small (typically less than 5%). If no such restrictions are made (that n/N should be small, that N should be large, or that the population is normal), then, as per Isserlis, the margin of error should be corrected using the finite population correction:

$$E_{\text{corrected}} = E_m \sqrt{\frac{N - n}{N - 1}}$$

where $E_m$ is the margin of error computed under the infinite-population assumption, $n$ is the sample size, and $N$ is the population size.

It is important to validate the underlying assumptions, as sampling error has direct implications on analysis, such as anomaly detection (refer to Part 1).
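To see how much the infinite-population assumption matters, the following sketch applies the standard finite population correction, sqrt((N - n) / (N - 1)), to the maximum margin of error. The function name and the example numbers (a population of 10,000 with a 10% sampling fraction) are illustrative assumptions, not figures from the article.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, population=None, confidence=0.95):
    """Maximum margin of error, optionally corrected for a finite population.

    n          -- sample size
    population -- population size N, or None to assume an infinite population
    """
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    moe = z * 0.5 / sqrt(n)  # worst-case proportion p = 0.5
    if population is not None:
        if population < 2 or not 1 <= n <= population:
            raise ValueError("need population >= 2 and 1 <= n <= population")
        # Finite population correction: shrinks the margin as n approaches N.
        moe *= sqrt((population - n) / (population - 1))
    return moe


uncorrected = margin_of_error(1_000)
corrected = margin_of_error(1_000, population=10_000)
print(round(uncorrected, 4), round(corrected, 4))
```

With a 10% sampling fraction, the correction lowers the margin by about 5%. At sampling fractions well under 5%, the two values are nearly identical, which is why the uncorrected formula is usually adequate for high-volume operational data.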

The following sampling methodologies have been extensively studied and used in a variety of contexts:

• Simple random sampling: A method of selecting $n$ units out of $N$ such that every one of the $\binom{N}{n}$ possible samples has an equal chance of being drawn. Random sampling can be done with or without replacement.
• Stratified random sampling: Under this method, the overall population is divided into subpopulations (or strata) that are non-overlapping and collectively exhaustive. Then, a random sample is drawn from each stratum. The estimated mean $m$ and its variance $s_m^2$ under stratified sampling are given as follows:

$$m = \frac{1}{N} \sum_{h=1}^{L} N_h \, m_h$$

$$s_m^2 = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(\frac{N_h - n_h}{N_h}\right) \frac{s_h^2}{n_h}$$

where:

• $N$ = population size

• $L$ = number of strata

• $N_h$ = size of stratum $h$

• $n_h$ = size of the random sample drawn from stratum $h$

• $s_h$ = sample standard deviation of stratum $h$

• $m_h$ = sample mean of stratum $h$

Stratification often improves the representativeness of the sample by reducing sampling error. Comparing the relative precision of simple random and stratified random sampling, Cochran remarked the following (where nh is the size of the random sample drawn from stratum h):

> ...stratification nearly always results in smaller variance for the estimated mean or total than is given by a comparable simple random sample. ... If the values of nh are far from optimum, stratified sampling may have a higher variance.

In the context of operations, to evaluate the response time of a website, for example, the response time data should be divided into multiple strata based on geolocation and then analyzed.
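A minimal sketch of stratified random sampling for the response-time example follows. It draws a simple random sample without replacement from each stratum and combines the per-stratum means weighted by N_h / N; the variance uses the textbook estimator with a per-stratum finite population correction, which is an assumption about the form the article intends. The region names and response times are invented for illustration.

```python
import random
from statistics import mean, variance


def stratified_estimate(strata, sample_sizes, seed=None):
    """Estimate the population mean from per-stratum random samples.

    strata       -- dict: stratum name -> list of all values in that stratum
    sample_sizes -- dict: stratum name -> n_h, with 2 <= n_h <= N_h
    Returns (estimated mean, estimated variance of that mean).
    """
    rng = random.Random(seed)
    total = sum(len(values) for values in strata.values())  # N
    est_mean = 0.0
    est_var = 0.0
    for name, values in strata.items():
        size = len(values)                # N_h
        n_h = sample_sizes[name]
        sample = rng.sample(values, n_h)  # simple random sample, no replacement
        weight = size / total             # N_h / N
        est_mean += weight * mean(sample)
        # variance() is the sample variance s_h^2 (divisor n_h - 1).
        est_var += weight ** 2 * ((size - n_h) / size) * variance(sample) / n_h
    return est_mean, est_var


# Hypothetical response times (ms), stratified by geolocation.
response_ms = {
    "north_america": [120, 130, 125, 140, 118, 133],
    "europe": [200, 210, 190, 220, 205, 215, 198, 207],
}
print(stratified_estimate(response_ms, {"north_america": 3, "europe": 4}, seed=7))
```

Because each region is sampled separately, the large gap between regions does not inflate the variance of the estimate the way it would in a single simple random sample over the pooled data.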

• Systematic sampling: To select n units out of N, a unit is chosen at random from the first k units (where k = N/n), and every k-th unit thereafter is sampled. A variant of the above sets the start of the sampling sequence to (k+1)/2 if k is odd or to k/2 if k is even. In another variation, the N units are assumed to be arranged around a circle, a number between 1 and N is selected at random, and then every k-th unit (where k is the integer nearest to N/n) is sampled. Check out further reading on systematic sampling.
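The systematic sampling variants can be sketched as follows. The code assumes the basic scheme picks a random start among the first k units and then takes every k-th unit, with k = N // n; the function names and the `start` parameter are made up for this example.

```python
import random


def systematic_sample(units, n, start="random", seed=None):
    """Linear systematic sample of n units, taking every k-th unit (k = N // n).

    start="random" -- random start among the first k units
    start="middle" -- start at (k+1)/2 if k is odd, or k/2 if k is even
    """
    total = len(units)
    if not 0 < n <= total:
        raise ValueError("need 0 < n <= len(units)")
    k = total // n
    if start == "random":
        first = random.Random(seed).randint(1, k)  # 1-based position
    elif start == "middle":
        first = (k + 1) // 2 if k % 2 else k // 2
    else:
        raise ValueError("start must be 'random' or 'middle'")
    return [units[i] for i in range(first - 1, total, k)][:n]


def circular_systematic_sample(units, n, seed=None):
    """Circular variant: random start anywhere, step k = nearest integer to N/n."""
    total = len(units)
    if not 0 < n <= total:
        raise ValueError("need 0 < n <= len(units)")
    k = max(1, round(total / n))
    start = random.Random(seed).randrange(total)
    # Indices wrap around the end of the list; for some N and n the
    # wrap-around can revisit a unit.
    return [units[(start + j * k) % total] for j in range(n)]


hours = list(range(1, 25))
print(systematic_sample(hours, 6, start="middle"))
print(circular_systematic_sample(hours, 6, seed=3))
```

The circular variant is useful when N is not a multiple of n, since a random start anywhere on the circle gives every unit the same chance of selection.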

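• Adaptive sampling: Several variants of adaptive sampling have been proposed in the literature. For instance, in locally adaptive sampling, the intervals between samples are computed using a function of previously taken samples, called a sampling function. Hence, although it is a non-uniform sampling scheme, the sampling times need not be stored, because they can be recomputed from the samples themselves. In particular, sampling time $t_{i+1}$ is determined in the following fashion:

$$t_{i+1} = t_i + f\big(x(t_i), x(t_{i-1}), \ldots\big)$$

where $f$ is the sampling function and $x(t_i), x(t_{i-1}), \ldots$ are the most recent samples.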
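A minimal sketch of locally adaptive sampling is shown below. The sampling function here is a hypothetical choice made for illustration (it shortens the next interval when the last two samples differ sharply); the literature proposes several different sampling functions, and the parameters `base`, `gain`, and `t_min` are assumptions of this example.

```python
import math


def sampling_function(history, base=1.0, gain=4.0, t_min=0.05):
    """Hypothetical sampling function: returns the next inter-sample interval.

    history is a list of (time, value) pairs. The interval shrinks as the
    slope between the last two samples grows, but never drops below t_min.
    """
    if len(history) < 2:
        return base
    (t0, x0), (t1, x1) = history[-2], history[-1]
    slope = abs(x1 - x0) / (t1 - t0)
    return max(t_min, base / (1.0 + gain * slope))


def adaptive_sample(signal, t_end, **params):
    """Sample signal(t) on [0, t_end]; each interval depends only on past samples."""
    history = [(0.0, signal(0.0))]
    while True:
        t_next = history[-1][0] + sampling_function(history, **params)
        if t_next > t_end:
            return history
        history.append((t_next, signal(t_next)))


flat = adaptive_sample(lambda t: 1.0, 5.0)
busy = adaptive_sample(lambda t: math.sin(10 * t), 5.0)
print(len(flat), len(busy))
```

Because each interval is a deterministic function of earlier samples, anyone holding the sample values and the sampling function can rebuild the sampling times, which is why they do not need to be stored.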

We now walk through a couple of examples to illustrate the applicability of the sampling techniques discussed above. The plot below compares the number of page views across the different continents over a three-day period. The data was collected every hour. Note that the scale of the y-axis is logarithmic.

Simple random or stratified random sampling of the time series in the plot above would render the subsequent comparison inaccurate owing to the underlying seasonality. This can be addressed by employing systematic sampling whereby the number of page views of the same hour for each day would be sampled. Subsequent comparison of the sampled data across different continents would be valid.
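As a sketch of the same-hour systematic sampling described above, the snippet below picks the observation at a fixed hour of each day from an hourly series. It assumes the series starts at hour 0 of the first day and has no gaps; the function name and the synthetic page-view numbers are invented for illustration.

```python
def same_hour_sample(hourly_values, hour, period=24):
    """Return the observation taken at the given hour of every day."""
    if not 0 <= hour < period:
        raise ValueError("hour must be in [0, period)")
    # Systematic sample with a fixed start (hour) and step k = period.
    return hourly_values[hour::period]


# Three days of synthetic hourly page views with a daily cycle.
page_views = [1000 + 50 * (h % 24) + 10 * (h // 24) for h in range(72)]
print(same_hour_sample(page_views, hour=9))
```

Sampling with a step equal to the seasonal period keeps every sampled point at the same phase of the daily cycle, so differences between continents are not confounded with time of day.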

The plot below compares document completion time across the different continents over a three-day period. The data was collected every hour. Note that the scale of the y-axis is in thousands of milliseconds. From the plot, we note that unlike the number of page views, document completion time does not exhibit a seasonal nature.

Given the high variance of the document completion time, employing simple random sampling would incur a large sampling error. Consequently, in this case, stratified random sampling — where a stratum would correspond to a day — can be employed and then the sampled data can be used for comparative analysis across the different continents.
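Treating each day as a stratum can be sketched as below: the hourly series is split into days and a simple random sample is drawn within each day. The function name, the per-day sample size, and the assumption that the series starts at the beginning of a day are all illustrative.

```python
import random


def sample_by_day(hourly_values, per_day, period=24, seed=None):
    """Draw a random sample (without replacement) from each day of an hourly series."""
    rng = random.Random(seed)
    # Each stratum is one day's worth of consecutive hourly observations.
    days = [hourly_values[i:i + period] for i in range(0, len(hourly_values), period)]
    return [rng.sample(day, min(per_day, len(day))) for day in days]


# Three days of hypothetical document completion times (ms).
completion_ms = [2000 + (37 * h) % 900 for h in range(72)]
for day, picked in enumerate(sample_by_day(completion_ms, per_day=6, seed=11)):
    print(day, picked)
```

Every day contributes the same number of observations, so no single day dominates the comparison across continents.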


Topics:
big data ,tutorial ,metrics ,sampling


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.