Over a million developers have joined DZone.

Estimating Statistics via Bootstrapping and Monte Carlo Simulation

DZone's Guide to

Estimating Statistics via Bootstrapping and Monte Carlo Simulation

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

We want to estimate some "statistics" (e.g. average income, 95 percentile height, variance of weight ... etc.) from a population.

It will be too tedious to enumerate all members of the whole population.  For efficiency reason, we randomly pick a number samples from the population, compute the statistics of the sample set to estimate the corresponding statistics of the population.  We understand the estimation done this way (via random sampling) can deviate from the population.  Therefore, in additional to our estimated statistics, we also include a "standard error" (how big our estimation may be deviated from the actual population statistics) or a "confidence interval" (a lower and upper bound of the statistics which we are confident about containing the true statistics).

The challenge is how do we estimate the "standard error" or the "confidence interval".  A straightforward way is to repeat the sampling exercise many times, each time we create a different sample set from which we compute one estimation.  Then we look across all estimations from different sample sets to estimate the standard error and confidence interval of the estimation.

But what if collecting data from a different sample set is expensive, or for any reason the population is no longer assessable after we collected our first sample set.  Bootstrapping provides a way to address this ...


Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set.  Each of the created sample set will follow the same data distribution of the original sample set, which in turns, follow the population.

R provides a nice "bootstrap" library to do this.

> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Lets say we care about the ninety percentile
> quantile(population.weight, 0.9)
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of ninety percentile
> quantile(sample_set1, 0.9)
> # Notice that the sample statistics deviates from the population statistics
> # We want to estimate how big is this deviation by using bootstrapping
> # I need to define my function to compute the statistics
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result


boot(data = sample_set1, statistic = ninety_percentile, R = 1000)

Bootstrap Statistics :
    original   bias    std. error
t1* 232.3641 2.379859     5.43342
> plot(boot_result)
> boot.ci(boot_result, type="bca")
Based on 1000 bootstrap replicates

boot.ci(boot.out = boot_result, type = "bca")

Intervals : 
Level       BCa          
95%   (227.2, 248.1 )  
Calculations and Intervals on Original Scale
Here is the visual output of the bootstrap plot

Bootstrapping is a powerful simulation technique for estimate any statistics in an empirical way.  It is also non-parametric because it doesn't assume any model as well as parameters and just use the original sample set to estimate the statistics. 

If we assume certain distribution model want to see the distribution of certain statistics.  Monte Carlo simulation provides a powerful way for this.

Monte Carlo Simulation

The idea is pretty simple, based on a particular distribution function (defined by a specific model parameters), we generate many sets of samples.  We compute the statistics of each sample set and see how the statistics distributed across different sample sets.

For example, given a normal distribution population, what is the probability distribution of the max value of 5 randomly chosen samples.

> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+     sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)
Here is the distribution of the "max(5)" statistics, which shows some right skewness

Bootstrapping and Monte Carlo simulation are powerful tools to estimate statistics in an empirical manner, especially when we don't have an analytic form of solution.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}