Algorithm of the Week: Practical Parallelizing in R
I wrote an algorithm in R to run a Monte Carlo simulation of how many test subjects I need for split tests to detect an X% shift in the mean. It required hundreds of thousands of calculations to produce the final table, so the algorithm ran for a few minutes.
I’ll talk about my specific problem in a future post, but for now I’ll quickly introduce you to parallel operations in R.
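First you should install the “multicore” package. I can’t say that this is the “best” package but it works:
install.packages("multicore")
On R 2.14 and later, “mclapply” also ships in the built-in “parallel” package, so library(parallel) gives you the same function without installing anything.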
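Now you can use a function called “mclapply”, which is a parallel version of “lapply”. When you are iterating over a single vector, it can also stand in for “mapply”. Note that the argument order differs: “mapply” takes the function first, while “mclapply” takes the data first and the function second. Let’s create a slightly contrived example: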
mapply(function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)), 1:10000)
Each call draws two samples of 10,000 values from the same normal distribution and compares them with a t-test, and the whole thing is repeated 10,000 times. It takes several seconds before it starts printing its results. Because I chose to implement this with an apply function rather than a for loop, I can easily convert it to be multi-core friendly. Check out this version and try running it:
mclapply(1:10000, function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)))
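Printing 10,000 full t-test results is rarely what you want, so here is a minimal sketch of collecting them instead. It assumes the built-in “parallel” package rather than “multicore”, uses smaller sizes than the example above so it finishes quickly, and the names n_cores, results and p_values are just illustrative choices.

```r
library(parallel)  # base-R home of mclapply(); assumed here instead of multicore

# mclapply() works by forking, which Windows does not support,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Smaller sizes than the article's example so the sketch runs quickly.
results <- mclapply(1:500, function(i) {
  t.test(rnorm(1000, mean = 25, sd = 1), rnorm(1000, mean = 25, sd = 1))
}, mc.cores = n_cores)

# Each element is an "htest" object; keep only its p-value.
p_values <- vapply(results, function(r) r$p.value, numeric(1))

# Share of runs that look "significant" at the 5% level purely by chance.
mean(p_values < 0.05)
```

Since both samples come from the same distribution, that last number should land somewhere near 0.05, give or take random noise.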
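You might also try watching your CPU with both versions. The “mclapply” version will spread the work across your cores and finish MUCH faster. The standard R version will only ever keep a single core busy. One caveat: “mclapply” works by forking the R process, so it needs a Unix-like system such as Linux or OS X.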
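If you would rather measure the difference than eyeball a CPU monitor, something like the following sketch should do it. It assumes the built-in “parallel” package, a reduced workload (n_runs and the sample size are my own picks, not the numbers from the example above), and a hypothetical helper called one_test. The actual speedup depends on your machine, so no timings are shown.

```r
library(parallel)  # provides mclapply() and detectCores()

# One simulated comparison, same shape as the article's example but smaller.
one_test <- function(i, n = 1000) {
  t.test(rnorm(n, mean = 25, sd = 1), rnorm(n, mean = 25, sd = 1))
}

n_runs <- 2000  # assumed workload, kept small so this finishes quickly

# Use every detected core, except on Windows where forking is unavailable.
# detectCores() can return NA, hence the na.rm guard.
n_cores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, detectCores(), na.rm = TRUE)
}

# Time the single-core and the multi-core version of the same work.
serial_time   <- system.time(serial_res   <- lapply(seq_len(n_runs), one_test))
parallel_time <- system.time(parallel_res <- mclapply(seq_len(n_runs), one_test,
                                                      mc.cores = n_cores))

print(serial_time["elapsed"])
print(parallel_time["elapsed"])
```

Setting mc.cores explicitly is a deliberate choice here: it makes the number of worker processes obvious and lets you leave a core free if you want to keep using the machine while the simulation runs.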
OK, I know this seems like a contrived situation, but it sets us up perfectly for my next post, where I talk about extending this technique to find the number of samples you need in an experiment to measure a statistically significant shift in the results. Because we’re generating all of the data ourselves and we aren’t I/O bound, a simple multicore technique like this will save us a lot of time.
(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)