
Algorithm of the Week: Practical Parallelizing in R


I wrote an algorithm in R to run a Monte Carlo simulation of how many test subjects I need in a split test to detect an X% shift in the mean. It required hundreds of thousands of calculations to produce the final table, so the algorithm ran for a few minutes.

I’ll talk about my specific problem in a future post, but for now I’ll quickly introduce you to parallel operations in R.

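First you should install the “multicore” package. I can’t say that this is the “best” package but it works: install.packages("multicore"). (On R 2.14 and later, the same “mclapply” function also ships in the bundled “parallel” package, so library(parallel) works without installing anything.)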
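Before parallelizing anything, it helps to know how many cores R can see. This is a minimal sketch that assumes the bundled “parallel” package rather than “multicore”; the variable name n_cores is just for illustration.

```r
# Assumption: using the "parallel" package that is bundled with R (2.14+).
library(parallel)

# detectCores() reports the number of logical cores; it can return NA
# on platforms where the count cannot be determined.
n_cores <- detectCores()
cat("Cores detected:", n_cores, "\n")
```

If this prints 1 (or NA), a forked apply call has nothing extra to spread the work across.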

Now you can use a function called “mclapply” in place of “lapply” (or in place of a single-vector “mapply” call like the one below). Let’s create a slightly contrived example:

mapply(function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)), 1:10000)

This draws two samples of 10,000 values each from the same normal distribution and compares them with a t-test, and it repeats that 10,000 times. It takes several seconds before it starts printing results. Because I chose to implement this with an apply function rather than a for loop, I can easily convert it to be multi-core friendly. Check out this example and try running it.

mclapply(1:10000, function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)))

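You might also try watching your CPU with both versions. The “mclapply” version will spread the work across your cores and finish MUCH faster. The standard R version keeps only a single core busy. (If you use mclapply from the “parallel” package, pass mc.cores explicitly, since it defaults to two cores there. It also relies on forking, so it won’t run in parallel on Windows.)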
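If you’d rather measure the difference than eyeball a CPU monitor, a rough timing comparison could look like the sketch below. It assumes the “parallel” package, uses smaller sample sizes than the example above so it finishes quickly, and the helper name one_test is made up for illustration.

```r
# Assumption: the bundled "parallel" package provides mclapply and detectCores.
library(parallel)

# One simulated comparison: two samples from the same normal distribution.
# Returns only the p-value to keep the results small.
one_test <- function(i, n = 1000) {
  t.test(rnorm(n, mean = 25, sd = 1), rnorm(n, mean = 25, sd = 1))$p.value
}

runs <- 1:2000

# Forking is unavailable on Windows, where mclapply only accepts mc.cores = 1.
# detectCores() may return NA, so fall back to a single core in that case.
cores <- if (.Platform$OS.type == "windows") 1L else max(1L, detectCores(), na.rm = TRUE)

# Time the single-core version, then the forked version.
serial_time   <- system.time(serial_res   <- lapply(runs, one_test))
parallel_time <- system.time(parallel_res <- mclapply(runs, one_test, mc.cores = cores))

print(serial_time["elapsed"])
print(parallel_time["elapsed"])
```

The actual speedup depends on your machine; with only one core available the two timings should be about the same.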

Ok. I know this seems like a contrived situation, but it sets us up perfectly for my next post where I talk about extending this technique to find the number of samples you need in an experiment to measure a statistically significant shift in the results. Because we’re generating all of the data and we aren’t I/O bound, a simple multicore technique like this will save us big.

Stay tuned!

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)


Published at DZone with permission of Justin Bozonier, DZone MVB.
