Over a million developers have joined DZone.

Algorithm of the Week: Practical Parallelizing in R

DZone's Guide to

Algorithm of the Week: Practical Parallelizing in R

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

I wrote an algorithm in R to run a Monte Carlo simulation of how many test subjects I need for split tests to detect X% shift in the mean. It essentially required hundreds of thousands of calculations in order to come up with the final table. As a result this meant that my algorithm ran for a few minutes.

I’ll talk about my specific problem in a future post but for now I’ll quickly introduce you to parallel operations in R.

First you should install the “multicore” package. I can’t say that this is the “best” package but it works: install.packages(“multicore”)

Now you can use a function called “mclapply” that you can use in place of “mapply”. Let’s create a slightly contrived example:

mapply(function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)), 1:10000)

This will create two normal distributions with 10,000 elements in each and will compare them to each other with a t-test 10,000 times. It takes several seconds until it starts printing its results. Now because I chose to implement this using an apply function rather than using a for loop, I can easily convert this to be multi-core friendly. Check out this example and try running it.

mclapply(1:10000, function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)))

You might also try watching your CPU with both versions. The “mclapply” version will automatically max out all of your cores and finish MUCH faster. The standard R version will hardly peg a single CPU.

Ok. I know this seems like a contrived situation, but it sets us up perfectly for my next post where I talk about extending this technique to find the number of samples you need in an experiment to measure a statistically significant shift in the results. Because we’re generating all of the data and we aren’t I/O bound, a simple multicore technique like this will save us big.

Stay tuned!

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}