Over a million developers have joined DZone.

Algorithm of the Week: Practical Parallelizing in R

DZone's Guide to

Algorithm of the Week: Practical Parallelizing in R

· Big Data Zone
Free Resource

Free O'Reilly eBook: Learn how to architect always-on apps that scale. Brought to you by Mesosphere DC/OS–the premier platform for containers and big data.

I wrote an algorithm in R to run a Monte Carlo simulation of how many test subjects I need for split tests to detect X% shift in the mean. It essentially required hundreds of thousands of calculations in order to come up with the final table. As a result this meant that my algorithm ran for a few minutes.

I’ll talk about my specific problem in a future post but for now I’ll quickly introduce you to parallel operations in R.

First you should install the “multicore” package. I can’t say that this is the “best” package but it works: install.packages(“multicore”)

Now you can use a function called “mclapply” that you can use in place of “mapply”. Let’s create a slightly contrived example:

mapply(function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)), 1:10000)

This will create two normal distributions with 10,000 elements in each and will compare them to each other with a t-test 10,000 times. It takes several seconds until it starts printing its results. Now because I chose to implement this using an apply function rather than using a for loop, I can easily convert this to be multi-core friendly. Check out this example and try running it.

mclapply(1:10000, function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)))

You might also try watching your CPU with both versions. The “mclapply” version will automatically max out all of your cores and finish MUCH faster. The standard R version will hardly peg a single CPU.

Ok. I know this seems like a contrived situation, but it sets us up perfectly for my next post where I talk about extending this technique to find the number of samples you need in an experiment to measure a statistically significant shift in the results. Because we’re generating all of the data and we aren’t I/O bound, a simple multicore technique like this will save us big.

Stay tuned!

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)

Easily deploy & scale your data pipelines in clicks. Run Spark, Kafka, Cassandra + more on shared infrastructure and blow away your data silos. Learn how with Mesosphere DC/OS.


Published at DZone with permission of Justin Bozonier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}