
Are Parallel Computations Worth It?


Yesterday, Daniel Marcelino published an interesting post on his blog, entitled "Parallel Processing: When does it worth?" I had been asking myself the same question for a chapter I am currently writing. I liked his approach, so I tried to do the same on my own computer. I used three packages to run parallel R code,

> library(multicore)
> library(snow)
> library(snowfall)

and one to measure the time it takes to run the code

> library(microbenchmark)
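
A note for readers trying this on a recent R installation: since version 2.14.0, R ships with a parallel package that builds on multicore and snow, and mclapply is available there under the same name. A minimal sketch, assuming only base R; the toy squaring function and the variable names are illustrative and not part of the original benchmark:

```r
# parallel ships with R, so nothing needs to be installed
library(parallel)

# Number of cores R can detect (may be NA on some platforms)
all <- detectCores(all.tests = TRUE)

# mclapply() forks the R process, so it is limited to one core on Windows
cores <- if (.Platform$OS.type == "windows") 1L else max(1L, all, na.rm = TRUE)

# Toy task, only to show the call signature
res <- mclapply(1:4, function(k) k^2, mc.cores = cores)
unlist(res)
```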

I ran the code on my mac, at the office,

> all=detectCores(all.tests=TRUE)
> all
[1] 4

which is a standard computer, with four cores. To run some code, I had to generate datasets. Here, I consider a data frame with n rows and 100 columns. I generate values using a Gaussian distribution,

> gen=function(n) data.frame(matrix(rnorm(n*100),n,100))
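
To make the shape of the simulated data concrete, here is a quick check of what gen returns on a deliberately tiny n. The seed and the value n = 5 are choices made for this illustration, not part of the original run:

```r
gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))

set.seed(1)            # illustrative seed, only for reproducibility
small <- gen(n = 5)

dim(small)             # number of rows, then number of columns
names(small)[1:3]      # data.frame() names the matrix columns X1, X2, ...

# Quartiles of the first column, the quantity computed per column in the benchmark
quantile(small$X1, probs = 1:3 / 4)
```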

The goal here is to compute quantiles (more specifically, quartiles) per column, and to replicate that 100 times. The standard technique is to use lapply, but at least two parallel versions of that function can be found: mclapply and sfLapply. So, let us use them,

> base=gen(n=100)
> microbenchmark(
+ mlapp=data.frame(lapply(base, quantile, probs = 1:3/4 )),
+ mclapp=data.frame(mclapply(base, quantile, probs = 1:3/4 , mc.cores = all)),
+ sflapp=data.frame(sfLapply(base, quantile, probs = 1:3/4 )),
+ times=100) -> m
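
Before looking at timings, it can be worth checking that the sequential and the forked variants return the same quartiles. A minimal sketch, assuming the parallel package that ships with R as a stand-in for multicore, and two workers rather than all cores; the seed and the names q_seq and q_par are illustrative:

```r
library(parallel)  # provides mclapply() in current versions of R

gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))

set.seed(1)        # illustrative seed
base <- gen(n = 100)

# Forking is not available on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

q_seq <- data.frame(lapply(base, quantile, probs = 1:3 / 4))
q_par <- data.frame(mclapply(base, quantile, probs = 1:3 / 4, mc.cores = cores))

dim(q_seq)                 # one row per quartile, one column per variable
all.equal(q_seq, q_par)    # the two variants should agree
```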

For instance, with 100 rows, we have

> m
Unit: milliseconds
    expr      min       lq   median       uq       max
1 mclapp 50.19290 55.90364 57.99185 64.10619 266.88692
2  mlapp 26.94146 29.49396 31.20571 49.54824  75.60251
3 sflapp 27.54857 30.10224 31.41864 47.10688  59.28925

And with 500,000 rows, we have

> m
Unit: seconds
    expr       min         lq     median        uq      max
1 mclapp 42.999504 103.873919 161.989876 258.66887 660.2953
2  mlapp  3.720542   3.770319   4.070116  11.90181 166.9461
3 sflapp  3.587703   3.770399   4.027876  10.62654 181.0093

So yes, using parallel code would be very interesting! Especially with very large datasets (I could not run it with 1 million rows). If we run a loop to see how the median time evolves for each of those three functions, we can plot the time it took as a function of the number of rows,

> i=1;vk=seq(1,6,by=.2)
> col=seq(i,3*2,by=3)
> plot(10^vk,db[2,col],ylim=range(db),col="white",log="x",
+     xlab="Number of rows",ylab="Time")
> polygon(c(10^vk,rev(10^vk)),c(db[1,col],rev(db[3,col])),col="light blue",border=NA)
> lines(10^vk,db[2,col],col="blue",lwd=2)
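
The loop that fills db is not shown above. Here is a minimal sketch of how such a matrix might be built, under several assumptions that are not the post's own: base R's system.time stands in for microbenchmark, parLapply from the parallel package stands in for sfLapply, time_quartiles is a hypothetical helper, and the sizes stop at 1,000 rows so that it runs in seconds. The rows of db hold the lower quartile, the median and the upper quartile, with the three methods interleaved across columns, which is the layout the plotting code appears to expect:

```r
library(parallel)  # ships with R; provides mclapply() and parLapply()

gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))

# Hypothetical helper: quartiles of the elapsed time over `reps` runs of f()
time_quartiles <- function(f, reps = 5) {
  elapsed <- replicate(reps, system.time(f())[["elapsed"]])
  quantile(elapsed, probs = 1:3 / 4, names = FALSE)
}

vk <- seq(1, 3, by = 1)   # 10, 100 and 1,000 rows only, to keep the sketch fast
cores <- if (.Platform$OS.type == "windows") 1L else 2L  # no forking on Windows
cl <- makeCluster(2)      # socket cluster used by parLapply()

# Columns 1, 4, 7, ... are lapply; 2, 5, 8, ... mclapply; 3, 6, 9, ... parLapply
db <- matrix(NA_real_, nrow = 3, ncol = 3 * length(vk))
for (k in seq_along(vk)) {
  base <- gen(n = 10^vk[k])
  methods <- list(
    function() lapply(base, quantile, probs = 1:3 / 4),
    function() mclapply(base, quantile, probs = 1:3 / 4, mc.cores = cores),
    function() parLapply(cl, base, quantile, probs = 1:3 / 4)
  )
  for (i in 1:3) db[, 3 * (k - 1) + i] <- time_quartiles(methods[[i]])
}
stopCluster(cl)

i <- 1
col <- seq(i, 3 * length(vk), by = 3)   # columns of db for method i, one per size
```

With this layout, db[2, col] is the median time of method i for each size in 10^vk, and db[1, col] and db[3, col] bound the shaded band.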

Here, we have the following, with the standard lapply on the left (the line is the median time, and the band runs from the 25% to the 75% quartile), the multicore function in the middle, and the snowfall function on the right.

If we zoom in on small datasets (fewer than 10,000 rows, and 100 columns), we do observe a gain, since the code ran two times faster.

So clearly, it might be interesting to write code that is distributed across different cores. But here, I use a simple function (I compute quantiles on the columns of a dataset). I should try with a more complex function…

On the other hand, I should mention that, usually, while I have one (or two) scripts running, I can do something else: look for recent papers for ongoing research projects, answer emails that I should have answered a few weeks ago, check for typos in the book and update the tex file, type parts of a future post for my blog, etc. The problem I had yesterday afternoon, when I ran the code, was that suddenly all the cores on my computer were dedicated to that R code. I could not even finish an email I had started before running the code… So finally I left early, decided to pick up the kids after school, and went to the park to enjoy the sunny day we had! So I have to admit that running parallel code can have advantages you would not think of!


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
