Yesterday, Daniel Marcelino published an interesting post on his blog, entitled *Parallel Processing: When does it worth*? I was asking myself the same question for a chapter I am currently writing, and I liked his approach, so I tried to do the same on my computer. I used three packages to run parallel R code,

> library(multicore)
> library(snow)
> library(snowfall)

and one to measure how long the code takes to run

> library(microbenchmark)

I ran the code on my Mac, at the office,

> all=detectCores(all.tests=TRUE)
> all
[1] 4

which is a standard computer, with four cores. To run the code, I had to generate datasets. Here, I consider a data frame with n rows and 100 columns. I generate the values using a Gaussian distribution,

> gen=function(n) data.frame(matrix(rnorm(n*100),n,100))

The goal, here, will be to compute quantiles (or, to be more specific, quartiles) per column, and to replicate that 100 times. The standard technique is to use lapply, but (at least) two parallel versions of the function can be found. So, let us use them

> base=gen(n=100)
> microbenchmark(
+ mlapp=data.frame(lapply(base, quantile, probs = 1:3/4 )),
+ mclapp=data.frame(mclapply(base, quantile, probs = 1:3/4 , mc.cores = all)),
+ sflapp=data.frame(sfLapply(base, quantile, probs = 1:3/4 )),
+ times=100) -> m
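Before looking at timings, it can help to check that the sequential and the forked versions return the same quartiles. The following is only a minimal sketch: it assumes the base `parallel` package (which ships with R and provides `mclapply()`, as multicore does), and the number of cores (two) and the names `cores`, `q_seq` and `q_par` are my own choices for illustration.

```r
library(parallel)  # ships with R; provides mclapply()

set.seed(1)
gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))
base <- gen(n = 100)

# forking is not available on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# sequential and forked versions of the same computation
q_seq <- data.frame(lapply(base, quantile, probs = 1:3/4))
q_par <- data.frame(mclapply(base, quantile, probs = 1:3/4, mc.cores = cores))

dim(q_seq)                # one row per quartile, one column per variable
all.equal(q_seq, q_par)   # both versions should agree
```

Only the way the columns are dispatched changes; the function applied to each column is the same in both calls.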

For instance, with 100 rows, we have

> m
Unit: milliseconds
    expr      min       lq   median       uq       max
1 mclapp 50.19290 55.90364 57.99185 64.10619 266.88692
2  mlapp 26.94146 29.49396 31.20571 49.54824  75.60251
3 sflapp 27.54857 30.10224 31.41864 47.10688  59.28925

And with 500,000 rows, we have

> m
Unit: seconds
    expr       min         lq     median        uq      max
1 mclapp 42.999504 103.873919 161.989876 258.66887 660.2953
2  mlapp  3.720542   3.770319   4.070116  11.90181 166.9461
3 sflapp  3.587703   3.770399   4.027876  10.62654 181.0093
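A note on the sflapp row: as far as I know, snowfall only distributes work once a cluster has been started with `sfInit(parallel=TRUE, cpus=...)`, and otherwise falls back to sequential execution, which would be consistent with sflapp and mlapp giving almost the same timings. The sketch below shows the same start, use, stop pattern with the snow-style cluster functions of the base `parallel` package (`makeCluster()`, `parLapply()`, `stopCluster()`). The two workers are an arbitrary choice, and this is not the code used for the timings above.

```r
library(parallel)

gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))
base <- gen(n = 100)

cl <- makeCluster(2)   # start two worker processes (assumed count)
q_cl <- data.frame(parLapply(cl, base, quantile, probs = 1:3/4))
stopCluster(cl)        # release the workers once done

# the sequential version, for comparison
q_seq <- data.frame(lapply(base, quantile, probs = 1:3/4))
all.equal(q_seq, q_cl)
```

With a cluster, the data have to be sent to the workers, so the cost of the transfer is part of what gets timed.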

So yes, using parallel code would be very interesting! Especially with very large datasets (I could not run it with 1 million rows). If we run a loop to follow the evolution of the median time for each of those three functions, we can plot the time it took as a function of the number of rows,

> i=1;vk=seq(1,6,by=.2)
> # db is assumed to hold the 25%, 50% and 75% timings (rows), with three columns (one per function) for each size
> col=seq(i,3*length(vk),by=3)
> plot(10^vk,db[2,col],ylim=range(db),col="white",log="x",
+ xlab="Number of rows",ylab="Time")
> polygon(c(10^vk,rev(10^vk)),c(db[1,col],rev(db[3,col])),col="light blue",border=NA)
> lines(10^vk,db[2,col],col="blue",lwd=2)
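The loop that fills `db` is not shown here. A minimal sketch of what it could look like is given below, under several assumptions of mine: timings come from `system.time()` rather than microbenchmark, the base `parallel` package stands in for multicore and snowfall, the grid `vk` is much coarser and `reps` much smaller than in the post so that it runs quickly, and `db` stores the three quartiles of the timings in rows, with three consecutive columns (one per function) for each dataset size.

```r
library(parallel)

gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))
cores <- if (.Platform$OS.type == "windows") 1L else 2L
cl <- makeCluster(2)

# three competitors, in the column order assumed for db
runs <- list(
  lapply   = function(d) data.frame(lapply(d, quantile, probs = 1:3/4)),
  mclapply = function(d) data.frame(mclapply(d, quantile, probs = 1:3/4,
                                             mc.cores = cores)),
  cluster  = function(d) data.frame(parLapply(cl, d, quantile, probs = 1:3/4))
)

vk   <- seq(1, 3, by = 1)   # coarse grid so that the sketch runs quickly
reps <- 5                   # replications per size (the post uses 100)
db   <- matrix(NA_real_, 3, 3 * length(vk))

for (k in seq_along(vk)) {
  d <- gen(n = 10^vk[k])
  for (i in seq_along(runs)) {
    times <- replicate(reps, system.time(runs[[i]](d))[["elapsed"]])
    # rows of db: 25%, 50% and 75% quantiles of the elapsed times
    db[, 3 * (k - 1) + i] <- quantile(times, probs = 1:3/4)
  }
}
stopCluster(cl)

i <- 1
col <- seq(i, 3 * length(vk), by = 3)   # columns holding the lapply timings
db[2, col]                              # median time per dataset size
```

Setting `i` to 2 or 3 selects the columns of the other two functions, so the same plotting code can be reused for each panel.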

Here, we have the following, with the standard **lapply** on the left (the line is the median time, and the band goes from the 25% to the 75% quartile), the **multicore** function in the middle, and the **snowfall** function on the right,

If we zoom in, for small datasets (fewer than 10,000 rows and 100 columns), we do observe a gain, since the code ran twice as fast

So clearly, it might be interesting to write code that can be distributed over different cores. But here, I use a simple function (I compute quantiles on the columns of a dataset). I should try with a more complex function…

On the other hand, I should mention that, usually, while I have one (or two) pieces of code running, I can do something else: look for recent papers for ongoing research projects, answer emails that I should have answered a few weeks ago, check for typos in the book and update the tex file, type parts of a future post on my blog, etc. The problem I had yesterday afternoon, when I ran the code, was that suddenly all the cores on my computer were dedicated to that R code. I could not even finish an email I had started before running the code… So finally I left early, decided to pick up the kids after school, and went to the park to enjoy the sunny day we had! So I have to admit that running parallel code can have advantages you could not think of!
