Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Parallelizing Linear Regression or Using Multiple Sources

DZone's Guide to

Parallelizing Linear Regression or Using Multiple Sources

See how it works to parallelize computation on multiple cores, using multiple sources to estimate the parameters of a linear regression.

· Performance Zone ·
Free Resource

Sensu is an open source monitoring event pipeline. Try it today.

My previous post explained how mathematically, it was possible to parallelize computation to estimate the parameters of a linear regression. More specifically, we have a matrix \mathbf{X} which is n\times k matrix and \mathbf{y} a n-dimensional vector, and we want to compute \widehat{\mathbf{\beta}}=[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y} by splitting the job. Instead of using the observations, we've seen that it was possible to compute "something" using the first n_1 rows, then the next n_2 rows, etc. Then, finally, we "aggregate" the m objects created to get our overall estimate.

Parallelizing on Multiple Cores

Let us see how it works from a computational point of view, to run each computation on a different core of the machine. Each core will see a slave, computing what we've seen in the previous post. Here, the data we use are

y = cars$dist
X = data.frame(1,cars$speed)
k = ncol(X)

On my laptop, I have three cores, so we will split it in m=3 chunks.

library(parallel)
library(pbapply)
ncl = detectCores()-1
cl = makeCluster(ncl)

This is more or less what we will do: we have our dataset, and we split the jobs:

We can then create lists containing elements that will be sent to each core, as Ewen suggested:

chunk = function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
a_parcourir = chunk(seq_len(nrow(X)), ncl)
for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]]))
Xlist = split(X, unlist(a_parcourir))
ylist = split(y, unlist(a_parcourir))

It is also possible to simplify the QR functions we will use.

compute_qr = function(x){
  list(Q=qr.Q(qr(as.matrix(x))),R=qr.R(qr(as.matrix(x))))
}
get_Vlist = function(j){
  Q3 = QR1[[j]]$Q %*% Q2list[[j]]
  t(Q3) %*% ylist[[j]]
}
clusterExport(cl, c("compute_qr", "get_Vlist"), envir=environment())

Then, we can run our functions on each core. The first one is 

QR1 = parLapply(cl=cl,Xlist, compute_qr)

Note that it is also possible to use

QR1 = pblapply(Xlist, compute_qr, cl=cl)

which will include a progress bar (that can be nice when the database is rather large). Then use

  R1 = pblapply(QR1, function(x) x$R, cl=cl) %>% do.call("rbind", .)
  Q1 = qr.Q(qr(as.matrix(R1)))
  R2 = qr.R(qr(as.matrix(R1)))
  Q2list = split.data.frame(Q1, rep(1:ncl, each=k))
  clusterExport(cl, c("QR1", "Q2list", "ylist"), envir=environment())
  Vlist = pblapply(1:length(QR1), get_Vlist, cl=cl)
  sumV = Reduce('+', Vlist)

And finally, the output is

solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

which is what we were expecting...

Using Multiple Sources

In practice, it might also happen that various "servers" have the data, but we cannot get a copy. But it is possible to run some functions on their server and get an output that we can use afterward.

Datasets are supposed to be available somewhere. We can send a request and get a matrix. Then we aggregate all of them and send another request. That's what we will do here. Provider j should run f_1(\mathbf{X}) on his part of the data, that function will return R^{(1)}_j. More precisely, to the first provider, send

function1 = function(subX){
return(qr.R(qr(as.matrix(subX))))}
R1 = function1(Xlist[[1]])

and actually, send that function to all providers, and aggregate the output

for(j in 2:m) R1 = rbind(R1,function1(Xlist[[j]]))

The create on your side the following objects

Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k+1:k,]

Finally, contact one last time the providers, and send one of your objects

function2=function(subX,suby,Q){
Q1=qr.Q(qr(as.matrix(subX)))
Q2=Q
return(t(Q1%*%Q2) %*% suby)}

Provider j should then run f_2(\mathbf{X},\mathbf{y},Q_j^{(2)}) on his part of the data, using also Q_j^{(2)} as argument (that we obtained on own side) and that function will return (\mathbf{Q}^{(2)}_j\mathbf{Q}^{(1)}_j)^{T}_j\mathbf{y}_j. For instance, ask the first provider to run

sumV = function2(Xlist[[1]],ylist[[1]], Q2list[[1]])

and do the same with all providers

for(j in 2:m) sumV = sumV+ function2(Xlist[[j]],ylist[[j]], Q2list[[j]])
solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

which is what we were expecting...

Sensu: workflow automation for monitoring. Learn more—download the whitepaper.

Topics:
performance ,linear regression ,tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}