# Parallelizing Linear Regression or Using Multiple Sources

# Parallelizing Linear Regression or Using Multiple Sources

### See how it works to parallelize computation on multiple cores, using multiple sources to estimate the parameters of a linear regression.

Join the DZone community and get the full member experience.

Join For Free**Sensu is an open source monitoring event pipeline. Try it today.**

My previous post explained how mathematically, it was possible to parallelize computation to estimate the parameters of a linear regression. More specifically, we have a matrix \mathbf{X} which is n\times k matrix and \mathbf{y} a n-dimensional vector, and we want to compute \widehat{\mathbf{\beta}}=[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y} by splitting the job. Instead of using the observations, we've seen that it was possible to compute "something" using the first n_1 rows, then the next n_2 rows, etc. Then, finally, we "aggregate" the *m* objects created to get our overall estimate.

## Parallelizing on Multiple Cores

Let us see how it works from a computational point of view, to run each computation on a different core of the machine. Each core will see a slave, computing what we've seen in the previous post. Here, the data we use are

```
y = cars$dist
X = data.frame(1,cars$speed)
k = ncol(X)
```

On my laptop, I have three cores, so we will split it in m=3 chunks.

```
library(parallel)
library(pbapply)
ncl = detectCores()-1
cl = makeCluster(ncl)
```

This is more or less what we will do: we have our dataset, and we split the jobs:

We can then create lists containing elements that will be sent to each core, as Ewen suggested:

```
chunk = function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
a_parcourir = chunk(seq_len(nrow(X)), ncl)
for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]]))
Xlist = split(X, unlist(a_parcourir))
ylist = split(y, unlist(a_parcourir))
```

It is also possible to simplify the QR functions we will use.

```
compute_qr = function(x){
list(Q=qr.Q(qr(as.matrix(x))),R=qr.R(qr(as.matrix(x))))
}
get_Vlist = function(j){
Q3 = QR1[[j]]$Q %*% Q2list[[j]]
t(Q3) %*% ylist[[j]]
}
clusterExport(cl, c("compute_qr", "get_Vlist"), envir=environment())
```

Then, we can run our functions on each core. The first one is

`QR1 = parLapply(cl=cl,Xlist, compute_qr)`

Note that it is also possible to use

`QR1 = pblapply(Xlist, compute_qr, cl=cl)`

which will include a progress bar (that can be nice when the database is rather large). Then use

```
R1 = pblapply(QR1, function(x) x$R, cl=cl) %>% do.call("rbind", .)
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list = split.data.frame(Q1, rep(1:ncl, each=k))
clusterExport(cl, c("QR1", "Q2list", "ylist"), envir=environment())
Vlist = pblapply(1:length(QR1), get_Vlist, cl=cl)
sumV = Reduce('+', Vlist)
```

And finally, the output is

```
solve(R2) %*% sumV
[,1]
X1 -17.579095
X2 3.932409
```

which is what we were expecting...

## Using Multiple Sources

In practice, it might also happen that various "servers" have the data, but we cannot get a copy. But it is possible to run some functions on their server and get an output that we can use afterward.

Datasets are supposed to be available somewhere. We can send a request and get a matrix. Then we aggregate all of them and send another request. That's what we will do here. Provider j should run f_1(\mathbf{X}) on his part of the data, that function will return R^{(1)}_j. More precisely, to the first provider, send

```
function1 = function(subX){
return(qr.R(qr(as.matrix(subX))))}
R1 = function1(Xlist[[1]])
```

and actually, send that function to all providers, and aggregate the output

`for(j in 2:m) R1 = rbind(R1,function1(Xlist[[j]]))`

The create on your side the following objects

```
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k+1:k,]
```

Finally, contact one last time the providers, and send one of your objects

```
function2=function(subX,suby,Q){
Q1=qr.Q(qr(as.matrix(subX)))
Q2=Q
return(t(Q1%*%Q2) %*% suby)}
```

Provider j should then run f_2(\mathbf{X},\mathbf{y},Q_j^{(2)}) on his part of the data, using also Q_j^{(2)} as argument (that we obtained on own side) and that function will return (\mathbf{Q}^{(2)}_j\mathbf{Q}^{(1)}_j)^{T}_j\mathbf{y}_j. For instance, ask the first provider to run

`sumV = function2(Xlist[[1]],ylist[[1]], Q2list[[1]])`

and do the same with all providers

`for(j in 2:m) sumV = sumV+ function2(Xlist[[j]],ylist[[j]], Q2list[[j]])`

```
solve(R2) %*% sumV
[,1]
X1 -17.579095
X2 3.932409
```

which is what we were expecting...

**Sensu: workflow automation for monitoring. Learn more—download the whitepaper.**

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}