# Linear Regression, With Map-Reduce

# Linear Regression, With Map-Reduce

### Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. In today's article, we look at how to do so.

Join the DZone community and get the full member experience.

Join For FreeThe open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Consider the case of the linear regression, y=Xβ+ε (with classical matrix notations). The OLS estimate of β is

To illustrate, consider a not too big dataset, and run some regression.

```
lm(dist~speed,data=cars)$coefficients
(Intercept) speed
-17.579095 3.932409
y=cars$dist
X=cbind(1,cars$speed)
solve(crossprod(X,X))%*%crossprod(X,y)
[,1]
[1,] -17.579095
[2,] 3.932409
```

How is this computed in R? Actually, it is based on the QR decomposition of X,X=QR, where Q is an orthogonal matrix (i.e. QTQ=I). Then

```
solve(qr.R(qr(as.matrix(X)))) %*% t(qr.Q(qr(as.matrix(X)))) %*% y
[,1]
[1,] -17.579095
[2,] 3.932409
```

So far, so good, we get the same output. Now, what if we want to parallelize computations? Actually, it is possible.

Consider m blocks

`m = 5`

and split vectors and matrices

and

To split vectors and matrices, use:

```
Xlist = list()
for(j in 1:m) Xlist[[j]] = X[(j-1)*10+1:10,]
ylist = list()
for(j in 1:m) ylist[[j]] = y[(j-1)*10+1:10]
```

and get small QR recomposition (per subset)

```
QR1 = list()
for(j in 1:m) QR1[[j]] = list(Q=qr.Q(qr(as.matrix(Xlist[[j]]))),R=qr.R(qr(as.matrix(Xlist[[j]]))))
```

Consider the QR decomposition of R^{(1)} which is the first step of the reduce part

where

```
R1 = QR1[[1]]$R
for(j in 2:m) R1 = rbind(R1,QR1[[j]]$R)
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*2+1:2,]
```

Define – as step 2 of the reduce part

and

```
Q3list = list()
for(j in 1:m) Q3list[[j]] = QR1[[j]]$Q %*% Q2list[[j]]
Vlist = list()
for(j in 1:m) Vlist[[j]] = t(Q3list[[j]]) %*% ylist[[j]]
```

and finally set – as the step 3 of the reduce part

```
sumV = Vlist[[1]]
for(j in 2:m) sumV = sumV+Vlist[[j]]
solve(R2) %*% sumV
[,1]
[1,] -17.579095
[2,] 3.932409
```

It looks like we’ve been able to parallelize our linear regression!

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}