# Linear Regression, With Map-Reduce

# Linear Regression, With Map-Reduce

### Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. In today's article, we look at how to do so.

Join the DZone community and get the full member experience.

Join For FreeHortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Consider the case of the linear regression, y=Xβ+ε (with classical matrix notations). The OLS estimate of β is

To illustrate, consider a not too big dataset, and run some regression.

```
lm(dist~speed,data=cars)$coefficients
(Intercept) speed
-17.579095 3.932409
y=cars$dist
X=cbind(1,cars$speed)
solve(crossprod(X,X))%*%crossprod(X,y)
[,1]
[1,] -17.579095
[2,] 3.932409
```

How is this computed in R? Actually, it is based on the QR decomposition of X,X=QR, where Q is an orthogonal matrix (i.e. QTQ=I). Then

```
solve(qr.R(qr(as.matrix(X)))) %*% t(qr.Q(qr(as.matrix(X)))) %*% y
[,1]
[1,] -17.579095
[2,] 3.932409
```

So far, so good, we get the same output. Now, what if we want to parallelize computations? Actually, it is possible.

Consider m blocks

`m = 5`

and split vectors and matrices

and

To split vectors and matrices, use:

```
Xlist = list()
for(j in 1:m) Xlist[[j]] = X[(j-1)*10+1:10,]
ylist = list()
for(j in 1:m) ylist[[j]] = y[(j-1)*10+1:10]
```

and get small QR recomposition (per subset)

```
QR1 = list()
for(j in 1:m) QR1[[j]] = list(Q=qr.Q(qr(as.matrix(Xlist[[j]]))),R=qr.R(qr(as.matrix(Xlist[[j]]))))
```

Consider the QR decomposition of R^{(1)} which is the first step of the reduce part

where

```
R1 = QR1[[1]]$R
for(j in 2:m) R1 = rbind(R1,QR1[[j]]$R)
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*2+1:2,]
```

Define – as step 2 of the reduce part

and

```
Q3list = list()
for(j in 1:m) Q3list[[j]] = QR1[[j]]$Q %*% Q2list[[j]]
Vlist = list()
for(j in 1:m) Vlist[[j]] = t(Q3list[[j]]) %*% ylist[[j]]
```

and finally set – as the step 3 of the reduce part

```
sumV = Vlist[[1]]
for(j in 2:m) sumV = sumV+Vlist[[j]]
solve(R2) %*% sumV
[,1]
[1,] -17.579095
[2,] 3.932409
```

It looks like we’ve been able to parallelize our linear regression!

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub. Join the discussion.

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}