
Linear Regression, With Map-Reduce


Sometimes, with big data, matrices are too big to handle in memory, but a few numerical tricks let us fit the model anyway. In today's article, we look at how to run a linear regression with Map-Reduce.


Sometimes, with big data, matrices are too big to handle in memory, and we need a few tricks to carry out the computation anyway. Map-Reduce is one of those. With several cores, it is possible to split the problem, map the computation onto each machine, and then aggregate everything back at the end.
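
To fix ideas, here is a toy sketch of that split/map/aggregate pattern in base R (a hypothetical warm-up on a sum of squares, not part of the regression problem):

```r
# toy split / map / aggregate pattern, on a sum of squares
x = 1:100
chunks = split(x, rep(1:4, each = 25))          # split into 4 blocks
partial = lapply(chunks, function(z) sum(z^2))  # map: work on each block separately
total = Reduce(`+`, partial)                    # aggregate the partial results
total == sum(x^2)                               # TRUE: same answer as the direct computation
```

The regression case is more subtle, since the estimator is not a plain sum over blocks; this is where the QR decomposition will help.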

Consider the case of linear regression, y = Xβ + ε (with classical matrix notation). The OLS estimator of β is

\hat{\beta} = (X^\top X)^{-1} X^\top y

To illustrate, consider the (not too big) cars dataset, and run the regression.

lm(dist~speed,data=cars)$coefficients
(Intercept)       speed 
 -17.579095    3.932409
y=cars$dist
X=cbind(1,cars$speed)
solve(crossprod(X,X))%*%crossprod(X,y)
           [,1]
[1,] -17.579095
[2,]   3.932409

How is this computed in R? Actually, it is based on the QR decomposition of X, X = QR, where Q is an orthogonal matrix (i.e. Q^\top Q = I). Then

\hat{\beta} = R^{-1} Q^\top y

solve(qr.R(qr(as.matrix(X)))) %*% t(qr.Q(qr(as.matrix(X)))) %*% y
           [,1]
[1,] -17.579095
[2,]   3.932409

So far, so good: we get the same output. Now, what if we want to parallelize the computations? Actually, it is possible.

Consider m blocks (the cars dataset has 50 observations, so each of the m = 5 blocks will contain 10 of them)

m = 5

and split vectors and matrices

X = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix}

and

y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}

To split vectors and matrices, use:

Xlist = list()
for(j in 1:m) Xlist[[j]] = X[(j-1)*10+1:10,]
ylist = list()
for(j in 1:m) ylist[[j]] = y[(j-1)*10+1:10]

and get a small QR decomposition per subset, X_j = Q_j^{(1)} R_j^{(1)}:

QR1 = list()
for(j in 1:m) QR1[[j]] = list(Q=qr.Q(qr(as.matrix(Xlist[[j]]))),R=qr.R(qr(as.matrix(Xlist[[j]]))))
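
As a quick sanity check (my own, not from the original walkthrough), the two factors of a block should multiply back to the block itself:

```r
# sanity check: Q %*% R recovers the first block of X (up to rounding)
X = cbind(1, cars$speed)
X1 = X[1:10, ]
Q = qr.Q(qr(X1))
R = qr.R(qr(X1))
all.equal(Q %*% R, X1, check.attributes = FALSE)  # TRUE
```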

Consider the QR decomposition of R^{(1)}, the matrix obtained by stacking the blocks R_j^{(1)}, which is the first step of the reduce part:

R^{(1)} = \begin{pmatrix} R_1^{(1)} \\ \vdots \\ R_m^{(1)} \end{pmatrix} = Q^{(2)} R^{(2)}

where

Q^{(2)} = \begin{pmatrix} Q_1^{(2)} \\ \vdots \\ Q_m^{(2)} \end{pmatrix}

R1 = QR1[[1]]$R
for(j in 2:m) R1 = rbind(R1,QR1[[j]]$R)
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*2+1:2,]


Define, as step 2 of the reduce part,

Q_j^{(3)} = Q_j^{(1)} Q_j^{(2)}

and

V_j = \left(Q_j^{(3)}\right)^\top y_j

Q3list = list()
for(j in 1:m) Q3list[[j]] = QR1[[j]]$Q %*% Q2list[[j]]
Vlist = list()
for(j in 1:m) Vlist[[j]] = t(Q3list[[j]]) %*% ylist[[j]]

and finally set, as step 3 of the reduce part,

\hat{\beta} = \left(R^{(2)}\right)^{-1} \sum_{j=1}^m V_j

sumV = Vlist[[1]]
for(j in 2:m) sumV = sumV+Vlist[[j]]
solve(R2) %*% sumV
           [,1]
[1,] -17.579095
[2,]   3.932409

It looks like we’ve been able to parallelize our linear regression!
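
To wrap up, the whole pipeline can be packaged into a single function. This is only a sketch (the name mapreduce_ols is mine, and it assumes the number of rows is divisible by m); the lapply in the map step is the part that could be shipped to separate cores or machines, e.g. with parallel::mclapply:

```r
# map-reduce OLS in one function; assumes nrow(X) is divisible by m
mapreduce_ols = function(X, y, m) {
  n = nrow(X); k = ncol(X)
  idx = split(1:n, rep(1:m, each = n %/% m))
  # map: one QR decomposition per block (could run on separate machines)
  QR1 = lapply(idx, function(i) {
    d = qr(X[i, , drop = FALSE])
    list(Q = qr.Q(d), R = qr.R(d), i = i)
  })
  # reduce, step 1: QR decomposition of the stacked R's
  R1 = do.call(rbind, lapply(QR1, function(b) b$R))
  Q2 = qr.Q(qr(R1)); R2 = qr.R(qr(R1))
  # reduce, steps 2 and 3: recombine the Q's, project y, and aggregate
  V = lapply(1:m, function(j)
    t(QR1[[j]]$Q %*% Q2[(j-1)*k + 1:k, ]) %*% y[QR1[[j]]$i])
  solve(R2) %*% Reduce(`+`, V)
}
mapreduce_ols(cbind(1, cars$speed), cars$dist, m = 5)
#            [,1]
# [1,] -17.579095
# [2,]   3.932409
```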



Published at DZone with permission of

