Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Linear Regression, With Map-Reduce

DZone's Guide to

Linear Regression, With Map-Reduce

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. In today's article, we look at how to do so.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Consider the case of the linear regression, y=Xβ+ε (with classical matrix notations). The OLS estimate of β is

Image title

To illustrate, consider a not too big dataset, and run some regression.

lm(dist~speed,data=cars)$coefficients
(Intercept)       speed 
 -17.579095    3.932409
y=cars$dist
X=cbind(1,cars$speed)
solve(crossprod(X,X))%*%crossprod(X,y)
           [,1]
[1,] -17.579095
[2,]   3.932409

How is this computed in R? Actually, it is based on the QR decomposition of X,X=QR, where Q is an orthogonal matrix (i.e. QTQ=I). Then 

Image title

solve(qr.R(qr(as.matrix(X)))) %*% t(qr.Q(qr(as.matrix(X)))) %*% y
           [,1]
[1,] -17.579095
[2,]   3.932409

So far, so good, we get the same output. Now, what if we want to parallelize computations? Actually, it is possible.

Consider m blocks

m = 5

and split vectors and matrices

Image title

and

Image title

To split vectors and matrices, use:

Xlist = list()
for(j in 1:m) Xlist[[j]] = X[(j-1)*10+1:10,]
ylist = list()
for(j in 1:m) ylist[[j]] = y[(j-1)*10+1:10]

and get small QR recomposition (per subset)

QR1 = list()
for(j in 1:m) QR1[[j]] = list(Q=qr.Q(qr(as.matrix(Xlist[[j]]))),R=qr.R(qr(as.matrix(Xlist[[j]]))))

Consider the QR decomposition of R(1) which is the first step of the reduce part

Image title

where

Image title

R1 = QR1[[1]]$R
for(j in 2:m) R1 = rbind(R1,QR1[[j]]$R)
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*2+1:2,]


Define – as step 2 of the reduce part

Image title

and

Image title

Q3list = list()
for(j in 1:m) Q3list[[j]] = QR1[[j]]$Q %*% Q2list[[j]]
Vlist = list()
for(j in 1:m) Vlist[[j]] = t(Q3list[[j]]) %*% ylist[[j]]

and finally set – as the step 3 of the reduce part

Image title

sumV = Vlist[[1]]
for(j in 2:m) sumV = sumV+Vlist[[j]]
solve(R2) %*% sumV
           [,1]
[1,] -17.579095
[2,]   3.932409

It looks like we’ve been able to parallelize our linear regression!

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
data mapping ,map-reduce ,big data

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}