Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

# Linear Regression, With Map-Reduce

### Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. In today's article, we look at how to do so.

· Big Data Zone ·
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Consider the case of the linear regression, y=Xβ+ε (with classical matrix notations). The OLS estimate of β is

To illustrate, consider a not too big dataset, and run some regression.

lm(dist~speed,data=cars)$coefficients (Intercept) speed -17.579095 3.932409 y=cars$dist
X=cbind(1,cars$speed) solve(crossprod(X,X))%*%crossprod(X,y) [,1] [1,] -17.579095 [2,] 3.932409 How is this computed in R? Actually, it is based on the QR decomposition of X,X=QR, where Q is an orthogonal matrix (i.e. QTQ=I). Then solve(qr.R(qr(as.matrix(X)))) %*% t(qr.Q(qr(as.matrix(X)))) %*% y [,1] [1,] -17.579095 [2,] 3.932409 So far, so good, we get the same output. Now, what if we want to parallelize computations? Actually, it is possible. Consider m blocks m = 5 and split vectors and matrices and To split vectors and matrices, use: Xlist = list() for(j in 1:m) Xlist[[j]] = X[(j-1)*10+1:10,] ylist = list() for(j in 1:m) ylist[[j]] = y[(j-1)*10+1:10] and get small QR recomposition (per subset) QR1 = list() for(j in 1:m) QR1[[j]] = list(Q=qr.Q(qr(as.matrix(Xlist[[j]]))),R=qr.R(qr(as.matrix(Xlist[[j]])))) Consider the QR decomposition of R(1) which is the first step of the reduce part where R1 = QR1[[1]]$R
for(j in 2:m) R1 = rbind(R1,QR1[[j]]$R) Q1 = qr.Q(qr(as.matrix(R1))) R2 = qr.R(qr(as.matrix(R1))) Q2list=list() for(j in 1:m) Q2list[[j]] = Q1[(j-1)*2+1:2,] Define – as step 2 of the reduce part and Q3list = list() for(j in 1:m) Q3list[[j]] = QR1[[j]]$Q %*% Q2list[[j]]
Vlist = list()
for(j in 1:m) Vlist[[j]] = t(Q3list[[j]]) %*% ylist[[j]]

and finally set – as the step 3 of the reduce part

sumV = Vlist[[1]]
for(j in 2:m) sumV = sumV+Vlist[[j]]
solve(R2) %*% sumV
[,1]
[1,] -17.579095
[2,]   3.932409

It looks like we’ve been able to parallelize our linear regression!

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
data mapping ,map-reduce ,big data

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

### {{ parent.tldr }}

{{ parent.urlSource.name }}