# On the Robustness of Lasso

### In this post, a big data expert continues his examination of lasso regressions by looking at how to use R, with simulations and the bootstrap, to assess how much the coefficient-path graphs can be trusted.


This is probably the last post on the lasso before the summer break... More specifically, I was wondering about the interpretation of the coefficient-path graphs.

We use them for variable selection, but my main concern was about confidence intervals: how much can we trust those lines?

As usual, a natural way is to use simulations on generated datasets. Consider, for instance:

```r
library(mnormt)
set.seed(123)
Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
n = 1000
X = rmnorm(n,rep(0,3),Sigma)
df = data.frame(X1=X[,1],X2=X[,2],X3=X[,3],X4=rnorm(n),
                X5=runif(n),
                X6=exp(X[,3]),
                X7=sample(c("A","B"),size=n,replace=TRUE,prob=c(.5,.5)),
                X8=sample(c("C","D"),size=n,replace=TRUE,prob=c(.5,.5)))
df$Y = 1+df$X1-df$X4+5*(df$X7=="A")+rnorm(n)
```

One can then simulate several such datasets and store the lasso estimates obtained on each of them.

```r
library(glmnet)
vlambda = exp(seq(-8,1,length=201))
VLASSO = list()
for(s in 1:100){ # regenerate df (as above) and X = model.matrix(lm(Y~.,data=df)) here
  lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  VLASSO[[s]] = as.matrix(lasso$beta)}
```

To visualize confidence bands, one can compute quantiles:

```r
Q05 = Q95 = Qm = matrix(NA,9,201)
for(i in 1:nrow(Q05)){
  for(j in 1:ncol(Q05)){
    v = unlist(lapply(VLASSO,function(x) x[i,j]))
    Q05[i,j] = quantile(v,.05)
    Q95[i,j] = quantile(v,.95)
    Qm[i,j]  = mean(v)
}}
```

And get the graph:

```r
library(RColorBrewer)
colrs = brewer.pal(8,"Set1")
plot(lasso,xvar="lambda",col=colrs,ylim=c(min(Q05),max(Q95)))
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[2,],rev(Q95[2,])),col=colrs[1],border=NA)
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[5,],rev(Q95[5,])),col=colrs[2],border=NA)
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[8,],rev(Q95[8,])),col=colrs[3],border=NA)
```

An alternative (more realistic when working with real data) is to use bootstrapped versions of the dataset.

```r
id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
lasso = glmnet(x=X[id,],y=df[id,"Y"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)
```

So far, it looks like it's working very well. Now, what if we have a smaller dataset?

```r
n = 100
```

On simulated new samples, we get:

While the bootstrap version is:

There is more uncertainty, clearly, but the conclusion is not ambiguous here.
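To reproduce this comparison, the simulation steps above can be wrapped in a function taking the sample size as a parameter. A minimal sketch, with illustrative choices of mine (the name `lasso_bands`, a simplified data-generating process where Y depends only on the first of three correlated Gaussian features, and 100 replications):

```r
library(glmnet)

# quantile bands of the lasso coefficient paths, for a given sample size n
lasso_bands = function(n, nsim = 100, vlambda = exp(seq(-8,1,length=201))){
  Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
  B = array(NA, c(nsim, 3, length(vlambda)))
  for(s in 1:nsim){
    X = matrix(rnorm(n*3),n,3) %*% chol(Sigma)  # correlated features
    Y = 1 + X[,1] + rnorm(n)
    B[s,,] = as.matrix(glmnet(X, Y, alpha = 1, lambda = vlambda)$beta)
  }
  # pointwise 90% bands over the simulated paths
  list(Q05 = apply(B, 2:3, quantile, .05),
       Q95 = apply(B, 2:3, quantile, .95))
}

b1000 = lasso_bands(1000)
b100  = lasso_bands(100)   # the bands are visibly wider with fewer observations
```

The same polygons as above can then be drawn from `b1000$Q05`/`b1000$Q95` and `b100$Q05`/`b100$Q95` to compare the two sample sizes on one graph.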

Now, what about real data? Consider the following:

```r
chicago = read.table("http://freakonometrics.free.fr/chicago.txt",header=TRUE,sep=";")
tail(chicago)
   Fire   X_1 X_2    X_3
42  4.8 0.152  19 13.323
43 10.4 0.408  25 12.960
44 15.6 0.578  28 11.260
45  7.0 0.114   3 10.080
46  7.1 0.492  23 11.428
47  4.9 0.466  27 13.731
```

With one variable of interest (the number of fires per inhabitant) and three features. We can use the bootstrap here to generate samples and then fit a lasso regression on each of them. On the original dataset, the regression is:

```r
X = model.matrix(lm(Fire~.,data=chicago))
vlambda = exp(seq(-4,2,length=201))
lasso = glmnet(x=X,y=chicago[,"Fire"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)
```

And if we just plot the coefficient paths, we get:
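The code that draws those lines is not shown above; a minimal self-contained sketch with base graphics (glmnet's own `plot(lasso, xvar = "lambda")` would produce much the same picture):

```r
library(glmnet)
# refit on the chicago data, as above
chicago = read.table("http://freakonometrics.free.fr/chicago.txt",header=TRUE,sep=";")
X = model.matrix(lm(Fire~.,data=chicago))
vlambda = exp(seq(-4,2,length=201))
lasso = glmnet(x=X,y=chicago[,"Fire"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)

# one line per coefficient, against log(lambda)
colrs = 1:nrow(lasso$beta)
matplot(log(lasso$lambda), t(as.matrix(lasso$beta)),
        type = "l", lty = 1, col = colrs,
        xlab = "log(lambda)", ylab = "coefficients")
legend("topleft", legend = rownames(lasso$beta), col = colrs, lty = 1, cex = .8)
```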

Now, consider bootstrap samples.

```r
library(glmnet)
vlambda = exp(seq(-4,2,length=201))
for(s in 1:100){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
  lasso = glmnet(x=X[id,],y=chicago[id,"Fire"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  # add the coefficient paths of this bootstrap sample to the plot
}
```

We get here:

The interpretation here is much more difficult. To quantify it, one can record, for each bootstrap sample, the order in which the variables enter the model along the lasso path:

```r
library(glmnet)
vlambda = exp(seq(-4,2,length=201))
N = matrix(NA,100000,4)
for(s in 1:100000){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
  lasso = glmnet(x=X[id,],y=chicago[id,"Fire"],
                 family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  N[s,] = names(sort(apply(as.matrix(lasso$beta),
                           1,function(x) sum(x!=0))))}
```

The ordering that was obtained on the original dataset was recovered in about 57% of the scenarios,

```r
mean(apply(N,1,function(x) paste(x,collapse="")=="(Intercept)X_1X_2X_3"))
[1] 0.5693
```

We can look at all six possible orderings,

```r
L = as.character(c(123,132,213,231,312,321))
Li = paste("(Intercept)X_",substr(L,1,1),"X_",
           substr(L,2,2),"X_",substr(L,3,3),sep="")
g = function(y) mean(apply(N,1,function(x) paste(x,collapse="")==y))
vL = unlist(lapply(Li,g))
names(vL) = L
barplot(vL,las=2,horiz=TRUE)
```



Published at DZone with permission of
