
On the Robustness of Lasso


In this post, a big data expert continues his examination of lasso regressions, using simulations and the bootstrap in R to assess how much the usual lasso coefficient-path graphs can be trusted.


Probably the last post on the lasso before the summer break... More specifically, I was wondering about the interpretation of graphs like this one:

[Figure: lasso coefficient paths plotted against log(λ)]

We use them for variable selection, but my major concern was about confidence intervals: how can we trust those lines?

As usual, a natural way is to use simulations on generated datasets. Consider, for instance:

library(mnormt)
set.seed(123)
# three correlated Gaussian covariates
Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
n = 1000
X = rmnorm(n,rep(0,3),Sigma)
df = data.frame(X1=X[,1],X2=X[,2],X3=X[,3],X4=rnorm(n),
              X5=runif(n),
              X6=exp(X[,3]),
              X7=sample(c("A","B"),size=n,replace=TRUE,prob=c(.5,.5)),
              X8=sample(c("C","D"),size=n,replace=TRUE,prob=c(.5,.5)))
# only X1, X4 and X7 actually enter the model
df$Y = 1+df$X1-df$X4+5*(df$X7=="A")+rnorm(n)


One can then generate new datasets the same way, fit a lasso on each, and store the coefficient paths:

library(glmnet)
vlambda = exp(seq(-8,1,length=201))
VLASSO = list()
for(s in 1:100){   # regenerate df (as above) inside the loop
  X = model.matrix(lm(Y~.,data=df))
  lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  VLASSO[[s]] = as.matrix(lasso$beta)}


To visualize confidence bands, one can compute quantiles:

Q05=Q95=Qm=matrix(NA,9,201)
for(i in 1:nrow(Q05)){
  for(j in 1:ncol(Q05)){
    v = unlist(lapply(VLASSO,function(x) x[i,j]))
    Q05[i,j] = quantile(v,.05)
    Q95[i,j] = quantile(v,.95)
    Qm[i,j]  = mean(v)
  }}
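As a side note, the double loop above can be written more compactly with base R's `apply` over a 3-d array. A minimal, self-contained sketch on toy stand-in matrices (the dimensions match the 9 × 201 coefficient matrices above, but the data here is illustrative, not from the original code):

```r
# toy stand-in for VLASSO: 100 simulated 9 x 201 coefficient matrices
set.seed(1)
VL = lapply(1:100, function(s) matrix(rnorm(9*201), 9, 201))

# stack the list into a 9 x 201 x 100 array
A = simplify2array(VL)

# pointwise 5% / 95% quantiles and mean across the 100 replications
Q05 = apply(A, c(1,2), quantile, probs = .05)
Q95 = apply(A, c(1,2), quantile, probs = .95)
Qm  = apply(A, c(1,2), mean)
```

This produces the same `Q05`, `Q95` and `Qm` matrices as the explicit double loop, just without the indexing bookkeeping.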


And plot the resulting bands:

library(RColorBrewer)
colrs = brewer.pal(8,"Set1")
plot(lasso,xvar="lambda",col=colrs,ylim=c(min(Q05),max(Q95)))
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[2,],rev(Q95[2,])),col=colrs[1],border=NA)
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[5,],rev(Q95[5,])),col=colrs[2],border=NA)
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[8,],rev(Q95[8,])),col=colrs[3],border=NA)

An alternative (more realistic with real data, where we cannot simulate new samples) is to use a bootstrapped version of the dataset.

id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
lasso = glmnet(x=X[id,],y=df[id,"Y"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)
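To make the bootstrap mechanics explicit, independently of glmnet, here is a minimal self-contained sketch: resample rows with replacement, refit, and take pointwise quantiles of the stored estimates. Ordinary least squares stands in for the lasso here purely so the example runs with base R only; the resampling and band construction are exactly the same.

```r
set.seed(42)
n = 200
x = rnorm(n)
y = 1 + 2*x + rnorm(n)

# bootstrap the slope: resample rows with replacement, refit, store
B = 500
slopes = replicate(B, {
  id = sample(1:n, size = n, replace = TRUE)
  coef(lm(y[id] ~ x[id]))[2]
})

# 90% percentile band for the slope
ci = quantile(slopes, c(.05, .95))
```

With the lasso, `slopes` becomes a matrix of coefficients over the λ grid, and the quantiles are taken pointwise in λ, as in the simulation code above.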


So far, it looks like it's working very well. Now, what if we have a smaller dataset?

n = 100

On newly simulated samples, we get:

[Figure: simulation-based confidence bands, n = 100]

while the bootstrap version gives:

[Figure: bootstrap-based confidence bands, n = 100]

There is more uncertainty, clearly, but the conclusion is not ambiguous here.

Now, what about real data? Consider the following:

chicago = read.table("http://freakonometrics.free.fr/chicago.txt",header=TRUE,sep=";")
tail(chicago)
   Fire   X_1 X_2    X_3
42  4.8 0.152  19 13.323
43 10.4 0.408  25 12.960
44 15.6 0.578  28 11.260
45  7.0 0.114   3 10.080
46  7.1 0.492  23 11.428
47  4.9 0.466  27 13.731


With one variable of interest (the number of fires, per inhabitant) and three explanatory variables. We can use the bootstrap here to generate samples and then fit a lasso regression. On the original dataset, the regression is:

library(glmnet)
X = model.matrix(lm(Fire~.,data=chicago))
vlambda = exp(seq(-4,2,length=201))
lasso = glmnet(x=X,y=chicago[,"Fire"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)


And if we just plot the coefficient paths, we get:

[Figure: lasso coefficient paths on the chicago dataset]

Now, consider bootstrap samples:

vlambda = exp(seq(-4,2,length=201))
for(s in 1:100){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
  lasso = glmnet(x=X[id,],y=chicago[id,"Fire"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  plot(lasso,xvar="lambda",col=colrs,lwd=.2,add=TRUE)}


We get here:

The interpretation here is much more difficult.

What about the order?

N = matrix(NA,100000,4)
vlambda = exp(seq(-4,2,length=201))
for(s in 1:100000){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
  lasso = glmnet(x=X[id,],y=chicago[id,"Fire"],
                 family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  # sort variables by the number of lambda's for which
  # their coefficient is nonzero, i.e. by order of entry
  N[s,] = names(sort(apply(as.matrix(lasso$beta),
                           1,function(x) sum(x!=0))))}


The ordering that was obtained on the original dataset shows up in about 57% of the bootstrap scenarios:

mean(apply(N,1,function(x) paste(x,collapse="")=="(Intercept)X_1X_2X_3"))
[1] 0.5693
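The trick used here is worth spelling out: each bootstrap ordering is collapsed into a single string with `paste(..., collapse="")`, and the mean of a logical vector gives the proportion of matches. A tiny base-R illustration (toy orderings, not the chicago results):

```r
# three hypothetical bootstrap orderings, one per row
orders = rbind(c("(Intercept)","X_1","X_2","X_3"),
               c("(Intercept)","X_1","X_2","X_3"),
               c("(Intercept)","X_2","X_1","X_3"))

# collapse each row into one string key
keys = apply(orders, 1, paste, collapse = "")

# proportion of orderings matching the reference one
p = mean(keys == "(Intercept)X_1X_2X_3")
# p is 2/3: two of the three toy orderings match
```

The same one-liner applied to the 100,000 rows of `N` gives the 57% above.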


We can look at all six possible orderings:

L=as.character(c(123,132,213,231,312,321))
Li=paste("(Intercept)X_",substr(L,1,1),"X_",
         substr(L,2,2),"X_",substr(L,3,3),sep="")
g=function(y) mean(apply(N,1,function(x) paste(x,collapse="")==y))
vL=unlist(lapply(Li,g))
names(vL)=L
barplot(vL,las=2,horiz=TRUE)


Topics:
big data ,lasso ,data visualization

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
