
On the Robustness of Lasso


In this post, a big data expert continues his examination of lasso regressions, looking at how to use R to visualize and assess the robustness of the fitted coefficient paths.


Probably the last post on lasso before the summer break... More specifically, I was wondering about the interpretation of graphs like this one:

[Figure: lasso coefficient paths as a function of log(λ)]

We use them for variable selection, but my major concern was about confidence intervals: how can we trust those lines?

As usual, a natural way is to use simulations on generated datasets. Consider, for instance:

library(mnormt)
set.seed(123)                 # seed before any random draws, for reproducibility
Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
n = 1000
X = rmnorm(n,rep(0,3),Sigma)  # three correlated Gaussian covariates
df = data.frame(X1=X[,1],X2=X[,2],X3=X[,3],X4=rnorm(n),
              X5=runif(n),
              X6=exp(X[,3]),
              X7=sample(c("A","B"),size=n,replace=TRUE,prob=c(.5,.5)),
              X8=sample(c("C","D"),size=n,replace=TRUE,prob=c(.5,.5)))
df$Y = 1+df$X1-df$X4+5*(df$X7=="A")+rnorm(n)   # only X1, X4 and X7 actually matter
X = model.matrix(lm(Y~.,data=df))              # 9-column design matrix for glmnet


One can then simulate many such datasets the same way, fit the lasso on each, and store the estimated coefficient paths:

library(glmnet)
vlambda = exp(seq(-8,1,length=201))
VLASSO = list()
for(s in 1:100){  # re-simulate df and rebuild X as above at each iteration
  lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  VLASSO[[s]] = as.matrix(lasso$beta)
}
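Each element of VLASSO is then a 9×201 matrix, with one row per column of the design matrix and one column per value of λ. As a quick sanity check:

dim(VLASSO[[1]])  # should return 9 201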


To visualize confidence bands, one can compute quantiles:

Q05=Q95=Qm=matrix(NA,9,201)
for(i in 1:nrow(Q05)){
  for(j in 1:ncol(Q05)){
    v = unlist(lapply(VLASSO,function(x) x[i,j]))  # coefficient (i,j) across simulations
    Q05[i,j] = quantile(v,.05)  # lower bound of the 90% band
    Q95[i,j] = quantile(v,.95)  # upper bound of the 90% band
    Qm[i,j]  = mean(v)          # average coefficient path
  }
}


And draw the graph:

library(RColorBrewer)
colrs = brewer.pal(8,"Set1")  # palette must be defined before plotting
plot(lasso,xvar="lambda",col=colrs,ylim=c(min(Q05),max(Q95)))
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[2,],rev(Q95[2,])),col=colrs[1],border=NA)
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[5,],rev(Q95[5,])),col=colrs[2],border=NA)
polygon(c(log(lasso$lambda),rev(log(lasso$lambda))),
        c(Q05[8,],rev(Q95[8,])),col=colrs[3],border=NA)
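Rows 2, 5, and 8 of the coefficient matrix correspond to X1, X4, and X7, the three variables that actually enter Y. The average paths stored in Qm (computed above but not used so far) can be overlaid on those bands, for instance:

lines(log(lasso$lambda),Qm[2,],lwd=2)  # average path of X1
lines(log(lasso$lambda),Qm[5,],lwd=2)  # average path of X4
lines(log(lasso$lambda),Qm[8,],lwd=2)  # average path of X7 (level B)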

An alternative (more realistic when working with real data) is to use a bootstrapped version of the dataset.

id = sample(1:nrow(X),size=nrow(X),replace=TRUE)  # resample rows with replacement
lasso = glmnet(x=X[id,],y=df[id,"Y"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)
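Put together, a minimal sketch of the full bootstrap version, reusing vlambda and the VLASSO list from above:

VLASSO = list()
for(s in 1:100){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
  lasso = glmnet(x=X[id,],y=df[id,"Y"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  VLASSO[[s]] = as.matrix(lasso$beta)
}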


So far, it looks like it's working very well. Now, what if we have a smaller dataset?

n = 100

On newly simulated samples, we get:

[Figure: confidence bands over simulated samples, n = 100]

While the bootstrap version is:

[Figure: confidence bands over bootstrap samples, n = 100]

There is clearly more uncertainty, but the conclusion is unambiguous here.

Now, what about real data? Consider the following:

chicago = read.table("http://freakonometrics.free.fr/chicago.txt",header=TRUE,sep=";")
tail(chicago)
   Fire   X_1 X_2    X_3
42  4.8 0.152  19 13.323
43 10.4 0.408  25 12.960
44 15.6 0.578  28 11.260
45  7.0 0.114   3 10.080
46  7.1 0.492  23 11.428
47  4.9 0.466  27 13.731


With one variable of interest (the number of fires per inhabitant) and three features. We can use the bootstrap here to generate samples and then fit a lasso regression. On the original dataset, the regression is:

X = model.matrix(lm(Fire~.,data=chicago))
vlambda = exp(seq(-4,2,length=201))
lasso = glmnet(x=X,y=chicago[,"Fire"],family="gaussian",alpha=1,
               lambda=vlambda,standardize=TRUE)


And if we just plot the lines, we get:

[Figure: lasso coefficient paths on the original chicago dataset]
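The plotting call itself is a one-liner; a minimal sketch, assuming the colrs palette defined earlier (the line width is just a display choice):

plot(lasso,xvar="lambda",col=colrs,lwd=2)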

Now, consider bootstrap samples.

library(glmnet)
vlambda = exp(seq(-4,2,length=201))
for(s in 1:100){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)  # bootstrap sample of the rows
  lasso = glmnet(x=X[id,],y=chicago[id,"Fire"],family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  plot(lasso,xvar="lambda",col=colrs,lwd=.2,add=TRUE)  # overlay each bootstrap path
}


We get here:

[Figure: bootstrapped lasso coefficient paths on the chicago dataset]

The interpretation here is much more difficult. What about the order in which the variables enter the model?

library(glmnet)
vlambda = exp(seq(-4,2,length=201))
N = matrix(NA_character_,100000,4)  # character matrix to store variable orderings
for(s in 1:100000){
  id = sample(1:nrow(X),size=nrow(X),replace=TRUE)
  lasso = glmnet(x=X[id,],y=chicago[id,"Fire"],
                 family="gaussian",alpha=1,
                 lambda=vlambda,standardize=TRUE)
  # count, for each variable, how many values of lambda yield a nonzero
  # coefficient; sorting those counts gives the order of entry into the model
  N[s,] = names(sort(apply(as.matrix(lasso$beta),
                           1,function(x) sum(x!=0))))
}
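For reference, the ordering on the original (non-bootstrapped) dataset can be computed the same way; this is the ordering the next sentence compares against (lasso0 is just a name introduced here):

lasso0 = glmnet(x=X,y=chicago[,"Fire"],family="gaussian",alpha=1,
                lambda=vlambda,standardize=TRUE)
names(sort(apply(as.matrix(lasso0$beta),1,function(x) sum(x!=0))))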


The ordering obtained on the original dataset occurred in about 57% of the scenarios:

mean(apply(N,1,function(x) paste(x,collapse="")=="(Intercept)X_1X_2X_3"))
[1] 0.5693


We can look at all six possible orderings:

L = as.character(c(123,132,213,231,312,321))  # the six possible orderings
Li = paste("(Intercept)X_",substr(L,1,1),"X_",
           substr(L,2,2),"X_",substr(L,3,3),sep="")
g = function(y) mean(apply(N,1,function(x) paste(x,collapse="")==y))
vL = unlist(lapply(Li,g))  # frequency of each ordering across bootstrap samples
names(vL) = L
barplot(vL,las=2,horiz=TRUE)


Topics: big data, lasso, data visualization
