Standardization in Lasso

We venture into the wild west of data and learn how to wrangle and standardize data with a lasso regression. Read on to learn how to visualize this data as well!


The lasso regression is based on the idea of solving

min_β { Σᵢ (yᵢ − β₀ − xᵢᵀβ)² + λ‖β‖₁ },  where ‖β‖₁ = Σⱼ |βⱼ|,

for any λ ≥ 0.

In a recent post, we looked at the computational aspects of this optimization problem, but went quickly over the role of the ℓ1-norm. Because the penalty sums the |βⱼ|, it implicitly assumes that the values of β₁ and β₂ are comparable. But with two significant variables on very different scales, we should expect the relative magnitudes of β₁ and β₂ to be very, very different. Hence the usual advice: it is necessary to center and reduce (or standardize) the variables.
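To see why raw coefficient magnitudes are not comparable across scales, note that linearly rescaling a covariate rescales its coefficient inversely. A minimal base-R sketch with simulated data (the variable names and the factor 1000 are purely illustrative):

```r
set.seed(1)
n = 100
x = rnorm(n)                        # a covariate on some original scale
y = 2*x + rnorm(n, sd = .1)
b1 = coef(lm(y ~ x))[2]             # slope of x on its original scale
b2 = coef(lm(y ~ I(1000*x)))[2]     # same variable, multiplied by 1000
b1/b2                               # the slope shrinks by exactly the factor 1000
```

Since the ℓ1 penalty sums the |βⱼ|, it would bite 1000 times harder on the rescaled variable, which is exactly why one standardizes first.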

Consider the following (simulated) dataset:

library(mnormt)   # for rmnorm
Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
n = 1000
X = rmnorm(n,rep(0,3),Sigma)
df = data.frame(X1=X[,1],X2=X[,2],X3=X[,3],X4=rnorm(n),
  X5=rnorm(n),X6=rnorm(n),                   # this line was truncated; extra covariates assumed
  X7=sample(c("A","B"),size=n,replace=TRUE)) # categorical variable, used in the next line
df$Y = 1+df$X1-df$X4+5*(df$X7=="A")+rnorm(n)
X = model.matrix(lm(Y~.,data=df))

Use the following colors for the graphs, and a grid of values for λ (the grid itself was not shown, so a log-spaced one is assumed):

library(RColorBrewer)               # for brewer.pal
colrs = c(brewer.pal(8,"Set1"))[c(1,4,5,2,6,3,7,8)]
vlambda = exp(seq(-8,1,length=100)) # assumed: log-spaced grid of penalties

The first regression we can run is a non-standardized one:

library(glmnet)
lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,lambda=vlambda,standardize=FALSE)

We can visualize the coefficient paths λ ↦ β̂ⱼ(λ) using the following code:

idx = which(apply(lasso$beta,1,function(x) sum(x==0))<200)
matplot(log(lasso$lambda),t(as.matrix(lasso$beta[idx,])),type="l",lty=1,col=colrs) # plotting call assumed

Observe, at least, that the most significant variables are the ones that were actually used to generate the data.

Now, consider the case where we standardize the data:

lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,lambda=vlambda,standardize=TRUE)

The coefficient paths λ ↦ β̂ⱼ(λ) are obtained the same way.

The graph is (strangely) very similar to the previous one, except perhaps for the green curve. Maybe the categorical variable does not behave like the continuous ones: standardizing a categorical variable is, somehow, not very natural.

Why not consider some home-made function? Let us transform (linearly) all variables in the X matrix (except the first one, which is the intercept):

Xc = X
for(j in 2:ncol(X)) Xc[,j]=(Xc[,j]-mean(Xc[,j]))/sd(Xc[,j])
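The loop above can equivalently be written with base R's scale(), which centers each column and divides by its sample standard deviation. A quick check on a toy design matrix (names are illustrative):

```r
set.seed(1)
X  = cbind("(Intercept)"=1, x1=rnorm(10), x2=runif(10))  # toy design matrix
Xc = X
Xc[,-1] = scale(X[,-1])             # center and reduce all columns but the intercept
Xc2 = X                             # the explicit loop, for comparison
for(j in 2:ncol(X)) Xc2[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
all.equal(Xc, Xc2)                  # TRUE: the two transformations coincide
```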

Now, we can run our lasso regression on that matrix (keeping the intercept, since all the variables are centered except y):

lasso = glmnet(x=Xc,y=df$Y,family="gaussian",alpha=1,intercept=TRUE,lambda=vlambda)

The plot is now:

idx = which(apply(lasso$beta,1,function(x) sum(x==0))<length(vlambda))
matplot(log(lasso$lambda),t(as.matrix(lasso$beta[idx,])),type="l",lty=1,col=colrs) # plotting call assumed

Actually, why not also center the y variable, and remove the intercept:

Yc = (df[,"Y"]-mean(df[,"Y"]))/sd(df[,"Y"])
lasso = glmnet(x=Xc,y=Yc,family="gaussian",alpha=1,intercept=FALSE,lambda=vlambda)

Hopefully, those graphs are very consistent (and if we use them for variable selection, they suggest the variables that were actually used to generate the dataset). And having both qualitative and quantitative variables is not a big deal. But still, I do not feel comfortable with the differences...
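Part of that discomfort can be framed as follows: for ordinary least squares, fitting on standardized variables is exactly equivalent to fitting on the raw ones, since the slopes map back through bⱼ = bⱼˢᵗᵈ · sd(y)/sd(xⱼ). The ℓ1 penalty breaks that exact equivalence, which is why standardize=TRUE and the raw fit can genuinely differ. A base-R sketch of the OLS identity (simulated data, names illustrative):

```r
set.seed(1)
n = 200
x = cbind(rnorm(n, sd=1), rnorm(n, sd=100))   # two covariates on very different scales
y = drop(1 + x %*% c(2, .02) + rnorm(n))
b  = coef(lm(y ~ x))[-1]                      # slopes on the raw scale
ys = drop(scale(y))                           # standardized response
xs = scale(x)                                 # standardized covariates
b_std = coef(lm(ys ~ xs))[-1]                 # slopes on the standardized scale
max(abs(b - b_std * sd(y) / apply(x, 2, sd))) # ~ 0: for OLS the mapping is exact
```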


big data, lasso regression, data standardization, data visualization

Published at DZone with permission of

