# Standardization in Lasso

### We venture into the wild west of data and learn how to wrangle and standardize data with a lasso regression. Read on to learn how to visualize this data as well!

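The lasso regression is based on the idea of solving

$$\hat{\beta}_{\lambda} = \text{argmin}\left\{ -\log\mathcal{L}(\beta \mid x, y) + \lambda\|\beta\|_{\ell_1} \right\}$$

where

$$\|a\|_{\ell_1} = \sum_{i=1}^{d} |a_i|$$

for any $a \in \mathbb{R}^d$.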

In a recent post, we've seen computational aspects of the optimization problem. But I went quickly through the story of the $\ell_1$-norm, because using it means, somehow, that the values of $\beta_1$ and $\beta_2$ should be comparable. Yet with two significant variables on very different scales, we should expect the orders (or relative magnitudes) of $\hat{\beta}_1$ and $\hat{\beta}_2$ to be very different. So people say that it is, therefore, necessary to center and reduce (or standardize) the variables.
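To make the scale argument concrete, here is a short worked derivation. It is only a sketch: the rescaling constant $c$ and the notation $\tilde{x}_j$, $\tilde{\beta}_j$ are introduced here for illustration. Expressing a variable in different units leaves its contribution to the fit unchanged, but rescales its coefficient and therefore its share of the penalty:

$$
\tilde{x}_j = c\,x_j \ (c > 0)
\quad\Longrightarrow\quad
\tilde{x}_j\tilde{\beta}_j = x_j\beta_j \ \text{ with } \ \tilde{\beta}_j = \frac{\beta_j}{c},
\qquad
\lambda|\tilde{\beta}_j| = \frac{\lambda}{c}\,|\beta_j| .
$$

A variable measured in large units (large $c$) is thus penalized less for the same contribution to the fit, which is why the lasso solution depends on the choice of units.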

Consider the following (simulated) dataset:

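```
library(mnormt)
set.seed(123)  # set the seed before any random draw, so the whole dataset is reproducible
n = 1000
# correlation matrix of the three Gaussian covariates
Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
X = rmnorm(n,rep(0,3),Sigma)
df = data.frame(X1=X[,1],X2=X[,2],X3=X[,3],X4=rnorm(n),
X5=runif(n),X6=exp(X[,3]),
X7=sample(c("A","B"),size=n,replace=TRUE,prob=c(.5,.5)),
X8=sample(c("C","D"),size=n,replace=TRUE,prob=c(.5,.5)))
# only X1, X4 and X7 enter the response
df$Y = 1+df$X1-df$X4+5*(df$X7=="A")+rnorm(n)
# design matrix: intercept, X1 to X6, and one dummy each for X7 and X8
X = model.matrix(lm(Y~.,data=df))
```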

We use the following colors for the graphs, and the following grid of values for lambda.

```
library("RColorBrewer")
colrs = c(brewer.pal(8,"Set1"))[c(1,4,5,2,6,3,7,8)]
vlambda=exp(seq(-8,1,length=201))
```

The first regression we can run is a non-standardized one:

```
library(glmnet)
lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,lambda=vlambda,standardize=FALSE)
```

We can visualize the graphs of $\lambda \mapsto \hat{\beta}_{\lambda}$ using the following code:

```
# variables whose coefficient is non-zero for at least one value of lambda
idx = which(apply(lasso$beta,1,function(x) sum(x==0))<length(vlambda))
plot(lasso,col=colrs,xvar='lambda',xlim=c(-5.5,2.3),lwd=2)
legend(1.2,.9,legend=paste('X',0:8,sep='')[idx],col=colrs,lty=1,lwd=2)
```

At the very least, observe that the most significant variables are the ones that were used to generate the data (X1, X4 and X7).

Now, consider the case where we standardize the data:

`lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,lambda=vlambda,standardize=TRUE)`

The graph of $\lambda \mapsto \hat{\beta}_{\lambda}$ is (strangely) very similar to the previous one, except perhaps for the green curve. Maybe that is because a categorical variable is not like the continuous ones: somehow, standardizing a categorical variable might not be natural...
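One point that may help when reading this graph: according to the glmnet documentation, coefficients are always returned on the original scale, even with `standardize=TRUE`. The sketch below uses a small simulated dataset of its own (the names `Z`, `y`, `s`, `fit_std` and `fit_manual` are illustrative, not from the dataset above) to compare `standardize=TRUE` with a rescaling done by hand. It assumes that glmnet scales each column by a standard deviation computed with a 1/n denominator, rather than the 1/(n-1) used by `sd()`; under that assumption the two sets of coefficients should agree up to convergence tolerance.

```r
library(glmnet)
set.seed(1)
n = 500
# two predictors on very different scales
Z = cbind(Z1 = rnorm(n), Z2 = 100 * rnorm(n))
y = 1 + Z[, 1] + 0.01 * Z[, 2] + rnorm(n)
lam = 0.1

# standard deviation with a 1/n denominator (assumed to match glmnet's internal scaling)
s = apply(Z, 2, function(z) sqrt(mean((z - mean(z))^2)))

# (a) let glmnet standardize internally
fit_std = glmnet(Z, y, alpha = 1, lambda = lam, standardize = TRUE, thresh = 1e-12)
# (b) rescale the columns by hand, then fit without standardization
Zs = sweep(Z, 2, s, "/")
fit_manual = glmnet(Zs, y, alpha = 1, lambda = lam, standardize = FALSE, thresh = 1e-12)

# drop the intercept (first row) and compare on the original scale
beta_std = unname(as.matrix(coef(fit_std))[-1, 1])
beta_manual = unname(as.matrix(coef(fit_manual))[-1, 1] / s)
print(cbind(beta_std, beta_manual))
```

If the two columns match, the penalty was applied to the standardized coefficients while the reported values were mapped back to the units of `Z`.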

Why not consider some home-made function? Let us transform (linearly) all variables in the **X** matrix (except the first one, which is the intercept):

```
Xc = X
for(j in 2:ncol(X)) Xc[,j]=(Xc[,j]-mean(Xc[,j]))/sd(Xc[,j])
```
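As a sanity check on this home-made transformation, the following self-contained sketch applies the same loop to a toy matrix `M` (hypothetical, not part of the dataset above). It confirms that every non-intercept column ends up with mean 0 and standard deviation 1, and that the result agrees with base R's `scale()`. It also shows what happens to a 0/1 dummy column: it is mapped to two values of opposite sign, which is one reason standardizing a categorical variable can feel unnatural.

```r
set.seed(42)
n = 200
# toy design matrix: intercept, one continuous column, one 0/1 dummy
M = cbind(intercept = 1,
          cont = rnorm(n, mean = 10, sd = 3),
          dummy = rbinom(n, 1, 0.5))

# same loop as above: skip the intercept, center and scale the other columns
Mc = M
for (j in 2:ncol(M)) Mc[, j] = (Mc[, j] - mean(Mc[, j])) / sd(Mc[, j])

# scale() also centers and divides by the (n-1) standard deviation
Mscaled = scale(M[, -1])

print(round(colMeans(Mc[, -1]), 10))           # column means after the transformation
print(apply(Mc[, -1], 2, sd))                  # column standard deviations
print(sort(unique(round(Mc[, "dummy"], 4))))   # the dummy now takes two values
```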

Now, we can run our lasso regression on that one (with the intercept, since all the variables are centered except **y**):

`lasso = glmnet(x=Xc,y=df$Y,family="gaussian",alpha=1,intercept=TRUE,lambda=vlambda)`

The plot is now:

```
plot(lasso,col=colrs,"lambda",xlim=c(-6.7,1.3),lwd=2)
idx = which(apply(lasso$beta,1,function(x) sum(x==0))<length(vlambda))
legend(.15,.45,legend=paste('X',0:8,sep='')[idx],col=colrs,lty=1,bty="n",lwd=2)
```

Actually, why not also center (and scale) the **y** variable, and remove the intercept as well:

```
Yc = (df[,"Y"]-mean(df[,"Y"]))/sd(df[,"Y"])
lasso = glmnet(x=Xc,y=Yc,family="gaussian",alpha=1,intercept=FALSE,lambda=vlambda)
```
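Part of the difference between this fit and the previous one can be traced to the rescaling of **y**. Here is a short derivation, assuming the objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_{\ell_1}$ that glmnet documents for `family="gaussian"`, with $y$ and the columns of $X$ already centered so that the intercept can be ignored. The notation $s$ for `sd(df[,"Y"])` and $\tilde{\beta}_{\lambda}$ for the solution obtained with the rescaled response is introduced here for illustration.

$$
\frac{1}{2n}\left\|\frac{y}{s} - X\beta\right\|_2^2 + \lambda\|\beta\|_{\ell_1}
= \frac{1}{s^2}\left[\frac{1}{2n}\|y - X(s\beta)\|_2^2 + (\lambda s)\,\|s\beta\|_{\ell_1}\right]
\quad\Longrightarrow\quad
\tilde{\beta}_{\lambda} = \frac{1}{s}\,\hat{\beta}_{\lambda s} .
$$

Dividing **y** by $s$ therefore shrinks every coefficient by a factor $1/s$ and shifts the whole path by $\log s$ along the $\log\lambda$ axis, while the order in which the variables enter the model is unchanged.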

Fortunately, those graphs are very consistent (and if we use them for variable selection, they suggest using the variables that were actually used to generate the dataset). And having both qualitative and quantitative variables is not a big deal. But still, I do not feel comfortable with the differences...

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
