
Regression With Splines: Should We Care About Non-Significant Components?


Having a lot of non-significant components in a spline regression is not a major issue. And reducing the degrees of freedom is clearly a bad option.


Following this morning's course, I got a very interesting question from a student of mine. The question was about having non-significant components in a spline regression: should we prefer a model with a small number of knots, where all components are significant, or one with a (much) larger number of knots, where many components are non-significant?

My initial intuition was to prefer the second alternative, as with autoregressive models. When we fit an AR(6) model, it is not really a big deal if most coefficients are not significant (except the last one): it won't affect the forecast much. It might be the same here: with a larger number of knots, we should be able to capture small bumps that we would never capture with a smaller number.
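To illustrate the analogy, here is a minimal sketch (the series below is simulated just for this illustration, with only the first and sixth lags non-zero):

> set.seed(1)
> # simulate an AR(6) where only lags 1 and 6 matter
> z=arima.sim(n=200,list(ar=c(.5,0,0,0,0,.3)))
> fit=arima(z,order=c(6,0,0))
> # rough t-ratios of the estimated coefficients
> round(fit$coef/sqrt(diag(fit$var.coef)),2)

The middle coefficients should come out with small t-ratios, yet dropping down to a low-order model would miss the sixth lag entirely.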

Here is what I have with a small number of knots and cubic splines,

and with a larger number of knots

In order to understand what's going on, consider a simple model built from the two splines above (in red),

> library(splines)
> x=seq(0,1,by=.01)
> v=bs(x,10)   # cubic B-spline basis, 10 degrees of freedom
> x2=v[,2]     # second basis function
> x10=v[,10]   # tenth basis function
> set.seed(1)
> y=1+3*x2+5*x10+rnorm(length(x))/4       # training sample
> y_test=1+3*x2+5*x10+rnorm(length(x))/4  # test sample, new noise

Note that I have generated two sets of data here: one to train the model, and one to test it. The data look like this,

> plot(x,y)

It is based on two splines,

> lines(x,1+3*x2+5*x10)

If we use a spline model with 10 degrees of freedom, we get

> df=data.frame(x,y)
> reg=lm(y~bs(x,10),data=df)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.91671 0.17068   5.371 6.08e-07 ***
bs(x, 10)1   0.20485 0.32696   0.627    0.533    
bs(x, 10)2   3.15593 0.22534  14.005  < 2e-16 ***
bs(x, 10)3   0.04847 0.25075   0.193    0.847    
bs(x, 10)4   0.09373 0.21597   0.434    0.665    
bs(x, 10)5   0.11624 0.22939   0.507    0.614    
bs(x, 10)6   0.24829 0.22293   1.114    0.268    
bs(x, 10)7  -0.06825 0.23498  -0.290    0.772    
bs(x, 10)8   0.19633 0.26241   0.748    0.456    
bs(x, 10)9   0.27557 0.26976   1.022    0.310    
bs(x, 10)10  4.78134 0.24116  19.826  < 2e-16 ***

which makes sense, given what we have generated: most of the components are not significant, except the second and the tenth. We can actually test whether all those other components are jointly null,

> library(car)   # for linearHypothesis
> A=matrix(0,8,11)
> colnames(A)=names(coefficients(reg))
> A[1,2]=A[2,4]=A[3,5]=A[4,6]=A[5,7]=
+ A[6,8]=A[7,9]=A[8,10]=1   # one row per tested coefficient
> b=rep(0,8)
> linearHypothesis(reg,A,b)
Linear hypothesis test

Hypothesis:
bs(x, 10)1 = 0
bs(x, 10)3 = 0
bs(x, 10)4 = 0
bs(x, 10)5 = 0
bs(x, 10)6 = 0
bs(x, 10)7 = 0
bs(x, 10)8 = 0
bs(x, 10)9 = 0

Model 1: restricted model
Model 2: y ~ bs(x, 10)

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 4.8766                           
2     90 4.6196  8   0.25701 0.6259  0.754

and indeed, those coefficients are jointly not significant (the p-value of the F-test is 0.754).
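If I am not mistaken, the same test can be run by comparing the full model with the restricted one directly, using a standard nested-model comparison; the F statistic should be the same:

> # restricted model: intercept plus the two true components
> reg0=lm(y~x2+x10,data=df)
> anova(reg0,reg)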

> yp10=predict(reg)
> lines(df$x,yp10,col="red")

The red curve is not far from the black one (the simulated model, with the noise removed).
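To put a number on "not far", one can look at the largest gap between the fitted curve and the noiseless signal (just a quick check):

> max(abs(yp10-(1+3*x2+5*x10)))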

Now, if we compare with a spline regression with three degrees of freedom (with no interior knots, this is simply a cubic polynomial),

> reg=lm(y~bs(x),data=df)
> yp3=predict(reg)
> lines(df$x,yp3,col="blue",lwd=3)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.5381  0.2188   7.029 2.91e-10 ***
bs(x)1        2.2276  0.6348   3.509 0.000683 ***
bs(x)2       -4.4788  0.4089 -10.952  < 2e-16 ***
bs(x)3        1.9936  0.3434   5.805 8.12e-08 ***

This time, all components are significant, but the quality of fit is rather poor, isn’t it?
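The test sample confirms it; compare the sums of squared prediction errors of the two fits (the x's are the same in both samples, so the fitted values can be reused as predictions):

> sum((y_test-yp10)^2)   # 10 degrees of freedom
> sum((y_test-yp3)^2)    # 3 degrees of freedom

The model with 10 degrees of freedom should come out well ahead.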

What is the best thing we can do? Let us fit a collection of models: for each number of degrees of freedom k, from 4 to 15, keep only the best 1, 2, 3, etc. components (best subsets, ranked with Mallows' Cp), then predict on the test dataset we have, and see which model yields the smallest sum of squared errors,

> library(leaps)
> me=1e9                # smallest test error found so far
> ERR=matrix(NA,15,15)  # rows: degrees of freedom, columns: components kept
> for(k in 4:15){
+ bsX=bs(df$x,k)
+ # best subset of each size, ranked by Mallows' Cp
+ W=leaps(bsX,df$y,method="Cp",nbest=1)$which
+ for(i in 1:k){
+ dfXi=data.frame(y=df$y,bsX[,W[i,]])
+ reg=lm(y~.,data=dfXi)
+ dfXi_s=data.frame(y=y_test,bsX[,W[i,]])
+ names(dfXi_s)=c("y",names(coefficients(reg))[-1])
+ erreur=dfXi_s$y-predict(reg,newdata=dfXi_s)
+ ERR[k,i]=sum(erreur^2)
+ if(ERR[k,i]<=me){LOC=c(k,i);me=ERR[k,i]}
+ }}

The best model is the one where we keep the ten best components, out of 11 degrees of freedom,

> LOC
[1] 11 10
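As a sanity check, the location of the minimum can also be read directly off the error matrix (it should match LOC):

> which(ERR==min(ERR,na.rm=TRUE),arr.ind=TRUE)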

More specifically,

> library(leaps)
> bsX=bs(df$x,LOC[1])
> W=leaps( bsX , df$y , method="Cp" , 
+ nbest=1)$which
> dfXi=data.frame(y=df$y,bsX[,W[LOC[2],]])
> reg=lm(y~.,data=dfXi)
> ypopt=predict(reg)
> lines(df$x,ypopt,col="purple")

The best model here is one with more degrees of freedom than the one we used to generate the data,

Actually, if we use 10 degrees of freedom, but keep only the best two components, we get something rather close

> bsX=bs(df$x,10)
> W=leaps( bsX , df$y , method="Cp" , 
+ nbest=1)$which
> dfXi=data.frame(y=df$y,bsX[,W[2,]])
> reg=lm(y~.,data=dfXi)
> ypopt=predict(reg)
> lines(df$x,ypopt,col="red",lwd=3)
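And its test error should indeed be close to the optimum found in the search above (again, the x's are identical in both samples, so the fitted values serve as predictions):

> sum((y_test-ypopt)^2)   # the two best components out of 10
> me                      # smallest error found in the search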

So, it looks like having a lot of non-significant components in a spline regression is not a major issue. And reducing the degrees of freedom is clearly a bad option.
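A related option, which I have not explored here, is to keep a large basis and let a penalty shrink the coefficients instead of dropping components by hand; with a smoothing spline, for instance, the amount of smoothing is chosen internally by (generalized) cross-validation. A one-line sketch:

> fit=smooth.spline(x,y)   # penalized fit, smoothing parameter chosen by GCV
> lines(fit,col="green")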


Topics:
regression, splines, mathematical programming

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
