
Regression Models: It's Not Only About Interpretation

Arthur Charpentier · May 09, 2015

Yesterday, I uploaded a post where I tried to show that "standard" regression models were not performing badly, at least if you include (multivariate) splines to take joint effects and nonlinearities into account. So far, I have not discussed the possibly high number of features (but with bootstrap procedures, it is possible to assess something related to variable importance, which people from machine learning like).
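For instance, a bootstrap-flavoured importance measure could look like the following (a minimal sketch of my own, not from the original post: var_importance is a hypothetical helper that refits the spline model used below on bootstrap samples of the simulated dataset df, and records how much the prediction error grows when each covariate is scrambled):

> library(mgcv)
> # sketch: refit the spline model on B bootstrap samples; for each fit,
> # permute one covariate at a time and record the increase in squared error
> var_importance <- function(B=100){
+   imp <- matrix(NA, B, 2, dimnames=list(NULL, c("x1","x2")))
+   for(b in 1:B){
+     idx <- sample(1:nrow(df), replace=TRUE)
+     fit <- gam(y~s(x1,x2), data=df[idx,])
+     base <- sum((df$y - predict(fit, newdata=df))^2)
+     for(v in c("x1","x2")){
+       dfp <- df; dfp[[v]] <- sample(dfp[[v]])
+       imp[b,v] <- sum((df$y - predict(fit, newdata=dfp))^2) - base
+     }
+   }
+   colMeans(imp)
+ }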

But my post was not complete: I was simply plotting the predictions obtained by each model. And the regression "looked" nice, but so did the random forest, the k-nearest neighbour, and the boosting algorithm. What if we compare those models on new data?

Here is the code to create all the models (I did include another one, a kind of benchmark, where no covariates are included), based on 1,000 simulated values:

> n <- 1000
> set.seed(1)
> # true regression function
> rtf <- function(a1, a2) { sin(a1+a2)/(a1+a2) }
> df <- data.frame(x1=(runif(n, min=1, max=6)),
+                  x2=(runif(n, min=1, max=6)))
> df$m <- rtf(df$x1, df$x2)
> df$y <- df$m+rnorm(n,sd=.1)

> model_cste <- lm(y~1,data=df)
> p_cste <- function(x1,x2) predict(model_cste,newdata=data.frame(x1=x1,x2=x2))

> model_lm <- lm(y~x1+x2,data=df)
> p_lm <- function(x1,x2) predict(model_lm,newdata=data.frame(x1=x1,x2=x2))

> library(mgcv)
> model_bs <- gam(y~s(x1,x2),data=df)
> p_bs <- function(x1,x2) predict(model_bs,newdata=data.frame(x1=x1,x2=x2))

> library(rpart)
> model_cart <- rpart(y~x1+x2,data=df,method="anova")
> p_cart <- function(x1,x2) predict(model_cart,newdata=data.frame(x1=x1,x2=x2),type="vector")

> library(randomForest)
> model_rf <- randomForest(y~x1+x2,data=df)
> p_rf <- function(x1,x2) as.numeric(predict(model_rf,newdata=
+   data.frame(x1=x1,x2=x2),type="response"))

> k <- 10
> # k-nearest-neighbour prediction coded by hand: average the responses of
> # the k training points closest to (x1,x2) in Euclidean distance
> p_knn <- function(x1,x2){
+   d <- (df$x1-x1)^2+(df$x2-x2)^2
+   return(mean(df$y[which(rank(d)<=k)]))
+ }
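For reference, the same predictor is also available from a package (a sketch assuming the FNN package is installed; p_knn_fnn is my own naming, not from the original post):

> library(FNN)
> # knn.reg averages the y's of the k nearest training points, as above
> p_knn_fnn <- function(x1,x2) knn.reg(train=df[,c("x1","x2")],
+   test=data.frame(x1=x1,x2=x2), y=df$y, k=k)$pred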

> library(dismo)
> model_gbm <- gbm.step(data=df, gbm.x = 1:2, gbm.y = 4,
+   family = "gaussian", tree.complexity = 5,
+   learning.rate = 0.01, bag.fraction = 0.5)


 gbm step - version 2.9 

performing cross-validation optimisation of a boosted regression tree model 
for y and using a family of gaussian 
using 1000 observations and 2 predictors 
creating 10 initial models of 50 trees 

 folds are unstratified 
total mean deviance =  0.0242 
tolerance is fixed at  0 
ntrees resid. dev. 
50    0.0195 
now adding trees... 
100   0.017 
150   0.0154 
200   0.0145 
250   0.0139

(etc)

1650   0.0123 
fitting final gbm model with a fixed number of  1150  trees for  y 

mean total deviance = 0.024 
mean residual deviance = 0.009 

estimated cv deviance = 0.012 ; se = 0.001 

training data correlation = 0.804 
cv correlation =  0.705 ; se = 0.013 

elapsed time -  0.11 minutes 
> # the final boosted model was fitted with 1,150 trees, so predict with that many
> p_boost <- function(x1,x2) predict(model_gbm,newdata=data.frame(x1=x1,x2=x2),n.trees=1150)

To test those models on new data (that is, after all, the goal of predictive modeling: building a model that generalizes well to data it has not seen), generate another sample:

> n <- 500
> df_new <- data.frame(x1=(runif(n, min=1, max=6)), x2=(runif(n, min=1, max=6)))
> df_new$m <- rtf(df_new$x1, df_new$x2)
> df_new$y <- df_new$m+rnorm(n,sd=.1)

Then compare the observed values with the predicted ones, for instance on a graph:

> output_model <- function(p=Vectorize(p_knn)){
+   plot(df_new$y, p(df_new$x1,df_new$x2), ylim=c(-.45,.45), xlim=c(-.45,.45),
+        xlab="observed", ylab="predicted")
+   abline(a=0, b=1, lty=2, col="grey")
+ }

For the linear model, we get

> output_model(Vectorize(p_lm))

For the k-nearest neighbour, we get

> output_model(Vectorize(p_knn))

With our boosted model, we get

> output_model(Vectorize(p_boost))

And finally, with our bivariate splines, we get

> output_model(Vectorize(p_bs))

It is also possible to consider some distance between observations and predictions, e.g. the standard ℓ2 distance (the sum of squared errors):

> sum_error_2 <- function(name_model){
+   sum( (df_new$y - Vectorize(get(name_model))(df_new$x1,df_new$x2))^2 )
+ }
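For a single model, the function is called with the predictor's name as a string (an illustrative call of my own; the value it returns depends on the simulated sample):

> sum_error_2("p_lm")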

Here, we pass the name of the prediction function (not the R object itself): get() retrieves the function from its name, which lets us loop over a vector of names below. To draw valid conclusions, why not generate hundreds of new samples and compute the ℓ2 distance on the errors each time?

> l2 <- NULL
> for(s in 1:100){
+   n <- 500
+   df_new <- data.frame(x1=(runif(n, min=1, max=6)), x2=(runif(n, min=1, max=6)))
+   df_new$m <- rtf(df_new$x1, df_new$x2)
+   df_new$y <- df_new$m+rnorm(n,sd=.1)
+   list_models <- c("p_cste","p_lm","p_bs","p_cart","p_rf","p_knn","p_boost")
+   l2_error <- sapply(list_models,sum_error_2)
+   l2 <- rbind(l2,l2_error)
+ }


To compare our predictors, use:

> # strip the "p_" prefix from the model names, then compare the distributions
> colnames(l2) <- substr(colnames(l2),3,nchar(colnames(l2)))
> boxplot(l2)


Our linear regression model (lm) is clearly not performing well, but we had already seen that yesterday. Our bivariate spline model (bs) is; actually, it even performs better than all the other models considered here (rf, knn, and even boost). Zooming in:

> boxplot(l2,ylim=c(4.5,6.2))
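Alongside the boxplots, a quick numeric summary can also be computed (a small sketch of my own; the exact figures depend on the random seed):

> # average L2 error over the 100 new samples, per model
> round(colMeans(l2), 3)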

There was a lot of discussion following my previous post, in the comments as well as on Twitter. What I've seen suggested is that there should be some kind of trade-off: interpretability (econometric models) against predictive accuracy (machine learning). It is clearly not that simple: from what we've seen here, a simple regression model with splines can perform better than any of the machine learning algorithms.


Published at DZone with permission of Arthur Charpentier, DZone MVB.
