Classification From Scratch, Part 11 of 8: Boosting


So, yeah, we may have underestimated the length of this series, but, today, in the final installment, we check out the fascinating problem of boosting in big data.


This is the eleventh post of our series on classification from scratch. Today should be the last one... unless I forgot something important. So today, we'll discuss boosting.

An Econometrician's Perspective

I might start with an unconventional introduction, but that's actually how I understood what boosting was about. And I am quite sure it has to do with my background in econometrics.

The goal here is to solve something which looks like

$m^\star = \underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(y_i, m(\mathbf{x}_i)\big)\right\}$

for some loss function $\ell$ and some set of predictors $\mathcal{M}$. This is an optimization problem. Well, the optimization here is in a function space, but it is still simply an optimization problem. And from a numerical perspective, optimization is solved using gradient descent (this is why this technique is also called gradient boosting). Gradient descent can be visualized like below:
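
As a minimal, self-contained sketch (not in the original post), here is what plain gradient descent looks like on a simple one-dimensional function, where each step moves against the gradient; the function, the starting point, and the step size eta are arbitrary choices for illustration:

# Minimal sketch of gradient descent on f(x) = (x-2)^2 (illustration only)
f      = function(x) (x-2)^2
grad_f = function(x) 2*(x-2)
eta  = .1                     # step size
x    = 5                      # arbitrary starting point
path = x
for(t in 1:25){
  x = x - eta*grad_f(x)       # move in the direction of the negative gradient
  path = c(path, x)
}
plot(path, type="b", xlab="iteration", ylab="x")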

Again, the optimum is not some real value x*, but some function m*. Thus, here we will have something like

$m^{(k)} = m^{(k-1)} + \underset{f\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(y_i,\, m^{(k-1)}(\mathbf{x}_i) + f(\mathbf{x}_i)\big)\right\}$

where the term on the right can also be written

$\underset{f\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(y_i - m^{(k-1)}(\mathbf{x}_i),\, f(\mathbf{x}_i)\big)\right\}$

I prefer the latter, because we see clearly that $f$ is some model we fit on the remaining residuals.

We can rewrite it like that: define

$\varepsilon_{k,i} = y_i - m^{(k-1)}(\mathbf{x}_i)$

for all i=1,...,n. The goal is to fit a model so that

$f^{(k)}(\mathbf{x}_i) \approx \varepsilon_{k,i}$

and when we have that optimal function, set

$m^{(k)}(\mathbf{x}) = m^{(k-1)}(\mathbf{x}) + f^{(k)}(\mathbf{x})$

(yes, we can include some shrinkage here).

Two important comments here. First of all, this idea should seem weird to any econometrician: we fit a model to explain $y$ by some covariates $\mathbf{x}$, and then we consider the residuals

$\widehat{\varepsilon}_i = y_i - \widehat{m}(\mathbf{x}_i)$

and try to explain them with the same covariates $\mathbf{x}$. If you try that with a linear regression, you'd be done at the end of step one, since the residuals

$\widehat{\varepsilon}_i$

are orthogonal to the covariates $\mathbf{x}$: there is no way we can learn from them. Here it works because we consider simple non-linear models. And actually, something that can be used is to add a shrinkage parameter: do not consider

$m^{(k)}(\mathbf{x}) = m^{(k-1)}(\mathbf{x}) + f^{(k)}(\mathbf{x})$ but rather $m^{(k)}(\mathbf{x}) = m^{(k-1)}(\mathbf{x}) + \gamma\, f^{(k)}(\mathbf{x})$, for some shrinkage parameter $\gamma\in(0,1]$.

The idea of weak learners is extremely important here. The more we shrink, the longer it will take, but that's not (too) important.
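
To make this recipe concrete before moving to specific weak learners, here is a minimal sketch of the generic boosting loop described above; weak_fit is a hypothetical placeholder for any weak learner returning an object with a predict() method (concrete versions with splines and trees follow below).

# Minimal sketch of the boosting loop (weak_fit is a hypothetical placeholder)
boost = function(x, y, weak_fit, n_iter=100, gamma=.05){
  residuals = y                  # start from m_0 = 0, so epsilon_{0,i} = y_i
  models = list()
  for(k in 1:n_iter){
    fit = weak_fit(x, residuals)                 # fit f^(k) on the current residuals
    models[[k]] = fit
    residuals = residuals - gamma*predict(fit)   # shrunken update of the residuals
  }
  models    # the final model is the (shrunken) sum of these models' predictions
}

For instance, weak_fit = function(x, r) lm(r ~ poly(x, 3)) would use a cubic polynomial as the (weak) learner at each step.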

I should also mention that it's nice to keep learning from our mistakes, but at some point, we should stop. I said that I would not discuss this part in this series of posts, maybe later on. But heuristically, we should stop when we start to overfit, and this can be observed either by splitting the initial dataset into training and validation sets, or by using cross-validation. I will get back to that issue later on in this post, but again, those ideas probably deserve a dedicated series of posts.

Learning With Splines

Just to make sure we get it, let's try to learn with splines. Since standard splines have fixed knots, we would not really "learn" anything here (after a few iterations we would end up with what a standard spline regression gives). So instead, we will (somehow) optimize the knot locations; there is a package to do so. And just to illustrate, we use a Gaussian regression here, not a classification (we will do that later on). Consider the following dataset (with only one covariate):

n=300
set.seed(1)
u=sort(runif(n)*2*pi)    # one covariate, on [0, 2*pi]
y=sin(u)+rnorm(n)/4      # noisy sine curve
df=data.frame(x=u,y=y)

For an optimal choice of knot locations, we can use

library(freeknotsplines)
xy.freekt=freelsgen(df$x, df$y, degree = 1, numknot = 2, 555)

With 5% shrinkage, the code is simply the following:

v=.05                    # shrinkage parameter
library(splines)
xy.freekt=freelsgen(df$x, df$y, degree = 1, numknot = 2, 555)
fit=lm(y~bs(x,degree=1,knots=xy.freekt@optknot),data=df)
yp=predict(fit,newdata=df)
df$yr=df$y - v*yp        # residuals after the first (shrunken) fit
YP=v*yp                  # store the successive (shrunken) contributions
for(t in 1:200){
  xy.freekt=freelsgen(df$x, df$yr, degree = 1, numknot = 2, 555)
  fit=lm(yr~bs(x,degree=1,knots=xy.freekt@optknot),data=df)
  yp=predict(fit,newdata=df)
  df$yr=df$yr - v*yp     # update the residuals
  YP=cbind(YP,v*yp)}
nd=data.frame(x=seq(0,2*pi,by=.01))
viz=function(M){
  if(M==1)  y=YP[,1]
  if(M>1)   y=apply(YP[,1:M],1,sum)        # aggregated model after M iterations
  plot(df$x,df$y,ylab="",xlab="")
  lines(df$x,y,type="l",col="red",lwd=3)   # boosted model (red)
  fit=lm(y~bs(x,degree=1,df=3),data=df)
  yp=predict(fit,newdata=nd)
  lines(nd$x,yp,type="l",col="blue",lwd=3) # single spline regression (blue)
  lines(nd$x,sin(nd$x),lty=2)}             # true underlying curve (dashed)

To visualize the output after 100 iterations, use:

viz(100)


Clearly, we see that we learn from the data here... Cool, isn't it?

Learning With Stumps (and Trees)

Let us try something else. What if, at each step, we consider a regression tree instead of the piecewise-linear regression we had with linear splines?

library(rpart)
v=.1                       # shrinkage parameter
fit=rpart(y~x,data=df)     # first (weak) regression tree
yp=predict(fit)
df$yr=df$y - v*yp          # residuals after the first shrunken fit
YP=v*yp
for(t in 1:100){
  fit=rpart(yr~x,data=df)  # fit a tree on the current residuals
  yp=predict(fit,newdata=df)
  df$yr=df$yr - v*yp
  YP=cbind(YP,v*yp)}

Again, to visualize the learning process, use:

viz=function(M){
  y=apply(YP[,1:M],1,sum)                  # aggregated model after M trees
  plot(df$x,df$y,ylab="",xlab="")
  lines(df$x,y,type="s",col="red",lwd=3)   # boosted trees (red)
  fit=rpart(y~x,data=df)
  yp=predict(fit,newdata=nd)
  lines(nd$x,yp,type="s",col="blue",lwd=3) # single regression tree (blue)
  lines(nd$x,sin(nd$x),lty=2)}             # true underlying curve (dashed)
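
For instance, after 100 trees:

viz(100)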


This time, with those trees, it looks like we not only have a good model, but also a model that differs from the one we would get using a single regression tree.

What if we change the shrinkage parameter?

viz=function(v=0.05){
  fit=rpart(y~x,data=df)
  yp=predict(fit)
  df$yr=df$y - v*yp
  YP=v*yp
  for(t in 1:100){
    fit=rpart(yr~x,data=df)
    yp=predict(fit,newdata=df)
    df$yr=df$yr - v*yp
    YP=cbind(YP,v*yp)}
  y=apply(YP,1,sum)                        # aggregated model after 100 trees
  plot(df$x,df$y,xlab="",ylab="")
  lines(df$x,y,type="s",col="red",lwd=3)   # boosted trees with shrinkage v (red)
  fit=rpart(y~x,data=df)
  yp=predict(fit,newdata=nd)
  lines(nd$x,yp,type="s",col="blue",lwd=3) # single regression tree (blue)
  lines(nd$x,sin(nd$x),lty=2)}

The shrinkage parameter clearly has an impact: it has to be small to get a good model. This is the idea of using weak learners to get a good prediction.
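
For instance, we can compare a large and a small shrinkage parameter side by side with the function above:

par(mfrow=c(1,2))
viz(0.9)     # each tree contributes a lot (large steps)
viz(0.05)    # each tree contributes little (many small, weak corrections)
par(mfrow=c(1,1))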

Classification and AdaBoost

Now that we understand how boosting works, let's try to adapt it to classification. It will be more complicated because residuals are usually not very informative in a classification setting, and they are hard to shrink. So let's try something slightly different, and introduce the AdaBoost algorithm.

In our initial discussion, the goal was to minimize a convex loss function. Here, if we express classes as {-1,+1}, the loss function we consider is

$\ell\big(y, m(\mathbf{x})\big) = e^{-y\, m(\mathbf{x})}$

This was already discussed when we looked at the SVM algorithm. Note that the loss function related to the logistic model would be:

$\ell\big(y, m(\mathbf{x})\big) = \log\big(1 + e^{-y\, m(\mathbf{x})}\big)$
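
Just to see what those two losses look like as functions of the margin $y\, m(\mathbf{x})$, here is a small sketch (not part of the original code):

# Exponential loss (used by AdaBoost) vs. logistic loss, as functions of the margin
u = seq(-3,3,by=.01)
plot(u, exp(-u), type="l", lwd=2, col="red", xlab="margin y*m(x)", ylab="loss")
lines(u, log(1+exp(-u)), lwd=2, col="blue")
legend("topright", c("exponential","logistic"), col=c("red","blue"), lwd=2)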

What we do here is related to gradient descent (or Newton-type algorithms). Previously, we were learning from our errors: at each iteration, the residuals were computed, a (weak) model was fitted to these residuals, and the contribution of this weak model was used in a gradient-descent optimization process. Here things will be different because (from my understanding) it is more difficult to play with residuals, since null residuals never exist in classification. So instead, we will add weights. Initially, all observations have the same weight. Then, iteratively, we increase the weights of the wrongly predicted individuals and decrease those of the correctly predicted ones. Somehow, we want to focus more on the difficult predictions. That's the trick, and I guess that's why it performs so well. This algorithm is well described on Wikipedia, so we will follow that presentation.

We start with

$\omega_{0,i} = \frac{1}{n}, \qquad i=1,\dots,n,$

then, at each step $k$, fit a model (a classification tree) with weights $\omega_k$ (we did not discuss weights in the tree algorithms, but it is actually straightforward to include them in the formulas). Let

$\widehat{h}_k(\,\cdot\,)$

denote that model (i.e. the estimated probability in each leaf). Then consider the classifier

$g_k(\mathbf{x}) = 2\,\mathbf{1}\big[\widehat{h}_k(\mathbf{x}) > 1/2\big] - 1$

which returns a value in {-1,+1}. Then set

$\varepsilon_k = \sum_{i\in\mathcal{I}_k} \omega_{k,i}$

where $\mathcal{I}_k$ is the set of misclassified individuals,

$\mathcal{I}_k = \big\{ i :\ y_i\, g_k(\mathbf{x}_i) < 0 \big\}$

Then set

$\alpha_k = \frac{1}{2} \ln\!\left( \frac{1-\varepsilon_k}{\varepsilon_k} \right)$

and, finally, update the model using

$m_k(\mathbf{x}) = m_{k-1}(\mathbf{x}) + \alpha_k\, g_k(\mathbf{x})$

as well as the weights

$\omega_{k+1,i} = \omega_{k,i}\, e^{-\alpha_k\, y_i\, g_k(\mathbf{x}_i)}$

(of course, we divide by the sum so that the weights sum to 1). And as previously, one can include some shrinkage. To visualize the convergence of the process, we will plot the total error on our dataset:

n_iter = 100
y = (myocarde[,"PRONO"]==1)*2-1            # recode the classes as {-1,+1}
x = myocarde[,1:7]
error = rep(0,n_iter)
f = rep(0,length(y))                       # scores of the aggregated model
w = rep(1,length(y))                       # observation weights
alpha = 1
library(rpart)
for(i in 1:n_iter){
  w = exp(-alpha*y*f) *w                   # update the weights...
  w = w/sum(w)                             # ...and normalize them
  rfit = rpart(y~., x, w, method="class")  # weighted classification tree
  g = -1 + 2*(predict(rfit,x)[,2]>.5)      # classifier in {-1,+1}
  e = sum(w*(y*g<0))                       # weighted misclassification error
  alpha = .5*log ( (1-e) / e )
  alpha = 0.1*alpha                        # shrinkage
  f = f + alpha*g
  error[i] = mean(1*f*y<0)                 # error rate of the aggregated model
}
plot(seq(1,n_iter),error,type="l",
     ylim=c(0,.25),col="blue",
     ylab="Error Rate",xlab="Iterations",lwd=2)


Here we face a classical problem in machine learning: we have a perfect model, with zero error. That is nice, but not interesting. The same thing is possible in econometrics with polynomial fits: with 10 observations and a polynomial of degree 9, we get a perfect fit, but a poor model. Here it is the same. So the trick is to split our dataset in two, a training dataset and a validation dataset:

set.seed(123)
id_train = sample(1:nrow(myocarde), size=45, replace=FALSE)
train_myocarde = myocarde[id_train,]
test_myocarde = myocarde[-id_train,]

We construct the model on the first one, and check on the second one that it's not that bad...

y_train = (train_myocarde[,"PRONO"]==1)*2-1
x_train =  train_myocarde[,1:7]
y_test = (test_myocarde[,"PRONO"]==1)*2-1
x_test = test_myocarde[,1:7]
train_error = rep(0,n_iter) 
test_error = rep(0,n_iter)
f_train = rep(0,length(y_train))
f_test = rep(0,length(y_test)) 
w_train = rep(1,length(y_train)) 
alpha = 1
for(i in 1:n_iter){
  w_train = w_train*exp(-alpha*y_train*f_train) 
  w_train = w_train/sum(w_train)
  rfit = rpart(y_train~., x_train, w_train, method="class")
  g_train = -1 + 2*(predict(rfit,x_train)[,2]>.5)
  g_test = -1 + 2*(predict(rfit,x_test)[,2]>.5)
  e_train = sum(w_train*(y_train*g_train<0))
  alpha = .5*log ( (1-e_train) / e_train )
  alpha = 0.1*alpha 
  f_train = f_train + alpha*g_train
  f_test = f_test + alpha*g_test
  train_error[i] = mean(1*f_train*y_train<0)
  test_error[i] = mean(1*f_test*y_test<0)}
plot(seq(1,n_iter),test_error,col='red')
lines(train_error,lwd=2,col='blue')


Here, as previously, after 80 iterations we have a perfect model on the training dataset, but it behaves badly on the validation dataset. With around 20 iterations, though, it seems to be okay.
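
Rather than eyeballing the curves, one can also simply look at the iteration where the validation error is smallest, using the vectors computed above:

which.min(test_error)    # iteration with the lowest validation error
min(test_error)          # the corresponding error rate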

R Function

Of course, it's possible to use R functions:

library(gbm)
gbmWithCrossValidation = gbm(PRONO ~ ., distribution = "bernoulli",
                             data = myocarde, n.trees = 2000,
                             shrinkage = .01, cv.folds = 5, n.cores = 1)
bestTreeForPrediction = gbm.perf(gbmWithCrossValidation)

Here, cross-validation is considered rather than a training/validation split, and forests of trees rather than single trees, but, overall, the idea is the same... Of course, the output is much nicer (here the shrinkage parameter is very small, so learning is extremely slow).
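
If needed, predicted probabilities for the selected number of trees can then be obtained with something like the following (a quick sketch, predicting back on the same dataset just for illustration):

# Predicted probabilities of the boosted model, using the number of trees
# selected by cross-validation
p = predict(gbmWithCrossValidation, newdata = myocarde,
            n.trees = bestTreeForPrediction, type = "response")
head(p)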
