# Classification From Scratch, Part 11 of 8: Boosting

# Classification From Scratch, Part 11 of 8: Boosting

### So, yeah, we may have underestimated the length of this series, but, today, in the final installment, we check out the fascinating problem of boosting in big data.

Join the DZone community and get the full member experience.

Join For FreeHortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

This is the eleventh post of our series on classification from scratch. Today should be the last one... unless I forgot something important. So today, we'll discuss boosting.

## An Econometrician Perspective

I might start with a non-conventional introduction. But that's actually how I understood what boosting was about. And I am quite sure it has to do with my background in econometrics.

The goal here is to solve something which looks like

for some loss function *l*, and for some set of predictors ** M**. This is an optimization problem. Well, optimization is here in a function space, but still, that's simply an optimization problem. And from a numerical perspective, optimization is solved using gradient descent (this is why this technique is also called gradient boosting). And the gradient descent can be visualized like below:

Again, the optimum is not some real value x^{*}, but some function m^{*}. Thus, here we will have something like

where the term on the right can also be written

I prefer the later because we see clearly that ** f** is some model we fit on the remaining residuals.

We can rewrite it like that: define

for all *i*=1,...,n. The goal is to fit a model so that

and when we have that optimal function, set

(yes, we can include some shrinkage here).

Two important comments here. First of all, the idea should be weird to any econometrician. First, we fit a model to explain *y* by some covariates ** x**. Then consider the residuals

and to explain them with the same covariate ** x**. If you try that with a linear regression, you'd done at the end of step 1, since residuals

are orthogonal to covariates \mathbf{x}: no way that we can learn from them. Here it works because we consider simple non-linear model. And actually, something that can be used is to add a shrinkage parameter. Do not consider

The idea of *weak* learners is extremely important here. The more we shrink, the longer it will take, but that's not (too) important.

I should also mention that it's nice to keep learning from our mistakes. But somehow, we should stop, someday. I said that I will not mention this part in this series of posts, maybe later on. But heuristically, we should stop when we start to overfit. And this can be observed either using a split training/validation of the initial dataset or to use cross-validation. I will get back on that issue later one in this post, but again, those ideas should probably be dedicated to another series of posts.

## Learning With Splines

Just to make sure we get it, let's try to learn with splines. Because standard splines have fixed knots, actually, we do not really "learn" here (and after a few iterations we get to what we would have with a standard spline regression). So here, we will (somehow) optimize knots locations. There is a package to do so. And just to illustrate, use a Gaussian regression here, not a classification (we will do that later on). Consider the following dataset (with only one covariate)

```
n=300
set.seed(1)
u=sort(runif(n)*2*pi)
y=sin(u)+rnorm(n)/4
df=data.frame(x=u,y=y)
```

For an optimal choice of knot locations, we can use

```
library(freeknotsplines)
xy.freekt=freelsgen(df$x, df$y, degree = 1, numknot = 2, 555)
```

With 5% shrinkage, the code is simply the following:

```
v=.05
library(splines)
xy.freekt=freelsgen(df$x, df$y, degree = 1, numknot = 2, 555)
fit=lm(y~bs(x,degree=1,knots=xy.freekt@optknot),data=df)
yp=predict(fit,newdata=df)
df$yr=df$y - v*yp
YP=v*yp
for(t in 1:200){
xy.freekt=freelsgen(df$x, df$yr, degree = 1, numknot = 2, 555)
fit=lm(yr~bs(x,degree=1,knots=xy.freekt@optknot),data=df)
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)}
nd=data.frame(x=seq(0,2*pi,by=.01))
viz=function(M){
if(M==1) y=YP[,1]
if(M>1) y=apply(YP[,1:M],1,sum)
plot(df$x,df$y,ylab="",xlab="")
lines(df$x,y,type="l",col="red",lwd=3)
fit=lm(y~bs(x,degree=1,df=3),data=df)
yp=predict(fit,newdata=nd)
lines(nd$x,yp,type="l",col="blue",lwd=3)
lines(nd$x,sin(nd$x),lty=2)}
```

To visualize the output after 100 iterations, use:

`viz(100)`

Clearly, we see that we learn from the data here... Cool, isn't it?

## Learning With Stumps (and Trees)

Let us try something else. What if we consider at each step a regression tree, instead of a linear-by-parts regression (that was considered with linear splines).

```
library(rpart)
v=.1
fit=rpart(y~x,data=df)
yp=predict(fit)
df$yr=df$y - v*yp
YP=v*yp
for(t in 1:100){
fit=rpart(yr~x,data=df)
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)}
```

Again, to visualize the learning process, use:

```
viz=function(M){
y=apply(YP[,1:M],1,sum)
plot(df$x,df$y,ylab="",xlab="")
lines(df$x,y,type="s",col="red",lwd=3)
fit=rpart(y~x,data=df)
yp=predict(fit,newdata=nd)
lines(nd$x,yp,type="s",col="blue",lwd=3)
lines(nd$x,sin(nd$x),lty=2)}
```

This time, with those trees, it looks like not only do we have a good model, but also a different model from the one we can get using a single regression tree.

What if we change the shrinkage parameter?

```
viz=function(v=0.05){
fit=rpart(y~x,data=df)
yp=predict(fit)
df$yr=df$y - v*yp
YP=v*yp
for(t in 1:100){
fit=rpart(yr~x,data=df)
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)}
y=apply(YP,1,sum)
plot(df$x,df$y,xlab="",ylab="")
lines(df$x,y,type="s",col="red",lwd=3)
fit=rpart(y~x,data=df)
yp=predict(fit,newdata=nd)
lines(nd$x,yp,type="s",col="blue",lwd=3)
lines(nd$x,sin(nd$x),lty=2)}
```

There is clearly an impact of that shrinkage parameter. It has to be small to get a good model. This is the idea of using *weak learners* to get a good prediction.

## Classification and Adaboost

Now that we understand how bootsting works, let's try to adapt it to classification. It will be more complicated because residuals are usually not very informative in a classification. And it will be hard to shrink. So let's try something slightly different, to introduce the AdaBoost algorithm.

In our initial discussion, the goal was to minimize a convex loss function. Here, if we express classes as {-1,+1}, the loss function we consider is

This was already discussed when we looked at the SVM algorithm. Note that the loss function related to the logistic model would be:

What we do here is related to gradient descent (or Newton algorithms). Previously, we were learning from our errors. With each iteration, the residuals are computed and a (weak) model is fitted to these residuals. The contribution of this weak model is used in a gradient descent optimization process. Here things will be different, because (from my understanding) it is more difficult to play with residuals because null residuals never exist in classifications. So we will add weights. Initially, all the observations will have the same weights. But, iteratively, we will change them. We will increase the weights of the wrongly predicted individuals and decrease the ones of the correctly predicted individuals. Somehow, we want to focus more on the difficult predictions. That's the trick. And I guess that's why it performs so well. This algorithm is well described on Wikipedia, so we will use it.

We start with

then at each step fit a model (a classification tree) with weights *w*_{k}(we did not discuss weights in the algorithms of trees, but it is straigtforward in the formula actually). Let

denote that model (i.e. the probability in each leaves). Then consider the classifier

which returns a value in {-1,+1}. Then set

where *I*_{k} is the set of misclassified individuals,

Then set

and, finally, update the model using

as well as the weights

(of course, divide by the sum to ensure that the total sum is then 1). And as previously, one can include some shrinkage. To visualize the convergence of the process, we will plot the total error on our dataset.

```
n_iter = 100
y = (myocarde[,"PRONO"]==1)*2-1
x = myocarde[,1:7]
error = rep(0,n_iter)
f = rep(0,length(y))
w = rep(1,length(y)) #
alpha = 1
library(rpart)
for(i in 1:n_iter){
w = exp(-alpha*y*f) *w
w = w/sum(w)
rfit = rpart(y~., x, w, method="class")
g = -1 + 2*(predict(rfit,x)[,2]>.5)
e = sum(w*(y*g<0))
alpha = .5*log ( (1-e) / e )
alpha = 0.1*alpha
f = f + alpha*g
error[i] = mean(1*f*y<0)
}
plot(seq(1,n_iter),error,type="l",
ylim=c(0,.25),col="blue",
ylab="Error Rate",xlab="Iterations",lwd=2)
```

Here we face a classical problem in machine learning: we have a perfect model. With zero error. That is nice, but not interesting. It is also possible in econometrics, with polynomial fits: with 10 observations, and a polynomial of degree 9, we have a perfect fit. But a poor model. Here it is the same. So the trick is to split our dataset in two, a training dataset, and a validation dataset:

```
set.seed(123)
id_train = sample(1:nrow(myocarde), size=45, replace=FALSE)
train_myocarde = myocarde[id_train,]
test_myocarde = myocarde[-id_train,]
```

We construct the model for the first one, and we check the second one to make sure that it's not that bad...

```
y_train = (train_myocarde[,"PRONO"]==1)*2-1
x_train = train_myocarde[,1:7]
y_test = (test_myocarde[,"PRONO"]==1)*2-1
x_test = test_myocarde[,1:7]
train_error = rep(0,n_iter)
test_error = rep(0,n_iter)
f_train = rep(0,length(y_train))
f_test = rep(0,length(y_test))
w_train = rep(1,length(y_train))
alpha = 1
for(i in 1:n_iter){
w_train = w_train*exp(-alpha*y_train*f_train)
w_train = w_train/sum(w_train)
rfit = rpart(y_train~., x_train, w_train, method="class")
g_train = -1 + 2*(predict(rfit,x_train)[,2]>.5)
g_test = -1 + 2*(predict(rfit,x_test)[,2]>.5)
e_train = sum(w_train*(y_train*g_train<0))
alpha = .5*log ( (1-e_train) / e_train )
alpha = 0.1*alpha
f_train = f_train + alpha*g_train
f_test = f_test + alpha*g_test
train_error[i] = mean(1*f_train*y_train<0)
test_error[i] = mean(1*f_test*y_test<0)}
plot(seq(1,n_iter),test_error,col='red')
lines(train_error,lwd=2,col='blue')
```

Here, as previously, after 80 iterations, we have a perfect model on the training dataset, but it behaves badly on the validation dataset. But with 20 iterations, it seems to be okay.

## R Function

Of course, it's possible to use R functions:

```
library(gbm)
gbmWithCrossValidation = gbm(PRONO ~ .,distribution = "bernoulli",
data = myocarde,n.trees = 2000,shrinkage = .01,cv.folds = 5,n.cores = 1)
bestTreeForPrediction = gbm.perf(gbmWithCrossValidation)
```

Here cross-validation is considered, and not training/validation, as well as forests instead of single trees, but, overall, the idea is the same... Of course, the output is much nicer (here the shrinkage is a very small parameter, and learning is extremely slow).

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub. Join the discussion.

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}