
Classification From Scratch, Part 10 of 8: Bagging and Forests


In today's post, we'll take a look at the heuristics of the algorithm behind bagging techniques, and how this applies to random forests.


This is the tenth post of our series on classification from scratch. Today, we'll see the heuristics of the algorithm inside bagging techniques.

Often, bagging is associated with trees, to generate forests. But actually, it is possible to use bagging for any kind of model. Recall that bagging means "bootstrap aggregation." So, consider a model $m:\mathcal{X}\rightarrow\mathcal{Y}$, and let $\widehat{m}_{S}$ denote the estimator of $m$ obtained from the sample $S=\{(y_i,\mathbf{x}_i)\}$, $i=1,\dots,n$.

Consider now some bootstrap sample $S_b=\{(y_i,\mathbf{x}_i)\}$, where $i$ is randomly drawn from $\{1,\dots,n\}$ (with replacement). Based on that sample, estimate $\widehat{m}_{S_b}$. Then draw many samples, and consider the aggregation of the estimators obtained, using either a majority rule, or the average of probabilities (if a probabilistic model was considered). Hence:

$$\widehat{m}^{\text{bag}}(\cdot)=\frac{1}{B}\sum_{b=1}^{B}\widehat{m}_{S_b}(\cdot)$$
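Before specializing to logistic regression or trees, here is a minimal, self-contained sketch of that "bootstrap + aggregate" recipe, where the fitting procedure is passed as an argument. The toy data and the bag/p_bag names are only illustrations, not objects used later in this post:

# generic "b" step: refit the same model on B bootstrap samples
bag = function(df, fit, B = 100){
  n = nrow(df)
  lapply(1:B, function(b) fit(df[sample(1:n, size = n, replace = TRUE), ]))
}
# toy data and a logistic fitter, just to show the mechanics
set.seed(123)
toy = data.frame(x = rnorm(50), y = rbinom(50, size = 1, prob = .5))
models = bag(toy, fit = function(d) glm(y ~ x, data = d, family = binomial))
# "agg" step: average of the B predicted probabilities at a new point
p_bag = function(newdata) mean(sapply(models, predict, newdata = newdata, type = "response"))
p_bag(data.frame(x = 0))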

Bagging Logistic Regression #1

Consider the case of logistic regression. To generate a bootstrap sample, it is natural to use the technique described above, i.e., draw pairs $(y_i,\mathbf{x}_i)$ randomly and uniformly (each with probability 1/n), with replacement. Consider here the small dataset, just to visualize. For the b (bootstrap) part of bagging, use the following code:

L_logit = list()
n = nrow(df)
for(s in 1:1000){
  # draw a bootstrap sample of the n observations, and refit the logistic model
  df_s = df[sample(1:n,size=n,replace=TRUE),]
  L_logit[[s]] = glm(y~., df_s, family=binomial)
}
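Note that this code assumes the small two-covariate dataset df (and, further down, the ten-color palette clr10) introduced earlier in the series. If you are jumping in here, a comparable stand-in, with values chosen purely for illustration, could be:

# hypothetical stand-in for the toy dataset and palette from the earlier posts
df = data.frame(x1 = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85),
                x2 = c(.85,.95,.8,.87,.05,.5,.5,.2,.1,.3),
                y  = as.factor(c(1,1,1,1,1,0,0,1,0,0)))
clr10 = colorRampPalette(c("#4575b4","#ffffbf","#d73027"))(10)  # 10 colors, matching breaks=(0:10)/10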

Then we aggregate over the 1000 models, to get the agg (aggregation) part of bagging:

p = function(x){
  nd = data.frame(x1=x[1], x2=x[2])
  # 1000 predicted probabilities at that new point, one per bootstrapped model
  unlist(lapply(1:1000,function(z) predict(L_logit[[z]],newdata=nd,type="response")))}

We now have a prediction for any new observation:

vu = seq(0,1,length=101)
vv = outer(vu,vu,Vectorize(function(x,y) mean(p(c(x,y)))))
image(vu,vu,vv,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10)
points(df$x1,df$x2,pch=19,cex=1.5,col="white")
points(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],cex=1.5)
contour(vu,vu,vv,levels = .5,add=TRUE)

Bagging Logistic Regression #2

Another technique that can be used to generate a bootstrap sample is to keep all the $\mathbf{x}_i$'s but, for each of them, to draw (randomly) a value for $y$, with

$$Y_{i,b}\sim\mathcal{B}\big(\widehat{m}_{S}(\mathbf{x}_i)\big)$$

since

$$\widehat{m}(\mathbf{x})=\mathbb{P}[Y=1\mid\mathbf{X}=\mathbf{x}]$$

Thus, the code for the b part of the bagging algorithm is now:

L_logit = list()
n = nrow(df)
reg = glm(y~x1+x2, df, family=binomial)
for(s in 1:1000){
  df_s = df
  # keep the x's, and redraw each y from a Bernoulli with the fitted probability
  df_s$y = factor(rbinom(n,size=1,prob=predict(reg,type="response")),labels=0:1)
  L_logit[[s]] = glm(y~., df_s, family=binomial)
}

The agg part of the bagging algorithm remains unchanged. Here we obtain:

vu = seq(0,1,length=101)
vv = outer(vu,vu,Vectorize(function(x,y) mean(p(c(x,y)))))
image(vu,vu,vv,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10)
points(df$x1,df$x2,pch=19,cex=1.5,col="white")
points(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],cex=1.5)
contour(vu,vu,vv,levels = .5,add=TRUE)


Of course, we can use that code to check the predictions we obtain for the observations in our sample. For a change, consider here the myocarde data. All the code is here:

L_logit = list()
n = nrow(myocarde)
reg = glm(as.factor(PRONO)~., myocarde, family=binomial)
for(s in 1:1000){
  myocarde_s = myocarde
  myocarde_s$PRONO = 1*rbinom(n,size=1,prob=predict(reg,type="response"))
  L_logit[[s]] = glm(as.factor(PRONO)~., myocarde_s, family=binomial)
}
p = function(x){
  nd = data.frame(FRCAR=x[1], INCAR=x[2], INSYS=x[3], PRDIA=x[4],
                  PAPUL=x[5], PVENT=x[6], REPUL=x[7])
  unlist(lapply(1:1000,function(z) predict(L_logit[[z]],newdata=nd,type="response")))}
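This assumes the myocarde dataset (seven covariates, FRCAR through REPUL, plus a binary prognosis PRONO) from the earlier posts of the series is already in memory; if it is not, it was read from a small semicolon-separated CSV, along these lines (the file path here is only a placeholder):

# hypothetical loading step; the actual file location was given earlier in the series
myocarde = read.csv2("myocarde.csv")   # semicolon-separated; adjust to your copy
n = nrow(myocarde)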

For the first observation, with our 1000 simulated datasets and our 1000 models, we obtain the following estimates of the probability of dying.

histo = function(i){
  x   = as.numeric(myocarde[i,1:7])
  v_x = p(x)
  hist(v_x, proba=TRUE, breaks=seq(0,1,by=.05), xlab="", main="",
       col=rep(c(rgb(0,0,1,.4),rgb(1,0,0,.4)),each=10), ylim=c(0,5))
  segments(mean(v_x),0,mean(v_x),5,col="red",lty=2)
  points(myocarde$PRONO[i],0,pch=19,cex=2)
  # share (in %) of the 1000 models predicting a probability above 50%
  xi = round(mean(v_x > .5)*1000)/10
  text(.75,-.1,paste(xi,"%",sep=""),col=rgb(1,0,0,.6))}
histo(1)
histo(4)

Hence, for the first observation, in 77.8% of the models, the predicted probability was higher than 50%, and the average probability was actually close to 75%.


Or, for observation #22, the predictions were very close to those for the first one (except that the first patient died, while the 22nd survived):

histo(23)
histo(11)


Bagging Trees

Let's now get back to the trees mentioned in the previous post. Bagging was introduced in 1994 by Leo Breiman in Bagging Predictors. The first section of that paper describes the procedure, while the second introduces "Bagging Classification Trees." Trees are nice for interpretation but, most of the time, they are rather poor predictors. The idea of bagging was to improve the accuracy of classification trees.

The idea of bagging is to generate a lot of trees:

library(rpart)       # classification trees
library(rpart.plot)  # prp() for plotting them
clr12 = c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f")
n = nrow(myocarde)
par(mfrow=c(4,3))
sed = c(1,2,4,5,6,10,11,21,22,24,27,28,30)
for(i in 1:12){
  set.seed(sed[i])
  idx  = sample(1:n, size=n, replace=TRUE)   # bootstrap sample
  cart = rpart(PRONO~., myocarde[idx,])
  prp(cart, type=2, extra=1, box.col=clr12[i])}



The strategy is actually the same as before. For the bootstrap part, store the trees in a list:

L_tree = list()
for(s in 1:1000){
  idx = sample(1:n, size=n, replace=TRUE)
  L_tree[[s]] = rpart(as.factor(PRONO)~., myocarde[idx,])
}

And for the aggregation part, just take the average of the predicted probabilities:

p = function(x){
  nd = data.frame(FRCAR=x[1], INCAR=x[2], INSYS=x[3], PRDIA=x[4],
                  PAPUL=x[5], PVENT=x[6], REPUL=x[7])
  unlist(lapply(1:1000,function(z) predict(L_tree[[z]],newdata=nd,type="prob")[,2]))}
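As mentioned at the beginning, we could also aggregate with a majority rule rather than by averaging probabilities. Here is a quick sketch of that variant, over the same L_tree list (the p_vote name is only for illustration; the plots below still use the averaged probabilities):

# alternative aggregation: each tree votes for a class, return the most frequent one
p_vote = function(x){
  nd = data.frame(FRCAR=x[1], INCAR=x[2], INSYS=x[3], PRDIA=x[4],
                  PAPUL=x[5], PVENT=x[6], REPUL=x[7])
  votes = unlist(lapply(1:1000, function(z)
    as.character(predict(L_tree[[z]], newdata=nd, type="class"))))
  names(which.max(table(votes)))}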

Because we cannot visualize the predictions with this example (there are seven covariates), let us run the same code on the smaller dataset.

L_tree = list()
n = nrow(df)
for(s in 1:1000){
  idx = sample(1:n, size=n, replace=TRUE)
  L_tree[[s]] = rpart(y~x1+x2, df[idx,],
                      control = rpart.control(cp = 0.25, minsplit = 2))
}
p = function(x){
  nd=data.frame(x1=x[1], x2=x[2]) 
  unlist(lapply(1:1000,function(z) predict(L_tree[[z]],newdata=nd,type="prob")[,2]))}
vu=seq(0,1,length=101)
vv=outer(vu,vu,Vectorize(function(x,y) mean(p(c(x,y)))))
image(vu,vu,vv,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10)
points(df$x1,df$x2,pch=19,cex=1.5,col="white")
points(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],cex=1.5)
contour(vu,vu,vv,levels = .5,add=TRUE)

From Bags to Forests

Here, we grew a lot of trees, but this is not, stricto sensu, the random forest algorithm, as introduced in 1995 in Random decision forests. Actually, the difference lies in how the decision trees are grown. To understand what happens, look back at my previous post on classification trees. As we've seen, when we reach a node, we look at possible splits: we consider all possible variables and all possible thresholds. The strategy here will be to randomly draw k variables out of p at each split (with, of course, k < p, for instance k=√p). That is interesting in high dimensions, because at each split we would otherwise have to search over all variables and all cutoffs, and that can take quite some time (especially with the bootstrap procedure, where the goal will be to grow 1000 trees).
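As a reference point (and only a preview, since the point of this series is to build things from scratch), the off-the-shelf randomForest package does exactly that per-split sampling through its mtry argument, which defaults to roughly √p for classification:

# off-the-shelf comparison: bootstrap samples + mtry variables drawn at each split
library(randomForest)
rf = randomForest(as.factor(PRONO) ~ ., data = myocarde,
                  ntree = 500, mtry = floor(sqrt(7)))
predict(rf, newdata = myocarde[1,], type = "prob")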

To be continued...

