Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Choosing a Classification Model

DZone's Guide to

Choosing a Classification Model

Not sure which classification model is right for your dataset? We'll use simulated data to try out the holdout and cross-validation methods and their visualizations.

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

In order to illustrate the problem of chosing a classification model consider some simulated data,

> n = 500
> set.seed(1)
> X = rnorm(n)
> ma = 10-(X+1.5)^2*2
> mb = -10+(X-1.5)^2*2
> M = cbind(ma,mb)
> set.seed(1)
> Z = sample(1:2,size=n,replace=TRUE)
> Y = ma*(Z==1)+mb*(Z==2)+rnorm(n)*5
> df = data.frame(Z=as.factor(Z),X,Y)

A first strategy is to split the dataset in two parts, a training dataset, and a testing dataset.

> df1 = training = df[1:300,]
> df2 = testing  = df[301:500,]

The Holdout Method: Training and Testing Datasets

The two datasets can be visualised below, with the training dataset on top, and the testing dataset below

> plot(df1$X,df1$Y,pch=19,col=c(rgb(1,0,0,.4),
+ rgb(0,0,1,.4))[df1$Z])

We can consider a simple classification tree.

> library(rpart)
> fit = rpart(Z~X+Y,data=df1)
> pred = function(x,y) predict(fit,newdata=data.frame(X=x,Y=y))[,1]

To visualize it, use

> vx=seq(-3,3,length=101)
> vy=seq(-25,25,length=101)
> z=matrix(NA,length(vx),length(vy))
> for(i in 1:length(vx)){
+   for(j in 1:length(vy))
+  {z[i,j]=pred(vx[i],vy[j])}}

and

> image(vx,vy,z,axes=FALSE,xlab="",ylab="")
> points(df1$X,df1$Y,pch=19,col=c(rgb(1,0,0,.4),
+ rgb(0,0,1,.4))[df1$Z])

We have on top the prediction on the training dataset, then below the prediction on the testing dataset. And finally on the bottom, the ROC curve for the testing dataset (black curve) as well as the training dataset (grey curve)

> Y1=as.numeric(df1$Z)-1
> Y2=as.numeric(df2$Z)-1
> library(ROCR)
> S1 = predict(fit,newdata=df1)[,1]
> S2 = predict(fit,newdata=df2)[,1]
> pred <- prediction( S2, Y2 ) 
> perf <- performance( pred, "tpr", "fpr" ) 
> plot( perf ) 
> pred <- prediction( S1, Y1 ) 
> perf <- performance( pred, "tpr", "fpr" ) 
> plot( perf ,add=TRUE,col="grey")

From that tree, it is natural to get a prediction using a random forest

> library(randomForest)
> fit=randomForest(Z~X+Y,data=df1)

To get a prediction, here, use

> pred=function(x,y) 
+ predict(fit,newdata=data.frame(X=x,Y=y),
+ type="prob")[,2]

Not that the fit is perfect on the training sample.  Another popoular model might be the logistic regression

> fit=glm(Z~X+Y,data=df1,family=binomial)
> pred=function(x,y) 
+ predict(fit,newdata=data.frame(X=x,Y=y),
+ type="response")

The later is very close to the linear discriminent analysis

> library(MASS)
> fit=lda(Z~X+Y,data=df1,family=binomial)
> pred=function(x,y) 
+ predict(fit,newdata=
+ data.frame(X=x,Y=y))$posterior[,2]

or a quadratic discriminent analysis

> fit=qda(Z~X+Y,data=df1,family=binomial)
> pred=function(x,y) 
+ predict(fit,newdata=
+ data.frame(X=x,Y=y))$posterior[,2]

To try very different models, consider a k-nearest neighbor
(here with 9 neighbors)

> library(caret)
> fit=knn3(Z~X+Y,data=df1,k=9)
> pred=function(x,y)  
+ predict(fit,newdata=data.frame(X=x,Y=y))[,2]

some logistic regression with bivarriate splines

> library(mgcv)
> fit=gam(Z~s(X,Y),data=df1,family=binomial)
> pred=function(x,y) 
+ predict(fit,newdata=data.frame(X=x,Y=y),
+ type="response")

or some gradient boosting algorithm

> library(dismo)
> df1$Z01 = 1*(df1$Z=="2")
> fit=gbm.step(data=df1, gbm.x = 2:3, gbm.y = 4,
+     family = "bernoulli", tree.complexity = 5,
+     learning.rate = 0.01, bag.fraction = 0.5)
> pred = function(x,y) 
+ predict(fit,newdata=data.frame(X=x,Y=y),
+ type="response",n.trees=400)

Nevertheless, the use of this holdout method to create some training and testing datasets can be difficult. It can be done only when we have a lot of observations. And even so, we can still be unluckly when creating the testing sample (simply because of bad luck). An alternative is to use cross validation.

Using Cross-Validation

Here, we consider one-leave-out cross-validation (but v-fold cross validation can also be consider, to get results faster).

Consider here a classification tree. To derive the cross validation ROC curve, for instance, use

> FIT=list()
> for(i in 1:n) 
+ FIT[[i]] = rpart(Z~X+Y,data=df[-i,])
> predict_i = function(i) 
+ predict(FIT[[i]],newdata=df[i,])[,2] 
> S = Vectorize(predict_i)(1:n)

Here, I store the n models, where each time, one observation is removed, and then I get a prediction for the removed observation. And we can generate the ROC curve

> Y = as.numeric(df$Z)-1
> library(ROCR)
> pred = prediction( S, Y )
> perf = performance( pred, "tpr", "fpr" )
> plot( perf )

Of course, one can easily get a code that runs faster, but that one is working on that small dataset.

If we consider a random forest, we get

> FIT=list()
> for(i in 1:n) 
+ FIT[[i]] = randomForest(Z~X+Y,data=df[-i,])
> predict_i = function(i) 
+ predict(FIT[[i]],newdata=df[i,],
+ type="prob")[,2] 
> S = Vectorize(predict_i)(1:n)

One can consider a logistic regression

> FIT=list()
> for(i in 1:n) 
+ FIT[[i]] = glm(Z~X+Y,data=df[-i,],
+ family=binomial)
> predict_i = function(i) 
+ predict(FIT[[i]],newdata=df[i,],
+ type="response") 
> S = Vectorize(predict_i)(1:n)

or a linear discriminent analysis

> FIT=list()
> for(i in 1:n) 
+ FIT[[i]] = lda(Z~X+Y,data=df[-i,],
+ family=binomial)
> predict_i = function(i) 
+ predict(FIT[[i]],newdata=df[i,])$posterior[,2] 
> S = Vectorize(predict_i)(1:n)

while a quadratic discriminent analysis yields

> FIT=list()
> for(i in 1:n) 
+ FIT[[i]] =  qda(Z~X+Y,data=df[-i,],
+ family=binomial)
> predict_i = function(i) 
+ predict(FIT[[i]],newdata=df[i,])$posterior[,2] 
> S = Vectorize(predict_i)(1:n)

Again, if we consider another type of model, such as a k-nearest neighbor (with k=5) we get

> FIT=list()
> for(i in 1:n) 
+ FIT[[i]] = knn3(Z~X+Y,data=df[-i,],k=5)
> predict_i = function(i) 
+ predict(FIT[[i]],newdata=df[i,])[,2] 
> S = Vectorize(predict_i)(1:n)

or a k nearest neighbor with k=9

or k=21

One can also consider a logistic regression with bivariate splines

or, last but not least, some gradient boosting model. On that one, we might have trouble storing 500 output from the gradient boosting function (the output is large). A faster code might be

> VS = rep(NA,n)
> for(i in 1:n){
+  FIT = gbm.step(data=df[-i,], 
+  gbm.x = 2:3, gbm.y = 4, family = "bernoulli", 
+  tree.complexity = 5, learning.rate = 0.01,
+  bag.fraction = 0.5)
+  VS[i] = predict(FIT,newdata=df[i,],
+  n.trees=400) 
+ }

Actually, it is now possible to compare all those models, with all ROC curves on a single graph:


Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
big data ,r language ,trees ,splines ,boosting ,data visualization ,classification models

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}