
Growing Some Trees


Consider the dataset used in a previous post on visualising a classification with more than two features:

> MYOCARDE=read.table(
+ "http://freakonometrics.free.fr/saporta.csv",
+ header=TRUE,sep=";")

The default classification tree is

> library(rpart)
> library(rpart.plot)
> arbre = rpart(factor(PRONO)~.,data=MYOCARDE)
> rpart.plot(arbre,type=4,extra=6)


We can change the options, such as the minimum number of observations a node must contain before a split is attempted (minsplit, which defaults to 20):

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE,
+       control=rpart.control(minsplit=10))
> rpart.plot(arbre,type=4,extra=6)


or

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE,
+        control=rpart.control(minsplit=5))
> rpart.plot(arbre,type=4,extra=6)
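To see what minsplit does without downloading anything, here is a minimal sketch. It is not the analysis above: it assumes the kyphosis data bundled with rpart as a stand-in for MYOCARDE, and count_leaves is a helper name introduced only for this illustration.

```r
library(rpart)  # kyphosis ships with rpart

# Hypothetical helper: number of terminal nodes in a fitted tree
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Assumption: kyphosis stands in for MYOCARDE
minsplits <- c(20, 10, 5)
fits <- lapply(minsplits, function(ms)
  rpart(Kyphosis ~ ., data = kyphosis,
        control = rpart.control(minsplit = ms)))
names(fits) <- paste0("minsplit=", minsplits)

leaves <- sapply(fits, count_leaves)
print(leaves)

# Size of every node that was actually split, for minsplit = 10
split_sizes <- with(fits[["minsplit=10"]]$frame, n[var != "<leaf>"])
print(split_sizes)
```

No node smaller than minsplit is ever split, so every value in split_sizes is at least 10. Lowering minsplit also lowers the default minbucket (a third of minsplit, rounded), so the trees are not necessarily nested.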


To visualize that classification, use the following code, which projects it on the first two principal components:

> library(FactoMineR) # PCA (on the continuous variables)
> X = MYOCARDE[,1:7]
> acp = PCA(X,ncp=ncol(X))
> M = acp$var$coord
> Minv = solve(M)
> m = apply(X,2,mean)
> s = apply(X,2,sd)
> 
> arbre = rpart(factor(PRONO)~.,data=MYOCARDE)
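> pred2=function(d1,d2,Mat,tree){
+   # coordinates on the first two components, the others set to 0
+   z=Mat %*% c(d1,d2,rep(0,ncol(X)-2))
+   # undo the standardisation to get back to the original units
+   newd=data.frame(t(z*s+m))
+   names(newd)=names(X)
+   # predicted probability of the second class of PRONO
+   predict(tree,newdata=newd,
+           type="prob")[2] }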
> p=function(d1,d2) pred2(d1,d2,Minv,arbre)

> Outer <- function(x,y,fun) {
+   mat <- matrix(NA, length(x), length(y))
+   for (i in seq_along(x)) {
+     for (j in seq_along(y)) 
+       mat[i,j]=fun(x[i],y[j])}
+   return(mat)}
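The explicit double loop is needed because base outer() expects a vectorized function, while the prediction function is called on one point at a time. As a hedged aside, wrapping the function in Vectorize() should give the same matrix. The scalar function f below is a made-up stand-in for p.

```r
# Same helper as above, repeated so the sketch runs on its own
Outer <- function(x, y, fun) {
  mat <- matrix(NA, length(x), length(y))
  for (i in seq_along(x)) {
    for (j in seq_along(y))
      mat[i, j] <- fun(x[i], y[j])
  }
  mat
}

# Hypothetical scalar-only function (not vectorized, like p)
f <- function(a, b) if (a > b) a - b else 0

x <- seq(-1, 1, length = 5)
y <- seq(0, 2, length = 4)

m_loop <- Outer(x, y, f)             # explicit double loop
m_vec  <- outer(x, y, Vectorize(f))  # base R equivalent
```

Both versions call the function once per grid point, so neither should be expected to be much faster than the other.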

> xgrid=seq(-5,5,length=251)
> ygrid=seq(-5,5,length=251)
> zgrid=Outer(xgrid,ygrid,p)
> bluereds=c(
+   rgb(1,0,0,(10:0)/25),rgb(0,0,1,(0:10)/25))

> acp2=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
> plot(acp2, habillage = 8,col.hab=c("red","blue"))
> image(xgrid,ygrid,zgrid,add=TRUE,col=bluereds)
> contour(xgrid,ygrid,zgrid,add=TRUE,levels=.5)
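The key step in pred2 is going from a point on the plane of the first two components back to a full vector of features, with the remaining components set to zero. Here is a minimal sketch of that idea, using base R's prcomp rather than FactoMineR (whose scaling conventions differ) and assuming the iris measurements as a stand-in for MYOCARDE[,1:7]. back_project is a name made up for this illustration.

```r
# Assumption: iris stands in for the continuous variables of MYOCARDE
X <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Hypothetical helper: map PCA scores (one row per point) back to the
# original feature scale; components not supplied are treated as zero
back_project <- function(scores, pca) {
  k <- ncol(scores)
  z <- scores %*% t(pca$rotation[, 1:k, drop = FALSE])  # standardized features
  sweep(sweep(z, 2, pca$scale, "*"), 2, pca$center, "+")
}

full  <- back_project(pca$x, pca)          # all components: exact reconstruction
plane <- back_project(pca$x[, 1:2], pca)   # first two only: point on the plane

max_err_full <- max(abs(full - as.matrix(X)))
print(max_err_full)
```

With all components the reconstruction error is numerically zero. With two components we get the feature vector that a classifier would be asked to score at that point of the plane.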


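The same projection can be drawn for the deeper tree: refit with minsplit=5, then recompute zgrid and redraw the plot as above (p picks up the new arbre when it is called),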

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE,
+        control=rpart.control(minsplit=5))


Finally, one can also grow more trees, obtained by sampling. This is the idea of bagging: we bootstrap our observations, grow a tree on each bootstrap sample, and then aggregate the predicted values. On the grid

> xgrid=seq(-5,5,length=201)
> ygrid=seq(-5,5,length=201)


the code is the following,

> Z = matrix(0,201,201)
> for(i in 1:200){
+ indice = sample(1:nrow(MYOCARDE),
+          size=nrow(MYOCARDE),
+          replace=TRUE)
+ ECHANTILLON=MYOCARDE[indice,]
+ arbre_b = rpart(factor(PRONO)~.,
+   data=ECHANTILLON)
+ p2 = function(d1,d2) pred2(d1,d2, Minv,arbre_b)
+ zgrid2_b = Outer(xgrid,ygrid,p2)
+ Z = Z+zgrid2_b }
> Zgrid = Z/200
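Stripped of the grid and the PCA projection, the bagging loop reduces to a few lines. The sketch below assumes the kyphosis data bundled with rpart in place of MYOCARDE, and averages the predicted probabilities on the observations themselves rather than on a grid. B = 25 is an arbitrary choice.

```r
library(rpart)
set.seed(1)

# Assumption: kyphosis stands in for MYOCARDE
B <- 25
n <- nrow(kyphosis)
prob_sum <- numeric(n)

for (b in seq_len(B)) {
  # bootstrap: draw n observations with replacement
  idx <- sample(n, size = n, replace = TRUE)
  tree_b <- rpart(Kyphosis ~ ., data = kyphosis[idx, ])
  # accumulate the predicted probability of the class "present"
  prob_sum <- prob_sum +
    predict(tree_b, newdata = kyphosis, type = "prob")[, "present"]
}

# aggregate: average over the B trees
bagged_prob <- prob_sum / B
```

A single tree only outputs as many distinct probabilities as it has leaves. Averaging over bootstrap trees gives a smoother score, which is why the bagged boundary on the plot is less blocky.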


To visualize it, use

> plot(acp2, habillage = 8,
+ col.hab=c("red","blue"))
> image(xgrid,ygrid,Zgrid,add=TRUE,
+ col=bluereds)


> contour(xgrid,ygrid,Zgrid,add=TRUE,
+ levels=.5,lwd=3)


Last, but not least, it is possible to use a random forest algorithm. The method combines Breiman’s bagging idea (mentioned previously) and the random selection of features.

> library(randomForest)
> foret = randomForest(factor(PRONO)~.,
+          data=MYOCARDE)
> pF=function(d1,d2) pred2(d1,d2,Minv,foret)
> zgridF=Outer(xgrid,ygrid,pF)

> acp2=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
> plot(acp2, habillage = 8,col.hab=c("red","blue"))
> image(xgrid,ygrid,zgridF,add=TRUE,
+ col=bluereds)
> contour(xgrid,ygrid,zgridF,
+ add=TRUE,levels=.5,lwd=3)
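To make the "bagging plus random selection of features" idea concrete, here is a deliberately simplified sketch built on rpart. It draws one random subset of features per tree, whereas Breiman's random forest (and the randomForest package) redraws the candidate features at every split, so this is closer to a random-subspace ensemble than to a true random forest. The kyphosis data and the values B = 30 and m = 2 are assumptions made for the illustration.

```r
library(rpart)
set.seed(2)

# Assumption: kyphosis stands in for MYOCARDE
features <- setdiff(names(kyphosis), "Kyphosis")
B <- 30   # number of trees
m <- 2    # features drawn for each tree
n <- nrow(kyphosis)
prob_sum <- numeric(n)

for (b in seq_len(B)) {
  idx  <- sample(n, size = n, replace = TRUE)        # bootstrap the rows
  vars <- sample(features, m)                        # random feature subset
  form <- reformulate(vars, response = "Kyphosis")
  tree_b <- rpart(form, data = kyphosis[idx, ])
  prob_sum <- prob_sum +
    predict(tree_b, newdata = kyphosis, type = "prob")[, "present"]
}

subspace_prob <- prob_sum / B
```

Restricting the features makes the individual trees less alike than plain bootstrap trees, which is the motivation for adding feature sampling on top of bagging.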



Published at DZone with permission of
