DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
What's in store for DevOps in 2023? Hear from the experts in our "DZone 2023 Preview: DevOps Edition" on Fri, Jan 27!
Save your seat
  1. DZone
  2. Data Engineering
  3. Databases
  4. Regression Tree Using Gini’s Index

Regression Tree Using Gini’s Index

Arthur Charpentier user avatar by
Arthur Charpentier
·
Jan. 29, 13 · Interview
Like (2)
Save
Tweet
Share
3.80K Views

Join the DZone community and get the full member experience.

Join For Free

in order to illustrate the construction of regression tree (using the cart methodology), consider the following simulated dataset,

> set.seed(1)
> n=200
> x1=runif(n)
> x2=runif(n)
> p=.8*(x1<.3)*(x2<.5)+
+   .2*(x1<.3)*(x2>.5)+
+   .8*(x1>.3)*(x1<.85)*(x2<.3)+
+   .2*(x1>.3)*(x1<.85)*(x2>.3)+
+   .8*(x1>.85)*(x2<.7)+
+   .2*(x1>.85)*(x2>.7) 
> y=rbinom(n,size=1,p)  
> b=data.frame(y,x1,x2)

with one dichotomos varible (the variable of interest, ), and two continuous ones (the explanatory ones and ).

> tail(b)
    y        x1        x2
195 0 0.2832325 0.1548510
196 0 0.5905732 0.3483021
197 0 0.1103606 0.6598210
198 0 0.8405070 0.3117724
199 0 0.3179637 0.3515734
200 1 0.7828513 0.1478457

the theoretical partition is the following

here, the sample can be plotted below (be careful, the first variate is on the y -axis above, and the x -axis below) with blue dots when equals one, and red dots when is null,

> plot(x1,x2,col="white")
> points(x1[y=="1"],x2[y=="1"],col="blue",pch=19)
> points(x1[y=="0"],x2[y=="0"],col="red",pch=19)

in order to construct the tree, we need a partition critera. the most standard one is probably gini’s index, which can be writen, when ‘s are splited in two classes, denoted here

l'image “http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-04.png” ne peut être affichée car elle contient des erreurs.

or when ‘s are splited in three classes, denoted
http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-07.png

etc. here, are just counts of observations that belong to partition such that takes value . but it is possible to consider other criteria, such as the chi-square distance,

http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-01.png

where, classically

http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-02.png
when we consider two classes (one knot) or, in the case of three classes (two knots)
http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-05.png

here again, the idea is to maximize that distance: the idea is to discriminate, so we want samples as not independent as possible. to compute gini’s index consider

> gini=function(y,i){
+ t=table(y,i)
+ nx=apply(t,2,sum)
+ pxy=t/matrix(rep(nx,each=2),2,ncol(t))
+ vxy=pxy*(1-pxy)
+ zx=apply(vxy,2,sum)
+ n=sum(t)
+ -sum(nx/n*zx)
+ }

we simply construct the contingency table, and then, compute the quantity given above. assume, first, that there is only one explanatory variable. we split the sample in two, with all possible spliting values , i.e.

then, we compute gini’s index, for all those values. the knot is the value that maximizes gini’s index. once we have our first knot, we keep it (call it, from now on ). and we reiterate, by seeking the best second choice: given one knot, consider the value that splits the sample in three, and give the highest gini’s index, thus, we consider either the following partition

or this one

i.e. we cut either below, or above the previous knot. and we iterate. the code can be something like that,

> x=x2
> u=(sort(x)[2:n]+sort(x)[1:(n-1)])/2
> knot=null
> for(s in 1:4){
+ vgini=rep(na,length(u))
+ for(i in 1:length(u)){
+ kn=c(knot,u[i])
+ f=function(x){sum(x<=kn)}
+ i=vectorize(f)(x)
+ vgini[i]=gini(y,i)
+ }
+ plot(u,vgini)
+ k=which.max(vgini)
+ cat("knot",k,u[k],"\n")
+ knot=c(knot,u[k])
+ u=u[-k]
+ }
knot 69 0.3025479 
knot 133 0.5846202 
knot 72 0.3148172 
knot 111 0.4811517

at the first step, the value of gini’s index was the following,

which was maximal around 0.3. then, this value is considered as fixed. and we try to construct a partition in three parts (spliting either below or above 0.3). we get the following plot for gini’s index (as a function of this second knot)

which is maximum when the split the sample around 0.6 (which becomes our second knot). etc. now, let us compare our code with the standard r function,

> tree(y~x2,method="gini")
node), split, n, deviance, yval
      * denotes terminal node

 1) root 200 49.8800 0.4750  
   2) x2 < 0.302548 69 12.8100 0.7536 *
   3) x2 > 0.302548 131 28.8900 0.3282  
     6) x2 < 0.58462 65 16.1500 0.4615  
      12) x2 < 0.324591 7  0.8571 0.1429 *
      13) x2 > 0.324591 58 14.5000 0.5000 *
     7) x2 > 0.58462 66 10.4400 0.1970 *

we do obtain similar knots: the first one is 0.302 and the second one 0.584. so, constructing tree is not that difficult…

now, what if we consider our two explanatory variables? the story remains the same, except that the partition is now a bit more complex to write. to find the first knot, we consider all values on the two components, and again, keep the one that maximizes gini’s index,

> n=nrow(b)
> u1=(sort(x1)[2:n]+sort(x1)[1:(n-1)])/2
> u2=(sort(x2)[2:n]+sort(x2)[1:(n-1)])/2
> gini=matrix(na,nrow(b)-1,2)
> for(i in 1:length(u1)){
+ i=(x1<u1[i])
+ gini[i,1]=gini(y,i)
+ i=(x2<u2[i])
+ gini[i,2]=gini(y,i)
+ }
> mg=max(gini)
> i=1+sum(mg==max(gini[,2]))
> par(mfrow = c(1, 2))
> plot(u1,gini[,1],ylim=range(gini),col="green",type="b",xlab="x1",ylab="gini index")
> abline(h=mg,lty=2,col="red")
> if(i==1){points(u1[which.max(gini[,1])],mg,pch=19,col="red")
+          segments(u1[which.max(gini[,1])],mg,u1[which.max(gini[,1])],-100000)}
> plot(u2,gini[,2],ylim=range(gini),col="green",type="b",xlab="x2",ylab="gini index")
> abline(h=mg,lty=2,col="red")
> if(i==2){points(u2[which.max(gini[,2])],mg,pch=19,col="red")
+          segments(u2[which.max(gini[,2])],mg,u2[which.max(gini[,2])],-100000)}
> u2[which.max(gini[,2])]
[1] 0.3025479

the graphs are the following: either we split on the first component (and we obtain the partition on the right, below),

or we split on the second one (and we get the following partition),

here, it is optimal to split on the second variate, first. and actually, we get back to the one-dimensional case discussed previously: as expected, it is optimal to split around 0.3. this is confirmed with the code below,

> library(tree)
> arbre=tree(y~x1+x2,data=b,method="gini")
> arbre$frame[1:4,]
     var   n       dev      yval splits.cutleft splits.cutright
1     x2 200 49.875000 0.4750000      <0.302548       >0.302548
2     x1  69 12.811594 0.7536232      <0.800113       >0.800113
4 <leaf>  57  8.877193 0.8070175                               
5 <leaf>  12  3.000000 0.5000000

for the second knot, four cases should be considered: spliting on the second variable (again), either above, or below the previous knot ( see below on the left ) or spliting on the first one. then whe have wither a partition below or above the previous knot ( see below on the right ),

etc. to visualize the tree, the code is the following

> plot(arbre)
> text(arbre)
> partition.tree(arbre)

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/01/arbre-gini-x1-x2-encore.png

note that we can also visualize the partition. nice, isn’t it?

to go further, the book classification and regression trees by leo breiman (and co-authors) is awesome. note that there are also interesting sections in the bible elements of statistical learning: data mining, inference, and prediction by trevor hastie, robert tibshirani and jerome friedman (which can be downloaded from http://www.stanford.edu/~hastie/… )

Tree (data structure) Knot Partition (database)

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • How Do the Docker Client and Docker Servers Work?
  • How To Use Terraform to Provision an AWS EC2 Instance
  • Distributed Stateful Edge Platforms
  • Load Balancing Pattern

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: