# Growing a Spam Tree

Consider the following toy dataset, with a spam/ham label `y` and two 0/1 indicator columns recording whether an email contains the words “viagra” and “lottery”.

```
> load("spam.rdata")
> head(db)
y viagra lottery
27 spam 0 1
37 ham 0 1
57 spam 0 0
89 ham 0 0
20 spam 1 0
86 ham 0 0
```

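For the first node, compute the Gini index for the two variables. The function below returns the negative of the weighted Gini impurity, so a larger value (closer to zero) means purer child nodes.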

```
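> gini=function(variable){
+ # contingency table: class (rows) by word indicator (columns)
+ t=table(db$y,db[,variable])
+ nx=apply(t,2,sum)
+ # conditional class probabilities within each column
+ probcond=t/matrix(rep(nx,each=2),2,2)
+ # negative Gini impurity terms, weighted by each column's share
+ gini=-probcond*(1-probcond)
+ sum(matrix(rep(nx,each=2),2,2)/sum(nx)*gini)}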
> gini("viagra")
[1] -0.44
> gini("lottery")
[1] -0.487
```

Here the (negative) Gini index is maximal for “viagra” (-0.44 versus -0.487), so that word defines the first split.
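Read off the code, the quantity returned by `gini()` can be written as follows. The notation is introduced here for illustration: $n_x$ is the number of emails with word indicator $x$ (`nx`), $n$ is the total number of emails, and $p_{y\mid x}$ is the share of class $y$ among emails with indicator $x$ (`probcond`).

$$
G = -\sum_{x \in \{0,1\}} \frac{n_x}{n} \sum_{y \in \{\text{spam},\,\text{ham}\}} p_{y\mid x}\,\bigl(1 - p_{y\mid x}\bigr)
$$

For “viagra”, using the counts and proportions that appear in the node-level computations later in the article (75 emails without the word, 25 with it):

$$
G(\text{viagra}) = -\left[\frac{75}{100}\,(0.4 \cdot 0.6 + 0.6 \cdot 0.4) + \frac{25}{100}\,(0.8 \cdot 0.2 + 0.2 \cdot 0.8)\right] = -(0.36 + 0.08) = -0.44
$$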

On the left node (emails without “viagra”), the contribution to the Gini index is

```
> -75/100*(.4*.6+.6*.4)
[1] -0.36
```

If we decide to split at this node (using the second word, “lottery”), the new Gini index would be

```
> idx=which(db$viagra==0)
> t=table(db[idx,"y"],db[idx,"lottery"])
> nx=apply(t,2,sum)
> probcond=t/matrix(rep(nx,each=2),2,2)
> gini=-probcond*(1-probcond)
> sum(matrix(rep(nx,each=2),2,2)/100*
+ gini)
[1] -0.333
```
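The node-level computation above can be wrapped in a small helper so that the same steps apply at any node. This is only a sketch: `split_gini`, its arguments, and the eight-row `toy` data frame are made up for illustration and are not the `db` data used in the article. It also uses `prop.table` in place of the explicit `matrix(rep(...))` division.

```r
# Negative weighted Gini impurity when the rows `idx` of `data` are split on
# `variable`. `n_total` is the size of the full sample, so that values from
# different nodes stay comparable (the article hard-codes 100).
split_gini <- function(data, variable,
                       idx = seq_len(nrow(data)),
                       n_total = nrow(data)) {
  counts <- table(data[idx, "y"], data[idx, variable])  # class x indicator
  nx <- colSums(counts)                                  # emails per indicator value
  probcond <- prop.table(counts, margin = 2)             # P(class | indicator)
  impurity <- colSums(probcond * (1 - probcond))         # Gini impurity per column
  -sum(nx / n_total * impurity)
}

# Hypothetical toy data, NOT the article's dataset
toy <- data.frame(
  y       = c("spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"),
  viagra  = c(1, 1, 1, 1, 0, 0, 0, 0),
  lottery = c(1, 0, 1, 0, 0, 0, 1, 1)
)

split_gini(toy, "viagra")                                   # -0.375
split_gini(toy, "lottery", idx = which(toy$viagra == 0))    # -0.125
```

In this toy example the node without “viagra” contributes -4/8 * 0.375 = -0.1875 before splitting, so moving to -0.125 is an improvement, mirroring the comparison made in the article.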

Since this Gini index (-0.333) is larger than the -0.36 obtained without splitting, we decide to split (on the second word) here. On the other node (emails with “viagra”), the contribution to the Gini index is

```
> -25/100*(.8*.2+.2*.8)
[1] -0.08
```

and if we decide to split (again, on the second word), we get

```
> idx=which(db$viagra==1)
> t=table(db[idx,"y"],db[idx,"lottery"])
> nx=apply(t,2,sum)
> probcond=t/matrix(rep(nx,each=2),2,2)
> gini=-probcond*(1-probcond)
> sum(matrix(rep(nx,each=2),2,2)/100*
+ gini)
[1] -0.0792
```

which is only *slightly* larger (-0.0792 versus -0.08). Splitting would not be very interesting here.

To visualize the tree, use

```
> library(rpart)
> arbre = rpart(factor(y)~.,data=db)
> library(rpart.plot)
> rpart.plot(arbre,type=4,extra=6)
```
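Returning to the two candidate splits on “lottery”, the arithmetic can be checked in a few lines. The values are copied as printed in the article (hence rounded), and the variable names are chosen here purely for illustration.

```r
# Node contributions before splitting on "lottery" (from the article's outputs)
before <- c(no_viagra = -0.36, viagra = -0.08)
# Values after splitting each node on "lottery" (rounded as printed)
after <- c(no_viagra = -0.333, viagra = -0.0792)

# The two contributions add up to the root-split value gini("viagra")
sum(before)      # -0.44

# Improvement from splitting each node; larger means a more useful split
gain <- after - before
round(gain, 4)   # no_viagra 0.0270, viagra 0.0008
```

The improvement on the “viagra” node is roughly thirty times smaller than on the other node, which is why only the node without “viagra” looks worth splitting.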

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

