# Growing a Spam Tree

Consider the following toy dataset `db` of 100 emails, each labeled spam or ham (`Y`), with two binary indicators for whether the email contains the words “viagra” and “lottery”. A few of its rows:

``````
> load("spam.RData")
      Y viagra lottery
27 spam      0       1
37  ham      0       1
57 spam      0       0
89  ham      0       0
20 spam      1       0
86  ham      0       0
``````

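For the first node, compute the Gini index for each of the two variables. The function below returns minus the weighted Gini impurity of the two child nodes, so values closer to zero mean purer nodes: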

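``````
> gini=function(variable){
+ # contingency table: class (rows) by word indicator (columns)
+ T=table(db$Y,db[,variable])
+ # number of emails in each child node
+ nx=apply(T,2,sum)
+ # class proportions within each child node
+ ProbCond=T/matrix(rep(nx,each=2),2,2)
+ # terms -p(1-p), then weighted by the share of emails in each node
+ Gini=-ProbCond*(1-ProbCond)
+ sum(matrix(rep(nx,each=2),2,2)/sum(nx)*Gini)}
> gini("viagra")
[1] -0.44
> gini("lottery")
[1] -0.487
``````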

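Because the index is defined with a minus sign, the better split is the one with the larger value. Here it is maximal for “viagra” (-0.44 versus -0.487), so “viagra” defines the first node.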
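
As a cross-check, the quantity returned by `gini` can be written out explicitly. The notation is an assumption introduced here, not the author's: $n_j$ is the number of emails in child node $j$, $n$ the total, and $p_{y \mid j}$ the proportion of class $y$ in node $j$. Plugging in the node sizes and class proportions for “viagra” (75 and 25 emails, with proportions 0.4/0.6 and 0.8/0.2, as used in the calculations that follow) recovers the value printed by R.

```latex
G = -\sum_{j \in \{0,1\}} \frac{n_j}{n} \sum_{y} p_{y \mid j}\,\bigl(1 - p_{y \mid j}\bigr)

G_{\text{viagra}}
  = -\left[ \tfrac{75}{100}\,(0.4 \cdot 0.6 + 0.6 \cdot 0.4)
          + \tfrac{25}{100}\,(0.8 \cdot 0.2 + 0.2 \cdot 0.8) \right]
  = -(0.36 + 0.08) = -0.44
```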

At the left node (the 75 emails without “viagra”, with class proportions 0.4 and 0.6), the contribution to the Gini index is

``````
> -75/100*(.4*.6+.6*.4)
[1] -0.36
``````
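
The same arithmetic can be wrapped in a small helper that makes the node weight and the class proportions explicit. This is only a sketch: `node_gini` and its argument names are hypothetical and do not appear in the original code.

```r
# Contribution of a single node to the (negative) Gini index.
# n_node:  number of observations in the node
# n_total: number of observations in the whole dataset
# p:       vector of class proportions within the node (sums to 1)
node_gini <- function(n_node, n_total, p) {
  -(n_node / n_total) * sum(p * (1 - p))
}

node_gini(75, 100, c(0.4, 0.6))  # emails without "viagra": -0.36
```

The helper applies unchanged to any other node, given its size and class proportions.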

If we split this node on the second word, “lottery”, its contribution to the Gini index becomes

``````
> idx=which(db$viagra==0)
> T=table(db[idx,"Y"],db[idx,"lottery"])
> nx=apply(T,2,sum)
> ProbCond=T/matrix(rep(nx,each=2),2,2)
> Gini=-ProbCond*(1-ProbCond)
> sum(matrix(rep(nx,each=2),2,2)/100*
+       Gini)
[1] -0.333
``````
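
Since this computation repeats the body of `gini` on a subset of the rows, the two could be folded into a single helper. The sketch below is one possible way to do it, not the author's code: `gini_split` and `n_total` are hypothetical names, and the vectors `y` and `x` are reconstructed from the node sizes and class proportions quoted in this article (assuming spam is the majority class among emails containing “viagra”), not read from `spam.RData`.

```r
# Negative weighted Gini impurity of the child nodes obtained by splitting on x.
# y:       class labels
# x:       splitting variable (one child node per distinct value)
# n_total: size of the full dataset, so that a subset can be weighted
#          relative to all emails rather than to the subset itself
gini_split <- function(y, x, n_total = length(y)) {
  tab <- table(y, x)                  # classes (rows) by child nodes (columns)
  p <- prop.table(tab, margin = 2)    # class proportions within each child node
  w <- colSums(tab) / n_total         # share of all emails in each child node
  -sum(w * colSums(p * (1 - p)))
}

# Reconstructed "viagra" split: 75 emails without the word (30 spam, 45 ham)
# and 25 emails with it (20 spam, 5 ham). The spam/ham assignment is assumed.
x <- c(rep(0, 75), rep(1, 25))
y <- c(rep("spam", 30), rep("ham", 45), rep("spam", 20), rep("ham", 5))

gini_split(y, x)  # -0.44, matching gini("viagra")
```

On the author's data frame, the subset computation would presumably read `gini_split(db$Y[idx], db$lottery[idx], n_total = nrow(db))`.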

Since the index is larger after the split (-0.333 versus -0.36), we split this node on the second word. At the other node (emails with “viagra”), the contribution to the Gini index is

``````
> -25/100*(.8*.2+.2*.8)
[1] -0.08
``````

and if we decide to split (again, on the second word), we get

``````
> idx=which(db$viagra==1)
> T=table(db[idx,"Y"],db[idx,"lottery"])
> nx=apply(T,2,sum)
> ProbCond=T/matrix(rep(nx,each=2),2,2)
> Gini=-ProbCond*(1-ProbCond)
> sum(matrix(rep(nx,each=2),2,2)/100*
+       Gini)
[1] -0.0792
``````

which is only slightly larger (-0.0792 versus -0.08), so splitting here brings little improvement. To visualize the tree, use

``````
> library(rpart)
> arbre = rpart(factor(Y)~.,data=db)
> library(rpart.plot)
> rpart.plot(arbre,type=4,extra=6)
``````
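
To put the two hand-computed split decisions side by side, one can compare the index before and after each candidate split on “lottery”. This sketch only reuses the values printed earlier; the variable names are illustrative and not part of the original code.

```r
# Contribution of each node to the (negative) Gini index,
# before and after a further split on "lottery" (values from the text).
before <- c(no_viagra = -0.36,  viagra = -0.08)
after  <- c(no_viagra = -0.333, viagra = -0.0792)

gain <- after - before                # absolute improvement of the index
relative_gain <- gain / abs(before)   # improvement relative to the node's own value

# gain is about 0.027 for the node without "viagra" (7.5% of its contribution)
# and about 0.0008 for the node with "viagra" (1% of its contribution)
round(cbind(before, after, gain, relative_gain), 4)
```

The much smaller gain at the node with “viagra” is what makes the second split unattractive there.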

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.