
Growing a Spam Tree


Consider the following toy dataset of emails, each labeled spam or ham, with two indicator variables recording whether the words “viagra” and “lottery” appear.

> load("spam.RData")
> head(db)
      Y viagra lottery
27 spam      0       1
37  ham      0       1
57 spam      0       0
89  ham      0       0
20 spam      1       0
86  ham      0       0

For the first node, compute the Gini index for each of the two variables:

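> gini=function(variable){
+ # contingency table: rows = class (spam/ham), columns = values of the word indicator
+ tab=table(db$Y,db[,variable])
+ nx=apply(tab,2,sum)
+ # conditional class probabilities within each column
+ ProbCond=tab/matrix(rep(nx,each=2),2,2)
+ # negated impurity terms, weighted by each column's share of the sample
+ Gini=-ProbCond*(1-ProbCond)
+ sum(matrix(rep(nx,each=2),2,2)/sum(nx)*Gini)}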
> gini("viagra")
[1] -0.44
> gini("lottery")
[1] -0.487

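Note that this function returns the negative of the weighted Gini impurity, so a larger value (closer to zero) means purer child nodes. Here the Gini index is maximal for “viagra”, so that word defines the first split.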
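Since `spam.RData` is not distributed with this post, the computation can be replayed on a stand-in dataset. The sketch below is an assumption, not the author's file: the cell counts are chosen so that the Gini values match the outputs printed above, and the spam/ham labels could be swapped without changing any of them. The names `counts` and `neg_gini` are hypothetical.

```r
# Hypothetical reconstruction of the toy data (NOT the original spam.RData):
# one row per (class, viagra, lottery) cell, with n emails in each cell.
counts <- data.frame(
  Y       = c("spam", "ham", "spam", "ham", "spam", "ham", "spam"),
  viagra  = c(0, 0, 0, 0, 1, 1, 1),
  lottery = c(0, 0, 1, 1, 0, 0, 1),
  n       = c(20, 40, 10, 5, 19, 5, 1)
)
# Expand the cell counts into one row per email (100 rows in total).
db <- counts[rep(seq_len(nrow(counts)), counts$n), c("Y", "viagra", "lottery")]

# Negated weighted Gini impurity after splitting `data` on `variable`,
# following the sign convention of gini() in the article.
neg_gini <- function(data, variable) {
  tab <- table(data$Y, data[[variable]])      # classes in rows, split values in columns
  nx <- colSums(tab)                          # size of each child node
  prob_cond <- sweep(tab, 2, nx, "/")         # class shares within each child
  impurity <- colSums(prob_cond * (1 - prob_cond))
  -sum(nx / sum(nx) * impurity)               # weight each child by its share of emails
}

print(round(neg_gini(db, "viagra"), 3))
print(round(neg_gini(db, "lottery"), 3))
```

Under these assumed counts the two calls reproduce the article's -0.44 and -0.487. Using `sweep` and `colSums` rather than a hard-coded 2-by-2 matrix keeps the function valid if a variable takes more than two values.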

On the left node (the 75 emails out of 100 without “viagra”, split 40%/60% between the two classes), the component of the Gini index is

> -75/100*(.4*.6+.6*.4)
[1] -0.36
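The same arithmetic can be wrapped in a small helper, which also makes it easy to check that the components of the two nodes add up to the value of `gini("viagra")`. This is only a sketch: `node_component` and its argument names are hypothetical, and the node sizes and class shares are the ones used in the article's own arithmetic.

```r
# Contribution of one node to the negated weighted Gini impurity.
# n_node: emails in the node; n_total: emails overall;
# p: share of one of the two classes within the node.
node_component <- function(n_node, n_total, p) {
  probs <- c(p, 1 - p)
  -n_node / n_total * sum(probs * (1 - probs))
}

left  <- node_component(75, 100, 0.4)  # emails without "viagra"
right <- node_component(25, 100, 0.8)  # emails with "viagra"
print(c(left = left, right = right, total = left + right))
```

The total of the two components is -0.44, the value returned for the split on “viagra”.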

If we split this node on the second word, “lottery”, the component becomes

> idx=which(db$viagra==0)
> T=table(db[idx,"Y"],db[idx,"lottery"])
> nx=apply(T,2,sum)
> ProbCond=T/matrix(rep(nx,each=2),2,2)
> Gini=-ProbCond*(1-ProbCond)
> sum(matrix(rep(nx,each=2),2,2)/100*
+       Gini)
[1] -0.333

Since the Gini index is larger (-0.333 versus -0.36), we decide to split on the second word here. On the other node (emails with “viagra”), the component of the Gini index is

> -25/100*(.8*.2+.2*.8)
[1] -0.08

and if we decide to split (again, on the second word), we get

> idx=which(db$viagra==1)
> T=table(db[idx,"Y"],db[idx,"lottery"])
> nx=apply(T,2,sum)
> ProbCond=T/matrix(rep(nx,each=2),2,2)
> Gini=-ProbCond*(1-ProbCond)
> sum(matrix(rep(nx,each=2),2,2)/100*
+       Gini)
[1] -0.0792

which is only slightly larger (-0.0792 versus -0.08). Splitting would not be very interesting here. To visualize the tree, use

> library(rpart)
> arbre = rpart(factor(Y)~.,data=db)
> library(rpart.plot)
> rpart.plot(arbre,type=4,extra=6)
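The node-by-node comparison carried out by hand above can also be written as one helper that returns a node's component before and after a candidate split. This is a sketch under stated assumptions: `db` is rebuilt from hypothetical cell counts chosen to match the article's printed outputs (it is not the original `spam.RData`), and `node_gini` is an illustrative name.

```r
# Hypothetical reconstruction of the toy data (NOT the original spam.RData).
counts <- data.frame(
  Y       = c("spam", "ham", "spam", "ham", "spam", "ham", "spam"),
  viagra  = c(0, 0, 0, 0, 1, 1, 1),
  lottery = c(0, 0, 1, 1, 0, 0, 1),
  n       = c(20, 40, 10, 5, 19, 5, 1)
)
db <- counts[rep(seq_len(nrow(counts)), counts$n), c("Y", "viagra", "lottery")]

# Negated Gini component of the emails selected by `rows`, weighted by their
# share of the full sample. With `variable = NULL` the node is left unsplit;
# otherwise it is split on that variable first.
node_gini <- function(data, rows, variable = NULL) {
  node <- data[rows, ]
  groups <- if (is.null(variable)) rep(1, nrow(node)) else node[[variable]]
  tab <- table(node$Y, groups)
  nx <- colSums(tab)
  prob_cond <- sweep(tab, 2, nx, "/")
  -sum(nx / nrow(data) * colSums(prob_cond * (1 - prob_cond)))
}

# Compare each child of the "viagra" split before and after splitting on "lottery".
for (v in c(0, 1)) {
  rows   <- db$viagra == v
  before <- node_gini(db, rows)
  after  <- node_gini(db, rows, "lottery")
  cat(sprintf("viagra == %g: before %.4f, after %.4f, gain %.4f\n",
              v, before, after, after - before))
}
```

With these assumed counts, the gain is about 0.027 for the emails without “viagra” and under 0.001 for the emails with it, which matches the conclusion that only the first of these nodes is worth splitting.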


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
