# Coding Quickie: Classification Tree Learning

· Big Data Zone ·

Have you ever seen a large table of data and thought, "Somebody just vomited numbers on my monitor"? Sometimes it can be hard to make sense of tables like this -- even more so when the tables are large. There's at least one machine learning technique that can help you make sense of the data: classification trees.

Classification trees work by poring over your data column by column and then hierarchically grouping your data by the unique values in each column. As an example, imagine you had a table of data like this:

In code, I write it like this:

```ruby
dataset = {
  "headers" => ["Number of Wheels", "Number of Doors", "Miles/Gallon", "Type of Auto"],
  "data" => [
    [4,  4, '21-40', 'Car'],
    [2,  0, '41-60', 'Motorcycle'],
    [4,  2, '11-20', 'Truck'],
    [16, 2, '1-10',  'Big Rig'],
    [4,  2, '21-40', 'Car'],
    [8,  2, '1-10',  'Big Rig'],
    [4,  4, '41-60', 'Hybrid Car'],
  ]
}
```
How would you organize the data to make it more understandable?
Note that this example is a bit impractical because the table is pretty small, but the mechanics work the same on larger tables. One caveat: the algorithm works off of already-enumerated values in a table. There are no continuous values like 8.132, no names, and no IDs. In trying to use this in a real-world scenario, I've found myself preprocessing my data into buckets so I can run this algorithm.
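For instance, a preprocessing pass for the fuel-economy column might look like the sketch below. The bucket boundaries here are my own illustrative choices, not something from the original dataset:

```ruby
# Hypothetical preprocessing step: map a continuous miles-per-gallon
# reading onto the discrete buckets used in the table above. The
# boundary values are arbitrary illustrative choices.
def mpg_bucket(mpg)
  case mpg
  when 0...11  then '1-10'
  when 11...21 then '11-20'
  when 21...41 then '21-40'
  else              '41-60'
  end
end

mpg_bucket(8.132)  # a continuous reading becomes the enumerated value '1-10'
```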

The classification algorithm I learned tonight starts by looking at each column of data and deciding which one will give us the most information gain. Once it determines that, it partitions the table and carries on recursively until we've built a full classification tree. You might be wondering how exactly the algorithm determines which split will give us the most information. That's thanks to a calculation known as Shannon entropy. It's outside the scope of this post, but feel free to read more about it.

Here's the code I ported from Python to Ruby:

```ruby
def calculate_shannon_entropy(data, column_index)
  number_of_entries = data.length

  # Tally how often each value appears in the chosen column
  label_counts = Hash.new(0)
  data.each do |row|
    current_label = row[column_index]
    label_counts[current_label] += 1
  end

  # Sum -p * log2(p) over each value's frequency
  shannon_entropy = 0.0
  label_counts.each do |_label, count|
    frequency = count.to_f / number_of_entries
    shannon_entropy -= frequency * Math.log(frequency, 2)
  end
  shannon_entropy
end
```
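To sanity-check the math, here's my own worked example (not from the original post): the entropy of the "Number of Doors" column, computed straight from its value counts.

```ruby
# Worked example: Shannon entropy of the "Number of Doors" column.
# In the table above, the value 4 appears 3 times, 0 once, and 2
# three times, out of 7 rows.
counts = { 4 => 3, 0 => 1, 2 => 3 }
total = counts.values.sum.to_f

entropy = counts.values.reduce(0.0) do |sum, count|
  frequency = count / total
  sum - frequency * Math.log(frequency, 2)
end

# entropy comes out to roughly 1.449 bits -- lower than any other
# column in the table, which is why the split happens on
# "Number of Doors" first.
```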
With the algorithm I created, the following hierarchy results:

```ruby
{
  "Number of Doors" => {
    4 => {
      "Number of Wheels" => {
        4 => {
          "Miles/Gallon" => {
            "21-40" => {
              "Type of Auto" => [["Car"]]
            },
            "41-60" => {
              "Type of Auto" => [["Hybrid Car"]]
            }
          }
        }
      }
    },
    0 => {
      "Number of Wheels" => {
        2 => {
          "Miles/Gallon" => {
            "41-60" => {
              "Type of Auto" => [["Motorcycle"]]
            }
          }
        }
      }
    },
    2 => {
      "Number of Wheels" => {
        4 => {
          "Miles/Gallon" => {
            "11-20" => {
              "Type of Auto" => [["Truck"]]
            },
            "21-40" => {
              "Type of Auto" => [["Car"]]
            }
          }
        },
        16 => {
          "Miles/Gallon" => {
            "1-10" => {
              "Type of Auto" => [["Big Rig"]]
            }
          }
        },
        8 => {
          "Miles/Gallon" => {
            "1-10" => {
              "Type of Auto" => [["Big Rig"]]
            }
          }
        }
      }
    }
  }
}
```
This is one of the few problems that very neatly fits a recursive solution. Why this one? Well, a couple of reasons:
1. You'll notice we're building a tree from this hierarchy. Trees and many of the solutions that involve them are self-similar, meaning whatever you do at one level of the tree you'll need to do at every level. In this case, that's partitioning our dataset into many sub-datasets.
2. The recursion depth is fairly well limited by the context of the problem. You'll only ever run into trouble on datasets with columns that number in the thousands. If that's you, you'll need to unwind the recursion into a stack-based design. Have fun! :)
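If you ever do need to unwind it, the general shape is a depth-first walk with an explicit stack. This sketch is my own illustration, using a generic nested-hash walker rather than the classifier itself:

```ruby
# Walk a nested hash "tree" depth-first using an explicit stack
# instead of recursion -- the general shape you'd unwind into.
# Anything that isn't a Hash is treated as a leaf.
def each_leaf(tree)
  leaves = []
  stack = [tree]
  until stack.empty?
    node = stack.pop
    if node.is_a?(Hash)
      node.each_value { |child| stack.push(child) }
    else
      leaves << node
    end
  end
  leaves
end
```

Building the nested result hash iteratively takes a bit more bookkeeping than this, but the stack-in-place-of-call-stack idea is the same.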
This code is pretty tacky for Ruby, but I'm posting it anyway. Suggest how I can clean it up!

```ruby
def categorize_dataset(dataset)
  column_count = dataset["headers"].length

  # Base case: only the label column remains, so return its values as leaves
  if column_count == 1
    return { dataset["headers"][0] => dataset["data"] }
  end

  # Find the column whose values have the lowest entropy
  min_entropy_value = Float::INFINITY
  best_index_to_split = 0
  (0..column_count - 1).each do |column_index|
    entropy_value = calculate_shannon_entropy dataset["data"], column_index
    if entropy_value < min_entropy_value
      min_entropy_value = entropy_value
      best_index_to_split = column_index
    end
  end

  puts "Splitting on column #{dataset["headers"][best_index_to_split]} with entropy of #{min_entropy_value}"
  the_split_dataset = split_the_dataset dataset, best_index_to_split

  # Recurse into each partition, keyed by the value we split on
  tree = { dataset["headers"][best_index_to_split] => {} }
  the_split_dataset.each do |value_key, value_dataset|
    tree[dataset["headers"][best_index_to_split]][value_key] = categorize_dataset(value_dataset)
  end
  tree
end
```
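One thing the listing leans on: `split_the_dataset` never appears in the post. Here's a minimal reconstruction of my own (so treat the details as an assumption) that partitions the rows by their value in the chosen column and drops that column from each sub-dataset, consistent with the hierarchy shown earlier:

```ruby
# Hypothetical reconstruction of the split_the_dataset helper (not
# shown in the original post). It groups rows by their value in the
# chosen column and removes that column from each sub-dataset.
def split_the_dataset(dataset, column_index)
  splits = Hash.new do |hash, value|
    hash[value] = {
      "headers" => dataset["headers"].reject.with_index { |_, i| i == column_index },
      "data"    => []
    }
  end

  dataset["data"].each do |row|
    splits[row[column_index]]["data"] << row.reject.with_index { |_, i| i == column_index }
  end

  splits
end
```

Each value of the split column maps to a smaller dataset with the same shape as the input, which is what lets `categorize_dataset` recurse on it directly.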
To give credit where it's due, I sat down with Machine Learning in Action tonight, and while I didn't copy the algorithm word for word, it was extremely helpful. This is one of the few books on an advanced topic that seems able to convey the author's knowledge well. Some of the more experienced among you might notice that the way I evaluate the right place to slice the data differs from the standard approach (at least compared to my book). I'm okay with that, but I might clean it up and straighten it out if the mood strikes me.
