An Introduction to Machine Learning With Decision Trees
Decision trees are a common model for software applications, but how are they used in combination with machine learning? Let's take a look.
Join the DZone community and get the full member experience.
Join For Free“ Machine learning is the subfield of computer science that gives computers the ability to learn without being programmed.” — Arthur Samuel, 1959
Machine learning is a buzzword in the technology world right now. It is fun, challenging, puzzling, and even a bit scary if you’re one of those people who believes robots will someday steal our jobs and rule the world. Whether we like it or not, we are surrounded by adaptive smart things that can fix some of our most common daily queries in a split second.
Machine learning was embodied in the famous Skynet from the Terminator franchise. Some are afraid of this fantasy becoming real, while some are excited about a whole new world of opportunities we’ll get in the AI world. Sure, this future AI may want to eradicate the entire human race... but for now, we can achieve much more than we can even imagine with it.
Google self-driving cars, Facebook face recognition, Amazon recommendations, speech recognition with Siri and Cortana, fraud detection by PayPal... the list is long and growing!
So, that was a brief intro into ML. Now, let’s look into decision trees, one machine learning technique.
What Are Decision Trees?
Simply put, a decision tree is a tree in which each branch node represents a choice between a number of alternatives and each leaf node represents a decision.
It is a type of supervised learning algorithm (with a predefined target variable) that is mostly used in classification problems and works for both categorical and continuous input and output variables. It is one of the most widely used and practical methods for inductive inference. (Inductive inference is the process of reaching a general conclusion from specific examples.)
Decision trees learn and train themselves from given examples and predict for unseen circumstances.
A graphical representation of a sample decision tree could be:
Decision Tree Algorithm: ID3
ID3 stands for Iterative Dichotomizer 3. The ID3 algorithm was invented by Ross Quinlan. It builds a decision tree from a fixed set of examples and the resulting tree is used to classify future samples.
The basic idea is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node.
Sounds simple — but which node should we select to build the correct and most precise decision tree? How would we decide that?
Well, we have some measures that can help us in selecting the best choice!
Entropy
In information theory, entropy is a measure of the uncertainty about a source of messages. It gives us the degree of disorganization in our data.
Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:
Here, p+ and p- are the proportion of positive and negative examples in S.
Consider this entropy function relative to a boolean classification, as the proportion of positive examples p+ varies between 0 and 1.
Notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p- is 0, and Entropy(S) = -1. log2(1) – 0. log2 0 = -1. 0 – 0. log2 0 = 0.
The entropy is 1 when the collection contains an equal number of positive and negative examples.
If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.
Information Gain
It measures the expected reduction in entropy. It decides which attribute goes into a decision node. To minimize the decision tree depth, the attribute with the most entropy reduction is the best choice!
More precisely, the information gain Gain(S, A) of an attribute A relative to a collection of examples S is defined as:
S = Each value v of all possible values of attribute A
Sv = Subset of S for which attribute A has value v
|Sv| = Number of elements in Sv
|S| = Number of elements in S
Let’s see how these measures work!
Suppose we want ID3 to decide whether the weather is good for playing baseball. Over the course of two weeks, data is collected to help ID3 build a decision tree. The target classification is “should we play baseball?” which can be yes or no.
See the following table.
Day |
Outlook |
Temperature |
Humidity |
Wind |
Should I play baseball? |
D1 |
Sunny |
Hot |
High |
Weak |
No |
D2 |
Sunny |
Hot |
High |
Strong |
No |
D3 |
Overcast |
Hot |
High |
Weak |
Yes |
D4 |
Rain |
Mild |
High |
Weak |
Yes |
D5 |
Rain |
Cool |
Normal |
Weak |
Yes |
D6 |
Rain |
Cool |
Normal |
Strong |
No |
D7 |
Overcast |
Cool |
Normal |
Strong |
Yes |
D8 |
Sunny |
Mild |
High |
Weak |
No |
D9 |
Sunny |
Cool |
Normal |
Weak |
Yes |
D10 |
Rain |
Mild |
Normal |
Weak |
Yes |
D11 |
Sunny |
Mild |
Normal |
Strong |
Yes |
D12 |
Overcast |
Mild |
High |
Strong |
Yes |
D13 |
Overcast |
Hot |
Normal |
Weak |
Yes |
D14 |
Rain |
Mild |
High |
Strong |
No |
The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:
outlook = { sunny, overcast, rain }
temperature = {hot, mild, cool }
humidity = { high, normal }
wind = { weak, strong }
We need to find which attribute will be the root node in our decision tree.
Entropy(S) = – (9/14) Log2 (9/14) – (5/14) Log2 (5/14) = 0.940
Gain(S,Wind) = Entropy(S) – (8/14)*Entropy(S_{weak}) – (6/14)*Entropy(S_{strong})
= 0.940 – (8/14)*0.811 – (6/14)*1.00
= 0.048
Entropy(S_{weak}) = - (6/8)*log2(6/8) – (2/8)*log2(2/8) = 0.811
Entropy(S_{strong}) = - (3/6)*log2(3/6) – (3/6)*log2(3/6) = 1.00
For each attribute, the gain is calculated and the highest gain is used in the decision.
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Clearly, the outlook attribute has the highest gain. Therefore, it is used as the decision attribute in the root node.
Since outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is, What attribute should be tested at the sunny branch node? Since we've used outlook at the root, we only decide on the remaining three attributes: humidity, temperature, or wind.
S_{sunny} = {D1, D2, D8, D9, D11} = 5 examples from the table with outlook = sunny
Gain(S_{sunny}, Humidity) = 0.970
Gain(S_{sunny}, Temperature) = 0.570
Gain(S_{sunny}, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all data is classified perfectly or we run out of attributes.
The decision tree can also be expressed in rule format:
IF outlook = sunny AND humidity = high THEN play baseball = no
IF outlook = rain AND humidity = high THEN play baseball = no
IF outlook = rain AND wind = strong THEN play baseball = yes
IF outlook = overcast THEN play baseball = yes
IF outlook = rain AND wind = weak THEN play baseball = yes
So, that was a very quick intro to decision trees. It’s as simple as depicted above. I hope this article helps you!
Published at DZone with permission of Ramandeep Kaur, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments