Is Your Decision Tree Accurate?
In my previous blog post, we successfully built a decision tree for the given data. But the story doesn't end there. We cannot simply grow the tree until it fits every training example, because that leads to overfitting.
What Is Overfitting and Why Does It Occur?
If a decision tree is fully grown, it may lose some generalization capability. This is a phenomenon known as overfitting. According to this documentation:
"Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has a smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances."
It simply means a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e. including instances beyond the training examples).
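The definition can be written as a simple comparison of four error rates. The sketch below is only illustrative: the function name `overfits` and the example numbers are made up, and in practice the error over the entire distribution is unknown and has to be estimated on held-out data.

```python
def overfits(train_err_h, train_err_alt, true_err_h, true_err_alt):
    """Return True if hypothesis h overfits relative to an alternative h'.

    h overfits when it has the smaller error on the training examples
    but h' has the smaller error over the whole distribution.
    """
    return train_err_h < train_err_alt and true_err_alt < true_err_h


# Hypothetical numbers: h is perfect on the training set but worse overall.
print(overfits(train_err_h=0.00, train_err_alt=0.20,
               true_err_h=0.20, true_err_alt=0.00))  # True
```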
The figure below illustrates the impact of overfitting in a typical application of decision tree learning. Suppose we have built our decision tree from the given training examples. It fits all of them and gives 100% accuracy on that data. But when we evaluate the same tree on unseen sample data, the accuracy is much lower.
What could be the reason for this difference in accuracy?
The answer is overfitting of the training examples. To reach 100% accuracy on the training data, the tree also learned the quirks of those particular examples, so its accuracy on unseen data dropped. In other words, we ended up with a decision tree that generalizes poorly.
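The gap between training and unseen-data accuracy can be reproduced with a toy example. The sketch below is not the tree from the previous post: it is a minimal single-feature tree written for illustration, and the data set (ten points with true rule `x >= 5` and two deliberately mislabeled training points) is invented. All function names are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build(xs, ys, max_depth=None, depth=0):
    """Grow a tree on one numeric feature. A leaf is a bare class label;
    an internal node is a tuple (threshold, left_subtree, right_subtree)."""
    if len(set(ys)) == 1 or len(set(xs)) == 1 or depth == max_depth:
        return Counter(ys).most_common(1)[0][0]  # majority class
    best_score, best_t = None, None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    left = [(x, y) for x, y in zip(xs, ys) if x < best_t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= best_t]
    return (best_t,
            build([x for x, _ in left], [y for _, y in left], max_depth, depth + 1),
            build([x for x, _ in right], [y for _, y in right], max_depth, depth + 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def accuracy(tree, xs, ys):
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(xs)


# Invented data: the true rule is "class 1 when x >= 5".
# The training labels at x=2 and x=7 are deliberately wrong (noise).
train_x = list(range(10))
train_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
test_x = [x + 0.25 for x in range(10)]
test_y = [int(x >= 5) for x in test_x]

full = build(train_x, train_y)                # grown until every leaf is pure
stump = build(train_x, train_y, max_depth=1)  # a single split

print("full tree :", accuracy(full, train_x, train_y), accuracy(full, test_x, test_y))
print("one split :", accuracy(stump, train_x, train_y), accuracy(stump, test_x, test_y))
```

On this data the fully grown tree scores 1.0 on the training set and 0.8 on the unseen points, while the single-split tree scores 0.8 on the training set and 1.0 on the unseen points, which is exactly the situation the definition of overfitting describes.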
Causes of Overfitting
There are two major situations that can cause overfitting in decision trees:
- Overfitting due to the presence of noise. Mislabeled instances may contradict the class labels of other similar records.
- Overfitting due to lack of representative instances. When only a handful of training records reach a part of the tree, the algorithm makes its decisions there from too little evidence, and those decisions may not hold for the wider population.
A good model must not only fit the training data well but also accurately classify records it has never seen.
How to Avoid Overfitting
There are two major approaches to avoid overfitting in decision trees:
- Approaches that stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
- Approaches that allow the tree to overfit the data and then prune the tree.
Before looking into these approaches, let's understand what pruning is.
What Is Pruning?
Pruning is a technique that reduces the size of a decision tree by removing sections that contribute little to classifying instances. Pruning reduces the complexity of the final classifier and can improve predictive accuracy by reducing overfitting.
According to this documentation:
Consider each of the decision nodes in the tree to be candidates for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set. The pruning of nodes continues until further pruning is harmful (i.e. decreases the accuracy of the tree over the validation set).
Now, let's get back to the approaches for dealing with overfitting.
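The first one is called pre-pruning. The basic stopping conditions for a node are:
- Stop if all instances belong to the same class.
- Stop if all the feature values are the same.
Pre-pruning adds more restrictive conditions on top of these, such as stopping when a node holds fewer instances than a chosen threshold or when a split does not improve the impurity measure.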
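The stopping checks can be collected into one function that the tree-growing code calls before it tries to split a node. This is a minimal sketch: the function name `should_stop` and the `min_samples` threshold are assumptions added for illustration, with `min_samples` standing in for the kind of stricter rule that makes the tree stop early.

```python
def should_stop(rows, labels, min_samples=2):
    """Return the reason to stop splitting this node, or None to keep growing.

    rows        -- list of feature tuples for the instances at this node
    labels      -- class label of each instance
    min_samples -- hypothetical pre-pruning threshold (not from the article)
    """
    if len(set(labels)) == 1:
        return "all instances belong to the same class"
    if len(set(rows)) == 1:
        return "all feature values are the same"
    if len(rows) < min_samples:
        return "too few instances to split"
    return None


print(should_stop([(1, "a"), (2, "b")], ["yes", "yes"]))
print(should_stop([(1, "a"), (1, "a")], ["yes", "no"]))
print(should_stop([(1, "a"), (2, "b"), (3, "c")], ["yes", "no", "no"], min_samples=5))
print(should_stop([(1, "a"), (2, "b"), (3, "c")], ["yes", "no", "no"]))
```

Raising `min_samples` makes the tree stop sooner, which is the trade-off pre-pruning has to get right: stop too early and the tree underfits, too late and it overfits.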
The second is post-pruning, in which we grow the decision tree to its entirety and then trim the nodes of the decision tree from the bottom up:
- If generalization error improves after trimming, replace a sub-tree with a leaf node.
- The class label of a leaf node is determined from the majority class of instances in the subtree.
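The bottom-up trimming can be sketched on a small hand-built tree. Everything here is an assumption made for illustration: the dictionary layout of a node, the stored `majority` label (the most common training class under that node), the overfit tree itself, and the validation points. Generalization error is estimated on the validation set, and a subtree is replaced when the leaf does no worse than the subtree; requiring a strict improvement instead would only change `<=` to `<`.

```python
def predict(node, x):
    """Internal nodes are dicts; leaves are bare class labels."""
    while isinstance(node, dict):
        node = node["left"] if x < node["t"] else node["right"]
    return node


def errors(node, data):
    return sum(predict(node, x) != y for x, y in data)


def prune(node, val):
    """Prune bottom-up. `val` holds the validation (x, y) pairs that reach
    this node. Returns the (possibly simplified) subtree."""
    if not isinstance(node, dict):
        return node
    left_val = [(x, y) for x, y in val if x < node["t"]]
    right_val = [(x, y) for x, y in val if x >= node["t"]]
    # Prune the children first, then decide about this node.
    node = dict(node,
                left=prune(node["left"], left_val),
                right=prune(node["right"], right_val))
    leaf = node["majority"]
    if errors(leaf, val) <= errors(node, val):
        return leaf  # the leaf is no worse on validation data
    return node


# Hand-built overfit tree: the true rule is "class 1 when x >= 4.5", but the
# tree has extra splits that isolate two noisy training points (x=2 and x=7).
overfit_tree = {
    "t": 4.5, "majority": 0,
    "left": {"t": 1.5, "majority": 0, "left": 0,
             "right": {"t": 2.5, "majority": 0, "left": 1, "right": 0}},
    "right": {"t": 6.5, "majority": 1, "left": 1,
              "right": {"t": 7.5, "majority": 1, "left": 0, "right": 1}},
}

# Invented validation set with clean labels.
validation = [(x + 0.25, int(x >= 5)) for x in range(10)]

pruned = prune(overfit_tree, validation)
print(errors(overfit_tree, validation), "errors before pruning")
print(errors(pruned, validation), "errors after pruning")
print(pruned)
```

Here the four lower splits are removed because a single leaf does at least as well on the validation points they cover, while the root split is kept because collapsing it would misclassify half of the validation set.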
Both of these approaches can be used to prune your decision tree. Although the first approach might seem more direct, the second approach of post-pruning overfit trees has been found to be more successful in practice. This is because, with the first approach, it is difficult to estimate precisely when to stop growing the tree.
Published at DZone with permission of Ramandeep Kaur, DZone MVB.