Exporting Decision Trees in Textual Format With sklearn
In this post, we explore how to build decision trees using Python and an open data set.
In the past, we have covered Decision Trees, showing how interpretable these models can be (see the tutorials here). In those tutorials, we exported the rules of the models using the function export_graphviz from sklearn and visualized its output with an external tool that is not always easy to install. Luckily, since version 0.21.2, scikit-learn offers the possibility to export Decision Trees in a textual format (I implemented this feature personally!), and in this post we will see an example of how to use this new feature.
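For reference, the older workflow looked roughly like this (a minimal sketch; the file name tree.dot and the rendering command are assumptions, not part of the original tutorials):
# The previous approach: export the tree in Graphviz DOT format and
# render it with the external Graphviz tool, e.g.:
#   dot -Tpng tree.dot -o tree.png
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
export_graphviz(clf, out_file='tree.dot',
                feature_names=iris.feature_names,
                class_names=iris.target_names)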
Let's train a tree of depth two on the famous iris dataset using all the data, and print the resulting rules using the brand new function export_text:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the iris dataset and build string labels (50 samples per class,
# in the same order as the targets returned by load_iris).
iris = load_iris()
X = iris['data']
y = ['setosa']*50 + ['versicolor']*50 + ['virginica']*50

# Train a depth-2 decision tree on all the data.
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)

# Export the learned rules as plain text.
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
|--- petal width (cm) <= 0.80
| |--- class: setosa
|--- petal width (cm) > 0.80
| |--- petal width (cm) <= 1.75
| | |--- class: versicolor
| |--- petal width (cm) > 1.75
| | |--- class: virginica
Reading the output, we note that if the feature petal width is less than or equal to 0.80cm, the samples are always classified as setosa. Otherwise, if the petal width is less than or equal to 1.75cm, they're classified as versicolor, or as virginica if the petal width is more than 1.75cm. This model might well suffer from overfitting, but it tells us some important details about the data. It's easy to see that the petal width is the only feature used: we could even say that the petal width is small for setosa samples, medium for versicolor, and large for virginica.
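To see these rules in action, we can feed the tree a couple of hypothetical samples (the feature values below are invented for illustration; the order is sepal length, sepal width, petal length, petal width):
# Two made-up samples: one with a small petal width, one with a large one.
samples = [[5.0, 3.5, 1.5, 0.3],   # petal width <= 0.80 -> setosa
           [6.5, 3.0, 5.5, 2.0]]   # petal width > 1.75  -> virginica
print(decision_tree.predict(samples))  # ['setosa' 'virginica']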
To understand how the rules separate the labels, we can also print the number of samples from each class (the class weights) on the leaves:
r = export_text(decision_tree, feature_names=iris['feature_names'],
                decimals=0, show_weights=True)
print(r)
|--- petal width (cm) <= 1
| |--- weights: [50, 0, 0] class: setosa
|--- petal width (cm) > 1
| |--- petal width (cm) <= 2
| | |--- weights: [0, 49, 5] class: versicolor
| |--- petal width (cm) > 2
| | |--- weights: [0, 1, 45] class: virginica
Here we have the number of samples per class in square brackets. Recalling that we have 50 samples per class, we see that all the samples labeled as setosa are correctly modeled by the tree, while for 5 virginica and 1 versicolor samples the model fails to capture the information given by the label.
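We can double-check these numbers by comparing the tree's predictions on the training data with the labels (a quick sketch reusing decision_tree, X, and y from above):
import numpy as np

# Count the training samples the tree misclassifies.
predictions = decision_tree.predict(X)
print(np.sum(predictions != np.array(y)))  # 6: 5 virginica + 1 versicolor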
Check out the documentation of the function export_text to discover all its capabilities.
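For example, among its other parameters, export_text lets us limit the depth of the printed rules and control the indentation (a small sketch; the parameter values here are arbitrary):
# Print only the first level of the tree with wider indentation;
# deeper branches are reported as truncated.
r = export_text(decision_tree, feature_names=iris['feature_names'],
                max_depth=1, spacing=5)
print(r)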
Published at DZone with permission of Giuseppe Vettigli, DZone MVB.