Implementing Decision Trees Using Smile


Decision trees are used for predictions because of their transparency in conveying explicit decision rules. Learn how to create decision trees with the Smile library!


Smile is a machine learning library based on Java. It also provides a Scala API, although most of its classes are implemented in Java. To learn more about Smile, check out this introductory article.

In this article, we will implement a decision tree using the Smile library. To learn more about decision trees, check out this blog.


To use Smile, we need to include the following dependency in our SBT project:

libraryDependencies += "com.github.haifengl" %% "smile-scala" % "1.4.0"
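For context, here is a minimal build.sbt sketch around that dependency line. The project name and Scala version are assumptions for illustration; check which Scala versions your Smile release is published for.

```scala
// Hypothetical minimal build.sbt for this tutorial
name := "smile-decision-tree-demo"  // placeholder project name
scalaVersion := "2.11.8"            // assumed; verify against Smile's published artifacts
libraryDependencies += "com.github.haifengl" %% "smile-scala" % "1.4.0"
```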


The main ingredient for a decision tree is the training data, based on which the nodes and branches of the tree are created. The data we use to train our decision tree must be in attribute-value format. In this article, I'll refer to the collection of attribute values that yields a single response as an instance.

Let's get started. With Smile, we can use Weka's ARFF files, CSV files, text files, JDBC result sets, and many other formats. The read object in Smile's Scala API provides methods like arff(), csv(), and jdbc() to parse our data. The syntax looks like this:

val weather = read.arff("src/main/resources/weather.nominal.arff", 4)

Here, we're reading the weather data in ARFF format. The second parameter (value 4) is the index of the column that holds the response value for our training instances. To explain this, let's take a quick look at the data we're using:

[Screenshot: contents of weather.nominal.arff]
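This is Weka's classic weather.nominal dataset. Its ARFF content, as shipped in the standard Weka distribution (assumed here to match the file used in the article), looks like this:

```
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
```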

In this file, we have two parts of the data. The first part declares the attributes and the second part contains the instances of these attributes. The order of the declaration of the attributes is important since the data instances are also stored in the same order.

Indexing starts from position 0. The response attribute play is in position 4.

For example, with {sunny, hot, high, FALSE, no}, we have the response no at the fourth index.

After parsing the data, we get an object of the AttributeDataset class. All our values are converted to numeric values. The response values are converted to Int (for attribute play, {yes, no} => {0, 1}) and all the training instances to Double (for attribute windy, {TRUE, FALSE} => {0.0, 1.0}). Hence, our instance {sunny, hot, high, FALSE, no} converts to {0.0, 0.0, 0.0, 1.0, 1}.
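The conversion can be pictured as replacing each nominal value with its index in the attribute's declared value list. Here is a hand-rolled illustration (not Smile's actual code; the value lists come from the ARFF declarations):

```scala
// Illustration only: nominal-to-numeric encoding by declaration order.
// Smile does this internally when it parses the ARFF file.
object NominalEncoding {
  val outlook = Seq("sunny", "overcast", "rainy")
  val windy   = Seq("TRUE", "FALSE")
  val play    = Seq("yes", "no")

  // A value's numeric code is its position in the declared list.
  def code(values: Seq[String], v: String): Double = values.indexOf(v).toDouble

  def main(args: Array[String]): Unit = {
    // {sunny, hot, high, FALSE, no} -> 0.0, ..., 1.0, 1
    println(code(outlook, "sunny"))  // 0.0
    println(code(windy, "FALSE"))    // 1.0
    println(code(play, "no").toInt)  // 1
  }
}
```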

Next, we extract the training instances and response values separately. For this, we have the toArray() method of the AttributeDataset class.

val trainingInstances = weather.toArray(Array(new Array[Double](weather.size())))
val responseValues = weather.toArray(new Array[Int](weather.size()))

Here, we are using two overloaded forms of the toArray() method:

  1. The first takes a parameter of type Array[Array[Double]] and returns the training instances. Printing this array makes its structure easier to follow; if you're still unsure, check out the code at the end of the article!

  2. The second takes a simple Array of Int values and returns all the responses in an array.

Actual Training

After getting the training instances and response values, we need to build a decision tree from them. For this, we have the cart() method from the smile.classification package, which returns an object of the DecisionTree class. It needs the following additional parameters:

val maxNodes = 200
val splitRule = DecisionTree.SplitRule.ENTROPY
val attributes = weather.attributes()

The variable maxNodes sets the maximum number of leaf nodes, which limits the size of the tree and guards against overfitting (and excessive computation) when the training data is large.

The rule we use to measure information gain and split the tree is ENTROPY. The last parameter is an array containing objects of the Attribute class.

attributes is optional here, since cart() can derive the attributes from the training data itself. So here is the cart() method:

val decisionTree = cart(trainingInstances, responseValues, maxNodes, attributes, splitRule)

This method returns an instance of the DecisionTree class. Our decision tree is now trained on the data we provided!
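To recall what the ENTROPY split rule measures: at each node, CART picks the split that most reduces Shannon entropy. A self-contained sketch of the entropy formula, applied to the weather data's overall class distribution of 9 yes and 5 no responses:

```scala
object EntropyDemo {
  // Shannon entropy of a class distribution: -sum(p * log2(p))
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * (math.log(p) / math.log(2))
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // 9 "yes" vs. 5 "no" in the 14 weather instances
    println(f"${entropy(Seq(9, 5))}%.3f") // 0.940
  }
}
```

A pure node (all one class) has entropy 0, and a 50/50 split has entropy 1, so lower entropy after a split means a more informative attribute.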


After the tree is completely trained, we should be able to predict the responses for our test data. The class DecisionTree contains the method predict(), which takes the array of test instances as input and returns the responses accordingly.
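Conceptually, predict() just walks the tree from the root, branching on one attribute per node, until it reaches a leaf. A toy illustration of that walk (a hypothetical hand-built tree, not Smile's actual data structures):

```scala
// Toy model of a trained tree: internal nodes split on one attribute,
// leaves hold a response code.
sealed trait Node
case class Leaf(response: Int) extends Node
case class Split(attrIndex: Int, threshold: Double, left: Node, right: Node) extends Node

object ToyTree {
  // Walk from the root to a leaf, choosing a branch at each split.
  def predict(node: Node, x: Array[Double]): Int = node match {
    case Leaf(r)           => r
    case Split(i, t, l, r) => if (x(i) <= t) predict(l, x) else predict(r, x)
  }

  // Hypothetical tree: split on humidity (index 2), where high=0.0 -> no (1)
  // and normal=1.0 -> yes (0), mirroring the encoding described above.
  val tree: Node = Split(2, 0.5, Leaf(1), Leaf(0))

  def main(args: Array[String]): Unit = {
    println(predict(tree, Array(0.0, 0.0, 0.0, 1.0))) // 1 (no)
    println(predict(tree, Array(0.0, 0.0, 1.0, 1.0))) // 0 (yes)
  }
}
```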

We need the test data to verify whether our decision tree works:

[Screenshot: contents of weatherTest.nominal.arff]

This data is in our weatherTest.nominal.arff file. I've deliberately included 12 instances whose labels are wrong, so our decision tree must catch these and report the number of errors as 12.

First, let’s load the test data like we did for training data.

val weatherTest = read.arff("src/main/resources/weatherTest.nominal.arff", 4)
val testInstances = weatherTest.toArray(Array(new Array[Double](weatherTest.size())))
val testResponseValues = weatherTest.toArray(new Array[Int](weatherTest.size()))

We actually don't need the variable testResponseValues, as we can predict the response for each instance using the decision tree we trained. We are just going to use these values to match the predictions of our decision tree.

We predict the outcome for each instance and match it against the corresponding response:

val error = testInstances.zip(testResponseValues).count {
  case (testInstance, response) => decisionTree.predict(testInstance) != response
}
println("Number of errors in test data is " + error)
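The same zip-and-count pattern in isolation, with dummy stand-ins for the tree's predictions, to show exactly what is being counted:

```scala
object ErrorCountDemo {
  def main(args: Array[String]): Unit = {
    // Dummy stand-ins for decisionTree.predict results and the true responses.
    val predictions = Array(0, 1, 1, 0)
    val actual      = Array(0, 0, 1, 1)
    // Pair each prediction with its true response and count mismatches.
    val errors = predictions.zip(actual).count { case (p, a) => p != a }
    println(s"Number of errors in test data is $errors") // prints 2
  }
}
```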


[Screenshot: console output of the test run]

We can see here that our decision tree is working fine.

If we just want to predict the responses, we don't need the array of response values at all:

val decisions = testInstances.map { testInstance =>
  decisionTree.predict(testInstance) match {
    case 0 => "play"
    case 1 => "not playable weather"
  }
}

From this, we get the list of decisions based on our instances.

Where Is the Tree?

So the decision tree is trained now. We've also tested that it works with correct and incorrect inputs. But can we actually see the branches where the splitting is happening? And how does the tree look when we look at it in data mining tools?

The DecisionTree class has another method, dot(), which, according to the documentation, "returns the graphic representation in Graphviz dot format." We take the string output of the dot() method and paste it into Viz.js, which is a "makefile for building Graphviz with Emscripten and a simple wrapper for using it in any browser." So let's do that.

digraph DecisionTree {
 node [shape=box, style="filled, rounded", color="black", fontname=helvetica];
 edge [fontname=helvetica];
 0 [label=nscore = 0.3476>, fillcolor="#00000000"];
 1 [label=, fillcolor="#00000000", shape=ellipse];
 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"];
 2 [label=nscore = 0.4200>, fillcolor="#00000000"];
 0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"];
 3 [label=, fillcolor="#00000000", shape=ellipse];
 2 -> 3;
 4 [label=nscore = 0.9183>, fillcolor="#00000000"];
 2 -> 4;
 5 [label=, fillcolor="#00000000", shape=ellipse];
 4 -> 5;
 6 [label=, fillcolor="#00000000", shape=ellipse];
 4 -> 6;
}
This is the output of the dot() method. Paste it into Viz.js as described above.

[Screenshot: the decision tree rendered by Viz.js]

That's our decision tree! So that's all for this simple implementation of a decision tree using Smile. Here is the link to the repository with sample code.


Published at DZone with permission of Anuj Saxena, DZone MVB. See the original article here.
