
Machine Learning With Random Forests


Machines depend on our commands for each step they take. They need guidance on where to go and what to do, like a child who doesn't yet understand its surroundings well enough to make its own decisions. Likewise, a developer writes commands for a machine to execute. But when we talk about machine learning, we talk about teaching the machine to make its own decisions without any external help: the machine develops a mature mind of its own, can understand facts and situations, and can choose the best course of action.

To dive deeper into the basics of machine learning, I'd suggest you go through this introductory article.

In our previous blogs, we learned about the decision tree algorithm and its implementation. In this blog, we will move on to the next algorithm for machine learning: the random forest algorithm. Please go through the previously mentioned blogs before moving forward, as the random forest algorithm is based on the decision tree algorithm.

What Is the Random Forest Algorithm?

We could just say it's "another algorithm for machine learning," but as we know, explaining things is necessary at each step in the process of knowledge sharing! So let's dig deeper into this algorithm.

The random forest algorithm, as the name suggests, is a forest. And forests consist of trees. Here, the trees being mentioned are decision trees. So, our full definition is: The random forest algorithm consists of a random collection of decision trees. Hence, this algorithm is basically just an extension of the decision tree algorithm.

Under the Hood

In this algorithm, we create multiple decision trees without pruning them; that limitation isn't needed in a random forest. The catch here is that we don't provide all of the data for each decision tree to consume. Instead, we provide a random subset of our training data to each decision tree. This process is called bagging, or bootstrap aggregating.

Bagging is a general procedure for reducing the variance of high-variance algorithms. In this process, each decision model is trained on a sub-sample of the data set and a random subset of the attributes. Then, we consider every model and make our decision by voting (classification) or by taking the average (regression). For a random forest, we usually take about two-thirds of the data, sampled with replacement (the same instance can appear in more than one tree's sample, so there's no need for the samples to be unique).
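
To make bagging concrete, here is a minimal, illustrative sketch of drawing one bootstrap sample in Scala. It is not the Smile implementation; the helper name and the two-thirds fraction are just for illustration:

import scala.util.Random

// Draw a bootstrap sample: roughly two-thirds of the rows, sampled *with* replacement,
// so the same row may appear more than once and across several trees' samples.
def bootstrapSample[T](data: Vector[T], fraction: Double = 2.0 / 3.0): Vector[T] = {
  val sampleSize = math.max(1, (data.length * fraction).round.toInt)
  Vector.fill(sampleSize)(data(Random.nextInt(data.length)))
}

Each tree in the forest would get its own sample like this, plus a random subset of the attributes.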

In the random forest algorithm, each decision tree predicts a response for an instance and the final response is decided based on voting. In classification, the response received by the majority of decision trees is the final response. In regression, the average of all the responses is the final response.
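
As a rough sketch of that aggregation step (again, illustrative rather than Smile's code), classification takes the most common prediction and regression averages them:

// Majority vote over the class labels predicted by the individual trees (classification).
def majorityVote(treePredictions: Seq[Int]): Int =
  treePredictions.groupBy(identity).maxBy { case (_, votes) => votes.size }._1

// Mean of the numeric responses predicted by the individual trees (regression).
def averageResponse(treePredictions: Seq[Double]): Double =
  treePredictions.sum / treePredictions.size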

Advantages

  • Can be used for both classification and regression problems.
  • Can handle large data sets with a large number of attributes, as they'll be divided among trees.
  • Can model the importance of attributes; hence, it's also used for dimensionality reduction.
  • Maintains accuracy, even when data is missing.
  • Works with unlabeled data (unsupervised learning) for clustering, data views, and outlier detection.
  • Uses bootstrap sampling on the input data. For each tree, roughly one-third of the data is left out of training and used for testing instead; these are called out-of-bag samples, and the error measured on them is the out-of-bag error. This error rate is about the same as you'd get from a separate test data set, so it removes the need for one (see the sketch after this list).
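
The out-of-bag idea fits in a couple of lines: for one tree, the OOB rows are simply the rows that never made it into that tree's bootstrap sample. This is illustrative only, not Smile's internals:

import scala.util.Random

// Indices drawn for one tree's bootstrap sample (with replacement)...
val rowCount = 14                                       // e.g. 14 rows in the weather data set
val inBag = Seq.fill(rowCount)(Random.nextInt(rowCount)).toSet
// ...and the out-of-bag rows that tree never saw, usable as a built-in test set.
val outOfBag = (0 until rowCount).toSet -- inBag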

Disadvantages

  • Regression results aren't as good as classification results with random forests.
  • Works as a black box. You can't control the inside functionality aside from changing the input values.

Implementation

Now it's time to see the implementation of the random forest algorithm in Scala. We'll use the Smile library like we did for the implementation of decision trees.

To use Smile, include the following dependency in your SBT project:

libraryDependencies += "com.github.haifengl" %% "smile-scala" % "1.4.0"

We are going to use the same data for this implementation as we did for the decision tree. We'll get an Array[Array[Double]] as the training instances and an Array[Int] as the response values for those instances.

// Load the ARFF file; column index 4 is the response (class) attribute
val weather: AttributeDataset = read.arff("src/main/resources/weather.nominal.arff", 4)
// Split the dataset into feature vectors and their integer class labels
val (trainingInstances, responseVariables) = data.pimpDataset(weather).unzipInt

Training

After getting the data, we can use the randomForest() method from the smile.operators package, which returns an instance of the RandomForest class.

val nTrees = 200   // number of decision trees in the forest
val maxNodes = 4   // maximum number of leaf nodes per tree
val rf = randomForest(trainingInstances, responseVariables, weather.attributes(), nTrees, maxNodes)

Here is the parameter list for this method:

  • trainingInstances: Array[Array[Double]] (required)
  • responseVariables: Array[Int] (response values for each instance)
  • attributes: Array[Attribute] (an array of all the attributes; by default, this array is null)
  • nodeSize: Int (the number of instances in a node below which the tree will not split; by default, the value is 1, but for very large data sets, it should be more than one)
  • ntrees: Int (limits the number of trees; by default, the value is 500)
  • maxNodes: Int (maximum number of leaf nodes in each decision tree; by default, the value is the number of attributes divided by nodeSize)
  • mtry: Int (the number of randomly selected attributes for each decision tree; by default, its value is the square root of the number of attributes)
  • subsample: Double (if the value is 1.0, then sample with replacement; if less than 1.0, then sample without replacement; by default, its value is 1.0)
  • splitRule: DecisionTree.SplitRule (the method on which the information gain is calculated for decision trees; could be GINI or ENTROPY; by default, it is GINI)
  • classWeight: Array[Int] (the ratio of the number of instances each class contains; if nothing is provided, the algorithm calculates this value itself)
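
For illustration, the optional parameters can also be passed by name. Here's a hedged sketch that reuses the values defined above and changes only the split rule; the parameter names follow the list above, so verify them against your Smile version:

import smile.classification.DecisionTree

// Same training call as before, but splitting on entropy instead of the default GINI.
val rfEntropy = randomForest(trainingInstances, responseVariables, weather.attributes(),
  nTrees, maxNodes, splitRule = DecisionTree.SplitRule.ENTROPY)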

Testing

Now, our random forest is created. We can use its error() method to show the out-of-bag error for our random forest.

println(s"OOB error = ${rf.error}")

The output is:

OOB error = 0.0

We can see that the out-of-bag error of our random forest is 0.0, so we don't need to test it again with another data set.

We can use the predict() method of the RandomForest class to predict the outcome of some instance.
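
For example, here is a quick, illustrative check on one of the training rows; predict() takes a feature vector (Array[Double]) and returns the index of the predicted class:

// Predict the class of a single instance; here we simply reuse the first training row.
val predictedClass = rf.predict(trainingInstances.head)
println(s"Predicted class index: $predictedClass")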

Accuracy

Our random forest is ready, and we've also checked the out-of-bag error. We know that every prediction comes with some error. So how do we check the accuracy of the random forest we just built?

We have the smile.validation package! In this package, we get many methods to test our models. Here, we are using one such method: test(). It is a curried function and takes several parameters.

val testedRF = test(trainingInstances, responseVariables, testInstances, testResponseVariables)((_, _) => rf)

The parameter list is below:

  • trainingInstances: Array[Array[Double]]
  • responseValues: Array[Int]
  • testInstances: Array[Array[Double]]
  • testResponseValues: Array[Int]
  • trainer: An anonymous method that takes trainingInstances and responseValues and returns a classifier.

Here, the testInstances and testResponseValues are fetched from a testing data set, shown below:

// Load the separate test data set; column index 4 is again the response attribute
val weatherTest = read.arff("src/main/resources/weatherRF.nominal.arff", 4)
val (testInstances, testResponseValues) = data.pimpDataset(weatherTest).unzipInt

The output reports the accuracy of our random forest, which is 83.33% right now.
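
If you prefer to see that number computed explicitly, a small sanity check using only predict() and the test data loaded above would look something like this:

// Count how many test instances the forest classifies correctly and derive the accuracy.
val correctPredictions = testInstances.zip(testResponseValues).count {
  case (instance, expected) => rf.predict(instance) == expected
}
val accuracy = correctPredictions.toDouble / testInstances.length
println(f"Accuracy = ${accuracy * 100}%.2f%%")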

The link to the sample code is here!


Topics:
ai, machine learning, tutorial, random forests
