Over a million developers have joined DZone.

Random Forest as a Regressor: A Spark-based Solution

DZone 's Guide to

Random Forest as a Regressor: A Spark-based Solution

In this article, we will see how to use the Random Forest (RF) algorithm as a regressor with Spark 2.0 on the YearPredictionMSD (Year Prediction Million Song Database) dataset.

· Big Data Zone ·
Free Resource

Typically, the Random Forest (RF) algorithm is used for solving classification problems and making predictive analytics (i.e., in supervised machine learning technique). However, in this article, we will see how to use the same algorithm as a regressor with Spark 2.0 on the YearPredictionMSD (Year Prediction Million Song Database) dataset. The first part of this article covered how to use the RF algorithm as a classifier for predicting appropriate classes of the upcoming features (i.e., test set) by learning from the available features (i.e., training  set). 

The objective here is to predict the release year of the songs from the available audio features, which is typically a classification problem. However, we will see how to use the RF algorithm to predict the song's year by converting the classification problem into an equivalent regression problem. 

Random Forest as a Regressor

The regression analysis is a statistical/machine learning process for estimating the relationships by utilizing widely used techniques such as modeling and analyzing several variables. Unlike the classification problem, here the focus is on the relationship between a dependent variable and one or more independent variables (usually more than one). The independent variables are also called the 'predictors'. 

On the other hand, the Random forest [1, 2] (also sometimes called random decision forest [3]) (RDF) is an ensemble learning technique used for solving supervised learning tasks such as classification and regression. An advantageous feature of RF is that it can overcome the overfitting [4] problem across its training dataset.

A forest in the RF or RDF usually consists of hundreds of thousands of trees. These trees are actually trained on different parts of the same training set [1-3]. More technically, an individual tree that is grown very deep tends to learn from the highly unpredictable patterns. This kind of nature of the trees creates overfitting problems on the training sets. Moreover, low biases make the classifier a low performer even if your dataset quality is good in terms of feature presented [1].

On the other hand, an RF algorithm helps to average multiple decision trees together with the goal of reducing the variance to ensure consistency by computing proximities between pairs of cases. However, this increases a small bias or some loss of the interpretability of the results. But eventually, the performance of the final model increased dramatically.

Main Advantages of Using the Random Forest Algorithm

As discussed in the literature [1, 2, 5] following are the main advantages of using an RF algorithm:

  • RF provides excellent accuracy among current classification and regression algorithms
  • It can be applied efficiently to large-scale datasets for both classifications as well as regressions analysis making it scalable based on increasing data size. Consequently, it's a perfect algorithm for making predictive analytics on Big Data.
  • It can handle thousands of input features and variables at a time. That means high dimensional dataset can be handled and used to make predictive analytics using the RF model efficiently.
  • It provides dimensionality reduction facility without deleting unwanted variables from the training dataset. As a result, it gives an estimate of what variables are important and should be considered during model building for the classification
  • It has an effective technique embedded for estimating missing or null values. Therefore, it can maintain the accuracy level (i.e., stable performance) even when a large proportion of the dataset are missing [1]. This also implies that it is resilient against the null values
  • Generated forests and trained model can be saved for future use that can be applied to other datasets later stages. This way it provides model adaptability for new data types
  • It can compute proximities between pairs of cases which provide interesting views on the data. These views help in clustering and locating the outliers after scaling the data eventually
  • The above capabilities also can be extended to unlabelled data. That means that RF can also be applied to unsupervised learning problems like clustering, high-level data views generation, and outlier detection
  • Even a supervised RF algorithm can be applied to solve unsupervised learning problem. For example anomaly detection from large scale categorical dataset presented at [17]. These advantages make the RF as a suitable algorithm for enabling incremental learning
  • Above all, as usually RF algorithm does not overfit, so you can run and build the model by specifying as many trees as you want.

For more on pros and cons of the RF algorithm, readers can refer the book titled "Large Scale Machine Learning with Spark" published recently with the latest machine learning algorithms with Spark 2.0.0 release [10].

How Does It Work?

Particularly, general techniques like bootstrap aggregating and bagging are applied to the trees that are generated while building the forests in RF [1, 2, 5]. This is the typical first step in the RF algorithm also called begging to RF. As already mentioned that typically hundreds to thousands of trees are constructed dynamically (of course based on the parameter setting). Moreover, the number of trees depends on:

  • The size and nature of the training set
  • Model building parameters like a number of classes, a number of beans and number of maximum depth etc.

Nevertheless, finding the optimal number of trees is a matter of hyperparameter tuning a part of tuning machine learning model. Therefore, optimal numbers of trees say N can be found by performing hyperparameter tuning using cross-validation and train split of the dataset. Alternatively, it can be found by observing the out-of-bag error [6, 7] which is often expressed using in terms of training and test error [1, 2].

After that proximities [8] are computed for each pair of cases when all the trees are built using the complete dataset (unless you randomly split the data into training and test set). Furthermore, "when two cases occupy the same terminal node, proximity is increased by one for each case" [1]. However, at the end of the each iteration, proximity values are normalized by dividing them by the number of trees (that were built and specified during the training period) [1].

You might be wondering why the proximity calculation is so important! The reality is that the proximity calculation has many contributions towards making the prediction better. For example, for replacing missing data, null values and locating the outliers from a set of the cluster. Finally, it also helps produce low-dimensional views of the data [1, 5, 6].

On the one hand, for a classification problem with say n features, the value of n is usually rounded down as n. That means, only √nfeatures are used in each split on the training set. On the other hand, for a regression analysis problem using RF, the inventors (i.e., Leo Breiman and Adele Cutler et al.) recommend using the rounded value as n/3 with a minimum tree node size of 5 as the default [9].

Spark-based Implementation of the RF Algorithm for Regression

Since Spark 1.2.0 release, two learning algorithms (RF classifier and RF regressor) have been implemented to solve the classification and regression problem with Spark MLLib. Both algorithms support using the dataset with continuous as well as categorical features. The settings for the parameter featureSubsetStrategy are based on  references [15, 16]. The other configuration parameters for the random forest algorithm are the following:

  • Type of random forest (classification or regression),
  • Feature type (continuous, categorical),
  • The depth of the tree and quantile calculation strategy etc.

The RF algorithm for the ensemble models can be used either in Classification or Regression tree's tree ensembles. While using the RF as a regressor, here goes the parameter setting during building the RF model:

  • If the number of trees is 1, then no bootstrapping is used at all. However, if the number of trees is > 1, then the bootstrapping is accomplished. Where, the parameter featureSubsetStrategy signifies the number of features to be considered for splits at each node. The supported values of featureSubsetStrategy are "auto", "all", "sqrt", "log2" and "on third". The supported numerical values, on the other hand, are (0.0-1.0] and [1-n]. However, if featureSubsetStrategy is chosen as "auto", the algorithm chooses the best feature subset strategy automatically
  • If the numTrees == 1, the featureSubsetStrategy is set to be "all". However, if the numTrees > 1 (i.e., forest), featureSubsetStrategy is set to be "onethird" for regression
  • Moreover, if a real value "n" is in the range (0, 1.0] is set, n*number_of_features is used consequently. However, if an integer value "n" is in the range (1, the number of features) is set, only n features are used alternatively
  • The parameter categoricalFeaturesInfo which is a map is used for storing arbitrary of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1,...,k-1}
  • The impurity criterion used for information gain calculation. The supported values are “gini" and “variance”. The former is the only supported value for classification. The latter is used for regression
  • The maxDepth is the maximum depth of the tree. (e.g., depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). However, the suggested value is 4 to get a better result
  • The maxBins signifies the maximum number of bins used for splitting the features; where the suggested value is 100 to get better results
  • Finally, the random seed is used for bootstrapping and choosing feature subsets to avoid the random nature of the results.

As already mentioned, since RF is fast and scalable enough for the large-scale dataset, Spark is a suitable technology to implement the RF to take the massive scalability. However, if the proximities are calculated, storage requirements also grow exponentially.

The next section shows a step-by-step example of using the RF as a regressor with Spark. For the demonstration purpose, Spark 2.0.0 will be used. The application that I will be showing is written in Java as Maven project on Eclipse.

Dataset Exploration

This data is a subset of the Million Song Dataset:  http://labrosa.ee.columbia.edu/millionsong/ , a collaboration between LabROSA (Columbia University) and The Echo Nest.  Prepared by T. Bertin-Mahieux <tb2332@columbia.edu>. The objective is to predict of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the year 2000s.

Data Set Characteristics:


Number of Instances:




Attribute Characteristics:


Number of Attributes:


Date Donated


Associated Tasks:


Missing Values?


Number of Web Hits:


Attribute information

There are 90 attributes as follows:

Timbre average = 12,
Timbre covariance = 78

The first value is the year (target), ranging from 1922 to 2011. Features extracted from the 'timbre' features from The Echo Nest API. We take the average and covariance overall 'segments', each segment being described by a 12-dimensional timbre vector. The top-20 rows have been shown in Figure 1.

Therefore, it is clear from the DataFrame that here the independent variables are the 'features' (i.e., predictors) and the column 'label' is the dependent variable - a regression problem. Now, we can take the opportunity of the regression analysis and related algorithms for their predictive power.

Among them, we are considering the RF algorithm because of its high accuracy (for classification) and less model building time. Yet, we still need to validate how does the RF algorithm perform in solving a regression problem. In the Conclusion section, I will suggest some alternative ways if the prediction accuracy of the RF-based regressor is not better.

Sample of the MSD dataset

                                                     Figure 1: Sample of the MSD dataset

Dataset preparation

According to the documentation provided at [9] - i.e., https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD. The following split ration should be followed to get the better result:

Training set: first 463,715 examples => 89.98147%
Test set: last 51,630 examples => 10.01853%

The reason is that it avoids the 'producer effect' by making sure no song from a given artist ends up in both the train and test set.

Song's Year Prediction Using the RF Algorithm

The example that I gonna show has several steps: dataset parsing, training and test set preparation, model building and finally model evaluation. We will see each step with corresponding explanation and supporting code written in java. 

Step-1: Loading required packages and APIs

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.HashMap;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.example.SparkSession.UtilityForSparkSession;
import org.apache.spark.SparkConf;

Step-2: Creating a Spark session 

static SparkSession spark = UtilityForSparkSession.mySession();

Here is the UtilityForSparkSession class:

public class UtilityForSparkSession {
    public static SparkSession mySession() {
        SparkSession spark =  SparkSession.builder()                                                                         .appName("HeartDiseasesPrediction")
                                                .config("spark.sql.warehouse.dir", "E:/Exp/")
        return spark;

Note here the Spark SQL warehouse is set as "E:/Exp/" in Windows 7 environment. Please set the path accordingly based on your OS. 

Step-3: Loading, parsing and creating data frame for exploratory view

String datapath = "input/YearPredictionMSD/YearPredictionMSD";
//String datapath = args[0]; // Make this line enable if you want to take input from commanline
Dataset<Row> df = spark.read().format("libsvm").option("header", "true").load(datapath); // Here the format is the "libsvm"

The above line will produce a data frame like Figure 1.

Step-4: Preparing the RDD of LabelPoint for regression as follows:

JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(spark.sparkContext(), datapath).toJavaRDD(); // We use the MLUtils API to load the same data in LibSVM format

Step-5: Preparing the training and test set:

Split the data into training and test sets (89.98147% held out for training and the rest as testing)

JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.8998147, 0.1001853}); JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

Step-6: Training the RF model

First, let's set the required parameters before training the RF model for regression.

Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();

The above empty categoricalFeaturesInfo indicates that all features are continuous. That also means there are/is no categorical variables in the dataset we saw above. 

Integer numTrees = 20;
String featureSubsetStrategy = "auto"; // You can set it as "onethird".  However, it's sometimes wiser to let the algorithm choose the best for the dataset we have. 
String impurity = "variance"; // Here the impurity is set as "vairance" for the regression related problem.
Integer maxDepth = 20;
Integer maxBins = 20;
Integer seed = 12345;

Note the above parameters are here set naively that means without applying the hyperparameters tuning. Readers are suggested to tune ML model before setting these values out. Now let's train the RandomForest model by utilizing the above parameters as follows: 

final RandomForestModel model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

Step-7: Evaluating the model

First, let's calculate the Test Means Square error to weekly evaluate the model. Before that, we need to prepare the prediction and the label for evaluating the model on test instances and computing the test error as follows:

JavaPairRDD<Double, Double> predictionAndLabel =
      testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        public Tuple2<Double, Double> call(LabeledPoint p) {
          return new Tuple2<>(model.predict(p.features()), p.label());

The above code is self-explanatory as you saw in Figure 1 that the dataset contains only the label and features (i.e., no categorical variables/features). The above code calculates the prediction for each pair of features and label out of the LabeledPoint we created in step 5. Now, let's compute the test mean squared error as follows.    

Double testMSE =
      predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
        public Double call(Tuple2<Double, Double> pl) {
          Double diff = pl._1() - pl._2();
          return diff * diff;
      }).reduce(new Function2<Double, Double, Double>() {
        public Double call(Double a, Double b) {
          return a + b;
      }) / testData.count();

    System.out.println("Test Mean Squared Error: " + testMSE);

Up to this point, the model has been evaluated using a weaker evaluation parameter (i.e., MSE). Now let's evaluate the model by calculating more robust evaluation metrics.

The below source code not only evaluates the model on test instances but also computes related performance measure statistics. However, here let's prepare the prediction and label and compute the prediction on the labels only. 

JavaRDD<Tuple2<Object, Object>> predictionAndLabels = testData.map(
            new Function<LabeledPoint, Tuple2<Object, Object>>() {
              public Tuple2<Object, Object> call(LabeledPoint p) {
                Double prediction = model.predict(p.features());
                return new Tuple2<Object, Object>(prediction, p.label());

Now let's get the evaluation metrics using the Multi metrics evaluator of Spark as follows:

    MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
    //System.out.println(metrics.confusionMatrix()); // You could print the confusion metrics to get more insights.
   // System.out.println(metrics.confusionMatrix());

Now let's compute the related performance metrics like precision, recall, f_measure for the label that means column 0 as follows:

double precision = metrics.precision(metrics.labels()[0]);
double recall = metrics.recall(metrics.labels()[0]);
double f_measure = metrics.fMeasure();

Note the F1-score is the harmonic mean of precision and recall that used to evaluate the model performance more sophisticated way. 

Now let's play around the prediction column. Let's check how did our RF model perform to predict a label, say the year 2001: 

double query_label = 2001;

Now let's compute some metrics like true positive rate, false positive rate, weighted true positive rate and the weighted false positive rate as follows:

double TP = metrics.truePositiveRate(query_label);
double FP = metrics.falsePositiveRate(query_label);
double WTP = metrics.weightedTruePositiveRate();
double WFP =  metrics.weightedFalsePositiveRate();

Finally, let's print the above-mentioned metrics as follows: 

System.out.println("Precision = " + precision);
System.out.println("Recall = " + recall);
System.out.println("F-measure = " + f_measure);
System.out.println("True Positive Rate = " + TP);
System.out.println("False Positive Rate = " + FP);
System.out.println("Weighted True Positive Rate = " + WTP);
System.out.println("Weighted False Positive Rate = " + WFP);

I got the following output for the above parameter settings:

Test Mean Squared Error: 0.48703966498866
Precision = 0.867559
Recall = 0.63254
F-measure = 0.73800738
True Positive Rate = 0.867559
False Positive Rate = 0.136587
Weighted True Positive Rate = 0.7936875921
Weighted False Positive Rate = 8.147273506679529E-7


This way the RF can be used as a regressor. Please note due to the random nature of the data, you might get different prediction value. However, from the above output, it is clear that the prediction performance is not that good in terms of accuracy. Therefore, readers are suggested to tune the RF model to get a better result.

Alternatively, you can also use the linear or logistic regression algorithm (that are the specialized algorithms for solving regression problems) to fulfill the requirement. However, the current implementation of the Logistic Regression algorithm in Spark supports only binary classification, so you cannot solve this problem using the logistic one. Furthermore, you can try using the Linear Regression algorithm of course that can predict the outcome even if your dataset has up to 4096 classes with better accuracy. 

Readers are also suggested to find more on machine learning algorithms with Spark MLlib at [7]. Moreover, a recent book titled “Large Scale Machine Learning with Spark” would be a good starting point from the technical as well as theoretical perspective to learn machine learning algorithms with latest Spark release (i.e., Spark 2.0.0) [10].

In my next article, I will show how to perform tuning with Spark using Cross Validation and Random Split operation to get better predictive performance.

Source code availability:

The maven friendly pom.xml file and the associated source codes can be downloaded from my GitHub repository here [12].

machine learning ,big data analtics ,spark 2.0.0 ,big data

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}