Using Linear Regression with Apache Ignite

How do you use linear regression with Apache Ignite? Read this article in order to find out how.

In the previous article of this Machine Learning series, we looked at the Apache Ignite Machine Learning Grid. Now, let’s take the opportunity to drill further into some of the Machine Learning algorithms that are supported in Apache Ignite and try out some examples using popular datasets.

Many datasets are available, but one good candidate for Linear Regression is the Boston house prices dataset. Very conveniently, this data is available through the UCI website.

In this article, we will train a Linear Regression model and calculate its R2 score (the coefficient of determination).

Some data preparation is required to get the data into a suitable format for Apache Ignite. This kind of preparation is often where a Data Scientist spends time.

First, we need to take the raw data and split it into training data (80%) and test data (20%). At the time of writing this article, Ignite does not support dedicated data splitting, but this functionality is on the roadmap for a future release. In the meantime, there are many free and open-source tools available that can perform this type of data splitting or we could code this ourselves in one of the programming languages supported by Ignite. For this article, we'll use Scikit-learn, and my colleague Anton Dmitriev at GridGain very kindly wrote the following code to achieve this task:

import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load Boston housing dataset.
# Note: load_boston() was removed in scikit-learn 1.2, so this needs an older release.
boston_dataset = datasets.load_boston()
x = boston_dataset.data
y = boston_dataset.target

# Split it into train (80%) and test (20%) subsets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=23)

# Save train set: features first, target in the last column, no header row.
train_ds = pd.DataFrame(x_train, columns=boston_dataset.feature_names)
train_ds["TARGET"] = y_train
train_ds.to_csv("boston-housing-train.csv", index=False, header=False)

# Save test set in the same layout.
test_ds = pd.DataFrame(x_test, columns=boston_dataset.feature_names)
test_ds["TARGET"] = y_test
test_ds.to_csv("boston-housing-test.csv", index=False, header=False)

# Train linear regression model.
lr = LinearRegression()
lr.fit(x_train, y_train)

# Score result model (R2 on the test set).
print(lr.score(x_test, y_test))

This code loads the Boston housing dataset, performs the data split, saves the training and test sets as CSV files, and then trains a Scikit-learn model and calculates its R2 score on the test data. The value returned is 0.745021053016975, or 74.5%. Later, we’ll compare this against the value returned by Ignite.
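For readers who would rather code the split themselves than rely on Scikit-learn, the idea can be sketched with the Python standard library alone. This is only an illustration: the function name `train_test_split_rows` and the toy rows are made up here, and the shuffle will not reproduce the same partition as Scikit-learn's `random_state=23`.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=23):
    """Shuffle a copy of rows with a fixed seed, then split off the test rows."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy rows standing in for the real dataset (hypothetical values).
rows = [[float(i), float(2 * i)] for i in range(10)]
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two subsets, which is the property that matters when the test score is meant to reflect unseen data.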

With our training and test data ready, we can start coding the application. My colleague Anton very kindly wrote a Java application for use with Ignite, and you can download it from GitHub if you would like to follow along. The application follows these steps:

  1. Read the training data and test data
  2. Store the training data and test data in Ignite caches
  3. Use the training data to fit the Linear Regression model
  4. Apply the model to the test data
  5. Determine the R2 score of the model

Since the dataset is quite small, we could just load it into standard Java data structures and run Linear Regression directly from within the Java program. Alternatively, we could load the data into Apache Ignite caches and then run Linear Regression on the cached data. The advantage of using Apache Ignite caches is that the data will be distributed across an entire cluster and, therefore, we will be performing distributed training. For large datasets, using Ignite caches could, therefore, have great benefits. In our example, we will load the data into Ignite caches.

Read the Training Data and Test Data

We have two CSV files to read in: one for the training data and the other for the test data. We can use the following code to read values in from the CSV files:

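import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import org.apache.ignite.IgniteCache;

private static void loadData(String fileName, IgniteCache<Integer, HouseObservation> cache)
        throws FileNotFoundException {

   // try-with-resources closes the file when loading finishes or fails.
   try (Scanner scanner = new Scanner(new File(fileName))) {
      int cnt = 0;
      while (scanner.hasNextLine()) {
         String row = scanner.nextLine();
         String[] cells = row.split(",");

         // All cells except the last are features; the last cell is the price.
         double[] features = new double[cells.length - 1];
         for (int i = 0; i < cells.length - 1; i++)
            features[i] = Double.parseDouble(cells[i]);
         double price = Double.parseDouble(cells[cells.length - 1]);

         // The row number serves as the cache key.
         cache.put(cnt++, new HouseObservation(features, price));
      }
   }
}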

The code reads the data line by line and splits each line on the CSV field separator. Each field value is converted to a double: the last field is the price and the remaining fields are the features. Each row is then stored in an Ignite cache, keyed by its row number.
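Because the loader assumes that every cell is numeric and that every row has the same layout, it can be worth checking the CSV files before loading them. The following is a minimal sketch of such a check; the helper names `parse_row` and `check_rows` and the sample rows are hypothetical and not part of the original application.

```python
def parse_row(line):
    """Split one CSV line into (features, price); the last cell is the price."""
    cells = [float(cell) for cell in line.strip().split(",")]
    return cells[:-1], cells[-1]

def check_rows(lines):
    """Return (row_count, feature_count); raise ValueError on malformed rows."""
    feature_count = None
    row_count = 0
    for line in lines:
        features, _price = parse_row(line)     # float() raises ValueError on bad cells
        if feature_count is None:
            feature_count = len(features)      # the first row fixes the expected width
        elif len(features) != feature_count:
            raise ValueError(
                f"row {row_count} has {len(features)} features, expected {feature_count}")
        row_count += 1
    return row_count, feature_count

# Made-up rows in the same layout: features first, price last.
sample = ["1.0,2.5,3.0,24.0", "0.5,1.5,2.0,21.6"]
print(check_rows(sample))
```

To check a real file, the same function can be given an open file object, for example `check_rows(open("boston-housing-train.csv"))`.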

Store the Training Data and Test Data in Ignite Caches

The previous code stores data values in an Ignite cache. To use this code, we need to create the Ignite caches prior to using them, as follows:

IgniteCache<Integer, HouseObservation> trainData = ignite.createCache("BOSTON_HOUSING_TRAIN");

IgniteCache<Integer, HouseObservation> testData = ignite.createCache("BOSTON_HOUSING_TEST");

Use the Training Data to Create the Linear Regression Model

Now that our data is stored, we can create the trainer as follows:

DatasetTrainer<LinearRegressionModel, Double> trainer = new LinearRegressionLSQRTrainer();

and fit a linear model to the training data, as follows:

LinearRegressionModel mdl = trainer.fit(
   ignite,
   trainData,
   (k, v) -> v.getFeatures(),  // Feature extractor.
   (k, v) -> v.getPrice()      // Label extractor.
);

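Ignite stores data in a Key-Value (K-V) format, so both extractors in the above code read from the value part (v) of each cache entry. The target value is the price, returned by getPrice(), and the features are the other columns, returned by getFeatures().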
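To make the idea of fitting concrete, here is a minimal sketch of ordinary least squares for a single feature, using the closed-form slope and intercept. It is not what LinearRegressionLSQRTrainer does internally (that trainer handles many features over distributed data); the function name `fit_line` and the numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

A trained model plays the same role as `slope` and `intercept` here: it holds the weights that turn a feature vector into a predicted price.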

Apply the Model to the Test Data

Next, we are ready to check the test data against the trained linear model. On the Apache Ignite Machine Learning roadmap, there is a plan to provide built-in score calculators. For now, we can do the following:

double meanPrice = getMeanPrice(testData);
double u = 0, v = 0;

try (QueryCursor<Cache.Entry<Integer, HouseObservation>> cursor = testData.query(new ScanQuery<>())) {
   for (Cache.Entry<Integer, HouseObservation> testEntry : cursor) {
      HouseObservation observation = testEntry.getValue();

      double realPrice = observation.getPrice();
      double predictedPrice = mdl.apply(new DenseLocalOnHeapVector(observation.getFeatures()));

      u += Math.pow(realPrice - predictedPrice, 2);
      v += Math.pow(realPrice - meanPrice, 2);
   }
}

Here we calculate the residual sum of squares (u) and the total sum of squares (v).

Determine the R2 Score of the Model

We can find the value of R2 as 1 - u / v:

double score = 1 - u / v;

System.out.println("Score : " + score);

This gives us the value 0.7450194305206714, or 74.5%. Rounded to one decimal place, this percentage matches what we achieved earlier with Scikit-learn; the two raw scores differ by less than 0.00001.
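The u and v bookkeeping can be checked by hand on a few made-up numbers. The sketch below mirrors the calculation in plain Python; the function name `r2_score_by_hand` and the sample prices are hypothetical, and the mean is computed inline because the getMeanPrice helper is not shown in this article.

```python
def r2_score_by_hand(actual, predicted):
    """Return 1 - u / v, where u is the residual and v the total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    u = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # residual sum of squares
    v = sum((a - mean_actual) ** 2 for a in actual)            # total sum of squares
    return 1 - u / v

# Made-up real and predicted prices.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.5, 8.5]
score = r2_score_by_hand(actual, predicted)
print("Score : " + str(score))
```

A score of 1 means the predictions match the real values exactly, while a model that always predicts the mean price scores 0.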

Summary

Apache Ignite provides a library of Machine Learning algorithms. Through a Linear Regression example, we have seen the ease with which we can create a model, test the model, and determine the R2 score of the model. We can now also use this model to make predictions.

Today, many Machine Learning tools are available, but many of them cannot scale beyond a single node and can only handle small quantities of data. In contrast, Ignite can scale both:

  1. The size of the cluster (hundreds or thousands of machines).
  2. The quantity of data stored (hundreds of Gigabytes, Terabytes or even Petabytes).

Ignite can, therefore, run Machine Learning at scale. It can truly manage Machine Learning on Big Data using distributed processing.

In the next part of this Apache Ignite Machine Learning series, we’ll look at another Machine Learning algorithm. Stay tuned!

Topics:
apache ignite, machine learning, linear regression, artificial intelligence, ai

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
