
Running an Apache Spark Artificial Neural Network as a Docker Image


In this article, we will discuss how to develop a Docker image from an Apache Spark artificial neural network that solves a classification problem.


Introduction

Various previous studies discuss running Apache Spark [1] applications in Docker. A Docker image for an earlier version (1.6.0) of Spark is available in [2], for both standalone and cluster applications. In [3], a Docker image is discussed that simplifies installing/configuring Spark and submitting Spark applications to a Spark cluster. A Docker container for running Python code to interact with Spark is discussed in [4]. A review of running Spark clusters in Docker containers is given in [5].

In this article, we will discuss how to develop a Docker image from an Apache Spark artificial neural network that solves a classification problem. The image runs in a single, non-clustered Docker host. The run-time environment for Apache Spark is also very basic: the locally running, non-distributed, single-JVM deployment mode [6]. The simple run-time environment discussed here could be beneficial for application settings with relatively small workloads that are manageable with a single client machine. For example, the classifier could perform life-expectancy predictions for patients with a particular medical condition. Once the classifier is developed into its final form, it can be delivered to a clinical informatics team as a Docker image. Because the number of patient records to be processed by the classifier will typically be small, the image can be executed on a standalone hospital computer. (Note that, in contrast, the amount of processing required to develop the classifier into its final form could be very significant and may require a dedicated computing environment to handle the associated workload.)

Problem Statement

Consider the following steps in a machine learning application based on an artificial neural network (ANN):

  1. A data science team trains an ANN to solve a particular classification problem using training and validation data sets. (For example, the ANN predicts patient survival regarding a particular medical condition based on a clinical input set.)
  2. The ANN is ready for production applications. It is delivered to end users, e.g. a clinical informatics team.
  3. The end users run the ANN with data available in the production environment, e.g. a hospital.

Figure: Workflow supported by the sample application.

In this article, we will focus on steps 2 and 3. The data science team will develop a Docker image from an already-trained ANN with all the required components, including the Java VM and Spark runtime. End users will only need the Docker framework on their computers to execute the image. We will use Docker version 18.03, Java version 8, and Apache Spark version 2.3.

The article is organized as follows. The next section discusses training and saving an Apache Spark ANN. That is followed by a discussion of how to load the previously saved ANN and execute it to solve a classification problem. The following section explains how to build a Docker image from the ANN. Then, downloading and running the image is described. The last section gives a summary and conclusions.

Training and Saving the Artificial Neural Network

The kind of ANN we will use is a "Multilayer Perceptron Classifier" [7], a feedforward ANN. The Apache Spark Machine Learning Library (MLlib) implementation of the Multilayer Perceptron Classifier is described in [8], [9]. Specific details of training the classifier are outside the scope of this article. In addition to the MLlib reference information, interested readers can see the examples, e.g. JavaMultilayerPerceptronClassifierExample.java, that come with the Apache Spark installation. Also, see the author's previous article on the subject [10].

After the classifier is trained and a final version is agreed on, it has to be saved, i.e. serialized in a file system, for further use. Consider the code snippet below:

// specify layers for the neural network - assume that size of
// intermediate layers have been determined a priori and this is the
// final structure for the neural network:
// input layer of size 4 (features), two intermediate layers of size 5 
// and 4 and output of size 3 (classes)
int[] layers = new int[] {4, 5, 4, 3};

// create the trainer and set its parameters
MultilayerPerceptronClassifier trainer =
  new MultilayerPerceptronClassifier()
    .setLayers(layers)
    .setBlockSize(128)
    .setSeed(1234L)
    .setMaxIter(100);

// train the model
MultilayerPerceptronClassificationModel model = 
  trainer.fit(trainingData);
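
// (illustrative step, not in the original snippet: before agreeing
// on the final version, the team would typically evaluate the model
// on a held-out test set; testData is assumed to have been split off
// earlier, e.g. via Dataset.randomSplit(), and the evaluator requires
// org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
Dataset<Row> result = model.transform(testData);
MulticlassClassificationEvaluator evaluator =
  new MulticlassClassificationEvaluator()
    .setMetricName("accuracy");
System.out.println("Test set accuracy = "
  + evaluator.evaluate(result.select("prediction", "label")));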

// having trained the model, save it somewhere in the file system,
// e.g. /tmp/savedModel
try {
   model.save("/tmp/savedModel");
} catch (IOException e) {
   // Do some exception handling here
   ...
}

In the above example, after training the "final" classifier model, we saved it in /tmp/savedModel. (Of course, that could be any location in the file system.) If we inspect /tmp/savedModel, we see two directories: data and metadata. Under those directories are various files that can be used to reconstruct the model in a separate Spark application. (The specific details of those files are not important for our purposes.)

konur$ ls -l /tmp/savedModel/
total 0
drwxr-xr-x  5 konur  staff  160 Jun 20 15:39 data
drwxr-xr-x  5 konur  staff  160 Jun 20 15:39 metadata

Loading the Saved ANN

Now let us look into the Java application that loads the model from the file system and puts it to work.

package org.spark.docker.demo;

import ...


public class DockerMultilayerPerceptronClassifier {
   private static Logger logger = null;

   // Start by initializing the logger
   static {
      logger = Logger
            .getLogger("DockerMultilayerPerceptronClassifier");
      logger.setLevel(Level.INFO);
   }
...

The application logic is given in the main method. The /app folder is the root folder in the Docker container to which the local data folder will be mounted at run time, as we will see later on. The application accepts one argument at startup, passed as the value of args[0]: the relative path of the data file the classifier is supposed to process.

...
public static void main(String[] args) {

   if (args.length != 1) {
      logger.error("Usage: docker run --rm \\");
      logger.error("   --volume=<full-path-to-local-data-folder>:/app \\");
      logger.error("   --name docker-classifier \\");
      logger.error("   <repository-name>:docker-classifier \\");
      logger.error("   <relative-path-to-data-file>");
      return;
   }
...

Then, we initialize Spark in local, i.e. non-distributed, single-JVM mode.

...
SparkSession spark = SparkSession
   .builder()
   .appName("DockerMultilayerPerceptronClassifier")
   .master("local[*]")
   .getOrCreate();
...

We create a MultilayerPerceptronClassificationModel instance by loading the previously saved model. (We will review getSavedModelPath() in a moment.) Then, we construct dataPath as the full path to the data file to process and call displayPredictions(data, model) to let the model process the data. Finally, we stop Spark.

...
MultilayerPerceptronClassificationModel model =
   MultilayerPerceptronClassificationModel
      .load(getSavedModelPath());

String dataPath = "/app/" + args[0];

Dataset<Row> data = spark.read().format("libsvm").load(dataPath);
displayPredictions(data, model);

// Stop the Spark session
spark.stop();
} // end of main

The displayPredictions method is very simple. It calls transform on the model, passing data to it. Then, from the result, it extracts features, i.e. the input to the classifier, and prediction, the corresponding output from the classifier. Finally, it displays the results. Note that collectAsList brings all rows to the driver, which is acceptable here because the approach targets small workloads in the first place.

private static void displayPredictions( Dataset<Row> data, 
   MultilayerPerceptronClassificationModel model){

   Dataset<Row> result = model.transform(data);
   Dataset<Row> predictionsAndFeatures = 
      result.select("prediction","features");
   List<Row> collectedPredictionsAndFeatures = 
      predictionsAndFeatures.collectAsList();
   for(Row r:collectedPredictionsAndFeatures){
      // r.get(0) is the prediction, 
      // r.get(1) is the features
      logger.info("Input: " + r.get(1) 
         + ", Prediction: " + r.get(0));
   }
}

Now, let us review getSavedModelPath(). Before the Docker image is built, we copy the serialized model under

<docker-work-directory>/resources/savedModel/

where docker-work-directory is the development folder in which the docker build command is run.

konur$ ls -l resources/savedModel/
total 0
drwxr-xr-x  5 konur  staff  160 Jun 20 15:39 data
drwxr-xr-x  5 konur  staff  160 Jun 20 15:39 metadata

At run time, we obtain the path by appending /resources/savedModel to the root folder as resolved by the class loader of this class, as shown below.

private static String getSavedModelPath(){
   String path = null;
   try {
      URI uri = DockerMultilayerPerceptronClassifier
            .class.getResource("/").toURI();
      path = uri.getPath() + "/resources/savedModel";

   } catch (URISyntaxException e) {
      logger.error(e.getMessage());
   }
   return path;
}

Below is a complete listing of the file.

package org.spark.docker.demo;

import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;


public class DockerMultilayerPerceptronClassifier {
   private static Logger logger = null;
   static {
      logger = Logger.getLogger("DockerMultilayerPerceptronClassifier");
      logger.setLevel(Level.INFO);
   }

   public static void main(String[] args) {

      if(args.length != 1){
         logger.error("Usage: docker run --rm \\");
         logger.error("   --volume=<full-path-to-local-data-folder>:/app \\");
         logger.error("   --name docker-classifier  \\");
         logger.error("   <repository-name>:docker-classifier \\"); 
         logger.error("   <relative-path-to-data-file>");
         return;
      }


      SparkSession spark = SparkSession
            .builder()
            .appName("DockerMultilayerPerceptronClassifier")
            .master("local[*]")
            .getOrCreate();

      MultilayerPerceptronClassificationModel model =
            MultilayerPerceptronClassificationModel
                  .load(getSavedModelPath());

      String dataPath = "/app/" + args[0];

      Dataset<Row> data = spark.read().format("libsvm").load(dataPath);
      displayPredictions(data, model);

      // Stop the Spark session
      spark.stop();
   } // end of main

   private static String getSavedModelPath() {
      String path = null;
      try {
         URI uri = DockerMultilayerPerceptronClassifier
               .class.getResource("/").toURI();
         path = uri.getPath() + "/resources/savedModel";
      } catch (URISyntaxException e) {
         logger.error(e.getMessage());
      }
      return path;
   }

   private static void displayPredictions(Dataset<Row> data,
         MultilayerPerceptronClassificationModel model) {

      Dataset<Row> result = model.transform(data);
      Dataset<Row> predictionsAndFeatures =
            result.select("prediction", "features");
      List<Row> collectedPredictionsAndFeatures =
            predictionsAndFeatures.collectAsList();
      for (Row r : collectedPredictionsAndFeatures) {
         // r.get(0) is the prediction,
         // r.get(1) is the features
         logger.info("Input: " + r.get(1)
               + ", Prediction: " + r.get(0));
      }
   }
}

Building the Docker Image

Let us look into the contents of the docker-work-directory before we build the image.

konur$ ls -la
total 8
drwxr-xr-x    6 konur  staff   192 Jun 22 09:55 .
drwxr-xr-x    7 konur  staff   224 Jun 22 11:15 ..
-rw-r--r--@   1 konur  staff   191 Jun 22 09:55 Dockerfile
drwxr-xr-x  229 konur  staff  7328 Jun 20 15:54 lib
drwxr-xr-x    3 konur  staff    96 Jun 20 15:52 org
drwxr-xr-x    4 konur  staff   128 Jun 21 15:23 resources

lib Folder

The lib folder contains all the Spark jars, copied from the jars folder under the Spark binary distribution root.

resources Folder

A listing of the resources folder is given below.

konur$ ls -l resources/
total 8
-rw-r--r--  1 konur  staff  315 Jun 20 17:11 log4j.properties
drwxr-xr-x  4 konur  staff  128 Jun 20 15:39 savedModel

As we discussed previously, savedModel stores the serialized classifier model. The log4j.properties file is listed below.

log4j.properties

By default, the log level is set to ERROR to suppress verbose info output from the Spark libraries.

# Define the root logger with a console appender
log4j.rootLogger = ERROR, CONSOLE

# Define the console appender
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender

# Define the layout for the console appender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.conversionPattern=%m%n
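
Equivalently, the verbosity of the Spark libraries could be reduced programmatically from the application itself. A minimal sketch using the same log4j API that our class already imports (the logger names below are conventional choices, not something mandated by Spark):

// silence chatty packages from code instead of log4j.properties
Logger.getLogger("org").setLevel(Level.ERROR);
Logger.getLogger("akka").setLevel(Level.ERROR);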

org Folder

The org/spark/docker/demo folder contains the DockerMultilayerPerceptronClassifier.class file.

Dockerfile

The Dockerfile is very simple. The image is based on the openjdk:8 image. The entry point calls java on DockerMultilayerPerceptronClassifier with a classpath and a runtime Java memory option. The classpath consists of the resources and lib folders.

FROM openjdk:8
COPY . /usr/src/Main
WORKDIR /usr/src/Main
ENTRYPOINT ["java", "-Xmx700m", "-classpath", ".:./resources/:./lib/*", \
   "org.spark.docker.demo.DockerMultilayerPerceptronClassifier"]

Build and Push

Finally, we build the image as follows.

docker build -t docker-classifier .

After building the image, it should be placed into a repository. In this example, we will push it to a repository in Docker Hub. Let <image-id> denote the image ID of the just-built image in the local repository. We execute the following commands.

docker tag <image-id> <repository-name>:docker-classifier

where <repository-name> is the repository name in Docker Hub. Then, push it:

docker push <repository-name>

Running the Application

Data File

Firstly, let us discuss the data file to be processed by the classifier. The file should be in the libsvm format [11], which is the format our application loads via Spark's libsvm data source (spark.read().format("libsvm")). The first couple of lines of a sample data file are shown below. Each row consists of a label, i.e. a category, and the related features, all separated by whitespace. The first column is the label, and for our purposes, it can always be set to 0. This is because the label is useful only for training, whereas in our application the sole purpose is to predict from the features. The features are numbered as 1:<feature1value> 2:<feature2value> … etc.

0 1:-0.222222 2:0.5 3:-0.762712 4:-0.833333
0 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667
0 1:-0.722222 2:-0.166667 3:-0.864407 4:-0.833333
  ...
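
When loaded via Spark's libsvm data source, each such row becomes a Dataset row with a label column (a double) and a features column (a vector). As a quick sanity check, a snippet along the following lines (not part of the delivered application; the file path is just an example) prints the schema and the first few rows:

// inspect how the libsvm file is parsed by Spark
Dataset<Row> sample = spark.read().format("libsvm")
   .load("/tmp/testData/data.txt");
sample.printSchema();  // expect columns: label (double), features (vector)
sample.show(3, false); // display the first 3 rows without truncation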

Downloading and Executing the Image

An end user will download the image via:

docker pull <repository-name>:docker-classifier

Then, docker images -a will show:

REPOSITORY                     TAG                 ...
<repository-name>              docker-classifier   ...

Assume that the data file to be processed by the classifier is named data.txt and placed under a local folder named /tmp/testData. Then, execute:

docker run --rm --volume=/tmp/testData:/app --name docker-classifier \ 
    <repository-name>:docker-classifier data.txt

Sample output lines are shown below:

Input: (4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333]), Prediction: 1.0
Input: (4,[0,1,2,3],[-0.555556,0.25,-0.864407,-0.916667]), Prediction: 1.0
Input: (4,[0,1,2,3],[-0.722222,-0.166667,-0.864407,-0.833333]), Prediction: 1.0
  ...

In each line, the first part gives the input, i.e. the features, consisting of 4 values as a vector, and the second part gives the predicted label, one of 1.0 or 0.0 (our ANN solves a binary classification problem). For example, the first line above should be interpreted as follows:

For the input vector [-0.222222,0.5,-0.762712,-0.833333], the predicted outcome is 1.0.
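
If downstream code needs the predicted values rather than log lines, the Row fields selected in displayPredictions can also be read in a typed fashion. A minimal sketch (the variable names are illustrative):

// column 0 is the prediction (a double), column 1 is the features vector
for (Row r : collectedPredictionsAndFeatures) {
   double prediction = r.getDouble(0);
   org.apache.spark.ml.linalg.Vector features = r.getAs(1);
   // e.g. branch on prediction == 1.0, or read features.toArray()
}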

Observe that while running the image we map the local data folder /tmp/testData to the Docker folder /app via the --volume option. Also,

--name docker-classifier

specifies the name of the container that runs the application. Because we remove the container immediately after running it via the --rm flag, we could have used any container name that does not currently exist in the local Docker environment, e.g. --name any-container.

Conclusions

In this article, we looked into delivering an Apache Spark artificial neural network as a Docker image and running it in a single, non-clustered Docker host. We utilized the most basic run-time option for Apache Spark: the locally running, non-distributed, single-JVM mode. The only required software component on the client desktop is the Docker runtime. In machine learning applications with relatively small data processing needs, this type of approach could be advantageous because running and maintaining cluster-based software is avoided.

Let us conclude the article by reviewing the significant parameters and commands discussed in the previous sections.

  • /app: the root folder in the Docker container at which the physical folder on the client machine containing the data file to process is mounted.

Relative to the work folder where the application image is created:

  • lib: folder that contains all the Spark jar files.
  • resources: folder that contains the log4j.properties file and the saved ANN model files.
  • org/spark/docker/demo: folder that contains the application class file.
  • Dockerfile: configuration file used to assemble the application's Docker image.

To build the image:

docker build -t docker-classifier .

where docker-classifier is the tag name of the application.

To push the image:

docker tag 710ce225d745 konuratdocker/spark-examples:docker-classifier

where konuratdocker/spark-examples is an example repository to push the image to and 710ce225d745 is an example image ID created in the local repository upon building the image.

To pull the image:

docker pull konuratdocker/spark-examples:docker-classifier

To run the image:

docker run --rm --volume=/tmp/testData:/app --name my-container \ 
   konuratdocker/spark-examples:docker-classifier data.txt

where:

/tmp/testData/data.txt is the full path on the client machine to the data file to be consumed by the classifier, and my-container is an example container name used to run the image. The /tmp/testData folder is mapped to the Docker root folder /app via the --volume option.

References

[1] http://spark.apache.org/

[2] https://hub.docker.com/r/sequenceiq/spark/

[3] https://dzone.com/articles/running-apache-spark-applications-in-docker-contai

[4] http://maxmelnick.com/2016/06/04/spark-docker.html

[5] https://databricks.com/session/lessons-learned-from-running-spark-on-docker

[6] https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-local.html

[7] https://en.wikipedia.org/wiki/Multilayer_perceptron

[8] http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier

[9] http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html

[10] https://dzone.com/articles/apache-spark-machine-learning-using-artificial-neu

[11] https://www.csie.ntu.edu.tw/%7Ecjlin/libsvm/
