
Using UDF for Machine Learning


Learn about the potential of Java user-defined functions with a brief use case involving machine learning and the iris dataset.


We added support for scalar user-defined functions in our recent release, VoltDB V7.6. With this feature, users can write their own SQL functions in Java and load them into VoltDB. Those functions, once loaded, can be used in all kinds of SQL queries just like any other built-in SQL functions. User-defined functions can give you great flexibility in customizing your business logic inside the database. In this blog post, we will use a specific example to show how user-defined functions can benefit machine learning applications.
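
To make the general pattern concrete before diving into the machine learning example, here is a minimal, hypothetical sketch of a scalar UDF (the MathFunctions class and roundTo method below are illustrative and not part of the iris example):

public class MathFunctions {
    // Hypothetical example: round a value to a given number of decimal places.
    // Any public Java method with supported argument and return types can be
    // exposed to SQL once its class is loaded into the database.
    public double roundTo(double value, int places) {
        double factor = Math.pow(10, places);
        return Math.round(value * factor) / factor;
    }
}

After packaging the class into a JAR and loading it with LOAD CLASSES, the method would be registered with a statement such as CREATE FUNCTION round_to FROM METHOD MathFunctions.roundTo, and round_to() could then be called from any SQL query. The rest of this post applies the same pattern to machine learning.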

Classification is a fundamental problem in machine learning and statistics. Given an input data set, a classification model makes a prediction and assigns each data sample to one of several predefined classes. Commonly used classification models include logistic regression, naive Bayesian classifiers, support vector machines, decision trees, and neural networks. Here, we choose the naive Bayesian classifier and the k-nearest neighbors algorithm as two examples to illustrate how user-defined functions can simplify business applications.

The data set we use in our examples is the iris data set, a multivariate data set introduced by the British statistician and biologist Ronald Fisher. It contains a total of 150 samples drawn evenly from three species of iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured for each sample: the length and the width of both the sepals and the petals, in centimeters.

The original data set comes in the CSV format. To store it in the database, we first define the iris table as follows:

CREATE TABLE iris (
   sepal_length FLOAT NOT NULL,
   sepal_width FLOAT NOT NULL,
   petal_length FLOAT NOT NULL,
   petal_width FLOAT NOT NULL,
   class VARCHAR(20) NOT NULL
);

The iris table includes four columns for the feature values and a target class column that records the actual species of each sample. To load the data, VoltDB provides a command-line utility, csvloader, which imports the contents of a CSV file into a VoltDB table (for example, csvloader iris -f iris.csv, assuming the data file is named iris.csv).
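
The rows in the CSV file look roughly like the following (four measurements in centimeters followed by the species label; the exact label spelling may vary depending on the source of the file):

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica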

A typical machine learning application involves two parts: learning and scoring. VoltDB strives to provide solid support for fast data applications, where a fast incoming data stream is summarized and processed in real time. That is, VoltDB can apply machine learning models to the incoming data stream for real-time decision making with high throughput and low latency. Building the models on historical data, however, is normally done on a back-end, large-scale big data analytics platform. VoltDB can export the live transaction data to those platforms for incremental model updates and sync up with the back-end system to get the updated model for future data processing.

In our example, we include the learning part for completeness, so we first build the classification models (the naive Bayesian classifier and the k-nearest neighbors classifier) from the iris data set. The naive Bayesian classifier is a simple probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions between the features, while the k-nearest neighbors classifier classifies each sample based on the classes of its nearest neighbors in the training set. A technical discussion of how these models are computed from the training data is beyond the scope of this blog post; if you are interested, numerous resources are available on the Internet. Here, we create a Java stored procedure in VoltDB that scans the input table and learns the models, leveraging the algorithm implementations provided by the Java-ML library.

Please note that the input data table in this example is replicated, i.e., no partitioning column is defined and each partition holds a full copy of the table data. This keeps the example simple: because every partition runs the same learning method on the same data, no model synchronization between partitions is required. If VoltDB is used only for real-time decision making (i.e., data scoring), the data set can be partitioned to achieve maximum performance.
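
For reference, partitioning is declared with a single DDL statement. The sketch below is only illustrative: class is the only practical partitioning candidate in this toy schema (FLOAT columns cannot be partitioning columns) and it has just three distinct values, so in practice one would add and partition on a higher-cardinality column such as a sample id.

PARTITION TABLE iris ON COLUMN class;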

The code for the VoltDB Java stored procedure to train the k-nearest neighbor classification model is shown below (the code for the naive Bayesian classifier is almost the same):

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;
import org.voltdb.client.ClientResponse;

import net.sf.javaml.classification.Classifier;
import net.sf.javaml.classification.KNearestNeighbors;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.core.DefaultDataset;
import net.sf.javaml.core.DenseInstance;

public class IrisKNNClassifier extends VoltProcedure {
    // The query to extract the data from a VoltDB table.
    public final SQLStmt selectData =
            new SQLStmt("SELECT sepal_length, sepal_width, petal_length, petal_width, class FROM iris;");
    // Number of attributes.
    private static final int attributeCount = 4;
    private static Classifier knnc;

    // The train method (Java stored procedure).
    public long run() {
        // Run the SELECT query to get the data.
        voltQueueSQL(selectData);
        VoltTable[] queryResults = voltExecuteSQL(true);
        VoltTable dataTable = queryResults[0];

        // Build the data set that can be used by the machine learning library.
        Dataset dataset = new DefaultDataset();
        while (dataTable.advanceRow()) {
            DenseInstance dataRow = new DenseInstance(attributeCount);
            for (int i = 0; i < attributeCount; i++) {
                dataRow.put(i, dataTable.getDouble(i));
            }
            dataRow.setClassValue(dataTable.getString(attributeCount));
            dataset.add(dataRow);
        }

        // Train a k-nearest neighbors classifier with k = 5.
        knnc = new KNearestNeighbors(5);
        knnc.buildClassifier(dataset);
        return ClientResponse.SUCCESS;
    }
}

The classifier knnc is a static member of the IrisKNNClassifier class. When the stored procedure is run on all partitions, each partition learns the model and stores it in this static member. In a real-world use case, this procedure should instead get its model from an external analytics system.

To apply the model, we create a Java method in the same class as the stored procedure method in order to access the model data. The code for the Java method is shown below:

public String classify(double sepal_length, Double sepal_width, double petal_length, Double petal_width) {
    if (knnc == null) {
        throw new RuntimeException("Please run the IrisKNNClassifier procedure first to train the model.");
    }
    // Assemble the attributes together.
    DenseInstance dataRow = new DenseInstance(attributeCount);
    dataRow.put(0, sepal_length);
    dataRow.put(1, sepal_width);
    dataRow.put(2, petal_length);
    dataRow.put(3, petal_width);
    Object predictedClassValue = knnc.classify(dataRow);
    return (String) predictedClassValue;
}

The Java method takes four floating-point parameters. Note that the signature mixes the primitive double and the boxed Double types: VoltDB supports both boxed and primitive types in UDF methods and automatically maps the Java types to the corresponding VoltDB SQL data types (see our documentation for details). The function simply assembles the parameter values into a data instance, applies the model, and returns the predicted class.

A UDF is compiled the same way as a stored procedure, and it is perfectly fine to define stored procedures and UDFs in the same class, as we do here. After the JAR file is compiled, we can load the procedures and the UDFs by executing the following commands:

LOAD CLASSES VoltMachineLearning.jar;
CREATE PROCEDURE FROM CLASS IrisNBClassifier;
CREATE PROCEDURE FROM CLASS IrisKNNClassifier;
CREATE FUNCTION nb_classify FROM METHOD IrisNBClassifier.classify;
CREATE FUNCTION knn_classify FROM METHOD IrisKNNClassifier.classify;

Please note that in order to use the Java-ML library, we need to copy the related Java-ML JAR files into the /lib/extension folder under the VoltDB installation directory so that VoltDB can load them at startup.

To build the classification models, we run the stored procedures:

EXEC IrisNBClassifier;
EXEC IrisKNNClassifier;
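
Once the models are built, the UDFs can be used anywhere a built-in SQL function can. For instance, a single new measurement can be scored with a query like the one below (the four values are arbitrary, and the LIMIT 1 is just a convenient way to evaluate the expression once against the replicated table):

SELECT knn_classify(5.1, 3.5, 1.4, 0.2) AS predicted_class FROM iris LIMIT 1;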

We can then apply the models to the same data set to see how accurate their predictions are:

SELECT SUM(CASE WHEN nb_classify(sepal_length,
                                 sepal_width,
                                 petal_length,
                                 petal_width) = class THEN 1 ELSE 0 END) AS nb_correct,
       SUM(CASE WHEN knn_classify(sepal_length,
                                  sepal_width,
                                  petal_length,
                                  petal_width) = class THEN 1 ELSE 0 END) AS knn_correct,
       COUNT(*) AS total
FROM iris;

The result of the query above is:

NB_CORRECT  KNN_CORRECT  TOTAL
        77          145    150

In our example, out of 150 samples, the k-nearest neighbors classifier got 145 correct classifications (96.6%), while the naive Bayesian classifier only got 77 correct (51.3%).

In this example, we have shown the potential of Java user-defined functions with a brief use case. User-defined functions unlock the full power and flexibility of the Java programming language and environment, and with the wealth of external libraries that UDFs can link against, there is a great deal you can do inside the database. The source code for this example is available on GitHub. We hope you find the feature helpful in designing your fast data applications, and please let us know what you think!


Topics: ai, udf, machine learning, tutorial
