Over a million developers have joined DZone.

Building Regression and Classification GBM Models in Scala With H2O

DZone's Guide to

Building Regression and Classification GBM Models in Scala With H2O

In this code-ful tutorial, you will learn how to build H2O GBM models for regression and binomial classification in Scala.

· AI Zone ·
Free Resource

Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Read how Alegion's Chief Data Scientist discusses the source of most headlines about AI failures here.

In the full code below, you will learn to build H2O GBM models for regression and binomial classification in Scala.

Let's first import all the classes we need for this project:

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

Next, we need to start the H2O cluster so that we can start using H2O APIs:

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Now, we need to ingest the data, which we can use to perform modeling:

// Import prostate data into H2O
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

// Understanding our input data

Now, we will import some H2O-specific classes that we need to perform our actions:

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

Let's set up our GBM parameters, which will shape our GBM modeling process:

val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE

In the above response column setting, the column CAPSULE is numeric. So by default, the GBML model will build a regression model. Let's start building the GBM model now:

val gbm = new GBM(gbmParams,Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Same as above
val gbmRegModel = gbm.trainModel().get()

Let's get to know our GBM model. We will see that the type of this model is “regression:” gbmRegModel.

Let's perform prediction using the GBM regression model:

val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

Now, we will set the input data to perform the GBM classification model. Below, we are setting the response column to be a categorical type so that all the values in this column become an enumerator instead of a number. This way, we can make sure that the GBM model we build will be a classification model:

// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on above the CAPSULE is the id = 1
// Note: If we will not set categorical for response variable we will see the following exception
//        - water.exceptions.H2OModelBuilderIllegalArgumentException: 
//             - Illegal argument(s) for GBM model: gbmModel.hex.  Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical

withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}

gbmParams._response_column = 'CAPSULE

We can also set the distribution to have a specific method. In the code below, we are setting the distribution to have the Bernoulli method:

import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli

Now, let's build our GBM model:

val gbm = new GBM(gbmParams,Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Same as above
val gbmBinModel = gbm.trainModel().get()

Let's check the new model. We will find that it is a classification model — specifically, binomial classification because it has only two classes in its response classes: gbmBinModel.

Now, let's perform the prediction using our GBM binomial classification model, as shown below:

val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

That's all. Enjoy!

Your machine learning project needs enormous amounts of training data to get to a production-ready confidence level. Get a checklist approach to assembling the combination of technology, workforce and project management skills you’ll need to prepare your own training data.

h2o ,gbm ,scala ,machine learning ,regression ,classification ,ai ,predictive analytics ,tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}