
Building GLM, GBM, and Random Forest Binomial Models With H2O


With the models you'll learn how to build in this article, you can score new data and measure model performance on training, validation, and testing datasets.


Here is an example of using the H2O machine learning library to build GLM, GBM, and distributed random forest models for a categorical response variable.

Let's import the H2O library and initialize the H2O machine learning cluster:

import h2o
h2o.init()

Import the dataset and get familiar with it:

df = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv")
df.summary()
df.col_names

Let's configure our predictors and response variables from the ingested dataset:

y = 'CAPSULE'
x = df.col_names
x.remove(y)
print("Response = " + y)
print("Predictors = " + str(x))

Now, we need to convert the response column to a categorical (factor) column:

df['CAPSULE'] = df['CAPSULE'].asfactor()

Consider the levels in our response variable:

df['CAPSULE'].levels()
[['0', '1']]

Note: Because there are only two levels or values, the model is called a binomial model.

Now, we will split our dataset into training, validation, and testing datasets:

train, valid, test = df.split_frame(ratios=[.8, .1])
print(df.shape)
print(train.shape)
print(valid.shape)
print(test.shape)
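
The shapes printed above will usually be close to, but not exactly, an 80/10/10 split. H2O's documentation describes split_frame as a probabilistic split rather than an exact one, and the method also accepts a seed argument for reproducibility. The following is a minimal pure-Python sketch of that idea, not H2O's actual implementation; the function name probabilistic_split and the row count are illustrative assumptions.

```python
import random

def probabilistic_split(rows, ratios, seed=42):
    """Assign each row to a split by drawing one uniform random number per row."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.8, 0.9] for ratios [0.8, 0.1]
    bounds = []
    total = 0.0
    for r in ratios:
        total += r
        bounds.append(total)
    # One list per ratio, plus one for the remainder
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        u = rng.random()
        # Index of the first boundary that u falls below, else the remainder split
        idx = next((i for i, b in enumerate(bounds) if u < b), len(ratios))
        splits[idx].append(row)
    return splits

rows = list(range(380))  # illustrative row count
train_rows, valid_rows, test_rows = probabilistic_split(rows, [0.8, 0.1])
print(len(train_rows), len(valid_rows), len(test_rows))
```

Because every row is assigned independently, the three sizes vary from seed to seed while always adding up to the full dataset.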

Let's build our generalized linear model (GLM) first. With family = "binomial", this is a logistic regression, which is the appropriate choice when the response variable is categorical with two levels:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
glm_logistic = H2OGeneralizedLinearEstimator(family="binomial")
glm_logistic.train(x=x, y=y, training_frame=train, validation_frame=valid,
                   model_id="glm_logistic")

Now, we will take a look at a few model metrics, starting with variable importance:

glm_logistic.varimp()
Warning: This model doesn't have variable importances

Let's have a look at model coefficients:

glm_logistic.coef()
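
To see what these coefficients mean, here is a small sketch of how a logistic regression turns them into a probability for class 1. It assumes the coefficients come as a dictionary with an "Intercept" entry plus one entry per predictor; the numbers below are hypothetical and are not fitted values from the prostate data.

```python
import math

def predict_probability(coefs, row):
    """Return P(y = 1) for one row under a logistic regression model."""
    # Linear predictor: intercept plus the weighted sum of the feature values
    eta = coefs["Intercept"] + sum(coefs[name] * value for name, value in row.items())
    # The logistic (sigmoid) function maps the linear predictor into (0, 1)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and feature values, for illustration only
example_coefs = {"Intercept": -1.0, "PSA": 0.02, "GLEASON": 0.1}
example_row = {"PSA": 10.0, "GLEASON": 8.0}
p = predict_probability(example_coefs, example_row)
print(round(p, 4))
```

A positive coefficient pushes the predicted probability of class 1 up as that predictor increases, and a negative one pushes it down.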

Let's perform the prediction using the testing dataset:

glm_logistic.predict(test_data=test)

Now, check the root mean squared error (RMSE) of the model on the testing, validation, and training datasets:

print(glm_logistic.model_performance(test_data=test).rmse())
print(glm_logistic.model_performance(test_data=valid).rmse())
print(glm_logistic.model_performance(test_data=train).rmse())
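
For a binomial model, RMSE can be read as the typical distance between the 0/1 label and the predicted probability of class 1. The sketch below shows that calculation in plain Python under this assumption; the labels and probabilities are made up for illustration.

```python
import math

def rmse(actual, predicted_p1):
    """Root mean squared error between 0/1 labels and predicted P(class 1)."""
    squared_errors = [(y - p) ** 2 for y, p in zip(actual, predicted_p1)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up labels and predicted probabilities
labels = [0, 1, 1, 0]
probs = [0.5, 0.5, 0.5, 0.5]
print(rmse(labels, probs))
```

Lower is better. A training RMSE that is much lower than the validation or testing RMSE is a common sign of overfitting.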

Check the R² of the model on the testing, validation, and training datasets:

print(glm_logistic.model_performance(test_data=test).r2())
print(glm_logistic.model_performance(test_data=valid).r2())
print(glm_logistic.model_performance(test_data=train).r2())
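
As a rough guide to reading these numbers, R² compares the model's squared error with that of a baseline that always predicts the mean of the response. The following is a generic sketch of the textbook formula, with made-up data; it is not taken from H2O's source.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

# Made-up labels; always predicting the mean gives an R^2 of zero
example_labels = [0, 1, 1, 0]
baseline = [0.5, 0.5, 0.5, 0.5]
print(r_squared(example_labels, baseline))
```

Values near 1 mean the predictions track the labels closely, while values at or below 0 mean the model does no better than the mean baseline.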

Let's build our gradient boosting model now:

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator()
gbm.train(x=x, y=y, training_frame=train, validation_frame=valid)

Now, let's get to know our model metrics, starting with the confusion matrix:

gbm.confusion_matrix()
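
A confusion matrix for a binomial model depends on the probability threshold used to turn predictions into class labels. H2O chooses a threshold for you (to my understanding, the one that maximizes F1 by default). The sketch below uses a fixed threshold of 0.5 and made-up data simply to show how the four counts are produced.

```python
def confusion_matrix(actual, predicted_p1, threshold=0.5):
    """Return counts keyed by (actual_label, predicted_label) for a binary classifier."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(actual, predicted_p1):
        # Predict class 1 when the probability reaches the threshold
        predicted = 1 if p >= threshold else 0
        counts[(y, predicted)] += 1
    return counts

# Made-up labels and predicted probabilities of class 1
cm_labels = [0, 0, 1, 1, 1]
cm_probs = [0.2, 0.6, 0.4, 0.7, 0.9]
cm = confusion_matrix(cm_labels, cm_probs, threshold=0.5)
print(cm)
```

Raising the threshold trades false positives for false negatives, which is why the same model can produce different confusion matrices.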

Now, have a look at the variable importance plot:

gbm.varimp_plot()

Now have a look at the variable importance table:

gbm.varimp()
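
The table lists, for each variable, a relative importance together with a scaled importance and a percentage. The sketch below shows how the last two can be derived from the first, assuming scaled means "divided by the largest value" and percentage means "divided by the total". The importance values are hypothetical, not output from the model above.

```python
def summarize_importance(relative):
    """Build (name, relative, scaled, percentage) rows, most important first."""
    largest = max(relative.values())
    total = sum(relative.values())
    rows = [
        (name, value, value / largest, value / total)
        for name, value in relative.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical relative importances, for illustration only
hypothetical = {"PSA": 30.0, "GLEASON": 40.0, "AGE": 10.0, "VOL": 20.0}
table = summarize_importance(hypothetical)
for row in table:
    print(row)
```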

Let's build our distributed random forest model:

from h2o.estimators.random_forest import H2ORandomForestEstimator
drf = H2ORandomForestEstimator()
drf.train(x=x, y=y, training_frame=train, validation_frame=valid)

Let's look at the random forest model metrics, starting with the confusion matrix:

drf.confusion_matrix()

We can also look at the gains and lift table:

drf.gains_lift()
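
Lift tells us how much better the model is at finding positive cases in its top-scored rows than picking rows at random. The following is a simplified sketch with equal-sized groups and made-up data; H2O's own table uses quantile-based groups and reports more columns.

```python
def lift_table(actual, predicted_p1, groups=2):
    """Rank rows by predicted probability and compute the lift of each group."""
    ranked = sorted(zip(predicted_p1, actual), key=lambda pair: pair[0], reverse=True)
    overall_rate = sum(actual) / len(actual)
    size = len(ranked) // groups
    table = []
    for g in range(groups):
        # The last group absorbs any leftover rows
        end = (g + 1) * size if g < groups - 1 else len(ranked)
        chunk = ranked[g * size:end]
        rate = sum(y for _, y in chunk) / len(chunk)
        table.append({"group": g + 1, "response_rate": rate, "lift": rate / overall_rate})
    return table

# Made-up labels and predicted probabilities of class 1
gl_labels = [1, 1, 0, 1, 0, 0, 0, 0]
gl_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
gl_table = lift_table(gl_labels, gl_probs, groups=2)
for row in gl_table:
    print(row)
```

A lift above 1 in the first groups means the model concentrates positive cases among its highest-scored rows.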

Note:

  • The metrics shown for one model type are also available for the other model types, wherever they apply.
  • We can also compute model performance on the training, validation, and testing data for all of the models.

That's it. Enjoy!


Topics:
h2o, machine learning, gbm, glm, random forest, ai, tutorial, python

Published at DZone with permission of
