
Exploring H2O.ai AutoML


Ensembling (stacking models) is the best way to perform well.


In the short time I have spent on Kaggle, I have realized that ensembling (stacking models) is the best way to perform well.

Well, I am not the only one to think so!

Stacking is a model ensembling technique that combines the predictions of several base models and trains a new model on top of them.

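To make the idea concrete, here is a toy sketch in base R on the built-in mtcars data (illustrative only; the models and features are hypothetical, not the article's pipeline):

# Toy stacking sketch: two base models plus a meta-model
set.seed(42)
idx <- sample(nrow(mtcars), 22)
trn <- mtcars[idx, ]
hld <- mtcars[-idx, ]

# Two base models trained on different feature sets
m1 <- lm(mpg ~ wt + hp,     data = trn)
m2 <- lm(mpg ~ disp + qsec, data = trn)

# The base models' holdout predictions become features for the meta-model
meta_df <- data.frame(p1  = predict(m1, newdata = hld),
                      p2  = predict(m2, newdata = hld),
                      mpg = hld$mpg)
meta <- lm(mpg ~ p1 + p2, data = meta_df)
summary(meta)$coefficients  # how the stacker weights each base model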

I am going to write a new post on model ensembling.

I have experimented with multiple ensembling techniques and built a model with XGBoost, LightGBM, and Keras for the Zillow Zestimate problem, which performed well.

Hyperparameter tuning for the base models was done using cross-validation plus grid search. Tuning the parameters of the combined model is where things get strenuous.
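For reference, here is a minimal sketch of cross-validated grid search using caret with an XGBoost base model. The grid values and feature names are hypothetical, and it assumes a training data frame like the one built later in this post:

library(caret)

# 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# A small, hypothetical hyperparameter grid; xgbTree requires all seven
grid <- expand.grid(nrounds          = c(100, 300),
                    max_depth        = c(4, 6),
                    eta              = c(0.05, 0.1),
                    gamma            = 0,
                    colsample_bytree = 0.8,
                    min_child_weight = 1,
                    subsample        = 0.8)

# Keep a few complete-case numeric columns for the sketch
cols <- c("logerror", "bedroomcnt", "bathroomcnt", "calculatedfinishedsquarefeet")
dat  <- na.omit(as.data.frame(training)[, cols])

fit <- train(logerror ~ ., data = dat,
             method    = "xgbTree",
             trControl = ctrl,
             tuneGrid  = grid,
             metric    = "MAE")

fit$bestTune  # best parameter combination found by the grid search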

That is when I began searching for a better way to build ensembled models. I found a few frameworks for this, such as Auto-sklearn, TPOT, Auto-WEKA, machineJS, and H2O.ai AutoML.

Auto-sklearn and TPOT provide a scikit-learn-styled API that can get things going quite fast, but H2O.ai AutoML gave me better results, at least on this problem.

H2O.ai is an open-source machine learning platform that offers a broad set of machine learning algorithms for building scalable prediction models.

H2O AutoML helps automate the machine learning workflow, including training models and tuning their hyperparameters. The AutoML process can be controlled by specifying a time limit or by defining a performance-metric-based stopping criterion. AutoML returns a leaderboard of the best models, including stacked ensembles.
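For example, a run can be bounded by wall-clock time or by model count with metric-based early stopping. This is a sketch using the h2o R package; the x, y, and train objects are the ones defined in the script below, and the values are illustrative:

# Bound the run by wall-clock time (seconds)...
aml_time <- h2o.automl(x = x, y = y, training_frame = train,
                       max_runtime_secs = 600)

# ...or by model count, with MAE-based early stopping
aml_count <- h2o.automl(x = x, y = y, training_frame = train,
                        max_models = 20,
                        stopping_metric = "MAE",
                        stopping_rounds = 3,
                        stopping_tolerance = 1e-3)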


AutoML provides Python and R APIs that ship with the H2O library.
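If you haven't used H2O from R before, the package installs from CRAN (a one-time setup sketch):

# Install and load the h2o R package
install.packages("h2o")
library(h2o)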

I decided to give H2O AutoML a try on the Zillow Zestimate problem, using R to build the model and generate the submission.

library(data.table)
library(h2o)

# Load train and properties data

properties <- fread("../input/properties_2016.csv", header=TRUE, stringsAsFactors=FALSE, colClasses = list(character = 50))
train      <- fread("../input/train_2016_v2.csv")
training   <- merge(properties, train, by="parcelid",all.y=TRUE)

# Initialise H2O (all available cores, up to 8 GB of memory)
h2o.init(nthreads = -1, max_mem_size = "8g")

# Mark predictor and response variables
x <- names(training)[which(names(training)!="logerror")]
y <- "logerror"

# Import data into H2O
train <- as.h2o(training)
test <- as.h2o(properties)

# Fit H2O AutoML model
aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  max_runtime_secs = 1800, stopping_metric='MAE')

# Store the H2O AutoML Leaderboard                  
lb <- aml@leaderboard
lb

# Use Best Model in the leaderboard
aml@leader

# Generate predictions using the leader model
pred <- h2o.predict(aml@leader, test)

predictions <- round(as.vector(pred), 4)

# Prepare predictions for the submission file; the same predictions are
# reused for all six evaluation dates
result <- data.frame(cbind(properties$parcelid, predictions, predictions,
                           predictions, predictions, predictions,
                           predictions))

colnames(result)<-c("parcelid","201610","201611","201612","201710","201711","201712")
# Avoid scientific notation in the written output
options(scipen = 999)

# Write results to submission file
write.csv(result, file = "submission_xgb_ensemble.csv", row.names = FALSE )

Running the AutoML model for 1,800 seconds with MAE as the stopping metric gave me a public leaderboard score of 0.06564.

That's a good score considering I haven't even done basic data preprocessing.

Topics:
r, h2o, deep learning, machine learning, ai, ensembling, automl

