Exploring H2O.ai AutoML
Ensembling (stacking models) is one of the best ways to perform well in machine learning competitions.
In the short time I have spent on Kaggle, I have realized that ensembling (stacking models) is one of the best ways to perform well.
Stacking is a model ensembling technique that combines the predictions of multiple base models by training a new, higher-level model on them.
I am going to write a new post on model ensembling.
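In the meantime, here is a minimal, self-contained sketch of the idea in base R. The synthetic data, the two linear base learners, and the 5-fold split are all hypothetical stand-ins; a real stack would use stronger base models like the gradient-boosted ones discussed below.

# Minimal stacking sketch (illustrative only, not the article's model)
set.seed(42)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 - df$x2 + rnorm(n, sd = 0.5)

# 5-fold split for out-of-fold predictions
folds <- sample(rep(1:5, length.out = n))
meta <- data.frame(m1 = numeric(n), m2 = numeric(n))

for (k in 1:5) {
  tr <- df[folds != k, ]
  te <- df[folds == k, ]
  m1 <- lm(y ~ x1 + x2, data = tr)           # base model 1
  m2 <- lm(y ~ poly(x1, 2) + x2, data = tr)  # base model 2
  meta$m1[folds == k] <- predict(m1, te)
  meta$m2[folds == k] <- predict(m2, te)
}

# Meta-learner: fit a new model on the base models' out-of-fold predictions
meta$y <- df$y
stacker <- lm(y ~ m1 + m2, data = meta)
summary(stacker)$r.squared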
I have experimented with multiple ensembling techniques and built a stacked model with XGBoost, LightGBM, and Keras for the Zillow Zestimate problem, which performed well.
Hyper-parameter tuning for the base models was done using cross-validation plus grid search. Tuning the parameters of the combined model is where things get strenuous.
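For reference, this is roughly what cross-validated grid search looks like in H2O, the library used later in this post. It is a sketch only: the GBM algorithm, the toy iris frame, and the hyper-parameter values are placeholders, not the actual XGBoost/LightGBM/Keras setup I used.

library(h2o)
h2o.init(nthreads = -1)

# Toy frame as a stand-in for real training data
train_hf <- as.h2o(iris)
x <- setdiff(names(train_hf), "Sepal.Length")
y <- "Sepal.Length"

# 5-fold cross-validated grid search over a few GBM hyper-parameters
grid <- h2o.grid(
  algorithm      = "gbm",
  grid_id        = "gbm_grid",
  x = x, y = y,
  training_frame = train_hf,
  nfolds         = 5,
  hyper_params   = list(max_depth  = c(3, 5, 7),
                        learn_rate = c(0.01, 0.1))
)

# Rank models by cross-validated MAE and fetch the best one
sorted <- h2o.getGrid("gbm_grid", sort_by = "mae", decreasing = FALSE)
best   <- h2o.getModel(sorted@model_ids[[1]])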
Auto-sklearn and TPOT provide a scikit-learn-style API that can get you going quite fast, but H2O.ai AutoML produced better results, at least for me.
H2O.ai is an open-source machine learning platform that offers a broad set of machine learning algorithms for building scalable prediction models.
H2O AutoML helps automate the machine learning workflow, including training models and tuning their hyper-parameters. The AutoML process can be controlled by specifying a time limit or by defining a performance-metric-based stopping criterion. AutoML returns a leaderboard of the trained models, which includes stacked ensembles of the best performers.
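In practice, those controls are just arguments to h2o.automl. A hedged sketch, where the frame and variable names are placeholders:

# Bound the run by wall-clock time...
aml <- h2o.automl(x = x, y = y, training_frame = train_hf,
                  max_runtime_secs = 600)

# ...or by model count, with a metric-based early-stopping criterion
aml <- h2o.automl(x = x, y = y, training_frame = train_hf,
                  max_models = 20,
                  stopping_metric = "MAE",
                  stopping_rounds = 3,
                  stopping_tolerance = 1e-3)

# The leaderboard ranks every model trained, including stacked ensembles
print(aml@leaderboard)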
AutoML is exposed through the Python and R APIs that ship with the H2O library.
I decided to try H2O AutoML on the Zillow Zestimate problem, using R to build the model and generate the submission.
library(data.table)
library(h2o)

# Load train and properties data
properties <- fread("../input/properties_2016.csv", header = TRUE,
                    stringsAsFactors = FALSE, colClasses = list(character = 50))
train <- fread("../input/train_2016_v2.csv")
training <- merge(properties, train, by = "parcelid", all.y = TRUE)

# Initialise H2O
h2o.init(nthreads = -1, max_mem_size = "8g")

# Mark predictor and response variables
x <- names(training)[which(names(training) != "logerror")]
y <- "logerror"

# Import data into H2O
train <- as.h2o(training)
test <- as.h2o(properties)

# Fit the H2O AutoML model
aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  max_runtime_secs = 1800,
                  stopping_metric = "MAE")

# Store the H2O AutoML leaderboard
lb <- aml@leaderboard
lb

# Use the best model on the leaderboard
aml@leader

# Generate predictions using the leader model
pred <- h2o.predict(aml, test)
predictions <- round(as.vector(pred), 4)

# Prepare predictions for the submission file
result <- data.frame(cbind(properties$parcelid, predictions, predictions,
                           predictions, predictions, predictions, predictions))
colnames(result) <- c("parcelid", "201610", "201611", "201612",
                      "201710", "201711", "201712")
options(scipen = 999)

# Write results to the submission file
write.csv(result, file = "submission_xgb_ensemble.csv", row.names = FALSE)
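If you want a model other than the leader, say to inspect the runner-up, the leaderboard's model IDs can be pulled into R and any model fetched by ID. A small hypothetical follow-up to the script above:

# Pull all model IDs off the leaderboard stored in lb
model_ids <- as.data.frame(lb$model_id)[, 1]

# Fetch the second-ranked model and check its cross-validated MAE
runner_up <- h2o.getModel(model_ids[2])
h2o.mae(h2o.performance(runner_up, xval = TRUE))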
Running AutoML for 1,800 seconds with MAE as the stopping metric gave me a Public Leaderboard score of 0.06564.
That's a good score considering I haven't even done basic data preprocessing.