Predict Bad Loans With H2O Flow AutoML

Learn how to accurately predict bad loan data to help borrowers in making financial decisions and investors in choosing the best investment strategy.

Machine learning algorithms play a key role in predicting loan outcomes for banks and other lenders. The main challenge is choosing models and algorithms that accurately estimate the probability of loan default, so that sound financial decisions can be made. H2O Flow is a web-based interactive computational environment that combines text, code execution, and rich media in a single document.

H2O’s AutoML, an easy-to-use interface for advanced users, automates the machine learning workflow, including the training of a large set of models. Stacked ensembles are then used to produce a top-performing, highly predictive ensemble model on the AutoML Leaderboard. In this blog, let's predict bad loans in order to help borrowers make financial decisions and investors choose the best investment strategy.


Prerequisites

  • Install Python 2.7 or 3.5+
  • Install H2O Flow with the following packages:
    • pip install requests
    • pip install tabulate
    • pip install scikit-learn
    • pip install colorama
    • pip install future
    • pip install http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/Python/h2o-
  • Upon successfully installing H2O, check the cluster connection using h2o.init().
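
The connection check in the last step could look like the following minimal sketch. It assumes the h2o package is installed as described above; `check_cluster` is a hypothetical helper name used only for illustration.

```python
def check_cluster():
    """Start (or connect to) a local H2O cluster and report whether it is running."""
    import h2o  # imported lazily so this file still loads where h2o is not installed

    h2o.init()               # connects to a running local cluster or starts a new one
    cluster = h2o.cluster()  # handle to the cluster we are now connected to
    return cluster.is_running()


# Usage (requires h2o and a Java runtime):
#     print("H2O cluster running:", check_cluster())
```

Once the cluster is up, H2O Flow is normally reachable in a browser at http://localhost:54321, the default port.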

Data Description

The source file is Lending Club loan data from 2007 to 2011, with 163K rows and 15 columns. Lending Club is a peer-to-peer lending platform that connects investors and borrowers.

Sample dataset:


Dataset variables:

  1. loan_amnt

  2. term

  3. int_rate

  4. addr_state

  5. dti

  6. revol_util

  7. delinq_2yrs

  8. emp_length

  9. annual_inc

  10. home_ownership

  11. purpose

  12. total_acc

  13. longest_credit_length

  14. verification_status

  15. bad_loan (dependent variable)

Use Case

  • Analyze Lending Club’s loan data.
  • Predict bad loans in the dataset using the distributed random forest model and the stacked ensembles built by AutoML, that is, whether a borrower's loan should be approved or rejected.

Based on the percentage of bad loans, investors can readily decide whether to finance a borrower's new loan. For example, a loan is considered "rejected" if its bad loan value is 1.
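
To make the "percentage of bad loans" concrete, here is a small sketch in plain Python. The labels are made up for illustration and are not taken from the Lending Club data; `bad_loan_rate` is a hypothetical helper name.

```python
def bad_loan_rate(labels):
    """Return the share of loans flagged as bad (label 1), as a percentage."""
    if not labels:
        raise ValueError("labels must not be empty")
    return 100.0 * sum(1 for value in labels if value == 1) / len(labels)


# Illustrative labels only: 1 marks a bad (rejected) loan, 0 a good one.
sample_labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
rate = bad_loan_rate(sample_labels)
print(f"Bad loans: {rate:.1f}%")
```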


  • Import data from source
  • View parsing data
  • View job details and dataset summary
  • Visualize labels
  • Impute data
  • Split data
  • Run AutoML
  • View leaderboard
  • Compute variable importance
  • View output

Importing Data From Source

To import the data from the source, perform the following:

  • Open H2O Flow.
  • Click Data > Import Files to import the source files into H2O Flow as shown in the below diagram:



After importing the files, a summary displays the results of the import.

Viewing Parsing Data

On successfully importing these files, click Parse these files to parse the files and to view the details of the source data as shown in the below diagram:


The parsed files contain column names and data types of all features. The data types will be assigned by default and can be changed if required. For example, in our use case, the data type of the response column (bad loan) is changed from numeric to factor (enum). 

After making all changes, click Parse.
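
For readers who prefer the Python client over the Flow UI, the import and the numeric-to-factor conversion can be sketched roughly as follows. The file path is a placeholder, and the response column name `bad_loan` is an assumption based on the description above.

```python
LOAN_CSV = "loan.csv"   # hypothetical path; point it at your copy of the data
RESPONSE = "bad_loan"   # assumed name of the response column


def load_loans(path=LOAN_CSV, response=RESPONSE):
    """Import the CSV into H2O and convert the response column to a factor (enum)."""
    import h2o  # imported lazily so this file still loads where h2o is not installed

    h2o.init()
    loans = h2o.import_file(path)
    # A numeric 0/1 response would be modeled as regression;
    # converting it to a factor makes this a classification problem.
    loans[response] = loans[response].asfactor()
    return loans


# Usage (requires h2o and the data file):
#     loans = load_loans()
#     loans.describe()
```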

Viewing Job Details and Dataset Summary

After clicking Parse, you can view the job details. Click View to see the summary of the DataFrame.


Loan dataset summary:


In the above summary, the input columns contain multiple label values. The data in each column can be visualized by clicking the corresponding column name.

Visualizing Labels

In this section, let's visualize the data in the loan amount and employment length columns.

Loan amount data:


Employment length data:


Imputing Data

Missing values are imputed in place by filling them with an aggregate (for example, the mean or median) computed over the column's non-missing (“na.rm’d”) values.
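
The idea can be illustrated in plain Python, assuming the median is the chosen aggregate. This is only a sketch of the concept, not H2O's implementation, and the values are made up.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the non-missing entries."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(present)  # aggregate computed with missing values removed
    return [fill if v is None else v for v in values]


# Illustrative values only.
column = [10.0, None, 2.0, 5.0, None, 7.0]
imputed = impute_median(column)
print(imputed)  # -> [10.0, 6.0, 2.0, 5.0, 6.0, 7.0]
```

The Python client exposes a comparable operation as `H2OFrame.impute`, whose `method` and `combine_method` arguments correspond to the fields entered in Flow.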

To impute the data, perform the following:

  • Choose the attribute with missing values.
  • Click Impute as shown in the below diagram:


  • Specify the following details:
    • Frame
    • Column
    • Method
    • Combine method


On successfully imputing the column with the median values, the summary of the column will be displayed as shown in the below diagram:


Splitting Data

To split the dataset into a training set (70%) and a test set (30%), perform the following:

  • Click Assist Me and then Split Frame (or click the Data drop-down and select Split Frame) to split the DataFrame. H2O Flow automatically adjusts the ratio values so that they sum to 1. If unsupported values are entered, an error is displayed.
  • Click Create to view the split frames.
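
As a rough picture of what a 70/30 split does, the sketch below assigns each row to the training set with probability 0.7. It is plain Python with stand-in rows, and `split_rows` is a hypothetical helper, not an H2O function.

```python
import random


def split_rows(rows, ratio=0.7, seed=42):
    """Randomly assign each row to train with probability `ratio`, otherwise to test."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for the rows of the loan frame
train, test = split_rows(rows)
print(len(train), len(test))  # roughly 700 and 300, not exactly
```

In the Python client, `frame.split_frame(ratios=[0.7], seed=...)` returns the two frames. Because rows are assigned probabilistically, the resulting sizes are close to, but not exactly, 70% and 30%.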


Running AutoML

To run AutoML, perform the following:

  • Select Model > Run AutoML as shown in the below diagram:


  • Provide the following details as shown in the below diagram:
    • Training frame: Select the dataset to build the model.
    • Response column: Select the column to be used as a dependent variable; required only for GLM, GBM, DL, DRF, and Naïve Bayes (classification model).
    • Fold column: Select the column that holds the cross-validation fold index assignment per observation (optional in AutoML).
    • Weight column: Select the column of per-row observation weights; weights do not increase the size of the data. During training, rows with higher weights matter more because of the larger loss function pre-factor.
    • Validation frame: Select the dataset to evaluate the model accuracy (optional).
    • Leaderboard frame: Specify the Leaderboard Frame when configuring the AutoML run. If not specified, the Leaderboard Frame will be created from the Training Frame. The output models with best results will be displayed on the Leaderboard.
    • Max models: Specify the maximum number of models to be built in an AutoML run.
    • Max runtime seconds: Controls execution time of AutoML run (default time is 3,600s).
    • Stopping rounds: Stops training based on a simple moving average when stopping_metric doesn't improve for a specified number of training rounds. Specify 0 to disable this feature.
    • Stopping tolerance: Specify the relative tolerance for metric-based stopping; training ceases if the improvement is smaller than this value.
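
The same settings can be passed through the Python client. The sketch below is one possible configuration, not the one used in this article: every value except the 3,600-second default is illustrative, the response name `bad_loan` is assumed, and `run_automl` is a hypothetical helper.

```python
AUTOML_SETTINGS = {
    "max_models": 10,             # illustrative cap on the number of models
    "max_runtime_secs": 3600,     # the default run time mentioned above
    "stopping_rounds": 3,         # illustrative; 0 disables early stopping
    "stopping_tolerance": 0.001,  # illustrative relative tolerance
    "seed": 1,                    # for reproducibility
}


def run_automl(train, test, response="bad_loan", settings=None):
    """Train AutoML on `train` and rank the resulting models on `test`."""
    from h2o.automl import H2OAutoML  # lazy import; requires the h2o package

    predictors = [col for col in train.columns if col != response]
    aml = H2OAutoML(**(settings or AUTOML_SETTINGS))
    aml.train(x=predictors, y=response,
              training_frame=train, leaderboard_frame=test)
    return aml


# Usage (requires a running H2O cluster and the split frames):
#     aml = run_automl(train, test)
#     print(aml.leaderboard)   # models ranked best first
#     best = aml.leader        # top model on the leaderboard
```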


Viewing Leaderboard

The Leaderboard displays the models with the best results first, as shown in the below diagram:


ROC curve (training metrics):

Computing Variable Importance

The relative importance of each variable in the model is computed in an algorithm-dependent way, and the variables are listed from most to least important.

The percentage importance values of all variables sum to 100. The scaled importance of the variables is shown in the below diagram:
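
The relationship between these columns can be sketched in plain Python. The sketch assumes the usual H2O convention in which scaled importance is relative to the top variable and percentage is each variable's share of the total; the scores are made up and `summarize_importance` is a hypothetical helper.

```python
def summarize_importance(relative):
    """Turn {variable: relative importance} into rows sorted from most to least important.

    Each row is (name, relative, scaled, percentage), where
    scaled = relative / max(relative) and percentage = relative / sum(relative).
    """
    if not relative or max(relative.values()) <= 0:
        raise ValueError("need at least one positive importance value")
    top = max(relative.values())
    total = sum(relative.values())
    rows = [(name, value, value / top, value / total)
            for name, value in relative.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up scores for three of the dataset's columns.
scores = {"term": 30.0, "int_rate": 50.0, "dti": 20.0}
table = summarize_importance(scores)
for name, rel, scaled, pct in table:
    print(f"{name:10s} {rel:6.1f} {scaled:5.2f} {pct:6.1%}")
```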


Viewing Output

Predicted model of loan dataset:

ROC curve:

Prediction scores:



Conclusion

In this blog, AutoML, a distributed random forest model, and stacked ensembles are used to build and test the best model for predicting loan default. The predictions are analyzed to obtain a cut-off value, which investors can use to choose an investment strategy and to decide which applicants receive loans.
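
The article does not say how the cut-off value is chosen. One common choice, assumed here purely for illustration, is the probability threshold that maximizes the F1 score. The predictions below are made up, and both helper names are hypothetical.

```python
def f1_at(threshold, probs, labels):
    """F1 score when every probability >= threshold is predicted as a bad loan."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_cutoff(probs, labels):
    """Try each predicted probability as a cut-off and keep the one with the highest F1."""
    return max(sorted(set(probs)), key=lambda t: f1_at(t, probs, labels))


# Made-up predictions: probability of default and the true label (1 = bad loan).
probs = [0.10, 0.20, 0.35, 0.40, 0.65, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]
cutoff = best_cutoff(probs, labels)
print(cutoff, round(f1_at(cutoff, probs, labels), 3))  # -> 0.4 0.857
```

Applicants whose predicted default probability falls at or above the cut-off would be treated as likely bad loans. A more conservative investor could pick a lower threshold at the cost of rejecting more good borrowers.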


Published at DZone with permission of Rathnadevi Manivannan. See the original article here.
