Predict Bad Loans With H2O Flow AutoML
Learn how to accurately predict bad loan data to help borrowers in making financial decisions and investors in choosing the best investment strategy.
Machine learning algorithms play a key role in accurately predicting loan outcomes for a bank. A central challenge is choosing the models and algorithms that best predict the probability of loan default, so that sound financial decisions can be made. H2O Flow is a web-based interactive computational environment that combines text, code execution, and rich media in a single document.
H2O’s AutoML, an easy-to-use interface for advanced users, automates the machine learning workflow, including the training of a large set of models. Stacked ensembles are used to produce a highly predictive ensemble model that is the top performer in the AutoML Leaderboard. In this blog, let's predict bad loans accurately in order to help borrowers make financial decisions and investors choose the best investment strategy.
Prerequisites
- Install Python 2.7 or 3.5+
- Install H2O Flow with the following packages:
- pip install requests
- pip install tabulate
- pip install scikit-learn
- pip install colorama
- pip install future
- pip install http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/Python/h2o-3.14.0.2-py2.py3-none-any.whl
- After successfully installing H2O, verify that you can connect to the cluster.
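One way to check the connection from Python is sketched below. It assumes the `h2o` package installed above and a Java runtime are available; the function name `check_cluster` is a hypothetical helper, not something the article defines.

```python
def check_cluster():
    """Start (or attach to) a local H2O cluster and report whether it is running."""
    import h2o  # imported lazily so this file loads even where h2o is not installed

    h2o.init()                 # starts a local cluster, or attaches to one already running
    cluster = h2o.cluster()
    cluster.show_status()      # prints version, node count, memory, and related details
    return cluster.is_running()
```

Calling `check_cluster()` should print a status table. With H2O's default settings, the Flow UI is then served at http://localhost:54321.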
Data Description
Lending Club loan data from 2007-2011, with 163K rows and 15 columns, is used as the source file. Lending Club is a peer-to-peer lending platform for both investors and borrowers.
Use Case
- Analyze Lending Club’s loan data.
- Predict bad loans in the dataset using the distributed random forest model and the stacked ensembles in AutoML, based on whether the borrower's loan amount was approved or rejected.
Based on the percentage of bad loans, investors can decide whether to finance a borrower's new loans. For example, a loan is considered "rejected" if its bad loan value is 1.
Synopsis
- Import data from source
- View parsing data
- View job details and dataset summary
- Visualize labels
- Impute data
- Split data
- Run AutoML
- View leaderboard
- Compute variable importance
- View output
Importing Data From Source
To import the data from the source, perform the following:
- Open H2O Flow.
- Click Data > Import Files to import the source files into H2O Flow as shown in the below diagram:
After importing the files, a summary displays the results of the import.
Viewing Parsing Data
On successfully importing these files, click Parse these files to parse the files and to view the details of the source data as shown in the below diagram:
The parsed files contain column names and data types of all features. The data types will be assigned by default and can be changed if required. For example, in our use case, the data type of the response column (bad loan) is changed from numeric to factor (enum).
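For readers who prefer scripting to the Flow UI, the import, parse, and type change can be sketched in Python. This is an illustration under assumptions: the response column is taken to be named `bad_loan`, the file path is whatever you pass in, and `load_loans` is a hypothetical helper.

```python
def load_loans(path, response="bad_loan"):
    """Import a loan file into H2O and mark the response column as categorical."""
    import h2o  # imported lazily so this file loads even where h2o is not installed

    h2o.init()
    loans = h2o.import_file(path)                  # imports and parses in one call
    loans[response] = loans[response].asfactor()   # numeric -> factor (enum)
    return loans
```

Converting the response to a factor matters because H2O treats a numeric response as a regression target and a factor response as a classification target.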
Viewing Job Details and Dataset Summary
After you click Parse, the job details are displayed. Click View to see the summary of the DataFrame.
Loan dataset summary:
Visualizing Labels
In the above summary, each input column contains multiple label values. The data in a column can be visualized by clicking the corresponding column name.
In this section, let's visualize the data in the loan amount and employment length columns.
Loan amount data:
Employment length data:
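Imputing Data
Missing values in a column are imputed in place. The aggregate used as the fill value (for example, the median) is computed on the “na.rm’d” vector, that is, on the column with its missing values removed.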
To impute the data, perform the following:
- Choose the attribute with missing values.
- Click Impute as shown in the below diagram:
- Specify the following details:
- Combine method
On successfully imputing the column with the median values, the summary of the column will be displayed as shown in the below diagram:
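The idea behind median imputation can be shown in a few lines of plain Python. This is a minimal sketch of the concept, not H2O's implementation, and the sample values are made up.

```python
from statistics import median


def impute_median(values):
    """Return a copy of values with None entries replaced by the median of the rest."""
    observed = [v for v in values if v is not None]  # the "na.rm'd" vector
    if not observed:
        raise ValueError("cannot impute a column that has no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


emp_length = [10, None, 3, 1, None, 6]  # made-up employment lengths with gaps
print(impute_median(emp_length))        # prints [10, 4.5, 3, 1, 4.5, 6]
```

Note that the median is computed only from the four observed values, so the missing entries do not influence the fill value.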
Splitting Data
To split the dataset into a training set (70%) and a test set (30%), perform the following:
- Click Assist Me and Split Frame (or click the Data drop-down and select Split Frame) to split the DataFrame. The ratio values are adjusted automatically so that they sum to 1. If you enter an unsupported value, an error is displayed.
- Click Create to view the split frames.
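A 70/30 split can be sketched in plain Python as follows. The sketch assigns each row by a random draw, so the resulting sizes are approximately, not exactly, 70% and 30%; as far as I know H2O's splitter is also probabilistic, but treat that as an assumption. The function name and the seed are illustrative.

```python
import random


def split_rows(rows, train_ratio=0.7, seed=42):
    """Randomly assign each row to a training or test list."""
    if not 0.0 < train_ratio < 1.0:
        # mirrors the UI behaviour of rejecting unsupported ratio values
        raise ValueError("train_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test


train, test = split_rows(list(range(1000)))
print(len(train), len(test))  # roughly 700 and 300
```

Only the training ratio is supplied; the test share is the remainder, which is the same "ratios sum to 1" behaviour described above.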
Running AutoML
To run AutoML, perform the following:
- Select Model > Run AutoML as shown in the below diagram:
- Provide the following details as shown in the below diagram:
- Training frame: Select the dataset to build the model.
- Response column: Select the column to be used as the dependent variable. It is required only for GLM, GBM, DL, DRF, and Naïve Bayes (classification) models.
- Fold column: Select the column that holds the cross-validation fold index assignment for each observation (optional in AutoML).
- Weight column: Select the column that holds the per-row observation weights. Weights do not increase the size of the data; during training, rows with higher weights matter more because of the larger loss function pre-factor.
- Validation frame: Select the dataset to evaluate the model accuracy (optional).
- Leaderboard frame: Specify the Leaderboard Frame when configuring the AutoML run. If not specified, the Leaderboard Frame will be created from the Training Frame. The output models with best results will be displayed on the Leaderboard.
- Max models: Specify the maximum number of models to be built in an AutoML run.
- Max runtime seconds: Controls execution time of AutoML run (default time is 3,600s).
- Stopping rounds: Stops training when the stopping_metric doesn't improve for a specified number of training rounds, based on a simple moving average. Specify 0 to disable this feature.
- Stopping tolerance: Specify the relative amount by which the stopping metric must improve; if the improvement is smaller than this value, training stops.
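The interplay of stopping rounds and stopping tolerance can be illustrated with a small sketch. It is a simplified reading of the rule described above, not H2O's exact implementation, and it assumes a metric where lower is better (such as logloss). The function name is hypothetical.

```python
def should_stop(history, stopping_rounds=3, stopping_tolerance=0.001):
    """Decide whether training should stop, given metric values per scoring round.

    history: list of metric values, oldest first; lower is better.
    """
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # feature disabled, or not enough rounds to compare yet
    # Simple moving averages over a window of k rounds.
    averages = [sum(history[i:i + k]) / k for i in range(len(history) - k + 1)]
    latest = averages[-1]
    reference = min(averages[-k - 1:-1])  # best of the k preceding moving averages
    # Stop unless the latest average beats the reference by the relative tolerance.
    return latest > reference * (1.0 - stopping_tolerance)


print(should_stop([1.0, 0.9, 0.8, 0.7, 0.6, 0.5]))  # False: still improving
print(should_stop([0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))  # True: no improvement
```

Averaging over a window keeps a single noisy round from triggering an early stop, which is the reason a moving average is used instead of the raw metric.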
Viewing Leaderboard
The Leaderboard displays the models with the best results first, as shown in the below diagram:
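The same run can be scripted with the H2O Python package. The sketch below is hedged: the response column name `bad_loan`, the helper name `run_automl`, and the values for `max_models`, `stopping_rounds`, `stopping_tolerance`, and `seed` are illustrative assumptions, and the parameters available can differ between H2O versions.

```python
def run_automl(train, test, response="bad_loan"):
    """Run AutoML on H2OFrames and return the top model on the leaderboard."""
    from h2o.automl import H2OAutoML  # lazy import; requires the h2o package

    predictors = [c for c in train.columns if c != response]
    aml = H2OAutoML(
        max_models=10,            # illustrative cap on the number of models
        max_runtime_secs=3600,    # the default run time mentioned above
        stopping_rounds=3,        # illustrative value
        stopping_tolerance=0.001, # illustrative value
        seed=1,                   # for reproducibility
    )
    aml.train(
        x=predictors,
        y=response,
        training_frame=train,
        leaderboard_frame=test,   # models are ranked on this frame
    )
    print(aml.leaderboard)        # best model first
    return aml.leader
```

Passing the held-out test frame as the leaderboard frame means the ranking reflects performance on data the models were not trained on.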
Computing Variable Importance
The importance of each variable to the model is computed in an algorithm-dependent way, and the variables are listed in order from most to least important.
The percentage importance of all variables is scaled to 100. The scaled importance value of the variables is shown in the below diagram:
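The scaling itself is simple arithmetic, sketched below in plain Python. The variable names and raw importance values are made up for illustration; the sketch assumes the scaled value is relative to the top variable and the percentage is each variable's share of the total.

```python
def scale_importance(relative):
    """Turn raw importances into (name, raw, scaled, percentage) rows, best first.

    relative: dict mapping variable name -> raw (relative) importance.
    """
    top = max(relative.values())
    total = sum(relative.values())
    if top <= 0:
        raise ValueError("at least one variable must have positive importance")
    rows = [
        (name, value, value / top, 100.0 * value / total)
        for name, value in relative.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


raw = {"annual_inc": 30.0, "int_rate": 50.0, "term": 20.0}  # made-up values
for name, value, scaled, percent in scale_importance(raw):
    print(name, value, round(scaled, 2), round(percent, 1))
```

In this example the most important variable gets a scaled importance of 1.0, and the percentages of all variables add up to 100.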
Viewing Output
Predicted model of loan dataset:
Conclusion
In this blog, AutoML, a distributed random forest model, and stacked ensembles are used to build and test the best model for predicting loan default. The data is analyzed to obtain a cut-off value. Investors use this cut-off value to decide on the best investment strategy for loans and to determine which applicants receive loans.
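How a cut-off turns predicted probabilities into decisions can be sketched as follows. The article does not say how its cut-off was chosen; picking the threshold that maximizes the F1 score is one common choice and is used here purely as an assumption. The probabilities and labels are made up.

```python
def classify(p_bad, cutoff):
    """Return 1 (bad loan, reject) if the predicted probability reaches the cutoff."""
    return 1 if p_bad >= cutoff else 0


def best_f1_cutoff(probs, labels):
    """Try each predicted probability as a cutoff and keep the one with the best F1."""
    best_cutoff, best_f1 = None, -1.0
    for cutoff in sorted(set(probs)):
        preds = [classify(p, cutoff) for p in probs]
        tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
        fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
        fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
        f1 = 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_cutoff, best_f1 = cutoff, f1
    return best_cutoff, best_f1


probs = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities of a bad loan
labels = [0, 0, 1, 1]          # made-up true outcomes (1 = bad loan)
print(best_f1_cutoff(probs, labels))
```

A lower cut-off rejects more applicants and misses fewer bad loans; a higher one approves more applicants at greater risk. The right balance depends on the investor's strategy.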
Published at DZone with permission of Rathnadevi Manivannan. See the original article here.