Predict Loan Default Using Seahorse and SparkR


Loans can be risky, to say the least, so the ability to predict loan defaults is valuable. Learn how to predict loan defaults using data from Lending Club, the largest online marketplace connecting borrowers and investors.


Data scientists often use Python and R to solve data problems because of the wide availability of packages in these languages. These environments are limited, however: data is typically processed on a single machine, and moving work from the development environment to the production environment is time-consuming and requires extensive re-engineering.

To overcome this problem, Spark provides a powerful, unified engine that is both fast (up to 100x faster than Hadoop for large-scale data processing) and easy to use for data scientists and data engineers alike. It is simple, scalable, and easy to integrate with other tools. Seahorse, a scalable data analytics workbench, lets data scientists visually build Spark applications and collaboratively perform data preparation, transformation, modeling, training, analysis, and visualization. Seahorse's built-in operations allow data scientists to customize parameter values.

In this blog, let's discuss predicting loan defaults at Lending Club. Lending Club is the world's largest online marketplace connecting borrowers and investors.

Prerequisites


  • VirtualBox (version 5.0.10)
  • Vagrant (version 1.8.1)
  • Google Chrome (60.0.3112.113)

Data Description

Loan data of Lending Club from 2007 to 2011, with 40K records, is used as the source file. Each loan has more than 100 characteristics of the loan and the borrower.

Use Case

  • Analyze loan data of Lending Club.
  • Predict loan default in the Lending Club dataset by building a data model using logistic regression.

Loan status falls under two categories: Charged Off (default loan) and Fully Paid (desirable loan). Lending Club defines Charged Off loans as non-collectible loans for which the lender has no reasonable expectation of recovering the money.

The overall workflow consists of the following steps:


  • Read data from source
  • Prepare data
  • Train and evaluate data model
  • Visualize data

Workflow Operations

In Seahorse, all machine learning processes are represented as operations. R Transformation operations are used to clean and prepare the data. The operations used for the Lending Club loan data analysis are as follows:

  • Input/output: Read DataFrame
  • Action: Fit, transform, evaluate
  • Set Operation: Split
  • Filtering: Filter columns, handle missing values
  • Transformation: SQL transformation, R transformation
  • Feature conversion: String indexer, one hot encoder, assemble vector
  • Machine learning: Logistic regression from classification, binary classification evaluator from evaluation


Reading Data From Source

Seahorse supports three file formats (CSV, Parquet, and JSON) and several types of data sources: HDFS, database, local file system, and Google Spreadsheets. The Read DataFrame operation reads files from these data sources and loads them into the Seahorse library.
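
As a rough illustration of what Read DataFrame produces, the sketch below parses a small CSV with Python's standard csv module (Seahorse also supports Python transformations). The file contents are toy values, not the real Lending Club schema; Seahorse itself does this at scale with Spark.

```python
# Minimal sketch of reading CSV loan data into row dictionaries.
# The columns and values here are illustrative only.
import csv, io

raw = """loan_amnt,int_rate,loan_status
5000,10.65%,Fully Paid
2500,15.27%,Charged Off
"""

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))             # 2 records
print(rows[0]["int_rate"])   # 10.65%
```

Note that int_rate arrives as the string "10.65%", which is why the preparation step below strips the % character before modeling.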


Preparing Data

To prepare the data for analysis, perform the following:

  • Remove irrelevant data (loan ID, URL, and so on), poorly documented data (average current balance), and less important features (payment plan, home state) from the source data.
  • Use the Filter Columns operation to select 17 key features from the dataset as shown in the below diagram:


  • Use the R Transformation operation to write any custom function in R.
  • Convert string columns into numeric columns by removing special characters and duplicate data. For example, convert int_rate and revol_util columns into numeric columns by removing special characters (%).
transform <- function(dataframe) {
  # Convert the Spark DataFrame into a local R data frame using collect()
  dataframe <- collect(dataframe)
  # Remove the special character (%) so the columns become numeric
  dataframe$int_rate <- as.numeric(gsub("%", "", dataframe$int_rate))
  dataframe$revol_util <- as.numeric(gsub("%", "", dataframe$revol_util))
  # Convert string to numeric by removing the repeated word "months"
  dataframe$term <- as.numeric(gsub("months", "", dataframe$term))
  # Reduce the factor levels of some feature columns
  dataframe$home_ownership[dataframe$home_ownership == "NONE"] <- "OTHER"
  # "Verified" and "Source Verified" mean the same, so merge them into a single state
  dataframe$verification_status[dataframe$verification_status == "Source Verified"] <- "Verified"
  dataframe$loan_status[dataframe$loan_status == "Does not meet the credit policy. Status:Charged Off"] <- "Charged Off"
  dataframe$loan_status[dataframe$loan_status == "Does not meet the credit policy. Status:Fully Paid"] <- "Fully Paid"
  # Return the cleaned data frame
  dataframe
}
  • Derive new features from the date columns by applying feature engineering. For example, derive issue_year and issue_month from issue_d, and cr_line_year and cr_line_month from earliest_cr_line.
transform <- function(dataframe) {
  dataframe <- collect(dataframe)
  # Add a default day value so the month-year strings parse as dates
  dataframe$issue_d <- as.Date(paste("01-", dataframe$issue_d, sep = ""), "%d-%b-%Y")
  dataframe$earliest_cr_line <- as.Date(paste("01-", dataframe$earliest_cr_line, sep = ""), "%d-%b-%Y")
  # Get the year from the date columns
  dataframe$issue_year <- as.numeric(format(dataframe$issue_d, "%Y"))
  dataframe$cr_line_year <- as.numeric(format(dataframe$earliest_cr_line, "%Y"))
  # Get the month from the date columns
  dataframe$issue_month <- as.numeric(format(dataframe$issue_d, "%m"))
  dataframe$cr_line_month <- as.numeric(format(dataframe$earliest_cr_line, "%m"))
  # Drop the original date columns
  dataframe$issue_d <- NULL
  dataframe$earliest_cr_line <- NULL
  # Return the data frame with the derived features
  dataframe
}

The derived features are shown in the below diagram:

After Preprocessing

After preprocessing, perform the following:

  • Use the Handle Missing Values operation to find the rows with missing values and to handle them with the selected strategies such as remove row, remove column, custom value, and mode. For example, provide custom values for NAs and empty strings.
  • Select numeric and string columns from the DataFrame and choose the remove row strategy, as shown in the below diagram:
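
The two strategies used here can be sketched in plain Python (Seahorse also supports Python transformations); the records and column names below are toy examples, with None and empty strings standing in for NAs.

```python
# Sketch of Handle Missing Values: custom value, then remove row.
records = [
    {"annual_inc": 60000, "emp_length": "10+ years"},
    {"annual_inc": None,  "emp_length": ""},
    {"annual_inc": 45000, "emp_length": "4 years"},
]

# Strategy: custom value -- replace missing entries with defaults
for r in records:
    if r["annual_inc"] is None:
        r["annual_inc"] = 0
    if r["emp_length"] == "":
        r["emp_length"] = "n/a"

# Strategy: remove row -- drop any record that still has a missing value
records = [r for r in records if all(v not in (None, "") for v in r.values())]
```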


  • Use String Indexer to map the categorical features into numbers.
  • Choose the columns from the DataFrame using name, index, or type.
  • Select string type columns from the DataFrame and apply the string indexer to those columns. For example, after the String Indexer execution, Fully Paid becomes 0 and Charged Off becomes 1 in the loan_status column.
  • Use the One Hot Encoder operation to convert categorical values into numbers in a fixed range. Each categorical column is expanded into a vector with one entry corresponding to each possible value of the feature.
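
The two encodings can be sketched in plain Python on a toy column. Like Spark's StringIndexer, the most frequent label gets index 0; note that Spark's OneHotEncoder drops the last category by default, while a full encoding is shown here for simplicity.

```python
# Sketch of String Indexer and One Hot Encoder on a toy loan_status column.
from collections import Counter

status = ["Fully Paid", "Charged Off", "Fully Paid"]

# String Indexer: order labels by frequency, then map each value to its index.
labels = [lab for lab, _ in Counter(status).most_common()]
indexed = [labels.index(s) for s in status]   # Fully Paid -> 0, Charged Off -> 1

# One Hot Encoder: expand each index into a 0/1 vector.
one_hot = [[1 if i == idx else 0 for i in range(len(labels))] for idx in indexed]
```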


  • Use the Assemble Vector operation to group all relevant columns together and to form a column with a single vector of all the features. For example, the loan_status column is a prediction variable and all other columns are features.
  • Use Excluding Mode to select all the columns other than the prediction variable.
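
The assemble step amounts to collecting every column except the prediction variable into one numeric vector per row. A toy Python sketch of this "excluding" selection (column names are illustrative):

```python
# Sketch of Assemble Vector: one features vector per row, excluding loan_status.
rows = [
    {"loan_amnt": 5000, "int_rate": 10.65, "loan_status": 0},
    {"loan_amnt": 2500, "int_rate": 15.27, "loan_status": 1},
]

feature_cols = [c for c in rows[0] if c != "loan_status"]   # excluding mode
features = [[float(r[c]) for c in feature_cols] for r in rows]
print(features[0])   # [5000.0, 10.65]
```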


Training and Evaluating Data Model

To split the dataset into a training set and a validation set using the Split operation, perform the following:

  • Use 0.7 as a split ratio to split 70% of the data in the training set and 30% of the data in the validation set.
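
The effect of a 0.7 split ratio can be sketched in plain Python: shuffle the rows deterministically, then cut at 70% (Seahorse's Split operation does the equivalent on a Spark DataFrame).

```python
# Sketch of a 70/30 random split on stand-in row indices.
import random

rows = list(range(100))   # stand-ins for DataFrame rows
random.seed(42)           # reproducible shuffle
random.shuffle(rows)

cut = int(0.7 * len(rows))
training, validation = rows[:cut], rows[cut:]
print(len(training), len(validation))   # 70 30
```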


  • Use Logistic Regression and Fit operations to perform model training.
  • Use the Fit operation to fit the estimator and produce a transformer.
  • In the Fit operation, select the features column and the prediction variable.
  • Select the maximum iterations and threshold values for the model. The fitted model outputs predicted values along with confidence scores in the raw prediction and probability columns.
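
To make the fit/transform idea concrete, here is a toy logistic regression trained by gradient ascent in plain Python. Seahorse delegates the real training to Spark MLlib; the single feature, data values, learning rate, and iteration count below are all made up for illustration.

```python
# Toy logistic regression: fit weights, then "transform" by adding
# probability and thresholded prediction columns.
import math

int_rate    = [5, 7, 9, 16, 18, 22]
loan_status = [0, 0, 0, 1, 1, 1]        # 1 = Charged Off

w, b = 0.0, 0.0
for _ in range(2000):                   # "maximum iterations"
    for x, y in zip(int_rate, loan_status):
        p = 1 / (1 + math.exp(-(w * x + b)))
        w += 0.01 * (y - p) * x         # gradient ascent on the log-likelihood
        b += 0.01 * (y - p)

probability = [1 / (1 + math.exp(-(w * x + b))) for x in int_rate]
prediction  = [int(p > 0.5) for p in probability]   # 0.5 threshold
```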


  • Use the Evaluate action with Binary Classification Evaluator to find the performance of the model.
  • Find AUC, F-Score, and Recall values from the Binary Classification Evaluator and select AUC as a metric for the model.
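
The AUC metric reported by the evaluator is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small sketch with made-up labels and scores:

```python
# Sketch of AUC as the fraction of correctly ranked positive/negative pairs
# (ties count as half).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.8, 0.4, 0.7, 0.6]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)   # 8 of 9 pairs ranked correctly
```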


  • Use custom functions (R or Python transformation) to find the confusion matrix of the model and derive the metrics for that model.
  • Use SQL Transformation to write a custom Spark SQL query and retrieve the correctly and wrongly predicted values from the DataFrame.
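
A custom transformation for the confusion matrix boils down to counting agreement between the predicted and actual labels, much as the SQL step would select rows where prediction equals the label. A toy sketch (the label arrays are made up):

```python
# Sketch of deriving the confusion matrix and metrics from predicted
# vs. actual loan_status (1 = Charged Off).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_score   = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn)   # confusion matrix cells
```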


Visualizing Data

Let's look at the DataFrame report, column visualizations, and more.

DataFrame Report

In the DataFrame report, every column has plots based on its datatype.

Int_rate Column Visualization

For continuous features, the bar chart is used for data visualization, as shown in the below diagram:


Grade Column Visualization

For discrete features, the pie chart is used for data visualization, as shown in the below diagram:


To create a custom plot, such as a combination of two column values, use custom operations such as the R, Python, or SQL Transformation, or a Python or R notebook.



Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

