Variable Selection and Big Data Analytics in Credit Score Modeling
Variable selection is a critical step in credit score modeling for finding the key information. Learn how to do it and gain a good understanding of your data!
"Doing more with less" is the main philosophy of credit intelligence, and credit risk models are the means to achieve this goal. By automating the process and focusing on the key information, lenders can make credit decisions in seconds, which in turn reduces operational costs. Fewer questions and rapid credit decisions ultimately increase customer satisfaction. For lenders, this means expanding their customer base, taking on lower-risk customers, and increasing profit.
How can we achieve parsimony, and what is the key information to look for? The answer is found during the next step of the credit risk modeling process: the variable selection process.
The mining view created as the result of data preparation is a multidimensional, unique customer signature used to discover potentially predictive relationships and to test the strength of those relationships. A thorough analysis of the customer signature is an important step in creating a set of testable hypotheses based on the characteristics it contains. Often referred to as business insights, this analysis provides an interpretation of trends in customer behavior and helps direct the modeling process.
The purpose of the business insights analysis is to:
- Validate that the derived customer data is in line with business understanding. For example, insight analysis should support the business statement that customers with a higher debt-to-income ratio are more likely to default.
- Provide benchmarks for analyzing model results.
- Shape the modeling methodology.
Business insights analysis uses techniques similar to those of exploratory data analysis, combining univariate and multivariate statistics with different data visualization techniques. Typical techniques are correlation, cross-tabulation, distribution, time-series analysis, and supervised and unsupervised segmentation analysis. Segmentation is of special importance, as it determines when multiple scorecards are needed.
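As an illustration of one of these techniques, the debt-to-income statement mentioned earlier can be checked with a simple cross-tabulation of default rate by band. This is a minimal sketch only: the records, the band edges, and the names `BANDS`, `band_of`, and `default_rate_by_band` are all hypothetical and chosen for the example.

```python
from collections import defaultdict

# Hypothetical customer records: (debt_to_income_ratio, defaulted flag).
customers = [
    (0.10, 0), (0.15, 0), (0.18, 0), (0.19, 0),
    (0.25, 0), (0.30, 0), (0.32, 1), (0.38, 0),
    (0.45, 1), (0.50, 0), (0.55, 1), (0.60, 1),
]

# Illustrative band edges (an assumption); real cut-offs would come from the business.
BANDS = [(0.0, 0.2, "low"), (0.2, 0.4, "medium"), (0.4, float("inf"), "high")]


def band_of(dti):
    """Return the label of the band whose [lower, upper) range contains dti."""
    for lower, upper, label in BANDS:
        if lower <= dti < upper:
            return label
    raise ValueError(f"DTI {dti} is outside every band")


def default_rate_by_band(records):
    """Cross-tabulate band against outcome and return the default rate per band."""
    counts = defaultdict(lambda: [0, 0])  # label -> [customers, defaults]
    for dti, defaulted in records:
        cell = counts[band_of(dti)]
        cell[0] += 1
        cell[1] += defaulted
    return {label: defaults / total for label, (total, defaults) in counts.items()}


rates = default_rate_by_band(customers)
for _, _, label in BANDS:
    print(f"{label:>6}: {rates[label]:.2f}")
```

If the default rate rises from band to band, as it does in this made-up sample, the data agrees with the business expectation. A flat or reversed pattern would be a reason to investigate the data before modeling.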
Variable selection, based on the results of the business insights analysis, starts by splitting the mining view into at least two partitions: a training partition and a testing partition. The training partition is used to develop the model, and the testing partition is used to assess the model's performance and validate the model.
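The partitioning step might look like the sketch below. It assumes a stratified split, so that both partitions keep roughly the same share of bad customers; the article does not prescribe this. The 70/30 ratio, the fixed seed, the `stratified_split` helper, and the `mining_view` records are illustrative assumptions.

```python
import random


def stratified_split(records, label_of, test_fraction=0.3, seed=42):
    """Split records into (train, test), preserving the share of each label."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    by_label = {}
    for record in records:
        by_label.setdefault(label_of(record), []).append(record)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]  # copy so the caller's data is left untouched
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical mining view: 100 customers, 10 of them "bad" (label 1).
mining_view = [{"customer_id": i, "bad": int(i < 10)} for i in range(100)]
train, test = stratified_split(mining_view, label_of=lambda r: r["bad"])

print(len(train), len(test))
print(sum(r["bad"] for r in train), sum(r["bad"] for r in test))
```

Stratifying matters most when bad customers are rare, because a purely random split can leave the testing partition with too few of them to assess the model reliably.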
Figure 1: Simplified scorecard model building process
Variable selection works on a collection of candidate model variables that are tested for significance during model training. Candidate model variables are also known as independent variables, predictors, attributes, model factors, covariates, regressors, features, or characteristics.
Variable selection is a parsimonious process that aims to identify a minimal set of predictors for maximum gain (predictive accuracy). This approach is the opposite of data preparation, where as many meaningful variables as possible are added to the mining view. These opposing requirements are achieved using optimization; that is, finding the minimal selection bias under the given constraints.
The key objective is to find the right set of variables, so that the scorecard model can not only rank customers by their likelihood of bad debt but also estimate the probability of that bad debt. This usually means selecting statistically significant variables for the predictive model and keeping a balanced set of predictors (8-15 is usually considered a good balance) that converges to a 360-degree customer view. In addition to customer-specific risk characteristics, we should also consider including systematic risk factors to account for economic drifts and volatilities.
This is easier said than done, because a number of limitations apply when selecting variables. First, some highly predictive variables usually cannot be used because legal, ethical, or regulatory rules prohibit them. Second, some variables might be unavailable or of poor quality during the modeling or production stages. In addition, there might be important variables that have not been recognized as such, for example, because of a biased population sample, or because their model effect would be counter-intuitive as a result of multicollinearity. Finally, the business will always have the last word and might insist that only business-sound variables are included, or request monotonically increasing or decreasing effects.
All of these constraints are potential sources of bias, which leaves data scientists with the challenging task of minimizing selection bias. Typical preventive measures during variable selection include:
- Collaboration with experts in the field to identify the important variables.
- Awareness of any problems in relation to data source, reliability or mismeasurement.
- Cleaning the data.
- Using control variables to account for banned variables or specific events such as an economic drift.
It is important to recognize that variable selection is an iterative process that runs throughout model building:
- It starts before model fitting, by reducing the variables in the mining view to a manageable set of candidate variables.
- It continues during model training, where further reduction results from statistical insignificance, multicollinearity, low contribution, or penalization to avoid overfitting.
- It carries on during model evaluation and validation.
- It concludes during business approval, where model readability and interpretability play an important part.
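One of the reductions listed above, removing multicollinear candidates, can be approximated with a pairwise correlation filter. The sketch below is one simple way to do it, not the only one (variance inflation factors are a common alternative). The 0.8 threshold, the `pearson` and `drop_correlated` helpers, and the sample variables are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(candidates, threshold=0.8):
    """Keep variables in the given order, skipping any that correlates
    too strongly (in absolute value) with a variable already kept."""
    kept = []
    for name, values in candidates.items():
        if all(abs(pearson(values, candidates[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate variables, listed in order of business preference.
candidates = {
    "debt_to_income": [0.10, 0.25, 0.40, 0.55, 0.70, 0.30],
    "total_debt": [5, 13, 20, 28, 36, 15],  # nearly collinear with debt_to_income
    "months_at_address": [48, 3, 60, 12, 30, 24],
}

print(drop_correlated(candidates))
```

Because the filter keeps the first variable of each correlated pair, the order of the candidates encodes which one the business would rather retain.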
Variable selection finishes once the "sweet spot" has been reached, that is, when no further improvement in model accuracy can be achieved.
Figure 2: Iterative nature of variable selection process
A plethora of variable selection methods is available, and advances in machine learning keep adding to their number. The techniques differ in whether they use variable reduction or variable elimination (filtering); whether selection is carried out inside or outside the predictive model; whether learning is supervised or unsupervised; and whether the underlying methods rely on specific embedded techniques such as cross-validation.
Table 1: Variable selection methods typical in credit risk modeling
Figure 3: Variable selection using bivariate analysis
In credit risk modeling, two of the most commonly used variable selection methods are information value, for filtering prior to model training, and stepwise selection, for selecting variables while training a logistic regression model. Although both receive some criticism from practitioners, it is important to recognize that no ideal methodology exists, as each variable selection method has its pros and cons. Deciding which one to use and how best to combine them is not easy and requires solid domain knowledge, a good understanding of the data, and extensive modeling experience.
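To make the first of these methods concrete, the sketch below computes weight of evidence (WoE) and information value (IV) for binned variables, using the common definitions WoE = ln(share of goods / share of bads) per bin and IV = sum over bins of (share of goods - share of bads) × WoE. The counts, the variable names, the smoothing constant, and the 0.02 cut-off (a frequently quoted rule of thumb) are assumptions for illustration, not values taken from the article.

```python
import math


def information_value(bins, smoothing=0.5):
    """Return (WoE per bin, total IV) for one binned variable.

    bins maps a bin label to (good_count, bad_count). The smoothing constant
    is an assumption added here so that empty cells do not produce log(0).
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    n_bins = len(bins)

    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        pct_good = (good + smoothing) / (total_good + smoothing * n_bins)
        pct_bad = (bad + smoothing) / (total_bad + smoothing * n_bins)
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv


# Hypothetical binned candidates: label -> (good customers, bad customers).
candidates = {
    "debt_to_income": {"low": (400, 10), "medium": (350, 30), "high": (150, 60)},
    "customer_id_parity": {"even": (450, 50), "odd": (450, 50)},
}

IV_CUTOFF = 0.02  # rule-of-thumb threshold, treated as an assumption here

selected = []
for name, bins in candidates.items():
    woe, iv = information_value(bins)
    print(f"{name}: IV = {iv:.4f}")
    if iv >= IV_CUTOFF:
        selected.append(name)

print(selected)
```

A variable whose bins separate good and bad customers well receives a high IV and passes the filter, while one whose bins all have the same good-to-bad mix scores close to zero and is dropped before model training.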
Published at DZone with permission of Natasha Mashanovich, DZone MVB. See the original article here.