Scorecard development describes how to turn data into a scorecard model, assuming that data preparation and the initial variable selection process (filtering) have been completed, and a filtered training dataset is available for the model building process.
The development process consists of four main parts: variable transformations, model training using logistic regression, model validation, and scaling.
Figure 1: Standard scorecard development process
"If you torture the data long enough, it will confess to anything." (Ronald Coase, Economist)
A standard scorecard model, based on logistic regression, is an additive model. Hence, special variable transformations are required. The commonly adopted transformations — fine classing, coarse classing, and either dummy coding or weight of evidence (WOE) transformation — form a sequential process providing a model outcome that is both easy to implement and explain to the business. Additionally, these transformations assist in converting non-linear relationships between independent variables and the dependent variable into a linear relationship: the customer behavior often requested by the business.
Fine classing is applied to all continuous variables and those discrete variables with high cardinality. This is the process of initial binning into typically between 20 and 50 fine granular bins.
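As a rough illustration of fine classing, the sketch below splits a continuous variable into roughly equal-frequency bins. The function name `fine_class` and the choice of equal-frequency cut points are assumptions made for this example; equal-width bins are another common choice.

```python
import bisect


def fine_class(values, n_bins=20):
    """Assign each value to one of up to n_bins equal-frequency bins.

    Returns (edges, bins): the interior cut points and the bin index
    of each input value. Hypothetical helper for illustration only.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Interior cut points at evenly spaced ranks; duplicate cut points
    # (from heavily tied data) are dropped, which can yield fewer bins.
    edges = sorted({ordered[(i * n) // n_bins] for i in range(1, n_bins)})
    # bisect_right puts a value equal to an edge into the higher bin.
    bins = [bisect.bisect_right(edges, v) for v in values]
    return edges, bins


edges, bins = fine_class(list(range(100)), n_bins=20)
```

With 100 distinct values and 20 bins, each fine bin holds five observations.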
Coarse classing is where a binning process is applied to the fine granular bins to merge those with similar risk and create fewer bins, usually up to ten. The purpose is to achieve simplicity by creating fewer bins, each with a distinctly different risk level, while minimizing information loss. However, to create a robust model that is resilient to overfitting, each bin should contain a sufficient share of the total accounts (5% is the minimum recommended by most practitioners). These opposing goals can be balanced through optimal binning, an optimization that maximizes a variable's predictive power during the coarse classing process.
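One simple way to picture coarse classing is a greedy merge: any bin holding less than the minimum share of accounts is folded into the neighbouring bin with the closest bad rate. This is only a sketch of the idea and not an optimal binning algorithm; the function name `merge_small_bins` and the `(n_good, n_bad)` input layout are assumptions.

```python
def merge_small_bins(bins, min_share=0.05):
    """Greedily merge undersized bins into their most similar neighbour.

    bins: ordered list of (n_good, n_bad) counts, one pair per fine bin.
    Hypothetical helper; real optimal binning optimizes a measure
    such as information value instead of merging greedily.
    """
    bins = [tuple(b) for b in bins]
    total = sum(g + b for g, b in bins)

    def bad_rate(counts):
        g, b = counts
        return b / (g + b) if (g + b) else 0.0

    while len(bins) > 1:
        # Pick the smallest bin; stop once it meets the minimum share.
        idx = min(range(len(bins)), key=lambda i: sum(bins[i]))
        if sum(bins[idx]) / total >= min_share:
            break
        # Merge with the adjacent bin whose bad rate is closest.
        neighbours = [j for j in (idx - 1, idx + 1) if 0 <= j < len(bins)]
        target = min(
            neighbours,
            key=lambda j: abs(bad_rate(bins[j]) - bad_rate(bins[idx])),
        )
        lo, hi = sorted((idx, target))
        merged = (bins[lo][0] + bins[hi][0], bins[lo][1] + bins[hi][1])
        bins[lo:hi + 1] = [merged]
    return bins


coarse = merge_small_bins([(400, 20), (20, 2), (350, 50), (130, 28)])
```

Here the second bin holds only 2.2% of accounts, so it is merged with the third bin, whose bad rate is the closer of its two neighbours.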
Optimal binning utilizes the same statistical measures used during variable selection, such as information value, Gini, and chi-square statistics. The most popular measure is, again, information value, although a combination of two or more measures is often beneficial. Missing values, if they contain predictive information, should form a separate class or be merged into a bin with similar risk.
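To make the information value measure concrete, here is a minimal sketch that computes it from per-bin counts of good and bad accounts. The function name and input layout are assumptions, and the sketch assumes every bin contains at least one good and one bad account, since the logarithm is undefined otherwise.

```python
import math


def information_value(bins):
    """Information value of a binned variable.

    bins: list of (n_good, n_bad) counts, one pair per coarse class.
    Assumes no bin has a zero good or bad count.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for good, bad in bins:
        pct_good = good / total_good  # share of all goods in this bin
        pct_bad = bad / total_bad     # share of all bads in this bin
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv


iv = information_value([(80, 20), (20, 80)])
```

A variable whose bins all have the same mix of goods and bads has an information value of zero; the more the bins differ, the higher the value.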
Dummy coding is the process of creating binary (dummy) variables for all coarse classes except the reference class. This approach may present issues, as the extra variables require more memory and processing resources, and overfitting may occasionally arise because of the reduced degrees of freedom.
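A small sketch of dummy coding in plain Python follows. The function name `dummy_code` and the example class labels are made up for illustration; in practice a data-frame library would usually do this step.

```python
def dummy_code(classes, reference):
    """Create one 0/1 column per coarse class, skipping the reference class.

    Returns (levels, rows): the column labels and one row of
    indicators per observation. Hypothetical helper for illustration.
    """
    levels = sorted(set(classes) - {reference})
    rows = [[1 if c == level else 0 for level in levels] for c in classes]
    return levels, rows


levels, rows = dummy_code(["low", "mid", "high", "mid"], reference="low")
```

An observation in the reference class is represented by a row of zeros, which is why a variable with k coarse classes adds k - 1 columns to the model.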
Weight of Evidence (WOE) Transformation
This is the alternative (and more favored) approach to dummy coding. It substitutes each coarse class with a risk value and, in turn, collapses the risk values into a single numeric variable. The numeric variable describes the relationship between an independent variable and the dependent variable. The WOE framework is well suited for logistic regression modeling, as both are based on log-odds calculations. In addition, WOE transformation puts all independent variables on the same scale, so the parameters in a subsequent logistic regression can be compared directly. The main drawback of this approach is that it considers only the relative risk of each bin, not the proportion of accounts in each bin. The information value can be used instead to assess the relative contribution of each bin.
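The sketch below shows one way the WOE substitution might look. It assumes the common convention WOE = ln(share of goods / share of bads) for each class; some texts use the inverse ratio, which only flips the sign. The function names and input layout are assumptions, and every class is assumed to contain at least one good and one bad account.

```python
import math


def woe_table(bins):
    """Map each coarse class to its weight of evidence.

    bins: dict of class label -> (n_good, n_bad).
    Convention assumed here: ln(share of goods / share of bads).
    """
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    return {
        label: math.log((g / total_good) / (b / total_bad))
        for label, (g, b) in bins.items()
    }


def woe_transform(values, table):
    """Replace each class label with its WOE value."""
    return [table[v] for v in values]


table = woe_table({"A": (80, 20), "B": (20, 80)})
transformed = woe_transform(["A", "B", "A"], table)
```

Under this convention, a positive WOE marks a class with proportionally more goods than bads, and a negative WOE marks the opposite.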
Dummy coding and WOE transformation give similar results. The choice between them depends mainly on the data scientist's preference.
But be warned: optimal binning, dummy coding, and weight of evidence transformation are time-consuming processes when carried out manually. A software package for binning, optimization, and WOE transformation is therefore extremely useful and highly recommended.
Figure 2: Automated optimal binning and WOE transformation with World Programming software
Model Training and Scaling
Logistic regression is a commonly used technique in credit scoring for solving binary classification problems. Prior to model fitting, another iteration of variable selection is valuable to check whether the newly WOE-transformed variables are still good model candidates. Preferred candidate variables are those with higher information value (usually between 0.1 and 0.5) that have a linear relationship with the dependent variable, have good coverage across all categories, have a normal distribution, contain a notable overall contribution, and are relevant to the business.
Many analytics vendors include the logistic regression model in their software products, usually with an extensive range of statistical and graphical functions. For example, the implementation of the SAS language PROC LOGISTIC in WPS offers a comprehensive set of options, including automated variable selection, restriction of model parameters, weighted variables, separate analyses for different segments, scoring on a different dataset, and generation of automated deployment code.
Once the model has been fitted, the next step is to adjust the model to a scale desired by the business. This is known as scaling. Scaling acts as a measuring instrument that provides consistency and standardization of scores across different scorecards. The minimum and maximum score values and the score range help in risk interpretation and should be reported to the business. Often, the business requirement is to use the same score range for multiple scorecards so they all have the same risk interpretation.
A popular scoring method creates discrete scores on a logarithmic scale, where the odds double every pre-determined number of points. This requires specifying three parameters: base points (for example, 600), base odds (for example, 50:1), and points to double the odds (for example, 20). Score points correspond to each of the bins of the model variables, while the model intercept is translated into the base points. The scaling output, with its tabulated allocation of points, represents the actual scorecard model.
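The scaling described above can be sketched as a linear function of the log-odds. The sketch assumes the fitted logistic regression predicts the log-odds of a good outcome on WOE-transformed variables; the function names and the example coefficients are hypothetical.

```python
import math


def scaling_parameters(base_points=600, base_odds=50, pdo=20):
    """Return (factor, offset) so that score = offset + factor * ln(odds).

    pdo is the number of points that doubles the odds.
    """
    factor = pdo / math.log(2)
    offset = base_points - factor * math.log(base_odds)
    return factor, offset


def score_from_log_odds(log_odds_good, factor, offset):
    """Convert log-odds of good into a score."""
    return offset + factor * log_odds_good


def scorecard_points(intercept, coefficients, woe_by_bin, factor, offset):
    """Allocate points to each bin of each model variable.

    coefficients: {variable: beta}; woe_by_bin: {variable: {bin: woe}}.
    The intercept is folded into a single starting-points value.
    Assumes the model predicts the log-odds of good.
    """
    intercept_points = offset + factor * intercept
    points = {
        var: {b: factor * coefficients[var] * w for b, w in bins.items()}
        for var, bins in woe_by_bin.items()
    }
    return intercept_points, points


factor, offset = scaling_parameters(600, 50, 20)
start, points = scorecard_points(
    1.0, {"age": 0.5}, {"age": {"young": -0.4, "old": 0.6}}, factor, offset
)
```

With these parameters, odds of 50:1 map to 600 points and odds of 100:1 map to 620 points. An applicant's score is the starting points plus the points of the bin they fall into for each variable.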
Figure 3: Scorecard scaling
Model assessment is the final step in the model building process. It consists of three distinct phases: evaluation, validation, and acceptance.
Evaluation for Accuracy
"Did I build the model right?" is the first question to ask in order to test the model. The key metrics assessed are statistical measures including model accuracy, complexity, error rate, model fitting statistics, variable statistics, significance values, and odds ratios.
Validation for Robustness
"Did I build the right model?" is the next question to ask when moving from classification accuracy and statistical assessment towards ranking ability and business assessment.
The choice of validation metrics depends on the type of the model classifier. The most common metrics for binary classification problems are gains charts, lift charts, ROC curves, and Kolmogorov-Smirnov charts. The ROC curve is the most common tool for visualizing model performance. It is a multi-purpose tool used for:
- Champion-challenger methodology to choose the best performing model.
- Testing the model performance on unseen data and comparing it to the performance on the training data.
- Selecting the optimal threshold, which maximizes the true positive rate and minimizes the false positive rate.
The ROC curve is created by plotting sensitivity against the probability of false alarm (false positive rate) at different thresholds. Assessing performance metrics at different thresholds is a desirable feature of the ROC curve. Different types of business problems will have different thresholds based on a business strategy.
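A minimal sketch of how ROC points might be computed from labels and model scores follows. It treats label 1 as the positive class and a higher score as more likely positive. The threshold picker uses Youden's J statistic (true positive rate minus false positive rate), which is one common criterion chosen here as an assumption; the business may prefer a different trade-off. Function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) triples, strictest threshold first.

    An observation is predicted positive when its score >= threshold.
    Assumes y_true contains both classes.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def best_threshold(points):
    """Threshold maximizing Youden's J = tpr - fpr (an assumed criterion)."""
    return max(points, key=lambda p: p[2] - p[1])[0]


y = [1, 1, 0, 1, 1, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
curve = roc_points(y, s)
cutoff = best_threshold(curve)
```

Plotting the second element of each triple against the third gives the ROC curve for this toy data.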
The area under the ROC curve (AUC) is a useful measure that indicates the predictive ability of a classifier. In credit risk, an AUC of 0.75 or higher is the industry-accepted standard and a prerequisite to model acceptance.
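The AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way by comparing every positive/negative pair, counting ties as half. This pairwise form is quadratic in the number of observations, so it is meant as an illustration only; the function name is an assumption.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Label 1 is the positive class; higher scores should indicate
    positives. Assumes both classes are present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


value = auc(
    [1, 1, 0, 1, 1, 0, 0, 0],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
)
```

For this toy data, 14 of the 16 positive/negative pairs are ranked correctly, giving an AUC of 0.875.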
Figure 4: Model performance metrics
Acceptance for Usefulness
"Will the model be accepted?" is the final question to ask in order to test whether the model is valuable from the business perspective. This is the critical phase where the data scientist has to play back the model results to the business and "defend" the model. The key assessment criterion is the model's business benefit; hence, benefit analysis is the central part of presenting the results. Data scientists should make every effort to present the results concisely so that the results and findings are easy to follow and understand. Failure to achieve this could result in model rejection and, consequently, project failure.