To satisfy the main hallmarks of scientific model development — rigor, testability, replicability and precision, and confidence — it is important to consider model validation and how to deal with unbalanced data. This article outlines advanced validation frameworks that can be utilized to satisfy those hallmarks and provides a brief overview of methodologies frequently applied when dealing with unbalanced data.
Advanced Validation Framework
Any predictive model that fits data too well should be considered suspect: it is too good to be true. When building complex, high-performance predictive models, data scientists often make a modeling error referred to as overfitting. Overfitting, which occurs when a model fits the training dataset almost perfectly but fails to generalize to data it was not trained on, is a fundamental issue and the biggest threat to predictive models. The consequence is poor prediction on new (unseen, holdout) datasets.
Figure 1: Model overfitting
A number of validation frameworks exist for the purpose of detecting and minimizing overfitting. They differ in algorithm complexity, the computational power they require, and robustness. Two simple and common techniques are:
Simple validation: Random or stratified partitioning into train and test partitions.
Nested holdout validation: Random or stratified partitioning into train, validation, and test partitions. Different models are trained on the training partition and compared with one another on the validation partition; the champion model is then validated on unseen data, the testing partition.
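As a minimal sketch of the stratified three-way partitioning described above, the snippet below splits record indices class by class so that each partition keeps roughly the original class ratio. The function name `stratified_split`, the 60/20/20 fractions, and the toy labels are assumptions made for illustration, not part of the original method.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists that preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within each class
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]        # remainder goes to the test partition
    return train, valid, test

labels = [1] * 10 + [0] * 90    # hypothetical data: 10% minority class
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```

Dropping the validation fraction (for example `fractions=(0.7, 0.0, 0.3)`) reduces this to the simple train/test validation scheme.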
The main drawback of these two approaches is that the model fitted to a subset of the available data could still be subject to overfitting. This is particularly true with datasets containing a small number of observations.
Another problem with simple validation arises when model parameters are adjusted repeatedly and model performance is checked each time on the same test sample. This leads to data leakage: the model effectively "learns" from the test sample, so the test sample is no longer a true holdout sample and overfitting may become a problem. Nested holdout validation can resolve the problem to a certain extent; however, this approach requires a large amount of data, which may not be available.
Bootstrapping and cross-validation are two validation frameworks specifically designed to overcome problems with overfitting and more thoroughly capture sources of variation.
Bootstrapping is sampling with replacement. The standard bootstrap validation process draws M random samples from the original data, each the same size as the original. The model is fitted on each of the bootstrap samples and subsequently tested on the entire dataset to measure performance.
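The bootstrap validation loop can be sketched as follows. The "model" here is a deliberately trivial stand-in (it predicts the mean of the target), and the names `bootstrap_validate`, `fit_mean`, and `mse`, as well as the toy data, are hypothetical; any fit and score functions could be plugged in instead.

```python
import random
from statistics import mean

def bootstrap_validate(data, fit, score, n_samples=200, seed=0):
    """Fit on each bootstrap sample, score on the full data, return all scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        # Sampling with replacement, same size as the original data.
        sample = rng.choices(data, k=len(data))
        model = fit(sample)
        scores.append(score(model, data))
    return scores

# Hypothetical toy model: predict the mean of y, scored by mean squared error.
def fit_mean(rows):
    return mean(y for _, y in rows)

def mse(model, rows):
    return mean((y - model) ** 2 for _, y in rows)

data = [(x, 2.0 * x + 1.0) for x in range(20)]
scores = bootstrap_validate(data, fit_mean, mse)
print(f"mean MSE over bootstraps: {mean(scores):.2f}")
print(f"spread (min, max): {min(scores):.2f}, {max(scores):.2f}")
```

The spread of the scores across the M samples gives a rough picture of how sensitive the model is to the particular data it was fitted on.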
Cross-validation (CV) uses the entire population for both fitting and testing by systematically swapping out samples for testing and training. Cross-validation has many forms, including:
- K-fold (partitioning the population into K equal-sized samples and iterating K times over the training/testing splits, so that each sample serves once as the test set)
- Nested cross-validation
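To make the K-fold scheme concrete, here is a small sketch that generates the K training/testing index splits. The helper name `k_fold_indices` is an assumption, and the sketch takes the records in their given order; in practice the data would normally be shuffled (or stratified) first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each observation is tested exactly once."""
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```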
Nested cross-validation is required if we want to validate the model in addition to parameter tuning and/or variable selection. It consists of an inner and an outer CV. The inner CV is used for either parameter tuning or variable selection while the outer CV is used for model validation.
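The inner/outer structure can be sketched as below. For illustration only, the model is a one-dimensional k-nearest-neighbours regressor and the tuned parameter is its number of neighbours; the function names, the grid, the fold counts, and the toy data are all assumptions. The key point is that the inner CV sees only the outer training part, so the outer test fold never influences tuning.

```python
from statistics import mean

def k_fold(indices, k):
    """Split a list of indices into k (train, test) pairs using interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test

def knn_predict(x, train, xs, ys, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda j: abs(xs[j] - x))[:n_neighbors]
    return mean(ys[j] for j in nearest)

def cv_error(indices, xs, ys, n_neighbors, k):
    """Mean squared error of the regressor, estimated by k-fold CV over `indices`."""
    errors = []
    for train, test in k_fold(indices, k):
        errors += [(ys[i] - knn_predict(xs[i], train, xs, ys, n_neighbors)) ** 2
                   for i in test]
    return mean(errors)

def nested_cv(xs, ys, grid, outer_k=5, inner_k=4):
    """Outer loop validates; inner loop tunes on the outer training part only."""
    outer_scores, chosen = [], []
    for outer_train, outer_test in k_fold(list(range(len(xs))), outer_k):
        # Inner CV: grid search restricted to the outer training indices.
        best = min(grid, key=lambda g: cv_error(outer_train, xs, ys, g, inner_k))
        chosen.append(best)
        # Outer CV: evaluate the tuned model on the untouched outer test fold.
        outer_scores.append(mean(
            (ys[i] - knn_predict(xs[i], outer_train, xs, ys, best)) ** 2
            for i in outer_test))
    return mean(outer_scores), chosen

xs = [i / 10 for i in range(40)]
ys = [x * x + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]  # toy data
score, chosen = nested_cv(xs, ys, grid=[1, 3, 5])
print("outer CV error:", round(score, 3), "chosen per fold:", chosen)
```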
With some modifications, both bootstrapping and cross-validation can simultaneously achieve three different objectives:
- Model validation
- Variable selection
- Parameter tuning (grid search)
Table 1: Grid-search and CV for validation, selection, and tuning
Modeling Unbalanced Data
Model accuracy, defined as the ratio of correct predictions to the total number of cases, is a typical measure used to assess model performance. However, assessing model performance solely by accuracy can be misleading because of the accuracy paradox. As an example, assume we have an unbalanced training dataset in which a very small percentage of the population (1%) belongs to the target class, for whom we predict fraud or other catastrophic events. Even without a predictive model, just by always making the same guess, "no fraud" or "no catastrophe," we reach 99% accuracy! Impressive! However, such a strategy would have a 100% miss rate, meaning that we still need a predictive model to either reduce the miss rate (false negative, a "type II error") or to reduce false alarms (false positive, a "type I error").
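The arithmetic behind the accuracy paradox can be checked in a few lines. The population size of 10,000 is an assumption chosen only to make the 1% share easy to read.

```python
# Hypothetical population: 10,000 cases, 1% of which are fraud (label 1).
actual = [1] * 100 + [0] * 9_900
predicted = [0] * len(actual)          # naive rule: always guess "no fraud"

correct = sum(a == p for a, p in zip(actual, predicted))
missed = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = correct / len(actual)
miss_rate = missed / sum(actual)       # false negatives / actual positives

print(f"accuracy  = {accuracy:.0%}")   # accuracy  = 99%
print(f"miss rate = {miss_rate:.0%}")  # miss rate = 100%
```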
The right performance measure depends on business objectives. Some cases require minimizing miss rate; others are more focused on minimizing false alarms, especially if customer satisfaction is the primary aim. Based on the overall objective, data scientists need to identify the best methodology to build and evaluate a model using unbalanced data.
Unbalanced data may be a problem when using machine learning algorithms, as these datasets could have insufficient information about the minority class. This is because algorithms based on minimizing the overall error are biased towards the majority class, neglecting the contribution of the cases we are more interested in.
Two general techniques used to combat unbalanced data modeling issues are sampling and ensemble modeling.
Sampling methods are further classified into undersampling and oversampling techniques. Undersampling involves removing cases from the majority class and keeping the complete minority population. Oversampling is the process of replicating the minority class to balance the data. Both aim to create balanced training data so the learning algorithms can produce less biased results. Both techniques have potential disadvantages: undersampling may lead to information loss while oversampling can lead to overfitting.
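A minimal sketch of both sampling approaches, assuming a fully balanced (50/50) target and plain random selection; the function names and the toy records are hypothetical.

```python
import random

def undersample(majority, minority, seed=0):
    """Keep every minority case and a random majority subset of the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)) + minority

def oversample(majority, minority, seed=0):
    """Keep every majority case and replicate minority cases (with replacement)."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

majority = [("good", i) for i in range(950)]   # hypothetical records
minority = [("bad", i) for i in range(50)]

under = undersample(majority, minority)   # 50 good + 50 bad
over = oversample(majority, minority)     # 950 good + 950 bad
print(len(under), len(over))
```

The trade-off named above is visible here: `under` discards 900 majority records (information loss), while `over` contains many exact duplicates of the same 50 minority records (a risk of overfitting to them).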
A popular modification of the oversampling technique, developed to minimize overfitting, is the synthetic minority oversampling technique (SMOTE), which creates synthetic minority cases with the help of another learning technique, usually the k-nearest neighbors (KNN) algorithm. As a rule of thumb, use undersampling if a large number of observations is available; otherwise, oversampling is the preferred method.
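The core idea of SMOTE can be sketched as interpolating between a minority case and one of its nearest minority neighbours. This is a simplified illustration rather than a full implementation; the name `smote_like`, the choice of k = 3, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()   # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
new_points = smote_like(minority, n_new=10)
print(new_points[:3])
```

Because the new cases lie between existing minority cases instead of duplicating them, the learner sees more varied minority examples than with plain replication.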
The steps below outline a simple example of model development using the undersampling technique.
1. Create a balanced training view by selecting all "bad" cases and a random sample of "good" cases in a chosen proportion, for example 35%/65%. If there is a sufficient number of "bad" cases, undersample from an unbalanced training partition; otherwise, use the entire population to undersample.
2. Select the best set of predictors using the usual modeling steps:
   a. Selection of candidate variables
   b. Fine classing
   c. Coarse classing with optimal binning
   d. Weight of evidence or dummy transformations
   e. Stepwise logistic regression model
3. If not created in Step 1, partition the full unbalanced dataset into train and test partitions (for example, 70% in the training partition and 30% in the testing partition). Keep the ratio of the minority class the same in both partitions.
4. Train the model on the training partition, using the model variables selected by the stepwise method in Step 2e.
5. Validate the model on the testing partition.
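The first of the steps above, building a balanced view at a chosen proportion, comes down to a small calculation: with all "bad" cases kept at a 35% share, the number of "good" cases to sample is n_bad × 65 / 35. The sketch below shows only that step; the function name `balanced_view` and the toy records are hypothetical, and the later modeling steps are not covered.

```python
import random

def balanced_view(bad, good, bad_share=0.35, seed=0):
    """Keep all "bad" cases and sample "good" cases to reach the target share."""
    rng = random.Random(seed)
    n_good = round(len(bad) * (1 - bad_share) / bad_share)
    n_good = min(n_good, len(good))   # cannot sample more than is available
    return bad + rng.sample(good, k=n_good)

bad = [{"id": i, "target": 1} for i in range(70)]          # hypothetical cases
good = [{"id": i, "target": 0} for i in range(70, 5_000)]

view = balanced_view(bad, good)
share = sum(row["target"] for row in view) / len(view)
print(len(view), f"{share:.0%}")
```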
Ensemble modeling is an alternative for unbalanced data modeling. Bagging and boosting are typical techniques used to make stronger predictors and overcome overfitting without using undersampling or oversampling. Bagging (bootstrap aggregation) creates different bootstrap samples with replacement, trains a model on each sample, and averages the prediction results. Boosting works by gradually building a stronger predictor in each iteration, learning from the errors made in the previous iteration.
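Bagging can be sketched with a very simple base learner. Here each model is a one-variable threshold rule (a "decision stump"), which is an assumption made to keep the example short; the function names and the toy data are hypothetical as well.

```python
import random
from statistics import mean

def fit_stump(rows):
    """Pick the threshold on x that best separates the labels in this sample."""
    best_threshold, best_errors = None, None
    for threshold, _ in rows:
        # The stump predicts 1 when x is above the threshold, otherwise 0.
        errors = sum((x > threshold) != bool(y) for x, y in rows)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(rows, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_models)]

def bagging_predict(thresholds, x):
    """Average the stump votes; the result is a score between 0 and 1."""
    return mean(1.0 if x > t else 0.0 for t in thresholds)

# Hypothetical toy data: label is 1 when x exceeds 5, plus one noisy point.
rows = [(float(x), int(x > 5)) for x in range(11)] + [(4.5, 1)]
ensemble = bagging_fit(rows)
print(bagging_predict(ensemble, 9.0), bagging_predict(ensemble, 1.0))
```

Averaging over many bootstrap-trained models smooths out the influence of any single unusual sample, which is where the resistance to overfitting comes from.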
As discussed above, accuracy is not the preferred metric for unbalanced data, as it considers only correct predictions. By considering correct and incorrect results simultaneously, we gain more insight into the classification model. In such cases, the useful performance measures are sensitivity (also called recall, hit rate, probability of detection, or true positive rate), specificity (true negative rate), and precision.
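These three measures follow directly from the four cells of the confusion matrix. The sketch below assumes binary labels with 1 marking the minority ("bad") class; the function name and the example predictions are hypothetical.

```python
def classification_metrics(actual, predicted):
    """Return sensitivity, specificity and precision for binary labels (1 = "bad")."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)   # hits
    tn = sum(a == 0 and p == 0 for a, p in pairs)   # correct rejections
    fp = sum(a == 0 and p == 1 for a, p in pairs)   # false alarms (type I)
    fn = sum(a == 1 and p == 0 for a, p in pairs)   # misses (type II)
    # Note: each ratio is undefined (division by zero) if its denominator is empty.
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # share of alarms that are real
    }

# Hypothetical predictions: 10 "bad" cases among 100.
actual = [1] * 10 + [0] * 90
predicted = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 86
metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example accuracy would be 94%, yet only two out of three alarms are genuine, which is the kind of detail accuracy alone hides.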
In addition to these three scalar metrics, another popular measure that dominates the industry is the ROC curve. The ROC curve is independent of the proportion of "bad" vs. "good" cases, which is an important feature, especially for unbalanced data. Where there is a sufficient number of "bad" cases, the standard modeling methodology can be applied instead of unbalanced data methods, and the resulting model tested using the ROC curve.
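One way to illustrate this independence is through the area under the ROC curve (AUC), which equals the probability that a randomly chosen "bad" case receives a higher score than a randomly chosen "good" case. The sketch below computes AUC that way; the function name and the scores are hypothetical, and the pairwise loop is meant for small examples only.

```python
def roc_auc(actual, scores):
    """AUC as the probability that a random "bad" case outscores a random "good" one."""
    bad = [s for a, s in zip(actual, scores) if a == 1]
    good = [s for a, s in zip(actual, scores) if a == 0]
    # Ties between a "bad" and a "good" score count as half a win.
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

# Hypothetical scores from some model; higher means more likely "bad".
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
auc = roc_auc(actual, scores)

# Adding ten extra copies of every "good" case changes the class ratio
# from 3:5 to 3:55 but leaves the AUC unchanged.
auc_unbalanced = roc_auc(actual + [0] * 50, scores + scores[3:] * 10)
print(round(auc, 4), round(auc_unbalanced, 4))
```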