Handling Imbalanced Data With R
Imbalanced data is a common issue in classification. When one class heavily outnumbers the other, a model can score high overall accuracy while predicting the minority class poorly. Learn how to tackle imbalanced classification problems using R.
Imbalanced data refers to classification problems where one class outnumbers the other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extremely imbalanced data can be seen in banking or financial data, where the majority of credit card uses are acceptable and very few are fraudulent.
With an imbalanced dataset, an algorithm sees too few minority-class examples to learn the information required to make accurate predictions about that class. So, it is recommended to train on a balanced classification dataset.
In this blog, let's discuss tackling imbalanced classification problems using R.
A credit card transaction dataset with about 284K transactions, of which 492 are fraudulent, and 31 columns is used as the source file. For the sample dataset, refer to the References section. The columns are as follows:
- Time: Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
- V1-V28: Principal component variables obtained with PCA.
- Amount: Transaction amount.
- Class: Dependent (response) variable, with a value of 1 if fraud and 0 if good.
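To see why plain accuracy is misleading on a dataset like this, the short sketch below uses the rounded counts quoted above and scores a hypothetical model that labels every transaction as good. The value 284000 is an assumption standing in for "284K"; the exact row count is not stated here.

```r
# Rounded counts from the dataset description; 284000 approximates "284K".
n_total <- 284000
n_fraud <- 492
n_good  <- n_total - n_fraud

# Share of fraudulent transactions, in percent.
fraud_pct <- 100 * n_fraud / n_total

# A hypothetical model that predicts "good" (0) for every row.
accuracy_all_good <- n_good / n_total   # fraction of correct predictions
recall_fraud      <- 0 / n_fraud        # no fraud is ever caught

cat(sprintf("Fraud share: %.2f%%\n", fraud_pct))
cat(sprintf("Accuracy of always predicting good: %.2f%%\n", 100 * accuracy_all_good))
cat(sprintf("Fraud recall: %.0f%%\n", 100 * recall_fraud))
```

Under these assumed counts, the do-nothing model is right more than 99% of the time yet detects no fraud at all, which is why the later sections rely on the ROC metric rather than accuracy.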
Performing exploratory data analysis
- Checking imbalance data
- Checking the number of transactions by hour
- Checking the mean using PCA variables
Building model on training set
Applying sampling methods to balance dataset
Performing Exploratory Data Analysis
Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.
Checking Imbalance Data
To find the imbalance in the dependent variable, perform the following:
- Group the data by Class value using the group_by function from the dplyr package.
- Use ggplot to show the percentage of each class.
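The two steps above might look like the following minimal sketch. It assumes the dplyr and ggplot2 packages are installed, and it uses a small synthetic data frame (the name `transactions` and its 990/10 split are hypothetical) in place of the real credit card file.

```r
library(dplyr)
library(ggplot2)

# Small synthetic stand-in for the credit card data (hypothetical values):
# 990 good (0) and 10 fraud (1) transactions.
transactions <- data.frame(Class = c(rep(0, 990), rep(1, 10)))

# Count rows per class and convert the counts to percentages.
class_summary <- transactions %>%
  group_by(Class) %>%
  summarise(count = n()) %>%
  mutate(percentage = 100 * count / sum(count))

print(class_summary)

# Bar chart of the class percentages.
class_plot <- ggplot(class_summary, aes(x = factor(Class), y = percentage)) +
  geom_col() +
  labs(x = "Class (0 = good, 1 = fraud)", y = "Percentage of transactions")

print(class_plot)
```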
Checking Number of Transactions by Hour
The graph of transactions by hour covers two days of data. It shows that most of the fraudulent transactions occurred between hours 13 and 18.
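One way to derive an hour-of-day value from the Time column is sketched below in base R. The data is synthetic, and the sketch assumes the first transaction falls at hour 0, which the dataset description does not confirm.

```r
# Synthetic stand-in: Time in seconds since the first transaction, spanning two days.
set.seed(1)
transactions <- data.frame(
  Time  = sort(runif(2000, min = 0, max = 2 * 24 * 3600)),
  Class = rbinom(2000, size = 1, prob = 0.01)
)

# Convert elapsed seconds to hour of day (0-23).
# Assumption: the first transaction happens at hour 0.
transactions$Hour <- floor(transactions$Time / 3600) %% 24

# Count transactions per hour, split by class.
hourly_counts <- table(Hour = transactions$Hour, Class = transactions$Class)
print(hourly_counts)
```

The modulo step folds both days onto the same 24 hours, so the counts for each hour combine day one and day two.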
Checking Mean Using PCA Variables
In predictive modeling, the data needs to be partitioned into a training set (80% of the data) and a test set (20% of the data). After partitioning the data, feature scaling is applied to standardize the range of the independent variables.
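A minimal base R sketch of the 80/20 split and scaling step follows. The data frame is synthetic, the split is a plain random split (whether the original analysis stratified by class is not stated), and scaling the test set with the training set's mean and standard deviation is a design choice of this sketch.

```r
set.seed(123)

# Synthetic stand-in for the dataset (hypothetical values).
n <- 1000
dataset <- data.frame(
  V1     = rnorm(n),
  Amount = rexp(n, rate = 1 / 80),
  Class  = rbinom(n, size = 1, prob = 0.05)
)

# 80/20 split by random row indices.
train_idx    <- sample(seq_len(n), size = floor(0.8 * n))
training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]

# Standardize the predictors using the training set's mean and standard
# deviation, and apply the same parameters to the test set.
predictors <- c("V1", "Amount")
train_mean <- sapply(training_set[predictors], mean)
train_sd   <- sapply(training_set[predictors], sd)

training_set[predictors] <- scale(training_set[predictors],
                                  center = train_mean, scale = train_sd)
test_set[predictors]     <- scale(test_set[predictors],
                                  center = train_mean, scale = train_sd)

print(summary(training_set[predictors]))
```

Reusing the training parameters keeps information from the test set out of the preprocessing step.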
Building Model on Training Set
To build a model on the training set, perform the following:
- Apply a logistic classifier on the training set.
- Predict the test set.
- Check the predicted output on the imbalanced data.
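These steps can be sketched with base R's glm function. The data below is synthetic, and the 0.5 probability threshold is an assumption of this sketch rather than a value taken from the original analysis.

```r
set.seed(42)

# Synthetic imbalanced data (hypothetical): only a few percent of rows are fraud.
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(-5 + 1.5 * x1 + 1.0 * x2)
dataset <- data.frame(V1 = x1, V2 = x2, Class = rbinom(n, 1, p))

train_idx    <- sample(seq_len(n), size = floor(0.8 * n))
training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]

# Fit a logistic regression classifier on the training set.
classifier <- glm(Class ~ ., family = binomial, data = training_set)

# Predict fraud probabilities on the test set and threshold at 0.5 (assumed cutoff).
prob_pred  <- predict(classifier, newdata = test_set, type = "response")
class_pred <- ifelse(prob_pred > 0.5, 1, 0)

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(Actual = test_set$Class, Predicted = class_pred)
print(confusion)
```

On data this skewed, the confusion matrix typically shows most fraud rows predicted as good, which motivates the sampling methods in the next section.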
Applying Sampling Methods to Balance Dataset
The training set has about 227K good and 394 fraud transactions.
In R, the Random Over Sampling Examples (ROSE) and DMwR packages are used to quickly apply sampling strategies. The ROSE package generates artificial data based on sampling methods and a smoothed bootstrap approach. It also provides functions for measuring accuracy, such as roc.curve.
The different types of sampling methods are the following.
Oversampling replicates minority class observations by sampling them with replacement. As the training set has 227K good observations, the minority class is oversampled until it also reaches 227K, so the resulting dataset has a total of 454K samples. This can be attained using method = “over”.
Undersampling works similarly to oversampling but reduces the majority class instead, sampling it without replacement until the number of good transactions equals the number of fraud transactions. Because most of the good transactions are discarded, significant information can be lost from this sample. This can be attained using method = “under”.
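The difference between sampling with and without replacement can be illustrated in base R. This is a sketch of the idea on synthetic data (950 good and 50 fraud rows, both hypothetical), not a reproduction of the ROSE package internals.

```r
set.seed(7)

# Synthetic training set (hypothetical): 950 good (0) and 50 fraud (1) rows.
training_set <- data.frame(
  Amount = c(rexp(950, 1 / 60), rexp(50, 1 / 120)),
  Class  = c(rep(0, 950), rep(1, 50))
)

good_rows  <- which(training_set$Class == 0)
fraud_rows <- which(training_set$Class == 1)

# Oversampling: draw fraud rows WITH replacement until they match the good rows.
over_idx <- c(good_rows,
              sample(fraud_rows, size = length(good_rows), replace = TRUE))
over_set <- training_set[over_idx, ]

# Undersampling: draw good rows WITHOUT replacement down to the number of fraud rows.
under_idx <- c(sample(good_rows, size = length(fraud_rows), replace = FALSE),
               fraud_rows)
under_set <- training_set[under_idx, ]

print(table(over_set$Class))
print(table(under_set$Class))
```

The oversampled set keeps every good row and repeats fraud rows many times, while the undersampled set keeps every fraud row and throws away most good rows.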
The combined method applies both oversampling and undersampling. The majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = “both”.
The ROSE sampling method generates data synthetically, using a smoothed bootstrap, and provides a better estimate of the original data.
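With the ROSE package installed, the four strategies above might be called as in the sketch below. The data is synthetic and the choices of N, p, and seed are illustrative assumptions; the arguments used in the original analysis are not shown in this article.

```r
library(ROSE)

set.seed(11)
# Synthetic training set (hypothetical): 950 good (0) and 50 fraud (1) rows.
training_set <- data.frame(
  V1     = c(rnorm(950, 0), rnorm(50, 2)),
  Amount = c(rexp(950, 1 / 60), rexp(50, 1 / 120)),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

n_good  <- sum(training_set$Class == 0)
n_fraud <- sum(training_set$Class == 1)

# N is the desired size of the resulting sample.
over_set  <- ovun.sample(Class ~ ., data = training_set, method = "over",
                         N = 2 * n_good, seed = 1)$data
under_set <- ovun.sample(Class ~ ., data = training_set, method = "under",
                         N = 2 * n_fraud, seed = 1)$data

# p is the target probability of the minority class in the new sample.
both_set  <- ovun.sample(Class ~ ., data = training_set, method = "both",
                         N = nrow(training_set), p = 0.5, seed = 1)$data

# ROSE() draws synthetic rows from a smoothed bootstrap instead of copying rows.
rose_set  <- ROSE(Class ~ ., data = training_set, seed = 1)$data

print(table(over_set$Class))
print(table(under_set$Class))
print(table(both_set$Class))
print(table(rose_set$Class))
```

Both functions return a list, and the resampled data frame is in its `data` element.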
Synthetic Minority Over-Sampling Technique (SMOTE) Sampling
This method is used to avoid the overfitting that occurs when exact replicas of minority instances are added to the main dataset. A subset of data from the minority class is taken, new synthetic instances similar to it are created, and these are added to the original dataset. The record count of each class after applying the sampling techniques is shown below:
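The core idea of SMOTE, creating new points between a minority instance and one of its nearest minority neighbours, can be sketched in base R as follows. This is a simplified illustration for numeric features only; the function name `smote_sketch` and its arguments are hypothetical, and it is not the DMwR implementation.

```r
set.seed(3)

# Simplified SMOTE sketch for numeric features; not the DMwR implementation.
# minority: numeric matrix of minority-class rows
# n_new:    number of synthetic rows to create
# k:        number of nearest neighbours to interpolate towards (k < nrow(minority))
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  dists <- as.matrix(dist(minority))   # pairwise Euclidean distances
  diag(dists) <- Inf                   # a point is not its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (i in seq_len(n_new)) {
    base       <- sample(nrow(minority), 1)          # random minority point
    neighbours <- order(dists[base, ])[seq_len(k)]   # its k nearest neighbours
    nb         <- neighbours[sample.int(k, 1)]       # pick one of them
    gap        <- runif(1)                           # position along the segment
    synthetic[i, ] <- minority[base, ] + gap * (minority[nb, ] - minority[base, ])
  }
  synthetic
}

# Hypothetical minority class with two numeric features.
fraud     <- cbind(V1 = rnorm(20, mean = 2), V2 = rnorm(20, mean = -1))
new_fraud <- smote_sketch(fraud, n_new = 40, k = 5)

print(head(new_fraud))
```

Because each synthetic row lies on a line segment between two real minority rows, the new rows are similar to the originals without being exact copies.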
A logistic classifier model is fitted on each balanced training set and then used to predict the test data. Confusion matrix accuracy is disregarded because the test data is imbalanced.
The roc.curve function from the ROSE package is used to compute the ROC metric.
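To make the ROC metric concrete, the sketch below computes the area under the ROC curve in base R with the rank-sum formula. The helper `auc_rank` and the sample labels and scores are hypothetical; with the ROSE package, a call of the form `roc.curve(actual, score)` should give a comparable value.

```r
# AUC via the rank-sum (Mann-Whitney) formula: the probability that a randomly
# chosen fraud row receives a higher score than a randomly chosen good row.
auc_rank <- function(actual, score) {
  r     <- rank(score)          # average ranks handle tied scores
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and predicted probabilities.
actual <- c(0, 0, 0, 0, 1, 0, 1, 1)
score  <- c(0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90)

auc <- auc_rank(actual, score)
print(auc)
```

In this toy example, 14 of the 15 fraud/good pairs are ranked correctly, so the AUC is 14/15. Unlike accuracy, this value does not depend on how many rows belong to each class.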
In this blog, the highest accuracy is obtained using the SMOTE method. As the results vary little across these sampling methods, combining them with a more robust algorithm, such as random forest or boosting, can improve accuracy further.
When dealing with an imbalanced dataset, experiment with all of these methods to find the sampling method best suited to your dataset. For better results, advanced sampling methods that combine synthetic sampling with boosting can be used.
These sampling methods can be implemented in the same way in Python, too. For Python code, check the References section below.
Published at DZone with permission of Rathnadevi Manivannan. See the original article here.