Handling Imbalanced Data With R

Imbalanced data is a major issue in classification. When one class dominates, predictions for the minority class tend to be unreliable. Learn how to tackle imbalanced classification problems using R.

Oct. 09, 17 · Tutorial

Imbalanced data refers to classification problems where one class outnumbers the other by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-class classification. For example, extreme imbalance can be seen in banking or financial data, where the vast majority of credit card transactions are legitimate and very few are fraudulent.

With an imbalanced dataset, a learning algorithm sees too few minority-class examples to learn that class well, so it tends to favor the majority class. It is therefore recommended to balance the dataset before training a classifier.

In this blog, let's discuss tackling imbalanced classification problems using R.

Data Description

The source file is a credit card transaction dataset with about 284K transactions, 492 of them fraudulent, and 31 columns. For a sample dataset, refer to the References section.

Columns

Time: Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
V1-V28: Principal component variables obtained with PCA.
Amount: Transaction amount.
Class: Dependent (response) variable, with a value of 1 for fraud and 0 for good.

Synopsis

Performing exploratory data analysis
- Checking imbalance data
- Checking the number of transactions by hour
- Checking the mean using PCA variables
Partitioning data
Building model on training set
Applying sampling methods to balance dataset

Performing Exploratory Data Analysis

Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.

Checking Imbalance Data

To find the imbalance in the dependent variable, perform the following:

Group the data by Class value using the group_by() function from the dplyr package:

Use ggplot to show the percentage of each Class category:
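As a minimal sketch of this check, the class counts and percentages can be computed in base R. The `transactions` data frame below is a small synthetic stand-in (an assumption, not the real credit card file), and `table()` with `prop.table()` is used here in place of dplyr's `group_by()` and ggplot, which would give the same numbers.

```r
# Hypothetical stand-in for the credit card data: 1,000 rows, 2% fraud.
set.seed(42)
transactions <- data.frame(
  Amount = round(rexp(1000, rate = 1 / 80), 2),
  Class  = factor(c(rep(0, 980), rep(1, 20)), levels = c(0, 1))
)

# Count the rows in each class and convert the counts to percentages.
class_counts <- table(transactions$Class)
class_pct    <- round(100 * prop.table(class_counts), 2)

class_summary <- data.frame(
  Class   = names(class_counts),
  n       = as.vector(class_counts),
  percent = as.vector(class_pct)
)
print(class_summary)

# Quick base R bar chart of the class percentages.
barplot(class_pct, names.arg = c("0 (good)", "1 (fraud)"),
        ylab = "Percent of transactions", main = "Class distribution")
```

A strongly lopsided `percent` column is the signal that accuracy alone will be a misleading metric later on.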

Checking Number of Transactions by Hour

To check the number of transactions by day and hour, normalize the time by day and group the hours into four quarters according to the time of day.

The resulting graph covers two days of transactions and shows that most of the fraudulent transactions occurred between hours 13 and 18.
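The time normalization described here could be sketched as follows. The data is synthetic, and the sketch assumes that the first transaction happens at midnight, so that elapsed seconds map directly onto hours of the day; the real dataset only gives time relative to the first transaction.

```r
# Hypothetical Time values: seconds since the first transaction, over two days.
set.seed(1)
transactions <- data.frame(
  Time  = sort(sample(0:(2 * 86400 - 1), 500)),
  Class = rbinom(500, 1, 0.05)
)

# Normalize the elapsed seconds into a day index and an hour of the day.
transactions$Day  <- transactions$Time %/% 86400 + 1        # 1 or 2
transactions$Hour <- (transactions$Time %% 86400) %/% 3600  # 0 to 23

# Bucket the hours into four six-hour quarters of the day.
transactions$Quarter <- cut(
  transactions$Hour,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("00-06", "06-12", "12-18", "18-24"),
  right  = FALSE
)

# Transactions per day and quarter, split by class.
print(table(Day = transactions$Day,
            Quarter = transactions$Quarter,
            Class = transactions$Class))
```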

Checking Mean Using PCA Variables

To find data anomalies, take the mean of each variable from V1 to V28 and check the variation. In the resulting plot, the blue points are the ones with large variation.
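One possible reading of this step is to compare the per-class means of the PCA variables and look for the components where fraud and good transactions differ most. The sketch below does that on synthetic data; the shifted components (V3 and V14) are chosen purely for illustration and say nothing about the real dataset.

```r
# Hypothetical data: 28 PCA-like columns, 285 good rows and 15 fraud rows.
set.seed(7)
n <- 300
pca_cols <- paste0("V", 1:28)
transactions <- as.data.frame(
  matrix(rnorm(n * 28), nrow = n, dimnames = list(NULL, pca_cols))
)
transactions$Class <- c(rep(0, 285), rep(1, 15))

# Shift two components for the fraud rows so they stand out (illustrative only).
fraud <- transactions$Class == 1
transactions$V3[fraud]  <- transactions$V3[fraud] - 4
transactions$V14[fraud] <- transactions$V14[fraud] - 6

# Mean of every PCA variable within each class (row 1 = good, row 2 = fraud).
class_means <- aggregate(transactions[pca_cols],
                         by = list(Class = transactions$Class), FUN = mean)

# Absolute gap between the fraud and good means, largest first.
gap <- abs(unlist(class_means[2, pca_cols]) - unlist(class_means[1, pca_cols]))
print(head(sort(gap, decreasing = TRUE), 5))
```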

Partitioning Data

In predictive modeling, the data needs to be partitioned into a training set (80% of the data) and a test set (20% of the data). After partitioning, feature scaling is applied to standardize the range of the independent variables.
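A minimal sketch of the split and scaling step, in base R on synthetic data. Two choices here are assumptions rather than something the text specifies: the split is stratified by class so that the rare class appears in both sets, and the test set is scaled with the training set's mean and standard deviation.

```r
# Hypothetical data: 1,000 rows with 4% fraud.
set.seed(123)
n <- 1000
transactions <- data.frame(
  Time   = sort(runif(n, 0, 172800)),
  Amount = round(rexp(n, rate = 1 / 80), 2),
  Class  = factor(c(rep(0, 960), rep(1, 40)))
)

# Stratified 80/20 split: sample 80% of the row indices within each class.
train_idx <- unlist(
  lapply(split(seq_len(n), transactions$Class), function(idx) {
    idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
  }),
  use.names = FALSE
)
train <- transactions[train_idx, ]
test  <- transactions[-train_idx, ]

# Standardize the predictors using training-set statistics only,
# then apply the same transformation to the test set.
predictors <- c("Time", "Amount")
centers <- colMeans(train[predictors])
scales  <- apply(train[predictors], 2, sd)
train[predictors] <- scale(train[predictors], center = centers, scale = scales)
test[predictors]  <- scale(test[predictors],  center = centers, scale = scales)

print(table(train$Class))
print(table(test$Class))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.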

Building Model on Training Set

To build a model on the training set, perform the following:

Apply a logistic classifier to the training set.
Predict the test set.
Check the predicted output on the imbalanced data.

The confusion matrix shows 99.9% accuracy on the test set, but this figure is driven almost entirely by the dominant Class 0 (good) records, so let's disregard it. Measured with the ROC curve instead, the test result is only 78%, which is very low.
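The steps above can be sketched with base R's `glm()`. Everything here is illustrative: the data is synthetic, the 0.5 cutoff is an assumption, and the small rank-based `auc()` helper is a hypothetical stand-in for a packaged ROC function. The numbers it prints will not match the 99.9% and 78% quoted above.

```r
# Hypothetical imbalanced data: the positive class is rare (a few percent).
set.seed(2024)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame(x1, x2,
                  Class = rbinom(n, 1, plogis(-5 + 1.5 * x1 + x2)))
train <- dat[1:4000, ]
test  <- dat[4001:5000, ]

# Fit a logistic classifier on the training set and score the test set.
fit  <- glm(Class ~ x1 + x2, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

# Confusion matrix, overall accuracy, and recall on the rare class.
conf_mat <- table(Actual = factor(test$Class, levels = c(0, 1)),
                  Predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
recall   <- conf_mat["1", "1"] / sum(conf_mat["1", ])

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
auc <- function(labels, scores) {
  r     <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
test_auc <- auc(test$Class, prob)

print(conf_mat)
cat("accuracy:", accuracy, " recall:", recall, " AUC:", test_auc, "\n")
```

Comparing `accuracy` with `recall` makes the problem visible: a model can score very high accuracy while still missing many of the rare positives.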

Applying Sampling Methods to Balance Dataset

Different sampling methods can be used to balance the training data before a model is applied to it. First, check the number of good and fraud transactions in the training set.

There are 227K good and 394 fraud transactions.

In R, the ROSE (Random Over-Sampling Examples) and DMwR packages can be used to quickly apply sampling strategies. The ROSE package generates artificial data based on sampling methods and a smoothed bootstrap approach. It also provides well-defined accuracy functions for evaluating the results.

The different types of sampling methods are the following.

Oversampling

Setting method to “over” instructs the algorithm to perform oversampling. As the original training set has 227K good observations, the minority class is oversampled until it also reaches 227K, giving a total of 454K samples. This can be attained using method = “over”.

Undersampling

This method works in the opposite direction to oversampling and is done without replacement. Good transactions are randomly removed until their number equals the number of fraud transactions. Because most of the good transactions are discarded, significant information is lost from the sample. This can be attained using method = “under”.

Both Sampling

This method is a combination of both oversampling and undersampling methods. Using this method, the majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = “both”.
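The three strategies described so far can be sketched in base R to show what happens to the rows. The `resample_classes()` helper is hypothetical and written for illustration only; in practice the ROSE package's `ovun.sample()` function, with the `method` values named above, would normally be used instead. The even 50/50 split for "both" is also an assumption.

```r
# Hypothetical training set: 980 good rows and 20 fraud rows.
set.seed(10)
train <- data.frame(
  Amount = round(rexp(1000, rate = 1 / 80), 2),
  Class  = factor(c(rep(0, 980), rep(1, 20)))
)

# Hypothetical helper: resample a two-class data frame by one of three strategies.
resample_classes <- function(data, method = c("over", "under", "both"),
                             target = "Class", minority = "1") {
  method <- match.arg(method)
  minor  <- which(data[[target]] == minority)
  major  <- which(data[[target]] != minority)
  rows <- switch(method,
    # Keep every majority row; draw minority rows with replacement to match.
    over = c(major,
             minor[sample.int(length(minor), length(major), replace = TRUE)]),
    # Keep every minority row; draw as many majority rows without replacement.
    under = c(major[sample.int(length(major), length(minor))], minor),
    # Keep the original size and split it evenly between the two classes.
    both = {
      half <- nrow(data) %/% 2
      c(major[sample.int(length(major), nrow(data) - half)],
        minor[sample.int(length(minor), half, replace = TRUE)])
    }
  )
  data[rows, , drop = FALSE]
}

# Class counts (good, fraud) after each strategy.
for (m in c("over", "under", "both")) {
  cat(m, ":", table(resample_classes(train, m)$Class), "\n")
}
```

Oversampling grows the data to twice the majority count, undersampling shrinks it to twice the minority count, and the combined method keeps the original size.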

ROSE Sampling

The ROSE sampling method generates synthetic data using a smoothed bootstrap, rather than repeating or discarding existing rows, and can provide a better estimate of the original data.
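To give a feel for the smoothed bootstrap idea, the sketch below draws a class label, picks a random existing row of that class, and adds Gaussian noise around it. This is a simplified illustration and not the ROSE package's implementation: the `smoothed_bootstrap()` helper is hypothetical, and the rule-of-thumb bandwidth is an assumption.

```r
# Hypothetical training set: 190 good rows and 10 fraud rows, two features.
set.seed(5)
train <- data.frame(
  V1    = c(rnorm(190), rnorm(10, mean = 3)),
  V2    = c(rnorm(190), rnorm(10, mean = -2)),
  Class = factor(c(rep(0, 190), rep(1, 10)))
)

# Hypothetical helper: smoothed bootstrap for a two-class data frame.
# p is the probability of drawing the second factor level (the minority here).
smoothed_bootstrap <- function(data, target = "Class",
                               n_new = nrow(data), p = 0.5) {
  features  <- setdiff(names(data), target)
  classes   <- levels(data[[target]])
  new_class <- sample(classes, n_new, replace = TRUE, prob = c(1 - p, p))
  parts <- lapply(classes, function(cl) {
    pool <- as.matrix(data[data[[target]] == cl, features, drop = FALSE])
    rownames(pool) <- NULL
    k <- sum(new_class == cl)
    d <- ncol(pool)
    # Assumed rule-of-thumb bandwidth, one value per feature.
    h <- (4 / ((d + 2) * nrow(pool)))^(1 / (d + 4)) * apply(pool, 2, sd)
    # Resample existing rows, then jitter each one with Gaussian noise.
    seeds <- pool[sample.int(nrow(pool), k, replace = TRUE), , drop = FALSE]
    noise <- matrix(rnorm(k * d), nrow = k, ncol = d) %*% diag(h, nrow = d)
    out <- as.data.frame(seeds + noise)
    out[[target]] <- factor(rep(cl, k), levels = classes)
    out
  })
  do.call(rbind, parts)
}

synthetic <- smoothed_bootstrap(train)
print(table(synthetic$Class))
```

Unlike plain oversampling, every row in `synthetic` is a new point near an observed one, so the minority class is not made of exact duplicates.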

Synthetic Minority Over-Sampling Technique (SMOTE) Sampling

This method avoids the overfitting that can occur when exact replicas of minority instances are added to the main dataset. Instead, a subset of data from the minority class is taken, and new synthetic instances similar to it are created and added to the original dataset. After applying the sampling techniques, compare the count of records in each class.
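The interpolation idea behind SMOTE can be sketched in base R as follows. The `smote_sketch()` helper is hypothetical and deliberately minimal: it assumes numeric features only and the default of five nearest neighbours is an assumption. A maintained package implementation should be preferred for real work.

```r
# Hypothetical training set: 190 good rows and 10 fraud rows, two features.
set.seed(3)
train <- data.frame(
  V1    = c(rnorm(190), rnorm(10, mean = 3)),
  V2    = c(rnorm(190), rnorm(10, mean = -2)),
  Class = factor(c(rep(0, 190), rep(1, 10)))
)

# Hypothetical helper: add n_new synthetic minority rows by interpolation.
smote_sketch <- function(data, target = "Class", minority = "1",
                         n_new = 100, k = 5) {
  features <- setdiff(names(data), target)
  minor <- as.matrix(data[data[[target]] == minority, features, drop = FALSE])
  rownames(minor) <- NULL
  dists <- as.matrix(dist(minor))  # pairwise Euclidean distances
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(minor),
                  dimnames = list(NULL, features))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(minor), 1)
    # The k nearest minority neighbours of the base point, excluding itself.
    nn <- setdiff(order(dists[base, ]), base)[seq_len(k)]
    neighbour <- nn[sample.int(k, 1)]
    # New point at a random position on the segment from base to neighbour.
    gap <- runif(1)
    synth[i, ] <- minor[base, ] + gap * (minor[neighbour, ] - minor[base, ])
  }
  synth <- as.data.frame(synth)
  synth[[target]] <- factor(rep(minority, n_new),
                            levels = levels(data[[target]]))
  rbind(data, synth)
}

balanced <- smote_sketch(train)
print(table(balanced$Class))
```

Because each new row lies between two existing fraud rows, the synthetic points stay inside the region the minority class already occupies.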

A logistic classifier model is fit on each balanced training set and used to predict the test data. Confusion matrix accuracy is disregarded because the test data is still imbalanced. Instead, the built-in roc.curve function is used to capture the ROC metric.

Conclusion

In this blog, the highest accuracy is obtained using the SMOTE method. As there is not much variation between these sampling methods, combining them with a more robust algorithm such as random forest or boosting may provide considerably higher accuracy.

When dealing with an imbalanced dataset, experiment with all of these methods to find the sampling method best suited to your data. For better results, advanced approaches that combine synthetic sampling with boosting can be used.

These sampling methods can be implemented in the same way in Python, too. For Python code, see the References section below.

References

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.
