Handling Imbalanced Data With R

Imbalanced data is a common issue in classification. When one class heavily outnumbers the other, predictions for the rare class tend to be unreliable. Learn how to tackle imbalanced classification problems using R.

Imbalanced data refers to classification problems where one class outnumbers the other by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extremely imbalanced data can be seen in banking or financial data, where the vast majority of credit card transactions are legitimate and very few are fraudulent.

With an imbalanced dataset, an algorithm sees too few minority-class examples to learn what distinguishes them, so its predictions for the minority class tend to be poor. For this reason, it is recommended to train on a balanced classification dataset.

In this blog, let's discuss tackling imbalanced classification problems using R.

Data Description

A credit card transaction dataset with about 284K transactions, of which 492 are fraudulent, and 31 columns is used as the source file. For a sample dataset, refer to the References section.

Columns

  • Time: Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
  • V1-V28: Principal component variables obtained with PCA.
  • Amount: Transaction amount.
  • Class: Dependent (response) variable, with value 1 if fraud and 0 if good.

Synopsis

  • Performing exploratory data analysis
    • Checking imbalanced data
    • Checking the number of transactions by hour
    • Checking the mean using PCA variables
  • Partitioning data
  • Building model on training set
  • Applying sampling methods to balance dataset

Performing Exploratory Data Analysis

Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.

Checking Imbalanced Data

To find the imbalance in the dependent variable, perform the following:

Group the data by Class value using the group_by function from the dplyr package.

Use ggplot to show the percentage of each Class category.
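The two steps above might look roughly like the following. This is a minimal sketch, not the article's original code: the `transactions` data frame is a small hypothetical stand-in for the credit card data, and the column names `n` and `percent` are chosen here for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the credit card data: 995 good, 5 fraud
transactions <- data.frame(Class = c(rep(0, 995), rep(1, 5)))

# Count the rows in each class and convert the counts to percentages
class_summary <- transactions %>%
  group_by(Class) %>%
  summarise(n = n()) %>%
  mutate(percent = 100 * n / sum(n))

print(class_summary)

# Bar chart of the percentage of each class
class_plot <- ggplot(class_summary, aes(x = factor(Class), y = percent)) +
  geom_col() +
  labs(x = "Class", y = "Percentage of transactions")
print(class_plot)
```

On the real dataset, the fraud bar would be barely visible, which is the imbalance this check is meant to expose.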

Checking Number of Transactions by Hour

To check the number of transactions by day and hour, normalize the time by day and categorize the transactions into four quarters according to the time of day.

The above graph shows the transactions of two days. It shows that most of the fraudulent transactions occurred between hours 13 and 18.
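The normalization described above can be sketched in base R as follows. The `time_sec` values are hypothetical, and the four six-hour quarter boundaries are an assumption, since the article does not state the exact cut points it used.

```r
# Hypothetical Time values in seconds since the first transaction (two days)
time_sec <- c(0, 3600, 50000, 86399, 86400, 100000, 150000, 172799)

seconds_per_day <- 24 * 60 * 60

# Day index (1 or 2) and hour of day (0-23) after normalizing by day
day  <- time_sec %/% seconds_per_day + 1
hour <- (time_sec %% seconds_per_day) %/% 3600

# Four six-hour quarters of the day (assumed boundaries)
quarter <- cut(hour,
               breaks = c(0, 6, 12, 18, 24),
               right  = FALSE,
               labels = c("00-05", "06-11", "12-17", "18-23"))

# Transactions per day and quarter
print(table(day, quarter))
```

Splitting the table further by Class would show whether fraud concentrates in a particular quarter.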

Checking Mean Using PCA Variables

To find data anomalies, take the mean of each variable from V1 to V28 and check the variation. The points with the most variation are shown in blue in the resulting plot.
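One possible reading of this step is to compare the mean of each PCA variable between the two classes. The sketch below assumes that reading and uses hypothetical random data in which only V3 is shifted for fraud rows; it is not the article's original code.

```r
set.seed(1)
n_good <- 500
n_fraud <- 20
pca_cols <- paste0("V", 1:28)

# Hypothetical data: PCA-like columns centred on 0, with V3 shifted for fraud
good  <- as.data.frame(matrix(rnorm(n_good * 28), ncol = 28,
                              dimnames = list(NULL, pca_cols)))
fraud <- as.data.frame(matrix(rnorm(n_fraud * 28), ncol = 28,
                              dimnames = list(NULL, pca_cols)))
fraud$V3 <- fraud$V3 - 5
transactions <- rbind(cbind(good, Class = 0), cbind(fraud, Class = 1))

# Mean of every PCA variable within each class (row 1 = good, row 2 = fraud)
class_means <- aggregate(transactions[pca_cols],
                         by = list(Class = transactions$Class), FUN = mean)

# Absolute gap between the fraud and good means, largest first
gap <- abs(unlist(class_means[2, pca_cols]) - unlist(class_means[1, pca_cols]))
print(head(sort(gap, decreasing = TRUE)))
```

Variables with a large gap are the ones most likely to help separate fraud from good transactions.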

Partitioning Data

In predictive modeling, data needs to be partitioned into a training set (80% of the data) and a testing set (20% of the data). After partitioning the data, feature scaling is applied to standardize the range of the independent variables.
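A base R sketch of the partitioning and scaling step is shown below. It assumes a stratified split (each class is split 80/20 separately) and scales the test set with the training set's means and standard deviations; both are choices made for this illustration, and the data is hypothetical.

```r
set.seed(123)

# Hypothetical data: 1,000 transactions, 50 of them fraud
transactions <- data.frame(
  V1     = rnorm(1000),
  Amount = rexp(1000, rate = 1 / 80),
  Class  = c(rep(0, 950), rep(1, 50))
)

# Stratified 80/20 split: sample 80% of the row indices within each class
train_idx <- unlist(lapply(
  split(seq_len(nrow(transactions)), transactions$Class),
  function(idx) sample(idx, size = round(0.8 * length(idx)))
))
train <- transactions[train_idx, ]
test  <- transactions[-train_idx, ]

# Feature scaling: use the training means and standard deviations for both sets
features <- c("V1", "Amount")
centers <- colMeans(train[features])
scales  <- apply(train[features], 2, sd)
train[features] <- scale(train[features], center = centers, scale = scales)
test[features]  <- scale(test[features],  center = centers, scale = scales)

print(table(train$Class))
print(table(test$Class))
```

Stratifying keeps the fraud rate the same in both sets, which matters when the minority class has only a few hundred rows.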

Building Model on Training Set

To build a model on the training set, perform the following:

  • Apply a logistic classifier on the training set.
  • Predict the test set.
  • Check the predicted output on the imbalanced data.

Using the confusion matrix, the test result shows 99.9% accuracy. This figure is high only because Class 0 (good) records dominate the data, so let's disregard it. Using the ROC curve, the test result shows a score of 78%, which is very low.
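To see why accuracy is misleading here, the sketch below fits a logistic classifier on hypothetical imbalanced data and compares its accuracy with that of a model that labels every transaction as good. The data, the single feature `V1`, and the hand-computed area under the ROC curve (via the rank formula) are assumptions for illustration, not the article's original code.

```r
set.seed(7)

# Hypothetical imbalanced data: 2,000 good and 20 fraud, weakly separated on V1
n_good <- 2000
n_fraud <- 20
make_data <- function() data.frame(
  V1    = c(rnorm(n_good), rnorm(n_fraud, mean = 1.5)),
  Class = c(rep(0, n_good), rep(1, n_fraud))
)
train <- make_data()
test  <- make_data()

# Logistic classifier fitted on the imbalanced training set
model <- glm(Class ~ V1, data = train, family = binomial)
prob  <- predict(model, newdata = test, type = "response")
pred  <- as.integer(prob > 0.5)

# Confusion matrix and accuracy
print(table(actual = test$Class, predicted = pred))
accuracy <- mean(pred == test$Class)

# Accuracy of a useless model that labels everything as good
baseline_accuracy <- mean(test$Class == 0)

# Area under the ROC curve via the rank (Mann-Whitney) formula
ranks <- rank(prob)
auc <- (sum(ranks[test$Class == 1]) - n_fraud * (n_fraud + 1) / 2) /
  (n_fraud * n_good)

cat("accuracy:", accuracy, " baseline:", baseline_accuracy, " AUC:", auc, "\n")
```

Because the baseline already scores about 99% without detecting any fraud, a ranking-based metric such as the area under the ROC curve is the more informative number.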

Applying Sampling Methods to Balance Dataset

Different sampling methods can be used to balance the training data before a model is applied to it. First, check the number of good and fraud transactions in the training set.

There are 227K good and 394 fraud transactions.

In R, the Random Over-Sampling Examples (ROSE) and DMwR packages can be used to quickly apply sampling strategies. The ROSE package generates artificial data based on sampling methods and a smoothed bootstrap approach. It also provides functions, such as roc.curve, for measuring accuracy.

The different types of sampling methods are the following.

Oversampling

Setting method = "over" instructs the algorithm to perform oversampling. As the original training set has 227K good observations, the minority class is oversampled with replacement until it also reaches 227K, giving a dataset with a total of about 454K samples.
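A minimal sketch of oversampling with ROSE's ovun.sample function follows. The `train` data frame is a small hypothetical stand-in (950 good, 50 fraud), so the target size here is 1,900 rather than 454K.

```r
library(ROSE)

set.seed(42)
# Hypothetical training set: 950 good, 50 fraud
train <- data.frame(
  V1     = c(rnorm(950), rnorm(50, mean = 2)),
  Amount = rexp(1000, rate = 1 / 80),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

n_good <- sum(train$Class == 0)

# N is the total size of the result: the majority class is kept as-is and the
# minority class is resampled with replacement to fill the remainder
over <- ovun.sample(Class ~ ., data = train, method = "over",
                    N = 2 * n_good, seed = 1)$data

print(table(over$Class))
```

The extra fraud rows are exact copies of existing ones, so no new information is added to the data.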

Undersampling

This method works like oversampling in reverse and is done without replacement: the majority class is reduced until the number of good transactions equals the number of fraud transactions. Because most of the good transactions are discarded, significant information is lost from this sample. This can be attained using method = "under".
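Undersampling with ovun.sample might look like this. Again, `train` is a hypothetical 950/50 data set, so the balanced result has only 100 rows.

```r
library(ROSE)

set.seed(42)
# Hypothetical training set: 950 good, 50 fraud
train <- data.frame(
  V1     = c(rnorm(950), rnorm(50, mean = 2)),
  Amount = rexp(1000, rate = 1 / 80),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

n_fraud <- sum(train$Class == 1)

# N is the total size of the result: the minority class is kept as-is and the
# majority class is sampled without replacement down to the remainder
under <- ovun.sample(Class ~ ., data = train, method = "under",
                     N = 2 * n_fraud, seed = 1)$data

print(table(under$Class))
```

Here 900 of the 950 good rows are thrown away, which illustrates how much of the majority class this method discards.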

Both Sampling

This method is a combination of both oversampling and undersampling methods. Using this method, the majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = “both”.
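A sketch of the combined approach is shown below, assuming the same hypothetical 950/50 `train` data. With method = "both", the argument p sets the probability of drawing from the minority class, so the final class counts are approximately, not exactly, equal.

```r
library(ROSE)

set.seed(42)
# Hypothetical training set: 950 good, 50 fraud
train <- data.frame(
  V1     = c(rnorm(950), rnorm(50, mean = 2)),
  Amount = rexp(1000, rate = 1 / 80),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

# Keep the original size; each class makes up roughly half of the result
both <- ovun.sample(Class ~ ., data = train, method = "both",
                    N = nrow(train), p = 0.5, seed = 1)$data

print(table(both$Class))
```

Keeping N at the original size avoids both the very large data set of pure oversampling and the very small one of pure undersampling.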

ROSE Sampling

The ROSE sampling method generates data synthetically, using a smoothed bootstrap, and can provide a better estimate of the original data.
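A minimal sketch of calling the ROSE function follows, using the same hypothetical 950/50 `train` data as a stand-in for the real training set.

```r
library(ROSE)

set.seed(42)
# Hypothetical training set: 950 good, 50 fraud
train <- data.frame(
  V1     = c(rnorm(950), rnorm(50, mean = 2)),
  Amount = rexp(1000, rate = 1 / 80),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

# Generate a synthetic, roughly balanced data set of the same size
rose <- ROSE(Class ~ ., data = train, seed = 1)$data

print(table(rose$Class))
print(head(rose))
```

Unlike the ovun.sample results, the rows returned here are newly generated values drawn around the original observations rather than copies of them.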

Synthetic Minority Over-Sampling Technique (SMOTE) Sampling

This method is used to avoid the overfitting that occurs when exact replicas of minority instances are added to the main dataset. A subset of data from the minority class is taken, and new, similar synthetic instances are created and added to the original dataset.

After applying each sampling technique, check the count of records in each class to confirm that the data is balanced.
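To make the idea behind SMOTE concrete, here is a hand-rolled base R sketch. It is not the DMwR implementation: `smote_sketch` is a hypothetical helper written for this illustration, the data is made up, and only numeric features are handled.

```r
set.seed(99)

# Hypothetical training set with two numeric features: 200 good, 10 fraud
train <- data.frame(
  V1    = c(rnorm(200), rnorm(10, mean = 3)),
  V2    = c(rnorm(200), rnorm(10, mean = 3)),
  Class = c(rep(0, 200), rep(1, 10))
)

# Hypothetical helper: create n_new synthetic rows by interpolating between a
# random minority point and one of its k nearest minority neighbours
smote_sketch <- function(minority, n_new, k = 5) {
  x <- as.matrix(minority)
  d <- as.matrix(dist(x))  # pairwise distances within the minority class
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                      dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(x), 1)
    neighbours <- order(d[base, ])[2:(k + 1)]  # position 1 is the point itself
    nb <- neighbours[sample(k, 1)]
    gap <- runif(1)                            # position along the segment
    synthetic[i, ] <- x[base, ] + gap * (x[nb, ] - x[base, ])
  }
  as.data.frame(synthetic)
}

minority  <- train[train$Class == 1, c("V1", "V2")]
synthetic <- cbind(smote_sketch(minority, n_new = 190), Class = 1)
balanced  <- rbind(train, synthetic)

# Count of records in each class after sampling
print(table(balanced$Class))
```

Each synthetic row lies on a line segment between two real fraud rows, which is why the new points are similar to, but not copies of, the originals.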

A logistic classifier is fitted to each balanced training set and used to predict the test data. Confusion matrix accuracy is disregarded because the test data is still imbalanced. Instead, the built-in roc.curve function is used to capture the ROC metric.
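The comparison described above could be sketched as follows. The data is hypothetical, SMOTE is left out to keep the example short, and the resulting numbers say nothing about the article's actual results.

```r
library(ROSE)

set.seed(2024)
# Hypothetical data generator: 950 good and 50 fraud per set
make_data <- function() data.frame(
  V1    = c(rnorm(950), rnorm(50, mean = 2)),
  V2    = c(rnorm(950), rnorm(50, mean = 1)),
  Class = factor(c(rep(0, 950), rep(1, 50)))
)
train <- make_data()
test  <- make_data()  # the test set stays imbalanced

# Balance only the training data, one data set per sampling method
balanced_sets <- list(
  over  = ovun.sample(Class ~ ., data = train, method = "over",
                      N = 1900, seed = 1)$data,
  under = ovun.sample(Class ~ ., data = train, method = "under",
                      N = 100, seed = 1)$data,
  both  = ovun.sample(Class ~ ., data = train, method = "both",
                      N = 1000, p = 0.5, seed = 1)$data,
  rose  = ROSE(Class ~ ., data = train, seed = 1)$data
)

# Fit a logistic classifier on each balanced set and score the untouched test set
auc <- sapply(balanced_sets, function(d) {
  model <- glm(Class ~ ., data = d, family = binomial)
  prob  <- predict(model, newdata = test, type = "response")
  roc.curve(test$Class, prob, plotit = FALSE)$auc
})

print(round(auc, 3))
```

Only the training data is resampled; scoring on the original, imbalanced test set keeps the comparison between methods fair.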

Conclusion

In this blog, the highest accuracy is obtained using the SMOTE method, although the results do not vary much across the sampling methods. When combined with a more robust algorithm, such as random forest or boosting, these methods may provide considerably higher accuracy.

When dealing with an imbalanced dataset, experiment with all of these methods to find the sampling method best suited to your data. For better results, advanced sampling methods that combine synthetic sampling with boosting can be used.

These sampling methods can be implemented in the same way in Python, too. For Python code, see the References section below.

References

Published at DZone with permission of
