Handling Imbalanced Data With R

Imbalanced data is a major obstacle to accurate predictions, especially for the minority class. Learn how to tackle imbalanced classification problems using R.

By Rathnadevi Manivannan · Oct. 09, 17 · Tutorial

Imbalanced data refers to classification problems where one class outnumbers the other by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extremely imbalanced data can be seen in banking or financial data, where the majority of credit card uses are acceptable and very few are fraudulent.

With an imbalanced dataset, a learning algorithm sees too few minority-class examples to obtain the information required to predict that class accurately. So, it is recommended to use a balanced classification dataset.

In this blog, let's discuss tackling imbalanced classification problems using R.

Data Description

A credit card transaction dataset with 284K transactions, of which 492 are fraudulent, and 31 columns is used as the source file. For the sample dataset, refer to the References section.

Columns

  • Time: Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
  • V1-V28: Principal component variables obtained with PCA.
  • Amount: Transaction amount.
  • Class: Dependent (response) variable, with value 1 for fraud and 0 for good transactions.

Synopsis

  • Performing exploratory data analysis
    • Checking imbalance data
    • Checking the number of transactions by hour
    • Checking the mean using PCA variables
  • Partitioning data
  • Building model on training set
  • Applying sampling methods to balance dataset

Performing Exploratory Data Analysis

Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.

Checking Imbalance Data

To find the imbalance in the dependent variable, perform the following:

Group the data by Class value using the group_by function from the dplyr package:
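
The original code for this step is not available here, so the following is only a minimal sketch of the grouping. The data frame name `transactions` and its toy values are assumptions standing in for the credit card data.

```r
library(dplyr)

# Hypothetical stand-in for the credit card data:
# 995 good (Class = 0) and 5 fraud (Class = 1) transactions.
transactions <- data.frame(
  Amount = c(rep(10, 995), rep(250, 5)),
  Class  = c(rep(0, 995), rep(1, 5))
)

# Count the rows in each class, then express each count as a percentage.
class_summary <- transactions %>%
  group_by(Class) %>%
  summarise(count = n()) %>%
  mutate(percent = 100 * count / sum(count))

print(class_summary)
```

A large gap between the two percentages is the imbalance this section is checking for.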

Use ggplot to show the percentage of the Class category:
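
As a rough illustration of such a plot, the sketch below draws a bar chart of class percentages with ggplot2. The `class_share` data frame and its values are made up for the example and are not the figures from the real dataset.

```r
library(ggplot2)

# Hypothetical class percentages (assumed values, for illustration only).
class_share <- data.frame(
  Class   = factor(c(0, 1)),
  percent = c(99.5, 0.5)
)

# One bar per class, labelled with its share of all transactions.
p <- ggplot(class_share, aes(x = Class, y = percent, fill = Class)) +
  geom_col() +
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.5) +
  labs(x = "Class (0 = good, 1 = fraud)", y = "Percentage of transactions")

print(p)
```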

Checking Number of Transactions by Hour

To check the number of transactions by day and hour, normalize the time by day and categorize the transactions into four quarters according to the time of day.

The resulting graph covers two days of transactions and shows that most of the fraudulent transactions occurred between hours 13 and 18.
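
The time normalization described above could be sketched in base R as follows. The sample `Time` values, the column names `Day`, `Hour`, and `Quarter`, and the six-hour quarter boundaries are assumptions. Because Time is measured from the first transaction, the hours here are relative to that transaction rather than to the wall clock.

```r
# Hypothetical sample of the Time column (seconds since the first transaction).
transactions <- data.frame(
  Time = c(0, 3600 * 7, 3600 * 13.5, 86400 + 3600 * 2, 86400 + 3600 * 20)
)

seconds_per_day <- 24 * 60 * 60

# Day index (1 = first day) and hour within that day (0 to 23).
transactions$Day  <- transactions$Time %/% seconds_per_day + 1
transactions$Hour <- (transactions$Time %% seconds_per_day) %/% 3600

# Split each day into four six-hour quarters.
transactions$Quarter <- cut(
  transactions$Hour,
  breaks = c(0, 6, 12, 18, 24),
  right  = FALSE,
  labels = c("00-05", "06-11", "12-17", "18-23")
)

# Number of transactions by day and hour.
print(table(Day = transactions$Day, Hour = transactions$Hour))
```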

Checking Mean Using PCA Variables

To find data anomalies, take the mean of each variable from V1 to V28 and check the variation. In the resulting plot, the points with the most variation are shown in blue.
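
One plausible reading of this step is to compare the per-class mean of each PCA variable. The sketch below does that on simulated data. The simulated values, the shift applied to the fraud rows, and the name `mean_gap` are all assumptions, not the author's code.

```r
set.seed(42)

# Simulated stand-in for V1..V28: fraud rows are shifted so that they differ.
n_good   <- 500
n_fraud  <- 20
pca_cols <- paste0("V", 1:28)

good  <- matrix(rnorm(n_good * 28), ncol = 28)
fraud <- matrix(rnorm(n_fraud * 28, mean = 2), ncol = 28)

transactions <- data.frame(rbind(good, fraud))
names(transactions) <- pca_cols
transactions$Class <- c(rep(0, n_good), rep(1, n_fraud))

# Mean of every PCA variable within each class (one row per class).
class_means <- aggregate(
  transactions[pca_cols],
  by  = list(Class = transactions$Class),
  FUN = mean
)

# Absolute difference between the fraud and good means, largest first.
mean_gap <- abs(unlist(class_means[2, pca_cols]) - unlist(class_means[1, pca_cols]))
print(head(sort(mean_gap, decreasing = TRUE)))
```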

Partitioning Data

In predictive modeling, data needs to be partitioned into a training set (80% of the data) and a testing set (20% of the data). After partitioning the data, feature scaling is applied to standardize the range of the independent variables.
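
A minimal base R sketch of an 80/20 split followed by feature scaling is shown below. The simulated data and the choice to reuse the training mean and standard deviation on the test set are assumptions. The original code may have used a helper package for the split.

```r
set.seed(123)

# Simulated stand-in for the transaction data.
transactions <- data.frame(
  Amount = runif(1000, 1, 500),
  V1     = rnorm(1000),
  Class  = rbinom(1000, 1, 0.05)
)

# Randomly assign 80% of the rows to training and the rest to testing.
train_idx <- sample(seq_len(nrow(transactions)), size = 0.8 * nrow(transactions))
train <- transactions[train_idx, ]
test  <- transactions[-train_idx, ]

# Standardize the independent variables using the training mean and sd,
# then apply the same transformation to the test set.
feature_cols <- setdiff(names(transactions), "Class")
train_scaled <- scale(train[feature_cols])
train[feature_cols] <- as.data.frame(train_scaled)
test[feature_cols]  <- as.data.frame(scale(
  test[feature_cols],
  center = attr(train_scaled, "scaled:center"),
  scale  = attr(train_scaled, "scaled:scale")
))
```

Scaling the test set with the training statistics keeps information from the test set out of the preprocessing step.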

Building Model on Training Set

To build a model on the training set, perform the following:

  • Apply a logistic classifier on the training set.
  • Predict the test set.
  • Check the predicted output on the imbalanced data.

Using the confusion matrix, the test result shows 99.9% accuracy. This figure is inflated because the good (Class 0) records dominate the data, so let's neglect this accuracy. Using the ROC curve, the test result shows a score of 78%, which is very low.
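
The steps above can be sketched with base R's glm on simulated imbalanced data. The data-generating formula and variable names are assumptions. The area under the ROC curve is computed here with the rank-sum identity as a stand-in for a packaged ROC function.

```r
set.seed(1)

# Simulated imbalanced data: only a small share of rows are Class = 1.
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
data <- data.frame(x1, x2, Class = rbinom(n, 1, plogis(-5 + 1.5 * x1 + x2)))

train_idx <- sample(n, 0.8 * n)
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Fit a logistic classifier on the training set and predict the test set.
model <- glm(Class ~ ., data = train, family = binomial)
prob  <- predict(model, newdata = test, type = "response")
pred  <- as.integer(prob > 0.5)

# Confusion matrix and plain accuracy, which the majority class dominates.
confusion <- table(actual = test$Class, predicted = factor(pred, levels = c(0, 1)))
accuracy  <- mean(pred == test$Class)

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
pos <- test$Class == 1
r   <- rank(prob)
auc <- (sum(r[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos))

print(confusion)
print(c(accuracy = accuracy, auc = auc))
```

Accuracy stays high even for a model that rarely predicts fraud, which is why the ROC-based score is the more informative number here.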

Applying Sampling Methods to Balance Dataset

Different sampling methods are used to balance the given data before applying a model on the balanced data. First, check the number of good and fraud transactions in the training set.

There are 227K good and 394 fraud transactions.

In R, the Random Over-Sampling Examples (ROSE) and DMwR packages are used to quickly perform sampling strategies. The ROSE package generates artificial data based on sampling methods and a smoothed bootstrap approach. It also provides accuracy functions, such as roc.curve, to quickly evaluate the results.

The different types of sampling methods are the following.

Oversampling

The method “over” instructs the algorithm to perform oversampling. As the training set has 227K good observations, the minority class is oversampled until it also reaches 227K, so the resulting dataset has a total of 454K samples. This can be attained using method = “over”.

Undersampling

This method works like oversampling but in the opposite direction: the majority class is sampled without replacement until the number of good transactions equals the number of fraud transactions. Because most of the good transactions are discarded, significant information can be lost from the sample. This can be attained using method = “under”.

Both Sampling

This method is a combination of both oversampling and undersampling methods. Using this method, the majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = “both”.
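
The three modes described so far can be sketched with ovun.sample from the ROSE package. The toy training set, the target sizes passed as N, and the seed values are assumptions chosen for illustration.

```r
library(ROSE)

set.seed(7)

# Toy training set: 950 good (0) and 50 fraud (1) rows.
train <- data.frame(
  V1     = rnorm(1000),
  Amount = runif(1000, 1, 500),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)
n_good  <- sum(train$Class == "0")
n_fraud <- sum(train$Class == "1")

# Oversampling: grow the minority class until the total reaches 2 * n_good.
over <- ovun.sample(Class ~ ., data = train, method = "over",
                    N = 2 * n_good, seed = 1)$data

# Undersampling: shrink the majority class until the total is 2 * n_fraud.
under <- ovun.sample(Class ~ ., data = train, method = "under",
                     N = 2 * n_fraud, seed = 1)$data

# Both: keep the original size and aim for roughly half of each class.
both <- ovun.sample(Class ~ ., data = train, method = "both",
                    N = nrow(train), p = 0.5, seed = 1)$data

print(table(over$Class))
print(table(under$Class))
print(table(both$Class))
```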

ROSE Sampling

The ROSE sampling method generates data synthetically and provides a better estimate of the original data.
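
A short sketch of this method using the ROSE function from the ROSE package follows. The toy training set and the seed are assumptions. By default the function returns a synthetic dataset of the same size as the input.

```r
library(ROSE)

set.seed(7)

# Toy training set: 950 good (0) and 50 fraud (1) rows.
train <- data.frame(
  V1     = rnorm(1000),
  Amount = runif(1000, 1, 500),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

# Generate a synthetic dataset from the training data.
rose <- ROSE(Class ~ ., data = train, seed = 1)$data

print(table(rose$Class))
```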

Synthetic Minority Over-Sampling Technique (SMOTE) Sampling

This method is used to avoid the overfitting that occurs when exact replicas of minority instances are added to the main dataset. For example, a subset of data from the minority class is taken, and new, similar synthetic instances are created and added to the original dataset. The count of records in each class is then checked after applying the sampling techniques.
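
To illustrate the idea, the sketch below implements a simplified SMOTE-style interpolation in base R. It is not the DMwR implementation. The function name `smote_sketch`, the toy data, and the choice of k = 5 neighbours are assumptions. The last line counts the records in each class.

```r
set.seed(99)

# Toy data: 200 majority rows and 20 minority rows with two features.
majority <- matrix(rnorm(200 * 2), ncol = 2,
                   dimnames = list(NULL, c("V1", "V2")))
minority <- matrix(rnorm(20 * 2, mean = 3), ncol = 2,
                   dimnames = list(NULL, c("V1", "V2")))

# Hypothetical helper: create n_new synthetic rows, each placed on the line
# between a random minority row and one of its k nearest minority neighbours.
smote_sketch <- function(x, n_new, k = 5) {
  d <- as.matrix(dist(x))  # pairwise distances among minority rows
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                      dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base       <- sample(nrow(x), 1)
    neighbours <- order(d[base, ])[2:(k + 1)]  # position 1 is the row itself
    mate       <- neighbours[sample.int(k, 1)]
    gap        <- runif(1)
    synthetic[i, ] <- x[base, ] + gap * (x[mate, ] - x[base, ])
  }
  synthetic
}

synthetic <- smote_sketch(minority, n_new = nrow(majority) - nrow(minority))

balanced <- data.frame(
  rbind(majority, minority, synthetic),
  Class = factor(c(rep(0, nrow(majority)),
                   rep(1, nrow(minority) + nrow(synthetic))))
)

print(table(balanced$Class))
```

Because each synthetic row is an interpolation rather than a copy, the minority class grows without exact duplicates.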

A logistic classifier model is fitted on each balanced training set and used to predict the test data. Confusion matrix accuracy is neglected because the test data is imbalanced. Instead, the ROC metric is captured using the inbuilt roc.curve function.
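
The comparison described above might look like the following sketch, which fits a logistic model on several balanced versions of a simulated training set and scores each on the untouched test set with roc.curve from the ROSE package. The simulated data, the set of methods compared, and the seeds are assumptions, and the resulting numbers say nothing about the real dataset.

```r
library(ROSE)

set.seed(11)

# Simulated imbalanced data with a factor response.
n  <- 4000
x1 <- rnorm(n)
x2 <- rnorm(n)
data <- data.frame(x1, x2,
                   Class = factor(rbinom(n, 1, plogis(-4 + 1.5 * x1 + x2))))

train_idx <- sample(n, 0.8 * n)
train <- data[train_idx, ]
test  <- data[-train_idx, ]
n_good  <- sum(train$Class == "0")
n_fraud <- sum(train$Class == "1")

# Balanced versions of the training set, one per sampling method.
balanced_sets <- list(
  over  = ovun.sample(Class ~ ., data = train, method = "over",
                      N = 2 * n_good, seed = 1)$data,
  under = ovun.sample(Class ~ ., data = train, method = "under",
                      N = 2 * n_fraud, seed = 1)$data,
  both  = ovun.sample(Class ~ ., data = train, method = "both",
                      N = nrow(train), p = 0.5, seed = 1)$data,
  rose  = ROSE(Class ~ ., data = train, seed = 1)$data
)

# Fit on each balanced set, predict the imbalanced test set, keep the AUC.
auc_by_method <- sapply(balanced_sets, function(balanced) {
  model <- glm(Class ~ ., data = balanced, family = binomial)
  prob  <- predict(model, newdata = test, type = "response")
  roc.curve(test$Class, prob, plotit = FALSE)$auc
})

print(round(auc_by_method, 3))
```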

Conclusion

In this blog, the highest accuracy is obtained using the SMOTE method. As the results do not vary much across these sampling methods, combining them with a more robust algorithm, such as random forest or boosting, may provide higher accuracy.

When dealing with an imbalanced dataset, experiment with all these methods to find the best-suited sampling method for your dataset. For better results, advanced sampling methods that combine synthetic sampling with boosting can be used.

These sampling methods can be implemented in the same way in Python, too. For the Python code, see the References section below.

References

  • Sample Credit Card Transaction Data
  • Associated R and Python Code in GitHub

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.
