Striking a Balance: 6 Techniques to Overcome Unbalanced Data in Machine Learning

data science Nov 19, 2023

Are you curious about unbalanced data and what it entails? 

Maybe you are familiar with unbalanced data sets, but wondering what may be the best techniques to handle unbalanced data in machine learning effectively?

Don't worry, we've got you covered! 

Before looking into the 6 techniques to handle unbalanced data sets like a pro, you first need to know what unbalanced data means, so let's start there.



What is Unbalanced Data?

Imagine you're at a party, and there's a dessert table with different types of cakes.

You notice there are 10 chocolate cakes and only 2 vanilla cakes on the dessert table. 

Do you think they are balanced?

No, right?

The number of chocolate cakes is much higher than the number of vanilla cakes, and that mismatch is an imbalance.

In the world of data science, a similar situation can occur when you have a dataset in which the classes (or categories) are not evenly represented.

When one class has a significantly larger number of instances than another, you have an unbalanced class, also called unbalanced data.

Clear enough?

Now let’s look at a real-world example of unbalanced data.

Unbalanced Data: Real World Example

Imagine you are working as a data scientist in a bank, and your task is to build a model to detect fraudulent transactions. 

Now, let's say you have a dataset of 10,000 transactions, out of which only 100 are flagged as fraudulent, while the remaining 9,900 are legitimate transactions.


In this case, you have an unbalanced data situation. 

The fraudulent transactions (100) represent the minority class, while the legitimate transactions (9,900) make up the majority class.
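If you had this dataset in Python, a quick way to spot the imbalance is simply to count the labels. Below is a tiny sketch; the label list is fabricated here just to mirror the 9,900/100 split.

from collections import Counter

# Fabricated labels mirroring the example: 0 = legitimate, 1 = fraudulent
labels = [0] * 9900 + [1] * 100

counts = Counter(labels)
print(counts)  # Counter({0: 9900, 1: 100})
print("Minority share: {:.1%}".format(counts[1] / len(labels)))  # 1.0%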

Now that you have seen a real-world example, let’s see what challenges unbalanced data poses.

Challenges Posed by Unbalanced Data

Unbalanced data sometimes poses challenges when building models because the algorithm might pay more attention to the majority class, just like people at the party are more likely to choose the chocolate cake because it's abundant.


In our bank transactions example, you have to keep a close eye on all transactions to catch any potential fraud.

However, if most transactions are legitimate, it becomes more challenging to spot the fraudulent ones, just like finding a needle in a haystack, right?

You want your model to accurately identify fraudulent transactions without getting overwhelmed by the large number of legitimate transactions.

Now, are you overwhelmed by the challenges of unbalanced data? 🙂 Don’t worry, there are strategies to handle it, and we will explore them now.

6 Techniques for Handling Unbalanced Data

Handling unbalanced data sets is like ensuring that all types of cakes get equal attention and appreciation :-)  

In data science, we employ special techniques to address this imbalance and give fair consideration to all classes, making our models more accurate and reliable.

The 6 techniques to handle unbalanced data are:

  • 1) Oversampling
  • 2) Undersampling
  • 3) Hybrid approaches
  • 4) ROSE
  • 5) SMOTE
  • 6) Class weighting

Of the above 6 techniques, oversampling, undersampling, hybrid approaches, SMOTE, and ROSE fall under resampling techniques, whereas class weighting falls under algorithmic techniques.

Let’s look into each of those techniques.

Resampling Techniques:

Resampling techniques overcome unbalanced data by adding or removing instances to balance the class distribution. Let’s look at the specific resampling techniques.

1) Oversampling

Oversampling is a resampling technique that involves increasing the number of instances in the minority class by creating synthetic samples or replicating existing ones.

It helps balance the class distribution.

So, in the chocolate cakes vs vanilla cakes analogy, oversampling would involve adding more vanilla cakes to match the quantity of chocolate cakes, ensuring a more balanced representation of the classes.
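If you're working in Python, the imbalanced-learn library provides a RandomOverSampler that does this replication for you. Here is a minimal sketch on a synthetic dataset (the dataset and variable names are only for illustration):

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# A toy dataset where roughly 80% of the samples belong to class 0
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# Randomly replicate minority-class instances until both classes are equal in size
oversampler = RandomOverSampler(random_state=42)
X_over, y_over = oversampler.fit_resample(X, y)

print("Before oversampling:", Counter(y))       # roughly 800 vs 200
print("After oversampling: ", Counter(y_over))  # both classes equal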


 

2) Undersampling

Undersampling is another resampling technique; it involves reducing the number of instances in the majority class to match the minority class. It can be done randomly or strategically by selecting representative samples. (A full, runnable undersampling example appears at the end of this post.)

So, in the chocolate cakes vs vanilla cakes analogy, undersampling would involve reducing the number of chocolate cakes to match the quantity of vanilla cakes, creating a more balanced representation of the classes.

3) Hybrid Approaches

The hybrid approach is another resampling technique.

Can you guess what hybrid approach does?

Yes, you guessed it! It combines oversampling and undersampling techniques to strike a balance between the two classes.

In our analogy, you would oversample the vanilla cakes by creating synthetic copies or generating new instances to increase their number. At the same time, you could undersample the chocolate cakes by selecting a subset of them, reducing their quantity.
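In Python, imbalanced-learn ships combined samplers that implement this idea. The sketch below uses SMOTEENN, which oversamples the minority class with SMOTE (covered next) and then removes ambiguous samples with Edited Nearest Neighbours; the dataset is again synthetic and purely illustrative.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# A toy dataset with roughly a 90/10 class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# SMOTE oversampling followed by Edited Nearest Neighbours cleaning,
# so minority samples are added and noisy samples are removed
combined = SMOTEENN(random_state=42)
X_hybrid, y_hybrid = combined.fit_resample(X, y)

print("Before:", Counter(y))
print("After: ", Counter(y_hybrid))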

Now that you have seen a few strategies, let’s look at 2 other popular resampling techniques, SMOTE and ROSE, which are used to handle unbalanced data in machine learning.

4 and 5) ROSE (Random Over-Sampling Examples) / SMOTE (Synthetic Minority Over-Sampling Technique)

ROSE & SMOTE are specific techniques that help us tackle the problem of limited samples in the minority class.

Neither technique blindly copies existing minority class instances.

Instead, both generate brand-new synthetic samples. SMOTE creates each new sample along the line connecting a minority instance to one of its nearest minority neighbors, while ROSE draws new samples from a small neighborhood around existing minority instances (a smoothed bootstrap), which adds more variation.

Imagine drawing a line between two nearby minority class samples on a graph: SMOTE creates new data points along that line.

These new data points act as synthetic examples.

In our cake analogy, suppose you take two vanilla cakes (from the minority class) that are already there, say a small cupcake and a big cupcake. Instead of just duplicating one of them, you bake a new mid-size cupcake that fits right in between the two. That is the SMOTE technique: it creates new vanilla cakes that are similar to, but not copies of, the ones you already have.



By generating these synthetic examples, ROSE/SMOTE helps to balance the class distribution. 

It's like adding more voices to the minority class, making it more prominent and giving it a fair chance to be properly represented in the model's training.
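If you want to try SMOTE in Python, imbalanced-learn provides it directly; here is a minimal sketch on a synthetic dataset. (ROSE itself is best known as an R package; as an assumption worth verifying, recent imbalanced-learn versions can approximate its smoothed bootstrap through the shrinkage parameter of RandomOverSampler.)

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A toy dataset with roughly a 90/10 class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Each synthetic point is interpolated between a minority instance
# and one of its k nearest minority neighbours
smote = SMOTE(k_neighbors=5, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)

print("Before SMOTE:", Counter(y))
print("After SMOTE: ", Counter(y_sm))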

Algorithmic Approaches:

6) Class Weighting

Class weighting is an algorithmic technique that assigns a higher weight to the minority class during model training, so the model gives it more importance and doesn't simply favor the majority class.

So, in the chocolate cakes vs vanilla cakes analogy, using the "Class Weighting" strategy, you would assign higher weights to the vanilla cakes during the model training process. 

This technique helps to address the class imbalance and ensure that the model pays more attention to the minority class. 
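Many scikit-learn estimators expose this through a class_weight parameter. Below is a minimal sketch using logistic regression on a synthetic dataset; passing 'balanced' weights each class inversely proportional to its frequency, and you could pass an explicit dictionary such as {0: 1, 1: 9} instead.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A toy dataset with roughly a 90/10 class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# 'balanced' makes errors on the rare class cost proportionally more during training
model = LogisticRegression(class_weight='balanced')
model.fit(X, y)

print("Training accuracy:", model.score(X, y))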

Well, what is the difference between SMOTE & ROSE?

The difference is that SMOTE adds new instances that closely resemble the existing minority instances, whereas ROSE also adds new instances but introduces more variation around them. That extra variation keeps the model from getting too fixated on specific instances, reducing the risk of overfitting, where the model becomes too specialized in the minority class patterns it has seen.

Unbalanced Class: Example Code

We start by importing numpy, make_classification (to build a synthetic dataset), and RandomUnderSampler from the imblearn.under_sampling module.

import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Generating a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_redundant=2, 
                           n_classes=2, weights=[0.8, 0.2], random_state=42)

After generating the synthetic dataset, we create an instance of RandomUnderSampler. 

# Creating an instance of RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)

We then perform undersampling on the dataset by calling the “fit_resample” method of the undersampler object, passing in the feature matrix 'X' and target vector 'y'. 

This step reduces the majority class samples to achieve a balanced class distribution.

# Performing undersampling on the dataset
X_resampled, y_resampled = undersampler.fit_resample(X, y)

Finally, we print the shapes of the resampled feature matrix 'X_resampled' and target vector 'y_resampled', along with the class counts from np.bincount, to verify that the undersampled dataset is balanced.

# Printing the shapes and class counts to verify the result
print("Shape of X_resampled:", X_resampled.shape)
print("Shape of y_resampled:", y_resampled.shape)
print("Class counts before:", np.bincount(y))
print("Class counts after: ", np.bincount(y_resampled))

Make sure to have the imbalanced-learn library installed (pip install imbalanced-learn) to use the RandomUnderSampler class.

Conclusion

As our exploration of how to deal with unbalanced data draws to a close, let's quickly recap the insights and strategies to overcoming it: 

  • Unequal Distribution in Datasets: First we saw that unbalanced data means having an unequal distribution of categories in a dataset, just like an imbalanced dessert table where one type of cake steals the spotlight.

  • Challenges of Unbalanced Data Sets: Then you saw that unbalanced data sets can pose challenges when building models, because the algorithm might pay more attention to the majority class, just like people at the party are more likely to choose the chocolate cake because it's abundant.
  • Resampling Techniques - Undersampling and Oversampling: Then you saw that undersampling is a resampling technique that involves reducing the number of instances in the majority class to match the minority class, whereas oversampling involves increasing the number of instances in the minority class by creating synthetic samples or replicating existing ones.
  • Hybrid Approaches in Balancing Data: Then we saw that Hybrid Approaches are methods that combine oversampling and undersampling techniques to strike a balance between the two classes and to handle the unbalanced data sets in machine learning. 
  • Advanced Oversampling Techniques - SMOTE and ROSE: Then we also explored SMOTE and ROSE which are oversampling techniques. Both generate synthetic new instances for the minority class to handle unbalanced data.  
  • Opting for SMOTE in Specific Scenarios: When you need new instances that resemble the existing minority class instances, SMOTE is a good choice for handling unbalanced data sets. 
  • Choosing ROSE for Diverse Sampling: On the other hand, when you want to introduce diversity in the synthetic samples and avoid overfitting, ROSE can be a better option to handle unbalanced data sets. It's like adding variations to the minority class representation to capture the overall patterns accurately.
  • Class Weighting Strategy: While undersampling, oversampling, hybrid techniques, ROSE, and SMOTE balance the class distribution by resampling, class weighting works by assigning a higher weight to the minority class during model training, giving it more importance and preventing the algorithm from favoring the majority class.


Question For You

You are working on a dataset for disease diagnosis, where you have 10,000 patient records. 

Out of these, 9,500 patients are labeled as healthy, and only 500 patients are labeled as having the disease. 

You aim to build a machine learning model to predict the presence of the disease. Which of the following statements accurately describes the class distribution in this scenario? 

  • A) This is an example of a balanced class distribution.
  • B) This is an example of an imbalanced class distribution.
  • C) The class distribution does not impact the model's performance.
  • D) The prediction from the model that is trained on this dataset will always be unbiased.

 Let me know in the comments if it is A, B, C, or D. 
