9 Outlier Detection Methods to Handle Data Outliers for Enhanced Data Quality & Insights

data science Oct 11, 2023

Hey there! Ever wondered why data outliers are the unruly troublemakers of the data world?

Curious to know how data scientists handle the outliers in data?

You've landed at the perfect spot.  

Before we look into the 9 outlier detection methods, you should first know what outliers are!

In this article, we will cover what outliers are, why detecting them matters, nine detection techniques, how to remove outliers from a dataset, and the challenges involved.


What are Outliers?

Imagine you're at a basketball game with a group of friends, and all the players on the court have similar heights, let's say around 6 feet.

Suddenly, you notice a player who stands at a towering 7 feet tall. That player would be considered an outlier in terms of height among the group, right?


In simple terms, outliers are like the "giants" or the "exceptionally tall players" in a basketball team. 

In a dataset, outliers are like the "odd ones out" or the "exceptions". 

Just like the exceptionally tall player at the basketball game, data outliers in statistics are the data points that stand out from the rest because they are unusually different.

Clear enough? 

Now that we have seen what outliers are, let's see what problems they cause and why outlier identification is vital for data analysis.

Why is Outlier Detection vital in Data Analysis?

Imagine you're organizing a treasure hunt with a group of friends. 

Each clue leads you to the next, eventually guiding you to the hidden treasure. 

Now, imagine if someone sneaky comes along and swaps one of the clues with a misleading one. 

It would throw off the entire hunt, leading you and your friends in the wrong direction, right?



In data analysis, data outliers are like those misleading clues. They're the sneaky data points that don't quite fit in with the rest of the dataset. 

They can be way too large, way too small, or just completely different from what we expect.

These data outliers can wreak havoc on our analysis, just like that swapped clue did to the treasure hunt.
 

 Before we delve into specific outlier detection techniques, though, let's briefly discuss the concept of "outlier tests."

Outlier tests are essential components in data analysis, providing a systematic way to identify and handle outliers within a dataset. These tests help ensure data quality and enhance the reliability of insights derived from the analysis.

Now, let's explore nine common outlier detection methods, ranging from statistical approaches to machine learning algorithms and visualization techniques.

9 Common Outlier Detection Techniques

There are four broad categories of methods for finding outliers: statistical methods, unsupervised learning approaches, supervised learning approaches, and visualization techniques.

Among these categories, we will see 3 statistical methods, 2 unsupervised learning approaches, 2 supervised learning approaches, & 2 visualization techniques. In total, we will look into 9 outlier detection techniques that can help you handle data outliers.

Statistical Methods: Outlier Detection

In this statistical methods category, we will look into 3 methods:

1) Z-Score or Standard Deviation Method

The Z-score is a measure of how many standard deviations an observation is away from the mean. The Z-score method is a statistical method that involves calculating the z-score for each data point and identifying points that fall beyond a specified threshold.

Going back to the basketball player analogy, think of the z-score as a way to tell you how far away a player's height is from the average height of all the players.

For example, let's say the average height of the basketball players is 6 feet, and the standard deviation is 2 inches.

Then if you see a player who has a height of 6 feet 2 inches, then they have a z-score of 1, and it means that player is one standard deviation above the average height.

Let’s say you set the Z-score threshold as 2 (6 feet 4 inches).

Now, a data point will be considered an outlier if its Z-score is greater than 2 or less than -2. In other words, if the absolute value of the Z-score (the value with the minus sign ignored) for a data point is greater than 2, it is typically classified as an outlier.

So if you see a basketball player whose height is 5 feet 6 inches, you can note that the player's Z-score is -3 and absolute Z-score is 3. Since 3 is greater than the Z-score threshold of 2, you can consider this player an outlier among the basketball players. 
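To make this concrete, here is a minimal sketch of the Z-score method using NumPy. The player heights (in inches) are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical player heights in inches (72 inches = 6 feet)
heights = np.array([72, 73, 71, 72, 74, 70, 66])

# Z-score: how many standard deviations each height is from the mean
z_scores = (heights - heights.mean()) / heights.std()

# Flag any height whose absolute z-score exceeds the threshold of 2
outliers = heights[np.abs(z_scores) > 2]
print(outliers)  # the 5'6" (66-inch) player stands out
```

Here the 66-inch height sits more than two standard deviations below the mean, so it is the only point flagged.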

2) Modified Z-Score Method

Similar to the standard z-score, this method calculates the modified z-score using the median and median absolute deviation (MAD) instead of the mean and standard deviation. The MAD is a measure of how spread out the data is from the median.

Outliers are identified based on the modified z-score threshold.

To continue our basketball analogy, let's say the median height of the basketball players is 6 feet 1 inch, and the median absolute deviation is 2 inches.

Then if you see a player who has a height of 6 feet 3 inches, their modified z-score is 1, which means the player is one median absolute deviation above the median height.

Based on the modified Z-score threshold value and how many median absolute deviations each data point is away from the median, you can determine whether to consider a data point an outlier.

To refresh your academic math: the median is simply the middle value when the elements are in sorted order.

 

“While the “Z-score” method flags a data point as an outlier when it is more than “x” standard deviations (where “x” is the threshold) away from the mean, the “modified z-score” method flags a data point as an outlier when it is more than “x” median absolute deviations away from the median.”
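Here is a minimal sketch of the modified z-score method, again with made-up heights in inches. Note that the conventional formula also multiplies by a constant of 0.6745, so that the commonly used threshold of 3.5 is comparable to a standard z-score:

```python
import numpy as np

heights = np.array([71, 72, 72, 73, 74, 70, 66])  # hypothetical heights in inches

median = np.median(heights)                # middle value of the sorted heights
mad = np.median(np.abs(heights - median))  # median absolute deviation (MAD)

# Conventional modified z-score: scaled by 0.6745 so that a
# threshold of 3.5 behaves like a standard z-score threshold
modified_z = 0.6745 * (heights - median) / mad

outliers = heights[np.abs(modified_z) > 3.5]
print(outliers)  # only the 66-inch player is flagged
```

Because the median and MAD are barely affected by extreme values, this method is more robust than the plain z-score when the data already contains outliers.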

3) Tukey's Fence Method

This method utilizes the interquartile range (IQR) to identify outliers. It defines a lower and upper fence based on the first and third quartiles, and any data points outside these fences are considered outliers.

In our basketball analogy, you would calculate the IQR by finding the height range between the 25th percentile (let's say 5 feet 11 inches) and the 75th percentile (let's say 6 feet 3 inches). Then you multiply the IQR by a constant (let's say 1.5). 

To be more precise, calculating the IQR as 75th percentile - 25th percentile would be

6 feet 3 inches - 5 feet 11 inches = 4 inches (IQR)

Multiplying IQR with the constant would give us

IQR (4 inches) * Constant (1.5) = 6 inches

Calculating the lower boundary and upper boundary would be as below.

Lower boundary = 5 feet 11 inches - 6 inches = 5 feet 5 inches

Upper boundary = 6 feet 3 inches + 6 inches = 6 feet 9 inches

Based on the above calculation, if any player has a height below 5 feet 5 inches or above 6 feet 9 inches, then that player would be considered an outlier.
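The same fence calculation can be sketched in a few lines of NumPy, using hypothetical heights in inches (72 inches = 6 feet):

```python
import numpy as np

heights = np.array([71, 71, 72, 72, 73, 73, 74, 75, 84])  # hypothetical inches

q1 = np.percentile(heights, 25)   # first quartile (25th percentile)
q3 = np.percentile(heights, 75)   # third quartile (75th percentile)
iqr = q3 - q1                     # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is considered an outlier
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
print(outliers)  # the 7-foot (84-inch) player falls outside the fences
```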

 

Unsupervised Learning Approaches: Outlier Detection

Unsupervised learning is a method of data analysis in which you don't give the algorithm any information about the dataset, since no labeled historical data is available. 

So the unsupervised learning approaches try to organize the data themselves into a structure that makes sense and then identify the outliers. We will look at 2 of the common unsupervised learning approaches:

4) Local Outlier Factor (LOF)

Local outlier factor method deals with finding data outliers in the local neighborhood. 

Going back to our basketball player height analogy, let’s say the basketball court is surrounded by a crowd of players who are all different heights. 

Let’s say you don’t have a history of what is considered normal or an outlier height for a basketball player. 

If you look at each player and compare their heights to the heights of their nearby teammates (instead of comparing with the entire team as a whole), then you are using Local Outlier Factor (LOF) method.

Based on the comparisons, LOF assigns a score to each player. 

If someone has a high LOF score, it means they're significantly taller than the players around them and could potentially be an outlier. 

By doing this, it can detect exceptionally tall players who might go unnoticed if we only looked at the overall average height.
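Here is a minimal sketch of LOF using scikit-learn's LocalOutlierFactor, with made-up heights in inches; the 7-foot (84-inch) player is the one we expect to be flagged:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical heights; the last player (84 inches = 7 feet) towers over the rest
X = np.array([[70], [71], [72], [71], [73], [72], [70], [84]])

# Compare each player to their 3 nearest neighbors;
# contamination=0.125 means we expect roughly 1 in 8 points to be an outlier
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.125)
labels = lof.fit_predict(X)   # -1 marks outliers, 1 marks inliers

print(labels)
```

The contamination parameter is an assumption about how many outliers exist; in practice you would tune it or use the default.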


 

5) Isolation Forest

Isolation forest is another unsupervised learning approach. Imagine you're in a dense forest full of trees, and somewhere in that forest, there's a rare and exceptional tree that stands out from the rest. 

Let’s say you start by randomly selecting a tree in the forest and then choose a random direction to move in. 

Then you keep repeating this process, branching out and isolating different trees from the rest of the forest. 

If you continue until you finally reach the rare tree that is far away from the others, then it means you have used the “Isolation forest” method to find the outlier.

The key idea here is that if you can isolate or separate a tree quickly from the rest, it's more likely to be an outlier.

It's like you discovering a tree that is unique because it's not surrounded by other trees and has its own little space.

By using this Isolation Forest method, you can efficiently identify data outliers in a dataset.
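A minimal sketch of this idea, using scikit-learn's IsolationForest on made-up one-dimensional data (a cluster of ordinary values plus one far-away point):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# 50 "ordinary trees": values clustered around 72
normal = rng.normal(loc=72, scale=2, size=(50, 1))
# one "rare tree" far from the rest
X = np.vstack([normal, [[95]]])

# Points that are isolated in few random splits get labeled -1 (outlier)
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)

print(labels[-1])  # the far-away point is flagged as -1
```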


Supervised Learning Approaches: Outlier Detection

Supervised learning approaches learn through examples, where the examples are some data composed of an input and an output. 

Based on examples/past history data, the supervised learning algorithm already knows what an outlier looks like and tries to spot data outliers in new data, as well.

We will look at 2 of the common supervised learning approaches:

6) One-Class Support Vector Machines (SVM)

Imagine you're a magician, and you have a special power to draw a boundary around a specific area. 

One-Class Support Vector Machines (SVM) is like your magical ability to draw that boundary around objects.

SVM looks at the data points and tries to draw a boundary that encloses the majority of the data. It's like creating a fence around the crowd at a concert, but only including the people who are really part of the event.

However, there's a twist. SVM is not just interested in drawing any boundary; it wants to draw the smallest possible boundary around the data.

7) Random Forest for Anomaly Detection

Random Forest is another supervised learning approach. Imagine you're part of a group of experts who are trying to identify unique animals in a jungle. 

Each expert is knowledgeable about different aspects of the animals—such as their color, size, or sound they make. Individually, they may not be able to spot the outliers, but when they work together, their combined expertise becomes a powerful tool. 

Random Forest for Anomaly Detection works in a similar way.

Each expert, or "tree" in the forest, examines the data based on a specific feature. They share their findings, and through a voting process, the group determines whether a data point should be considered ordinary or extraordinary. 

By using this collective decision-making approach, Random Forest can detect anomalies more accurately.



Well, you may be wondering how this differs from the Isolation Forest, since the names sound similar.

So, the main difference between Isolation Forest and Random Forest is the approach they take for outlier identification. 

Isolation Forest isolates outliers by measuring how quickly each data point can be separated from the rest, while Random Forest relies on a collective decision-making process among multiple trees to identify outliers. 

The other difference is that the isolation forest is unsupervised whereas the Random forest is a supervised learning approach.
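Since Random Forest is supervised, it needs labeled examples. Here is a minimal sketch using scikit-learn's RandomForestClassifier, with entirely hypothetical labels (1 = outlier height, 0 = normal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Labeled historical data (hypothetical): 1 = outlier height, 0 = normal
heights = np.array([[70], [71], [72], [73], [74], [60], [62], [88], [90], [71], [72]])
labels  = np.array([  0,    0,    0,    0,    0,    1,    1,    1,    1,    0,    0])

# Each tree votes; the forest combines the votes into a final decision
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(heights, labels)

preds = forest.predict([[72], [85]])
print(preds)  # expect: normal, outlier
```

The quality of this approach depends entirely on having representative labeled outliers in the training data.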


Seeing is Believing: Visualization Techniques for Outlier Detection

We will look at 2 of the common visualization techniques:

8) Scatterplots

Scatterplots are a simple yet effective visualization technique for spotting outliers. They display data points as individual markers on a two-dimensional plot, allowing you to identify any points that deviate significantly from the general pattern.

In a scatter plot, you can easily visualize and find the data points that are outliers.

9) Box Plots

The box plot is another important visualization approach that provides a summary of the data distribution, including the median, quartiles, and potential outliers. 

Outliers appear as individual data points outside the whiskers of the boxplot, making them easily identifiable. 

This is often associated with Tukey's fence method, as it uses the interquartile range and percentile values to draw the boxes in the graph. You can find more details about the box plot here: https://www.geeksforgeeks.org/box-plot/


Remember, in Tukey's fence method we calculate the 25th percentile and 75th percentile.

 

How to remove outliers in a dataset?

Here is example code to remove outliers from a dataset using the Z-score statistical method. First you need to install the numpy & scipy libraries using the “pip” command as below.

pip install numpy scipy

Next you can define a function like “remove_outliers_zscore” that takes a dataset as input along with an optional threshold parameter (defaulted to 3). 

The function calculates the z-scores for each data point using the “stats.zscore” function from the SciPy library. Data points with an absolute z-score higher than the threshold are considered outliers and are filtered out.

import numpy as np
from scipy import stats

def remove_outliers_zscore(data, threshold=3):
    # Convert a plain Python list to a NumPy array so boolean indexing works
    data = np.asarray(data)
    z_scores = np.abs(stats.zscore(data))
    filtered_data = data[z_scores < threshold]
    return filtered_data

# Example usage:
dataset = [10, 12, 15, 9, 14, 18, 22, 7, 21, 13, 25, 11, 16]

filtered_dataset = remove_outliers_zscore(dataset)
print(filtered_dataset)

The resulting filtered dataset is returned and printed.

Challenges or Limitations Associated with Outlier Detection

You may be wondering whether there are any challenges associated with these outlier detection methods. Yes, here are the top challenges that data scientists face.

1. Definition of "normal": You know how people have different ideas of what's normal or expected? Imagine you have two groups: a bunch of regular people and a group of professional basketball players (same analogy you saw before). 

Now, let's talk about their heights. What you might consider unusual for the regular people's height might be totally normal for the basketball players.

For example, you may see an average height around 5 feet 8 inches within a bunch of regular people, but you may commonly see heights of 6 feet 6 inches and above among basketball players.



So, if you try to set a clear boundary and say, "Hey, anyone above 6 feet is an outlier," you might end up mistakenly labeling some normal basketball players as outliers, right?

To overcome this challenge, you need to consider the specific circumstances, understand the context (in this case, the world of basketball), and carefully take into account what's expected in that particular dataset.

2. Noisy data / errors / mistakes: You know how sometimes data can be all messy and have mistakes or inconsistencies? 

Well, just like that friend of yours who sometimes does strange things because of a misunderstanding or a mistake, data can also have its own version of "weirdness."

Sometimes, when you are trying to find outliers, these little errors or inconsistencies in the data can make things really confusing.


These "noisy" data points can falsely appear as outliers.

To deal with this challenge, you first need to carefully sift through the data, double-check for errors, and try to remove the errors before performing outlier detection.

3. Contextual factors: Sometimes, there are external contextual factors that come into play, but they don't show up in the data itself.

Let's say you're looking at a group of people and trying to classify their behavior as to whether they are indoor or outdoor people.

But suddenly, the weather goes all wild, and there's a massive storm. Now, this weather change can affect people's behavior. Some might stay indoors, while others might brave the storm and go out.

Here's the tricky part: because of this sudden weather change, the people who decide to stay indoors seem like outliers in the data, but in reality, it's not because they're genuinely unique or different from others—it's just because of the weather.



To overcome this challenge, you need to think about the context, in this scenario the weather. 

Another example is people's shopping behavior: you may need to consider external contexts like how climate change & seasonal discounts can affect people's shopping behavior. 

Pros and Cons of Unsupervised Outlier Detection Methods

Local Outlier Factor method and Isolation forest are the 2 unsupervised learning approaches that we saw. Let’s see what pros and cons they hold for us.

Pros of Unsupervised Outlier Detection:

1. Discovery of unknown anomalies: Unsupervised outlier detection methods can help you uncover hidden or unknown anomalies in a dataset, even when you're not sure what to look for.

2. Flexibility and adaptability: Unsupervised outlier detection methods are quite flexible because you don’t need predefined labels or prior knowledge of outliers. 

Cons of Unsupervised Outlier Detection:

1. Subjectivity in interpretation: Unsupervised outlier detection methods don't have a clear ground truth to compare against, which means interpreting the results can be subjective. 

It's like looking at abstract art where different people might have different opinions on what is considered an outlier. 

You will find it harder to validate and reach a consensus on the identified outliers, so you may need a domain expert to weigh in on and validate the results.

2. False positives and noisy data: Unsupervised methods may also generate false positives, flagging data points as outliers when they are actually normal. These methods can be sensitive to noisy or erroneous data, which can lead to inaccurate outlier identification.

That is why you need to remove/fix the error data before using unsupervised learning approaches to identify the outliers.

3. Lack of context: Unsupervised outlier detection focuses solely on the data itself and may not consider important contextual factors that could influence outlier detection. 

It's like trying to solve a mystery without taking into account the background story or the surrounding circumstances. 

That is why you need to consider the external factors while trying to identify the outliers.

While unsupervised outlier detection methods offer the advantage of 

  1. Discovering unknown anomalies and 
  2. Being flexible in different scenarios, 

they also face challenges such as 

  1. Subjective interpretation, 
  2. False positives, & sensitivity to noisy data, and 
  3. Potential lack of contextual understanding.

Outliers in the Real World Business

Let's say you're running a retail store and you're looking at your daily sales data. 

Most days, you might have a consistent number of sales that fall within a certain range. 

But every now and then, you notice a day where the sales are unusually high or exceptionally low. These are the outliers in your business data.

Now, outliers in business data can tell us a lot of interesting things. 

Sometimes, they can be a reason to celebrate! 

An exceptionally high sales day could be because of a new marketing campaign that really resonated with customers or a special promotion that attracted a lot of people. 

On the other hand, outliers can also be a cause for concern. An unusually low sales day could indicate a problem in the supply chain, a dip in customer satisfaction, or even an issue with the product itself. It's like a warning sign that something might need attention or improvement.



Detecting data outliers in business data is important because it helps us identify trends, anomalies, and areas that require our attention. It's like having a business compass that guides us towards the areas that need focus or the potential opportunities we can capitalize on.

Analyzing data outliers in business data helps us make better decisions, understand customer behavior, and improve overall performance. 

Conclusion

 

As we finish talking about unusual data points, let's summarize the main things we learned:

  1. Uniqueness of Outliers:
    • Outliers are special in datasets, like a standout player in a basketball game.
    • They can mess up data analysis.
  2. Significance of Handling Outliers:
    • It's crucial to find and deal with outliers.
    • They can affect and change the results of our analysis.
  3. Methods for Outlier Handling:
    • We looked at different ways to handle outliers.
    • This includes using statistics like Z-score, modified Z-score, and Tukey's fence.
    • We also talked about machine learning methods like Local Outlier Factor, isolation forests, Support Vector Machines (SVM), and Random Forest.
    • Visualization tools like scatter plots and box plots are useful for finding outliers.
  4. Challenges in Outlier Detection:
    • There are difficulties in finding outliers.
    • These include interpreting data subjectively, getting false positive results, being sensitive to noise, and needing to understand the context well.

In the end, even though outliers can mess things up, they often show areas that need attention and might reveal opportunities for growth in real-world industries.

Question For You

Which of the following statements accurately describes the interquartile range (IQR) and its role in outlier identification?

A) IQR is the range between the maximum and minimum values in a dataset and is commonly used in Tukey's fence method.

B) IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a dataset and is commonly used in Tukey's fence method.

C) IQR is calculated by dividing the dataset into equal halves and is commonly used in the Z-score method.

D) IQR is a statistical measure used to identify outliers based on their deviation from the mean and is commonly used in the Z-score method.

 Let me know in the comments if it is A, B, C, or D.
