Normalization vs Standardization: Understanding When, Why & How to Apply Each Method

Nov 28, 2023

Ever wondered why data scientists have a love affair with normalization and standardization in machine learning? πŸ€” 

What's the secret behind these techniques that magically transform your data?

Well, read on to find out! In this article, we'll look at what normalization and standardization are, why they matter, how to apply each in code, and when to choose one over the other.



What is Normalization?

Before we delve into how, when, & why, you need to understand what Normalization means.

Imagine you have a dataset containing different features, such as house prices, the number of bedrooms, and the square footage of houses. 

Can you tell whether all these features, such as house prices and the number of bedrooms, will have the same range of values and units of measurement?

Certainly not, right? 

For instance, house prices could normally range from $100,000 to $1,000,000, while the number of bedrooms might range from 1 to 5.

For example, house prices are large numbers in the hundreds of thousands, whereas the number of bedrooms is a single-digit value.

As humans, you and I know why the price has a higher range of numbers and why the number of rooms has a lower range.

However, machine learning algorithms might not.

That is where normalization comes to the rescue.

Normalization is like a magical trick that helps algorithms make sense of these numbers by putting them on a common scale.

It's like squishing or stretching the numbers so that they all fit nicely between 0 and 1. 

So the original data becomes normalized: each value is rescaled onto a uniform scale between 0 and 1.

Notice how every original value turns into a normalized value in that 0-to-1 range.

This way, machine learning algorithms can compare them more easily and understand their relative importance.

 

"So by definition, Normalization is a data preprocessing technique that transforms the values of numerical variables to a specific range or scale.  The original values of a variable are adjusted or rescaled to a common range, typically between 0 and 1."



Now that we have seen what normalization means, let’s see why it is needed.
 

Why is data normalization important in machine learning?

So, why do we need to normalize? 

Well, without normalization, some numbers might have a bigger impact on our analysis just because they have larger values. It's like having a giant among dwarfs.

Normalization helps machine learning algorithms to treat all the numbers fairly and ensures that they contribute equally to our calculations. It's like giving everyone the same chance to shine.

Next time you encounter a bunch of numbers that seem all over the place, remember the magic of normalization that brings them together, making them play nice and revealing their hidden secrets!

Well, are there any methods or functions to perform this normalization scaling easily? Yes, there is a popular method called the min-max scaler, and we will see how it works.
 

What is Min-max Scaler?

Min-max scaler is a cool normalization method that takes the original values and performs its magic by rescaling the numbers to fit within a specific range, typically between 0 and 1. 

It does this by subtracting the minimum value and dividing by the range. 

So, for the house prices, let's walk through one example record: a house priced at $250,000, in a dataset where prices range from $150,000 to $600,000.

The min-max scaler would subtract $150,000 (the minimum house price) from each price.

Then it divides by $450,000 (the range between the $600,000 maximum and the $150,000 minimum).

The result? The house price that was originally $250,000 (the first record) would become:

 

= (250,000 - 150,000 (min) ) / 450,000 (range)
= 100,000 / 450,000
= 1 / 4.5
= 0.22

 

Notice that the first record, where the house price was $250,000, now becomes 0.22 after normalization. In the same way, all the other values are normalized onto a scale between 0 and 1.

By performing this scaling, the Min-Max Scaler ensures that all the features are on a level playing field, regardless of their original ranges. It allows you to compare and interpret the values more easily, without one feature dominating the others.
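If you want to see that arithmetic as code, here's a minimal sketch of the min-max formula; the $150,000 minimum and $600,000 maximum are the values assumed in the worked example above.

# A quick sketch of the min-max formula from the worked example above
def min_max_scale(value, min_value, max_value):
    # (value - min) / (max - min) rescales the value into the 0-to-1 range
    return (value - min_value) / (max_value - min_value)

print(min_max_scale(250_000, 150_000, 600_000))  # 0.2222..., i.e. roughly 0.22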

But how do you apply this min-max scaling in code, using scikit-learn?

Normalization: Example Code

In the below code, we import the MinMaxScaler class from the “sklearn.preprocessing” module.

from sklearn.preprocessing import MinMaxScaler

I define two lists, “house_prices” and “num_bedrooms”, which represent the original values of house prices and the number of bedrooms, respectively.

# Example dataset
house_prices = [100000, 200000, 500000, 800000, 1000000]
num_bedrooms = [1, 2, 3, 4, 5]

I then create an instance of MinMaxScaler.

# Create MinMaxScaler object
scaler = MinMaxScaler()

Next, I reshape the data using list comprehensions to match the expected input shape for the scaler.

# Reshape the data (required by MinMaxScaler)
house_prices = [[price] for price in house_prices]
num_bedrooms = [[bedrooms] for bedrooms in num_bedrooms]

Then, I fit the scaler to each feature and transform it in one step with the fit_transform method.

# Fit the scaler to each feature and transform it in one step
# (each fit_transform call re-fits the scaler to that feature's own min and max)
scaled_house_prices = scaler.fit_transform(house_prices)
scaled_num_bedrooms = scaler.fit_transform(num_bedrooms)

Note that each feature needs its own fit: calling fit twice on the same scaler would simply overwrite the first fit, so here the scaler is re-fitted separately for the house prices and for the bedrooms.

Finally, I print the scaled values for each corresponding house price and number of bedrooms.

# Print the scaled values
for price, bedrooms in zip(scaled_house_prices, scaled_num_bedrooms):
    print(f"Scaled House Price: {price[0]:.3f}, Scaled Number of Bedrooms: {bedrooms[0]:.3f}")

Note that it's important to reshape the data into a 2D shape, a list of single-value lists (hence the double brackets [[ ]]), because the fit and transform methods of MinMaxScaler expect 2D input.
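With the scaler re-fitted for each feature as above, running the code should print values along these lines (the house prices span $100,000 to $1,000,000 and the bedroom counts span 1 to 5, so the smallest values map to 0 and the largest to 1):

Scaled House Price: 0.000, Scaled Number of Bedrooms: 0.000
Scaled House Price: 0.111, Scaled Number of Bedrooms: 0.250
Scaled House Price: 0.444, Scaled Number of Bedrooms: 0.500
Scaled House Price: 0.778, Scaled Number of Bedrooms: 0.750
Scaled House Price: 1.000, Scaled Number of Bedrooms: 1.000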

Now that we have seen enough about normalization, let's look at what standardization is.

 

What is Standardization?

Standardization is a data preprocessing technique that transforms the values in such a way that they are easier to understand and compare. 

It does this by shifting the values so that the average (mean) of the feature becomes 0.

Then it divides by the standard deviation, which measures how much the values vary from that average, so the spread becomes 1.

 

"Standardization scales the values in a way that the average value of the feature becomes 0, and the standard deviation becomes 1. This helps us understand the spread and variability of the data."



Well, we saw that there is a Min-max scaler method to perform the normalization, right? Do we have any method to perform standardization? Well, Standard Scaler is the answer!
 

What is Standard Scaler?

Imagine you have a dataset with different features, such as the ages of people and their incomes. These features might have different units of measurement and ranges. 

For instance, ages could range from 20 to 60 years or more, while incomes could range from 20,000 to 100,000 dollars or more.

If we apply the Standard Scaler, it transforms the income so that the new values have a mean of 0 and a standard deviation of 1. 

For example, if the average income in the dataset were $60,000, a person earning exactly $60,000 would get a standardized value of 0, while a person earning $45,000 might get a value of around -1.33, depending on the standard deviation.

The standard scaler is a cool standardization method that takes the original values and performs its magic by rescaling them so that their mean becomes 0 and their standard deviation becomes 1. Unlike min-max scaling, the result isn't confined to a fixed range such as 0 to 1.

How do we use the standard scaler in code? It's easy; read on to find out!

Standardization: Example Code

In the below code, I import the “StandardScaler” class from the “sklearn.preprocessing” module. 

from sklearn.preprocessing import StandardScaler
import numpy as np

I define two NumPy arrays, “ages” and “income”, which hold the original values of the ages and incomes, respectively. Because the scaler expects 2D input, the income array is reshaped into a single column.

# Sample data
ages = np.array([35, 42, 38, 50, 28])
income = np.array([45000, 60000, 35000, 80000, 25000])

# Reshape income data to a 2D array (one column)
income = income.reshape(-1, 1)

Similar to how we created the MinMaxScaler instance, here we create an instance of StandardScaler.

# Create a StandardScaler instance
scaler = StandardScaler()

Next, I apply the “fit_transform” method to the “income” array. This fits the scaler and returns “scaled_income”, which contains the standardized values.

# Fit the scaler to the data and transform it
scaled_income = scaler.fit_transform(income)
# Print the standardized values
for age, standard_income in zip(ages, scaled_income):
    print(f"Age: {age}, Standardized Income: {standard_income[0]:.2f}")

Pros & Cons of Normalization

Pros of Normalization:

  1. Fairness and Balance: Normalization doesn't give any feature an unfair advantage just because it has a bigger value or a different unit of measurement. 

    It's like making sure nobody gets special treatment just because they have more or measure differently. 

  2. Improved Comparability: Once they're normalized, we can easily compare the values and get a clear idea of their relative importance.

    It's like having a common ground for understanding.

Cons of Normalization:

  1. Loss of Original Value Interpretation: One downside is that it transforms the original values into a different scale, usually between 0 and 1. 

    That can sometimes make it a bit tricky to interpret the values in their original context. So, if it's really important for you to maintain the original value interpretation in your analysis, normalization might not be the best choice.

  2. Sensitivity to Outliers: Outliers are those extreme values that are way out there, far from the rest of the data. The thing is, normalization scales the whole dataset based on the range of values. So, if you have outliers, they can really throw off the normalization process. 

    It's crucial to deal with outliers properly before applying normalization techniques, to avoid distorting the data.

Pros & Cons of Standardization

 Pros of Standardization:

  1. Improved Interpretability: By now, you already know that Standardization transforms the data to have a mean of 0 and a standard deviation of 1, right?

    Now, why does that matter?

    Well, the standardized values represent the number of standard deviations away from the mean.

    This means we can easily understand the relative positions and significance of each data point. It gives us a clear idea of how each point stands in relation to the average.

    For example, we may notice that one data point is 1 standard deviation away from the mean while another is 2 standard deviations away. This helps us interpret the results better.

  2. Enhanced Comparability: It helps us compare and analyze features more effectively! 

    How does it do that?

    Well, when we standardize the data, it puts features with different measurement units or scales on a common scale. This means we can directly compare them without any issues.

    It's like bringing everything to the same measuring tape, so to speak. Pretty handy, don't you think?

    Well, you may be thinking that normalization also provides better interpretability and comparability, so are there additional pros that standardization offers?

    Yes, see below for what standardization specifically offers.

  3. Normal Distribution Assumption: Did you know that many statistical techniques and machine learning algorithms assume that data follows a normal distribution? 

    The normal distribution is simply a continuous probability distribution in which the data points form a symmetrical, bell-shaped curve.

    Well, that's where standardization comes in and saves the day!

    It helps us meet this assumption by transforming the data to have a mean of 0 and a standard deviation of 1. By doing so, it brings the data closer to a normal distribution.

    This is super helpful because when our data aligns with the normal distribution, those statistical techniques and machine learning algorithms work their magic more effectively. 

  4. Robustness to Outliers: Remember how outliers can throw off the normalization process? Standardization is actually more robust to outliers than min-max scaling and similar techniques.

    How does it manage that?

    Well, standardization takes into account the mean and standard deviation of the data. And guess what? These measures are less affected by extreme values, those pesky outliers.

    So, when we standardize our data, outliers have minimal impact on the standardized values. It's like they lose some of their power to distort our analysis. This robustness is really handy because it helps us get more reliable results by reducing the influence of outliers.

    So which one will you choose when your data has outliers? Normalization or standardization?

    If you answered “standardization”, you're following along well.

 

Cons of Standardization:

Well, does this awesome standardization have any cons?

Though standardization has the advantages mentioned above, it also has its own downsides.

  1. Loss of Original Unit Interpretation: One of the downsides of standardization is that it transforms the original values into standardized values that are not in the original unit of measurement. 

    This can make it difficult to interpret the data in its original context, especially if the unit of measurement is important for the analysis or presentation of results.

  2. Dependency on Normality Assumption: Standardization assumes that the data follows a normal distribution. If the data does not meet this assumption, applying standardization may not be appropriate or may lead to misleading results. 

    For example, the distribution of income across a population often exhibits a skewed pattern.

    Did you know most people earn relatively low to moderate incomes, while a few individuals or households earn extremely high incomes?

    This can cause the data to have a long tail on the right side.

    So this type of data doesn't follow the normal distribution (normally distributed data has a symmetrical shape), and standardizing it with this technique can give misleading results.

    On the other hand, watermelon sizes roughly follow a normal distribution. For example, when you pick a watermelon at random from a shop, its size varies within a range: most sizes cluster around the average, with only a few watermelons at the low and high ends. When you plot the sizes on a graph, the shape is roughly symmetrical.

    So for this type of data you can apply standardization to bring them to a common ground or scale.
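If you're not sure whether a feature is roughly normal, one quick sanity check is to look at its skewness before standardizing. Here's a minimal sketch using scipy.stats.skew; the two datasets are made-up illustrations, where values near 0 suggest a roughly symmetrical distribution and large positive values indicate a long right tail like the income example:

import numpy as np
from scipy.stats import skew

# Illustrative data: symmetric-ish watermelon sizes vs. a right-skewed income sample
watermelon_sizes = np.array([4.8, 5.0, 5.1, 4.9, 5.2, 5.0, 4.7, 5.3])
incomes = np.array([25000, 30000, 32000, 35000, 40000, 45000, 250000])

print(f"Watermelon size skewness: {skew(watermelon_sizes):.2f}")  # close to 0
print(f"Income skewness: {skew(incomes):.2f}")                    # strongly positive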


Remember, standardization has numerous advantages such as improved interpretability, enhanced comparability, and the ability to meet assumptions of statistical techniques. However, it's important to consider the potential limitations, such as loss of original unit interpretation and dependency on normality assumptions, while applying standardization in your analysis.


 

Difference between Normalization and Standardization

So, you now know how these two cool techniques (normalization and standardization) are immensely helpful for preparing our data.

They're pretty handy when we want to bring our features to a nice standardized format.

Well, are there any striking differences between the two?

The key difference lies in how they rescale the data: normalization squeezes values into a fixed range (typically 0 to 1) using the minimum and maximum, while standardization re-centers values around a mean of 0 with a standard deviation of 1 and doesn't confine them to any fixed range.

When it comes to using these techniques in your code, though, they're both pretty straightforward. You just create a MinMaxScaler or StandardScaler object, and then you use the 'fit' and 'transform' methods to do the magic. Easy peasy!
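To make the difference concrete, here's a minimal side-by-side sketch that runs both scalers on the same column of house prices (the values are reused from the normalization example and are purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Same house prices as in the normalization example, as a single 2D column
house_prices = np.array([[100000], [200000], [500000], [800000], [1000000]])

normalized = MinMaxScaler().fit_transform(house_prices)      # squeezed into 0..1
standardized = StandardScaler().fit_transform(house_prices)  # mean 0, std 1

for original, n, s in zip(house_prices, normalized, standardized):
    print(f"Original: {original[0]:>9}, Normalized: {n[0]:.3f}, Standardized: {s[0]:+.2f}")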

When to Use Normalization vs. Standardization

Use Normalization When:

 

  1. You need to constrain the data within a specific range: As you know, normalization takes the values of a feature and rescales them to fit nicely within a specific range, usually between 0 and 1.

    So, if you've got a specific range in mind for your data, normalization can totally help you nail it.

  2. You want to maintain the relative interpretation of the values: Normalization preserves the relative relationships between the original values; a normalized value of 0.5 simply means “halfway between the smallest and largest observed values.”

    If that kind of relative interpretation is easier to work with in your analysis or your domain than “standard deviations from the mean,” normalization is a suitable choice.

  3. The distribution of the data is not necessarily Gaussian: Gaussian is another name for normal distribution.  Normalization does not assume any specific distribution of the data. It can handle data with non-normal or unknown distributions effectively. 

    So if your data doesn't follow a Gaussian distribution, you can use normalization.

Use Standardization When:

  1. You need to compare and analyze features with different units and scales: Standardization transforms the data to have a mean of 0 and a standard deviation of 1. 

    If you have features measured in different units or scales and you want to compare them on a common scale, standardization is a suitable choice. It helps ensure that no particular feature dominates the analysis based solely on its original unit or scale.

  2. You want to reduce the impact of outliers: Standardization is more robust to outliers compared to other scaling techniques. 

    If your data contains outliers and you want to reduce their influence on the analysis, standardization can be helpful. By using the mean and standard deviation, which are less affected by extreme values, standardization minimizes the impact of outliers on the scaled data.

Real-World Use Cases and Industry Insights

Normalization Use Cases:

  1. Image Processing: You know, when it comes to image processing, normalization is like a pro! It's often used to make sure those pixel intensity values play nice within a specific range. You can set it to be between 0 and 1, or even -1 and 1, whichever floats your boat.

    The cool thing is, by doing this, you make sure that all the images in your dataset are on the same page. It's like they're speaking the same language! 

    This consistency makes tasks like object recognition or image classification a breeze!

  2. Feature Scaling: When your use case has features with all sorts of different measurement units or scales, normalization comes in handy. By normalizing them, you level the playing field, and the algorithm doesn't get all confused by those varied units. No more biases just because of original scales!

    And the best part? This trick works wonders in loads of areas! Think about customer segmentation, fraud detection, or recommendation systems - they all benefit from this normalization magic!

Standardization Use Cases:

  1. Principal Component Analysis (PCA): This is all about reducing dimensions, making your life easier! 

    But before you start, you wanna standardize those features. Why? 'Cause Principal Component Analysis (PCA) is all about finding the directions of maximum variance, and standardizing helps them contribute equally to the analysis.

    People in finance, genetics, and image recognition love using this to make sense of their high-dimensional data.

  2. Anomaly Detection: Now this one's cool! Think cybersecurity, fraud detection, or quality control. You wanna catch those sneaky anomalies or outliers, right? 

    Well, standardization is your friend here! By transforming the data to have mean 0 and standard deviation 1, you can easily spot those big deviations from the norm.
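Here's a minimal, illustrative sketch of that idea: standardize a feature and flag anything more than 3 standard deviations from the mean. The generated transaction amounts and the 3-sigma threshold are assumptions for the example, not a fixed rule.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: 200 "normal" transaction amounts plus two injected anomalies
rng = np.random.default_rng(42)
amounts = np.append(rng.normal(loc=130, scale=15, size=200), [900, 850]).reshape(-1, 1)

# Standardize: mean 0, standard deviation 1
z_scores = StandardScaler().fit_transform(amounts).ravel()

# Flag anything more than 3 standard deviations away from the mean
anomalies = amounts.ravel()[np.abs(z_scores) > 3]
print(anomalies)  # should contain the 900 and 850 we injected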

Now that you have a pretty good hold on how to use normalization and standardization, their pros and cons, and the use cases where they apply, it's also important to know the common mistakes data scientists make when normalizing or standardizing data.

You don't want to repeat those mistakes, right?

What are some common mistakes to avoid when normalizing or standardizing data?

  1. Data Leakage: You should fit the scaler on the training data only and then reuse that fitted scaler to transform the test data (see the sketch after this list). Why? Because you want to be fair and not let the test data influence the learning process.

    Imagine if the model peeked at the test data during training; it's like cheating on the practice test! It might seem like the model is super smart, but it's just memorizing answers from the future test. That won't help when it faces new questions it hasn't seen before.

    The goal in machine learning is to build models that can handle new challenges they've never encountered. 
  2. Applying Normalization/Standardization on Categorical Variables: By now you know, normalization & standardization is like putting them all on the same measuring scale, so they're easier to compare. 

    But, it is important, you can't use the same magic trick with categorical variables – you know, those non-numerical things like colors, types of cars, or animal species.

    They need their own special treatment! If we try to encode them with numbers and then normalize or standardize them, we'll end up with some funky results.

    To deal with categorical variables, we use different tools such as "one-hot encoding", “target encoding” etc.  It is like giving each category its own special tag, and then we can work with them in a way that makes sense to our model.  You can read in detail about the encoding and the types of encoding here.

  3. Ignoring the Distribution Assumptions: We all have our assumptions, right? Well, normalization and standardization do too! 

    Normalization wants the data to chill within a specific range, often 0 to 1. And standardization? It's all about the Gaussian distribution.

    But here's the catch: make sure your data actually fits these assumptions before you apply these techniques. If not, you might get some confusing results.
  4. Not Considering Outliers: Outliers can be sneaky troublemakers, especially for normalization! If your data has some extreme values, they can mess up the whole scaling process. 

    So, give those outliers some special treatment! You could preprocess them separately or use robust scaling techniques that don't get easily swayed by outliers. Think median and interquartile range!
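Here's the data-leakage point from mistake #1 as a minimal sketch: the scaler is fitted on the training split only, and the test split is transformed with the statistics learned from training. The feature values and the split are just illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative feature column
X = np.array([[100000], [200000], [350000], [500000], [800000], [1000000]])

# Split first, so the test data never influences the scaling
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min and max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())  # can fall outside 0..1 if test values exceed the training range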

So if your data contains outliers and/or roughly follows a normal distribution, you should opt for standardization. If your data doesn't follow a normal distribution and instead fits naturally within a specific range, you can opt for normalization.

By being mindful of these common mistakes, you can ensure that the normalization or standardization process is performed correctly and yields meaningful results for your data analysis or machine learning tasks.

 

Conclusion

In this part of our journey, we've explored two important techniques called Normalization and Standardization. These techniques help prepare your data for analysis.

  1. Transforming Scales with Normalization:

    • Like an alchemist turning lead into gold, normalization changes your data to fit within a new scale of 0 to 1.
    • This ensures that each feature has an equal chance to be considered by the model.
  2. Refining Data with Standardization:

    • Standardization focuses on the mean and scales data based on the standard deviation.
    • It's like setting a universal time standard for data, where everything is measured against a common reference.
  3. Weighing Pros and Cons:

    • After careful consideration, we've compared both techniques.
    • Normalization brings balance but can be influenced by outliers.
    • Standardization provides comparability, even with outliers.
  4. Navigating the When:

    • Use normalization when data needs to fit tightly within specific bounds.
    • Choose standardization when you want to standardize units and scales for fair comparison.
  5. Avoiding the Pitfalls:

    • Keep training and test data separate.
    • Avoid normalizing categorical data.
    • Ensure that your data's distribution aligns with the chosen technique.
  6. Mastering the Techniques:

    • With knowledge of when and how to use these tools, you're ready to shape your data for machine learning success.
    • This ensures fair and insightful results.

Now that you understand these methods, let's test what you've learned.

Question For You

When should you consider using normalization in your data preprocessing for machine learning?

  1. A) When you want to compare features with different units and scales.
  2. B) When you need to maintain the original unit interpretation.
  3. C) When your data follows a Gaussian (normal) distribution.
  4. D) When dealing with outliers in the dataset.

Let me know in the comments which option you think is correct. Is it A, B, C, or D?


