Unveiling the Secrets of the “Train-Test Split” in Machine Learning

data science Nov 14, 2023

Heard about the train-test split in machine learning?

Ever wondered why on earth we need to split our precious data into two camps: training and testing datasets?

What exactly happens when you skip this fundamental step and train your model on the entire dataset?

Is it a recipe for a potential disaster?

Join me as we unravel the reasoning behind the train-test split.


 

Training Dataset (Building Knowledge)

Imagine you're a teacher passionate about unlocking the full potential of your students. 

You use training exercises to train them. 

Just as a teacher imparts knowledge to students through training exercises, we provide our machine learning models with a training dataset.

The training dataset allows the model to learn and internalize the underlying patterns. 

Through repeated exposure to the training dataset records, the model builds a solid foundation of understanding about the data. 

Now that you have seen what a training dataset is, let’s see what a testing dataset is.

Testing Dataset (Assessing Performance)

Back to the analogy of training your students: imagine the moment when you want to test their understanding.

What will you do to gauge your students’ level of understanding?

..

..

You will use some testing questions to evaluate how well they understood the concepts, right?

Similarly, to assess how well your machine learning model is performing, you should run the model with the testing dataset.

From the model’s perspective, the testing dataset is just unseen data, and the model tries to predict the value for each of the testing records.

However, you already know the actual results for the testing dataset, because it is simply a part of the original dataset that you set aside for “testing” purposes.

Since you know those results, you can easily compare what the model predicts with the actual values in the “testing” dataset.

This way, you can evaluate how well your model performs on unseen data.
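
The comparison described above can be sketched in a few lines of Python. This is only an illustration, assuming scikit-learn is installed; the synthetic dataset, the choice of logistic regression, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold back 20% of the records as the testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The model learns only from the training records.
model = LogisticRegression().fit(X_train, y_train)

# The model has never seen X_test; y_test holds the answers we already know.
y_pred = model.predict(X_test)

# Compare each prediction with the known value.
n_correct = int((y_pred == y_test).sum())
print(f"{n_correct} of {len(y_test)} test records predicted correctly")
```

The key point is that `y_test` is never shown to the model during training; it is used only afterwards, as the answer key.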

Why Divide the Dataset into “Training” and “Testing” Datasets?

For example, let’s say you have to build a model that predicts the height of a mouse based on its weight.

Let’s say you were given the heights of 100 mice and their corresponding weights.

If you train on all 100 mice records, you have no way of knowing whether the model predicts accurately when applied to a new, unseen 101st record.

On the other hand, if you split the dataset into 80 records as a “training” dataset and the remaining 20 records as a “testing” dataset, you have performed a train-test split.

Now you train your model on the training dataset (80 records) and test it on the testing dataset (20 records). Because you already know the actual results for those 20 records, you have a clear way to assess your model’s performance.

You may find that your model predicts at, let’s say, 95% or 96% accuracy.

This gives you an estimate of how accurately your model will perform when it sees the 101st record.

Since you need one dataset for training and another for testing the model, you cannot use the entire dataset for training, and that is exactly why you should divide the data into a training dataset and a testing dataset.
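
Here is a rough sketch of the mouse example in Python. The weights and heights are made-up numbers generated at random, and the linear model, units, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up data for 100 mice: weight in grams, height in centimetres.
rng = np.random.default_rng(seed=42)
weight = rng.uniform(15, 40, size=100)
height = 2.0 + 0.1 * weight + rng.normal(0, 0.2, size=100)

X = weight.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
y = height

# 80 records for training, 20 records for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train only on the 80 training records.
model = LinearRegression().fit(X_train, y_train)

# Evaluate only on the 20 mice the model has not seen.
y_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)

print("Training records:", len(X_train))
print("Testing records:", len(X_test))
print("Mean absolute error on the test set (cm):", test_mae)
```

Because height is a number rather than a category, this sketch reports the average prediction error instead of a percentage accuracy; the idea of scoring the model on the 20 held-out records is the same.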

Benefits of train-test split

  • Model Evaluation (Real-world Performance Check): The train-test split allows for unbiased evaluation of the model's performance. By testing the model on unseen data (the test set), you can assess its ability to generalize and make accurate predictions on new, unseen instances.

    Imagine you're learning to ride a bicycle. You practice in a safe, controlled environment with your trainer, who watches your progress. But to truly test your skills, you need to ride on your own in different situations.

    Similarly, the train-test split allows us to check how well a machine learning model performs in the real world. 

  • Overfitting Detection (Prevention of Over-Optimization): The train-test split helps in detecting overfitting, which occurs when a model performs exceptionally well on the training data but fails to generalize to new data. 

    By comparing the performance on the training and test sets, you can identify if the model is overfitting and take appropriate steps to address it.

    Let's say you're memorizing answers to specific questions without understanding the underlying concepts. You might perform well on those specific questions, but struggle when faced with similar questions in a different format.

    This is overfitting. The train-test split helps you to detect if a model is merely memorizing the training data instead of grasping the general patterns. 

  • Hyperparameter Tuning (Improved Model Tuning): The train-test split enables effective hyperparameter tuning.

    Hyperparameters are configuration settings that impact the model's performance. By tuning these parameters based on the model's performance on a validation set (a subset of the training data), you can optimize the model before the final evaluation on the test set.

    When building a machine learning model, you need to find the best settings for various parameters, similar to adjusting the volume and equalizer on a music player to get the best sound.

    The train-test split supports this fine-tuning by keeping the test set separate. You can try different settings, such as the number of hidden layers or the learning rate, and see how they impact the model's performance on the validation set, leaving the test set untouched for the final check.

    This way, you can adjust and optimize your model to achieve better results and make accurate predictions.

    In simpler terms, the train-test split is like a reality check for our machine learning models. It shows whether they can handle new situations, helps catch them when they are merely memorizing, and lets us fine-tune their settings for optimal performance.
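
The overfitting check and the tuning workflow described in the list above can be sketched together. This is a minimal example under stated assumptions: the synthetic dataset, the decision tree model, the candidate `max_depth` values, and the 60/20/20 split are all choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First carve out the test set (20%), then split the rest into
# training (60% of all data) and validation (20% of all data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

best_depth, best_val_score = None, -1.0
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_score = tree.score(X_train, y_train)
    val_score = tree.score(X_val, y_val)
    # A large gap between the two scores is a sign of overfitting.
    print(f"max_depth={depth}: train={train_score:.3f}, validation={val_score:.3f}")
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# Refit with the chosen setting and use the test set exactly once.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_rest, y_rest)
print("Chosen max_depth:", best_depth)
print("Test accuracy:", final_model.score(X_test, y_test))
```

Settings are compared on the validation set only, so the test set remains a fair measure of how the final model handles unseen data.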

The Right Split Ratio: Finding the Perfect Balance

Finding the right split ratio is like finding the perfect balance between learning and evaluation.

It depends on factors like the size of your dataset and the complexity of your problem. 

As a rule of thumb, a common split ratio is 80% for training and 20% for testing. This means you reserve 80% of your data to teach your model and 20% to assess its performance.

However, this ratio isn't set in stone and can vary based on your specific scenario.

If you have a large dataset, you might have enough data to spare for a larger test set, allowing for a more reliable evaluation. 

Conversely, if you have a small dataset, you might need to allocate more data for training to ensure your model learns effectively.

Remember, finding the right split ratio requires some experimentation and consideration of your specific situation. 

The goal is to find a balance that gives your model enough data to learn and enough data to be accurately evaluated, just like having enough ingredients to practice baking and to enjoy the tasty outcome.
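
To see what different ratios mean in practice, here is a small sketch that splits the same 1,000 placeholder records three ways. The record count and the candidate ratios are assumptions for illustration; `test_size` is the `train_test_split` parameter that controls the ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 placeholder records (an assumption for illustration).
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

sizes = {}
for test_size in (0.1, 0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record how many rows ended up on each side of the split.
    sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} training, {len(X_test)} testing")
```

Trying a few ratios like this, and checking how stable the evaluation score is for each, is one simple way to experiment before settling on a split.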



Train Test Split: Example Code

from sklearn.datasets import make_classification

# Generating a synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=2,
    n_classes=2,
    weights=[0.8, 0.2],
    random_state=42,
)

# Printing the shapes of X and y
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

In the code above, the “make_classification” function from scikit-learn is used to generate a synthetic dataset. 

The parameters work as follows:

  • “n_samples” specifies the number of samples in the dataset (in this case, 1000).
  • “n_features” specifies the total number of features (in this case, 5).
  • “n_informative” specifies the number of informative features (in this case, 3).
  • “n_redundant” specifies the number of redundant features (in this case, 2).
  • “n_classes” specifies the number of classes (in this case, 2).
  • “weights” specifies the class imbalance ratio, where [0.8, 0.2] indicates that roughly 80% of the samples will belong to the first class and roughly 20% to the second class.
  • “random_state” ensures reproducibility by setting a seed value for randomization.

After generating the dataset, you can print the shapes of the feature matrix 'X' and the target vector 'y' to verify the dimensions of the generated dataset.
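
The snippet above only generates the data. A possible next step, sketched here with scikit-learn's `train_test_split`, is the split itself. The 80/20 ratio follows the rule of thumb from earlier in this post; using `stratify=y` is an optional extra assumed here because the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the example above.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=2,
    n_classes=2, weights=[0.8, 0.2], random_state=42,
)

# 80/20 split; stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)
print("Share of class 1 in training set:", y_train.mean())
print("Share of class 1 in testing set:", y_test.mean())
```

Without `stratify`, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer minority-class records than the training set.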

Conclusion

Let's sum up what we've learned about the train-test split:

  • Purpose of Training Dataset: We talked about why the "training" dataset is important. It helps the model learn and understand the patterns in the data.
  • Role of Testing Dataset: The "testing" dataset helps us check how well the model performs with new, unseen data. We also walked through Python example code using scikit-learn.
  • Three Main Benefits of Train-Test Split: We found three key advantages, which show how useful the train-test split is in different ways:
    • Model Evaluation
    • Overfitting Detection
    • Hyperparameter Tuning
  • Common Split Ratio: We discussed the usual split ratio, which is 80% for training and 20% for testing. We also mentioned that this ratio can be adjusted based on the specific needs of the model for a good balance between learning and accurate evaluation.
 

Question For You

Why is it important to keep the “test” set separate and not use it during the model training process?

  1. The test set is used to fine-tune the model's hyperparameters.
  2. The test set helps evaluate the model's performance on unseen data.
  3. The test set is used to validate the training data.
  4. The test set is used to increase the model's training accuracy.

Tell me in the comments whether it is 1, 2, 3, or 4 that you believe best answers the question 🙂

 
