Are you curious about this mysterious Chi-Square test in Data Science?
Scratching your head about how to make sense of Chi Square results in machine learning?

If so, put on your detective hat and read on.
In this article we will look into the following
What is the “Chi Square Test”?
"The Chi-Square test in Data Science helps you evaluate whether a null hypothesis (an assumption you have made) is consistent with the data, or whether it should be rejected."
Imagine you own a restaurant, and you have an assumption about the occupancy of your restaurant.
You have a suspicion that certain days might have more customers compared to others.
Based on that, you assume there will be more customers on Thursday and Friday, and you lay out the expected occupancy numbers for Monday through Saturday as below.
Note that Thu and Fri have expected counts of “40” and “60”, higher than the other days.

| Day | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| Expected count | 30 | 20 | 30 | 40 | 60 | 30 |

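To investigate this further, you decide to conduct a study over a week and record the number of customers who actually visit your restaurant each day.
You add the recorded (observed) values to the table as below.

| Day | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| Expected count | 30 | 20 | 30 | 40 | 60 | 30 |
| Observed count | 20 | 20 | 30 | 40 | 60 | 30 |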

Note that the observed count (20) is lower than the expected count (30) on Monday.
Based on the expected and observed counts recorded in the table, the chi-square test in data science will help you determine whether your assumption holds good or not.
So in this case, the null hypothesis and alternative hypothesis are as below.
- Null hypothesis (H₀): There is no significant difference between the expected and observed occupancies. This means your assumption holds good.
- Alternative hypothesis (H₁): There is a significant difference between the expected and observed occupancies. This means your assumption does not hold good.
Check the hypothesis vs null hypothesis blog to read more about this.
3-Step “Chi Square Test” Process
Now, let's consider the Chi-Square test in machine learning as a tool that helps you determine whether there is a significant difference between what you expected and what you observed.
Before using the Chi-Square tool, you need to know that it involves 3 steps.
Step 1: Formula behind the Chi-Square test: calculate the test statistic using the Chi-Square formula.
Step 2: Decoding the Chi-Square table: look up the critical value in the Chi-Square distribution table.
Step 3: Rejecting the null hypothesis: compare the two values (the calculated test statistic and the critical value from the table) to determine whether to reject the null hypothesis or not.
Step 1: Formula behind the Chi-Square Test
The formula for the chi-square test statistic in this case is:
χ² = Σ((Oᵢ - Eᵢ)² / Eᵢ)
where:
- χ² is the chi-square test statistic
- Oᵢ is the observed frequency for each category
- Eᵢ is the expected frequency for each category (based on the theoretical distribution)
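
The formula above can be sketched as a small Python function. This is a minimal illustration, not a library implementation: the name `chi_square_statistic` and the two-category toy numbers in the usage line are made up for the example and do not come from the restaurant study.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy check with two hypothetical categories: (2^2 / 10) + ((-2)^2 / 10)
print(chi_square_statistic([12, 8], [10, 10]))  # 0.8
```

Each category contributes its squared gap between observed and expected, scaled by the expected count, so a gap of 2 matters more when 10 were expected than when 100 were.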

In your restaurant occupancy test, you use the above formula to calculate the test statistic, which you will later compare against a value from the chi-square table to decide whether to reject the null hypothesis.
In this case, for Monday through Saturday, the observed frequencies are 20, 20, 30, 40, 60, 30 and the expected frequencies are 30, 20, 30, 40, 60, 30.
Plugging in the values from your data, the calculation looks like this:
χ² = ((20 - 30)² / 30) + ((20 - 20)² / 20) + ((30 - 30)² / 30) + ((40 - 40)² / 40) + ((60 - 60)² / 60) + ((30 - 30)² / 30)
Simplifying further:
χ² = ((-10)² / 30) + (0² / 20) + (0² / 30) + (0² / 40) + (0² / 60) + (0² / 30)
χ² = 100/30 + 0 + 0 + 0 + 0 + 0
χ² ≈ 3.333
Using the chi-square formula, you have now calculated the chi-square statistic as 3.333 for the above restaurant occupancy study.
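
As a quick check of the arithmetic, the same calculation can be written out in plain Python using the six observed and expected counts from the study. The variable names are just for illustration. The sums are done by hand here rather than with a library routine because the two lists do not add up to the same total (200 observed vs 210 expected), and recent versions of `scipy.stats.chisquare` are, as far as I know, strict about that.

```python
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
observed = [20, 20, 30, 40, 60, 30]
expected = [30, 20, 30, 40, 60, 30]

# One (O - E)^2 / E term per day, exactly as in the worked formula
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

for day, o, e, c in zip(days, observed, expected, contributions):
    print(f"{day}: observed={o}, expected={e}, contribution={c:.3f}")

chi_square = sum(contributions)
print(f"chi-square statistic = {chi_square:.3f}")  # 3.333
```

Only Monday contributes anything, because it is the only day where the observed and expected counts differ.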
Step 2: Decoding the Chi-Square Table
Now that you have calculated the chi-square statistic, let’s look at how you can get the critical value from the chi-square distribution table.
The chi-square distribution table shown below contains values for a range of “levels of significance” and “degrees of freedom”. In the table, each row represents a “degrees of freedom” value, and each column represents a “level of significance”.
The degrees of freedom for a chi-square test in machine learning are calculated as (number of categories - 1). In your case, there are 6 categories or 6 days (Mon - Sat), so df = 6 - 1 = 5.
If you look at the image below, the critical value for a test with a “0.05” level of significance and “5” degrees of freedom is 11.070.
Note this value: 11.070.
 
Image source: https://statisticsbyjim.com/hypothesis-testing/chi-square-table/
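
If you do not have the table at hand, the same critical value can be approximated in code. The sketch below uses only the Python standard library: it evaluates the chi-square cumulative distribution with a series expansion and then searches for the cut-off by bisection. The function names `chi2_cdf` and `chi2_critical_value` are hypothetical helpers written for this example. In practice, most people would call `scipy.stats.chi2.ppf(1 - alpha, df)` instead.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom.

    Uses the series expansion of the regularized lower incomplete gamma
    function, which is adequate for the moderate values used here.
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_critical_value(alpha, df):
    """Find x such that P(X <= x) = 1 - alpha, by bisection."""
    target = 1.0 - alpha
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < target:  # widen the bracket until it contains x
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return hi


categories = 6       # Mon through Sat
df = categories - 1  # degrees of freedom = 5
print(df, round(chi2_critical_value(0.05, df), 2))
```

The result should agree with the table entry of 11.070 for a 0.05 level of significance and 5 degrees of freedom, up to rounding.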
Step 3: Rejecting the Null Hypothesis with Confidence using Chi Square Test
Now that you have calculated the chi-square statistic as 3.333 and retrieved the critical value from the chi-square distribution table as 11.070, let’s see how you can use them to determine whether to reject the null hypothesis or not.
This is very simple. If the calculated chi-square statistic is greater than the critical value from the distribution table, then you can reject the null hypothesis. If not, you fail to reject the null hypothesis.
In this scenario, when comparing the calculated chi-square test statistic (3.333) with the critical value (11.070), what do you arrive at?
You can see that the calculated value is smaller than the critical value, right?

Therefore, you would fail to reject the null hypothesis, indicating that there is no significant difference between the expected and observed occupancies at a significance level of 0.05 and degrees of freedom of 5.
If you fail to reject the null hypothesis, it means the data give you no statistically significant reason to doubt your assumption, so your assumption holds good 🙂.
This means that the observed occupancy is in sync with the expected occupancy.
In other words, what you assumed/expected to be the occupancy of your restaurant is consistent with what you observed, at a significance level of 0.05.
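
The decision rule in this step boils down to a single comparison. Here is a minimal sketch, with a hypothetical helper name `decide` and the two numbers taken from Step 1 and Step 2:

```python
def decide(chi_square_stat, critical_value):
    """Reject H0 only if the statistic is greater than the critical value."""
    if chi_square_stat > critical_value:
        return "reject H0"
    return "fail to reject H0"


chi_square_stat = 100 / 30  # about 3.333, from Step 1
critical_value = 11.070     # from the table: significance 0.05, df = 5

print(decide(chi_square_stat, critical_value))  # fail to reject H0
```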
Conclusion
Question For You
The observed frequencies of a categorical variable in a study are as follows:
category A (45), category B (60), category C (35).
The expected frequencies based on the null hypothesis are: category A (50), category B (50), category C (50).
Can we reject the null hypothesis at a significance level of 0.05? Tell me in the comments.