Statistics (III) ANOVA in Data Science & Machine Learning

11 min read · Feb 15, 2022

For the last part of the Statistics series, we will cover ANOVA, post-hoc pairwise comparison, two-way ANOVA, and R-squared. Previously, the studies involved only one or two groups of subjects. In this article, we will gradually expand the concept to multiple groups with multiple factors. For example, in a vaccine study, we may separate subjects by dosage and gender to study the effectiveness. How can we determine whether the differences among the groups are statistically significant, and for which combinations?

Chi-squared distribution (χ² distribution)

Before moving to our discussions, we need to learn a few more distributions first. The key idea is that if a test statistic follows one of these distributions, we can use the corresponding pdf to compute the p-value.

If Z₁, Z₂, …, Zₖ are k independent random variables, each with a standard normal distribution, then the random variable Q holding the sum of their squares

$$Q = \sum_{i=1}^{k} Z_i^2$$

has a distribution called the chi-squared distribution χ² with k degrees of freedom (where k is the number of variables).

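Here is the pdf for the chi-squared distribution with k degrees of freedom, defined for x > 0:

$$f(x; k) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2} \, \Gamma(k/2)}$$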

From a chi-squared table, we can look up the critical chi-squared score for a given degree of freedom and the significance level α. If a computed score is greater than this critical value, we will reject H₀.
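In practice, the table lookup can be replaced by a library call. The sketch below assumes SciPy is available; the values of `alpha`, `df`, and `score` are made up for illustration and do not come from any study in this article.

```python
from scipy.stats import chi2

alpha = 0.05   # significance level (assumed for illustration)
df = 3         # degrees of freedom (assumed for illustration)

# Critical value: the score that leaves probability alpha in the upper tail.
critical = chi2.ppf(1 - alpha, df)

# p-value for a hypothetical computed score: the upper-tail probability.
score = 9.2
p_value = chi2.sf(score, df)

print(f"critical value = {critical:.3f}")
print(f"p-value = {p_value:.4f}, reject H0: {score > critical}")
```

Comparing the score with the critical value and comparing the p-value with α are two views of the same decision.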

F-distribution

If a random variable U is defined as

$$U = \frac{S_1 / d_1}{S_2 / d_2}$$

where S₁ and S₂ are independent chi-squared random variables with d₁ and d₂ degrees of freedom respectively, then U follows an F-distribution with (d₁, d₂) degrees of freedom.

ANOVA (Analysis of Variance)

So far, our examples have worked on one or two groups only. Given many groups, can we tell whether their means are significantly different? For example, we can study the effectiveness of a blood-pressure-lowering drug at three different dosages. In this section, we will focus on multiple groups for a single factor. The null hypothesis assumes the same mean outcome for all three dosages.

Assumptions

Here are the assumptions for ANOVA:

the data come from a simple random sample, i.e. each member has an equal probability of being chosen,
observations are independent of each other,
the groups are independent of each other,
each group has a large sample size (as a rough guide, greater than 20) or each group is normally distributed, and
the SD of each group is roughly the same.

The last assumption simplifies the model: it lets ANOVA pool the within-group variation of all groups into a single estimate.

Sum of Squares (SS)

Suppose Group 1 has 4 members, and the dotted line is the average systolic blood pressure (SBP) of the group.

The sum of squares (SS) is the sum of squared deviations. We can visualize it as squaring the distance (d) of each data point from the mean and summing the results.

Next, we will introduce three types of SS in this example.

In statistics, variance measures the spread of a variable (roughly, the average SS per data point). The variance in the observed data within a group is caused by noise and factors that are not captured in the study. Here, we only study how the amount of dose impacts SBP. The variance observed within a group can be caused by, for example, biological differences among the subjects. We label this “SS within the group” as unexplained since it contains variation that is not explained by the study.

This study contains three groups on three different dosages and each group has a different sample mean on SBP.

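The unexplained SS is the sum of squared differences between each data point and its corresponding group mean. Visually, we are squaring the length of all the arrows above and adding them together.

$$SS_{\text{unexplained}} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$

where x_{ij} is the i-th data point in group j, n_j is the size of group j, and x̄_j is its sample mean.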

Next, we merge all k samples together to compute the overall mean μ.

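The differences in the sample means among groups explain how SBP is impacted by the different dosages.

Visually, we can square the length of all the arrows above and add them together. This is the explained SS, which is the sum over all data points of the squared difference between the point's group mean and the overall mean μ:

$$SS_{\text{explained}} = \sum_{j=1}^{k} n_j (\bar{x}_j - \mu)^2$$

where n_j is the size of group j and x̄_j is its sample mean.

The total SS is the sum of squared distances between each data point and the overall mean μ. Squaring and adding all the arrows below gives the SS total:

$$SS_{\text{total}} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \mu)^2$$

where x_{ij} is the i-th data point in group j.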

These SSs satisfy an important relationship:

$$SS_{\text{total}} = SS_{\text{explained}} + SS_{\text{unexplained}}$$

So we can just calculate two to find the third one. In some contexts, the explained SS and unexplained SS are called “SS between” and “SS within” respectively. This indicates whether the SS is computed between groups or within a group. SS within is sometimes viewed as noise while SS between is the signal we want to study.

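The SS within is computed from n data points and uses k (3) sample means. Each sample mean reduces the df by 1. Therefore, the degrees of freedom (df) for SS within is n minus the number of groups, i.e. n − 3. Similarly, the SS between is computed from k group means and uses one overall mean, so its df is k − 1, i.e. df equals 2.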
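The three sums of squares and their degrees of freedom can be checked numerically. The sketch below uses made-up SBP readings for three dosage groups (the numbers and group names are assumptions, not the article's data) and plain Python only.

```python
# Hypothetical SBP readings for three dosage groups (made-up numbers).
groups = {
    "low":    [142, 138, 145, 139],
    "medium": [134, 130, 136, 132],
    "high":   [125, 128, 122, 129],
}

def mean(values):
    return sum(values) / len(values)

all_points = [x for g in groups.values() for x in g]
overall_mean = mean(all_points)

# SS within (unexplained): each point vs. its own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

# SS between (explained): group mean vs. overall mean, counted once per point.
ss_between = sum(len(g) * (mean(g) - overall_mean) ** 2 for g in groups.values())

# SS total: each point vs. the overall mean.
ss_total = sum((x - overall_mean) ** 2 for x in all_points)

n, k = len(all_points), len(groups)
df_between, df_within = k - 1, n - k

print(f"SS between = {ss_between:.3f} (df = {df_between})")
print(f"SS within  = {ss_within:.3f} (df = {df_within})")
print(f"SS total   = {ss_total:.3f}")
```

With this data, SS between plus SS within reproduces SS total, so only two of the three need to be computed.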

Mean Square & F-ratio

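Mean square (MS) is defined as

$$MS = \frac{SS}{df}$$

And the F-ratio is

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$

Substituting MS with SS, where k is the number of groups and n is the total number of data points, we get

$$F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (n - k)}$$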

Under H₀, the F-ratio follows an F-distribution, so we can use it to find the p-value. If H₁ is true, the F-ratio tends to be greater than one. If H₀ is true, it should be approximately equal to one.
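As a minimal sketch of the formula, the hypothetical helper `one_way_f_ratio` below computes the F-ratio from raw group data. The function name and the sample readings are assumptions made for illustration.

```python
def one_way_f_ratio(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_points = [x for g in groups for x in g]
    n, k = len(all_points), len(groups)
    overall_mean = sum(all_points) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Explained variation: group means around the overall mean.
    ss_between = sum(len(g) * (m - overall_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Unexplained variation: data points around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k


# Made-up SBP readings for three dosage groups.
f_ratio, df1, df2 = one_way_f_ratio([
    [142, 138, 145, 139],
    [134, 130, 136, 132],
    [125, 128, 122, 129],
])
print(f"F = {f_ratio:.2f} with ({df1}, {df2}) degrees of freedom")
```

An F-ratio far above one, as in this made-up data, suggests the group means differ by more than the within-group noise would produce.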

p-value

Using an F-table, we can find the critical F-ratio corresponding to a specific α in conjunction with the degrees of freedom ν₁ and ν₂.

where ν₁ and ν₂ are the degrees of freedom of SS between and SS within:

$$\nu_1 = k - 1, \qquad \nu_2 = n - k$$

If the computed F-ratio is larger than the critical value in the table, we should reject H₀. For ν₁ = 3, ν₂ = 246 and α = 0.05, the critical F-ratio is 2.64. Any F-ratio greater than 2.64 leads us to reject H₀.

We mostly use software to compute the p-value from the F-ratio. For example, when ν₁ = 3 and ν₂ = 246, the p-value for an F-ratio of 2.57 is 0.05487. In this situation, we fail to reject H₀ at the α = 0.05 significance level.
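One way to reproduce these numbers, assuming SciPy is the software in use (the article does not say which tool produced them), is sketched below.

```python
from scipy.stats import f

nu1, nu2 = 3, 246   # degrees of freedom from the example above
alpha = 0.05

# Critical F-ratio: the value leaving probability alpha in the upper tail.
critical_f = f.ppf(1 - alpha, nu1, nu2)

# p-value for the computed F-ratio of 2.57: the upper-tail probability.
f_ratio = 2.57
p_value = f.sf(f_ratio, nu1, nu2)

print(f"critical F = {critical_f:.2f}")
print(f"p-value = {p_value:.5f}, reject H0: {p_value < alpha}")
```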

We will demonstrate these concepts with a more complex example later.

Post-hoc Pairwise Comparison

H₁ claims that at least one of the μᵢ is statistically different from others. How can we locate the different one if H₀ is rejected? Consider the study has four different groups A, B, C, and D. If we pick any two groups, we have 6 combinations: AB, AC, AD, BC, BD, and CD. We can use a 2-sample t-test to compare each combination for any significant difference.

Say we find two combinations, BD and CD, that are statistically different. Assume B and C are close after looking at their sample means. Let’s visualize it. As B and C are close, the CI for D should not overlap with the CI for B or C, while the CI for A will overlap with those of B, C, and D. In short, D is significantly different from B and C.
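The pairwise procedure can be sketched as follows, assuming SciPy's `ttest_ind` as the 2-sample t-test. The four samples are made up so that B and C are close and D stands apart.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical measurements for four groups (made-up numbers).
samples = {
    "A": [125, 155, 138, 146, 136],
    "B": [130, 132, 131, 129, 133],
    "C": [131, 133, 130, 132, 134],
    "D": [150, 152, 149, 151, 153],
}

# Every unordered pair of groups: AB, AC, AD, BC, BD, CD.
pairs = list(combinations(samples, 2))

p_values = {}
for g1, g2 in pairs:
    t_stat, p = ttest_ind(samples[g1], samples[g2])
    p_values[g1 + g2] = p

for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

Running several tests on the same data inflates the chance of a false positive, so the raw p-values should not be compared with the unadjusted α.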

Bonferroni Correction

Errors add up. So if we want to keep the overall error rate below a certain value, what should the acceptable error rate be for each individual test?

α controls the type I error.

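In this example, we conduct 6 hypothesis tests. Errors accumulate: assuming independent tests, the chance of making no type I error in any of the 6 tests, (1-α)⁶, is smaller than the (1-α) of a single test. For m tests, the Bonferroni correction suggests changing α to

$$\alpha' = \frac{\alpha}{m}$$

i.e. α/6 in this example. This keeps the overall type I error rate close to, and no greater than, α.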
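A short calculation shows the effect of the correction. The sketch below assumes the 6 tests are independent, which is a simplification (pairwise tests that share groups are not), and uses α = 0.05 for illustration.

```python
alpha = 0.05   # desired overall type I error rate
m = 6          # number of pairwise tests

# Chance of at least one false positive if every test uses alpha directly.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni: test each pair at alpha / m instead.
alpha_per_test = alpha / m
fwer_corrected = 1 - (1 - alpha_per_test) ** m

print(f"uncorrected overall error rate: {fwer_uncorrected:.4f}")
print(f"per-test alpha after correction: {alpha_per_test:.5f}")
print(f"corrected overall error rate:   {fwer_corrected:.4f}")
```

Under these assumptions the uncorrected rate is roughly 0.26, while the corrected rate stays just under 0.05.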

Two-way ANOVA

The ANOVA we studied is a one-way ANOVA. The study considers only one factor (dosage) in evaluating the effectiveness of the treatment. In a two-way ANOVA, we evaluate how two factors may impact the outcome. For example, the study may consider three different dosages and the gender factor. Now we have 3 × 2 groups and two factors in the study.

The SS is expressed as

$$SS_{\text{total}} = SS_A + SS_B + SS_{A \times B} + SS_{\text{within}}$$

We consider the outcome to be influenced by the main effect of factor A, the main effect of factor B, the interaction between A and B, and the noise (SS within).

The third term on the R.H.S. is

$$SS_{A \times B} = SS_{\text{between}} - SS_A - SS_B$$

It estimates the effect of the interaction between factors A and B. It takes the SS between (between the 6 groups) and subtracts the main effects of A and B.

Factor A and factor B can each be highly influential on the outcome while the interaction effect is negligible. The question is whether some combinations of gender and dosage impact the outcome disproportionately, beyond what the corresponding gender and dosage would cause independently.

In our example, we will not compute it directly. Instead, we will calculate it as

$$SS_{A \times B} = SS_{\text{total}} - SS_A - SS_B - SS_{\text{within}}$$

Here is the data we will be using for the example. The lower part is the average under different scenarios. We also compute the average for each gender as well as for each dosage.

Our objective is to fill up the table below:

The df for the interaction SS (A × B) equals the df for gender × the df for doses, i.e. (2 − 1) × (3 − 1) = 2.

This is the calculation of the SS for gender. For each data point, we square the difference between the male/female average and the overall mean μ.

And this is the SS for dosage. For each data point, we square the difference between the low/medium/high dose average and the overall mean μ.

This is the SS within (SS noise). For each data point, we square the difference between itself and the mean for the group that it belongs to. (Note: we have 3×2 groups.)

This is the SS total. For each data point, we square the difference between itself and the overall mean μ.

With SS A × B = SS total − SS A − SS B − SS within, SS A × B is 352.44.
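The same bookkeeping can be sketched in plain Python. The data below is made up and deliberately tiny (a balanced 2 × 3 design with two subjects per cell), so its numbers differ from the table in this article. It also checks that the subtraction shortcut agrees with the direct definition of the interaction SS.

```python
# Hypothetical, balanced data: (gender, dose) -> outcome scores (made-up numbers).
data = {
    ("female", "low"):    [10, 12],
    ("female", "medium"): [14, 16],
    ("female", "high"):   [20, 22],
    ("male", "low"):      [8, 10],
    ("male", "medium"):   [11, 13],
    ("male", "high"):     [12, 14],
}

def mean(values):
    return sum(values) / len(values)

# Flatten to (gender, dose, outcome) records, one per data point.
points = [(g, d, y) for (g, d), ys in data.items() for y in ys]
grand_mean = mean([y for _, _, y in points])

def level_mean(index, level):
    """Mean outcome over all points whose factor at `index` equals `level`."""
    return mean([p[2] for p in points if p[index] == level])

# Main effects: factor-level average vs. overall mean, summed over data points.
ss_gender = sum((level_mean(0, g) - grand_mean) ** 2 for g, _, _ in points)
ss_dose = sum((level_mean(1, d) - grand_mean) ** 2 for _, d, _ in points)

# Noise: each point vs. the mean of its own (gender, dose) cell.
ss_within = sum((y - mean(data[(g, d)])) ** 2 for g, d, y in points)
ss_total = sum((y - grand_mean) ** 2 for _, _, y in points)

# Interaction by subtraction, as in the text.
ss_interaction = ss_total - ss_gender - ss_dose - ss_within

# Interaction by definition: SS between the 6 cells minus both main effects.
ss_cells = sum((mean(data[(g, d)]) - grand_mean) ** 2 for g, d, _ in points)
ss_interaction_direct = ss_cells - ss_gender - ss_dose

print(f"SS gender = {ss_gender:.3f}, SS dose = {ss_dose:.3f}")
print(f"SS within = {ss_within:.3f}, SS total = {ss_total:.3f}")
print(f"SS interaction = {ss_interaction:.3f}")
```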

With all SSs computed, we can fill in the table below and compute the F-ratios. Here, we use software to compute the p-values. For the first two rows, the p-value is so small that the software just shows that it is smaller than 0.0001.

All three effects have p-values smaller than α = 0.05. So we reject all the null hypotheses below:

H₀: gender has no impact on the outcome of the treatment.
H₀: level of the dose has no impact on the outcome of the treatment.
H₀: the interaction between gender and dose has no impact on the outcome of the treatment.

R-squared (Coefficient of determination)

In machine learning (ML), we use R-squared to compare the performance of models using the concept of explained and unexplained variance.

Let’s consider building linear regression models to fit the data below. Models with a higher R-squared explain more of the variance in the data than those with lower values.

The sum of squares of variation SS is conceptualized as

$$SS_{\text{total}} = SS_{\text{explained}} + SS_{\text{unexplained}}$$

which contains components explained and not explained by a model. The squares below show the part of the sum of squares that the model fails to explain (a.k.a. the squared errors).

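The total SS for the regression data is

$$SS_{\text{total}} = \sum_i (y_i - \bar{y})^2$$

where ȳ is the mean of the sample data’s output.

Then we compute the residual SS of a model as

$$SS_{\text{residual}} = \sum_i (y_i - \hat{y}_i)^2$$

where ŷᵢ is the model’s prediction for the i-th data point. This is the sum of squares of the prediction errors. The R-squared is the ratio of explained SS (total − unexplained) over total SS:

$$R^2 = \frac{SS_{\text{total}} - SS_{\text{residual}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$$

In a nutshell, R-squared measures how well a model can explain the data.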
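A minimal sketch of the calculation follows. The data points and the two candidate models (a line y = 2x and a baseline that always predicts the mean) are assumptions chosen for illustration, and `r_squared` is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - residual SS / total SS."""
    y_bar = sum(y_true) / len(y_true)
    ss_total = sum((y - y_bar) ** 2 for y in y_true)
    ss_residual = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_residual / ss_total


# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Hypothetical model A: the line y = 2x.
pred_line = [2 * x for x in xs]

# Hypothetical model B: a baseline that always predicts the sample mean.
y_mean = sum(ys) / len(ys)
pred_mean = [y_mean for _ in xs]

r2_line = r_squared(ys, pred_line)
r2_mean = r_squared(ys, pred_mean)

print(f"R-squared, y = 2x:        {r2_line:.4f}")
print(f"R-squared, mean baseline: {r2_mean:.4f}")
```

The mean baseline explains none of the variation, so its R-squared is zero, while the line leaves only small residuals and scores close to one.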