Statistics (I) in Data Science & Machine Learning
In data science and machine learning, engineers often struggle with the underlying mathematics. It is not that hard: many of the topics, perhaps under different names, were already covered in high school or university. In this series, let's refresh some topics core to these fields and make them approachable again. We will divide the fundamental topics in statistics into three articles. This first one focuses on the central limit theorem, confidence intervals, hypothesis testing, statistical significance, the p-value, sample variance, the t-test, the z-test, Student's t-distribution, and study design.
Random variables hold values derived from the outcomes of random experiments. For example, random variable X holds the number of heads in flipping a coin 100 times.
External validity in statistics refers to how well the sample estimation can be generalized to the external population. For example, how well a vaccine effectiveness study conducted on the British population generalizes to other countries.
Internal validity means whether the sample estimation within the study is biased. For example, are there any confounding factors? Did subjects recently recover from a COVID infection, pushing the measured effectiveness of a vaccine higher?
Sample: A sample is a subset of a population of interest.
Statistical inference: We use statistics from the sample to draw conclusions about the general population of interest. This includes confidence intervals and hypothesis tests.
Bootstrapping: In bootstrapping, sample points are drawn from the observed sample with replacement, rather than from the population itself.
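The resampling idea can be sketched in a few lines of NumPy. This is a minimal illustration with a made-up sample; the sample values and the number of resamples are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=120, scale=15, size=50)  # hypothetical SBP-like sample

# Draw 10,000 bootstrap resamples (same size as the sample, with replacement)
# and record the mean of each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# A bootstrap 95% interval for the mean: the 2.5th and 97.5th percentiles
# of the resampled means.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```

The spread of `boot_means` approximates the sampling variability of the mean without assuming any particular population distribution.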
Normal Distribution/Gaussian Distribution
In a normal distribution, 68.3% of data is within one σ from μ and 95.5% of data is within two σ.
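These percentages can be verified directly from the normal CDF, here with SciPy (assumed available):

```python
from scipy.stats import norm

# Fraction of a normal distribution within 1 and 2 standard deviations of the mean.
within_1sd = norm.cdf(1) - norm.cdf(-1)   # ≈ 0.683
within_2sd = norm.cdf(2) - norm.cdf(-2)   # ≈ 0.955
```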
Central Limit Theorem
In many situations, if we take the mean of n independent random variables and repeat the process many times, the collected results (the sampling distribution of the mean) tend toward a normal distribution as the sample size n gets larger (around 30–50 is often sufficient).
This is true even if the random variable X itself (the left diagram) is not normally distributed.
If the random variable X on the left above has mean μ and variance σ², the sampling distribution of the mean will be approximately normal: X̄ ~ N(μ, σ²/n).
After sampling 100 people for the average systolic blood pressure (SBP), what do we know about the SBP of the overall population? Let’s see how we formulate the problem first.
All the sample points come from a population with mean μ and variance σ². This distribution does not need to be normal; in fact, SBP has a skewed distribution. Mathematically, the sample mean is formulated as X̄ = (1/n) Σᵢ Xᵢ.
Nevertheless, according to the central limit theorem, as n gets large, the sample mean will be approximately normally distributed.
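We can check this behavior with a quick simulation. Here the skewed population is modeled as an exponential distribution (an assumption standing in for a skewed variable like SBP), and we verify that the sample means cluster around the population mean with spread close to σ/√n:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A skewed "population": exponential with mean 1 (and sigma = 1).
n = 100

# Draw 20,000 samples of size n and record each sample mean.
sample_means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)

# By the CLT, the sample means are approximately normal around the
# population mean, with standard deviation close to sigma / sqrt(n) = 0.1.
print(sample_means.mean())   # ≈ 1.0
print(sample_means.std())    # ≈ 0.1
```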
A z-score is measured in terms of standard deviations from the population mean.
We can use a table based on the normal distribution to find the percentage of data within a given z-score of the mean (to the left and to the right). This table is called the two-tailed z-score table. For example, 95% of the data is within 1.96 standard deviations (SD) of the mean.
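Instead of a printed table, the same critical values come from the inverse normal CDF. A small sketch with SciPy:

```python
from scipy.stats import norm

# Two-tailed critical value for 95% confidence:
# the point leaving 2.5% in each tail.
z_95 = norm.ppf(0.975)        # ≈ 1.960

# One-tailed 95% critical value, used later for one-sided bounds.
z_one_sided = norm.ppf(0.95)  # ≈ 1.645
```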
For now, let’s not worry too much about μ and σ. In practice, σ can be found from literature or approximated by the sample. And μ can be set to a hypothetical value for testing (details later).
A confidence interval (CI) is a range of values that will include a population statistic (say mean) with a certain degree of confidence. It includes a margin of error (ME) in the estimation.
ME is formulated as z × S.E. (standard error), where S.E. = σ/√n is the SD of the sampling distribution, i.e. the distribution of the sample mean X̄. The CI includes values within z standard errors of the sample mean: CI = X̄ ± z × S.E.
CI gives a range of values that researchers think may contain the mean of the population. With a 95% confidence level, CI is within ±1.96 SD from the sampling mean for a normal distribution.
The z-score table often lists the significance level α instead of the confidence level. It equals one minus the confidence level: for a 95% confidence level, α = 1 − 0.95 = 0.05.
To illustrate the concept of confidence level, let’s collect six samples and calculate their corresponding sampling means and CIs. We plot the results below. As shown, not all confidence intervals (like the one in red) will include the population mean.
95% CI does not claim there is a 95% chance that the population mean falls within this interval. Given a confidence interval, there is either a 0% or 100% chance containing the population mean μ. μ does not change regardless of the CI or the study.
Instead, we are just 95% confident in the estimation procedure. For demonstration, we can take samples over and over again; 95% of the CIs will contain the population mean. A confidence level is about how often we are right. At a 95% confidence level, we accept being wrong 5% of the time. A 99% confidence level is also common in research studies. Consulting the z-score table for α = 0.01 gives z = 2.576, so the CI is computed as X̄ ± 2.576 × S.E.
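Putting the pieces together, a CI at either confidence level is a one-liner. The sample mean, σ, and n below are hypothetical numbers chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical numbers: sample mean 135 mmHg, known sigma = 12, n = 36.
x_bar, sigma, n = 135.0, 12.0, 36
se = sigma / np.sqrt(n)   # standard error = 2.0

cis = {}
for conf in (0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)   # 1.960 for 95%, 2.576 for 99%
    me = z * se                        # margin of error
    cis[conf] = (x_bar - me, x_bar + me)

print(cis[0.95])   # ≈ (131.08, 138.92)
print(cis[0.99])   # ≈ (129.85, 140.15)
```

Note how the 99% interval is wider: higher confidence costs precision.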
One-tailed or two-tailed table
In some research studies, we are interested in only the left tail (or right tail) of the normal distribution. Instead of finding the chance of being within a certain range of the mean, we may ask the chance of being two SD below the mean. To accomplish that, a z-score table can be one-tailed instead of two-tailed.
For the same z-score, α in the one-tailed table is half the value in the two-tailed table. To compute a one-sided CI, we calculate only the lower bound or the upper bound. For example, for the top-left diagram, we just need the lower bound: X̄ − z × S.E. (for a one-sided bound at α = 0.05, z = 1.645).
Sample variance & population variance
These equations require the population variance σ² to be known. How do we know it?
Some research studies are interested in learning whether a sampling mean for a specific group is statistically different from the population. For example, will a drug lower the cholesterol level for the hypertension population? For σ², we can assume those taking the drug will have the same variance as the hypertension population. We can search the literature for its value as it may be studied already.
However, in some studies, the population variance may not be known. In this case, we will estimate it from the sample and substitute population σ² with the sample variance s² instead.
But there are some caveats. In particular, does the sampling distribution resemble the normal distribution when the sample size is very small? To address that, we need to understand the t-distribution.
The sample variance is defined as S² = (1/(n−1)) Σᵢ (Xᵢ − X̄)². For the random variable T = (X̄ − μ)/(S/√n), we say that T has a t-distribution with ν = n−1 degrees of freedom (df). Compared with the normal distribution, the t-distribution is flatter with longer tails.
But as ν increases, it resembles the normal distribution. The dotted line below has a degree of freedom (df) equal to 30. It is close to the normal distribution.
With the sample variance, we use the t-score for the t-distribution to compute the CI: CI = X̄ ± t × S.E., where S.E. = S/√n is the standard error.
t-score can be found from the t-score table using α and the degree of freedom.
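The table lookup can again be replaced by an inverse-CDF call. This sketch also shows the point made above: as the degrees of freedom grow, the t critical value approaches the z critical value.

```python
from scipy.stats import t, norm

# Two-tailed critical values for alpha = 0.05 (95% confidence)
# at several degrees of freedom.
t_crit = {df: t.ppf(0.975, df) for df in (5, 10, 30, 100)}
z_crit = norm.ppf(0.975)   # ≈ 1.960

# t critical values shrink toward z as df grows:
print(t_crit)   # ≈ {5: 2.571, 10: 2.228, 30: 2.042, 100: 1.984}
```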
Increasing the sampling size reduces uncertainty. For a fixed margin of error (ME), what is the minimum sample size?
The margin of error equals ME = t × S.E. = t × s/√n. For a desired ME, solving for n gives n = (t × s / ME)².
Since t depends on the degrees of freedom (n−1), we approximate the equation with the z-score instead: n = (z × s / ME)². For the sample variance s², we can search the literature for a value or conduct a small pilot study. Now, with a target ME and confidence level, we can find the necessary sample size when planning a study. We may also overcompensate n for possible participant dropout, e.g. n_adjusted = n / (1 − expected dropout rate).
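A small planning sketch, using assumed numbers (pilot-study s = 12, target ME = 3 mmHg, 95% confidence, 10% expected dropout):

```python
import numpy as np
from scipy.stats import norm

s, me_target, conf = 12.0, 3.0, 0.95
z = norm.ppf(1 - (1 - conf) / 2)           # ≈ 1.96

n = (z * s / me_target) ** 2               # ≈ 61.5
n = int(np.ceil(n))                        # always round up → 62

# Overcompensate for an assumed 10% participant dropout.
n_adjusted = int(np.ceil(n / (1 - 0.10)))  # 69
```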
Hypothesis testing & statistical significance
A null hypothesis H₀ claims that a statistic of interest is the same between two possibilities. Under the assumption of H₀ and given the observations, a hypothesis test checks whether a test statistic, like the z-score, is statistically significant or not.
Let's consider a drug used to lower systolic blood pressure (SBP) in the hypertension population. H₀ claims there is no difference in SBP between taking the drug and not. Given a likelihood model, we compute the likelihood of the observed SBP sample mean for persons taking the drug under the H₀ assumption. If the likelihood is larger than a chosen threshold α, we do not reject H₀: any drop from the hypertension population may be explainable by sampling variability. Otherwise, we assume the difference does not happen by chance and support the alternative hypothesis H₁ instead, i.e. the drug lowers blood pressure. (We will demonstrate the details with an example later.)
Mathematically, we often model H₀ and H₁ with one of the possibilities below: H₀: μ₁ = μ₂ against H₁: μ₁ ≠ μ₂, μ₁ < μ₂, or μ₁ > μ₂. Here, μ₁ is the SBP mean for the group taking the drug and μ₂ is the SBP mean for the hypertension population. In this example, H₁ assumes μ₁ < μ₂.
Usually, our hope is to prove H₁. But proving a claim requires considering all possibilities, many of which are unknown. In science, disproving a claim is more approachable. So by default, we assume H₀ to be true until the observed data for the drug is unlikely to have occurred by chance. We hope the data reject H₀ and therefore imply H₁ instead.
z-test and t-test
A z-test or a t-test determines whether there is a statistically significant difference for a statistical value. The major difference between them is that the z-test uses the z-score with the population variance σ² while the t-test uses the t-score with the sample variance s²: z = (X̄ − μ)/(σ/√n) and t = (X̄ − μ)/(s/√n), where X̄ is the sample mean and n is the sample size.
So when should we use the t-test and when the z-test? If the population variance is known, and Xᵢ is normally distributed or the sample size is greater than 30, we use the z-test.
When the population variance is not known, we estimate it from the sample and use the t-test. As n increases, the difference between the t-distribution and the normal distribution diminishes. But if Xᵢ is not approximately normally distributed, say it is skewed, we want the sample size to be at least 20.
The principle is: under the null hypothesis, does the normal distribution or the t-distribution better resemble the sampling distribution? When n is smaller than 30, the t-table is more appropriate.
For simplicity, we may use the z-score table to look up the z-score or the p-value in our illustrations. Without the degree of freedom, the z-score table is simpler. In practice, this is done by software, so the complexity is not a concern.
One-sided and two-sided test
The alternative hypothesis H₁ claims that SBP is lower when taking the drug. Therefore, a one-tailed table is used for such a test.
For many equality checks, like pay equality between males and females, we use a two-sided test. And a two-tailed table is used instead.
In the blood pressure drug example, let's say the mean SBP in the hypertension population is 140 mmHg. The corresponding hypotheses are H₀: μ = 140 and H₁: μ < 140.
Let’s detail the parameters and the results of the study. The sample size of the study is 36 (n=36). And for demonstration, we use s=12. For those taking the drug, the study reports the sampling SBP mean to be 135.
According to H₀, the mean should be 140. Therefore, the t-score is t = (135 − 140)/(12/√36) = −5/2 = −2.5.
The p-value (probability value) is the probability of observing sample statistics at least as extreme as those measured, when the null hypothesis is true. In hypothesis testing, we reject H₀ when the p-value falls below the chosen significance level α.
We are focusing on the left tail to verify whether the sample mean 135 is statistically lower. For illustration and simplicity, we use the z-table. The p-value equals 0.0062 for z = −2.5, i.e. a sample mean of 135 or lower has a 0.62% chance if 140 is the true mean.
Since the p-value is below α (0.05), we reject H₀ and conclude that the medication is statistically significant in lowering blood pressure. As a sanity check, the corresponding CI should not contain the hypothesized mean (140).
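The whole worked example fits in a few lines. As in the text, the p-value is looked up from the normal distribution for simplicity:

```python
import numpy as np
from scipy.stats import norm

# Numbers from the example: n = 36, s = 12, sample mean 135,
# hypothesized mean 140 under H0.
n, s, x_bar, mu_0 = 36, 12.0, 135.0, 140.0

se = s / np.sqrt(n)        # standard error = 2.0
z = (x_bar - mu_0) / se    # test statistic = -2.5

# One-sided (left-tail) p-value.
p_value = norm.cdf(z)      # ≈ 0.0062, below alpha = 0.05, so reject H0
```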
For a 95% confidence level (α=0.05), the researchers are willing to be wrong 5% of the time. This is considered to be acceptable by common practice. But we can adopt a different α according to the problem domains.
Statistical significance is not the same as clinical significance. Any real difference, no matter how small, can be shown to be statistically significant as long as the sample size is large enough. If a drug really lowers SBP by 0.01 mmHg, we can always show the result is statistically significant. But even if the drop is real, it is too small to be valuable; it is not clinically significant. The p-value alone cannot show us the big picture. We need to understand the whole context of the test, including the study design, the parameters, and the test statistics (like the z-score).
Types of error
In hypothesis testing, we can make two types of mistakes (in red below).
- Type I error is the false positive that we reject H₀ when H₀ is true.
- Type II error is the false negative that we fail to reject H₀ when H₁ is true.
In a study, we choose the value of the significance level α, say 0.05. In the last example, H₀ assumes the population mean is 140. Under this assumption, the blue region is possible but has probability smaller than 0.05. We decide that any observed sample mean with p-value < α (the blue area) likely contradicts H₀.
But if H₀ is really true, the blue area is where we make mistakes. Therefore, α is the type I error rate — we reject H₀ while H₀ is true (false positive). We may adjust α in the tradeoff between the type I and type II errors.
For demonstration purposes, we change s from 12 to 24, and we choose α such that the margin of error is 6, i.e. if the sample SBP mean is smaller than or equal to 134, we reject H₀. So what is the type II error rate, i.e. the chance that H₁ is true but we fail to reject H₀? H₁ assumes μ < μ₀ (μ: the mean when taking the drug). But let's be more specific and say μ equals 132, which drops SBP by 8. This gives an absolute effect size of 8, the magnitude of the effect of taking the drug.
The type II error is the area in red above. We name this probability as β. It is the area where H₀ is not rejected when H₁ is true.
Power equals 1-β. Power is the probability of correctly rejecting H₀.
So with a magnitude of effect equal to 8, we have a 69% chance of rejecting H₀ when H₁ (μ < μ₀) is true (SBP 134 is the cutoff for rejecting H₀). Power is the probability that the study will find a statistically significant difference in the statistic of interest when one actually exists.
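The 69% figure follows directly from the numbers in the example (s = 24, n = 36, rejection cutoff 134, assumed true mean 132):

```python
import numpy as np
from scipy.stats import norm

se = 24.0 / np.sqrt(36)    # standard error = 4.0
cutoff, mu_true = 134.0, 132.0

# Power = P(sample mean <= cutoff | true mean = 132)
power = norm.cdf((cutoff - mu_true) / se)   # ≈ 0.69
beta = 1 - power                            # type II error rate ≈ 0.31
```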
A study is usually expensive. Before a study, we estimate the minimum sample size needed to reach a target level of power: the power analysis requires the type II error rate β, the probability of concluding there is no effect when one exists, to be below a target value. General practice targets a power of 80% or higher. This may be adjusted depending on whether the false positive or the false negative is particularly expensive.
Power depends on the significance level α, the sample variance s², the sample size n, and the magnitude of the effect d. α should not be changed without justification, and s is a property we cannot change. As we start a study, we first estimate the effect size. This can be done from a pilot study, from similar studies, or from some minimum difference judged to be meaningful. Next, we calculate the minimum number of subjects. In some cases, we can increase the dose of a drug for a potentially higher effect size. This leads to higher power but risks possible adverse effects.
For a two-sided two-sample t-test with 80% power and α = 0.05, the per-group sample size can be roughly estimated as n ≈ 16 s²/d². If it is a one-sample t-test, it is n ≈ 8 s²/d².
For higher accuracy, we can reverse the calculation in this example to find n.
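Plugging the earlier example's numbers (s = 24, effect size d = 8) into these rules of thumb:

```python
# Rule-of-thumb sample sizes for 80% power at alpha = 0.05 (two-sided),
# using the example's s = 24 and effect size d = 8.
s, d = 24.0, 8.0

n_two_sample = 16 * s**2 / d**2   # per group: 144
n_one_sample = 8 * s**2 / d**2    # 72
```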
There are four major sampling techniques:
- Simple random sample: each member of the population of interest has an equal chance of being selected.
- Stratified random sample: members belong to groups; for example, college students belong to different majors. Samples are selected from each group proportionally, with the members of each group selected randomly.
- Cluster random sample: data is first split into groups; some groups are selected and their members are sampled randomly.
- Systematic random sample: data is ordered; for example, each member is assigned an ID, and the sample is selected systematically, say all members whose ID ends in 4.
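Three of these techniques can be sketched on a toy population. The 1,000-member population and the four majors below are made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# A toy population of 1,000 member IDs, each assigned one of four majors.
ids = np.arange(1000)
majors = rng.choice(["math", "bio", "cs", "art"], size=1000)

# Simple random sample: every member has an equal chance of selection.
simple = rng.choice(ids, size=40, replace=False)

# Stratified random sample: sample 4% within each major,
# proportional to the major's size.
stratified = np.concatenate([
    rng.choice(ids[majors == m],
               size=int(0.04 * (majors == m).sum()),
               replace=False)
    for m in np.unique(majors)
])

# Systematic random sample: all members whose ID ends in 4.
systematic = ids[ids % 10 == 4]
```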
Before a research study, we decide the study design on how to collect and analyze data. It can be observational or experimental:
- An observational study plays an observation role in measuring or surveying subjects without intervention.
- A controlled experiment introduces intervention. For example, in a clinical study, subjects may be assigned to one group receiving treatment or to another group that does not.
A cross-sectional study looks at data from a population at one specific point in time, usually to determine prevalence.
In case-control studies, researchers compare two groups of people:
- those with the disease or condition under study (case), and
- a control group of similar people who do not have the disease or condition.
Case-control studies are retrospective: researchers review the subjects' histories to learn which factors are associated with the disease or condition.
Longitudinal studies follow participants over a period of time, usually for years.
Cohort studies are longitudinal studies that recruit subjects sharing common characteristics. For example, in a lung cancer study, one group is smokers and the other group is non-smokers. It mainly studies incidence, causes, and prognosis.
A randomized controlled trial is similar to a cohort study but participants are assigned to different groups randomly. For example, some may get treatment while others may get a placebo.
In a double-blind method, the subjects and the experimenters will be unaware of the specific treatment of a subject.
In this article, we have covered the fundamentals of statistics in data science and machine learning. For the examples so far, we have only one group of interest and one sample. We call these one-sample tests, for example, the one-sample t-test. There are research studies that compare groups, with a sample collected from each group. For example, we have two groups, one taking a blood pressure drug and the other taking a placebo. A careful study design minimizes the differences between the groups except that one is taking the drug and the other isn't. Is any difference in their SBPs statistically significant?