# Statistics (I) in Data Science & Machine Learning

In data science and machine learning, engineers often struggle with the underlying mathematics. It is not that hard: under different jargon, many of the topics were already covered in high school or university. In this series, let’s refresh the topics core to these fields and make them approachable again. We will divide the fundamental topics in statistics into three articles. This first one focuses on the central limit theorem, confidence intervals, hypothesis testing, statistical significance, the p-value, sample variance, the t-test, the z-test, Student’s t-distribution, and study design.

# Terms

**Random variables** hold values derived from the outcomes of random experiments. For example, the random variable *X* holds the number of heads in 100 coin flips.

**External validity** refers to how well a sample estimate generalizes to populations outside the study. For example, is a vaccine effectiveness study on the British population valid for other countries?

**Internal validity** concerns whether the sample estimate within the study is biased. For example, are there any confounding factors? Did subjects recently recover from a COVID infection, pushing the measured effectiveness of a vaccine higher?

**Sample**: A sample is a subset of a population of interest.

**Statistical inference**: We use statistics from the sample to draw conclusions about the general population of interest. This includes confidence intervals and hypothesis tests.

**Bootstrapping**: In bootstrapping, sample points are drawn from the observed sample with replacement, over and over, to approximate the sampling distribution of a statistic.

# Normal Distribution/Gaussian Distribution

In a normal distribution, 68.3% of the data lies within one *σ* of *μ* and 95.5% lies within two *σ*.
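We can check these percentages numerically with Python’s standard library (a sketch; `statistics.NormalDist` provides the standard normal CDF):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

# P(-1 < Z < 1): fraction of data within one sigma of the mean
within_1sd = z.cdf(1) - z.cdf(-1)
# P(-2 < Z < 2): fraction within two sigma
within_2sd = z.cdf(2) - z.cdf(-2)

print(round(within_1sd, 3))  # 0.683
print(round(within_2sd, 3))  # 0.954
```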

# Central Limit Theorem

In many situations, if we take the mean of *n* independent random variables and repeat the sampling many times, the collected means (the sampling distribution) tend toward a normal distribution as the sample size *n* gets larger (around 30–50 is usually sufficient).

This is true even if the random variable *X* itself is not normally distributed.

If the random variable *X* has a mean *μ* and a variance *σ*², the sampling distribution of the mean will approach a normal distribution with mean *μ* and variance *σ*²/*n*, i.e. X̄ ~ N(*μ*, *σ*²/*n*).
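A quick simulation (a sketch using only the standard library) illustrates the theorem: even for a skewed source distribution such as the exponential, the means of repeated samples of size *n* cluster around *μ* with spread *σ*/√*n*:

```python
import random
from statistics import mean, stdev

random.seed(0)
n = 50            # sample size
trials = 5000     # number of repeated samples

# Exponential(lambda=1) is skewed, with mu = 1 and sigma = 1
sample_means = [mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(trials)]

# The sampling distribution centers on mu, with SD close to sigma/sqrt(n)
print(round(mean(sample_means), 2))   # ~1.0
print(round(stdev(sample_means), 2))  # ~0.14, i.e. 1/sqrt(50)
```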

# Confidence Interval

After sampling 100 people for the average systolic blood pressure (**SBP**), what do we know about the SBP of the overall population? Let’s see how we formulate the problem first.

All the sample points come from a population with a mean *μ* and a variance *σ*². This distribution does not need to be normal; in fact, SBP has a skewed distribution. Mathematically, the sample mean is formulated as:

X̄ = (*X*₁ + *X*₂ + … + *Xₙ*) / *n*

Nevertheless, according to the central limit theorem, as *n* gets large, the sample mean will be approximately normally distributed.

A **z-score** measures how many standard deviations a value lies from the population mean: *z* = (*x* − *μ*)/*σ*.

We can use a table based on normal distribution to find the percentage of data to be z-score away from the mean (to the left and to the right). This table is called the two-tailed z-score table. For example, 95% of the data is within 1.96 standard deviations (SD) from the mean.
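Instead of a printed table, the inverse CDF gives the same critical values (a sketch with the standard library):

```python
from statistics import NormalDist

z = NormalDist()

# Two-tailed: 95% of the data lies within +/- z*, so 2.5% sits in each tail
z_two_tailed = z.inv_cdf(0.975)
print(round(z_two_tailed, 2))  # 1.96

# Critical value for a 99% confidence level (alpha = 0.01)
print(round(z.inv_cdf(0.995), 2))  # 2.58
```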

For now, let’s not worry too much about *μ* and *σ*. In practice, *σ* can be found in the literature or approximated from the sample, and *μ* can be set to a hypothetical value for testing (details later).

A **confidence interval** (CI) is a range of values that will include a population statistic (say, the mean) with a certain degree of confidence. It includes a margin of error (ME) in the estimation.

The ME is formulated as *z* × S.E., where the standard error (S.E.) is the SD of the sampling distribution — the distribution of the sampling mean X̄ — so S.E. = *σ*/√*n*. The CI includes values within *z* standard errors of the sampling mean.

CI gives a range of values that researchers think may contain the mean of the population. With a 95% confidence level, CI is within ±1.96 SD from the sampling mean for a normal distribution.
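As a concrete sketch (the numbers here are hypothetical, chosen only for illustration): with a sample mean of 120 mmHg, a known population *σ* of 15, and *n* = 100 subjects, a 95% CI is computed as:

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 120.0, 15.0, 100   # hypothetical study values
z = NormalDist().inv_cdf(0.975)      # 1.96 for a 95% confidence level

se = sigma / sqrt(n)                 # standard error of the sampling mean
me = z * se                          # margin of error
ci = (x_bar - me, x_bar + me)
print(tuple(round(v, 2) for v in ci))  # (117.06, 122.94)
```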

The z-score table often contains the **significance level** *α* instead of the confidence level. It equals one minus the confidence level. For a 95% confidence level,

*α* = 1 − 0.95 = 0.05.

To illustrate the concept of confidence level, let’s collect six samples and calculate their corresponding sampling means and CIs. We plot the results below. As shown, not all confidence intervals (like the one in red) will include the population mean.

A 95% CI does not claim there is a 95% chance that the population mean falls within this interval. A given confidence interval either contains the population mean *μ* or it does not; *μ* is fixed and does not change with the CI or the study.

Instead, we are just 95% confident in the estimation procedure. If we took samples over and over again, 95% of the resulting CIs would contain the population mean. A confidence level is about how often we are right: at a 95% confidence level, we accept being wrong 5% of the time. A 99% confidence level is also common in research studies. We can consult the z-score table to find the corresponding *z* value for *α* = 0.01 (*z* ≈ 2.576). The CI will be computed as

CI = X̄ ± 2.576 × *σ*/√*n*
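A simulation (a sketch with made-up population parameters) makes this concrete: drawing many samples and computing a 95% CI for each, roughly 95% of the intervals contain the true population mean:

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)
mu, sigma, n = 140.0, 12.0, 36       # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)
me = z * sigma / sqrt(n)             # margin of error (sigma known)

hits = 0
trials = 2000
for _ in range(trials):
    x_bar = mean(random.gauss(mu, sigma) for _ in range(n))
    if x_bar - me <= mu <= x_bar + me:
        hits += 1

print(hits / trials)  # close to 0.95
```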

**One-tailed or two-tailed table**

In some research studies, we are interested in only the left tail (or right tail) of the normal distribution. Instead of asking the chance of being “within a certain range of the mean”, we may ask the chance of being two SD below the mean. For that, a one-tailed z-score table is used instead of a two-tailed one.

For the same z-score, *α* in the one-tailed table is half the value in the two-tailed table. To compute a one-sided CI, we only need the lower bound or the upper bound. For example, for the top-left diagram, we just need to calculate the lower bound as

X̄ − *z* × *σ*/√*n*

**Sample variance & population variance**

These equations require the population variance *σ*² to be known. How do we know it?

Some research studies are interested in whether a sampling mean for a specific group is statistically different from the population. For example, will a drug lower the cholesterol level for the hypertension population? For *σ*², we can assume those taking the drug have the same variance as the hypertension population. We can search the literature for its value, as it may have been studied already.

However, in some studies, the population variance may not be known. In this case, we estimate it from the sample and substitute the sample variance *s*² for the population *σ*², i.e.

*s*² = Σ(*Xᵢ* − X̄)² / (*n* − 1)
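In code, the *n* − 1 denominator is the sample-variance convention; Python’s `statistics.variance` uses it by default, while `pvariance` divides by *n* (a sketch with made-up readings):

```python
from statistics import variance, pvariance

data = [128, 135, 121, 142, 130, 138]  # hypothetical SBP readings

s2 = variance(data)        # sample variance: divides by n - 1
sigma2 = pvariance(data)   # population variance: divides by n
print(s2 > sigma2)         # True: the n - 1 denominator gives a larger estimate
```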

But there are some caveats. In particular, does the sampling distribution resemble the normal distribution when the sample size is very small? To address that, we need to understand the t-distribution.

# Student’s t-distribution

Given a sample *X*₁, …, *Xₙ* drawn from a normal population with mean *μ*, let *S*² be the sample variance. For the random variable *T* below,

*T* = (X̄ − *μ*) / (*S*/√*n*)

we say that *T* has a t-distribution with *n* − 1 degrees of freedom (df, denoted *ν*). Compared with the normal distribution, the t-distribution is flatter with longer tails.

But as *ν *increases, it resembles the normal distribution. The dotted line below has a degree of freedom (df) equal to 30. It is close to the normal distribution.

With the sample variance, we use the t-score for the t-distribution to compute the CI:

CI = X̄ ± *t* × S.E.

where S.E. = *s*/√*n* is the standard error.

The t-score can be found from the t-score table using *α* and the degrees of freedom.

# Sample sizing

Increasing the sample size reduces uncertainty. For a fixed margin of error (ME), what is the minimum sample size?

The margin of error equals

ME = *t* × *s*/√*n*

For a desired ME, *n* equals

*n* = (*t* × *s* / ME)²

Since *t* depends on the degrees of freedom (*n* − 1), we approximate the equation with the z-score instead. For the sample variance *s*², we can search the literature for a value or conduct a small pilot study. Now, with a target ME and confidence level, we can find the necessary sample size when planning a study. We may also overcompensate *n* for possible participant dropout, e.g. dividing *n* by (1 − expected dropout rate).
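For example (a sketch with hypothetical numbers): targeting an ME of 4 mmHg at a 95% confidence level, with *s* = 12 from a pilot study and 10% expected dropout:

```python
from math import sqrt, ceil
from statistics import NormalDist

s, target_me = 12.0, 4.0              # pilot-study SD and desired margin of error
z = NormalDist().inv_cdf(0.975)       # 1.96 for a 95% confidence level
dropout = 0.10                        # expected participant dropout rate

n = ceil((z * s / target_me) ** 2)    # minimum sample size
n_recruit = ceil(n / (1 - dropout))   # overcompensate for dropout
print(n, n_recruit)  # 35 39
```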

# Hypothesis testing & statistical significance

A **null hypothesis** *H*₀ claims that a statistic of interest is the same between two groups (or equals a hypothesized value). Under the assumption of *H*₀ and given the observations, a hypothesis test checks whether a test statistic, like the z-score, is statistically significant or not.

Let’s consider a drug used for lowering the systolic blood pressure (SBP) of the hypertension population. *H*₀ claims there is no difference in SBP whether taking the drug or not. Given a likelihood model, we compute how likely the observed SBP sampling mean for persons taking the drug is under the *H*₀ assumption. If this probability is larger than a chosen threshold *α*, we will not reject *H*₀: any drop from the hypertension population mean may be explainable by sampling variability. Otherwise, we conclude the difference is unlikely to have happened by chance and support the alternative hypothesis *H*₁ instead, i.e. the drug lowers blood pressure. (We will demonstrate the details with an example later.)

Mathematically, we often model *H*₀ and *H*₁ with one of the possibilities below: *H*₀: *μ*₁ = *μ*₂ against *H*₁: *μ*₁ < *μ*₂, *μ*₁ > *μ*₂, or *μ*₁ ≠ *μ*₂. Here *μ*₁ is the sampling SBP mean for the group taking the drug and *μ*₂ is the SBP mean for the hypertension population. In this example, *H*₁ assumes *μ*₁ < *μ*₂.

Usually, our hope is to prove *H*₁. But proving a claim requires considering all possibilities, many of which are not known. In science, disproving a claim is more approachable. So by default, we assume *H*₀ to be true until the observed data for the drug is unlikely to have occurred by chance. We hope the data reject *H*₀ and therefore imply *H*₁ instead.

## z-test and t-test

A z-test or a t-test determines whether there is a statistically significant difference for a statistical value. The major difference is that the t-test uses the t-score with the sample variance *s*² while the z-test uses the z-score with the population variance *σ*²:

*z* = (X̄ − *μ*) / (*σ*/√*n*),  *t* = (X̄ − *μ*) / (*s*/√*n*)

where *n* is the sample size.

So when should we use the t-test and when to use the z-test? If the population variance is known, and *Xᵢ *is normally distributed or the sample size is greater than 30, we will use the z-table.

When the population variance is not known, we estimate it from the sample and use the t-test. As *n* increases, the difference between the t-distribution and the normal distribution diminishes. But if *Xᵢ* is not approximately normally distributed, say it is skewed, we want the sample size to be at least 20.

The principle is that under the null hypothesis, does the normal distribution or the t-distribution resemble the sampling distribution? When *n* is smaller than 30, the t-table is more appropriate.

For simplicity, we may use the z-score table to look up the z-score or the p-value in our illustrations. Without the degrees of freedom, the z-score table is simpler. In practice, this is done by software and complexity is not a concern.

**One-sided and two-sided test**

The alternative hypothesis *H*₁ claims that SBP is lower when taking the drug. Therefore, a one-tailed table is used for such testing.

For many equality checks, like pay equality between males and females, we use a two-sided test. And a two-tailed table is used instead.

In the blood pressure drug example, let’s say the mean SBP in the hypertension population is 140 mmHg. The corresponding hypotheses are

*H*₀: *μ* = 140,  *H*₁: *μ* < 140

Let’s detail the parameters and results of the study. The sample size is 36 (*n* = 36), and for demonstration we use *s* = 12. For those taking the drug, the study reports a sampling SBP mean of 135.

According to *H*₀, the mean should be 140. Therefore, the t-score is

*t* = (135 − 140) / (12/√36) = −5/2 = −2.5
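Computing this test statistic in code (a sketch; *t* = (X̄ − *μ*₀)/(*s*/√*n*) with the study’s values):

```python
from math import sqrt

x_bar, mu0 = 135.0, 140.0   # observed sampling mean vs. the H0 mean
s, n = 12.0, 36             # sample SD and sample size

t = (x_bar - mu0) / (s / sqrt(n))
print(t)  # -2.5
```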

**p-value**

The **p-value** (probability value) is the probability of observing sample statistics at least as extreme as those seen, assuming the null hypothesis is true. In hypothesis testing, we reject *H*₀ when the p-value falls below the significance level *α*.

We focus on the left tail to verify whether the sampling mean of 135 is statistically lower. For illustration and simplicity, we use the z-table. The p-value equals 0.0062 for *z* = −2.5, i.e. a sampling mean of 135 has a chance of 0.62% if 140 is the expected mean.
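The 0.0062 figure is just the standard normal CDF evaluated at the test statistic (a sketch):

```python
from statistics import NormalDist

p_value = NormalDist().cdf(-2.5)   # left-tail probability for z = -2.5
print(round(p_value, 4))  # 0.0062
```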

Since the p-value is below *α* (0.05), we reject *H*₀ and conclude that the medication lowers blood pressure with statistical significance. To verify the result, we should find that the corresponding CI does not contain the hypothesized mean (140).

For a 95% confidence level (*α* = 0.05), the researchers are willing to be wrong 5% of the time. This is considered acceptable by common practice, but we can adopt a different *α* according to the problem domain.

**Clinical Significance**

Statistical significance is not the same as clinical significance. Any real difference, no matter how small, can be shown to be statistically significant as long as the sample size is large enough. If a drug really lowers SBP by 0.01 mmHg, we can always show the result is statistically significant. But even if the drop is real, it is too small to be valuable; it is not clinically significant. The p-value alone cannot show us the big picture. We need to understand the whole context of the test, including the study design, the parameters, and the test statistics (like the z-score).

# Types of error

In hypothesis testing, we can make two types of mistakes (in red below).

- Type I error is the false positive: we reject *H*₀ when *H*₀ is true.
- Type II error is the false negative: we fail to reject *H*₀ when *H*₁ is true.

In a study, we choose a value for the significance level *α*, say 0.05. In the last example, *H*₀ assumes the population mean is 140. Under this assumption, the blue region is possible but has a probability smaller than 0.05. We decide that any observed sampling mean with p-value < *α* (the blue area) likely contradicts *H*₀.

But if *H*₀ is really true, the blue area is where we make mistakes. Therefore, *α* is the type I error rate — we reject *H*₀ while *H*₀ is true (false positive). We may adjust *α *in the tradeoff between the type I and type II errors.

**Power**

Let’s say *H*₀: *μ* = *μ*₀ = 140 and *H*₁: *μ* < *μ*₀.

For demonstration purposes, we change *s* from 12 to 24, and we choose *α* such that the margin of error is 6, i.e. if the sampling SBP mean is at or below 134, we reject *H*₀. So what is the type II error rate, i.e. the chance that *H*₁ is true but we fail to reject *H*₀? *H*₁ assumes *μ* < *μ*₀ (*μ*: taking the drug). Let’s be more specific and say *μ* equals 132, which drops SBP by 8. This gives an absolute **effect size** of 8 — the magnitude of the effect of taking the drug.

The type II error is the area in red above. We call this probability *β*. It is the area where *H*₀ is not rejected even though *H*₁ is true.

**Power** equals 1-*β*. Power is the probability of correctly rejecting *H*₀.

So with a magnitude of effect equal to 8, we have a 69% chance of rejecting *H*₀ when *H*₁ (*μ* < *μ*₀) is true (134 is the threshold at which we reject *H*₀). Power is the probability that the study will find a statistically significant difference in the statistic of interest when one actually exists.
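The 69% figure can be reproduced in a few lines (a sketch): under *H*₁ with *μ* = 132 and S.E. = 24/√36 = 4, power is the probability that the sampling mean falls at or below the rejection cutoff of 134:

```python
from math import sqrt
from statistics import NormalDist

mu1, s, n = 132.0, 24.0, 36   # assumed true mean under H1, sample SD, sample size
cutoff = 134.0                # we reject H0 when the sampling mean <= 134

se = s / sqrt(n)              # standard error = 4
power = NormalDist(mu1, se).cdf(cutoff)  # P(reject H0 | H1 true)
beta = 1 - power              # type II error rate
print(round(power, 2), round(beta, 2))  # 0.69 0.31
```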

**Power analysis**

A study is usually expensive. Before a study, we estimate the minimum sample size needed to reach a target level of power — that is, given an acceptable chance (*β*) of not detecting an effect when one exists, the power analysis estimates the minimum sample size. General practice targets a power of 80% or higher. This may be adjusted depending on whether a false positive or a false negative is particularly expensive.

Power depends on the significance level *α*, the sample variance *s*², the sample size *n*, and the magnitude of the effect (*d*). *α* should not be changed without justification, and *s* is a property we cannot change. As we start a study, we estimate the effect size first. This can be done from a pilot study, from similar studies, or as some minimum difference judged to be meaningful. Next, we calculate the minimum number of subjects. In some cases, we can increase the dose of a drug for a potentially higher effect size. This leads to higher power but risks possible adverse effects.

For a two-sided two-sample t-test with 80% power and *α* = 0.05, the sample size per group can be roughly estimated as

*n* ≈ 16*s*²/*d*²

If it is a one-sample t-test, it is

*n* ≈ 8*s*²/*d*²

For higher accuracy, we can reverse the calculation in this example to find *n*.
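A commonly quoted approximation (Lehr’s rule) puts the per-group size for 80% power at *α* = 0.05 at about 16*s*²/*d*² for a two-sided two-sample t-test, and 8*s*²/*d*² for a one-sample test. A sketch with the earlier hypothetical numbers (*s* = 24, effect size *d* = 8):

```python
s, d = 24.0, 8.0   # sample SD and effect size from the earlier example

# Rough sizing for 80% power at alpha = 0.05 (Lehr's approximation)
n_two_sample = 16 * s**2 / d**2   # per group, two-sided two-sample t-test
n_one_sample = 8 * s**2 / d**2    # one-sample t-test
print(n_two_sample, n_one_sample)  # 144.0 72.0
```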

# Probability Sampling

There are four major sampling techniques. In a **simple random sample**, each member of the population of interest has an equal chance of being selected. In a **stratified random sample**, members belong to groups; for example, college students belong to different majors. Samples are selected from each group proportionally, with members of each group selected randomly. In a **cluster random sample**, data is first split into groups; some groups are selected, and we sample their members randomly. In a **systematic random sample**, data is ordered; for example, each member is assigned an ID, and the sample is selected systematically, say every member whose ID ends in 4.
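A sketch of proportional stratified sampling with the standard library (the strata and sizes here are made up):

```python
import random

random.seed(0)
# Hypothetical student population grouped by major (stratum -> member IDs)
strata = {
    "biology": [f"bio-{i}" for i in range(60)],
    "physics": [f"phy-{i}" for i in range(30)],
    "history": [f"his-{i}" for i in range(10)],
}
total = sum(len(members) for members in strata.values())  # 100 students
sample_size = 20

sample = []
for name, members in strata.items():
    # Each stratum contributes in proportion to its share of the population
    k = round(sample_size * len(members) / total)
    sample.extend(random.sample(members, k))

print(len(sample))  # 20 (12 biology, 6 physics, 2 history)
```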

# Study Design

Before a research study, we decide the study design on how to collect and analyze data. It can be observational or experimental:

- An observational study plays an observation role in measuring or surveying subjects without intervention.
- A controlled experiment introduces an intervention. For example, in a clinical study, subjects may be assigned to one group receiving treatment or to another group that does not.

A **cross-sectional study** looks at data from a population at one specific point in time, typically to determine prevalence.

In **case-control studies**, researchers compare two groups of people:

- those with the disease or condition under study (case), and
- a control group of similar people who do not have the disease or condition.

Case-control studies are retrospective: they look back at subjects’ histories to learn which factors are associated with the disease or condition.

**Longitudinal studies** follow participants over a period of time, usually for years.

**Cohort studies** are longitudinal studies that recruit subjects sharing common characteristics. For example, in a lung cancer study, one group is smokers and the other group is non-smokers. It mainly studies incidence, causes, and prognosis.

A **randomized controlled trial** is similar to a cohort study but participants are assigned to different groups randomly. For example, some may get treatment while others may get a placebo.

In a double-blind method, the subjects and the experimenters will be unaware of the specific treatment of a subject.

# Next

In this article, we covered the fundamentals of statistics in data science and machine learning. In the examples so far, we had only one group of interest and one sample; we call these one-sample tests, for example, the one-sample t-test. There are research studies that compare groups, with a sample collected from each group. For example, we have two groups: one taking a blood pressure drug and the other taking a placebo. A careful study design minimizes the differences except that one group takes the drug and the other doesn’t. Is any difference in SBP between them statistically significant?