# Normal Distributions in Machine Learning

Normal distributions are important in Machine Learning (ML). Yet many people are unfamiliar with the properties of normal distributions that are central to many ML algorithms. In this article, we assume you know the basics and cover them only briefly. Then we shift gears and spend most of our effort on the areas ML needs. Their value may not be obvious until later articles, but this material is critical for understanding Bayesian inference and Gaussian processes. For example, Bayesian inference is intractable in general because many of the integrals involved cannot be solved. Normal distributions make these problems tractable in closed form. Without understanding the properties of normal distributions, we cannot understand those algorithms.

# Terms

**Random variables** hold values derived from the outcomes of random experiments. For example, a random variable *X* can hold the number of heads in 100 coin flips.

A **probability distribution** describes the possible values a random variable can take and their corresponding probabilities. For example, it gives the probabilities of getting 0, 1, 2, …, 100 heads respectively.

**Intractable**: a problem with no efficient exact solution. From a computational perspective, it cannot be solved exactly in a practical amount of time.

# Central Limit Theorem (Recap)

If we average *n* independent, identically distributed random variables and repeat the experiment many times, the collected averages tend toward a normal distribution as the sample size *n* gets larger (around 30–50 is often sufficient in practice).

This is true even if the random variable *X* (the left diagram) is not normally distributed.

If the random variable *X* on the left above has a mean *μ* and a variance *σ²*, the sampling distribution of the sample mean will be approximately normal:

$$\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$
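The claim can be checked with a small simulation. This is only a sketch: the exponential distribution, the sample size of 40, and the number of trials are arbitrary choices made for illustration, not values from the article.

```python
import random
import statistics

random.seed(0)

n = 40            # sample size per experiment (illustrative choice)
trials = 20000    # number of repeated experiments (illustrative choice)

# Exponential(rate=1) is strongly skewed, with mean 1 and variance 1.
mu, sigma2 = 1.0, 1.0

# Each experiment: draw n values and record their average.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.4f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {sigma2 / n})")
```

Even though each draw is far from normal, the averages should cluster around *μ* with a variance close to *σ²*/*n*.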

# Normal Distribution/Gaussian Distribution

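As suggested by the central limit theorem, normal distributions are common in real life. For a single variable *x* with mean *μ* and variance *σ²*, the density is

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

For a *k*-dimensional vector **x** with mean vector **μ**, the multivariate density is

$$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

where Σ is the covariance matrix whose element (*i*, *j*) is the covariance between *xᵢ* and *xⱼ*, and |Σ| is the determinant of Σ.

The variance *σ²* and the covariance are defined as

$$\sigma^2 = E\left[(x-\mu)^2\right], \qquad \mathrm{Cov}(x_i, x_j) = E\left[(x_i-\mu_i)(x_j-\mu_j)\right]$$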

In a normal distribution, about 68% of the data lies within one *σ* of *μ* and about 95% lies within two *σ*.
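These percentages follow from the normal cumulative distribution function. As a quick check, the probability of landing within *k* standard deviations of the mean equals erf(*k*/√2); the helper name `prob_within` below is made up for this sketch.

```python
import math

def prob_within(k: float) -> float:
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```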

**Notation**:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

When **μ** = **0** and Σ = *I*, it is called the standard normal distribution.

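To sample **y** from a multivariate normal distribution (**y** ∼ *Ɲ*(**μ**, **Σ**)), we sample **x** from a standard normal distribution (**x** ∼ *Ɲ*(**0**, *I*)) and calculate **y** as:

$$\mathbf{y} = \boldsymbol{\mu} + A\mathbf{x}, \qquad \text{where } AA^\top = \Sigma$$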

The solution for *A* is not unique, but one choice stands out. We use the Cholesky decomposition, which factors Σ = *AAᵀ* with *A* a lower triangular matrix, so every entry above the diagonal is zero. This reduces computation in sampling.
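The sampling recipe can be sketched in plain Python. The mean and covariance below are hypothetical values, and the small `cholesky` helper is written by hand only to keep the example dependency-free; in practice a library routine such as `numpy.linalg.cholesky` would normally be used.

```python
import math
import random

def cholesky(sigma):
    """Return lower-triangular L with L L^T == sigma (sigma symmetric positive definite)."""
    d = len(sigma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def sample_mvn(mu, L, rng):
    """Draw one sample y = mu + L x with x ~ N(0, I)."""
    d = len(mu)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # L is lower triangular, so row i only needs x[0..i].
    return [mu[i] + sum(L[i][k] * x[k] for k in range(i + 1)) for i in range(d)]

rng = random.Random(0)
mu = [1.0, -2.0]                     # hypothetical mean
sigma = [[4.0, 1.2], [1.2, 1.0]]     # hypothetical covariance
L = cholesky(sigma)

samples = [sample_mvn(mu, L, rng) for _ in range(50000)]
n = len(samples)
mean = [sum(s[i] for s in samples) / n for i in range(2)]
cov01 = sum((s[0] - mean[0]) * (s[1] - mean[1]) for s in samples) / n

print("L =", L)
print("empirical mean:", mean)
print("empirical Cov(y0, y1):", cov01)
```

The empirical mean and covariance of the samples should land close to the chosen **μ** and Σ, and the triangular shape of `L` means each coordinate needs only a partial dot product.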

Normal distributions are common in modeling information noise in machine learning. Noise can be viewed as the accumulation of a large number of independent random variables with small values. Those are factors that our models do not account for. According to the central limit theorem, the sum of these variables tends to be normally distributed.

# Properties of Normal Distribution

It is “convenient” to operate on normal distributions. Many intractable problems can now be solved analytically. And many operations on normal distributions return a normal distribution.

## Product

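One convenience is that when we multiply two normal density functions, the result is another normal density scaled by a factor *s*. Writing the two factors with means **a**, **b** and covariances *A*, *B*:

$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\,\mathcal{N}(\mathbf{x}; \mathbf{b}, B) = s\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C)$$

*s*, **c**, and *C* can be solved analytically:

$$C = (A^{-1} + B^{-1})^{-1}, \qquad \mathbf{c} = C\,(A^{-1}\mathbf{a} + B^{-1}\mathbf{b}), \qquad s = \mathcal{N}(\mathbf{a}; \mathbf{b}, A + B)$$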

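Let *f* and *g* be univariate normal densities.

$$f(x) = \frac{1}{\sqrt{2\pi\sigma_f^2}} \exp\left(-\frac{(x-\mu_f)^2}{2\sigma_f^2}\right), \qquad g(x) = \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp\left(-\frac{(x-\mu_g)^2}{2\sigma_g^2}\right)$$

Their product is

$$f(x)\,g(x) = \frac{1}{2\pi\sigma_f\sigma_g} \exp\left(-\frac{(x-\mu_f)^2}{2\sigma_f^2} - \frac{(x-\mu_g)^2}{2\sigma_g^2}\right)$$

With some tedious manipulation (completing the square in *x*), we can demonstrate that the product has the form

$$f(x)\,g(x) = s \cdot \frac{1}{\sqrt{2\pi\sigma_{fg}^2}} \exp\left(-\frac{(x-\mu_{fg})^2}{2\sigma_{fg}^2}\right)$$

with

$$\sigma_{fg}^2 = \frac{\sigma_f^2\sigma_g^2}{\sigma_f^2+\sigma_g^2}, \qquad \mu_{fg} = \frac{\mu_f\sigma_g^2 + \mu_g\sigma_f^2}{\sigma_f^2+\sigma_g^2}, \qquad s = \frac{1}{\sqrt{2\pi(\sigma_f^2+\sigma_g^2)}} \exp\left(-\frac{(\mu_f-\mu_g)^2}{2(\sigma_f^2+\sigma_g^2)}\right)$$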
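As a numerical sanity check of the scaled-Gaussian form, the sketch below multiplies two univariate normal densities pointwise and compares the result with *s* times a single normal density. The parameter values and the function names `normal_pdf` and `gaussian_product` are made up for this illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_product(m1, v1, m2, v2):
    """Return (s, m, v) such that N(x; m1, v1) * N(x; m2, v2) == s * N(x; m, v)."""
    v = 1.0 / (1.0 / v1 + 1.0 / v2)      # precisions add
    m = v * (m1 / v1 + m2 / v2)          # precision-weighted mean
    s = normal_pdf(m1, m2, v1 + v2)      # scaling factor
    return s, m, v

m1, v1 = 0.0, 1.0    # hypothetical f
m2, v2 = 2.0, 0.5    # hypothetical g
s, m, v = gaussian_product(m1, v1, m2, v2)

for x in (-1.0, 0.5, 1.3, 3.0):
    lhs = normal_pdf(x, m1, v1) * normal_pdf(x, m2, v2)
    rhs = s * normal_pdf(x, m, v)
    print(f"x={x:+.1f}  f*g={lhs:.6e}  s*N={rhs:.6e}")
```

The two columns should agree at every *x*, which is what "normal up to a scale factor" means in practice.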

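## Summation

Given two independent normally distributed random variables

$$X \sim \mathcal{N}(\boldsymbol{\mu}_X, \Sigma_X), \qquad Y \sim \mathcal{N}(\boldsymbol{\mu}_Y, \Sigma_Y)$$

their sum is also normally distributed.

$$X + Y \sim \mathcal{N}(\boldsymbol{\mu}_X + \boldsymbol{\mu}_Y,\ \Sigma_X + \Sigma_Y)$$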
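A short simulation illustrates the rule for the sum of two independent normal random variables in the scalar case: the means add and the variances add. The parameter values are hypothetical.

```python
import random
import statistics

rng = random.Random(1)
mu_x, var_x = 1.0, 4.0      # hypothetical X ~ N(1, 4)
mu_y, var_y = -3.0, 2.25    # hypothetical Y ~ N(-3, 2.25)

n = 100000
# Draw X and Y independently and add them.
z = [rng.gauss(mu_x, var_x ** 0.5) + rng.gauss(mu_y, var_y ** 0.5) for _ in range(n)]

z_mean = statistics.fmean(z)
z_var = statistics.pvariance(z)
print(f"mean:     {z_mean:.3f} (theory: {mu_x + mu_y})")
print(f"variance: {z_var:.3f} (theory: {var_x + var_y})")
```

Note that this concerns the sum of the random variables, not the sum of their density functions; the latter would be a mixture and is generally not normal.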

## Conditionals

In classification and regression problems, we essentially ask: what is the probability distribution of a new label *y*′, given a new example *x*′ and the training dataset *D* = (**y**, *X*)?

With *D* containing *m* data points, the training labels can be modeled as

$$\mathbf{y} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon}$$

With *x* and *y* pre-processed to be zero-centered, we can model *θ* and *ε* (information noise) with normal distributions.

This leads to the question of how to compute conditional probability with normal distributions.

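Let’s consider a multivariate normal distribution

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

where

$$\mathbf{x} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$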

For the distribution on the left, *x* and *y* are not correlated. In contrast, *x* and *y* are highly correlated on the right.

As shown below, the conditional distribution of *X* given *Y* (or vice versa) is normally distributed.

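Let’s say the mean vector and the covariance matrix of the conditional are

$$p(\mathbf{x}_1 \mid \mathbf{x}_2) = \mathcal{N}(\boldsymbol{\mu}_{1|2}, \Sigma_{1|2})$$

They can be solved analytically. With **μ**₁, **μ**₂ the means of **x**₁ and **x**₂, and Σᵢⱼ the covariance between **x**ᵢ and **x**ⱼ:

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$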
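For the bivariate case, where every block of the covariance matrix is a scalar, the conditional mean and variance reduce to a few arithmetic operations. The sketch below uses a hypothetical joint distribution, not one taken from the article, and the function name `conditional_normal_2d` is invented for the example.

```python
def conditional_normal_2d(mu, sigma, x2):
    """Mean and variance of p(x1 | x2) for a bivariate normal.

    mu = (mu1, mu2), sigma = ((s11, s12), (s21, s22)).
    """
    mu1, mu2 = mu
    (s11, s12), (s21, s22) = sigma
    cond_mean = mu1 + s12 / s22 * (x2 - mu2)   # shift by how surprising x2 is
    cond_var = s11 - s12 * s21 / s22           # never larger than s11
    return cond_mean, cond_var

# Hypothetical joint distribution, chosen only for illustration.
mu = (0.0, 1.0)
sigma = ((2.0, 0.8), (0.8, 1.0))

cond_mean, cond_var = conditional_normal_2d(mu, sigma, x2=-2.0)
print(f"E[x1 | x2=-2]   = {cond_mean:.3f}")
print(f"Var[x1 | x2=-2] = {cond_var:.3f}")
```

Observing *x*₂ shifts the mean of *x*₁ in proportion to their covariance and shrinks its variance; the variance does not depend on the observed value itself.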

Given a joint distribution

we can recover the distribution of the missing data from what we know (*x*₂ = −2)

The mean and the variance for the conditional distribution *p*(*x*₁|*x*₂=-2) are

The probability distribution for *x*₁ given *x*₂=-2 is

To exercise our knowledge of expectation and variance, let’s prove the equation, even though the proof is not necessary for understanding the material. Let’s extend the concept to allow *x*₁ and *x*₂ to be vectors. In the proof, we should use bold letters for vectors and matrices. But that makes the proof look like a ransom note and makes my head spin. So we temporarily relax our notation, even though it is unorthodox.

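Throughout, μ₁ and μ₂ are the means of x₁ and x₂, and Σᵢⱼ = Cov(xᵢ, xⱼ) are the blocks of the joint covariance matrix.

Let’s define z = x₁ + Ax₂, where *A* is

$$A = -\Sigma_{12}\Sigma_{22}^{-1}$$

Then

$$\mathrm{Cov}(z, x_2) = \mathrm{Cov}(x_1, x_2) + A\,\mathrm{Cov}(x_2, x_2) = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0$$

Since Cov(z, x₂) is zero, z and x₂ are uncorrelated. Because they are jointly normal, they are also independent.

And therefore,

$$E[z \mid x_2] = E[z] = \mu_1 + A\mu_2$$

Applying this, we prove the mean for x₁ given x₂ is

$$E[x_1 \mid x_2] = E[z - Ax_2 \mid x_2] = E[z \mid x_2] - Ax_2 = \mu_1 + A(\mu_2 - x_2) = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$$

Let’s compute

$$\mathrm{Var}(z) = \mathrm{Var}(x_1 + Ax_2) = \Sigma_{11} + A\Sigma_{21} + \Sigma_{12}A^\top + A\Sigma_{22}A^\top = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

And finally, since Ax₂ is constant once x₂ is given and z is independent of x₂, the variance of x₁ given x₂ is

$$\mathrm{Var}(x_1 \mid x_2) = \mathrm{Var}(z - Ax_2 \mid x_2) = \mathrm{Var}(z \mid x_2) = \mathrm{Var}(z) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

Here is the final conditional.

$$p(x_1 \mid x_2) = \mathcal{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$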

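## Marginals

Given a joint distribution of

$$\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right)$$

the marginal is normally distributed.

$$p(\mathbf{x}_1) = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11}), \qquad p(\mathbf{x}_2) = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_{22})$$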
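In code, marginalizing a multivariate normal amounts to selecting the matching entries of the mean vector and the matching rows and columns of the covariance matrix; no integration is needed. The helper name `marginal` and the three-dimensional example are hypothetical.

```python
def marginal(mu, sigma, keep):
    """Marginal of N(mu, sigma) over the coordinates listed in `keep`."""
    mu_m = [mu[i] for i in keep]
    sigma_m = [[sigma[i][j] for j in keep] for i in keep]
    return mu_m, sigma_m

# Hypothetical three-dimensional joint distribution.
mu = [0.0, 1.0, -1.0]
sigma = [
    [2.0, 0.3, 0.1],
    [0.3, 1.0, 0.4],
    [0.1, 0.4, 1.5],
]

# Keep coordinates 0 and 2, marginalizing out coordinate 1.
mu_m, sigma_m = marginal(mu, sigma, keep=[0, 2])
print("marginal mean:", mu_m)
print("marginal covariance:", sigma_m)
```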

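## Linear transformation

If *X* and *Y* are independent normally distributed random variables (i.e. *p*(*x*, *y*) = *p*(*x*)*p*(*y*)), then any linear combination of them is also normally distributed:

$$aX + bY \sim \mathcal{N}(a\mu_X + b\mu_Y,\ a^2\sigma_X^2 + b^2\sigma_Y^2)$$

Applying a linear transformation *A* on **x**,

$$\mathbf{y} = A\mathbf{x}, \qquad \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

the probability distribution for **y** is

$$\mathbf{y} \sim \mathcal{N}(A\boldsymbol{\mu},\ A\Sigma A^\top)$$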
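The standard rule for a linear map of a normal vector is that **y** = *A***x** has mean *A***μ** and covariance *A*Σ*A*ᵀ. The sketch below computes both on a small hypothetical example with plain Python lists; the helper names are made up for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def transform_normal(A, mu, sigma):
    """Parameters of y = A x when x ~ N(mu, sigma): mean A mu, covariance A sigma A^T."""
    mean_y = [sum(A[i][k] * mu[k] for k in range(len(mu))) for i in range(len(A))]
    cov_y = matmul(matmul(A, sigma), transpose(A))
    return mean_y, cov_y

A = [[1.0, 1.0], [1.0, -1.0]]        # hypothetical map: (x1 + x2, x1 - x2)
mu = [1.0, 2.0]                      # hypothetical mean of x
sigma = [[2.0, 0.5], [0.5, 1.0]]     # hypothetical covariance of x

mean_y, cov_y = transform_normal(A, mu, sigma)
print("mean of y:", mean_y)
print("covariance of y:", cov_y)
```

The first diagonal entry of the result is Var(*x*₁ + *x*₂) = Var(*x*₁) + Var(*x*₂) + 2 Cov(*x*₁, *x*₂), which gives a quick way to check the output by hand.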

## Once the Prior and Likelihood are Normal, the Posterior is also Normal

In Bayesian inference, the marginal probability (the integral in the denominator) is generally intractable.

$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D \mid \theta)\,p(\theta)\,d\theta}$$

where *D* is the training data set.

But when the likelihood *p*(*D|θ*) and the prior *p*(*θ*) are both modeled with normal distributions, the posterior is also normally distributed. Given this insight, we can keep dropping any scaling factor on the right-hand side that does not depend on *θ*. Once we merge what remains into a single normal form, that is the posterior we are looking for, because a density must integrate to one and the normalizing constant is therefore fixed. Proofs for Gaussian processes and Bayesian linear regression use this strategy. Here is the skeleton for Bayesian linear regression.
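To make the strategy concrete, here is a toy version with a single scalar weight. It is not the article's skeleton. It assumes the model *y* = *θx* + *ε* with noise *ε* ∼ *Ɲ*(0, `noise_var`) and prior *θ* ∼ *Ɲ*(0, `prior_var`); those names, the model, and the data are all assumptions made for this sketch. Collecting the terms in *θ* from the product of likelihood and prior gives a normal posterior whose precision is the prior precision plus Σ*xᵢ²* / `noise_var`.

```python
import random

def posterior_theta(xs, ys, noise_var, prior_var):
    """Posterior N(mean, var) of theta for y = theta * x + eps,
    with eps ~ N(0, noise_var) and prior theta ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    var = 1.0 / precision
    mean = var * sum(x * y for x, y in zip(xs, ys)) / noise_var
    return mean, var

rng = random.Random(0)
true_theta, noise_var, prior_var = 1.5, 0.25, 10.0   # hypothetical values

# Synthetic training data generated from the assumed model.
xs = [rng.uniform(-2.0, 2.0) for _ in range(200)]
ys = [true_theta * x + rng.gauss(0.0, noise_var ** 0.5) for x in xs]

mean, var = posterior_theta(xs, ys, noise_var, prior_var)
print(f"posterior mean: {mean:.3f}, posterior variance: {var:.5f}")
```

No integral is evaluated anywhere: the denominator of Bayes' rule is implied by recognizing the normal form, which is the point of the strategy described above.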

# Next

In the next article, we will apply all this knowledge to Bayesian linear regression and Gaussian processes to solve ML problems.