Normal Distributions in Machine Learning
Normal distributions are important in Machine Learning (ML), yet many practitioners are unfamiliar with the properties that are central to many ML algorithms. In this article, we assume you know the basics and cover them only briefly. Then we shift gears and spend most of our effort on the areas ML needs. Some of this material may not show its full value until later articles, but it is critical for understanding Bayesian inference and Gaussian processes. For example, Bayesian inference is intractable in general: many of its integrals have no closed-form solution. Normal distributions make these problems tractable in closed form. Without understanding the properties of normal distributions, we cannot understand those algorithms.
Terms
Random variables hold values derived from the outcomes of random experiments. For example, the random variable X may hold the number of heads in 100 coin flips.
A probability distribution describes the possible values a random variable can take and their corresponding likelihoods. For example, the probabilities of having 0, 1, 2, …, 100 heads respectively.
Intractable: a problem with no efficient exact solution; from a computational perspective, it cannot be solved exactly within practical resource limits.
Central Limit Theorem (Recap)
If we sum (or average) n independent, identically distributed random variables, normalize the result, and repeat the experiment many times, the collected results tend toward a normal distribution as the sample size n gets larger (around 30–50 is often sufficient in practice).
This is true even if the random variable X itself is not normally distributed.
If the random variable X has a mean μ and a variance σ², the sampling distribution of its sample mean will approach a normal distribution of
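X̄ ∼ Ɲ( μ, σ²/n )

where X̄ denotes the mean of n samples of X.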
Normal Distribution/Gaussian Distribution
As suggested by the central limit theorem, normal distributions are common in real life.
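For a single variable with mean μ and variance σ², the normal density is

p(x) = ( 1 / √(2πσ²) ) exp( −( x − μ )² / (2σ²) )

For a d-dimensional random vector x with mean vector μ, it generalizes to

p(x) = ( 1 / √( (2π)ᵈ |Σ| ) ) exp( −½ ( x − μ )ᵀ Σ⁻¹ ( x − μ ) )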
where Σ is the covariance matrix, whose (i, j) element is the covariance between xᵢ and xⱼ, and |Σ| is the determinant of Σ.
The variance σ² and the covariance are defined as
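σ² = E[ ( x − μ )² ]

Cov( xᵢ, xⱼ ) = E[ ( xᵢ − μᵢ )( xⱼ − μⱼ ) ]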
In a normal distribution, about 68% of the data lies within one σ of μ and about 95% within two σ.
Notation: we write x ∼ Ɲ( μ, Σ ) for a variable x that is normally distributed with mean μ and covariance Σ.
When μ=0 and Σ = I, it is called the standard normal distribution.
To sample y from a multivariate normal distribution (y ∼ Ɲ( μ, Σ )), we sample x from a standard normal distribution (x ∼ Ɲ( 0, I )) and calculate y = μ + Ax, where A is any matrix satisfying AAᵀ = Σ.
The solution for A is not unique, but one choice stands out: we use the Cholesky decomposition to choose A. A is then a lower-triangular matrix (all entries above the diagonal are zero), which reduces the computation in sampling.
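As a quick illustration, here is a minimal NumPy sketch of this sampling procedure (the mean and covariance values below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Target distribution y ~ N(mu, Sigma); the numbers are illustrative only.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Cholesky decomposition: Sigma = A @ A.T with A lower triangular.
A = np.linalg.cholesky(Sigma)

# Sample x from the standard normal and transform it.
x = rng.standard_normal(2)
y = mu + A @ x          # y is one sample from N(mu, Sigma)
```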
Normal distributions are commonly used to model noise in machine learning. Noise can be viewed as the accumulation of a large number of independent random variables with small values, i.e. the factors that our models do not account for. According to the central limit theorem, the sum of these variables tends to be normally distributed.
Properties of Normal Distribution
It is “convenient” to operate on normal distributions: many intractable problems can be solved analytically, and many operations on normal distributions return another normal distribution.
Product
One convenience is that when we multiply two normal distributions, the result is just another normal distribution scaled by a factor s.
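That is, for two normal densities with means a and b and covariances A and B,

Ɲ( x; a, A ) · Ɲ( x; b, B ) = s · Ɲ( x; c, C )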
s, c, and C can be solved analytically.
Let f and g be single-variate normal densities, f(x) = Ɲ( x; μ_f, σ_f² ) and g(x) = Ɲ( x; μ_g, σ_g² ). With some tedious manipulation (multiplying the two densities and completing the square in the exponent), we can demonstrate that the product has the form of a scaled normal distribution:
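f(x) g(x) = s · Ɲ( x; c, C ), with

C = σ_f² σ_g² / ( σ_f² + σ_g² )

c = ( μ_f σ_g² + μ_g σ_f² ) / ( σ_f² + σ_g² )

s = Ɲ( μ_f; μ_g, σ_f² + σ_g² ) = ( 1 / √( 2π( σ_f² + σ_g² ) ) ) exp( −( μ_f − μ_g )² / ( 2( σ_f² + σ_g² ) ) )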
Summation
Given two independent normally distributed random variables X ∼ Ɲ( μ_x, σ_x² ) and Y ∼ Ɲ( μ_y, σ_y² ),
their sum is also normally distributed:
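X + Y ∼ Ɲ( μ_x + μ_y, σ_x² + σ_y² )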
Conditionals
In classification and regression problems, we essentially answer the question: what is the probability distribution of a new label y’, given a new example x’ and the training dataset D = (y, X)?
With D containing m data points, the training labels can be modeled as
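For instance, with a linear model (the setup behind Bayesian linear regression):

y = Xθ + ε

where X stacks the m training inputs, θ holds the model parameters, and ε is the noise.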
With x and y pre-processed to be zero-centered, we can model θ and ε (information noise) with normal distributions.
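For example, a common choice is a zero-mean normal prior on the parameters and i.i.d. normal noise:

θ ∼ Ɲ( 0, Σ_θ ),  ε ∼ Ɲ( 0, σ²I )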
This leads to the question of how to compute conditional probability with normal distributions.
Let’s consider a multivariate normal distribution over two variables x and y, (x, y)ᵀ ∼ Ɲ( μ, Σ ), where the mean vector μ holds the means of x and y and the covariance matrix Σ holds their variances and their covariance. When the covariance between x and y is zero, x and y are not correlated; when it is large relative to their variances, x and y are highly correlated.
As shown below, the conditional distribution of x given y (or vice versa) is also normally distributed.
In general, partition the variables into two blocks x₁ and x₂, with mean vectors μ₁ and μ₂ and covariance blocks Σ₁₁, Σ₁₂, Σ₂₁, and Σ₂₂. Let’s call the mean vector and the covariance matrix of the conditional p(x₁ | x₂) μ₁|₂ and Σ₁|₂ respectively. They can be solved analytically:
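μ₁|₂ = μ₁ + Σ₁₂ Σ₂₂⁻¹ ( x₂ − μ₂ )

Σ₁|₂ = Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁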
As a concrete example, suppose we are given a joint distribution over x₁ and x₂ and we observe x₂ = −2. We can then recover the distribution of the missing component from what we know: plugging x₂ = −2 into the formulas above gives the mean and the variance of the conditional distribution p(x₁ | x₂ = −2), which is again a normal distribution.
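Here is a small NumPy sketch of that calculation. The mean vector and covariance matrix below are hypothetical values chosen only to illustrate the formulas; substitute your own joint distribution:

```python
import numpy as np

# Hypothetical bivariate normal over (x1, x2); the numbers are made up.
mu = np.array([1.0, 2.0])                      # [mu_1, mu_2]
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])                 # [[S11, S12], [S21, S22]]

x2 = -2.0                                      # the observed value of x_2

# Conditional p(x_1 | x_2) of a bivariate normal:
#   mean_1|2 = mu_1 + S12 / S22 * (x2 - mu_2)
#   var_1|2  = S11 - S12 / S22 * S21
mean_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]

print(mean_cond, var_cond)                     # parameters of p(x_1 | x_2 = -2)
```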
To exercise our knowledge of expectation and variance, let’s prove these equations, even though the proof is not necessary for understanding the material. We now allow x₁ and x₂ to be vectors. Strictly, we should use bold letters for vectors and matrices in the proof, but that makes it look like a ransom note and makes my head spin. So we temporarily relax our notation even though it is unorthodox.
Let’s define z = x₁ + Ax₂, where A is
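A = −Σ₁₂ Σ₂₂⁻¹, chosen so that Cov(z, x₂) = Cov(x₁, x₂) + A Cov(x₂, x₂) = Σ₁₂ − Σ₁₂ Σ₂₂⁻¹ Σ₂₂ = 0.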
Since Cov(z, x₂) is zero, z and x₂ are uncorrelated. And because z and x₂ are jointly normal, being uncorrelated also means they are independent.
And therefore, E( z | x₂ ) = E( z ).
Applying this, we can show that the mean of x₁ given x₂ is
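E( x₁ | x₂ ) = E( z − Ax₂ | x₂ )
             = E( z | x₂ ) − Ax₂
             = E( z ) − Ax₂
             = μ₁ + Aμ₂ − Ax₂
             = μ₁ + Σ₁₂ Σ₂₂⁻¹ ( x₂ − μ₂ )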
Next, let’s compute the variance of z.
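Var( z ) = Var( x₁ + Ax₂ )
         = Var( x₁ ) + A Var( x₂ ) Aᵀ + A Cov( x₂, x₁ ) + Cov( x₁, x₂ ) Aᵀ
         = Σ₁₁ + Σ₁₂ Σ₂₂⁻¹ Σ₂₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁
         = Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁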
And finally, the variance of x₁ given x₂ is
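Var( x₁ | x₂ ) = Var( z − Ax₂ | x₂ ) = Var( z | x₂ ) = Var( z ) = Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁

(Given x₂, the term Ax₂ is a constant, and z is independent of x₂, so its conditional variance equals its unconditional variance.)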
Here is the final conditional.
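x₁ | x₂ ∼ Ɲ( μ₁ + Σ₁₂ Σ₂₂⁻¹ ( x₂ − μ₂ ), Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁ )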
Marginals
Given a joint normal distribution over two blocks of variables x₁ and x₂,
the marginal distribution of each block is also normally distributed. Integrating out x₂ simply picks out the corresponding blocks of the mean and the covariance:
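p( x₁ ) = ∫ p( x₁, x₂ ) dx₂ = Ɲ( x₁; μ₁, Σ₁₁ )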
Linear transformation
If X and Y are independent, normally distributed random variables (i.e. p(x, y) = p(x)p(y)), then any linear combination of them is also normally distributed:
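aX + bY ∼ Ɲ( aμ_x + bμ_y, a²σ_x² + b²σ_y² )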
More generally, for a normally distributed vector x ∼ Ɲ( μ, Σ ), applying a linear transformation A to x, i.e. y = Ax,
the probability distribution for y is
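y ∼ Ɲ( Aμ, AΣAᵀ )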
Once the Prior and Likelihood are Normal, the Posterior is also Normal
In Bayesian inference, the marginal probability (the integral in the denominator) is generally intractable.
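p( θ | D ) = p( D | θ ) p( θ ) / p( D ) = p( D | θ ) p( θ ) / ∫ p( D | θ ) p( θ ) dθ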
where D is the training data set
But when the likelihood p(D|θ) and the prior p(θ) are modeled with normal distributions, the posterior is also normally distributed. Given this insight, we can keep dropping any scaling factors on the R.H.S.; once we merge what remains into a single normal distribution, that is the posterior we are looking for. The proofs for the Gaussian process and for Bayesian linear regression use this strategy. Here is the skeleton for Bayesian linear regression.
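A minimal sketch, assuming the linear model y = Xθ + ε with noise ε ∼ Ɲ( 0, σ²I ) and prior θ ∼ Ɲ( 0, Σ_θ ) (the exact choice of prior is up to us):

p( θ | y, X ) ∝ p( y | X, θ ) p( θ )
             ∝ exp( −( y − Xθ )ᵀ( y − Xθ ) / (2σ²) ) · exp( −½ θᵀ Σ_θ⁻¹ θ )

Collecting the terms that are quadratic and linear in θ and completing the square, the R.H.S. merges into a single normal distribution:

θ | y, X ∼ Ɲ( σ⁻² Λ⁻¹ Xᵀ y, Λ⁻¹ ),  with Λ = σ⁻² XᵀX + Σ_θ⁻¹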
Next
In the next article, we will apply all this knowledge to Bayesian linear regression and Gaussian processes to solve ML problems.