
Probability Distributions in Machine Learning & Deep Learning

11 min read · Mar 15, 2022

In Bayesian inference, probability distributions are heavily used to make intractable problems solvable. After discussing the normal distribution, we will cover other basic distributions as well as more advanced ones, including the beta, Dirichlet, Poisson, and gamma distributions. We will also discuss topics including the conjugate prior, the exponential family of distributions, and the method of moments.

Bernoulli Distribution


The Bernoulli distribution is a discrete distribution for a single binary random variable x ∈ {0, 1} taking the values 0 and 1 with probabilities 1 − θ and θ respectively. For example, when flipping a coin, the chance of heads is θ.

p(x | θ) = θ^x (1 − θ)^(1−x) for x ∈ {0, 1}

The expected value and the variance for the Bernoulli distribution are:

𝔼[x] = θ and Var[x] = θ(1 − θ)

Binomial Distribution

The binomial distribution is the aggregated result of N independent Bernoulli trials. For example, we flip a coin N times and model the chance of getting x heads.

p(x | N, θ) = C(N, x) θ^x (1 − θ)^(N−x), where C(N, x) = N! / (x!(N − x)!) is the binomial coefficient.

The expected value and the variance for the binomial distribution are:

𝔼[x] = Nθ and Var[x] = Nθ(1 − θ)
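As a quick sanity check, here is a minimal Python sketch (with assumed values N = 10 and θ = 0.3) that verifies these formulas with scipy:

```python
# Minimal sketch: check the binomial mean/variance formulas (assumed N = 10, theta = 0.3).
from scipy.stats import binom

N, theta = 10, 0.3
dist = binom(N, theta)

print(dist.mean(), N * theta)               # both 3.0
print(dist.var(), N * theta * (1 - theta))  # both 2.1
```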

Categorical distribution

A Bernoulli distribution has two possible outcomes. In a categorical distribution, we have K possible outcomes with probabilities p₁, p₂, p₃, …, p_K respectively. All these probabilities add up to one.

p(x = i) = pᵢ for i = 1, …, K, with p₁ + p₂ + … + p_K = 1.

Multinomial distribution

The multinomial distribution is a generalization of the binomial distribution. Instead of two outcomes, it has K possible outcomes. Just as the binomial distribution aggregates Bernoulli trials, the multinomial distribution aggregates categorical trials.


Suppose these outcomes are associated with probabilities θ₁, θ₂, …, θ_K respectively. We collect a sample of size N and xᵢ represents the count for outcome i. The joint probability is

p(x₁, …, x_K | N, θ) = (N! / (x₁! ⋯ x_K!)) θ₁^x₁ ⋯ θ_K^x_K, where x₁ + … + x_K = N.

The expected value and the variance for the multinomial distribution are:

𝔼[xᵢ] = Nθᵢ and Var[xᵢ] = Nθᵢ(1 − θᵢ)
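A small numpy sketch, assuming K = 3 outcomes with probabilities θ = (0.2, 0.3, 0.5) and N = 1000 draws, shows the observed counts hovering around 𝔼[xᵢ] = Nθᵢ:

```python
# Sketch: multinomial counts vs. their expected values N * theta_i (assumed theta and N).
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.3, 0.5])   # assumed outcome probabilities
N = 1000

counts = rng.multinomial(N, theta)  # one sample of N categorical draws
print(counts)                       # roughly [200, 300, 500]
print(N * theta)                    # expected counts
```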

Beta distribution

For a Bernoulli distribution or a binomial distribution, how can we model the value of θ? For example, if a new virus is discovered, can we use a probability distribution to model the infection probability θ?

The beta distribution is a distribution over a continuous random variable on a finite interval of values. It is often used to model the probability θ of some binary event. The model has two positive parameters α and β that control the shape of the distribution.


When we have no knowledge about the new virus, we can set α = β = 1 for a uniform distribution, i.e. any possible probability values for θ ∈ [0, 1] are equally likely. This is our prior.

Beta(α = 1, β = 1) is the uniform distribution.

Then we can apply Bayesian inference with the likelihood modeled by a binomial distribution. The posterior will also be a beta distribution, with updated values of α and β. This becomes the new infection rate distribution given the observed data, and it acts as the new prior when the next sample is observed.


Mathematically, the beta distribution is defined as:

p(θ | α, β) = θ^(α−1) (1 − θ)^(β−1) / B(α, β) for θ ∈ [0, 1]

The beta function B normalizes the R.H.S. so that it integrates to one.

The definition seems complicated, but when it is used in Bayesian inference, the calculation becomes very simple. Let’s say the CDC reports x new infections out of N people. Applying Bayes’ Theorem, the posterior will be:

p(θ | x) ∝ θ^x (1 − θ)^(N−x) · θ^(α−1) (1 − θ)^(β−1) ∝ Beta(α + x, β + N − x)

i.e. we simply add the new positives to α and the new negatives (N-x) to β.

The expected value and variance for the beta distribution are

𝔼[θ] = α / (α + β) and Var[θ] = αβ / ((α + β)²(α + β + 1))
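Here is a minimal sketch of the conjugate update above, assuming a uniform Beta(1, 1) prior and a hypothetical report of 7 positives out of 100 people:

```python
# Sketch: beta-binomial conjugate update (hypothetical data: 7 positives out of 100).
from scipy.stats import beta

a, b = 1.0, 1.0              # uniform prior over the infection rate theta
x, N = 7, 100                # hypothetical observations

a_post = a + x               # add positives to alpha
b_post = b + (N - x)         # add negatives to beta

posterior = beta(a_post, b_post)
print(posterior.mean())      # alpha / (alpha + beta) = 8/102 ≈ 0.078
print(posterior.var())
```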

Dirichlet distribution

In the previous Bayesian inference example, the likelihood is modeled by the binomial distribution. We pair it with the beta distribution (prior) to calculate the posterior easily. For a likelihood modeled by the multinomial distribution, the corresponding prior is the Dirichlet distribution.


The Dirichlet distribution is defined as:

p(θ | α) = (1 / B(α)) ∏ᵢ θᵢ^(αᵢ − 1), where B(α) = ∏ᵢ Γ(αᵢ) / Γ(Σᵢ αᵢ) is the multivariate beta function.

This random process has K outcomes, and the corresponding Dirichlet distribution is parameterized by a K-component vector α.

As with the beta distribution, the match between its functional form and the corresponding likelihood makes the posterior computation easy.

p(θ | x) = Dir(θ | α₁ + x₁, …, α_K + x_K)

The expected value and the variance for the Dirichlet distribution are:

𝔼[θᵢ] = αᵢ / α₀ and Var[θᵢ] = αᵢ(α₀ − αᵢ) / (α₀²(α₀ + 1)), where α₀ = Σᵢ αᵢ.
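The multinomial-Dirichlet update follows the same pattern as the beta-binomial case. A short sketch, assuming a symmetric Dirichlet(1, 1, 1) prior and hypothetical counts (12, 30, 58):

```python
# Sketch: Dirichlet-multinomial conjugate update (hypothetical prior and counts).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # symmetric prior over K = 3 outcome probabilities
counts = np.array([12, 30, 58])        # hypothetical observed counts x_i

alpha_post = alpha + counts            # posterior is Dirichlet(alpha + x)
print(alpha_post / alpha_post.sum())   # posterior mean E[theta_i] = alpha_i / sum(alpha)
```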

Poisson Distribution

The Poisson distribution models the probability of a given number of events occurring in a fixed interval of time. It arises from a Poisson process, in which events occur independently and continuously at a constant average rate λ.

p(x | λ) = λ^x e^(−λ) / x! for x = 0, 1, 2, …

A binomial distribution can be approximated by the Poisson distribution when the number of trials N is large and the event is relatively rare, with λ = Nθ.

Binomial(N, θ) → Poisson(λ = Nθ) as N → ∞ with Nθ held fixed.

A Poisson process is assumed to be memoryless — the past does not influence any future predictions. The average wait time for the next event is the same regardless of whether the last event happened 1 minute or 5 hours ago.

The expected value and the variance for the Poisson distribution are:

𝔼[x] = λ and Var[x] = λ
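A small sketch, assuming N = 10,000 trials with a rare event probability θ = 0.0003 (so λ = Nθ = 3), illustrates how close the binomial and Poisson probabilities are:

```python
# Sketch: Poisson approximation of a binomial with a rare event (assumed N and theta).
from scipy.stats import binom, poisson

N, theta = 10_000, 0.0003
lam = N * theta               # lambda = 3

for x in range(6):
    print(x, binom.pmf(x, N, theta), poisson.pmf(x, lam))
```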

Exponential distribution

The exponential distribution is the probability distribution of the waiting time before the next event occurs in a Poisson process. For example, with rate parameter λ = 0.1, the chance of waiting more than 15 time units is e^(−1.5) ≈ 0.22.


Mathematically, it is defined as:

p(x | λ) = λ e^(−λx) for x ≥ 0

The expected value and the variance for the exponential distribution are:

𝔼[x] = 1/λ and Var[x] = 1/λ²
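To reproduce the waiting-time example above (λ = 0.1, waiting more than 15 time units), here is a minimal scipy sketch; note that scipy parameterizes the exponential distribution by its scale 1/λ:

```python
# Sketch: exponential waiting time for a Poisson process with rate lambda = 0.1.
from scipy.stats import expon

lam = 0.1
dist = expon(scale=1 / lam)      # scipy uses scale = 1/lambda

print(dist.sf(15))               # P(X > 15) = exp(-1.5) ≈ 0.223
print(dist.mean(), dist.var())   # 1/lambda = 10, 1/lambda^2 = 100
```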

Dirac distribution

The Dirac delta distribution (δ distribution) can be considered as a function that has a narrow peak at x = 0. Specifically, δ(x) has the value zero everywhere except at x = 0, and the area (integral) under the peak is 1.


This function is a helpful approximation of a tall, narrow spike (an impulse) or of a deterministic value within a probability distribution. For example, an empirical distribution over observed data points can be written as a sum of Dirac deltas centered on those points.

Recap

Here is a recap of some of the probability distributions discussed: Bernoulli and binomial model a binary outcome and its counts, categorical and multinomial generalize them to K outcomes, beta and Dirichlet serve as their conjugate priors, Poisson models event counts in a fixed interval, and the exponential distribution models the waiting time between events.


Gamma distribution

The exponential distribution and the chi-squared distribution are special cases of the gamma distribution. For an integer shape parameter k, the gamma distribution can be viewed as the distribution of the sum of k independent exponentially distributed random variables.


Intuitively, it is the distribution of the waiting time until the kth event occurs.


Here is the mathematical definition for the gamma distribution.

p(x | α, β) = β^α x^(α−1) e^(−βx) / Γ(α) for x > 0

Depending on the context, the gamma distribution can be parameterized in two different ways.

In the shape-rate parameterization, Gamma(α, β) has density proportional to x^(α−1) e^(−βx); in the shape-scale parameterization, Gamma(k, θ) has density proportional to x^(k−1) e^(−x/θ), with θ = 1/β.

α (a.k.a. k) parameterizes the shape of the gamma distribution and β parameterizes the rate (its inverse is the scale). As suggested by the Central Limit Theorem, as k increases, the gamma distribution resembles the normal distribution.


As we change β, the shape remains the same, but the scale of the x- and y-axes changes.


The expectation and the variance of the Gamma distribution are:

𝔼[x] = α/β and Var[x] = α/β² (in the shape-rate parameterization)
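The "sum of exponential waits" view can be checked numerically. A sketch, assuming shape k = 5 and rate β = 2:

```python
# Sketch: Gamma(k, rate=beta) as the sum of k independent Exponential(beta) waiting times.
import numpy as np

rng = np.random.default_rng(0)
k, rate = 5, 2.0                             # assumed shape and rate

waits = rng.exponential(scale=1 / rate, size=(100_000, k))
gamma_samples = waits.sum(axis=1)            # sum of k exponential waiting times

print(gamma_samples.mean(), k / rate)        # both ≈ 2.5
print(gamma_samples.var(), k / rate**2)      # both ≈ 1.25
```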

Conjugate prior

As discussed before, if we pair the distributions for the likelihood and the prior smartly, we can make Bayesian inference tractable.

In Bayesian inference, a prior is a conjugate prior if the corresponding posterior belongs to the same family of distributions as the prior.

p(θ | x) ∝ p(x | θ) p(θ), i.e. posterior ∝ likelihood × prior.

For example, the beta distribution is a conjugate prior to the binomial distribution (likelihood). The posterior calculated with Bayes’ Theorem is also a beta distribution. Here are more examples of conjugate priors.

The beta distribution is conjugate to the Bernoulli, binomial, and geometric likelihoods; the Dirichlet distribution is conjugate to the categorical and multinomial likelihoods; the gamma distribution is conjugate to the Poisson and exponential likelihoods; and the normal distribution is conjugate to the mean of a normal likelihood with known variance.

Sufficient Statistics

By definition, when a distribution is written in the form of

p(x | θ) = h(x) g(T(x), θ)

T(x) is called a sufficient statistic for θ (this is the Fisher-Neyman factorization).

Here is an example applied to the Poisson distribution.

p(x₁, …, x_n | λ) = (∏ⱼ 1/xⱼ!) · exp(log λ · Σⱼ xⱼ − nλ), so h(x) = ∏ⱼ 1/xⱼ! and T(x) = Σⱼ xⱼ.

T(x) sums over xⱼ.

The significance of a sufficient statistic is that no other statistic calculated from x₁, x₂, x₃, … provides any additional information for estimating the distribution parameter θ. If we know T(x), we have sufficient information to estimate θ; no other information is needed. We don’t need to keep x₁, x₂, x₃, … around to build the model. For example, given a Poisson distribution modeled by θ (a.k.a. λ), we can estimate θ by dividing T(x) by n.

θ̂ = T(x) / n = (1/n) Σⱼ xⱼ
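A short sketch, assuming a true rate λ = 4 and 10,000 simulated observations, shows that the sum T(x) alone is enough to recover λ:

```python
# Sketch: T(x) = sum(x_j) is sufficient to estimate the Poisson rate lambda.
import numpy as np

rng = np.random.default_rng(0)
true_lam = 4.0                    # assumed true rate
x = rng.poisson(true_lam, size=10_000)

T = x.sum()                       # sufficient statistic
lam_hat = T / len(x)              # the individual x_j are no longer needed
print(lam_hat)                    # ≈ 4.0
```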

Exponential family of distributions

The normal, Bernoulli, gamma, beta, Dirichlet, exponential, Poisson, and many other distributions belong to a family of distributions called the exponential family. It has the form

p(x | η) = h(x) exp(ηᵀ T(x) − A(η))

Here are the exponential family components h(x), η, T(x), and A for the binomial and Poisson distributions.

Binomial (N fixed): h(x) = C(N, x), η = log(θ / (1 − θ)), T(x) = x, A(η) = N log(1 + e^η).
Poisson: h(x) = 1 / x!, η = log λ, T(x) = x, A(η) = e^η = λ.

We can convert between the parameter θ and the natural parameter η. For example, the Bernoulli parameter θ can be recovered from the corresponding natural parameter η using the logistic function.

θ = 1 / (1 + e^(−η)), with η = log(θ / (1 − θ)).
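In code, this conversion is just the logistic (expit) and logit functions. A minimal sketch, assuming θ = 0.8:

```python
# Sketch: converting between the Bernoulli parameter theta and the natural parameter eta.
from scipy.special import expit, logit

theta = 0.8
eta = logit(theta)     # eta = log(theta / (1 - theta)) ≈ 1.386
print(eta)
print(expit(eta))      # logistic function recovers theta = 0.8
```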

Here is another example: writing the normal distribution in the form of an exponential family.

For the normal distribution N(μ, σ²): h(x) = 1/√(2π), T(x) = (x, x²), η = (μ/σ², −1/(2σ²)), and A(η) = μ²/(2σ²) + log σ.

What is the advantage of this abstract generalization?

The exponential family provides a general mathematical framework for solving problems for its family of distributions. For example, computing the expected value of the Poisson distribution directly requires summing an infinite series.

𝔼[x] = Σ_{x=0}^∞ x · λ^x e^(−λ) / x!

Instead, for any member of the exponential family, the expected value can be calculated fairly easily from A: A′(η) equals the expected value of T(x). In the Poisson distribution, T(x) = x, λ = exp(η), and A(η) = exp(η), so differentiating A(η) gives 𝔼[x] = exp(η) = λ.

𝔼[T(x)] = A′(η) and Var[T(x)] = A″(η)
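A tiny symbolic sketch with sympy (using the Poisson log-partition function A(η) = exp(η)) makes the differentiation explicit:

```python
# Sketch: mean and variance of the Poisson distribution from derivatives of A(eta) = exp(eta).
import sympy as sp

eta = sp.symbols("eta")
A = sp.exp(eta)              # log-partition function of the Poisson distribution

mean = sp.diff(A, eta)       # A'(eta)  = exp(eta) = lambda
var = sp.diff(A, eta, 2)     # A''(eta) = exp(eta) = lambda
print(mean, var)
```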

This family of distributions also has nice properties in Bayesian analysis. If the likelihood belongs to an exponential family, there exists a conjugate prior, which is often an exponential family itself. If we have an exponential family likelihood written as

p(x | η) = h(x) exp(ηᵀ T(x) − A(η)),

the conjugate prior parameterized by γ will have the form

p(η | γ) ∝ exp(γ₁ᵀ η − γ₂ A(η)), where γ = (γ₁, γ₂).

The conjugate prior, modeled by γ, will have one additional degree of freedom. For example, the Bernoulli distribution has one degree of freedom modeled by θ. The corresponding beta distribution will have two degrees of freedom modeled by α and β.

Consider the Bernoulli distribution below in the form of the exponential family,

p(x | θ) = exp(x log(θ / (1 − θ)) + log(1 − θ)), i.e. η = log(θ / (1 − θ)), T(x) = x, and A(η) = −log(1 − θ) = log(1 + e^η).

We can define (or guess)

p(θ | γ₁, γ₂) ∝ exp(γ₁ η − γ₂ A(η)) = exp(γ₁ log(θ / (1 − θ)) + γ₂ log(1 − θ))

We get

p(θ | γ₁, γ₂) ∝ θ^γ₁ (1 − θ)^(γ₂ − γ₁), which is a beta distribution with α = γ₁ + 1 and β = γ₂ − γ₁ + 1,

i.e. the beta distribution is a conjugate prior to the Bernoulli distribution.

Principle of maximum entropy

There are possibly infinitely many models that fit the prior data (prior knowledge) exactly. The principle of maximum entropy asserts that the probability distribution that best represents a system is the one with the largest entropy. In information theory, the entropy of a random variable measures the “surprise” inherent in its possible outcomes. Under this principle, we avoid applying unnecessary additional constraints on what is possible, since constraints decrease the entropy of the system.

Many distributions can satisfy the constraints imposed by the sufficient statistics, but the one we choose is the one with the highest entropy. It can be proven that the exponential family is the maximum-entropy distribution consistent with the given constraints on the sufficient statistics.

Kth Moment

A moment describes the shape of a function quantitatively. If the function f is a probability distribution, the zeroth moment is the total probability (= 1) and the first moment is the mean. For the second and higher moments, the central moments provide better information about the distribution’s shape. The second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis.


The kth moment, or the kth raw moment, of function f is defined as

μ′_k = 𝔼[x^k] = ∫ x^k f(x) dx

This moment is called the moment about zero. If we first subtract the mean from x, it is called a central moment.

μ_k = 𝔼[(x − μ)^k] = ∫ (x − μ)^k f(x) dx

For the exponential family, the kth-order derivative of A(η) gives the kth cumulant of T(x); in particular, the first derivative is the mean and the second derivative is the variance.

𝔼[T(x)] = A′(η) and Var[T(x)] = A″(η)

Method of Moments

How can we estimate model parameters from samples? How can we model the population density p with a model q*? In moment matching, we calculate the moments from the sample data and set them equal to the model’s theoretical moments, so that the expectations of the sufficient statistics match.

𝔼_q*[T(x)] = (1/n) Σⱼ T(xⱼ)

Consider a simple zero-centered distribution model f, parameterized by its standard deviation σ, with T(x) = x.


The first and second theoretical moments are:

𝔼[x] = 0 and 𝔼[x²] = σ²

The second-order sample moment is:

(1/n) Σⱼ xⱼ²

By setting the sample moment equal to the theoretical moment, we get an estimate of σ as:

σ̂² = (1/n) Σⱼ xⱼ², i.e. σ̂ = √((1/n) Σⱼ xⱼ²)

The required integration is not easy in general, but we can use the derivatives of A to compute the moments and then solve for the distribution parameters. For example, the gamma distribution’s parameters α and β can be estimated from the sample mean and variance.

Since 𝔼[x] = α/β and Var[x] = α/β², matching the sample mean m and sample variance v gives α̂ = m²/v and β̂ = m/v.
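A short numerical sketch, assuming true parameters α = 3 and β = 2, recovers them from the sample mean and variance:

```python
# Sketch: method-of-moments estimates for the gamma distribution (assumed alpha = 3, beta = 2).
import numpy as np

rng = np.random.default_rng(0)
true_alpha, true_beta = 3.0, 2.0
x = rng.gamma(shape=true_alpha, scale=1 / true_beta, size=100_000)

m, v = x.mean(), x.var()
alpha_hat = m**2 / v          # from mean = alpha/beta and variance = alpha/beta^2
beta_hat = m / v
print(alpha_hat, beta_hat)    # ≈ 3.0, 2.0
```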
