Probability Distributions in Machine Learning & Deep Learning

Jonathan Hui
11 min read · Mar 15, 2022

In Bayesian inference, probability distributions are heavily used to make intractable problems solvable. After discussing the normal distribution, we will cover other basic distributions and more advanced ones, including the beta, Dirichlet, Poisson, and gamma distributions. We will also discuss topics including conjugate priors, the exponential family of distributions, and the method of moments.

Bernoulli Distribution

The Bernoulli distribution is a discrete distribution for a single binary random variable X ∈ {0, 1}, which takes the value 0 with probability 1 − θ and the value 1 with probability θ. For example, when flipping a coin, the chance of heads is θ.

The expected value and the variance for the Bernoulli distribution are 𝔼[X] = θ and Var[X] = θ(1 − θ).

Binomial Distribution

The binomial distribution is the aggregated result of independent Bernoulli trials. For example, we flip a coin N times and model the chance to have x heads.


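The probability of x heads in N flips is P(x | N, θ) = C(N, x) θ^x (1 − θ)^(N − x), where C(N, x) = N! / (x! (N − x)!). The expected value and the variance for the binomial distribution are 𝔼[x] = Nθ and Var[x] = Nθ(1 − θ).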
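To make the binomial formulas concrete, here is a minimal sketch that evaluates the probability mass function and checks the mean and variance numerically. The function name `binomial_pmf` and the values N = 10 and θ = 0.3 are illustrative assumptions, not part of the original discussion.

```python
from math import comb


def binomial_pmf(x: int, n: int, theta: float) -> float:
    """Probability of exactly x successes in n independent Bernoulli(theta) trials."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


# Illustrative values (assumed): 10 coin flips with a 30% chance of heads.
n, theta = 10, 0.3
pmf = [binomial_pmf(x, n, theta) for x in range(n + 1)]

# Mean and variance computed directly from the PMF.
mean = sum(x * p for x, p in enumerate(pmf))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))

print(sum(pmf))   # total probability, should be close to 1
print(mean)       # should be close to n * theta
print(variance)   # should be close to n * theta * (1 - theta)
```

Setting N = 1 in the same function gives the Bernoulli distribution.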

Categorical distribution

A Bernoulli distribution has two possible outcomes. In a categorical distribution, we have K possible outcomes with probabilities p₁, p₂, …, p_K respectively. All these probabilities add up to one.

Multinomial distribution

The multinomial distribution is a generalization of the binomial distribution. Instead of two outcomes, it has k possible outcomes. Just as the binomial distribution counts the results of repeated Bernoulli trials, the multinomial distribution counts the results of repeated categorical trials.

Suppose these outcomes are associated with probabilities θ₁, θ₂, …, θ_k respectively. We collect a sample of size N, and xᵢ represents the count for outcome i. The joint probability is P(x₁, …, x_k | N, θ) = N! / (x₁! ⋯ x_k!) · θ₁^x₁ ⋯ θ_k^x_k, where x₁ + … + x_k = N.

The expected value and the variance for the multinomial distribution are 𝔼[xᵢ] = Nθᵢ and Var[xᵢ] = Nθᵢ(1 − θᵢ), with Cov[xᵢ, xⱼ] = −Nθᵢθⱼ for i ≠ j.
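The multinomial probability can be sketched in a few lines. The helper name `multinomial_pmf` and the dice example are assumptions chosen for illustration. With only two outcomes, the result should agree with the binomial distribution.

```python
from math import comb, factorial, prod


def multinomial_pmf(counts, thetas):
    """Probability of observing the given count for each outcome."""
    n = sum(counts)
    # Multinomial coefficient N! / (x1! * x2! * ... * xk!), kept as an exact integer.
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)
    return coeff * prod(t**c for t, c in zip(thetas, counts))


# Illustrative example (assumed): roll a fair die 6 times and see each face once.
p_each_face_once = multinomial_pmf([1] * 6, [1 / 6] * 6)

# With two outcomes, the multinomial reduces to the binomial.
p_two_outcomes = multinomial_pmf([3, 7], [0.3, 0.7])
p_binomial = comb(10, 3) * 0.3**3 * 0.7**7

print(p_each_face_once)
print(p_two_outcomes, p_binomial)
```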

Beta distribution

For a Bernoulli distribution or a binomial distribution, how can we model the value for θ? For example, if a new virus is discovered, can we use a probability distribution to model the infection probability θ?

The beta distribution is a distribution over a continuous random variable on a finite interval of values. It is often used to model the probability for some binary event like θ. The model has two positive parameters α and β that affect the shape of the distribution.


When we have no knowledge about the new virus, we can set α = β = 1 for a uniform distribution, i.e. any possible probability values for θ ∈ [0, 1] are equally likely. This is our prior.


Then we can apply Bayesian inference with the likelihood modeled by a binomial distribution. The posterior will also be a beta distribution, with updated values of α and β. This becomes the new infection rate distribution given the observed data, and it acts as the new prior when a new sample is observed.


Mathematically, the beta distribution is defined as f(θ; α, β) = θ^(α − 1) (1 − θ)^(β − 1) / B(α, β), for θ ∈ [0, 1].

The beta function B(α, β) = Γ(α)Γ(β) / Γ(α + β) normalizes the right-hand side into a probability distribution.

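The definition seems complicated, but when it is used in Bayesian inference, the calculation becomes very simple. Let’s say the CDC reports x new infections out of N people. Applying Bayes’ Theorem, the posterior is proportional to the likelihood times the prior, θ^x (1 − θ)^(N − x) · θ^(α − 1) (1 − θ)^(β − 1), so the posterior will be Beta(α + x, β + N − x).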

i.e. we simply add the new positives to α and the new negatives (N-x) to β.
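The update rule is simple enough to sketch directly. The function names and the infection counts below are hypothetical, chosen only to illustrate the arithmetic. The sketch also shows that updating in two batches gives the same posterior as updating once with all the data.

```python
def update_beta(alpha, beta, positives, total):
    """Posterior Beta parameters after observing `positives` out of `total` trials."""
    return alpha + positives, beta + (total - positives)


def beta_mean(alpha, beta):
    """Expected value of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior: no knowledge about the infection probability.
prior = (1.0, 1.0)

# Hypothetical reports: 3 infections out of 20 people, then 5 out of 30.
posterior = update_beta(*prior, positives=3, total=20)
posterior = update_beta(*posterior, positives=5, total=30)

# Processing all 50 people at once gives the same answer.
batch_posterior = update_beta(*prior, positives=8, total=50)

print(posterior, batch_posterior)
print(beta_mean(*posterior))  # posterior estimate of the infection probability
```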

The expected value and variance for the beta distribution are 𝔼[θ] = α / (α + β) and Var[θ] = αβ / ((α + β)² (α + β + 1)).

Dirichlet distribution

In the previous Bayesian inference example, the likelihood is modeled by the binomial distribution. We pair it with the beta distribution (prior) to calculate the posterior easily. For a likelihood modeled by the multinomial distribution, the corresponding prior is the Dirichlet distribution.

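The Dirichlet distribution is defined as f(θ₁, …, θ_K; α₁, …, α_K) = (1 / B(α)) · θ₁^(α₁ − 1) ⋯ θ_K^(α_K − 1), where θᵢ ≥ 0, θ₁ + … + θ_K = 1, and B(α) = Γ(α₁) ⋯ Γ(α_K) / Γ(α₁ + … + α_K).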

This random process has K outcomes and the corresponding Dirichlet distribution will be parameterized by a K-component α.

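As with the beta distribution, its functional form mirrors that of the corresponding likelihood, which makes the posterior computation easy: observing counts x₁, …, x_K turns a Dirichlet(α₁, …, α_K) prior into a Dirichlet(α₁ + x₁, …, α_K + x_K) posterior.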

The expected value and the variance for the Dirichlet distribution are 𝔼[θᵢ] = αᵢ / α₀ and Var[θᵢ] = αᵢ(α₀ − αᵢ) / (α₀² (α₀ + 1)), where α₀ = α₁ + … + α_K.
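The Dirichlet update with multinomial counts follows the same pattern as the beta update: add the observed count for each outcome to the matching α. The sketch below assumes a three-outcome process with made-up counts, and the helper names are illustrative.

```python
def update_dirichlet(alphas, counts):
    """Posterior Dirichlet parameters after observing one count per outcome."""
    return [a + c for a, c in zip(alphas, counts)]


def dirichlet_mean(alphas):
    """Expected value of each component of a Dirichlet distribution."""
    total = sum(alphas)
    return [a / total for a in alphas]


# Uniform prior over three outcomes, then hypothetical observed counts.
prior = [1.0, 1.0, 1.0]
counts = [2, 5, 3]

posterior = update_dirichlet(prior, counts)
posterior_mean = dirichlet_mean(posterior)

print(posterior)
print(posterior_mean)  # the components sum to one
```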

Poisson Distribution

The Poisson distribution models the probability of a given number of events occurring in a fixed interval of time. It models a Poisson process in which events occur independently and continuously at a constant average rate λ. The probability of observing x events is P(x | λ) = λ^x e^(−λ) / x!.

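A binomial distribution can be approximated by the Poisson distribution if the event is relatively rare: as N grows large and θ shrinks with Nθ = λ held fixed, the binomial probabilities converge to the Poisson probabilities.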
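We can check this approximation numerically. The sketch below holds λ = Nθ fixed at an assumed value of 2 and measures the largest gap between the two distributions for a small and a large N. The helper names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, theta):
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)


lam = 2.0  # assumed average number of events per interval


def max_gap(n):
    """Largest absolute difference between the two PMFs for x = 0..10."""
    theta = lam / n  # keep n * theta fixed at lam
    return max(
        abs(binomial_pmf(x, n, theta) - poisson_pmf(x, lam)) for x in range(11)
    )


gap_small_n = max_gap(20)    # event is not that rare (theta = 0.1)
gap_large_n = max_gap(2000)  # event is rare (theta = 0.001)

print(gap_small_n, gap_large_n)  # the gap shrinks as the event gets rarer
```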

A Poisson process is assumed to be memoryless — the past does not influence any future predictions. The average wait time for the next event is the same regardless of whether the last event happened 1 minute or 5 hours ago.

The expected value and the variance for the Poisson distribution are both equal to the rate: 𝔼[x] = Var[x] = λ.

Exponential distribution

The exponential distribution is the probability distribution for the waiting time before the next event occurs in a Poisson process. For example, for λ = 0.1 (rate parameter), the chance of waiting for more than 15 time units is e^(−0.1 × 15) ≈ 0.22.

Mathematically, it is defined as f(x; λ) = λ e^(−λx) for x ≥ 0, so the probability of waiting longer than t is P(X > t) = e^(−λt).

The expected value and the variance for the exponential distribution are 𝔼[x] = 1/λ and Var[x] = 1/λ².
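A short sketch reproduces the waiting-time number quoted above and illustrates the memoryless property of the Poisson process. The function name `survival` and the elapsed time of 5 are assumptions for illustration.

```python
from math import exp


def survival(t, lam):
    """P(X > t) for an exponential distribution with rate lam."""
    return exp(-lam * t)


lam = 0.1
p_wait_over_15 = survival(15, lam)

# Memoryless property: having already waited s does not change the odds
# of waiting at least t more, i.e. P(X > s + t | X > s) = P(X > t).
s, t = 5.0, 15.0
p_conditional = survival(s + t, lam) / survival(s, lam)

print(round(p_wait_over_15, 2))
print(p_conditional)
```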

Dirac distribution

The Dirac delta distribution (δ distribution) can be considered as a function that has a narrow peak at x = 0. Specifically, δ(x) has the value zero everywhere except at x = 0, and the area (integral) under the peak is 1.

This function is a helpful approximation for a tall narrow spike function (an impulse) or some deterministic value in a probability distribution. It helps us to transform some models into mathematical equations.


Here is a recap of some of the probability distributions discussed.


Gamma distribution

The exponential distribution and the chi-squared distribution are special cases of the gamma distribution. For an integer shape k, a gamma-distributed variable can be viewed as the sum of k independent exponentially distributed random variables with the same rate.

Intuitively, it is the distribution of the wait time for the kth event to occur.
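This intuition can be checked with a small simulation: we add up k exponential waiting times and compare the sample mean and variance with the gamma values k/λ and k/λ². The values k = 3 and λ = 0.5, the sample size, and the seed are assumptions, and the simulated numbers only approximate the theoretical ones.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable

k, lam = 3, 0.5      # wait for the 3rd event; events arrive at rate 0.5
n_samples = 20_000

# Each sample is the total wait for k events: a sum of k exponential waits.
waits = [sum(rng.expovariate(lam) for _ in range(k)) for _ in range(n_samples)]

sample_mean = fmean(waits)
sample_var = pvariance(waits)

print(sample_mean, k / lam)      # sample mean vs. theoretical mean
print(sample_var, k / lam**2)    # sample variance vs. theoretical variance
```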


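Here is the mathematical definition for the gamma distribution. Depending on the context, it can be parameterized in two different ways: with a shape α and a rate β, f(x; α, β) = β^α x^(α − 1) e^(−βx) / Γ(α) for x > 0, or with a shape k and a scale θ = 1/β, f(x; k, θ) = x^(k − 1) e^(−x/θ) / (Γ(k) θ^k).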

α (a.k.a. k) parameterizes the shape of the gamma distribution and β parameterizes the rate (the inverse of the scale). As suggested by the Central Limit Theorem, as k increases, the gamma distribution resembles the normal distribution.

As we change β, the shape remains the same but the scales of the x- and y-axes change.

The expectation and the variance of the gamma distribution, with shape α and rate β, are 𝔼[x] = α/β and Var[x] = α/β².

Conjugate prior

As discussed before, if we pair the distributions for the likelihood and the prior smartly, we can make Bayesian inference tractable.

In Bayesian inference, a prior is a conjugate prior if the corresponding posterior belongs to the same family of distributions as the prior.

For example, the beta distribution is a conjugate prior to the binomial distribution (likelihood). The posterior calculated with Bayes’ Theorem is also a beta distribution. Here are more examples of conjugate priors.


Sufficient Statistics

By definition, when a distribution can be written in the factorized form p(x | θ) = h(x) g(θ, T(x)), where h does not depend on θ, T(x) is called a sufficient statistic for θ.

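Here is an example applied to the Poisson distribution. For n independent observations, p(x₁, …, xₙ | θ) = ∏ⱼ θ^xⱼ e^(−θ) / xⱼ! = (1 / ∏ⱼ xⱼ!) · θ^(Σⱼ xⱼ) e^(−nθ), so h(x) = 1 / ∏ⱼ xⱼ! and T(x) = Σⱼ xⱼ, i.e. T(x) sums over xⱼ.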

The significance of a sufficient statistic is that no other statistic calculated from x₁, x₂, x₃, … will provide any additional information to estimate the distribution parameter θ. If we know T(x), we have sufficient information to estimate θ. No other information is needed. We don’t need to keep x₁, x₂, x₃, … around to build the model. For example, given a Poisson distribution modeled by θ (a.k.a. λ), we can estimate θ by dividing T(x) by n.
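One way to see sufficiency in action is to compare two samples of the same size that share the same sum. The sketch below, with made-up data and an illustrative helper name, shows that their Poisson log-likelihoods differ only by a constant that does not depend on θ, so both samples lead to the same estimate T(x)/n.

```python
from math import lgamma, log


def poisson_log_likelihood(xs, theta):
    """Log-likelihood of independent Poisson(theta) observations."""
    # lgamma(x + 1) equals log(x!).
    return sum(x * log(theta) - theta - lgamma(x + 1) for x in xs)


# Two hypothetical samples with the same size n and the same sum T(x) = 10.
sample_a = [2, 0, 4, 1, 3]
sample_b = [2, 2, 2, 2, 2]

# Estimate of theta from the sufficient statistic alone.
theta_hat = sum(sample_a) / len(sample_a)

# The gap between the two log-likelihoods is the same for every theta,
# so the two samples carry the same information about theta.
gaps = [
    poisson_log_likelihood(sample_a, th) - poisson_log_likelihood(sample_b, th)
    for th in (0.5, 2.0, 5.0)
]

print(theta_hat)
print(gaps)
```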

Exponential family of distribution

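The normal, Bernoulli, gamma, beta, Dirichlet, exponential, and Poisson distributions, and many others, belong to a family of distributions called the exponential family. It has the form p(x | η) = h(x) exp(ηᵀ T(x) − A(η)), where η is the natural parameter, T(x) is the sufficient statistic, h(x) is the base measure, and A(η) is the log-partition function.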

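Here are the exponential family forms, represented by h(x), η, T(x), and A, for the binomial and Poisson distributions. For the binomial distribution with N fixed: h(x) = C(N, x), η = log(θ / (1 − θ)), T(x) = x, and A(η) = N log(1 + e^η). For the Poisson distribution: h(x) = 1/x!, η = log λ, T(x) = x, and A(η) = e^η.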


We can convert between the parameter θ and the natural parameter η. For example, the Bernoulli parameter θ can be calculated from the corresponding natural parameter η using the logistic function, θ = 1 / (1 + e^(−η)).
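As a quick illustration of this conversion for the Bernoulli distribution, the sketch below maps θ to η with the log-odds and back with the logistic function. The function names are assumptions made for this example.

```python
from math import exp, log


def natural_from_theta(theta):
    """Natural parameter (log-odds) of a Bernoulli distribution."""
    return log(theta / (1 - theta))


def theta_from_natural(eta):
    """Logistic function: recovers theta from the natural parameter."""
    return 1 / (1 + exp(-eta))


for theta in (0.1, 0.5, 0.9):
    eta = natural_from_theta(theta)
    print(theta, eta, theta_from_natural(eta))  # the round trip recovers theta
```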

Here is another example in writing the normal distribution in the form of an exponential family: T(x) = (x, x²), η = (μ/σ², −1/(2σ²)), h(x) = 1/√(2π), and A(η) = μ²/(2σ²) + log σ.

What is the advantage of this abstract generalization?

The exponential family provides a general mathematical framework in solving problems for its family of distributions. For example, computing the expected value for the Poisson distribution can be hard.

Instead, all the expected values for the exponential family can be calculated fairly easily from A: the derivative A′(η) equals the expected value of T(x). In the Poisson distribution, T(x) = x, λ = exp(η), and A(η) = exp(η) = λ, so we differentiate A(η) to find 𝔼[x] = A′(η) = exp(η) = λ.
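This property can be verified numerically with a finite-difference derivative. The sketch below assumes λ = 3.5 for the Poisson case and, as a second check, uses the Bernoulli log-partition function A(η) = log(1 + e^η) with θ = 0.3. The helper name `derivative` and the step size are illustrative choices.

```python
from math import exp, log


def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Poisson: A(eta) = exp(eta) with eta = log(lambda).
lam = 3.5
poisson_mean = derivative(exp, log(lam))

# Bernoulli: A(eta) = log(1 + exp(eta)) with eta = log(theta / (1 - theta)).
theta = 0.3
eta = log(theta / (1 - theta))
bernoulli_mean = derivative(lambda e: log(1 + exp(e)), eta)

print(poisson_mean)    # should be close to lam
print(bernoulli_mean)  # should be close to theta
```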

This family of distributions also has nice properties in Bayesian analysis. If the likelihood belongs to an exponential family, there exists a conjugate prior that is often an exponential family. If we have an exponential family written as p(x | η) = h(x) exp(ηᵀ T(x) − A(η)),

the conjugate prior parameterized by γ will have the form

The conjugate prior, modeled by γ, will have one additional degree of freedom. For example, the Bernoulli distribution has one degree of freedom modeled by θ. The corresponding beta distribution will have two degrees of freedom modeled by α and β.

Consider the Bernoulli distribution below in the form of the exponential family,

We can define (or guess)

We get

i.e. the beta distribution is a conjugate prior to the Bernoulli distribution.

Principle of maximum entropy

There are possibly infinitely many models that can fit the prior data (prior knowledge) exactly. The principle of maximum entropy asserts that the probability distribution that best represents a system is the one with the largest entropy. In information theory, the entropy of a random variable measures the “surprise” inherent to the possible outcomes. Under this principle, we avoid applying unnecessary and additional constraints on what is possible, as constraints decrease the entropy of the system.

Many distributions can satisfy the constraints imposed by the sufficient statistics, but the one we choose is the one with the highest entropy. It can be proven that the maximum-entropy distribution consistent with the given constraints on the expected sufficient statistics belongs to the exponential family.

Kth Moment

A moment describes the shape of a function quantitatively. If the function f is a probability distribution, the zeroth moment is the total probability (= 1) and the first moment is the mean. For the 2nd and higher moments, the central moments provide better information about the distribution’s shape. The second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis.


The kth moment, or the kth raw moment, of function f is defined as μₖ = ∫ xᵏ f(x) dx.

This moment is called the moment about zero. But if we subtract the mean μ from x first, it is called a central moment: ∫ (x − μ)ᵏ f(x) dx.

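For an exponential family, the derivatives of A(η) give the cumulants of T(x): the first derivative A′(η) is the mean of T(x), and the second derivative A″(η) is its variance. The raw moments can be recovered from these.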
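For sample data, the same definitions apply with the integral replaced by an average. The sketch below computes raw, central, and standardized moments for a small made-up dataset. The function names are assumptions for illustration.

```python
def raw_moment(xs, k):
    """kth sample moment about zero."""
    return sum(x**k for x in xs) / len(xs)


def central_moment(xs, k):
    """kth sample moment about the mean."""
    mean = raw_moment(xs, 1)
    return sum((x - mean) ** k for x in xs) / len(xs)


def standardized_moment(xs, k):
    """kth central moment divided by the kth power of the standard deviation."""
    return central_moment(xs, k) / central_moment(xs, 2) ** (k / 2)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = raw_moment(data, 1)
variance = central_moment(data, 2)
skewness = standardized_moment(data, 3)
kurtosis = standardized_moment(data, 4)

print(mean, variance, skewness, kurtosis)
```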

Method of Moments

How can we estimate model parameters by sampling? How can we model the population density p with q*? In moment matching, we calculate the moments from the sample data so the expectation of their sufficient statistic will match.

Consider a simple zero-centered distribution model f parameterized by θ with T(X)=x.

The first and second theoretical moments are 𝔼[x] = ∫ x f(x | θ) dx = 0 and 𝔼[x²] = ∫ x² f(x | θ) dx = σ².


The second-order sample moment of N samples is m₂ = (1/N) Σᵢ xᵢ².

By setting the sample moment equal to the theoretical moment, we get an estimate of σ (the sampled σ): σ̂² = (1/N) Σᵢ xᵢ².

But the integration is not easy in general. Instead, we can use the derivatives of A to compute the moments and solve for the distribution parameters. For example, in the gamma distribution with shape α and rate β, the mean is α/β and the variance is α/β², so the parameters can be estimated from the sample mean x̄ and sample variance s² as α̂ = x̄²/s² and β̂ = x̄/s².
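Here is a minimal sketch of moment matching for the gamma distribution. It draws synthetic data from a gamma distribution with an assumed shape of 2 and rate of 0.5, then recovers both parameters from the sample mean and variance. Note that Python's `random.gammavariate` takes a scale rather than a rate as its second argument, so we pass 1/rate. The function name `fit_gamma_by_moments`, the seed, and the sample size are illustrative choices.

```python
import random
from statistics import fmean, pvariance


def fit_gamma_by_moments(xs):
    """Estimate gamma shape and rate by matching the sample mean and variance."""
    mean = fmean(xs)
    var = pvariance(xs, mu=mean)
    shape_hat = mean**2 / var  # from mean = shape / rate, var = shape / rate^2
    rate_hat = mean / var
    return shape_hat, rate_hat


rng = random.Random(42)  # fixed seed so the run is repeatable
true_shape, true_rate = 2.0, 0.5

# random.gammavariate expects (shape, scale), where scale = 1 / rate.
data = [rng.gammavariate(true_shape, 1.0 / true_rate) for _ in range(50_000)]

shape_hat, rate_hat = fit_gamma_by_moments(data)
print(shape_hat, rate_hat)  # should land near the true shape and rate
```

The estimates are only approximate and tighten as the sample size grows.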