Machine Learning — Fundamental

Jonathan Hui
55 min readApr 23, 2019


San Francisco Exploratorium

“One learning algorithm” hypothesizes that human intelligence is due to a single learning algorithm. Unfortunately, we are not there yet. Machine Learning (ML) starts from many disciplines and there are a lot of concepts to learn. In this ML article, we will study the fundamentals like information theory, probability, distribution, Bayesian inference, PCA, Statistical Significance, etc … In later articles, we will look into ML algorithms. However, to shorten the length of this ML series, we assume that you have basic exposure to ML. If you have problems following the material, I will suggest some good ML learning resources at the end of the article. There is a lot of materials in each article, some are easy and some are very hard. Feel free to skip some sections based on your level of knowledge and interests.

Information theory

The Shannon information content is the amount of information gain when an event x occurs. Mathematically, it is defined as

For example, if a coin shows heads all the time, the events are predictable and we know nothing new. Therefore, the information content is zero. If the coin is unbiased, the probability distribution is uniform and is most unpredictable on whether it will be head or tail next. Therefore, the information content is the highest. Take a second to digest it. In particular, uniform distribution leads to the highest content.

In computer science, the information content can also be viewed as the number of bits to encode the information most efficiently. Utilizing the event frequency, the optimal encoding scheme will be0b0 for head and 0b1for tail in a fair coin (or vice versa). For our discussion, we use base 2 for the log. Therefore, the information content for a fair coin flipping is -log₂(½) = 1. i.e. we gain 1-bit of information when we flip the coin once.

X is called a random variable if it holds a value generated from a random process (a stochastic process), e.g. the face value in rolling dice or the total number of heads after flipping a coin 20 times.

The value of X can be modeled with a distribution p(X). For example, let X holds the total sum of rolling ten dies. We can model X with a Gaussian distribution (according to the Central limit theorem).

Entropy H measures the expected information content of a random variable. To calculate the value, we sum up all information content with a weight equals to its occurrence frequency.

For example, in flipping a fair coin, P(X=head)=½ and P(X=tail)=½. Its entropy equals

The average content is one bit. That is, we need an average of one bit to encode an event.

Entropy establishes a lower bound for the average bits to encode events with the probability distribution P.


Cross-entropy H(P, Q) measures the expected number of bits to encode X with distribution P using an encoding scheme targeted for distribution Q.

In ML, we want our predictions Q to match the ground truth P. If they match, the cross-entropy will be the minimum and therefore, we often use it as our training objective.

The cross-entropy for our example is

As shown above, the cost function for many classification problems is simply


KL-divergence measures the difference between two distributions P and Q.

It is easy to prove that cross-entropy, entropy and KL Divergence are related:

i.e., KL-Divergence measures the inefficiency of representing P with encoding scheme Q — the extra-bits to encode the information with the sub-optimal scheme. Therefore, KL-divergence is always greater or equal to zero.

Since entropy does not change on how we model it, optimizing KL divergence w.r.t. model parameters is the same as optimizing cross-entropy.

There are a few properties of the KL-divergence that need to be aware of.

KL(p, q) ≥ 0

KL(p, p)=0, and

KL(p, q) ≠KL(q, p) (non-symmetry)

KL(q, p) is called the reverse KL-divergence. The non-symmetry property of KL-divergence has some very significant implications. For example, let’s say the ground truth p is a bimodal distribution (the blue curve below) and we want to model it with a single-mode Gaussian distribution (the red curve). If KL-divergence KL(p, q) is used as the training objective function, we will get a Gaussian distribution that overlaps both modes of the ground truth p that peaks at the trough between the two modes (Diagram a). If the reverse KL-divergence KL(q, p) is used, we will either get one of the local optimal in Diagram b or c.


For KL(p, q), we want q(x) to be non-zero if p(x) is non-zero. Otherwise, if q(x) is small, the KL value will be high. Therefore, it tries to cover what p will cover. For KL(q, p), if P(x) is zero, we want Q(x) to be zero also.

Let’s view the difference in a 2D space. Reverse KL will only cover one of the modes in a bimodal ground truth but the peak of Q will be close to the peak of one of the modes.

The root of the problem is we are using a simple model for complex ground truth. If both models have similar complexity, this will be a non-issue.

Conditional Entropy

The conditional entropy H(Y|X) is the entropy of Y given X is known. If Y can be separated according to X, this is the weighted entropy of the separated groups and calculated as:

Information Gain

Mutual information (or Information gain) I(X; Y) is the information obtained on the random variable X when Y is observed. However, the mathematical definition below does not sound intuitive.

Modified from source

But it will be easier with entropy

Intuitively, mutual information measures how much information do we gain by knowing X? If knowing Y gives us all the information about X, the conditional entropy H(X|Y) is zero because there is no more information we needed on X. The mutual information I becomes H(X) (or H(Y)). On the other extreme, if knowing Y gives us no information about X, the conditional entropy is H(X) and the mutual information is zero.

For example, if we know the label (Y) of an object, we gain a lot of information about its raw image (X). We should not mistake its picture with other objects. Therefore the information gain I(X;Y) is high. Let’s visualize this with sets. The mutual information is its overlap.

If random variable X and Y are unrelated, their intersection is empty and therefore, the mutual information is zero. If random variables X and Y are the same, H(X)=H(Y)=I(X;Y), knowing Y is the same as knowing X.

Let’s go through some of its applications briefly. In a decision tree, we choose a condition to maximize I — a condition that gains the maximum information by separating data according to the branching condition. In another example, InfoGAN maximizes the mutual information between the generated image G and its intended label c. This encourages the GAN model to generate images related to its intended label. The information gain will be used as part of the training objective.


Probability mass function is the probability distribution for discrete random variables. For example,

Probability density function (PDF) is the probability distribution for continuous random variables with lower case notation p. In principle, the chance of having a weight exactly equals 10.0000000000 lb should be zero. Instead, we calculate the probability between two values by integrating PDF over this range (say between 9 and 10).

Cumulative distribution function (CDF)

CDF calculates the accumulative probabilities p(X≤x).

CDF on the right (The CDF image source)

Conditional probability


Two random variables A and B are independent if

Marginal probability

The marginal probability P(X) is computed by summing (or integrating) the joint probability over other variables.

In many ML problems, we build a joint distribution model for all the variables. Once it is modeled, we can inference the probability of a single or a subset of variables (p(x₁) or p(x₁, x₂, x₃)) by summing or integrating over the rest of the variables.

Chain rule

Bayes’ theorem

The equation looks simple but it is one of the most important equations in ML. So it is important to check out here if you are not familiar with the terms above. Otherwise, you can be miserable in reading literature if you are not used to those terms.

Bayes’ Theorem can be formularized in different forms. We will use the following forms frequently.

Let’s say there is a genetic disease that only 0.1% of the population gets it. We develop a test with 99% accuracy for the positive and negative results. So, if you test positive, should you be overly worried. Intuition will say yes but in reality, the Bayes’ Theorem proves that there is only a 9% chance that you get the disease (proof). The intuition is not hard. Since the disease is so rare, most of the positive results from the lab come from false positives of people without the disease — a small portion of a large pie is still larger than a big portion of a very small pie.

Probabilistic models & non-probabilistic models

Models can be categorized as probabilistic models and non-probabilistic models. Probabilistic models use probability theory to model a problem and to make predictions. It reasons about uncertainty. These models are usually built on MLE (Maximum likelihood estimation) or MAP (Maximum a posteriori estimation). For example, in training a linear regression model, we optimize its parameters w to maximize the likelihood of the observed data.

Non-probabilistic models do not use distribution to model the problems. Some examples include clustering, SVM, decision trees, etc …

Bayes inference v.s. Point estimate

Bayes inference infers unobserved variables, like latent factors, label or model parameter θ, given the observed variables x. Unlike other models, it calculates a probability curve instead of a point estimation, which evaluates the probability of all possibilities. For example, instead of computing the likely symptom onset time of a disease, we evaluate the chance for each possibility for the incubation period.


Let’s elaborate it a little bit more. In linear regression, we predict a point estimate based on input features. For example, the price of a house can be estimated from the square footage and the number of rooms (price = a × square footage + b × # of rooms + c).

In Bayes’ Theorem, the prior, the likelihood and the calculated posterior is a probability density function (PDF). With Bayes’ Theorem, we predict a distribution of values and their certainty. It reasons the uncertainty in the real-world better. But they do not come free and often make the computation intractable.

Advantages and disadvantage of Bayes’ Theorem

In many ML problems, modeling P(y|x₁, x₂, x₃, ) is hard. With Bayes’ Theorem, we compute P(x₁, x₂, x₃, …|y) instead. If it is easier, we win. As shown later in Naive Bayes’ Theorem, for some problem domains, the conditional probability can be further broken down into independent components (P(x₁|y)P(x₂|y)…) which is much easy to model. Also, in the early experiment stage, we don’t have enough samples to draw conclusions. But if we have a solid prior belief, we can integrate the new evidence with the belief using Bayes’ Theorem to draw reasonable predictions. Don’t worry about the details. More specific examples will come soon.

However, we are working on curves. Calculating the posterior is usually not easy even the equation looks naively simple.

Multiply the prior and the likelihood curve in high-dimensional space or continuous space is often intractable unless they belong to some special family of distributions. Calculating the integral is even harder.

In many problems, x and y are composed of many variables (x = (x₁, x₂, x₃, …) ) with exponential complexity for the integral.

In general, calculating the posterior is NP-hard.

A large portion of ML algorithms discussed in our ML series is dedicated to simplifying or approximating the posterior. For example, we can model the prior and the likelihood with Gaussian Distributions. The posterior distribution will be a Gaussian and can be computed analytically with ease.

In ML, the model parameter w training can be formulated as a posterior in the Bayes inference

The marginal probability in the denominator sums over w and no longer a function of w. Hence, it does not change w.r.t. w. Therefore, to find the optimal model, we can ignore the marginal probability.

Naive Bayes’ theorem

Naive Bayes’ theorem solves classification problems by exploring the independence of variables. As explained before, Bayes’ theorem helps us to solve P(y|x) when P(x|y) is easier to model, i.e. we compute P(y|x₁, x₂, x₃, …) through P(x₁, x₂, x₃, …|y). However, the joint conditional probability P(x₁, x₂, x₃, …|y) usually remains too complex. The exponential combination of x₁, x₂, …, xn makes it too hard to collect data to estimate it.

But we can simplify P(x₁, x₂, …|y) into P(x₁|y)P(x₂|y)…P(xn|y) by assuming x₁, x₂, x₃, … and xn are independent of each other. Therefore, P(y|x) can be evaluated as

In ML, exploiting independency P(A, B) = P(A) P(B) or conditional independence P(A, B|C) = P(A|C) P(B|C) are critical in avoiding exponential complexity. Even it may be false that variables in the Naive Bayes algorithm are completely independent of others, but empirically, this remains a valid simplification in solving many problems.

Let’s use Naive Bayes’ theorem to detect spam emails. People mark their spam emails. Therefore, we know what words those spam emails usually contain. We can further simplify P(w₁,w₂, …|spam) into the multiplication of independent components P(w₁|spam)P(w₂|spam)… By computing the corresponding values below, we pick the category with the highest value.

(Click the link for more details on this example)

In the example above, we ignore the frequency of a word. But this information can be very useful. For example, the word “money” will likely occur more frequently in a spam email.

To model it, we use the Poisson distribution Poisson(cⱼ) to measure the probability of the word j with count cⱼ. For each word, it has two frequency values λ, one for the spam-emails and one of the non-spam emails.

Bias & variance

In supervised learning, we supply labels to train model f.

As shown in the proof, the mean square error (MSE) composes of bias, variance and data noise. Since noise is unrelated to the model and irreducible when training a model, we will ignore it from our discussion.

Variance is the error that subjects to what data points are sampled during the model training. If the sampled training dataset is too small, a complex model is vulnerable to overfitting. Once it happens, the model and the predictions may change significantly if different training data is sampled.

With N 1-D training data input, we can fit a model perfectly with an Nth-order polynomial. Consider the ground truth is closer to a linear equation, the red rectangle region below is overfitted and results in bad predictions. If the last 2 data points are not included in the training dataset or more data points are sampled in the red rectangle area, the prediction will be different. In short, the training has a high variance if the model is complex with not enough data to fit it.

On the other hand, bias happens when the model is not complex enough to make accurate predictions. If the ground truth for x is around the center circle below, bias shifts the answer in one particular direction while variance spread the predictions around the ground truth.

Below is a demonstration of the model complexity. As the complexity improves, it fits the training data better until it is overfitted. At that point, the training error continues to decrease. But when we validate the model with data points that are not used for training, the validation error is large and it goes up as the number of iterations increases.

The following figure summarizes the relationship between model complexity, bias, and variance. The solution in reducing the variance is to increase the training dataset size, reduce the model complexity or add regularization.


i.i.d. (Independent and identically distributed)

In many ML problems, we assume data points are drawn from the same data distribution and the observed value does not impact the observed values of others.

This is a simple concept but the success of many ML and DL algorithms highly depends on it. Model training is more stable if the sampled data and training batches are close to i.i.d.

When we roll a dice 100 times, each sampled value is originated from the same distribution and the result is independent of the previous one. Therefore, throwing dice is i.i.d. However, if we draw from a deck of cards without replacement, the sampling distribution will change and therefore it is not i.i.d. The i.i.d. assumption can simplify our math in many situations. For example, the joint distribution in the maximum likelihood estimation (MLE) can be simplified as the multiplication of these independent events.

This is critical. By breaking down a model into manageable independent subcomponents, we reduce the complexity significantly.

Covariate shift

However, if we have a model trained with fair dice, its prediction will be wrong if the input during the inference time has a different distribution, say the dealer switches to a biased dice. This is the covariate shift. Lately, there is a report on the setback of using deep learning in the medical field. While the accuracy of the DL predictions seems superhuman, it shows inconsistent results in other hospitals. It turns out some models are overfitted by data collected from a small number of hospitals. When the model applies to rural areas where mobile equipment is more frequently used, the accuracy drops. This is a covariate shift problem!

In some ML or DL problems, states may be highly correlated, in particular for the time sequence model. If we use gradient descent in optimizing these models, the i.i.d. assumption does not hold and the corresponding training can be very unstable.


In this section, we will discuss families of distributions like the Gaussian, the Gamma, the Beta, the Dirichlet, etc … Different families have different parameters and different shapes. It is important to shop around and use the right distribution to simplify the calculation dramatically. But let’s go through some simple definitions first.


For a continuous random variable, the expectation of f(x) is

For a discrete random variable, it is

Variance and covariance

For a continuous random variable, it is

Covariance indicates whether two variables are positively or negatively related. If it is zero, they are unrelated.

Variance can be expressed as the equation below (proof).

This formalization is handy for many proves, including finding the bias and the variance for the MLE estimator.

The covariance for X & Y is

Sample variance

The sample variance is an estimator for the population’s variance by sampling m data points.

However, the mean of the sampling data is correlated with the data itself. Therefore, (xᵢ-μ)² will be smaller than that of the population. To compensate that, we need to divide the total variance by m-1 instead of m (proof). But for the sample mean, we just need to divide the sum by m and this estimation is unbiased.


Each variable can be on a different scale (or units). Covariance shows how variables are related but it does not quantify the extent well. Correlation normalizes the covariance such that each variable corresponding to the same scale.

Gaussian distribution/Normal distribution

In a Gaussian distribution, 68% of data is within 1 σ from the μ and 95% of data is within 2 σ. The standard normal distribution has μ=0 and σ =1.

Multivariate Gaussian distribution

Multivariate Gaussian distribution models multiple variables distribution using a Gaussian model. The plot below is a bivariate Gaussian distribution with two random variables involved.


A multivariate Gaussian distribution for multivariate is defined as

where Σ is the covariance matrix. Each element in this matrix records how two random variables are correlated. We can sample a multivariate Gaussian distribution by

The covariance matrix can be expressed with correlation. Here is an example of a bivariate Gaussian distribution.


Central limit theorem

Consider rolling dice once, the value of the throw is uniformly distributed if the dice is fair. Let’s throw the dice many times and record its average. We repeat the experiment many times and plot these averages out. We will realize the results are Gaussian distributed. The central limit theorem says that:

Sums of large numbers of identically distributed random variables tend to Gaussian distributed.

i.e. we can have many independent random variables, say each represents the result of a throw. After we add them together and normalize it (average it), the normalized results are Gaussian Distributed. The theorem extends to any distribution for the independent variables. So no matter whether the dice is fair, biased or with any data distribution, the normalized results are Gaussian Distributed. This implies that even individual variables have different data distribution, we can use a Gaussian distribution to model its aggregate results.

If the random variable X has a mean μ and a variance σ², the normalized results will be:

i.e. the variance of the aggregated result drops as the result is average from a larger sample size. The sum or the difference of 2 Gaussian distributed variables is also Gaussian distributed.

Bernoulli distributions

The Bernoulli distribution is a discrete distribution for an event with two possible outcomes (head or tail in a biased coin), one with probability 𝜙 and the other 1-𝜙.

The distribution can also be written as:

Binomial distributions

The binomial distribution is the aggregated result of independent Bernoulli trials. For example, we flip a coin n times and model the chance to have x tails.

where p is the probability of a tail. The mean and the variance of a Binomial distribution are


Variance in estimating θ in Bernoulli Distribution is (proof):

Categorical distribution

A Bernoulli distribution has two possible outcomes only. In a categorical distribution, we have K possible outcomes with probability p₁, p₂ p₃, … and pk accordingly and all these probabilities add up to one.

Multinomial distribution

The Multinomial distribution is a generalization of the Binomial distributions. Instead of two outcomes, it has k possible outcomes.

Suppose these outcomes are associated with probabilities p₁, p₂, … and pk respectively. We collect a sample of size n and xᵢ represents the count for the outcome i. The joint probability is

With Bayesian inference

To simplify the Bayesian calculation, we frequently use Beta distribution and Dirichlet distribution as priors to work with Beta distribution and Multinomial distribution likelihood respectively (discussed later).


for discrete variable n.

Beta distribution

The definition of the Beta distribution has a close resemblance with the Binomial distribution. The beta distribution is defined as:

And the beta function B is a normalization factor defined as:

B normalizes the numerator so the computed result f is a probability distribution. i.e. if we sum f(x) over all possible x, it equals one.

When a = b = 1, f(x) is constant for all x and therefore, the distribution is uniformly distributed. The two parameters a and b in the Beta distribution change the shape of the distribution for θ. As shown below, it can model different data distribution for θ.

In specific, in Bayesian inference, we use the Beta distribution θ to model the prior for p if the likelihood is Binomial distributed. (Details later)

This approach simplifies the calculation significantly. Let’s ignore the normalization part in the Beta distribution for a second. The Binomial distribution has a similar form as the Beta distribution (underlining in red).

If we use the Beta function as the prior and use the Binomial distribution for the likelihood, the posterior in the Bayes’ Theorem can be solved analytically and easily.

As a preview, the posterior is as simple as:

(We will detail the discussion in the Bayesian Inference section.)

For your reference, the mean and the variance of the Beta distribution are:

Modified from source

where ψ is the Digamma function:

With the equations above, we can reverse engineer the model parameters of the Beta Distribution by collecting data samples.

Dirichlet distribution

If Beta distribution is a distribution used to learn the Binomial distribution, Dirichlet distribution is a distribution used to learn the Multinomial distribution.

The Dirichlet distribution notation is

where α is the parameters for the Dirichlet distribution which controls the shape of the distribution. If θ has K components, α has K components also.

Let’s make it less abstract. A radio station DJ plays songs from three genres with ratios vary daily. But on average, the ratios are 0.3, 0.2 and 0.5 respectively. Instead of using a fixed ratio, we will use the Dirichlet distribution to model the probability distribution that resembles what the DJ plays.

Now, the DJ has θ composed of 3 random variables (θ₁, θ₂, θ₃) which each represents a probability that a specific genre is played. Every day, the DJ samples values from the Dirichlet distribution p(θ). For example, the sampled values are θ₁=0.28, θ₂ = 0.22 and θ₃ = 0.5 today. Then, the DJ will use this probability to pick songs for that day. The next day, the DJ sample a new set of values. Here, θ₁, θ₂, and θ₃ combine together to form a probability distribution and Dirichlet distribution guarantees that the sampled values sum up to one. That is why we describe it as a distribution over distribution.

Since θ has 3 components, α composes of 3 values also. The Dirichlet distribution is defined as:

Again, we can see a close similarity between the Dirichlet distribution and the Multinomial distribution below. That is why it is a good candidate to use it to learn a Multinomial distribution by adjusting the value of α.

Visually, the possible values for θ₁, θ₂, θ₃ lays on a triangular plane to enforce the constraints that these values summing up to one. For K variables, all sampled values lie within the (K − 1)-simplex. i.e. all K variables sum up to one.

The diagram below shows some of the possible shapes for K = 1 and K = 3 with different αᵢ. As shown, it can represent uniform distribution, skewed/unskewed unimodal shape and multiple modal shapes. Dirichlet distribution allows us to model different shapes of probability distribution given θ₁, θ₂, θ₃, … always add up to one.

Source 1 & 2

(For K=3, the viewing angle of the triangle is changed so the probability distribution can be plotted in the vertical direction.)

Next, we will learn how to pick the values for α for a specific distribution. But, let’s summarize some of its properties first.


When all αᵢ are equal, 𝔼(θ₁)=𝔼(θ₂)=𝔼(θ₃). Therefore, to have a uniform distribution for θ₁, θ₂, θ₃, we can set αᵢ all equal to 1. If we want the choice to gear towards one specific genre, we make the corresponding αᵢ to be much higher than 1 and much higher than other αⱼ (assume αⱼ ≥1).

The source of the plot

On the other hand, if we want one genre not to be picked, we want θᵢ to be zero. It can be done with the corresponding αᵢ set to be smaller than 1. When θᵢ approaches 0, p(θᵢ=0) will approach infinity.

Symmetric Dirichlet distribution

Let’s make all αᵢ to be equal and name it as α. This is a special case for the Dirichlet distribution called symmetric Dirichlet distribution. Instead of a vector, α is modeled by a scalar. As shown below, as we increase α, we change from a very sparse distribution to a uniform distribution. i.e. when α is small, with one or a couple exceptions, most other values are zero (high sparsity). As α increases, the distribution becomes uniform.

Modified from source with K=10

Dirichlet distribution model parameter

Instead of modeling the Dirichlet distribution with a vector α’, we can reformulate the model parameter as a product of α and n (α’ᵢ → α nᵢ) where α is a scalar called the concentration parameter which controls the sparsity of the sampled value and n is a vector called the base measure with elements summing up to one.

Poisson Distribution

Poisson distribution models the probability for a given number of events occurring in a fixed interval of time.

When the number of events is relatively small compared to the scale of the time duration, the probability that an event happens x times in a certain duration can be simplified as (proof):

Here is the probability mass function with different λ.


A Poisson process is a process that events occur independently and continuously at a constant average rate. Let’s spend some time on it since exponential distribution and Gamma distribution are related to it.

Let’s discuss what the bus waiting time will look like if it is a Poisson process. In such a process, you know an average of six buses serving per hour for your route. But the arriving time is random and completely independent. Three buses may arrive in a row or you may wait for 50 minutes for one. In addition, even if you have waited for 20 minutes, don’t expect a bus will arrive at any minute. Past history does not count.

In a Poisson process,

  • You know the average event occurrence rate.
  • Events are random and independent of each other. The occurrence of an event does not influence the chance of another one.
  • The distribution of the wait time is memoryless. It does not matter how long you have been waiting, the distribution going forward remains the same, i.e.

The memoryless is a little bit odd. So this process is not exactly like your visioned bus service. But, let’s say we want to know when the 10th server may die in your server room so we can plan the inventory ahead. We know the average server failure rate. Each failure is mostly independent of others and occurs randomly. The Poisson process is a good starting point in drawing a nice picture of when the 10th server may die. This will be further elaborated later with the Gamma distribution.

Exponential and Laplace distribution

The exponential distribution is the probability distribution for the waiting time before the next event occurs in a Poisson process. This waiting time is modeled as a random variable with an exponential distribution. As mentioned before, this distribution is memoryless. The PDF is the same even no event has occurred in the past 20 minutes. The diagram below shows the shape and the mathematical definition of the distribution.

Modified from Wikipedia

For λ = 0.1 (rate parameter), the chance of waiting for more than 15 is 0.22.

The Laplace distribution is also called the double exponential distribution which can be thought of as two exponential distributions glued together back-to-back.

Gamma distribution

The exponential distribution and chi-squared distribution are special cases of the Gamma distribution. The Gamma distribution can be thought of as the sum of k independent random variables with exponential distribution.

Intuitively, it is the distribution of the wait time for the kth events to occur. So instead of one bus, what is the wait time distribution for the kth bus to arrive.

Here is the mathematical definition for the Gamma distribution.

The Gamma distribution can be expressed with different notations:

α is often written as k which is the kth events. β is the rate parameter (how often a bus may come). But it can be expressed as the scale parameter θ which is the inverse of β.

We fix the rate parameter β below and plot the PDF when changing α. As shown, α controls the shape. k = 1 is the exponential distribution with an exponentially decaying shape. As k increases, the skewness of the curve decreases.

For illustration only. Not on the exact scale.

As k approaches infinity, the plot will resemble a normal distribution which is suggested by the Central Limit Theorem.

Next, we fix k = 2 and plot β = 1 and 2 separately. As shown below, the shape remains the same. But if we look closer to the unit in the x and y-axis, we use different scales. As β increases (rate increases), the spread of the value decreases. Intuitively, when the event rate increases, the expected wait time decreases.

If both curves have the same scale, the range of values in the second diagram will be narrower.

Gamma distribution can model different shapes. Depend on the values of α and β, part of the equation is growing and part of the equation is decaying.

By definition, the PDF for the Gamma distribution is zero if x is negative. This is a nice constraint in modeling real problems since, in many problem domains, x (like height) can never be negative. Here is a plot of the Gamma distribution with different parameters.


The expectation and the variance of the Gamma distribution are:

Modified from source

Chi-squared distribution

Let’s say Z₁, Z₂, …, and Zk are k independent random variables that each with a standard normal distribution. We square each variable and add them together. We repeat the experiments many times and the squared results will have a distribution called chi-squared distribution 𝓧² and k is the number of degrees of freedom. The following is the mathematical definition and its distribution plot with different values of k.

Modified from source

Let’s illustrated it with an example with k to be 3. In each run, we sample 3 students from the freshman class. We normalized their weight, square it and add them together. Say their weights are (165, 175, 145) with normalized values (0.17, 0.5, -0.5). Its squared sum will be 0.53. We repeat the run 1000 times and plot the results. It should look like the light blue curve above with k=3.

A Chi-square goodness of fit test determines whether our sampled data is a good representative of the population. We can draw 100 student samples and compute the Chi-square statistic below. If the value is too high, it indicates it is a poor sample in representing the population.

The Chi-square distribution is skewed. But with no surprise, as the degree of freedom increases, the distribution looks closer to the normal distribution. In short, if the sample size increase, we should expect the distribution to approach a normal distribution.

There is another important application of the Chi-square test which is a test of independence, for example, whether gender impacts the choice of a pet. The details of the Chi-square test and the interpretation will be deferred to a later section.

Chi-square distribution is a special case of the Gamma distribution. With α = k/2 and β = 1/2, a gamma(α, β) random variable is a chi-squared random variable with k degrees of freedom.

Dirac distribution

Student’s t-distribution

Let assume a random variable X has a normal distribution with the standard deviation σ. If we sample the random variable n times, their mean and the sample variance will be:

The random variable holding the first result below will have a normal distribution.

However, if σ is unknown or the sample size is too small to estimate it, we usually model another variable with S. This random variable will have Student’s t-distribution with n-1 degrees of freedom. As n increase, it will resemble a normal distribution.

Source (The red curves are the t-distribution with different degrees of freedom and the blue is the Gaussian distribution)

Normalization factor

Many distributions can be formulated as

where the numerator is an unnormalized distribution. In many probabilistic models, we can model the unnormalized distribution first and then sum up the values for all possible x to create the normalization factor. This turns our result into a probability with p(x) sums up to one. For example, using a Graphical model, we model our problem with a Graph and the numerator are the factors derived from it.

In a Graphical model, this normalization factor depends on how we model the factors (θ) and we refer it as the partition function (Z(θ)). In general, the partition function or the normalization factor is hard to compute. It is not easy in summing up or integrating over x.

But for the well-known distribution, we can first focus on calculating the distribution parameters, like μ, σ², the normalization factor can be easily found from these parameters later.


Here is a quick recap of different distributions.

Here is a list of the PDF for different distributions.


Exponential family of distribution (Optional)

This is an advance topic so read it at your own pace. Gaussian, Binomial, Multinomial, Poisson, Gamma, and Beta distributions are all belonged to the Exponential family of distributions. The general form of this family of distribution is:

Let’s go through some of the definitions first. A(η) is the cumulant function and defined as:

The exponential eᴬ is the normalization factor to convert the value into a probability. Therefore, A(η) is also called the log partition function. η is the natural parameter (or canonical parameter). It is a single value or a vector of values that parameterize (model) the distribution. T(x) is called a sufficient statistic. In many distributions, like the Bernoulli distribution, it equals x. Once such statistics are collected, there are no additional information needed. In the Gaussian model, we collect the mean and the variance with T(x) = (x, x²).

This is hopelessly abstract so it will be better off to demonstrate it with an example. Consider the following Bernoulli distribution which takes the value 1 with probability α and the value 0 with probability 1-α. Let’s rewrite the Bernoulli distribution to be expressed in the exponential family form.

Then, the corresponding components in the exponential family form are:

We can repeat the exercise for other distributions. It will simply result in other choices of h, T, and A.

If we invert the function for η in the Bernoulli distribution, it becomes the logistic function. In short, we can recover the Bernoulli distribution parameter from its exponential family form’s parameter.

Instead of modeling Bernoulli distribution using parameter α, it can be represented as an exponential family with the natural parameter η.

For Binomial and Poisson distribution, the corresponding exponential family is:

Modified from source

So far, our distribution only needs one scalar parameter to model. For distributions that modeled by more than one parameter, η will contain a vector of values.

Modified from source

Why are we interested in such an abstract concept? As a preview, the probability density in many probabilistic models, like those modeled by the Markov Random Field MRF in the Graphical model, can be formulated as the exponential family.

Therefore, exponential family distribution can be a natural generic choice in modeling probability models.

Let’s take a look at the derivative of A(η) which has some important properties.

Modified from source

Its first-order derivative is the expectation of the sufficient statistics T(x). For T(x)=x, this derivative equals the mean of the distribution.

The second-order derivative A’(η) equals the variance.

In a Poisson distribution, E[x] (the mean) is not easy to compute with the traditional integration definition. However, differentiation is usually easier than integration. So with T(x) defined as x in the Poisson distribution, we can compute A’(η) instead (which equals E[x]).

To recap, many data distributions belong to the Exponential family of distribution and many ML solutions can be generalized with this family of distribution. We are interested in deriving easier and generic solutions with this family of distributions to address a large range of problems.

In our last example, we find the mean of a distribution from the distribution parameter. But often, we are interested in finding the distribution’s parameters by sampling data. But first, let’s introduce moments first.

Kth Moment (Optional)

A moment describes the shape of a function quantitatively. The kth moment, or the kth raw moment, of function f is defined as

This moment is called the moment about zero. But if we subtract x with the mean first, it will be called a central moment.

For the 2nd and higher moments, the central moments are usually used to provide better information about the distribution’s shape.

The kth moment equals the kth-order derivative of A(η).

If the function f is a probability distribution, the zero moment is the total probability (=1), the first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, etc …

Moment Matching

In this section, we are interested in estimating the distribution parameters from sampling data. In machine learning, we model the population density p with q* (say, a Gaussian Distribution). In moment matching, we calculate the moments from the sample data so the expectation of their sufficient statistic will match.

Assume all the data drawn is i.i.d., the maximum likelihood estimation will be:

Modified from source

i.e., μ (the first moment of the distribution) can be estimated by the average of the sufficient statistic from the sample data. This is called moment matching.

Consider a simple zero-centered distribution f.

Let’s see how can we compute the distribution parameter σ by sampling. The first and second theoretical moment is calculated as:

Modified from source

We can use the samples X to compute 𝔼() to estimate 2σ².

By linking the model and sample moment together, we get an estimation of σ (sampled σ).

Finding E(x) and E(x²) by integration is easy in the example above. But in general. it is not easy for many other exponential families of distributions, like the Gamma distribution. Here, we will reuse the property of A’ and A’’ to compute the distribution parameter.

The natural parameter and its reverse are defined as:

with sufficient statistics as (log x, x) and A(η) as

Using the derivative of A(η), we find the expectation of sufficient statistics.


By using the samples to computed the expectation values and the variance, we can reverse engineer the parameters α and β. Here, we provide a general mechanism in converting observed data into the distribution parameters of the exponential family of distribution.

Bayesian inference

Frequentist inference draws conclusions from the frequency of events. If we get two heads from flipping a coin twice, p(head) equals 100%. However, a frequentist will unlikely publish such a result because the sample size is too small. The result may be just by chance and it is too early to spot any trend.

Bayesian inference uses Bayes’ theorem to derive a posterior distribution from the likelihood and a prior belief. When new observations are available, we turn the posterior into the prior and compute a new posterior given the new evidence. Since the posterior is modeled as a certainty distribution rather than a point estimate, we can continue combining it with new evidence to form a new belief.

For example, the prior belief of a car position can be started by combining the dynamic model of how car moves and the previous measurements from the GPS. Or we can even start a prior entirely from a random guess, a gut feeling, or experience. Given the current sensor reading, we form a likelihood of our current sensor reading given different location assumptions. Using Bayes inference, we can derive the probability distribution P(H|E) of the current car location given the sensor readings. That is posterior.

We turn the posterior into the prior for our next iteration when new observations are made. If the sample size is small, the likelihood curve is wide with a lower peak. We have not drawn enough data to rule out many possibilities. Therefore, the posterior will resemble the prior belief if it is strong (narrow and peaked). When more data is collected, the likelihood will get pointy and the posterior distribution will get closer to the likelihood curve.

illustration purpose only

Frequentist v.s. Bayesian

Frequentist applies the maximum likelihood estimation to find an optimal model parameter in explaining the observations. The Bayesian put the spotlights on the model parameter θ and calculate the posterior for the model parameter using the Bayes’ Theorem.

Bayesian inference calculates the probabilities of different models given the observation. Of course, this can be extremely complex for high dimensions or large continuous space. Further simplification of the likelihood and the prior model can be applied. Or we can solve the problem by sampling or approximation.

Depending on the way how samples are collected, it may be easier to answer P(x|y) than P(y|x). Sometimes, the probabilities are easy to model in the opposite direction. For example, P(y | x, θ) and P(θ) may be easily modeled with Gaussian or Beta distribution. Below is an example of the Bayesian linear regression.

Linear regression

We ignore the denominator P(y |X) in the Bayes’ Theorem because it is not a function of θ and can be dropped when we optimize the posterior w.r.t. θ. For both P(y | x, θ) and P(θ), we model them with separate Gaussian models in the Bayesian linear regression. In general, P(y |X) or P(X) is often very hard to compute, so this is a very nice simplification in optimizing the posterior.

If you cannot solve P(y|x) easily, transform the problem with Bayes’ Theorem and try your luck in the opposite direction P(x|y).

In the Bayes’ Theorem, we have relatively large freedom in choosing the model for P(θ). But not every choice is equal and the choice impacts how easy to compute the posterior analytically. A prior is a conjugate prior if the corresponding posterior belongs to the same class of distribution of the prior. Since posterior is often used as prior in the next iteration, we can simply repeat the same math in calculating the posterior again, what a convenient. For example, if both the likelihood and the prior can be modeled by Gaussian, the posterior is also Gaussian and computed easily.

If the model θ can be modeled with a conjugate prior correspondence to a specific likelihood distribution, we can usually solve the posterior easily and analytically.

Let’s look at them in detail. But we will model the prior with a Beta distribution instead in the next section.

Bayesian inference with Beta distribution

For a Binomial distribution, we can model it with a beta distribution. If the likelihood is binomial or Bernoulli, we will choose our conjugate prior to being beta distributed. This choice allows us to have the posterior to be beta distributed and the calculation can be computed analytically easily.

A detailed example demonstrating the Bayesian inference with Beta distribution can be found here. Here is the skeleton on finding the posterior using beta distributions for the prior P(θ). The posterior P(θ |data) will be beta distributed and the math involved is just simple additions.

For example, we can have a=b=1 for the first prior (B(1, 1)). This creates a uniform distribution indicating that we have no specific information and like to start with any value to be equally likely. After collecting 10 samples with 3 positives, the posterior will be B(1+3, 10+1–3). The new B(4, 8) will be used as the new prior if more samples are collected. The posterior in the Bayesian inference will resemble the frequentist result since our belief is weak.

Otherwise, we can start with some prior knowledge based on past experience, knowledge or even gut feeling. However, if our belief is off, we need to collect more data to gradually reshape the posterior to the correct curve.

While we can start with a non-informative prior, the strength of Bayes inference depends on how easy and reliable to form our prior belief.

Let’s see how the Bayesian inference is different from a frequentist. In Bayesian, let’s have a prior belief for the infection rate of flu to be modeled as B(2, 6). This will be our first graph below. Let’s say we have just one lab result coming in and test positive for the new flu season. A naive frequentist will say the infection rate is 100% based on the sample. But we know this is scientifically unsound. A professional frequentist will reject the claim based on the sample size. But for a Bayesian, we can still draw some kind of conclusion using the Bayes inference as results are gradually coming in. In some perspective, Bayesian inference gives us a reasonable picture if our prior belief is sound. But be warned that the debate between Bayesian and Frequentist are ongoing as one can argue the prior is just some legacy or hard to establish.

Gamma distribution as a conjugate prior

We can use a Gamma distribution as the conjugate prior if the likelihood can be modeled by a Gaussian distribution.

The Gaussian distribution for our likelihood p(x|θ) can be expressed in the following form.

Apply the Bayes’ Theorem, we can derive the posterior in the form of Gamma distribution also.

Dirichlet — the conjugate prior of multinomial

Dirichlet distribution is the conjugate prior of multinomial.

Modified from source

The posterior is:

Modified from source

which is a Dirichlet distribution (Proof).

Dirichlet distribution is also the conjugate prior of Categorical distribution:

Modified from source

Recap on conjugate prior

Here are some other conjugate priors corresponding to the specific likelihood distribution.

Modified from source

Prediction & regularization

With Bayes’ Theorem, we calculate the posterior for the model θ given the observations. If we assume the model parameter θ to be a zero-centered Gaussian distribution, the prior p(θ) turns into an L2-regularization term in the objective function (details). Conceptually, p(θ) can be viewed as a regularization factor. It can penalize a cost function if θ deviates from our prior belief. As shown below, we can apply a pretty complicated model for p(θ) if we have some prior knowledge of what θ may look like.

To make a new prediction, we use the posterior p(θ | X, y) in our training as p(θ). Then we find the marginal probability p(y|x₀) by integrating over θ. This is Marginal inference. We compute the probability of a variable by summing everything else out.


Jacobian matrix & Hessian matrix

These matrices are the first-order and the second-order derivatives of f respectively.

This notation is called the numerator layout. Hessian matrix is symmetrical. The upper bound for a quadratic equation with the Hessian matrix and a vector v is

Below, we use a different notation called denominator-layout notation. It is the transpose of the numerator layout.

Here is the result in differentiating a vector and a matrix in such notation.

Matrix Decomposition

Graphical interpretation of the component analysis

We can represent a 2-D vector x by projecting x onto the x-axis and y-axis. So a data point can be represented as (xᵢ, yᵢ). The choice of x-axis and y-axis as the projected axis is pretty arbitrary. Instead, we can select a unit vector q and calculate the projection of x on q. The projection vector will be qqᵀx with its magnitude equals qᵀx.

In ML, we extract features from a high-dimension space into a low-dimension latent space (say with k-dimension). Conceptually, we project x onto k different vectors qⱼ. The choices of qⱼ are important. If it is done correctly, we can use fewer components in representing information. For example, if we choose q₁ and q₂ below, we may ignore the q₂ component (the blue dots). They may be too small and we can ignore them. However, that is not true if we pick the x-axis and y-axis.

SVD decomposes a matrix into independent components. All q selected in SVD are independent of each other (orthogonal), i.e. the extracted features are not correlated. Conceptually, SVD picks the first q that minimizes the least square error below if the rest of the components are dropped.

Modified from source

XXᵀ is symmetrical. The optimal q (named q₁) will be the eigenvector for XXᵀ with the largest eigenvalue λ or the largest singular value σ (λ = σ²). (proof).

Then we pick the next component based on the same principle with the condition that q are orthogonal to each other. Therefore, the selected q₂ will have the second-largest eigenvalue. We can continue the process until we run out of eigenvectors. But the description here is just for a demonstration. We don’t solve this problem sequentially. It simply helps us to understand this process allows us to extract the most important principal components of the underneath data.

Singular Value Decomposition (SVD)

SVD is presented differently in linear algebra. It is the same method but from a different angle. Any matrix A can be decomposed as (proof)

where U compose of uᵢ — the eigenvectors for AAᵀ and uᵢ are orthogonal (perpendicular/independent) to each other. (uᵢ is equivalent to qᵢ in the last section.) Similarly, V is composed of eigenvector vᵢ of AᵀA which is also orthogonal to each other.

From the equation above, A can also be written as

where uᵢ and vᵢ are unit vectors. So when we evaluate the importance of the decomposed components, we can ignore those terms with very small σᵢ.

If we keep only the topmost k terms with the largest σᵢ, we effectively reduce the rank of A into k, i.e. the extracted features are in k dimension only. We effectively reduce the dimension of the input with the consideration of the importance of each principal component. That is what PCA does.

Principal Component Analysis PCA

As shown below, PCA simply truncates those insignificant terms in SVD (terms with very small σᵢ).

Intuitively, two input features may be so correlated that you can create a single new feature to represent both. For PCA, we want to find k independent features to represent our data.

PCA Example

In ML, SVD decomposes a matrix containing training data into independent features. For example, the row of the matrix contains the movie ratings from a user. The column contains the user ratings for a movie.

If we pick the top K eigenvalues of AAᵀ, its corresponding eigenvectors is equivalent to the top K optimized qk vectors below:

Modified from source

Recall that we project x into these principal components qk. Finding the top K optimized qk, we reduce the dimension of x into K. We can interpret the projected vector is the kth latent factors of x.

We can concatenate qᵢ to form the matrix Q. We can derive the latent features of user by multiplying Qᵀ with a user’s movie ratings. (qᵢ is M × 1 where M is the number of movies, and Q is M × K.)

SVD discovers the patterns (principal components) on user ratings. We can imagine some of the principal components may represent the genre of the movie or the years of the release. For example, the first component in zᵢ may indicate whether a user likes comedy.

Probabilistic PCA

In SVD, we decompose X to USVᵀ. Probabilistic PCA models X≈WZ instead. We will use the EM algorithm (discussed later in another article) to learn W and Z which Z can be treated as the latent feature of X. Unlike SVD, W does not need to be orthonormal. Columns do not need to be of unit length or perpendicular to each other.

First, let’s assume the latent variable zᵢ is a zero-centered Gaussian distribution. With W, we can reconstruct the original data X by WZ where x is modeled by a Gaussian also.

Z is the latent variable θ₂ in the EM algorithm and W is θ₁. Our objective is

In the E-step, we compute the Gaussian distribution for q(zᵢ).

In the M-step, we optimize

Without proof here, the algorithm is:


Kernel PCA

From one perspective, PCA finds a set of vector q that maximizes qᵀXXᵀq. Since XXᵀ is symmetrical, q will be the eigenvectors of XXᵀ with the largest eigenvalues (details).

So the problem statement can be rewritten as finding the eigenvectors with the largest eigenvalues.

In a later article on ML algorithms, we replace XXᵀ with a kernel to map the input to a higher dimension. This allows us to create a linear boundary to classify data that is not linearly separable in the low dimension space (details). On the contrary, PCA is often considered as a dimension reduction technique. So both technique seems to go in the opposite direction. However, sometimes, we need to go big before go small. Going to a high-dimensional space allows us to cluster information with a simpler and clear boundary. Once the information is clearly cluster, it will be easier to map it to a lower-dimensional space. This is the motivation behind PCA kernel. Let’s start with the following equation.

After some manipulating, we get

Modified from source

So given the matrix K is holding the kernel result, we can find aᵢ by finding the eigenvectors of K. Let’s define the kernel function with a Gaussian function. The corresponding latent factor for x can be computed as:


Below is how we make a prediction for a new input x₀ using Kernel PCA.


Cholesky decomposition

The Cholesky decomposition of a Hermitian positive-definite matrix A is


A Hermitian matrix is a square matrix that equals its transpose conjugate. A transpose conjugate takes the complex conjugate of each element and then transpose the matrix.

A covariance matrix is symmetric (a special case for Hermitian if values are all real) and semi-positive definite. Cholesky decomposition is commonly used in ML for easier and more stable manipulation.

Moore-Penrose Pseudoinverse

For a linear equation system, we can compute the inverse of a square matrix A to solve x.

But not all matrices are invertible. In ML, it will be unlikely to find an exact solution with the presence of noise in data. But the solution for x can be estimated as,


Statistical Significance

Null hypothesis H₀ is a statement that there is no relationship between two measured phenomena, for example, no correlation between wealth and happiness. This idea can be further extended for statements that we want to nullify and repute. A null hypothesis will be rejected if the observed data is statistically significant, i.e. it is very rare to observe the collected data if the null hypothesis is true. For example, if we see 100 heads in flipping a coin 100 times, we can “nullify” the hypothesis that the coin is fair. Therefore, the alternative hypothesis H₁, one that contradicts with H₀, is likely true (the coin is biased). In practice, it is harder to quantify the relationship between two variables than computing the chance that the collected data happens by chance only. So a null hypothesis serves as better means to draw conclusions about two phenomena.

p-value (probability value) is the probability of the observed sample when the null hypothesis is true. A small p-value (typically ≤ 0.05 or ≤ 0.01) shows strong evidence against the null hypothesis, i.e. it is rare for that to be happened by chance.

For example, after collect 100 data points, we can calculate a correlation coefficient based on the data (say, the correlation between wealth and happiness).

As shown below, if the correlation is -0.25 for the 100 data points it collected, its corresponding PDF is about 0.012. Only 2.5% of the population may have a correlation smaller than -0.2.

Source (N=100 and no correlations)

Therefore, the null hypothesis is likely false.

p-value example

(This example is adapted from here.)

We want to determine whether a coin is fair. Here is the detail of the experiment with the coin turning up heads 14 times out of 20.


Since we have no assumption on whether the coin may bias towards head or tail, we will use a two-tailed test. The probability of having at least 14 heads is:


The two-tailed p-value will be 0.115 (2×0.058). This is higher than the alpha level 0.05 and therefore the null hypothesis (the coin is fair) is not rejected.


(This example is adapted from here.)

Suppose the mean and the standard deviation of a score in a general population are 100 and 12 respectively. Our sample has 55 data points with a mean score of 96. We want to ask if the sample’s score is significantly lower than the general population. The z-score is computed as:

After the normalization with SE, it is the number of standard deviations away from the mean. Therefore, it can tell how rare the observation may be.

Source: Wikipedia

Confidence interval

After conducting an experiment to collect a sample. We can use the sample data points to estimate a population parameter like the mean (called estimator). A confidence interval can be calculated as a range around this sample mean. A 95% confidence level means that in 95% of those experiments, its confidence interval contains the true mean of the population. In other words, 1 out of 20 chance that the confidence interval of an experiment does not contain the true mean.

Here is the skeleton in calculating the confidence interval for the sample mean.

Modified from source

We standardize our sample mean (X̄) with μ (the population’s mean which is unknown) and σ (the population variance which we assume it is known or if n is large, we can estimate it properly). By consulting a standard normal distribution table, we can find z to establish the confidence interval above.

When σ is unknown, we can use the following equations to compute the confidence interval for σ².

Modified from source

Fisher’s exact test

(This example is originated from here.)

Fisher’s exact test is a statistical significance test for the null hypothesis with contingency tables. For example, we can determine whether we can reject the null hypothesis that women and men are equally likely to study for the driver's written test. In practice, Fisher’s exact test is employed when sample sizes are small (like the values in the table are in single-digit).

Using hypergeometric distribution, the probability of the observed data in the table is computed as:

In a lower tail test, we add all the possibilities with the Men+studying count equals 1 and 0 to get the p-value.

Chi-square test

Chi-square test is a popular test in measuring the likelihood that the correlation in the observed data is by chance only, rather than some correlation between two variables. It gears toward testing that more samples can be collected.

We compute the Chi-square statistic with the formula above. We compare the real count from the sample and the expected count assuming no correlation exists. Here is an example below in determining whether gender impacts the choice of a pet.

In this example, we compute the difference between the observed count of males owning a cat minus the expected count if gender is not a factor. We square it, divide it by the expected count and compute the corresponding Chi-square value. In our table, we have four possible combinations (male-cat, male-dog, female-cat, female-dog). Therefore, we have four degrees of freedom and we need to sum up all four values to compute the Chi-square statistic.

For a two-sided test, we divide the given significance level α by two. For example, for α = 0.05, we can accept the correlation if the Chi-square statistic has only a 0.05/2=0.025 probability to be by chance only. Since the Chi-square distribution is non-symmetrical, we usually consult a table to see what is the corresponding Chi-square statistics for a specific probability value.

Modified from source

For example, with a degree of freedom equals 4, if the Chi-square statistic is greater than 11.1 in the upper-tail table, we will accept the correlation. Of course, we need to consult the bottom-tail table also to check if the Chi-square value is too small.

Exploratory data analysis

To explore the data, we may calculate the covariance between two variables or perform a scatter plot like the one below to spot trends.


For example, the green dots and the blue dots below are houses for SF and NY respectively. If we have a decision stump for elevation > 73 feet, the one satisfies this condition will likely in SF.

Modified from source (but the data can be outdated)

Probability Sampling

Simple random sample: Every data has an equal chance to be selected.

Stratified random sample: The data is divided into groups like male or female. Samples come from each group with members of each group selected randomly.

Cluster random sample: The data is first split into groups. Some groups are selected and we select its members randomly.

Systematic random sample: Data is ordered and select systematically.

Model Performance

Accuracy & Performance

Accuracy measures the percentage of predictions that are correct and the error rate measures the ratio of predictions that is wrong. For an un-balance class, this is not a good performance metric. For example, if a rare disease happens in 1 of a million people, we can design a test kit that always shows negative results and achieves 99.9999% accuracy. While accuracy is one of the metrics in analyzing model performance, other metrics may be needed.

Precision, recall, sensitivity & F1 score

Measurement metrics:

Precision is the ratio of the true positives over all positive predictions.

Recall (sensitivity/true positive rate) is the ratio of true positives over all positives.

Sensitivity (true negative rate) is the ratio of true negatives over all negatives.

Precision is important if false positive is costly. For example, starting chemotherapy for false-positive cancer patients is not acceptable. Recall is important if false negative is costly. For example, if we miss the detection of a new virus, we may start an epidemic. F1 Score tries to strike a balance between precision and recall.

Sensitivity is P(+|disease) while specificity is P(-|health). If it is expanded with Bayes’ Theorem, they have different denominators and therefore they do not complement each other.

ROC curve (receiver operating characteristic curve)

ROC is the plot of the TPR rate over the FPR rate. It shows the performance of a classifier at all classification thresholds.

Modified from source

AUC: Area Under the ROC Curve

When calculating the area under the ROC curve, we can evaluate different models regardless of the classification thresholds.




Let’s build a regression model with the following data points. R-squared compares how well models are doing.

First, we establish a simple baseline line model. The baseline model will naively predict the output to be the mean of the output of the sample points. i.e. y = f(x) = ȳ where ȳ is the mean of the sample data.

The total sum of squares for this baseline is:

Then we compute the residual variance of our new model as the squared error between the predictions and the ground truth.

The R-squared can be viewed as the ratio on explained variance.

If the model’s predictions match perfectly with the observed data, the sum of the square of the residual errors becomes 0 and R-squared equals 1. Hence, we can use it to compare the performance of models with 0 being the worst and 1 being the best.

To recap, R-squared or coefficient of determination is the proportion of the variance in the dependent variable that is predictable from the independent variables in a regression model. Mathematically, it is defined as:


Here are the common norms:

𝐿₀ measures the number of non-zero elements. 𝐿 is the Manhattan distance. 𝐿₂ is the standard Euclidean norm and 𝐿∞ measures the maximum absolute value of any elements.


Jaccard Similarity

The Jaccard similarity measures the ratio between the size of the intersection over the size of the union.

Cosine Similarity

Cosine similarity measures the angle between two vectors.

Pearson Similarity

Pearson correlation coefficient ρ measures the correlation between two variables.


Here are some free courses in ML.

Andrew Ng Course

John W. Paisley Course

Nando de Freitas Course

Mark Schmidt Course


Each section in the article can be spawned off to a lengthy article. So congratulation that you make it so far. Next, we will dig into more traditional ML algorithms.



Jonathan Hui