Machine Learning — Algorithms

Jonathan Hui
56 min readApr 24, 2019
The left limb of the Lone Cypress is broken due to a big storm after the picture is taken.

In the previous article, we look into the fundamental of ML. In this article, we dig into more traditional ML algorithms including regression, classification, kernels, Gaussian Process, Bayesian linear regression, SVM, clustering and decision tree. Other topics including cost functions, regularization, MLE, MAP, Laplace approximation and Restricted Boltzmann Machine are also covered. We will go through some deep networks like LSTM quickly. In addition, we add some topics like normalization that may be more relevant to DL than ML. Again, the article is long. It covers topics that may take more than a semester in a college class. Please feel free to skip sections based on your level of interest. But we will be brief in explaining some basic ML algorithms at the beginning of the article.


Linear regression

Linear regression models y=f(x) with a linear vector w.

To add a constant bias, we add a new component with value 1 into the input.


In many ML algorithms or regression, we assume the input has been pre-processed and standardized.

Residual Sum of Squares & Weighted Sum of Squares

To train a linear regression model, we minimize a cost function. One common objective function is RSS (residual sum of squares). The corresponding regression is called the Least Squares Linear Regression.

Change of basis with Polynomial regression

To increase the complexity of a model, we can add or change the bases for the input. In the examples below, polynomial bases are added.

Without such transformation, it is hard to separate two classes of data if their boundary is non-linear. However, the added x₁² and x₂² components transform a non-linear decision boundary in the low dimension space into a linear boundary in the high-dimensional space. In theory, any boundary shape in low dimension can be transformed into a linear boundary in a higher dimension. After the transformation, we can classify data into separate groups with linear algebra.

The generic form of the linear regression is

where g can be any function including non-linear, logarithmic, etc ... It is often mistaken that Linear Regression models linear functions only. By expanding the basis of the input, the function f in y = f(x) can be non-linear.

In ML, feature engineers apply domain knowledge to construct g that resembles the relationship between the input and the prediction. However, it is hard and we easily overdo it. To correct the problem, we can apply an L1-penalty. L1 cost favors sparsity in w (explained later). This allows the training to select fewer base vectors in making predictions. This mitigates the problem when too many input features are included and the model can be easily overfitted.

Change of basis solution (Gram matrix)

With any change of basis, the general solution for the linear regression can be computed analytically with the Gram matrix below (with proof later).

Parameter-based v.s. non-parameter-based

For linear regression y = wᵀx, we fit the model with training data to find w. This type of model is called the parameter-based model. The function is pre-defined and our task is to find the model parameters θ that fit the data the best. For non-parameter-based model, we have no assumption on the function, we let the data to mold the function itself.

Next, we will study a non-parameter-based model called Kernel.


Previously, we generalize the linear regression as y = wᵀg(x). With the concept of the kernel method, we expand the idea further to explore the relation of x with other training data points x⁽ⁱ⁾. This allows us to mold a function based on the training data, rather than a pre-defined function.

Here, instead of making a prediction from wᵀx, we compute the similarity k between x and a training data point xᵢ. Then, we multiply it with the corresponding model parameter wᵢ. We sum up the results from all data points to make our prediction. Intuitively, we compute a weighted average of the training data labels. We put a heavier weight if the data is similar and a smaller weight if it is different. For example, in predicting the salary from your years of education, we use higher weights for data with similar years of education to you. Then we calculate a weighted average from the training data salaries.

The general definition of a kernel is:

In general, it maps xᵢ and xⱼ into a high dimensional space and computes their inner product to explore similarity. But in practice, we can simplify the equation to a form that simply measures the similarity of two data points. For example, we can use the Gaussian distribution below.

Next, we will detail the kernel method using RBF.

Radial basis functions (RBF — Gaussian Kernel)

RBF uses a Gaussian function for the kernel. We form a new basis z for the input where zᵢⱼ measures the similarity between training data point i and j. Then we use the training data to fit the model parameter w using an objective function like MSE.

A kernel function is symmetrical and the matrix is positive semi-definite. Often, we take advantage of this property in making mathematical proves.

To make a new prediction, we transform x into a matrix using the kernel function.

Intuitively, RBF computes how far the input from each training data point and multiply it with the corresponding weight wᵢ to make predictions. We are computing a weighted average of the output based on similarity.


Classification with regression

We can apply Bayes’ Theorem to check whether input x should belong to class y=0 or class y=1. For example, if the label y should be 1, we want p(y=1|x) > p(y=0|x). Alternatively, we can compute a linear regression value with the equation xᵀw + b. If the sign is negative, y belongs to 0, otherwise, it belongs to 1. As shown below, if p(x|y) is Gaussian distributed, the Bayes’ Theorem method is equivalent to the linear regression method.

However, xᵀw is unbounded while we want our predictions to be -1 or 1. The choice of the cost function has to be careful here. As shown below, the least square cost function with xᵀw as input can add penalty into the cost function even it correctly predicts the class for x with no ambiguity.

To solve this issue, we can use the logistic regression or switch to other cost functions like the hinge loss or the logistic loss.

Logistic regression

We can apply the result of the xᵀw to a logistic function (also called the sigmoid function) before making a decision. This squeezes the unbounded xᵀw between 0 and 1.

Classify the data points according to σ(wx) ≥ 0 or not.

As shown below, xᵀw + b actually model the logits which can act as the input to the logistic function to recover the probability. In this example, we use the linear regression xᵀw to compute the score for the logistic function. But other scoring methods including a deep network can be used.

Bayes classifier

Bayes classifier applies Bayes’ Theorem to classify x. We find the best value for y that gives the highest value for the posterior.

Cost Functions & Regularization

Mean Square Error (MSE)

MSE is popular because it is easy to compute with a smooth differential. But it is vulnerable to outliners. We penalize large error un-proportionally high. If we have bad noisy data, we pay too much attention to fit these data into the model which we should ignore in the first place. MSE cost function for linear regression is

The corresponding optimized analytical solution (Normal equations) is

The proof can be found here. Note, we will quote or use this equation very often.

MSE with an L2-regularization is called ridge regression. The corresponding optimal solution is

The following is the Gradient descent using the least mean square error (LMS).

L0, L1, Huber Loss or Regularization

Huber loss combines an L2 loss in the low error range with an L1 loss for the high error range. This combines the derivative smoothness of L2 and the smaller vulnerability to the outlier in L1.

L1 & L2 Regularization comparison

L2 has a smoother gradient and therefore the training can be more stable. But L1 is also popular because it promotes sparseness. As shown below, the optimal point after adding an L1 regularization favors parameters to be 0. It encourages spareness in avoiding overfitting.

If an L1-regularization is used with the least square error, the linear regression is called Lasso regression. While the choice depends on the specific data and problem domain, L2-regularization seems more popular but feel free to experiment both.


Let’s consider the following generic p-norm cost function. Can we optimize the objective function analytically or numerically?

  • If we don’t have any regularization (λ=0), the optimization can be solved analytically if XᵀX is invertible.
  • If p=2, an analytical solution is guaranteed.
  • For p≥1 but not equal to 2, we can use an iterative method to find the global optimal.
  • When p<1, the objective function will not be convex. In contrast to other choices, global optimal is not guaranteed. We can only find a local optimal with numerical methods.

Next, let’s go through some of the popular cost functions.


In ML, we want our predictions to match the ground truth which P is the ground truth and Q is the model prediction. For classification, P(xᵢ)=1 for the ground truth label i and P(xⱼ)=0, otherwise. Therefore, the cross-entropy is simply H(P, Q) = -log Q(xᵢ).

Softmax function

For multiple-class classification, we can apply a softmax function to the computed score before computing the cross-entropy. In linear regression, the score equals xᵀw for the ground truth class.

KL Divergence

KL-divergence measures the difference between two data distributions. So P is the ground truth and Q is the distribution predicted by the model.

Note: KL-Divergence is not symmetrical and it is always greater or equal to 0.

Jensen-Shannon Divergence

Jensen-Shannon Divergence is symmetrical.

Logistic loss

It measures the loss based on the logistic function. We can apply a logistic function to the output and use the cross-entropy loss or use the logistic loss directly without passing the output to a logistic function.


Hinge loss

Hinge loss penalizes classification mistakes only if the prediction is wrong or too close to the decision boundary. Hinge loss is used in SVM and the details will be discussed later.


Model Optimization

Maximum Likelihood Estimation (MLE)

MLE finds the model parameters that maximize the likelihood of the observed data xᵢ.

Negative log-likelihood (NLL)

MLE has a long chain of multiplication which is vulnerable to diminishing or exploding problems. To mitigate the problem, we take a logarithm function on the objective function. Since “log” is a monotonic function, maximizing MLE is the same as maximizing the log-likelihood or minimizing the negative log-likelihood (NLL).

In optimizing an objective, we can add (subtract) or multiply (divide) values as long as it is invariant w.r.t. the model parameters. We can also add a monotonic function. The optimal solution remains the same.

As shown below,

MLE, NLL and cross-entropy have the same optimization objective.

If p(y|θ) can be modeled by a zero-centered independent Gaussian Distribution,

we can prove that maximizing MLE is equivalent to minimizing MSE. Intuitively, the square part in the exponential function of the Gaussian distribution contributes to the reason that MLE is the same as optimizing the least square.

MLE is the same as MSE if the prediction is Gaussian distributed.

MLE with linear regression

Let’s say w is the ground truth weight for the linear regression model yᵢ = Xᵢᵀwᵢ + εᵢ. And εᵢ is the noise with a zero-centered Gaussian distribution with variance equals σᵢ².

Since the noise is Gaussian distributed, we can model y as

After collecting the data sample, we can train w using MLE. What will be the expected value and the variance of w using this method? (Feel free to spend some time to study the proof if you are interested in how expectation and variance are calculated in an estimator.)

As indicated, the estimated w using MLE objective is unbiased. However, if σ²(XᵀX)⁻¹ is large, the variance is high, i.e. we get very different models with different training datasets. In short, the estimated w values are sensitive to the noise in the measured data. i.e. even the estimation is unbiased, the trained model can still be bad if the estimated models have a high variance. Let’s get into more details.

Every matrix X can be decomposed using SVD as

The calculation above indicates (XᵀX)⁻¹ is proportional to the inverse of S² where S is a diagonal matrix with diagonal elements containing the singular values of X. If some of the singular values are small, the variance σ²(XᵀX)⁻¹ is high. By evaluating the singular value of X, we can understand the variance of the trained models.

Small singular values happen when the columns in X are highly related.

When information is highly correlated, the trained model is vulnerable to variance and it is easier to be overfitted.

To address this issue, we can add L2 regularization (Ridge regression) to constrain the model parameters. This regularization stabilizes the inverse numerically. Without it, even a small change in X will result in a large change in w*.

Here is the reason. The trained w using ridge regression is


When Sᵢᵢ is very small, λ comes into the rescue and the individual element in S⁻¹ will have a certain upper bound. Otherwise, it will be unlimited large. It magnifies the slight variation of the data and causes high variance in w.

The equation below compares the difference of w trained with ridge regression and least square alone.

Modified from source

Again λ limits how far the ridge regression will be from the least square solution. The expected value and the variance for the ridge regression is

MLE is unbiased but it may have high variance. Ridge regression is biased but has a lower variance. Because we can always tune λ, we can tailor a solution sensitive to the data distribution with the Ridge regression.

Optimization in alternative steps

In many situations, we need to break down machine learning problems into iterations of two alternative steps which one optimizes a sub-set of parameters and the other step optimizes the rest. The EM-algorithm discussed in the previous article is one of the examples. In these problems, we cannot optimize both sets of parameters simultaneously. Or it is too hard or very unstable. These two sets of parameters often have inter-dependency between them and we cannot compute their derivative simultaneously. Because of such inter-dependency, we optimize one set at a time (parameter θ₁ or θ₂) while fixing the other.


This algorithm takes advantage of the monotonic decrease in cost. Within certain precision, the space for θ₁ and θ₂ are in limited size. We cannot keep decreasing costs forever and therefore, the solution will converge.

In each step, the cost drops and the solution will converge to a local or global optimal. However, for a non-convex objective function, we will unlikely reach the global optimal but in practice, many problems still produce good results. In later sections, we will demonstrate this type of algorithms with examples.

Maximum A Posteriori (MAP)

Previously, we take the MSE for granted and use it as the cost function to train a model. Should we ever challenge this decision? Instead of MLE, MAP can be used as an alternative approach in optimizing a model. Indeed, MAP can justify other objective functions like the MSE. In MLE, we find θ for the highest value in p(y|θ, x). In MAP, we optimize θ to maximize p(θ | y, x).

But there is an important difference. MLE finds a point estimate for θ with the best likelihood for the observations. MAP computes a probability distribution for all values of θ using the Bayes’ Theorem.

To calculate p(θ | y, x), we apply Bayes’ Theorem.

If we assume the model parameters θ are zero centered, and p(θ) and p(y|θ) are both Gaussian distributed, we can prove that MAP arrives with the same objective as using L2 as the cost function with added L2 regularization.

In short, we apply a prior belief that θ is Gaussian distributed. Combined with a likelihood of y (observation) that is modeled by a Gaussian model, we reach the same objective as the Ridge regression.

Newton’s Method for optimization

We can apply Newton’s Method iteratively to locate the lowest cost.

It is an alternative to gradient descent which only uses the first-order derivative. Newton method is more accurate by using the curvature of f (f’’). However, it is computationally intense and not worth the effort in many problems. However, for objective functions with large curvature, this can be extremely helpful. To resolve the complexity issue, some kind of approximation is taken to make it scalable.

Taylor series

Using the Taylor series, we can expand and approximate a function f. In the example below, we expand f to the second order.

By differentiating the equation above w.r.t. ϵ, the optimal step ϵ* in minimizing f equals

Nash Equilibrium

The Nash Equilibrium is a condition in which no individual player can gain an advantage by changing their strategy unilaterally. For instance, in the case of two suspects being interrogated separately for a crime they are accused of committing, if they both choose to remain silent, they will receive a significantly reduced sentence. However, if one confesses while the other remains silent, the confessor will be absolved while the other will face a severe sentence.

In a non-cooperative context, the Nash Equilibrium can be attained if both Peter and Mary confess, leading to a jail sentence of 6 months, which is longer than the 1 month they would receive if they both remained silent. Despite this choice appearing irrational, neither player can enhance their outcome by changing their strategy. Hence, the Nash Equilibrium is established when both confess.

Generative learning v.s. discriminative learning

In DL, we design a deep network to predict a label y from data x. Generative learning instead builds a model for x given y. As shown by a Naive Bayes’ classifier, some problems are easier to model with p(x|y) than p(y, x).

In generative learning, we can apply Bayes’ Theorem to make a prediction of p(y|x) from a model of p(x|y). Generative learning is a study of p(x). With this data distribution, we can sample (or generate) data. In GAN, we create a generator to create x from sampling a noise z. It models p(x|z). In a Gaussian mixture model, we model p(x) using a mixture of Gaussian distribution.


Gaussian discriminant analysis (GDA)

Making inference with Bayes’ Theorem is hard for generic likelihood and prior functions in high dimensional space. However, if it is feasible to model them with a known family of distribution functions, we may manage to solve them analytically and easily. Consider a classification problem that groups object as apples or oranges. Let’s assume that both distributions p(x|y=apple) and p(x|y=orange) can be modeled by multivariate Gaussian distributions. For a 100 × 100 image, x will contain 100 × 100 × 3 features (RGB for each pixel). Here is the generic formula for the multivariate Gaussian distribution.

(Credit: the equations in this example is adapted from here.)

Using the training dataset, we compute the mean of each feature from the same spatial location (i, j) in each image. μ₀ below will be a 30,000-dimensional vector contains the means of these features given the object is an apple. We repeat the same process for the orange in calculating μ₁.

Here are the definitions.

However, we simplify the problem so we share the same covariance matrix for both conditional probabilities p(x|y=0) and p(x|y=1). This matrix will be computed using images from the combined data (apples and oranges). Because of this simplification, the shapes for both conditional distributions are the same. As shown below, we can draw a decision boundary easily between the two modes in dividing the classes.

Modified from source

We can remove our simplification and compute two covariance matrix separately. However, the decision boundary will be non-linear. In fact, subject to the shapes of Σ, it can get very complex.


Bayesian linear regression

Next, we are going to extend the concept for linear regression problems. Again, if it is feasible to model the prior and the likelihood with Gaussian distributions, we will do it because the posterior will be Gaussian distributed and can be calculated analytically.

Let’s get an example in using Bayes’ Theorem and linear regression to build a model. Our objective is to find model parameter θ using MAP.

With both prior and likelihood modeled as Gaussian, MAP becomes

The posterior above is actually Gaussian distributed even it is not obvious. So, let assume it is, we want to find the corresponding mean μ and covariance Σ.

First, we can expand a Gaussian definition into

By comparing it with our equation, we find μ and Σ to be:

Predictive distribution

To make predictions, we can marginalize over the posterior distribution, a.k.a. ∫p(y|w)p(w)dw.

Again, we can solve that analytically without the integral.

With Bayes inference (the diagram on the right below), the uncertainty increases as the predictions move further from the known circled training data.


For point estimation algorithm, we don’t estimate how certain about our predictions. Bayes inference gives us a bonus on how confident we are by continue tracking parameters and information with probabilities and distributions. Some people may argue that it reasons the uncertainty about the real world better from what we know.

Active Learning

Measurements can be expensive and hard to perform. Active learning uses the current training dataset in determining what should be measured next (x₀ y₀). We want to pick addition training points x₀ with the best return in building up a model.

Let’s go through it step-by-step.

  1. Calculate the predictive distribution above for the un-observed x₀ using the existing model.
  2. Pick an x₀ which σ₀² will be the largest and physically measure the corresponding y₀.
  3. Update the posterior p(w|y, X) with the training dataset including the old and the new measurement (x₀, y₀) using the Bayesian linear regression.
  4. Iterate the steps above again to build up a model.

The entropy of a Gaussian distribution is

Our selection of x₀ will drop the entropy of the posterior the most from its prior.

(Credit: Equations are adapted from Active learning in here.)

Laplace Approximation

As shown before, we can use Gaussian distribution to model the prior and the likelihood to calculate the posterior distribution. But it is not necessarily true that both are Gaussian. But even those are false, we can still assume the posterior is Gaussian distributed and calculate the distribution using the Laplace approximation. With the Laplace approximation, we assume the posterior to be Gaussian:

Let’s compute the posterior with MAP and define f as

Let’s take note that optimizing MAP above is the same as optimizing f. We integrate the denominator over w and therefore, it can be ignored since it does not change w.r.t. w. The exponential function in the numerator is a monotonic function. Therefore, optimizing MAP is the same as optimizing f above.

We can approximate f with the Taylor expansion up to the second order.

The secret of the Laplace approximation is the choice of z. For example, we can use gradient descent to find the optimal value w* as z. But our final objective is to model w with a Gaussian Distribution.

As mentioned before, optimizing MAP is the same as optimizing f. i.e. f (z) should equal to 0. The posterior distribution can be further simplified to


We can recognize that the R.H.S. fits the definition of a multivariate Gaussian distribution where

And if we use logistic regression σ(yᵢxᵢᵀw) to compute p, the negative inverse of Σ will be:

Let’s put them together. Here is the final algorithm where λ is a tunable regularization factor.


In step (1), we can use a gradient method, or other optimization methods, to find the optimized w*. Then, we can compute the covariance Σ of the posterior using the equation in step (2). Our posterior model, step (3), will be a Gaussian model with mean and covariance calculated in (1) and (2).

Gaussian Process

Conditional distribution in Multivariate Gaussian

Let x to be an n-dimensional vector which can be modeled with n means and a covariance matrix using the Gaussian Distribution below.

We can break x into 2 subvectors x₁ and x₂. We also sub-divide Σ accordingly.

Now, we want to express the conditional probability p(x₁| x₂) in terms of the sub-components above.

This gives us a foundation for making predictions. Consider x₂ to be the data in the training dataset. The general concept is if we want to make a new prediction for a new data point, we find its similarity with the existing training data points. Then, we extrapolate the result using the labels of the training dataset. Let’s give an example.


Consider we have the weight and height of two persons (person₁ (150, 66) & and person₂ (200, 72)). Let’s build a Gaussian model with this information. First, we use the height information to build a covariance matrix to quantify the similarity of person₁ and person₂. Then the weight of both person can be modeled with a Gaussian distribution with mean 175 (the mean of both people).

If we draw samples from this Gaussian Distribution, the expected value for these two persons should be 150 and 200 respectively. Our next major task is to build a function with height as an input in predicting weight.

First, we expand the Gaussian distribution above to make 2 weight predictions (f₁ and f₂) for person₃ and person₄. With the height information for all four persons, we compute a new covariance matrix. This matrix quantifies the correlation between these four persons. We can sample from the new distribution to make predictions on the unknown weight for the person₃ and person₄.

Why should we stop with two predictions above if we can make predictions for a large range of height values hᵢ?

When we sample from Ɲ, we got a function that maps height (from h₁ to hn) to weight, like the blue line above even though it looks odd or wrong. We can repeat the sampling in producing many functions. That is why the Gaussian process is considered as generating a distribution over functions.

For example, the left diagram below demonstrates 5 such sampled functions (5 curves in different colors) for another Gaussian process.


However, we are not interested in a particular function. Nevertheless, If we draw enough functions, we should get a good idea of what is the expected prediction for each value of x and what is its likely range. The diagram on the right above shows the expected prediction and the shaded area is within one standard deviation. We are not making a point estimation any more. For a particular x, we calculate the expected prediction (the blue line) and its variance (our likely range in our prediction). As shown above, as we move away from know training data points, the uncertainty of our prediction increases.

Gaussian process formula

Let’s work through the equations. Consider

where f is the distribution generated from the training data. f* is what we want to predict. Given the training data, we want to compute the conditional distribution for

Let’s model f* as

With the conditional distribution computed before, we can link f* with f as


μ and μ* are the means for f and f* respectively (μ = 175 in our example).

This formula can be further simplified if data is preprocessed and meet certain criteria. For example, if our training data set is large enough and we are predicting data with a similar range as f, we can assume μ = μ*. Furthermore, if f is zero-centered, we can simplify the equation further to

However, finding the inverse of K can be numerically unstable. Therefore, we decompose K with Cholesky decomposition first and solve the problem using linear algebra (details). A sample program in demonstrating the calculation can be found here.

This is a long journey but these are the equations we need. In summary, we use the height information to build a covariance matrix in quantifying correlations between people. We break up this matrix into sub-components and use that to recreate the weight distribution given the known labels in the training dataset.

Bayesian Linear Regression v.s. Gaussian Process

Let’s look into the similarity between Bayesian Linear Regression and Gaussian Process. Bayesian linear regression models its model parameter w with a Gaussian distribution. The label y is another Gaussian with Xw as means, Its variance comes from the noise with variance σ². For the Gaussian process, the variances come from the noise and the covariance matrix K.

Source (Gaussian Process) of the equations

In our example here, let’s further simplify the prediction y to be zero-centered. As we compare both predictions, they are very similar. The difference is one using the covariance matrix and one with the kernel. In short, the Gaussian process can be conceptualized as kernelized Bayesian linear regression. Instead of computing xᵢ xⱼᵀ, we replace it with a kernel.

Here is a summary of how new prediction y₀ is made using both methods.

Source (Gaussian Process) of the equations

Support vector machine (SVM — classification)

In Gaussian-based methods, we model the data distribution for each class (i.e. p(x|y)). Then, we can create a decision boundary in classifying data. SVM classifier adopts a different approach without discovering the distributions. It separates data points into one of the two corresponding classes through linear regression and hinge loss. Visually, SVM maintains the largest possible margin in separating both classes.

In the SVM example above, if wᵀxⁱ < 0, x belongs to the green. Otherwise, it belongs to the red. When hinge loss is applied, we care about points that are close to the opposite class only (support vectors). Points that are away from the margin boundary incur zero cost as we already classify them correctly and they are not close to the decision boundary. We do not need to do better for those points and therefore they should have no impact on the model to be built. SVM only penalizes a prediction that is wrong or close to being on the wrong side. The first term in the cost function of SVM is the hinge loss.

It adds a loss for every point x with yⁱwᵀxⁱ < 1. These are the points that violate the margin constraint. The second term is the L2 regularization.

The objective of SVM can be written as the optimization problem below. It maximizes xw ≥ 1 (for y=1) with the smallest w possible.

With this objective, SVM maximizes the margin of its decision boundary. Also, the trained w will be perpendicular to this decision boundary.

At a high level, SVM wants xᵀw ≥ 1 (for y=1) with the smallest possible w. This condition is equivalent to pw≥ 1 which p is the projection of x on w.

To minimize ‖w‖, we want to increase p as much as possible. The projection of one of the support vector (blue dot) along w is shown as green below. In the left diagram below, the decision boundary is not optimized. It has a smaller margin boundary than it can be. To increase p, we can rotate w clockwise.

Effectively, as shown above, it also increases the margin. While our SVM objective tries to decrease ‖w‖, the diagram on the right shows that it has the same effect as increasing the margin. i.e. our SVM objective has the same effect of maximizing the margin of the decision boundary.

In summary, if u and v are two support vectors having the shortest possible distance between two class boundaries, the optimal solution for SVM w* will lay on uv and this solution has the largest margin. We had gone through the reason why SVM maximizes the margin graphically. If you feel it is not foolproof, it is normal because the thorough mathematical proof is more complicated but quite interesting in applying the concept of convexity and Lagrangian multiplier.

To avoid overfitting, we tune the regularization factor λ to mitigate noise in the training data. The diagram below shows the effect on the decision boundary with or without regularization. If we have noise in data, regularization can help us to eliminate outliers.

SVM with kernels

We can further modify SVM to use a kernel output as the input.

The mathematical model for the SVM with constraint violation and kernel can be found here.

Here are the decision boundaries for the Iris flower classification with different kernels.


Unbalanced class

In fraud detection, the amount of fraud is small compared to the number of transactions. This creates an unbalanced class that leads to unbalancing training. In specific, the total loss from the major classes will have a better say on the trained model. In fraud detection, identifying these minority classes is important. To correct that, we can boost the loss from these minority classes with a larger weight. As shown below, with a weighted class, it can correctly classify the red dots below better.



In general, to train a model, we standardize the input features first to make the model easier to train.

Let’s study more normalization techniques. However, these techniques may be more relevant to DL than ML.

Batch normalization

Why do we perform normalization on the input only if we can normalize input in every layer in a deep network?


The mean and the variance used in the normalization is calculated from the mini-batch. This assumes that each subsequent layer in DL is close to i.i.d. This is a very bold assumption and in general not true. But in many computer vision problems, this is fair enough and bring tremendous improvement in model training.

Batch normalization introduces trainable parameters γ and β. This normalization can be undone if γ=σ and β=μ above. At the beginning of the training, we initialize γ=1 and β=0 to take full advantage of the normalization in early training. Eventually, these parameters will be learned during the training in settling to whatever values that it should be.

There is one thing that we need to be careful. Each layer is trained with the mean and the variance of the training data. So, during inference, we continue to use the mean and variance of the training dataset. This maintains the statistics that we use to train the model.

Weight normalization

In batch normalization, we normalize the input into the same scale for different dimensions. In weight normalization, instead of normalizing the input to the next layer, we normalize the weight. This is an attempt to improve the solution convergence.

Now, we separate the weight training into learning the scale and the vector direction for w separately. i.e. instead of training w directly, we train a scalar g and a unit vector for v to derive w.

This allows us to control the magnitude of w and provides an alternative way to control the stability of the training by controlling the magnitude of w.

From some perspective, Batch normalization assumes features to the next layer have the same distribution from one training data point to another. Therefore, we use the mean and variance calculated from the batch sample to normalize data. While it is appropriate for CNN, it is questionable for RNN or reinforcement learning which features in each layer or subsequent layer can be highly correlated in a time sequence model. Weight normalization has no dependency on the mini-batch or assumptions on the input. Therefore, it is more suitable for recurrent networks or reinforcement learning.

Layer normalization

Layer normalization is another normalization method addressing the shortcomings of Batch normalization in RNN. The key question asked here is what data will be used to calculate the mean and the variance in normalizing features.

In Batch Normalization, it is calculated from each feature within the batch sample. In Layer Normalization, it is calculated from features of the data point. So if a layer has k input features, the mean and variance are calculated from the k features of a sample point. As Geoffrey Hinton points it out

Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. This suggests the “covariate shift” problem can be reduced by fixing the mean and the variance of the summed inputs within each layer. We, thus, compute the layer normalization statistics over all the hidden units in the same lay …

In short, in training the next layer, Layer Normalization wants to keep the statistic to the next layer to be consistent.

For each normalization method below, the blue area is used to calculate each mean and variance. Let’s consider an input space of N×C×H×W in each layer. The batch normalization will have 1×C×1×1 means and variances that are common to all samples in a mini-batch. In layer normalization, there are N×1×1×1 means and variances. Each sample will have one mean and one variance. In instance normalization, there are N×C×1×1 means and variances.



K nearest neighbor (KNN)

In inference time, KNN finds the k nearest neighbor and use their labels to make predictions. For example, the black dot below should be classified as blue by a simple majority vote.

There are different metrics to measure the distance. The most common one is the Euclidean distance (L2). Other metrics include the Manhattan distance, cosine distance, editing distance for text, distance correlation, etc… (more metrics can be found in scikit) If we increase the number of neighbors k in making predictions, the bias increases while the variance decreases.

For example, let’s consider a classification problem in which we collect one-half of the population to classify the color of the black dot.

When k=1, the black dot prediction varies depending on the collected sample. But when k is large, it usually predicts the most common color of the population (blue). i.e. the variant is small but the prediction is biased.

K-means clustering

K-means clustering groups data points into K clusters using the following algorithm:

The centroid is re-estimated and the data points are re-cluster. The process continues iteratively until converge.

k=3 (3 clusters)


The cost function for the K-mean clustering is


where cᵢ is the cluster assignment for the data point i and μⱼ is the centroid of the cluster j. K-mean clustering algorithm improves the clustering cᵢ and μⱼ in alternative steps. Therefore, the cost decreases monotonically. However, the cost function is non-convex. Therefore, the algorithm guarantees reaching a local optimal only. Different random seeds will likely result in different clusters. But with reasonable random initialization of the initial guess of the cluster’s centroids, it should produce a high-quality result.

We can also repeat the process with different random seeds and use the total Euclidean distance for the corresponding centroid in selecting the best model.

Choosing K

So what is the proper value of K on the K-mean clustering? If we have N data points in the sample, we can train the model to have zero cost if we use N clusters. But this is not we want. The solution cannot generalize.

The choice of K may determine by an explicit goal. For example, we want to manufacture a t-shirt with size XS, S, M, L, and XL. So K will be 5. Alternatively, we can study how fast the cost is dropping. We can stop at the elbow point knowing that further drops in cost will have a far less return.

We can also monitor the gap in error between the training dataset and the validation dataset. If the error in the validation training dataset is flattened or increased, we should not increase K further.

K-median clustering

Use the median instead of the mean for clustering to avoid outliners. We also use L1 in measuring distance and cost to reduce the impact of outliners.

K-means++ clustering

In K-means clustering, instead of initializing K centroids randomly at the beginning, we randomly select one centroid at a time. For the next centroid, we still select it randomly but with a higher preference for data points further away from the previous means. Here is the algorithm.

Bisecting K-means

We split each cluster into two but only commit the one that reduces the cost the most.

Gaussian mixture model (GMM)

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions.

For K=2, we will have 2 Gaussian distributions G₁=(μ₁,σ₁²) and G₂=(μ₂, σ₂²). We start with random initialization of parameters μ and σ and compute the probability on which cluster that a data point may belong to. Then, we recompute the parameters μ and σ for each Gaussian distribution based on this probability. The data points are re-fitted to different clusters and the Gaussian parameters are re-calculated again. The iterations continue until the solution converges. Let’s detail this method using the EM algorithm.

Expectation–maximization (EM) algorithm (GMM)

The EM algorithm alternates between performing an expectation estimation (E-step) and a maximization (M-step). E-step computes

The probability p is computed by using the distance between xᵢ and μ of the corresponding cluster using Gaussian distribution.

We can compare q(aᵢ) and q(bᵢ) for xᵢ and choose which cluster that it should belong to. After all the computation, the training data is either assigned to a or b. This is called a hard assignment. Then we recompute μ and σ for each cluster according to the data points that it owns. However, this is not exactly what GMM does. Instead of assigning a data point to either one of the clusters, we track the probability q(aᵢ) and q(bᵢ), i.e. the chance of whether the data point i belongs to a or b. Generally speaking, we calculate a probability distribution of cluster assignment instead of a point estimate. This is called a soft assignment. We will use this distribution as a weight on how much influence xᵢ has in calculating μ and σ for the corresponding cluster. For example, the mean of cluster A is the weighted average of all data points with the weight equals q(aᵢ). A probabilistic model often has a smoother cost function with smaller curvature. This smoother surface stabilizes the training. By not assigning xᵢ deterministically to a cluster, we allow the training to converge faster. Here is the algorithm for GMM.

Here will have the proof of GMM using the equations from the EM algorithm. Below is the EM algorithm for any mixture models (other than a Gaussian).


For GMM, the data distribution p(xᵢ|θⱼ), for cluster j, is assumed to be Gaussian distributed.

To model the distribution of a random variable, we can collect samples to fit a GMM. The corresponding model will be more complex than a simple Gaussian model and remains dense (requires very few model parameters). That is why it is popular in modeling distribution, in particular for multimodal variables.

Self-organizing maps (SOM)

SOM produces a low-dimensional representation for high-density input. For example, we want to compute 39 index colors to represent the RGB values of an image. We start with 39 nodes with each node represented by a weight vector having the same dimension as the input (dim=3). So for each node, it is holding an RGB value.

Intuitively, we initialize all nodes randomly. Then we go through all data points in the training data set (every pixel of the image). We locate the node closest to the current data point RGB value. We update the RGB value of this current node and its neighbor towards this RGB value, with less impact as we move away. As we go through the training data, the value of the nodes change and more representative to the color of the image pixels of the training data. Here is a recap.

  • We random sample a data point (a pixel) from the image.
  • We locate a node with weight as close as the input using L2 distance.
  • We update the weight for surrounding nodes to match closer to the input. But the changes drop as we move away from the center.
  • We sample the next data point and repeat the iteration.
  • Eventually, the weight in each node represents the RGB value of the index color.

We use a learning rate to control the change and the change will decrease over distance and also decreases over time.

Density-based clustering (DBSCAN)

K-means clustering cannot discover manifold. Density-based clustering connects neighboring high-density points together to form a cluster. Intuitively, we gradually add neighboring points to a cluster. Such an approach allows the structure to grow slowly in discovering the neighbors as well as the manifold. As shown below, with manifold discovery, DBSCAN discovers the U shape structure that K-means cannot.

A data point is a core point if there are m reachable points within the radius r. A cluster is formed by connecting core points (the darker green) and its close neighbors (the lighter green).

The green cluster is formed by the following method:

If we have a lot of data points, compute the distances for one data point to another is expensive. Instead, data points are partitioned into regions. We connect points that are in the same or adjacent grid regions only.

Density-Based Hierarchical Clustering

The hardest part of density-based clustering is determining the radius r.

Alternatively, we can use a top-down hierarchy approach in breaking a large cluster into sub-clusters. So we can start with a large r and gradually decrease it as we break down a cluster in the hierarchical structure.

Hierarchical clustering

There are many other Hierarchical clustering approaches, including top-down or bottom-up. These are the algorithms.

Ensemble Clustering (UBClustering)

Ensemble Clustering runs K-means m times with different random seeds to produce m models. If a simple majority of models agrees two points should belong to the same cluster, we make sure they are. Ensemble Clustering starts with no cluster,

  • if two points should be together but do not have any cluster assignment, we create a new cluster for them.
  • if one is assigned but not the other, we put the un-assigned to the cluster holding the assigned one.
  • if they are assigned to different clusters, we merge both clusters together.

Canopy clustering

Clustering is expensive. But we can apply Canopy clustering to form some form of clusters before applying a more expensive method to refine it.



TF-IDF scores the relevancy of a document on a search query. To avoid exploitation of the search term, the score decrease if the terms commonly exist among documents. Therefore, it will score higher if the term is not common.

Decision Tree

Technically, we are dissecting the input space binary at each decision stump.

A decision tree is also called Classification and Regression Trees (CART). A decision tree is easy to interpret, easy to prepare data and fast for inference. Since we use a greedy method in selecting decision stump and the branching conditions are simple, the model may not be too accurate and with high variance. Overfitting is also a common issue in decision trees.

For regression, we split the data for a specific feature at a specific threshold. We try different features and threshold values combination by brute force and selects one that reduces the L2 error the most greedily. The prediction at the tree leaves will be the mean or the median of the data points remained in this branch.

In a classification decision tree, we can use different criteria in selecting a decision stump. For example, we can compute the classification error, the Gini index or the entropy (details later). The tree prediction will be the majority vote by the branching data set.

Before choosing the decision stump, we can create a scatter plot matrix to spot the correlations between properties to give us some insights on how to split the tree.


Next, we will discuss the details in selecting the decision stump.

Gini index

If we know 90% of the data belong to class i and the remaining 10% belongs to class j, we can make random guesses on the labels using the same ratio. In fact, we can do reasonably well. 82% of our guess will be correct. Gini index measures how likely we are wrong in making predictions with such a scheme. If the data is very predictable, the Gini index will be low.


Let’s say a class has 60 students. 40 of them are male and 22 of them attend the engineering school later while 8 out of the 20 female students attend the engineering school. The weighted Gini index is

To pick a decision stump, we choose one with the lowest weighted Gini index. If the data from each branch belongs to the same class, Gini index equals zero.

Information Gain

We can pick the decision stump with the highest information gain.

In other words, we want the smallest conditional entropy after the branching.

Intuitively, after the branch we want the entropy to be the smallest and therefore it is the most predictable.

For example, we want to predict whether we want to play tennis or not. Below is the calculation on the information gain if we choose windy (true or false) as the decision stump. First, we calculate the entropy given it is windy. Then we calculate that if it is not windy. Then we combine them for the conditional entropy H(Y|X). Then the information gain is computed as:

Here is the detail calculation for each step.

Example modified from source

Then, we move on to other possible stump and select greedily for the one with the highest value.

Variance reduction

Variance reduction works with continuous space. It measures the variance before and after the branch. We want to choose a decision stump with the largest variance reduction.

Modified from source


The following data points will take at least 4 levels of decision stump before the leaf node contains all red or blue dots. In the first 3 splits, each branch will contain an equal amount of red dots and blue dots. This proves that we cannot simply use the progress in determining whether we should stop further branching. If we do, we will stop creating decision stumps prematurely. In this example, we will see no progress until we reach the fourth level.

To avoid this problem, we can purposely grow the tree very deep and use pruning to trim the tree backward.


For the classification problem, we grow the tree until the data points for each branch belong to the same class. Or their input attributes are all the same so we cannot make a further distinction. Due to noise in the data, this approach overfits the model easily. As we mentioned before, we can prune the tree after it is built. We will remove the decision stump from the leaves towards the root if the validation performance for the testing data improves.

Other approaches include

  • Have a hard upper limit on the tree depth.
  • Stop further branching if the number of data points drops below a threshold.
  • Require each leaf node to have a minimum number of data points.
  • Set a maximum number of the leaf node.

Chi-square test

We can also verify whether the decision stump is statistically significant with the Chi-square test. We can use Chi-square to measure the difference in data distribution between its parent and its children. We evaluate whether the difference is mainly by chance. Within a certain confidence level (say 95%), if we think the difference in the distribution can occur by chance, we should take the stump out.


If Chi-square is smaller than a threshold T, we remove the decision stump because the difference in the data distribution is not statistically significant. We often consult a table or a Chi-Square distribution calculator in finding the value T for the corresponding confidence level (details).

We can also use the Chi-square value in selecting the decision stump. We will pick the one with the highest value.

Weak learners

In ML, we can build a weak learner (models with less complexity) to avoid overfitting. For decision trees, we can restrict the depth of the tree and the number of input features to lower the model complexity. However, these learners are likely biased. As the Art of War said if you have the right commands and disciplines, you can turn 180 concubines into military companies. In ML, we can bundle models together in making predictions. If they are trained independently such that the correlation between the trained model is small, they are unlikely making the same mistakes. A simple majority vote can cancel errors and create strong predictions. Next, we will talk about different methods in “bundling” weak learners.

Ensemble methods

There are many methods in creating different instances of training models. We can use different algorithms, different seeds & configurations, and different models saved at different iterations to build those models. If cares are taken in avoiding strong correlations, we can use a simple majority vote for classification problems. For regression problems, we can compute an average value, a weighted average, or a median value in making predictions.


In stacking, we use multiple ML methods to make predictions in the first round. Then we feed these predictions as input to another model for the final decision.

Bagging (Bootstrap aggregated)

Bootstrapping is a general technique on random sampling with replacement. We select data from a data set. But the picked data can be reselected again (replacement). i.e., we select our data from the same set of data again and again.

Bagging (Bootstrap aggregated) simply applies this technique in collecting data points with random sampling and replacement. The data points sampled can be repeated. If this sample dataset has the same size as the original training dataset, we should expect only 63.2% of data in the original dataset will be present in this sample dataset. Others are just repeating themselves.

Bagging reduces the correlations among trained models since the input samples are not identical now. If we also lower the depth of the tree, we can also avoid overfitting. By bundle models trained with different sampled datasets, we should expect a much stronger aggregated model in making predictions.

For example, we can have B bootstrap datasets each containing n data points. Each bootstrap samples with replacement from the original dataset of size n. After B models are built, we can use them to make B independent predictions. Our final answer can be the median of these answers.

Random forest

As the number of the bootstrap datasets increases, the performance of the ensemble trees plateau because many of the datasets are indeed highly correlated. Random forest uses bagging mentioned above plus one more trick. Each model is only trained with a subset of features. If the data has K features, we will randomly select the square root(K) of them. So before training a model, we randomly preselect a subset of features and train the model with this subset of features only. This further reduces the correlation of models and different models will find different ways of determining a class. Combining it with bagging, it improves the accuracy of the ensemble methods.


We should always learn from our mistakes. When we build our first model, we sample data from the training set uniformly. But for the next iteration, we sample data points that have the wrong predictions more frequently. With a better focus on where we are lacking, the second model should fare better in those areas. After k iterations, k models are built. And in each iteration, we keep track of the accuracy of each model and use this to compute a weight. During inference, we use these weights to compute a weighted average for the prediction using these models.


Here is the AdaBoost which uses this boosting technique.

Source of the algorithm

In general, for any misclassified data, we set a larger sampling weight for the next iteration.

The upper bound of the training error for the AdaBoost is (proof)


Even though we may learn a weak learner in each step, the error decreases significantly if we have learned enough models.

Gradient-Based Model

In AdaBoost, we learn from data points that fare badly. In the Gradient-based model, we model our errors in our predictions.

Let say we have a training data set y = [ f(100), f(200), …, f(800)] for input 100, … to 800. Now, we want to build a model for f. Our first model simple models f using the average of y. This is a good start but not an accurate one. Next, we compute the residual (error) in our prediction and model it with another model.

We continue to build simple models out of the residuals and our prediction will simply add all the model results together.

In a nutshell, the first model outputs the average of y. The second model predicts the residual error of the first model. The third model predicts the residual error left for the second model.

Let’s do it graphically. We will let you look into the explanation inside the diagram.

Here is one possible way of building a regression tree to model the residual error on the left with the stump condition as x > 500. In this model, when x > 500, it returns 22, otherwise -13.

But there are other possibilities. Instead of building a model for the residual, we build a model for the sign of the residual only. This is kind of odd but when we build enough residual models, it achieves similar results as our example before.

What is the difference between these solutions and why do we call these methods gradient-based? If we compare these gradient-based models with gradient descent, they look similar except with a different sign.

As shown below, we can start with a L2 or a L1 loss function. If we take the derivative of these loss functions, the gradients of these cost functions are just the negative of the term that we use in modeling the residuals in our previous examples. For the L2 loss, the equivalent is the residual error. For the L1 loss, the equivalent is the sign of the residual. So our methods are just the negative gradient of the corresponding selected cost functions. This is why it is called gradient-based.

Therefore, we can broaden the concept of the gradient-based method into other cost functions, like Huber loss. The general algorithm will be:


Local beam search

The searching space, like the GO game, is huge. In the Local beam search, we only explore the topmost promising searches.

Regression tree

Besides classification, we can use a decision tree to break up the model into sub-cases and solve each leave node with a regression model.

Use a linear regression model to make predictions inside a decision tree

Example of supervised & unsupervised learning

Supervised learning builds a model f in predicting a label given an input (y=f(x)). The training dataset contains pairs of the input and the corresponding label (xᵢ, yᵢ). Conceptually, supervised learning is the study of P(y|x). Unsupervised learning finds patterns and trends of the training data. No label is attached. We can generalize it as the study of the distribution P(x). In K-means clustering, we approximate P(x) with the K centroids. Unsupervised learning includes clustering, matrix factorization, and sequential models. Here are different algorithms according to this categorization.

However, the line drawn between supervised learning and unsupervised learning can be vague and often not very important.

Semi-supervised learning

In some cases, we can collect a lot of training data but we only have a limited budget in labeling a small portion of it. Semi-supervised learning uses this small set of labels to label those that are not labeled. Here are two different possibilities.

Semi-Supervised Learning, Weak Supervision & Transfer Learning

Weak Supervision employs inexpensive weak labels for Supervised Learning.


In Snorkel, label functions (rules) are created to label objects, for example, a large yellow vehicle is a school bus. These label functions can be treated as noisy “voters”. Depending on their agreements and disagreements in labeling data, we can estimate the accuracies and correlations of these functions. Using statistical models and these functions, we can produce a probability distribution of the labels for later supervised training.


Model complexity

We risk overfitting when we increase the model complexity, for example, the number of Gaussian models in GMM, the number of hidden states in HMM, or the rank in the matrix factorization, etc … As the model complexity increases, the training error diminishes. However, when we validate the model with a validation dataset, we will realize the validation error will increase instead. In DL problems, we often need a lot of data. Therefore, its dataset is usually large and we often dedicate a portion of data for validation purposes only.


In ML, we may collect a much smaller amount of data compared to DL problems. If we have enough data, we can use a dedicated validation dataset. Otherwise, we can use a rotated hold-out data set (cross-validation) or randomly select from the full dataset (bootstrap) for the validation data. We can use this data to pick the right model complexity that gives the best performance. To penalize the complexity, we add a regularization penalty term to the cost function. The following are the two possible regularizations.

Modified from source

In this formularization, we should only increase the model complexity if the cost dropped is greater than the number of model parameters increase. In BIC, we will also multiply K by ln N. For those that want to understand how to justify the term K ln N, here is the proof for BIC.

Energy-based model

ML has a long history in statistical physics. Some research papers are explained in the energy-based model. Even though the concept is not hard, their terms can be overwhelming. So it may be nice to have some understanding of it.

With an energy function E(x), the probability distribution for the configuration (state) x is defined as

In such a model, high energy E(x) leads to a low possibility for the state x. We model an energy function that maximizes the P(x) for the training dataset. We call Z the partition function. Basically, it normalizes the probability distribution between 0 and 1. It sums over the exponential function for all possible configurations of x. It is not easy to compute Z for a system with many states. An approximation is often applied to solve those problems. Next, let’s discuss how the energy function is modeled and defined.

Boltzmann Machine

To increase the expressiveness of an energy model, we add non-observed variables (hidden units in blue below) into a Boltzmann Machine. The white nodes are the visible units. They represent a data sample, the observed states or features, in the training dataset.

Hidden units are not observable and represent the latent factors for our observable units. Each unit is in one of the binary state sᵢ∈{0,1}. Yes, each state is binary only. Units are fully connected with others with the edge Wⱼ. The value for Wrepresents how two nodes are related. If node sᵢ and sⱼ are positively related, we want Wⱼ>0. If they are negatively related, we want Wⱼ<0. If there are independent, Wshould be zero. Conceptually, we want to turn on/off a unit according to the other units and their correlations. The energy function E(X) and probability distribution P(X) for the Bolzman Machine is defined as:

It represents the energy of the nodes and the connections in a linear form.

Whether a node is on or off depends on the values of their connected nodes.

By training the model with data, we establish the most likely values for W and bthat has the lowest energy for the training dataset.

From some perspective, Boltzmann Machine trains W to extract the latent factor for the observable units. However, Boltzmann machine is fully interconnected with each other and is very hard to train.

Restricted Boltzmann Machine (RBM)

RBM extracts latent feature h from the input v (the observable). Observable units are not interconnected with each other, so as the hidden units. It is more restrictive but the model is much simpler than a Boltzmann machine and much easier to train. The energy E(v, h) for the configuration (v, h) equals

To train the model, we optimize the maximum log-likelihood of the training data (log p(v)). The gradient for this objective function w.r.t. wᵢⱼ equals

We will use gradient descent with this gradient to update wᵢⱼ later. (Part of the proof in deriving the gradient can be found here.)

To calculate the first term below,

we sample v from the training data set. v is just a sample in the training dataset. Then we compute the probability distribution p(hⱼ | v) for the hidden node j using the equation below.

We compute the expected value by multiplying vᵢ with the corresponding hⱼ with v being drawn from the training sample. The second term is computed with Gibbs sampling and will be discussed later. With the gradient w.r.t. wᵢⱼ computed, we use gradient descent to update the model parameter wᵢⱼ.

The second term uses the current model to sample v and h to compute the expectation. i.e. we need to know p(v, h | w). It is hard to find it analytically. But we can estimate this expectation using Gibbs sampling.

The general idea of Gibbs sampling is that for a joint probability p(X = x₁, x₂, x₃, …, xᵢ, …), we fix all the values of X except one parameter, say xᵢ. In some problems, when we fix all parameters but one, p(…, xᵢ, …) can be very easy to find or model. We sample one value from this distribution and set xᵢ to this sampled value. Then, we pick another parameter and fix the rest again. We repeat the process many times and each time we get a sampled X.

As shown below, we plot the sampled X as the red dots. Each X changes one of the value in X = (x₁, x₂, x₃, …, xᵢ, …) only. As shown in the plot, the sampled data will resemble the joint probability p(x₁, x₂, x₃, …)!

Contrastive Divergence

So we will combine the Gibbs sampling concept with our gradient descent method to train the RBM. This is called the contrastive divergence. In RBM, we sample h and v in alternating step using our RBM model.

First, we start with any value for v. Then we can compute the probability for each hidden node using

Then we sample values for the hidden nodes to form h. In the second step, we use the sampled h to sample v again.

In the previous Gibbs example, we repeat the iterations many times. But it turns out, even we do it a few times, we can generate good model parameters in RBM. The exact algorithm for RBM is slightly different from what is described. The following is the algorithm in training and we will let you study it yourself.

Source (x’ is column form of x)

The equations used in training model parameters are:

where k is the kth of iterations.

Free energy

Another frequently mentioned term in the energy model is the free energy F below.

The gradient of the log-likelihood can be expressed in the form of free energy.

In a nutshell, it is the energy that it is “free” for us to manipulate in building an energy model (by manipulating the model parameters) to maximize the log-likelihood.


We use k × k convolution in each layer to extract feature maps. With stride or max-pooling, we reduce the spatial dimension gradually. Since there are many good resources on CNN, the description here is extremely brief. For more detail, here is one CNN article that I wrote my years ago.

Dilated convolution

To improve the receptive field, we can apply the dilated convolution to increase the area of coverage in computing the convolution result. For example, for l=2, every other pixel is skipped as the input to the convolution filter.

(Source) Standard Convolution (l=1) (Left) Dilated Convolution (l=2) (Right)


Modified from source 1 & 2

LSTM contains a series of LSTM cells. The cell t is responsible for processing input data xt at time t and each cell is connected to the LSTM cell in the previous time step. Each cell has an internal cell state ct and output a hidden state ht. LSTM uses three gates to control the flow of information. They are the forget gate, input gate, and output gate. They are shown with the symbol ⊗ in the diagram. Each gate has the form of

which applies a sigmoid function to the current data input and the previous hidden state after multiplying them with Wx and Wh respectively (each gate has different Wx and Wh). Each gate controls what information can be passed through by performing a piecewise multiplication with the gate control.

In a nutshell, the forget gate forgets part of the information from the previous internal cell state. The input gate control what part of the input and the previous hidden state will be processed to create the cell state. The output gate controls what part of the internal cell state ct will be output as the hidden state ht.

At each time step, we need to create a cell state. This is computed by forgetting part of the old cell state with added information from the current input and previous hidden — all controlled by the corresponding gate below.

Then we output the new cell state controlled by the output gate.

GRU is an alternative of LSTM for RNN (recursive neural networks). Compared with LSTM, GRU does not maintain a cell state C and use 2 gates instead of 3. At each time step, we compute a hidden state based on the previous hidden state and the current input.


Below is a typical process flow for an ML/DL project.

Clip Art Credit

Clip art source


If you have gone through all the materials, I congratulate you! It is hard. But you should feel good because it represents many types of research done in decades.