# Machine Learning — Learning Graphical Model

We have learned how to model a problem and make exact and approximated inferences with a Graphical model. It is time to put the last piece of the puzzle together. How can we learn the model parameters and the structure of a Probabilistic Graphical model?

**MLE**

One approach is to estimate the population density *p**(*x*) with *p *by minimizing the KL-divergence between them. However, we don’t know *p** enough to calculate the KL-divergence. Fortunately, minimizing the KL-divergence is the same as the Maximum Likelihood Estimation (MLE).

And MLE can be approximated if we know how to collect samples. This would not be hard for many ML problems. In short, our objective is finding a model that maximizes the likelihood of the empirical data.

This objective can be further generalized as optimizing a loss function w.r.t. the graphical model parameters. The loss function in our density estimation is simply -log *p*(*x*). This model can answer any queries, like *p*(*x₁, x₂*) or *p*(*x₁| x₂*), by computing the corresponding marginals in conjunction with the Bayes’ Theorem if needed. However, in many supervised ML problems, we are interested in the latter conditional query only, *p*(*y|x*). For these problems, we can design a more focused loss function to answer the query instead.

**MLE Example**

So let’s learn a few models using MLE. Say, we flip a coin 4 times with the first three times being the tail. So the optimal probability *θ** for getting a tail is:

As seen above, it leads to the answer that a frequentist may say “I told you so”. Here, is the general solution for the Bernoulli distribution for any possible outcome count.

Let’s further generalize it for problems that have more than 2 possible outcomes.

Optimize *L* above is not that obvious. But it can be done in a very simple alternative way. If we treat the problem as a simple extension to the Bernoulli distribution, we can group all outcomes except outcome *i* into one group. Therefore, the optimal probability for outcome *i* is simply:

But let’s generalize the problem to a multinomial distribution and optimize *L* directly.

Since all *pᵢ* add up to one, we cannot just throw away terms in *pᵢ* when we take derivatives w.r.t. other *pⱼ*. Instead, we add a Lagrange multiplier in the objective function to enforce the total probability should be one, i.e. Σ *pᵢ* = 1. Then we optimize *L* the same way before, just one more parameter *λ *and* *one more derivative* *w.r.t. to *λ. *The details of the Lagrange is not important and more information can be found here. But we should expect the optimal *pᵢ *will match the result we just discussed.

Thanks for the patience of going through things that look obvious to a frequentist. But they will lead to something important.

**MLE for Bayesian Network**

Now, we can apply MLE in finding the conditional probability parameters in each node. This should be pretty straight forward.

The objective function for the Bayesian Network is:

The optimal parameters for the conditional probability for each node are:

This is no different from our previous example.

Therefore, the model parameter is simply:

Again, we create the corresponding conditional probability table for each node by counting. Parameter learning for Bayesian Network is easy.

# MLE for Markov Random Fields MRF

The expressiveness in MRF, compared with Bayesian network, makes it very hard to learn its model parameter. Let’s recap something we discuss in an earlier article. We can formulate the MRF into a linear equation form *θ*ᵀ*f*(*x*) below.

This is nice because this is exactly like the exponential family of distribution. We can use it to model MRF.

Exponential family of distribution and partition function *z* are concave and convex w.r.t. *θ* respectively. This makes optimization easy. But while parameter learning is simple in Bayesian Networks, it is not that easy for the MRF. If you suspect the difficulty comes from the partition function, you are right. Let’s compute the gradient of the log-likelihood.

So the gradient is the difference of the expectation from the empirical data and the model. We are finding a model producing *x* just like the collected samples.

So we now have a mechanism to learn MRF parameters by MLE. Taking the derivative of the first term is easy. The second term makes inference w.r.t. *p. *It is often intractable and cannot be solved analytically. Any feasible solution will likely require us to change the loss function or to apply some approximation methods.

**Sum of log v.s. log of sum**

But before looking into the solution, let’s take another look at the Bayesian network and MRF. The loss function for the Bayesian network is in the form of the sum of the log.

i.e. the loss function is decomposed to the sum of log which each log term modeled by a single parameter*. *Therefore, its gradient can be computed analytically and independent on other parameter *θⱼ. *By setting the derivatives to zero, we solve *θ* and λ.

The loss function for MRF composed a second term that is the log of the sum. When the log contains a collection of terms, things get very complicated. Its gradient cannot be decomposed into the summation of simple terms containing single parameters. The gradient, in general, is intractable.

We point out this difficulty now because we will deal with many loss functions in the form of the sum of log or the log of the sum. The latter is often intractable and requires significant simplification or approximation.

In probability models, we want

- components of the probability distribution are easy to model with tractable distribution, and
- the loss function is in a form, like the sum of the log, that is easy to optimize and preferably convex.

# Sampling

One popular approximation method in ML is sampling. It is simple and easy to understand. The energy model and the joint probability distribution for the Restricted Boltzmann Machine (RBM) is:

We don’t need to understand RBM in details. Our focus is to estimate the expectation term E[*f*(*x*)]. The major problem here is how to sample *x* with distribution *p*. Once the collection of the data is collected, we can compute the expected value for *f*(*x*) easily.

The joint probability model for RBM is complex and hard to sample from. But given all variables fixed except one (say *hᵢ*), the conditional probability *p*(*hᵢ | v*) is simple to model in RBM.

In Gibbs sampling, say each sample point contains *D* variables. We don’t sample all the variables at the same time from the joint probability* p*(*x*)*.* Instead, we keep all variables the same from the last sample data point except one. We sample that variable using the conditional distribution. For example,

Since the conditional probability is easy to sample, we can collect the sample easily. And here is one possible implementation of this algorithm:

Unless the joint probability is in some crazy shape, Gibbs sampling often provides nice samples in representing the joint distribution.

There is a variant of the Gibbs sampling called contrastive divergence. In RBM, the gradient is computed as:

Again, this gradient is the difference between the expected value using the empirical data and the model. In Gibbs sampling, we may have a warmup period and then collect many data points (*h, v*) in estimating the expected value.

However, the change in the RBM model is small in each iteration step. Therefore, we can forgo the warmup and continue the current Gibbs sample instead. In addition, at least through the empirical result, we only need to collect a few samples (or just one) in calculating the expected value for the model.

# Pseudo-likelihood

(Credit: some equations in this section are originated from here.)

MRF is hard to learn compared with Bayesian Network. If we can make the loss function of MRF as simple to optimize as the Bayesian Network, we will win. Let’s see what is the problem from another perspective other than the partition function. In Bayesian Network, we model the joint distribution and expand it with the chain rule. Later, we simplify the conditional probability by discovering independency.

This eventually leads to the nice sum of log which composed of terms for individual parameter *θᵢ. *However, we do not model MRF this way. But can we do it similarly?

We can approximate the joint distribution as:

Note that, it is not the same as the expansion using the chain rule since we add additional parameters into the conditional clauses. This approximation is great news to MRF because the conditional part can simplify to its neighbors only (the Markov Blanket).

Now, we have approximate the loss function of MRF into a form that similar to the Bayesian Network, which is the sum of the log rather than the log of the sum. The approximate loss function is called the Pseudo-likelihood.

Let’s use an energy model in the MRF to calculate Pseudo-likelihood.

The corresponding gradient for the loss function w.r.t. *θᵢⱼ *for the edge *i, j* is:

First, we compute the derivative of log *Z*, and then we finalize the derivative of the loss function.

Without proof here, if we can collect a large amount of data point, the optimal solution in the pseudo-likelihood approaches the optimal solution for the joint probability.

**Conditional random fields (CRF)**

Now, let’s see how MLE can be extended to CRF.

The corresponding objectives for MRF and CRF are:

So instead of building a model for *p*(*x*), we build a model for *p*(*y|x*).

Again, the gradient is the difference of *f* between the empirical data and the model.

# Moment matching

The function *f* discussed before is an indicator function. While the detail is not important here, our log-likelihood objective trains the model marginals to match the empirical marginals.

This can be considered as the moment matching in general where we want to model a distribution *q* that match the moments of the sufficient statistics to the corresponding moments of the ground truth. In our example, we use MLE to minimize such difference.

Let’s expand the concept a little more. Let’s assume we have some function *f*. We want the expected value from the model to match the value computed from the empirical data.

There may be more than one distribution *p* that satisfies the condition. Our optimization can be formularized as finding a distribution *p* with the maximum entropy that satisfies the stated conditions. The maximum entropy allows us to have the best flexibility.

Let’s solve it with the Lagrange multiplier.

This comes to an exponential family of distribution. Originally, we start with a MRF model and find that an exponential family of distribution has the same form as MRF and therefore a natural choice to model MRF. With MLE, the optimal solution will match the moment of the empirical data.

In this section, we prove that to have a model that match the moments of the empirical data, our optimal solution will be the exponential family of distribution if we want the maximum flexibility that measured by the entropy, i.e. feature constraints plus maximum entropy leads to the exponential family of distribution in modeling our problem. This gives us a dual perspective between the model and the empirical data.

# Bayesian Parameter Estimation

Previously, we keep maintaining that Bayes’ Theorem is hard to handle and the partition function is intractable. While this is generally true, if we select the likelihood and the prior with specific well-known distribution combinations, the math is simple and the partition function can be derived from the distribution parameters directly. We can compute the posterior easily and analytically.

In Bayes’ Theorem, we start with a prior belief. This can be modeled with prior knowledge, domain expertise, or just gut feeling. In the worst case, we can create a uniform distribution for the belief. In short, this prior will be non-informative. In this case, the posterior calculated will resemble the results from the sample data. For example, if we get two tails out of five, the expected value for the posterior will be 0.4. On the other hand, if we have a strong prior, it will take a lot of sample data to override the belief. The Bayes’ Theorem allows us to inference based on the strength of our belief and the evidence. We will illustrate this in the next section.

A prior is a **conjugate prior** if the corresponding posterior belongs to the same class of distribution of the prior. When new evidence is collected, we use the posterior as the prior and re-use the formula in calculating the new posterior. In the next few sections, we will show what combinations of distribution form a conjugate prior.

**Gaussian Distribution**

If both likelihood and prior are Gaussian distributed, the posterior will be Gaussian distributed. Assume *μ *is unknow and *σ*² is known. We can estimate the parameter *μ* directly with:

So given a prior belief on *μ *which is Gaussian distributed with *μ*₀* *and *σ*₀², the posterior given the observation *D* can be modeled with another Gaussian modeled by *μn* and *σ*²*n*. If the prior is completely non-informative, *σ*₀ is infinity. *μn *will equal the average of the sampled data. On the other hand, if we have a strong prior, *σ*₀ approaches zero. *μn *will be *μ*₀ — the prior belief of the mean*. *This belief has a much stronger impact on the posterior compared with the new observation *x*.

**Gamma function — conjugate prior of Gaussian distribution (optional)**

A Gaussian distribution is modeled by its mean *μ *and variance (*σ*²). Instead of the variance, we can model precision (τ) where *σ*² = 1/τ. In this exercise, we assume *μ *is fixed and *σ*² is unknown. We are going to estimate it from the observed data *x*. We will also study how τ can be modeled at the same time.

We use an inverse Gamma distribution and a Gamma distribution to model the prior for the variance and precision respectively. Both are the conjugate prior of the Gaussian distribution with posterior:

**Normal-Inverse-Gamma (optional)**

So far, we assume either *μ *or *σ*² is known. Next, we want to learn both with *μ *having a Gaussian distribution and *σ*² having a Gamma distribution.

The solution will require knowledge in the variational inference. This is a large topic and will be discussed in here instead.

**Beta distribution — conjugate prior of Bernoulli distribution**

The Beta distribution is a conjugate prior of Bernoulli distribution.

The posterior can be expressed as a beta distribution with the expectation below:

Let’s give an example. Given *a* = 1 and *b* =1 for the prior, our first experiment has 3 positives out of 10 samples. What is the posterior? It equals beta(1+3, 1+7) = beta(4, 8) according to the formula above. For the second experiment, we have 4 positives out of 10 samples. The refined posterior will equal beta(4 + 4, 8 + 6) = beta(8, 14).

**Dirichlet Distribution is the Multinomial**

Dirichlet distribution is the conjugate prior of the multinomial distribution. We can consider it as the Beta distribution version for multiple possible outcomes.

The posterior can be expressed as a Dirichlet distribution with the expectation below:

**Gamma distribution-conjugate prior of Poisson distribution (optional)**

Gamma distribution is the conjugate prior of Poisson distribution. The posterior and the expectation is computed are:

# Chow-Liu algorithm

Finally, we will cover the last topic in discovering the structure of a directed Graphical model with the Chow-Liu algorithm. We particular select a tree structure because it is easy to learn greedily.

Information Gain (IG) is defined as:

Intuitively, it means if we know *Y*, how much should we know *X*.

So if we know nothing about *X* given *Y*, *H*(*X|Y*) equals *H*(*X*) and the information gain is zero. On the opposite when we know everything about *X* given *Y*, *H*(*X|Y*) equals zero and it has the maximum IG.

Using the collected sample, we can compute *p*(*x*), *p*(*y*) and *p*(*x, y*) empirically (by counting) and therefore the IG. We compute the weights between two nodes using the IG. Then we try to find the maximum weight spanning tree greedily. At each iteration, we add the largest-weight edge into the tree as long as it remains as a tree.

Then, we can select your favorite node as the root (say node 1) and draw a directed tree away from the selected node.

Here is the skeleton of why it works:

# Score-based learning

In the context of Bayesian Network learning, we find a graph structure that maximizes the score based on the likelihood of the data.

To avoid overfitting, we add a second term to regularize the model. Below is two common regularization criteria AIC and BIC.