How do you know your spouse is happy or not? Any couple will tell you it can be hard. In machine learning ML, many internal states are hard to determine or observe. An alternative is to determine them from observable external factors. That is what HMM solves. For example, in speech recognition, we listen to a speech (the observable) to deduce its script (the internal state representing the speech). First, let’s look at some commonly-used definitions first.
Markov process/Markov chains
A first-order Markov process is a stochastic process in which the future state solely depends on the current state only. The first-order Markov process is often simply called the Markov process. If it is in a discrete space, it is called the Markov chain.
The assumption of the Markov process may not be true in reality. But even it is not true, we can model extra states in the system to make it closer to the Markov process sometimes. In practice, the Markov process can be an appropriate approximation in solving complex ML and reinforcement learning problems. In addition, the probability of the transition from one state to another can be packed into a transition matrix like the one below:
This transition matrix is also called the Markov matrix. The element ij is the probability of transiting from state j to state i. Note, some literature may use a transposed notation where each element is the probability of transiting from state i to j instead.
The columns of a Markov matrix add up to one, i.e. the probability of reaching a state from any possible state is one. Once, it is defined as a matrix, we can use linear algebra and eigenvector to determine its stable state if existed, i.e. if we keep going for a long time, what is the probability of being at a particular state?
To solve that, let’s have a quick review of eigenvectors first. Eigenvector vᵢ and eigenvalue λᵢ of the matrix A fulfill the following relation. (Note, matrix A can have many eigenvectors.)
Our state at time k+1 is related to the previous step by the Markov matrix which the stable state is determined with k approaches ∞.
If uᵢ is an eigenvector of A,
where λᵢ is the corresponding eigenvalue for uᵢ.
Consider a vector v₁ in ℝⁿ. We can represent it using the eigenvectors of A. Using the equation above, the state of v at time step k+1 will become (the inner product of two different eigenvectors equals zero)
If v converges in time, v will have a stable state. uᵢ can be chosen to be unit vectors. In order for v to converge, eigenvalues λᵢ must be smaller or equal to 1. Otherwise, ‖v‖ will continue to grow.
A Markov matrix always has an eigenvalue 1. All other eigenvalues will have a magnitude smaller or equal to 1. Let’s say, the eigenvector u₁ (say [0.2, 0.5, 0.3]) has an eigenvalue of 1. Then, u₁ will be the stable state, i.e. we have 0.2, 0.5, and 0.3 chance to be in states 1, 2, or 3 respectively as the time approaches infinity. Note, the solution is independent of the initial state. We end up with the same target distribution regardless of where we start. (More details can be found here.) In theory, we can have more than one eigenvectors with eigenvalues equal to one. However, in practice, real problems usually have only one. In fact, if all elements in the matrix are greater than zero, there is exactly one eigenvector with eigenvalue equals to one.
Calculating an exact solution can be computationally intensive. Alternatively, Markov processes can be solved using random walks. Let’s say we drop off 100 shoppers randomly around the downtown area in San Franciso. We provide a transition matrix to show the probability of where the shoppers may head next in the current position. Eventually, we can spot where most interesting shops are located.
This strategy allows us to use local information to understand the general structure of the data. In many ML problems, it is much easier to collect. We don’t need to understand the structure of the data. We don’t need to understand how the city plans its shopping districts. Just look around and see what may be more interesting. In addition, the transition matrix is mostly sparse in many problems. This random walk concept is very popular in ranking or making product recommendations.
As we continue the iterations, our random walk will converge to the stable state that we are interested in. For very large scale problems, this may be easier to execute and to compute.
Hidden Markov Model (HMM)
In many ML problems, we assume the sampled data is i.i.d. This simplifies the maximum likelihood estimation (MLE) and makes the math much simpler to solve. But for the time sequence model, states are not completely independent. If I am happy now, I will be more likely to stay happy tomorrow.
In many ML problems, the states of a system may not be observable or fully observable. But we can get insights about this internal state through the observables. For example, if I am happy, there is a 40% chance that I will go to a party. But there is a 10% chance that I will be found at a party when I am sad too. With HMM, we determine the internal state (happy or sad) by making observations — where I was found.
HMM models a process with a Markov process.
- It includes the initial state distribution π (the probability distribution of the initial state)
- The transition probabilities A from one state (xt) to another.
- HMM also contains the likelihood B of the observation (yt) given a hidden state. Matrix B is called the emission probabilities. It demonstrates the probability of our observation given a specific internal state.
The complexity of the problem is that the same observations may be originated from different states (happy or not).
Two major assumptions are made in HMM. The next state and the current observation solely depend on the current state only.
Given all the observable and the initial state distribution, we can compute a pretty complex equation for the probability for the internal state xt P(xt| y₁, y₂, y₃, … , yt) at time t. For simplicity here, we will not include π in our equation. All equations assume π is a given condition, like P(y) → P(y|π).
The equation above uses the transition probability and the emission probability to compute the probability of the internal state based on all observations.
Depending on the situation, we usually ask three different types of questions regarding an HMM problem.
- Likelihood: How likely are the observations based on the current model or the probability of being at a state at a specific time step.
- Decoding: Find the internal state sequence based on the current model and observations.
- Learning. Learn the HMM model.
The remaining section details the solution. Read through it according to your level of interest.
Likelihood (likelihood of the observation)
Likelihood is to find the likelihood of observation Y.
This is computationally intense. But we can do it smartly to avoid summing all possible state sequences one-by-one and to share results for other computations. Otherwise, the complexity will grow exponentially.
Our strategy will employ a divide-and-conquer. In specifically, if we can express components recursively, we can break down the problem into intermediate steps and share results.
In HMM, we solve the problem at time t by using the result from time t-1 and/or t+1. A circle below represents an HMM hidden state j at time t. So even the number of state sequence increases exponentially with time, we can solve it linear if we can express the calculation recursively with time.
This is the idea of dynamic programming that breaks the exponential curse. At time t, the probability of our observations up to time t is:
Let’s rename the term underlined in red above as αt(j) (forward probability) and check if we can express it recursively.
Yes, it does. Since the current observation depends on the current state only, α can be expressed as:
i.e. the likelihood of the observations can be calculated recursively for each time step below.
Below is an example which we start with the initial state distribution on the left. Then we propagate the value of α to the right for each timestep. Therefore, we break the curse of exponential complexity.
Decoding (find the internal states — Viterbi algorithm)
The decoding problem is finding the optimal internal states sequence given a sequence of observations. Again, we want to express our components recursively. Given the state is j at time t, vt(j) is the joint probability of the observation sequence with the best state sequence. If we examine closely, the resulting equation is close to the forward algorithm except the summation is replaced by the max function.
So not only it can be done, the solution is similar to the forward algorithm except the summation is replaced by the maximum function. Here, instead of summing over all possible state sequences in the forward algorithm, the Viterbi algorithm finds the most likely path.
As shown below, finding the internal states that maximize the likelihood of observations is similar to the likelihood method. We just replace the summation with the maximum function.
In this algorithm, we also record the maximum path leading to each node at time t (the red arrow above). e.g. we are transited from a happy state H at t=1 to the happy state H at t=2 above since it is the most optimal (likely) path.
Let’s study the famous dishonest casino example. To cheat, a dealer switches between a fair dice and a biased die. But, to avoid detection, the dealer does it infrequently. A transition matrix is provided to model the chance of the dealer in switching between the fair dice or biased dice. We also provide a value distribution (the observed dice value distribution) for each dice. For example, for the fair dice, the dice value will be uniformly distributed — this is the emission probability.
We can use the algorithms described before to make inferences by observing the value of the rolling die even though we don’t know which die is used.
The internal state is which die is used. The line curve above is the likelihood to be at a particular state at time t. It fluctuates a lot. It gives a narrower view of what may happen at time t. The shaded area is the likely transition between the fair and the biased dice using the Viterbi algorithm. It is much smoother and reflects the transition better. It gives a global view on when states on transited. For the Viterbi algorithm, we find the most likely state sequence that explains the observations. For the likelihood, given different possible sequences, we sum them together accordingly to find the most likely state at time t.
Learning (Baum–Welch algorithm or Forward-Backward Algorithm — build the model)
Besides likelihood and decoding, the last algorithm learns the HMM model parameters λ given the observation. Here, we will use the Baum–Welch algorithm to learn the transition and the emission probability.
This task sounds mission impossible since both probabilities are highly tangled in our calculation. But, if we know the state occupation probability (the state distribution at time t), we can derive the emission probability and the transition probability. If we know these two probabilities, we can derive the state distribution at time t. This is the chicken and egg problem we discussed in the EM algorithm.
EM algorithm solves the problem in iteration steps. In each step, we optimizing one latent variable while fixing the others. Even for a continuous space, we work with limited provisions, and therefore, there are finite states to explore and improve only. Therefore, if we keep the iterations, the solution will converge. So it is not surprising that the Baum–Welch algorithm is an EM algorithm.
Let’s get familiar with the following new notations.
We are already familiar with α (forward probability) in the forward algorithm already. β (backward probability) is its close cousin in the reverse direction (the probability of seeing all the coming observations given a state i at time t). We can express this recursively similar to α but in the reverse direction (a.k.a. backward algorithm).
To learn the HMM model, we need to know what states we are to explain the observations the best. That will be the occupation probability γ — the probability of state i at time t given all the observations.
Given the HMM model parameters fixed, we can apply the forward and backward algorithm to calculate α and β from the observations. γ can be calculated by simply multiplying α with β, and then renormalize it.
ξ is the probability of transiting from state i to j after time t given all the observations. It can be computed by α and β similarly.
Intuitively, with a fixed HMM model, we refine the state occupation probability (γ) and the transition (ξ) with the given observations.
Here comes the chicken and egg part. Once the distribution of γ and ξ (θ₂) are refined, we can perform a point estimate on what will be the best transition and emission probability (θ₁: a, b).
We fix one set of parameters to improve others and continue the iteration until the solution converges.
The EM algorithm is usually defined as:
Here, the E-step establishes p(γ, ξ | x, a, b). Then, the M-step finds a, b that roughly maximizes the objective below.
Here is the recap of the algorithm: