Speech Recognition — GMM, HMM

14 min readSep 9, 2019

Before the Deep Learning (DL) era for speech recognition, HMM and GMM are two must-learn technology for speech recognition. Now, there are hybrid systems that combine HMM with Deep Learning and there are systems that are HMM free. We have more design choices now. However, for many generative models, HMM remains important. But regardless of the status, speech recognition helps us to understand the application of HMM and GMM in the ML context better. So stop the long face and let’s spend sometimes on it.

Automatic Speech Recognition (ASR)

Let’s get a high-level overview first. The diagram below is a high-level architecture for speech recognition that links HMM (Hidden Markov Model) with speech recognition.

Starting from an audio clip, we slide windows of 25 ms width and 10 ms apart to extract MFCC features. For each window frame, 39 MFCC parameters will be extracted. The primary objective of speech recognition is to build a statistical model to infer the text sequences W (say “cat sits on a mat”) from a sequence of feature vectors X.

One approach looks for all possible sequences of words (with limited maximum length) and finds one that matches the input acoustic features the best.

This model depends on building a language model P(W), a pronunciation lexicon model, and an acoustic model P(X|W) (a generative model) as below.

A pronunciation model can use tables to convert words to phones, or a corpus is already transcribed with phonemes already. The acoustic model is about modeling a sequence of feature vectors given a sequence of phones instead of words. But we will continue the use of the notation p(X|W) for the acoustic model. Just be aware.

The language model is about the likelihood of the word sequence. For example, “I watch a movie” will be more likely than “I you movie watch” or “I watch an apple”. It predicts the next word given the previous words. If we approximate it with a first-order Markov chain, the next word will depend on the current word only. We can estimate it by counting the occurrence of word pairs in a corpus.

By combining the acoustic model and the language model, we search for the text sequence with the maximum likelihood.

This approach sounds indirect and the search looks inefficient or impossible. But p(X|W) is much easier to model in speech recognition. The distribution of features for a phone can be modeled with a Gaussian Mixture Model (GMM). We will learn it with training data. The transition between phones and the corresponding observable can be modeled with the Hidden Markov Model (HMM). So if we can find an optimal way to search the phone sequence efficiently, this may not sound bad after all.

An HMM model composes of hidden variables and observables. The top nodes below represent the phones and the bottom nodes represent the corresponding observables (the audio features). The horizontal arrows demonstrate the transition in the phone sequence for the true label “she just …”.

In speech recognition, the observable can be represented by 39 MFCC features extracted from the corresponding audio frame. The good news is that with this HMM model, we do not need to search the phone sequence one-by-one. Otherwise, the complexity grows exponentially with the number of phones. With the Viterbi algorithm or other HMM methods, we can find the optimal sequence in polynomial time. We will come back to this later.

The diagram below is a possible realization of an Automatic Speech Recognition (ASR). Combining information on the lexicon, the acoustic model and the language model, we can find the optimal phone sequence with the Viterbi decoder.

Modified from source (O is the same as X here)

Let’s do a quick recap, we can model the acoustic model P(X|W) with an HMM. The arrows on an HMM model will represent phone transitions or links to observables. To model the audio features that we observe, we learn a GMM model from the training data. So let’s understand HMM and GMM more in the general context first.

Hidden Markov Model (HMM)

A Markov chain contains all the possible states of a system and the probability of transiting from one state to another.

A first-order Markov chain assumes that the next state depends on the current state only. For simplicity, we often call it a Markov chain.

This model will be much easier to handle. However, in many ML systems, not all states are observable and we call these states hidden states or internal states. Some may treat them as latent factors for the inputs. For example, it may not be easy to know whether I am happy or sad. My internal state will be {H or S}. But we can get some hints from what we observe. For example, when I am happy I have a 0.2 chance that I watch a movie, but when I am sad, that chance goes up to 0.4. The probability of observing an observable given an internal state is called the emission probability. The probability of transiting from one internal state to another is called the transition probability.

For speech recognition, the observable is the content in each audio frame. We can use the MFCC parameters to represent it. Let’s see what we can do with an HMM.

Likelihood with the forward algorithm

An HMM is modeled by the transition and the emission probability.

Given an HMM model is learned, we can use the forward algorithm to calculate the likelihood of our observations. Our objective is summing the probabilities of the observations for all possible state sequences:

But we have to do this smartly. We cannot sum over all possible state sequences one-at-a-time. It has exponential complexity.

Our strategy will employ divide-and-conquer. If we can express calculations recursively, we can break down the problem into intermediate steps. In HMM, we solve the problem at time t by using the result from time t-1 and/or t+1. A circle below represents an HMM hidden state j at time t. So even the number of state sequence increases exponentially with time, we can solve it linear if we can express the calculation recursively with time.

This is the idea of dynamic programming that breaks the exponential curse. At time t, the probability of our observations up to time t is:

Let’s rename the term underlined in red as αt(j) (forward probability) and check if we can express it recursively. Since the current observation depends on the current state only, α can be expressed as:

So it does have a recursive relationship. Below are the steps in calculating the likelihood of the observations given the model λ using recursion. Instead of summing each state sequence independently, we calculate α from time step one to the end (time T). If there are k internal states, the complexity will only be O(k²T), instead of exponentially.

Here is an example which we start with the initial state distribution on the left. Then we propagate the value of α to the right. We compute α for every state and repeat it for every timestep.

Next, given the HMM model, how can we find the internal states given the sequence of observations. This process is called decoding. This is particularly interesting for speech recognition. If we have an audio clip, the internal states represent the phones. Speech recognition can be viewed as finding these internal states given the audio clips.

Decoding (find the internal states — Viterbi algorithm)

Again, we want to express our components recursively. Given the state is j at time t, vt(j) is the joint probability of the observation sequence with the optimal state sequence.

So not only it can be done, the equation is similar to the forward algorithm except the summation is replaced by the maximum function. Instead of summing over all possible state sequence in the forward algorithm, Viterbi algorithm takes the most likely path.

Finding the internal states that maximize the likelihood of observations is similar to the likelihood method. We just replace the summation with the maximum function.

In this algorithm, we also record the maximum path that leads to each node at time t (the red arrow above), i.e. we backtrace the optimal path for each node. For example, we are transited from a happy state H at t=1 to the happy state H at t=2.

Learning (Baum–Welch algorithm/forward-backward algorithm)

Now, it comes to the hard part. How can we learn the HMM model? This can be done by the Baum–Welch algorithm (forward-backward algorithm) to learn the transition and the emission probability. This task sounds mission impossible since both probabilities are highly tangled in our calculation. But from some perspective, if we know the state occupation probability (the state distribution at time t), we can derive the emission probability and the transition probability. If we know these two probabilities, we can derive the state distribution at time t. This is the chicken and egg problem we discussed in the EM algorithm. EM algorithm solves the problem in iteration steps. In each step, we optimizing one latent variable while fixing the others. Imagine that each iteration step improves the solution. Even for a continuous space, we work with limited precisions and therefore, there are finite states to explore and improve. So, if we keep the iterations, the solution will converge.

So it is not surprising that Baum–Welch algorithm is a special case of the EM algorithm.

Let’s get familiar with the following new notations.

We are already familiar with α (forward probability) in the forward algorithm already. β (backward probability) is its close cousin in the reverse direction (the probability of seeing all the coming observations given a state i at time t). We can express this recursively similar to α but in the reverse direction (a.k.a. backward algorithm).

To learn the HMM model, we need to know what states we are to explain the observations the best. That will be the state occupation probability γ — the probability of state i at time t given all the observations.

Given the HMM model parameters fixed, we can apply the forward and backward algorithm to calculate α and β from the observations. γ can be calculated by simply multiplying α with β, and then renormalize it.

ξ is the probability of transiting from state i to j after time t given all the observations. It can be computed by α and β similarly.

Intuitively, with a fixed HMM model, we refine the state occupation probability (γ) and the transition (ξ) with the given observations.

Here come the chicken and egg part. Once the distribution of γ and ξ (θ₂) are refined, we can perform a point estimate on what will be the best transition and emission probability (θ₁: a, b).

We fix one set of parameters to improve others and continue the iteration until the solution converges.

The EM algorithm is usually defined as:

Here, the E-step establishes p(γ, ξ | x, a, b). Then, the M-step finds a, b that roughly maximize the objective below.

Here is the recap of the algorithm:

So given all the observations in the training data, Baum–Welch algorithm can learn the HMM model. However, remember to keep an open mind. In speech recognition, the problem is far more complex and many solutions do not scale well sometimes.

Acoustic model

In ASR, we can use a pronunciation table to produce the phones for the text sequence Y. Next, we need to create an acoustic model for these phones.

Decades of research have been done on phonetics. Experts can identify vowels and consonants from reading the spectrograms directly.

But again we need a much denser representation of the acoustic model so we can determine the likelihood of an audio feature vector X given the phone P(X|phone).

With MFCC, we extract 39 features from an audio frame. Let’s simplify the picture and assume that there is only one feature for each frame. For the state “sh” (phone), the value of this feature can be modeled with a normal distribution.

To extend the concept to 39 features, we just need a Multivariate normal distribution with 39 variables. The following diagram visualizes a bivariate normal distribution for two variables.

Here is the definition of the Multivariate normal distribution.

where Σ is the covariance matrix which measures correlations among variables. MFCC parameters have a nice property. There are relatively independent of each other. Therefore, the non-diagonal element of Σ can simply set to zero.

But, it is too hard to think multi-dimensionally. So we will stick with a 1-D example for illustration. The likelihood p(x| q) of the observed feature x will be calculated as how far it is from the peak of the normal distribution q:

Given different phones, we can calculate the corresponding probability density values and classify it as the phone with the highest value. To learn this Gaussian distribution, we can simply estimate from the training data point xᵢ.

These equations can be proven by maximizing the likelihood of the training data.

So this Gaussian model is easy to learn from training data and give us a nice likelihood model on P(x| μ, σ²). In the context of speech recognition, we can learn the Gaussian model (μ, σ²) for each phone. This serves as the likelihood probability. This also acts as the emission probability in an HMM.

Unfortunately, this concept is naive even we use a multivariate Gaussian Distribution. If this is true, it will be much simpler to learn a foreign spoken language. The likelihood is more complex than a single peak bell curve. To address this, we switch to a Gaussian Mixture Model (GMM). This allows the distributions to be multimodal, i.e. we allow a feature to have a few possible values. This provides the flexibility of variants in speech.

For example, the GMM on the right combines three Gaussian distribution with different weights to form a new probability density (a 3-component GMM). The model is still pretty dense, 6 Gaussian parameters plus 3 weights.

Intuitively, the feature value of a specific phone can be observed near one of the m modes. But some values may be more likely than others. So we introduce weights to indicate which are more likely. With the internal HMM state to be j, the likelihood of the observed feature vector is:

To learn a GMM, say for a 2-component GMM, we feed features extracted from the training data to fit the parameters of these two clusters. Conceptually, we start with an initial or random guess of these parameters. We find which cluster that each data sample should belong to. Then we re-calculate the cluster parameters according to the associated data points.

Yes, we will use the EM algorithm to iterate the solution until it converges. In EM, we use a soft assignment rather than a hard assignment. For a hard assignment, we assign a specific cluster that each data sample belongs to (a point estimate). In a soft assignment, it will be a probability distribution. So, it is the chance that a sample will belong to a cluster. Then we recompute the cluster parameters according to this soft assignment. Since we have covered this a few times, we will not elaborate on how it is trained further.

To recap, given a phone, we can learn the feature vector of the observable using GMM. This probability distribution allows us to compute the likelihood of our speech segment given a phone P(x|s) — this is also the emission probability given an internal state in HMM.

Vector quantization

Throughout the process, we try to model a denser representation of the acoustic signals. GMM is one popular approach. Alternatively, after we extract a training set of feature vectors from our corpus, we group these features into k clusters, say using k-means clustering. This creates a codebook with size k to encode audio frame.

With this index, we can start using it to train an HMM. Once the model is trained, we can use it to decode audio clips also. This method is called Vector Quantization and use it early research. But it is less popular compared to GMM. Therefore, we just want you to aware of it.

Thoughts

GMM models the observed probability distribution of the feature vector given a phone. It provides a principled method to measure “distance” between a phone and our observed audio frame.

On the other hand, HMM produces a principled model on how states are transited and observed. As the probability of the observations can be modeled with HMM as