Speech Recognition — Kaldi

Oct 7, 2019

Photo by Jakub Kapusnak
(According to legend, Kaldi was an Ethiopian goatherd who discovered the coffee plant.)

Kaldi is a speech recognition toolkit aimed at researchers. We can use Kaldi to train speech recognition models and to decode speech audio. So far, we have discussed different topics in our Speech Recognition Series. We would like to close the discussion with a solid example of training an Automatic Speech Recognizer (ASR). Nevertheless, if you are looking for instructions on using Kaldi, you should refer to the Kaldi documentation. It is the authority. The APIs and the command lines shown in this article can change frequently. Because we just want to demonstrate the process, we will not update them even if they change in the future. Each section has a different level of detail. Feel free to skip information according to your interest level.

OpenFst

Weighted finite-state transducers (WFSTs) are a popular way to model the components of an ASR system.

One popular open-source WFST toolkit is OpenFst, and it is heavily used by Kaldi. To use a WFST in OpenFst, we need to define the input and output symbols and the FST definition. The symbol file contains the vocabulary and maps words into unique IDs used by OpenFst.

Next, we define the FST in a separate file. Each line in the FST file identifies an arc (except the last line, which marks the final state). The first two columns identify the state the arc leaves and the state it enters. The third column is the input label and the fourth column is the output label. If it is a dash, the output is empty.

Here is the definition file for WFST which includes a weight for each arc.
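To make the text format concrete, here is a minimal Python sketch that parses such a definition. It assumes whitespace-separated columns, an optional fifth column for the weight, and a final-state line holding only a state number (plus an optional weight). The example labels and the names Arc and parse_fst_text are made up for illustration; they are not part of OpenFst.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    src: int      # state the arc leaves
    dst: int      # state the arc enters
    ilabel: str   # input label
    olabel: str   # output label ("-" stands for an empty output here)
    weight: float


def parse_fst_text(text):
    """Parse a text FST: arc lines have 4 or 5 columns, final-state lines 1 or 2."""
    arcs, finals = [], {}
    for line in text.splitlines():
        cols = line.split()
        if not cols:
            continue
        if len(cols) >= 4:
            # Missing weight is treated as 0.0 (an assumption of this sketch).
            weight = float(cols[4]) if len(cols) > 4 else 0.0
            arcs.append(Arc(int(cols[0]), int(cols[1]), cols[2], cols[3], weight))
        else:
            finals[int(cols[0])] = float(cols[1]) if len(cols) > 1 else 0.0
    return arcs, finals


# Hypothetical two-arc FST with weights, followed by the final state.
EXAMPLE = """\
0 1 hello hello 0.5
1 2 world - 1.5
2
"""

arcs, finals = parse_fst_text(EXAMPLE)
for arc in arcs:
    print(arc)
print("final states:", finals)
```

In practice we would not parse these files ourselves; fstcompile reads them together with the symbol tables.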

For faster computation during training, we compile these files to a binary FST representation with fstcompile.

Data Source

Before training, we need the raw data as well as other meta information to be ready. Speech recognition training starts with a corpus containing a collection of transcribed speech recordings. Many speech resources are available from the Linguistic Data Consortium (LDC), with fees starting in the $1K+ range for non-members.

Here is an example of identifying the recordings with a string ID. Then we link this ID to its transcript. In this example, each clip contains the pronunciation of three digits.

Here are other transcripts for the Resource Management corpus.

We also need to provide a lexicon as a pronunciation dictionary. Some lexicons, like CMUDict, are available for free. In the example lexicon file, each line (after the second line) contains the sequence of phonemes that makes up a digit (a word).

The training may involve extra text to build a language model. This helps us to generate a sequence of words that are grammatically sound.

This language model can be represented as an N-gram model.

Data

There are two ways data are read or written. In the “scp” (script) form, a “.scp” file maps a key to a filename or a Unix pipe.

In the “ark” (archive) form, data is stored inline in a single file.

Here is the specifier used in the command line that indicates how data is read or written.
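As a rough illustration of these two ideas, the Python sketch below splits an scp line into a key and a target, and splits a specifier such as "ark,t:-" into its type, options and target. It assumes that a target ending in "|" denotes a command whose output is read, and that options are comma-separated before the first colon. The file paths and function names are hypothetical, and this is not Kaldi's actual parser.

```python
def parse_scp_line(line):
    """Split an scp line into (key, target, is_pipe)."""
    key, target = line.strip().split(None, 1)
    # Assumption: a trailing "|" marks a command to run rather than a file to open.
    is_pipe = target.endswith("|")
    return key, target, is_pipe


def parse_specifier(spec):
    """Split a specifier like 'ark,t:-' into (kinds, options, target)."""
    head, _, target = spec.partition(":")
    parts = head.split(",")
    kinds = [p for p in parts if p in ("ark", "scp")]
    options = [p for p in parts if p not in ("ark", "scp")]
    return kinds, options, target


# Hypothetical scp entries: one plain file and one Unix pipe.
file_entry = parse_scp_line("utt1 /data/audio/utt1.wav")
pipe_entry = parse_scp_line("utt2 gunzip -c /data/audio/utt2.wav.gz |")

text_ark = parse_specifier("ark,t:-")
script = parse_specifier("scp:data/train/feats.scp")

print(file_entry, pipe_entry, text_ark, script, sep="\n")
```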

WFST

Kaldi uses OpenFst for constructing and searching weighted finite-state transducers (WFST). It is used by Kaldi for decoding. For example, in the second diagram below, it matches a sequence of phones in producing the word “data” (with phones: /d/ /ey/ /t/ /ax/) or “dew” (with phones: /d/ /uw/).

We can compose multiple transducers together. For example, we can compose a language model on top to encourage proper grammar.

Data preparation

Let’s look at the major steps in training an ASR based on custom audio data. Besides recording clips, we need to prepare meta-data for the acoustic and the language model.

Acoustic data

The acoustic data includes gender information on the speakers (spk2gender), the identifier and the audio data for each utterance (wav.scp), the transcripts for each utterance (text), the mapping between utterances and speakers (utt2spk) and the corpus’s transcript (corpus.text).

Language data

The language data includes the lexicon, and the non-silence and silence phone information.

Resource Management

Resource Management is a corpus containing simple and clean commands over a small vocabulary. To create test and training sets from the Resource Management (RM) corpora (catalog number LDC93S3A purchased from the Linguistic Data Consortium (LDC)), we run

The following directories are created under ./data with data segmented into training and testing separately.

The sub-directory “local” will contain:

The sub-directory “train” will contain:

That is, all the files mentioned before will be prepared automatically. However, not all files are in the format used by Kaldi. We use OpenFst tools to prepare them first. The command will be:

All files created with the *.txt extension are symbol tables in OpenFst format. Each maps a string to an integer ID. The command above will create a new folder called “lang”. The first two files created are words.txt and phones.txt. These files map a string ID to an integer ID used by Kaldi internally.
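A symbol table is simply a two-column text file. The Python sketch below shows the idea of assigning integer IDs to words. It assumes the common convention that ID 0 is reserved for the epsilon symbol <eps> and that the remaining words are numbered in sorted order; the real words.txt produced by the Kaldi scripts also contains extra entries that are not modeled here.

```python
def build_symbol_table(words):
    """Assign a unique integer ID to each distinct word, reserving 0 for <eps>."""
    table = {"<eps>": 0}
    for word in sorted(set(words)):
        table.setdefault(word, len(table))
    return table


def format_symbol_table(table):
    """Render the table as 'symbol id' lines, ordered by ID."""
    ordered = sorted(table.items(), key=lambda kv: kv[1])
    return "\n".join(f"{sym} {idx}" for sym, idx in ordered)


# Hypothetical vocabulary taken from a digits transcript.
table = build_symbol_table(["one", "two", "one", "three"])
print(format_symbol_table(table))
```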

Phones

Kaldi allows users to define and categorize different types of phones, including “real phones” and silence phones. All this category information helps Kaldi to build HMM topologies and decision trees.

prepare_lang.sh will transform the silence and non-silence phone files for Kaldi. The file context_indep.txt contains all the phones which are not "real phones": i.e. silence (SIL), spoken noise (SPN), non-spoken noise (NSN), and laughter (LAU). The file may contain many variants of these phones depending on the word position. For example, SIL_B is the silence phone that occurs at the beginning of a word.

The files silence.txt and nonsilence.txt contain the silence phones and nonsilence phones respectively. These are the phones we want to model in the project. In many configurations, the silence.txt is identical to context_indep.txt.

The file word_boundary.txt describes how each phone relates to word boundaries.

The disambiguation symbols (used to distinguish lexicon entries that share the same prefix) are contained in:

The optional silence file below contains a single phone that can optionally appear between words.

Another file called L.fst is created. It is the compiled lexicon in FST format. We can run the following command to view the lexicon in the text form.

Grammar

The next step is to create an FST for the grammar.

The FST for the language model can be found at data/lang_test/G.fst. Internally, it calls fstcompile to generate the WFST G.

Recap

Here are the commands to prepare the data, the language model and the grammar model for the Resource Management corpus.

Extract Features

Next, we will run the command to extract MFCC features for each utterance.

The first command below displays the data location of the training utterances. The last two commands display the location and the extracted MFCC features.

Here is another command for inspecting the extracted MFCC features.

The cepstral mean and variance statistics indexed by speakers can be located by:

For each speaker, these statistics are used to normalize the input cepstral features corresponding to the same speaker.
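The normalization itself is simple arithmetic. The pure-Python sketch below computes the per-dimension mean and variance over all frames of one speaker and then normalizes each frame to zero mean and unit variance. The tiny two-dimensional "features" are invented for illustration, and the small eps added to the variance is our own safeguard against division by zero, not a Kaldi setting.

```python
import math


def cmvn_stats(frames):
    """Per-dimension mean and variance over all frames of one speaker."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def apply_cmvn(frames, mean, var, eps=1e-10):
    """Shift each dimension to zero mean and scale it to unit variance."""
    return [
        [(f[d] - mean[d]) / math.sqrt(var[d] + eps) for d in range(len(f))]
        for f in frames
    ]


# Hypothetical features: two frames, two cepstral coefficients each.
speaker_frames = [[1.0, 2.0], [3.0, 6.0]]
mean, var = cmvn_stats(speaker_frames)
normalized = apply_cmvn(speaker_frames, mean, var)
print(mean, var, normalized)
```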

Monophone training

Next, we will train the monophone models with the command below. (In our speech recognition series, we call a monophone a context-independent phone.)

Here is the usage of train_mono.sh

Internally, gmm-init-mono initializes and outputs the model 0.mdl and a flat phonetic decision tree without any split.

The file data/lang/topo will be created. It contains the HMM topology for phones. The first state in each entry is the initial state. The last one is the final state with no transitions out of it.

Phones 21 to 188 are the phones that we care about in the transcript (“real” phones). Their topology has three emitting states. Emitting states are states that “emit” feature vectors. Phones 1 to 20 are for silence and noise. Their topology is more complex and uses 5 emitting states.

Next, Kaldi compiles the training graphs so that later training is faster. This generates one FST per training utterance. It encodes the HMM structure for the utterance. Each FST consists of arcs, and each arc has a from-state, a to-state, an input symbol, an output symbol and a cost. The input symbols are transition-ids, which include the pdf-ids that identify the GMM acoustic states of the audio frames. The output symbols are words. The cost includes the pronunciation probability in the lexicon. The transition probabilities of the HMM model are only added later during training.

Next, Kaldi performs the first alignment which assumes the HMM states are equally spaced. (i.e. each HMM state covers the same number of audio frames.)
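The idea of an equally spaced alignment can be sketched in a few lines of Python. Given the number of frames and the ordered list of HMM states for the utterance, each state receives roughly the same number of consecutive frames. This is a simplified illustration with hypothetical state names; the actual Kaldi tool works on the compiled training graphs rather than on a plain list of states.

```python
def equal_align(num_frames, states):
    """Assign frames to the ordered HMM states in (nearly) equal-sized chunks."""
    n = len(states)
    if num_frames < n:
        raise ValueError("need at least one frame per state")
    # Frame t goes to the state whose chunk contains t.
    return [states[min(t * n // num_frames, n - 1)] for t in range(num_frames)]


# Hypothetical utterance: 7 frames spread over 3 states.
alignment = equal_align(7, ["s0", "s1", "s2"])
print(alignment)
```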

Kaldi uses Viterbi training for the forced alignment, not the forward-backward algorithm. Afterward, Kaldi re-estimates the GMM acoustic models.
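To illustrate what a Viterbi forced alignment computes, here is a small Python sketch for a left-to-right HMM. It assumes we already have a log-likelihood for every frame under every state, that the path starts in the first state and ends in the last one, and that each step either stays in a state or advances by one. The function name, the single shared self-loop and forward probabilities, and the toy scores are all simplifications of ours, not Kaldi code.

```python
import math


def viterbi_left_to_right(loglikes, log_self, log_next):
    """Best state sequence through a left-to-right HMM.

    loglikes[t][s] is the log-likelihood of frame t under state s.
    """
    num_frames, num_states = len(loglikes), len(loglikes[0])
    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = loglikes[0][0]  # the path must start in state 0
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = score[t - 1][s] + log_self
            move = score[t - 1][s - 1] + log_next if s > 0 else neg_inf
            if stay >= move:
                score[t][s], back[t][s] = stay, s
            else:
                score[t][s], back[t][s] = move, s - 1
            score[t][s] += loglikes[t][s]
    # Trace back from the last state at the last frame.
    path = [num_states - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy scores: the first two frames fit state 0, the last two fit state 1.
toy_loglikes = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
best_path = viterbi_left_to_right(toy_loglikes, math.log(0.5), math.log(0.5))
print(best_path)
```

Unlike forward-backward, which weights every possible path, this keeps only the single best path, which is what makes the subsequent statistics accumulation a matter of simple counting.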

As shown, the HMM probability models (TransitionModel) and the GMM models are updated. The GMM components are also split.

Once the first pass is completed, we run multiple iterations in training the model. Inside the loop,

gmm-align-compiled aligns phone states according to the GMM models,
gmm-acc-stats-ali accumulates stats for GMM training, and
gmm-est performs maximum-likelihood re-estimation of the GMM-based acoustic models.

Once the model file is trained, we can examine the model (0.mdl) with:

To check out the Viterbi alignment of the training data, type:

It contains one line per training file. Each alignment is a sequence of transition-ids. A transition-id encodes the phone and the transition. To know more about the transitions, type:

Or, to read the alignment in a human-friendly form, type:

Let’s take a short look at the HMM topologies and the transition model before learning how to do the ASR decoding.

HMM topologies

The HMM topology definition below covers phone 1 to phone 8. It is loaded into a class called HmmTopology.

There are three emitting states above, state 0 to 2. For each emitting state, we will model the acoustic features that we may observe. The first state 0 is the start state. State 3 above is a final state with no emission.

An emitting state has a pdf associated with it (pdf: probability density function). PdfClass models this output PDF (the emission distribution). We can apply state tying explicitly, for example in context-dependent phones, to share acoustic models. Therefore, the value for the PdfClass does not need to be unique. States can have a self-loop transition and transitions to other states. The first value in <Transition> identifies the destination state and the second value is the transition probability. The second value is just an initial guess. A copy will be stored in the transition model and readjusted during training.

Transition models

The TransitionModel object in Kaldi stores the transition probabilities and the HMM topologies (HmmTopology). The graph-building code uses the TransitionModel object to get the topology and transition probabilities.

transition-ids

We will use the transition-ids for the input labels of the FST instead of the pdf-id (the acoustic state). “transition-id” identifies the pdf-id, the phone and the specific transition in the HmmTopology object. The extra information helps us to map from an input-label sequence to a phone sequence and other bookkeeping needed during the training of the transition probabilities.

Transition model training

The FSTs created have transition-ids as input labels. Kaldi performs Viterbi training that gives the most likely input-label sequence, in the form of a sequence of transition-ids (one transition-id per audio frame). Kaldi accumulates counts for different transitions and uses that to finalize the transition model.
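The counting step can be sketched as follows. For simplicity, each alignment entry here is a (source state, destination state) pair standing in for a transition-id. That is a simplification of ours: a real transition-id also identifies the phone and the pdf-id. The transition probabilities are then the normalized counts per source state.

```python
from collections import Counter, defaultdict


def estimate_transitions(alignments):
    """Turn transition counts from Viterbi alignments into probabilities."""
    counts = defaultdict(Counter)
    for alignment in alignments:
        for src, dst in alignment:
            counts[src][dst] += 1
    probs = {}
    for src, dst_counts in counts.items():
        total = sum(dst_counts.values())
        probs[src] = {dst: n / total for dst, n in dst_counts.items()}
    return probs


# Hypothetical alignment: state 0 loops twice before moving on to state 1.
toy_alignments = [[(0, 0), (0, 0), (0, 1), (1, 1), (1, 2)]]
transition_probs = estimate_transitions(toy_alignments)
print(transition_probs)
```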

Decode

Next, given an utterance, we want to find the most likely sequence of words. Before any decoding, we need to create the decoding graph (WFST graph) for inference.

Internally, it first composes L◦G. The graph will be determinized and minimized. This creates a graph to map phones into words.

Next, it composes C◦L◦G.

Then, Kaldi creates the H (HMM) transducer using the topologies, the decision tree, and the transition model. The Ha.fst (H) will have the self-looping in HMM removed. The input label will be the transition-id that includes the pdf-id which identifies the acoustic state.

Next, we compose H◦C◦L◦G followed by determinization and minimization. ε and disambiguation symbols are also removed.

Later, self-looping in HMM is added back.

After mkgraph.sh is done, we can decode the testing data with:

To see some of the decoded results:

Or using the .tra file to view the result:

Internally, gmm-latgen-faster is called. It generates a lattice which will be re-scored later to pick the most likely word sequence. The beam parameters below control the size of the search beam.

To take a look at some of the summary information of this decoding section, like the average log-likelihood per frame, type

to see the end of the log file.

Here is a sample for the decoding output.

Training Triphone System

As discussed before, to avoid nasty local optima, we train an ASR with multiple passes. We start by training a monophone system, which is easier to train. Then we move gradually to a triphone system seeded with information prepared in the previous training.

The script below involves two major training steps. The first step is to align the data using the monophone system. Then we train the triphone system with train_deltas.sh.

A monophone ASR will not be accurate enough for LVCSR. The phone context is important. But the use of triphones will increase the number of HMM states too much. One important step in the triphone training is to train a decision tree to cluster states that sound similar acoustically. So it can reduce the complexity by sharing the same acoustic model for multiple HMM states.

Phonetic decision tree

This decision tree is built with top-down greedy splitting. The binary decision stumps are selected from user-supplied or pre-defined questions including the type of the surrounding phones or the current state etc… For each split, Kaldi selects the question that gives the split data the highest likelihood under the models in the corresponding branches. For example, if we use two Gaussian models to model the data on the left branch and the right branch respectively, after the data split, the training data should have the highest likelihood according to the corresponding Gaussian model.
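The selection of one split can be sketched in Python as below. We assume one-dimensional features and a single Gaussian per branch, and we score each question by the gain in log-likelihood after splitting. The sample values, the question names and the variance floor are invented for illustration; Kaldi works on accumulated statistics of full feature vectors rather than on raw samples.

```python
import math


def gauss_loglike(values, var_floor=1e-3):
    """Log-likelihood of the values under their own maximum-likelihood Gaussian."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    sq_dev = sum((v - mean) ** 2 for v in values)
    var = max(sq_dev / n, var_floor)  # floor avoids log(0) for tiny branches
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * sq_dev / var


def best_question(samples, questions):
    """Pick the question whose yes/no split gives the largest likelihood gain."""
    base = gauss_loglike([v for _, v in samples])
    best = None
    for name, phones in questions.items():
        yes = [v for p, v in samples if p in phones]
        no = [v for p, v in samples if p not in phones]
        if not yes or not no:
            continue  # a question that does not split the data is useless
        gain = gauss_loglike(yes) + gauss_loglike(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# Hypothetical samples: (left-context phone, 1-D feature value).
samples = [("m", 1.0), ("n", 1.2), ("a", 5.0), ("e", 5.2)]
questions = {
    "left-is-nasal": {"m", "n"},
    "left-is-m-or-a": {"m", "a"},
}
choice = best_question(samples, questions)
print(choice)
```

Repeating this greedily on each resulting branch, until a leaf budget or a minimum gain is reached, gives the top-down tree described above.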

Part of the decision tree training process is configurable. Phones listed on the same line in roots.txt have a single “shared root” in the decision tree.

For example, in stress and tone-dependent systems, all the stress or tone-dependent variants of a particular phone will typically share the same root. In addition, all three HMM states of a phoneme (phone) should also share the same root. In practice, Kaldi generally has each tree root correspond to a “real phone”: phones corresponding to different variants of word position, tone or stress are grouped together and form a tree root.

Kaldi comes with predefined questions to be chosen for the decision stump but extra questions are configurable. For example, it can include questions regarding the word position of a phone (like whether it is the beginning or the end of a word).

After the monophone model is trained, we use it to align the audio frames with monophones. This is the align_si.sh call in the script.

Then we build the triphone acoustic model, gradually increasing its complexity. For the rest of this section, we will focus on the script train_deltas.sh, which is the core of the triphone training. The source code can be found here.

train_deltas.sh

(Credit: In this section, we will reuse some of the code comments in Kaldi for our explanation.)

In the beginning, acc-tree-stats reads in a feature archive, and the corresponding alignments. Then it accumulates and generates statistics for decision tree creation. Then sum-tree-stats will summarize the statistics for phonetic-context tree building.

Here is the output for acc-tree-stats

and sum-tree-stats which show 19268 states. The first entry below is for state 0 with phonetic context (0 10 22: <eps> /d /ih as in “DID”).

Then it clusters the phones into sets based on acoustic similarity and uses these sets to set up the questions, including questions about the HMM state.

Kaldi has options in controlling how the tree is built. As shown in roots.txt below, SIL, SIL_B, SIL_E, SIL_I and SIL_S all share the same root regardless of the word position of the SIL phone.

Next, Kaldi builds a set of phonetic decision trees with the maximum number of leaves defined in $numleaves.

Here is what the tree may look like.

Next, Kaldi reads the tree, tree accumulators, and topology. It initializes and outputs the model file 1.mdl. It writes the HMM topology and probability transition model (type TransitionModel) and the acoustic model (type AmDiagGmm) into the model file.

Then, Kaldi performs the GMM mixture split.

Then it converts alignments from the monophone model to the current tree model.

Next, we compile training graphs for the transcripts so the later training can be faster.

Then we run multiple iterations in training the model. We realign and remodel the GMM acoustic model similar to the monophone training.

This concludes how an ASR is trained. In the remaining sections, we will cover individual topics in Kaldi. First, we get familiar with a few terms and basic concepts that are constantly mentioned in the Kaldi documentation.

Decoding Graph Construction

Kaldi composes HCLG = H o C o L o G to form the decoding Graph.

G is an acceptor (input and output symbols are the same) encoding the language model.
L is the lexicon. It maps phone sequences to words.
C maps context-dependent phones into context-independent phones.
H contains the HMM definitions. It maps transition-ids into context-dependent phones.

The output will be determinized and minimized for optimization.
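To give a feel for what the "o" operator does, here is a minimal Python sketch of composing two transducers. It matches the output label of an arc in the first transducer with the input label of an arc in the second, and adds the weights (as costs). The dictionary layout and the labels are our own invention, and the sketch deliberately ignores epsilon labels, which a real composition such as the one in OpenFst must handle.

```python
from collections import deque


def compose(a, b):
    """Compose two epsilon-free transducers.

    Each transducer is a dict with 'start', 'finals' (a set) and
    'arcs' (state -> list of (ilabel, olabel, weight, next_state)).
    """
    start = (a["start"], b["start"])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        sa, sb = state
        arcs[state] = []
        if sa in a["finals"] and sb in b["finals"]:
            finals.add(state)
        for i1, o1, w1, n1 in a["arcs"].get(sa, []):
            for i2, o2, w2, n2 in b["arcs"].get(sb, []):
                if o1 == i2:  # output of the first must feed the second
                    nxt = (n1, n2)
                    arcs[state].append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Toy transducers: the first maps "a" to "b", the second maps "b" to "c".
t1 = {"start": 0, "finals": {1}, "arcs": {0: [("a", "b", 0.5, 1)]}}
t2 = {"start": 0, "finals": {1}, "arcs": {0: [("b", "c", 1.0, 1)]}}
composed = compose(t1, t2)
print(composed)
```

The composed transducer maps "a" directly to "c", in the same way that HCLG maps transition-ids directly to words.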

Disambiguation (optional)

To make determinization possible, disambiguation symbols are inserted and later removed. Disambiguation symbols (symbols #1, #2, #3) are inserted at the end of phoneme sequences in a lexicon when a phoneme sequence is a prefix of another phoneme sequence. They differentiate such sequences and ensure that the L o G composition can be determinized. Symbol #0 is added to the backoff arc in the language model G. This ensures G is determinizable when epsilons are removed. Other details will be skipped here. Many of them are needed to ensure that the WFST can be determinized.
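The lexicon part of this can be sketched in Python as follows. The rule used here is a simplification and partly our assumption: a pronunciation gets a symbol when it is a proper prefix of another pronunciation, or when several words share exactly the same pronunciation (homophones). The words and phones are made up, and Kaldi's own script may number the symbols differently.

```python
from collections import Counter


def add_disambig(lexicon):
    """Append #1, #2, ... to pronunciations that would otherwise be ambiguous.

    lexicon is a list of (word, phones) pairs, with phones as a tuple.
    """
    prons = [tuple(phones) for _, phones in lexicon]
    counts = Counter(prons)
    # Every proper prefix of every pronunciation.
    prefixes = {p[:i] for p in prons for i in range(1, len(p))}
    used = Counter()
    result = []
    for word, phones in lexicon:
        pron = tuple(phones)
        if counts[pron] > 1 or pron in prefixes:
            used[pron] += 1
            pron = pron + (f"#{used[pron[:len(phones)]]}",)
        result.append((word, pron))
    return result


# Hypothetical lexicon with a homophone pair and a prefix pair.
toy_lexicon = [
    ("red", ("r", "eh", "d")),
    ("read", ("r", "eh", "d")),
    ("a", ("ax",)),
    ("about", ("ax", "b", "aw", "t")),
]
disambiguated = add_disambig(toy_lexicon)
for word, pron in disambiguated:
    print(word, " ".join(pron))
```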

Composition

WFST composition can be summarized as:

For H, the self-loops are removed and added back at the end of the composition. The fine details in preparing each transducer and how to compose them together can be found here. Separate decoding graphs are created for training and testing. The graph created at training time is simpler than the one at test time because no disambiguation symbols are needed: we know the transcript. In training, G consists of a linear acceptor corresponding to the training transcript.

H transducer

The H transducer maps the acoustic states (the input labels) to context-dependent phones. In Kaldi, the input label of H (or H◦C◦L◦G) is the transition-id which includes the acoustic state (the pdf-id). This transducer H does not contain the self-loops. They are added back when the composition is completed.

Decoder

There are two basic Kaldi decoders, SimpleDecoder and FasterDecoder, plus corresponding versions that generate a lattice instead of the most likely word sequence. These decoders can be used with different acoustic models, like the GMM. FasterDecoder has almost exactly the same interface as SimpleDecoder. The major difference is a new configuration value “max-active” which controls the maximum number of states that can be active at one time. This configuration controls the beam pruning during the decoding using a weight cutoff.
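The effect of the beam and of a max-active style limit can be sketched in Python. We assume each active state carries a cost (lower is better): states whose cost exceeds the best cost by more than the beam are dropped, and if too many states remain, only the cheapest ones are kept. The function and the numbers are hypothetical, not the decoder's actual implementation.

```python
def prune_tokens(tokens, beam, max_active):
    """Keep states within `beam` of the best cost, and at most `max_active` of them.

    tokens maps a state to its accumulated cost (lower is better).
    """
    if not tokens:
        return {}
    best = min(tokens.values())
    kept = {s: c for s, c in tokens.items() if c <= best + beam}
    if len(kept) > max_active:
        cheapest = sorted(kept.items(), key=lambda kv: kv[1])[:max_active]
        kept = dict(cheapest)
    return kept


# Hypothetical active states after processing one frame.
active = {1: 10.0, 2: 12.0, 3: 30.0, 4: 11.0}
survivors = prune_tokens(active, beam=5.0, max_active=2)
print(survivors)
```

A wider beam or a larger limit explores more hypotheses at the price of slower decoding.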

First, a decoder is instantiated with an FST.

Here is a code snippet for the decoding. The gmm_decodable contains the sequence of feature vectors to be decoded and the GMM acoustic model.

The class DiagGmm represents a single diagonal-covariance GMM. An acoustic model is a collection of DiagGmm objects identified by “pdf-ids”. This is implemented as the class AmDiagGmm.

Then, the most likely sequence can be retrieved as a lattice. But this lattice will contain one path only.

And this is the internal loop inside the decoder that iterates over the audio frames.

Terms

alignment: In Kaldi, an alignment is a sequence of transition-ids.

pdf-id: identifies the clustered context-dependent HMM emission state (for example, the GMM acoustic state).

transition-id: identifies the pdf-id, the phone identity, and information about whether it takes the self-loop or forward transition in the HMM.

Recap

Let’s do a recap using a simpler ASR. Here is the code snippet for training an ASR for digits. This is one of the simplest examples. Other ASRs may involve multiple passes with repeated alignment, feature transformations, the use of the MMIE objective function etc… For those interested, here is a more complex example using the RM corpus.

Prepare Acoustic Data

Acoustic data, including the utterance, transcript, text corpus and speaker information, will be prepared manually.

The spk2utt file will be generated automatically with the following format.
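Conceptually, spk2utt is just the inverse of utt2spk. The Python sketch below shows the conversion, assuming each utt2spk line has the form "utterance-id speaker-id" and each spk2utt line lists a speaker followed by all of its utterances. The IDs are invented; in a real setup the Kaldi utility scripts generate this file.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert 'utterance speaker' lines into 'speaker utt1 utt2 ...' lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        utt, spk = parts
        spk2utt[spk].append(utt)
    return [
        f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(spk2utt.items())
    ]


# Hypothetical utterance-to-speaker entries.
utt2spk = ["dad_4_4_2 dad", "july_1_2_5 july", "dad_5_0_7 dad"]
spk2utt = utt2spk_to_spk2utt(utt2spk)
print("\n".join(spk2utt))
```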

Feature Extraction

MFCC features will be extracted with cepstral mean and variance statistics (CMVN) collected at the end to normalize features.

Prepare Language Data

Lexicon, silence, and non-silence phone information will be prepared manually. The FST for the lexicon will be created and files related to phones and words will be transformed with integer IDs needed by Kaldi.

Language Model Creation

SRILM is a language model toolkit. Its ngram-count tool builds an internal N-gram count set by reading counts from a file. Then it generates and manipulates N-gram counts, and estimates N-gram language models from them. The last command creates the FST for the grammar (G).

Monophone Training & Decode

Once all the information is ready from the previous step, we can perform the monophone training. Next, we create the decoding graph and use it to decode the testing data.