Speech Recognition — Kaldi

Image for post
Photo by Jakub Kapusnak
(According to legend, Kaldi was an Ethiopian goatherd who discovered the coffee plant.)

Kaldi is a speech recognition toolkit aimed at researchers. We can use it to train speech recognition models and to decode audio recordings. So far, we have discussed different topics in our Speech Recognition Series, and we would like to close the discussion with a solid example of training an Automatic Speech Recognizer (ASR). If you are looking for instructions on using Kaldi, however, you should refer to the Kaldi documentation; it is the authority. The APIs and command lines shown in this article can change frequently, and because we only want to demonstrate the process, we will not update them if they change in the future. Each section goes into a different level of detail, so feel free to skip material according to your interest.

OpenFST

Weighted finite-state transducers (WFSTs) are a popular way to model the transducers used in ASR.

Image for post

One popular open-source WFST toolkit is OpenFst, and it is heavily used by Kaldi. To build a WFST in OpenFst, we need to define the input and output symbol tables and the FST itself. A symbol file contains the vocabulary and maps each word to a unique ID used by OpenFst.

Image for post
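
As a rough sketch (the words and IDs here are illustrative, and <eps> conventionally takes ID 0), a symbol table is just a plain-text list of symbol/ID pairs:

    <eps> 0
    one 1
    two 2
    three 3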

Next, we describe the FST in a separate text file. Each line in the FST file (except the last) defines an arc. The first two columns identify the state it transits from and the state it transits to. The third column is the input label and the fourth column is the output label. A dash means the output is empty.

Image for post
Modified from source
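
For illustration, a tiny FST that transduces “one two” to itself might be written as follows (state numbers and labels are made up; the single-column last line marks the final state, and in OpenFst's text format an empty label is normally written as <eps>):

    0 1 one one
    1 2 two two
    2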

Here is the definition file for WFST which includes a weight for each arc.

Image for post
Modified from source

For faster computation during training, we compile these files to a binary FST representation with fstcompile.
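
A compile command might look like the sketch below (file names are illustrative; --isymbols and --osymbols point to the symbol tables, and fstprint reverses the process for inspection):

    fstcompile --isymbols=isyms.txt --osymbols=osyms.txt text.fst binary.fst
    fstprint --isymbols=isyms.txt --osymbols=osyms.txt binary.fst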

Data Source

Before training, we need the raw data as well as other metadata to be ready. Speech recognition training starts with a corpus containing a collection of transcribed speech recordings. Many speech resources are available from the Linguistic Data Consortium (LDC) for a fee, starting in the $1K+ range for non-members.

Here is an example that identifies each recording with a string ID and links that ID to its transcript. In this example, each clip contains the pronunciation of three digits.

Image for post
Source

Here are other transcripts for the Resource Management corpus.

Image for post

We also need to provide a lexicon as a pronunciation dictionary. Some lexicons, like CMUdict, are available for free. In the file on the right below, each line (after the second line) contains the sequence of phonemes that makes up a digit (a word).

Image for post
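
A digit lexicon of this kind might contain lines like the following sketch (one word per line followed by its phonemes, CMUdict-style; the entries are illustrative):

    ONE  W AH N
    TWO  T UW
    THREE  TH R IY
    FOUR  F AO R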

The training may involve extra text to build a language model. This helps us to generate a sequence of words that are grammatically sound.

Image for post
Text source

This language model is an N-gram model.

Image for post
Source

Data

There are two ways data is read or written. In the “scp” (script) form, a “.scp” file maps a key to a filename or a Unix pipe.

Image for post
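
As a sketch, each line of a “.scp” file holds a key followed by a filename or a pipe command ending in “|” (the paths below are made up):

    utt_001 /data/digits/train/utt_001.wav
    utt_002 /data/digits/train/utt_002.wav
    utt_003 sph2pipe -f wav /data/rm1/audio/utt_003.sph |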

In the “ark” (archive) form, data is stored inline in a single file.

Image for post

Here are the specifiers used on the command line to indicate how data is read or written.

Image for post
Source
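
For example, the following (illustrative) commands combine a storage type with a location; “ark,t” requests text mode and “-” means stdout:

    # read features listed in an scp file, write them to a binary archive
    copy-feats scp:feats.scp ark:feats.ark
    # dump the archive as text to stdout
    copy-feats ark:feats.ark ark,t:-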

WFST

Kaldi uses OpenFst to construct and search weighted finite-state transducers (WFSTs), which Kaldi uses for decoding. For example, the second diagram below matches a sequence of phones to produce the word “data” (with phones /d/ /ey/ /t/ /ax/) or “dew” (with phones /d/ /uw/).

Image for post
Source

We can compose multiple transducers together. For example, we can compose a language model on top to encourage proper grammar.

Data preparation

Let’s look at the major steps in training an ASR on custom audio data. Besides recording clips, we need to prepare metadata for the acoustic model and the language model.

Acoustic data

The acoustic data includes gender information on the speakers (spk2gender), the identifier and the audio data for each utterance (wav.scp), the transcripts for each utterance (text), the mapping between utterances and speakers (utt2spk) and the corpus’s transcript (corpus.text).

Image for post
Modified from source
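
As a sketch, these files contain one entry per line; the speaker, utterance and path names below are made up:

    # spk2gender (speaker-id gender)
    july f
    # wav.scp (utterance-id path-to-audio)
    july_123 /home/user/digits/train/july_123.wav
    # text (utterance-id transcript)
    july_123 one two three
    # utt2spk (utterance-id speaker-id)
    july_123 july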

Language data

The language data includes the lexicon, and the non-silence and silence phone information.

Image for post
Modified from source
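
A minimal sketch of these files for a digit task might look like this (phone names are illustrative; !SIL maps a silence “word” to the silence phone):

    # lexicon.txt
    !SIL sil
    one w ah n
    two t uw
    # silence_phones.txt
    sil
    # nonsilence_phones.txt
    w
    ah
    n
    t
    uw
    # optional_silence.txt
    sil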

Resource Management

Resource Management is a corpus containing simple and clean commands over a small vocabulary. To create test and training sets from the Resource Management (RM) corpus (catalog number LDC93S3A, purchased from the Linguistic Data Consortium (LDC)), we run

Image for post

The following directories are created under ./data with data segmented into training and testing separately.

Image for post

The sub-directory “local” will contain:

Image for post

The sub-directory “train” will contain:

Image for post

That is, all the files mentioned before are prepared automatically. However, not all of them are in the format used by Kaldi, so we first convert them with OpenFst tools. The command is:

Image for post

All files created with the *.txt extension are symbol tables in OpenFst format; each maps a string to an integer ID. The command above creates a new folder called “lang”. The first two files created are words.txt and phones.txt, which map word and phone strings to the integer IDs used by Kaldi internally.

Image for post
Source
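
For instance, words.txt and phones.txt are plain symbol tables (entries illustrative; <eps> is always ID 0 and #0 is the language-model disambiguation symbol):

    # words.txt
    <eps> 0
    one 1
    two 2
    #0 3
    # phones.txt
    <eps> 0
    sil 1
    w 2
    ah 3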

Phones

Kaldi allows users to define and categorize different types of phones, including “real phones” and silence phones. All this category information helps Kaldi build HMM topologies and decision trees.

prepare_lang.sh transforms the silence and non-silence phone files into the form Kaldi needs. The file context_indep.txt contains all the phones that are not “real phones”: i.e. silence (SIL), spoken noise (SPN), non-spoken noise (NSN), and laughter (LAU). The file may contain many variants of these phones depending on the word position. For example, SIL_B is the silence phone that occurs at the beginning of a word.

Image for post

The files silence.txt and nonsilence.txt contain the silence phones and nonsilence phones respectively. These are the phones we want to model in the project. In many configurations, the silence.txt is identical to context_indep.txt.

The file word_boundary.txt describes how phones relate to word boundaries.

Image for post

The disambiguation symbols (used to distinguish lexicon entries that share a prefix) are contained in:

Image for post

The optional silence file below contains a single phone that can optionally appear between words.

Image for post

Another file called L.fst is created. It is the compiled lexicon in FST format. We can run the following command to view the lexicon in text form.

Image for post
Source
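
Assuming the standard data/lang layout, one way to print it is:

    fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head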

Grammar

The next step is to create an FST for the grammar.

Image for post

The FST for the language model can be found at data/lang_test/G.fst. Internally, it calls fstcompile to generate the WFST G.

Image for post

Recap

Here are the commands to prepare the data, the language model and the grammar model for Resource Management.

Image for post

Extract Features

Next, we will run the command to extract MFCC features for each utterance.

Image for post
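
In the RM recipe this is done with steps/make_mfcc.sh followed by steps/compute_cmvn_stats.sh; a sketch of the calls (the job count, --cmd value and directory names are illustrative):

    for x in train test; do
      steps/make_mfcc.sh --nj 8 --cmd run.pl data/$x exp/make_mfcc/$x mfcc
      steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc
    done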

The first command below displays the data location of the training utterances. The last two commands display the location of the extracted MFCC features and the features themselves.

Image for post

Here is another command for inspecting the extracted MFCC features.

Image for post

The cepstral mean and variance statistics indexed by speakers can be located by:

Image for post

For each speaker, these statistics are used to normalize the input cepstral features corresponding to the same speaker.

Monophone training

Next, we will train the monophone models with the command below. (In our speech recognition series, we refer to a monophone as a context-independent phone.)

Image for post
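
A typical invocation (the job count, --cmd value and directory names are illustrative) looks like:

    steps/train_mono.sh --nj 8 --cmd run.pl data/train data/lang exp/mono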

Here is the usage of train_mono.sh

Image for post

Internally, gmm-init-mono initializes and outputs the model 0.mdl and a flat phonetic decision tree without any split.

Image for post

The file data/lang/topo will be created. It contains the HMM topology for phones. The first state in each entry is the initial state. The last one is the final state with no transitions out of it.

Image for post

Phones 21 to 188 are the phones that we care about for the transcripts (“real” phones). Their topology is composed of three emitting states, i.e. states that “emit” feature vectors. Phones 1 to 20 are for silence and noise. Their topology is more complex and uses five emitting states.

Next, Kaldi compiles the training graphs for faster training later. This generates one FST per training utterance, encoding the HMM structure for that utterance. Each arc in this FST holds a from-state, to-state, input symbol, output symbol and cost. The input symbols are transition-ids, which include the pdf-ids that identify the GMM acoustic states of the audio frames. The output symbols are words. The cost includes the pronunciation probability from the lexicon; the HMM transition probabilities are only added later during training.

Image for post

Next, Kaldi performs the first alignment which assumes the HMM states are equally spaced. (i.e. each HMM state covers the same number of audio frames.)

Image for post

Kaldi uses Viterbi training for the forced alignment, not the forward-backward algorithm. Afterward, Kaldi re-estimates the GMM acoustic models.

Image for post

As shown, the HMM probability models (TransitionModel) and the GMM models are updated. The GMM components are also split.

Image for post

Once the first pass is completed, we run multiple training iterations. Inside the loop,

  • gmm-align-compiled aligns phone states according to the GMM models,
  • gmm-acc-stats-ali accumulates statistics for GMM training, and
  • gmm-est performs maximum-likelihood re-estimation of the GMM-based acoustic models.
Image for post

Once the model file is trained, we can examine the model (0.mdl) with:

Image for post

To check out the Viterbi alignment of the training data, type:

Image for post

It contains one line per training utterance. Each alignment is a sequence of transition-ids; a transition-id encodes the phone and the specific transition within its HMM. To learn more about the transitions, type:

Image for post

Or, to read the alignment in a human-friendly form, type:

Image for post
Source
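
One way to do this, assuming the monophone alignments live in exp/mono/ali.*.gz, is the show-alignments tool, which prints the phones alongside the transition-ids:

    show-alignments data/lang/phones.txt exp/mono/final.mdl "ark:gunzip -c exp/mono/ali.1.gz |" | head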

Let’s take a short detour into the HMM topologies and the transition model before learning how ASR decoding is done.

HMM topologies

Image for post
Source

The HMM topology definition below covers phones 1 to 8 and is loaded into a class called HmmTopology.

Image for post

There are three emitting states above, states 0 to 2. For each emitting state, we model the acoustic observations we may see. State 0 is the start state. State 3 is a final state with no emission.

An emitting state has a pdf associated with it (pdf: probability density function). PdfClass models this output pdf (the emission distribution). We can apply state tying explicitly, for example in context-dependent phones, to share acoustic models, so the value of PdfClass does not need to be unique. States can have a self-loop transition and transitions to other states. The first value in <Transition> identifies the destination state and the second value is the transition probability. The second value is just an initial guess; a copy is kept in the transition model and re-adjusted during training.
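
For reference, a topology entry of this form might look like the sketch below (the phone IDs and probabilities are illustrative initial values, not trained ones):

    <TopologyEntry>
      <ForPhones> 1 2 3 4 5 6 7 8 </ForPhones>
      <State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
      <State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
      <State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
      <State> 3 </State>
    </TopologyEntry>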

Transition models

The TransitionModel object in Kaldi stores the transition probabilities and the HMM topologies (HmmTopology). The graph-building code uses the TransitionModel object to get the topology and transition probabilities.

transition-ids

Kaldi uses transition-ids, rather than pdf-ids (the acoustic states), as the input labels of the FST. A transition-id identifies the pdf-id, the phone and the specific transition in the HmmTopology object. This extra information lets us map an input-label sequence back to a phone sequence and supports other bookkeeping needed when training the transition probabilities.

Transition model training

The FSTs created have transition-ids as input labels. Kaldi performs Viterbi training that gives the most likely input-label sequence, in the form of a sequence of transition-ids (one transition-id per audio frame). Kaldi accumulates counts for different transitions and uses that to finalize the transition model.

Decode

Next, given an utterance, we want to find the most likely sequence of words. Before any decoding, we need to create the decoding graph (WFST graph) for inference.

Image for post
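
A sketch of the graph-building call (directory names are illustrative) is:

    utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph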

Internally, it first composes L and G into LG. The graph is then determinized and minimized. This creates a graph that maps phones to words.

Image for post

Next, it composes CLG.

Image for post

Then, Kaldi creates the H (HMM) transducer using the topologies, the decision tree, and the transition model. Ha.fst (H) has the HMM self-loops removed. Its input labels are transition-ids, which include the pdf-ids identifying the acoustic states.

Image for post

Next, we compose HCLG followed by determinization and minimization. ε and disambiguation symbols are also removed.

Image for post

Later, the HMM self-loops are added back.

Image for post

After mkgraph.sh is done, we can decode the testing data with:

Image for post
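
A typical decoding call (job count and paths are illustrative) looks like:

    steps/decode.sh --nj 8 --cmd run.pl exp/mono/graph data/test exp/mono/decode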

To see some of the decoded results:

Image for post

Or using the .tra file to view the result:

Image for post

Internally, gmm-latgen-faster is called. It generates a lattice which is re-scored later to pick the most likely word sequence. The beam parameters below control the size of the search beam.

Image for post

To take a look at some of the summary information of this decoding section, like the average log-likelihood per frame, type

Image for post

to see the end of the log file.

Here is a sample for the decoding output.

Image for post

Training Triphone System

As discussed before, to avoid nasty local optima, we train an ASR in multiple passes. We start by training a monophone system, which is easier to train. Then we move gradually to a triphone system seeded with information prepared in the previous pass.

The script below involves two major training steps. The first step is to align the data using the monophone system. Then we train the triphone system with train_deltas.sh.

Image for post

A monophone ASR will not be accurate enough for LVCSR because phone context is important. But using triphones increases the number of HMM states too much. One important step in triphone training is therefore to train a decision tree that clusters states which sound acoustically similar. This reduces complexity by sharing the same acoustic model across multiple HMM states.

Phonetic decision tree

Image for post
Source

This decision tree is built by greedy top-down splitting. The binary decision stumps are selected from user-supplied or pre-defined questions, for example about the type of the surrounding phones or the current HMM state. For each split, Kaldi selects the question that gives the split data the highest likelihood under the models in the corresponding branches. For example, if we fit one Gaussian to the data in the left branch and one to the data in the right branch, the chosen question is the one under which the training data has the highest likelihood according to the corresponding Gaussians.

Part of the decision-tree training process is configurable. Phones listed on the same line in roots.txt share a single root in the decision tree.

Image for post

For example, in stress- and tone-dependent systems, all the stress- or tone-dependent variants of a particular phone typically share the same root. In addition, all three HMM states of a phone share the same root. In practice, Kaldi generally has each tree root correspond to a “real phone”: phones corresponding to different variants of word position, tone or stress are grouped together to form one tree root.

Kaldi comes with predefined questions for the decision stumps, but extra questions are configurable. For example, they can include questions about the word position of a phone (such as whether it is at the beginning or the end of a word).

Image for post

After the monophone model is trained, we use it to align the audio frames with monophones. This is the align_si.sh call in the script.

Image for post

Then we build the triphone acoustic model, gradually increasing its complexity. For the remainder of this section, we will focus on the script train_deltas.sh, which is the core of the triphone training. The source code can be found here.

Image for post

train_deltas.sh

(Credit: In this section, we will reuse some of the code comments in Kaldi for our explanation.)

In the beginning, acc-tree-stats reads in a feature archive and the corresponding alignments, then accumulates statistics for decision-tree creation. Then sum-tree-stats summarizes these statistics for phonetic-context tree building.

Image for post

Here is the output for acc-tree-stats

Image for post

and for sum-tree-stats, which shows 19268 states. The first entry below is for state 0 with phonetic context (0 10 22: <eps> /d /ih as in “DID”).

Image for post

Then it clusters the phones into sets and sets up questions about the HMM states based on acoustic similarity.

Image for post

Kaldi has options for controlling how the tree is built. As shown in roots.txt below, SIL, SIL_B, SIL_E, SIL_I and SIL_S all share the same root regardless of the word position of the SIL phone.

Image for post

Next, Kaldi builds a set of phonetic decision trees with the maximum number of leaves defined in $numleaves.

Image for post

Here is what the tree may look like.

Image for post

Next, Kaldi reads the tree, tree accumulators, and topology. It initializes and outputs the model file 1.mdl. It writes the HMM topology and probability transition model (type TransitionModel) and the acoustic model (type AmDiagGmm) into the model file.

Image for post

Then, Kaldi performs the GMM mixture split.

Image for post

Then it converts alignments from the monophone model to the current tree model.

Image for post

Next, we compile training graphs for the transcripts so that later training is faster.

Image for post

Then we run multiple training iterations. We realign the data and re-estimate the GMM acoustic models, similar to the monophone training.

Image for post

This concludes how an ASR is trained. The remaining sections cover individual topics in Kaldi. First, let’s get familiar with a few terms and basic concepts that are constantly mentioned in the Kaldi documentation.

Decoding Graph Construction

Kaldi composes HCLG = H o C o L o G to form the decoding Graph.

  • G is an acceptor (input and output symbols are the same) encoding the language model.
  • L is the lexicon. It maps phone sequences to words.
  • C maps context-dependent phones into context-independent phones.
  • H contains the HMM definitions. It maps transition-ids into context-dependent phones.

The output is determinized and minimized for optimization purposes.

Disambiguation (optional)

To ensure determinizability, disambiguation symbols are inserted and later removed. Disambiguation symbols (#1, #2, #3, …) are appended to phoneme sequences in the lexicon when one phoneme sequence is a prefix of another. They differentiate such entries and keep the L o G composition determinizable. The symbol #0 is added to the backoff arcs in the language model G, which keeps G determinizable when epsilons are removed. Other details are skipped here; many of them exist to ensure the determinizability of the WFSTs.
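
As an illustrative sketch, if one pronunciation is a prefix of another, the shorter entry receives a disambiguation symbol:

    # before
    red   r eh d
    reds  r eh d z
    # after: the prefix entry is marked so that L o G stays determinizable
    red   r eh d #1
    reds  r eh d z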

Composition

WFST composition can be summarized as:

Image for post

For H, the self-loops are removed and added back at the end of the composition. The fine details of preparing each transducer and composing them together can be found here. Separate decoding graphs are created for training and testing. The graph created at training time is simpler than the one used at test time because no disambiguation symbols are needed: we know the transcript, and during training G consists of a linear acceptor corresponding to the training transcript.

H transducer

The H transducer maps acoustic states (the input labels) to context-dependent phones. In Kaldi, the input label of H (or HCLG) is the transition-id, which includes the acoustic state (the pdf-id). The transducer H does not contain the self-loops; they are added back once the composition is completed.

Decoder

There are two basic Kaldi decoders, SimpleDecoder and FasterDecoder, plus corresponding versions that generate a lattice instead of only the most likely word sequence. These decoders can be used with different acoustic models, like the GMM. FasterDecoder has almost exactly the same interface as SimpleDecoder. The major difference is a new configuration value, “max-active”, which controls the maximum number of states that can be active at one time. This controls the beam pruning during decoding using a weight cutoff.

First, a decoder is instantiated with an FST.

Image for post

Here is a code snippet for the decoding. The gmm_decodable object wraps the sequence of feature vectors to be decoded and the GMM acoustic model.

Image for post

The class DiagGmm represents a single diagonal-covariance GMM. An acoustic model is a collection of DiagGmm objects identified by “pdf-ids”. This is implemented as the class AmDiagGmm.

Then, the most likely sequence can be retrieved as a lattice. But this lattice will contain one path only.

Image for post

And this is the decoder’s internal loop over the audio frames.

Image for post

Terms

alignment: In Kaldi, an alignment is a sequence of transition-ids.

pdf-id: identifies the clustered context-dependent HMM emission state (for example, the GMM acoustic state).

transition-id: identifies the pdf-id, the phone identity, and information about whether it takes the self-loop or forward transition in the HMM.

Recap

Let’s do a recap using a simpler ASR. Here is the code snippet for training an ASR for digits, one of the simplest examples. Other ASRs may involve multiple passes with repeated alignment, feature transformations, the MMIE objective function, etc. For those interested, here is a more complex example using the RM corpus.

Prepare Acoustic Data

Acoustic data, including the utterance, transcript, text corpus and speaker information, will be prepared manually.

Image for post

The spk2utt file will be generated automatically with the following format.

Image for post
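
Its format is one speaker per line, followed by that speaker's utterance IDs (names illustrative):

    july july_123 july_124 july_125
    dad dad_200 dad_201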

Feature Extraction

MFCC features will be extracted, with cepstral mean and variance normalization (CMVN) statistics collected at the end to normalize the features.

Image for post

Prepare Language Data

The lexicon, silence, and non-silence phone information will be prepared manually. The FST for the lexicon will be created, and the files related to phones and words will be mapped to the integer IDs needed by Kaldi.

Image for post

Language Model Creation

SRILM is a language-modeling toolkit. Its ngram-count tool builds an internal N-gram count set by reading counts from a file, then generates and manipulates N-gram counts and estimates N-gram language models from them. The last command creates the FST for the grammar (G).

Image for post
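
A sketch of these calls, assuming a unigram model built from corpus.txt (file names and options are illustrative), might be:

    ngram-count -order 1 -text data/local/corpus.txt -lm data/local/tmp/lm.arpa
    arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt \
      data/local/tmp/lm.arpa data/lang/G.fst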

Monophone Training & Decode

Once all the information is ready from the previous step, we can perform the monophone training. Next, we create the decoding graph and use it to decode the testing data.

Image for post

Triphone Training & Decode

The first step is to align data using the monophone system. Then we train the triphone system followed by the decoding.

Image for post

Credits & References

All the shell commands used in this article are snapshots from the Kaldi documents or Kaldi Github.

Kaldi doc

Speech recognition with Kaldi lectures

Introduction to the use of WFSTs in Speech and Language processing
