Speech Recognition — ASR Model Training

Recap (optional)


ASR

Source (the word “six”)
  • Pronunciation is sensitive to neighboring phones (context), both inside a word and across word boundaries.
  • The alignment between HMM states and audio frames is harder for continuous speech.
  • A large vocabulary means a lot of states to keep track of.
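
To make the first point concrete, here is a minimal sketch of how a phone sequence can be expanded into context-dependent units (triphones). The phone sequence for "six", the `left-center+right` label format, and the `sil` boundary symbol are assumptions chosen for illustration, not something fixed by this article.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a context-independent phone sequence into triphone labels.

    Each label has the (assumed) form left-center+right, so the same
    phone gets a different unit for every distinct pair of neighbors.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Assumed pronunciation of the word "six".
six = ["s", "ih", "k", "s"]
triphones = to_triphones(six)
print(triphones)
```

Note that the two /s/ phones in "six" end up as two different units because their neighbors differ. This is also why the number of states grows so quickly with context and vocabulary.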

Training strategy

Forced alignment
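
In forced alignment, the state sequence is already fixed by the transcript, so the only question is which audio frames belong to which state. The following is a minimal sketch, assuming a strict left-to-right HMM (stay in a state or move to the next one), ignoring transition probabilities, and using toy 1-D frames scored by unit-variance Gaussians. The function name `forced_align` and all the numbers are hypothetical.

```python
import math


def forced_align(loglik):
    """Viterbi alignment for a left-to-right state sequence.

    loglik[t][s] is the log-likelihood of frame t under state s.
    Returns one state index per frame; the path starts in state 0,
    ends in the last state, and never moves backwards.
    """
    T, S = len(loglik), len(loglik[0])
    if T < S:
        raise ValueError("need at least one frame per state")
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = loglik[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else NEG
            if move > stay:
                score[t][s], back[t][s] = move + loglik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + loglik[t][s], s
    # Trace back from the final state at the final frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example: three states with means 0, 5, 10 and unit variance.
means = [0.0, 5.0, 10.0]
frames = [0.1, -0.2, 4.8, 5.3, 5.1, 9.7]
loglik = [[-0.5 * (x - m) ** 2 for m in means] for x in frames]
alignment = forced_align(loglik)
print(alignment)
```

A real system would add the log transition probabilities to each step and score frames with the GMM acoustic model instead of a single fixed Gaussian.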


Key training steps

  • We start with training utterances paired with their transcripts.
  • We define the pronunciation lexicon for each word.
  • We define the HMM topology manually.
  • Following the reference transcript, we form an initial HMM topology for each utterance.
  • Based on this topology, we build an acoustic model for each context-independent (CI) phone. To learn the alignment between the HMM phone states and the audio frames, we apply the forward-backward (FB) algorithm.
  • Then, the MFCC features of the aligned audio frames are used as the training data to estimate the mean and the variance of a single Gaussian.
  • Mixture splitting (details later): We start with this 1-component GMM (a single Gaussian). We split each Gaussian into two and run many iterations of FB, which realigns the audio frames with the HMM phone states. We repeat the splitting, each time followed by many iterations of FB, until the GMM reaches a target number of components. The acoustic model grows more complex gradually.
  • Refine the reference transcript: In this phase, we select the pronunciation variant and spot the silence phones for each utterance. We perform further FB training to refine the transcript and the alignment.
  • Use the CI model and the refined transcript to realign the CI phones with the training data. Then we build the phonetic decision tree (details later).
  • Seed (clone) the context-dependent (CD) models from the CI models.
  • Retrain the CD models with the forward-backward algorithm or the Viterbi algorithm. We refine the GMM acoustic model, possibly with more mixture splitting.
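
The FB step and the Gaussian re-estimation in the list above can be sketched as follows. This is a simplified illustration, not a full trainer: it assumes 1-D features instead of MFCC vectors, a tiny 2-state left-to-right HMM, and no constraint on the final state. The names `forward_backward` and `reestimate` and all the numbers are hypothetical.

```python
import math


def forward_backward(init, trans, emit):
    """Scaled forward-backward pass.

    init[s]: initial state probabilities, trans[i][j]: transition
    probabilities, emit[t][s]: likelihood of frame t under state s.
    Returns gamma[t][s], the soft alignment of frame t to state s.
    """
    T, S = len(emit), len(init)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(S):
            prev = init[j] if t == 0 else sum(
                alpha[t - 1][i] * trans[i][j] for i in range(S))
            alpha[t][j] = prev * emit[t][j]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                for j in range(S)) / scale[t + 1]
    return [[alpha[t][s] * beta[t][s] for s in range(S)] for t in range(T)]


def reestimate(frames, gamma):
    """Occupancy-weighted mean and variance for each state's Gaussian."""
    params = []
    for s in range(len(gamma[0])):
        occ = sum(g[s] for g in gamma)
        mean = sum(g[s] * x for g, x in zip(gamma, frames)) / occ
        var = sum(g[s] * (x - mean) ** 2 for g, x in zip(gamma, frames)) / occ
        params.append((mean, var))
    return params


# Toy data: two states with initial means 0 and 4, unit variance.
frames = [0.0, 0.2, 3.9, 4.1]
means = [0.0, 4.0]
emit = [[math.exp(-0.5 * (x - m) ** 2) for m in means] for x in frames]
gamma = forward_backward([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], emit)
params = reestimate(frames, gamma)
print(params)
```

Unlike a hard alignment, each frame contributes to every state in proportion to `gamma`. Repeating the two functions in a loop (with the new parameters used to recompute `emit`) gives the "many iterations of FB" mentioned in the list.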

GMM acoustic model

K=3 for 2-D features
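
As a rough sketch of how such a model is scored and grown, the code below evaluates a diagonal-covariance GMM and performs one mixture split. Diagonal covariances, the perturbation of 0.2 standard deviations, the variance floor, and every parameter value are assumptions made for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, gmm):
    """gmm is a list of (weight, mean, var); uses log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def fit_single_gaussian(frames, floor=1e-6):
    """1-component GMM from the frames aligned to one HMM state."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, floor)
           for d in range(dim)]
    return [(1.0, mean, var)]


def split_mixture(gmm, eps=0.2):
    """Split every component in two: halve the weight, shift the means apart."""
    out = []
    for w, mean, var in gmm:
        shift = [eps * math.sqrt(v) for v in var]
        out.append((w / 2, [m + s for m, s in zip(mean, shift)], list(var)))
        out.append((w / 2, [m - s for m, s in zip(mean, shift)], list(var)))
    return out


# A hand-made K=3 GMM over 2-D features (made-up parameters).
gmm3 = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.3, [4.0, 4.0], [1.0, 2.0]),
    (0.2, [-3.0, 5.0], [2.0, 1.0]),
]
near = gmm_log_likelihood([0.1, -0.1], gmm3)
far = gmm_log_likelihood([20.0, 20.0], gmm3)

# Start from a single Gaussian and split once.
frames = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0], [2.0, 3.0]]
gmm1 = fit_single_gaussian(frames)
gmm2 = split_mixture(gmm1)
print(near, far, len(gmm2))
```

Right after a split the two halves are nearly identical. It is the FB iterations that follow which pull them apart, so in training each split would be followed by re-estimation before splitting again.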

Phonetic Decision Tree
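
A phonetic decision tree groups contexts by asking yes/no questions about the neighboring phones. One common way to pick a question is to choose the split that most increases the likelihood of the data when each side is modeled by its own Gaussian. The sketch below assumes that criterion, 1-D features, questions about the left context only, and toy data. The question names and values are hypothetical.

```python
import math


def gaussian_loglik(values, floor=1e-4):
    """Log-likelihood of values under their own maximum-likelihood Gaussian.

    With the ML variance this simplifies to -n/2 * (log(2*pi*var) + 1).
    The variance floor makes this an approximation for degenerate sets.
    """
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, floor)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def best_question(data, questions):
    """data: {left-context phone: frame values}; questions: {name: phone set}.

    Returns (question name, likelihood gain) for the best split,
    or None if no question separates the data into two non-empty sets.
    """
    pooled = [v for vals in data.values() for v in vals]
    base = gaussian_loglik(pooled)
    best = None
    for name, phones in questions.items():
        yes = [v for p, vals in data.items() if p in phones for v in vals]
        no = [v for p, vals in data.items() if p not in phones for v in vals]
        if not yes or not no:
            continue
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# Toy frames for one center phone, grouped by left-context phone.
data = {
    "m": [1.0, 1.2, 0.9],
    "n": [1.1, 0.8, 1.0],
    "s": [4.0, 4.2, 3.9],
    "z": [4.1, 3.8, 4.0],
}
questions = {
    "L-Nasal": {"m", "n"},
    "L-Voiced": {"m", "n", "z"},
}
choice = best_question(data, questions)
print(choice)
```

Applying `best_question` recursively to each side, and stopping when the gain or the amount of data falls below a threshold, would yield the tree. Contexts that land in the same leaf share one acoustic model.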


Refine transcript


Alignment (Optional)


Speaker adaptation


Next

Credits & References
