Speech Recognition Series

Dec 20, 2019

In this speech recognition series, we cover the basics, such as phonetics, and the machine learning models used in speech recognition. Later, we apply deep learning to speech recognition. The first article covers the core principles behind speech recognition.

Speech Recognition — Phonetics

Finding the core principle and focus is unexpectedly hard for new inventions. In deep learning (DL), many early efforts…

As with any machine learning (ML) problem, the first challenge is feature extraction. How will vocal information be extracted and represented?

Speech Recognition — Feature Extraction MFCC & PLP

Machine learning (ML) extracts features from raw data and creates a dense representation of the content. This forces us…

Before developing models for speech recognition, we study two ML algorithms that are frequently used in speech recognition.

Speech Recognition — GMM, HMM

Before the Deep Learning (DL) era for speech recognition, HMM and GMM were two must-learn technologies for speech…

Now, let’s start developing the acoustic model, the lexicon, and the language model for speech recognition.

Speech Recognition — Acoustic, Lexicon & Language Model

Speech recognition can be viewed as finding the best sequence of words (W) according to the acoustic, the pronunciation…

The next two articles develop the models and methods used to transcribe an audio recording.

Speech Recognition — ASR Decoding

With the acoustic, pronunciation lexicon and language model built and discussed in the previous article, we are ready…

This will involve the development of a state machine.

Speech Recognition — Weighted Finite-State Transducers (WFST)

Previously, we developed all the necessary Lego blocks in modeling our ASR problem. They include the HMM models for the…

Next, we detail how these models are trained.

Speech Recognition — ASR Model Training

Now, we come to the last part of the puzzle in training an ASR. In this article, we will dig deeper to learn how to…

To make the discussion concrete, we will use the Kaldi platform to demonstrate a training process.

Speech Recognition — Kaldi

Kaldi is a speech recognition toolkit aimed at researchers. We can use Kaldi to train speech recognition models…

Finally, we move into the deep learning era and apply its techniques to the speech recognition problem.

Speech Recognition — Deep Speech, CTC, Listen, Attend, and Spell

Deep Learning (DL) changes many Machine Learning (ML) fields that heavily depend on domain knowledge. Decades of…

Miscellaneous:

Speech Recognition — Maximum Mutual Information Estimation (MMIE)

Many ASRs are trained with MLE (maximum likelihood estimation). It is one of the most popular methods in…

Written by Jonathan Hui