One of the inventors I admire most is Leonardo da Vinci. But would his flying machine ever fly? Leonardo designed it to be steered like a bird, which complicated the design too much. It was powered by a pilot and could not generate the lift it needed. Leonardo misidentified the core principle of flight. Had he ignored the steering for a moment and focused on rising air currents, he might have developed a hang glider centuries earlier.
Decades of linguistic and phonetic study preceded speech recognition. We face the same dilemma. We need to know which principles to focus on and how to take advantage of the raw computational speed we have invented. In this article, we briefly discuss how speech is produced and perceived. Then we focus on the core design concepts needed for speech recognition.
Voiced and voiceless sounds
Speech sounds can be classified as voiced or voiceless. Don’t be misled by the terms: both produce sound, as shown in this 2-minute video. For voiced sounds, we tense our vocal folds. When we exhale air from the lungs, it pushes the vocal folds open. The airflow speeds up and the pressure at the vocal folds drops, which closes them again. Once the airflow drops, the pressure increases and reopens them. These open-close cycles continue and produce a series of sound-wave frequencies with a fundamental frequency averaging 125 Hz for men and 210 Hz for women. This fundamental frequency determines the perceived pitch of the voice. If we touch our throat while producing the voiced /b/, we can feel this vibration. A video is better than a thousand words: this 60-second video probes the vocal fold movements in producing sounds. However, for some sounds the vocal folds stay relaxed and do not vibrate. These are the voiceless sounds. When producing the “th” in “thin”, we feel no vibration in our throat.
But the key component in speaking is the vocal tract, which is composed of an oral part and a nasal part. It acts as a resonator. Both voiced and voiceless sounds are further modulated by articulation, which creates different resonances in the vocal tract.
For pronunciation, we split a word into syllable(s). A syllable usually contains one vowel sound, with or without surrounding consonants.
Consonants are sounds articulated with a complete or partial closure of the vocal tract. They can be voiced or voiceless. They break up the stream of vowels and require more precise articulation. Without consonants, or without proper pronunciation, speech sounds like someone talking under anesthetic after a dental visit.
To classify a consonant, we ask where and how it is produced. Constrictions can be made at different places in the vocal tract, classified into labial, coronal, dorsal, and radical places of articulation. For example, labial consonants mainly involve the lip(s) and sometimes the teeth, while coronal consonants are made with the tip or the blade of the tongue. Other articulators include the jaw, the velum, and the lips.
Besides the areas involved in the articulation, a consonant’s sound depends on how we articulate it: stops, fricatives, nasals, laterals, trills, taps/flaps, clicks, affricates, approximants, etc. For example, in stops, the airstream is completely obstructed and no air escapes through the mouth. Fricatives bring two articulators (like the teeth and the tongue) into close approximation, producing a hissing sound from the turbulent airflow between them. Below is how and where consonants are produced.
Vowels are voiced sounds. As quoted from Wikipedia:
Vowels are syllabic speech sounds that are pronounced without any obstruction in the vocal tract. Unlike consonants, which usually have definite places of articulation, vowels are defined in relation to a set of reference vowels called cardinal vowels. Three properties are needed to define vowels: tongue height, tongue backness and lip roundedness.
The pronunciation of a vowel can be modeled by the vowel height (how far we raise the tongue or lower the jaw) and how far we move the tongue to the front or the back. The top diagram below shows the height and the tongue positions for different vowels. The trapezoid shows the corresponding combinations for different vowels (in red). A vowel sound can also be composed of two vowels, called a diphthong. The blue lines show the pronunciation transition from one vowel to the other in a diphthong.
Phonemes & Phones
The same letters can be pronounced differently in different words, as demonstrated by the repeating letters “ough” in the following sentence. (The first 60 seconds of this video demonstrate how it is pronounced.)
Though I coughed roughly and hiccoughed throughout the lecture, I still thought I could plough through the rest of it.
American English has about 44 phonemes. Phonemes are the distinct units of sound in spoken English that distinguish one word from another. For example, when we switch the /k/ in ‘cat’ to /b/, we produce another word, ‘bat’.
Phones are the acoustic realizations of phonemes. We can have many allophones of the same phoneme; for example, the phoneme /p/ in “pit” and “spit” is pronounced differently. Switching between allophones produces a strange accent rather than a different word. So phonemes are an abstract concept in linguistics used to distinguish words, and phones are how we pronounce them. The diagram below transcribes “she just had a baby” with phonemes.
In speech recognition, we collect corpora that are phonetically transcribed and time-aligned (the start and end time of each phone are marked). TIMIT is one popular corpus; it contains utterances from 630 North American speakers.
The audio clip will be divided into frames. A phone will occupy multiple frames. With such a corpus, we can learn how to perform:
- Frame classification: assign a phone label to an audio frame.
- Phone classification: assign a phone to a segment of the audio (multiple frames).
- Phone recognition: recognize the sequence of phones corresponding to the recorded utterance.
The first half of the diagram is the audio for the fricative consonant /sh/. It is clearly different from the vowel after it. However, for machine learning (ML), we need a denser representation so we can identify them more easily. Engineers love the frequency domain. We apply the Fourier transform to convert time-domain information to the frequency domain. For example, a square wave can be decomposed into the sum of many sine waves. In short, we ask what the frequency composition is and what the corresponding magnitudes are.
In the third row below, a sinusoidal wave is transformed into the frequency domain as a single spike at its specific frequency.
The following is the sound wave for the vowel [iy] in the time domain.
It is composed of many frequencies, but we can see that it repeats itself 10 times within 0.03875 s, i.e. 258 Hz. This is the lowest frequency of its vibrations, and each peak corresponds to an opening of the vocal folds. This frequency is called the fundamental frequency, or F0.
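As a sanity check, we can recover F0 numerically. The sketch below picks the strongest FFT peak; the 16 kHz sample rate and the synthetic signal (a 258 Hz fundamental plus two weaker harmonics) are illustrative assumptions, not the actual [iy] recording:

```python
import numpy as np

# Synthesize 0.5 s of a periodic signal: 258 Hz fundamental plus two
# weaker harmonics, sampled at an assumed 16 kHz.
fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
signal = (1.0 * np.sin(2 * np.pi * 258 * t)
          + 0.5 * np.sin(2 * np.pi * 2 * 258 * t)
          + 0.25 * np.sin(2 * np.pi * 3 * 258 * t))

# The strongest magnitude peak sits at the fundamental frequency.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
f0 = freqs[np.argmax(spectrum)]
print(f0)  # → 258.0
```

Real speech is noisier, so practical F0 trackers use autocorrelation or cepstral methods rather than a single FFT peak, but the principle is the same.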
In a spectrogram, we slice the audio waveform into frames, say with a 25 ms duration each. It visualizes the spectrum of frequencies (y-axis) as it varies with time (x-axis). The magnitude of each frequency is indicated by the intensity of the color.
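The slicing can be sketched in a few lines. The sample rate, the chirp test signal, and the Hamming window below are illustrative assumptions:

```python
import numpy as np

# Minimal spectrogram sketch: slice the waveform into 25 ms frames
# spaced 10 ms apart, window each frame, and take the FFT magnitude.
fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
wave = np.sin(2 * np.pi * (200 + 300 * t) * t)  # frequency rises over time

frame_len = int(0.025 * fs)   # 25 ms -> 400 samples
hop = int(0.010 * fs)         # 10 ms -> 160 samples
window = np.hamming(frame_len)

frames = [wave[i:i + frame_len] * window
          for i in range(0, len(wave) - frame_len + 1, hop)]
spectrogram = np.abs(np.fft.rfft(frames, axis=1))
print(spectrogram.shape)  # (n_frames, n_frequency_bins) = (98, 201)
```

Each row is one time slice; plotting the rows as columns of a heat map gives the familiar spectrogram picture.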
Here is the spectrogram for the vowels. There are a few dominant frequencies, as shown below. They are called formants. Starting from the bottom, they are named F1, F2, and F3 respectively. As shown, the movements of F1, F2, and F3 (up or down) differ between vowels. With these formants, we should be able to identify different vowels.
For now, let’s focus on the vowels to develop a model of how sound is articulated. We start with the sound produced by the vibrations of the vocal folds. The left diagram is the waveform, and the right is the corresponding frequency domain.
Our articulations create different shapes of the vocal tract, which produce different resonances. They act as filters that suppress or amplify frequencies.
From one perspective, it is like blowing air over bottles. Our articulations change the depth and shape of the bottles, creating different resonances, except that the bottles are connected together and form a more complex filter.
Let’s take a look at the right diagram below. This is the audio in the frequency domain, the result of combining the audio source (the left diagram) with the filter. We can roughly identify three peaks. They correspond to the formants F1, F2, and F3 respectively. These peaks can be used in speech recognition to distinguish vowels.
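The source-filter idea can be sketched numerically. The formant locations (500, 1500, 2500 Hz) and the Gaussian resonance shapes below are illustrative assumptions, not measurements of a real vocal tract:

```python
import numpy as np

# Source-filter sketch: the vocal-fold source spectrum rolls off with
# frequency; the vocal-tract filter has resonances near assumed formants
# F1=500, F2=1500, F3=2500 Hz. The output spectrum is their product,
# so its peaks sit near the formants.
freqs = np.arange(1.0, 4000.0, 1.0)       # frequency axis in Hz
source = 1.0 / freqs                       # decaying source envelope
formants = [500.0, 1500.0, 2500.0]
filt = sum(np.exp(-((freqs - f) ** 2) / (2 * 80.0 ** 2)) for f in formants)
output = source * filt

# The strongest peak of the output lands near the first formant F1.
peak = freqs[np.argmax(output)]
print(peak)
```

The peak sits slightly below 500 Hz because the decaying source envelope pulls it down a little; real formant estimation works the same way, locating the resonance peaks of the spectral envelope.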
And the diagram below summarizes this whole process.
Now, let us come back to the Da Vinci example. Is our discussion of syllables, vowels, and consonants the same as the bird anatomy behind the flying machine? While we did apply linguistics to speech recognition in the early days, many of those efforts produced dismal results. The concepts of syllables, vowels, and consonants are likely too high-level, with variants that are hard to model effectively. The more successful speech recognizers ignore most of what we learn from phonetics or linguistics. The particular articulation methods do not help us much either. But it is good to know, since it builds the foundation we need.
So what is the core principle we need to focus on? For the last few decades, the key focus has been finding the most probable word sequence given the audio. In other words, the principle simplifies to finding the word sequence W with the highest probability given the observed audio signal. Mathematically, we can write it with the discriminative or the generative model below:
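In standard notation, with X the audio observations and W a candidate word sequence, Bayes’ rule links the two views (P(X) does not depend on W, so it drops out of the argmax):

```latex
W^* = \arg\max_{W} P(W \mid X)
    = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
    = \arg\max_{W} P(X \mid W)\,P(W)
```

The left-hand form is the discriminative model; the right-hand factorization is the generative model.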
Before the introduction of deep learning (DL), the generative model was much easier to work with in speech recognition; modeling P(W|X) directly is too hard with classical ML. Our discussion will focus on the generative approach first, before using DL to solve the problem with the discriminative model. Both approaches are important.
The generative model depends on building an acoustic model P(X|W) and a language model P(W). The acoustic model describes what the speech may sound like given a sequence of words. The language model describes the likelihood of the word sequence: it makes the sequence grammatically and semantically sound. For example, “I watch a movie” will be more likely than “I you movie watch” or “I watch an apple”.
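A toy bigram language model makes this concrete. The tiny corpus and the add-alpha smoothing below are illustrative assumptions, far smaller than any real LM:

```python
from collections import defaultdict

# A toy bigram language model: P(W) is approximated by a product of
# bigram probabilities estimated from a (tiny, made-up) corpus.
corpus = [
    "<s> I watch a movie </s>",
    "<s> I watch a show </s>",
    "<s> you watch a movie </s>",
]
bigrams = defaultdict(int)
unigrams = defaultdict(int)
for line in corpus:
    words = line.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def score(sentence, alpha=0.1):
    """Approximate P(W) under the bigram model with add-alpha smoothing."""
    words = ("<s> " + sentence + " </s>").split()
    vocab = len(unigrams) + 1
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)
    return p

# The grammatical ordering scores far higher than the scrambled one.
print(score("I watch a movie") > score("I you movie watch"))  # → True
```

Real systems use much larger n-gram or neural language models, but the role of P(W) is exactly this: prefer word sequences that look like real language.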
For the past few decades, speech recognition has basically been supervised machine learning plus searching. We learn how to match the audio signals to words. But because of the possible variants, we must explore the promising (or all) possibilities, and that is the searching part.
Features & labels in supervised ML
There are two fundamental questions in any supervised learning: what are the features and what are the labels? In the generative model, what are X and W in the acoustic model P(X|W)? English speakers know about 20K to 50K words. If we use words as W, the space for W will be unnecessarily large. In addition, English is not a phonetic language: letters are not always pronounced the same in different words. A less ambiguous connection with the pronunciation would be a bonus. So what are the alternatives? Phones are more fundamental than words in speech. In addition, many corpora are already phonetically transcribed, or such transcription can be done automatically with pronunciation tables. Therefore, our acoustic model will be phone-based rather than word-based.
So the next question is: what is X? The audio signal contains noisy information. We are going to extract features from the audio waveform, and X will be the feature vectors. An expert can perform speech recognition from a spectrogram, so extracting features from the frequency domain is a reasonable start. But we need an even denser representation. This will force us to learn the core information, not the noise.
However, our discussion so far does not paint a full picture. We speak what we hear.
In speech recognition, knowing how we hear is more important than knowing how we speak in feature extraction.
Human hearing sensitivity is non-linear. The perceived separation between two loudness levels, or between two frequencies, varies across the frequency range.
We need to remap the measured audio waveform onto the scales humans perceive. These are not 1-to-1 mappings and are mostly non-linear. In theory, such a model may also need to account for the sensitivity of the measurement equipment, like the microphone. But we will assume that is a non-factor for now.
This is a nice video on how we hear sounds of different frequencies. In a nutshell, sounds propagate as vibrations through the eardrum and the ossicles (three bones: the malleus, incus, and stapes), finally reaching the cochlea. The cochlea contains fluid, and the vibrations are transmitted as waves inside it.
There are about 15,000 hair cells inside the cochlea at birth. The hairs move along with the vibrations. The hairs near the front are stiffer and respond only to high-frequency vibrations. As we go deeper into the coil, the hairs become less stiff and respond to lower frequencies. The cochlea’s coiled shape also enhances low-frequency vibrations as the waves travel down the tube. So the hairs in front are responsible for detecting high frequencies, while those at the back detect low frequencies. The movements of the hairs generate electrical signals that are transmitted to the brain through the nerve cells.
Here is the frequency response of the hairs in a cat’s basilar membrane.
This should sound pretty familiar to engineers who study signal processing. Each hair behaves like a bandpass filter that detects only a specific range of frequencies. We will mimic this mechanism in our feature extraction.
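A common way to mimic this is a bank of triangular bandpass filters spaced on a perceptual (mel) scale: narrow at low frequencies, wider at high frequencies, like the cochlear hairs. The sizes below (16 kHz sampling, 512-point FFT, 10 filters) are illustrative assumptions; real MFCC pipelines typically use 20–40 filters:

```python
import numpy as np

# Mel scale conversions (standard formulas).
def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

fs, n_fft, n_filters = 16000, 512, 10

# Filter center frequencies: evenly spaced in mel, so increasingly
# spread out in Hz — narrow filters at low f, wide filters at high f.
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)

# Build triangular filters over the FFT bins.
fbank = np.zeros((n_filters, n_fft // 2 + 1))
for i in range(1, n_filters + 1):
    left, center, right = bins[i - 1], bins[i], bins[i + 1]
    for b in range(left, center):
        fbank[i - 1, b] = (b - left) / (center - left)
    for b in range(center, right):
        fbank[i - 1, b] = (right - b) / (right - center)

print(fbank.shape)  # (10, 257): one bandpass filter per row
```

Multiplying a frame’s power spectrum by this matrix gives one energy value per filter, a compact, perceptually warped summary of the spectrum.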
Fourier Transform (optional)
For non-engineers, let’s take an overview of the Fourier transform one more time. The waveform on the left below, in the time domain, is converted into a sinc function in the frequency domain on the right using a Fourier transform.
Let’s lay down some definitions quickly. The Fourier Transform is defined as:
Or in its discrete form:
The inverse conversion is
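In standard notation (using the angular-frequency convention), the continuous transform, its discrete form, and the inverse are:

```latex
\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt
\qquad\text{(continuous)}

X_k = \sum_{n=0}^{N-1} x_n\, e^{-i 2\pi k n / N}
\qquad\text{(discrete)}

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{i\omega t}\, d\omega
\qquad\text{(inverse)}
```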
Such conversion is symmetrical: a square-like waveform converts to a sinc function in the frequency domain, and in the reverse direction, a bandpass filter (a square-like form in the frequency domain) converts to a sinc function in the time domain.
Intuitively, we multiply f by a sinusoidal function and integrate, to extract the amplitude and phase components for each frequency.
The sinusoidal function is expressed with Euler’s formula (exp(iω) = cos ω + i sin ω). It sounds unusually abstract, but its advantages are that exponentials are easier to integrate and differentiate, and that it captures phase.
The Fourier transform is closely related to convolution, which is generally defined as:
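In standard notation, the convolution of two functions is:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```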
CNNs apply this convolution concept to extract features: g above plays the role of the convolution filter and defines the response function applied to f(x).
So what does the Fourier transform buy us? As shown below, if a function is composed of two periodic functions, the Fourier transform (or the inverse Fourier transform) identifies their periods.
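Here is a minimal numerical check; the 50 Hz and 120 Hz components and the 1 kHz sample rate are illustrative assumptions:

```python
import numpy as np

# A signal composed of two periodic components: 50 Hz and 120 Hz.
fs = 1000
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.7 * np.sin(2 * np.pi * 120 * t)

# The FFT magnitude peaks at exactly those two frequencies.
magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
top_two = sorted(freqs[np.argsort(magnitude)[-2:]])
print(top_two)  # → [50.0, 120.0]
```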
If we do it appropriately, we can apply the inverse Fourier Transform to the output on the right to characterize F1, F2, and F3 (details later).
In addition, there is a duality in the Fourier transform. Convolution and multiplication can be interchanged as:
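In standard notation (the constant factor depends on the transform convention used):

```latex
\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\},
\qquad
\mathcal{F}\{f \cdot g\} = \frac{1}{2\pi}\,\bigl(\mathcal{F}\{f\} * \mathcal{F}\{g\}\bigr)
```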
So if a function is hard to model in one domain, we may model it in the other. As shown before, a filter bank can easily be defined in the frequency domain. On the other hand, if a manipulation is hard, we can perform the corresponding dual transformation above to see whether it becomes easier to solve. Conceptually, we just switch tools whenever it is easier.
To extract audio features, we slide windows of width 25 ms, spaced 10 ms apart, across the audio waveform. Each sliding window extracts a frame of audio samples. We apply the Fourier transform and manipulate the result to make the perceived speech features stand out; then we apply the inverse Fourier transform. In the end, we extract 39 MFCC features for each frame (details later). For speech recognition, we use this feature vector to represent the audio signal. For example, xᵢ in P(xᵢ|w) is simply this feature vector.
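The framing arithmetic can be sketched quickly; the 2-second clip length and 16 kHz sample rate are illustrative assumptions (the MFCC math itself is covered in the next article):

```python
# How many 39-dimensional feature vectors does a clip produce?
fs = 16000
clip_seconds = 2.0
frame_len = int(0.025 * fs)   # 25 ms window -> 400 samples
hop = int(0.010 * fs)         # windows start 10 ms apart -> 160 samples

n_samples = int(clip_seconds * fs)
n_frames = 1 + (n_samples - frame_len) // hop
print(n_frames)  # → 198 frames, each mapped to a 39-d MFCC vector
```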
Regardless of our understanding of phonetics and linguistics, we generally view automatic speech recognition (ASR) as finding the most likely word sequence given the audio, and we train these probability models with transcribed speech.
However, speech recognition is never so simple. The devil is in the details.
First, we need to detail how MFCC features are extracted, how an acoustic model is trained, and how to decode an audio clip.
Next, we will see how to extract the audio features with MFCC or PLP.
Credits & references
Here are some of the references and credits for the whole series of speech recognition.