One of the inventors I admire most is Leonardo da Vinci. But would his flying machine ever fly? Leonardo designed it to be steered like a bird, which complicated the design too much. It was powered by a pilot and could not generate the lift it needed. Leonardo misidentified the core principle of flight. Had he ignored the steering for a moment and focused on rising air currents, he might have developed a hang glider centuries earlier.
Decades of linguistic and phonetic study preceded speech recognition. We face the same dilemma. We need to know which principles to focus on and how to take advantage of the raw computational speed we have invented. In this article, we briefly discuss how speech is produced and perceived. Then we focus on the core design concepts needed for speech recognition.
Voiced and voiceless sounds
Speech sounds can be classified as voiced or voiceless. Don’t be misled by the terms: both produce sound, as shown in this 2-minute video. For voiced sounds, we tense our vocal folds. When we exhale air from the lungs, it pushes the vocal folds open. The airflow speeds up and the pressure at the vocal folds drops, which closes them again. Once the airflow drops, the pressure increases and reopens them. These open-close cycles continue and produce a series of sound-wave frequencies with a fundamental frequency averaging 125 Hz for men and 210 Hz for women. This fundamental frequency determines the perceived pitch of the voice. If we touch our throat while producing the voiced /b/, we can feel this vibration. A video is better than a thousand words: this 60-second video probes the vocal fold movements in producing sounds. However, for some sounds the vocal folds stay relaxed and do not vibrate. These are the voiceless sounds. When producing the “th” in “thin”, we feel no vibration in our throat.
But the key component in speaking is the vocal tract, which is composed of an oral part and a nasal part. It acts as a resonator. Both voiced and voiceless sounds are further modulated by articulation, which creates different resonances in the vocal tract.
For pronunciation, we split a word into syllable(s). A syllable usually contains one vowel sound, with or without surrounding consonants.
Consonants are sounds articulated with a complete or partial closure of the vocal tract. They can be voiced or voiceless. They break up the stream of vowels and require more precise articulation. Without consonants, or without proper pronunciation, speech sounds like someone talking under anesthetic after a dental visit.
To classify a consonant, we ask where and how it is produced. Constrictions can be made at different places in the vocal tract, classified into labial, coronal, dorsal, and radical places of articulation. For example, labial consonants mainly involve the lip(s) and sometimes the teeth, while coronal consonants are made with the tip or the blade of the tongue. Other articulators include the jaw, the velum, and the lips.
Besides the areas involved in the articulation, a consonant’s sound depends on how we articulate it: stops, fricatives, nasals, laterals, trills, taps/flaps, clicks, affricates, approximants, etc. For example, in stops, the airstream is completely obstructed and no air escapes through the mouth. Fricatives bring two articulators (like the teeth and the tongue) into close approximation, producing a hissing sound from the turbulent airflow between them. Below is how and where consonants are produced.
Vowels are voiced sounds. As quoted from Wikipedia:
Vowels are syllabic speech sounds that are pronounced without any obstruction in the vocal tract. Unlike consonants, which usually have definite places of articulation, vowels are defined in relation to a set of reference vowels called cardinal vowels. Three properties are needed to define vowels: tongue height, tongue backness and lip roundedness.
The pronunciation of a vowel can be modeled by the vowel height (how far we raise the tongue or lower the jaw) and how far we move the tongue to the front or the back. The top diagram below shows the height and the tongue positions for different vowels. The trapezoid shows the corresponding combinations for different vowels (in red). A vowel sound can also be composed of two vowels, called a diphthong. The blue lines show the pronunciation transition from one vowel to the other in a diphthong.
Phonemes & Phones
The same letters can be pronounced differently in different words, as demonstrated by the repeating letters “ough” in the following sentence. (The first 60 seconds of this video demonstrate how it is pronounced.)
Though I coughed roughly and hiccoughed throughout the lecture, I still thought I could plough through the rest of it.
American English has about 44 phonemes. Phonemes are the distinct units of sound in spoken English that distinguish one word from another. For example, when we switch the /k/ in ‘cat’ to /b/, we produce another word, ‘bat’.
Phones are the acoustic realizations of phonemes. We can have many allophones of the same phoneme; for example, the phoneme /p/ in “pit” and “spit” is pronounced differently. Switching between allophones produces a strange accent rather than a different word. So phonemes are an abstract concept in linguistics used to distinguish words, and phones are how we pronounce them. The diagram below transcribes “she just had a baby” with phonemes.
In speech recognition, we collect corpora that are phonetically transcribed and time-aligned (the start and end time of each phone are marked). TIMIT is one popular corpus; it contains utterances from 630 North American speakers.
The audio clip will be divided into frames. A phone will occupy multiple frames. With such a corpus, we can learn how to perform:
- Frame classification: assign a phone label to an audio frame.
- Phone classification: assign a phone to a segment of the audio (multiple frames).
- Phone recognition: recognize the sequence of phones corresponding to the recorded utterance.
The first half of the diagram is the audio for the fricative consonant /sh/. It is clearly different from the vowel after it. However, for machine learning (ML), we need a denser representation so we can identify them more easily. Engineers love the frequency domain. We apply the Fourier transform to convert time-domain information to the frequency domain. For example, a square wave can be decomposed into the sum of many sine waves. In short, we ask what the frequency composition is and what the corresponding magnitudes are.
In the third row below, a sinusoidal wave is transformed into the frequency domain as a single spike at its specific frequency.
The following is the sound wave for the vowel [iy] in the time domain.
It is composed of many frequencies, but we can see that it repeats itself 10 times within 0.03875 s, i.e. 258 Hz. This is the lowest frequency of its vibrations, and each peak corresponds to an opening of the vocal folds. This frequency is called the fundamental frequency, or F0.
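As a sanity check, we can recover F0 numerically. The sketch below picks the strongest FFT peak; the 16 kHz sample rate and the synthetic signal (a 258 Hz fundamental plus two weaker harmonics) are illustrative assumptions, not the actual [iy] recording:

```python
import numpy as np

# Synthesize 0.5 s of a periodic signal: 258 Hz fundamental plus two
# weaker harmonics, sampled at an assumed 16 kHz.
fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
signal = (1.0 * np.sin(2 * np.pi * 258 * t)
          + 0.5 * np.sin(2 * np.pi * 2 * 258 * t)
          + 0.25 * np.sin(2 * np.pi * 3 * 258 * t))

# The strongest magnitude peak sits at the fundamental frequency.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
f0 = freqs[np.argmax(spectrum)]
print(f0)  # → 258.0
```

Real speech is noisier, so practical F0 trackers use autocorrelation or cepstral methods rather than a single FFT peak, but the principle is the same.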
In a spectrogram, we slice the audio waveform into frames, say with a 25 ms duration each. It visualizes the spectrum of frequencies (y-axis) as it varies with time (x-axis). The magnitude of each frequency is indicated by the intensity of the color.
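The slicing can be sketched in a few lines. The sample rate, the chirp test signal, and the Hamming window below are illustrative assumptions:

```python
import numpy as np

# Minimal spectrogram sketch: slice the waveform into 25 ms frames
# spaced 10 ms apart, window each frame, and take the FFT magnitude.
fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
wave = np.sin(2 * np.pi * (200 + 300 * t) * t)  # frequency rises over time

frame_len = int(0.025 * fs)   # 25 ms -> 400 samples
hop = int(0.010 * fs)         # 10 ms -> 160 samples
window = np.hamming(frame_len)

frames = [wave[i:i + frame_len] * window
          for i in range(0, len(wave) - frame_len + 1, hop)]
spectrogram = np.abs(np.fft.rfft(frames, axis=1))
print(spectrogram.shape)  # (n_frames, n_frequency_bins) = (98, 201)
```

Each row is one time slice; plotting the rows as columns of a heat map gives the familiar spectrogram picture.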
Here is the spectrogram for the vowels. There are a few dominant frequencies, as shown below. They are called formants. Starting from the bottom, they are named F1, F2, and F3 respectively. As shown, the movements of F1, F2, and F3 (up or down) differ between vowels. With these formants, we should be able to identify different vowels.
For now, let’s focus on the vowels to develop a model of how sound is articulated. We start with the sound produced by the vibrations of the vocal folds. The left diagram is the waveform, and the right is the corresponding frequency domain.
Our articulations create different shapes of the vocal tract, which produce different resonances. They act as filters that suppress or amplify frequencies.
From one perspective, it is like blowing air over bottles. Our articulations change the depth and shape of the bottles, creating different resonances, except that the bottles are connected together and form a more complex filter.
Let’s take a look at the right diagram below. This is the audio in the frequency domain, the result of combining the audio source (the left diagram) with the filter. We can roughly identify three peaks. They correspond to the formants F1, F2, and F3 respectively. These peaks can be used in speech recognition to distinguish vowels.
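The source-filter idea can be sketched numerically. The formant locations (500, 1500, 2500 Hz) and the Gaussian resonance shapes below are illustrative assumptions, not measurements of a real vocal tract:

```python
import numpy as np

# Source-filter sketch: the vocal-fold source spectrum rolls off with
# frequency; the vocal-tract filter has resonances near assumed formants
# F1=500, F2=1500, F3=2500 Hz. The output spectrum is their product,
# so its peaks sit near the formants.
freqs = np.arange(1.0, 4000.0, 1.0)       # frequency axis in Hz
source = 1.0 / freqs                       # decaying source envelope
formants = [500.0, 1500.0, 2500.0]
filt = sum(np.exp(-((freqs - f) ** 2) / (2 * 80.0 ** 2)) for f in formants)
output = source * filt

# The strongest peak of the output lands near the first formant F1.
peak = freqs[np.argmax(output)]
print(peak)
```

The peak sits slightly below 500 Hz because the decaying source envelope pulls it down a little; real formant estimation works the same way, locating the resonance peaks of the spectral envelope.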
And the diagram below summarizes this whole process.
Now, let us come back to the Da Vinci example. Is our discussion of syllables, vowels, and consonants the same as the bird anatomy behind the flying machine? While we did apply linguistics to speech recognition in the early days, many of those efforts produced dismal results. The concepts of syllables, vowels, and consonants are likely too high-level, with variants that are hard to model effectively. The more successful speech recognizers ignore most of what we learn from phonetics or linguistics. The particular articulation methods do not help us much either. But it is good to know, since it builds the foundation we need.
So what is the core principle we need to focus on? For the last few decades, the key focus has been finding the most probable word sequence given the audio. In other words, the principle simplifies to finding the word sequence W with the highest probability given the observed audio signal. Mathematically, we can write it with the discriminative or the generative model below:
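In standard notation, with X the audio observations and W a candidate word sequence, Bayes’ rule links the two views (P(X) does not depend on W, so it drops out of the argmax):

```latex
W^* = \arg\max_{W} P(W \mid X)
    = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
    = \arg\max_{W} P(X \mid W)\,P(W)
```

The left-hand form is the discriminative model; the right-hand factorization is the generative model.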
Before the introduction of deep learning (DL), the generative model was much easier to work with in speech recognition; modeling P(W|X) directly is too hard with classical ML. Our discussion will focus on the generative approach first, before using DL to solve the problem with the discriminative model. Both approaches are important.
The generative model depends on building an acoustic model P(X|W) and a language model P(W). The acoustic model describes what the speech may sound like given a sequence of words. The language model describes the likelihood of the word sequence: it makes the sequence grammatically and semantically sound. For example, “I watch a movie” will be more likely than “I you movie watch” or “I watch an apple”.
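A toy bigram language model makes this concrete. The tiny corpus and the add-alpha smoothing below are illustrative assumptions, far smaller than any real LM:

```python
from collections import defaultdict

# A toy bigram language model: P(W) is approximated by a product of
# bigram probabilities estimated from a (tiny, made-up) corpus.
corpus = [
    "<s> I watch a movie </s>",
    "<s> I watch a show </s>",
    "<s> you watch a movie </s>",
]
bigrams = defaultdict(int)
unigrams = defaultdict(int)
for line in corpus:
    words = line.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def score(sentence, alpha=0.1):
    """Approximate P(W) under the bigram model with add-alpha smoothing."""
    words = ("<s> " + sentence + " </s>").split()
    vocab = len(unigrams) + 1
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)
    return p

# The grammatical ordering scores far higher than the scrambled one.
print(score("I watch a movie") > score("I you movie watch"))  # → True
```

Real systems use much larger n-gram or neural language models, but the role of P(W) is exactly this: prefer word sequences that look like real language.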
For the past few decades, speech recognition has basically been supervised machine learning plus searching. We learn how to match the audio signals to words. But because of the possible variants, we must explore the promising (or all) possibilities, and that is the searching part.
Features & labels in supervised ML
There are two fundamental questions in any supervised learning: what are the features and what are the labels? In the generative model, what are X and W in the acoustic model P(X|W)? English speakers know about 20K to 50K words. If we use words as W, the space for W will be unnecessarily large. In addition, English is not a phonetic language: letters are not always pronounced the same in different words. A less ambiguous connection with the pronunciation would be a bonus. So what are the alternatives? Phones are more fundamental than words in speech. In addition, many corpora are already phonetically transcribed, or such transcription can be done automatically with pronunciation tables. Therefore, our acoustic model will be phone-based rather than word-based.
So the next question is: what is X? The audio signal contains noisy information. We are going to extract features from the audio waveform, and X will be the feature vectors. An expert can perform speech recognition from a spectrogram, so extracting features from the frequency domain is a reasonable start. But we need an even denser representation. This will force us to learn the core information, not the noise.
However, our discussion so far does not paint a full picture. We speak what we hear.
In speech recognition, knowing how we hear is more important than knowing how we speak in feature extraction.
Human hearing sensitivity is non-linear. The perceived separation between two loudness levels, or between two frequencies, varies across the frequency range.
We need to remap the measured audio waveform onto the scales humans perceive. These are not 1-to-1 mappings and are mostly non-linear. In theory, such a model may also need to account for the sensitivity of the measurement equipment, like the microphone. But we will assume that is a non-factor for now.
This is a nice video on how we hear sounds of different frequencies. In a nutshell, sounds propagate as vibrations through the eardrum and the ossicles (three bones: the malleus, incus, and stapes), finally reaching the cochlea. The cochlea contains fluid, and the vibrations are transmitted as waves inside it.
There are about 15,000 hair cells inside the cochlea at birth. The hairs move along with the vibrations. The hairs near the front are stiffer and respond only to high-frequency vibrations. As we go deeper into the coil, the hairs become less stiff and respond to lower frequencies. The cochlea’s coiled shape also enhances low-frequency vibrations as the waves travel down the tube. So the hairs in front are responsible for detecting high frequencies, while those at the back detect low frequencies. The movements of the hairs generate electrical signals that are transmitted to the brain through the nerve cells.
Here is the frequency response of the hairs in a cat’s basilar membrane.
This should sound pretty familiar to engineers who study signal processing. Each hair behaves like a bandpass filter that detects only a specific range of frequencies. We will mimic this mechanism in our feature extraction.
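A common way to mimic this is a bank of triangular bandpass filters spaced on a perceptual (mel) scale: narrow at low frequencies, wider at high frequencies, like the cochlear hairs. The sizes below (16 kHz sampling, 512-point FFT, 10 filters) are illustrative assumptions; real MFCC pipelines typically use 20–40 filters:

```python
import numpy as np

# Mel scale conversions (standard formulas).
def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

fs, n_fft, n_filters = 16000, 512, 10

# Filter center frequencies: evenly spaced in mel, so increasingly
# spread out in Hz — narrow filters at low f, wide filters at high f.
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)

# Build triangular filters over the FFT bins.
fbank = np.zeros((n_filters, n_fft // 2 + 1))
for i in range(1, n_filters + 1):
    left, center, right = bins[i - 1], bins[i], bins[i + 1]
    for b in range(left, center):
        fbank[i - 1, b] = (b - left) / (center - left)
    for b in range(center, right):
        fbank[i - 1, b] = (right - b) / (right - center)

print(fbank.shape)  # (10, 257): one bandpass filter per row
```

Multiplying a frame’s power spectrum by this matrix gives one energy value per filter, a compact, perceptually warped summary of the spectrum.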
Fourier Transform (optional)
For non-engineers, let’s take an overview of the Fourier transform one more time. The waveform on the left below, in the time domain, is converted into a sinc function in the frequency domain on the right using a Fourier transform.
Let’s lay down some definitions quickly. The Fourier Transform is defined as:
Or in its discrete form:
The inverse conversion is
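In standard notation (using the angular-frequency convention), the continuous transform, its discrete form, and the inverse are:

```latex
\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt
\qquad\text{(continuous)}

X_k = \sum_{n=0}^{N-1} x_n\, e^{-i 2\pi k n / N}
\qquad\text{(discrete)}

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{i\omega t}\, d\omega
\qquad\text{(inverse)}
```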
Such conversion is symmetrical: a square-like waveform converts to a sinc function in the frequency domain, and in the reverse direction, a bandpass filter (a square-like form in the frequency domain) converts to a sinc function in the time domain.
Intuitively, we multiply f by a sinusoidal function and integrate, to extract the amplitude and phase components for each frequency.
The sinusoidal function is expressed with Euler’s formula (exp(iω) = cos ω + i sin ω). It sounds unusually abstract, but its advantages are that exponentials are easier to integrate and differentiate, and that it captures phase.
The Fourier transform is closely related to convolution, which is generally defined as:
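In standard notation, the convolution of two functions is:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```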
CNNs apply this convolution concept to extract features: g above plays the role of the convolution filter and defines the response function applied to f(x).
So what does the Fourier transform buy us? As shown below, if a function is composed of two periodic functions, the Fourier transform (or the inverse Fourier transform) identifies their periods.
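Here is a minimal numerical check; the 50 Hz and 120 Hz components and the 1 kHz sample rate are illustrative assumptions:

```python
import numpy as np

# A signal composed of two periodic components: 50 Hz and 120 Hz.
fs = 1000
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.7 * np.sin(2 * np.pi * 120 * t)

# The FFT magnitude peaks at exactly those two frequencies.
magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
top_two = sorted(freqs[np.argsort(magnitude)[-2:]])
print(top_two)  # → [50.0, 120.0]
```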
If we do it appropriately, we can apply the inverse Fourier Transform to the output on the right to characterize F1, F2, and F3 (details later).
In addition, there is a duality in the Fourier transform. Convolution and multiplication can be interchanged as:
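In standard notation (the constant factor depends on the transform convention used):

```latex
\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\},
\qquad
\mathcal{F}\{f \cdot g\} = \frac{1}{2\pi}\,\bigl(\mathcal{F}\{f\} * \mathcal{F}\{g\}\bigr)
```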
So if a function is hard to model in one domain, we may model it in the other. As shown before, a filter bank can easily be defined in the frequency domain. On the other hand, if a manipulation is hard, we can perform the corresponding dual transformation above to see whether it becomes easier to solve. Conceptually, we just switch tools whenever it is easier.
To extract audio features, we slide windows of width 25 ms, spaced 10 ms apart, across the audio waveform. Each sliding window extracts a frame of audio samples. We apply the Fourier transform and manipulate the result to make the perceived speech features stand out; then we apply the inverse Fourier transform. In the end, we extract 39 MFCC features for each frame (details later). For speech recognition, we use this feature vector to represent the audio signal. For example, xᵢ in P(xᵢ|w) is simply this feature vector.
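The framing arithmetic can be sketched quickly; the 2-second clip length and 16 kHz sample rate are illustrative assumptions (the MFCC math itself is covered in the next article):

```python
# How many 39-dimensional feature vectors does a clip produce?
fs = 16000
clip_seconds = 2.0
frame_len = int(0.025 * fs)   # 25 ms window -> 400 samples
hop = int(0.010 * fs)         # windows start 10 ms apart -> 160 samples

n_samples = int(clip_seconds * fs)
n_frames = 1 + (n_samples - frame_len) // hop
print(n_frames)  # → 198 frames, each mapped to a 39-d MFCC vector
```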
Regardless of our understanding of phonetics and linguistics, we generally view automatic speech recognition (ASR) as finding the most likely word sequence given the audio, and we train these probability models with transcribed speech.
However, speech recognition is never so simple. The devil is in the details.
First, we need to detail how MFCC features are extracted, how an acoustic model is trained, and how to decode an audio clip.
Next, we will see how to extract the audio features with MFCC or PLP.
Credits & references
Here are some of the references and credits for the whole series of speech recognition.