In our last installment in the series we talked about the very beginnings of speech recognition and the early, primitive, speaker-dependent, trained-word approach. We saw how that works for relatively short lists of individual words or short phrases. So how do we make that work for a much longer list of words? Clearly we don't want to train every single word we might say.
Let's look at the issue of speaker independence. Since it is not practical to have every user train every word in their language, we need to look for patterns that are more general. Of course this must be possible, since humans are capable of understanding speech from people they have never heard speak before. Sure, there are issues with accents, vocal pitch, etc., but there must be some common features too. Think about how children sound out words. Think about vowels.
Sit Set Sat
To keep it simple, look at these three short words: sit, set, sat. As you recall from the last installment, speech scientists extract the important features from the frequency domain: spectrographs. The horizontal axis is time, the vertical axis is frequency, and the brightness is the intensity (amplitude) at a particular time and frequency. Last time we looked at the spectrographs as complete images, but now we'll examine them a little more closely.
One of the interesting things to notice about speech spectrographs is that there are many bands of sound energy at regular intervals (acoustic baklava?) as we move up the spectrum. Those are the harmonics of the base pitch of a voice (that's the musical note you hear when people sing). The sound of your voice is driven by your vocal cords opening and closing. I have a fairly low voice, and that frequency for me is normally about 85 Hz. But that driving pressure at 85 Hz is not a smooth sine wave. During each cycle the vocal cords snap closed and the air pressure from your lungs pushes them open again. Each cycle is actually a pulse of air (a glottal pulse), and rather than a sine wave it looks more like a triangular wave. The sharp edge of the triangle is rich in higher frequencies, and that is important when it comes to making different speech sounds. We can think of the vocal tract (the human airway starting at the vocal cords and ending at the lips and nose) as an organ pipe. The vocal cords make sounds that resonate in it.
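To see why the shape of the pulse matters, here is a minimal Python sketch that compares the harmonic content of a smooth sine wave with that of a waveform that has a sharp edge. It is only an illustration: the ramp below is a crude stand-in for a glottal pulse, not a measured one, and the names `harmonic_magnitudes`, `N`, `sine` and `ramp` are made up for this example. The only number taken from the text is the 85 Hz base pitch.

```python
import cmath
import math

def harmonic_magnitudes(cycle, count):
    """Amplitude of harmonics 1..count for one cycle of a periodic waveform."""
    n = len(cycle)
    mags = []
    for k in range(1, count + 1):
        # Plain DFT at bin k: correlate the cycle with a complex sinusoid.
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(cycle))
        mags.append(2 * abs(total) / n)
    return mags

N = 200      # samples per cycle (arbitrary choice for the sketch)
F0 = 85.0    # base pitch in Hz, from the text

# A smooth sine wave: all of its energy sits at the base pitch.
sine = [math.sin(2 * math.pi * i / N) for i in range(N)]

# Assumed stand-in for a glottal pulse: a slow rise followed by an abrupt drop.
ramp = [2 * i / N - 1 for i in range(N)]

sine_mags = harmonic_magnitudes(sine, 6)
ramp_mags = harmonic_magnitudes(ramp, 6)

for k, (s, r) in enumerate(zip(sine_mags, ramp_mags), start=1):
    print(f"{k * F0:6.0f} Hz   sine={s:.3f}   ramp={r:.3f}")
```

The sine column should be essentially zero above 85 Hz, while the ramp keeps putting energy into every multiple of 85 Hz. Those multiples are the evenly spaced bands in the spectrograph, and they are the raw material the vocal tract gets to shape.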
We've all played with tubes at some time or another. We've made sounds through paper towel tubes, longer wrapping paper tubes, small and large culverts (those have always been fun to yell into). And we've all noticed that it somehow changes the character of our voice. This is because the length of the tube determines which frequencies resonate, both the lowest resonance and the higher ones. Resonant frequencies are enhanced and the other frequencies are reduced. The same thing happens in our human organ pipe except, of course, our vocal tract is a very flexible and reconfigurable organ pipe. By using our tongue and jaw we can move a narrow spot forward and back, effectively creating two resonant pipes of different lengths. And that's exactly how we generate vowels.
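The link between tube length and resonance can be sketched with the textbook formula for a uniform tube closed at one end and open at the other, where the resonances fall at odd multiples of c / (4L). This is a simplification I am assuming for illustration: a real vocal tract is not uniform, and this model does not capture the moving narrow spot described above. The function name, the 350 m/s speed of sound (roughly right for warm, moist air) and the example lengths are all assumptions.

```python
def tube_resonances(length_m, count, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Uses the quarter-wavelength model: f_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]

# Assumed example lengths: roughly an adult vocal tract, a paper towel tube,
# and a longer wrapping paper tube.
for length in (0.175, 0.30, 0.60):
    rounded = [round(f) for f in tube_resonances(length, 3)]
    print(f"{length:.3f} m -> {rounded} Hz")
```

Longer tubes push every resonance lower, which is why yelling into a culvert sounds so different from talking through a paper towel tube. Changing the effective lengths with the tongue and jaw shifts the resonances in the same way.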
The previous spectrograms are of me saying sit, set, sat. The only thing that changed is the vowel sound between the S and the T. To make it a little clearer what's happening I said the vowels in isolation. If you look at the frequency band around 500 Hz and around 1800 Hz you will see some resonant enhancement of the harmonics. To make this even clearer the last part of the spectrogram shows the three vowels spoken as a continuous transition. It should be easy to see that each one of the vowels has a unique pair of resonant frequencies that are enhanced by my vocal tract. Note: as an exercise for the reader, try saying the ih-eh-ah vowel transition and think about how your jaw and the front and back of your tongue move in a well-defined trajectory. It's almost like the slide on a trombone!
Plotting the frequencies of the first two resonances (F1 and F2) demonstrates that the vowels are reasonably distinct and identifiable in this two-dimensional plot.
In order to make a speaker-independent recognizer for these three vowels, the utterances of a small group of representative individuals would be analyzed and averaged to create a "typical" location for each of the vowels. Then, in the future, a new utterance could be analyzed and its F1/F2 location compared by a simple distance measure to each representative average. The closest one is deemed the one "recognized". Other measures, such as the ratio of the distance to the closest "typical" vowel to the distances to the others, can give some measure of the confidence of the recognition.
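Here is a minimal Python sketch of that idea. Everything specific in it is an assumption made for illustration: the F1/F2 numbers are invented rather than measured, the labels and function names are hypothetical, and the confidence formula (one minus the ratio of the best distance to the runner-up distance) is just one simple way to turn a distance ratio into a score.

```python
import math

def centroid(points):
    """Average (F1, F2) location of a list of measurements."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def train(samples):
    """Build the 'typical' location for each vowel from several speakers."""
    return {label: centroid(points) for label, points in samples.items()}

def recognize(model, f1, f2):
    """Return the closest vowel and a rough confidence between 0 and 1."""
    distances = sorted((math.dist((f1, f2), center), label)
                       for label, center in model.items())
    best_distance, best_label = distances[0]
    runner_up_distance = distances[1][0]
    # Assumed confidence measure: near 1 when the best match is much closer
    # than the runner-up, near 0 when the two are about equally far away.
    if runner_up_distance > 0:
        confidence = 1.0 - best_distance / runner_up_distance
    else:
        confidence = 0.0
    return best_label, confidence

# Invented (F1, F2) values in Hz for three hypothetical speakers per vowel.
training_samples = {
    "ih": [(400, 2000), (420, 1900), (380, 2100)],
    "eh": [(550, 1800), (530, 1850), (570, 1750)],
    "ah": [(650, 1600), (700, 1650), (690, 1550)],
}

model = train(training_samples)
label, confidence = recognize(model, 560, 1780)
print(label, round(confidence, 2))
```

A plain Euclidean distance in Hz treats F1 and F2 equally, which is the simplest possible choice. A real recognizer would likely scale the two axes or use more features, but the structure (average, measure, pick the closest) stays the same.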
Sat Hat Fat Mat Pat Cat
The next spectrograph shows six short similar words all containing the vowel "AH". Now it should be easy for you to see two distinct bands that carry through all six words: 600 Hz and 1600 Hz. ("ah" I get it now!)
You may have noticed that the scale on this spectrograph has twice the frequency range. This was to give you a peek at some of the other parts of speech that have to be detected. Vowels are what we call "voiced" sounds, but another big part of speech can be categorized as "unvoiced" sounds: sounds made mostly by air movement (e.g. s, t, f, h). Together all of these sounds are referred to as phonemes. Notice that the "unvoiced" phonemes have most of their energy at frequencies above the vowels.
In the next installment we will take a closer look at these windy phonemes.
(In case you missed it, here's part one.)