Introduction to Synthetic Agents: Speech Recognition - Part 1
Zone Leader Emmet Coin's next part in his series on speech recognition and the components of a synthetic agent.
So let's continue our exploration of the components of a synthetic agent (begun in a previous post). Let's see how these components work and how they work together.
Automatic Speech Recognition
About the time that ELIZA was created, some workable, albeit primitive, devices and software for speech recognition were in development in the lab. While today's methods for speech recognition operate in a more complex way, it is useful to understand how the first working prototypes functioned and how they serve as a foundation for today's recognizers.
A speech recognizer processes an audio signal and produces the text that was contained in that signal. Up front, we should probably clear up a few bits of terminology. Historically, and even today, the translation of an "audio speech signal" to "text" is called Automatic Speech Recognition (ASR). People sometimes use the term Voice Recognition synonymously (but incorrectly) with the term Speech Recognition.
Technically, speech recognition extracts the words that are spoken, whereas voice recognition identifies the voice that is speaking. Speech recognition is "what someone said" and voice recognition is "who said it". The underlying technologies do overlap, but they serve very different purposes. (Note: voice recognition falls under the scope of biometrics, and while it can and will be useful for synthetic agents, we won't discuss it until later in this series.)
ELIZA was a terminal-based, typed-text software system that ran on a 1960s mainframe. Around that same time, significant and serious research and practical development began on ASR. There were academic, theoretical projects underway at universities as well as working engineering prototypes being built at commercial labs (e.g. AT&T Bell Labs). In this early phase, the practical goal was to identify isolated words from a short list. The academic laboratories experimented more with control words and entity names, whereas places like Bell Labs experimented mostly with isolated spoken digits.
How did these systems work? Let's take a look at a small set of words we might want to detect:
Collecting the audio is straightforward and the waveforms look like this:
Very early on, the researchers and developers realized that looking at the audio signal in the time domain (i.e. the waveform) was not very helpful. The signals were difficult to distinguish, and the shape of the waveform was mostly a function of how loud the utterance was. Methods based on the waveform led to results that were only slightly better than chance. But some folks noticed that speech has a tone and timbre similar to music, and so it became obvious that looking at speech in the frequency domain (i.e. the spectrum) should be more promising.
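The move to the frequency domain can be sketched with a short-time Fourier transform: slice the audio into overlapping frames, window each frame, and take its FFT. This is only a minimal illustration of the idea, not the exact analysis those early labs used; the frame length, hop size, and the 440 Hz test tone are arbitrary choices for the sketch.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: FFT of overlapping, windowed frames."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        # rfft keeps only the non-negative frequency bins.
        frames.append(np.abs(np.fft.rfft(frame)))
    # Rows are time frames; columns are frequency bins.
    return np.array(frames)

# One second of a 440 Hz tone sampled at 8 kHz: in the frequency
# domain its energy concentrates in a single bin, unlike the waveform.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec[0].argmax()        # loudest frequency bin in frame 0
peak_hz = peak_bin * sr / 256      # convert bin index to Hz
```

Plotting `spec` with time on the horizontal axis and frequency on the vertical axis gives exactly the kind of image discussed below.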
Converting the waveform to a spectrum using Fourier Transform analysis allowed the researchers to see the input like this:
The horizontal axis is time, the same as in the waveform plot, but now the vertical axis represents frequency, and the brightness at any time-frequency point is the energy of that frequency at that point in time. Using this representation, different patterns for each word become apparent. Even without understanding what makes the images different (we'll address that later in the series), you could imagine processing the audio of another, different utterance of the name of a big cat and comparing that image to our previously recorded (trained?) example utterances (templates?). The closest-fitting template should be a reasonable guess for the word that was spoken. It is the word we have recognized.
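The closest-template idea can be sketched as a nearest-neighbor search over stored spectrograms. This is a toy illustration: the `templates` dictionary here holds stand-in random arrays rather than real recorded spectrograms, and it assumes every spectrogram has already been trimmed to the same shape (real systems of that era used time-warping to align utterances of different lengths).

```python
import numpy as np

def match_template(utterance_spec, templates):
    """Guess the word by finding the stored template whose spectrogram
    is closest (smallest Euclidean distance) to the input."""
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        dist = np.linalg.norm(utterance_spec - template)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Toy "trained" templates: random stand-ins for real recordings.
rng = np.random.default_rng(0)
words = ["lion", "tiger", "leopard", "cheetah"]
templates = {w: rng.random((61, 129)) for w in words}

# A new "utterance": a slightly noisy copy of the tiger template.
utterance = templates["tiger"] + rng.normal(0, 0.05, (61, 129))
print(match_template(utterance, templates))  # → tiger
```

The noisy copy still lands far closer to its own template than to any other, which is exactly why the technique works for isolated words from a short list, and why it degrades when templates start to resemble each other.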
Up until around 2010, most phones that did voice dialing by name used exactly this technique. As a user, you had to train each name in your address book ahead of time, and for it to work you had to say the name exactly as you said it during training. Also, if the names were even vaguely similar (e.g. "Don Johnson" and "Donna Jackson"), it didn't work well at all.
As an exercise take a look at the following spectrum and see if you can identify (by visual pattern matching) what words were spoken. They are the same four cats but the order is jumbled up a bit.
This sort of system does not work well for large numbers of words. Beyond a hundred words or so, increasing error rates make the approach unusable, so it would never be up to the task of the "dictation-style" speech recognition that we see today with Google, Siri, or DragonDictate.
Side note: This type of speech recognition has its uses even today. A large portion of the stock picking in distribution centers is done using wearable, speech-based computer appliances. The system tells the human where to go in a row, which stock bin to reach into, and how many items to pick. Then the human stock picker tells the computer a check number (to verify they are at the correct location) and how many items they picked up (e.g. there may not have been enough in the bin to fill the order). These sorts of applications use a short list of words: a few commands, numeric digits, yes/no, etc. In this case, the speaker-specific training for each word allows the system to work for any language, dialect, or regional accent. Each employee trains their own templates, so if they train "1" as "uno" (or "one", or "eins"), it will recognize "1" as long as they say what they trained the system with. Another term for this type of recognition is "speaker dependent".
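A speaker-dependent recognizer of this kind might be organized roughly as follows. The class and method names are hypothetical, and the "features" are placeholder vectors standing in for whatever acoustic features a real device would extract; the point is only that each speaker's utterances are matched against that speaker's own templates.

```python
from collections import defaultdict

class SpeakerDependentRecognizer:
    """Each user trains their own template per symbol, so '1' can be
    trained as 'uno', 'one', or 'eins' and still recognize as '1'."""

    def __init__(self):
        # speaker -> {symbol: trained feature vector}
        self.templates = defaultdict(dict)

    def train(self, speaker, symbol, features):
        self.templates[speaker][symbol] = features

    def recognize(self, speaker, features):
        # Compare only against this speaker's own trained templates.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(self.templates[speaker],
                   key=lambda s: dist(self.templates[speaker][s], features))

r = SpeakerDependentRecognizer()
r.train("maria", "1", [0.9, 0.1, 0.2])   # Maria trained "1" as "uno"
r.train("maria", "2", [0.1, 0.8, 0.3])
print(r.recognize("maria", [0.85, 0.15, 0.2]))  # → 1
```

Because templates are never shared across speakers, the system carries no assumptions about language or accent, which is precisely the property the warehouse application relies on.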
So what would happen if those four cats were part of a normal sentence? Now the "cat name" is not spoken in isolation, and in addition there are unknown words mixed in. The following spectrum contains the four cats. I'm pretty sure it will be hard for you to find them, and I'm quite sure it would be even harder to imagine an algorithm that finds them automatically.
It shouldn't take you too long to notice that in normal speech, words are not spoken in isolation; they run together. Also, adjacent words change each other slightly: the sounds are co-articulated. We'll talk about how to attack this part of the problem in the next installment.
[long sentence: a big lion and a smart tiger and a silly leopard walk into a bar with a cheetah]