Why Alexa Misbehaves
Alexa has taken some initiative in the past (e.g. laughing maniacally in the night) and she has even recorded conversations and sent them to the owner's contacts. Is this maniacal or malicious? Who knows, but it is explainable.
A few months back, we heard about issues with Alexa laughing at inappropriate times, occasionally even in the middle of the night sitting right there next to you on your nightstand (unnerving). More recently, we've heard about Alexa making audio transcriptions of conversations and sending them as messages to friends and acquaintances (somewhat more disturbing). Alexa seems to be such a nice and friendly synthetic being, so why is she doing this? Can she trace her lineage back to HAL? Is she carrying his digital DNA disorder? This would certainly be a great plot line for yet another doomsday AI apocalypse movie, but (sadly?) that's not the case. It is very much a garden-variety engineering artifact, which is a part of any system that relies on imperfect sensor inputs. In fact, we as humans are afflicted by the same dilemma.
You may not be aware of it, but when you hear (see, smell, taste, feel) the world, you react not only to what you've heard but also to how strongly you believe you heard it. Most of the time, your hearing accuracy can be judged purely by the acoustic input. Humans are very adept at recognizing isolated words acoustically. We can easily determine whether we heard the word "banana" or "grapefruit." Even in the presence of a great deal of background ambient noise, humans can hear the correct fruit quite reliably, but at some point, as more background noise is added, even those acoustically distinct words become difficult to identify accurately. To complicate things, not all words are equally distinct. If you were given the same task but the two words were "tomato" and "potato," it would take less added background noise to make you misrecognize them. In the worst-case scenario, some very common words sound exactly alike: there, they're, their or here, hear. Other words sound almost exactly alike (e.g. in US English): matter, madder or ladder, latter.
Obviously, we do correctly perceive most of the words in the sentences we hear, even in the midst of a crowded and noisy sporting event. So, something else must be helping us. While the individual word recognition system is based on the acoustics (how phonemes go together) within a word, this other system is based on how words go together in phrases and, by extension, how phrases go together into sentences. This type of processing is usually lumped under the labels Natural Language Processing (NLP) and Natural Language Understanding (NLU). Noise exists in this context too: words can be assembled in an astronomically large number of unique ways, and it is common to have extraneous words (uh, um, etc.) or even jumbled order (Yoda this problem he has).
To recap (I promise I'm getting back to Alexa soon), the individual words that we think we hear have associated confidence values. At the next level of recognition, individual phrases and sentences have their own confidence values, based on how probable it is that those words fit together. As humans, we effortlessly do all of these probabilistic computations to reliably extract a string of highly likely words from a human vocalization, even though it is blended with other ambient sounds.
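To make the two levels of confidence concrete, here is a minimal Python sketch. It is only an illustration: the candidate words, the acoustic confidences, the bigram probabilities, and the `FLOOR` value are all made-up assumptions, not numbers from any real recognizer. It combines a per-word acoustic score with a "how well do these words go together" score, and picks the most likely sentence even though the homophones are acoustically indistinguishable.

```python
import itertools
import math

# Hypothetical acoustic confidences for each candidate word at each position.
# The homophones in the first slot are deliberately given identical scores.
ACOUSTIC = [
    {"they're": 1 / 3, "their": 1 / 3, "there": 1 / 3},
    {"going": 0.9, "growing": 0.1},
    {"home": 0.8, "hum": 0.2},
]

# Hypothetical bigram probabilities P(word | previous word).
# "<s>" marks the start of the sentence.
BIGRAM = {
    ("<s>", "they're"): 0.4, ("<s>", "their"): 0.3, ("<s>", "there"): 0.3,
    ("they're", "going"): 0.30, ("their", "going"): 0.01, ("there", "going"): 0.02,
    ("going", "home"): 0.20,
}
FLOOR = 0.001  # assumed probability for word pairs not listed above


def best_sentence(acoustic, bigram):
    """Score every possible word sequence and return (words, log_score) of the best."""
    best = None
    for words in itertools.product(*[slot.keys() for slot in acoustic]):
        score = 0.0
        prev = "<s>"
        for slot, word in zip(acoustic, words):
            # Add log acoustic confidence and log word-sequence probability.
            score += math.log(slot[word]) + math.log(bigram.get((prev, word), FLOOR))
            prev = word
        if best is None or score > best[1]:
            best = (words, score)
    return best


words, score = best_sentence(ACOUSTIC, BIGRAM)
print(" ".join(words), round(score, 2))
```

Acoustics alone cannot choose among the three homophones here; only the word-sequence score breaks the tie. Real systems search far larger spaces with much better language models, so treat this brute-force loop as a toy.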
Synthetic agents like Alexa need to have the same functional skill if we expect to interact with them in a human-like fashion. All of the speaker-device-based synthetic agents have approached this problem in a similar way. These physical devices have some sort of specialized microphone audio processing input that reduces ambient noise. Usually, this includes an array of microphones and beamforming, which is the computational equivalent of pointing a physical parabolic microphone in the direction of the speaker. When the microphone array detects acoustic energy that resembles human speech, the system does real-time computations involving the amplitudes and temporal displacements (sound arrives at each microphone at a slightly different time). The effect is like steering an invisible beam in the direction of the presumed human speech. In addition, if the device itself is playing some audio, it can scale and invert the audio output signal and combine it with the microphone signal, effectively removing the music from the input signal and dramatically improving the signal-to-noise ratio.
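The simplest form of this idea is usually called delay-and-sum beamforming. The Python sketch below is a toy under loud assumptions: three microphones, whole-sample delays that are already known, and a pure sine wave standing in for the talker. Real devices have to estimate fractional delays on the fly, and nothing here is taken from Amazon's implementation.

```python
import math


def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its integer sample delay."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(n)
    ]


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


# Simulated talker: a sine wave with a period of 8 samples.
source = [math.sin(2 * math.pi * i / 8) for i in range(256)]

# The sound reaches the three microphones 0, 2, and 4 samples late (assumed).
true_delays = [0, 2, 4]
channels = [[0.0] * d + source[: len(source) - d] for d in true_delays]

steered = delay_and_sum(channels, true_delays)  # beam aimed at the talker
unsteered = delay_and_sum(channels, [0, 0, 0])  # beam aimed somewhere else

print(power(steered), power(unsteered))
```

When the delays match the talker's direction, the three copies line up and reinforce each other. When they do not, the copies partially cancel, which is exactly what we want to happen to sounds arriving from other directions.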
Another technique used in combination with beamforming is spectral subtraction of noise that is constant (e.g. the noise of an air conditioner). The idea is that noises that are relatively constant in the background usually have a characteristic spectrum that is different from human speech (imagine the hum of a fluorescent light or the whistle of a teakettle). So if a Fourier spectral analysis is done on the ambient sounds before and after the speech, the result can be used to create a computational filter that attenuates only the specific narrow frequency bands where the ambient noise is present. The resulting signal has the ambient noise greatly diminished.
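Here is a minimal Python sketch of spectral subtraction, again as an illustration rather than a production design. It assumes a single 64-sample frame, a hum and a "voice" that are each a pure tone, and a noise-only recording that matches the hum exactly. The naive `dft`/`idft` helpers are written out only to keep the example dependency-free; a real system would use an FFT library and process many overlapping frames.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for a tiny demo frame)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtract(noisy, noise_only):
    """Subtract the noise magnitude from each bin, keeping the noisy signal's phase."""
    cleaned = []
    for y, n in zip(dft(noisy), dft(noise_only)):
        magnitude = max(abs(y) - abs(n), 0.0)  # never let a magnitude go negative
        cleaned.append(cmath.rect(magnitude, cmath.phase(y)))
    return idft(cleaned)


def rms_error(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


FRAME = 64
hum = [0.5 * math.sin(2 * math.pi * 8 * t / FRAME) for t in range(FRAME)]  # steady noise
voice = [math.sin(2 * math.pi * 3 * t / FRAME) for t in range(FRAME)]      # stand-in for speech
noisy = [v + h for v, h in zip(voice, hum)]

cleaned = spectral_subtract(noisy, hum)
print(rms_error(noisy, voice), rms_error(cleaned, voice))
```

In this idealized case the hum occupies one narrow band that the voice does not touch, so it is removed almost perfectly. With real speech and real noise the bands overlap, and the subtraction leaves behind some distortion in the voice signal.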
Once all of these techniques have been applied, our microphone array can extract relatively clean speech in a noisy room at a distance. In a moderately quiet room, these systems work easily up to 15 feet away. But even though the signal-to-noise ratio is better as a result of all these techniques, the improvement comes with an engineering trade-off (all things do). After all these mathematical/electronic transformations, there is some inevitable distortion of the underlying voice signal, which brings us back to the issue of how well our Automated Speech Recognizer (ASR) and Natural Language Processing (NLP) will detect the correct words (the words that were actually spoken). So, we must assume that the transcription handed off to the application frequently contains some errors. Some of these errors are minor (e.g. hearing "a" when "an" was spoken) and some, while completely understandable, are also very wrong (e.g. hearing "call Marilyn" when "call Madelyn" was spoken). Clearly, as the utterance becomes shorter, any error in it becomes more important (the error is a bigger percentage of the total utterance). This situation immediately collides with the short-command-driven paradigm that these smart speakers rely on for their concise usability (e.g. "play NPR," "gimme the weather," "Tesla stock"). Another issue is that these devices do not want to appear dull-witted or hard of hearing, so they are inclined to act on utterances at a relatively low confidence level in order to avoid annoying the user with "I'm sorry, did you say blah blah blah?"
In the early days of the telephone calling feature of Alexa, I found it almost unusable because every component of the call was verified: "Did you want to call Madelyn?" "Did you want to call her from Emmett's phone?" Amazon has since added features such as speaker identification, so Alexa knows that I am the one placing the call. Perhaps because I call Madelyn often, it also ascribes higher confidence to the full utterance, even if the word-level phonetic confidence for "Madelyn" is lower than what would be needed to call a person I've never called before. In any case, Alexa now responds to the simple phrase "message Madelyn." Alexa does prompt with "what's the message," but if the Echo output volume is low or you're not expecting it, then it's easy to miss. Alexa proceeds to dutifully transcribe everything you say (continuously, until you give a long pause of several seconds) and sends it to Madelyn!
Behind the scenes, there is yet another problem. All voice dialing systems that have access to your contact list will try very hard to fit the name they thought they heard to one that's in your contact list. It is a sensible strategy, since odds are you're going to call someone you know. So, in this case, it's pretty easy to see how a person speaking at low volume some distance from the Echo device might trigger the wake-up word (try quietly saying "Alaska" or "I'll ask a friend" or "axles"). Imagine you might say something like, "I should send a message to Bill about that tomorrow" or, "Yeah, I get the message," followed by something that sounds like somebody's name in your contact list. Chances are somebody's going to get a message!
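The "fit it to the contact list" strategy can be sketched in a few lines of Python using the standard library's `difflib`. This is an assumption-laden stand-in: the contact list, the `match_contact` helper, and the 0.6 cutoff are hypothetical, and real voice dialers compare phonetic representations rather than spellings. It does show the behavior, though: whatever was heard gets snapped to the nearest known name as long as it is close enough.

```python
import difflib

# Hypothetical contact list.
CONTACTS = ["Madelyn", "Marilyn", "Chris", "Bill", "Emmett"]


def match_contact(heard, contacts, cutoff=0.6):
    """Return the contact whose spelling is closest to what was heard, or None."""
    by_lowercase = {name.lower(): name for name in contacts}
    matches = difflib.get_close_matches(
        heard.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else None


print(match_contact("Madeline", CONTACTS))   # a near miss still finds a contact
print(match_contact("Marlin", CONTACTS))     # so does a word that was never a name
print(match_contact("xylophone", CONTACTS))  # only very distant input is rejected
```

Lowering the cutoff makes the system more forgiving of a mumbled name, and also more willing to find a contact in words that were never meant as one.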
I have personally had something even less probable happen. I previously set my device to wake up with the word "computer." One day while I was watching the news, a reporter did a story about "computers" (which woke up the device) just as I told my wife that "I had to call Chris tomorrow." Alexa cheerfully announced "Calling Chris!" A lot of hurried, "Stop! Stop! Cancel!" ensued. Also, a word of warning: if you do decide to set your device to wake up to "computer," you will not be able to watch any Star Trek without ensuing hilarity.
So, speed and convenience of use push both the user and the developer towards shorter commands delivered at a greater distance with more ambient noise, which exacerbates all of the previously mentioned pitfalls on the way to recognizing the utterance. This tacit contract between the user and the developer is a delicate balance between how efficiently the system works and how often it makes mistakes. It's a little bit like having a conversation with your hard-of-hearing grandfather, who is not a native English speaker, at a large family reunion with the football game playing in the background.
So now it is easy to see that the issue with Alexa laughing was a function of the developers permitting very short commands to induce Alexa to laugh. The developers reasonably assumed that laughing occasionally, even if it was not requested, would be fairly harmless and possibly even amusing. Of course, an unintended consequence of Alexa's ability to listen carefully is that in a quiet room (e.g. an empty room, or at night while you are sleeping), any small sound (a grunt or a snore) is much louder than the ambient quiet, and Alexa will try to figure out if there is speech in the sound. Remember, so that users don't have to loudly over-articulate the wake word, the Alexa developers have set the triggering threshold relatively low. This in and of itself is not bad; the idea is that if Alexa can't understand what is said following the wake word, she won't do anything. In fact, she won't even interrupt and say "what did you say?" However, if there are some very short commands that (because they are thought to be relatively harmless) also have low trigger thresholds, then you set the stage for the curious case of the laughing in the night. I'm not sure I recall exactly how easy it was to induce Alexa to laugh before the fix (I think it was just "Alexa laugh"), but now she requires a much more obvious command such as "Alexa can you laugh," and her response is not just a laugh. She responds with a confirmation such as "Sure, tee hee hee."
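One way to picture this is as a table of per-command confidence thresholds, sketched below in Python. Everything in it is hypothetical: the intent names, the threshold values, and the idea that the fix amounts to raising a single number are my simplifying assumptions. As described above, the actual fix involved requiring a longer phrase and adding a spoken confirmation.

```python
# Hypothetical confidence thresholds (0 to 1) per command. "Harmless" commands
# get a low bar; commands with consequences get a high one.
THRESHOLDS_BEFORE = {"laugh": 0.30, "play_music": 0.50, "call_contact": 0.80}

# Modeling the fix as simply raising the bar for "laugh" (an assumption).
THRESHOLDS_AFTER = dict(THRESHOLDS_BEFORE, laugh=0.70)


def decide(intent, confidence, thresholds):
    """Act only if the recognizer's confidence clears the bar for this intent."""
    return "act" if confidence >= thresholds[intent] else "ignore"


# A snore in a quiet room that the recognizer weakly hears as "laugh".
snore_intent, snore_confidence = "laugh", 0.35

print(decide(snore_intent, snore_confidence, THRESHOLDS_BEFORE))
print(decide(snore_intent, snore_confidence, THRESHOLDS_AFTER))
print(decide("call_contact", snore_confidence, THRESHOLDS_BEFORE))
```

The same weak recognition that used to clear the bar for a laugh no longer does after the change, while a consequential command such as placing a call was never going to fire at that confidence.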
A simple enough fix, although it does somehow seem less fun. By the way, I don't know if you are aware that Alexa also knows how to scream, and I think Amazon made it harder for her to do that accidentally too. Darn.