The use of speech in applications ranging from IVR to desktop dictation to smartphone voice search has grown steadily and significantly over the last couple of decades. AVIOS has been a part of that journey, chronicling the paths and pitfalls of these challenging and exciting technologies. AVIOS surveys the technological horizon and examines future trends for speech and natural language. The present trajectory of the underlying AI technologies is the reason that AVIOS has refined its focus. Our upcoming annual conference at the end of January 2017 is titled "Conversational Interaction".
The rapid convergence of two technologies, in particular, has brought our industry to a tipping point.
Parallel to the development of speech-only interaction (IVR) is the evolution of text-based and touch-based styles of interaction that have become ubiquitous on the desktop and the smartphone in the form of chat windows. Lately, these types of text interactions have evolved into what are commonly called "chatbots" ("textbot" seems more accurate to me). We have all received tech support via a "chat window" on a web page. Granted, it all started as a real human helping us on the other side of that window, but the interaction felt normal in the way that "texting" has come to feel normal. When developers began automating this new interaction paradigm, it was well understood that "typing" had a much lower error rate than "speech recognition". So it is not surprising that early efforts to automate a "chat window" style of interaction were built upon NLP/NLU text analytics and state machines (similar to what is used for IVR applications) to manage the interaction flow.
Initially, what these systems did most reliably was classify the human intent into pre-defined subcategories and then transfer the UI experience to an existing page that provided additional detailed information about that subcategory. In fact, most virtual-agent-based chat windows today do precisely that kind of simple category detection followed by a redirection to more detailed information.
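To make the "category detection followed by redirection" pattern concrete, here is a minimal sketch in Python. It is an illustration only: the intent names, keyword lists, and page paths are hypothetical, and simple keyword overlap stands in for whatever NLP/NLU text analytics a real virtual agent would use.

```python
import re

# Hypothetical intent categories and trigger words (illustrative only).
INTENT_KEYWORDS = {
    "billing": {"bill", "invoice", "charge", "refund"},
    "password_reset": {"password", "login", "locked"},
    "shipping": {"shipping", "delivery", "package", "tracking"},
}

# Hypothetical destination pages, one per category.
INTENT_PAGES = {
    "billing": "/help/billing",
    "password_reset": "/help/password-reset",
    "shipping": "/help/shipping",
}

FALLBACK_PAGE = "/help"


def classify_intent(message):
    """Return the category whose keywords overlap the message most, or None."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    best_intent, best_score = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


def route(message):
    """Classify the message, then hand the user off to an existing page."""
    intent = classify_intent(message)
    return INTENT_PAGES.get(intent, FALLBACK_PAGE)


if __name__ == "__main__":
    print(route("I was charged twice on my last bill"))
    print(route("Where is my package?"))
    print(route("Hello there"))
```

Note that the system never answers the question itself. It only decides which existing page is most likely to help, which is why this style of automation was reliable long before open-ended conversation was.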
Concurrent with chatbot evolution, speech-based interactions continued to develop on the telephone. Because of the limitations of speech recognition at that time, these systems focused heavily on extracting details in small chunks. Talking to our bank IVR, we could say "checking" or "savings" to direct the system. Later these systems supported short, well-formed directives such as "transfer $400 to checking". But if you said "move $400 out of my savings into my other account" it would most likely fail, because neither the speech recognition nor the NLU was robust enough to handle utterances that open-ended. (Clearly, anticipating one complex utterance would have been doable, but the bigger issue was that one vague utterance opens the door to a vast number of potential utterances that the system must anticipate.) The handcrafted grammars and NLU analytics at that time were not up to the task.
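The brittleness of a handcrafted grammar is easy to see in a small sketch. The Python below is not how any particular IVR platform was built; it uses a single regular expression as a stand-in for a grammar rule, and the function name and returned fields are assumptions made for illustration. It accepts the well-formed directive from the banking example above and rejects the open-ended one.

```python
import re

# A stand-in for one handcrafted grammar rule: "transfer $<amount> to <account>".
DIRECTIVE = re.compile(
    r"^transfer \$(?P<amount>\d+(?:\.\d{2})?) to (?P<account>checking|savings)$"
)


def parse_directive(utterance):
    """Return the slots for a well-formed directive, or None if no rule matches."""
    match = DIRECTIVE.match(utterance.strip().lower())
    if match is None:
        return None
    return {
        "action": "transfer",
        "amount": float(match.group("amount")),
        "to": match.group("account"),
    }


if __name__ == "__main__":
    print(parse_directive("transfer $400 to checking"))
    print(parse_directive("move $400 out of my savings into my other account"))
```

Adding a second rule for "move ... out of ... into ..." would handle that one sentence, but each new phrasing needs its own rule, which is exactly the combinatorial problem described above.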
But powerful advances have been demonstrated over the last decade. As speech professionals working with recognition technology for the last two decades, we would not have predicted that the average person could do near-dictation-quality speech recognition on a cell phone or with the built-in microphone on an inexpensive laptop from 5 feet away. Speech recognition is still far from being as good as a human, but it is quite good enough to do conversational transcription over less-than-ideal audio channels. While NLP/NLU has not made such dramatic advances, it has become good enough to do the needed analytics at conversational speed. One clear sign that NLU intent analysis is improving is that it’s available from multiple vendors as a RESTful microservice (e.g. api.ai, wit.ai, and others).
These technologies are now converging, and combined with multimodal fusion they give us a seamless touch-talk-type interaction. Users are less interested in being led through a dialog. Instead, they expect to be part of a richer conversation. They want to participate in a natural interaction and prefer not to simply micromanage an app. Rich interaction does not need to be a long chatty conversation. It just needs to be aware of context:
human: I'm leaving work at two today.
computer: I'll send a note to your team. Should I set your home thermostat for 2 PM arrival?
human: sure, thanks.
computer: okay, later.
human: oh, let Megan know too.
computer: sure, I’ll text your wife that you’re heading home.