Google Duplex Is Here?
Was that demo you saw where Google Assistant made a restaurant reservation for you really that cool? Is it actually radically new? The long and short answers are: maybe.
Most of us have watched Google's recent demonstration, in which Google Assistant makes phone calls to set up appointments and make reservations for you. If you haven't watched it, please do.
First of all, it was very cool, and on the flip side, it was a bit creepy. However, upon reflection, you might seriously wonder:
Does it really work?
How much of the demo was actually real?
Okay, so it sort of worked, but how was such a thing built?
If you're like me, tech demos like this make questions instantly pop into your brain. Let's delve into how such a thing works. First off, I would like to acknowledge that I have no privileged information from the Google team about what they did "behind the curtain." They don't seem eager to talk about the details. However, I have been working on human-computer conversational systems for decades (my Cassandra won a prize in 2010), and I have built many similar conversational systems and understand the pros and cons underlying their construction.
Until the last few years, all human-computer conversational systems (with the possible exception of one or two implementations) were developed with well-structured declarative languages. Most of those systems were built with a procedural, flowchart-like language such as VoiceXML, and a few were developed with state-based frameworks. These frameworks track the state of the conversation (e.g., the last user input, what information is still missing, contextual elements like time, etc.), transition to the best (most probable) state for which they have a definition, and continue the conversation from there. You can find an example of statechart-based development here.
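To make the state-based idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the slot names, the `DialogState` class, and the state names are hypothetical and not taken from any particular framework. It also picks the next state deterministically from whatever is still missing, whereas a real framework might score candidate states by probability.

```python
from dataclasses import dataclass, field

# Hypothetical slots for a restaurant-reservation dialog.
REQUIRED_SLOTS = ("day", "time", "party_size")


@dataclass
class DialogState:
    """Tracks what the conversation knows so far."""
    last_user_input: str = ""
    slots: dict = field(default_factory=dict)

    def missing_slots(self):
        # Slots are requested in the fixed order of REQUIRED_SLOTS.
        return [s for s in REQUIRED_SLOTS if s not in self.slots]


def next_state(state):
    """Choose the next state name from the information still missing."""
    missing = state.missing_slots()
    if not missing:
        return "confirm"
    return f"ask_{missing[0]}"


def update(state, user_input, extracted):
    """Record the latest input and any extracted slot values, then transition."""
    state.last_user_input = user_input
    state.slots.update(extracted)
    return next_state(state)
```

Calling `update(state, "Friday at 7", {"day": "Friday", "time": "19:00"})` on a fresh state would move the conversation to `ask_party_size`, because that is the only slot left unfilled.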
Some research has been done in academia on promising fronts such as Partially Observable Markov Decision Processes (POMDPs, used with reinforcement learning) and other probability-based approaches such as Naïve Bayes, but I doubt that anyone has experienced systems in the wild that were developed using these methodologies. One thing that has progressed significantly over the last two or three years is the set of tools that make it easier for the conversation designer to recognize the "intent" of the spoken or textual input from the user. If you have played with Google's Dialogflow or tried to build an Alexa skill, then you understand the power of working with the abstraction of intents versus working with a collection of words.
Dialogflow intention and data slot example:
Prior to the widespread introduction of intents, many applications used simple pragmatic methods such as word spotting. These worked most of the time, but when they didn't work, they led to amusing, frustrating, or disastrous results.
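A toy word spotter shows both why the approach was popular and how it goes wrong. This is only a sketch: the keyword table and the action names are hypothetical.

```python
import re

# Hypothetical keyword table for a toy word spotter.
KEYWORDS = {
    "cancel": "cancel_reservation",
    "book": "make_reservation",
    "reserve": "make_reservation",
}


def spot_intent(utterance):
    """Return the action for the first keyword found, ignoring all context."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for word in words:
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None
```

With this table, "I'd like to book a table" maps to `make_reservation` as intended, but "Please don't cancel my reservation" maps to `cancel_reservation`, because the spotter sees "cancel" and never notices the negation.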
Both Google and Amazon provide a similar approach to simplifying the handling of user intent so that the developer no longer needs to do their own natural language processing. Not too long ago, developers had to write sophisticated context-free grammars (e.g., SRGS) to parse the input words into a small set of intents that the application planned to service. Today, the developer need only give a few examples, and the platform-provided intent tool can use other linguistic information behind the scenes to cover a wide range of similar but not explicitly defined inputs that mean the same thing.
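The "give a few examples" workflow can be imitated with a very crude stand-in. The sketch below matches an utterance to the intent whose example phrases share the most words with it (Jaccard overlap). To be clear, this is not how Dialogflow or Alexa work internally; those platforms draw on much richer linguistic information. The intent names, phrases, and threshold are all hypothetical.

```python
import re

# Hypothetical intents, each defined only by a few example phrases.
TRAINING_PHRASES = {
    "make_reservation": [
        "I would like to book a table",
        "can I reserve a table for tonight",
    ],
    "ask_hours": [
        "what time do you open",
        "are you open on Sunday",
    ],
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(utterance, threshold=0.2):
    """Return the best-overlapping intent, or None if nothing is close enough."""
    query = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, phrases in TRAINING_PHRASES.items():
        for phrase in phrases:
            score = jaccard(query, tokens(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None
```

Here "could I book a table for two" is not among the examples, yet it overlaps enough with them to be classified as `make_reservation`, while an unrelated sentence falls below the threshold and returns `None`.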
So, the short answer to the question, "Did the Google Assistant Duplex demonstration really work?" is: yes, we can make something like that using the existing, albeit primitive, procedural or state-driven technologies that are used in all the automated phone apps we have today. It is possible (although not practical) because the context is known in advance and very well contained. Calling to make a reservation for a table is not likely to be a philosophical or rhetorical adventure. There are several well-defined slots to fill: day, time, and number of people. And the application developer could define concise and functional strategies to deal with the atomic issues that the restaurant receptionist might present:
"We don't have a table for four until 6 PM."
"Sorry, how many?"
"At four or for four?"
The dialog can be driven by a simple goal of volunteering all the information for the slots that must be filled. In some ways, it may actually be easier for an agent to make the reservation than it would be to devise an agent to take the reservation from a human. The Google Assistant application can rely on the human intelligence of the reservation taker to pull out the salient details even if they are presented in an imperfect and disorderly fashion. The human reservation taker is incented to "make the sale" even if the reservation maker doesn't seem that intelligent.
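As a rough illustration of how contained this problem is, the following sketch has a calling agent volunteer every slot up front and then handle the three receptionist replies quoted above. It is a hand-written toy, not a description of what Google built: the goal values, the `acceptable_times` field, and the matching rules are all assumptions.

```python
import re

# Hypothetical goal the calling agent is trying to achieve.
GOAL = {
    "day": "Friday",
    "time": "5 PM",
    "party_size": 4,
    "acceptable_times": {"5 PM", "6 PM"},
}


def restate(goal):
    """Volunteer every slot at once; the human can pick out what they need."""
    return (f"A table for {goal['party_size']} people "
            f"on {goal['day']} at {goal['time']}.")


def respond(utterance, goal):
    """Pick a reply to the receptionist using a few hand-written rules."""
    text = utterance.lower()
    if "how many" in text:
        return f"{goal['party_size']} people."
    # Counteroffer such as "We don't have a table for four until 6 PM."
    offer = re.search(r"until (\d{1,2}) ?(am|pm)", text)
    if offer:
        offered = f"{offer.group(1)} {offer.group(2).upper()}"
        if offered in goal["acceptable_times"]:
            return f"{offered} works, thank you."
        return "That's too late for us, thanks anyway."
    # Anything else, including ambiguity like "At four or for four?",
    # is handled by restating all of the slots.
    return restate(goal)
```

The fallback is the interesting design choice: rather than trying to understand every question, the agent repeats the full request and relies on the human to extract the missing detail.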
My guess is that this application was largely handcrafted by UX designers and engineers using the Google Dialogflow framework. I'm certain that it took advantage of the existing intent methodologies in order to manage the dysfluencies that would naturally occur from the receptionist. The primary value of using intents is that it reduces the special-case coding necessary to deal with the myriad possible variations of what the human might say. This conversational combinatorial explosion is one of the big problems that make longer, less constrained interactions impractical. Also, the slot-filling requirements for dates, times, headcount, etc. are well within the capabilities of Dialogflow. So, I believe it is not a mysterious giant neural net thinking on its own. It has a much closer kinship to old-timey, voice-based phone applications. However, if it gets a lot of use and a lot of data is collected (millions of conversations), then that data may be used to create a purely neural/statistical model that could do as good a job. From that point on, the model could continue to learn, ever refining and improving. This handcrafted first step may be an excellent foundation upon which to build a more statistical, trainable conversation manager in the future.
Another question: if it is a viable technology, how will it be accepted by the humans forced to interact with it? Only time will tell. It's quite possible that phone receptionists will recognize that it is the Google Assistant calling them and learn the most efficient conversational path to extract the information needed. In that case, it will become less conversational because the receptionist will interrogate the Google Assistant instead of chatting with it. On the other hand, if I receive a call from the Google Assistant of one of my friends who is trying to set up a meeting with me, then I would definitely mess with it. I would test it until it failed (sorry, that's the kind of guy I am). So, it may succeed in certain applications, but I predict that it will fail when people try to use it to replace personal interaction with their friends.
On that note, you might want to watch what comedian/philosophers think of this new technology and how it might be used: