Dialog Management as a Key Technology for Conversational Systems
ML is applied to a lot of domains these days. One of the interesting applications is learning to have a conversation. This panel explores how to think about the problem.
The following are notes and highlights from a very interesting panel discussion about technology principles, with panelists from leading companies that provide conversational agents (chatbots, personal assistants, etc.).
- Conference: Conversational Interactions
- Session: Dialog Management as a Key Technology for Conversational Systems
- Date: February 6, 2018, in San Jose, CA
Who was on the panel:
- Ilya Gelfenbeyn, Lead Product Manager, Dialogflow at Google: Ilya has been working on conversational agents for over a decade. Focused on supporting conversations with multiple turns servicing multiple backends. Dialogflow is Google’s main platform for extending Google Assistant. It’s a cross-platform tool used by Facebook bots and other assistants. The focus is on the issues of creating true conversations.
- Alborz Geramifard, Machine Learning Manager at Amazon: He started building the Alexa that we see today and has built a team to address the challenges presented by longer open-ended conversations, with the goal of advancing conversational AI. The focus is on moving from the mechanical and transactional toward more personalized, natural voice experiences.
- Nirmal Mukhi, Master Inventor at IBM Watson Education: Applying AI technology to education, primarily focused on tutoring. Working on the harder goal of creating a tutoring experience for any piece of content. These use cases require addressing problems related to creating rich conversational interactions.
Are there examples in the marketplace today of truly conversational systems that you can quickly describe? Assuming there is a demand, this panel is primarily asking the question, “What is keeping us from creating truly conversational systems?”
Before we ask this question more directly, let’s build up to it. Most commercial systems today seem to take a “form-filling” approach to driving conversation. Do you think this approach is sufficient for the majority of applications? What approaches are more powerful than ‘form filling’ and ‘question-answering’? What approaches do you think we will need to adopt in order to create more flexible conversations?
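For readers unfamiliar with the term, the "form-filling" approach mentioned above can be sketched roughly as follows. This is only an illustrative toy, not any panelist's implementation; the slot names, prompts, and helper functions are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical slots for a hotel-booking form; names are illustrative only.
REQUIRED_SLOTS = ["city", "check_in", "nights"]

PROMPTS = {
    "city": "Which city?",
    "check_in": "What date do you want to check in?",
    "nights": "How many nights?",
}


def next_prompt(form: Dict[str, str]) -> Optional[str]:
    """Return the prompt for the first unfilled slot, or None when the form is complete."""
    for slot in REQUIRED_SLOTS:
        if slot not in form:
            return PROMPTS[slot]
    return None


def fill(form: Dict[str, str], slot: str, value: str) -> Dict[str, str]:
    """Return a copy of the form with one more slot filled."""
    updated = dict(form)
    updated[slot] = value
    return updated


form: Dict[str, str] = {}
form = fill(form, "city", "San Jose")
print(next_prompt(form))  # the system asks for the next missing slot
```

The conversation here is driven entirely by which slots are still empty, which is why this style tends to feel mechanical once the user steps outside the form.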
Here is our panel’s primary question: What is keeping us from creating truly conversational systems?
What are the roadblocks and how can we overcome them to create truly conversational systems? The best user experiences today seem to be the result of talented individuals rather than the result of reproducible design methodologies. Is there a methodology that can be consistently applied to create good user interface designs for conversational systems?
Please discuss your perception of the demand in the marketplace for true conversational systems.
Ilya: Define what true conversational systems are: supporting the context of the conversation, having multiple topics, understanding speakers.
Demand from companies is huge, but from consumers, demand is driven by expectations. If you ask users, they don’t expect the assistant to be aware of the context or support clarifying questions. They are used to search engines, so when we say now you can ask your question in a natural way, we don’t see users actually doing that or using clarifying questions.
Example of a complex query: What will be the weather in San Jose? Now, book a hotel there. Or it’s my wife’s birthday, so order flowers. Users don’t think systems can actually do this. When they believe systems can support those, then we’ll see demand grow.
Alborz: Lots of companies are all excited about bringing the conversation to the next level. But there are big gaps. [Shows Moviebot demo video.] Moviebot shows a long conversation where the user asks a lot of follow-up questions about directors, movies, actors, plots, ratings, where you can watch it, etc. It has interesting qualities: some personalization, some customization. It was launched to see if people would have longer conversations with it. While most topics are just one query/command, the average Moviebot conversation length is much greater and more varied.
Nirmal: Demand isn’t there because technology isn’t there. But as we open it up, we will run into issues, such as how many times Moviebot said “Steven Spielberg” and didn't use coreference pronouns. We do this naturally, but machine learning and NLU are not up to this yet.
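To make the pronoun point concrete, here is a deliberately naive sketch of the generation side: use a pronoun when the entity being mentioned is the one mentioned most recently. The `MentionTracker` class is hypothetical, and real systems need far more than this (gender, number, ambiguity between entities), which is part of why the problem is hard.

```python
from typing import Optional


class MentionTracker:
    """Toy rule: repeat a name only when the topic entity has changed."""

    def __init__(self) -> None:
        self.last_entity: Optional[str] = None

    def refer(self, name: str, pronoun: str) -> str:
        """Return the pronoun if `name` was the last entity mentioned, else the name."""
        if name == self.last_entity:
            return pronoun
        self.last_entity = name
        return name


tracker = MentionTracker()
first = tracker.refer("Steven Spielberg", "he")
second = tracker.refer("Steven Spielberg", "he")
print(f"{first} directed Jaws. Later, {second} directed E.T.")
```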
Are there examples in the marketplace now where we see truly conversational systems?
Alborz: We don’t see anything out there. That’s why we set up the Alexa Prize: a $2.5M prize for building a bot to chat for 20 minutes (or get the highest rating). No one won the grand prize, but there was lots of good work. Initially, the focus was on using the latest technology, but users didn’t respond positively because it didn’t work. State-of-the-art machine learning was getting low ratings. Many teams then started paying more attention to the users and began to improve by looking at how the interactions themselves could be improved.
Ilya: There are some niche areas where users are having some longer conversations in specific domains: movie questions, movie tickets, or niche topics where people aren’t motivated to ask random questions. Setting up expectations is key to success.
Nirmal: The key is setting up state and context and being able to understand language within that context. What IBM Watson is doing is more specific. Example: the Watson tutor, which is a cognitive tutoring system. It's a chat window (not voice) but is multimodal in providing different types of learning material. The interaction is stateful and conversational. It uses “Socratic” dialog where the system asks questions and the user answers (different from Alexa or Google Assistant, where the user asks the system). It starts with a general question and gets more specific. Students can say that they want to reread the material. The system provides a relevant section to read. The student comes back and questions continue. The system can generate questions on the fly to get students to create more specific questions and can also recognize when the student answers correctly but uses different wording than the questions. Students can ask questions. If questions are off track, the tutor will answer and help bring them back on track. Dialogs can be five or 30 turns of conversation. The goal is to keep students engaged and meet educational goals. You can have the best machine learning algorithms, but it's important to have the right tone and wording to keep the student positive and engaged.
What approaches do we need to adopt to have these more flexible conversations?
Alborz: To go to the next level, we need to create a feedback loop where we can have conversations and get feedback back directly, so I can change the way I interact. When the user is negative, we need to change. When feedback is positive, I can reinforce that activity. For machine learning, we need a lot of data. To get a lot of data, we need more and more positive interactions. In order to gather that data, the dialog manager has to work. It's a chicken-and-egg problem: you need a good policy to get a dialog manager that works well enough to gather data, but you need that data to learn a good policy. Current research is trying to use a simulated user to get more data (see NIPS 2017).
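The feedback loop and the simulated-user idea can be illustrated with a very small sketch. This is not the research Alborz refers to; it is a toy epsilon-greedy learner, and the action names, the simulated user's preferences, and all function names are made-up assumptions.

```python
import random
from typing import Dict, List

# Hypothetical response styles the dialog policy can choose between.
ACTIONS: List[str] = ["short_answer", "answer_with_follow_up"]


def simulated_user(action: str, rng: random.Random) -> float:
    """Toy stand-in for a real user: returns +1 (positive) or -1 (negative) feedback."""
    p_positive = 0.8 if action == "answer_with_follow_up" else 0.4  # made-up preferences
    return 1.0 if rng.random() < p_positive else -1.0


def update(values: Dict[str, float], counts: Dict[str, int], action: str, reward: float) -> None:
    """Keep a running average of the feedback observed for each action."""
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]


def choose(values: Dict[str, float], rng: random.Random, epsilon: float = 0.1) -> str:
    """Epsilon-greedy: mostly pick the best-rated action, sometimes explore."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: values[a])


rng = random.Random(0)
values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
for _ in range(2000):
    action = choose(values, rng)
    update(values, counts, action, simulated_user(action, rng))
print(values)  # estimated average feedback per response style
```

The simulated user lets the policy gather thousands of interactions cheaply, but the learned behavior is only as good as the simulator's resemblance to real users.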
Nirmal: In order to manage… we don’t have the tools. It still requires a lot of technical development. We need the creation of tools to be able to push forward so more can do it.
Ilya: Tools are important and bring attention to the problem. We need to look at the goals. Why do we need these truly conversational systems? When doing task-oriented dialog, the assistant is coordinating many different subsystems. We have to marry these third-party systems with the "main bot." So, we need to share context between apps. But that raises security questions. Dialog management is just one piece, focused only on the dialog; the decision-making is equally important but not necessarily distinguishable by the user. We need to connect to the knowledge graphs of the applications, but that’s difficult with multiple graphs. Exposing a user to different systems is hard since the user has no way of knowing where they are in the system. What systems are they actually talking to? From a user interaction perspective, users shouldn’t have to care about the state of the conversation. They should be able to let the assistant handle all of the information.
Alborz: In terms of simplicity, this is true, but security is actually important to users, so there will be users who object to the system holding their personal information. Customer trust becomes really important.
Ilya: On the technical side, security is important, but from the user side, that’s not what they like to think about.
Nirmal: When we have actions with multiple steps, users need to know what stage of the interaction we are in.
In text messaging, people misunderstand each other all the time. It's easier in voice. Are any of you looking at recognizing emotional state from the voice? This may impact feelings about security.
A: Amazon is very excited about this. They are working on it, though users are split; some don’t want their emotions analyzed.
I: For Dialogflow, most systems don’t supply the text, much less voice and emotion recognition, to third-party developers. So that is interesting, but we need to consider how to get users' permission.
N: Some of these detections happen over multiple turns. We need to track changes in emotion.
I: This is particularly an issue with children.
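Nirmal's point about tracking emotion across several turns, rather than judging a single utterance, can be sketched with a simple exponential moving average. The per-turn scores are assumed to come from some emotion or sentiment classifier that is not shown here; the function name and the score range are hypothetical.

```python
from typing import List


def track_emotion(scores: List[float], alpha: float = 0.5) -> List[float]:
    """Smooth per-turn emotion scores so a trend across turns becomes visible.

    `scores` are assumed to be in [-1, 1] (negative to positive), one per user turn.
    """
    smoothed: List[float] = []
    current = None
    for score in scores:
        # The first turn starts the average; later turns are blended in.
        current = score if current is None else alpha * score + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


print(track_emotion([1.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25]
```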
What is the formalism underlying Moviebot (or other apps at Amazon)? What takes context into account? [Yours truly, Emmett Coin, asked this!]
A: Moviebot was trying to solve the chicken/egg problem by collecting data on something that had multiple interactions. In the course of creating that, we found that there were repeats, so we created a way to track what had already been said in order to track the dialog state. It is more like a rule-based approach in that it’s keeping track of what’s inferable.
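The idea of tracking what has already been said, in order to avoid repeats, can be sketched as a tiny rule-based state tracker. This is an illustrative assumption about the general technique, not Moviebot's actual design; the `DialogState` class and the fact format are hypothetical.

```python
from typing import List, Set, Tuple

# (entity, attribute, value), e.g. ("Jaws", "director", "Steven Spielberg")
Fact = Tuple[str, str, str]


class DialogState:
    """Remembers which facts the bot has already told the user."""

    def __init__(self) -> None:
        self.said: Set[Fact] = set()

    def select_new(self, candidates: List[Fact]) -> List[Fact]:
        """Return only facts not yet mentioned, and mark them as said."""
        fresh: List[Fact] = []
        for fact in candidates:
            if fact not in self.said:
                self.said.add(fact)
                fresh.append(fact)
        return fresh


state = DialogState()
director = ("Jaws", "director", "Steven Spielberg")
year = ("Jaws", "year", "1975")
print(state.select_new([director]))        # first mention: the director fact
print(state.select_new([director, year]))  # the director fact is skipped as a repeat
```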
N: We use similar policies for understanding context and references: tracking the last thing said, and sometimes something more complex depending on the type of conversation. We need that feedback loop to be able to improve, either after the fact or online.
There is reinforcement learning between humans and bots, and between bots and bots. How do you do reinforcement learning?
A: In interactions with a human, you need to figure out when the human is satisfied. But when you create a bot, you can know what’s inside the brain of the bot, so you can track bot satisfaction. Once a dialog makes a decision, you can immediately assign credit.
[Note: if you are working on conversational systems, you should consider the Conversational Interaction conference next year. I'll see you there!]