Chatbots With Machine Learning: Building Neural Conversational Agents

AI can easily set reminders or make phone calls—but discussing general or philosophical topics? Not so much. Learn how to fix this with networks and machine learning.

You have probably used Siri, Alexa, or Cortana to set an alarm, call a friend, or arrange a meeting. But despite their usefulness in common and routine tasks, it’s difficult to force conversational agents to talk about general, sometimes philosophical topics. The Statsbot team spoke with data scientist Dmitry Persiyanov to learn how to fix this issue with neural conversational models and how to build chatbots using machine learning.

Interacting with a machine via natural language is one of the requirements for general artificial intelligence. This field of AI is referred to as dialogue systems, spoken dialogue systems, or chatbots. The machine needs to provide you with an informative answer, maintain the context of the dialogue, and, ideally, be indistinguishable from a human.

In practice, the last requirement is not yet reachable. But luckily, humans are ready to talk with robots if they are helpful — sometimes, they can even be funny and interesting interlocutors.

There are two major types of dialogue systems: goal-oriented (e.g., Siri, Alexa, Cortana) and general conversation (e.g., Microsoft Tay bot). The former help people solve everyday problems using natural language, while the latter attempt to talk with people on a wide range of topics.

In this post, I will give you a comparative overview of general conversation dialogue systems based on deep neural networks. I will describe the main architecture types and ways to advance them. There will also be a lot of links to papers, tutorials, and implementations.

I hope this post will eventually become the entry point for everyone who wants to create chatbots with machine learning. If you read this post till the end, you will be ready to train your own conversational model. Ready?

I’m going to refer to recurrent neural networks and word embeddings, so you should know how these work in order to easily follow the article. For those who need to refresh their knowledge, I’ve prepared great tutorials at the end of the article for you.

Generative and Selective Models

General conversation models can be simply divided into two major types: generative and selective (or ranking) models. Also, hybrid models are possible. But the common denominator is that such models take in several sentences of dialogue context and predict the answer for this context. In the picture below, you can see the illustration of such systems.

Throughout this post, when I say “a network consumes a sequence of words” or “words are passed to RNN,” I mean that word embeddings are passed to the network, not word IDs.

Note on Dialogue Data Representation

Before going deeper, we should discuss what dialogue datasets look like. All models described below are trained on (context, reply) pairs. The context is one or several sentences that precede the reply. A sentence is just a sequence of tokens from the vocabulary.

For better understanding, consider the following raw dialogue between two persons, from which a batch of three samples is extracted:

  • Hi!
  • Hi there.
  • How old are you?
  • Twenty-two. And you?
  • Me too! Wow!

Note the <eos> (end-of-sequence) token at the end of each sentence in the batch. This special token helps neural networks understand sentence boundaries and update their internal state wisely.
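
To make the data format concrete, here is a minimal sketch of how (context, reply) pairs with <eos> tokens could be built from the raw dialogue above. The helper names (tokenize, build_pairs), the lowercase tokenization, and the two-sentence context window are my assumptions for illustration; real pipelines differ in all three.

```python
import re

EOS = "<eos>"  # end-of-sequence token appended to every sentence

def tokenize(sentence):
    """Lowercase, split into word and punctuation tokens, and append <eos>."""
    return re.findall(r"[a-z0-9'-]+|[^\sa-z0-9]", sentence.lower()) + [EOS]

def build_pairs(dialogue, context_size=2):
    """Pair every reply with up to `context_size` sentences that precede it."""
    pairs = []
    for i in range(1, len(dialogue)):
        context = dialogue[max(0, i - context_size):i]
        context_tokens = [tok for sent in context for tok in tokenize(sent)]
        pairs.append((context_tokens, tokenize(dialogue[i])))
    return pairs

dialogue = ["Hi!", "Hi there.", "How old are you?", "Twenty-two. And you?", "Me too! Wow!"]
pairs = build_pairs(dialogue)
for context, reply in pairs:
    print(" ".join(context), "->", " ".join(reply))
```

Because every sentence ends with <eos>, a multi-sentence context stays separable even after it is flattened into one token sequence.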

Some models may use additional meta information from data such as speaker ID, gender, emotion, etc.

Now, we are ready to move on to discussing generative models.

Generative Models

We start with the simplest conversational model, based on the paper A Neural Conversational Model.


For modeling dialogue, this paper deployed a sequence-to-sequence (seq2seq) framework that emerged in the neural machine translation field and was successfully adapted to dialogue problems. The architecture consists of two RNNs with different sets of parameters. The left one (corresponding to A-B-C tokens) is called the encoder, while the right one (corresponding to <eos>-W-X-Y-Z tokens) is called the decoder.

How Does the Encoder Work?

The encoder RNN consumes a sequence of context tokens one at a time and updates its hidden state. After processing the whole context sequence, it produces a final hidden state, which captures the meaning of the context and is used for generating the answer.

How Does the Decoder Work?

The goal of the decoder is to take context representation from the encoder and generate an answer. For this purpose, a softmax layer over vocabulary is maintained in the decoder RNN. At each time step, this layer takes the decoder hidden state and outputs a probability distribution over all words in its vocabulary.

Here is how reply generation works:

  1. Initialize the decoder hidden state with the final encoder hidden state (h_0).
  2. Pass the <eos> token as the first input to the decoder and update the hidden state (h_0 -> h_1).
  3. Sample (or take one with max probability) the first word (w_1) from the softmax layer (using h_1).
  4. Pass this word as input, update the hidden state (h_1 -> h_2), and generate a new word (w_2).
  5. Repeat Step 4 until the <eos> token is generated or the maximum answer length is exceeded.

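For those who prefer formulas to words, reply generation in the decoder can be written as:

$$h_t = \mathrm{RNN}(h_{t-1}, w_{t-1}; \Theta), \qquad \hat{p}_t = \mathrm{softmax}\big(g(h_t; \varphi)\big), \qquad w_t \sim \hat{p}_t$$

Here, w_t is the sampled word at time step t (w_0 is the <eos> token and h_0 is the final encoder hidden state); Theta represents the decoder parameters; phi represents the dense layer parameters; g represents the dense layers; and p-hat is the probability distribution over the vocabulary at time step t.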

Using argmax while generating a reply, one will always get the same answer when utilizing the same context (argmax is deterministic, while sampling is stochastic).
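
The generation steps can be sketched in a few lines of Python. This is only an illustration under loud assumptions: decoder_step and the NEXT_WORD_PROBS table are a toy stand-in for a trained decoder RNN with a softmax layer, and the "hidden state" is just the tuple of words consumed so far.

```python
import random

EOS = "<eos>"
VOCAB = ["hi", "there", "friend", EOS]

# Toy stand-in for "decoder RNN + softmax": the next-word distribution
# depends only on the previous word. A real decoder would use its hidden state.
NEXT_WORD_PROBS = {
    EOS:      [0.7, 0.1, 0.1, 0.1],
    "hi":     [0.0, 0.6, 0.3, 0.1],
    "there":  [0.0, 0.0, 0.2, 0.8],
    "friend": [0.0, 0.0, 0.0, 1.0],
}

def decoder_step(hidden, word):
    """Consume one input word; return the new hidden state and a distribution over VOCAB."""
    return hidden + (word,), NEXT_WORD_PROBS[word]

def generate(encoder_state, sample=False, max_len=10, rng=None):
    rng = rng or random.Random()
    hidden = encoder_state   # step 1: start from the final encoder state
    word = EOS               # step 2: <eos> is the first decoder input
    reply = []
    for _ in range(max_len):  # step 5: stop at <eos> or at the length limit
        hidden, probs = decoder_step(hidden, word)
        if sample:            # step 3, stochastic variant
            word = rng.choices(VOCAB, weights=probs, k=1)[0]
        else:                 # step 3, deterministic variant (argmax)
            word = VOCAB[probs.index(max(probs))]
        if word == EOS:
            break
        reply.append(word)    # step 4: the chosen word becomes the next input
    return reply

greedy_reply = generate(encoder_state=())
sampled_reply = generate(encoder_state=(), sample=True, rng=random.Random(0))
print(greedy_reply, sampled_reply)
```

Calling generate with sample=False twice on the same context always returns the same reply, while sample=True may return a different reply on every call unless the random generator is seeded.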

The process I’ve described above is only the model inference part, but there is also the model training part, which works in a slightly different way. At each decoding step, we use the correct word y_t instead of the generated one (w_t) as the input. In other words, at training time, the decoder consumes a correct reply sequence but with the last token removed and the <eos> token prepended.

Illustration of decoder inference phase. Output at the previous time step is fed as the input at the current time step.

The goal is to maximize the probability of the correct next word at each time step. More simply, we ask the network to predict the next word in the sequence by providing it with a correct prefix. Training is performed via maximum likelihood, which leads to the classical cross-entropy loss:

$$\mathcal{L} = -\sum_{t=1}^{T} \log \hat{p}_t(y_t)$$

Here, y_t is the correct word in the reply at time step t, T is the reply length, and p-hat_t(y_t) is the probability the decoder assigns to y_t given the correct prefix.
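
Here is a small sketch of the training-time bookkeeping described above: how decoder inputs and targets are derived from a correct reply, and how the cross-entropy loss sums the negative log-probabilities of the correct words. The per-step distributions are made-up numbers standing in for real decoder outputs, and the function names are mine.

```python
import math

EOS = "<eos>"

def teacher_forcing_io(reply_tokens):
    """Inputs are the reply with <eos> prepended and the last token removed;
    targets are the reply itself (which ends with <eos>)."""
    inputs = [EOS] + reply_tokens[:-1]
    targets = list(reply_tokens)
    return inputs, targets

def cross_entropy(step_distributions, targets):
    """Sum over time steps of -log p_t(y_t)."""
    return -sum(math.log(dist[y]) for dist, y in zip(step_distributions, targets))

reply = ["hi", "there", EOS]
inputs, targets = teacher_forcing_io(reply)

# Hypothetical decoder outputs: one word -> probability mapping per time step.
dists = [
    {"hi": 0.5, "there": 0.25, EOS: 0.25},
    {"hi": 0.25, "there": 0.5, EOS: 0.25},
    {"hi": 0.25, "there": 0.25, EOS: 0.5},
]
loss = cross_entropy(dists, targets)
print(inputs, targets, loss)
```

The loss reaches zero only if the decoder puts probability 1 on the correct word at every step.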

Modifications of Generative Models

Now, we have a basic understanding of the sequence-to-sequence framework. How do we add more generalization power to such models? There are several ways:

  • Add more layers to the encoder and/or decoder RNNs.
  • Use a bi-directional encoder. There is no way to make the decoder bi-directional due to its forward generation structure.
  • Experiment with embeddings. You can pre-initialize word embeddings or learn them from scratch together with the model.
  • Use a more advanced reply generation procedure: beam search. The idea is to not generate a reply “greedily” (by taking argmax for the next word) but to consider the probability of longer chains of words and choose among them.
  • Make your encoder and/or decoder convolutional. Convnets might work much faster than RNNs because they can be parallelized efficiently.
  • Use an attention mechanism. Attention was initially introduced in neural machine translation papers and has become a very popular and powerful technique.
  • Pass the final encoder state at each time step to the decoder. The decoder sees the final encoder state only once and then may forget it. A good idea is to pass it to the decoder along with word embedding.
  • Use different encoder and decoder state sizes. The model I described above requires the encoder and decoder to have the same hidden state size (because we initialize the decoder state with the final encoder state). You can get rid of this requirement by adding a projection (dense) layer from the final encoder state to the initial decoder state.
  • Use characters or byte pair encoding instead of words for building the vocabulary. Character-level models are worth considering because they work faster thanks to a smaller vocabulary and can handle words that are not in a word-level vocabulary. Byte pair encoding (BPE) is the best of both worlds. The idea is to find the most frequent pairs of tokens in a sequence and merge them into one token.
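
To illustrate the byte pair encoding idea from the last item, here is a minimal sketch of the merge loop: start from characters, repeatedly find the most frequent adjacent pair, and merge it into a single token. The function names and the tiny word list are my own, and production BPE implementations also weight words by frequency and mark word boundaries, which this sketch skips.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all sequences and return the most common one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_bpe(words, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences

merges, segmented = learn_bpe(["lower", "lowest", "low", "slow"], num_merges=2)
print(merges)
print(segmented)
```

After two merges, the frequent substring "low" becomes a single token, while rarer endings such as "e", "s", and "t" stay as characters.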

Problems With Generative Models

Later, I’ll give you links to popular implementations so you can train your own dialogue models. But first, I’d like to warn you about some common problems you may face with generative models.

Generic Responses

Generative models trained via maximum likelihood tend to predict a high probability for general replies such as “okay,” “no,” “yes,” and “I don’t know” for a wide range of contexts. There are some works dealing with this problem by doing either of the following:

Reply Inconsistency and How to Incorporate Metadata

The second major problem with seq2seq models is that they can generate inconsistent replies for contexts that are phrased differently but have the same meaning:


The most cited work dealing with it is A Persona-Based Neural Conversation Model. Authors used speaker IDs for each utterance in order to generate an answer, which conditioned not only on the encoder state but also on speaker embedding. Speaker embeddings are learned from scratch, along with the model.


Using this idea, you can augment your model with the different metadata you have. For example, if you know the tense of utterance (past/present/future), you can generate replies in different tenses at inference time! You can adjust the personality of the replier (gender, age, mood) or reply properties (tense, sentiment, question/not question, etc.) while you have such data to train models on.
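
One simple way to wire such metadata into a decoder is to concatenate a learned metadata embedding with the word embedding at every decoding step. The sketch below only shows this input construction. The embedding tables are random placeholders rather than trained parameters, and the names and dimensions are my assumptions, not the exact setup of the persona-based paper.

```python
import random

def make_embedding_table(keys, dim, rng):
    """Random placeholder embeddings; in a real model these are trained parameters."""
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for key in keys}

rng = random.Random(0)
word_emb = make_embedding_table(["<eos>", "i", "live", "in", "london"], dim=4, rng=rng)
speaker_emb = make_embedding_table(["speaker_1", "speaker_2"], dim=2, rng=rng)

def decoder_input(word, speaker):
    """Decoder input for one time step: word embedding followed by speaker embedding."""
    return word_emb[word] + speaker_emb[speaker]

x1 = decoder_input("i", "speaker_1")
x2 = decoder_input("i", "speaker_2")
print(len(x1), x1[:4] == x2[:4], x1[4:] == x2[4:])
```

The same word produces different decoder inputs for different speakers, which is what lets the model condition its reply on who is speaking. Swapping the speaker table for a tense or sentiment table follows the same pattern.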


I promised you links to seq2seq models implementations in different frameworks, and here they are.



Papers and Guides

Diving Into Selective Models

Having finished with generative models, let’s look at how selective neural conversational models work (they are often referred to as DSSM, which stands for deep semantic similarity model).

Instead of estimating probability p(reply | context; w), selective models learn similarity function sim(reply, context; w), where a reply is one of the elements in a predefined pool of possible answers (see illustration below).

The intuition is that the network takes context and a candidate reply as inputs and returns the confidence of how appropriate they are to each other.

The selective (or ranking, or DSSM) network consists of two “towers”: the first for the context and the second for the reply. Each tower may have any architecture you want. The tower takes its input and embeds it in semantic vector space (vectors R and C on the illustration). Then, the similarity between context and reply vectors is computed, e.g., using cosine similarity C^T*R/(||C||*||R||).

At inference time, we can calculate the similarity between given context and all possible answers and choose the one with maximum similarity.
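
As a minimal sketch of this inference step, the code below scores a context vector against every reply vector in a small pool with cosine similarity and picks the best one. The vectors are hand-written placeholders for what the two towers would output, and select_reply is my own helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity C^T*R / (||C|| * ||R||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def select_reply(context_vec, reply_vecs):
    """Score every candidate reply and return the one with maximum similarity."""
    scores = {reply: cosine(context_vec, vec) for reply, vec in reply_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical tower outputs for the context "How old are you?" and three candidates.
context_vec = [1.0, 0.0, 1.0]
reply_pool = {
    "Twenty-two.": [0.9, 0.1, 0.8],
    "Hi there.": [0.0, 1.0, 0.0],
    "I don't know.": [-1.0, 0.0, 0.5],
}
best, scores = select_reply(context_vec, reply_pool)
print(best, scores)
```

In practice, the reply vectors for the whole pool can be computed once and cached, so only the context tower has to run for each new request.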

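In order to train the model, we use triplet loss. Triplet loss is defined on triplets (context, reply_correct, reply_wrong) and is equal to:

$$\mathcal{L} = \max\big(0,\; m - \mathrm{sim}(\mathrm{reply\_correct}, \mathrm{context}; w) + \mathrm{sim}(\mathrm{reply\_wrong}, \mathrm{context}; w)\big)$$

Triplet loss for selective models, where m > 0 is a margin hyperparameter. It’s very similar to max margin loss in SVM.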

What is reply_wrong? It is also called “negative” sample (reply_correct is called “positive”) and in the simplest case, it is a random reply from the pool of answers. By minimizing such loss, we learn similarity function in a ranking way where absolute values aren’t informative. But remember that at the inference phase we only need to compare scores for all replies and choose one with the maximum score.
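
A tiny sketch of the loss on precomputed similarity scores may help. The margin value of 0.2 is an arbitrary choice of mine, and the scores are made up.

```python
def triplet_loss(sim_correct, sim_wrong, margin=0.2):
    """Zero when the correct reply beats the wrong one by at least `margin`,
    otherwise the size of the violation."""
    return max(0.0, margin - sim_correct + sim_wrong)

# (similarity to correct reply, similarity to wrong reply) for three triplets.
batch = [(0.9, 0.1), (0.5, 0.6), (0.4, 0.3)]
losses = [triplet_loss(pos, neg) for pos, neg in batch]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

Only the difference between the two scores enters the loss, which is why the absolute similarity values carry no meaning on their own.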

You may dive deeper into DSSMs at a special Microsoft project page. There are not as many open-source implementations as there are for generative models; however, you may refer to this tutorial that implements a selective model in TensorFlow.

Sampling Schemes in Selective Models

You may ask, why should we just take a random sample from a dataset? Maybe it is a good idea to use a more complex sampling scheme? That’s true. If you look closer, you may realize that the number of triplets is O(n³), so it’s important to choose negatives properly because we can’t go through all of them (big data, you know).

For example, we could sample K random negative replies from the pool, score them, and choose the one with the maximum score as our negative. This scheme is called “hard negative” mining. If you want to dig deeper, read the paper Sampling Matters in Deep Embedding Learning.
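
Here is a minimal sketch of that scheme. The word_overlap scorer is a deliberately crude stand-in for the current model's similarity function, and the pool and function names are hypothetical.

```python
import random

def word_overlap(context, reply):
    """Toy similarity: number of shared words. A real setup would use the model's score."""
    return len(set(context.lower().split()) & set(reply.lower().split()))

def mine_hard_negative(context, correct_reply, pool, score_fn, k=5, rng=None):
    """Sample up to k random wrong replies and keep the one the scorer likes most."""
    rng = rng or random.Random()
    candidates = [reply for reply in pool if reply != correct_reply]
    sampled = rng.sample(candidates, min(k, len(candidates)))
    return max(sampled, key=lambda reply: score_fn(context, reply))

pool = ["twenty-two", "how are you", "hi there", "me too"]
negative = mine_hard_negative(
    "how old are you", "twenty-two", pool, word_overlap, k=3, rng=random.Random(0)
)
print(negative)
```

The selected negative is the wrong reply that currently looks most plausible to the scorer, so it yields a larger triplet loss than a typical random negative.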

Generative vs. Selective: Pros and Cons

At this moment, we have an understanding of how both generative and selective models work. But which type do you choose? It fully depends on your needs. The table below is here to help you with the decision.

The Hardest Part Is Evaluation

One of the most important questions is how to evaluate neural conversational models. There are many automatic metrics that are used to evaluate chatbots with machine learning:

  • Precision/recall/accuracy for selective models
  • Perplexity/loss value for generative models
  • BLEU/METEOR scores from machine translation

But some recent research works have shown that all such metrics are poorly correlated with the human judgment of appropriateness of the reply for a given context.

For example, suppose you have the context, “Is Statsbot disrupting the way we work with data?” and reply, “It surely does.” in your dataset. But your model replies to this context with something like, “It’s definitely true.” All metrics shown above will give a low score for such an answer, but we can see that this answer is as good as your data provides.
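
To see the effect in numbers, here is a sketch using clipped unigram precision, a heavily simplified stand-in for overlap-based metrics such as BLEU (which also uses longer n-grams and a brevity penalty). The tokenization and function names are mine.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; apostrophes stay inside words."""
    return re.findall(r"[a-z']+", text.lower())

def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference (with clipping)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference = "It surely does."
model_reply = "It's definitely true."
score = unigram_precision(model_reply, reference)
print(score)
```

The two replies share no tokens, so the overlap score is zero even though a human would accept either answer.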


Therefore, the most reliable way today is to perform a human evaluation of your models using your target metric, then choose the best model. Yes, this seems like an expensive process (you need to use something like Amazon Mechanical Turk to evaluate models), but at this time, we don’t have anything better. In any case, the research community is moving in this direction.

Why Don’t We See Them in Our Smartphones?

Finally, we are ready to create the most powerful and intelligent conversational model, general artificial intelligence, right? If this were so, companies like Apple, Amazon, and Google, which have thousands of researchers, would have already deployed such models along with their personal assistant products.

Despite a lot of work in this area, neural dialogue systems are not ready to talk with humans in an open domain and provide them with informative, funny, or helpful answers. But as for closed domains (technical support or Q&A systems, for example), there are success stories.

Tutorials on RNN and Word Embeddings

Here are a couple more helpful tutorials for you.

Recurrent Neural Networks

Word Embeddings


Conversational models may seem difficult to grasp at first (and not only at first). I advise you to read the resources I gave links to. Also, there is a pool that contains many essential papers on dialogue systems.

When you’re ready to practice, choose some simple architecture, take one of the popular datasets or mine your own (Twitter, Reddit, or whatever), and train a conversational model on it.

Published at DZone with permission of
