The Anatomy of a Modern Conversational Application

Underlying every conversational app is the exchange of words between humans and machines. Check out what these building blocks are.

Chatbots, voicebots, Alexa skills, Google actions, intelligent assistants: these are all different, yet they share a single common denominator. They’re all conversational apps.

Con·ver·sa·tion: The informal exchange of ideas by spoken words.

Underlying every conversational app is the exchange of words between humans and machines. Today’s technology is mature enough to create high-level conversation, using broadly available building blocks.

The Stack of a Modern Conversational Application

The building blocks required to develop a modern conversational application.

Let's describe these building blocks in more detail!

Speech Recognition

Speech recognition (also known as voice recognition, or speech-to-text) transcribes voice into text. The machine captures our voice with a microphone and provides a text transcription of our words. Using a simple level of text processing, we can develop a voice control feature with simple commands such as “turn left” or “call John.” But achieving a higher level of understanding requires the natural language understanding layer (see below).
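
To make the distinction concrete, here is a minimal Python sketch of the kind of simple text processing that can sit on top of a speech-to-text transcription for basic voice control. The command names and handler responses are hypothetical; anything richer than keyword matching is the job of the NLU layer described below.

```python
# Minimal sketch: keyword-based voice control on top of a speech-to-text result.
# The transcription string would come from a speech recognition service;
# the commands and responses below are hypothetical examples.

def handle_transcription(transcription: str) -> str:
    text = transcription.lower().strip()

    if "turn left" in text:
        return "steering: LEFT"           # e.g., send a command to a device
    if "turn right" in text:
        return "steering: RIGHT"
    if text.startswith("call "):
        contact = text[len("call "):]
        return f"dialing: {contact}"      # e.g., trigger a phone call

    return "unrecognized command"         # richer understanding needs NLU


if __name__ == "__main__":
    print(handle_transcription("Turn left"))   # -> steering: LEFT
    print(handle_transcription("Call John"))   # -> dialing: john
```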

Leading providers of speech recognition include Nuance, Amazon, IBM Watson, Google, Microsoft, and Apple.

Natural Language Understanding

Natural language understanding (NLU) is basically reading comprehension, but with machines. The machine “reads text” (often transcribed using speech recognition) and then tries to grasp the user's intent.

Take, for example, a meal ordering app. The system must identify two individual intents: a food order (OrderFood) and a drink order (OrderDrink). In this case, the NLU layer is expected to understand that when the user says, “Please order food,” the intent behind it is OrderFood. However, people rarely speak like this, so an advanced NLU should also understand that when a user says, “I’m hungry” or “I want pizza,” the user’s intent is OrderFood. Similarly, if the user says, “Please order a drink,” or “I’m thirsty,” the app should understand both requests to mean OrderDrink.
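
Conceptually, the NLU layer takes raw text and returns a structured interpretation, typically an intent name plus a confidence score. Below is a rough sketch of that contract in Python; the function, the canned utterances, and the confidence values are purely illustrative stand-ins for a real NLU service call:

```python
# Hypothetical shape of an NLU result for the meal ordering example.
# Real services (api.ai, wit.ai, IBM Watson, etc.) return richer,
# vendor-specific payloads; this canned lookup is for illustration only.

def understand(text: str) -> dict:
    canned = {
        "please order food":    ("OrderFood", 0.98),
        "i'm hungry":           ("OrderFood", 0.91),
        "i want pizza":         ("OrderFood", 0.95),
        "please order a drink": ("OrderDrink", 0.97),
        "i'm thirsty":          ("OrderDrink", 0.90),
    }
    intent, confidence = canned.get(text.lower(), ("Unknown", 0.0))
    return {"text": text, "intent": intent, "confidence": confidence}


print(understand("I want pizza"))
# {'text': 'I want pizza', 'intent': 'OrderFood', 'confidence': 0.95}
```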

Computer science considers NLU a “hard AI problem,” meaning that even with artificial intelligence (powered by deep learning), developers still struggle to provide a high-quality solution.

Some leading providers of NLU are api.ai (acquired by Google), wit.ai (acquired by Facebook), Amazon, IBM Watson, and Microsoft.


How does the NLU layer “understand” text?

The answer is ontology, i.e. a broad and comprehensive sample set of concepts and categories in a subject area or domain. Building intents requires us to provide a list of associated samples, i.e. a collection of possible sentences for a single intent.

Continuing our meal ordering app example, this is what an ontology looks like:

User Says               Intent
Please order food       OrderFood
I am hungry             OrderFood
I want pizza            OrderFood
Please order a drink    OrderDrink
I am thirsty            OrderDrink

As you can see, this is hard. An ontology is domain-specific and, as such, requires different configurations and tweaks from one app to another. An ontology is also limited to a specific language (sometimes even a dialect!): the language it was built in. And lastly, an ontology must be both broad and comprehensive to facilitate understanding the conversation.
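
To make the role of samples concrete, here is a toy Python sketch of how an NLU layer could score an utterance against the ontology above and pick the closest intent. The token-overlap scoring is purely illustrative; production NLU engines rely on trained statistical and deep-learning models rather than this kind of matching:

```python
# Toy intent matcher: score each ontology sample by token overlap with the
# user's utterance and return the best-matching intent.
# For illustration only; real NLU layers use trained language models.

ontology = {
    "OrderFood":  ["please order food", "i am hungry", "i want pizza"],
    "OrderDrink": ["please order a drink", "i am thirsty"],
}

def classify(utterance: str) -> str:
    words = set(utterance.lower().split())
    best_intent, best_score = "Unknown", 0.0
    for intent, samples in ontology.items():
        for sample in samples:
            sample_words = set(sample.split())
            score = len(words & sample_words) / len(words | sample_words)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent


print(classify("I want a pizza please"))   # -> OrderFood
print(classify("I am so thirsty"))         # -> OrderDrink
```

The more samples each intent has, the more user phrasings such a matcher can cover, which is exactly why the number of samples matters so much in the next section.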

How Many Samples Are Enough?

The answer is an infinite number of (relevant) samples.

The diagram below demonstrates the accuracy level of the NLU layer as a function of the number of samples (the diagram is based on research conducted at Conversation.one). As shown, 30 samples deliver a conversational accuracy level of 60%. Reaching 100% accuracy for a single intent requires 500+ samples. The catch is that the more intents we build, the more samples are required for each intent in order to avoid overlaps. For example, a skill or a bot with 10-20 intents requires at least 1,000 samples per intent, totaling tens of thousands of samples.

Adding more samples to the ontology improves the accuracy level of the NLU layer.

How Do You Build An Ontology?

Building an ontology requires a “domain expert”: someone who understands the specific domain and can make a knowledgeable guess about the sentences users might say when asking for each intent. The problem is that this process is manual, slow, and unscalable.

Some solutions offer help here. Amazon’s Alexa, for example, comes with a few ready-made intents, such as Yes, No, and Cancel. These intents come with their own ontology, but for anything more advanced, you’ll have to build your own. Other services, such as IBM Watson, provide a ready-made agent with domain-specific expertise (for example, a help-desk agent). This is a great way to launch a quick solution; the downside is the agents’ lack of flexibility in adapting the conversational user interface (CUI) to the business.


Context

As described, a conversation is the “exchange of ideas” between two or more people. In other words, it’s a series of questions and answers.

In the conversational app you’re building (whether a chatbot or a voicebot), the interaction likewise has two sides: the end user can ask, “Can I order pizza?” and your bot will respond with “Yes,” and might also add, “Do you want me to deliver the pizza to your home?” The end user can then accept the offer, or respond with something like, “No, please deliver it to my office.”

As we can see, a real conversation isn’t a simple Q&A. Each end user may present information in a different order, or flow, and your app needs to handle all of these different flows.

One popular way to describe context is with a state machine. In a state machine, each state of the conversation defines the possible next states, depending on the user’s reaction.
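
A minimal sketch of the state machine approach, using hypothetical state and intent names for the meal ordering flow:

```python
# Minimal conversational state machine: for each state, map the recognized
# intent to the next state. State and intent names are hypothetical.

TRANSITIONS = {
    "Start":       {"OrderFood": "ChooseFood", "OrderDrink": "ChooseDrink"},
    "ChooseFood":  {"ConfirmYes": "Delivery",  "ConfirmNo": "Start"},
    "ChooseDrink": {"ConfirmYes": "Delivery",  "ConfirmNo": "Start"},
    "Delivery":    {"ConfirmYes": "Done",      "ConfirmNo": "Done"},
}

def next_state(current: str, intent: str) -> str:
    # Stay in the current state if the intent is not expected here.
    return TRANSITIONS.get(current, {}).get(intent, current)


state = "Start"
for intent in ["OrderFood", "ConfirmYes", "ConfirmYes"]:
    state = next_state(state, intent)
    print(state)   # ChooseFood -> Delivery -> Done
```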

The problem with state machines is that once a large number of states are added, along with a correspondingly large number of transitions, the flow becomes impossible to understand and maintain.

Another approach, much simpler to develop yet nearly as effective as state machines, is the stack context approach. This approach doesn’t require us to map the entire conversational flow in advance, only the next possible states. For example, if the user says, “Can I order?” a good guess is that the next possible state is “OrderFood” or “OrderDrink,” so we can ignore any other possible state in our app and minimize the options.
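
A rough sketch of the stack context idea: rather than mapping every transition up front, we only keep track of the intents that are plausible next, given what the user just said. The intent names are again hypothetical:

```python
# Stack context sketch: keep only the expected next intents on a stack,
# instead of mapping the whole conversational flow in advance.
# Intent names are hypothetical.

expected = []   # stack of sets of intents that make sense next

def on_user_says(intent: str) -> None:
    if intent == "Order":
        # After a generic "Can I order?", only food or drink makes sense next.
        expected.append({"OrderFood", "OrderDrink"})
        print("Asking: food or drink?")
    elif expected and intent in expected[-1]:
        expected.pop()   # the expectation was fulfilled
        print(f"Handling {intent}")
    else:
        print(f"Ignoring unexpected intent: {intent}")


on_user_says("Order")          # Asking: food or drink?
on_user_says("CheckWeather")   # Ignoring unexpected intent: CheckWeather
on_user_says("OrderFood")      # Handling OrderFood
```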

Business Logic

If you want your app to provide rich real-time data, and even be able to run transactions, you must connect it to your business logic. To access your data and run transactions, you need APIs into your backend systems: for example, an API that returns the price of a pizza in real time when the user asks to order food, or an API that executes the food-ordering transaction.
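
As an illustration, here is a hedged sketch of an intent handler wired to backend APIs. The base URL, endpoints, and response fields are hypothetical placeholders for whatever your own backend exposes:

```python
# Sketch of the business logic layer behind a conversational app.
# The endpoints and payloads below are hypothetical placeholders;
# replace them with your backend's real APIs.
import requests

BACKEND = "https://api.example.com"   # placeholder base URL

def handle_order_food(item: str, quantity: int) -> str:
    # Fetch the live price for the requested item.
    price = requests.get(f"{BACKEND}/menu/{item}/price", timeout=5).json()["price"]

    # Execute the ordering transaction.
    order = requests.post(
        f"{BACKEND}/orders",
        json={"item": item, "quantity": quantity},
        timeout=5,
    ).json()

    return f"Ordered {quantity} x {item} for ${price * quantity}. Order id: {order['id']}"
```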

If you lack APIs for some of the needed data and functions, you can develop new ones, or use screen-scraping techniques to avoid complex backend development. This tutorial shows how to create APIs from an existing website without any backend development.
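
If screen scraping is the chosen workaround, the core of it can look roughly like this. The URL and CSS selector are hypothetical and would depend entirely on the actual website’s markup:

```python
# Screen-scraping sketch: extract a price from an existing web page
# when no API is available. The URL and selector are hypothetical.
import requests
from bs4 import BeautifulSoup

def get_pizza_price() -> str:
    html = requests.get("https://example.com/menu", timeout=5).text
    soup = BeautifulSoup(html, "html.parser")
    price_element = soup.select_one(".pizza .price")   # depends on the page markup
    return price_element.get_text(strip=True) if price_element else "unknown"
```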

Choosing the Right Tools

For almost every layer in the conversational stack, you have multiple choices. The first rule is “don’t compromise.” For example, if instead of a modern NLU-powered platform you choose a simple keyword-matching solution, you will later find that the keyword method does not scale as your application becomes more sophisticated. Think ahead: What are your needs? Which languages do you want to support? What are the use cases? Which platforms do you want to integrate with? Do you need compatibility with Alexa, Google Home, etc.?
