A Brief Overview of Designing and Testing Effective Chatbots

Designing an effective chatbot involves starting from the use case, choosing the right model, tuning parameters, and testing.

Vinayak Prasad

Oct. 06, 25 · Analysis

Startups, enterprises, and individuals are all looking for ways to incorporate chatbots into their systems for customer service, internal workflows, and compliance. For these chatbots to be truly effective, it is important to understand how to design and test them.

With ever-larger language models and tools such as retrieval-augmented generation (RAG) and the Model Context Protocol (MCP) all the rage, it is important to recognize that a chatbot built without a clear understanding of its use case, design, and testing will likely become a "black box."

The goal of this article is to help a person decide what type of chatbot is really needed and how to go about thinking of design with a focus on performance, compliance, and the user's needs. It will briefly cover how to design and test effective chatbots, and future writings will dive deeper into chatbot design and testing. It would be helpful before reading this article to have a high-level understanding of LLMs, decision trees, RAG, and MCP.

The Use Case

Why Start With the Use Case

To avoid building a solution for a problem that never existed, it is critical to understand the business goal and its priorities: is it speed, compliance, or customer satisfaction? A clear understanding of the use case helps decide which design and testing patterns will maximize the utility of the chatbot. The next section divides use cases into three general categories.

Deterministic vs. Creative vs. Hybrid Model

Once it is clear that there is a problem that needs to be solved with a chatbot, the use case will very likely lie on a spectrum between highly deterministic tasks and highly creative tasks.

A highly deterministic use case is one based on a largely fixed set of rules and predictable answers. A simple example would be asking a chatbot, "Do you have pencils available in the ABC store?" These answers are clear, well-defined, and repetitive. In fact, in some cases, an LLM may be harmful for this use case or unnecessarily overcomplicate it. A decision tree or rules engine may be the best model in this case.

On the other end of the spectrum are the more creative tasks; an example of this would be a user asking a chatbot to "Help me come up with a creative marketing campaign for a new water bottle targeted at college students."

While these are two extreme cases, most use cases fall under a hybrid path, which is partly deterministic and partly creative. An example would be a user saying, "I bought these pants and want to return them and buy something else." The deterministic part is fetching facts from a database or API, such as the store's return policy and the date of the purchase, and assembling them into a base for the LLM to refer to (known as grounding context). The creative part would be suggesting other options the user could buy.
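As a rough illustration of the hybrid path, the sketch below separates the deterministic step (computing return eligibility from fixed rules) from the creative step (a prompt that asks an LLM for suggestions). The 30-day window, the function names, and the prompt wording are all assumptions made for this example; a real system would fetch the policy and order data from a database or API.

```python
from datetime import date

# Hypothetical policy value; a real system would read this from a policy store.
RETURN_WINDOW_DAYS = 30


def build_grounding_context(purchase_date: date, today: date) -> str:
    """Deterministic step: derive return eligibility from fixed rules."""
    days_since_purchase = (today - purchase_date).days
    eligible = days_since_purchase <= RETURN_WINDOW_DAYS
    return (
        f"Return window: {RETURN_WINDOW_DAYS} days. "
        f"Days since purchase: {days_since_purchase}. "
        f"Eligible for return: {'yes' if eligible else 'no'}."
    )


def build_prompt(user_message: str, grounding_context: str) -> str:
    """Creative step: ask the LLM for suggestions, constrained by the facts."""
    return (
        "Use only the facts below when discussing the return.\n"
        f"Facts: {grounding_context}\n"
        f"Customer: {user_message}\n"
        "After handling the return, suggest two alternative products."
    )


context = build_grounding_context(date(2025, 9, 20), date(2025, 10, 6))
prompt = build_prompt(
    "I bought these pants and want to return them and buy something else.",
    context,
)
print(prompt)
```

The prompt string would then be sent to whichever LLM is in use; that call is left out because it depends on the provider.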

The image below shows the flow of how most use cases work.

Designing the Chatbot

Once the use case has been clearly determined, it is time to design the chatbot. This section focuses on choosing the right model, adding context, and tuning the parameters of an LLM.

Choosing the Right Model Size

Choosing the right model is crucial to resolving the use case effectively at the lowest cost. Three factors matter most. Concurrency: how many users do you expect to be using the chatbot at the same time? Latency: how long does the chatbot take to respond after a user sends a message? Complexity: are you resolving a problem that needs reasoning (more complex), or are you answering direct factual questions?

Finally, the most important reason to choose the model carefully is that it drives the costs associated with developing a chatbot.

While there is no template for the right size, and I will not recommend any particular company's model, smaller models for high-traffic, short factual tasks and larger models for fewer users or complex reasoning queries tend to yield the best results.
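To make the cost side of this tradeoff concrete, here is a back-of-the-envelope estimate. The per-token prices and traffic figures are invented for illustration and do not reflect any provider's pricing; only the arithmetic is the point.

```python
# Hypothetical prices in USD per 1,000 tokens. Real prices vary by provider.
PRICE_PER_1K_TOKENS = {"small": 0.0005, "large": 0.01}


def monthly_cost(model: str, requests_per_day: int,
                 tokens_per_request: int, days: int = 30) -> float:
    """Estimate monthly spend as total tokens times the per-token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


# A high-traffic, short-answer workload priced on both model sizes.
small_cost = monthly_cost("small", requests_per_day=50_000, tokens_per_request=300)
large_cost = monthly_cost("large", requests_per_day=50_000, tokens_per_request=300)
print(f"small: ${small_cost:,.2f} per month")
print(f"large: ${large_cost:,.2f} per month")
```

With these made-up numbers, the same workload costs roughly twenty times more on the larger model, which is why high-traffic factual tasks usually favor the smaller one.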

Adding Context

Adding context is the process of bringing in factual data that the LLM uses to answer questions. This reduces the risk of the LLM giving false information, known as hallucination.

Retrieval-augmented generation (RAG) is one method that grounds responses by fetching relevant facts from external sources. This helps direct the chatbot toward the right answer and reduces hallucinations (especially in smaller models).

Another, newer method is the Model Context Protocol (MCP). It allows an LLM to connect to multiple sources or APIs across different tools, so that context can be pulled in dynamically based on the question asked of the LLM.
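The retrieval step can be sketched in a few lines. This toy version ranks a small in-memory list of passages by bag-of-words cosine similarity and places the best match in the prompt. Production systems typically use embedding models and a vector store instead; the documents, function names, and prompt wording here are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; a real one would live in a document store.
DOCUMENTS = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list, k: int = 1) -> list:
    """Return the k passages most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list) -> str:
    """Put the retrieved passages in front of the question as grounding context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


print(build_prompt("How many days do I have for returns?", DOCUMENTS))
```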

It is good practice to move forward with the general assumption that end users are not prompt engineers. This means there shouldn't be any onus on them to be able to query in a certain way to get the answer they want.

These processes help satisfy grounding and compliance requirements. Grounding refers to basing your answers on reliable data (your company’s database or a set of rules it is required to follow). This is critical in domains such as health, finance, and law.

Parameter Tuning

Tuning parameters is another key factor in ensuring the chatbot provides the best results for your use case. The most commonly tuned LLM parameters are described below.

The first parameter is temperature. The name is borrowed from physics, where a higher temperature corresponds to more randomness in particle movement. It is commonly set between 0 and 1, although some APIs accept higher values. Internally, the model's raw scores for each candidate word are divided by the temperature before being converted into a probability distribution. In simple terms, a higher temperature (0.7-1) means a more randomized selection of words, whereas a lower temperature (0-0.2) means more deterministic, predictable answers. There is no hard and fast rule as to which is better. If the use case prioritizes factual QA or compliance, a lower temperature should be set, but if the chatbot is used for brainstorming or ideation, a higher temperature is preferable.
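The effect of temperature on the probability distribution can be seen with a small softmax sketch. The logits below are made-up scores for three candidate words, and the function name is chosen for this example. A temperature of exactly 0 is rejected here because it would divide by zero; in practice it is usually treated as always picking the top-scoring word.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Divide logits by the temperature, then normalize with a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
print("temperature 0.2:", [round(p, 3) for p in low])
print("temperature 1.0:", [round(p, 3) for p in high])
```

At the lower temperature nearly all of the probability lands on the top-scoring word, while at the higher temperature the other candidates keep a meaningful share.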

The next parameter is top-p (also called nucleus sampling), which controls how much of the probability distribution is considered when an LLM chooses the next word: only the smallest set of most likely words whose combined probability reaches p is kept. As with temperature, low values (0.1-0.3) can be very restrictive, i.e., the results repeat themselves regularly, providing a predictable output. A higher value (0.8-1) gives the LLM a larger candidate set to choose from, so results are more creative and less predictable. The tradeoff is between diversity and coherence.
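A minimal sketch of top-p filtering follows, assuming the next-word probabilities are already available as a dictionary. The words and their probabilities are invented; the function keeps the most likely words until their cumulative probability reaches p, then renormalizes.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {token: prob / total for token, prob in kept}


# Hypothetical next-word distribution.
probs = {"blue": 0.5, "green": 0.3, "red": 0.15, "mauve": 0.05}
narrow = top_p_filter(probs, 0.3)
wide = top_p_filter(probs, 0.9)
print("top-p 0.3 keeps:", list(narrow))
print("top-p 0.9 keeps:", list(wide))
```

With the low value only the single most likely word survives, which is why output becomes repetitive; the high value leaves three candidates for the sampler to choose from.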

Next are the frequency and presence penalties; while these are tuned separately, they usually go hand in hand. The frequency penalty lowers the likelihood of a word the more often it has already appeared, and the presence penalty applies once a word has appeared at all, which encourages the model to introduce new words and concepts into the conversation. Setting both penalties high can lead to output that is forced or sometimes incoherent. Finding the right balance between them is essential.
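One common way to apply these penalties is to subtract them from a word's raw score before sampling. The sketch below uses that additive form; treat the exact formula as an assumption, since providers differ in the details. The scores and the generated history are made up.

```python
from collections import Counter


def apply_penalties(logits: dict, generated: list,
                    frequency_penalty: float, presence_penalty: float) -> dict:
    """Lower the score of words that already appeared in the generated text.

    The frequency penalty scales with how many times a word appeared;
    the presence penalty is a flat deduction once it has appeared at all.
    """
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        flat = presence_penalty if count > 0 else 0.0
        adjusted[token] = logit - count * frequency_penalty - flat
    return adjusted


logits = {"great": 2.0, "good": 1.5, "solid": 1.0}  # hypothetical raw scores
history = ["great", "great", "good"]                # words generated so far
adjusted = apply_penalties(logits, history,
                           frequency_penalty=0.5, presence_penalty=0.4)
print(adjusted)
```

In this example the unused word "solid" ends up with the highest adjusted score, even though it started with the lowest.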

Finally, max tokens: this focuses on the maximum length of your response. While deciding the number of tokens per response, assess the maximum cost you can spare, as more tokens mean a higher cost. More tokens also mean greater latency, i.e., the time taken to receive a response.

Designing for Multiple Use Cases

Many organizations use a chatbot for multiple use cases, so the question arises of how to adapt the design choices and parameter settings above to handle all of them. The simplest answer would be an “intent-first” flow.

This means asking for the user’s goal before starting the actual process. The easiest and most reliable way is an old-school if-else check. This usually works well if the chatbot has 4–5 well-defined use cases. The user chooses the use case they want, and the parameters, model, and design are then adjusted accordingly.
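An intent-first flow with an if-else check might look like the following sketch. The three use cases, the menu numbering, the model labels, and the parameter values are all hypothetical and only show how a user's choice maps to a configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotConfig:
    model: str          # hypothetical model label, not a real model name
    temperature: float
    max_tokens: int


# Hypothetical per-use-case settings.
CONFIGS = {
    "order_status": BotConfig(model="small", temperature=0.0, max_tokens=150),
    "returns": BotConfig(model="small", temperature=0.2, max_tokens=300),
    "product_ideas": BotConfig(model="large", temperature=0.8, max_tokens=600),
}


def config_for_menu_choice(choice: str) -> BotConfig:
    """Map the option the user picked from a menu to a configuration."""
    if choice == "1":
        return CONFIGS["order_status"]
    elif choice == "2":
        return CONFIGS["returns"]
    elif choice == "3":
        return CONFIGS["product_ideas"]
    else:
        raise ValueError(f"Unknown menu choice: {choice!r}")


print(config_for_menu_choice("3"))
```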

While this works for some cases, it would not work for chatbots with hundreds of different use cases. The if-else criteria may also seem too robotic and fail to bring a human touch to the chatbot. The solution for this is similarity testing: the user enters their question, and the chatbot finds the most similar use case and adjusts the model and parameters dynamically. I will write a more comprehensive article on designing for multiple use cases in the near future.
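The similarity approach can be sketched with a simple word-overlap (Jaccard) score standing in for the embedding-based similarity a production system would more likely use. The intents, example phrases, and the 0.2 threshold are assumptions for illustration.

```python
import re

# Hypothetical example phrases for each use case.
INTENT_EXAMPLES = {
    "order_status": ["where is my order", "track my package"],
    "returns": ["i want to return an item", "how do i get a refund"],
    "product_ideas": ["suggest a gift for my friend", "help me pick an outfit"],
}


def tokens(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Shared words divided by total distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_intent(question: str, threshold: float = 0.2):
    """Return the best-matching intent, or None if nothing is similar enough."""
    q = tokens(question)
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = jaccard(q, tokens(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


print(match_intent("I want to return these pants"))
print(match_intent("What is the weather?"))
```

Returning None for a weak match leaves room for a fallback, such as asking a clarifying question or escalating to a human.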

Testing

Testing your chatbot is essential to ensuring that your use case is being served as expected or better than expected. It is also necessary for seeing where the chatbot goes wrong and how things can be fixed. This section gives a brief overview of the different test metrics. Future articles will dive deeper into testing and exactly how to perform it.

The initial step is developing a gold standard dataset of the anticipated questions that could be asked. These could come from past records or logs, with answers usually written by a human expert. As much as possible, the dataset should also contain the expected answer to each question. The gold dataset must also contain restricted questions that users might ask but that the chatbot shouldn’t answer.

Next, run these questions as automated batch tests, at least 1,000 times, and score the results on precision, recall, accuracy, and latency. Depending on your use case, testing for grounding and compliance may be required, too.

Precision is the share of correct answers among the questions that the chatbot actually answered (ignoring those that it did not answer).

Recall, on the other hand, measures how many of the answers present in the gold dataset the bot actually provides. For example, if there are 90 answers present in the knowledge base and the bot answers only 75 of them, then it missed 15 answers, for a recall of 75/90, or about 83%.
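Both metrics can be computed from a list of test results. The sketch below assumes each result stores the expected answer from the gold dataset (None for a restricted question) and the bot's actual answer (None when it declined), and it counts an answer as correct only on an exact string match. Real evaluations usually need a looser notion of correctness, such as human or model-based grading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    expected: Optional[str]  # None means the question is restricted
    actual: Optional[str]    # None means the bot did not answer


def precision(results: list) -> float:
    """Correct answers divided by all questions the bot chose to answer."""
    answered = [r for r in results if r.actual is not None]
    if not answered:
        return 0.0
    correct = sum(1 for r in answered
                  if r.expected is not None and r.actual == r.expected)
    return correct / len(answered)


def recall(results: list) -> float:
    """Correct answers divided by all questions that have a gold answer."""
    answerable = [r for r in results if r.expected is not None]
    if not answerable:
        return 0.0
    correct = sum(1 for r in answerable if r.actual == r.expected)
    return correct / len(answerable)


# Hypothetical batch of four test results.
results = [
    Result(expected="30 days", actual="30 days"),    # answered correctly
    Result(expected="3 to 5 days", actual="2 days"),  # answered incorrectly
    Result(expected="yes", actual=None),              # missed
    Result(expected=None, actual=None),               # restricted and declined
]
print(f"precision: {precision(results):.2f}")
print(f"recall: {recall(results):.2f}")
```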

Latency is the amount of time it generally takes for a chatbot to provide a response to a user's question. Three main values are used to understand latency:

P50 (median) – This is the time at which 50% of the queries are faster and 50% are slower. This typically indicates the general user experience.
P90 (90th percentile latency) – This is the time at which 90% of the queries are faster and only 10% are slower.
P99 (99th percentile latency) – This is the time at which 99% of the queries are faster and only 1% are slower. This shows the possible worst-case latency.
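These three values can be computed from recorded response times with Python's standard library. The latencies below are synthetic placeholders standing in for values you would collect from logs or a load test.

```python
import statistics


def latency_percentiles(latencies_ms: list) -> dict:
    """Return P50, P90, and P99 from a list of response times in milliseconds."""
    # quantiles with n=100 returns 99 cut points; index 49 is the 50th percentile.
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": cut_points[49],
        "p90": cut_points[89],
        "p99": cut_points[98],
    }


# Synthetic sample: 101 response times from 100 ms to 200 ms.
latencies = list(range(100, 201))
summary = latency_percentiles(latencies)
print(summary)
```

A large gap between P50 and P99 suggests that a small share of users sees much slower responses than the typical user does.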

Grounding score measures how many of the answers for which supporting evidence is available actually use that evidence to answer the question.

Compliance testing ensures that the chatbot does not answer questions involving restricted data or unethical requests at all.
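Both checks reduce to simple ratios once each test record has been labeled. The sketch below assumes the labels already exist (for example from human review), and the field names are hypothetical. How to decide whether an answer really used its evidence is a separate problem not covered here.

```python
def grounding_score(records: list):
    """Share of answers that used evidence, among those where evidence existed."""
    with_evidence = [r for r in records if r["evidence_available"]]
    if not with_evidence:
        return None  # nothing to measure
    used = sum(1 for r in with_evidence if r["used_evidence"])
    return used / len(with_evidence)


def compliance_rate(records: list):
    """Share of restricted questions that the bot refused to answer."""
    restricted = [r for r in records if r["restricted"]]
    if not restricted:
        return None  # nothing to measure
    refused = sum(1 for r in restricted if r["refused"])
    return refused / len(restricted)


# Hypothetical labeled test records.
records = [
    {"evidence_available": True, "used_evidence": True,
     "restricted": False, "refused": False},
    {"evidence_available": True, "used_evidence": False,
     "restricted": False, "refused": False},
    {"evidence_available": False, "used_evidence": False,
     "restricted": True, "refused": True},
    {"evidence_available": False, "used_evidence": False,
     "restricted": True, "refused": True},
]
print("grounding score:", grounding_score(records))
print("compliance rate:", compliance_rate(records))
```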

Acceptance metrics are the thresholds set up for each of the above test parameters, based on which a chatbot can be deemed ready for deployment or requires further tuning or improvement.
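An acceptance check can be as simple as comparing measured values against thresholds. The threshold values and metric names below are placeholders; each team would set its own based on the use case.

```python
# Hypothetical thresholds: "min" means the value must be at least the limit,
# "max" means it must be at most the limit.
THRESHOLDS = {
    "precision": ("min", 0.95),
    "recall": ("min", 0.85),
    "grounding": ("min", 0.90),
    "compliance": ("min", 1.00),
    "p90_latency_ms": ("max", 2000),
}


def failed_metrics(measured: dict) -> list:
    """Return the names of metrics that miss their threshold."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = measured[name]
        passed = value >= limit if kind == "min" else value <= limit
        if not passed:
            failures.append(name)
    return failures


# Hypothetical results from a batch test run.
measured = {
    "precision": 0.97,
    "recall": 0.80,
    "grounding": 0.93,
    "compliance": 1.00,
    "p90_latency_ms": 1500,
}
failures = failed_metrics(measured)
print("ready for deployment" if not failures else f"needs tuning: {failures}")
```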

The flowchart below shows the overall design and testing that is done before deploying a chatbot into production.

Even after deployment, the chatbot needs to be continuously tested and improved so that it keeps up with real-world changes and expectations. Latency, task completion, user satisfaction, compliance, and the number of escalations to a human because the chatbot could not answer the question are just some of the metrics that should be tracked live.

Conclusion

Chatbots, when designed well and tested properly, are a powerful tool that can resolve many use cases. Due to the ease and speed at which they can be built, sometimes compliance and clear design are ignored or not thoroughly looked into. However, for an effective chatbot, a systematic process of understanding the use case, designing based on the use case, and testing for that use case is an absolute necessity.

In future articles, I will dive deeper into three main aspects: the design of a chatbot, the design of a multi-use-case chatbot, and the testing of a chatbot.

Opinions expressed by DZone contributors are their own.