Traditional Testing and RAGAS: A Hybrid Strategy for Evaluating AI Chatbots
This article compares what traditional testing offers with how the RAGAS framework provides alternatives, and how the two combine into an overall test plan for RAG applications.
With the advent of Artificial Intelligence, Retrieval-Augmented Generation (RAG) models are commonly used in simple applications such as chatbots for websites. These models offer practical solutions, but ensuring their accuracy and user-friendliness remains a key concern. When it comes to software testing, there are several approaches: traditional testing techniques can be employed alongside newer RAG testing frameworks such as Retrieval-Augmented Generation Assessment (RAGAS).
This article introduces software testers, especially those just getting their first exposure to AI, to a hybrid approach that combines traditional and RAGAS-based testing of chatbot applications. We explore a structured approach to testing a chatbot RAG model using traditional software testing techniques, provide an introduction to RAGAS, and analyze the effectiveness of each.
Chatbot Sample Implementation in Python
Let us consider a sample implementation of a RAG-based chatbot. One possible implementation, written out in full for clarity, is below.
First, we import the Python LangChain packages used in this example.
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings
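If you are following along, the packages can be installed first (this assumes the CPU build of FAISS is sufficient for a toy example):
pip install langchain langchain-community langchain-openai faiss-cpu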
We then define a web scraper; in this toy example, it simply returns a hard-coded list of scraped content (documents) obtained from the webpages of interest. There are several scraping techniques, but they are not the focus here.
# A class that parses a website for content
class WebsiteDocGenerator:
    """
    Simulates document fetching from a yoga and health-related website.
    In a real-world scenario, this could involve scraping or loading files.
    """

    def __init__(self):
        # Replace this with actual text data from the site (scraped, PDFs, markdowns, etc.)
        self.documents = [
            "We offer Hatha Yoga classes every Monday and Wednesday at 6 PM.",
            "The classes take place in Livermore, CA in the Yoga Center and Health Sciences at 1234 Avenue.",
            "Online yoga sessions are available via Zoom for registered users.",
        ]

    def get_docs(self):
        """
        Returns the list of website documents.
        """
        return self.documents
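For reference, a real get_docs implementation might fetch and parse pages rather than return canned strings. The sketch below is a minimal, hypothetical example; the URL, the use of requests and BeautifulSoup, and the paragraph-tag selection are all assumptions for illustration, not part of the sample above.
# Hypothetical scraping sketch (requires the requests and beautifulsoup4 packages).
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    """Download a page and return its paragraph text as one string."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

# Example usage with a placeholder URL:
# documents = [fetch_page_text("https://example.com/yoga-classes")]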
Coming to the core of the RAG application, the chatbot consists of a few components.
Instantiation loads the embeddings, the AI model, and the scraped content, which are then used to build a vector database (the retrieval piece).
class ChatBot:
    """
    A simple RAG (Retrieval-Augmented Generation) chatbot using LangChain,
    FAISS vector store, OpenAI embeddings, and OpenAI LLM.
    """

    def __init__(self, set_vector_database=True):
        # Step 1: Load documents from the website
        self.DOCS = WebsiteDocGenerator().get_docs()

        # Step 2: Load OpenAI embeddings (used to embed user query and documents)
        self.EMBEDDINGS = OpenAIEmbeddings()

        # Step 3: Set up the OpenAI LLM
        self.LLM = OpenAI(
            temperature=0.4,  # Balanced creativity
            max_tokens=500,   # Limit response length
            top_p=0.9         # Diversity in output
        )

        # Step 4: These will be set only if the vector database is initialized
        self.VECTORSTORE = None
        self.retriever = None
        self.CHAIN = None

        # Step 5: Optionally build the vector store and retrieval chain
        if set_vector_database:
            self.set_vector_db()

    def set_vector_db(self):
        """
        Builds the FAISS vector store from website documents,
        and initializes the retrieval + LLM answering chain.
        """
        # Convert documents to vector space using embeddings
        self.VECTORSTORE = FAISS.from_texts(self.DOCS, embedding=self.EMBEDDINGS)

        # Create a retriever interface to search the FAISS index
        self.retriever = self.VECTORSTORE.as_retriever()

        # Create a retrieval-based QA chain that answers using relevant documents
        self.CHAIN = RetrievalQAWithSourcesChain.from_llm(
            llm=self.LLM,
            retriever=self.retriever
        )
User query steps:
- Accept the user’s query.
- Retrieve relevant information from the knowledge base using the query.
- Pass the query, together with the retrieved context, to the AI model as input.
- Return the response to the user.
    def userQuery(self, query):
        """
        Accepts a user query, retrieves relevant context, and generates a response.
        Returns a dictionary with:
        - 'response': AI-generated answer + sources used
        - 'retrieved_texts': List of matched context documents
        Or returns an error message if something goes wrong.
        """
        try:
            # Get the answer and source references from the chain
            result = self.CHAIN({"question": query}, return_only_outputs=True)

            # Also return the raw context used (optional, useful for debugging)
            retrieved_context = self.retriever.get_relevant_documents(query)
            retrieved_texts = [doc.page_content for doc in retrieved_context]

            return {
                "response": f"{result['answer']} sources:{result['sources']}",
                "retrieved_texts": retrieved_texts
            }
        except Exception as e:
            # Useful for handling and logging unexpected failures
            return {"error": str(e)}
Now, let us explore some aspects of traditional software testing that one could pursue.
Unit Testing
Unit tests verify individual chatbot components to ensure they function correctly. One could very well design a test plan that roughly looks like this for the above:
Module | Objective | Test Type
--- | --- | ---
Website retrieval | Ensure that we are able to retrieve the website content as per the requirement | Positive
Vector Database | Ensure that the vector database retrieves context for a supplied query | Positive
Embedding Status and Consistency | Verify that embeddings are produced and that similar queries produce similar embeddings | Positive
Likewise, one could come up with negative test cases for the above as well (a sample negative test follows the positive examples below).
Here are some example test cases that illustrate ways to test module-level behaviors:
- Testing the scraper for content format and data fetch
def test_website_doc_format():
    """
    Test if the content is in a specific format.
    """
    docs = WebsiteDocGenerator().get_docs()
    assert isinstance(docs, list), "Docs should be a list"
    assert all(isinstance(doc, str) for doc in docs), "Each doc should be a string"


def test_website_scrape():
    """
    Test if the website content is being fetched properly.
    """
    chatbot = ChatBot(set_vector_database=False)
    assert len(chatbot.DOCS) > 0
- Testing the vector database for data retrieval
def test_vector_db_retrieval():
    """
    Test the vector storage.
    """
    chatbot = ChatBot()
    query = "What yoga courses are available?"
    results = chatbot.VECTORSTORE.similarity_search(query, k=3)
    assert any("yoga" in doc.page_content.lower() for doc in results)
- Testing the vector embedding for similar content
def test_embedding_consistency():
    """
    Test the embedding similarity.
    """
    chatbot = ChatBot(set_vector_database=False)
    query1 = "What is the address of the Health Department?"
    query2 = "Fetch the location of the Health Department."
    embedding1 = chatbot.EMBEDDINGS.embed_query(query1)
    embedding2 = chatbot.EMBEDDINGS.embed_query(query2)
    # OpenAI embeddings are unit-normalized, so the dot product equals cosine similarity
    similarity = sum(a * b for a, b in zip(embedding1, embedding2))
    assert similarity > 0.85, "Embedding consistency test failed"
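And, as mentioned above, here is a minimal sketch of a negative test case; it assumes we want userQuery to surface an error rather than crash when the vector store has not been built.
def test_query_without_vector_db():
    """
    Negative test: querying before the vector store is built
    should surface an error instead of raising an unhandled exception.
    """
    chatbot = ChatBot(set_vector_database=False)
    result = chatbot.userQuery("What yoga courses are available?")
    assert "error" in result, "Expected an error when no vector DB is set"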
Integration Testing
Integration testing covers more end-to-end, functional behavior; these tests verify the interactions between components such as the LLM, the embeddings, and the retrieval layer.
Example Integration Test:
- To check the structure of the returned response
def test_user_query_data_structure():
    chatbot = ChatBot()
    query = "Where do the weekly yoga classes happen?"
    response = chatbot.userQuery(query)
    assert isinstance(response, dict), "Response is not a dictionary"
    assert "response" in response, "Response key missing"
    assert "retrieved_texts" in response, "retrieved_texts key missing"
- To check the response itself for expected information
def test_user_query_with_retrieval():
    chatbot = ChatBot()
    response = chatbot.userQuery("Where do the weekly yoga classes happen?")
    assert "Livermore" in response["response"], "Expected 'Livermore' in response but was missing"
Or one might do a more detailed static check on the response, as in the following typical string-check pattern:
def test_user_full_query_with_retrieval():
    chatbot = ChatBot()
    response = chatbot.userQuery("Where do the weekly yoga classes happen?")
    assert "The classes take place in Livermore, CA in the Yoga Center and Health Sciences" in response["response"]
- Checking for the retrieved content for queries outside the domain completely
def test_query_outside_domain():
    chatbot = ChatBot()
    response = chatbot.userQuery("How do I apply for a driving license?")
    assert not any("driving" in text.lower() for text in response["retrieved_texts"]), "No irrelevant documents should be retrieved"
- When the vector database is updated
Let us say the website content has been updated with some additional information, such as a new location for doing yoga called Yoga Park, and we need to simulate that update in our tests as follows:
def test_document_update():
    chatbot = ChatBot()
    chatbot.DOCS.append("A new location called Yoga Park is created for carrying out Yoga outdoors")
    # Simulating a knowledge base update
    chatbot.set_vector_db()  # Rebuilding the vector store
    response = chatbot.userQuery("Tell me about the new location for doing Yoga")
    assert "Park" in response["response"], "New document information not reflected in chatbot response"
We can see that, depending on the implementation, there are some important working pieces that need to be tested before we get into any other testing space. More importantly, such testing is generally used to validate functional correctness. If any tests raise an alarm here, it helps us investigate the individual blocks and what might have led to the failure.
This is important before we even evaluate the language of what is being returned to the user.
Limitations of Traditional Testing
We can observe that there is a gap in this testing methodology, although it does offer some useful coverage out of the box. While traditional tests help validate functional correctness, they struggle with:
- Output Variability – A chatbot's response could vary for the same input.
- Relevance vs. Correctness – Traditional tests check if a response exists, but not if it is meaningful.
- Risk in dynamic environments:
- Website content could change or update: If the website updates its yoga course names or descriptions, existing tests may pass, but chatbot responses might become incorrect due to outdated vector embeddings.
- LLM might have updates at the backend: If the underlying language model updates its parameters, responses might shift subtly, leading to variations that are hard to catch in traditional tests.
- User query could vary internally: Traditional tests assume static function inputs, but AI-driven responses can vary widely based on paraphrasing, making fixed assertions unreliable (a brittle-assertion sketch follows this list).
- Limited scalability – If we need a scaled-up range of tests covering many queries, the static assertions involved make it difficult to tailor the tests to every scenario.
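To make that brittleness concrete, here is a hypothetical sketch: two paraphrases of the same question checked with the same fixed assertion. The second assertion may fail even when the answer is perfectly reasonable, because the model might mention the venue name rather than the city.
def test_paraphrased_queries_are_brittle():
    """
    Illustrative only: fixed string assertions can break under paraphrasing,
    even when both answers are semantically correct.
    """
    chatbot = ChatBot()
    answer1 = chatbot.userQuery("Where do the weekly yoga classes happen?")
    answer2 = chatbot.userQuery("At which venue are the weekly yoga sessions held?")
    assert "Livermore" in answer1["response"]
    # May fail if the model answers with the venue name instead of the city.
    assert "Livermore" in answer2["response"]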
Value of Traditional Testing
The basic value it adds, as we have seen, is:
- Code behavior: Does the function return a value? Type verification is possible here as well.
- Deterministic outputs: Is output A always returned for input B?
- Structure: Are individual components wired together properly and working?
- Failures caused by bugs: Null pointers or any unexpected types
While traditional testing helps achieve some important objectives, it falls short when it comes to evaluating the dynamic nature of AI-driven responses. Traditional tests often focus on verifying static outputs or components, but chatbots, especially those using RAG models, generate responses that can vary based on context, input phrasing, and knowledge updates.
To move beyond simple binary pass/fail tests and truly understand the quality and relevance of the chatbot's output, we need a more nuanced approach. RAGAS offers a solution in this context.
RAGAS: Testing for Chatbots
RAGAS offers a dynamic evaluation method, analyzing chatbot responses using quality metrics instead of relying on predefined expected answers. This helps overcome several of the limitations of traditional testing.
Traditional tests struggle when content changes frequently, or when LLMs hallucinate plausible but incorrect answers. For RAG chatbots, we need evaluation methods that assess not just the presence of expected phrases but the quality and grounding of the answer.
Key RAGAS Metrics
- Faithfulness: How well does the response align with the documents retrieved from the database?
- Context Precision: How much of the retrieved context is relevant to answering the query?
- Context Recall: Did the system retrieve all the necessary supporting information from the database?
- Answer Relevancy: How well does the answer address the user’s query?
Context Precision and Context Recall focus on the retriever’s performance.
Faithfulness and Answer Relevancy focus on how well the LLM used the context in its final answer.
Example: RAGAS-Based Evaluation
Rather than writing brittle tests like:
assert "The classes take more in Livermore, CA in the Yoga Center and Health Sciences" in response["response"]
we can evaluate the response semantically. RAGAS can be installed by simply executing: pip install ragas
The following sample demonstrates how we can use the library to perform a semantic evaluation of our response.
Firstly, we import some functionality from RAGAS for this purpose.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from ragas import EvaluationDataset, SingleTurnSample
We then create the evaluation dataset as input for the model, passing in the metrics we wish to use for evaluating the response.
def test_chatbot_response():
    chatbot = ChatBot()
    query = "What is the address of the yoga center?"
    result = chatbot.userQuery(query)

    dataset = EvaluationDataset(samples=[
        SingleTurnSample(
            user_input=query,
            retrieved_contexts=result["retrieved_texts"],
            response=result["response"]
        )
    ])

    eval_results = evaluate(dataset, metrics=[faithfulness, context_precision, answer_relevancy])
We could choose some thresholds here. These are experimental and depend on the chosen metric.
assert eval_results["faithfulness"] >= 0.5
assert eval_results["answer_relevancy"] > 0.7
For illustration, suppose the chatbot's userQuery returns a response along these lines:
{
    "response": "The Yoga Center's address is <address>.",
    "sources": "VisitUs.pdf",
    "retrieved_texts": [...]
}
The metrics might then be returned, for example, as:
{
    "faithfulness": 0.5000,
    "context_precision": 1.0000,
    "answer_relevancy": 0.9680
}
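If several metrics need to be checked against minimum thresholds at once, a small helper like the hypothetical one below keeps the assertions in one place (the metric names and threshold values are illustrative):
# Hypothetical helper; relies only on dictionary-style access to the scores.
def assert_minimum_scores(eval_results, thresholds):
    """Assert that each metric in 'thresholds' meets its minimum score."""
    failures = {
        metric: eval_results[metric]
        for metric, minimum in thresholds.items()
        if eval_results[metric] < minimum
    }
    assert not failures, f"Metrics below threshold: {failures}"

# Usage inside the test above:
# assert_minimum_scores(eval_results, {"faithfulness": 0.5, "answer_relevancy": 0.7})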
With metrics like these, we can better understand where a given response stands, and we can dig further to tune our implementation modules as required to meet some expected minimum standards.
Also, we can catch subtle failures this way. Imagine we have a regular functional test:
response = chatbot.userQuery("Is there a Sunday yoga session?")
assert "Sunday" in response["response"]
This test passes, even if the model guessed “Sunday” without any evidence!
- Faithfulness detects hallucination.
- Context Precision checks if the relevant source content was used.
So, if we had a response that was not well-aligned with what was retrieved from the knowledge database, the metrics would indicate values on the lower side for us to investigate.
In this way, RAGAS metrics are also useful for regression testing. By tracking scores over time, we can detect drops in performance, hallucinations, or dips in the retrieval quality for our queries. One lightweight way to do this is sketched below.
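This is a minimal sketch, assuming scores are kept in a simple JSON baseline file next to the test suite (the file name and tolerance are arbitrary choices):
import json
from pathlib import Path

# Hypothetical regression check: compare current RAGAS scores against a stored
# baseline and flag metrics that dropped by more than a tolerance.
def check_against_baseline(current_scores, baseline_path="ragas_baseline.json", tolerance=0.05):
    path = Path(baseline_path)
    if path.exists():
        baseline = json.loads(path.read_text())
        drops = {
            metric: (baseline[metric], score)
            for metric, score in current_scores.items()
            if metric in baseline and score < baseline[metric] - tolerance
        }
        assert not drops, f"Regression detected (baseline, current): {drops}"
    # Store the latest scores as the new baseline for the next run.
    path.write_text(json.dumps(current_scores, indent=2))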
Conclusion
Overall, to achieve a comprehensive chatbot evaluation, we can leverage the advantages of both approaches. RAGAS does not replace traditional tests; it complements them. It enables deeper, more semantic evaluation of chatbot behavior, particularly in retrieval-heavy environments. While open-ended prompts may still pose a challenge, for fact-based queries these metrics offer a much-needed lens into system quality.
As AI chatbots become more widespread in consumer applications, adopting this hybrid testing approach will be key to maintaining both accuracy and trust in automated systems.