Token-Efficient RAG: Using Query Intent to Reduce Cost Without Losing Accuracy

A Retrieval-Augmented Generation (RAG) optimization technique that reduces the number of tokens required to generate a response while maintaining response accuracy.

By Varun Setia · Feb. 03, 26 · Opinion

In this article, we examine a RAG optimization technique that reduces the number of tokens required to generate a response while maintaining response accuracy. Before digging into RAG, let us review a few basic terms.

What Is an LLM (Large Language Model)?

Large language models (LLMs) are very large deep learning models that are pre-trained on vast amounts of data. They are capable of performing tasks ranging from simple to complex, such as content generation, text classification, text mining, and summarization.

What Is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is the process of optimizing an LLM’s output by referencing a knowledge base beyond its training data before generating a response. This approach is useful in several scenarios, including accessing a knowledge base, personalizing responses based on user details, and building a search engine.

RAG works in combination with an LLM to generate responses in a human-readable format. The three illustrations below build up to it step by step: vectorizing documents, semantic search, and RAG with an LLM.

Vectorization Process Illustration

Vectorization Process Illustration Diagram


This diagram illustrates the vectorization workflow using an OpenAI embedding model. First, a raw document is taken as input. In step one, the document is split into smaller, manageable chunks to preserve context and improve processing efficiency. In step two, each chunk is sent to the OpenAI embedding model (text-embedding-3-large). The model converts textual meaning into high-dimensional numerical vectors that capture semantic relationships.

In step three, these vectors mathematically represent the document content. Finally, in step four, the generated vectors are stored in a vector database, enabling fast semantic search, similarity matching, and retrieval for downstream AI applications.
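The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: `toy_embed` is a hypothetical hash-based stand-in for the text-embedding-3-large model so the example runs offline, and a plain list stands in for the vector database.

```python
import hashlib
import math
import re


def split_into_chunks(text, max_words=40):
    """Step 1: split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text, dim=64):
    """Steps 2-3: hypothetical stand-in for an embedding model.

    Hashes each word into one of `dim` buckets and L2-normalizes the counts.
    A real embedding model captures meaning; this only captures word overlap.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(document):
    """Step 4: store (chunk, vector) pairs; a list stands in for the vector database."""
    return [(chunk, toy_embed(chunk)) for chunk in split_into_chunks(document)]


document = "Microsoft reported quarterly results. " * 30  # placeholder text
index = build_index(document)
print(len(index), len(index[0][1]))
```

In a real pipeline, `toy_embed` would be replaced by a call to the embedding model and the list by writes to the vector database; the chunk size of 40 words is an arbitrary choice for the example.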

Search Engine Working Illustration

Search Engine Diagram


This diagram shows how a semantic search engine works using embeddings. A user enters a query as plain text. The query is converted into a numerical vector using the text-embedding-3-large model, capturing its semantic meaning. This query vector is sent to a vector database that already stores embeddings of document chunks.

The database compares the query vector with stored vectors using cosine similarity to find the most relevant matches. The top N matching document chunks are selected based on similarity scores. Finally, these matching documents are returned to the user as search results, enabling meaning-based search rather than keyword matching.
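The ranking step described above can be sketched as follows. The three-dimensional vectors and chunk texts are hypothetical placeholders chosen for readability; real embeddings have far more dimensions, and a vector database would perform this comparison internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_vector, index, n):
    """Return the n (score, chunk) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vector, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Hypothetical stored chunks and vectors, standing in for the vector database.
index = [
    ("Microsoft company overview", [0.9, 0.1, 0.0]),
    ("Microsoft FY2025 profit margin", [0.2, 0.9, 0.1]),
    ("Paris travel guide", [0.0, 0.1, 0.9]),
]
query_vector = [0.1, 1.0, 0.0]  # hypothetical embedding of the user's query

results = top_n(query_vector, index, n=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

The parameter `n` is the "top N" from the diagram: it fixes how many chunks are returned regardless of how many are actually relevant to the query.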

RAG with LLM Illustration

RAG with LLM Diagram


This diagram explains a knowledge base–driven RAG architecture. A user submits a query, which is converted into a semantic vector using the text-embedding-3-large model. This vector is used to search a vector database that stores embeddings of document chunks.

Using cosine similarity, the system retrieves the top N most relevant chunks from the knowledge base. These retrieved chunks are then passed as context to the GPT-5 large language model. Finally, GPT-5 generates a grounded, context-aware response based strictly on the retrieved documents and returns it to the user, combining accurate retrieval with natural language generation.
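To make the hand-off between retrieval and generation concrete, here is a minimal sketch of how retrieved chunks might be assembled into the prompt sent to the LLM. The function name, the instruction wording, and the placeholder chunk texts are assumptions for illustration; the article does not show its actual prompt template.

```python
def build_grounded_prompt(query, chunks):
    """Assemble retrieved chunks and the user question into a single prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Placeholder chunk texts; in practice these come from the vector database.
chunks = [
    "Placeholder text of the first retrieved chunk.",
    "Placeholder text of the second retrieved chunk.",
]
prompt = build_grounded_prompt(
    "What was the profit margin of Microsoft in Year 2025?", chunks
)
print(prompt)
```

Every chunk added to `chunks` lengthens the prompt, so the number of retrieved chunks directly drives the input token count of the generation call.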

Key Takeaway

The key insight from the above is that RAG is essentially a modified implementation of how a search engine works. We are simply processing the retrieved response one step further by using an LLM to summarize and generate the final answer.

Optimization Approach

In the above implementation, every query returns the same N document chunks, whatever is asked. On closer inspection, however, the number of chunks that are actually relevant depends on the nature of the user’s query.

Below are a few examples to better understand this based on query type:

| Query | Nature of query | Relevant document chunks | Traditional chunks | Saving |
|---|---|---|---|---|
| What is Microsoft? | Generic | 7 | 7 | 0% |
| What was the profit margin of Microsoft in Year 2025? | Contextual | 4 | 7 | ~43% |
| What was the profit margin of Microsoft in first quarter of Year 2025? | Very Specific Contextual | 2 | 7 | ~71% |

Based on the nature of the query, we categorize queries into three types (Generic, Contextual, and Very Specific Contextual) and assign maximum document counts of 7, 4, and 2 respectively.

Since the number of retrieved documents varies by query type, the percentage of token savings also varies. The optimal configuration can be achieved by analyzing the specific nature of the use case.
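The saving for each category is simple arithmetic: one minus the ratio of chunks retrieved to the fixed baseline of 7. The sketch below works this out. It assumes chunks are of roughly equal size, so the chunk saving approximates the saving in context tokens; it does not account for the tokens spent on the extra classification call.

```python
def chunk_saving(relevant_chunks, baseline_chunks=7):
    """Fraction of chunks avoided versus always fetching baseline_chunks."""
    return 1 - relevant_chunks / baseline_chunks


for label, k in [("Generic", 7), ("Contextual", 4), ("Very Specific Contextual", 2)]:
    print(f"{label}: {chunk_saving(k):.0%}")
```

For the Contextual case this is 3/7, or about 42.9%; for the Very Specific Contextual case it is 5/7, or about 71.4%.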

To implement this, we introduce another LLM that determines the nature of the query. We then maintain a dictionary that maps the query type to the corresponding number of relevant documents.

Knowledge base RAG + LLM Architecture


Prompt-Based Classification

To achieve this implementation, we define a system prompt. Let’s focus on the prompt implementation, keeping it relevant to the optimization context of this article. If you’d like a complete working example, feel free to comment, and I can write a detailed code walkthrough.

System Prompt

You are an AI assistant that classifies user input into exactly one of the following three categories: Generic, Contextual, or Very Specific Contextual.

  • Generic: The input is broad, high-level, and does not rely on any specific background, constraints, or prior context.
  • Contextual: The input includes some background, role, or situational details that guide the response, but still allows flexibility.
  • Very Specific Contextual: The input contains detailed constraints such as strict rules, format requirements, tone, role, audience, or explicit do’s and don’ts.

Your task is to analyze the input and return only one category name that best matches it. Do not provide explanations, examples, or additional text. Return only the category label.

Examples:

  • Example 1 — What is Microsoft? → Generic
  • Example 2 — What was the profit margin of Microsoft in Year 2025? → Contextual
  • Example 3 — What was the profit margin of Microsoft in the first quarter of Year 2025? → Very Specific Contextual

Testing Results

I tested this prompt with a few inputs using the Gemini 2.5 Flash model. In each case the model returned a single category label and nothing else.

Input: What is Honda?
Output: Generic

Input: Tell me a few places that I can visit in Paris
Output: Generic

Input: Tell me about the capital city of the United States of America
Output: Contextual

Since the output always falls into one of the three categories, we can easily create a Python dictionary to map each category to the appropriate number of documents.

Below is a simple example demonstrating how this optimization works by combining the classification prompt with a dictionary lookup.

Python
 
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # Reads GOOGLE_API_KEY from a local .env file, if present

# The non-streaming generateContent endpoint returns one JSON document, so the
# reply can be parsed directly instead of stripping "data: " prefixes from an
# SSE stream (which breaks as soon as the stream contains more than one event).
GEMINI_URL = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent"

SYSTEM_PROMPT = (
    "You are an AI assistant that classifies user input into exactly one of the "
    "following three categories: Generic, Contextual, or Very Specific Contextual. "
    "Generic: The input is broad, high-level, and does not rely on any specific "
    "background, constraints, or prior context. "
    "Contextual: The input includes some background, role, or situational details "
    "that guide the response, but still allows flexibility. "
    "Very Specific Contextual: The input contains detailed constraints such as "
    "strict rules, format requirements, tone, role, audience, or explicit do’s and don’ts. "
    "Your task is to analyze the input and return only one category name that best "
    "matches it. Do not provide explanations, examples, or additional text. "
    "Return only the category label. Below are 3 examples for reference: "
    "Example 1 - What is Microsoft?, Output 1 - Generic; "
    "Example 2 - What was the profit margin of Microsoft in Year 2025?, Output 2 - Contextual; "
    "Example 3 - What was the profit margin of Microsoft in first quarter of Year 2025?, "
    "Output 3 - Very Specific Contextual"
)

# Maps each query category to the number of chunks to retrieve
RELEVANT_DOCS_COUNT_MAP = {
    "Generic": 7,
    "Contextual": 4,
    "Very Specific Contextual": 2,
}


def get_relevantdocs_count(query):
    """Classify the query with Gemini and return how many chunks to retrieve."""
    payload = {
        # Send the caller's query rather than a hard-coded string
        "contents": [{"parts": [{"text": query}]}],
        "systemInstruction": {"parts": [{"text": SYSTEM_PROMPT}]},
    }
    headers = {
        # os.environ is a mapping, so the key is read with [] rather than ()
        "x-goog-api-key": os.environ["GOOGLE_API_KEY"],
        "Content-Type": "application/json",
    }
    response = requests.post(GEMINI_URL, headers=headers, json=payload, timeout=30)
    response.raise_for_status()
    category_name = response.json()["candidates"][0]["content"]["parts"][0]["text"].strip()
    return RELEVANT_DOCS_COUNT_MAP[category_name]


relevantdocs_count = get_relevantdocs_count("Tell me about the capital city of United States of America")
print(relevantdocs_count)  # Output: 4
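Once a category label is available, the count can be applied to the ranked retrieval results. The sketch below shows one possible way to do this defensively. The helper names, the lowercase keys, and the fallback to 7 chunks for an unrecognized label are all assumptions for illustration and are not part of the article's code.

```python
RELEVANT_DOCS_COUNT_MAP = {
    "generic": 7,
    "contextual": 4,
    "very specific contextual": 2,
}
DEFAULT_COUNT = 7  # Assumption: unknown labels fall back to the widest retrieval


def docs_count_for_label(raw_label):
    """Map a raw model reply to a chunk count, tolerating case, spacing, and quotes."""
    cleaned = " ".join(raw_label.strip().strip(".\"'").lower().split())
    return RELEVANT_DOCS_COUNT_MAP.get(cleaned, DEFAULT_COUNT)


def select_context(ranked_chunks, raw_label):
    """Keep only as many top-ranked chunks as the query category allows."""
    return ranked_chunks[:docs_count_for_label(raw_label)]


# Placeholder retrieval results, ordered best match first.
ranked_chunks = [f"chunk {i}" for i in range(1, 8)]

print(select_context(ranked_chunks, " Contextual\n"))
print(len(select_context(ranked_chunks, "Unexpected label")))
```

Falling back to the largest count trades token savings for safety: a misclassified or malformed label costs extra tokens but never starves the generator of context.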


Conclusion

RAG optimization is not just about better retrieval — it is about retrieving smarter. By understanding the nature of a user’s query before fetching context, we can dynamically control how much information is passed to the LLM, significantly reducing token usage without compromising response accuracy.

Classifying queries into Generic, Contextual, and Very Specific Contextual allows the system to adapt retrieval depth based on intent rather than applying a one-size-fits-all approach. This intent-aware RAG design mirrors how humans search for information and delivers meaningful cost, performance, and latency benefits.

As LLM-powered systems scale, such optimizations will move from being “nice to have” to essential — enabling more efficient, accurate, and production-ready AI applications.

Let me know how you feel about this approach in the comments section.

Thanks for reading till the end! I hope you enjoyed it!!

Published at DZone with permission of Varun Setia. See the original article here.

Opinions expressed by DZone contributors are their own.
