Cost-Aware Routing for RAG: Fetch Less, Spend Less, Answer Better

Every query hits the same retrieval depth — whether it needs 0 passages or 10. Here is what that mistake costs, and how cost-aware routing fixes it.

By Sanjay Mishra · May. 06, 26 · Opinion

You have a knowledge base full of PDFs. Someone asks: "What do you know about RAG?" Your RAG system dutifully searches all the documents, retrieves 10 passages, stuffs them into the prompt, and generates the answer. The problem? The LLM already knew the answer. You just paid for 10 passages you did not need.

This is the silent tax of static RAG — and most teams do not realize they are paying it on every single query.

What Is Static RAG, Exactly?

Retrieval-Augmented Generation (RAG) works by fetching relevant text passages from a knowledge base before generating an answer. The passages provide grounding — they give the LLM factual context it might not have in its training weights.

Static RAG means the retrieval depth — the number of passages fetched, called k — is a fixed constant in your configuration. It never changes, regardless of what the user asked.

Plain Text
 
# Static RAG: k is hardcoded forever
# (FAISSRetriever is a placeholder wrapper name, not a class shipped with FAISS)
retriever = FAISSRetriever(k=10)  # always 10, every query

# These queries all get 10 passages:
"What is RAG?"                         # needs 0
"Compare dense vs sparse retrieval"    # needs 3
"Explain all tradeoffs with examples"  # needs 10


The number 10 is not the problem. The problem is using the same number for everything.

Key Insight: Static RAG overfetches on simple queries and underfetches on complex ones, often within the same application.

A Real Example: The Question That Needs Zero Passages

Imagine your knowledge base contains a RAG explainer PDF, architecture documentation, technical guides, and dozens of other PDFs. A user asks:

User Query

"What do you know about RAG?"

Here is what static RAG does, step by step:

1. Converts the query to a vector embedding.
2. Searches all passages in all PDFs via FAISS.
3. Returns the top 10 most similar passages.
4. Stuffs all 10 into the prompt.
5. Sends the entire thing to the LLM API — you pay for every token.
6. LLM generates an answer.
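
The six steps above can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: it swaps FAISS and a real embedding model for a toy bag-of-words vector and a brute-force cosine search so that it runs on the standard library alone. The names `embed`, `cosine`, and `build_static_prompt` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_static_prompt(query: str, passages: list[str], k: int = 10) -> tuple[str, list[str]]:
    query_vec = embed(query)  # step 1: embed the query
    # Step 2: score every passage (brute force here, FAISS in the article).
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    top_k = ranked[:k]  # step 3: keep the top k, whatever the query was
    context = "\n".join(top_k)  # step 4: stuff them into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return prompt, top_k  # steps 5-6 would send the prompt to the LLM API


corpus = [f"Passage {i} about retrieval augmented generation" for i in range(12)]
_, simple = build_static_prompt("What is RAG?", corpus)
_, hard = build_static_prompt("Explain all tradeoffs with examples", corpus)
print(len(simple), len(hard))  # both queries pull 10 passages
```

The point of the sketch is the `ranked[:k]` line: nothing about the query influences how many passages end up in the prompt.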

The answer it generates? Effectively the same as what it would have said with zero passages. A model like GPT-4 was trained on a vast amount of public text, and it already knows what RAG is. The retrieval step added cost, added latency, and changed nothing.

Static RAG Result

  • ~200 extra tokens billed 
  • +1–2 seconds of latency  
  • Zero improvement in answer quality

Why This Happens: Fixed k vs. Dynamic k

The root cause is architectural. FAISS, a widely used vector search library in RAG systems, takes a parameter k on each search call that determines how many nearest-neighbor passages to return. In static RAG, the application sets this value once in its configuration and never reconsiders it.

The table below assumes a static configuration of k=5 for illustration.

Query | What It Needs | Static RAG Gives | Verdict
"What is RAG?" | k=0: LLM already knows | k=5: 5 wasted passages | Overfetch
"Compare dense vs sparse retrieval" | k=3: moderate context | k=5: close enough | Acceptable
"Explain all tradeoffs with production examples" | k=8–10: deep context | k=5: not enough | Underfetch


Static RAG gets it right only when the query complexity happens to match the hardcoded k. For everything else, it either wastes money or degrades quality.

"Static RAG is not wrong because it uses k=10. It is wrong because it uses the same k for every query."

The Solution: Cost-Aware RAG (CA-RAG)

CA-RAG treats retrieval depth as a per-query decision, not a system-wide constant. Instead of one fixed k, it maintains a catalog of discrete strategy bundles — each pairing a retrieval depth with a generation profile:

[Image: Cost-Aware RAG (CA-RAG) bundle catalog]


A lightweight router evaluates each incoming query against all four bundles (direct_llm, light_rag, medium_rag, and heavy_rag) using a scalar utility function:

Plain Text
 
U(b) = wQ × Quality_prior − wL × Latency_norm − wC × Cost_norm

# Default weights: quality=0.6, latency=0.2, cost=0.2
# The bundle with the highest score wins — no ML model required
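
The scoring rule above can be sketched as a small function. This is one possible reading, not the author's implementation: the `Bundle` dataclass, the `utility` and `pick_bundle` names, and every per-bundle number below are hypothetical inputs made up for a definitional query. Only the formula and the default weights come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    name: str
    quality_prior: float  # expected answer quality for this query, in [0, 1]
    latency_norm: float   # latency normalized to [0, 1]
    cost_norm: float      # cost normalized to [0, 1]


def utility(b: Bundle, w_q: float = 0.6, w_l: float = 0.2, w_c: float = 0.2) -> float:
    # U(b) = wQ * Quality_prior - wL * Latency_norm - wC * Cost_norm
    return w_q * b.quality_prior - w_l * b.latency_norm - w_c * b.cost_norm


def pick_bundle(bundles: list[Bundle], **weights: float) -> Bundle:
    # The bundle with the highest utility wins; ties go to the earliest entry.
    return max(bundles, key=lambda b: utility(b, **weights))


# Hypothetical priors for a simple definitional query (illustrative values only).
bundles = [
    Bundle("direct_llm", quality_prior=0.70, latency_norm=0.05, cost_norm=0.05),
    Bundle("light_rag", quality_prior=0.75, latency_norm=0.20, cost_norm=0.15),
    Bundle("medium_rag", quality_prior=0.80, latency_norm=0.30, cost_norm=0.25),
    Bundle("heavy_rag", quality_prior=0.80, latency_norm=0.50, cost_norm=0.50),
]

print(pick_bundle(bundles).name)  # direct_llm
```

With these assumed inputs, retrieval adds a little quality but the latency and cost penalties outweigh it. Passing different weights, for example `pick_bundle(bundles, w_q=1.0, w_l=0.0, w_c=0.0)`, moves the operating point without touching the catalog.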


For the query "What do you know about RAG?", the router computes:

Plain Text
 
direct_llm  → U = 0.40  ← winner: LLM knows this, cost penalty near zero
light_rag   → U = 0.38
medium_rag  → U = 0.37
heavy_rag   → U = 0.28  ← heavy cost penalty kills utility


Result: the router selects direct_llm, skips FAISS entirely, and gets the same answer in half the time at a fraction of the cost.

[Image: CA-RAG Result]


What the Data Shows

Across a 28-query benchmark spanning definitional, comparative, procedural, and analytical queries, CA-RAG demonstrated consistent advantages over static configurations:

[Image: benchmark comparison of CA-RAG against static configurations]


The router exercised all four bundles: medium_rag handled 57% of queries, heavy_rag 18%, direct_llm 14%, and light_rag 11%. This non-uniform distribution is what you want to see: it suggests the router is making meaningful per-query decisions rather than defaulting to one bundle.
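
A distribution like this is easy to monitor if the router logs the bundle it picked for each query. The sketch below tallies such a log. The per-bundle counts are inferred from the reported percentages of 28 queries (an assumption, since only percentages are given), and `routing_share` is a hypothetical helper.

```python
from collections import Counter


def routing_share(decisions: list[str]) -> dict[str, int]:
    """Return the rounded percentage of queries routed to each bundle."""
    counts = Counter(decisions)
    total = len(decisions)
    return {name: round(100 * n / total) for name, n in counts.most_common()}


# Assumed log: counts inferred from the reported percentages, not published data.
log = ["medium_rag"] * 16 + ["heavy_rag"] * 5 + ["direct_llm"] * 4 + ["light_rag"] * 3
print(routing_share(log))
```

If one bundle's share drifts toward 100%, the router has effectively collapsed back into static RAG and the weights or priors deserve a second look.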

But Is FAISS Always the Right Choice?

FAISS is a great default for research and prototyping — it is free, fast, and runs locally. But it is not the only vector search option, and for production systems you should evaluate alternatives based on your scale and constraints:

Option | Best For | Trade-off
FAISS | Local prototyping, research | No built-in persistence layer, no metadata filtering
Pinecone | Managed cloud at scale | Cost per query at high volume
pgvector | Already using PostgreSQL | Slower than dedicated vector DBs
Oracle AI Vector Search | Enterprise Oracle environments | Requires Oracle licensing
Weaviate / Qdrant | Production with hybrid search | Infra to manage


The CA-RAG routing framework is vector-store agnostic — you can swap FAISS for any of the above without changing the routing logic.
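
One way to keep the routing logic independent of the store is to code against a narrow interface. The sketch below is an assumption about how that could look, not part of the CA-RAG framework itself: the `Retriever` protocol, the `InMemoryRetriever` toy backend, and `retrieve_for_bundle` are all hypothetical names.

```python
from typing import Protocol


class Retriever(Protocol):
    def search(self, query: str, k: int) -> list[str]: ...


class InMemoryRetriever:
    """Toy backend ranked by word overlap; a FAISS, pgvector, or Qdrant
    adapter would expose the same search(query, k) method."""

    def __init__(self, passages: list[str]) -> None:
        self.passages = passages

    def search(self, query: str, k: int) -> list[str]:
        terms = set(query.lower().split())
        ranked = sorted(
            self.passages,
            key=lambda p: len(terms & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def retrieve_for_bundle(retriever: Retriever, query: str, k: int) -> list[str]:
    # k == 0 corresponds to direct_llm: the store is never touched.
    return retriever.search(query, k) if k > 0 else []


store = InMemoryRetriever([
    "dense retrieval uses embeddings",
    "sparse retrieval uses BM25",
    "release notes for version two",
])
print(retrieve_for_bundle(store, "dense retrieval", 1))
```

The router only ever decides `k`; swapping the backend means writing a new adapter class, not changing the routing code.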

Five Rules for RAG That Actually Works

1. Route every query. Match retrieval depth to query complexity. Never apply the same k to everything.

2. Keep your corpus clean. Deduplicate chunks, verify provenance, fill coverage gaps. Garbage in, garbage out — regardless of retrieval depth.

3. Trust the confidence score. Low similarity scores from FAISS retrieval signal poor corpus coverage, not a retrieval failure. Do not generate answers from weakly matched passages.

4. Tune weights for your SLO. Latency-sensitive? Increase wL. Cost-sensitive? Increase wC. Same bundle catalog, different operating point — no code change needed.

5. Measure per query, not in aggregate. Aggregate means hide variance. Track per-query cost, latency, and quality to find where your system is over- or under-retrieving.
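
Rule 3 can be enforced with a simple gate in front of generation. The sketch below assumes the retriever returns (passage, similarity) pairs where a higher score means a closer match; with an L2-distance index the comparison would flip, since smaller distances are closer. The `gate_passages` name and the 0.5 threshold are hypothetical and would need calibration against your own corpus.

```python
def gate_passages(scored: list[tuple[str, float]], min_score: float = 0.5) -> list[str]:
    """Keep only passages whose similarity clears the threshold."""
    return [passage for passage, score in scored if score >= min_score]


# Illustrative scores, not measurements.
hits = [
    ("RAG combines retrieval with generation", 0.82),
    ("Unrelated release notes", 0.21),
]

kept = gate_passages(hits)
if kept:
    print(kept)
else:
    # Nothing matched well: treat it as a corpus coverage gap.
    print("No passage cleared the threshold; answer without context or flag the gap.")
```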

The Principle

Fetch only what the question needs.
Nothing more. Nothing less.

— The CA-RAG routing principle

Getting Started

You do not need to implement CA-RAG fully on day one. Start with a simple complexity-based router that picks one of three retrieval depths:

Plain Text
 
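CUE_WORDS = {"compare", "explain", "tradeoffs", "why", "versus"}  # hypothetical list

def compute_complexity(query: str) -> float:
    # Illustrative heuristic (an assumption): word count plus cue words, in [0, 1]
    words = [w.strip("?.,") for w in query.lower().split()]
    cues = sum(w in CUE_WORDS for w in words)
    return min(len(words) / 20, 1.0) * 0.5 + min(cues / 2, 1.0) * 0.5

def route(query: str) -> int:
    complexity = compute_complexity(query)
    if complexity < 0.3:
        return 0   # direct_llm: skip retrieval
    elif complexity < 0.6:
        return 3   # light_rag
    else:
        return 10  # heavy_rag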


Even this simple heuristic — skip retrieval for short, simple queries — will reduce your token bill noticeably. Add more bundles and the utility function as your needs grow.
