Cost-Aware Routing for RAG: Fetch Less, Spend Less, Answer Better

Every query hits the same retrieval depth — whether it needs 0 passages or 10. Here is what that mistake costs, and how cost-aware routing fixes it.

Sanjay Mishra

May. 06, 26 · Opinion

Likes (0)

Comment

Save

1.5K Views

You have a knowledge base full of PDFs. Someone asks: "What do you know about RAG?" Your RAG system dutifully searches all the documents, retrieves 10 passages, stuffs them into the prompt, and generates the answer. The problem? The LLM already knew the answer. You just paid for 10 passages you did not need.

This is the silent tax of static RAG — and most teams do not realize they are paying it on every single query.

What Is Static RAG, Exactly?

Retrieval-Augmented Generation (RAG) works by fetching relevant text passages from a knowledge base before generating an answer. The passages provide grounding — they give the LLM factual context it might not have in its training weights.

Static RAG means the retrieval depth — the number of passages fetched, called k — is a fixed constant in your configuration. It never changes, regardless of what the user asked.

    Plain Text
   
   # Static RAG — k is hardcoded forever
retriever = FAISSRetriever(k=10)  # always 10, every query

# These queries all get 10 passages:
"What is RAG?"                   # needs 0
"Compare dense vs sparse retrieval" # needs 3
"Explain all tradeoffs with examples" # needs 10

The number 10 is not the problem. The problem is using the same number for everything.

Key Insight: Static RAG simultaneously overfetches on simple queries and underfetches on complex ones — often at the same time, within the same application.

A Real Example: The Question That Needs Zero Passages

Imagine your knowledge base contains a RAG explainer PDF, architecture documentation, technical guides, and dozens of other PDFs. A user asks:

User Query

"What do you know about RAG?"

Here is what static RAG does, step by step:

1. Converts the query to a vector embedding.
2. Searches all passages in all PDFs via FAISS.
3. Returns the top 10 most similar passages.
4. Stuffs all 10 into the prompt.
5. Sends the entire thing to the LLM API — you pay for every token.
6. LLM generates an answer.

The answer it generates? Identical to what it would have said with zero passages. GPT-4 was trained on the entire internet. It knows what RAG is. The retrieval step added cost, added latency, and changed nothing.

Static RAG Result

~200 extra tokens billed
+1–2 seconds of latency
Zero improvement in answer quality

Why This Happens: Fixed k vs. Dynamic k

The root cause is architectural. FAISS — the most common vector search library in RAG systems — takes a parameter k that determines how many nearest-neighbor passages to return. In static RAG, this is set once at initialization and never reconsidered.

Query	What It Needs	Static RAG Gives	Verdict
"What is RAG?"	k=0 — LLM already knows	k=5 — 5 wasted passages	Overfetch
"Compare dense vs sparse retrieval"	k=3 — moderate context	k=5 — close enough	Acceptable
"Explain all tradeoffs with production examples"	k=8–10 — deep context	k=5 — not enough	Underfetch

Static RAG gets it right only when the query complexity happens to match the hardcoded k. For everything else, it either wastes money or degrades quality.

"Static RAG is not wrong because it uses k=10. It is wrong because it uses the same k for every query."

The Solution: Cost-Aware RAG (CA-RAG)

CA-RAG treats retrieval depth as a per-query decision, not a system-wide constant. Instead of one fixed k, it maintains a catalog of discrete strategy bundles — each pairing a retrieval depth with a generation profile:

A lightweight router evaluates each incoming query against all four bundles using a scalar utility function:

    Plain Text
   
   U(b) = wQ × Quality_prior − wL × Latency_norm − wC × Cost_norm

# Default weights: quality=0.6, latency=0.2, cost=0.2
# The bundle with the highest score wins — no ML model required

For the query "What do you know about RAG?", the router computes:

    Plain Text
   
   direct_llm  → U = 0.40  ← winner: LLM knows this, cost penalty near zero
light_rag   → U = 0.38
medium_rag  → U = 0.37
heavy_rag   → U = 0.28  ← heavy cost penalty kills utility

Result: the router routes to direct_llm, skips FAISS entirely, and gets the same answer in half the time at a fraction of the cost.

What the Data Shows

Across a 28-query benchmark spanning definitional, comparative, procedural, and analytical queries, CA-RAG demonstrated consistent advantages over static configurations:

The router exercised all four bundles: medium_rag handled 57% of queries, heavy_rag 18%, direct_llm 14%, and light_rag 11%. This non-uniform distribution is exactly what you want — it proves the router is actually making meaningful decisions rather than defaulting to one bundle.

But Is FAISS Always the Right Choice?

FAISS is a great default for research and prototyping — it is free, fast, and runs locally. But it is not the only vector search option, and for production systems you should evaluate alternatives based on your scale and constraints:

Option	Best For	Trade-off
FAISS	Local prototyping, research	No persistence, no metadata filtering
Pinecone	Managed cloud at scale	Cost per query at high volume
pgvector	Already using PostgreSQL	Slower than dedicated vector DBs
Oracle AI Vector Search	Enterprise Oracle environments	Requires Oracle licensing
Weaviate / Qdrant	Production with hybrid search	Infra to manage

The CA-RAG routing framework is vector-store agnostic — you can swap FAISS for any of the above without changing the routing logic.

Five Rules for RAG That Actually Works

1. Route every query. Match retrieval depth to query complexity. Never apply the same k to everything.

2. Keep your corpus clean. Deduplicate chunks, verify provenance, fill coverage gaps. Garbage in, garbage out — regardless of retrieval depth.

3. Trust the confidence score. Low FAISS retrieval confidence signals poor corpus coverage, not a retrieval failure. Do not generate answers from weakly matched passages.

4. Tune weights for your SLO. Latency-sensitive? Increase wL. Cost-sensitive? Increase wC. Same bundle catalog, different operating point — no code change needed.

5. Measure per query, not in aggregate. Aggregate means hide variance. Track per-query cost, latency, and quality to find where your system is over- or under-retrieving.

The Principle

Fetch only what the question needs.
Nothing more. Nothing less.

— The CA-RAG routing principle

Getting Started

You do not need to implement CA-RAG fully on day one. Start with a two-bundle system:

    Plain Text
   
 

   def route(query: str) -> int:
    complexity = compute_complexity(query)  # word length + cue words
    if complexity < 0.3:
        return 0   # direct_llm — skip retrieval
    elif complexity < 0.6:
        return 3   # light_rag
    else:
        return 10  # heavy_rag
  

Even this simple heuristic — skip retrieval for short, simple queries — will reduce your token bill noticeably. Add more bundles and the utility function as your needs grow.

Fetch (FTP client) LESS large language model RAG

Opinions expressed by DZone contributors are their own.

Related

Trending