Cost-Aware Routing for RAG: Fetch Less, Spend Less, Answer Better
Every query hits the same retrieval depth — whether it needs 0 passages or 10. Here is what that mistake costs, and how cost-aware routing fixes it.
Join the DZone community and get the full member experience.
Join For FreeYou have a knowledge base full of PDFs. Someone asks: "What do you know about RAG?" Your RAG system dutifully searches all the documents, retrieves 10 passages, stuffs them into the prompt, and generates the answer. The problem? The LLM already knew the answer. You just paid for 10 passages you did not need.
This is the silent tax of static RAG — and most teams do not realize they are paying it on every single query.
What Is Static RAG, Exactly?
Retrieval-Augmented Generation (RAG) works by fetching relevant text passages from a knowledge base before generating an answer. The passages provide grounding — they give the LLM factual context it might not have in its training weights.
Static RAG means the retrieval depth — the number of passages fetched, called k — is a fixed constant in your configuration. It never changes, regardless of what the user asked.
# Static RAG — k is hardcoded forever
retriever = FAISSRetriever(k=10) # always 10, every query
# These queries all get 10 passages:
"What is RAG?" # needs 0
"Compare dense vs sparse retrieval" # needs 3
"Explain all tradeoffs with examples" # needs 10
The number 10 is not the problem. The problem is using the same number for everything.
Key Insight: Static RAG simultaneously overfetches on simple queries and underfetches on complex ones — often at the same time, within the same application.
A Real Example: The Question That Needs Zero Passages
Imagine your knowledge base contains a RAG explainer PDF, architecture documentation, technical guides, and dozens of other PDFs. A user asks:
User Query
"What do you know about RAG?"
Here is what static RAG does, step by step:
1. Converts the query to a vector embedding.
2. Searches all passages in all PDFs via FAISS.
3. Returns the top 10 most similar passages.
4. Stuffs all 10 into the prompt.
5. Sends the entire thing to the LLM API — you pay for every token.
6. LLM generates an answer.
The answer it generates? Identical to what it would have said with zero passages. GPT-4 was trained on the entire internet. It knows what RAG is. The retrieval step added cost, added latency, and changed nothing.
Static RAG Result
- ~200 extra tokens billed
- +1–2 seconds of latency
- Zero improvement in answer quality
Why This Happens: Fixed k vs. Dynamic k
The root cause is architectural. FAISS — the most common vector search library in RAG systems — takes a parameter k that determines how many nearest-neighbor passages to return. In static RAG, this is set once at initialization and never reconsidered.
| Query | What It Needs | Static RAG Gives | Verdict |
|---|---|---|---|
| "What is RAG?" | k=0 — LLM already knows | k=5 — 5 wasted passages | Overfetch |
| "Compare dense vs sparse retrieval" | k=3 — moderate context | k=5 — close enough | Acceptable |
| "Explain all tradeoffs with production examples" | k=8–10 — deep context | k=5 — not enough | Underfetch |
Static RAG gets it right only when the query complexity happens to match the hardcoded k. For everything else, it either wastes money or degrades quality.
"Static RAG is not wrong because it uses k=10. It is wrong because it uses the same k for every query."
The Solution: Cost-Aware RAG (CA-RAG)
CA-RAG treats retrieval depth as a per-query decision, not a system-wide constant. Instead of one fixed k, it maintains a catalog of discrete strategy bundles — each pairing a retrieval depth with a generation profile:

A lightweight router evaluates each incoming query against all four bundles using a scalar utility function:
U(b) = wQ × Quality_prior − wL × Latency_norm − wC × Cost_norm
# Default weights: quality=0.6, latency=0.2, cost=0.2
# The bundle with the highest score wins — no ML model required
For the query "What do you know about RAG?", the router computes:
direct_llm → U = 0.40 ← winner: LLM knows this, cost penalty near zero
light_rag → U = 0.38
medium_rag → U = 0.37
heavy_rag → U = 0.28 ← heavy cost penalty kills utility
Result: the router routes to direct_llm, skips FAISS entirely, and gets the same answer in half the time at a fraction of the cost.

What the Data Shows
Across a 28-query benchmark spanning definitional, comparative, procedural, and analytical queries, CA-RAG demonstrated consistent advantages over static configurations:

The router exercised all four bundles: medium_rag handled 57% of queries, heavy_rag 18%, direct_llm 14%, and light_rag 11%. This non-uniform distribution is exactly what you want — it proves the router is actually making meaningful decisions rather than defaulting to one bundle.
But Is FAISS Always the Right Choice?
FAISS is a great default for research and prototyping — it is free, fast, and runs locally. But it is not the only vector search option, and for production systems you should evaluate alternatives based on your scale and constraints:
| Option | Best For | Trade-off |
|---|---|---|
| FAISS | Local prototyping, research | No persistence, no metadata filtering |
| Pinecone | Managed cloud at scale | Cost per query at high volume |
| pgvector | Already using PostgreSQL | Slower than dedicated vector DBs |
| Oracle AI Vector Search | Enterprise Oracle environments | Requires Oracle licensing |
| Weaviate / Qdrant | Production with hybrid search | Infra to manage |
The CA-RAG routing framework is vector-store agnostic — you can swap FAISS for any of the above without changing the routing logic.
Five Rules for RAG That Actually Works
1. Route every query. Match retrieval depth to query complexity. Never apply the same k to everything.
2. Keep your corpus clean. Deduplicate chunks, verify provenance, fill coverage gaps. Garbage in, garbage out — regardless of retrieval depth.
3. Trust the confidence score. Low FAISS retrieval confidence signals poor corpus coverage, not a retrieval failure. Do not generate answers from weakly matched passages.
4. Tune weights for your SLO. Latency-sensitive? Increase wL. Cost-sensitive? Increase wC. Same bundle catalog, different operating point — no code change needed.
5. Measure per query, not in aggregate. Aggregate means hide variance. Track per-query cost, latency, and quality to find where your system is over- or under-retrieving.
The Principle
Fetch only what the question needs.
Nothing more. Nothing less.
— The CA-RAG routing principle
Getting Started
You do not need to implement CA-RAG fully on day one. Start with a two-bundle system:
def route(query: str) -> int:
complexity = compute_complexity(query) # word length + cue words
if complexity < 0.3:
return 0 # direct_llm — skip retrieval
elif complexity < 0.6:
return 3 # light_rag
else:
return 10 # heavy_rag
Even this simple heuristic — skip retrieval for short, simple queries — will reduce your token bill noticeably. Add more bundles and the utility function as your needs grow.
Opinions expressed by DZone contributors are their own.
Comments