Query-Aware Retrieval Routing for Analytics on AWS: When to Use Redshift, OpenSearch, Neptune, or Cache

Use a query router for LLM analytics — Redshift (KPIs), OpenSearch (definition), Neptune (lineage), and Cache (repeats) — to improve accuracy, latency, and costs.

Anusha Kovi

CORE ·

Feb. 10, 26 · Analysis

Likes (1)

Comment

Save

1.1K Views

Typically, LLM analytics assistants or chatbots start with retrieval-augmented generation (RAG) and a database connection. That's fine until real users ask a mix of KPI questions, definition lookups, lineage questions, and repeated dashboard-style requests. If everything goes through one retrieval path to access data, you will see three predictable failures.

Wrong answers: Metrics that are computed at the wrong grain, wrong joins, missing filters
Slow answers: Long prompts, retries
Higher cost: More tokens, more queries, more wasted warehouse scans

Analytics questions are not the same every time. The backend that is best for one question (e.g., what does active users mean?) may not be the best for another (e.g., which dashboards depend on the product type field?).

A practical fix is query-aware routing, which parses the user query into a small, structured intent and routes it to the best backend.

Redshift (SQL/semantic metrics) for governed KPIs and aggregations
OpenSearch vector search for definitions, policies
Neptune graph for lineage, dependencies, and relationship/path questions
Cache for repeated, stable aggregates and common tiles

This article explains a routing technique optimized for analytics workloads and an evaluation plan to measure its effectiveness.

The Four Backends And What They Are Best At

1. Redshift SQL

When the user's question requests computed facts, route to Redshift. Some of the examples are:

Time windows such as the last 15 days, MoM, WoW, QoQ, YoY, or since Jan 1
Breakdowns such as by channel, region, product group, or top 10
KPIs, or metrics, such as daily/monthly active users, conversion/adoption rate, revenue, churn
Comparisons such as before vs. after

Treat the semantic layer as a contract (grain, default filters, allowed dimensions, metric definitions, etc.), if you have one. The LLM should not invent SQL; it should request metrics, dimensions, filters, and time ranges and let your semantic layer compile them into SQL.

2. OpenSearch Vectors

When the question is about context, route to OpenSearch. Some of the examples are:

How is the adoption rate calculated?
What is the PII access policy?
What does this metric mean?

3. Neptune Graph

When the question is relationship-heavy, route to Neptune. Some of the examples are:

Upstream and downstream lineage
What depends on this table/column?
Which dashboards break if the table is dropped?

Graphs answer paths and not similar text.

4. Cache

This is the fastest way to retrieve the data. Use it when:

The users ask the same question again and again
Answers are stable enough (daily refresh is good)
Larger scans or popular

Cache both results (for common aggregates) and plans (normalized question -> parsed intent -> chosen route).

A Simple Routing Policy

1. Parse the Question Into Intent and Constraints

Routing is fragile if it's just 'LLM decides which tool.' A small structured object is what you want to validate and test for analytics.

For example:

    JSON
   
 

   {
  "intent": "kpi",
  "metrics": ["conversion_rate"],
  "dimensions": ["channel"],
  "filters": {"country": "US"},
  "time_range": {"type": "relative", "value": "last_14_days"},
  "needs_lineage": false,
  "needs_definition": false
}

  

Some of the ways to produce it are:

Normalize common phrases like 'last week' to a date range.
Use your metric catalogue to resolve synonyms (e.g., CVR to conversion_rate)
Use a strict JSON schema, like rule-based, LLM-based, or hybrid, and then validate.

Once you prepare this object, the router will be deterministic and explainable, stating, 'We chose Redshift because a known metric, time window, and breakdown were present.'

2. Analytics-Friendly Routing Heuristics

Rule A: If you detect known metric names or aggregation terminology such as "trend," "how many," "increase/decrease," "compare," or a time window, or breakdown words such as "by," "top," or "segment," then route to Cache, then Redshift.

Let's see this with an example: If the user asks for "revenue by region last month," then it is metric, dimension, and time range, which means route to Redshift, but check the cache first.

Rule B: If you detect definition or policy terms such as "define," "meaning," "what counts as," "how is Y calculated," "policy," "PII," "SLA," "owner," or "source of truth," then route to OpenSearch vectors.

It's example time! If the user requests, "What counts as adoption?" Now, we should retrieve definitions, edge cases, and ownership links, so route to OpenSearch vectors.

Rule C: If you detect lineage or dependency questions such as "depends on," "lineage," "upstream/downstream," "impact," "what breaks if...," or column/table mentions, then route to Neptune.

Let's check with an example. "Which dashboards depend on traffic?" The graph returns downstream dashboard nodes, owners, and paths.

Rule D: Hybrid routes are normal, and don't treat them as exceptions. Some questions that naturally span more than one system are:

KPI and context: "Why did adoption drop last week?"
- Redshift: calculate delta by segments
- OpenSearch: pull the definition, known incidents, and recent changes that happened
Lineage and freshness: "Is this dashboard stale since the upstream changed?"
- Neptune: lineage path to upstream jobs/models
- Redshift: check freshness timestamps like max event time, and load time
- OpenSearch: pull relevant runbooks for guidance

3. Small Scoring Router (More Stable Than If/Else)

Rules work well, but a scoring layer makes it more robust and tunable.

Compute four scores:

sql_score: metric, time window, aggregation/breakdown
vector_score: definition/policy, unknown terms
graph_score: lineage/dependency, table/fields
cache_score: repeat likelihood, stability of window, common KPI

Decision logic:

If vector_score is high, route to vector.
If two scores are close and high, then take a hybrid route.

Sample pseudo code:

    Python
   
   def choose_route(scores):
    if scores["cache"] >= 0.80:
        return ["cache", "sql"]

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    (a, sa), (b, sb) = ranked[0], ranked[1]

    if sa < 0.60:
        return ["vector"]  # default help mode

    if sa > 0.75 and (sa - sb) < 0.15:
        return [a, b]      # hybrid

    return [a]

The benefit is that you can tune thresholds using real logs without changing prompts.

Architecture

By treating the router as a control plan, follow the steps below:

Normalize the question, such as time phrases, metric synonyms, etc.
Parse into intent JSON (validated)
Policy gates such as allowlisted metrics/dimensions, RLS/CLS checks, and PII flags
Select a route with scores and thresholds
Execute tools (Cache, Redshift, OpenSearch, Neptune)
Assemble the answer with tables, narrative, and citations for the retrieved text
Log the route chosen, latency, tool errors, and fallbacks

The main principle here is that the LLM can help parse, but the router's final decision should be deterministic and logged.

Evaluation Mechanism

The process change to include this routing is only useful if it improves outcomes. Measure the following:

Accuracy: (Did we answer correctly?)
- KPI accuracy: Compare Redshift results to gold (trusted) queries. If the numbers don't match within the agreed tolerance, it's a correctness failure, not a model issue.
- Definition accuracy: For "what does X mean" questions, check whether the answer is directly supported by retrieved sources. The main test is: if the cited snippets are removed, does the explanation still hold? If not, the system is guessing.
- Lineage accuracy: Compare the assistant-retrieved Neptune path with a known ground-truth dependency map. If the assistant misses a critical upstream node, it can cause real operational errors.
Latency: (Did we answer quickly?)
- End-to-end latency (p50, p95): Check how long a response takes from the user question to the final answer. Record both the median (p50) and the slow case (p95) to understand typical performance and worst-case scenarios.
- Tool-level breakdown: Check the time spent on Redshift execution, OpenSearch retrieval, and Neptune traversal. By this, you can understand whether you need to do SQL optimization, index tuning, or graph query improvements.
Cost: (Did we answer efficiently?)
- Tokens per answer: Record token usage, as it directly correlates with latency and inference cost. If routing is working, you should see fewer bloated prompts due to unnecessary retrievals.
- Number of tool calls per question: Calculate how many tool invocations happen per user query. A good router reduces thrashing and converges in fewer steps.
- Redshift scan/credit estimates: Monitor bytes scanned. If routing is correct, fewer questions should hit expensive wide scans and should use governed efficient metric queries and cached results.
Routing quality: (Did we choose the right backend?)
- Precision/recall by intent class: Treat routing like a classification problem. For each query type, such as KPI, lineage, definition, or measure, the router correctly chooses the intended backend (precision) and how frequently it successfully catches queries of that type (recall).
- Avoidable tool calls: Record cases where a tool was used unnecessarily. For instance, using vectors for straightforward KPI questions that can be answered by a single Redshift metric query. These are clear opportunities to reduce latency and costs.
- Fallback rate: Track how often the first route fails before the system attempts a second route. A high fallback rate usually indicates that intent parsing is weak, the thresholds are misconfigured, or the backend's reliability needs improvement.

Conclusion

In analytics, trust comes from being right, being fast, and explaining why. Query-aware routing makes this possible by using Redshift for correct, known, governed computation; OpenSearch for grounded meaning and policies; Neptune for dependency resolution; and Cache for speed when computations are repeated.

Once you log route decisions and measure accuracy, latency, or cost, you can iterate like an engineering product rather than a one-off experiment.

AWS Analytics large language model

Opinions expressed by DZone contributors are their own.

Related

Trending