Production-Grade RAG: Why Vector Search Isn't Enough (and How Hybrid Search Fills the Gaps)

RAG pipelines are increasingly popular, with vector search at their core. However, vector search alone might not be enough for high-quality retrieval.

Alejandro Duarte

Jun. 08, 26 · Analysis

Imagine your team just deployed a sleek RAG-based docs assistant for the SaaS platform you develop. In testing, it worked flawlessly. It knows your functionality and answers questions in three perfectly written paragraphs with no hallucinations. But two days after launch, a senior dev pokes you on Slack: "Hey man, the AI bot can't find anything on PX-9000-v2 configuration errors."

You check the logs. The user queried the exact error code. Vector search, optimized for semantic meaning, returned documents about general error handling and configuration best practices, but the specific technical description for PX-9000-v2 was buried at position 50 in the retriever's results (or chunks) because its "semantic" distance was too far from the general concept of "error."

In production RAG, semantic similarity is a powerful tool, but it is not a complete one. To build retrieval systems that survive real-world queries — IDs, acronyms, and specialized jargon — you need hybrid search.

The Dual Nature of Information Retrieval

To understand why hybrid search is necessary, we have to look at the two distinct ways we retrieve data: semantic intent and lexical precision.

Vector Search (Semantic)

Vector search relies on vector embeddings, which are dense numerical representations of text. It captures the meaning of a query. If I search for "fastest reconnaissance plane," a vector search engine understands the relationship with the "SR-71 Blackbird" even if the words don't match. This is excellent for handling natural language, synonyms, and vague user intent.
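To make the idea of "closeness" concrete, here is a minimal sketch of ranking by cosine distance. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds or thousands of dimensions), and the `nearest` helper is a hypothetical name, not part of any library.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "embeddings", invented for this example only.
query = [0.9, 0.1, 0.0]  # stands in for "fastest reconnaissance plane"
chunks = {
    "SR-71 Blackbird specifications": [0.8, 0.2, 0.1],
    "Cappuccino recipe": [0.0, 0.1, 0.9],
}

def nearest(query_vec, chunk_vecs):
    """Return chunk names ordered from closest to farthest."""
    return sorted(chunk_vecs, key=lambda name: cosine_distance(query_vec, chunk_vecs[name]))

print(nearest(query, chunks))
```

No word in the query has to appear in the winning chunk; only the direction of the vectors matters.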

Keyword Search (Lexical)

Keyword search (traditional Full-Text Search) is about exactness. It counts occurrences, considers document length, and rewards rare term matches. When a user types "PX-9000-v2," they don't want something similar to that string; they want that exact string.
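Those three ingredients (term counts, document length, and term rarity) are what BM25-style ranking functions combine. The sketch below uses one common BM25 variant with typical default parameters (`k1=1.2`, `b=0.75`); the tokenizer and the sample documents are assumptions made for this example, and production engines add stemming, stop words, and inverted indexes on top.

```python
import math
import re

def tokenize(text):
    # Keep hyphenated identifiers such as "px-9000-v2" as single tokens.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in set(tokenize(query)):
            tf = tokens.count(term)  # occurrences of the term in this document
            if tf == 0:
                continue
            df = sum(1 for t in tokenized if term in t)  # documents containing the term
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # rare terms weigh more
            length_norm = 1 - b + b * len(tokens) / avg_len  # long documents are penalized
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "General guidance on error handling and configuration best practices.",
    "PX-9000-v2 configuration errors: causes and fixes.",
    "PX-9000-v1 release notes.",
]
scores = bm25_scores("PX-9000-v2", docs)
```

Only the second document gets a non-zero score here: the v1 release notes do not match at all, which is exactly the strictness an identifier lookup needs.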

Vector embeddings often fail here because they are trained on broad semantic relationships. In a high-dimensional vector space, "PX-9000-v1" and "PX-9000-v2" might be neighbors, or worse, "PX-9000" might be pulled toward a cluster of unrelated "9000" series products. Keyword search acts as a precision filter for these specific technical identifiers, version numbers, and the "long tail" of specialized vocabulary that embeddings often flatten into general categories.

Hybrid search is about merging the strengths of these retrieval methods into a single, ranked result set.

Merging the Results: The Reciprocal Rank Fusion (RRF)

An engineering challenge in hybrid search is the "apples to oranges" problem. Vector search gives you a distance score (e.g., cosine distance, which ranges from 0 to 2), where lower is better. Keyword search gives you a relevance score derived from term frequencies, which can be any non-negative number, where higher is better. The scales and directions are incompatible, so you cannot simply add them together.

To solve this, we can use Reciprocal Rank Fusion (RRF). RRF is an algorithm that ignores the raw scores entirely and focuses on the rankings. If chunk A is #1 in vector search and #5 in keyword search, RRF calculates a new score based on those positions. The formula is elegantly simple:

score(c) = Σ_{r ∈ R} 1 / (k + rank(c, r))

Where:

c is a chunk.
R is the set of rankings (e.g., vector results and keyword results), and r is one ranking in that set.
rank(c, r) is the position of chunk c in ranking r, starting at 1.
k is a constant (usually 60) that prevents a single top-ranked result from completely dominating the final score.
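The formula translates almost directly into code. Here is a minimal sketch, assuming each retriever hands back a list of chunk IDs ordered best-first; the function name and the sample IDs are illustrative only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs. Each list is ordered best-first; ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["A", "B", "C", "D", "E"]   # A is #1 in vector search
keyword_results = ["F", "B", "C", "G", "A"]  # A is #5 in keyword search
fused = reciprocal_rank_fusion([vector_results, keyword_results])
```

In this sample, chunk B (#2 in both lists) ends up ahead of chunk A (#1 and #5). A chunk that is missing from one list simply contributes nothing for that list, so it can still appear in the fused result.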

The Significance of 'k=60'

Why 60? It isn't a magic number, but it is a stable one. In the original RRF paper (Cormack, Clarke, and Büttcher, 2009), the authors fixed k=60 after a pilot investigation and reported that it worked well across a wide variety of datasets, with results that were not very sensitive to the exact value.

From a mathematical intuition standpoint, k acts as a "dampener." If k were 1, the gap between the 1st rank (1/2 = 0.5) and the 2nd rank (1/3 ≈ 0.33) would be massive (0.17). A document at #1 in vector search but only #4 in keyword search (0.5 + 0.2 = 0.7) would then beat a document at #2 in both (0.33 + 0.33 ≈ 0.67).

By setting k=60, the gap between rank 1 (1/61 ≈ 0.01639) and rank 2 (1/62 ≈ 0.01613) shrinks to about 0.00026, and the same comparison flips: the document at #2 in both lists (≈ 0.03226) now edges out the one at #1 and #4 (≈ 0.03202). This pushes the algorithm to prioritize consensus (documents that perform well across both retrieval methods) over documents that happen to peak in just one, and it provides the "smoothing" needed to fuse the different score distributions of keyword and vector search.
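A few lines of arithmetic show the dampening effect. This sketch compares a "spiky" chunk (#1 in one list, #10 in the other) with a "consensus" chunk (#3 in both) under k=1 and k=60; the rank pairs are chosen purely for illustration.

```python
def rrf_score(ranks, k):
    """RRF score for one chunk given its 1-based rank in each result list."""
    return sum(1.0 / (k + rank) for rank in ranks)

spiky = (1, 10)     # #1 in one list, #10 in the other
consensus = (3, 3)  # #3 in both lists

for k in (1, 60):
    print(k, round(rrf_score(spiky, k), 4), round(rrf_score(consensus, k), 4))
```

With k=1 the spiky chunk wins (about 0.59 against 0.50). With k=60 the consensus chunk wins (about 0.0317 against 0.0307).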

Implementation

Typically, implementing hybrid search means maintaining two separate databases — a keyword search engine like Elasticsearch or Solr for keyword retrieval, and a dedicated vector database like Pinecone or Weaviate for semantic search — then writing a custom orchestration layer to collect results from both, run the RRF math, and return a unified ranked list.
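A minimal sketch of such an orchestration layer might look like the following. The two `fake_*` functions are hypothetical stand-ins for real client calls to a vector database and a keyword engine, and the chunk IDs are invented; the retrievers run concurrently so that total latency is closer to the slower of the two than to their sum.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_search(query, retrievers, top_k=10, k=60):
    """Run every retriever concurrently, then fuse their rankings with RRF.

    `retrievers` is a list of callables taking (query, top_k) and returning
    chunk IDs ordered best-first.
    """
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        rankings = list(pool.map(lambda retrieve: retrieve(query, top_k), retrievers))
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical stand-ins for a vector database client and a keyword engine client.
def fake_vector_search(query, top_k):
    return ["general-errors", "config-best-practices", "px-9000-v2-errors"][:top_k]

def fake_keyword_search(query, top_k):
    return ["px-9000-v2-errors", "px-9000-v1-notes"][:top_k]

results = hybrid_search(
    "PX-9000-v2 configuration errors",
    [fake_vector_search, fake_keyword_search],
)
```

In this toy run, the chunk that vector search ranked third moves to the top because keyword search ranked it first.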

With relational databases that support both full-text and vector search natively (or through an extension), you can do all of this in a single SQL query using either window functions or CTEs. Here's an example with MariaDB:

MariaDB SQL

CREATE OR REPLACE TABLE phrases (
    content VARCHAR(200) UNIQUE,
    embedding VECTOR(1536),
    FULLTEXT KEY (content)
);

INSERT INTO phrases (content) VALUES
    ("I love a strong morning coffee."),
    ("A greeting card I received said 'morning pick-me-up' in neon letters."),
    ("Every morning, I start my day with cappuccino."),
    ("When time is short, a tiny jolt of caffeine is just what I need.");

-- calculate vector embeddings for each phrase before continuing

ALTER TABLE phrases MODIFY COLUMN embedding VECTOR(1536) NOT NULL;
ALTER TABLE phrases ADD VECTOR INDEX (embedding);

SET @search_term = "morning pick-me-up";
SET @search_term_vector = VEC_FromText("... search vector here ...");
SET @k = 60;

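-- Hybrid search (window functions)
-- Rank 1 must be the best match in each ordering:
-- smallest distance for vectors, highest relevance (DESC) for full-text
SELECT content, 1 / (@k + RANK() OVER (ORDER BY VEC_DISTANCE_COSINE(embedding, @search_term_vector)))
    + 1 / (@k + RANK() OVER (ORDER BY MATCH(content) AGAINST (@search_term) DESC)) AS rrf
FROM phrases
ORDER BY rrf DESC
LIMIT 2;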

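-- Hybrid search (CTE)
WITH vector_score AS (
    SELECT
        content,
        RANK() OVER (ORDER BY VEC_DISTANCE_COSINE(embedding, @search_term_vector)) AS relevance
    FROM phrases
    ORDER BY relevance  -- keep the 10 closest chunks, not 10 arbitrary rows
    LIMIT 10
),
fulltext_score AS (
    SELECT
        content,
        RANK() OVER (ORDER BY MATCH(content) AGAINST (@search_term) DESC) AS relevance
    FROM phrases
    ORDER BY relevance  -- keep the 10 best keyword matches
    LIMIT 10
)
-- UNION ALL + GROUP BY keeps chunks that appear in only one list;
-- an inner join would silently drop them
SELECT
    content,
    SUM(1 / (@k + relevance)) AS rrf
FROM (
    SELECT content, relevance FROM vector_score
    UNION ALL
    SELECT content, relevance FROM fulltext_score
) AS ranked
GROUP BY content
ORDER BY rrf DESC
LIMIT 2;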

Scaling these approaches can be a challenge in itself and involves architectural overhead. MariaDB Enterprise Platform includes a product called MariaDB AI RAG that simplifies all this by implementing a ready-to-use RAG service that handles the orchestration internally. So, instead of managing separate indices and merging logic, you interact with a REST service where the retrieval strategy is simply a configuration choice.

For example, when using the orchestration API to generate a response, you can switch to hybrid search with a single parameter:

HTTP

POST /orchestrate/generation
{
  "query": "How do I resolve PX-9000-v2 configuration errors?",
  "retrieval_method": "hybrid",
  "top_k": 10
}
  

Or, if you are building your own RAG logic and just need the raw retrieved documents:

HTTP

POST /hybrid_search
{
  "query": "PX-9000-v2",
  "top_k": 10
}
  

Behind the scenes, MariaDB AI RAG executes the vector similarity search across your embeddings and a full-text keyword search across the original text, applies the RRF calculation, and returns a single, optimized list. Check the docs for more information about MariaDB AI RAG.

Operational Impact: Accuracy vs. Latency

Software engineering always involves tradeoffs, and hybrid search is no exception. By running two search algorithms instead of one, you are adding compute overhead.

From an operational perspective, this means monitoring two metrics:

Retrieval Latency (p95): While keyword search is typically faster than vector search, the combined execution and RRF merge will always be slower than vector search alone. In testing, expect a 10% to 30% increase in retrieval time compared to pure vector search when hybrid search is properly implemented.
Hit Rate @ K: This is your primary success metric. For example, if hybrid search raises your Hit Rate @ 5 from 70% to 90%, the latency trade-off is almost certainly worth it.
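Both metrics are simple to compute from an evaluation run. Here is a minimal sketch, assuming you have a labeled set of queries with their known relevant chunks plus per-query retrieval timings; the function names and the sample data are illustrative.

```python
import math

def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k results contain at least one relevant chunk."""
    hits = sum(
        1
        for results, wanted in zip(retrieved, relevant)
        if any(chunk in wanted for chunk in results[:k])
    )
    return hits / len(retrieved)

def p95(latencies_ms):
    """95th-percentile latency using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    index = math.ceil(95 * len(ordered) / 100) - 1
    return ordered[index]

# One entry per evaluation query: what the retriever returned, and what it should have found.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = [{"b"}, {"x"}, {"i"}, {"j"}]

print(hit_rate_at_k(retrieved, relevant, k=2))
print(p95(list(range(1, 101))))
```

Running the same evaluation set through pure vector search and through hybrid search gives you the two numbers to weigh against each other.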

However, for most RAG applications, the bottleneck is rarely the retrieval phase—it is the LLM generation phase. Adding 20ms to a search query to ensure the LLM receives the correct context is a trade most architects will take every time. The cost of a "fast" but incorrect answer (or a hallucination caused by missing context) is far higher than the compute cost of a keyword index.

Conclusion: Start With Hybrid

If you are building RAG for anything more complex than a demo, don't wait for your users to find the gaps in your vector embeddings. Vector search is fantastic for intent, but it is "fuzzy" by design. Hybrid search provides the lexical safety net that enterprise applications require. Use algorithms like RRF, or even better, ready-to-use implementations like MariaDB AI RAG to build RAG applications that use high-quality retrieval systems. After all, retrieving the correct context is the key part of a RAG pipeline.

Opinions expressed by DZone contributors are their own.
