Essential Techniques for Production Vector Search Systems, Part 5: Reranking

Proven techniques for production vector search, including when to use each one, how to combine them effectively, and trade-offs to understand before deployment.

By 
Pavan Vemuri
Mar. 27, 26 · Analysis

After implementing vector search systems at multiple companies, I wanted to document efficient techniques that can be very helpful for successful production deployments.

I want to present these techniques by showcasing when to apply each one, how they complement each other, and the trade-offs they introduce. This will be a multi-part series that introduces all of the techniques one by one in each article. I have also included code snippets to quickly test each technique.

Before we get into the real details, let us look at the prerequisites and setup.

For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below.

Steps to Set Up Qdrant Cloud

Step 1: Get a Free Qdrant Cloud Cluster

  • Sign up at https://cloud.qdrant.io.
  • Create a free cluster
    • Click "Create Cluster."
    • Select Free Tier.
    • Choose a region closest to you.
    • Wait for the cluster to be provisioned.
  • Capture your credentials.
    • Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333.
    • API Key: Click "API Keys" → "Generate" → Copy the key.

Step 2: Install Python Dependencies

PowerShell
 
pip install qdrant-client fastembed numpy python-dotenv


Recommended versions:

  • qdrant-client >= 1.7.0
  • fastembed >= 0.2.0
  • numpy >= 1.24.0
  • python-dotenv >= 1.0.0

Step 3: Set Environment Variables or Create a .env File

Shell
 
# Add to your ~/.bashrc or ~/.zshrc
export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333"
export QDRANT_API_KEY="your-api-key-here"


Create a .env file in the project directory with the following content. Remember to add .env to your .gitignore to avoid committing credentials.

Plain Text
 
# .env file
QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333
QDRANT_API_KEY=your-api-key-here
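
Before connecting, it can help to confirm that both settings are actually visible to the Python process. The following is a minimal sketch, not part of the original setup: the helper name `missing_settings` is hypothetical, and it assumes the two variable names used above. If you rely on the .env file, call `load_dotenv()` first so the values are loaded into the environment.

```python
import os

# Names of the settings this article expects (taken from the .env example above).
REQUIRED_SETTINGS = ("QDRANT_URL", "QDRANT_API_KEY")


def missing_settings(env=None):
    """Return the required setting names that are absent or blank.

    `env` defaults to the process environment; passing a dict makes
    the check easy to try out without touching real variables.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Running this check before creating the client turns a vague connection error into a direct message about which variable is missing.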


Step 4: Verify Connection

We can verify the connection to the Qdrant cluster with the following script. From this point onward, I am assuming the .env setup is complete.

Python
 
from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Initialize client
client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY"),
)

# Test connection
try:
    collections = client.get_collections()
    print("Connected successfully!")
    print(f"Current collections: {len(collections.collections)}")
except Exception as e:
    print(f"Connection failed: {e}")
    print("Check that your .env file has QDRANT_URL and QDRANT_API_KEY")


Expected Output:

Plain Text
 
python verify-connection.py
Connected successfully!
Current collections: 2


Now that we have the setup out of the way, we can get into the meat of the article.

Before the deep dive into reranking, let us look at a high-level overview of the techniques we have covered so far, and also the last one we are covering in this part.

| Technique | Problem solved | Performance impact | Complexity |
|---|---|---|---|
| Hybrid Search | Pure semantic search misses exact matches. | Accuracy improvement of close to 16% | Medium |
| Binary Quantization | Memory costs scale linearly with data. | 40x memory reduction, 15% faster | Low |
| Filterable HNSW | Post-filtering wastes computation. | 5x faster filtered queries | Medium |
| Multi-Vector Search | A single embedding cannot capture the importance of different fields. | Handles queries across multiple fields, such as title vs. description; requires twice the storage | Medium |
| Reranking | Vector search is optimized for speed over precision. | Deeper semantic understanding, 15-20% ranking improvement | High |


Keep in mind that production systems typically combine two to four of these techniques.

For example, a typical e-commerce website might use hybrid search, binary quantization, and filterable HNSW.

We covered hybrid search in the first part of the series, binary quantization in the second part, filterable HNSW in the third part, and multi-vector search in the fourth part. In this final part, we will cover reranking.

Reranking

All the techniques we have covered so far focus on quickly identifying the right candidates. Reranking is about ensuring that the best result shows up in the number one position. These are two different concerns in search: how fast candidates are found and how well they are ordered. Since the first four parts dealt with the former, this article looks at the latter.

The core idea of reranking is to let a model read the query and the document together at query time, attending to every token in both. This does not scale to the whole corpus, but it is very effective at providing a deeper understanding of a shortlist of 20-100 candidates.

Let us look at the pattern we follow for retrieval in the case of reranking.

Stage 1: Quickly retrieve a generous set of candidates. This is primarily what we have seen in the first four parts of this series.

Stage 2: Rescore the shortlist with full query-document attention and return the true top-k, even if that means adding latency.
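
The two stages above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the article's implementation: token overlap stands in for both models, the function names (`cheap_score`, `full_score`, `two_stage_search`) are hypothetical, and the catalog descriptions are made up for the example. In a real system, Stage 1 would be a vector query and Stage 2 a cross-encoder.

```python
def tokens(text):
    """Lowercase whitespace tokenization; a toy stand-in for a real model."""
    return set(text.lower().split())


def cheap_score(query, doc):
    """Stage 1 stand-in: token overlap between the query and the title only."""
    return len(tokens(query) & tokens(doc["title"]))


def full_score(query, doc):
    """Stage 2 stand-in: token overlap against title plus description."""
    return len(tokens(query) & tokens(doc["title"] + " " + doc["description"]))


def two_stage_search(query, docs, stage1_limit, top_k,
                     stage1=cheap_score, stage2=full_score):
    # Stage 1: score every document cheaply and keep a shortlist.
    shortlist = sorted(docs, key=lambda d: stage1(query, d), reverse=True)[:stage1_limit]
    # Stage 2: rescore only the shortlist with the more expensive scorer.
    reranked = sorted(shortlist, key=lambda d: stage2(query, d), reverse=True)
    return shortlist, reranked[:top_k]


# Hypothetical toy catalog; the descriptions are invented for illustration.
CATALOG = [
    {"title": "Brake Hydraulic Pressure Sensor",
     "description": "Measures hydraulic line pressure."},
    {"title": "Wheel Speed Sensor",
     "description": "Feeds wheel speed to the ABS system for brake control."},
    {"title": "Knock Sensor",
     "description": "Detects engine knock."},
]

shortlist, top = two_stage_search(
    "brake sensor for ABS system", CATALOG, stage1_limit=3, top_k=1
)
print("Stage 1 top:", shortlist[0]["title"])
print("Reranked top:", top[0]["title"])
```

The point of the shape is that the expensive scorer only ever sees `stage1_limit` documents, so its cost is bounded by the shortlist size rather than the catalog size.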

High-Level Conceptual Flow Diagram for Reranking

Let us now look at the two-step process with the help of the following conceptual flow diagram.

[Figure: conceptual flow diagram of two-stage retrieval with reranking]


Let us now take a look at it in more detail with the code below.

Python
 
"""
Example: two-stage retrieval with cross-encoder reranking.

Stage 1 — Bi-encoder (vector): retrieve many candidates quickly (~tens of ms).
Stage 2 — Cross-encoder: rescore (query, document) pairs and take top 5 (~hundreds of ms).

This demo uses title-only Stage 1 when the collection has a `title` vector: that mimics a
common production pattern (cheap first pass on short fields) and often misorders results
vs the full product text — so reranking’s lift is visible (correct part moves to #1).
"""

from dotenv import load_dotenv

from multi_vector import (
    create_demo_collection,
    get_collection_vector_names,
    get_qdrant_client,
)
from rerank_search import (
    display_two_stage_result,
    get_cross_encoder,
    two_stage_retrieve,
    warm_bi_encoder,
)

load_dotenv()

client = get_qdrant_client()
EXISTING_COLLECTION_NAME = "automotive_parts"
DEMO_COLLECTION_NAME = "multi_vector_demo"

STAGE1_LIMIT = 80
STAGE2_TOP_K = 5
TYPICAL_S1_MS = 45.0
TYPICAL_S2_MS = 190.0

# --- Collection setup (same pattern as multi_vector_example.py) ---
print("=" * 80)
print("RERANKING DEMO — Collection setup")
print("=" * 80)

available_vectors = get_collection_vector_names(EXISTING_COLLECTION_NAME, client)

if available_vectors:
    print(f"Using '{EXISTING_COLLECTION_NAME}' with vectors: {available_vectors}")
    COLLECTION_NAME = EXISTING_COLLECTION_NAME
    vector_names = available_vectors[:2]
    if len(vector_names) == 1:
        vector_names = [vector_names[0], vector_names[0]]
    weights = {name: 1.0 / len(vector_names) for name in vector_names}
else:
    print(f"No named vectors in '{EXISTING_COLLECTION_NAME}'. Using demo collection.")
    if not create_demo_collection(DEMO_COLLECTION_NAME, client, force_recreate=False):
        print("Failed to create or open demo collection. Exiting.")
        exit(1)
    COLLECTION_NAME = DEMO_COLLECTION_NAME
    vector_names = ["title", "description"]
    weights = {"title": 0.6, "description": 0.4}

# Title-only Stage 1 when possible — makes rerank “rescue” visible on small catalogs
if "title" in vector_names:
    STAGE1_MODE = "title"
    STAGE1_VECTOR = "title"
else:
    STAGE1_MODE = "multi"
    STAGE1_VECTOR = None

print(f"Collection: {COLLECTION_NAME}")
print(f"Stage 1: up to {STAGE1_LIMIT} candidates using {STAGE1_MODE} (lean / fast first pass).")
print(f"Stage 2: cross-encoder rerank → top {STAGE2_TOP_K} (full title + description text).")
if STAGE1_MODE == "title":
    print(
        "Note: Title-only Stage 1 ignores description — e.g. 'Wheel Speed Sensor' loses\n"
        "      explicit ABS context; generic 'Brake … Sensor' titles win until rerank.\n"
    )
print()

# Warm models (first run downloads cross-encoder weights)
print("Loading bi-encoder and cross-encoder (first run may download models)...")
warm_bi_encoder()
ce = get_cross_encoder()
print("Cross-encoder ready (cross-encoder/ms-marco-MiniLM-L-6-v2).")
print()

# --- Scenario 1: ABS-specific intent ---
print("=" * 80)
print('SCENARIO 1: Complex query — "brake sensor for ABS system"')
print("=" * 80)
print(
    "Title-only retrieval matches literal 'brake' + 'sensor' in titles; the true ABS part\n"
    "is titled 'Wheel Speed Sensor' (ABS only in description). Reranking reads full text.\n"
)

target_abs = "ABS Wheel Speed Sensor"
r1 = two_stage_retrieve(
    COLLECTION_NAME,
    "brake sensor for ABS system",
    vector_names,
    weights,
    client,
    stage1_limit=STAGE1_LIMIT,
    stage2_top_k=STAGE2_TOP_K,
    cross_encoder=ce,
    stage1_mode=STAGE1_MODE,
    stage1_vector_name=STAGE1_VECTOR,
)
display_two_stage_result(
    r1,
    typical_stage1_ms=TYPICAL_S1_MS,
    typical_stage2_ms=TYPICAL_S2_MS,
    highlight_part_names=[target_abs],
)

# --- Scenario 2: Multiple requirements ---
print("=" * 80)
print(
    'SCENARIO 2: Multi-requirement query — '
    '"radar sensor for blind spot detection and lane change assistance"'
)
print("=" * 80)
print(
    "Title-only matches generic 'radar' / 'sensor' strings; the blind-spot + lane-change\n"
    "part uses a vague title ('Corner Mount Radar Module'). Reranking uses the description.\n"
)

target_radar = "Blind Spot & Lane Change Assist Radar"
r2 = two_stage_retrieve(
    COLLECTION_NAME,
    "radar sensor for blind spot detection and lane change assistance",
    vector_names,
    weights,
    client,
    stage1_limit=STAGE1_LIMIT,
    stage2_top_k=STAGE2_TOP_K,
    cross_encoder=ce,
    stage1_mode=STAGE1_MODE,
    stage1_vector_name=STAGE1_VECTOR,
)
display_two_stage_result(
    r2,
    typical_stage1_ms=TYPICAL_S1_MS,
    typical_stage2_ms=TYPICAL_S2_MS,
    highlight_part_names=[target_radar],
)

# --- Summary ---
print("=" * 80)
print("SUMMARY: Why rerank?")
print("=" * 80)
stage1_word = "title-only" if STAGE1_MODE == "title" else "multi-vector"
print(f"""
• Stage 1 (~{TYPICAL_S1_MS:.0f} ms typical): Fast {stage1_word} bi-encoder — cheap but blind to
  key details that live only in longer text (ABS, blind spot + lane change together, …).

• Stage 2 (~{TYPICAL_S2_MS:.0f} ms typical): Cross-encoder scores query + full passage — fixes
  ordering when Stage 1 is intentionally shallow (or when the index is approximate).

• In production you might use multi-vector or sparse+dense for Stage 1; this demo uses
  title-only on purpose so the lift from reranking is obvious in the output.

• Demo catalog: up to {STAGE1_LIMIT} Stage-1 hits; recreate after data changes:
    create_demo_collection('{DEMO_COLLECTION_NAME}', client, force_recreate=True)
""")


Now, let us look at it with the help of the output for reranking.

Plain Text
 
================================================================================
RERANKING DEMO — Collection setup
================================================================================
No named vectors in 'automotive_parts'. Using demo collection.
✓ Demo collection 'multi_vector_demo' exists with latest data (v2.2).
  Reusing existing collection.

Collection: multi_vector_demo
Stage 1: up to 80 candidates using title (lean / fast first pass).
Stage 2: cross-encoder rerank → top 5 (full title + description text).
Note: Title-only Stage 1 ignores description — e.g. 'Wheel Speed Sensor' loses
      explicit ABS context; generic 'Brake … Sensor' titles win until rerank.


Loading bi-encoder and cross-encoder (first run may download models)...
Cross-encoder ready (cross-encoder/ms-marco-MiniLM-L-6-v2).

================================================================================
SCENARIO 1: Complex query — "brake sensor for ABS system"
================================================================================
Title-only retrieval matches literal 'brake' + 'sensor' in titles; the true ABS part
is titled 'Wheel Speed Sensor' (ABS only in description). Reranking reads full text.

Query: "brake sensor for ABS system"
--------------------------------------------------------------------------------
Stage 1 setup: single-vector 'title' only (lean first stage)
Stage 1 (bi-encoder / vector): 213 ms  (typical ~45 ms — fast, approximate)
Stage 2 (cross-encoder rerank): 222 ms  (typical ~190 ms — slow, precise)
Total extra latency for rerank: ~222 ms on 28 candidates → top 5

Side-by-side — Vector top 3 vs Reranked top 3
--------------------------------------------------------------------------------
Vector (Stage 1)                       | Reranked (Stage 2)
---------------------------------------+---------------------------------------
1. Brake Hydraulic Pressure Sensor     | 1. ABS Wheel Speed Sensor
   (score 0.6584)                      |    (CE 8.8043, was #4)
2. Parking Brake Warning Switch        | 2. Parking Brake Warning Switch
   (score 0.6403)                      |    (CE 4.5170, was #2)
3. Brake Pad Wear Sensor               | 3. Brake Hydraulic Pressure Sensor
   (score 0.6195)                      |    (CE 3.7028, was #1)

Reranked top (original vector score vs cross-encoder score, position change)
--------------------------------------------------------------------------------
  1. ABS Wheel Speed Sensor
     Vector score: 0.4913  |  Rerank score: 8.8043
     Position: #4 → #1
  2. Parking Brake Warning Switch
     Vector score: 0.6403  |  Rerank score: 4.5170
     Position: #2 (unchanged)
  3. Brake Hydraulic Pressure Sensor
     Vector score: 0.6584  |  Rerank score: 3.7028
     Position: #1 → #3
  4. Brake Pad Wear Sensor
     Vector score: 0.6195  |  Rerank score: 0.0795
     Position: #3 → #4
  5. Knock Sensor
     Vector score: 0.3739  |  Rerank score: -4.4043
     Position: #8 → #5

Highlight — best answer vs Stage 1:
--------------------------------------------------------------------------------
  • "ABS Wheel Speed Sensor": Stage 1 rank #4 → Reranked #1 (+3 positions)
    — reranking recovered the correct part.

================================================================================
SCENARIO 2: Multi-requirement query — "radar sensor for blind spot detection
and lane change assistance"
================================================================================
Title-only matches generic 'radar' / 'sensor' strings; the blind-spot + lane-change
part uses a vague title ('Corner Mount Radar Module'). Reranking uses the description.

Query: "radar sensor for blind spot detection and lane change assistance"
--------------------------------------------------------------------------------
Stage 1 setup: single-vector 'title' only (lean first stage)
Stage 1 (bi-encoder / vector): 84 ms  (typical ~45 ms — fast, approximate)
Stage 2 (cross-encoder rerank): 42 ms  (typical ~190 ms — slow, precise)
Total extra latency for rerank: ~42 ms on 28 candidates → top 5

Side-by-side — Vector top 3 vs Reranked top 3
--------------------------------------------------------------------------------
Vector (Stage 1)                       | Reranked (Stage 2)
---------------------------------------+---------------------------------------
1. Long-Range Highway Radar            | 1. Blind Spot & Lane Change Assist Radar
   (score 0.5498)                      |    (CE 9.1630, was #4)
2. Rear Cross Traffic Sensor           | 2. Rear Cross Traffic Sensor
   (score 0.5480)                      |    (CE 7.5298, was #2)
3. Forward Radar Module                | 3. Forward Radar Module
   (score 0.4873)                      |    (CE -1.0884, was #3)

Reranked top (original vector score vs cross-encoder score, position change)
--------------------------------------------------------------------------------
  1. Blind Spot & Lane Change Assist Radar
     Vector score: 0.4303  |  Rerank score: 9.1630
     Position: #4 → #1
  2. Rear Cross Traffic Sensor
     Vector score: 0.5480  |  Rerank score: 7.5298
     Position: #2 (unchanged)
  3. Forward Radar Module
     Vector score: 0.4873  |  Rerank score: -1.0884
     Position: #3 (unchanged)
  4. Long-Range Highway Radar
     Vector score: 0.5498  |  Rerank score: -2.3879
     Position: #1 → #4
  5. Parking Brake Warning Switch
     Vector score: 0.3345  |  Rerank score: -7.5928
     Position: #6 → #5

Highlight — best answer vs Stage 1:
--------------------------------------------------------------------------------
  • "Blind Spot & Lane Change Assist Radar": Stage 1 rank #4 → Reranked #1
    (+3 positions) — reranking recovered the correct part.

================================================================================
SUMMARY: Why rerank?
================================================================================

- Stage 1 (~45 ms typical): Fast title-only bi-encoder — cheap but blind to
  key details that live only in longer text (ABS, blind spot + lane change together, …).

- Stage 2 (~190 ms typical): Cross-encoder scores query + full passage — fixes
  ordering when Stage 1 is intentionally shallow (or when the index is approximate).

- In production you might use multi-vector or sparse+dense for Stage 1; this demo uses
  title-only on purpose so the lift from reranking is obvious in the output.

- Demo catalog: up to 80 Stage-1 hits; recreate after data changes:
    create_demo_collection('multi_vector_demo', client, force_recreate=True)


Benefits

As you can see from the results, reranking delivers meaningful improvements in result quality. In both scenarios, the correct result was at position #4 and moved to position #1 after reranking. The results also show that, when given the full query-document context, the reranker confidently identified the right answer.

We also saw that reranking handles multi-requirement queries very well. For the query "radar sensor for blind spot detection and lane change assistance," a part must meet two distinct functional requirements to satisfy the query. Reranking addressed both requirements and correctly surfaced the right part.

Reranking works on top of all the techniques we have learnt so far, and hence it is an enabler and not a replacement.

Costs

Benefits come with costs, and reranking has trade-offs that need to be taken into account.

  • The first and most obvious trade-off is latency. In the results above, reranking added 222 ms in the first scenario and 42 ms in the second, against a typical figure of ~190 ms. This might be acceptable for a web search UX, but it might be an issue for real-time systems requiring strict response times.
  • Inference cost grows with the number of candidates passed to the reranker, since each query-candidate pair is scored separately.
  • Unless you are using a product like Qdrant, you take on the added complexity of running two models, one for indexing and the other for reranking.

When to Use

  • When the ordering of results matters more than the raw results fetched.
  • When the users of the application write complex, multi-requirement queries.
  • When the catalog contains items with informative descriptions that cannot be retrieved by title alone.
  • When the improvement that reranking provides is worth the additional latency it adds.

When NOT to Use

  • When latency is critical and cannot be compromised.
  • When the system is a simple lookup for which keyword search is sufficient.
  • When the catalog is small and the first stage alone already delivers the required search quality.

Efficiency Comparison (From the Results)

Let us quickly compare the efficiency based on the results.

| Query | Stage 1 top result | Reranked top result | Position change |
|---|---|---|---|
| brake sensor for ABS system | Brake Hydraulic Pressure Sensor | ABS Wheel Speed Sensor | #4 → #1 |
| radar sensor for blind spot detection and lane change assistance | Long-Range Highway Radar | Blind Spot & Lane Change Assist Radar | #4 → #1 |
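
The position changes can be recomputed from the scores printed in the Scenario 1 output. The sketch below is illustrative only: the helper names `rank_by` and `position_changes` are hypothetical, and it ranks just the five rows shown in the reranked list. Because of that, Knock Sensor comes out at #5 here, whereas it was #8 in the full 28-candidate Stage 1 list.

```python
# Scores copied from the Scenario 1 output: (name, vector score, cross-encoder score).
RESULTS = [
    ("ABS Wheel Speed Sensor", 0.4913, 8.8043),
    ("Parking Brake Warning Switch", 0.6403, 4.5170),
    ("Brake Hydraulic Pressure Sensor", 0.6584, 3.7028),
    ("Brake Pad Wear Sensor", 0.6195, 0.0795),
    ("Knock Sensor", 0.3739, -4.4043),
]


def rank_by(rows, column):
    """Map each name to its 1-based rank when sorted by a score column, highest first."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return {row[0]: position for position, row in enumerate(ordered, start=1)}


def position_changes(rows):
    """Return {name: (rank by vector score, rank by cross-encoder score)}."""
    before = rank_by(rows, 1)
    after = rank_by(rows, 2)
    return {name: (before[name], after[name]) for name in before}


changes = position_changes(RESULTS)
for name, (old, new) in sorted(changes.items(), key=lambda item: item[1][1]):
    note = "unchanged" if old == new else f"#{old} -> #{new}"
    print(f"{new}. {name} ({note})")
```

Tracking the before and after rank per item, rather than only the final order, makes it easy to log how often reranking actually changes the top result.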


Performance Characteristics

Based on the results, the performance characteristics are as follows:

| Metric | Stage 1 only | With reranking | Evidence from the data |
|---|---|---|---|
| Top-1 accuracy | Incorrect | Correct | The correct part was recovered in both scenarios |
| Multi-requirement handling | Poor | Excellent | The blind spot and lane change query was correctly resolved |
| Compute per query | Low | High | Reranking scores every shortlisted candidate |
| Scalability | Constant with catalog size | Dependent on Stage 1 limit | Keep the Stage 1 limit low for manageable latency |
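
To make the advice about the Stage 1 limit concrete, here is a rough back-of-the-envelope sketch based on the two measured runs above (222 ms and 42 ms, each on 28 candidates). It assumes rerank time grows linearly with the number of candidates, which is a simplification, and the 100 ms budget is a hypothetical figure chosen for illustration. The function names are also hypothetical.

```python
def per_candidate_ms(stage2_ms, candidates):
    """Average rerank cost per candidate, assuming linear scaling."""
    return stage2_ms / candidates


def max_candidates(budget_ms, cost_per_candidate_ms):
    """Largest shortlist that fits in the rerank latency budget."""
    return int(budget_ms // cost_per_candidate_ms)


# Measured Stage 2 timings from the demo output, both on 28 candidates.
slow = per_candidate_ms(222, 28)  # Scenario 1
fast = per_candidate_ms(42, 28)   # Scenario 2

BUDGET_MS = 100  # hypothetical rerank latency budget

print(f"Scenario 1: {slow:.2f} ms per candidate, limit {max_candidates(BUDGET_MS, slow)}")
print(f"Scenario 2: {fast:.2f} ms per candidate, limit {max_candidates(BUDGET_MS, fast)}")
```

The two runs give very different per-candidate costs, so it is worth measuring on your own hardware over many queries before fixing a Stage 1 limit.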


Conclusion

Reranking ties everything together. In a production system, the techniques discussed in the first four parts of the series retrieve the candidates, and reranking then puts them in the right order. The two stages solve fundamentally different problems.

In both scenarios, reranking placed the right candidate first by a clear margin: a cross-encoder score of 8.80 versus 4.52 for the runner-up in the first scenario, and 9.16 versus 7.53 in the second. This is because the reranker read the full description and matched it against what the query was asking for.

Recap of the Series

Over the five parts of the series, we have discussed the essential elements of the production vector search toolkit. Let us put together how the five techniques fare in a production system.

| Layer | Technique | Primary purpose |
|---|---|---|
| Storage | Binary Quantization | Compress vectors and, in turn, reduce RAM usage |
| Index | Filterable HNSW | Apply filters during graph traversal rather than after it |
| Retrieval | Hybrid Search | Combine semantic search with keyword matching |
| Scoring | Multi-Vector Search | Weigh different fields, such as title and description, independently |
| Ranking | Reranking | Rescore the shortlist for final ordering |


As stated multiple times throughout the series, a mature production system does not need all five; these can be used as needed. If memory is a concern, use binary quantization. If pure semantic search is missing exact matches, use hybrid search. If filter selectivity is high, use filterable HNSW. When there are multiple fields that need to be weighted independently, use multi-vector search. Apply reranking at the end only if result ordering is a major driver for the business.

Build incrementally, measure everything, and only add complexity when the metrics justify it.
