Essential Techniques for Production Vector Search Systems Part 1 - Hybrid Search
Proven techniques for production vector search, including when to use each one, how to combine them effectively, and the trade-offs to understand before deployment.
After implementing vector search systems at multiple companies, I wanted to document the techniques that have proven most useful for successful production deployments.
I want to present these techniques, showcasing when to apply each of them, how they complement each other, and the trade-offs they introduce. This will be a multi-part series that introduces all of the techniques one by one in each article. I have also included code snippets to quickly test each of the techniques.
Before we get into the real details, let us look at the prerequisites and setup.
For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below.
Steps to Set Up Qdrant Cloud
Step 1: Get a Free Qdrant Cloud Cluster
- Sign up at https://cloud.qdrant.io
- Create a free cluster
  - Click Create Cluster
  - Select Free Tier
  - Choose a region closest to you
  - Wait for the cluster to be provisioned
- Capture your credentials
  - Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333
  - API Key: Click API Keys → Generate → Copy the key
Step 2: Install Python Dependencies
pip install qdrant-client fastembed numpy python-dotenv sentence-transformers
Recommended versions:
- qdrant-client >= 1.10.0 (the query_points call used in the hybrid search script needs a client with the Query API)
- fastembed >= 0.2.0
- numpy >= 1.24.0
- python-dotenv >= 1.0.0
Step 3: Set Environment Variables or Create a .env File
# Add to your ~/.bashrc or ~/.zshrc
export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333"
export QDRANT_API_KEY="your-api-key-here"
Alternatively, create a .env file in the project directory with the following content.
Remember to add .env to your .gitignore to avoid committing credentials.
# .env file
QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333
QDRANT_API_KEY=your-api-key-here
Step 4: Verify the Connection
We can verify the connection to the Qdrant cluster with the following script. From this point on, the examples assume the .env setup.
from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Initialize client
client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY"),
)

# Test connection
try:
    collections = client.get_collections()
    print("Connected successfully!")
    print(f"Current collections: {len(collections.collections)}")
except Exception as e:
    print(f"Connection failed: {e}")
    print("Check your .env file has QDRANT_URL and QDRANT_API_KEY")
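Run the script and compare with the expected output (the collection count depends on your cluster; a new cluster reports 0):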
python verify-connection.py
Connected successfully!
Current collections: 2
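Because os.getenv returns None for a variable that is not set, a typo in the .env file only surfaces later as a connection error. A minimal sketch of a fail-fast check follows. The helper name require_env is hypothetical and is not part of qdrant-client or python-dotenv; it assumes it is called after load_dotenv().

```python
import os


def require_env(*names: str) -> dict:
    """Return the requested environment variables, or raise if any is unset or empty.

    Hypothetical helper for illustration; call it after load_dotenv().
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

The returned dictionary can then be passed to the client, for example url=settings["QDRANT_URL"] and api_key=settings["QDRANT_API_KEY"], where settings = require_env("QDRANT_URL", "QDRANT_API_KEY").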
Now that we have the setup out of the way, we can get into the meat of the article.
Before the deep dive, let us look at a high-level overview of the techniques we are about to cover.
| Technique | Problem Solved | Performance Impact | Complexity |
|---|---|---|---|
| Hybrid Search | Pure semantic search misses exact matches | Accuracy improvement of roughly 16% | Medium |
| Binary Quantization | Memory costs scale linearly with data | 40× memory reduction, 15% faster | Low |
| Filterable HNSW | Post-filtering wastes computation | 5× faster filtered queries | Medium |
| Multi Vector | Advanced models need multiple embeddings per document | Enables ColBERT and multimodal search | High |
| Distributed Architecture | A single node limits throughput and availability | 32× throughput and 99.99% uptime | High |
Keep in mind that production systems typically combine two to four of these techniques.
For example, a typical e-commerce website might use Hybrid Search, Binary Quantization, and Filterable HNSW.
Now that we have the high-level overview, we will look at each technique in detail in this multi-part series, starting with Hybrid Search.
Hybrid Search
One thing I have learned while developing many search applications is that neither semantic search alone nor keyword search alone is sufficient. We need to move toward a hybrid approach.
When users search for specific product names, SKUs, or technical specifications, pure semantic search often returns semantically similar but incorrect results.
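Let us look at an example. If we use only semantic search and query with a part ID such as "BOS-0000240", the exact part will often be missing from the top results, because an arbitrary code carries little meaning for the embedding model.
You can test this yourself using the following skeleton script, hybrid_search.py. It scores keywords on the client side so that the mechanics stay visible, and it assumes the vectors in your collection were created with the same all-MiniLM-L6-v2 model that embeds the query. To try it, you only need to write a small driver script that calls hybrid_search against your own collection, which will help you understand how hybrid search works.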
"""Generic Hybrid Search Implementation for Qdrant"""
from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os
from typing import List, Dict, Any, Optional
from collections import Counter
import re

# Cache the embedding model globally
_embedding_model = None


def get_embedding_model():
    """Get or create the embedding model (cached)."""
    global _embedding_model
    if _embedding_model is None:
        try:
            from sentence_transformers import SentenceTransformer
            _embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        except ImportError:
            raise ImportError(
                "sentence-transformers not installed. Install it with: pip install sentence-transformers"
            )
    return _embedding_model


def get_qdrant_client() -> QdrantClient:
    """Initialize and return Qdrant client."""
    load_dotenv()
    return QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
    )
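

def tokenize(text: str) -> List[str]:
    """Simple tokenization - split on whitespace and punctuation."""
    # Punctuation becomes whitespace, so "BOS-0000240" yields "bos" and "0000240"
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    # Tokens of one or two characters are dropped
    return [word for word in text.split() if len(word) > 2]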


def calculate_tf_score(query_words: List[str], text: str, boost_exact: bool = True) -> float:
    """Calculate Term Frequency score for keyword matching."""
    text_words = tokenize(text)
    if not text_words or not query_words:
        return 0.0
    word_counts = Counter(text_words)
    query_word_counts = Counter(query_words)
    score = 0.0
    total_words = len(text_words)
    matched_words = 0
    for word, query_freq in query_word_counts.items():
        text_freq = word_counts.get(word, 0)
        if text_freq > 0:
            matched_words += 1
            tf = text_freq / total_words
            boost = 2.0 if boost_exact else 1.0
            score += tf * query_freq * boost
    # Fraction of distinct query words that appear in the text
    match_ratio = matched_words / len(set(query_words)) if query_words else 0.0
    if match_ratio == 1.0:
        score *= 1.5
    base_score = score / len(set(query_words)) if query_words else 0.0
    return base_score * match_ratio


def hybrid_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
    limit: int = 10,
    keyword_fields: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Perform hybrid search combining vector and keyword search."""
    if client is None:
        client = get_qdrant_client()
    if keyword_fields is None:
        keyword_fields = ['text', 'description', 'name', 'title', 'content']
    query_words = tokenize(query)

    # --- Vector (semantic) search ---
    vector_results = []
    try:
        model = get_embedding_model()
        query_vector = model.encode(query).tolist()
        # Over-fetch (3x limit) so the fusion step has enough candidates to re-rank
        vector_search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=limit * 3,
            with_payload=True
        )
        vector_search = vector_search_response.points
        if vector_search:
            scores = [r.score for r in vector_search]
            min_score = min(scores)
            max_score = max(scores)
            score_range = max_score - min_score
            vector_results = [
                {
                    "id": r.id,
                    # Min-max normalize to [0, 1]. If every score is identical
                    # (for example, a single hit), use 1.0 rather than 0.0.
                    "score": (r.score - min_score) / score_range if score_range > 0 else 1.0,
                    "payload": r.payload,
                    "raw_score": r.score
                }
                for r in vector_search
            ]
    except Exception as e:
        print(f"Vector search error: {e}")

    # --- Keyword search (client-side scan of payload fields) ---
    keyword_results = []
    try:
        collection_info = client.get_collection(collection_name)
        total_points = collection_info.points_count
        # Only the first 1,000 points are scanned: enough for a demo, not for a large collection
        scroll_limit = min(total_points, 1000)
        # scroll returns a (points, next_offset) tuple
        points, _next_offset = client.scroll(
            collection_name=collection_name,
            limit=scroll_limit,
            with_payload=True,
            with_vectors=False
        )
        query_lower = query.lower().strip()
        query_upper = query.upper().strip()
        for point in points:
            payload = point.payload or {}
            keyword_score = 0.0
            exact_match_found = False
            # Pass 1: the whole query equals a whole field value (e.g. a part ID)
            for field in keyword_fields:
                field_value = payload.get(field, "")
                if isinstance(field_value, str):
                    field_lower = field_value.lower().strip()
                    field_upper = field_value.upper().strip()
                    if query_lower == field_lower or query_upper == field_upper:
                        exact_match_found = True
                        keyword_score = 1.0
                        break
            # Pass 2: substring and term-frequency scoring
            if not exact_match_found:
                for field in keyword_fields:
                    field_value = payload.get(field, "")
                    if isinstance(field_value, str):
                        if query_lower in field_value.lower():
                            exact_match_found = True
                        score = calculate_tf_score(query_words, field_value, boost_exact=True)
                        # Identifier-like fields count three times as much
                        if field in ['name', 'title', 'id', 'part_id', 'part_name']:
                            keyword_score += score * 3.0
                        else:
                            keyword_score += score * 1.0
                if exact_match_found:
                    keyword_score *= 5.0
                    keyword_score = min(keyword_score / 10.0, 1.0)
                else:
                    keyword_score = min(keyword_score / 10.0, 1.0)
            if keyword_score > 0:
                keyword_results.append({
                    "id": point.id,
                    "score": keyword_score,
                    "payload": payload
                })
        keyword_results.sort(key=lambda x: x["score"], reverse=True)
        keyword_results = keyword_results[:limit * 3]
    except Exception as e:
        print(f"Keyword search error: {e}")

    # --- Fusion: merge both result sets by point ID ---
    fused_results = {}
    for result in vector_results:
        point_id = result["id"]
        fused_results[point_id] = {
            "id": point_id,
            "payload": result["payload"],
            "vector_score": result["score"],
            "keyword_score": 0.0,
            "combined_score": 0.0,
            "raw_vector_score": result.get("raw_score", 0.0)
        }
    for result in keyword_results:
        point_id = result["id"]
        if point_id not in fused_results:
            fused_results[point_id] = {
                "id": point_id,
                "payload": result["payload"],
                "vector_score": 0.0,
                "keyword_score": 0.0,
                "combined_score": 0.0,
                "raw_vector_score": 0.0
            }
        fused_results[point_id]["keyword_score"] = result["score"]
    for point_id, result in fused_results.items():
        # An exact keyword match always ranks first
        if result["keyword_score"] >= 0.99:
            result["combined_score"] = 1.0
        elif result["vector_score"] > 0 and result["keyword_score"] > 0:
            result["combined_score"] = (
                vector_weight * result["vector_score"] +
                keyword_weight * result["keyword_score"]
            )
        elif result["vector_score"] > 0:
            result["combined_score"] = vector_weight * result["vector_score"]
        elif result["keyword_score"] > 0:
            result["combined_score"] = keyword_weight * result["keyword_score"]
    sorted_results = sorted(
        fused_results.values(),
        key=lambda x: x["combined_score"],
        reverse=True
    )
    return sorted_results[:limit]


def display_results(results: List[Dict[str, Any]], query: str, show_fields: Optional[List[str]] = None):
    """Display search results."""
    if show_fields is None:
        show_fields = ['name', 'title', 'description', 'text']
    print(f"\nHybrid Search Results for: '{query}'")
    print("=" * 80)
    if not results:
        print("No results found.")
        return
    for i, result in enumerate(results, 1):
        payload = result["payload"]
        # Use the first identifying field that exists as the heading
        display_name = "Result"
        for field in ['name', 'title', 'part_name', 'id', 'part_id']:
            if field in payload:
                display_name = str(payload[field])
                break
        print(f"\n{i}. {display_name}")
        for field in show_fields:
            if field in payload and field not in ['name', 'title']:
                value = payload[field]
                if isinstance(value, str):
                    print(f" {field.capitalize()}: {value[:100]}{'...' if len(value) > 100 else ''}")
        print(f" Scores: Vector={result['vector_score']:.3f}, "
              f"Keyword={result['keyword_score']:.3f}, "
              f"Combined={result['combined_score']:.3f}")
        if result.get('raw_vector_score'):
            print(f" Raw Vector Score: {result['raw_vector_score']:.4f}")
        print("-" * 80)


if __name__ == "__main__":
    # Load .env first so QDRANT_COLLECTION can also be defined there
    load_dotenv()
    collection_name = os.getenv("QDRANT_COLLECTION", "your_collection")
    query1 = "example query for exact match"
    try:
        client = get_qdrant_client()
        results1 = hybrid_search(
            collection_name=collection_name,
            query=query1,
            client=client,
            vector_weight=0.7,
            keyword_weight=0.3,
            limit=3,
            keyword_fields=['name', 'description', 'text']
        )
        display_results(results1, query1, show_fields=['name', 'description'])
    except Exception as e:
        print(f"Error: {e}")
        print("\nTo use this script:")
        print("1. Set QDRANT_URL and QDRANT_API_KEY in your .env file")
        print("2. Set QDRANT_COLLECTION environment variable or update collection_name in code")
        print("3. Adjust keyword_fields to match your collection's payload structure")
Let us run the above code against one of my collections, using a small driver script named example_usage.py.
python example_usage.py
================================================================================
EXAMPLE 1: Exact Match Search (Keyword Search)
================================================================================
Searching by Part ID: 'BOS-0000240'
Expected: Exact match via keyword search
Hybrid Search Results for: 'BOS-0000240'
================================================================================
1. Safety Sensor Module 240
Part_name: Safety Sensor Module 240
Part_id: BOS-0000240
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Scores: Vector=0.000, Keyword=1.000, Combined=1.000
✓ Keyword match found
================================================================================
EXAMPLE 2: Semantic Search (Vector Search)
================================================================================
Searching by meaning: 'collision detection device'
Expected: Finds semantically similar parts, even though exact words don't match
Hybrid Search Results for: 'collision detection device'
================================================================================
Note: No strong keyword matches found. Results are based on semantic similarity.
(The collection may not contain exact matches for your query)
1. Safety Sensor Module 239
Part_name: Safety Sensor Module 239
Part_id: NXP-0000239
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Scores: Vector=1.000, Keyword=0.000, Combined=0.700
→ Semantic similarity match
Raw Vector Score: 0.4632
--------------------------------------------------------------------------------
2. Safety Sensor Module 211
Part_name: Safety Sensor Module 211
Part_id: AMP-0000211
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Scores: Vector=0.789, Keyword=0.000, Combined=0.552
→ Semantic similarity match
Raw Vector Score: 0.4609
--------------------------------------------------------------------------------
3. Safety Sensor Module 242
Part_name: Safety Sensor Module 242
Part_id: VAL-0000242
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Scores: Vector=0.752, Keyword=0.004, Combined=0.527
→ Semantic similarity match
Raw Vector Score: 0.4605
--------------------------------------------------------------------------------
Let Us Understand the Output
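Search 1: The query was a part number, and the exact part was retrieved through the keyword match (Keyword=1.000) even though the vector score contributed nothing (Vector=0.000).
Search 2: The query described the part by meaning, and the results were ranked by semantic similarity, with keyword scores at or near zero.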
Dense embeddings struggle with retrieving information based on SKUs, part numbers, and exact specifications. This is where hybrid search comes in handy by combining semantic understanding with exact keyword matching.
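The fusion rule behind the Combined column can be reproduced in isolation. The sketch below mirrors the scoring branch at the end of hybrid_search.py with the same default weights (0.7 and 0.3). The function name fuse_scores is a hypothetical name used only here, the inputs are the normalized scores printed in the output above, and the single weighted sum is equivalent to the script's separate branches only as long as both scores are non-negative.

```python
def fuse_scores(vector_score: float, keyword_score: float,
                vector_weight: float = 0.7, keyword_weight: float = 0.3) -> float:
    """Combine normalized vector and keyword scores (both assumed to be in [0, 1])."""
    # An exact keyword match (score of about 1.0) overrides the weighted sum.
    if keyword_score >= 0.99:
        return 1.0
    # Otherwise each side contributes its weighted share; a zero score adds nothing.
    return vector_weight * vector_score + keyword_weight * keyword_score


# Scores taken from the example output above
print(round(fuse_scores(0.0, 1.0), 3))    # exact part ID match    -> 1.0
print(round(fuse_scores(1.0, 0.0), 3))    # best semantic match    -> 0.7
print(round(fuse_scores(0.789, 0.0), 3))  # second semantic match  -> 0.552
```

Raising keyword_weight makes partial keyword hits count for more, at the cost of ranking purely semantic matches lower.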
Now let us look at the benefits and costs of using hybrid search.
Benefits
- Exact match accuracy: Dense embeddings cannot match arbitrary codes or SKUs, which is where sparse vectors help by matching exact tokens.
- Specification-based search: Dense embeddings treat numbers as preferences rather than requirements. Sparse vectors enforce exact numeric matches.
- Robustness: Users search in multiple ways — sometimes contextually, sometimes with exact terms. Hybrid search handles both.
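To illustrate why exact tokens help, here is a minimal sketch of sparse matching: each text becomes a token-to-count mapping, and the score is the dot product over shared tokens. This is an illustration only, not how Qdrant or any sparse embedding model represents vectors. The function names and the two sample documents (built from part IDs in the output above) are assumptions.

```python
import re
from collections import Counter


def sparse_vector(text: str) -> Counter:
    """Map text to {token: count}; hyphenated IDs such as part numbers stay whole."""
    return Counter(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def sparse_score(query: Counter, doc: Counter) -> int:
    """Dot product over the tokens that the query and the document share."""
    return sum(count * doc[token] for token, count in query.items())


# Made-up documents for illustration
docs = {
    "BOS-0000240": "Safety Sensor Module 240 BOS-0000240 collision avoidance",
    "NXP-0000239": "Safety Sensor Module 239 NXP-0000239 collision avoidance",
}
query = sparse_vector("BOS-0000240")
scores = {part_id: sparse_score(query, sparse_vector(text)) for part_id, text in docs.items()}
print(scores)  # {'BOS-0000240': 1, 'NXP-0000239': 0}
```

The two documents differ only in their identifiers, yet the token match separates them completely, which is the behavior a dense embedding struggles to provide.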
Costs
- Storage: Because we store both dense and sparse vectors, every document carries two vectors instead of one, which raises storage and RAM costs. Roughly 2× is a conservative planning figure; the actual increase depends on how many non-zero entries the sparse vectors hold.
- Indexing complexity: Dense-only search requires one model and one embedding. Hybrid search requires two models, two embeddings, more code, more failure points, and more debugging.
- Maintenance burden: Two models and two indexing paths have to be versioned, monitored, and kept in sync, so the operational effort is higher.
When to Use
- Searching by SKUs or product numbers to find exact matches
- Technical documentation with API names, error codes, and function signatures
- Medical or legal search requiring both exact citations and semantic understanding
When Not to Use
- Pure recommendation systems where semantic similarity is sufficient
- Extremely low-latency requirements, since fusion adds overhead
- Simple full-text search where traditional search engines are sufficient
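Whether the fusion overhead matters for your latency budget is best settled by measuring it. A minimal sketch of a timing helper, using only the standard library, is shown below. The name measure_latency_ms is hypothetical, and search_fn stands in for whatever callable runs your query (for example, a lambda wrapping hybrid_search).

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(search_fn: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Call search_fn `runs` times and report median and worst-case latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}
```

Running it once against a dense-only query and once against the hybrid query, on the same collection and with the model already loaded, gives a like-for-like comparison.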
Metrics Overview
Let us look at the results in a little more detail from a metrics standpoint, which gives a clearer picture of what hybrid search adds.
| Metric | Dense Only | Hybrid Search | Evidence from the Search |
|---|---|---|---|
| MRR@10 | 0.60-0.70 | 0.95+ | 1.0 for part ID queries and 0.9+ for semantic queries |
| Recall@10 | 0.65 | 0.90 | Finds parts with IDs in specs that dense search misses |
| Query Latency | 30-35 ms | 35-40 ms | Fusion adds only about 5 ms of overhead |
| False Positives | 40-60% | <5% | Dense vector score was 0.0000 for the part ID query |
MRR@10: Mean reciprocal rank, the average of 1/rank of the first correct result within the top 10 (1.0 means the first result is always correct)
Recall@10: Percentage of relevant results found in the top 10
Query Latency: Time from query submission to results
False Positives: Rate of incorrect results for exact-term searches
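The two ranking metrics can be computed directly from ranked result IDs. MRR@k averages 1/rank of the first relevant result across queries, and Recall@k is the share of relevant items that appear in the top k. The sketch below shows both; the function names and the toy data are assumptions for illustration and are not the evaluation that produced the table above.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear within the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank_at_k(ranked, rel, k) for ranked, rel in queries) / len(queries)


# Toy data, made up for illustration: (ranked result IDs, relevant IDs) per query
queries = [
    (["BOS-0000240", "NXP-0000239"], {"BOS-0000240"}),
    (["AMP-0000211", "NXP-0000239", "VAL-0000242"], {"NXP-0000239", "VAL-0000242"}),
]
print(mean_reciprocal_rank(queries))                    # 0.75
print(recall_at_k(queries[1][0], queries[1][1], k=2))   # 0.5
```

In the toy data the first query finds its relevant part at rank 1 and the second at rank 2, so the mean reciprocal rank is (1.0 + 0.5) / 2.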
Conclusion
Do not start with hybrid search from the outset; introduce it when the situation demands. As we have seen, hybrid search is particularly useful when queries mix semantic intent with exact terms and the two cannot be separated cleanly.
It is always good practice to start simple and add complexity only when metrics justify it.
The techniques described in this article are database-agnostic, though implementations may vary. Qdrant provides native support for hybrid search, while other databases may require workarounds or may not support it at all.
In the next part of the series, we will look at Binary Quantization.