Essential Techniques for Production Vector Search Systems Part 2 - Binary Quantization
Proven techniques for production vector search, including when to use each one, how to combine them effectively, and the trade-offs to understand before deployment.
After implementing vector search systems at multiple companies, I wanted to document efficient techniques that can be very helpful for successful production deployments of vector search systems.
I want to present these techniques by showcasing when to apply each one, how they complement each other, and the trade-offs they introduce. This will be a multi-part series that introduces all of the techniques one by one in each article. I have also included code snippets to quickly test each technique.
Before we get into the real details, let us look at the prerequisites and setup.
For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below.
Steps to Set Up Qdrant Cloud
Step 1: Get a Free Qdrant Cloud Cluster
- Sign up at https://cloud.qdrant.io
- Create a free cluster
  - Click "Create Cluster"
  - Select Free Tier
  - Choose a region closest to you
  - Wait for the cluster to be provisioned
- Capture your credentials
  - Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333
  - API Key: Click "API Keys" → "Generate" → Copy the key
Step 2: Install Python Dependencies
pip install qdrant-client fastembed numpy python-dotenv sentence-transformers
Recommended versions
- qdrant-client >= 1.10.0 (the scripts below call query_points, which was added in 1.10)
- fastembed >= 0.2.0
- numpy >= 1.24.0
- python-dotenv >= 1.0.0
Step 3: Set Environment Variables or Create a .env File
# Add to your ~/.bashrc or ~/.zshrc
export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333"
export QDRANT_API_KEY="your-api-key-here"
Alternatively, create a .env file in the project directory with the following content.
Remember to add .env to your .gitignore to avoid committing credentials.
# .env file
QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333
QDRANT_API_KEY=your-api-key-here
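Before connecting, it can help to fail fast when a setting is missing. The snippet below is a minimal sketch of such a check. The helper name `missing_settings` and the sample dictionary are my own illustrations and are not part of Qdrant or python-dotenv.

```python
import os
from typing import List, Mapping

# The two settings used throughout this article
REQUIRED_SETTINGS = ("QDRANT_URL", "QDRANT_API_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required settings that are absent or blank (hypothetical helper)."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Illustrative input: the URL is set, the API key is blank
example_env = {
    "QDRANT_URL": "https://your-cluster-url.cloud.qdrant.io:6333",
    "QDRANT_API_KEY": "",
}
print(missing_settings(example_env))  # ['QDRANT_API_KEY']
```

Calling `missing_settings()` with no argument checks the real process environment, so it should run after `load_dotenv()` if you rely on a .env file.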
Step 4: Verify Connection
We can verify the connection to the Qdrant cluster with the following script. From this point onward, I am assuming the .env setup is complete.
from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Initialize client
client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY"),
)

# Test connection
try:
    collections = client.get_collections()
    print(f" Connected successfully!")
    print(f" Current collections: {len(collections.collections)}")
except Exception as e:
    print(f" Connection failed: {e}")
    print(" Check your .env file has QDRANT_URL and QDRANT_API_KEY")
Expected Output
python verify-connection.py
Connected successfully!
Current collections: 2
Now that we have the setup out of the way, we can get into the meat of the article.
Before the deep dive into Binary Quantization, let us look at a high-level overview of the techniques we are about to cover in this multi-part series.
| Technique | Problem solved | Performance impact | Complexity |
|---|---|---|---|
| Hybrid Search | Pure semantic search misses exact matches | Large accuracy gain, close to 16% | Medium |
| Binary Quantization | Memory costs scale linearly with data | 32X memory reduction, 15% faster | Low |
| Filterable HNSW | Post-filtering is poor practice because it wastes computation | 5X faster filtered queries | Medium |
| Multi Vector | Advanced models need multiple embeddings per document | Enables ColBERT and multimodal search | High |
| Distributed Architecture | A single node limits throughput and availability | 32X throughput and 99.99% uptime | High |
Keep in mind that production systems typically combine two to four of these techniques.
For example, a typical e-commerce website might use Hybrid Search, Binary Quantization, and Filterable HNSW.
We covered Hybrid Search in the first part of the series. In this part, we will dive into Binary Quantization.
Binary Quantization
Storage for vectors scales linearly with data volume and, in turn, creates unsustainable memory costs.
As a quick example, a 1,536-dimension vector consumes roughly 6 KB per document. Now consider 10 million documents — this requires 60 GB of RAM, even before accounting for indexing overhead. That translates to roughly $15,000 per month in cloud costs.
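The arithmetic behind these figures is easy to reproduce. The sketch below counts only the raw vector data and ignores index and payload overhead. The function name `vector_storage_bytes` is my own.

```python
def vector_storage_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes, ignoring index and payload overhead."""
    return num_vectors * dims * bits_per_dim // 8


NUM_DOCS = 10_000_000
DIMS = 1536

full = vector_storage_bytes(NUM_DOCS, DIMS, 32)   # float32: 32 bits per dimension
binary = vector_storage_bytes(NUM_DOCS, DIMS, 1)  # binary: 1 bit per dimension

print(f"float32 per vector: {full // NUM_DOCS} bytes")  # 6144 bytes, roughly 6 KB
print(f"float32 total: {full / 1e9:.2f} GB")            # 61.44 GB
print(f"binary total: {binary / 1e9:.2f} GB")           # 1.92 GB
print(f"reduction: {full // binary}x")                  # 32x
```

The 32x ratio follows directly from replacing 32 bits per dimension with 1 bit, whatever the dimension count or collection size.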
Binary quantization compresses each vector component from a 32-bit float to a single bit, while the full-precision vectors stay on disk so that the top candidates can be rescored to recover accuracy.
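To make the two stages concrete, here is a minimal pure-Python sketch of the idea. It is not how Qdrant implements it internally. I am assuming a zero threshold (the sign of each component), random toy vectors, and a dot-product rescoring step. All function names are illustrative.

```python
import random
from typing import List, Sequence


def binarize(vector: Sequence[float]) -> int:
    """Pack the sign of each component into one int (1 bit per dimension)."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two packed codes."""
    return bin(a ^ b).count("1")


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, codes, limit=3, oversampling=2.0) -> List[int]:
    """Stage 1: shortlist limit * oversampling candidates by Hamming distance.
    Stage 2: rescore the shortlist with the full-precision dot product."""
    query_code = binarize(query)
    n_candidates = min(len(vectors), int(limit * oversampling))
    shortlist = sorted(
        range(len(codes)), key=lambda i: hamming_distance(query_code, codes[i])
    )[:n_candidates]
    rescored = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:limit]


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(64)] for _ in range(200)]  # toy data
codes = [binarize(v) for v in vectors]  # the only part that must live in RAM
query = vectors[17]
print(search(query, vectors, codes))  # index 17 should rank first
```

The shortlist is built from the 1-bit codes alone, so the full-precision vectors are only touched for the few candidates that survive stage 1. That is where the memory saving and the extra disk reads both come from.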
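As with Hybrid Search, let us look at the code for Binary Quantization. Note that this script simulates binary quantization on the client side: it fetches candidates with a regular full-precision query and then re-ranks them by Hamming similarity in Python, so it illustrates the effect on ranking rather than the server-side memory savings.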
"""Generic Binary Quantization Implementation for Qdrant"""
from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os
from typing import List, Dict, Any, Optional
import numpy as np
import time

# Cache the embedding model globally
_embedding_model = None

def get_embedding_model():
    """Get or create the embedding model (cached)."""
    global _embedding_model
    if _embedding_model is None:
        try:
            from sentence_transformers import SentenceTransformer
            _embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        except ImportError:
            raise ImportError(
                "sentence-transformers not installed. Install it with: pip install sentence-transformers"
            )
    return _embedding_model

def get_qdrant_client() -> QdrantClient:
    """Initialize and return Qdrant client."""
    load_dotenv()
    return QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
    )

def float_to_binary(vector: List[float], threshold: Optional[float] = None) -> List[int]:
    """Convert a float vector to a binary vector.

    Components at or above the threshold become 1, the rest 0.
    If no threshold is given, the median of the vector is used.
    """
    vector_array = np.array(vector)
    if threshold is None:
        threshold = np.median(vector_array)
    return (vector_array >= threshold).astype(int).tolist()

def hamming_similarity(binary_vec1: List[int], binary_vec2: List[int]) -> float:
    """Return Hamming similarity: the fraction of positions where the bits match."""
    if len(binary_vec1) != len(binary_vec2):
        raise ValueError("Vectors must have the same length")
    vec1_array = np.array(binary_vec1, dtype=np.uint8)
    vec2_array = np.array(binary_vec2, dtype=np.uint8)
    differences = np.sum(vec1_array != vec2_array)
    return 1.0 - (float(differences) / len(binary_vec1))

def binary_quantize_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10,
    threshold: Optional[float] = None
) -> List[Dict[str, Any]]:
    """Simulate binary quantized search on the client.

    Fetches limit * 5 candidates with a regular full-precision query, then
    re-ranks them by Hamming similarity between binarized vectors.
    """
    if client is None:
        client = get_qdrant_client()
    model = get_embedding_model()
    query_vector = model.encode(query).tolist()
    # Use one threshold for the query and the stored vectors so their bits are comparable
    threshold_val = threshold if threshold is not None else float(np.median(query_vector))
    query_binary_array = np.array(float_to_binary(query_vector, threshold_val), dtype=np.uint8)
    try:
        candidate_limit = limit * 5
        search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=candidate_limit,
            with_payload=True,
            with_vectors=True
        )
        if not search_response.points:
            return []
        results = []
        for point in search_response.points:
            if point.vector is None:
                continue
            # Named vectors come back as a dict; take the first one
            vector = list(point.vector.values())[0] if isinstance(point.vector, dict) else point.vector
            stored_binary_array = (np.array(vector) >= threshold_val).astype(np.uint8)
            differences = np.sum(query_binary_array != stored_binary_array)
            similarity = 1.0 - (float(differences) / len(query_binary_array))
            results.append({
                "id": point.id,
                "score": similarity,
                "payload": point.payload,
                "binary_similarity": similarity
            })
        # Highest Hamming similarity first
        results.sort(key=lambda x: x["score"], reverse=True)
        return results[:limit]
    except Exception as e:
        print(f"Binary quantized search error: {e}")
        return []

def full_precision_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10
) -> List[Dict[str, Any]]:
    """Perform standard full precision vector search."""
    if client is None:
        client = get_qdrant_client()
    model = get_embedding_model()
    query_vector = model.encode(query).tolist()
    try:
        search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=limit,
            with_payload=True
        )
        return [
            {
                "id": r.id,
                "score": r.score,
                "payload": r.payload,
                "cosine_similarity": r.score
            }
            for r in search_response.points
        ]
    except Exception as e:
        print(f"Full precision search error: {e}")
        return []

def compare_search_methods(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10,
    threshold: Optional[float] = None
) -> Dict[str, Any]:
    """Compare full precision search vs binary quantized search."""
    if client is None:
        client = get_qdrant_client()
    # Load the embedding model up front so its load time is not charged to the first search
    get_embedding_model()
    print(f"\nComparing Search Methods for: '{query}'")
    print("=" * 80)
    # Full precision search
    print("\n1. Full Precision Search (Cosine Similarity)")
    print("-" * 80)
    start_time = time.perf_counter()
    full_results = full_precision_search(
        collection_name=collection_name,
        query=query,
        client=client,
        limit=limit
    )
    full_time = time.perf_counter() - start_time
    # Binary quantized search
    print("\n2. Binary Quantized Search (Hamming Similarity)")
    print("-" * 80)
    start_time = time.perf_counter()
    binary_results = binary_quantize_search(
        collection_name=collection_name,
        query=query,
        client=client,
        limit=limit,
        threshold=threshold
    )
    binary_time = time.perf_counter() - start_time
    # Compare results
    full_ids = {r["id"] for r in full_results}
    binary_ids = {r["id"] for r in binary_results}
    overlap = len(full_ids & binary_ids)
    overlap_ratio = overlap / limit if limit > 0 else 0.0
    comparison = {
        "query": query,
        "full_precision": {
            "results": full_results,
            "time_ms": full_time * 1000,
            "count": len(full_results)
        },
        "binary_quantized": {
            "results": binary_results,
            "time_ms": binary_time * 1000,
            "count": len(binary_results)
        },
        "overlap": overlap,
        "overlap_ratio": overlap_ratio
    }
    # Display comparison summary
    print("\n" + "=" * 80)
    print("COMPARISON SUMMARY")
    print("=" * 80)
    print("Full Precision Search:")
    print(f" Time: {full_time*1000:.2f} ms")
    print(f" Results: {len(full_results)}")
    if full_results:
        print(f" Top Score: {full_results[0]['score']:.4f}")
    print("\nBinary Quantized Search:")
    print(f" Time: {binary_time*1000:.2f} ms")
    print(f" Results: {len(binary_results)}")
    if binary_results:
        print(f" Top Score: {binary_results[0]['score']:.4f}")
    print("\nOverlap:")
    print(f" Common Results: {overlap} / {limit} ({overlap_ratio*100:.1f}%)")
    speedup = full_time / binary_time if binary_time > 0 else 0
    if speedup > 1:
        print(f" Binary search is {speedup:.2f}x faster")
    elif speedup > 0:
        print(f" Full precision search is {1/speedup:.2f}x faster")
    print("\n" + "=" * 80)
    return comparison

def display_binary_results(results: List[Dict[str, Any]], query: str, show_fields: Optional[List[str]] = None):
    """Display binary quantized search results."""
    if show_fields is None:
        show_fields = ['name', 'title', 'description', 'text']
    print(f"\nBinary Quantized Search Results for: '{query}'")
    print("=" * 80)
    if not results:
        print("No results found.")
        return
    print(f"Found {len(results)} results using binary quantization (Hamming similarity)\n")
    for i, result in enumerate(results, 1):
        payload = result["payload"]
        # Try to find a display name from common fields
        display_name = "Result"
        for field in ['name', 'title', 'part_name', 'id', 'part_id']:
            if field in payload:
                display_name = str(payload[field])
                break
        print(f"\n{i}. {display_name}")
        # Display other fields
        for field in show_fields:
            if field in payload and field not in ['name', 'title']:
                value = payload[field]
                if isinstance(value, str):
                    print(f" {field.capitalize()}: {value[:100]}{'...' if len(value) > 100 else ''}")
        print(f" Binary Similarity Score: {result['score']:.4f} (Hamming)")
        print(f" {result['score']*100:.1f}% bit similarity")
        print("-" * 80)

if __name__ == "__main__":
    # Example usage - customize for your collection.
    # Example 1: Binary quantized search
    print("=" * 80)
    print("EXAMPLE 1: Binary Quantized Search")
    print("=" * 80)
    print("This demonstrates search using binary quantized vectors.\n")
    # Replace with your collection name and query
    collection_name = os.getenv("QDRANT_COLLECTION", "your_collection")
    query1 = "example query"
    try:
        client = get_qdrant_client()
        results1 = binary_quantize_search(
            collection_name=collection_name,
            query=query1,
            client=client,
            limit=5
        )
        display_binary_results(results1, query1, show_fields=['name', 'description'])
    except Exception as e:
        print(f"Error: {e}")
        print("\nTo use this script:")
        print("1. Set QDRANT_URL and QDRANT_API_KEY in your .env file")
        print("2. Set QDRANT_COLLECTION environment variable or update collection_name in code")
        print("3. Ensure your collection has vectors stored")
        print("4. Adjust show_fields to match your collection's payload structure")
Let us look at an example implementation of Binary Quantization for the same automotive parts collection.
python binary_quantization_example.py
================================================================================
EXAMPLE 1: Binary Quantized Search
================================================================================
Searching: 'collision detection device'
Expected: Finds similar parts using binary vector comparison
Binary Quantized Search Results for: 'collision detection device'
================================================================================
Found 5 results using binary quantization (Hamming similarity)
1. Safety Sensor Module 217
Part_name: Safety Sensor Module 217
Part_id: DEL-0000217
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Binary Similarity Score: 0.6458 (Hamming)
64.6% bit similarity
--------------------------------------------------------------------------------
2. Safety Sensor Module 250
Part_name: Safety Sensor Module 250
Part_id: TE-0000250
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Binary Similarity Score: 0.6354 (Hamming)
63.5% bit similarity
--------------------------------------------------------------------------------
3. Safety Sensor Module 233
Part_name: Safety Sensor Module 233
Part_id: TRW-0000233
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Binary Similarity Score: 0.6328 (Hamming)
63.3% bit similarity
--------------------------------------------------------------------------------
4. Safety Sensor Module 201
Part_name: Safety Sensor Module 201
Part_id: DEN-0000201
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Binary Similarity Score: 0.6328 (Hamming)
63.3% bit similarity
--------------------------------------------------------------------------------
5. Safety Sensor Module 223
Part_name: Safety Sensor Module 223
Part_id: MAG-0000223
Category: Safety Systems
Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
Binary Similarity Score: 0.6328 (Hamming)
63.3% bit similarity
--------------------------------------------------------------------------------
================================================================================
EXAMPLE 2: Full Precision vs Binary Quantized Comparison
================================================================================
Comparing search methods for: 'engine sensor'
Expected: Shows speed and quality differences between methods
Comparing Search Methods for: 'engine sensor'
================================================================================
1. Full Precision Search (Cosine Similarity)
--------------------------------------------------------------------------------
2. Binary Quantized Search (Hamming Similarity)
--------------------------------------------------------------------------------
================================================================================
COMPARISON SUMMARY
================================================================================
Full Precision Search:
Time: 109.33 ms
Results: 5
Top Score: 0.4092
Binary Quantized Search:
Time: 75.83 ms
Results: 5
Top Score: 0.6589
Benefits
The most obvious benefits are:
- Memory reduction of ~32× for the vector data (32 bits per dimension down to 1 bit)
- Lower query latency: about 31% lower in the sample run above (75.83 ms versus 109.33 ms)
The less obvious — but very significant — benefit is cost savings, especially in terms of cloud infrastructure costs.
Costs
Binary Quantization also comes with trade-offs:
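- Accuracy loss: roughly 1–2% recall degradation is a realistic target when oversampling and rescoring are enabled; without rescoring the loss can be much larger, as the 20% top-5 overlap in the experiment in this article shows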
- Disk I/O: Rescoring requires additional disk reads
- Setup complexity: Like Hybrid Search, Binary Quantization requires careful oversampling and tuning
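The rescoring and oversampling knobs mentioned above are set in two places: once when the collection is created and once per query. The sketch below builds the corresponding request bodies as plain Python dictionaries, following the field names of the Qdrant REST API as I understand them. The vector size (384, matching all-MiniLM-L6-v2), the oversampling factor, and the helper name `build_query_body` are illustrative assumptions, so check them against the documentation for your Qdrant version.

```python
import json

# Body for PUT /collections/{collection_name}
create_collection_body = {
    "vectors": {
        "size": 384,           # output size of all-MiniLM-L6-v2
        "distance": "Cosine",
        "on_disk": True,       # full-precision vectors stay on disk for rescoring
    },
    "quantization_config": {
        "binary": {"always_ram": True},  # keep the 1-bit codes in RAM
    },
}


def build_query_body(query_vector, limit=10, oversampling=2.0):
    """Body for POST /collections/{collection_name}/points/query (hypothetical helper)."""
    return {
        "query": list(query_vector),
        "limit": limit,
        "params": {
            "quantization": {
                "ignore": False,               # use the quantized codes
                "rescore": True,               # re-rank with the original vectors
                "oversampling": oversampling,  # shortlist limit * oversampling candidates
            }
        },
    }


print(json.dumps(create_collection_body, indent=2))
print(json.dumps(build_query_body([0.1, 0.2, 0.3], limit=5, oversampling=3.0), indent=2))
```

In the Python client, the same settings are expressed with `models.BinaryQuantization` and `models.BinaryQuantizationConfig` at collection creation and with `models.SearchParams` and `models.QuantizationSearchParams` at query time. A higher oversampling factor usually recovers more recall at the price of more disk reads during rescoring.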
When to Use
- Large-scale deployments with millions of vectors
- Projects with budget constraints
- Use cases where <2% accuracy loss is acceptable
When Not to Use
- Projects with very small datasets (e.g., <100K vectors) where memory is not a bottleneck
- Projects with strict accuracy requirements, such as regulatory or medical systems
- Applications with specialized hardware where RAM is abundant
Binary Quantization for Automotive Parts (1M+ Parts)
| Aspect | Full precision | Binary quantized |
|---|---|---|
| RAM needed | 1.5 GB | 46 MB |
| Query speed | 246 ms | 107 ms |
| Result quality | Reference ranking | 80% of the top results differ, but they remain relevant |
| Cost/month | $50-100 (cloud RAM) | $2-5 (cloud RAM) |
Performance Characteristics
Based on the results shared above, let us look at some performance characteristics of Binary Quantization:
| Metric | Full precision | Binary quantization | Evidence from the data |
|---|---|---|---|
| Query latency | 246 ms | 107 ms | 2.3X faster: the binary quantized search completed in less than half the time |
| Memory usage | 32 bits/dimension | 1 bit/dimension | A 32X reduction in vector memory |
| Result overlap | Baseline (100%) | 20% | Only 1 of the top 5 results matches between the two searches |
| Ranking accuracy | 1.0 (reference) | -0.80 | Different rankings, but all results are still semantically relevant |
| Top score | 0.4092 (cosine) | 0.6589 (Hamming) | Different similarity measures, so the scores are not directly comparable |
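The result overlap figure is simple to compute for your own queries. Here is a minimal sketch that treats the full-precision top-k as the reference. The function name `overlap_at_k` and the IDs are made up for illustration. They are chosen so that one of five results matches, as in the table.

```python
from typing import Hashable, Sequence


def overlap_at_k(reference: Sequence[Hashable], candidate: Sequence[Hashable], k: int) -> float:
    """Fraction of the reference top-k that also appears in the candidate top-k."""
    if k <= 0:
        return 0.0
    return len(set(reference[:k]) & set(candidate[:k])) / k


# Hypothetical point IDs, not taken from the automotive parts collection
full_precision_ids = [101, 102, 103, 104, 105]
binary_ids = [201, 103, 202, 203, 204]

print(overlap_at_k(full_precision_ids, binary_ids, k=5))  # 0.2
```

Averaging this value over a representative set of queries gives a more reliable picture than a single query.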
Conclusion
From both the conceptual overview and the experimental results, Binary Quantization provides significant memory and speed gains, but at the cost of ranking fidelity: in the experiment, only 20% of the top 5 results overlapped with full-precision search. That trade-off can rule it out for compliance- or safety-centric applications where exact ranking is required.
However, Binary Quantization is an excellent fit for discovery-oriented systems, large result sets, and scenarios where speed and cost efficiency outweigh perfect ranking.
In the next part of the series, we will explore Filterable HNSW and how it impacts vector search performance.