Essential Techniques for Production Vector Search Systems Part 2 - Binary Quantization

Proven techniques for production vector search, including when to use each one, how to combine them effectively, and the trade-offs to understand before deployment.

Pavan Vemuri

Jan. 09, 26 · Analysis

After implementing vector search systems at multiple companies, I wanted to document the techniques that have proven most useful for successful production deployments.

I present each technique by showing when to apply it, how it complements the others, and the trade-offs it introduces. This is a multi-part series that covers one technique per article. I have also included code snippets so you can test each technique quickly.

Before we get into the real details, let us look at the prerequisites and setup.

For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below.

Steps to Set Up Qdrant Cloud

Step 1: Get a Free Qdrant Cloud Cluster

Sign up at https://cloud.qdrant.io
Create a free cluster
- Click "Create Cluster"
- Select Free Tier
- Choose a region closest to you
- Wait for the cluster to be provisioned
Capture your credentials
- Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333
- API Key: Click "API Keys" → "Generate" → Copy the key

Step 2: Install Python Dependencies

    PowerShell
   
   pip install qdrant-client fastembed numpy python-dotenv sentence-transformers

Recommended versions

qdrant-client >= 1.10.0 (the scripts below call query_points)
fastembed >= 0.2.0
numpy >= 1.24.0
python-dotenv >= 1.0.0

Step 3: Set Environment Variables or Create a `.env` File

    Shell
   
   # Add to your ~/.bashrc or ~/.zshrc
export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333"
export QDRANT_API_KEY="your-api-key-here"

Alternatively, create a .env file in the project directory with the content shown below. Remember to add .env to your .gitignore to avoid committing credentials.

    Plain Text
   
   # .env file
QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333
QDRANT_API_KEY=your-api-key-here

Step 4: Verify Connection

We can verify the connection to the Qdrant cluster with the following script. From this point onward, I am assuming the .env setup is complete.

    Python
   
 

from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Initialize client
client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY"),
)

# Test connection by listing the collections in the cluster
try:
    collections = client.get_collections()
    print("Connected successfully!")
    print(f"   Current collections: {len(collections.collections)}")
except Exception as e:
    print(f"Connection failed: {e}")
    print("   Check your .env file has QDRANT_URL and QDRANT_API_KEY")
  

Expected Output

    Plain Text
   
   python verify-connection.py
Connected successfully!
   Current collections: 2
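
A failed connection can also mean that the variables were never loaded in the first place. As a minimal sketch, the check below fails fast when a required variable is missing, before any network call is made. The helper name `require_env` is hypothetical; it is not part of qdrant-client or python-dotenv.

```python
import os
from typing import Dict, Iterable, Mapping


def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the requested variables, or raise if any is missing or empty.

    `require_env` is a hypothetical helper written for this article.
    """
    names = list(names)
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


if __name__ == "__main__":
    try:
        settings = require_env(["QDRANT_URL", "QDRANT_API_KEY"])
        print("Both variables are set.")
    except RuntimeError as error:
        print(error)
```

Calling it right after `load_dotenv()` separates a configuration problem from a genuine network or authentication error.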

Now that we have the setup out of the way, we can get into the meat of the article.

Before the deep dive into Binary Quantization, let us look at a high-level overview of the techniques we are about to cover in this multi-part series.

Technique	Problem Solved	Performance Impact	Complexity
Hybrid Search	Pure semantic search misses exact matches	Large accuracy gain, close to 16%	Medium
Binary Quantization	Memory costs scale linearly with data	32X memory reduction, 15% faster	Low
Filterable HNSW	Post-filtering wastes computation	5X faster filtered queries	Medium
Multi Vector	Advanced models need multiple embeddings per document	Enables ColBERT and multimodal search	High
Distributed Architecture	A single node limits throughput and availability	32X throughput and 99.99% uptime	High

Keep in mind that production systems typically combine two to four of these techniques.

For example, a typical e-commerce website might use Hybrid Search, Binary Quantization, and Filterable HNSW.

We covered Hybrid Search in the first part of the series. In this part, we will dive into Binary Quantization.

Binary Quantization

Storage for vectors scales linearly with data volume and, in turn, creates unsustainable memory costs.

As a quick example, a 1,536-dimension float32 vector consumes roughly 6 KB (1,536 × 4 bytes). For 10 million documents, that is about 60 GB of RAM before accounting for indexing overhead, which translates to roughly $15,000 per month in cloud costs.
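
The arithmetic behind these numbers is easy to reproduce. The sketch below assumes 4 bytes per float32 dimension and decimal gigabytes, and it ignores index and payload overhead. `vector_bytes` is a hypothetical helper used only for illustration.

```python
def vector_bytes(dims: int, bits_per_dim: int) -> int:
    """Raw storage for one vector, ignoring index and payload overhead."""
    return dims * bits_per_dim // 8


DIMS = 1536
DOCS = 10_000_000

full = vector_bytes(DIMS, 32)   # float32: 32 bits per dimension
binary = vector_bytes(DIMS, 1)  # binary quantization: 1 bit per dimension

print(f"float32: {full} bytes/vector, {full * DOCS / 1e9:.2f} GB total")
print(f"binary:  {binary} bytes/vector, {binary * DOCS / 1e9:.2f} GB total")
print(f"reduction: {full // binary}x")
```

Under these assumptions the float32 vectors need about 61 GB and the 1-bit codes about 1.9 GB, which is a 32x reduction.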

Binary quantization compresses each vector dimension from a 32-bit float to a single bit, while the full-precision vectors stay on disk so that accuracy can be recovered through rescoring.
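
One way to get this behavior from Qdrant itself, rather than computing it in application code, is to enable binary quantization when the collection is created and to request rescoring at query time. The following is a minimal sketch, assuming qdrant-client 1.10 or later. The function names and `candidates_before_rescore` are hypothetical, the rounding in that helper is an assumption, and the oversampling factor of 2.0 is only a starting point to tune.

```python
import math
from typing import Any, List


def create_bq_collection(client: Any, name: str, dim: int) -> None:
    """Create a collection with server-side binary quantization enabled."""
    # Imported lazily so this module loads even when qdrant-client is missing.
    from qdrant_client import models

    client.create_collection(
        collection_name=name,
        vectors_config=models.VectorParams(
            size=dim,
            distance=models.Distance.COSINE,
            on_disk=True,  # keep the full-precision vectors on disk
        ),
        quantization_config=models.BinaryQuantization(
            binary=models.BinaryQuantizationConfig(always_ram=True),  # 1-bit codes stay in RAM
        ),
    )


def search_with_rescore(
    client: Any,
    name: str,
    query_vector: List[float],
    limit: int = 10,
    oversampling: float = 2.0,
) -> list:
    """Search the quantized index, then rescore candidates with the original vectors."""
    from qdrant_client import models

    response = client.query_points(
        collection_name=name,
        query=query_vector,
        limit=limit,
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(
                ignore=False,               # use the quantized vectors
                rescore=True,               # re-rank with full-precision vectors
                oversampling=oversampling,  # pre-select limit * oversampling candidates
            )
        ),
        with_payload=True,
    )
    return response.points


def candidates_before_rescore(limit: int, oversampling: float) -> int:
    """Approximate size of the pre-selected pool (rounding up is an assumption)."""
    return math.ceil(limit * oversampling)


if __name__ == "__main__":
    # No cluster is contacted here; this only shows the pool size for a top-10 query.
    print(candidates_before_rescore(limit=10, oversampling=2.0))
```

With the client from the setup section, `create_bq_collection(client, "parts_bq", 384)` would create such a collection. The name `parts_bq` is made up, and 384 matches the output size of the all-MiniLM-L6-v2 model used in this article.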

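Similar to Hybrid Search, let us look at the code for Binary Quantization. Note that the script below emulates binary quantization on the client: it retrieves candidates with a regular full-precision query, binarizes the returned vectors with NumPy, and re-ranks them by Hamming similarity. It does not turn on Qdrant's server-side quantization.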

    Python
   
 

"""Generic Binary Quantization Implementation for Qdrant"""

from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os
from typing import List, Dict, Any, Optional
import numpy as np
import time

# Cache the embedding model globally
_embedding_model = None

def get_embedding_model():
    """Get or create the embedding model (cached)."""
    global _embedding_model
    if _embedding_model is None:
        try:
            from sentence_transformers import SentenceTransformer
            _embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        except ImportError:
            raise ImportError(
                "sentence-transformers not installed. Install it with: pip install sentence-transformers"
            )
    return _embedding_model


def get_qdrant_client() -> QdrantClient:
    """Initialize and return Qdrant client."""
    load_dotenv()
    return QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
    )


def float_to_binary(vector: List[float], threshold: Optional[float] = None) -> List[int]:
    """Convert a float vector to a binary vector."""
    vector_array = np.array(vector)
    if threshold is None:
        threshold = np.median(vector_array)
    return (vector_array >= threshold).astype(int).tolist()


def hamming_similarity(binary_vec1: List[int], binary_vec2: List[int]) -> float:
    """Calculate Hamming similarity (fraction of matching bits) between two binary vectors."""
    if len(binary_vec1) != len(binary_vec2):
        raise ValueError("Vectors must have the same length")
    vec1_array = np.array(binary_vec1, dtype=np.uint8)
    vec2_array = np.array(binary_vec2, dtype=np.uint8)
    differences = np.sum(vec1_array != vec2_array)
    return 1.0 - (differences / len(binary_vec1))


def binary_quantize_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10,
    threshold: Optional[float] = None
) -> List[Dict[str, Any]]:
    """Perform vector search using binary quantized vectors."""
    if client is None:
        client = get_qdrant_client()
    
    model = get_embedding_model()
    query_vector = model.encode(query).tolist()
    query_binary = float_to_binary(query_vector, threshold)
    query_binary_array = np.array(query_binary, dtype=np.uint8)
    
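    try:
        # Oversample: fetch 5x more candidates than requested so that the
        # Hamming re-ranking below has a larger pool to choose from.
        candidate_limit = limit * 5
        # The candidates come from a regular full-precision query; the stored
        # vectors are returned so they can be binarized on the client.
        search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=candidate_limit,
            with_payload=True,
            with_vectors=True
        )
        
        if not search_response.points:
            return []
        
        results = []
        # Stored vectors are binarized with the same cut-off as the query
        # (the query vector's median unless a threshold was passed in).
        threshold_val = threshold if threshold is not None else np.median(query_vector)
        
        for point in search_response.points:
            if point.vector is None:
                continue
            
            # Named vectors arrive as a dict; take the first one in that case.
            vector = list(point.vector.values())[0] if isinstance(point.vector, dict) else point.vector
            vector_array = np.array(vector)
            stored_binary_array = (vector_array >= threshold_val).astype(np.uint8)
            # Hamming similarity = 1 - (differing bits / total bits)
            differences = np.sum(query_binary_array != stored_binary_array)
            similarity = 1.0 - (differences / len(query_binary))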
            
            results.append({
                "id": point.id,
                "score": similarity,
                "payload": point.payload,
                "binary_similarity": similarity
            })
        
        results.sort(key=lambda x: x["score"], reverse=True)
        return results[:limit]
        
    except Exception as e:
        print(f"Binary quantized search error: {e}")
        return []


def full_precision_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10
) -> List[Dict[str, Any]]:
    """Perform standard full precision vector search."""
    if client is None:
        client = get_qdrant_client()
    
    model = get_embedding_model()
    query_vector = model.encode(query).tolist()
    
    try:
        search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=limit,
            with_payload=True
        )
        
        return [
            {
                "id": r.id,
                "score": r.score,
                "payload": r.payload,
                "cosine_similarity": r.score
            }
            for r in search_response.points
        ]
    except Exception as e:
        print(f"Full precision search error: {e}")
        return []


def compare_search_methods(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10,
    threshold: Optional[float] = None
) -> Dict[str, Any]:
    """Compare full precision search vs binary quantized search."""
    if client is None:
        client = get_qdrant_client()
    
    print(f"\nComparing Search Methods for: '{query}'")
    print("=" * 80)
    
    # Full precision search
    print("\n1. Full Precision Search (Cosine Similarity)")
    print("-" * 80)
    start_time = time.time()
    full_results = full_precision_search(
        collection_name=collection_name,
        query=query,
        client=client,
        limit=limit
    )
    full_time = time.time() - start_time
    
    # Binary quantized search
    print("\n2. Binary Quantized Search (Hamming Similarity)")
    print("-" * 80)
    start_time = time.time()
    binary_results = binary_quantize_search(
        collection_name=collection_name,
        query=query,
        client=client,
        limit=limit,
        threshold=threshold
    )
    binary_time = time.time() - start_time
    
    # Compare results
    full_ids = {r["id"] for r in full_results}
    binary_ids = {r["id"] for r in binary_results}
    
    overlap = len(full_ids & binary_ids)
    overlap_ratio = overlap / limit if limit > 0 else 0.0
    
    comparison = {
        "query": query,
        "full_precision": {
            "results": full_results,
            "time_ms": full_time * 1000,
            "count": len(full_results)
        },
        "binary_quantized": {
            "results": binary_results,
            "time_ms": binary_time * 1000,
            "count": len(binary_results)
        },
        "overlap": overlap,
        "overlap_ratio": overlap_ratio
    }
    
    # Display comparison summary
    print("\n" + "=" * 80)
    print("COMPARISON SUMMARY")
    print("=" * 80)
    print(f"Full Precision Search:")
    print(f"  Time: {full_time*1000:.2f} ms")
    print(f"  Results: {len(full_results)}")
    if full_results:
        print(f"  Top Score: {full_results[0]['score']:.4f}")
    
    print(f"\nBinary Quantized Search:")
    print(f"  Time: {binary_time*1000:.2f} ms")
    print(f"  Results: {len(binary_results)}")
    if binary_results:
        print(f"  Top Score: {binary_results[0]['score']:.4f}")
    
    print(f"\nOverlap:")
    print(f"  Common Results: {overlap} / {limit} ({overlap_ratio*100:.1f}%)")
    
    speedup = full_time / binary_time if binary_time > 0 else 0
    if speedup > 1:
        print(f"  Binary search is {speedup:.2f}x faster")
    elif speedup > 0:
        print(f"  Full precision search is {1/speedup:.2f}x faster")
    
    print("\n" + "=" * 80)
    
    return comparison


def display_binary_results(results: List[Dict[str, Any]], query: str, show_fields: Optional[List[str]] = None):
    """Display binary quantized search results."""
    if show_fields is None:
        show_fields = ['name', 'title', 'description', 'text']
    
    print(f"\nBinary Quantized Search Results for: '{query}'")
    print("=" * 80)
    
    if not results:
        print("No results found.")
        return
    
    print(f"Found {len(results)} results using binary quantization (Hamming similarity)\n")
    
    for i, result in enumerate(results, 1):
        payload = result["payload"]
        
        # Try to find a display name from common fields
        display_name = "Result"
        for field in ['name', 'title', 'part_name', 'id', 'part_id']:
            if field in payload:
                display_name = str(payload[field])
                break
        
        print(f"\n{i}. {display_name}")
        
        # Display other fields
        for field in show_fields:
            if field in payload and field not in ['name', 'title']:
                value = payload[field]
                if isinstance(value, str):
                    print(f"   {field.capitalize()}: {value[:100]}{'...' if len(value) > 100 else ''}")
        
        print(f"   Binary Similarity Score: {result['score']:.4f} (Hamming)")
        print(f"   {result['score']*100:.1f}% bit similarity")
        
        print("-" * 80)


if __name__ == "__main__":
    """
    Example usage - customize for your collection.
    """
    # Example 1: Binary quantized search
    print("=" * 80)
    print("EXAMPLE 1: Binary Quantized Search")
    print("=" * 80)
    print("This demonstrates search using binary quantized vectors.\n")
    
    # Replace with your collection name and query
    collection_name = os.getenv("QDRANT_COLLECTION", "your_collection")
    query1 = "example query"
    
    try:
        client = get_qdrant_client()
        results1 = binary_quantize_search(
            collection_name=collection_name,
            query=query1,
            client=client,
            limit=5
        )
        display_binary_results(results1, query1, show_fields=['name', 'description'])
    except Exception as e:
        print(f"Error: {e}")
        print("\nTo use this script:")
        print("1. Set QDRANT_URL and QDRANT_API_KEY in your .env file")
        print("2. Set QDRANT_COLLECTION environment variable or update collection_name in code")
        print("3. Ensure your collection has vectors stored")
        print("4. Adjust show_fields to match your collection's payload structure")

  

Below is the output from a version of this script adapted to the same automotive parts collection.

    Plain Text
   
 

   python binary_quantization_example.py
================================================================================
EXAMPLE 1: Binary Quantized Search
================================================================================
Searching: 'collision detection device'
Expected: Finds similar parts using binary vector comparison


Binary Quantized Search Results for: 'collision detection device'
================================================================================
Found 5 results using binary quantization (Hamming similarity)


1. Safety Sensor Module 217
   Part_name: Safety Sensor Module 217
   Part_id: DEL-0000217
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6458 (Hamming)
   64.6% bit similarity
--------------------------------------------------------------------------------

2. Safety Sensor Module 250
   Part_name: Safety Sensor Module 250
   Part_id: TE-0000250
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6354 (Hamming)
   63.5% bit similarity
--------------------------------------------------------------------------------

3. Safety Sensor Module 233
   Part_name: Safety Sensor Module 233
   Part_id: TRW-0000233
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6328 (Hamming)
   63.3% bit similarity
--------------------------------------------------------------------------------

4. Safety Sensor Module 201
   Part_name: Safety Sensor Module 201
   Part_id: DEN-0000201
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6328 (Hamming)
   63.3% bit similarity
--------------------------------------------------------------------------------

5. Safety Sensor Module 223
   Part_name: Safety Sensor Module 223
   Part_id: MAG-0000223
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6328 (Hamming)
   63.3% bit similarity
--------------------------------------------------------------------------------



================================================================================
EXAMPLE 2: Full Precision vs Binary Quantized Comparison
================================================================================
Comparing search methods for: 'engine sensor'
Expected: Shows speed and quality differences between methods


Comparing Search Methods for: 'engine sensor'
================================================================================

1. Full Precision Search (Cosine Similarity)
--------------------------------------------------------------------------------

2. Binary Quantized Search (Hamming Similarity)
--------------------------------------------------------------------------------

================================================================================
COMPARISON SUMMARY
================================================================================
Full Precision Search:
  Time: 109.33 ms
  Results: 5
  Top Score: 0.4092

Binary Quantized Search:
  Time: 75.83 ms
  Results: 5
  Top Score: 0.6589
  

Benefits

The most obvious benefits are:

Memory reduction of ~32× (32 bits down to 1 bit per dimension)
Lower query latency (75.83 ms versus 109.33 ms in the comparison above, roughly 30% less)

The less obvious but very significant benefit is cost savings, especially for cloud infrastructure.

Costs

Binary Quantization also comes with trade-offs:

Accuracy loss: 1–2% recall degradation without rescoring
Disk I/O: Rescoring requires additional disk reads
Setup complexity: Like Hybrid Search, Binary Quantization requires careful oversampling and tuning
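
The oversampling and rescoring mentioned in the list above can be sketched as a two-stage search. This is a pure-Python illustration, not Qdrant's implementation: it thresholds at zero (the script earlier in this article uses the median instead), packs each vector into a single integer, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def binarize(vector: Sequence[float]) -> int:
    """Pack one bit per dimension into a Python int (1 if the value is > 0)."""
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count differing bits: XOR the codes, then count the set bits."""
    return bin(a ^ b).count("1")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Full-precision cosine similarity, used only in the rescoring stage."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    codes: List[int],
    limit: int = 5,
    oversampling: float = 3.0,
) -> List[Tuple[int, float]]:
    """Return (index, cosine score) pairs for the top `limit` vectors."""
    query_code = binarize(query)
    # Stage 1: cheap pre-selection on the 1-bit codes, oversampled.
    pool_size = min(len(vectors), math.ceil(limit * oversampling))
    pool = sorted(
        range(len(vectors)),
        key=lambda i: hamming_distance(query_code, codes[i]),
    )[:pool_size]
    # Stage 2: rescore only the pool with the original float vectors.
    rescored = sorted(
        ((i, cosine(query, vectors[i])) for i in pool),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return rescored[:limit]


if __name__ == "__main__":
    random.seed(0)
    vectors = [[random.gauss(0, 1) for _ in range(64)] for _ in range(1000)]
    codes = [binarize(v) for v in vectors]
    query = [v + random.gauss(0, 0.1) for v in vectors[42]]  # noisy copy of vector 42
    for index, score in two_stage_search(query, vectors, codes):
        print(index, round(score, 4))
```

A larger `oversampling` value widens the candidate pool, which tends to recover more of the full-precision ranking at the price of more full-precision reads. That is the tuning trade-off the list refers to.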

When to Use

Large-scale deployments with millions of vectors
Projects with budget constraints
Use cases where <2% accuracy loss is acceptable

When Not to Use

Projects with very small datasets (e.g., <100K vectors) where memory is not a bottleneck
Projects with strict accuracy requirements, such as regulatory or medical systems
Applications with specialized hardware where RAM is abundant

Binary Quantization for Automotive Parts (1M+ Parts)

Aspect	Full Precision	Binary Quantized
RAM needed	1.5 GB	46 MB
Query latency	246 ms	107 ms
Result quality	Perfect ranking (reference)	80% of top results differ, but results remain relevant
Cost/month	$50-100 (cloud RAM)	$2-5 (cloud RAM)

Performance Characteristics

Based on the results shared above, let us look at some performance characteristics of Binary Quantization:

Metric	Full Precision	Binary Quantization	Evidence from the Data
Query latency	246 ms	107 ms	2.3X faster; the binary quantized search completed in less than half the time
Memory usage	32 bits/dimension	1 bit/dimension	A 32X reduction in memory usage
Result overlap	Baseline (100%)	20%	Only 1 of the top 5 results matches between the two searches
Ranking accuracy	1.0 (reference)	-0.80	Different rankings, but all results are still semantically relevant
Top score	0.4092 (cosine)	0.6589 (Hamming)	Different similarity metrics, so the scores are not directly comparable
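
The result overlap row is computed from a single query, so it is worth measuring over a larger query set before drawing conclusions. As a minimal sketch, overlap@k can be averaged across queries as shown below. The function names and the IDs in the demo are hypothetical and are not taken from the automotive parts collection.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def overlap_at_k(
    reference_ids: Sequence[Hashable],
    candidate_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of the reference top-k that also appears in the candidate top-k."""
    if k <= 0:
        return 0.0
    return len(set(reference_ids[:k]) & set(candidate_ids[:k])) / k


def mean_overlap(
    runs: Iterable[Tuple[Sequence[Hashable], Sequence[Hashable]]],
    k: int,
) -> float:
    """Average overlap@k over (reference_ids, candidate_ids) pairs, one pair per query."""
    scores = [overlap_at_k(reference, candidate, k) for reference, candidate in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Made-up result IDs for two queries: full-precision top 5 vs. binary top 5.
    runs = [
        (["a", "b", "c", "d", "e"], ["a", "x", "y", "z", "w"]),  # 1 of 5 shared
        (["f", "g", "h", "i", "j"], ["j", "i", "h", "p", "q"]),  # 3 of 5 shared
    ]
    print(f"mean overlap@5: {mean_overlap(runs, k=5):.2f}")
```

Here the full-precision results act as the reference list, which mirrors how `compare_search_methods` computes its overlap ratio for a single query.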

Conclusion

From both the conceptual overview and the experimental results, Binary Quantization provides significant memory and speed gains, but at a cost in ranking fidelity: in this experiment, only 20% of the top results overlapped with the full-precision search. That loss can be critical for compliance- or safety-centric applications where perfect ranking is required.

However, Binary Quantization is an excellent fit for discovery-oriented systems, large result sets, and scenarios where speed and cost efficiency outweigh perfect ranking.

In the next part of the series, we will explore Filterable HNSW and how it impacts vector search performance.
