Essential Techniques for Production Vector Search Systems Part 2 - Binary Quantization

Proven techniques for production vector search, including when to use each one, how to combine them effectively, and the trade-offs to understand before deployment.

By Pavan Vemuri · Jan. 09, 26 · Analysis
After implementing vector search systems at multiple companies, I wanted to document the techniques that have proven most useful for deploying them successfully in production.

I will present each technique by showing when to apply it, how it complements the others, and the trade-offs it introduces. This is a multi-part series that covers one technique per article. I have also included code snippets so you can quickly test each technique.

Before we get into the real details, let us look at the prerequisites and setup.

For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below.

Steps to Set Up Qdrant Cloud

Step 1: Get a Free Qdrant Cloud Cluster

  • Sign up at https://cloud.qdrant.io
  • Create a free cluster
    • Click "Create Cluster"
    • Select Free Tier
    • Choose a region closest to you
    • Wait for the cluster to be provisioned
  • Capture your credentials
    • Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333
    • API Key: Click "API Keys" → "Generate" → Copy the key

Step 2: Install Python Dependencies

PowerShell
 
pip install qdrant-client fastembed numpy python-dotenv sentence-transformers


Recommended versions

  • qdrant-client >= 1.7.0
  • fastembed >= 0.2.0
  • numpy >= 1.24.0
  • python-dotenv >= 1.0.0

Step 3: Set Environment Variables or Create a .env File

Shell
 
# Add to your ~/.bashrc or ~/.zshrc
export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333"
export QDRANT_API_KEY="your-api-key-here"


Alternatively, create a .env file in the project directory with the following content.

Remember to add .env to your .gitignore to avoid committing credentials.

Plain Text
 
# .env file
QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333
QDRANT_API_KEY=your-api-key-here


Step 4: Verify Connection

We can verify the connection to the Qdrant collection with the following script. From this point onward, I am assuming the .env setup is complete.

Python
 
from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Initialize client
client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY"),
)

# Test connection
try:
    collections = client.get_collections()
    print("Connected successfully!")
    print(f"   Current collections: {len(collections.collections)}")
except Exception as e:
    print(f"Connection failed: {e}")
    print("   Check that your .env file has QDRANT_URL and QDRANT_API_KEY")


Expected Output

Plain Text
 
python verify-connection.py
Connected successfully!
   Current collections: 2


Now that we have the setup out of the way, we can get into the meat of the article.

Before the deep dive into Binary Quantization, let us look at a high-level overview of the techniques we are about to cover in this multi-part series.


| Technique | Problem solved | Performance impact | Complexity |
|---|---|---|---|
| Hybrid Search | Pure semantic search misses exact matches | Large accuracy gain, close to 16% | Medium |
| Binary Quantization | Memory costs scale linearly with data | 32X memory reduction, 15% faster | Low |
| Filterable HNSW | Post-filtering wastes computation | 5X faster filtered queries | Medium |
| Multi Vector | Advanced models need multiple embeddings per document | Enables ColBERT and multi-modal search | High |
| Distributed Architecture | A single node limits throughput and availability | 32X throughput and 99.99% uptime | High |



Keep in mind that production systems typically combine two to four of these techniques.

For example, a typical e-commerce website might use Hybrid Search, Binary Quantization, and Filterable HNSW.

We covered Hybrid Search in the first part of the series. In this part, we will dive into Binary Quantization.

Binary Quantization

Storage for vectors scales linearly with data volume and, in turn, creates unsustainable memory costs.

As a quick example, a 1,536-dimension vector consumes roughly 6 KB per document. Now consider 10 million documents — this requires 60 GB of RAM, even before accounting for indexing overhead. That translates to roughly $15,000 per month in cloud costs.
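The arithmetic behind these figures can be checked with a short sketch. The helper names below are hypothetical, the numbers cover raw vector storage only (no index or payload overhead), and the second case assumes the 384-dimensional output of the all-MiniLM-L6-v2 model used later in this article.

```python
def vector_memory_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes, ignoring index and payload overhead."""
    return num_vectors * dims * bits_per_dim // 8


def human(n_bytes: float) -> str:
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} PB"


if __name__ == "__main__":
    cases = [
        ("10M docs x 1,536 dims", 10_000_000, 1536),
        ("1M docs x 384 dims", 1_000_000, 384),
    ]
    for label, n, d in cases:
        full = vector_memory_bytes(n, d, 32)   # float32: 32 bits per dimension
        binary = vector_memory_bytes(n, d, 1)  # binary: 1 bit per dimension
        print(f"{label}: float32={human(full)}, 1-bit={human(binary)}, "
              f"ratio={full // binary}x")
```

Whatever the dimensionality, the ratio between the two representations is 32, because each 32-bit float is replaced by a single bit.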

Binary quantization helps compress vectors from float32 to a 1-bit binary representation, while storing full-precision vectors on disk for accuracy recovery through rescoring.

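Similar to Hybrid Search, let us look at the code for Binary Quantization. The script illustrates the idea on the client side: it fetches candidates from Qdrant with a regular query, binarizes the query and candidate vectors, and re-ranks the candidates by Hamming similarity.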

Python
 
"""Client-side illustration of binary quantization on top of a Qdrant collection."""

from qdrant_client import QdrantClient
from dotenv import load_dotenv
import os
from typing import List, Dict, Any, Optional
import numpy as np
import time

# Cache the embedding model globally
_embedding_model = None

def get_embedding_model():
    """Get or create the embedding model (cached)."""
    global _embedding_model
    if _embedding_model is None:
        try:
            from sentence_transformers import SentenceTransformer
            _embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        except ImportError:
            raise ImportError(
                "sentence-transformers not installed. Install it with: pip install sentence-transformers"
            )
    return _embedding_model


def get_qdrant_client() -> QdrantClient:
    """Initialize and return Qdrant client."""
    load_dotenv()
    return QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
    )


def float_to_binary(vector: List[float], threshold: Optional[float] = None) -> List[int]:
    """Convert a float vector to a binary vector."""
    vector_array = np.array(vector)
    if threshold is None:
        threshold = np.median(vector_array)
    return (vector_array >= threshold).astype(int).tolist()


def hamming_similarity(binary_vec1: List[int], binary_vec2: List[int]) -> float:
    """Return Hamming similarity: the fraction of positions where both vectors agree."""
    if len(binary_vec1) != len(binary_vec2):
        raise ValueError("Vectors must have the same length")
    vec1_array = np.array(binary_vec1, dtype=np.uint8)
    vec2_array = np.array(binary_vec2, dtype=np.uint8)
    # Count differing bits, then convert the distance into a similarity in [0, 1]
    differences = np.sum(vec1_array != vec2_array)
    return float(1.0 - differences / len(binary_vec1))


def binary_quantize_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10,
    threshold: Optional[float] = None
) -> List[Dict[str, Any]]:
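    """Re-rank Qdrant candidates by Hamming similarity of binarized vectors.

    Candidates come from a regular full-precision query and are binarized
    client-side, so this shows the scoring idea rather than Qdrant's
    built-in quantization.
    """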
    if client is None:
        client = get_qdrant_client()
    
    model = get_embedding_model()
    query_vector = model.encode(query).tolist()
    query_binary = float_to_binary(query_vector, threshold)
    query_binary_array = np.array(query_binary, dtype=np.uint8)
    
    try:
        candidate_limit = limit * 5
        search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=candidate_limit,
            with_payload=True,
            with_vectors=True
        )
        
        if not search_response.points:
            return []
        
        results = []
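        # When no threshold is given, stored vectors are binarized against the
        # query's median so that query and candidates share one threshold.
        threshold_val = threshold if threshold is not None else np.median(query_vector)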
        
        for point in search_response.points:
            if point.vector is None:
                continue
            
            vector = list(point.vector.values())[0] if isinstance(point.vector, dict) else point.vector
            vector_array = np.array(vector)
            stored_binary_array = (vector_array >= threshold_val).astype(np.uint8)
            differences = np.sum(query_binary_array != stored_binary_array)
            similarity = 1.0 - (differences / len(query_binary))
            
            results.append({
                "id": point.id,
                "score": similarity,
                "payload": point.payload,
                "binary_similarity": similarity
            })
        
        results.sort(key=lambda x: x["score"], reverse=True)
        return results[:limit]
        
    except Exception as e:
        print(f"Binary quantized search error: {e}")
        return []


def full_precision_search(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10
) -> List[Dict[str, Any]]:
    """Perform standard full precision vector search."""
    if client is None:
        client = get_qdrant_client()
    
    model = get_embedding_model()
    query_vector = model.encode(query).tolist()
    
    try:
        search_response = client.query_points(
            collection_name=collection_name,
            query=query_vector,
            limit=limit,
            with_payload=True
        )
        
        return [
            {
                "id": r.id,
                "score": r.score,
                "payload": r.payload,
                "cosine_similarity": r.score
            }
            for r in search_response.points
        ]
    except Exception as e:
        print(f"Full precision search error: {e}")
        return []


def compare_search_methods(
    collection_name: str,
    query: str,
    client: Optional[QdrantClient] = None,
    limit: int = 10,
    threshold: Optional[float] = None
) -> Dict[str, Any]:
    """Compare full precision search vs binary quantized search."""
    if client is None:
        client = get_qdrant_client()
    
    print(f"\nComparing Search Methods for: '{query}'")
    print("=" * 80)
    
    # Full precision search
    print("\n1. Full Precision Search (Cosine Similarity)")
    print("-" * 80)
    start_time = time.time()
    full_results = full_precision_search(
        collection_name=collection_name,
        query=query,
        client=client,
        limit=limit
    )
    full_time = time.time() - start_time
    
    # Binary quantized search
    print("\n2. Binary Quantized Search (Hamming Similarity)")
    print("-" * 80)
    start_time = time.time()
    binary_results = binary_quantize_search(
        collection_name=collection_name,
        query=query,
        client=client,
        limit=limit,
        threshold=threshold
    )
    binary_time = time.time() - start_time
    
    # Compare results
    full_ids = {r["id"] for r in full_results}
    binary_ids = {r["id"] for r in binary_results}
    
    overlap = len(full_ids & binary_ids)
    overlap_ratio = overlap / limit if limit > 0 else 0.0
    
    comparison = {
        "query": query,
        "full_precision": {
            "results": full_results,
            "time_ms": full_time * 1000,
            "count": len(full_results)
        },
        "binary_quantized": {
            "results": binary_results,
            "time_ms": binary_time * 1000,
            "count": len(binary_results)
        },
        "overlap": overlap,
        "overlap_ratio": overlap_ratio
    }
    
    # Display comparison summary
    print("\n" + "=" * 80)
    print("COMPARISON SUMMARY")
    print("=" * 80)
    print(f"Full Precision Search:")
    print(f"  Time: {full_time*1000:.2f} ms")
    print(f"  Results: {len(full_results)}")
    if full_results:
        print(f"  Top Score: {full_results[0]['score']:.4f}")
    
    print(f"\nBinary Quantized Search:")
    print(f"  Time: {binary_time*1000:.2f} ms")
    print(f"  Results: {len(binary_results)}")
    if binary_results:
        print(f"  Top Score: {binary_results[0]['score']:.4f}")
    
    print(f"\nOverlap:")
    print(f"  Common Results: {overlap} / {limit} ({overlap_ratio*100:.1f}%)")
    
    speedup = full_time / binary_time if binary_time > 0 else 0
    if speedup > 1:
        print(f"  Binary search is {speedup:.2f}x faster")
    elif speedup > 0:
        print(f"  Full precision search is {1/speedup:.2f}x faster")
    
    print("\n" + "=" * 80)
    
    return comparison


def display_binary_results(results: List[Dict[str, Any]], query: str, show_fields: Optional[List[str]] = None):
    """Display binary quantized search results."""
    if show_fields is None:
        show_fields = ['name', 'title', 'description', 'text']
    
    print(f"\nBinary Quantized Search Results for: '{query}'")
    print("=" * 80)
    
    if not results:
        print("No results found.")
        return
    
    print(f"Found {len(results)} results using binary quantization (Hamming similarity)\n")
    
    for i, result in enumerate(results, 1):
        payload = result["payload"]
        
        # Try to find a display name from common fields
        display_name = "Result"
        for field in ['name', 'title', 'part_name', 'id', 'part_id']:
            if field in payload:
                display_name = str(payload[field])
                break
        
        print(f"\n{i}. {display_name}")
        
        # Display other fields
        for field in show_fields:
            if field in payload and field not in ['name', 'title']:
                value = payload[field]
                if isinstance(value, str):
                    print(f"   {field.capitalize()}: {value[:100]}{'...' if len(value) > 100 else ''}")
        
        print(f"   Binary Similarity Score: {result['score']:.4f} (Hamming)")
        print(f"   {result['score']*100:.1f}% bit similarity")
        
        print("-" * 80)


if __name__ == "__main__":
    # Example usage: customize for your collection.
    # Example 1: Binary quantized search
    print("=" * 80)
    print("EXAMPLE 1: Binary Quantized Search")
    print("=" * 80)
    print("This demonstrates search using binary quantized vectors.\n")
    
    # Replace with your collection name and query
    collection_name = os.getenv("QDRANT_COLLECTION", "your_collection")
    query1 = "example query"
    
    try:
        client = get_qdrant_client()
        results1 = binary_quantize_search(
            collection_name=collection_name,
            query=query1,
            client=client,
            limit=5
        )
        display_binary_results(results1, query1, show_fields=['name', 'description'])
    except Exception as e:
        print(f"Error: {e}")
        print("\nTo use this script:")
        print("1. Set QDRANT_URL and QDRANT_API_KEY in your .env file")
        print("2. Set QDRANT_COLLECTION environment variable or update collection_name in code")
        print("3. Ensure your collection has vectors stored")
        print("4. Adjust show_fields to match your collection's payload structure")

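The script above keeps each bit as a full Python integer, which is convenient for reading but does not save any memory. As a rough sketch of how the 1-bit representation can be stored and compared, the snippet below packs eight dimensions into each byte with NumPy and counts differing bits through a lookup table. The function names are hypothetical, and it thresholds at zero (the sign of each value) instead of the median used in the script, which is an assumption made for simplicity.

```python
import numpy as np


def pack_sign_bits(vectors: np.ndarray) -> np.ndarray:
    """Binarize by sign (1 where the value is > 0) and pack 8 dimensions per byte."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)


# Lookup table: number of set bits for every possible byte value
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def packed_hamming_distance(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Number of differing bits between one packed code and each row of `codes`."""
    xor = np.bitwise_xor(codes, query_code)
    return _POPCOUNT[xor].sum(axis=-1, dtype=np.int64)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for 1,000 embeddings of 384 dimensions
    vectors = rng.standard_normal((1000, 384)).astype(np.float32)
    codes = pack_sign_bits(vectors)

    print(f"float32 storage: {vectors.nbytes} bytes")
    print(f"packed storage:  {codes.nbytes} bytes")

    # Compare the first vector against all others
    distances = packed_hamming_distance(codes[0], codes)
    nearest = np.argsort(distances, kind="stable")[:5]
    print("nearest by Hamming distance:", nearest.tolist())
```

A 384-dimensional float32 vector occupies 1,536 bytes, while its packed code occupies 48 bytes, which is where the 32X figure comes from.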

Let us look at the output of an example run of Binary Quantization against the same automotive parts collection.

Plain Text
 
python binary_quantization_example.py
================================================================================
EXAMPLE 1: Binary Quantized Search
================================================================================
Searching: 'collision detection device'
Expected: Finds similar parts using binary vector comparison


Binary Quantized Search Results for: 'collision detection device'
================================================================================
Found 5 results using binary quantization (Hamming similarity)


1. Safety Sensor Module 217
   Part_name: Safety Sensor Module 217
   Part_id: DEL-0000217
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6458 (Hamming)
   64.6% bit similarity
--------------------------------------------------------------------------------

2. Safety Sensor Module 250
   Part_name: Safety Sensor Module 250
   Part_id: TE-0000250
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6354 (Hamming)
   63.5% bit similarity
--------------------------------------------------------------------------------

3. Safety Sensor Module 233
   Part_name: Safety Sensor Module 233
   Part_id: TRW-0000233
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6328 (Hamming)
   63.3% bit similarity
--------------------------------------------------------------------------------

4. Safety Sensor Module 201
   Part_name: Safety Sensor Module 201
   Part_id: DEN-0000201
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6328 (Hamming)
   63.3% bit similarity
--------------------------------------------------------------------------------

5. Safety Sensor Module 223
   Part_name: Safety Sensor Module 223
   Part_id: MAG-0000223
   Category: Safety Systems
   Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea...
   Binary Similarity Score: 0.6328 (Hamming)
   63.3% bit similarity
--------------------------------------------------------------------------------



================================================================================
EXAMPLE 2: Full Precision vs Binary Quantized Comparison
================================================================================
Comparing search methods for: 'engine sensor'
Expected: Shows speed and quality differences between methods


Comparing Search Methods for: 'engine sensor'
================================================================================

1. Full Precision Search (Cosine Similarity)
--------------------------------------------------------------------------------

2. Binary Quantized Search (Hamming Similarity)
--------------------------------------------------------------------------------

================================================================================
COMPARISON SUMMARY
================================================================================
Full Precision Search:
  Time: 109.33 ms
  Results: 5
  Top Score: 0.4092

Binary Quantized Search:
  Time: 75.83 ms
  Results: 5
  Top Score: 0.6589


Benefits

The most obvious benefits are:

  • Memory reduction of ~32× for vector storage (32 bits per dimension become 1 bit)
  • Speed improvement of ~20%

The less obvious — but very significant — benefit is cost savings, especially in terms of cloud infrastructure costs.

Costs

Binary Quantization also comes with trade-offs:

  • Accuracy loss: 1–2% recall degradation without rescoring
  • Disk I/O: Rescoring requires additional disk reads
  • Setup complexity: Like Hybrid Search, Binary Quantization requires careful oversampling and tuning
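The oversampling and rescoring steps listed above can be sketched with NumPy alone. This is an illustration on synthetic random vectors, not a benchmark: the function names are hypothetical, the binarization thresholds at zero, and random Gaussian data will not behave like real embeddings, so the recall values it prints should not be read as representative.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale rows to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most similar rows by full-precision cosine similarity."""
    scores = vectors @ query
    return np.argsort(-scores)[:k]


def binary_top_k(query: np.ndarray, vectors: np.ndarray, k: int,
                 oversampling: float = 1.0, rescore: bool = True) -> np.ndarray:
    """Select candidates by Hamming distance on sign bits, then optionally rescore."""
    n_candidates = min(len(vectors), max(k, int(k * oversampling)))
    # Hamming distance on unpacked sign bits (kept unpacked here for readability)
    distances = np.count_nonzero((vectors > 0) != (query > 0), axis=1)
    candidates = np.argsort(distances, kind="stable")[:n_candidates]
    if not rescore:
        return candidates[:k]
    # Rescoring: re-rank only the candidates with the original float vectors
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]


def recall_at_k(approx: np.ndarray, exact: np.ndarray) -> float:
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    return len(set(approx.tolist()) & set(exact.tolist())) / len(exact)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    vectors = normalize(rng.standard_normal((5000, 384)))
    query = normalize(rng.standard_normal(384))
    exact = exact_top_k(query, vectors, k=10)
    for factor in (1.0, 2.0, 5.0, 10.0):
        approx = binary_top_k(query, vectors, k=10, oversampling=factor)
        print(f"oversampling={factor}: recall@10={recall_at_k(approx, exact):.2f}")
```

The trade-off is visible in the structure of `binary_top_k`: a larger oversampling factor means more full-precision vectors must be read for rescoring, which is the extra disk I/O mentioned above.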

When to Use

  • Large-scale deployments with millions of vectors
  • Projects with budget constraints
  • Use cases where <2% accuracy loss is acceptable

When Not to Use

  • Projects with very small datasets (e.g., <100K vectors) where memory is not a bottleneck
  • Projects with strict accuracy requirements, such as regulatory or medical systems
  • Applications with specialized hardware where RAM is abundant

Binary Quantization for Automotive Parts (1M+ Parts)

| Aspect | Full precision | Binary quantized |
|---|---|---|
| RAM needed | 1.5 GB | 46 MB |
| Query speed | 246 ms | 107 ms |
| Results quality | Perfect ranking | 80% different ranking but still relevant |
| Cost/month | $50-100 (cloud RAM) | $2-5 (cloud RAM) |


Performance Characteristics

Based on the results shared above, let us look at some performance characteristics of Binary Quantization:

| Metric | Full precision | Binary quantization | Evidence from the data |
|---|---|---|---|
| Query latency | 246 ms | 107 ms | 2.3X faster, as the search with Binary Quantization completed in less than half the time |
| Memory usage | 32 bits/dimension | 1 bit/dimension | A net 32X reduction in memory usage |
| Result overlap | Baseline (100%) | 20% | Only 1 out of 5 top results match between searches |
| Ranking accuracy | 1.0 (reference) | -0.80 | Different rankings, but all results still semantically relevant |
| Top score | 0.4092 (Cosine) | 0.6589 (Hamming) | Different distance metrics, not directly comparable |


Conclusion

From both the conceptual overview and the experimental results, Binary Quantization provides significant memory and speed gains, but at the cost of ranking fidelity: in this experiment, only 20% of the top results overlapped with the full-precision search. This can be critical for compliance- or safety-centric applications where perfect ranking is required.

However, Binary Quantization is an excellent fit for discovery-oriented systems, large result sets, and scenarios where speed and cost efficiency outweigh perfect ranking.

In the next part of the series, we will explore Filterable HNSW and how it impacts vector search performance.
