Scaling Image Deduplication: Finding Needles in a Haystack
Learn to efficiently deduplicate 100M+ images using distributed architectures, embeddings, FAISS for ANN search, and clustering to ensure accurate results.
In the current era of AI, organizations deal with vast inventories of images, and identifying duplicates among them can be a daunting task. Distributed deduplication at scale is essential for optimizing storage, reducing redundancy, and maintaining data integrity. This article provides insight into the architectural design and practical implementation for deduplicating 100 million images efficiently using state-of-the-art tools and approaches.
Challenges in Image Deduplication
Scale
Processing millions or even billions of images demands:
- High throughput ingestion pipelines to handle large volumes of data efficiently
- Distributed architectures to ensure scalability across multiple nodes and GPUs
Deduplication at scale requires striking a balance between computational cost and accuracy. Choosing methodologies like approximate nearest neighbor (ANN) search allows us to achieve speed without sacrificing too much precision. Furthermore, hierarchical clustering ensures that results are grouped meaningfully for deduplication tasks.
Accuracy
Capturing perceptual similarity is particularly challenging due to:
- Variations in resolution, cropping, and compression artifacts across images
- The need to identify visually identical images even if minor differences exist, such as watermarks or slight rotations
Techniques like perceptual hashing and embedding-based similarity searches were chosen because they address these challenges by focusing on visual and semantic content rather than raw pixel data.
Latency
Real-time deduplication, especially in applications like content moderation or dynamic storage optimization, requires low-latency solutions. By using CUDA for GPU acceleration and libraries like FAISS, which are optimized for quick similarity searches, latency can be significantly reduced.
Architecture
Image Ingestion
Image ingestion forms the backbone of the pipeline, ensuring that data is collected and organized efficiently from diverse sources. Parallel ingestion pipelines allow high throughput operations, ensuring minimal bottlenecks when dealing with millions of images.
- Parallel processing. Loading images sequentially would severely limit throughput. Multi-threading and distributed file systems allow ingestion to scale linearly with available compute resources.
- Format handling. Diverse image formats require robust libraries capable of processing everything from PNG to RAW without failing.
import os
from concurrent.futures import ThreadPoolExecutor

image_dir = "my_images/"

def load_image(image_path):
    with open(image_path, 'rb') as f:
        return f.read()

image_paths = [os.path.join(image_dir, f) for f in os.listdir(image_dir)]

# Parallel image ingestion; tune max_workers for the available I/O bandwidth
with ThreadPoolExecutor(max_workers=8) as executor:
    images = list(executor.map(load_image, image_paths))
This implementation reads multiple files concurrently, making full use of the available I/O bandwidth. By wrapping these I/O-heavy operations in threads, the latency caused by waiting on disk reads is significantly reduced.
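At production scale, images typically live in object storage rather than on a local disk. The following is a minimal sketch of the same pattern against S3, assuming boto3 and a hypothetical bucket name; a paginator would be needed beyond 1,000 objects.
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')
bucket = "my-image-bucket"   # Hypothetical bucket name
prefix = "images/"           # Hypothetical key prefix

# List object keys (single page shown; use a paginator for more than 1,000 objects)
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [obj['Key'] for obj in response.get('Contents', [])]

def download_image(key):
    # Download the raw bytes of one object
    return s3.get_object(Bucket=bucket, Key=key)['Body'].read()

# Parallel downloads, mirroring the local-filesystem pipeline above
with ThreadPoolExecutor(max_workers=8) as executor:
    image_bytes = list(executor.map(download_image, keys))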
Embedding Extraction
Embedding extraction transforms images into high-dimensional vectors that represent their visual and semantic content. These vectors are the foundation of similarity search, making this step critical for accurate deduplication.
- Semantic representation. Unlike comparing raw pixels, embeddings capture high-level features like shapes, textures, and patterns. This ensures that semantically similar images are clustered together even if minor differences exist among them.
- Scalability. High-dimensional vectors are compatible with tools like FAISS, which is specifically designed for efficient retrieval from massive datasets.
CUDA-Accelerated Extraction
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained ResNet model
model = models.resnet50(pretrained=True)
model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove classification layer
model.eval().cuda()

def extract_embeddings(image_paths):
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    embeddings = []
    for path in image_paths:
        image = Image.open(path).convert('RGB')
        input_tensor = preprocess(image).unsqueeze(0).cuda()
        with torch.no_grad():
            embedding = model(input_tensor).squeeze().cpu().numpy()
        embeddings.append(embedding)
    return embeddings

image_embeddings = extract_embeddings(image_paths)
By leveraging pre-trained models, this method provides a plug-and-play solution for extracting embeddings without the need for extensive training. CUDA acceleration ensures that even large batches of images can be processed in parallel, drastically reducing compute time.
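The loop above pushes one image through the model at a time. A minimal batched variant is sketched below; it reuses the model defined above, and the batch size of 32 is an assumed starting point that should be tuned to the available GPU memory.
import torch
from torchvision import transforms
from PIL import Image

def extract_embeddings_batched(image_paths, batch_size=32):
    # Same preprocessing as above; batch_size=32 is an assumed starting point
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    embeddings = []
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        # Preprocess the batch and stack into a single (B, 3, 224, 224) tensor
        batch = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in batch_paths]).cuda()
        with torch.no_grad():
            # ResNet backbone output is (B, 2048, 1, 1); flatten to (B, 2048)
            out = model(batch).squeeze(-1).squeeze(-1).cpu().numpy()
        embeddings.extend(out)
    return embeddings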
Approximate Nearest Neighbor Search
ANN search enables rapid similarity lookups within a large dataset of semantic embeddings. We use FAISS, as described earlier, because it handles billions of embeddings efficiently and offers GPU support.
FAISS-Index Setup
import faiss
import numpy as np

# Convert embeddings to a numpy array
embeddings_np = np.array(image_embeddings).astype('float32')

# Create FAISS index
d = embeddings_np.shape[1]    # Dimension of embeddings
index = faiss.IndexFlatL2(d)  # L2 distance metric
index.add(embeddings_np)      # Add embeddings to index

# ANN search
k = 5  # Number of nearest neighbors to retrieve
query_embedding = embeddings_np[0].reshape(1, -1)
D, I = index.search(query_embedding, k)  # D: distances, I: indices

print("Distances:", D)
print("Indices:", I)
FAISS's index structures allow efficient memory management, and sharding indices across nodes provides horizontal scalability.
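The IndexFlatL2 index above performs an exact, brute-force search. At the scale of 100 million images, an inverted-file (IVF) index trades a small amount of recall for much faster queries. A minimal sketch follows; nlist and nprobe are assumed starting values that must be tuned to the dataset size.
import faiss

d = embeddings_np.shape[1]
nlist = 4096                      # Number of coarse clusters (assumption; scale with dataset size)
quantizer = faiss.IndexFlatL2(d)  # Coarse quantizer that assigns vectors to clusters
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

index_ivf.train(embeddings_np)    # IVF indices must be trained before vectors are added
index_ivf.add(embeddings_np)

index_ivf.nprobe = 32             # Clusters scanned per query; higher = better recall, slower search
D, I = index_ivf.search(embeddings_np[:5], 5)
For distribution across nodes, each node can build an IVF index over its own shard of embeddings; queries are then fanned out to all shards and the top-k results merged.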
Clustering for Deduplication
Clustering organizes similar embeddings into groups, identifying duplicate sets. Hierarchical clustering is a great option here because it doesn’t require predefining the number of clusters and can adapt dynamically to varying levels of similarity.
- Flexibility. Unlike k-means, hierarchical clustering dynamically adapts to the structure of the data.
- Interpretability. The resulting dendrogram allows fine-grained control over cluster thresholds, making it ideal for deduplication tasks.
from scipy.cluster.hierarchy import linkage, fcluster

# Hierarchical clustering
Z = linkage(embeddings_np, method='ward')  # Ward's method for clustering
threshold = 1.0                            # Distance threshold for defining clusters
clusters = fcluster(Z, t=threshold, criterion='distance')

# Review cluster assignments
for i, cluster_id in enumerate(clusters):
    print(f"Image {i} -> Cluster {cluster_id}")
This clustering ensures that near-duplicates are grouped together while distinct images remain in separate clusters. By adjusting the threshold, the sensitivity of deduplication can be tuned for specific domains.
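With cluster assignments in hand, the actual deduplication step is straightforward: keep one representative per cluster and flag the rest. A minimal sketch, assuming the first image in each cluster is kept as the canonical copy:
from collections import defaultdict

# Group image indices by cluster ID
groups = defaultdict(list)
for idx, cluster_id in enumerate(clusters):
    groups[cluster_id].append(idx)

# Keep the first image of each cluster; the rest are flagged as duplicates
keep, duplicates = [], []
for cluster_id, members in groups.items():
    keep.append(members[0])
    duplicates.extend(members[1:])

print(f"Unique images kept: {len(keep)}, duplicates flagged: {len(duplicates)}")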
Techniques for Enhanced Deduplication
Perceptual Hashing
Perceptual hashing generates a compact representation of an image based on its visual features. It provides a lightweight alternative for identifying exact or near-duplicate images.
from imagehash import phash
from PIL import Image
hashes = {path: phash(Image.open(path)) for path in image_paths}
Perceptual hashing is especially useful for quick initial filtering before applying more computationally intensive techniques like embedding extraction.
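Subtracting two ImageHash objects yields the Hamming distance between them, so near-duplicate candidates can be pre-filtered cheaply. The sketch below uses an assumed cutoff of 5 bits and a brute-force pairwise comparison, which is fine for illustration; at 100-million-image scale the hashes would instead be bucketed or indexed.
from itertools import combinations

hash_threshold = 5  # Assumed Hamming-distance cutoff; tune per dataset

# Pairs whose hashes differ by only a few bits are likely near-duplicates
candidate_pairs = [
    (p1, p2)
    for p1, p2 in combinations(hashes.keys(), 2)
    if hashes[p1] - hashes[p2] <= hash_threshold
]
print(f"Candidate near-duplicate pairs: {len(candidate_pairs)}")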
Combining ANN With Clustering
Combining FAISS for ANN and hierarchical clustering leverages the strengths of both techniques. ANN ensures speed, while clustering provides interpretability and accuracy.
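Concretely, FAISS can retrieve each image's nearest neighbors in one batched call, and only pairs within a distance threshold are merged into duplicate groups. The sketch below uses the flat index built earlier; the threshold of 0.5 is an assumed value (IndexFlatL2 returns squared L2 distances), and a simple union-find stands in for full hierarchical clustering.
k = 5
distance_threshold = 0.5  # Assumed squared-L2 threshold; tune per embedding model

# Retrieve k nearest neighbors for every embedding in one batched call
D, I = index.search(embeddings_np, k)

# Union-find: merge images connected through a sufficiently close neighbor
parent = list(range(len(embeddings_np)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # Path compression
        x = parent[x]
    return x

for i in range(len(embeddings_np)):
    for dist, j in zip(D[i], I[i]):
        if i != j and dist <= distance_threshold:
            parent[find(i)] = find(j)

# Collect duplicate groups keyed by their root representative
duplicate_groups = {}
for i in range(len(embeddings_np)):
    duplicate_groups.setdefault(find(i), []).append(i)

print(f"Found {sum(1 for g in duplicate_groups.values() if len(g) > 1)} duplicate groups")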
Deduplication of 100 Million Images
At a scale of 100 million images:
- Data volume. 50 TB of images (PNG, JPG, RAW, etc.) stored across a distributed filesystem.
- Infrastructure. Kubernetes cluster with 100 GPU-enabled nodes.
- Workflow:
- Images were ingested from S3 buckets or similar sources using a multi-threaded pipeline.
- Embeddings were extracted using a PyTorch-based pre-trained ResNet model with CUDA.
- FAISS indices were sharded across nodes for parallel ANN search.
- Hierarchical clustering grouped near-duplicate images for removal.
Conclusion
Distributed deduplication at scale combines the best of modern technologies like CUDA, FAISS, and clustering to tackle massive datasets. The architecture and optimization techniques provide a framework for optimizing storage, enhancing data quality, and uncovering valuable insights from the image inventory. By balancing scale and accuracy, this approach ensures future-proof deduplication systems.