Scaling Image Deduplication: Finding Needles in a Haystack
Learn to efficiently deduplicate 100M+ images using distributed architectures, embeddings, FAISS for ANN search, and clustering to ensure accurate results.
In the current era of AI, organizations deal with vast inventories of images, and identifying duplicates among them can be a daunting task. Distributed deduplication at scale is essential for optimizing storage, reducing redundancy, and maintaining data integrity. This article provides insight into the architectural design and practical implementation for deduplicating 100 million images efficiently using state-of-the-art tools and approaches.
Challenges in Image Deduplication
Scale
Processing millions or even billions of images demands:
- High throughput ingestion pipelines to handle large volumes of data efficiently
- Distributed architectures to ensure scalability across multiple nodes and GPUs
Deduplication at scale requires striking a balance between computational cost and accuracy. Choosing methodologies like approximate nearest neighbor (ANN) search allows us to achieve speed without sacrificing too much precision. Furthermore, hierarchical clustering ensures that results are grouped meaningfully for deduplication tasks.
Accuracy
Capturing perceptual similarity is particularly challenging due to:
- Variations in resolution, cropping, and compression artifacts across images
- The need to identify visually identical images even if minor differences exist, such as watermarks or slight rotations
Techniques like perceptual hashing and embedding-based similarity searches were chosen because they address these challenges by focusing on visual and semantic content rather than raw pixel data.
Latency
Real-time deduplication, especially in applications like content moderation or dynamic storage optimization, requires low-latency solutions. By using CUDA for GPU acceleration and libraries like FAISS, which are optimized for quick similarity searches, latency can be significantly reduced.
Architecture
Image Ingestion
Image ingestion forms the backbone of the pipeline, ensuring that data is collected and organized efficiently from diverse sources. Parallel ingestion pipelines allow high throughput operations, ensuring minimal bottlenecks when dealing with millions of images.
- Parallel processing. Loading images sequentially would severely limit throughput. Multi-threading and distributed file systems allow ingestion to scale linearly with available compute resources.
- Format handling. Diverse image formats require robust libraries capable of processing everything from PNG to RAW without failing.
import os
from concurrent.futures import ThreadPoolExecutor

image_dir = "my_images/"

def load_image(image_path):
    with open(image_path, 'rb') as f:
        return f.read()

image_paths = [os.path.join(image_dir, f) for f in os.listdir(image_dir)]

# Parallel image ingestion; tune max_workers for the available I/O bandwidth
with ThreadPoolExecutor(max_workers=8) as executor:
    images = list(executor.map(load_image, image_paths))
This implementation reads multiple files concurrently, making full use of the available I/O bandwidth. By wrapping these I/O-heavy operations in threads, the latency caused by waiting on disk reads is significantly reduced.
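At production scale, images typically live in object storage rather than on a local disk. The following is a minimal sketch of the same pattern against S3, assuming boto3 and a hypothetical bucket name; a paginator would be needed beyond 1,000 objects.
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')
bucket = "my-image-bucket"   # Hypothetical bucket name
prefix = "images/"           # Hypothetical key prefix

# List object keys (single page shown; use a paginator for more than 1,000 objects)
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [obj['Key'] for obj in response.get('Contents', [])]

def download_image(key):
    # Download the raw bytes of one object
    return s3.get_object(Bucket=bucket, Key=key)['Body'].read()

# Parallel downloads, mirroring the local-filesystem pipeline above
with ThreadPoolExecutor(max_workers=8) as executor:
    image_bytes = list(executor.map(download_image, keys))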
Embedding Extraction
Embedding extraction transforms images into high-dimensional vectors that represent their visual and semantic content. These vectors are the foundation of similarity search, making this step critical for accurate deduplication.
- Semantic representation. Unlike comparing raw pixels, embeddings capture high-level features like shapes, textures, and patterns. This ensures that semantically similar images are clustered together even if minor differences exist among them.
- Scalability. High-dimensional vectors are compatible with tools like FAISS, which is specifically designed for efficient retrieval from massive datasets.
CUDA-Accelerated Extraction
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained ResNet model
model = models.resnet50(pretrained=True)
model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove classification layer
model.eval().cuda()

def extract_embeddings(image_paths):
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    embeddings = []
    for path in image_paths:
        image = Image.open(path).convert('RGB')
        input_tensor = preprocess(image).unsqueeze(0).cuda()
        with torch.no_grad():
            embedding = model(input_tensor).squeeze().cpu().numpy()
        embeddings.append(embedding)
    return embeddings

image_embeddings = extract_embeddings(image_paths)
By leveraging pre-trained models, this method provides a plug-and-play solution for extracting embeddings without the need for extensive training. CUDA acceleration ensures that even large batches of images can be processed in parallel, drastically reducing compute time.
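The loop above pushes one image through the model at a time. A minimal batched variant is sketched below; it reuses the model defined above, and the batch size of 32 is an assumed starting point that should be tuned to the available GPU memory.
import torch
from torchvision import transforms
from PIL import Image

def extract_embeddings_batched(image_paths, batch_size=32):
    # Same preprocessing as above; batch_size=32 is an assumed starting point
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    embeddings = []
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        # Preprocess the batch and stack into a single (B, 3, 224, 224) tensor
        batch = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in batch_paths]).cuda()
        with torch.no_grad():
            # ResNet backbone output is (B, 2048, 1, 1); flatten to (B, 2048)
            out = model(batch).squeeze(-1).squeeze(-1).cpu().numpy()
        embeddings.extend(out)
    return embeddings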
Approximate Nearest Neighbor Search
ANN search enables rapid similarity lookups within a large dataset of semantic embeddings. We use FAISS, as described earlier, because it handles billions of embeddings efficiently and offers GPU support.
FAISS-Index Setup
import faiss
import numpy as np

# Convert embeddings to a numpy array
embeddings_np = np.array(image_embeddings).astype('float32')

# Create FAISS index
d = embeddings_np.shape[1]    # Dimension of embeddings
index = faiss.IndexFlatL2(d)  # L2 distance metric
index.add(embeddings_np)      # Add embeddings to index

# ANN search
k = 5  # Number of nearest neighbors to retrieve
query_embedding = embeddings_np[0].reshape(1, -1)
D, I = index.search(query_embedding, k)  # D: distances, I: indices

print("Distances:", D)
print("Indices:", I)
FAISS's index structures allow efficient memory management, and sharding indices across nodes provides horizontal scalability.
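The IndexFlatL2 index above performs an exact, brute-force search. At the scale of 100 million images, an inverted-file (IVF) index trades a small amount of recall for much faster queries. A minimal sketch follows; nlist and nprobe are assumed starting values that must be tuned to the dataset size.
import faiss

d = embeddings_np.shape[1]
nlist = 4096                      # Number of coarse clusters (assumption; scale with dataset size)
quantizer = faiss.IndexFlatL2(d)  # Coarse quantizer that assigns vectors to clusters
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

index_ivf.train(embeddings_np)    # IVF indices must be trained before vectors are added
index_ivf.add(embeddings_np)

index_ivf.nprobe = 32             # Clusters scanned per query; higher = better recall, slower search
D, I = index_ivf.search(embeddings_np[:5], 5)
For distribution across nodes, each node can build an IVF index over its own shard of embeddings; queries are then fanned out to all shards and the top-k results merged.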
Clustering for Deduplication
Clustering organizes similar embeddings into groups, identifying duplicate sets. Hierarchical clustering is a great option here because it doesn’t require predefining the number of clusters and can adapt dynamically to varying levels of similarity.
- Flexibility. Unlike k-means, hierarchical clustering dynamically adapts to the structure of the data.
- Interpretability. The resulting dendrogram allows fine-grained control over cluster thresholds, making it ideal for deduplication tasks.
from scipy.cluster.hierarchy import linkage, fcluster

# Hierarchical clustering
Z = linkage(embeddings_np, method='ward')  # Ward's method for clustering
threshold = 1.0                            # Distance threshold for defining clusters
clusters = fcluster(Z, t=threshold, criterion='distance')

# Review cluster assignments
for i, cluster_id in enumerate(clusters):
    print(f"Image {i} -> Cluster {cluster_id}")
This clustering ensures that near-duplicates are grouped together while distinct images remain in separate clusters. By adjusting the threshold, the sensitivity of deduplication can be tuned for specific domains.
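With cluster assignments in hand, the actual deduplication step is straightforward: keep one representative per cluster and flag the rest. A minimal sketch, assuming the first image in each cluster is kept as the canonical copy:
from collections import defaultdict

# Group image indices by cluster ID
groups = defaultdict(list)
for idx, cluster_id in enumerate(clusters):
    groups[cluster_id].append(idx)

# Keep the first image of each cluster; the rest are flagged as duplicates
keep, duplicates = [], []
for cluster_id, members in groups.items():
    keep.append(members[0])
    duplicates.extend(members[1:])

print(f"Unique images kept: {len(keep)}, duplicates flagged: {len(duplicates)}")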
Techniques for Enhanced Deduplication
Perceptual Hashing
Perceptual hashing generates a compact representation of an image based on its visual features. It provides a lightweight alternative for identifying exact or near-duplicate images.
from imagehash import phash
from PIL import Image
hashes = {path: phash(Image.open(path)) for path in image_paths}
Perceptual hashing is especially useful for quick initial filtering before applying more computationally intensive techniques like embedding extraction.
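Subtracting two ImageHash objects yields the Hamming distance between them, so near-duplicate candidates can be pre-filtered cheaply. The sketch below uses an assumed cutoff of 5 bits and a brute-force pairwise comparison, which is fine for illustration; at 100-million-image scale the hashes would instead be bucketed or indexed.
from itertools import combinations

hash_threshold = 5  # Assumed Hamming-distance cutoff; tune per dataset

# Pairs whose hashes differ by only a few bits are likely near-duplicates
candidate_pairs = [
    (p1, p2)
    for p1, p2 in combinations(hashes.keys(), 2)
    if hashes[p1] - hashes[p2] <= hash_threshold
]
print(f"Candidate near-duplicate pairs: {len(candidate_pairs)}")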
Combining ANN With Clustering
Combining FAISS for ANN and hierarchical clustering leverages the strengths of both techniques. ANN ensures speed, while clustering provides interpretability and accuracy.
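Concretely, FAISS can retrieve each image's nearest neighbors in one batched call, and only pairs within a distance threshold are merged into duplicate groups. The sketch below uses the flat index built earlier; the threshold of 0.5 is an assumed value (IndexFlatL2 returns squared L2 distances), and a simple union-find stands in for full hierarchical clustering.
k = 5
distance_threshold = 0.5  # Assumed squared-L2 threshold; tune per embedding model

# Retrieve k nearest neighbors for every embedding in one batched call
D, I = index.search(embeddings_np, k)

# Union-find: merge images connected through a sufficiently close neighbor
parent = list(range(len(embeddings_np)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # Path compression
        x = parent[x]
    return x

for i in range(len(embeddings_np)):
    for dist, j in zip(D[i], I[i]):
        if i != j and dist <= distance_threshold:
            parent[find(i)] = find(j)

# Collect duplicate groups keyed by their root representative
duplicate_groups = {}
for i in range(len(embeddings_np)):
    duplicate_groups.setdefault(find(i), []).append(i)

print(f"Found {sum(1 for g in duplicate_groups.values() if len(g) > 1)} duplicate groups")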
Deduplication of 100 Million Images
At a scale of 100 million images:
- Data volume. 50 TB of images (PNG, JPG, RAW, etc.) stored across a distributed filesystem.
- Infrastructure. Kubernetes cluster with 100 GPU-enabled nodes.
- Workflow:
- Images were ingested from S3 buckets or similar sources using a multi-threaded pipeline.
- Embeddings were extracted using a PyTorch-based pre-trained ResNet model with CUDA.
- FAISS indices were sharded across nodes for parallel ANN search.
- Hierarchical clustering grouped near-duplicate images for removal.
Conclusion
Distributed deduplication at scale combines the best of modern technologies like CUDA, FAISS, and clustering to tackle massive datasets. The architecture and optimization techniques provide a framework for optimizing storage, enhancing data quality, and uncovering valuable insights from the image inventory. By balancing scale and accuracy, this approach ensures future-proof deduplication systems.