Scaling Image Deduplication: Finding Needles in a Haystack

Learn to efficiently deduplicate 100M+ images using distributed architectures, embeddings, FAISS for ANN search, and clustering to ensure accurate results.

By Praneeth Reddy Vatti · Feb. 20, 25 · Analysis

In the current era of AI, organizations manage vast inventories of images, and identifying duplicates among them can be a daunting task. Distributed deduplication at scale is essential for optimizing storage, reducing redundancy, and maintaining data integrity. This article describes an architectural design and a practical implementation for deduplicating 100 million images efficiently using state-of-the-art tools and approaches.

Challenges in Image Deduplication

Scale

Processing millions or even billions of images demands:

  1. High throughput ingestion pipelines to handle large volumes of data efficiently
  2. Distributed architectures to ensure scalability across multiple nodes and GPUs

Deduplication at scale requires striking a balance between computational cost and accuracy. Approximate nearest neighbor (ANN) search allows us to achieve speed without sacrificing too much precision. Furthermore, hierarchical clustering groups the results meaningfully for deduplication tasks.
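To make the trade-off concrete, here is a rough back-of-the-envelope sketch comparing exhaustive pairwise comparison with one k-nearest-neighbor lookup per image. The value k = 5 matches the FAISS example later in this article. The variable names are illustrative, and the count ignores the cost of the lookups themselves.

```python
# Worked arithmetic: candidate pairs under exhaustive comparison vs. k-NN lookups.
n_images = 100_000_000   # dataset size discussed in this article
k = 5                    # neighbors retrieved per query

# Every unordered pair compared once: n * (n - 1) / 2
exhaustive_pairs = n_images * (n_images - 1) // 2

# One k-NN query per image yields at most n * k candidate pairs to verify.
candidate_pairs = n_images * k

print(f"exhaustive pairs: {exhaustive_pairs:.3e}")
print(f"k-NN candidates:  {candidate_pairs:.3e}")
print(f"reduction factor: {exhaustive_pairs / candidate_pairs:.1e}")
```

The gap of roughly seven orders of magnitude is why the rest of the pipeline is built around neighbor search rather than all-pairs comparison.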

Accuracy

Capturing perceptual similarity is particularly challenging due to:

  1. Variations in resolution, cropping, and compression artifacts across images
  2. The need to identify visually identical images even if minor differences exist, such as watermarks or slight rotations

Techniques like perceptual hashing and embedding-based similarity searches were chosen because they address these challenges by focusing on visual and semantic content rather than raw pixel data.

Latency

Real-time deduplication, especially in applications like content moderation or dynamic storage optimization, requires low-latency solutions. By using CUDA for GPU acceleration and libraries like FAISS, which are optimized for quick similarity searches, latency can be significantly reduced.

Architecture

Image Ingestion

Image ingestion forms the backbone of the pipeline: it collects and organizes data from diverse sources. Parallel ingestion pipelines sustain high throughput and minimize bottlenecks when dealing with millions of images.

  • Parallel processing. Loading images sequentially would severely limit throughput. Multi-threading and distributed file systems let ingestion throughput grow with the available compute and I/O resources.
  • Format handling. Diverse image formats require robust libraries capable of processing everything from PNG to RAW without failing.
Python
 
import os
from concurrent.futures import ThreadPoolExecutor

image_dir = "my_images/"

def load_image(image_path):
    # Read the raw bytes of one image file
    with open(image_path, 'rb') as f:
        return f.read()

# Collect regular files only (skip subdirectories)
image_paths = [
    os.path.join(image_dir, f)
    for f in os.listdir(image_dir)
    if os.path.isfile(os.path.join(image_dir, f))
]

# Parallel image ingestion; tune max_workers to the storage backend
with ThreadPoolExecutor(max_workers=8) as executor:
    images = list(executor.map(load_image, image_paths))


This implementation reads multiple files concurrently, making better use of the available I/O bandwidth. Because these I/O-heavy operations run in separate threads, the time one thread spends waiting on a disk read overlaps with the reads issued by the others.
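Holding every image in one list does not scale to millions of files, so a batched variant is one way to keep memory bounded. The sketch below is a minimal illustration, not the article's production pipeline: the helper names (iter_image_paths, read_bytes, iter_batches), the extension set, and the batch size are all assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical set of extensions; adjust to the formats in your inventory.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}

def iter_image_paths(image_dir):
    """Yield image file paths lazily instead of building one large list."""
    with os.scandir(image_dir) as entries:
        for entry in entries:
            ext = os.path.splitext(entry.name)[1].lower()
            if entry.is_file() and ext in IMAGE_EXTENSIONS:
                yield entry.path

def read_bytes(path):
    """Return (path, bytes), or (path, None) if the file cannot be read."""
    try:
        with open(path, "rb") as f:
            return path, f.read()
    except OSError:
        return path, None

def iter_batches(paths, batch_size=256, max_workers=8):
    """Read files concurrently, one bounded batch at a time."""
    paths = iter(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while True:
            batch = list(islice(paths, batch_size))
            if not batch:
                break
            yield list(executor.map(read_bytes, batch))
```

A caller would loop with `for batch in iter_batches(iter_image_paths("my_images/")):` and pass each batch on to the next stage, so only one batch of raw bytes is resident at a time. Unreadable files come back as `None` rather than aborting the whole run.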

Embedding Extraction

Embedding extraction transforms images into high-dimensional vectors that represent their visual and semantic content. These vectors are the foundation of similarity search, which makes this step critical for accurate deduplication.

  • Semantic representation. Unlike raw pixel comparison, embeddings capture high-level features like shapes, textures, and patterns. As a result, semantically similar images end up close together even if minor differences exist among them.
  • Scalability. High-dimensional vectors are compatible with tools like FAISS, which is specifically designed for efficient retrieval from massive datasets.

CUDA-Accelerated Extraction

Python
 
import torch
from torchvision import models, transforms
from PIL import Image

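#Pre-trained ResNet model (torchvision >= 0.13 uses the `weights` argument;
#`pretrained=True` is deprecated and loaded these same ImageNet V1 weights)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model = torch.nn.Sequential(*list(model.children())[:-1])  #Remove classification layer
model.eval().cuda()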

def extract_embeddings(image_paths):
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    embeddings = []
    for path in image_paths:
        image = Image.open(path).convert('RGB')
        input_tensor = preprocess(image).unsqueeze(0).cuda()
        with torch.no_grad():
            embedding = model(input_tensor).squeeze().cpu().numpy()
        embeddings.append(embedding)
    return embeddings

image_embeddings = extract_embeddings(image_paths)


By leveraging pre-trained models, this method provides a plug-and-play solution for extracting embeddings without the need for extensive training. The loop above runs one image per forward pass for clarity; stacking preprocessed tensors into batches lets the GPU process many images in parallel, which substantially reduces compute time.

Approximate Nearest Neighbor Search

ANN search enables rapid similarity lookups within a large dataset of semantic embeddings. We use FAISS, introduced earlier, because of its GPU support and its ability to handle billions of embeddings efficiently.

FAISS-Index Setup

Python
 
import faiss
import numpy as np

#Convert embeddings to numpy array
embeddings_np = np.array(image_embeddings).astype('float32')

#Create FAISS index
d = embeddings_np.shape[1]  #Dimension of embeddings
index = faiss.IndexFlatL2(d)  #Exact (brute-force) L2 index; a simple baseline
index.add(embeddings_np)  #Add embeddings to index

#Nearest-neighbor search
k = 5  #Number of nearest neighbors to retrieve
query_embedding = embeddings_np[0].reshape(1, -1)
D, I = index.search(query_embedding, k)  #D: squared L2 distances, I: indices
print("Distances:", D)
print("Indices:", I)


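IndexFlatL2 compares the query against every stored vector, so its results are exact rather than approximate. For datasets of this size, FAISS also provides approximate index types such as IndexIVFFlat (inverted lists) and IndexHNSWFlat (a hierarchical graph), which trade a small amount of recall for much faster search. Sharding indices across nodes provides horizontal scalability.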
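When an index is sharded, each shard returns its own nearest neighbors, and those partial results have to be combined into one global answer. The following is a minimal sketch of that merge step in plain Python. It assumes each shard reports hits as (distance, global_id) tuples sorted by ascending distance; the function name and the sample values are hypothetical.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, k):
    """Merge per-shard (distance, global_id) hits into the global top-k.

    shard_results: one list per shard, each already sorted by ascending
    distance, as a per-shard nearest-neighbor search would return them.
    """
    merged = heapq.merge(*shard_results)  # lazy k-way merge of sorted lists
    return list(islice(merged, k))

# Made-up results from three shards for a single query.
shard_a = [(0.00, 17), (0.42, 3), (0.90, 8)]
shard_b = [(0.05, 1042), (0.61, 1100)]
shard_c = [(0.30, 2077)]

top3 = merge_shard_results([shard_a, shard_b, shard_c], k=3)
print(top3)
```

Because every shard's list is already sorted, the merge only needs to look at the head of each list, so the cost of combining results stays small compared with the searches themselves.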

Clustering for Deduplication

Clustering organizes similar embeddings into groups, each of which represents a set of duplicates. Hierarchical clustering is a good fit here because it doesn’t require predefining the number of clusters and can adapt to varying levels of similarity.

  • Flexibility. Unlike k-means, hierarchical clustering dynamically adapts to the structure of the data.
  • Interpretability. The resulting dendrogram allows fine-grained control over cluster thresholds, making it ideal for deduplication tasks.
Python
 
from scipy.cluster.hierarchy import linkage, fcluster

#Hierarchical clustering
Z = linkage(embeddings_np, method='ward')  #Ward's method for clustering
threshold = 1.0  #Distance threshold for defining clusters
clusters = fcluster(Z, t=threshold, criterion='distance')

#Review cluster assignments
for i, cluster_id in enumerate(clusters):
    print(f"Image {i} -> Cluster {cluster_id}")


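This clustering groups near-duplicates together while keeping distinct images in separate clusters. By adjusting the threshold, the sensitivity of deduplication can be tuned for specific domains. Note that linkage works from all pairwise distances, so its memory use grows quadratically with the number of embeddings; in practice it is applied to small candidate sets rather than to the full collection at once.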
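Once every image has a cluster label, the labels still have to be turned into a decision about which files to keep. The sketch below shows one simple policy, under the assumption that the first image in each cluster is kept and the rest are marked for removal. The function names and the sample labels are illustrative stand-ins for the `clusters` array returned by fcluster.

```python
from collections import defaultdict

def duplicate_groups(cluster_labels):
    """Map each cluster label to its member indices, keeping groups of 2+."""
    groups = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        groups[label].append(index)
    return {label: members for label, members in groups.items() if len(members) > 1}

def removal_plan(cluster_labels):
    """Keep the lowest-index image of each duplicate group; mark the rest.

    Images in singleton clusters are not duplicates and appear in neither list.
    """
    keep, remove = [], []
    for members in duplicate_groups(cluster_labels).values():
        keep.append(members[0])
        remove.extend(members[1:])
    return keep, remove

labels = [1, 2, 1, 3, 2, 1]  # stand-in for the `clusters` array
keep, remove = removal_plan(labels)
print("keep:", keep)
print("remove:", remove)
```

In a real pipeline the keeper would more likely be chosen by resolution, file size, or upload date than by position; the index-based rule here only keeps the example short.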

Techniques for Enhanced Deduplication

Perceptual Hashing

Perceptual hashing generates a compact representation of an image based on its visual features. It provides a lightweight alternative for identifying exact or near-duplicate images.

Python
 
from imagehash import phash
from PIL import Image

hashes = {path: phash(Image.open(path)) for path in image_paths}


Perceptual hashing is especially useful for quick initial filtering before applying more computationally intensive techniques like embedding extraction.
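Perceptual hashes are compared by Hamming distance, the number of bits in which two hashes differ. The imagehash library exposes this as subtraction between two hash objects. The sketch below reproduces the idea on plain integers so it runs without any image files. The hash values, file names, and the 4-bit threshold are made up for illustration.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")

def near_duplicate_pairs(hashes, max_distance=4):
    """Return (path_a, path_b, distance) for pairs within max_distance bits.

    Compares every pair, so this is only suitable for small candidate sets.
    """
    paths = sorted(hashes)
    pairs = []
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            dist = hamming_distance(hashes[a], hashes[b])
            if dist <= max_distance:
                pairs.append((a, b, dist))
    return pairs

# Made-up 64-bit hash values for illustration.
example_hashes = {
    "a.jpg": 0xF0F0F0F0F0F0F0F0,
    "a_resized.jpg": 0xF0F0F0F0F0F0F0F1,  # differs from a.jpg in 1 bit
    "b.jpg": 0x0F0F0F0F0F0F0F0F,          # differs from a.jpg in all 64 bits
}
print(near_duplicate_pairs(example_hashes))
```

A small threshold catches re-encoded or resized copies while leaving unrelated images alone. The right cutoff depends on the hash size and the data, so it is worth calibrating on a labeled sample.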

Combining ANN With Clustering

Combining FAISS for ANN and hierarchical clustering leverages the strengths of both techniques. ANN ensures speed, while clustering provides interpretability and accuracy.
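One lightweight way to combine the two steps is to treat each close neighbor pair as an edge and group images by connected components. Single-linkage clustering cut at a distance threshold corresponds to the connected components of the graph linking every pair closer than that threshold, so a union-find pass is a common stand-in. Note that this swaps the Ward linkage used earlier for connected components, and that with only k neighbors per image the grouping is approximate. The function names, the sample arrays, and the threshold are assumptions for illustration.

```python
def find(parent, x):
    """Find the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root becomes the parent."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[max(root_a, root_b)] = min(root_a, root_b)

def group_from_neighbors(distances, indices, threshold):
    """Group items linked by a neighbor distance within threshold.

    distances[i][j] and indices[i][j] describe the j-th neighbor of item i,
    in the layout a k-NN search over the whole collection returns.
    An index of -1 (no neighbor found) is skipped.
    """
    n = len(indices)
    parent = list(range(n))
    for i in range(n):
        for dist, j in zip(distances[i], indices[i]):
            if j >= 0 and j != i and dist <= threshold:
                union(parent, i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]

# Made-up neighbor results for four items (k = 3, self-match first).
D = [[0.0, 0.1, 5.0], [0.0, 0.1, 0.2], [0.0, 0.2, 4.0], [0.0, 4.0, 5.0]]
I = [[0, 1, 3], [1, 0, 2], [2, 1, 3], [3, 2, 0]]
print(group_from_neighbors(D, I, threshold=0.5))
```

Items 0 and 2 land in the same group through item 1 even though they are not listed as direct neighbors. That chaining is the main thing to watch when choosing the threshold. If the distances come from IndexFlatL2, they are squared, so the threshold must be expressed in squared units as well.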

Deduplication of 100 Million Images

At a scale of 100 million images:

  • Data volume. 50 TB of images (PNG, JPG, RAW, etc.) stored across a distributed filesystem.
  • Infrastructure. Kubernetes cluster with 100 GPU-enabled nodes.
  • Workflow:
    • Images were ingested from S3 buckets or similar sources using a multi-threaded pipeline.
    • Embeddings were extracted using a pre-trained PyTorch ResNet model with CUDA.
    • FAISS indices were sharded across nodes for parallel ANN search.
    • Clustering grouped duplicate images together.
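The figures above allow a quick sizing estimate. The sketch below works through the arithmetic, assuming decimal terabytes, uncompressed float32 embeddings of dimension 2048 (the output size of ResNet-50 once the classification layer is removed), and an even split across the 100 nodes. Compressed or quantized indices would need considerably less memory.

```python
# Worked arithmetic: storage and memory estimates for the 100M-image scenario.
n_images = 100_000_000
total_bytes = 50 * 10**12   # 50 TB of source images (decimal units)
nodes = 100
dim = 2048                  # ResNet-50 embedding size without the classifier
bytes_per_float32 = 4

avg_image_kb = total_bytes / n_images / 1e3
embedding_bytes = n_images * dim * bytes_per_float32
per_node_gb = embedding_bytes / nodes / 1e9
images_per_node = n_images // nodes

print(f"average image size:   {avg_image_kb:.0f} KB")
print(f"all embeddings:       {embedding_bytes / 1e9:.1f} GB")
print(f"embeddings per node:  {per_node_gb:.3f} GB")
print(f"images per node:      {images_per_node:,}")
```

The embeddings are roughly sixty times smaller than the source images, and about 8 GB per node is small enough that an uncompressed shard can plausibly fit in memory on a single machine.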

Conclusion

Distributed deduplication at scale combines modern technologies like CUDA, FAISS, and clustering to tackle massive datasets. The architecture and optimization techniques described here provide a framework for optimizing storage and improving data quality across an image inventory. By balancing scale and accuracy, this approach gives deduplication systems room to grow with the data.
