Over a million developers have joined DZone.

Data De-Duplication in Image Search Services

DZone's Guide to

Data De-Duplication in Image Search Services

In this post, we will learn how to find and delete duplicate images in a PostreSQL database. Let's get started!

· Big Data Zone ·
Free Resource

Cloudera Data Flow, the answer to all your real-time streaming data problems. Manage your data from edge to enterprise with a no-code approach to developing sophisticated streaming applications easily. Learn more today.

Image processing technology, such as image search, has a multitude of applications in the real world. For example, Internet users may upload multiple versions of a video or image, each with different formatting, audio tracks, or compression ratios. This leads to a significant number of duplicate videos stored on the service end. However, this problem can be solved using data de-duplication. But how is this normally done?

When you use search engines to look for relevant images, the search engine will process the image and the tags related to the image. For example, when I search for a "snowman" image, a search engine may return this result.

Image title

Pretty accurate, right? Typically, PostgreSQL is behind the implementation of the image search and its Payment Gateway Application Programming Interface (API) extends the image search function.

PostgreSQL’s Image Search Plug-In Background Technology

PostgreSQL’s image search plug-in adopts the mainstream Haar wavelet technology to convert and store an image. The following figures briefly describe the Haar wavelet technology. For additional details, refer to the following Wikipedia link: https://en.wikipedia.org/wiki/Haar_wavelet




Steps to Install PostgreSQL Image Search Plug-In

Below are the steps to install the PostgreSQL image search plug-in:

  1. Dependent on gd.h. # yum install -y gd-devel
  2. Download and install imgsmlr.
    $ git clone https://github.com/postgrespro/imgsmlr
    $ cd imgsmlr
    $ export PGHOME=/home/digoal/pgsql9.5
    $ export PATH=$PGHOME/bin:$PATH:.
    $ make USE_PGXS=1
    $ make USE_PGXS=1 install
  3. Install the plug-in.
    $ psql
    psql (9.5.3)
    Type "help" for help.
    postgres=# create extension imgsmlr;
  4. Two data types now exist in imgsmlr.
    Data Type Storage Length Description
    Pattern 16388 bytes Result of Haar wavelet transform on the image
    Signature 64 bytes Short representation of the pattern for fast search using GiST indexes
  5. For similar image searching, use the gist index method (supporting the pattern and signature types) and KNN operator.
    Data Type Left Type Right Type Return Type Description
    <-> pattern pattern float8 Eucledian distance between two patterns
    <-> signature signature float8 Eucledian distance between two signatures
    This adds several functions.
  6. You can convert the binary images into the pattern type and convert the data stored in the pattern into the signature type.
    Function Return Type Description
    jpeg2pattern(bytea) pattern Convert jpeg image to pattern
    png2pattern(bytea) pattern Convert png image to pattern
    gif2pattern(bytea) pattern Convert gif image to pattern
    pattern2signature(pattern) signature Create signature from pattern
    shuffle_pattern(pattern) pattern Shuffle pattern for less sensitivity to image shift

Steps to Perform PostgreSQL Image Search Plug-in Test:

Once you are done installing, carry out these steps to perform PostgreSQL image search plug-in test:

  1. Import images, such as the following (the more the better).5
  2. Create the image table (id serial, data bytes);
  3. Import the images to the database.
  4. Insert into image(data) select pg_read_binary_file.
  5. Convert the image to the pattern and signature type.
            shuffle_pattern(pattern) AS pattern, 
            pattern2signature(pattern) AS signature 
        FROM (
                jpeg2pattern(data) AS pattern 
        ) x 
  6. Create an index.
    CREATE INDEX pat_signature_idx ON pat USING gist (signature);
  7. Perform an approximation query, such as querying images that are similar to  id = :id images and retrieving the top 10 items on the similarity ranking list.
            pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
        FROM pat
        WHERE id <> :id
        ORDER BY
            signature <-> (SELECT signature FROM pat WHERE id = :id)
        LIMIT 100
    ) x
    ORDER BY x.smlr ASC 
    LIMIT 10
  8. K – Nearest Neighbour (KNN) indexing is an option here and the result is output quickly based on similarity rankings.

Testing Our Image Search Engine

For the most part, our search engine works as expected.


However, sometimes the image search does not work too well.


This is because the computer "sees" the images differently from humans. It processes an object as a 2D matrix, and transform it into a signature, which is readable for computers.


Video De-duplication Service

For video de-duplication, you can extract key frames in a video to generate the Cartesian product through self-correlation. Remember to calculate the similarity of two images of different videos. When the similarity reaches a certain threshold, the services deem the two videos the same.


  1. Create the image table and import the key frames of all videos into the table (id serial8 primary key, movie_id int, data bytea);
  2. Import the image (assume it is in jpeg format).
  3. Skipped
  4. Generate the pattern and signature types
            id, movie_id,
            shuffle_pattern(pattern) AS pattern, 
            pattern2signature(pattern) AS signature 
        FROM (
                id, movie_id,
                jpeg2pattern(data) AS pattern 
        ) x 
  5. Calculate the similarity of different videos.
    select t1.movie_id, t1.id, t1.signature<->t2.signature from 
    pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) 
    order by t1.signature<->t2.signature desc 
    select t1.movie_id, t1.id, t1.signature<->t2.signature from 
    pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) 
    where t1.signature<->t2.signature > 0.9 
    order by t1.signature<->t2.signature desc 


    Image de-duplication requires Postgres as their database and uses its API. PostgreSQL is a powerful database with customizable functions. It not only ensures image de-duplication effectively but is also safe and reliable. Video de-duplication is the additional feature that is possible using PostgreSQL. The Haar wavelet algorithm adds to the possibility of searching images on popular search engines. The implementation of PostgreSQL and installation are aspects that are worth knowing.

 Cloudera Enterprise Data Hub. One platform, many applications. Start today.

postgresql ,big data ,data de-duplication

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}