Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Data De-Duplication in Image Search Services

DZone's Guide to

Data De-Duplication in Image Search Services

In this post, we will learn how to find and delete duplicate images in a PostreSQL database. Let's get started!

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Image processing technology, such as image search, has a multitude of applications in the real world. For example, Internet users may upload multiple versions of a video or image, each with different formatting, audio tracks, or compression ratios. This leads to a significant number of duplicate videos stored on the service end. However, this problem can be solved using data de-duplication. But how is this normally done?

When you use search engines to look for relevant images, the search engine will process the image and the tags related to the image. For example, when I search for a "snowman" image, a search engine may return this result.

Image title

Pretty accurate, right? Typically, PostgreSQL is behind the implementation of the image search and its Payment Gateway Application Programming Interface (API) extends the image search function.

PostgreSQL’s Image Search Plug-In Background Technology

PostgreSQL’s image search plug-in adopts the mainstream Haar wavelet technology to convert and store an image. The following figures briefly describe the Haar wavelet technology. For additional details, refer to the following Wikipedia link: https://en.wikipedia.org/wiki/Haar_wavelet

2

3

4

Steps to Install PostgreSQL Image Search Plug-In

Below are the steps to install the PostgreSQL image search plug-in:

  1. Dependent on gd.h. # yum install -y gd-devel
  2. Download and install imgsmlr.
    $ git clone https://github.com/postgrespro/imgsmlr
    $ cd imgsmlr
    $ export PGHOME=/home/digoal/pgsql9.5
    $ export PATH=$PGHOME/bin:$PATH:.
    
    $ make USE_PGXS=1
    $ make USE_PGXS=1 install
  3. Install the plug-in.
    $ psql
    psql (9.5.3)
    Type "help" for help.
    postgres=# create extension imgsmlr;
    CREATE EXTENSION
  4. Two data types now exist in imgsmlr.
    Data Type Storage Length Description
    Pattern 16388 bytes Result of Haar wavelet transform on the image
    Signature 64 bytes Short representation of the pattern for fast search using GiST indexes
  5. For similar image searching, use the gist index method (supporting the pattern and signature types) and KNN operator.
    Data Type Left Type Right Type Return Type Description
    <-> pattern pattern float8 Eucledian distance between two patterns
    <-> signature signature float8 Eucledian distance between two signatures
    This adds several functions.
  6. You can convert the binary images into the pattern type and convert the data stored in the pattern into the signature type.
    Function Return Type Description
    jpeg2pattern(bytea) pattern Convert jpeg image to pattern
    png2pattern(bytea) pattern Convert png image to pattern
    gif2pattern(bytea) pattern Convert gif image to pattern
    pattern2signature(pattern) signature Create signature from pattern
    shuffle_pattern(pattern) pattern Shuffle pattern for less sensitivity to image shift

Steps to Perform PostgreSQL Image Search Plug-in Test:

Once you are done installing, carry out these steps to perform PostgreSQL image search plug-in test:

  1. Import images, such as the following (the more the better).5
  2. Create the image table (id serial, data bytes);
  3. Import the images to the database.
  4. Insert into image(data) select pg_read_binary_file.
  5. Convert the image to the pattern and signature type.
    CREATE TABLE pat AS (
        SELECT
            id,
            shuffle_pattern(pattern) AS pattern, 
            pattern2signature(pattern) AS signature 
        FROM (
            SELECT 
                id, 
                jpeg2pattern(data) AS pattern 
            FROM 
                image
        ) x 
    );
  6. Create an index.
    ALTER TABLE pat ADD PRIMARY KEY (id);
    
    CREATE INDEX pat_signature_idx ON pat USING gist (signature);
  7. Perform an approximation query, such as querying images that are similar to  id = :id images and retrieving the top 10 items on the similarity ranking list.
    SELECT
        id,
        smlr
    FROM
    (
        SELECT
            id,
            pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
        FROM pat
        WHERE id <> :id
        ORDER BY
            signature <-> (SELECT signature FROM pat WHERE id = :id)
        LIMIT 100
    ) x
    ORDER BY x.smlr ASC 
    LIMIT 10
  8. K – Nearest Neighbour (KNN) indexing is an option here and the result is output quickly based on similarity rankings.

Testing Our Image Search Engine

For the most part, our search engine works as expected.

6

However, sometimes the image search does not work too well.

7

This is because the computer "sees" the images differently from humans. It processes an object as a 2D matrix, and transform it into a signature, which is readable for computers.

8

Video De-duplication Service

For video de-duplication, you can extract key frames in a video to generate the Cartesian product through self-correlation. Remember to calculate the similarity of two images of different videos. When the similarity reaches a certain threshold, the services deem the two videos the same.

Example:

  1. Create the image table and import the key frames of all videos into the table (id serial8 primary key, movie_id int, data bytea);
  2. Import the image (assume it is in jpeg format).
  3. Skipped
  4. Generate the pattern and signature types
    CREATE TABLE pat AS (
        SELECT
            id, movie_id,
            shuffle_pattern(pattern) AS pattern, 
            pattern2signature(pattern) AS signature 
        FROM (
            SELECT 
                id, movie_id,
                jpeg2pattern(data) AS pattern 
            FROM 
                image
        ) x 
    );
  5. Calculate the similarity of different videos.
    select t1.movie_id, t1.id, t1.signature<->t2.signature from 
    pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) 
    order by t1.signature<->t2.signature desc 
    or 
    select t1.movie_id, t1.id, t1.signature<->t2.signature from 
    pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) 
    where t1.signature<->t2.signature > 0.9 
    order by t1.signature<->t2.signature desc 

    Summary

    Image de-duplication requires Postgres as their database and uses its API. PostgreSQL is a powerful database with customizable functions. It not only ensures image de-duplication effectively but is also safe and reliable. Video de-duplication is the additional feature that is possible using PostgreSQL. The Haar wavelet algorithm adds to the possibility of searching images on popular search engines. The implementation of PostgreSQL and installation are aspects that are worth knowing.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
postgresql ,big data ,data de-duplication

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}