Data De-Duplication in Image Search Services
In this post, we will learn how to find and delete duplicate images in a PostgreSQL database. Let's get started!
Image processing technology, such as image search, has a multitude of applications in the real world. For example, Internet users may upload multiple versions of a video or image, each with different formatting, audio tracks, or compression ratios. This leads to a significant number of duplicate videos and images stored on the server side. This problem can be solved using data de-duplication. But how is this normally done?
When you use search engines to look for relevant images, the search engine will process the image and the tags related to the image. For example, when I search for a "snowman" image, a search engine may return this result.
Pretty accurate, right? In this post, PostgreSQL is behind the implementation of the image search, and a plug-in built on the PostgreSQL (PG) extension API adds the image search function.
PostgreSQL’s Image Search Plug-In Background Technology
PostgreSQL’s image search plug-in adopts the mainstream Haar wavelet technology to convert and store an image. The following figures briefly describe the Haar wavelet technology. For additional details, refer to the following Wikipedia link: https://en.wikipedia.org/wiki/Haar_wavelet
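To make the idea concrete, here is a minimal Python sketch of a Haar wavelet decomposition. It is not the imgsmlr implementation: the function names are hypothetical, and the averaging normalization (dividing sums and differences by 2) is an assumption chosen to keep the arithmetic easy to follow. Each level replaces neighboring pairs with their average and their half-difference, so the coarse shape of the signal ends up in the first few coefficients.

```python
def haar_step(values):
    """One Haar level: pairwise averages and pairwise half-differences."""
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_1d(values):
    """Full 1-D Haar decomposition; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    coefficients = []
    while len(current) > 1:
        current, details = haar_step(current)
        # Details from coarser levels are placed in front of finer ones.
        coefficients = details + coefficients
    return current + coefficients


def haar_2d(matrix):
    """2-D transform of a square image: every row first, then every column."""
    rows = [haar_1d(row) for row in matrix]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


print(haar_1d([9, 7, 3, 5]))
print(haar_2d([[1, 1], [1, 1]]))
```

In the first call, the leading coefficient is the overall average of the input and the remaining ones describe progressively finer detail. A uniform image, as in the second call, keeps only that average, which is why similar images tend to produce similar coefficients.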
Steps to Install PostgreSQL Image Search Plug-In
Below are the steps to install the PostgreSQL image search plug-in:
- Install the gd development package, because imgsmlr depends on gd.h.
# yum install -y gd-devel
- Download and install imgsmlr.
$ git clone https://github.com/postgrespro/imgsmlr
$ cd imgsmlr
$ export PGHOME=/home/digoal/pgsql9.5
$ export PATH=$PGHOME/bin:$PATH:.
$ make USE_PGXS=1
$ make USE_PGXS=1 install
- Create the extension in the database.
$ psql
psql (9.5.3)
Type "help" for help.

postgres=# create extension imgsmlr;
CREATE EXTENSION
- Two data types now exist in imgsmlr.
| Data Type | Storage Length | Description |
|---|---|---|
| pattern | 16388 bytes | Result of Haar wavelet transform on the image |
| signature | 64 bytes | Short representation of the pattern for fast search using GiST indexes |

- For similar image searching, use the GiST index method (supporting the pattern and signature types) and the KNN operator.

| Operator | Left Type | Right Type | Return Type | Description |
|---|---|---|---|---|
| <-> | pattern | pattern | float8 | Euclidean distance between two patterns |
| <-> | signature | signature | float8 | Euclidean distance between two signatures |
- You can convert the binary images into the pattern type and convert the data stored in the pattern into the signature type.
| Function | Return Type | Description |
|---|---|---|
| jpeg2pattern(bytea) | pattern | Convert jpeg image to pattern |
| png2pattern(bytea) | pattern | Convert png image to pattern |
| gif2pattern(bytea) | pattern | Convert gif image to pattern |
| pattern2signature(pattern) | signature | Create signature from pattern |
| shuffle_pattern(pattern) | pattern | Shuffle pattern for less sensitivity to image shift |
Steps to Test the PostgreSQL Image Search Plug-In
Once installation is done, carry out these steps to test the PostgreSQL image search plug-in:
- Import images, such as the following (the more the better).
- Create the image table (id serial, data bytea).
- Import the images to the database by reading each file with pg_read_binary_file:
insert into image(data) select pg_read_binary_file(...);
- Convert the image to the pattern and signature type.
CREATE TABLE pat AS (
    SELECT
        id,
        shuffle_pattern(pattern) AS pattern,
        pattern2signature(pattern) AS signature
    FROM (
        SELECT id, jpeg2pattern(data) AS pattern
        FROM image
    ) x
);
- Create an index.
ALTER TABLE pat ADD PRIMARY KEY (id);
CREATE INDEX pat_signature_idx ON pat USING gist (signature);
- Perform an approximation query, such as querying images that are similar to the image with id = :id and retrieving the top 10 items on the similarity ranking list.
SELECT id, smlr
FROM (
    SELECT id, pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
    FROM pat
    WHERE id <> :id
    ORDER BY signature <-> (SELECT signature FROM pat WHERE id = :id)
    LIMIT 100
) x
ORDER BY x.smlr ASC
LIMIT 10;
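- The inner query is a K-Nearest Neighbor (KNN) search that can use the GiST index on signature to pick 100 candidates quickly. The outer query then re-ranks those candidates by the exact distance between patterns. Because <-> returns a distance, a smaller smlr value means a more similar image.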
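The two-stage shape of that query (a cheap shortlist on the short signature, then an exact re-ranking on the full pattern) can be sketched in plain Python. This is an illustration only: find_similar and the toy vectors are hypothetical, and a linear scan stands in for the GiST index that PostgreSQL would use.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_similar(items, query_id, candidates=100, top=10):
    """items maps id -> (signature, pattern); returns [(id, distance), ...]."""
    query_signature, query_pattern = items[query_id]
    others = [(i, sig, pat) for i, (sig, pat) in items.items() if i != query_id]

    # Stage 1: rank by the short signature and keep a shortlist.
    shortlist = sorted(
        others, key=lambda item: euclidean(item[1], query_signature)
    )[:candidates]

    # Stage 2: re-rank the shortlist by the full pattern distance.
    ranked = sorted(
        ((i, euclidean(pat, query_pattern)) for i, _, pat in shortlist),
        key=lambda pair: pair[1],
    )
    return ranked[:top]


# Toy data: two-element "signatures" and four-element "patterns".
images = {
    1: ([0, 0], [0, 0, 0, 0]),
    2: ([0, 1], [0, 0, 0, 1]),
    3: ([5, 5], [5, 5, 5, 5]),
    4: ([0, 2], [0, 0, 3, 4]),
}
print(find_similar(images, query_id=1, candidates=2, top=2))
```

The shortlist size trades speed for recall: an image whose signature ranks outside the shortlist is never compared by pattern, which is why the SQL query fetches 100 candidates to return 10 results.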
Testing Our Image Search Engine
For the most part, our search engine works as expected.
However, sometimes the image search does not work too well.
This is because the computer "sees" images differently from humans. It processes an image as a 2D matrix and transforms it into a signature that it can compare numerically.
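A small Python sketch shows how far numeric comparison can drift from human perception. It compares raw pixel matrices rather than imgsmlr patterns, and the 4x4 images are made up for illustration: a bright vertical bar, the same bar moved one pixel to the right, and a blank image.

```python
import math


def pixel_distance(img_a, img_b):
    """Euclidean distance between two equally sized grayscale images."""
    return math.sqrt(sum(
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    ))


def shift_right(img):
    """Shift every row one pixel to the right, wrapping around the edge."""
    return [[row[-1]] + row[:-1] for row in img]


bar = [[0, 255, 0, 0] for _ in range(4)]
shifted = shift_right(bar)
blank = [[0, 0, 0, 0] for _ in range(4)]

print(pixel_distance(bar, shifted))  # same bar, moved by one pixel
print(pixel_distance(bar, blank))    # no bar at all
```

A person would call the first pair nearly identical, yet the shifted bar is numerically farther from the original than the blank image is. Reducing this kind of sensitivity is the stated purpose of shuffle_pattern in the function table.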
Video De-duplication Service
For video de-duplication, you can extract key frames from each video, self-join the key-frame table to form the Cartesian product, and calculate the similarity of each pair of frames that belong to different videos. When the similarity reaches a certain threshold (that is, when the distance between two frames falls below a chosen value), the service deems the two videos the same.
- Create the image table to hold the key frames of all videos (id serial8 primary key, movie_id int, data bytea).
- Import the key-frame images (assumed to be in JPEG format).
- Generate the pattern and signature types.
CREATE TABLE pat AS (
    SELECT
        id,
        movie_id,
        shuffle_pattern(pattern) AS pattern,
        pattern2signature(pattern) AS signature
    FROM (
        SELECT id, movie_id, jpeg2pattern(data) AS pattern
        FROM image
    ) x
);
- Calculate the similarity of different videos.
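-- <-> returns a distance, so smaller values mean more similar frames.
select t1.movie_id, t1.id, t2.movie_id, t2.id, t1.signature <-> t2.signature as dist
from pat t1 join pat t2 on (t1.movie_id <> t2.movie_id)
order by dist;

or, to keep only the pairs below a distance threshold (:threshold is a placeholder that must be tuned on your own data):

select t1.movie_id, t1.id, t2.movie_id, t2.id, t1.signature <-> t2.signature as dist
from pat t1 join pat t2 on (t1.movie_id <> t2.movie_id)
where t1.signature <-> t2.signature < :threshold
order by dist;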
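The pairwise comparison can also be sketched in Python to show how frame-level matches roll up into video-level duplicates. Everything here is an assumption made for illustration: the duplicate_videos name, the toy two-element signatures, the max_distance value, and the policy that a single matching key-frame pair is enough to flag two videos.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def duplicate_videos(frames, max_distance):
    """frames is a list of (movie_id, frame_id, signature) tuples.

    Returns {(movie_a, movie_b): smallest frame distance} for every pair of
    different videos with at least one key-frame pair within max_distance.
    """
    matches = {}
    for (m1, _, s1), (m2, _, s2) in combinations(frames, 2):
        if m1 == m2:
            continue  # only compare frames from different videos
        dist = euclidean(s1, s2)
        if dist <= max_distance:
            key = (min(m1, m2), max(m1, m2))
            matches[key] = min(dist, matches.get(key, dist))
    return matches


key_frames = [
    (1, 1, [0, 0]),
    (1, 2, [1, 1]),
    (2, 3, [0, 0.5]),
    (3, 4, [9, 9]),
]
print(duplicate_videos(key_frames, max_distance=1.0))
```

Using combinations visits each unordered pair once, whereas the SQL join on movie_id <> movie_id reports every pair in both orders. A stricter policy, such as requiring several matching key frames before flagging two videos, may suit real data better.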
Summary
In this post, image de-duplication is built on PostgreSQL and its imgsmlr plug-in. PostgreSQL is a powerful database with customizable functions, and it provides a safe and reliable base for image de-duplication. The same approach extends to video de-duplication by comparing key frames. The Haar wavelet transform is what makes the similarity search possible: it reduces each image to a pattern and a short signature that can be indexed and compared. Both the installation of the plug-in and the way PostgreSQL implements the search are worth knowing.
Published at DZone with permission of Leona Zhang. See the original article here.