
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data


As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.


Our paper "∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data" by Pradeeban Kathiravelu, Helena Galhardas, and Luís Veiga was presented at the 23rd International Conference on Cooperative Information Systems (CoopIS 2015). 

Due to a clash in my travel schedule, I could not attend the conference and present the paper myself. Hence, my supervisor and co-author, Prof. Luís Veiga, presented the paper in Greece on the 29th of October.

Abstract. Near duplicate detection algorithms have been proposed and implemented in order to detect and eliminate duplicate entries from massive datasets. Due to differences in data representation (such as measurement units) across data sources, potential duplicates may not be textually identical, even though they refer to the same real-world entity. As data warehouses typically contain data coming from several heterogeneous data sources, detecting near duplicates in a data warehouse requires considerable memory and processing power.
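
To make the problem concrete, here is a minimal sketch (illustrative only, not code from the paper) that compares two records describing the same person using Jaccard similarity over their token sets. The records, tokenizer, and 0.4 threshold are assumptions for the example; the point is that a difference in measurement units ("1.80 m" versus "180 cm") keeps the pair from being textually identical even though both refer to one entity.

```java
import java.util.*;

// Illustrative sketch: flag two differently represented records that refer
// to the same real-world entity as near duplicates via Jaccard similarity.
public class NearDuplicateExample {

    // Jaccard similarity: |intersection| / |union| of the two token sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Naive tokenizer: lowercase, drop punctuation, split on whitespace.
    static Set<String> tokens(String record) {
        String cleaned = record.toLowerCase().replaceAll("[^a-z0-9. ]", " ");
        return new HashSet<>(Arrays.asList(cleaned.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // Same person, different measurement units and formatting.
        String r1 = "John Smith 1.80 m Lisbon";
        String r2 = "Smith, John 180 cm Lisbon";
        double sim = jaccard(tokens(r1), tokens(r2));
        // 0.4 is an arbitrary threshold chosen for this example.
        System.out.printf("similarity = %.2f, near duplicate: %b%n",
                sim, sim > 0.4);
    }
}
```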

Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. While parallel and distributed frameworks have recently been exploited to scale existing algorithms to larger datasets, such efforts typically distribute a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of existing similarity join algorithms is still lacking.
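
The distribution strategy such MapReduce-based efforts typically rely on is blocking: a map phase groups records under a cheap blocking key, so that a reduce phase compares records pairwise only within each block instead of across the whole dataset. The plain-Java sketch below illustrates the idea; the blocking key (first character of the record) and the records themselves are hypothetical.

```java
import java.util.*;
import java.util.stream.Collectors;

// Illustrative sketch of MapReduce-style blocking for a similarity join.
public class BlockingSketch {
    public static void main(String[] args) {
        List<String> records = List.of(
                "smith john lisbon", "smith j. lisbon", "brown ana porto");

        // "Map" phase: group each record under its blocking key.
        Map<String, List<String>> blocks = records.stream()
                .collect(Collectors.groupingBy(r -> r.substring(0, 1)));

        // "Reduce" phase: pairwise comparison inside each block only,
        // so "brown ana porto" is never compared against the Smith records.
        blocks.values().forEach(block -> {
            for (int i = 0; i < block.size(); i++)
                for (int j = i + 1; j < block.size(); j++)
                    System.out.println("compare: " + block.get(i)
                            + " <-> " + block.get(j));
        });
    }
}
```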

In-Memory Data Grids (IMDGs) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of ∂u∂u, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. ∂u∂u leverages the distributed shared memory and execution model provided by IMDGs to execute existing near duplicate detection algorithms in a parallel, multi-tenanted environment. As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
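
As a rough illustration of this IMDG execution style, the sketch below uses Hazelcast as a stand-in data grid; the paper's actual substrate, class names, and task structure are not shown here and may differ. Records live in a distributed map, and a task fanned out to every member scans only that member's locally owned partition, so the comparison work is naturally split across the cluster.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.IMap;

import java.io.Serializable;
import java.util.concurrent.Callable;

// Hypothetical sketch: near duplicate detection fanned out over an IMDG
// (Hazelcast 3.x APIs), each member working on its own data partitions.
public class ImdgDedupSketch {

    // Task shipped to every cluster member; must be serializable.
    static class LocalDedupTask
            implements Callable<Integer>, Serializable, HazelcastInstanceAware {
        private transient HazelcastInstance hz;

        @Override
        public void setHazelcastInstance(HazelcastInstance hz) {
            this.hz = hz;
        }

        @Override
        public Integer call() {
            IMap<Long, String> records = hz.getMap("records");
            int duplicates = 0;
            // localKeySet() restricts the scan to this member's partitions.
            for (Long key : records.localKeySet()) {
                String rec = records.get(key);
                // ... run a near duplicate detection algorithm on rec here,
                // incrementing 'duplicates' for each match found ...
            }
            return duplicates;
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, String> records = hz.getMap("records");
        records.put(1L, "smith john lisbon");
        records.put(2L, "smith j. lisbon");

        // Fan the task out to all members; each returns its local count.
        IExecutorService exec = hz.getExecutorService("dedup");
        exec.submitToAllMembers(new LocalDedupTask())
            .forEach((member, future) -> {
                try {
                    System.out.println(member + " found " + future.get());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        hz.shutdown();
    }
}
```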

The full paper can be accessed here.
