Over a million developers have joined DZone.

Finding Bieber: On Removing Duplicates from a Set of Documents

DZone 's Guide to

Finding Bieber: On Removing Duplicates from a Set of Documents

· Web Dev Zone ·
Free Resource

So I have two million song lyrics in a big file. Don't ask me how I got it. The point is that I want to find the most poetic phrase of all time.

Problem is, the origins of this file are so sketchy it would make a Pearls Before Swine cartoon look like a Da Vinci. There could well be thousands of copies of Justin Bieber's Eenie Meenie, all frenetically typed in by a horde of Snapchatting teenagers using mom's Windows Vista laptop with the missing shift key.

I don't want my analysis to include the copies and covers of the same song. So I have two problems to solve:

  1. How can we know whether two songs are actually the same?
  2. And how can we do this quickly, over the whole collection?

But first

When dealing with text, or images, or sound files, or whatever kind of media tickles your pickle, we want to transform them into numbers that computers can use. We turn them into feature vectors, or what most Ph.D toting natural language processing experts call, when they really want to get technical for a stodgy old formal publication -- one that will put them on the tenure track -- when they want to choose the most technically precise phrase, they call them: "bags of words". I am not making this up.

Lets say we had this paragon of poesy:

Eenie, meenie, miney, mo
Catch a bad chick by her toe
If she holla
If, if, if she holla, let her go

A bag of words is set of words, and their counts. The ordering of the words is lost to simplify things. Order is rarely important anyway.

{a: 1, bad: 1, by: 1, catch: 1, chick: 1, eenie: 1, go: 1,her: 2, holla: 2, if: 4, let: 1, meenie: 1, miney: 1 mo:1, she: 2, toe: 1 }

We could go even simpler and remove the counts, if we feel they aren't important.

{a, bad, by, catch, chick, eenie, go, her, holla, if, let, meenie, miney, mo, she, toe}

As we process each document from the database, the first thing we do is turn it into the bag of words. In python it's a one-liner.

def makeBag(song):
    return set(song.replace(",", " ").split())

Comparing two bags

Let's say we had three sets. One is the song above. In the other, the teenaged transcriber thought "miney" should be spelled "miny". The third is Frank Sinatra's Fly me to the moon. We would like a distance function, so that if two songs are differ by only one word, then the distance would be small, and if they are completely different, the distance is large.

To find the answer, we have to travel to 1907, and accompany Swiss professor Paul Jaccard on his trip to the Alps to do some serious botany. He noticed that there were different clusters of plants in different regions, and wondered if these clusters could be used to determine the ecological evolution of the area. He writes:

In order to reply to this question I have considered, in an alpine district of fair size, various natural sub-divisions, presenting, besides numerous resemblances in their ecological conditions (i.e. conditions dependent on soil and climate), a few characteristic differences, and I have sought to determine, by comparison, the influence of these resemblances and differences on the composition of flora.

He counted all of the different plants in different regions, and came up with this formula to compare how similar two different regions are:

Number of species common to the two districts / total number of species in the two districts.

This gives 0 if the sets share no common elements, and 1 if they are the same. But that's the opposite of what we need, so we subtract it from one to obtain a distance function.

def Jaccard(A, B):  
    intersection = len(A & B)
    union = len(A | B)
    return 1.0 - float(intersection)/union

Now we have a distance function. To find all the duplicate songs, we just run it on every pair of songs that we have. With two million songs, that's only, umm, four trillion comparisons. If we can do 10000/second we could be done in about three years. Maybe we could split it up use some cloud instances, pay a few thousand dollars for compute time and it's be done in a day.

Or we could use algorithms.

Time to get LSH'd

I have two little girls and coincidentally, they have approximately two million individual socks, with no two pairs alike. Yet it doesn't take me three years to sort them, because I use a locality sensitive hash.

I take a sock, and if it's pinkish, I put it in the top left. Purple goes in the top right, and colours in the middle go in between. Blues go on the bottom, greens have their own spot. By the time I run out of socks to sort, the pairs are of near each-other on the carpet. Then it's a simple matter to join them together.


Now let's travel to 1997. Titanic and The Full Monty are in theaters. Sometimes people went to see the film 9 or 10 times. (Titanic, I mean) This was not surprising, because the only thing on TV was the OJ Simpson trial. On the WWW, then known as the World Wide Wait, AltaVista was one of the top search engines for finding the status of the Trojan Room Coffee Pot.

Computer Scientist Andrei Broder, who had been with AltaVista from near the beginning, was working on the duplicates problem. As the web was expanding, a lot of the pages were turning out to be duplicates of other pages. When you searched for something, the first page of results would just be copies of eachother. Very annoying. So Broder came up with a way of quickly searching through these millions of pages to find the duplicates.

The MinHash is a function that reduces a text document to a single number. Documents that share many of the same words have numbers that are near each-other.

How is this done?

Suppose you build a dictionary of all the words that could possibly occur in your documents, and you number them.

0 aardvark
1 abacus
2 abacuses
3 abaft
4 abalone
5 abandon

The minhash would take this dictionary, and take your document, and assign it the number of the minimum word that occurs. That's it.

So if your document is "The aardvark abandoned his abacus" then the number assigned would be 0 (because aardvark is the zero'th word in the dictionary). In fact, every document that talks about an aardvark would hash to 0.

But what if, by chance, there is a document that is similar to our aardvark text but mispells it? Then they would hash to some other number entirely.

To guard against this, we actually take several random permutations of the dictionary and average the minhash against each of them.

0 abacus
1 abalone
2 abacuses
3 aardvark
4 abaft
5 abandoned
0 abalone
1 abacus
2 abacuses
3 abandoned
4 aardvark
5 abaft
0 abacus
1 abaft
2 abacuses
3 abalone
4 abandoned
5 aardvark
  • Document: "The aardvark abandoned his abacus"
  • Minhash under first dictionary: 0
  • Minhash under dictionary 2: 1
  • Minhash under dictionary 3: 0
  • Combined minhash: (0 + 1 + 0) / 3 = 0.333333333

As you use more and more dictionaries to compute the hash, then documents that share similar sets of words begin to hash to similar values. If you like code, here's some python.

import random

def MinHash(corpus, k = 5):
    # Map from words to array of the five values
    words = {}
    for word in corpus:
        words[word] = []

    for i in range(k):
        shuffled = list(corpus)
        for j in range(len(shuffled)):

    def hash(document):
        total = 0.

        # for each hash function, find the lowest value word in the
        # document.

        #sum(min(h_k(w) over words in doc)

        vals = [-1] * k
        for word in document:
            if word in words:
                m = words[word]
                for i in range(k):
                    if vals[i] == -1 or m[i] < vals[i]:
                        vals[i] = m[i]

        return sum(vals) / k

    return hash


Using MinHash, you can mark duplicates in two passes through the data, and a sort.
  1. In the first pass, compute the minhash of each document.
  2. Secondly, sort the documents by their minhash (if you can afford to do so) or place them into buckets. In either case, documents that are similar will theoretically be close together.
  3. Finally, go through the list (if sorted) or nearby buckets, and compare documents within a certain window using a more refined comparison function, such as Jaccard distance. Anything that is close enough to being the same is a duplicate.

Oh yeah

I will assume that the most poetic words of all time, in English, are the ones most likely to end a line. After analysis of 2 million song lyrics, with near duplicates removed, they are:
at all
no more
don't know
all alone
right now
all night
at night
far away
like that
let go
too late

And the number one most poetic phrase in the history of music:

oh oh

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}