Over a million developers have joined DZone.

Squashing Duplicate Pairs Together in Python

As part of a data cleaning pipeline, I had pairs of ids of duplicate addresses that I wanted to group together. Learn how I tracked the sets of duplicates using an array of arrays, or list of lists, in Python.

· Database Zone

Learn NoSQL for free with hands-on sample code, example queries, tutorials, and more.  Brought to you in partnership with Couchbase.

As part of a data cleaning pipeline, I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t work out how to solve the problem immediately, so I simplified the problem into pairs of letters, i.e.,

AB(A is the same as B)
BC(B is the same as C)
EF(E is the same as F)

The output that I want to get is:

(A, B, C, F)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays or list of lists since we’re going to script this using Python.

The actual data is in a CSV file, but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]

print find_matching_index(("A", "B"), [set(["D", "E"])])

print find_matching_index(("B", "C"), [set(["A", "B"])])

print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all of our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)

        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.append(matching_dup.union([pair[0], pair[1]]))
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]

            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
    return dups

Now, let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]

for test_case in test_cases:
    print extract_groups(test_case)

[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well, but since I only have a few hundred duplicate addresses, it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists, but I haven’t figured out what that is yet.

The Getting Started with NoSQL Guide will get you hands-on with NoSQL in minutes with no coding needed. Brought to you in partnership with Couchbase.

python ,algorithm ,web dev

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}