Squashing Duplicate Pairs Together in Python


As part of a data cleaning pipeline, I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t work out how to solve the problem immediately, so I simplified it into pairs of letters, i.e.,

(A, B): A is the same as B
(B, C): B is the same as C
(E, F): E is the same as F

The output that I want to get is:

(A, B, C)
(E, F)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays (a list of lists, since we’re going to script this in Python).

The actual data is in a CSV file, but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]

print(find_matching_index(("A", "B"), [set(["D", "E"])]))
[]

print(find_matching_index(("B", "C"), [set(["A", "B"])]))
[0]

print(find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])]))
[0, 1]

Next we need to write a function which iterates over all of our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)

        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.append(matching_dup.union([pair[0], pair[1]]))
            dups.pop(index)
        else:
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]

            dups.append(dup1.union(dup2).union([pair[0], pair[1]]))
            dups.pop(index1)
            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
    return dups

Now, let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
]

for test_case in test_cases:
    print(extract_groups(test_case))

[{'A', 'C', 'B', 'D'}, {'E', 'G', 'F'}]
[{'A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'}]
[{'A', 'C', 'B', 'E'}]
[{'C', 'D'}, {'G', 'F'}, {'I', 'H'}, {'A', 'J', 'B'}]

This certainly doesn’t scale very well, but since I only have a few hundred duplicate addresses, it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists, but I haven’t figured out what that is yet.
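One alternative worth sketching (this is my own addition, not from the article, and the name extract_groups_union_find is made up) is a disjoint-set, or union-find, structure: each value points at a representative "root," merging two groups is just relinking one root to the other, and the groups are read off at the end. This avoids rescanning and re-popping the list of sets on every pair, so it also addresses the scaling concern above.

```python
def extract_groups_union_find(pairs):
    # parent maps each value to its parent in the disjoint-set forest;
    # a value whose parent is itself is the root of its group.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root, halving the path as we go (path compression).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Merging two groups is a single pointer update on their roots.
        parent[find(a)] = find(b)

    # Collect every value under its final root.
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Running it over the first test case above yields the same two groups, {A, B, C, D} and {E, F, G}, though the order of the returned list isn’t guaranteed.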

python, algorithm, web dev

Published at DZone with permission of Mark Needham , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
