Beyond Fuzzy Matching: Engineering a Multi-Signal ML Pipeline for CRM Deduplication

This article explores using blocking, fuzzy matching, geo proximity, and a Random Forest classifier to streamline records.

Punitha Ponnuraj

Jan. 05, 26 · Analysis

Likes (1)

Comment

Save

853 Views

The CRM Problem We All Face

Almost every CRM platform struggles with duplicate customer records because the data is a messy mix flowing in from too many places — web forms, call centers, imports from ERP, partner feeds, and even legacy systems from multiple business units. The exact same company can pop up under slightly different names, with spelling variations, inconsistent addresses, or missing suite numbers. A simple single-rule fuzzy match isn't enough to connect these dots.

To truly fix this, we need something smarter: a multi-signal pipeline. This means using a blocking layer to speed up things, checking fuzzy name and address similarity, using geographic distance as a reality check, and implementing a small machine learning classifier to weigh all the evidence and decide if two records should be merged.

This article walks through a compact version of the pipeline, built in Python using rapidfuzz, geopy, and a Random Forest classifier. You can run everything locally and extend it however you want.

The Multi-Signal ML Pipeline Architecture

Multi-Signal ML Pipeline

1. Starting with Messy Data

Let's look at a typical real-world example. Here are four records that are clearly connected, but a standard system might not see the relationship.

ID	Name	Address	ZIP	Lat/Lon
A12	Mama's Kitchen LLC	123 Main St	29607	close coords
SS6	Mamas Kitchen – Greenville	123 Main St Ste A	29607	close coords
C99	Mama Kitchen	123 Main Street	29607	close coords
P11	M K Greenville	123 Main St	29607	nearly identical co-ords

2. Standardize the Mess

Most of the magic in matching begins with the boring work: normalize the things humans enter inconsistently.

We don’t need heavy NLP; just a simple clean up of lowercase, stripping punctuation, and compressing whitespaces helps fuzzy matching become stable. For example, this takes all the versions of the above customer names and turns those into consistent strings like mamas kitchen and mk greenville.

    Python
   
 

   import re
import pandas as pd

def normalize(text):
    if not text:
        return ""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\bstreet\b", "st", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

df["name_std"] = df["name"].apply(normalize)
df["addr_std"] = df["address"].apply(normalize)

  

3. Blocking: The Step That Saves Everything (Including Your CPU)

This is a very simple but critical point: Comparing every record to every other record will result in O(N²) complexity. To avoid that, we derive a block key:

block_key = first 3 letters of normalized name + zip.

For our example records, this groups A12, SS6, C99 --> mam29607 and P11 --> mk29607.

    Python
   
   from itertools import combinations

def block_key(row):

    return row["name_std"][:3] + "_" + row["ZIP"]

df["block"] = df.apply(block_key, axis=1)

# The critical step—generating pairs only within the same block.

all_pairs = []

for block_id, block_data in df.groupby("block"):

    # The index of the DataFrame (df.index) should be the record ID (e.g., A12, SS6)

    record_ids = block_data.index.tolist()

    # Generate all unique combinations of IDs within the block

    pairs_in_block = list(combinations(record_ids, 2))

    all_pairs.extend(pairs_in_block)

print(f"Total pairs generated after blocking: {len(all_pairs)}")

Running pairwise comparisons only within each block keeps the pipeline fast, even with millions of CRM records. P11 falls into a separate block, but we’ll revisit it because geo proximity tells a different story.

4. Feature Engineering: Extracting the Real Signals

For each pair inside a block, we compute:

Name similarity (token sort ratio or similar)
Address similarity
Geo distance converted into a similarity (1.0 when extremely close)

Critical insight: On real CRM data, geo_sim is the single most powerful differentiator.

    Python
   
   from rapidfuzz import fuzz

from geopy.distance import geodesic

import numpy as np

# We'll assume the input records are indexed in the DataFrame 'df'

def extract_features(a, b):

    A, B = df.loc[a], df.loc[b]

    # 1. Name and Address Similarity (Fuzzy Matching)

    name_sim  = fuzz.token_sort_ratio(A["name_std"],  B["name_std"]) / 100

    addr_sim  = fuzz.token_sort_ratio(A["addr_std"],  B["addr_std"]) / 100

    # 2. Geo Proximity (Physical Distance)

    d = geodesic((A["lat"], A["lon"]), (B["lat"], B["lon"])).meters

    # geo_sim is 1.0 when extremely close, degrading towards 0.

    [span_24](start_span)geo_sim = max(0, 1 - d / 500) # anything within ~0–500m is highly likely same operator[span_24](end_span)

    # 3. Structural Cues

    zip_match     = int(A["ZIP"] == B["ZIP"])

    prefix_match  = int(A["name_std"][:4] == B["name_std"][:4])

    return [name_sim, addr_sim, geo_sim, zip_match, prefix_match]

# Code to extract features for all generated pairs (X)

X_features = []

pair_indices = []

for idx_a, idx_b in all_pairs:

    features = extract_features(idx_a, idx_b)

    X_features.append(features)

    pair_indices.append((idx_a, idx_b))

# Convert to a format ready for the ML model

X = np.array(X_features)

print(f"X array shape: {X.shape}")

5. A Tiny ML Model to Tie It All Together

Rather than adding more handcrafted rules, we let a small Random Forest combine the features and estimate the probability of a match.

The Training Data Challenge

To train this model, we need a manually labeled dataset: pairs of records that are known matches (1) and known non-matches (0). Since labeling millions of pairs is impossible, we typically:

Label a small, diverse sample (e.g., 5,000 pairs).
Use hard-coded rules to auto-label the highest confidence matches and non-matches, feeding these into the training set.

This ensures the model learns which features are truly indicative of a duplicate.

    Python
   
 

   from sklearn.ensemble import RandomForestClassifier
# For a real-world system, you would train this model on a labeled dataset (X_train, y_train)
# model.fit(X_train, y_train)
model = RandomForestClassifier(
   n_estimators=70, max_depth=6, random_state=42
)
# Note: You can use logistic regression, but forests are surprisingly good here because the
# [span_26](start_span)interactions between features (like geo + fuzzy differences) behave a bit non-linear[span_26](end_span).
# Assuming the model is trained, use it to predict the match probability on all generated pairs (X)
probabilities = model.predict_proba(X)[:, 1]
# Now, match the probabilities back to the pair IDs for the Decision Layer
results = pd.DataFrame({
   'Pair_A': [p[0] for p in pair_indices],
   'Pair_B': [p[1] for p in pair_indices],
   'Match_Probability': probabilities
})
  

Why not logistic regression? You can — but forests are surprisingly good here because the interactions between features (like geo + fuzzy differences) behave a bit non-linear.

Here’s how the model scores our earlier pairs:

    Plain Text
   
   A12 vs SS6 → 0.94
A12 vs C99 → 0.89
A12 vs P11 → 0.52  (cross-block check because geo_sim was high)

That last one is the interesting case: the name doesn’t match at all, but the records lie practically on top of each other on the map.

6. The Decision Layer (Where Humans Still Matter)

Based on the confidence scores, we apply thresholds:

0.75+ --> MATCH (Auto-Merge)
0.50–0.75 --> REVIEW (Send to Human-in-the-Loop)
0.50 --> NO MATCH (Ignore)

So, the three “mama kitchen” variants merge automatically, but P11 doesn’t. But when a human sees the same coordinates, the same address, and recognizes “M K” as an abbreviation of “Mama Kitchen,” they can approve the merge.

It sounds small, but adding humans only for this narrow band dramatically reduces errors in CRMs where sales crediting depends on correct customer identity.

7. Building the Final Unified Customer

We derive the golden record by taking the densest cluster of matched records and using the metadata that appears most consistently after normalization.

    Plain Text
   
   Cluster#1 : A12,SS6,C99,P11 (P11 added after review)
Unified Name (Derived): mama kitchen
Unified Address (Derived): 123 main st, greenville sc 29607

P11 folds into the cluster after review, bringing its metadata with it.

8. Why This Works Better Than Pure Fuzzy Matching

Because in real-world enterprise systems, data is messy. For a single physical customer, records are maintained across multiple systems. A name may be very different, but the coordinates are identical. Addresses may be formatted in three different ways, but point to the same building. Sometimes, category metadata (if available) adds to these.

Hard-coded rules struggle with these overlaps, often requiring huge, complex, and high-maintenance rule sets. But a small ML model, powered by a multi-signal approach, can handle these nonlinear relationships well and solve the core data quality issues with minimal human effort.

AI Architecture

Opinions expressed by DZone contributors are their own.

Related

Trending