How Do AI Systems Identify Duplicate Data?

A discussion of AI concepts, such as comparing records in a database, and how these techniques can be used in conjunction with Salesforce.

By Ilya Dudkin (DZone Core) · May 10, 2021 · Analysis

When you compare two Salesforce records (or records in any other CRM, for that matter) side by side, you can easily determine whether they are duplicates. However, even with a relatively small number of records, say fewer than 100,000, it would be practically impossible to sift through them one by one and perform such a comparison. This is why companies have developed tools that automate the process, but to do a good job, the machines need to be able to recognize all of the similarities and differences between records. In this article, we will take a closer look at some of the methods data scientists use to train machine learning systems to identify duplicates.

How Can Machine Learning Systems Compare and Contrast Records? 

One of the main tools researchers use is string metrics. A string metric takes two strings of data and returns a number that is low if the strings are similar and high if they are different. How does this work in practice? Let's take a look at the two records below:

| First Name | Last Name | Email                 | Company Name |
|------------|-----------|-----------------------|--------------|
| Ron        | Burgundy  | ron.burgundy@acme.com | Acme         |
| Ronald     | burgundy  | ron.burgundy@acme.com | Acme Corp    |

If a human were to look at these two records, it would be pretty obvious that these are duplicates. However, machines rely on string metrics to replicate the human thought process, which is what AI is all about. One of the most famous string metrics is the Hamming distance which measures the number of substitutions that need to be made in order to turn one string into another. For example, if we return to the two records above, there would only need to be one substitution made to turn “burgundy” into “Burgundy,” therefore the Hamming distance would be 1. 
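The Hamming distance described above can be sketched in a few lines of Python (a minimal illustration, not any particular library's implementation):

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

# "burgundy" vs "Burgundy" differ only in the first character
print(hamming_distance("burgundy", "Burgundy"))  # 1
```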

There are many other string metrics that measure the similarity between two strings, and what separates them is the operations they allow. The Hamming distance, for example, only allows substitutions, which means it can only be applied to strings of equal length. The Levenshtein distance, by contrast, allows deletion, insertion, and substitution.
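Because it allows insertions and deletions, the Levenshtein distance also handles strings of different lengths, such as "Ron" versus "Ronald" from the table above. A standard dynamic-programming sketch:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "Ron" -> "Ronald" needs three insertions: a, l, d
print(levenshtein_distance("Ron", "Ronald"))  # 3
```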

How Can All of This Be Used to Dedupe Salesforce? 

There are a couple of ways an AI system can approach Salesforce deduplication. One of the ways is the blocking method, which is illustrated below: 

| Record 1                                  | Record 2                                          |
|-------------------------------------------|---------------------------------------------------|
| Ron Burgundy, ron.burgundy@acme.com, Acme | Ronald burgundy, ron.burgundy@acme.com, Acme Corp |

Such blocking methodology is what makes this approach scalable. The way it works is that whenever you upload new records into your Salesforce, the system will automatically block together records that look “similar.” This can be something like the first three letters of the first name or any other criteria. 
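The first-three-letters criterion mentioned above can be sketched as follows (the record fields and the key function are illustrative assumptions, not any vendor's actual logic):

```python
from collections import defaultdict

def block_records(records, key=lambda r: r["first_name"][:3].lower()):
    """Group records by a cheap blocking key so that only records
    sharing a block ever need to be compared pairwise."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks

records = [
    {"first_name": "Ron", "email": "ron.burgundy@acme.com", "company": "Acme"},
    {"first_name": "Ronald", "email": "ron.burgundy@acme.com", "company": "Acme Corp"},
    {"first_name": "Veronica", "email": "v.corningstone@acme.com", "company": "Acme"},
]

blocks = block_records(records)
# "Ron" and "Ronald" share the key "ron" and land in the same block,
# so only that pair gets compared; "Veronica" (key "ver") is skipped.
```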

This is very beneficial because it reduces the number of comparisons that need to be made. For example, let's say that you have 100,000 records in your Salesforce and you would like to upload an Excel spreadsheet that contains 50,000 records. Traditional rule-based deduplication apps would need to compare each new record with every existing one, meaning 5,000,000,000 comparisons (100,000 x 50,000). Imagine how long this would take and how much it increases the probability of an error. Also, keep in mind that 100,000 records is a fairly modest number for Salesforce; many organizations have hundreds of thousands or even millions of records. The traditional approach simply does not scale to such volumes.

The other option would be to compare each field individually: 


|            | Record 1              | Record 2              |
|------------|-----------------------|-----------------------|
| First Name | Ron                   | Ronald                |
| Last Name  | Burgundy              | burgundy              |
| Email      | ron.burgundy@acme.com | ron.burgundy@acme.com |
| Company    | Acme                  | Acme Corp             |

Once the system has blocked together "similar" records, it will then proceed to analyze each record field by field. This is where the string metrics we talked about earlier come into play. In addition, the system assigns each field a particular "weight," or importance. For example, let's say that for your dataset, the "Email" field is the most important. You can either adjust the weights yourself, or the system can learn the correct weights automatically as you label record pairs as duplicates (or not). The latter is called active learning and is preferable, since the system can precisely calculate the importance of one field over another.
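Field-by-field comparison with weights can be sketched like this (the weight values are hypothetical, and a normalized similarity from Python's standard-library `difflib` stands in for whatever string metric a real product uses):

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings match (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical weights: for this dataset, the email field matters most.
WEIGHTS = {"first_name": 0.15, "last_name": 0.25, "email": 0.45, "company": 0.15}

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted average of per-field similarities; near 1.0 suggests a duplicate."""
    return sum(w * field_similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

rec1 = {"first_name": "Ron", "last_name": "Burgundy",
        "email": "ron.burgundy@acme.com", "company": "Acme"}
rec2 = {"first_name": "Ronald", "last_name": "burgundy",
        "email": "ron.burgundy@acme.com", "company": "Acme Corp"}

score = match_score(rec1, rec2)
# The identical email and last name push the score close to 1.0.
```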

What Are the Advantages of the Machine Learning Approach? 

The biggest benefit machine learning can offer is that it does all of the work for you. The active learning described in the previous section applies the necessary weights to each field automatically, so there is no complicated setup process and no rules to create. Consider the alternative: one of the sales reps discovers a duplicate and notifies the Salesforce admin, who then creates a rule to prevent that kind of duplicate in the future. This process has to be repeated every time a new kind of duplicate is discovered, making it unsustainable.
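To make the active-learning idea concrete, here is a minimal sketch (the training pairs, field order, and the plain logistic-regression approach are all illustrative assumptions, not the actual algorithm of any product): each labeled pair is represented by its per-field similarity scores, and a simple model learns which fields matter most from the labels alone.

```python
import math

# Hypothetical labeled pairs: [first_name, last_name, email, company]
# similarities, with label 1 = duplicate, 0 = distinct.
pairs = [
    ([0.7, 1.0, 1.0, 0.6], 1),
    ([0.9, 1.0, 1.0, 1.0], 1),
    ([0.3, 0.2, 0.0, 0.5], 0),
    ([0.8, 0.1, 0.0, 0.9], 0),  # same company, different person
]

weights = [0.0] * 4
bias = 0.0
lr = 0.5

def predict(x):
    """Probability that a pair with similarities x is a duplicate."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))

# A few passes of gradient descent on the logistic loss.
for _ in range(500):
    for x, y in pairs:
        err = predict(x) - y
        for i in range(4):
            weights[i] -= lr * err * x[i]
        bias -= lr * err

# On this toy data, the email-similarity weight grows largest,
# because email similarity separates the two classes best.
```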

Also, remember that the built-in deduplication in Salesforce is rule-based as well, and it is quite limited: you can only merge three records at a time, there is no support for custom objects, and there are many other restrictions. Machine learning is the smarter way to go, since rule creation is simple automation, whereas AI and machine learning try to recreate the human thought process. (More about the differences between machine learning and automation is discussed in this article.) It would not make sense to choose a deduplication product that simply expands Salesforce's existing functionality instead of fixing the entire process, which is why the machine learning approach is the best way to go.

AI Machine learning Record (computer science) Data science

Published at DZone with permission of Ilya Dudkin. See the original article here.

Opinions expressed by DZone contributors are their own.
