How Does Data Matching Work?
In the first part of this series, we will take a look at the theory behind Data Matching, what Data Matching is, and how Data Matching works.
This blog is the first in a series of three looking at Data Matching and how it can be done with the Talend toolset. This first blog looks at the theory behind Data Matching: what it is and how it works. The second blog will look at using the Talend toolset to actually do Data Matching. Finally, the last blog in the series will look at how you can tune the Data Matching algorithms to achieve the best possible results.
First, what is Data Matching? Basically, it is the ability to identify duplicates in large data sets. These duplicates could be people with multiple entries in one or many databases. It could also be duplicate items, of any description, in stock systems. Data Matching allows you to identify duplicates (or possible duplicates) and then allows you to take actions such as merging the two identical or similar entries into one. It also allows you to identify non-duplicates, which can be equally important to identify because you want to know that two similar things are definitely not the same.
So, how does Data Matching actually work? What are the mathematical theories behind it? OK, let’s go back to first principles. How do you know that two "things" are actually the same "thing?" Or, how do you know if two "people" are the same person? What is it that uniquely identifies something? We do it intuitively ourselves. We recognize features in things or people that are similar and acknowledge they could be, or are, the same. In theory, this can apply to any object, be it a person, an item of clothing such as a pair of shorts, a cup, or a "widget."
This problem has actually been around for over 60 years. It was formalized in the 60s in the seminal work of Ivan Fellegi and Alan Sunter, two statisticians at Canada’s Dominion Bureau of Statistics (now Statistics Canada). Statistical agencies such as the U.S. Census Bureau have used it extensively. It’s called record linkage, i.e. how are records from different data sets linked together? For duplicate records, it is sometimes called de-duplication: the process of identifying duplicates and linking them. So, what properties help identify duplicates?
Well, we need identifying properties, ideally ones that are unlikely to change over time. We can then associate a probability with each property and weigh it: the probability that two records agreeing on that property are actually the same thing. This can be applied to both people and things.
The problem, however, is that things can and do change, or they get misidentified. The trick is to identify what can change or be recorded wrongly, e.g. a name, an address, or a date of birth. Some things are less likely to change than others. For objects, this could be size, shape, color, etc.
NOTE: Record linkage is highly sensitive to the quality of the data being linked. Data should first be ‘standardized’ so it is all of a similar quality.
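To make the standardization step concrete, here is a minimal Python sketch. It is only an illustration and not how Talend does it: the function names, the cleaning rules, and the list of date formats are all assumptions you would adapt to your own data.

```python
import re
import unicodedata
from datetime import datetime
from typing import Optional


def standardize_name(raw: str) -> str:
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    # Drop the combining accent marks left behind by the decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space.
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


# Hypothetical list of date layouts; extend it to match your own sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def standardize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

After this step, values that were written differently but mean the same thing (for example "03/07/1985" and "1985-07-03") compare as equal, which is what the later matching steps rely on.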
Now, there are two sorts of record linkage:
Deterministic record linkage, which is based on a number of identifiers that match.
Probabilistic record linkage, which is based on the probability that a number of identifiers match.
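The vast majority of Data Matching is Probabilistic Data Matching. Deterministic links are too inflexible: a single typo, a changed surname, or a differently formatted date is enough to break an exact match.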
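A small Python sketch shows why exact comparison is so brittle. The records and field names are made up, and the similarity score from the standard library's `difflib` is just one possible ingredient of a probabilistic match, not the whole method.

```python
from difflib import SequenceMatcher


def deterministic_match(a: dict, b: dict, keys: tuple) -> bool:
    """True only if every chosen identifier is exactly equal."""
    return all(a[k] == b[k] for k in keys)


def similarity(x: str, y: str) -> float:
    """Similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, x, y).ratio()


# Hypothetical records: same person, but the surname was mistyped once.
rec_a = {"surname": "johnson", "dob": "1985-07-03"}
rec_b = {"surname": "jonson", "dob": "1985-07-03"}

exact = deterministic_match(rec_a, rec_b, ("surname", "dob"))
score = similarity(rec_a["surname"], rec_b["surname"])

print("deterministic match:", exact)
print("surname similarity:", round(score, 3))
```

The deterministic check rejects the pair outright because of one missing letter, while the similarity score stays high. A probabilistic approach can use that score as evidence instead of throwing the pair away.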
So, just how do you match? First, you do what is called blocking: you sort the data into similar-sized blocks whose records share the same value for a chosen attribute, so that only records in the same block need to be compared. For this, you identify "attributes" that are unlikely to change. These could be surname, date of birth, color, volume, or shape.
Next, you do the matching. First, assign a match type to each attribute (there are lots of different ways to match attributes). Names can be matched phonetically; dates can be matched by similarity. Next, you calculate the relative weight for each matching attribute, which is similar to a measure of importance. Then you calculate two probabilities for each attribute: the probability that it matches when the two records really are the same, and the probability that it matches purely by accident when they are not. Finally, you assign an algorithm for adjusting the relative weight of each attribute to get what is called a Total Match Weight. That is then the probabilistic match for two things.
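Blocking and a phonetic match type can be sketched in a few lines of Python. This is an illustration under assumptions: the sample records, the choice of birth year as the blocking key, and the use of American Soundex as the phonetic comparison are all hypothetical, and a real tool may block and compare differently.

```python
from collections import defaultdict
from itertools import combinations

_SOUNDEX_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                   "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _SOUNDEX_GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]


def block_by(records, key_func):
    """Group records by a blocking key; only records in the same block are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    """Yield every pair of records that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical, already standardized records.
records = [
    {"id": 1, "surname": "smith", "dob": "1985-07-03"},
    {"id": 2, "surname": "smyth", "dob": "1985-07-03"},
    {"id": 3, "surname": "jones", "dob": "1990-01-15"},
    {"id": 4, "surname": "smith", "dob": "1985-07-03"},
]

# Assumed blocking key: year of birth.
blocks = block_by(records, lambda r: r["dob"][:4])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]

# Phonetic agreement on surname for each candidate pair.
phonetic = {(a["id"], b["id"]): soundex(a["surname"]) == soundex(b["surname"])
            for a, b in candidate_pairs(blocks)}
```

With four records there are six possible pairs, but blocking on birth year leaves only the three pairs among records 1, 2, and 4. On large data sets this reduction is what makes matching feasible, at the cost of never comparing records that land in different blocks.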
To summarize, the steps are:
1. Standardize the data.
2. Pick attributes that are unlikely to change.
3. Block and sort into similar-sized blocks.
4. Match via probabilities (remember there are lots of different match types).
5. Assign weights to the matches.
6. Add it all up and get a total weight.
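The last steps, weighting each attribute and adding the weights up, can be sketched with the classic log-ratio weights from the Fellegi and Sunter model. This is one well-known way to get a total weight and not necessarily the algorithm any particular tool uses. The attribute names and probability values below are invented for illustration; in practice they are estimated from your own data.

```python
import math

# Hypothetical probabilities per attribute:
#   m = probability the attribute agrees when two records are a true match
#   u = probability the attribute agrees by chance when they are not a match
PROBABILITIES = {
    "surname": (0.95, 0.01),
    "dob": (0.98, 0.003),
    "postcode": (0.90, 0.05),
}


def attribute_weight(agrees: bool, m: float, u: float) -> float:
    """Positive weight for agreement, negative weight for disagreement."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def total_match_weight(agreement: dict) -> float:
    """Sum the per-attribute weights for one candidate pair of records."""
    return sum(
        attribute_weight(agreement[name], m, u)
        for name, (m, u) in PROBABILITIES.items()
    )


all_agree = total_match_weight({"surname": True, "dob": True, "postcode": True})
one_off = total_match_weight({"surname": True, "dob": True, "postcode": False})
none_agree = total_match_weight({"surname": False, "dob": False, "postcode": False})

print(round(all_agree, 2), round(one_off, 2), round(none_agree, 2))
```

An attribute that rarely agrees by chance (a small u) earns a large positive weight when it does agree, so it counts for more in the total. Pairs with a high total are treated as likely duplicates, pairs with a low total as non-duplicates, and the thresholds between them are part of what gets tuned.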
The final step is to tune your matching algorithms so that you can obtain better and better matches. This will be covered in the third article in this series.
The next question, then, is what tools are available in the Talend tool set and how can you use them to do Data Matching? This will be covered in the next article in this series of Data Matching blogs.
Published at DZone with permission of Stefan Franczuk, DZone MVB.