This blog is the third part of a three-part series looking at data matching. In the first part, we looked at the theory behind data matching.
In the second part, we looked at the tools Talend provides in its suite to enable you to do data matching, and how the theory is put into practice.
In this third and final part, we will look at how you tune your data matching in order to get the best possible matches from your data matching processes and software.
By tuning, I mean tuning the matching parameters within the matching components so they produce the best possible matches.
In some respects, tuning your data matching routines is more of an art than a science, although there is plenty of science to help us. It depends more on experience than on theory, and that experience only comes from tuning a number of data matching projects.
Let’s take a step back for a moment. In any data matching exercise, the goal is to end up with a group of records that match, a group that don’t, and a third group that we are not sure about. Ideally, there would be only two groups, those that match and those that don’t, but in practice we never achieve this. There will always be a small group in the middle where we simply cannot tell whether the records match or not. What we try to achieve, however, is to make the matched and unmatched groups as large and as distinct as possible, while making the uncertain group as small as possible. Let’s look at a diagram to try and make this clear.
Here we can see the typical situation that we will encounter once we have tuned our data matching. In the above diagram, we have plotted Total Weight on the Y-axis. This, if you remember, is an indication of the confidence of a match: the higher the value, the more confident we are that the records match. Along the X-axis we simply plot the data points. As described above, we should have two clear groups, the higher one containing matches and the lower one containing non-matches, with a smallish data set in the middle of which we are unsure. If we can achieve this, we can set our “Match” and “Unmatch” thresholds. These thresholds then allow us to do auto-merging: records above the Match threshold are merged automatically, records below the Unmatch threshold are not merged (as they don’t match), and those in the middle are checked manually. Within the Talend suite, these uncertain records could, for example, be sent to Data Stewardship to be checked by a data steward.
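To make the two thresholds concrete, here is a minimal Python sketch of the three-way split described above. It is not Talend code: the threshold values, the 0 to 1 weight scale, the sample weights and the `classify` helper are all assumptions chosen for illustration, as is the choice to treat a weight exactly on the Match threshold as a match.

```python
from collections import Counter

# Hypothetical thresholds; real values come out of the tuning exercise.
MATCH_THRESHOLD = 0.85
UNMATCH_THRESHOLD = 0.60


def classify(total_weight, match=MATCH_THRESHOLD, unmatch=UNMATCH_THRESHOLD):
    """Assign a record pair to one of the three groups by its total weight."""
    if unmatch > match:
        raise ValueError("unmatch threshold must not exceed match threshold")
    if total_weight >= match:
        return "match"      # confident enough to merge automatically
    if total_weight < unmatch:
        return "unmatch"    # confident the records are different
    return "uncertain"      # in the middle: send for manual review


# Made-up total weights, for illustration only.
weights = [0.97, 0.91, 0.88, 0.72, 0.66, 0.41, 0.35, 0.12]
groups = Counter(classify(w) for w in weights)
print(groups)
```

The closer the two thresholds sit to each other, the fewer records land in the manual review group, which is exactly what the tuning is trying to make possible.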
As a reminder, there are no hard and fast rules for tuning data matching algorithms. There is, however, a methodology which you will need to adopt, and this is what we will now explain. So, how do we tune our data matching algorithms? Well, it’s a multi-step process that needs to be repeated several times to get the best possible matches. The 6-step process of tuning your matching algorithms looks like this:
- Make a first estimate of matching parameters and weights. Use these as a baseline to start.
- Run the matching and save the results (the data points) to a file.
- Plot a graph of Total Weight vs. Data Points.
- Try to estimate which parameters need adjusting.
- Re-run and replot the data.
- Repeat until the results are satisfactory.
The following diagram illustrates this process:
In practice, this process is often repeated a number of times. Experience in data matching, knowledge of the tool and its algorithms, and an understanding of the data all contribute towards the initial values of the matching parameters. You then run the matching, examine the results, adjust the match parameters and re-run. Most of the time the matching improves; sometimes it doesn’t. It is simply a matter of repeating the exercise until you can make no more improvements in the quality of the data matching.
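The run, save, adjust and re-run cycle can be sketched in a few lines of Python. This is a toy stand-in, not how the Talend matching components work internally: the per-field similarity scores are invented, the field names are hypothetical, and computing the total weight as a simple weighted average of field similarities is an assumption made for the example.

```python
import csv
import io

# Toy per-field similarity scores (0 to 1) for five record pairs.
# All values are invented for illustration.
PAIRS = [
    {"name": 0.95, "postcode": 1.00, "phone": 0.90},
    {"name": 0.90, "postcode": 1.00, "phone": 0.20},
    {"name": 0.60, "postcode": 1.00, "phone": 0.10},
    {"name": 0.30, "postcode": 0.00, "phone": 0.05},
    {"name": 0.20, "postcode": 1.00, "phone": 0.00},
]


def total_weight(similarities, field_weights):
    """Weighted average of the per-field similarities (an assumed formula)."""
    weight_sum = sum(field_weights.values())
    if weight_sum <= 0:
        raise ValueError("field weights must sum to a positive number")
    scored = sum(similarities[f] * w for f, w in field_weights.items())
    return scored / weight_sum


def run_matching(field_weights):
    """One tuning run: score every pair, sorted from highest to lowest."""
    return sorted((total_weight(p, field_weights) for p in PAIRS), reverse=True)


def save_results(weights, handle):
    """Write the data points as CSV so they can be plotted afterwards."""
    writer = csv.writer(handle)
    writer.writerow(["data_point", "total_weight"])
    for i, w in enumerate(weights, start=1):
        writer.writerow([i, f"{w:.3f}"])


baseline = {"name": 1, "postcode": 1, "phone": 1}   # first estimate
adjusted = {"name": 3, "postcode": 1, "phone": 1}   # trust names more

for label, params in (("baseline", baseline), ("adjusted", adjusted)):
    out = io.StringIO()  # stands in for a results file on disk
    save_results(run_matching(params), out)
    print(label)
    print(out.getvalue())
```

Comparing the two saved result sets side by side, or plotting them, shows which pairs moved up or down when the weights changed, and that is the evidence you use to decide on the next adjustment.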
What you would ideally see is two groups of data slowly emerging, the “Matched” and “Unmatched” groups, along with the uncertain data in the middle. As you continue to tune the algorithms, the two main groups should move apart, and the uncertain data should reduce in number. Eventually, you will reach a point of diminishing returns, where any further changes to the matching parameters and weights produce no detectable effect. You have now tuned your data matching to get the best possible matches!
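One way to judge "no detectable effect" without relying only on the eye is to track two numbers from run to run: the share of records in the uncertain group, and the gap between the lowest match and the highest non-match. The sketch below is only one possible heuristic; the thresholds, the tolerance, the sample runs and the helper names are all assumptions for illustration.

```python
def separation_report(weights, match=0.85, unmatch=0.60):
    """Summarise one run: share of uncertain records and gap between groups."""
    matched = [w for w in weights if w >= match]
    unmatched = [w for w in weights if w < unmatch]
    uncertain = len(weights) - len(matched) - len(unmatched)
    # Gap between the weakest match and the strongest non-match.
    gap = min(matched) - max(unmatched) if matched and unmatched else 0.0
    share = uncertain / len(weights) if weights else 0.0
    return {"uncertain_share": share, "gap": gap}


def has_converged(previous, current, tolerance=0.01):
    """True when neither number moved by more than the tolerance."""
    return (
        abs(current["uncertain_share"] - previous["uncertain_share"]) < tolerance
        and abs(current["gap"] - previous["gap"]) < tolerance
    )


# Made-up total weights from three successive tuning runs.
run1 = separation_report([0.90, 0.80, 0.70, 0.65, 0.50, 0.40])
run2 = separation_report([0.95, 0.90, 0.88, 0.65, 0.30, 0.20])
run3 = separation_report([0.95, 0.90, 0.88, 0.66, 0.30, 0.20])

print(has_converged(run1, run2))  # groups still moving apart
print(has_converged(run2, run3))  # nothing left to gain
```

A check like this complements the plot rather than replacing it: the graph tells you what to adjust, while the numbers tell you when to stop.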