Using Machine Learning to Detect Dupes: Some Real-Life Examples (Part II)
As the volume of content grows, companies need tools to sort through it, and machine learning is one of the most efficient ways to dedupe data.
Today we are continuing our series on using machine learning for deduplication tasks. As the volume of content grows, companies need tools to help them sort through everything, and machine learning is one of the most efficient ways to dedupe data. In this post, we will look at more real-life examples of companies using machine learning for deduplication.
1. Deduping Web Page Content
When companies create new websites or add new sections, duplicate content can sometimes appear within their domain. Often this is unintentional. In fact, Google itself provides examples of what it calls "non-malicious duplicate content," including:
Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
Items in an online store that are shown or linked to by multiple distinct URLs
Printer-only versions of web pages
However, there are other times when people will intentionally place duplicate content on their website in an attempt to trick Google and thereby increase their SEO rankings. Such deceptive practices also negatively impact the user experience since the reader will have to sort through all of the duplicate content.
Machine learning is indispensable here, since human reviewers cannot possibly comb through every website on the internet to check for duplicate content. Instead, algorithms use techniques like set similarity, which measures whether two pages contain largely the same words, and cluster such pages together. Based on the results, Google can then decide whether to take action.
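To make the idea concrete, here is a minimal sketch of set similarity in Python using word shingles and the Jaccard coefficient. The page URLs, texts, and the 0.8 threshold are made up for illustration; this is not Google's actual pipeline.

```python
import re

def shingles(text, k=3):
    """Split text into lowercase word k-shingles (sets of k consecutive words)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical pages -- two near-duplicates and one distinct page.
pages = {
    "/product?id=42": "Blue widget, fits all standard mounts. Free shipping.",
    "/print/product-42": "Blue widget, fits all standard mounts. Free shipping!",
    "/about": "We are a small team building widgets since 2010.",
}

THRESHOLD = 0.8  # assumed cutoff; real systems tune this empirically
urls = list(pages)
shingle_sets = {url: shingles(text) for url, text in pages.items()}

# Compare every pair and report the ones that look like duplicates.
for i in range(len(urls)):
    for j in range(i + 1, len(urls)):
        sim = jaccard(shingle_sets[urls[i]], shingle_sets[urls[j]])
        if sim >= THRESHOLD:
            print(f"likely duplicates: {urls[i]} ~ {urls[j]} (similarity {sim:.2f})")
```

In practice, a production system would hash the shingles (for example with MinHash) so that billions of pages can be compared without holding full word sets in memory.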
2. Deduping Wikipedia Entries
Wikipedia is one of the most popular sources of information, but regular users might notice that different entries contain similar or even identical text. If the texts are identical, they will simply be marked for merging, but there are more complex cases. For example, say there is an entry for Hurricane Sandy along with other entries about the locations where it made landfall, such as various parts of New York, New Jersey, and so on. In such cases, editors may have simply copied and pasted information from one article into another, which is problematic because, in an environment where anyone can edit, the copies quickly drift out of sync.
Thanks to its wide community of users and developers, Wikipedia is able to use machine learning for deduplication through a technique called record linkage, in which the system finds records within a dataset, or across different sources, that refer to the same entity. More ambiguous cases can be flagged for human review.
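Here is a rough sketch of what record linkage can look like in Python, scoring two hypothetical records field by field with a weighted string similarity. The records, field weights, and review thresholds are assumptions for illustration, not Wikipedia's actual tooling.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Normalized string similarity between two field values (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a, rec_b, weights):
    """Weighted average of per-field similarities -- a basic record-linkage score."""
    total = sum(weights.values())
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items()) / total

# Hypothetical records from two sources describing the same storm.
source_a = {"title": "Hurricane Sandy",
            "summary": "Made landfall near Brigantine, New Jersey in October 2012."}
source_b = {"title": "Sandy (hurricane)",
            "summary": "Struck near Brigantine, New Jersey, October 2012."}

weights = {"title": 0.4, "summary": 0.6}  # assumed weights; learned in practice
score = record_score(source_a, source_b, weights)

if score > 0.75:     # confident match: merge or link automatically
    print(f"same entity (score {score:.2f})")
elif score > 0.5:    # ambiguous: route to human review
    print(f"flag for human review (score {score:.2f})")
else:                # confident non-match
    print(f"distinct entities (score {score:.2f})")
```

The middle branch is the key design choice: rather than forcing an automatic decision, borderline scores are handed to editors, which mirrors how ambiguous Wikipedia merges are handled.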
3. Deduping Medical Records
While duplicate medical records may not seem like a big issue, they cost the U.S. healthcare system more than $6 billion annually and individual hospitals around $1.5 million a year. Duplicates are so costly because it is difficult to link transactions made by the same person, reconcile multiple claims filed for the same patient, and handle many other related issues. The problem often arises when two hospitals merge: each has its own patient records, and consolidating the duplicates is difficult.
Many companies have developed solutions to dedupe patient records, and what's interesting is how efficiently they do it. While the system needs to compute similarity values across records, it does not need to store all of those values, only the ones above a certain threshold (a sketch of this thresholding idea follows the list below). This is done through model-based deduplication, which covers:
Matching duplicate records within a single dataset
Linking duplicate records across two datasets
Matching duplicate records against a canonical dataset
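Below is a rough Python sketch of the thresholding idea: pairwise similarities are computed, but only pairs scoring above the cutoff are kept as candidate duplicates. The patient records, the similarity measure, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(rec_a, rec_b):
    """Crude record similarity: compare name plus date of birth as one string."""
    a = f"{rec_a['name']} {rec_a['dob']}".lower()
    b = f"{rec_b['name']} {rec_b['dob']}".lower()
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical patient records from two merged hospital systems.
records = [
    {"id": 1, "name": "John A. Smith",  "dob": "1980-03-14"},
    {"id": 2, "name": "Jon Smith",      "dob": "1980-03-14"},
    {"id": 3, "name": "Maria Gonzalez", "dob": "1975-11-02"},
]

THRESHOLD = 0.8  # assumed cutoff; pairs below it are discarded, not stored
candidate_pairs = []
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        if score >= THRESHOLD:  # keep only high-scoring pairs, not the full matrix
            candidate_pairs.append((records[i]["id"], records[j]["id"], round(score, 2)))

print(candidate_pairs)  # [(1, 2, 0.91)] -- only the near-duplicate pair is kept
```

In a real system, a blocking key (such as date of birth) would typically narrow the comparisons so that not every pair is scored, and the similarity function would be a learned model rather than a simple string ratio.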
Machine Learning is the Smarter Way to Dedupe
A common thread across the three examples above is that machine learning offers a smarter approach to deduplication. It lets us compare far fewer records, since only pairs scoring above a certain threshold are examined in detail. It is also far more scalable and continuously learns to weigh more parameters. Even an entire team hired to dedupe your data would not work as fast as the machine learning approach. That's why machine learning is the best way to go!
Published at DZone with permission of Ilya Dudkin.