How to: PostgreSQL Fuzzy String Matching In YugabyteDB
Let's see how PostgreSQL’s Fuzzy String matching works in YugabyteDB using the northwind dataset.
Join the DZone community and get the full member experience.Join For Free
Before analyzing a large dataset that contains textual information, it’s important to scrub it and eliminate duplicates when necessary. To remove duplicates, you may need to compare strings referring to the same thing, but that may be written slightly different, have typos or were misspelled. Alternatively, you might need to join two tables on a column (let’s say on company name), and these can appear slightly different in both tables.
Fuzzy String Matching (or Approximate String Matching) is the process of finding strings that approximately match a pattern. Fuzzy string matching has several real-life use-cases including spell-checking, DNA analysis and detection, and spam detection. Fuzzy string matching enables a user to quickly filter down a large dataset to only those rows that match the fuzzy criteria.
You might also like: Fuzzy Logic: From Appliances to Intelligent Automation
What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) and drivers are PostgreSQL wire compatible.
Before we begin, please ensure that you have YugabyteDB installed and have loaded the northwind dataset onto the cluster. For a step-by-step guide on how to accomplish this, please visit: “How-to: The Northwind PostgreSQL Sample Database Running on a Distributed SQL Database”.
Using PostgreSQL Extensions in YugabyteDB
With YugabyteDB 2.0, many PostgreSQL extensions are now supported. Extensions package multiple SQL objects that can be added or removed as a single unit. One such extension that we are going to use for fuzzy string matching is called fuzzystrmatch.
CREATE EXTENSION fuzzystrmatch;
There are various algorithms that can do some form of fuzzy string matching, and in this blog, we will go over the three most popular ones.
The Soundex algorithm matches similar-sounding names by converting them to the same Soundex code. Every Soundex code consists of a letter and three numbers, such as W252. The difference function takes 2 string parameters, and converts the two strings to their Soundex codes. It then reports the number of matching code positions. Because Soundex codes have four characters, the result ranges from zero to four, with zero being no match and four being an exact match.
Below is an example query using the difference function in YugabyteDB.
SELECT order_id,customer_id,order_date,required_date FROM orders WHERE difference(customer_id, 'FRA') > 2;
With the above query, we will be getting rows that have the customer_id partially matching with ‘FRA’ using soundex.
Metaphone is similar to Soundex. Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. With metaphone, you can index words by their English pronunciation rather than how it is spelled. For example:
SELECT supplier_id,company_name,contact_title,contact_name FROM suppliers WHERE metaphone(contact_name,1) = metaphone('John',1)
The Levenshtein distance between two words is the minimum number of single-character changes (i.e. insertions, deletions, or substitutions) required to change one word into the other. Thus, the smaller the number of edits to transform one word to the other, the closer the words are to each other.
Below is an example query using the Levenshtein function in YugabyteDB.
SELECT ship_name, ship_address,ship_city FROM orders WHERE levenshtein(ship_city, 'Strasbourg') < 8;
The above query will give us a list of 64 cities slightly matching “Strasbourg.”
That’s it! As you can see, it is straightforward to start using PostgreSQL’s fuzzy matching extension inside of YugabyteDB without any modifications.
Published at DZone with permission of Jimmy Guerrero. See the original article here.
Opinions expressed by DZone contributors are their own.