DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
Building Scalable Real-Time Apps with AstraDB and Vaadin
Register Now

Trending

  • Application Architecture Design Principles
  • How To Scan and Validate Image Uploads in Java
  • Security Challenges for Microservice Applications in Multi-Cloud Environments
  • Constructing Real-Time Analytics: Fundamental Components and Architectural Framework — Part 2

Trending

  • Application Architecture Design Principles
  • How To Scan and Validate Image Uploads in Java
  • Security Challenges for Microservice Applications in Multi-Cloud Environments
  • Constructing Real-Time Analytics: Fundamental Components and Architectural Framework — Part 2
  1. DZone
  2. Data Engineering
  3. Databases
  4. How to: PostgreSQL Fuzzy String Matching In YugabyteDB

How to: PostgreSQL Fuzzy String Matching In YugabyteDB

Let's see how PostgreSQL’s Fuzzy String matching works in YugabyteDB using the northwind dataset.

Jimmy Guerrero user avatar by
Jimmy Guerrero
·
Oct. 22, 19 · Tutorial
Like (4)
Save
Tweet
Share
7.57K Views

Join the DZone community and get the full member experience.

Join For Free

Image title

Think about it...yarn is just fuzzy string.

Before analyzing a large dataset that contains textual information, it’s important to scrub it and eliminate duplicates when necessary. To remove duplicates, you may need to compare strings referring to the same thing, but that may be written slightly different, have typos or were misspelled. Alternatively, you might need to join two tables on a column (let’s say on company name), and these can appear slightly different in both tables.

Fuzzy String Matching (or Approximate String Matching) is the process of finding strings that approximately match a pattern. Fuzzy string matching has several real-life use-cases including spell-checking, DNA analysis and detection, and spam detection. Fuzzy string matching enables a user to quickly filter down a large dataset to only those rows that match the fuzzy criteria.

In this blog, we will show how PostgreSQL’s Fuzzy String matching works in YugabyteDB using the northwind dataset (download).

You might also like:  Fuzzy Logic: From Appliances to Intelligent Automation

What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) and drivers are PostgreSQL wire compatible.

Prerequisites

Before we begin, please ensure that you have YugabyteDB installed and have loaded the northwind dataset onto the cluster. For a step-by-step guide on how to accomplish this, please visit: “How-to: The Northwind PostgreSQL Sample Database Running on a Distributed SQL Database”.

Using PostgreSQL Extensions in YugabyteDB

With YugabyteDB 2.0, many PostgreSQL extensions are now supported. Extensions package multiple SQL objects that can be added or removed as a single unit. One such extension that we are going to use for fuzzy string matching is called fuzzystrmatch.

CREATE EXTENSION fuzzystrmatch;


There are various algorithms that can do some form of fuzzy string matching, and in this blog, we will go over the three most popular ones.

Soundex

The Soundex algorithm matches similar-sounding names by converting them to the same Soundex code. Every Soundex code consists of a letter and three numbers, such as W252. The difference function takes 2 string parameters, and converts the two strings to their Soundex codes. It then reports the number of matching code positions. Because Soundex codes have four characters, the result ranges from zero to four, with zero being no match and four being an exact match.

Below is an example query using the difference function in YugabyteDB.

SELECT order_id,customer_id,order_date,required_date
FROM orders 
WHERE difference(customer_id, 'FRA') > 2;


With the above query, we will be getting rows that have the customer_id partially matching with ‘FRA’ using soundex.

Image title

Metaphone

Metaphone is similar to Soundex. Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. With metaphone, you can index words by their English pronunciation rather than how it is spelled. For example:

SELECT supplier_id,company_name,contact_title,contact_name 
FROM suppliers 
WHERE metaphone(contact_name,1) = metaphone('John',1)


Image title

Levenshtein

The Levenshtein distance between two words is the minimum number of single-character changes (i.e. insertions, deletions, or substitutions) required to change one word into the other. Thus, the smaller the number of edits to transform one word to the other, the closer the words are to each other.

Below is an example query using the Levenshtein function in YugabyteDB.

SELECT ship_name, ship_address,ship_city
FROM orders
WHERE levenshtein(ship_city, 'Strasbourg') < 8;


Image title


The above query will give us a list of 64 cities slightly matching “Strasbourg.”

Conclusion

That’s it! As you can see, it is straightforward to start using PostgreSQL’s fuzzy matching extension inside of YugabyteDB without any modifications. 

Further Reading

Searching JSON: Comparing Text Search in Couchbase and MongoDB

Fuzzy Logic: Making Decisions From Landing Planes to Cybersecurity

Strings Data Types YugabyteDB PostgreSQL Database sql Soundex

Published at DZone with permission of Jimmy Guerrero. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • Application Architecture Design Principles
  • How To Scan and Validate Image Uploads in Java
  • Security Challenges for Microservice Applications in Multi-Cloud Environments
  • Constructing Real-Time Analytics: Fundamental Components and Architectural Framework — Part 2

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: