Pattern Searching and PostgreSQL Full-Text Search: Understanding the Mismatch

Learn why PostgreSQL full-text search (FTS) is great for semantic, language-aware searching, but not for raw pattern search.

Horatiu Dan

CORE ·

Oct. 07, 25 · Analysis

Likes (11)

Comment

Save

3.9K Views

A while ago, a request emerged to implement a global find functionality in one of our products. Briefly, the purpose was to be able to search for multiple (different) entities by a pattern that is looked up and matched against several of their attributes. Not very complicated at first glance, such a feature comes with a lot of concerns and trade-offs that need to be analyzed, in addition to the implementation itself, in order to ensure at least a pleasant user experience.

As the underlying database was PostgreSQL and I was aware of its powerful full-text search (FTS) capability, I considered it a potentially good candidate, so I started researching it, trying things out, aiming to compile the solution around it. Unfortunately for the requirement, but fortunately for me and for the interesting FTS capabilities I became aware of during this study, a few hours later, I understood why it cannot be used for pattern searching.

The purpose of this article is to understand a few of the PostgreSQL FTS basics, use them in a concrete example while trying to perform pattern searching, observe the results, and understand why it isn’t a solution for such a case.

We will jump right into it and make the important statements first, then detail and implement the proof of concept.

Why Pattern Searching Doesn’t Work With FTS

PostgreSQL full-text search is designed for linguistics-based search (words, phrases, ranking by relevance) and not for pattern-based search (wildcards, substrings, regex). The two serve fundamentally different purposes.

A few of the cons of using FTS in this scenario:

Tokenization and Dictionaries

FTS breaks text into lexemes (normalized word units), which means searching is word-based (or phrase-based), not character-by-character. For example, searching for "cat" won’t match "concatenate," because "concatenate" is stored as the lexeme 'concaten':1.

No Wildcard or Substring Matching

Pattern matching (LIKE, ILIKE, regex) works on raw text, while FTS does not. FTS queries only match entire words or their stems, not arbitrary substrings ('%cat%') or regex-like patterns.

Loss of Word Boundaries

Once tokenized, FTS discards original text formatting (whitespace, punctuation, order to some extent). This makes it impossible to reliably search for patterns across boundaries like "abc-def" or "foo_bar."

Focus on Semantics, Not Syntax

FTS is built to mainly answer this question: "What documents are most relevant to these words?" It prioritizes ranking, stemming, stop word removal, and linguistic analysis. Pattern search, on the other hand, is purely syntactic and exact.

Indexing Mismatch

FTS uses GIN or GIST indexes optimized for word lookups. Pattern searching requires B-tree indexes for faster substring/regex matches. To a certain extent, mixing the two is inefficient.

No Regular Expression Support

FTS query operators (@@, to_tsquery) accept logical combinations of lexemes (AND, OR, NOT, FOLLOWS), not regex. You can't, for example, run to_tsquery('%cat%') or similar and expecting wildcard behavior.

PostgreSQL Full-Text Search Concepts

Before the concrete example, a few of the basic concepts are defined briefly.

A document is the actual content that is searched for. For instance, it may designate one or more text columns in a table, or textual data of interest gathered from multiple tables.

In order for the documents to be searchable, they must be transformed appropriately into a form called tsvector.

to_tsvector(regconfig, text) is the function that creates this PostgreSQL's internal representation of a document that is optimized for performing the search operation.

    SQL
   
   select to_tsvector('english', 'PostgreSQL database results if you concatenate Postgre and SQL.');

outputs:

    Plain Text
   
   'concaten':6 'databas':2 'postgr':7 'postgresql':1 'result':3 'sql':9

It is a sorted list of normalized words together with their position in the initial document. In the above example, the items are sorted alphabetically.

The normalized words are called lexemes. A lexeme basically represents the meaning of a word in a full-text search context. Usually, when a phrase is transformed into lexemes, the stop words are removed, all items are lowercased, special characters are handled, and only the root form of the words is kept (stemming).

Once the documents are transformed, they may be searched using the tsquery type, a search query that can be applied on tsvectors.

    SQL
   
   select to_tsquery('english', 'PostgreSQL & database');

outputs:

    Plain Text
   
   'postgresql' & 'databas'

The supported operators are &, |, !, and <->. While the first three are intuitive, the last means follows, more exactly, the document is matched if the two lexemes come right one after the other.

A Practical Experiment

Considering the PostgreSQL Database Server is up and running, one may create a simple schema that is used in this use case.

    SQL
   
   CREATE SCHEMA pgfts;

To keep things simple, only one table is used, a table that stores inventory items. Such an entity has a number, a service, is from a vendor, and comments can be recorded for each.

    SQL
   
 

   DROP TYPE IF EXISTS vendor;
CREATE TYPE vendor AS ENUM ('Vodafone', 'Orange');
 
DROP TYPE IF EXISTS service;
CREATE TYPE service AS ENUM ('VOIP', 'MPLS');
 
DROP TABLE IF EXISTS inventory;
CREATE TABLE IF NOT EXISTS inventory (
    id SERIAL PRIMARY KEY,
    number TEXT NOT NULL,
    service service NOT NULL,
    vendor vendor NOT NULL,
    comments TEXT
);
  

In order to be able to experiment with searching, first, some data is created — a few inventory items for each combination of services and vendors are inserted.

    SQL
   
 

   INSERT INTO inventory (number, service, vendor, comments)
SELECT 'inv' || floor(random() * 100000) || 'vdf',
       'VOIP',
       'Vodafone',
       'For inventory ' || i || ' from Vodafone: vdf voip.'
FROM generate_series(1, 3) i;
 
INSERT INTO inventory (number, service, vendor, comments)
SELECT 'inv-' || floor(random() * 100000) || 'vdf',
       'MPLS',
           'Vodafone',
       'For inventory ' || i || ' from Vodafone: vdf mpls.'
FROM generate_series(1, 3) i;
 
INSERT INTO inventory (number, service, vendor, comments)
SELECT 'inv' || floor(random() * 100000) || '-org',
       'VOIP',
           'Orange',
       'For inventory ' || i || ' from Orange: org voip.'
FROM generate_series(1, 3) i;
 
INSERT INTO inventory (number, service, vendor, comments)
SELECT 'inv-' || floor(random() * 100000) || '-org',
       'MPLS',
           'Orange',
       'For inventory ' || i || ' from Orange: org mpls.'
FROM generate_series(1, 3) i;
  

The form of the inventory numbers was intentionally generated differently for certain entities, as we’ll shortly see.

The raw data looks as shown in the picture below.

The intention of the experiment is to perform a pattern search and match patterns from the number column in a full-text search manner. We assume the full search is performed on both the number and comments columns. In order to be able to accomplish that, an additional column containing the tsvector representation of the two is added. Moreover, a GIN index is created as well, although the performance is not a concern here.

    SQL
   
 

   ALTER TABLE inventory
    DROP COLUMN IF EXISTS search_ts_vector; 
     
ALTER TABLE inventory
    ADD COLUMN IF NOT EXISTS search_ts_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(number, '')) ||
        to_tsvector('english', coalesce(comments, ''))
    ) STORED;
     
CREATE INDEX inventory_search_idx ON inventory USING GIN (search_ts_vector);
  

The documents look as follows, only the columns of interest are shown.

    Plain Text
   
 

   +--+-------------+----------------------------------------+---------------------------------------------------------------------------------------+
|id|number       |comments                                |search_ts_vector                                                                       |
+--+-------------+----------------------------------------+---------------------------------------------------------------------------------------+
|1 |inv92615vdf  |For inventory 1 from Vodafone: vdf voip.|'1':4 'inv92615vdf':1 'inventori':3 'vdf':7 'vodafon':6 'voip':8                       |
|2 |inv88528vdf  |For inventory 2 from Vodafone: vdf voip.|'2':4 'inv88528vdf':1 'inventori':3 'vdf':7 'vodafon':6 'voip':8                       |
|3 |inv6924vdf   |For inventory 3 from Vodafone: vdf voip.|'3':4 'inv6924vdf':1 'inventori':3 'vdf':7 'vodafon':6 'voip':8                        |
|4 |inv-80542vdf |For inventory 1 from Vodafone: vdf mpls.|'1':6 '80542vdf':3 'inv':2 'inv-80542vdf':1 'inventori':5 'mpls':10 'vdf':9 'vodafon':8|
|5 |inv-81722vdf |For inventory 2 from Vodafone: vdf mpls.|'2':6 '81722vdf':3 'inv':2 'inv-81722vdf':1 'inventori':5 'mpls':10 'vdf':9 'vodafon':8|
|6 |inv-90259vdf |For inventory 3 from Vodafone: vdf mpls.|'3':6 '90259vdf':3 'inv':2 'inv-90259vdf':1 'inventori':5 'mpls':10 'vdf':9 'vodafon':8|
|7 |inv39708-org |For inventory 1 from Orange: org voip.  |'1':6 'inv39708':2 'inv39708-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10        |
|8 |inv93571-org |For inventory 2 from Orange: org voip.  |'2':6 'inv93571':2 'inv93571-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10        |
|9 |inv4770-org  |For inventory 3 from Orange: org voip.  |'3':6 'inv4770':2 'inv4770-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10          |
|10|inv-87130-org|For inventory 1 from Orange: org mpls.  |'-87130':2 '1':6 'inv':1 'inventori':5 'mpls':10 'orang':8 'org':3,9                   |
|11|inv-25686-org|For inventory 2 from Orange: org mpls.  |'-25686':2 '2':6 'inv':1 'inventori':5 'mpls':10 'orang':8 'org':3,9                   |
|12|inv-99591-org|For inventory 3 from Orange: org mpls.  |'-99591':2 '3':6 'inv':1 'inventori':5 'mpls':10 'orang':8 'org':3,9                   |
+--+-------------+----------------------------------------+---------------------------------------------------------------------------------------+
  

Let's assume the pattern is '2vd' or '0-or'. In each case, we would expect to get two records back – ids 4, 5, and 9, 10 respectively. When we execute the following:

    SQL
   
 

   SELECT id, number, search_ts_vector, ts_rank(search_ts_vector, query) as search_ts_rank
FROM inventory,
     to_tsquery('english', '2vd') query
WHERE search_ts_vector @@ query
ORDER BY search_ts_rank DESC;
  

The result is as follows:

    Plain Text
   
 

   +--+------------+-------------------------------------------------------------------------------+--------------+
|id|number      |search_ts_vector                                                               |search_ts_rank|
+--+------------+-------------------------------------------------------------------------------+--------------+
|8 |inv93571-org|'2':6 'inv93571':2 'inv93571-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10|0.12158542    |
|1 |inv92615vdf |'1':4 'inv92615vdf':1 'inventori':3 'vdf':7 'vodafon':6 'voip':8               |0.06079271    |
+--+------------+-------------------------------------------------------------------------------+--------------+
  

As the search_ts_vector column of these records contains lexemes starting with 'inv9', they are returned.

Further, if we search by pattern 'vdf',

    SQL
   
 

   SELECT id, number, comments, search_ts_vector, ts_rank(search_ts_vector, query) as search_ts_rank
FROM inventory,
     to_tsquery('english', 'vdf') query
WHERE search_ts_vector @@ query
ORDER BY search_ts_rank DESC;
  

We indeed receive the first six records as a result, and this happens because lexemes having the 'vdf' value do exist for them. These lexemes were actually created from the content in the comments column. If the full-text search had been configured only for the number column, no results would have been returned.

The pattern search limitation is clear and has a logical explanation now. On the other hand, one last query is worth executing, though, to show the power of PostgreSQL full-text search.

    SQL
   
 

   SELECT id, number, comments, search_ts_vector, ts_rank(search_ts_vector, query) as search_ts_rank
FROM inventory,
     to_tsquery('english', 'inv:* & voip & orange') query
WHERE search_ts_vector @@ query
ORDER BY search_ts_rank DESC;
  

We aim for all inventory entities whose numbers start with 'inv' and contain in their numbers or comments the words 'voip' and 'orange' or others starting with 'inv'. Quite a complex query, but the results are returned immediately.

    Plain Text
   
 

   +--+------------+--------------------------------------+-------------------------------------------------------------------------------+--------------+
|id|number      |comments                              |search_ts_vector                                                               |search_ts_rank|
+--+------------+--------------------------------------+-------------------------------------------------------------------------------+--------------+
|7 |inv39708-org|For inventory 1 from Orange: org voip.|'1':6 'inv39708':2 'inv39708-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10|0.26071766    |
|8 |inv93571-org|For inventory 2 from Orange: org voip.|'2':6 'inv93571':2 'inv93571-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10|0.26071766    |
|9 |inv4770-org |For inventory 3 from Orange: org voip.|'3':6 'inv4770':2 'inv4770-org':1 'inventori':5 'orang':8 'org':3,9 'voip':10  |0.26071766    |
+--+------------+--------------------------------------+-------------------------------------------------------------------------------+--------------+
  

Takeaway

By combining the theoretical aspects with the results of the simple experiment above, the conclusion is the following. PostgreSQL full-text search is great for semantic, language-aware searching, but it fails at raw pattern search because of tokenization, indexing strategy, and focus on relevance rather than on the exact matches.

Nevertheless, to finish on an optimistic note, the capabilities of PostgreSQL full-text search are more than satisfactory and a very good alternative to Elasticsearch when it comes to implementing general searching solutions, especially in the case of small and medium-sized applications.

Resources

PostgreSQL FTS Documentation
The picture is a mosaic detail from the Dome of St. Peter’s Basilica in the Vatican

Database sql PostgreSQL

Published at DZone with permission of Horatiu Dan. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending