PostgreSQL Full-Text Search vs. Pattern Matching: A Performance Comparison

We'll analyze the performance of PostgreSQL full-text search (FTS) versus pattern and regex searching, highlighting trade-offs and execution efficiency.

Horatiu Dan

CORE ·

Oct. 17, 25 · Analysis

Likes (8)

Comment

Save

2.9K Views

A previous article explains why PostgreSQL full-text search (FTS) is not a good candidate for implementing a general find functionality, where the user is expected to provide a pattern to be looked up and matched against the fields of one or multiple entities. Considering the previously explained technical challenges, it is clear that FTS is great for semantic and language-aware search, although it cannot cover raw searches in detail for various use cases.

In the field of software development, it isn’t uncommon for us to need to accept trade-offs. Actually, almost every decision that we make when architecting and designing a product is a compromise. The balance is tilted depending on the purpose, requirements, and specifics of the developed product so that the solution and the value delivery are ensured, and the customers are helped to accomplish their needs.

With this statement in mind, to achieve success, you need to compromise. This article aims to compare the two methods, FTS and pattern matching, in terms of execution performance, of course, under a set of clear preconditions (trade-offs) that are assumed. The purpose is obviously to increase the objectivity related to these methods’ applicability and help programmers choose easily when needed.

Preconditions

Perform a pattern search on three different entities by looking up in multiple tables’ columns.
The entities of interest are: telecom invoices, inventory items for which the invoices are issued, and orders of such inventory items.
Entities that compose the result of the operation are displayed together.
A ‘starts with’ matching against the aimed columns is acceptable.
The fields of interest are: invoice number and comments, inventory number and comments, and order number and comments.

Performance Analysis

Assuming the PostgreSQL Database Server is up and running, one can connect to it and explore the three entities of interest.

    SQL
   
   select count(*) from inventories;   -- 2_821_800 records
select count(*) from invoice;       -- 55_911 records
select count(*) from orders;        -- 30_564 records

One can observe that the total number of searched entities is around 3 million, which is a pretty good sample for an objective analysis.

The purpose isn’t necessarily to provide the implementation in detail, but to observe the performance of several strategies and make a comparison.

In order to be able to make the analysis without having to modify the existing tables, the following materialized view is created.

    SQL
   
 

   CREATE MATERIALIZED VIEW IF NOT EXISTS mv_fts_entities AS
SELECT contractid AS id,
       'INVENTORY' AS type,
       extrainfo AS number,
        (
            setweight(to_tsvector('simple', coalesce(extrainfo, '')), 'A') ||
            setweight(to_tsvector('simple', coalesce(itemcomments, '')), 'B')
        ) AS search_vector
FROM inventories
UNION ALL
SELECT i.invoiceid AS id,
       'INVOICE' AS type,
       i.invoicenum AS number,
        (
            setweight(to_tsvector('simple', i.invoicenum), 'A') ||
            setweight(to_tsvector('simple', coalesce(i.comments, '')), 'B') ||
            setweight(to_tsvector('simple', a.account), 'A')
        ) AS search_vector
FROM invoice i
LEFT JOIN account a on a.id = i.accountid
UNION ALL
SELECT orderid AS id,
       'ORDER' AS type,
       ordernumber AS numnber,
        (
            setweight(to_tsvector('simple', ordernumber), 'A') ||
            setweight(to_tsvector('simple', coalesce(comments, '')), 'B')
        ) AS search_vector
FROM orders;
  

A few clarifications:

The materialized view reunites all the records contained in the three tables of interest.
id represents the unique identifier of each entity.
type allows identifying the entity — INVENTORY, INVOICE, or ORDER.
number is the field we’re mainly interested performing the pattern search on — inventory, invoice and order number respectively.
search_vector contains the tsvector representation of the columns of interest and it represents PostgreSQL’s text search vector type that denotes the searchable content.
setweight() – depending on the considered column, different weights are set when computing the lexemes, for instance, in case of orders, we consider the match to have a higher priority on order number than on comments.
coalesce() handles the null values gracefully.
For invoice records, a match is attempted on the invoice account as well, although the column designates an attribute of a different entity.

The materialized view creation takes about three seconds.

If interested in refreshing the content so that the data is up to date, one can issue the command below.

    SQL
   
   REFRESH MATERIALIZED VIEW mv_fts_entities;

The materialized view refresh takes around 10 seconds.

Additionally, the following indexes are created to improve performance in both cases.

A GIN index on the search_vectorcolumn to improve full-text search

      SQL
     
     CREATE INDEX mv_fts_entities_search_idx ON mv_fts_entities USING GIN (search_vector);

A B-Tree index on the number column to improve the ‘starts with’ pattern searching
SQL
```
CREATE INDEX mv_fts_entities_number_idx ON mv_fts_entities(number);
```

Both operations take about five seconds.

With the above items in place, let’s examine the performance in each of the following scenarios.

1. Full-Text Search

    SQL
   
 

   EXPLAIN ANALYZE
    SELECT id, type, number, search_vector, ts_rank(search_vector, query) as rank
    FROM mv_fts_entities,
         to_tsquery('simple', '101:*') query
    WHERE search_vector @@ query
    ORDER BY rank DESC
    LIMIT 10;
  

This returns the following results and query plan:

    Plain Text
   
 

   +------+---------+---------------------+---------------------------------------------+---------+
|id    |type     |number               |search_vector                                |rank     |
+------+---------+---------------------+---------------------------------------------+---------+
|162400|INVENTORY|KBBC24100 101ATI     |'101ati':2A 'kbbc24100':1A                   |0.6079271|
|13162 |INVOICE  |M566274              |'10130bafy0':2A 'm566274':1A                 |0.6079271|
|4880  |INVOICE  |M554853              |'10130bafy0':2A 'm554853':1A                 |0.6079271|
|55713 |INVOICE  |M628493              |'10130bbt0':2A 'm628493':1A                  |0.6079271|
|52525 |INVOICE  |M623623              |'10130bfml0':2A 'm623623':1A                 |0.6079271|
|35131 |INVOICE  |4233816-IVG          |'10111020':3A '4233816':1A 'ivg':2A          |0.6079271|
|34326 |INVOICE  |4233312-IVG          |'10111020':3A '4233312':1A 'ivg':2A          |0.6079271|
|34082 |INVOICE  |4232587IVG           |'10111020':2A '4232587ivg':1A                |0.6079271|
|46370 |INVOICE  |101912352160142303323|'101912352160142303323':1A '9897901309489':2A|0.6079271|
|132670|INVENTORY|KBBC75705 101ATI     |'101ati':2A 'kbbc75705':1A                   |0.6079271|
+------+---------+---------------------+---------------------------------------------+---------+
 
+-------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
|Limit  (cost=173.72..173.75 rows=10 width=29) (actual time=1.120..1.122 rows=10 loops=1)                                                         |
|  ->  Sort  (cost=173.72..174.08 rows=145 width=29) (actual time=1.119..1.120 rows=10 loops=1)                                                   |
|        Sort Key: (ts_rank(mv_fts_entities.search_vector, '''101'':*'::tsquery)) DESC                                                            |
|        Sort Method: top-N heapsort  Memory: 25kB                                                                                                |
|        ->  Bitmap Heap Scan on mv_fts_entities  (cost=9.93..170.59 rows=145 width=29) (actual time=0.259..0.967 rows=914 loops=1)               |
|              Recheck Cond: (search_vector @@ '''101'':*'::tsquery)                                                                              |
|              Heap Blocks: exact=476                                                                                                             |
|              ->  Bitmap Index Scan on mv_fts_entities_search_idx  (cost=0.00..9.89 rows=145 width=0) (actual time=0.213..0.213 rows=914 loops=1)|
|                    Index Cond: (search_vector @@ '''101'':*'::tsquery)                                                                          |
|Planning Time: 0.199 ms                                                                                                                          |
|Execution Time: 1.141 ms                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
  

Key points:

The Bitmap Index Scan is used for matching
Uses Top-N heapsort for ordering (ranking) the results
The search time is around 1 ms

2. Case-Sensitive Wildcard Pattern Search

    SQL
   
 

   EXPLAIN ANALYZE
    SELECT id, type, number
    FROM mv_fts_entities
    WHERE number LIKE '101%'
    LIMIT 10;
  

This returns the following results and query plan:

    Plain Text
   
 

   +------+---------+-----------+
|id    |type     |number     |
+------+---------+-----------+
|532   |ORDER    |TEM1101    |
|15642 |ORDER    |CCON101    |
|310983|INVENTORY|7037618101 |
|36445 |INVOICE  |83551101   |
|1532  |ORDER    |TEM2101    |
|16642 |ORDER    |CCON1101   |
|546667|INVENTORY|P0000010101|
|13180 |INVOICE  |28071101   |
|2529  |ORDER    |TEM3101    |
|17642 |ORDER    |CCON2101   |
+------+---------+-----------+
 
+---------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------+
|Limit  (cost=0.00..2268.43 rows=10 width=25) (actual time=0.395..2.265 rows=10 loops=1)                              |
|  ->  Seq Scan on mv_fts_entities  (cost=0.00..66011.44 rows=291 width=25) (actual time=0.393..2.262 rows=10 loops=1)|
|        Filter: ((number)::text ~~ '101%'::text)                                                                     |
|        Rows Removed by Filter: 35657                                                                                |
|Planning Time: 0.063 ms                                                                                              |
|Execution Time: 2.277 ms                                                                                             |
+---------------------------------------------------------------------------------------------------------------------+
  

Key points:

The sequential scan is used
The result set is obviously different from the one at point 1
The search time is slightly longer, about 2.2 ms

3. Case-Insensitive Wildcard Pattern Search

    SQL
   
 

   EXPLAIN ANALYZE
    SELECT id, type, number
    FROM mv_fts_entities
    WHERE number ILIKE '101%'
    LIMIT 10;
  

This returns the following results and query plan:

    Plain Text
   
 

   +------+---------+----------------------------+
|id    |type     |number                      |
+------+---------+----------------------------+
|46370 |INVOICE  |101912352160142303323       |
|455785|INVENTORY|101t1zflivnminwdcolivomigf  |
|455782|INVENTORY|101t1zfclevohizhopclevoh62  |
|455783|INVENTORY|101t1zfclmmohkfhoohlrdoh87  |
|455784|INVENTORY|101t1zfgbtpm11111h00grblmimn|
|455786|INVENTORY|101t1zfnilsmimndc0nltpmid0  |
|455787|INVENTORY|101t1zfpthrmibiho6pthrmimn  |
|36819 |INVOICE  |101912352160142393323       |
|32931 |INVOICE  |101912352160142392823       |
|8002  |INVOICE  |101912352121                |
+------+---------+----------------------------+
 
 
+----------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
|Limit  (cost=0.00..2268.43 rows=10 width=25) (actual time=1.939..11.356 rows=10 loops=1)                              |
|  ->  Seq Scan on mv_fts_entities  (cost=0.00..66011.44 rows=291 width=25) (actual time=1.938..11.353 rows=10 loops=1)|
|        Filter: ((number)::text ~~* '101%'::text)                                                                     |
|        Rows Removed by Filter: 35657                                                                                 |
|Planning Time: 0.098 ms                                                                                               |
|Execution Time: 11.369 ms                                                                                             |
+----------------------------------------------------------------------------------------------------------------------+
  

Key points:

The sequential scan is used
The result set is obviously different from the one at point 1, but also from the one at point 2
The search time is longer than the ones before, about 11.5 ms

4. Case-Insensitive Regular Expression Search

    SQL
   
 

   EXPLAIN ANALYZE
    SELECT id, type, number
    FROM mv_fts_entities
    WHERE number ~* '^101'
    LIMIT 10;
  

This returns the following results and query plan:

    Plain Text
   
 

   +------+---------+----------------------------+
|id    |type     |number                      |
+------+---------+----------------------------+
|46370 |INVOICE  |101912352160142303323       |
|455785|INVENTORY|101t1zflivnminwdcolivomigf  |
|455782|INVENTORY|101t1zfclevohizhopclevoh62  |
|455783|INVENTORY|101t1zfclmmohkfhoohlrdoh87  |
|455784|INVENTORY|101t1zfgbtpm11111h00grblmimn|
|455786|INVENTORY|101t1zfnilsmimndc0nltpmid0  |
|455787|INVENTORY|101t1zfpthrmibiho6pthrmimn  |
|36819 |INVOICE  |101912352160142393323       |
|32931 |INVOICE  |101912352160142392823       |
|8002  |INVOICE  |101912352121                |
+------+---------+----------------------------+
 
+---------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------+
|Limit  (cost=0.00..2268.43 rows=10 width=25) (actual time=0.943..6.265 rows=10 loops=1)                              |
|  ->  Seq Scan on mv_fts_entities  (cost=0.00..66011.44 rows=291 width=25) (actual time=0.942..6.262 rows=10 loops=1)|
|        Filter: ((number)::text ~* '^101'::text)                                                                     |
|        Rows Removed by Filter: 35657                                                                                |
|Planning Time: 0.088 ms                                                                                              |
|Execution Time: 6.277 ms                                                                                             |
+---------------------------------------------------------------------------------------------------------------------+
  

Key points:

The sequential scan is used.
The result set is obviously different from the one at point 1, but identical to the previous one.
The search time is longer than the ones using the full-text search, but shorter than the pattern searching, about 6.2 ms.

The results obtained in terms of execution time can be summarized as follows:

FTS << Case Sens. Pattern Search < Case Insens. Regex Search < Case Insens. Pattern Search

The experiment in this article and the results obtained make FTS a candidate worth considering even for pattern searching when its known limitations in the scenarios of interest are acceptable. Moreover, its configuration flexibility in terms of tsvector computation and its speed of execution make it superior in comparison to other solutions, of course, under the presented circumstances.

Resources

Pattern Searching and PostgreSQL Full-Text Search: Understanding the Mismatch
The picture was taken at an immersive exhibition in Bucharest, Romania.

Database PostgreSQL Performance

Published at DZone with permission of Horatiu Dan. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending

PostgreSQL Full-Text Search vs. Pattern Matching: A Performance Comparison

We'll analyze the performance of PostgreSQL full-text search (FTS) versus pattern and regex searching, highlighting trade-offs and execution efficiency.

Preconditions

Performance Analysis

1. Full-Text Search

2. Case-Sensitive Wildcard Pattern Search

3. Case-Insensitive Wildcard Pattern Search

4. Case-Insensitive Regular Expression Search

Resources

Related

Partner Resources