DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Master SQL Performance Optimization: Step-by-Step Techniques With Case Studies
  • Top 10 PostgreSQL Tuning Tips for High-Performance Databases
  • Snowflake Administration: A Comprehensive Step-by-Step Guide
  • SQL Server to Postgres Database Migration

Trending

  • Jakarta EE 12: Entering the Data Age of Enterprise Java
  • Migrate a Hardcoded LangGraph Agent to LaunchDarkly AI Configs in 20 Minutes
  • How to Save Money Using Custom LLMs for Specific Tasks
  • Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
  1. DZone
  2. Data Engineering
  3. Databases
  4. PostgreSQL Trigram Similarity vs. Pattern Matching: A Performance Comparison

PostgreSQL Trigram Similarity vs. Pattern Matching: A Performance Comparison

Compare planning and execution times for similarity searches using trigram matching, case-insensitive regex and wildcard patterns, with and without GiST or GIN indexing.

By 
Horatiu Dan user avatar
Horatiu Dan
DZone Core CORE ·
Jan. 23, 26 · Analysis
Likes (7)
Comment
Save
Tweet
Share
3.5K Views

Join the DZone community and get the full member experience.

Join For Free

A previous article presented a performance comparison between PostgreSQL Full-Text Search (FTS) and plain pattern matching, especially in cases where a given piece of text is searched across multiple attributes of multiple entities. In brief, the conclusion was that FTS is significantly faster, while pattern matching offers greater flexibility, a wider range of applicability, and higher precision.

The present article aims to continue this analysis, which is worthwhile when implementing global search solutions, in order to increase objectivity regarding the applicability of these methods and to help programmers make informed choices more easily.

The goal this time is largely the same: to study PostgreSQL’s support for text similarity using trigram matching and to compare its efficiency and performance with that of pattern matching. According to the reference documentation, a trigram is a group of three consecutive characters in a string. The similarity between two strings can be computed, to a certain degree, by counting the number of trigrams they have in common. Although this may appear fuzzy at first glance, the approach is effective and can be applied to words in many languages.

Preconditions

  • Perform pattern searches on three different entities by looking up a single common attribute.
  • The entities of interest are telecom invoices, inventory items for which invoices are issued, and orders of such inventory items.
  • Entities that compose the result of the operation are displayed together.
  • A “contains” pattern strategy against the target columns is desired.
  • The searched fields are, respectively, invoice number, inventory number, and order number.

The experiment consists of four different strategies:

  • Similarity search using trigram matching
  • Case-insensitive regular expression search
  • Case-insensitive wildcard pattern search
  • Case-sensitive wildcard pattern search

The performance for each of the above is analyzed from three perspectives, based on the type of index applied to the number column (if any) and the recorded execution duration:

  • No index or B-Tree
  • Generalized Search Tree-based (GiST)
  • Generalized Inverted Index (GIN)

Usually, indexes are not mandatory for text searching, but performance is significantly improved when they are used, as will be observed after the experiment. To better highlight this aspect, measurements are also recorded when no index is set at all. It is well known that B-Tree indexes are not leveraged in use cases such as the one in this experiment (%pattern%); nevertheless, they are mentioned here for the sake of completeness.

In order to create GiST or GIN indexes, the pg_trgm extension must first be installed. Connect to the database and issue the following command:

SQL
 
CREATE EXTENSION pg_trgm;


Performance Analysis

Assuming the PostgreSQL database server is up and running, one can connect to it and explore the three entities of interest:

SQL
 
select count(*) from inventories;   -- 2_821_800 records
select count(*) from invoice;       -- 55_911 records
select count(*) from orders;        -- 30_564 records


The total number of searched entities is around 3 million — a sufficiently large sample for an objective analysis.

In order to manipulate the entire result set when filtering, and to clearly apply any index on the attribute being searched, the following materialized view is created:

SQL
 
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_trgm_entities AS
SELECT contractid AS id,
       'INVENTORY' AS type,
       extrainfo AS number
FROM inventories
UNION ALL
SELECT invoiceid AS id,
       'INVOICE' AS type,
       invoicenum AS number
FROM invoice
UNION ALL
SELECT orderid AS id,
       'ORDER' AS type,
       ordernumber AS numnber
FROM orders;


A few clarifications are worth making:

  • All records from the three tables of interest are consolidated in the materialized view.
  • id represents the unique identifier of each entity.
  • type identifies the source entity — INVENTORY, INVOICE, or ORDER.
  • number is the field on which pattern searches are performed and contains the relevant entity attributes.
  • The materialized view creation takes approximately 3 seconds.

To create the indexes for each case, the following commands are used:

SQL
 
CREATE INDEX mv_trgm_entities_number_btree_idx ON mv_trgm_entities(number);
 
CREATE INDEX mv_trgm_entities_number_gist_idx ON mv_trgm_entities USING GIST (number gist_trgm_ops);
 
CREATE INDEX mv_trgm_entities_number_gin_idx ON mv_trgm_entities USING GIN (number gin_trgm_ops);


To refresh the materialized view once the indexes are in place, the following command can be issued:

SQL
 
REFRESH MATERIALIZED VIEW mv_trgm_entities;


Depending on the index type used, the operation durations vary.

Index creation:

  • B-Tree – 3 seconds
  • GiST – 34 seconds
  • GIN – 6 seconds

Materialized view refresh:

  • B-Tree – 5 seconds
  • GiST – 37 seconds
  • GIN – 8 seconds

With these elements in place, we can examine performance across the following scenarios.

1. Similarity Search Using Trigram Matching (number % '101')

SQL
 
EXPLAIN ANALYZE
    SELECT id, type, number, similarity(number, '101') AS similarity
    FROM mv_trgm_entities
    WHERE number % '101';


The result set returned contains 10 records.

Plain Text
 
+-------+---------+-----------+----------+
|id     |type     |number     |similarity|
+-------+---------+-----------+----------+
|2699480|INVENTORY|data-100101|0.33333334|
|2498790|INVENTORY|Frame-101  |0.4       |
|2503795|INVENTORY|Frame-10101|0.36363637|
|2398638|INVENTORY|Pager-101  |0.4       |
|2408673|INVENTORY|Pager-10101|0.36363637|
|2599258|INVENTORY|data-101   |0.44444445|
|2609269|INVENTORY|data-10101 |0.4       |
|2700338|INVENTORY|voice-101  |0.4       |
|2710339|INVENTORY|voice-10101|0.36363637|
|49038  |INVOICE  |1041042101 |0.4       |
+-------+---------+-----------+----------+


The query plan analysis in all four cases:

Plain Text
 
-- no index or B-Tree on number
+---------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------+
|Gather  (cost=1000.00..33533.25 rows=291 width=29) (actual time=426.706..643.423 rows=10 loops=1)                                |
|  Workers Planned: 3                                                                                                             |
|  Workers Launched: 3                                                                                                            |
|  ->  Parallel Seq Scan on mv_trgm_entities  (cost=0.00..32504.15 rows=94 width=29) (actual time=453.421..636.644 rows=2 loops=4)|
|        Filter: ((number)::text % '101'::text)                                                                                   |
|        Rows Removed by Filter: 727066                                                                                           |
|Planning Time: 0.171 ms                                                                                                          |
|Execution Time: 643.443 ms                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------+
 
-- GIST on number
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities  (cost=0.41..327.43 rows=291 width=29) (actual time=3.813..156.371 rows=10 loops=1)|
|  Index Cond: ((number)::text % '101'::text)                                                                                                             |
|Planning Time: 0.269 ms                                                                                                                                  |
|Execution Time: 156.391 ms                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
 
-- GIN on number
+---------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities  (cost=17.65..336.51 rows=291 width=29) (actual time=13.675..24.263 rows=10 loops=1)                    |
|  Recheck Cond: ((number)::text % '101'::text)                                                                                               |
|  Rows Removed by Index Recheck: 14335                                                                                                       |
|  Heap Blocks: exact=3443                                                                                                                    |
|  ->  Bitmap Index Scan on mv_trgm_entities_number_gin_idx  (cost=0.00..17.58 rows=291 width=0) (actual time=4.812..4.812 rows=14345 loops=1)|
|        Index Cond: ((number)::text % '101'::text)                                                                                           |
|Planning Time: 0.300 ms                                                                                                                      |
|Execution Time: 24.289 ms                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------------------------+


Planning + Execution time: GIN (25 ms) < GiST (157 ms) < no index (644 ms)

2. Case Insensitive Regular Expression Search (number ~* '.*101.*')

SQL
 
EXPLAIN ANALYZE
    SELECT id, type, number
    FROM mv_trgm_entities
    WHERE number ~* '.*101.*';


The result set returned contains 33834 records.

The query plan analysis in all four cases:

Plain Text
 
-- no index or B-Tree on number
+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------+
|Gather  (cost=1000.00..33533.02 rows=291 width=25) (actual time=0.270..249.657 rows=33834 loops=1)                                |
|  Workers Planned: 3                                                                                                              |
|  Workers Launched: 3                                                                                                             |
|  ->  Parallel Seq Scan on mv_trgm_entities  (cost=0.00..32503.92 rows=94 width=25) (actual time=0.137..241.108 rows=8458 loops=4)|
|        Filter: ((number)::text ~* '.*101.*'::text)                                                                               |
|        Rows Removed by Filter: 718610                                                                                            |
|Planning Time: 0.087 ms                                                                                                           |
|Execution Time: 250.763 ms                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------+
 
-- GIST on number
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities  (cost=0.41..326.71 rows=291 width=25) (actual time=0.221..181.516 rows=33834 loops=1)|
|  Index Cond: ((number)::text ~* '.*101.*'::text)                                                                                                           |
|Planning Time: 0.123 ms                                                                                                                                     |
|Execution Time: 182.704 ms                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
 
-- GIN on number
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities  (cost=243.07..17975.93 rows=29377 width=25) (actual time=2.648..24.620 rows=33834 loops=1)                |
|  Recheck Cond: ((number)::text ~* '.*101.*'::text)                                                                                             |
|  Heap Blocks: exact=4786                                                                                                                       |
|  ->  Bitmap Index Scan on mv_trgm_entities_number_gin_idx  (cost=0.00..235.72 rows=29377 width=0) (actual time=2.122..2.123 rows=33834 loops=1)|
|        Index Cond: ((number)::text ~* '.*101.*'::text)                                                                                         |
|Planning Time: 0.177 ms                                                                                                                         |
|Execution Time: 25.698 ms                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------+


Planning + Execution time: GIN (26 ms) < GiST (183 ms) < no index (251 ms) 

3. Case Insensitive Wildcard Pattern Search (number ilike '%101%')

SQL
 
EXPLAIN ANALYZE
    SELECT id, type, number
    FROM mv_trgm_entities
    WHERE number ilike '%101%';


The result set is identical to the one at point 2 and contains 33834 records.

The query plan analysis in all four cases:

Plain Text
 
-- no index or B-Tree on number
+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------+
|Gather  (cost=1000.00..33533.02 rows=291 width=25) (actual time=0.290..266.339 rows=33834 loops=1)                                |
|  Workers Planned: 3                                                                                                              |
|  Workers Launched: 3                                                                                                             |
|  ->  Parallel Seq Scan on mv_trgm_entities  (cost=0.00..32503.92 rows=94 width=25) (actual time=0.250..257.123 rows=8458 loops=4)|
|        Filter: ((number)::text ~~* '%101%'::text)                                                                                |
|        Rows Removed by Filter: 718610                                                                                            |
|Planning Time: 0.088 ms                                                                                                           |
|Execution Time: 267.454 ms                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------+
 
-- GIST on number
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities  (cost=0.41..326.71 rows=291 width=25) (actual time=0.143..122.600 rows=33834 loops=1)|
|  Index Cond: ((number)::text ~~* '%101%'::text)                                                                                                            |
|Planning Time: 0.120 ms                                                                                                                                     |
|Execution Time: 123.704 ms                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
 
-- GIN on number
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities  (cost=243.07..17975.93 rows=29377 width=25) (actual time=2.152..20.411 rows=33834 loops=1)                |
|  Recheck Cond: ((number)::text ~~* '%101%'::text)                                                                                              |
|  Heap Blocks: exact=4786                                                                                                                       |
|  ->  Bitmap Index Scan on mv_trgm_entities_number_gin_idx  (cost=0.00..235.72 rows=29377 width=0) (actual time=1.636..1.636 rows=33834 loops=1)|
|        Index Cond: ((number)::text ~~* '%101%'::text)                                                                                          |
|Planning Time: 0.133 ms                                                                                                                         |
|Execution Time: 21.497 ms                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------+


Planning + Execution time: GIN (22 ms) < GiST (124 ms) < no index (268 ms)

4. Case Sensitive Wildcard Pattern Search (number like '%101%')

SQL
 
EXPLAIN ANALYZE
    SELECT id, type, number
    FROM mv_trgm_entities
    WHERE number like '%101%';


The result set is identical to the one at point 2 and 3 and contains 33834 records.

The query plan analysis in all four cases:

Plain Text
 
-- no index or B-Tree on number
+---------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------+
|Gather  (cost=1000.00..33533.02 rows=291 width=25) (actual time=0.268..65.248 rows=33834 loops=1)                                |
|  Workers Planned: 3                                                                                                             |
|  Workers Launched: 3                                                                                                            |
|  ->  Parallel Seq Scan on mv_trgm_entities  (cost=0.00..32503.92 rows=94 width=25) (actual time=0.090..57.038 rows=8458 loops=4)|
|        Filter: ((number)::text ~~ '%101%'::text)                                                                                |
|        Rows Removed by Filter: 718610                                                                                           |
|Planning Time: 0.053 ms                                                                                                          |
|Execution Time: 66.360 ms                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------+
 
-- GIST on number
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities  (cost=0.41..326.71 rows=291 width=25) (actual time=0.150..111.017 rows=33834 loops=1)|
|  Index Cond: ((number)::text ~~ '%101%'::text)                                                                                                             |
|Planning Time: 0.086 ms                                                                                                                                     |
|Execution Time: 112.211 ms                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
 
-- GIN on number
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities  (cost=243.07..17975.93 rows=29377 width=25) (actual time=2.120..10.853 rows=33834 loops=1)                |
|  Recheck Cond: ((number)::text ~~ '%101%'::text)                                                                                               |
|  Heap Blocks: exact=4786                                                                                                                       |
|  ->  Bitmap Index Scan on mv_trgm_entities_number_gin_idx  (cost=0.00..235.72 rows=29377 width=0) (actual time=1.606..1.607 rows=33834 loops=1)|
|        Index Cond: ((number)::text ~~ '%101%'::text)                                                                                           |
|Planning Time: 0.101 ms                                                                                                                         |
|Execution Time: 11.945 ms                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------+


Planning + Execution time: GIN (12 ms) < no index (66 ms) < GiST (112 ms) 

Conclusions

Based on the measurements above, the conclusions are fairly clear. For the specific dataset and types of searches performed, using a GIN index results in the lowest execution times. When opting for similarity search using trigram matching, the returned results do not include all records that would normally qualify under pattern-matching searches. When choosing pattern matching (either case-sensitive or case-insensitive) or regular expression searches, execution times are broadly comparable.

Planning + Execution time is summarized as follows:

Plain Text
 
-- no index
Similarity search - trigram matching (number % '101') - 644 ms 
CI regular expression search (number ~* '.*101.*') - 251 ms
CI wildcard pattern search (number ilike '%101%') - 268 ms
CS wildcard pattern search (number like '%101%') - 66 ms
  
-- GIST
Similarity search - trigram matching (number % '101') - 157 ms
CI regular expression search (number ~* '.*101.*') - 183 ms
CI wildcard pattern search (number ilike '%101%') - 124 ms
CS wildcard pattern search (number like '%101%') - 112 ms
  
-- GIN
Similarity search - trigram matching (number % '101') - 25 ms
CI regular expression search (number ~* '.*101.*') - 26 ms
CI wildcard pattern search (number ilike '%101%') - 22 ms
CS wildcard pattern search (number like '%101%') - 12 ms


Search Performance Comparison


Overall, the presence of GiST or GIN indexes on the relevant column(s) clearly improves performance. As a rule of thumb, GiST indexes are better suited for dynamic data and spatial types (being lossy and faster to update or build) but require false-positive checks, while GIN indexes excel with static data (arrays, JSONB, full-text search), offering faster lookups at the cost of slower builds, updates, and larger index sizes.

Database Virtual screening sql PostgreSQL Performance

Published at DZone with permission of Horatiu Dan. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Master SQL Performance Optimization: Step-by-Step Techniques With Case Studies
  • Top 10 PostgreSQL Tuning Tips for High-Performance Databases
  • Snowflake Administration: A Comprehensive Step-by-Step Guide
  • SQL Server to Postgres Database Migration

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook