PostgreSQL Trigram Similarity vs. Pattern Matching: A Performance Comparison
Compare planning and execution times for similarity searches using trigram matching, case-insensitive regex and wildcard patterns, with and without GiST or GIN indexing.
Join the DZone community and get the full member experience.
Join For FreeA previous article presented a performance comparison between PostgreSQL Full-Text Search (FTS) and plain pattern matching, especially in cases where a given piece of text is searched across multiple attributes of multiple entities. In brief, the conclusion was that FTS is significantly faster, while pattern matching offers greater flexibility, a wider range of applicability, and higher precision.
The present article aims to continue this analysis, which is worthwhile when implementing global search solutions, in order to increase objectivity regarding the applicability of these methods and to help programmers make informed choices more easily.
The goal this time is largely the same: to study PostgreSQL’s support for text similarity using trigram matching and to compare its efficiency and performance with that of pattern matching. According to the reference documentation, a trigram is a group of three consecutive characters in a string. The similarity between two strings can be computed, to a certain degree, by counting the number of trigrams they have in common. Although this may appear fuzzy at first glance, the approach is effective and can be applied to words in many languages.
Preconditions
- Perform pattern searches on three different entities by looking up a single common attribute.
- The entities of interest are telecom invoices, inventory items for which invoices are issued, and orders of such inventory items.
- Entities that compose the result of the operation are displayed together.
- A “contains” pattern strategy against the target columns is desired.
- The searched fields are, respectively, invoice number, inventory number, and order number.
The experiment consists of four different strategies:
- Similarity search using trigram matching
- Case-insensitive regular expression search
- Case-insensitive wildcard pattern search
- Case-sensitive wildcard pattern search
The performance for each of the above is analyzed from three perspectives, based on the type of index applied to the number column (if any) and the recorded execution duration:
- No index or B-Tree
- Generalized Search Tree-based (GiST)
- Generalized Inverted Index (GIN)
Usually, indexes are not mandatory for text searching, but performance is significantly improved when they are used, as will be observed after the experiment. To better highlight this aspect, measurements are also recorded when no index is set at all. It is well known that B-Tree indexes are not leveraged in use cases such as the one in this experiment (%pattern%); nevertheless, they are mentioned here for the sake of completeness.
In order to create GiST or GIN indexes, the pg_trgm extension must first be installed. Connect to the database and issue the following command:
CREATE EXTENSION pg_trgm;
Performance Analysis
Assuming the PostgreSQL database server is up and running, one can connect to it and explore the three entities of interest:
select count(*) from inventories; -- 2_821_800 records
select count(*) from invoice; -- 55_911 records
select count(*) from orders; -- 30_564 records
The total number of searched entities is around 3 million — a sufficiently large sample for an objective analysis.
In order to manipulate the entire result set when filtering, and to clearly apply any index on the attribute being searched, the following materialized view is created:
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_trgm_entities AS
SELECT contractid AS id,
'INVENTORY' AS type,
extrainfo AS number
FROM inventories
UNION ALL
SELECT invoiceid AS id,
'INVOICE' AS type,
invoicenum AS number
FROM invoice
UNION ALL
SELECT orderid AS id,
'ORDER' AS type,
ordernumber AS numnber
FROM orders;
A few clarifications are worth making:
- All records from the three tables of interest are consolidated in the materialized view.
idrepresents the unique identifier of each entity.typeidentifies the source entity — INVENTORY, INVOICE, or ORDER.numberis the field on which pattern searches are performed and contains the relevant entity attributes.- The materialized view creation takes approximately 3 seconds.
To create the indexes for each case, the following commands are used:
CREATE INDEX mv_trgm_entities_number_btree_idx ON mv_trgm_entities(number);
CREATE INDEX mv_trgm_entities_number_gist_idx ON mv_trgm_entities USING GIST (number gist_trgm_ops);
CREATE INDEX mv_trgm_entities_number_gin_idx ON mv_trgm_entities USING GIN (number gin_trgm_ops);
To refresh the materialized view once the indexes are in place, the following command can be issued:
REFRESH MATERIALIZED VIEW mv_trgm_entities;
Depending on the index type used, the operation durations vary.
Index creation:
- B-Tree – 3 seconds
- GiST – 34 seconds
- GIN – 6 seconds
Materialized view refresh:
- B-Tree – 5 seconds
- GiST – 37 seconds
- GIN – 8 seconds
With these elements in place, we can examine performance across the following scenarios.
1. Similarity Search Using Trigram Matching (number % '101')
EXPLAIN ANALYZE
SELECT id, type, number, similarity(number, '101') AS similarity
FROM mv_trgm_entities
WHERE number % '101';
The result set returned contains 10 records.
+-------+---------+-----------+----------+
|id |type |number |similarity|
+-------+---------+-----------+----------+
|2699480|INVENTORY|data-100101|0.33333334|
|2498790|INVENTORY|Frame-101 |0.4 |
|2503795|INVENTORY|Frame-10101|0.36363637|
|2398638|INVENTORY|Pager-101 |0.4 |
|2408673|INVENTORY|Pager-10101|0.36363637|
|2599258|INVENTORY|data-101 |0.44444445|
|2609269|INVENTORY|data-10101 |0.4 |
|2700338|INVENTORY|voice-101 |0.4 |
|2710339|INVENTORY|voice-10101|0.36363637|
|49038 |INVOICE |1041042101 |0.4 |
+-------+---------+-----------+----------+
The query plan analysis in all four cases:
-- no index or B-Tree on number
+---------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+---------------------------------------------------------------------------------------------------------------------------------+
|Gather (cost=1000.00..33533.25 rows=291 width=29) (actual time=426.706..643.423 rows=10 loops=1) |
| Workers Planned: 3 |
| Workers Launched: 3 |
| -> Parallel Seq Scan on mv_trgm_entities (cost=0.00..32504.15 rows=94 width=29) (actual time=453.421..636.644 rows=2 loops=4)|
| Filter: ((number)::text % '101'::text) |
| Rows Removed by Filter: 727066 |
|Planning Time: 0.171 ms |
|Execution Time: 643.443 ms |
+---------------------------------------------------------------------------------------------------------------------------------+
-- GIST on number
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities (cost=0.41..327.43 rows=291 width=29) (actual time=3.813..156.371 rows=10 loops=1)|
| Index Cond: ((number)::text % '101'::text) |
|Planning Time: 0.269 ms |
|Execution Time: 156.391 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
-- GIN on number
+---------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities (cost=17.65..336.51 rows=291 width=29) (actual time=13.675..24.263 rows=10 loops=1) |
| Recheck Cond: ((number)::text % '101'::text) |
| Rows Removed by Index Recheck: 14335 |
| Heap Blocks: exact=3443 |
| -> Bitmap Index Scan on mv_trgm_entities_number_gin_idx (cost=0.00..17.58 rows=291 width=0) (actual time=4.812..4.812 rows=14345 loops=1)|
| Index Cond: ((number)::text % '101'::text) |
|Planning Time: 0.300 ms |
|Execution Time: 24.289 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------+
Planning + Execution time: GIN (25 ms) < GiST (157 ms) < no index (644 ms)
2. Case Insensitive Regular Expression Search (number ~* '.*101.*')
EXPLAIN ANALYZE
SELECT id, type, number
FROM mv_trgm_entities
WHERE number ~* '.*101.*';
The result set returned contains 33834 records.
The query plan analysis in all four cases:
-- no index or B-Tree on number
+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+----------------------------------------------------------------------------------------------------------------------------------+
|Gather (cost=1000.00..33533.02 rows=291 width=25) (actual time=0.270..249.657 rows=33834 loops=1) |
| Workers Planned: 3 |
| Workers Launched: 3 |
| -> Parallel Seq Scan on mv_trgm_entities (cost=0.00..32503.92 rows=94 width=25) (actual time=0.137..241.108 rows=8458 loops=4)|
| Filter: ((number)::text ~* '.*101.*'::text) |
| Rows Removed by Filter: 718610 |
|Planning Time: 0.087 ms |
|Execution Time: 250.763 ms |
+----------------------------------------------------------------------------------------------------------------------------------+
-- GIST on number
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities (cost=0.41..326.71 rows=291 width=25) (actual time=0.221..181.516 rows=33834 loops=1)|
| Index Cond: ((number)::text ~* '.*101.*'::text) |
|Planning Time: 0.123 ms |
|Execution Time: 182.704 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
-- GIN on number
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities (cost=243.07..17975.93 rows=29377 width=25) (actual time=2.648..24.620 rows=33834 loops=1) |
| Recheck Cond: ((number)::text ~* '.*101.*'::text) |
| Heap Blocks: exact=4786 |
| -> Bitmap Index Scan on mv_trgm_entities_number_gin_idx (cost=0.00..235.72 rows=29377 width=0) (actual time=2.122..2.123 rows=33834 loops=1)|
| Index Cond: ((number)::text ~* '.*101.*'::text) |
|Planning Time: 0.177 ms |
|Execution Time: 25.698 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------+
Planning + Execution time: GIN (26 ms) < GiST (183 ms) < no index (251 ms)
3. Case Insensitive Wildcard Pattern Search (number ilike '%101%')
EXPLAIN ANALYZE
SELECT id, type, number
FROM mv_trgm_entities
WHERE number ilike '%101%';
The result set is identical to the one at point 2 and contains 33834 records.
The query plan analysis in all four cases:
-- no index or B-Tree on number
+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+----------------------------------------------------------------------------------------------------------------------------------+
|Gather (cost=1000.00..33533.02 rows=291 width=25) (actual time=0.290..266.339 rows=33834 loops=1) |
| Workers Planned: 3 |
| Workers Launched: 3 |
| -> Parallel Seq Scan on mv_trgm_entities (cost=0.00..32503.92 rows=94 width=25) (actual time=0.250..257.123 rows=8458 loops=4)|
| Filter: ((number)::text ~~* '%101%'::text) |
| Rows Removed by Filter: 718610 |
|Planning Time: 0.088 ms |
|Execution Time: 267.454 ms |
+----------------------------------------------------------------------------------------------------------------------------------+
-- GIST on number
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities (cost=0.41..326.71 rows=291 width=25) (actual time=0.143..122.600 rows=33834 loops=1)|
| Index Cond: ((number)::text ~~* '%101%'::text) |
|Planning Time: 0.120 ms |
|Execution Time: 123.704 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
-- GIN on number
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities (cost=243.07..17975.93 rows=29377 width=25) (actual time=2.152..20.411 rows=33834 loops=1) |
| Recheck Cond: ((number)::text ~~* '%101%'::text) |
| Heap Blocks: exact=4786 |
| -> Bitmap Index Scan on mv_trgm_entities_number_gin_idx (cost=0.00..235.72 rows=29377 width=0) (actual time=1.636..1.636 rows=33834 loops=1)|
| Index Cond: ((number)::text ~~* '%101%'::text) |
|Planning Time: 0.133 ms |
|Execution Time: 21.497 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------+
Planning + Execution time: GIN (22 ms) < GiST (124 ms) < no index (268 ms)
4. Case Sensitive Wildcard Pattern Search (number like '%101%')
EXPLAIN ANALYZE
SELECT id, type, number
FROM mv_trgm_entities
WHERE number like '%101%';
The result set is identical to the one at point 2 and 3 and contains 33834 records.
The query plan analysis in all four cases:
-- no index or B-Tree on number
+---------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+---------------------------------------------------------------------------------------------------------------------------------+
|Gather (cost=1000.00..33533.02 rows=291 width=25) (actual time=0.268..65.248 rows=33834 loops=1) |
| Workers Planned: 3 |
| Workers Launched: 3 |
| -> Parallel Seq Scan on mv_trgm_entities (cost=0.00..32503.92 rows=94 width=25) (actual time=0.090..57.038 rows=8458 loops=4)|
| Filter: ((number)::text ~~ '%101%'::text) |
| Rows Removed by Filter: 718610 |
|Planning Time: 0.053 ms |
|Execution Time: 66.360 ms |
+---------------------------------------------------------------------------------------------------------------------------------+
-- GIST on number
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Index Scan using mv_trgm_entities_number_gist_idx on mv_trgm_entities (cost=0.41..326.71 rows=291 width=25) (actual time=0.150..111.017 rows=33834 loops=1)|
| Index Cond: ((number)::text ~~ '%101%'::text) |
|Planning Time: 0.086 ms |
|Execution Time: 112.211 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
-- GIN on number
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on mv_trgm_entities (cost=243.07..17975.93 rows=29377 width=25) (actual time=2.120..10.853 rows=33834 loops=1) |
| Recheck Cond: ((number)::text ~~ '%101%'::text) |
| Heap Blocks: exact=4786 |
| -> Bitmap Index Scan on mv_trgm_entities_number_gin_idx (cost=0.00..235.72 rows=29377 width=0) (actual time=1.606..1.607 rows=33834 loops=1)|
| Index Cond: ((number)::text ~~ '%101%'::text) |
|Planning Time: 0.101 ms |
|Execution Time: 11.945 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------+
Planning + Execution time: GIN (12 ms) < no index (66 ms) < GiST (112 ms)
Conclusions
Based on the measurements above, the conclusions are fairly clear. For the specific dataset and types of searches performed, using a GIN index results in the lowest execution times. When opting for similarity search using trigram matching, the returned results do not include all records that would normally qualify under pattern-matching searches. When choosing pattern matching (either case-sensitive or case-insensitive) or regular expression searches, execution times are broadly comparable.
Planning + Execution time is summarized as follows:
-- no index
Similarity search - trigram matching (number % '101') - 644 ms
CI regular expression search (number ~* '.*101.*') - 251 ms
CI wildcard pattern search (number ilike '%101%') - 268 ms
CS wildcard pattern search (number like '%101%') - 66 ms
-- GIST
Similarity search - trigram matching (number % '101') - 157 ms
CI regular expression search (number ~* '.*101.*') - 183 ms
CI wildcard pattern search (number ilike '%101%') - 124 ms
CS wildcard pattern search (number like '%101%') - 112 ms
-- GIN
Similarity search - trigram matching (number % '101') - 25 ms
CI regular expression search (number ~* '.*101.*') - 26 ms
CI wildcard pattern search (number ilike '%101%') - 22 ms
CS wildcard pattern search (number like '%101%') - 12 ms

Overall, the presence of GiST or GIN indexes on the relevant column(s) clearly improves performance. As a rule of thumb, GiST indexes are better suited for dynamic data and spatial types (being lossy and faster to update or build) but require false-positive checks, while GIN indexes excel with static data (arrays, JSONB, full-text search), offering faster lookups at the cost of slower builds, updates, and larger index sizes.
Published at DZone with permission of Horatiu Dan. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments