
How to Speed Up Pattern Matching Queries


If there is no built-in solution or index in MySQL, don't give up. Many times, with small modifications, you can create your own index table or use other tricks.


From time to time, I see pattern matching queries with conditions that look like this: "where fieldname like '%something%'". MySQL cannot use an index for these kinds of queries, which means it has to do a full scan every single time.

(That's really only half true: there are FULLTEXT indexes, which I will cover in another blog post.)

I was recently trying to find a solution when my friend Charles Nagy reminded me of trigrams. Let me show you the trigrams of the name Daniel:

daniel:
dan
ani
nie
iel
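
Just to illustrate, the same four trigrams can be produced with plain SQL; this is a throwaway sketch, not part of the article's setup:

-- every 3-character window of 'daniel'
SELECT SUBSTRING('daniel', n, 3) AS trigram
FROM (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) AS positions;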

But How Is This Useful?

Let me show you an example. You have the following email schema:

CREATE TABLE `email` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=318459 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

With data like this:

+--------+-------------------------------+
| id     | email                         |
+--------+-------------------------------+
|   8392 | ehintz@example.com            |
| 238396 | williamson.pierre@example.net |
| 286517 | godfrey79@example.org         |
| 137291 | welch.durward@example.org     |
| 291984 | curtis67@example.net          |
...
+--------+-------------------------------+

And we are looking for email addresses like '%n.pierre%':

mysql> select * from email where email like '%n.pierre%';
+--------+-------------------------------+
| id     | email                         |
+--------+-------------------------------+
|  90587 | anderson.pierre@example.org   |
| 133352 | friesen.pierre@example.net    |
| 118937 | green.pierre@example.net      |
| 118579 | hudson.pierre@example.com     |
| 237928 | johnson.pierre@example.org    |
|  59526 | larkin.pierre@example.net     |
| 278384 | monahan.pierre@example.net    |
|  58625 | olson.pierre@example.org      |
| 306718 | rohan.pierre@example.com      |
| 200608 | wilkinson.pierre@example.org  |
| 238396 | williamson.pierre@example.net |
+--------+-------------------------------+
11 rows in set (0.12 sec)
mysql> explain select * from email where email like '%n.pierre%'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: email
   partitions: NULL
         type: index
possible_keys: NULL
          key: idx_email
      key_len: 362
          ref: NULL
         rows: 332475
     filtered: 11.11
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)
mysql> show session status like '%handler%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 2      |
| Handler_mrr_init           | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 1      |
| Handler_read_key           | 1      |
| Handler_read_last          | 0      |
| Handler_read_next          | 318458 |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
18 rows in set (0.00 sec)

The query returns 11 email addresses, but MySQL has to scan the whole index (318458 rows). That's not good! Let's try to make it better.

Trigram Table

I created a table like this:

CREATE TABLE `email_trigram` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email_id` int(10) unsigned NOT NULL,
  `trigram` char(3) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email_id` (`email_id`),
  KEY `idx_trigram` (`trigram`)
) ENGINE=InnoDB AUTO_INCREMENT=2749001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

As we can see, there is an index (idx_trigram) on the trigram column.

The plan is to generate and store the trigrams for every email address. I wrote the following trigger to populate the table on insert:

DELIMITER $$
CREATE TRIGGER insert_trigram AFTER INSERT ON test.email
FOR EACH ROW
BEGIN
  DECLARE email_length int;
  DECLARE x int;
  DECLARE i int;
  -- email_length is the length of the local part (before the '@');
  -- only that part is broken into trigrams
  SELECT CHAR_LENGTH(SUBSTRING_INDEX(email,'@',1)) INTO email_length
    FROM test.email WHERE email=NEW.email AND id=NEW.id;
  SET i=1;
  SET x=3;
  -- one row per 3-character window of the local part
  WHILE email_length >= 3 DO
    INSERT INTO email_trigram (email_id,trigram)
      VALUES (NEW.id, SUBSTRING(NEW.email FROM i FOR x));
    SET email_length=email_length-1;
    SET i=i+1;
  END WHILE;
END$$
DELIMITER ;

On every insert, the trigger generates the trigrams of the new address and inserts them into the email_trigram table. Here are the trigrams for anderson.pierre:

mysql> select trigram from email_trigram where email_id=90587;
+---------+
| trigram |
+---------+
| and     |
| nde     |
| der     |
| ers     |
| rso     |
| son     |
| on.     |
| n.p     |
| .pi     |
| pie     |
| ier     |
| err     |
| rre     |
+---------+
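
Note that the trigger only covers rows inserted after it was created. For the rows that already exist, a one-time backfill is needed; here is a minimal sketch using a recursive CTE (MySQL 8.0+ only, not part of the original setup):

-- generate every 3-character window of the local part for existing rows
INSERT INTO email_trigram (email_id, trigram)
WITH RECURSIVE positions AS (
  SELECT id AS email_id,
         SUBSTRING_INDEX(email, '@', 1) AS local_part,
         1 AS pos
  FROM email
  UNION ALL
  SELECT email_id, local_part, pos + 1
  FROM positions
  WHERE pos < CHAR_LENGTH(local_part) - 2
)
SELECT email_id, SUBSTRING(local_part, pos, 3)
FROM positions
WHERE CHAR_LENGTH(local_part) >= 3;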

With the following query, we can find all the email addresses containing n.pierre:

SELECT e.email
FROM email AS e
INNER JOIN
( SELECT tr.email_id
FROM email_trigram AS tr
WHERE tr.trigram IN ("n.p",".pi","pie","ier","err","rre")
GROUP BY tr.email_id HAVING count(*) =6) AS ti ON ti.email_id=e.id;

Here is the EXPLAIN output for this query:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 8214
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: e
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: ti.email_id
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: tr
   partitions: NULL
         type: range
possible_keys: idx_email_id,idx_trigram
          key: idx_trigram
      key_len: 9
          ref: NULL
         rows: 8214
     filtered: 100.00
        Extra: Using index condition; Using temporary; Using filesort
3 rows in set, 1 warning (0.00 sec)

It does not have to read the whole table, but it still needs to read a lot of rows and even uses a filesort. I did not want to build the trigram lists by hand, so I wrote the following procedure:

DELIMITER $$
CREATE PROCEDURE select_email(IN email_address varchar(255))
BEGIN
  DECLARE email_length int;
  DECLARE x int;
  DECLARE i int;
  DECLARE trigram varchar(255);
  DECLARE trigram_list varchar(255);
  -- length of the part we search in (before the '@', if there is one)
  SELECT CHAR_LENGTH(SUBSTRING_INDEX(email_address,'@',1)) INTO email_length;
  SET i=1;
  SET x=3;
  -- build the comma-separated list of trigrams of the search pattern
  WHILE email_length >= 3 DO
    SELECT SUBSTRING(SUBSTRING_INDEX(email_address,'@',1) FROM i FOR x) INTO trigram;
    SET trigram_list=CONCAT_WS('","',trigram_list,trigram);
    SET email_length=email_length-1;
    SET i=i+1;
  END WHILE;
  SET trigram_list=CONCAT('"',trigram_list,'"');
  -- an email matches only if it contains all (i-1) trigrams of the pattern
  SET @final=CONCAT(" SELECT e.email
    FROM email AS e
    INNER JOIN
    ( SELECT tr.email_id
      FROM email_trigram AS tr
      WHERE tr.trigram IN (" , trigram_list , ")
      GROUP BY tr.email_id HAVING count(*) =" , i-1 , ") AS ti ON ti.email_id=e.id");
  PREPARE stmt FROM @final;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END$$
DELIMITER ;
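
With the procedure in place, a search is reduced to a single call; the procedure builds the trigram list and the matching HAVING count itself (assuming it was created in the same test schema as the tables):

-- equivalent to: where email like '%n.pierre%'
CALL select_email('n.pierre');
-- a longer, more selective pattern
CALL select_email('derson.pierre');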

Since with trigrams we are looking for short parts of words (like err or ier), there can be many matches. Even with a longer condition like derson.pierre, the procedure had to read 65722 rows. That is still a lot.

Let's have a look at the selectivity a bit:

mysql> select trigram,count(*) as c from email_trigram group by trigram order by c desc limit 10;
+---------+-------+
| trigram | c     |
+---------+-------+
| er.     | 12919 |
| ann     | 12726 |
| man     | 12462 |
| ell     | 12257 |
| ber     | 12225 |
| sch     | 11753 |
| son     | 10983 |
| ill     |  9443 |
| mar     |  9294 |
| ert     |  8839 |
+---------+-------+
10 rows in set (1.59 sec)

Some trigrams match a lot of rows and, as mentioned, a longer pattern means more trigrams, so even more rows have to be read.

I was hoping for a bigger improvement, so I wondered what else we could do. MySQL cannot use an index because of the leading %. How can we avoid that? Let's store every possible suffix of the email address that we could be searching for.

(I don't know if there is an official name for this method; if someone knows it, please write a comment.)

Shorting Method

anderson.pierre@example.org:
anderson.pierre
nderson.pierre
derson.pierre
erson.pierre
rson.pierre
son.pierre
on.pierre
n.pierre
.pierre
pierre
ierre
erre
rre

Hmm... could this work? Let's test it. I created the following table and trigger:

CREATE TABLE `email_tib` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email_id` int(10) unsigned NOT NULL,
  `email_parts` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email_id` (`email_id`),
  KEY `idx_trigram` (`email_parts`)
) ENGINE=InnoDB AUTO_INCREMENT=2749001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

DELIMITER $$
CREATE TRIGGER insert_tib AFTER INSERT ON test.email
FOR EACH ROW
BEGIN
  DECLARE email_length int;
  DECLARE x int;
  DECLARE i int;
  -- length of the local part (before the '@')
  SELECT CHAR_LENGTH(SUBSTRING_INDEX(email,'@',1)) INTO email_length
    FROM test.email WHERE email=NEW.email AND id=NEW.id;
  SET i=-3;
  -- insert every suffix of the local part, starting with the last
  -- 3 characters and growing up to the whole local part
  WHILE email_length >= 3 DO
    INSERT INTO email_tib (email_id,email_parts)
      VALUES (NEW.id, SUBSTRING(SUBSTRING_INDEX(NEW.email,'@',1), i));
    SET email_length=email_length-1;
    SET i=i-1;
  END WHILE;
END$$
DELIMITER ;
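
As with the trigram table, rows that existed before the trigger was created need a one-time backfill. A minimal sketch with a recursive CTE (again MySQL 8.0+ only, not part of the original post):

-- store every suffix (length >= 3) of the local part for existing rows
INSERT INTO email_tib (email_id, email_parts)
WITH RECURSIVE suffixes AS (
  SELECT id AS email_id, SUBSTRING_INDEX(email, '@', 1) AS part
  FROM email
  UNION ALL
  SELECT email_id, SUBSTRING(part, 2)
  FROM suffixes
  WHERE CHAR_LENGTH(part) > 3
)
SELECT email_id, part
FROM suffixes
WHERE CHAR_LENGTH(part) >= 3;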

Let's find the email addresses that contain n.pierre:

mysql> SELECT e.email
FROM email AS e
RIGHT JOIN email_tib AS t ON t.email_id=e.id
WHERE t.email_parts LIKE "n.pierre%";
+-------------------------------+
| email                         |
+-------------------------------+
| anderson.pierre@example.org   |
| friesen.pierre@example.net    |
| green.pierre@example.net      |
| hudson.pierre@example.com     |
| johnson.pierre@example.org    |
| larkin.pierre@example.net     |
| monahan.pierre@example.net    |
| olson.pierre@example.org      |
| rohan.pierre@example.com      |
| wilkinson.pierre@example.org  |
| williamson.pierre@example.net |
+-------------------------------+
11 rows in set (0.00 sec)
mysql> show session status like '%handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 12    |
| Handler_read_last          | 0     |
| Handler_read_next          | 11    |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
18 rows in set (0.00 sec)

Wow, that is much better than the previous one! It is more than 100 times faster! Now you can have a beer because you deserve it. 

Selectivity

mysql> select email_parts,count(*) as c
from email_tib group by email_parts
order by c desc limit 10;
+-------------+------+
| email_parts | c    |
+-------------+------+
| son         | 6228 |
| ler         | 3863 |
| ner         | 3635 |
| ann         | 3590 |
| mann        | 3464 |
| man         | 3387 |
| ski         | 3331 |
| ton         | 2659 |
| ell         | 2482 |
| ger         | 2437 |
+-------------+------+
10 rows in set (1.61 sec)

Some suffixes match many rows as well, but selectivity improves a lot once the search pattern gets longer:

mysql> select email_parts,count(*) as c
from email_tib where length(email_parts) > 5
group by email_parts order by c desc limit 10;
+-------------+-----+
| email_parts | c   |
+-------------+-----+
| reilly      | 704 |
| walter      | 684 |
| kowski      | 676 |
| onnelly     | 662 |
| nnelly      | 662 |
| kinson      | 641 |
| tenberg     | 626 |
| enberg      | 626 |
| hermann     | 417 |
| ermann      | 417 |
+-------------+-----+
10 rows in set (1.59 sec)

Using more than six characters gives much better selectivity.

Table Statistics

select count(*) from email;
+----------+
| count(*) |
+----------+
|   318458 |
+----------+
1 row in set (0.04 sec)
select count(*) from email_trigram;
+----------+
| count(*) |
+----------+
|  2749000 |
+----------+
1 row in set (0.44 sec)
select count(*) from email_tib;
+----------+
| count(*) |
+----------+
|  2749000 |
+----------+
1 row in set (0.45 sec)

In this test, I used 318458 random email addresses; each method created 2749000 additional rows.

Size on the disk:

45M   email.ibd
221M  email_tib.ibd
189M  email_trigram.ibd

As expected, they use considerably more disk space than the original table.

Cons

  • Both solutions require an extra table.
  • That table contains millions of short rows, and it could use a few gigs of space.
  • Requires three triggers (insert, update, and delete), which could affect write performance on the table, or the application has to keep the extra table up to date (a minimal delete-trigger sketch follows this list).
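
For completeness, the delete case is the simplest of the three. Here is a minimal sketch of a hypothetical delete_tib trigger for the suffix table (not part of the original post; an update trigger would have to combine a delete and a re-insert):

-- hypothetical trigger: remove the stored suffixes when an email row is deleted
CREATE TRIGGER delete_tib AFTER DELETE ON test.email
FOR EACH ROW
  DELETE FROM email_tib WHERE email_id = OLD.id;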

Pros

  • Finding an email address is going to be much faster and require fewer reads.
  • Users will be much more satisfied.

Conclusion

If there is no built-in solution or index in MySQL that can help or solve your problem, do not give up. Many times, with small modifications, you can create your own index table or use other tricks.

In this specific case, if you are willing to sacrifice some additional disk space, you can speed up your queries a lot with the right method. The trigram method was not the best option here, but I can see use cases where it could be.

This is just one possible solution; there could be an even better one. If you know of one, please let me know.

