Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Pattern-Matching Queries vs. Full-Text Indexes

DZone's Guide to

Pattern-Matching Queries vs. Full-Text Indexes

When it comes to pattern-matching queries vs. full-text indexes, full-text index can be helpful, and it is built-in. But we do not have many metrics regarding full-text indexes.

· Big Data Zone ·
Free Resource

How to Simplify Apache Kafka. Get eBook.

In this post, we'll compare the performance of pattern-matching queries vs. full-text indexes.

In my previous post, I looked for a solution for how we can search only a part of the email address and how can we make faster queries where the condition is email LIKE '%n.pierre%'. I showed two possible ways that could work. Of course, they had some pros and cons, as well, but were more efficient and faster than a like '%n.prierre%'.

But you could also ask, Why I would bother with this? Let's add a FULLTEXT index, and everybody is happy! Are you sure about that? I'm not. Let's investigate and test a bit. (We have some nice posts that explain how FULLTEXT indexes work: Post 1, Post 2, Post 3.)

Let's see if it works in our case where we were looking for email addresses. Here is the table:

 CREATE TABLE `email` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=318465 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Add the default full-text index:

ALTER TABLE email ADD FULLTEXT KEY (email);

It took only five seconds for 320K email addresses.

Let's run a search:

SELECT id, email FROM email where MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE);
+--------+--------------------------------+
| id     | email                          |
+--------+--------------------------------+
|   2940 | pierre.west@example.org        |
|  10775 | pierre.beier@example.org       |
|  24267 | schroeder.pierre@example.org   |
|  26285 | bode.pierre@example.org        |
|  27104 | pierre.franecki@example.org    |
|  31792 | pierre.jaskolski@example.com   |
|  39369 | kuphal.pierre@example.org      |
|  58625 | olson.pierre@example.org       |
|  59526 | larkin.pierre@example.net      |
|  64718 | boyle.pierre@example.com       |
|  72033 | pierre.wolf@example.net        |
|  90587 | anderson.pierre@example.org    |
| 108806 | fadel.pierre@example.org       |
| 113897 | jacobs.pierre@example.com      |
| 118579 | hudson.pierre@example.com      |
| 118798 | pierre.wuckert@example.org     |
| 118937 | green.pierre@example.net       |
| 125451 | hauck.pierre@example.net       |
| 133352 | friesen.pierre@example.net     |
| 134594 | windler.pierre@example.com     |
| 135406 | dietrich.pierre@example.org    |
| 190451 | daugherty.pierre@example.org   |
...

Immediately, we have issues with the results. It returns 43 rows, but there are only 11 rows with string n.pierre. Why? It is because of . The manual says:

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example,    (space), , (comma), and . (period).

The parser believes that a . starts a new word, so it is going to search for pierre instead of n.pierre. That's not good news, as many email addresses contain .. What can we do? The manual says:

It is possible to write a plugin that replaces the built-in full-text parser. For details, see Section 28.2, "The MySQL Plugin API". For example parser plugin source code, see the plugin/fulltext directory of a MySQL source distribution.

If you are willing to write your own plugin in C/C++, you can try that route. Until then, it is going to give us back a lot of irrelevant matches.

We can order the results by relevancy:

SELECT id,email,MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE)
 AS score FROM email where MATCH(email) AGAINST
('n.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc limit 10;
+-------+------------------------------+-------------------+
| id    | email                        | score             |
+-------+------------------------------+-------------------+
|  2940 | pierre.west@example.org      | 14.96491813659668 |
| 10775 | pierre.beier@example.org     | 14.96491813659668 |
| 24267 | schroeder.pierre@example.org | 14.96491813659668 |
| 26285 | bode.pierre@example.org      | 14.96491813659668 |
| 27104 | pierre.franecki@example.org  | 14.96491813659668 |
| 31792 | pierre.jaskolski@example.com | 14.96491813659668 |
| 39369 | kuphal.pierre@example.org    | 14.96491813659668 |
| 58625 | olson.pierre@example.org     | 14.96491813659668 |
| 59526 | larkin.pierre@example.net    | 14.96491813659668 |
| 64718 | boyle.pierre@example.com     | 14.96491813659668 |
+-------+------------------------------+-------------------+

This does not guarantee that we get back the lines that we are looking for, however. I tried to change innodb_ft_min_token_size, as well, but it did not affect the results.

Let's see what happens when I search for williamson pierre: two separate words. I know there is only one email address with these names.

SELECT id,email,MATCH(email) AGAINST
('williamson.pierre' IN NATURAL LANGUAGE MODE) AS score
FROM email where MATCH(email) AGAINST
('williamson.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc limit 50;
+--------+---------------------------------+-------------------+
| id     | email                           | score             |
+--------+---------------------------------+-------------------+
| 238396 | williamson.pierre@example.net   | 24.08820343017578 |
|   2940 | pierre.west@example.org         | 14.96491813659668 |
|  10775 | pierre.beier@example.org        | 14.96491813659668 |
|  24267 | schroeder.pierre@example.org    | 14.96491813659668 |
|  26285 | bode.pierre@example.org         | 14.96491813659668 |
|  27104 | pierre.franecki@example.org     | 14.96491813659668 |
|  31792 | pierre.jaskolski@example.com    | 14.96491813659668 |
|  39369 | kuphal.pierre@example.org       | 14.96491813659668 |
|  58625 | olson.pierre@example.org        | 14.96491813659668 |
...

The first result is that we still got another 49 addresses. How can the application decide which email address is relevant and which is not? I am still not happy.

Are There Any Other Options Without Writing Our Own Plugin?

Can I somehow tell the parser to use n.pierre as one word? The manual says:

A phrase that is enclosed within double quote ( ") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".

I can use double quotes, but it will still split at . and the results are the same. I did not find a solution except writing your own plugin. If someone knows a solution, please write a comment.

With Parser ngram

The built-in MySQL full-text parser uses delimiters between words, but we can create an ngram-based full-text index.

mysql> alter table  email ADD FULLTEXT KEY (email) WITH PARSER ngram;
Query OK, 0 rows affected (20.10 sec)
Records: 0  Duplicates: 0  Warnings: 0

Before that, I changed the ngram_token_size to 3.

mysql> SELECT id,email,MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE) AS score FROM email where MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc;
+--------+----------------------------------+--------------------+
| id     | email                            | score              |
+--------+----------------------------------+--------------------+
|  58625 | olson.pierre@example.org         |  16.56794548034668 |
|  59526 | larkin.pierre@example.net        |  16.56794548034668 |
|  90587 | anderson.pierre@example.org      |  16.56794548034668 |
| 118579 | hudson.pierre@example.com        |  16.56794548034668 |
| 118937 | green.pierre@example.net         |  16.56794548034668 |
| 133352 | friesen.pierre@example.net       |  16.56794548034668 |
| 200608 | wilkinson.pierre@example.org     |  16.56794548034668 |
| 237928 | johnson.pierre@example.org       |  16.56794548034668 |
| 238396 | williamson.pierre@example.net    |  16.56794548034668 |
| 278384 | monahan.pierre@example.net       |  16.56794548034668 |
| 306718 | rohan.pierre@example.com         |  16.56794548034668 |
| 226737 | warren.pfeffer@example.net       | 12.156486511230469 |
|  74278 | stiedemann.perry@example.net     |  11.52701187133789 |
|  75234 | bogan.perry@example.org          |  11.52701187133789 |
...
4697 rows in set (0.03 sec)

Finally, we are getting somewhere. But it gives back 4,697 rows. How can the application decide which results are relevant? Should we just use the score?

Subselect?

I dropped the Ngram FULLTEXT index and created a normal one because that gives me back only 43 results instead of 4697. I thought a full-text search might be good to narrow down the results from a million to a few thousand, and then we can run a select based on that. Example:

mysql> Select e2.id,e2.email from
(SELECT id,email FROM email where MATCH(email)
AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE))
as e2 where e2.email like '%n.pierre%';
+--------+-------------------------------+
| id     | email                         |
+--------+-------------------------------+
|  58625 | olson.pierre@example.org      |
|  59526 | larkin.pierre@example.net     |
|  90587 | anderson.pierre@example.org   |
| 118579 | hudson.pierre@example.com     |
| 118937 | green.pierre@example.net      |
| 133352 | friesen.pierre@example.net    |
| 200608 | wilkinson.pierre@example.org  |
| 237928 | johnson.pierre@example.org    |
| 238396 | williamson.pierre@example.net |
| 278384 | monahan.pierre@example.net    |
| 306718 | rohan.pierre@example.com      |
+--------+-------------------------------+
11 rows in set (0.00 sec)

Wow, this can work and it looks quite fast, as well. But (there is always a but) if I run the following query (searching for ierre):

mysql> Select e2.id,e2.email from
(SELECT id,email FROM email where MATCH(email)
AGAINST ('ierre' IN NATURAL LANGUAGE MODE))
as e2 where e2.email like '%ierre%';
Empty set (0.00 sec)

...it gives back nothing because the default full-text parser uses only full words! In our case, that is not very helpful. Let's switch back to ngram and re-run the query:

mysql> Select e2.id,e2.email from
(SELECT id,email FROM email where MATCH(email)
AGAINST ('ierre' IN NATURAL LANGUAGE MODE))
as e2 where e2.email like '%ierre%';
+--------+--------------------------------+
| id     | email                          |
+--------+--------------------------------+
|   2940 | pierre.west@example.org        |
|  10775 | pierre.beier@example.org       |
|  16958 | pierre68@example.com           |
|  24267 | schroeder.pierre@example.org   |
...
65 rows in set (0.05 sec)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000072 |
| checking permissions    | 0.000006 |
| Opening tables          | 0.000014 |
| init                    | 0.000027 |
| System lock             | 0.000007 |
| optimizing              | 0.000006 |
| statistics              | 0.000013 |
| preparing               | 0.000006 |
| FULLTEXT initialization | 0.006384 |
| executing               | 0.000012 |
| Sending data            | 0.020735 |
| end                     | 0.000014 |
| query end               | 0.000014 |
| closing tables          | 0.000013 |
| freeing items           | 0.001383 |
| cleaning up             | 0.000024 |
+-------------------------+----------+

It gives us back 65 rows, and it takes between 0.02-0.05s because the subquery results in many rows.

With my "shorting method":

select e.email from email as e right join email_tib as t
on t.email_id=e.id where t.email_parts like "ierre%";
+--------------------------------+
| email                          |
+--------------------------------+
| anderson.pierre@example.org    |
| bode.pierre@example.org        |
| bode.pierre@example.org        |
| boyle.pierre@example.com       |
| bradtke.pierre@example.org     |
| bradtke.pierre@example.org     |
...
65 rows in set (0.00 sec)
mysql> show profile;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000069 |
| checking permissions | 0.000011 |
| checking permissions | 0.000003 |
| Opening tables       | 0.000020 |
| init                 | 0.000021 |
| System lock          | 0.000008 |
| optimizing           | 0.000009 |
| statistics           | 0.000070 |
| preparing            | 0.000011 |
| executing            | 0.000001 |
| Sending data         | 0.000330 |
| end                  | 0.000002 |
| query end            | 0.000007 |
| closing tables       | 0.000005 |
| freeing items        | 0.000014 |
| cleaning up          | 0.000010 |
+----------------------+----------+

It reads and gives back exactly 65 rows and it takes 0.000s.

Conclusion

When it comes to pattern-matching queries vs. full-text indexes, it looks like full-text index can be helpful, and it is built-in. Unfortunately, we do not have many metrics regarding full-text indexes — we cannot see how many rows were read, etc. I don't want to make any conclusions on which one is faster. I still have to run some tests with our favorite benchmark tool sysbench on a much bigger dataset.

I should mention that full-text indexes and my previous solutions won't solve all the problems. In this and my other article, I was trying to find an answer to a specific problem, but there are cases where my solutions would not work that well.

12 Best Practices for Modern Data Ingestion. Download White Paper.

Topics:
big data ,tutorial ,pattern matching ,full-text ,indexing ,data analytics ,queries ,data performance

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}