Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Correct Index Choices for Equality + LIKE Query Optimization

DZone's Guide to

Correct Index Choices for Equality + LIKE Query Optimization

A lot of performance gains can come from query optimizations. Learn how to put your indexes in the correct order for major gains.

· Performance Zone
Free Resource

Transform incident management with machine learning and analytics to help you maintain optimal performance and availability while keeping pace with the growing demands of digital business with this eBook, brought to you in partnership with BMC.

As part of our support services, we do a lot of query optimization. This is where most performance gains come from. Here’s an example of the work we do.

Some days ago, a customer arrived with the following table:

 CREATE TABLE `infamous_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `member_id` int(11) NOT NULL DEFAULT '0',
  `email` varchar(200) NOT NULL DEFAULT '',
  `msg_type` varchar(255) NOT NULL DEFAULT '',
  `t2send` int(11) NOT NULL DEFAULT '0',
  `flag` char(1) NOT NULL DEFAULT '',
  `sent` varchar(100) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `f` (`flag`),
  KEY `email` (`email`),
  KEY `msg_type` (`msg_type`(5)),
  KEY `t_msg` (`t2send`,`msg_type`(5))
) ENGINE=InnoDB DEFAULT CHARSET=latin1

And a query that looked like this:

SELECT COUNT(*)
  FROM `infamous_table`
 WHERE `t2send` > 1234
   AND `msg_type` LIKE 'prefix%';

The table had an index t_msg that wasn’t helping at all: the EXPLAIN for our 1000000 rows test table looked like this:

id: 1
select_type: SIMPLE
table: infamous_table
type: range
possible_keys: t_msg
key: t_msg
key_len: 4
ref: NULL
rows: 107478
Extra: Using where

You can see the index is the on that was expected: t_msg. But the key_len is 4. This indicates that the INT part was used, but that the msg_type(5) part was ignored. This resulted examining 100k+ rows. If you have MySQL 5.6, you can see it more clearly with EXPLAIN FORMAT=JSON under used_key_parts:

EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "infamous_table",
      "access_type": "range",
      "possible_keys": [
        "t_msg"
      ],
      "key": "t_msg",
      "used_key_parts": [
        "t2send"
      ],
      "key_length": "4",
      "rows": 107478,
      "filtered": 100,
      "index_condition": "(`test`.`infamous_table`.`t2send` > 1234)",
      "attached_condition": "(`test`.`infamous_table`.`msg_type` like 'prefix%')"
    }
  }
}

The customer had multi-valued strings like PREFIX:INT:OTHER-STRING stored in the columnmsg_type, and that made it impossible to convert it to an enum or similar field type that allowed changing the LIKE for an equity.

So, the solution was rather simple. Just like for point and range queries over numeric values, you must define the index with the ranged field as the rightmost part. This means the correct index would have looked like msg_type(5),t2send. The EXPLAIN for the new index provided the customer with some happiness:

id: 1
select_type: SIMPLE
table: infamous_table
type: range
possible_keys: t_msg,better_multicolumn_index
key: better_multicolumn_index
key_len: 11
ref: NULL
rows: 4716
Extra: Using where

You can see the key_len is now what we would have expected: four bytes for the INT and another seven bytes for the VARCHAR (five for our chosen prefix + two for prefix length). More importantly, you can notice the rows count decreased by approximately 22 times.

We used pt-online-schema on the customer’s environment to apply ALTER to avoid downtime. This made it an easy and painless solution, and the query effectively executed in under 1/20 of the time! So, all fine and dandy? Well, almost. We did a further test, and the query looked like this:

SELECT COUNT(*)
  FROM `infamous_table`
 WHERE `t2send` > 1234
   AND `msg_type` LIKE 'abc%';

So, where’s the difference? The length of the string used for the LIKE condition is shorter than the prefix length we choose for the VARCHAR part of the index (the customer intended to lookup strings with only three chars, so we needed to check this). This query also scanned 100k rows, and EXPLAIN showed the key_len was 4, meaning the VARCHAR part was being ignored once again.

This means the index prefix needed to be shorter. We ALTERed the table and made the prefix four characters long, counting on the fact that the multi-valued strings were using : to separate the values, so we suggested the customer include the colon in the lookup string for the shortest strings. In this case,  'abc%' would be 'abc:%' (which is also four characters long).

As a final optimization, we suggested dropping old indexes that were covered by the new better_multicolumn_index, and that were most likely created by the customer while testing optimization.

Conclusion

Just like in point-and-range queries, the right order for multi-column indexes is putting the ranged part last. Equally important is that the length of the string prefix needs to match the length of the shortest string you intend to look-up. Just remember that you can’t make this prefix too short or you’ll lose specificity and the query will end up scanning rows unnecessarily.

Evolve your approach to Application Performance Monitoring by adopting five best practices that are outlined and explored in this e-book, brought to you in partnership with BMC.

Topics:
performance ,tutoral ,query optimization

Published at DZone with permission of Marcos Albe, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}