Indexing 101: Optimizing MySQL Queries on a Single Table
Join the DZone community and get the full member experience.Join For Free
[This article was written by Combaudon]
I have recently seen several cases when performance for MySQL queries on a single table was terrible. The reason was simple: the wrong indexes were added and so the execution plan was poor. Here are guidelines to help you optimize various kinds of single-table queries.
Disclaimer: I will be presenting general guidelines and I do not intend to cover all scenarios. I am pretty confident that you can find examples where what I am writing does not work, but I am also confident that it will help you most of the time. Also I will not discuss features you can find in MySQL 5.6+ like Index Condition Pushdown to keep things simple. Be aware that such features can actually make a significant difference in query response time (for good or for bad).
What an index can do for you
An index can perform up to 3 actions: filter, sort/group and cover. While the first 2 actions are self-explanatory, not everyone may know what a ‘covering index’ is. Actually that’s very easy. The general workflow for a basic query is:
1. Use an index to find matching records and get the pointers to data.
2. Use the pointers to the corresponding data.
3. Return records
When a covering index can be used, the index already covers all fields requested in the query, so step #2 can be skipped and the workflow is now:
1. Use an index to find matching records
2. Return records
In many cases, indexes are small and can fit in memory while data is large and does not fit in memory: by using a covering index, you can avoid lots of disk operations and performance can be order of magnitudes better.
Let’s now look at different common scenarios.
This is the most basic scenario:
SELECT * FROM t WHERE c = 100
The idea is of course to add an index on
(c). However note that if the criteria is not selective enough, the optimizer may choose to perform a full table scan that will certainly be more efficient.
Also note that a frequent variation of this query is when you only select a small subset of fields instead of all fields:
SELECT c1, c2 FROM t WHERE c = 100
Here it could make sense to create an index on
(c, c1, c2) because it will be a covering index. Do not create an index on
(c1, c2, c)! It will still be covering but it will not be usable for filtering (remember that you can only use a left-most prefix of an index to filter).
SELECT * FROM t WHERE c = 100 and d = 'xyz'
It is also very easy to optimize: just add an index on
(c, d) or
Equality and inequality
SELECT * FROM t WHERE c > 100 and d = 'xyz'
Here we must be careful because as long as we are using a column with an inequality, this will prevent us from using further columns in the index.
Therefore if we create an index on
(d, c), we will be able to filter both on
d, this is good.
But if we create an index on
(c, d), we will only be filtering on
c, which is less efficient.
So unlike the situation when you have equalities, order of columns matters when inequalities are used.
SELECT * FROM t WHERE c > 100 and b < 10 and d = 'xyz'
As we have 2 inequalities, we already know that we will not be able to filter on both conditions (*). So we have to make a decision: will we filter on
(d, b) or on
It is not possible to tell which option is better without looking at the data: simply choose the column where the inequality is the most selective. The main point is that you must put the column(s) with an equality first.
(*) Actually there is a way to ‘filter’ on both inequalites: partition on
b and add an index on
(d, c) or partition on
c and add an index on
(d, b). The details are out of the scope of this post but it might be an option for some situations.
Equalities and sort
SELECT * FROM t WHERE c = 100 and d = 'xyz' ORDER BY b
As mentioned in the first paragraph, an index can filter and sort so this query is easy to optimize. However like for inequalities, we must carefully choose the order of the columns in the index: the rule is that we will filter first, and then sort.
With that in mind, it is easy to know that
(c, d, b) or
(d, c, b) will be good indexes while
(b, c, d) or
(b, d, c) are not as good (they will sort but not filter).
And if we have:
SELECT c1, c2 FROM t WHERE c = 100 and d = 'xyz' ORDER BY b
We can create a super efficient index that will filter, sort and be covering:
(c, d, b, c1, c2).
Inequality and sort
We have 2 main variations here. The first one is:
SELECT * FROM t WHERE c > 100 and d = 'xyz' ORDER BY b
Two options look reasonable in this case:
1. filter on
d and sort by
2. filter on
Which strategy is more efficient? It will depend on your data, so you will have to experiment.
The second variation is:
SELECT * FROM t WHERE c > 100 ORDER BY b
This time we have no equality so we have to choose between filtering and sorting. Most likely you will choose filtering.
Not all cases have been covered in this post but you can already see that in some cases you will create poor MySQL indexes if you are not careful. In a future post, I will present a case that can look confusing at first sight but which is easy to understand if you already know everything mentioned here.
Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.