Recently, we learned an expensive lesson about MySQL partition pruning, and we think that it is better to share it here so that others will not repeat our mistake.
In our system, there is a big stats table that does not have primary key and indexes. This table is partitioned, but the lack of indexes often causes the full partition or even a full table scan when queried. To make things worse, the system still continues writing to this table, making it slower every day.
To fix this performance issue, we want to clean the legacy data and add new indexes. However, this is not easy because the table is too big. Therefore, we chose the long approach by migrating only the desired data from this old table to a new table with a proper schema.
Partition by Hash
It would have been fine if we only did what we originally intended to do. However, we changed the partition type for convenience, and that made the new table slower.
In the original table, the partition is based on a timestamp column that represents the time as a number of hours from epoch. For example, the first second of the year 2017 in GMT is 1483228800 seconds from epoch. To get the number of hours, we divide the number by 3600 to get 1483228800 div 3600) = 412008.
Because of the partition by range type, we need to have a maintenance script that creates the monthly partition for next year. This method of partition is not very ideal because the partition size is big and not even. Hence, we converted it from a monthly partition to a weekly partition, but we were too lazy to define each range and switched from partition by range to partition by hash.
This is a short version of how hash definition will look like if we do the partition by range
PARTITION BY RANGE (hour_epoch) (PARTITION pOct2016 VALUES LESS THAN (419304), PARTITION pNov2017 VALUES LESS THAN (420024) ENGINE = InnoDB, PARTITION pDec2017 VALUES LESS THAN (420768) ENGINE = InnoDB, PARTITION pMax VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
And this is how the partition definition will look like if we do partition by hash
partition by hash (hour_epoch div 168) partitions 157;
The partition by hash type did more than just shorten the syntax. MySQL will try to split records evenly by applying modulo function to select a partition. However, to make the duration of one partition one week, we divide the hour_epoch number by 168 to effectively get week_epoch.
With the new table schema, we were happy with smaller partitions, shorter description, and more indexes.
Because of the huge volume of data, we could not fully migrate data to the new schema to verify performance. We only did the preliminary performance test with the data of two weeks and did not detect any performance issue. However, in the final testing, we were surprised to observe mixed results. Most of the queries were faster as expected, but some were slower.
After investigating, we realized that instead of scanning only a few partitions, MySQL does the full table scanning for the time range query. It is even stranger that this behavior only happens with the date range of fewer than three weeks. Totally surprised by this result, we overcame our procrastination to read up on the MySQL documentation carefully and realized why.
For tables that are partitioned by HASH or [LINEAR] KEY, partition pruning is also possible in cases in which the WHERE clause uses a simple = relation against a column used in the partitioning expression
As the text clearly explained, the partition pruning only works with the equal condition for partition by hash type. However, we did not detect this issue earlier because of the query optimizer will auto convert range conditions to equal conditions if the number of distinct values in between the range condition is short enough. Unfortunately, in our early test, the data of two weeks is short enough for the query optimizer to hide the problem from us.
After learning this new information, we struggled to find a way to fix the performance issue. There were two proposed solutions:
- Trick the query optimizer to do the work by splitting a big range into multiple small ranges, each fitting one partition. In this way, the query optimizer will work on each individual small range.
- Rebuild the schema again with the proper partition type.
The first solution is quick but dirty while the second solution is too time-consuming. Eventually, we almost decided to launch the new table with the first solution until we found a quick way to implement the second solution.
We dug through the MySQL documentation and learned that re-partitioning is basically a copy and paste operation. However, MySQL also has another command that allows us to do some partition changing without too much effort.
ALTER TABLE pt EXCHANGE PARTITION p WITH TABLE nt;
In this command, MySQL allows us to exchange partitions between a table and a partition of another table. Even when this is not a direct exchange between two partitions of two tables, it is just a matter of inconvenience to do one more middle swap to a temp table.This is how our partition swapping looks:
ALTER TABLE origin_table EXCHANGE PARTITION p1 WITH TABLE temp_table; ALTER TABLE final_table EXCHANGE PARTITION p1 WITH TABLE temp_table;
This is not as fast as you might guess, as MySQL will do a row-by-row validation to ensure every record of temp table is eligible for storing in the final table partition. If we use MySQL 5.7, this validation can be turned off by adding "WITHOUT VALIDATION" to the end of the second command.
Because we use Aurora, which only supports MySQL 5.6, it still took us two days to fully update the partition type. However, this would have been one month if we do not use partition exchange.
Fortunately, we managed to recover from the mistake this time. We hope that you learn from our mistake and do remember to read the documentation carefully before using any fancy method.