Recently I gave a presentation on “How to Avoid Common (but Deadly) MySQL Development Mistakes” for Percona MySQL Webinars. If you missed it, you can still register to view the recording and my slides.
Thanks to everyone who attended, and especially to folks who asked the great questions. I answered as many as we had time for during the session, but here are all the questions with my complete answers:
Q: Does a JOIN operation between two tables always produce an “access table” on the rows of the first table of the join, or it is possible to define an index (COVERING INDEX) to avoid this access to the first table?
Yes, if your query references only columns in the index, then MySQL can create the query result without having to access table rows. That’s the way a covering index gives benefit, by eliminating many page accesses. MySQL can get this optimization on any table in your query, even the first table in a JOIN.
Q: What are the performance drawbacks of prepared statements?
Some people have supposed that a prepared statement causes overhead, because it involves two communications between the application and the database server. First to send the SQL query with placeholders, and second to send parameter values. But unless you have a very slow network between your app and the database server, this isn’t a significant amount of overhead, and the benefits of using prepared statements usually outweigh the cost.
In fact, in a past blog post, Peter Zaitsev showed in a test that prepared statements allow for a significantly higher rate of queries per second than non-prepared queries.
I have also done tests with PHP, and I found that the overhead to response time is due more to additional PHP code than the performance on the database server side.
One more performance consequence of using prepared statements: compatibility with the query cache. If you issue SQL statements with the text protocol, i.e. using PREPARE and EXECUTE statements, then MySQL cannot use the query cache. But applications typically run dynamic SQL using the binary protocol, and that is compatible with the query cache.
Q: What about using fixed length text fields rather than VARCHARs?
CHAR would be an example of a fixed-length string data type. MyISAM tables have an optimization when all columns of a table use fixed-length data types. Then the row position in the data file is simply a multiple of the row size.
But InnoDB does not have this optimization. Rows are always located on pages, and pages are referenced by pointers rather than by computing an offset in the file. So there is no benefit to using fixed-length string data types.
Please don’t take this as a recommendation to use MyISAM. You should use InnoDB in virtually all cases. InnoDB has better performance that MyISAM in most workloads, but even more importantly it supports ACID behavior and has crash recovery. MyISAM is gradually being phased out.
Q: Advice on fixing some of the datatype or character set problems?
You can use a query modifier like PROCEDURE ANALYSE to help you find out what is the optimal datatype for each of your columns.
The tricky part, though, is changing a data type on a table populated with a lot of data. Using ALTER TABLE may lock the table for the duration of restructuring it to change a data type, and that can interrupt your work. It takes time to do this, proportional to the size of data in the table and the number of indexes that need to be rebuilt.
MySQL 5.6 has new features for InnoDB to perform ALTER TABLE as online DDL. This can help in some cases to avoid locking your table. In the case of changing a data type, you can continue to read the table, but you can’t make changes while the ALTER TABLE is running.
Another solution is to use Percona Toolkit’s pt-online-schema-change. This free tool performs table alterations during which you can continue to read and write data.
Q: Can you talk about index cardinality vs. selectivity and cases where an index does not help because the number of pages read by the database does not decrease by using an index?
Selectivity is the ratio of the rows matched by a given query to the number of rows. Say for instance, I am looking for blog posts written by a given user. There are 1500 blog posts, and 15 written by me. That’s a selectivity of 15/1500, or 1%.
MySQL’s optimizer pays attention to selectivity, and will prefer not to use an index if the selectivity goes over 20%.
To understand this, think of an analogy to a book with an index at the back. Why doesn’t the index contain common words like “the”? Because a word like “the” would occur on virtually every page of the book, so its entry in the index would simply list every page. The list of pages itself could be pages long! If we require to use the index to find all those pages, it seems like a waste of time to flip back and forth, using the index to tell us where the next matching page is. It would be easier and more efficient to simply read the book cover-to-cover.
It is similar in a database index. Loading pages of index structures has some cost, just like loading pages of data rows. So there’s a selectivity breakpoint at which it becomes more efficient to simply do a table-scan. Hopefully MySQL makes the correct estimate of selectivity and where that breakpoint is, but occasionally it can get that wrong.
In those cases, you can override the optimizer by using Index Hints. But those cases are generally rare; its better to avoid using optimizer hints unless you are forced to.
Cardinality is an estimate of the number of unique values in an index. You can see it in the output of SHOW INDEX. A unique index by definition has cardinality equal to the number of rows. Cardinality doesn’t come up as often for the optimizer, except that a unique index on a PRIMARY or UNIQUE KEY allows MySQL to access the table with a different join type, that being const or eq_ref.
Q: The blog post said the partitioning and sharding topic will be touched. Unfortunately, it wasn’t, but I’d like to try and bring it up.
Apologies for that, I covered as many topics as I could during a one-hour webinar. I’ll plan to do another webinar in the future, and I’ll make sure to cover partitioning and sharding in more detail then. But on to your question…
Q: What would be the dataset size when you should think about sharding.
The decision to use sharding has little to do with dataset size. It has to do with the rate of traffic — especially write traffic. That is, use sharding when you’re trying to INSERT or UPDATE at such a rapid rate of queries per second that your database server can’t physically handle the rate of changes. Sharding allows us to scale out to multiple database servers, each of which will handle a fraction of the writes.
Q: And what would be the best choice middleware for the task? I.e. MySQL proxy is still alpha, so it’s not something you would put into production yet.
Right, MySQL Proxy has been available for years (version 0.7.0 is dated 2009-01-13 in launchpad.net), but curiously, its documentation still explicitly says it’s in alpha.
Oracle is currently working hard on a new solution to manage sharded architectures, it’s called MySQL Fabric. It’s still under development, but it’s available at MySQL Labs and you can download it and try it out. No doubt the developers would greatly value community feedback.
Another tool developed by Percona veteran Justin Swanhart is Shard-Query. This works as a proxy, and transparently maps your SQL statements into multiple queries, executed in parallel against individual shards in your cluster. Then it combines the results into a single result set. You can watch the recording of Justin’s past webinar about Shard-Query.
Q: Would you recommend moving to 5.6 (in production)?. The benchmark numbers looks amazing.
Yes, MySQL 5.6 demonstrates some great benchmarks, and Percona Server 5.6 adds even greater scalability. I would certainly recommend upgrading if you haven’t already.
When 5.6 first came out, there were a few high-profile bugs. But that’s true of every initial release of a new major version of any software. It pays to test carefully before any major upgrade, and to have a fallback plan. But it’s been a year since the first GA release of 5.6, and we have worked with many customers who have deployed it successfully in production and are very happy with the performance improvements.
Q: Would also love to hear what you think about sharding plus partitioning, and MySQL cluster.
I’ll go into more detail on sharding plus partitioning in a future webinar.
But let me say this about MySQL Cluster: the NDB storage engine is brilliant, it provides high availability at the same time as high performance, and internally I am told it’s one of the best pieces of code with the best test suite of any component of the MySQL family.
However: you should be aware that using any kind of sharded architecture requires that you design your database schema and your queries to match the way the data is sharded. Sharding works best when your queries need to access only one shard. For example, reading or writing just one row at a time.
If your queries involve range conditions, for instance, this is likely to require all shards to respond to the query, and that would gain you nothing in regards to performance or scalability. In fact, we’ve seen cases where a company migrated an existing application from a traditional MySQL instance to MySQL Cluster without designing their schema for sharding, and their performance degraded instead of improving.
Another example is if you do a JOIN between related rows, but the sharding by primary key has spread the related rows across multiple shards, then the JOIN could actually be more expensive than if you were running in a non-sharded environment. The fix in that case is to make sure your keys are defined such that the foreign key in related rows has the same modulus as the primary key of their parent row.
Q: You said that varchar columns stores efficiently on disk in data and indexes and in the buffer pool. But a string in the SQL layer pads out by using as much space as the maximum length. Increased 3x for utf8 or x4 for utf8mb4. I don’t understand what the “SQL Layer” means. Can you explain that?
The MySQL server has an architecture that divides its work between two main areas of code: the first part is storage-independent tasks like accepting connections, session management, authentication, SQL parsing, query optimization, logging, replication, the query cache, etc. We call this part of MySQL Server the “SQL layer” (or sometimes the “MySQL layer”).
The other part is the storage engine, which is responsible for reading and writing rows, some kinds of caching (like the buffer pool in the case of InnoDB), indexes, I/O, and so on.
The SQL layer knows how to parse queries and turns them into a series of individual row requests. Then the storage engine for a given table fulfills those row requests, reading them off disk or from the buffer pool as needed. As the storage engine returns rows, they are copied from one part of the code (the storage engine) to the other part (the SQL layer), and then prepares to return them as a query result set.
My point in the webinar is that the storage engine knows how to store VARCHAR data in a compact form, by storing variable-length strings. But the SQL layer treats rows as fixed-length records (except for BLOB/TEXT) and therefore allocates memory for VARCHAR strings according to the maximum length the column could be.
* * *
Thanks again for attending my webinar! Remember that you can still view the slides and the recording for “Common (but deadly) MySQL Development Mistakes.”