More on transactions descriptors optimization
Since our first post on transaction descriptors optimization introduced in Percona Server 5.5.30-30.2 and a followup by Dimitri Kravchuk we have received a large number of questions on why the benchmark results in both posts look rather different. We were curious as well, so we tried to answer that question by retrying benchmarks on various combinations of hardware and dataset sizes, including the ones that are as close as possible to Dimitri’s environment. To put a long story short, the results are fairly consistent with our original post across all test combinations, see the details are below.
Here is a brief context recap. There are two cases covered in the first post:
PRIMARY KEYlookups (aka QPS sysbench mode);
- same queries executed inside single-statement transactions (TPS mode, see the original post on why this case is relevant).
Dimitri didn’t touch case #2 and focused only on case #1. Which is the perfect case for read-only transactions optimization in MySQL 5.6, because all
SELECT queries in the
AUTOCOMMIT mode are, by definition, read-only transactions, so the server deals with an empty transaction list when creating read views. However, the more general descriptors optimization in Percona Server showed fairly close results in our tests. In Dimitri’s tests Percona Server scalability is somewhere in between of MySQL 5.5 and MySQL 5.6.
In order to understand more about differences between results we have rerun tests on several our boxes. You can find setup/config details at the very end of this post.
- HP ProLiant DL380 G6 (8cores/16 cpu threads)
- Cisco UCS C250 (12cores/24 cpu threads)
- Dell PowerEdge R720 (16 cores/32 cpu threads)
All are NUMA boxes, but the second one has a higher QPI bandwidth, and the third one features even higher QPI speed (and thus, faster inter-process/node communication).
First of all, we used a smaller dataset for this benchmarks round, since reportedly Dimitri was using 8 tables with 1M rows each (about 2.1GB) for his tests, while in our original tests we used a much larger one, 16 tables with 5M rows each (about 23GB). So it makes sense to check if the dataset size makes any difference.
We first excluded NUMA from the equation. We ran tests on Dell PowerEdge R720 box and varied cpu combinations with taskset. The following chart shows a results comparison on a single-node (on the left) vs. two-node (on the right) configuration.
No surprises here, the general pattern is consistent with what we saw previously. Note that the single-node configuration used 16 threads (8 cores on a single socket x 2 threads per socket), while the second one use 16 cores (8 cores per socket per node). So this test also excludes HT as the culprit for the results difference.
We then compared results across all 3 boxes to see if different hardware has any notable impact:
Still no luck, everything is in line with the other tests. There is some minor variation, e.g. in the first two cases Percona Server performance is even slightly ahead of MySQL 5.6.10, but nothing unexpected.
So while the difference in results still requires an explanation, I’d like to highlight one aspect of the descriptors optimization that I thought was explained in the original post, but seems to cause much confusion.
Even though the read-only transactions optimization in MySQL 5.6 virtually eliminates contention on the mutex protecting the list of transactions, its applicability is quite limited. Namely, it requires the list of transactions to be either empty or very short, which basically implies a fully read-only server. As soon as there is a non-trivial amount of updates to the database, all
SELECT queries, even those participating in read-only transactions, start suffering from the trx_list overhead created by concurrent updates.
Once we step away from this spherical read-only cow and throw some data updates into the mix, the limitation of read-only transactions become obvious.
In the following test sysbench does 9 primary key
SELECTs followed by a PK
UPDATE statement in each thread in a loop, which is probably a bit closer to real life workloads. It’s still the
AUTOCOMMIT mode, so
SELECTs are read-only transactions, but it doesn’t help much, because e.g. for 1024 concurrent threads each
SELECT has to scan about 100 update transactions to create a read view:
I hope this sheds more light on our tests setup and clarifies the scope of descriptors optimization.
[mysqld] user=root port=3306 innodb_status_file=0 innodb_data_file_path=ibdata1:100M:autoextend innodb_flush_log_at_trx_commit = 2 innodb_flush_method = O_DIRECT innodb_log_buffer_size = 16M innodb_buffer_pool_size = 52G innodb_log_file_size = 2000M innodb_log_files_in_group = 2 innodb_file_per_table = true innodb_read_io_threads = 8 innodb_write_io_threads = 8 innodb_io_capacity = 2000 max_connections=5000 table_open_cache=5000 query_cache_type=OFF performance_schema=0
To warmup server and load data and indices to the buffer pool we use following queries:
select avg(id) from sbtest$i force key (primary) select count(*) from sbtest$i WHERE k like '%0%'
POINT_SELECT QPS test
sysbench --num-threads=<1..1024> --test=oltp.lua --oltp_tables_count=8 --oltp-table-size=1000000 --rand-init=on --report-interval=1 --rand-type=uniform --forced-shutdown=1 --max-time=120 --max-requests=0 --percentile=99 --mysql-user=root --mysql-db=sbtest8t1M --mysql-table-engine=INNODB --mysql-socket=/tmp/mysql.sock --oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on --oltp-test-mode=nontrx --oltp-read-only=off --oltp-index-updates=0 --oltp-non-index-updates=0 run
POINT_SELECT + UPDATE QPS test
sysbench --num-threads=<1..1024> --test=oltp.lua --oltp_tables_count=8 --oltp-table-size=1000000 --rand-init=on --report-interval=1 --rand-type=uniform --forced-shutdown=1 --max-time=120 --max-requests=0 --percentile=99 --mysql-user=root --mysql-db=sbtest8t1M --mysql-table-engine=INNODB --mysql-socket=/tmp/mysql.sock --oltp-point-selects=9 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on --oltp-test-mode=nontrx --oltp-read-only=off --oltp-index-updates=1 --oltp-non-index-updates=0 run