Performance Improvements in MariaDB MaxScale 2.1 GA
Performance is important, and perhaps doubly so for a database proxy such as MariaDB MaxScale. From the start, the goal of MaxScale’s architecture has been that performance should scale with the number of CPU cores. The means for achieving that is asynchronous I/O and Linux’s epoll, combined with a fixed number of worker threads.
However, benchmarking revealed some bottlenecks: the throughput did not keep increasing as more threads were added, but instead soon started to decrease. The throughput also hit a relatively low ceiling, above which it would not grow, irrespective of the configuration of MaxScale.
When we started working with MariaDB MaxScale 2.1, we decided to make some significant changes to the internal architecture, with the aim of improving the overall performance. In this blog post, I will describe what we have done and show how the performance has improved.
The benchmark results have all been obtained using the same setup: two computers (connected with a GbE LAN), each with 16 cores/32 hyperthreads, 128GB RAM and an SSD drive. One of the computers was used for running 4 MariaDB-10.1.13 servers and the other for running MariaDB MaxScale and sysbench. The sysbench used is a version slightly modified by MariaDB and it is available here. The configurations used and the raw data can be found here.
In all test runs, the MaxScale configuration was basically the same and 16 worker threads were used. As can be seen here, 16 threads is not optimal for MaxScale 2.0, but that does not distort the general picture too much. The workload used is the same in all cases: an OLTP read-only benchmark with 100 simple selects per iteration, no transaction boundaries, and autocommit enabled.
In the figures:
- direct means that sysbench itself distributes the load to the servers in a round-robin fashion (i.e. not through MaxScale).
- rcr means that MaxScale with the readconnroute router is used.
- rws means that MaxScale with the readwritesplit router is used.
The rcr and rws results are meaningful to compare and the difference reflects the overhead caused by the additional complexity and versatility of the readwritesplit router.
As a baseline, the following figure shows the result for MaxScale 2.0.5 GA.
As can be seen, a saturation point is reached at 64 client connections, at which point the throughput is less than half of that of a direct connection.
Before MaxScale 2.1, all threads were used in a completely symmetric fashion; every thread listened on the same epoll instance, which in practice meant that every thread could accept an incoming connection, read from and write to any client and read from and write to any server related to that client. Without going into the specifics, that not only implied a lot of locking inside MaxScale but also made it quite hard to ensure that there could be no data races.
For 2.1 we changed the architecture so that when a client has connected, a particular thread is used for handling all communication with and on behalf of that client. That is, the thread that reads a request from a client is also the one that writes that request to one or more servers, reads the responses from those servers and finally writes a response to the client. With this change, quite a lot of locking could be removed, while at the same time the possibilities for data races were reduced.
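The idea of pinning a client to one thread can be sketched in a few lines of Python. This is only an illustration and not MaxScale code: the `Worker` and `Dispatcher` classes are hypothetical, round-robin assignment is an assumption, and a plain queue stands in for epoll. The point it shows is that once a client belongs to exactly one worker, the per-client state needs no lock.

```python
import itertools
import queue
import threading


class Worker(threading.Thread):
    """Hypothetical worker that owns every session assigned to it."""

    def __init__(self, name):
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()  # the only structure shared with other threads
        self.sessions = {}          # touched by this thread only, so no lock

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # shutdown marker
                break
            client_id, _request = item
            # Per-client state lives in this worker alone: count the requests.
            self.sessions[client_id] = self.sessions.get(client_id, 0) + 1


class Dispatcher:
    """Pins each client to one worker at connect time (round-robin, assumed)."""

    def __init__(self, n_workers):
        self.workers = [Worker(f"worker-{i}") for i in range(n_workers)]
        self._next = itertools.cycle(self.workers)
        self._owner = {}
        for w in self.workers:
            w.start()

    def connect(self, client_id):
        self._owner[client_id] = next(self._next)

    def submit(self, client_id, request):
        # Every request of a client goes to the worker that owns the client.
        self._owner[client_id].inbox.put((client_id, request))

    def shutdown(self):
        for w in self.workers:
            w.inbox.put(None)
        for w in self.workers:
            w.join()
```

In the symmetric pre-2.1 model, `sessions` would have been reachable from every thread and would have needed a lock around each access.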
As can be seen in the following figure, that had quite an impact on the throughput, as far as readconnroute is concerned. The behavior of readwritesplit did not really change.
In 2.1.0 Beta we also introduced a caching filter and as long as the number of client connections is low, the impact is quite significant. However, as the number of client connections grows, the benefit shrinks dramatically.
The cache can be configured to be shared between all threads or to be specific for each thread and in the benchmark a thread-specific cache was used. Otherwise, the cache was not configured in any way, which means that all SELECT statements, but for those that are considered non-cacheable, are cached and that the cache uses as much memory as it needs (no LRU based cache eviction).
It is not shown in the figure, but readconnroute + cache performed roughly equivalently with readwritesplit + cache. That is, from 32 client connections onwards readconnroute + cache performed much worse than readconnroute alone.
Increasing Query Classification Concurrency
The change in the threading architecture improved the raw throughput of MaxScale, but it was still evident that the throughput did not increase as expected in all contexts. Investigations showed that there was some unnecessary serialization going on when statements were classified, something that readwritesplit needs to do, but readconnroute does not.
Statements are classified inside MaxScale using a custom SQLite based parser. An in-memory database is created for each thread so that there can be no concurrency issues between classification performed by different threads. However, it turned out that SQLite had not been built in a way that turned off all of its internal locking, so some locking was still taking place. For 2.1.1 Beta this issue was fixed and now no locking takes place when statements are classified.
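The per-thread in-memory database can be illustrated with Python's standard `sqlite3` module. This is a sketch of the isolation idea only, not of MaxScale's parser: the `thread_db` and `record` helpers and the `seen` table are hypothetical. SQLite documents a compile-time option, `SQLITE_THREADSAFE`, that controls its internal mutexes; whether that is the option the MaxScale fix relied on is not stated here.

```python
import sqlite3
import threading

_local = threading.local()


def thread_db():
    """Return this thread's private in-memory database, creating it on first use."""
    db = getattr(_local, "db", None)
    if db is None:
        # Every connect(":memory:") call creates a separate, unshared database.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE seen (stmt TEXT)")
        _local.db = db
    return db


def record(stmt):
    """Store a statement in the calling thread's database and return its row count."""
    db = thread_db()
    db.execute("INSERT INTO seen VALUES (?)", (stmt,))
    return db.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
```

Because no database is ever visible to two threads, the application itself needs no synchronization around it; any locking that remains is inside the library.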
More Aggressive Caching
Initially, the cache was used only if no transaction was in progress or if an explicitly read-only transaction was in progress. In 2.1.1 Beta we changed that so that caching also takes place in transactions that are not explicitly read-only, but only until the first updating statement.
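The two rules can be written as small predicates. The `SessionState` fields and function names below are hypothetical, chosen only to make the difference between the rules explicit.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    in_transaction: bool = False
    read_only: bool = False   # transaction was started as explicitly read-only
    saw_update: bool = False  # an updating statement ran in this transaction


def may_use_cache_initial(s: SessionState) -> bool:
    """Initial rule: outside transactions, or inside explicitly read-only ones."""
    return (not s.in_transaction) or s.read_only


def may_use_cache_2_1_1(s: SessionState) -> bool:
    """2.1.1 Beta rule: also inside other transactions, until the first update."""
    return (not s.in_transaction) or s.read_only or (not s.saw_update)
```

The only case where the two rules differ is a transaction that is not explicitly read-only and has not yet updated anything.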
Reduce Data Collection
Up until 2.1.1 Beta the query classifier always collected, for every statement, all the information that is accessible via the query classifier API. Profiling showed that the cost of allocating memory for storing the accessed fields and the used functions was significant. In most cases paying that cost is completely unnecessary, as it is primarily only the database firewall filter that is interested in that information.
For 2.1.2 Beta the internal API was changed so that filters and routers can express what information they need and only that information is subsequently collected during query classification.
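One common way to express such requirements is a set of flags that each component declares and the classifier honors. The sketch below assumes that approach; the `Collect` flags, `required` and `classify` are hypothetical names, and the regular expressions are a naive stand-in for a real parser.

```python
import re
from enum import Flag, auto


class Collect(Flag):
    ESSENTIALS = 0      # statement type only
    FIELDS = auto()     # accessed fields
    FUNCTIONS = auto()  # used functions


def required(components):
    """Union of what all routers and filters in a service ask for."""
    need = Collect.ESSENTIALS
    for c in components:
        need |= c
    return need


def classify(stmt, need):
    """Collect only the information that was requested."""
    info = {"type": stmt.split(None, 1)[0].upper()}
    if Collect.FUNCTIONS in need:
        info["functions"] = re.findall(r"\b(\w+)\s*\(", stmt)
    if Collect.FIELDS in need:
        m = re.search(r"SELECT\s+(.*?)\s+FROM", stmt, re.IGNORECASE | re.DOTALL)
        info["fields"] = [f.strip() for f in m.group(1).split(",")] if m else []
    return info
```

With no component asking for fields or functions, nothing beyond the statement type is extracted or allocated.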
Many routers and filters need to know whether a transaction is in progress or not. Up until 2.1.1 Beta, that information was obtained using the parser of the query classifier. That imposed a significant cost if classification otherwise was not needed. For 2.1.2 Beta a custom parser for only detecting statements affecting the transaction state and autocommit mode was written. Now the detection carries almost no cost at all.
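A minimal sketch of such a detector is shown below. It is not MaxScale's parser: the `TrxTracker` class is hypothetical, it ignores comments and many corner cases, and it simplifies the semantics by treating a session with autocommit disabled as always being inside a transaction.

```python
import re

_BEGIN = re.compile(r"^\s*(BEGIN|START\s+TRANSACTION)\b", re.IGNORECASE)
_READ_ONLY = re.compile(r"\bREAD\s+ONLY\b", re.IGNORECASE)
# ROLLBACK TO <savepoint> does not end the transaction, so it is excluded.
_END = re.compile(r"^\s*(COMMIT|ROLLBACK)\b(?!\s+TO\b)", re.IGNORECASE)
_AUTOCOMMIT = re.compile(
    r"^\s*SET\s+(?:SESSION\s+)?AUTOCOMMIT\s*=\s*(0|1|ON|OFF)\b", re.IGNORECASE
)


class TrxTracker:
    """Tracks transaction state and autocommit mode from statement prefixes."""

    def __init__(self):
        self.autocommit = True
        self.in_transaction = False
        self.read_only = False

    def observe(self, stmt):
        m = _AUTOCOMMIT.match(stmt)
        if m:
            self.autocommit = m.group(1).upper() in ("1", "ON")
            self.in_transaction = not self.autocommit
            self.read_only = False
        elif _BEGIN.match(stmt):
            self.in_transaction = True
            self.read_only = bool(_READ_ONLY.search(stmt))
        elif _END.match(stmt):
            # With autocommit off, the next statement opens a new transaction.
            self.in_transaction = not self.autocommit
            self.read_only = False
```

Each statement costs a few anchored pattern matches, which is far cheaper than a full parse.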
Optional Cache Safety Checking
By default, the cache filter verifies that a SELECT statement is cacheable. For instance, a SELECT that refers to user variables or uses functions such as CURRENT_DATE() is not cacheable. In order to do that verification, the cache uses the query classifier, which means that the statement will be parsed. In 2.1.2 Beta we introduced the possibility to specify in the configuration file that all SELECT statements can be assumed to be cacheable, which means that no parsing needs to be performed. In practice this behaviour is enabled by adding the following entry to the cache section in the MaxScale configuration file.
[Cache]
...
selects=assume_cacheable
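What the verification amounts to, and what skipping it saves, can be sketched as follows. This is a naive illustration using regular expressions rather than the query classifier; `is_cacheable` and the function list are hypothetical and deliberately incomplete, and the check is conservative in that any `@` name counts as a variable.

```python
import re

# Illustrative, incomplete list of functions whose result can change between calls.
NON_CACHEABLE_FUNCTIONS = {"CURRENT_DATE", "NOW", "RAND", "UUID"}

_VARIABLE = re.compile(r"@\w+")
_FUNCTION_CALL = re.compile(r"\b(\w+)\s*\(")


def is_cacheable(stmt, assume_cacheable=False):
    """Decide whether the result of a statement may be served from the cache."""
    if not stmt.lstrip().upper().startswith("SELECT"):
        return False
    if assume_cacheable:
        return True  # no inspection of the statement at all
    if _VARIABLE.search(stmt):
        return False
    used = {name.upper() for name in _FUNCTION_CALL.findall(stmt)}
    return not (used & NON_CACHEABLE_FUNCTIONS)
```

With `assume_cacheable=True` the function returns before looking inside the statement, which mirrors why the setting removes the parsing cost. It is only safe when the application is known not to issue SELECT statements of the non-cacheable kind.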
The impact of the changes above is shown in the figure below.
Compared to MaxScale 2.1.0, the performance of readconnroute is roughly the same. However, the performance of readwritesplit has increased by more than 100% and is no longer significantly different from readconnroute. Also, the cache performance has improved quite dramatically, as can be seen in the following figure.
In the figure, cacheable means that the cache has been configured so that it can assume that all SELECT statements can be cached.
From the figure it can be seen that if there are fewer than 64 concurrent client connections, the cache improves the throughput, even if it is verified that each SELECT statement is cacheable. With more concurrent connections than that, the verification constitutes a significant cost that diminishes the benefit of the cache. Best performance is always obtained if the verification can be turned off.
Compared to 2.0.5 the overall improvement is quite evident and although the shown results reflect a particular benchmark, it is to be expected that 2.1 consistently performs better than 2.0, irrespective of the workload.
Performance-wise, 2.1.3 GA is the best MaxScale version there has ever been, but there are still improvements that can be made.
In 2.1 all threads listen for new clients, but once a connection has been accepted, it is assigned to a particular thread. That causes cross-thread activity that requires locking. In the next version, MaxScale 2.2, we will change this so that the thread that accepts the connection is also the one that will handle it. That way there will be no cross-thread activity when a client connects.
A significant amount of the remaining locking inside MaxScale is actually related to the collection of statistics. Namely, as the MaxScale component for handling maxadmin requests is a regular MaxScale router, a maxadmin connection will run on some thread, yet it needs to access data belonging to the other threads. In 2.2 we introduce an internal epoll-based cross-thread messaging mechanism and have a dedicated admin thread, which communicates with the worker threads using that messaging mechanism. The result is that all statistics-related locking can be removed.
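The message-passing idea can be sketched as follows. This is not the MaxScale 2.2 mechanism: a standard-library `queue.Queue` stands in for the epoll-based messaging, and `StatsWorker` and `collect_stats` are hypothetical names. What it shows is that the counters are only ever read and written by the thread that owns them, while the admin side asks for a copy by message.

```python
import queue
import threading


class StatsWorker(threading.Thread):
    """Hypothetical worker whose statistics are touched by this thread only."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.queries = 0

    def run(self):
        while True:
            kind, payload = self.inbox.get()
            if kind == "query":
                self.queries += 1
            elif kind == "stats":
                payload.put(self.queries)  # reply on the caller's queue
            elif kind == "stop":
                break


def collect_stats(workers):
    """Runs on the admin thread: ask every worker for its own counter."""
    reply = queue.Queue()
    for w in workers:
        w.inbox.put(("stats", reply))
    return sum(reply.get() for _ in workers)
```

Since each inbox is processed in order, a statistics request sees every query that was posted to that worker before it.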