Better Than Linear Scaling

It's often accepted as a fact that systems like databases can’t scale better than linearly. But I think database systems shouldn't be seen as single-server systems.

In this blog, we’ll look at how to achieve better-than-linear scaling.

Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. For example, we consider a system scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added.

It is often accepted as a fact that systems (in particular, databases) can’t scale better than linearly. By this, I mean that when you double the resources, the performance at best doubles (and often improves by less than that).

We can attribute this assumption to Amdahl’s law and, later, to the Universal Scalability Law (USL). Both of these laws state that it is impossible to achieve better-than-linear scalability. To be totally precise, this is practically correct for single-server systems when the added resources are only CPU units.
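
For reference, both laws have simple closed forms. Here is a sketch of the usual formulations, in my own notation (not taken from this article): p is the parallelizable fraction of the work, N the number of processors, and σ and κ the USL's contention and coherency coefficients.

Amdahl's law:   S(N) = \frac{1}{(1 - p) + p/N}

USL:            C(N) = \frac{N}{1 + \sigma(N - 1) + \kappa N(N - 1)}

Both expressions are bounded above by N (the denominators can never drop below 1/N and 1, respectively), which is why neither model permits better-than-linear scaling.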

Multi-Node Systems

However, I think database systems should no longer be seen as single-server systems. MongoDB and Cassandra have had multi-node auto-sharding capabilities for a long time. We are about to see the rise of strongly consistent, SQL-based multi-node systems. And even MySQL is frequently deployed with manual sharding across multiple nodes.

Products like Vitess propose auto-sharding for MySQL, and with ProxySQL (which I will use in my experiment), you can set up a basic sharding schema.

I describe multi-node setups because in this environment it is possible to achieve much-better-than-linear scalability, as I will show below.

Why Is This Important?

Understanding scalability of multi-node systems is important for resource planning and for understanding how much of a potential performance gain we can expect when we add more nodes. This is especially interesting for cloud deployments.

How Is This Possible?

I’ve written before about how the size of available memory (cache) affects performance. When we add nodes to a deployment, we increase not only the CPU cores but also the memory that comes with each node (as well as extra IO capacity). So, as the node count grows, the available memory (and cache) grows with it. As the graphs in that post show, the effect of extra memory can be non-linear (and actually better than linear). Exploiting this fact, we can achieve better-than-linear scaling in a sharded setup. Below, I show an experimental setup that demonstrates this.
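
A rough way to see the mechanism, in my own back-of-envelope notation (not a formula from the article): if each node brings C of cache and the total data size is D, then with N shards the aggregate cache is N·C, and under uniform access the fraction of reads served from memory is roughly

f(N) \approx \min\left(1, \frac{N \cdot C}{D}\right)

As f(N) approaches 1, reads that previously went to disk are served from memory, so per-node throughput rises at the same time as the node count grows; the combination of the two effects is what can push the overall scaling past linear.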

Experimental Setup

For the sharded setup, we will use ProxySQL in front of N MySQL servers (shards). We will also use sysbench with 60 tables (4 million rows each, uniform distribution).

  • For one shard, that shard contains all 60 tables.
  • For two shards, each shard contains 30 tables.
  • For three shards, each shard contains 20 tables.
  • For six shards, each shard contains 10 tables.

So schematically, it looks like this:

One shard: [diagram]

Two shards: [diagram]

Six shards: [diagram]

We want to measure how the performance (both throughput and latency) changes as we go from one shard to two, three, four, five, and six shards.

For the single shard, I used a Google Cloud instance with eight virtual CPUs and 16GB of RAM, where 10GB is allocated for the innodb_buffer_pool_size.

The database size (for all 60 tables) is about 51GB for the data, and 7GB for indexes.
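
As a back-of-envelope check (my own sketch, using the numbers above: a 10GB buffer pool per node and roughly 58GB of data plus indexes in total), we can print how much of the data set the aggregate cache covers at each shard count:

# Rough cache-coverage estimate per shard count. Assumes the numbers above:
# a 10GB buffer pool per node and ~58GB of data + indexes in total.
data_gb=58
cache_per_node_gb=10
for n in 1 2 3 4 5 6
do
  total_cache=$(( n * cache_per_node_gb ))
  pct=$(( total_cache * 100 / data_gb ))   # integer percentage of the data set cached
  if [ $pct -gt 100 ]; then pct=100; fi
  echo "$n shard(s): ${total_cache}GB aggregate cache, ~${pct}% of the data set"
done

This is only a rough guide (it ignores how InnoDB actually fills the buffer pool), but it lines up with the results below, where the gains flatten once the data roughly fits in the aggregate cache.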

For this, we will use a sysbench read-only uniform workload, with ProxySQL performing the query routing. The sharding is defined through ProxySQL query rules:

#!/bin/bash
# Takes the number of shards as the first argument
shards=$1

# Remove any existing query rules
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "DELETE FROM mysql_query_rules"

# Route queries that touch table sbtestN to hostgroup (N mod shards) + 1.
# The \b word boundary keeps sbtest1 from also matching sbtest10, sbtest11, etc.
for i in {1..60}
do
  hg=$(( i % shards + 1 ))
  mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,destination_hostgroup,apply) VALUES ($i,1,'root','sbtest${i}\b',$hg,1);"
done
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "LOAD MYSQL QUERY RULES TO RUNTIME;"
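
The rules above route queries to hostgroups 1 through N, so each backend MySQL server has to be registered in the matching hostgroup beforehand. The article does not show that step; here is a sketch of how it could look, with shard1 … shardN as placeholder hostnames:

# Register one MySQL backend per hostgroup (hostnames are placeholders).
shards=$1
for i in $(seq 1 $shards)
do
  mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES ($i,'shard$i',3306);"
done
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "LOAD MYSQL SERVERS TO RUNTIME;"

The sysbench user ('root' here) also needs an entry in ProxySQL's mysql_users table; I omit that part for brevity.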

Command line for sysbench 1.0.4:

sysbench oltp_read_only.lua --mysql-socket=/tmp/proxysql.sock --mysql-user=root --mysql-password=test --tables=60 --table-size=4000000 --threads=60 --report-interval=10 --time=900 --rand-type=pareto run
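
The article does not show the data-loading step that precedes the run above. A sketch of how it could be done through the same ProxySQL socket, assuming the query rules are already loaded so that the CREATE TABLE and INSERT statements for each sbtestN table get routed to its shard:

sysbench oltp_read_only.lua --mysql-socket=/tmp/proxysql.sock --mysql-user=root --mysql-password=test --tables=60 --table-size=4000000 --threads=60 prepare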

The Results

Nodes | Throughput, tps | Speedup vs. 1 node | Latency, ms
------|-----------------|--------------------|------------
1     | 245             | 1.00               | 244.88
2     | 682             | 2.78               | 87.95
3     | 1659            | 6.77               | 36.16
4     | 2748            | 11.22              | 21.83
5     | 3384            | 13.81              | 17.72
6     | 3514            | 14.34              | 17.07

[graph: scaling results]

As we can see, performance improves by a factor far greater than the number of nodes added; the scaling is much better than linear.

With five nodes, the improvement is 13.81 times compared to the single node.

The sixth node does not add much benefit because, by that point, the data practically fits into memory (with five nodes, the total cache size is 50GB, compared to the 51GB data size).

Factors That Affect Multi-Node Scaling

How can we model/predict the performance gain? There are multiple factors to take into account: the size of the active working set, the available memory size, and (importantly) the distribution of accesses to that working set. A uniform distribution is the best-case scenario, while access concentrated on a single row is the opposite corner case, where no speedup is possible. We also need to keep network speed in mind: if we come close to saturating the available network bandwidth, significant further improvement becomes impossible.
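
One way to turn these factors into a rough estimate (my own sketch, not a model from the article): reuse the cached fraction f(N) from earlier, with the active working set W in place of the total data size. If a read served from memory costs t_mem and a read that misses the cache costs t_disk, the expected cost per read and the aggregate throughput are roughly

t(N) \approx f(N)\, t_{\mathrm{mem}} + \left(1 - f(N)\right) t_{\mathrm{disk}}, \qquad \mathrm{Throughput}(N) \propto \frac{N}{t(N)}

Because t(N) itself shrinks as nodes are added (until the working set fits in cache), throughput can grow faster than N, which is what the table above shows. Skewed access distributions make f(N) grow more slowly than in the uniform case, and network saturation caps the whole expression.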

Conclusion

In multi-node, auto-scaling, auto-sharding distributed systems, the traditional scalability models do not provide much help. We need a better framework to understand how adding nodes affects performance.

Topics:
database, linear scaling, multi nodes, scalability, single server

Published at DZone with permission of Vadim Tkachenko, DZone MVB. See the original article here.
