Over a million developers have joined DZone.

Introducing a New Type of Benchmark

· Performance Zone

Download Forrester’s “Vendor Landscape, Application Performance Management” report that examines the evolving role of APM as a key driver of customer satisfaction and business success, brought to you in partnership with BMC.

Traditionally, most benchmarks have focused on throughput. We all get used to that, and in our benchmarks, sysbench and tpcc-mysql, the final result also represents the throughput (transactions per second in sysbench; NewOrder transactions Per Minute in tpcc-mysql). However, like Mark Callaghan mentioned in comments, response time is a much more important metric to compare.

I want to pretend that we pioneered (not invented, but started to use widely) a benchmark methodology when we measure not the final throughput, but rather periodic probes (i.e. every 10 sec). This allows us to draw “stability” graphs, like this one:

where we can see not only a final result, but how the system behaves in dynamic.

What’s wrong with existing benchmarks?

Well, all benchmarks are lies, and focusing on throughput does not get any closer to reality.

Benchmarks, like sysbench or tpcc-mysql, start N threads and try to push the database as much as possible, bombarding the system with queries with no pause.

That rarely happens in real life. There are no systems that are pushed to 100% load all time.

So, how we can model it? There are different theories, and the one which describes user’s behavior, is Queueing theory. In short we can assume that users send requests with some arrival rate (which can be different in the different part of day/week/month/year though). And what is important for an end user is response time, that is how long the user has to wait on results. E.g. when you go to your Facebook profile or Wikipedia page, you expect to get response within second or two.

How we should change the benchmark to base on this model ?
There are my working ideas:

  • Benchmark starts N working threads, but they all are idle until asked to handle a request
  • Benchmarks generates transactions with a given rate, i.e. M transactions per second and puts into a queue. The interval between arrivals is not uniform, but rather distributed by Exponential distribution, with λ = M. That how it goes if to believe to the Poisson process.
    For example, if our target is arrival rate 2 queries per second, then exponential distribution will give us following intervals (in sec) between events: 0.61, 0.43, 1.55, 0.18, 0.01, 0.76, 0.09, 1.26, …

    Or if we represent graphically (point means even arrival):

    As you see interval is far from being strict 0.5 sec, but 0.5 is the mean of this random generation function. On the graph you see 20 events arrived within 9 seconds.

  • Transactions from the queue are handled by one of free threads, or are waiting in the queue until one of threads are ready. The time waiting in the queue is added to a total response time.
  • As a result we measure 95% or 99% response times.

What does it give to us? It allows to see:

  • What is the response time we may expect having a given arrival rate
  • What is the optimal number of working threads (the one that provides best response times)

When is it useful?

I'm looking to answer on questions like:

  • When we add additional node to a cluster (e.g. XtraDB Cluster), how does it affect the response time?
  • When we put a load to two nodes instead of three nodes, will it help to improve the response time?
  • Do we need to increase number of working threads when we add nodes?

Beside cluster testing, it will also help to see an affect of having a side on the server. For example, the famous problem with DROP TABLE performance. Does DROP TABLE, running in separate session, affect a response time of queries that handle user load ? The same for mysqldump, how does it affect short user queries ?

In fact I have a prototype based on sysbench. It is there lp:~vadim-tk/sysbench/inj-rate/. It works like a regular sysbench, but you need to specify the additional parameter tx-rate, which defines an expected arrival rate (in transactions per second).

There are some results from my early experiments. Assume we have an usual sysbench setup, and we target an arrival rate as 3000 transactions per second (regular sysbench transactions). We vary working threads from 1 to 128.

There are results for 16-128 threads (the result is 99% response time, taken every 10 sec. the less is better)

We can see that 16 threads give best 99% response time (15.74ms final), 32 threads: 16.75 ms, 64 threads: 25.14ms. And with 128 threads we have pretty terrible unstable response times, with 1579.91ms final.
That means that 16-32 threads is probably best number of concurrently working threads (for this kind of workload and this arrival rate).

Okay, but what happens if we have not enough working threads? You can see it from following graph (1-8 threads):

The queue piles up, waiting time grows, and the final response time grows linearly up to ~30 sec, where benchmark stops, because the queue is full.

I'm looking for your comments -- do you find this useful?

See Forrester’s Report, “Vendor Landscape, Application Performance Management” to identify the right vendor to help IT deliver better service at a lower cost, brought to you in partnership with BMC.


Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}