Hyper-Threading: How Does It Double CPU Throughput?
Hyper-Threading: How Does It Double CPU Throughput?
Join the DZone community and get the full member experience.Join For Free
xMatters delivers integration-driven collaboration that relays data between systems, while engaging the right people to proactively resolve issues. Read the Monitoring in a Connected Enterprise whitepaper and learn about 3 tools for resolving incidents quickly.
The other day a customer asked me to do capacity planning for their web server farm. I was looking at the CPU graph for one of the web servers that had Hyper-threading switched ON and thought to myself: “This must be quite a misleading graph – it shows 30% CPU usage. It can’t really be that this server can handle 3 times more work?”
Or can it?
I decided to do what we usually do in such case – I decided to test it and find out the truth. Turns out – there’s more to it than meets the eye.
How Intel Hyper-Threading works
Before we get to my benchmark results, let’s talk a little bit about hyper-threading. According to Intel, Intel® Hyper-Threading Technology (Intel® HT Technology) uses processor resources more efficiently, enabling multiple threads to run on each core. As a performance feature, Intel HT Technology also increases processor throughput, improving overall performance on threaded software.
Sounds almost like magic, but in reality (and correct me if I’m wrong), what HT does essentially is – by presenting one CPU core as two CPUs (threads rather), it allows you to offload task scheduling from kernel to CPU.
So for example if you just had one physical CPU core and two tasks with the same priority running in parallel, the kernel would have to constantly switch the context so that both tasks get a fair amount of CPU time. If, however, you have the CPU presented to you as two CPUs, the kernel can give each task a CPU and take a vacation.
On the hardware level, it will still be one CPU doing the same amount of work, but there maybe some optimization to how that work is going to be executed.
Here’s the problem that was driving me nuts: if HT does NOT actually give you twice more power and yet the system represents statistics for each CPU thread separately, then at 50% CPU utilization (as per mpstat on Linux), the CPU should be maxed out.
So if I tried to model the scalability of that web server – a 12-core system with HT enabled (represented as 24 CPUs on a system), assuming perfect linear scalability, here’s how it should look:
Throughput (requests per second) | 9 | ,+++++++++++++++ | + | + 6 | + | + | + 3 | + | + | + 0 '-----+----+----+----+---- 1 6 12 18 24
In the example above, single CPU thread could process the request in 1.2s, which is why you see it max out at 9-10 requests/sec (12/1.2).
From the user perspective, this limitation would hit VERY unexpectedly, as one would expect 50% utilization to be… well, exactly that – 50% utilization.
In fact, the CPU utilization graph would look even more frustrating. For example if I were increasing the number of parallel requests linearly from 1 to 24, here’s how that relationship should look:
CPU utilization: 100% | ++++++++++++++ | . | . | . | . 50% | . | + | + | + | + 0% '----+----+----+----+---- 0 6 12 18 24
Hence CPU utilization would skyrocket right at 12 cores from 50% to 100%, because in fact the system CPU would be 100% utilized at this point.
What happens in reality
Naturally, I decided to run a benchmark and see if my assumptions are correct. The benchmark was pretty basic – I wrote a CPU-intensive php script, that took 1.2s to execute on the CPU I was testing this, and bashed it over http (apache) with ab at increasing concurrency. Here’s the result:
Raw data can be found here.
If this does not blow your mind, please go over the facts again and then back at the graph.
Still not sure why I find this interesting? Let me explain. If you look carefully, initially – at concurrency of 1 through 8 – it scales perfectly. So if you only had data for threads 1-8 (and you knew processes don’t incur coherency delays due to shared data structures), you’d probably predict that it will scale linearly until it reaches ~10 requests/sec at 12 cores, at which point adding more parallel requests would not have any benefits as the CPU would be saturated.
What happens in reality, though, is that past 8 parallel threads (hence, past 33% virtual CPU utilization), execution time starts to increase and maximum performance is only achieved at 24-32 concurrent requests. It looks like at the 33% mark there’s some kind of “throttling” happening.
In other words, to avoid a sharp performance hit past 50% CPU utilization, at 33% virtual thread utilization (i.e. 66% actual CPU utilization), the system gives the illusion of a performance limit – execution slows down so that the system only reaches the saturation point at 24 threads (visually, at 100% CPU utilization).
Naturally then the question is – does it still make sense to run hyper-threading on a multi-core system? I see at least two drawbacks:
1. You don’t see the real picture of how utilized your system really is – if the CPU graph shows 30% utilization, your system may well be 60% utilized already.
2. Past 60% physical utilization, execution speed of your requests will be throttled intentionally in order to provide higher system throughput.
So if you are optimizing for higher throughput – that may be fine. But if you are optimizing for response time, then you may consider running with HT turned off.
Did I miss something?
Published at DZone with permission of Peter Zaitsev , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.