
Uneven Work Distribution and Oversubscription


A few days ago I was teaching our Win32 Concurrent Programming course and showed students an experiment with the std::thread class introduced in C++11. The experiment demonstrates how to partition work across multiple threads and coordinate their execution; the work to partition is simply counting the number of primes in a given interval.

You can find the whole benchmark here. The heart of the code is the parallelize_count function, below:

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

void count_primes(unsigned begin, unsigned end);  // defined in the full benchmark

void parallelize_count(unsigned nthreads, unsigned begin, unsigned end) {
    std::vector<std::thread> threads;
    unsigned chunk_size = (end - begin) / nthreads;
    for (unsigned i = 0; i < nthreads; ++i) {
        unsigned chunk_begin = begin + chunk_size * i;
        // The last chunk absorbs any remainder left by the integer division.
        unsigned chunk_end = (i == nthreads - 1) ? end : chunk_begin + chunk_size;
        threads.emplace_back([=]() { count_primes(chunk_begin, chunk_end); });
    }
    std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
}
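
count_primes itself lives in the full benchmark; a minimal trial-division version (my sketch of the idea – the real benchmark's counting and result collection may well differ, including the shared total_primes counter I use here) could look like this:

#include <atomic>

std::atomic<unsigned> total_primes{0};  // shared result across threads

// Trial division up to sqrt(n) – deliberately naive, so the work per number
// grows noticeably with n.
bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

void count_primes(unsigned begin, unsigned end) {
    unsigned count = 0;
    for (unsigned n = begin; n < end; ++n)
        if (is_prime(n)) ++count;
    total_primes += count;  // one atomic update per chunk, not per number
}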

Many people expect this to scale linearly with the number of threads, up to the hardware limit – the number of processors actually installed on the system. The reality is slightly more complex. Below are the results from one of the trial runs on my system, with an i7-3720QM processor (4 physical cores, 8 logical cores):

[Figure: runtime as a function of thread count, gcc and Visual C++]

The orange and blue lines are the results from gcc 4.8.2 on OS X 10.9 and from Visual C++ 2013 on Windows 8. VC's spike around 50 threads is not reproducible and can be attributed to measurement perturbations. The interesting results, which are fairly similar across both compilers, are the following (a minimal timing harness is sketched after the list):

  1. The speedup is not quite linear. For example, going from 1 to 2 threads cuts the runtime by only 27%, whereas a perfectly linear speedup would cut it by 50%.
  2. Even though I only have 4 physical cores on this machine, there is still a considerable 37% speed bump from 4 to 8 threads.
  3. Even after the hardware limits are exceeded, we can still see speedups from introducing more threads to the mix. For example, going from 8 to 16 threads produces a 27% speedup.
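
The percentages above come from plain wall-clock timings. A minimal harness along these lines (assuming the parallelize_count and count_primes sketches above; the linked benchmark has its own measurement code) is enough to reproduce this kind of measurement:

#include <chrono>
#include <cstdio>

int main() {
    for (unsigned nthreads : {1u, 2u, 4u, 8u, 16u, 32u, 64u}) {
        total_primes = 0;
        auto start = std::chrono::steady_clock::now();
        parallelize_count(nthreads, 0, 100000);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%2u threads: %u primes in %lld ms\n",
                    nthreads, total_primes.load(), (long long)ms);
    }
}

steady_clock is the right clock for this job: it is monotonic, so the measurement is unaffected by system clock adjustments.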

The explanation is the uneven distribution of work across the threads: counting primes in an interval of larger numbers takes much longer than in an interval of smaller numbers. The Visual Studio Concurrency Visualizer can show you this graphically – and can even embed markers from your code into the generated report. For example, here is a sample showing three threads working on the range [0, 100000):

[Figure: Concurrency Visualizer timeline of three threads on [0, 100000)]

And here are the eight threads working on the same range:

[Figure: Concurrency Visualizer timeline of eight threads on [0, 100000)]

Because the work is distributed unevenly, adding more threads – i.e., forcefully oversubscribing the scheduler with more threads than processors – is actually beneficial here: more threads mean smaller chunks, so the expensive high ranges get subdivided and a core that finishes a cheap chunk picks up another thread instead of going idle.
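
A back-of-the-envelope model makes the imbalance concrete. If primality is tested by trial division (as in the is_prime sketch above – a crude assumption), testing n costs on the order of sqrt(n) operations, so a chunk [a, b) costs roughly the integral of sqrt(x) over [a, b), which grows like b^1.5 - a^1.5. Under that model, the relative costs of four equal chunks of [0, 100000) are:

#include <cmath>
#include <cstdio>

// Crude cost model: testing n by trial division takes ~sqrt(n) steps, so a
// chunk [a, b) costs roughly b^1.5 - a^1.5 (normalized to the first chunk).
int main() {
    const double chunk = 100000.0 / 4;
    for (int i = 0; i < 4; ++i) {
        double a = chunk * i, b = a + chunk;
        std::printf("chunk %d [%.0f, %.0f): relative cost %.2f\n",
                    i, a, b, (std::pow(b, 1.5) - std::pow(a, 1.5)) / std::pow(chunk, 1.5));
    }
}

This prints relative costs of roughly 1.00, 1.83, 2.37, and 2.80 – the last chunk is almost three times as expensive as the first, which matches the lopsided timelines above.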

The practical lesson here is that distributing work evenly across threads is not easy. One good way to get rid of these (and other) problems is to use smaller chunks of work and queue them to a thread pool, as the sketch below illustrates. There are of course even higher-level abstractions, such as parallel_for, designed to handle the work item partitioning and distribution for you. But that's a story for a different post.
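
C++11 does not ship a standard thread pool, but the core idea – dynamic scheduling of many small chunks instead of one static split – can be sketched with an atomic counter that worker threads pull chunks from. (This is my illustration of the approach, not code from the benchmark; parallelize_count_dynamic and its default chunk_size are made-up names and values.)

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Dynamic scheduling: each thread repeatedly claims the next small chunk, so
// a thread that drew cheap low ranges simply claims more chunks instead of
// sitting idle while the expensive high ranges finish.
void parallelize_count_dynamic(unsigned nthreads, unsigned begin, unsigned end,
                               unsigned chunk_size = 1000) {
    std::atomic<unsigned> next{begin};
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nthreads; ++i) {
        threads.emplace_back([&]() {
            for (;;) {
                unsigned chunk_begin = next.fetch_add(chunk_size);
                if (chunk_begin >= end) break;
                count_primes(chunk_begin, std::min(chunk_begin + chunk_size, end));
            }
        });
    }
    for (auto& t : threads) t.join();
}

Smaller chunks balance better but synchronize more often, so chunk_size is a tuning knob; parallel_for implementations automate this trade-off, typically adding work stealing on top.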



Published at DZone with permission of Sasha Goldshtein, DZone MVB.

