
Uneven Work Distribution and Oversubscription


A few days ago I was teaching our Win32 Concurrent Programming course and showed students an experiment with the std::thread class introduced in C++11. The experiment is designed to demonstrate how to partition work across multiple threads and coordinate their execution; the work to partition is simply counting the number of primes in a certain interval.

You can find the whole benchmark here. The heart of the code is the parallelize_count function, below:

void parallelize_count(unsigned nthreads, unsigned begin, unsigned end) {
    std::vector<std::thread> threads;
    unsigned chunk_size = (end - begin) / nthreads;
    for (unsigned i = 0; i < nthreads; ++i) {
        unsigned chunk_begin = begin + chunk_size * i;
        unsigned chunk_end = chunk_begin + chunk_size;
        threads.emplace_back([=]() {
            count_primes(chunk_begin, chunk_end);
        });
    }
    std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
}
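The count_primes helper is not shown above; a minimal trial-division sketch of what it might look like (my own assumption, not the original benchmark code) is:

```cpp
#include <cassert>

// Trial-division primality test: check every candidate divisor up to sqrt(n).
static bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Counts the primes in the half-open interval [begin, end).
static unsigned count_primes(unsigned begin, unsigned end) {
    unsigned count = 0;
    for (unsigned n = begin; n < end; ++n)
        if (is_prime(n)) ++count;
    return count;
}
```

Note that the cost of is_prime grows with n, which is exactly what makes the static partitioning above uneven.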

Many people expect this to scale linearly with the number of threads, up to the hardware limit: the number of processors actually installed on the system. The reality is slightly more complex. Below are the results from one of the trial runs on my system, with an i7-3720QM processor (4 physical cores, 8 logical cores):


The orange and blue lines are results from gcc 4.8.2 on OS X 10.9 and from Visual C++ 2013 on Windows 8. VC's spike around 50 threads is not reproducible, and can be attributed to measurement perturbations. The interesting results, which are pretty similar across both compilers, are the following:

  1. The speedup is not quite linear. For example, going from 1 to 2 threads cuts the running time by only 27%, rather than the 50% a perfectly linear speedup would deliver.
  2. Even though I only have 4 physical cores on this machine, there is still a considerable 37% speed bump from 4 to 8 threads.
  3. Even after the hardware limits are exceeded, we can still see speedups from introducing more threads to the mix. For example, going from 8 to 16 threads produces a 27% speedup.

The explanation for this is the uneven workload distribution between the threads. Counting primes in an interval of larger numbers takes much longer than in an interval of smaller numbers. The Visual Studio Concurrency Visualizer can show you this graphically – and even embed markers from your code into the generated report. For example, here is a sample showing the three threads working on the range [0, 100000):


And here are the eight threads working on the same range:


This uneven distribution means that adding more threads, i.e., forcefully oversubscribing the scheduler with more threads than processors, is actually beneficial here: threads that draw a cheap chunk finish early and free up a processor for the threads stuck with expensive chunks.
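You can see the imbalance without a profiler by counting the trial divisions each chunk performs. The function below (an illustrative sketch of mine, not part of the original benchmark) uses that count as a deterministic proxy for a thread's workload; the top quarter of [0, 100000) costs noticeably more than the bottom quarter:

```cpp
#include <cassert>
#include <cstdint>

// Counts the trial divisions performed while primality-testing every number
// in [begin, end) -- a deterministic proxy for the work a thread owning that
// chunk would do. Prime (and near-prime) numbers force the inner loop to run
// all the way to sqrt(n), so chunks of larger numbers cost more.
static std::uint64_t divisions_in_chunk(unsigned begin, unsigned end) {
    std::uint64_t divisions = 0;
    for (unsigned n = begin; n < end; ++n) {
        for (unsigned d = 2; d * d <= n; ++d) {
            ++divisions;
            if (n % d == 0) break;  // found a factor; stop early
        }
    }
    return divisions;
}
```

Comparing divisions_in_chunk(0, 25000) against divisions_in_chunk(75000, 100000) shows why the thread assigned the last chunk becomes the straggler.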

The practical lesson here is that distributing work across threads is not easy. One good way to get rid of these (and other) problems is to use smaller chunks of work and queue them to a thread pool. There are, of course, even higher-level abstractions, such as parallel_for, designed to handle the work-item partitioning and distribution for you. But that's a story for a different post.
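As a sketch of that small-chunks remedy (my own illustration, not code from the post), the workers below pull fixed-size chunks off a shared atomic counter, so a thread that finishes a cheap chunk immediately grabs another instead of idling:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Trial-division primality test, assumed to match the benchmark's helper.
static bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Dynamic scheduling: nthreads workers repeatedly claim the next small chunk
// via an atomic counter, which evens out the per-chunk cost differences.
unsigned count_primes_dynamic(unsigned nthreads, unsigned begin, unsigned end,
                              unsigned chunk = 1000) {
    std::atomic<unsigned> next{begin};
    std::vector<unsigned> partial(nthreads, 0);  // one slot per worker, no locking
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            for (;;) {
                unsigned lo = next.fetch_add(chunk);
                if (lo >= end) break;            // range exhausted
                unsigned hi = std::min(lo + chunk, end);
                for (unsigned n = lo; n < hi; ++n)
                    if (is_prime(n)) ++partial[t];
            }
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0u);
}
```

With chunks this small, no single thread can get stuck with a quarter of the expensive numbers, so there is nothing left for oversubscription to fix.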



Published at DZone with permission of
