Python – parallelizing CPU-bound tasks with concurrent.futures

By Eli Bendersky · Jan. 17, 13

A year ago, I wrote a series of posts about using the Python multiprocessing module. One of those posts contrasted compute-intensive task parallelization using threads vs. processes. Today I want to revisit that topic, this time employing the concurrent.futures module, which has been part of the standard library since Python 3.2.

First of all, what are "futures"? The Wikipedia page says:

In computer science, future, promise, and delay refer to constructs used for synchronizing in some concurrent programming languages. They describe an object that acts as a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.

I wouldn’t say "futures" is the best name choice, but this is what we’re stuck with, so let’s move on. Futures are actually a very nice tool that helps bridge an annoying gap that always exists in concurrent execution – the gap between launching some computation concurrently and obtaining the result of that computation. As my previous post showed, one of the common ways to deal with this gap is to pass a synchronized Queue object into every worker process (or thread) and then collect the results once the workers are done. Futures make this much easier and more elegant, as we’ll see.
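To make the idea concrete before diving into the factorization example, here's a minimal sketch of submitting a single task and asking its future for the result (a thread pool keeps it small; the names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns immediately with a Future...
    future = executor.submit(square, 7)
    # ...and result() blocks until the computation is done.
    print(future.result())   # 49
```

The future is the bridge: the caller holds it from the moment of submission, and queries it for the result whenever it's ready.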

For completeness, here is the computation we’re going to apply in parallel over a large amount of inputs:

def factorize_naive(n):
    """ A naive factorization method. Take integer 'n', return list of
        factors.
    """
    if n < 2:
        return []
    factors = []
    p = 2

    while True:
        if n == 1:
            return factors

        r = n % p
        if r == 0:
            factors.append(p)
            n = n // p
        elif p * p >= n:
            factors.append(n)
            return factors
        elif p > 2:
            # Advance in steps of 2 over odd numbers
            p += 2
        else:
            # If p == 2, get to 3
            p += 1
    assert False, "unreachable"
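A quick sanity check of what the function returns (factors in non-decreasing order, with multiplicity; a condensed but equivalent copy of the function is repeated here so the snippet runs standalone):

```python
def factorize_naive(n):
    """Naive trial division; condensed version of the function above."""
    if n < 2:
        return []
    factors, p = [], 2
    while n > 1:
        if n % p == 0:
            factors.append(p)
            n //= p
        elif p * p >= n:
            factors.append(n)
            break
        else:
            p += 2 if p > 2 else 1
    return factors

print(factorize_naive(360))   # [2, 2, 2, 3, 3, 5]
print(factorize_naive(13))    # [13]
print(factorize_naive(1))     # []
```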

And here’s the first attempt at doing that with concurrent.futures:

import math
from concurrent.futures import ProcessPoolExecutor, as_completed

def chunked_worker(nums):
    """ Factorize a list of numbers, returning a num:factors mapping.
    """
    return {n: factorize_naive(n) for n in nums}


def pool_factorizer_chunked(nums, nprocs):
    # Manually divide the task to chunks of equal length, submitting each
    # chunk to the pool.
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    futures = []

    with ProcessPoolExecutor(max_workers=nprocs) as executor:
        for i in range(nprocs):
            chunk = nums[(chunksize * i) : (chunksize * (i + 1))]
            futures.append(executor.submit(chunked_worker, chunk))

    resultdict = {}
    for f in as_completed(futures):
        resultdict.update(f.result())
    return resultdict

The end result of pool_factorizer_chunked is a dictionary mapping numbers to lists of their factors. The most interesting thing to note here is this: the function run in a worker process (chunked_worker in this case) can simply return a value. For each such "call" (submission to the executor), a future is returned. This future encapsulates the result of the execution, which is probably not ready immediately but will be at some point. The concurrent.futures.as_completed helper lets us simply wait on all the futures and yield the results of those that are done, whenever they're done.

It’s easy to see that this code is conceptually simpler than manually launching the processes, passing some sort of synchronization queues to workers and collecting results. This, IMHO, is the main goal of futures. Futures aren’t there to make your code faster, they’re there to make it simpler. And any simplification is a blessing when parallel programming is concerned.

Note also that ProcessPoolExecutor is used as a context manager – this makes process cleanup automatic and reliable. For more fine-grained control, it has a shutdown method that can be called manually.
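A sketch of the manual-shutdown style (a thread pool is used here for brevity; shutdown comes from the common Executor base class, so the call is the same for ProcessPoolExecutor):

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
try:
    future = executor.submit(pow, 2, 10)
    print(future.result())   # 1024
finally:
    # wait=True (the default) blocks until pending futures finish
    executor.shutdown(wait=True)
```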

There’s more. Since the concurrent.futures module lets you simply return a value from a concurrent call, it has another tool that makes computations like the above even simpler – Executor.map. Here’s the same task rewritten with map:

def pool_factorizer_map(nums, nprocs):
    # Let the executor divide the work among processes by using 'map'.
    with ProcessPoolExecutor(max_workers=nprocs) as executor:
        return {num: factors for num, factors in
                zip(nums, executor.map(factorize_naive, nums))}

Amazingly, this is it. This small function (a potential two-liner, if not for the wrapping for readability) creates a process pool, submits a bunch of tasks to it, collects all the results when they’re ready, puts them into a single dictionary and returns it.
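One detail the map version quietly relies on: Executor.map yields results in the order of the inputs, even when tasks finish out of order, which is what makes the zip with nums safe. A small thread-pool sketch (the names are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def delayed_negate(x):
    # Earlier inputs sleep longer, so they finish later...
    time.sleep(0.05 * (3 - x))
    return -x

with ThreadPoolExecutor(max_workers=3) as executor:
    # ...yet map still yields results in input order.
    results = list(executor.map(delayed_negate, [0, 1, 2]))

print(results)   # [0, -1, -2]
```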

As for performance, the second method is also a bit faster in my benchmarks. I think this makes sense: the manual division into chunks doesn’t take into account which chunks will take longer, so the division of work between workers may be unbalanced. The map method keeps a pool of workers to which it submits new computations as they become ready, which means the workers are all kept busy until everything is done.

To conclude, I strongly recommend using concurrent.futures whenever the possibility presents itself. Futures are much simpler conceptually, and hence less error-prone, than the manual method of creating processes and keeping track of their results. In practice, a future is a nice way to convey the result of a computation from the process producing it to the process consuming it. It’s like creating a result queue manually, but with a lot of useful semantics already implemented. Futures can be polled and cancelled, provide useful access to exceptions, and can have callbacks attached to them. My examples here are simplistic and don’t show how to use the cooler features, but if you need them – the documentation is pretty good.
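For instance, a worker's exception is captured by its future and re-raised in the caller, and callbacks fire once a future completes. A small sketch (names illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def fail():
    raise ValueError("boom")

completed = []

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(fail)
    # Runs when the future finishes (even if it failed).
    future.add_done_callback(lambda f: completed.append(f))

# result() re-raises the worker's exception in the caller...
try:
    future.result()
except ValueError as e:
    print("caught:", e)   # caught: boom

# ...while exception() returns it without raising.
print(type(future.exception()).__name__)   # ValueError
print(len(completed))                      # 1
```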



Published at DZone with permission of Eli Bendersky. See the original article here.
