Over a million developers have joined DZone.

Infinispan 5.2: Map/Reduce Parallel Execution

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Ever since the Infinispan 5.2 release we implemented fully distributed execution of both map and reduce phases of MapReduceTask. For the map phase, MapReduceTask hashes task input keys, groups them by execution node N these keys are hashed to, and sends map function along input keys to each node N. At node N map function gets invoked for each input key and locally loaded corresponding value. However, map function on node N, until recently, got invoked on a single thread regardless of the number of key/value pairs. If we need to invoke map function on many key/value pairs, things would sooner rather than later grind to a halt.

Similarly in order to complete reduce phase, MapReduceTask groups intermediate KOut keys by execution node N they are hashed to. After intermediate phase is completed, MapReduceTask sends a reduce command to each node N where KOut keys are hashed. Once reduce command arrives on target execution node, it looks up temporary cache belonging to MapReduceTask and for each KOut key, grabs a list of VOut values, wraps it with an Iterator and invokes reduce on it. However, even reduce function, until recently, got invoked on a single thread, as well. Even though, due to the nature of map/reduce paradigm, reduce entails significantly smaller number of key/value function invocations compared to a map, current single threaded execution model does not help to speed things up.

Starting with Infinispan community release 7.0.0.Alpha1, map and reduce task phases are executed in parallel. If the eviction is not configured for the cache where key/value pairs involved in the map phase reside, MapReduceTask uses fork/join work-stealing technique for parallel execution of the map and reduce functions. Otherwise, we implement parallel execution using a standard thread executor framework. Reduce phase is always executed using fork/join work-stealing algorithm. Either way, we are hoping that users' large map/reduce tasks will experience a significant execution speedup.  At the moment, we are conducting our own performance tests and will get back to you with the results soon. Stay tuned.

Cheers,
Vladimir

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:

Published at DZone with permission of Manik Surtani, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}