Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Infinispan 5.2: Map/Reduce Parallel Execution

DZone's Guide to

Infinispan 5.2: Map/Reduce Parallel Execution

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Ever since the Infinispan 5.2 release we implemented fully distributed execution of both map and reduce phases of MapReduceTask. For the map phase, MapReduceTask hashes task input keys, groups them by execution node N these keys are hashed to, and sends map function along input keys to each node N. At node N map function gets invoked for each input key and locally loaded corresponding value. However, map function on node N, until recently, got invoked on a single thread regardless of the number of key/value pairs. If we need to invoke map function on many key/value pairs, things would sooner rather than later grind to a halt.

Similarly in order to complete reduce phase, MapReduceTask groups intermediate KOut keys by execution node N they are hashed to. After intermediate phase is completed, MapReduceTask sends a reduce command to each node N where KOut keys are hashed. Once reduce command arrives on target execution node, it looks up temporary cache belonging to MapReduceTask and for each KOut key, grabs a list of VOut values, wraps it with an Iterator and invokes reduce on it. However, even reduce function, until recently, got invoked on a single thread, as well. Even though, due to the nature of map/reduce paradigm, reduce entails significantly smaller number of key/value function invocations compared to a map, current single threaded execution model does not help to speed things up.

Starting with Infinispan community release 7.0.0.Alpha1, map and reduce task phases are executed in parallel. If the eviction is not configured for the cache where key/value pairs involved in the map phase reside, MapReduceTask uses fork/join work-stealing technique for parallel execution of the map and reduce functions. Otherwise, we implement parallel execution using a standard thread executor framework. Reduce phase is always executed using fork/join work-stealing algorithm. Either way, we are hoping that users' large map/reduce tasks will experience a significant execution speedup.  At the moment, we are conducting our own performance tests and will get back to you with the results soon. Stay tuned.

Cheers,
Vladimir

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}