Over a million developers have joined DZone.

Is there a future for Map/Reduce?

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

8w9jj

Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the map/reduce paper  10 years ago (2004). According to WikiPedia Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce,  one year later at Yahoo – both these implementations were done for the same purpose – batch indexing of the web.

Back than, the web began its “web 2.0″ transition, pages became more dynamic , people began to create more content – so an efficient way to reprocess and build the web index was needed and map/reduce was it. Web Indexing was a great fit for map/reduce since the initial processing of each source (web page) is completely independent from any other – i.e.  a very convenient map phase and you need  to combine the results to build the reverse index. That said, even the core google algorithm –  the famous pagerank is iterative (so less appropriate for map/reduce), not to mention that  as the internet got bigger and the updates became more and more frequent map/reduce wasn’t enough. Again Google (who seem to be consistently few years ahead of the industry) began coming up with alternatives like Google Percolator  or Google Dremel (both papers were published in 2010, Percolator was introduced at that year, and dremel has been used in Google since 2006).

So now, it is 2014, and it is time for the rest of us to catch up with Google and get over Map/Reduce and  for multiple reasons:

  • end-users’ expectations (who hear “big data” but interpret that as  “fast data”)
  • iterative problems like graph algorithms which are inefficient as you need to load and reload the data each iteration
  • continuous ingestion of data (increments coming on as small batches or streams of events) – where joining to existing data can be expensive
  • real-time problems – both queries and processing

In my opinion, Map/Reduce is an idea whose time has come and gone – it won’t die in a day or a year, there is still a lot of working systems that use it and the alternatives are still maturing. I do think, however, that if you need to write or implement something new that would build on map/reduce – you should use other option or at the very least carefully consider them.

So how is this change going to happen ?  Luckily, Hadoop has recently adopted YARN (you can see my presentation on it here), which opens up the possibilities to go beyond map/reduce without changing everything … even though in effect,  a lot  will change. Note that some of the new options do have migration paths and also we still retain the access to all that “big data” we have in Hadoopm as well as the extended reuse of some of the ecosystem.

The first type of effort to replace map/reduce is to actually subsume it by offering more flexible batch. After all saying Map/reduce is not relevant, deosn’t mean that batch processing is not relevant. It does mean that there’s a need to more complex processes. There are two main candidates here  Tez and Spark where Tez offers a nice migration path as it is replacing map/reduce as the execution engine for both Pig and Hive and Spark has a compelling offer by combining Batch and Stream processing (more on this later) in a single engine.

The second type of effort or processing capability that will help kill map/reduce is MPP databases on Hadoop. Like the “flexible batch” approach mentioned above, this is replacing a functionality that map/reduce was used for – unleashing the data already processed and stored in Hadoop.  The idea here is twofold

  • To provide fast query capabilities* – by using specialized columnar data format and database engines deployed as daemons on the cluster
  • To provide rich query capabilities – by supporting more and more of the SQL standard and enriching it with analytics capabilities (e.g. via MADlib)

Efforts in this arena include Impala from Cloudera, Hawq from Pivotal (which is essentially greenplum over HDFS), startups like Hadapt or even Actian trying to leverage their ParAccel acquisition with the recently announced Actian Vector . Hive is somewhere in the middle relying on Tez on one hand and using vectorization and columnar format (Orc)  on the other

The Third type of processing that will help dethrone Map/Reduce is Stream processing. Unlike the two previous types of effort this is covering a ground the map/reduce can’t cover, even inefficiently. Stream processing is about  handling continuous flow of new data (e.g. events) and processing them  (enriching, aggregating, etc.) them in seconds or less.  The two major contenders in the Hadoop arena seem to be Spark Streaming and Storm though, of course, there are several other commercial and open source platforms that handle this type of processing as well.

In summary – Map/Reduce is great. It has served us (as an industry) for a decade but it is now time to move on and bring the richer processing capabilities we have elsewhere to solve our big data problems as well.

Last note  - I focused on Hadoop in this post even thought there are several other platforms and tools around. I think that regardless if Hadoop is the best platform it is the one becoming the de-facto standard for big data (remember betamax vs VHS?)

One really, really last note – if you read up to here, and you are a developer living in Israel, and you happen to be looking for a job –  I am looking for another developer to join my Technology Research team @ Amdocs. If you’re interested drop me a note: arnon.rotemgaloz at amdocs dot com or via my twitter/linkedin profiles


*esp. in regard to analytical queries – operational SQL on hadoop with efforts like  Phoenix ,IBM’s BigSQL or Splice Machine are also happening but that’s another story

illustration idea found in James Mickens’s talk in Monitorama 2014 –  (which is, by the way, a really funny presentation – go watch it) -ohh yeah… and pulp fiction :)

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:

Published at DZone with permission of Arnon Rotem-gal-oz, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}