This article was co-written by Srinath Perera
With its Google pedigree, MapReduce has had a far-reaching impact on the computing industry. It is built on the simple concept of mapping (i.e. filtering and sorting) and then reducing data (i.e. running a formula for summarization), but the true value of MapReduce lies in its ability to run these processes in parallel on commodity servers while balancing disk, CPU, and I/O evenly across each node in a computing cluster. When used alongside a distributed storage architecture, this horizontally scalable system is cheap enough for a fledgling startup. It is also a cost-effective alternative for large organizations that were previously forced to use expensive high-performance computing methods and complicated tools such as MPI (the Message Passing Interface library). With MapReduce, companies no longer need to delete old logs that are rife with insights, or dump them onto unmanageable tape storage, before they’ve had a chance to analyze them.
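To make the map-then-reduce idea concrete, here is a minimal word-count sketch in plain Python. The function names and the single-process shuffle step are illustrative only; a real Hadoop job distributes these same phases across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit one ("word", 1) key-value pair per word.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: summarize each key's values; here, a simple sum.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["the quick fox", "the lazy dog"])))
print(counts["the"])  # → 2
```

In Hadoop, the map and reduce functions run on many nodes at once, and the shuffle is the network-heavy step that moves each key's values to a single reducer.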
Hadoop Takes Over
Today, the Apache Hadoop project is the most widely used implementation of MapReduce. It handles all the details required to scale MapReduce operations. The industry support and community contributions have been so strong over the years that Hadoop has become a fully-featured, extensible data-processing platform. There are scores of other open source projects designed specifically to work with Hadoop. Apache Pig and Cascading, for instance, provide high-level languages and abstractions for data manipulation. Apache Hive provides a data warehouse on top of Hadoop.
As the Hadoop ecosystem left the competition behind, companies like Microsoft, which had been trying to build their own MapReduce platforms, eventually gave up and decided to support Hadoop under the pressure of customer demand. Other tech powerhouses like Netflix, LinkedIn, Facebook, and Yahoo (where the project originated) have been using Hadoop for years. A newer industrial Hadoop user, TrueCar, recently reported a cost of $0.23 per GB with Hadoop, down from the $19 per GB it was paying before. Smaller shops looking to keep costs even lower have tried to run virtual Hadoop instances. However, virtualizing Hadoop is the subject of some controversy among Hadoop vendors and architects, and the cost and performance of virtualized Hadoop are fiercely debated.
Hadoop’s strengths are most clearly visible in use cases such as clickstream and server log analytics. Financial risk scoring, sensor-based mechanical failure prediction, and vehicle fleet route analysis are just some of the other areas where Hadoop is making an impact. With some of these industries facing 60- to 90-day limits on data retention, Hadoop is unlocking insights that were once extremely difficult to obtain in time. If an organization is allowed to store data longer, the Hadoop Distributed File System (HDFS) can save data in its raw, unstructured form while it waits to be processed, much like the NoSQL databases that have broadened our options for managing massive data.
Where MapReduce Falls Short
There are a lot of things that MapReduce handles well: any type of counting, basic statistics (min, max, standard deviation, correlation, histograms, grouping), relational operations (except joins), and basic set operations, to name a few. But MapReduce, like any technology, is not a panacea for every use case. At this year’s Google I/O conference, Google’s Urs Hölzle said, “we don’t really use MapReduce anymore.” This is partially because Google is dealing with data on a scale that few of us will ever encounter, but most developers would agree with at least some of Hölzle’s criticisms of MapReduce. It handles batch processing well, but it is not suited to rapid data ingestion or streaming analysis. Here are some other areas where it falls short:
It usually doesn’t make sense to use Hadoop and MapReduce if you’re not dealing with large datasets like high-traffic web logs or clickstreams.
Joining two large datasets with complex conditions—a problem that has baffled database people for decades—is also difficult for MapReduce.
If the map phase generates too many keys (e.g. when taking the cross product of two datasets), the mapping phase will take a very long time.
If processing is highly stateful (e.g. evaluating a state machine), MapReduce won’t be as efficient.
As the software industry starts to encounter these harder use cases, MapReduce will not always be the right tool for the job, but Hadoop might be.
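To see why joins strain the model, consider a reduce-side join sketched in plain Python. This is an illustration of the pattern, not a Hadoop API: every record from both datasets must be re-keyed and shuffled on the join key before any matching can happen, which is why the shuffle becomes the bottleneck when both sides are large:

```python
from collections import defaultdict
from itertools import product

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

# Map: tag each record with its source and emit it under the join key.
tagged = [(uid, ("user", name)) for uid, name in users] + \
         [(uid, ("order", item)) for uid, item in orders]

# Shuffle: group everything by join key — in a real cluster this is
# the expensive network step, moving both full datasets around.
groups = defaultdict(list)
for key, records in ((k, v) for k, v in tagged):
    groups[key].append(records)

# Reduce: for each key, cross-join the user records with the order records.
joined = []
for key, records in groups.items():
    names = [v for tag, v in records if tag == "user"]
    items = [v for tag, v in records if tag == "order"]
    joined.extend((key, n, i) for n, i in product(names, items))

print(sorted(joined))
# → [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

With complex join conditions the reducer can no longer rely on a single shared key, and the simple shuffle-on-key trick above stops working, which is where MapReduce struggles.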
Long before Google dropped MapReduce, software vendors and communities were building new technologies to address some of the shortcomings described above. The Hadoop project made significant changes just last year and now has a cluster resource management platform called YARN that allows developers to run many other non-MapReduce technologies on top of it. The Hadoop project contributors were already thinking about a resource manager for Hadoop back in early 2008.
With YARN, developers can run a variety of jobs inside a YARN container. YARN schedules the container as a whole rather than the individual jobs inside it, and the code in that container can be any normal programming construct, so MapReduce is just one of many application types that Hadoop can harness. Even the MPI library from the pre-MapReduce days can run on Hadoop. The number of products and projects that the YARN ecosystem enables is too large to list here, but this table will give you an idea of the wide-ranging capabilities YARN can support:
3 Ways to Start Using YARN
Below are three basic options for using YARN (though not the only options). The complexity decreases as you go down the list, but so does the granular control over the application:
Directly code a YARN application master to create a YARN application. This will give you more control over the behavior of the application, but it will be the most challenging to program.
Use Apache Slider, which provides a client to submit JAR files for launching work on YARN-based clusters. Slider provides the least programmatic control of these three options, but it also has the lowest cost of entry for trying out new code on YARN because it provides a ready-to-use application master.
For organizations migrating from Hadoop 1.x (pre-YARN) to Hadoop 2, the migration shouldn’t be too difficult, since the APIs are compatible between the two versions. Most legacy code should just work, though in certain cases custom source code may simply need to be recompiled against the newer Hadoop 2 JARs.
As you saw in the table, there are plenty of technologies that take full advantage of the YARN model to expand Hadoop’s analysis capabilities far beyond the limits of the original Hadoop. Apache Tez greatly improves Hive query times. Cloudera’s Impala project is a massively parallel processing (MPP) SQL query engine. And then there’s Apache Spark, which is close to doubling its contributors in less than a year.
Apache Spark Starts a Fire
Spark runs natively on YARN (it can also run standalone or on Mesos). In addition to supporting MapReduce-style computation, Spark lets you point to a large dataset and define a virtual variable to represent it. Then you can apply functions to each element in the dataset and create a new dataset, so you can pick the right functions for the right kinds of data manipulation. But that’s not even the best part.
The real power of Spark comes from performing operations on top of these virtual variables. Virtual variables enable data-flow optimization from one execution step to the next, and they can optimize common data-processing patterns such as cascading tasks and iterations.
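The optimization opportunity comes from laziness: nothing is computed until a result is actually demanded, so a chain of transformations can be fused into a single pass over the data. A rough analogy in plain Python, with generators standing in for Spark's virtual variables (this is an illustration of the idea, not Spark's API):

```python
data = range(1, 11)

# Each "transformation" builds a new lazy dataset; no work happens yet.
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# Only the "action" at the end forces evaluation, and the two
# transformations above run as one fused pass over the data.
total = sum(evens)
print(total)  # → 220
```

Because Spark sees the whole chain of transformations before executing anything, it can keep intermediate results in memory across iterations instead of writing them to disk after every step, as a sequence of MapReduce jobs would.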
For streaming workloads, Spark Streaming uses a technique called “micro-batching,” while Storm uses an event-driven system to analyze data.
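The contrast can be sketched with a toy micro-batcher in plain Python (illustrative only): incoming events are buffered into small bounded batches, and each batch is then processed with ordinary batch logic, whereas an event-driven system like Storm invokes processing per event as it arrives:

```python
def micro_batches(events, batch_size):
    # Buffer the incoming stream into fixed-size batches;
    # real micro-batching systems typically bound batches by time.
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

stream = iter(range(7))
batches = list(micro_batches(stream, batch_size=3))
print(batches)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Micro-batching trades a small amount of latency (events wait for their batch to fill or time out) for the throughput and fault-tolerance machinery of batch processing.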
Just One Tool in the Toolbox
MapReduce’s main strength is simplicity. When it first emerged in the software industry, it was widely adopted and soon became synonymous with Big Data, along with Hadoop. Hadoop is still the toolbox most commonly associated with Big Data, but now more organizations are realizing that MapReduce is not always the best tool in the box.