MapReduce's Founding Documents
MapReduce is an incredibly powerful algorithm, especially when used to process large amounts of data using distributed systems of commodity hardware. It makes data processing in big, distributed, fault prone systems approachable for the typical developer. Its recent renaissance was one of the inspirations for the Cloudant product line. In fact, it helped inspire the creation of the company itself. MapReduce has also come under recent scrutiny. There are multiple implementations with specific strengths and weaknesses. It is not the best tool for all jobs, and I've been outspoken on that exact issue. However, it is an incredibly powerful tool if applied judiciously. It is also an extremely simple concept and a fantastic first step into the world of distributed computing. I am therefore surprised at how few people have read a small selection of enlightening publications on MapReduce. I provide below a subjective selection that I have found particularly informative, as well as the main concepts I drew from each selection.
Discovery and promise
Google MapReduce. Jeff Dean and Sanjay Ghemawat did the world a tremendous service by publishing the details of Google's implementation of MapReduce for data parallel batch processing at the largest of scales. Major advances in their approach were providing an elegant and simple API for developers ("How do you want to process a single line of this file?") and hiding all of the workflow management, job scheduling, and failure recovery from the users. However, the most beautiful portion of the paper is the example application on page 13 where they embed a MapReduce workflow inside a sequential C program. That one example made elegantly obvious the separation between sequential programming and parallel processing, as well as the sheer thrill of a few lines of code that would crunch petabytes of data with ease. Finally, this paper also presented actual operational data that cemented the critical role MapReduce had in Google's stack.
Google Sawzall. Without going into too much detail, Sawzall was the first paper I found that demonstrated the need for declarative languages to wrap Google's MapReduce implementation. It was my first hint that SQL-like languages such as Pig and Hive were just around the corner. This work also foreshadowed Google's use of power tool names for all future transformative work on big data query engines.
Yahoo map-reduce-merge. This paper provided the first proof that multi-step MapReduce workflows (specifically map-reduce-merge) are relationally complete. SQLs relational algebras can be expressed in a manner that is "load-balanced, scalable, and parallel." The paper also went on to provide sample implementations of said algorithms. This work made Pig, Hive and other SQL layers over MapReduce (and by that time Hadoop) inevitable.
Understanding and quantifying limitations
Graph Twiddling in a MapReduce World. This transformative (but rarely cited) publication from the NSA's Jonathan Cohen addresses the applicability and limitations of generic MapReduce algorithms for processing graph data. Specifically, it examines common graph algorithms (e.g. calculating the out degree of all nodes, enumerating triangles, etc). This work establishes the existence proofs and example implementations of several common algorithms, but also notes that multiple passes of MapReduce processing are often required, and that in some cases the entire graph needs to be copied forward in successive steps. While possible, graph processing in mapreduce can be inefficient, depending on the algorithm in question.
Google Percolator. I've addressed this in detail elsewhere, but this publication established the latency limitations of Google's batch-oriented MapReduce framework (and therefore Hadoop as well). Specifically this work introduces the Percolator framework for incremental processing of new documents as they are added to a corpus. It discusses specific implementation (including addition of transactional semantics to Bigtable) and provides compelling operational evidence that incremental approaches are superior for efficient analysis of continuously growing/changing data sets. Google writes, "Converting the indexing system to an incremental system...reduced the average document processing latency by a factor of 100."
Google Dremel. If you read nothing else, read this. This paper describes the core of an astounding alternative for querying vast amounts of data with incredibly low latency. Dremel is the heart of the new BigQuery product, promising analysis of trillions of rows of data on thousands of servers in seconds. This work convincingly argues that Google's MapReduce implementation suffers from two key drawbacks: (i) the significant processing latency that is inevitable in their batch-oriented implementation and (ii) the lack of a declarative query language. Dremel is simply bad-ass. The pragmatic move to column oriented storage, query expansion and server trees and a SQL-like query language form the core of an absolutely revolutionary product. It is no surprise that the open source and enterprise camps are both jealous, and we will see how efficiently mapR technologies can bring an open source clone (Apache Drill) from concept to reality, and how that will compare to Cloudera's recently announced Impala project.
MapReduce at Cloudant
MapReduce is incredibly powerful. At Cloudant we tend to take it for granted, but it can vastly simplify application stacks by bringing parallel computation into the database layer. Our implementation is significantly different from those of Google, Hadoop, Mongo and Riak. Like Google and Hadoop, ours is completely parallel. Execution happens on each node in the cluster that holds portions of the input dataset. Unlike all of the above, however, our implementation is incremental. It is optimized not for blazing speed on the first pass through the data, but instead to efficiently compute only on new or modified documents, addressing the exact latency issues exposed in the Percolator paper.
Next, our implementation doesn't just create key/value rows that are stored on a file system. The output of Cloudant MapReduce is itself a b+tree that is persisted on disk. Cloudant mapreduce is the engine by which we allow users to efficiently process data and create secondary indexes for blazingly fast queries. It is the logical equivalent of
ALTER TABLE CREATE INDEX in SQL.
Cloudant's MapReduce is chainable, in contrast to the in-database implementations of Mongo and Riak. That primarily means that it is capable of efficient pre-computation of certain types of JOIN operations on huge amounts of transactional data. Finally, unlike Dremel, we don't yet provide a declarative query language. Clearly the map-reduce-merge paper, along with the compilers from Pig and Hive, provide a clear recipe to compile most SQL queries into efficient MapReduce code. This is a common request from new Cloudant users. If you are interested how such feedback impacts our product roadmap, tune intoAlan Hoffman's webcast on Thursday, November 8.