
MapReduce's Founding Documents

By Mike Miller · December 4, 2012

MapReduce is an incredibly powerful algorithm, especially when used to process large amounts of data on distributed systems of commodity hardware. It makes data processing in big, distributed, fault-prone systems approachable for the typical developer. Its recent renaissance was one of the inspirations for the Cloudant product line; in fact, it helped inspire the creation of the company itself.

MapReduce has also come under recent scrutiny. There are multiple implementations with specific strengths and weaknesses, and it is not the best tool for all jobs, a point on which I've been outspoken. Applied judiciously, however, it is an incredibly powerful tool. It is also an extremely simple concept and a fantastic first step into the world of distributed computing. I am therefore surprised at how few people have read a small selection of enlightening publications on MapReduce. Below is a subjective selection that I have found particularly informative, along with the main concepts I drew from each.
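
For anyone who hasn't seen it, the entire concept fits in a few lines. Here is a toy, single-process sketch (mine, not from any of the papers below): you supply a map function and a reduce function, and the framework owns grouping, distribution, and failure recovery.

```python
from collections import defaultdict

# A toy, single-process MapReduce "framework": the user supplies only
# map_fn and reduce_fn; grouping by key (the shuffle) happens here.
# Real implementations distribute these phases across machines.
def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):      # map phase
            groups[key].append(value)          # shuffle: group by key
    return {key: reduce_fn(key, values)        # reduce phase
            for key, values in groups.items()}

# The canonical word count: "How do you want to process a single
# line of this file?"
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(map_reduce(lines, map_fn, reduce_fn))  # {'the': 2, 'quick': 1, ...}
```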

Discovery and promise

  • Google MapReduce. Jeff Dean and Sanjay Ghemawat did the world a tremendous service by publishing the details of Google's implementation of MapReduce for data-parallel batch processing at the largest of scales. The major advances of their approach were an elegant, simple API for developers ("How do you want to process a single line of this file?") and hiding all of the workflow management, job scheduling, and failure recovery from users. The most beautiful portion of the paper, however, is the example application on page 13, where they embed a MapReduce workflow inside a sequential C program. That one example made the separation between sequential programming and parallel processing elegantly obvious, along with the sheer thrill of a few lines of code crunching petabytes of data with ease. Finally, the paper presented actual operational data that cemented the critical role MapReduce played in Google's stack.

  • Google Sawzall. Without going into too much detail, Sawzall was the first paper I found that demonstrated the need for declarative languages to wrap Google's MapReduce implementation. It was my first hint that SQL-like languages such as Pig and Hive were just around the corner. This work also foreshadowed Google's use of power-tool names for all future transformative work on big-data query engines.

  • Yahoo map-reduce-merge. This paper provided the first proof that multi-step MapReduce workflows (specifically map-reduce-merge) are relationally complete: SQL's relational algebra can be expressed in a manner that is "load-balanced, scalable, and parallel." The paper went on to provide sample implementations of those algorithms (a toy sketch of the classic reduce-side join follows this list). This work made Pig, Hive, and the other SQL layers over MapReduce (and, by that time, Hadoop) inevitable.
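
To see why relational completeness matters in practice, here is a toy, single-process sketch of the standard reduce-side equi-join (my illustration with hypothetical tables, not code from the paper): the map phase tags each row with its relation of origin, and the reduce phase pairs up rows that share the join key.

```python
from collections import defaultdict

# Hypothetical relations: users(user_id, name), orders(order_id, user_id, total)
users  = [(1, "alice"), (2, "bob")]
orders = [(101, 1, 9.99), (102, 1, 5.00), (103, 2, 3.50)]

def map_users(row):
    user_id, name = row
    yield user_id, ("users", name)             # tag with source relation

def map_orders(row):
    order_id, user_id, total = row
    yield user_id, ("orders", (order_id, total))

def reduce_join(key, tagged_values):
    names = [v for tag, v in tagged_values if tag == "users"]
    rows  = [v for tag, v in tagged_values if tag == "orders"]
    for name in names:                         # pair rows sharing the key
        for order_id, total in rows:
            yield (name, order_id, total)

groups = defaultdict(list)                     # the shuffle, in miniature
for row in users:
    for k, v in map_users(row):
        groups[k].append(v)
for row in orders:
    for k, v in map_orders(row):
        groups[k].append(v)

joined = [r for k, vs in groups.items() for r in reduce_join(k, vs)]
print(joined)  # [('alice', 101, 9.99), ('alice', 102, 5.0), ('bob', 103, 3.5)]
```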

Understanding and quantifying limitations

  • Graph Twiddling in a MapReduce World. This transformative (but rarely cited) publication from the NSA's Jonathan Cohen addresses the applicability and limitations of generic MapReduce algorithms for processing graph data. Specifically, it examines common graph algorithms (e.g., calculating the out-degree of every node, enumerating triangles). The work establishes existence proofs and example implementations for several common algorithms, but it also notes that multiple passes of MapReduce processing are often required and that in some cases the entire graph must be copied forward through successive steps (see the sketch after this list). While possible, graph processing in MapReduce can therefore be inefficient, depending on the algorithm in question.

  • Google Percolator. I've addressed this in detail elsewhere, but this publication established the latency limitations of Google's batch-oriented MapReduce framework (and therefore of Hadoop as well). Specifically, the work introduces the Percolator framework for incremental processing of new documents as they are added to a corpus. It discusses specific implementation details (including the addition of transactional semantics to Bigtable) and provides compelling operational evidence that incremental approaches are superior for efficient analysis of continuously growing or changing data sets. Google writes, "Converting the indexing system to an incremental system...reduced the average document processing latency by a factor of 100."

  • Google Dremel. If you read nothing else, read this. This paper describes the core of an astounding alternative for querying vast amounts of data with incredibly low latency. Dremel is the heart of the new BigQuery product, promising analysis of trillions of rows of data on thousands of servers in seconds. The work convincingly argues that Google's MapReduce implementation suffers from two key drawbacks: (i) the significant processing latency inevitable in a batch-oriented implementation and (ii) the lack of a declarative query language. Dremel is simply bad-ass. The pragmatic move to column-oriented storage, query expansion over serving trees, and a SQL-like query language form the core of an absolutely revolutionary product. It is no surprise that the open-source and enterprise camps are both jealous, and we will see how efficiently MapR Technologies can bring an open-source clone (Apache Drill) from concept to reality, and how that will compare to Cloudera's recently announced Impala project.
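
To make the graph point concrete, here is a toy, single-process sketch (my illustration, not Cohen's code) of the easiest case: computing the out-degree of every node takes a single MapReduce pass over the edge list. Harder algorithms such as triangle enumeration chain several such passes and must re-emit the edge list into each successive step, which is exactly the inefficiency noted above.

```python
from collections import defaultdict

# A graph represented as an edge list of (source, destination) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c")]

def map_fn(edge):
    source, _dest = edge
    yield source, 1          # one "vote" per outgoing edge

def reduce_fn(node, ones):
    return sum(ones)         # out-degree = number of outgoing edges

groups = defaultdict(list)   # the shuffle, in miniature
for edge in edges:
    for key, value in map_fn(edge):
        groups[key].append(value)

out_degree = {node: reduce_fn(node, vs) for node, vs in groups.items()}
print(out_degree)  # {'a': 2, 'b': 1}
```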

MapReduce at Cloudant

MapReduce is incredibly powerful. At Cloudant we tend to take it for granted, but it can vastly simplify application stacks by bringing parallel computation into the database layer. Our implementation is significantly different from those of Google, Hadoop, Mongo, and Riak. Like Google's and Hadoop's, ours is completely parallel: execution happens on each node in the cluster that holds a portion of the input dataset. Unlike all of the above, however, our implementation is incremental. It is optimized not for blazing speed on a first pass through the data but for efficiently computing over only new or modified documents, addressing the exact latency issues exposed in the Percolator paper.
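
As a rough illustration of the incremental idea, here is a toy sketch (mine, assuming a sum-style reduce, and nothing like our actual engine): rather than rescanning the corpus, the engine retracts a changed document's old contribution and applies its new one.

```python
# Persisted key -> aggregate; Cloudant persists the analogous structure
# in a B+tree on disk rather than an in-memory dict.
index = {}

def apply_change(index, old_pairs, new_pairs):
    """Update aggregates given one doc's old and new (key, value) map output."""
    for key, value in old_pairs:
        index[key] -= value              # retract the stale contribution
        if index[key] == 0:
            del index[key]
    for key, value in new_pairs:
        index[key] = index.get(key, 0) + value

# First write of a doc: nothing to retract.
apply_change(index, [], [("clicks", 3)])
# The doc is later updated: retract old output, apply new.
apply_change(index, [("clicks", 3)], [("clicks", 5)])
print(index)  # {'clicks': 5}
```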

Next, our implementation doesn't just create key/value rows stored on a file system. The output of Cloudant MapReduce is itself a B+tree persisted on disk. Cloudant MapReduce is the engine by which we let users efficiently process data and create secondary indexes for blazingly fast queries; it is the logical equivalent of CREATE INDEX in SQL.
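
Concretely, a Cloudant secondary index is declared as a design document holding a JavaScript map function plus a reduce (here the built-in _sum). The sketch below uses placeholder account, database, and credentials:

```python
import requests

# Placeholders: substitute your own Cloudant account, database, and auth.
BASE = "https://ACCOUNT.cloudant.com/mydb"
AUTH = ("user", "password")

# The design document defines the view; the map function is JavaScript.
design_doc = {
    "views": {
        "clicks_by_page": {
            "map": "function (doc) { if (doc.page) emit(doc.page, doc.clicks); }",
            "reduce": "_sum",
        }
    }
}
requests.put(f"{BASE}/_design/stats", json=design_doc, auth=AUTH)

# Querying the view reads the persisted B+tree -- the CREATE INDEX
# analogue -- and results stay current incrementally as documents change.
resp = requests.get(f"{BASE}/_design/stats/_view/clicks_by_page",
                    params={"group": "true"}, auth=AUTH)
print(resp.json())
```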

Cloudant's MapReduce is chainable, in contrast to the in-database implementations of Mongo and Riak. That primarily means it is capable of efficiently pre-computing certain types of JOIN operations on huge amounts of transactional data. Finally, unlike Dremel, we don't yet provide a declarative query language. The map-reduce-merge paper, along with the compilers from Pig and Hive, provides a clear recipe for compiling most SQL queries into efficient MapReduce code, and this is a common request from new Cloudant users. If you are interested in how such feedback impacts our product roadmap, tune into Alan Hoffman's webcast on Thursday, November 8.

Published at DZone with permission of Mike Miller.