Google’s seminal paper on Map-Reduce  was the trigger that led to lot of developments in the big data space. Though the Map-Reduce paradigm was known in functional programming literature, the paper provided scalable implementations of the paradigm on a cluster of nodes. The paper, along with Apache Hadoop, the open source implementation of the MR paradigm, enabled end users to process large data-sets on a cluster of nodes – a usability paradigm shift. Hadoop, which comprises the MR implementation along with the Hadoop Distributed File System (HDFS), has now become the de-facto standard for data processing, with a lot of Industrial game-changers such as Disney, Sears, Walmart, and AT&T having their own Hadoop cluster installations.
Hadoop is no doubt very useful for a number of use cases, especially those where the data can be split into independent chunks and certain computations need to run on the chunks and aggregated for a final result. This is appropriate for the Map-Reduce (MR) paradigm. It allows the computations to be parallelized and near-linear speed-ups to be obtained across a cluster of nodes. There are a number of cases where Hadoop may not be appropriate – this has been highlighted by, among others, Vincent Granville (http://www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do). To summarize, the cases where MR is not appropriate are those where data cannot be partitioned into independent chunks – either computation spans the chunks or there needs to be communication for intermediate results to be exchanged. Moreover, Hadoop is also not well suited for realizing iterative Machine Learning (ML) algorithms such as the kernel support vector machines, multivariate logistic regression, etc. This is reflected clearly in Mahout (the open source machine learning library written over Hadoop), which has only sequential implementation of some iterative algorithms. This point has also been reinforced by several others – see Prof. Srirama’s paper , for instance, here. This paper outlines that Hadoop is suited for simpler iterative algorithms where the algorithm can be expressed as a single execution of an MR model or a sequential execution of constant MR models. Hadoop is not well suited in cases where the algorithm can only be expressed in a way that each iteration is a single MR model or each iteration comprises multiple MR models – conjugate gradient descent, for instance, would be in the last category. Mapscale (http://www.cs.ucsb.edu/~cgb/mapscale.html) also shows that Hadoop is not well suited for iterative algorithms such as conjugate gradient descent, block tridiagonal and fast fourier transforms.
While an adversary may argue that Hadoop Yarn mitigates this to an extent by supporting non MR type workloads, it may not have any new constructs to realize these use cases – especially the ones on iterative machine learning algorithms. Yarn could be seen as providing efficient cluster management utilities including scheduling, pretty similar to what Mesos does.
2. Beyond Hadoop
What then are the alternatives to realize some of the complex iterative algorithms that may not be amenable to be implemented efficiently over Hadoop? A table outlining some of the work in this space is given in table 1 as an image.
The above is a three generational view of paradigms/tools to realize machine learning algorithms. The first generation is the traditional tools/paradigms such as SAS/R which are typically only vertically scalable. The efforts to engineer R to work over Hadoop such as Revolution R or R-Hadoop as well as the SAS in-memory analytics suite over Hadoop would fall into the second generation. The third generational paradigms are those that go beyond Hadoop. Again, efforts by SAS to go beyond Hadoop would fall into this category. Researchers at the University of Berkeley have proposed “Spark”  as one such alternative – in other words, Spark could be seen as the next generation data processing alternative to Hadoop in the Big-data space. The key idea distinguishing Spark is its in-memory computation, allowing data to be cached in memory across iterations/interactions. The Berkeley researchers have proposed Berkeley Data Analytics (BDA) stack as a collection of technologies that help in running data analytics tasks across a cluster of nodes. The lowest level component of the BDA is Mesos, the cluster manager which helps in task allocation and resource management tasks of the cluster. The second component is the Tachyon file system built on top of Mesos. Tachyon provides a distributed file system abstraction and provides interfaces for file operations across the cluster. Spark, the computation paradigm is realized over Tachyon and Mesos in a specific embodiment though it could be realized without Tachyon and even without Mesos for clustering. Shark, which is realized over Spark, provides an SQL abstraction over a cluster – similar to the abstraction Hive provides over Hadoop.
The other important paradigm that has looked beyond Hadoop Map-Reduce is graph processing, exemplified by the Pregel effort from Google. Pregel is a Bulk Synchronous Processing (BSP) paradigm where user defined compute functions can be spawned on the nodes of the graph, with edges used for communication. This provides a deterministic computation framework. Apache Giraph is an open source implementation of Pregel. The other projects which are similar to Pregel are, Golden orb, Stanford GPS. GraphX is the other system with specific focus on graph construction and transformations. While Pregel is good at graph parallel abstraction, easy to reason with and ensures deterministic computation, it leaves it to the user to architect the movement of data. Further, like all BSP systems, it also suffers from the curse of the slow jobs – meaning that even a single slow job (which could be due to load fluctuations or other reasons) can slow down the whole computation. To alleviate some of these issues, the CMU researchers came up with GraphLab. GraphLab, as well as its subsequent version known as Powergraph, constitute the state of the art in graph processing and are especially well suited to power law graphs. Another interesting effort in this space is the Graph Search work from Facebook, who are building search over what they term as an entity graph.
Another interesting effort in the third generation paradigms comes from Twitter, who built the Storm framework for real-time complex event processing. We have built several machine learning algorithms over Storm in order to perform real-time analytics. Interesting alternatives to Storm include the Akka project, which is based on the Actors model of concurrent programming. The S4 system from Yahoo also falls in the category. The Dempsy system from Nokia is also comparable to Storm. While Storm has kicked up a literal storm in the real-time analytics/computing world, the others need to catch up. However, Akka is promising due to its ability to deliver stream composition, failure recovery and two-way communication.
3. Performance and Real-life Use Cases
We have realized several ML algorithms over Spark and compared some of them to ML realizations of Mahout over Hadoop. We have made detailed comparisons on a custom cluster at Impetus and have found that the while the logistic regression algorithm over Hadoop (from Mahout) took nearly 250 seconds to run for a million records, Spark was able to run the logistic regression algorithm for a million records on the same cluster in only 30 seconds. This is in line with the performance studies reported in the AmpLab site here: http://spark.incubator.apache.org/index.html.
There are several similar real-life use cases for Spark as well as Storm and GraphLab, some of the third generation paradigms we have focused on. For example, Ooyala, a video analytics startup, uses Cassandra to store millions of video events generated every day and use Spark over it to run C* OLAP aggregate queries. The same queries took nearly 130 seconds on Cassandra natively (took only 20-30 seconds on a warmed cache), but less than 1 second using Spark! Quantifind is another start-up that uses Spark in production. They use Spark to allow video companies to predict success of new releases – they have been able to move from running ML in hours over Hadoop to running it in seconds by using Spark. Conviva, a start-up uses Spark to run repeated queries on video data and found that it was nearly 30 times faster than Hive. Yahoo is also using Spark to build algorithmic learning for advertisement targeting for a new paradigm they name as continuous computing.
There are also several use cases for Storm listed in the Storm page. Umbrella security labs provides a case for using GraphLab to develop and test ML models quickly over complete data sets. They have implemented a page rank algorithm over a large graph and ran it using GraphLab in about 2 minutes.
4. Concluding Remarks
This tutorial has set the tone for big data analytics thinking beyond Hadoop by discussing the limitations of Hadoop. It has also brought out the three dimensions along which thinking beyond Hadoop is necessary:
- Real-time analytics – Storm/Spark streaming are the choices.
- Analytics involving iterative machine learning: Spark is the technology of choice.
- Specialized data structures and processing requirements for these: GraphLab is an important paradigm to process large graphs.
 Ghemawat, J. D. (2008, January). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107-113.
 Matei Zaharia, M. C. (2012). Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. USENIX Conference on Networked Systems Design and Implementation (pp. 2-2). Berkeley: USENIX.
 Satish Narayana Srirama, P. J. (2012). Adapting scientific computing problems to clouds using MapReduce. Future Generation Computer Systems, 184-192.
 Yucheng Low, D. B. (2012). Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8. (pp. 716-727.). VLDB.