Over a million developers have joined DZone.

Next Generation Hadoop: It's Not Just Batch!

DZone's Guide to

Next Generation Hadoop: It's Not Just Batch!

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

In my JavaOne talk last week, I presented changes that are happening in Hadoop: It’s shaking off it’s batch-based shackles and enabling a new Hadoop platform that can support a mix of processing systems, from stream-processing systems to NoSQL systems.

The slides for my talk can be viewed on Speaker Deck. The rest of this post is an overview of the technologies covered in my talk, along with links for further reading.


With Hadoop 2.x, we now have YARN, which acts as a distributed scheduler. This is a big step towards the vision of Hadoop being the Big Data Kernel, as it allows arbitrary applications to be scheduled on the same Hadoop cluster, and enables a new world where we can have silo’d applications coexisting on the same hardware and sharing the same storage.

The following links serve as a good starting ground to learn more about YARN:

Apache HBase

HBase is a NoSQL, distributed multi-dimensional map based on Google’s BigTable. It uses HDFS for persistence, which is a huge benefit if a key requirement of your NoSQL system is the ability to read and write data into HBase using MapReduce.

HBase on YARN (Hoya)

Hoya is a YARN application that allows multiple HBase clusters to coexist on a single Hadoop YARN cluster. This provides strong data/resource isolation properties, in conjunction with the ability to easily spin up, upsize/downsize and shutdown HBase clusters. Hoya was developed by Steve Loghran and friends over at Hortonworks.

Apache Accumulo

Accumulo is a BigTable implementation much like HBase. It also uses HDFS for storage, and currently has an edge in the security world due to its cell-level security. Although it should be noted that this is planned for HBase (see HBASE-6222).


ElephantDB is a read-only key-value store, which uses HDFS to load data, which is served in real-time. It’s a part of Nathan Marz’s Lambda Architecture and enables the rapid loading and serving of data produced in the batch tier.


Storm is a stream processing, continuous computation and distributed RPC system developed and open-sourced by Twitter. It allows you to perform near real-time calculations such as trending topics.

Storm on YARN

Yahoo uses Storm for a variety of use cases, and created the Storm-on-YARN so that they could run Storm on their YARN clusters. They also added the ability for Storm to read/write to secure HDFS.

Apache Samza

Samza (incubating) is a stream processing system that uses Kafka for messaging, and optionally YARN for resource management.


Morphlines is an ETL library from Cloudera that has implementations available for use within Flume, MapReduce and HBase. Using a modified JSON syntax, it allows you to create a pipeline of work which can fulfill use cases such as near real-time writes from Flume into Solr Cloud.

Apache Giraph

Giraph is a framework for performing offline batch processing of semi-structured graph data on a massive scale. It offers performance advantages over graph processing with MapReduce.


Impala from Cloudera is an implementation of Google’s paper on Dremel, and provides interactive SQL capabilities on top of data in HDFS and HBase.

Apache Drill

An (incubating) project that offers the promise of interactive SQL capabilities over data in HDFS, HBase, Cassandra, MongoDB and Splunk.


Parquet, a joint initiative from Cloudera and Twitter, is a columnar data format supporting nested data. It can offer space and time advantages over row-ordered data, especially with queries that return a subset of the overall columns. It supports a wide variety of tools (MapReduce, Impala, Pig and Hive) and is used in production by Twitter.

ORC File

ORC File is a columnar data format that also supports nested data. It is currently implemented within Hive 0.11.

Apache Tez

Tez (incubating) is a generalized DAG execution engine. The goal of the project is to remove disk barriers that exist with pipelined MapReduce jobs. The first goal of the project is to provide a MapReduce implementation using Tez, followed by Hive and Pig.

Apache Mesos

Mesos is a cluster manager, similar to YARN, providing resource sharing and isolation capabilities in a distributed cluster. It can support multiple instances and versions of Hadoop, Spark and other applications. It’s used in Twitter to manage various applications in production.

Lambda Architecture

The Lambda Architecture, an architectural blueprint from Nathan Marz, suggests that speed and batch layers should exist to play to their mutual strengths: the speed layer providing near real-time data aggregations, and the batch layer providing a mechanism to correct potential mistakes made in the speed layer.


Summingbird is a project out of Twitter which could be viewed as an implementation of the Lambda Architecture. It allows you to using a single API to define operations on distributed collections which can be mapped into MapReduce or Storm executions.

Apache Spark

Spark (incubating) is an in-memory distributed processing system which allows you to perform MapReduce, as well as iterative workloads over data. Spark and its family of associated projects (such as Spark Streaming, GraphX) offers a complete solution to most distributed processing use cases.

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.


Published at DZone with permission of Alex Holmes, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}