Developing applications in native MapReduce can be time-consuming and daunting, a task reserved for experienced programmers.
Fortunately, a number of frameworks make the process of implementing distributed computation on a Hadoop cluster easier and quicker, even for non-developers. The most popular ones are Hive and Pig.
Hive provides a SQL-like language, called HiveQL, for easier analysis of data in a Hadoop cluster. When using Hive, our datasets in HDFS are represented as tables with rows and columns. Hive is therefore easy to learn and appealing to those who already know SQL and have experience working with relational databases.
That said, Hive can be considered a data warehouse infrastructure built on top of Hadoop.
A Hive query is translated into a series of MapReduce jobs (or a Tez directed acyclic graph) that are subsequently executed on a Hadoop cluster.
Let's process a dataset about songs listened to by users over a given period. The input data consists of a tab-separated file, songs.txt:
"Creep" Radiohead piotr 2014-07-20
"Desert Rose" Sting adam 2014-07-14
"Desert Rose" Sting piotr 2014-06-10
"Karma Police" Radiohead adam 2014-07-23
"Everybody" Madonna piotr 2014-07-01
"Stupid Car" Radiohead adam 2014-07-18
"All This Time" Sting adam 2014-07-13
We use Hive to find the two most popular artists in July 2014:
Note: We assume that the commands below are executed as user "training".
Put the songs.txt file on HDFS:
# hdfs dfs -mkdir songs
# hdfs dfs -put songs.txt songs/
Create an external table in Hive that gives a schema to our data on HDFS:
hive> CREATE EXTERNAL TABLE songs(
    title STRING,
    artist STRING,
    user STRING,
    date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/training/songs';
Check if the table was created successfully:
hive> SHOW tables;
You can also see the table's properties and columns. Apart from information about column names and types, other interesting properties are displayed:
hive> DESCRIBE FORMATTED songs;
# Detailed Table Information
CreateTime: Tue Jul 29 14:08:49 PDT 2014
Protect Mode: None
Table Type: EXTERNAL_TABLE
Run a query that finds the two most popular artists in July 2014:
hive> SELECT artist, COUNT(*) AS total
FROM songs
WHERE year(date) = 2014 AND month(date) = 7
GROUP BY artist
ORDER BY total DESC
LIMIT 2;
This query is translated into two MapReduce jobs. You can verify this by reading the log messages that the Hive client prints to standard output, or by tracking the jobs executed on the Hadoop cluster in the ResourceManager web UI.
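For intuition, the computation behind this query can be sketched in plain Python. This is a toy in-memory sketch with the rows of songs.txt hard-coded, not actual MapReduce code: the first phase mirrors the job that groups plays by artist and counts them, and the second mirrors the job that orders the counts.

```python
from collections import defaultdict

# Rows of songs.txt: (title, artist, user, date)
songs = [
    ("Creep", "Radiohead", "piotr", "2014-07-20"),
    ("Desert Rose", "Sting", "adam", "2014-07-14"),
    ("Desert Rose", "Sting", "piotr", "2014-06-10"),
    ("Karma Police", "Radiohead", "adam", "2014-07-23"),
    ("Everybody", "Madonna", "piotr", "2014-07-01"),
    ("Stupid Car", "Radiohead", "adam", "2014-07-18"),
    ("All This Time", "Sting", "adam", "2014-07-13"),
]

# Phase 1: keep only July 2014 plays and count them per artist
# (the group-and-aggregate job).
counts = defaultdict(int)
for title, artist, user, date in songs:
    if date.startswith("2014-07"):
        counts[artist] += 1

# Phase 2: order by total descending and keep the top two
# (the ordering job).
top2 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top2)  # [('Radiohead', 3), ('Sting', 2)]
```

Running the sketch shows the result the Hive query produces: Radiohead with three plays and Sting with two.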
Note: at the time of this writing, MapReduce was the default execution engine for Hive. This may change in the future. See the next section for instructions on setting a different execution engine for Hive.
Hive is not constrained to translating queries into MapReduce jobs only. You can also instruct Hive to express its queries using other distributed frameworks, such as Apache Tez.
Tez is an efficient framework that executes a computation as a DAG (directed acyclic graph) of tasks. With Tez, a complex Hive query can be expressed as a single Tez DAG rather than as multiple MapReduce jobs. This way we avoid the overhead of launching multiple jobs and the cost of storing intermediate data on HDFS between jobs, which saves I/O.
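The DAG-of-tasks execution model itself is easy to illustrate. The toy Python sketch below (hypothetical task names, not the Tez API) runs each task once all of its upstream tasks have finished, passing results between tasks in memory rather than writing them out between jobs:

```python
from graphlib import TopologicalSorter

# Each task: (names of upstream tasks, function of their results).
# The names and functions here are made up for illustration.
dag = {
    "load":   ([], lambda: [3, 1, 2]),
    "filter": (["load"], lambda xs: [x for x in xs if x > 1]),
    "sort":   (["filter"], lambda xs: sorted(xs, reverse=True)),
}

# Execute tasks in topological order: every task runs after its
# dependencies, and intermediate results stay in memory.
results = {}
order = TopologicalSorter({name: deps for name, (deps, _) in dag.items()})
for name in order.static_order():
    deps, fn = dag[name]
    results[name] = fn(*(results[d] for d in deps))

print(results["sort"])  # [3, 2]
```

A real Tez DAG is of course far more sophisticated (distributed tasks, data movement between vertices), but the scheduling idea is the same: one job, many dependent tasks.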
To benefit from Tez's fast response times, simply override the hive.execution.engine property and set it to tez.
Follow these steps to execute the Hive query from the previous section as a Tez application:
- Enter the Hive shell:
# hive
- Set the execution engine to tez:
hive> SET hive.execution.engine=tez;
- Execute the query from the Hive section:
Note: you will now see different logs on the console than when executing the query with MapReduce:
Total Jobs = 1
Launching Job 1 out of 1
Status: Running application id: application_123123_0001
Map 1: -/- Reducer 2: 0/1 Reducer 3: 0/1
Map 1: 0/1 Reducer 2: 0/1 Reducer 3: 0/1
Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 1/1
Status: Finished successfully
The query is now executed as a single Tez job instead of two MapReduce jobs as before. Tez isn't tied to the strict MapReduce model - it can execute any sequence of tasks in a single job, for example Reduce tasks following Reduce tasks, which brings significant performance benefits.
Find out more about Tez on the blog: http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing.
Apache Pig is another popular framework for large-scale computations on Hadoop. Like Hive, Pig allows you to implement computations in an easier, faster, and less verbose way than native MapReduce. Pig introduces a simple yet powerful scripting-like language called PigLatin. PigLatin supports many common, ready-to-use data operations such as filtering, aggregating, sorting, and joining. Developers can also implement their own user-defined functions (UDFs) that extend Pig's core functionality.
Like Hive queries, Pig scripts are translated into MapReduce jobs that are scheduled to run on the Hadoop cluster.
We use Pig to find the two most popular artists, as we did in the previous Hive example.
- Save the following script in a file named top-artists.pig:
a = LOAD 'songs/songs.txt' AS (title, artist, user, date);
b = FILTER a BY date MATCHES '2014-07-.*';
c = GROUP b BY artist;
d = FOREACH c GENERATE group, COUNT(b) AS total;
e = ORDER d BY total DESC;
f = LIMIT e 2;
STORE f INTO 'top-artists-pig';
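Each PigLatin relation builds on the previous one, which makes the script easy to mirror step by step in plain Python. The sketch below is only an illustration of what each relation computes, with the rows of songs.txt hard-coded; it is not Pig itself:

```python
import re
from collections import defaultdict

# a = LOAD ... : rows of (title, artist, user, date)
a = [
    ("Creep", "Radiohead", "piotr", "2014-07-20"),
    ("Desert Rose", "Sting", "adam", "2014-07-14"),
    ("Desert Rose", "Sting", "piotr", "2014-06-10"),
    ("Karma Police", "Radiohead", "adam", "2014-07-23"),
    ("Everybody", "Madonna", "piotr", "2014-07-01"),
    ("Stupid Car", "Radiohead", "adam", "2014-07-18"),
    ("All This Time", "Sting", "adam", "2014-07-13"),
]
b = [row for row in a if re.match(r"2014-07-.*", row[3])]  # FILTER by date
c = defaultdict(list)                                      # GROUP by artist
for row in b:
    c[row[1]].append(row)
d = [(artist, len(rows)) for artist, rows in c.items()]    # FOREACH ... COUNT
e = sorted(d, key=lambda t: t[1], reverse=True)            # ORDER by total DESC
f = e[:2]                                                  # LIMIT 2
print(f)  # [('Radiohead', 3), ('Sting', 2)]
```

The final relation f matches the result of the Hive query from the previous section: Radiohead with three July plays and Sting with two.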
- Execute the Pig script on the Hadoop cluster:
# pig top-artists.pig
- Read the contents of the output directory:
# hdfs dfs -cat top-artists-pig/*
When developing Pig scripts, you can iterate in local mode and catch mistakes before submitting jobs to the cluster. To enable local mode, add the -x local option to the pig command:
# pig -x local top-artists.pig
Apache Hadoop is one of the most popular tools for big data processing thanks to features such as a high-level API, scalability, the ability to run on commodity hardware, fault tolerance, and its open-source nature. Hadoop has been successfully deployed in production by many companies for several years.
The Hadoop ecosystem offers a variety of open-source tools for collecting, storing, and processing data, as well as for cluster deployment, monitoring, and data security. Thanks to this amazing ecosystem of tools, any company can now store and process large amounts of data in a distributed and highly scalable way, easily and relatively cheaply.
This list contains names and short descriptions of the most useful and popular projects from the Hadoop ecosystem that have not been mentioned yet:
- Oozie: Workflow scheduler system to manage Hadoop jobs.
- ZooKeeper: Framework that enables highly reliable distributed coordination.
- Sqoop: Tool for efficient transfer of bulk data between Hadoop and structured datastores such as relational databases.
- Flume: Service for aggregating, collecting, and moving large amounts of log data.
- HBase: Non-relational, distributed database running on top of HDFS. It enables random, realtime read/write access to your Big Data.