
Apache Hive vs. Apache Pig


Apache Hive is awesome for things like ACID transactions and BI queries, while Apache Pig is well-suited for procedural coding and MapReduce-style programming.


Apache Hive is great for ACID transactions, BI queries, and queries that feed other systems, and it serves as an ingress and egress point for Apache NiFi and for Spark SQL table queries. Hive now has stored procedures (via HPL/SQL) and, with Apache Hivemall, machine learning. Hive has many committers at work adding features for interactive querying, broader SQL support, enhanced transactions, and more speed. Using Apache ORC as the file format brings substantial speed gains. Hive integrates well with Apache Spark SQL and helps manage table structures. Hive is now a great choice not just for queries, table management, secure data access, and data access at scale, but also for transactions.
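
As a minimal sketch of the ACID side (the table name, columns, and bucket count here are hypothetical, and the cluster must have Hive's transaction manager enabled), a transactional table stored as ORC looks like this:

-- Hypothetical table; on Hive 1.x/2.x, ACID requires ORC, bucketing,
-- and a transaction manager configured on the cluster
CREATE TABLE web_events (
  event_id INT,
  user_id  INT,
  url      STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level updates work once the table is transactional
UPDATE web_events SET url = 'http://example.com' WHERE event_id = 1;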

Apache Pig is great for procedural coding of data access, summaries, and MapReduce-style programming. Apache Pig runs on Tez for performance, just like Apache Hive. With Hive View 2.0 and Apache Zeppelin on top, you also get fast queries and quick reporting.
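
To give a feel for that procedural style, here is a minimal sketch (the input file and its columns are hypothetical) that loads a CSV and builds a per-user summary:

-- Hypothetical CSV of page visits with columns: user,url
visits  = LOAD '/data/visits.csv' USING PigStorage(',') AS (user:chararray, url:chararray);
by_user = GROUP visits BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(visits) AS hits;
DUMP counts;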

Both tools run on modern Apache Hadoop 2.x and 3.x on top of Apache Tez and Apache YARN.

Both can be called via Apache NiFi and from the command line, and there are third-party libraries that enhance each tool.
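
For example, from the shell (the script path and HiveServer2 URL below are placeholders for your own):

# Run a Pig script from the command line
pig -f /opt/demo/pigscripts/test.pig

# Run a Hive query through Beeline against HiveServer2
beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES;"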

I recommend using both along with Apache NiFi.

The latest version of Pig, 0.17, is extremely stable. With Tez and YARN updates beneath it, it runs fast and handles certain ETL and data-processing jobs faster and more easily than most alternatives.

Starting My Hadoop Tools

NiFi can interface directly with Hive, HDFS, HBase, Flume, and Phoenix, and I can also trigger Spark and Flink through Kafka and site-to-site. Sometimes, I need to run some Pig scripts. Apache Pig is very stable and has a lot of functions and tools that make for smart processing. You can easily add it as one piece of a larger pipeline or as part of the process.

Pig Setup

I like to use Ambari to install the HDP 2.5 clients on my NiFi box to have access to all the tools I may need.

Then, I can just do:

yum install pig
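
A quick sanity check that the client landed on the box:

# Confirm the Pig client is installed and on the PATH
which pig
pig -version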

Pig to Apache NiFi 1.0.0

Executing the Process

From NiFi (for example, with the ExecuteProcess processor), we call a shell script that wraps the Pig script.

The output of the script is stored in HDFS; check it with hdfs dfs -ls /nifi-logs.

Shell script:

# Point Pig at the JDK to use
export JAVA_HOME=/opt/jdk1.8.0_101/
# Run the script in local mode, writing the Pig log to /tmp/pig.log
pig -x local -l /tmp/pig.log -f /opt/demo/pigscripts/test.pig

You can run Pig in different execution modes, such as local, MapReduce, and Tez. You can also pass parameters into the script.
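
For instance (the parameter name is hypothetical and must match a $logfile reference inside the script):

# Run on Tez instead of local mode, passing the log path as a parameter
pig -x tez -param logfile=/var/log/nifi/nifi-app.log -f /opt/demo/pigscripts/test.pig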

Pig script:

-- Load the NiFi application log as lines of text
messages = LOAD '/opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/nifi-app.log';
-- Keep only the lines containing WARN
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;
STORE warns INTO 'warns.out';

This is a basic example from the internet, with the NiFi 1.0 log used as the source.

As an aside, I run a script on the schedule 1 * * * * ? to clean up my logs.

Simply: 

/bin/rm -rf /opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/*2016*

PutHDFS

Hadoop configuration: /etc/hadoop/conf/core-site.xml.

Pick a directory and store away.

Results:

HadoopVersion: 2.7.3.2.5.0.0-1245
PigVersion:    0.16.0.2.5.0.0-1245
UserId:        root
StartedAt:     2016-11-03 19:53:57
FinishedAt:    2016-11-03 19:53:59
Features:      FILTER

Success!

Job Stats (time in seconds):
JobId:      job_local72884441_0001
Maps:       1
Reduces:    0
Map times (max/min/avg/median):    n/a
Reduce times (max/min/avg/median): 0
Alias:      messages,warns
Feature:    MAP_ONLY
Outputs:    file:/tmp/temp1540654561/tmp-600070101,
Input(s):
Successfully read 30469 records from: "/opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/nifi-app.log"
Output(s):
Successfully stored 1347 records in: "file:/tmp/temp1540654561/tmp-600070101"
Counters:
Total records written : 1347
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local72884441_0001

If you're a Spark shop, you can run Pig on Spark.
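
Pig 0.17 supports Spark as an execution engine selected at launch. A minimal sketch (the SPARK_HOME path is an assumption for an HDP-style install):

# Point Pig at the Spark installation, then run in Spark mode
export SPARK_HOME=/usr/hdp/current/spark-client
pig -x spark -f /opt/demo/pigscripts/test.pig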



