Apache Hive vs. Apache Pig
Apache Hive is awesome for things like ACID transactions and BI queries, while Apache Pig is well-suited for procedural coding and MapReduce-style programming.
Apache Hive is great for ACID transactions, BI queries, and queries that feed other systems, and it works well as an ingress and egress point for Apache NiFi and for Spark SQL table queries. Apache Hive now has stored procedures and, with Apache Hivemall, machine learning! Hive has a lot of committers working on new features for interactive querying, broader SQL support, enhanced transactions, and more speed. Using Apache ORC as the file format gives you a ton of speed enhancements. Apache Hive combines well with Apache Spark SQL and helps manage table structures. Hive is now a really great choice not just for queries, table management, secure data access, and data access at scale, but also for transactions.
Apache Pig is great for procedural coding of data access, summaries, and MapReduce-style programming. Apache Pig runs on Tez for performance just like Apache Hive. You can do a lot with Hive View 2.0 and Apache Zeppelin for fast queries and quick reporting.
Both tools run on modern Apache Hadoop 2.x and 3.x on top of Apache Tez and Apache YARN.
They both can be called via Apache NiFi and from the command line. There are third-party libraries to enhance both access tools.
I recommend using both along with Apache NiFi.
The latest version of Pig is 0.17 and is extremely stable. With Tez and YARN updates beneath it, it runs fast and does certain ETL and data processing faster and easier than most options.
Starting My Hadoop Tools
NiFi can interface directly with Hive, HDFS, HBase, Flume, and Phoenix. And I can also trigger Spark and Flink through Kafka and site-to-site. Sometimes, I need to run some Pig scripts. Apache Pig is very stable and has a lot of functions and tools that make for some smart processing. You can easily augment and add this piece to a larger pipeline or part of the process.
I like to use Ambari to install the HDP 2.5 clients on my NiFi box to have access to all the tools I may need.
Then, I can just do:
yum install pig
Pig to Apache NiFi 1.0.0
Executing the Process
We call a shell script that wraps the Pig script.
The output of the script is stored to HDFS:
hdfs dfs -ls /nifi-logs

export JAVA_HOME=/opt/jdk1.8.0_101/
pig -x local -l /tmp/pig.log -f /opt/demo/pigscripts/test.pig
You can run Pig in different modes, such as local, MapReduce, and Tez. You can also pass parameters to the script.
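To make the mode and parameter options concrete, here is a minimal sketch of a wrapper helper that assembles the pig command line. The function name build_pig_cmd and the LOG parameter are hypothetical, made up for illustration; -x, -l, -param, and -f are standard Pig command-line options. The helper only prints the command, so it is safe to try on a box without Pig installed.

```shell
# Hypothetical helper: prints the pig command for a given mode, script,
# and optional name=value parameters. Values containing spaces are not handled.
build_pig_cmd() {
  mode="$1"      # local, mapreduce, or tez
  script="$2"    # path to the .pig file
  shift 2
  cmd="pig -x $mode -l /tmp/pig.log"
  for kv in "$@"; do
    cmd="$cmd -param $kv"   # each pair becomes a -param option
  done
  printf '%s\n' "$cmd -f $script"
}

# Example: the same script on Tez with a LOG parameter (assumed name and path).
build_pig_cmd tez /opt/demo/pigscripts/test.pig LOG=/tmp/nifi-app.log
```

Inside the Pig script, a parameter passed this way would be referenced as $LOG, for example in the LOAD path, instead of hard-coding the file name.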
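-- Load the NiFi application log, one record per line
messages = LOAD '/opt/demo/HDF/centos7/tars/nifi/nifi-22.214.171.124.0.0.0-579/logs/nifi-app.log';
-- Keep only the lines that contain WARN
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
-- Print the matches, then write them out
DUMP warns;
STORE warns INTO 'warns.out';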
This is a basic example from the internet, with NiFi 1.0 log used as the source.
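As a quick sanity check on the Pig result, the same filter can be approximated from the shell. This is only a sketch, not part of the original pipeline: count_warns is a hypothetical helper, and it assumes each log line is one record, so a line matches '.*WARN+.*' exactly when it contains WARN.

```shell
# Counts lines containing WARN, mirroring FILTER ... MATCHES '.*WARN+.*'.
count_warns() {
  grep -c 'WARN' "$1"
}

# Demo on a throwaway file (made-up contents, not real NiFi output).
tmp_log="$(mktemp)"
printf '%s\n' 'INFO started' 'WARN disk low' 'ERROR failed' 'WARN retrying' > "$tmp_log"
count_warns "$tmp_log"
rm -f "$tmp_log"
```

Run against the real nifi-app.log, the count should line up with the number of records Pig reports as stored, as long as the log has not changed between the two runs.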
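As an aside, I run a cleanup script with the schedule 1 * * * * ? to clear out my old logs. If that is read as a Quartz-style cron expression, which is what NiFi's CRON-driven scheduling uses, it fires at second 1 of every minute; a once-a-day run would look more like 0 0 1 * * ?.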
/bin/rm -rf /opt/demo/HDF/centos7/tars/nifi/nifi-126.96.36.199.0.0.0-579/logs/*2016*
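A slightly more defensive version of that cleanup might look like the sketch below. The clean_logs name and its arguments are hypothetical; it uses find rather than a bare rm -rf glob so that it only touches regular files directly inside the directory and refuses to run when the directory argument is empty or missing. It assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: delete regular files in a log directory whose names
# match a pattern such as '*2016*'. Subdirectories are left alone.
clean_logs() {
  log_dir="$1"
  pattern="$2"
  # Refuse to run on an empty or missing directory argument.
  [ -n "$log_dir" ] && [ -d "$log_dir" ] || return 1
  find "$log_dir" -maxdepth 1 -type f -name "$pattern" -exec rm -f {} +
}
```

It would be called as clean_logs followed by the NiFi logs directory and a quoted pattern like '*2016*'; quoting the pattern keeps the shell from expanding it before find sees it.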
Pick a directory and store away.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
188.8.131.52.5.0.0-124184.108.40.206.5.0.0-1245 root 2016-11-03 19:53:57 2016-11-03 19:53:59 FILTER

Success!

Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local72884441_0001 1 0 n/a n/a n/a n/a 0 0 0 0 messages,warns MAP_ONLY file:/tmp/temp1540654561/tmp-600070101,

Input(s):
Successfully read 30469 records from: "/opt/demo/HDF/centos7/tars/nifi/nifi-220.127.116.11.0.0.0-579/logs/nifi-app.log"

Output(s):
Successfully stored 1347 records in: "file:/tmp/temp1540654561/tmp-600070101"

Counters:
Total records written : 1347
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local72884441_0001
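If the wrapper script captures this stats text, it can pull out the stored record count and hand it to the rest of the flow. The sketch below assumes the "Successfully stored N records" wording shown in the run above; stored_records is a hypothetical helper, and it is meant to sit alongside a check of Pig's exit status, not replace it.

```shell
# Reads Pig job stats on stdin and prints the "Successfully stored" record count(s).
stored_records() {
  sed -n 's/.*Successfully stored \([0-9][0-9]*\) records.*/\1/p'
}

# Demo using the output line from the run above.
echo 'Successfully stored 1347 records in: "file:/tmp/temp1540654561/tmp-600070101"' | stored_records
```

A script with several STORE statements would produce one count per line, so a caller may want to sum them or check each one.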
For all Spark shops, you can run Pig on Spark.