Over a million developers have joined DZone.

Hadoop: Getting Started with Pig

· Big Data Zone

Read this eGuide to discover the fundamental differences between iPaaS and dPaaS and how the innovative approach of dPaaS gets to the heart of today’s most pressing integration problems, brought to you in partnership with Liaison.

What is Pig?

Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data analysts to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL. Pig Scripts are converted into MapReduce Jobs which runs on data stored in HDFS (refer to the diagram below).

Through the User Defined Functions(UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

Pig Architecture

Pig Achitecture

How Pig is being Used ?

  • Rapid prototyping of algorithms for processing large data sets.
  • Data Processing for web search platforms.
  • Ad Hoc queries across large data sets.
  • Web log processing.

Pig Elements

Pig consists of three elements -

  • Pig Latin
    • High level scripting language
    • No Schema
    • Translated to MapReduce Jobs
  • Pig Grunt Shell
    • Interactive shell for executing pig commands.
  • PiggyBank
    • Shared repository for User defined functions (explained later).

Pig Latin Statements 

Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output(except LOAD and STORE statements).

Pig Latin statements are generally organized as follows:

  • A LOAD statement to read data from the file system.
  • A series of “transformation” statements to process the data.
  • A DUMP statement to view results or a STORE statement to save the results.

Note that a DUMP or STORE statement is required to generate output.

  • In this example Pig will validate, but not execute, the LOAD and FOREACH statements.
    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
  • In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.
    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
    DUMP B;

Storing Intermediate Results

Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using thepig.temp.dir property.

Storing Final Results

Use the STORE operator and the load/store functions to write results to the file system ( PigStorage is the default store function).

Note: During the testing/debugging phase of your implementation, you can use DUMP to display results to your terminal screen. However, in a production environment you always want to use the STORE operator to save your results.

Debugging Pig Latin

Pig Latin provides operators that can help you debug your Pig Latin statements:

  • Use the DUMP operator to display results to your terminal screen.
  • Use the DESCRIBE operator to review the schema of a relation.
  • Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
  • Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

What is Pig User Defined Functions (UDFs) ?

Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig.UDF is very powerful functionality to do many complex operations on data.The Piggy Bank is a place for Pig users to share their functions(UDFs).


REGISTER saurzcodeUDF.jar;
A = LOAD 'employee_data' AS (name: chararray, age: int, designation: chararray);

This article was just a Getting Started Article on Pig , I will cover further details about How to to Write Pig Latin commands for some basic operations like JOIN,FILTER,GROUP, ORDER etc. , also how to make your own UDFs for processing on Hadoop cluster.

References :-

Discover the unprecedented possibilities and challenges, created by today’s fast paced data climate and why your current integration solution is not enough, brought to you in partnership with Liaison


Published at DZone with permission of Saurabh Chhajed, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}