Herding Apache Pig: Using Pig with Perl and Python

By Arnon Rotem-gal-oz · Mar. 05, 13


The past week or so we got some new data that we had to process quickly. There are quite a few technologies out there for quickly churning out map/reduce jobs on Hadoop (Cascading, Hive, Crunch, and Jaql, to name a few of many); my personal favorite is Apache Pig. I find that the imperative nature of Pig makes it relatively easy to understand what's going on and where the data is going, and that it produces efficient enough map/reduce jobs. On the downside, Pig lacks control structures, so working with Pig also means you need to extend it with user-defined functions (UDFs) or Hadoop Streaming. Usually I use Java or Scala for writing UDFs, but it is always nice to try something new, so we decided to check out some other technologies, namely Perl and Python. This post highlights some of the pitfalls we met and how to work around them.

Yuval, who was working with me on this mini-project, likes Perl (to each his own, I suppose), so we started with that. Searching for Pig and Perl examples, we found something like the following:

A = LOAD 'data';
B = STREAM A THROUGH `stream.pl`;

The first pitfall here is that the Perl script name is surrounded by backticks (the character on the tilde (~) key) and not single quotes (so in the script above, 'data' is surrounded by single quotes and `stream.pl` is surrounded by backticks).

The second pitfall was that the code above works nicely when you use Pig in local mode (pig -x local), but it failed when we tried to run it on the cluster. It took some head scratching and some trial and error, but eventually Yuval came up with the following:

DEFINE CMD `perl stream.pl` ship('/PATH/stream.pl');
A = LOAD 'data';
B = STREAM A THROUGH CMD;

Basically, we're telling Pig to ship the Perl script to the cluster so that it is accessible on all the nodes.

So, Perl worked pretty well, but since we're using Hadoop Streaming and get the data via stdin, we lose all the context of the data that Pig knows. We also need to emulate the textual representation of bags and tuples so the returned data will be available to Pig for further work. This is all workable, but not fun to work with (in my opinion, anyway).
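To make that concrete, here is a minimal sketch of the kind of script streaming expects (written in Python rather than Perl, but the idea is identical; the pass-the-first-field transformation and the single-bag output format are made up for illustration):

#!/usr/bin/env python
# Minimal streaming sketch: Pig feeds each input tuple to stdin as a
# tab-separated line, and every line written to stdout becomes an
# output tuple. There is no schema; complex types arrive and leave
# as text, e.g. a bag of tuples is rendered as {(a),(b)}.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # hypothetical transformation: pass the first field through,
    # wrapped by hand in Pig's textual bag-of-tuples syntax
    sys.stdout.write("{(%s)}\n" % fields[0])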

I decided to write Pig UDFs in Python. Python can be used with Hadoop Streaming, like Perl above, but it also integrates more tightly with Pig via Jython (i.e., the Python UDF is compiled into Java bytecode and shipped to the cluster as part of the JAR Pig generates for the map/reduce anyway).

Pig UDFs are better than streaming, as you get Pig's schema for the parameters and you can tell Pig the schema of your output. UDFs in Python are especially nice, as the code is almost 100% regular Python and Pig does the mapping for you (for instance, a bag of tuples in Pig is translated to a list of tuples in Python, etc.). Actually, the only difference is that if you want Pig to know about the data types you return from the Python code, you need to annotate the method with @outputSchema, e.g. a simple UDF that gets the month as an int from a date string in the format YYYY-MM-DD HH:MM:SS:

from datetime import datetime

@outputSchema("num:int")  # tell Pig this UDF returns an int
def getMonth(strDate):
    try:
        # drop fractional seconds, if any, before parsing
        dt, _, _ = strDate.partition(".")
        return datetime.strptime(dt, "%Y-%m-%d %H:%M:%S").month
    except (AttributeError, IndexError, ValueError):
        # null or malformed input: default to 0
        return 0
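The mapping works in the return direction too: a Python list of tuples comes back to Pig as a bag of tuples. As a sketch of that (this tokenize function and its schema are my own illustration, not from the original project):

@outputSchema("words:bag{t:tuple(word:chararray)}")
def tokenize(line):
    # a Python list of tuples is handed back to Pig as a bag of tuples
    if line is None:
        return []
    return [(w,) for w in line.split()]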

Using the UDF is as simple as registering the Python file where the UDF is defined. Assuming our UDF is in a file called utils.py, it would be declared as follows:

Register 'utils.py' using jython as utils;

And then using that UDF would go something like:

A = LOAD 'data' using PigStorage('|') as (dateString:chararray);
B = FOREACH A GENERATE utils.getMonth(dateString) as month;

Again, like in the Perl case, there are a few pitfalls here. For one, the Python script and the Pig script need to be in the same directory (relative paths only work in local mode). The more annoying pitfall hit me when I wanted to import some Python libs (e.g. datetime in the example, which is imported using "from datetime import datetime"). There was no way I could come up with to make this work. The solution I did come up with eventually was to take a Jython standalone .jar (a jar with the common Python libraries included) and replace Pig's Jython jar (in the Pig lib directory) with the standalone one. There's probably a nicer way to do this (and I'd be happy to hear about it), but this worked for me. It only has to be done on the machine where you run the Pig script, as the Python code gets compiled and shipped to the cluster as part of the jar file Pig generates anyway.

Working with Pig and Python has been really nice. I liked writing Pig UDFs in Python much more than writing them in Java, or Scala for that matter. The two main reasons for that are that a lot of the Java cruft for integrating with Pig is just not there, so I can focus on solving the business problem, and that with both Pig and Python being "scripts," the feedback loop from making a change to seeing it work is much shorter. Anyway, Pig also supports JavaScript and Ruby UDFs, but these would have to wait for next time :)
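One note on that short feedback loop: since the UDF body is plain Python, you can sanity-check it outside Pig altogether. The only wrinkle is that @outputSchema is injected by Pig's Jython runtime, so running utils.py under a plain interpreter needs a no-op stand-in. A sketch of one (my own workaround, not from the original post) that could go at the top of utils.py:

# Provide a no-op outputSchema when running outside Pig;
# Pig's Jython runtime injects the real decorator at run time.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

With that in place, something like python -c "import utils; print(utils.getMonth('2013-03-05 12:34:56'))" should print 3 without involving Pig at all.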

Published at DZone with permission of Arnon Rotem-gal-oz, DZone MVB.
