Flatten Entire HBase Column Families with Pig and Python UDFs

By Chase Seibert · Feb. 21, 2013

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, rather than composites of labels and data. When working with HBase, for example, it's actually not uncommon for both of those assumptions to be false. Being a columnar database, it's very common to be working with rows that have thousands of columns. Under that circumstance, it's also common for the column names themselves to encode dimensions, such as date and counter type.

How do you solve this mismatch? If you're in the early stages of designing a schema, you could reconsider a more row-based approach. If you have to work with an existing schema, however, you can bridge the gap with the help of Pig UDFs.

Say we have the following table:

rowkey                    cf1:20130101post  cf1:20130101comment  cf1:20130101like  cf1:20130102post  ...
ce410-00005031-00089182   147               5                    41                153
ce410-00005031-00021915   1894              33                   86                1945
5faa4-00009521-00019828   30                2                    8                 31
...

Here there is a composite row key, but also composite column keys. Because the date is part of the column keys, there are potentially many, many columns. Enumerating them all in your Pig scripts is impractical. Notice that they are also in the same column family. To load them all, you can do the following in Pig:

data = load 'hbase://table_name' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*', '-loadKey true') AS (id:chararray, stats:map[int]);
illustrate data;

This will result in all of the columns being loaded into a Pig map, which is just a collection of key/value pairs:

-----------------------------------------------------------------------------------------------------
| data         | id:chararray            | stats:map(:int)                                          |
-----------------------------------------------------------------------------------------------------
|              | ce410-00005031-00089182 | {20130101post=147,20130101comment=5,20130101like=41,...} |
-----------------------------------------------------------------------------------------------------

So, now you have loaded all the data, but how do you parse the column names into their respective parts, so you can apply logic to the values? Here is a very simple Python implementation of a UDF that will turn that map into a bag:

@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    return map_dict.items()

You can include this UDF (place the above in a file called udfs.py in Pig's current working directory) and invoke it like this:

register 'udfs.py' using jython as py;
data = load 'hbase://table_name' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*', '-loadKey true') AS (id:chararray, stats:map[int]);
databag = foreach data generate id, FLATTEN(py.bag_of_tuples(stats));
illustrate databag;

This takes advantage of the built-in FLATTEN operator, which un-nests the bag by crossing its tuples with the rest of the row, producing N new rows for a bag of N tuples.

--------------------------------------------------------------------------------
| databag      | id:chararray            | key:bytearray    | value:bytearray  |
--------------------------------------------------------------------------------
|              | ce410-00005031-00089182 | 20130101post     | 147              |
--------------------------------------------------------------------------------
|              | ce410-00005031-00089182 | 20130101comment  | 5                |
--------------------------------------------------------------------------------
|              | ce410-00005031-00089182 | 20130101like     | 41               |
--------------------------------------------------------------------------------
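
At this point each row still carries the composite column name as a single field. Since the goal is to apply logic to the date and counter-type dimensions separately, you can split that key in the same udfs.py file. The split_key helper below is a hypothetical sketch, not from the original article, and assumes the column-name layout shown in the example table: an eight-digit date followed by the counter type. Cast the key to chararray in Pig before calling it, since HBase column names arrive as bytearrays.

@outputSchema("parts:tuple(day:chararray,counter:chararray)")
def split_key(key):
    # Hypothetical helper: expects a plain string such as '20130101post'
    # and splits the 8-digit date prefix from the counter type.
    return (key[:8], key[8:])

You could then call it from another foreach, for example databag2 = foreach databag generate id, FLATTEN(py.split_key((chararray)key)), (int)value;, which gives you day, counter, and value as ordinary fields to filter or group on.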

You can then process your data as normal. When you are done, you can write the data back to HBase in the same format by using the built-in TOMAP function and the same cf:* syntax. Assuming you have produced new column names and values in your script, you can do:

...
-- HBaseStorage uses the first field of each tuple as the row key when storing
mapped = foreach processed_data generate id, TOMAP(columnname, value) as stats;
store mapped into 'hbase://table_name' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:*');
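
If the new column names follow the same date-plus-type convention, a small companion UDF can rebuild the composite names before the TOMAP call. Like split_key above, make_key is a hypothetical sketch in the same udfs.py, not something from the original article:

@outputSchema("columnname:chararray")
def make_key(day, counter):
    # Hypothetical helper: rebuilds a composite column name such as
    # '20130101post' from its date and counter-type parts.
    return '%s%s' % (day, counter)

With that in place, the foreach above becomes generate id, TOMAP(py.make_key(day, counter), value) as stats; when your script produced the parsed parts rather than a ready-made column name.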




Published at DZone with permission of Chase Seibert, DZone MVB.

