
The Best Ways to Get Started with HCatalog

Need a quick starting guide for HCatalog? Adam Diaz has got you covered!


HCatalog, also called HCat, is an interesting Apache project. It has the unique distinction of being one of the few Apache projects that began as part of another project, became a project in its own right, and then returned to its original home, Apache Hive.

HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages. It exposes data stored in the Hive metastore to languages other than HQL. Classically, this has meant Pig and MapReduce; when Spark burst onto the big data scene, it too gained access to HCat.

Using HCat means leveraging an abstraction layer that lets programmers focus on the task at hand rather than on file format issues. This is done using what is called a “SerDe,” or serializer/deserializer, which translates a programming object into a series of bytes and back again. For those of you who are not Java programmers, a SerDe is a piece of Java code that tells HCat and Hive how to read and write information in a particular format.
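As a minimal sketch of what that looks like in practice, a SerDe can be named explicitly in a table definition. The table name and columns below are illustrative, not from this article:

# Define a Hive table that uses the HCatalog JSON SerDe to interpret its files
hive -e "CREATE TABLE web_logs (host STRING, request STRING)
         ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
         STORED AS TEXTFILE;"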

Getting Started With HCatalog

In general, you would use HCatalog by uploading data to the distributed file system, defining the data in Hive, and then accessing the data from the technology of your choice using the appropriate HCatalog statement for that language.
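From the command line, the first two steps might look like the following sketch; the file, path, and table names are hypothetical:

# Upload raw data to HDFS
hdfs dfs -mkdir -p /user/hive/data/sales
hdfs dfs -put sales.csv /user/hive/data/sales

# Define a table over that data in Hive; HCatalog then exposes it to other tools
hive -e "CREATE EXTERNAL TABLE sales (prod_id INT, country STRING, amount DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/hive/data/sales';"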

Accessing Data With HCatalog

Below are short examples of HCatalog being used with several technologies:

Pig - Pig uses HCatLoader and HCatStorer. Please see the very detailed Hortonworks tutorial on the use of HCat for fully worked examples.

a = LOAD 'TABLENAME1' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'TABLENAME2' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = JOIN b BY colname1, a BY colname1;
DUMP c;

Hive - Hive uses HCat directly, so there is no need for special code. Simply define your table as you would in the Hive CLI and it will be accessible via HCat, as the sketch below shows.
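As a quick illustration (the table name and columns are hypothetical), a table created with the Hive CLI is immediately visible to the HCat command line:

hive -e "CREATE TABLE events (eid INT, name STRING);"
hcat -e "show tables;"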

MapReduce - MapReduce can also access data via HCat. A fully worked example is available here. In short, adjust your mapper, reducer, and driver to use HCat.

// Get the table schema inside the mapper
HCatSchema schema = HCatBaseInputFormat.getTableSchema(context);
Integer var1 = new Integer(value.getString("var1", schema));

// Define the output record schema
List<HCatFieldSchema> columns = new ArrayList<HCatFieldSchema>(3);
columns.add(new HCatFieldSchema("year", HCatFieldSchema.Type.INT, ""));

// Write a field into the output record
record.setInteger("year", schema, key.getFirstInt());

SparkSQL - Spark supports several languages, including Scala, Python, Java, and R. Of course, one can simply use Spark SQL to run native HQL commands, which interact with HCat natively (in later Spark releases, hql has been replaced by hiveContext.sql).

val a = hiveContext.hql("from data.test select country, prodID")

Spark - What if you would like to access Hive from Spark without writing Spark SQL? Spark code can access the Hive metastore directly, as sketched below.
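As a minimal sketch, assuming a Hive-enabled Spark 1.x build and the sample_07 table used later in this article, a Hive table can be pulled straight out of the metastore as a DataFrame from the spark-shell:

spark-shell
scala> val df = sqlContext.table("sample_07")   // table metadata is resolved via the Hive metastore
scala> df.select("description", "salary").show(5)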

Interacting with HCatalog Through WebHCat

Simply put, WebHCat is the REST API for HCatalog. It allows for all sorts of scenarios where interacting with HCatalog is required but cannot be done using other methods. The easiest way to demonstrate WebHCat is via curl. You will notice the name “Templeton” in the URLs below; this is the old name for WebHCat.

curl -s 'http://localhost:50111/templeton/v1/status'
{"status":"ok","version":"v1"}
 
curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table/sample_07?user.name=hive'

 
{  
  "columns":[  
     {  
        "name":"code",
        "type":"string"
     },
     {  
        "name":"description",
        "type":"string"
     },
     {  
        "name":"total_emp",
        "type":"int"
     },
     {
        "name":"salary",
        "type":"int"
     }
  ],
  "database":"default",
  "table":"sample_07"
}
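WebHCat exposes a number of other DDL resources as well. For example, per the WebHCat DDL API (same host and user as above), every table in a database can be listed:

# List all tables in the default database
curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table?user.name=hive'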
 

Connecting to Hive with HiveServer2

HiveServer2 (HS2) is a connection layer that allows client connections to Hive. It includes a TCP- or HTTP-based Hive service layer and, like most Hadoop services, a web interface. One of the easiest ways to connect is to use the built-in client, beeline, which ships with Hive. This is the technology that allows many BI tools in the Hadoop market to make use of Hive today.

beeline
WARNING: Use "yarn jar" to launch YARN applications.
Beeline version 1.2.1000.2.4.0.0-169 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/default
Connecting to jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: hive
Enter password for jdbc:hive2://localhost:10000/default: ****
Connected to: Apache Hive (version 1.2.1000.2.4.0.0-169)
Driver: Hive JDBC (version 1.2.1000.2.4.0.0-169)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| xademo         |
+----------------+--+
2 rows selected (2.867 seconds)
0: jdbc:hive2://localhost:10000/default>

You can think of HS2 as a Thrift-based service allowing remote access to the Hive command line. So again, in this scenario, any tables created will automatically be available via HCatalog, since you are essentially working at the Hive CLI.

0: jdbc:hive2://localhost:10000/default> show tables;
+------------+--+
|  tab_name  |
+------------+--+
| sample_07  |
| sample_08  |
+------------+--+
2 rows selected (0.381 seconds)
0: jdbc:hive2://localhost:10000/default> create table testtable( eid int);
No rows affected (10.69 seconds)
0: jdbc:hive2://localhost:10000/default> show tables;
+------------+--+
|  tab_name  |
+------------+--+
| sample_07  |
| sample_08  |
| testtable  |
+------------+--+
3 rows selected (0.351 seconds)
 
hcat -e "show tables;"
WARNING: Use "yarn jar" to launch YARN applications.
OK
sample_07
sample_08
testtable
Time taken: 8.589 seconds
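For scripting, beeline can also be run non-interactively. The connection string and credentials below simply mirror the interactive session above:

beeline -u jdbc:hive2://localhost:10000/default -n hive -e "show tables;"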

Learn More About HCatalog and Compatible Technologies

HCatalog is a way for many different technologies to share the tables defined in Hive without having to write low-level integrations with the Hive metastore. Without HCatalog, reusing existing data becomes far more cumbersome. Aside from the fact that Hive is the technology in Hadoop that looks and feels the most like everyone’s beloved RDBMS, HCatalog is what allows a multitude of Hadoop command-line tools to interact with Hive, and HS2 provides the easiest way for a sea of BI tools to connect to Hive and leverage its tables directly.
