Kerberized Connections: HBase, Hive Metastore

In this post, a Solutions Architect walks us through the process of making secure, client-side connections to Kerberized servers.

With the introduction of Kerberos Security for the Hadoop Ecosystem, there have been some fundamental changes with respect to:

  • The process of submitting jobs in Hadoop.

  • Making secure connections to any server, be it Namenode, HiveServer, HBase, etc.

  • Impersonating other users in the cluster.

Because secure connection establishment is handled transparently by the clients of the respective components, developers and users of a Hadoop system usually don't need to know the steps involved in establishing a connection to a server, or the nitty-gritty of Kerberized connections as a whole. The mystery that then remains to be solved is all about GSS Exceptions, "TGT not found" errors, and the like.

Assuming that the reader already knows about Kerberos, and impersonation in general, this post is focused on the steps that should be followed while making connections to Kerberized servers.

Let's understand this by considering two use cases:

  1. One where we would like to open connections to a secured HBase cluster in the mappers/reducers of a MapReduce job, or use a secured HBase to look up data in Hive functions. (Note: here we are not talking about using HBase's MapReduce input/output formats or a Hive table backed by HBase; we want to do lookups on HBase from within MapReduce.)

  2. Second, consider an example where we would like to connect to a secured Hive Metastore by impersonating another user.

Now, the question is, what is the problem with the first use case? If we run a MapReduce job and try to establish an HBase connection in a mapper, it should work, right? But this is a Kerberized HBase cluster, which means the user connecting to HBase must be authenticated, and to do so, HBase will look for the user's ticket cache (or credentials). Would the user's credentials or tickets be available on the mapper nodes? No, they are only available on the nodes where the user has logged in. So the credentials will not be found, and the job will fail with a big trace of the famous GSS Exception.

But what about the second use case? Although the process is executed on a node where the user is logged in, the Hive Metastore will not be able to verify the authenticity of the user, because it can only get the credentials (from the ticket cache) of the user who is logged in, not of the one who is being impersonated. So, again, what we get is a trace of a GSS Exception complaining about credentials not being present.

So, what should we do to connect to these servers, then? Hadoop already has the concept of Delegation Tokens; we just need to understand and apply it to solve our use cases.

Tokens are analogous to the concept of coupons distributed to their employees by companies. These coupons can be used online or in other stores to purchase goods depending on the type of coupon issued. In Hadoop, the servers can issue tokens (coupons) to users or clients (employees) who are logged into the system and hence their credentials are available for authentication (usually at the edge nodes). Tokens are based on the type of server - HBase, NN, Metastore, etc. These tokens can then be used on other nodes to “connect” and “access” (purchase goods) resources like HBase tables. The identity of a user, on the other node, would thus be established through the token and not Kerberos tickets/cache.

Rewinding back to the coupon example, an employee's family member can use the coupons for purchases on behalf of the employee. Likewise, a logged-in user (the employee) can retrieve a delegation token from a server like the Hive Metastore, and an impersonating user (the family member) can use this token to "connect" and "access" Metastore resources.

As coupons have validity periods, so do the tokens. They expire after a designated amount of time, which is long enough for processes to perform their tasks. More on token expiry and renewal can be read here.
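In code, the tokens a process carries can be inspected through Hadoop's UserGroupInformation API. The following is a minimal sketch (it assumes Hadoop client libraries on the classpath and is purely illustrative):

```java
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TokenLister {
    public static void main(String[] args) throws Exception {
        // Tokens (the "coupons") attached to the current user's credentials
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        for (Token<? extends TokenIdentifier> t : ugi.getTokens()) {
            System.out.println(t.getKind() + " for service " + t.getService());
        }
    }
}
```

Run on an edge node inside a job, this prints one line per delegation token (HBase, HDFS, Metastore, etc.) attached to the current user.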

So the theory looks okay. But how do we actually implement all this in reality?

Let's start with the first case where we want to connect to HBase in mappers. HBase already provides helper APIs to obtain tokens for jobs.

HBase and other classes (including the Hadoop classes the driver and mapper below depend on):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.User;
import org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier;
import org.apache.hadoop.hbase.security.token.TokenUtil;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

In the driver class of the MapReduce job, we write the following piece of code:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "job name");
/* Add HBase resources (hbase-site.xml) or any specific HBase params to the configuration object */

/* Obtain an HBase token for the user and set it in the job. At this point the user's
   credentials are available and authenticated for retrieving the token. */
Connection hbaseConn = ConnectionFactory.createConnection(job.getConfiguration());
TokenUtil.obtainTokenForJob(hbaseConn, User.getCurrent(), job);
hbaseConn.close();

In the mapper of the MapReduce job, we connect to HBase in the setup method like so:

private Connection connection;
private Table htbl;

/* Since the token was already added to the job, it will be available to the
   HBase connection through the configuration object. */
@Override
public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    connection = ConnectionFactory.createConnection(conf);
    htbl = connection.getTable(TableName.valueOf("table name"));
}
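The mapper should also release these handles when it finishes; a minimal sketch of the matching cleanup override (not part of the original, and assuming the `htbl` and `connection` fields from the setup method above):

```java
/* Close the table and connection so region-server resources are not leaked. */
@Override
public void cleanup(Context context) throws IOException {
    if (htbl != null) {
        htbl.close();
    }
    if (connection != null) {
        connection.close();
    }
}
```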

What if we had to connect to HBase and do a lookup in a Hive UDAF/UDF? How, then, do we make the token available in the UDAF/UDF? We make use of Hive pre-hooks: obtain a token and add it to the configuration in a pre-hook, and use it in the UDAF/UDF thereafter.

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.User;
import org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier;
import org.apache.hadoop.hbase.security.token.TokenUtil;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class HbaseTokenFetcherHook implements ExecuteWithHookContext {

    private static final Log LOG = LogFactory.getLog(HbaseTokenFetcherHook.class);

    @Override
    public void run(HookContext hookContext) throws Exception {
        HiveConf hiveConf = hookContext.getConf();
        /* If required */
        hiveConf.set("zookeeper.znode.parent", "/hbase-secure");
        try {
            Connection tokenConnection = ConnectionFactory.createConnection(hiveConf);
            Token<AuthenticationTokenIdentifier> token =
                    TokenUtil.obtainToken(tokenConnection, User.getCurrent());
            String urlString = token.encodeToUrlString();
            hiveConf.set("HBASE_AUTH_TOKEN", urlString);
            tokenConnection.close();
        } catch (IOException | InterruptedException e) {
            LOG.error("Error while fetching token for hbase: " + e.getMessage(), e);
        }
    }
}
In the UDAF/UDF class, override the configure method to include the following snippet:

private Table hTbl;

@Override
public void configure(org.apache.hadoop.hive.ql.exec.MapredContext mapredContext) {
    JobConf jobConf = mapredContext.getJobConf();
    jobConf.set("zookeeper.znode.parent", "/hbase-secure");
    String urlEncodedToken = jobConf.get("HBASE_AUTH_TOKEN");
    try {
        /* Rebuild the token from its URL-encoded form and attach it to the
           current user so the HBase client can authenticate with it. */
        Token<AuthenticationTokenIdentifier> token = new Token<>();
        token.decodeFromUrlString(urlEncodedToken);
        UserGroupInformation.getCurrentUser().addToken(token);

        Connection hbaseConnection = ConnectionFactory.createConnection(jobConf);
        hTbl = hbaseConnection.getTable(TableName.valueOf("table name"));

        LOG.debug("Table name: " + hTbl.getName());
    } catch (IOException e) {
        LOG.error("Error while creating hbase connection: " + e.getMessage(), e);
    }
}
So, next, let's take a look at the second use case where a proxy user (impersonation) is used to connect to the Hive Metastore. This code can be used as a main class or as a method.

import java.io.IOException;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.MetaException;
/* Package of DelegationTokenIdentifier varies by Hive version */
import org.apache.hadoop.hive.thrift.DelegationTokenIdentifier;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.thrift.TException;

try {
    final String proxyUser = "user_to_be_impersonated";
    /* The service name here is chosen by the developer. It can be anything,
       but must remain consistent throughout. */
    final String tokenServiceName = "hive_metastore_connection";

    /* Open a connection to the Hive Metastore and fetch a delegation token for the
       user to be impersonated (proxy user). This connection is made with the actual
       user's credentials (the employee). */
    HiveMetaStoreClient tokenFetchingClient = new HiveMetaStoreClient(new HiveConf());
    String tok = tokenFetchingClient.getDelegationToken(proxyUser, proxyUser);
    /* Close the connection from the actual user after fetching the delegation token */
    tokenFetchingClient.close();

    /* Set the token and service name in Hadoop's user object */
    Token<DelegationTokenIdentifier> delegationToken = new Token<>();
    delegationToken.decodeFromUrlString(tok);
    delegationToken.setService(new Text(tokenServiceName));

    UserGroupInformation ugi =
            UserGroupInformation.createProxyUser(proxyUser, UserGroupInformation.getCurrentUser());
    ugi.addToken(delegationToken);

    /* Connect to the metastore as the impersonated/proxy user (the employee's family member).
       'client' is assumed to be a field of the enclosing class. */
    ugi.doAs(new PrivilegedExceptionAction<Void>() {
        public Void run() throws MetaException {
            HiveConf conf = new HiveConf();
            conf.set("hive.metastore.token.signature", tokenServiceName);
            client = new HiveMetaStoreClient(conf);
            return null;
        }
    });
} catch (TException | IOException | InterruptedException e) {
    throw e;
}
In the above code, before impersonating the user, the HiveMetaStoreClient connects to the Metastore as the logged-in user and obtains a token for the proxy user. This token is then decoded, added to the proxy user's UserGroupInformation object, and used to connect to the Metastore as the proxy user.

Well, that's it! This was an attempt to share the ways of making client connections to Kerberized servers. While working on recent assignments, I got the chance to learn this by digging through source code and many articles, so I wanted to consolidate my experience. Hope this helps!





