
Using HBase to Create an Enterprise Key Service


HBase is the distributed key-value NoSQL store for Hadoop, and its ability to store key-value pairs makes it a good choice for lookups.


What it solves: systems often use keys to provide unique identifiers for transactions or entities within a firm, yielding savings through compact storage and reduced compute needs. HBase's ability to serve millions of requests with good concurrency meets the needs of such a key service. This post does not discuss HBase or ETL in depth; it assumes the reader has some prior knowledge of both. As a brief refresher, though:

  • HBase stores its data in a columnar format. HBase tables consist of rows identified by row keys, column families, qualifiers, and timestamps. Data is sorted in byte order of the row keys, and HBase serves it through regions hosted on region servers.
  • In layman's terms, facts are transactional data that are updated frequently and are linked to slowly changing dimension data. For example, in a retail store, sales transactions and inventory are fact data, whereas the product in a transaction is dimension data consisting of product ID, type, color, make, and so on. Time and store can be other dimensions of the same transaction.

Introduction

HBase is the distributed key-value NoSQL store for Hadoop, and its ability to store key-value pairs makes it a good choice for lookups. A typical ETL process handles facts and dimensions; the general requirement is to look up the dimension keys present in fact data in order to link the two. Moreover, if a dimension does not yet exist, the ETL process needs to generate its dimension key using some kind of sequencing logic. To summarize, an ETL process over facts and dimensions involves the following:

  • Start processing fact data, which arrives frequently.
  • Relate the fact data to dimension identifiers and attributes by looking up the dimensions referenced in the fact data in the dimension database/store.
  • Generate identifiers for dimensions that do not exist in the dimension store and use them for fact processing (also termed keying).
  • Update dimension data with new or modified dimensions as and when required.

How does HBase help us to solve this problem?

  1. The HBase table structure is designed to store "unique dimension keys" as row keys and "dimension attributes" as qualifiers in column families. Each dimension key is also assigned a "sequence identifier" (stored as a qualifier value) that is unique within that dimension's key space; this identifier is what a lookup returns, and it is used for further ETL processing.
  2. One HBase table stores all dimensions required for a fact.
HBase Dimension Lookup Table

  Row key                                   | Column Family:Qualifier | Qualifier Value
  ------------------------------------------+-------------------------+----------------------------------------
  <dimension name>+"unique dimension key"   | colfam:seq_id           | Sequence identifier
  <dimension name>+"unique dimension key"   | colfam:dim_attrib       | Dimension attributes
  <dimension name>                          | colfam:cntr_val         | Last incremented sequence number for the dimension's space (an HBase counter value)
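
For concreteness, here is a minimal sketch of creating such a table with the HBase Java client (1.x-style API). The table name dim_lookup and the single column family colfam are illustrative assumptions, not names prescribed by the design:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateDimLookupTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // One table holds every dimension: data rows are keyed as
            // "<dimension name>+<unique dimension key>", plus one counter
            // row per dimension keyed by the dimension name alone.
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("dim_lookup"));
            desc.addFamily(new HColumnDescriptor("colfam"));
            admin.createTable(desc);
        }
    }
}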


So, with this table design, fact processing would include the following steps:

  • HBase serves as the lookup mechanism for fetching dimension values against a particular dimension key: htable.get("dimension key") returns the sequence ID and other attributes for a specific dimension.
  • HBase can generate new dimension keys when they do not exist: htable.incr("dimension") increments a counter value for the given dimension space.
  • The new sequence number is registered for the new dimension with htable.put("dimension key", "dimension attributes & sequence number"), after which lookups on it work as above.

To summarize, the pseudocode for fact and dimension ETL would look like so:

if htable.exists("dimension key") then
     key = htable.get("dimension key")
     // Use this key and other dimension attributes as required.
else
     dim_seq = htable.incr("dimension")
     htable.put("dimension key", dim_seq, dim_attributes)
end if


This logic seems fine until you run the ETL process as a distributed job like MapReduce or Spark. In that case, the logic should be modified to look like this:

if htable.exists("dimension key") then
     key = htable.get("dimension key")
     // Use this key and other dimension attributes as required.
else
     dim_seq = htable.incr("dimension")
     if not htable.exists("dimension key") then
          htable.put("dimension key", dim_seq, dim_attributes)
     else
          key = htable.get("dimension key")
          // Use this key and other dimension attributes as required.
     end if
end if
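
In the Java client, the inner check can also be made atomic with a check-and-put ("put only if the cell is still empty"), which closes even the narrow window between the second existence check and the put. A sketch under the same assumptions as the class above (HBase 1.x API, where checkAndPut with a null expected value succeeds only when the cell does not exist):

// Distributed-safe variant of the create path in lookupOrCreate above.
static long createIfAbsent(Table table, byte[] row, String dim, byte[] attrs)
        throws IOException {
    long seq = table.incrementColumnValue(Bytes.toBytes(dim), CF, CNTR, 1L);
    Put p = new Put(row);
    p.addColumn(CF, SEQ_ID, Bytes.toBytes(seq));
    p.addColumn(CF, ATTRS, attrs);
    // Atomic "put only if seq_id is still unset": no window between check and put.
    boolean created = table.checkAndPut(row, CF, SEQ_ID, null, p);
    if (!created) {
        // Another task won the race; re-read the sequence ID it registered.
        seq = Bytes.toLong(table.get(new Get(row)).getValue(CF, SEQ_ID));
    }
    return seq;
}

With this variant, the sequence number drawn by the losing task is simply discarded, which leaves a harmless gap in the dimension's sequence.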


The extra check after generating the new dimension caters to the edge case where two tasks generate a new key for the same dimension simultaneously.

Explicit updates to existing dimensions are handled along similar lines.

Overall, this is a simple and reasonable solution. The catch is that it may not be efficient when the ETL process generates lots of new keys, because of the contention and locking involved in writing to HBase. To optimize it further:

  • Use batch request APIs for HBase GET and PUT calls (see the sketch after this list).
  • Avoid the HBase EXISTS call; the existence of a row key can be checked with the same GET that fetches it.
  • Ideally, split the logic above into two parts: lookups (checking whether a dimension key exists) happen in the map phase of a Spark or MapReduce job, while generation of new dimensions using INCR and PUT is deferred to the reduce phase. The GETs and PUTs in their respective phases are issued as batched HBase operations.
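
A minimal sketch of those batched calls, again under the assumptions of the DimKeyService sketch above; Table.get(List<Get>) returns one Result per Get in order, and Table.put(List<Put>) writes a whole batch in one call:

import java.util.ArrayList;
import java.util.List;

// Map phase: batch many lookups into a single client call.
static Result[] batchLookup(Table table, List<String> rowKeys) throws IOException {
    List<Get> gets = new ArrayList<>();
    for (String key : rowKeys) {
        gets.add(new Get(Bytes.toBytes(key)));
    }
    return table.get(gets); // results[i].isEmpty() doubles as the existence check
}

// Reduce phase: register all newly keyed dimensions in one batch.
static void batchRegister(Table table, List<Put> newDims) throws IOException {
    table.put(newDims);
}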

Apart from these, lookup performance can be improved with the usual HBase cluster tuning measures: pre-splitting regions, designing row keys to avoid hotspotting, salting row keys (sketched below), and so on. HBase cluster tuning is a separate topic for discussion in itself.
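
As one illustration, a salted row key prefixes the natural key with a bucket derived from its hash, so that skewed or sequential keys spread across pre-split regions. This helper and its bucket count of 16 are illustrative assumptions, not part of the design above:

// Hypothetical salting helper: prefix the natural key with a hash-derived bucket.
static final int SALT_BUCKETS = 16; // arbitrary; match it to your region pre-splits

static byte[] saltedRowKey(String naturalKey) {
    int bucket = (naturalKey.hashCode() & Integer.MAX_VALUE) % SALT_BUCKETS;
    // For exact-key lookups the bucket is recomputable from the key itself,
    // so single Gets still work; only full scans must fan out across buckets.
    return Bytes.toBytes(String.format("%02d|%s", bucket, naturalKey));
}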



Topics:
cluster, tuning, nosql, keys, etl, hbase

Published at DZone with permission of Garima Dosi, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
