Using HBase to Create an Enterprise Key Service

HBase is the distributed key-value NoSQL store for Hadoop, and its ability to store key-value pairs makes it a good choice for doing lookups.

By Garima Dosi · Aug. 19, 16 · Tutorial


What it solves: systems often use keys to provide unique identifiers for transactions or entities within an enterprise, and compact keys bring savings through smaller storage footprints and reduced compute needs. HBase's ability to service millions of requests with good concurrency meets the needs of such a key service. This blog does not discuss HBase or ETL in depth; it assumes the reader has some prior knowledge of both. As a brief refresher, though:

  • HBase stores its data in a column-oriented format. HBase tables consist of rows and row keys, column families, qualifiers, and timestamps. The data is sorted by the byte order of the row keys, and HBase serves it through regions and region servers.
  • In layman's terms, facts are transactional data that are updated frequently and are linked to slowly changing dimension data. For example, in a retail store, all sales transactions or inventory can be fact data, whereas the product in a transaction is dimension data consisting of product ID, type, color, make, and so on. Similarly, time and store can be other dimensions for the same transaction.

Introduction

HBase is the distributed key-value NoSQL store for Hadoop. The ability to store key-value pairs makes it a good choice for doing lookups. A typical ETL process involves processing facts and dimensions; the general requirement is to perform lookups on dimension keys to link them up with fact data. Moreover, if some dimensions are unavailable, the ETL process needs to generate the dimension key using some sort of sequencing logic. To summarize, an ETL process over facts and dimensions involves the following:

  • Start processing fact data, which is received very often.
  • Relate the fact data to dimension identifiers and attributes by looking up the dimensions present in the fact data against the dimension database/store.
  • Generate identifiers for dimensions that do not exist in the dimension store and use them for fact processing (this can also be termed keying).
  • Update the dimension store with new or modified dimensions as and when required.

How does HBase help us to solve this problem?

  1. The HBase table structure is designed to store "unique dimension keys" as row keys and "dimension attributes" as qualifiers in column families. Each dimension key is also assigned a sequence identifier (stored as a qualifier value), unique within that dimension's key space, which replaces the dimension key during lookups and is used for further ETL processing.
  2. One HBase table stores all the dimensions required for a fact.
HBase Dimension Lookup Table

Row Key                                    | Column Family:Qualifier | Qualifier Value
<dimension name> + "unique dimension key"  | colfam:seq_id           | Sequence identifier
<dimension name> + "unique dimension key"  | colfam:dim_attrib       | Dimension attributes
<dimension name>                           | colfam:cntr_val         | Last incremented sequence number for the dimension space (an HBase counter value)
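
For reference, creating such a lookup table with the HBase 1.x Java client could look like the minimal sketch below. The table name (dim_lookup) and column family name (colfam) are illustrative assumptions, not requirements of the design:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateLookupTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // One table holds every dimension; "colfam" is the single column family
            // carrying the seq_id, dim_attrib, and cntr_val qualifiers shown above.
            TableName tableName = TableName.valueOf("dim_lookup");
            if (!admin.tableExists(tableName)) {
                HTableDescriptor desc = new HTableDescriptor(tableName);
                desc.addFamily(new HColumnDescriptor("colfam"));
                admin.createTable(desc);
            }
        }
    }
}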


So, with this table design, fact processing includes the following steps:

  • HBase serves as the lookup mechanism to get dimension values for a particular dimension key: htable.get("dimension key") returns the sequence ID and other attributes for a specific dimension.
  • HBase can generate new dimension keys when they do not exist: htable.incr("dimension") increments a counter value for the given dimension space.
  • The resulting sequence number can be registered for the new dimension with htable.put("dimension key", "dimension attributes & sequence number"), after which lookups on it succeed. (A Java sketch of all three calls follows this list.)
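
To make the three calls concrete, here is a minimal sketch on the HBase 1.x Table API, continuing with the Connection from the setup sketch above. The row key format ("product|P1234") and all concrete values are illustrative assumptions:

// Reusing the Connection from the table-creation sketch above.
Table table = connection.getTable(TableName.valueOf("dim_lookup"));
byte[] CF  = Bytes.toBytes("colfam");
byte[] SEQ = Bytes.toBytes("seq_id");

// 1. Lookup: fetch the sequence id and attributes for a known dimension key.
Result result = table.get(new Get(Bytes.toBytes("product|P1234")));
byte[] seqId = result.getValue(CF, SEQ);

// 2. Key generation: atomically bump the per-dimension counter row.
long newSeq = table.incrementColumnValue(
        Bytes.toBytes("product"), CF, Bytes.toBytes("cntr_val"), 1L);

// 3. Registration: store the new dimension so that later lookups find it.
Put put = new Put(Bytes.toBytes("product|P1234"));
put.addColumn(CF, SEQ, Bytes.toBytes(newSeq));
put.addColumn(CF, Bytes.toBytes("dim_attrib"), Bytes.toBytes("color=red,make=acme"));
table.put(put);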

To summarize, the pseudocode for fact and dimension ETL looks like this:

if htable.exists("dimension key"), then
     key = htable.get("dimension key")
     // Use this key and other dimension attributes as required.
else
     dim_seq = htable.incr("dimension")
     htable.put("dimension key", dim_seq, dim_attributes)
end if


This logic seems fine until you run a distributed job like MapReduce or Spark to do the ETL. In that case, it should be modified to look like this:

if htable.exists("dimension key"), then
     key = htable.get("dimension key")
     // Use this key and other dimension attributes as required.
else
     dim_seq = htable.incr("dimension")
     if not htable.exists("dimension key"), then
          htable.put("dimension key", dim_seq, dim_attributes)
     else
          key = htable.get("dimension key")
          // Use this key and other dimension attributes as required.
     end if
end if


The extra check after generating the new dimension caters to the edge case where two tasks generate a new dimension key for the same dimension simultaneously.
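
Translated to the HBase 1.x Java API (continuing with the table, CF, and SEQ handles from the sketches above; the row and dimension names remain illustrative), the distributed-safe flow looks roughly like this:

byte[] rowKey = Bytes.toBytes("product|P1234");
Result result = table.get(new Get(rowKey));

if (!result.isEmpty()) {
    long key = Bytes.toLong(result.getValue(CF, SEQ));
    // Use this key and the other dimension attributes as required.
} else {
    long dimSeq = table.incrementColumnValue(
            Bytes.toBytes("product"), CF, Bytes.toBytes("cntr_val"), 1L);
    // Re-check: another task may have registered this key in the meantime.
    Result recheck = table.get(new Get(rowKey));
    if (recheck.isEmpty()) {
        Put put = new Put(rowKey);
        put.addColumn(CF, SEQ, Bytes.toBytes(dimSeq));
        table.put(put);
    } else {
        long key = Bytes.toLong(recheck.getValue(CF, SEQ));
        // Use the key registered by the other task instead.
    }
}

Note that a narrow race window still remains between the re-check and the put; HBase's checkAndPut operation, which performs the put only if a cell's current value matches an expected value (null meaning "the cell must not exist"), can close that window atomically if exactly-once registration matters.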

Explicit updates to dimensions are handled along similar lines.

Overall, this seems to be a simple and reasonable solution. However, the logic may not prove efficient when the ETL process generates lots of new keys, because of the contention and locking involved in writes into HBase. To optimize it further:

  • Use the batch request APIs for HBase GET and PUT calls (see the sketch after this list).
  • Avoid the HBase EXISTS call; the existence of a row key can be checked with a GET call instead.
  • Ideally, split the logic above into two parts: lookups (checking whether a dimension key exists) happen in the map phase of a Spark or MapReduce job, while the generation of new dimensions using INCR and PUT calls is deferred to a reduce phase. Both the GETs and PUTs in their respective phases use batched HBase operations.
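
As an illustration of these points, a map task can resolve a whole batch of dimension keys with one multi-get, and a reduce task can register new dimensions with one multi-put. Again a sketch on the HBase 1.x API; the key lists and the "name|key" row format are assumptions carried over from the earlier sketches:

// Map phase: one round trip resolves many dimension keys at once.
List<String> dimKeys = Arrays.asList("product|P1234", "store|S001", "time|20160819");
List<Get> gets = new ArrayList<>();
for (String dimKey : dimKeys) {
    gets.add(new Get(Bytes.toBytes(dimKey)));
}
Result[] results = table.get(gets);   // batched GET
for (int i = 0; i < results.length; i++) {
    // An empty Result doubles as the existence check, so no EXISTS call is needed.
    if (results[i].isEmpty()) {
        // dimKeys.get(i) is a new dimension; route it to the reduce phase for keying.
    }
}

// Reduce phase: register all new dimensions, then issue one batched PUT.
List<Put> puts = new ArrayList<>();
for (String newKey : newDimensionKeys) {   // hypothetical list produced by the map phase
    long dimSeq = table.incrementColumnValue(
            Bytes.toBytes(newKey.split("\\|")[0]),   // dimension name prefix of the row key
            CF, Bytes.toBytes("cntr_val"), 1L);
    Put put = new Put(Bytes.toBytes(newKey));
    put.addColumn(CF, SEQ, Bytes.toBytes(dimSeq));
    puts.add(put);
}
table.put(puts);   // batched PUT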

Apart from these, lookup performance can be improved by following the usual HBase cluster tuning measures: pre-splitting regions, designing row keys to avoid hotspotting, salting HBase tables, and so on. HBase cluster tuning is in itself a separate topic for discussion.
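
For instance, pre-splitting can be done at table creation time by supplying split points. A sketch, reusing admin and desc from the creation sketch above and assuming row keys begin with the dimension name as in the table layout:

// Alternative to admin.createTable(desc): start with one region per key range
// so that load is spread across region servers from the first write.
byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("product"),
        Bytes.toBytes("store"),
        Bytes.toBytes("time")
};
admin.createTable(desc, splitKeys);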

References

  • https://hbase.apache.org/book.html
  • http://blog.cloudera.com/blog/2015/06/how-to-scan-salted-apache-hbase-tables-with-region-specific-key-ranges-in-mapreduce/
  • http://searchdatamanagement.techtarget.com/answer/What-are-the-differences-between-fact-tables-and-dimension-tables-in-star-schemas



Published at DZone with permission of Garima Dosi.
