What it solves: systems often use keys to provide unique identifiers for transactions or entities within the firm and provide savings via compact storage and reduced compute needs. HBase's ability to service millions of requests with good concurrency meets the needs of a key service. This blog does not discuss HBase or ETL as such. It assumes that the reader has some prior knowledge of both. However, just as a briefing exercise:
- HBase stores its data in a columnar format. HBase tables consist of rows & row keys, column families, qualifiers and timestamps etc. The data is sorted in the order of bytes of row keys. HBase serves its data through regions and region servers.
- In layman terms, facts are transactional data which are updated frequently and they are linked to slowly changing dimension data. For example, in a retail store all sales transactions or inventory can be fact data whereas products in the transaction is a dimension data consisting of product id, type, color, make etc. Similarly, time and store can be other dimensions for the same transaction.
HBase is the distributed key value NoSQL store for Hadoop. The ability to store key value pairs makes it a good choice for doing lookups. A typical ETL process involves processing of facts and dimensions; the general requirement is to perform lookups on dimension keys to link them up with fact data. More to it, if some dimensions are unavailable, the ETL process would require to generate the dimension key using some sort of a sequencing logic. To summarize, an ETL process of facts and dimensions would involve the following:
- Start processing fact data which is received very often.
- Relate the fact data to dimension identifiers and attributes by looking up dimensions available in the fact data from the dimension database/store.
- Generate identifiers for dimensions which do not exist in the dimension store and use them for fact processing. (Can also be termed as keying).
- Update dimension data with the new or modified dimensions as and when required.
How does HBase help us to solve this problem?
- The HBase table structure is designed to store “unique dimension keys” as row keys and “dimension attributes” as qualifiers in column families. Each dimension key is also assigned a “sequence identifier (stored as a qualifier value)” unique to each dimension’s space which is replaced during lookup of the dimension. The sequence identifier is used for further ETL processing.
- One HBase table stores all dimensions required for a fact.
|HBase Dimension Lookup Table
|<dimension name>+”unique dimension key”
|<dimension name>+”unique dimension key”
||Last incremented sequence number for a dimension space. This is a hbase counter value.
So, with this table design fact processing would include the following steps:
- HBase serves as the lookup mechanism to get dimension values against a particular dimension key. So, htable.get(“dimension key”) will return the sequence id and other attributes for a specific dimension.
- HBase can be used to generate new dimension keys, if they do not exist. So, htable.incr(“dimension”) can be used to increment a counter value for the given dimension space.
- This sequence number can be used for the new dimension by registering it as a new value using htable.put(“dimension key”,”dimension attributes & sequence number”) and then performing lookups on it.
To summarize, the pseudocode for ETL fact and dimension would look like so:
if htable.exists(“dimension key”), then key = htable.get(“dimension key”) // Use this key and other dimension attributes as required. else dim_seq = htable.incr(“dimension”) htable.put(“dimension key”,dim_seq, dim_attributes) end if
This logic seems to be fine unless you run a distributed job like MapReduce or Spark to do the ETL process. In that case, the logic should be modified to look like this:
if htable.exists(“dimension key”), then key = htable.get(“dimension key”) // Use this key and other dimension attributes as required. else dim_seq = htable.incr(“dimension”) if not htable.exists(“dimension key”), then htable.put(“dimension key”,dim_seq, dim_attributes) else key = htable.get(“dimension key”) // Use this key and other dimension attributes as required. end if end if
The extra check after generating the new dimension caters to the edge cases where two tasks might have generated a new dimension key for the same dimension simultaneously.
Updates for dimensions explicitly are also handled on similar lines.
Overall, this seems to be a simple and reasonable solution. However, the problem is this logic might not prove to be efficient when lots of new keys are generated by the ETL process because of the contention and locking done while doing writes into HBase. To optimize it further:
- Use batch request APIs for HBase GET & PUT calls.
- Avoid using HBase EXISTS call as existence of a row key can also be checked using a GET call.
- Ideally, split the logic mentioned above into two parts – lookups or existence of a dimension key can be checked in the map phase of a Spark or a MapReduce job and generation of new dimensions using INCR and PUT call can be deferred to a reduce phase. Both GET and PUT in their respective phases are done by using batched HBase operations.
Apart from these, the lookup performance can be improved by following normal HBase cluster tuning measures like pre-splitting regions, avoiding hotspotting by designing row keys efficiently, salted hbase tables etc. HBase cluster tuning is in itself a separate topic for discussion.