Data Model
Table
Applications store data into an HBase table. Tables are made of rows and columns. Table cells — the intersection of row and column coordinates — are versioned.
Cell Value
A {row, column, version} tuple precisely specifies a cell in HBase.
Versions
An unbounded number of cells can share the same row and column and differ only in their version dimension. A version is specified as a long integer. The HBase version dimension is stored in decreasing order, so when reading from a store file, the most recent values are found first.
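To make the {row, column, version} addressing concrete, here is a minimal in-memory sketch. It is a toy model, not the HBase client API: the class name `VersionedCells` and its methods are hypothetical, and the sample values are made up for illustration.

```python
from collections import defaultdict


class VersionedCells:
    """Toy in-memory model of versioned cells (hypothetical, not HBase code)."""

    def __init__(self):
        # (row, column) -> {version: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, version, value):
        # Writing an existing version replaces only that version.
        self._cells[(row, column)][version] = value

    def get(self, row, column, version=None):
        """Return the value at an exact version, or the newest if version is None."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def versions(self, row, column):
        """List (version, value) pairs newest first, mirroring the decreasing order."""
        cell = self._cells.get((row, column), {})
        return sorted(cell.items(), reverse=True)


table = VersionedCells()
table.put(b"row1", b"courses:math", 1, b"B")
table.put(b"row1", b"courses:math", 3, b"A")
table.put(b"row1", b"courses:math", 2, b"A-")
```

Reading without a version returns the value written at version 3, because the newest version is found first.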
Row Key
Table row keys are byte arrays, so almost anything can serve as a row key, from strings to binary representations of longs or even serialized data structures.
Rows are lexicographically sorted with the lowest order appearing first in a table. The empty byte array is used to denote both the start and end of a table's namespace. All table accesses are via the table row key — its primary key.
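Lexicographic ordering of byte arrays can be surprising when keys embed numbers. The sketch below uses Python `bytes`, which also compare byte by byte, to illustrate the effect. The key formats are illustrative assumptions, and the binary encoding shown only sorts numerically for non-negative values.

```python
import struct


def long_key(n):
    """Encode a non-negative integer as an 8-byte big-endian row key."""
    return struct.pack(">q", n)


# String keys sort byte by byte, so b"row10" lands before b"row2".
sorted_string_keys = sorted([b"row2", b"row10", b"row1"])

# Zero-padding the number restores numeric order for string keys.
padded_keys = sorted(b"row%04d" % n for n in (2, 10, 1))

# Fixed-width big-endian longs also sort in numeric order.
binary_keys = sorted(long_key(n) for n in (2, 10, 1))
decoded = [struct.unpack(">q", k)[0] for k in binary_keys]
```

Here `sorted_string_keys` is `[b"row1", b"row10", b"row2"]`, while the padded and binary variants come back in numeric order.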
Columns & Column Families
Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, the courses:history and courses:math columns are both members of the courses column family. Physically, all column family members are stored together in the filesystem. Because tuning and storage specifications are done at the column family level, it is recommended that all column family members have the same general access pattern and size characteristics.
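As a small illustration of the prefix convention, the sketch below splits full column names into family and member and groups them by family. The helper names are hypothetical, and `info:name` is an invented column added only to show a second family.

```python
from collections import defaultdict


def split_column(column):
    """Split b'family:member' at the first colon."""
    family, _, member = column.partition(b":")
    return family, member


def group_by_family(columns):
    """Group column members under their column family prefix."""
    grouped = defaultdict(list)
    for column in columns:
        family, member = split_column(column)
        grouped[family].append(member)
    return dict(grouped)


columns = [b"courses:history", b"courses:math", b"info:name"]
families = group_by_family(columns)
```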
Schema Creation & Updating
Tables are declared up front at schema-definition time using the HBase shell or Java API (see earlier sections). Column families are defined at table-creation time. It is possible to alter a table and add new column families, but the table must be disabled at altering time.
When changes are made to either tables or column families (e.g., region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.
Row-Key Design
Try to keep row keys short: they are stored with each cell in an HBase table, so noticeably reducing row-key size reduces the amount of storage HBase data needs. This advice also applies to column family names. The common trade-offs between sequential row keys and randomly distributed row keys are:
Design Solution | Pros | Cons |
Using sequential row keys (e.g. time-series data with a row key built from the timestamp) | Allows fast range scans by setting start/stop keys on the Scanner | Creates a single-RegionServer hotspot on writes (because row keys arrive in sequence, all records are written to a single region at a time) |
Using randomly distributed row keys (e.g. UUIDs) | Aims for the fastest write performance by distributing new records over random regions | Does not allow fast range scans over the written data |
Some mixed-design approaches allow fast range scans while distributing data across the cluster when writing data that is sequential by nature. One ready-to-use solution is HBaseWD: https://github.com/sematext/HBaseWD.
Row-key design is essential to gaining maximum performance when using HBase to store your application's data. For more detail, see the HBase Reference Guide: http://hbase.apache.org/book.html#rowkey.design
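One common mixed design is to prefix each sequential key with a small bucket number, so writes spread over several key ranges while a range scan is still possible by scanning every bucket and merging. The sketch below illustrates that general idea only; it is not HBaseWD's implementation, and the bucket count, hash choice, and function names are all assumptions.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count


def salted_key(key, buckets=NUM_BUCKETS):
    """Prefix the key with a one-byte bucket derived from a checksum of the key."""
    bucket = zlib.crc32(key) % buckets
    return bytes([bucket]) + key


def original_key(salted):
    """Strip the one-byte bucket prefix."""
    return salted[1:]


def range_scan(store, start, stop, buckets=NUM_BUCKETS):
    """Scan [start, stop) inside every bucket, then merge back into key order."""
    hits = []
    for bucket in range(buckets):
        prefix = bytes([bucket])
        lo, hi = prefix + start, prefix + stop
        hits.extend(original_key(k) for k in store if lo <= k < hi)
    return sorted(hits)


# Sequential, timestamp-like keys (fixed width so they sort numerically).
keys = [b"%010d" % t for t in range(1000, 1020)]
store = sorted(salted_key(k) for k in keys)
result = range_scan(store, b"%010d" % 1005, b"%010d" % 1010)
```

The cost of this design is that one logical range scan becomes one scan per bucket, so the bucket count trades write distribution against read fan-out.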
Column Families
Currently, HBase does not do well with anything above two or three column families per table. With that said, keep the number of column families in your schema low. Try to make do with one column family in your schemata if you can. Only introduce a second and third column family in the case where data access is usually column-scoped; i.e. you usually query no more than a single column family at one time.
You can also set a TTL (time to live, in seconds) for a column family. HBase automatically deletes rows once they reach the expiration time.
Versions
The maximum number of row versions that can be stored is configured per column family (the default is 3). This is an important parameter because HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Setting the number of maximum versions to an exceedingly high level (e.g., hundreds or more) is not a good idea because that will greatly increase StoreFile size.
The minimum number of row versions to keep can also be configured per column family (the default is 0, meaning the feature is disabled). This parameter is used together with the TTL and maximum row versions parameters to allow configurations such as "keep the last T minutes' worth of data, with at least M versions and at most N versions." This parameter should only be set when TTL is enabled for a column family, and it must be less than the maximum number of row versions.
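The interaction of TTL, minimum versions, and maximum versions can be sketched as a small function. This is a simplified reading of the rule described above, not HBase's compaction logic: the function name is hypothetical, and timestamps and TTL are both expressed in seconds for simplicity.

```python
def surviving_versions(timestamps, now, ttl, min_versions=0, max_versions=3):
    """Return the version timestamps that remain, newest first.

    Assumed semantics (a simplification):
    - never keep more than max_versions;
    - drop versions older than ttl seconds, except the newest min_versions.
    """
    newest_first = sorted(timestamps, reverse=True)[:max_versions]
    kept = []
    for index, ts in enumerate(newest_first):
        expired = now - ts > ttl
        if not expired or index < min_versions:
            kept.append(ts)
    return kept


# Recent writes: the TTL and the maximum apply, the minimum is not needed.
recent = surviving_versions([100, 400, 650, 800, 950], now=1000, ttl=300)

# Only old writes: everything has expired, but the minimum keeps the newest two.
stale = surviving_versions([100, 200, 300], now=1000, ttl=300, min_versions=2)
```

In the first call only the two versions written within the last 300 seconds remain; in the second, the two newest versions survive despite having expired.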
Data Types
HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input can be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.
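Because values are opaque bytes, the application decides how to encode and decode them. The sketch below shows a few illustrative encodings; the helper names are hypothetical and the encodings (UTF-8, 8-byte big-endian, JSON) are assumptions rather than anything the store mandates.

```python
import json
import struct


def encode_str(text):
    return text.encode("utf-8")


def decode_str(raw):
    return raw.decode("utf-8")


def encode_long(number):
    # Fixed-width 8-byte big-endian signed integer.
    return struct.pack(">q", number)


def decode_long(raw):
    return struct.unpack(">q", raw)[0]


def encode_object(obj):
    # Any serialization works as long as the reader uses the same one.
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def decode_object(raw):
    return json.loads(raw.decode("utf-8"))


stored = {
    b"name": encode_str("Ada"),
    b"credits": encode_long(42),
    b"profile": encode_object({"year": 2, "major": "math"}),
}
```

Nothing in the stored bytes records which encoding was used, so the reader has to know which decoder to apply to each column.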
One supported data type that deserves special mention is the "counters" type. This type enables atomic increments of numbers.
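The value of an atomic increment is that concurrent writers never lose updates to a read-modify-write race. The sketch below models that behaviour locally with a lock; it is a conceptual illustration only, the `CounterCell` class is hypothetical, and storing the count as an 8-byte big-endian long is an assumption made for the example.

```python
import struct
import threading


class CounterCell:
    """Toy counter whose value is kept as 8 bytes, incremented under a lock."""

    def __init__(self):
        self._value = struct.pack(">q", 0)
        self._lock = threading.Lock()

    def increment(self, amount=1):
        # Read, add, and write back as one indivisible step.
        with self._lock:
            current = struct.unpack(">q", self._value)[0]
            updated = current + amount
            self._value = struct.pack(">q", updated)
            return updated

    def get(self):
        return struct.unpack(">q", self._value)[0]


counter = CounterCell()


def worker():
    for _ in range(1000):
        counter.increment()


threads = [threading.Thread(target=worker) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With four workers each adding 1000, the final count is 4000 because no increment can interleave with another.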