Cloudera customers are looking to store complex data types in Apache HBase to provide fast retrieval of complex information such as banking transactions, web analytics records, and related metadata associated with those records. Serialization formats such as Apache Avro, Thrift, and Protocol Buffers greatly assist in meeting this goal, but they are difficult for developers to work with and debug due to their binary format.
The HBase shell allows the use of a custom formatter for interpreting and presenting such binary types. A formatter is useful because the HBase shell, by default, only presents a string representation of the stored bytes to a user when a table is queried. In some cases, this basic representation is fine; however, in cases where more complex data serialization methods are used, this basic formatting lacks clarity.
Figure 1. HBase Shell default output format
There is, however, a simple solution. In this post, I will describe and demonstrate the use of a custom formatter that de-serializes an Avro type (in this case a customer profile), and displays the content to the user within the HBase shell as a human-readable string.
Whilst this post focuses on Avro serialization and deserialization, you can use an HBase custom formatter to perform other operations such as reformatting encoded values, unpacking legacy binary types, and extracting content from XML or JSON documents (making them easily comprehensible by users). In addition, other data serialization formats, such as Thrift and Protocol Buffers, can be used with the same approach.
Most end-user applications can map complex domain objects such as structured log records or financial transactions into distinct columns and their respective column families within HBase. They often do this through a data access layer (DAL), or a repository pattern when following domain-driven design (DDD), which abstracts away persistence responsibility and allows the consuming code to be concerned only with operations on domain objects and events. For example, many Cloudera customers find it advantageous to store related data atomically as structured Avro records.
There are many benefits to using Avro for data serialization:
- Schema management
- Sharing complex data types between producing and consuming applications
- Schema evolution, which gives you control over compatibility across different versions and consumers of your data
- Being explicit about the interfaces and contracts between components leading to predictability and safer code
- A compact format, both in storage and on the wire, compared with JSON or XML
- Atomic put of arbitrarily nested information
- Fast retrieval of related information in documents
- Fast persistence of related information in documents
- Compact and efficient transfer of real-time data
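To make the compactness claim concrete, here is a minimal sketch that hand-encodes a record following the rules of the Avro binary encoding: longs are written as zig-zag varints, strings as a length followed by UTF-8 bytes, and record fields are concatenated in schema order. The two-field record (id, name) and the class name are assumptions made for illustration only; real applications would use the Avro library rather than code like this.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hand-rolled illustration of Avro's binary layout; not a replacement for the Avro library. */
final class AvroSizeSketch {

    /** Writes a long as a zig-zag varint, the encoding Avro uses for int and long. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long zigzag = (value << 1) ^ (value >> 63); // small magnitudes become small unsigned values
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80)); // 7 payload bits; high bit means "more follows"
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    /** Writes a string as its UTF-8 byte length (as a long) followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Encodes a hypothetical record (id: long, name: string) in schema order. */
    static byte[] encodeProfile(long id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, id);
        writeString(out, name);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] avro = encodeProfile(42L, "Ada");
        String json = "{\"id\":42,\"name\":\"Ada\"}";
        System.out.println("Avro bytes: " + avro.length);                                   // prints 5
        System.out.println("JSON bytes: " + json.getBytes(StandardCharsets.UTF_8).length); // prints 22
    }
}
```

Field names never appear in the binary form because the schema carries them, which is where most of the saving comes from in this small example.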
Subsequently, applications can persist such complex objects as serialized Avro to an HBase column with a simple HBase put of the bytes that represent the serialized object. The object can then be stored with a row key that supports the access pattern required by the application. (For more information regarding the importance of access patterns when designing for HBase, refer to the excellent book HBase: The Definitive Guide by Lars George.)
For example, if the application is persisting customer profiles into a column as an Avro document, the customer’s identifier could be used as the HBase row key, and an entire customer profile, including its related nested information, can be retrieved with a minimal number of read operations. Contrast this approach with the performance overhead of joining multiple master-to-child records in a traditional normalized RDBMS to obtain a customer profile and associated child records, such as product purchase history or web page visits.
Figure 2. HBase de-normalization versus RDBMS normalization
Alternatively, complex transactions with related information can be persisted as Avro and retrieved by scan operations in HBase. Again, no join operations are required to retrieve arbitrarily nested and related information as the requisite data is stored as a self-contained document; data retrieval is fast.
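The access patterns described above can be sketched with an in-memory stand-in. HBase keeps rows sorted by row key, so a sorted map is a reasonable toy model: a get is a point lookup and a scan is a range read over adjacent keys. The class name and the row key layout (customer identifier, a separator, then a transaction identifier) are hypothetical choices for this sketch and do not come from the example project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of one HBase column: row keys in sorted order, each holding one serialized document. */
final class RowKeyAccessSketch {
    // For plain ASCII keys, String ordering matches HBase's byte-wise row key ordering.
    private final TreeMap<String, byte[]> rows = new TreeMap<>();

    void put(String rowKey, byte[] document) {
        rows.put(rowKey, document);
    }

    /** Point lookup, analogous to an HBase get by row key. */
    byte[] get(String rowKey) {
        return rows.get(rowKey);
    }

    /** Range read, analogous to an HBase scan from startRow (inclusive) to stopRow (exclusive). */
    List<String> scanKeys(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        RowKeyAccessSketch table = new RowKeyAccessSketch();
        table.put("cust42#txn0002", new byte[] {2});
        table.put("cust43#txn0001", new byte[] {3});
        table.put("cust42#txn0001", new byte[] {1});

        // All transactions for customer 42 sit next to each other, so one range read returns them.
        System.out.println(table.scanKeys("cust42#", "cust43")); // prints [cust42#txn0001, cust42#txn0002]
    }
}
```

Each value is a complete document, so neither the lookup nor the range read needs to consult any other row to assemble the result.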
In the numerous organizations where we see this pattern used successfully, a common requirement is to be able to read the Avro document within the HBase shell for development, testing, and debugging in pre-production environments. By default, if you retrieve such an Avro document using the HBase shell, it will display a string representation of the raw bytes of the serialized Avro document, making its contents difficult to decipher and increasing the cognitive load on the user.
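To see why the default output is hard to read, consider this small sketch of an escaped-bytes rendering: printable ASCII is kept and every other byte is shown as \xNN. It is similar in spirit to what HBase does with raw cell values, but it is an approximation written for illustration, and the sample bytes are a hypothetical record rather than data from the example project.

```java
/** Approximates how raw cell bytes look when rendered as an escaped string. */
final class RawBytesSketch {

    /** Keeps printable ASCII as-is and escapes every other byte as \xNN. */
    static String toEscapedString(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int ch = b & 0xFF; // treat the byte as unsigned
            if (ch >= ' ' && ch <= '~' && ch != '\\') {
                sb.append((char) ch);
            } else {
                sb.append(String.format("\\x%02X", ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical record in Avro binary form: a long 42 followed by the string "Ada".
        byte[] cell = {0x54, 0x06, 0x41, 0x64, 0x61};
        System.out.println(toEscapedString(cell)); // prints T\x06Ada
    }
}
```

The string field survives, but the numeric field and the length prefix show up as stray characters and escapes, and nothing in the output says which field is which.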
In this situation, you can instead employ a custom formatter to present the serialized Avro document however you choose.
Avro schemas are specified using JSON and are composed of primitive and complex types. For this example, you will use a schema for a hypothetical customer profile that supports online marketing and personalization activity. (The source code for this example can be found here.)
In this example, the Avro schema can be found in the file
Avro provides code generation utilities that greatly simplify development by creating the necessary supporting classes. For example, compiling a schema is trivial:
In this example:
java -jar avro-tools-1.7.7.jar compile schema customerProfile.avsc .
The Avro Maven plugin can also be used for code generation and integration of these stages into your continuous integration and delivery pipelines.
Here, assume that the example HBase table contains a column for the serialized bytes of such documents and that this table has been populated with data from your application. The example code demonstrates the population of such an HBase table in code using the
You can easily de-serialize these documents using Avro; for example:
Thus, just de-serialize the bytes into an object representation using the Avro binary decoder.
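In real code this step is handled by the Avro library (a DatumReader together with a binary decoder obtained from DecoderFactory) and the classes generated from your schema. To show what the decoder does with the bytes, the sketch below hand-decodes a hypothetical two-field record (id: long, name: string) into a readable string. The record layout, class name, and method names are assumptions for illustration and are not the example project's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hand-decodes a hypothetical (id: long, name: string) record from Avro binary form. */
final class AvroDecodeSketch {

    /** Reads a zig-zag varint long, the encoding Avro uses for int and long. */
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            raw |= (long) (b & 0x7F) << shift; // low 7 bits carry the payload
            shift += 7;
        } while ((b & 0x80) != 0);             // high bit set means another byte follows
        return (raw >>> 1) ^ -(raw & 1);       // undo the zig-zag mapping
    }

    /** Reads a string: a long byte length followed by that many UTF-8 bytes. */
    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[(int) readLong(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Decodes the fields in schema order and renders them as a human-readable string. */
    static String decodeProfile(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        long id = readLong(in);
        String name = readString(in);
        return "{\"id\": " + id + ", \"name\": \"" + name + "\"}";
    }

    public static void main(String[] args) {
        byte[] cell = {0x54, 0x06, 0x41, 0x64, 0x61};
        System.out.println(decodeProfile(cell)); // prints {"id": 42, "name": "Ada"}
    }
}
```

The decoder only works because it knows the schema: the bytes carry no field names or type tags, so the reader must read fields in exactly the order the writer's schema defines them.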
Configuring a method to be called by the JRuby shell couldn’t be easier. For this example, simply delegate the call to deserializeFromAvro through a
Once packaged, the Customer Profile formatter function can operate on a column at the point of executing a query. Just copy the JAR to a known location in the local filesystem of the machine from which you will use the HBase shell.
Register the JAR with the HBase shell by starting it and executing the following command:
You can now use the formatter to inspect your data using the following syntax; your customer data type will be reformatted as it is printed to the shell output:
Figure 3. Custom HBase formatting for scan
Figure 4. Custom HBase formatting for get
The custom formatter can now represent the Avro data in a human-readable format.
This post has described how to use custom formatters in the HBase shell to visualize data stored in different data serialization formats. Ideally, these types of tools will help your team more readily adopt these efficient storage formats in your own applications.
Robert Siwicki is a Solutions Architect at Cloudera.