Improve Apache HBase Performance via Data Serialization With Apache Avro
Taking a thoughtful approach to data serialization can achieve significant performance improvements for HBase deployments.
The question of using tall versus wide tables in Apache HBase is a commonly discussed design pattern (see discussions here and here). However, there are more considerations than that simple choice. Because HBase stores each column of a table as an independent key-value entry in the underlying HFiles, significant storage overhead can occur when storing small pieces of information. For example, storing a simple Boolean column can write 35 (or more) bytes to disk (depending on the key size, column family, and so on). This overhead can become quite significant for overall I/O and network utilization, especially when multiple columns are read and written together within a transaction.
A simple solution is to serialize data that is accessed together: that is, serialize multiple columns and store the serialized data in a single HBase column. In this post, I’ll describe an implementation of this concept by Cloudera’s HBase team and the comparative performance improvements it achieved based on testing.
Implementation
First, let’s consider serialization. To be as efficient as possible in our implementation, we used Apache Avro to serialize the data. Avro uses schemas, which must be presented when reading and writing data. This approach permits each datum to be written with no per-value overhead, making serialization fast and the serialized output compact. (For more information about using Avro, see "Avro Usage" in the CDH documentation.)
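For illustration only, the record type used below (called RandomSchema here) could be described by a schema along the following lines, built with Avro's SchemaBuilder; the field names and namespace are assumptions, and in practice the class would typically be generated from an equivalent .avsc file with the Avro compiler:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Hypothetical schema roughly matching the RandomSchema record used below.
Schema schema = SchemaBuilder.record("RandomSchema")
        .namespace("com.example.hbase")
        .fields()
        .requiredDouble("d0")
        .requiredBoolean("b0")
        // ...additional Double and Boolean fields (8 or 64 in the tests described later)
        .endRecord();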
The serialization function is quite simple:
public static byte[] serialize(RandomSchema aRecord) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Encode the record to the in-memory stream using Avro's binary encoding.
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<RandomSchema> writer = new SpecificDatumWriter<RandomSchema>(aRecord.getSchema());
    try {
        writer.write(aRecord, encoder);
        encoder.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return out.toByteArray();
}
Here is a similar function for deserialization:
public static RandomSchema deserialize(byte[] serializedBytes, Schema mySchema) {
    DatumReader<RandomSchema> reader = new SpecificDatumReader<RandomSchema>(mySchema);
    // Decode the byte array back into a RandomSchema record.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(serializedBytes, null);
    try {
        return reader.read(null, decoder);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
This code allows quick creation of byte arrays to store data in HBase, and quick extraction of that data afterward. Although the serialization adds some CPU overhead, it’s offset by the reduced I/O and network bandwidth (more on that later).
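As a hedged sketch (the row key, column family, and qualifier names below are illustrative assumptions, not the exact code used in the tests), writing and reading a serialized record with the classic HBase client API looks like this:

// Write: all fields travel in one cell instead of one cell per column.
Put put = new Put(Bytes.toBytes(rowKey));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("avro"), serialize(aRecord));
table.put(put);

// Read: fetch the single cell and deserialize it back into the record.
Result result = table.get(new Get(Bytes.toBytes(rowKey)));
byte[] bytes = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("avro"));
RandomSchema record = deserialize(bytes, RandomSchema.getClassSchema());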
Next, we created two identical HBase tables: one to store Avro-serialized byte arrays (the Avro Table) and one to store the exact same volume of data in separate columns (the Raw Table). Tables were created with the following code:
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(AVRO_TABLE_NAME));
HColumnDescriptor family = new HColumnDescriptor(COLUMN_FAMILY);
family.setCompactionCompressionType(Compression.Algorithm.SNAPPY);
desc.addFamily(family);
admin.createTable(desc);
It is important to note that we enabled Snappy compression on the files, as this reduces some of the overhead when there is duplicate data in the HBase keys. Compression is also typical in production, so the results are more realistic.
Testing and Results
Finally, we tested the put throughput of HBase for several million records (puts were done in batches of 100) on the Cloudera QuickStart VM. We also tested get throughput for random keys drawn from the key space.
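The exact benchmark harness isn’t shown here; the following is only a rough sketch of the batched-write pattern described above (batch size, key generation, and record construction are assumptions):

List<Put> batch = new ArrayList<Put>(100);
for (long i = 0; i < numRecords; i++) {
    Put put = new Put(Bytes.toBytes(i));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("avro"), serialize(randomRecord()));
    batch.add(put);
    if (batch.size() == 100) {
        table.put(batch);   // one client call sends the batch of 100 puts
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    table.put(batch);       // send any remainder
}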
The tests were done for Avro record schemas of 8 and 64 fields, with a mix of Double and Boolean fields within the schema. We chose Double because it has fixed-length storage; hence the actual values stored in the record would not affect performance. Results are outlined below. (The results are not intended as an absolute measure of performance, but rather as useful comparative measures.)
The results show how combining multiple columns into a single serialized object can lead to significant savings in both storage and I/O. The larger the number of columns gathered together, the greater the performance benefit and storage savings.
Because Snappy compression was used on both tables, similar keys could be compressed together in the Raw Table. (If Snappy were disabled, the storage savings for Avro would have been much greater.) Furthermore, the HBase RegionServer was given 4GB of RAM, so 1.6GB was available for the memstore. This is important because the first test had close to that amount of data; it is quite possible that the first test showed minimal performance gains for random reads because the data would be cached in memory upon first read, so subsequent reads required no additional I/O. In the later tests, the data could not be held in the memstore and the reduced I/O requirements became more evident.
Finally, we did not use any diff encoding of the keys, as it would be difficult to control for its effect on overall system performance given that diff encoding gains are somewhat dependent on the compaction schedule. We do, however, recommend benchmarking different encoding methods and Avro serializations to determine the ideal performance tuning for your particular deployment.
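If you do experiment with key encodings, they are enabled per column family; the following is only a sketch of what that configuration looks like with the same API used above (it was not part of the tested setup):

HColumnDescriptor family = new HColumnDescriptor(COLUMN_FAMILY);
family.setDataBlockEncoding(DataBlockEncoding.DIFF);   // or FAST_DIFF, PREFIX, etc.
family.setCompactionCompressionType(Compression.Algorithm.SNAPPY);
desc.addFamily(family);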
Keep in Mind
This methodology of aggregating columns may have some significant drawbacks depending on the use case. Specifically, when reading data from an Avro-serialized object, it’s all or nothing: you cannot read just some of the fields (columns) stored in that object. If an Avro-serialized object has 64 columns stored within it, you must read the entire object from disk and send it over the network before you can access any of the data. If you are only interested in one of the columns, this approach can be very wasteful and may negate any performance gains achieved through serialization. Furthermore, because HBase cannot see inside the Avro object, filtering on anything within it is impossible.
With these limitations in mind, a hybrid approach is used in most production deployments: columns that are usually operated upon independently (for reading, updating, or scanning) are kept outside the serialized object, while columns that are operated upon together are serialized into one object for more efficient storage, as in the sketch below. This approach does require forethought about usage patterns, though, and should not be applied arbitrarily, as it may do more harm than good.
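A hedged sketch of that hybrid layout (the column names "status" and "payload" are illustrative assumptions):

Put put = new Put(Bytes.toBytes(rowKey));
// Kept as a plain column so it can be read, updated, and filtered on its own.
put.add(Bytes.toBytes("cf"), Bytes.toBytes("status"), Bytes.toBytes(statusValue));
// Fields that are always read and written together travel as one Avro cell.
put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"), serialize(restOfRecord));
table.put(put);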
Avro serialization itself has some caveats within this scope as well. Because it has some processing overhead, for further CPU performance gains you may choose to aggregate columns with delimited strings (<val1,val2,val3,…valn> stored within one HBase cell) rather than using Avro. This method requires that the delimiter never appear in the data; if that guarantee cannot be made, aggregating columns with delimited strings should not be done. Finally, if Avro objects are used, you must take particular care with schema evolution: if columns are to be added or removed, the original write schema of the data must be remembered, because reading the data requires both the write and read Avro schemas (assuming they differ), as sketched below.
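A minimal sketch of handling that case, assuming the application keeps both schemas available (the variable names are illustrative):

// The reader is constructed with the schema the data was written with (writerSchema)
// and the schema the application expects to read (readerSchema); Avro resolves the
// differences between them, such as added fields with defaults or removed fields.
DatumReader<RandomSchema> reader =
        new SpecificDatumReader<RandomSchema>(writerSchema, readerSchema);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(serializedBytes, null);
RandomSchema record = reader.read(null, decoder);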
Conclusion
Assuming the complexities of this approach and use patterns are carefully considered, data serialization can be a powerful technique for optimizing HBase performance. We look forward to any feedback about your personal experiences.
Published at DZone with permission of Justin Kestelyn. See the original article here.