Improve Apache HBase Performance via Data Serialization With Apache Avro

A thoughtful approach to data serialization can yield significant performance improvements for HBase deployments.

By Justin Kestelyn · May 17, 2016 · Opinion

The choice between tall and wide tables in Apache HBase is a commonly discussed design question (see reference here and here). However, there are more considerations than that simple choice alone. Because HBase stores each column value as an independent key-value entry in the underlying HFiles, significant storage overhead can occur when storing small pieces of information. For example, storing a simple Boolean column can consume 35 (or more) bytes on disk (depending on the key size, column family, and so on). This overhead can become quite significant for overall I/O and network utilization, especially if multiple columns are read and written together within a transaction.

A simple solution is to serialize data that is accessed together: that is, serialize multiple columns and store the serialized data in a single HBase column. In this post, I’ll describe an implementation of this concept by Cloudera’s HBase team and the comparative performance improvements it achieved based on testing.

Implementation

First, let’s consider serialization. To be as efficient as possible in our implementation, we used Apache Avro to serialize the data. Avro uses schemas, which must be presented when reading and writing data. This approach permits each datum to be written with no per-value overheads, making serialization both fast and small. (For more information about using Avro, see "Avro Usage" in the CDH documentation.)
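
For illustration, here is one way the record type used in the snippets below might be defined. The RandomSchema class is assumed to be a SpecificRecord generated from an Avro schema; the field names and namespace in this sketch are hypothetical, chosen only to mirror the Double/Boolean mix used later in the tests.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class RandomSchemaDefinition {
    // Hypothetical schema: a record holding a mix of double and boolean fields.
    // Running an equivalent .avsc file through Avro's code generator would produce
    // the RandomSchema SpecificRecord class used in the serialize/deserialize code.
    public static Schema buildSchema() {
        return SchemaBuilder.record("RandomSchema")
                .namespace("com.example.hbase")   // hypothetical namespace
                .fields()
                .requiredDouble("measurement1")
                .requiredBoolean("flag1")
                .requiredDouble("measurement2")
                .requiredBoolean("flag2")
                .endRecord();
    }
}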

The serialization function is quite simple:

// Reusable encoder, kept as a static field so it can be passed back to the factory for reuse
private static BinaryEncoder encoder = null;

public static byte[] serialize(RandomSchema aRecord) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Reuse the existing encoder if one is available; otherwise a new one is created
    encoder = EncoderFactory.get().binaryEncoder(out, encoder);
    DatumWriter<RandomSchema> writer = new SpecificDatumWriter<RandomSchema>(aRecord.getSchema());
    try {
        writer.write(aRecord, encoder);
        encoder.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return out.toByteArray();
}

Here is a similar function for deserialization:

// Reusable decoder, kept as a static field (mirrors the encoder above)
private static BinaryDecoder decoder = null;

public static RandomSchema deserialize(byte[] serializedBytes, Schema mySchema) {
    DatumReader<RandomSchema> reader = new SpecificDatumReader<RandomSchema>(mySchema);
    decoder = DecoderFactory.get().binaryDecoder(serializedBytes, null);
    try {
        return reader.read(null, decoder);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

This code makes it quick to create byte arrays for storage in HBase and to extract the data again afterward. Although the serialization adds some CPU overhead, it’s offset by the reduced I/O and network bandwidth (more on that later). A minimal round-trip sketch follows.
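
To make the flow concrete, here is a minimal round-trip sketch showing the serialized bytes being written to, and read back from, a single HBase cell. The table, column family, and qualifier names are placeholders, the HBase 1.x client API is assumed, and imports are omitted in keeping with the snippets above.

public static void roundTripExample() throws IOException {
    // Assumed placeholders: table "avro_table", family "cf", qualifier "avro".
    byte[] family = Bytes.toBytes("cf");
    byte[] qualifier = Bytes.toBytes("avro");

    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("avro_table"))) {

        // Write: serialize the whole record and store it in one cell.
        RandomSchema record = new RandomSchema();   // populate its fields as needed
        Put put = new Put(Bytes.toBytes("row-1"));
        put.addColumn(family, qualifier, serialize(record));
        table.put(put);

        // Read: fetch the single cell and deserialize it back into a record.
        Result result = table.get(new Get(Bytes.toBytes("row-1")));
        byte[] bytes = result.getValue(family, qualifier);
        RandomSchema roundTripped = deserialize(bytes, record.getSchema());
    }
}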

Next, we created two identical HBase tables: one to store Avro-serialized byte arrays (the Avro Table), and one to store the exact same data in separate columns (the Raw Table). The tables were created with the following code:

// Shown for the Avro table; the Raw table was created identically.
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(AVRO_TABLE_NAME));
HColumnDescriptor family = new HColumnDescriptor(COLUMN_FAMILY);
// Enable Snappy compression on the column family's files
family.setCompactionCompressionType(Compression.Algorithm.SNAPPY);
desc.addFamily(family);
admin.createTable(desc);

It is important to note that we enabled Snappy compression on the files, as this reduces some of the overhead when there is duplicate data in the HBase keys. Compression is also typical in production, so the results are more realistic.

Testing and Results

Finally, we tested HBase put throughput for several million records (puts were done in batches of 100) on the Cloudera QuickStart VM. We also tested get throughput for random keys drawn from the key space.
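
As a rough sketch of that batching pattern, reusing the table, family, and qualifier from the round-trip sketch above (the row-key scheme and the randomRecord helper are assumptions for illustration, not the exact benchmark code):

// Accumulate puts and flush them to the table in batches of 100.
List<Put> batch = new ArrayList<Put>(100);
for (long i = 0; i < numRecords; i++) {
    Put put = new Put(Bytes.toBytes("row-" + i));
    put.addColumn(family, qualifier, serialize(randomRecord()));   // randomRecord() is a hypothetical record generator
    batch.add(put);
    if (batch.size() == 100) {
        table.put(batch);   // Table.put(List<Put>) sends the whole batch in one call
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    table.put(batch);       // flush any remainder
}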

The tests were done with Avro record schemas of 8 and 64 fields, containing a mix of Double and Boolean values. We chose Double because it has fixed-length storage, so the actual values stored in a record do not affect performance. Results are outlined below. (The results are not intended as an absolute measure of performance, but rather as useful comparative measures.)

[Results table: put/get throughput and storage comparison between the Avro and Raw tables]

The results show how serializing multiple columns into a single HBase column can lead to significant savings in both storage and I/O. The larger the number of columns gathered together, the greater the performance benefit and storage savings.

As Snappy compression was used on both tables, similar keys could be compressed together on the Raw table. (Had Snappy been disabled, the storage savings for Avro would have been much greater.) Furthermore, the HBase RegionServer was given 4GB of RAM, of which 1.6GB was available for the memstore. This matters because the first test involved close to that amount of data: it is quite possible that the first test showed minimal gains for random reads because the data was cached in memory after the first read, so subsequent reads required no additional I/O. In the later tests, the data could no longer fit in the memstore, and the reduced I/O requirements became more evident.

Finally, we did not use any diff encoding of the keys, as it would be difficult to control overall system performance given that diff-encoding gains are somewhat dependent on the compaction schedule. We do, however, recommend benchmarking different encoding methods and Avro serializations to determine the ideal performance tuning for your particular deployment.
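
If you do benchmark key encodings, data block encoding is configured per column family at table-creation time. A minimal sketch, reusing desc and COLUMN_FAMILY from the table-creation snippet above (DIFF is just one option to try, not a recommendation):

// Enable DIFF data block encoding on the column family before creating the table.
// FAST_DIFF and PREFIX are other encodings worth benchmarking.
HColumnDescriptor family = new HColumnDescriptor(COLUMN_FAMILY);
family.setDataBlockEncoding(DataBlockEncoding.DIFF);
family.setCompactionCompressionType(Compression.Algorithm.SNAPPY);
desc.addFamily(family);
admin.createTable(desc);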

Keep in Mind

This methodology of aggregating columns may have some significant drawbacks depending on the use case. Specifically, when reading data from an Avro-serialized object, it’s all or nothing: you cannot read just some of the fields (columns) stored in that object. If an Avro-serialized object holds 64 columns, you must read the entire object from disk and send it over the network before you can access any of the data. If you are only interested in one of those columns, this approach can be very wasteful and may negate any performance gains achieved through serialization. Furthermore, because HBase cannot read inside the Avro object, server-side filtering on anything within the object is impossible.

With these limitations in mind, a hybrid approach is used in most production deployments. Columns that are usually operated upon independently (for reading, updating, or scanning) are kept outside the serialized object, while columns that are operated upon together are serialized into one object for more efficient storage, as sketched below. This approach does require forethought about use patterns, though, and should not be applied arbitrarily, as it may do more harm than good.
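
As an illustration of such a hybrid row layout (the column names here are hypothetical), a field that is frequently read or filtered on its own stays a plain column, while the rest of the record travels together inside one Avro-serialized cell:

// Hybrid layout: "status" is accessed independently, so it remains a plain column;
// the remaining fields are stored together in a single Avro-serialized cell.
Put put = new Put(Bytes.toBytes("row-1"));
put.addColumn(family, Bytes.toBytes("status"), Bytes.toBytes("ACTIVE"));
put.addColumn(family, Bytes.toBytes("avro"), serialize(record));
table.put(put);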

Avro serialization itself has some caveats within this scope as well. Because it carries some processing overhead, for further CPU performance gains you may choose to aggregate columns as delimited strings (<val1,val2,val3,…valn> stored within one HBase cell) rather than using Avro. This method requires that the delimiter never appears in the data; if that guarantee cannot be met, delimited-string aggregation should not be used. Finally, if Avro objects are used, you must take particular care with schema evolution: if columns are added or removed, the original write schema of the data must be remembered, as reading the data requires both the writer's and the reader's Avro schemas (assuming they differ).
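
For the schema-evolution case, Avro's specific reader can be given both schemas explicitly. A minimal sketch, assuming writerSchema is the schema the data was originally written with and readerSchema is the current, evolved schema:

public static RandomSchema deserializeEvolved(byte[] serializedBytes,
                                              Schema writerSchema,
                                              Schema readerSchema) {
    // Avro resolves the differences between the two schemas at read time,
    // so fields added or removed since the write are handled by the usual
    // Avro schema-resolution rules (e.g., defaults for newly added fields).
    DatumReader<RandomSchema> reader =
            new SpecificDatumReader<RandomSchema>(writerSchema, readerSchema);
    BinaryDecoder d = DecoderFactory.get().binaryDecoder(serializedBytes, null);
    try {
        return reader.read(null, d);
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}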

Conclusion

Assuming the complexities of this approach and use patterns are carefully considered, data serialization can be a powerful technique for optimizing HBase performance. We look forward to any feedback about your personal experiences.

Tags: Avro, Database, Data (computing), Serialization, Apache HBase

Published at DZone with permission of Justin Kestelyn. See the original article here.

Opinions expressed by DZone contributors are their own.
