What are the performance differences between in-memory columnar databases like SAP HANA and GridGain’s In-Memory Database (IMDB), which uses distributed key-value storage? This question comes up regularly in conversations with our customers, and the answer is not obvious.
First off, let’s clearly state that we are talking about the storage model only and its implications on performance for various use cases. It’s important to note that:
- The storage model doesn’t dictate or preclude particular transactionality or consistency guarantees. There are columnar databases that support ACID (HANA) and those that don’t (HBase). There are distributed key-value databases that support ACID (GridGain) and those that don’t (for example, Riak and Memcached).
- The storage model doesn’t dictate the specific query language used. Using the above examples, GridGain and HANA support SQL while HBase, for example, doesn’t.
Unlike transactionality and query languages, performance considerations are not that straightforward.
Note also: SAP HANA has a pluggable storage model and an experimental row-based storage implementation. We’ll concentrate on columnar storage, which apparently accounts for all HANA usage at this point.
HANA’s Columnar Storage Model
Let’s recall what the columnar storage model entails in general and note its HANA specifics.
Some of its standout characteristics include:
- Data is kept in columns (versus rows, as in row-based storage models).
- Since data in a single column is almost always homogeneous, it’s frequently compressed for storage (especially in in-memory systems like HANA).
- Aggregate functions (i.e. column functions such as SUM or AVG) are very fast in the columnar data model, since the entire column can be fetched very quickly and indexed effectively.
- Inserts, updates and row functions, however, are significantly slower than their row-based counterparts; this is the trade-off of the columnar approach (inserting a single row leads to multiple column inserts). Because of this characteristic, columnar databases are typically used in R/OLAP scenarios (where data doesn’t change) and very rarely in OLTP use cases (where data changes frequently).
- Since columnar storage is fairly compact, it doesn’t generally require distribution (i.e. data partitioning) to store large datasets. The entire database can often be logically stored in the memory of a single server. HANA, however, provides comprehensive support for data partitioning.
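To make the layout difference concrete, here is a minimal sketch in plain Java (not HANA’s actual engine; the `Trade` table and its fields are invented for illustration) contrasting row-oriented and column-oriented storage of the same data, and showing why a column aggregate is just a tight scan over one contiguous array:

```java
import java.util.Arrays;

public class ColumnarVsRow {
    // Row-oriented: each record keeps all of its fields together.
    record Trade(long id, String symbol, double price) {}

    public static void main(String[] args) {
        Trade[] rows = {
            new Trade(1, "AAPL", 190.0),
            new Trade(2, "SAP",  135.5),
            new Trade(3, "AAPL", 191.2)
        };

        // Column-oriented: each field is stored as its own contiguous array.
        long[]   idCol     = {1, 2, 3};
        String[] symbolCol = {"AAPL", "SAP", "AAPL"};
        double[] priceCol  = {190.0, 135.5, 191.2};

        // An aggregate touches exactly one column: a cache-friendly linear scan.
        double avgFromColumn = Arrays.stream(priceCol).average().orElse(0);

        // The same aggregate over rows must pull every full record into view.
        double avgFromRows =
            Arrays.stream(rows).mapToDouble(Trade::price).average().orElse(0);

        System.out.println(avgFromColumn == avgFromRows); // same result, different work
    }
}
```

The aggregate computes the same value either way; the difference is that the columnar scan reads only the `price` bytes, while the row scan drags every field of every record through memory.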
It is important to emphasize that the columnar storage model is ideally suited for very compact memory utilization for two main reasons:
- The columnar model is a natural fit for compression, which often provides for dramatic reduction in memory consumption.
- Since column-based functions are very fast, there is no need for materialized views of aggregated values: the necessary values can simply be computed on the fly. This, too, significantly reduces the memory footprint.
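As a sketch of the first point, here is dictionary encoding, a common compression technique for homogeneous columns (this is an illustrative example in plain Java, not HANA’s internal implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DictionaryEncoding {
    public static void main(String[] args) {
        // A homogeneous, repetitive column -- the typical columnar case.
        String[] country = {"DE", "US", "DE", "DE", "US", "FR", "DE", "US"};

        // Store each distinct value once, in first-seen order...
        List<String> dict = new ArrayList<>(new LinkedHashSet<>(Arrays.asList(country)));

        // ...and replace the column with small integer codes into the dictionary.
        int[] codes = Arrays.stream(country).mapToInt(dict::indexOf).toArray();

        System.out.println(dict);                   // [DE, US, FR]
        System.out.println(Arrays.toString(codes)); // [0, 1, 0, 0, 1, 2, 0, 1]
    }
}
```

Eight string values collapse into a three-entry dictionary plus an array of small integers; the longer and more repetitive the column, the larger the saving.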
GridGain’s IMDB Key-Value Storage Model
The key-value (KV) storage model is less defined than its columnar counterpart and usually involves a fair amount of vendor specifics.
Historically, there are two schools of KV storage models:
- Traditional (examples include Riak, Memcached, and Redis): The common characteristic of these systems is a raw, language-independent storage format for the keys and values.
- Data grid (examples include GridGain IMDB, GigaSpaces, and Coherence): The common trait of these systems is the reliance on JVM as the underlying runtime platform, and treating keys and values as user-defined JVM objects.
GridGain’s IMDB belongs to the data grid branch of KV storage models. Some of its key characteristics are:
- Data is stored in a set of distributed maps (a.k.a. dictionaries or caches). In a simple approximation, you can think of a value as a row in the row-based model, and a key as that row’s primary key. Following this analogy, a single KV map can be approximated as a row-based table with an automatic primary key index.
- Keys and values are represented as user-defined JVM objects and therefore no automatic compression can be performed.
- Data distribution is designed in from the ground up: data is partitioned across the cluster, which in part mitigates the lack of compression. Unlike in HANA, data partitioning is mandatory.
- MapReduce is its main API for data processing (SQL is supported as well).
- It has strong affinity and co-location semantics provided by default.
- It has no bias toward aggregate or row-based processing performance and therefore no bias toward either OLAP or OLTP applicability.
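The characteristics above can be sketched in a few lines of plain Java (this is a conceptual model, not GridGain’s actual API; the `Account` type and partition count are invented for illustration): a distributed map behaves like a row-based table whose primary key is the map key, and every key deterministically maps to a partition owned by some cluster node.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KvSketch {
    // A value plays the role of a row; the key is its primary key.
    record Account(long id, String owner, double balance) {}

    static final int PARTITIONS = 4; // each partition lives on some cluster node

    // Mandatory partitioning: every key deterministically maps to a partition.
    static int partitionOf(Object key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    public static void main(String[] args) {
        // One distributed map ~ one row-based table with an automatic PK index.
        Map<Long, Account> accounts = new ConcurrentHashMap<>();
        accounts.put(42L, new Account(42, "alice", 100.0));

        // Primary-key lookup is a single hash probe, not a scan.
        Account a = accounts.get(42L);
        System.out.println(a.owner() + " -> partition " + partitionOf(42L));
    }
}
```

The deterministic key-to-partition mapping is what makes affinity and co-location possible: any node can compute, without coordination, where a given key lives.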
It is somewhat expected that for heavy transactional processing GridGain will provide overall better performance in most cases:
- The columnar model is rather inefficient in updating or inserting values in multiple columns.
- Transactional locking is also less efficient in the columnar model.
- The decompression and recompression required for in-place changes further degrade performance.
- The KV storage model, on the other hand, provides an ideal model for individual updates as individual objects can be accessed, locked and updated very effectively.
- Lack of compression in GridGain IMDB makes updates go even faster than in the columnar model with compression.
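To illustrate why per-key updates are so cheap in the KV model, here is a minimal sketch in plain Java (again a conceptual model, not GridGain’s API): an update locates one entry by its key, locks only that entry, and swaps in a new value, with no columns to decompress or re-encode.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KvUpdate {
    record Account(long id, String owner, double balance) {}

    public static void main(String[] args) {
        Map<Long, Account> accounts = new ConcurrentHashMap<>();
        accounts.put(1L, new Account(1, "alice", 100.0));

        // compute() atomically reads, transforms and replaces a single entry;
        // concurrent updates to other keys proceed without contention.
        accounts.compute(1L, (k, acct) ->
            new Account(acct.id(), acct.owner(), acct.balance() + 50.0));

        System.out.println(accounts.get(1L).balance()); // 150.0
    }
}
```

Contrast this with a columnar store, where the same logical update fans out into several column modifications, each potentially requiring decompression and recompression.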
As an example, GridGain just won a public tender for one of the biggest financial institutions in the world, achieving 1 billion transactional updates per second on 10 commodity blades costing less than $25 thousand altogether. That level of transactional performance, and the associated TCO, is clearly not territory any columnar database can approach.
For OLAP workloads, the picture is less obvious. HANA is heavily biased toward OLAP processing, and GridGain IMDB is neutral toward it. Both GridGain IMDB and SAP HANA provide comprehensive data partitioning capabilities and allow for processing parallelization, which are MPP traits necessary for scaling out OLAP processing. I believe the actual difference observed by the customers will be driven primarily by three factors rooted deeply in differences between columnar and KV implementations in respective products:
- Optimizations around data affinity and collocation
- Optimizations around the distribution overhead
- Optimizations around indexing of partitioned data
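The first of these factors, affinity and co-location, can be sketched as follows (a hypothetical example in plain Java, not either product’s API; the customer/order entities and partition count are invented): by partitioning on a chosen affinity field rather than on the entry’s own key, related data lands on the same node, so a join or aggregate over it needs no network hop.

```java
public class AffinitySketch {
    static final int PARTITIONS = 8;

    // Partition by an affinity field (here: customerId) instead of the
    // entry's own key, so a customer's orders co-locate with the customer.
    static int partitionFor(long customerId) {
        return Math.floorMod(Long.hashCode(customerId), PARTITIONS);
    }

    public static void main(String[] args) {
        long customerId = 1001L;
        long[] orderIds = {5001L, 5002L, 5003L};

        int home = partitionFor(customerId);
        System.out.println("customer " + customerId + " -> partition " + home);

        // Every order routed by its customer's id maps to the same partition,
        // regardless of the order's own id.
        for (long orderId : orderIds) {
            System.out.println("order " + orderId + " -> partition " + partitionFor(customerId));
        }
    }
}
```

How aggressively a product exploits this kind of co-location, and how much distribution and indexing overhead it adds on top, is precisely what varies between implementations.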
Unfortunately, there is no way to provide generalized guidance on performance differences here. We always recommend trying both products in your particular scenario, paying close attention to configuration and tuning around the three points mentioned above, and seeing what results you get. It takes time and resources, but you may be surprised by your findings!