Database Comparison: MapR-DB, Cassandra, HBase, and More
Learn about the trade-offs of Log Structured Merge (LSM) trees that power HBase and Cassandra and how MapR-DB chooses to deal with them.
If you are a developer or architect working with a highly performant product, you want to understand what differentiates it, much like a race car driver wants to understand what makes a highly performant car fast. My in-depth architecture blog posts such as An In-Depth Look at the HBase Architecture, Apache Drill Architecture: The Ultimate Guide, and How Stream-First Architecture Patterns Are Revolutionizing Healthcare Platforms have been hugely popular, and I hope this one will be, too!
In this blog post, I'll give you an in-depth look at the MapR-DB architecture compared to other NoSQL architectures like Cassandra and HBase, and explain how MapR-DB delivers fast, consistent, scalable performance with instant recovery and zero data loss.
Why NoSQL?
Simply put, the motivation behind NoSQL is data volume, velocity, and/or variety. MapR-DB provides for data variety with two different data models:
- Wide column data model exposing an HBase API.
- Document data model exposing an Open JSON API (similar to the MongoDB API).
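To make the wide column model concrete, here is a minimal sketch using the standard HBase client API that MapR-DB exposes. The table name, column family, column qualifier, and row key are hypothetical, chosen only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WideColumnExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table name; with MapR-DB the table can also be addressed by a file system path.
             Table table = connection.getTable(TableName.valueOf("trades"))) {

            // Write one row: row key "AMZN_20170818", column family "price", qualifier "close".
            Put put = new Put(Bytes.toBytes("AMZN_20170818"));
            put.addColumn(Bytes.toBytes("price"), Bytes.toBytes("close"), Bytes.toBytes("958.47"));
            table.put(put);

            // Read the same row back by row key.
            Result result = table.get(new Get(Bytes.toBytes("AMZN_20170818")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("price"), Bytes.toBytes("close"))));
        }
    }
}
```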
Concerning data volume and velocity, a recent ESG Lab analysis determined that MapR-DB outperforms Cassandra and HBase by 10x in the cloud, as measured in operations per second (higher operations/sec is better).
But how do you get fast reads and writes with high performance at scale?
The key is partitioning for parallel operations and minimizing time spent on disk reads and writes.
Limitations of a Relational Model
With a relational database, you normalize your schema, which eliminates redundant data and makes storage efficient. Then, you use indexes and queries with joins to bring the data back together again. However, indexes slow down data ingestion with lots of nonsequential disk I/O, and joins become a bottleneck on reads when there is lots of data. The relational model also does not scale horizontally across a cluster.
Automatically Partitioning by Row Key
With MapR-DB (HBase API or JSON API), a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table. MapR-DB has a "query-first" schema design in which queries are identified first, then the row key is designed to distribute the data evenly and also to give a meaningful primary index. The row document (JSON) or columns (HBase) should be designed to group together data that will be read together. With MapR-DB, you denormalize your schema to store in one row or document what would be multiple tables with indexes in a relational world. Grouping the data by key range provides for fast reads and writes by row key.
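As a hedged illustration of this kind of row key design, the sketch below builds a composite key for a hypothetical trades table: the symbol plus a reverse timestamp, so the newest rows for a symbol sort first and can be read with a short scan starting at the symbol prefix.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyDesign {
    // Hypothetical composite key: symbol + "_" + zero-padded reverse timestamp.
    // Reversing the timestamp makes the most recent trade for a symbol sort first,
    // so a short scan from the "AMZN_" prefix returns the latest data.
    static byte[] tradeRowKey(String symbol, long epochMillis) {
        long reverseTs = Long.MAX_VALUE - epochMillis;
        return Bytes.toBytes(symbol + "_" + String.format("%019d", reverseTs));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(tradeRowKey("AMZN", System.currentTimeMillis())));
    }
}
```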
You can read more about MapR-DB HBase schema design here; in a future post, we will discuss JSON document design.
How Do NoSQL Data Stores Get Fast Writes?
Traditional databases based on B-trees deliver very fast reads, but they are costly to update in real time because the required disk I/O is random and inefficient.
Log Structured Merge trees (or LSM trees) are designed to provide higher write throughput than traditional B-tree file organizations by minimizing nonsequential disk I/O. LSM trees are used by data stores such as HBase, Cassandra, MongoDB, and others. LSM trees write updates both to a log on disk and to an in-memory store: updates are appended to the log, which is used only for recovery, and sorted in the memory store. When the memory store is full, it is flushed to a new immutable file on disk.
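The following is a deliberately simplified sketch (not any product's actual code) of that write path: append to a log for recovery, insert into a sorted in-memory store, and flush to a new immutable sorted file when the store fills up. The flush threshold and file names are made up for illustration.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class LsmWriteSketch {
    private static final int FLUSH_THRESHOLD = 4;   // tiny threshold, just for demonstration
    private final TreeMap<String, String> memStore = new TreeMap<>();
    private final Path dir;
    private final BufferedWriter log;
    private int flushCount = 0;

    LsmWriteSketch(Path dir) throws IOException {
        this.dir = dir;
        this.log = Files.newBufferedWriter(dir.resolve("wal.log"),
                StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void put(String key, String value) throws IOException {
        log.write(key + "\t" + value + "\n");   // append to the log, used only for recovery
        log.flush();                            // make it durable before acknowledging the write
        memStore.put(key, value);               // keep updates sorted in memory
        if (memStore.size() >= FLUSH_THRESHOLD) flush();
    }

    private void flush() throws IOException {
        Path sstable = dir.resolve("sstable-" + (flushCount++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(sstable, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, String> e : memStore.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue() + "\n");   // new immutable, sorted file
            }
        }
        memStore.clear();   // a real store would also roll the log at this point
    }

    public static void main(String[] args) throws IOException {
        LsmWriteSketch store = new LsmWriteSketch(Files.createTempDirectory("lsm"));
        for (int i = 0; i < 10; i++) store.put("key" + i, "value" + i);
    }
}
```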
Read Amplification
This design provides fast writes. However, as more new files are written to disk, a read whose row values are not in memory may have to examine multiple files to assemble the row contents, which slows down reads. This is called read amplification.
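Here is a simplified sketch of why that hurts: a single get() must check the in-memory store and then each immutable file from newest to oldest until it finds the key, so the more un-compacted files there are, the more lookups (and, on disk, the more I/O) one read costs.

```java
import java.util.*;

public class ReadAmplificationSketch {
    // Each map stands in for one flushed, immutable sorted file (newest last in the list).
    static String get(String key, NavigableMap<String, String> memStore,
                      List<NavigableMap<String, String>> files) {
        if (memStore.containsKey(key)) return memStore.get(key);
        for (int i = files.size() - 1; i >= 0; i--) {   // newest file first
            String v = files.get(i).get(key);           // one lookup per file until a hit
            if (v != null) return v;
        }
        return null;                                    // key not found anywhere
    }

    public static void main(String[] args) {
        NavigableMap<String, String> memStore = new TreeMap<>();
        List<NavigableMap<String, String>> files = new ArrayList<>();
        files.add(new TreeMap<>(Map.of("user42", "old-value")));
        files.add(new TreeMap<>(Map.of("user42", "new-value")));
        System.out.println(get("user42", memStore, files));   // new-value, after checking 2 places
    }
}
```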
Write Amplification
In order to speed up reads, a background process, run on command or on a schedule, "compacts" multiple files by reading them, merge sorting them in memory, and writing the sorted key-values into a new, larger file.
However, compaction causes lots of disk I/O, which decreases write throughput. This is called write amplification. Compaction in Cassandra requires 50% free disk space and will bring the node to a halt if it runs out of disk space. Read and write amplification cause unpredictable latencies with HBase and Cassandra. Recently, ESG Lab confirmed that MapR-DB outperforms Cassandra and HBase by 10x in the cloud, with predictable, consistently low latency (low latency is good; it means fast response).
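A simplified sketch of that compaction step is shown below, with sorted in-memory maps standing in for the on-disk files; a real compactor would stream a k-way merge to disk rather than hold everything in memory.

```java
import java.util.*;

public class CompactionSketch {
    // Merge several sorted "files" into one larger sorted result, keeping only the
    // newest value for each key. Files appear in the list from oldest to newest.
    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> files) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> file : files) {
            merged.putAll(file);   // newer values overwrite older ones
        }
        return merged;             // a real store would write this out as a new immutable file
    }

    public static void main(String[] args) {
        NavigableMap<String, String> older = new TreeMap<>(Map.of("a", "1", "b", "2"));
        NavigableMap<String, String> newer = new TreeMap<>(Map.of("b", "9", "c", "3"));
        System.out.println(compact(List.of(older, newer)));   // {a=1, b=9, c=3}
    }
}
```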
How Does MapR-DB Get Predictable Low Latency?
LSM trees allow faster writes than B-trees, but they don't deliver the same read performance. Also, with LSM trees, there is a balance between compacting too infrequently (which hurts read performance) and too often (which hurts write performance). MapR-DB strikes a middle ground and avoids the large compactions of HBase and Cassandra, as well as the highly random I/O of traditional systems. The result is rock-solid latency and very high throughput.
To understand how MapR-DB can do what others can't, you have to understand the MapR converged platform. Apache HBase runs on an append-only file system (HDFS), and Apache Cassandra writes immutable SSTable files, so neither can update its data files in place as data changes. What's unique about MapR is that MapR-DB tables, MapR Files, and MapR Event Streams are integrated into MapR-XD, a high-scale, reliable, globally distributed data store. MapR-XD implements a random read-write file system natively in C++ and accesses disks directly, making it possible for MapR-DB to do efficient file updates instead of always writing to new immutable files.
MapR invented a hybrid LSM-tree/B-tree for MapR-DB to achieve consistently fast reads and writes. Unlike a traditional B-tree, leaf nodes are not actively balanced, so updates can happen very quickly. With MapR-DB, updates are appended to small "micro" logs and sorted in memory. Unlike LSM trees, which flush to new files, MapR-DB performs micro-reorganizations in which the in-memory updates are frequently merged into the read/write file system, meaning that MapR-DB does not need to do compaction. (Micro logs are deleted after the memory has been merged, since they are no longer needed.) MapR-DB reads leverage a B-tree to "get close, fast" to the ranges of possible values, then scan to find the matching data.
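The sketch below is purely conceptual (it is not MapR's implementation) and only illustrates the "get close, fast, then scan" read pattern: a sorted block index jumps to the candidate key range, and a short scan inside that range finds the row.

```java
import java.util.*;

public class GetCloseFastSketch {
    public static void main(String[] args) {
        // Index from the first key of each block to that block's rows (hypothetical data).
        TreeMap<String, NavigableMap<String, String>> blockIndex = new TreeMap<>();
        blockIndex.put("a", new TreeMap<>(Map.of("apple", "1", "avocado", "2")));
        blockIndex.put("m", new TreeMap<>(Map.of("mango", "3", "melon", "4")));

        String key = "melon";
        // B-tree style: jump straight to the block whose start key is <= the search key...
        Map.Entry<String, NavigableMap<String, String>> block = blockIndex.floorEntry(key);
        // ...then scan inside that block for the exact row.
        String value = (block == null) ? null : block.getValue().get(key);
        System.out.println(key + " -> " + value);   // melon -> 4
    }
}
```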
How Do HBase and Cassandra Do Recovery?
Earlier, I said that the log is only used for recovery, so how does recovery actually work? In order to achieve reliability on commodity hardware, one has to resort to a replication scheme in which multiple redundant copies of data are stored. How does HBase do recovery? HDFS breaks files into blocks of 64 MB. Each block is stored on a DataNode in the cluster and replicated to two other DataNodes.
If an HDFS data node (and the region server on it) crashes, the affected regions are assigned to another region server. A reassigned region becomes available only after the write-ahead log has been replayed, which means reading all of the not-yet-flushed updates from the log into memory and flushing them as sorted key-values into new files.
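As a simplified sketch of that replay step (not HBase's actual code), assume log records were written as tab-separated key/value lines:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class WalReplaySketch {
    // Read every not-yet-flushed update back from the log, sort in memory,
    // and write the result out as a new sorted store file.
    static void replay(Path wal, Path newStoreFile) throws IOException {
        NavigableMap<String, String> memStore = new TreeMap<>();
        for (String line : Files.readAllLines(wal, StandardCharsets.UTF_8)) {
            String[] kv = line.split("\t", 2);   // key<TAB>value records
            memStore.put(kv[0], kv[1]);          // later entries overwrite earlier ones
        }
        List<String> sorted = new ArrayList<>();
        memStore.forEach((k, v) -> sorted.add(k + "\t" + v));
        Files.write(newStoreFile, sorted, StandardCharsets.UTF_8);   // flushed as a sorted file
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wal");
        Path wal = dir.resolve("wal.log");
        Files.write(wal, List.of("b\t2", "a\t1", "b\t3"), StandardCharsets.UTF_8);
        replay(wal, dir.resolve("storefile-0.txt"));   // store file ends up with a=1, b=3 in order
    }
}
```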
The HBase recovery process is slow and has the following problems:
- HDFS does disk I/O in large blocks of 64 MB.
- HDFS files are append-only: there is a single writer per file, no reads are allowed while the file is being written, and closing the file is the transaction that allows readers to see the data. When a failure occurs, unclosed files are deleted.
- Because of the layers between HDFS and HBase, data locality is only guaranteed when new files are written. With HBase, whenever a region is moved for failover or load balancing, the data is not local, which means that the region server is reading from files on another node.
- Because the log (or WAL) is large, replay can take time.
- Because of the separation from the file system, coordination is required between ZooKeeper, the HBase Master, the region servers, and the HDFS NameNode on failover.
So how does recovery work for Cassandra? With Cassandra, the client performs a write by sending the request to any node, and that node acts as the coordinator (proxy) for the client. The coordinator locates the N nodes that hold the data replicas and forwards the write request to all of them. If a node is down, the coordinator still writes to the other replicas, and the downed replica becomes inconsistent.
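The sketch below is a conceptual illustration of that coordinator behavior, not Cassandra's code: the write is forwarded to all N replicas, acknowledged once enough replicas respond, and any replica that missed the write is left inconsistent until a later repair.

```java
import java.util.*;
import java.util.function.Predicate;

public class CoordinatorWriteSketch {
    static boolean write(String key, String value,
                         List<Map<String, String>> replicas,   // the N replica nodes
                         Predicate<Integer> replicaIsUp,       // true if replica i is reachable
                         int requiredAcks) {                   // e.g. quorum = N/2 + 1
        int acks = 0;
        List<Integer> missed = new ArrayList<>();
        for (int i = 0; i < replicas.size(); i++) {
            if (replicaIsUp.test(i)) {
                replicas.get(i).put(key, value);   // replica applied the write
                acks++;
            } else {
                missed.add(i);                      // now inconsistent; needs anti-entropy repair
            }
        }
        System.out.println("acks=" + acks + ", replicas needing repair=" + missed);
        return acks >= requiredAcks;               // the write can succeed despite a downed replica
    }

    public static void main(String[] args) {
        List<Map<String, String>> replicas =
                List.of(new HashMap<>(), new HashMap<>(), new HashMap<>());
        // Replica 2 is down: the write still succeeds at quorum (2 of 3), leaving replica 2 stale.
        System.out.println(write("user42", "v1", replicas, i -> i != 2, 2));
    }
}
```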
According to Datastax, downed nodes are common causes of data inconsistency with Cassandra, and need to be routinely fixed by manually running an anti-entropy repair tool. According to Robert Yokota from Yammer, Cassandra has not been more reliable than their strongly consistent systems — yet Cassandra has been more difficult to work with and reason about in the presence of inconsistencies.
Getting Instant Recovery With Zero Data Loss
To understand how MapR-DB can do what others can't, you have to understand MapR file system replication, which provides 24/7 reliability and zero data loss. In normal operation, the cluster needs to write and replicate incoming data as quickly as possible. When files are written to the MapR cluster, they are first sharded into pieces called chunks. Each chunk is written to a container as a series of 8 KB blocks at a 2 GB/second update rate. Once a chunk is written, it is replicated as a series of blocks. This is repeated for each chunk until the entire file has been written. (You can see this animated in this MapR-XD video.) MapR ensures that every write that has been acknowledged survives a crash.
Where HDFS uses a single block size for sharding, replication location, and file I/O, MapR-XD uses three different sizes. Distinguishing these block sizes matters because of how each is used:
- Having a larger shard size means that the metadata footprint associated with files is smaller.
- Having a larger replication location size means that fewer replication location calls are made to the Container Location Database (CLDB).
- Having a smaller disk I/O size means that files may be randomly read and written.
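As a back-of-the-envelope illustration of why separating these sizes matters, the sketch below uses the 8 KB I/O block and the 64 MB HDFS block mentioned in the text, plus a purely hypothetical 256 MB chunk size, to compare how many units of each kind a 1 TB file breaks into.

```java
public class BlockSizeMath {
    public static void main(String[] args) {
        long fileBytes = 1L * 1024 * 1024 * 1024 * 1024;   // a 1 TB file

        long hdfsBlocks = fileBytes / (64L * 1024 * 1024);   // one metadata entry per 64 MB block
        long maprChunks = fileBytes / (256L * 1024 * 1024);  // one shard entry per (hypothetical) 256 MB chunk
        long ioBlocks   = fileBytes / (8L * 1024);           // 8 KB units that can be read/written randomly

        System.out.println("64 MB blocks to track:  " + hdfsBlocks);   // 16,384
        System.out.println("256 MB chunks to track: " + maprChunks);   // 4,096: a 4x smaller metadata footprint
        System.out.println("8 KB I/O units:         " + ioBlocks);     // fine-grained random access
    }
}
```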
With MapR-DB tables, as with HBase, continuous sequences of rows are divided into regions or tablets. However, with MapR-DB, the tablets "live" inside a container. Because the tables are integrated into the file system, MapR-DB can guarantee data locality, which HBase strives for but cannot guarantee since its file system is separate.
On a node failure, the Container Location Database (CLDB) does a failover of the primary containers that were being served by that node to one of the replicas. For a given container, the new primary serves the MapR-DB tables or files present in that container.
The real differentiating feature is that MapR-DB can instantly recover from a crash without replaying the small micro logs up front, meaning users can access the database as soon as the recovery process begins. If data is requested that requires applying a micro log, it happens inline in about 250 milliseconds, so fast that most users won't even notice it. HBase and Cassandra use large recovery logs; MapR, on the other hand, uses about 40 small logs per tablet. Since today's servers have so many cores, it makes sense to write many small logs and parallelize recovery across as many cores as possible. Also, MapR-DB is constantly emptying micro logs, so most of them are empty anyway. This architecture means that MapR-DB can recover from a crash between 100x and 1,000x faster than HBase. We call this instant recovery.
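The sketch below is a conceptual illustration (not MapR's code) of why many small logs recover faster than one large one: the roughly 40 micro logs per tablet mentioned above can be replayed in parallel across all available cores instead of sequentially.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelRecoverySketch {
    public static void main(String[] args) throws Exception {
        List<List<String>> microLogs = new ArrayList<>();
        for (int i = 0; i < 40; i++) microLogs.add(List.of());   // most micro logs are empty anyway

        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Integer>> replays = new ArrayList<>();
        for (List<String> log : microLogs) {
            replays.add(pool.submit(() -> {
                // Replay this micro log's records into the in-memory store.
                return log.size();   // stand-in for the real replay work
            }));
        }
        int replayed = 0;
        for (Future<Integer> f : replays) replayed += f.get();   // wait for all replays to finish
        pool.shutdown();
        System.out.println("Replayed " + replayed + " records from 40 micro logs in parallel");
    }
}
```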
This summarizes the features of MapR-DB tablets living inside a container:
- Guaranteed data locality: Table data is stored in files that are guaranteed to be on the same node because they are in the same container.
- Smarter load balancing uses container replicas: If a tablet needs to be moved for load-balancing, it is moved to a container replica, so the data is still local.
- Smarter, simpler failover uses container replicas.
MapR Converged Data Platform
Three core services (MapR-XD, MapR-DB, and MapR Event Streams) work together to enable the MapR Converged Data Platform to support all workloads on a single cluster that is distributed, scalable, reliable, and high performance.