You've probably heard about in-memory databases. If not, you can check out a quick overview of what they are here!
Long story short, an in-memory database is a database that keeps the whole dataset in RAM. What does that mean? It means that each time you query a database or update data in a database, you only access the main memory. There's no disk involved in these operations — and this is good because the main memory is way faster than any disk. A good example of such a database is Memcached.
But wait a minute. How would you recover your data after a machine with an in-memory database reboots or crashes? Well, with just an in-memory database, there's no way out. A machine is down — the data is lost. Forget about it.
You may be wondering how in-memory storage can be persistent. The trick here is that you still keep everything in memory, but additionally, you persist each operation on disk in a transaction log. Look at the picture:
The first thing that you may notice is that even though your nice, fast in-memory database now has persistence, queries don't slow down because they still hit only the main memory like they did with just an in-memory database. That's good news! But what about updates? Each update (we'll call them transactions) should not only be applied to memory but also persisted on disk — a slow disk. Is this a problem? Let's look at the picture:
Transactions are applied to the transaction log in an append-only way. What's so good about that? When addressed in this append-only manner, disks are pretty fast. If we're talking about spinning magnetic hard disk drives (HDD), they can write to the end of a file as fast as 100 Mbytes per second. If you don't believe me, then try to run this test in a Unix/Linux/MacOS shell:
cat /dev/zero >some_file
Check out how fast
some_file is growing. Magnetic disks are pretty fast when you use them sequentially. On the other hand, they're utterly slow when you use them randomly. They can normally complete around 100 random operations per second.
If you write byte-by-byte with each byte put in a random place of an HDD, you can see some real 100 bytes per second as the peak throughput of the disk in this scenario. Again, it is as little as 100 bytes per second! This tremendous six-order-of-magnitude difference between the worst case scenario (100 bytes per second) and the best case scenario (100,000,000 bytes per second) of disk access speed is based on the fact that in order to seek a random sector on disk, a physical movement of a disk head has occurred (while you don't need it for sequential access, as you just read data from disk as it spins, with a disk head being stable).
If we consider solid-state drives (SSD), then the situation will be better because of no moving parts. But still, the best/worst ratio is similar. Sequential access gives 200-300 Mbytes per second and random access gives 1,000-10,000 seeks per second — and that is four to five orders of magnitude. Here are some numbers.
So, our in-memory database floods the disk with transactions as fast as 100 Mbytes per second. Is that fast enough? Well, that is really fast. Say, if a transaction size is 100 bytes, then this will be one million transactions per second! This number is so high that you can definitely be sure that the disk will never be a bottleneck for your in-memory database. Here is a detailed benchmark of one in-memory database showing one million transactions per second, where the bottleneck is the CPU, not the disk.
To summarize all that was said above about disks and in-memory databases:
- In-memory databases don't use disk for non-change operations.
- In-memory databases do use disk for data change operations — but they use it in the fastest possible way.
Why wouldn't regular disk-based databases adopt the same techniques? Well, first, unlike in-memory databases, they need to read data from disk on each query (let's forget about caching for a minute; this is going to be a topic for another article).
You never know what the next query will be, so you can consider that queries generate random access workload on a disk — which is, remember, the worst scenario of disk usage. Second, disk-based databases need to persist changes in such a way that the changed data can immediately be read.
Unlike in-memory databases (which usually don't read from disk unless for recovery reasons upon starting up). Disk-based databases require specific data structures to avoid a full scan of a transaction log in order to read from a dataset fast. One such data structure of the kind is a B/B+ tree. The flip side of this data structure is that you should change a B/B+ tree on each change operation, which could constitute random workload on a disk. While improving the performance of read operations, B/B+ trees are degrading it for write operations. There is a handful of database engines based on B/B+ trees, including InnoDB by MySQL or Postgres storage engine.
There is also another data structure that is somewhat better in terms of write workload: LSM tree. This modern data structure doesn't solve problems with random reads, but it partially solves problems with random writes. Examples of such engines are RocksDB, LevelDB, or Vinyl. You can see a summary in this picture:
So, in-memory databases with persistence can be really fast on both read/write operations — as fast as pure in-memory databases, using a disk extremely efficiently and never making it a bottleneck.
The last (but not least) topic that I want to partially cover here is snapshotting. Snapshotting is the way transaction logs are compacted. A snapshot of a database state is a copy of the whole dataset. A snapshot and the latest transaction logs are enough to recover your database state. With a snapshot, you can delete all the outdated transaction logs that don't have any new information on top of the snapshot.
Why would we need to compact logs? Because the more transaction logs, the longer the recovery time for a database. Another reason for that is that you wouldn't want to fill your disks with outdated and useless information (to be perfectly honest, old logs sometimes save the day, but let's save that for another article).
Snapshotting is essentially an occasional dumping of the whole database from the main memory to disk. Once we dump a database to disk, we can delete all the transaction logs that do not contain transactions newer than the last transaction checkpointed in a snapshot. Easy, right? This is just because all other transactions from day one are already considered in a snapshot.
You may ask now be wondering how to save a consistent state of a database to disk and how to determine the latest checkpointed transaction while new transactions keep coming. Well then, I'll see you in the next article. Stay tuned!