Originally written by Ariel Weisberg.
Most discussions I have seen about choosing SSDs vs. spinning disk arrays for databases tend to focus on SSDs as a replacement for disk. SSDs don’t replace disk; they replace the RAM you would be using to cache enough disk pages to make up for the terrible random IO performance of spinning disk arrays.
When you add a new disk to a disk array you get hundreds of IOPs, max. When you add an SSD to a RAID array – or better yet, one packaged along with a new node in your cluster – you get hundreds of thousands more IOPs. For workloads where you will never fit everything in RAM, scaling IOPs matters, because there isn’t a power in the verse that will make your random reads sequential.
For workloads that are focused on sequential IO performance the delta in performance between SSDs and spinning disks is not as great, and spinning disks are an excellent source of sequential IO and raw capacity. For workloads that stress random IO, SSDs end up being more cost effective in several dimensions.
Benchmarking the distinction
To demonstrate the difference between spinning disk and SSDs I put together a quick benchmark for random read performance. You can find the benchmark on GitHub along with a description, and results for a 7.2k RPM 3 terabyte disk and a Crucial m4 128 gigabyte SSD. The results are also available as a spreadsheet.
I tested on two different desktops, one with the SSD and one with the disk. The desktops have different CPUs, which shows up as a difference in throughput for the in-memory data sets. For the larger-than-memory data sets that are the focus of this discussion, performance is dominated by available IOPs (as you will see in the graphs). Both desktops had 16 gigabytes of RAM and were running Ubuntu 12.04 with EXT4 mounted with noatime,nodiratime.
Read-ahead was disabled on the SSD using “blockdev --setra 0” and “hdparm -A0” in order to get the full 33k IOPs the device can deliver. Read-ahead was not disabled for the disk and was left at the default of 128 kilobytes.
The benchmark consists of 1k aligned reads from a portion of a pre-allocated file. The access distribution is scrambled Zipfian: the output of a Zipfian distribution is hashed using an FNV hash so that hot and cold values end up stored on the same page, as they would in many real-world databases under real workloads. Reads are issued by a thread pool containing eight threads.
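To make the access pattern concrete, here is a minimal sketch of a scrambled Zipfian generator. It is not the benchmark's actual code. The FNV variant (64-bit FNV-1a), the skew parameter theta = 0.99, and the names fnv1a_64 and ScrambledZipfian are all assumptions made for illustration, since the text only says "FNV hash".

```python
import bisect
import itertools
import random

FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(value: int) -> int:
    """64-bit FNV-1a over the 8 little-endian bytes of value (variant is an assumption)."""
    h = FNV_OFFSET_BASIS_64
    for byte in value.to_bytes(8, "little"):
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


class ScrambledZipfian:
    """Draws Zipfian popularity ranks, then hashes each rank to a record index."""

    def __init__(self, item_count: int, theta: float = 0.99, seed: int = 0):
        self.item_count = item_count
        self.rng = random.Random(seed)
        # Rank r (1-based) gets weight 1 / r**theta; store the running totals.
        weights = [1.0 / (rank ** theta) for rank in range(1, item_count + 1)]
        self.cumulative = list(itertools.accumulate(weights))

    def next_rank(self) -> int:
        """Return a 0-based popularity rank; rank 0 is the hottest."""
        u = self.rng.random() * self.cumulative[-1]
        index = bisect.bisect_left(self.cumulative, u)
        return min(index, self.item_count - 1)  # guard against float rounding

    def next_item(self) -> int:
        """Hash the rank so hot records are scattered across the file."""
        return fnv1a_64(self.next_rank()) % self.item_count


if __name__ == "__main__":
    generator = ScrambledZipfian(item_count=1000)
    print([generator.next_item() for _ in range(10)])
```

Without the hash, the hottest records would sit next to each other at the start of the file and a handful of cached pages would absorb most reads. Hashing the rank spreads them out, so a page that holds one hot record mostly holds cold ones.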
The pre-allocated file is accessed by memory-mapping. Even though reads are 1k, actual IOs are 4k, matching the page size of the page cache. To test different dataset sizes a prefix of the pre-allocated file is used for each benchmark.
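As a rough illustration of reading through a memory mapping, the sketch below maps a file and copies out 1k-aligned records from a prefix of it. The helper names (build_file, open_mapping, read_record) and the record contents are hypothetical and are not taken from the benchmark.

```python
import mmap

READ_SIZE = 1024  # bytes per logical read, as described in the text


def build_file(path, record_count):
    """Pre-allocate a file of fixed-size records; record i is filled with byte i % 256."""
    with open(path, "wb") as f:
        for i in range(record_count):
            f.write(bytes([i % 256]) * READ_SIZE)


def open_mapping(path):
    """Map the whole file read-only; the mapping stays valid after the file is closed."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_record(mapped, record_index, prefix_bytes):
    """Return one 1k-aligned record, refusing offsets beyond the dataset prefix."""
    offset = record_index * READ_SIZE
    if record_index < 0 or offset + READ_SIZE > prefix_bytes:
        raise IndexError("record lies outside the dataset prefix")
    # Slicing copies the bytes out of the mapping. If the page is not resident,
    # the kernel faults in a whole page (mmap.PAGESIZE bytes), not just 1k.
    return mapped[offset:offset + READ_SIZE]
```

Passing a smaller prefix_bytes restricts reads to the front of the same file, which is one way to vary the dataset size without re-creating the file.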
The benchmark is run with a 600 second warm-up followed by a 600 second measurement period. Only one run of each configuration is presented. I observed that 600 seconds was sufficient to warm up the cache, even for the spinning disk. Caches are dropped between runs using “echo 1 > /proc/sys/vm/drop_caches”.
Performance is identical for in-memory data sets modulo the difference in CPU performance. I truncated the vertical axis to make the differences between larger-than-memory workloads more visible. In-memory performance was several million operations a second, telling the story of why you don’t build in-memory databases the same way you build larger-than-memory databases.
Once even a small slice of the dataset is no longer in memory, at 16 gigabytes, performance drops sharply. Once 50% of the dataset is not in memory, at 24 gigabytes, performance drops again by 3x for the SSD.
Throughput for the disk drops to near zero as soon as things don’t fit in memory, since the device can only sustain 250 IOPs. The long tail of IOs in a Zipfian distribution rapidly consumes all available IOPs, exhausting the IO thread pool as threads block waiting for pages to come in.
With log scale you can see the performance of the disk.
Focusing on the performance of real larger-than-memory datasets, you can see the extra IOPs of the SSD allow it to hit well above its weight class in terms of operations performed. The SSD can only do 33k random reads a second, but with caching the workload manages to perform 4.45x and 3.15x the number of supported IOPs for 2x and 4x larger-than-memory workloads, respectively.
The spinning disk also hits above its weight class, but the multipliers are 2.12x and 1.5x. More critically, the throughput provided does not reach the threshold of what I would call useful.
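For a sense of scale, the multipliers can be turned back into operations per second. Taking them relative to the 33k and 250 IOPs figures quoted above (an assumption about how they were computed), the SSD lands at roughly 146,850 and 103,950 operations a second, and the disk at roughly 530 and 375. The sketch below does that arithmetic. It also derives a cache hit rate under a further simplifying assumption that is mine, not the benchmark's: the device is saturated and every cache miss costs exactly one device IO. Read-ahead on the disk breaks that assumption, so treat the disk numbers as rough.

```python
def ops_per_second(device_iops, multiplier):
    """Operations per second when the workload achieves multiplier x the device IOPs."""
    return device_iops * multiplier


def implied_hit_rate(multiplier):
    """Cache hit rate implied by a multiplier, assuming one device IO per miss."""
    if multiplier < 1:
        raise ValueError("a saturated device cannot deliver fewer ops than IOPs")
    # ops = iops / miss_rate, so miss_rate = 1 / multiplier.
    return 1.0 - 1.0 / multiplier


if __name__ == "__main__":
    # Device IOPs and multipliers are the figures quoted in the text.
    for name, iops, multiplier in [
        ("SSD, 2x memory", 33_000, 4.45),
        ("SSD, 4x memory", 33_000, 3.15),
        ("disk, 2x memory", 250, 2.12),
        ("disk, 4x memory", 250, 1.5),
    ]:
        ops = ops_per_second(iops, multiplier)
        hit = implied_hit_rate(multiplier)
        print(f"{name}: {ops:,.0f} ops/s, implied hit rate {hit:.1%}")
```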
Why this matters for databases
Databases typically have to cache two discrete types of data, indexes and values. A read will have to touch many index pages for each retrieval, but only one page to retrieve a row/document/column. If indexes fit in memory, exactly one IO will be consumed retrieving a value.
For many workloads indexes are on the order of 10x smaller than values; thus, if you paid for the RAM to store your indexes and SSDs to store your values, you can support as many retrievals as you have IOPs available. If you commit additional RAM to caching values you can have performance that exceeds the number of IOPs available.
If your workloads are friendlier to caching than the scrambled Zipfian distribution used here, the potential gains are greater: you will be closer to in-memory performance because available IOPs will not be consumed as quickly.
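The reasoning above can be written as a back-of-the-envelope model. This is a sketch under stated assumptions, not a sizing tool: it assumes each retrieval costs one device IO per uncached index page plus one IO for the value page unless the value is cached, and that the device is the only bottleneck. The function names and parameters are hypothetical.

```python
def index_ram_bytes(value_bytes, index_ratio=10):
    """RAM needed to hold all indexes, assuming indexes are value_bytes / index_ratio."""
    return value_bytes / index_ratio


def retrievals_per_second(device_iops, uncached_index_pages=0, value_hit_rate=0.0):
    """Retrievals the device can sustain given the expected IOs per retrieval."""
    if not 0.0 <= value_hit_rate <= 1.0:
        raise ValueError("value_hit_rate must be between 0 and 1")
    # Expected device IOs for one retrieval: index misses plus a possible value miss.
    ios_per_retrieval = uncached_index_pages + (1.0 - value_hit_rate)
    if ios_per_retrieval == 0:
        return float("inf")  # everything is served from RAM; the device is idle
    return device_iops / ios_per_retrieval


if __name__ == "__main__":
    iops = 33_000  # the SSD figure quoted in the text
    print(retrievals_per_second(iops))                          # indexes cached
    print(retrievals_per_second(iops, uncached_index_pages=3))  # indexes on the device
    print(retrievals_per_second(iops, value_hit_rate=0.5))      # half the values cached
```

With indexes fully in RAM the model gives one retrieval per IO. Leaving three index pages per lookup on the device cuts that to a quarter, and caching half of the values doubles it.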
Recognize what you need to be optimizing for when picking storage. If you have to bring in extra RAM, nodes, rack space, and power to get away with using disk arrays, that needs to be accounted for. You will be hit with a double whammy – not only are you scaling IOPs, you are also scaling enough RAM to make up for the IOPs. Factor in hidden costs like power consumption, and weigh them against the costs you pay up front with SSDs.
SSDs present their own challenges. Write amplification is a factor that must be accounted for when choosing which data structure to use. The additional sequential IO provided by SSDs makes log-structured data structures more attractive, especially if it means you can use less-expensive, lower-write-endurance SSDs. These data structures sometimes have their own warts in terms of inconsistent performance over time; this is an area where I still see room for improvement.
Remember: SSDs don’t replace disk; they replace RAM, offering a way to avoid the poor random IO performance of spinning disk. IOPs matter, and when you have larger-than-memory datasets, SSDs can provide a way to improve IOPs without adding RAM. Make the right choice for your workload by balancing IOPs, direct and hidden costs of SSDs, and disk performance metrics. Let me know your thoughts, here or on Twitter at @VoltDB.