Compared with expensive, fast internal memory, hard disks are much cheaper and about one or two orders of magnitude of slower. But the problems of hard disks are more than the slow speed.
Disk Drives Are Bad at Handling High-Frequency Accesses in Small Amounts
There are computing scenarios where the frequency of data access is high and the amount of data retrieved each time is very small. With internal memory, it takes nearly same time to retrieve a hundred million bytes of data in ten million accesses, ten bytes at a time, and to retrieve the same amount of data in ten accesses with ten million bytes each time. This is because a byte is the smallest unit in the internal memory for data access, and either way, the number of bytes retrieved is the same.
Hard disk access uses a different mechanism. Hard disks store data by sectors. A sector is the smallest unit for accessing data on the hard disk (the size of one sector is 512 bytes by default). It takes almost the same time to read one byte and to read 512 bytes (time spent to copy data to the memory can be ignored compared with the time taken to retrieve data from the hard disk). So, it could take much longer to access the hard disk ten million times with ten bytes retrieved each time than to access it ten times with ten million bytes retrieved at a time, though the total amount of data retrieved is the same. A lot of useless data could be retrieved using the first way.
The real-life situations are even worse. Usually, data is stored directly in files under the control of the operating system. Most of the file systems use clusters as the data access unit. A cluster generally holds a default of 4K data, which is 8 times bigger than the 512 bytes. Besides, hard disk access is more complex than internal memory access, for which one CPU instruction is enough. Access to hard disks requires starting the disk controller with a series of actions and the location of to-be-retrieved data. Ten million hard disk accesses need the same number of start and locate actions. Though the retrieval of ten million bytes at one time involves not a few disk actions (because ten million bytes cover many sectors), the total number of access actions is much smaller. The performance of low-frequency retrieval in batch is far better than that of high-frequency retrieval in a small amount.
Random Access to Discontinuous Data Is a Major Factor
The use of “could” in assessing high-frequency data access in small amounts is because things may be different when data is stored continuously. If the to-be-read data is continuously stored, it won’t take very long to retrieve data, even in ten million accesses with ten bytes each time. Because for every access action, each ten bytes are already within the retrieved unit sector thanks to the buffer functionality both the disk controller and the operating system have, the actual number of hard disk accesses won’t be so huge and the performance difference between the two retrieval ways isn’t so big. The problem is the random high-frequency data access in small amounts, wherein the to-be-read data isn’t continuously stored and a retrieved unit sector doesn’t contain the next to-be-retrieved piece of data, causing a waste of time. The same unit sector will have to be re-read later if needed. The waste of data retrieval effort and the possible re-reads are certain to cause extremely poor performance.
For an HDD, random accesses involve head jumps, which correspond to the average seek time stated in a hard disk’s product standard. Locating the disk space where the to-be-read data is stored is an extremely slow mechanical movement, much slower than reading a sector. Even if each retrieved sector (or cluster) can be fully used, the seek time could be longer than the retrieval time when random hard disk accesses are performed. It’s wise to avoid random high-frequency data access in small amounts with an HDD.
Because an SSD doesn’t store data based on sectors, it doesn’t need the mechanical locating actions. This means there isn’t the seek time issue for random accesses. Yet the operating system simulates the HDD’s way for an SSD to read and write data according to the unit of a cluster (or sector), which is too big to perform random high-frequency data accesses in small amounts (less than a cluster) efficiently.
Parallel and Concurrent Processing Is the Root of the Problem
For computations that need continuous batch access only, like traversal-based aggregation or query, is the hard disk performance determined only by its access speed?
Generally, the answer is yes, if it is a single-thread computation or a single-task one. But today, parallel processing has become indispensable for high-performance computation, and a lot of computing services need to support concurrency. Both types of computation will, to some degree, transform continuous hard disk accesses to random accesses. The reason is simple. Multiple threads share the same hard disk, but their requests for access are not continuous. Head jumps happen when the hard disk is trying to respond to these requests, leading to random accesses.
Usually, this is disastrous to an HDD. Frequent switches between threads produce frequent seek actions, making multithreaded processing slower than single-threaded processing. Sometimes, the performance of single-task processing falls dramatically from acceptable to poor after concurrency is used. A countermeasure is increasing the size of the buffer for storing the retrieved data. If each thread retrieves a large enough amount of continuous data, the proportion of the seek time decreases and becomes less noticeable. But, this also increases memory consumption. The higher the number of threads there are, the more need there is for the memory space. An SSD needs less memory space, which is the size of a cluster (or a sector).
An example is columnar storage, which stores data continuously by columns. For a multi-column computation, even a single thread will lead to random hard disk accesses. If there are a lot of columns involved, columnar storage won’t necessarily be able to increase performance. Multithreaded processing will make the situation even worse. So, be careful when trying to use columnar storage with an HDD; this technique isn’t always a guarantee of better performance.
Two Algorithms for External Memory Computations
Because of those hard disk headaches, a different algorithm will be adopted and a steep decrease in performance happens when data to be processed is transferred from internal memory to an external memory device (the decrease rate isn’t a simple multiplication according to the proportion of the amount of the data transferred).
A typical example is JOIN operations. For a foreign key-based JOIN with a 1:N relationship, if all data is stored in the internal memory, we can create a hash index over the primary key of the dimension table pointed by the foreign key. Then, according to the index, we can quickly locate desired records during the traversal of the fact table. Even if there are multiple foreign keys pointing to different dimension tables respectively, one traversal is enough to achieve all JOINs. If the dimension table is too big to be accommodated by the internal memory, a different algorithm is needed. The hard disk can’t handle random high-frequency accesses in small amounts, which is the type of operation in accessing a dimension table. The use of the above algorithm will result in terrible performance, worse than the performance of a sorting and merge algorithm. An alternative is dividing a large table stored in an external device into multiple small segments that can be loaded into the memory for a JOIN according to the key values (their hash values, actually). But only one JOIN can be handled each time. Multiple same operations will thus be needed for a computation where the fact table corresponds to multiple large dimension tables via multiple foreign keys, causing a much bigger workload than the internal memory JOINs.
We can use a cluster to get away from the complicated external memory computations if the one computer’s memory space isn’t big enough. Though the performance of accessing to a unit amount of data across a cluster via the network when getting data stored in the internal memory of a node is little better than that with a hard disk due to the network latency, which is almost as slow as the hard disk, the internal memory’s capability in handling random access and concurrency becomes available. Similar to hard disk accesses, a connection to the network is resource-consuming, and thus inefficient in dealing with high-frequency data accesses in small amounts. To counter this issue, the results of multiple data access requests are collected and returned in batch. To solve JOINs involving big dimension tables, the cluster strategy is more complicated than the internal memory strategy, but still much simpler than an external memory algorithm by achieving multiple JOINs at a time.