Checksums in Storage Systems and Why the Enterprise Should Care
With more data being transferred than ever before, it's of the utmost importance to make sure it reaches its destination safely.
Random bit flips are far more common than most people, even IT professionals, think. Surprisingly, the problem isn’t widely discussed, even though it silently causes data corruption that can directly affect our jobs, our businesses, and our security. It’s unsettling to know that such corruption happens in the memory of our computers and servers, that is, before the data even reaches the network and storage portions of the stack. Google’s in-depth study of bit-level DRAM errors showed that memory errors, including uncorrectable ones, are a fact of life. And do you remember the time when Amazon had to globally reboot their entire S3 service due to a single bit error?
The Error-Prone Data Trail
Let’s assume for a moment that your data survives its many passes through a system’s DRAM and emerges intact. That data must then be safely transported over a network to the storage system, where it is written to disk. How do you ensure the data remains unaltered along the way? Well, if you’re using one of the storage protocols that lack end-to-end checksums (e.g. NFSv2, NFSv3, SMBv2), your data remains susceptible to random bit flips and data corruption. Even NFSv4 plus Kerberos 5 with integrity checking (krb5i) doesn’t offer true end-to-end checksums: once the data is extracted from the RPC, it is unprotected once again. In addition, NFSv4 has not been widely adopted, and fewer installations still use krb5i.
Over a decade ago, the folks at CERN urged that “checksum mechanisms (…) be implemented and deployed everywhere.” This appeal carries even more weight today when one considers the storage sizes and daily rates of data transfer we’re dealing with. Data corruption can no longer be dismissed as just a “theoretical” issue. And if you think modern applications protect against this problem, I’ve got bad news for you: In 2017, researchers at the University of Wisconsin uncovered serious problems in some storage systems when they injected bit errors into well-known and widely used applications.
Checksums Came at a Cost That’s Worth Its Price Today
When NFS was designed, file writes and the general amount of data were relatively small, and checksum computations were very expensive. Hence, the decision to rely on TCP checksums for data protection seemed reasonable. Unfortunately, these checksums proved to be too weak: the TCP checksum is only 16 bits wide, so a randomly corrupted packet has roughly a 1-in-65,536 chance of slipping through undetected, and that adds up quickly when you transfer gigabytes per second. What about Ethernet checksums, you ask? They are indeed stronger. However, they don’t allow for end-to-end protection, and opportunities for data corruption are manifold: cut-through switches that don’t recompute checksums and kernel drivers for NICs are just two examples of where things can go horribly wrong.
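To make the weakness concrete, here is a minimal sketch of a 16-bit ones' complement checksum in the style TCP uses (RFC 1071), compared with CRC32 from Python's standard library. The function name `internet_checksum` and the sample payloads are illustrative assumptions, and the real TCP checksum also covers a pseudo-header that is left out here. Because the checksum is a plain sum of 16-bit words, swapping two words leaves it unchanged, while a CRC notices the difference.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071 (illustrative)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"ABCDEFGH"
reordered = b"CDABEFGH"  # the first two 16-bit words have been swapped

# Addition is commutative, so the 16-bit sum cannot see the reordering.
same_sum = internet_checksum(original) == internet_checksum(reordered)

# CRC32 is position-sensitive and detects the change.
same_crc = zlib.crc32(original) == zlib.crc32(reordered)

print("TCP-style checksum matches:", same_sum)
print("CRC32 matches:", same_crc)
```

Word reordering is only one class of error the sum misses; two bit flips that cancel each other out in the same bit position of different words go undetected as well.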
Checksums and the End of Silent Data Corruption
Experts have seen such silent data corruption happen, even in mid-sized installations. In one instance, enterprise administrators were informed that their data corruption happened in transit. At that point, they began investigating the network stack. It turned out to be a driver-related issue that occurred after a kernel update broke the TCP offload feature of their NICs. Tracking down the problem was both difficult and time-consuming.
That’s where end-to-end checksums come in. In one use case, as soon as the system receives data from the operating system, each block (usually 4k bytes, but that can be adjusted in the volume configuration) is checksummed. Because this checksum stays with the data block forever, the data is protected, even against software bugs, as it travels through the software stack. The checksum is validated along the path throughout the life of the data, and periodic disk scrubbing validates it even at rest, when the data isn’t being accessed. All this is possible because the system does not rely on dated legacy protocols like NFS. Instead, it uses an RPC protocol in which each data block, and the message itself, is checksum-protected. And since modern CPUs have built-in CRC32 computation capabilities, the performance penalty for using CRCs is now negligible.
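The per-block scheme described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation of any particular storage system: the helper names (`checksum_blocks`, `scrub`, `flip_bit`) are hypothetical, blocks live in a Python list instead of on disk, and `zlib.crc32` stands in for whatever CRC variant a real system would use (hardware CRC instructions typically compute CRC32C, a different polynomial).

```python
import zlib

BLOCK_SIZE = 4096  # the typical 4k block size mentioned in the text


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks and store each block together with its CRC32."""
    return [
        (data[i:i + block_size], zlib.crc32(data[i:i + block_size]))
        for i in range(0, len(data), block_size)
    ]


def scrub(blocks):
    """Re-verify every block; return the indexes of blocks that fail the check."""
    return [
        index
        for index, (block, stored_crc) in enumerate(blocks)
        if zlib.crc32(block) != stored_crc
    ]


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by flipping a single bit."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


data = bytes(range(256)) * 40          # 10,240 bytes, so three blocks
blocks = checksum_blocks(data)
clean_result = scrub(blocks)           # nothing is corrupted yet

# Corrupt one bit in the second block while keeping its stored checksum.
block, stored_crc = blocks[1]
blocks[1] = (flip_bit(block, 123), stored_crc)
corrupt_result = scrub(blocks)         # the scrub pinpoints the bad block

print("before corruption:", clean_result)
print("after corruption:", corrupt_result)
```

The key design point is that the checksum is computed once, as early as possible, and travels with the block, so the same check can be repeated at every hop and again during scrubbing.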