Adding Transparent Data Encryption to InfinityDB
Better protect your data all in one file.
Transparent Data Encryption (TDE) is showing up in a few DBMSs these days. Here is how InfinityDB does it.
TDE allows client applications to almost completely ignore the complex issues surrounding security, except for the need to keep control of the encryption keys. That problem has many solutions: putting the keys in the same file system as the database is very weak, keeping them on a different host is stronger, and for solid enterprise use the best is to keep them in a Key Management Server (KMS). KMSs range from an AWS instance, to an on-premises VM, to Hardware Security Modules such as those used to secure the top levels of the Domain Name System.
The Problem Was Simpler
InfinityDB is a Java hierarchical key-value store that keeps all data in one file. As a result, the encryption is far easier than for many DBMSs, where data is distributed across logs, base tables, indexes, sort areas, rollback segments, replicas, and so on. We had only to encrypt data using a shim, which is a layer between the existing upper layer and the file, and to add a variable-length header for metadata. The shim is a virtual RandomAccessFile with the same basic API, a real RandomAccessFile under it, and encryption applied to the data passing between them. The shim provides a single cached block in both plaintext and encrypted form. Thus, virtually none of the original code changed, and we could put various new features into the header.
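The shim idea can be sketched as a wrapper that mirrors the RandomAccessFile surface, encrypting on the way down and decrypting on the way up. This is a hypothetical illustration, not InfinityDB's code: the class name, block layout (IV prepended to ciphertext), and the choice of AES/CTR are assumptions for the sketch.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of the shim: same flavor of API as RandomAccessFile,
// with a real RandomAccessFile underneath and encryption in between.
public class EncryptedBlockFile {
    static final int IV_LEN = 16;
    private final RandomAccessFile file;
    private final SecretKeySpec key;                 // AES-256 data key
    private final SecureRandom random = new SecureRandom();

    public EncryptedBlockFile(File f, byte[] keyBytes) throws Exception {
        this.file = new RandomAccessFile(f, "rw");
        this.key = new SecretKeySpec(keyBytes, "AES");
    }

    // Encrypts plaintext under a fresh per-block IV; stores IV || ciphertext.
    public void writeBlock(long offset, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_LEN];
        random.nextBytes(iv);                        // per-block random IV
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        file.seek(offset);
        file.write(iv);
        file.write(c.doFinal(plaintext));
    }

    // Reads IV || ciphertext back and returns the decrypted plaintext.
    public byte[] readBlock(long offset, int plaintextLen) throws Exception {
        file.seek(offset);
        byte[] iv = new byte[IV_LEN];
        file.readFully(iv);
        byte[] ct = new byte[plaintextLen];
        file.readFully(ct);
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(ct);
    }
}
```

Because the upper layer already works in terms of byte ranges at arbitrary offsets, none of its code needs to know the shim exists.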
Data Blocks Fall Just Short of 4096 Bytes
The data blocks under the shim each contain roughly 64 fewer bytes than the standard 4096-byte file system block size. This is fine because we are not doing Full Disk Encryption, and the RandomAccessFile abstraction is more flexible than a raw disk: I/O is not block-oriented but operates on a range of bytes at any offset. Also, InfinityDB blocks are variable-length for strong compression, so no alignment is needed.
The Block Overhead
The non-data area at the end of each block contains:
- a block number, which is encrypted and authenticated to prevent blocks from being maliciously rearranged.
- an initialization vector, a securely generated 16-byte random number that initializes the AES-256 cipher before it is applied to the block data. Such an IV is vital, and for maximum strength we securely randomize it per block rather than using a single constant IV over the entire file.
- an expansion area, currently all zeros.
- a 32-byte authentication and integrity hash covering all the preceding bytes. This hash is keyed rather than being a plain digest such as SHA-256: it is a standard HMAC-SHA256, which depends on a separate authentication key.
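The keyed tag at the end of each block maps directly onto the JDK's standard Mac API. A minimal sketch, assuming hypothetical names; the real code's structure will differ:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch: a 32-byte HMAC-SHA256 tag over all the bytes that precede it
// in a block (block data, block number, IV, expansion area), keyed by a
// separate authentication key so an attacker without the key cannot
// forge a valid tag.
public class BlockAuthenticator {
    private final SecretKeySpec authKey;

    public BlockAuthenticator(byte[] keyBytes) {
        this.authKey = new SecretKeySpec(keyBytes, "HmacSHA256");
    }

    // Computes the 32-byte tag stored at the end of the block.
    public byte[] tag(byte[] precedingBlockBytes) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(authKey);
        return mac.doFinal(precedingBlockBytes);
    }

    // Constant-time comparison avoids timing leaks on verification.
    public boolean verify(byte[] precedingBlockBytes, byte[] storedTag) throws Exception {
        return java.security.MessageDigest.isEqual(tag(precedingBlockBytes), storedTag);
    }
}
```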
The encryption key and the authentication key are independently randomly generated. How, then, do you decrypt the file? These two internal keys are themselves encrypted by the Password-Based Encryption (PBE) key that the client API uses. This two-layer scheme, using a key-encryption key, allows the PBE key to be changed at any time without re-encrypting the file data.
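The two-layer scheme can be sketched as a simple wrap/unwrap pair. This is an illustrative assumption, not InfinityDB's actual format: the class name and the IV-plus-ciphertext layout are invented for the sketch, but the property it demonstrates is the point of the design — the internal keys never change, so re-keying under a new password touches only a few dozen wrapped bytes, not the file.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch: the internal data-encryption and authentication keys are random
// and fixed; only their wrapped (encrypted) form depends on the
// password-derived key-encryption key (KEK).
public class KeyWrapper {
    private static final int IV_LEN = 16;
    private static final SecureRandom RANDOM = new SecureRandom();

    // Encrypts the concatenated internal keys under the KEK; returns IV || ciphertext.
    public static byte[] wrap(byte[] kek, byte[] internalKeys) throws Exception {
        byte[] iv = new byte[IV_LEN];
        RANDOM.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(kek, "AES"), new IvParameterSpec(iv));
        byte[] ct = c.doFinal(internalKeys);
        byte[] out = new byte[IV_LEN + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    // Recovers the internal keys given the same KEK.
    public static byte[] unwrap(byte[] kek, byte[] wrapped) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(kek, "AES"),
                new IvParameterSpec(java.util.Arrays.copyOf(wrapped, IV_LEN)));
        return c.doFinal(wrapped, IV_LEN, wrapped.length - IV_LEN);
    }
}
```

Changing the password amounts to unwrapping with the old KEK and re-wrapping the same internal keys with the new one.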
The PBE key must go through a Key Derivation Function (KDF), in our case PBKDF2, the current standard. That derivation requires a repeated application of HMAC-SHA256 to the PBE key, about 1,000 to 10,000 times, in order to make it deliberately slow to compute; otherwise, GPUs would have an easier time with brute-force attacks. This only makes the database about 10 ms slower to open. The iteration count needs to increase over time as GPUs speed up, so we store it in the header. Also in the header is a securely generated random salt that is input to PBKDF2.
The header is itself encrypted and authenticated. It has its own random cipher initialization vector and holds the PBE-encrypted data encryption and integrity keys. These, plus the salt and the KDF iteration count, are actually stored before the main header because they are needed in order to access and validate the rest of it. The variable-length header leaves plenty of room for other features, such as multiple independent signing certificates and signing public keys, the actual signatures, plus future extensions.
TDE is a very convenient way to address the new data security requirements we are all facing. See boilerbay.com for more.