GDPR and Cassandra
What does the GDPR mean for databases and the data stored therein? Read on to get the thoughts from an expert on Cassandra.
delete /dɪˈliːt/ — verb: remove or obliterate (written or printed matter), especially by drawing a line through it.
As we all know, GDPR comes into force in May 2018. After that, users of software products and services will have the right to be forgotten (cool, right? Finally I can rest assured that my browsing history will not be read aloud at my funeral). In other words, if a user from the EU asks a service provider to delete their data, the provider will have to delete all of that user’s data or face severe consequences.
But it is unclear what it actually means to delete a user’s data. I guess we will only find out when the first audits occur.
This post is the introduction to a series of blog posts about GDPR and Cassandra databases.
Cassandra and Data Deletion
As Cassandra consultants, our primary concern is: what does it mean to delete data from Cassandra’s point of view? And what can we do to be as sure as possible that a user’s data will stay deleted? As we know, when Cassandra deletes data, it just marks it as deleted by writing a tombstone. The actual “deletion” occurs during the compaction process.
When Cassandra marks data as deleted:
- It can’t be fetched anymore using Cassandra’s query language (CQL).
- The data still exists in Cassandra’s files on the disk (SSTables) but is flagged as deleted.
- The data is removed (for real) from SSTables only when a compaction evicts it, which cannot happen before the tombstone is older than the table’s gc_grace_seconds.
- Until compaction evicts the deleted data, it can still be accessed with specialized (forensic?) tools.
Once again: Cassandra, like many other systems, does not actually delete data when it deletes the data. But this is in line with the definition of the verb delete from the Oxford dictionary:
“remove or obliterate (written or printed matter), especially by drawing a line through it.”
On the other hand, a similar thing happens in the underlying OS (Linux). When the OS deletes a file, it just marks it as deleted. And you can recover the deleted files with specialized forensic tools.
Okay, so actual, irreversible deletion of data does not usually happen in software engineering. But we would love to do as much as we can to make sure that the data is not accessible from Cassandra or from any Cassandra tooling (like sstabledump or sstable2json). OS and file system engineers should do their part by doing the same at the OS level (if they think that’s necessary).
Another problem in Cassandra is that it is hard to filter on fields that are not part of the primary key. So, if some of the user’s data is held in the table where the primary key is something like deviceId, that would mean that we would have to search all the records for all the deviceIds and remove the corresponding user’s data. That does not scale.
Data Deletion and Compactions
As already said, even after a delete statement is issued, it is not guaranteed that the data is deleted. Furthermore, if the data model is not well designed, the deleted data might never get evicted. In Cassandra 3.10, this behavior is improved: compaction is triggered when an SSTable contains a certain percentage of expired tombstones (read more about it here). A deleting compaction strategy also looks like it could solve this problem (note that the strategy is not an official part of Apache Cassandra). Also, I’m quite sure that I saw a Jira issue in the Apache Cassandra project about some other kind of deleting compaction strategy, one that would guarantee that data is actually deleted and not only marked as deleted, but I can’t find it now. That would be cool.
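As a rough illustration of the knobs involved, the tombstone-related settings can be adjusted per table. This is a sketch, not a recommendation: the table name is the example table used later in this post, the values shown are simply the documented defaults except for unchecked_tombstone_compaction, and whether changing any of them helps depends on your data model and Cassandra version.

```cql
-- Sketch only: tune against your own workload before changing anything.
ALTER TABLE device_measurements
WITH gc_grace_seconds = 864000            -- tombstones younger than this are never evicted
AND compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.2',              -- ratio of droppable tombstones that makes an SSTable a candidate
    'tombstone_compaction_interval': '86400',  -- minimum SSTable age (seconds) before such a compaction
    'unchecked_tombstone_compaction': 'true'   -- attempt it without the pre-check against overlapping SSTables
};
```

Lowering gc_grace_seconds shortens the time deleted data lingers, but repairs then have to complete within that window, otherwise deleted data can reappear on replicas that missed the delete.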
Speaking of compaction strategies, SizeTieredCompactionStrategy can be tricky: if you end up with one huge SSTable file, it will only be compacted together with other SSTables of a similar size. This means that the tombstones will stay in the huge SSTable for a very long time, maybe forever. The situation is similar to the one occurring in the 2048 game:
Tile 2048 will not be merged anytime soon.
The main takeaway is: be aware of how different compaction strategies work and know your system’s behavior. If you have a problem with tombstone eviction, it might be a good idea to change your compaction strategy and/or to redesign your tables.
Delete User Data That Is Not Part of the Primary Key
Unlike in relational databases, in Cassandra data is stored in denormalized form. Thus, it is not possible to (easily) filter on fields that are not part of the partition key. So, if we have the following table:
CREATE TABLE device_measurements ( device_id uuid, measurement_type text, measurement_value text, user_id uuid, PRIMARY KEY (device_id, measurement_type));
This means that we cannot just:
DELETE FROM device_measurements WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;
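Adding ALLOW FILTERING to the DELETE does not help either, because CQL accepts it only on SELECT statements. It is, however, possible to find the affected rows first:
SELECT device_id, measurement_type FROM device_measurements WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b ALLOW FILTERING;
and then delete each of them by its primary key. But this scans the whole table and might ruin the performance of the entire cluster.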
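Whichever way the affected rows are found, a delete that Cassandra accepts has to restrict the primary key columns. A minimal sketch, assuming the lookup returned one device and measurement type (both values below are made up for illustration):

```cql
-- Hypothetical key values, for illustration only.
-- Delete a single row by its full primary key:
DELETE FROM device_measurements
WHERE device_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47
  AND measurement_type = 'temperature';

-- Or delete the whole partition by its partition key alone:
DELETE FROM device_measurements
WHERE device_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;
```

Note that the partition-level delete removes every measurement of that device, which is only appropriate if the device belongs to a single user.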
Therefore, we should think about the user’s data in advance when designing the tables.
Embracing Privacy by Design
Solution 1: design tables in a way that the user’s data can be easily deleted (user_id part of the primary key) from all the tables. This solution will obviously have an impact on the design process in both greenfield projects and when redesigning existing databases.
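One way Solution 1 could look for the example table is sketched below. The table names device_measurements_by_user and devices_by_user are hypothetical, and the right choice depends on how the data is queried, so treat both options as starting points and not as a finished design.

```cql
-- Option A (hypothetical table): user_id is the partition key,
-- so all of a user's rows can be deleted with one statement.
CREATE TABLE device_measurements_by_user (
    user_id uuid,
    device_id uuid,
    measurement_type text,
    measurement_value text,
    PRIMARY KEY ((user_id), device_id, measurement_type)
);

DELETE FROM device_measurements_by_user
WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;

-- Option B (hypothetical table): keep device_measurements as it is and
-- maintain a lookup table that maps each user to their devices.
CREATE TABLE devices_by_user (
    user_id uuid,
    device_id uuid,
    PRIMARY KEY ((user_id), device_id)
);

-- Find the partitions to delete without scanning the cluster:
SELECT device_id FROM devices_by_user
WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;
```

Option A changes how the data is partitioned, so reads by device_id would need their own table. Option B keeps the existing read path but requires the application to keep the lookup table in sync on every write.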
Solution 2: embrace encryption. Okay, this is not a production-ready solution; it’s more of an idea we’re currently playing with at SmartCat. The idea is to encrypt each user’s stored data with a per-user key, using an order-preserving encryption scheme where the ordering of clustering columns has to be preserved, and, when the data needs to be deleted, to just delete the key (an approach sometimes called crypto-shredding). If you have any thoughts on this or experience to share, we would love to hear from you.
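To make the idea a bit more concrete, here is a sketch of what the storage side might look like. Everything in it is an assumption: the table names user_keys and device_measurements_encrypted are hypothetical, and the encryption and decryption are assumed to happen in the application, since nothing below makes Cassandra encrypt individual values.

```cql
-- Hypothetical table holding one data key per user.
CREATE TABLE user_keys (
    user_id uuid PRIMARY KEY,
    data_key blob
);

-- Hypothetical table where the application stores values
-- already encrypted with the owning user's key.
CREATE TABLE device_measurements_encrypted (
    device_id uuid,
    measurement_type text,
    user_id uuid,
    encrypted_value blob,
    PRIMARY KEY (device_id, measurement_type)
);

-- "Forgetting" the user: without the key, the encrypted values are unreadable.
DELETE FROM user_keys
WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;
```

One caveat with this sketch: if the key lives in Cassandra, deleting it only writes a tombstone, so the key is subject to the same delayed eviction described above. Keeping keys in a separate key management system is one possible way around that.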
Embrace privacy by design. The idea of GDPR is a good thing from a consumer perspective. A user’s data will be seen as a liability for companies, not as an asset, which means that companies will, hopefully, be cautious when storing a user’s data. GDPR is also an excellent opportunity for new players in the database-as-a-service (DaaS) market or some derivative of the concept; it seems that it is easier to build new systems with privacy in mind from scratch than to refactor existing ones. What I would like to see is a database (as a service) that would allow me to issue a delete for a userId and then, as a programmer/user of the database, stop worrying about it. The DaaS provider would be responsible for the rest.
What are your thoughts on this?
Published at DZone with permission of Milan Milosevic, DZone MVB. See the original article here.