GDPR : Designing Privacy and Data Protection
There's been a lot written about the impending GDPR regulations, but not much of it comes from developers. Read on to get a fellow dev's point of view.
Many existing articles cover the GDPR as a concept, but few address the technical challenges that accompany GDPR compliance. The aim of this post is to suggest a valid approach to overcoming those challenges.
The General Data Protection Regulation (GDPR) will come into effect on 25 May 2018 and will change the way companies collect, process, and store user data.
Privacy by design and data protection by design are essential parts of the GDPR. Privacy by design means that organizations need to consider privacy at the initial design stages and throughout the complete development process. From 2018, data protection will become an integral part of technological development, as well as of how the product or service is delivered.
As developers and decision-makers, we need to carefully design our system to respond to privacy requirements. This topic is vast - you can gain a better understanding of what privacy by design involves (and best practices) here. In this post, we will try to focus on how to store and guarantee the privacy of user data.
To start, let's define our main objective as personal data defenders: our highest-priority task is to make sure an individual cannot be identified from the data, either on its own or when combined with other information, and, most importantly when dealing with sensitive information, to make sure that information cannot be linked to any living individual. In other words, we are responsible for the care of Personally Identifiable Information (PII).
Driver's license numbers, credit/debit card account numbers, and social security numbers are well known as sensitive data, but there are many other kinds that are less well known, such as racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, health data, genetic data, biometric data, and legal records, to name a few.
What is important to keep in mind is that some information is not sensitive if it is not linked to an individual. For example, governments release reports about diseases in an area every year without disclosing individual records.
The same goes for personal identification - there's no problem in having a list of addresses in your database if they are not linked to a living individual (Google has all addresses mapped and paired with photos).
We can have information about employees in our database, such as salary, the date when they started, and the holidays they take, but that data should not be linked to an identifiable person. The same goes for customers' orders - there is no problem in having data about transactions, but they should not be attributable to a data subject.
When your data is leaked, it will be used along with other databases, and the combination of several databases can lead to bad actors identifying targets. It is not an easy job to figure out what to remove from your dataset, but keep in mind that 87% of the U.S. population is uniquely identified by the combination of ZIP code, gender, and date of birth.
When anonymization is possible, you can remove data or use k-anonymity and l-diversity approaches (these deserve a Google search) to tackle privacy in those cases. However, when the identity of the data owner is essential for our application, we can't use anonymization. In that case we can use encryption, pseudonymization, or both.
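To make k-anonymity and l-diversity a little more concrete, here is a minimal sketch in Python. The records, field names, and the helper functions `k_anonymity` and `l_diversity` are hypothetical and exist only for illustration. The sketch only measures a dataset that has already been generalized; it does not perform the generalization itself.

```python
from collections import Counter, defaultdict


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values. The dataset is k-anonymous for that k."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    group of records that share the same quasi-identifier values."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[field] for field in quasi_identifiers)
        groups[key].add(record[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


# Hypothetical records, already generalized (ZIP prefix, age range).
records = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
]

print(k_anonymity(records, ["zip", "age"]))               # 2
print(l_diversity(records, ["zip", "age"], "diagnosis"))  # 2
```

A result of 2 means that every combination of ZIP prefix and age range is shared by at least two records, and that each of those groups contains at least two different diagnoses.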
These two kinds of privacy protection are double-edged swords. Although encryption can enhance privacy, it adds work for your development and security teams, as they need to write the encryption and decryption operations. The efficiency of database operations will decrease, and if we encrypt the data before inserting it into the database, we lose the ability to query the database by the original value. Transparent Data Encryption will not always be available for our database either.
An acceptable approach is to remove relationships, thus protecting the privacy that those relationships might compromise. After removing all sensitive relationships, we add a pseudo-reference that links the sensitive data to an individual. This approach is a kind of pseudonymization of the individual's ID.
Pseudonymization is a central feature of “data protection by design.” The word appears several times in the GDPR text, whilst the word encryption appears only four times:
- “…implement measures to mitigate those risks, such as encryption.” (P51. (83))
- “…appropriate safeguards, which may include encryption” (P121 (4.e))
- “…including inter alia as appropriate: (a) the pseudonymization and encryption of personal data.” (P160 (1a))
- “…unintelligible to any person who is not authorized to access it, such as encryption” (P163 (3a))
You will notice that encryption always comes in as a suggestion, and the text gives no real context (Encryption at rest? In transit? Where is the data encrypted? What level of encryption? etc.).
The pseudo reference is a code generated using techniques such as hash functions, tokenization, or encryption. The pseudonym allows data to be traced back to its origins, which distinguishes pseudonymization from anonymization. The additional information necessary to trace the data back must be kept separately to ensure non-attribution to an identified or identifiable person.
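As one possible illustration of generating such a pseudo reference, the sketch below uses a keyed hash (HMAC-SHA256) from Python's standard library. This is an assumption made for the example, not necessarily the technique the author uses: the function name `pseudo_reference`, the identifier format, and the key are all hypothetical. With a keyed hash, tracing back works by recomputation: whoever holds both the key and the person's ID can derive the same pseudonym again, so the key plays the role of the additional information that must be kept separately.

```python
import hashlib
import hmac


def pseudo_reference(person_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym for person_id using HMAC-SHA256.

    The same (person_id, secret_key) pair always yields the same value,
    but the value cannot be recomputed without the key."""
    return hmac.new(
        secret_key, person_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Hypothetical key for illustration only. In practice it would be loaded
# from a secret store kept apart from the pseudonymized data.
SECRET_KEY = b"example-key-do-not-use-in-production"

ref = pseudo_reference("person-42", SECRET_KEY)

# The pseudonym is deterministic for a given key...
assert ref == pseudo_reference("person-42", SECRET_KEY)
# ...and changes completely when the key changes.
assert ref != pseudo_reference("person-42", b"another-key")
```

A plain unkeyed hash of the ID would be a weaker choice here, because anyone who can guess or enumerate the IDs could recompute the pseudonyms.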
So, pseudonymization, along with column encryption, is an effective approach to keeping relationships private when we need to trace data back to its origins. To make the relationships private, we can replace the usual foreign key with an encrypted identifier that only the domain knows.
Let’s look at the case where relationships between entities are to be kept private.
The application keeps the strategy for encrypting a person's identifier, based on an encryption key that is available only to the application. The relationship can be resolved only from the person's domain, which is able to generate the encrypted code and use it to fetch the matching rows from the other tables.
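The idea can be sketched with an in-memory SQLite database. Everything in this example is an assumption made for illustration: the table and column names, the `person_ref` and `salaries_for` helpers, and the use of HMAC-SHA256 as the mechanism that produces the code. The point is only that the `salary` table carries no foreign key to `person`, so the link can be rebuilt only by code that holds the key.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical key, assumed to be available only to the application.
SECRET_KEY = b"hypothetical-application-key"


def person_ref(person_id: int) -> str:
    """Derive the pseudo reference stored in place of a foreign key."""
    return hmac.new(
        SECRET_KEY, str(person_id).encode("utf-8"), hashlib.sha256
    ).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- No foreign key to person: only the pseudo reference is stored.
    CREATE TABLE salary (person_ref TEXT NOT NULL, amount INTEGER NOT NULL);
    CREATE INDEX idx_salary_person_ref ON salary (person_ref);
""")

conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (1, "Alice"))
conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (2, "Bob"))
conn.execute(
    "INSERT INTO salary (person_ref, amount) VALUES (?, ?)",
    (person_ref(1), 5000),
)
conn.execute(
    "INSERT INTO salary (person_ref, amount) VALUES (?, ?)",
    (person_ref(2), 4200),
)


def salaries_for(person_id: int) -> list:
    """Resolve the relationship from the person's side by recomputing
    the pseudo reference and querying by it."""
    rows = conn.execute(
        "SELECT amount FROM salary WHERE person_ref = ?",
        (person_ref(person_id),),
    ).fetchall()
    return [amount for (amount,) in rows]


print(salaries_for(1))  # [5000]
```

Someone who obtains only the database dump sees salaries next to opaque codes and cannot join them back to `person` without the application's key. Because the code is deterministic, the column can still be indexed and queried by equality.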
In this post, we have focused on the problem of link identification. We have proposed an approach for pseudonymizing sensitive relationships by generating a pseudo reference based on an encryption mechanism, and we have looked at how it can help your company enhance its users' privacy.
Please join the discussion and leave your thoughts about the suggested approach and what you are doing to achieve GDPR compliance.
Published at DZone with permission of Alexsandro Souza. See the original article here.