GRAKN.AI to Manage GDPR
See how you can use GRAKN.AI to help manage your GDPR requirements in your organization. If you're looking for resources to help in this daunting task, this is the article for you.
In my understanding, GDPR asks companies to track every piece of information that they require from a user and to ask the user for permission for each piece of information. Thus, all the systems a company uses need to be able to request each and every needed authorization. That is not a small task.
The difficulty of tracking and managing all these authorizations is the reason I propose to use an external system, or meta-system, that all other software used in the company will have to refer to. These systems will have to verify whether a user has granted permission to access the system and, if not, ask for it and record the answer back in the meta-system.
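To make the check, ask, and record cycle concrete, here is a minimal Python sketch. It is not the author's implementation: ConsentRegistry, may_use, and the ask callback are hypothetical names, and a plain dictionary stands in for the meta-system.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[int, str, str]  # (user id, system name, data item)


class ConsentRegistry:
    """Hypothetical in-memory stand-in for the central meta-system."""

    def __init__(self) -> None:
        self._decisions: Dict[Key, bool] = {}

    def lookup(self, key: Key) -> Optional[bool]:
        # None means the user was never asked for this item in this system.
        return self._decisions.get(key)

    def record(self, key: Key, granted: bool) -> None:
        self._decisions[key] = granted


def may_use(registry: ConsentRegistry, key: Key,
            ask_user: Callable[[Key], bool]) -> bool:
    """Check the registry first; ask the user only if nothing is recorded."""
    decision = registry.lookup(key)
    if decision is None:
        decision = ask_user(key)
        registry.record(key, decision)
    return decision


registry = ConsentRegistry()
questions_asked = []


def ask(key: Key) -> bool:
    questions_asked.append(key)
    return True  # pretend the user clicked "I agree"


first = may_use(registry, (1, "newsletter", "email"), ask)
second = may_use(registry, (1, "newsletter", "email"), ask)
```

The second call does not ask the user again, because the answer is already recorded centrally. That is the point of having every system refer to one place.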
The idea is to make this as simple and flexible as possible and not to be cornered by choices that were made a few months ago. This is where graph databases shine. Graph databases are flexible by design, and it is really easy to add new relations (edges) between nodes (vertices) without doing a migration, as we would have to in the RDBMS world. Searches are typically very fast, especially across many relations.
But we want more. We want the benefit of a graph database but we also want to be able to easily extract knowledge from the data we will be inserting in the graph database. We could have chosen OWL or RDF for that purpose, but these semantic web technologies are too verbose and — because they were not designed to be used as a database — not really up to the task at hand.
The solution I have found is GRAKN.AI, which is a distributed knowledge base. While it is not technically a graph database, it leverages the best technical strengths of graph databases but also uses a schema to define its structure. The schema in GRAKN.AI is very simple to write and allows us to have data validation. Moreover, GRAKN.AI also has a built-in inference engine, which will help us simplify queries and provide basic recommendations.
Building an Ontology
As mentioned before, GDPR requires data handlers/processors to track every piece of information. An ontology would usually be designed with an entity person that would have many attributes. You could have an entity man and another entity woman that inherit attributes from the person entity. This is the typical object inheritance concept. That works, but it is not very practical from a GDPR point of view.
That is where GRAKN.AI can be used to leverage the opportunity of GDPR. The knowledge base lets us decompose user data into single items, which can then be individually granted an authorization in order to have this data used in a system. Inference rules can help simplify queries. Let's start with
I think it is better to decompose each of the attributes so that they are entities in their own right. Thus, they will be able to take an active part in the graph relations. In GRAKN.AI, attributes of an entity are declared as follows:
value sub attribute datatype string;
Here, attribute is a reserved keyword, value is the name of the attribute, and string is the declared datatype, which will be used to validate the incoming data. In the graph, my attributes will be entities; therefore, I declare an abstract entity property, which is not a reserved keyword, and all my attributes (email, firstname, etc.) will inherit from this abstract entity. GRAKN.AI will prevent me from creating a property directly; I will only be able to create its children. This is a great feature of GRAKN.AI, as it really helps with the creation of schemas and their maintenance. Let's see how the property entity is declared:
property sub entity is-abstract has value
plays owned plays demand plays authorizer plays exported plays imported plays revoker plays withdrawer;
We can then declare our sub-entities as one-liners:
last-name sub property;
first-name sub property;
email sub property;
The advantage given by the abstract entity property is very clear here: all the roles and the attribute (singular, intentionally) are declared once and inherited by all the sub-entities. When another relation is required, add it to the parent entity and all sub-entities will reflect the change.
The other advantage is that it simplifies querying for properties. I do not have to query each sub-entity; instead, I can query the abstract entity, and get all the sub-entities:
match $a isa property; $p isa person, has identifier 123456; (exporter:$x, exported:$a, exported-to:$z); (owned:$a, owner:$p); get;
The above request will return all the properties that have been exported to a system, and by whom (i.e. which user), limiting this request to a specific owner of type person. But we will come back to those relations. What is important in this request is to understand that I can query the abstract property to get all its children. The identifier’s value would, of course, be passed dynamically.
In the same manner, a sub-entity of a sub-entity can be declared in GRAKN.AI. It is, for example, very useful for addresses:
address sub property has value; city sub address; zip sub address; street1 sub address; street2 sub address; street3 sub address;
Here, the address is not abstract, as I want to be able to give a value to that address. Typical values could be “home,” “professional,” “billing,” or “delivery.” As before, I want to be able to query all the children of an address at once and get their value (singular, intentionally). The tradeoff here is that I did not declare has name for the address, as this attribute would have been inherited by all its children. It does not make sense for a ZIP to have both a name and a value, but I think that it is easy enough to remember that an address has a value (i.e. the name of the address).
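To illustrate why querying a parent type returns all of its children, here is a small Python sketch of the type hierarchy. It only mimics the idea behind isa; the PARENT table mirrors the sub declarations above, while is_a and the sample instances are made up for the example.

```python
# Child type -> parent type, mirroring the "sub" declarations above.
PARENT = {
    "last-name": "property",
    "first-name": "property",
    "email": "property",
    "address": "property",
    "city": "address",
    "zip": "address",
    "street1": "address",
}


def is_a(type_name, ancestor):
    """True if type_name is ancestor or inherits from it, directly or not."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = PARENT.get(type_name)
    return False


# Made-up instances, each one a (type, value) pair.
instances = [
    ("email", "email@example.com"),
    ("address", "home"),
    ("city", "Lausanne"),
    ("zip", "1003"),
]

# Asking for the abstract parent returns every kind of property...
all_properties = [i for i in instances if is_a(i[0], "property")]
# ...while asking for address returns the address and its parts only.
address_parts = [i for i in instances if is_a(i[0], "address")]
```

In the knowledge base this lookup is done for us by the schema; the sketch only shows what the inheritance chain buys at query time.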
Now that we have decomposed all the properties of a person into individual nodes (or entities), we need a node in the graph to attach all these properties, which can be numerous, to one user or, as I named it in my schema, one person. As this person has been deprived of all its properties, the person node is very simple:
person sub entity has timestamp has type has identifier
plays identified plays imported plays importer plays exported plays exporter plays owner;
It has three attributes. What the timestamp does is obvious; the type is the type of person that makes sense for the European Respiratory Society (ERS). The value could be any person linked to our organization: members, non-members, or staff. This raises an interesting point. GDPR is for users of your product, i.e. for your clients. But GDPR is also for your staff. They also have the right to be forgotten and to retrieve their data from your systems. The identifier is a unique identifier that makes sense for your data. At ERS, this is an integer, but it could be a string or anything else.
Having an identifier makes querying the person node easier. Without it, you always have to check for a property and its relation with the person node you are trying to reach:
match $p isa person; $e isa email, has value "email@example.com"; ($e, $p) isa belongs; get $p;
match $p isa person has identifier 1; get;
Both queries work and return exactly the same result, but one is obviously shorter. I did not do any benchmarking, but the first query should logically be slower, as it does not directly access the information we are interested in. In the visualizer, there is no visible difference.
Before developing more on relations, let’s finish with the entities. I want to mention an important “trick.” The knowledge base is intended to hold all our users, but also all of their interactions, and also all of our content (or at least references to our content). The idea is to add relations between a user and content, between content and content, between content and topics, between a user and topics, etc. Doing this will enable the ERS to have good knowledge of its users, of its content, and of how all these things relate. GRAKN.AI, therefore, serves as the ideal base for our recommender system.
What happens when a user has to be deleted? Does the system become dumber? In fact, it will if I have a user connected to content items, or topics, etc. I will have to delete all those relations when a user asks to be deleted. If other users were suggested content based on that user’s behavior, the recommender system would lose information.
The solution? Abstract the user away. How could this be done? I have added a new entity in the graph:
anonymous sub entity has timestamp plays incognito;
The idea is to connect all interactions with content and all the user's behaviors (clicked, read, went to an event) to the anonymous node.
When a user requests his data to be deleted, we can delete everything that is on the right of the anonymous node. Anything that is connected to the anonymous node will stay in the graph. In my view, this is a very good compromise. The anonymous node does not store any information about the person node. Thus, when the relation between the anonymous node and the person node is deleted, there is no way to know which anonymous node the person node was connected to; but at the same time, all the data around the user stays. The deletion of the relation between the anonymous node and the person is, to my knowledge, irreversible. There is one exception: if the person who asks to be deleted is the first one, we could query for all the anonymous nodes, then for all those that have a relation to a person, and filter them out. To avoid that, we could create a few fake anonymous nodes.
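The deletion described above can be sketched on a toy edge list in Python. This is only an illustration under assumptions: the node names, the read and attended relations, and the forget_person helper are invented, and a real deletion would be done with queries against the knowledge base.

```python
# Toy graph: each edge is (relation, source, target). All names are made up.
edges = {
    ("belongs", "email-1", "person-1"),
    ("belongs", "last-name-1", "person-1"),
    ("identifies", "person-1", "anonymous-1"),
    ("read", "anonymous-1", "article-42"),
    ("attended", "anonymous-1", "event-7"),
}


def forget_person(edges, person):
    """Drop every edge touching the person, including the link to the
    anonymous node; edges that start from the anonymous node survive."""
    return {
        (relation, source, target)
        for (relation, source, target) in edges
        if person not in (source, target)
    }


remaining = forget_person(edges, "person-1")
```

After the call, the behavior attached to anonymous-1 is still in the graph, but nothing in it points back to person-1.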
There is one caveat, though. We saw in the previous post that a few personal data points suffice to re-identify a person. If we record all the events the person went to and then match that to the history of payment transactions, we should not be far away from identifying an individual. Legally, in Switzerland, we need to keep the accounting data for ten years. This presents a kind of paradox. Honestly, I do not know how far one should go to prevent re-identification. This is a question for a lawyer and not for the basement guy in a hoodie.
In order to make our schema complete, we need to add other entities. This is because we need to know what authorizations were given and what systems are used in the company. I intentionally use the word system as it is vague enough. For example, in the case of ERS, systems would encompass our CRM, websites, apps, or MailChimp but also any kind of export one could do, such as Excel, CSV, or JSON, as well as external systems when you exchange data with partners, etc.
system sub entity has value has icon
plays importer plays exporter plays requester plays authorized plays exported-to plays imported-to;
The icon is not necessary. We simply store a string that will allow us to display nice icons for each system in the dashboard in order to make it more appealing.
We finally need a last entity: the authorization. I thought it was better to have the authorization as an entity, as we could have many authorizations per property. MailChimp requires an authorization to use the email address to send the newsletter, but it could ask for additional authorizations to send advertisements, job listings, etc. The email address is also used in the CRM, and authorizations could range from system mail to the staff contacting the member.
Additionally, authorizations need to be queried by the systems in order to display the list of data required from the user as well as displaying all the necessary authorizations.
Although the above graph does not display the system name, we can easily see from the authorization values that the system in question is sending emails. It requires three authorizations that are all related to emailing (the timestamp is just composed of fake numbers). The descriptions can be displayed to the users. In my understanding, this is required by GDPR, as it is not possible to ask for general or vague authorizations such as “your data will be used to improve your experience.” Modeling your schema in such a manner is very granular. If you decide, or if your lawyers tell you, that you do not have to be so detailed, this schema works as well; just connect more user properties to the same authorization, and do the same for systems.
One of the biggest advantages of using an external system is that all authorizations are managed from a central point; thus, you can update all your systems at once as they all query the central “authorization repository.” The request to get all authorizations is very simple:
match $a isa authorization; $s isa system; (requisite:$a, requester:$s); get;
And it can easily be used to create an API endpoint where the only parameter needed is some data that specifies the system. But this is another story and will be dealt with in a separate blog post.
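As a rough idea of what such an endpoint might do with its single parameter, the Python sketch below builds the query text for one system. It assumes the system is identified by its value attribute, and authorizations_query is a hypothetical helper; sending the query to Grakn is left out on purpose, since the client API is not covered in this article.

```python
import re

# Conservative allow-list: letters, digits, underscore, space, dot, hyphen.
# Quotes are rejected so the parameter cannot break out of the string literal.
_SAFE_VALUE = re.compile(r"[\w .-]+")


def authorizations_query(system_value: str) -> str:
    """Build the query listing the authorizations one system requires."""
    if not _SAFE_VALUE.fullmatch(system_value):
        raise ValueError("unsupported characters in system value")
    return (
        "match $a isa authorization; "
        f'$s isa system, has value "{system_value}"; '
        "(requisite:$a, requester:$s); get;"
    )


query = authorizations_query("MailChimp")
```

Rejecting unexpected characters is a deliberately blunt choice for a sketch; a real endpoint would more likely map a known list of systems to their identifiers.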
We now have all our entities and you have already seen that they are of course all interconnected. We should now speak about the edges or the relations among all these entities.
The relations are very simple. They just connect nodes together. A particularity of GRAKN.AI is that it is a hypergraph. Thus, we can connect not only one node to another but many nodes together.
To start the discussion, let’s query the graph and go to an anonymous user and find the email, the authorization it requires, and finally, a system in which it is used.
match $p isa person has identifier 1; $i isa anonymous; $e isa email; $a isa authorization; $s isa system; ($i, $p) isa identifies; ($p, $e) isa belongs; ($e, $a) isa needs; ($a, $s) isa requires; get;
Which gives the following result:
Actually, in this example, I cheated and clicked on the email node to show a hyper-relation that is among three nodes. We can see that the email has been exported by a person which has the id V65592 and that it was exported to some system.
The idea is to help the company track what is happening with the data. Here, the exporter is a person, but if you pay attention to the system entity definition, you will notice that a system can also play the role of exporter (or importer for that matter).
When users request the deletion of their email, we know that each email is used in two systems. If it had been exported to Excel, we would know who has that Excel sheet, and we could require from that person a confirmation that the file has been deleted or that the person in question was removed from it.
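The export relation connects three nodes at once, which is what makes this follow-up possible. Here is a minimal Python sketch of the idea, with each relation instance stored as a mapping from role to player. The role names come from the schema; the node names and the copies_of helper are invented for the example.

```python
# Each hyper-relation instance maps role names to the node playing that role.
exports = [
    {"exporter": "staff-7", "exported": "email-1", "exported-to": "excel"},
    {"exporter": "crm", "exported": "email-1", "exported-to": "mailchimp"},
    {"exporter": "staff-7", "exported": "last-name-1", "exported-to": "excel"},
]


def copies_of(exports, item):
    """List (who exported it, where it went) for one data item."""
    return [
        (export["exporter"], export["exported-to"])
        for export in exports
        if export["exported"] == item
    ]


# Everyone we must contact before confirming that email-1 is gone.
to_confirm = copies_of(exports, "email-1")
```

Note that the exporter is a staff member in one record and a system in the other, as the schema allows.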
We have an identifies relation between the person and the anonymous node. Each attribute has a belongs relation connecting it to its owner. Each attribute needs an authorization. Finally, each system requires an authorization. Notice how all relation names are simply a third person singular verb and how the text can easily be translated into a schema.
Each relation needs at least two roles and each node it connects plays one or the other role. Let’s check the definition of the authorization entity to see all the roles it can play.
authorization sub entity has name has description has timestamp has expiration-date
plays needed plays requisite plays revoked plays withdrawn;
As a side note, notice that there is a timestamp attribute but also an expiration date. Let's say you need to require an authorization to use the data for one year (the length of a membership); the expiration timestamp could already be calculated upon creation. A simple inference rule could compare that expiration date with the “now” timestamp and automatically revoke access to the data. However, as it is not yet possible to get the “now” timestamp from the rules, a simple cron job could update the rule every day in order to revoke authorizations that are too old.
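The comparison such a daily job would make can be sketched in a few lines of Python. This is an assumption about how the job might work, not code from the repository: the attribute names follow the authorization entity, while the expired helper and the sample dates are invented.

```python
from datetime import datetime, timedelta, timezone


def expired(authorizations, now):
    """Return the names of authorizations whose expiration date has passed."""
    return [a["name"] for a in authorizations if a["expiration-date"] <= now]


created = datetime(2018, 1, 1, tzinfo=timezone.utc)
authorizations = [
    {
        "name": "newsletter",
        "timestamp": created,
        # One year, e.g. the length of a membership, computed at creation.
        "expiration-date": created + timedelta(days=365),
    },
    {
        "name": "job-listings",
        "timestamp": created,
        "expiration-date": created + timedelta(days=30),
    },
]

# The job passes in the current time, which the rules cannot obtain themselves.
now = datetime(2018, 6, 1, tzinfo=timezone.utc)
to_revoke = expired(authorizations, now)
```

In production, now would come from datetime.now(timezone.utc); it is fixed here so the example is reproducible.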
Keeping data for a limited amount of time is actually a requirement of GDPR (there is an exception that seems broad about archiving and statistical purposes, but that’s beyond our discussion here). You need to collect data only if and when you need it and for the period of time that you need it. Thus, it should be deleted when a user’s membership is deleted. Or, you could send an email to the user when his membership expires to ask him if he agrees to let you keep the data in order to simplify the renewal of his membership and to make sure he will be able to find his history again at a later stage. It should be precisely explained how long you will keep the data. That process could be repeated when that authorization expires.
After this digression, back to the schema definition above: we can see that the authorization plays two pairs of similar roles: needed and requisite, and revoked and withdrawn. The second in each pair is used for inferences. As we will see below, we need those additional roles, as they could be used at the same time but from different entities. In any case, I thought it made the schema clearer. Inference is part of the magic of GRAKN.AI, and it is dynamic. We can change a relation and other relations will adapt automatically, as they are inferred from our data.
For now, in this ontology, I use inference as a way to simplify queries. Indeed, the query that we studied above can be simplified. When you want to check if an email can be used, you do not need to know the details of the authorization; you just want to know whether a relation exists between an attribute and a system, whether that relation is of type authorizes, and that it is not of type withdraws. If the relation is of type withdraws, when the user tries to access the system, he can be notified that he has withdrawn the authorization, and he could be asked to authorize that data again in order to access the system. Both relations are inferred relations. Let's see first how the previous query could be simplified:
match $p isa person has identifier 1; $i isa anonymous; $e isa email; $s isa system; ($i, $p) isa identifies; ($p, $e) isa belongs; ($e, $s) isa authorizes; get;
If you did not erase the previous result in the visualizer and you have ticked the box activate inference in the settings of the visualizer, you should see the inferred relation pop up.
If the underlying data change (for example, the needs relation is replaced by revoke), then the authorizes inferred relation will be replaced by the withdraws inferred relation. You certainly have noticed that revoke does not use the third person singular. In the way that I have defined the schema, revoke is a child of an abstract relationship, action. In fact, I have defined some relationships as actions that can be taken on the data; revoke is one of them. This is just my understanding, and it could easily be modified to fit the third-person way of structuring relationships. The only advantage I find is that I can query action, as we did with the property entity, to easily find all the movements of data.
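To show how the inferred relations follow the underlying data, here is a plain Python sketch of the derivation. It is not the rule set from the repository: the infer function and the sample names are invented, and as a simplifying assumption both derived relations are expressed as (property, system) pairs.

```python
def infer(needs, revoke, requires):
    """Derive authorizes and withdraws from the stored relations.

    needs and revoke hold (property, authorization) pairs;
    requires holds (authorization, system) pairs.
    Assumption: both results are keyed by (property, system).
    """
    systems_for = {}
    for authorization, system in requires:
        systems_for.setdefault(authorization, set()).add(system)

    def expand(pairs):
        return {
            (prop, system)
            for prop, authorization in pairs
            for system in systems_for.get(authorization, ())
        }

    return {"authorizes": expand(needs), "withdraws": expand(revoke)}


requires = {("send-newsletter", "mailchimp")}

# While the email needs the authorization, the system is authorized.
before = infer(
    needs={("email-1", "send-newsletter")}, revoke=set(), requires=requires
)
# Once needs is replaced by revoke, authorizes disappears and withdraws appears.
after = infer(
    needs=set(), revoke={("email-1", "send-newsletter")}, requires=requires
)
```

Nothing is stored for authorizes or withdraws; both are recomputed from the current data, which is why changing one relation is enough to flip the result.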
When we put everything together, we get the full schema. The visualizer displays the parent and child entities, as well as the relationships, very well.
Now that we have a schema ready, we will load some data and create an API that will be the foundation of the dashboard but also will be used by all systems to create, read, modify, and delete data in the knowledge base.
The whole code is available in this GitHub repository; you can clone it:
>>> git clone https://github.com/idealley/grakn-gdpr <a folder>
I assume that Grakn is installed and running. Then, you can use the script load.sh to load the ontology, rules, and some fake data (one user and one staff member). Run the following command:
>>> ./load.sh grakn grakn
The first grakn is the relative path from $HOME to where GRAKN.AI is installed. The second one is the keyspace you want the data loaded in. It takes a few seconds; then you will have the same schema as above.
The data is here. In the next post, we will see how to interact with the data and start connecting the dots with an API.
Published at DZone with permission of Samuel Pouyt , DZone MVB. See the original article here.