Modeling Biomedical Data for a Drug Discovery Knowledge Graph
In this article, see a presentation on modeling biomedical data for a drug discovery knowledge graph.
Earlier this month, we were joined by Natalie Kurbatova, Associate Principal Scientist at AstraZeneca, for the first session in the Grakn Orbit series.
Our newest community event brings members of the #GraknCommunity to the forefront. Through tech panels, presentations, use case deep dives, and more, our community shares the latest and greatest from the Grakn Cosmos.
Natalie works in AstraZeneca’s Data Science and Artificial Intelligence department, where she focuses on data modeling, integration of data into a knowledge graph, prediction algorithms, and the topics therein.
The Goal: Predict Novel Disease Targets
At AstraZeneca, Natalie’s team focuses on building a Knowledge Graph to predict new disease targets (gene or protein targets), which they call a Discovery Graph. In building this, Natalie walked us through the two types of data sources to consider:
Structured data refers to the publicly available datasets in bioinformatics that have already been curated and are used extensively in the industry. While biomedical structured data is machine readable, it is often not straightforward to work with. In particular, it is difficult to integrate these datasets because they can describe similar concepts in different ways, for example through distinct IDs that do not align with each other. Some of the most used publicly available datasets include: Ensembl, Uniprot, ChEMBL, PubChem, OmniPath, Reactome, GO, CTD, HumanProteinAtlas.
Unstructured data refers to data extracted from text. To work with it, we need to run NLP (Natural Language Processing) pipelines and then process their output. Here, the difficulty is that this data is often messy and noisy. For their NLP engine, Natalie’s team used the open source library SciBERT as well as AstraZeneca’s proprietary tools.
Natalie then introduced us to the schema that her team built for their Discovery Graph (slide below).
Natalie’s team is mainly interested in studying the triangle of gene-protein, disease, and compound entities, which they affectionately call the Golden Triangle. The connections between these entities need to be as solid and reliable as possible, which means ingesting all possible related data sources into their Discovery Graph.
This Discovery Graph is growing in size daily. As of today it is already widely populated with these three entity types:
- gene-protein: 19,371
- disease: 11,565

There are also 656,206 relations between the entity types.
How Are They Modeling This “Golden Triangle”?
Natalie then explained how she modeled each part of the Discovery Graph, giving examples along the way.
Modeling Genes and Proteins
First, her team looked at how to model genes and proteins. Instead of separating these into two entities, Natalie’s team decided to model them as a single entity, which they called
gene-protein. This helps to reduce noise and bias.
The [gene-protein] entity visualised in Grakn Workbase.
The associated attributes assign the corresponding gene name and the name of the chromosome where that gene is located. The gene-id attribute is modelled as an abstract attribute whose children contain the unique IDs for each data source.
There may be scenarios where a parent attribute is defined only for other attributes to inherit, and under no circumstances do we expect to have any instances of this parent. To model this logic in the schema, use the abstract keyword.
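As an illustration, the gene-id hierarchy described above might be declared along these lines (a sketch assuming Grakn 1.x Graql syntax; the child attribute names are illustrative guesses, not taken from the talk):

```graql
define

# Abstract parent: never instantiated directly, only inherited from
gene-id sub attribute, abstract, value string;

# Hypothetical per-source children that inherit the string value type
ensembl-gene-id sub gene-id;
uniprot-id sub gene-id;

gene-protein sub entity,
    has gene-id;
```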
The interaction relation establishes interactions between gene-protein entities, where one plays the role of gene-source and the other the role of gene-target. This relation is particularly important in predicting novel disease targets, as it captures regulatory interactions between genes and proteins.
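Such a relation might be defined along these lines (again a sketch assuming Grakn 1.x Graql syntax):

```graql
define

# A gene-protein can regulate (gene-source) or be regulated (gene-target)
interaction sub relation,
    relates gene-source,
    relates gene-target;

gene-protein sub entity,
    plays gene-source,
    plays gene-target;
```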
Similarly, instead of separating them, Natalie’s team modelled small molecules and antibodies as one entity type, which they called compound. For this data, they used two data sources: PubChem and ChEMBL.
These databases overlap by roughly 95%, but some chemicals exist in only one of the two sources. To deal with these unique chemicals, they decided to assign the chembl-id as the primary ID if a compound has one; if it doesn’t, they use the pubchem-id instead.
As we could see on Natalie’s slide, compound was modelled with a few more attributes. The compound-id attribute was used as an abstract attribute, with chembl-id and pubchem-id as its children.
To obtain the value of the preferred-compound-id attribute, they use two rules that assign either the chembl-id or the pubchem-id, according to the logic below:
The first rule attaches the value of chembl-id, if it exists, to preferred-compound-id. The second rule first checks whether a compound lacks a chembl-id; if so, it attaches the value of the pubchem-id attribute to preferred-compound-id instead. This makes it easy to query for that attribute when asking specifically for compounds.
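The two rules could be sketched roughly as follows (assuming Grakn 1.x Graql; exact rule syntax differs between Grakn versions, and the rule names here are hypothetical):

```graql
define

# Rule 1: when a chembl-id exists, copy its value to preferred-compound-id
preferred-id-from-chembl sub rule,
when {
    $c isa compound, has chembl-id $id;
}, then {
    $c has preferred-compound-id $id;
};

# Rule 2: when no chembl-id exists, fall back to the pubchem-id
preferred-id-from-pubchem sub rule,
when {
    $c isa compound, has pubchem-id $id;
    not { $c has chembl-id $any; };
}, then {
    $c has preferred-compound-id $id;
};
```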
Natalie explained that the most complex concept to model was disease. This is because diseases are described by multiple ontologies, each with different IDs assigned to them. For instance, a single disease might have multiple IDs from different ontologies and disease hierarchies. In the slide below, Natalie showed us the hierarchies of three data sources: EFO, MONDO and DOID.
The reason for this heterogeneity is that these ontologies were originally designed in different sub-domains: for example, medical doctors use Orphanet, while a medical research group might use another, such as EFO or MONDO. This leads to unconnected and disparate data. Natalie and her team want to model a disease entity that cross-references these ontologies.
Because of this challenge, they chose to model two entity types: disease and ontology. They use a disease-hierarchy relation, a ternary relation, to connect the two.
With the disease-hierarchy relation, Natalie and her team are able to write useful queries such as:
Give me all the children of “chronic kidney disease” disease node using EFO ontology.
This query shows the power of the model, and of Grakn’s schema, because even though they are only asking for a high-level disease, the query returns all possible sub-types of that disease.
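In Graql, that query might look roughly like this (a sketch; the attribute names and the role played by the ontology are assumptions, as only superior-disease and subordinate-disease are named in the talk):

```graql
match
    $p isa disease, has disease-name "chronic kidney disease";
    $o isa ontology, has ontology-name "EFO";
    (superior-disease: $p, subordinate-disease: $c, ontology-source: $o) isa disease-hierarchy;
get $c;
```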
To model this, Natalie leverages Grakn’s rule engine again:
The rule shown on the slide above, single-ontology-transitivity, creates transitivity in the disease-hierarchy relation. As a result, all diseases that play the role of subordinate-disease will be inferred when you query for a disease that plays the role of superior-disease. Further, this means that even if you don’t specify the ontology you are querying from, you are still returned all subordinate diseases of that particular parent disease, across all ingested data sources.
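Based on that description, the transitivity rule could be sketched along these lines (assuming Grakn 1.x Graql syntax; the ontology-source role name is a hypothetical placeholder):

```graql
define

# If A is above B and B is above C within one ontology,
# then A is also above C in that ontology
single-ontology-transitivity sub rule,
when {
    (superior-disease: $a, subordinate-disease: $b, ontology-source: $o) isa disease-hierarchy;
    (superior-disease: $b, subordinate-disease: $c, ontology-source: $o) isa disease-hierarchy;
}, then {
    (superior-disease: $a, subordinate-disease: $c, ontology-source: $o) isa disease-hierarchy;
};
```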
This type of rule is especially useful when there is no corresponding reference ID in a particular ontology.
As Natalie showed us, when a
disease-id is not present in a certain ontology, we can go up the hierarchy until we find an ID that does exist and then assign that ID. To do this, Natalie uses another transitive rule:
The first rule,
determine-best-mesh-id-1, assigns a
best-mesh-id attribute to a
disease entity if a
mesh-id is already present. The second rule then states that if we don’t know the
mesh-id, we want to pull down the
mesh-id from a parent disease. Natalie emphasised how effective this was, and how positive the results have been in practice.
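These two rules might be sketched as follows (a sketch assuming Grakn 1.x Graql; only determine-best-mesh-id-1 is named in the talk, so the second rule's name and the role names in the hierarchy pattern are assumptions):

```graql
define

# Rule 1: a disease that already has a mesh-id uses it as best-mesh-id
determine-best-mesh-id-1 sub rule,
when {
    $d isa disease, has mesh-id $m;
}, then {
    $d has best-mesh-id $m;
};

# Rule 2 (hypothetical name): otherwise inherit best-mesh-id from a parent disease
determine-best-mesh-id-2 sub rule,
when {
    $d isa disease;
    not { $d has mesh-id $any; };
    (superior-disease: $p, subordinate-disease: $d) isa disease-hierarchy;
    $p has best-mesh-id $m;
}, then {
    $d has best-mesh-id $m;
};
```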
Once the domain has been modelled, we can start ingesting our data. To do this, there are two approaches that can be taken:
- Data factory: integrate data before loading into a knowledge graph
- Data schema: data integration happens in the knowledge graph through flexible data loading and subsequent reasoning
Natalie’s research group uses the Data Schema approach, using the knowledge graph itself as a tool to integrate the data. This may be counter-intuitive to some developers.
If you imagine adding data sources that are unknown at the start, you have to be flexible enough to manipulate the data inside the database. This also makes it possible to replicate another group’s work, for example from a paper, in order to validate its hypotheses.
In the case of research work, flexibility is essential, as Natalie showed us with the previously conflicting mesh-ids.
The solution, as noted, is using Grakn’s reasoning engine.
In closing, Natalie spent a few minutes summarising the benefits of this approach, in relation to the data schema and logical reasoning.
Other graph databases are flexible in data loading, but lack validation. And while flexibility is important, validation is necessary to keep the data consistent.
A few months ago, one of Natalie’s junior colleagues accidentally loaded movie data into their biomedical knowledge graph. The next morning, the rest of the team was surprised to see films and actors among their biomedical data!
With Grakn, data is logically validated via the data schema at insert, to ensure the right data goes into the knowledge graph.
Natalie believes this database provides a nice trade-off between formal schema design, logical reasoning and prediction algorithm capabilities. In her experience, data needs to be modelled before it is loaded into the knowledge graph, and a formal yet flexible schema helps to find this ideal balance.
Note that all prediction algorithms depend on what you load into the database, so be careful what you put in the knowledge graph; the noise level should be kept as low as possible.
Finally, she spoke about the key choice of whether to integrate data before or after loading it into the database, which in their case (as explained above) is done after loading.
A special thank you to Natalie and her team at AstraZeneca for their enthusiasm, dedication and thorough explanation.
You can find the full presentation on the Grakn Labs YouTube channel here.
Opinions expressed by DZone contributors are their own.