Building a Biomedical Knowledge Graph
Building a biomedical knowledge graph using publicly available datasets to better aid disease research and biomedical data modeling.
Debrief from a Vaticle Community talk — featuring Konrad Myśliwiec, Scientist, Systems Biology, at Roche. This talk was delivered virtually at Orbit 2021 in April.
Konrad, like so many TypeDB community members, comes from a diverse engineering background. Knowledge graphs have been part of his scope since he worked on an enterprise knowledge graph for GSK, and he has been part of the TypeDB community for roughly 3 years. While most of his career has been spent in the biomedical industry, he has also worked on business intelligence applications and mobile apps, and he is currently a data science engineer in the RGITSC (Roche Global IT Solutions Centre) for Roche Pharmaceuticals.
Over the last year, Konrad started to notice a trend: whether in articles and posts on LinkedIn or amongst his biomedical network, knowledge graphs were everywhere. Given that the world was in the midst of grappling with the COVID-19 pandemic, he wanted to know if a knowledge graph could aid the efforts of his bio-peers.
In developing BioGrakn Covid, he sought to bring these two topics together to provide a more digestible and useful way to support the research and biomedical fight during the pandemic. In this talk, he describes how he and a group of TypeDB community members approached some of the technical development areas.
- BioGrakn Covid — what is it, and what is the goal of the project
- Publicly available data sources and ontologies — what data was loaded into BioGrakn Covid
- Knowledge extraction — working with unstructured data
- Analytical reasoning — how to perform some analysis on a knowledge graph
Building BioGrakn Covid
BioGrakn Covid is an open-source project started by Konrad, Tomás Sabat from Vaticle, and Kim Wager from GSK. It is a knowledge graph database centered on COVID-19.
Representing Biomedical Data as a Graph

Modeling biomedical data as a graph becomes the obvious choice once we realize that the natural representation of biomedical data tends to be graph-like. Graph databases traditionally represent data as binary nodes and edges. Think labeled property graphs, where two nodes are connected via a directional edge. What Konrad and co. quickly realized was that it becomes much simpler and ultimately more natural to represent biomedical data as a hypergraph, such as that found in TypeDB.
Hypergraphs generalise the common notion of graphs by relaxing the definition of edges. An edge in a graph is simply a pair of vertices. Instead, a hyperedge in a hypergraph is a set of vertices. Such sets of vertices can be further structured, following some additional restrictions involved in different possible definitions of hypergraphs.
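To make the distinction concrete, here is a minimal sketch in plain Python (not TypeDB code) contrasting a binary edge with a hyperedge; the entity names are illustrative only.

```python
# A binary edge in an ordinary graph is a pair of vertices.
binary_edge = ("gene:ACE2", "disease:covid-19")

# A hyperedge is a set of vertices, of any size -- e.g. one fact that
# touches a drug, a disease, and an organisation at the same time.
hyperedge = frozenset({"drug:remdesivir", "disease:covid-19", "organisation:NIH"})

# In a labeled property graph, the ternary fact above must be reified into
# an extra node plus three binary edges; in a hypergraph (and in a TypeDB
# relation) it is stored as a single hyperedge.
assert len(binary_edge) == 2
assert len(hyperedge) == 3
```

This is why n-ary facts such as clinical-trial collaborations map onto TypeDB relations without introducing artificial intermediate nodes.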
Data Sources Used
Currently, there are quite a few publicly available datasets in BioGrakn Covid. Briefly, here are some of the datasets and mappings included in the database:
- UniProt — entities (transcript, gene, protein), relations (translation, transcription, gene-protein-encoding)
- Human Protein Atlas — entities (tissue), relations (expression)
- Reactome — entities (pathway), relations (pathway-participation)
- DGIdb — entities (drug), relations (drug-gene-interaction)
- DisGeNET — entities (disease), relations (gene-disease-association)
- Coronaviruses — entities (virus), relations (organism-virus-hosting, gene-virus-association, protein-virus-association)
- ClinicalTrials.gov — entities (clinical-trial, organisation), relations (clinical-trial-collaboration, clinical-trial-drug-association)
Maintaining Data Quality
As many of us know, it is not as simple as taking the data from one of the above sources and loading it into a database; data quality needs to be considered first and, as Konrad notes in his talk, maintained throughout the work. For this, the project draws on the UMLS (Unified Medical Language System). One subset of UMLS, the Metathesaurus, supplies the concepts themselves. The second subset is the Semantic Network data, which is focused on the taxonomy: each concept is placed in a taxonomy, e.g. the concept protein is a subtype of compound. The benefits of this subset will become more obvious later on.

Using UMLS as the initial or baseline ontology gives us confidence that we can move forward without data quality issues. Obviously, this is not the only way to approach the challenge of data quality, but UMLS is effective in the biomedical domain.
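One practical payoff of a baseline ontology is entity resolution: different datasets name the same concept differently, and collapsing those surface forms onto one identifier keeps the graph free of duplicates. A minimal sketch of that idea follows; the synonym table and the concept id are illustrative stand-ins, not real UMLS data.

```python
# Illustrative synonym table: several surface forms from different sources
# all resolve to one (made-up) concept identifier.
SYNONYMS = {
    "sars-cov-2": "CUI-0001",
    "2019-ncov": "CUI-0001",
    "covid-19 virus": "CUI-0001",
}

def normalise(mention):
    """Return the shared concept id for a raw mention, or None if unknown."""
    return SYNONYMS.get(mention.strip().lower())

# Three surface forms load as one entity in the graph, not three duplicates.
assert normalise("SARS-CoV-2") == normalise("2019-nCoV") == "CUI-0001"
```

In the real project this lookup would be backed by the UMLS Metathesaurus rather than a hand-written dictionary.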
Defining TypeDB Schema
Schema in TypeDB provides the structure and model for our data, and the safety of knowing that any data ingested into the database will adhere to it.
In Konrad’s case, they are creating a biomedical schema with entities such as gene, protein, and disease, and the relations between them. It should be noted that many of these decisions can be derived from the data sources you use.
Here is an excerpt from this schema:
define

gene sub fully-formed-anatomical-structure,
    owns gene-symbol,
    owns gene-name,
    plays gene-disease-association:associated-gene;

disease sub pathological-function,
    owns disease-name,
    owns disease-id,
    owns disease-type,
    plays gene-disease-association:associated-disease;

protein sub chemical,
    owns uniprot-id,
    owns uniprot-entry-name,
    owns uniprot-symbol,
    owns uniprot-name,
    owns ensembl-protein-stable-id,
    owns function-description,
    plays protein-disease-association:associated-protein;

protein-disease-association sub relation,
    relates associated-protein,
    relates associated-disease;

gene-disease-association sub relation,
    owns disgenet-score,
    relates associated-gene,
    relates associated-disease;
Now that we have our schema, we can start to load our data.
Ingesting Structured Data
In the talk, Konrad walked through loading the data from UniProt, which also contains data from Ensembl. The first thing to do is to identify the relevant columns and then, based on our schema, identify the relevant entities to populate with the data. From there, it is fairly simple to add the attributes for each concept.
Loading the data is then trivial — using a client API in Python, Java, or Node.js. Konrad built this migrator for the UniProt data — available via the BioGrakn Covid repo.
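The shape of such a migrator can be sketched in a few lines of Python: read each row, then template a TypeQL insert query from the relevant columns. The column names below are assumptions for illustration; the actual UniProt format and the full migrator live in the BioGrakn Covid repo.

```python
import csv
import io

def protein_insert_query(row):
    """Template a TypeQL insert for one UniProt row (column names assumed)."""
    return (
        f'insert $p isa protein, '
        f'has uniprot-id "{row["Entry"]}", '
        f'has uniprot-entry-name "{row["Entry name"]}";'
    )

# Tiny inline stand-in for a UniProt TSV export (P0DTC2 is the
# SARS-CoV-2 spike protein entry).
tsv = "Entry\tEntry name\nP0DTC2\tSPIKE_SARS2\n"
rows = csv.DictReader(io.StringIO(tsv), delimiter="\t")
queries = [protein_insert_query(r) for r in rows]

# Each query string would then be executed inside a write transaction
# opened through the Python, Java, or Node.js client.
```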
Those actively working with these types of publicly available datasets know that they are not updated as often as we might like; some are updated only yearly, so we need to supplement them with relevant, current data. The trouble is that this data usually comes from papers, articles, and unstructured text. To use it, a sub-domain model is needed, which allows us to work with the text more expressively and ultimately connect it to our biomedical model. The two models are shown below:
With the schema set and publications identified, some challenges will need to be addressed. Konrad highlights two of them: extracting biomedical entities from text and linking different ontologies within a central knowledge graph.
Extracting Biomedical Entities from Text
When approaching this challenge, Konrad reminds us not to reinvent the wheel but to make use of existing named entity recognition corpora. For this project, Konrad and co. used CORD-NER and SemMed.
Named entity recognition (NER) is the NLP task of identifying named entities within text.
CORD-NER is a data source of pre-computed named entity recognition output. The nice thing about CORD-NER is that the entities extracted from the text are mapped to concepts in UMLS, which helps provide consistency and data quality in the knowledge graph. With the concepts and their types, we can now map to the schema in TypeQL.
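A sketch of that final mapping step: turning one NER hit into a TypeQL insert that links the extracted entity back to its source paper. The attribute and relation names here are illustrative assumptions, not the project's exact schema.

```python
def mention_insert_query(paper_id, entity_text, umls_type):
    """Template a TypeQL insert connecting one extracted entity to its paper.
    Names like paper-id, mention-text, and the mention relation are
    illustrative; the real text-mining schema is in the BioGrakn Covid repo."""
    return (
        f'match $pub isa publication, has paper-id "{paper_id}"; '
        f'insert $m isa {umls_type}, has mention-text "{entity_text}"; '
        f'(mentioned: $m, mentioning: $pub) isa mention;'
    )

# One NER record -- entity text "ACE2", typed as a protein via its UMLS
# concept -- becomes a query linking it to a (hypothetical) paper id.
query = mention_insert_query("cord-0001", "ACE2", "protein")
```

Because the entity type comes from UMLS, the inserted concept lands in the same taxonomy as the structured data loaded earlier, which is what lets the text-mining sub-domain model connect to the biomedical model.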
Published at DZone with permission of Daniel Crowe.
Opinions expressed by DZone contributors are their own.