Modeling and Loading Data at Scale
This article gives a glimpse into how a team within Bayer Pharmaceuticals uses TypeDB to accelerate their drug discovery pipelines.
Back in April we hosted an online conference for our community, Orbit 2021. Listening to Henning Kuich, Dan Plischke, and Joren Retel from Bayer Pharmaceuticals, the community got a glimpse into how a team within Bayer uses TypeDB to accelerate its drug discovery pipelines.
Fundamentally, the team at Bayer wanted to understand diseases better so that they could create better therapeutic interventions: a deeper understanding of a disease enables the identification and development of novel therapeutics that have little to no side effects.
In order to achieve this, they need to build a model that accurately represents the biological system of diseases, and map that to the large biomedical datasets that they have available.
Describing a Disease: The Data Model
To model diseases in TypeDB, they started with concepts that can be modulated. Modeling these as entities in a TypeDB schema would be as follows:
```typeql
define
  gene sub entity;
  variant sub entity;
  protein sub entity;
  complex sub entity;
  indication sub entity;
```
Genes, variants, proteins, and complexes were chosen as entities because each can be associated in some form with a disease. The disease or diseases of interest are modeled through the indication entity. In practice, the team also needs to capture the inherent interrelations within the biological system, and biology is complex! This means worrying about an enormous number of other things within that system: genes are expressed in cells, cells are part of tissues, and tissues are in an organism. This data could come from assays and experiments in clinical studies. Putting all of that together and modeling the right relations leads to a much richer schema.
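As a rough sketch of where this leads, the entity types above can be connected with relation types in TypeQL. The entity, relation, and role names below are illustrative, not Bayer's actual schema (and they assume the gene entity from the earlier define statement):

```typeql
define
  cell sub entity;
  tissue sub entity;
  organism sub entity;

  # hypothetical relation: a gene is expressed in a cell
  gene-expression sub relation,
    relates expressed-gene,
    relates expressing-cell;
  gene plays gene-expression:expressed-gene;
  cell plays gene-expression:expressing-cell;

  # hypothetical part-whole relation: cells in tissues, tissues in organisms
  composition sub relation,
    relates part,
    relates whole;
  cell plays composition:part;
  tissue plays composition:whole;
  tissue plays composition:part;
  organism plays composition:whole;
```

The real schema would carry far more context (assay conditions, provenance, and so on), but the pattern of typed entities joined by typed, role-bearing relations is the same.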
Why did the Bayer team choose to use TypeDB as the database for this type of modeling?
Everything in biology is extremely connected and context-dependent, which makes representing data in a traditional relational database very difficult. Because of this, Henning and his team felt that TypeDB’s type system was a much more natural way to model biology.
Biomedical experiments are highly context-specific, so when building a biomedical knowledge graph, you want to be able to quickly interrogate that context: what parameters were used, what was the actual assay, was it in vitro, was it in a cell, and so on. This requires a database that can capture that level of complexity and a query language that makes it easy to ask complex questions over the data. With TypeDB, Henning's team was able to do exactly that.
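To make this concrete, here is a hedged example of the kind of contextual query TypeQL enables. The assay and measurement types are hypothetical, not Bayer's schema:

```typeql
match
  $p isa protein;
  $a isa assay, has assay-type "in-vitro";
  # tie the protein to the experimental context it was measured in
  (measured-target: $p, measurement-context: $a) isa measurement;
get $p;
```

The experimental context lives directly in the data model as a first-class relation, so filtering by it is a one-line constraint rather than a join across ad-hoc metadata tables.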
Bayer also leverages the benefits of the native inference engine in TypeDB. Inference can do a lot for teams with similarly complex biomedical data. For example, Henning and his team use TypeDB's inference engine to infer gene connections based on their locations on the chromosome.
Genes are essentially just sequences in the genome: literally locations along a long stretch of other locations. Variants are variations of these sequences, along the same stretch of positions. Using TypeQL (TypeDB's query language), the team is able to map variants to genes based on overlapping locations, rather than depending on external databases that would require them to import precomputed variant-to-gene relations. Crucially, this avoids having to maintain those imports, which would otherwise need to be updated and migrated on a regular basis.
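A minimal sketch of such an inference rule, assuming hypothetical position attributes and a hypothetical containment relation (TypeDB 2.x rule syntax; none of these names come from Bayer's actual schema):

```typeql
define
  position sub attribute, value long;
  start-position sub attribute, value long;
  end-position sub attribute, value long;
  gene owns start-position, owns end-position;
  variant owns position;

  gene-variant-containment sub relation,
    relates containing-gene,
    relates contained-variant;
  gene plays gene-variant-containment:containing-gene;
  variant plays gene-variant-containment:contained-variant;

  # infer that a variant belongs to a gene when its position
  # falls within the gene's chromosomal interval
  rule variant-within-gene:
    when {
      $g isa gene, has start-position $s, has end-position $e;
      $v isa variant, has position $p;
      $p >= $s;
      $p <= $e;
    } then {
      (containing-gene: $g, contained-variant: $v) isa gene-variant-containment;
    };
```

With a rule like this in place, variant-to-gene relations are derived at query time from the positional data already in the database, instead of being imported and kept in sync with an external source.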
TypeDB also allows a schema to be redefined at any point during the database's lifecycle. For example, the team could add a new definition of what it means for a variant to be associated with a gene, depending on gene location. This is important, especially when new datasets are added. All of this enables Henning's team to operate much more efficiently and accelerate their drug discovery process.
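As an illustrative example of extending a live schema (the attribute name here is hypothetical), a new property can simply be defined and attached to an existing type:

```typeql
define
  # hypothetical attribute added after the database is already populated
  clinical-significance sub attribute, value string;
  variant owns clinical-significance;
```

Existing data is unaffected, and new inserts can immediately make use of the added attribute.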
You can see an example of a fully fleshed-out schema in the biology domain in the BioGrakn-Covid repo on GitHub.
Loading Data and Open-Sourcing TypeDB Loader
Now that there is a schema that represents the system being modeled, loading data at scale becomes the next challenge. To do this at the scale Bayer needs, Henning and his team built and open-sourced TypeDB Loader (formerly GraMi), a data migration tool for TypeDB. Here, we describe how it works and how the Bayer team uses it.
The reason Henning's team decided to build their own custom loader is that they wanted a better and more scalable way to load large amounts of data into TypeDB. To load data, the original TypeDB docs suggest writing a function that reads in a file (for example, in a tabular format) and a templating function that generates the insert queries, then using one of the client drivers to insert those into a TypeDB database. Doing this at scale, however, introduces a number of challenges that TypeDB Loader seeks to solve:
- Handling the repetitive logic of the templating functions
- Cleaning potentially dirty data to prevent errors during insert
- Parallelizing and batching insert queries per transaction
- Providing fault tolerance for big-data loads
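For context, the hand-written approach being replaced turns each row of a tabular file into a TypeQL insert query via a templating function. The attribute names and the sample row below are illustrative:

```typeql
# generated from one hypothetical CSV row, e.g. "HGNC:1100,BRCA1"
insert $g isa gene,
  has gene-id "HGNC:1100",
  has gene-symbol "BRCA1";
```

TypeDB Loader generates and executes such queries itself, driven by configuration rather than bespoke code, so the templating, cleaning, batching, and retry logic no longer has to be rewritten for every dataset.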
How Does TypeDB Loader Work?
Published at DZone with permission of Daniel Crowe. See the original article here.