An Enterprise Data Stack Using TypeDB
Building a data technology stack to facilitate a deeper understanding of diseases and drive drug discovery at scale. Hear from the Bayer Pharmaceutical team.
At Bayer, one of the largest pharmaceutical companies in the world, gaining a deep understanding of biological systems is paramount for the discovery of new therapeutics. This is inspiring the adoption of technologies that can accelerate and automate discovery, spanning all of the components of their data infrastructure, starting with the database.
The challenges posed within the data and discovery process are not unique to Bayer:
- Loading heterogeneous datasets at scale
- Mapping heterogeneous datasets into one source of truth — given a disease how do we reconcile its different IDs across multiple datasets?
- Modelling the complexity of the biological system — representing genes and variants based on locations along the genome and ternary relations between genes, proteins, pathways, cells, tissues, natively in the database
- Query construction over such a complex system
- Connecting ML pipelines to the database
The team at Bayer — Henning Kuich, Dan Plischke, and Joren Retel — set out to build a stack that would help them solve these challenges.
The first step was to find a database that allowed them to model the complexity of the biological system around diseases. It needed to represent the interconnectivity inherent in biology, let them update the model dynamically without interfering with the rest of the process, and connect efficiently to knowledge extraction pipelines, whether those are visualisations or learning processes. They found their database in TypeDB, a strongly-typed database from the team at Vaticle (the community version is fully open-source).
“….we’ve really sort of developed a stack internally to use TypeDB very seriously, so we’ve built a data migrator to get things in there, and then we now have two branches to get knowledge back out.” — Henning Kuich, Sr. Computational Scientist at Bayer Pharmaceuticals
A Database to Model Complexity — TypeDB
Everything in biology is extremely connected and context-dependent, which makes representing data in a traditional relational database very difficult. Because of this, Henning and his team felt that TypeDB’s type system was a much more natural way to model biology.
Biomedical experiments are highly context-specific, so when building a biomedical knowledge graph you want to be able to quickly interrogate what parameters were used. What is the actual assay? Is it in-vitro, is it in a cell? This requires a database that can capture this level of complexity and a query language that allows us to easily ask complex questions over that data. With TypeDB, Henning’s team was able to do exactly that.
Bayer also leverages the benefits of the native inference engine in TypeDB. Inferencing can do a lot for teams with similarly complex biomedical data. For example, Henning and his team use TypeDB’s inference engine to infer gene connections based on their locations on the chromosome.
Genes are essentially just sequences in the genome, literally locations along a long stretch of other locations. Variants are variations of these sequences, located along the same stretch of numbered positions. Using TypeQL (TypeDB’s query language), the team is able to map variants to genes based on overlapping locations, rather than depending on other databases from which they would have to import the relations between variants and genes. Crucially, this avoids the need to maintain those other databases, which take a long time to keep updated and migrated on a regular basis. All of this is avoided by using TypeDB’s inference engine.
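The overlap logic described above can be sketched in a few lines. This is plain Python rather than a TypeQL rule, and it is not Bayer's implementation: the `Region` class, the field names, and the sample coordinates are all hypothetical, chosen only to show how a variant can be tied to a gene purely from positions on the same chromosome.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """A named stretch of a chromosome (hypothetical model, inclusive bounds)."""
    name: str
    chromosome: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    # Two closed intervals overlap when each one starts before the other ends.
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def map_variants_to_genes(variants: List[Region], genes: List[Region]) -> Dict[str, List[str]]:
    # For every variant, collect the genes whose location it overlaps.
    return {v.name: [g.name for g in genes if overlaps(v, g)] for v in variants}


# Illustrative data only; these are not real genomic coordinates.
genes = [
    Region("GENE_A", "chr1", 100, 500),
    Region("GENE_B", "chr1", 450, 900),
    Region("GENE_C", "chr2", 100, 500),
]
variants = [
    Region("var_1", "chr1", 470, 470),  # falls inside GENE_A and GENE_B
    Region("var_2", "chr2", 50, 99),    # ends just before GENE_C starts
]

mapping = map_variants_to_genes(variants, genes)
```

In a database with an inference engine, the same condition would live in the schema as a rule, so the variant-to-gene relation is derived at query time instead of being imported and maintained separately.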
TypeDB also allows you to redefine a schema at any point during the database’s lifecycle. For example, the team could add a new definition of what it means for a variant to be associated with a gene, based on the gene’s location. This is especially important when adding new datasets. All of this enables Henning’s team to operate much more efficiently and accelerate their drug discovery process.
Data Ingestion — TypeDB Loader
TypeDB Loader is a Java application and library, so you can use it as a CLI tool or within your own Java project. It loads many massive CSV files and stitches the data together into a true graph format in TypeDB. It does this with minimal configuration files, which tell the loader which columns to use to associate data from the various tables being loaded. Using it is relatively simple: you specify which instance you are talking to and which TypeDB database you are writing into. You then provide your data configuration, your processor configuration, and your TypeDB schema, and determine where you want the migration status to be tracked. This can be as simple as a JSON file.
Henning’s team decided to build their own custom loader to provide a scalable way to load a large amount of data into TypeDB. To load data, the original TypeDB docs suggest building a function that reads in a file (for example in a tabular format) and building a templating function to generate the insert query. The docs then suggest using one of the client drivers to insert into a TypeDB database. However, when doing this at scale, a number of challenges are introduced that TypeDB Loader seeks to solve:
- Handling the repeated logic of the templating functions needed
- Taking care of potentially dirty data, preventing errors during insert
- Parallelising and batching insert queries per transaction at scale
- Providing fault tolerance for big data loads
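The manual approach described above, a templating function plus batching, can be sketched as follows. This is a minimal illustration and not TypeDB Loader's code: the `gene`, `symbol`, and `chromosome` schema labels, the CSV columns, and every function name are assumptions, and the sketch stops at building query strings rather than calling a client driver.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, Tuple


def escape(value: str) -> str:
    # Escape backslashes and double quotes so the value is safe inside a quoted string.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def gene_insert_query(row: Dict[str, str]) -> str:
    """Templating function: turn one CSV row into an insert query (hypothetical schema)."""
    symbol = (row.get("symbol") or "").strip()
    chromosome = (row.get("chromosome") or "").strip()
    if not symbol or not chromosome:
        raise ValueError("row is missing a required column value")
    return (
        f'insert $g isa gene, has symbol "{escape(symbol)}", '
        f'has chromosome "{escape(chromosome)}";'
    )


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    # Group queries so that each batch can be sent in a single transaction.
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_insert_batches(csv_text: str, batch_size: int) -> Tuple[List[List[str]], List[int]]:
    """Return query batches plus the file line numbers of dirty rows that were skipped."""
    queries: List[str] = []
    skipped: List[int] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            queries.append(gene_insert_query(row))
        except ValueError:
            skipped.append(line_no)
    return list(batched(queries, batch_size)), skipped


# Illustrative input only.
sample_csv = "\n".join([
    "symbol,chromosome",
    "BRCA1,chr17",
    ",chr1",
    "TP53,chr17",
    "EGFR,chr7",
])
batches, skipped = build_insert_batches(sample_csv, batch_size=2)
```

Each batch would then be written in one transaction, and batches could be distributed across worker threads. Recording the skipped line numbers is one simple way to keep a load running when individual rows are dirty.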
Data Exploration — GraEs
It’s natural to expect that given the complexity of biology and specifically disease understanding, the model (TypeDB schema) can grow quickly.
A schema is a model of your domain, implemented in TypeQL as types (entities, attributes, relations, roles) and rules. All data inserted into TypeDB will instantiate parts of the schema, ensuring your data conforms to the model you have implemented.
This presents a challenge when writing queries, as the user has to remember or look up which role players are allowed in which relation, which attributes are available for a given variable or thing, and all the possible attribute values for constraints in the query. To solve this, the team at Bayer built an auto-completion engine that offers grammar- and schema-based suggestions while you’re writing your query.
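To make the idea of schema-based suggestions concrete, here is a small sketch. It is not how GraEs works internally; the `SCHEMA` dictionary, its type and role labels, and both helper functions are hypothetical, and a real engine would read the schema from the database and also take the query grammar into account.

```python
from typing import Dict, List

# A hand-written stand-in for schema information (all labels are made up).
SCHEMA: Dict[str, Dict[str, List[str]]] = {
    "relations": {
        "gene-variant-association": ["associated-gene", "associated-variant"],
        "protein-encoding": ["encoding-gene", "encoded-protein"],
    },
    "attributes": {
        "gene": ["symbol", "chromosome", "start-position"],
        "variant": ["identifier", "chromosome", "position"],
    },
}


def suggest_roles(relation: str, prefix: str = "") -> List[str]:
    # Offer only the roles that the given relation type actually defines.
    roles = SCHEMA["relations"].get(relation, [])
    return sorted(role for role in roles if role.startswith(prefix))


def suggest_attributes(type_label: str, prefix: str = "") -> List[str]:
    # Offer only the attributes that the given type is allowed to own.
    attributes = SCHEMA["attributes"].get(type_label, [])
    return sorted(attr for attr in attributes if attr.startswith(prefix))


role_suggestions = suggest_roles("protein-encoding", "enc")
attribute_suggestions = suggest_attributes("gene", "s")
```

Because the suggestions come from the schema, a user never has to recall which roles or attributes are valid for a type, and an unknown type simply yields no suggestions.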
Another challenge the team aimed to solve was to create a means of quickly visualising specific important queries for their assessment. To handle this, GraEs provides summary tables and single sub-graph visualisations. In addition, it will automatically return all possible attributes for the Concepts returned from your query, so you don’t have to request them manually. Today, GraEs already has basic wiring to solve all these issues:
- Query completion
- Query development
- Standard visualisation of subgraphs
- Summary tables of certain entities or relations
- Paves the way for more custom visualisations for specific queries or types of queries
Learning System for Link Prediction and Novel Discovery Pipelines — TypeDB to PyTorch Connector
Today, it is fairly clear that the drug discovery process is costly in both time and money. Biology, as Henning notes, is most naturally represented as a graph, and teams can use this interconnected nature to identify novel potential targets. Taking advantage of the rich semantic modelling that TypeDB provides enables the prediction of which genes, proteins, or variants are most important.
While KGLIB (the open-source machine learning library from Vaticle) provides a connection to TensorFlow and GraphNets, Bayer’s ML team has been working in PyTorch for two years now, hence their desire to extend KGLIB to connect with PyTorch.
The connector they have built takes sub-graphs from their TypeDB database and passes them directly into the learner. This is beneficial because passing the entire graph wastes time and limits how flexibly you can iterate when you want to change something in the learner or in the data itself.
This gives them a pipeline as follows:
- Being able to export into Python in-memory NetworkX subgraphs
- Encoding from a TypeDB hypergraph entity/relation/attribute model into a conventional node-edge model
- TensorFlow and PyTorch embedders to automatically interpret type and value data from TypeDB as features
- The end-to-end pipeline is ready for use with any of the new and exciting native graph learning algorithms, for example, those built into PyTorch Geometric
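The encoding step in the pipeline above, from an entity/relation/attribute model to a node-edge model, can be sketched as below. This is one common encoding assumed for illustration, not the connector's actual code: relations and attributes both become nodes, and edges carry either a role label or `has`. The sketch uses plain dictionaries and tuples instead of NetworkX to stay dependency-free, and all type and role labels are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]
Edge = Tuple[str, str, str]  # (source id, target id, edge label)


def encode_as_node_edge(
    entities: Dict[str, Dict[str, Any]],
    relations: Dict[str, Dict[str, Any]],
) -> Tuple[Dict[str, Node], List[Edge]]:
    nodes: Dict[str, Node] = {}
    edges: List[Edge] = []

    for entity_id, entity in entities.items():
        nodes[entity_id] = {"type": entity["type"]}
        # Each owned attribute becomes its own node, linked by a "has" edge.
        for attr_type, value in entity.get("attributes", {}).items():
            attr_id = f"{entity_id}:{attr_type}"
            nodes[attr_id] = {"type": attr_type, "value": value}
            edges.append((entity_id, attr_id, "has"))

    for relation_id, relation in relations.items():
        # An n-ary relation becomes a node, with one edge per role player.
        nodes[relation_id] = {"type": relation["type"]}
        for role, player_id in relation["role_players"]:
            edges.append((relation_id, player_id, role))

    return nodes, edges


# Illustrative ternary relation between a gene, a protein, and a tissue.
entities = {
    "g1": {"type": "gene", "attributes": {"symbol": "BRCA1"}},
    "p1": {"type": "protein"},
    "t1": {"type": "tissue"},
}
relations = {
    "r1": {
        "type": "expression",
        "role_players": [
            ("expressed-gene", "g1"),
            ("expression-product", "p1"),
            ("expression-site", "t1"),
        ],
    },
}

nodes, edges = encode_as_node_edge(entities, relations)
```

Turning the relation into a node is what lets a ternary relation survive in a graph where every edge joins exactly two nodes. The type labels and values stored on each node are the raw material an embedder would convert into features.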
Hopefully, this has given you a glimpse into one of the ways that teams are approaching the drug discovery process in today’s ever more complex biomedical domain.
While this is just one example of an enterprise production stack built around TypeDB, we are seeing more and more of the community building tools and libraries to continue advancing their applications and solutions to complex problems. From developing a hybrid approach to autonomous vehicle systems and modelling causality for the manufacturing and logistics industry, to patient therapeutics and precision medicine, cyber intelligence threat detection, and supply chain analysis, TypeDB is enabling teams to work more efficiently and model domains more expressively than they have before.
To learn more about what the community is building and gain some inspiration for your own projects, here are a few ways to dive in: