Graph Databases in Life Sciences Workshop
This post was originally written by Michael Hunger at the Neo4j Blog.
As biotechnology is one of the hot topics of the century and graph databases are on the rise in this decade, we thought it would be a good idea to bring researchers and bioinformatics developers together for a workshop on the applicability of graph databases in biological research and applications.
Fortunately, Prof. Lennart Martens, a group leader in the Department of Medical Protein Research at VIB and Ghent University, offered to host the workshop. Neo Technology's Rik Van Bruggen and Lennart Martens then organized the workshop and invited a host of attendees from a variety of backgrounds.
After the introduction by Lennart and Rik, I ran a quick intro to NoSQL and graph databases in particular and their applicability in a wide range of fields, with some references to existing biotech applications.
Pablo Pareja of Oh no sequences! presented Bio4j, an open-source research database (and platform) integrating many different sources for protein, genome and taxonomy information. Bio4j also runs on Neo4j and currently holds almost 1 billion relationships. (Slides 1, 2, 3)
In the time until lunch I answered some questions about Neo4j, especially about the roadmap and scaling, and we highlighted some visualization approaches such as Gephi, Cytoscape and HivePlots.
During the breaks and over lunch we had lots of interesting discussions about life sciences in general, working with scientists and particular data management problems.
After lunch, Anthony Liekens presented biograph.be, a knowledge discovery system for finding relevant information in the life sciences, e.g. proteins in reactions ranked by their publication relevance. The system employs a PageRank algorithm that is implemented using matrix multiplication on a parallel processing system.
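To make the idea of ranking by matrix multiplication concrete, here is a minimal sketch of PageRank as a power iteration in plain Python. It is not the biograph.be implementation: the toy graph, the damping factor of 0.85 and the function name `pagerank` are assumptions for illustration, and nothing here is parallelized.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power iteration: repeatedly multiply the rank vector by the transition matrix."""
    n = len(adjacency)
    # Column-stochastic transition matrix:
    # transition[i][j] is the probability of moving from node j to node i.
    transition = [[0.0] * n for _ in range(n)]
    for j in range(n):
        out_degree = sum(adjacency[j])
        for i in range(n):
            if out_degree == 0:
                transition[i][j] = 1.0 / n  # dangling node: jump anywhere
            else:
                transition[i][j] = adjacency[j][i] / out_degree

    ranks = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        # One matrix-vector multiplication plus the damping (random jump) term.
        ranks = [
            (1.0 - damping) / n
            + damping * sum(transition[i][j] * ranks[j] for j in range(n))
            for i in range(n)
        ]
    return ranks


# Toy graph (assumed data): adjacency[j][i] == 1 means an edge from node j to node i.
toy_graph = [
    [0, 1, 1],  # node 0 links to nodes 1 and 2
    [0, 0, 1],  # node 1 links to node 2
    [1, 0, 0],  # node 2 links to node 0
]
ranks = pagerank(toy_graph)
```

Because every step is a matrix-vector product, the rows of the product can be computed independently, which is what makes this formulation a natural fit for parallel hardware.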
Davy Suvee of Janssen Pharmaceutica and datablend.be presented different graph database use cases from his experience at a big pharmaceutical company. He closed the presentation with an intro to FluxGraph, a time-traveling graph implementation on top of Datomic.
Thilo then introduced the topic of the workshop "Graph Databases in Life Science" and the "Reactome" database of human protein interaction pathways. He discussed some Neo4j APIs and how they can be used to import the data from flat CSV files into a graph database. The attendees set up their development environment with the Neo4reactome project that we prepared upfront and ran the import successfully.
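The general shape of such an import can be sketched without a database: read each CSV row, create each node once, and record a relationship between the two nodes. The two-column layout, the labels and the function name `import_rows` below are hypothetical and do not reflect the Neo4reactome project or the real Reactome file format; the sketch builds plain Python structures instead of calling any Neo4j API.

```python
import csv
import io

# Hypothetical CSV layout (not the actual Reactome export format):
# one row per protein-to-pathway link.
SAMPLE_CSV = """accession,pathway
P69905,Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes
P68871,Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes
"""


def import_rows(csv_text):
    nodes = {}          # (label, key) -> node id, so each node is created only once
    relationships = []  # (start node id, relationship type, end node id)

    def get_or_create(label, key):
        # Returns the existing id, or assigns the next free id to a new node.
        return nodes.setdefault((label, key), len(nodes))

    for row in csv.DictReader(io.StringIO(csv_text)):
        protein = get_or_create("Protein", row["accession"])
        pathway = get_or_create("Pathway", row["pathway"])
        relationships.append((protein, "INVOLVED_IN", pathway))
    return nodes, relationships


nodes, relationships = import_rows(SAMPLE_CSV)
```

The get-or-create step is the important part: both proteins end up connected to the same pathway node rather than to two duplicates, which is what makes the later pathway queries possible.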
Next came several use cases: first visualizing pathways in the Neo4j web UI, then running several queries in Cypher, Neo4j's query language, to find certain proteins (HBA and HBB) and their interaction pathways.
Find the common pathways of HBA and HBB
Both proteins should be involved in particular pathways, which should be easy to find by querying. Now we want to retrieve only the pathways which have both proteins in common.
START proteinA=node:proteins(accession = "P69905"),
      proteinB=node:proteins(accession = "P68871")
MATCH (proteinA)-[:INVOLVED_IN]->(pathway)<-[:INVOLVED_IN]-(proteinB)
RETURN pathway
- O2/CO2 exchange in erythrocytes
- Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes
- Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes
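Conceptually, the pattern in this query is a set intersection: collect the pathways each protein is involved in and keep those that appear for both. The sketch below shows that idea in plain Python on a tiny in-memory mapping. The three shared pathway names are the ones listed above; the two "Hypothetical pathway" entries and the function name `common_pathways` are made up for illustration and are not Reactome data.

```python
# Illustrative in-memory data, not the Reactome import: each protein accession
# maps to the set of pathways it is involved in.
SHARED = {
    "O2/CO2 exchange in erythrocytes",
    "Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes",
    "Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes",
}
involved_in = {
    "P69905": SHARED | {"Hypothetical pathway A"},  # HBA, placeholder extra pathway
    "P68871": SHARED | {"Hypothetical pathway B"},  # HBB, placeholder extra pathway
}


def common_pathways(involved_in, accession_a, accession_b):
    """Pathways reachable from both proteins, like the two-sided MATCH pattern."""
    return involved_in[accession_a] & involved_in[accession_b]


common = sorted(common_pathways(involved_in, "P69905", "P68871"))
for name in common:
    print("-", name)
```

In the graph database the same result falls out of the traversal itself, since the pattern only matches pathway nodes that have an INVOLVED_IN relationship from both proteins.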