Protein Structure and Function Prediction Powered by a Grakn Knowledge Graph

We use the Grakn knowledge graph to show how data scientists can use knowledge graphs to efficiently draw insights from big data sets.

Ever since we have been able to sequence proteins, their three-dimensional structures have received tremendous experimental attention. Thanks to new methods and technological advances, determining these structures has become steadily more accurate over time.

The problem, however, is that the discovery of new protein structures has not kept pace with the rate at which new sequences are being produced. As a result, the gap between the number of known sequences and the number of identified three-dimensional structures keeps growing.

Given sufficient accuracy, a possible solution is the computational prediction of protein structures. Methods such as homology modeling, fold recognition, and de novo modeling can be used to fill this gap. However, whichever method is used, and with the amount of sequence data rising rapidly, the underlying problem remains the lack of a single knowledge base that allows a rapid and powerful scan over the universe of protein sequences. All publicly available data currently sits in different databases spread across many sources. Moving from one source to another is not, and certainly should not be, the biggest challenge in this process.

In this post, I aim to show how a Knowledge Graph can accelerate the protein structure prediction process by allowing you to:

  • query for insights over one single, comprehensive, and interconnected dataset of protein sequences.
  • search and produce a shortlisted set of sequences to be passed on to the next computational component in the prediction process.

All Data in One Knowledge Graph

The image below illustrates what I think the model of a knowledge graph in this domain of protein sequences and structures could look like.

This Grakn knowledge graph plays the role of a single knowledge base that contains all relevant data pulled in from various sources, such as UniProt and PDB. Data could also be pulled in by running BLAST with Grakn.

Migrating data to Grakn: To learn how data in CSV, JSON, and XML formats can be migrated to a Grakn Knowledge Graph, have a look at the comprehensive and step-by-step Migration Guide.
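
As a rough illustration of what a CSV migration involves, the Python sketch below turns CSV rows into Graql insert statements for the structure entity used in this post. The pdb_id column name, the structure_inserts helper, and the sample rows are assumptions made for this example. It only builds the statements; sending them to a Grakn server is out of scope here.

```python
import csv
import io


def structure_inserts(csv_text: str) -> list[str]:
    """Build one Graql insert statement per CSV row.

    Assumes a header row with a `pdb_id` column (hypothetical layout).
    """
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pdb_id = row["pdb_id"].strip()
        if not pdb_id:
            continue  # skip rows without an identifier
        # Escape characters that would end the Graql string literal early.
        escaped = pdb_id.replace("\\", "\\\\").replace('"', '\\"')
        statements.append(f'insert $s isa structure has pdb-id "{escaped}";')
    return statements


# Made-up sample: one usable row and one row with a blank identifier.
sample = "pdb_id,note\n2RHC,from the article\n ,row without an id\n"
for statement in structure_inserts(sample):
    print(statement)
```

Building the statements as plain strings keeps the sketch independent of any particular client library or Grakn version.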

Query for Insights

Now that we have all relevant data represented (as shown above) in a Grakn knowledge graph, we can go ahead and ask the following questions over this dataset. Under each question, I've included the relevant query.

  • What are the structures of the following sequence?
MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
match
  $target-sequence isa sequence "MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY";
  $structure isa structure;
  (mapping-sequence: $target-sequence, mapped-structure: $structure) isa sequence-structure-mapping;
get $structure;
  • Which sequences have the structure with the PDB ID "2RHC"?
match
  $target-structure isa structure has pdb-id "2RHC";
  $sequence isa sequence;
  (mapping-sequence: $sequence, mapped-structure: $target-structure) isa sequence-structure-mapping;
get $sequence;
  • The following sequence has no known structure. What are the structures of other sequences that are at least 80% identical to this particular sequence?
MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
match
  $target-sequence isa sequence "MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY";
  $similar-sequence isa sequence;
  $alignment (target-sequence: $target-sequence, matched-sequence: $similar-sequence) isa sequence-sequence-alignment;
  $alignment has identicality >= 0.8;
  $structure isa structure;
  (mapping-sequence: $similar-sequence, mapped-structure: $structure) isa sequence-structure-mapping;
get $structure;
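
To make the third query more concrete, here is a minimal Python sketch of the same shortlisting logic over in-memory data. It assumes that identicality stores the fraction of identical positions in a pairwise alignment, similar to the percent identity that BLAST reports; this post does not define the attribute. The function names, the toy peptides, and the structure labels are invented for illustration. The comparison uses >= to match the "at least 80%" wording of the question.

```python
def fraction_identical(aligned_a: str, aligned_b: str) -> float:
    """Fraction of alignment columns where both rows carry the same residue.

    Both inputs are rows of one pairwise alignment, so they must have equal
    length; '-' marks a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / len(aligned_a)


def shortlist_structures(alignments, structures, threshold=0.8):
    """Return structures of sequences whose alignment to the target reaches `threshold`.

    alignments: {sequence_id: (aligned_target_row, aligned_match_row)}
    structures: {sequence_id: [structure_id, ...]}
    """
    shortlisted = set()
    for sequence_id, (aligned_target, aligned_match) in alignments.items():
        if fraction_identical(aligned_target, aligned_match) >= threshold:
            shortlisted.update(structures.get(sequence_id, []))
    return sorted(shortlisted)


# Toy data: short made-up peptides, not real proteins.
alignments = {
    "seq-a": ("MNVGTAHSEV", "MNVGTAHSEV"),  # 10 of 10 columns identical
    "seq-b": ("MNVGTAHSEV", "MNVGTAHAAV"),  # 8 of 10 columns identical
    "seq-c": ("MNVGTAHSEV", "MN-GTQQSEA"),  # 6 of 10 columns identical
}
structures = {
    "seq-a": ["structure-1"],
    "seq-b": ["structure-2"],
    "seq-c": ["structure-3"],
}
print(shortlist_structures(alignments, structures))
# prints ['structure-1', 'structure-2']
```

The shortlisted structures are what would be handed to the next computational component in the prediction process.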

The code you saw above is Graql. Graql is the language for Grakn. Its expressivity is what makes it extremely human-readable and intuitive. In simple terms, Graql is a query language that can be understood and written by anyone, not just programmers.

Extending the Knowledge Graph

As we decide to pull in more relevant data sources into the Grakn knowledge graph, the model can evolve and be extended with minimal effort.

An Example

Below I've included the code that defines the model that I illustrated earlier in this post. If we were to extend this model and introduce the protein sequence function with a mapping relationship to the protein sequence structure, we could do so by extending the model (a.k.a. schema) like so:

define
  sequence-sequence-alignment sub relationship,
    relates target-sequence,
    relates matched-sequence,
    has identicality,
    has positivity;
  sequence-structure-mapping sub relationship,
    relates mapped-structure,
    relates mapping-sequence;
  # structure-function-mapping sub relationship,
  #   relates mapping-structure,
  #   relates mapped-function;
  sequence sub attribute datatype string,
    plays target-sequence,
    plays matched-sequence,
    plays mapping-sequence;
  structure sub entity,
    plays mapped-structure,
    # plays mapping-structure,
    has pdb-id;
  # function sub attribute datatype string,
  #   plays mapped-function;
  identicality sub attribute datatype double;
  positivity sub attribute datatype double;
  pdb-id sub attribute datatype string;

The commented lines above are the extra code that we need to add. Nothing else needs to change. This extended model of the knowledge graph looks like this now.

The extended model: 'function' is added as an attribute and mapped with 'structure' (directly) and with 'sequence' (via inference).

Given the new relationship structure-function-mapping and the previous relationship sequence-structure-mapping, we can make use of Grakn's automated reasoning capability to make an inference, resulting in new knowledge: the implied sequence-function-mapping relationship.

define
  sequence-function-mapping sub relationship,
    relates mapping-sequence,
    relates mapped-function;

  implied-sequence-function-mapping sub rule,
    when {
      $seq isa sequence;
      $struct isa structure;
      $func isa function;
      (mapping-sequence: $seq, mapped-structure: $struct) isa sequence-structure-mapping;
      (mapping-structure: $struct, mapped-function: $func) isa structure-function-mapping;
    } then {
      (mapping-sequence: $seq, mapped-function: $func) isa sequence-function-mapping;
    };

The implied-sequence-function-mapping rule above is telling Grakn that:

When:

  • there is a sequence, and
  • there is a structure, and
  • there is a function, and
  • the sequence and the structure have a mapping relationship, and
  • the structure and the function have a mapping relationship,

Then:

  • consider the sequence and the function to have a mapping relationship.
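
The rule behaves like a join over the shared structure. The Python sketch below reproduces that join over plain sets of pairs. It illustrates the logic only and says nothing about how Grakn's reasoner is implemented; the function name and the sample facts are assumptions made for this example.

```python
def infer_sequence_function(sequence_structure, structure_function):
    """Derive (sequence, function) pairs from two sets of mapping pairs.

    sequence_structure: set of (sequence, structure) pairs
    structure_function: set of (structure, function) pairs
    """
    # Index functions by structure so each sequence needs a single lookup.
    functions_by_structure = {}
    for structure, function in structure_function:
        functions_by_structure.setdefault(structure, set()).add(function)

    inferred = set()
    for sequence, structure in sequence_structure:
        for function in functions_by_structure.get(structure, ()):
            inferred.add((sequence, function))
    return inferred


# Hypothetical facts; the pairing of "2RHC" with "enzyme" is for illustration only.
sequence_structure = {
    ("seq-1", "2RHC"),
    ("seq-2", "2RHC"),
    ("seq-3", "structure-x"),
}
structure_function = {("2RHC", "enzyme")}
print(sorted(infer_sequence_function(sequence_structure, structure_function)))
# prints [('seq-1', 'enzyme'), ('seq-2', 'enzyme')]
```

Note that seq-3 gets no function because its structure has no function mapping, which mirrors a rule whose "when" conditions are not all met.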

With these additions to the schema, we can now ask the following questions:

  • Which sequences have the function "enzyme"?
match
  $target-function isa function "enzyme";
  $sequence isa sequence;
  (mapping-sequence: $sequence, mapped-function: $target-function) isa sequence-function-mapping;
get $sequence;
  • Which functions are mapped either directly to the following sequence or indirectly via an aligned sequence that is at least 80% identical to the given sequence?
The sequence: MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
match
  $target-sequence isa sequence "MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY";
  $function isa function;
  {
    # direct mapping to the target sequence
    (mapping-sequence: $target-sequence, mapped-function: $function) isa sequence-function-mapping;
  } or {
    # indirect mapping via a sufficiently similar sequence
    $similar-sequence isa sequence;
    $alignment (target-sequence: $target-sequence, matched-sequence: $similar-sequence) isa sequence-sequence-alignment;
    $alignment has identicality >= 0.8;
    (mapping-sequence: $similar-sequence, mapped-function: $function) isa sequence-function-mapping;
  };
get $function;
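
The last question asks for the union of two result sets: functions mapped directly to the sequence, plus functions reached through similar sequences. As a plain-Python illustration of that union, independent of Graql, consider the sketch below. The data layout, the function name, and the sample values are assumptions made for this example.

```python
def functions_for(target, functions, identities, threshold=0.8):
    """Functions mapped to `target` directly or via sufficiently similar sequences.

    functions:  {sequence_id: set of function names}
    identities: {(target_id, other_id): fraction of identical positions}
    """
    result = set(functions.get(target, set()))  # direct mappings
    for (aligned_target, other), identity in identities.items():
        if aligned_target == target and identity >= threshold:
            result |= functions.get(other, set())  # indirect mappings
    return result


# Hypothetical data: seq-1 has no function of its own.
functions = {"seq-2": {"enzyme"}, "seq-3": {"transport"}}
identities = {("seq-1", "seq-2"): 0.92, ("seq-1", "seq-3"): 0.45}
print(functions_for("seq-1", functions, identities))
# prints {'enzyme'}
```

Because the result is a union, a sequence with no direct function mapping still receives the functions of its close matches.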

Grakn Rules

What you saw above is simply one example of how a Grakn knowledge graph can be extended to infer new knowledge. Rules can be written in any research domain to:

  • inject biological facts.
  • infer based on new findings (hypotheses).
  • enforce constraints.

It's entirely up to you how you choose to make your knowledge graph more intelligent by writing rules tailored to your own work.

The Opportunities Are Endless!

Grakn is about modeling intelligent knowledge graphs in an intelligent way. We believe simplicity to be a cornerstone of intelligence. Hence, the query language — Graql. What you can model and query with a Grakn knowledge graph is only limited by your will and imagination.

See an example of the thought process behind modeling a dataset in Grakn. Read about the Schema Concepts, Types and Rules. Go through examples of Graql queries or see how to write your own.

Topics:
big data, knowledge graph, tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
