SQL vs. Graql: Modeling and Querying of Biomedical Data
Look at how Graql compares to SQL when working with biomedical data.
Join the DZone community and get the full member experience.
Join For FreeUsing SQL to query relational databases is easy. As a declarative language, it’s straightforward to write queries and build powerful applications. Having been around for many decades, SQL has developed into a language that is extremely robust, reliable, fast, and useful for applications of many sizes.
However, SQL struggles when working with complex data. By complex data, I mean data that contains many entity types and is highly interconnected. When querying such data in SQL, challenges arise in the modeling and querying of the data. For example, due to a large number of necessary JOINs, it forces us to write long and verbose queries. Not only do such queries take very long to run, but they are also difficult to write and prone to mistakes. When working with complex data in SQL, we can list the challenges as follows:
- Complex to write queries — analyzing hierarchical data with many relationships leads to SQL statements containing many JOINs which easily become too difficult to understand.
- Slow query speed — as a consequence, queries with a large number of JOINs create complexity in the computing response and lead to high query response times.
- Complicated data models — domains with complex data require non-intuitive data models with (join) tables that create unnecessary complexity and reduce data quality.
The above challenges are especially pertinent when working with biomedical data (see more here). This is exactly the type of data that becomes difficult to model and query in SQL. It contains many entity types (e.g. various proteins, genes, tissues) and many connections between those entities (interactions, associations, etc). This is why using a query language such as Graql can reduce the modeling and querying complexity by orders of magnitude (more on that below). That’s why in this article, I want to look at how Graql compares to SQL when working with biomedical data.
At the end of this article, I’ll show you how we can write a 151-line SQL query in just 4 lines in Graql.
An Introduction to Graql
If you’re unfamiliar with Graql, I want to first give a brief high-level overview of the modeling constructs (see here for an in-depth introduction).
Just as SQL is the standard query language in relational databases, Graql is Grakn’s query language. It’s a declarative language and allows us to model query and reason over our data. Crucially, it’s Graql’s expressivity that makes it the most readable query language in the world. This is really important because it means even non-programmers (such as biologists rather than bioinformaticians) are able to talk to the database and make sense of it.
Grakn’s data model consists of three core types:
- Entities are the main actors in your domain. These are usually the type of things you want to know about (e.g. proteins, genes, tissues, diseases)
- A relationship describes how two or more things are in some way connected to each other. Each relationship must relate to roles that are played by something else in the domain. In other words, relationships are dependent on the existence of at least two other things (e.g. protein interactions, process localization, transcription).
- Attributes are small pieces of data that get attached to other concepts (e.g. uniprot-id, entrez-id, disease-name).
- Below is how these concepts can be used to create a simple model (or schema). The example below specifically shows us how genes encode proteins and are associated to diseases.
Finally, Rules are logical patterns that we encode in our model which allow us to reason over our existing data to create new instances of entities, relationships and attributes. Effectively, this means that by imposing a certain logic over existing data, we can answer exponentially more questions from the same data.
For example, if our original data doesn’t include protein-disease relationships, we could infer these protein-disease relationships by looking at the protein’s encoding genes and connecting it with the diseases to which they are associated to. In other words:
If:
- Genes encode proteins;
- And genes are associated to diseases.
Then:
- Create protein-disease associations.
This rule would be part of our schema, which we could then visualize as follows, where the dotted line represents the inferred relationship:

SQL and Graql: Data Modeling
With this basic understanding of Graql (and assuming you understand SQL), let’s compare how we would represent a more complicated model in both Graql and SQL, and let’s find out which one is more intuitive.
For this comparison, I want to specifically look at disease networks. The entities I want to model include:
- Kinases (protein type)
- Ion Channels (protein type)
- Nuclear Receptors (protein type)
- GPCRs (protein type)
- Genes
- Diseases
- Drugs
And the corresponding relationships:
- Genes encode proteins
- Genes are associated to drugs
- Genes are associated to diseases
Finally, I want to include sequence alignment data from a BLAST search (more on that here) that we previously did. For my model, I want to integrate the positivity and identicality scores that BLAST returns between match and target sequences. We model this as relationships between sequences:
- Targetprotein sequences have positivity and identicality scores with match protein sequences
Pictured below is how we would represent this domain in a relational model.

There are a few things to note:
- There are three join tables to connect genes to diseases, drugs, and proteins.
- The sequence alignments are also modeled as a join table, where I used protein sequences as foreign keys to connect them with proteins. In this table, I also added positivity and identicality scores.
Now that we've modeled this disease network in a relational model, how would we go about doing this in Grakn? See below.

It's important to note that as Graql is so much more expressive than SQL when modeling any domain in Graql, we need to think differently if we're used to modeling in a relational world. For example, we should take into account Graql's type hierarchy, and think of concepts which we later want to infer through the use of rules. With that in mind, for my model:
- I sub-typed four protein types (kinase, ion-channel, nuclear receptor, and GPCR), and created a parent type protein
- I created a relationship between two sequence attributes called sequence-sequence-alignment, and on that relationship, I attached identicality and positivity attributes
- I also created a protein-protein-alignment relationship, so that I can create rules that infer this relationship
- Finally, I created a drug-disease-association relationship, so that I can also infer this relationship
SQL and Graql: Query Examples
Now that we've created these models, how would we go about querying these in SQL? And how would this compare in Graql?
First, let's start with a simple question.
- Which drugs interact with the gene Ensembl-id ‘ENSG00000010404'?
SQL query:
SELECT Drug.ChemblId, Drug.DrugName
FROM Gene
INNER JOIN Drug_Gene
ON Gene.EntrezId = Drug_Gene.EntrezId
LEFT JOIN Drug
ON Drug.ChemblId = Drug_Gene.ChemblId
WHERE Gene.EnsemblId = 'ENSG00000010404'
Graql query:
match
$ge isa gene, has ensembl-id "ENSG00000010404"; $dr isa drug;
(inhibitor: $dr, target-gene: $ge) isa drug-gene-interaction; get;
Okay, it looks like Graql is more intuitive to read. Let's look at a slightly more complicated query.
2. Which Kinases are associated to Diabetes?
SQL query:
SELECT Kinase.UniprotId, Kinase.UniprotName
FROM Protein_Gene
INNER JOIN Disease_Gene
ON Disease_Gene.EntrezId = Protein_Gene.EntrezId
RIGHT JOIN Disease
ON Disease_Gene.DiseaseName = Disease.DiseaseName
LEFT JOIN Kinase
ON Protein_Gene.UniprotId = Kinase.UniprotId
WHERE Disease.DiseaseName = 'Asthma'
Graql query:
match
$k isa kinase;
$di isa disease, has disease-name "Diabetes";
(associated-protein: $k, associated-disease: $di) isa protein-disease-association; get;
The Graql query is two times shorter than its SQL equivalent!
Now, bear in mind that for this Graql query we leverage the rule that I previously created to infer protein-disease-associations if both entities are connected to the same gene. This allows us to ask a higher level question, and let the system figure out how to do the navigation. In SQL, as it doesn't have a reasoning engine, we need to write the explicit query by doing three JOINs.
Finally, let's look at an extreme example.
3. Which drugs are associated to the disease Asthma?
SQL query:
select Drug_Gene.ChemblId, DrugName from Drug
inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
inner join Kinase on Kinase.UniProtId=Gene_Protein.UniProtId
where
Kinase.sequence in
(select Sequence_Alignment.TargetSequence from Kinase
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Ion_channel
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Nuc_receptor
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from GPCR
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)
UNION ALL
select Drug_Gene.ChemblId, DrugName from Drug
inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
inner join Ion_channel on Ion_channel.UniProtId=Gene_Protein.UniProtId
where
Ion_channel.sequence in (
select Sequence_Alignment.TargetSequence from Kinase
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Ion_channel
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Nuc_receptor
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from GPCR
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)
UNION ALL
select Drug_Gene.ChemblId, DrugName from Drug
inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
inner join Nuc_receptor on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
where
Nuc_receptor.sequence in
(select Sequence_Alignment.TargetSequence from Kinase
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Ion_channel
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Nuc_receptor
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from GPCR
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)
UNION ALL
select Drug_Gene.ChemblId, DrugName from Drug
inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
inner join GPCR on GPCR.UniProtId=Gene_Protein.UniProtId
where
GPCR.sequence in
(select Sequence_Alignment.TargetSequence from Kinase
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Ion_channel
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from Nuc_receptor
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
UNION ALL
select Sequence_Alignment.TargetSequence from GPCR
inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
where
Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)
Graql query:
match
$di isa disease, has disease-name 'Asthma';
$dr isa drug;
$r (therapeutic: $dr, affected-disease: $di) isa drug-disease-association; get;
Wow! That's quite a big difference — the Graql query is 37 times shorter than its SQL equivalent. Now, please note I'm really not trying to take the piss out of SQL! And I'm sure it can still be optimized by someone highly experienced in it. That being said, I hope to show here how an average person (such as me) would go about writing this.
Let me expand how we reduced a 151-line SQL query into a 4-line Graql query. Due to the underlying data structure and the logic we want to apply, the query itself leads to a combinatory explosion due to the number of permutations of the query. Graql's type-based and rule-based reasoning abstracts this away, allowing us to ask the higher level query.
In the original data ingested, no drug-disease relationships exist. Therefore, in order to answer this question, we want to apply the following logic:
- If diseases are connected to genes (in our example above we specifically asked for the disease Asthma);
- And genes encode proteins (kinases, ion channels, nuclear receptors, GCPRs);
- If any of those proteins has a sequence alignment with another protein containing a positivity higher than 0.98 and an identicality higher than 0.98,
- and that protein is encoded by a gene,
- which is associated to a drug
I stored this logic in the form of a rule (see below) in my Grakn schema. This means that subsequently I can just query for drug-disease-relationships and the Grakn will figure out how to do the navigation for me.
when {
$di isa disease;
$g isa gene; $g2 isa gene; $g != $g2;
$pr isa protein, has sequence $sequence-1;
$pr2 isa protein, has sequence $sequence-2;
$pr != $pr2;
$dr isa drug;
(associated-disease: $di, associated-gene: $g) isa gene-disease-association;
(encoding-gene: $g, encoded-protein: $pr) isa gene-protein-encoding;
$alignment ($sequence-1, $sequence-2) isa sequence-sequence-alignment, has sequence-identicality $ident, has sequence-positivity $pos;
$ident >= 0.98; $pos >= 0.98;
(encoding-gene: $g2, encoded-protein: $pr2) isa gene-protein-encoding;
($dr, target-gene: $g2) isa drug-gene-interaction;
} then {
(affected-disease: $di, therapeutic: $dr) isa drug-disease-association;
};
However, as SQL doesn't support reasoning, writing the SQL query results in the combinatory explosion that you saw above. Specifically, this is because there are four types of proteins that contain sequences that can either be a target-sequence or a match-sequence in a sequence-alignment relationship connected to any of the four other protein types. In SQL we have to explicitly state all the paths that the query can take, whereas in Grakn, we just reason over our data using rule-based and type-based inferencing.

Final Words
SQL is undoubtedly the most battle-hardened query language in the world. And to this day powers most of world's IT infrastructure. However, when it comes to handling complex data, SQL struggles:
- It's complex to write queries;
- it leads to slow query speeds;
- and forces us to create complicated data models.
Fundamentally, these challenges are inevitable as the relational data model was not created to work with this type of data. Using a much more expressive query language that reasons over our data abstracts away this complexity so that we can focus on the higher level question.
Thanks to Soroush Saffari for helping with this article, especially creating the SQL queries!
Published at DZone with permission of Tomás Sabat, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments