Since writing this post in December, I’ve been thinking about doing something similar, yet different, using the Neo4j Cypher query language.
That’s how I came up with the idea of looking for topological patterns across a large subset of the protein-protein interaction network included in Bio4j, rather than focusing on a few proteins selected a priori.
I decided to mine the data in order to find circuits (simple cycles) of length 3 where at least one protein is from the Swiss-Prot dataset.
I would like to point out that direction matters here, so these two cycles:
- A --> B --> C --> A
- A --> C --> B --> A
are not the same.
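To make the distinction concrete, here is a small Python sketch that lists the directed edges of each cycle. The helper name `directed_edges` is made up for this illustration. The two cycles visit the same three nodes but do not share a single directed edge.

```python
def directed_edges(cycle):
    """Return the set of directed edges (source, target) of a cycle.

    The cycle is given as a sequence of nodes; the last node links back
    to the first one.
    """
    n = len(cycle)
    return {(cycle[i], cycle[(i + 1) % n]) for i in range(n)}


forward = directed_edges(["A", "B", "C"])   # A -> B -> C -> A
backward = directed_edges(["A", "C", "B"])  # A -> C -> B -> A

print(sorted(forward))     # edges of the first cycle
print(sorted(backward))    # edges of the second cycle
print(forward & backward)  # directed edges shared by both cycles (none)
```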
OK, with that said, let’s see what the Cypher query looks like:
START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
return p.accession, p2.accession, p3.accession
As you can see, it’s really simple and straightforward. The first two lines match the proteins from the Swiss-Prot dataset; the rest of the pattern then retrieves the ones that form a cycle of length 3, as described before.
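For readers who do not use Cypher, the same pattern can be sketched in plain Python on a toy graph. Everything here is hypothetical: the function name `find_directed_3_cycles`, the edge list, and the set standing in for the Swiss-Prot proteins. The check that the three proteins are distinct is an assumption of this sketch, matching the "simple cycle" wording above.

```python
def find_directed_3_cycles(edges, start_nodes):
    """Return rows (p, p2, p3) with p -> p2 -> p3 -> p and p in start_nodes."""
    # Adjacency map: node -> set of nodes it points to.
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)

    rows = []
    for p in sorted(start_nodes):
        for p2 in sorted(successors.get(p, ())):
            for p3 in sorted(successors.get(p2, ())):
                closes_cycle = p in successors.get(p3, ())
                all_distinct = len({p, p2, p3}) == 3  # assumption: simple cycles only
                if closes_cycle and all_distinct:
                    rows.append((p, p2, p3))
    return rows


# Toy interaction network; "A" and "B" play the role of Swiss-Prot proteins.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("D", "B")]
toy_rows = find_directed_3_cycles(toy_edges, start_nodes={"A", "B"})
print(toy_rows)  # [('A', 'B', 'C'), ('B', 'C', 'A')]
```

Note that the single cycle in the toy graph is reported twice, once for each starting protein that belongs to the start set.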
Once the query has finished, you should get something like this:
| p.accession | p2.accession | p3.accession |
| Q08465      | P35189       | P3421        |
| Q08465      | P34218       | P35189       |
| Q8GXA4      | Q8L7E5       | Q9LE82       |
| Q8GXA4      | Q9FH18       | Q8L7E5       |
==> 6632 rows, 1019211 ms
As you can see, the query took about 17 minutes to complete on a 100% fresh DB (no information was cached at all yet), running on an m1.large AWS machine with 7.5 GB of RAM.
Not bad, right?!
There is one thing to watch out for, though. This query returns cycles such as:
- A --> B --> C --> A
- B --> C --> A --> B
as different cycles when they are actually not.
That’s why I developed a simple program to remove these repetitions and to gather some statistics.
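The program itself is not reproduced here. As a minimal sketch of one way such a filter could work, the Python below rotates every row so that its smallest accession comes first and keeps one row per rotated form. The names `canonical_rotation` and `unique_cycles` and the sample rows are assumptions made for this illustration. Rotating never reverses a cycle, so cycles that differ only in direction stay separate.

```python
def canonical_rotation(cycle):
    """Rotate the cycle so its smallest element comes first, keeping direction."""
    i = cycle.index(min(cycle))
    return tuple(cycle[i:]) + tuple(cycle[:i])


def unique_cycles(rows):
    """Keep one representative per cycle, in order of first appearance."""
    seen = set()
    result = []
    for row in rows:
        key = canonical_rotation(row)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


sample_rows = [
    ("A", "B", "C"),  # A -> B -> C -> A
    ("B", "C", "A"),  # same cycle, starting at B
    ("A", "C", "B"),  # opposite direction: a different cycle
]
deduplicated = unique_cycles(sample_rows)
print(deduplicated)  # [('A', 'B', 'C'), ('A', 'C', 'B')]
```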
After running the program you get two files:
After filtering, the circuits found were reduced to 2226 records.
Finally, I also created a really simple chart showing the absolute frequency of the 20 proteins that occur most often in the cycles found.
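A count like this could be obtained along the following lines. This is only a sketch: it assumes the frequency of a protein is the number of filtered cycles it appears in, and the function name `top_proteins` and the sample cycles are hypothetical.

```python
from collections import Counter


def top_proteins(cycles, n=20):
    """Return the n most frequent accessions as (accession, count) pairs."""
    counts = Counter(accession for cycle in cycles for accession in cycle)
    return counts.most_common(n)


sample_cycles = [("A", "B", "C"), ("A", "B", "D"), ("A", "E", "F")]
ranking = top_proteins(sample_cycles, n=2)
print(ranking)  # [('A', 3), ('B', 2)]
```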
Well, that’s all for now.
Have a good day!