Mining Bio4j Data: Finding Topological Patterns in PPI Networks

Hi everyone!

After writing this post in December, I’ve been thinking of doing something similar, yet different, using the Neo4j Cypher query language.

That’s how I came up with the idea of looking for topological patterns across a large subset of the protein-protein interaction (PPI) network included in Bio4j, rather than focusing on a few proteins selected a priori.

I decided to mine the data in order to find circuits (simple cycles) of length 3 where at least one protein is from the Swiss-Prot dataset.

I would like to point out that the direction here is important and these two cycles:

  • A –> B –> C –> A
  • A –> C –> B –> A

are not the same.
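
To make the pattern concrete, here is a minimal Python sketch that enumerates directed cycles of length 3 in a toy edge list. It is not Bio4j code: the function name `directed_3_cycles` and the edges are hypothetical, and the brute-force search is only sensible for tiny examples. It shows that A –> B –> C –> A can be present while A –> C –> B –> A is not.

```python
from itertools import permutations


def directed_3_cycles(edges):
    """Return every (a, b, c) such that a->b, b->c and c->a are all edges."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return sorted(
        (a, b, c)
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (c, a) in edge_set
    )


# Toy directed interactions (hypothetical labels, not Bio4j data).
edges = [("A", "B"), ("B", "C"), ("C", "A")]
cycles = directed_3_cycles(edges)

# The cycle is reported once per starting protein, always in the same direction.
print(cycles)
# The reversed cycle A -> C -> B -> A needs edges that are not in the list.
print(("A", "C", "B") in cycles)
```

Note that the single cycle shows up three times, once for each starting protein.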

OK, with that said, let’s see what the Cypher query looks like:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
 return p.accession, p2.accession, p3.accession

As you can see, it’s really simple and straightforward. The first two lines match the proteins from the Swiss-Prot dataset; the query then retrieves the ones that form a cycle of length 3, as described above.
Once the query has finished, you should get something like this:

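==> +-------------------------------------------+
==> | p.accession | p2.accession | p3.accession |
==> +-------------------------------------------+
==> | Q08465      | P35189       | P3421        |
==> | Q08465      | P34218       | P35189       |
==> | Q8GXA4      | Q8L7E5       | Q9LE82       |
==> | Q8GXA4      | Q9FH18       | Q8L7E5       |
==> +-------------------------------------------+
==> 6632 rows, 1019211 ms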

As you can see, the query took about 17 minutes to complete on a 100% fresh DB (no information was cached at all yet) running on an m1.large AWS machine (this machine has 7.5 GB of RAM).
Not bad, right?!

We have to be careful about one thing, though: this query returns cycles such as:

  • A –> B –> C –> A
  • B –> C –> A –> B

as different cycles when they are actually the same one.

That’s why I developed a simple program to remove these repetitions and to gather some statistics.
After running the program, you get two files:

  1. PPICircuitsLength3NoRepeats file: download it here
  2. PPICircuitsProteinsFreq file: download it here.

After filtering, the circuits found were reduced to 2226 records.
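
The program itself is not shown in this post, but the filtering idea can be sketched in a few lines of Python. This is only one possible approach, with hypothetical function names: rotate each cycle so that its smallest accession comes first (which keeps the direction intact) and keep the first occurrence of each rotated form.

```python
def canonical_rotation(cycle):
    """Rotate a cycle so its smallest element comes first, keeping direction."""
    start = cycle.index(min(cycle))
    return cycle[start:] + cycle[:start]


def remove_rotations(rows):
    """Keep one representative per cycle, dropping rotated duplicates."""
    seen = set()
    unique = []
    for row in rows:
        key = canonical_rotation(tuple(row))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy query rows (hypothetical labels standing in for accessions).
rows = [
    ("A", "B", "C"),
    ("B", "C", "A"),  # rotation of the first row
    ("C", "A", "B"),  # rotation of the first row
    ("A", "C", "B"),  # opposite direction, so a different cycle
]
unique = remove_rotations(rows)
print(unique)
```

Because only rotations are collapsed, the two directions of the same three proteins still count as two separate cycles.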

Finally, I also created a really simple chart showing the absolute frequency of the 20 proteins that occur most often in the cycles found.
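
The frequency counts behind a chart like this can be obtained with a simple tally. The following Python sketch assumes the filtered cycles are available as tuples of accessions; the function name `protein_frequencies` and the sample data are hypothetical.

```python
from collections import Counter


def protein_frequencies(cycles, top=20):
    """Count how many cycles each protein appears in; return the most frequent."""
    counts = Counter(protein for cycle in cycles for protein in cycle)
    return counts.most_common(top)


# Toy filtered cycles (hypothetical labels standing in for accessions).
cycles = [("A", "B", "C"), ("A", "B", "D"), ("A", "E", "F")]
top_two = protein_frequencies(cycles, top=2)
print(top_two)
```

With the default `top=20`, the same call would return the 20 most frequent proteins of a full result set.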

Well, that’s all for now.

Have a good day!

Source: http://blog.bio4j.com/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks/
