
Mining Bio4j Data: Finding Topological Patterns in PPI Networks


Hi everyone!

After writing this post in December, I’ve been thinking of doing something similar, yet different, using the Neo4j Cypher query language.

That’s how I came up with the idea of looking for topological patterns across a large subset of the protein-protein interaction (PPI) network included in Bio4j, rather than focusing on a few proteins selected a priori.

I decided to mine the data to find circuits (simple cycles) of length 3 in which at least one protein comes from the Swiss-Prot dataset.


I would like to point out that the direction here is important and these two cycles:

  • A -> B -> C -> A
  • A -> C -> B -> A


are not the same.

OK, now that this has been said, let’s see what the Cypher query looks like:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
      circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
RETURN p.accession, p2.accession, p3.accession


As you can see, it’s really simple and straightforward. In the first two lines we match the proteins that belong to the Swiss-Prot dataset; the rest of the pattern then retrieves only those that form a 3-length cycle as described above.
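To make the matching logic more concrete, here is a minimal Python sketch of what the query computes, run on a tiny in-memory graph instead of Bio4j. The function name `find_3_cycles`, the `anchors` parameter (standing in for the Swiss-Prot proteins) and the toy edges are all made up for illustration. The sketch also assumes the three proteins must be distinct, in line with the "simple cycles" wording above.

```python
from collections import defaultdict


def find_3_cycles(edges, anchors):
    """Return (p, p2, p3) triples with p in `anchors` and p -> p2 -> p3 -> p.

    `edges` is an iterable of directed (source, target) pairs and `anchors`
    plays the role of the proteins matched from the dataset (hypothetical names).
    """
    successors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)

    rows = []
    for p in sorted(anchors):
        for p2 in sorted(successors.get(p, ())):
            for p3 in sorted(successors.get(p2, ())):
                # Assumption: only simple cycles, so all three nodes differ.
                if len({p, p2, p3}) != 3:
                    continue
                # Close the cycle: p3 must point back to the starting node.
                if p in successors.get(p3, ()):
                    rows.append((p, p2, p3))
    return rows


# Toy data: A -> B -> C -> A is a directed cycle; D only points into it.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "A")]
toy_rows = find_3_cycles(toy_edges, anchors={"A", "B"})
print(toy_rows)  # [('A', 'B', 'C'), ('B', 'C', 'A')]
```

Note that the same cycle is reported once for every anchor protein it contains, each time starting from a different node.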
Once the query has finished, you should be getting something like this:

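cypher>
==> +-------------------------------------------+
==> | p.accession | p2.accession | p3.accession |
==> +-------------------------------------------+
==> | Q08465      | P35189       | P3421        |
==> | Q08465      | P34218       | P35189       |
==> | Q8GXA4      | Q8L7E5       | Q9LE82       |
==> | Q8GXA4      | Q9FH18       | Q8L7E5       |
==> ...
==> +-------------------------------------------+
==> 6632 rows, 1019211 ms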


As you can see, the query took about 17 minutes to complete on a 100% fresh DB (there was no information cached at all yet), running on an m1.large AWS machine (this machine has 7.5 GB of RAM).
Not bad, right?!

We have to be careful about one thing, though: this query returns cycles such as:

  • A -> B -> C -> A
  • B -> C -> A -> B


as different cycles when they are actually not.

That’s why I developed a simple program to remove these repetitions and to gather some statistics.
After running the program you get two files:

  1. PPICircuitsLength3NoRepeats file: download it here
  2. PPICircuitsProteinsFreq file: download it here.


After filtering, the circuits found were reduced to 2226 records.
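The program itself is not shown here, but the filtering idea can be sketched in a few lines of Python. This is only one possible approach, not necessarily what the original program does: each row is rotated so that its smallest accession comes first, which maps all rotations of a cycle to the same key while keeping the direction intact. The function names and the toy rows are made up for illustration.

```python
def canonical_rotation(cycle):
    """Rotate a cycle so its smallest element comes first, keeping direction."""
    cycle = tuple(cycle)
    start = cycle.index(min(cycle))
    return cycle[start:] + cycle[:start]


def remove_rotations(rows):
    """Keep one representative per directed cycle, in first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = canonical_rotation(row)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy rows: the first two are rotations of the same cycle,
# the third runs in the opposite direction and must be kept.
toy_rows = [("A", "B", "C"), ("B", "C", "A"), ("A", "C", "B")]
print(remove_rotations(toy_rows))  # [('A', 'B', 'C'), ('A', 'C', 'B')]
```

Because only rotations are merged, A -> B -> C -> A and A -> C -> B -> A still count as two different cycles, as pointed out earlier.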

Finally, I also created a really simple chart showing the absolute frequency of the 20 proteins that occur most often in the cycles found.
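Counting how often each protein shows up is straightforward once the repetitions are gone. The following is a minimal sketch using Python's standard `collections.Counter`; the function name `protein_frequencies` and the toy cycles are made up, and this is not necessarily how the frequency file above was produced.

```python
from collections import Counter


def protein_frequencies(cycles, top=20):
    """Return the `top` most frequent proteins as (protein, count) pairs."""
    counts = Counter(protein for cycle in cycles for protein in cycle)
    return counts.most_common(top)


# Toy cycles (already de-duplicated): A appears three times, B twice.
toy_cycles = [("A", "B", "C"), ("A", "B", "D"), ("A", "E", "F")]
print(protein_frequencies(toy_cycles, top=2))  # [('A', 3), ('B', 2)]
```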


Well, that’s all for now.

Have a good day!



Source: http://blog.bio4j.com/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks/
