Neo4j Based Social News Stream Framework Graphity Source Code
Eight months ago I posted the results of my research on fast retrieval of social news feeds, and in particular on my graph index Graphity. The index is able to serve more than 12 thousand personalized social news streams per second in social networks with several million active users. I was able to show that the system's retrieval performance is independent of node degree and network size. Therefore it scales to graphs of arbitrary size.
Today I am pleased to announce that our joint work was accepted as a full research paper at the IEEE SocialCom 2012 conference. The conference will take place in early September 2012 in Amsterdam. As promised before, I am now opening the source code of Graphity to the community. Its documentation could be better and might be improved in the future. I am also sure that one could use a better data structure for our implementation of the priority queue.
Still, the attention from the developer community for Graphity was quite high, so maybe the source code is of help to someone. It contains the entire evaluation framework that we used to compare Graphity against other baselines, which should also help anyone who wants to reproduce our evaluation.
There are some nice things one can learn from it, such as how to set up multithreading for time measurements and how to set up a good logging mechanism.
The code can be found at https://github.com/renepickhardt/graphity-evaluation and the main algorithm should lie in the file:
Other files of high interest should be:
- https://github.com/renepickhardt/graphity-evaluation/blob/master/src/de/metalcon/neo/evaluation/neo/SortUtils.java top-k n-way merge inside a graph database
- https://github.com/renepickhardt/graphity-evaluation/blob/master/src/de/metalcon/neo/evaluation/neo/NodeQueueIterator.java iterator over the Graphity index
- https://github.com/renepickhardt/graphity-evaluation/blob/master/src/de/metalcon/neo/evaluation/neo/NeoUtils.java some shortcuts for Neo4j coding
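For readers who have not met a top-k n-way merge before, here is a minimal, generic sketch in Python. It is not the code from SortUtils.java and does not touch Neo4j. It only illustrates the general technique, assuming each input stream is a list of (timestamp, payload) pairs that is already sorted newest first. The function name top_k_merge and the data layout are made up for this illustration.

```python
import heapq


def top_k_merge(streams, k):
    """Return the k newest items from several streams sorted newest first.

    Each stream is assumed to be an iterable of (timestamp, payload) pairs
    in descending timestamp order (an assumption made for this sketch).
    """
    heap = []
    for idx, stream in enumerate(streams):
        it = iter(stream)
        first = next(it, None)
        if first is not None:
            # heapq is a min-heap, so the timestamp is negated to pop the
            # newest item first. idx breaks ties so that items and iterators
            # are never compared with each other.
            heapq.heappush(heap, (-first[0], idx, first, it))

    result = []
    while heap and len(result) < k:
        _, idx, item, it = heapq.heappop(heap)
        result.append(item)
        nxt = next(it, None)
        if nxt is not None:
            # Only the stream that just lost its head is advanced.
            heapq.heappush(heap, (-nxt[0], idx, nxt, it))
    return result
```

The priority queue never holds more than one entry per stream, and the loop stops after k items, so streams are only read as far as needed.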
I have not touched the code over the last couple of months, and it really has a lot of debugging comments inside. My apologies for this bad practice. I hope you can overlook this, bearing in mind that I am a mathematician and that this was one of my first bigger evaluation projects. In my own interest I promise that next time I will produce code that is easier to read, understand and reuse.
Still, if you have any questions, suggestions or comments, feel free to contact me.
The raw data can be downloaded at:
- 18 MB: http://rene.metalcon.de/de-nodeIds.txt.bz2
- 650 MB: http://rene.metalcon.de/de-events.log.bz2 All events that ever happened to German Wikipedia articles up to the middle of 2011
The format of these files is straightforward:
de-nodeIds.txt has first an ID, then a tab, and then the title of the Wikipedia article. This file is only necessary if you want to display your data with article titles rather than IDs.
The interesting file is de-events.log. In this file there are 4 columns:
timestamp TAB FromNodeID TAB [ToNodeID] TAB U/R/A
So every line tells exactly when the article FromNodeID changed. If only 3 columns are present and a U is written, then just the article itself changed. If links in the article changed, there is another node ID in the third column, followed by an A or an R for add or remove respectively.
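As a rough illustration of this format, a single line could be parsed like this in Python. This is only a sketch based on the description above and is not part of the evaluation framework. The names Event and parse_event are invented here, and timestamps and node IDs are kept as plain strings because their exact format is not specified in this post.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    timestamp: str
    from_node: str
    to_node: Optional[str]  # None for plain "U" (update) events
    kind: str               # "U", "A" (link added) or "R" (link removed)


def parse_event(line: str) -> Event:
    """Parse one tab-separated line of de-events.log (format as described above)."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) == 3:
        # timestamp TAB FromNodeID TAB U
        timestamp, from_node, kind = parts
        to_node = None
    elif len(parts) == 4:
        # timestamp TAB FromNodeID TAB ToNodeID TAB A/R
        timestamp, from_node, to_node, kind = parts
        # Assumption: tolerate an empty third column instead of a missing one.
        to_node = to_node or None
    else:
        raise ValueError(f"expected 3 or 4 columns, got {len(parts)}: {line!r}")
    if kind not in ("U", "A", "R"):
        raise ValueError(f"unknown event type {kind!r} in line {line!r}")
    return Event(timestamp, from_node, to_node, kind)
```

The decompressed log can then be read line by line, for example with Python's bz2.open in text mode, and each line passed to parse_event.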
I think processing these files is rather straightforward. With this file you can fully simulate the growth of Wikipedia over time. The file is sorted by the second column. If you want to use it in our evaluation framework, you should sort it by the first column. This can be done on a Unix shell in less than 10 minutes with the sort command.
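To give an idea of what simulating the growth means, here is a small Python sketch that replays already parsed events in timestamp order and returns the link graph at a given point in time. It is not taken from the evaluation framework. The function name graph_at and the tuple layout (timestamp, from_node, to_node, kind) are assumptions made for this example, and the timestamps are assumed to be comparable values such as integers. It sorts in memory, which is fine for a small sample but not for the full log, where the sort command is the better choice.

```python
from collections import defaultdict


def graph_at(events, until):
    """Return {node: set of linked nodes} after replaying all events up to `until`.

    `events` is an iterable of (timestamp, from_node, to_node, kind) tuples
    with kind in {"U", "A", "R"} (layout assumed for this sketch).
    """
    links = defaultdict(set)
    # sorted() is stable, so events with equal timestamps keep their input order.
    for timestamp, from_node, to_node, kind in sorted(events, key=lambda e: e[0]):
        if timestamp > until:
            break
        if kind == "A":
            links[from_node].add(to_node)
        elif kind == "R":
            links[from_node].discard(to_node)
        # "U" events only say that the article changed, so the links stay as they are.
    # Drop nodes whose links were all removed again.
    return {node: targets for node, targets in links.items() if targets}
```

Calling graph_at with increasing values of until gives successive snapshots of the link structure.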
Sorry, I cannot publish the paper on my blog right now, since the camera-ready version still has to be prepared and submitted to IEEE. But follow me on Twitter or subscribe to my newsletter so I can let you know as soon as the entire paper is available as a PDF.