Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Efficient Graph Algorithms in Neo4j

DZone's Guide to

Efficient Graph Algorithms in Neo4j

Neo4j has just publicly released its graph algorithms library! Read on to find out what it includes, what it means to you, and where you can get it!

· AI Zone ·
Free Resource

I am very happy to announce the first public release of the Neo4j graph algorithms library.

You can use these graph algorithms on your connected data to gain new insights more easily within Neo4j. You can use these graph analytics to improve results from your graph data, for example, by focusing on particular communities or favoring popular entities.

We developed this library as part of our effort to make it easier to use Neo4j for a wider variety of applications. Many users expressed interest in running graph algorithms directly on Neo4j without having to employ a secondary system. We also tuned these algorithms to be as efficient as possible in regards to resource utilization as well as streamlined for later management and debugging.

These algorithms represent user-defined procedures which you can call as part of Cypher statements running on top of Neo4j.

Kudos

Big thanks go to Martin Knobloch and Paul Horn from our good friends at Avantgarde Labs in Dresden who did all the heavy lifting. From reading tons of papers and tuning and parallelizing implementations to providing performance testing and implementers documentation — most of the work you see in this library is theirs.

Tomaz Bratanic also helped immensely with documenting the library, providing explanations and examples on small graphs and detailing syntax information for all graph algorithms.

Our documentation is still a work in progress, so please bear with us! However, please let us know if the existing sections are helpful or you have ideas on how to improve the documentation.

The Graph Algorithms

The graph algorithms covered by the library are:

Centrality

  • PageRank
  • Betweenness centrality
  • Closeness centrality

Partitioning

  • Label propagation
  • (Weakly) connected components
  • Strongly connected components
  • Union-find

Path-finding

  • Minimum weight spanning tree
  • All pairs — and single source — shortest path
  • Multi-source, breadth-first search

Most of the graph algorithms are available in two variants: one that writes the results (rank or partition) back to the graph and the other suffixed with .stream, which will stream the results back for further sorting, filtering, or aggregation.

To select which part of the graph to run the graph algorithm on, you can provide a label and relationship type as first parameters. Then, add configuration options depending on the algorithm. For instance, iterations and damping-factor for PageRank.

Example: Pagerank on DBPedia

Here, we run PageRank on DBPedia (11M Page-nodes, 125M Link-relationships):

CALL algo.pageRank('Page', 'Link', {write:true,iterations:20});


+--------------------------------------------------------------------------------------+
| nodes    | iter | loadMillis | computeMillis | writeMillis | damping | writeProperty |
+--------------------------------------------------------------------------------------+
| 11474730 | 20   | 34106      | 9712          | 1810        | 0.85    | "pagerank"    |
+--------------------------------------------------------------------------------------+
1 row
47888 ms


CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score
WITH * ORDER BY score DESC LIMIT 5
RETURN node.title, score;


+--------------------------------------+
| node.title                 | score   |
+--------------------------------------+
| "United States"            | 13349.2 |
| "Animal"                   | 6077.77 |
| "France"                   | 5025.61 |
| "List of sovereign states" | 4913.92 |
| "Germany"                  | 4662.32 |
+--------------------------------------+
5 rows
46247 ms

Graph Projection

One really cool feature is the ability to load a projection of a (sub-)graph of your data into the graph algorithm by passing Cypher statements to select nodes and node-pairs and choosing the “cypher” graph loader. The compiled Cypher runtime of Neo4j 3.2 (Enterprise) benefits this load strategy.

Here is an example for Union-Find:

CALL algo.unionFind('
MATCH (p:Page) WHERE p.title CONTAINS "Europe" RETURN id(p) as id
','
MATCH (p1:Page)-[:Link]->(p2:Page)
 WHERE p1.title CONTAINS "Europe" AND p2.title CONTAINS "Europe"
 RETURN id(p1) as source, id(p2) as target
',{graph:'cypher',write:true});


+-------------------------------------------------------------+
| loadMillis | computeMillis | writeMillis | nodes | setCount |
+-------------------------------------------------------------+
| 2975       | 2             | 30          | 23467 | 4322     |
+-------------------------------------------------------------+
1 row
3013 ms

Performance Testing

For several of the algorithms (PageRank, union-find, label-propagation, strongly-connected-components), I ran preliminary tests on medium and larger datasets that have also been used in other publications.

The table below contains database size as well as node and relationship counts. For each graph algorithm, you see the runtime (in seconds) of the first and second run.

Comparing them with other publications, those runtimes look quite good. But of course, the real confirmation comes from you running the algorithms on your own datasets on your own hardware.

Graph

Size (GB)

nodes (M)

rels (M)

PR1 (s)

PR2 (s)

UF1 (s)

UF2 (s)

LP1 (s)

LP2 (s)

SCC1 (s)

SCC2 (s)

Pokec

7.3

2

31

10

7

24

6

12

9

12

7

DBPedia

15

11

117

46

38

91

37

51

43

65

41

Graphs500-23

7.9

5

129

19

15

29

10

18

17

25

13

cit-patents

0.2

4

17

13

10

23

5

12

10

14

5

Twitter-2010

49

42

1468

349

131

353

128

405

405

339

189

soc-LifeJournal1

6.3

5

69

30

14

34

11

25

19

23

13

Friendster

62

66

1806

611

235

619

196

296

282

483

257

And here is the obligatory chart. Please note that this is log-scale to fit larger and smaller datasets in one chart.

Installation

We provide two releases — one for Neo4j 3.1.x and one for Neo4j 3.2.x

Installation is easy. Just download the jar-file from the release link below, copy it into your $NEO4J_HOME/plugins directory, and restart Neo4j.

Note: For Neo4j 3.2.x you will also have to add this line to your $NEO4J_HOME/conf/neo4j.conf config file: dbms.security.procedures.unrestricted=algo.*.

Feedback

We would love to hear your thoughts on the new graph algorithm library!

Please try out this library on your data and let us know how it worked. Raise GitHub issues if you run into any problems and don’t forget our #neo4j-graph-algorithm channel in the neo4j-users Slack if you have questions.

Implementation

The library is currently limited to handle two billion nodes by design, but in future versions, we will remove those limits as we more tightly integrate this work.

Our general approach is to load the projected data from Neo4j into an efficient data structure, compute the algorithm and write the results back. All three steps are parallelized as much as possible to maximize CPU and I/O utilization.

We use a composed Graph API interface to provide the algorithms access to the graph data, which is loaded into different representations by GraphFactory instances. Both the loading and writing back of results happen in parallel batches.

Feel free to have a look at the code, give us feedback, or even add your own algorithm implementation based on the existing infrastructure.

We welcome any pull request with new algorithms, bug-fixes, or other improvements.

References

Topics:
algorithms ,graph analytics ,neo4j ,ai

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}