# Efficient Graph Algorithms in Neo4j

# Efficient Graph Algorithms in Neo4j

### Neo4j has just publicly released its graph algorithms library! Read on to find out what it includes, what it means to you, and where you can get it!

Join the DZone community and get the full member experience.

Join For FreeI am very happy to announce the first public release of the Neo4j graph algorithms library.

You can use these graph algorithms on your connected data to gain new insights more easily within Neo4j. You can use these graph analytics to improve results from your graph data, for example, by focusing on particular communities or favoring popular entities.

We developed this library as part of our effort to make it easier to use Neo4j for a wider variety of applications. Many users expressed interest in running graph algorithms directly on Neo4j without having to employ a secondary system. We also tuned these algorithms to be as efficient as possible in regards to resource utilization as well as streamlined for later management and debugging.

These algorithms represent user-defined procedures which you can call as part of Cypher statements running on top of Neo4j.

## Kudos

Big thanks go to Martin Knobloch and Paul Horn from our good friends at Avantgarde Labs in Dresden who did all the heavy lifting. From reading tons of papers and tuning and parallelizing implementations to providing performance testing and implementers documentation — most of the work you see in this library is theirs.

Tomaz Bratanic also helped immensely with documenting the library, providing explanations and examples on small graphs and detailing syntax information for all graph algorithms.

Our documentation is still a work in progress, so please bear with us! However, *please* let us know if the existing sections are helpful or you have ideas on how to improve the documentation.

## The Graph Algorithms

The graph algorithms covered by the library are:

**Centrality**

- PageRank
- Betweenness centrality
- Closeness centrality

**Partitioning**

- Label propagation
- (Weakly) connected components
- Strongly connected components
- Union-find

**Path-finding**

- Minimum weight spanning tree
- All pairs — and single source — shortest path
- Multi-source, breadth-first search

Most of the graph algorithms are available in two variants: one that writes the results (rank or partition) back to the graph and the other suffixed with `.stream`

, which will stream the results back for further sorting, filtering, or aggregation.

To select which part of the graph to run the graph algorithm on, you can provide a `label`

and `relationship type`

as first parameters. Then, add configuration options depending on the algorithm. For instance, iterations and damping-factor for PageRank.

### Example: Pagerank on DBPedia

Here, we run PageRank on DBPedia (11M Page-nodes, 125M Link-relationships):

```
CALL algo.pageRank('Page', 'Link', {write:true,iterations:20});
+--------------------------------------------------------------------------------------+
| nodes | iter | loadMillis | computeMillis | writeMillis | damping | writeProperty |
+--------------------------------------------------------------------------------------+
| 11474730 | 20 | 34106 | 9712 | 1810 | 0.85 | "pagerank" |
+--------------------------------------------------------------------------------------+
1 row
47888 ms
CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score
WITH * ORDER BY score DESC LIMIT 5
RETURN node.title, score;
+--------------------------------------+
| node.title | score |
+--------------------------------------+
| "United States" | 13349.2 |
| "Animal" | 6077.77 |
| "France" | 5025.61 |
| "List of sovereign states" | 4913.92 |
| "Germany" | 4662.32 |
+--------------------------------------+
5 rows
46247 ms
```

### Graph Projection

One really cool feature is the ability to load a projection of a (sub-)graph of your data into the graph algorithm by passing Cypher statements to select nodes and node-pairs and choosing the “cypher” graph loader. The compiled Cypher runtime of Neo4j 3.2 (Enterprise) benefits this load strategy.

Here is an example for Union-Find:

```
CALL algo.unionFind('
MATCH (p:Page) WHERE p.title CONTAINS "Europe" RETURN id(p) as id
','
MATCH (p1:Page)-[:Link]->(p2:Page)
WHERE p1.title CONTAINS "Europe" AND p2.title CONTAINS "Europe"
RETURN id(p1) as source, id(p2) as target
',{graph:'cypher',write:true});
+-------------------------------------------------------------+
| loadMillis | computeMillis | writeMillis | nodes | setCount |
+-------------------------------------------------------------+
| 2975 | 2 | 30 | 23467 | 4322 |
+-------------------------------------------------------------+
1 row
3013 ms
```

## Performance Testing

For several of the algorithms (PageRank, union-find, label-propagation, strongly-connected-components), I ran preliminary tests on medium and larger datasets that have also been used in other publications.

The table below contains database size as well as node and relationship counts. For each graph algorithm, you see the runtime (in seconds) of the first and second run.

Comparing them with other publications, those runtimes look quite good. But of course, the real confirmation comes from you running the algorithms on your own datasets on your own hardware.

Graph |
Size (GB) |
nodes (M) |
rels (M) |
PR1 (s) |
PR2 (s) |
UF1 (s) |
UF2 (s) |
LP1 (s) |
LP2 (s) |
SCC1 (s) |
SCC2 (s) |

Pokec |
7.3 |
2 |
31 |
10 |
7 |
24 |
6 |
12 |
9 |
12 |
7 |

DBPedia |
15 |
11 |
117 |
46 |
38 |
91 |
37 |
51 |
43 |
65 |
41 |

Graphs500-23 |
7.9 |
5 |
129 |
19 |
15 |
29 |
10 |
18 |
17 |
25 |
13 |

cit-patents |
0.2 |
4 |
17 |
13 |
10 |
23 |
5 |
12 |
10 |
14 |
5 |

Twitter-2010 |
49 |
42 |
1468 |
349 |
131 |
353 |
128 |
405 |
405 |
339 |
189 |

soc-LifeJournal1 |
6.3 |
5 |
69 |
30 |
14 |
34 |
11 |
25 |
19 |
23 |
13 |

Friendster |
62 |
66 |
1806 |
611 |
235 |
619 |
196 |
296 |
282 |
483 |
257 |

And here is the obligatory chart. Please note that this is log-scale to fit larger and smaller datasets in one chart.

## Installation

We provide two releases — one for Neo4j 3.1.x and one for Neo4j 3.2.x

Installation is easy. Just download the jar-file from the release link below, copy it into your `$NEO4J_HOME/plugins`

directory, and restart Neo4j.

**Note**: For Neo4j 3.2.x you will also have to add this line to your `$NEO4J_HOME/conf/neo4j.conf`

config file: `dbms.security.procedures.unrestricted=algo.*`

.

## Feedback

We would love to hear your thoughts on the new graph algorithm library!

Please try out this library on your data and let us know how it worked. Raise GitHub issues if you run into any problems and don’t forget our `#neo4j-graph-algorithm`

channel in the neo4j-users Slack if you have questions.

## Implementation

The library is currently limited to handle two billion nodes by design, but in future versions, we will remove those limits as we more tightly integrate this work.

Our general approach is to load the projected data from Neo4j into an efficient data structure, compute the algorithm and write the results back. All three steps are parallelized as much as possible to maximize CPU and I/O utilization.

We use a composed Graph API interface to provide the algorithms access to the graph data, which is loaded into different representations by `GraphFactory`

instances. Both the loading and writing back of results happen in parallel batches.

Feel free to have a look at the code, give us feedback, or even add your own algorithm implementation based on the existing infrastructure.

We welcome any pull request with new algorithms, bug-fixes, or other improvements.

## References

- Releases
- GitHub Repository (please star if you like it)
- Documentation
- GitHub issues
- Neo4j Slack community

Published at DZone with permission of Michael Hunger , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

{{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}