Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Neo4j Is Faster Than MySQL in Performing Recursive Queries

DZone's Guide to

Neo4j Is Faster Than MySQL in Performing Recursive Queries

The performance of Neo4j is better than the performance of MySQL when it comes to recursive queries. But don't just take my word for it. Let's see it in action.

· Performance Zone ·
Free Resource

Sensu is an open source monitoring event pipeline. Try it today.

Before the reveal, let’s perform an autopsy on the query and see what is going on. The model is a simple (c:Customer)-[f:FRIEND_OF]->(c2:Customer). There are a total of 100k Customer nodes and each has 100 relationships, for a total of 10M relationships. The query is supposed to start at a Customer Node, and traverse out four levels, returning the distinct CustomerIds of every node along the way.

The initial query looked like this:

MATCH (c:Customer{CustomerID:99})-[:FRIEND_OF*4]->(cc:Customer)
RETURN DISTINCT cc.CustomerID;

That’s a pretty obvious starting point, but Neo4j is performing more work than it needs to. That MATCH statement is holding on to whole paths four levels out. It doesn’t realize that we just want a single property distinctly from all those nodes.

Let me give you another example. Find all messages created by my friends, or friends of friends, or friends of friends of friends, etc. four levels out, return their author and the content, and order them by date. We could write the query like this:

MATCH (:Person {id:{personId}})-[:KNOWS*1..4]->(friend:Person)<-[:HAS_CREATOR]-(message)
WHERE message.creationDate < {date}
RETURN DISTINCT message, friend
ORDER BY message.creationDate DESC, message.id ASC
LIMIT {top}

But what’s the problem? Well, it’s doing way too much work. Not only is it finding and keeping track of all the KNOWS paths four levels out, it is going one further to their messages. That’s a lot of paths, and it could be traversing the HAS_CREATOR relationship over and over again if friends of friends are friends with our friends. Say that three times fast. So, what’s a better way to write that query?

MATCH (:Person {id:{personId}})-[:KNOWS*1..4]->(friend:Person)
WITH COLLECT (DISTINCT friend) AS friends
UNWIND friends AS friend
MATCH (friend:Person)<-[:HAS_CREATOR]-(message)
WHERE message.creationDate < {date}
WITH DISTINCT message
ORDER BY message.creationDate DESC, message.id ASC
LIMIT {top}
MATCH (friend:Person)<-[:HAS_CREATOR]-(message)
RETURN message, friend

Right. So now we just get the distinct friends and then traverse to the messages of that friend just once. Andres Taylor wanted the query planner to be able to make that optimization on its own. So he and Johan Teleman are addressing this in Neo4j 3.2 with the PruningVarExpander. This rewriter will rewrite the query plans of these types of queries:

match (a)-[*1..3]->(b) return distinct b
match (from)-[*1..3]->(to) return count(distinct to)
match (a)-[*1..3]->(b:X) return distinct b
match (a)-[:R*1..3]->(b) with count(*) as count return distinct count
match (a)-[:R*1..3]->(b)-[:T*1..3]->(c) return distinct c
match (a)-[:R*1..3]->(b)-[:T]->(c) return distinct c
match (a) optional match (a)-[:R*1..3]->(b)-[:T]->(c) return distinct c

The common theme is that these queries do not care about the paths along the way; they just care about distinct nodes, properties, or counts. However, the next query is one we can’t rewrite with this rewriter: match (a)-[*1..3]->(b) return b, count(*) .

Because it is asking for the count of all paths leading to a node b, the paths matter. Since we don’t have 3.2 out just yet (there are alphas available, but I tend to wait until the first milestone comes out before I get my hands dirty) we can rewrite that query like so:

match (c:Customer{CustomerID:99})-[:FRIEND_OF]->(c1)-[:FRIEND_OF]->(c2)
with distinct c2
match (c2)-[:FRIEND_OF]->(c3)
with distinct c3
match (c3)-[:FRIEND_OF]->(cc:Customer)
with distinct cc
return cc.CustomerID;

According to the poster, this updated query took Neo4j from 240 seconds to 40 seconds, which is much better than before, but they also rewrote their SQL query to make use of the same idea and got that down to 24 seconds.

Now, I’ve seen a very similar query before… and I wrote up a blog post about it. Basically there were two ways to speed this up. One by modifying how Neo4j saved temporary unique nodes as it traversed. The other doing what I called the BitSet Dance. Turns out there is a simpler way of doing it, that yields the same performance. I am going to use RoaringBitMaps instead of OpenBitSet but it’s the same idea.

Instead of having an array of bitmaps, I just have 3. One that keeps track globally of all the nodes I’ve seen and two that alternate between newly found nodes, and nodes we are traversing from.

RoaringBitmap seen = new RoaringBitmap();
RoaringBitmap nextA = new RoaringBitmap();
RoaringBitmap nextB = new RoaringBitmap();

We will use RoaringBitmap's checkedAdd function to see if the node ID had already been seen, and if it wasn’t we’ll add it to one of our next Iterators.

while (relationshipIterator.hasNext()) {
    c = ops.relationshipCursor(relationshipIterator.next());
    c.next();
    int nodeId = (int)c.get().endNode();
    if(seen.checkedAdd(nodeId)) {
        nextA.add(nodeId);
    }
}

We do the same thing for every level, now iterating on nextA to populate nextB.

iterator = nextA.iterator();
while (iterator.hasNext()) {
    relationshipIterator = ops.nodeGetRelationships((long)iterator.next(), Direction.OUTGOING);
    while (relationshipIterator.hasNext()) {
        c = ops.relationshipCursor(relationshipIterator.next());
        c.next();
        int nodeId = (int)c.get().endNode();
        if(seen.checkedAdd(nodeId)) {
            nextB.add(nodeId);
        }
    }
}

Next, iterating on nextB to populate nextA, etc. Then, at the end, we want to convert all these node IDs into the property that we are interested in. In our case it’s the CustomerID property so we return a stream of them:

return StreamSupport.stream(seen.spliterator(),false).
        map(nodeId -> {
            try {
                return new StringResult((String) ops.nodeGetProperty(nodeId, propertyCustomerIDkey));
            } catch (EntityNotFoundException e) {
                e.printStackTrace();
                return null;
            }
        });

Let’s try it. We’ll add our stored procedure to a Neo4j 3.1.1 instance and populate it with some data. First, we’ll create 100k users with a query you’ve seen before:

WITH ['Jennifer','Michelle','Tanya','Julie','Christie',
      'Sophie','Amanda','Khloe','Sarah','Kaylee'] AS names 
        FOREACH (r IN range(0,100000) | CREATE (:Customer {CustomerID:names[r % size(names)]+r}))

Next, we will add the relationships. Run this query 10 times. Each run will create 1 million relationships randomly.

MATCH (u1:Customer),(u2:Customer)
WITH u1,u2
LIMIT 10000000
WHERE rand() < 0.1
CREATE (u1)-[:FRIEND_OF]->(u2);

Now, we have our 100k nodes and 10M relationships. So, what kind of performance do we get on this 4-year-old laptop when we try it?

MATCH (c:Customer {CustomerID:'Jennifer0'}) 
CALL com.maxdemarzi.distinct_network(c) YIELD value 
RETURN value

screen-shot-2017-02-06-at-7-13-46-pm

2.7 seconds. Now that’s what I’m talking about! Eat it, MySQL. But don’t take my word for it. Try it yourself. The code is on GitHub as always. Now we could go even faster if we multi-threaded the traversal. I’ve already shown you how to do that, so I’ll leave it as an exercise for the reader.

Also, remember that this is a very tiny graph — 1 million nodes and 10 million relationships, heading four levels deep. Try this with 100 million nodes, or going 25 levels deep. MySQL, Postgres, Oracle Exadata, all of them will bust and Neo4j will keep on going. Listen to this presentation to hear more.

Sensu: workflow automation for monitoring. Learn more—download the whitepaper.

Topics:
tutorial ,mysql ,performance ,neo4j ,recursive queries

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}