Over a million developers have joined DZone.

Neo4j and Cypher: Set Based Operations

· Java Zone

Discover how powerful static code analysis and ergonomic design make development not only productive but also an enjoyable experience, brought to you in partnership with JetBrains

I was recently reminded of a Neo4j cypher query that I wrote a couple of years ago to find the colleagues that I hadn’t worked with in the ThoughtWorks London office.

The model looked like this:

2014 02 18 17 04 01

And I created the following fake data set of the aforementioned model:

public class SetBasedOperations
{
    private static final Label PERSON = DynamicLabel.label( "Person" );
    private static final Label OFFICE = DynamicLabel.label( "Office" );
 
    private static final DynamicRelationshipType COLLEAGUES = DynamicRelationshipType.withName( "COLLEAGUES" );
    private static final DynamicRelationshipType MEMBER_OF = DynamicRelationshipType.withName( "MEMBER_OF" );
 
    public static void main( String[] args ) throws IOException
    {
        Random random = new Random();
        String path = "/tmp/set-based-operations";
        FileUtils.deleteRecursively( new File( path ) );
 
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
 
        Transaction tx = db.beginTx();
        try
        {
            Node me = db.createNode( PERSON );
            me.setProperty( "name", "me" );
 
            Node londonOffice = db.createNode( OFFICE );
            londonOffice.setProperty( "name", "London Office" );
 
            me.createRelationshipTo( londonOffice, MEMBER_OF );
 
            for ( int i = 0; i < 1000; i++ )
            {
                Node colleague = db.createNode( PERSON );
                colleague.setProperty( "name", "person" + i );
 
                colleague.createRelationshipTo( londonOffice, MEMBER_OF );
 
                if(random.nextInt( 10 ) >= 8) {
                    me.createRelationshipTo( colleague, COLLEAGUES );
                }
 
                tx.success();
            }
        }
        finally
        {
            tx.finish();
        }
 
        db.shutdown();
 
        CommunityNeoServer server = CommunityServerBuilder
                .server()
                .usingDatabaseDir( path )
                .onPort( 9001 )
                .persistent()
                .build();
 
        server.start();
 
    }
}

I’ve created a node representing me and 1,000 people who work in the London office. Out of those 1,000 people I made it so that ~150 of them have worked with me.

If I want to write a cypher query to find the exact number of people who haven’t worked with me I might start with the following:

MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)

We start by finding me, then find the London office which I was a member of, and then find the other people who are members of that office. On the second line we remove people that I’ve previously worked with and then return a count of the people who are left.

When I ran this through my Cypher query tuning tool the average time to evaluate this query was 7.46 seconds.

That is obviously a bit too slow if we want to run the query on a web page and as far as I can tell the reason for that is that for each potential colleague we are searching through my ‘COLLEAGUES’ relationships and checking whether they exist. We’re doing that 1,000 times which is a bit inefficient.

I chatted to David about this, and he suggested that a more efficient query would be to work out all my colleagues up front once and then do the filtering from that set of people instead.

The re-worked query looks like this:

MATCH (p:Person {name: "me"})-[:COLLEAGUES]->(colleague)
WITH p, COLLECT(colleague) as marksColleagues
MATCH (colleague)-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(p)
WHERE NOT (colleague IN marksColleagues)
RETURN COUNT(colleague)

When we run that through the query tuner the average time reduces to 150 milliseconds which is much better.

This type of query seems to be more about set operations than graph ones because we’re looking for what isn’t there rather than what is. When that’s the case getting the set of things that we want to compare against up front is more profitable.



Learn more about Kotlin, a new programming language designed to solve problems that software developers face every day brought to you in partnership with JetBrains.

Topics:

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}