Over a million developers have joined DZone.

Neo4j and Cypher: Set Based Operations

DZone's Guide to

Neo4j and Cypher: Set Based Operations

· Java Zone ·
Free Resource

Build vs Buy a Data Quality Solution: Which is Best for You? Gain insights on a hybrid approach. Download white paper now!

I was recently reminded of a Neo4j cypher query that I wrote a couple of years ago to find the colleagues that I hadn’t worked with in the ThoughtWorks London office.

The model looked like this:

2014 02 18 17 04 01

And I created the following fake data set of the aforementioned model:

public class SetBasedOperations
    private static final Label PERSON = DynamicLabel.label( "Person" );
    private static final Label OFFICE = DynamicLabel.label( "Office" );
    private static final DynamicRelationshipType COLLEAGUES = DynamicRelationshipType.withName( "COLLEAGUES" );
    private static final DynamicRelationshipType MEMBER_OF = DynamicRelationshipType.withName( "MEMBER_OF" );
    public static void main( String[] args ) throws IOException
        Random random = new Random();
        String path = "/tmp/set-based-operations";
        FileUtils.deleteRecursively( new File( path ) );
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
        Transaction tx = db.beginTx();
            Node me = db.createNode( PERSON );
            me.setProperty( "name", "me" );
            Node londonOffice = db.createNode( OFFICE );
            londonOffice.setProperty( "name", "London Office" );
            me.createRelationshipTo( londonOffice, MEMBER_OF );
            for ( int i = 0; i < 1000; i++ )
                Node colleague = db.createNode( PERSON );
                colleague.setProperty( "name", "person" + i );
                colleague.createRelationshipTo( londonOffice, MEMBER_OF );
                if(random.nextInt( 10 ) >= 8) {
                    me.createRelationshipTo( colleague, COLLEAGUES );
        CommunityNeoServer server = CommunityServerBuilder
                .usingDatabaseDir( path )
                .onPort( 9001 )

I’ve created a node representing me and 1,000 people who work in the London office. Out of those 1,000 people I made it so that ~150 of them have worked with me.

If I want to write a cypher query to find the exact number of people who haven’t worked with me I might start with the following:

MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)

We start by finding me, then find the London office which I was a member of, and then find the other people who are members of that office. On the second line we remove people that I’ve previously worked with and then return a count of the people who are left.

When I ran this through my Cypher query tuning tool the average time to evaluate this query was 7.46 seconds.

That is obviously a bit too slow if we want to run the query on a web page and as far as I can tell the reason for that is that for each potential colleague we are searching through my ‘COLLEAGUES’ relationships and checking whether they exist. We’re doing that 1,000 times which is a bit inefficient.

I chatted to David about this, and he suggested that a more efficient query would be to work out all my colleagues up front once and then do the filtering from that set of people instead.

The re-worked query looks like this:

MATCH (p:Person {name: "me"})-[:COLLEAGUES]->(colleague)
WITH p, COLLECT(colleague) as marksColleagues
MATCH (colleague)-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(p)
WHERE NOT (colleague IN marksColleagues)
RETURN COUNT(colleague)

When we run that through the query tuner the average time reduces to 150 milliseconds which is much better.

This type of query seems to be more about set operations than graph ones because we’re looking for what isn’t there rather than what is. When that’s the case getting the set of things that we want to compare against up front is more profitable.

Build vs Buy a Data Quality Solution: Which is Best for You? Maintaining high quality data is essential for operational efficiency, meaningful analytics and good long-term customer relationships. But, when dealing with multiple sources of data, data quality becomes complex, so you need to know when you should build a custom data quality tools effort over canned solutions. Download our whitepaper for more insights into a hybrid approach.


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}