Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Counting Nodes With Multiple Labels

DZone's Guide to

Counting Nodes With Multiple Labels

Label jars, not people! In this article, learn how to count sets of data with two labels without it taking way too much time.

· Database Zone ·
Free Resource

Discover Tarantool's unique features which include powerful stored procedures, SQL support, smart cache, and the speed of 1 million ACID transactions on a single CPU core!

We have over 6,000 users in our #neo4j-users slack channel and get all kinds of requests. About a month ago Thomas Shields asked:

Should counting the set of things with 2 labels really take so long? I've got 48M nodes with LabelA and LabelB and the query MATCH (n:LabelA:LabelB) RETURN COUNT(n) is taking 80-90 seconds.

Let's see what's going on by creating a small version of his graph. We will create 1M nodes of LabelA, then 1M nodes with both LabelA and LabelB, and then 1M nodes with just Label B:

WITH ["Jennifer","Michelle","Tanya","Julie","Christie","Sophie","Amanda","Khloe","Sarah","Kaylee"] AS names 
FOREACH (r IN range(0,1000000) | CREATE (:LabelA {username:names[r % size(names)]+r}))

WITH ["Jennifer","Michelle","Tanya","Julie","Christie","Sophie","Amanda","Khloe","Sarah","Kaylee"] AS names 
FOREACH (r IN range(0,1000000) | CREATE (:LabelA:LabelB {username:names[r % size(names)]+r}))

WITH ["Jennifer","Michelle","Tanya","Julie","Christie","Sophie","Amanda","Khloe","Sarah","Kaylee"] AS names 
FOREACH (r IN range(0,1000000) | CREATE (:LabelB {username:names[r % size(names)]+r}))

When we try to profile the query, we see this execution plan:

This is taking about 240ms with our simple dataset. So we are going to try to speed it up by building a Cypher stored procedure. We will make it dynamic so that we can take two or more labels. We'll use my favorite library, Roaring Bitmap, and create an array of them. Then for each Label, we will add the nodeId to the corresponding bitmap, and at the end, we will aggregate them all and get the cardinality to get the count. The whole procedure is below:

    @Procedure(name = "com.maxdemarzi.multiple_label_count", mode = Mode.DEFAULT)
    @Description("CALL com.maxdemarzi.multiple_label_count([labels]")
    public Stream<LongResult> MultipleLabelCount(@Name("labels") List<String> labels) {
        RoaringBitmap[] bitmaps = new RoaringBitmap[labels.size()];

        int count = 0;
        for (String label : labels) {
            int finalCount = count;
            RoaringBitmap bitmap = new RoaringBitmap();
            bitmaps[count] = bitmap;
            db.findNodes(Label.label(label)).stream().forEach(node -> bitmaps[finalCount].add(((int) node.getId())));
            count++;
        }

        return Stream.of(new LongResult(FastAggregation.and(bitmaps).getLongCardinality()));
    }

We will compile this procedure, add the jar to the plugins folder of Neo4j, and allow it by adding the following line to the neo4j.conf file:

dbms.security.procedures.unrestricted=com.maxdemarzi.*

Now we restart the database and call the procedure:

CALL com.maxdemarzi.multiple_label_count(['LabelA','LabelB']) YIELD value RETURN value

We get the same result and our time is down to about 90ms from 240ms. Thomas built his own version for just two labels and was able to get his performance from 90 seconds down to 10 seconds, for almost an order of magnitude improvement.

The source code, as always, is on GitHub. Feel free to grab it if you need something like it.

If you run into trouble with Neo4j, please join our user-slack channel and give us a chance to help you. We have dedicated Neo4j employees answering questions as well as our awesome community members helping one another. 

Discover Tarantool's unique features such as powerful stored procedures, SQL support, smart cache, and the speed of 1 million ACID transactions on a single CPU.

Topics:
database ,cypher ,tutorial ,nodes ,query optimization

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}