Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Tips for Fast Batch Updates of Graph Structures With Neo4j and Cypher

DZone's Guide to

Tips for Fast Batch Updates of Graph Structures With Neo4j and Cypher

When you’re writing a lot of data to the graph from your application or library, you want to be efficient. You want small queries that are constant and use parameters.

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

These approaches are not very efficient:

  • Hard coding values instead of using parameters.
  • Sending a single query/tx per individual update.
  • Sending many single queries within a single tx with individual updates.
  • Generating large, complex statements (hundreds of lines) and sending one of them per tx and update.
  • Sending in huge (millions) of updates in a single tx, which will cause out-of-memory issues.

Better Approach

You want small-enough queries that are constant in their shape (for caching) and are using parameters.

Each query can update from a single property to a whole subgraph (100 nodes) but has to be the same in the overall structure for caching.

UNWIND to the Rescue

To achieve that, you just prefix your regular single-update-query with an UNWIND that turns a batch of data (up to 10k to 50k entries) into individual rows that contain the information for each of the (more or less complex) updates.

You send in a {batch} parameter (up to 10k to 50k) of data (hopefully a delta) as a list of maps, which are then applied in a compact query, which is also properly compiled and cached, as it has a fixed structure.

{batch: [{row1},{row2},{row3},...10k]} 

UNWIND {batch} as row

// now perform updates with the data in each "row" map

Here are some examples.

Create Node With Properties

{batch: [{name:"Alice",age:32},{name:"Bob",age:42}]}

UNWIND {batch} as row
CREATE (n:Label)
SET n += row

Merge Node With Properties

{batch: [{id:"alice@example.com",properties:{name:"Alice",age:32}},{id:"bob@example.com",properties:{name:"Bob",age:42}}]}

UNWIND {batch} as row
MERGE (n:Label {row.id})
(ON CREATE) SET n += row.properties

Node Lookup and MERGE/CREATE Relationship Between With Properties

{batch: [{from:"alice@example.com",to:"bob@example.com",properties:{since:2012}},{from:"alice@example.com",to:"charlie@example.com",properties:{since:2016}}]}

UNWIND {batch} as row
MATCH (from:Label {row.from})
MATCH (to:Label {row.to})
CREATE/MERGE (from)-[rel:KNOWS]->(to)
(ON CREATE) SET rel += row.properties

Look Up by ID or List of IDs

This is good for parent-child trees. Here, we’re passing a single property created. Alternatively, you could pass in no properties or a map of properties to be set/updated.

{batch: [{from:123,to:[44,12,128],created:"2016-01-13"}, {from:34,to:[23,35,2983],created:"2016-01-15"},...]

UNWIND {batch} as row
MATCH (from) WHERE id(from) = row.from
MATCH (to) WHERE id(from) IN row.to // list of ids
CREATE/MERGE (from)-[rel:FOO]->(to)
SET rel.created = row.created

Faster, Better, Further: All the Tricks

There are some more tricks. You can also send in a map where the keys are node or relationship IDs (converted to strings) that are more compact so that the ID can look them up more quickly.

Update of Existing Nodes by ID

{ batch : [{"1":334,"2":222,3:3840, ... 100k}]}

WITH {batch} as data, [k in keys({batch}) | toInt(k)] as ids
MATCH (n) WHERE id(n) IN ids

// single property value
SET n.count = data[toString(id(n))]

// or override all properties
SET n = data[toString(id(n))]

// or add all properties
SET n += data[toString(id(n))]

Update of Existing Relationships by Id

{ batch : [{"1":334,"2":222,3:3840, ... 100k}]}

WITH {batch} as data, [k in keys({batch}) | toInt(k)] as ids
MATCH ()-[rel]->() WHERE id(rel) IN ids
SET rel.foo = data[toString(id(rel))] // single property
SET rel= data[toString(id(rel))] // all properties

Conditional Data Creation

Sometimes, you want to create data dynamically based on inputs (for example, a node with a certain label). As Cypher currently has no conditional WHEN or IF clause, and case when is just an expression, you have to use a trick that I came up with many years ago.

Fortunately, there is FOREACH, which is meant to iterate over a list of items and execute update operations for each of them. A list of 0 or 1 elements can serve as a conditional of false and true, i.e., no iteration or one iteration.

General idea:

...
FOREACH (_ IN CASE WHEN predicate THEN [true] ELSE [] END |
... update operations ....
)

Note that the true value in that list could be anything, 42, "", null etc. as long as it is any single value so that we have a non-empty list.

You can achieve something similar with RANGE(1, CASE WHEN predicate THEN 1 ELSE 0 END), which will yield an empty list when the predicate is false. Or, if you fancy filte , then you can usefilter(_ IN [1] WHERE predicate).

Here is a concrete example:

LOAD CSV FROM {url} AS row
MATCH (o:Organization {name:row.org})
FOREACH (_ IN case when row.type = 'Person' then [1] else [] end|
   MERGE (p:Person {name:row.name})
   CREATE (p)-[:WORKS_FOR]->(o)
)
FOREACH (_ IN case when row.type = 'Agency' then [1] else [] end|
   MERGE (a:Agency {name:row.name})
   CREATE (a)-[:WORKS_FOR]->(o)
)

Note that identifiers created within FOREACH are not accessible from the outside; you have to re-match the value later on, or you have to move all your update operations into the FOREACH.

Utilizing APOC Procedures

The APOC procedure library comes with a lot of useful procedures that can help you here. I want to highlight three of them:

  • Create nodes and relationships with dynamic labels and properties.
  • Batched transactions and iterations of updates.
  • Functions for creating and manipulating maps to be set as properties.

1. Creating Nodes and Relationships Dynamically

With apoc.create.node and apoc.create.relationship, you can have dynamically computed node labels and relationship types, as well as any map of properties.

  • Labels is a string array.
  • Properties is just a map.
UWNIND {batch} as row
CALL apoc.create.node(row.labels, row.properties) yield node
RETURN count(*)

There are also procedures in apoc.create.* for setting, updating, or removing properties and labels with dynamic string keys.

UWNIND {batch} as row
MATCH (from) WHERE id(n) = row.from
MATCH (to:Label) where to.key = row.to
CALL apoc.create.relationship(from, row.type, row.properties, to) yield rel
RETURN count(*)

2. Batched Transactions

As mentioned at the beginning, huge transactions are a problem. You can update a million records with around 2G-4G of heap, but it gets difficult with larger volumes.

My biggest volume per single transaction was about 10M nodes and relationships with a 32G heap.

That’s where apoc.periodic.iterate comes in.

The idea is simple. You have two Cypher statements. The first statement provides the data to operate on and can produce a huge (many millions) stream of data (nodes, rels, scalar values).

The second statement does the actual update work. It is called for each item, but a new transaction is created only for each batch of items.

(There is a new variant of this which will go into the next version of APOC that actually does an UNWIND  variant of the second statement, so it executes only one inner statement per tx.)

So, for example, your first statement returns five million nodes to update with a computed value.
The inner statement is executed once for each of those 5 million nodes. If your batch size is 10,k then that happens in batches of 10k statements per transaction.

If your updates are independent of each other (think the creation of nodes or updates of properties, or updates of independent subgraphs), then you can run this procedure with a parallel:trueoption, which will use all your CPUs.

For example, if you want to compute a score of many rated items and update this property in a batched fashion, this is what you would do:

call apoc.periodic.iterate('
MATCH (n:User)-[r1:LIKES]->(thing)<-[r2:RATED]-(m:User) WHERE id(n)<id(m) RETURN thing, avg( r1.rating + r2.rating ) as score
','
WITH {thing} as t SET t.score = {score}
', {batchSize:10000, parallel:true})

3. Creating and Updating Maps Dynamically

While lists can be created and processed quite easily in Cypher with rangecollect, unwindreduce, extract, filter,size , etc., maps have more limited means — especially for creation and modification.

The apoc.map.* package comes with a number of functions that make your life easier.

Creating Maps From Other Data

RETURN apoc.map.fromPairs([["alice",38],["bob",42],...​])
// {alice:38, bob: 42, ...}

RETURN apoc.map.fromLists(["alice","bob",...],[38,42])
// {alice:38, bob: 42, ...}

// groups nodes, relationships, maps by key, good for quick lookups by that key
RETURN apoc.map.groupBy([{name:"alice",gender:"female"},{name:"bob",gender:"male"}],"gender")
// {female:{name:"alice",gender:"female"}, male:{name:"bob",gender:"male"}}

RETURN apoc.map.groupByMulti([{name:"alice",gender:"female"},{name:"bob",gender:"male"},{name:"Jane",gender:"female"}],"gender")
// {female:[{name:"alice",gender:"female"},{name:"jane",gender:"female"}], male:[{name:"bob",gender:"male"}]}

Updating Maps

RETURN apoc.map.merge({alice: 38},{bob:42})
// {alice:38, bob: 42}

RETURN apoc.map.setKey({alice:38},"bob",42)
// {alice:38, bob: 42}

RETURN apoc.map.removeKey({alice:38, bob: 42},"alice")
// {bob: 42}

RETURN apoc.map.removeKey({alice:38, bob: 42},["alice","bob","charlie"])
// {}

// remove the given keys and values, good for data from load-csv/json/jdbc/xml
RETURN apoc.map.clean({name: "Alice", ssn:2324434, age:"n/a", location:""},["ssn"],["n/a",""])
// {name:"Alice"}

Conclusion

I used these approaches successfully for high-volume update operations, and also in the implementation of object graph mappers for bulk updates. Of course, you can combine these variants for more complex operations. If you try them out and are successful, please let me know.

If you have any other tricks that helped you to achieve more write throughput with Cypher, please let me know, too, and I’ll update this post.

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

Topics:
cypher ,big data ,neo4j ,graphs ,batch updates

Published at DZone with permission of Michael Hunger, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}