Quickly Create a 100k Neo4j Graph Data Model with Cypher Only

We want to run some test queries on an existing graph model but have no sample data at hand, and also no input files (CSV, GraphML) that would provide it.

Why not quickly create it on our own using just Cypher? At first I thought about using Cypher to generate CSV files and loading them back, but creating the data directly is much easier.

The domain is simple, (:User)-[:OWN]->(:Product), but good enough for collaborative filtering or demographic analysis.

Nodes: Users and Products

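Let’s start with users; we create 100k of them in one go.

We create an array of names and iterate over a range of 100k with the FOREACH clause, using the counter as the id and picking a name from the array. Since range(0,100000) includes both endpoints, this actually creates 100,001 users.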

WITH ["Andres","Wes","Rik","Mark","Peter","Kenny","Michael","Stefan","Max","Chris"] AS names
FOREACH (r IN range(0,100000) | CREATE (:User {id:r, name:names[r % size(names)]+" "+r}));

This finishes quickly and reports how many nodes, labels and properties were created.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100001
Properties set: 200002
Labels added: 100001
5788 ms
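
The counters above report 100001 nodes rather than 100000 because range(0,100000) includes both endpoints. A quick sanity check, sketched here against the :User nodes just created (the aliases users, minId and maxId are only illustrative names), shows the count and the id bounds:

```cypher
// count the generated users and show the lowest and highest id
match (u:User)
return count(u) as users, min(u.id) as minId, max(u.id) as maxId;
```

Assuming the database held no other :User nodes beforehand, this should report 100001 users with ids running from 0 to 100000.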

Same for products. As names I just used a few of my shiny geek things.

with ["Mac","iPhone","Das Keyboard","Kymera Wand","HyperJuice Battery","Peachy Printer","HexaAirBot","AR-Drone","Sonic Screwdriver","Zentable","PowerUp"] as names
foreach (r in range(0,50) | create (:Product {id:r, name:names[r % size(names)]+" "+r}));

Please note that I only created about 50 products (51, to be exact, because range(0,50) is inclusive). I initially started with 3000, but then the cross product between users and products, from which we sample the relationships, grows really large (300M pairs) and takes a long time to process. So I decided to stick with a cross product of about 5M, which is good enough for our purposes.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 51
Properties set: 102
Labels added: 51
46 ms
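
To see how large the cross product actually is before sampling from it, the two node counts can be multiplied. This is a minimal sketch; the aliases users, products and pairs are illustrative names, not part of the model:

```cypher
// size of the user x product cross product, computed from the node counts
match (u:User)
with count(u) as users
match (p:Product)
with users, count(p) as products
return users, products, users * products as pairs;
```

With the node counts reported above (100001 users and 51 products) this works out to 5,100,051 pairs, slightly more than the round 5M figure.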

Relationships: OWN

The general idea is to create the cross product between users and products and sample a percentage of it to create the relationships. For sampling we use rand(); for the cross product, a MATCH on two independent, unconnected node patterns.

My first approach didn’t really work: the WHERE clause belongs to the MATCH, so it is pulled into the pattern matching and causes it to sample only users, not user-product pairs.
So for each user that was selected, all OWN relationships were created. Not what I wanted :)

// don't do this
match (u:User),(p:Product)
where rand() < 0.1
with u,p
limit 50000
merge (u)-[:OWN]->(p);

So we have to detach the WHERE clause from the MATCH with a WITH clause that passes on the user-product pairs. We still limit the cross-product results to 5M as a safeguard in case we have miscalculated the cross product.
The predicate rand() < 0.1 samples 10% of those pairs, which in our case is about 500k combinations. From those we can then create the relationships with CREATE, which is faster and doesn’t check for duplicates.

match (u:User),(p:Product)
with u,p
limit 5000000
where rand() < 0.1
create (u)-[:OWN]->(p);

+-------------------+
| No data returned. |
+-------------------+
Relationships created: 509898
11684 ms
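
One way to check that the sample is spread over user-product pairs, rather than concentrated on a few users, is to look at how many products the most-connected users own. A sketch, with owned as an illustrative alias:

```cypher
// users with the most OWN relationships; if whole users had been sampled
// instead of pairs, the top users would own every single product
match (u:User)-[:OWN]->(:Product)
return u.name, count(*) as owned
order by owned desc
limit 5;
```

With 10% sampling the counts should stay well below the total number of products; a user owning all of them would point to the problem described for the first approach.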

We could also use MERGE, which makes sure that at most one relationship between two nodes exists.
If we use MERGE, we should limit the number of relationships created in one execution to keep the execution time from building up.
If we introduce this limit, we also have to move the window of node pairs under consideration in proportion to the share of relationships we create per run.
A limit of 100k is 1/5 of the total of 500k relationships, so we have to advance the window by 20% of 5M, i.e. 1M, on each run.

match (u:User),(p:Product)
with u,p
// increase skip value from 0 to 4M in 1M steps
skip 1000000
limit 5000000
where rand() < 0.1
with u,p
limit 100000
merge (u)-[:OWN]->(p);

Which results in:

+-------------------+
| No data returned. |
+-------------------+
Relationships created: 100000
51428 ms
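
Since CREATE does not check for existing relationships while MERGE does, it can be useful to verify that no user-product pair ended up with more than one OWN relationship. A minimal sketch (rels and duplicatePairs are illustrative aliases):

```cypher
// count user-product pairs connected by more than one OWN relationship
match (u:User)-[r:OWN]->(p:Product)
with u, p, count(r) as rels
where rels > 1
return count(*) as duplicatePairs;
```

A result of 0 means every pair is connected at most once. A non-zero value would be expected if the CREATE and MERGE variants were both run over overlapping windows of the same data.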

If you have more memory for your Neo4j server than my 4G heap, you can also merge larger segments of relationships in a single transaction (200k or more).

We also create indexes on the id property of :User and :Product.

create index on :User(id);
create index on :Product(id);
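
With the indexes in place, lookups by id no longer need to scan all nodes of a label. A simple lookup to try it out is sketched below. In Neo4j versions that support it, prefixing the statement with PROFILE shows the execution plan; whether that is available depends on your setup and is an assumption here.

```cypher
// look up a single user by the indexed id property
match (u:User {id:1})
return u.name;
```

Given how the names were generated (names[r % size(names)] plus the counter), this should return "Wes 1".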

Now we can run some of the test-queries we wanted to check:

Find similar users that own the same stuff that I do.

match (u:User {id:1})-[:OWN]->()<-[:OWN]-(other)
return other.name,count(*)
order by count(*) desc
limit 5;

+--------------------------+
| other.name    | count(*) |
+--------------------------+
| "Peter 23404" | 6        |
| "Peter 26754" | 5        |
| "Mark 35223"  | 5        |
| "Peter 19614" | 5        |
| "Chris 23959" | 5        |
+--------------------------+
5 rows
145 ms

Collaborative filtering – product suggestions

match (u:User {id:3})-[:OWN]->()<-[:OWN]-(other)-[:OWN]->(p)
return p.name,count(*)
order by count(*) desc
limit 5;

+------------------------------------+
| p.name                  | count(*) |
+------------------------------------+
| "HyperJuice Battery 37" | 2894     |
| "Zentable 9"            | 2872     |
| "Kymera Wand 3"         | 2865     |
| "Zentable 31"           | 2863     |
| "Das Keyboard 35"       | 2847     |
+------------------------------------+
5 rows
410 ms
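
The suggestion query above also counts products the user already owns, as long as they are reached via a different shared product. A possible variant, offered as one option rather than the author's query, filters those out with a pattern predicate (score is an illustrative alias):

```cypher
// suggest products owned by similar users, excluding what user 3 already owns
match (u:User {id:3})-[:OWN]->()<-[:OWN]-(other)-[:OWN]->(p)
where not (u)-[:OWN]->(p)
return p.name, count(*) as score
order by score desc
limit 5;
```

On randomly generated data like this the ranking carries little meaning, but the same shape of query applies to real ownership data.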

Published at DZone with permission of Michael Hunger, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
