Importing Forests into Neo4j
Sometimes you don’t see the forest for the trees. But if you do, you probably use a graph database.
Trees are one of the simplest graph data structures, a special case of directed acyclic graphs (DAGs).
For our example we use a time-tree that we want to import into the database.
Data Volume
A quick Soulver script (thanks Mark) later, we know how many nodes and relationships (nodes - 1) we will have to import to represent a full year down to the second level.
1 year = 12 months = 365 days = 8,760 hours = 525,600 minutes = 31,536,000 seconds
So we have to import about 32M nodes and 32M relationships. Sounds good enough.
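Adding up all tree levels (one century, the years 2000 to 2099, plus the months, days, hours, minutes and seconds of the single year we expand below) gives roughly 1 + 100 + 12 + 365 + 8,760 + 525,600 + 31,536,000 ≈ 32M nodes, and a tree always has exactly one relationship fewer than it has nodes.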
These numbers also allow us to configure the memory mapping so that we can match it to the expected store-file sizes and get a smooth import.
Again, let’s ask Soulver:
32M nodes x 14 bytes in MB = 448 MB
32M rels x 38 bytes in MB = 1,216 MB
For the properties, I take 500MB for good measure. So our neo4j.properties would look like this:
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=1200M
neostore.propertystore.db.mapped_memory=500M
We run the import with a 4GB heap on Unix/Mac and a 6.5GB heap on Windows (remember that on Windows the memory mapping lives inside the heap), either against a running server by just calling bin/neo4j-shell, or offline against a test database:
bin/neo4j-shell -path time.db -config conf/neo4j.properties
Root, Stem and first Branches
My friend and colleague Rik Van Bruggen started this experiment and created the following Cypher queries to import the data.
We import the data using the Neo4j-Shell, but you can also do it one statement at a time with the Neo4j-Browser.
The first part of the tree (8k nodes) can be imported quickly in one go; it takes about 5 seconds to parse and execute.
BEGIN

//Let's create the top of the tree, the Century
CREATE (century:Century {id:21});

//Let's create 100 children of the Century, Years
MATCH (century:Century {id:21})
WITH range(2000,2099) as YEARS, century
FOREACH (year in YEARS |
  CREATE (:Year {id:year})-[:PART_OF]->(century) );

//for every Year, let's connect 12 children at a 3rd level, the Months
MATCH (year:Year {id:2014})
WITH range(1,12) as MONTHS, year
FOREACH (month in MONTHS |
  CREATE (:Month {id:month})-[:PART_OF]->(year) );

//for every Month, connect 30 Days, and add one day for the longer months
MATCH (month:Month)
WITH range(1,30) as DAYS, month
FOREACH (day in DAYS |
  CREATE (:Day {id:day})-[:PART_OF]->(month))
WITH month
WHERE month.id in [1,3,5,7,8,10,12]
CREATE (:Day {id:31})-[:PART_OF]->(month);

// remove 2 days from February
MATCH (day:Day)-[r:PART_OF]->(month:Month)
WHERE month.id=2 and day.id in [29,30]
DELETE r,day;

//for every Day, connect 24 different Hours (0-23)
MATCH (day:Day)
WITH range(0,23) as HOURS, day
FOREACH (hour in HOURS |
  CREATE (:Hour {id:hour})-[:PART_OF]->(day) );

COMMIT
First levels of the graph
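To sanity-check what we just created, a quick verification query (not part of the original import) groups the nodes by label:

MATCH (n)
RETURN head(labels(n)) AS level, count(*) AS nodes
ORDER BY nodes DESC;

At this point it should list one Century, 100 Years, 12 Months, 365 Days and 8,760 Hours.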
For the 500k minute nodes our heap (4GB) is also large enough, so let’s try it:
//for every Hour, connect 60 minutes
MATCH (hour:Hour)
FOREACH (minute in range(0,59) |
  CREATE (:Minute {id:minute})-[:PART_OF]->(hour));
Nodes created: 525600
Relationships created: 525600
Properties set: 525600
Labels added: 525600
14044 ms
The import takes 14 seconds. As we have 60 times as many seconds as minutes, we can estimate that importing the seconds will take about 60 x 14s = 840s, i.e. roughly 14 minutes.
Raking in the Foliage
Next up is the real challenge of this exercise. Importing 32M nodes and relationships into the transactional graph takes a bit, so you might grab a coffee or watch two episodes of our family’s favorite show, Tom and the strawberry jam bread with honey.
We’re using PERIODIC COMMIT here to internally commit every 50k updates, so that we don’t overflow our heap with huge transactions. We discuss its fate a bit further down.
//for every Minute, connect 60 seconds
USING PERIODIC COMMIT 50000
WITH range(0,59) as SECONDS
MATCH (minute:Minute)
FOREACH (second in SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));
Nodes created: 31536000
Relationships created: 31536000
Properties set: 31536000
Labels added: 31536000
742248 ms
In reality it was a bit faster than the estimate, taking only 740s, about 12 minutes. That gives us 12 minutes per 32M nodes and 32M rels, which is about 45k node/relationship pairs, or roughly 90k elements, per second. Not so bad for transactional inserts with Cypher.
Chunking Inserts
After Milestone 1 this gets a bit more tedious, because PERIODIC COMMIT is now tied to LOAD CSV. As we can’t use it on its own anymore, we have to create individual statements, each of which should insert about 50k pairs. For 32M pairs to insert that means 640 statements. If we have enough heap, we can increase the size per transaction to about 500k, which leaves us with 64 statements to execute.
We could do that by inserting 6 days at a time. You have to run this and increment the day range by 6 for every run.
//for every Minute, connect 60 seconds
WITH range(0,59) as SECONDS, range(1,6) as DAYS
MATCH (:Month {id:1})<-[:PART_OF]-(d:Day)<-[:PART_OF]-(:Hour)<-[:PART_OF]-(minute:Minute)
WHERE d.id IN DAYS
FOREACH (second in SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));
Nodes created: 518400
Relationships created: 518400
Properties set: 518400
Labels added: 518400
16623 ms
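For the next run you simply shift the window by six days; the statement stays the same, only the range changes (shown here for days 7 to 12 of January), and you keep going chunk by chunk, month by month, until the whole year is covered:

//for every Minute, connect 60 seconds (second chunk: days 7-12)
WITH range(0,59) as SECONDS, range(7,12) as DAYS
MATCH (:Month {id:1})<-[:PART_OF]-(d:Day)<-[:PART_OF]-(:Hour)<-[:PART_OF]-(minute:Minute)
WHERE d.id IN DAYS
FOREACH (second in SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));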
Cheating, Handle with Care
Or we can cheat and use PERIODIC COMMIT with LOAD CSV without actually loading any real data, by pointing it at a CSV file that contains only a single row. That file can be something hosted in a GitHub Gist or a single-row file on your filesystem.
Important: Please only use this if you know what you’re doing. PERIODIC COMMIT ignores query structure and just commits internally after N updated elements, be they nodes, relationships, properties or labels. So the commit will very probably happen somewhere in the middle of your statement. If something fails in between, you will have little chance to recover from that state unless you use additional flags (labels or properties) to track your progress.
//for every Minute, connect 60 seconds
USING PERIODIC COMMIT 50000
LOAD CSV FROM "http://gist.github.com/jexp/xxx/one-row.csv" AS dummy
WITH range(0,59) as SECONDS
MATCH (minute:Minute)
FOREACH (second IN SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));
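The dummy file itself only needs to contain a single line. As a sketch, a local variant could point at a one-row file on disk instead of the Gist (the path is hypothetical, and the exact file: URL form depends on your Neo4j version and configuration):

// one-row.csv contains just a single line, e.g. the character 1
USING PERIODIC COMMIT 50000
LOAD CSV FROM "file:///path/to/one-row.csv" AS dummy
WITH range(0,59) as SECONDS
MATCH (minute:Minute)
FOREACH (second IN SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));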
Queries
OK, now that we’ve imported this giant of a tree, what can we do with it?
Here are a few graph queries that navigate along the tree.
Drill down to hours:
MATCH (y:Year {id:2014})<-[:PART_OF*3]-(hr)
RETURN y,count(*);

+-------------------------------+
| y                  | count(*) |
+-------------------------------+
| Node[128]{id:2014} | 8760     |
+-------------------------------+
1 row
63 ms
How many minutes do the odd months have?
MATCH (y:Year {id:2014})<-[:PART_OF]-(m)<-[:PART_OF]-(d)<-[:PART_OF]-(hr)<-[:PART_OF]-(min)
WHERE m.id % 2 = 1
RETURN y.id,m.id,count(*)
ORDER BY y.id, m.id;

+------------------------+
| y.id | m.id | count(*) |
+------------------------+
| 2014 | 1    | 44640    |
| 2014 | 3    | 44640    |
| 2014 | 5    | 44640    |
| 2014 | 7    | 44640    |
| 2014 | 9    | 43200    |
| 2014 | 11   | 43200    |
+------------------------+
6 rows
1169 ms
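The numbers check out: a 31-day month has 31 x 24 x 60 = 44,640 minutes and a 30-day month has 43,200, while February, having an even id, is excluded by the WHERE clause.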
Find the path to a certain day in the tree:
MATCH (y:Year {id:2014})<-[:PART_OF]-(m {id:12})<-[:PART_OF]-(d {id:24})
RETURN y.id,m.id,d.id;

+--------------------+
| y.id | m.id | d.id |
+--------------------+
| 2014 | 12   | 24   |
+--------------------+
1 row
3 ms
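As another sketch (assuming the second level has been fully imported), you can anchor on a specific day and count every second below it, which should come back as 24 x 60 x 60 = 86,400:

MATCH (:Year {id:2014})<-[:PART_OF]-(:Month {id:12})<-[:PART_OF]-(d:Day {id:24})
MATCH (d)<-[:PART_OF*3]-(s:Second)
RETURN d.id, count(s);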
Grow a branch at a time: On-demand creation
This blog post showed how to generate a huge tree in the graph. In many cases pre-generating is not needed though, as you can create the tree on the fly while you are inserting data that’s connected to the leaves of the tree.
The MERGE clause helps here a lot. Besides uniquely creating individual nodes and relationships between existing nodes it can also create unique subgraphs.
That means only the starting point of your subgraph is taken into consideration for the lookup; the remainder of the path is created on demand.
For the whole time-tree it would look like this:
MERGE (y:Year {id:2014})
MERGE (y)<-[:PART_OF]-(m:Month {id:1})
MERGE (m)<-[:PART_OF]-(d:Day {id:13})
MERGE (d)<-[:PART_OF]-(h:Hour {id:18})
MERGE (h)<-[:PART_OF]-(min:Minute {id:35})
MERGE (min)<-[:PART_OF]-(s:Second {id:59})
RETURN y.id,m.id,d.id,h.id,min.id,s.id;

+--------------------------------------------+
| y.id | m.id | d.id | h.id | min.id | s.id  |
+--------------------------------------------+
| 2014 | 1    | 13   | 18   | 35     | 59    |
+--------------------------------------------+
1 row
Nodes created: 6
Relationships created: 5
Properties set: 6
Labels added: 6
7 ms

// executed a second time

+--------------------------------------------+
| y.id | m.id | d.id | h.id | min.id | s.id  |
+--------------------------------------------+
| 2014 | 1    | 13   | 18   | 35     | 59    |
+--------------------------------------------+
1 row
4 ms
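The same pattern lets you grow the branch and attach your actual data in one go; here is a sketch with a hypothetical Event node and HAPPENED_AT relationship hung off the second it occurred at:

// Event, name and HAPPENED_AT are example names, not part of the time-tree itself
MERGE (y:Year {id:2014})
MERGE (y)<-[:PART_OF]-(m:Month {id:1})
MERGE (m)<-[:PART_OF]-(d:Day {id:13})
MERGE (d)<-[:PART_OF]-(h:Hour {id:18})
MERGE (h)<-[:PART_OF]-(min:Minute {id:35})
MERGE (min)<-[:PART_OF]-(s:Second {id:59})
CREATE (e:Event {name:"signup"})-[:HAPPENED_AT]->(s)
RETURN e;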
This allows us to handle much more sparsely filled trees, as they only contain the paths from the root to the leaves that are actually needed.
So we can handle even more levels, down to milli-, micro- or even nanoseconds.