Welcome back to learning how to perform a bulk data import into Neo4j. In part one, we took a look at importing a dataset from Stack Overflow as well as
LOAD CSV, a tool that helps process JSON data. Today, we're taking a closer look at that tool to see what else it can do.
1. Start with a Small Sample
If you run the data the above way, you'll likely end up with a long run time because you started with your entire database. But when you're first running your data, start with a small set — such as 100 data points — to see if your query works. This speeds up the round trip times, which allows you to make adjustments and corrections to your query without having to first wait for your entire dataset to run:
2. MERGE on a Key
Another important tip to keep in mind is to
MERGE on a key.
MERGE will explore all the properties you've placed on your data. So you can ask the question — Is there anything else already in my database that, in this case, has the same ID? Is there any question which has the ID and the title and the up vote count and the creation date? So you're checking four different elements when actually you know whether or not a question is unique based on its ID.
This is effectively the Stack Overflow primary key. Because each question has an ID, we don't need to compare every property to see whether or not it's the same.
We can do the same for the owner as well. We have a key on the owner, so the next key
MERGE tip is to use the key inside your
Don't put every single thing in there because what you probably want to do is set these properties if they aren't already set. You don't actually really want to use them to go and remove duplicates so we're just pulling this apart, and we're using the
ON CREATE clause to help. If you run the script again, it would make sure the question is there.
3. Use Constraints and Indexes
There are two ways to make a search faster:
- By using constraints, which ensure uniqueness.
- By using indexes, which provide a starting point for the search.
The constraint automatically creates an index, so creating a constraint allows you to take advantage of both constraints and indexes.
4. Use Only One MERGE Statement
Our next tip is to use only one
MERGE statement instead of building one massive input query:
MERGE per statement is faster than creating one large overarching
MERGE. For example, create one
MERGE for the question, one for the owner and one to combine the two.
At this stage, since all node flavors are independent of one another, you can also run imports in parallel. There are some nodes, such as a tags, which have a lot of relationships. If you do a
MERGE of these relationships, the database will check to make sure there isn't already a relationship of this type and direction between the two nodes. To do this, it iterates over the relationships between the nodes.
This can take some time, depending on how many relationships that node has. However, the tool will automatically check the node with fewer relationships. Which is called
5. Use DISTINCT
Our next tip is to use
DISTINCT, which equates to a "don't-do-the-same-work-over-and-over-again" tip. In this dataset it applies to the tags:
Lots of people add the same tag over and over again to our questions. For example, of our 10,000 Neo4j entries, you'll see 10,000 entries with the Neo4j tag. To reduce the amount of work the
MERGE has to do, we can use
DISTINCT when we create our tags. Instead of creating or going over Neo4j 10,000 times, we just get it once and then we create it.
6. Use PERIODIC COMMIT
Another tip that you can use in
LOAD CSV is a
PERIODIC COMMIT, which in this example will take the first 1,000 questions and commit those before going on to the next 1,000. This prevents you from exceeding your RAM:
Something between 10,000 and 100,000 updates per transaction are a good target.
Something to keep in mind is that
PERIODIC COMMIT applies only to
LOAD CSV. If you want to do a big graph refactoring of something that you've already imported, you'll have to do batching.
7. Script Import Commands
Our next tip is to script your import commands. We've been showing individual queries, which are fine when you start, but as you become more experienced you'll want to start collecting them into a script. You can run these through the neo4j-shell tool that comes with the database:
While neither Windows Excel nor DMG comes with the shell, there is a tool written by my colleague, William Lyon, which has a web version that allows you to upload and automatically run a file.
LOAD CSV is a tool that can be used to import and inspect data. It has been included in Neo4j since version 2.1, allows you to load data from any URL, and is ideal for any medium-sized dataset, which we define as up to 10 million rows. If you go beyond that, you'll want to use the next tool we're about to introduce.
That's all for now! Be on the lookout for the finale of our presentation. The conclusion will delve into