Effective Bulk Data Import into Neo4j (Part 2)


Let's examine LOAD CSV, a handy tool for processing CSV data. Read on to see seven tips to most effectively utilize it.


Welcome back to learning how to perform a bulk data import into Neo4j. In part one, we took a look at importing a dataset from Stack Overflow as well as LOAD CSV, a tool that helps process CSV data. Today, we're taking a closer look at that tool to see what else it can do.

1. Start with a Small Sample

If you run the import the way shown above, you'll likely end up with a long run time because you started with your entire dataset. When you're first developing your query, start with a small sample, such as 100 rows, to see if it works. This speeds up the round-trip time, which allows you to make adjustments and corrections to your query without having to first wait for your entire dataset to run:
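As a sketch, you can cap a LOAD CSV query with LIMIT while you iterate on it (the file URL and column names here are placeholders, not the exact Stack Overflow export schema):

```cypher
// Load only the first 100 rows while developing the query
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
WITH row LIMIT 100
RETURN row.question_id, row.title
```

Once the query returns the shape of data you expect, remove the LIMIT and run it over the full file.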

Data import tip: Start with a sample

2. MERGE on a Key

Another important tip to keep in mind is to MERGE on a key. MERGE matches on every property you include in the pattern. So if you pass in the ID, the title, the up-vote count, and the creation date, you're asking: is there anything already in my database with all four of these values? You're checking four different elements when you actually know whether or not a question is unique based on its ID alone.

This is effectively the Stack Overflow primary key. Because each question has an ID, we don't need to compare every property to see whether or not it's the same.

Data import tip: MERGE on a key

We can do the same for the owner as well. We have a key on the owner, so the tip again is to use only the key inside your MERGE.

Don't put every single property in there. You probably want to set those properties if they aren't already set, not use them to detect duplicates. So we pull the MERGE apart and use the ON CREATE clause to set the remaining properties. If you run the script again, it will simply make sure the question is there.
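A sketch of the pattern (column and property names are assumptions, and `toInteger` is `toInt` on older 2.x releases): MERGE only on the id, then fill in everything else with ON CREATE SET so the properties are written once, on first creation.

```cypher
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
// MERGE only on the primary key...
MERGE (q:Question {id: row.question_id})
// ...and set the other properties only when the node is first created
ON CREATE SET q.title     = row.title,
              q.upVotes   = toInteger(row.up_vote_count),
              q.createdAt = row.creation_date
MERGE (u:User {id: row.owner_id})
ON CREATE SET u.name = row.owner_display_name
```

Re-running this script is then idempotent: existing questions and owners are matched by id, and nothing is duplicated.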

3. Use Constraints and Indexes

There are two ways to make a search faster:

  1. By using constraints, which ensure uniqueness.
  2. By using indexes, which provide a starting point for the search.

Data import tip: Use constraints and indexes

The constraint automatically creates an index, so creating a constraint allows you to take advantage of both constraints and indexes.
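A sketch of both, using the constraint and index syntax from the Neo4j 2.x/3.x era this series covers (the labels and properties are assumptions based on this dataset):

```cypher
// Uniqueness constraints: each also creates a backing index
CREATE CONSTRAINT ON (q:Question) ASSERT q.id IS UNIQUE;
CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;

// Plain index: no uniqueness guarantee, just a fast starting point
CREATE INDEX ON :Tag(name);
```

Create these before running the import, so every MERGE lookup is an index seek rather than a label scan.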

4. Use Only One MERGE Statement

Our next tip is to use only one MERGE statement instead of building one massive input query:

Data import tip: One MERGE per statement

Using one MERGE per statement is faster than creating one large overarching MERGE. For example, create one MERGE for the question, one for the owner, and one for the relationship between them.
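Sketched out, that splits the import into three small passes rather than one long pattern (file URL and names are illustrative):

```cypher
// Pass 1: question nodes only
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
MERGE (q:Question {id: row.question_id});

// Pass 2: owner nodes only
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
MERGE (u:User {id: row.owner_id});

// Pass 3: the relationship between them
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
MATCH (q:Question {id: row.question_id})
MATCH (u:User {id: row.owner_id})
MERGE (u)-[:POSTED]->(q);
```

Because passes 1 and 2 touch different labels, they can run independently of each other.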

At this stage, since the different node types are independent of one another, you can also run the imports in parallel. There are some nodes, such as tags, which have a lot of relationships. If you do a MERGE on these relationships, the database will check to make sure there isn't already a relationship of this type and direction between the two nodes. To do this, it iterates over the relationships between the nodes.

This can take some time, depending on how many relationships that node has. However, the database will automatically start the check from the node with fewer relationships, which is an internal MERGE optimization.


5. Use DISTINCT

Our next tip is to use DISTINCT, which equates to a "don't-do-the-same-work-over-and-over-again" tip. In this dataset, it applies to the tags:

Data import tip: Use DISTINCT

Lots of people add the same tag over and over again to our questions. For example, across our 10,000 Stack Overflow questions about Neo4j, you'll see the neo4j tag 10,000 times. To reduce the amount of work the MERGE has to do, we can use DISTINCT when we create our tags. Instead of creating or matching the neo4j tag 10,000 times, we get it once and then we create it.
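As a sketch (assuming the tags arrive as a delimited string such as "neo4j;cypher" in a column called `tags`, which is an assumption about the export format):

```cypher
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
// Split the tag string into individual tag names
UNWIND split(row.tags, ";") AS tagName
// Collapse the 10,000 repeats of each tag down to one
WITH DISTINCT tagName
MERGE (t:Tag {name: tagName})
```

The DISTINCT means MERGE runs once per unique tag, not once per row.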


6. Use PERIODIC COMMIT

Another tip that you can use with LOAD CSV is USING PERIODIC COMMIT, which in this example will take the first 1,000 questions and commit those before going on to the next 1,000. This prevents the transaction state from exceeding your RAM:

Data import tip: Use periodic commit

Something between 10,000 and 100,000 updates per transaction is a good target.
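The clause goes at the very top of the query; a sketch with a 50,000-row batch size (file URL and columns are placeholders):

```cypher
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
MERGE (q:Question {id: row.question_id})
```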

Something to keep in mind is that PERIODIC COMMIT applies only to LOAD CSV. If you want to do a big graph refactoring of something that you've already imported, you'll have to do batching.
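One way to do that batching, assuming you have the APOC procedures mentioned later installed (this is an option, not something the article prescribes), is apoc.periodic.iterate, which runs an update in fixed-size transactions:

```cypher
// Refactor already-imported data, committing every 10,000 nodes
CALL apoc.periodic.iterate(
  "MATCH (q:Question) RETURN q",
  "SET q.upVotes = toInteger(q.upVotes)",
  {batchSize: 10000}
)
```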

7. Script Import Commands

Our next tip is to script your import commands. We've been showing individual queries, which are fine when you start, but as you become more experienced you'll want to start collecting them into a script. You can run these through the neo4j-shell tool that comes with the database:
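Assuming your statements are collected in a file such as import.cql (a hypothetical name), the invocation looks roughly like this from the Neo4j installation directory:

```
# Run a file of Cypher statements through the bundled shell
./bin/neo4j-shell -file import.cql
```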

Data import tip: Script your import commands

While neither the Windows installer nor the Mac DMG comes with the shell, there is a tool written by my colleague, William Lyon, which has a web version that allows you to upload and automatically run a file.

In summary, LOAD CSV is a tool that can be used to import and inspect data. It has been included in Neo4j since version 2.1, allows you to load data from any URL, and is ideal for any medium-sized dataset, which we define as up to 10 million rows. If you go beyond that, you'll want to use the next tool we're about to introduce.

Neo4j 3.0 brought the introduction of procedures, which allow you to write your own custom code and call it from Cypher. There is also the Awesome Procedures (APOC) library, which is available on GitHub.

That's all for now! Be on the lookout for the conclusion of our presentation.


Published at DZone with permission of Mark Needham , DZone MVB. See the original article here.

