Neo4j: Loading Data - REST API vs Batch Import


A couple of weeks ago, when I first started playing around with my football data set, I was loading all the data into Neo4j using the REST API via neography, which took around 4 minutes.

The data set consisted of just over 250 matches, which translated into 8,000 nodes and 30,000 relationships, so it’s very small by any measure.

Ashok and I were discussing how that could be made quicker, and the first thing we tried was storing inserted nodes in an in-memory hash map and looking them up from there rather than doing an index lookup each time.
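The caching idea can be sketched in a few lines of Ruby. This is a minimal stand-in, not the actual code: the `index_lookup` block here is a hypothetical placeholder for whatever slow lookup (e.g. a neography index query followed by node creation) you'd otherwise do on every match:

```ruby
# Sketch of the in-memory cache: check a hash first and only fall back to
# the (slow) index lookup when we haven't seen the key before.
# `index_lookup` is a placeholder for a real index query / node creation.
class CachingNodeLookup
  attr_reader :hits, :misses

  def initialize(&index_lookup)
    @cache = {}
    @index_lookup = index_lookup
    @hits = 0
    @misses = 0
  end

  def find_or_create(key)
    if @cache.key?(key)
      @hits += 1
      @cache[key]
    else
      @misses += 1
      @cache[key] = @index_lookup.call(key)
    end
  end
end
```

Since most players and teams appear in many matches, hits vastly outnumber misses once a few matches have been loaded, which is where the ~30% saving comes from.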

These were the timings for different numbers of matches when I did that:

| Matches | Cache hits | Cache misses | Lucene     | In memory  |
|---------|------------|--------------|------------|------------|
| 25      | 501        | 325          | 26.692s    | 22.877s    |
| 50      | 1275       | 373          | 50.491s    | 38.304s    |
| 263     | 8016       | 480          | 4m 11.031s | 2m 49.951s |

For the full data set it was about 30% faster which was a nice improvement but still left me waiting around for a bit longer than I wanted to!

I’ve previously used the batch inserter, and I was planning to use it again to get a significant improvement in loading time until Ashok pointed out Michael Hunger’s batch-import tool, which seemed worth a try.

I had to add an extra step to the pipeline to put all the nodes and relationships into CSV files and then pass those files to the batch-import JAR.
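That extra step looks roughly like this in Ruby. The data, file names, and column layout below are illustrative: batch-import takes tab-separated files with a header row of property names for nodes, and a start/end/type header for relationships where `start` and `end` refer to node row positions (check the batch-import README for the exact format and invocation):

```ruby
require "csv"

# Illustrative data: two team nodes and one relationship between them.
nodes = [
  { name: "Manchester United", type: "team" },
  { name: "Arsenal",           type: "team" },
]

rels = [
  { start: 1, end: 2, type: "beat", score: "8-2" },
]

# Write tab-separated node and relationship files for batch-import.
CSV.open("nodes.csv", "w", col_sep: "\t") do |csv|
  csv << [:name, :type]
  nodes.each { |n| csv << [n[:name], n[:type]] }
end

CSV.open("rels.csv", "w", col_sep: "\t") do |csv|
  csv << [:start, :end, :type, :score]
  rels.each { |r| csv << [r[:start], r[:end], r[:type], r[:score]] }
end
```

The generated files then get passed to the batch-import JAR along with the target database directory; the README documents the exact command line.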

There was a massive improvement in the load time using it. These were the timings:

| Matches | Lucene     | In memory  | Batch Import             |
|---------|------------|------------|--------------------------|
| 25      | 26.692s    | 22.877s    | 0.378s + 0.921s = 1.299s |
| 50      | 50.491s    | 38.304s    | 0.392s + 1.025s = 1.417s |
| 263     | 4m 11.031s | 2m 49.951s | 0.524s + 1.239s = 1.763s |

(The two numbers represent the time taken to generate the CSV files and the time to import them.)

From my brief skimming of the code, it seems to read in the files and then route them through the batch importer API, so I imagine you’d get similar results by calling that API directly.

I know this is not a very fair comparison given that you probably shouldn’t be using the REST API to insert data but since I’ve done it a couple of times I thought it’d be interesting to measure anyway!

Published at DZone with permission of
