Effective Bulk Data Import into Neo4j (Part 1)

DZone 's Guide to

Effective Bulk Data Import into Neo4j (Part 1)

Bulk imports are easy to do in Neo4j. We have a look at an actual example using a dataset from Stack Overflow.

· Database Zone ·
Free Resource

Today we're going to look at how to import a dataset from Stack Overflow into Neo4j:

In part one of this three-part presentation, we're going to take a look at the basics of importing a dataset and LOAD CSV, a tool to help the process. Stack Overflow is an online community where programmers both post and answer questions on a variety of developer topics. Neo4j has become a large area of discussion on the platform — so far there have been over 10,000 questions asked and answered:

The Stack Overflow Dataset for Importing

We thought it would be interesting to pull all of this data into Neo4j to see if we could learn anything interesting from all the posted questions and answers. So today, we're going to go over the process for bulk data import into Neo4j. If you're interested in replicating what we did today, you can check out this blog post.

Importing the Data: Stack Exchange API

Stack Overflow has an API called Stack Exchange. While it has rate limits and takes a while to download all the data, the data dump is hosted on the Internet and provides gigantic XML storage. It's essentially a dump of an SQL Server — CSV pretending to be XML — and contains about 66GB of data.

The Stack Exchange API and data dump

So we have two APIs with two completely different types of data — there is one which is updated constantly and is like an API, and then a data dump that happens once a month. To bring the two together in Neo4j, we need what we'll call "magic import dust:"

Watch Mark Needham and Michael Hunger's presentation on efficient bulk data import into Neo4j

We will use the API below in two different ways to get all of the information from Stack Overflow into Neo4j, which has a number of properties that describe different elements.

The question ID is in HTML, while the title, text, answer, and interpretation of nodes is in the JSON API:

Data properties from the Stack Exchange API


We are going to go over two different ways to process this JSON data, the first of which is via a tool called LOAD CSV, which allows us to iterate over CSV files and create things in the graph. But we have to convert from the JSON to CSV:

Data import from JSON to CSV and then LOAD CSV

Remember, our goal is to create a graph that includes users, answers, questions and tags, which are used to organize questions and get the right person to answer them.

It's very important to determine your data model before importing your data. You have to know what you're trying to do with the data and the questions you're trying to answer so that you can use your data in a relevant way. You'll also likely have to manipulate the data to fit your data model prior to import.

For example, we decided not to include the body of the questions or the comments because they weren't relevant to our data model, which is depicted below:

The data model for the bulk data import from Stack Overflow

We used the tool jq to convert our data from JSON to CSV, which is comparable to XPath for XML. It's implemented in C and C++ and is very fast and has a rich data language.

Let's explore the following example, which we will ultimately upload into our data model:

Using jq for CSV data conversion

We've downloaded an array of questions into a file, and then we'll iterate through them and have them sit under an array called "items." We'll then go through these items and grab the question ID, title, etc., each of which will be a column in our CSV file. If a question has multiple tags, we separate those with ";." We then run all the data through a filter. This returns the following standard CSV format:

The jq tool for working with CSV data

You can see that we have a question ID, up vote, down vote, etc. and one row for each JSON document that we had in our original API call.

It depends a bit on your jq query how many rows you get, but in our case, it's one row per question.

Then, we do the same for the answers. We jq the JSON document into CSV, which is now in a format that works for LOAD CSV (which again, we use to import the data into Neo4j).

The jq tool for data conversion of CSV

Below is the LOAD CSV tool, which is a command in Cypher that allows us to iterate over a CSV file. We can provide either a URI or a file part, so if you find a CSV on the web you can upload it directly without even downloading it to your machine.

Data import using LOAD CSV

Depending on whether or not headers are provided, LOAD CSV will return each line of the CSV as either a list of values or as a map that allows you to map headers to fields. You can create, update, move and match data with information that comes in from the CSV.

You can drive computation, filter, group, and aggregate the CSV on certain properties to sort data qualities, and run the CSV through Cypher. This allows you to see the data the same way Cypher does — with lone quotes and binary zeros.

After we apply Cypher operations to our map, there are three things we're going to hook up: CREATE, MERGE, and MATCH. CREATE allows us to create things such as nodes and relationships, which — without any constraints — will cause our database to perform this action every time.

If we don't want the database to create connections because they're already there, we'll use MERGE. But, if there are some instances in which connections aren't already there, it will create them.

You'll also want to use MERGE if you want to write item-potent LOAD CSV scripts. Sometimes, we'll have some data already in the database which we will also want to add to the graph with a LOAD CSV query. In those cases, we might want to add a MATCH as well.

As a general rule, CREATE is faster than MERGE because it doesn't have to check the data — it just pumps in the data. This is especially useful in an initial import, but keep in mind that you'll have to check for duplicates outside of Neo4j.

Because LOAD CSV returns a list of rows of maps, you can add filters that — for example — aggregate or sort out duplicates. This is done by doing a DISTINCT on input values, reducing the number of rows and killing the duplicates, especially if you have data resulting from JOINs in a relational database.

Now you have your CSV files from Stack Overflow, so you're ready to create a graph. You might start out with something like this:

LOAD CSV -- the naive version

With a LOAD CSV from questions of CSV, you'll create a question, add then add an ID, title, up vote count, creation date, etc. You'll also create an owner, which is connected to the relevant question, and iterate through the tags (which is essentially going through the semicolon-separated array we mentioned earlier).

That's all for now! Stay tuned for part two of this presentation, which covers six LOAD CSV tips and tricks to make your import process smoother.

csv, data, database, graph database, import, model, neo4j, nosql

Published at DZone with permission of Mark Needham , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}