In part one of this three-part presentation, we're going to take a look at the basics of importing a dataset and LOAD CSV, a tool to help the process. Stack Overflow is an online community where programmers both post and answer questions on a variety of developer topics. Neo4j has become a large area of discussion on the platform — so far there have been over 10,000 questions asked and answered:
We thought it would be interesting to pull all of this data into Neo4j to see if we could learn anything interesting from all the posted questions and answers. So today, we're going to go over the process for bulk data import into Neo4j. If you're interested in replicating what we did today, you can check out this blog post.
Importing the Data: Stack Exchange API
Stack Overflow's data is available through the Stack Exchange API. While the API has rate limits and downloading everything through it takes a while, there is also a periodic data dump hosted online as a set of gigantic XML files. It's essentially a dump of a SQL Server database (CSV pretending to be XML) and contains about 66GB of data.
So we have two sources with two completely different types of data: the API, which is updated constantly, and the data dump, which is published periodically. To bring the two together in Neo4j, we need what we'll call "magic import dust:"
We will use the API in two different ways to get all of the information from Stack Overflow into Neo4j. Each question the API returns has a number of properties that describe its different elements. The question ID appears in the HTML page, while the title, text, answers and other fields come back in the JSON API response:
We are going to go over two different ways to process this JSON data, the first of which is via a tool called LOAD CSV, which allows us to iterate over CSV files and create things in the graph. But first we have to convert the JSON to CSV:
Remember, our goal is to create a graph that includes users, answers, questions and tags, which are used to organize questions and get the right person to answer them.
It's very important to determine your data model before importing your data. You have to know what you're trying to do with the data and the questions you're trying to answer so that you can use your data in a relevant way. You'll also likely have to manipulate the data to fit your data model prior to import.
For example, we decided not to include the body of the questions or the comments because they weren't relevant to our data model, which is depicted below:
Let's explore the following example, which we will ultimately upload into our data model:
We've downloaded an array of questions into a file, which sit under an array called "items." We'll iterate through these items and grab the question ID, title, etc., each of which will become a column in our CSV file. If a question has multiple tags, we separate those with a semicolon. We then run all the data through a filter. This returns the following standard CSV format:
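The talk does this conversion with a jq filter; the same transformation can be sketched in Python. The sample data and field names below are assumptions shaped like a Stack Exchange API response, not the real payload:

```python
import csv
import io
import json

# Hypothetical sample shaped like a Stack Exchange API response:
# real responses hold the questions in an "items" array.
raw = json.loads("""
{"items": [
  {"question_id": 1, "title": "How do I import CSV?",
   "up_vote_count": 3, "down_vote_count": 0,
   "tags": ["neo4j", "cypher"]},
  {"question_id": 2, "title": "What is LOAD CSV?",
   "up_vote_count": 5, "down_vote_count": 1,
   "tags": ["neo4j"]}
]}
""")

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["question_id", "title", "up_vote_count",
                 "down_vote_count", "tags"])
for item in raw["items"]:
    # Each JSON document becomes one CSV row;
    # multiple tags collapse into one semicolon-separated field.
    writer.writerow([item["question_id"], item["title"],
                     item["up_vote_count"], item["down_vote_count"],
                     ";".join(item["tags"])])

print(out.getvalue())
```

The semicolon-joined tags column is what the import query later splits apart again to create one Tag node per tag.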
You can see that we have a question ID, up vote, down vote, etc. and one row for each JSON document that we had in our original API call.
How many rows you get depends a bit on your jq query, but in our case, it's one row per question.
Then, we do the same for the answers. We run jq over the JSON document to produce CSV, which is now in a format that works for LOAD CSV (which, again, we use to import the data into Neo4j).
Below is the LOAD CSV tool, a command in Cypher that allows us to iterate over a CSV file. We can provide either a URL or a file path, so if you find a CSV on the web you can load it directly without even downloading it to your machine.
Depending on whether or not headers are provided, LOAD CSV will return each line of the CSV as either a list of values or as a map that lets you look up fields by header name. You can create, update and match data with the information that comes in from the CSV.
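The two modes can be sketched as below; the file name and column names are assumptions for illustration:

```cypher
// With a header row: each line arrives as a map keyed by header name.
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
RETURN row.question_id, row.title
LIMIT 5;

// Without headers: each line arrives as a list, accessed by position.
LOAD CSV FROM "file:///questions.csv" AS line
RETURN line[0], line[1]
LIMIT 5;
```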
You can drive computation with the CSV, filter, group and aggregate it on certain properties, and check data quality by running the CSV through Cypher before creating anything. This lets you see the data the same way Cypher does, including any stray quotes and binary zeros.
After we apply Cypher operations to our map, there are three keywords we can use to hook the data up:
CREATE allows us to create things such as nodes and relationships. Without any constraints, the database will perform this creation every time the statement runs, even if the data is already there.
If we don't want the database to create connections that are already there, we'll use MERGE, which matches existing data and only creates the pieces that are missing.
You'll also want to use MERGE if you want to write idempotent LOAD CSV scripts. Sometimes we'll already have data in the database that we want to connect to the data coming in from a LOAD CSV query. In those cases, we might want to add a MATCH as well.
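The three keywords can be combined like this; labels, relationship types and column names here are assumptions for illustration, not the talk's exact script:

```cypher
// Idempotent upsert: MERGE finds an existing Question by id or
// creates it; ON CREATE sets properties only the first time through.
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
MERGE (q:Question {id: toInteger(row.question_id)})
  ON CREATE SET q.title = row.title;

// Connecting to data already in the graph: MATCH the existing user,
// then MERGE only the new answer and its relationship.
LOAD CSV WITH HEADERS FROM "file:///answers.csv" AS row
MATCH (u:User {name: row.owner})
MERGE (u)-[:PROVIDED]->(:Answer {id: toInteger(row.answer_id)});
```

Because MERGE never creates what it can already match, re-running either statement leaves the graph unchanged.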
As a general rule, CREATE is faster than MERGE because it doesn't have to check for existing data; it just pumps the data in. This is especially useful in an initial import, but keep in mind that you'll have to check for duplicates outside of Neo4j.
Because LOAD CSV returns a list of rows of maps, you can add filters that, for example, aggregate values or weed out duplicates. This is done with a DISTINCT on the input values, which reduces the number of rows and kills the duplicates. This helps especially if your data comes from JOINs in a relational database.
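A minimal sketch of that deduplication step, assuming an answers.csv where a relational JOIN repeats the same owner on many rows:

```cypher
// Collapse repeated owners to one row each before creating users,
// so CREATE doesn't produce one User node per answer row.
LOAD CSV WITH HEADERS FROM "file:///answers.csv" AS row
WITH DISTINCT row.owner AS owner
CREATE (:User {name: owner});
```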
Now you have your CSV files from Stack Overflow, so you're ready to create a graph. You might start out with something like this:
With LOAD CSV over the questions CSV, you'll create a question, then add its ID, title, up-vote count, creation date, etc. You'll also create an owner, connect it to the relevant question, and iterate through the tags (which essentially means going through the semicolon-separated list we mentioned earlier).
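Put together, the question import might look like the sketch below; the labels, relationship types and column names are assumptions based on the data model described above, not the talk's exact query:

```cypher
LOAD CSV WITH HEADERS FROM "file:///questions.csv" AS row
// One Question node per CSV row, with its scalar properties.
CREATE (q:Question {id: toInteger(row.question_id),
                    title: row.title,
                    upVotes: toInteger(row.up_vote_count),
                    createdAt: row.creation_date})
// MERGE the owner so repeat askers become a single User node.
MERGE (owner:User {name: row.owner})
MERGE (owner)-[:ASKED]->(q)
// Split the semicolon-separated tags field and connect each tag.
FOREACH (tagName IN split(row.tags, ";") |
  MERGE (t:Tag {name: tagName})
  MERGE (q)-[:TAGGED]->(t));
```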
That's all for now! Stay tuned for part two of this presentation, which covers six LOAD CSV tips and tricks to make your import process smoother.