DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
View Events Video Library
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Integrating PostgreSQL Databases with ANF: Join this workshop to learn how to create a PostgreSQL server using Instaclustr’s managed service

Mobile Database Essentials: Assess data needs, storage requirements, and more when leveraging databases for cloud and edge applications.

Monitoring and Observability for LLMs: Datadog and Google Cloud discuss how to achieve optimal AI model performance.

Automated Testing: The latest on architecture, TDD, and the benefits of AI and low-code tools.

Related

  • Learn From Netflix, Stack Overflow, Slack, and More
  • How To Verify Database Connection From a Spring Boot Application
  • Automate Migration Assessment With XML Linter
  • Time Machine: A Look-Back at Java Sessions From NODES 2022

Trending

  • TDD With FastAPI Is Easy
  • Log Analysis Using grep
  • Automated Testing Lifecycle
  • Docker and Kubernetes Transforming Modern Deployment
  1. DZone
  2. Data Engineering
  3. Databases
  4. Import 10M Stack Overflow Questions into Neo4j in Just 3 Minutes

Import 10M Stack Overflow Questions into Neo4j in Just 3 Minutes

I want to demonstrate how you can take the Stack Overflow dump and quickly import it into Neo4j for querying and building an app on top of the dataset.

Michael Hunger user avatar by
Michael Hunger
·
Sep. 15, 15 · Tutorial
Like (3)
Save
Tweet
Share
6.85K Views

Join the DZone community and get the full member experience.

Join For Free

i want to demonstrate how you can take the stack overflow dump and quickly import it into neo4j. after that, you’re ready to start querying the graph for more insights and then possibly build an application on top of that dataset. if you want to follow along, we have a running (readonly) neo4j server with the data available here .

but first things first: congratulations to stack overflow for being so awesome and helpful. they’ve just recently announced that over ten million programming questions (and counting) have been answered on their site. (they’re also doing a giveaway around the #soreadytohelp hashtag. more on that below.)

without the stack overflow platform, many questions around neo4j could’ve never been asked nor answered. we’re still happy that we started to move away from google groups for our public user support.

the neo4j community on stack overflow has grown a lot, as have the volume of questions there.

stack overflow has answered 10 million questions

(and it is a graph)

importing the stack overflow data into neo4j

importing the millions of stack overflow questions, users, answers and comments into neo4j has been a long-time goal of mine. one of the distractions that kept me from doing it was answering many of the 8,200 neo4j questions out there .

two weeks ago, damien at linkurious pinged me in our public slack channel . he asked about neo4j’s import performance for ingesting the full stack exchange data dump into neo4j.

after a quick discussion, i pointed him to neo4j’s csv import tool , which is perfect for the task as the dump consists of only relational tables wrapped in xml.

stack-overflow-graph


so damien wrote a small python script to extract the csv from xml and with the necessary headers the neo4j-import tool did the grunt work of creating a graph out of huge tables. you can find the script and instructions on github here .

importing the smaller stack exchange community data only takes a few seconds. amazingly, the full stack overflow dump with users, questions and answers takes 80 minutes to convert back to csv and then only 3 minutes to import into neo4j on a regular laptop with an ssd.

here is how we did it:

download stack exchange dump files

first, we downloaded the dump files from the internet archive for the stack overflow community (total 11 gb) into a directory:

    • 7.3g stackoverflow.com-posts.7z
    • 576k stackoverflow.com-tags.7z
    • 154m stackoverflow.com-users.7z

the other data could be imported separately if we wanted to:

    • 91m stackoverflow.com-badges.7z
    • 2.0g stackoverflow.com-comments.7z
    • 36m stackoverflow.com-postlinks.7z
    • 501m stackoverflow.com-votes.7z

unzip the .7z files

for i in *.7z; do 7za -y -oextracted x $i; done


this extracts the files into an

extracted

directory and takes 20 minutes and uses 66gb on disk.

clone damien’s github repository

the next step was to clone damien’s github repo:

git clone https://github.com/mdamien/stackoverflow-neo4j


note: this command uses python 3 , so you have to install

xmltodict
sudo apt-get install python3-setuptools
easy_install3 xmltodict


run the xml-to-csv conversion

after that, we ran the conversion of xml to csv.

python3 to_csv.py extracted


the conversion ran for 80 minutes on my system and resulted in 9.5gb csv files, which were compressed to 3.4g.

this is the data structure imported into neo4j. the header lines of the csv files provide the mapping.

nodes:

posts.csv
postid:id(post),title,posttype:int,createdat,score:int,views:int,
answers:int,comments:int,favorites:int,updatedat,body

users.csv userid:id(user),name,reputation:int,createdat,accessedat,url,location,
views:int,upvotes:int,downvotes:int,age:int,accountid:int
tags.csv
tagid:id(tag),count:int,wikipostid:int


relationships:

posts_answers.csv:answer   -> :start_id(post),:end_id(post)
posts_rel.csv:parent_of    -> :start_id(post),:end_id(post)
tags_posts_rel.csv:has_tag -> :start_id(post),:end_id(tag)
users_posts_rel.csv:posted -> :start_id(user),:end_id(post)


import into neo4j

we then used the neo4j import tool

neo/bin/neo4j-import

to ingest posts, users, tags and the relationships between them.

../neo/bin/neo4j-import \
--into ../neo/data/graph.db \
--id-type string \
--nodes:post csvs/posts.csv \
--nodes:user csvs/users.csv \
--nodes:tag csvs/tags.csv \
--relationships:parent_of csvs/posts_rel.csv \
--relationships:answer csvs/posts_answers.csv \
--relationships:has_tag csvs/tags_posts_rel.csv \
--relationships:posted csvs/users_posts_rel.csv


the actual import only takes 3 minutes, creating a graph store of 18 gb.

import done in 3m 48s 579ms. imported:
  31138559 nodes
  77930024 relationships
  260665346 properties


neo4j configuration

we then wanted to adapt neo4j’s config in

conf/neo4j.properties

to increase the

dbms.pagecache.memory

option to 10g. we also edited the

conf/neo4j-wrapper.conf

to provide some more heap, like 4g or 8g.

then we started the neo4j server with

../neo/bin/neo4j start

adding indexes

we then had the option of running the next queries either directly in neo4j’s server ui or on the command-line with

../neo/bin/neo4j-shell

which connects to the running server.

here’s how much data we had in there:

neo4j-sh (?)$ match (n) return head(labels(n)) as label, count(*);
+-------------------+
| label  | count(*) |
+-------------------+
| "tag"  | 41719    |
| "user" | 4551115  |
| "post" | 26545725 |
+-------------------+
3 rows


next, we created some indexes and constraints for later use:

create index on :post(title);
create index on :post(createdat);
create index on :post(score);
create index on :post(views);
create index on :post(favorites);
create index on :post(answers);
create index on :post(score);

create index on :user(name);
create index on :user(createdat);
create index on :user(reputation);
create index on :user(age);

create index on :tag(count);

create constraint on (t:tag) assert t.tagid is unique;
create constraint on (u:user) assert u.userid is unique;
create constraint on (p:post) assert p.postid is unique;


we then waited for the indexes to be finished.

schema await


please note: neo4j as a graph database wasn’t originally built for these global-aggregating queries. that’s why the responses are not instant.

getting insights with cypher queries

below are just some of the insights we gleaned from the stack overflow data using cypher queries:

the top 10 stack overflow users

match (u:user) 
with u,size( (u)-[:posted]->()) as posts order by posts desc limit 10 
return u.name, posts;
+---------------------------+
| u.name            | posts |
+---------------------------+
| "jon skeet"       | 32174 |
| "gordon linoff"   | 20989 |
| "darin dimitrov"  | 20871 |
| "balusc"          | 16579 |
| "commonsware"     | 15493 |
| "anubhava"        | 15207 |
| "hans passant"    | 15156 |
| "martijn pieters" | 14167 |
| "slaks"           | 14118 |
| "marc gravell"    | 13400 |
+---------------------------+
10 rows
7342 ms


the top 5 tags that jon skeet used in asking questions

it seems he never really asked questions, but only answered.

match (u:user)-[:posted]->()-[:has_tag]->(t:tag) 
where u.name = "jon skeet" 
return t,count(*) as posts order by posts desc limit 5;
+------------------------------------------------+
| t                                      | posts |
+------------------------------------------------+
| node[31096861]{tagid:"c#"}             | 14    |
| node[31096855]{tagid:".net"}           | 7     |
| node[31101268]{tagid:".net-4.0"}       | 4     |
| node[31118174]{tagid:"c#-4.0"}         | 4     |
| node[31096911]{tagid:"asp.net"}        | 3     |
+------------------------------------------------+
10 rows
36 ms


the top 5 tags that balusc answered

match (u:user)-[:posted]->()-[:has_tag]->(t:tag) 
where u.name = "balusc" 
return t.tagid,count(*) as posts order by posts desc limit 5;

+------------------------+
| t.tagid        | posts |
+------------------------+
| "java"         | 5     |
| "jsf"          | 3     |
| "managed-bean" | 2     |
| "eclipse"      | 2     |
| "cdi"          | 2     |
+------------------------+
5 rows
23 ms


how am i connected to darin dimitrov?

match path = allshortestpaths(
     (u:user {name:"darin dimitrov"})-[*]-(me:user {name:"michael hunger"}))
return path;

result visualization in the neo4j browser

result visualisation in neo4j browser

which mark answered the most questions about neo4j ?

match (u:user)-[:posted]->(answer)<-[:parent_of]-()-[:has_tag]-(:tag {tagid:"neo4j"}) 
where u.name like "mark %" 
return u.name, u.reputation,u.location,count(distinct answer) as answers
order by answers desc;

+--------------------------------------------------------------------------+
| u.name                 | u.reputation | u.location             | answers |
+--------------------------------------------------------------------------+
| "mark needham"         | 1352         | "united kingdom"       | 36      |
| "mark leighton fisher" | 4065         | "indianapolis, in"     | 3       |
| "mark byers"           | 377313       | "denmark"              | 2       |
| "mark whitfield"       | 899          | <null>                 | 1       |
| "mark wojciechowicz"   | 1473         | <null>                 | 1       |
| "mark hughes"          | 586          | "london, uk"           | 1       |
| "mark mandel"          | 859          | "melbourne, australia" | 1       |
| "mark jackson"         | 56           | "atlanta, ga"          | 1       |
+--------------------------------------------------------------------------+
8 rows
38 ms

top 20 paths rendered as graph

top 20 paths rendered as graph

the top 5 tags of all time

match (t:tag) 
with t order by t.count desc limit 5 
return t.tagid, t.count;
+------------------------+
| t.tagid      | t.count |
+------------------------+
| "javascript" | 917772  |
| "java"       | 907289  |
| "c#"         | 833458  |
| "php"        | 791534  |
| "android"    | 710585  |
+------------------------+
5 rows
30 ms


co-occurrence of the javascript tag

match (t:tag {tagid:"javascript"})<-[:has_tag]-()-[:has_tag]->(other:tag) 
with other, count(*) as freq order by freq desc limit 5
return other.tagid,freq;
+----------------------+
| other.tagid | freq   |
+----------------------+
| "jquery"    | 318868 |
| "html"      | 165725 |
| "css"       | 76259  |
| "php"       | 65615  |
| "ajax"      | 52080  |
+----------------------+
5 rows


the most active answerers for the neo4j tag

quick aside: thank you to everyone who answered neo4j questions!

match (t:tag {tagid:"neo4j"})<-[:has_tag]-()
       -[:parent_of]->()<-[:posted]-(u:user) 
with u, count(*) as freq order by freq desc limit 10
return u.name,freq;

+-------------------------------+
| u.name                 | freq |
+-------------------------------+
| "michael hunger"       | 1352 |
| "stefan armbruster"    | 760  |
| "peter neubauer"       | 308  |
| "wes freeman"          | 277  |
| "frobberofbits"        | 277  |
| "cybersam"             | 277  |
| "luanne"               | 235  |
| "christophe willemsen" | 190  |
| "brian underwood"      | 169  |
| "jjaderberg"           | 161  |
+-------------------------------+
10 rows
45 ms


where else were the top answerers also active?

match (neo:tag {tagid:"neo4j"})<-[:has_tag]-()
      -[:parent_of]->()<-[:posted]-(u:user) 
with neo,u, count(*) as freq order by freq desc limit 10
match (u)-[:posted]->()<-[:parent_of]-(p)-[:has_tag]->(other:tag)
where not (p)-[:has_tag]->(neo)
with u,other,count(*) as freq2 order by freq2 desc 
return u.name,collect(distinct other.tagid)[1..5] as tags;


+----------------------------------------------------------------------------------------+
| u.name                 | tags                                                          |
+----------------------------------------------------------------------------------------+
| "cybersam"             | ["java","javascript","node.js","arrays"]                      |
| "luanne"               | ["spring-data-neo4j","java","cypher","spring"]                |
| "wes freeman"          | ["go","node.js","java","php"]                                 |
| "peter neubauer"       | ["graph","nosql","data-structures","java"]                    |
| "brian underwood"      | ["ruby-on-rails","neo4j.rb","ruby-on-rails-3","activerecord"] |
| "michael hunger"       | ["spring-data-neo4j","nosql","cypher","graph-databases"]      |
| "christophe willemsen" | ["php","forms","doctrine2","sonata"]                          |
| "stefan armbruster"    | ["groovy","intellij-idea","tomcat","grails-plugin"]           |
| "frobberofbits"        | ["python","xsd","xml","django"]                               |
| "jjaderberg"           | ["vim","logging","python","maven"]                            |
+----------------------------------------------------------------------------------------+
10 rows
84 ms


note that this cypher query above contains the equivalent of 14 sql joins.

stack overflow data rendered in linkurious visualizer

rendered in linkurious visualizer

people who posted the most questions about neo4j

match (t:tag {tagid:'neo4j'})<-[:has_tag]-(:post)<-[:posted]-(u:user)
return u.name,count(*) as count
order by count desc limit 10;

+------------------------+
| c.name         | count |
+------------------------+
| "ldb"          | 39    |
| "deemeetree"   | 39    |
| "alexanoid"    | 38    |
| "monkeybonkey" | 35    |
| "badmiral"     | 35    |
| "mik378"       | 27    |
| "kiran"        | 25    |
| "red-devil"    | 24    |
| "rahul"        | 23    |
| "sovos"        | 23    |
+------------------------+
10 rows
42 ms


the top answerers for the py2neo tag

match (:tag {tagid:'py2neo'})<-[:has_tag]-()-[:parent_of]->()
      <-[:posted]-(u:user)
return u.name,count(*) as count
order by count desc limit 10;

+--------------------------------+
| u.name                 | count |
+--------------------------------+
| "nigel small"          | 88    |
| "martin preusse"       | 24    |
| "michael hunger"       | 22    |
| "nicole white"         | 9     |
| "stefan armbruster"    | 8     |
| "frobberofbits"        | 6     |
| "peter neubauer"       | 5     |
| "christophe willemsen" | 5     |
| "cybersam"             | 4     |
| "wes freeman"          | 4     |
+--------------------------------+
10 rows
2 ms


which users answered their own question

this global graph query takes a bit of time as it touches 200 million paths in the database, it returns after about 60 seconds.

if you would want to execute it only on a subset of the 4.5m users you could add a filtering condition, e.g. on reputation.

match (u:user) where u.reputation > 20000
match (u)-[:posted]->(question)-[:answer]->(answer)<-[:posted]-(u)
with u,count(distinct question) as questions
order by questions desc limit 5
return u.name, u.reputation, questions;

+---------------------------------------------+
| u.name           | u.reputation | questions |
+---------------------------------------------+
| "stefan kendall" | 31622        | 133       |
| "prosseek"       | 31411        | 114       |
| "cheeso"         | 100779       | 107       |
| "chase florell"  | 21207        | 99        |
| "shimmy"         | 29175        | 96        |
+---------------------------------------------+
5 rows
10 seconds


more information

we’re happy to provide you with the graph database of the stack overflow dump here:

    • neo4j database dump for 2.3-snapshot or 2.2.4
    • running neo4j server to explore the data (read-only)
    • csv files

if you want to learn about other ways to import or visualize stack overflow questions in neo4j, please have a look at these blog posts:

    • load json from url as data
    • making master data management fun with neo4j
    • visualizing stack overflow
    • embrace relationships with neo4j, r & java
    • please also check out the stack overflow developer survey . it’s a very interesting read.

thanks again to everyone who posts and answers neo4j questions. you’re the ones who make the neo4j community really tick, and without you this level of analysis would only be half as much fun.

circling back to stack overflow's 10 million question milestone, thank you for being # soreadytohelp with any stack overflow questions related to neo4j and cypher.

please let us know if you find other interesting questions and answers on this dataset. just drop us an email to content@neo4j.com .

Stack overflow Neo4j Overflow (software) Database dump

Published at DZone with permission of , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Learn From Netflix, Stack Overflow, Slack, and More
  • How To Verify Database Connection From a Spring Boot Application
  • Automate Migration Assessment With XML Linter
  • Time Machine: A Look-Back at Java Sessions From NODES 2022

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: