A Step-By-Step Tutorial on How to Use Graphipedia to Import Wikipedia into Neo4j

The Java Zone is brought to you in partnership with ZeroTurnaround. Discover how you can skip the build and redeploy process by using JRebel by ZeroTurnaround.

Wouldn’t it be cool to import Wikipedia into Neo4j?

Mirko Nasato thought so, and built Graphipedia, which uses the batch importer to do just that.

It’s written in Java, so if you’re a pure Ruby guy, I’ll walk you through the steps.

Let’s clone the project and jump in.

git clone git://github.com/mirkonasato/graphipedia.git
cd graphipedia

If you look in here you’ll see a pom.xml file, which means you’ll need to install Maven and build the project.

sudo apt-get install maven2
mvn install

You’ll see a bunch of stuff flying by. That’s just the dependencies being downloaded. At the end you should see this:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS [1:08.932s]
[INFO] Graphipedia DataImport ................................ SUCCESS [1:16.018s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 25 seconds
[INFO] Finished at: Thu Feb 16 11:36:55 CST 2012
[INFO] Final Memory: 28M/434M
[INFO] ------------------------------------------------------------------------

OK, so now let’s get the Wikipedia dump file we need. You can download it with wget.

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Whoa, hold up. That’s a 7.6 GB file… can we try a smaller data set first?

Sure. Let’s go with Lea faka-Tonga (the Tongan-language Wikipedia) ’cause it just sounds cool, and we’ll decompress it.

wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2

It is a two-step process, so first let’s create a smaller intermediate XML file containing page titles and links only:

java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

You should see:

Parsing pages and extracting links...
2835 pages parsed in 0 seconds.
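
Curious what “extracting links” amounts to? Wikipedia markup writes links as [[Target]] or [[Target|label]], so the idea can be sketched in a couple of lines of shell. This is only an illustration: extract_links is a made-up helper, not part of Graphipedia, whose real parser is written in Java and may handle namespaces, redirects and other edge cases differently.

```shell
# Hypothetical helper (not part of Graphipedia): read wikitext on stdin and
# print the target of every [[...]] link, one per line.
extract_links() {
  # Grab "[[" plus everything up to the first "]", "|" or "#",
  # then strip the leading "[[" so only the target title is left.
  grep -o '\[\[[^]|#]*' | sed 's/^\[\[//'
}

printf '%s\n' "Nuku'alofa is the capital of [[Tonga]], in the [[Pacific Ocean|Pacific]]." | extract_links
# prints:
# Tonga
# Pacific Ocean
```

Keeping just titles and link targets is what makes the intermediate file so much smaller than the full dump.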

Then we run the batch importer on this file and write the result to the graph.db directory:

java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

You should see:

Importing pages...
2835 pages imported in 0 seconds.
Importing links...
5799 links imported in 0 seconds; 6383 broken links ignored
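
Why so many broken links? The output suggests a two-pass import: pages first, then links. A plausible reading (an assumption here, not something confirmed from Graphipedia’s source) is that a link counts as broken when its target title isn’t one of the imported pages, which happens a lot in a small wiki. Here is a rough sketch of that bookkeeping, using a made-up count_links helper and invented sample data:

```shell
# Hypothetical sketch of the two-pass idea, not Graphipedia's actual code.
# $1: file with one page title per line
# $2: file with "source<TAB>target" per line
count_links() {
  awk -F '\t' '
    NR == FNR { known[$0] = 1; next }    # pass 1: remember every page title
    ($2 in known) { imported++; next }   # pass 2: target is a known page
    { broken++ }                         # pass 2: target does not exist
    END { printf "%d links imported; %d broken links ignored\n", imported, broken }
  ' "$1" "$2"
}

# Invented sample data: two pages, three links, one pointing at a missing page.
tmp=$(mktemp -d)
printf '%s\n' 'Tonga' "Nuku'alofa" > "$tmp/pages.txt"
printf '%s\t%s\n' 'Tonga' "Nuku'alofa" "Nuku'alofa" 'Tonga' 'Tonga' 'Fiji' > "$tmp/links.txt"
count_links "$tmp/pages.txt" "$tmp/links.txt"
# prints: 2 links imported; 1 broken links ignored
rm -r "$tmp"
```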

Go inside and take a look, and you’ll see our neostore files.

cd graph.db

You can copy this folder over any existing Neo4j database by overwriting its data/graph.db folder (for example /neo4j/data/graph.db), and enjoy.
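
If you’d rather not clobber the old database outright, the swap might look something like this. It is a sketch under assumptions: install_graph_db is a made-up helper, NEO4J_HOME stands for wherever your Neo4j lives, and it assumes the server is stopped while the files are replaced.

```shell
# Hypothetical helper: put a freshly imported graph.db in place,
# keeping the previous database as graph.db.bak.
# $1: path to the new graph.db directory
# $2: Neo4j installation directory (an assumption; adjust to your setup)
install_graph_db() {
  src=$1
  neo4j_home=$2
  dest="$neo4j_home/data/graph.db"
  # Move the old database aside instead of deleting it
  # (assumes no earlier graph.db.bak is lying around).
  if [ -d "$dest" ]; then
    mv "$dest" "$dest.bak"
  fi
  mkdir -p "$neo4j_home/data"
  cp -R "$src" "$dest"
}

# Example usage, run from the directory that contains graph.db:
# "$NEO4J_HOME/bin/neo4j" stop
# install_graph_db ./graph.db "$NEO4J_HOME"
# "$NEO4J_HOME/bin/neo4j" start
```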

 Source: http://maxdemarzi.com/2012/02/16/importing-wikipedia-into-neo4j-with-graphipedia/
