How I MapReduced a Neo4j Store w/ Hadoop

When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.

When your graph has 100M+ nodes and 1,000M+ edges, the regular Neo4j import tools become very time-intensive: an import can take many hours to days.

In this talk, I'll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster, and how we solved problems such as creating sequential row IDs and position-dependent records within a distributed framework.
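
To make the sequential-ID problem concrete, here is a minimal sketch of one common two-pass approach. It is not necessarily the method used in the talk. It assumes the input is already split into ordered partitions: a first pass counts the rows in each partition, a running total of those counts gives each partition its starting ID, and a second pass lets every partition number its own rows independently. The function names and the `record_size` parameter are hypothetical, and `record_offset` only illustrates the idea of a position-dependent record (a fixed-size record whose byte position follows from its ID), not Neo4j's actual store layout.

```python
from typing import List, Sequence, Tuple


def partition_offsets(counts: Sequence[int]) -> List[int]:
    """Return the first id of each partition (exclusive prefix sum of counts)."""
    offsets, total = [], 0
    for count in counts:
        offsets.append(total)
        total += count
    return offsets


def assign_ids(partitions: Sequence[Sequence[str]]) -> List[List[Tuple[int, str]]]:
    """Give every row a globally sequential id without a shared counter."""
    # Pass 1: count rows per partition (in a cluster, a cheap counting job).
    counts = [len(part) for part in partitions]
    offsets = partition_offsets(counts)
    # Pass 2: each partition numbers its rows starting from its own offset,
    # so partitions can be processed independently and in parallel.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


def record_offset(row_id: int, record_size: int) -> int:
    """Byte position of a fixed-size record; record_size is a hypothetical value."""
    return row_id * record_size


if __name__ == "__main__":
    parts = [["a", "b"], ["c"], ["d", "e", "f"]]
    for part in assign_ids(parts):
        print(part)
```

The design point is that no worker needs to coordinate with another while numbering rows: the only shared information is the small list of per-partition counts.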

About the speaker:

Kris Geusebroek is a developer with a passion for combining technologies to create new possibilities for the people around him. Coming from a Java and GIS background and being a fan of open source software, Kris started working with distributed systems and graph databases in the last couple of years. He's currently working on visualizing Big Data with the help of Hadoop and Neo4j. Kris has spoken at in-house knowledge-sharing events and several local meetups. He also has experience running hands-on training sessions at the Dutch Java User Group.

--YouTube Page
