When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.
When your graph has more than 100 million nodes and more than a billion edges, importing it with the regular Neo4j import tools becomes very time-intensive, taking many hours or even days.
In this talk, I'll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster. I'll also cover how we solved problems that are awkward in a distributed framework like Hadoop, such as creating sequential row IDs and position-dependent records.
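To give a flavor of the sequential row ID problem, here is a minimal sketch of one common approach. It is not necessarily the method presented in the talk. It assumes the data is already split into ordered partitions, counts the rows in each partition in a first pass, and then uses an exclusive prefix sum of those counts as each partition's starting ID. The function names `partition_offsets` and `assign_ids` are hypothetical, and plain Python lists stand in for the work that Hadoop tasks would do.

```python
from itertools import accumulate


def partition_offsets(counts):
    """Exclusive prefix sum: the first global id for each partition."""
    return [0] + list(accumulate(counts))[:-1]


def assign_ids(partitions):
    """Give every row a gap-free, globally sequential id.

    Each partition only needs its own offset, so in a real cluster
    the second pass could run independently per partition.
    """
    # Pass 1: count the rows in each partition.
    counts = [len(part) for part in partitions]
    offsets = partition_offsets(counts)
    # Pass 2: id = partition offset + position within the partition.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


if __name__ == "__main__":
    # Hypothetical input: three partitions of node names.
    for part in assign_ids([["a", "b"], ["c"], ["d", "e"]]):
        print(part)
```

Because each partition knows its offset up front, no task has to wait for another to finish numbering its rows, which is what makes the scheme fit a distributed setting.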
About the speaker:
Kris Geusebroek is a developer with a passion for combining technologies to create new possibilities for the people around him. Coming from a Java and GIS background and being a fan of open source software, Kris started working with distributed systems and graph databases in the last couple of years. He's currently working on visualizing Big Data with the help of Hadoop and Neo4j. Kris has spoken at in-house knowledge-sharing events and several local meetups. He also has experience running hands-on training sessions at the Dutch Java User Group.