Curator's Note: The content of this article was written by Rik Van Bruggen over at the Neo4j blog.
Many of you know that I am a big fan of Belgian beers. But of course I have a number of other hobbies and passions. One of those is music. I have played music, created music (although that seems like a very long time ago) and still listen to new music almost every single day. So when sometime in 2006 I heard about this really cool music site called Last.fm, I was one of the early adopters to try it. So, a good 7 years and 50k+ scrobbles later, I have quite a bit of data about my musical habits.
On top of that, I have a couple of friends that have been using Last.fm as well. So this got me thinking. What if I was somehow able to get that last.fm data into neo4j, and start "walking the graph"? I am sure that must give me some interesting new musical insights... It almost feels like a "recommendation graph for music" ... Let's see where this brings us.
Basically, the approach I took had four simple high-level steps:
- get the data from last.fm
- model that data into a neo4j graph
- pump the data into the neo4j database using an import tool
- query the data for hours on end ;)
Step 1: exporting the data from last.fm.
Turns out there are some cool tools out there to get the scrobble data out of last.fm. I used the LastToLibre "lastexport.py" script, which is very easy and simple: run it, give it a user name, and the public scrobbles will be available shortly after in a text file with the date, trackname, artistname, albumname and then the "MusicBrainz" identifiers for the track (trackmbid), artist (artistmbid) and album (albummbid). I did this for myself and two friends, and got a sizeable dataset.
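To make the shape of that export concrete, here is a minimal Python sketch of reading such a file. It assumes the file is tab-separated with the seven columns in the order listed above; the delimiter, the `Scrobble` type, the `parse_scrobbles` helper and the sample row are all my own illustrations, not part of lastexport.py.

```python
import csv
import io
from typing import NamedTuple


class Scrobble(NamedTuple):
    # Column order as described in the text; field names are hypothetical.
    date: str
    track: str
    artist: str
    album: str
    track_mbid: str
    artist_mbid: str
    album_mbid: str


def parse_scrobbles(lines):
    """Yield one Scrobble per row with exactly 7 columns; skip other rows."""
    # Assumption: tab-separated fields without any quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 7:
            yield Scrobble(*row)


# Made-up sample data: one well-formed row (empty MBIDs) and one broken row.
sample = io.StringIO(
    "1370000000\tSome Track\tSome Artist\tSome Album\t\t\t\n"
    "broken line\n"
)
scrobbles = list(parse_scrobbles(sample))
```

For a real export you would pass an open file handle instead of the `io.StringIO` sample. Note that the MusicBrainz identifiers can be empty, which matters later when deciding which nodes carry an identifier.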
Step 2: create a model out of this.
Step 3: import the data
This is where it got interesting. The spreadsheet import mechanism worked ok - but it really wasn't great. It took more than an hour to get the dataset to load - so I had to look for alternatives. Thanks to my French friend and colleague Cédric, I bumped into the Talend ETL (Extract - Transform - Load) tools. I found out that there was a proper neo4j connector that was developed by Zenika, a French integrator that really seems to know their stuff.
- Import the nodes: 2 subjobs, one for the nodes with name and type, the other for the nodes that have name, type and a "Musicbrainz Identifier" (artists, tracks, albums),
- Import the relationships: 7 subjobs, one for every relationship type (see model). Importantly, there is an additional step here: making the relationships "unique", so that no relationship is created twice.
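The "unique" step is easy to picture outside of Talend too. The sketch below is not how the Talend job does it; it is just a plain Python illustration of the idea, assuming each relationship is a (start, type, end) triple. The `unique_relationships` helper, the `PERFORMED_BY` type name and the node keys are hypothetical.

```python
def unique_relationships(rows):
    """Return (start, rel_type, end) triples in first-seen order, without repeats."""
    seen = set()
    result = []
    for triple in rows:
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


# Made-up input: the same track-to-artist relationship appears twice,
# as it would if the track had been scrobbled more than once.
rows = [
    ("track:A", "PERFORMED_BY", "artist:X"),
    ("track:A", "PERFORMED_BY", "artist:X"),
    ("track:B", "PERFORMED_BY", "artist:X"),
]
deduped = unique_relationships(rows)
```

Since every scrobble of the same track repeats the same track, artist and album links, skipping this step would create a lot of duplicate relationships in the graph.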
Step 4: do some neo4j Cypher queries
After the Talend job was done, I started to experiment with some Cypher queries: figure out which artists my friends had been listening to on the same day:
Seems like these queries are not that trivial, and there probably still is quite a bit of optimisation to be done - but that's way above my capabilities. And obviously there are many more ideas for interesting queries - the music domain is very graphy in nature, and allows for more hours of graph fun. But that will be for a later time.