Importing MusicBrainz (Open Content Music Database) into Neo4j
Importing MusicBrainz (Open Content Music Database) into Neo4j
Join the DZone community and get the full member experience.Join For Free
Delivering modern software? Atomist automates your software delivery experience.
This article was originally written by Paul Tremberth
What is MusicBrainz?
|American artists signed on British record labels|
START usa=node:mb_fulltext(name="United States"), gb=node:mb_fulltext(name="United Kingdom") MATCH (usa:Country), (gb:Country), (a:Artist)-[:FROM_AREA]-(usa), (a:Artist)-[:RECORDING_CONTRACT]-(l:Label), (l)-[:FROM_AREA]-(gb) RETURN a,l,usa,gb
That’s a lot of interlinked data! Which makes it a perfect candidate for Neo4j.
All entities (artists, labels, releases, etc.) are identified by their MusicBrainz Identifier (MBID. It’s probably the closest to an universal UUID for the music recording industry.To find the MBID of your favourite band, do a search from the homepage’s search bar on the top, click on the artist you want (there can be many homonyms) and in the address bar of your browser, the MBID is the last part of the URL. For example, this is the URL for Maxïmo Park’s page on MusicBrainz:
The License of the Musicbrainz Data
Core dataThe core data, as noted above, is licensed under the CC0, which is effectively placing the data into the Public Domain. This means that anyone can download and use the core data in any way they see fit. No restrictions, no worries!
Since 2003,  MusicBrainz's core data (artists, recordings, releases, and so on) are in the public domain, and additional content, including moderation data (essentially every original content contributed by users and its elaborations), is placed under the Creative Commons CC-BY-NC-SA-2.0 license.
What Would MusicBrainz Data Look Like in a Graph?
|Neo4j Musicbrainz Schema|
What’s with the “Artist Credit” Entity?
How to Import MusicBrainz Data into Neo4j?
The SQL2GRAPH tool
Step 1 - get a local copy of the database
Step 2 - Export data to CSV/TSV with sql2graph
How sql2graph operates
Entity('artist', [ IntegerProperty('pk', Column('id')), Property('mbid', Column('gid')), Property('disambiguation', Column('comment')), Property('name', Column('name')), ],
a name, here “artist”, that will become an :Artist label for the nodes.
a list of Property() instances, that also have a name, that will become properties of the node
a mandatory primary key property called “pk” that is use by sql2graph to resolve links between entities when creating relationships; this is usually the column name “id” in SQL tables (this “pk” property is not really useful in the final graph)
What about relationships?
Entity('artist', [ IntegerProperty('pk', Column('id')), Property('mbid', Column('gid')), Property('disambiguation', Column('comment')), Property('name', Column('name', ForeignColumn('artist_name', 'name'))), ], [ Relation( 'FROM_AREA', start=Reference('artist', Column('id')), end=Reference('area', Column('area')), properties= ), Relation( 'BEGAN_IN_AREA', start=Reference('artist', Column('id')), end=Reference('area', Column('begin_area')), properties= ), Relation( 'ENDED_IN_AREA', start=Reference('artist', Column('id')), end=Reference('area', Column('end_area')), properties= ), ], ),
‘FROM’: represents the relationship between the “artist” table and the “area” table, representing where an artist is from (was born or built his career in)
‘BEGAN_IN’ and ‘ENDED_IN’ also represent a link between the “artist” and “area” tables but they represent the place of birth and place of death of artist (or band sometimes)
Entity('release_label', , [ Relation( 'RELEASED_ON', start=Reference('release', Column('release')), end=Reference('label', Column('label')), properties=[ Property('catalog_number', Column('catalog_number')), ] ), ] ),
Exporting nodes and relationships using SQL queries
first, it creates a temporary table (in the RDBMS itself) mapping entity primary keys values to node IDs in Neo4j. This SQL table has 3 columns: “entity”, “pk” and “node_id”
then, it dumps nodes data for all tables for which you defined node properties in TSV format (PostgreSQL can do that, see http://maxdemarzi.com/2012/02/28/batch-importer-part-2/)
finally, it exports relationships as TSV, using the Relation() and Property() instances you defined in the mapping above, and resolving node IDs using the temporary table
SQL table “artist”:
SQL table “label”:
Temporary mapping table:
batch-import supports automatic indexing if your CSV header’s columns contain a type and index name, for example “name:string:mb” if the “name” field in nodes is a string that you want to index in an index called “mb”.
The generated MusicBrainz SQL export script above assumes 2 indexes: “mb_fulltext” as a fulltext index and “mb_exact” as an exact index. Make sure your batch.properties file declares those 2 indexes (see below “Import using batch-import”):
“mbid” columns will be indexed in the “mb_exact” index and "type" columns will become labels on nodes
“name” will be indexed in the “mb_fulltext” index
Step 3 - Import using the Neo4j batch-import
batch_import.keep_db=true batch_import.mapdb_cache.disable=true batch_import.node_index.mb_fulltext=fulltext batch_import.node_index.mb_exact=exact batch_import.csv.quotes=false cache_type=none use_memory_mapped_buffers=true neostore.nodestore.db.mapped_memory=300M neostore.relationshipstore.db.mapped_memory=3G neostore.propertystore.db.mapped_memory=500M neostore.propertystore.db.strings.mapped_memory=500M neostore.propertystore.db.arrays.mapped_memory=0M neostore.propertystore.db.index.keys.mapped_memory=15M neostore.propertystore.db.index.mapped_memory=15MWith that, we can import the files (using UTF-8 encoding) with:
MAVEN_OPTS="-server -Xmx10G -Dfile.encoding=UTF-8" mvn exec:java -Dfile.encoding=UTF-8 -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args="batch.properties musicbrainz.db /tmp/musicbrainz__nodes__full.csv /tmp/musicbrainz__rels__full.csv"
And get output like:
|British artists signed on American record labels|
The Cypher query for this result:
START usa=node:mb_fulltext(name="United States"), gb=node:mb_fulltext(name="United Kingdom") MATCH (gb:Country), (gb:Country), (a:Artist)-[:FROM_AREA]-(gb), (a:Artist)-[:RECORDING_CONTRACT]-(l:Label), (l)-[:FROM_AREA]-(usa) RETURN a,l,usa,gb
Thanks a lot to Michael Hunger, Peter Neubauer, Max DeMarzi and the fantastic Neo4j community for all the help and inspiration for this blog!
Published at DZone with permission of Andreas Kollegger , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.