Groovy Graphs with Grails and Gremlins
Join the DZone community and get the full member experience.
Join For Freethis article provides a practical introduction to graph databases with a focus on the groovy ecosystem. it explores grails gorm support for neo4j as well as query comparisons between the neo4j cypher language, sql and the gremlin graph dsl.
this article originally appeared in the may 2013 issue of groovymag .
graph theory and terminology
a graph is a way of modeling connections or relationships between nodes or vertices. these relationships, or edges, may be directed (a digraph in mathematical terminology) as shown in figure 1. parent-child relationships may form a tree; with additional associations these form a network.
a weighted graph (figure 2) has a weight on every edge. weighted networks can be used for modeling road networks with the weights representing distances; in conjunction with shortest-path graph algorithms such as dijkstra’s algorithm, these are used for example for minimizing the distance travelled by delivery lorries.
figure 3 shows a loop, which is a self-referencing edge.
figure 4 shows a property graph; this type of graph allows properties on both nodes and edges.
sample graph
one of the main examples that is often used in current introductory material is a social graph much like figure 4. instead for the first practical example we’ll explore a different scenario: the relationships between a development project, the skills required and the team members involved as shown in figure 5 (projects are denoted in orange, skills in blue and employees in green).
what is a graph database?
graph databases use property graphs, where both nodes and edges can have properties. one of the terms that is frequently used to describe graph databases is ‘index-free adjacency’, meaning that each node knows the location of the nodes that it is connected to without needing to perform an index lookup. in fact, this is often a key distinction between a graph database and a non-native store.
when should you use a graph database?
now many people may jump on the bandwagon to use a ‘shiny new’ technology. the lesson that people should be taking away from the nosql movement is that the most appropriate data store technology should be utilized. non-functional requirements are very important, as the solution will need to be operated in a production environment.
the best fit is for highly associative data sets where the (semi-structured) relationships are of critical importance to the value of the data. graph databases may offer significant performance improvements over a relational model for this type of data as edges can be traversed at o(1) rather than requiring significant amounts of i/o operations to perform joins.
for the practical aspects of the article we’ll primarily use neo4j ( http://neo4j.org ), which is described as the world’s leading graph database, before looking at some of the alternatives.
all the code is available on github from https://github.com/rbramley/groovymaggraphs
using a graph database in a grails application
for the first example, we’ll explore a domain model that encapsulates a ‘project database’. we’ll start by installing the gorm-based plugin for neo4j ( http://www.grails.org/plugin/neo4j – the plugin manual is available at http://springsource.github.io/grails-data-mapping/neo4j/manual/guide/index.html ):
grails install-plugin neo4j
then configure the datasource for the plugin along the lines of listing 1, but choosing an appropriate data file location for your system.
listing 1: neo4j plugin datasource configuration
grails {
neo4j {
type = "embedded"
location = "/usr/local/cellar/neo4j/data"
}
}
we’ll use scaffolded controllers and views for speed, so let’s define the domain model to match figure 5. the domain classes required are skill shown in listing 2, employee shown in listing 3 and project as per listing 4 – they all inform gorm that they are for neo4j by using
static mapwith = "neo4j"
.
package neolisting 2: a simple skill representation
import groovy.transform.tostring
@tostring(includes='name')
class skill {
string name
static constraints = {}
static mapwith = "neo4j"
}
package neolisting 3: a simple employee representation
import groovy.transform.tostring
@tostring(includes='name')
class employee {
string name
string title
date dob
static hasmany = [ workedon : project, knows : skill ]
static constraints = {}
static mapwith = "neo4j"
}
package neolisting 4: a simple project representation
import groovy.transform.tostring
@tostring(includes='projectname')
class project {
string customername
string projectname
date startdate
date enddate
static hasmany = [ members : employee, requiredskills : skill ]
static constraints = {}
static belongsto = employee
static mapwith = "neo4j"
}
in summary it is very easy to create the simple grails object domain model with one-to-many and many-to-many relationships and map them with neo4j.
following this you’ll need to create the controllers, set to scaffold, run, and populate data. figure 6 illustrates the view of the figure 5 graph as seen from the project view action – job done!
the simple project database was such a success that hr want to extend it to provide a skills matrix. this new requirement will necessitate a richer domain model to capture more details about employee skill levels and experience; in a relational database world we could happily augment the many-to-many link table between employee and skill – with a property graph we can add properties to an edge.
this is where we’ve reached the limitations of the gorm plugin, which is designed for 80% use cases (see http://stackoverflow.com/questions/10647834/architecting-a-neo4j-based-application-stick-to-vanilla-api-using-plain-nodes ).
so it looks like the best way to use neo4j to exploit the power of the graph with grails is to forego gorm/scaffolding. which isn’t that much of an issue for ‘real’ projects as opposed to prototyping.
spring data to the rescue
if you don’t want to couple your application directly to the native datastore apis, then spring data is one option. spring data provides a set of data access apis with implementations for a number of nosql technologies.
the spring data neo4j support includes
@relationshipentity
annotations (
http://static.springsource.org/spring-data/data-neo4j/docs/2.0.0.release/reference/html/#d0e1889
) for adding properties to graph edges.
six degrees of kevin bacon
neo4j & cypher
neo4j also provides a native traversal api and a declarative query language called cypher.
for this example we’ll manipulate the graph using a groovy script and we’ll play a game of ’6 degrees of kevin bacon’ ( http://en.wikipedia.org/wiki/six_degrees_of_kevin_bacon ) with a small network.
the network we’ll define is illustrated in figure 7, the blue shades signify movies and the green shades signify actors, with the exception of kevin bacon who is denoted in orange. the darker shades are first-degree connections and the lighter shades are second-degree connections.
script goodies
groovy missing method meta-programming is used in listing 5 to make establishing relationships feel more natural e.g.
movie.featured(actor)
. this is using an outbound edge from a movie node into an actor node.
// an enum helperlisting 5: groovy dsl meta-programming
enum myrelationshiptypes implements relationshiptype { featured }
// metaclass magic
node.metaclass {
propertymissing { string name, val -> delegate.setproperty(name, val) }
propertymissing { string name -> delegate.getproperty(name) }
methodmissing { string name, args -> delegate.createrelationshipto(args[0], myrelationshiptypes."$name") }
}
additionally listing 6 defines a helper method so that we can specify the graph using a list of simple triples (e.g. ‘footloose-featured-kevin bacon’) rather than having to write lots of boilerplate code to create each node and define the edges.
an alternative would be to use the geoff notation, which is closely aligned with the neo4j cypher query syntax (more information later) – however this requires an additional plugin.
// keep the graph creation drylisting 6: graph construction code
def getorcreatenode(db, map, key, type) {
def node
if(map.containskey(key)) {
node = map.get(key)
} else {
node = db.createnode()
node.name = key
node.type = type
map.put(key, node)
}
node
}
// helper method to create the graph
void creategraph(db, gd) {
def moviemap = [:]
def actormap = [:]
gd.each { str ->
def movie, actor
def parts = str.split('-')
def moviekey = parts[0]
def actorkey = parts[2]
movie = getorcreatenode(db, moviemap, moviekey, 'movie')
actor = getorcreatenode(db, actormap, actorkey, 'actor')
// set up the edge
movie.featured(actor)
}
}
neo4j setup
listing 7 shows that the script uses an impermanent database, from the neo4j testing utils module, to simplify re-execution rather than having to tidy up the data directory used by an embedded instance. it also requires a lucene index for nodes so that we can search by name.
// set up an impermanent test instance (this saves having to write disk clean up)listing 7: neo4j graph database setup
def db = new impermanentgraphdatabase([:], [new luceneindexprovider()], [], [new softcacheprovider()])
def nodeautoindexer = db.index().getnodeautoindexer()
nodeautoindexer.startautoindexingproperty("name")
nodeautoindexer.setenabled(true)
data population
with the helper methods from listing 6 and the data in listing 8, this makes the node creation mainly a case of demarcating the transaction as shown in listing 9.
// describe a graph as triples e.g. movie-featured-actorlisting 8: the representation of the graph from figure 7
def graphdescription = [
'footloose-featured-kevin bacon',
'footloose-featured-john lithgow',
'shrek-featured-mike myers',
'shrek-featured-john lithgow',
'shrek-featured-cameron diaz',
"charlie's angels-featured-cameron diaz",
"charlie's angels-featured-bill murray",
'54-featured-mike myers',
'54-featured-neve campbell',
'wild things-featured-neve campbell',
'wild things-featured-bill murray',
'wild things-featured-kevin bacon'
]
def txlisting 9: transaction for index and graph creation
try {
tx = db.begintx()
db.index().fornodes("node_auto_index")
// create the nodes for the fixture data
creategraph(db, graphdescription)
tx.success()
} finally {
tx?.finish()
}
cypher
cypher is a declarative query language, and for more information i’d recommend the very good reference guide which provides executable samples generated from tests: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html
now everything is in place and we’ve populated the data, it’s time to perform the query in listing 10 to discover the second-degree connections in the graph and the routes to them as can be seen in figure 8. we tell cypher the starting node to find in the index, then how to match the output using 4 hops (movie-actor-movie-actor) and which data properties to output.
def query = '''listing 10: constructing and executing a cypher query
start k=node:node_auto_index(name = "kevin bacon")
match (k)--(m)--(a)--(n)--(b)
return k.name, m.name, a.name, n.name, b.name
'''
// execute the query
executionengine engine = new executionengine(db)
executionresult result = engine.execute(query)
println result
the slowest part of the query from listing 10 is the index lookup to find the starting point for the traversal. this could easily optimized if the use case were to always start with kevin bacon, as we could make him the root node of the graph. adding the relationship directions (e.g.
match (k)<--(m)-->(a)<--(n)-->(b)
) is also beneficial.
an alternative but less desirable approach is to use the node id, but this isn’t good practice as the id is an internal identifier and you should be aware that node ids are reused within neo4j.
the index lookup, in my tests with the small dataset, added ~4ms in comparison to e.g.
start k=node(1)
.
an alternative way to specify a 2nd degree query (which in our case is 4 hops because the movies are modeled as nodes) is shown in listing 11. this query is simpler to write and permits variable length hops (as per listing 12 for 1st and 2nd degree actors with ordered output in figure 9), but requires additional work to make sense of the linking movies and actors.
start k=node:node_auto_index(name = "kevin bacon")listing 11: match variable-length edges with a hop count
match p = (k)-[:featured*4..4]-(x)
return p, x.name
start k=node:node_auto_index(name = "kevin bacon")listing 12: where clause and order by
match p = (k)-[:featured*2..4]-(x)
where x.type = 'actor'
return p, x.name
order by x.name
comparison to sql
the cypher graph query in listing 10 is significantly easier to specify in cypher that its equivalent sql query. in a relational database the necessary normalized table structure as shown in figure 10, introduces a link table to break the many-to-many relationship between movies and actors into two separate one-to-many relationships. i.e. a movie can feature many actors and an actor can appear in many movies. consequently the equivalent sql query requires a lot of extra constructs as detailed in listing 13 and the exclusions are required to prevent kevin bacon and first-degree connections appearing in the results, as seen in figure 11, as second-degree connections.
the full ddl and dml for this example are available from the github repository, tested using mysql.
select k.actor_name, m.movie_name, a.actor_name, n.movie_name, b.actor_namelisting 13: sql query to retrieve second-degree connections
from
actor k, actor a, actor b,
movie m, movie n,
movie_actor mk, movie_actor ma, movie_actor na, movie_actor nb
where
k.actor_name = 'kevin bacon'
and k.actor_id = mk.actor_id
and mk.movie_id = m.movie_id
and ma.movie_id = m.movie_id
and ma.actor_id = a.actor_id
and na.actor_id = a.actor_id
and na.movie_id = n.movie_id
and nb.movie_id = n.movie_id
and nb.actor_id = b.actor_id
and k.actor_id <> a.actor_id
and k.actor_id <> b.actor_id
and a.actor_id <> b.actor_id
and m.movie_id <> n.movie_id;
the response time of the unoptimized sql query isn’t too bad in this case (the explain plan is shown in figure 12), but we only have a miniscule data set – if you want to really try this out at scale and at 6 degrees, you can access the full imdb actors and actresses datasets from http://www.imdb.com/interfaces , though the data will need pre-processing to get it into a loadable form for either relational or graph databases.
further areas of interest
neo4j event handlers
the
org.neo4j.graphdb.event
package defines a
transactioneventhandler<t>
interface that can be used to trigger custom behavior before or after commits and after a rollback has occurred. this is beyond the scope of this article.
gremlin
tinkerpop ( http://www.tinkerpop.com ) provides a set of open source graph projects; blueprints is a property graph model interface with provided implementations. databases that implement the blueprints interfaces automatically support blueprints-enabled applications.
tinkerpop also created gremlin ( https://github.com/tinkerpop/gremlin/wiki ), which is a domain specific language (see http://gremlindocs.com ) for traversing property graphs.
gremlin provides an interactive shell that utilizes the groovy shell and which we’ll now use for an example.
gremlin example
the data for this example is available in the companion github repository and is a graphml representation of the graph in figure 7, as it is one of the standard formats that gremlin supports. for this example i used titan version 0.2.0 (there is a little more information on titan in the following section).
listing 14 illustrates an interactive approximate equivalent of the cypher query in listing 10.
titan-0.2.0$ bin/gremlin.shlisting 14: an interactive gremlin session
\,,,/
(o o)
-----oooo-(_)-oooo-----
gremlin>g = titanfactory.open('/tmp/titan')
==>titangraph[local:/tmp/titan]
gremlin>g.createkeyindex('name', vertex.class)
==>null
gremlin> g.loadgraphml('data/bacon_oracle.xml')
==>null
gremlin> kevin = g.v('name','kevin bacon').next()
==>v[4]
gremlin> kevin.map()
==>name=kevin bacon
==>type=actor
gremlin> kevin.in('featured').map
==>{name=footloose, type=movie}
==>{name=wild things, type=movie}
gremlin> kevin.in('featured').out('featured').except([kevin]).name
==>john lithgow
==>bill murray
==>neve campbell
gremlin> kevin.in('featured').out('featured').dedup()
==>v[4]
==>v[12]
==>v[32]
==>v[40]
gremlin> kevin.in('featured').out('featured').in('featured').out('featured').except([g.v(4), g.v(12), g.v(32), g.v(40)]).dedup().name
==>mike myers
==>cameron diaz
alternative graph stores
titan
aurelius titan ( http://thinkaurelius.github.io/titan/ ) provides a choice of persistence backends including cassandra and hbase. it provides support for search using elasticsearch or lucene. titan leverages gremlin and the getting started examples make use of the groovy gremlin shell.
orientdb
orientdb ( http://www.orientdb.org ) provides a graph + document database with fulltext indices.
rdf, triples/quads & sparql
rdf ( http://www.w3.org/tr/2004/rec-rdf-primer-20040210/ ) and sparql ( http://www.w3.org/tr/rdf-sparql-query/ ) are two semantic web standards from the w3c (world wide web consortium).
rdf (resource descriptor framework) provides a mechanism for describing a resource and there are related standards such as rdf/a for embedding metadata attributes within a document.
an rdf triple ( http://www.w3.org/tr/rdf-concepts/#section-triples ) refers to the subject -predicate-> object data structure used to denote for instance that alice knows bob. a quad extends this with additional context, also known as a named graph, which can be used for sub-setting a large graph.
sparql, the sparql protocol and rdf query language, provides a mechanism to query many rdf resources, typically held within a triple/quad store.
the openrdf project have produced an interface called rdf sail (storage and inference layer) – the blueprints property graph interface from tinkerpop enables a graphsail to be used on top of graph databases such as neo4j, orientdb and titan.
visualizations
this is quite a big area and is a hot topic, for neo4j users a graphviz plugin is available plus the following page makes a good starting point http://www.neo4j.org/develop/visualize
note: the gephi visualization tool has been tested with neo4j 1.5, i didn’t have any luck with a 1.7.2 database.
further reading
the forthcoming o’reilly book graph databases ( http://graphdatabases.com ) goes into depth on the subject with a focus on neo4j.
credits
listing 5 is derived from work by paul king ( https://twitter.com/paulk_asert ) published in the public domain at http://groovyconsole.appspot.com/script/245001 .
Published at DZone with permission of Robin Bramley, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments