Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

# Groovy Graphs with Grails and Gremlins

· Java Zone
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Download Microservices for Java Developers: A hands-on introduction to frameworks and containers. Brought to you in partnership with Red Hat.

This article provides a practical introduction to graph databases with a focus on the Groovy ecosystem. It explores Grails GORM support for Neo4j as well as query comparisons between the Neo4j Cypher language, SQL and the Gremlin graph DSL.

# Graph theory and terminology

A graph is a way of modeling connections or relationships between nodes or vertices. These relationships, or edges, may be directed (a digraph in mathematical terminology) as shown in Figure 1. Parent-child relationships may form a tree; with additional associations these form a network.

Figure 1: a directed graph

A weighted graph (Figure 2) has a weight on every edge. Weighted networks can be used for modeling road networks with the weights representing distances; in conjunction with shortest-path graph algorithms such as Dijkstra’s algorithm, these are used for example for minimizing the distance travelled by delivery lorries.

Figure 2: a weighted graph

Figure 3 shows a loop, which is a self-referencing edge.

Figure 3: a graph loop

Figure 4 shows a Property Graph; this type of graph allows properties on both nodes and edges.

Figure 4: A Property Graph

## Sample graph

One of the main examples that is often used in current introductory material is a social graph much like Figure 4. Instead for the first practical example we’ll explore a different scenario: the relationships between a development project, the skills required and the team members involved as shown in Figure 5 (projects are denoted in orange, skills in blue and employees in green).

Figure 5: Sample skill graph

## What is a graph database?

Graph databases use property graphs, where both nodes and edges can have properties. One of the terms that is frequently used to describe graph databases is ‘index-free adjacency’, meaning that each node knows the location of the nodes that it is connected to without needing to perform an index lookup. In fact, this is often a key distinction between a graph database and a non-native store.

### WHEN SHOULD YOU USE A GRAPH DATABASE?

Now many people may jump on the bandwagon to use a ‘shiny new’ technology. The lesson that people should be taking away from the NoSQL movement is that the most appropriate data store technology should be utilized. Non-functional requirements are very important, as the solution will need to be operated in a production environment.

The best fit is for highly associative data sets where the (semi-structured) relationships are of critical importance to the value of the data. Graph databases may offer significant performance improvements over a relational model for this type of data as edges can be traversed at O(1) rather than requiring significant amounts of I/O operations to perform joins.

For the practical aspects of the article we’ll primarily use Neo4j (http://neo4j.org), which is described as the world’s leading graph database, before looking at some of the alternatives.

All the code is available on GitHub fromhttps://github.com/rbramley/GroovyMagGraphs

# Using a graph database in a Grails application

For the first example, we’ll explore a domain model that encapsulates a ‘project database’. We’ll start by installing the GORM-based plugin for Neo4j (http://www.grails.org/plugin/neo4j – the plugin manual is available athttp://springsource.github.io/grails-data-mapping/neo4j/manual/guide/index.html):

`grails install-plugin neo4j`

Then configure the datasource for the plugin along the lines of Listing 1, but choosing an appropriate data file location for your system.

```
grails {
neo4j {
type = "embedded"
location = "/usr/local/Cellar/neo4j/data"
}
}```
Listing 1: Neo4j plugin datasource configuration

We’ll use scaffolded controllers and views for speed, so let’s define the domain model to match Figure 5. The domain classes required are Skill shown in Listing 2, Employee shown in Listing 3 and Project as per Listing 4 – they all inform GORM that they are for Neo4j by using `static mapWith = "neo4j"`.

```package neo

import groovy.transform.ToString

@ToString(includes='name')
class Skill {
String name
static constraints = {}
static mapWith = "neo4j"
}```
Listing 2: a simple Skill representation

```package neo

import groovy.transform.ToString

@ToString(includes='name')
class Employee {
String name
String title
Date dob

static hasMany = [ workedOn : Project, knows : Skill ]

static constraints = {}
static mapWith = "neo4j"
}```
Listing 3: a simple Employee representation

```package neo

import groovy.transform.ToString

@ToString(includes='projectName')
class Project {
String customerName
String projectName
Date startDate
Date endDate

static hasMany = [ members : Employee, requiredSkills : Skill ]

static constraints = {}
static belongsTo = Employee
static mapWith = "neo4j"
}```
Listing 4: a simple Project representation

In summary it is very easy to create the simple Grails object domain model with one-to-many and many-to-many relationships and map them with Neo4j.

Following this you’ll need to create the controllers, set to scaffold, run, and populate data. Figure 6 illustrates the view of the Figure 5 graph as seen from the Project view action – job done!

Figure 6: An application view of the graph from Figure 5

The simple project database was such a success that HR want to extend it to provide a Skills Matrix. This new requirement will necessitate a richer domain model to capture more details about employee skill levels and experience; in a relational database world we could happily augment the many-to-many link table between Employee and Skill – with a Property Graph we can add properties to an edge.

This is where we’ve reached the limitations of the GORM plugin, which is designed for 80% use cases (seehttp://stackoverflow.com/questions/10647834/architecting-a-neo4j-based-application-stick-to-vanilla-api-using-plain-nodes).

So it looks like the best way to use Neo4j to exploit the power of the graph with Grails is to forego GORM/scaffolding. Which isn’t that much of an issue for ‘real’ projects as opposed to prototyping.

## Spring Data to the rescue

If you don’t want to couple your application directly to the native datastore APIs, then Spring Data is one option. Spring Data provides a set of data access APIs with implementations for a number of NoSQL technologies.

The Spring Data Neo4j support includes `@RelationshipEntity` annotations (http://static.springsource.org/spring-data/data-neo4j/docs/2.0.0.RELEASE/reference/html/#d0e1889) for adding properties to graph edges.

# Six degrees of Kevin Bacon

## Neo4j & Cypher

Neo4j also provides a native traversal API and a declarative query language called Cypher.

For this example we’ll manipulate the graph using a Groovy script and we’ll play a game of ’6 degrees of Kevin Bacon’ (http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon) with a small network.

The network we’ll define is illustrated in Figure 7, the blue shades signify movies and the green shades signify actors, with the exception of Kevin Bacon who is denoted in orange. The darker shades are first-degree connections and the lighter shades are second-degree connections.

Figure 7: A small Kevin Bacon graph

## Script goodies

Groovy missing method meta-programming is used in Listing 5 to make establishing relationships feel more natural e.g. `movie.featured(actor)`. This is using an outbound edge from a movie node into an actor node.

```// an enum helper
enum MyRelationshipTypes implements RelationshipType { featured }

// metaclass magic
Node.metaClass {
propertyMissing { String name, val -> delegate.setProperty(name, val) }
propertyMissing { String name -> delegate.getProperty(name) }
methodMissing { String name, args -> delegate.createRelationshipTo(args[0], MyRelationshipTypes."\$name") }
}```
Listing 5: Groovy DSL meta-programming

Additionally Listing 6 defines a helper method so that we can specify the graph using a list of simple triples (e.g. ‘Footloose-FEATURED-Kevin Bacon’) rather than having to write lots of boilerplate code to create each node and define the edges.

An alternative would be to use the GEOFF notation, which is closely aligned with the Neo4j Cypher query syntax (more information later) – however this requires an additional plugin.

```// keep the graph creation DRY
def getOrCreateNode(db, map, key, type) {
def node

if(map.containsKey(key)) {
node = map.get(key)
} else {
node = db.createNode()
node.name = key
node.type = type

map.put(key, node)
}

node
}

// helper method to create the graph
void createGraph(db, gd) {
def movieMap = [:]
def actorMap = [:]

gd.each { str ->
def movie, actor

def parts = str.split('-')
def movieKey = parts[0]
def actorKey = parts[2]

movie = getOrCreateNode(db, movieMap, movieKey, 'Movie')
actor = getOrCreateNode(db, actorMap, actorKey, 'Actor')

// set up the edge
movie.featured(actor)
}
}```
Listing 6: Graph construction code

## Neo4j setup

Listing 7 shows that the script uses an impermanent database, from the Neo4j testing utils module, to simplify re-execution rather than having to tidy up the data directory used by an embedded instance. It also requires a Lucene index for nodes so that we can search by name.

```// Set up an impermanent test instance (this saves having to write disk clean up)
def db = new ImpermanentGraphDatabase([:], [new LuceneIndexProvider()], [], [new SoftCacheProvider()])

def nodeAutoIndexer = db.index().getNodeAutoIndexer()
nodeAutoIndexer.startAutoIndexingProperty("name")
nodeAutoIndexer.setEnabled(true)```
Listing 7: Neo4j graph database setup

## Data population

With the helper methods from Listing 6 and the data in Listing 8, this makes the node creation mainly a case of demarcating the transaction as shown in Listing 9.

```// Describe a graph as triples e.g. Movie-FEATURED-Actor
def graphDescription = [
'Footloose-FEATURED-Kevin Bacon',
'Footloose-FEATURED-John Lithgow',
'Shrek-FEATURED-Mike Myers',
'Shrek-FEATURED-John Lithgow',
'Shrek-FEATURED-Cameron Diaz',
"Charlie's Angels-FEATURED-Cameron Diaz",
"Charlie's Angels-FEATURED-Bill Murray",
'54-FEATURED-Mike Myers',
'54-FEATURED-Neve Campbell',
'Wild Things-FEATURED-Neve Campbell',
'Wild Things-FEATURED-Bill Murray',
'Wild Things-FEATURED-Kevin Bacon'
]```
Listing 8: the representation of the graph from Figure 7

```def tx
try {
tx = db.beginTx()

db.index().forNodes("node_auto_index")

// create the nodes for the fixture data
createGraph(db, graphDescription)

tx.success()
} finally {
tx?.finish()
}```
Listing 9: Transaction for index and graph creation

## Cypher

Cypher is a declarative query language, and for more information I’d recommend the very good reference guide which provides executable samples generated from tests: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html

Now everything is in place and we’ve populated the data, it’s time to perform the query in Listing 10 to discover the second-degree connections in the graph and the routes to them as can be seen in Figure 8. We tell Cypher the starting node to find in the index, then how to match the output using 4 hops (Movie-Actor-Movie-Actor) and which data properties to output.

```def query = '''
start k=node:node_auto_index(name = "Kevin Bacon")
match (k)--(m)--(a)--(n)--(b)
return k.name, m.name, a.name, n.name, b.name
'''

// execute the query
ExecutionEngine engine = new ExecutionEngine(db)
ExecutionResult result = engine.execute(query)
println result```
Listing 10: Constructing and executing a Cypher query

Figure 8: Cypher results toString

The slowest part of the query from Listing 10 is the index lookup to find the starting point for the traversal. This could easily optimized if the use case were to always start with Kevin Bacon, as we could make him the root node of the graph. Adding the relationship directions (e.g. `match (k)<--(m)-->(a)<--(n)-->(b)`) is also beneficial.

An alternative but less desirable approach is to use the node ID, but this isn’t good practice as the ID is an internal identifier and you should be aware that Node IDs are reused within Neo4j.

The index lookup, in my tests with the small dataset, added ~4ms in comparison to e.g. `start k=node(1)`.

An alternative way to specify a 2nd degree query (which in our case is 4 hops because the movies are modeled as nodes) is shown in Listing 11. This query is simpler to write and permits variable length hops (as per Listing 12 for 1st and 2nd degree Actors with ordered output in Figure 9), but requires additional work to make sense of the linking movies and actors.

```start k=node:node_auto_index(name = "Kevin Bacon")
MATCH p = (k)-[:featured*4..4]-(x)
RETURN p, x.name```
Listing 11: Match variable-length edges with a hop count

```start k=node:node_auto_index(name = "Kevin Bacon")
MATCH p = (k)-[:featured*2..4]-(x)
WHERE x.type = 'Actor'
RETURN p, x.name
ORDER BY x.name```
Listing 12: Where clause and order by

Figure 9: Results with the graph path

## Comparison to SQL

The Cypher graph query in Listing 10 is significantly easier to specify in Cypher that its equivalent SQL query. In a relational database the necessary normalized table structure as shown in Figure 10, introduces a link table to break the many-to-many relationship between movies and actors into two separate one-to-many relationships. i.e. a movie can feature many actors and an actor can appear in many movies. Consequently the equivalent SQL query requires a lot of extra constructs as detailed in Listing 13 and the exclusions are required to prevent Kevin Bacon and first-degree connections appearing in the results, as seen in Figure 11, as second-degree connections.

Figure 10: the Entity Relationship Diagram

The full DDL and DML for this example are available from the GitHub repository, tested using MySQL.

```select k.actor_name, m.movie_name, a.actor_name, n.movie_name, b.actor_name
from
actor k, actor a, actor b,
movie m, movie n,
movie_actor mk, movie_actor ma, movie_actor na, movie_actor nb
where
k.actor_name = 'Kevin Bacon'
and k.actor_id = mk.actor_id
and mk.movie_id = m.movie_id
and ma.movie_id = m.movie_id
and ma.actor_id = a.actor_id
and na.actor_id = a.actor_id
and na.movie_id = n.movie_id
and nb.movie_id = n.movie_id
and nb.actor_id = b.actor_id
and k.actor_id <> a.actor_id
and k.actor_id <> b.actor_id
and a.actor_id <> b.actor_id
and m.movie_id <> n.movie_id;```
Listing 13: SQL query to retrieve second-degree connections

Figure 11: Result set

The response time of the unoptimized SQL query isn’t too bad in this case (the explain plan is shown in Figure 12), but we only have a miniscule data set – if you want to really try this out at scale and at 6 degrees, you can access the full IMDB actors and actresses datasets from http://www.imdb.com/interfaces, though the data will need pre-processing to get it into a loadable form for either relational or graph databases.

Figure 12: MySQL explain plan

# Further areas of interest

## Neo4j Event Handlers

The org.neo4j.graphDB.event package defines a `TransactionEventHandler<T>`interface that can be used to trigger custom behavior before or after commits and after a rollback has occurred. This is beyond the scope of this article.

## Gremlin

Tinkerpop (http://www.tinkerpop.com) provides a set of Open Source graph projects; Blueprints is a property graph model interface with provided implementations. Databases that implement the Blueprints interfaces automatically support Blueprints-enabled applications.

Tinkerpop also created Gremlin (https://github.com/tinkerpop/gremlin/wiki), which is a domain specific language (see http://gremlindocs.com) for traversing property graphs.

Gremlin provides an interactive shell that utilizes the Groovy Shell and which we’ll now use for an example.

### GREMLIN EXAMPLE

The data for this example is available in the companion GitHub repository and is a GraphML representation of the graph in Figure 7, as it is one of the standard formats that Gremlin supports. For this example I used Titan version 0.2.0 (there is a little more information on Titan in the following section).

Listing 14 illustrates an interactive approximate equivalent of the Cypher query in Listing 10.

```titan-0.2.0\$ bin/gremlin.sh

\,,,/
(o o)
-----oOOo-(_)-oOOo-----
gremlin>g = TitanFactory.open('/tmp/titan')
==>titangraph[local:/tmp/titan]
gremlin>g.createKeyIndex('name', Vertex.class)
==>null
==>null
gremlin> kevin = g.V('name','Kevin Bacon').next()
==>v[4]
gremlin> kevin.map()
==>name=Kevin Bacon
==>type=Actor
gremlin> kevin.in('featured').map
==>{name=Footloose, type=Movie}
==>{name=Wild Things, type=Movie}
gremlin> kevin.in('featured').out('featured').except([kevin]).name
==>John Lithgow
==>Bill Murray
==>Neve Campbell
gremlin> kevin.in('featured').out('featured').dedup()
==>v[4]
==>v[12]
==>v[32]
==>v[40]
gremlin> kevin.in('featured').out('featured').in('featured').out('featured').except([g.v(4), g.v(12), g.v(32), g.v(40)]).dedup().name
==>Mike Myers
==>Cameron Diaz```
Listing 14: an interactive Gremlin session

## Alternative graph stores

### TITAN

Aurelius Titan (http://thinkaurelius.github.io/titan/) provides a choice of persistence backends including Cassandra and HBase. It provides support for search using ElasticSearch or Lucene. Titan leverages Gremlin and the getting started examples make use of the Groovy Gremlin shell.

### ORIENTDB

OrientDB (http://www.orientdb.org) provides a graph + document database with fulltext indices.

RDF (http://www.w3.org/TR/2004/REC-rdf-primer-20040210/) and SPARQL (http://www.w3.org/TR/rdf-sparql-query/) are two Semantic Web standards from the W3C (World Wide Web Consortium).

RDF (Resource Descriptor Framework) provides a mechanism for describing a resource and there are related standards such as RDF/a for embedding metadata attributes within a document.

An RDF Triple (http://www.w3.org/TR/rdf-concepts/#section-triples) refers to the Subject -Predicate-> Object data structure used to denote for instance that Alice knows Bob. A quad extends this with additional context, also known as a named graph, which can be used for sub-setting a large graph.

SPARQL, the SPARQL Protocol and RDF Query Language, provides a mechanism to query many RDF resources, typically held within a Triple/Quad Store.

The OpenRDF project have produced an interface called RDF SAIL (Storage And Inference Layer) – the Blueprints property graph interface from Tinkerpop enables a GraphSail to be used on top of graph databases such as Neo4j, OrientDB and Titan.

## Visualizations

This is quite a big area and is a hot topic, for Neo4j users a GraphViz plugin is available plus the following page makes a good starting pointhttp://www.neo4j.org/develop/visualize

Note: the Gephi visualization tool has been tested with Neo4j 1.5, I didn’t have any luck with a 1.7.2 database.

The forthcoming O’Reilly book Graph Databases (http://graphdatabases.com) goes into depth on the subject with a focus on Neo4j.

# Credits

Listing 5 is derived from work by Paul King (https://twitter.com/paulk_asert) published in the public domain athttp://groovyconsole.appspot.com/script/245001.

Download Building Reactive Microservices in Java: Asynchronous and Event-Based Application Design. Brought to you in partnership with Red Hat

Topics:

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of Robin Bramley, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.