Polyglot Persistence and Query with Gremlin
Curator's note: this article was co-authored by Marko Rodriguez and Stephen Mallette over at the Aurelius blog.
Complex data storage architectures are typically not grounded in a single database. In these environments, data is highly disparate: it exists in many forms, is aggregated and duplicated at different levels, and, in the worst case, its meaning is not clearly understood. Environments featuring disparate data can present challenges to those seeking to integrate it for purposes of analytics, ETL (extract-transform-load), and other business services. Having easy ways to work with data across these types of environments enables the rapid engineering of data solutions.
Some of this data disparity arises from the need to store data in different database types, so as to take advantage of the specific benefits that each type offers. Examples of different database types include (please see Getting and Putting Data from a Database):
- Relational database: a relational database, such as MySQL, Oracle, or Microsoft SQL Server, organizes data into tables with rows and columns, using a schema to help govern data integrity.
- Document store: a document-oriented database, such as MongoDB, CouchDB, or RavenDB, organizes data into the concept of a document, which is typically semi-structured as nested maps and encoded in some format such as JSON.
- Graph database: a graph is a data structure that organizes data into the concepts of vertices and edges. Vertices might be thought of as “dots” and edges might be thought of as “lines”, where the lines connect those dots via some relationship. Graphs represent a very natural way to model real-world relationships between different entities. Examples of graph databases are Titan, Neo4j, OrientDB, DEX, and InfiniteGraph.
Gremlin is a domain-specific language (DSL) for traversing graphs. It is built using the metaprogramming facilities of Groovy, a dynamic programming language for the Java Virtual Machine (JVM). In the same way that Gremlin builds upon Groovy, Groovy builds upon Java, providing an extended API and programmatic shortcuts that cut down on the verbosity of Java itself.
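As a rough, self-contained illustration of that layering, compare Java-style iteration (which is also valid Groovy) with the closure style that Gremlin sessions lean on throughout this article:

```groovy
// Java-style iteration: valid Groovy, but verbose
List<String> users = new ArrayList<String>();
users.add("1");
users.add("2");
users.add("3");
for (String u : users) {
    System.out.println(u);
}

// the same operation in idiomatic Groovy, the style used in Gremlin sessions
def sameUsers = ["1", "2", "3"]
sameUsers.each { println it }
```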
Gremlin comes equipped with a terminal, also known as a REPL or CLI, which provides an interface through which the programmer can interactively traverse the graph. Given Gremlin's role as a DSL for graphs, interacting with a graph represents the typical usage of the terminal. However, given that the Gremlin terminal is actually a Groovy terminal, the full power of Groovy is available as well:
- Access to the full APIs for Java and Groovy
- Access to external JARs (i.e., third-party libraries)
- Gremlin and Groovy's syntactic sugar
- An extensible programming environment via metaprogramming
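For example, nothing prevents mixing plain JDK classes into a terminal session. A minimal sketch (terminal output omitted) of treating the Gremlin prompt as an ordinary Groovy shell:

```
gremlin> // the terminal is a plain Groovy shell, so any JVM class is available
gremlin> import java.text.SimpleDateFormat
gremlin> fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
gremlin> fmt.parse("2013-01-01T20:36:26.804Z").time
```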
With these capabilities in hand, Gremlin presents a way to interact with a multi-database environment with great efficiency. The following sections detail two different use cases, where Gremlin acts as an ad-hoc data workbench for rapid development of integrated database solutions centered around a graph.
Polyglot Persistence
Loading data into a graph from a different data source might take some careful planning. The formation of a load strategy is highly dependent on the size of the data, its source format, the complexity of the graph schema, and other environmental factors. In cases where the complexity of the load is low, such as scenarios where the data set is small and the graph schema simplistic, the load strategy might be to utilize the Gremlin terminal to load the data.
MongoDB as a Data Source
Consider a scenario where the source data resides in MongoDB. The source data itself contains information which indicates a “follows” relationship between two users, similar to the concept of a user following another user on Twitter. Unlike graphs, document stores such as MongoDB do not maintain a notion of linked objects and therefore make it difficult to represent the network of users for analytical purposes.
{ "_id" : objectid("4ff74c4ae4b01be7d54cb2d3"), "followed" : "1", "followedby" : "3", "createdat" : isodate("2013-01-01t20:36:26.804z") } { "_id" : objectid("4ff74c58e4b01be7d54cb2d4"), "followed" : "2", "followedby" : "3", "createdat" : isodate("2013-01-15t20:36:40.211z") } { "_id" : objectid("4ff74d13e4b01be7d54cb2dd"), "followed" : "1", "followedby" : "2", "createdat" : isodate("2013-01-07t20:39:47.283z") }
This kind of data set translates easily to a graph structure. The following diagram shows how the document data in MongoDB would be expressed as a graph.
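In brief, the three documents above reduce to three user vertices connected by three directed “follows” edges:

```
user 3 --follows--> user 1
user 3 --follows--> user 2
user 2 --follows--> user 1
```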
To begin the graph loading process, the Gremlin terminal needs to have access to a client library for MongoDB. GMongo is just such a library and provides an expressive syntax for working with MongoDB in Groovy. The GMongo JAR file and its dependency, the Mongo Java driver JAR, must be placed in the GREMLIN_HOME/lib directory. With those files in place, start Gremlin with:

```
GREMLIN_HOME/bin/gremlin.sh
```
Gremlin automatically imports a number of classes during its initialization process. The GMongo classes will not be part of those default imports; classes from external libraries must be explicitly imported before they can be utilized. The following code demonstrates the import of GMongo into the terminal session and then the initialization of connectivity to the running MongoDB “network” database.
```
gremlin> import com.gmongo.GMongo
==>import com.tinkerpop.gremlin.*
...
==>import com.gmongo.GMongo
gremlin> mongo = new GMongo()
==>com.gmongo.GMongo@6d1e7cc6
gremlin> db = mongo.getDB("network")
==>network
```
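As an aside, GMongo mirrors the underlying Mongo Java driver's constructors, so connecting to something other than the default localhost instance is a one-line change (the host name below is a hypothetical example):

```
gremlin> // hypothetical remote MongoDB instance rather than the localhost default
gremlin> mongo = new GMongo("mongo.example.com", 27017)
gremlin> db = mongo.getDB("network")
```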
At this point, it is possible to issue any number of MongoDB commands to bring that data into the terminal.
```
gremlin> db.follows.findOne().followed
==>followed=1
gremlin> db.follows.find().limit(1)
==>{ "_id" : { "$oid" : "4ff74c4ae4b01be7d54cb2d3"} , "followed" : "1" , "followedBy" : "3" , "createdAt" : { "$date" : "2013-01-01T20:36:26.804Z"}}
```
The steps for loading the data into a Blueprints-enabled graph (in this case, a local Titan instance) are as follows.
```
gremlin> g = TitanFactory.open('/tmp/titan')
==>titangraph[local:/tmp/titan]
gremlin> // first grab the unique list of user identifiers
gremlin> x=[] as Set; db.follows.find().each{x.add(it.followed); x.add(it.followedBy)}
gremlin> x
==>1
==>3
==>2
gremlin> // create a vertex for the unique list of users
gremlin> x.each{g.addVertex(it)}
==>1
==>3
==>2
gremlin> // load the edges
gremlin> db.follows.find().each{g.addEdge(g.v(it.followedBy), g.v(it.followed), 'follows', [followsTime:it.createdAt.getTime()])}
gremlin> g.V
==>v[1]
==>v[3]
==>v[2]
gremlin> g.E
==>e[2][2-follows->1]
==>e[1][3-follows->2]
==>e[0][3-follows->1]
gremlin> g.e(2).map
==>{followsTime=1341607187283}
```
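One caveat with the script above is that re-running it would attempt to create duplicate vertices. A hedged variation that makes the vertex-creation step safe to re-run, assuming `g.v()` returns null for an unknown identifier as it does in Blueprints:

```
gremlin> // only add a vertex when one with that identifier does not already exist
gremlin> db.follows.find().each{ [it.followed, it.followedBy].each{ uid -> if (g.v(uid) == null) g.addVertex(uid) } }
```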
This method for graph-related ETL is lightweight and low-effort, making it a fit for a variety of use cases that stem from the need to quickly get data into a graph for ad-hoc analysis.
MySQL as a Data Source
The process for extracting data from MySQL is not so different from MongoDB. Assume that the same “follows” data is in MySQL in a four-column table called “follows”:
| id | followed | followed_by | created_at |
|---|---|---|---|
| 10001 | 1 | 3 | 2013-01-01T20:36:26.804Z |
| 10002 | 2 | 3 | 2013-01-15T20:36:40.211Z |
| 10003 | 1 | 2 | 2013-01-07T20:39:47.283Z |
Aside from some field name formatting changes and the “id” column being a long value as opposed to a MongoDB identifier, the data is the same as in the previous example and presents the same problems for network analytics as MongoDB did.
Groovy SQL is straightforward in its approach to accessing data over JDBC. To make use of it inside of the Gremlin terminal, the MySQL JDBC driver JAR file must be placed in the GREMLIN_HOME/lib directory. Once that file is in place, start the Gremlin terminal and execute the following commands:
```
gremlin> import groovy.sql.Sql
...
gremlin> sql = Sql.newInstance("jdbc:mysql://localhost/network", "username", "password", "com.mysql.jdbc.Driver")
...
gremlin> g = TitanFactory.open('/tmp/titan')
==>titangraph[local:/tmp/titan]
gremlin> // first grab the unique list of user identifiers
gremlin> x=[] as Set; sql.eachRow("select * from follows"){x.add(it.followed); x.add(it.followed_by)}
gremlin> x
==>1
==>3
==>2
gremlin> // create a vertex for the unique list of users
gremlin> x.each{g.addVertex(it)}
==>1
==>3
==>2
gremlin> // load the edges
gremlin> sql.eachRow("select * from follows"){g.addEdge(g.v(it.followed_by), g.v(it.followed), 'follows', [followsTime:it.created_at.getTime()])}
gremlin> g.V
==>v[1]
==>v[3]
==>v[2]
gremlin> g.E
==>e[2][2-follows->1]
==>e[1][3-follows->2]
==>e[0][3-follows->1]
gremlin> g.e(2).map
==>{followsTime=1341607187283}
```
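Groovy SQL also supports parameterized queries, which makes it easy to narrow the extraction without string concatenation. A small sketch that pulls only “follows” rows after a cutoff date (the cutoff value is an illustrative assumption):

```
gremlin> // '?' placeholders are bound from the list argument to eachRow
gremlin> cutoff = java.sql.Timestamp.valueOf("2013-01-05 00:00:00")
gremlin> sql.eachRow("select * from follows where created_at > ?", [cutoff]){ println "${it.followed_by} follows ${it.followed}" }
```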
Aside from some data access API differences, there is little separating the script that loads the data from MongoDB and the script that loads the data from MySQL. Both examples demonstrate options for data integration that carry little cost and effort.
Polyglot Queries
A graph database is likely accompanied by other data sources, which together represent the total data strategy for an organization. With a graph established and populated with data, engineers and scientists can utilize the Gremlin terminal to query the graph and develop algorithms that will become the basis for future application services. An issue arises when the graph does not contain all the data that the Gremlin user needs to do their work.
In these cases, it is possible to use the Gremlin terminal to execute what can be thought of as a polyglot query. A polyglot query blends data together from a variety of data sources and data storage types to produce a single result set. The concept of the polyglot query can be demonstrated by extending the last scenario, where “follows” data was migrated to a graph from MongoDB. Assume that there is another collection in MongoDB called “profiles”, which contains user demographic data, such as name, age, etc. Using the Gremlin terminal, this “missing data” can be made part of the analysis.

```
gremlin> // a simple query within the graph
gremlin> g.v(1).in
==>v[3]
==>v[2]
gremlin> // a polyglot query that incorporates data from the graph and MongoDB
gremlin> g.v(1).in.transform{[userId:it.id, userName:db.profiles.findOne(uid:it.id).name]}
==>{userId=3, userName=willis}
==>{userId=2, userName=arnold}
```
The first Gremlin statement above represents a one-step traversal, which simply asks to see the users who follow vertex “1”. Although it is now clear how many users follow this vertex, the results are not terribly meaningful: they are only a list of vertex identifiers, and given the example thus far, there is no way to expand those results, as that data represents the total data in the graph. To really understand these results, it would be good to grab the name of each user from the “profiles” collection in MongoDB and blend that attribute into the output. The second Gremlin statement, the polyglot query, does just that. It expands that limited view of the data by performing the same traversal and then reaching out to MongoDB to find each user's name in the “profiles” collection.
The anatomy of the polyglot query is as such:

- `g.v(1).in` – get the incoming vertices of vertex 1
- `transform{...}` – for each incoming vertex, process it with a closure that produces a map (i.e., a set of key/value pairs) for each vertex
- `[userId:it.id,` – use the “id” of the vertex as the value of the “userId” key in the map
- `userName:db.profiles.findOne(uid:it.id).name]` – blend in the user's name by querying MongoDB with `findOne()` to look up a “profiles” document, grabbing the value of the “name” key from that document and making that the value of the “userName” field in the output
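One practical refinement, sketched below: if a vertex has no matching “profiles” document, `findOne()` returns null and the query above throws a NullPointerException. Groovy's safe-navigation and Elvis operators make the polyglot query tolerant of that gap (the "unknown" fallback is an illustrative choice):

```
gremlin> // guard against users that have no profile document in MongoDB
gremlin> g.v(1).in.transform{[userId:it.id, userName:db.profiles.findOne(uid:it.id)?.name ?: "unknown"]}
```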
With the names of the users included in the results, the final output becomes more user friendly, perhaps allowing greater insights to surface.
Conclusion
Loading data into the graph and gathering data not available in the graph itself are two examples of the flexibility of the Gremlin terminal, but other use cases exist:
- Write the output of an algorithm to a file or database for ad-hoc analysis in other tools like Microsoft Excel, R, or business intelligence reporting tools.
- Read text-based data files from the file system (e.g., CSV files) to generate graph data (a small sketch of this appears after this list).
- Traversals that build in-memory maps of significant size could benefit from using MapDB, which has map implementations backed by disk or off-heap memory.
- Validate traversals and algorithms before committing to a particular design by building a small “throwaway” graph from a subset of external data that is relevant to what will be tested. This approach is also relevant to basic ad-hoc analysis of data that may not yet be in a graph, but would benefit from a graph data structure and the related toolsets available.
- Not all graph data requires a graph database. Gremlin supports GraphML, GraphSON, and GML as file-based graph formats, which can be readily loaded into an in-memory TinkerGraph. Utilize Gremlin to analyze these graphs using path expressions in ways not possible with typical graph analysis tools like igraph, NetworkX, JUNG, etc.
- “Data debugging” is possible given Gremlin's rapid turnaround between query and result. Traversing the graph from the Gremlin terminal to make sure the data was loaded correctly is important for ensuring that the data was properly curated.
- Access to data need not be limited to locally accessible files and databases. The same techniques for writing and reading data to and from those resources can be applied to third-party web services and other APIs, using Groovy's HTTPBuilder.
- Pull data into a graph and output it as GraphML or another format, which can be visualized in Cytoscape, Gephi, or other graph visualization tools.
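As a brief, hedged illustration of the CSV bullet above, the sketch below builds a TinkerGraph from a hypothetical follows.csv (assumed to contain headerless "followed,followedBy" rows) and writes it out as GraphML via the Blueprints GraphMLWriter:

```
gremlin> g = new TinkerGraph()
gremlin> // each CSV row becomes an edge; vertices are created on first sight of an identifier
gremlin> new File("/tmp/follows.csv").splitEachLine(","){ f -> f.each{ uid -> if (g.v(uid) == null) g.addVertex(uid) }; g.addEdge(null, g.v(f[1]), g.v(f[0]), "follows") }
gremlin> // serialize for visualization in Gephi or Cytoscape
gremlin> com.tinkerpop.blueprints.util.io.graphml.GraphMLWriter.outputGraph(g, new FileOutputStream("/tmp/follows.xml"))
```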
The power and flexibility of Gremlin and Groovy make it possible to seamlessly interact with disparate data. This capability enables analysts, engineers, and scientists to utilize the Gremlin terminal as a lightweight workbench in a lab of data, making it possible to do rapid, ad-hoc analysis centered around graph data structures. Moreover, as algorithms are discovered, designed, and tested, those Gremlin traversals can ultimately be deployed into the production system.