Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Implementing Temporal Graphs With Apache TinkerPop and HGraphDB

DZone's Guide to

Implementing Temporal Graphs With Apache TinkerPop and HGraphDB

Learn how to use the up-and-coming graph framework Apache TinkerPop over big data storage like Apache HBase to simulate hierarchical structures.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

When most people think of big data, they often imagine loads of unstructured data. However, there is always some sort of structure or relationship within this data. Based on these relationships, there are one or more representation schemes best suited to handle this type of data. A common pattern seen in the field is hierarchy/relationship representation. This form of representation is adept at handling scenarios like complex business models, chains of events or plans, chains of stock orders in banks, user relations on social media websites, etc.

Storing and changing these hierarchical datasets can be a challenge, especially if the data is temporal (the time period attached to the data expresses when it was valid). One simple way to effectively store this data is with graphs. Often, these hierarchical datasets are in the form of a tree, which is just a specialized version of the graph. Storing data in this form has some inherent advantages such as easy representation/visualization of data, fast and well-known traversal algorithms proven across the industry, and easy manipulation of parts of data without changing or duplicating the entire datasets.

This article explains how to use an up-and-coming graph framework like Apache TinkerPop over big data storage like Apache HBase to simulate hierarchical structures. HBase was chosen as a storage layer because of its fast retrieval and scalability. Additionally, HBase is also suitable for handling low response times demanded by online applications while still having big data storage capabilities.

Apache TinkerPop

Apache TinkerPop is an open-source graph computing framework composed of various interoperable components. TinkerPop uses simple and intuitive terms like vertices (nodes of a graph representing data points) and edges (connections between these vertices representing relationships between data points) to represent the complex hierarchical data structure of nodes and relationships between them.

TinkerPop also comes with its own graph traversal language called Gremlin, which makes traversing nodes and searching for relations between them as easy as specifying vertices and looking for specific edges between them. More details on Apache TinkerPop are available here.

As mentioned earlier, Apache TinkerPop is comprised of interoperable components. The following diagram from Apache TinkerPop documentation will show this in more detail.

This interoperability means that any underlying storage and processing engine can be used to support scalability and meet required SLAs with industry-standard tools.

For instance, Apache Spark can be used as a graph processor that can take the load of processing large and complex graphs using standard Hadoop clusters.

Similarly, the underlying graph database can also be any storage platform like Apache HBase, which supports fast updates and retrieval together with storage of vast amounts of data.

This interoperability is all possible because of the provider APIs, which means that if there is an implementation of these APIs for the required platform, then it can be used.

There are various implementations for the provider API to support different underlying storages.

HGraphDB

One such implementation discussed in this article is HGraphDB, implemented by Robert Yokota. HGraphDB is an implementation of Apache TinkerPop 3 interface that uses Apache HBase as the underlying graph database. Like TinkerPop HGraphDB stores nodes and relationship between them as vertices and edges in HBase Tables.

HGraphDB uses a tall table approach to store graph data in Apache HBase. That is, each of the graph's nodes and relationships between them is stored in one HBase row as vertex and edge. The Unique Key (row IDs) for these rows is the combination of VertexID and EdgeID. HGraphDB also stores "created" and "updated" times for all vertices and edges along with multiple user-defined property columns for better management and searching of vertices and edges. These user-defined properties are what makes storing and updating temporal graphs easier without duplicating the entire graph.

Apart from this, HGraphDB also provides indexes for faster traversal of graphs. These indexes are based on user-defined properties, one per index.

HGraphDB uses five tables in Apache HBase, and their structure is explained in more detail below.

Vertex Table

This table stores all the vertices. Columns and their represented values for this table are as follows:

  • Label: Text label assigned to vertices for multiple types of vertices.

  • Created: Timestamp of when a vertex was created.

  • Updated: Timestamp of the last update for this vertex.

  • Property #: User-defined property for this vertex and its associated value. There can be as many property columns as required.

An example of one entry for a graph node is below:

Edge Table

This table stores all the edges between graph nodes, and weights or values (properties) assigned to these edges are stored as user-defined property columns. Columns and their represented values for this table are as follows:

  • Label: Text label assigned to vertices for multiple types of vertices.

  • Created: Timestamp of when an edge was created.

  • Updated: Timestamp of the last update for this edge.

  • From vertex: Vertex where this edge originates from.

  • To vertex: Destination vertex for this edge.

  • Property #: User-defined properties for this edge and its associated value. There can be as many property columns as required.

An example of one such edge entry is as below.

Vertex Index Table

This table stores indexes of all the vertices for faster searching vertices based on their property values.

Columns and their represented values for this table are as follows:

  • Key: Combination of vertex label, property for which this index is created, value of the property, and vertex ID.

  • Created at: Timestamp of the creation of this index value.

  • Vertex ID: ID of the vertex that is being indexed.

Following is an example of the table:

Edge Index Table

Similarly to the vertex index table, this table stores indexes for all edges for faster searching based on property values.

Columns and their represented values for this table are as follows:

  • Key: Combination of the vertex ID of the first vertex associated with this edge, the direction of edge related to the first vertex, the property for which this index is created, the label of this edge, the value of the property, and the vertex ID for the second vertex associated with this edge.

  • Created at: The timestamp of the creation of this index value.

Following is an example.

Index Metadata Table

This table contains metadata related to the indexes created on each property and specifies whether the index is currently active.

Columns and their represented values for this table are as follows:

  • Key: A unique value.

  • Created at: Timestamp of the creation of this index metadata value.

  • Updated at: Timestamp of the last update to this index metadata value.

  • State: Current state of this index (active or not).

Following is an example of the table.

More details and source code for HGraphDB are available here.

Working Project

By this point, you should have a good understanding of base components and their workings. The rest of this article will describe an example data generator and query module to show how to implement hierarchical structures using Apache HBase as the graph database. We are choosing HBase here because of its scalability and fast response times.

The code for this project is available here.

This project has two main components:

  1. Data generator: Generates test data/graph.
  2. Query engine: Queries the generated data to test performance and functionality.

Data Generator

The code in a Java class with the name datagen.java generates multiple hierarchies in the form of a balanced binary tree. It persists the data into Apache HBase using HGraphDB.

Following is a call to HGraphDB for generating a graph in HBase:

this.cfg = new HBaseGraphConfiguration()                
.setInstanceType(HBaseGraphConfiguration.InstanceType.DISTRIBUTED)
               .setGraphNamespace(HIERARCHY_PREFIX + HIERARCHY_NAME)
               .setCreateTables(Create)
               .setRegionCount(REGION_COUNT)
               .set(ZOOKEPER_QORUM, ZOOKEPER_QORUM_NODE)
               .set(ZOOKEPER_PARENT, ZOOKEPER_PARENT_NODE);
this.graph = HBaseGraph.open(cfg);

Parameters for the graph generated can be controlled by the constants file, which defines constants like the number of nodes, suffixes and prefixes used for vertex and edge names, etc.

The graph generated by the datagen has the following properties:

  • Each vertex has a VertexID: a combination of the prefix defined in constants file and the node number.
  • All vertices have one user-defined property called Created_on, a value of seconds from Epoch. This is a test property to demonstrate how to define properties of a vertex.
  • Similarly to vertices, edges have an EdgeID, a combination of prefixes defined in the constants file and edge number.
Vertex v = VertexBulkLoad.addVertex(T.id, NodePrefix + "_" + NodeNum, T.label, "Node", "created_on", HIERARCHY_CREATE_TIME);
  • Edges in this graph have two user-defined properties: Since and Till.
  • The Since property defines when an edge was created.
  • The Till property defines until when an edge was active.
  • Together, these two properties help maintain the temporal nature of a graph.
  • Apart from a static base graph, the datagen also generates some updates to the base graph, which acts as temporal data and shows a different state of the graph at a given point of time.
  • These temporal data points are created by creating new edges and manipulating Till property of existing edges.

Running this datagen will generate the required number of hierarchal datasets in HBase, each with a binary tree/graph complete with random temporal data.

Query Engine

After data is generated in HBase by datagen, the queryHierarchy Java file code queries graphs sitting in HBase using Apache TinkerPop.

The query engine in this code can get two types of data from graphs in HBase:

  1. Get the required root of a node at any given point in time.
  2. Get the entire hierarchy up to a given level as it appears at a given point in time.

Query Root

To get the all the nodes in the path from a given node to its root at any given point in time, the code performs the following series of steps.

  • First, get a graph instance. Use same HGraphDB command as a create graph but this time with create as false.
  • Once we have the graph instance, we can use Gremlin commands to find the requested vertex.
return Graph.vertex(NodeName);
  • Next, find all the incoming edges of this vertex and check their Since and Till properties to find the one edge that was active for the requested time.
  • This will give an active parent for the requested node at a specified time.
E1 = V.edges(D, "Parent", "since", HIERARCHY_CREATE_TIME,  TimeInterval[1]);
E2 = V.edges(D, "Parent", "Till", TimeInterval[0], HIERARCHY_END_TIME + UPDATE_INTERVAL);
  • Once the parent is found, repeat the same process as above to get the entire path Till root.

This simple setup demonstrated how to use edge properties to maintain and manipulate the same vertex for creating different hierarchies in time, i.e. having temporal graphs.

Query Hierarchy

Let's say you would like to find information about full lifecycle of a stock order. (i.e. you need to reconstruct all parent orders for this order). In this case, a query for entire hierarchy is needed for that day.

To get the entire hierarchy, the code performs the following series of steps

  • Same as with root query, first, get a graph instance. This can be done with the same HGraphDB command as a create graph but this time, with create as false.
  • Then, find any random vertex of this graph that was active for the given moment in time.
  • Once this vertex is found, perform a Query root operation as described above to get the root for this vertex for that given moment in time.
  • After the root for hierarchy is found for the given time, the entire hierarchy can be rebuilt.
  • Rebuilding the entire hierarchy tree is split into multiple threads using a threadpool. One vertex at a time is assigned to a thread from this pool for processing.
ExecutorService executorService = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
while (graphQueue.size() > 0){
  Vertex node = graphQueue.poll();
  if (node == null){
  if (graphQueue.size() > 0) {
     System.out.println("Level: " + Level + " > " );
     Level++;
     ArrayList<Future<ArrayList<Vertex>>> ProcessNodes = new ArrayList<>();
     final ArrayList<Vertex> VResFuture = new ArrayList<>();
     while(graphQueue.peek() != null) {
       final Vertex LevelNode = graphQueue.poll();
       ProcessNodes.add(executorService.submit(new Callable<ArrayList<Vertex>>() {
          @Override
          public ArrayList<Vertex> call() throws Exception {
              ArrayList<Vertex> VRes = new ArrayList<>();
              for (Vertex Child : getChild((HBaseVertex)LevelNode,TimeInterval)) {
                   VRes.add(Child);
                   VResFuture.add(Child);
               }
               return VRes;
          }
        }));
     }
     try {
        graphQueue.add(null);
        for (Future<ArrayList<Vertex>> TaskFuture: ProcessNodes ) {
            VList = TaskFuture.get();
            for (Vertex Child : VList) {
                if (ProcessedNodes.containsKey(Child.id().toString())) {
                    continue;
                }
                ProcessedNodes.put(Child.id().toString(), true);
                graphQueue.add(Child);
            }
  • Within this thread, check all the out edges of this vertex and then find the ones that were active at the required point of time.
  • After edges are found, check their out vertex and continue pushing those vertexes to the thread pool until the end of the hierarchy is reached.

Conclusion

Some of the takeaways from this article are:

  • A common perception is that big data is unstructured, but in most cases, there is some sort of structure to this data
  • In the field, hierarchal/relationship representation is more common.
  • Graphs are best for representing hierarchical data.
  • Graph frameworks like TinkerPop are the best, as they offer interoperable components.
  • HBase is the best storage layer when scalability and online applications type response time are required.
  • HGraphDB provides an easy implementation for using HBase as a graph database for the TinkerPop framework.

In conclusion, handling temporal hierarchical data is tricky and can take a large amount of processing power, storage, and subject expertise.

But with the help of tools like Apache TinkerPop and HGraphDB, it can be done with relative ease while still taking advantage of scalable storage engines like Apache HBase and the processing power of Apache Spark on a Hadoop cluster. In addition, using the TinkerPop framework and HGraphDB could be more productive, as a Gremlin-style traversal language is more intuitive for people across the industry and reduces the effort for building custom traversal logic.

Resources

  1. Apache TinkerPop
  2. HGraphDB
  3. Datagen and query engine
  4. Apache HBase

Please note that Apache TinkerPop and HGraphDB are not tested or included with Cloudera CDH. Further support or assistance with these Apache projects can be found in the Apache open source community and mailing lists for those projects.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
big data ,tutorial ,apache tinkerpop ,hgraphdb ,data visualization ,temporal graphs ,hierarchy

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}