
Memory Efficient Graph Data Processing

Learn how you can increase performance and decrease the strain on your system with a more efficient way of processing big segments of graph data.

A traditional database query extracts data in three steps: first we initialize the connection, then we fetch the data, and lastly we process the result set obtained from the query. In Neo4j, we usually follow the same principle. The sample code below walks through these steps. For a sample graph database, I have used the Neo4j movie database as described in the Neo4j tutorial.

// Location of the graph database
private String graphDbPath = "C:\\graphdb\\dump\\movie";
// The graph database service object to be used for our query
private static GraphDatabaseService graphDb;
private static boolean started;

// Simple method to initialize the service and return it to the caller
public GraphDatabaseService getGraphDatabaseService() {
    File file = new File(graphDbPath);
    graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(file);
    started = true;
    return graphDb;
}

// Iterates over the result set and prints every column of every row
private void processImperative(String queryStmt) {
    Result result = graphDb.execute(queryStmt);
    // Proceed while the result set has another row
    while (result.hasNext()) {
        // Store the current row in a map of column name to value
        Map<String, Object> nodeData = result.next();
        nodeData.forEach((k, v) -> System.out.println("Key : " + k + " Value : " + v));
    }
}

In the main method of our main class, SimpleNeo4jAPI, we can invoke the method above to process the node data. To keep the query simple, we limit the result to only 5 records.

public static void main(String[] args) {
    SimpleNeo4jAPI api = new SimpleNeo4jAPI();
    String queryStmt = "MATCH (people:Person)-[relatedTo]-(movie:Movie) RETURN people.name, Type(relatedTo), movie.title limit 5;";
    graphDb = api.getGraphDatabaseService();
    api.processImperative(queryStmt);
}

In this traditional, iterator-based approach, every call to next() hands back the current row as a Map<String, Object>. If there's a lot of data, this can increase memory consumption, because a full pass over the result set produces a map for every row.
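
To make the cost concrete, here is a minimal sketch that uses only the JDK. The class IteratorAllocationSketch and its rows helper are hypothetical stand-ins, not Neo4j code; they only mimic an API that returns one map per row, so we can count how many maps a full pass creates.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an iterator-style result; this is not Neo4j code.
class IteratorAllocationSketch {

    // Returns an iterator that builds a fresh map for every row and
    // records each creation in mapsCreated[0].
    static Iterator<Map<String, Object>> rows(int rowCount, long[] mapsCreated) {
        return new Iterator<Map<String, Object>>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < rowCount;
            }

            @Override
            public Map<String, Object> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Map<String, Object> row = new HashMap<>();
                row.put("people.name", "person-" + index);
                row.put("movie.title", "movie-" + index);
                index++;
                mapsCreated[0]++;
                return row;
            }
        };
    }

    // Walks the whole stand-in result and returns how many maps were created.
    static long countRowMaps(int rowCount) {
        long[] mapsCreated = {0};
        Iterator<Map<String, Object>> it = rows(rowCount, mapsCreated);
        while (it.hasNext()) {
            it.next();
        }
        return mapsCreated[0];
    }

    public static void main(String[] args) {
        // prints "maps created for 1000 rows: 1000"
        System.out.println("maps created for 1000 rows: " + countRowMaps(1_000));
    }
}
```

The count grows linearly with the row count, so at around 10 million rows the garbage collector has a lot of short-lived maps to clean up.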

While parsing graph data (~35 GB) with queries returning around 10 million records, I found that with this approach memory consumption increased and performance fell. Further exploration of the Neo4j library API revealed that the Result interface defines a nested ResultVisitor interface, which has a visit method, and that Result has an accept() method, which takes a ResultVisitor.

Here's a snippet from the Result interface:

/**
     * Visits all rows in this Result by iterating over them.
     *
     * This is an alternative to using the iterator form of Result. Using the visitor is better from a object
     * creation perspective.
     *
     * @param visitor the ResultVisitor instance that will see the results of the visit.
     * @param <VisitationException> the type of the exception that might get thrown
     * @throws VisitationException if the {@code visit(ResultRow)} method of {@link ResultVisitor} throws such an
     * exception.
     */
    <VisitationException extends Exception> void accept( ResultVisitor<VisitationException> visitor )
            throws VisitationException;

The corresponding visit method of the ResultVisitor interface:

/**
     * This is the visitor interface you need to implement to use the {@link Result#accept(ResultVisitor)} method.
     */
    interface ResultVisitor<VisitationException extends Exception>
    {
        /**
         * Visits the specified row.
         *
         * @param row the row to visit. The row object is only guaranteed to be stable until flow of control has
         *            returned from this method.
         * @return true if the next row should also be visited. Returning false will terminate the iteration of
         * result rows.
         * @throws VisitationException if there is a problem in the execution of this method. This exception will close
         * the result being visited, and the exception will propagate out through the
         * {@linkplain #accept(ResultVisitor) accept method}.
         */
        boolean visit( ResultRow row ) throws VisitationException;
    }
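
Two details in this contract are easy to miss: the row is only stable while visit is running, and returning false stops the iteration. The sketch below illustrates both with JDK-only stand-ins. Row, RowVisitor, accept, and firstNames are hypothetical names that mirror the shape of the interface above; this is not how Neo4j implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins that mirror the visitor contract; this is not Neo4j code.
class VisitorContractSketch {

    // A single reusable row holder, only valid during a visit call.
    static final class Row {
        private String name;

        String getName() {
            return name;
        }
    }

    interface RowVisitor<E extends Exception> {
        boolean visit(Row row) throws E;
    }

    // Feeds each name to the visitor through one reused Row and stops as soon
    // as the visitor returns false. Returns the number of rows visited.
    static <E extends Exception> int accept(List<String> names, RowVisitor<E> visitor) throws E {
        Row row = new Row();
        int visited = 0;
        for (String name : names) {
            row.name = name;
            visited++;
            if (!visitor.visit(row)) {
                break;
            }
        }
        return visited;
    }

    // Keeps at most `limit` names. Values are copied out of the row inside
    // visit, because the row itself is overwritten on the next iteration.
    static List<String> firstNames(List<String> names, int limit) {
        List<String> kept = new ArrayList<>();
        if (limit <= 0) {
            return kept;
        }
        accept(names, row -> {
            kept.add(row.getName());
            return kept.size() < limit; // false ends the iteration early
        });
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "c", "d");
        // prints "[a, b]"
        System.out.println(firstNames(names, 2));
    }
}
```

Because the stand-in reuses one Row for the whole pass, keeping a reference to the row instead of copying its values would leave you with whatever the last visit wrote into it.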

After going through the documentation, I decided to implement it and compare the performance.

The accept method takes a ResultVisitor, whose visit method receives a ResultRow object for each row; using it, we can extract the data from the underlying node data. The following code shows, in brief, how to access the node's records using the ResultVisitor pattern:

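private void processUsingResultVisitor(String queryStmt) {
    Result result = graphDb.execute(queryStmt);

    // Column names exactly as they appear in the RETURN clause of the query
    String actor = "people.name";
    String title = "movie.title";
    String typeRelation = "Type(relatedTo)";

    // logger is assumed to be a logger field defined elsewhere in the class
    result.accept(resultRow -> {
        logger.info(resultRow.get(actor) + " " + resultRow.get(typeRelation) + " " + resultRow.get(title));
        // Returning true asks for the next row as well
        return true;
    });
}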

Now, a few of you might ask why I am hardcoding the column names here. That can be avoided altogether, as shown below.

private void processUsingVisitor(String queryStmt) {
    Result result = graphDb.execute(queryStmt);

    // columns() already returns the column names as a List<String>
    List<String> columns = result.columns();

    result.accept(resultRow -> {
        // Get the data for each column obtained from the columns list
        columns.forEach(col -> logger.info(col + " " + resultRow.get(col)));
        return true;
    });
}

I have seen the benefit of using ResultVisitor in our production environment. In my case, I have seen a significant reduction in object creation and better performance. For a deeper analysis, run VisualVM (jvisualvm) and turn on its sampler for your program.
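
Before reaching for a profiler, a rough in-process check can tell you whether the two approaches differ at all on your data. The sketch below is one way to do that with the JDK alone. RoughMeasurementSketch, Measurement, and measure are hypothetical names, and the heap figure is only indicative: a garbage collection in the middle of the run can make the delta small or even negative.

```java
// Hypothetical helper for a rough comparison; it does not replace a profiler.
class RoughMeasurementSketch {

    // Elapsed time and change in used heap for one run.
    static final class Measurement {
        final long elapsedNanos;
        final long heapDeltaBytes;

        Measurement(long elapsedNanos, long heapDeltaBytes) {
            this.elapsedNanos = elapsedNanos;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    // Heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    // Runs the work once and records elapsed time and the used-heap delta.
    static Measurement measure(Runnable work) {
        long heapBefore = usedHeapBytes();
        long start = System.nanoTime();
        work.run();
        long elapsed = System.nanoTime() - start;
        long heapAfter = usedHeapBytes();
        return new Measurement(elapsed, heapAfter - heapBefore);
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute the code you want to compare.
        Measurement m = measure(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("elapsed ms: " + m.elapsedNanos / 1_000_000
                + ", heap delta bytes: " + m.heapDeltaBytes);
    }
}
```

In the program above, you would pass something like () -> api.processImperative(queryStmt) for one run and () -> api.processUsingVisitor(queryStmt) for the other, then repeat each run a few times before trusting the numbers.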

By comparing the two approaches on large datasets, you will see the power of ResultVisitor and, hopefully, this will help in your use case as well.

Topics:
neo4j, database, performance, interface
