Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Modeling Data in Neo4j: Qualifying Relationships

DZone's Guide to

Modeling Data in Neo4j: Qualifying Relationships

· Performance Zone
Free Resource

Transform incident management with machine learning and analytics to help you maintain optimal performance and availability while keeping pace with the growing demands of digital business with this eBook, brought to you in partnership with BMC.

In the last post of our "Neo4j Modelling for Beginners" series, we looked at bidirectional relationships. In this post, we compare the implications of qualifying relationships by using different relationship types versus using relationship properties.

Properties as Qualifiers

Let's say we want to model movie ratings in Neo4j. People have an option to rate a movie with 1 to 5 stars. One way of modelling this, and perhaps the first one that springs into mind, is creating a RATED relationship with a rating property that takes on 5 different values: integers 1 though 5.

Qualifying Relationship by Property

Writing queries using this model is fairly straightforward in both Java and Cypher. If we wanted to get all people that rated Pulp Fiction positively, i.e. with rating greater than 3, we could just write

for (Relationship r : pulpFiction.getRelationships(INCOMING, RATED)) { 
  if ((int) r.getProperty("rating") > 3) { 
    Node fan = r.getStartNode(); //do something with it 
  } 
}

or, equivalently, in Cypher

START   pulpFiction=node({id}) 
MATCH   (pulpFiction)<-[r:RATED]-(fan) 
WHERE   r.rating > 3 
RETURN  fan

Relationship Types

Since we know all the possible relationship qualities up front, there is another option: using a separate relationship type for each rating. For example, we could define the following relationship types: LOVED, LIKED, NEUTRAL, DISLIKED, and HATED, corresponding to 5 stars down to 1 star, respectively. The above graph would then look as follows.

Qualifying Relationship by Type

Both queries would have to be slightly modified to yield the same result, i.e., people who are fans of Pulp Fiction. In Java, one would write:

for (Relationship r : pulpFiction.getRelationships(INCOMING, LIKED, LOVED)) { 
  Node fan = r.getStartNode(); //do something with it 
}

and in Cypher:

START pulpFiction=node({id}) 
MATCH (pulpFiction)<-[r:LIKED|LOVED]-(fan) 
RETURN fan

Comparison

In terms of query syntax, there isn't really all that much difference. If we had, for example, 10 different qualities of the relationship and wanted to query for 7 of them, one could argue the first approach is more convenient: it does not require listing all the relationship types we're looking for.

Let us, however, explore the two approaches from a performance point of view. The first experiment is designed to find out, whether there are any write throughput differences between the two approaches. We created 1,000 relationships between random pairs of nodes and measured the time taken to do so. We varied the number of relationships created in a single transaction from 1 to 1,000. The results are depicted in the following figure:

Write Throughput Comparison

Clearly, there is no significant difference in write throughput between the two approaches. However, this is not true for traversals.

In the second experiment, we executed all the queries shown earlier 100 times on a graph with 100 nodes and 5,000 randomly qualified, uniformly distributed relationships (which makes the degree of each node 100, on average). We performed the experiments in three different settings.

  1. No caches involved, data read from disk
  2. Data in low level cache, high level cache turned off
  3. Data in high level cache

The next two figures show the time taken to execute these queries in Cypher and Java, respectively.

Traversal Performance (Cypher)

Traversal Performance (Java API)

The multiple-relationship-types approach always outperforms the single-type-and-property approach, sometimes by as much as a factor of 8. There is a technical reason for this, which has to do with the way Neo4j organizes its data on disk and in memory. But that is a topic for one of the next posts.

It is important to realize that we have only measured a single-hop traversal. If that is already 8x faster, a traversal 2 levels deep could be 64x faster and 3 levels deep could be 512x faster.

Conclusion

When possible, choosing different relationship types over a single type qualified by properties can have a significant  positive performance impact when querying the graph. The former approach is always at least 2x faster than the latter.  When data is in high-level cache and the graph is queried using native Java API, the first approach is more than  8x faster for single-hop traversals.

Evolve your approach to Application Performance Monitoring by adopting five best practices that are outlined and explored in this e-book, brought to you in partnership with BMC.

Topics:

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}