Graph Model of Facebook Post Reactions in Neo4j, Part 1
Optimize query efficiency in a pinch with Neo4j.
I was watching Neo4j's online conference, conveniently named NODES 2019, and stumbled upon a great presentation titled "Tuning Cypher" by Andrew Bowman. I highly recommend checking it out. It inspired me to write my own blog post about Cypher query optimizations.
In this blog post, we will take a look at how to model Facebook posts and users’ reactions to them. As some of you might know, there are six different types of reactions available on Facebook today:
- Like
- Love
- Haha
- Wow
- Sad
- Angry
There are a couple of different ways to model this in Neo4j. We will use three different graph models and compare Cypher query performance across them. The dataset is generated synthetically: 100 Facebook posts with 10,000 reactions per post, with reactions evenly distributed across all six types. Our goal is to count, as fast as possible, how many reactions each post has, grouped by reaction type.
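The even split across reaction types comes from cycling a counter modulo 6, which is how the import queries below assign reactions. A quick Python sketch of that distribution (illustrative only, not part of the import):

```python
from collections import Counter

REACTIONS = ['like', 'love', 'haha', 'wow', 'sad', 'angry']

# Assign each of 10,000 reactions a type by cycling j % 6,
# mirroring the CASE expression used in the Cypher imports below.
counts = Counter(REACTIONS[j % 6] for j in range(10000))

# 10,000 is not divisible by 6, so the split is only near-even:
# the first four types get 1,667 reactions each, the last two 1,666.
```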
We will compare both the execution plans of the Cypher queries and the actual execution times. Cypher execution plans can be visualized with the PROFILE clause. For measuring the execution time we will use:
- result_available_after (the time it took for the server to have the result available).
- result_consumed_after (the time it took for the server to consume the result).
These two values are available in the summary of the response in the official Neo4j Python Bolt driver. The idea is to run each query 1,000 times and calculate basic statistics about the execution times.
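A benchmarking harness along these lines could look as follows. The percentile helper is a plain-Python sketch; the timing values themselves would come from result_available_after in the summary returned by the Bolt driver (the driver session and the query are assumed, not shown here):

```python
import statistics

def timing_stats(timings_ms):
    """Summarize a list of per-query timings (in milliseconds)."""
    ordered = sorted(timings_ms)

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        k = (len(ordered) - 1) * p / 100
        lower, upper = int(k), min(int(k) + 1, len(ordered) - 1)
        return ordered[lower] + (ordered[upper] - ordered[lower]) * (k - lower)

    return {
        'avg': statistics.mean(ordered),
        'min': ordered[0],
        'max': ordered[-1],
        'p50': percentile(50),
        'p90': percentile(90),
        'p99': percentile(99),
    }

# In the real benchmark each timing would be collected roughly like this:
#   summary = session.run(query).summary()
#   timings.append(summary.result_available_after)
stats = timing_stats([768, 783, 787, 792, 806, 860])
```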
In this blog post I used:
- Neo4j Enterprise 3.5.11.
- APOC 3.5.0.5.
- Neo4j bolt driver for Python 1.6.3.
All graph models are visualized with the Arrows tool.
Reaction Type as an Attribute of a Relationship
Graph model:
In all three graph models, there will be nodes labeled User and Post. In the first graph model, we store the reaction type as an attribute of the relationship between a User and a Post.
Import
We use apoc.periodic.iterate() to speed up our import.
CALL apoc.periodic.iterate(
"UNWIND range(0,100) as i
return i
",
"CREATE (post:Post{id:i})
WITH i,post
UNWIND range(0,10000) as j
WITH i,j,post, CASE j%6 WHEN 0 THEN 'like'
WHEN 1 THEN 'love'
WHEN 2 THEN 'haha'
WHEN 3 THEN 'wow'
WHEN 4 THEN 'sad'
WHEN 5 THEN 'angry' END as reaction
CREATE (post)<-[:REACTION{type:reaction}]-(:User)",
{parallel:true})
Testing Execution Time
Before each test, apoc.warmup.run() was executed to quickly warm up the database. As mentioned before, each query will be executed 1,000 times to calculate the average response time and other metrics.
The first query is pretty straightforward. We visit all the nodes labeled Post in the graph and expand all their incoming relationships. The second step is to aggregate by an attribute “id” of the node and attribute “type” of the relationship. We finish the query with ordering and a limit.
PROFILE
MATCH (p:Post)<-[rel]-()
RETURN p.id as id,
rel.type as type,
count(*) AS count
ORDER BY count DESC
LIMIT 5
Results
Average response: 799.739ms
Max response: 1860ms
Min response: 768ms
25th percentile: 783.0ms
50th percentile: 787.0ms
75th percentile: 792.0ms
90th percentile: 806.0ms
95th percentile: 860.1ms
99th percentile: 1071.03ms
The execution plan reports 3,030,506 total db hits, and it takes about 800ms on average to execute the query. We can definitely improve this.
The presentation by Andrew Bowman made me realize that if you want to aggregate and group by a node’s unique id, it is better to group by the node and only then return its unique id. This way we do the aggregations, ordering, and limit first and only then access the “id” attribute of the top five rows.
PROFILE
MATCH (p:Post)<-[rel]-()
WITH p, rel.type as type, count(*) AS count
ORDER BY count DESC
LIMIT 5
RETURN p.id as id,type,count
Results
Average response: 657.88ms
Max response: 908ms
Min response: 644ms
25th percentile: 650.0ms
50th percentile: 653.0ms
75th percentile: 657.0ms
90th percentile: 667.0ms
95th percentile: 687.0ms
99th percentile: 760.11ms
Just by grouping by the node instead of by the node's unique property, we saved about a million db hits and are left with a total of 2,020,410. The average response time dropped to around 650ms.
Not much else could be optimized here, so let’s move to the next possible graph model.
Reaction Type as a Node Label
Graph Model
Here, we introduce an intermediary node between the user and the post. The reaction type will be stored as a label of the intermediary node. This node has no other purpose; it will always have one incoming and one outgoing relationship.
Import
CALL apoc.periodic.iterate(
"UNWIND range(0,100) as i
return i
",
"CREATE (post:Post{id:i})
WITH i,post
UNWIND range(0,10000) as j
WITH i,post, CASE j%6 WHEN 0 THEN 'like'
WHEN 1 THEN 'love'
WHEN 2 THEN 'haha'
WHEN 3 THEN 'wow'
WHEN 4 THEN 'sad'
WHEN 5 THEN 'angry' END as reaction
// Create node with a dynamic label
CALL apoc.create.node([reaction], {}) YIELD node
CREATE (post)<-[:TO]-(node)
CREATE (node)<-[:REACTION]-(:User)",
{parallel:true})
Testing Execution Time
As we are only interested in counting reaction types for a given post, we don’t have to traverse all the way to the user. We start by matching all the posts in our graph, expand their incoming relationships and count the labels of the nodes on the other end of the relationships.
PROFILE
MATCH (p:Post)<--(reaction)
WITH p,
labels(reaction) as labels,
count(*) as count
ORDER BY count DESC
LIMIT 5
RETURN p.id as id,labels[0] as type,count
Results
Average response: 569.038ms
Max response: 1255ms
Min response: 548ms
25th percentile: 553.0ms
50th percentile: 556.0ms
75th percentile: 560.0ms
90th percentile: 570.1ms
95th percentile: 637.1ms
99th percentile: 903.31ms
Even though the total db hits did not change compared to the previous Cypher query (2,020,410 db hits), the execution time improved by almost 100ms. This indicates that not all db hits are equal: counting labels appears to be faster than counting relationship attributes, even though both have the same cost in db hits.
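Total db hits can also be pulled out of a PROFILE plan programmatically by summing the hits of every operator in the plan tree. A minimal sketch, assuming the plan has been converted to nested dicts with dbHits and children keys (the exact attribute names vary between driver versions, so treat this shape as an assumption); the example plan tree below is hand-built for illustration:

```python
def total_db_hits(plan):
    """Recursively sum dbHits over an execution-plan tree."""
    hits = plan.get('dbHits', 0)
    for child in plan.get('children', []):
        hits += total_db_hits(child)
    return hits

# A tiny hand-built plan tree whose operator hits add up to
# the 2,020,410 total reported above.
plan = {
    'operator': 'ProduceResults', 'dbHits': 0,
    'children': [
        {'operator': 'EagerAggregation', 'dbHits': 1010205,
         'children': [
             {'operator': 'Expand(All)', 'dbHits': 1010104,
              'children': [
                  {'operator': 'NodeByLabelScan', 'dbHits': 101,
                   'children': []},
              ]},
         ]},
    ],
}
```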
Reaction Type as a Relationship Type
Graph model
As the title suggests, we will store the reaction type as the type of the relationship between a user and a post. To me, this one feels the most natural, but it always depends on what you want to achieve. There are no silver bullets when modeling graphs; use whatever works best for your use case.
Import
CALL apoc.periodic.iterate(
"UNWIND range(0,100) as i
return i
",
"CREATE (post:Post{id:i})
WITH i,post
UNWIND range(0,10000) as j
CREATE (u:User)
WITH i,post,u, CASE j%6 WHEN 0 THEN 'like'
WHEN 1 THEN 'love'
WHEN 2 THEN 'haha'
WHEN 3 THEN 'wow'
WHEN 4 THEN 'sad'
WHEN 5 THEN 'angry' END as reaction
// Create relationship with a dynamic type
CALL apoc.create.relationship(u, reaction, {}, post) YIELD rel
RETURN count(*)",
{parallel:true})
Testing Execution Time
The first query uses a naive approach. We match all the posts, iterate over their incoming relationships and count the types of those relationships. Nothing special here.
PROFILE
MATCH (p:Post)<-[rel]-()
WITH p,
type(rel) as type,
count(*) as count
ORDER BY count DESC
LIMIT 5
RETURN p.id as id,type,count
Results
Average response: 629.604ms
Max response: 903ms
Min response: 607ms
25th percentile: 616.0ms
50th percentile: 628.0ms
75th percentile: 635.0ms
90th percentile: 648.1ms
95th percentile: 656.0ms
99th percentile: 698.07ms
When I first saw that we had dropped to 1,010,309 total db hits, I expected a faster execution time, but to my surprise it was actually a bit slower than before, when we were counting labels. Definitely, not all db hits cost the same.
Access Node’s Degree Store
There isn’t much documentation available about the degree store, but I found this knowledge base explanation. Basically, each node stores the count of its relationships, by relationship type and direction, in a “degree store”. We can access it using the size() function.
PROFILE
MATCH (p:Post)
WITH p
UNWIND [{key:'like',count:size((p)<-[:like]-())},
{key:'love',count:size((p)<-[:love]-())},
{key:'haha',count:size((p)<-[:haha]-())},
{key:'wow', count: size((p)<-[:wow]-())},
{key:'sad', count: size((p)<-[:sad]-())},
{key:'angry',count: size((p)<-[:angry]-())}] as k
WITH p,
k.key as key,
k.count as count
ORDER BY count DESC
LIMIT 5
RETURN p.id as id,key,count
Results
Average response: 0.998ms
Max response: 82ms
Min response: 0ms
25th percentile: 1.0ms
50th percentile: 1.0ms
75th percentile: 1.0ms
90th percentile: 1.0ms
95th percentile: 2.0ms
99th percentile: 2.0ms
Instead of iterating over all relationships and counting their types, we just iterate over nodes and check the degree value for each relationship type stored in the degree store of the node. We have dropped down to 1,319 total db hits, which is three orders of magnitude better than before. Execution times are down to 1ms.
Predefining all the relationship types in the above example feels a bit ugly and is something we want to avoid in our applications. To solve this, we can use the APOC node functions, specifically apoc.node.relationship.types() and apoc.node.degree.in().
MATCH (p:Post)
// Get all relationship types for a given node
UNWIND apoc.node.relationship.types(p) as type
WITH p,
type,
// Get the incoming degree value for the relationship type
apoc.node.degree.in(p, type) as count
ORDER BY count DESC
LIMIT 5
RETURN p.id as post, type, count
The Cypher profiler is not accurate when APOC procedures are involved, so we will skip it. Looking at the execution times, it appears APOC has a bit of overhead in this scenario, as the average response time almost doubled to 2ms. That is still very fast.
Having said that, APOC usually speeds things up, so I recommend experimenting with it when optimizing your queries. This is simply a case where we trade a bit of speed for a prettier Cypher query, as we don't have to specify the relationship types in advance. And I am sure that if we optimized the APOC procedure for this exact scenario, we could get better execution times.
Results
Average response: 1.869ms
Max response: 74ms
Min response: 1ms
25th percentile: 1.0ms
50th percentile: 2.0ms
75th percentile: 2.0ms
90th percentile: 3.0ms
95th percentile: 3.0ms
99th percentile: 4.0ms
Conclusion
We explored a few different graph models and compared execution times for a very simple counting task. In the next part, I will add the date attribute of the posts to the graph, and explore what happens when we want to do a simple date filter and order by. Stay tuned!
All the code is available on GitHub.
Opinions expressed by DZone contributors are their own.