Half-Terabyte Benchmark Neo4j vs. TigerGraph
Neo4j is ranked as the top graph database by DB-Engines, and Strata Data recently gave its Most Disruptive Startup award to TigerGraph. Let's see how they compare.
Graph databases have become increasingly popular and are getting a lot of attention.
To understand how graph databases perform, I reviewed state-of-the-art benchmarks and found that they commonly measure loading speed, storage size of the loaded data, query performance, and scalability. However, the test datasets in those benchmarks are small, ranging from 4 MB to 30 GB. So I decided to run my own benchmark on a much larger dataset: half a terabyte.
Because it is hard to find a graph dataset of this size on the internet, I generated my own test dataset, which mimics daily phone call records. Here is a sample:
Since Neo4j is ranked as the top graph database by DB-Engines, I was curious about its performance. And recently, Strata Data gave its Most Disruptive Startup award to TigerGraph. Let's see how TigerGraph differs.
I used an Amazon EC2 machine.
|EC2 type||IOPS (SSD)||CPUs||Memory||Volume type||OS||Disk size|
|r4.4xlarge||32000||16||122 GiB||io1||Ubuntu 14||3 TB|
I used the latest downloadable versions of both database systems:
TigerGraph Developer Edition
Neo4j 3.4.7 Community Edition
The phone call edges are stored in 21 files of around 24 GB each. The total size of the dataset is 501 GB.
|Name||Vertices #||Edges #|
Description of Tests
The goal of the benchmark is to measure how each database system performs when there is not enough memory to hold the whole dataset. For this reason, I chose an EC2 r4.4xlarge (122 GiB of memory) as the server. To my surprise, TigerGraph compressed the raw data to 14% of its original size, so the loaded graph fit in memory.
The following test cases have been included:
- Data loading: Bulk loading method supported by each database system.
|Feature||Neo4j-Cypher||TigerGraph|
|Built-in loading language||YES||YES|
|Requires separate vertex file||YES||NO|
|Incremental data loading||YES||YES|
|Index build during loading||NO||YES|
|Vertex ID deduplication||YES||YES|
- Storage size: Storage size of the loaded datasets.
- k-hops query performance: I search for distinct, directed neighbors starting from six randomly selected vertices, returning total counts for discovered neighbors.
| ||1-hop||3-hops||6-hops|
|Query timeout||180 s||9000 s||9000 s|
- PageRank: Traverses every edge during each iteration. I chose ten iterations of PageRank and ran the test three times to calculate the average execution time. In this test, I set the timeout to 24 hours.
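To illustrate why each PageRank iteration has to touch every edge, here is a minimal Python sketch. It is not the implementation used by either database system. The function name `pagerank`, the damping factor of 0.85, and the uniform redistribution of rank from vertices without outgoing edges are all assumptions made for illustration.

```python
def pagerank(edges, iterations=10, damping=0.85):
    """Run a fixed number of PageRank iterations over a directed edge list.

    edges: list of (source, target) pairs.
    damping=0.85 is an assumed value; the benchmark does not state one.
    """
    vertices = sorted({v for edge in edges for v in edge})
    if not vertices:
        return {}
    n = len(vertices)

    # Out-degree of every vertex, needed to split rank across outgoing edges.
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread uniformly
        # (an assumption; other conventions exist).
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        next_rank = {v: base for v in vertices}

        # This loop visits every edge exactly once per iteration.
        for src, dst in edges:
            next_rank[dst] += damping * rank[src] / out_degree[src]
        rank = next_rank
    return rank


if __name__ == "__main__":
    toy_calls = [("a", "b"), ("c", "b"), ("b", "a")]
    for vertex, score in sorted(pagerank(toy_calls).items()):
        print(vertex, round(score, 4))
```

With ten iterations, the inner loop runs ten times over the full edge list, which is why this test stresses sequential edge access much more than the k-hop queries do.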
Neo4j required extra time to build the index and to extract the vertex file from the edge file. In my test, Neo4j took an extra 8.7 hours to prepare the node file.
| ||TigerGraph||Neo4j|
|Load time||13.46 h||7.479 h (7 h 28 m 44 s 571 ms)|
|Index build||-||0.819 h (49 m 8 s)|
|Total||13.46 h||8.298 h|
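The vertex-file preparation step that Neo4j needed can be pictured with the following Python sketch. It is only an illustration of the idea, not the script used in this benchmark. The edge file layout (a CSV whose first two columns are the caller and callee IDs) and the function names `extract_vertex_ids` and `write_vertex_file` are assumptions.

```python
import csv
import io


def extract_vertex_ids(edge_rows):
    """Collect the distinct vertex IDs that appear as either endpoint of an edge.

    edge_rows: iterable of rows; the first two fields are assumed to be
    the source and target vertex IDs (hypothetical layout).
    """
    seen = set()
    for row in edge_rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        seen.add(row[0])
        seen.add(row[1])
    return sorted(seen)


def write_vertex_file(edge_file, vertex_file):
    """Read edges from one open text file and write one vertex ID per line."""
    writer = csv.writer(vertex_file)
    for vertex_id in extract_vertex_ids(csv.reader(edge_file)):
        writer.writerow([vertex_id])


if __name__ == "__main__":
    edges = io.StringIO("100,200,2018-01-01\n200,300,2018-01-02\n100,300,2018-01-03\n")
    vertices = io.StringIO()
    write_vertex_file(edges, vertices)
    print(vertices.getvalue())
```

This version keeps every distinct ID in memory, which is fine for a toy input. On a 501 GB edge set, a real preprocessing job would more likely need an external sort or a disk-backed deduplication step, which helps explain why the step takes hours.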
Storage Size After Loading
| ||TigerGraph||Neo4j|
|Size||74.095 GB||1.4 TB|
K-Hops-Neighbors Query Performance
| ||1-hop||3-hops||6-hops|
|TigerGraph||6.093 ms||0.053 s||433.796 s|
|Neo4j||151.015 ms||95.847 s||all out-of-memory|
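For readers unfamiliar with k-hop neighbor queries, the following Python sketch shows one way to count distinct neighbors with a breadth-first search. It is not the Cypher or TigerGraph query used in the test. The function name `k_hop_neighbor_count` is hypothetical, and the sketch assumes "k-hop neighbors" means all distinct vertices reachable in at most k directed hops, excluding the start vertex.

```python
from collections import deque


def k_hop_neighbor_count(adjacency, start, k):
    """Count distinct vertices reachable from `start` in at most k directed hops.

    adjacency: dict mapping a vertex to a list of its outgoing neighbors.
    """
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return len(visited) - 1  # exclude the start vertex itself


if __name__ == "__main__":
    calls = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": ["a"]}
    for hops in (1, 3, 6):
        print(hops, k_hop_neighbor_count(calls, "a", hops))
```

The `visited` set grows with the number of discovered vertices, so on a densely connected call graph the working set at six hops can approach the whole graph. That is consistent with memory becoming the limiting factor at larger hop counts.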
PageRank Query Performance
|Neo4j||cannot complete within 24 hours|
- Neo4j's loading time is shorter than TigerGraph's; however, Neo4j requires extra preprocessing to extract the vertex file from the edge file. Once the 8.7 hours of preprocessing are included, Neo4j takes longer to load the data than TigerGraph.
- TigerGraph compresses the data effectively and needs 19.3x less storage space than Neo4j.
- On the one-hop path query, TigerGraph is 24.8x faster than Neo4j.
- On the three-hop path query, TigerGraph is 1808.43x faster than Neo4j.
- TigerGraph completed the six-hop path query without difficulty; the Neo4j query process was killed by the OS out-of-memory killer after two hours.
- Neo4j could not complete the PageRank query within one day.
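The ratios quoted above can be rechecked from the measurements in the result tables. The short Python sketch below does that arithmetic. The function name `benchmark_ratios` is hypothetical, and the storage ratio assumes 1.4 TB is converted as 1.4 × 1024 GB, which is an assumption about how the 19.3x figure was derived.

```python
def benchmark_ratios():
    """Recompute the headline ratios from the figures reported in the tables."""
    tigergraph_load_h = 13.46
    neo4j_load_h = 7.479 + 0.819   # load time + index build
    neo4j_preprocess_h = 8.7       # extracting the vertex file from the edge files

    return {
        "tigergraph_total_load_h": tigergraph_load_h,
        "neo4j_total_load_h": neo4j_load_h + neo4j_preprocess_h,
        "one_hop_speedup": 151.015 / 6.093,     # both in ms
        "three_hop_speedup": 95.847 / 0.053,    # both in s
        # Assumes 1 TB = 1024 GB; with 1000 GB the ratio would be lower.
        "storage_ratio": 1.4 * 1024 / 74.095,
    }


if __name__ == "__main__":
    for name, value in benchmark_ratios().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the end-to-end Neo4j load comes to roughly 17 hours against 13.46 hours for TigerGraph, and the speedup and storage figures round to the values listed in the bullets.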