# MySQL vs. Neo4j on a Large-Scale Graph Traversal

This post presents an analysis of MySQL (a relational database) and Neo4j (a graph database) in a side-by-side comparison on a simple graph traversal.

The data set that was used was an artificially generated graph with natural statistics. The graph has 1 million vertices and 4 million edges. The degree distribution of this graph on a log-log plot is provided below. A visualization of a 1,000 vertex subset of the graph is diagrammed above.

## Loading the Graph

The graph data set was loaded both into MySQL and Neo4j. In MySQL a single table was used with the following schema.

CREATE TABLE graph ( outV INT NOT NULL, inV INT NOT NULL ); CREATE INDEX outV_index USING BTREE ON graph (outV); CREATE INDEX inV_index USING BTREE ON graph (inV);

After loading the data, the table appears as below. The first line reads: “vertex 0 is connected to vertex 1.”

mysql> SELECT * FROM graph LIMIT 10; +------+-----+ | outV | inV | +------+-----+ | 0 | 1 | | 0 | 2 | | 0 | 6 | | 0 | 7 | | 0 | 8 | | 0 | 9 | | 0 | 10 | | 0 | 12 | | 0 | 19 | | 0 | 25 | +------+-----+ 10 rows in set (0.04 sec)

The 1 million vertex graph data set was also loaded into Neo4j. In Gremlin, the graph edges appear as below. The first line reads: “vertex 0 is connected to vertex 992915.”

gremlin> g.E[1..10] ==>e[183][0-related->992915] ==>e[182][0-related->952836] ==>e[181][0-related->910150] ==>e[180][0-related->897901] ==>e[179][0-related->871349] ==>e[178][0-related->857804] ==>e[177][0-related->798969] ==>e[176][0-related->773168] ==>e[175][0-related->725516] ==>e[174][0-related->700292]

## Warming Up the Caches

Before traversing the graph data structure in both MySQL and Neo4j, each database had a “warm up” procedure run on it. In MySQL, a “SELECT * FROM graph” was evaluated and all of the results were iterated through. In Neo4j, every vertex in the graph was iterated through and the outgoing edges of each vertex were retrieved. Finally, for both MySQL and Neo4j, the experiment discussed next was run twice in a row and the results of the second run were evaluated.

## Traversing the Graph

The traversal that was evaluated on each database started from some root vertex and emanated n-steps out. There was no sorting, no distinct-ing, etc. The only two variables for the experiments are the length of the traversal and the root vertex to start the traversal from. In MySQL, the following 5 queries denote traversals of length 1 through 5. Note that the “?” is a variable parameter of the query that denotes the root vertex.

SELECT a.inV FROM graph as a WHERE a.outV=? SELECT b.inV FROM graph as a, graph as b WHERE a.inV=b.outV AND a.outV=? SELECT c.inV FROM graph as a, graph as b, graph as c WHERE a.inV=b.outV AND b.inV=c.outV AND a.outV=? SELECT d.inV FROM graph as a, graph as b, graph as c, graph as d WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND a.outV=? SELECT e.inV FROM graph as a, graph as b, graph as c, graph as d, graph as e WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND d.inV=e.outV AND a.outV=?

For Neo4j, the Blueprints Pipes framework was used. A pipe of length n was constructed using the following static method.

public static Pipeline createPipeline(final Integer steps) { final ArrayList pipes = new ArrayList(); for (int i = 0; i < steps; i++) { Pipe pipe1 = new VertexEdgePipe(VertexEdgePipe.Step.OUT_EDGES); Pipe pipe2 = new EdgeVertexPipe(EdgeVertexPipe.Step.IN_VERTEX); pipes.add(pipe1); pipes.add(pipe2); } return new Pipeline(pipes); }

For both MySQL and Neo4j, the results of the query (SQL and Pipes) were iterated through. Thus, all results were retrieved for each query. In MySQL, this was done as follows.

while (resultSet.next()) { resultSet.getInt(finalColumn); }

In Neo4j, this is done as follows.

while (pipeline.hasNext()) { pipeline.next(); }

## Experimental Results

The artificial graph dataset was constructed with a “rich get richer“, preferential attachment model. Thus, the vertices created earlier are the most dense (i.e. highest number of adjacent vertices). This property was used to limit the amount of time it would take to evaluate the tests for each traversal. Only the first 250 vertices were used as roots of the traversals. Before presenting timing results, note that all of these experiments were run on a MacBook Pro with a 2.66GHz Intel Core 2 Duo and 4Gigs of RAM at 1067 MHz DDR3. The packages used were Java 1.6, MySQL JDBC 5.0.8, and Blueprints Pipes 0.1.2.

java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

The following Java Virtual Machine parameters were used:

-Xmx1000M -Xms500M

Below are the total running times for both MySQL (red) and Neo4j (blue) for traversals of length 1, 2, 3, and 4.

The raw data is presented below along with the total number of vertices returned by each traversal—which, of course, is the same for both MySQL and Neo4j given that its the same graph data set being processed. Also realize that traversals can loop and thus, many of the same vertices are returned multiple times. Finally, note that only Neo4j has the running time for a traversal of length 5. MySQL did not finish after waiting 2 hours to complete. In comparison, Neo4j took 14.37 minutes to complete a 5 step traversal.

[mysql steps-1] time(ms):124 -- vertices_returned:11360 [mysql steps-2] time(ms):922 -- vertices_returned:162640 [mysql steps-3] time(ms):8851 -- vertices_returned:2206437 [mysql steps-4] time(ms):112930 -- vertices_returned:28125623 [mysql steps-5] N/A [neo4j steps-1] time(ms):27 -- vertices_returned:11360 [neo4j steps-2] time(ms):474 -- vertices_returned:162640 [neo4j steps-3] time(ms):3366 -- vertices_returned:2206437 [neo4j steps-4] time(ms):49312 -- vertices_returned:28125623 [neo4j steps-5] time(ms):862399 -- vertices_returned:358765631

Next, the individual data points for both MySQL and Neo4j are presented in the plot below. Each point denotes how long it took to return n number of vertices for the varying traversal lengths.

Finally, the data below provides the number of vertices returned per millisecond (on average) for each of the traversals. Again, MySQL did not finish in its 2 hour limit for a traversal of length 5.

[mysql steps-1] vertices/ms:91.6128847554668 [mysql steps-2] vertices/ms:176.399127537985 [mysql steps-3] vertices/ms:249.286746556076 [mysql steps-4] vertices/ms:249.053599519823 [mysql steps-5] N/A [neo4j steps-1] vertices/ms:420.740351166341 [neo4j steps-2] vertices/ms:343.122344772028 [neo4j steps-3] vertices/ms:655.507125256186 [neo4j steps-4] vertices/ms:570.360621871775 [neo4j steps-5] vertices/ms:416.00886711325

## Conclusion

In conclusion, given a traversal of an artificial graph with natural statistics, the graph database Neo4j is more optimal than the relational database MySQL. However, no attempts have been made to optimize the Java VM, the SQL queries, etc. These experiments were run with both Neo4j and MySQL “out of the box” and with a “natural syntax” for both types of queries.*Source: http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/*

{{ nComments() }}