What Are the Criteria to Differentiate Between Graph Databases?
Graph databases have attracted much attention due to their advantages over the relational model (see discussions here). However, as technology companies such as Amazon, Microsoft, Oracle, and IBM rush into this area, it is getting more challenging to evaluate the different vendors' products when a project wants to adopt a graph database.
In this post, I'm going to share the insights I gained from benchmarking different graph databases. You can download the benchmark report here.
- Loading capability. If you plan to work on real-life problems with a graph database, this is the first (and most important!) differentiator between good and bad graph databases. I would recommend trying a public data set with more than 1B edges and 50M-100M vertices. Check the loading effort, including the loading language/API support, the loading speed (it should finish within one hour at most), and the incremental/bulk loading support. Some loading scripts for different graph database vendors can be found here. If you have completed this minimum-requirement loading and the result is satisfactory, the next thing to try is a graph that has more than 10 vertex types and edge types, each with different attributes/properties. Can you load it easily? Does it support complex attribute types such as map, set, and list? What about JSON file loading? All of these questions are practical hurdles in handling real-life graph data.
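One way to make the loading test concrete is to measure throughput yourself and extrapolate to the one-hour budget. The sketch below is a minimal, hedged illustration: `bulk_load` uses an in-memory adjacency map as a stand-in for whatever bulk-load API your candidate database actually exposes, and `projected_load_hours` extrapolates a measured rate to a billion edges.

```python
import time
from collections import defaultdict

def bulk_load(edge_batches):
    """Load batches of (src, dst) edges into an in-memory adjacency map,
    returning the graph and the measured throughput in edges/second.
    (The dict is a placeholder for a real database's bulk-load call.)"""
    graph = defaultdict(list)
    count = 0
    start = time.perf_counter()
    for batch in edge_batches:
        for src, dst in batch:
            graph[src].append(dst)
            count += 1
    elapsed = time.perf_counter() - start
    return graph, count / max(elapsed, 1e-9)

def projected_load_hours(edges_per_sec, total_edges=1_000_000_000):
    """Extrapolate how long loading `total_edges` would take at this rate."""
    return total_edges / edges_per_sec / 3600

# A tiny synthetic workload: 10 batches of 1,000 edges each.
batches = [[(i, i + 1) for i in range(b * 1000, b * 1000 + 1000)]
           for b in range(10)]
g, rate = bulk_load(batches)
```

If `projected_load_hours(rate)` comes out well above 1.0 for your real loader, the product fails the loading criterion before you even get to queries.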
- Real-time update support. Real-time update means that an update can happen at the same time as query processing on the database. The update could be an insertion/deletion of a vertex/edge, an upsert of an attribute of an existing vertex/edge, etc. The graph database provides concurrency control, such that different operations can happen in an interleaved way, yet the end results are consistent, as if all operations had been executed sequentially. Note that a graph compute platform is different from a graph database: most HDFS-based graph platforms (e.g., Giraph, GraphX) do not support real-time updates due to HDFS's by-design limitations.
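The consistency guarantee described above can be sketched in a few lines. The toy store below is not how any real graph database is implemented; it only illustrates the contract: concurrent upserts interleave, but the final state matches a sequential execution.

```python
import threading

class TinyGraphStore:
    """A toy vertex store illustrating concurrency control: a lock makes
    concurrent upserts behave as if they had run one after another."""
    def __init__(self):
        self._vertices = {}
        self._lock = threading.Lock()

    def upsert_attr(self, vid, attr, delta):
        with self._lock:  # serialize conflicting writes
            attrs = self._vertices.setdefault(vid, {})
            attrs[attr] = attrs.get(attr, 0) + delta

    def read_attr(self, vid, attr):
        with self._lock:  # readers see a consistent snapshot
            return self._vertices.get(vid, {}).get(attr, 0)

# Four threads each upsert the same attribute 1,000 times, concurrently.
store = TinyGraphStore()
threads = [threading.Thread(
               target=lambda: [store.upsert_attr("v1", "views", 1)
                               for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the interleaving, `store.read_attr("v1", "views")` ends at exactly 4000, the sequential result. A store without concurrency control would lose updates here.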
- Native or non-native disk storage format. Many graph databases are non-native, meaning they store graph data as relational tables, RDF triples, or key-value pairs on disk and provide a middle-layer API to simulate graph traversal in memory. A native graph database stores graph data in graph-model format, as vertices and edges. The popular ones in this class are TigerGraph and Neo4j. The advantage of a native graph storage format is that it inherits the graph model's benefits for free: the graph model is a natural index, and each query touches only the relevant data by following that index.
- Query language expressive power. A graph language can be Turing complete, meaning the user can use it to write any algorithm and thus extract value from the data. Currently, there is no dominant graph query language, and the many offerings on the market differ in expressive power. GSQL from TigerGraph is a user-friendly and highly expressive query language with more of a PL/SQL flavor; the user can use it to declaratively write any graph algorithm. Another declarative language is Cypher from Neo4j; however, it is hard to write finely controlled graph analytical algorithms with it. The third one is Gremlin (https://arxiv.org/pdf/1508.03843...), which is highly expressive, though its learning curve is not low. The user has to read each one's manual and write different queries to test and appreciate each one's expressive power, learning curve, design philosophy, etc. Relational database specialists can read this guide on appreciating and evaluating graph query languages.
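To make "expressive power" tangible, here is the kind of algorithm a Turing-complete graph language should let you push into the database: single-source shortest paths, which needs loops, per-vertex accumulators, and data-dependent termination. The Python below is a reference implementation to test a candidate language against, not a query in any vendor's syntax.

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths over a weighted adjacency map --
    an iterative, accumulator-driven algorithm that a sufficiently
    expressive graph query language can state in-database."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for nbr, w in adj.get(v, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
```

If expressing this in a candidate language requires pulling all edges out to the client, the language has failed the expressiveness test for analytics.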
- Scale-out and/or scale-up support for compute and/or storage. Some graph databases can scale horizontally (scale out) on storage, meaning that doubling the machines doubles the storage; however, compute is not sped up by 2x. Some graph databases can scale out on compute as well; if they do, they must have an MPP architecture. Scale-up means that, on a single machine, the graph compute engine can exploit multiple cores to parallelize computation and achieve speedup when more cores are added. One caveat before you put a checkmark on this item: never forget about loading effort. If doubling the machines doubles or triples the loading effort, it cannot be regarded as a qualified scale-out database. TigerGraph has an MPP architecture that can scale out and scale up. Amazon Neptune cannot scale out; it can have replicas of the primary DB to increase read throughput. JanusGraph and ArangoDB can scale out storage, but not compute on the same query. Neo4j cannot scale out compute or storage due to its single-server architecture.
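The MPP pattern behind compute scale-out is partition, fan out, merge. The sketch below illustrates that pattern on a toy degree-count job; threads stand in for the separate processes or machines a real MPP engine would use (Python threads alone would not speed up CPU-bound work, so treat this purely as a structural illustration).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_degree(edge_partition):
    """Each worker computes out-degrees over its own edge partition."""
    return Counter(src for src, _ in edge_partition)

def parallel_degrees(edges, workers=4):
    """MPP-style sketch: partition the edges, fan the partitions out to
    workers, then merge the partial results into a global answer."""
    partitions = [edges[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(partial_degree, partitions):
            total.update(partial)
    return total

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
```

A qualified scale-out database does this partition/merge transparently, per query, across machines; if only storage is partitioned, each query still runs on one node.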
- OLTP, OLAP, or HTAP (hybrid transactional/analytical processing) support. Some graph platforms are purely for offline large-scale processing, which is OLAP style: PageRank, gradient descent, weakly connected components, etc. Some graph databases support point traversal queries (e.g., return a given person's 3-step neighbors), which is more OLTP style. A good indicator is whether they talk about QPS (queries per second); if yes, most likely they can support OLTP. Some can support both, which is HTAP. Try the simple connected-components and PageRank algorithms and see if they can finish in a reasonable time on a 1-billion-edge graph.
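As a reference for the OLAP test suggested above, here is a minimal weakly-connected-components implementation (union-find with path halving). Running the database's equivalent on a 1-billion-edge graph, and comparing counts on a sample against an independent implementation like this one, checks both speed and correctness.

```python
def connected_components(num_vertices, edges):
    """Count weakly connected components via union-find -- a classic
    whole-graph OLAP computation (edges are treated as undirected)."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components
    return len({find(v) for v in range(num_vertices)})
```

An OLTP-only engine will struggle to finish this in reasonable time at scale, while an OLAP-only platform will lack the point-query QPS story; an HTAP system should handle both.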
- Multiple-graph support. This is an enterprise security and concurrency requirement, where different departments want exclusive or shared access to different graphs/subgraphs on the same server/cluster at the same time. As of this post, the only vendor that supports native multiple graphs is TigerGraph; it allows different graphs to share subgraphs, which is important for different functional units to cooperate. Neo4j allows you to attach different graph store files for the current use, but not for the simultaneous use of multiple graphs within the same server instance. Amazon Neptune has one built-in graph by default and does not allow users to create their own graphs.
- Schema or schema-less. Some graph database vendors claim they do not require a pre-defined schema. Other graph database vendors require a pre-defined schema, like a traditional relational database, and support online evolution of the schema. In my opinion, a pre-defined schema is very important for enterprise real-time applications: with a schema, you force the developer/architect to spend design time thinking about what to put in the graph and how to organize it beforehand. And separating metadata from data instances is a well-known technique for better performance in a database system. Unfortunately, the current graph database market overemphasizes the explosive quantity of element types and therefore advocates schema-less design as an advantage to address this challenge, which in reality causes a significant performance drop during query processing.
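The design-time discipline argued for above amounts to something like the following sketch: a declared, typed schema checked at load time. The `SCHEMA` dict and `validate_vertex` helper are hypothetical illustrations, not any vendor's DDL.

```python
# A hypothetical pre-defined schema: each vertex type declares its
# required attributes and their types (the metadata, kept separate
# from the data instances themselves).
SCHEMA = {
    "Person":  {"name": str, "age": int},
    "Company": {"name": str},
}

def validate_vertex(vtype, attrs):
    """Reject vertices that do not match the declared schema -- the
    check a schema-first database applies at load/insert time."""
    spec = SCHEMA.get(vtype)
    if spec is None:
        return False  # undeclared vertex type
    return all(isinstance(attrs.get(k), t) for k, t in spec.items())
```

In a schema-less store, the malformed vertices this check rejects would be silently accepted, and every query would pay the cost of re-discovering types at run time.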
- Organic graph database or add-on graph interface only. This is important. As a database builder, I personally disagree with the one-size-fits-all claim. I follow craftsmanship and believe in providing a best-of-breed specialized database to achieve the best performance and user experience. In the current market, too many vendors chase each new market trend by adding new interfaces on top of an existing core that was developed and architected for another market. This kind of swing can hardly deliver the best product in the data-management software sector, and it causes friction in adopting the right technology in the enterprises that need it. A simple test to reveal the truth is a 3-hop query on a billion-edge graph: a product not designed for graph data management will struggle to answer 3-hop queries in real time. Here is a 12-hop query example.
- Real-time deep-link analytical query support. On most graph databases today, query response time degrades significantly starting from 3 hops. One simple test of whether a graph database is top-grade is the k-hop-path neighbor-count query: on a billion-edge graph, given an input vertex, can the database find all neighbor vertices within k hops (say, k = 3) of the input vertex and return the total count? This simple test is a great criterion for differentiating between good and bad graph databases. In our upcoming benchmark, I will reveal the truth about five media-popular graph databases: only one can fulfill this simple test; all the others failed. This is because, starting from a seed vertex, the number of neighbors grows exponentially with each additional hop away from the seed. Deep-link analytical query (>3 hops) performance is the one criterion most graph database vendors avoid talking about.
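For reference, the k-hop neighbor-count query described above is just a breadth-first search with deduplication. This pure-Python version defines the expected answer exactly; the hard part for a database is delivering it in real time when the frontier explodes on a billion-edge graph.

```python
from collections import deque

def khop_neighbor_count(adj, source, k):
    """Count the distinct vertices reachable from `source` within
    k hops, via breadth-first search (the source itself is excluded)."""
    seen = {source}
    frontier = deque([source])
    for _ in range(k):
        next_frontier = deque()
        for v in frontier:
            for nbr in adj.get(v, []):
                if nbr not in seen:       # dedupe: count each vertex once
                    seen.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return len(seen) - 1

adj = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [], 5: []}
```

Note how the deduplication set, not the traversal itself, dominates memory as k grows; engines that cannot keep that set distributed and in memory are the ones whose response time collapses past 3 hops.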
- Built-in graph visualization support. One advantage of using a graph database is object-oriented thinking. Unlike a traditional relational database, where everything from modeling to query results is in tabular format, a graph database elegantly maps real-world objects to vertices and edges, and human eyes can readily understand a query result when it is visualized as a vertex-edge picture. Here is an example.
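When a database lacks a built-in visualizer, the fallback is to export query results into a drawable format yourself. The sketch below is one such illustrative workaround (`to_dot` is a hypothetical helper, not a vendor API): it renders a vertex/edge result set as Graphviz DOT text, which any Graphviz-compatible viewer can draw as a vertex-edge picture.

```python
def to_dot(result_edges, name="query_result"):
    """Render (src, edge_label, dst) query-result triples as Graphviz
    DOT text, ready for an external viewer to draw."""
    lines = [f"digraph {name} {{"]
    for src, label, dst in result_edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)

# A toy query result: Alice's relationships.
result = [("Alice", "WORKS_AT", "Acme"), ("Alice", "KNOWS", "Bob")]
dot = to_dot(result)
```

A database with built-in visualization saves you this export step and keeps the picture interactive, which is exactly the convenience this criterion is about.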
In conclusion, graph databases are hot and sprouting up everywhere. Purchasers and practitioners are advised to take vendor claims with a grain of salt before embarking on the adventure of the graph database world! That does not mean it is hard to find the right database: the criteria in this article give readers some simple, achievable steps to try and uncover the truth.