No technology is equally good at everything, and databases are no exception. Databases can be built to serve many different kinds of functions: batch and transactional workloads, memory access and disk access, SQL and XML access, and graph and document data models.
When building a database management system (DBMS), development teams must decide early on what cases to optimize for, which will dictate how well the DBMS will handle the tasks it is dealt (what the DBMS will be amazing at, what it will be OK at and what it may not do so well). As a result, the graph database world is populated with both technology designed to be "graph first," known as native, and technology where graphs are an afterthought, classified as non-native.
The two approaches differ considerably in how they architect both graph storage and graph processing. Unsurprisingly, native technologies tend to perform queries faster, scale bigger (retaining their hallmark query speed as the data set grows in size), and run more efficiently, calling for much less hardware. That makes it critical to understand the differences.
In this "Graph Databases for Beginners" blog series, we have covered why graphs are the future, why data relationships matter, the basics of data modeling, data modeling pitfalls to avoid, why a database query language matters, why we need NoSQL databases, ACID vs. BASE, a tour of aggregate stores, and other graph data technologies.
Now that we have a relatively comprehensive grasp of the basics, it is time to go over the differing internal properties of a graph database. Today, we will discuss some of the characteristics that distinguish native graph databases and why these characteristics are of interest to graph database users.
What "Graph First" Means for Native Graph Technology
There are two main elements that distinguish native graph technology: storage and processing.
Graph storage commonly refers to the underlying structure of the database that contains graph data. When built specifically for storing graph-like data, it is known as native graph storage. Graph databases with native graph storage are optimized for graphs in every aspect, ensuring that data is stored efficiently by writing nodes and relationships close to each other.
Graph storage is classified as non-native when the storage comes from an outside source, such as a relational, columnar, or other NoSQL database. These databases use other algorithms to store data about nodes and relationships, which may end up being placed far apart. This non-native approach can lead to slower query results because the storage layer is not optimized for graphs.
Native graph processing is another key element of graph technology, referring to how a graph database processes database operations, including both storage and queries. Index-free adjacency is the key differentiator of native graph processing.
At write time, index-free adjacency speeds up storage processing by ensuring that each node is stored with direct references to its adjacent nodes and relationships. Then, during query processing (i.e., read time), index-free adjacency allows fast retrieval by following those references, without the need for index lookups. Graph databases that rely on global indexes (rather than index-free adjacency) to gather results are classified as having non-native processing.
Another important consideration is ACID writes. Related data brings an uncommonly strict need for data integrity beyond that of other NoSQL models. In order to store a connection between two things, we must not only write a relationship record but update the node at each end of the relationship as well. If any one of these three write operations fails, it will result in a corrupted graph.
The only way to ensure that graphs aren't corrupted over time is to carry out writes as full ACID transactions. Systems with native graph processing include the proper internal guard rails to ensure that data quality remains impervious to network blips, server failures, competing transactions, and the like.
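The three-write problem described above can be sketched in a few lines of Python. This is a toy illustration of atomicity only (the "A" in ACID), not how any particular graph database implements transactions. The class `TinyGraphStore`, its fields, and the `fail_after` hook are all hypothetical names invented for this example, and the snapshot-and-restore rollback is a stand-in for the logging techniques a real system would use.

```python
import copy


class TinyGraphStore:
    """Toy in-memory store; names and layout are illustrative only."""

    def __init__(self):
        self.nodes = {}          # node_id -> list of attached relationship ids
        self.relationships = {}  # rel_id -> (start_node, end_node, rel_type)

    def add_node(self, node_id):
        self.nodes[node_id] = []

    def connect(self, rel_id, start, end, rel_type, fail_after=None):
        """Create a relationship as one all-or-nothing unit of three writes.

        `fail_after` is a hypothetical test hook: it simulates a crash after
        that many of the three writes have already been applied.
        """
        # Naive rollback strategy: remember the state before the first write.
        snapshot = (copy.deepcopy(self.nodes), copy.deepcopy(self.relationships))
        steps = [
            lambda: self.relationships.__setitem__(rel_id, (start, end, rel_type)),
            lambda: self.nodes[start].append(rel_id),
            lambda: self.nodes[end].append(rel_id),
        ]
        try:
            for done, step in enumerate(steps):
                if fail_after is not None and done == fail_after:
                    raise RuntimeError("simulated failure")
                step()
        except Exception:
            # Undo any partial writes so no half-connected graph is left behind.
            self.nodes, self.relationships = snapshot
            raise


store = TinyGraphStore()
store.add_node("alice")
store.add_node("bob")
store.connect("r1", "alice", "bob", "KNOWS")

try:
    # Fails after the relationship record and one node update were written.
    store.connect("r2", "bob", "alice", "KNOWS", fail_after=2)
except RuntimeError:
    pass
```

After the failed call, the store still contains only `r1`. Without the rollback, `r2` would exist as a relationship record and on one node but not the other, which is exactly the kind of corruption the text warns about.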
Native Graph Storage
In more detail, what makes graph storage native is that the database is structured for graphs from the ground up. Graph databases with native graph storage have underlying storage designed specifically for the storage and management of graphs. They are designed to maximize the speed of traversals during arbitrary graph algorithms.
For example, let's take a look at the way Neo4j — a native graph database — is structured for native graph storage. Every layer of this architecture — from the Cypher query language to the files on disk — is optimized for storing graph data, and not a single part is substituted in from other non-graph technologies.
Graph data is kept in store files, each of which contains data for a specific part of the graph, such as nodes, relationships, labels, and properties. Dividing the storage in this way facilitates highly performant graph traversals.
In a native graph database, a node record's main purpose is to simply point to lists of relationships, labels, and properties, making it quite lightweight.
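To make the idea of a lightweight node record concrete, here is a minimal sketch using Python's `struct` module. The layout is an assumption made up for illustration (three 4-byte ids per record, a `NO_ID` sentinel, fixed-size records) and is not Neo4j's actual on-disk format. What it shows is the general principle: a node record holds only pointers into other stores, and with fixed-size records a node's position can be computed from its id rather than searched for.

```python
import struct

# Hypothetical layout: (first_relationship_id, first_property_id, label_list_id),
# each stored as a little-endian 4-byte unsigned int, so 12 bytes per node.
NODE_RECORD = struct.Struct("<III")
NO_ID = 0xFFFFFFFF  # sentinel meaning "points to nothing"


def write_node(store, node_id, first_rel, first_prop, labels):
    """Write a node record at the offset implied by its id."""
    offset = node_id * NODE_RECORD.size
    needed = offset + NODE_RECORD.size
    if len(store) < needed:
        # Pad with 0xFF bytes so unused slots decode as NO_ID pointers.
        store.extend(b"\xff" * (needed - len(store)))
    NODE_RECORD.pack_into(store, offset, first_rel, first_prop, labels)


def read_node(store, node_id):
    """Read a node record directly by offset; no search is involved."""
    return NODE_RECORD.unpack_from(store, node_id * NODE_RECORD.size)


node_store = bytearray()
write_node(node_store, 2, first_rel=7, first_prop=NO_ID, labels=3)
record = read_node(node_store, 2)
```

Here `record` holds the three pointers for node 2. The relationship chain, properties, and labels themselves would live in their own stores, which is what keeps the node record small.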
So, what makes non-native graph storage different from a native graph database?
Non-native graph storage uses a relational database, a columnar database, or some other general-purpose data store rather than being specifically engineered for the uniqueness of graph data. While the typical operations team might be more familiar with a non-graph backend (like MySQL or Cassandra), the disconnect between graph data and non-graph storage results in a number of performance and scalability concerns.
Non-native graph databases are not optimized for storing graphs, so the algorithms utilized for writing data may store nodes and relationships all over the place. This causes performance problems at the time of retrieval because all these nodes and relationships then have to be reassembled for every single query.
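As a rough illustration of that reassembly, the sketch below stores a small graph in a relational backend using Python's built-in `sqlite3` module. The schema (tables `nodes` and `edges`, index `edges_by_src`) and the function `friends_of_friends` are assumptions invented for this example, not the design of any specific product. Note how a two-hop question requires one join, and therefore one index lookup, for each hop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER, type TEXT);
    CREATE INDEX edges_by_src ON edges (src);
""")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [(1, 2, "KNOWS"), (2, 3, "KNOWS")],
)


def friends_of_friends(conn, name):
    """Two-hop traversal: every hop is another join against the edge table."""
    rows = conn.execute(
        """
        SELECT DISTINCT n3.name
        FROM nodes n1
        JOIN edges e1 ON e1.src = n1.id    -- hop 1
        JOIN edges e2 ON e2.src = e1.dst   -- hop 2
        JOIN nodes n3 ON n3.id = e2.dst
        WHERE n1.name = ?
        ORDER BY n3.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


result = friends_of_friends(conn, "alice")
```

Each additional hop adds another join to the query, so deeper traversals mean more rows to locate and stitch back together at read time.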
On the other hand, native graph storage is built to handle highly interconnected datasets from the ground up and is therefore the most efficient when it comes to the storage and retrieval of graph data.
Native Graph Processing
A graph database has native processing capabilities if it uses index-free adjacency. This means that each node directly references its adjacent nodes, acting as a micro-index for all nearby nodes. Index-free adjacency is more efficient and cheaper than using global indexes, as query times are proportional to the amount of the graph searched, rather than increasing with the overall size of the data.
Since graph databases store relationship data as first-class entities, relationships are easier to traverse in any direction with native graph processing. With processing that is specifically built for graph datasets, relationships — rather than over-reliance on indexes — are used to maximize the efficiency of traversals.
On the other hand, non-native graph databases use global indexes to link nodes together. This method is more costly, as the indexes add another layer to each traversal, which slows processing considerably.
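First of all, a global index lookup is more expensive for every hop: an index lookup typically costs about O(log n) time, where n is the number of indexed entries, while native graph processing traverses a relationship in O(1) time. Multi-hop queries compound the difference, because every additional layer of connection requires another index lookup. A traversal of m hops costs roughly O(m log n) through a global index, compared with O(m) when following relationships directly.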
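The difference between the two lookup styles can be sketched as follows. This is a simplified model under stated assumptions: the "global index" is just a sorted Python list searched with `bisect`, and "index-free adjacency" is modeled with object references. The names `neighbors_via_index`, `neighbors_via_adjacency`, and `Node` are hypothetical and chosen only for this example.

```python
from bisect import bisect_left


def neighbors_via_index(sorted_edges, node):
    """Non-native style: binary-search one global sorted edge list, O(log n)."""
    i = bisect_left(sorted_edges, (node,))
    found = []
    while i < len(sorted_edges) and sorted_edges[i][0] == node:
        found.append(sorted_edges[i][1])
        i += 1
    return found


class Node:
    """Native style: each node holds direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.adjacent = []


def neighbors_via_adjacency(node):
    """Follow stored references; no lookup in any global structure."""
    return [n.name for n in node.adjacent]


# The same three-edge graph in both representations.
edges = sorted([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])

alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.adjacent = [bob, carol]
bob.adjacent = [carol]

via_index = neighbors_via_index(edges, "alice")
via_adjacency = neighbors_via_adjacency(alice)
```

Both calls return the same neighbors. The point is where the work goes: the index version has to search a structure whose size grows with the whole graph, while the adjacency version only touches the node it is already standing on.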
In addition, reversing the direction of a traversal is extremely difficult with non-native graph processing. To reverse a traversal's direction, you must either create a costly reverse-lookup index for each traversal or perform a brute-force search through the original index, which is a costly O(n) operation.
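A small sketch makes the reversal problem visible. It assumes a forward-only index modeled as a Python dict from source to targets; the helper names (`incoming_by_scan`, `build_reverse_index`, `GraphNode`, `Relationship`) are hypothetical. The first two functions show the two non-native options described above, and the classes show one way a relationship stored as a first-class record can be reached from either end.

```python
def incoming_by_scan(forward_index, target):
    """Brute force: inspect every entry of the forward-only index, O(n)."""
    return sorted(src for src, dsts in forward_index.items() if target in dsts)


def build_reverse_index(forward_index):
    """Alternative: pay up front to build and store a second index."""
    reverse = {}
    for src, dsts in forward_index.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return reverse


class GraphNode:
    def __init__(self, name):
        self.name = name
        self.outgoing = []  # relationships that start at this node
        self.incoming = []  # relationships that end at this node


class Relationship:
    """A first-class record linked into both of its end nodes."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        start.outgoing.append(self)
        end.incoming.append(self)


forward = {"alice": ["bob", "carol"], "bob": ["carol"]}
scanned = incoming_by_scan(forward, "carol")
reverse = build_reverse_index(forward)

a, b, c = GraphNode("alice"), GraphNode("bob"), GraphNode("carol")
Relationship(a, b)
Relationship(a, c)
Relationship(b, c)
who_points_at_carol = [rel.start.name for rel in c.incoming]
```

All three approaches find the same answer for "carol". The scan touches the entire index, the reverse index doubles what must be stored and maintained, and the linked relationship records answer the question from the node itself.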
The Bottom Line: Why Native vs. Non-Native Matters
When deciding between a native and non-native graph database, it is important to understand the tradeoffs of working with each.
Non-native graph technology most likely has a persistence layer that your development team is already familiar with (such as Cassandra, MySQL, or another relational database), and when your dataset is small or less connected, choosing non-native graph technology isn't likely to significantly affect the performance of your application.
However, it's important to note that datasets tend to grow over time, and today's datasets are more unstructured, interconnected, and interrelated than ever before. Even if your dataset is small to begin with, it's important to plan for the future if your data is likely to grow alongside your business. In this case, a native graph database will serve you better over the long term because the performance of non-native graph processing degrades sharply under larger datasets.
One of the biggest drivers behind moving to a native graph architecture is that it scales. As you add more data to the database, many queries that would slow with size in a non-native graph database remain lithe and speedy in a native context. Native graph scaling takes advantage of a large number of optimizations in storage and processing to yield a highly efficient approach, whereas non-native uses brute force to solve the problem, requiring more hardware (usually two to four times the amount of hardware or more) and resulting in higher latencies, especially for larger graphs.
Not all applications require low latency or processing efficiency, and in those use cases, a non-native graph database might just do the job. But if your application requires storing, querying and traversing large interconnected datasets in real time for an always-on, mission-critical application, then you need a database architecture specifically designed for handling graph data at scale.
The bottom line: The importance of native vs. non-native graph technology depends on the particular needs of your application, but for enterprises hoping to leverage the connections in their data like never before, the performance, efficiency and scaling advantages of a native graph database are crucial for success.
Original article by Joy Chao