There is no magic in computer science. There are only two things: algorithms and data structures. Choose the right combinations and they will help you solve the problem. Choose the wrong ones and they will make the problem seem insurmountable.
Imagine you’ve been tasked with keeping a list of people, a list of things, and a list of things each person likes. If you use a relational database, you would have three tables: one table of people with a primary key of personId, one table of things with a primary key of thingId, and a join table between them holding foreign keys to both tables. Let’s say we start with personId 83. How do we find the names of the things this person likes? We would scan down the join table looking for all the entries that had personId 83 in them, and collect the thingIds they referred. Then, we would scan down the things table, getting the name property of each of the thing Ids we collected. This algorithm would work, but it would not be very efficient.
We could add an index to the keys of our tables. Now, instead of scanning the join table looking for entries that have personId 83, we can search a B-tree. A B-tree can perform a search operation in logarithmic time, using Big O notation: O(log n). As the data grows 10x in size, our search speed would only slow down by 2x. Starting with personId 83, we would perform one search operation to find all the thingIds they liked, and for each thingId, we would perform another search operation to get the name property of each one. Much better than scanning down the whole table.
Alternatively, we could have used a key-value store. KV stores use one big table and hashes to search in an average of O(1) time, and a worst case of O(n). Starting once again at personId 83, we would perform one hash operation to get to a list of thingIds they liked, and for each thingId, we would perform another hash operation to get the name property of each one. This would be faster, but we run into a problem: overhead from keeping a large hash table and the inevitable hash collisions that would creep up our search time.
Three tables didn’t work well. One table didn’t work well. What if we tried two tables?
Imagine one table to hold both people and things, and another table to hold the relationships between them. The entry for personId 83 would have a pointer to its first relationship. That first relationship would point to the entry for the thing that was liked as well as another pointer to the next relationship of personId 83. Instead of searching in a B-tree or hashing keys, we would chase pointers in two arrays which is always an O(1) operation.
That’s the secret to graph databases. It is called “index-free adjacency” but all it really means is that objects are connected at all times without having to search. This is an important distinction from other databases as it makes data “intelligent.”
Imagine a row in a relational database. It has no idea which join tables are pointing at it and what else those join tables are pointing to. In the graph, a row is transformed into a Node and it knows exactly what it connects to and what is connecting to it by Relationships that are now first-class data citizens and hold referential integrity by design.
“There is nothing new except what has been forgotten,” said Rose Bertin to Marie Antoinette in 1785 — and it applies here, as well. Charles Bachman built the Integrated Data Store (IDS) in 1963, which managed relationships between records using chains of pointers. This later evolved into the network database model from CODASYL, which fell out of favor in the1970s as relational databases emerged and eventually took over in the 1980s. We are in many ways still trying to build the technology showcased in 1968 by Douglas Engelbart in “the mother of all demos.” But before we get lost in the past, let’s jump back to the present and see where graph databases can help us today.
A graph database is a general-purpose database. Anywhere you use a relational database, you can use a graph database, but it really shines when you can reimagine your data as a graph. There are hundreds of use cases, but the most popular ones seem to be real-time recommendations, fraud detection, master data management, network and IT operations, product hierarchies, rules engines, routing, ancestry, and data lineage. A big shift in capabilities occurs when you can take a query that wasn’t able to perform in real time and suddenly have it complete in milliseconds. There are many accounts of companies that had failed to achieve results in Teradata or Oracle for years, which are now finding success in weeks using graph databases.
To take full advantage of a graph database you must forget what you know about modeling in relational databases. There is no third normal form for graph databases. The shape of your data model will be heavily dictated by the kind of queries you need to run. It is a mistake to try to simply translate a relational model to the graph. It is also a mistake to see the graph in a flat two-dimensional plane. A graph allows for any number of dimensions and the ability to quickly traverse in and out of them. Yes, it can be more complicated to model things correctly initially, but the graph is pretty forgiving as it allows new types of nodes and relationships to easily be added. Careful constructs in direction and relationship types allow for a kind of data sharding or built-in pagination not easily doable in relational databases. The end goal is to ensure each query looks at the least amount of the graph as possible in order to find the correct answer quickly.
Getting the data into the graph is one thing, getting the answers out is another problem altogether. A language land war is currently being fought amongst the vendors. On one side you have Gremlin, a powerful imperative language built by a genius level developer to be used by genius-level developers. On the other side, you have Cypher, an ASCII art-inspired declarative language that aims to be as easy to learn as SQL once was. Regardless of which side you choose, the real win is being able to express your ideas in a graph-like way, not the manner in which those ideas are interpreted. Much like the in the movie “Arrival,” thinking in graphs requires a kind of graph epiphany, but once achieved, it opens up a new world of possibilities.
Once dismissed as the red-headed stepchild of the NoSQL world, over the last four years, graph databases have become the fastest growing category with new vendors emerging and existing vendors adding graph capabilities to their current solutions. As our industry continues to mature and evolve, it is important to know when to use the right tool for the job. There are over 300 databases listed on db-engines.com and you owe it to yourself to explore some graph databases. But be warned, once you experience their power and flexibility, you may never wish to go back to relational databases ever again.