The Engineering Evolution of Neo4j Into a Native Graph Database
Follow the history and progression of Neo4j from its inception to its current state, and see how it came to be from an engineering standpoint.
Hi, everyone. My name is Dr. Jim Webber. I’m Neo4j’s chief scientist, and I’d like to take a few minutes to talk to you about Neo4j’s evolution from an engineering point of view.
Our Origin Story: Overcoming Complicated SQL Queries
I don’t know if many people know this, but Neo4j actually started off as the database that supported a content management system (CMS). That CMS originally ran on a relational database.
The problem that spawned Neo4j, its genesis if you like, is that in trying to execute some of these CMS queries, we ended up writing a lot of very complicated SQL queries that were difficult to understand and maintain. The main problem was that while we had to use a relational database to store and query our data, our mental model was one of connected data: paths between related items of content, metadata, tags, and metatags for that content.
As a result, we had an enormous cognitive gap between the way that we thought of the data and how that data was actually stored and queried.
Our First Incarnation: Graph Layer Over RDBMS
In order to shrink that gap and make us more productive, we took the step of writing a graph layer on top of our relational database so that we could express our content relationships in a way that felt natural. That multiplicity of rich, varied, bidirectional, named content relationships was now accessible to us as developers, and the graph layer dealt with translating that rich model into the tabular model of the relational database.
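To picture what such a layer does, here is a minimal sketch in Python. It is not the original CMS code: the `GraphLayer` class, the `relationship` table, and the method names are all hypothetical, and SQLite stands in for the relational database. It only shows the idea of a graph-style API whose calls are translated into rows and SQL underneath.

```python
import sqlite3


class GraphLayer:
    """Hypothetical graph-style API backed by a single relational table."""

    def __init__(self):
        # An in-memory SQLite database stands in for the relational store.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE relationship (src TEXT, name TEXT, dst TEXT)"
        )

    def relate(self, src, name, dst):
        # Each named relationship is translated into one row.
        self.conn.execute(
            "INSERT INTO relationship VALUES (?, ?, ?)", (src, name, dst)
        )

    def outgoing(self, node, name):
        # Follow relationships of the given name away from the node.
        rows = self.conn.execute(
            "SELECT dst FROM relationship WHERE src = ? AND name = ? ORDER BY dst",
            (node, name),
        )
        return [row[0] for row in rows]

    def incoming(self, node, name):
        # The same rows answer the reverse question, so links read both ways.
        rows = self.conn.execute(
            "SELECT src FROM relationship WHERE dst = ? AND name = ? ORDER BY src",
            (node, name),
        )
        return [row[0] for row in rows]
```

With a layer like this, a developer writes `layer.relate("article-1", "TAGGED", "graphs")` and `layer.incoming("graphs", "TAGGED")` and never sees the SQL, which is the productivity gain described above.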
Initially, this was great. It gave us the ability to write queries naturally; it provided us with insight into our data, and it provided us with a development boost. It also gave us a great deal of ambition.
Now that we didn’t have to maintain an awful lot of complex SQL, we started to be more ambitious in the kind of queries that we wanted to run — because of the removal of that accidental complexity. That is, we started to get more ambitious about the value we could extract from that connected data.
Our Second Incarnation: Fully Native Graph Database
This is where it all went wrong. Because we were not using a native graph database — that is, under the covers it was a relational database — we ran into the JOIN bomb problem.
Our graph API gave us the ability to express phenomenal things, but the underlying engine was not optimized for graphs. It was optimized for tables. So, doing recursive JOINs through that data eventually led us to the JOIN bomb problem.
Mechanically, it placed stress on main memory, main memory spilled over to disk, and then, far from being greased lightning, we were chugging along at mechanical, rather than electronic, speeds. That was the kernel that gave birth to Neo4j.
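The shape of the problem can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Neo4j's engine and not the original CMS schema: the `edge` table, the sample data, and both function names are made up, and SQLite stands in for the relational database. The relational version needs one more self-join for every extra hop in the path, while the graph-style version simply follows each node's own links.

```python
import sqlite3

# Hypothetical content graph: (source, target) pairs standing in for "related to" links.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]


def three_hops_sql(start):
    """Relational style: a three-hop path costs three copies of the edge table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", EDGES)
    rows = conn.execute(
        """
        SELECT DISTINCT e3.dst
        FROM edge e1
        JOIN edge e2 ON e2.src = e1.dst
        JOIN edge e3 ON e3.src = e2.dst
        WHERE e1.src = ?
        """,
        (start,),
    ).fetchall()
    conn.close()
    return {row[0] for row in rows}


def n_hops_graph(start, hops):
    """Graph style: expand a frontier by following each node's outgoing links."""
    adjacency = {}
    for src, dst in EDGES:
        adjacency.setdefault(src, []).append(dst)
    frontier = {start}
    for _ in range(hops):
        frontier = {nxt for node in frontier for nxt in adjacency.get(node, [])}
    return frontier
```

Both functions return the same nodes for a three-hop query on this toy data. The difference is that a deeper path means rewriting the SQL with yet another join, and on real data each join can multiply the intermediate rows the engine has to hold.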
Our CTO, Johan Svensson, and our CEO, Emil Eifrem, decided to take the big step of removing the relational database and building our own native graph database to be able to store and query that graph data.
This was great because now not only did we have a graph API through which we could express useful, powerful graph queries, but we had an underlying engine that didn’t blow up every time we tried to traverse a large path. So that’s the JOIN bomb defused.
The Birth of the Cypher Graph Query Language
A few years down the line, Neo4j evolved and matured. On top of the original graph API, we built a phenomenal query language called Cypher, which makes that graph API accessible not only to the development and engineering community, but also to end users who are slightly technical. Under the covers, the graph engine itself became more and more powerful. It’s always been a decent ACID transactional engine, but now, decades of engineering later, it is a high-performance machine ready to use in your deployments.
Of course, given that we aim Neo4j at real production deployments — the kind of Internet-facing apps that we’re so used to using in our daily lives — Neo4j is able to cluster. In fact, Neo4j has now gone through three generations of clustering architectures.
The Next Step: Causal Clustering With Neo4j
At some point in our history, we decided that we could be even more ambitious about Neo4j clustering. What we chose to experiment with was taking our wonderful Cypher query language and grafting it onto a NoSQL database. We knew that NoSQL databases have a reputation for being very, very scalable. We knew that Cypher is the most amazing query language: It opens up the world of graphs to non-experts who can get very powerful things done. So, our assumption was that the marriage of these technologies would yield a very scalable, very accessible, production-ready graph database.
For the first few weeks of designing this system, we were extraordinarily excited about the prospect. However, that quickly turned sour when we realized that the eventual consistency model underpinning NoSQL databases is simply not suitable for graphs.
The reason is that we need to maintain consistency between disjointed replicas, and the consistency models supported by the popular NoSQL stores are simply far too weak to maintain those semantics. And then there was a kind of sinking feeling, an “Uh-oh,” that we were going down a bit of a rat hole here.
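One way to picture the concern is the sketch below. It rests on an assumption made only for illustration: that a relationship is recorded as two reciprocal entries held on two separate stores. It does not describe any specific NoSQL product or Neo4j's internals, and the replica layout, function names, and record format are all hypothetical.

```python
def make_replicas():
    """Two hypothetical stores, each holding the records for its own node."""
    return {"alice": set()}, {"bob": set()}


def write_knows(replica_a, replica_b, second_write_arrives):
    """Record alice-KNOWS-bob as two reciprocal entries, one per store."""
    replica_a["alice"].add(("KNOWS", "bob"))
    if second_write_arrives:
        # With weak consistency, this second write may be delayed or lost.
        replica_b["bob"].add(("KNOWN_BY", "alice"))


def is_reciprocal(replica_a, replica_b):
    """The relationship is sound only if both ends agree on whether it exists."""
    forward = ("KNOWS", "bob") in replica_a["alice"]
    backward = ("KNOWN_BY", "alice") in replica_b["bob"]
    return forward == backward
```

If the second write never arrives, a traversal starting from alice finds the relationship while one starting from bob does not, so the two stores describe different graphs.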
Ironically, we should have looked to our own past for this: We should have looked at where Neo4j came from and at the fact that graph technology is special. Graphs need native graph technology to be stored, queried, and processed safely.
Unless we have a stack that’s designed top to bottom to store and query graphs safely, we’re going to end up with corruption or worse. Neo4j took the other route: It now has a phenomenal Causal Clustering architecture, and it is graph native.
Causal Clustering is designed to store and query graphs at scale for Internet-facing apps. It’s designed to keep that data safely so that when you entrust your data to us, we look after it. You can take this stuff today and build some phenomenal apps with it.
Published at DZone with permission of Jim Webber, DZone MVB. See the original article here.