An Interview with Ian Robinson, Author of Graph Databases, from O'Reilly
I recently had the opportunity to interview one of the authors of O'Reilly's newly published Graph Databases. Ian Robinson took some time to introduce the book and to explain how it is relevant to Java devs interested in technologies like Neo4j. He also gives some good insight into where he believes graph database technologies are headed in the near future.
DZONE: Please give us a quick introduction to the content of O'Reilly's new book, Graph Databases.
The book provides both a broad introduction to the graph database space, and a detailed look at how to go about designing and developing a graph database solution. Many of the examples draw on our experience of developing and working with Neo4j, but the important concepts carry over to other graph databases.
In terms of content, the first half of the book focuses on offering practical guidance on modeling with graphs, querying graphs, and building graph database solutions. We demonstrate how the graph data model can be used to model domains in ways that are very similar to the informal techniques we use for communicating complex data problems: where we draw circles, lines and text on the whiteboard, we add nodes, relationships and properties to the graph. Having shown how to build a graph, we then show how we can use Neo4j's Cypher query language to identify complex patterns in the data, thereby generating valuable insights into our domain. The implementation sections discuss the many things we should think about when developing a graph database solution, from solutions architecture, to testing and capacity planning.
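To make the whiteboard analogy concrete, here is a minimal in-memory sketch of nodes, relationships and properties, followed by a simple friend-of-a-friend pattern match. It is an illustration only: the class names, the `FRIEND` relationship type and the sample people are assumptions made for this example, and the code is neither Cypher nor Neo4j's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node (a circle on the whiteboard) with key-value properties."""
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed relationship (a line on the whiteboard)."""
    start: int
    rel_type: str
    end: int
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy store; a real graph database indexes relationships."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []
        self._next_id = 0

    def add_node(self, **properties):
        node = Node(self._next_id, properties)
        self.nodes[node.node_id] = node
        self._next_id += 1
        return node

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start.node_id, rel_type, end.node_id, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node, rel_type):
        """Follow relationships of one type outward from a node."""
        return [self.nodes[r.end] for r in self.relationships
                if r.start == node.node_id and r.rel_type == rel_type]


def friends_of_friends(graph, person):
    """Match (person)-[FRIEND]->()-[FRIEND]->(fof), excluding the
    person themselves and anyone who is already a direct friend."""
    direct = graph.outgoing(person, "FRIEND")
    excluded = {person.node_id} | {f.node_id for f in direct}
    found = {}
    for friend in direct:
        for fof in graph.outgoing(friend, "FRIEND"):
            if fof.node_id not in excluded:
                found[fof.node_id] = fof
    return sorted(n.properties["name"] for n in found.values())


graph = PropertyGraph()
alice = graph.add_node(name="Alice")
bob = graph.add_node(name="Bob")
carol = graph.add_node(name="Carol")
graph.relate(alice, "FRIEND", bob, since=2010)
graph.relate(bob, "FRIEND", carol, since=2012)
fof_names = friends_of_friends(graph, alice)
```

In a graph query language the same pattern would be stated declaratively rather than walked by hand; the loop above only shows what such a match has to find.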
Following this modeling and implementation guidance, we then describe three real-world use cases: social networking, access control, and logistics. For each use case, we summarise the business case, and give detailed examples of the data models and queries used in that scenario.
In later chapters, we take a deep dive into Neo4j's internals, including its store file formats, and transactional and clustering behaviours. We end with a look at how we can apply predictive analysis techniques and graph algorithms to solve some tricky problems and generate yet deeper insights into our graph data.
DZONE: Tell us a little about your background and how it informed the creation of this book.
Emil, together with his Neo cofounders, Johan and Peter, invented the graph database at the turn of the century. In the intervening years he's devoted his professional life to building and evangelizing graph databases. Jim and I were users of Neo4j before we joined Neo Technology; together we built product catalogues and recommendations engines for a number of different clients. Since joining Neo, Jim has led the research and development of a massive horizontally write-scalable graph that compromises neither traversal performance nor ACIDity. I've divided my time between working with customers to design and develop graph database solutions, and development with the research and product engineering teams.
DZONE: What does this book have to offer DZone's core audience of Java developers, many of whom have experience with Neo4j?
I think there's a lot here even for people who are quite familiar with Neo4j. The book's backed by a lot of real-world experience, which translates into practical guidance that shows how to solve real problems in ways that fit with today's software delivery practices. The data modeling and testing sections tackle many of the questions we frequently see asked around these subjects. The query examples--in particular, the use-case examples--show off many of Cypher's capabilities, and can be easily adapted for other uses. And our discussion of the benefits of graph databases, and how they stack up against both relational and other NOSQL databases for particular kinds of data problems, can help if you want to make the case in your own organisation for adopting graph technology.
DZONE: What are some exciting projects that you wish more graph database or graph theory enthusiasts were aware of?
The use cases are varied and growing rapidly--across large enterprises and ambitious startups alike. In the last couple of years we've seen some very exciting applications of Neo4j that demonstrate the sheer size and scale capabilities of graph databases. Glassdoor, for example, maintain over half the Facebook graph in an instance of Neo4j in order to power their Inside Connections feature. In Europe, one of the world's largest parcel networks uses Neo4j to calculate parcel routes. Several thousand parcels enter the network each second, at which point they're mechanically sorted according to their destination. Neo4j calculates the route to the ultimate destination in the several milliseconds available to the system before the parcel reaches a point where the sorting equipment has to make a choice.
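As a rough illustration of what calculating a route over a network involves, the sketch below runs Dijkstra's shortest-path algorithm over a small weighted graph. The interview does not say which algorithm or data model the parcel system uses, so the algorithm choice, the location names and the costs are all assumptions made for this example.

```python
import heapq


def cheapest_route(network, source, destination):
    """Dijkstra's algorithm over a dict of {node: {neighbour: cost}}.

    Returns (total_cost, [nodes on the route]), or None if the
    destination cannot be reached from the source.
    """
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, step in network.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(
                    queue, (cost + step, neighbour, path + [neighbour]))
    return None


# Hypothetical delivery network: directed legs with illustrative costs.
network = {
    "depot": {"hub_a": 4, "hub_b": 2},
    "hub_b": {"hub_a": 1, "delivery": 7},
    "hub_a": {"delivery": 3},
}
route = cheapest_route(network, "depot", "delivery")
```

Here the cheapest route goes via both hubs at a total cost of 6, beating the more direct legs through either hub alone.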
Elsewhere we are seeing a lot of large corporations turning to graph databases to manage entitlements and complex access and authorization networks. Telenor, for example, replaced the relational database-based access control element of one of their software-as-a-service offerings with Neo4j, and in so doing reduced the latency of some of their complex queries, which span millions of entities and relationships, from several minutes to several milliseconds.
DZONE: In the introduction to the book, you recognize the recent democratization of graph database technologies. What are some of the results of the more open availability of these technologies?
The most obvious consequence is that graphs are no longer considered simply a niche or academic subject. As an industry, we've dealt in graphs in one form or another for a long time, but always called them something else, with the result that the material that dealt with graphs as graphs was confined mainly to research departments, with little or no impact on our day-to-day software development practices. Ten years ago, graph as a commodity technology didn't exist: those few companies that did appreciate the power of graphs--companies like Google, Facebook and Twitter, who were prepared to bet their business on graphs--had to invest enormously in proprietary platforms.
That's all changed. We're all now familiar with things like the social graph and the knowledge graph. And with this familiarity we've come to see how many of our day-to-day problems in our own data domains can be regarded as graph problems.
But it's not just a matter of us becoming more aware of the power and potential of graphs; more importantly, we now all have access to a suite of tools and techniques that we can use to apply graphs to our own particular problems.
The second consequence of the democratization of graph database technologies is that many organisations are now challenging the institutional orthodoxy that regards relational databases as being the natural or inevitable end state for working with connected data. NOSQL emerged not so much as a challenge to the expressive and flexible nature of the relational model as a remedy for the perceived performance and operational limitations of relational technology. But in addressing performance and scalability, NOSQL has generally given up on the capabilities of the relational model, particularly with regard to connected data. Graph databases, in contrast, revitalise the world of connected data, outperforming relational databases by several orders of magnitude. Many of the most interesting questions we want to ask of our data require us to understand not only that things are connected but also the differences between those connections. Graph databases offer the most powerful and performant means for generating this kind of insight.
DZONE: What do you see as the natural evolution of graph database technologies - where is this technology going?
Our capacity to model and reason about a problem depends on the languages and tools available to us. As graph databases become more mainstream, we anticipate that the concepts, terminologies, and reasoning strategies associated with them will become more deeply embedded in the ways we think about, model and implement software solutions.
The network databases of old--the forerunners of today's modern graph databases--fell out of favour in the '70s because they were difficult to work with. Writing in 1975, in Data and Reality, William Kent summarised this weakness in the network model; more importantly, however, he also offered up a tantalising promise for the future:
"Still others focus their critical comparisons on the manipulative language specified in the DBTG proposal. Such critics fail to see the possibility of developing a better language (exploiting the semantics of named relationships), much as a language like SQL shields users from relational joins and projections."
That possibility is now here in the form of Cypher. Very few people, once they've tried Cypher, want to go back to SQL. Today's graph databases make it easy for developers and data architects to model, store and query their connected data in ways that map closely to their natural language descriptions of a domain and the questions they want to ask of it. I think that as we exercise the expressive power of both the language and the graph model, we'll demand more of the technology, and so drive its evolution.
Most of today's graph databases adopt the property graph model (which comprises nodes, properties, and named and directed relationships) that Neo4j pioneered nearly 10 years ago. We think it's a pretty powerful model. It's not necessarily a pure graph model--rather, it's a pragmatic union of graph and record-like structures--but it works; developers tend to find it much easier to use a property graph than either a pure graph model or a relational model to move a design from the whiteboard into their application.
But despite its advantages and ease of use, today's model isn't necessarily the best model. We think we can do better. Later this year Neo4j will introduce some new first-class elements to its property graph model. Chief amongst these are labels and constraints. Labels are like valueless properties that can be attached to nodes. There's a many-to-many relationship between nodes and labels: a node can have one or more labels attached to it, and a label can be associated with one or more nodes. Schema constraints and indexing behaviours can then be associated with these labelled nodes.
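The many-to-many relationship between nodes and labels, and the idea of attaching a constraint to labelled nodes, can be sketched as follows. This is a toy model written for illustration: the `LabelledGraph` class, its method names, the uniqueness constraint and the sample data are all assumptions, not the API that Neo4j will ship.

```python
from collections import defaultdict


class LabelledGraph:
    """Nodes carry properties plus any number of labels; a uniqueness
    constraint can be declared for a (label, property key) pair."""

    def __init__(self):
        self.properties = {}                # node id -> property dict
        self.labels_of = defaultdict(set)   # node id -> labels on that node
        self.nodes_with = defaultdict(set)  # label -> node ids carrying it
        self.unique = set()                 # (label, property key) pairs

    def require_unique(self, label, key):
        """Declare that `key` must be unique among nodes with `label`.
        Only nodes added afterwards are checked in this sketch."""
        self.unique.add((label, key))

    def add_node(self, labels, **props):
        # Check every constraint that applies to the new node's labels.
        for label, key in self.unique:
            if label in labels and key in props:
                for other in self.nodes_with[label]:
                    if self.properties[other].get(key) == props[key]:
                        raise ValueError(f"{label}.{key} must be unique")
        node_id = len(self.properties)
        self.properties[node_id] = props
        for label in labels:
            self.labels_of[node_id].add(label)
            self.nodes_with[label].add(node_id)
        return node_id


g = LabelledGraph()
g.require_unique("User", "email")
first = g.add_node(["User", "Author"], email="a@example.com")
# Same email is accepted here because this node is not labelled User.
second = g.add_node(["Author"], email="a@example.com")
try:
    g.add_node(["User"], email="a@example.com")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Because the constraint is tied to a label rather than to the whole graph, nodes without that label remain free of it, which is what keeps the model schema-optional.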
Labels and constraints make the graph even more expressive and easier to reason about than ever before, while retaining the bottom-up, schema-free characteristics that make the property graph model ideal for capturing variably-structured domains. Additional enhancements to the model will likely see a more full-blown map- or document-like structure replace the current record structure of node and relationship properties. Together, these enhancements will make it easier to work with simple and complex data structures alike.
Finally, in a more speculative mode, I’d like to reflect on the suggestive and inferential capabilities of graph databases, and the new opportunities for creating end-user value they open up.
Line-of-business applications typically apply deductive and precise algorithms--calculating payroll, applying tax, and maintaining shopping cart totals, for example. But as the big web properties such as Google, Facebook, and Twitter have shown, there's also tremendous value in being imprecise and suggestive--in applying data along a metaphorical axis.
Pattern matching in a graph provides an arbitrarily complex identity function. This is revolutionary. It can be observed in the small in things like social networking and recommendations. My identity--in the context of an online shopping experience, say--is a metaphor for the behaviours of others; I recognise myself and the opportunities available to me in the matched patterns that situate me with respect to other elements in the network. These suggestive, inferential capabilities of graph databases open up an exciting world of new possibilities.
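One small-scale instance of this idea is a co-purchase recommendation: a shopper is situated relative to other shoppers by the items they have in common. The sketch below is an assumption-laden illustration (the `recommend` function, the `BOUGHT` pattern and the sample purchases are invented for this example), not a description of any particular product's recommendation engine.

```python
from collections import Counter


def recommend(purchases, shopper):
    """Match (shopper)-[BOUGHT]->(item)<-[BOUGHT]-(other)-[BOUGHT]->(suggestion)
    and rank each suggestion by the number of matched paths that reach it."""
    mine = purchases.get(shopper, set())
    scores = Counter()
    for other, theirs in purchases.items():
        if other == shopper:
            continue
        shared = len(mine & theirs)   # items linking the two shoppers
        for item in theirs - mine:    # things the shopper has not bought
            scores[item] += shared
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked if score > 0]


# Hypothetical purchase history: shopper -> set of items bought.
purchases = {
    "me": {"tent", "stove"},
    "ann": {"tent", "stove", "lantern"},
    "raj": {"tent", "map"},
    "lee": {"novel"},
}
suggestions = recommend(purchases, "me")
```

The shopper with no items in common contributes nothing, so only suggestions reachable through a shared purchase are returned, with the most strongly connected one first.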