I have been working with graph visualizations for almost 20 years now, but only recently have I begun looking into graph databases.
Shortly after I got introduced to Neo4j, I found that when looking at existing dataset examples, I often felt the need to look at and better understand the underlying schema of the data. Although a Neo4j database does not need a schema, most of the time data will adhere to a schema and without one, creating elegant and efficient queries to gain insight into your database becomes rather difficult.
I spoke with other Neo4j users, and they told me that they had come across the same problem. Larger projects should have separate documentation of the database schema, but as is so often the case with documentation, either it doesn’t exist or it is out of sync with reality.
Getting the Schema
You don’t need up-to-date documentation to take a look at the schema. There are existing solutions that can help you here:
The built-in Neo4j Browser shows a list of all node labels, relationship types and property keys currently in use in its sidebar. When you click on one of them, the Neo4j Browser samples a few random nodes and relationships and renders them to the screen. You can then interactively explore the actual graph data and build the schema in your head or with pen and paper.
If you have installed the APOC tools on your server, you can make use of the awesome meta-graph APOC procedure to automate this sampling: the procedure creates results that the Neo4j Browser can actually display as a graph.
Some people may not have the APOC tools installed or cannot install them on their server. Luckily, starting with Neo4j 3.1, there is similar functionality built right into the database. If you send a call db.schema() query to the database, you will get a response that looks like an ordinary query result with nodes and relationships; however, the entities are purely virtual and do not exist in the database.
For very small instances with a simple schema, this may already be enough for you to get a good understanding of the structure of your database. However, while the current implementation of the graph viewer in the Neo4j Browser is fine for displaying smaller result sets, it does not work very well for non-trivial schemata.
If you start looking more closely at the output of the db.schema() call, you will see that there are some node labels that seem to be connected to a large number of other labels. In the example above, there are labels like User that have very high connectivity.
It is their adjacent relationships that make the diagram almost unusable. Now if you look at the database contents you will quickly see that in reality there is not a single node in the database that is labeled with just one of these labels:
MATCH (n:StackOverflow) WITH labels(n) as labels WHERE size(labels) = 1 RETURN collect(distinct(labels))
This will give you an empty list since there is no combination of labels in the whole database that only consists of the “StackOverflow” label! Instead, the label is used as a “tag” in combination with other labels only.
MATCH (n:StackOverflow) WITH labels(n) as labels RETURN collect(distinct(labels))
With this knowledge, in order to better understand the schema, we can actually manually remove those “tagging” nodes from the graph display. We won’t lose any relevant schema information since the relationships are still there for the other labels that the tagging label was used in combination with.
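Once the label combinations have been fetched, the same "never occurs alone" check can also be run on the client. The following is a minimal Python sketch and not part of any tool mentioned here; the function name and the extra stand-alone "Tag" label are made up for illustration. Note that it only yields candidates: a label that never occurs alone is not automatically a pure tagging label.

```python
def labels_never_alone(label_combinations):
    """Return, sorted, the labels that never occur as a node's only label."""
    seen = set()
    alone = set()
    for combo in label_combinations:
        labels = set(combo)
        seen |= labels
        if len(labels) == 1:
            alone |= labels
    return sorted(seen - alone)


# Label combinations quoted in this article, plus a hypothetical
# stand-alone "Tag" label that is added purely for contrast.
combos = [
    ["User", "Twitter"],
    ["User", "StackOverflow"],
    ["Content", "Question", "StackOverflow"],
    ["Content", "Answer", "StackOverflow"],
    ["Tag"],
]

candidates = labels_never_alone(combos)
print(candidates)
```

Every label except the hypothetical "Tag" shows up as a candidate, so deciding which candidates are really just tags remains a manual step.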
Be careful though, because there can be labels that cannot be removed without destroying information. In this case the User tag appeared in these combinations:
[User, Twitter], [User, StackOverflow], [User, Meetup], [User, GitHub], [User, Slack]
And if we remove the User label too, we will remove all Users and their interactions and relationships from the schema.
Actually, the schema should show a different structure. It currently shows a single type of “User” who posts questions on Stack Overflow, tweets about his or her work on Twitter, and meets his or her fellow developers at Meetups. We see a single User that participates in all of these relationships. However, if we look at the actual data, we will see that only Twitter Users tweet, and only Stack Overflow Users post questions and answers on Stack Overflow.
Thus, to match what is really in the database, the schema should be drawn with separate types of Users, one for each “tagging” label combination that they appear in.
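To make the idea of "one User type per label combination" concrete, here is a small Python sketch. It is an illustration under my own assumptions, not the code of the application described later; the function name split_label and the "User:Twitter" naming scheme are hypothetical.

```python
def split_label(label, label_combinations):
    """Return one variant of `label` per distinct combination it occurs in.

    Each variant is a tuple that starts with `label`, followed by the
    other labels of the combination in alphabetical order.
    """
    variants = set()
    for combo in label_combinations:
        if label in combo:
            others = sorted(l for l in combo if l != label)
            variants.add((label, *others))
    return sorted(variants)


# The User combinations listed earlier in this article.
user_combos = [
    ["User", "Twitter"],
    ["User", "StackOverflow"],
    ["User", "Meetup"],
    ["User", "GitHub"],
    ["User", "Slack"],
]

user_types = [":".join(v) for v in split_label("User", user_combos)]
print(user_types)
```

Each resulting name would become its own node in the schema diagram, replacing the single highly connected User node.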
At this point, it becomes clear that the current implementation of the graph visualization in the Neo4j Browser does not suffice for rendering more complex database schemata.
Graph Visualization to the Rescue!
“yFiles” is a generic graph visualization, drawing and editing library for programmers that comes with the most complete suite of automatic layout algorithms. It also features extensive customization options, and as such, can be used to create completely new applications that exactly suit one’s requirements. Therefore, I was positive that I could easily build an application that allows users to quickly and efficiently browse and understand the schemata of even the most complex Neo4j graph databases.
That was all I had to do to get the basic schema to display in my own application! I just plugged the above code into one of the samples in the getting started tutorials for yFiles for HTML and was immediately able to interactively explore and finally understand my database schema!
Creating a Schema Explorer
Of course, I didn’t stop at this point. I was excited to see what one can create when the power of Neo4j and the yFiles libraries is used together in the same application. So, I added an option in the context menu for the user to automatically split node labels into all of their label combinations and update the relationships accordingly.
A Cypher query like the following quickly reveals that for certain label combinations there are a lot fewer relationships, and they suddenly begin to make sense:
MATCH (n:User:StackOverflow)-[r]->(n2) RETURN collect(distinct(type(r))), labels(n2)
collect(distinct(type(r)))    labels(n2)
[POSTED]                      [Content, Question, StackOverflow]
[POSTED]                      [Content, Answer, StackOverflow]
So, Stack Overflow users in the database do not participate in a meetup and do not create GitHub repositories; they will post answers and questions and the schema should reflect that!
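The grouping that this query performs for one label combination can be sketched for all combinations at once. The Python snippet below is a rough illustration only: the function name schema_edges and the row format (start labels, relationship type, end labels) are assumptions, and it presumes that every combination is reported with a consistent label order.

```python
from collections import defaultdict


def schema_edges(rows):
    """Group relationship types by (start combination, end combination)."""
    edges = defaultdict(set)
    for start_labels, rel_type, end_labels in rows:
        edges[(tuple(start_labels), tuple(end_labels))].add(rel_type)
    # Sort the types so the result is stable and easy to compare.
    return {pair: sorted(types) for pair, types in edges.items()}


# Rows modelled on the query result above; the duplicate row stands in
# for several relationships of the same kind in the real data.
rows = [
    (["User", "StackOverflow"], "POSTED", ["Content", "Question", "StackOverflow"]),
    (["User", "StackOverflow"], "POSTED", ["Content", "Answer", "StackOverflow"]),
    (["User", "StackOverflow"], "POSTED", ["Content", "Answer", "StackOverflow"]),
]

edges = schema_edges(rows)
for (start, end), types in edges.items():
    print(":".join(start), types, ":".join(end))
```

Each key of the resulting dictionary corresponds to one edge in the split schema, which is why the diagram becomes so much sparser once User is no longer a single node.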
So after reading in the schema, splitting the node labels, reinserting the right relationships, and removing the tagging labels from the schema view, I finally applied some custom styling and one of the automatic layout algorithms and got this much-improved schema: