A Deeper Dive into Neo4j 3.0 Language Drivers
A Deeper Dive into Neo4j 3.0 Language Drivers
This article goes extremely in-depth into how Neo4j 3.0's new binary protocol Bolt works, and what it can do for you — a must read if you're at all into Neo4j.
Join the DZone community and get the full member experience.Join For Free
Built by the engineers behind Netezza and the technology behind Amazon Redshift, AnzoGraph™ is a native, Massively Parallel Processing (MPP) distributed Graph OLAP (GOLAP) database that executes queries more than 100x faster than other vendors.
Editor’s Note: This presentation was given by Nigel Small and Stefan Plantikow at GraphConnect Europe in April 2016. Here’s a quick TL;DR of what they covered:
- Language driver protocol concepts
- Driver API design and concepts
- Built-in procedures
- Legacy stores
Nigel Small: Today, we’re going to go over the functionality of a new set of Neo4j 3.0 language drivers that provide a uniform API for easy access to Neo4j from any programming language.
We built these in response to the large number of drivers developed by our community, which posed some of the following challenges:
- Inconsistent patterns of usage.
- Different feature sets across drivers.
- The use of predominantly only one language (Java).
- A large number of type systems — including Java, Cypher, and JSON — each of which are most efficient for specific purposes.
- HTTP can be limiting and verbose. Additionally, shipping HTTP headers with each request and response over the network results in a high payload.
Language Driver Protocol ConceptsBelow is an illustration — using some British telephone references — of what we’ve been working to build:
The Bolt protocol is implemented by connectors on either side of the communication. One connector sits in the driver itself, which acts on behalf of the application, and the other connector sits in the nearest server. These connectors send Cypher messages to the server, which then sends streams of records — the results — back to the connectors.
The client connectors that we’re building implement a standardized, defined API. Now we know that regardless of language, users will have the same set of classes, methods and experiences.
Below is an illustration of the data flow:
The driver sends a string of a Cypher statement along with a set of parameters (just as is the current experience using Cypher) but now this is the only interaction you actually have over Bolt. The other things previously allowed over HTTP have been turned into stored procedures, which we'll cover below.
We then have a stream of records coming out of the server and back into the driver, which you can read one at a time with streamlining capability. The results — metadata — are also returned from the server to describe, for example, the plan of the query that you've run, the field names of the data that you've pulled back, and so on.
The API DesignStefan Plantikow: But how is this all exposed? While it's nice to develop a protocol, you don’t want to only send binary over the wire in the application. You need an API.
We started out building a cross-language API design, which is extremely challenging because there are a number of desired, but competing features. You want certain uniform features to both prevent the mismatch of information and ensure the same set of functionality across platforms.
But too many restrictions can make you feel like you’re working in a program that requires one type of language when you’re used to another. These need to be balanced in order to create a great user experience.
Driver API ConceptsWe knew we had to approach this challenge in a way that would produce a simple result. Our approach to this was to provide each driver package with four key abstractions:
The first is the driver itself, which is essentially a template for creating sessions. The driver encapsulates your interaction with a given instance of Neo4j and holds things such as user credentials, configuration syncs for the size of your session pool, and the URL to your Bolt server. You can create this just one time for your application and then use it across your stack.
Once you have a driver you can ask it to give you a session, which is a logical context for database interaction. A session essentially gives you the illusion that you're working with a database, sending Cypher, executing queries, and receiving results. These get mapped internally onto a pool of connections that executes the binary Bolt protocol. And then there's Neo4j, which gives you full ACID transactions, which we had to expose as well.
The session is also a means of creating transactions if you need to send multiple Cypher statements as one logical unit of work. But regardless of whether you're sending one or multiple queries per transaction, your results are called statement results.
As Nigel went over, whenever you send a statement over the wire, you get back a handle that allows you to continuously pull out a stream of records from the database. The model allows you to send multiple queries at one time and consume the results in order, without having to wait the whole latency cycle.
Different API Language DriversBelow is an in-depth look at how to create a Java Driver API:
The first step is to create a driver, which you do by entering a
GraphDatabase.driver and the Bolt URL
bolt://localhost to describe the connection endpoint. Then you provide your credentials, etc.
Next, you use your driver to obtain a session, which is as simple as saying
driver.session and then performing a
session.run to execute your Cypher. Note that you can still pause on statement parameters when you run a session. This returns a statement result, and we iterate over those results to print the records which include two fields: title and name.
It's important to note that you have to tell the driver as it's accessing the record how you want the returned result converted into Java type. This is the piece that will vary between different drivers depending on the language being used. For example, this isn't something you have to do in Python because you work directly with native values. However, it makes more sense to be explicit when using a language like Java.
Python Driver API
.NET Driver API
Official Drivers with Neo4j 3.0Nigel: Each of the four drivers mentioned above are hosted on GitHub, and we hope you’ll use the code and include any problems you may have on the issue lists. Please also post a pull request on GitHub if you see a bug that needs to be fixed.
All of our drivers are Apache licensed, which is a bit more commercially friendly than the licensing we've chosen for the Neo4j server itself (which is GPL). With Apache, the drivers aren't going to hit any issues with commercial applications.
Additionally, the drivers are versioned and released independently of Neo4j, so while we're generally going to keep them in step with each other, the version numbers between the drivers and the server won't necessarily match. This is largely because one driver is going to be able to support multiple versions of the server. And finally, we're publishing each of the driver bundles to the place you'd expect to go from your ecosystem, i.e. Maven for Java, PyPl for Python, and so on.
Driver InstallationVery briefly, below is how you'd install in your given language:
Community Language DriversOur project is in no way intended to stamp out the efforts of the community driver authors. I myself was one before I joined Neo4j. But we think that we've built receptive drivers that can effectively complement the community drivers. Our driver provides the fundamental foundation and plumbing, but community drivers can add extra value by adding high-level API and language-idiomatic features.
Several community drivers are essentially complete, such as the PHP and C drivers. But if you're a budding driver author, we have some good resources for you. You can sign up for the Neo4j mailing list, join the public Slack channel, or send us an e-mail.
HTTP vs Bolt ProtocolsBelow is a quick overview of what the HTTP and Bolt protocols look like side by side in a typical set-up:
Your request and response will typically contain a load of HTTP headers and adjacent payload, and if you contrast that with the binary that's sent to and from with Bolt, you're getting some significantly smaller request and responses. So naturally, you’re going to get a bit of a better performance, overall.
The payload for Bolt itself uses a data sterilization mechanism that we've built in-house and is very closely tied to the Neo4j type system, which we've worked to standardize. Additionally, as you're using a session in Bolt world, you have a stateful set-up rather than the stateless set-up of HTTP, so you don't need to send all the information for every request.
Stefan: To get the maximum out of Neo4j, you need to implement specialized algorithms or talk to a third-party system. The best way to do that is to use an extension that directly plugs into the database and executes directly on the bare metal. It was obvious to us in our development of the Bolt protocol and drivers project that we would need to preserve that capability.
Java stored procedures provide a simple way to write custom code, plug it into the database, deploy it and run it via Cypher, and access it via Bolt (which is much faster than HTTP).
Built-In Stored ProceduresThe product includes a number of built-in stored procedures:
What you used to do via the REST API is slowly moving over to built-in procedures, which are more data-centric rather than system-centric. This includes procedures for listing the procedures themselves, getting information about the database, carrying the monitoring system and changing your password. I'm sure we're going to see even more interesting procedures moving forward.
Let's go over how to write a procedure. Essentially you have to write 1.5 classes. We say 1.5 instead of two because the second class is so simple that it almost doesn't count. Let's take a look at the first one, the
YourProcedureClass, which holds your application logic:
You essentially write a custom method that takes arguments — just like in Java — and annotate the code. You have to say "This is a procedure," provide some description string, name the arguments explicitly, and then start coding.
The second question is, how do you return data? By returning a stream of something, which you do with the second simple container class. In this demo, you turn records that only have a single field called value of
sometype. So if the procedure gets called, it will return a stream of these records.
Last but not least, how do we actually access the database from within a procedure? This is done with a standard injection type of mechanism. You annotate the tool you want to use and inject that at run time so that your procedure logic can access these things to work with the graph.
Below is a more concrete example that relies on the zip function in Cypher, which is a tiny utility that you give to lists to return a list of pairs. Sometimes it’s more convenient to get that back as a request of roles in the query instead of as a list. Below is a function I copied from Michael Hunger’s APOC procedures depository that does essentially that:
We have a procedure with a description that zips lists but emits one role per pair, and it has two arguments — list one and list two — and returns it as a list result. If we look farther down the code, we see that a list result is just a list of things.
The implementation is somewhat trivial. If the first list is empty, we return nothing because then there's nothing to zip. Otherwise, it's essentially like a nested loop, so you go over the outer list and then combine it with the elements from the inner list.
We made an addition to Cypher that allows you to call it in two different ways: Standalone and Call and Yield:
standalone because all you have to say is run
CALL zip. Note that the first line should say
session.run instead of
driver.run, and then
CALL zip and pass the parameters.
But let's say you want to use the procedure from Cypher and combine it with other features of Cypher, which is called
YIELD. The second example is a search for the knights in our database. You find all the knights that live in Camelot, and then find the distinct ones to ensure we don't get duplicates. But we also have a list of squires that we want to pair up with the knights.
CALL zip with the knights, and the squires are passed as a parameter. We then yield back the pair constructed by the zip function and return the pairing. While this is a simple example, you can now freely combine your custom code for accessing third-party systems, performing custom application logic, and doing interesting accommodation algorithms within the database.
I invite all of you who are interested in using the more programming side to check out Michael Hunger's APOC Procedures Repository.
Using Legacy StoresAnd last not least, another demo I'd like to go over is plugging in a legacy store to pull all of your data into a graph, which can be done with a 10- or 20-line procedure. Below is an example of how you would return the data, but of course, you can take this procedure, just replace a
CREATEstatement, start getting data into your graph store and use it from there:
SummaryTo summarize, drivers give a new uniform cross-language surface API to access your graph. Under the hood, Neo4j gives you this really awesome next generation binary protocol which will help you with throughput and performance. We also provide procedures for executing custom application logic in the database via Cypher without having any impedance mismatch.
It all comes with a shiny new developer manual, which includes code examples for the drivers for all the different languages. And we hope you'll contribute code on our developer page, and find our drivers on GitHub.
Inspired by Nigel and Stefan's talk? Click below to register for GraphConnect San Francisco and see even more presentations, talks, and workshops from the world’s graph technology leaders.
Published at DZone with permission of Nigel Small , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.