
What's New in Neo4j Databridge


Neo4j-Databridge can help you import large amounts of data into Neo4j. Read on to find out what's new with Databridge and what it means to you.


Learn more about what's new in Neo4j Databridge from GraphAware, as of April 2017.

Since our first post a few months back, Neo4j Databridge has seen a number of improvements and enhancements. In this post, we’ll take a quick tour of the latest features.

Streaming Endpoint

Although Databridge is primarily designed for bulk data import, which requires Neo4j to be offline, we recently added the capability to import data into a running Neo4j instance.

This was prompted by a specific request from a user who pointed out that in many cases, people want to do a fast bulk-load of an initial large dataset with the database offline and then subsequently apply small incremental updates to that data with the database running. This seemed like a great idea, so we added the streaming endpoint to enable this feature.

The streaming endpoint uses Neo4j’s Bolt binary protocol, and the good news is that you don’t need to change any of your existing import configuration to use it. Simply pass the -s option to the import command, and it will automatically use the streaming endpoint:

Example: Use the -s option to import the hawkeye dataset into a running instance of Neo4j.

bin/databridge import -s hawkeye

The streaming endpoint connects to Neo4j using the following defaults:


You can override these defaults by creating a file custom.properties in the Databridge config folder and setting the values as appropriate for your particular Neo4j installation.
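For example, a custom.properties file along these lines would point the streaming endpoint at a non-default Bolt address. Note that the property names and values below are illustrative assumptions, not confirmed Databridge settings; check the project documentation for the exact keys your version expects:

```properties
# Hypothetical connection overrides for the streaming endpoint
# (property names are assumptions -- consult the Databridge docs)
neo4j.bolt.uri=bolt://localhost:7687
neo4j.username=neo4j
neo4j.password=secret
```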

Please note that despite using the Bolt protocol, the streaming endpoint will take quite a bit longer to run than the offline endpoint for large datasets, so it isn’t really intended to replace bulk import. For small incremental updates, however, this should not be a problem.

Updates from the streaming endpoint are batched, with the transaction commit size currently set to 1000, and the plan is to make the commit size user-configurable in the near future.
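The batching behaviour described above can be sketched as follows. This is not the actual Databridge implementation (which is not shown in this article), just the general idea: rows are grouped into chunks of the commit size, and each chunk would be written to Neo4j in a single Bolt transaction.

```python
# Sketch of transaction batching for a streaming import (hypothetical;
# illustrates the idea, not Databridge's real code).

def batches(rows, commit_size=1000):
    """Yield successive lists of at most `commit_size` rows each."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == commit_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each yielded batch would then be committed in one Bolt transaction,
# e.g. with the official Neo4j driver:
#   for batch in batches(updates):
#       session.write_transaction(apply_updates, batch)
```

With the commit size fixed at 1000, importing 2,500 rows would produce three transactions of 1000, 1000, and 500 rows.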

Specifying the Output Database Folder

By default, Neo4j-Databridge creates a new graph.db database in the same folder as the import task. We’ve now added the ability for you to define the output path to the database explicitly. To do this, use the -o option to specify the output folder path to the import command:

Example: Use the -o option to import the hawkeye dataset into a user-specified database.

bin/databridge import -o /databases/common hawkeye

In the example above, the hawkeye dataset will be imported into /databases/common/graph.db, instead of the default location hawkeye/graph.db.

Among other things, this new feature allows you to import different datasets into the same physical database:

Example: Use the -o option to allow the hawkeye and epsilon datasets to co-exist in the same Neo4j database.

bin/databridge import -o /databases/common hawkeye
bin/databridge import -o /databases/common epsilon

Simpler Commands

The eagle-eyed among you will have spotted that the above examples use the import command, while in our first blog post, our examples all used the run command, which was invoked with a variety of different option flags. The original run command still exists, but we’ve added some additional commands to make life a bit simpler.

All the new commands also support a -l option to limit the number of rows imported. This can be very useful when testing a new import task, for example. The new commands are:

import: Runs the specified import task.

  • Usage: import [-cdsq] [-o target] [-l limit]

    • c: Allow multiple copies of this import to co-exist in the target database

    • d: Delete any existing dataset prior to running this import

    • s: Stream data into a running instance of Neo4j

    • q: Run the import task in the background, logging output to import.log instead of the console.

    • o target: Use the specified target database for this import.

    • l limit: The maximum number of rows to process from each resource during the import.

test: Performs a dry run of the specified import task, but does not create a database.

  • Usage: test [-l limit]

    • l limit: The maximum number of rows to process from each resource during the dry run.

profile: Profiles the resources for an import task.

  • Databridge uses a profiler at the initial phase of every import. The profiler examines the various data resources that will be loaded during the import and generates tuning information for the actual import phase.

  • Usage: profile [-l limit]

    • l limit: The maximum number of rows to profile from each resource.

The profiler displays the statistics that will be used to tune the import. For nodes, these statistics include the average key length akl of the unique identifiers for each node type, as well as an upper bound max on the number of nodes of each type.

For relationships, the statistics include an upper bound on the number of edges of each type. (The max values are upper bounds because the profiler doesn’t attempt to detect possible duplicates.)
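A small sketch makes these two statistics concrete. This is not the profiler's actual code, just an illustration: for a node type, akl is the mean length of the unique-identifier keys seen, and max is a plain row count, which over-counts when duplicates are present.

```python
# Illustrative computation of the profiler's per-node-type statistics
# (hypothetical sketch, not the real Databridge profiler).

def profile_keys(keys):
    """Return {'max': upper bound on node count, 'akl': average key length}.

    `max` is just the number of keys seen -- an upper bound, because
    duplicate identifiers are not detected at this stage.
    """
    lengths = [len(str(k)) for k in keys]
    return {
        'max': len(lengths),
        'akl': sum(lengths) / len(lengths),
    }
```

For instance, three orbit keys 'LEO', 'GEO', and 'MEO' would profile as max=3 and akl=3.0, even if two of them later turned out to refer to the same node.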

Profile statistics are displayed in JSON format:

nodes: [
    { 'Orbit': {'max':11, 'akl':10.545455} },
    { 'Satellite': {'max':11, 'akl':8.909091} },
    { 'SpaceProgram': {'max':11, 'akl':9.818182} },
    { 'Location': {'max':11, 'akl':4.818182} }
],
edges: [
    { 'LOCATION': {'max':11} },
    { 'ORBIT': {'max':11} },
    { 'LAUNCHED': {'max':11} },
    { 'LIVE': {'max':11} }
]

Deleting and Copying Individual Datasets

In order to support the new streaming endpoint as well as the ability to host multiple import datasets in the same database, Databridge only creates a brand new database the first time you run an import task.

If you run the same import task multiple times with the same datasets, Databridge will not create any new nodes or relationships in the graph during the second and subsequent imports.

If you want to force Databridge to clear down any previous data and re-import it again, you can use the -d option, which will delete the existing dataset first.

Example: Use the -d option to delete an existing dataset prior to re-importing it.

bin/databridge import hawkeye
bin/databridge import -d hawkeye

On the other hand, if you want to create a copy of an existing dataset, you can use the -c option instead:

Example: Use the -c option to create a copy of a previously imported dataset.

bin/databridge import hawkeye
bin/databridge import -c hawkeye

Deleting All the Things

If you need to delete everything in the graph database and start again with a completely clean slate, you can use the purge command:

bin/databridge purge hawkeye

Note that if you have imported multiple datasets into the same physical database, you should purge each of them individually, specifying the database path each time:

bin/databridge purge -o /databases/common hawkeye
bin/databridge purge -o /databases/common epsilon


Well, that about wraps up this quick survey of what's new in Databridge from GraphAware. If you're interested in finding out more, please take a look at the project wiki, and in particular the Tutorials section.

