Over a million developers have joined DZone.

APOC: An Introduction to User-Defined Procedures and APOC

With Neo4j 3.0's introduction of user-defined procedures, a new utility library is becoming increasingly popular: APOC. Read on to find out more and what it can do for you and your graph projects.

· Database Zone

Learn NoSQL for free with hands-on sample code, example queries, tutorials, and more.  Brought to you in partnership with Couchbase.

This is the first in a series of blog posts in which I want to introduce you to Neo4j's user-defined procedures and the APOC procedure library in particular.

Besides many other cool things, of the best features of our recent Neo4j 3.0 release were callable procedures.

Built-in Procedures

There are a number of built-in procedures that currently provide database management and introspection functionalities. Going forward, more functionality will be provided through these procedures (e.g., user management and security, transaction management, and more).

But even more interesting than the built-in procedures are "User-Defined Procedures." You might know them from other databases as "Stored Procedures" or "User-Defined Functions/Views."

Essentially, they are a way to extend the given query language, in our case Cypher, with custom functionality that adds new capabilities.

Calling Procedures

You can call procedures stand-alone as singular statements or as part of your more complex Cypher statement.

Procedures can be called with parameters and their return values are a stream of columns of data.

Here are two examples:

Call a procedure as only content of a statement
CALL db.labels();

If you call a procedure as part of a Cypher statement you can pass in not only literal parameter values (strings, numbers, maps, lists, etc.), but also nodes, relationships, and paths. This enables a procedure to execute graph operations from simple neighborhood expansions to complex iterative algorithms. For the result columns that are returned by a procedure, you have to use the YIELD keyword to select which of those are introduced and potentially aliased into the query.

Call a procedure as part of a more complex statement – list procedures grouped by package
CALL dbms.procedures() YIELD name
RETURN head(split(name,".")) as package, count(*), collect(name) as procedures;


   
packagecount(*)procedures
db5

[db.constraints, db.indexes, db.labels, db.propertyKeys, db.relationshipTypes]

dbms

4

[dbms.changePassword, dbms.components, dbms.procedures, dbms.queryJmx]

apoc

203

[apoc.algo.aStar, apoc.algo.aStarConfig, apoc.algo.allSimplePaths, … ]

When and How to Write a Procedure

Procedures can be small, generic helper functions for common tasks, but also very specific solutions and additions to your specific use case.

 So consider them when: 
  1. Cypher lacks a feature that you really need (but make sure to consult the documentation and ask in our public Slack first), or
  2. You need that last little bit of performance for a graph algorithm or operation where you can't afford any indirection between your code and the database core.

Only then should you consider writing a user-defined procedure.

Writing a user-defined procedure is pretty straightforward in most programming languages on the JVM.

You just need to create a method annotated with @Procedure that takes in any of the Cypher types as named parameters and returns a stream of data transfer objects (DTOs) each of whose fields becomes a column in the procedure output.

One very simple procedure that I also used before to explain this feature is providing the capability to generate UUIDs. As you probably know, that's something very easy to do in Java, but not so much in Cypher, and still sometimes you want to add uniquely identifying IDs to your nodes.

In our example, we will provide the uuid procedure with a number which indicates how many UUIDS we're interested in. Note that the only numeric types supported by Cypher are long and double , which encompass their smaller size brethren.

For our results, we use a dedicated Java DTO that contains our two result columns as public final fields.

As a fun side-effect of using Neo4j procedures, you also get to play with Java 8 streams, but please don't overdo it.

@Procedure("example.uuid")
public Stream<UuidResult> uuid(@Name("count") long count) {
    return LongStream.range(0,count)
        .mapToObj(row -> new UuidResult(row, UUID.randomUUID()));
    }
static class UuidResult {
    public final long row;
    public final String uuid;
    UuidResult(long row, UUID uuid) {
        this.row = row;
        this.uuid = uuid.toString();
        }
    }

We then use our favorite build tool (Gradle, maven, SBT, etc.) to package our procedure as a jar and put it into the $NEO4J_HOME/plugins directory. Please make sure that all dependencies also become part of that jar, otherwise, they won't be found when your procedure is loaded.

And mark the Neo4j dependencies (and other libraries that are already within the Neo4j distribution) as provided so that they are not packaged with and blow up the size your procedure jar.

After restarting your server, your procedure should be listed when you CALL dbms.procedures(). If not please check $NEO4J_HOME/logs/debug.log.

Now we can call our newly minted procedure to create nodes with UUIDs.

WITH {data} as rows
CALL example.uuid(size(rows)) YIELD row, uuid
CREATE (p:Person {id:uuid}) SET p += rows[row]
RETURN row, p;

We provide detailed documentation on how to do it in our Developer's Manual and in an example project on GitHub that is ready for you to fork and adapt.

APOC

Learn all about APOC and user-defined procedures in Neo4j

While Neo4j 3.0 was maturing, I pondered about all the questions and feature requests I've gotten over the years regarding certain functionality in Cypher, and I felt that it was a good idea to create a generally usable library of procedures that covered these aspects and more.

Drawing from the unlucky technician in The Matrix movie and the historic Neo4j acronym "A Package Of Components," the name APOC was an obvious choice, which also stands for "Awesome Procedures On Cypher." So the hardest part of all the work was done.

Starting with a small selection of procedures from various areas, the APOC library grew quickly and already had 100 entries for the Neo4j 3.0 releas. It now contains more than 200 procedures from all these areas:

  • Graph algorithms.
  • Metadata.
  • Manual indexes and relationship indexes.
  • Full text search.
  • Integration with other databases like MongoDB, ElasticSearch, Cassandra, and relational .databases
  • Loading of XML and JSON from APIs and files.
  • Collection and map utilities.
  • Date and time functions.
  • String and text functions.
  • Import and export.
  • Concurrent and batched Cypher execution.
  • Spatial functions.
  • Path expansion.

In this blog series, I want to focus on each of these above-mentioned areas with a specific blog post, but today, I just want to mention a few of the procedures to excite you about what’s to come.

Installation

Just download the APOC jar file from the latest release on GitHub and put it into the $NEO4J_HOME/plugins directory.

After restarting your Neo4j server, a lot of new procedures should be listed in CALL dbms.procedures().

You can also CALL apoc.help("apoc") or CALL apoc.help("apoc.algo") for the help function built into the library.

If you want to use any of the database integrations (e.g., for relational databases, Cassandra, or MongoDB), make sure to add the relevant jar files for the drivers for those databases too. It's probably easiest to clone the repository and run mvn dependency:copy-dependencies to find the relevant jars in the target/dependency folder.

Many of the APOC procedures just return a single value , which is of the expected return type and can be aliased. Some of them act like Boolean filters, i.e., if a certain condition does not match (e.g., isType or contains) they just return no row, otherwise, they return a single row of "nothing," so no YIELD is needed.

As there are no named procedure parameters yet, some APOC procedures take a configuration map, which can be used to control the behavior of the procedure.


A Meta Graph in APOC


One very handy feature is the metadata capabilities of APOC. Being a schema-free database, you probably wondered in the past what the underlying graph structure looks like.

The apoc.meta.* package offers procedures that analyze the actual content or database statistics of your graph and return a meta graph representation.

The simplest procedure, apoc.meta.graph, just iterates over the graph and records relationship types between node labels as it goes along. Other more advanced procedures either sample the graph or use the database statistics to do their job.

The visual metadata procedures return virtual nodes and relationships which don't exist in the graph but are returned correctly to the Neo4j Browser and client, and are rendered just like normal graph nodes and relationships. Those virtual graph entities can also be used by you to represent an aggregated projection of the original graph data.

apoc.load.json

Loading data from web and other APIs has been a favorite past time of mine.

With apoc.load.json, it's now very easy to load JSON data from any file or URL.

If the result is a JSON object, it is returned as a singular map. Otherwise, if it is an array, it is turned into a stream of maps. Let's take the Stack Overflow example linked above.

The URL for retrieving the last questions and answers of the Neo4j tag is this:

https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf

Now it can be used from within Cypher directly, let's first introspect the data that is returned.

JSON data from Stack Overflow:
WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS item
RETURN item.title, item.owner, item.creation_date, keys(item)

Load JSON Data from Stack Overflow in APOC


Combined with the Cypher query from the other blog post, it's easy to create the full Neo4j graph of those entities.

Graph data created via loading JSON from Stack Overflow:
WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MERGE (question:Question {id:q.question_id}) ON CREATE
  SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count

MERGE (owner:User {id:q.owner.user_id}) ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)

FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERS]-(answer:Answer {id:a.answer_id})
   MERGE (answerer:User {id:a.owner.user_id}) ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)

Results of loading JSON data from Stack Overflow using APOC


The documentation also contains other examples, like loading from Twitter and Geocoding.

apoc.coll.* and apoc.map.*

While Cypher sports more collection and data structure operations that most other query languages, there is always that one thing you couldn’t do yet. Until now.

With procedures, it is easy to provide solutions for these missing pieces, oftentimes just a one-liner in the procedure. That’s why APOC contains a lot of these utility functions for creating, manipulating, converting and operating on data.

Here are two examples:

Creating pairs of a list of nodes or relationships can be used to compare them pair-wise, without a double loop around over an index value.

Create a list of pairs from a single list:
WITH range(1,5) as numbers
CALL apoc.coll.pairs(numbers) YIELD value
RETURN value


value

[[1,2],[2,3],[3,4],[4,5],[5,null]]

Currently, you can't create a map from raw data in Cypher, only literal maps are supported and the properties() function on nodes and relationships. That's why APOC has a number of convenience functions to create and modify maps based on data from any source.

Create a map from pairs of data:
WITH [["Stark","Winter is coming"],["Baratheon","Ours is the fury"],["Targaryen","Fire and blood"],
      ["Lannister","Hear me roar"]] as pairs
CALL apoc.map.fromPairs(pairs) YIELD value as houses
RETURN houses


houses

{Stark:”Winter is coming”, Baratheon:”Ours is the fury”, Targaryen:”Fire and blood”, Lannister:”Hear me roar”}


Converting between numeric time information and its string counterparts can be cumbersome.

APOC provides a number of flexible conversion procedures between the two:

CALL apoc.date.format(timestamp(),"ms","dd.MM.yyyy")
// -> 07.07.2016
CALL apoc.date.parse("13.01.1975 19:00","s","dd.MM.yyyy HH:mm")
// -> 158871600

More details can be found in the documentation.

apoc.export.cypher*

Exporting your database with the dump command that’s built into neo4j-shell generates a single large CREATE statement and has a hard time dealing with large databases.

Using the original export code I wrote a few years ago, the apoc.export.cypher* procedures can export your graph data to a file with Cypher statements. The graph data to be exported can be:

  • The whole database
  • Collections of nodes of relationships
  • Results of a Cypher statement

The procedure first writes commands to create constraints and indexes. Nodes and relationships are created in singular Cypher statements, batched in transactional blocks.

For node lookup, it uses constrained properties on a label (kind of a primary key) if they exist. Otherwise, it adds a temporary, artificial label and property (the node-id) which are cleaned up at the end.

Export Full Database:
CALL apoc.export.cypherAll('/tmp/stackoverflow.cypher', {})

Export Specific Nodes and Relationships:
MATCH (q:Question)-[r:TAGGED]->(t:Tag)
WITH collect(q) + collect(t) as n, collect(r) as r
CALL apoc.export.cypherData(n, r, '/tmp/stackoverflow.cypher', {batchSize:1000})
YIELD nodes, relationships, properties, time
RETURN nodes, relationships, properties, time

Export From Statement:
CALL apoc.export.cypherQuery('MATCH (:Question)-[r:TAGGED]->(:Tag) RETURN r', 
                     '/tmp/stackoverflow.cypher', {nodesOfRelationships: true})

╒═════════════════════════╤════════════════════════════════╤══════╤═════╤═════════════╤══════════╤════╕
│file                     │source                          │format│nodes│relationships│properties│time│
╞═════════════════════════╪════════════════════════════════╪══════╪═════╪═════════════╪══════════╪════╡
│/tmp/stackoverflow.cypher│statement: nodes(181), rels(283)│cypher│181  │283          │481       │25  │
└─────────────────────────┴────────────────────────────────┴──────┴─────┴─────────────┴──────────┴────┘

Resulting Export File:
begin
CREATE (:`Question` {`id`:38226745, `favorite_count`:0, `share_link`:"http://stackoverflow.com/q/38226745", 
                     `title`:"How can I use order and find_each in neo4j.rb, Rails?"});
...
CREATE (:`Question` {`id`:38038019, `favorite_count`:0, `share_link`:"http://stackoverflow.com/q/38038019", 
                     `title`:"Convert neo4j Integer object to JavaScript integer"});
CREATE (:`Tag`:`UNIQUE IMPORT LABEL` {`name`:"ruby-on-rails", `UNIQUE IMPORT ID`:500});
...
CREATE (:`Tag`:`UNIQUE IMPORT LABEL` {`name`:"fuzzy-search", `UNIQUE IMPORT ID`:580});
commit

begin
CREATE INDEX ON :`Tag`(`name`);
CREATE CONSTRAINT ON (node:`Question`) ASSERT node.`id` IS UNIQUE;
CREATE CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
commit

schema await

begin
MATCH (n1:`Question`{`id`:38226745}), (n2:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:500}) CREATE (n1)-[:`TAGGED`]->(n2);
...
MATCH (n1:`Question`{`id`:38038019}), (n2:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:532}) CREATE (n1)-[:`TAGGED`]->(n2);
MATCH (n1:`Question`{`id`:38038019}), (n2:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:501}) CREATE (n1)-[:`TAGGED`]->(n2);
commit

begin
MATCH (n:`UNIQUE IMPORT LABEL`)  WITH n LIMIT 20000 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
commit

begin
DROP CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
commit

Conclusion

Even the journalists of the ICIJ used APOC as part of the browser guide they created for the downloadable Panama Papers Neo4j database. Really cool!

The next blog post in this series will cover data integration with APOC.

Please try out the procedures library, you can find it here on GitHub, and might find its documentation (WIP) helpful.

If you have any feedback or issues, please report them as GitHub issues.

I want to thank all the contributors who provided feedback, code, and fixes to the library.

The Getting Started with NoSQL Guide will get you hands-on with NoSQL in minutes with no coding needed. Brought to you in partnership with Couchbase.

Topics:
library ,data ,cypher ,graph ,graph database ,database ,nosql

Published at DZone with permission of Michael Hunger. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}