Using a Cuckoo Filter for Unique Relationships
In Neo4j applications, a Cuckoo filter is a fast way to guarantee one and only one relationship between two nodes. Learn how to use it!
We often see a pattern in Neo4j applications where a user wants to create one and only one relationship between two nodes. For example, a user follows another user on a social network. We don’t want to accidentally create a second FOLLOWS relationship, because that may cause errors such as duplicate entries on their feed, problems unfollowing or blocking them, or even skewed recommendation algorithms. It is also just plain wasteful: while an occasional duplicate relationship won’t be a big deal, millions of them could be.
So how do we deal with this?
In Neo4j, you have a few options. You could use CREATE UNIQUE:
MATCH (u1:User {id:123}), (u2:User {id:456})
CREATE UNIQUE (u1)-[:FOLLOWS]->(u2)
You could use MERGE:
MATCH (u1:User {id:123}), (u2:User {id:456})
MERGE (u1)-[:FOLLOWS]->(u2)
Or you could write it in the Java API:
try (Transaction tx = db.beginTx()) {
    Node n1 = db.findNode(Label.label("User"), "id", 123);
    Node n2 = db.findNode(Label.label("User"), "id", 456);
    // FOLLOWS is assumed to be RelationshipType.withName("FOLLOWS"), defined elsewhere.
    // Traverse from the side with the smaller degree; bail out if the relationship already exists.
    if (n1.getDegree(FOLLOWS, Direction.OUTGOING) < n2.getDegree(FOLLOWS, Direction.OUTGOING)) {
        for (Relationship r1 : n1.getRelationships(Direction.OUTGOING, FOLLOWS)) {
            if (r1.getEndNode().equals(n2)) {
                return;
            }
        }
    } else {
        for (Relationship r1 : n2.getRelationships(Direction.INCOMING, FOLLOWS)) {
            if (r1.getStartNode().equals(n1)) {
                return;
            }
        }
    }
    n1.createRelationshipTo(n2, RelationshipType.withName("FOLLOWS"));
    tx.success();
}
All of these effectively do the same thing: find the two nodes in the graph, start with the one that has the fewest relationships of the type and direction we care about, and traverse until we either find the other node we are looking for or run out of relationships, in which case we create the new one.
Is there a better way? Well, sure. If we used the “legacy” relationship indexes, we could have indexed each relationship by its two node identifiers and queried that. But we are not guaranteed that it would be faster than traversing, since the index and its search time would grow as the number of relationships grew.
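To make the legacy-index idea concrete, here is a minimal sketch using the old org.neo4j.graphdb.index API (RelationshipIndex and IndexHits); the index name follows_idx and the pair key are made up for the example, and n1/n2 are assumed to have been looked up already:

try (Transaction tx = db.beginTx()) {
    RelationshipIndex followsIndex = db.index().forRelationships("follows_idx");
    String pair = n1.getId() + "-" + n2.getId();
    // One index lookup replaces the traversal.
    try (IndexHits<Relationship> hits = followsIndex.get("pair", pair)) {
        if (!hits.hasNext()) {
            Relationship r = n1.createRelationshipTo(n2, RelationshipType.withName("FOLLOWS"));
            followsIndex.add(r, "pair", pair);
        }
    }
    tx.success();
}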
If Neo4j stored relationships ordered by the OtherNodeID (the node on the other side of the relationship), we could stop traversing as soon as either side passed the point where the relationship should have been. But it doesn’t: new relationships are inserted at the top of the list, so they are ordered by descending relationship ID. We could store the creation date on each relationship, and once we reach relationships older than the date the node we are searching for was created, we could stop.
Another option is to create a custom index, using something like a dynamic K2 tree (a succinct data structure) to store all the relationships that have been created.
But as the title of the post suggests, we’re going to try something different. A few years ago, while importing a hypergraph into Neo4j, I showed you how to use a Bloom filter to avoid duplicating relationships. A Bloom filter, as you may recall, is a “probabilistic data structure” that can tell us with certainty that an item is not in a set, or that it is possibly in the set.
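As a quick refresher, this is roughly what that looks like with Guava’s BloomFilter; the pair-string format here is only illustrative:

// 1M expected entries with a 1% false positive rate.
BloomFilter<CharSequence> seen = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

seen.put("123-456");                               // record the pair
boolean maybe = seen.mightContain("123-456");      // true: added earlier, or a false positive
boolean definitelyNot = !seen.mightContain("456-123"); // false from mightContain means definitely not in the set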
A Cuckoo filter uses Cuckoo hashing to build a better Bloom filter.
[Image: a Bloom vs. Cuckoo filter comparison by Brian Dupras.]
…for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do). — Michael Mitzenmacher (2014)
Take a look at the paper introducing the data structure by Bin Fan, David G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher for more information.
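The core trick, very roughly sketched below, is that an item’s short fingerprint plus its current bucket is enough to compute its single alternate bucket, which is what makes eviction-based insertion (and deletion) work. This is only an illustration of the idea from the paper, not the library’s implementation:

// Illustrative only: partial-key cuckoo hashing. The bucket count is a power of two
// so the XOR trick always lands back in range.
public class CuckooIndexSketch {
    private static final int NUM_BUCKETS = 1 << 16;

    static int fingerprint(long item) {
        return (Long.hashCode(item) & 0xFF) | 1;   // small, non-zero tag actually stored in the bucket
    }

    static int bucket1(long item) {
        return Long.hashCode(item) & (NUM_BUCKETS - 1);
    }

    static int bucket2(int bucket1, int fingerprint) {
        // The alternate bucket depends only on the current bucket and the fingerprint,
        // so an evicted fingerprint can be relocated without re-reading the original item.
        return (bucket1 ^ (fingerprint * 0x5bd1e995)) & (NUM_BUCKETS - 1);
    }
    // Lookup checks both candidate buckets for the fingerprint; insertion tries bucket1 then
    // bucket2, and if both are full it evicts a resident fingerprint to that fingerprint's own
    // alternate bucket, repeating up to a maximum number of kicks before declaring the filter full.
}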
Ok, so how would we go about it in Neo4j? We’ll create a kernel extension that listens to transactions and, whenever a new relationship is created, adds an entry to a set of Cuckoo filters, split by relationship type. Then, in a stored procedure or using the Java API, we can make use of the Cuckoo filters.
I’m using the CuckooFilter library available here. I’m still using 1.x, but I hear the 2.x version will be much faster. We need a map of CuckooFilters by relationship type, so we can start our class this way:
public class CuckooFilters {
    private static HashMap<String, CuckooFilter<byte[]>> filters = new HashMap<>();
When we want to add an entry, we first find the CuckooFilter we want and then add the entry to it. If this is the first time we are seeing the relationship type, we create a new CuckooFilter. For this example, I’m using a byte array for my data and limiting it to 200M entries, but these could be configurable or dynamic. It has a default false positive rate of 1%.
public static void set(String name, long from, long to) {
    if (filters.containsKey(name)) {
        filters.get(name).put(Util.longsToBytes(from, to));
    } else {
        // First time we see this relationship type: create a new filter sized for
        // 200M entries (the default false positive rate is 1%), then add the entry.
        CuckooFilter<byte[]> ck = new CuckooFilter.Builder<>(Funnels.byteArrayFunnel(), 200_000_000).build();
        ck.put(Util.longsToBytes(from, to));
        filters.put(name, ck);
    }
}
The CuckooFilter class has builders for primitives like ints and strings, but we are adding relationships, so we could have converted them to a string: Node_1_id + dash + Node_2_id. That is probably the simplest way, but I decided to try something a little fancier and convert the two longs into a single byte array.
public static byte[] longsToBytes(long l1, long l2) {
    byte[] result = new byte[Long.BYTES * 2];
    // Write l1 into the first 8 bytes, most significant byte first (big-endian).
    for (int i = (Long.BYTES) - 1; i >= 0; i--) {
        result[i] = (byte) (l1 & 0xFF);
        l1 >>= Long.BYTES; // shift by 8 bits (Long.BYTES happens to equal Byte.SIZE)
    }
    // Write l2 into the last 8 bytes the same way.
    for (int i = (2 * Long.BYTES) - 1; i >= Long.BYTES; i--) {
        result[i] = (byte) (l2 & 0xFF);
        l2 >>= Long.BYTES;
    }
    return result;
}
We also need a way to check if a relationship might exist, so we add a get method that calls mightContain on our Cuckoo filter.
public static boolean get(String name, long from, long to) {
    if (filters.containsKey(name)) {
        return filters.get(name).mightContain(Util.longsToBytes(from, to));
    } else {
        return false;
    }
}
Lastly, we’ll need a way to un-set a relationship by deleting that entry from the Cuckoo filter:
public static void unset(String name, long from, long to) {
    if (filters.containsKey(name)) {
        filters.get(name).delete(Util.longsToBytes(from, to));
    }
}
With the CuckooFilters class now in place, we are going to hook into our transaction execution afterCommit hook and set the Cuckoo filters for any newly created relationships, as well as un-set them for any newly deleted relationships.
@Override
public void afterCommit(TransactionData td, Object o) {
    try (Transaction tx = db.beginTx()) {
        for (Relationship relationship : td.createdRelationships()) {
            CuckooFilters.set(
                    relationship.getType().name(),
                    relationship.getStartNode().getId(),
                    relationship.getEndNode().getId()
            );
        }
        for (Relationship relationship : td.deletedRelationships()) {
            CuckooFilters.unset(
                    relationship.getType().name(),
                    relationship.getStartNode().getId(),
                    relationship.getEndNode().getId());
        }
        tx.success();
    }
}
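The post doesn’t show the wiring itself, but a kernel extension (or any code with access to the GraphDatabaseService) would register the handler roughly like this; the anonymous Adapter below just stands in for the handler class that holds the afterCommit method above:

db.registerTransactionEventHandler(new TransactionEventHandler.Adapter<Object>() {
    @Override
    public void afterCommit(TransactionData td, Object state) {
        // ... the afterCommit body shown above ...
    }
});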
Now, to test how much faster this is, we’ll create a set of JMH tests with 10k users and randomly generate 200 FOLLOWS relationships for each user. First, not using the filter:
public boolean measureNoFilterCheckLongs() throws IOException {
    long n1Id = rand.nextInt(userCount);
    long n2Id = rand.nextInt(userCount);
    return doubleCheck(n1Id, n2Id);
}
…and with the filter:
public boolean measureCuckooCheckLongs() throws IOException {
    long n1Id = rand.nextInt(userCount);
    long n2Id = rand.nextInt(userCount);
    if (CuckooFilters.get("FOLLOWS", n1Id, n2Id)) {
        if (doubleCheck(n1Id, n2Id)) return true;
    }
    return false;
}
The tests are nearly identical, except our original test always performs a check, while the other only performs a check if a relationship might exist.
private boolean doubleCheck(long n1Id, long n2Id) {
    try (Transaction tx = db.beginTx()) {
        Node n1 = db.getNodeById(n1Id);
        Node n2 = db.getNodeById(n2Id);
        if (n1.getDegree(FOLLOWS, Direction.OUTGOING) < n2.getDegree(FOLLOWS, Direction.OUTGOING)) {
            for (Relationship r1 : n1.getRelationships(Direction.OUTGOING, FOLLOWS)) {
                if (r1.getEndNode().equals(n2)) {
                    return true;
                }
            }
        } else {
            for (Relationship r1 : n2.getRelationships(Direction.INCOMING, FOLLOWS)) {
                if (r1.getStartNode().equals(n1)) {
                    return true;
                }
            }
        }
        tx.success();
    }
    return false;
}
So what are our results? Well:
Benchmark                                    Mode  Cnt       Score       Error  Units
CuckooBenchmark.measureCuckooCheckLongs     thrpt    5  198018.199 ± 35573.548  ops/s
CuckooBenchmark.measureNoFilterCheckLongs   thrpt    5    2110.529 ±   135.350  ops/s
That’s right… our Cuckoo filter check is nearly 100x faster than our normal implementation (roughly 198k vs. 2.1k ops/s), and since we double-check the potential false positives, we are guaranteed to be correct.
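Tying it back to the original goal of creating one and only one relationship, the same check can guard the write path. A sketch, assuming n1Id and n2Id are known node ids:

try (Transaction tx = db.beginTx()) {
    // Create only if the filter says "definitely not there", or the traversal proves
    // the "maybe" was a false positive.
    if (!CuckooFilters.get("FOLLOWS", n1Id, n2Id) || !doubleCheck(n1Id, n2Id)) {
        db.getNodeById(n1Id).createRelationshipTo(
                db.getNodeById(n2Id), RelationshipType.withName("FOLLOWS"));
        // The afterCommit hook adds the new pair to the filter once this commits.
    }
    tx.success();
}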
The source code, as always, is on GitHub.