Searching for Objects Using Multiple Dimensions
Searching across multiple dimensions (without pulling your hair out) requires a change in thought and a lesson learned from relational databases.
Join the DZone community and get the full member experience.
Join For FreeLet's take a look at a scenario where you are trying to search for things by their attributes, not their description. They can be users, documents, or any object that could be described by discrete values in multiple dimensions. What does that mean exactly? Well, let me give you an example: searching for a dog. My family includes two four-legged furry creatures named Tyler and Ronnie. They are my half lab, half golden retrievers. Dogs come in all shapes and sizes, from teacup breeds with adult weights around 5 lbs, to giant Mastiff breeds over 150 lbs. But most people don’t care exactly how much a dog weighs, only their general size.
The Dog Size dimension can be described as “X-Small”, “Small”, “Medium”, “Large”, “X-Large”, or “XX-Large”. Ronnie and Tyler weigh in at around 65 lbs, so they fit in that “Large” category. So even though weight is a “continuous” value, we can convert it into a “discrete” value. If you can do this with how you describe your data, by clustering, bucketing, or grouping similar values together, then you can take advantage of this type of search.
There are a couple of ways to approach searching by dimensions. One of them is to simply create a Cartesian product or Cross Join of all of the discrete values in each dimension against one another. For each result, we would create a node and then connect the Things we are searching for to that node with a relationship. Take a look at the video below for a refresher on Cartesian products:
This approach works amazingly well when we only have a few dimensions and the number of discrete values is small. If we have just two dimensions with 3 and 5 as discrete values, it means we only have to create 15 nodes. But if we have, say ,15 dimensions and an average of 10 values in each, then that’s 1 quadrillion nodes…
The worst part is that we probably don’t have 1 quadrillion Things to search on, so most of those will be lonely nodes without relationships. Obviously, we could just not create the unused combinations. But what happens if we decide to add or modify Dimensions or discrete values? Yeah, it can get messy. So what’s another way to do it? Let’s say we have four dimensions and each one has four discrete values. We would create 16 nodes. Four with a Label of Dimension1, each with their own discrete value. If Size was our Dimension1, it would be:
CREATE (s1:Size {key:'Small'}),(s2:Size {key:'Medium'}),(s3:Size {key:'Large'}),(s4:Size {key:'X-Large'})
Then we add a “HAS” relationship from all the customers to one of the four nodes. So it ends up looking like this:
So if we wanted to find all the customers that had a Dimension1 of “d1a”, Dimension2 of “d2c”, Dimension3 of “d3c”, and Dimension4 of “d4a”, we could run this query:
Which would find all the discrete values we specified, then traverse to all the customers related to all four of them, giving us this result:
So that’s pretty nice, but let's go back to our heavy example with 15 dimensions. Let's say we have millions of customers. Is there a fast way to do this query? Yes, but we have to be a little smarter in how we traverse the graph. One thing we can do is to find the discrete value with the fewest amount of relationships and gather them into a list. Then we find the discrete value with the next fewest amount of relationships and gather them into a list. We can AND both lists, and whatever remains are our Customers related to both discrete values. We can continue down this path, ANDing as we go, until we are left with a small final set. This would work… but it doesn’t sound very graphy. It sounds more like what a Relational Database would do using bitmap indexes.
Maybe we can do better with a hybrid approach. We will start the same way, but once our ever-reducing lists reaches a certain threshold (say 100k or 50k nodes) we turn it around. Instead of traversing from the Discrete values to the Customers, we traverse from the remaining Customers out to the Discrete values and eliminate any as soon as they relate to a Discrete value we don’t want.
Let’s build ourselves, another Cypher Stored Procedure. We’ll call this one dimensionalSearch and take a map parameter as input:
@Description("com.maxdemarzi.dimensionalSearch(map) | Find Distinct Customers by many dimensions")
@Procedure(name = "com.maxdemarzi.dimensionalSearch", mode = Mode.READ)
public Stream<NodeResult> performDimensionalSearch(@Name("map") Map input) throws EntityNotFoundException {
To call our procedure we would pass in our map like this:
CALL com.maxdemarzi.dimensionalSearch({Dimension1:"d1a",Dimension2:"d2c",Dimension3:"d3c",Dimension4:"d4a"})
The first thing we’ll do is find the dimensional attributes we asked for in the map and put them into an array as well as a set. Why both? Well because I need to order them, but I also want a fast “contains”, and that’s better in a set than an array.
ArrayList<Node> attributes = new ArrayList<>();
HashSet<Node> attributeSet = new HashSet<>();
Iterator it = input.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
Node attribute = db.findNode(Label.label((String)pair.getKey()), "value", pair.getValue());
attributes.add(attribute);
}
attributeSet.addAll(attributes);
I sort the array by degree, with the lowest degree at the beginning of our array:
attributes.sort(Comparator.comparingInt(Node::getDegree));
Now I want to traverse from the first attribute to all the customers related to it and store the result in a list. In this case, I am using a RoaringBitmap and storing just the node ids.
RoaringBitmap nodeIds = new RoaringBitmap();
Node firstAttribute = attributes.remove(0);
for (Relationship r : firstAttribute.getRelationships(Direction.INCOMING)) {
nodeIds.add(((Number)r.getStartNode().getId()).intValue());
}
We next continue traversing from the attribute with the second lowest degree out to Customers and store those in a list. We then AND from our starting list and remove that attribute from our array.
RoaringBitmap nextNodeIds = new RoaringBitmap();
it = attributes.iterator();
while (it.hasNext()){
Node attribute = (Node)it.next();
for (Relationship r : attribute.getRelationships(Direction.INCOMING)) {
nextNodeIds.add(((Number)r.getStartNode().getId()).intValue());
}
nodeIds.and(nextNodeIds);
attributes.remove(attribute);
...
In our while loop, we will break if we get to our threshold. For the sample project I put in 100, but in a real application, you would use a larger number.
if (nodeIds.getCardinality() < 100 ) {
break;
}
So now we’ll collect our remaining Customer nodes into a list...
ArrayList<Node> remaining = new ArrayList<>();
nodeIds.forEach((IntConsumer) id -> {
remaining.add(db.getNodeById((long)id));
});
…and for each one of these, traverse out to the discrete values. If we encounter a value that is not in our starting set, we remove that customer from our remaining list and continue with the next customer. This short circuit will prevent us from having to traverse all of the dimensions as soon as we find an invalid one.
Iterator<Node> iterator = remaining.iterator();
while (iterator.hasNext()) {
Node node = iterator.next();
for (Relationship r : node.getRelationships(Direction.OUTGOING)) {
if(!attributeSet.contains(r.getEndNode())) {
iterator.remove();
break;
}
}
}
Finally, we will stream out our NodeResult:
return remaining.stream().map(NodeResult::new);
The source code, as always, is on GitHub.
By the way, did you know you can also take the Cartesian product of two graphs? Dr. Sarada Herke explains below:
Published at DZone with permission of Max De Marzi, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Apache Kafka vs. Message Queue: Trade-Offs, Integration, Migration
-
Build a Simple Chat Server With gRPC in .Net Core
-
Playwright JavaScript Tutorial: A Complete Guide
-
Building A Log Analytics Solution 10 Times More Cost-Effective Than Elasticsearch
Comments