Database Building 101: The Cost of Graphing
As you might expect, graph databases have different issues than table-based databases. In this article we examine some of the computational costs.
Join the DZone community and get the full member experience.
Join For FreeSo far we looked into how we can store the nodes and the edges, and we explored some interesting data structures inside Voron. Now, let’s see how we can traverse the graph.
public NameValueCollection GetNode(long id)
{
using(var tx = _env.ReadTransaction())
{
var nodesTable = new Table(_emptySchema, "Nodes", tx);
var reader = new TableValueReader(table.DirectRead(id, out size), size);
var data = reader.Read(0, out size);
var nodeStr = Encoding.UTF8.GetString(data, size);
return ParseNode(nodeStr);
}
}
So getting the node is pretty easy, and remember that to access the table by ID gives us an O(1) cost.
Now, what about finding all the edges from a node?
public IEnumerable<Tuple<long, NameValueCollection>> GetEdgesFor(string type, long id)
{
using(var tx = _env.ReadTransaction())
{
var edgesDataTable = new Table(_emptySchema, "EdgesData", tx);
var edgeTypeTree = tx.CreateTree("Edges_"+ type);
var fixedSizeTree = edgeTypeTree.FixedTreeFor(from, valSize: 8);
using(var it = fixedSizeTree.Iterate())
{
if(it.Seek(Slices.BeforeAllKeys))
yield break;
do
{
var destNodeId = it.CurrentKey;
var edgeDataId = it.CreateReaderForCurrent().ReadLittleEndianInt64();
NameValueCollection edgeData = null;
if(edgeDataId != null)
{
var reader = new TableValueReader(edgesDataTable.DirectRead(edgeDataId, out size), size);
var data = reader.Read(0, out size);
var edgeStr = Encoding.UTF8.GetString(data, size);
edgeData = ParseNode(nodeStr);
}
yield return Tuple.Create(destNodeId, edgeDataId);
}while(it.MoveNext());
}
}
}
This is a lot heftier, but let’s try to break it into individual pieces. First, we find the tree holding all the edges of that particular type, then we access (by the ID of the source node) the fixed size tree that holds all the connected nodes and the edge data ID.
Finally, we access the edges data (if it exists) to get the data about the edge itself, after which, we return it all to the caller.
Unlike the previous method, here we can’t claim to have O(1) costs. Instead, our costs are composed of:
- O(logN): where N is the number of edges types (typically, very few), to search for the right edges tree.
- O(logN): where N is the number of source nodes (typically, very large, but the fixed size trees are very good at being fast, and they don’t have to do much).
- O(N): where N is the number of connection between the source node and the destination node.
I’m excluding the cost of loading the edge data because this is an O(1) operation and is directly relevant to the iteration over all nodes connected to the source node.
Ideally, we can find a way to turn the second operation into an O(1) cost, but that should be more than good enough for the moment.
Now, this just gives us traversal of the nodes, but going from here to Breadth First Search/Depth First Search and doing them well is fairly trivial, although there are a few things that we'll need to talk about, but that would serve as a separate post.
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Implementing RBAC in Quarkus
-
Real-Time Presence Platform System Design
-
Why You Should Consider Using React Router V6: An Overview of Changes
-
How To Integrate Microsoft Team With Cypress Cloud
Comments