DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Databases
  4. Reading the NSA's Codebase: LemonGraph Review, Part 6: Executing Queries

Reading the NSA's Codebase: LemonGraph Review, Part 6: Executing Queries

In the 6th part of this review, let's check out how LemonGraph finds the clause that is likely to have the least number of items and how to process that.

Oren Eini user avatar by
Oren Eini
·
Aug. 21, 18 · Review
Like (2)
Save
Tweet
Share
3.82K Views

Join the DZone community and get the full member experience.

Join For Free

After going over many layers inside LemonGraph, I think we are ready to get to the real deal. We know how LemonGraph starts to execute a query. It finds the clause that is likely to have the least number of items and starts from there. These are the seeds of the query, but how is it going to actually process that?

Here is my graph:

image

And here is the query that I want to run on it: n()->e(type="likes", value="yes")->n()

LemonGraph detects that the cheapest source of seeds for this query is the edge and provides these to the MatchCTX class, which does the actual processing. Let’s explore how it is working on a single seed.

The actual behavior starts from the matches() method, which starts here:

image

As usual, for this codebase, the use of tuples for controlling behavior is prevalent and annoying. In this case, the idx argument controls the position of the target in the query, whether this is the start, end, or somewhere in the middle. The deltas define what direction the matches should go and the stops where you must stop searching, I think.

Now that we have set up the ground rules, the execution is as follows:

image

In this case, because the seed of our query is the edge (which is in the middle) it determines that deltas are (1, –1) and stops are (2, 0). We’ll also go to the first clause in the if statement. The link variable there controls whatever we should follow links going in or out.

There is a lot of logic actually packed into the link initialization. The self.link is an array that has "next" and "prev" as its values, so the idea is that it finds what property to look at and then uses the dictionary syntax to get the relevant property and decide what kind of direction it should go.

We’ll focus on the _recurse() method for now. In this case, we are being passed:

  • idx = 1
  • delta = 1
  • stop = 2

Here is the actual method:

image

As you can see, we first validate that the match is valid (this is where we mostly check other properties of the value that we are currently checking, filtering them in place). Usually, the do_seen will be set to True, which will ensure that we only traverse each node once. The main iteration logic is done in combination with the _next() method, as shown below:

image

The first line there shows usage of delta, which controls in what direction to move. I’m not quite sure why the filter is set in the if statements since it is also being set immediately after. This looks like redundant code that snuck in somehow.

The bulk of the work is shelled out to iterlinks, so let’s check what is going on there...I know that we are running on an edge here, so we need to look at the Edge.iterlinks() method:

image

There isn’t really much to tell here; it checks the filters and returns the incoming or outgoing edges, nothing more. On the other hand, the Node.iterlinks() implementation is a bit simpler:image

We’ll explore exactly how the edges are loaded for a node in a bit, however, I do want to note right now that the _next() method isn’t passing the types of the edge, even though it looks to me that it has this information. In that case, that would give a performance boost because it could filter a lot of edges in busy graphs.

Actually, I already looked at how iterators working in a previous post, so I won’t go over all of that again. This is basically calling this method

image

The graph_node_edges_in() and graph_node_edges_out() methods are pretty simple, basically just scanning the DB_TGTNODE_IDX or DB_SRCNODE_IDX indexes, respectively. The graph_node_edges() however, is really strange.

Here it is:

image

It took me a few reads to figure out that this is probably a case of someone unpacking a statement for debugging but forgetting to remove the packed statement. The second return statement is ignored and likely stripped out as unreachable, but it is pretty confusing to read.

The graph_iter_concat() answers a question I had about how graph_iter_t is used, here is how it looks like:

image

This is using C’s support for passing an unknown number of parameters to a method. This basically builds a simple linked list, which also answers the usage of _blarf() function and its behavior a few posts ago.

So we are back where we started; we understand how the data flows into the Python code, now let’s go back and look at _recurse() and _next() calls.

Now, I know a lot more about the behavior of the code, so this makes sense. The stop argument to the _recurse() controls the depth of the number of links that would be traversed in a particular query. Now that I know how this particular clause work, I understand how LemonGraph goes from the edge to the node in e()->n(), but I don’t know how it goes to back to find the full path.

The key for that is the calls to push() and pop() in the the MatchCtx._recurse() methods. These update the chain, like so:

image

In processing the query, we first append the edge to the chain:

image

The next step goes from the edge to its destination, giving us Oscar:

image

Remember that in the matches(), we got into this if clause:

image

This is when we start scanning an edge, which first adds things to the right and then scans things to the left in the _next() method. Look at the push() method there. By the time we get to the result(), we have iterated both sides of the connection, giving us:

image

That is a very clever way of doing things.

Let’s try doing things a little bit differently. I’m going to check a slightly different query: n(type=”dog”, value=”arava”)->e(type="likes", value="yes")->n()

If I understand things correctly, this is going to select the node as the starting point because that is a lot more efficient. That will allow me to figure out the other side of this operation. I just run this through the debugger, and this is indeed how it actually works.

The usage of yield to pass control midway is not trivial to follow, but it ends up being quite a nice implementation. This is enough for this post. In my next one, I want to explore a few of the possible operations that are exposed by LemonGraph.

Database

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Better Performance and Security by Monitoring Logs, Metrics, and More
  • DevOps Roadmap for 2022
  • Choosing the Best Cloud Provider for Hosting DevOps Tools
  • Top 5 Java REST API Frameworks

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: