DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
  1. DZone
  2. Data Engineering
  3. Databases
  4. Reading the NSA's Codebase: LemonGraph Review, Part 3: Figuring Out Queries

Reading the NSA's Codebase: LemonGraph Review, Part 3: Figuring Out Queries

This article takes a look at a review of LemonGraph, specifically on figuring out queries this time.

Oren Eini user avatar by
Oren Eini
·
Aug. 16, 18 · Review
Like (1)
Save
Tweet
Share
3.20K Views

Join the DZone community and get the full member experience.

Join For Free

After figuring out how LemonGraph is storing data on disk, my next task is to figure out how queries are handled. Here are some queries:

image

A query starts in the Python’s Query class, where it is parsed by the MatchLGQL class. I scanned this code briefly, but this is basically doing query parsing into the in-memory representation. This is a typically ugly piece of code, and that holds true for this code as well. Python’s dynamic nature also means that there isn’t an easy to follow set of AST classes. I’ll skip the details of query parsing and just see how it is actually handling the queries, then.

I started to look into the query execution and got lost in details that I didn’t understand. In particular, there is a very strong tie between the query parsing and the execution. More so than I expected. What brought this home was this piece of code, which is used to rank the most efficient manner in which you should start executing the query.

image

At this point, I think that the idea here is that when you start running a query, you want to start from the smallest set of seed nodes. the ranking here seems to be a nice way to go about doing just that, but I’m not really sure yet how this is applied.

This is later used to figure out what the best starting location is in the Query.finalize() method.

image

This all comes together for me inside the _adhoc() method on Query, where the query is actually being run:

image

The self.compiled is the set of already parsed matches. On each of them, we create a context (which will track already visited nodes/edges) and start by finding the seeds on the query. Seeds are handled using…an interesting approach:

image

It took me a few reads to get what this code is trying to do, and I still think that this is an obnoxious way to write things. This basically does the other side of the ranking. It is using the ranking to decide which method to call. I think that an enum would be about three times more readable. Especially since a bit lower you have:

image

I have to say that the modulus usage is the sign of true genius. Or madness. I’m not sure which, because I can’t figure out what the history of this code is. This was the same way from the initial commit, but I think that this code has history from before the public release. And it might have a reason for this kind of code. But I don’t like it and I think it would have been much easier to read if it wasn’t using magic numbers all over the place.

At any rate, let’s assume that we have the simplest query, for all the nodes. This would send us to txn.nodes()method. This would be rank 6, by the way. Here is how this looks like:

image

As you can see, we have two modes here. If the rank was 6, we aren’t sent a type. But if the rank was 4, we are getting a type. I’m going to start from the search for types, which seems more interesting.

Here is where we end up in the C code:

image

Yes, the graph_nodes_type() method is calling _graph_nodes_edges_type() method. That was confusing to me as well. The key here is the DB_NODE_IDX index there, which tell it to use a different tree for the lookups.

The graph_string_lookup() is something that we already run into, this is the __blob_resolve() method, which is searching for the string id for the given type. The code starts to get interesting when we see the graph_iter_new() call:

image

So we specify an iterator on the given prefix. From the previous post, you might recall that DB_NODE_IDX is specific as (type, val –> id). So this does a search on the first item that matches the prefix. The _cleanse_beforeID() method will ensure that the beforeID is only valid if it represent a value that is between 1 and the max log id that was generated on the graph.

The iterator we got from the nodes() method just implement’s Python iteration interface, starting from the graph_iter_new() item, then calling graph_iter_next() until the end. This is implemented like so:

image

Here, we see for the first time the actual usage of beforeID. I’m not sure what this does yet, so we’ll skip this for now and look at the _iter_idx_nextID() method and come back to it later. This method is quite interesting. We have the head of the iterator, which is set to true in the iterator init. I’m not sure what this means yet. What seems to be interesting is that _blarf() method, which I understand to be a cry of frustration (I had to look it up).

image

I’m not sure what the next pointer is doing there at all. We’ll look at that after I’ll inspect (with some measure of dread) the _blarf() function.

image

To start with, I love the use of goto instead of return statements. I understand that this may be a coding convention here and this is used to clearly mark when resources are supposed to be disposed of, but still…

The iter_next_key() ends up moving the cursor (and validating that the prefix is still valid). The _parse_idx_logID() call is here:

image

And this is just a fancy way of saying, gimme the last value in this buffer.

To understand what is going on, let’s go back a bit and understand what is going on here. We are actually scanning the index DB_NODE_IDX. And that index has the value of (type, val, id). Since the iteration is done in sorted order, this means that you can iterate over all the matches that match the type you want. The use of beforeID for filtering here, however, is interesting. I wonder how common is the use of historical queries like this are. Because if you need to skip over a lot of items, this will result in an O(N) operation while you are scanning and discarding a lot of data.

Anyway, there doesn’t seem to be any use of the graph_iter_t.next in this code path, so I’ll leave it for now. The iteration over all the nodes is done with the exact same method, but without specifying any prefix, which means that it matches everything.

I have to admit that at this point, I don’t really get how it processes the more complex queries. I had to give up the “let’s not run the code” and try a few things out. Surprisingly, on WSL, the code just cause a segmentation fault. I tracked it down to something in cffi, but I didn’t care that much. I created a simple ubuntu machine and played with it for a bit. So far, we just looked at how we get the first seeds of the query. A lot of the smarts seems to be hidden in the MatchCTX class. In particular, the matches() method seems to be doing a lot of magic.

I’m going to concentrate on that in my next post.

Database

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Cucumber.js Tutorial With Examples For Selenium JavaScript
  • GitLab vs Jenkins: Which Is the Best CI/CD Tool?
  • Specification by Example Is Not a Test Framework
  • Using GPT-3 in Our Applications

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: