DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
What's in store for DevOps in 2023? Hear from the experts in our "DZone 2023 Preview: DevOps Edition" on Fri, Jan 27!
Save your seat

Reading the NSA's Codebase, Part I: LemonGraph Review, Storing Nodes

A database expert and software developer looks at some code written by the NSA to get a better grasp on LemonGraph and using nodes in databases.

Oren Eini user avatar by
Oren Eini
·
Aug. 12, 18 · Review
Like (1)
Save
Tweet
Share
4.58K Views

Join the DZone community and get the full member experience.

Join For Free

A few months ago the NSA released LemonGraph, a graph database based on LMDB. This weekend I took the time to explore the codebase. As usual, I’m going to be reading the code in lexical order, and mostly trying to figure out where things are happening and what the code is doing.

I started to read the Python code, but it immediately called to native C code:

image

Looking at the lemongraph.h file, I see a lot of quite interesting methods. This looks like the heart of LemonGraph is implemented in C, while the higher level Python code is orchestrating things. Not sure if this is a fair statement, but I’m going to be reading some C code now.

image

The first interesting thing in the C code is that LemonGraph actually wrapped LMDB with their own interface. From db.c, we have:

image

The MDB_xyz flags are defined by LMDB, DB_xyz are defined by LemonGraph. I assume that this was done to hedge their bets with regards to the underlying storage engine. From my own experience, any such attempt is usually doomed to failure. It isn’t the compilation errors that will kill you but the different behavior that you are coupled to that matters. A lot of the code there is basically forwarding directly to LMDB, but there is actual behavior implemented in db.c. In particular, they seem to have a lot of logic around managing remapping of LMDB (which is fixed size and requires reopening to change).

The lemongraph.c file is about 1,750 lines in size. I skimmed it a bit, but let’s get to the interesting bits. I tried to check where it is creating a new node and follow the code from there, but I got lost. So I started with the docs and looked at this code:

image

As a reminder, I’m reading this code in a text editor, not running it, so I’m somewhat limited by what I can discover. This Python code is strange, because it is doing creation by omission. In other words, if you already had a a node with “foo:bar” there, it would return it. Otherwise, a new node is created. Not my cup of tea, but make sense for a Python interface.

Let’s dig deeper into how this is built, shall we? This call ends up here:

image

This then calls to:

image

I want to stop everything for a moment and look at the first line of this code. It calls malloc, and it does this in a pretty strange manner. Note that it looks like it dereferences the variable before it has been initialized. This isn’t what is actually going on.

The sizeof operator is a compile time feature. And this relies on this code:

image

In C, structs are referred to as “struct STRUCT_NAME” and are typically typedef to themselves to make things easier. In this codebase, there is “struct node_t” and “node_t”, which is a pointer to the struct. I find this very confusing, to be honest.

At any rate, back tothe _node_resolve method. You can see something pretty interesting here. We have the _string_resolve method. What is it doing there? This just forwards the call to __resolve_blob, which is really interesting.

It is a bit long, and you can read it at your leisure here. The signature is here:

image

Basically, it uses a feature of LMDB that allows you to store multiple values for a key. I started describing what this is doing, but then I realized that this is crazy. Sample code would be much easier. Here is the essence of this method:

Dictionary<uint, List<long>> _crcToIds;
Dictionary<long, string> _idsToStrs;
long _lastId;

long __resolve_blob(string str){
	uint crc = ComputeCrc(str);

	if(_crcToIds.TryGetValue(crc, out var list))
	{
		foreach(var id in list)
		{
			var existingStr = _idsToStrs[id];
			if(existingStr == str)
				return id;
		}
	}
	else
	{
		_crcToIds[crc] = list = new List<long>();
	}
	var id = ++_lastId;

	list.Add(id);
	_idsToStrs.Add(id, str);
	return id;
}

As you can see, the idea is that LemonGraph maintains an internal string index, allowing it to turn any existing string into a numeric id. Going back to the_node_resolve() method, you can see that the _string_resolve method is used to get a unique id for both thetype and val parameters. This is then sent to the __node_resolve() method.

As an aside, I’m loving the fact that you can figure out the call graph for methods based on the number of underscores prefixed to the name. No underscore, public method. Single underscore, private method used internally. Two underscores, hidden method, used sparingly. At three underscores, I could tell you what it means, but given that this is the NSA’s codebase, I’ll have to kill you afterward.

image

This is a nice method. We first do a lookup, and if it isn’t there, we’ll append a new node. I’m going to look at the _node_append() method first, because it will likely make looking at the _node_lookup() method easier.

image

The encode() method uses length prefixed ints to encode int64 values in fewer bytes. I’m not sure why they chose this format over variant size int. That would save at least one byte in almost all cases. The end result of this method is that we pass the dbuf buffer to the _log_append() method with the (next, type, val) values encoded in it. I’m not sure what next is at this point, but I know that in the case we have here, this is meant to be 0, so we can probably ignore it for now. The _log_append() method simply append the buffer to a tree and returns a sequential id, nothing more. Going back up a bit, the _node_index() is now called, which looks like this:

image

This is basically is just adding the value to the log and creating an index to search by (type,val) – giving the id. This just make me that much more interested in the next. I’m pretty sure that this is required for a feature they call historical views, which likely means that they never actually do deletes, just markers. This is a nice feature if you want to effectively go back in time, or maybe compare current and past events.

On the one hand, this limits the usability of the graph, because your data set will only grow. On the other hand, not having to worry about deletes is so much easier.

Now we can go and figure out what is going on with the _node_lookup() method.

image

You can see that this is using the same DB_NODE_IDX from the _node_index() method. So basically this is doing the reverse of the past few methods. This will lookup the index, then fetch the relevant log id for this.

This is enough for now, I have a good grasp of the basic underlying data structures. I’m pretty sure that now that I know how nodes work, I can quickly check how edges and properties work as well. Hopefully, the next post will be able to talk about the graph operations directly, but I might run into other interesting bits first.

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Mr. Over, the Engineer [Comic]
  • Getting Started With JMS-ActiveMQ: Explained in a Simple Way
  • How ChatGPT Is Revolutionizing the World of Natural Language Processing
  • ChatGPT vs. GPT3: The Ultimate Comparison

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: