Reviewing Resin (Part 5)
As we continue our series reviewing the NoSQL database and search engine Resin, we cover the pieces that we haven't looked at yet.
Join the DZone community and get the full member experience.
Join For FreeBe sure to check out Part 1, Part 2, Part 3, and Part 4 first!
In the previous part, I looked at how indexing and queries are handled in Resin. This post is mostly about the pieces I haven’t talked about so far. We’ll start with the query parser and move to the trie.
Queries in Resin looks like this:
This is sort of looking like the Lucene syntax, but it looks like it keeps the same context until a new field comes along.
Range queries look better, sort of:
I had a hard time figuring this one out until I realized that this is not an XML tag in the middle.
The problem is that the Lucene query syntax kinda sucks. Actually, it sucks a lot. It is complex and ambiguous to parse and it is full of all those little things like the ~
over there that is not very obvious but is very important to the query. I would actually suggest something more like SQL. Sure, that wouldn’t be what you’ll put in the search box, but programmers will appreciate you for that.
Looking at the parser code, there aren’t any surprises there. It is using a hard-rolled system using regex and split, which can be vastly improved. One thing to note is that because of the simplicity of the parser, it isn’t really able to process things like a search for a token with a colon in it, so it can’t process this query: url:http://ayende.com
.
Anyway, the query parser isn’t really the most important thing here. The core of Resin, and what I haven’t looked at so far at all is the trie…
LcrsTrie
stands for left child right sibling, there is a good discussion on the reasons why you’ll want to use this here. At this point, I’m not really sure why the choice of Lcrs
was used. In general, they are used to reduce space and simplify the representation, but I don’t think that this is a good idea for a persistent structure. I’ll look at that later. Right now I’m reading the code, and it is mostly pretty obvious code. But then you get to this:
This pattern of using IEnumerable
to return a single value is something that I’ve seen in other places in the codebase, and I don’t really get it.
I like the use of the Levenshtein distance in fuzzy search, mostly because we don’t need to store a lot of data to get fuzzy search working. In particular, it looks like suggestion style queries are pretty easy, and would be much cheaper than it would be in Lucene.
Probably the core operation you always perform on a trie is the search, and the core of that, in this case, is the TryFindPath
method:
public bool TryFindPath(string path, out LcrsTrie leaf)
{
var node = LeftChild;
var index = 0;
while (true)
{
if (node == null) break;
if (node.Value.Equals(path[index]))
{
if (index + 1 == path.Length)
{
// destination has been reached
leaf = node;
return true;
}
else
{
// go deep when you've found c
index++;
node = node.LeftChild;
}
}
else
{
// go right when you are looking for c
node = node.RightSibling;
}
}
leaf = null;
return false;
}
There is nothing surprising in this code, but it is purely an in-memory implementation, which makes for a very different environment than a persistent data structure.
The persistent data structure is actually the MappedTrieReader
, so let's examine it. Looking at it, there is some reference to the notions of segments within the file, but I’m not seeing where it is used. This is what the *.six
file is used for, it seems. I think that this might be related to merging, but I don’t really know.
And here is the reason for the IsWord
design:
When using a single LcrsTrie
, it isn’t needed. But when using a possibly segmented reader, we might have multiple results for the same word.
What is worrying here is that the same access pattern for the trie that is used in memory is also using in the persistent mode. That means that each time we need to traverse the trie, we’ll need to seek. Actually, it looks like that might only be needed when we aren’t on the right path, but that is actually pretty common, so there are going to be a lot of seeks.
That is enough for now, my next post will be a more detailed analysis of the Resin I/O structure and what I would probably do instead.
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments