Reviewing Resin (Part 1)
A review of Resin, a small (about 6,500 lines of code) cross-platform document database and search engine written in C#.
Resin is a “cross-platform document database and search engine with a query language, API, and CLI.” It is written in C#, and while I admit that reading C# code isn’t as challenging as diving into a new language, a project that has a completely new approach to a subject that is near and dear to my heart is always welcome. It is also small, coming at about 6,500 lines of code, so that makes for quick reading.
I’m reviewing commit ddbffff88995226fa52236f6dd6af4a48c833f7a.
As usual, I’m going to start reading the code in alphabetical file order and then jump around as it makes sense. The very first file I run into is Tests/AnalyzerTests, where we find the following:
List<string> terms = new Analyzer(stopwords:new[]{"the", "a"})
.Analyze("The americans sent us a new movie.").ToList();
terms = new Analyzer(tokenSeparators:new []{'o'}).Analyze("hello world").ToList();
terms = new Analyzer().Analyze("Hello!World?").ToList();
This is really interesting, primarily because of what it tells me. Analyzers are pretty much only used for full-text searches, such as Lucene or Noise. Jumping into the analyzer, we see:
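The snippet itself isn’t reproduced here; the following is a rough, hypothetical sketch of the shape under discussion (guessed names and structure, not the actual Resin source):
// Hypothetical sketch, not Resin's actual code: a virtual, yield-based Analyze
// and a per-field token dictionary, as discussed below.
public virtual IEnumerable<string> Analyze(string value)
{
    foreach (var token in Tokenize(value))
    {
        yield return token;
    }
}

public IDictionary<string, int> AnalyzeField(string fieldValue)
{
    var tokenDic = new Dictionary<string, int>(); // allocated per field value
    foreach (var token in Analyze(fieldValue))
    {
        int count;
        tokenDic.TryGetValue(token, out count);
        tokenDic[token] = count + 1;
    }
    return tokenDic;
}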
This tells me quite a few things. To start with, this is a young project. The first commit was less than 18 months ago, and I’m judging it with the same eye I use when looking at our own code. This code needs to be improved for several reasons.
First, we have a virtual method call here, probably intended to be an extension point down the line. Currently, it isn’t used, and we pay the virtual call cost for no reason. Next, we have the return value. IEnumerable is great, but this method is using yield, which means that we’ll have a new instance created per document. For the same reason, the tokenDic is also problematic. This one is created per field’s value, which is going to cost.
One of the first things you want to have when you start worrying about performance is controlling your allocations. Reducing allocations, in this case, by reusing the dictionary instance or avoiding the yield would help. Lucene did a lot of stuff right in that regard, and it ensures that you can reuse instances wherever possible (almost always) since that can dramatically improve performance.
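To make the reuse idea concrete, here is a minimal sketch (hypothetical names, not Resin’s code): one term-count dictionary that is kept around and cleared between field values instead of being allocated per field.
private readonly Dictionary<string, int> _termCounts = new Dictionary<string, int>();

private Dictionary<string, int> CountTerms(IEnumerable<string> tokens)
{
    _termCounts.Clear(); // reuse the same instance instead of new-ing one per field
    foreach (var token in tokens)
    {
        int count;
        _termCounts.TryGetValue(token, out count);
        _termCounts[token] = count + 1;
    }
    // note: callers must consume the result before the next CountTerms call
    return _termCounts;
}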
Other than this, we can also see that we have Analyze and Index features. For now, I’m going to assume that they are identical to Lucene until proven otherwise. This was the analyzer, but what is going on with the tokenizer? Usually, that is a lot more low level.
The core of the tokenizer is this method (I prettified it a bit to make it fit better on screen):
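The method itself isn’t reproduced here either, but based on the breakdown that follows, a hypothetical reconstruction of its shape would look roughly like this (guessed names, not the actual source; assumes System.Linq for the final filter):
// Hypothetical reconstruction of the shape described below, not the actual source:
public IEnumerable<string> Tokenize(string value)
{
    var lowered = value.ToLowerInvariant();  // allocate a lowered copy of the string
    var buffer = new char[lowered.Length];   // allocate a char buffer of the same size
    for (int i = 0; i < lowered.Length; i++) // process that buffer
    {
        buffer[i] = IsNoice(lowered[i]) ? ' ' : lowered[i];
    }
    var cleaned = new string(buffer);        // allocate another string from the buffer
    return cleaned
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)   // split it
        .Where(t => _stopwords == null || !_stopwords.Contains(t));    // lambda per part vs. stopwords
}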
As far as I can tell so far, most of the effort in the codebase has gone into the data structures used, not to police allocations or get absolute performance. However, even so, this would be one of the first places I would look at whenever performance work would start. (To be fair, from speaking with the author of this code, I know there hasn’t been any profiling/benchmarking on the code.)
This code is going to be at the heart of any indexing, and for each value, it is going to:
- Allocate another string with the lowered case value.
- Allocate a character buffer of the same size as the string.
- Process that character buffer.
- Allocate another string from that buffer.
- Split that string.
- Use a lambda on each of the parts and evaluate that against the stopwords.
That adds up to a huge amount of allocations/computation that can be saved. Without changing anything external to this function, we can write the following:
private readonly StringBuilder _buffer = new StringBuilder();

public IEnumerable<string> Tokenize(string value)
{
    var list = new List<string>(); // ideally would be buffered
    for (int i = 0; i < value.Length; i++)
    {
        var ch = char.ToLowerInvariant(value[i]);
        if (!IsNoice(ch))
        {
            _buffer.Append(ch);
            continue;
        }
        if (_buffer.Length == 0)
            continue;
        var maybeToken = _buffer.ToString();
        _buffer.Clear();
        if (_stopwords == null ||
            _stopwords.Contains(maybeToken) == false)
            list.Add(maybeToken);
    }
    if (_buffer.Length > 0) // flush the trailing token so it isn't dropped or carried into the next call
    {
        var lastToken = _buffer.ToString();
        _buffer.Clear();
        if (_stopwords == null ||
            _stopwords.Contains(lastToken) == false)
            list.Add(lastToken);
    }
    return list;
}
This will do the same, but at a greatly reduced allocation cost. A better alternative here would be to change the design: instead of allocating a new list, have the caller send in a buffer, and instead of dealing with strings directly, deal with spans. Until .NET Core 2.0 is out, I’m going to skip spans and just use direct tokens, like so:
public void Tokenize(string value, List<(int Start, int Length)> tokens)
{
    int length = 0, start = 0;
    for (int i = 0; i < value.Length; i++)
    {
        var ch = char.ToLowerInvariant(value[i]);
        if (!IsNoice(ch))
        {
            if (length == 0)
                start = i; // a new token starts here
            length++;
            continue;
        }
        if (length == 0)
            continue;
        if (IsStopword(value, start, length) == false)
            tokens.Add((start, length));
        length = 0; // reset even for stopwords, so the next token starts cleanly
    }
    if (length > 0 && IsStopword(value, start, length) == false)
        tokens.Add((start, length)); // don't drop a trailing token
}
There are a few important things here. First, the code now doesn’t do any string allocations. Instead, it is operating on the string’s characters directly. We have the IsStopword method, which is now more complex because it needs to do the check without allocating a string while being efficient about it. How to do that is left as an exercise for the reader, but it shouldn’t be too hard.
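As a rough sketch of what such an allocation-free check could look like (again hypothetical, assuming _stopwords holds lower-cased words), something along these lines would do:
// Hypothetical sketch: compare the token's character range against each stopword
// without materializing a substring.
private bool IsStopword(string value, int start, int length)
{
    if (_stopwords == null)
        return false;
    foreach (var stopword in _stopwords)
    {
        if (stopword.Length != length)
            continue;
        var match = true;
        for (int i = 0; i < length; i++)
        {
            if (char.ToLowerInvariant(value[start + i]) != stopword[i])
            {
                match = false;
                break;
            }
        }
        if (match)
            return true;
    }
    return false;
}
For a handful of stopwords, a linear scan like this is fine; a larger set would call for something smarter.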
One thing that might not be obvious is the tokens list that we get as an argument. The idea here is that the caller code can reuse this list (and memory) between calls, but that would require a major change in the code.
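A hypothetical usage sketch of that pattern (tokenizer and fieldValues are stand-ins), with the caller owning the list and clearing it between calls:
var tokens = new List<(int Start, int Length)>();
foreach (var fieldValue in fieldValues) // fieldValues is a stand-in for the caller's data
{
    tokens.Clear(); // reuse the list (and its backing array) between calls
    tokenizer.Tokenize(fieldValue, tokens);
    foreach (var token in tokens)
    {
        // consume (token.Start, token.Length) against fieldValue directly;
        // calling Substring here would bring the allocations right back
    }
}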
In general, not operating on strings at all would be a great thing indeed. We can work with direct character buffers, which would allow us to mutate them in place. Again, spans would probably fit right into this and be of great help.
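For instance, as a hypothetical sketch, with a mutable character buffer the lower-casing could happen in place instead of allocating a lowered copy of the whole string:
// Hypothetical sketch: lower-case a token's range in place in a char buffer,
// instead of allocating value.ToLowerInvariant().
private static void LowerInPlace(char[] buffer, int start, int length)
{
    for (int i = start; i < start + length; i++)
    {
        buffer[i] = char.ToLowerInvariant(buffer[i]);
    }
}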
That is enough for now. I know we just started, but I’ll continue this in the next post.
Published at DZone with permission of Oren Eini, DZone MVB.