Reviewing Resin (Part 2)


As we continue reviewing Resin, let's check out the second file in the code, CollectorTests, and learn about UpsertTransaction.


In the first part of this series, I looked into how Resin tokenizes and analyzes text. I'm still reading the code via the tests (mostly because the Tests folder sorts higher than the Resin folder), and I have now moved on to the second file, CollectorTests.

That one has a really interesting start:

var docs = new List<dynamic>
{
    new {_id = "abc0123", title = "rambo first blood" },
    new {_id = "1", title = "rambo 2" },
    new {_id = "2", title = "rocky 2" },
    new {_id = "3", title = "the raiders of the lost ark" },
    new {_id = "four", title = "the rain man" },
    new {_id = "5five", title = "the good, the bad and the ugly" }
}.ToDocuments(primaryKeyFieldName: "_id");

long indexName;
using (var writer = new UpsertTransaction(dir, new Analyzer(), compression: Compression.Lz, documents: docs))
{
    indexName = writer.Write();
}

using (var collector = new Collector(dir, IxInfo.Load(Path.Combine(dir, indexName + ".ix")), new Tfidf()))
{
    var scores = collector.Collect(new QueryContext("_id", "3")).ToList();

    Assert.AreEqual(1, scores.Count);
    Assert.IsTrue(scores.Any(d => d.DocumentId == 3));
}

There are a lot of really interesting things here: UpsertTransaction, the document structure, issuing queries, etc. UpsertTransaction is a good place to start, so let's poke around. Looking at it, we can see a lot of usage of the Utils class, so I'll look at that first.

private static Int64 GetTicks()
{
    return DateTime.Now.Ticks;
}

public static long GetNextChronologicalFileId()
{
    return GetTicks();
}

This is probably a bad idea. While using the current time's ticks seems like it would generate ever-increasing values, that is not actually the case, certainly not with local time (clock shifts, daylight saving time, etc.). Using that to generate a file ID is probably a mistake. It would be better to maintain our own counter and keep track of the last value used on the file system itself.
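To make that concrete, here is a minimal sketch of the counter approach (the `FileIds` class and its members are mine, not Resin's): scan the directory once at startup for the highest ID already on disk, then hand out strictly increasing values from memory.

```csharp
// Hypothetical sketch, not Resin code: a monotonic file ID that survives
// restarts by recovering the highest ID already used on disk.
public static class FileIds
{
    private static long _lastId;

    public static void Initialize(string directory)
    {
        // Recover the last ID from the existing *.ix file names.
        _lastId = Directory.GetFiles(directory, "*.ix")
            .Select(f => long.Parse(Path.GetFileNameWithoutExtension(f)))
            .DefaultIfEmpty(0)
            .Max();
    }

    public static long GetNextChronologicalFileId()
    {
        // Interlocked makes this safe across threads; values never
        // repeat or go backward, regardless of clock changes.
        return Interlocked.Increment(ref _lastId);
    }
}
```

This keeps IDs chronological by construction instead of hoping the wall clock never moves backward.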

Then we have this:

public static string ReplaceOrAppend(this string input, int index, char newChar)
{
    var chars = input.ToCharArray();
    if (index == input.Length) return input + newChar;
    chars[index] = newChar;
    return new string(chars);
}

It took me a while to figure out what was going on there, and longer still to track down where this is used. It turns out this is used in fuzzy searches, and it allocates a new string instance on each call. Given that fuzzy search is popular in full-text search usage and that this is called a lot during any such search, this is going to allocate like crazy. It would be better to move the entire thing to mutable buffers instead of passing strings around.
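A hypothetical mutable-buffer version (this signature is my own suggestion, not Resin code) might look like this: the caller owns one `char[]` with a spare slot at the end and reuses it across the whole fuzzy-search expansion, so each edit is an in-place write rather than a fresh string.

```csharp
// Hypothetical sketch: mutate a caller-owned buffer instead of allocating
// a new string per candidate term. Returns the new logical length.
public static int ReplaceOrAppend(char[] buffer, int length, int index, char newChar)
{
    if (index == length)
    {
        buffer[length] = newChar; // caller sizes the buffer with one spare slot
        return length + 1;
    }
    buffer[index] = newChar;      // in-place replacement, zero allocations
    return length;
}
```

A string would only be materialized at the very end, for the candidates that actually match.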

Then we go to the locking — and I had to run it a few times to realize what was going on.

public static bool TryAquireWriteLock(string directory)
{
    var tmp = Path.Combine(directory, "write._lock");
    var lockFile = Path.Combine(directory, "write.lock");
    File.Create(tmp).Dispose();
    try
    {
        File.Copy(tmp, lockFile);
        return true;
    }
    catch (IOException)
    {
        return false;
    }
}

public static void ReleaseFileLock(string directory)
{
    File.Delete(Path.Combine(directory, "write.lock"));
}

And this isn't the way to do this at all. This relies on the file system failing when you try to copy a file over an already existing one, which is a really bad way to go about it. The OS and the file system already have locking primitives that you can use, and they are going to be much better than this option. Consider, for example, what happens after a crash: is the directory locked or not? There is no real way to answer that, since the process might have crashed and left the lock file in place, or it might still be working and expecting the directory to be locked.
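One way to lean on those OS primitives, sketched here as an assumption about what I'd do rather than anything in Resin: keep a `FileStream` open with `FileShare.None`, so the open handle itself is the lock, and use `FileOptions.DeleteOnClose` so the lock file disappears when the handle closes, including when the OS reclaims handles after a crash.

```csharp
// Hypothetical sketch: the held stream IS the lock. Dispose it to release.
// DeleteOnClose means a crashed process cannot leave a stale lock file behind.
public static FileStream TryAcquireWriteLock(string directory)
{
    var lockFile = Path.Combine(directory, "write.lock");
    try
    {
        return new FileStream(lockFile, FileMode.OpenOrCreate,
            FileAccess.ReadWrite, FileShare.None, 4096,
            FileOptions.DeleteOnClose);
    }
    catch (IOException)
    {
        return null; // another process holds the lock
    }
}
```

With this shape, the post-crash question answers itself: if the file exists, someone holds an open handle to it; if no one does, it is gone.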

Moving on, we have this simple looking method:

public static IEnumerable<string> GetIndexFileNamesInChronologicalOrder(string directory)
{
    return GetIndexFileNames(directory)
        .Select(f => new { id = long.Parse(new FileInfo(f).Name.Replace(".ix", "")), fileName = f })
        .OrderBy(info => info.id)
        .Select(info => info.fileName);
}

I know I'm harping on this, but this method is doing a lot of allocations through lambdas and anonymous types, and depending on the number of files, the delegate indirection can be quite costly. There is also the issue of error handling: if there is a lock file in the directory when this is called, it will throw.
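Both issues can be addressed with a plain loop, as in this sketch (assuming `GetIndexFileNames` returns the file paths, as in the code above): no lambdas, no anonymous types, and `TryParse` quietly skips anything whose name is not a numeric ID, such as a stray lock file.

```csharp
// Hypothetical rewrite: SortedList keeps the files ordered by ID as we add
// them, and non-numeric file names are skipped instead of throwing.
public static IEnumerable<string> GetIndexFileNamesInChronologicalOrder(string directory)
{
    var files = new SortedList<long, string>();
    foreach (var f in GetIndexFileNames(directory))
    {
        long id;
        if (long.TryParse(Path.GetFileNameWithoutExtension(f), out id))
            files.Add(id, f);
    }
    return files.Values;
}
```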

Our final code for this post is:

public static int GetDocumentCount(string directory)
{
    return GetIndexFileNamesInChronologicalOrder(directory)
        .Select(f => IxInfo.Load(f).DocumentCount)
        .Sum();
}

public class IxInfo
{
    public static IxInfo Load(string fileName)
    {
        var time = new Stopwatch();
        IxInfo ix;
        using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            ix = Serializer.DeserializeIxInfo(fs);
        }
        Log.DebugFormat("loaded ix in {0}", time.Elapsed);
        return ix;
    }
}

public class Serializer
{
    public static IxInfo DeserializeIxInfo(Stream stream)
    {
        var versionBytes = new byte[sizeof(long)];
        stream.Read(versionBytes, 0, sizeof(long));
        var docCountBytes = new byte[sizeof(int)];
        stream.Read(docCountBytes, 0, sizeof(int));

        // redacted code here

        return new IxInfo
        {
            VersionId = BitConverter.ToInt64(versionBytes, 0),
            DocumentCount = BitConverter.ToInt32(docCountBytes, 0)
        };
    }
}

I really don't like this code. It looks cheap, but it will:

  • Sort all the index files in the folder.
  • Open all of them.
  • Read some data.
  • Sum up that data.

Leaving aside the fact that the deserialization code has the typical issue of not checking that the entire buffer was read, this can cause a lot of I/O on the system. Luckily, this function is never called.
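The "entire buffer" issue deserves a line of explanation: `Stream.Read` is allowed to return fewer bytes than requested, so a single call can silently hand back a partial header. A small helper like this sketch (my own, not Resin's) closes that hole:

```csharp
// Hypothetical helper: loop until the buffer is completely filled,
// or fail loudly if the stream ends early (a truncated index file).
private static void ReadExactly(Stream stream, byte[] buffer)
{
    var offset = 0;
    while (offset < buffer.Length)
    {
        var read = stream.Read(buffer, offset, buffer.Length - offset);
        if (read == 0)
            throw new EndOfStreamException("index header is truncated");
        offset += read;
    }
}
```

With memory streams the single-call version happens to work, which is exactly why this kind of bug tends to surface only in production, over real files or sockets.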

Okay, so we didn’t actually get to figure out what UpsertTransaction is, but we’ll look at that in the next post.


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

