I Like My Performance Unsafe
Sometimes, you get to the point where you need to go native and use a bit of unsafe code. In this article, Ayende Rahien goes through the process of doing that.
After introducing the problem and doing some very obvious things (then doing some pretty non-obvious things and even writing our own I/O routines), we ended up with an implementation that is 17 times faster than the original one.
And yet, we can still do better. At this point, we need to go native and use a bit of unsafe code. We’ll start by implementing a naïve native record parser, like so:
public static unsafe class NativeRecord
{
    public static void Parse(byte* buffer, out long id, out long duration)
    {
        duration = (ParseTime(buffer + 20) - ParseTime(buffer)).Ticks;
        id = ParseInt(buffer + 40, 8);
    }

    private static DateTime ParseTime(byte* buffer)
    {
        var year = ParseInt(buffer, 4);
        var month = ParseInt(buffer + 5, 2);
        var day = ParseInt(buffer + 8, 2);
        var hour = ParseInt(buffer + 11, 2);
        var min = ParseInt(buffer + 14, 2);
        var sec = ParseInt(buffer + 17, 2);
        return new DateTime(year, month, day, hour, min, sec);
    }

    private static int ParseInt(byte* buffer, int size)
    {
        var val = 0;
        for (int i = 0; i < size; i++)
        {
            val *= 10;
            val += buffer[i] - '0';
        }
        return val;
    }
}
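Note the fixed-width layout this parser assumes: a start timestamp at offset 0, an end timestamp at offset 20, an 8-digit id at offset 40, and 50 bytes per record. Here is a safe, managed sketch of the same parsing that you can use to sanity-check the pointer version (the sample record text below is made up for illustration; the offsets come from the code above):

```csharp
using System;

class RecordLayoutDemo
{
    // Managed equivalent of NativeRecord.ParseInt: parse `size` ASCII digits.
    static int ParseInt(byte[] buffer, int offset, int size)
    {
        var val = 0;
        for (int i = 0; i < size; i++)
        {
            val *= 10;
            val += buffer[offset + i] - '0';
        }
        return val;
    }

    // Managed equivalent of NativeRecord.ParseTime, same field offsets.
    static DateTime ParseTime(byte[] buffer, int offset)
    {
        return new DateTime(
            ParseInt(buffer, offset, 4),       // yyyy
            ParseInt(buffer, offset + 5, 2),   // MM
            ParseInt(buffer, offset + 8, 2),   // dd
            ParseInt(buffer, offset + 11, 2),  // HH
            ParseInt(buffer, offset + 14, 2),  // mm
            ParseInt(buffer, offset + 17, 2)); // ss
    }

    static void Main()
    {
        // One 50-byte record: start time, end time, 8-digit id, trailing newline.
        var record = System.Text.Encoding.ASCII.GetBytes(
            "2016-07-01 00:00:00 2016-07-01 00:01:30 00043171 \n");
        var duration = ParseTime(record, 20) - ParseTime(record, 0);
        var id = ParseInt(record, 40, 8);
        Console.WriteLine($"{id} {duration.TotalSeconds}"); // 43171 90
    }
}
```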
The parsing logic is pretty much the same as before, but now we are dealing with pointers. How do we use this?
var stats = new Dictionary<long, FastRecord>();
using (var mmf = MemoryMappedFile.CreateFromFile(args[0]))
using (var accessor = mmf.CreateViewAccessor())
{
    byte* buffer = null;
    accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref buffer);
    var end = buffer + new FileInfo(args[0]).Length;
    while (buffer != end)
    {
        long id;
        long duration;
        NativeRecord.Parse(buffer, out id, out duration);
        buffer += 50;
        FastRecord value;
        if (stats.TryGetValue(id, out value) == false)
        {
            stats[id] = value = new FastRecord
            {
                Id = id
            };
        }
        value.DurationInTicks += duration;
    }
}
We memory map the file, and then we go over it, doing no allocations at all throughout.
With this, processing the file takes one second, allocates 126 MB (probably in the dictionary), and peaks at a working set of 320 MB.
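One caveat worth noting: AcquirePointer pins the mapped view, and in production code you would pair it with ReleasePointer in a finally block. A minimal sketch of that pattern (not from the original post; the temp file here just stands in for the real data file):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

unsafe class AcquireReleaseDemo
{
    static void Main()
    {
        // Hypothetical stand-in for the real data file.
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 1, 2, 3, 4 });

        using (var mmf = MemoryMappedFile.CreateFromFile(path))
        using (var accessor = mmf.CreateViewAccessor())
        {
            byte* buffer = null;
            var handle = accessor.SafeMemoryMappedViewHandle;
            handle.AcquirePointer(ref buffer);
            try
            {
                // Scan the mapped region directly through the pointer.
                long sum = 0;
                for (int i = 0; i < 4; i++)
                    sum += buffer[i];
                Console.WriteLine(sum); // 10
            }
            finally
            {
                handle.ReleasePointer(); // balance AcquirePointer
            }
        }
        File.Delete(path);
    }
}
```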
We are now 30 times faster than the initial implementation, and I wonder if I can do more. We can do that by going parallel, which gives us the following code:
// state
public unsafe class ThreadState
{
    public Dictionary<long, FastRecord> Records;
    public byte* Start;
    public byte* End;
}

// parallel work
Dictionary<long, FastRecord> allStats;
using (var mmf = MemoryMappedFile.CreateFromFile(args[0]))
using (var accessor = mmf.CreateViewAccessor())
{
    byte* buffer = null;
    accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref buffer);
    var len = new FileInfo(args[0]).Length;
    var entries = len / 50;
    int count = 4;
    var threadStates = new ThreadState[count];
    for (int i = 0; i < count; i++)
    {
        threadStates[i] = new ThreadState
        {
            Records = new Dictionary<long, FastRecord>(),
            Start = buffer + i * (entries / count) * 50,
            End = buffer + (i + 1) * (entries / count) * 50
        };
    }
    threadStates[threadStates.Length - 1].End = buffer + len;
    Parallel.ForEach(threadStates, state =>
    {
        while (state.Start != state.End)
        {
            long id;
            long duration;
            NativeRecord.Parse(state.Start, out id, out duration);
            state.Start += 50;
            FastRecord value;
            if (state.Records.TryGetValue(id, out value) == false)
            {
                state.Records[id] = value = new FastRecord
                {
                    Id = id
                };
            }
            value.DurationInTicks += duration;
        }
    });
    allStats = threadStates[0].Records;
    for (int i = 1; i < count; i++)
    {
        foreach (var record in threadStates[i].Records)
        {
            FastRecord value;
            if (allStats.TryGetValue(record.Key, out value))
                value.DurationInTicks += record.Value.DurationInTicks;
            else
                allStats.Add(record.Key, record.Value);
        }
    }
}
This is pretty ugly, but basically we are using four threads, giving each of them its own range of the file and its own dedicated records dictionary. Once they are done, we merge the per-thread records into a single dictionary, and that is it.
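A detail worth calling out: because each partition boundary is a multiple of the 50-byte record size, no thread ever starts mid-record, and the last thread simply absorbs the remainder. A quick sanity check of that arithmetic (the file length here is made up; the 50-byte record size and four-way split come from the code above):

```csharp
using System;

class PartitionDemo
{
    static void Main()
    {
        const int recordSize = 50;
        long len = 1_000_000_050; // hypothetical file length, a multiple of 50
        long entries = len / recordSize;
        int count = 4;
        for (int i = 0; i < count; i++)
        {
            long start = i * (entries / count) * recordSize;
            long end = (i + 1) * (entries / count) * recordSize;
            if (i == count - 1)
                end = len; // last thread takes the remainder
            // Every start offset is record-aligned, so Parse never straddles records.
            Console.WriteLine($"thread {i}: [{start}, {end}) aligned={start % recordSize == 0}");
        }
    }
}
```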
Using this approach, we can get down to 663 ms run time, 184 MB of allocations and 364 MB peak working set.
So, we are now about 45(!) times faster than the original version. We are almost done; in my next post, I'm going to pull out the profiler and see if we can squeeze anything else out of it.
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.