Performance Regression in Optimization: Part I

In this example provided by Ayende Rahien, there isn’t anything that looks like it could cause much of a performance gap...so why is there a huge one?

PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions and is the reason why we support MVCC and can avoid taking locks (for the most part). It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject of multiple performance reviews in that time, but acceptable levels of performance from our code in 2013 are no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component. It was responsible for too much CPU and far too many allocations.

Here is a drastically simplified implementation that retains the salient points:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public class PageTable_Old
{
    public struct TransactionPage
    {
        public long TransactionId;
        public long PagePosition;
    }

    private readonly ConcurrentDictionary<long, TransactionPage[]> _mapping =
        new ConcurrentDictionary<long, TransactionPage[]>();

    public void SetItems(long txId, Dictionary<long, long> mappingForTx)
    {
        foreach (var page in mappingForTx)
        {
            var copy = page; // copy the loop variable for capture in the lambdas below
            // Both lambdas capture txId and copy, so each iteration allocates a
            // closure and two delegates; the update path also allocates a List
            // plus a brand new array on every call.
            _mapping.AddOrUpdate(page.Key,
                l => new[] { new TransactionPage { PagePosition = copy.Value, TransactionId = txId } },
                (l, pages) =>
                    new List<TransactionPage>(pages)
                    {
                        new TransactionPage { PagePosition = copy.Value, TransactionId = txId }
                    }.ToArray());
        }
    }

    public void RemoveBefore(long txId)
    {
        foreach (var pages in _mapping)
        {
            // Entries are appended in transaction order, so if the oldest entry
            // is still visible, the whole array is.
            if (pages.Value[0].TransactionId > txId)
                continue;

            var newVal = pages.Value.Where(x => x.TransactionId > txId).ToArray();
            if (newVal.Length == 0)
            {
                _mapping.TryRemove(pages.Key, out newVal);
            }
            else
            {
                _mapping.AddOrUpdate(pages.Key, newVal, (l, oldVal) => newVal);
            }
        }
    }

    public bool TryGetValue(long page, long tx, out long position)
    {
        TransactionPage[] value;
        if (_mapping.TryGetValue(page, out value) == false)
        {
            position = -1;
            return false;
        }

        // Scan from the newest entry down to the latest one visible to tx.
        for (int i = value.Length - 1; i >= 0; i--)
        {
            if (value[i].TransactionId > tx)
                continue;
            position = value[i].PagePosition;
            return true;
        }

        position = -1;
        return false;
    }
}

Here is the sample workload for this class, which just simulates ten thousand transactions. This little scenario takes 15.3 seconds and allocates a total of 1.1GB of memory! That is a lot of allocations, and it must mean a tremendous amount of time spent in GC. The most problematic issue here is the SetItems method, which will allocate two different delegates for each modified page in the transaction. On top of that, we allocate additional memory in there with total abandon: a new List and a new array on every single update. As you can imagine, we weren't very happy about this, so we set out to fix this issue.
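The benchmark itself isn't reproduced here; a minimal sketch of that kind of workload might look like the following. The transaction count matches the article, but the pages-per-transaction, key range, and cleanup cadence are assumptions made purely for illustration:

using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class PageTableWorkload
{
    public static void Run()
    {
        var pageTable = new PageTable_Old();
        var sw = Stopwatch.StartNew();

        for (long txId = 1; txId <= 10_000; txId++)
        {
            // Each transaction remaps a handful of pages (assumed workload shape).
            var modifiedPages = new Dictionary<long, long>();
            for (int i = 0; i < 16; i++)
                modifiedPages[(txId + i) % 1_024] = txId * 16 + i;

            pageTable.SetItems(txId, modifiedPages);

            // Simulate a reader doing a lookup at this transaction.
            long position;
            pageTable.TryGetValue(txId % 1_024, txId, out position);

            // Periodically release versions no open transaction can still see.
            if (txId % 64 == 0)
                pageTable.RemoveBefore(txId - 32);
        }

        sw.Stop();
        Console.WriteLine($"Elapsed: {sw.Elapsed}");
    }
}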

We can take advantage of the fact that SetItems and RemoveBefore are only ever called under a lock, while TryGetValue is called concurrently with everything else.

I wrote the following code:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class PageTable_New
{
    public class TransactionPage
    {
        public long TransactionId;
        public long PagePosition;
    }

    private readonly ConcurrentDictionary<long, BufferHolder> _mapping =
        new ConcurrentDictionary<long, BufferHolder>();

    private class BufferHolder
    {
        public TransactionPage[] Buffer;
        public int Start;
        public int End;
    }

    public void SetItems(long txId, Dictionary<long, long> mappingForTx)
    {
        // Only ever called under the write lock, so we can mutate freely.
        foreach (var page in mappingForTx)
        {
            // The factory lambda captures nothing, so the compiler caches the delegate.
            var bufferHolder = _mapping.GetOrAdd(page.Key, _ => new BufferHolder
            {
                Buffer = new TransactionPage[2]
            });
            if (bufferHolder.Buffer.Length == bufferHolder.End)
            {
                // The buffer is full: double it and compact the live entries to the front.
                var newBufferHolder = new BufferHolder
                {
                    Buffer = new TransactionPage[bufferHolder.Buffer.Length * 2]
                };
                Array.Copy(bufferHolder.Buffer, bufferHolder.Start, newBufferHolder.Buffer, 0,
                    bufferHolder.End - bufferHolder.Start);
                newBufferHolder.End = bufferHolder.End - bufferHolder.Start;
                _mapping.TryUpdate(page.Key, newBufferHolder, bufferHolder); // no one else is writing here
                bufferHolder = newBufferHolder;
            }
            bufferHolder.Buffer[bufferHolder.End++] = new TransactionPage
            {
                PagePosition = page.Value,
                TransactionId = txId
            };
        }
    }

    public void RemoveBefore(long txId)
    {
        // Also called under the write lock. Instead of copying arrays around,
        // just advance Start past the entries that are no longer visible.
        foreach (var kvp in _mapping)
        {
            var bufferHolder = kvp.Value;
            while (bufferHolder.Start < bufferHolder.Buffer.Length)
            {
                var page = bufferHolder.Buffer[bufferHolder.Start];
                if (page == null || page.TransactionId > txId)
                {
                    break;
                }
                bufferHolder.Start++;
            }
            if (bufferHolder.Start == bufferHolder.Buffer.Length)
            {
                // The whole buffer was consumed; reset it for reuse.
                bufferHolder.Start = bufferHolder.End = 0;
            }
        }
    }

    public bool TryGetValue(long page, long tx, out long position)
    {
        // Called concurrently with the writers; may observe a stale Start/End.
        BufferHolder value;
        if (_mapping.TryGetValue(page, out value) == false)
        {
            position = -1;
            return false;
        }

        for (int i = value.End - 1; i >= value.Start; i--)
        {
            var transactionPage = value.Buffer[i];
            if (transactionPage == null)
                continue;

            if (transactionPage.TransactionId > tx)
                continue;
            position = transactionPage.PagePosition;
            return true;
        }

        position = -1;
        return false;
    }
}
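To make the contract concrete, here is a small usage sketch (the page numbers and transaction ids are made up for illustration). TryGetValue returns the position recorded by the latest transaction visible to the reader, i.e. the newest entry whose transaction id is at most the reader's:

var table = new PageTable_New();
table.SetItems(10, new Dictionary<long, long> { [5] = 100 }); // tx 10 maps page 5 -> position 100
table.SetItems(20, new Dictionary<long, long> { [5] = 200 }); // tx 20 remaps page 5 -> position 200

long position;
table.TryGetValue(5, 15, out position); // true, position == 100: tx 15 sees the tx-10 version
table.TryGetValue(5, 25, out position); // true, position == 200: tx 25 sees the tx-20 version
table.TryGetValue(5, 5, out position);  // false: no version is visible to tx 5

table.RemoveBefore(15);                 // no reader at tx <= 15 remains, drop the tx-10 version
table.TryGetValue(5, 15, out position); // now false: the old version is gone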

The new version relies on allowing stale reads from concurrent readers, which we don't care about: a reader only looks at entries whose transaction id is at most its own, so an entry it misses because of a stale read is one it would have skipped anyway. This was able to reduce the allocations to just 320 MB, but the runtime actually went up to 32 seconds.

That is quite annoying, as you can imagine, and much cursing ensued as a result. I then pulled out my trusty profiler and asked it kindly to figure out what piece of code needs to be hit with a rolling pin and given a stern talking-to about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or by Git, I'll revert you.

What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.

[Profiler output: costs of the original implementation, with a large share of the time spent in garbage collection]

Here is the optimized version (which is actually slower, and used more memory?!):

[Profiler output: costs of the optimized implementation, which is slower and uses more memory]

There are a bunch of interesting things going on here. We can see that we are indeed spending a little less time in GC and that both the RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher. In fact, if we compare the two, we have:

[Profiler comparison of TryGetValue in the two implementations]

Our TryGetValue is 3.4 times more expensive, and somehow the cost of calling the concurrent dictionary's TryGetValue has risen by 88%.

The implementation is pretty much the same, and there isn't anything else that looks like it could cause that much of a performance gap.
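If you want to poke at the difference yourself, a harness along these lines isolates just the read path of the two versions. The setup below (256 hot pages, a thousand populating transactions, ten million lookups) is an assumption for illustration, not the author's profiler scenario:

using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ReadPathComparison
{
    public static void Run()
    {
        // Populate both tables identically.
        var oldTable = new PageTable_Old();
        var newTable = new PageTable_New();
        for (long txId = 1; txId <= 1_000; txId++)
        {
            var pages = new Dictionary<long, long> { [txId % 256] = txId };
            oldTable.SetItems(txId, pages);
            newTable.SetItems(txId, pages);
        }

        long position;

        // Time the read path of each implementation in isolation.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000_000; i++)
            oldTable.TryGetValue(i % 256, 1_000, out position);
        Console.WriteLine($"Old TryGetValue: {sw.Elapsed}");

        sw.Restart();
        for (int i = 0; i < 10_000_000; i++)
            newTable.TryGetValue(i % 256, 1_000, out position);
        Console.WriteLine($"New TryGetValue: {sw.Elapsed}");
    }
}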

I'll leave this as a riddle for now, because it drove me crazy for two whole days, and give you the details on what is going on in the next post.

Topics:
performance, regression, optimization

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

