About a week ago, Dima and I were consulting for two different teams at the same customer site, without even knowing about each other’s presence. Around the middle of the day he gave me a call, and we discovered that we’re only a couple of stories apart.
What Dima’s customer was experiencing is a non-deterministic delay in a managed application performing processing of buffers received from a physical device before they were being sent to another physical device. While it would typically take less than 0.5ms to handle a single buffer, some buffers would take as long as 500ms to be processed. At arrival rates approaching 10GBps this was not an acceptable delay, as the underlying buffers would overflow while waiting for the application to drain them.
Seeing that there was no significant work done by the application other than the memory allocation to store the buffer, we immediately suspected the garbage collector. However, after several enquiries and measurements it became evident that most collections were gen 0 collections, which are supposed to be extremely cheap. After doing some more work, however, Dima concluded that it were the gen 0 garbage collections that were sometimes taking nearly a second to complete.
The one clue that we did get was that the delay times were always in multiples of 250ms. Most delays were 250ms, some 500ms, a chosen few 750ms and a very rare one would be 1000ms or 1250ms. Following the “there are no coincidences” rule, I wanted to understand whether there was a component in the system (or possibly in the garbage collector) that was performing a 250ms wait, possibly multiple times in a row before giving up.
To this end, we fired up WinDbg and set a breakpoint in ntdll!NtDelayExecution (which is called by the various flavors of Sleep). Not a single breakpoint was hit. Moving on, we set a breakpoint in kernel32!WaitForSingleObject, WaitForSingleObjectEx, WaitForMultipleObject and WaitForMultipleObjectEx and specified that we want a timeout value of exactly 250ms. (Setting a conditional breakpoint in WinDbg is a piece of cake, especially if you know the method signature.)
Sure enough, there was a breakpoint hit in the middle of a gen 0 garbage collection, with the following call stack (edited and snipped for clarity):
Breakpoint 0 hit
Args to Child
0000008c 000000fa 00000000 KERNEL32!WaitForSingleObjectEx
000000fa 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
000000fa 00000000 00000000 mscorwks!CLREvent::Wait+0x17
00000000 c1ca7196 00000000 mscorwks!Thread::SysSuspendForGC+0x697
00000001 007fc318 00000000 mscorwks!WKS::GCHeap::SuspendEE+0x16c
00000000 00000000 0036f054 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x195
00000000 00000000 007fc318 mscorwks!WKS::GCHeap::GarbageCollectTry+0x33
00000000 00000000 00000000 mscorwks!WKS::GCHeap::GarbageCollect+0x67
632fe67e 02455fc4 63760272 mscorwks!GCInterface::CollectGeneration+0xaa
00210132 02455f00 02452d74 mscorlib_ni+0x1a9944
00be27e2 00000003 0036f200 mscorlib_ni+0x6b8127
I was able to tell that it was a gen 0 collection by observing the first parameter to the WKS::GCHeap::GarbageCollect method on the call stack. The 000000fa parameter to WaitForSingleObjectEx indicates that we’re seeing a timeout value of 250ms.
The wait object itself is a non-signaled manual reset event:
0:000> !handle 0000008c ff
Object Specific Information
Event Type Manual Reset
Event is Waiting
So we have the location of the wait, and we can even see that the wait returns with a timeout by stepping out and inspecting the return value (usually in EAX) of WaitForSingleObjectEx:
eax=00000102 ebx=007fc318 ecx=7545e9a7 edx=00000000 esi=754b1126 edi=00000000
eip=655bc659 esp=0036ee20 ebp=0036ee50 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
0:000> !error 102
Error code: (Win32) 0x102 (258) - The wait operation timed out.
Right, so we have a situation here where the GC is waiting for some event that is not getting signaled for 250ms while it should instead be very busy preparing itself for a quick gen 0 garbage collection.
By stepping back a few frames on the stack, it’s possible to see what’s happening inside the Thread::SysSuspendForGC method. If the wait returns 0x102 (timeout) then an entry is logged to the CLR stress log (which I hope to discuss in a future post). It’s also possible to trace the entry text to the following two messages:
Timed out waiting for rendezvous event %d threads remaining.
Failed to CheckForAndDoRedirectForGC(). Retry suspension 2 for thread %p.
Finally, when all attempts to gracefully wait for the event have failed, Thread::SysSuspendForGC calls Thread::SuspendThread which maps to kernel32!SuspendThread.
From all of the above, you might be already seeing the light at the end of the tunnel. The GC thread is trying to checkpoint with one of the application threads by waiting for an event that will allow it to continue the GC. The application thread doesn’t signal the event, which causes the GC thread to give up and eventually suspend the application thread brutally using the Win32 API.
Two questions remain.
Why would the GC thread checkpoint with an application thread?
This one’s relatively easy. A garbage collection is fairly sensitive process, in that it walks through data structures and pointers that were used by the application threads just a moment ago. Although during the collection itself all other threads are suspended (in our case we’re working with the workstation concurrent GC, so during the sweep phase threads are suspended), it’s still not advisable to brutally suspend a thread while it’s in the middle of some possibly sensitive operation.
This is why the JIT compiler inserts so-called “safe for GC” markers into the generated code, so that when a thread hits one of these markers it would suspend itself gracefully if a GC is currently in progress. (There are numerous ways to implement this, one might involve temporarily unmapping a page that is accessed in the safepoint code and suspending the thread in the vectored exception handler of the resulting access violation exception.)
What are the application threads doing to prevent them from checkpointing with the GC when necessary?
This one is a little bit harder, but the obvious things to look for are threads performing lots of CPU operations that don’t call into the CLR. Any form of busy loop, number crunching or similar things that are constrained to a single method are suspicious in this regard. In our specific case, we found an internal loop that was busy-waiting its way around a physical device buffer. Apparently, this busy-wait didn’t contain a GC safepoint, making the GC thread wait for at least 250ms until it gave up and brutally suspended the thread.
To overcome this problem, it was necessary to either break the inner loop so that it once in a while calls a CLR API (Dima discovered that GC.KeepAlive did the job, there are other alternatives as well), or redesign it so that it is based on true waiting instead of busy-waiting. Dima chose the latter alternative and the problem disappeared, yielding consistent processing times of less than 0.5ms, as desired.
Once we had this conclusion, it was easy to reproduce the problem in a sterile environment. It takes less than 30 lines of code to demonstrate this issue:
Here’s a typical output from my machine:
^CPress any key to continue . . .
(Notice the peculiar accuracy of the delays, always in 250ms multiples or near.)
Some remarks before we conclude this post:
- The problem does not manifest when the 64-bit CLR is loaded into the process. I.e., if you want to reproduce the problem, run the above code on a 32-bit OS or make sure that you’re forcing it to run as x86 (if you’re on a 64-bit OS). This can be done from Visual Studio or using the corflags utility.
- The workstation concurrent GC flavor was used in this example, but the other GC flavors are not immune from this problem. I was able to reproduce it with the server GC just as easily.
- This is not the first time someone out there encounters this problem. Even more surprisingly, this is not the first time I have heard about this problem—over a year ago, in February 2008, I got an email from William Levine (a reader of this blog) who was experiencing strange 500ms delays, and suspected that there was some sort of CLR activity that was capped at 500ms. After some analysis, Bill was able to put together a tester (slightly more complicated than the one above) and even discuss it with a Microsoft Escalation Engineer. Unfortunately, the results were the same—this is an inherent limitation of the CLR GC, and there’s not much to do about it other than sprinkle cooperative-multitasking-style API calls into the code.