The Case of the Rogue Heartbeat Timer
A few days ago I had to debug a curious out-of-memory condition and encountered a pattern that you may find useful. The application in question consists of a set of WCF services that communicate with each other strictly through WCF channels. Occasionally, especially under heavy load, database operations start timing out, memory usage goes through the roof, and eventually an OutOfMemoryException is thrown.

Instead of describing in so many words what I did with the original memory dumps, I wrote a simple repro, which is what I'm going to work with here. First of all, let's run the application, simulate the heavy load, and watch the managed memory usage in a Performance Monitor view:


Ah, the typical chainsaw graph. Memory goes up and down, but there are huge upward spikes and the general trend is towards memory exhaustion. Now is a good time to attach a debugger and look for some of those heavy memory consumers.

The bottom of !dumpheap -stat tells us the following:

5be06c28      861       481792 System.Object[]
001e7734    79316      1586320 MemoryExhaustingService.Product
5be4f9ac    82599      3070272 System.String
00703660      934     35016028      Free
5be54944    79406    164701720 System.Byte[]
Total 249998 objects

Oh my. Nearly 80,000 “Products” and 80,000 byte arrays. Coincidence? Not in my repro application :-) Let’s take a look at some of those:

0:028> .foreach (product {!dumpheap -mt 001e7734 -short}) {!objsize product}
sizeof(028afac0) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028afeec) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028b0318) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028b0744) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028b0b70) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028b0f9c) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028b13c8) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
sizeof(028b17f4) =         1068 (       0x42c) bytes (MemoryExhaustingService.Product)
… [and so on, for another 79,000 objects]

So these “Products” are not really that innocent—each Product takes more than 1K of memory. Why?

0:029> !DumpObj 02d2a0ec
Name:        MemoryExhaustingService.Product
MethodTable: 00247734
EEClass:     00310bdc
Size:        20(0x14) bytes
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
5be4f9ac  4000001        4        System.String  0 instance 02d2a4f4 <Name>k__BackingField
5be52978  4000002        c         System.Int32  1 instance     3771 <Stock>k__BackingField
5be54944  4000003        8        System.Byte[]  0 instance 02d2a100 <Image>k__BackingField

0:029> !DumpObj 02d2a100
Name:        System.Byte[]
MethodTable: 5be54944
EEClass:     5bb8af1c
Size:        1012(0x3f4) bytes
Array:       Rank 1, Number of elements 1000, Type Byte Element Type:System.Byte
Fields:
None
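
For reference, a Product type with this layout presumably looks something like the following. This is a hypothetical reconstruction based purely on the fields in the dump; the compiler-generated <...>k__BackingField names indicate auto-implemented properties, and the [DataContract]/[DataMember] attributes are an assumption (any WCF-serializable type would produce the same layout):

using System.Runtime.Serialization;

// Hypothetical reconstruction of the Product type seen in the dump.
[DataContract]
public class Product
{
    [DataMember]
    public string Name { get; set; }

    [DataMember]
    public int Stock { get; set; }

    [DataMember]
    public byte[] Image { get; set; }   // 1,000 bytes per product in this repro
}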

What’s keeping these products and images alive then?

0:028> .foreach (product {!dumpheap -mt 001e7734 -short}) {!gcroot product}
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 774
Scan Thread 2 OSTHread 2284
Scan Thread 6 OSTHread 10e4
Scan Thread 7 OSTHread 6a4
Scan Thread 9 OSTHread 1ee4
Scan Thread 10 OSTHread 183c
Scan Thread 11 OSTHread 1db0
Scan Thread 12 OSTHread 2220
Scan Thread 14 OSTHread e38
Scan Thread 15 OSTHread 2128
Scan Thread 16 OSTHread 1310
Scan Thread 17 OSTHread 1f90
Scan Thread 18 OSTHread 1ce0
ESP:e7eed50:Root:  0f4ab69c(System.Object[])->
  028afac0(MemoryExhaustingService.Product)

Scan Thread 19 OSTHread a30
Scan Thread 20 OSTHread 23ec
Scan Thread 8 OSTHread be8
Scan Thread 21 OSTHread 1f48
Scan Thread 22 OSTHread 22c0
Scan Thread 26 OSTHread 1980
Scan Thread 27 OSTHread 20ac
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 774
Scan Thread 2 OSTHread 2284
Scan Thread 6 OSTHread 10e4
Scan Thread 7 OSTHread 6a4
Scan Thread 9 OSTHread 1ee4
Scan Thread 10 OSTHread 183c
Scan Thread 11 OSTHread 1db0
Scan Thread 12 OSTHread 2220
Scan Thread 14 OSTHread e38
Scan Thread 15 OSTHread 2128
Scan Thread 16 OSTHread 1310
Scan Thread 17 OSTHread 1f90
Scan Thread 18 OSTHread 1ce0
ESP:e7eed50:Root:  0f4ab69c(System.Object[])->
  028afeec(MemoryExhaustingService.Product)

Scan Thread 19 OSTHread a30
Scan Thread 20 OSTHread 23ec
Scan Thread 8 OSTHread be8
Scan Thread 21 OSTHread 1f48
Scan Thread 22 OSTHread 22c0
Scan Thread 26 OSTHread 1980
Scan Thread 27 OSTHread 20ac
… [and so on]

It seems there are local references to all of our Product instances from thread stacks (thread 18’s stack in the output above). This is a good hint that it’s worth taking a look at the application’s threads.

0:029> !Threads
ThreadCount:      21
UnstartedThread:  0
BackgroundThread: 20
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
[…edited for brevity…]
   0    MTA
   2    MTA (Finalizer)
   6    MTA (Threadpool Completion Port)
   8    MTA (Threadpool Completion Port)
   9    MTA (Threadpool Completion Port)
  10    MTA (Threadpool Completion Port)
  11    MTA (Threadpool Worker)
  12    MTA (Threadpool Worker)
  14    MTA (Threadpool Worker)
  15    MTA (Threadpool Worker)
  16    MTA (Threadpool Worker)
  17    MTA (Threadpool Worker)
  18    MTA (Threadpool Worker)
  19    MTA (Threadpool Worker)
  20    MTA (Threadpool Worker)
  21    MTA (Threadpool Worker)
   7    MTA (Threadpool Worker)
  22    MTA (Threadpool Completion Port)
  23    MTA (Threadpool Completion Port)
  24    MTA (Threadpool Worker)
  25    MTA (Threadpool Worker)

Nothing particularly unusual here—it’s a WCF application, after all, so thread pool threads should not be surprising. How about the call stacks of these threads? I’m only looking at #18 here (which appears in the !gcroot output), but obviously I would have to look at the rest of them as well.

0:018> ~18s; !CLRStack
[…edited for brevity…]
OS Thread Id: 0x1bd0 (18)
[HelperMethodFrame_1OBJ: 1050e614] System.Threading.WaitHandle.WaitOneNative
System.Threading.WaitHandle.InternalWaitOne
System.Threading.WaitHandle.WaitOne
System.Runtime.TimeoutHelper.WaitOne
System.ServiceModel.Dispatcher.DuplexChannelBinder+SyncDuplexRequest.WaitForReply
System.ServiceModel.Dispatcher.DuplexChannelBinder.Request
System.ServiceModel.Channels.ServiceChannel.Call
System.ServiceModel.Channels.ServiceChannelProxy.InvokeService
System.ServiceModel.Channels.ServiceChannelProxy.Invoke
System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke
[TPMethodFrame: 1050ec30] MemoryExhaustingService.IInventoryService.GetAllProducts()
MemoryExhaustingService.HeartbeatManager.OnTimer
System.Threading._TimerCallback.TimerCallback_Context
System.Threading.ExecutionContext.runTryCode
[HelperMethodFrame_PROTECTOBJ: 1050f0d0] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup
System.Threading.ExecutionContext.RunInternal
System.Threading.ExecutionContext.Run
System.Threading._TimerCallback.PerformTimerCallback
1050f344 5f1021db [GCFrame: 1050f344]
1050f44c 5f1021db [DebuggerU2MCatchHandlerFrame: 1050f44c]

Hmm. This thread is inside a timer callback which invokes the HeartbeatManager.OnTimer method. This method, in turn, uses a WCF proxy to call the Inventory Service’s GetAllProducts operation, and blocks waiting for the call to return.
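
Translated into code, the pattern on this stack presumably looks something like the sketch below. Only the names HeartbeatManager, OnTimer, IInventoryService and GetAllProducts come from the dump; the timer period, the binding, the endpoint address, the contract’s return type, and the Product type (reconstructed earlier) are illustrative assumptions:

using System;
using System.ServiceModel;
using System.Threading;

[ServiceContract]
public interface IInventoryService
{
    [OperationContract]
    Product[] GetAllProducts();
}

public class HeartbeatManager
{
    // The proxy; the binding and address here are made up for the sketch.
    private readonly IInventoryService _proxy =
        ChannelFactory<IInventoryService>.CreateChannel(
            new NetTcpBinding(),
            new EndpointAddress("net.tcp://localhost:8000/inventory"));

    private readonly Timer _timer;

    public HeartbeatManager()
    {
        // System.Threading.Timer fires its callback on a thread pool thread every
        // period, whether or not the previous callback has finished.
        _timer = new Timer(OnTimer, null, TimeSpan.Zero, TimeSpan.FromSeconds(5));
    }

    private void OnTimer(object state)
    {
        // Synchronous WCF call. Under load it takes longer than the timer period,
        // so callbacks pile up: each one blocks a thread pool thread and keeps its
        // own copy of the product list (including the byte[] images) alive through
        // stack locals, which is exactly what !gcroot showed above.
        Product[] products = _proxy.GetAllProducts();
        // ... heartbeat/health checks over the products ...
    }
}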

I know that the WCF thread that’s supposed to be serving this operation is also within the same process, so I can look it up in the list of threads:

0:018> .shell -ci "~* e !CLRStack" findstr GetAllProducts
0616eb10 5f102356 [TPMethodFrame: 0616eb10] MemoryExhaustingService.IInventoryService.GetAllProducts()
08b5ec40 5f102356 [TPMethodFrame: 08b5ec40] MemoryExhaustingService.IInventoryService.GetAllProducts()
1050ec30 5f102356 [TPMethodFrame: 1050ec30] MemoryExhaustingService.IInventoryService.GetAllProducts()
080ee930 5f102356 [TPMethodFrame: 080ee930] MemoryExhaustingService.IInventoryService.GetAllProducts()
0822eba0 5f102356 [TPMethodFrame: 0822eba0] MemoryExhaustingService.IInventoryService.GetAllProducts()

This is interesting: these are all proxy calls, not the actual service implementation (I can tell because the method is IInventoryService.GetAllProducts, the contract interface, rather than InventoryService.GetAllProducts, the implementation class). It means the service isn’t actually processing a GetAllProducts request right now, while five threads are waiting for it to reply to one. This is suspicious, so let’s see if we can find a thread running the InventoryService implementation:

0:018> .shell -ci "~* e !CLRStack" findstr MemoryExhaustingService.InventoryService
07e3e8f0 00302483 MemoryExhaustingService.InventoryService.GetMyProducts()

There: the service is actually performing the GetMyProducts operation right now, and it must be configured with ConcurrencyMode.Single, so the rest of the threads are waiting. (Verifying the service’s concurrency mode is something you could do in the debugger, but it probably makes more sense to check it directly in the service code or configuration.)
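
For context, here is roughly what such a service looks like in code. This is a hypothetical sketch, not the application’s actual source: the contract repeats the one from the earlier sketch, the operation bodies are placeholders, and the singleton InstanceContextMode is an assumption (with per-call or per-session instances, ConcurrencyMode.Single alone would not serialize requests from different callers). Note that Single is also WCF’s default concurrency mode, so the attribute may not even appear explicitly in the real code:

using System.ServiceModel;

[ServiceContract]
public interface IInventoryService
{
    [OperationContract]
    Product[] GetAllProducts();

    [OperationContract]
    Product[] GetMyProducts();
}

// ConcurrencyMode.Single (the WCF default) lets the service object handle only one
// request at a time; the singleton InstanceContextMode is assumed so that all
// callers contend for the same service instance and its lock.
[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single,
                 ConcurrencyMode = ConcurrencyMode.Single)]
public class InventoryService : IInventoryService
{
    public Product[] GetAllProducts()
    {
        // Lengthy database work under load; every other request waits its turn.
        return new Product[0];   // placeholder
    }

    public Product[] GetMyProducts()
    {
        // Also lengthy; this is the operation the dump caught in progress.
        return new Product[0];   // placeholder
    }
}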

So what are our findings so far?

  1. The Inventory Service is configured with ConcurrencyMode.Single, and currently processing the GetMyProducts operation.
  2. There are five threads invoked by a timer, all in the HeartbeatManager.OnTimer method, and they are waiting for the GetAllProducts operation to return. [There may be additional threads invoked by the timer.]
  3. Objects are not reclaimed because they are referenced (as local variables) from the timer threads. [This is not actually something I’ve proven here—I showed you that some of the Product instances are kept alive by one of the timer threads. Nonetheless, this statement is true—verifying it would only make this post longer.]

Let’s see what happens if we let the application run and the leak accumulate some more:

5be06c28      911       803216 System.Object[]
001c7734   154838      3096760 MemoryExhaustingService.Product
5be4f9ac   158121      5785864 System.String
00813660     2673    117781372      Free
5be54944   154934    257907356 System.Byte[]

0:039> !threads
ThreadCount:      31
UnstartedThread:  0
BackgroundThread: 30
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
[…edited for brevity…]
   0    MTA
   2    MTA (Finalizer)
   6    MTA (Threadpool Completion Port)
   7    MTA (Threadpool Completion Port)
   8    MTA (Threadpool Completion Port)
   9    MTA (Threadpool Completion Port)
  10    MTA (Threadpool Completion Port)
  11    MTA (Threadpool Worker)
  12    MTA (Threadpool Worker)
  14    MTA (Threadpool Worker)
  15    MTA (Threadpool Worker)
  16    MTA (Threadpool Worker)
  17    MTA (Threadpool Worker)
  18    MTA (Threadpool Worker)
  19    MTA (Threadpool Worker)
  20    MTA (Threadpool Worker)
  21    MTA (Threadpool Worker)
  22    MTA (Threadpool Worker)
  23    MTA (Threadpool Worker)
  24    MTA (Threadpool Worker)
  25    MTA (Threadpool Worker)
  26    MTA (Threadpool Worker)
  27    MTA (Threadpool Worker)
  31    MTA (Threadpool Worker)
  32    MTA (Threadpool Worker)
  33    MTA (Threadpool Worker)
  34    MTA (Threadpool Worker)
  35    MTA (Threadpool Worker)
  36    MTA (Threadpool Worker)
  37    MTA (Threadpool Worker)
  38    MTA (Threadpool Worker)

Oh boy, even more Products and even more threads! Out of sheer curiosity, how many of those are inside the OnTimer method?

0:039> .shell -ci "~* e !CLRStack" findstr OnTimer
065aee80 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0632ed60 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0ad2edd0 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
076deac0 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0931ee80 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
10c8e8a0 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0be7eb00 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0ac2ed60 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0a57e990 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
0a68ea30 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
0a7ce6a0 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
0a9fefc0 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0e7deb70 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0c0ae8c0 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
0c22eb10 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
094bee80 00230c49 MemoryExhaustingService.HeartbeatManager.OnTimer
097bedf0 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
099aeca0 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
07c3e840 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
07e5e860 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
07f6e790 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
0806e790 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer
0822eab0 00230c34 MemoryExhaustingService.HeartbeatManager.OnTimer

Quite a few, and growing! This brings us to the conclusion:

The Inventory Service is configured with ConcurrencyMode.Single, and the GetAllProducts/GetMyProducts operations are rather lengthy. There is a heartbeat timer configured to retrieve the list of products at relatively short intervals, and under heavy load the Inventory Service returns results more slowly than the timer fires. As a result, timer threads accumulate inside HeartbeatManager.OnTimer, keeping Product instances from being collected because each blocked thread holds local references to them.

Some possible recommendations would be increasing the timer interval (i.e., firing the heartbeat less often), making sure there is only one timer invocation in flight at a time (e.g., using Monitor.TryEnter), finding a way to run the Inventory Service with the Multiple concurrency mode, or removing the need for the heartbeat threads to retrieve and hold a large number of products while blocked on a service call.
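
To illustrate the Monitor.TryEnter option, here is a minimal sketch of a non-reentrant heartbeat callback, reusing the _proxy field and Product type from the hypothetical HeartbeatManager sketch above (none of this is the application’s actual code). Callbacks that fire while a previous one is still blocked on the service simply skip that round instead of stacking up:

// Drop-in replacement for OnTimer inside the HeartbeatManager sketch above.
private readonly object _heartbeatGate = new object();

private void OnTimer(object state)
{
    // If the previous heartbeat is still blocked on the Inventory Service, skip
    // this tick instead of parking yet another thread (and yet another copy of
    // the product list) on the thread pool.
    if (!Monitor.TryEnter(_heartbeatGate))
        return;

    try
    {
        Product[] products = _proxy.GetAllProducts();
        // ... health checks ...
    }
    finally
    {
        Monitor.Exit(_heartbeatGate);
    }
}

Another option along the same lines is to create the timer as a one-shot timer and reschedule it with Timer.Change at the end of each callback, which rules out overlapping invocations by construction.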

In my original debugging session, the story was slightly more complicated because there were also dozens of application threads blurring the picture, and the heartbeat threads were busy accessing multiple WCF services and databases. The general picture was the same nevertheless—heartbeat threads, which are supposed to be responsible for the application’s well-being, served the opposite purpose by creating increasing load on system components and bringing the system down by exhausting all memory. I think there’s even a philosophical angle somewhere in all of this :-)

Published at DZone with permission of Sasha Goldshtein, DZone MVB. See the original article here.