Tracing .NET Core on Linux With USDT and BCC

In this post, we go over how to perform several different tracing tasks in a .NET Core application, like putting USDT probes in CoreCLR.

In my last post, I lamented the lack of call stack support for LTTng events in .NET Core. Fortunately, being open source, this is somewhat correctable — so I set out to produce a quick-and-dirty patch that adds USDT support for CoreCLR’s tracing events. This post explores some of the things that then become possible, and will hopefully become available in one form or another in CoreCLR in the future.

Very Brief USDT Primer

USDT (User Statically Defined Tracing) is a lightweight approach for embedding static trace markers into user-space libraries and applications. I took a closer look a year ago when discussing USDT support in BCC, so you might want to take a look as a refresher.

In a very small nutshell, to embed USDT probes into your library, you use a special set of macros, which then produce ELF NT_STAPSDT notes with information about the probe’s location (instruction offset), its name, its arguments, and a global variable that can be poked at runtime to turn the probe on and off (this is called the probe’s semaphore).

When tracing is disabled, i.e. the semaphore is off, USDT probes have a near-zero cost, essentially a single NOP instruction. If the argument preparation for the probe is prohibitively expensive, your code can protect relevant sections with another macro that checks if the probe is enabled before preparing and submitting its arguments. Here’s what the whole thing might look like:

// Declaring the trace semaphore and the trace macro:
#define _SDT_HAS_SEMAPHORES 1
#include <sys/sdt.h>

#define MYAPP_REQUEST_START_ENABLED() __builtin_expect (myapp_request_start_semaphore, 0)
__extension__ unsigned short myapp_request_start_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define MYAPP_REQUEST_START(url, client_port) DTRACE_PROBE2(myapp, request_start, url, client_port)

// The actual tracing code:
if (MYAPP_REQUEST_START_ENABLED()) {
  char const *url = curr_request->uri().canonicalize();
  unsigned short port = curr_request->connection()->client_port;
  MYAPP_REQUEST_START(url, port);
}

Okay, so why was I so eager to get these probes into CoreCLR? Because there are existing, lightweight tools for tracing USDT probes. One is SystemTap, which is great but requires a kernel module, and the other is the BCC toolkit, which I described extensively in previous posts. Also, because USDT probes can be mapped to specific program locations, the existing Linux uprobes mechanism can be used to trace them and obtain stack traces, with perf or ftrace-based machinery. Without subtracting from the value of LTTng traces, I really wanted to get the BCC tools working with CoreCLR, and that requires USDT.

Putting USDT Probes in CoreCLR

At this point, you might be thinking of some monstrous patch that modifies thousands of trace locations in CoreCLR to support USDT, somehow. Fortunately, there is a Python script in the CoreCLR source called genXplatLttng.py, which is responsible for generating function stubs for each CLR event. All I had to do is patch it ever so slightly (31 changed lines), and the resulting CoreCLR binary (libcoreclr.so) now has USDT probes!

# readelf -n .../libcoreclr.so

Displaying notes found at file offset 0x00000200 with length 0x00000024:
  Owner                 Data size       Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: a93f07f0d169d6dd53fb8a09e3fe793cda56072d

Displaying notes found at file offset 0x0079cf90 with length 0x0000c25c:
  Owner                 Data size       Description
  stapsdt              0x00000046       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCStart
    Location: 0x000000000051a296, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc18
    Arguments: 4@-28(%rbp) 4@-32(%rbp)
  stapsdt              0x0000006d       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCStart_V1
    Location: 0x000000000051a39e, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1a
    Arguments: 4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp)
  stapsdt              0x00000079       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCStart_V2
    Location: 0x000000000051a4b9, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1c
    Arguments: 4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp) 8@-72(%rbp)
  stapsdt              0x00000044       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCEnd
    Location: 0x000000000051a597, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1e
    Arguments: 4@-28(%rbp) 2@-30(%rbp)

Many more notes were omitted for brevity — there is a total of 394 events on the build I used. Now, it’s important to clarify that this patch doesn’t get the full fidelity events. LTTng events have a richer payload than what USDT probes support, and support complex structures, sequences, and more. However, in many tracing scenarios, very basic information such as strings and numbers is sufficient. And, of course, the call stack. So let’s see what we can do now.
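The Arguments lines in the readelf output above follow the SystemTap SDT convention of size@operand, where the size is in bytes and a negative size marks a signed value. As a minimal sketch (the UsdtArg type and parse_usdt_args function are hypothetical names, and splitting on whitespace assumes x86-style operands like the ones shown), the descriptors can be decoded like this:

```python
import re
from typing import List, NamedTuple


class UsdtArg(NamedTuple):
    size: int      # argument size in bytes
    signed: bool   # True when the descriptor used a negative size
    operand: str   # where the value lives, e.g. "-48(%rbp)"


# One descriptor looks like "4@-48(%rbp)": size, '@', then the operand.
ARG_RE = re.compile(r"^(-?\d+)@(.+)$")


def parse_usdt_args(spec: str) -> List[UsdtArg]:
    """Parse the 'Arguments:' field of an NT_STAPSDT note."""
    args = []
    for token in spec.split():
        match = ARG_RE.match(token)
        if match is None:
            raise ValueError("unrecognized USDT argument: %r" % token)
        raw_size = int(match.group(1))
        args.append(UsdtArg(abs(raw_size), raw_size < 0, match.group(2)))
    return args


if __name__ == "__main__":
    # The GCStart_V2 descriptor from the readelf output.
    spec = "4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp) 8@-72(%rbp)"
    for index, arg in enumerate(parse_usdt_args(spec), start=1):
        print("arg%d: %d bytes at %s" % (index, arg.size, arg.operand))
```

For GCStart_V2 this yields six arguments, which is why the tracing examples in this post can refer to arg1 through arg6.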

Tracing .NET Core Garbage Collections

OK, so what can we do with these newly-obtained superpowers? To begin with, we can trace USDT probes using the generic trace and argdist tools from BCC. For example, let’s get some statistics about garbage collections — how many collections do we have in each generation?

# argdist -p $(pidof helloworld) -C 'u::GCStart_V2():int:arg2#collections by generation' -c
[03:04:29]
collections by generation
        COUNT      EVENT
        4          arg2 = 2
        8          arg2 = 1
        13         arg2 = 0
[03:04:30]
collections by generation
        COUNT      EVENT
        5          arg2 = 2
        20         arg2 = 1
        25         arg2 = 0
[03:04:31]
collections by generation
        COUNT      EVENT
        5          arg2 = 2
        22         arg2 = 1
        28         arg2 = 0
[03:04:32]
collections by generation
        COUNT      EVENT
        6          arg2 = 2
        30         arg2 = 1
        36         arg2 = 0
[03:04:33]
collections by generation
        COUNT      EVENT
        9          arg2 = 2
        40         arg2 = 1
        49         arg2 = 0

arg2 in the above output is the collection “depth,” which is the collected generation. As you can see, we have quite a few gen0 and gen1 collections every second, and a handful of gen2 collections as well. (By the way, BCC has a tool called ugc for exploring GC latencies specifically, but it doesn’t currently support .NET Core.)
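For readers curious about what a tool like argdist does with such a probe, here is a rough sketch using the BCC Python bindings directly. It is not argdist's actual implementation: the handler name on_gc_start and the function trace_gc_generations are made up for illustration, and it only works against a libcoreclr.so that carries the USDT probes (such as the patched build described above), with BCC installed and root privileges.

```python
import sys
import time

# BPF program (C): read the probe's second argument and count by its value.
BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);

int on_gc_start(struct pt_regs *ctx) {
    u32 depth = 0;
    bpf_usdt_readarg(2, ctx, &depth);   // arg2: the condemned generation
    counts.increment(depth);
    return 0;
}
"""


def trace_gc_generations(pid, interval=1.0):
    """Print cumulative GCStart_V2 counts per generation until Ctrl-C."""
    # Imported lazily so the module can be loaded without BCC installed.
    from bcc import BPF, USDT

    usdt = USDT(pid=pid)
    usdt.enable_probe(probe="GCStart_V2", fn_name="on_gc_start")
    bpf = BPF(text=BPF_TEXT, usdt_contexts=[usdt])
    try:
        while True:
            time.sleep(interval)
            rows = sorted(bpf["counts"].items(), key=lambda kv: kv[0].value)
            for generation, count in rows:
                print("gen%d: %d collections" % (generation.value, count.value))
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        trace_gc_generations(int(sys.argv[1]))
    else:
        print("usage: gc_generations.py <pid>")
```

Passing the pid to USDT lets BCC locate the probe in the target process and turn on its semaphore, which is the step that makes the otherwise dormant probe fire.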

How did I know that arg2 is the collection depth, and how did I know that the collection “depth” is the generation to be collected? There are many more examples in this post that look a bit magical with various arg1, arg2, …, arg6 incantations. Right now, the answer is by inspecting the CLR source code to see where the probes are emitted, and what the values passed to them mean. In this particular case:

~/coreclr$ ack GCStart_V2 src/
src/vm/eventtrace.cpp
901: FireEtwGCStart_V2(pGcInfo->GCStart.Count, pGcInfo->GCStart.Depth, pGcInfo->GCStart.Reason, pGcInfo->GCStart.Type, GetClrInstanceId(), l64ClientSequenceNumberToLog);
...
src/gc/env/etmdummy.h
7:#define FireEtwGCStart_V2(Count, Depth, Reason, Type, ClrInstanceID, ClientSequenceNumber) 0

~/coreclr$ ack GCStart.*Depth src/
src/vm/eventtrace.cpp
895: (pGcInfo->GCStart.Depth == GCHeapUtilities::GetGCHeap()->GetMaxGeneration()) &&
901: FireEtwGCStart_V2(pGcInfo->GCStart.Count, pGcInfo->GCStart.Depth, pGcInfo->GCStart.Reason, pGcInfo->GCStart.Type, GetClrInstanceId(), l64ClientSequenceNumberToLog);
...
src/gc/gcee.cpp
91: Info.GCStart.Depth = (uint32_t)pSettings->condemned_generation;
100: else if (Info.GCStart.Depth < max_generation)

The argument order in the FireEtwGCStart_V2 function makes it clear that arg2 is going to be the collection depth. Then, the assignment statement in gcee.cpp hopefully makes it clear: the GC depth is the “condemned generation,” which is the generation to be collected.
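The same lookup can be made a little less manual. The sketch below derives an argN-to-name table from the macro definition found by ack, assuming the generated probe passes its arguments in the same order as the FireEtw macro parameters (which is what the reasoning above relies on). The function name usdt_arg_names is hypothetical.

```python
import re

# The definition found by ack in src/gc/env/etmdummy.h.
MACRO = ("#define FireEtwGCStart_V2(Count, Depth, Reason, Type, "
         "ClrInstanceID, ClientSequenceNumber) 0")


def usdt_arg_names(define_line):
    """Map arg1..argN to the parameter names of a FireEtw* macro."""
    match = re.search(r"#define\s+\w+\(([^)]*)\)", define_line)
    if match is None:
        raise ValueError("not a function-like macro: %r" % define_line)
    params = [p.strip() for p in match.group(1).split(",") if p.strip()]
    return {"arg%d" % i: name for i, name in enumerate(params, start=1)}


if __name__ == "__main__":
    for arg, name in usdt_arg_names(MACRO).items():
        print(arg, "=", name)
```

The table only gives names; what a value such as Depth means still has to be confirmed in the source, as done above with gcee.cpp.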

Now, where are these pesky collections coming from? The stackcount tool summarizes call stacks in-kernel:

# stackcount u:.../libcoreclr.so:GCStart_V2 -p $(pidof helloworld)
^C
  FireEtXplatGCStart_V2
  ETW::GCLog::FireGcStartAndGenerationRanges(ETW::GCLog::st_GCEventInfo*)
  WKS::GCHeap::UpdatePreGCCounters()
  WKS::gc_heap::do_pre_gc()
  WKS::gc_heap::garbage_collect(int)
  WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason)
  WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)
  WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)
  SlowAllocateString(unsigned int)
  StringObject::NewString(char16_t const*, int)
  Int32ToDecStr(int, int, StringObject*)
  COMNumber::FormatInt32(int, StringObject*, NumberFormatInfo*)
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    58

OK, so this looks like a fairly obvious path: there is a string allocation in DoSomeWork caused by converting an int32 to a string, and that triggers a GC repeatedly. Apparently, some of these GCs are gen0/gen1 but some of them actually require gen2 to clean up. Note that we get a full-fidelity call stack, including managed code (thanks to the COMPlus_PerfMapEnabled switch we saw in an earlier post).

If necessary, stack traces like these can also be visualized as flame graphs. Here’s an example flame graph from perf, of the above application while it was churning through a lot of memory allocations. The GC paths are clearly visible — in the foreground (allocating) thread, and in a background thread.
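Flame graph tooling typically consumes "folded" stacks: one line per stack, frames joined by semicolons from root to leaf, followed by a count. As a minimal sketch (fold_stacks is a hypothetical helper, and it assumes the input contains only stack blocks in the layout shown above, leaf frame first and the count on the last line), stackcount-style output can be converted like this:

```python
def fold_stacks(text):
    """Convert stackcount-style blocks into folded lines for flame graphs."""
    folded = []
    frames = []
    for line in text.splitlines():
        item = line.strip()
        if not item or item == "^C":
            continue
        if item.isdigit():
            # A bare number closes the current stack; emit it root-first.
            if frames:
                folded.append(";".join(reversed(frames)) + " " + item)
            frames = []
        else:
            frames.append(item)
    return folded


if __name__ == "__main__":
    sample = """
  FireEtXplatGCStart_V2
  WKS::gc_heap::garbage_collect(int)
  void [helloworld] helloworld.Program::Main(string[])
    58
"""
    for folded_line in fold_stacks(sample):
        print(folded_line)
```

Depending on the BCC version, stackcount may be able to produce folded output by itself, so check its help text before post-processing by hand.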

Another interesting thing to trace about the GC comes from the GCHeapStats_V1 event. This is an event that gets fired with every collection and provides information about individual generation sizes, the amount of promoted and finalized memory, and a bunch of other interesting stuff. Here’s an example of tracing generation 2 size over time, visualized as a histogram every 15 seconds:

# argdist -p $(pidof helloworld) -H 'u::GCHeapStats_V1():u64:arg5/1048576#gen2 size (MB)' -i 15 -c
[15:10:51]
     gen2 size (MB)      : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 23       |****                                    |
       128 -> 255        : 63       |************                            |
       256 -> 511        : 196      |****************************************|
[15:11:06]
     gen2 size (MB)      : count     distribution
         0 -> 1          : 6        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 3        |                                        |
        16 -> 31         : 6        |                                        |
        32 -> 63         : 10       |                                        |
        64 -> 127        : 49       |****                                    |
       128 -> 255        : 107      |**********                              |
       256 -> 511        : 404      |****************************************|

From the histogram, we can see that the gen 2 size is usually between 256MB and 512MB, but there are occasional GCs that bring it down, even as low as the 0-1MB bucket.
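The bucket boundaries in the histogram are powers of two. To make the mapping concrete, here is a small sketch of how a gen2 size in bytes ends up in a bucket, mirroring the arg5/1048576 expression from the command line. The function names are hypothetical, and the code only illustrates the bucketing arithmetic, not how BCC computes it in the kernel.

```python
MIB = 1048576  # the divisor used in the argdist expression


def log2_bucket(value):
    """Return the (low, high) bounds of the power-of-two bucket for value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, 2 * low - 1)


def gen2_bucket_mb(gen2_bytes):
    """Bucket a gen2 size given in bytes, using integer division to MB."""
    return log2_bucket(gen2_bytes // MIB)


if __name__ == "__main__":
    low, high = gen2_bucket_mb(300 * MIB)
    print("300 MB falls in the %d -> %d bucket" % (low, high))
```

Because the division is an integer division, any gen2 size below 2 MB lands in the first bucket.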

Tracing Object Allocations

Very similarly to the approach above, we could trace object allocations. The CLR includes a lightweight allocation tick event (GCAllocationTick_V3), which fires roughly every 100KB of object allocations. It includes the most recently allocated type name, and the amount of memory allocated since the last tick — allowing for low-overhead object allocation sampling, without tracing each individual allocation, which would be extremely expensive.
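Because the tick fires at a roughly fixed allocation interval, a tick count also gives a coarse estimate of allocation volume. The sketch below assumes exactly 100 KB per tick, which is only an approximation (the real event carries the precise amount allocated since the previous tick), and the function name is hypothetical.

```python
# Assumption: one tick per 100 KB of allocations, as a round figure.
TICK_BYTES = 100 * 1024


def estimate_allocated_mb(tick_count, tick_bytes=TICK_BYTES):
    """Back-of-the-envelope allocation volume, in MB, for a tick count."""
    return tick_count * tick_bytes / (1024.0 * 1024.0)


if __name__ == "__main__":
    ticks = 1000  # hypothetical number of ticks seen in one interval
    print("~%.1f MB allocated" % estimate_allocated_mb(ticks))
```

Dividing the result by the length of the sampling interval gives an approximate allocation rate.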

Unfortunately, the current trace and argdist tools don’t support Unicode strings, which is how the type name is provided to these events, so the output is slightly less useful — but we can still get histograms for the allocated amount at each tick, or a summary of type ids. First, let’s try arg6, which is the type name as a string:

# argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():char*:arg6' -z 32
[03:25:06]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        1          arg6 = S
        59         arg6 = S
        1254       arg6 = S
[03:25:07]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        1          arg6 = S
        383        arg6 = S
[03:25:08]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        2          arg6 = S
        11         arg6 = S
        1053       arg6 = S

That’s not very nice because when we treat the Unicode string as char*, only the first character gets displayed. This is fixable by modifying the tools or writing a dedicated tool that would display these strings correctly. For example, here’s the output from a patched argdist that appropriately decodes the strings instead of treating them like ASCII:

# argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():char*:arg6' -z 64
[03:51:58]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        1          arg6 = System.Char[]
        59         arg6 = System.String[]
        260        arg6 = System.String
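To see why the unpatched tool prints only S, it helps to look at the bytes. The type names are 16-bit character strings (note the char16_t parameters in the stack traces), so for ASCII text every second byte is zero, and a reader that treats the buffer as a NUL-terminated char* stops after the first character. This is a standalone sketch with hypothetical helper names, not the actual argdist patch:

```python
def read_c_string(buf):
    """Mimic reading a NUL-terminated char*: stop at the first zero byte."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


def read_utf16_string(buf):
    """Decode a UTF-16LE string, stopping at the first zero 16-bit unit."""
    for i in range(0, len(buf) - 1, 2):
        if buf[i:i + 2] == b"\x00\x00":
            buf = buf[:i]
            break
    return buf.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    # What a type name looks like in memory, including its terminator.
    raw = "System.String".encode("utf-16-le") + b"\x00\x00"
    print(read_c_string(raw))      # only the first character survives
    print(read_utf16_string(raw))  # the full type name
```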

We can also get good statistics by looking at type ids (method tables, actually) — which would have to be translated to type names separately, e.g. using SOS:

# argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():u64:arg5#type id'
[03:31:07]
type id
        COUNT      EVENT
        10         arg5 = 139987795580656
        746        arg5 = 139987795692592
[03:31:08]
type id
        COUNT      EVENT
        2          arg5 = 139987795580656
        1396       arg5 = 139987795692592
[03:31:09]
type id
        COUNT      EVENT
        1          arg5 = 139987795580656
        1064       arg5 = 139987795692592
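argdist prints these method table pointers in decimal, whereas debugger commands generally expect hexadecimal addresses (for example, when passing the value to SOS's DumpMT command). A tiny conversion helper, with a hypothetical name and an assumed 16-digit zero-padded format, could look like this:

```python
def method_table_hex(type_id):
    """Format a decimal method table pointer as a 64-bit hex address."""
    return format(type_id, "016x")


if __name__ == "__main__":
    # The two type ids from the argdist output.
    for type_id in (139987795580656, 139987795692592):
        print(type_id, "->", method_table_hex(type_id))
```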

The call stacks work great, though:

# stackcount -p $(pidof helloworld) u:.../libcoreclr.so:GCAllocationTick_V3
^C
  FireEtXplatGCAllocationTick_V3
  WKS::gc_heap::fire_etw_allocation_event(unsigned long, int, unsigned char*)
  WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)
  WKS::gc_heap::allocate_large_object(unsigned long, long&)
  WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)
  FastAllocatePrimitiveArray(MethodTable*, unsigned int, int)
  JIT_NewArr1(CORINFO_CLASS_STRUCT_*, long)
  [unknown]
  instance class [System.Collections]System.Collections.Generic.List`1<!1> [System.Linq] System.Linq.Enumerable+SelectListIterator`2[System.__Canon,System.Char]::ToList()
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    131

  FireEtXplatGCAllocationTick_V3
  WKS::gc_heap::fire_etw_allocation_event(unsigned long, int, unsigned char*)
  WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)
  WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)
  SlowAllocateString(unsigned int)
  StringObject::NewString(char16_t const*, int)
  Int32ToDecStr(int, int, StringObject*)
  COMNumber::FormatInt32(int, StringObject*, NumberFormatInfo*)
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    2496

This shows two major stack traces allocating objects: one allocating an array inside a LINQ ToList() call, and another one that we’ve already seen, formatting an int32 as a string.

Tracing Exception Events

Let’s take a look at another example. Suppose your application is suddenly hitting lots of internal exceptions, which are handled and processed but still producing some bad results. We will trace the exceptions as they occur, and get the call stacks where they are thrown. First, how many exceptions are we seeing? This is a question for the funccount tool:

# funccount -p $(pidof helloworld) u:.../libcoreclr.so:ExceptionThrown_V1
Tracing 1 functions for "u:/home/vagrant/helloworld/bin/Debug/netcoreapp2.0/ubuntu.16.10-x64/publish/libcoreclr.so:ExceptionThrown_V1"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
ExceptionThrown_V1                        100
Detaching...

All right, we have a fairly high rate of exceptions. What types? This requires the same patched argdist from the allocation tracing example:

# argdist -p $(pidof helloworld) -C 'u::ExceptionThrown_V1():char*:arg1#exception type' -i 5
[04:00:01]
exception type
        COUNT      EVENT
        100        arg1 = System.IndexOutOfRangeException

# argdist -p $(pidof helloworld) -C 'u::ExceptionThrown_V1():char*:arg2#exception message' -i 5 -z 128
[04:00:29]
exception message
        COUNT      EVENT
        200        arg2 = Index was outside the bounds of the array.

That’s pretty impressive — just like that, we can trace exception types and messages happening inside our application. And of course we can get the call stacks, using our good friend stackcount:

# stackcount -p $(pidof helloworld) u:.../libcoreclr.so:ExceptionThrown_V1
^C
  FireEtXplatExceptionThrown_V1
  ETW::ExceptionLog::ExceptionThrown(CrawlFrame*, int, int)
  ExceptionTracker::ProcessExplicitFrame(CrawlFrame*, StackFrame, int, ExceptionTracker::StackTraceState&)
  ExceptionTracker::ProcessOSExceptionNotification(_EXCEPTION_RECORD*, _CONTEXT*, _DISPATCHER_CONTEXT*, unsigned int, StackFrame, Thread*, ExceptionTracker::StackTraceState)
  ProcessCLRException
  UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*)
  DispatchManagedException(PAL_SEHException&, bool)
  __FCThrow(void*, RuntimeExceptionKind, unsigned int, char16_t const*, char16_t const*, char16_t const*)
  COMString::GetCharAt(StringObject*, int)
  char [helloworld] helloworld.Program::Selector(string)
  instance class [System.Collections]System.Collections.Generic.List`1<!1> [System.Linq] System.Linq.Enumerable+SelectListIterator`2[System.__Canon,System.Char]::ToList()
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    200

OK, so in a function called Selector we’re trying to access a character in a string, and hitting an out-of-bounds condition. Perhaps the string is empty, or the index is invalid. All that — without a debugger!

Conclusion

There are plenty of other things that are made possible by collecting call stacks from CoreCLR events: tracing assembly loads, method JIT, object movement, finalization, and many other interesting scenarios. Currently, this is all just wishful thinking: I don’t seriously expect anyone to patch their CoreCLR to emit USDT probes, just for the sake of BCC tools or SystemTap. However, it goes to show what’s possible, and what’s desirable, for the future of .NET Core tracing, debugging, and profiling on Linux.

In the meantime, there seems to be a very recent patchset proposing stack trace collection support for LTTng. If merged, you should be able to attach a stack trace to an event using the context mechanism, similar to how we attached the pid and the process name in the previous post. Although that wouldn’t light up all the BCC tools and SystemTap, it would be a step in the right direction and would make most of the analyses shown in this post possible.

Published at DZone with permission of Sasha Goldshtein, DZone MVB. See the original article here.
