Quick and Efficient Distributed Tracing In .NET
In this article, learn to implement quick and efficient distributed tracing in .NET, minimizing low-level code changes.
Join the DZone community and get the full member experience.
Join For FreeTo log, or not to log? To log! Nowadays, we can’t even imagine a modern software system without logging subsystem implementation, because it’s the very basic tool of debugging and monitoring developers can’t be productive without. Once something gets broken or you just want to know what’s going on in the depths of your code execution, there’s almost no other way than just to implement a similar functionality.
With distributed systems, and microservices architectures in particular, the situation gets even more complicated since each service can theoretically call any other service (or several of them at once), using either REST, gRPC, or asynchronous messaging (by means of numerous service buses, queues, brokers, and actor-based frameworks). Background processing goes there as well, resulting in entangled call chains we still want to have control over.
In this article we will show you how to implement efficient distributed tracing in .NET quickly, avoiding the modification of low-level code as much as possible so that only generic tooling and base classes for each communication instrument are affected.
Ambient Context Is The Core: Exploring The AsyncLocal
Let’s start with the root which ensures the growth of our tree - that is, where the tracing information is stored. Because to log the tracing information, we need to store it somewhere and then get it somehow. Furthermore, this information should be available throughout the execution flow - this is exactly what we want to achieve. Thus, I’ve chosen to implement the ambient context pattern (you’re probably familiar with it from HttpContext): simply put, it provides global access to certain resources in the scope of execution flow. Though it’s sometimes considered an anti-pattern, in my opinion, the dependency injection concerns are a bit out of… scope (sorry for the pun), at least for a specific case where we don’t hold any business data.
And .NET can help us with that, providing the AsyncLocal<T>
class; as opposed to ThreadLocal<T>
, which ensures data locality in the scope of a certain thread, AsyncLocal
is used to hold data for tasks, which (as we know) can be executed in any thread.
It’s worth mentioning that AsyncLocal
works top down, so once you set the value at the start of the flow, it will be available for the rest of the ongoing flow as well, but if you change the value in the middle of the flow, it will be changed for the flow branch only; i.e., data locality will be preserved for each branch separately.
If we look at the picture above, the following consequent use cases can be considered as examples:
- We set the
AsyncLocal
value as0
in theRoot Task
. If we don’t change it in the child tasks, it will be read as 0 in the child tasks’ branches as well. - We set the
AsyncLocal
value as1
in theChild Task 1
. If we don’t change it in theChild Task 1.1
, it will be read as1
in the context of _Child Task 1
_andChild Task 1.1,
but not in theRoot Task
orChild Task 2
branch - they will keep0
. - We set the
AsyncLocal
value as2
in theChild Task 2
. Similarly to #2, if we don’t change it in theChild Task 2.1
, it will be read as2
in the context ofChild Task 2
andChild Task 2.1
, but not in theRoot Task
orChild Task 1
branch - they will be0
forRoot Task
, and1
forChild Task 1
branch. - We set the
AsyncLocal
value as3
in theChild Task 1.1
. This way, it will be read as3
only in the context ofChild Task 1.1
, and not others’ - they will preserve previous values. - We set the
AsyncLocal
value as4
in theChild Task 2.1
. This way, it will be read as4
only in the context ofChild Task 2.1
, and not others’ - they will preserve previous values.
OK, words are cheap: let’s get to the code!
using Serilog;
using System;
using System.Threading;
namespace DashDevs.Framework.ExecutionContext
{
///
/// Dash execution context uses to hold ambient context.
/// IMPORTANT: works only top down, i.e. if you set a value in a child task, the parent task and other execution flow branches will NOT share the same context!
/// That's why you should set needed properties as soon you have corresponding values for them.
///
public static class DashExecutionContext
{
private static AsyncLocal _traceIdentifier = new AsyncLocal();
public static string? TraceIdentifier => _traceIdentifier.Value;
///
/// Tries to set the trace identifier.
///
/// Trace identifier.
/// If existing trace ID should be replaced (set to true ONLY if you receive and handle traced entities in a constant context)!
///
public static bool TrySetTraceIdentifier(string traceIdentifier, bool force = false)
{
return TrySetValue(nameof(TraceIdentifier), traceIdentifier, _traceIdentifier, string.IsNullOrEmpty, force);
}
private static bool TrySetValue(
string contextPropertyName,
T newValue,
AsyncLocal ambientHolder,
Func valueInvalidator,
bool force)
where T : IEquatable
{
if (newValue is null || newValue.Equals(default) || valueInvalidator.Invoke(newValue))
{
return false;
}
var currentValue = ambientHolder.Value;
if (force || currentValue is null || currentValue.Equals(default) || valueInvalidator.Invoke(currentValue))
{
ambientHolder.Value = newValue;
return true;
}
else if (!currentValue.Equals(newValue))
{
Log.Error($"Tried to set different value for {contextPropertyName}, but it is already set for this execution flow - " +
$"please, check the execution context logic! Current value: {currentValue} ; rejected value: {newValue}");
}
return false;
}
}
}
Setting the trace ID is as simple as DashExecutionContext.TrySetTraceIdentifier(“yourTraceId”)
with an optional value replacement option (we will talk about it later), and then you can access the value with DashExecutionContext.TraceIdentifier
. We could implement this class to hold a dictionary as well; just in our case, it was enough (you can do this by yourself if needed, initializing a ConcurrentDictionary<TKey, TValue>
for holding ambient context information with TValue
being AsyncLocal
).
In the next section, we will enrich Serilog with trace ID values to be able to filter the logs and get complete information about specific call chains.
Logging Made Easy With Serilog Dynamic Enrichment
Serilog, being one of the most famous logging tools on the market (if not the most), comes with an enrichment concept - logs can include additional metadata of your choice by default, so you don’t need to set it for each write by yourself. While this piece of software already provides us with an existing LogContext, which is stated to be ambient, too, its disposable nature isn’t convenient to use and reduces the range of execution flows, while we need to process them in the widest range possible.
So, how do we enrich logs with our tracing information? Among all the examples I’ve found that the enrichment was made using immutable values, so the initial plan was to implement a simple custom enricher quickly which would accept the delegate to get DashExecutionContext.TraceIdentifier
value each time the log is written to reach our goal and log the flow-specific data. Fortunately, there’s already a community implementation of this feature, so we’ll just use it like this during logger configuration initialization:
var loggerConfiguration = new LoggerConfiguration()
...
.Enrich.WithDynamicProperty(“X-Dash-TraceIdentifier”, () => DashExecutionContext.TraceIdentifier)
...
Yes, it's as simple as that - just a single line of code with a lambda, and all your logs now have a trace identifier!
HTTP Headers With Trace IDs for ASP.NET Core REST API and GRPC
The next move is to set the trace ID in the first place so that something valuable is shown in the logs. In this section, we will learn how to do this for REST API and gRPC communication layers, both server and client sides.
Server Side: REST API
For the server side, we can use custom middleware and populate our requests and responses with a trace ID header (don’t forget to configure your pipeline so that this middleware is the first one!).
using DashDevs.Framework.ExecutionContext;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Serilog;
using System.Threading.Tasks;
namespace DashDevs.Framework.Middlewares
{
public class TracingMiddleware
{
private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";
private readonly RequestDelegate _next;
public TracingMiddleware(RequestDelegate next)
{
_next = next;
}
public async Task Invoke(HttpContext httpContext)
{
if (httpContext.Request.Headers.TryGetValue(DashTraceIdentifier, out var traceId))
{
httpContext.TraceIdentifier = traceId;
DashExecutionContext.TrySetTraceIdentifier(traceId);
}
else
{
Log.Debug($"Setting the detached HTTP Trace Identifier for {nameof(DashExecutionContext)}, because the HTTP context misses {DashTraceIdentifier} header!");
DashExecutionContext.TrySetTraceIdentifier(httpContext.TraceIdentifier);
}
httpContext.Response.OnStarting(state =>
{
var ctx = (HttpContext)state;
ctx.Response.Headers.Add(DashTraceIdentifier, new[] { ctx.TraceIdentifier }); // there’s a reason not to use DashExecutionContext.TraceIdentifier value directly here
return Task.CompletedTask;
}, httpContext);
await _next(httpContext);
}
}
}
Since the code is rather simple, we will stop only on a line where the response header is added. In our practice, we’ve faced a situation when in specific cases the response context was detached from the one we’d expected because of yet unknown reason, and thus the DashExecutionContext.TraceIdentifier
value was null. Please, feel free to leave a comment if you know more - we’ll be glad to hear it!
Client Side: REST API
For REST API, your client is probably a handy library like Refit or RestEase. Not to add the header each time and produce unnecessary code, we can use an HttpMessageHandler implementation that fits the client of your choice. Here we’ll go with Refit and implement a DelegatingHandler for it.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using DashDevs.Framework.ExecutionContext;
namespace DashDevs.Framework.HttpMessageHandlers
{
public class TracingHttpMessageHandler : DelegatingHandler
{
private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";
protected override async Task SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
{
if (!request.Headers.TryGetValues(DashTraceIdentifier, out var traceValues))
{
var traceId = DashExecutionContext.TraceIdentifier;
if (string.IsNullOrEmpty(traceId))
{
traceId = Guid.NewGuid().ToString();
}
request.Headers.Add(DashTraceIdentifier, traceId);
}
return await base.SendAsync(request, cancellationToken);
}
}
}
Then you just need to register this handler as a scoped service in the ConfigureServices method of your Startup class and finally add it to your client configuration as follows.
public void ConfigureServices(IServiceCollection services)
{
...
services.AddScoped();
...
services.AddRefitClient(). ... .AddHttpMessageHandler();
...
}
Server Side: gRPC
For gRPC, the code is generated from Protobuf IDL (interface definition language) definitions, which can use interceptors for intermediate processing. For the server side, we’ll implement a corresponding one that checks the request headers for the trace ID header.
using DashDevs.Framework.ExecutionContext;
using Grpc.Core;
using Grpc.Core.Interceptors;
using System;
using System.Linq;
using System.Threading.Tasks;
namespace DashDevs.Framework.gRPC.Interceptors
{
public class ServerTracingInterceptor : Interceptor
{
private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";
public override Task UnaryServerHandler(TRequest request, ServerCallContext context, UnaryServerMethod continuation)
{
ProcessTracing(context);
return continuation(request, context);
}
public override Task ClientStreamingServerHandler(IAsyncStreamReader requestStream, ServerCallContext context, ClientStreamingServerMethod continuation)
{
ProcessTracing(context);
return continuation(requestStream, context);
}
public override Task ServerStreamingServerHandler(TRequest request, IServerStreamWriter responseStream, ServerCallContext context, ServerStreamingServerMethod continuation)
{
ProcessTracing(context);
return continuation(request, responseStream, context);
}
public override Task DuplexStreamingServerHandler(IAsyncStreamReader requestStream, IServerStreamWriter responseStream, ServerCallContext context, DuplexStreamingServerMethod continuation)
{
ProcessTracing(context);
return continuation(requestStream, responseStream, context);
}
private void ProcessTracing(ServerCallContext context)
{
if (string.IsNullOrEmpty(DashExecutionContext.TraceIdentifier))
{
var traceIdEntry = context.RequestHeaders.FirstOrDefault(m => m.Key == DashTraceIdentifier.ToLowerInvariant());
var traceId = traceIdEntry?.Value ?? Guid.NewGuid().ToString();
DashExecutionContext.TrySetTraceIdentifier(traceId);
}
}
}
}
To make your server calls intercepted, you need to pass a new instance of the ServerTracingInterceptor
to the ServerServiceDefinition.Intercept method. The ServerServiceDefinition, in turn, is obtained by a call of the BindService
method of your generated service. The following example can be used as a starting point.
...
var server = new Server
{
Services = { YourService.BindService(new YourServiceImpl()).Intercept(new ServerTracingInterceptor()) },
Ports = { new ServerPort("yourServiceHost", Port, ServerCredentials.Insecure) }
};
server.Start();
...
Client Side: GRPC
ChannelExtensions.Intercept extension method comes to the rescue here - we will call it after channel creation, but at first we’re to implement the interceptor itself in the form of Func
like it’s shown below.
using DashDevs.Framework.ExecutionContext;
using Grpc.Core;
using System;
namespace DashDevs.Framework.gRPC.Interceptors
{
public static class ClientInterceptorFunctions
{
private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";
public static Func TraceHeaderForwarder = (Metadata source) =>
{
var traceId = DashExecutionContext.TraceIdentifier;
if (string.IsNullOrEmpty(traceId))
{
traceId = Guid.NewGuid().ToString();
}
source.Add(DashTraceIdentifier, traceId);
return source;
};
}
}
The usage is quite simple:
- Create the Channel object with specific parameters.
- Create your client class object and pass the Intercept method result of a Channel from p.1 using the
InterceptorFunctions.TraceHeaderForwarder
as a parameter for the client class constructor instead of passing the original Channel instance instead.
It can be achieved with the following code as an example:
…
var channel = new Channel("yourServiceHost:yourServicePort", ChannelCredentials.Insecure);
var client = new YourService.YourServiceClient(channel.Intercept(ClientInterceptorFunctions.TraceHeaderForwarder));
...
Base Message Class vs. Framework Message Metadata in Asynchronous Communication Software
The next question is how to pass the trace ID in various async communication software. Basically, one can choose to use either framework-related features to pass trace ID further or go in a more straightforward manner with a base message. Both have pros and cons:
- The base message approach is ideal for communication where no features are provided to store contextual data, and it’s the least error-prone overall due to simplicity. On the other hand, if you have already defined a set of messages, backward compatibility may break if you just add another field depending on the serialization mechanism (so if you are to go this way, it’s better to do this from the very beginning and consider among other infrastructure features during design sessions), not mentioning that it may affect much code, which is better to be avoided.
- Setting framework metadata, if available, is a better choice, because you can leave your message processing code as it is with just a minor improvement, which will be automatically applied to all messaging across the whole system. Also, some software may provide features for additional monitoring of this data (e.g., in the dashboard).
Next, we will provide you with some real-world examples.
Amazon SQS
One of the most widely used message queues is Amazon Simple Queue Service. Fortunately, it provides message metadata (namely, message attributes) out of the box, so we will gladly use it.
The first step is to add trace ID to messages we send, so you can do something like this.
public async Task SendMessageAsync(T message, CancellationToken cancellationToken, string? messageDeduplicationId = null)
{
var amazonClient = new AmazonSQSClient(yourConfig);
var messageBody = JsonSerializer.Serialize(message, yourJsonOptions);
return await amazonClient.SendMessageAsync(
new SendMessageRequest
{
QueueUrl = "yourQueueUrl",
MessageBody = messageBody,
MessageDeduplicationId = messageDeduplicationId,
MessageAttributes = new Dictionary()
{
{
"X-Dash-TraceIdentifier", new MessageAttributeValue()
{
DataType = "String",
StringValue = DashExecutionContext.TraceIdentifier,
}
}
}
}, cancellationToken);
}
The second step is to read this trace ID in a receiver to be able to set it for ambient context and continue the same way.
public async Task> GetMessagesAsync(int maxNumberOfMessages, CancellationToken token)
{
if (maxNumberOfMessages < 0)
{
throw new ArgumentOutOfRangeException(nameof(maxNumberOfMessages));
}
var amazonClient = new AmazonSQSClient(yourConfig);
var asyncMessage = await amazonClient.ReceiveMessageAsync(
new ReceiveMessageRequest
{
QueueUrl = "yourQueueUrl",
MaxNumberOfMessages = maxNumberOfMessages,
WaitTimeSeconds = yourLongPollTimeout,
MessageAttributeNames = new List() { "X-Dash-TraceIdentifier" },
}, token);
return asyncMessage.Messages;
}
Important note (also applicable to other messaging platforms): If you read and handle messages in the background loop one by one (not several at once) and wait for the completion of each one, calling the DashExecutionContext.TrySetTraceIdentifier
with trace ID from metadata before message handling method with your business logic, then the DashExecutionContext.TraceIdentifier
value always lies in the same async context. That’s why in this case it’s essential to use the override option in the DashExecutionContext.TrySetTraceIdentifier
each time: it’s safe since only one message is processed at a time, so we don’t get a mess anyhow. Otherwise, the very first metadata trace ID will be used for all upcoming messages as well, which is wrong. But if you read and process your messages in batches, the simplest way is to add an intermediate async method where the DashExecutionContext.TrySetTraceIdentifier
is called and separate message from a batch is processed, so that you preserve an execution flow context isolation (and therefore trace ID) for each message separately. In this case, the override is not needed.
Microsoft Orleans
Microsoft Orleans provides its own execution flow context out of the box, so it’s extremely easy to pass metadata by means of the static RequestContext.Set(string key, object value)
method, and reading it in the receiver with a RequestContext.Get(string key)
. The behavior is similar to AsyncLocal we’ve already learned about; i.e., the original caller context always preserves the value that is projected to message receivers, and getting responses doesn’t imply any caller context metadata changes even if another value has been set on the other side.
But how can we efficiently interlink it with other contexts we use? The answer lies within Grain call filters. So, at first, we will add the outgoing filter so that the trace ID is set for calls to other Grains
(which is an actor definition in Orleans).
using DashDevs.Framework.ExecutionContext;
using Microsoft.AspNetCore.Http;
using Orleans;
using Orleans.Runtime;
using System;
using System.Threading.Tasks;
namespace DashDevs.Framework.Orleans.Filters
{
public class OutgoingGrainTracingFilter : IOutgoingGrainCallFilter
{
private const string TraceIdentifierKey = "X-Dash-TraceIdentifier";
private const string IngorePrefix = "Orleans.Runtime";
public async Task Invoke(IOutgoingGrainCallContext context)
{
if (context.Grain.GetType().FullName.StartsWith(IngorePrefix))
{
await context.Invoke();
return;
}
var traceId = DashExecutionContext.TraceIdentifier;
if (string.IsNullOrEmpty(traceId))
{
traceId = Guid.NewGuid().ToString();
}
RequestContext.Set(TraceIdentifierKey, traceId);
await context.Invoke();
}
}
}
By default, the framework is constantly sending numerous service messages between specific actors, so it’s mandatory to move them out of our filters because they’re not subjects for tracing. Thus, we’ve introduced an ignore prefix so that these messages aren’t processed.
Also, it’s worth mentioning that this filter is working for the pure client side, too. For example, if you’re calling an actor from the REST API controller by means of the Orleans cluster client, the trace ID will be passed from the REST API context further to the actors’ execution context and so on.
Then we’ll continue with an incoming filter, where we get the trace ID from RequestContext and initialize our DashExecutionContext with it. The ignore prefix is used there, too.
using DashDevs.Framework.ExecutionContext;
using Orleans;
using Orleans.Runtime;
using System.Threading.Tasks;
namespace DashDevs.Framework.Orleans.Filters
{
public class IncomingGrainTracingFilter : IIncomingGrainCallFilter
{
private const string TraceIdentifierKey = "X-Dash-TraceIdentifier";
private const string IngorePrefix = "Orleans.Runtime";
public async Task Invoke(IIncomingGrainCallContext context)
{
if (context.Grain.GetType().FullName.StartsWith(IngorePrefix))
{
await context.Invoke();
return;
}
DashExecutionContext.TrySetTraceIdentifier(RequestContext.Get(TraceIdentifierKey).ToString());
await context.Invoke();
}
}
}
Now let’s finish with our Silo (a Grain server definition in Orleans) host configuration to use the features we’ve already implemented, and we’re done here!
var siloHostBuilder = new SiloHostBuilder().
...
.AddOutgoingGrainCallFilter()
.AddIncomingGrainCallFilter()
...
Background Processing
Another piece of software you can use pretty often is a background jobs implementation. Here the concept itself prevents us from using a base data structure (which would look like an obvious workaround), and we’re going to review the features of Hangfire (the most famous background jobs software) which will help us to reach the goal of distributed tracing even for these kinds of execution units.
Hangfire
The feature which fits our goal most is the job filtering, implemented in the Attribute form. Thus, we need to define our own filtering attribute which will derive from the JobFilterAttribute, and implement the IClientFilter with IServerFilter interfaces.
From the client side, we can access our DashExecutionContext.TraceIdentifier
value, but not from the server. So, to be able to reach this value from the server context, we’ll pass our trace ID through the Job Parameter setting (worth mentioning that it’s not the parameter of a job method you write in your code, but a metadata handled by the framework).
With this knowledge, let’s define our job filter.
using DashDevs.Framework.ExecutionContext;
using Hangfire.Client;
using Hangfire.Common;
using Hangfire.Server;
using Hangfire.States;
using Serilog;
using System;
namespace DashDevs.Framework.Hangfire.Filters
{
public class TraceJobFilterAttribute : JobFilterAttribute, IClientFilter, IServerFilter
{
private const string TraceParameter = "TraceIdentifier";
public void OnCreating(CreatingContext filterContext)
{
var traceId = GetParentTraceIdentifier(filterContext);
if (string.IsNullOrEmpty(traceId))
{
traceId = DashExecutionContext.TraceIdentifier;
Log.Information($"{filterContext.Job.Type.Name} job {TraceParameter} parameter was not set in the parent job, "
+ "which means it's not a continuation");
}
if (string.IsNullOrEmpty(traceId))
{
traceId = Guid.NewGuid().ToString();
Log.Information($"{filterContext.Job.Type.Name} job {TraceParameter} parameter was not set in the {nameof(DashExecutionContext)} either. " +
"Generated a new one.");
}
filterContext.SetJobParameter(TraceParameter, traceId);
}
public void OnPerforming(PerformingContext filterContext)
{
var traceId = SerializationHelper.Deserialize(
filterContext.Connection.GetJobParameter(filterContext.BackgroundJob.Id, TraceParameter));
DashExecutionContext.TrySetTraceIdentifier(traceId!);
}
public void OnCreated(CreatedContext filterContext)
{
return;
}
public void OnPerformed(PerformedContext filterContext)
{
return;
}
private static string? GetParentTraceIdentifier(CreateContext filterContext)
{
if (!(filterContext.InitialState is AwaitingState awaitingState))
{
return null;
}
var traceId = SerializationHelper.Deserialize(
filterContext.Connection.GetJobParameter(awaitingState.ParentId, TraceParameter));
return traceId;
}
}
}
The specific case here is a continuation. If you don’t set the DashExecutionContext.TraceIdentifier
, enqueue a regular job, and then specify a continuation. Then your continuations will not get the trace ID of a parent job. But in case you do set the DashExecutionContext.TraceIdentifier
and then do the same, even though your continuations will share the same trace ID, in the particular case it may be considered as simple luck and a sort of coincidence, considering our job filter implementation and AsyncLocal principles. Thus, checking the parent is a must.
Now, the final step is to register it globally so that it’s applied to all the jobs.
GlobalJobFilters.Filters.Add(new TraceJobFilterAttribute());
Well, that’s it - your Hangfire jobs are now under control, too! By the way, you can compare this approach with the Correlate integration proposed by Hangfire docs.
Summary
In this article, we’ve tried to compose numerous practices and real-world examples for distributed tracing in .NET so that they can be used for most of the use cases in any software solution. We don’t cover automatic request/response and message logging directly here - it’s the simplest part of the story, so the implementation (i.e., if and where to add automatic request/response/message logging, and all other possible logs as well) should be made according to the specific needs. Also, in addition to tracing, this approach fits for any other data that you may need to pass across your system. As you can see, the DashExecutionContext
class, relying on AsyncLocal features, plays the key role in transferring the trace identifier between different communication instruments in the scope of a single service, so it’s crucial to understand how it works. Other interlink implementations depend on the features of each piece of software and should be carefully reviewed to craft the best solution possible, which can be automatically applied to all incoming and outgoing calls without modifications to existing code. Thank you for reading!
Published at DZone with permission of Ruslan Valis. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments