OpenTracing in NodeJS, Go, Python: What, Why, How?
Learn how to better use OpenTracing with NodeJS, Go, and Python.
In previous blogs, we described how to optimize the deployment of applications and put guardrails around them. These guardrails covered:
- Continuous security model — showcasing how to manage security in a pipeline with constant feedback.
- Application security — detailing how to support verification checks on images.
- Budget and resource management — managing project spend and resource utilization.
One additional guardrail in managing your application is to properly implement observability. Observability is more important than ever because of the shift to microservices architectures and the increased deployment pace (hourly/weekly vs. quarterly/yearly). Services are dynamically updated and are usually containerized, so the traditional way of bolting on monitoring after deployment does not scale.
Observability includes three major pillars:
- Logging — The fact that a log is just a string or a blob of JSON or typed key-value pairs makes it easy to represent any data in the form of a log line.
- Metrics — Metrics are a numeric representation of data measured over intervals of time.
- Distributed Tracing — A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
In this blog, we will focus on how to implement distributed tracing in your application. More specifically, we will review how to add the right code "bits" into NodeJS, Go, and Python.
Distributed Tracing
Modern applications, especially those with a microservices architecture, use many interconnected services. This makes troubleshooting harder because of the increase in the number of points of failure.
Let's look at an example of a ride-sharing app and the possible series of events that occur from interacting with the app.
As noted in the image above, several sequences and actions take place:
On Passenger's Phone:
- A passenger requests a ride from the app.
- The ride-sharing app finds a driver.
- A notification of the driver's arrival is sent.
On Driver's Phone:
- The passenger sits in the car.
- Time-to-travel is estimated and tracked for pricing.
On Ride-sharing Platform's Backend:
- The total cost is calculated.
- A credit card is authorized.
This end-to-end request tracking across various systems and services is called Distributed Tracing. A distributed trace is composed of multiple spans, and each span carries context (and optionally baggage items) that helps in debugging and troubleshooting issues.
These traces are different from logs and metrics.
- Logs are human-readable and are scoped to individual services. They are meant to capture every single event and add as much information as possible.
- Metrics are collected at a fixed interval of time, known as the resolution. They are usually used to track the health of a service or system. Metrics normalize the data so that repeated data is not logged or stored; the only thing stored is a numeric value and a timestamp. This kind of data is usually visualized with a time-series graph, and various mathematical algorithms can be applied to analyze it.
The goal of Distributed Tracing is to pinpoint failures and causes of poor performance across various system boundaries. It enables users to pinpoint the service, or more granularly, the specific request path, that is causing the issue. You can also attach logs and tags to traces to further reduce noise while debugging.
Implementing Distributed Tracing — Architecture and Concepts
There is a wide range of frameworks, libraries, and tools out there, both open-source and commercial, that try to solve issues related to Distributed Tracing.
Some of the popular tracing tools are Zipkin and Jaeger (open-source) and Wavefront, Datadog, and Lightstep (commercial). All of these have an agent (running alongside services or as a sidecar), a collector, and a UI.
Using any one of these tools and their specific client libraries results in vendor lock-in, and it makes developers' lives harder because not all vendors have the same support across different frameworks and libraries.
To solve this problem, the OpenTracing and OpenCensus projects were started. They provide vendor-neutral APIs and specifications that frameworks and libraries can implement. This enables developers to add instrumentation to their application code without locking themselves into any particular vendor. This loose coupling, along with easy-to-use interfaces, makes these two projects very attractive.
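To make this decoupling concrete, here is a minimal NodeJS sketch of our own (not from any vendor's docs): application code talks only to the vendor-neutral opentracing API, and a concrete tracer is registered once at startup (by default, opentracing installs a no-op tracer).
const opentracing = require('opentracing');
// At startup, any OpenTracing-compatible tracer (Jaeger, Zipkin, ...) can be
// registered once:
// opentracing.initGlobalTracer(someCompatibleTracer);

// Application code only ever uses the vendor-neutral API:
const span = opentracing.globalTracer().startSpan('checkout');
span.setTag('cart.items', 3);
span.log({ event: 'payment_authorized' });
span.finish();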
Architecture
Let's look at a high-level architecture of implementing Tracing.
OpenTracing libraries and OpenTracing-compatible client libraries are used to instrument each service. Every service can then send traces to an agent.
The agent is responsible for collecting traces from a particular service. The agent then forwards them to the collector, where spans from different services are stitched together based on the trace ID.
The collector database in most cases is either Elasticsearch or Cassandra. These traces are then displayed on a UI.
Concepts and Terminology
1. Span — The span is the building block of a trace. In general, you need at least one span created for a particular request path within your service. A span contains the following information:
a. Operation Name — Identifier for the operation being performed when the span starts.
b. Start and Finish Timestamps
c. Key-Value Span Tags — These are user-defined key-value pairs that make querying or filtering data easier. They should apply across the entire span (i.e. http.status_code, user.id, db.instance, etc.).
d. Key-Value Span Logs — These are useful for capturing logging messages. They should capture only a specific event within the span (use tags for data that applies across the whole span).
e. Span Context — This is one of the most important pieces of state in a span. The Span Context carries the data across process boundaries. It has two major components:
i. The traceID and spanID
ii. Baggage items — key-value pairs, as mentioned above, that travel with the context.
2. Trace — A trace crosses the boundaries of processes and tracks a request end-to-end.
3. Tracer — The actual implementation that records the spans. This can be Jaeger, Zipkin, Wavefront, or Datadog.
Once the tracer is instantiated, it can be used to create new spans.
4. Inject/Extract — These are two of the most important methods available in any tracing specification:
i. inject — This adds the span context into a request that is carried across process boundaries.
ii. extract — This obtains the Span Context from an incoming request. The context can be used to create a new span that is a ChildOf the calling span, or one that FollowsFrom it. If no context is found, a new root span is started.
It's up to the developers to decide whether a ChildOf or FollowsFrom reference makes more sense based on their application architecture; the sketch below shows both forms.
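Here is a brief NodeJS sketch of our own showing both reference types (it assumes a parentSpan already exists):
const opentracing = require('opentracing');
const tracer = opentracing.globalTracer();

// ChildOf: the parent waits on this work (e.g. a blocking RPC)
const child = tracer.startSpan('db_get_user', { childOf: parentSpan });
child.finish();

// FollowsFrom: the parent does not wait on this work (e.g. fire-and-forget)
const followup = tracer.startSpan('send_notification', {
  references: [opentracing.followsFrom(parentSpan.context())]
});
followup.finish();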
Let's look at how we would implement this in NodeJS, Go, and Python. I will be using the AcmeShop app to demonstrate. It's a polyglot application that mimics an e-commerce site.
It consists of various services, such as the front-end, authentication (user), catalog, cart, order, shipping, and payment.
Let's say the end-user experiences poor performance. The job of the developer and operations team is to figure out the exact cause. This is where tracing helps. Instead of randomly scraping through the logs of hundreds of services, tracing provides a starting point for the debugging process. It provides the latency of every request that originates from the front-end, all the way to the back-end services and databases.
By analyzing this call flow, one can pinpoint which service(s) caused the user's issue. From there, you can go to those services and look at their logs and other stats to figure out the actual problem with the application. We will discuss the troubleshooting techniques in the next blog.
The requests, in this case, originate from the front-end.
Tracing From the Front-End (NodeJS) to the User Service (Go), Then to the Cart Service (Python)
Initialize the Tracer in NodeJS
This is the first step. We will be using Jaeger as the tracer. The following parameters are very useful in the configuration:
- Sampler — The sampler makes the sampling decision for traces. Sampling is done for the following reasons:
i. Reduce the cost and size of data collection.
ii. Reduce the size of transferred data.
There are various sampling strategies:
1. Constant — Samples all (param = 1) or none (param = 0) of the traces.
2. Probabilistic — Makes the sampling decision with probability equal to param, where param is between zero and one. If param equals 0.5, roughly five in ten traces are sampled.
3. Rate Limiting — Samples at a fixed rate. If param equals five, at most five traces are sampled per second.
4. Remote — The strategy is either picked up from a central configuration or driven by adaptive sampling (individual request paths can be sampled with the different strategies mentioned above).
In the example below, we use the "constant" sampling strategy with param equal to one, which samples all traces.
- Reporter — This is where trace data is sent. It can go to an agent, as shown in the Jaeger architecture, or directly to the trace collector. The Jaeger clients send spans to the agent over UDP by default; sending directly to the collector switches to HTTP over TCP.
const opentracing = require('opentracing');
var util = require('util');
// Uncomment the lines below to send traces directly to the collector.
// const tracerIP = process.env.TRACER_HOST
// const tracerPort = process.env.TRACER_PORT || 14268
const agentIP = process.env.JAEGER_AGENT_HOST;
const agentPORT = process.env.JAEGER_AGENT_PORT || 6831;
// Jaeger
var jaeger = require('jaeger-client');
var config = {
  serviceName: 'front-end',
  sampler: {
    type: 'const',
    param: 1,
  },
  reporter: {
    logSpans: true,
    // Uncomment the line below to send traces directly to the collector
    // collectorEndpoint: util.format('http://%s:%d/api/traces', tracerIP, tracerPort),
    agentHost: agentIP,
    agentPort: agentPORT
  },
};
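The other sampling strategies described earlier differ only in the sampler block. As a sketch (the type strings follow the standard Jaeger client conventions):
// Probabilistic: sample roughly 50% of traces
sampler: { type: 'probabilistic', param: 0.5 },

// Rate limiting: sample at most five traces per second
sampler: { type: 'ratelimiting', param: 5 },

// Remote: fetch the sampling strategy from the agent at runtime
sampler: { type: 'remote', param: 1 },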
Start Tracer in Front-End Service (NodeJS)
This method needs to be invoked from the main function.
function initGlobalTracer() {
  console.log("Tracer Configuration " + config.serviceName);
  var options = {}; // additional tracer options (logger, metrics) can go here
  var tracer = jaeger.initTracer(config, options);
  opentracing.initGlobalTracer(tracer); // register as the global tracer
}
module.exports = {
  initGlobalTracer
}
Note that the global tracer is registered through the OpenTracing API rather than through Jaeger directly, which keeps the rest of the code vendor-neutral.
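As a hypothetical wiring example (the file names are our assumption), the service entry point would call it like this:
// server.js (hypothetical), assuming the code above lives in tracing.js
const tracing = require('./tracing');
tracing.initGlobalTracer(); // must run before any spans are started
// ... then set up the express app and routes as usual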
Start Span in Front-End Service (NodeJS)
Now, let's look at the instrumented code.
var request = require('request');
var express = require("express");
var async = require("async");
var http = require('http');
const opentracing = require('opentracing');
const { Tags, FORMAT_HTTP_HEADERS } = require('opentracing');

var app = express();

app.get("/users/:id", function(req, res, next) {
  // Section 1
  const userSpan = opentracing.globalTracer().startSpan('get_user', {
    tags: {
      'http.method': 'GET',
      'http.url': `http://${req.headers.host}${req.url}`,
      'user.id': req.params.id
    }
  });
  // Section 2
  opentracing.globalTracer().inject(userSpan.context(), FORMAT_HTTP_HEADERS, req.headers);
  // @todo - Replace the uri parameter
  // uri - needs to be created dynamically. This is only for local testing
  var options = {
    uri: endpoints.usersUrl + "/users/" + req.params.id,
    method: 'GET',
    json: true,
    headers: req.headers
  };
  // Leverages the request library
  request(options, function(error, response, body) {
    if (error) {
      return next(error);
    }
    if (response.statusCode == 200) {
      console.log('printing from within request');
      res.writeHead(200);
      res.write(JSON.stringify(body));
      // Section 3
      userSpan.setTag(opentracing.Tags.HTTP_STATUS_CODE, response.statusCode);
      userSpan.log({
        'event': 'request_end'
      });
      userSpan.finish();
      res.end();
    } else {
      res.status(response.statusCode);
      res.write(JSON.stringify(response.statusMessage));
      // Section 4
      userSpan.setTag(opentracing.Tags.HTTP_STATUS_CODE, response.statusCode);
      userSpan.setTag(opentracing.Tags.ERROR, true);
      userSpan.log({
        'event': 'error',
        'message': response.statusMessage.toString()
      });
      userSpan.log({
        'event': 'request_end'
      });
      userSpan.finish();
      res.end();
    }
  }); // end of request
}); // end of method
In the first code block, we start a span and give it an operation name; in this case, it is get_user. We also attach a few metadata tags, such as the HTTP method, the URL, and the object ID (user.id).
Once the span is created, we need to inject this information into the RPC made to other services.
In the second code block, we make an HTTP request to the user service. The request is made with a few options (URI, HTTP method, and headers).
It looks like this:
var options = {
  uri: endpoints.usersUrl + "/users/" + req.params.id,
  method: 'GET',
  json: true,
  headers: req.headers
};
The header should be the new header with the span context injected into it. This is very important because it carries the trace ID. Since we are using Jaeger as the tracer, the trace ID header field will be uber-trace-id. If you use Zipkin, the field will be x-b3-traceid.
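As a quick sanity check (our own sketch, not part of the service code), you can log the injected header right after the inject call in the second code block to confirm the context is being propagated:
// Right after the inject() call:
console.log(req.headers['uber-trace-id']);
// prints something like "4bf92f35770e9fd6:6e0c63257de34c92:0:1"
// (format: trace-id:span-id:parent-span-id:flags)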
In the third and fourth code blocks, tags and log fields are added. These provide additional information when debugging an issue. A useful tag is HTTP_STATUS_CODE, which records the return code from the service. You may also indicate an error by using userSpan.setTag(opentracing.Tags.ERROR, true);
Messages can also be logged along with the span by using:
userSpan.log({
  'event': 'error',
  'message': response.statusMessage.toString()
});
Attaching additional data to spans increases network and storage costs, so it's important to add only the information that is necessary. Complete logs and metrics should be handled separately from the trace.
Now, let's look at what happens when the call is received by the User service.
Initialize Tracer in User Service (Go)
First, we need to initialize the tracer for this service as well. It is similar to the NodeJS example.
func initJaeger(service string) (opentracing.Tracer, io.Closer) {
    // Uncomment the lines below only if sending traces directly to the collector
    // tracerIP := GetEnv("TRACER_HOST", "localhost")
    // tracerPort := GetEnv("TRACER_PORT", "14268")
    agentIP := GetEnv("JAEGER_AGENT_HOST", "localhost")
    agentPort := GetEnv("JAEGER_AGENT_PORT", "6831")
    logger.Infof("Sending Traces to %s %s", agentIP, agentPort)
    cfg := &jaegercfg.Configuration{
        Sampler: &jaegercfg.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &jaegercfg.ReporterConfig{
            LogSpans:           true,
            LocalAgentHostPort: agentIP + ":" + agentPort,
            // Uncomment the line below only if sending traces directly to the collector
            // CollectorEndpoint: "http://" + tracerIP + ":" + tracerPort + "/api/traces",
        },
    }
    // The jaeger-client-go config package is imported as jaegercfg
    tracer, closer, err := cfg.New(service, jaegercfg.Logger(jaeger.StdLogger))
    if err != nil {
        panic(fmt.Sprintf("ERROR: cannot init Jaeger: %v\n", err))
    }
    return tracer, closer
}

// Set the tracer as the global tracer
func main() {
    // Start Tracer
    tracer, closer := initJaeger("user")
    stdopentracing.SetGlobalTracer(tracer)
    // Flush and stop the tracer when main exits
    defer closer.Close()
    // Start the server
    handleRequest()
}
Extract the Context and Start Span in User Service (Go)
Once the request is received by the service, it is redirected to the GetUser function, as shown below.
GetUser accepts the gin context (which carries the request headers and the user ID path parameter) and returns the user info.
func GetUser(c *gin.Context) {
    var user UserResponse
    tracer := stdopentracing.GlobalTracer()
    userSpanCtx, _ := tracer.Extract(stdopentracing.HTTPHeaders, stdopentracing.HTTPHeadersCarrier(c.Request.Header))
    userSpan := tracer.StartSpan("db_get_user", stdopentracing.ChildOf(userSpanCtx))
    defer userSpan.Finish()
    userID := c.Param("id")
    userSpan.LogFields(
        tracelog.String("event", "string-format"),
        tracelog.String("user.id", userID),
    )
    if bson.IsObjectIdHex(userID) {
        err := collection.FindId(bson.ObjectIdHex(userID)).One(&user)
        if err != nil {
            message := "User " + err.Error()
            userSpan.LogFields(
                tracelog.String("event", "error"),
                tracelog.String("message", err.Error()),
            )
            userSpan.SetTag("http.status_code", http.StatusNotFound)
            c.JSON(http.StatusNotFound, gin.H{"status": http.StatusNotFound, "message": message})
            return
        }
    } else {
        message := "Incorrect Format for UserID"
        userSpan.LogFields(
            tracelog.String("event", "error"),
            tracelog.String("message", message),
        )
        userSpan.SetTag("http.status_code", http.StatusBadRequest)
        c.JSON(http.StatusBadRequest, gin.H{"status": http.StatusBadRequest, "message": message})
        return
    }
    userSpan.SetTag("http.status_code", http.StatusOK)
    c.JSON(http.StatusOK, gin.H{"status": http.StatusOK, "data": user})
}
Here, the span context is extracted by using:
userSpanCtx, _ := tracer.Extract(stdopentracing.HTTPHeaders, stdopentracing.HTTPHeadersCarrier(c.Request.Header))
This provides the trace ID from the parent. It is then used to start a new span with the identifier db_get_user. When the request is completed, we see the following in the Jaeger UI.
We have added logs and tags to the span for deeper context.
If we add tracing from the front-end service to the cart service (Python), the steps remain the same as for the front-end to the user service.
Since the front-end code looks the same, let's look at the cart service setup.
Initialize the Tracer in Cart Service (Python)
This is similar to the initialize method in the Go service. The variables jaegerhost and jaegerport are obtained from environment variables.
import os
import logging
from jaeger_client import Config

jaegerhost = os.environ['TRACER_HOST']
jaegerport = os.environ['TRACER_PORT']

def init_tracer(service):
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': jaegerhost,
                'reporting_port': jaegerport,
            },
            'logging': True,
        },
        service_name='cart',
        validate=True,
    )
    return config.initialize_tracer()
Extract the Context and Start Span in Cart Service (Python)
Here, span_ctx = cart_tracer.extract(Format.HTTP_HEADERS, request.headers) is used to extract the context.
We then use start_span with the context obtained above as the parent. We can also set additional tags, if needed.
@app.route('/cart/items/<userid>', methods=['GET'])
def getCartItems(userid):
    span_ctx = cart_tracer.extract(Format.HTTP_HEADERS, request.headers)
    functionName = '/cart/items/'
    returnValue = '200'
    # app.logger.info('the derived span context from frontend is: %s', str(span_ctx))
    with cart_tracer.start_span(functionName, child_of=span_ctx) as span:
        span.set_tag("call", functionName + userid)
        app.logger.info('getting all items on cart')
        PPTable = getitems(userid, span)
        if PPTable:
            packed_data = jsonify({"userid": userid, "cart": PPTable})
        else:
            app.logger.info('no items in cart found for %s', userid)
            output_message = "no cart found for " + userid
            packed_data = jsonify({"userid": userid, "cart": PPTable})
            returnValue = '204'
    return (packed_data, returnValue)
Once the code has been instrumented, you may deploy the application on your preferred infrastructure and test it out. For the AcmeShop app, we have created deployments for both Kubernetes and docker-compose.
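To experiment locally (our suggestion, separate from the AcmeShop deployment manifests), the Jaeger all-in-one image bundles the agent, collector, and UI in a single container:
docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
Port 6831/udp receives spans from the clients shown above, and the UI is served on http://localhost:16686.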
What does the output look like?
You will start noticing traces in the Jaeger UI as shown below.
Jaeger is great not only because it's open-source and easy to use, but also because it's fairly flexible and forgiving with tags, quickly giving you an idea of what your traces will look like.
However, as expected, commercial tools like Wavefront, Datadog, and Lightstep (which all have free trials) offer extra value-added features. As an example, here is how the output looks in Wavefront using Jaeger agents (essentially the same code as in this article, but pushing the output to Wavefront): you see a full service graph, metrics, and history.
NOTE: In the end, if you use OpenTracing with Jaeger agents (as described in this article), you can pick essentially ANY tracing UI you need, open-source or commercial.
Tracing Is Difficult
It's difficult to get tracing correct, especially when adding it to existing applications. This is because:
- Tracing requires the application to be instrumented (i.e. developers have to write code within their services).
- The tracing code needs to be written for every single request path across all the services. This requires working knowledge of the entire application and some coordination.
- Every request now needs to carry additional information, which means requests may have to be modified depending on how the application is structured.
- Complexity increases with the use of different frameworks and programming languages within the application. There might be inconsistencies with some of the interfaces.
Conclusion
Jaeger and the tools around it are pretty new; Jaeger was open-sourced back in 2017. CNCF has since started a new project called OpenTelemetry that is meant to combine OpenTracing and OpenCensus. Keep an eye out for it.
The configuration of Jaeger in a Kubernetes environment is a separate topic that we will cover in another blog. In the end, with OpenTracing libraries, these traces can be collected in Jaeger or Zipkin and then visualized and analyzed with any open-source or commercial tool you choose.