Pragmatic Tracing for Distributed Systems

Learn the essential facts of distributed tracing and how it enables monitoring of both monolithic and distributed systems.

Introduction

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -Leslie Lamport

The fact is, we're building bigger, more complex distributed systems. What used to be a single server serving old-style personal home pages has turned into blog posts mediated by medium.com and various other ecosystems. A simple bulletin board marketplace has turned into enormous systems like Amazon, eBay, and other online retailers. This evolution is akin to life itself: as we progressed from single-cell bacteria to the humans we are today, things got complicated. More powerful, but complicated.

Now, let's talk about your current system. Chances are, not everything is new. Even today, many systems still rely on numerical routines written in Fortran decades ago. Similarly, in your body, basic cell metabolism hasn't changed over the last 100 million years. With added complexity, though, some things did change. Aside from a superior feature set, e.g. sight, speech, and consciousness, we evolved advanced monitoring equipment. A single neuron acting as a proto-eye doesn't cut it, but a full-fledged eye does.

A vast nerve network almost instantaneously detects issues in your body and alerts the control center. Messages are sent and received via hormones on a gigantic message queue, the circulatory system, i.e. blood. That's far more complexity than a single-cell organism relying on osmosis and diffusion.

Therefore, your new systems cannot be monitored the way they were in a lower-complexity era. Logs alone aren't enough anymore, and neither are metrics, pprof, or similar tools. In this article, I'm presenting tracing as another tool in your toolbox, useful for specific use cases in both monolithic applications and distributed settings.

Observability Toolset

In this section, I'll briefly cover the basic observability tools at our disposal. I'll also use a football analogy for a clearer explanation. If you're already familiar with these tools, feel free to skip ahead to the last subsection: traces.

Logs

log - an official record of events during the voyage of a ship or aircraft.

This is the simplest and most basic tool at our disposal. One log represents some event happening with associated metadata.

The most common metadata is: when the event happened, which application generated it, on which host, and at what level.

Usually, log event levels are modeled after syslog levels, with DEBUG, INFO, WARNING, ERROR, and CRITICAL being the most used in software.

On the application and system level, you usually have log sources and log sinks. Each source generates log data and ships it to sinks. Each sink can apply some filtering, e.g. only ERROR level or higher. In practical terms, this might mean writing log level INFO or greater to stderr while dumping all logs to a file.

There are log management systems for gathering and searching logs. One of the simplest is a log file with grep, a more complex one is systemd's journal, and for production distributed systems you're usually using the ELK stack or Graylog.

In the football analogy, a player scoring a goal would be a great log event. Something like: msg="Ivan scored the goal" time="2018-01-12" game_minute="23"
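As a minimal sketch of what emitting such an event could look like in Go, using the standard library's log/slog package (Go 1.21+); the field names simply mirror the analogy:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // Sink: stderr, dropping everything below INFO.
    logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))

    logger.Debug("filtered out by the sink's level setting")
    logger.Info("Ivan scored the goal", "game_minute", 23)
}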

Log data is useful for unique, rare, or otherwise meaningful events. Frequent events need to be sampled down so they don't kill your log management system. For example, do you really want to log every time the server receives an IP packet?

Metrics

metric - a system or standard of measurement.

In an observability context, a metric is a scalar value that changes over time. This scalar value is usually a counter (e.g. number of goals scored, HTTP requests), a gauge (e.g. temperature, CPU utilization), a histogram (e.g. counts of 1xx, 2xx, 3xx, 4xx, 5xx HTTP response codes), or a rank estimation (e.g. 95th percentile response latency).

They are great for identifying bottlenecks or unusual behavior and for setting SLOs (service level objectives). Usually, alarms are tied to some metric; that is, whenever the specified metric goes outside a given bound, an action is performed, either auto-healing or notifying the operator.

For example, humans have a really important metric: blood glucose level. In a healthy human, too high a value triggers an auto-healing operation that releases more insulin into the bloodstream. Another human metric would be the pain level in your left leg. Usually, it's near 0, but over a certain threshold you're vividly aware of it; that is, an alarm is raised.

For computer systems, the usual metrics are related to throughput, error rate, latency, and resource (CPU/GPU/network) utilization. The usual systems mentioned are statsd, Graphite, Grafana, and Prometheus.
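As a rough sketch of how such a counter might be wired up with the Prometheus Go client (the metric name and handler here are illustrative, not taken from any particular system):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of handled HTTP requests.",
    },
    []string{"code"},
)

func main() {
    prometheus.MustRegister(httpRequests)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues("200").Inc()
        w.Write([]byte("ok"))
    })

    // Prometheus periodically scrapes this endpoint.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}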

Pprof & Flamegraphs

This tool is best for profiling CPU-intensive applications. Before I explain what it offers, I want to cover how it works: X times per second, the profiler stops your program and checks which line is currently being executed. It collects all execution samples into per-line/per-function buckets and later reconstructs what percentage of the time each function/line was executing. It's as if you snapshotted a football match every 10 seconds to see who has the ball and, from that, reconstructed ball possession percentages.
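In Go, for example, exposing these sampling profiles over HTTP takes only the standard library; a minimal sketch (port 6060 is just a convention):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Collect a 30-second CPU profile with:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    // The pprof web UI can then render it as a flamegraph.
    http.ListenAndServe("localhost:6060", nil)
}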

Traces

If you remember anything from this blog post, remember the image of a comet.

As a comet burns through Earth's atmosphere, it leaves a trace of its path. From this trace, we can deduce where it has been and how much time it spent there. The situation is similar within our programs.

A trace represents a single execution of a request/operation. The trace is composed of multiple spans. A span is a meaningful unit of work in your system, e.g. a database query, an RPC call, or a calculation. Spans can be related to each other, e.g. as parent/child. Thus, the trace forms a tree of spans or, more generally, a DAG (if you introduce follows-from relations and other gimmicks).

trace example

Each span can have useful metadata attached to it: either indexed key/value tags such as userId, http.statuscode, or hostname, or additional log data, e.g. the exact database query. The tracing backend provides the expected search capabilities: sorting by time, filtering by tags, etc.

In our football example, we could have a trace representing scoring a goal. It consists of three spans:

  • Ivan kicking the ball
  • The ball rolling to the goal
  • The ball entering the goal

Common Use Cases

In this section, I'm going to cover the top use cases for tracing. Unlike sampling profilers such as pprof, tracing can also show when your program was put to sleep due to I/O waits, resource contention, or other reasons.

Overall Request Overview

Tracing allows you to make a case study out of an individual trace. Metrics aggregate, while a trace focuses the story on an individual request. What services did this request touch? How much time did it spend in each? What's the time breakdown across CPU, GPU, network calls, etc.?

If you're debugging your application, searching through your traces for a specific edge case and analyzing that one is golden; the same goes for analyzing performance outliers.

Big Slow Bottleneck

When you have an obvious bottleneck, you know where you need to drill down. You’ve narrowed down your search space to this particular operation.

The causes of a big, slow bottleneck vary. Perhaps you're overusing locks and the program is waiting on one. Or a database query is underoptimized or missing an index. Or maybe your algorithm worked for 10 users, but after growing to 10,000 users, it's just too slow.

Find it, observe it, analyze it, and fix it.

Fanout

Fanout is the number of outgoing requests a service makes for each incoming request. The bigger the fanout, the bigger the latency. Sometimes fanout is deliberate and useful; in this context, however, we're primarily talking about calling the same service over and over again where a bulk operation would be more suitable.

It’s the difference between:

SELECT id, name
FROM users
WHERE ...

versus looping over some id list and querying for each id:

SELECT name
FROM users
WHERE id = $1 AND ...

This can happen inadvertently, e.g. when an ORM framework decides to do it for you. You've fetched your ids and are now happily looping over them, not knowing you're issuing a new database query on each iteration.
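As a sketch of the bulk alternative on the application side, assuming Postgres with the github.com/lib/pq driver and the users table from the queries above:

import (
    "context"
    "database/sql"

    "github.com/lib/pq"
)

// fetchUserNames issues one round trip for all ids instead of one query per id.
func fetchUserNames(ctx context.Context, db *sql.DB, ids []int64) (map[int64]string, error) {
    rows, err := db.QueryContext(ctx,
        `SELECT id, name FROM users WHERE id = ANY($1)`, pq.Array(ids))
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    names := make(map[int64]string, len(ids))
    for rows.Next() {
        var id int64
        var name string
        if err := rows.Scan(&id, &name); err != nil {
            return nil, err
        }
        names[id] = name
    }
    return names, rows.Err()
}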

Other times, it's an issue of API design. If you have an internal API endpoint such as /api/v1/user/{id} for fetching user data but need a bulk export, you end up with the same issue.

In the trace, you will see many small requests to the same service. Even if they are issued in parallel (and they aren't necessarily), you will hit a tail latency problem.

The probability that the calling service avoids the downstream service's p-th percentile latency on every call drops exponentially with the fanout degree: with a fanout of N, it is roughly (p/100)^N.
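Here's a quick back-of-the-envelope calculation in Go illustrating the effect, assuming the downstream latencies are independent:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Probability that *every* downstream call stays under the 99th
    // percentile latency, for increasing fanout degrees.
    p := 0.99
    for _, fanout := range []int{1, 10, 50, 100} {
        fmt.Printf("fanout %3d: %.1f%%\n", fanout, 100*math.Pow(p, float64(fanout)))
    }
    // Output:
    // fanout   1: 99.0%
    // fanout  10: 90.4%
    // fanout  50: 60.5%
    // fanout 100: 36.6%
}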

Chrome Tracing Format

This is a simple JSON format specification for single-process tracing. The catapult project includes a renderer for this specification; the same renderer is available in Chrome at the chrome://tracing/ URL. Various tools can emit this format, e.g. TensorFlow execution, Golang tracing, Chrome's own rendering, etc. It's easy to include in your application if you don't need distributed tracing and your requirements are simple.

For example, take this simple file (the "ph" field is the event phase, "B" for begin and "E" for end, and "ts" is a timestamp in microseconds):

[
  {
    "name": "Asub",
    "cat": "PERF",
    "ph": "B",
    "pid": 22630,
    "tid": 22630,
    "ts": 829
  },
  {
    "name": "Asub",
    "cat": "PERF",
    "ph": "E",
    "pid": 22630,
    "tid": 22630,
    "ts": 833
  }
]

Rendered in the chrome://tracing viewer, this appears as a single "Asub" slice lasting 4 microseconds.

If you're interested in more, I recommend at least skimming the specification. The biggest downside is working with the visualizer. As a newcomer, I had a hard time figuring out how to filter data points by category or by name, and the advanced use cases beyond basic scroll-and-see in general.
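If your requirements really are that simple, emitting the format yourself is straightforward. Here's a minimal sketch in Go (the struct and output file name are mine, not part of the specification):

package main

import (
    "encoding/json"
    "os"
    "time"
)

// traceEvent mirrors the fields from the JSON example above.
type traceEvent struct {
    Name string `json:"name"`
    Cat  string `json:"cat"`
    Ph   string `json:"ph"` // "B" = begin, "E" = end
    Pid  int    `json:"pid"`
    Tid  int    `json:"tid"`
    Ts   int64  `json:"ts"` // microseconds
}

func main() {
    pid := os.Getpid()
    start := time.Now()

    // ... the work you want to trace would run here ...

    events := []traceEvent{
        {Name: "Asub", Cat: "PERF", Ph: "B", Pid: pid, Tid: pid, Ts: start.UnixMicro()},
        {Name: "Asub", Cat: "PERF", Ph: "E", Pid: pid, Tid: pid, Ts: time.Now().UnixMicro()},
    }

    // Load trace.json via chrome://tracing to inspect it.
    f, err := os.Create("trace.json")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    json.NewEncoder(f).Encode(events)
}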

Distributed Tracing

All’s fine and dandy on a single node, but the trouble starts with distributed systems. The problem is how to connect/correlate traces coming from multiple nodes. Which spans belong to which trace and how are those spans related?

Today, most solutions take notes from Google's Dapper. Each trace has its own unique trace ID and each span a unique span ID, which are propagated across node boundaries.

There are multiple ideas about how this context should be propagated and whether to include additional data along with it (i.e. baggage). Also, each backend has its own ideas about how you should deliver trace/span data to it.
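As an illustration of what such propagation looks like on the wire, the widely used B3 propagation format (originating from Zipkin) carries the context in HTTP headers roughly like this; the hex values are just placeholders:

X-B3-TraceId: 80f198ee56343ba864fe8b2a57d3eff7
X-B3-SpanId: e457b5a2e4d86bd1
X-B3-ParentSpanId: 05e3ac9a4f6e3b90
X-B3-Sampled: 1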

The first available backend was Zipkin. Nowadays, Uber's Jaeger is a CNCF incubating project and a good place to start. Various cloud providers also have their own SaaS tracing offerings (Google's Stackdriver, AWS X-Ray, etc.).

The Jaeger frontend lets you search for and inspect your traces.

Since writing against a specific backend could mean hard vendor lock-in, two vendor-neutral client APIs have emerged: OpenTracing and OpenCensus.

Client-Side Vendor-Neutral APIs

In this subsection, I'm going to compare the OpenCensus and OpenTracing standards. Both are evolving projects with high GitHub star counts, multi-language support, and various middleware implementations for databases, HTTP, gRPC, etc.

OpenTracing is currently a CNCF incubating project while OpenCensus emerged from Google’s internal trace tool. Nowadays, OpenCensus has its own vendor-neutral organization.

As for the feature set, OpenCensus includes metrics in its API, while OpenTracing covers tracing only.

In OpenCensus, you add trace exporters to a global set, while in OpenTracing, you have to configure them on each tracer instance. OpenTracing does have a global tracer concept, but at least in Go it isn't used by default; you have to invoke it explicitly. Furthermore, the OpenTracing API feels clunkier and less polished compared to OpenCensus.

For the propagation format, OpenCensus specifies a standard. OpenTracing, on the other hand, only specifies the API, and each supported backend must implement the propagation itself.

What OpenTracing does have is a baggage concept, which lets you propagate some data to each downstream service alongside the span context.

OpenCensus Example

This subsection describes the basic OpenCensus building blocks. For complete examples, see the official documentation; it's quite good and features great examples. My examples are in Go, but multiple other languages are supported. For brevity, import statements are omitted.

First, we start by defining the trace exporter:

j, err := jaeger.NewExporter(jaeger.Options{
        Endpoint:    "http://localhost:14268",
        ServiceName: "opencensus-tracing",
})
if err != nil {
        log.Fatal(err) // fail fast if the exporter can't be created
}
trace.RegisterExporter(j)

We can easily start Jaeger with Docker, using the command from its getting started page:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.7

Then we start some span:

ctx, span := trace.StartSpan(ctx, "some/useful/name")
defer span.End()

The ctx is an instance of the standard library's context.Context.

We can attach indexed key/value metadata (tags) to this span:

span.AddAttributes(trace.StringAttribute("key", "value"))

OpenTracing actually specifies some standardized keys (e.g. error, http.status_code, …). I recommend using them if you can, even though OpenCensus doesn't specify any.

Or we can add some log data, that is, annotations:

span.Annotate(nil, "some useful annotation")
span.Annotate(
    []trace.Attribute{trace.BoolAttribute("key", true)},
    "some useful log data",
)

Since we've decided to use HTTP, let's inject the OpenCensus middleware:

client = &http.Client{Transport: &ochttp.Transport{}}  // client
http.ListenAndServe(addr, &ochttp.Handler{Handler: handler}) //server

For the client side, we only have to include the right context in our request:

req = req.WithContext(ctx)
resp, err := client.Do(req)

On the server side, we grab the context from the request and go on with our life.

func HandleXXX(w http.ResponseWriter, req *http.Request) {
    ctx := req.Context()
    // ... start spans from ctx as shown above
}
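One practical note, assuming the Go implementation: OpenCensus samples only a fraction of traces by default, so your spans may not show up in Jaeger while experimenting locally. For experiments (not high-traffic production), you can record everything:

trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})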

Consider this section an OpenCensus crash course/cheat sheet.

Conclusion

To summarize this post, I recommend tracing most for these use cases:

  • Getting an overall overview of an individual request's performance

  • Detecting big, slow operations

  • Detecting fanout

I presented various tools, such as the Chrome tracing format, OpenTracing, and OpenCensus. Of those, I recommend starting with OpenCensus and running Jaeger in a Docker container. Use middleware wherever possible.

Finally, reap the benefits! See how tracing helps you and how you can best leverage it for your success.
