Plumbr is designed to monitor the end user experience. Plumbr instruments the application bytecode during deployment, adding tracing code to the endpoints published by the application. Whenever calls arrive at these endpoints, Plumbr gathers and analyzes data from them.
This concept started to break down in 2014, as the adoption of both the cloud and microservice-based architectures gained momentum. As a result, existing monoliths were being replaced with microservices deployed and dynamically scaled in the cloud:
The deployment of distributed and dynamically provisioned architectures has created a situation in which tracing the end user experience within individual nodes no longer exposes the entire user experience. Each node in the architecture can:

- Be responsible for servicing only a specific span in the context of the end-user interaction.
- Have multiple (dynamically spawned) copies of itself acting as a cluster.
So it was only natural that we followed the path the industry is taking and added support for such deployment models. As a result, we ended up building something called distributed tracing.
This post is the first in a series describing how we built support for distributed tracing and which obstacles we had to tackle. In this post, we cover the concept of distributed tracing in general and demonstrate two key pillars our solution was built upon. To some readers, the concepts might be familiar; the key concepts we applied were largely inspired by the research behind Google Dapper, so if you have looked into that project, you might recognize familiar aspects.
What Is Distributed Tracing?
Distributed tracing is the concept of tracking a single user interaction with the application throughout an architecture deployed on multiple nodes. Capturing such traces allows us to use the individual elements to build a view of the entire chain of calls behind the user interaction. This perspective lets us see how different nodes interact with each other, linking the root causes of poor performance to a single user interaction:
The next sections describe our approach to building a solution for distributed tracing.
The first problem we needed to solve was understanding whether events in particular nodes are linked to the same user interaction at all. The answer was to generate a universally unique identifier in the first node accepting the interaction and pass it along to the other nodes as call metadata.
The UUIDs (d19931bb-f235-4dcb-2e2f-b9d31225d62e, as an example) are attached to the data each node sends to the Plumbr Server. Having this information allows us to assemble all the individual spans together, resulting in a distributed trace similar to the example above.
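The assembly step on the server side can be sketched roughly as follows. The `Span` record and the method names here are illustrative assumptions, not Plumbr's actual data model; the point is simply that spans sharing a transaction id get grouped back into one distributed trace:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (hypothetical names, not Plumbr's internal model): each node
// reports spans tagged with the shared transaction id; the server groups them
// back into one distributed trace per id.
public class TraceAssembly {

    // A span as reported by a single node.
    public record Span(String transactionId, String node, long durationMs) {}

    // Group incoming spans by transaction id to rebuild each distributed trace.
    public static Map<String, List<Span>> assemble(List<Span> spans) {
        Map<String, List<Span>> traces = new HashMap<>();
        for (Span s : spans) {
            traces.computeIfAbsent(s.transactionId(), id -> new ArrayList<>()).add(s);
        }
        return traces;
    }
}
```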
The generated UUID needs to be passed along with each call to a remote node. As changing the contract between the nodes participating in the transaction was something we could not do, we had to find other means to inject the metadata into the call. The solution ended up being protocol-specific, with the first implementation relying on the HTTP protocol.
The method we used for passing along the UUID involved injecting our own custom HTTP header into the downstream calls. So, all HTTP requests departing a node monitored by Plumbr would, in addition to the existing headers (Accept, Accept-Encoding, etc.), include our custom header carrying the UUID:
```
Accept: text/html,application/xhtml+xml
Accept-Encoding: gzip, deflate, sdch, br
...
X-Plumbr-TransactionId: d19931bb-f235-4dcb-2e2f-b9d31225d62e
```
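In code, the injection amounts to adding one more header to the outgoing request before it leaves the node. The sketch below is illustrative (only the header name comes from the example above; the helper itself is not Plumbr's agent code) and models the headers as a plain map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch (illustrative, not Plumbr's agent code): before a downstream
// HTTP call leaves the node, the tracing header is added next to whatever
// headers the application already set.
public class HeaderInjection {

    static final String TRANSACTION_HEADER = "X-Plumbr-TransactionId";

    // Returns a copy of the outgoing headers with the tracing header added,
    // leaving the application's original headers untouched.
    public static Map<String, String> withTransactionId(Map<String, String> headers, String uuid) {
        Map<String, String> out = new LinkedHashMap<>(headers);
        out.put(TRANSACTION_HEADER, uuid);
        return out;
    }
}
```

In a real agent this happens transparently via bytecode instrumentation of the HTTP client, so the application never sees or manages the header itself.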
This allowed us to perform a simple check:
- If the header was not present in the request, we were dealing with a new interaction, so we needed to generate a new UUID to use in this node and to pass along to downstream calls.
- If the header was present in the request, we were dealing with an interaction that arrived at our system via some other node. In this case, we should not generate a new UUID. Instead, we should join the ongoing interaction and make sure the downstream calls from this node also carry this UUID as a header.
I hope this post gave you insights into how tracing events in distributed systems can be built. If the picture looks simple and straightforward, rest assured: there were many hairy obstacles we had to tackle, some of which will be covered in follow-up posts over the forthcoming weeks.