How to Deal With Time Tracking in Distributed Systems
How to Deal With Time Tracking in Distributed Systems
Time-related issues can be different in distributed systems and non-distributed systems. Read on to find out how, why, and what to do about it.
Join the DZone community and get the full member experience.Join For Free
With the influx of DevOps-related products and services on the market, today’s application delivery toolchain has become complex and fragmented. Watch Avoiding the DevOps Tax to learn best practices for integration and automation to realize a faster DevOps lifecycle.
During the past few months, a lot of my time has been spent building the support for distributed traces in Plumbr. Among everything else, I found myself dealing with different issues related to handling the time in distributed systems.
Apparently, time in such systems is a whole lot different concept than in non-distributed systems. Consider for example a deployment where calls to the system are intercepted at node A. Node A makes a call to node B which in turn calls node C. The calls are synchronous and until the response is returned, the caller thread is blocking.
Now, if these three nodes would have slightly different system clocks, we could easily end up with distributed traces like seen below:
The trace above does not make sense in several aspects. For example:
- Node B claims to have received the request 5 seconds before node A sent the request to it.
- Node C says it responded to B 10 seconds later than B seems to have received the response.
We had to find a way how to solve this issue and find a way to handle the time when capturing distributed traces. But before jumping to our solution, lets see how the system clocks can end up drifting apart in the first place.
How Can Time Drift Happen?
Every software deployment is relying upon the presence of a system clock. Without going too deep into hardware world, it is a small quartz crystal located on the motherboard oscillating at a (seemingly) predictable rate. This rate can be used to monitor how many time units have passed by since the crystal started oscillating.
Apparently, there are two fundamental issues with this approach.
- These crystals do NOT oscillate at truly predictable and unified manner. Tiny differences in their composition result in what is known as time drift. If you unbox two computers manufactured at the same time in the same factory and plug them on at the same time, you will still end up with different system clocks on the machines after a while.
- There is no way to say to the crystal to know what is the current time. All it knows is to oscillate and count ticks. Whether it started counting on the first day of January 2000 or the last day of December 2016, it has no idea. This is why every system setup requires you to input the current date and time. As this includes human interaction along with imprecise clocks used as reference, you are guaranteed to have different times set on different nodes.
To compensate for both of the fundamental issues, a solution called NTP is built. It builds upon synchronizing the system clocks of computers against a centralized reference clock. Having all the nodes periodically check the system clock against such a reference clock would then limit the time drift in such a system.
Unfortunately, this is exactly the case with NTP. It does not eliminate the time drift; it only reduces the drift.
To make things from bad to worse, you are only rarely in control of all the nodes in your system. This means you cannot synchronize them against the same NTP service. In case you doubt this, what about the end user devices interacting with the web app you do control in your server room? There is no way to enforce the same NTP service on all the end users, so by definition, you are guaranteed to have nodes where the current time is set to some completely arbitrary moment in the timeline
I will stop here without going deeper into the issues why the system clocks can and will drift apart from one another. What you can count on is the fact that you are guaranteed to have different times in different nodes in your distributed systems. From what we have seen the time drifts can range from just milliseconds to decades. Yes, decades; some system clocks out there are reporting themselves to be counting time towards the end of 1996.
Solving the Time Drift for Our Use Case
As we saw, the time drifting cannot be eliminated for distributed systems. We had to come up with a custom solution taking the presence of the drift into account. It builds upon the following requirements:
- The time drift in nodes participating in the distributed transaction must be limited.
- Order of events must be correct.
- Alignment of the events must make sense.
Lets now see how we fulfilled the requirements.
Minimizing the Time Drift
Limiting the time drift was the easy part of the solution. It used the following approach, relying upon the Agent – Server Deployment Model of Plumbr:
- The Agent initiates a handshake with the Server. The handshake has a timeout of 60 seconds, after which the Agent assumes the Server is not available and drops the connection attempt to retry later.
- The Server accepts the handshake and initializes certain data structures needed for the new connection. Then, right before sending a response to the handshake, the Server injects its current timestamp into the response.
- The Agent receives a response, grabs the timestamp from the response, and adjusts its time to be equal to this timestamp received.
This way, we can guarantee that the Agent time is not different from the Server by more than 60 seconds. It can occasionally even increase the drift for some Agents, but it builds us a safety net of never dealing with more than 60 seconds of drift.
As an example, let’s check the following situation:
- The Agent sends a handshake to Server at 00:30 according to the system clock in the Agent machine.
- The Server receives the request at 04:55 according to the Server time.
- The Server injects the timestamp to the response at 05:00 according to the Server time.
- The Agent receives the response to the handshake at 01:00 according to Agent time.
- Now, the Agent adjusts its internal clock (we are never touching system clocks) by four minutes forward (the difference between Agent and Server clocks) on each timestamp it generates.
This way, we have effectively introduced our own custom-built time synchronization against which the nodes are syncing the clock each time the node is (re)connecting to our monitoring server. Now let’s move on to the more interesting part and see how we can order the events in the distributed system in a way the order will make sense.
Aligning the Events
Our solution for this was to align the spans on the timeline after we have received them. This way, we are not relying on any kind of distributed clock. Let's proceed with the algorithm description:
Consider we have a distributed transaction involving two nodes, A and B:
Node A starts the transaction, performs some work, sends a request to node B, waits for a response, processes the response, does some more work, and finishes.
This transaction now contains two spans: a and b. Each span has the following attributes:
- parentId (nullable).
- Four timestamps:
Note that b.requestSent and b.requestReceived are captured in different nodes likely deployed on different physical machines. That’s why we can’t use b.requestReceived – b.requestSent to calculate network latency between JVMs A and B.
Now that we have the model, we can start describing the alignment algorithm using the same example of a distributed transaction as in the beginning of the post:
The transaction above took one minute and 30 seconds to complete and consisted of three spans, each originating from a different node. The picture above does not make any sense, as:
- Node B claims to have received the request five seconds before node A sent the request to it.
- Node C seems to have responded to B 10 seconds later than B claims it received the response.
Now, our algorithm would start by locking the time in one of the JVMs and aligning all other spans according to it. We start with locking the time in JVM A in place. This means that we next need to shift time in JVM B so that the span alignment would start making sense.
Apparently, the correction can be anything in between five and 25 seconds forward shift:
- A sent out the request to B at 00:40 and B claimed to have received it 00:35. Having locked the time in node A, and assuming that time travel is not possible, it becomes clear that we need to shift the time in node B at least five seconds forward.
- B sent the response back to A at 01:30. A claims to have received it at 01:55. From this we can deduce that the shift cannot be greater than 25 seconds, otherwise, we would again have reinvented the “time travel,” which we agreed is impossible.
Now, the algorithm could pick any forward shift in JVM B between five and 25 seconds. Instead of doing a random pick from this range, the algorithm goes one step further and picks the average of the endpoints (15 seconds in this case). Why so?
As explained earlier, our model is imperfect in regards to not knowing the exact network latency for each request or response. It does, however, know the sum of the latency added by each request-response pair:
- Node A sent out the request at 00:40 and received it at 01:55. Whether or not the timestamps are correct is not relevant, but we can be certain that the delta of the timestamps is correct and we can claim that the node A was blocking for one minute and 15 seconds while waiting for the Node B to respond.
- Node B, in turn, knows that it took him 55 seconds to respond to the call received (claiming it received the request at 00:35 and responded to it at 01:30).
Using this information, we can see that the total latency for this request-response round trip was 20 seconds. Without knowing how much of this 20 seconds was added during request and how much during response, the algorithm treats these as equal and makes sure there is 10 seconds network latency gap both before the start and after the end of the span from node B:
The algorithm would now attempt to repeat the same step for node C, but it turns out that span c is already perfectly aligned. So, we have a consistent transaction aligned relative to JVM A with the start time is 00:30.
If we could trust A to have the correct time, we could stop here. Unfortunately, we can’t, so we go ahead and repeat the process and lock other spans one by one. Doing so gives us different transaction start timestamps based on the time in each node.
For example, repeating the procedure and aligning spans relative to JVM B would result in the following transaction being assembled:
Aligning spans according to Node B, the transaction starts drifts to 00:15. Repeating the process again and locking the Node C, the start time of the transaction would be 00:30. So, we now have three opinions about when the transaction could have started:
- Node A thinks it started 00:30.
- Node B is sure it started 00:15.
- Node C is convinced the start was at 00:30.
Having multiple opinions now allows us to use the wisdom of the crowd and pick the median of the start times to be used as the alignment base. The median of the dataset above happens to be 00:30, which corresponds to the alignment relative to JVMs A and C. Now our algorithm is finished and will use the alignment around JVM C as the correct version.
Published at DZone with permission of Nikita Artyushov , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.