There is a wide range of vendors who support PTP and packet capture time-stamping with micro-second timestamps or better. This works well in large, complex organisations where changing the software to do this is not an easy option, and you need something which stands back and tries to make sense of it all. The biggest problem with this approach is that analysing the data is typically twice the effort of just capturing it, and that is if you are lucky and use standard protocols which are already supported.
As software developers, is there an easy, simple, cheap way to get high resolution timings without the hardware complexity? Can we also find a simple way to ensure the data we capture will be used?
Bottom Up approach
Using hardware is a very bottom up approach. You have to:
- Distribute accurate timing using something like PTP. (requires special hardware)
- Add timestamps to the packets. (more special hardware)
- Record all the packets, on the assumption that you don't know which packets you might need. (lots of big data storage)
- Decode all the data in the packets, because you don't know what might be useful.
- Build a Hadoop style distributed system to manage all this data and provide distributed search facilities for ad hoc reports.
- Extract application specific performance results like fill rate.
- Join all this information together to generate reports and trigger alarms.
The problem is you are doing a lot of things which you might need but probably don't. As you get further down the line, you get into cross discipline/cross responsibility teams, and unless you have support from very high up, it is very likely that not all the pieces will come together and you won't spend the time required to produce or look at the reports.
Top Down approach
When simplifying a system, often the best way to start is to ask the question: why are we doing this? For specific requirements, what is the simplest and most maintainable solution to that problem?
The most common questions you want to answer are:
- when is latency hurting my PnL?
- when I have high latency, what is the most likely cause?
i.e. give me the data which points to a problem I can solve.
As application designers, with this in mind, we are aware of the critical paths in our system and we can place timestamps at key points along the way. They don't have to be perfect from the start, as you can drop or add timing points with each release, and you may need to change them as the software changes. You can also place timestamps in the code in a very lightweight way, without sending network packets each time.
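As a minimal sketch of what lightweight timestamps at key points could look like, the class below carries one timestamp slot per stage with the message itself. The class name, the stage names and the use of System.nanoTime() are all assumptions for illustration, not something a particular library provides. Note that System.nanoTime() has an arbitrary origin per JVM, so these values are only directly comparable within one process.

```java
import java.util.Arrays;

/** Hypothetical sketch: a fixed set of timing points carried with one message. */
final class TimedMessage {
    /** Illustrative stage names; a real system would pick its own critical-path points. */
    enum Stage { RECEIVED, DECODED, DECIDED, SENT }

    // One slot per stage, reused for the life of the object to avoid allocation.
    private final long[] nanos = new long[Stage.values().length];

    /** Record the current time for a stage. No I/O, just one array write. */
    void stamp(Stage stage) {
        stamp(stage, System.nanoTime());
    }

    /** Overload taking an explicit time, useful for tests and replay. */
    void stamp(Stage stage, long timeNanos) {
        nanos[stage.ordinal()] = timeNanos;
    }

    /** Elapsed nano-seconds between two stamped stages. */
    long elapsedNanos(Stage from, Stage to) {
        return nanos[to.ordinal()] - nanos[from.ordinal()];
    }

    /** Clear all slots so the object can be reused for the next message. */
    void reset() {
        Arrays.fill(nanos, 0L);
    }
}
```

Adding or dropping a timing point in a later release is then a matter of changing the enum, and the timings travel with the message so they can be written out alongside it.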
What about network synchronization?
There is a general assumption that you need network synchronization or you cannot compare times. In many systems you don't need the exact latency; you need to know when it is unusually high and where it was unusually high. This means you can compare the timings of quotes coming through the system to previous timings. If the clocks of two systems are typically out of sync (allowing for a little drift) by +/-123e12 nano-seconds and one update comes through with an apparent delay of +/-123e12 + 100e3 ns, it had a delay 100 micro-seconds higher than normal, and that might be very interesting.
If you assume that a high percentage of the time you will have low latencies, and it's the rare high latencies (one in 10, or one in 100) you are chasing, you can determine the relative timings between two machines with as few as 10 samples (as it is likely that some were good). After that you (or your program) have a pretty fair idea (within 10 micro-seconds) of what the difference should be, and you can detect unusually high delays easily.
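The idea above can be sketched as follows, assuming each message carries the sender's timestamp and the receiver stamps it on arrival. The class name, the warm-up count and the threshold are hypothetical. The smallest difference seen so far is treated as "normal", which includes the unknown clock offset, and anything well above it is reported.

```java
/** Hypothetical sketch: learn the usual clock difference, then flag unusual delays. */
final class RelativeDelayDetector {
    private final int warmupSamples;
    private final long thresholdNanos;
    private long baselineNanos = Long.MAX_VALUE; // smallest difference seen so far
    private int samples;

    RelativeDelayDetector(int warmupSamples, long thresholdNanos) {
        this.warmupSamples = warmupSamples;
        this.thresholdNanos = thresholdNanos;
    }

    /**
     * Feed one sample: the send time on machine A and the receive time on machine B.
     * The clocks do not need to be synchronized.
     *
     * @return the extra delay over the baseline in nano-seconds if it exceeds the
     *         threshold, or -1 if the sample looks normal or we are still warming up
     */
    long onSample(long sendNanos, long receiveNanos) {
        long diff = receiveNanos - sendNanos; // real delay plus unknown clock offset
        samples++;
        if (diff < baselineNanos) {
            baselineNanos = diff; // a better "good" sample
        }
        if (samples <= warmupSamples) {
            return -1; // not enough samples to trust the baseline yet
        }
        long extra = diff - baselineNanos;
        return extra > thresholdNanos ? extra : -1;
    }
}
```

This sketch keeps a single all-time minimum. To follow clock drift, a real implementation would more likely take the minimum over a rolling window of recent samples.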
Say we want the actual delay, not just the high delays.
The minimum latency between two systems is generally very stable. While high latencies are open ended, the best latency in a sample of 100 or 1000 doesn't change that often. (Though it might be interesting if it did, so it may still be worth looking at.)
You can either test this off-line and use this as a minimum or test the round trip and halve it throughout the day. In any case you know the minimum won't be less than 0 and even if you guess a reasonable delay, this won't make a difference to your outliers.
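A rough sketch of the round trip approach, with hypothetical names throughout: the round trip is timed on one machine, so it needs no synchronization, and half of the best round trip is used as the minimum one-way latency. This assumes the path is roughly symmetric. Until a round trip has been measured, the minimum is taken to be 0.

```java
/** Hypothetical sketch: turn unsynchronized timestamps into an estimated actual delay. */
final class OneWayDelayEstimator {
    private long minRoundTripNanos = Long.MAX_VALUE; // best round trip seen so far
    private long minDiffNanos = Long.MAX_VALUE;      // best (receive - send) seen so far

    /** Both times are taken on the same machine, so no clock offset is involved. */
    void onRoundTrip(long sentNanos, long replyReceivedNanos) {
        minRoundTripNanos = Math.min(minRoundTripNanos, replyReceivedNanos - sentNanos);
    }

    /** Send time is from machine A's clock, receive time from machine B's clock. */
    long estimateDelayNanos(long sendNanos, long receiveNanos) {
        long diff = receiveNanos - sendNanos; // real delay plus unknown clock offset
        minDiffNanos = Math.min(minDiffNanos, diff);
        // Assume the best one-way latency is half the best round trip, or 0 if unknown.
        long minOneWayNanos = minRoundTripNanos == Long.MAX_VALUE ? 0L : minRoundTripNanos / 2;
        // The best observed difference is assumed to correspond to the minimum latency.
        return diff - minDiffNanos + minOneWayNanos;
    }
}
```

Whatever minimum is used only shifts every estimate by the same amount, so the outliers stand out just as clearly either way.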
Hardware is more accurate, why do it in software?
Using software can be more real time, as you have the timings in your hand when you are processing the quote or order. Recording the important messages and their business impact, like fills, is then part of recording the trades and market data. This makes joining the information trivial and comparing it very easy, which means it is more likely to happen in the first place and be maintained as you go.
But hardware is more accurate.
When you have good timings, every micro-second and even every nano-second counts. However, even on the best systems, the worst timings are over 10 micro-seconds for network latency, and they can easily be over 100 micro-seconds for real applications. For the timings which really matter, you don't generally need the same resolution. You might even find your worst timings are in the milli-seconds.
Design your system around the critical path and make sure it is as fast as, or faster than, you need. Make it time itself and report when it is failing to meet your timing requirements. When you record key business events, make sure all the timing information is there as well. The analysis of the relationship between timing and business impact will then be easy to maintain and keep relevant.