Jitter: The Forgotten Performance Metric
On CTA we often talk about non-functional requirements and how these can drive the architecture of a system. Most of these cover desired response time and capacity (latency, throughput, storage, etc.), but I believe that jitter is a metric which is either forgotten or unknown to some software engineers, even though it is essential for hardware engineers.
The most basic definition would be “variation in response”. In other words, the response time, transmission time or latency of a system will not be constant. In many systems this variation goes unnoticed because it is small compared to the response itself. However, I believe that as the systems we deal with become more performant and demanding, software engineers will need to understand, measure and tune this variation.
A quick web search for jitter shows the term being used to cover performance degradations due to activities such as garbage collection or unexpected user actions. This seems to greatly annoy telecommunications and hardware engineers, who would argue that these are predictable system and user events which could simply be turned off (although ignoring your users and asking for a machine with infinite memory may get you fired). They would argue that true jitter is an unpredictable variation in response whose occurrence follows a normal distribution. Like most normally distributed effects, it is caused by the accumulation of many random events, most of which are due to switching actions. Personally, I think you should measure whatever makes sense in your system.
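To see why the accumulation of many random events produces a normal distribution, here is a minimal simulation (the delay values and counts are illustrative assumptions, not from the article): each request's total latency is the sum of many small, independent, uniformly distributed delays, and by the central limit theorem that sum tends towards a Gaussian.

```python
import random
import statistics

def total_latency(num_delays=50):
    # Sum of many small, independent delays, e.g. switching actions;
    # each micro-delay here is uniform between 0 and 1 ms (assumed values).
    return sum(random.uniform(0.0, 1.0) for _ in range(num_delays))

# Sample many requests; the totals cluster around the mean in a bell curve.
samples = [total_latency() for _ in range(10_000)]
print(f"mean={statistics.mean(samples):.2f}ms "
      f"stdev={statistics.stdev(samples):.2f}ms")
```

Plotting `samples` as a histogram would show the familiar bell shape even though no individual delay is normally distributed.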
Jitter means that hard limits for latency can be statistically likely but not guaranteed. Specifications such as:
“The system should respond to action X within 200ms”.
Should be challenged and replaced with statements like:
“The system should respond to action X within 200ms for 95% of requests.”
Of course, we are making the implicit assumption that response times are normally distributed and that the system WILL, eventually, respond. You might want to state explicitly that all actions will be executed, but stating a hard limit for all responses means that, if you measure for long enough, you will break your specification.
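Checking a percentile-based specification like the one above is straightforward. This is a hypothetical sketch; the function name and sample values are my own assumptions:

```python
def meets_spec(response_times_ms, limit_ms=200.0, percentile=95.0):
    """Return True if at least `percentile`% of responses are within the limit."""
    within = sum(1 for t in response_times_ms if t <= limit_ms)
    return within / len(response_times_ms) * 100.0 >= percentile

# 96 fast responses and 4 slow outliers: 96% are within 200ms, so the
# 95th-percentile spec is met despite the outliers.
samples = [150.0] * 96 + [450.0] * 4
print(meets_spec(samples))  # True
```

Note that the same data would fail a hard "all responses within 200ms" specification, which is exactly the point of the percentile wording.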
Jitter can be quite obvious in messaging-based systems that contain many hops. It will also tend to have a nice, Gaussian distribution due to the cumulative random delays.
The best way to measure jitter is to set up a standard test where you sequentially fire a large number of requests into your system and measure each response time. However, rather than simply adding these up and finding an average, count how many responses fall into each of a set of timing windows, e.g. how many fall between 5ms-10ms, 10ms-15ms, 15ms-20ms and so on. If we plot the count in each window we should see our expected normal distribution.
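The bucketing step above can be sketched as follows (the 5ms bucket width and the sample timings are assumptions for illustration):

```python
from collections import Counter

def histogram(response_times_ms, bucket_ms=5):
    """Count responses per fixed-width timing window."""
    counts = Counter()
    for t in response_times_ms:
        lo = int(t // bucket_ms) * bucket_ms
        counts[(lo, lo + bucket_ms)] += 1
    return counts

samples = [7.2, 8.9, 11.4, 12.0, 13.7, 16.1, 12.5]
for (lo, hi), n in sorted(histogram(samples).items()):
    print(f"{lo}ms-{hi}ms: {'#' * n}")
# 5ms-10ms: ##
# 10ms-15ms: ####
# 15ms-20ms: #
```

With enough real samples, the bar lengths trace out the response-time distribution, and the width of the bell is the jitter.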
Reducing jitter is subtly different from reducing latency itself: you are trying to reduce the variation in response times, i.e. to 'squeeze' the response profile.
Often when you reduce latency you will reduce the jitter proportionally, but this is not always the case. For example, network hops are a common cause of latency, but each hop also increases the jitter: you are accumulating more random delays with each hop. If you change the physical architecture of your system to have more but quicker hops, you will almost certainly increase the jitter (remember, this is the variation in response, not the response itself) even if the average latency is lower. It IS important to measure and monitor jitter as the system evolves.
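The more-hops-more-jitter effect can be simulated. In this sketch (all hop counts, means and per-hop jitter values are assumptions) two paths have the same total mean latency, but independent per-hop variances add, so the path with more hops shows wider variation:

```python
import random
import statistics

random.seed(42)  # reproducible illustration

def path_latency(hops, mean_ms, jitter_ms):
    # Model each hop's delay as an independent normal random variable.
    return sum(random.gauss(mean_ms, jitter_ms) for _ in range(hops))

few = [path_latency(3, 30.0, 2.0) for _ in range(10_000)]   # 3 slow hops
many = [path_latency(9, 10.0, 2.0) for _ in range(10_000)]  # 9 quick hops

print(f"few hops:  mean={statistics.mean(few):.1f}ms "
      f"stdev={statistics.stdev(few):.2f}ms")
print(f"many hops: mean={statistics.mean(many):.1f}ms "
      f"stdev={statistics.stdev(many):.2f}ms")
```

Both paths average around 90ms, but the nine-hop path's standard deviation is roughly 2·√9 = 6ms versus 2·√3 ≈ 3.5ms for the three-hop path: same mean latency, noticeably more jitter.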
Simon is covering a number of related issues in his Skills Matter tutorial Load Testing for Developers.
Published at DZone with permission of Robert Annett, DZone MVB. See the original article here.