Lean Tools: Queuing Theory
Our model of tasks passing through a value stream to be completed (for example: coded, tested and deployed) is interestingly similar to a more general one that we use when studying the performance of a server application: the system model of queuing theory.
In the case of the server, user requests (HTTP requests or just RPC calls) must be served in sequence, minimizing the total response time between the moment when you click a link and the time when a page has been completely generated and then loaded on your side.
Translating this model to a development team means that each phase is carried out by a subset of the people in the team, which we can call a station, borrowing a workspace metaphor. Each station may work on only one task at a time - either by time sharing or by finishing each task before accepting the next - or multiple developers may work in parallel.
In either case, the work items which are currently not accepted are put in the station's queue.
Concepts and metrics
The most important measure in queuing theory is the cycle time we have just defined: the time spent from the creation of a task or a user story to its completion and release. This time consists of the sum of the queue time (spent waiting) and the service time (spent adding value by working on it: talking with the client, writing tests and code, deploying the feature).
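As a minimal sketch of this decomposition, assume a single station and a work item that is picked up once and worked on without interruption. The function name and the timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime

def cycle_breakdown(created, started, finished):
    """Split the cycle time of a work item into queue time and service time."""
    queue_time = started - created      # spent waiting in the station's queue
    service_time = finished - started   # spent adding value by working on it
    return queue_time, service_time, queue_time + service_time

# Hypothetical story: created on a Monday, picked up on Wednesday, released on Thursday.
created = datetime(2024, 1, 1, 9, 0)
started = datetime(2024, 1, 3, 9, 0)
finished = datetime(2024, 1, 4, 9, 0)

queue_time, service_time, cycle_time = cycle_breakdown(created, started, finished)
print(queue_time.days, service_time.days, cycle_time.days)  # 2 1 3
```

In this example two thirds of the cycle time are spent in the queue, which is the part that a service-time measurement alone would never show.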
Given some hypotheses on the distribution of service times and rate of arrivals, which are random variables and not fixed quantities, we could derive some of these quantities from the others and find out which station needs help. We usually can't define such a precise, quantitative model for people-based systems, as we are unpredictable humans - nevertheless it is important to know queuing theory concepts qualitatively and to measure them when we have the occasion.
Rate of arrival and rate of service
The rate of arrival is the amount of work entering the system in the unit of time - for example, the number of stories you accept for an iteration. This rate should be evened out as much as possible if you want to minimize the cycle time.
In the opposite case, we would accept a 200-page requirements document for the months that follow. We may be able to work on the first requirements quickly, but most of them are going to be put into the queue and the average cycle time will increase. If we accept a lot of work, many items will just sit there waiting for the team to be available until they become obsolete or late to the market.
The throughput of a station is instead closely related to its service time, the average time spent working on an item fetched from the local queue. Service time complements queue time.
Utilization is defined as the percentage of time that a station spends working, with respect to the total time, which also includes waiting times and disposable work.
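A small sketch of these two definitions, with hypothetical numbers and function names. It assumes that the average service time is known and that a station which never waits completes one item per service time.

```python
def utilization(busy_time, total_time):
    """Fraction of the total time that a station spends working."""
    return busy_time / total_time

def max_throughput(service_time):
    """Items per unit of time completed by a station that never waits."""
    return 1.0 / service_time

# Hypothetical station: busy for 36 hours of a 40-hour week,
# needing 8 hours of work per story on average.
week_utilization = utilization(busy_time=36, total_time=40)
stories_per_hour = max_throughput(service_time=8)

print(f"{week_utilization:.0%}")  # 90%
print(stories_per_hour * 40)      # 5.0 stories per week at most
```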
Any time the rate of arrival varies randomly instead of being perfectly fixed, a station's queue size increases with its utilization. As a metaphor, think of a traffic jam: the utilization of a road increases and as a result a long queue forms, even though all the cars on it could go through the road faster if they kept a constant speed.
Intuitively, this is what happens. Suppose we are at 90% average utilization and work arrives randomly from the previous station.
- Sometimes we have nothing to do because the work items are taking a long time to be processed at the previous station. This time is lost and not recoverable: in these periods we are at (let's say) 70% utilization.
- At other times instead, many items arrive together after having been quickly processed at the previous station. We would have to work at 110% to drain the queue, but this is not possible.
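The intuition above can be checked with a small simulation. This is only a sketch under assumptions that are not in the text: a single station, with exponentially distributed service times and gaps between arrivals. The waiting time of each item is derived from that of the previous one (Lindley's recursion), and the function name and parameters are hypothetical.

```python
import random

def average_wait(utilization, n_items=100_000, mean_service=1.0, seed=42):
    """Average time an item waits in the queue of a single station."""
    rng = random.Random(seed)
    # Utilization = arrival rate * mean service time, so we derive the arrival rate.
    arrival_rate = utilization / mean_service
    wait = 0.0
    total_wait = 0.0
    for _ in range(n_items):
        total_wait += wait
        service = rng.expovariate(1.0 / mean_service)  # random service time
        gap = rng.expovariate(arrival_rate)            # random time until the next arrival
        # The next item waits for whatever work is still pending when it arrives.
        # Idle time (a negative value) is lost: it is clamped to zero.
        wait = max(0.0, wait + service - gap)
    return total_wait / n_items

for u in (0.5, 0.7, 0.9):
    print(f"utilization {u:.0%}: average wait {average_wait(u):.2f} service times")
```

The `max(0.0, ...)` is where the lost idle time shows up: quiet periods cannot be saved for later, while bursts always pile up in the queue. The average wait should therefore grow much faster than the utilization itself.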
You know that guy who knows everything about the codebase of your project, the one everyone asks questions to and pair programs with? When you have a question, how long do you have to wait to reach him?
The bottleneck of a system is the station with the lowest throughput (let's not talk about demands and visits in this simplified model). As Goldratt says, you can easily find a bottleneck by looking at where the longest queue is: since the other stations complete their work at a higher rate by definition, they push their completed items to that station, which is unable to process them at the same speed.
The bottleneck is the most important station as it dictates an upper limit for the throughput of the entire system: if deployment is your bottleneck, you can't complete stories faster than you can make releases (hard truth). If it's development because it is difficult to find clean coders, the features produced won't vary no matter how many requirements documents are written (another hard truth).
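The upper limit set by the bottleneck can be written down in a few lines. The station names and throughput figures below are hypothetical, and the sketch assumes a simple linear value stream where every item visits every station once.

```python
def find_bottleneck(throughputs):
    """Return the station with the lowest throughput and that throughput."""
    name = min(throughputs, key=throughputs.get)
    return name, throughputs[name]

# Hypothetical throughput of each station, in stories per week.
stations = {"analysis": 6.0, "development": 4.0, "testing": 5.0, "deployment": 2.0}

bottleneck, system_throughput = find_bottleneck(stations)
print(bottleneck, system_throughput)  # deployment 2.0

# Doubling the throughput of a station that is not the bottleneck changes nothing.
faster_analysis = dict(stations, analysis=12.0)
print(find_bottleneck(faster_analysis))  # ('deployment', 2.0)
```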
Continuing with the traffic metaphor: the bottleneck is that roundabout where 4 roads connect, or that stretch you passed through yesterday where two of the three lanes were closed and everyone was struggling to get into the remaining one.
Usually, when stuck in traffic, many people reduce their speed, as it's a waste of gas to accelerate hard for the few meters you can cover when the jam moves. Cars can only pass through the bottleneck at its own rate, so everyone feels free to slow down while in line, as the opposite would only produce a longer queue.
Here are some heuristics which are valid while simple mathematical assumptions are made on the distribution of the arrivals and the service time. We always talk of average quantities in this list, as in the rest of the article.
- Little's law tells us that measuring queue length is equivalent to measuring cycle time as they are proportional in a stable system.
- Variability in rate of arrival or in service time increases the cycle time and the queue.
- Large batches make queues grow.
- High utilization makes cycle time grow.
- To improve cycle times and reduce queues, target an even, stable rate of arrival, like restaurants and discos do by introducing discounts at certain times of the day. Other ways to help are smaller batches, stable service times, and parallelism where applicable (a station composed of two parallel substations halves the effective service time per item).
- Decreasing variability early in the process is better than later, like in Goldratt's matchstick game. As a consequence, you're better off when your bottleneck is high in the stream: variability is reduced as the station dictates the rate to the rest of the system.
- A system where every station is working all the time is close to bankruptcy (another of Goldratt's maxims); this case is an extension of high utilization to all stations. If every station is already working all the time, none of them has time to recover when its queue grows due to upstream variability.
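Two of the heuristics in this list can be turned into numbers. The first function is Little's law for a stable system. The second one assumes the simplest textbook model (a single station with exponentially distributed arrivals and service times), which is an assumption made here for illustration and not something a real team will match exactly. Function names and figures are hypothetical.

```python
def cycle_time_from_queue(items_in_system, throughput):
    """Little's law: average cycle time = average items in the system / average throughput."""
    return items_in_system / throughput

def cycle_time_at_utilization(service_time, utilization):
    """Average cycle time of a single station with exponential arrivals and service."""
    if not 0 <= utilization < 1:
        raise ValueError("the queue is only stable for a utilization below 100%")
    return service_time / (1.0 - utilization)

# Hypothetical team: 12 stories in the system, 4 stories completed per week.
print(cycle_time_from_queue(items_in_system=12, throughput=4))  # 3.0 weeks

# Hypothetical station with an average service time of 1 day.
for u in (0.5, 0.8, 0.9, 0.95):
    days = cycle_time_at_utilization(service_time=1.0, utilization=u)
    print(f"utilization {u:.0%}: cycle time {days:.1f} days")
```

Under these assumptions, going from 50% to 95% utilization multiplies the cycle time by ten (from 2 to 20 days) while the service time stays at one day: all the extra time is spent in the queue.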