Queueing Theory for LLM Inference
Learn how to size GPU capacity, batching, and concurrency for strict latency SLOs in production-ready LLM inference with this analysis of queuing theory applications.
Join the DZone community and get the full member experience.
Join For FreeIf you are deploying LLM inference in production, you are no longer just doing machine learning. You are doing applied mathematics plus systems engineering.
Most teams tune prompts, choose a model, then wonder why latency explodes at peak traffic. The root cause is usually not the model. It is load, variability, and the queue that forms when the arrival rate approaches the service capacity.
This article gives you a practical, math-driven way to reason about LLM serving. We will use queueing theory, Little’s Law, and a simple simulation to answer the questions every leader gets asked. How many GPUs do we need? What is our safe throughput? How should we batch? What happens to p95 and p99 under bursty traffic?
The goal is not to build a perfect analytical model. The goal is to build an engineering calculator you can defend.
Core Mental Model
Every request is a job. Jobs arrive over time. GPUs process jobs. If jobs arrive faster than you can process, they wait. Waiting is your latency.
Define:
- Arrival rate:
λrequests per second - Service time:
Sseconds per request - Service rate per worker:
μ = 1 / S - Number of workers:
k(GPU replicas or GPU partitions) - Utilization:
ρ = λ / (k μ) = λ S / k
The first rule of production inference: Keep ρ comfortably below 1. As ρ approaches 1, queues grow superlinearly, and tail latency blows up.
Little’s Law
Little’s Law is the simplest and most useful equation you can bring into an SLO meeting.
L = λ W
Lis average number of jobs in the systemWis average time in the system (waiting plus service)λis arrival rate
If you can measure two of these, you get the third. More importantly, it forces clarity: if you want lower W at the same λ, you must reduce L by increasing service capacity or smoothing variability.
Why LLM Is Serving Harder Than Normal Web Serving
LLM inference violates the assumptions people unconsciously make when they reason about latency. Service time is highly variable because prompt length varies, output length varies, tool use varies, and cache hit rate varies.
Moreover, arrivals are bursty because enterprise traffic often has diurnal peaks and release-driven spikes. Batching increases throughput but can add waiting time because you may hold requests to form a batch.
This variability is exactly where applied computational math helps. We do not need perfect predictions. We need safe bounds and policies that degrade gracefully.
A Simple Capacity Sizing Formula
Start with a capacity bound that is almost embarrassingly simple.
If each request takes S seconds on average, and you have k identical workers, then stable operation requires:
λ < k / S
Rearrange to size k:
k > λ S
Then add headroom for variability and tail behavior. A common engineering rule is to target utilization ρ between 0.4 and 0.7 for strict tail latency, depending on burstiness and service time variance. So a practical sizing is:
k = ceil( λ S / ρ_target )
Example
Suppose peak λ is 120 requests per second. Average service time S is 0.18 seconds per request on your chosen model and hardware. If you target ρ_target = 0.6:
k = ceil(120 × 0.18 / 0.6)
k = ceil(21.6 / 0.6)
k = ceil(36)
So you start with 36 workers.
This is a starting point. Next, we incorporate batching and tail.
Batching as a Control Problem
Batching is not magic. It is a scheduling policy. If you batch B requests together, you often improve compute efficiency and reduce per-request service time. But you also introduce batch formation delay.
A useful decomposition is:
Total latency = queue wait + batch wait + compute time
Batch wait is the time a request sits while you fill the batch. You can control it using a max wait timer. Consider the max batch size B_max, and the max batch wait T_max.
Dynamic batching then entails accumulation until B_max or T_max expires, then dispatch.
Batching improves throughput when compute cost scales sublinearly with B. For transformer decoding, you may get good scaling for prefill, and weaker scaling for long decode. The details depend on your serving stack.
Batching is only beneficial if the throughput gains outweigh the added waiting, especially at p95 and p99. High-throughput serving of LLMs typically depends on batching and careful KV cache management, as described in PagedAttention and vLLM.
If your workload is bursty, dynamic batching with a small T_max often dominates naive large batches. If you deploy with NVIDIA stacks, TensorRT LLM discusses in-flight batching and request scheduling.
A Tail Latency Heuristic You Can Use
Even without heavy theory, you can build a safe heuristic.
- Choose a latency SLO, for example, p95 under 800 ms
- Reserve part of the budget for model compute, for example, 300 ms
- Reserve part for network and orchestration, for example, 100 ms
- The rest is queueing plus batching budget, for example, 400 ms
- Enforce
T_maxbelow your queueing budget, for example, 20 to 50 ms
If T_max is too large, you manufacture tail latency even when you have capacity.
Simulation: A Small Model You Can Run
Analytical queueing models like M/M/k can be informative, but LLM service times are rarely exponential. A quick discrete event simulation is often more honest and aligns with standard performance modeling practice described in the Performance Modeling and Design of Computer Systems book.
Below is a compact simulation that lets you explore capacity, service time variability, and batching timers. You can adapt it to your real telemetry distributions.
import random
import heapq
import math
from statistics import mean
def percentile(xs, p):
xs = sorted(xs)
if not xs:
return None
i = int(math.ceil(p * len(xs))) - 1
i = max(0, min(i, len(xs) - 1))
return xs[i]
def simulate(
seconds=120,
arrival_rate=100.0, # λ requests per second
workers=24, # k
mean_service=0.20, # seconds
service_cv=0.8, # coefficient of variation
batch_max=8,
batch_wait_max=0.03, # seconds
seed=0
):
random.seed(seed)
# Arrivals as a Poisson process
t = 0.0
arrivals = []
while t < seconds:
t += random.expovariate(arrival_rate)
if t < seconds:
arrivals.append(t)
# Service time model: lognormal with chosen mean and cv
if service_cv <= 0:
sigma = 0.0
mu = math.log(mean_service)
else:
sigma2 = math.log(1 + service_cv**2)
sigma = math.sqrt(sigma2)
mu = math.log(mean_service) - 0.5 * sigma2
def sample_service_time(batch_size):
# Simple batching efficiency curve
# Replace this with measurements from your stack
base = random.lognormvariate(mu, sigma)
efficiency = 0.55 + 0.45 / math.sqrt(batch_size)
return base * efficiency
# Worker availability times
worker_free = [0.0 for _ in range(workers)]
heapq.heapify(worker_free)
latencies = []
# Batch accumulator
batch = []
batch_first_arrival = None
idx = 0
current_time = 0.0
def dispatch_batch(dispatch_time, batch_items):
nonlocal latencies
free_time = heapq.heappop(worker_free)
start_time = max(free_time, dispatch_time)
service_time = sample_service_time(len(batch_items))
finish_time = start_time + service_time
heapq.heappush(worker_free, finish_time)
for arrival_time in batch_items:
latencies.append(finish_time - arrival_time)
while idx < len(arrivals) or batch:
next_arrival = arrivals[idx] if idx < len(arrivals) else float("inf")
next_deadline = (batch_first_arrival + batch_wait_max) if batch_first_arrival is not None else float("inf")
current_time = min(next_arrival, next_deadline)
if current_time == next_arrival:
at = next_arrival
idx += 1
if not batch:
batch_first_arrival = at
batch.append(at)
if len(batch) >= batch_max:
dispatch_batch(at, batch)
batch = []
batch_first_arrival = None
else:
dispatch_batch(current_time, batch)
batch = []
batch_first_arrival = None
return {
"mean": mean(latencies),
"p50": percentile(latencies, 0.50),
"p95": percentile(latencies, 0.95),
"p99": percentile(latencies, 0.99),
"max": max(latencies) if latencies else None,
"count": len(latencies),
}
if __name__ == "__main__":
out = simulate(
seconds=180,
arrival_rate=120.0,
workers=36,
mean_service=0.18,
service_cv=0.9,
batch_max=8,
batch_wait_max=0.03,
seed=42
)
print(out)
How to use this in practice:
- Replace the service time sampler with your measured distribution
- Use real arrival traces, not just Poisson
- Sweep
workers,batch_max, andbatch_wait_max - Track p95 and p99, not just mean
This turns a fuzzy infrastructure debate into a quantitative policy discussion.
A Deployment Playbook That Reads Like Applied Math
Step 1: Measure the Service Time Distribution
Instrument per request compute time split into prefill and decode. Track prompt tokens, output tokens, and cache hits.
Step 2: Decide What You Are Optimizing
If your business cares about p99, size for p99. If your business cares about cost, set a max queueing budget and accept more shedding.
Step 3: Pick a Utilization Target and Enforce Admission Control
Choose ρ_target and do not exceed it at peak. Use a queue length circuit breaker. When overload hits, degrade and do not accumulate an infinite queue as mentioned by Google's SRE Playbook.
Step 4: Use Dynamic Batching With a Strict Timer
Set batch_wait_max to protect tail latency. Use smaller batches under low load, larger batches under high load.
Step 5: Add a Second Lever: Request Shaping
Route long prompts to a separate pool. Cap max generation length by tier. Use early exit for low confidence tasks.
Step 6: Validate With Chaos Load Tests
Replay bursty traffic. Replay worst-case long outputs. Confirm SLOs under realistic spikes.
What to Say to Leadership
When someone asks why p99 jumped from 900 ms to 6 seconds, you can say it clearly.
- We moved closer to utilization 1.
- Queueing delay grows nonlinearly near saturation.
- Batching timers and variability amplified the tail.
- We need either more capacity, stricter batching timers, or overload policies.
Applied mathematics is not an academic add-on to LLM systems. It is the difference between a demo and a reliable service.
If you treat LLM inference as a queueing system, you gain levers you can measure and control: utilization, batching delay, service time variance, and admission control. That is how you hit SLOs while keeping costs rational.
The opinions expressed in this article are the authors’ personal opinions and do not reflect the opinions of their employer.
Published at DZone with permission of Dhyey Mavani. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments