Queueing Theory for LLM Inference

Learn how to size GPU capacity, batching, and concurrency for strict latency SLOs in production-ready LLM inference with this analysis of queuing theory applications.

Dhyey Mavani

Feb. 20, 26 · Analysis

Likes (0)

Comment

Save

1.9K Views

If you are deploying LLM inference in production, you are no longer just doing machine learning. You are doing applied mathematics plus systems engineering.

Most teams tune prompts, choose a model, then wonder why latency explodes at peak traffic. The root cause is usually not the model. It is load, variability, and the queue that forms when the arrival rate approaches the service capacity.

This article gives you a practical, math-driven way to reason about LLM serving. We will use queueing theory, Little’s Law, and a simple simulation to answer the questions every leader gets asked. How many GPUs do we need? What is our safe throughput? How should we batch? What happens to p95 and p99 under bursty traffic?

The goal is not to build a perfect analytical model. The goal is to build an engineering calculator you can defend.

Core Mental Model

Every request is a job. Jobs arrive over time. GPUs process jobs. If jobs arrive faster than you can process, they wait. Waiting is your latency.

Define:

Arrival rate: λ requests per second
Service time: S seconds per request
Service rate per worker: μ = 1 / S
Number of workers: k (GPU replicas or GPU partitions)
Utilization: ρ = λ / (k μ) = λ S / k

The first rule of production inference: Keep ρ comfortably below 1. As ρ approaches 1, queues grow superlinearly, and tail latency blows up.

Little’s Law

Little’s Law is the simplest and most useful equation you can bring into an SLO meeting.

L = λ W

L is average number of jobs in the system
W is average time in the system (waiting plus service)
λ is arrival rate

If you can measure two of these, you get the third. More importantly, it forces clarity: if you want lower W at the same λ, you must reduce L by increasing service capacity or smoothing variability.

Why LLM Is Serving Harder Than Normal Web Serving

LLM inference violates the assumptions people unconsciously make when they reason about latency. Service time is highly variable because prompt length varies, output length varies, tool use varies, and cache hit rate varies.

Moreover, arrivals are bursty because enterprise traffic often has diurnal peaks and release-driven spikes. Batching increases throughput but can add waiting time because you may hold requests to form a batch.

This variability is exactly where applied computational math helps. We do not need perfect predictions. We need safe bounds and policies that degrade gracefully.

A Simple Capacity Sizing Formula

Start with a capacity bound that is almost embarrassingly simple.

If each request takes S seconds on average, and you have k identical workers, then stable operation requires:

λ < k / S

Rearrange to size k:

k > λ S

Then add headroom for variability and tail behavior. A common engineering rule is to target utilization ρ between 0.4 and 0.7 for strict tail latency, depending on burstiness and service time variance. So a practical sizing is:

k = ceil( λ S / ρ_target )

Example

Suppose peak λ is 120 requests per second. Average service time S is 0.18 seconds per request on your chosen model and hardware. If you target ρ_target = 0.6:

    Plain Text
   
   k = ceil(120 × 0.18 / 0.6)
k = ceil(21.6 / 0.6)
k = ceil(36)

So you start with 36 workers.

This is a starting point. Next, we incorporate batching and tail.

Batching as a Control Problem

Batching is not magic. It is a scheduling policy. If you batch B requests together, you often improve compute efficiency and reduce per-request service time. But you also introduce batch formation delay.

A useful decomposition is:

Total latency = queue wait + batch wait + compute time

Batch wait is the time a request sits while you fill the batch. You can control it using a max wait timer. Consider the max batch size B_max, and the max batch wait T_max.

Dynamic batching then entails accumulation until B_max or T_max expires, then dispatch.

Batching improves throughput when compute cost scales sublinearly with B. For transformer decoding, you may get good scaling for prefill, and weaker scaling for long decode. The details depend on your serving stack.

Batching is only beneficial if the throughput gains outweigh the added waiting, especially at p95 and p99. High-throughput serving of LLMs typically depends on batching and careful KV cache management, as described in PagedAttention and vLLM.

If your workload is bursty, dynamic batching with a small T_max often dominates naive large batches. If you deploy with NVIDIA stacks, TensorRT LLM discusses in-flight batching and request scheduling.

A Tail Latency Heuristic You Can Use

Even without heavy theory, you can build a safe heuristic.

Choose a latency SLO, for example, p95 under 800 ms
Reserve part of the budget for model compute, for example, 300 ms
Reserve part for network and orchestration, for example, 100 ms
The rest is queueing plus batching budget, for example, 400 ms
Enforce T_max below your queueing budget, for example, 20 to 50 ms

If T_max is too large, you manufacture tail latency even when you have capacity.

Simulation: A Small Model You Can Run

Analytical queueing models like M/M/k can be informative, but LLM service times are rarely exponential. A quick discrete event simulation is often more honest and aligns with standard performance modeling practice described in the Performance Modeling and Design of Computer Systems book.

Below is a compact simulation that lets you explore capacity, service time variability, and batching timers. You can adapt it to your real telemetry distributions.

    Python
   
 

   import random
import heapq
import math
from statistics import mean

def percentile(xs, p):
    xs = sorted(xs)
    if not xs:
        return None
    i = int(math.ceil(p * len(xs))) - 1
    i = max(0, min(i, len(xs) - 1))
    return xs[i]

def simulate(
    seconds=120,
    arrival_rate=100.0,          # λ requests per second
    workers=24,                  # k
    mean_service=0.20,           # seconds
    service_cv=0.8,              # coefficient of variation
    batch_max=8,
    batch_wait_max=0.03,         # seconds
    seed=0
):
    random.seed(seed)

    # Arrivals as a Poisson process
    t = 0.0
    arrivals = []
    while t < seconds:
        t += random.expovariate(arrival_rate)
        if t < seconds:
            arrivals.append(t)

    # Service time model: lognormal with chosen mean and cv
    if service_cv <= 0:
        sigma = 0.0
        mu = math.log(mean_service)
    else:
        sigma2 = math.log(1 + service_cv**2)
        sigma = math.sqrt(sigma2)
        mu = math.log(mean_service) - 0.5 * sigma2

    def sample_service_time(batch_size):
        # Simple batching efficiency curve
        # Replace this with measurements from your stack
        base = random.lognormvariate(mu, sigma)
        efficiency = 0.55 + 0.45 / math.sqrt(batch_size)
        return base * efficiency

    # Worker availability times
    worker_free = [0.0 for _ in range(workers)]
    heapq.heapify(worker_free)

    latencies = []

    # Batch accumulator
    batch = []
    batch_first_arrival = None

    idx = 0
    current_time = 0.0

    def dispatch_batch(dispatch_time, batch_items):
        nonlocal latencies
        free_time = heapq.heappop(worker_free)
        start_time = max(free_time, dispatch_time)
        service_time = sample_service_time(len(batch_items))
        finish_time = start_time + service_time
        heapq.heappush(worker_free, finish_time)
        for arrival_time in batch_items:
            latencies.append(finish_time - arrival_time)

    while idx < len(arrivals) or batch:
        next_arrival = arrivals[idx] if idx < len(arrivals) else float("inf")
        next_deadline = (batch_first_arrival + batch_wait_max) if batch_first_arrival is not None else float("inf")
        current_time = min(next_arrival, next_deadline)

        if current_time == next_arrival:
            at = next_arrival
            idx += 1
            if not batch:
                batch_first_arrival = at
            batch.append(at)

            if len(batch) >= batch_max:
                dispatch_batch(at, batch)
                batch = []
                batch_first_arrival = None

        else:
            dispatch_batch(current_time, batch)
            batch = []
            batch_first_arrival = None

    return {
        "mean": mean(latencies),
        "p50": percentile(latencies, 0.50),
        "p95": percentile(latencies, 0.95),
        "p99": percentile(latencies, 0.99),
        "max": max(latencies) if latencies else None,
        "count": len(latencies),
    }

if __name__ == "__main__":
    out = simulate(
        seconds=180,
        arrival_rate=120.0,
        workers=36,
        mean_service=0.18,
        service_cv=0.9,
        batch_max=8,
        batch_wait_max=0.03,
        seed=42
    )
    print(out)
  

How to use this in practice:

Replace the service time sampler with your measured distribution
Use real arrival traces, not just Poisson
Sweep workers, batch_max, and batch_wait_max
Track p95 and p99, not just mean

This turns a fuzzy infrastructure debate into a quantitative policy discussion.

A Deployment Playbook That Reads Like Applied Math

Step 1: Measure the Service Time Distribution

Instrument per request compute time split into prefill and decode. Track prompt tokens, output tokens, and cache hits.

Step 2: Decide What You Are Optimizing

If your business cares about p99, size for p99. If your business cares about cost, set a max queueing budget and accept more shedding.

Step 3: Pick a Utilization Target and Enforce Admission Control

Choose ρ_target and do not exceed it at peak. Use a queue length circuit breaker. When overload hits, degrade and do not accumulate an infinite queue as mentioned by Google's SRE Playbook.

Step 4: Use Dynamic Batching With a Strict Timer

Set batch_wait_max to protect tail latency. Use smaller batches under low load, larger batches under high load.

Step 5: Add a Second Lever: Request Shaping

Route long prompts to a separate pool. Cap max generation length by tier. Use early exit for low confidence tasks.

Step 6: Validate With Chaos Load Tests

Replay bursty traffic. Replay worst-case long outputs. Confirm SLOs under realistic spikes.

What to Say to Leadership

When someone asks why p99 jumped from 900 ms to 6 seconds, you can say it clearly.

We moved closer to utilization 1.
Queueing delay grows nonlinearly near saturation.
Batching timers and variability amplified the tail.
We need either more capacity, stricter batching timers, or overload policies.

Applied mathematics is not an academic add-on to LLM systems. It is the difference between a demo and a reliable service.

If you treat LLM inference as a queueing system, you gain levers you can measure and control: utilization, batching delay, service time variance, and admission control. That is how you hit SLOs while keeping costs rational.

The opinions expressed in this article are the authors’ personal opinions and do not reflect the opinions of their employer.

Queueing theory large language model

Published at DZone with permission of Dhyey Mavani. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending