KV Cache Implementation Inside vLLM
vLLM's KV cache is a virtualized memory system built on reusable blocks and prefix-aware scheduling that enables efficient, scalable LLM inference.
The key-value (KV) cache is a fundamental optimization in transformer-based LLM inference. It stores intermediate attention states, i.e., the keys and values computed during prefill and earlier decode steps, so that subsequent tokens can reuse them instead of recomputing them from scratch. This significantly reduces compute cost and latency, especially for long-context or multi-turn agentic workloads. KV caching has been discussed extensively in blogs and documentation [1, 2, 3, 4, 5].
Rather than revisiting those well-known concepts, this article walks through the KV cache implementation in vLLM (v0.20.0). Using concrete code pointers and design insights, it aims to bridge the gap between high-level understanding and real-world system design.
KV Cache Is Not a Standard Cache
At first glance, the KV cache sounds like a standard caching problem: store computed results and reuse them later. However, in systems like vLLM, the KV cache behaves fundamentally differently from traditional caches such as Redis. It is not a simple key-value lookup system sitting outside the execution path, but a tightly coupled component of the model's forward pass that must be accessed at every decoding step.
Unlike conventional caches, KV cache is dynamic, partially reusable, and deeply intertwined with GPU memory allocation. This means that KV cache design is as much about memory management and scheduling as it is about cache reuse. Thinking of it as just a cache hides its true complexity, and it is better understood as a virtualized memory layer for intermediate computation.
| Dimension | Traditional cache (e.g., Redis) | KV cache in LLMs (e.g., vLLM) |
|---|---|---|
| Purpose | Avoid recomputing full results | Avoid recomputing intermediate attention state |
| Common access pattern | Key -> value lookup | Key -> key-value bytes lookup during model execution |
| Reuse type | All or nothing | Partial reuse (prefix based) |
| Storage | In-memory / persisted | Primarily GPU memory, which can also be persisted |
| Consistency | Eventual or strong consistency | Must match exact token sequence |
| Scheduling dependency | Independent | Strongly coupled with request scheduling |
| Failure mode | Cache miss results in recompute | Cache miss results in recompute |
| Cache locality sensitivity | Low (can often be distributed for better reliability and scalability) | Very high (node/worker local) and sensitive to I/O latency |
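The "reuse type" row is the sharpest difference in the table. As a minimal sketch (the function names and the block size of 4 are illustrative assumptions, not vLLM APIs), an exact-key cache returns nothing when a prompt differs by a single token, while a prefix-based cache can still reuse every full block of the shared prefix:

```python
BLOCK_SIZE = 4  # illustrative value; not taken from vLLM


def exact_match_reuse(cached: list[int], prompt: list[int]) -> int:
    """All-or-nothing: tokens are reusable only if the whole key matches."""
    return len(prompt) if cached == prompt else 0


def prefix_reuse(cached: list[int], prompt: list[int]) -> int:
    """Partial reuse: count shared leading tokens, rounded down to full blocks."""
    shared = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        shared += 1
    return (shared // BLOCK_SIZE) * BLOCK_SIZE


cached_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
new_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9, 99]  # differs only in the last token

print(exact_match_reuse(cached_prompt, new_prompt))  # 0
print(prefix_reuse(cached_prompt, new_prompt))       # 8
```

Nine tokens match, but only two full blocks (eight tokens) are reusable in this sketch, because reuse is tracked per block rather than per token.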
The kv_cache_manager module is a good entry point for seeing that the KV cache in vLLM is not a traditional cache but an active memory manager. During inference it manages GPU KV cache memory: allocation, reuse, eviction, prefix cache hits, and request lifecycle state.
class KVCacheManager:
    def __init__(
        self,
        kv_cache_config: KVCacheConfig,
        max_model_len: int,
        hash_block_size: int,
        max_num_batched_tokens: int | None = None,
        enable_caching: bool = True,
        use_eagle: bool = False,
        log_stats: bool = False,
        enable_kv_cache_events: bool = False,
        dcp_world_size: int = 1,
        pcp_world_size: int = 1,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ) -> None:
        ...  # body omitted
Source: vllm/v1/core/kv_cache_manager.py (vllm-project/vllm on GitHub, v0.20.0)
vLLM KV Cache Design
vLLM's KV cache design treats KV memory like virtual memory rather than as contiguous per-request tensors, which avoids memory bottlenecks such as fragmentation. Instead of allocating one large contiguous region per request, it introduces a layer of indirection via fixed-size blocks and block tables. This allows memory to be used efficiently, reused across requests, and extended dynamically as sequences grow. Two core primitives enable this design: block tables and an eviction mechanism. Together, they address memory fragmentation, reuse, and scalability.
Block Tables
The block table is the central abstraction in vLLM's KV cache design. Instead of storing KV tensors contiguously in GPU memory, each request maintains a mapping from logical token positions to physical memory blocks. This indirection layer is conceptually similar to a page table in operating systems. When the model accesses KV for a given token, it resolves through the block table to locate the physical block in GPU memory. This design allows KV memory to be non-contiguous, shared across multiple requests, and dynamically extended as tokens are generated. The code pointers below are a good entry point to understand this concept in detail.
vLLM maintains a BlockTable whose rows correspond to active request slots. Each row maps a request's logical token/block positions to physical KV cache block IDs in GPU memory. This indirection lets KV blocks be allocated non-contiguously and lets multiple requests refer to reused/shared cached blocks.
class BlockTable:
    def __init__(
        self,
        block_size: int,
        max_num_reqs: int,
        max_num_blocks_per_req: int,
        max_num_batched_tokens: int,
        pin_memory: bool,
        device: torch.device,
        kernel_block_size: int,
        cp_kv_cache_interleave_size: int,
    ):
        ...  # body omitted
Source: vllm/v1/worker/block_table.py (vllm-project/vllm on GitHub, v0.20.0)
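The page-table analogy can be made concrete with a small sketch. This is not vLLM's BlockTable; the dictionary layout, request IDs, block IDs, and block size are all assumptions chosen for illustration. It shows a logical token position resolving to a physical block and an offset inside it, with two requests sharing their first two physical blocks:

```python
BLOCK_SIZE = 16  # tokens per block (illustrative value)

# Hypothetical block-table rows: list index = logical block number,
# value = physical block ID in the KV cache.
block_table = {
    "req-A": [7, 2, 9],  # physical blocks need not be contiguous
    "req-B": [7, 2, 4],  # shares its first two blocks with req-A
}


def resolve(request_id: str, token_pos: int) -> tuple[int, int]:
    """Return (physical_block_id, offset_in_block) for a token position."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return block_table[request_id][logical_block], offset


print(resolve("req-A", 0))   # (7, 0)
print(resolve("req-A", 37))  # (9, 5)
print(resolve("req-B", 37))  # (4, 5)
```

Extending a request is then just appending one more physical block ID to its row; nothing already stored has to move.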
vLLM's KV cache is divided into fixed-size blocks (KVCacheBlock). These blocks are the fundamental unit of allocation, prefix cache reuse, reference counting, and eviction. The code below is a good set of pointers for understanding that lifecycle.
class BlockPool:
    def __init__(
        self,
        num_gpu_blocks: int,
        enable_caching: bool,
        hash_block_size: int,
        enable_kv_cache_events: bool = False,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ):
        ...  # body omitted
Source: vllm/v1/core/block_pool.py (vllm-project/vllm on GitHub, v0.20.0)
allocate_slots() asks the coordinator how many blocks are needed, checks the shared block_pool for free capacity, and then calls allocate_new_blocks() only for the slots the current request needs. This shows that blocks are assigned dynamically from a shared pool rather than preallocated per request.
def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
    ...  # body omitted
Source: vllm/v1/core/kv_cache_manager.py (vllm-project/vllm on GitHub, v0.20.0)
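The allocation flow described above can be sketched in a few lines. ToyBlockPool and ToyManager are hypothetical stand-ins, and the simplified allocate_slots below takes a total token count rather than vLLM's real parameter list. The point is only that a request grows block by block from a shared pool, and that the manager returns None when the pool cannot cover the growth:

```python
from collections import deque
from typing import Optional

BLOCK_SIZE = 16  # illustrative value


class ToyBlockPool:
    """Hypothetical shared pool of physical block IDs."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))

    def allocate(self, n: int) -> list[int]:
        return [self.free.popleft() for _ in range(n)]


class ToyManager:
    """Hypothetical manager that tracks which blocks each request owns."""

    def __init__(self, pool: ToyBlockPool) -> None:
        self.pool = pool
        self.req_to_blocks: dict[str, list[int]] = {}

    def allocate_slots(self, request_id: str, total_tokens: int) -> Optional[list[int]]:
        """Grow a request to cover total_tokens. Return the newly assigned
        block IDs, or None if the pool does not have enough free blocks."""
        owned = self.req_to_blocks.setdefault(request_id, [])
        needed = -(-total_tokens // BLOCK_SIZE) - len(owned)  # ceiling division
        if needed <= 0:
            return []
        if needed > len(self.pool.free):
            return None
        new_blocks = self.pool.allocate(needed)
        owned.extend(new_blocks)
        return new_blocks


mgr = ToyManager(ToyBlockPool(num_blocks=4))
print(mgr.allocate_slots("a", 20))  # [0, 1]  (20 tokens need 2 blocks)
print(mgr.allocate_slots("b", 16))  # [2]
print(mgr.allocate_slots("a", 33))  # [3]     (grows from 2 to 3 blocks)
print(mgr.allocate_slots("b", 17))  # None    (pool is exhausted)
```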
Cache Eviction
Eviction in vLLM is more complex than a typical least recently used (LRU) policy because of dependencies between tokens. KV blocks form a logical prefix chain: later tokens depend on earlier ones, so eviction cannot remove arbitrary blocks without breaking correctness. Instead, vLLM combines reference counting with recency heuristics. Blocks become eligible for eviction only when no active request depends on them, and even then eviction typically proceeds from the tail of sequences to preserve prefix integrity. This constrained behavior preserves correctness while still letting the system operate under memory pressure.
Blocks are reference-counted. To ensure correctness, a block can be freed only when no active request depends on it.
def free(self, request: Request) -> None:
    self.coordinator.free(request.request_id)
Source: vllm/v1/core/kv_cache_manager.py (vllm-project/vllm on GitHub, v0.20.0)
When a request completes, KVCacheCoordinator.free() calls free() on each per-type manager. The manager removes that request's blocks from req_to_blocks and returns them to BlockPool.free_blocks(), where they become reclaimable once their reference count reaches zero.
def free(self, request_id: str) -> None:
    req_blocks = self.req_to_blocks.pop(request_id, [])
    ...  # remaining steps omitted
Source: vllm/v1/core/single_type_kv_cache_manager.py (vllm-project/vllm on GitHub, v0.20.0)
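A toy model makes the interaction between reference counts and eviction order easier to follow. ToyEvictor is a hypothetical class, not vLLM's BlockPool, and releasing blocks in reverse order is an assumption made here to reproduce the tail-first behavior described above:

```python
from collections import OrderedDict


class ToyEvictor:
    """Hypothetical ref-counted pool with an LRU-ordered free queue."""

    def __init__(self) -> None:
        self.ref_cnt: dict[int, int] = {}
        self.free_queue: OrderedDict[int, None] = OrderedDict()  # oldest first

    def touch(self, blocks: list[int]) -> None:
        """A request starts using these blocks (newly allocated or cache hits)."""
        for b in blocks:
            self.ref_cnt[b] = self.ref_cnt.get(b, 0) + 1
            self.free_queue.pop(b, None)  # in use, so no longer evictable

    def free(self, blocks: list[int]) -> None:
        """A request finished. Release tail blocks first so that they are
        evicted before the prefix blocks other requests may still reuse."""
        for b in reversed(blocks):
            self.ref_cnt[b] -= 1
            if self.ref_cnt[b] == 0:
                self.free_queue[b] = None

    def evict(self) -> int:
        """Reclaim the least recently freed block that has zero references."""
        block, _ = self.free_queue.popitem(last=False)
        return block


ev = ToyEvictor()
ev.touch([0, 1, 2])  # request A
ev.touch([0, 1, 3])  # request B shares the prefix blocks 0 and 1
ev.free([0, 1, 2])   # A finishes: only block 2 drops to zero references
print(list(ev.free_queue))     # [2]
ev.free([0, 1, 3])   # B finishes: 3, 1, 0 become evictable in that order
print(ev.evict(), ev.evict())  # 2 3
```

Blocks 0 and 1 stay cached longest, which is the desired outcome: they hold the shared prefix that a future request is most likely to hit.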
How the Request Flow Works
Understanding how KV cache works requires following a request through the system. At a high level, vLLM attempts to reuse previously computed KV blocks by matching prefixes, allocates new blocks for unseen tokens, and schedules requests in a way that maximizes reuse while balancing GPU utilization.
Prefix matching identifies previously computed KV blocks that can be reused for the incoming request. find_longest_cache_hit() takes the incoming request's block_hashes, searches the prefix cache for matching cached blocks, and returns the reusable KVCacheBlocks plus the number of computed tokens.
def find_longest_cache_hit(
    self,
    block_hashes: list[BlockHash],
    max_cache_hit_length: int,
) -> tuple[tuple[list[KVCacheBlock], ...], int]:
    ...  # body omitted
Source: vllm/v1/core/kv_cache_coordinator.py (vllm-project/vllm on GitHub, v0.20.0)
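Prefix matching over block hashes can be sketched as follows. This is a simplified illustration, not vLLM's hashing scheme: it assumes each block's hash folds in its parent's hash, uses SHA-256 over a string encoding of the tokens, and represents the prefix cache as a plain dictionary from hash to physical block ID.

```python
import hashlib

BLOCK_SIZE = 4  # illustrative value


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with its parent's hash, so that a hash
    identifies the block's tokens and everything that precedes them."""
    hashes, parent = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        payload = parent + "|" + ",".join(map(str, tokens[i : i + BLOCK_SIZE]))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def find_longest_hit(hashes: list[str], cache: dict[str, int]) -> tuple[list[int], int]:
    """Return (cached physical block IDs, number of already computed tokens)."""
    hit = []
    for h in hashes:
        if h not in cache:
            break  # the chain is broken, so later blocks cannot match
        hit.append(cache[h])
    return hit, len(hit) * BLOCK_SIZE


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
cache = {h: block_id for block_id, h in enumerate(block_hashes(system_prompt))}

print(find_longest_hit(block_hashes(system_prompt + [9, 10, 11, 12, 13]), cache))
# ([0, 1], 8)
print(find_longest_hit(block_hashes([0] + system_prompt), cache))
# ([], 0)
```

The second call shows the cost of strict prefix matching: the same eight tokens shifted by one position land in different blocks with different parents, so nothing is reused.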
New tokens are assigned newly allocated KV blocks on demand, and the returned block IDs are later used by the worker-side block table to extend the request's physical KV mapping.
new_blocks = self.coordinator.allocate_new_blocks(
    request.request_id,
    num_tokens_need_slot,
    num_tokens_main_model,
    num_encoder_tokens,
)
Source: vllm/v1/core/kv_cache_manager.py (vllm-project/vllm on GitHub, v0.20.0)
The scheduler decides which requests run together. Requests with shared prefixes benefit from co-location, improving cache hit rate. Even with perfect caching logic, poor scheduling can eliminate all cache benefits.
def schedule(self) -> SchedulerOutput:
    ...  # body omitted
Source: vllm/v1/core/sched/scheduler.py (vllm-project/vllm on GitHub, v0.20.0)
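One way to see why scheduling and caching are coupled is a toy scheduler with a per-step token budget. Everything here is hypothetical (ToyRequest, schedule_step, the budget of 1024), and it ignores chunked prefill, preemption, and decode requests. It assumes only that tokens covered by a prefix cache hit do not need to be recomputed and therefore do not consume budget:

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    request_id: str
    num_prompt_tokens: int
    num_cached_tokens: int  # tokens covered by a prefix cache hit


def schedule_step(waiting: list[ToyRequest], token_budget: int) -> list[str]:
    """Admit requests in arrival order until the per-step token budget runs
    out. Only uncached tokens count against the budget."""
    scheduled = []
    for req in waiting:
        cost = req.num_prompt_tokens - req.num_cached_tokens
        if cost > token_budget:
            break  # keep arrival order; do not skip ahead
        token_budget -= cost
        scheduled.append(req.request_id)
    return scheduled


cold = [ToyRequest("a", 600, 0), ToyRequest("b", 600, 0), ToyRequest("c", 600, 0)]
warm = [ToyRequest("a", 600, 0), ToyRequest("b", 600, 512), ToyRequest("c", 600, 512)]

print(schedule_step(cold, token_budget=1024))  # ['a']
print(schedule_step(warm, token_budget=1024))  # ['a', 'b', 'c']
```

With cache hits, three requests fit in the step that previously held one. If the cached blocks had been evicted before requests b and c arrived, the warm case would collapse back to the cold one.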
Before the forward pass, gpu_model_runner uses the request's block table to resolve the scheduled logical token positions into physical KV cache slot IDs. The resulting slot_mapping and block_table_tensor are then passed through the attention metadata so that kernels read and write the correct KV cache locations.
def _prepare_inputs(
    self,
    scheduler_output: "SchedulerOutput",
    num_scheduled_tokens: np.ndarray,
) -> tuple[
    torch.Tensor,
    SpecDecodeMetadata | None,
]:
    ...  # intermediate steps omitted
    self.input_batch.block_table.compute_slot_mapping(
        num_reqs,
        self.query_start_loc.gpu[: num_reqs + 1],
        self.positions[:total_num_scheduled_tokens],
    )
Source: vllm/v1/worker/gpu_model_runner.py (vllm-project/vllm on GitHub, v0.20.0)
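The slot-mapping step can be sketched for a small batch. This is plain Python rather than the tensor code vLLM runs, and the block table contents, block size, and the compute_slot_mapping signature here are assumptions for illustration. Each scheduled token is mapped to a flat slot index, assuming a slot ID is the physical block ID times the block size plus the offset within the block:

```python
BLOCK_SIZE = 4  # illustrative value

# Hypothetical block-table rows, one per request in the batch.
block_table = [
    [5, 7],     # request 0
    [3, 8, 6],  # request 1
]


def compute_slot_mapping(scheduled: list[tuple[int, int]]) -> list[int]:
    """Map (request_index, token_position) pairs to flat KV cache slot IDs."""
    slots = []
    for req_idx, pos in scheduled:
        physical_block = block_table[req_idx][pos // BLOCK_SIZE]
        slots.append(physical_block * BLOCK_SIZE + pos % BLOCK_SIZE)
    return slots


# Two tokens of request 0 (positions 4 and 5) and one of request 1 (position 9).
print(compute_slot_mapping([(0, 4), (0, 5), (1, 9)]))  # [28, 29, 25]
```

The attention kernel never needs to know which request owns a slot; it only receives the flat indices to write new keys and values into, plus the block table to read earlier ones.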
Future Work
While vLLM's prefix-based KV caching is highly effective, it has inherent limitations that motivate future work. Today, reuse is strongest when requests share the same prefix, because cached blocks are validated through prefix/block hash chains. One future direction is more general segment-level or chunk-level reuse, where systems try to reuse repeated prompt regions beyond strict prefixes. This could help when shared content appears later in prompts, but it is harder than prefix caching because KV states depend on position and surrounding context.
Another direction is distributed KV caching, where KV state can be stored, transferred, or shared across workers or replicas rather than remaining purely local to one GPU/node. This can improve reuse and scaling, but introduces challenges around latency, routing, placement, and consistency. Together, these directions move KV caching from a local per-worker optimization toward a broader system-level capability.
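To make the routing challenge concrete, here is a minimal sketch of prefix-affinity routing. It is a hypothetical policy, not a vLLM feature: the worker names, the block size, and the choice to hash only the first block are all assumptions. Requests that start with the same block of tokens are sent to the same worker, so their shared prefix is likely to be cached there:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative value
WORKERS = ["worker-0", "worker-1", "worker-2"]  # hypothetical replicas


def route(tokens: list[int]) -> str:
    """Pick a worker from a hash of the first block of prompt tokens."""
    first_block = ",".join(map(str, tokens[:BLOCK_SIZE]))
    digest = hashlib.sha256(first_block.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]


shared = [11, 12, 13, 14]
print(route(shared + [1, 2]) == route(shared + [7, 8, 9]))  # True
```

The trade-off is visible even in this toy: locality improves, but a very popular prefix concentrates load on one worker, which is why placement and load balancing have to be considered together.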
Conclusion
vLLM rethinks KV caching as a memory management and scheduling problem rather than a simple reuse mechanism. Through fixed-size block allocation, block tables for logical-to-physical indirection, prefix-aware reuse, reference-counted block lifetimes, and LRU-like eviction of cached blocks, it turns the KV cache into a virtualized resource that can be shared efficiently across requests.
However, the effectiveness of this system depends not only on its internal design, but also on how requests are scheduled, batched, and routed. These nuances show that KV caching is not merely a local optimization, but a core systems primitive for modern LLM inference. As inference systems evolve toward more general segment/chunk level reuse and distributed KV caching, these same principles will continue to shape scalable and efficient serving platforms.
References
- https://github.com/vllm-project/vllm
- https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
- https://bentoml.com/llm/inference-optimization/kv-cache-offloading
- https://cloud.google.com/blog/topics/developers-practitioners/boosting-llm-performance-with-tiered-kv-cache-on-google-kubernetes-engine/
- https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
- https://pub.towardsai.net/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d