KV Cache Implementation Inside vLLM

vLLM's KV cache is a virtualized memory system built on reusable blocks and prefix-aware scheduling that enables efficient, scalable LLM inference.

Bhala Ranganathan

May. 07, 26 · Analysis

The key-value (KV) cache is a fundamental optimization in transformer-based LLM inference. It stores intermediate attention states, i.e., the keys and values computed for earlier tokens during prefill and previous decode steps, so that each new token can reuse them instead of recomputing them from scratch. This significantly reduces compute cost and latency, especially for long-context or multi-turn agentic workloads. KV caching has been discussed extensively in blogs and documentation [1, 2, 3, 4, 5].

Instead of revisiting those well-known concepts, this article examines the KV cache implementation in vLLM (v0.20.0). By walking through the internals with concrete code pointers and design insights, it aims to bridge the gap between high-level understanding and real-world system design.

KV Cache Is Not a Standard Cache

At first glance, KV cache sounds like a standard caching problem: store computed results and reuse them later. However, in systems like vLLM, KV cache behaves fundamentally differently from traditional caches such as Redis. It is not a simple key-value lookup system sitting outside the execution path, but a tightly coupled component of the model's forward pass that must be accessed at every decoding step.

Unlike conventional caches, KV cache is dynamic, partially reusable, and deeply intertwined with GPU memory allocation. This means that KV cache design is as much about memory management and scheduling as it is about cache reuse. Thinking of it as just a cache hides its true complexity, and it is better understood as a virtualized memory layer for intermediate computation.

Dimension	Traditional cache (e.g., Redis)	KV cache in LLMs (e.g., vLLM)
Purpose	Avoid recomputing full results	Avoid recomputing intermediate attention state
Common access pattern	Key -> value lookup	Key -> key-value bytes lookup during model execution
Reuse type	All or nothing	Partial reuse (prefix-based)
Storage	In-memory / persisted	Primarily GPU memory, which can also be persisted
Consistency	Eventual or strong consistency	Must match the exact token sequence
Scheduling dependency	Independent	Strongly coupled with request scheduling
Failure mode	Cache miss results in recompute	Cache miss results in recompute
Cache locality sensitivity	Low (can often be distributed for better reliability and scalability)	Very high (node/worker local) and sensitive to I/O latency

The KVCacheManager class in kv_cache_manager.py is a good entry point for seeing that the KV cache in vLLM is not a traditional cache but an active memory manager. During inference, it handles allocation, reuse, eviction, prefix cache hits, and request lifecycle state for GPU KV cache memory.

    Python

    class KVCacheManager:
        def __init__(
            self,
            kv_cache_config: KVCacheConfig,
            max_model_len: int,
            hash_block_size: int,
            max_num_batched_tokens: int | None = None,
            enable_caching: bool = True,
            use_eagle: bool = False,
            log_stats: bool = False,
            enable_kv_cache_events: bool = False,
            dcp_world_size: int = 1,
            pcp_world_size: int = 1,
            metrics_collector: KVCacheMetricsCollector | None = None,
        ) -> None:
            ...  # body omitted in this excerpt

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

vLLM KV Cache Design

vLLM's KV cache design treats KV memory like virtual memory rather than as contiguous per-request tensors, which avoids fragmentation and over-reservation. Instead of allocating one large contiguous region per request, it introduces a layer of indirection via fixed-size blocks and block tables. This allows memory to be used efficiently, reused across requests, and extended dynamically as sequences grow. Two core primitives enable this design: block tables and an eviction mechanism. Together, they address memory fragmentation, reuse, and scalability.
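To make the fragmentation argument concrete, the following back-of-the-envelope sketch compares reserving a contiguous region of max_model_len slots per request with allocating fixed-size blocks on demand. The request lengths, the block size of 16, the max_model_len of 4096, and both helper names are illustrative assumptions, not vLLM APIs or measured values.

```python
import math

def contiguous_slots(seq_lens: list[int], max_model_len: int) -> int:
    """Token slots reserved if every request gets a full max_model_len region."""
    return len(seq_lens) * max_model_len

def paged_slots(seq_lens: list[int], block_size: int) -> int:
    """Token slots reserved if each request only holds the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [100, 900, 37]          # hypothetical current sequence lengths
used = sum(seq_lens)               # token slots actually needed
reserved_contiguous = contiguous_slots(seq_lens, max_model_len=4096)
reserved_paged = paged_slots(seq_lens, block_size=16)

print("wasted (contiguous):", reserved_contiguous - used)
print("wasted (blocks):    ", reserved_paged - used)
```

With block allocation, the unused space is bounded by less than one block per request, whereas contiguous reservation wastes everything between the current length and the maximum length.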

Block Tables

The block table is the central abstraction in vLLM's KV cache design. Instead of storing KV tensors contiguously in GPU memory, each request maintains a mapping from logical token positions to physical memory blocks. This indirection layer is conceptually similar to a page table in operating systems. When the model accesses KV for a given token, it resolves through the block table to locate the physical block in GPU memory. This design allows KV memory to be non-contiguous, shared across multiple requests, and dynamically extended as tokens are generated. The code pointers below are a good entry point to understand this concept in detail.

vLLM maintains a BlockTable whose rows correspond to active request slots. Each row maps a request's logical token/block positions to physical KV cache block IDs in GPU memory. This indirection lets KV blocks be allocated non-contiguously and lets multiple requests refer to reused/shared cached blocks.

    Python

    class BlockTable:
        def __init__(
            self,
            block_size: int,
            max_num_reqs: int,
            max_num_blocks_per_req: int,
            max_num_batched_tokens: int,
            pin_memory: bool,
            device: torch.device,
            kernel_block_size: int,
            cp_kv_cache_interleave_size: int,
        ):
            ...  # body omitted in this excerpt

Source: (v0.20.0) vllm/vllm/v1/worker/block_table.py at main · vllm-project/vllm · GitHub
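The page-table analogy can be illustrated with a toy block table. This is a minimal sketch, not vLLM's BlockTable: the class ToyBlockTable, its methods, the dictionary-based rows, and the block size of 4 are all assumptions chosen for readability.

```python
BLOCK_SIZE = 4  # tokens per block; deliberately small (assumption)

class ToyBlockTable:
    """Maps each request's logical block index to a physical block ID."""

    def __init__(self) -> None:
        self.rows: dict[str, list[int]] = {}

    def append_blocks(self, req_id: str, block_ids: list[int]) -> None:
        # Growing a request only appends IDs; existing KV data never moves.
        self.rows.setdefault(req_id, []).extend(block_ids)

    def slot_for(self, req_id: str, token_pos: int) -> int:
        # Same arithmetic as a page table: page number plus offset.
        logical_block, offset = divmod(token_pos, BLOCK_SIZE)
        physical_block = self.rows[req_id][logical_block]
        return physical_block * BLOCK_SIZE + offset

table = ToyBlockTable()
table.append_blocks("req-a", [7, 2])   # non-contiguous physical blocks
table.append_blocks("req-b", [7, 9])   # block 7 is shared (common prefix)
table.append_blocks("req-a", [5])      # req-a grows by one block

slot_a = table.slot_for("req-a", 6)    # logical block 1, offset 2, physical block 2
slot_b = table.slot_for("req-b", 6)    # logical block 1, offset 2, physical block 9
shared = table.slot_for("req-a", 3) == table.slot_for("req-b", 3)
print(slot_a, slot_b, shared)
```

Both requests resolve position 3 to the same physical slot because their first logical block points at the same physical block, which is how shared prefixes avoid duplicate storage.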

vLLM's KV cache is divided into fixed-size blocks, each tracked as a KVCacheBlock. These blocks are the fundamental unit of allocation, prefix cache reuse, reference counting, and eviction. The BlockPool constructor below is a good starting point for understanding that lifecycle.

    Python

    class BlockPool:
        def __init__(
            self,
            num_gpu_blocks: int,
            enable_caching: bool,
            hash_block_size: int,
            enable_kv_cache_events: bool = False,
            metrics_collector: KVCacheMetricsCollector | None = None,
        ):
            ...  # body omitted in this excerpt

Source: (v0.20.0) vllm/vllm/v1/core/block_pool.py at main · vllm-project/vllm · GitHub

allocate_slots() asks the coordinator how many blocks are needed, checks the shared block_pool for free capacity, and then calls allocate_new_blocks() only for the slots the current request needs. This shows that blocks are assigned dynamically from a shared pool rather than preallocated per request.

    Python

    def allocate_slots(
        self,
        request: Request,
        num_new_tokens: int,
        num_new_computed_tokens: int = 0,
        new_computed_blocks: KVCacheBlocks | None = None,
        num_lookahead_tokens: int = 0,
        num_external_computed_tokens: int = 0,
        delay_cache_blocks: bool = False,
        num_encoder_tokens: int = 0,
        full_sequence_must_fit: bool = False,
    ) -> KVCacheBlocks | None:
        ...  # body omitted in this excerpt

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub
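The on-demand allocation pattern described above can be sketched as follows. This is a simplified model under stated assumptions: ToyPool and allocate_slots_sketch are hypothetical names, the pool is a plain list of free IDs, and the real method handles many more cases (prefix hits, lookahead tokens, encoder tokens).

```python
import math
from typing import Optional

class ToyPool:
    """A shared pool of free physical block IDs (hypothetical stand-in)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_ids = list(range(num_blocks))

    def num_free(self) -> int:
        return len(self.free_ids)

    def take(self, n: int) -> list[int]:
        taken, self.free_ids = self.free_ids[:n], self.free_ids[n:]
        return taken

def allocate_slots_sketch(
    pool: ToyPool,
    req_blocks: list[int],
    num_tokens_total: int,
    block_size: int,
) -> Optional[list[int]]:
    """Grow req_blocks so it can hold num_tokens_total tokens.

    Returns the newly assigned block IDs, or None if the pool cannot
    satisfy the request; the caller must then handle the shortfall.
    """
    needed = math.ceil(num_tokens_total / block_size) - len(req_blocks)
    if needed <= 0:
        return []                      # the last block still has room
    if needed > pool.num_free():
        return None                    # not enough free blocks right now
    new_ids = pool.take(needed)
    req_blocks.extend(new_ids)
    return new_ids

pool = ToyPool(num_blocks=4)
blocks_a: list[int] = []
first = allocate_slots_sketch(pool, blocks_a, num_tokens_total=20, block_size=16)
second = allocate_slots_sketch(pool, blocks_a, num_tokens_total=21, block_size=16)
third = allocate_slots_sketch(pool, blocks_a, num_tokens_total=100, block_size=16)
print(first, second, third)
```

The second call allocates nothing because token 21 still fits in the request's second block, and the third call fails without modifying the pool because the shared pool has too few free blocks.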

Cache Eviction

Eviction in vLLM is more complex than a typical least recently used (LRU) policy because of dependencies between tokens. KV blocks form a logical prefix chain, meaning later blocks depend on earlier ones. As a result, eviction cannot remove arbitrary blocks without breaking correctness. Instead, vLLM combines reference counting with recency heuristics. Blocks are eligible for eviction only when no active request depends on them, and even then, eviction typically proceeds from the tail of sequences to preserve prefix integrity. This constrained eviction behavior ensures correctness while still allowing the system to operate under memory pressure.

Blocks are reference-counted. To ensure correctness, a block can be freed only when no active request depends on it.

    Python

    def free(self, request: Request) -> None:
        self.coordinator.free(request.request_id)

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

When a request completes, KVCacheCoordinator.free() calls free() on each per-type manager. The manager removes that request's blocks from req_to_blocks and returns them to BlockPool.free_blocks(), where they become reclaimable once their reference count reaches zero.

    Python

    def free(self, request_id: str) -> None:
        req_blocks = self.req_to_blocks.pop(request_id, [])
        # ... (rest of the method omitted in this excerpt)

Source: (v0.20.0) vllm/vllm/v1/core/single_type_kv_cache_manager.py at main · vllm-project/vllm · GitHub
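The interaction between reference counts and tail-first eviction can be sketched with a small model. ToyRefCountedPool and its methods are hypothetical, and the ordered queue is an assumption used to illustrate the behavior described in this section rather than a copy of vLLM's data structures.

```python
from collections import OrderedDict

class ToyRefCountedPool:
    """Blocks with ref counts; blocks at ref count 0 wait in an eviction queue."""

    def __init__(self, num_blocks: int) -> None:
        self.ref_cnt = {b: 0 for b in range(num_blocks)}
        # Insertion order is eviction order: the front is evicted first.
        self.evictable = OrderedDict((b, None) for b in range(num_blocks))

    def acquire(self, block_id: int) -> None:
        # A block used by any request must not be evictable.
        if self.ref_cnt[block_id] == 0:
            self.evictable.pop(block_id, None)
        self.ref_cnt[block_id] += 1

    def release_request(self, req_blocks: list[int]) -> None:
        # Release tail blocks first so they sit ahead of prefix blocks
        # in the eviction queue and the shared prefix survives longest.
        for block_id in reversed(req_blocks):
            self.ref_cnt[block_id] -= 1
            if self.ref_cnt[block_id] == 0:
                self.evictable[block_id] = None

    def evict_one(self) -> int:
        block_id, _ = self.evictable.popitem(last=False)
        return block_id

pool = ToyRefCountedPool(num_blocks=4)
req_a = [0, 1, 2]   # blocks 0 and 1 hold a prefix shared with req_b
req_b = [0, 1, 3]
for b in req_a + req_b:
    pool.acquire(b)

pool.release_request(req_a)
evictable_after_a = list(pool.evictable)   # blocks 0 and 1 are still in use
pool.release_request(req_b)
eviction_order = [pool.evict_one() for _ in range(4)]
print(evictable_after_a, eviction_order)
```

After the first request finishes, only its private tail block becomes evictable. Once both requests finish, the tail blocks are evicted before the shared prefix blocks, and the first prefix block goes last.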

How the Request Flow Works

Understanding how KV cache works requires following a request through the system. At a high level, vLLM attempts to reuse previously computed KV blocks by matching prefixes, allocates new blocks for unseen tokens, and schedules requests in a way that maximizes reuse while balancing GPU utilization.

Prefix matching identifies previously computed KV blocks that can be reused for the incoming request. find_longest_cache_hit() takes the incoming request's block_hashes, searches the prefix cache for matching cached blocks, and returns the reusable KVCacheBlocks plus the number of computed tokens.

    Python

    def find_longest_cache_hit(
        self,
        block_hashes: list[BlockHash],
        max_cache_hit_length: int,
    ) -> tuple[tuple[list[KVCacheBlock], ...], int]:
        ...  # body omitted in this excerpt

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_coordinator.py at main · vllm-project/vllm · GitHub
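Prefix matching over block hashes can be sketched with a simple hash chain. This is an illustration under stated assumptions: the functions block_hashes and longest_cache_hit, the use of SHA-256 over a string payload, the block size of 4, and the dictionary cache are all hypothetical and do not reproduce vLLM's hashing scheme.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (assumption for the example)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with its parent's hash (a hash chain)."""
    hashes: list[str] = []
    parent = ""
    for i in range(len(token_ids) // BLOCK_SIZE):
        chunk = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        payload = parent + "|" + ",".join(map(str, chunk))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes

def longest_cache_hit(
    hashes: list[str], cache: dict[str, int]
) -> tuple[list[int], int]:
    """Return cached block IDs for the longest matching prefix and the
    number of tokens they cover."""
    hit_blocks: list[int] = []
    for h in hashes:
        if h not in cache:
            break          # the first miss ends the reusable prefix
        hit_blocks.append(cache[h])
    return hit_blocks, len(hit_blocks) * BLOCK_SIZE

system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
first = system_prompt + [10, 11, 12, 13]
cache = {h: block_id for block_id, h in enumerate(block_hashes(first))}

second = system_prompt + [20, 21, 22, 23, 24]
blocks, num_cached_tokens = longest_cache_hit(block_hashes(second), cache)
print(blocks, num_cached_tokens)
```

Because each hash includes its parent, the same four tokens at a different position or after a different prefix produce a different hash, which is why reuse is restricted to exact prefixes.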

New tokens are assigned newly allocated KV blocks on demand, and the returned block IDs are later used by the worker-side block table to extend the request's physical KV mapping.

    Python

    new_blocks = self.coordinator.allocate_new_blocks(
        request.request_id,
        num_tokens_need_slot,
        num_tokens_main_model,
        num_encoder_tokens,
    )

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

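The scheduler decides which requests run together. Requests with shared prefixes benefit from co-location, which improves the cache hit rate. Even with perfect caching logic, poor scheduling can erase the benefit: if requests that share a prefix run far apart, their cached blocks may be evicted in between and the prefix must be recomputed.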

    Python

    def schedule(self) -> SchedulerOutput:
        ...  # body omitted in this excerpt

Source: (v0.20.0) vllm/vllm/v1/core/sched/scheduler.py at main · vllm-project/vllm · GitHub
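The effect of request ordering on cache reuse can be seen in a toy simulation. This does not model vLLM's scheduler: count_prefix_hits is a hypothetical helper, each request is reduced to a prefix ID, and the cache is assumed to be a plain LRU that holds a fixed number of prefixes.

```python
from collections import OrderedDict

def count_prefix_hits(order: list[str], capacity: int) -> int:
    """Count requests whose prefix is still cached when they run.

    `order` lists the prefix ID of each request in execution order;
    `capacity` is how many distinct prefixes fit in the cache (LRU).
    """
    cache = OrderedDict()
    hits = 0
    for prefix in order:
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # mark as most recently used
        else:
            cache[prefix] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits

interleaved = ["A", "B", "C", "A", "B", "C"]
grouped = ["A", "A", "B", "B", "C", "C"]

hits_interleaved = count_prefix_hits(interleaved, capacity=2)
hits_grouped = count_prefix_hits(grouped, capacity=2)
print(hits_interleaved, hits_grouped)
```

The same six requests and the same cache produce no hits in the interleaved order and a hit for every repeated prefix in the grouped order, so ordering alone decides whether the cache helps under memory pressure.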

Before the model's forward pass, gpu_model_runner uses the request's block table to resolve scheduled logical token positions into physical KV cache slot IDs. During attention execution, the resulting slot_mapping and block_table_tensor are passed through attention metadata so that kernels can read and write the correct KV cache locations.

    Python

    def _prepare_inputs(
        self,
        scheduler_output: "SchedulerOutput",
        num_scheduled_tokens: np.ndarray,
    ) -> tuple[
        torch.Tensor,
        SpecDecodeMetadata | None,
    ]:
        # ... (intervening code omitted in this excerpt)
        self.input_batch.block_table.compute_slot_mapping(
            num_reqs,
            self.query_start_loc.gpu[: num_reqs + 1],
            self.positions[:total_num_scheduled_tokens],
        )

Source: (v0.20.0) vllm/vllm/v1/worker/gpu_model_runner.py at main · vllm-project/vllm · GitHub
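For a whole batch, the resolution step amounts to flattening every scheduled token into one list of slot indices. The sketch below is a pure-Python approximation: compute_slot_mapping_sketch and its parameters are hypothetical, and the real code operates on preallocated tensors rather than Python lists.

```python
def compute_slot_mapping_sketch(
    block_table: list[list[int]],
    req_indices: list[int],
    positions: list[int],
    block_size: int,
) -> list[int]:
    """Resolve each scheduled token to a flat KV cache slot index.

    req_indices[i] is the block-table row of token i, and positions[i]
    is its logical position within that request.
    """
    slots = []
    for row, pos in zip(req_indices, positions):
        block_id = block_table[row][pos // block_size]
        slots.append(block_id * block_size + pos % block_size)
    return slots

# Two requests in one batch: row 0 decodes one token at position 5,
# while row 1 prefills positions 0 through 2.
block_table = [[3, 8], [6]]
slot_mapping = compute_slot_mapping_sketch(
    block_table,
    req_indices=[0, 1, 1, 1],
    positions=[5, 0, 1, 2],
    block_size=4,
)
print(slot_mapping)
```

The attention kernel then needs only this flat mapping to know where to write each new key and value, regardless of which request a token belongs to.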

Future Work

While vLLM's prefix-based KV caching is highly effective, it has inherent limitations that motivate future work. Today, reuse is strongest when requests share the same prefix, because cached blocks are validated through prefix/block hash chains. One future direction is more general segment-level or chunk-level reuse, where systems try to reuse repeated prompt regions beyond strict prefixes. This could help when shared content appears later in prompts, but it is harder than prefix caching because KV states depend on position and surrounding context.

Another direction is distributed KV caching, where KV state can be stored, transferred, or shared across workers or replicas rather than remaining purely local to one GPU/node. This can improve reuse and scaling, but introduces challenges around latency, routing, placement, and consistency. Together, these directions move KV caching from a local per-worker optimization toward a broader system-level capability.

Conclusion

vLLM rethinks KV caching as a memory management and scheduling problem rather than a simple reuse mechanism. Through fixed-size block allocation, block tables for logical-to-physical indirection, prefix-aware reuse, reference-counted block lifetimes, and LRU-like eviction of cached blocks, it turns the KV cache into a virtualized resource that can be shared efficiently across requests.

However, the effectiveness of this system depends not only on its internal design, but also on how requests are scheduled, batched, and routed. These nuances show that KV caching is not merely a local optimization, but a core systems primitive for modern LLM inference. As inference systems evolve toward more general segment/chunk level reuse and distributed KV caching, these same principles will continue to shape scalable and efficient serving platforms.

References
