KV Cache Implementation Inside vLLM

vLLM's KV cache is a virtualized memory system built on reusable blocks and prefix-aware scheduling that enables efficient, scalable LLM inference.

By Bhala Ranganathan · May. 07, 26 · Analysis

The key-value (KV) cache is a fundamental optimization in transformer-based LLM inference. It stores intermediate attention states, i.e., the keys and values computed for earlier tokens during prefill and previous decode steps, so that subsequent tokens can reuse them instead of recomputing from scratch. This significantly reduces compute cost and latency, especially for long-context or multi-turn agentic workloads. KV caching has been extensively discussed across several blogs and documentation [1, 2, 3, 4, 5].

Rather than revisit those well-known concepts, this article walks through the KV cache implementation in vLLM (v0.20.0). Using concrete code pointers and design insights, it aims to bridge the gap between high-level understanding and real-world system design.

KV Cache Is Not a Standard Cache

At first glance, KV cache sounds like a standard caching problem: storing computed results to reuse later. However, in systems like vLLM, the KV cache behaves fundamentally differently from traditional caches such as Redis. It is not a simple key-value lookup system sitting outside the execution path, but rather a tightly coupled component of the model's forward pass that must be accessed at every decoding step.

Unlike conventional caches, KV cache is dynamic, partially reusable, and deeply intertwined with GPU memory allocation. This means that KV cache design is as much about memory management and scheduling as it is about cache reuse. Thinking of it as just a cache hides its true complexity, and it is better understood as a virtualized memory layer for intermediate computation.

| Dimension | Traditional cache (e.g., Redis) | KV cache in LLMs (e.g., vLLM) |
| --- | --- | --- |
| Purpose | Avoid recomputing full results | Avoid recomputing intermediate attention state |
| Common access pattern | key -> value lookup | key -> key-value bytes lookup during model execution |
| Reuse type | All or nothing | Partial reuse (prefix based) |
| Storage | In-memory / persisted | Primarily GPU memory, which can also be persisted |
| Consistency | Eventual or strong consistency | Must match exact token sequence |
| Scheduling dependency | Independent | Strongly coupled with request scheduling |
| Failure mode | Cache miss results in recompute | Cache miss results in recompute |
| Cache locality sensitivity | Low (can often be distributed for better reliability and scalability) | Very high (node/worker local) and I/O-latency sensitive |

KVCacheManager (in kv_cache_manager.py) is a good entry point for seeing that the KV cache in vLLM is not a traditional cache but an active memory manager. During inference it handles allocation, reuse, eviction, prefix cache hits, and request lifecycle state for GPU KV cache memory.

Python
 
class KVCacheManager:
    def __init__(
        self,
        kv_cache_config: KVCacheConfig,
        max_model_len: int,
        hash_block_size: int,
        max_num_batched_tokens: int | None = None,
        enable_caching: bool = True,
        use_eagle: bool = False,
        log_stats: bool = False,
        enable_kv_cache_events: bool = False,
        dcp_world_size: int = 1,
        pcp_world_size: int = 1,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ) -> None:
        ...  # body omitted

Source: vllm/vllm/v1/core/kv_cache_manager.py (vllm-project/vllm, v0.20.0)

vLLM KV Cache Design

vLLM treats KV memory like virtual memory rather than as contiguous per-request tensors, which avoids the memory bottlenecks caused by fragmentation. Instead of reserving one large contiguous region per request, it introduces a layer of indirection via fixed-size blocks and block tables. This allows memory to be used efficiently, reused across requests, and extended dynamically as sequences grow. Two core primitives enable this design: block tables and an eviction mechanism. Together, they address memory fragmentation, reuse, and scalability.
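
To make the fragmentation argument concrete, here is a back-of-the-envelope comparison. It is only an illustration: the sequence lengths, `max_model_len`, and `block_size` values are hypothetical, and the two helper functions are not part of vLLM.

```python
import math


def contiguous_slots(num_requests: int, max_model_len: int) -> int:
    """Token slots reserved if every request preallocates its maximum length."""
    return num_requests * max_model_len


def paged_slots(seq_lens: list[int], block_size: int) -> int:
    """Token slots reserved if each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 900, 35, 2000]  # hypothetical current sequence lengths
max_model_len = 4096             # hypothetical
block_size = 16                  # hypothetical

used = sum(seq_lens)
reserved_contiguous = contiguous_slots(len(seq_lens), max_model_len)
reserved_paged = paged_slots(seq_lens, block_size)

print(f"used={used} contiguous={reserved_contiguous} paged={reserved_paged}")
```

With these numbers, the contiguous scheme reserves 16384 slots for 3035 tokens in use, while the block-based scheme reserves 3072. Under the block-based scheme each request wastes less than one block, regardless of how far it is from the maximum length.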

Block Tables

The block table is the central abstraction in vLLM's KV cache design. Instead of storing KV tensors contiguously in GPU memory, each request maintains a mapping from logical token positions to physical memory blocks. This indirection layer is conceptually similar to a page table in operating systems. When the model accesses KV for a given token, it resolves through the block table to locate the physical block in GPU memory. This design allows KV memory to be non-contiguous, shared across multiple requests, and dynamically extended as tokens are generated. The code pointers below are a good entry point to understand this concept in detail. 

vLLM maintains a BlockTable whose rows correspond to active request slots. Each row maps a request's logical token/block positions to physical KV cache block IDs in GPU memory. This indirection lets KV blocks be allocated non-contiguously and lets multiple requests refer to reused/shared cached blocks. 

Python
 
class BlockTable:
    def __init__(
        self,
        block_size: int,
        max_num_reqs: int,
        max_num_blocks_per_req: int,
        max_num_batched_tokens: int,
        pin_memory: bool,
        device: torch.device,
        kernel_block_size: int,
        cp_kv_cache_interleave_size: int,
    ):
        ...  # body omitted

Source: vllm/vllm/v1/worker/block_table.py (vllm-project/vllm, v0.20.0)
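
The page-table analogy can be sketched in a few lines. This is a simplified illustration and not vLLM's BlockTable class: the dictionary layout, the request IDs, the block IDs, and the `physical_slot` helper are all hypothetical.

```python
BLOCK_SIZE = 4  # tokens per block; hypothetical and small for readability

# Hypothetical per-request block tables: the list index is the logical block
# number, the value is the physical block ID in the shared KV cache.
block_tables = {
    "req-a": [7, 2, 9],  # physical blocks need not be contiguous or ordered
    "req-b": [7, 2, 5],  # shares its first two physical blocks with req-a
}


def physical_slot(request_id: str, token_pos: int) -> int:
    """Translate a logical token position into a flat physical slot index."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    physical_block = block_tables[request_id][logical_block]
    return physical_block * BLOCK_SIZE + offset


shared_a = physical_slot("req-a", 5)  # inside the shared prefix
shared_b = physical_slot("req-b", 5)
own_a = physical_slot("req-a", 9)     # past the shared prefix
own_b = physical_slot("req-b", 9)
print(shared_a, shared_b, own_a, own_b)
```

Token position 5 resolves to slot 9 for both requests because they point at the same physical block, while position 9 resolves to slots 37 and 21 because each request owns a different third block.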


vLLM's KV cache is divided into fixed-size blocks, each represented by a KVCacheBlock. These blocks are the fundamental unit of allocation, prefix cache reuse, reference counting, and eviction. BlockPool, shown below, is a good starting point for understanding that lifecycle.

Python
 
class BlockPool:
    def __init__(
        self,
        num_gpu_blocks: int,
        enable_caching: bool,
        hash_block_size: int,
        enable_kv_cache_events: bool = False,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ):
        ...  # body omitted

Source: vllm/vllm/v1/core/block_pool.py (vllm-project/vllm, v0.20.0)


allocate_slots() asks the coordinator how many blocks are needed, checks the shared block_pool for free capacity, and then calls allocate_new_blocks() only for the slots the current request needs. Blocks are thus assigned dynamically from a shared pool rather than preallocated per request.

Python
 
def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
    ...  # body omitted

Source: vllm/vllm/v1/core/kv_cache_manager.py (vllm-project/vllm, v0.20.0)

 

Cache Eviction

Eviction in vLLM is more complex than a typical least recently used (LRU) policy due to dependencies between tokens. KV blocks form a logical prefix chain, meaning later tokens depend on earlier ones. As a result, eviction cannot arbitrarily remove blocks without breaking correctness. Instead, vLLM uses a reference count-based mechanism combined with recency heuristics. Blocks are only eligible for eviction when no active request depends on them, and even then, eviction typically proceeds from the tail of sequences to preserve prefix integrity. This constrained eviction behavior ensures correctness while still allowing the system to operate under memory pressure.

Because blocks are reference-counted, a block can be reclaimed only once no active request still depends on it.

Python
 
def free(self, request: Request) -> None:
    self.coordinator.free(request.request_id)

Source: vllm/vllm/v1/core/kv_cache_manager.py (vllm-project/vllm, v0.20.0)


When a request completes, KVCacheCoordinator.free() calls free() on each per-type manager. The manager removes that request's blocks from req_to_blocks and returns them to BlockPool.free_blocks(), where they become reclaimable once their reference count reaches zero.

Python
 
def free(self, request_id: str) -> None:
    req_blocks = self.req_to_blocks.pop(request_id, [])
    ...  # remaining steps omitted

Source: vllm/vllm/v1/core/single_type_kv_cache_manager.py (vllm-project/vllm, v0.20.0)
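
The interaction between reference counts and eviction order can be sketched as below. This is an illustration of the idea rather than vLLM's data structures: `RefCountedPool` and its methods are hypothetical, and the tail-first release order is an assumption chosen to match the behavior described in this section.

```python
from collections import OrderedDict


class RefCountedPool:
    """Hypothetical pool: reference counts plus a queue of idle blocks."""

    def __init__(self) -> None:
        self.ref_cnt: dict[int, int] = {}
        # Idle blocks in eviction order: the first entry is evicted first.
        self.idle: OrderedDict[int, None] = OrderedDict()

    def acquire(self, block_id: int) -> None:
        """A request starts using a block, so it can no longer be evicted."""
        self.ref_cnt[block_id] = self.ref_cnt.get(block_id, 0) + 1
        self.idle.pop(block_id, None)

    def release_request(self, blocks: list[int]) -> None:
        """Drop one reference per block, visiting the tail of the sequence first."""
        for block_id in reversed(blocks):
            self.ref_cnt[block_id] -= 1
            if self.ref_cnt[block_id] == 0:
                self.idle[block_id] = None  # queued behind older idle blocks

    def evict_one(self) -> int:
        """Reclaim the block at the front of the eviction queue."""
        block_id, _ = self.idle.popitem(last=False)
        return block_id


pool = RefCountedPool()
req_a, req_b = [0, 1, 2], [0, 1, 3]  # blocks 0 and 1 form a shared prefix
for block_id in req_a + req_b:
    pool.acquire(block_id)

pool.release_request(req_a)
idle_after_a = list(pool.idle)  # only block 2 lost its last reference

pool.release_request(req_b)
eviction_order = [pool.evict_one() for _ in range(len(pool.idle))]
print(idle_after_a, eviction_order)
```

After the first request finishes, only its private tail block (2) is evictable, because blocks 0 and 1 are still referenced. Once both requests are done, the eviction order is 2, 3, 1, 0, so the shared prefix blocks are the last to go.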


How the Request Flow Works 

Understanding how KV cache works requires following a request through the system. At a high level, vLLM attempts to reuse previously computed KV blocks by matching prefixes, allocates new blocks for unseen tokens, and schedules requests in a way that maximizes reuse while balancing GPU utilization. 

Prefix matching identifies previously computed KV blocks that can be reused for the incoming request. find_longest_cache_hit() takes the incoming request's block_hashes, searches the prefix cache for matching cached blocks, and returns the reusable KVCacheBlocks plus the number of computed tokens.

Python
 
def find_longest_cache_hit(
    self,
    block_hashes: list[BlockHash],
    max_cache_hit_length: int,
) -> tuple[tuple[list[KVCacheBlock], ...], int]:
    ...  # body omitted

Source: vllm/vllm/v1/core/kv_cache_coordinator.py (vllm-project/vllm, v0.20.0)
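
Prefix matching over a chain of block hashes can be sketched as follows. This is a simplified stand-in, not vLLM's hashing scheme: the SHA-256 chaining, the `block_hashes` and `longest_cache_hit` helpers, and the dictionary used as a cache index are all assumptions made for illustration. Only full blocks are hashed in this sketch.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int) -> list[str]:
    """Hash each full block together with its parent's hash (a hash chain)."""
    hashes: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


def longest_cache_hit(hashes: list[str], cached: dict[str, int]) -> list[int]:
    """Return the physical block IDs of the longest cached prefix."""
    hit: list[int] = []
    for h in hashes:
        if h not in cached:
            break  # the first miss ends the reusable prefix
        hit.append(cached[h])
    return hit


BLOCK_SIZE = 4
first = list(range(1, 11))  # 10 tokens, so 2 full blocks get cached
cached = {h: i for i, h in enumerate(block_hashes(first, BLOCK_SIZE))}

same_prefix = [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]
hit = longest_cache_hit(block_hashes(same_prefix, BLOCK_SIZE), cached)

# The second block holds the same tokens as before but follows a different first block.
shifted = [0, 2, 3, 4, 5, 6, 7, 8]
miss = longest_cache_hit(block_hashes(shifted, BLOCK_SIZE), cached)

num_computed_tokens = len(hit) * BLOCK_SIZE
print(hit, num_computed_tokens, miss)
```

The request with the same first eight tokens reuses blocks 0 and 1, which covers 8 computed tokens. The shifted request reuses nothing, even though its second block contains identical tokens, because each hash depends on everything before it.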


New tokens are assigned newly allocated KV blocks on demand, and the returned block IDs are later used by the worker side block table to extend the request's physical KV mapping.

Python
 
new_blocks = self.coordinator.allocate_new_blocks(
    request.request_id,
    num_tokens_need_slot,
    num_tokens_main_model,
    num_encoder_tokens,
)

Source: vllm/vllm/v1/core/kv_cache_manager.py (vllm-project/vllm, v0.20.0)


The scheduler decides which requests run together. Requests with shared prefixes benefit from co-location, improving cache hit rate. Even with perfect caching logic, poor scheduling can eliminate all cache benefits.

Python
 
def schedule(self) -> SchedulerOutput:
    ...  # body omitted

Source: vllm/vllm/v1/core/sched/scheduler.py (vllm-project/vllm, v0.20.0)
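
One way to see why scheduling and caching interact is a toy token-budget scheduler. This is a deliberately simplified sketch and not vLLM's scheduler: `schedule_sketch`, the tuple layout, and the numbers are hypothetical, and it assumes strict FIFO admission with no chunked prefill or preemption.

```python
def schedule_sketch(
    waiting: list[tuple[str, int, int]], token_budget: int
) -> list[str]:
    """Admit requests in order while their uncached tokens fit the budget.

    Each entry in waiting is (request_id, num_prompt_tokens, num_cached_tokens).
    """
    scheduled: list[str] = []
    for request_id, num_tokens, num_cached in waiting:
        num_new = num_tokens - num_cached  # only uncached tokens need compute
        if num_new > token_budget:
            break  # FIFO: stop at the first request that does not fit
        token_budget -= num_new
        scheduled.append(request_id)
    return scheduled


# Three 600-token prompts, first with no cache hits, then with a 512-token cached prefix.
cold = schedule_sketch([("a", 600, 0), ("b", 600, 0), ("c", 600, 0)], 1024)
warm = schedule_sketch([("a", 600, 512), ("b", 600, 512), ("c", 600, 512)], 1024)
print(cold, warm)
```

With no cached prefix, only one request fits in the step. When the shared prefix is already cached, all three fit in the same step.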


Before the model's forward pass, gpu_model_runner uses the request's block table to resolve scheduled logical token positions into physical KV cache slot IDs. During attention execution, the resulting slot_mapping and block_table_tensor are passed through attention metadata so kernels can read/write the correct KV cache locations.

Python
 
def _prepare_inputs(
    self,
    scheduler_output: "SchedulerOutput",
    num_scheduled_tokens: np.ndarray,
) -> tuple[
    torch.Tensor,
    SpecDecodeMetadata | None,
]:
    ...  # earlier steps omitted
    self.input_batch.block_table.compute_slot_mapping(
        num_reqs,
        self.query_start_loc.gpu[: num_reqs + 1],
        self.positions[:total_num_scheduled_tokens],
    )

Source: vllm/vllm/v1/worker/gpu_model_runner.py (vllm-project/vllm, v0.20.0)
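
For a whole scheduled batch, the resolution step amounts to flattening every new token of every request into one list of physical slots. The pure-Python sketch below illustrates that idea only. It is not the vLLM kernel or its tensor layout, and `compute_slot_mapping_sketch` and its arguments are hypothetical.

```python
def compute_slot_mapping_sketch(
    block_tables: list[list[int]],
    num_computed: list[int],
    num_scheduled: list[int],
    block_size: int,
) -> list[int]:
    """Flatten a scheduled batch into one physical slot index per new token."""
    slot_mapping: list[int] = []
    for table, start, count in zip(block_tables, num_computed, num_scheduled):
        for pos in range(start, start + count):
            block_id = table[pos // block_size]
            slot_mapping.append(block_id * block_size + pos % block_size)
    return slot_mapping


slot_mapping = compute_slot_mapping_sketch(
    block_tables=[[7, 2], [5]],  # hypothetical physical block IDs per request
    num_computed=[3, 0],         # tokens whose KV is already in the cache
    num_scheduled=[2, 3],        # new tokens to process this step
    block_size=4,
)
print(slot_mapping)
```

The first request writes its two new tokens to slots 31 and 8, which straddle a block boundary. The second request writes its three tokens to slots 20, 21, and 22.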


Future Work

While vLLM's prefix-based KV caching is highly effective, it has inherent limitations that motivate future work. Today, reuse is strongest when requests share the same prefix, because cached blocks are validated through prefix/block hash chains. One future direction is more general segment-level or chunk-level reuse, where systems try to reuse repeated prompt regions beyond strict prefixes. This could help when shared content appears later in prompts, but it is harder than prefix caching because KV states depend on position and surrounding context.

Another direction is distributed KV caching, where KV state can be stored, transferred, or shared across workers or replicas rather than remaining purely local to one GPU/node. This can improve reuse and scaling, but introduces challenges around latency, routing, placement, and consistency. Together, these directions move KV caching from a local per-worker optimization toward a broader system-level capability.

Conclusion

vLLM rethinks KV caching as a memory management and scheduling problem rather than a simple reuse mechanism. Through fixed-size block allocation, block tables for logical-to-physical indirection, prefix-aware reuse, reference-counted block lifetimes, and LRU-like eviction of cached blocks, it turns the KV cache into a virtualized resource that can be shared efficiently across requests.

However, the effectiveness of this system depends not only on its internal design, but also on how requests are scheduled, batched, and routed. These nuances show that KV caching is not merely a local optimization, but a core systems primitive for modern LLM inference. As inference systems evolve toward more general segment/chunk level reuse and distributed KV caching, these same principles will continue to shape scalable and efficient serving platforms.

References

  • https://github.com/vllm-project/vllm
  • https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
  • https://bentoml.com/llm/inference-optimization/kv-cache-offloading
  • https://cloud.google.com/blog/topics/developers-practitioners/boosting-llm-performance-with-tiered-kv-cache-on-google-kubernetes-engine/
  • https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
  • https://pub.towardsai.net/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d