KV Caching: The Hidden Speed Boost Behind Real-Time LLMs

LLMs slow down as outputs grow due to repeated attention over past tokens. KV caching skips redundant work by reusing keys and values, enabling 3–4× faster inference.

Srinidhi Goud Myadaboyina

Aug. 01, 25 · Analysis


Introduction: Why LLM Performance Matters

Ever notice how your AI assistant starts out snappy but then begins to drag?

It’s not just you. That slowdown is baked into how large language models (LLMs) work. Most of them generate text one token at a time using something called autoregressive decoding. And here's the catch: the longer the response gets, the more work the model has to do at every step. So the lag adds up.

Now, picture that in practice:

You're chatting with a support bot, and suddenly, it’s taking forever to reply.
Your code autocompleter starts lagging just when you're in flow.
A voice assistant pauses awkwardly mid-response.

It's not a great experience. And under the hood, it’s expensive: every second of delay burns more GPU (or other accelerator) time, energy, and money.

So Why Does This Happen?

The model isn’t just thinking about the next word. It’s reprocessing all the previous words as well, over and over. Like re-checking your notes, from the beginning, every time you want to add a word to your output.

This isn’t just inefficient. It’s unnecessary.

That’s where KV caching comes in. It is a surprisingly simple idea that changes everything. Instead of redoing all that past computation, we cache what we’ve already seen and reuse it. The results are dramatic.

In this article, I’ll walk through:

Why LLMs rely on transformers in the first place
How decoding one token at a time slows things down
What the attention mechanism does, and how keys and values become the bottleneck
And finally, how KV caching speeds things up without cutting corners

In real-world deployments, flipping on KV caching can mean generating roughly three to four times faster with no drop in quality, on the same model and hardware.

Let’s start by peeling back the layers on transformers, an engine that powers almost every LLM out there.

What Are LLMs and Why Do They Use Transformers?

Large language models, or LLMs, are just giant neural networks trained to predict the next token (roughly, the next word or word piece). That’s it. But the way they do that prediction is where the magic happens.

Figure 1: Simplified Transformer Block Highlighting Q, K, V Computation

I won't go too deep into the architecture, since many other articles cover it. Instead of traditional architectures like RNNs (recurrent neural networks) or LSTMs (long short-term memory networks), modern LLMs rely almost entirely on something called the Transformer. It’s the architecture behind models like GPT, Claude, Mistral, and even BERT.

So why did Transformers take over? Because they’re really good at two things.

One, they process input in parallel during training, which makes them fast to train even at scale.
Two, they use a mechanism called attention, which lets the model look at every part of the input when deciding what to say next. I will talk more about this later.

A transformer is a stack of attention blocks that each take in some data, mix it around in a clever way, and pass it forward. Inside each block, there are components for computing queries, keys, and values. These play a huge role in the attention step, which we’ll get into soon.

But here's the twist. Even though transformers are fast at training, they slow down at inference. During training, you can process thousands of tokens in parallel. But at inference time, you can only predict one token at a time. And that’s where things start to break down.

We’ll unpack that next.

From Transformers to Autoregressive Decoding

Transformers are fast to train because they look at everything at once. But that’s not how they run inference.

When generating text, language models can’t predict all tokens in parallel. In the standard setup, they produce one token at a time. That’s what we call autoregressive decoding.

At step one, the model sees the prompt and predicts the first token.
At step two, it sees the prompt plus the first token and predicts the second.

Figure 2: Step-by-Step KV Accumulation During Autoregressive Decoding

The key idea here is that each new token depends on everything before it. So, generation becomes a sequential process, even though the model was trained with parallelism.

As I said earlier, the longer the sequence gets, the more computation it takes to generate the next token.
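To make the growing cost concrete, here is a minimal sketch of a decoding loop with no caching at all. The `toy_next_token` function is a hypothetical stand-in for a real model's forward pass (it is not from any library); it only counts how many token positions it is asked to process and returns a dummy token.

```python
def toy_next_token(tokens, counter):
    """Hypothetical stand-in for a full forward pass without caching.

    A real model would run every layer over all of `tokens`; here we only
    record how many positions were processed and return a dummy next token.
    """
    counter["positions_processed"] += len(tokens)
    return (sum(tokens) + 1) % 50  # arbitrary deterministic "prediction"


def generate_without_cache(prompt, num_new_tokens):
    tokens = list(prompt)
    counter = {"positions_processed": 0}
    for _ in range(num_new_tokens):
        # Each step feeds the whole sequence so far back through the model.
        next_token = toy_next_token(tokens, counter)
        tokens.append(next_token)
    return tokens, counter["positions_processed"]


tokens, work = generate_without_cache(prompt=[3, 7, 11, 2], num_new_tokens=5)
print(len(tokens), work)  # 9 30
```

With a 4-token prompt and 5 generated tokens, the steps process 4, 5, 6, 7, and 8 positions, for 30 in total, even though only 9 distinct tokens ever exist.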

How Attention Works

Let’s take a moment to unpack the attention mechanism. It’s the beating heart of the transformer. Please read this paper for a detailed explanation.

Each token gets turned into three vectors: Query (Q), Key (K), and Value (V).

Figure 3: KV Cache Mechanism. Updating and Reusing Stored Tensors

At each step, the model compares the current query to all the keys from previous tokens. It calculates a score for how relevant each past token is, and then uses those scores to weight the values. That’s how attention decides what to focus on.

Here’s what that looks like in code:

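Python

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q: (..., n_queries, d_k), k: (..., n_keys, d_k), v: (..., n_keys, d_v)
    d_k = q.size(-1)
    # Similarity of each query to every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k**0.5
    # Normalize the scores into weights that sum to 1 over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, v)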
  

This mechanism is powerful because it lets the model dynamically decide which parts of the context to pay attention to. But there’s a catch.

Without a cache, every decoding step recomputes the keys and values for all previous tokens and then scores the new query against them. That adds up quickly.
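The redundancy is easy to see in a small experiment. The sketch below uses a randomly initialized `torch.nn.Linear` layer as an assumed stand-in for a key projection (the sizes are arbitrary) and checks that the keys for the first five tokens come out the same whether or not a sixth token has been appended.

```python
import torch

torch.manual_seed(0)

d_model, d_k = 16, 8                      # illustrative sizes, not from a real model
key_proj = torch.nn.Linear(d_model, d_k)  # assumed stand-in for a key projection

# Pretend these are the embeddings of six tokens in one sequence.
x = torch.randn(6, d_model)

with torch.no_grad():
    keys_step_5 = key_proj(x[:5])  # keys computed when the sequence has 5 tokens
    keys_step_6 = key_proj(x[:6])  # keys recomputed after a 6th token arrives

# The first five rows are unchanged; only the last row is new work.
same_prefix = torch.allclose(keys_step_5, keys_step_6[:5], atol=1e-6)
print(same_prefix)
```

In a causal decoder the same holds at deeper layers too, because earlier positions never attend to later ones, so their keys and values do not change when new tokens arrive.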

KV Caching: Let’s Avoid Repeated Computations

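By now, you’ve seen the core problem: each new token requires computing attention with all previous tokens. That means more keys, more values, and more compute at every step. The memory for keys and values grows linearly with sequence length, while the attention work of reprocessing the full sequence grows quadratically.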

But here’s the trick: you don’t need to recompute keys and values for tokens you’ve already seen.

Instead, you cache them once and reuse them at every step.

That’s KV Caching in a nutshell.
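A rough count shows how much work the cache removes. The sketch below counts key/value projections only (one per token position processed), under the simplifying assumptions that one token is generated per step and that the cached version processes the prompt once up front. The function names are made up for this illustration.

```python
def kv_projections_without_cache(prompt_len, new_tokens):
    # Step i reprocesses the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def kv_projections_with_cache(prompt_len, new_tokens):
    # Every position is projected exactly once: the prompt up front, then
    # one position per later step (the last generated token is never fed back).
    return prompt_len + (new_tokens - 1)


naive = kv_projections_without_cache(prompt_len=100, new_tokens=200)
cached = kv_projections_with_cache(prompt_len=100, new_tokens=200)
print(naive, cached)  # 39900 299
```

This counts projections only, so the ratio is not a wall-clock speedup. Attention over the cached keys and values still grows with sequence length, and real throughput also depends on memory bandwidth.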

How It Works

At each decoding step, you compute Q, K, and V for the new token only.
You append the new K and V to the cache and reuse everything already stored there.
Attention becomes: Q_t x [cached K]^T, followed by a softmax and a weighted sum with [cached V].

No re-encoding. No reprocessing. Just fetch and multiply.

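For each new token, attention goes from a quadratic-time pass over the whole sequence to a single linear-time pass over the cache. Techniques like FlashAttention and paged KV memory build on this by reducing memory traffic and managing cache storage more efficiently.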
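The speedup is paid for in memory, which is why paged KV memory matters. As a back-of-the-envelope sketch, the cache holds one key vector and one value vector per token, per layer, per KV head. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption, not a measurement of any model in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token / 2**20, "MiB per token")   # 0.5 MiB per token
print(full / 2**30, "GiB for 4096 tokens")  # 2.0 GiB for 4096 tokens
```

Under these assumptions the cache grows by half a mebibyte per token and per sequence, so long contexts and large batches can make the cache, not the weights, the limiting factor.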

Figure 4: Attention with Cached Key/Value Tensors

How does this look in code?

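Python

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k**0.5
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Toy setup so the loop runs end to end. All sizes and layers here are
# illustrative stand-ins for a single attention head, not a real model.
torch.manual_seed(0)
vocab_size, d_model, d_k, d_v = 100, 32, 16, 16
batch, seq_len = 2, 8
embedding = nn.Embedding(vocab_size, d_model)
query_proj = nn.Linear(d_model, d_k)
key_proj = nn.Linear(d_model, d_k)
value_proj = nn.Linear(d_model, d_v)
# In real generation each new token comes from the model's previous output;
# here the tokens are fixed in advance to keep the simulation simple.
input_tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Simulate a decoding loop with KV caching
past_k = []
past_v = []

with torch.no_grad():  # inference only, no gradients needed
    for t in range(seq_len):  # decoding loop
        token_t = input_tokens[:, t]  # (batch,)

        # Embed and project the current token only
        x_t = embedding(token_t)  # (batch, d_model)
        q_t = query_proj(x_t)     # (batch, d_k)
        k_t = key_proj(x_t)       # (batch, d_k)
        v_t = value_proj(x_t)     # (batch, d_v)

        # Cache K and V
        past_k.append(k_t)
        past_v.append(v_t)

        # Stack all cached K and V
        k_stack = torch.stack(past_k, dim=1)  # (batch, t+1, d_k)
        v_stack = torch.stack(past_v, dim=1)  # (batch, t+1, d_v)

        # Compute attention only with current Q and cached K,V
        q_t = q_t.unsqueeze(1)  # (batch, 1, d_k)
        out_t = scaled_dot_product_attention(q_t, k_stack, v_stack)  # (batch, 1, d_v)

        # Generate token logits or continue loop...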
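A quick sanity check on the "no drop in quality" claim: the output computed for each new token from cached keys and values should match the corresponding row of full causal attention over the whole sequence. The sketch below assumes a single attention head, with random tensors standing in for real projections; all sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, d_k = 2, 6, 8  # illustrative sizes

# Random stand-ins for the projected queries, keys, and values of a sequence.
q = torch.randn(batch, seq_len, d_k)
k = torch.randn(batch, seq_len, d_k)
v = torch.randn(batch, seq_len, d_k)

# Full recomputation: causal attention over the whole sequence at once.
scores = q @ k.transpose(-2, -1) / d_k**0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
full_out = F.softmax(scores, dim=-1) @ v           # (batch, seq_len, d_k)

# Cached decoding: one query per step against the keys/values seen so far.
cached_rows = []
for t in range(seq_len):
    q_t = q[:, t:t + 1]                            # (batch, 1, d_k)
    k_cache, v_cache = k[:, :t + 1], v[:, :t + 1]  # what the cache holds at step t
    step_scores = q_t @ k_cache.transpose(-2, -1) / d_k**0.5
    cached_rows.append(F.softmax(step_scores, dim=-1) @ v_cache)
cached_out = torch.cat(cached_rows, dim=1)         # (batch, seq_len, d_k)

matches = torch.allclose(full_out, cached_out, atol=1e-6)
print(matches)
```

The two paths do the same arithmetic on the same keys and values, which is why caching changes speed but not the model's outputs (up to floating-point rounding).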
  

Speedup Numbers From Real-World Experiments

Model	GPU	Without KV Cache	With KV Cache	Speedup
GPT-2 (1.5B)	A100	12 tok/sec	45 tok/sec	3.75x
GPT-J (6B)	A100	5 tok/sec	20 tok/sec	4x
Mistral-7B	L4	4.2 tok/sec	14.6 tok/sec	3.5x
LLaMA 13B	A100	~6 tok/sec	~25 tok/sec	4.1x

When You Should Care

Let’s zoom out. KV caching may sound like a low-level implementation detail, but it’s a fundamental performance unlock that powers almost every real-time LLM deployment today.

Where It Matters Most

Chatbots that handle long conversations (ChatGPT, Claude, etc.)
Code copilots like GitHub Copilot, Cursor, and TabNine
Voice assistants, translators, and auto-completers
Any application with long prompts or long outputs

If you're building tools that generate tokens sequentially, you should assume KV caching is essential. Not optional.

When It Doesn’t Matter (Much)

One-shot inference with short outputs
Non-autoregressive tasks (like classification or embedding extraction)
Training time (training already uses parallel attention)

Final Thought

KV caching isn’t a niche trick. It’s what separates a working prototype from a real product.


Opinions expressed by DZone contributors are their own.
