Dive Into Tokenization, Attention, and Key-Value Caching
This article explains how key-value caching works and how it helps optimize large language models, walking through the text generation process step by step to make the concepts easy to follow.
The Rise of LLMs and the Need for Efficiency
In recent years, large language models (LLMs) such as GPT, Llama, and Mistral have transformed natural language understanding and generation. However, a significant challenge in deploying these models lies in optimizing their performance, particularly for tasks involving long text generation. One powerful technique to address this challenge is key-value caching (KV cache).
In this article, we will delve into how KV caching works, its role within the attention mechanism, and how it enhances efficiency in LLMs.
How Large Language Models Generate Text
To truly understand token generation, we need to start with the basics of how sentences are processed in LLMs.
Step 1: Tokenization
Before a model processes a sentence, it breaks it into smaller pieces called tokens.
Example sentence: Why is the sky blue?
Tokens can represent words, subwords, or even characters, depending on the tokenizer used.
For simplicity, let’s assume the sentence is tokenized as: ['Why', 'is', 'the', 'sky', 'blue', '?']
Each token is assigned a unique ID, forming a sequence like: [1001, 1012, 2031, 3021, 4532, 63]
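In practice, this step is handled by the model’s tokenizer. Below is a minimal sketch using the same Hugging Face tokenizer loaded in the full example later in this article; the actual subword splits and IDs will differ from the simplified sequence above.

from transformers import AutoTokenizer

# Load the tokenizer for the model used later in this article
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

sentence = "Why is the sky blue?"
tokens = tokenizer.tokenize(sentence)   # Subword pieces (exact form depends on the tokenizer)
token_ids = tokenizer.encode(sentence)  # The corresponding integer IDs

print(tokens)
print(token_ids)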
Step 2: Embedding Lookup
Token IDs are mapped to high-dimensional vectors, called embeddings, using a learned embedding matrix.
Example:
- Token “Why” (ID: 1001) → Vector:
[-0.12, 0.33, 0.88, ...]
- Token “is” (ID: 1012) → Vector:
[0.11, -0.45, 0.67, ...]
The sentence is then represented as a sequence of embedding vectors: [Embedding("Why"), Embedding("is"), Embedding("the"), ...]
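As a rough sketch of this lookup (assuming the tokenizer from the previous snippet), the model simply indexes into its learned embedding matrix:

import torch
from transformers import AutoModelForCausalLM

# Load the model whose embedding matrix we want to inspect
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

token_ids = tokenizer.encode("Why is the sky blue?", return_tensors="pt")  # Shape: (1, seq_len)
embedding_matrix = model.get_input_embeddings()  # A learned nn.Embedding(vocab_size, hidden_dim)
embeddings = embedding_matrix(token_ids)         # Shape: (1, seq_len, hidden_dim)

print(embeddings.shape)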
Step 3: Contextualizing Tokens With Attention
Raw embeddings don’t capture context. For instance, the meaning of “sky” differs in the sentences “Why is the sky blue?” and “The sky is clear today.” To add context, LLMs use the attention mechanism.
How Attention Works (Queries, Keys, and Values)
The attention mechanism uses three components:
- Query (Q). Represents the current token’s embedding, transformed through a learned weight matrix. It determines how much attention to give to other tokens in the sequence.
- Key (K). Encodes information about each token (including previous ones), transformed through a learned weight matrix. It is used to assess relevance by comparing it to the query (Q).
- Value (V). Represents the actual content of the tokens, providing the information that the model “retrieves” based on the attention scores.
Example: Let’s consider the LLM processing the example sentence, where the current token is “the.”
When processing the token “the,” the model attends to all tokens seen so far (“Why,” “is,” and “the” itself) using their key (K) and value (V) representations.
Query (Q) for “the”:
The Query vector for “the” is derived by applying a learned weight matrix to its embedding: Q("the") = WQ ⋅ Embedding("the")
Keys (K) and Values (V) for previous tokens:
Each previous token generates:
- Key (K):
K("why") = WK ⋅ Embedding("why")
- Value (V):
V("why") = Embedding("why")
Attention Calculation
The model calculates relevance by comparing Q("the") with all the K vectors computed so far (“Why,” “is,” and “the”) using a dot product, scaled by the square root of the key dimension.
The resulting scores are normalized with softmax to compute attention weights.
These weights are applied to the corresponding V vectors to update the contextual representation of “the.”
In summary:
- Q (the). The embedding of “the” passed through a learned weight matrix WQ to form the query vector Q for the token “the.” This query is used to determine how much attention “the” should pay to other tokens.
- K (why). The embedding of “why,” passed through a learned weight matrix WK to form the key vector K for “why.” This key is compared with Q (the) to compute attention relevance.
- V (why). The embedding of “why,” passed through a learned weight matrix WV to form the value vector V for “why.” This value contributes to updating the contextual representation of “the” based on its attention weight relative to Q (the).
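To make this concrete, here is a toy, self-contained sketch of single-head scaled dot-product attention. The random matrices W_Q, W_K, and W_V stand in for the learned projections inside a real transformer layer, and the dimensions are deliberately tiny.

import torch

torch.manual_seed(0)
d_model = 8  # Toy embedding size (real models use hundreds or thousands of dimensions)

# Toy embeddings for the tokens processed so far: "Why", "is", "the"
embeddings = torch.randn(3, d_model)

# Projection matrices (random here, learned during training in a real model)
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

K = embeddings @ W_K      # Keys for all tokens so far
V = embeddings @ W_V      # Values for all tokens so far
q = embeddings[-1] @ W_Q  # Query for the current token, "the"

# Relevance scores: dot product of the query with every key, scaled by sqrt(d_k)
scores = (K @ q) / (d_model ** 0.5)
weights = torch.softmax(scores, dim=-1)  # Attention weights sum to 1

# Weighted sum of values: the updated, contextualized representation of "the"
contextualized_the = weights @ V
print(weights)  # How much "the" attends to "Why", "is", and itself
print(contextualized_the.shape)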
Step 4: Updating the Sequence
Each token’s embedding is updated based on its relationships with all other tokens. This process is repeated across multiple attention layers, with each layer refining the contextual understanding.
Step 5: Generating the Next Token (Sampling)
Once embeddings are contextualized across all layers, the model outputs a logits vector — a raw score distribution over the vocabulary — for each token position.
For text generation, the model focuses on the logits for the last position. The logits are converted into probabilities using a softmax function.
Sampling Strategies
- Greedy sampling. Selects the token with the highest probability (for example, picking “because” as the continuation of “Why is the sky blue?”). A short code sketch of all three strategies follows this list.
- Top-k sampling. Chooses randomly among the top k probable tokens.
- Temperature sampling. Adjusts the probability distribution to control randomness (e.g., higher temperature = more random choices).
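As a rough, model-agnostic illustration, the sketch below applies each strategy to a toy logits vector for the last position:

import torch

torch.manual_seed(0)
logits = torch.randn(32_000)  # Toy logits over a 32,000-token vocabulary

# Greedy sampling: always pick the single most probable token
greedy_id = torch.argmax(logits).item()

# Top-k sampling: keep the k most probable tokens, renormalize, and sample among them
k = 50
top_logits, top_ids = torch.topk(logits, k)
top_k_id = top_ids[torch.multinomial(torch.softmax(top_logits, dim=-1), num_samples=1)].item()

# Temperature sampling: divide logits by the temperature before softmax;
# a higher temperature flattens the distribution and makes choices more random
temperature = 1.5
temperature_id = torch.multinomial(torch.softmax(logits / temperature, dim=-1), num_samples=1).item()

print(greedy_id, top_k_id, temperature_id)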
How Key-Value Cache Helps
Without a KV Cache
At each generation step, the model recomputes the keys and values for all tokens in the sequence, even those already processed. This results in a quadratic computational cost (O(n²)), where n is the number of tokens, making it inefficient for long sequences.
With a KV Cache
The model stores the keys and values for previously processed tokens in memory. When generating a new token, it reuses the cached keys and values and computes only the query, key, and value for the new token. This avoids recomputing attention components for the entire sequence and significantly speeds up generation; the trade-off is the extra memory needed to hold the cache.
Code With KV Cache
Suppose the model has already generated the sequence “Why is the sky.” The keys and values for these tokens are stored in the cache. When generating the next token, “blue”:
- The model retrieves the cached keys and values for the tokens “Why,” “is,” “the,” and “sky.”
- It computes the query, key, and value for “blue” and performs attention calculations using the query for “blue” with the cached keys and values.
- The newly calculated key and value for “blue” are added to the cache for future use.

import torch
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# Move model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Input text
input_text = "Why is the sky blue?"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
def generate_tokens(use_cache, steps=100):
    """
    Function to generate tokens with or without caching.

    Args:
        use_cache (bool): Whether to enable cache reuse.
        steps (int): Number of new tokens to generate.

    Returns:
        generated_text (str): The generated text.
        duration (float): Time taken for generation.
    """
    past_key_values = None  # Initialize past key values
    input_ids_local = input_ids  # Start with the initial input
    generated_tokens = tokenizer.decode(input_ids_local[0]).split()
    start_time = time.time()
    for step in range(steps):
        outputs = model(
            input_ids=input_ids_local,
            use_cache=use_cache,
            past_key_values=past_key_values,
        )
        logits = outputs.logits
        past_key_values = outputs.past_key_values if use_cache else None  # Cache for the next iteration
        # Get the next token (argmax over the logits)
        next_token_id = torch.argmax(logits[:, -1, :], dim=-1)
        # Decode and append the new token
        new_token = tokenizer.decode(next_token_id.squeeze().cpu().numpy())
        generated_tokens.append(new_token)
        # Update input IDs for the next step
        if use_cache:
            input_ids_local = next_token_id.unsqueeze(0)  # Feed only the new token in cached mode
        else:
            input_ids_local = torch.cat([input_ids_local, next_token_id.unsqueeze(0)], dim=1)
    end_time = time.time()
    duration = end_time - start_time
    generated_text = " ".join(generated_tokens)
    return generated_text, duration
# Measure time with and without cache
steps_to_generate = 200 # Number of tokens to generate
print("Generating tokens WITHOUT cache...")
output_no_cache, time_no_cache = generate_tokens(use_cache=False, steps=steps_to_generate)
print(f"Output without cache: {output_no_cache}")
print(f"Time taken without cache: {time_no_cache:.2f} seconds\n")
print("Generating tokens WITH cache...")
output_with_cache, time_with_cache = generate_tokens(use_cache=True, steps=steps_to_generate)
print(f"Output with cache: {output_with_cache}")
print(f"Time taken with cache: {time_with_cache:.2f} seconds\n")
# Compare time difference
time_diff = time_no_cache - time_with_cache
print(f"Time difference (cache vs no cache): {time_diff:.2f} seconds")
When Is Key-Value Caching Most Effective?
The benefits of KV cache depend on several factors:
- Model size. Larger models (e.g., 7B, 13B) perform more computations per token, so caching saves more time.
- Sequence length. KV cache is more effective for longer sequences (e.g., generating 200+ tokens).
- Hardware. GPUs benefit more from caching compared to CPUs, due to parallel computation.
Extending KV Cache: Prompt Caching
While KV cache optimizes text generation by reusing keys and values for previously generated tokens, prompt caching goes a step further by targeting the static nature of the input prompt. Let’s explore what prompt caching is and its significance.
What Is Prompt Caching?
Prompt caching involves pre-computing and storing the keys and values for the input prompt before the generation process starts. Since the input prompt does not change during text generation, its keys and values remain constant and can be efficiently reused.
Why Prompt Caching Matters
Prompt caching offers distinct advantages in scenarios with large prompts or repeated use of the same input:
- Avoids redundant computation. Without prompt caching, the model recomputes the keys and values for the input prompt at the start of every request, even though the prompt never changes. This leads to unnecessary computational overhead.
- Speeds up generation. By pre-computing these values once, prompt caching significantly accelerates the generation process, particularly for lengthy input prompts or when generating multiple completions.
- Optimized for batch processing. Prompt caching is invaluable in cases where the same prompt is reused across multiple batched requests or slight variations, ensuring consistent efficiency.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
assistant_prompt = "You are a helpful and knowledgeable assistant. Answer the following question thoughtfully:\n"
# Tokenize the assistant prompt
input_ids = tokenizer(assistant_prompt, return_tensors="pt").to(model.device)
# Step 1: Cache Keys and Values for the assistant prompt
with torch.no_grad():
    start_time = time.time()
    outputs = model(input_ids=input_ids.input_ids, use_cache=True)
    past_key_values = outputs.past_key_values  # Cache KV pairs for the assistant prompt
    prompt_cache_time = time.time() - start_time

print(f"Prompt cached in {prompt_cache_time:.2f} seconds\n")
# Function to generate responses for separate questions
def generate_response(question, past_key_values):
    question_prompt = f"Question: {question}\nAnswer:"
    question_ids = tokenizer(question_prompt, return_tensors="pt").to(model.device)

    # Keep the full text (prompt + question) for decoding, but feed the model only
    # the tokens the cache has not seen yet -- the assistant prompt is already cached.
    generated_ids = torch.cat((input_ids.input_ids, question_ids.input_ids), dim=-1)
    next_input_ids = question_ids.input_ids  # New tokens not yet covered by the cache

    num_new_tokens = 50  # Number of tokens to generate
    with torch.no_grad():
        for _ in range(num_new_tokens):
            outputs = model(input_ids=next_input_ids, past_key_values=past_key_values, use_cache=True)
            next_token_id = outputs.logits[:, -1].argmax(dim=-1).unsqueeze(0)  # Pick the next token
            generated_ids = torch.cat((generated_ids, next_token_id), dim=-1)  # Append it to the running sequence
            past_key_values = outputs.past_key_values  # Update the KV cache
            next_input_ids = next_token_id  # Only the newest token goes into the next forward pass

    response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return response, past_key_values
# Step 2: Pass multiple questions
questions = [
"Why is the sky blue?",
"What causes rain?",
"Why do we see stars at night?"
]
# Generate answers for each question
for i, question in enumerate(questions, 1):
    start_time = time.time()
    response, past_key_values = generate_response(question, past_key_values)
    response_time = time.time() - start_time
    print(f"Question {i}: {question}")
    print(f"Generated Response: {response.split('Answer:')[-1].strip()}")
    print(f"Time taken: {response_time:.2f} seconds\n")
For example:
- Customer support bots. The system prompt often remains unchanged for every user interaction. Prompt caching allows the bot to generate responses efficiently without recomputing the keys and values of the static system prompt.
- Creative content generation. When multiple completions are generated from the same input prompt, varying randomness (e.g., temperature settings) can be applied while reusing cached keys and values for the input.
Conclusion
Key-value caching (KV cache) plays a crucial role in optimizing the performance of LLMs. Reusing previously computed keys and values reduces computational overhead, speeds up generation, and improves efficiency, particularly for long sequences and large models.
Implementing KV caching is essential for real-world applications like summarization, translation, and dialogue systems, enabling LLMs to scale effectively and provide faster, more reliable results. Combined with techniques like prompt caching, KV cache ensures that LLMs can handle complex and resource-intensive tasks with improved efficiency.
I hope you found this article useful, and if you did, consider giving claps.