Inside LLM Inference: When the KV Cache No Longer Fits

Why managing the KV cache is a systems problem, not a model problem

In the previous article, we reduced the size of the KV cache.
Techniques like MQA and GQA reduce how much we store per token.

But reduction only delays the problem. It does not solve it.

That analysis, however, focused on a single request.
Real systems do not serve one request at a time.

Think about what actually happens when a product like ChatGPT serves users. At any given moment, hundreds of conversations are running on the same GPU.
Each with its own prompt.
Each with its own conversation history.
Each generating tokens, one at a time, competing for the same memory.
And with every token they generate, their memory footprint keeps growing.

From Article 2, you already know that a single request with 10,000 tokens holds roughly 5 GB of KV cache in memory.
Now multiply that by 100 concurrent users.
That is 500 GB. A single H100 has 80 GB.

One request fits. Hundreds don’t.
Memory demand scales with users. GPU memory does not.
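The arithmetic behind those numbers is easy to check. A quick sketch, using the rough per-token KV cost from Article 2:

```python
# Back-of-envelope KV cache demand, using the article's rough figures.
kv_bytes_per_token = 512 * 1024        # ~512 KB/token in FP16 (Llama-2-7B scale)
tokens_per_request = 10_000
concurrent_users = 100

kv_per_request_gb = kv_bytes_per_token * tokens_per_request / 1024**3
total_demand_gb = kv_per_request_gb * concurrent_users

print(f"per request: {kv_per_request_gb:.1f} GB")   # ~4.9 GB
print(f"100 users:   {total_demand_gb:.0f} GB")     # ~488 GB, against 80 GB on an H100
```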

At this scale, inference is no longer a compute problem.
It is a memory management problem.

When memory becomes the bottleneck, the system does not just slow down.
Throughput drops, queues build up, and systems start rejecting requests.

The model is doing exactly what it was designed to do.

Solving this requires a different kind of thinking. It requires decisions that go beyond the model itself.

What do you keep?
What do you share?
What do you move?
And what do you let go?

These are the decisions that determine whether an inference system can serve 10 users or 10,000.
This article is about how real systems make those decisions.

Why Naive KV Storage Fails

After reducing KV size, the next question is how to store it.
The simplest approach is to pre-allocate memory for each request upfront.

When a request arrives, the system reserves a contiguous block of GPU memory sized for the maximum sequence length it might produce. If the model supports a context window of 4096 tokens, the system reserves memory for 4096 tokens, even if the request finishes at 800.

This works for a single request.
It breaks under any real serving workload, in two ways.

The first is over-reservation.

Memory is allocated based on what a request might use, not what it will use. A request that finishes at 800 tokens holds memory for 4096 the entire time. That unused memory cannot be used by other requests.
The system behaves as if it is out of memory, even when large portions of it are sitting idle.
In practice, the vLLM paper measured that only 20–38% of reserved KV cache memory was actually used in existing systems.
Now multiply that waste across dozens of concurrent requests. A system with 80 GB of memory may effectively be using only 20–30 GB for actual KV data. The rest is reserved but unusable.
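A toy illustration of the gap. The request lengths here are invented; only the 4096-token reservation comes from the example above:

```python
# Each request reserves the full context window but uses far less.
max_context = 4096
actual_lengths = [800, 1200, 350, 2000, 600]   # hypothetical finished requests

used = sum(actual_lengths)
reserved = max_context * len(actual_lengths)
utilization = used / reserved
print(f"utilization: {utilization:.0%}")   # ~24%, in line with the 20-38% the vLLM paper measured
```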

(Figure: allocated memory wasted on unused tokens.)
The second is fragmentation.

As requests arrive and finish at different times, free memory gets scattered across the GPU in gaps of varying sizes.
Imagine 10 requests have just finished on an H100.
The freed memory is scattered across the GPU in dozens of small, non-contiguous gaps.
A new request needs 1 GB contiguous. Individually, no gap is large enough. The request cannot start.
The GPU has the memory.
It just cannot use it.

Memory exists, but not in the shape the system needs.
Together, these two problems mean the system runs out of usable memory long before it runs out of actual memory.
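Fragmentation is easy to demonstrate with made-up numbers:

```python
# Free memory left behind by finished requests, in MB (hypothetical gap sizes).
free_gaps_mb = [120, 340, 90, 210, 400, 150, 80]
need_mb = 1024                       # a new request needs 1 GB, contiguous

total_free = sum(free_gaps_mb)       # 1390 MB: enough memory in total...
largest_gap = max(free_gaps_mb)      # 400 MB: ...but no single gap can hold the request
print(total_free >= need_mb, largest_gap >= need_mb)   # True False
```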

This is not a problem with the model. It is a problem with how memory is allocated.
And it cannot be fixed by reducing KV size further.

It requires a different way of organizing memory altogether.
What is needed is a system where memory no longer needs to be contiguous.
And where allocation adapts dynamically to how requests actually grow.

PagedAttention: Virtual Memory for the KV Cache

The fix comes from an old idea.

In the 1960s, operating systems faced a similar problem.
Programs needed more memory than was physically available, and contiguous allocation was wasting what little existed.

The solution was virtual memory: break physical memory into fixed-size pages, and let programs use non-contiguous pages as if they were one continuous block.

PagedAttention applies the same idea to the KV cache.

Instead of allocating one large contiguous block per request, PagedAttention divides the KV cache into fixed-size blocks.
Each block holds the Keys and Values for a fixed number of tokens, typically 16.

For a model like Llama-2 with 32 layers and hidden dimension 4096, one block of 16 tokens holds roughly 8 MB of KV data.
Blocks need to be small enough for flexible allocation, but large enough to keep memory access patterns efficient for GPU execution.
These blocks do not need to be contiguous.
A request can use block 3, block 17, and block 42, wherever free blocks exist, and the system maps them together behind the scenes.

Each request maintains a block table: a mapping from its logical sequence of tokens to the physical blocks holding their KV data.
The attention computation reads through this table at each decode step, fetching only the blocks corresponding to the tokens it attends to.

Logical sequence stays continuous.
Physical memory does not.

At each decode step, the model does not need to load the entire KV cache as one block. It only needs to read the blocks corresponding to the tokens in the sequence. For a 10,000-token sequence, this means 625 independent blocks rather than one contiguous 5 GB slab.

This fundamentally changes how memory is managed.

Under naive allocation, memory is reserved upfront and held until the request finishes. Under PagedAttention, memory is allocated one block at a time, exactly as the sequence grows.
When a request finishes, its blocks are immediately freed and available to any new request.
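The mechanics can be sketched in a few lines of Python. This is a toy model of the idea, not vLLM's implementation; `BlockAllocator` and `BlockTable` are names invented for illustration:

```python
BLOCK_SIZE = 16   # tokens per KV block, as in the text

class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()            # any free block will do: no contiguity needed

    def release(self, block_ids):
        self.free.extend(block_ids)       # immediately reusable by other requests

class BlockTable:
    """Maps a request's logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator, self.blocks, self.num_tokens = allocator, [], 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:          # crossed a block boundary
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_idx):
        # logical position -> (physical block id, offset within that block)
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

A request that grows to 40 tokens ends up holding three blocks, wherever they happen to sit in physical memory; when it finishes, `release` returns them to the pool.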

The numbers reflect this directly.

Across both problems, existing systems waste 60–80% of KV cache memory through fragmentation and over-reservation.
PagedAttention brings that waste to under 4%.

And because more requests fit in memory simultaneously, throughput improves by 2–4x at the same latency, as reported in the original vLLM paper.

This is not just a benchmark result. LMSYS, the team behind Vicuna and Chatbot Arena, was serving roughly 45,000 daily requests. After switching to vLLM, they cut the number of GPUs needed by 50% while serving 2–3x more requests per second. Same model. Same hardware. Different memory management.

This is what vLLM is built on. PagedAttention is its core memory management mechanism, and it is what made high-throughput LLM serving practical at the batch sizes real products require.

But PagedAttention only solves how KV is stored.
It does not solve the fact that many requests are storing the same KV blocks in the first place.

Return to the ChatGPT scenario.
Many of those hundreds of concurrent users share the exact same system prompt.
With PagedAttention, each of those users still gets their own copy of that prompt’s KV blocks. The same data, stored hundreds of times.

That is the next bottleneck.

Prefix Sharing: Don’t Store What Already Exists

PagedAttention solves how memory is allocated.
But it does not eliminate redundancy.

Consider the ChatGPT scenario again. Those hundreds of concurrent users are not just competing for memory. Many of them are sending requests that start with identical text.

A customer support chatbot has the same system prompt for every user. A RAG pipeline sends the same retrieved documents to the model across dozens of queries. A few-shot prompt with carefully crafted examples gets prepended to every single request.

In each case, the KV blocks for that shared prefix are being computed and stored separately for every request. The same tokens. The same Keys and Values. Duplicated hundreds of times in GPU memory.

As the number of concurrent users grows, the probability of shared prefixes increases.
At scale, systems are not handling independent requests.
They are handling many variations of the same request.

The insight behind prefix sharing is simple.

If two requests share the same prompt prefix, their KV blocks for that prefix are identical. There is no reason to store them twice.

The system can compute the KV blocks for a shared prefix once, and map all requests that share it to the same physical blocks. This is safe because prefixes are read-only during decode. No request modifies the KV of a previous token. It only appends new ones.

This is made efficient using a radix tree, also called a prefix trie.
When a new request arrives, the system walks the tree to find the longest prefix already in cache. Whatever matches is reused directly. Only the portion that diverges needs new computation and new block allocation.
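A plain per-token trie is enough to sketch the lookup. Real systems like SGLang use a compressed radix tree, but the walk is the same idea:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode (each node stands for one cached token's KV)

def insert_prefix(root, tokens):
    """Record a computed prefix so later requests can find it."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def longest_cached_prefix(root, tokens):
    """Walk the trie: how many leading tokens already have cached KV?"""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, matched = node.children[t], matched + 1
    return matched
```

If one request inserted the prompt `[7, 7, 3, 9, 4]` and a new request starts `[7, 7, 3, 1]`, the walk matches three tokens; only the diverging tail needs prefill and fresh blocks.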

Shared prefixes collapse hundreds of copies into one.

This also reduces the amount of KV data that needs to be read during decode. Less duplication means less memory traffic at every step.

The impact is significant.
If a 10,000-token prefix costs roughly 5 GB, sharing it across 10 users avoids storing an additional 45 GB of KV data.
For shared prefixes, PagedAttention’s memory sharing reduces memory overhead by up to 55%, translating to up to 2.2x improvement in throughput, as reported in the original vLLM paper.

The RAG case makes this concrete. If 10 users query the same 10,000-word document, a naive system processes those 10,000 words 10 times, once per request. With prefix sharing, the system processes them once. The KV blocks are computed on the first request and reused for all subsequent ones. Nine out of ten prefills simply do not happen.

Return to the ChatGPT scenario one more time. That system prompt shared across hundreds of users? With prefix sharing, it is stored once. Every user reads from the same physical blocks. The memory that was being duplicated hundreds of times now occupies the space of one copy.

But prefix sharing only helps when the KV blocks already exist in memory. If a request arrives with a prefix the system has seen before, but those blocks were evicted to make room for something else, the system has to recompute them from scratch.

That is the next bottleneck.
And the problem prefill caching is designed to solve.

Prefill Caching: Don’t Recompute What You’ve Already Run

Prefix sharing removes duplication across active requests. But it only helps when the KV blocks are already in memory.

What happens when they are not?

Consider a RAG pipeline where the same document gets retrieved repeatedly across different users, at different times. The first request computes the KV blocks for that document and shares them with any concurrent requests. But an hour later, those blocks may have been evicted to make room for new requests. The next user querying the same document triggers a full prefill from scratch.

The work was already done. It just was not retained.

Prefill caching addresses this directly.

Instead of discarding KV blocks when a request finishes, the system retains them. When a new request arrives, the system checks whether its prompt prefix matches anything previously computed. If it does, those blocks are restored from cache rather than recomputed.

The same prefix-matching structure is used here. But now the tree is not just matching against live requests. It is matching against the full history of computed prefixes, including those from requests that finished long ago.
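A minimal sketch of the idea, where the dict-based cache and the `compute_prefill` callback are illustrative stand-ins rather than a real API:

```python
prefill_cache = {}   # prefix tokens -> KV blocks; deliberately not cleared when a request finishes

def get_kv(prompt_tokens, compute_prefill):
    """Reuse a previously computed prefill for this exact prefix, if it exists."""
    key = tuple(prompt_tokens)
    if key not in prefill_cache:
        prefill_cache[key] = compute_prefill(prompt_tokens)   # the expensive prefill
    return prefill_cache[key]
```

The first request over a document pays the full prefill; a request an hour later with the same prefix pays a lookup. Real systems match partial prefixes through the radix tree and restore at block granularity.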

The distinction from prefix sharing is subtle but important.

Prefix sharing asks: can multiple requests share the same blocks right now?
Prefill caching asks: have we computed this prefix before, even if no one is using it currently?

Together they cover both dimensions of redundancy.
Prefix sharing eliminates duplication across concurrent users.
Prefill caching eliminates duplication across time.

These techniques are not independent.
They build on top of each other, each reducing a different source of wasted work.

These are the workloads that dominate production systems.

A chatbot where every conversation starts with the same system prompt. A coding assistant where the same codebase is loaded as context for every query. A RAG pipeline where the same documents are retrieved again and again.

In each case, the most expensive part of the request, the prefill over a long shared context, becomes a cache lookup instead of a computation.

The latency impact is measurable. Research on prefix caching across real workloads shows P95 time-to-first-token reductions of up to 73% compared to systems without prefix caching. For long shared prefixes, the first token arrives in milliseconds instead of seconds.

But caching introduces a new tension. The longer you retain KV blocks, the more memory they consume. At some point, the cache fills up. And the system has to decide what to keep and what to drop.

That is where the next constraint appears.
The eviction problem.

Eviction: Deciding What to Let Go

Every caching system eventually runs out of space.

The KV cache is no exception. As requests arrive, complete, and leave their blocks behind for potential reuse, memory fills up. At some point, the system cannot retain everything. It has to make a choice.

Which blocks do you keep? Which do you drop?

This is the eviction problem. And the answer is not as simple as it might seem.

The simplest approach: recency

LRU, least recently used.
Drop the blocks that have not been accessed recently.
This is intuitive and cheap to implement.

But recency is a proxy, not a measure of value.

Consider a long conversation where the user’s first message defined the task. It was sent twenty minutes ago. Under LRU, it is a candidate for eviction.
Dropping it may silently degrade every response that follows, because the model can no longer attend to that early context.

The system gives no warning. The response quality simply gets worse.

The smarter approach: attention-aware eviction

Not all tokens contribute equally. At each decode step, the attention mechanism assigns weights to every token. Some tokens consistently receive high attention across steps. Others are rarely used.

This is a signal.

Tokens that are frequently attended to are load-bearing.
At each decode step, the model depends on these tokens to generate the next output. Evicting them changes what the model produces next.
Tokens that are rarely attended to can be dropped with minimal impact.

Attention-aware strategies use this signal to protect important blocks, instead of relying on recency alone.
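The selection step can be sketched in the spirit of heavy-hitter methods like H2O. This simplification scores individual tokens; real policies track accumulated scores during decode and evict at block granularity:

```python
def pick_evictions(accumulated_attention, num_to_evict, protected=frozenset()):
    """Evict the tokens that have received the least attention mass so far."""
    candidates = [(score, idx) for idx, score in enumerate(accumulated_attention)
                  if idx not in protected]
    candidates.sort()                                  # least-attended first
    return [idx for _, idx in candidates[:num_to_evict]]
```

With scores `[5.0, 0.1, 3.0, 0.05]` and the first token protected, evicting two drops tokens 3 and 1: the load-bearing tokens 0 and 2 survive, where plain LRU would have judged by age alone.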

Not all tokens matter equally.
Some carry the entire generation.

The sliding window: a structural compromise

A third approach is the sliding window. Instead of trying to keep the entire context, the system only retains the most recent N tokens. Older tokens fall out of the window automatically.

This bounds memory usage and keeps the system predictable.

The tradeoff is real. The model can no longer attend to anything outside the window. For short interactions this works well. For long documents or multi-turn reasoning, important context simply disappears.
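Structurally, a sliding window is just a bounded buffer. A toy sketch with a four-token window (real windows span thousands of tokens):

```python
from collections import deque

WINDOW = 4                           # toy size; real windows are thousands of tokens wide
kv_window = deque(maxlen=WINDOW)     # old entries fall out automatically

for token in ["sys", "task", "a", "b", "c", "d"]:
    kv_window.append(token)          # appending the 5th token silently evicts "sys"

print(list(kv_window))   # ['a', 'b', 'c', 'd']: the task-defining tokens are gone
```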

The deeper point

Eviction is not just a memory decision. It is a modeling decision.

What the system evicts determines what the model can see. And what the model can see determines what it generates.

Return to the ChatGPT scenario one final time. A user has been in a long conversation. The system prompt is shared. Recent turns are in cache. But the message that defined the task was evicted to make room for something else.

The model does not know this. It generates a response based on what remains.
The user notices something feels off, but cannot explain why.

This is the real cost of eviction done poorly. Not a crash. Not an error. Just a subtle, invisible degradation in every response that follows.

Quantization: Shrinking What You Store

The techniques so far have focused on how memory is organized and reused.
But all of those decisions assumed one thing. That the size of the KV cache itself is fixed.

But there is another dimension entirely.

Instead of changing how KV blocks are managed, what if we simply made each block smaller?

This is what KV cache quantization does.

In Article 2, we established that storing Keys and Values in FP16 costs roughly 512 KB per token. For a 10,000-token sequence with GQA applied, that comes down to around 625 MB. That number assumed 16-bit precision for every value in every block.

But precision is a choice, not a requirement.

If we store KV values in 8-bit instead of 16-bit, each value takes half the space. The cache shrinks by 2x. If we go to 4-bit, it shrinks by 4x.

For that same 10,000-token sequence:

FP16: ~625 MB
INT8: ~312 MB
INT4: ~156 MB
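The mechanics of symmetric INT8 quantization fit in a few lines. This is a per-tensor simplification; production kernels quantize per block or per channel and store the scales alongside the data:

```python
def quantize_int8(values):
    """Map floats to int8 range [-127, 127] plus one floating-point scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # guard against all-zero input
    return [round(v / scale) for v in values], scale

def dequantize_int8(quantized, scale):
    return [q * scale for q in quantized]

keys = [0.81, -1.62, 0.05, 0.40]           # toy KV values
q, scale = quantize_int8(keys)
restored = dequantize_int8(q, scale)
# each value now occupies 1 byte instead of 2, at the cost of ~scale/2 rounding error
```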

This directly reduces how much memory each request consumes. And because memory is the shared resource all requests compete for, smaller blocks mean more requests fit simultaneously.

At scale, this directly determines how many users a system can serve on a single GPU.
A 4x reduction in KV size can mean serving 4x more concurrent requests under the same memory budget.

The throughput impact is measurable.
With INT8 KV quantization applied to Llama-2–7B, requests per second improves by roughly 30%.
With INT4, roughly 40%.
Both measured against an FP16 baseline. These gains come directly from reducing memory bandwidth pressure during decode.

(Figure: quantization cuts memory while preserving performance. Source: NVIDIA.)

This matters because decode is memory-bandwidth bound. Reducing the size of KV data directly reduces how much data needs to be moved from memory at every step. Less data movement means faster token generation.

For every token the model generates, it must read the entire KV cache.
Quantization reduces the cost of that read at every single step.

The natural question is what this costs in accuracy.

Unlike model weights, the KV cache represents intermediate activations. These are more tolerant to reduced precision, because small numerical differences rarely change attention patterns significantly.

INT8 quantization is effectively lossless for the KV cache in most models.
INT4 introduces a small but measurable drop in accuracy, particularly in tasks that rely on precise long-range dependencies.
For many production workloads, this tradeoff is acceptable, but it must be evaluated based on the task.

It is also worth being clear about what quantization changes and what it does not.

MQA and GQA, which you saw in Article 2, reduce the number of KV vectors by changing the attention structure.
Quantization reduces the size of each vector by changing how values are represented.
The attention pattern stays the same.
The dependency on past tokens stays the same.
Only the cost of storing and reading that data becomes smaller.

At each decode step, the model still attends over the entire sequence. But instead of reading large FP16 KV blocks from memory, it reads smaller, quantized representations. The computation stays the same.

They are orthogonal optimizations. You can apply both.

A model using GQA with INT8 KV quantization gets the reduction from both directions simultaneously.

And crucially, quantization compounds with everything else in this article. Smaller blocks mean more of them fit in the same memory budget. More blocks in memory means higher cache hit rates for prefix sharing and prefill caching. Higher cache hit rates mean less recomputation and lower latency.

Each optimization makes the others more effective.
Together, they do not just improve performance.
They determine whether the system can scale at all.

Putting It All Together

Now step back and look at the system as a whole.

Each section introduced a different technique. But they are not independent solutions. They form a layered system, each addressing a different dimension of the same underlying problem.

The problem has always been the same.

Memory is finite. Requests are many.
And the KV cache never stops growing.

This is how a real inference system responds.

When a request arrives, the scheduler first checks the radix tree. If the prompt shares a prefix with something already in cache, those KV blocks are reused directly.
No prefill for that portion. No memory allocation.
The work was already done.

If the prefix was computed before but evicted, the blocks are restored from the prefill cache instead of being recomputed. The expensive compute-bound phase becomes a memory lookup.

For the portion of the prompt that is genuinely new, PagedAttention takes over. Memory is allocated one block at a time, placed wherever space is available, non-contiguous by design. No over-reservation. No fragmentation.

Each of those memory blocks is stored in INT8 or INT4 rather than FP16. The same sequence now occupies a fraction of the memory it would have consumed before.

As the sequence grows during decode, new blocks are allocated on demand. Every token adds to the KV cache, and every step depends on reading and managing that growing memory state.
When the request finishes, its blocks are either retained for reuse or marked as eviction candidates, depending on memory pressure.

And when memory does fill up, the eviction policy decides what stays.
Recent blocks, frequently attended blocks, and shared prefixes are protected.
Old, rarely accessed, and unique blocks are the first to go.
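Under heavy simplification, the whole request path condenses into one toy planning function. Every name here is illustrative; real schedulers work against the radix tree and block store directly:

```python
BLOCK_SIZE = 16

def plan_request(prompt, live_prefixes, prefill_cache):
    """Decide how each part of the prompt is obtained.
    live_prefixes: prefixes currently held by active requests (prefix sharing).
    prefill_cache: prefixes computed at some point and retained (prefill caching).
    Both are toy stand-ins: lists of token tuples."""
    # 1. Longest prefix live in memory right now is shared directly.
    shared = max((len(p) for p in live_prefixes
                  if prompt[:len(p)] == list(p)), default=0)
    # 2. A longer match in the prefill cache is restored instead of recomputed.
    restored = max((len(p) for p in prefill_cache
                    if prompt[:len(p)] == list(p) and len(p) > shared), default=shared)
    # 3. Only the genuinely new tail pays for prefill, one block at a time.
    new_tokens = len(prompt) - restored
    new_blocks = -(-new_tokens // BLOCK_SIZE)          # ceiling division
    return {"shared": shared, "restored": restored - shared,
            "prefilled": new_tokens, "new_blocks": new_blocks}
```

For a 40-token prompt whose first 16 tokens are live in another request and whose first 32 were computed before, only 8 tokens are prefilled and a single new block is allocated.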

Every decision in this pipeline maps back to the four questions.

What do you keep?
Blocks that are likely to be reused. Blocks that are load-bearing for active requests.

What do you share?
Prefix blocks that are identical across concurrent users.

What do you move?
Prefill results that can be reused across time instead of recomputed.

What do you let go?
Blocks that are old, unique, and unlikely to be needed again.

Taken together these techniques form a coherent memory management system.

And across every stage, one pattern repeats.
The cost is never removed.
It is moved, reduced, shared, or delayed.

From computation to memory.
From memory to reuse.
From reuse to decisions about what to keep and what to forget.

This is not a model optimization.
It is not a hardware upgrade.

It is an engineering response to a system-level constraint.

And this is still only one GPU.

As soon as we scale beyond it, new questions emerge.
How do we split the model across devices?
How do we coordinate memory across a fleet?
What happens when prefill and decode are so different that they no longer belong on the same hardware?

That is where the next layer of the system begins.

And where scaling stops being about memory alone, and becomes about coordination across an entire system.

  1. Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. https://arxiv.org/abs/2309.06180
  2. Kwon, W., & Li, Z. (2023). vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. vLLM Blog. https://blog.vllm.ai/2023/06/20/vllm.html
  3. vLLM Project Documentation and Source Code. https://github.com/vllm-project/vllm
  4. Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2024). SGLang: Efficient Execution of Structured Language Model Programs. Advances in Neural Information Processing Systems 37 (NeurIPS 2024). https://arxiv.org/abs/2312.07104
  5. Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., & Gholami, A. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Advances in Neural Information Processing Systems 37 (NeurIPS 2024). https://arxiv.org/abs/2401.18079
  6. Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., & Chen, B. (2023). H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. https://arxiv.org/abs/2306.14048
  7. NVIDIA. (2023). NVIDIA TensorRT-LLM: Supercharging Large Language Model Inference. NVIDIA Developer Blog. https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
  8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762

Inside LLM Inference: When the KV Cache No Longer Fits was originally published in Towards AI on Medium.
