@_avichawla: Prefill & decode in LLM inference. Have you ever noticed that the first token from an LLM always takes a moment to appe…
Summary
Explains the two phases of LLM inference - prefill and decode - detailing how GPU bottlenecks shift from compute-bound during prefill to memory-bound during decode, and the importance of KV caching.
Cached at: 06/29/26, 10:26 AM
Prefill & decode in LLM inference.
Have you ever noticed that the first token from an LLM always takes a moment to appear? But the subsequent tokens stream out smoothly?
That pause isn't network lag. It's a structural property of how LLMs work.
Inference happens in two phases that share the same model and the same code path, but the workload looks completely different in each, with different bottlenecks.
The prefill stage starts when you submit a prompt.
The model processes every input token in one parallel pass, computing Q, K, and V for all of them at once.
Attention runs as large matrix multiplications, which keep the GPU's arithmetic units at high utilization.
Prefill is compute-bound, and the metric that captures it is time-to-first-token (TTFT).
The decode stage starts once the first token is out.
To generate the next one, the model only computes Q, K, and V for that single new token, because everything before it is already cached.
So the model loops, producing one token per forward pass and multiplying a single query vector against the cached keys instead of running a full matrix-matrix product. Each step therefore involves very little arithmetic.
But the GPU still has to load every weight and every cached entry from memory to do that tiny computation, so the bottleneck flips and compute sits idle while memory bandwidth becomes the limiting factor.
Decode is memory-bound, and the metric that captures it is inter-token latency (ITL).
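One way to see why the bottleneck flips is to compare how many FLOPs each phase performs per byte of weights it reads. The sketch below is a back-of-the-envelope estimate, not a profiler: it assumes about 2 FLOPs per parameter per token, FP16 weights streamed from memory once per forward pass, and a hypothetical accelerator with 300 TFLOP/s of compute and 2 TB/s of memory bandwidth. It ignores attention FLOPs and KV cache traffic.

```python
def arithmetic_intensity(n_params, tokens_per_pass, bytes_per_param=2):
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token (one multiply, one add),
    and every weight is read from GPU memory exactly once per pass.
    """
    flops = 2.0 * n_params * tokens_per_pass
    bytes_read = n_params * bytes_per_param
    return flops / bytes_read


# Hypothetical hardware numbers, chosen only for illustration.
PEAK_FLOPS = 300e12      # FLOP/s
PEAK_BANDWIDTH = 2e12    # bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs per byte needed to keep compute busy


def bottleneck(intensity):
    """Classify a workload by comparing its intensity to the ridge point."""
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"


# Prefill pushes the whole prompt through in one pass; decode pushes one token.
prefill_intensity = arithmetic_intensity(13e9, tokens_per_pass=2048)
decode_intensity = arithmetic_intensity(13e9, tokens_per_pass=1)

print("prefill:", prefill_intensity, bottleneck(prefill_intensity))
print("decode: ", decode_intensity, bottleneck(decode_intensity))
```

Under these assumptions the weights are reused across every prompt token during prefill, but only once per generated token during decode, which is why the same model lands on opposite sides of the ridge point.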
GPU utilization peaks during prefill and drops sharply during decode because memory, not compute, is the bottleneck in the second phase.
Throwing more compute at a slow-streaming model often does nothing because the fix for memory-bound workloads is faster memory or a smaller cache, not more FLOPs.
Long contexts feel disproportionately slow because the KV cache grows with every token, and every decode step has to read all of it.
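A rough lower bound on per-token latency makes this concrete. The sketch below assumes a single sequence (no batching), 26 GB of FP16 weights for a 13B model, about 0.8 MB of KV cache per token, and a hypothetical 2 TB/s of memory bandwidth. The function name and all of the numbers are illustrative assumptions, and real systems will be slower than this floor.

```python
def decode_step_floor_ms(context_len, weight_bytes, kv_bytes_per_token,
                         bandwidth_bytes_per_s):
    """Lower bound on one decode step, assuming it is purely bandwidth-limited.

    Every step reads all weights plus the KV cache for every token so far.
    """
    bytes_read = weight_bytes + context_len * kv_bytes_per_token
    return 1000.0 * bytes_read / bandwidth_bytes_per_s


WEIGHT_BYTES = 26e9          # assumed: 13B parameters at 2 bytes each
KV_BYTES_PER_TOKEN = 819200  # assumed: ~0.8 MB per token
BANDWIDTH = 2e12             # assumed: 2 TB/s

for context_len in (0, 4096, 32768):
    floor = decode_step_floor_ms(context_len, WEIGHT_BYTES,
                                 KV_BYTES_PER_TOKEN, BANDWIDTH)
    print(f"context {context_len:>6}: at least {floor:.2f} ms per token")
```

With these numbers the floor is about 13 ms per token with an empty context and about 26 ms at 32K tokens, because by then the cache is as large as the weights themselves.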
But maintaining the cache is an important optimization since it makes decoding viable.
- Without KV cache, every new token would force the model to recompute K and V, and then attention, over the entire growing sequence.
- With KV cache, the cache is built once during prefill, then grows by exactly one entry per decode step, with existing entries reused rather than recomputed.
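The two bullets above can be sketched with a toy single-head, single-layer attention in plain Python. This is a minimal illustration, not how a real model is implemented: the random vectors stand in for token hidden states, and `KVCache`, `decode_with_cache`, and `decode_without_cache` are hypothetical names made up for this example.

```python
import math
import random


def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over lists of key and value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_without_cache(tokens, Wq, Wk, Wv):
    """Recompute K and V for the whole sequence, then attend from the last token."""
    keys = [matvec(x, Wk) for x in tokens]
    values = [matvec(x, Wv) for x in tokens]
    return attend(matvec(tokens[-1], Wq), keys, values)


class KVCache:
    """Stores one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, x, Wk, Wv):
        self.keys.append(matvec(x, Wk))
        self.values.append(matvec(x, Wv))


def decode_with_cache(x_new, cache, Wq, Wk, Wv):
    """Compute K and V only for the new token and reuse everything cached."""
    cache.append(x_new, Wk, Wv)
    return attend(matvec(x_new, Wq), cache.keys, cache.values)


random.seed(0)
d = 4


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
prompt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
new_token = [random.gauss(0, 1) for _ in range(d)]

cache = KVCache()
for x in prompt:  # "prefill": build the cache once
    cache.append(x, Wk, Wv)

cached_out = decode_with_cache(new_token, cache, Wq, Wk, Wv)
full_out = decode_without_cache(prompt + [new_token], Wq, Wk, Wv)
```

Both paths produce the same attention output for the new token. The cached path computes one new key and value, while the uncached path recomputes six of each.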
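The cache lives in GPU memory and grows linearly with sequence length. A 13B model with standard multi-head attention needs roughly 1 MB per token at 16-bit precision (about 0.8 MB for 40 layers and a hidden size of 5120), which means a 4K context consumes roughly 3 to 4 GB of VRAM for the cache alone.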
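The per-token figure can be reproduced with simple arithmetic. The sketch below assumes a 13B-class configuration of 40 layers, 40 attention heads, and a head dimension of 128, stored in FP16. These are assumptions for illustration, so check your model's actual config. The grouped-query variant with 8 KV heads is likewise hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token.

    The leading factor of 2 accounts for storing one key vector and one
    value vector per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed 13B-class config with standard multi-head attention, FP16.
per_token = kv_cache_bytes_per_token(n_layers=40, n_kv_heads=40, head_dim=128)
total_4k = per_token * 4096

# Hypothetical grouped-query variant: same model, only 8 KV heads.
per_token_gqa = kv_cache_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)

print(f"per token:        {per_token / 1e6:.2f} MB")
print(f"4K context:       {total_4k / 1e9:.2f} GB")
print(f"per token (GQA):  {per_token_gqa / 1e6:.2f} MB")
```

Under these assumptions the cache costs about 0.82 MB per token and about 3.4 GB at 4K tokens, which lands close to the rough 1 MB-per-token rule of thumb. Cutting the number of KV heads shrinks the cache proportionally.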
The entire field is now optimizing around this constraint with quantized caches, sliding windows, grouped-query attention, and PagedAttention, while DeepSeek’s V4 series goes further and redesigns attention itself so the cache stays small from the start.
The practical takeaway is that when someone says their model feels slow, the first question is whether it’s slow to start or slow to stream.
Slow to start means prefill and a compute bottleneck, while slow to stream means decode and a memory bottleneck.
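To answer that question with numbers, you can timestamp each streamed token and compute both metrics. The sketch below is a minimal version: `latency_metrics` and `diagnose` are hypothetical helpers, and the budget thresholds are arbitrary placeholders that you would replace with your own targets.

```python
def latency_metrics(request_sent_at, token_arrival_times):
    """Return (TTFT, mean ITL) in seconds from raw timestamps."""
    if not token_arrival_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_arrival_times[0] - request_sent_at
    gaps = [later - earlier
            for earlier, later in zip(token_arrival_times, token_arrival_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl


def diagnose(ttft, itl, ttft_budget=1.0, itl_budget=0.05):
    """Map the two metrics onto the two phases. Budgets are placeholder values."""
    findings = []
    if ttft > ttft_budget:
        findings.append("slow to start: prefill, compute bottleneck")
    if itl > itl_budget:
        findings.append("slow to stream: decode, memory bottleneck")
    return findings


# Example: the first token arrives after 2 s, then one token every 20 ms.
ttft, itl = latency_metrics(0.0, [2.0, 2.02, 2.04, 2.06])
print(diagnose(ttft, itl))
```

In practice you would record the timestamps on the client, so the measured TTFT also includes network and queueing time on top of prefill.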
The article below is a first-principles guide to LLM inference that walks through everything between your prompt and the streamed response, covering tokenization, embeddings, attention, the prefill and decode split, KV caching, and quantization.
It will give you a complete mental model of how inference actually works under the hood.
Read it below.