@_avichawla: Prefill & decode in LLM inference. Have you ever noticed that the first token from an LLM always takes a moment to appe…


Summary

Explains the two phases of LLM inference - prefill and decode - detailing how GPU bottlenecks shift from compute-bound during prefill to memory-bound during decode, and the importance of KV caching.


Cached at: 06/29/26, 10:26 AM

Prefill & decode in LLM inference.

Have you ever noticed that the first token from an LLM always takes a moment to appear, while the subsequent tokens stream out smoothly?

That pause isn’t network lag; it’s a structural property of how LLMs fundamentally work.

Inference happens in two phases that share the same model and the same code path, but the workload looks completely different in each, with different bottlenecks.

The prefill stage starts when you submit a prompt.

The model processes every input token in one parallel pass, computing Q, K, and V for all of them at once.

Attention runs as one large matrix multiplication, which keeps the GPU’s compute units at high utilization.

Prefill is compute-bound, and the metric that captures it is time-to-first-token (TTFT).

The decode stage starts once the first token is out.

To generate the next one, the model only computes Q, K, and V for that single new token, because everything before it is already cached.

So the model loops one token per forward pass, multiplying a single query against the cached keys instead of a full matrix. Each step involves very little arithmetic, so the computation itself is fast.

But the GPU still has to load every weight and every cached entry from memory to do that tiny computation, so the bottleneck flips and compute sits idle while memory bandwidth becomes the limiting factor.

Decode is memory-bound, and the metric that captures it is inter-token latency (ITL).

GPU utilization peaks during prefill and drops sharply during decode because memory, not compute, is the bottleneck in the second phase.
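One way to make the compute-bound versus memory-bound split concrete is arithmetic intensity: FLOPs performed per byte moved. The sketch below estimates it for a single weight matrix applied to a batch of tokens. The layer size, the 16-bit precision, and the hardware numbers (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) are illustrative assumptions, not the specifications of any particular GPU.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for multiplying n_tokens activations by one weight matrix."""
    flops = 2 * n_tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value
    activation_bytes = n_tokens * (d_in + d_out) * bytes_per_value  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)

# Hypothetical accelerator (assumed round numbers).
PEAK_FLOPS = 300e12      # FLOP/s
PEAK_BANDWIDTH = 2e12    # bytes/s
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to keep compute busy

prefill_intensity = arithmetic_intensity(n_tokens=2048, d_in=5120, d_out=5120)
decode_intensity = arithmetic_intensity(n_tokens=1, d_in=5120, d_out=5120)

print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(f"prefill:     {prefill_intensity:.0f} FLOP/byte")
print(f"decode:      {decode_intensity:.2f} FLOP/byte")
```

Under these assumptions, the 2048-token prefill pass lands well above the ridge point, while the single-token decode step sits near 1 FLOP per byte, far below it. That gap is why the same weights keep compute busy in one phase and leave it waiting on memory in the other.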

Throwing more compute at a slow-streaming model often does nothing because the fix for memory-bound workloads is faster memory or a smaller cache, not more FLOPs.

Long contexts feel disproportionately slow because the KV cache grows with every token, and every decode step has to read all of it.
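A rough lower bound on inter-token latency follows from this: each decode step must read the weights plus the whole cache, so it cannot finish faster than those bytes can be streamed from memory. The sketch below assumes a single request, 26 GB of weights (13B parameters at 16 bits), about 1 MB of cache per token, and a hypothetical 2 TB/s of memory bandwidth. All four figures are illustrative.

```python
def min_decode_step_seconds(weight_bytes, kv_bytes_per_token, context_len, bandwidth):
    """Bandwidth-only lower bound on the time for one decode step."""
    bytes_read = weight_bytes + kv_bytes_per_token * context_len
    return bytes_read / bandwidth

WEIGHT_BYTES = 26e9         # assumed: 13B parameters * 2 bytes
KV_BYTES_PER_TOKEN = 1e6    # assumed: roughly 1 MB per cached token
BANDWIDTH = 2e12            # assumed: 2 TB/s

empty = min_decode_step_seconds(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, 0, BANDWIDTH)
at_4k = min_decode_step_seconds(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, 4096, BANDWIDTH)
at_32k = min_decode_step_seconds(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, 32768, BANDWIDTH)

for label, seconds in [("0 tokens", empty), ("4K tokens", at_4k), ("32K tokens", at_32k)]:
    print(f"{label:>10}: at least {seconds * 1e3:.1f} ms per token")
```

This ignores compute, kernel overheads, and batching. Batching in particular changes the picture, because one read of the weights is shared across every sequence in the batch while each sequence still pays for its own cache.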

But maintaining the cache is an important optimization since it makes decoding viable.

  • Without KV cache, every new token would force the model to recompute the keys and values, and therefore attention, for the entire growing sequence.
  • With KV cache, the cache is built once during prefill, then grows by exactly one entry per decode step, with existing entries reused rather than recomputed.
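The difference between the two bullets can be sketched with a toy single-head attention layer in plain Python. The weights, embeddings, and helper names (`project`, `attend`, `step_with_cache`, `step_without_cache`) are made up for illustration; real models do this per layer and per head on GPU tensors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def step_without_cache(tokens, Wq, Wk, Wv):
    """Recompute K and V for every token, then attend from the last one."""
    keys = [project(Wk, t) for t in tokens]
    values = [project(Wv, t) for t in tokens]
    return attend(project(Wq, tokens[-1]), keys, values)

def step_with_cache(token, cache, Wq, Wk, Wv):
    """Project only the new token, append it to the cache, then attend."""
    cache["keys"].append(project(Wk, token))
    cache["values"].append(project(Wv, token))
    return attend(project(Wq, token), cache["keys"], cache["values"])

# Toy 2-dimensional weights and embeddings (illustrative values only).
Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.4, 0.1], [-0.3, 0.2]]
Wv = [[1.0, 0.0], [0.5, 0.5]]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

# Prefill: build the cache once from the whole prompt.
cache = {"keys": [project(Wk, t) for t in prompt],
         "values": [project(Wv, t) for t in prompt]}

cached_out = step_with_cache(new_token, cache, Wq, Wk, Wv)
full_out = step_without_cache(prompt + [new_token], Wq, Wk, Wv)
```

Both paths produce the same attention output for the new token. The cached path performs two K/V projections for the step instead of eight, and that saving grows with every additional token in the sequence.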

The cache lives in GPU memory and grows linearly with sequence length, so a 13B model roughly requires 1 MB per token at 16-bit precision, which means a 4K context consumes about 4 GB of VRAM on the cache alone.
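The per-token figure can be checked with a back-of-the-envelope calculation: one key and one value vector per layer, per KV head. The shape below (40 layers, 40 KV heads, head dimension 128, 16-bit values) is an assumption typical of 13B-class models with standard multi-head attention, not a number taken from the thread.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token: a key and a value per layer and head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed 13B-class shape (illustrative).
per_token = kv_cache_bytes_per_token(n_layers=40, n_kv_heads=40, head_dim=128)
context_4k = per_token * 4096

print(f"per token:  {per_token / 2**20:.2f} MiB")
print(f"4K context: {context_4k / 2**30:.3f} GiB")
```

With these assumptions the result is about 0.78 MiB per token and a little over 3 GiB for 4K tokens, which matches the rounded "1 MB per token, 4 GB at 4K" rule of thumb.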

The entire field is now optimizing around this constraint with quantized caches, sliding windows, grouped-query attention, and PagedAttention, while DeepSeek’s V4 series goes further and redesigns attention itself so the cache stays small from the start.
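Of the techniques listed, a sliding window is probably the simplest to picture: the cache keeps only the most recent entries, so its size stops growing once the window is full. The class below is a hypothetical minimal sketch built on `collections.deque`. Production implementations typically use preallocated buffers on the GPU instead.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps keys and values for only the most recent `window` tokens."""

    def __init__(self, window):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

window_cache = SlidingWindowKVCache(window=4)
for position in range(10):
    # Stand-in key/value payloads; a real cache would store vectors.
    window_cache.append(f"k{position}", f"v{position}")

print(len(window_cache), list(window_cache.keys))
```

The trade-off is that tokens older than the window are no longer directly visible to attention, so memory per decode step is capped at the cost of long-range context.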

The practical takeaway is that when someone says their model feels slow, the first question is whether it’s slow to start or slow to stream.

Slow to start means prefill and a compute bottleneck, while slow to stream means decode and a memory bottleneck.
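To answer the "slow to start or slow to stream" question with numbers, it is enough to record when the request was sent and when each token arrived. The helper below is a minimal sketch; `latency_metrics` and the sample timestamps are hypothetical and not tied to any particular client library.

```python
def latency_metrics(request_time, token_times):
    """Return (TTFT, mean ITL) in seconds from token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_time
    # Gaps between consecutive tokens make up the inter-token latencies.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

# Made-up example: request sent at t=0.0 s, tokens arriving afterwards.
ttft, mean_itl = latency_metrics(0.0, [0.80, 0.85, 0.90, 0.95])
print(f"TTFT: {ttft * 1e3:.0f} ms, mean ITL: {mean_itl * 1e3:.0f} ms")
```

A high TTFT with a low ITL points at prefill, while a low TTFT with a high ITL points at decode.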

The article below is a first-principles guide to LLM inference that walks through everything between your prompt and the streamed response, covering tokenization, embeddings, attention, the prefill and decode split, KV caching, and quantization.

It will give you a complete mental model of how inference actually works under the hood.

Read it below.
