@athleticKoder: A 1600-word note on how LLM inference works: Covering: 1. Attention - the only place tokens interact 2. KV caching - why…

Summary

A detailed thread explaining key concepts of LLM inference: attention, KV caching, chunked prefill, and batching techniques, including continuous batching used in vLLM and SGLang.

Cached at: 07/02/26, 06:25 PM

A 1600-word note on how LLM inference works:

Covering:

  1. Attention - the only place tokens interact

  2. KV caching - why decoding is cheap once prefill is done

  3. Chunked prefill - handling prompts too big to fit in memory

  4. Naive batching - why padding kills throughput

  5. Continuous batching - ragged batching + dynamic scheduling combined

This is the technique quietly powering vLLM, SGLang and most modern serving stacks. Building it up from first principles.

An LLM is just a next-token predictor. It reads your whole prompt once (prefill), then generates tokens one at a time, re-reading everything so far each time (decode).

That decode loop is expensive. Continuous batching is the biggest lever for serving many users without wasting GPU cycles.

Attention is the only layer where tokens talk to each other. Everything else (layernorm, matmuls) is token-wise.

QKᵀ scores similarity between every pair of tokens - this is the quadratic cost everyone complains about.

Then you apply a mask (causal = only look backward), softmax, and multiply by V. That’s one attention head.
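To make the steps concrete, here is a minimal single-head sketch in plain Python. It assumes the usual 1/sqrt(d) scaling (not mentioned above) and applies the causal mask by only scoring keys at or before the query position. The toy Q, K, V values are made up for illustration.

```python
import math

def causal_attention(Q, K, V):
    """Single-head causal attention over lists of row vectors."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Causal mask: token i only scores keys 0..i (it looks backward).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value rows.
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out

# Toy inputs: 3 tokens, head dimension 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(Q, K, V)

# Changing the last token's K/V leaves earlier outputs untouched.
K_changed = K[:2] + [[9.0, -9.0]]
V_changed = V[:2] + [[0.0, 0.0]]
out_changed = causal_attention(Q, K_changed, V_changed)
```

The first token can only see itself, so its output is exactly its own V row. The second comparison shows the property the KV cache relies on: later tokens never change earlier outputs.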

Read the mask like a grid: green = “this token can see that token.”

Once you’re comfortable with the full picture, you can simplify: just draw Q, K, and the mask. V always matches K’s length, so no need to draw it separately.

Naive continuation: to generate the next token, you’d redo the full forward pass - recomputing K and V for tokens you already processed. Pure waste.

#1 - KV cache:

The newest token never affects earlier tokens’ attention (causal masking). And K/V for everything before it were already computed in previous steps. So cache them instead of recomputing.

This takes decode from O(n²) to O(n) compute per new token, at the cost of O(n) memory. Only the new token gets computed fresh - everything else is a lookup.
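A rough sketch of the idea, assuming a single head and a made-up `project` function standing in for the learned Q/K/V projections (both are illustrative, not any library's API). It decodes the same tokens with and without a cache and counts how many projections each path performs.

```python
import math

projection_calls = 0  # counts how often a token's Q/K/V get computed

def project(x):
    """Hypothetical stand-in for the learned Q/K/V projections of one token."""
    global projection_calls
    projection_calls += 1
    q = [2.0 * a for a in x]
    k = [a + 1.0 for a in x]
    v = [a * a for a in x]
    return q, k, v

def attend(q, keys, values):
    """Softmax attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

def decode_without_cache(xs):
    """Redo the full pass: re-project every token, return the last output."""
    projected = [project(x) for x in xs]
    keys = [k for _, k, _ in projected]
    values = [v for _, _, v in projected]
    return attend(projected[-1][0], keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append its K/V, attend over the cache."""
        q, k, v = project(x)
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

tokens = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [-0.2, 0.6]]

projection_calls = 0
naive = [decode_without_cache(tokens[:t + 1]) for t in range(len(tokens))]
naive_calls = projection_calls   # 1 + 2 + 3 + 4 = 10 projections

projection_calls = 0
cache = KVCache()
cached = [cache.step(x) for x in tokens]
cached_calls = projection_calls  # one projection per token = 4
```

Both paths produce the same outputs. The cached path just trades recomputation for the memory held in `cache.keys` and `cache.values`.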

(Llama-2-7B in fp16: ~16 KB of cache per token per layer, so roughly 512 KB per token across its 32 layers. Adds up fast at scale.)
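The footprint can be checked from the model shape. A small calculation, assuming Llama-2-7B's published configuration (32 layers, 32 KV heads, head dimension 128) and 2-byte fp16 values. The function name and the 4096-token example are illustrative.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored for one token across all layers."""
    per_layer = 2 * kv_heads * head_dim * bytes_per_value  # 2 = one K + one V
    return per_layer * layers

# Assumed Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_layer = kv_cache_bytes_per_token(layers=1, kv_heads=32, head_dim=128)
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)

print(per_layer // 1024, "KiB per token per layer")  # 16
print(per_token // 1024, "KiB per token in total")   # 512

# One 4096-token sequence (illustrative length):
per_sequence_gib = per_token * 4096 / 1024**3
print(per_sequence_gib, "GiB for a 4096-token sequence")  # 2.0
```

Under these assumptions, 16 KiB is the per-layer share and a single 4096-token sequence already holds about 2 GiB of cache.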

#2 - chunked prefill:

Real prompts (your whole codebase in a Cursor context window) don’t fit in memory in one pass. Split the prefill into chunks, use the KV cache to carry state between them.
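A minimal sketch of the mechanics, assuming a single head and identity Q/K/V projections (each token's vector serves as its own q, k and v) to keep it short. `chunked_prefill` is a hypothetical helper, not a library call. Each chunk appends its K/V to the cache, then its queries attend over the cache up to their own position.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

def chunked_prefill(xs, chunk_size):
    """Process the prompt chunk by chunk, carrying K/V between chunks."""
    cache_k, cache_v, outputs = [], [], []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        # Only this chunk's K/V are computed now; earlier ones are cached.
        # (Identity projections: an assumption made for brevity.)
        cache_k.extend(chunk)
        cache_v.extend(chunk)
        for offset, q in enumerate(chunk):
            visible = start + offset + 1  # causal: positions 0..current
            outputs.append(attend(q, cache_k[:visible], cache_v[:visible]))
    return outputs

prompt = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [-0.2, 0.6], [0.7, 0.0]]
one_pass = chunked_prefill(prompt, chunk_size=len(prompt))
in_chunks = chunked_prefill(prompt, chunk_size=2)
```

`in_chunks` matches `one_pass`, but at any moment only `chunk_size` queries are in flight, which is what bounds the peak activation memory.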

Now the batching problem. Naively batching prompts means padding everything to the same length - tensors have to be rectangular.

This works fine when prompts finish around the same time.

But real traffic doesn’t cooperate. Swap a finished slot for a new prompt, and that prompt needs a full prefill while everyone else is mid-decode.

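Result: a wall of padding. Inserting an n-token prompt into a batch of B sequences that are each decoding one token adds (n-1)(B-1) padding tokens, so the waste grows with batch size × prompt length.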
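A quick back-of-the-envelope check of how bad it gets. The batch below (7 sequences mid-decode plus one new 1000-token prompt) is a made-up example, and `padded_batch_stats` is a hypothetical helper.

```python
def padded_batch_stats(lengths):
    """Padding tokens and useful-token fraction for a rectangular batch."""
    width = max(lengths)            # every row is padded to the longest one
    total = width * len(lengths)    # slots in the rectangular tensor
    real = sum(lengths)             # tokens that carry actual work
    return total - real, real / total

# 7 sequences decoding (1 new token each) + 1 new prompt of 1000 tokens.
lengths = [1] * 7 + [1000]
padding, utilization = padded_batch_stats(lengths)

print(padding)      # 6993 = (1000 - 1) * (8 - 1)
print(utilization)  # about 0.126: roughly 87% of the batch is padding
```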

#3 - ragged batching:

Instead of stacking prompts on a batch axis (→ padding), concatenate them into one long sequence.

Then use the attention mask to stop prompt 0’s tokens from leaking into prompt 1’s attention. No padding needed.
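The mask for a packed batch can be sketched like this. It is a simplification that assumes every token of each prompt is present in the packed sequence (no cached tokens), and `ragged_causal_mask` is a hypothetical helper name.

```python
def ragged_causal_mask(lengths):
    """Block-diagonal causal mask for prompts concatenated into one sequence.

    mask[i][j] is True when packed position i may attend to position j:
    both must belong to the same prompt, and j must not come after i.
    """
    seq_id = [s for s, n in enumerate(lengths) for _ in range(n)]
    total = len(seq_id)
    return [[seq_id[i] == seq_id[j] and j <= i for j in range(total)]
            for i in range(total)]

# Prompt 0 has 2 tokens, prompt 1 has 3 tokens: 5 packed positions, 0 padding.
mask = ragged_causal_mask([2, 3])
rows = ["".join("x" if allowed else "." for allowed in row) for row in mask]
for row in rows:
    print(row)
# x....
# xx...
# ..x..
# ..xx.
# ..xxx
```

Each prompt gets its own causal triangle on the diagonal, and the empty off-diagonal blocks are what keep prompt 0 out of prompt 1's attention.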

Pack as many prompts as fit into your token budget, mixing prefill chunks and decode tokens in the same batch. Combine with dynamic scheduling (swap finished prompts immediately) → continuous batching.
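A toy scheduler sketch of that loop. It is not how vLLM or SGLang implement it. It assumes a fixed token budget per forward pass, gives decode tokens priority over prefill chunks, and treats the final prefill chunk as also producing the first output token. `Request`, `plan_step` and `run_step` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0
    generated: int = 0

def plan_step(running, waiting, token_budget):
    """Choose (request, n_tokens) pairs for one forward pass."""
    plan, budget = [], token_budget
    for req in running:  # decode: one token per fully prefilled request
        if req.prefilled == req.prompt_len and budget > 0:
            plan.append((req, 1))
            budget -= 1
    for req in running:  # continue prompts that are partway through prefill
        if req.prefilled < req.prompt_len and budget > 0:
            n = min(budget, req.prompt_len - req.prefilled)
            plan.append((req, n))
            budget -= n
    while waiting and budget > 0:  # admit new prompts into leftover budget
        req = waiting.popleft()
        running.append(req)
        n = min(budget, req.prompt_len)
        plan.append((req, n))
        budget -= n
    return plan

def run_step(plan, running):
    """Pretend to run the forward pass; drop finished requests immediately."""
    finished = []
    for req, n in plan:
        if req.prefilled < req.prompt_len:
            req.prefilled += n
            produced = req.prefilled == req.prompt_len  # last chunk emits a token
        else:
            produced = True
        if produced:
            req.generated += 1
        if req.generated >= req.max_new_tokens:
            running.remove(req)
            finished.append(req.name)
    return finished

waiting = deque([Request("a", 6, 2), Request("b", 3, 3), Request("c", 4, 1)])
running, tokens_per_step, finish_order = [], [], []
while running or waiting:
    plan = plan_step(running, waiting, token_budget=8)
    tokens_per_step.append(sum(n for _, n in plan))
    finish_order += run_step(plan, running)

print(tokens_per_step)  # [8, 6, 1, 1]
print(finish_order)     # ['a', 'c', 'b']
```

In the second step the batch mixes a decode token for "a", the last prefill chunk of "b", and the whole prompt of "c". Finished requests free their slot right away instead of waiting for the rest of the batch.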

So continuous batching = 3 ideas stacked:

  1. KV cache → don’t recompute the past
  2. Chunked prefill → handle long prompts within memory limits
  3. Ragged batching + dynamic scheduling → kill padding, keep GPUs saturated

This is why ChatGPT can serve thousands of users concurrently.

That’s it! You can find the entire blog post here - http://huggingface.co/blog/continuous_batching…
