@athleticKoder: A 1600-word note on how LLM inference works: Covering: 1. Attention - the only place tokens interact 2. KV caching - why…
Summary
A detailed thread explaining key concepts of LLM inference: attention, KV caching, chunked prefill, and batching techniques, including continuous batching used in vLLM and SGLang.
Cached at: 07/02/26, 06:25 PM
A 1600-word note on how LLM inference works:
Covering:
- Attention - the only place tokens interact
- KV caching - why decoding is cheap once prefill is done
- Chunked prefill - handling prompts too big to fit in memory
- Naive batching - why padding kills throughput
- Continuous batching - ragged batching + dynamic scheduling combined
Continuous batching is the technique quietly powering vLLM, SGLang, and most modern serving stacks. Let's build it up from first principles.
An LLM is just a next-token predictor. It reads your whole prompt once (prefill), then generates tokens one at a time, re-reading everything so far each time (decode).
That decode loop is expensive. Continuous batching is the biggest lever for serving many users without wasting GPU cycles.
Attention is the only layer where tokens talk to each other. Everything else (layernorm, matmuls) is token-wise.
QKᵀ scores similarity between every pair of tokens - this is the quadratic cost everyone complains about.
Then you apply a mask (causal = only look backward), softmax, and multiply by V. That’s one attention head.
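Here is a minimal NumPy sketch of one causal attention head, following the steps above. The shapes, the random inputs, and the function name `causal_attention` are illustrative assumptions; the 1/sqrt(d) scaling is the standard convention and is not mentioned in the thread.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head attention with a causal mask. q, k, v: (n_tokens, d_head)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (n, n): one score per token pair
    mask = np.tril(np.ones((n, n), dtype=bool))      # True where token i may see token j <= i
    scores = np.where(mask, scores, -np.inf)         # hide future tokens
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
```

The `(n, n)` score matrix is where the quadratic cost lives. The first token can only see itself, so its output is exactly `v[0]`.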
Read the mask like a grid: green = “this token can see that token.”
Once you’re comfortable with the full picture, you can simplify: just draw Q, K, and the mask. V always matches K’s length, so no need to draw it separately.
Naive continuation: to generate the next token, you’d redo the full forward pass - recomputing K and V for tokens you already processed. Pure waste.
#1 - KV cache:
The newest token never affects earlier tokens’ attention (causal masking). And you already computed K/V for everything before it last step. So cache them instead of recomputing.
This takes attention compute per new token from O(n²) to O(n), at the cost of O(n) memory for the cache. Only the new token's Q, K, and V get computed fresh - everything else is a lookup.
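A small sketch of the idea, assuming a single attention layer with made-up projection matrices `wq`, `wk`, `wv` and random vectors standing in for token embeddings. It decodes token by token with a cache and checks that the result matches recomputing everything from scratch.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, wq, wk, wv):
    """Naive path: recompute Q, K, V for every token on every call."""
    q, k, v = x @ wq, x @ wk, x @ wv
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ v

def decode_step(x_new, wq, wk, wv, cache):
    """Cached path: project only the new token, append its K/V, attend over the cache."""
    q = x_new @ wq                                          # (1, d)
    cache["k"] = np.concatenate([cache["k"], x_new @ wk])
    cache["v"] = np.concatenate([cache["v"], x_new @ wv])
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])        # (1, n): linear in n
    return softmax(scores) @ cache["v"]

rng = np.random.default_rng(0)
d = 8
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((5, d))                             # stand-in for 5 token embeddings

cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
cached_out = np.concatenate(
    [decode_step(x[i:i + 1], wq, wk, wv, cache) for i in range(5)]
)
full_out = full_causal_attention(x, wq, wk, wv)
print(np.allclose(cached_out, full_out))
```

No mask is needed in `decode_step`: the cache only ever contains the past, so causality comes for free.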
(Llama-2-7B in fp16: ~16KB of cache per token per layer, so ~512KB per token across its 32 layers. Adds up fast at scale.)
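The arithmetic behind that figure, as a quick sketch. It assumes Llama-2-7B's published configuration (32 layers, 32 KV heads, head dimension 128) and fp16 storage at 2 bytes per value; the helper name is made up. Under those assumptions, 16KB is the cost for a single layer, and the full model needs 32 times that per token.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one K vector and one V vector are stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_layer = kv_cache_bytes_per_token(1, 32, 128, 2)
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)

print(per_layer // 1024, "KiB per token, one layer")            # 16
print(per_token // 1024, "KiB per token, all 32 layers")        # 512
print(per_token * 4096 / 2**30, "GiB for one 4096-token sequence")  # 2.0
```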
#2 - chunked prefill:
Real prompts (say, your whole codebase in a Cursor context window) may not fit in memory in a single forward pass. Split the prefill into chunks and use the KV cache to carry state between them.
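A sketch of chunked prefill for a single attention layer, under the same toy assumptions as before (random embeddings, made-up weights `wq`, `wk`, `wv`, hypothetical function name). Each chunk attends to the cached keys from earlier chunks plus a causal view of itself, and the result matches a one-shot prefill.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill_chunk(x_chunk, wq, wk, wv, cache):
    """Process one chunk of prompt tokens against everything cached so far."""
    q = x_chunk @ wq
    cache["k"] = np.concatenate([cache["k"], x_chunk @ wk])
    cache["v"] = np.concatenate([cache["v"], x_chunk @ wv])
    n_past = cache["k"].shape[0] - q.shape[0]
    # Chunk token i sits at absolute position n_past + i and may see keys 0..n_past + i.
    key_pos = np.arange(cache["k"].shape[0])[None, :]
    query_pos = n_past + np.arange(q.shape[0])[:, None]
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])
    scores = np.where(key_pos <= query_pos, scores, -np.inf)
    return softmax(scores) @ cache["v"]

rng = np.random.default_rng(0)
d, n, chunk_size = 8, 10, 4
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((n, d))

cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
chunked = np.concatenate(
    [prefill_chunk(x[s:s + chunk_size], wq, wk, wv, cache) for s in range(0, n, chunk_size)]
)

one_shot_cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
one_shot = prefill_chunk(x, wq, wk, wv, one_shot_cache)
print(np.allclose(chunked, one_shot))
```

The score matrix per step is at most `chunk_size × n` instead of `n × n`, which is what keeps peak memory bounded.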
Now the batching problem. Naively batching prompts means padding everything to the same length - tensors have to be rectangular.
This works fine when prompts finish around the same time.
But real traffic doesn’t cooperate. Swap a finished slot for a new prompt, and that prompt needs a full prefill while everyone else is mid-decode.
Result: a wall of padding. Cost scales quadratically with batch size × prompt length.
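To put a number on the wall of padding, here is a tiny sketch with a hypothetical batch: three requests mid-decode (one new token each) sharing a rectangular tensor with one fresh 512-token prefill. The lengths and the helper name are made up for illustration.

```python
def padding_stats(lengths):
    """Count real vs. total token slots when a batch is padded to its longest entry."""
    padded_slots = len(lengths) * max(lengths)
    real_tokens = sum(lengths)
    return real_tokens, padded_slots, 1 - real_tokens / padded_slots

real, slots, wasted = padding_stats([1, 1, 1, 512])
print(real, slots, round(wasted, 3))   # 515 2048 0.749
```

About three quarters of the slots in that batch are padding the GPU still has to process.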
#3 - ragged batching:
Instead of stacking prompts on a batch axis (→ padding), concatenate them into one long sequence.
Then use the attention mask to stop prompt 0’s tokens from leaking into prompt 1’s attention. No padding needed.
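A sketch of what that mask looks like for the simplest case, where every prompt in the packed sequence is being prefilled from scratch (no cached tokens). The function name and the example lengths are assumptions.

```python
import numpy as np

def ragged_causal_mask(lengths):
    """Mask for prompts concatenated into one sequence.

    mask[i, j] is True when token i may attend to token j:
    both belong to the same prompt, and j does not come after i.
    """
    seq_id = np.repeat(np.arange(len(lengths)), lengths)   # prompt index of each token
    pos = np.arange(seq_id.size)
    same_prompt = seq_id[:, None] == seq_id[None, :]
    causal = pos[None, :] <= pos[:, None]
    return same_prompt & causal

mask = ragged_causal_mask([3, 2])
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [0 0 0 1 0]
#  [0 0 0 1 1]]
```

The result is block-diagonal: one causal triangle per prompt, zeros everywhere else, and not a single padded token.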
Pack as many prompts as fit into your token budget, mixing prefill chunks and decode tokens in the same batch. Combine with dynamic scheduling (swap finished prompts immediately) → continuous batching.
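A toy scheduler makes the packing concrete. This is only an illustrative sketch, not how vLLM or SGLang actually schedule: the `Request` class, the `step` function, the 8-token budget, and the rule that the last prefill chunk yields the first output token are all simplifying assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0
    generated: int = 0

def step(active, waiting, token_budget):
    """Build one ragged batch and return it as (request name, n_tokens) pairs."""
    batch, budget = [], token_budget
    # 1. Every request already in decode contributes exactly one token.
    for req in active:
        if req.prefilled == req.prompt_len and budget > 0:
            batch.append((req.name, 1))
            req.generated += 1
            budget -= 1
    # 2. Spend what is left on prefill chunks, admitting waiting requests as needed.
    prefilling = [r for r in active if r.prefilled < r.prompt_len]
    while budget > 0 and (prefilling or waiting):
        if prefilling:
            req = prefilling.pop(0)
        else:
            req = waiting.popleft()
            active.append(req)
        chunk = min(budget, req.prompt_len - req.prefilled)
        batch.append((req.name, chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.prefilled == req.prompt_len:
            req.generated += 1   # assumption: the last prefill chunk yields the first token
    # 3. Dynamic scheduling: finished requests free their slot immediately.
    active[:] = [r for r in active if r.generated < r.max_new_tokens]
    return batch

waiting = deque([Request("A", 6, 3), Request("B", 3, 2), Request("C", 10, 2)])
active, batches = [], []
while active or waiting:
    batches.append(step(active, waiting, token_budget=8))

for i, b in enumerate(batches):
    print(i, b)
# 0 [('A', 6), ('B', 2)]
# 1 [('A', 1), ('B', 1), ('C', 6)]
# 2 [('A', 1), ('B', 1), ('C', 4)]
# 3 [('C', 1)]
```

Steps 1 and 2 are the interesting ones: A and B are decoding one token at a time while C's long prompt is prefilled in chunks alongside them, and no step exceeds the token budget.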
So continuous batching = 3 ideas stacked:
- KV cache → don’t recompute the past
- Chunked prefill → handle long prompts within memory limits
- Ragged batching + dynamic scheduling → kill padding, keep GPUs saturated
This is how services like ChatGPT can serve thousands of users concurrently.
That’s it! You can find the entire blog post here - http://huggingface.co/blog/continuous_batching…