# when LLMs predict their own predictions — undef.dev
Source: [https://undef.dev/writing/learn/speculative-decoding/](https://undef.dev/writing/learn/speculative-decoding/)
How do modern LLMs generate several tokens for the price of one? The trick borrows its name from CPUs, where *speculative execution* runs ahead on a guessed branch and rolls back if the guess was wrong; speculative decoding plays the same bet with tokens\. This is a field guide from the original rejection\-sampling trick to the **multi\-token\-prediction** machinery baked into Qwen 3\.6 and bolted onto Gemma 4\.
## Your GPU chews faster than it can be fed
Autoregressive decoding produces one token per forward pass, and every single pass has to stream the model's **active weights** \(plus the relevant KV cache\) from high\-bandwidth memory into the compute units\. The arithmetic it then does is trivial by comparison\.
At batch size 1, well over 99% of the chip's arithmetic capacity sits idle\. Latency per token is roughly *bytes moved / bandwidth*, and the FLOPs are essentially free\. That spare compute is the raw material everything below is built from\. \(The exact split shifts with quantization, MoE routing, and GQA/MLA attention, but the invariant holds: at small batch, arithmetic intensity stays low\.\)
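A back\-of\-envelope sketch of that claim\. Every number here is a hypothetical round figure rather than a real accelerator's spec sheet \(`HBM_BANDWIDTH`, `PEAK_FLOPS`, and the 8\-billion\-parameter model are assumptions\), and KV\-cache traffic is ignored:

```python
# Hypothetical round numbers for one accelerator and one dense model.
HBM_BANDWIDTH = 3.0e12      # bytes/s streamed from high-bandwidth memory
PEAK_FLOPS = 1.0e15         # dense half-precision FLOP/s
ACTIVE_PARAMS = 8e9         # parameters touched per token
BYTES_PER_PARAM = 2         # 16-bit weights


def decode_step(batch_tokens: int = 1) -> dict:
    """Estimate one forward pass that scores `batch_tokens` positions."""
    bytes_moved = ACTIVE_PARAMS * BYTES_PER_PARAM      # weights are read once per pass
    flops = 2 * ACTIVE_PARAMS * batch_tokens           # ~2 FLOPs per weight per token
    t_memory = bytes_moved / HBM_BANDWIDTH
    t_compute = flops / PEAK_FLOPS
    return {
        "t_memory_ms": t_memory * 1e3,
        "t_compute_ms": t_compute * 1e3,
        "compute_utilization": t_compute / max(t_memory, t_compute),
    }


step = decode_step()
print(f"memory {step['t_memory_ms']:.2f} ms, compute {step['t_compute_ms']:.3f} ms, "
      f"utilization {step['compute_utilization']:.2%}")
```

With these numbers the weight read takes milliseconds while the arithmetic takes microseconds, so utilization lands well under 1%\.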
## Checking *k* tokens costs about the same as making one
Run a forward pass over the prompt *plus k candidate tokens* and the weights stream from HBM **once**, exactly as for a single token\. The extra positions ride along on the idle compute\. Verification is nearly free; it's *generation* that's expensive, because it is sequential\.
Doesn't checking a draft token cheat by letting the model *see* it? No: causal masking lets each position attend only to tokens at or before it, so the prediction that grades a draft token is computed without ever attending to it\.
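The masking argument can be checked on a toy sequence\. This is a minimal sketch with made\-up tokens; `causal_mask` is a hypothetical helper, and the only convention assumed is the usual one that the logits at position *i* predict token *i \+ 1*:

```python
import numpy as np


def causal_mask(n: int) -> np.ndarray:
    """mask[i, j] is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))


prompt = ["The", "cat"]
drafts = ["sat", "down"]
tokens = prompt + drafts
mask = causal_mask(len(tokens))

# Draft token i + 1 is graded by the logits computed at position i.
for i in range(len(prompt) - 1, len(tokens) - 1):
    visible = [tokens[j] for j in range(len(tokens)) if mask[i, j]]
    print(f"grading {tokens[i + 1]!r} from position {i}, which sees {visible}")
    assert not mask[i, i + 1]   # the grading position never attends to the graded token
```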
And the cost of those extra positions really is flat:
Linear layers stay memory\-bound for small *k*, attention is dominated by the \(unchanged\) KV\-cache reads, and communication latency is independent of *k* \([Chen et al\. 2023](https://arxiv.org/abs/2302.01318)\)\. Costs stay flat until you finally become compute\-bound\. That single weight\-read covers two kinds of token at once: tokens already committed to the sequence and drafts that are still speculative\.
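A roofline\-style sketch of "flat until compute\-bound"\. The hardware and model figures are hypothetical, attention and communication are left out, and `pass_time` is an illustrative helper rather than a measurement:

```python
HBM_BANDWIDTH = 3.0e12   # bytes/s (hypothetical)
PEAK_FLOPS = 1.0e15      # FLOP/s (hypothetical)
WEIGHT_BYTES = 16e9      # 8e9 parameters at 2 bytes each (hypothetical)
FLOPS_PER_TOKEN = 16e9   # ~2 FLOPs per parameter per position


def pass_time(positions: int) -> float:
    """A pass takes the slower of weight streaming and arithmetic."""
    t_memory = WEIGHT_BYTES / HBM_BANDWIDTH
    t_compute = positions * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(t_memory, t_compute)


baseline = pass_time(1)
for k in (0, 4, 16, 64, 256, 1024):
    print(f"k={k:5d}  relative cost {pass_time(k + 1) / baseline:.2f}")

# Number of positions at which arithmetic catches up with the weight read.
ridge = (WEIGHT_BYTES / HBM_BANDWIDTH) * PEAK_FLOPS / FLOPS_PER_TOKEN
print(f"compute-bound beyond ~{ridge:.0f} positions")
```

Under these assumptions a handful of draft positions costs nothing extra, and the curve only bends once a pass carries a few hundred positions\.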
The drafts' keys and values are written **speculatively**: reject at position *j* and the cache is rolled back to before *j*; the rejected token and everything after it no longer describe the real sequence\. Seen this way, verification is a **mini\-prefill** over the guessed tokens; ordinary decoding is the slow, one\-token\-at\-a\-time path it replaces\.
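The rollback can be pictured with a toy cache\. `KVCache` here is a hypothetical stand\-in that stores one string per position; a real engine stores per\-layer key/value tensors and usually just moves a length pointer:

```python
class KVCache:
    """Toy cache: one entry per position."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def append(self, tokens: list[str]) -> None:
        self.entries.extend(tokens)

    def truncate(self, length: int) -> None:
        del self.entries[length:]


cache = KVCache()
cache.append(["The", "cat"])                 # committed context
committed = len(cache.entries)

drafts = ["sat", "on", "a", "mat"]
cache.append(drafts)                         # speculative writes during verification

first_rejected = 2                           # suppose the target rejects drafts[2]
cache.truncate(committed + first_rejected)   # roll back to just before the rejection
print(cache.entries)                         # ['The', 'cat', 'sat', 'on']
```

The target's replacement for the rejected token gets its own cache entry on the next pass, like any newly emitted token\.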
So here is the bet: if something cheap could *guess* the next few tokens, the big model could grade the whole guess in a single pass\. Right guesses are pure profit\. Wrong guesses cost almost nothing: you only burned idle FLOPs\.
## The classic algorithm: draft, then verify
A small **draft model** proposes $k$ tokens autoregressively\. The **target model** scores all of them in one pass, accepts the longest correct prefix, fixes the first mistake, and \(if everything was accepted\) emits one bonus token on top\. Every round yields between 1 and $k + 1$ tokens, so progress is guaranteed \([Leviathan et al\. 2022](https://arxiv.org/abs/2211.17192); [Chen et al\. 2023](https://arxiv.org/abs/2302.01318)\)\.
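A minimal sketch of one such round under greedy decoding\. The `target` and `draft` functions are toy stand\-ins invented for illustration \(deterministic arithmetic on token ids, not real models\), and the toy target is called once per position where a real engine would get all $k + 1$ predictions from a single forward pass:

```python
from typing import List


def target(ctx: List[int]) -> int:
    """Toy 'big model': greedy next token as a function of the context."""
    return (3 * ctx[-1] + len(ctx)) % 7


def draft(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except at every fourth position."""
    guess = target(ctx)
    return (guess + 1) % 7 if len(ctx) % 4 == 0 else guess


def speculative_round(ctx: List[int], k: int) -> List[int]:
    """One draft-then-verify round; returns between 1 and k + 1 new tokens."""
    proposal: List[int] = []
    for _ in range(k):                         # cheap, sequential drafting
        proposal.append(draft(ctx + proposal))
    out: List[int] = []
    for i in range(k + 1):
        correct = target(ctx + proposal[:i])   # target's own pick at this position
        out.append(correct)                    # accepted draft, fix, or bonus token
        if i == k or proposal[i] != correct:
            break
    return out


def generate(prompt: List[int], n_new: int, k: int) -> List[int]:
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        seq += speculative_round(seq, k)
    return seq[: len(prompt) + n_new]


def greedy(prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


print(generate([1, 2], 12, k=4) == greedy([1, 2], 12))   # True
```

Every token appended to `out` is the target's own greedy choice for its prefix, which is why the output matches plain greedy decoding regardless of how bad the drafter is\.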
## Provably lossless: the same distribution as the target
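Decode **greedily** and the guarantee is trivial\. Accept the draft iff it's the token the target would have picked, otherwise emit the target's own\. Deterministic, and identical to ordinary greedy *by construction*\. The real case is **sampling** \(temperature\), where you can't just check equality, so the decoder runs a form of rejection sampling\. With $p$ the target distribution and $q$ the draft, each drafted token $x \sim q$ is tested against a uniform draw $u \sim U(0, 1)$, and the emitted token $y$ is:

$$
y =
\begin{cases}
x & \text{if } u < \min\!\left(1, \dfrac{p(x)}{q(x)}\right), \\[2ex]
x' \sim \dfrac{\max\big(0,\, p - q\big)}{\sum_t \max\big(0,\, p(t) - q(t)\big)} & \text{otherwise.}
\end{cases}
$$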
The theorem \([Leviathan App\. A\.1](https://arxiv.org/abs/2211.17192); [Chen Thm\. 1](https://arxiv.org/abs/2302.01318)\) is that for *any* draft model the result is distributed **exactly like the target**, though batching and floating\-point in real engines can still nudge a token\.
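An empirical check of that statement for a single position\. The two distributions are made up for illustration, `speculative_sample` is a hypothetical helper, and the rule it implements is the standard one from the cited papers: keep the draft with probability $\min(1, p/q)$, otherwise resample from the normalised positive part of $p - q$:

```python
import numpy as np


def speculative_sample(p: np.ndarray, q: np.ndarray, n: int, rng: np.random.Generator):
    """Draw n tokens via draft-then-accept/reject; returns (tokens, accepted flags)."""
    x = rng.choice(len(q), size=n, p=q)               # draft proposals x ~ q
    u = rng.random(n)                                 # uniform draws
    accepted = u < np.minimum(1.0, p[x] / q[x])       # keep x with prob min(1, p/q)
    residual = np.maximum(0.0, p - q)
    if residual.sum() > 0:
        fallback = rng.choice(len(p), size=n, p=residual / residual.sum())
    else:                                             # p == q: nothing is ever rejected
        fallback = x
    return np.where(accepted, x, fallback), accepted


p = np.array([0.5, 0.3, 0.15, 0.05])     # hypothetical target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # hypothetical draft distribution
rng = np.random.default_rng(0)
tokens, accepted = speculative_sample(p, q, 200_000, rng)
empirical = np.bincount(tokens, minlength=len(p)) / len(tokens)

print("target     ", p)
print("empirical  ", empirical.round(3))
print("accept rate", round(float(accepted.mean()), 3))
```

The empirical histogram should track `p` to within sampling noise even though every proposal came from the uniform `q`\.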
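**The one\-liner\.** If you know what a [maximal coupling](https://colcarroll.github.io/couplings/static/maximal_couplings.html) is: writing $p$ for the target and $q$ for the draft, the sampler is exactly the conditional $\pi(y \mid x)$ from the **maximal coupling** of $q$ and $p$: the joint $\pi(x, y)$ with marginals $q$ and $p$ that puts as much mass as possible \($\sum_t \min(p(t), q(t))$\) on the diagonal $x = y$\. So drawing the proposal $x \sim q$ and emitting $y \sim \pi(\cdot \mid x)$ returns $y \sim p$ by construction\.

**The walk\-through\.** Split the target into the part the draft can cover and the part it cannot:

$$
p(t) = \underbrace{\min\big(p(t),\, q(t)\big)}_{m(t)} + \underbrace{\max\big(0,\, p(t) - q(t)\big)}_{r(t)},
$$

so $m$ is the **overlap** and $r$ the **residual**, and $\beta = \sum_t m(t)$ is their total overlap and, as we show shortly, the acceptance rate\. \(If $\beta = 1$ then $p = q$, the draft is already exact, and the claim is immediate; so assume $\beta < 1$\.\)

**Claim\.** Draw $x \sim q$ and keep it with probability $\min\big(1, p(x)/q(x)\big)$; otherwise output an independent $x' \sim r / (1 - \beta)$\. The emitted token $y$ is then distributed as $p$, for any $q$\.

**Proof\.** Decompose $P(y = t)$ into the accept case \($y = x$\) and the reject case \($y = x'$\), unfold each, and simplify the accept term:

$$
P(y = t) = \underbrace{q(t)\,\min\!\Big(1, \tfrac{p(t)}{q(t)}\Big)}_{\text{accept}} + \underbrace{P(\text{reject})\,\frac{r(t)}{1 - \beta}}_{\text{reject}},
\qquad
q(t)\,\min\!\Big(1, \tfrac{p(t)}{q(t)}\Big) = \min\big(q(t),\, p(t)\big) = m(t).
$$

Summed over $t$, the accept term is the acceptance rate and reject is its complement:

$$
P(\text{accept}) = \sum_t m(t) = \beta, \qquad P(\text{reject}) = 1 - \beta.
$$

Substituting and cancelling the normaliser,

$$
P(y = t) = m(t) + (1 - \beta)\,\frac{r(t)}{1 - \beta} = m(t) + r(t) = p(t). \qquad \blacksquare
$$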
The probability a position is accepted equals the overlap between the target and draft distributions, which is why a better\-matched drafter accepts more\.
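The overlap and the resulting output distribution can also be computed exactly rather than sampled\. A small sketch with hypothetical distributions; `acceptance` is an illustrative helper:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical target
q = np.array([0.1, 0.2, 0.3, 0.4])     # hypothetical, poorly matched draft

m = np.minimum(p, q)                   # overlap: mass the draft can cover
r = np.maximum(0.0, p - q)             # residual: mass it cannot
beta = m.sum()                         # acceptance rate

# P(y = t) = accept term + P(reject) * normalised residual
emitted = m + (1.0 - beta) * r / r.sum()
print("acceptance rate:", beta)
print("emitted equals target:", np.allclose(emitted, p))


def acceptance(p: np.ndarray, q: np.ndarray) -> float:
    """Total overlap between target p and draft q."""
    return float(np.minimum(p, q).sum())


# Blending the draft toward the target raises the overlap.
for mix in (0.0, 0.5, 0.9):
    print(mix, round(acceptance(p, (1 - mix) * q + mix * p), 3))
```

The overlap is also one minus the total\-variation distance between the two distributions, so "better\-matched" has a precise meaning here\.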
## Multi\-token prediction: one name, two ideas
"MTP" began as a **training objective**: at each position, predict the next *several* tokens, not just the immediate one\. Forcing the model to plan a few steps ahead gives a denser learning signal, and leaves behind a module that already knows how to guess the future, which is exactly what a speculative drafter needs\. Two camps wired this up differently\.
**Parallel heads** \(Meta, [Gloeckle et al\. 2024](https://arxiv.org/abs/2404.19737)\) bolt *n* independent heads onto the same final hidden state; head *k* predicts the token at offset *k*, all at once\. Simple, and a strong training signal \(notably better at code\)\. But the heads can't see each other: each guesses from the *same frozen snapshot of the present*, so the further\-out guesses drift and acceptance falls off with distance\.
**Sequential modules** \([DeepSeek\-V3](https://arxiv.org/abs/2412.19437) → Qwen\) predict the future tokens *one after another*: each MTP module takes the previous step's hidden state and concatenates it with the embedding of the **actual next token**, so every guess is conditioned on the one before it\. The causal chain stays intact, so drafts read more like text the model would actually write, which tends to raise acceptance\.
Concretely, DeepSeek\-V3's **sequential MTP module**, which Qwen inherits, is one transformer block riding on the trunk\. It takes the trunk's hidden state and the embedding of the token just produced, fuses them, runs the block, and reuses the shared LM head to draft the token *two* ahead\.
Two payoffs, one mechanism\. **In training** this runs as an auxiliary loss, $\mathcal{L} = \mathcal{L}_{\text{main}} + \lambda\,\mathcal{L}_{\text{MTP}}$\. \(DeepSeek\-V3 used a single extra token, $D = 1$, with $\lambda$ annealed over its 14\.8 T tokens\.\) Supervising those extra futures can sharpen the base model itself\. **At inference**, the very same module becomes a built\-in drafter: it proposes the token after next while the trunk emits the next one\. That is *self*\-speculative decoding with **no second model**, near\-zero draft cost, and high acceptance $\alpha$ because the drafter shares the trunk's embeddings, LM head, and hidden state\. DeepSeek\-V3 reported **85–90% acceptance** on the second token, about a 1\.8× decode speedup\. \(Or the module is switched off, leaving an ordinary model\.\)
### The building block: feature–token fusion
Zoom into the heart of that module, the step that fuses the trunk's hidden state with the token's embedding\. It isn't unique to DeepSeek/Qwen; it's the building block of *every* feature\-level drafter, [EAGLE](https://arxiv.org/abs/2401.15077) included\. To draft the next token cheaply, the module draws on two kinds of input: one or more of the target's **hidden states** \(internal vectors that already encode the context; most drafters use the single final\-layer hidden state, the one that feeds the LM head, while [EAGLE\-3](https://arxiv.org/abs/2503.01840) fuses several layers'\), and the **embeddings** of one or more recently\-seen **tokens** \(in practice just the one it's continuing from\)\. Each model fills in these inputs differently\.
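Underneath, they're one operation\. Let $\mathcal{F}$ be the nonempty set of trunk layers it taps \(the *feature* side: any subset of the $L$ layers, typically just the final one\), and $\mathcal{T}$ the tokens it's fed\. Gather both as a family $(x_i)$ over $i \in \mathcal{F} \sqcup \mathcal{T}$, with $x_\ell = h^{(\ell)}$ the hidden state of layer $\ell \in \mathcal{F}$ and $x_t = E(t)$ the embedding of token $t \in \mathcal{T}$\. Normalize each by its own map, concatenate, and project from the trunk width $d$ into the drafter's width $d'$:

$$
z = W \, \operatorname{concat}_{i \in \mathcal{F} \sqcup \mathcal{T}} \big( \nu_i(x_i) \big),
\qquad W \in \mathbb{R}^{d' \times n d}, \quad n = |\mathcal{F}| + |\mathcal{T}|,
$$

where each $\nu_i$ is that input's own normalization \(RMSNorm/LayerNorm, or the identity\) and $W$ is the learned projection\. The edges of this pin down the family: $\mathcal{T} = \varnothing$ \(no token\) gives a hidden\-state\-only **Medusa** head, and the width is shared \($d' = d$\) for an in\-trunk head like DeepSeek\-V3 or Qwen, smaller \($d' < d$\) for a standalone drafter like Gemma 4\.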
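The operation is small enough to write out\. This is a sketch under simplifying assumptions, not any model's actual module: `fuse` and `rms_norm` are hypothetical helpers, the widths are toy values, the projection is random rather than learned, and the norm has no learned gain \(real modules typically give each input its own\):

```python
import numpy as np


def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm without a learned gain, to keep the sketch short."""
    return x / np.sqrt(np.mean(x * x) + eps)


def fuse(hidden_states, token_embeddings, W: np.ndarray) -> np.ndarray:
    """Normalize each input separately, concatenate, and project to the drafter width."""
    stacked = np.concatenate([rms_norm(x) for x in (*hidden_states, *token_embeddings)])
    return W @ stacked


rng = np.random.default_rng(0)
d, d_draft = 8, 4                        # hypothetical trunk and drafter widths
h_final = rng.normal(size=d)             # target's final-layer hidden state
e_token = rng.normal(size=d)             # embedding of the token being continued
W = rng.normal(size=(d_draft, 2 * d))    # stand-in for the learned projection

z = fuse([h_final], [e_token], W)
print(z.shape)   # (4,)
```

Passing an empty token list \(with a correspondingly narrower `W`\) gives the hidden\-state\-only edge case, and passing several hidden states gives the multi\-layer one\.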
**The math underneath: fusion is a sum of projections\.** Concat\-then\-project is a linear map out of a direct sum: the stack of $n$ normalized inputs $\nu_i(x_i)$ lands in $\mathbb{R}^{nd}$, and block\-partitioning the projection as $W = [\,W_1 \mid \cdots \mid W_n\,]$ turns it into a sum of independent per\-source projections, $z = \sum_i W_i\,\nu_i(x_i)$\. So fusion is an *additive* combine \($\sum_i W_i x_i$, one weight block per source\), not a *multiplicative* one \(such as $x_i \odot x_j$, which would model feature–token cross\-terms\)\. Tie the blocks \($W_i = W_0$, $\nu_i = \nu_0$ for all $i$\) and it collapses to the permutation\-invariant pooling of [Deep Sets](https://arxiv.org/abs/1703.06114), which the per\-slot blocks deliberately avoid\.
Any feature–token cross\-terms appear later, supplied by the transformer block the fused vector $z$ feeds into: attention's $QK^\top$ across positions, the gated FFN \(SwiGLU's elementwise gate\) within one\. The projection is really that block's input layer, kept at width $d$ so an in\-trunk head can reuse the trunk's embedding table and LM head as\-is\.
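Both algebraic claims are easy to confirm numerically\. A minimal check with random toy matrices \(the widths and variable names are illustrative, and normalization is omitted because it does not affect the identity\):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_draft = 6, 3                        # hypothetical widths
h = rng.normal(size=d)                   # feature-side input
e = rng.normal(size=d)                   # token-side input
W = rng.normal(size=(d_draft, 2 * d))

W_h, W_e = W[:, :d], W[:, d:]            # block-partition the projection
concat_then_project = W @ np.concatenate([h, e])
sum_of_projections = W_h @ h + W_e @ e
print(np.allclose(concat_then_project, sum_of_projections))   # True

# Tying the blocks makes the result blind to which input is which.
W_tied = np.concatenate([W_h, W_h], axis=1)
original = W_tied @ np.concatenate([h, e])
swapped = W_tied @ np.concatenate([e, h])
print(np.allclose(original, swapped))                          # True
```

With untied blocks, swapping the two inputs changes the output, which is the slot\-awareness the per\-source weights buy\.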
That **feature–token fusion** is why a tiny drafter stays in lock\-step with a huge target almost for free\. Instead of re\-reading the context, it rides on the target's own hidden states\. It has no single standard name \(EAGLE's "FC layer", `eh_proj` in DeepSeek/Qwen code\)\. That one fusion is the shared core of EAGLE, DeepSeek\-V3's MTP, Qwen's NextN, and Gemma's drafter; only the packaging around it changes\. Two differences are architectural: **which hidden states it reads** \(just the trunk's final layer, or several fused together as in EAGLE\-3\) and **what block it feeds** \(an in\-trunk block at full width, or a separate, narrower drafter of one to four layers\)\. Two are about the surrounding system: **how it's trained** \(jointly as an auxiliary objective, or post\-hoc against a frozen target\) and **how it drafts** \(one token, a recursive chain, or a dynamic tree\)\. To be precise: Gemma 4, Qwen 3\.6 and DeepSeek\-V3 ship their own EAGLE\-*inspired* heads, **not** literal EAGLE\-1/2/3 checkpoints, which are a parallel line of drafters you attach to other models through vLLM / SGLang\.
## Case study A — Qwen 3\.6: MTP as a built\-in organ
Qwen's lineage \(Qwen3\-Next → 3\.5 → 3\.6\) carries the DeepSeek design forward: one extra "NextN" transformer block riding on the trunk, trained jointly with the model, sharing its embedding table and LM head \(`mtp_num_hidden_layers: 1`\)\. Officially Qwen commits only to "MTP: trained with multi\-steps"; the one\-block internals come from the released config, and the acceptance/speedup figures below are third\-party measurements\.
- **"MTP: trained with multi\-steps"** \([model card](https://huggingface.co/Qwen)\): the one block is unrolled over several future steps during training, so the recursive multi\-token drafting at inference matches what it practiced\.
- **Drafting is almost free:** the block is 1/64 of the trunk's layers; adding the shared LM head's matmul over the large vocab \(run once per draft step\) raises the real per\-step cost $c$ above that fraction, still below the 0\.2–0\.3 of an external small draft model\.
- **Measured \(third\-party\):** ~83% acceptance and about 2\.2× end\-to\-end at one draft length, rising to ≈2\.4× at another\. Qwen publishes the recipe, not the numbers\.
**GLM\-5\.2** \(Zhipu AI, June 2026\) ships the same weight\-shared block \(parameters tied across MTP steps\) but, unlike Qwen, documents the recipe\. Its **[release blog](https://z.ai/blog/glm-5.2)** reports an **accept length of 5\.47 on coding at 7 MTP steps**, \+20% over its baseline block, stacking an IndexShare/KV\-share trick, rejection sampling, and an end\-to\-end TV loss\. Under the i\.i\.d\.\-acceptance model, that accept length translates into a per\-token acceptance rate $\alpha$\.
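One way to turn an accept length into a per\-token acceptance rate is to invert the i\.i\.d\. formula numerically\. This sketch rests on an assumption the source does not confirm: that the reported accept length is the expected number of tokens per target pass *including* the bonus token, i\.e\. $(1 - \alpha^{k+1}) / (1 - \alpha)$ with $k = 7$\. If the blog counts only accepted drafts, the input would need adjusting\. `implied_alpha` is a hypothetical helper:

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per target pass with k drafts and i.i.d. acceptance alpha."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def implied_alpha(accept_length: float, k: int, tol: float = 1e-9) -> float:
    """Bisection on alpha; expected_tokens is increasing in alpha on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_tokens(mid, k) < accept_length:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha = implied_alpha(5.47, k=7)
print(f"implied per-token acceptance: {alpha:.3f}")
```

Under that reading the implied rate falls between 0\.88 and 0\.89\.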
## Case study B — Gemma 4: MTP as a shipped sidecar
Google took the other road\. Gemma 4 launched *without* speculative assets; weeks later, in May 2026, came the **["MTP drafters"](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/)**: separate lightweight checkpoints \(`google/gemma-4-31B-it-assistant`\), trained post\-hoc to imitate each frozen target\. A purpose\-built mini\-model that plugs into the target's internals, not a head inside the trunk\.
- **4\-layer drafter** \(3 sliding\-window \+ 1 global attention, per the `-assistant` configs\), ~0\.5 B params for the 31 B target; every Gemma 4 size gets one, down to the edge models\.
- **Three couplings** to the target: shared input embeddings; the same feature–token fusion on its final hidden state; and it **reuses the target's KV cache**, so context is never re\-encoded\.
- **Edge trick \(E2B/E4B only\):** clusters the 262K\-token vocabulary and scores only the likely clusters, dodging the full output projection on\-device\.
- **Claims \(vendor\-reported\):** up to 3× faster with "zero quality degradation," since the target still does the final verification \(the product\-level version of the lossless guarantee, subject to engine numerics\)\. ~2\.2× for the 26B\-A4B MoE on Apple Silicon; no acceptance rates published\.
## What "MTP" ends up meaning
By now the divide is clear\. Qwen bakes prediction\-of\-the\-future into the model as a training objective and reuses the head\. Google trains a dedicated drafter *post\-hoc* and ships it as a product\.
| | Qwen 3\.6 \(DeepSeek school\) | Gemma 4 \(EAGLE school\) |
|---|---|---|
| what "MTP" is | **Training objective** whose head moonlights as a drafter | **Inference product**: a post\-hoc drafter for a frozen target |
| drafter form | 1 transformer block inside the checkpoint | Separate 4\-layer `-assistant` mini\-model |
| ties to target | Shared embeddings \+ shared LM head \+ final hidden state | Shared embeddings \+ final hidden state \+ **shared KV cache** |
| training | Jointly, "with multi\-steps", auxiliary loss → may improve base | Separately, target frozen → zero risk to base, can ship late |
| deployment | One checkpoint; a vLLM/SGLang flag turns drafting on | Two checkpoints; HF `assistant_model=…` |
| reported wins | ~83% acceptance; ≈2\.2–2\.4× \(third\-party\) | Up to 3×; ~2\.2× MoE on Apple Silicon \(Google\) |
## How fast does it actually get?
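You have now met the drafters above, and two numbers decide how well any of them pays off: the acceptance rate $\alpha$ and the cost $c$ of a draft step relative to one target pass\. Expected tokens per target pass, and the resulting end\-to\-end speedup, for a draft of length $k$ \([Leviathan](https://arxiv.org/abs/2211.17192) Eq\. 1 & Thm\. 3\.8, i\.i\.d\.\-acceptance approximation\):

$$
\mathbb{E}[\text{tokens}] = \frac{1 - \alpha^{k+1}}{1 - \alpha},
\qquad
\text{speedup} = \frac{1 - \alpha^{k+1}}{(1 - \alpha)(k c + 1)}.
$$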
First\-order break\-even: the acceptance rate $\alpha$ should exceed the relative draft cost $c$\. A drafter must be right more often than it is expensive \(real systems add fixed overheads and batching effects\)\. Regarding draft length: with high $\alpha$ and tiny $c$ \(the built\-in MTP drafters\), longer drafts keep paying; with a mediocre drafter the optimum is short\.
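The break\-even rule and the draft\-length trade\-off fall straight out of the i\.i\.d\. formula from Leviathan et al\. A sketch with two hypothetical drafters; the acceptance rates and costs are illustrative values, not measurements, and `best_draft_length` is an illustrative helper:

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per target pass with k drafts and i.i.d. acceptance alpha."""
    return float(k + 1) if alpha == 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per round over round cost: k draft steps at cost c plus one target pass."""
    return expected_tokens(alpha, k) / (k * c + 1)


def best_draft_length(alpha: float, c: float, k_max: int = 16) -> int:
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))


# Hypothetical drafters: (acceptance rate, relative draft cost)
for name, alpha, c in [("in-trunk MTP head", 0.85, 0.03),
                       ("external small model", 0.70, 0.25)]:
    k = best_draft_length(alpha, c)
    print(f"{name}: best k={k}, speedup {speedup(alpha, k, c):.2f}x")
```

At $k = 1$ the speedup reduces to $(1 + \alpha) / (1 + c)$, which exceeds 1 exactly when $\alpha > c$\.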
## When the trick stops working
Speculation spends *spare* FLOPs, so it lives and dies by whether FLOPs are spare\. At high batch sizes or QPS, decoding turns compute\-bound and speculation can go **net negative** \(vLLM measured up to 1\.8× *slow\-downs* under load before adding dynamic disabling\)\. Higher sampling temperature lowers the acceptance rate $\alpha$; repetitive tasks like code and summarization raise it\. Speculation only speeds *decode* \(inter\-token latency\), not the prompt's prefill\. The drafter, too, has to live somewhere: extra memory, batch expansion from tree drafts, and cache\-rollback overhead all eat into the win\.
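A crude model of how batching erodes the win\. Everything here is an assumption made for illustration: a forward pass is flat in cost until it carries `ridge` positions and linear after that, a draft step costs a fixed fraction `c` of a plain decode pass, and acceptance is i\.i\.d\. It is not a model of any particular engine:

```python
def pass_cost(positions: float, ridge: float) -> float:
    """Relative pass time: flat while memory-bound, linear once compute-bound."""
    return max(1.0, positions / ridge)


def speedup_at_batch(alpha: float, k: int, c: float, batch: int, ridge: float = 300.0) -> float:
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)    # per sequence, per round
    plain = pass_cost(batch, ridge)                  # one position per sequence
    verify = pass_cost(batch * (k + 1), ridge)       # k + 1 positions per sequence
    round_cost = verify + k * c * plain              # verification plus k draft steps
    return tokens * plain / round_cost


for batch in (1, 16, 64, 256, 1024):
    print(f"batch {batch:5d}: {speedup_at_batch(0.7, 4, 0.05, batch):.2f}x")
```

Once every pass is compute\-bound, verifying $k + 1$ positions costs roughly $k + 1$ times a plain step, and the ratio drops below 1 unless nearly every draft is accepted\.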
## What this guide leaves out
This guide follows one thread: the draft\-verify core and the MTP/feature\-fusion drafters built on it\. Three adjacent branches, all composable with everything above, are left for elsewhere: **tree drafting**, where each step floats many candidates instead of one chain \([SpecInfer](https://arxiv.org/abs/2305.09781), [EAGLE\-2](https://arxiv.org/abs/2406.16858)\); **model\-free drafters** that guess from n\-grams or retrieval rather than a network \(lookahead decoding, REST\); and **lossy acceptance**, which trades the exactness proof above for extra speed \(typical acceptance, as in [Medusa](https://arxiv.org/abs/2401.10774)\)\.
### Sources
**Foundational / theory**
- Leviathan et al\., *Fast Inference from Transformers via Speculative Decoding*, [arXiv:2211\.17192](https://arxiv.org/abs/2211.17192)
- Chen et al\., *Accelerating LLM Decoding with Speculative Sampling*, [arXiv:2302\.01318](https://arxiv.org/abs/2302.01318)
- Miao et al\., *SpecInfer* \(tree verification\), [arXiv:2305\.09781](https://arxiv.org/abs/2305.09781)
**Drafter architectures**
- Cai et al\., *Medusa*, [arXiv:2401\.10774](https://arxiv.org/abs/2401.10774)
- Li et al\., *EAGLE* [1](https://arxiv.org/abs/2401.15077) / [2](https://arxiv.org/abs/2406.16858) / [3](https://arxiv.org/abs/2503.01840)
- Gloeckle et al\., *Multi\-token Prediction*, [arXiv:2404\.19737](https://arxiv.org/abs/2404.19737)
- *DeepSeek\-V3 Technical Report*, [arXiv:2412\.19437](https://arxiv.org/abs/2412.19437)
- Zhipu AI, *GLM\-5\.2* \(shared\-param MTP; IndexShare/KV\-share, rejection sampling, end\-to\-end TV loss; accept length 5\.47 / \+20% on coding\), [z\.ai/blog/glm\-5\.2](https://z.ai/blog/glm-5.2)
**Product / implementation**
- Google, *Gemma 4 MTP drafters*: [blog](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/) · [`-assistant` config](https://huggingface.co/google/gemma-4-31B-it-assistant/blob/main/config.json) \(`num_hidden_layers: 4`, `layer_types: [sliding, sliding, sliding, full]`\) · [transformers `gemma4`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma4/configuration_gemma4.py)
- [Qwen](https://huggingface.co/Qwen) model card & config · vLLM / SGLang docs · third\-party benchmarks
Qwen publishes no official acceptance/speedup figures; the "third\-party" numbers above are independent benchmarks, chiefly [zolotukhin\.ai](https://zolotukhin.ai/blog/2026-05-08-why-mtp-heads-are-the-speculative-decode-draft-qwen3-a3b-deserves/), which measured ≈0\.83 acceptance and ≈2\.2× at one draft length \(and ≈0\.72 / ≈2\.4× at another\) on Qwen3\.6\-27B, plus [Unsloth](https://unsloth.ai/docs/models/qwen3.6)'s Qwen3\.6 MTP builds\. The i\.i\.d\.\-acceptance formula is only an approximation, since real acceptance is positionally correlated\.
Cite this:
```
@misc{undefdev2026specdecoding,
author = {Tarek Sabet},
title = {Speculative decoding: when {LLMs} predict their own predictions},
year = {2026},
howpublished = {\url{https://undef.dev/writing/learn/speculative-decoding/}},
}
```