@_avichawla: Anthropic. Google. Meta. Everyone's using an idea from the 1990s to run LLM inference 2-3x faster. In the 1990s, CPU de…
Summary
Speculative decoding, inspired by 1990s CPU branch prediction, is now used by Anthropic, Google, and Meta to speed up LLM inference 2-3x. It uses a small model to guess future tokens and a large model to verify them in parallel, avoiding idle GPU time during decoding.
Cached at: 05/26/26, 09:14 PM
Anthropic. Google. Meta.
Everyone’s using an idea from the 1990s to run LLM inference 2-3x faster.
In the 1990s, CPU designers hit a pipeline stall problem.
Originally, CPUs were designed to process instructions in stages, like fetch, decode, execute, and write back.
They finished one instruction before starting the next. This left most stages idle.
Pipelining fixed this by letting many instructions flow through these stages simultaneously, like an assembly line.
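To put rough numbers on the assembly-line idea, here is a minimal sketch. It assumes every stage takes exactly one cycle and that nothing ever stalls; the function names are made up for illustration.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    # Each instruction passes through every stage before the next one starts.
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    # The first instruction needs num_stages cycles to fill the pipeline;
    # after that, one instruction completes per cycle (assuming no stalls).
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    n, s = 100, 4  # 100 instructions; fetch, decode, execute, write back
    print(cycles_unpipelined(n, s))  # 400
    print(cycles_pipelined(n, s))    # 103
```

Under these idealized assumptions, 100 instructions drop from 400 cycles to 103.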
But it introduced a new problem.
When the CPU hit a conditional branch (if/else), it didn’t know which path to take yet.
But the pipeline had already started fetching the next instructions. If those turned out to be from the wrong path, all that work got flushed.
The pipeline stalled, and with a branch showing up roughly every 3rd instruction in typical programs, it happened constantly.
Branch prediction fixed this.
The processor kept a history of past branches, predicted the likely path, and executed ahead.
If the prediction was right (this happened ~95% of the time), the pipeline never stalled. If it was wrong, it just flushed and retried.
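One classic way to "keep a history" is a 2-bit saturating counter per branch. The sketch below is an assumption about the general technique, not a description of any specific CPU; the class name, the starting state, and the branch address are all hypothetical.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (simplified, hypothetical table)."""

    def __init__(self) -> None:
        self.counters: dict[int, int] = {}  # branch address -> state in 0..3

    def predict(self, branch_addr: int) -> bool:
        # States 2 and 3 mean "predict taken". Unseen branches start at 1.
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = self.counters.get(branch_addr, 1)
        state = min(3, state + 1) if taken else max(0, state - 1)
        self.counters[branch_addr] = state


def accuracy(outcomes: list[bool], branch_addr: int = 0x40) -> float:
    """Fraction of outcomes predicted correctly when predicting, then updating."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(branch_addr) == taken
        predictor.update(branch_addr, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken 9 times, then falls through once, repeated 10 times.
    loop_branch = ([True] * 9 + [False]) * 10
    print(accuracy(loop_branch))
```

On this toy loop pattern the predictor only misses the loop exit (plus one warm-up miss), which is why regular branches are so cheap to predict.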
LLM inference now faces the same problem on different hardware.
A GPU can also process many tokens in parallel. It does this during prefill, when it ingests the entire prompt at once, and the compute units are fully saturated.
But during decoding, the model generates tokens one at a time. Each token depends on the one before it, so the GPU can’t work ahead.
It loads billions of parameters from memory for each token, finishes the math almost instantly, and leaves the compute units idle while it waits on memory for the next step.
Hardware that could handle many tokens at once is stuck doing one.
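A back-of-the-envelope estimate shows the gap. The numbers below are round, hypothetical values (not any particular GPU or model), and the sketch ignores the KV cache, attention cost, and any overlap between memory traffic and compute.

```python
def decode_step_estimate(
    num_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    peak_flops_per_s: float,
    tokens_per_step: int = 1,
) -> tuple[float, float]:
    """Return (seconds to stream the weights, seconds of math) for one forward pass."""
    load_time = num_params * bytes_per_param / mem_bandwidth_bytes_per_s
    # Roughly 2 FLOPs (multiply + add) per parameter per token for a dense model.
    compute_time = 2 * num_params * tokens_per_step / peak_flops_per_s
    return load_time, compute_time


if __name__ == "__main__":
    # Hypothetical setup: 7B parameters, 2 bytes each, 1 TB/s memory, 100 TFLOP/s.
    load, compute = decode_step_estimate(7e9, 2, 1e12, 1e14)
    print(f"1 token:  load {load * 1e3:.2f} ms, math {compute * 1e3:.2f} ms")
    load, compute = decode_step_estimate(7e9, 2, 1e12, 1e14, tokens_per_step=8)
    print(f"8 tokens: load {load * 1e3:.2f} ms, math {compute * 1e3:.2f} ms")
```

With these assumed numbers, streaming the weights takes about 100x longer than the math for one token, and processing 8 tokens in the same pass still fits well inside the load time.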
Speculative decoding breaks this bottleneck in the same way branch prediction broke CPU stalls.
→ A small model guesses the next K tokens.
→ The large model then verifies all K tokens in a single forward pass.
That verification step looks like a prefill stage, with multiple tokens processed at once and compute units fully saturated.
→ Best case, this gives K+1 tokens from one large-model call.
→ Worst case, it gives 1 token, exactly what standard decoding produces.
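Between those two extremes, the average depends on how often the draft is right. A minimal sketch, under the simplifying assumption that each drafted token is accepted independently with the same probability `alpha` (a hypothetical parameter, not a number from this thread): the expected yield is 1 + alpha + alpha^2 + ... + alpha^K.

```python
def expected_tokens_per_call(alpha: float, k: int) -> float:
    """Expected tokens per large-model call with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha, and that verification stops at the first rejection.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 0.8, 1.0):
        print(alpha, expected_tokens_per_call(alpha, k=4))
```

This counts tokens only. It leaves out the cost of running the small model, so it is an upper bound on the speedup, not the speedup itself.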
Whether the guesses are accepted or rejected, the output distribution is mathematically identical to what the large model would produce on its own.
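Here is a toy sketch of one draft-and-verify step, using the standard accept/reject rule from speculative sampling: accept a drafted token with probability min(1, p/q), otherwise resample from the leftover max(0, p - q). Both "models" are hypothetical stand-in functions over a 4-token vocabulary, and the large model is called in a Python loop where a real system would use one batched forward pass.

```python
import random

VOCAB = 4  # hypothetical toy vocabulary size


def draft_probs(prefix: list[int]) -> list[float]:
    # Stand-in for the small model: a fixed, slightly-off distribution.
    return [0.4, 0.3, 0.2, 0.1]


def target_probs(prefix: list[int]) -> list[float]:
    # Stand-in for the large model: depends on the previous token.
    last = prefix[-1] if prefix else 0
    base = [0.1, 0.2, 0.3, 0.4]
    return base[last:] + base[:last]


def sample(probs: list[float], rng: random.Random) -> int:
    # random.choices accepts unnormalized weights.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(prefix: list[int], k: int, rng: random.Random) -> list[int]:
    """Return between 1 and k+1 new tokens, distributed as if sampled from the target."""
    # 1) Draft k tokens one by one with the small model.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) Score all k+1 positions with the large model (one pass in a real system).
    p_dists = [target_probs(prefix + drafted[:i]) for i in range(k + 1)]

    # 3) Accept each drafted token with probability min(1, p/q); stop at the first rejection.
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(sample(residual, rng))
        return accepted

    # 4) All k drafts accepted: take one bonus token from the target.
    accepted.append(sample(p_dists[k], rng))
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step([], k=3, rng=rng))
```

A rejection still yields one token (the resampled one), which is the worst case of 1. A clean sweep yields K drafts plus the bonus token, which is the best case of K+1.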
Google Search uses this exact approach in AI Overviews to serve over 2 billion users. vLLM, TensorRT-LLM, and SGLang also ship built-in support.
I wrote about the full mechanics of speculative decoding in an article. It covers KV caching, the internals of speculative decoding (with code), and tradeoffs.
Read it below.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
Similar Articles
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
What is Speculative Decoding? (trending on paperswithco.de) [R]
Speculative decoding is an inference optimization technique that uses a fast draft model to propose future tokens verified in parallel by a larger model, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.
Faster LLM Inference via Sequential Monte Carlo
This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles. It reports a 2.36× speedup over standard speculative decoding and 5.2× over autoregressive decoding, with a 3% accuracy loss.
@RedHat_AI: 145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output qu…
Red Hat demonstrates that using speculative decoding can boost LLM inference speed from 145 to 424 tokens per second on the same H100 hardware with no quality loss, highlighting a significant optimization for production serving.
@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…
New research on DFlash and Spec V2 speculative decoding methods achieves >4.3X baseline throughput for LLM inference, released as the default speculative decoding engine in SGLang.