@_avichawla: Anthropic. Google. Meta. Everyone's using an idea from the 1990s to run LLM inference 2-3x faster. In the 1990s, CPU de…

X AI KOLs Timeline 05/26/26, 08:38 AM News

Summary

Speculative decoding, inspired by 1990s CPU branch prediction, is now used by Anthropic, Google, and Meta to speed up LLM inference 2-3x. It uses a small model to guess future tokens and a large model to verify them in parallel, avoiding idle GPU time during decoding.

Original Article

Cached at: 05/26/26, 09:14 PM

Anthropic. Google. Meta.

Everyone’s using an idea from the 1990s to run LLM inference 2-3x faster.

In the 1990s, CPU designers hit a pipeline stall problem.

Originally, CPUs were designed to process instructions in stages, like fetch, decode, execute, and write back.

They finished one instruction before starting the next. This left most stages idle.

Pipelining fixed this by letting many instructions flow through these stages simultaneously, like an assembly line.

But it introduced a new problem.

When the CPU hit a conditional branch (if/else), it didn’t know which path to take yet.

But the pipeline had already started fetching the next instructions. If those turned out to be from the wrong path, all that work got flushed.

The pipeline stalled, and it happened often: conditional branches are commonly cited as roughly one in every five instructions in typical programs.

Branch prediction fixed this.

The processor kept a history of past branches, predicted the likely path, and executed ahead.

If the prediction was right (this happened ~95% of the time), the pipeline never stalled. If it was wrong, it just flushed and retried.

LLM inference now faces the same problem on different hardware.

A GPU can also process many tokens in parallel. It does this during prefill, when it ingests the entire prompt at once, and the compute units are fully saturated.

But during decoding, the model generates tokens one at a time. Each token depends on the one before it, so the GPU can’t work ahead.

It loads billions of parameters from memory for each token, finishes the math almost instantly, and then waits on memory again for the next step. Decoding is limited by memory bandwidth, not by compute.

Hardware that could handle many tokens at once is stuck doing one.
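A back-of-the-envelope estimate shows how lopsided this is. The sketch below compares the time to stream the weights from memory with the time to do the arithmetic for one token at batch size 1. Every number in it (model size, bandwidth, peak FLOPs) is a hypothetical round figure chosen for illustration, and it ignores the KV cache and attention cost.

```python
def decode_step_estimate(n_params, bytes_per_param, mem_bandwidth_bytes_s, peak_flops):
    """Rough lower bounds, in seconds, for one decode step at batch size 1."""
    weight_bytes = n_params * bytes_per_param
    # Every weight must be read from memory once per generated token.
    memory_time_s = weight_bytes / mem_bandwidth_bytes_s
    # Roughly 2 FLOPs (one multiply, one add) per parameter per token.
    compute_time_s = 2 * n_params / peak_flops
    return memory_time_s, compute_time_s


# Hypothetical setup: 70B parameters in 16-bit precision,
# 3 TB/s of memory bandwidth, 1 PFLOP/s of peak compute.
memory_s, compute_s = decode_step_estimate(
    n_params=70e9,
    bytes_per_param=2,
    mem_bandwidth_bytes_s=3e12,
    peak_flops=1e15,
)
ratio = memory_s / compute_s
print(f"memory: {memory_s * 1e3:.1f} ms, compute: {compute_s * 1e3:.2f} ms")
print(f"memory time is about {ratio:.0f}x compute time")
```

Under these assumptions the arithmetic takes a small fraction of the time spent moving weights, so verifying several tokens in the same pass costs little extra.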

Speculative decoding breaks this in the same way branch prediction broke CPU stalls.

→ A small model guesses the next K tokens.

→ The large model then verifies all K tokens in a single forward pass.

That verification step looks like a prefill stage, with multiple tokens processed at once and compute units fully saturated.
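The draft-then-verify round can be sketched with toy models. This is a simplified greedy version, not the code from the linked article: `draft_next` and `target_next` are hypothetical stand-ins for the small and large model, and the list comprehension only emulates what a real system gets from one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One draft-then-verify round with greedy decoding.

    draft_next / target_next: hypothetical callables that map a list of
    tokens to the next token. Returns the tokens emitted this round.
    """
    # 1) The small model drafts k tokens, one cheap call at a time.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2) The large model predicts the next token at every drafted position.
    #    A real system does this in ONE forward pass; this comprehension
    #    only emulates the result of that pass.
    target_preds = [target_next(prefix + drafted[:i]) for i in range(k + 1)]

    # 3) Accept drafted tokens up to the first disagreement, then emit the
    #    large model's own token at that position.
    emitted = []
    for i in range(k):
        if drafted[i] != target_preds[i]:
            emitted.append(target_preds[i])
            return emitted
        emitted.append(drafted[i])
    emitted.append(target_preds[k])  # all k accepted: one bonus token
    return emitted


# Toy models over tokens 0..9: the target counts upward, and the draft
# agrees except right after token 3.
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


best = speculative_step([5], draft_next, target_next, k=4)
worst = speculative_step([3], draft_next, target_next, k=4)
print(len(best), len(worst))
```

Starting from token 5 the draft is right four times in a row, so the round emits five tokens; starting from token 3 the first guess is wrong, so the round emits only the large model's single token.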

→ Best case, this gives K+1 tokens from one large-model call.

→ Worst case, it gives 1 token, exactly what standard decoding produces.
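Typical rounds land between those two extremes. Under the simplifying assumption that each drafted token is accepted independently with probability alpha (a symbol introduced here, not in the thread), the expected number of tokens per large-model call is (1 - alpha^(K+1)) / (1 - alpha). A small sketch:

```python
def expected_tokens_per_call(alpha, k):
    """Expected tokens emitted per large-model call.

    Assumes each drafted token is accepted independently with probability
    alpha. This is a simplification: real acceptance rates vary by position
    and by prompt.
    """
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.0, 0.6, 0.8, 1.0):
    print(alpha, round(expected_tokens_per_call(alpha, k=4), 2))
```

With K = 4 and an acceptance rate of 0.8, this works out to about 3.36 tokens per large-model call. It is not the end-to-end speedup, because the cost of running the draft model is not included.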

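And in both cases, the output distribution is mathematically identical to what the large model would produce on its own, because every rejected guess is replaced by a corrected sample from the large model.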
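The identical-distribution claim rests on the accept/reject rule from the speculative sampling literature, which the thread does not spell out: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from max(0, p - q) renormalized. The sketch below computes the resulting distribution exactly for one step. The two small distributions are made up for illustration.

```python
from fractions import Fraction


def speculative_output_distribution(p, q):
    """Exact output distribution of one accept/reject step.

    p: large-model distribution, q: draft distribution (dicts token -> prob).
    A token x drawn from q is accepted with probability min(1, p[x] / q[x]);
    on rejection, a token is resampled from max(0, p - q), renormalized.
    """
    tokens = sorted(set(p) | set(q))
    # P(draft proposes x and it is accepted) = q[x] * min(1, p[x]/q[x])
    #                                        = min(p[x], q[x])
    accept_mass = {x: min(p.get(x, 0), q.get(x, 0)) for x in tokens}
    reject_prob = 1 - sum(accept_mass.values())
    residual = {x: max(0, p.get(x, 0) - q.get(x, 0)) for x in tokens}
    residual_total = sum(residual.values())
    out = {}
    for x in tokens:
        resampled = reject_prob * residual[x] / residual_total if residual_total else 0
        out[x] = accept_mass[x] + resampled
    return out


# Made-up distributions over three tokens, kept as exact fractions.
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(3, 10)}
out = speculative_output_distribution(p, q)
print(out == p)
```

The draft distribution q only changes how often guesses are accepted. The distribution of the emitted token is p either way.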

Google Search uses this exact approach in AI Overviews to serve over 2 billion users. vLLM, TensorRT-LLM, and SGLang also ship built-in support.

I wrote about the full mechanics of speculative decoding in an article. It covers KV caching, the internals of speculative decoding (with code), and tradeoffs.

Read it below.

Find me → @_avichawla Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
