@RedHat_AI: 145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output qu…
Summary
Red Hat demonstrates that using speculative decoding can boost LLM inference speed from 145 to 424 tokens per second on the same H100 hardware with no quality loss, highlighting a significant optimization for production serving.
Cached at: 06/15/26, 05:08 PM
145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output quality.
If you’re serving LLMs in production and not using speculative decoding, here’s what you’re leaving on the table…
Two models work together:
A small draft model (0.5-2B params) sprints ahead and proposes 3-5 tokens fast. The large verifier checks all of them in a single parallel forward pass.
When the draft is right (50-80% of the time for predictable tasks), you get multiple tokens for the cost of one forward pass. When it’s wrong, you lose microseconds.
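The draft-then-verify loop can be sketched in a few lines. This is a toy illustration, not vLLM's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the large and small models, acceptance is simple greedy matching (production engines use a rejection-sampling rule when sampling), and the verifier is called in a Python loop where a real engine would score all positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: token sequence in, greedy next token out.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The verifier checks each proposed position.
    accepted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        accepted.append(expected)
        if proposed[i] != expected:
            # First mismatch: keep the verifier's token, drop the rest.
            return accepted

    # 3. Every draft token matched, so the same pass yields a bonus token.
    accepted.append(target(prefix + proposed))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return the number of verify steps."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[:len(prefix) + max_new], steps


def toy_target(seq: List[int]) -> int:
    """Toy 'large model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy 'draft model': agrees with the target except after a 2."""
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10
```

Calling `generate([0], toy_draft, toy_target, k=4, max_new=12)` returns exactly the sequence the target would produce on its own, but in fewer verifier steps than tokens. That is the sense in which output quality is unchanged: every emitted token is one the verifier itself would have chosen.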
Where it works: Code generation, JSON/SQL, structured outputs, template-based generation. Anything with predictable patterns.
Where it doesn’t: Large batch sizes (32+) where the GPU is already saturated. Creative writing where the draft model can’t predict tokens accurately.
Acceptance rate is your signal. 60-80% is the sweet spot.
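To see why acceptance rate matters so much, here is a rough sketch of the expected number of tokens emitted per verifier pass. It assumes each draft token is accepted independently with the same probability, which is a simplification (real acceptance varies by position and prompt), and the function name is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float,
                             num_draft_tokens: int) -> float:
    """Expected tokens per verifier pass under an i.i.d. acceptance assumption.

    Sums the geometric series 1 + a + a^2 + ... + a^k, where a is the
    per-token acceptance rate and k the number of draft tokens. The leading
    1 is the token the verifier always contributes itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    a, k = acceptance_rate, num_draft_tokens
    return (1.0 - a ** (k + 1)) / (1.0 - a)


if __name__ == "__main__":
    for rate in (0.5, 0.6, 0.8):
        print(rate, round(expected_tokens_per_pass(rate, 5), 2))
```

Under this assumption, 5 draft tokens at 80% acceptance yield about 3.7 tokens per verifier pass, versus about 2.4 at 60%. This is an upper bound on speedup rather than the speedup itself, since the draft model's own latency is not counted.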
Enabling it in @vllm_project takes a single command:
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --speculative-model RedHatAI/gemma-4-31B-it-speculator.eagle3 \
  --num-speculative-tokens 5
Red Hat AI has pre-trained speculators for Gemma, Qwen, Llama, and Mistral ready on HuggingFace:
The cost math:
Standard: 100 tokens/sec at $5/hr ≈ $0.014 per 1,000 tokens
With spec decoding: 250 tokens/sec at $5/hr ≈ $0.0056 per 1,000 tokens
60% cost reduction. Same hardware. For a deployment serving 10M tokens/day, that’s roughly $30,400 saved annually.
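As a cross-check on the arithmetic, here is a small sketch that derives cost per 1,000 tokens directly from throughput and hourly price. The function names are illustrative, and it assumes the GPU sustains the stated throughput for every billed hour.

```python
SECONDS_PER_HOUR = 3600


def cost_per_1k_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Dollar cost of generating 1,000 tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * SECONDS_PER_HOUR
    return dollars_per_hour / tokens_per_hour * 1000


def cost_reduction(baseline_tps: float, spec_tps: float) -> float:
    """Fractional cost reduction; depends only on the throughput ratio."""
    return 1.0 - baseline_tps / spec_tps


def annual_savings(baseline_tps: float, spec_tps: float,
                   dollars_per_hour: float, tokens_per_day: float) -> float:
    """Yearly dollars saved for a fixed daily token volume."""
    saving_per_1k = (cost_per_1k_tokens(baseline_tps, dollars_per_hour)
                     - cost_per_1k_tokens(spec_tps, dollars_per_hour))
    return saving_per_1k * (tokens_per_day / 1000) * 365


if __name__ == "__main__":
    print(cost_per_1k_tokens(100, 5.0))
    print(cost_per_1k_tokens(250, 5.0))
    print(cost_reduction(100, 250))
    print(annual_savings(100, 250, 5.0, 10_000_000))
```

With 100 and 250 tokens/sec at $5/hr, this gives about $0.014 and $0.0056 per 1,000 tokens. The 60% reduction follows from the throughput ratio alone (1 - 100/250), whatever the hourly price.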
Full guide by @soyr: how it works, when to use it, how to tune acceptance rates, and where to get pre-trained speculator models:
Gemma 4 Diffusion landed in vLLM last week. Day 0.
First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel.
Result: 1,000+ tokens per second at batch size 1 on a single H100.
Built on Model Runner V2. @googlegemma