@RedHat_AI: 145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output qu…


Summary

Red Hat demonstrates that using speculative decoding can boost LLM inference speed from 145 to 424 tokens per second on the same H100 hardware with no quality loss, highlighting a significant optimization for production serving.


Cached at: 06/15/26, 05:08 PM

145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output quality.

If you’re serving LLMs in production and not using speculative decoding, here’s what you’re leaving on the table… A 🧵:

Two models work together:

A small draft model (0.5-2B params) sprints ahead and proposes 3-5 tokens fast. The large verifier checks all of them in a single parallel forward pass.

When the draft is right (50-80% of the time for predictable tasks), you get multiple tokens for the cost of one forward pass. When it’s wrong, you lose microseconds.
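The draft-then-verify loop can be sketched in a few lines. This is a toy illustration with greedy decoding, not vLLM's implementation: `draft` and `target` are hypothetical stand-ins for the two models (plain functions from a token prefix to the next token), and the verifier is called in a loop here only to simulate what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it commits."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target checks each proposed position. A real engine scores all
    #    positions in a single parallel forward pass; this loop simulates it.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the same pass yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Generate tokens; return (sequence, number of verifier passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
        passes += 1
    return out[: len(prompt) + max_new_tokens], passes


# Toy models: the target counts modulo 10; the draft agrees except after a 7.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10


tokens, passes = generate([0], toy_draft, toy_target, k=4, max_new_tokens=12)
```

Every committed token is one the target would have chosen itself, so under greedy decoding the output matches plain target-only decoding. Here 12 tokens arrive in 3 verifier passes instead of 12.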

Where it works: Code generation, JSON/SQL, structured outputs, template-based generation. Anything with predictable patterns.

Where it doesn’t: Large batch sizes (32+) where the GPU is already saturated. Creative writing where the draft model can’t predict tokens accurately.

Acceptance rate is your signal. 60-80% is the sweet spot.
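A rough way to see why acceptance rate matters: if each draft token is accepted independently with probability `alpha` (a simplifying assumption, not something the thread states), the expected number of tokens committed per verifier pass with `k` draft tokens is a geometric sum. The function name below is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per verifier pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha, and that the verifier always contributes one token
    (the correction on a mismatch, or a bonus token if all k match).
    Ignores the time spent running the draft model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.6, 0.7, 0.8):
    print(f"alpha={alpha:.1f}  tokens/pass={expected_tokens_per_pass(alpha, 5):.2f}")
```

This is an upper bound on the speedup, since drafting is not free. It also shows why low acceptance rates give little benefit: at `alpha = 0` each pass yields exactly one token, the same as ordinary decoding.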

Enabling it in @vllm_project is one flag:

vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --speculative-model RedHatAI/gemma-4-31B-it-speculator.eagle3 \
  --num-speculative-tokens 5
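To check a throughput claim on your own workload, one option is to time a request against the server's OpenAI-compatible `/v1/completions` endpoint and divide the reported completion tokens by wall-clock time. This is a sketch under assumptions: the `measure` helper, the `http://localhost:8000` base URL, and the prompt are illustrative, and the timing includes prefill, so it slightly understates pure decode speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Throughput for a single request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    """Send one completion request and return observed tokens/sec.

    base_url is assumed to point at a running server, e.g. http://localhost:8000.
    """
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy, so runs with and without a draft are comparable
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)
```

Calling `measure("http://localhost:8000", "RedHatAI/gemma-4-31B-it-FP8-Dynamic", "Write a SQL query that ...")` once against a server started without the speculative options and once with them gives a like-for-like comparison at batch size 1.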

Red Hat AI has pre-trained speculators for Gemma, Qwen, Llama, and Mistral ready on HuggingFace:

The cost math:

Standard: 100 tokens/sec at $5/hr = 360,000 tokens/hr, or about $0.0139 per 1,000 tokens

With spec decoding: 250 tokens/sec at $5/hr = 900,000 tokens/hr, or about $0.0056 per 1,000 tokens

60% cost reduction. Same hardware. For a deployment serving 10M tokens/day, that’s roughly $30,400 saved annually.
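The arithmetic behind a comparison like this can be checked with a short sketch. The function names are hypothetical; the inputs are the throughput and hourly price quoted above. The relative saving depends only on the throughput ratio, not on the hourly price.

```python
SECONDS_PER_HOUR = 3600


def cost_per_1k_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Dollar cost of generating 1,000 tokens on one fully utilized GPU."""
    tokens_per_hour = tokens_per_sec * SECONDS_PER_HOUR
    return dollars_per_hour / tokens_per_hour * 1000


def cost_reduction(base_tps: float, spec_tps: float) -> float:
    """Fractional cost reduction; the hourly price cancels out."""
    return 1.0 - base_tps / spec_tps


def annual_savings(tokens_per_day: float, base_tps: float,
                   spec_tps: float, dollars_per_hour: float) -> float:
    """Yearly dollar savings for a fixed daily token volume."""
    base = cost_per_1k_tokens(base_tps, dollars_per_hour)
    spec = cost_per_1k_tokens(spec_tps, dollars_per_hour)
    return (base - spec) * tokens_per_day / 1000 * 365


base = cost_per_1k_tokens(100, 5.0)
spec = cost_per_1k_tokens(250, 5.0)
print(f"standard:      ${base:.4f} per 1,000 tokens")
print(f"spec decoding: ${spec:.4f} per 1,000 tokens")
print(f"reduction:     {cost_reduction(100, 250):.0%}")
print(f"annual saving: ${annual_savings(10_000_000, 100, 250, 5.0):,.0f}")
```

This assumes the GPU is billed only while generating and runs at the quoted throughput the whole time; idle capacity or batching would change the absolute figures but not the ratio.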

Full guide by @soyr: how it works, when to use it, how to tune acceptance rates, and where to get pre-trained speculator models:

Gemma 4 Diffusion landed in vLLM last week. Day 0.

First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel.

Result: 1,000+ tokens per second at batch size 1 on a single H100.

Built on Model Runner V2. @googlegemma
