[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

Reddit r/LocalLLaMA 06/25/26, 09:55 PM Papers

Summary

JetSpec introduces parallel tree drafting for speculative decoding, achieving up to 9.64x end-to-end speedup on LLM inference while maintaining lossless accuracy, with throughput reaching ~1000 TPS on a single B200 GPU.

We find that speculative decoding (SD) can drive LLM generation latency down to an extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64× end-to-end speedup on MATH-500 and 4.58× on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, this further translates to around 1000 TPS on a single B200 GPU. ⚡️

Prior SD faces a dilemma: autoregressive (AR)-style draft heads preserve causality for quality, but their drafting cost grows with tree depth. Block-diffusion-style heads draft cheaply in one pass, but branches are often scored independently, so deeper paths can become mutually inconsistent. JetSpec gets its speed by drafting a causality-preserving tree in a single pass. 🚀🌳

Check out our project page for demos and how we built it 👇
https://jetspec-project.github.io/jetspec-web/
💻 Code: https://github.com/hao-ai-lab/JetSpec
🌟 Blog: https://haoailab.com/blogs/parallel-tree-decoding/

Figures: JetSpec vs. DFlash and AR baselines; JetSpec running in an inference engine at around 1000 TPS on average; end-to-end speedup comparisons.
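For readers new to tree-based speculative decoding, here is a minimal toy sketch of the two ideas the post relies on: a causality-preserving tree (each drafted node may only depend on its own ancestors) and lossless greedy verification. This is not JetSpec's code. The names `DRAFT_TREE`, `toy_target`, `ancestor_mask`, `verify_tree`, and `greedy_decode` are all made up for illustration, the "target model" is an arithmetic stand-in, and the verifier calls the target once per step for clarity, whereas a real system would score every tree node in a single forward pass using the ancestor mask.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical draft-tree encoding: node i = (parent index, drafted token).
# A parent of -1 means the node hangs directly off the prompt prefix.
Node = Tuple[int, int]
DRAFT_TREE: List[Node] = [(-1, 7), (-1, 4), (0, 3), (0, 5), (2, 9), (1, 1)]


def toy_target(context: Sequence[int]) -> int:
    """Stand-in for the target model's greedy next token (pure assumption)."""
    return (context[-1] * 3 + len(context)) % 10


def ancestor_mask(tree: Sequence[Node]) -> List[List[bool]]:
    """mask[i][j] is True iff node j is node i itself or one of its ancestors.

    This is the attention pattern that keeps each branch causal: a node
    never attends to sibling branches.
    """
    n = len(tree)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = tree[j][0]  # step up to the parent
    return mask


def verify_tree(
    prefix: Sequence[int],
    tree: Sequence[Node],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: follow the branch that matches the target.

    Every emitted token is the target's own greedy choice, so the result is
    identical to plain greedy decoding (the "lossless" property).
    """
    emitted: List[int] = []
    current = -1  # start at the virtual root (the prefix)
    while True:
        want = target_next(list(prefix) + emitted)
        emitted.append(want)  # accepted draft token, or the final correction
        child = next(
            (i for i, (p, t) in enumerate(tree) if p == current and t == want),
            None,
        )
        if child is None:
            return emitted
        current = child


def greedy_decode(
    prefix: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
    steps: int,
) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out: List[int] = []
    for _ in range(steps):
        out.append(target_next(list(prefix) + out))
    return out


if __name__ == "__main__":
    tokens = verify_tree([2], DRAFT_TREE, toy_target)
    print(tokens)  # [7, 3, 2]: two drafted tokens accepted plus one correction
    print(tokens == greedy_decode([2], toy_target, len(tokens)))  # True
```

In this toy run, one verification step emits three tokens instead of one, which is where the speedup comes from; how many tokens get accepted in practice depends on how well the drafter matches the target.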

Similar Articles

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Hugging Face Daily Papers

JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates, achieving up to 9.64x speedup on MATH-500 and 4.58x on conversational workloads.

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

arXiv cs.CL

JetFlow is a speculative decoding framework that breaks the scaling ceiling by combining one-forward drafting efficiency with branch-wise causal conditioning, achieving up to 9.64x speedup on math benchmarks and outperforming prior methods on dense and MoE Qwen3 models.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

Hugging Face Daily Papers

SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.

What is Speculative Decoding? (trending on paperswithco.de) [R]

Reddit r/MachineLearning

Speculative decoding is an inference optimization technique that uses a fast draft model to propose future tokens verified in parallel by a larger model, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Hugging Face Daily Papers

Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41x speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.
