[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000 TPS
Summary
JetSpec introduces parallel tree drafting for speculative decoding, achieving up to 9.64x end-to-end speedup on LLM inference while remaining lossless with respect to the target model's output, with throughput reaching roughly 1000 tokens per second (TPS) on a single B200 GPU.
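For readers new to the idea, here is a minimal sketch of the basic draft-and-verify loop behind speculative decoding, and of why it can be lossless under greedy decoding. It is not JetSpec's method: it drafts a single linear chain of tokens rather than a tree, and `target_model`, `draft_model`, `VOCAB`, and the block size `k` are hypothetical toy stand-ins chosen only so that the example runs.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]

VOCAB = 50  # hypothetical toy vocabulary size


def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large target model (assumption, not a real LLM)."""
    return (sum(ctx) * 31 + len(ctx) * 7 + 3) % VOCAB


def draft_model(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except when len(ctx) % 4 == 3."""
    tok = target_model(ctx)
    return (tok + 1) % VOCAB if len(ctx) % 4 == 3 else tok


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Chain-style speculative decoding with greedy verification.

    Returns the generated sequence and the number of verification rounds.
    In a real system each round is a single batched target forward pass;
    here the target is simply called once per checked position.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2) Verify: keep the longest prefix the target agrees with.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch with the target's token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target contributes one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:goal], rounds
```

Calling `speculative_decode(target_model, draft_model, [1, 2, 3], 20)` should return the same tokens as `greedy_decode(target_model, [1, 2, 3], 20)`, because every emitted token is either confirmed or supplied by the target. The speedup in practice comes from needing fewer target passes than generated tokens whenever the drafter is often right.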
Similar Articles
JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates, achieving up to 9.64x speedup on MATH-500 and 4.58x on conversational workloads.
JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
JetFlow is a speculative decoding framework that breaks the scaling ceiling by combining one-forward drafting efficiency with branch-wise causal conditioning, achieving up to 9.64x speedup on math benchmarks and outperforming prior methods on dense and MoE Qwen3 models.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
What is Speculative Decoding? (trending on paperswithco.de) [R]
Speculative decoding is an inference optimization technique in which a fast draft model proposes future tokens and a larger model verifies them in parallel, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41x speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.