[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

Reddit r/LocalLLaMA Papers

Summary

JetSpec introduces parallel tree drafting for speculative decoding, achieving up to 9.64x end-to-end speedup on LLM inference while maintaining lossless accuracy, with throughput reaching ~1000 TPS on a single B200 GPU.

We find that speculative decoding (SD) can push LLM generation speed to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64× end-to-end speedup on MATH-500 and 4.58× on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, this translates to around 1000 TPS on a single B200 GPU. ⚡️

Prior SD faces a dilemma. AR-style draft heads preserve causality, which helps draft quality, but their drafting cost grows with tree depth. Block-diffusion-style heads draft cheaply in one pass, but branches are often scored independently, so deeper paths can become mutually inconsistent. JetSpec resolves this by drafting a causality-preserving tree in a single pass. 🚀🌳

Check out our project page for demos and how we built it 👇
https://jetspec-project.github.io/jetspec-web/
💻 Code: https://github.com/hao-ai-lab/JetSpec
🌟 Blog: https://haoailab.com/blogs/parallel-tree-decoding/

Figures: JetSpec vs. DFlash and AR baselines; JetSpec with inference engine, rendering around 1000 TPS on average; end-to-end speedup comparisons.
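
To make the "lossless" claim for tree drafting concrete, here is a minimal sketch of greedy verification over a draft tree. It is not JetSpec's implementation: `toy_target`, `verify_tree`, and the nested-dict tree layout are hypothetical names chosen for illustration, and the toy target is called once per node, whereas a real engine would typically score all tree nodes in one batched forward pass.

```python
from typing import Callable, Dict, List

# A draft tree maps each candidate token to the subtree of its continuations.
Tree = Dict[int, "Tree"]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def verify_tree(prefix: List[int], tree: Tree,
                target: Callable[[List[int]], int]) -> List[int]:
    """Walk the draft tree, following the branch the target itself would pick.

    Every emitted token is the target's own greedy choice, so the result
    matches plain greedy decoding no matter what the draft proposed.
    """
    accepted: List[int] = []
    node = tree
    while True:
        tok = target(prefix + accepted)
        accepted.append(tok)      # the target's token is always kept
        if tok not in node:       # no drafted branch matches: stop here
            return accepted
        node = node[tok]          # a drafted branch matched: go deeper


def greedy_decode(prefix: List[int], n: int,
                  target: Callable[[List[int]], int]) -> List[int]:
    """Reference decoding: one target call per generated token."""
    out: List[int] = []
    for _ in range(n):
        out.append(target(prefix + out))
    return out


if __name__ == "__main__":
    prefix = [1, 2, 3]
    t0 = toy_target(prefix)
    t1 = toy_target(prefix + [t0])
    # Draft tree holding the correct two-token path plus two wrong siblings.
    tree: Tree = {t0: {t1: {}, (t1 + 1) % 50: {}}, (t0 + 1) % 50: {}}
    accepted = verify_tree(prefix, tree, toy_target)
    print(accepted == greedy_decode(prefix, len(accepted), toy_target))
```

A wider or deeper tree raises the chance that some drafted path matches the target for several steps, which is why both drafting cost and draft quality matter for the final speedup.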

Similar Articles

What is Speculative Decoding? (trending on paperswithco.de) [R]

Reddit r/MachineLearning

Speculative decoding is an inference optimization technique in which a fast draft model proposes future tokens that a larger model then verifies in parallel, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.
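
The draft-then-verify loop described here can be sketched with two toy models. This is an illustrative sketch under stated assumptions, not code from any of the linked projects: `target_next`, `draft_next`, and `speculative_generate` are hypothetical names, decoding is greedy, and verification calls the target once per position, whereas a real system would typically verify all drafted positions in a single parallel pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical large model: deterministic greedy next token."""
    return (sum(ctx) * 7 + len(ctx)) % 11


def draft_next(ctx: List[int]) -> int:
    """Hypothetical fast draft model: agrees with the target most of the time."""
    tok = target_next(ctx)
    return (tok + 1) % 11 if len(ctx) % 4 == 0 else tok


def speculative_generate(prompt: List[int], n_new: int, k: int,
                         draft: Model, target: Model) -> Tuple[List[int], int]:
    """Generate n_new tokens; return them with the number of verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft model proposes k future tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks the proposal position by position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)   # keep the target's token either way
            if expected != tok:         # first mismatch ends this round
                break
        else:
            # Every drafted token matched: the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new], rounds


def greedy_generate(prompt: List[int], n_new: int, target: Model) -> List[int]:
    """Reference: plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


if __name__ == "__main__":
    tokens, rounds = speculative_generate([1, 2, 3], 12, 4, draft_next, target_next)
    print(tokens == greedy_generate([1, 2, 3], 12, target_next), rounds)
```

Because each round keeps only tokens the target would have chosen itself, the output is identical to plain greedy decoding; the gain comes from needing fewer verification rounds than generated tokens.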