JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Hugging Face Daily Papers 06/25/26, 12:00 AM Papers

speculative-decoding llm-inference parallel-drafting tree-drafting efficiency causal-conditioning qwen3

Summary

JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates, achieving up to 9.64x speedup on MATH-500 and 4.58x on conversational workloads.

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
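
The scaling limitation described above can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from the paper: it assumes a single draft chain (not a tree), an independent per-token acceptance probability `alpha`, a verification cost of one target forward pass, and a hypothetical relative draft cost `draft_cost` per drafter pass. Under those assumptions the expected number of tokens emitted per verification step is the standard geometric sum (1 - alpha^(depth+1)) / (1 - alpha), which can never exceed 1 / (1 - alpha) however large the draft budget grows.

```python
def expected_tokens(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verify step for a chain draft of `depth`
    tokens, assuming i.i.d. per-token acceptance probability `alpha`.
    Includes the one token the target model always contributes."""
    if alpha >= 1.0:
        return float(depth + 1)
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


def speedup(alpha: float, depth: int, draft_cost: float, draft_passes: int) -> float:
    """Idealized speedup over plain autoregressive decoding.

    One step costs one target forward (normalized to 1.0) plus
    `draft_passes` drafter passes, each costing `draft_cost` target forwards.
    Both cost figures are hypothetical inputs, not measurements.
    """
    step_cost = 1.0 + draft_cost * draft_passes
    return expected_tokens(alpha, depth) / step_cost


if __name__ == "__main__":
    ALPHA, DRAFT_COST = 0.8, 0.05  # illustrative values only
    for depth in (2, 4, 8, 16, 32):
        # An autoregressive drafter needs one pass per drafted position.
        autoregressive = speedup(ALPHA, depth, DRAFT_COST, draft_passes=depth)
        # A one-forward drafter fills every position in a single pass.
        one_forward = speedup(ALPHA, depth, DRAFT_COST, draft_passes=1)
        print(f"depth={depth:2d}  autoregressive={autoregressive:.2f}x  "
              f"one-forward={one_forward:.2f}x")
```

In this toy model the autoregressive drafter's speedup falls once depth keeps growing, because drafting cost rises linearly while expected accepted tokens saturate. The one-forward drafter avoids the cost growth but still saturates near 1 / (1 - alpha), which is why a larger budget only pays off if acceptance also stays high.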

Original Article

Cached at: 06/26/26, 06:05 AM

Source: https://huggingface.co/papers/2606.18394 Authors:

Abstract

JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates across various benchmarks.

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model’s autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
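
To make the point about branch-agnostic marginals concrete, here is a toy sketch. It is not JetSpec's algorithm. It assumes a hypothetical joint distribution `JOINT` over two-token continuations, an idealized drafter that knows either that joint (path-conditioned scoring) or only its per-position marginals (branch-agnostic scoring), and exact-match acceptance against a continuation drawn from the joint. All names and probabilities are invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint distribution over two-token continuations.
JOINT = {
    ("New", "York"): 0.40,
    ("New", "Jersey"): 0.15,
    ("Los", "Angeles"): 0.45,
}
BUDGET = 2  # number of depth-2 nodes the draft tree may hold


def position_marginals(joint):
    """Per-position marginals, which ignore the token chosen at the other position."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return first, second


def build_tree(leaf_scores, first_tokens, budget):
    """Tree as a set of prefixes: every first token plus the top-`budget` leaves."""
    leaves = sorted(leaf_scores, key=leaf_scores.get, reverse=True)[:budget]
    return {(a,) for a in first_tokens} | set(leaves)


def accepted_length(tree, truth):
    """Length of the longest prefix of `truth` that is present in the tree."""
    n = 0
    for d in range(1, len(truth) + 1):
        if truth[:d] not in tree:
            break
        n = d
    return n


def expected_accepted(tree, joint):
    """Expected accepted draft tokens when the true continuation follows `joint`."""
    return sum(p * accepted_length(tree, path) for path, p in joint.items())


first, second = position_marginals(JOINT)

# Branch-agnostic scoring: product of marginals, blind to the parent token.
agnostic_scores = {(a, b): first[a] * second[b] for a, b in product(first, second)}
agnostic_tree = build_tree(agnostic_scores, first, BUDGET)

# Path-conditioned scoring: the joint probability of the whole branch.
causal_tree = build_tree(JOINT, first, BUDGET)

print("branch-agnostic leaves:", sorted(t for t in agnostic_tree if len(t) == 2))
print("path-conditioned leaves:", sorted(t for t in causal_tree if len(t) == 2))
print("expected accepted (branch-agnostic):", round(expected_accepted(agnostic_tree, JOINT), 2))
print("expected accepted (path-conditioned):", round(expected_accepted(causal_tree, JOINT), 2))
```

With these invented numbers, branch-agnostic scoring spends one of its two leaf slots on "New Angeles", a branch with zero joint probability, while path-conditioned scoring spends both slots on continuations that can actually occur. The same node budget therefore yields a longer expected accepted prefix in the path-conditioned tree.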

Similar Articles

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

What is Speculative Decoding? (trending on paperswithco.de) [R]
