Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

speculative-decoding llm-inference causal-modeling parallel-drafting inference-speedup qwen3

Summary

Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting, using a parallel backbone and lightweight causal refinement head to achieve up to 5.49× end-to-end speedup on Qwen3 models.

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
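The draft-then-verify step described above can be sketched in a few lines. This is a minimal illustration of greedy verification only, not the paper's implementation: `verify_greedy`, `target_next`, and `toy_target` are hypothetical names, and the toy target model is a stand-in rule rather than an LLM. A real system scores all draft positions in one batched forward pass of the target model; the loop below only mimics the acceptance rule.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens committed in one draft-and-verify step.

    `target_next` is a hypothetical stand-in for the target model's
    greedy next-token choice given a context.
    """
    committed: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            committed.append(expected)
            return committed
        committed.append(tok)
        context.append(tok)
    # Every draft token matched, so the target contributes one bonus token.
    committed.append(target_next(context))
    return committed


def toy_target(context: Sequence[int]) -> int:
    # Toy rule standing in for a language model: next token = last token + 1.
    return context[-1] + 1


accepted = verify_greedy(prefix=[1, 2, 3], draft=[4, 5, 9, 10], target_next=toy_target)
print(accepted)
```

In this toy run the first two draft tokens agree with the target and the third does not, so the step commits two draft tokens plus the target's correction. The longer the accepted run per step, the fewer target forward passes are needed per generated token.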


Cached at: 06/02/26, 03:35 PM


Source: https://huggingface.co/papers/2605.29707

Published on May 28

Submitted by https://huggingface.co/Huang2020

黄佳诺 on Jun 2

Abstract

Domino is a speculative decoding framework that improves LLM inference speed by decoupling causal dependency modeling from autoregressive drafting through a parallel backbone and lightweight causal refinement head, achieving significant speedups in both end-to-end execution and throughput.

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
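The two-stage idea in the abstract (parallel preliminary distributions for the block, then a cheap prefix-dependent correction) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: this page does not describe the Domino head's architecture, so the pairwise bias table `pair_bias`, the function `refine_block`, and all numbers are hypothetical stand-ins for "a lightweight correction conditioned on earlier draft tokens".

```python
from typing import List


def argmax(xs: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def refine_block(
    base_logits: List[List[float]],
    pair_bias: List[List[float]],
    last_token: int,
) -> List[int]:
    """Choose draft tokens left to right from precomputed parallel logits.

    base_logits[i] holds the parallel backbone's scores for block position i.
    pair_bias[p][v] is a hypothetical correction added to token v when the
    previously chosen token is p. Only this cheap lookup is sequential; the
    expensive scores in base_logits are computed once for the whole block.
    """
    draft: List[int] = []
    prev = last_token
    for logits in base_logits:
        corrected = [score + pair_bias[prev][v] for v, score in enumerate(logits)]
        prev = argmax(corrected)
        draft.append(prev)
    return draft


# Toy vocabulary of 3 tokens and a block of 2 positions (made-up values).
base_logits = [[2.0, 1.0, 0.0], [1.0, 0.9, 0.0]]
pair_bias = [
    [-1.0, 0.5, 0.0],  # after token 0: discourage repeating 0, favor 1
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

parallel_only = [argmax(row) for row in base_logits]
refined = refine_block(base_logits, pair_bias, last_token=2)
print(parallel_only, refined)
```

The parallel-only draft picks each position independently and repeats token 0, while the refined draft changes the second position once it knows the first choice. That is the kind of intra-block dependency a fully parallel drafter cannot express on its own.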


Community

Paper author

Paper submitter


Domino is a speculative decoding method that improves parallel drafting by adding lightweight causal correction. It aims to retain the efficiency of block-parallel draft generation while recovering part of the causal dependency modeling lost in fully parallel draft models. Code and models are available at: https://github.com/jianuo-huang/Domino
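The trade-off this comment refers to can be made concrete with a back-of-the-envelope cost model. This is a common simplification, not a formula from the paper: it assumes verification costs one target forward pass per step, measures drafting cost relative to that pass, and ignores all other overhead. The function name `estimated_speedup` and every number below are made up for illustration and are not results reported for Domino.

```python
def estimated_speedup(tokens_per_step: float, draft_cost: float) -> float:
    """Rough speedup over plain decoding (one token per target forward pass).

    tokens_per_step: average tokens committed per draft-and-verify step.
    draft_cost: drafting time per step, as a fraction of one target pass.
    """
    if tokens_per_step <= 0 or draft_cost < 0:
        raise ValueError("tokens_per_step must be positive and draft_cost non-negative")
    return tokens_per_step / (1.0 + draft_cost)


# Illustrative, made-up settings:
# an autoregressive drafter runs 8 sequential small-model passes per block,
autoregressive = estimated_speedup(tokens_per_step=5.0, draft_cost=8 * 0.05)
# a parallel drafter runs one pass but gets fewer tokens accepted,
parallel = estimated_speedup(tokens_per_step=3.5, draft_cost=0.06)
# and a parallel drafter with a cheap correction head sits in between on cost.
parallel_plus_head = estimated_speedup(tokens_per_step=4.5, draft_cost=0.08)

print(round(autoregressive, 2), round(parallel, 2), round(parallel_plus_head, 2))
```

Under these assumed numbers, recovering part of the acceptance length at a small extra drafting cost gives the best ratio, which matches the intuition in the comment. Real speedups depend on the model, block size, and serving backend.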


Get this paper in your agent:

hf papers read 2605.29707

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2

#### Huang2020/Qwen3-4B-Domino-b16

Text Generation • 0.6B • Updated about 23 hours ago • 134 • 1

#### Huang2020/Qwen3-8B-Domino-b16

Text Generation • 1B • Updated about 23 hours ago • 174 • 1



Similar Articles

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

Speculative Decoding Across Languages
