Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding
Summary
Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting, using a parallel backbone and lightweight causal refinement head to achieve up to 5.49× end-to-end speedup on Qwen3 models.
Source: https://huggingface.co/papers/2605.29707
Published on May 28. Submitted by 黄佳诺 (https://huggingface.co/Huang2020) on Jun 2.
Abstract
Domino is a speculative decoding framework that improves LLM inference speed by decoupling causal dependency modeling from autoregressive drafting through a parallel backbone and lightweight causal refinement head, achieving significant speedups in both end-to-end execution and throughput.
Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
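The draft-then-verify step that the abstract opens with can be illustrated with a minimal sketch of greedy verification. This is not code from the paper: `verify_draft` and `toy_target` are hypothetical names, the target model is replaced by a toy rule, and the Python loop only simulates what a real system would do in a single parallel forward pass of the target model.

```python
from typing import Callable, List, Sequence


def toy_target(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the target model: next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification of one draft block.

    Accepts the longest run of draft tokens that the target agrees with,
    then appends one token chosen by the target itself: either the
    correction at the first mismatch or a bonus token after a full match.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # every draft token matched
    return accepted


# The third draft token (9) is wrong, so two tokens are accepted plus one correction.
print(verify_draft([1, 2], [3, 4, 9, 6], toy_target))  # [3, 4, 5]
```

In this toy setting, the number of tokens emitted per target call grows with how many draft tokens match, which is why draft quality and drafting cost trade off against each other.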
Community
Domino is a speculative decoding method that improves parallel drafting by adding lightweight causal correction. It aims to retain the efficiency of block-parallel draft generation while recovering part of the causal dependency modeling lost in fully parallel draft models. Code and models are available at: https://github.com/jianuo-huang/Domino
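To make the idea of "parallel drafting plus lightweight causal correction" concrete, here is a toy sketch. It is an illustration under strong assumptions, not the Domino architecture: the backbone logits are hard-coded, the learned Domino head is replaced by a hand-written repeat penalty, and all function names (`backbone_logits`, `causal_bias`, `draft_parallel`, `draft_refined`) are hypothetical.

```python
from typing import List, Optional


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def backbone_logits(block_size: int) -> List[List[float]]:
    """Hypothetical parallel backbone: logits for every block position at once.

    Each position is computed without seeing the other draft tokens, so all
    positions get the same, slightly ambiguous distribution over 3 tokens.
    """
    return [[2.0, 1.9, 0.0] for _ in range(block_size)]


def causal_bias(prev_token: Optional[int], vocab_size: int) -> List[float]:
    """Hypothetical stand-in for a causal head: penalize repeating the previous token."""
    bias = [0.0] * vocab_size
    if prev_token is not None:
        bias[prev_token] -= 1.0
    return bias


def draft_parallel(block_size: int) -> List[int]:
    """Fully parallel draft: independent argmax at every position."""
    return [argmax(logits) for logits in backbone_logits(block_size)]


def draft_refined(block_size: int) -> List[int]:
    """Parallel logits plus a cheap, prefix-dependent correction per position."""
    tokens: List[int] = []
    prev: Optional[int] = None
    for logits in backbone_logits(block_size):  # backbone runs once for the block
        bias = causal_bias(prev, len(logits))
        refined = [logit + b for logit, b in zip(logits, bias)]
        prev = argmax(refined)
        tokens.append(prev)
    return tokens


print(draft_parallel(3))  # [0, 0, 0]
print(draft_refined(3))   # [0, 1, 0]
```

The expensive step (the backbone) runs once for the whole block, while the sequential part is only a small per-position adjustment. In the toy example the fully parallel draft repeats the same token because positions cannot coordinate, whereas the corrected draft conditions each position on the token chosen before it.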
Models citing this paper: 2
#### Huang2020/Qwen3-4B-Domino-b16 (Text Generation • 0.6B)
#### Huang2020/Qwen3-8B-Domino-b16 (Text Generation • 1B)
Similar Articles
AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding
AdaPLD is a training-free method that improves model-free speculative decoding by using adaptive retrieval combining lexical and semantic similarity, and constructing branched reuse hypotheses to handle continuation uncertainty, achieving up to 3.10x decoding speedup.
Attention Drift: What Autoregressive Speculative Decoding Models Learn
This paper identifies 'attention drift' in autoregressive speculative decoding models, where drafters' attention shifts from the prompt to their own generated tokens. The authors propose architectural changes, such as post-norm and RMSNorm, which improve acceptance rates and robustness across various benchmarks.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41x speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
This paper identifies a new vulnerability in model-based speculative decoding for large language models, where small perturbations can reduce draft token acceptance without affecting output quality, collapsing acceleration. The authors propose Mistletoe, an attack that jointly optimizes degradation and semantic preservation, demonstrating significant speedup reduction across various systems.
Speculative Decoding Across Languages
This paper compares three strategies to improve speculative decoding efficiency for non-English languages, finding that task-specific distillation improves acceptance rates but generalizes poorly, while n-gram draft models offer consistent speed-ups despite lower acceptance rates.