Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Hugging Face Daily Papers

Summary

Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting, using a parallel backbone and lightweight causal refinement head to achieve up to 5.49× end-to-end speedup on Qwen3 models.

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
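The draft-then-verify step described above can be illustrated with greedy verification, one common variant of speculative decoding. This is a generic sketch, not Domino's implementation: `verify_greedy` and `toy_target` are hypothetical names, and the target model is reduced to a function that returns its greedy next token. A real system scores all draft positions in one parallel forward pass; the loop below queries the target position by position only for readability.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens committed by one draft-and-verify step.

    Draft tokens are accepted while they match the target's greedy choice.
    At the first mismatch the target's own token is committed instead, and
    the remaining draft tokens are discarded. If every draft token matches,
    the target contributes one extra token.
    """
    context = list(prefix)
    committed: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            committed.append(expected)  # the target's correction ends the step
            return committed
        committed.append(token)
        context.append(token)
    committed.append(target_next(context))  # extra token after full acceptance
    return committed


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


# Two of four draft tokens match, so three tokens are committed in one step.
partial = verify_greedy([1, 2, 3], [4, 5, 9, 0], toy_target)
# All four draft tokens match, so five tokens are committed in one step.
full = verify_greedy([1, 2, 3], [4, 5, 6, 7], toy_target)
```

The number of tokens committed per target call is what drives the speedup, which is why draft quality and drafting cost pull against each other.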

Source: https://huggingface.co/papers/2605.29707 (published on May 28)

Submitted by 黄佳诺 (https://huggingface.co/Huang2020) on Jun 2

Abstract

Domino is a speculative decoding framework that improves LLM inference speed by decoupling causal dependency modeling from autoregressive drafting through a parallel backbone and lightweight causal refinement head, achieving significant speedups in both end-to-end execution and throughput.

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
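The curriculum mentioned in the abstract, which first strengthens the backbone and then shifts optimization toward the corrected distribution, could be expressed as a loss weight that changes over training. The abstract does not give the schedule, so everything below is an assumption made for illustration: the linear ramp, the convex combination of the two losses, the step counts, and the names `final_loss_weight` and `combined_loss`.

```python
def final_loss_weight(step: int, warmup_steps: int, ramp_steps: int) -> float:
    """Weight on the loss of the causally corrected (final) distribution.

    Assumed schedule: 0 during a backbone-only warmup, then a linear ramp
    up to 1 over `ramp_steps` training steps.
    """
    if step < warmup_steps:
        return 0.0
    if ramp_steps <= 0:
        return 1.0
    return min(1.0, (step - warmup_steps) / ramp_steps)


def combined_loss(
    base_loss: float,
    final_loss: float,
    step: int,
    warmup_steps: int = 1000,  # hypothetical value
    ramp_steps: int = 4000,    # hypothetical value
) -> float:
    """Blend the backbone loss and the final loss (assumed convex mix)."""
    w = final_loss_weight(step, warmup_steps, ramp_steps)
    return (1.0 - w) * base_loss + w * final_loss


# Early in training only the backbone loss counts; later the final loss dominates.
early = combined_loss(base_loss=2.0, final_loss=4.0, step=0)
middle = combined_loss(base_loss=2.0, final_loss=4.0, step=3000)
late = combined_loss(base_loss=2.0, final_loss=4.0, step=10000)
```

Whether the actual method keeps a nonzero weight on the backbone loss throughout training is not stated in the abstract; the authors' repository is the place to check.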


Community

Comment from the paper author and submitter:

Domino is a speculative decoding method that improves parallel drafting by adding lightweight causal correction. It aims to retain the efficiency of block-parallel draft generation while recovering part of the causal dependency modeling lost in fully parallel draft models. Code and models are available at: https://github.com/jianuo-huang/Domino
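A toy sketch may help make "parallel drafting plus lightweight causal correction" concrete. It is not the Domino head, whose architecture this page does not describe. It assumes that the backbone has already produced per-position scores for the whole block in one pass, and it replaces the learned head with a hypothetical lookup table `correction` that adjusts each position's scores given the previously drafted token. `draft_block` and `argmax` are illustrative names.

```python
from typing import Dict, List


def argmax(scores: List[float]) -> int:
    """Index of the largest score (first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def draft_block(
    base_logits: List[List[float]],
    prev_token: int,
    correction: Dict[int, List[float]],
) -> List[int]:
    """Refine parallel draft scores with a cheap prefix-dependent adjustment.

    base_logits: one list of vocabulary scores per block position, assumed
        to come from a single parallel backbone pass.
    prev_token: the last token before the block.
    correction: hypothetical table mapping the previous token to additive
        score adjustments; a stand-in for a learned lightweight head.
    """
    vocab_size = len(base_logits[0])
    no_adjustment = [0.0] * vocab_size
    tokens: List[int] = []
    for scores in base_logits:
        adjust = correction.get(prev_token, no_adjustment)
        refined = [s + a for s, a in zip(scores, adjust)]
        prev_token = argmax(refined)  # the chosen token conditions the next position
        tokens.append(prev_token)
    return tokens


# Three-token vocabulary, block of two positions.
base = [[2.0, 1.0, 0.0], [1.0, 0.9, 0.0]]
# Hypothetical rule: after token 0, strongly discourage emitting 0 again.
table = {0: [-5.0, 0.0, 0.0]}

parallel_only = draft_block(base, prev_token=2, correction={})
with_correction = draft_block(base, prev_token=2, correction=table)
```

Without the table, both positions pick token 0 independently; with it, the second position switches to token 1 because it sees what the first position chose. The sequential part is only a table lookup and an addition per position, which mirrors the stated goal of keeping causal modeling out of the expensive backbone pass.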



Models citing this paper: 2

#### Huang2020/Qwen3-4B-Domino-b16

Text Generation • 0.6B • Updated about 23 hours ago • 134 • 1

#### Huang2020/Qwen3-8B-Domino-b16

Text Generation • 1B • Updated about 23 hours ago • 174 • 1

