Speculative Decoding for Autoregressive Video Generation

Hugging Face Daily Papers

Summary

SDVG adapts speculative decoding to autoregressive video diffusion, using an image-quality router to achieve up to 2.09× speed-up with 95.7% quality retention on MovieGenVideoBench.

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation--taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold tau are accepted into the 14B target's KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and tau serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with tau=-0.7, and reaches 2.09x at 95.7% quality retention--while consistently outperforming draft-only generation by over +17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
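
The worst-frame aggregation described above can be illustrated with a minimal sketch that compares minimum and mean aggregation over per-frame rewards. The reward values are made up for illustration, and `worst_frame_score`, `mean_score`, and `accept_block` are hypothetical names rather than the authors' API; only the threshold tau = -0.7 is taken from the abstract.

```python
from statistics import mean

TAU = -0.7  # threshold reported in the abstract for the 1.59x operating point


def worst_frame_score(frame_rewards):
    """Aggregate per-frame rewards by taking the minimum (the worst frame)."""
    return min(frame_rewards)


def mean_score(frame_rewards):
    """Aggregate per-frame rewards by averaging (shown for comparison only)."""
    return mean(frame_rewards)


def accept_block(frame_rewards, tau=TAU):
    """Accept a drafted block only if its worst frame scores above tau."""
    return worst_frame_score(frame_rewards) > tau


# Illustrative (made-up) per-frame rewards: one frame has a severe artifact.
rewards = [0.4, 0.5, -1.8, 0.3]

print(mean_score(rewards) > TAU)   # True: the average hides the bad frame
print(worst_frame_score(rewards))  # -1.8: the minimum exposes it
print(accept_block(rewards))       # False: the block goes back to the target
```

In this toy case a mean-based router would accept the block, while the minimum-based router rejects it, which is the behaviour the abstract attributes to worst-frame aggregation.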

Cached at: 04/22/26, 06:17 AM

Source: https://huggingface.co/papers/2604.17397

Abstract

Speculative decoding is adapted to autoregressive video diffusion through a quality-based routing mechanism that maintains high visual quality while achieving significant speedup.

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation--taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold tau are accepted into the 14B target’s KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and tau serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with tau=-0.7, and reaches 2.09x at 95.7% quality retention--while consistently outperforming draft-only generation by over +17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
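
The routing procedure in the abstract can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the authors' implementation: `route_blocks`, `RouterStats`, and the three `toy_*` functions are hypothetical, a block is reduced to a list of per-frame rewards, a plain Python list stands in for the target's KV cache, and the sketch assumes the drafter is skipped entirely for the force-rejected first block.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Block = List[float]  # toy stand-in for a decoded block: one reward per frame


@dataclass
class RouterStats:
    accepted: int = 0
    rejected: int = 0
    sources: List[str] = field(default_factory=list)


def route_blocks(
    num_blocks: int,
    draft_block: Callable[[List[Block]], Block],
    target_block: Callable[[List[Block]], Block],
    score_frames: Callable[[Block], List[float]],
    tau: float,
) -> Tuple[List[Block], RouterStats]:
    """Route each block to the drafter or the target via a worst-frame score."""
    context: List[Block] = []  # toy stand-in for the target's KV cache
    stats = RouterStats()
    for index in range(num_blocks):
        # Assumption: block 0 is force-rejected, so no draft is produced for it.
        candidate = draft_block(context) if index > 0 else None
        if candidate is not None and min(score_frames(candidate)) > tau:
            context.append(candidate)  # draft accepted into the context
            stats.accepted += 1
            stats.sources.append("draft")
        else:
            context.append(target_block(context))  # regenerated by the target
            stats.rejected += 1
            stats.sources.append("target")
    return context, stats


def toy_drafter(context: List[Block]) -> Block:
    # Hypothetical drafter: every third position yields one bad frame.
    if len(context) % 3 == 2:
        return [0.2, 0.1, -1.5]
    return [0.2, 0.1, 0.0]


def toy_target(context: List[Block]) -> Block:
    # Hypothetical target: always produces uniformly good frames.
    return [0.3, 0.3, 0.3]


def toy_scorer(block: Block) -> List[float]:
    # A real router would VAE-decode the block and score each frame with
    # ImageReward; here the block already holds its per-frame rewards.
    return list(block)


video, stats = route_blocks(6, toy_drafter, toy_target, toy_scorer, tau=-0.7)
print(stats.sources)  # ['target', 'draft', 'target', 'draft', 'draft', 'target']
```

Raising `tau` in this sketch sends more blocks to the target, trading speed for quality, which mirrors the abstract's description of tau as a single quality-speed knob.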

Similar Articles

Long Video Generation (4 minute read)

TLDR AI

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.

What is Speculative Decoding? (trending on paperswithco.de) [R]

Reddit r/MachineLearning

Speculative decoding is an inference optimization technique that uses a fast draft model to propose future tokens verified in parallel by a larger model, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.

Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks

arXiv cs.AI

Introduces Speculative Refinement (SpecRef), a training-free hybrid decoding strategy that warm-starts a masked diffusion language model from an autoregressive draft using entropy-guided selective masking. Evaluated across six benchmarks, it reveals that code benchmarks conflate structural discovery with logical correctness, identifies a refinement tension phenomenon, and shows that evaluation protocols can produce different model rankings.