Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models
Summary
This paper introduces Causal-rCM, a unified teacher-forcing and self-forcing framework for autoregressive diffusion distillation in streaming video generation and interactive world models, achieving state-of-the-art performance with fast convergence.
Cached at: 06/25/26, 05:17 AM
Source: https://huggingface.co/papers/2606.25473
Abstract
Autoregressive video diffusion extends diffusion distillation frameworks to real-time streaming generation through causal training paradigms, achieving state-of-the-art performance with fast convergence and interactive world modeling capabilities.
Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM lies in the complementarity between forward and reverse divergences, represented by consistency models (CMs) and distribution matching distillation (DMD), respectively, in diffusion distillation. This philosophy naturally carries over to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10 times faster convergence compared to discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.
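As a rough illustration of the frame-wise and chunk-wise causal settings mentioned in the abstract, the following is a minimal sketch of a block-causal attention mask in NumPy. The function name `block_causal_mask` and its parameters are hypothetical and are not taken from the paper; the actual teacher-forcing mask and the custom-mask FlashAttention-2 JVP kernel used by Causal-rCM may be structured differently.

```python
import numpy as np


def block_causal_mask(num_frames: int, tokens_per_frame: int, chunk_size: int) -> np.ndarray:
    """Return a boolean mask where entry [i, j] is True if query token i may attend to key token j.

    Hypothetical helper for illustration only. Tokens are laid out frame by frame.
    Frames are grouped into chunks of `chunk_size` frames; a token attends to every
    token in its own chunk and in all earlier chunks, never to later chunks.
    chunk_size=1 gives the frame-wise setting; chunk_size>1 gives the chunk-wise one.
    """
    total_tokens = num_frames * tokens_per_frame
    # Frame index of each token, then the chunk index of that frame.
    frame_idx = np.arange(total_tokens) // tokens_per_frame
    chunk_idx = frame_idx // chunk_size
    # Broadcasting compares every query chunk with every key chunk.
    return chunk_idx[:, None] >= chunk_idx[None, :]


# Frame-wise: 3 frames, 2 tokens each, every frame is its own chunk.
frame_wise = block_causal_mask(num_frames=3, tokens_per_frame=2, chunk_size=1)

# Chunk-wise: 4 frames, 1 token each, 2 frames per chunk.
chunk_wise = block_causal_mask(num_frames=4, tokens_per_frame=1, chunk_size=2)
```

In the frame-wise mask, tokens within one frame see each other and all earlier frames; in the chunk-wise mask, the two frames of a chunk see each other bidirectionally but not the next chunk. A dense boolean mask like this is only practical for small examples, which is presumably one reason a custom-mask attention kernel is needed at video scale.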
Similar Articles
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Causal Forcing++ presents a novel causal consistency distillation method for frame-wise autoregressive video generation, achieving state-of-the-art quality with reduced latency and training cost.
A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency
A^2RD is a new paper introducing an Agentic Autoregressive Diffusion architecture for long video synthesis, achieving improved consistency and narrative coherence through a closed-loop self-improvement process.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
On-Policy Adversarial Flow Distillation for Autoregressive Video Generation
Proposes Adversarial Flow Distillation (AFD) for distilling heterogeneous black-box video generation models into autoregressive students, using on-policy feedback and forward-process flow-matching updates.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine is a new academic framework for real-time, interactive multi-shot video generation that uses causal modeling and dynamic memory routing to improve cross-shot coherence in autoregressive models.