Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

Summary

This paper introduces Causal-rCM, a unified teacher-forcing and self-forcing framework for autoregressive diffusion distillation in streaming video generation and interactive world models, achieving state-of-the-art performance with fast convergence.

Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM lies in the complementarity between forward and reverse divergences, represented by consistency models (CMs) and distribution matching distillation (DMD), respectively, in diffusion distillation. This philosophy naturally carries over to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10× faster convergence compared to discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; and (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.
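
The difference between the teacher-forcing and self-forcing regimes described above can be illustrated with a toy rollout. This is a minimal sketch and not the paper's training code: `rollout`, `toy_generator`, and the use of scalar "frames" are hypothetical stand-ins chosen for illustration. In a real setup the generator would presumably be a few-step causal diffusion model. The sketch only shows which history each step conditions on.

```python
from typing import Callable, List

Frame = float  # toy stand-in for a video frame or chunk latent (assumption)


def rollout(
    generate_next: Callable[[List[Frame]], Frame],
    ground_truth: List[Frame],
    mode: str,
) -> List[Frame]:
    """Generate len(ground_truth) frames, one per autoregressive step.

    mode="teacher": each step conditions on the ground-truth history (offline).
    mode="self":    each step conditions on the model's own earlier outputs (on-policy).
    """
    if mode not in ("teacher", "self"):
        raise ValueError(f"unknown mode: {mode!r}")
    outputs: List[Frame] = []
    for i in range(len(ground_truth)):
        history = ground_truth[:i] if mode == "teacher" else outputs[:i]
        outputs.append(generate_next(history))
    return outputs


def toy_generator(history: List[Frame]) -> Frame:
    """Hypothetical imperfect generator: previous frame plus a biased increment."""
    previous = history[-1] if history else 0.0
    return previous + 1.5  # the data below increases by 1.0 per frame


truth = [1.0, 2.0, 3.0, 4.0]
tf_frames = rollout(toy_generator, truth, mode="teacher")
sf_frames = rollout(toy_generator, truth, mode="self")

# Per-frame absolute error against the ground truth.
tf_errors = [abs(p - t) for p, t in zip(tf_frames, truth)]
sf_errors = [abs(p - t) for p, t in zip(sf_frames, truth)]
print(tf_errors)  # [0.5, 0.5, 0.5, 0.5]
print(sf_errors)  # [0.5, 1.0, 1.5, 2.0]
```

In this toy case the teacher-forced error stays constant because every step restarts from clean history, while the self-forced error grows as the model consumes its own outputs. That accumulation is the kind of train/inference mismatch an on-policy refinement stage is meant to expose.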

Cached at: 06/25/26, 05:17 AM

Source: https://huggingface.co/papers/2606.25473

Abstract

Autoregressive video diffusion extends diffusion distillation frameworks to real-time streaming generation through causal training paradigms, achieving state-of-the-art performance with fast convergence and interactive world modeling capabilities.

Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM lies in the complementarity between forward and reverse divergences, represented by consistency models (CMs) and distribution matching distillation (DMD), respectively, in diffusion distillation. This philosophy naturally carries over to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10× faster convergence compared to discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; and (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.
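
The frame-wise and chunk-wise settings mentioned above differ in how the causal attention mask groups tokens. The following is a minimal sketch, assuming the common block-causal convention in which tokens attend bidirectionally within their own chunk and causally to all earlier chunks. The function name `block_causal_mask` and its parameters are hypothetical. The mask layout used by the paper's custom-mask kernel is not given here and may differ, for example by adding separate clean-context tokens for teacher forcing.

```python
from typing import List


def block_causal_mask(
    num_frames: int, tokens_per_frame: int, frames_per_chunk: int = 1
) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens see every token in their own chunk and in all earlier chunks.
    frames_per_chunk=1 gives the frame-wise case (assumed convention).
    """
    if min(num_frames, tokens_per_frame, frames_per_chunk) < 1:
        raise ValueError("all sizes must be positive")
    total = num_frames * tokens_per_frame
    # Map each flattened token index to the index of the chunk that contains it.
    chunk_of = [(t // tokens_per_frame) // frames_per_chunk for t in range(total)]
    return [[chunk_of[k] <= chunk_of[q] for k in range(total)] for q in range(total)]


# 4 frames of 2 tokens each: 8 tokens in total.
frame_wise = block_causal_mask(num_frames=4, tokens_per_frame=2, frames_per_chunk=1)
chunk_wise = block_causal_mask(num_frames=4, tokens_per_frame=2, frames_per_chunk=2)

# Print the chunk-wise mask as a grid: "x" means attention is allowed.
for row in chunk_wise:
    print("".join("x" if allowed else "." for allowed in row))
```

With `frames_per_chunk=1`, token 0 can see token 1 (same frame) but not token 2 (next frame). With `frames_per_chunk=2`, the first four tokens form one chunk and all see each other.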
