Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Summary
Causal Forcing++ presents a novel causal consistency distillation method for frame-wise autoregressive video generation, achieving state-of-the-art quality with reduced latency and training cost.
Cached at: 05/15/26, 04:23 AM
Source: https://huggingface.co/papers/2605.15141
Abstract
A novel causal consistency distillation method enables efficient frame-wise video generation with reduced latency and improved quality compared to existing chunk-wise approaches.
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by about 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM
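The supervision pattern described in the abstract (one online teacher ODE step between adjacent timesteps, rather than a stored full trajectory) can be illustrated with a toy consistency-distillation loss. This is only a sketch under strong assumptions and is not the paper's implementation: the names `teacher_velocity`, `teacher_ode_step`, `student_flow_map`, and `causal_cd_loss`, the scalar student parameter `w`, the linear noising path, and the simplification that the clean frame equals the context frame are all hypothetical choices made for illustration.

```python
import numpy as np

def teacher_velocity(x_t, t, context):
    """Hypothetical stand-in for a teacher's AR-conditional velocity field.

    Toy assumption: the clean frame x0 equals the context (previous) frame and
    the path is x_t = (1 - t) * x0 + t * noise, so dx/dt = (x_t - x0) / t.
    """
    return (x_t - context) / t

def teacher_ode_step(x_t, t, t_prev, context):
    """One Euler step of the teacher ODE from timestep t down to t_prev."""
    return x_t + (t_prev - t) * teacher_velocity(x_t, t, context)

def student_flow_map(x_t, t, context, w):
    """Hypothetical one-parameter student that maps (x_t, t) to a clean frame.

    The timestep t is unused in this toy; a real student would condition on it.
    """
    return (1.0 - w) * x_t + w * context

def causal_cd_loss(x_t, t, t_prev, context, w, w_target):
    """Consistency loss between adjacent timesteps on one teacher trajectory.

    The target uses a separate parameter copy (w_target) and is treated as a
    constant, mimicking a stop-gradient / EMA target network.
    """
    x_prev = teacher_ode_step(x_t, t, t_prev, context)       # single online step
    pred = student_flow_map(x_t, t, context, w)               # student at t
    target = student_flow_map(x_prev, t_prev, context, w_target)  # student at t_prev
    return float(np.mean((pred - target) ** 2))

# Toy usage: one noisy "frame" conditioned on the previous frame.
rng = np.random.default_rng(0)
context = rng.standard_normal((4, 4))   # previous frame (toy clean target)
noise = rng.standard_normal((4, 4))
t, t_prev = 0.8, 0.6
x_t = (1.0 - t) * context + t * noise

loss_consistent = causal_cd_loss(x_t, t, t_prev, context, w=1.0, w_target=1.0)
loss_inconsistent = causal_cd_loss(x_t, t, t_prev, context, w=0.5, w_target=0.5)
```

In this toy, a student with `w = 1.0` returns the same clean frame at both timesteps and so incurs zero loss, while `w = 0.5` gives different outputs at `t` and `t_prev` and is penalized. Only the pair of adjacent states is needed at each training step, which is the property the abstract contrasts with precomputing and storing full PF-ODE trajectories.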
Similar Articles
One-Forcing: Towards Stable One-Step Autoregressive Video Generation
One-Forcing improves one-step video generation by augmenting the DMD objective with an auxiliary GAN loss, achieving state-of-the-art performance with reduced training costs.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine is a new academic framework for real-time, interactive multi-shot video generation that uses causal modeling and dynamic memory routing to improve cross-shot coherence in autoregressive models.
Streaming Video Generation with Streaming Force Control
StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture, achieving state-of-the-art performance in force adherence and motion realism.
On-Policy Adversarial Flow Distillation for Autoregressive Video Generation
Proposes Adversarial Flow Distillation (AFD) for distilling heterogeneous black-box video generation models into autoregressive students, using on-policy feedback and forward-process flow-matching updates.
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
This paper introduces Forcing-KV, a hybrid KV cache compression strategy for autoregressive video diffusion models that separates attention heads into static and dynamic categories, achieving up to 2.82x speedup at 1080P resolution while maintaining output quality.