Next Forcing: Causal World Modeling with Multi-Chunk Prediction
Summary
Next Forcing introduces a multi-chunk prediction framework for causal world modeling that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.
Cached at: 06/10/26, 01:44 PM
Source: https://huggingface.co/papers/2606.11187
Abstract
Next Forcing introduces a multi-chunk prediction framework that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.
Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next^1, next^2, next^3 chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.
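To make the structure described in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function names (`mcp_targets`, `causal_chain`, `total_loss`, `sequential_passes`), the scalar stand-ins for features, and the loss weight are all hypothetical, and the abstract does not say how the per-depth losses are combined. The sketch only illustrates three ideas: which future chunks each depth is supervised on, how each depth consumes the previous depth's output, and why predicting two chunks per pass ideally halves the number of sequential passes.

```python
from math import ceil
from typing import Callable, Dict, List


def mcp_targets(num_chunks: int, depths: int = 3) -> Dict[int, List[int]]:
    """For each current chunk index t, list the future chunk indices
    t+1 ... t+depths (next^1, next^2, next^3) that auxiliary modules
    would be supervised on. Targets past the end of the clip are dropped."""
    return {
        t: [t + k for k in range(1, depths + 1) if t + k < num_chunks]
        for t in range(num_chunks)
    }


def causal_chain(fused_feature: float,
                 modules: List[Callable[[float], float]]) -> List[float]:
    """Run the auxiliary modules in depth order. Each depth takes the
    previous depth's output, so near-future predictions inform
    farther-future ones. Scalars stand in for feature tensors."""
    outputs, state = [], fused_feature
    for module in modules:
        state = module(state)
        outputs.append(state)
    return outputs


def total_loss(main_loss: float, mcp_losses: List[float],
               weight: float = 0.5) -> float:
    """Main denoising loss plus a weighted sum of per-depth losses.
    The weighted-sum form and the weight value are assumptions."""
    return main_loss + weight * sum(mcp_losses)


def sequential_passes(num_chunks: int, chunks_per_pass: int = 1) -> int:
    """Idealized count of sequential generation passes for a clip."""
    return ceil(num_chunks / chunks_per_pass)


if __name__ == "__main__":
    print(mcp_targets(5))
    print(causal_chain(1.0, [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]))
    print(total_loss(1.0, [0.5, 0.5]))
    print(sequential_passes(8, 1), sequential_passes(8, 2))
```

In this idealized count, emitting the current and next chunk together takes 4 passes instead of 8 for an 8-chunk clip, which matches the 2x figure in the abstract only under the assumption that the auxiliary module adds negligible cost per pass.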
Similar Articles
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Causal Forcing++ presents a novel causal consistency distillation method for frame-wise autoregressive video generation, achieving state-of-the-art quality with reduced latency and training cost.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine is a new academic framework for real-time, interactive multi-shot video generation that uses causal modeling and dynamic memory routing to improve cross-shot coherence in autoregressive models.
One-Forcing: Towards Stable One-Step Autoregressive Video Generation
One-Forcing improves one-step video generation by augmenting the DMD objective with an auxiliary GAN loss, achieving state-of-the-art performance with reduced training costs.
YoCausal: How Far is Video Generation from World Model? A Causality Perspective
This paper introduces YoCausal, a benchmark based on the Violation of Expectation paradigm from cognitive science, to evaluate whether video diffusion models truly understand causality or merely overfit to temporal patterns. Evaluation of 13 state-of-the-art models reveals a significant gap compared to human-level causal cognition.
Streaming Video Generation with Streaming Force Control
StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture, achieving state-of-the-art performance in force adherence and motion realism.