YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Summary
This paper introduces YoCausal, a benchmark based on the Violation of Expectation paradigm from cognitive science, to evaluate whether video diffusion models truly understand causality or merely overfit to temporal patterns. Evaluation of 13 state-of-the-art models reveals a significant gap compared to human-level causal cognition.
Cached at: 05/29/26, 07:01 AM
Source: https://huggingface.co/papers/2605.30346
Abstract
Video diffusion models exhibit arrow-of-time perception without true causal understanding, as demonstrated by a novel benchmark measuring causal cognition through reverse surprise and visual language model analysis.
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
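The two-level idea in the abstract can be illustrated with a small sketch. This page does not give the paper's actual RSI or CCI formulas, so everything below is an assumption made for illustration: `reverse_surprise` uses a normalized loss difference between a reversed and a forward clip, `causality_gap` uses a plain difference between the causal and non-causal subset means, and `toy_loss` is a hypothetical stand-in for a real VDM's denoising loss. In the benchmark, the causal and non-causal subsets would come from a VLM; here they are hand-labeled.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A video is a list of frames; each frame is a flat list of pixel values.
Video = Sequence[Sequence[float]]
LossFn = Callable[[Video], float]


def reverse_video(video: Video) -> List[Sequence[float]]:
    """Temporal reversal, used as the zero-cost counterfactual sample."""
    return list(reversed(video))


def reverse_surprise(loss_fn: LossFn, video: Video) -> float:
    """Hypothetical per-video score (NOT the paper's RSI formula).

    Positive when the model finds the reversed clip harder to explain
    than the forward clip; zero when it cannot tell them apart.
    """
    forward = loss_fn(video)
    backward = loss_fn(reverse_video(video))
    return (backward - forward) / (backward + forward + 1e-12)


def causality_gap(loss_fn: LossFn,
                  causal: Sequence[Video],
                  non_causal: Sequence[Video]) -> float:
    """Hypothetical subset comparison (NOT the paper's CCI formula)."""
    causal_score = mean(reverse_surprise(loss_fn, v) for v in causal)
    baseline = mean(reverse_surprise(loss_fn, v) for v in non_causal)
    return causal_score - baseline


def toy_loss(video: Video) -> float:
    """Stand-in for a denoising loss: a 'model' that expects every pixel
    to brighten by exactly 1.0 per frame, scored by mean squared error."""
    errors = [
        (next_px - (cur_px + 1.0)) ** 2
        for cur, nxt in zip(video, video[1:])
        for cur_px, next_px in zip(cur, nxt)
    ]
    return mean(errors)


rising = [[0.0], [1.0], [2.0], [3.0]]   # has a clear temporal direction
static = [[5.0], [5.0], [5.0]]          # looks the same in both directions

print(reverse_surprise(toy_loss, rising))
print(reverse_surprise(toy_loss, static))
print(causality_gap(toy_loss, [rising], [static]))
```

With a real model, `loss_fn` would wrap the model's denoising objective evaluated on the clip; the toy loss only serves to show that a direction-sensitive loss yields a positive score on reversible-looking versus irreversible clips.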
Datasets citing this paper: 1
YouZhe/YoCausal-dataset
Similar Articles
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine is a new academic framework for real-time, interactive multi-shot video generation that uses causal modeling and dynamic memory routing to improve cross-shot coherence in autoregressive models.
Next Forcing: Causal World Modeling with Multi-Chunk Prediction
Next Forcing introduces a multi-chunk prediction framework for causal world modeling that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
CRONOS is a benchmark that evaluates counterfactual physical consistency in video prediction models by intervening on viewpoint, scene, object category, and appearance while keeping physical event types fixed. It reveals substantial failures in current video generators.
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Causal Forcing++ presents a novel causal consistency distillation method for frame-wise autoregressive video generation, achieving state-of-the-art quality with reduced latency and training cost.
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
This paper introduces WorldReasonBench and WorldRewardBench, new benchmarks designed to evaluate video generation models' ability to reason about world-state evolution and physical consistency. The research highlights a gap between visual plausibility and true logical reasoning in current commercial video generators.