YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

video-diffusion-models causality benchmark world-models cognitive-science evaluation

Summary

This paper introduces YoCausal, a benchmark based on the Violation of Expectation paradigm from cognitive science, to evaluate whether video diffusion models truly understand causality or merely overfit to temporal patterns. Evaluation of 13 state-of-the-art models reveals a significant gap compared to human-level causal cognition.

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
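As a rough illustration of the Level 1 idea, the sketch below reverses a clip's frame order and compares per-clip denoising losses for the forward and reversed versions. The function names, the "reversed loss is higher" rate, and the loss values are all assumptions made for illustration. This page does not give the RSI formula, so this is not the paper's definition, and in practice the losses would come from a video diffusion model rather than a hard-coded list.

```python
def reverse_video(frames):
    """Return a new clip with the frame order flipped (index 0 is the first frame)."""
    return list(reversed(frames))


def reversed_higher_loss_rate(loss_forward, loss_reversed):
    """Fraction of clips whose reversed version receives a higher denoising loss.

    Hypothetical stand-in for an arrow-of-time score; NOT the paper's RSI.
    """
    if len(loss_forward) != len(loss_reversed):
        raise ValueError("need one forward and one reversed loss per clip")
    if not loss_forward:
        raise ValueError("need at least one clip")
    higher = sum(1 for f, r in zip(loss_forward, loss_reversed) if r > f)
    return higher / len(loss_forward)


# Toy clip: four "frames" represented by labels instead of pixel arrays.
clip = ["frame0", "frame1", "frame2", "frame3"]
reversed_clip = reverse_video(clip)

# Made-up per-clip denoising losses (assumed values, not results from the paper).
forward_losses = [0.10, 0.20, 0.30, 0.25]
reversed_losses = [0.15, 0.18, 0.40, 0.30]
rate = reversed_higher_loss_rate(forward_losses, reversed_losses)
```

A rate near 0.5 would suggest the model cannot tell forward from reversed playback, while a rate near 1.0 would suggest it finds reversed clips consistently more surprising. Reversal itself costs nothing beyond reindexing frames, which is what makes the protocol extensible to any real-world video set.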

Original Article

Cached at: 05/29/26, 07:01 AM

Paper page - YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Source: https://huggingface.co/papers/2605.30346

Abstract

Video diffusion models can perceive the arrow of time without truly understanding causality, according to a new benchmark that measures causal cognition through a reverse surprise index and vision-language model (VLM) based dataset stratification.

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
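The Level 2 stratification can be pictured with a small sketch: clips are tagged as causal or non-causal (here hard-coded, standing in for VLM output), and the reversed-minus-forward loss gap is averaged per subset. The record fields, the `mean_gap_by_subset` helper, the contrast between subsets, and all numbers are assumptions for illustration. The CCI formula is not stated on this page, so this is only one plausible reading of how stratification could separate causal sensitivity from general temporal bias.

```python
from statistics import mean


def mean_gap_by_subset(records):
    """Average (reversed loss - forward loss) per label.

    Hypothetical helper; the labels would come from a VLM in the benchmark.
    """
    gaps = {}
    for record in records:
        gap = record["loss_reversed"] - record["loss_forward"]
        gaps.setdefault(record["label"], []).append(gap)
    return {label: mean(values) for label, values in gaps.items()}


# Made-up records (assumed values, not results from the paper).
records = [
    {"label": "causal", "loss_forward": 0.10, "loss_reversed": 0.30},
    {"label": "causal", "loss_forward": 0.20, "loss_reversed": 0.40},
    {"label": "non_causal", "loss_forward": 0.10, "loss_reversed": 0.15},
    {"label": "non_causal", "loss_forward": 0.20, "loss_reversed": 0.25},
]

gaps = mean_gap_by_subset(records)
# Extra surprise on causal clips beyond what temporal bias alone produces.
contrast = gaps["causal"] - gaps["non_causal"]
```

Under this reading, a model that reacts equally to reversal in both subsets (contrast near zero) would be showing temporal bias rather than causal sensitivity, which matches the abstract's point that arrow-of-time perception does not imply causal understanding.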

Get this paper in your agent:

hf papers read 2605.30346

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

YouZhe/YoCausal-dataset

Similar Articles

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
