YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Hugging Face Daily Papers

Summary

This paper introduces YoCausal, a benchmark based on the Violation of Expectation paradigm from cognitive science, to evaluate whether video diffusion models truly understand causality or merely overfit to temporal patterns. Evaluation of 13 state-of-the-art models reveals a significant gap compared to human-level causal cognition.

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
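To make the Level 1 idea concrete, the following is a minimal sketch of comparing denoising loss on a clip and on its temporal reversal. It is not the paper's implementation: the abstract does not give the RSI formula, so the score here (`surprise_gap`, reversed loss minus forward loss) is a stand-in, and `toy_denoiser`, `sigma`, and `n_draws` are all hypothetical names chosen for illustration. A real evaluation would call an actual video diffusion model in place of the toy denoiser.

```python
import numpy as np


def reverse_video(video):
    """Flip a (T, H, W, C) clip along the time axis (the zero-cost counterfactual)."""
    return video[::-1].copy()


def denoising_loss(denoiser, video, sigma, rng):
    """MSE between the injected Gaussian noise and the denoiser's noise prediction."""
    noise = rng.standard_normal(video.shape)
    noisy = video + sigma * noise
    pred = denoiser(noisy, sigma)
    return float(np.mean((pred - noise) ** 2))


def surprise_gap(denoiser, video, sigma=0.5, n_draws=8, seed=0):
    """Illustrative score (an assumption, not the paper's RSI):
    mean loss on the reversed clip minus mean loss on the forward clip.
    A positive value means the model is more 'surprised' by reversed playback."""
    rng = np.random.default_rng(seed)
    reversed_clip = reverse_video(video)
    fwd = np.mean([denoising_loss(denoiser, video, sigma, rng) for _ in range(n_draws)])
    rev = np.mean([denoising_loss(denoiser, reversed_clip, sigma, rng) for _ in range(n_draws)])
    return float(rev - fwd)


# Hypothetical toy setup: a clip whose brightness rises over time, and a
# "denoiser" whose only prior is that brightness rises over time.
T, H, W, C = 8, 4, 4, 1
ramp = np.linspace(0.0, 1.0, T)[:, None, None, None] * np.ones((T, H, W, C))


def toy_denoiser(noisy, sigma):
    """Predict the noise as whatever deviates from the rising-brightness prior."""
    return (noisy - ramp) / sigma


gap = surprise_gap(toy_denoiser, ramp)
print(f"surprise gap (reversed - forward): {gap:.3f}")
```

With this toy prior the forward clip is denoised almost perfectly and the reversed clip is not, so the gap is positive. Under the same assumptions, a Level 2 style analysis would compute such a score separately on the causal and non-causal subsets and compare the two.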

Cached at: 05/29/26, 07:01 AM


Source: https://huggingface.co/papers/2605.30346

Abstract

Video diffusion models exhibit arrow-of-time perception without true causal understanding, as demonstrated by a novel benchmark measuring causal cognition through reverse surprise and visual language model analysis.





Datasets citing this paper: 1

#### YouZhe/YoCausal-dataset


Similar Articles

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Hugging Face Daily Papers

CRONOS is a benchmark that evaluates counterfactual physical consistency in video prediction models by intervening on viewpoint, scene, object category, and appearance while keeping physical event types fixed. It reveals substantial failures in current video generators.