WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
Summary
This paper introduces WorldReasonBench and WorldRewardBench, two new benchmarks designed to evaluate video generation models' ability to reason about world-state evolution and physical consistency. The study highlights a gap in current commercial video generators between visual plausibility and genuine logical reasoning.
Source: https://huggingface.co/papers/2605.10434
Abstract
WorldReasonBench and WorldRewardBench are introduced as benchmarks for evaluating video generation models’ ability to reason about world-state evolution, revealing a gap between visual plausibility and true world reasoning capabilities.
Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation.
Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
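To make the two reward-model protocols concrete, the sketch below illustrates how pair-wise and point-wise evaluation are typically computed: pair-wise accuracy checks whether a reward model agrees with expert preference labels, and a rank correlation checks whether scalar rewards track expert quality scores. This is a minimal illustration of the standard protocols, not the benchmark's actual toolkit; all function names, the data schema, and the toy scores are assumptions.

```python
# Hypothetical sketch of pair-wise and point-wise reward-model evaluation,
# as described for WorldRewardBench. Schema and data are illustrative only.

def pairwise_accuracy(pairs, reward):
    """Fraction of annotated pairs where the reward model ranks the
    human-preferred video higher.

    pairs: list of (video_a, video_b, preferred), preferred in {"a", "b"}
    reward: callable mapping a video id to a scalar score
    """
    correct = 0
    for a, b, preferred in pairs:
        model_pick = "a" if reward(a) > reward(b) else "b"
        correct += model_pick == preferred
    return correct / len(pairs)

def spearman(xs, ys):
    """Spearman rank correlation for point-wise evaluation (no ties):
    how well scalar rewards track expert quality scores."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for both rank vectors
    return cov / var

# Toy example with made-up reward scores and preference labels.
scores = {"v1": 0.9, "v2": 0.4, "v3": 0.7}
pairs = [("v1", "v2", "a"), ("v3", "v2", "a"), ("v1", "v3", "b")]
acc = pairwise_accuracy(pairs, scores.get)  # 2 of 3 pairs agree
```

In practice the point-wise protocol would correlate rewards against the benchmark's multi-dimensional quality scores, and the pair-wise protocol against the ~6K expert-annotated preference pairs.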
View arXiv page (https://arxiv.org/abs/2605.10434) View PDF (https://arxiv.org/pdf/2605.10434) Project page (https://unix-ai-lab.github.io/WorldReasonBench/) GitHub (https://github.com/UniX-AI-Lab/WorldReasonBench)
Get this paper in your agent:
hf papers read 2605.10434
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Related Papers
OSCBench: Benchmarking Object State Changes in Text-to-Video Generation
OSCBench is a new benchmark for evaluating whether text-to-video generation models can accurately depict object state changes (transformations caused by actions such as peeling or slicing). The paper shows that current T2V models struggle with temporally consistent state changes, particularly in novel and compositional scenarios, identifying this as a key bottleneck for video generation.
CreativityBench: Evaluating Creative Reasoning in Agents via Affordance-Based Tool Repurposing
This paper introduces CreativityBench, a benchmark for evaluating large language models' ability to creatively repurpose tools through affordance-based reasoning. It highlights that although current models excel at general reasoning, they still struggle with creative problem solving.
VEFX-Bench: A Comprehensive Benchmark for General Video Editing and Visual Effects
VEFX-Bench introduces a large-scale, human-annotated video editing dataset (5,049 samples) with multi-dimensional quality labels, together with a reward model dedicated to standardized evaluation of video editing systems. Addressing the lack of comprehensive benchmarks for AI-assisted video creation, the paper provides VEFX-Dataset, VEFX-Reward, and a benchmark of 300 video-prompt pairs, revealing gaps in current editing models.
CollabVR: Collaborative Video Reasoning with Vision-Language Models and Video Generation Models
CollabVR is a research paper proposing a closed-loop framework that collaboratively integrates vision-language models with video generation models to improve visual reasoning and correct reasoning failures in real time.
Beyond Static Benchmarks: Synthesizing Harmful Content via Persona Simulation for Robust Evaluation
Source: https://arxiv.org/html/2604.17020 — Huije Lee, Jisu Shin, Hoyun Song, Changgeon Ko, Jong C. Park, Korea Advanced Institute of Science and Technology (KAIST). Abstract: Static benchmarks for harmful-content detection are limited in scalability and diversity, and may be...