WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Hugging Face Daily Papers

Abstract

This paper introduces WorldReasonBench and WorldRewardBench, two new benchmarks designed to evaluate video generation models' ability to reason about world-state evolution and physical consistency. The study highlights the gap between visual plausibility and genuine logical reasoning in current commercial video generators.

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit at https://github.com/UniX-AI-Lab/WorldReasonBench/ to support community research on genuinely world-aware video generation.
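The pair-wise reward-model evaluation mentioned above can be sketched minimally: a reward model maps each video to a scalar score, and accuracy is the fraction of expert-annotated pairs where the preferred video outscores the rejected one. The data schema and the toy reward function below are illustrative assumptions, not the benchmark's actual interface.

```python
# Minimal sketch of pair-wise reward-model accuracy on
# WorldRewardBench-style annotated pairs. The dict schema and the
# toy reward function are hypothetical stand-ins.

def pairwise_accuracy(pairs, reward_fn):
    """Fraction of (preferred, rejected) pairs in which the reward
    model scores the preferred video strictly higher."""
    correct = sum(
        1 for preferred, rejected in pairs
        if reward_fn(preferred) > reward_fn(rejected)
    )
    return correct / len(pairs)

# Toy reward model: reads a precomputed score from the video record
# (a stand-in for a learned model mapping a video to a scalar).
def toy_reward(video):
    return video["score"]

pairs = [
    ({"id": "a", "score": 0.9}, {"id": "b", "score": 0.4}),  # ranked correctly
    ({"id": "c", "score": 0.2}, {"id": "d", "score": 0.7}),  # ranked incorrectly
]
print(pairwise_accuracy(pairs, toy_reward))  # → 0.5
```

Point-wise evaluation would instead compare each scalar score against a per-video human rating (e.g. via correlation) rather than against a paired alternative.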


Source: https://huggingface.co/papers/2605.10434


View arXiv page (https://arxiv.org/abs/2605.10434) · View PDF (https://arxiv.org/pdf/2605.10434) · Project page (https://unix-ai-lab.github.io/WorldReasonBench/) · GitHub (https://github.com/UniX-AI-Lab/WorldReasonBench)

Get this paper in your agent:

hf papers read 2605.10434

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Papers

OSCBench: Benchmarking Object State Changes in Text-to-Video Generation

arXiv cs.CL

OSCBench is a new benchmark for evaluating whether text-to-video generation models accurately represent object state changes (transformations caused by actions such as peeling or slicing). The paper shows that current T2V models struggle with temporally consistent state changes, especially in novel and compositional scenarios, identifying this as a key bottleneck for video generation.

VEFX-Bench: A Comprehensive Benchmark for General Video Editing and Visual Effects

Hugging Face Daily Papers

VEFX-Bench introduces a large-scale, human-annotated video editing dataset (5,049 samples) with multi-dimensional quality labels, along with a reward model dedicated to standardized evaluation of video editing systems. Addressing the lack of a comprehensive benchmark for AI-assisted video creation, the paper provides VEFX-Dataset, VEFX-Reward, and a benchmark of 300 video-prompt pairs, revealing gaps in current editing models.

Beyond Static Benchmarks: Synthesizing Harmful Content via Persona Simulation for Robust Evaluation

arXiv cs.CL

Source: https://arxiv.org/html/2604.17020 — Huije Lee, Jisu Shin, Hoyun Song, Changgeon Ko, Jong C. Park (Korea Advanced Institute of Science and Technology, KAIST). Abstract: Static benchmarks for harmful-content detection are limited in scalability and diversity, and may be subject to...