WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
Summary
WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.
Source: https://huggingface.co/papers/2605.25874
Abstract
WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.
Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
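The case structure described in the abstract (one world setting plus a multi-turn interaction sequence) can be pictured with a minimal Python sketch. This is an illustration only, not the WBench data format: the class names, field names, and the example case are all hypothetical, and only the enum members and the counts of 289 cases and 1,058 turns come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class InteractionType(Enum):
    # The four interaction types named in the abstract.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class NavigationControl(Enum):
    # The three navigation interfaces the abstract says are unified.
    TEXT = "text"
    POSE_6DOF = "pose_6dof"
    DISCRETE_ACTION = "discrete_action"


@dataclass
class InteractionTurn:
    # Hypothetical fields: the real schema may differ.
    kind: InteractionType
    instruction: str


@dataclass
class BenchmarkCase:
    # Hypothetical fields: the real schema may differ.
    world_setting: str
    perspective: str  # "first_person" or "third_person"
    turns: List[InteractionTurn] = field(default_factory=list)


def total_turns(cases: List[BenchmarkCase]) -> int:
    """Count interaction turns across a list of cases."""
    return sum(len(case.turns) for case in cases)


# An invented example case with two turns.
example = BenchmarkCase(
    world_setting="A rainy city street at night",
    perspective="first_person",
    turns=[
        InteractionTurn(InteractionType.NAVIGATION, "Walk forward to the crossing"),
        InteractionTurn(InteractionType.EVENT_EDITING, "The rain stops"),
    ],
)

# Average turns per case implied by the reported totals.
average_turns_per_case = 1058 / 289
```

With the reported totals, the average works out to roughly 3.66 interaction turns per case.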
Similar Articles
@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …
LongCat released WBench, a benchmark for video world models that tests control, memory, instruction-following, and physical plausibility across 289 cases and 20 models, finding that no model excels in all dimensions, highlighting the gap between video quality and true world simulation.
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
This paper introduces WorldReasonBench and WorldRewardBench, new benchmarks designed to evaluate video generation models' ability to reason about world-state evolution and physical consistency. The research highlights a gap between visual plausibility and true logical reasoning in current commercial video generators.
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show top models like Qwen 3.5 Plus achieve only 41% success, far below the 94% human benchmark, highlighting a significant gap.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM is a full-stack open-source framework that converts bidirectional video diffusion models into real-time interactive video world models with controllable camera, low-latency rollout, and modular architecture.