WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Hugging Face Daily Papers 05/25/26, 12:00 AM Papers

benchmark interactive-world-models multi-turn video-quality consistency physics-compliance evaluation

Summary

WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.

Original Article

Cached at: 05/26/26, 06:43 AM

Source: https://huggingface.co/papers/2605.25874

Abstract

WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
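To make the benchmark's shape concrete, here is a minimal Python sketch of how one test case (a world setting plus a multi-turn interaction sequence) might be represented. The class and field names (BenchmarkCase, Turn, instruction) and the example instructions are illustrative assumptions, not the schema used in the WBench repository. Only the five dimensions, the four interaction types, and the counts of 289 cases and 1,058 turns come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class InteractionType(Enum):
    # The four interaction types named in the abstract.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject action"
    EVENT_EDITING = "event editing"
    PERSPECTIVE_SWITCHING = "perspective switching"


# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "video quality",
    "setting adherence",
    "interaction adherence",
    "consistency",
    "physics compliance",
)


@dataclass
class Turn:
    # Hypothetical fields: one interaction step within a case.
    interaction_type: InteractionType
    instruction: str


@dataclass
class BenchmarkCase:
    # Hypothetical container: a world setting plus its ordered turns.
    world_setting: str
    turns: List[Turn] = field(default_factory=list)


def mean_turns_per_case(total_turns: int, total_cases: int) -> float:
    """Average number of interaction turns per test case."""
    return total_turns / total_cases


# An invented example case, for illustration only.
example = BenchmarkCase(
    world_setting="A first-person walk through a rainy street",
    turns=[
        Turn(InteractionType.NAVIGATION, "move forward"),
        Turn(InteractionType.EVENT_EDITING, "the rain stops"),
    ],
)

# Counts reported in the abstract: 1,058 turns over 289 cases.
avg = mean_turns_per_case(1058, 289)
print(f"{avg:.2f}")  # prints 3.66
```

From the reported counts, a case averages roughly 3.7 interaction turns. That is why the benchmark can probe consistency across turns rather than scoring single-shot generations.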

Get this paper in your agent:

hf papers read 2605.25874

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …

MultiView-Bench: A Diagnostic Benchmark for World-Centric Multi-View Integration in VLMs

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark
