MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
Summary
This paper introduces MBench, a benchmark for evaluating the memory capabilities of video world models across entity, environment, and causal consistency over long temporal horizons.
Cached at: 06/15/26, 09:03 AM
Source: https://huggingface.co/papers/2606.00793
Abstract
A new benchmark called MBench is introduced to evaluate the memory capabilities of video world models, focusing on entity, environment, and causal consistency over extended temporal horizons.
Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos and is evaluated with rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.
Similar Articles
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench is a diagnostic benchmark for evaluating video generation models' memory consistency in dynamically changing environments, where objects disappear and reappear in updated states. It includes 360 ground-truth clips and an evaluation suite combining automated metrics with VQA-based assessment, revealing insights into memory consistency challenges.
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.
M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multi-modal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, interference patterns, and spatial-temporal grounding.
@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …
LongCat released WBench, a benchmark for video world models that tests control, memory, instruction-following, and physical plausibility across 289 cases and 20 models, finding that no model excels in all dimensions, highlighting the gap between video quality and true world simulation.