MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Hugging Face Daily Papers 06/08/26, 12:00 AM Papers

Summary

This paper introduces MBench, a benchmark for evaluating the memory capabilities of video world models across entity, environment, and causal consistency over long temporal horizons.

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos and is evaluated with rule-based quantitative metrics and a vision-language model (VLM) to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

Original Article

Cached at: 06/15/26, 09:03 AM

Source: https://huggingface.co/papers/2606.00793

Abstract

A new benchmark called MBench is introduced to evaluate the memory capabilities of video world models, focusing on entity, environment, and causal consistency over extended temporal horizons.

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos and is evaluated with rule-based quantitative metrics and a vision-language model (VLM) to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.
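To make the three-dimension decomposition concrete, here is a minimal Python sketch of how sub-dimension scores could be rolled up into per-dimension and overall memory scores. The abstract does not describe MBench's scoring protocol, so everything beyond the three dimension names is an assumption made for illustration: the function name `aggregate_memory_scores`, the placeholder sub-dimension labels (`sub_a`, `sub_b`, ...), the [0, 1] score range, and the unweighted averaging.

```python
from statistics import mean

# The three top-level dimensions named in the abstract.
DIMENSIONS = ("entity", "environment", "causal")


def aggregate_memory_scores(sub_scores):
    """Average sub-dimension scores per dimension, then across dimensions.

    `sub_scores` maps a dimension name to a dict of {sub_dimension: score}.
    Scores are assumed to lie in [0, 1]. Both the score range and the
    unweighted averaging are assumptions, not the paper's protocol.
    """
    per_dimension = {}
    for dim in DIMENSIONS:
        scores = sub_scores.get(dim, {})
        if not scores:
            raise ValueError(f"no sub-dimension scores for {dim!r}")
        for name, value in scores.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"score for {dim}/{name} is outside [0, 1]")
        per_dimension[dim] = mean(scores.values())
    # Each dimension counts equally, however many sub-dimensions it has.
    overall = mean(per_dimension.values())
    return per_dimension, overall


# Hypothetical scores with placeholder sub-dimension labels.
example = {
    "entity": {"sub_a": 0.8, "sub_b": 0.6},
    "environment": {"sub_c": 0.5},
    "causal": {"sub_d": 0.25, "sub_e": 0.75},
}
per_dimension, overall = aggregate_memory_scores(example)
```

Averaging within each dimension first keeps a dimension with many sub-dimensions from dominating the overall score. A real evaluation might weight dimensions differently or report them separately.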

Similar Articles

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …
