MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Hugging Face Daily Papers

Summary

This paper introduces MBench, a benchmark for evaluating the memory capabilities of video world models across entity, environment, and causal consistency over long temporal horizons.

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

Cached at: 06/15/26, 09:03 AM

Source: https://huggingface.co/papers/2606.00793

Abstract

A new benchmark called MBench is introduced to evaluate the memory capabilities of video world models, focusing on entity, environment, and causal consistency over extended temporal horizons.

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

Similar Articles

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

Hugging Face Daily Papers

MemoBench is a diagnostic benchmark for evaluating video generation models' memory consistency in dynamically changing environments, where objects disappear and reappear in updated states. It includes 360 ground-truth clips and an evaluation suite combining automated metrics with VQA-based assessment, revealing insights into memory consistency challenges.