M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
Summary
M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multi-modal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, interference patterns, and spatial-temporal grounding.
Source: https://huggingface.co/papers/2606.05008
Abstract
Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.
As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.
Datasets citing this paper: 1
PKU-VaLuE-Lab/m3eval
Similar Articles
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.
WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
WorldMemArena is a new benchmark with 400 multi-session multimodal tasks for evaluating multimodal agent memory, comparing long-context, RAG, and harness-based memory approaches, revealing that better memory writing does not guarantee better performance and that systems struggle with visual evidence.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.
MEME: Multi-entity & Evolving Memory Evaluation
The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.