M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Hugging Face Daily Papers

Summary

M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multi-modal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, interference patterns, and spatial-temporal grounding.

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Cached at: 06/04/26, 03:41 AM


Source: https://huggingface.co/papers/2606.05008

Abstract

Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.




Datasets citing this paper: 1

#### PKU-VaLuE-Lab/m3eval Updated about 1 hour ago • 705 • 1


Similar Articles

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Hugging Face Daily Papers

MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Hugging Face Daily Papers

WorldMemArena is a new benchmark with 400 multi-session multimodal tasks for evaluating multimodal agent memory, comparing long-context, RAG, and harness-based memory approaches, revealing that better memory writing does not guarantee better performance and that systems struggle with visual evidence.

MEME: Multi-entity & Evolving Memory Evaluation

Hugging Face Daily Papers

The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.