M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Hugging Face Daily Papers 06/03/26, 12:00 AM Papers

multi-modal memory evaluation benchmark video-understanding cognitive-psychology

Summary

M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multi-modal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, interference patterns, and spatial-temporal grounding.

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Original Article

Cached at: 06/04/26, 03:41 AM

Source: https://huggingface.co/papers/2606.05008

Abstract

Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Datasets citing this paper: 1

#### PKU-VaLuE-Lab/m3eval Updated about 1 hour ago • 705 • 1

Similar Articles

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

MEME: Multi-entity & Evolving Memory Evaluation

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
