MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
Summary
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.
Cached at: 05/15/26, 04:23 AM
Source: https://huggingface.co/papers/2605.14906
Abstract
A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches.
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
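The abstract reports accuracy broken down by memory ability and by context length. As a rough illustration of that kind of breakdown, the following Python sketch groups graded answers into (ability, context length) cells and measures how much one ability degrades between a short and a long context. The record format, the function names, the token values used for the context lengths, and the toy data are all assumptions made for this example; they are not taken from the MEMLENS code or results.

```python
from collections import defaultdict

# Hypothetical record format: (ability, context_length_tokens, is_correct).
# This layout is illustrative only and is not MEMLENS's actual schema.


def accuracy_by_cell(records):
    """Return {(ability, context_length): accuracy} for graded records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ability, context_length, is_correct in records:
        key = (ability, context_length)
        totals[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


def degradation(cells, ability, short_length, long_length):
    """Accuracy drop for one ability between a short and a long context."""
    return cells[(ability, short_length)] - cells[(ability, long_length)]


# Made-up toy data, for illustration only (not benchmark results).
# The token counts 32_000 and 256_000 are assumed stand-ins for "32K" and "256K".
toy_records = [
    ("information extraction", 32_000, True),
    ("information extraction", 32_000, True),
    ("information extraction", 256_000, True),
    ("information extraction", 256_000, False),
    ("multi-session reasoning", 32_000, False),
    ("multi-session reasoning", 32_000, True),
]

cells = accuracy_by_cell(toy_records)
drop = degradation(cells, "information extraction", 32_000, 256_000)
```

Keeping the cells keyed by both ability and context length makes it possible to compare a long-context model and a memory agent on the same axis, for example by checking whether one system's drop stays near zero as the context grows.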
Similar Articles
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Researchers introduce Memora, a benchmark that evaluates LLMs’ ability to retain, update, and forget long-term user memories over conversations spanning weeks to months, revealing frequent reuse of obsolete memories.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
This paper introduces LongMemEval-V2, a benchmark for evaluating long-term memory systems in web agents, along with two memory methods: AgentRunbook-R and AgentRunbook-C.
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Mem0 introduces a scalable memory-centric architecture using graph-based representations to improve long-term conversational coherence in LLMs, significantly reducing latency and token costs while outperforming existing memory systems.
δ-mem: Efficient Online Memory for Large Language Models
The paper introduces δ-mem, a lightweight memory mechanism that enhances large language models by augmenting a frozen attention backbone with a compact associative memory state. It demonstrates improved performance on memory-heavy benchmarks with minimal computational overhead.