MEME: Multi-entity & Evolving Memory Evaluation
Summary
The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.
Cached at: 05/13/26, 08:12 AM
Source: https://huggingface.co/papers/2605.12477
Abstract
MEME benchmark evaluates memory systems across multiple entities and evolving conditions, revealing persistent challenges in dependency reasoning despite advanced retrieval and prompting techniques.
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating that closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
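To make the gap between static retrieval and dependency reasoning concrete, here is a toy sketch. It is a hypothetical illustration of one possible reading of what a Cascade- or Deletion-style question probes. `ToyMemory`, the `alice.*`/`bob.*` keys, and the invalidation rule are invented for this example. They are not MEME's task definitions, data format, or any of the evaluated memory systems.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical key-value memory (NOT MEME's format).

    facts:      "entity.attribute" -> current value
    depends_on: derived key -> key it was inferred from
    """
    facts: dict[str, str] = field(default_factory=dict)
    depends_on: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, value: str, derived_from: str | None = None) -> None:
        self.facts[key] = value
        if derived_from is not None:
            self.depends_on[key] = derived_from

    def update(self, key: str, value: str, cascade: bool) -> None:
        # Overwrite a fact; optionally invalidate facts inferred from it.
        self.facts[key] = value
        if cascade:
            self._invalidate_dependents(key)

    def delete(self, key: str, cascade: bool) -> None:
        # Remove a fact; optionally invalidate facts inferred from it.
        self.facts.pop(key, None)
        if cascade:
            self._invalidate_dependents(key)

    def _invalidate_dependents(self, key: str) -> None:
        # Transitively drop every fact whose inference chain passes through `key`
        # (assumes the dependency graph is acyclic).
        for dep, src in list(self.depends_on.items()):
            if src == key and dep in self.facts:
                del self.facts[dep]
                self._invalidate_dependents(dep)

    def read(self, key: str) -> str:
        return self.facts.get(key, "unknown")


def build_example() -> ToyMemory:
    # Two entities, with a two-hop dependency chain that crosses between them.
    m = ToyMemory()
    m.write("alice.city", "Paris")
    m.write("alice.commute", "metro", derived_from="alice.city")
    m.write("bob.commute", "metro with Alice", derived_from="alice.commute")
    return m


if __name__ == "__main__":
    naive = build_example()
    naive.update("alice.city", "rural village", cascade=False)
    print(naive.read("alice.commute"))  # metro (stale, yet retrieved cleanly)

    aware = build_example()
    aware.update("alice.city", "rural village", cascade=True)
    print(aware.read("bob.commute"))  # unknown (dependent fact invalidated)

    gone = build_example()
    gone.delete("alice.city", cascade=True)
    print(gone.read("alice.city"))  # unknown (post-removal state)
```

In this toy, the store that only overwrites the updated key still returns "metro" without any retrieval error. That is the kind of answer that can look like adequate retrieval while the dependent state is stale. Real systems store free-text memories rather than explicit dependency edges, which is presumably what makes this harder in practice.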
Similar Articles
MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
EvoArena introduces a benchmark for evaluating LLM agents in dynamic environments with progressive updates across terminal, software, and social domains, while EvoMem proposes a patch-based memory paradigm that records structured evolution; experiments show current agents achieve only 39.6% accuracy on EvoArena, and EvoMem yields average gains of 1.5% on the benchmark and improvements on GAIA and LoCoMo.
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents
SubtleMemory is a benchmark for evaluating AI agents' fine-grained relational memory discrimination in long-horizon interactions, consisting of 1,522 instances over 10 long histories. It reveals limitations in current memory systems for preserving and utilizing nuanced memory relationships.
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
H-Mem is a novel memory mechanism for LLM-based agents that uses a hybrid structure combining a temporal and semantic tree with a knowledge graph to model memory evolution and improve retrieval, achieving state-of-the-art performance on QA benchmarks.