MEME: Multi-entity & Evolving Memory Evaluation
Summary
The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.
Source: https://huggingface.co/papers/2605.12477
Abstract
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, but at ~70x the baseline cost, indicating that closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
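To make the abstract's failure modes concrete, here is a hypothetical toy sketch (not the MEME benchmark code; all names are invented for illustration) of why a naive key-value memory fails the dependency-style tasks: updating a base fact does not propagate to facts derived from it (Cascade), and nothing enforces correct behavior after removal (Deletion).

```python
# Toy illustration of the Cascade and Deletion failure modes from the
# abstract, using an invented NaiveMemory class. A flat key-value store
# has no notion of one fact depending on another.

class NaiveMemory:
    def __init__(self):
        self.facts = {}

    def write(self, key, value):
        self.facts[key] = value

    def read(self, key):
        # Returns None when a fact is absent or has been deleted.
        return self.facts.get(key)

# Session 1: store a base fact and a fact derived from it.
mem = NaiveMemory()
mem.write("alice.office", "Building A")
mem.write("alice.mail_stop", "Building A, Floor 2")  # depends on alice.office

# Session 2: the base fact evolves, but the dependent fact silently goes
# stale -- the "Cascade" failure mode.
mem.write("alice.office", "Building C")
stale = mem.read("alice.mail_stop")  # still reflects the old office

# "Deletion": after removal, the system must report absence rather than
# resurface an old value.
del mem.facts["alice.office"]
gone = mem.read("alice.office")
```

A memory system that passes these tasks would need either explicit dependency links between facts or reasoning at read time over the update history, which is what the benchmark's results suggest current systems lack.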
Get this paper in your agent:
hf papers read 2605.12477
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.
I built a benchmark for AI "memory" in coding agents. Looking for others to beat it.
A developer created a benchmark called continuity-benchmarks to test AI coding agents' ability to maintain consistency with project rules during active development, addressing gaps in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena introduces a large-scale benchmark for evaluating robotic memory across 26 complex tasks with real-world validation, alongside PrediMem, a dual-system vision-language-action model that improves memory management through predictive coding.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
This paper introduces LongMemEval-V2, a benchmark for evaluating long-term memory systems in web agents, along with two memory methods: AgentRunbook-R and AgentRunbook-C.
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Researchers introduce Memora, a benchmark that evaluates LLMs' ability to retain, update, and forget long-term user memories over conversations spanning weeks to months, revealing frequent reuse of obsolete memories.