MEME: Multi-entity & Evolving Memory Evaluation

Hugging Face Daily Papers

Summary

The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.

Source: https://huggingface.co/papers/2605.12477

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and, in most cases, stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, but at roughly 70x the baseline cost, indicating that closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
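
To make the dependency-reasoning tasks concrete, the Python sketch below shows what a Cascade-style episode and its scoring could look like. Everything here is invented for illustration: the Episode schema, the MemorySystem interface, the LastSessionBaseline, the Alice/Bob facts, and the containment-based scoring are assumptions, not MEME's actual format; the real code and data are on the project page.

from dataclasses import dataclass

@dataclass
class Episode:
    """Illustrative episode: facts arrive over sessions, then one question
    is asked whose gold answer reflects the final world state."""
    sessions: list[str]   # conversation turns fed to the memory system in order
    question: str
    gold_answer: str

# Hypothetical Cascade episode: Bob's floor depends on Alice's team, so the
# update to Alice must propagate ("cascade") to the answer about Bob.
cascade = Episode(
    sessions=[
        "Alice leads the Search team, which sits on floor 3.",
        "Bob reports to Alice and sits on his team lead's floor.",
        "Update: Alice moved to the Ads team, which sits on floor 7.",
    ],
    question="Which floor does Bob sit on?",
    gold_answer="floor 7",
)

class MemorySystem:
    """Assumed minimal interface for a system under test."""
    def store(self, session: str) -> None: ...
    def answer(self, question: str) -> str: ...

class LastSessionBaseline(MemorySystem):
    """Toy baseline that answers with the most recently stored session."""
    def __init__(self) -> None:
        self.log: list[str] = []
    def store(self, session: str) -> None:
        self.log.append(session)
    def answer(self, question: str) -> str:
        return self.log[-1] if self.log else ""

def average_accuracy(system: MemorySystem, episodes: list[Episode]) -> float:
    """Exact-match containment scoring, averaged over episodes."""
    hits = 0
    for ep in episodes:
        for session in ep.sessions:
            system.store(session)
        hits += int(ep.gold_answer.lower() in system.answer(ep.question).lower())
    return hits / len(episodes)

print(average_accuracy(LastSessionBaseline(), [cascade]))  # 1.0, but only by
# luck: the final update happens to mention "floor 7", while a retriever that
# surfaces the stale "floor 3" session would answer wrongly.

Under this reading, a system that retrieves facts about Bob without propagating the update to Alice fails, which is the collapse the paper reports on Cascade. An Absence episode would presumably invert the setup, asking about a dependent fact whose prerequisite was removed or never stated, though the abstract does not spell out its exact format.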

Get this paper in your agent:

hf papers read 2605.12477

Don’t have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

arXiv cs.CL

MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Hugging Face Daily Papers

RoboMemArena introduces a large-scale benchmark for evaluating robotic memory across 26 complex tasks with real-world validation, alongside PrediMem, a dual-system vision-language-action model that improves memory management through predictive coding.