MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
Summary
MemEvoBench is presented as the first benchmark for evaluating long-horizon memory safety in LLM agents. It measures behavioral degradation caused by adversarial memory injection, noisy tool outputs, and biased feedback across QA-style and workflow-style tasks. The analysis suggests that memory evolution is a significant contributor to safety failures and that static prompt-based defenses are insufficient.
# MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Source: https://arxiv.org/abs/2604.15774

PDF: https://arxiv.org/pdf/2604.15774

> Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

## Submission history

From: Weiwei Xie

**[v1]** Fri, 17 Apr 2026 07:29:52 UTC (5,290 KB)
Similar Articles
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
EvoArena introduces a benchmark for evaluating LLM agents in dynamic environments with progressive updates across terminal, software, and social domains, while EvoMem proposes a patch-based memory paradigm that records structured evolution; experiments show current agents achieve only 39.6% accuracy on EvoArena, and EvoMem yields average gains of 1.5% on the benchmark and improvements on GAIA and LoCoMo.
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
EvolveMem introduces a self-evolving memory architecture for LLM agents that optimizes retrieval configurations through LLM-powered diagnosis and iterative research cycles, achieving significant performance improvements on benchmarks like LoCoMo and MemBench.
MEME: Multi-entity & Evolving Memory Evaluation
The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench is a new benchmark for evaluating LLM agent memory in multi-party conversations. It exposes failures in current memory systems, the best of which achieves only 46% average accuracy.
@hyunji_amy_lee: LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process …
MINTEval is a new benchmark for evaluating LLM agents and memory systems in continuously updated environments with frequent context changes. It shows that current systems perform poorly, with an average accuracy of 27.9% across representative systems.