MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

arXiv cs.CL Papers

Summary

MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.

arXiv:2604.15774v1 Announce Type: new Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.
Original Article
View Cached Full Text

Cached at: 04/20/26, 08:29 AM

# MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
Source: https://arxiv.org/abs/2604.15774
View PDF (https://arxiv.org/pdf/2604.15774)

> Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

## Submission history

From: Weiwei Xie [view email (https://arxiv.org/show-email/722fe92e/2604.15774)] **[v1]** Fri, 17 Apr 2026 07:29:52 UTC (5,290 KB)

Similar Articles

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Hugging Face Daily Papers

EvoArena introduces a benchmark for evaluating LLM agents in dynamic environments with progressive updates across terminal, software, and social domains, while EvoMem proposes a patch-based memory paradigm that records structured evolution; experiments show current agents achieve only 39.6% accuracy on EvoArena, and EvoMem yields average gains of 1.5% on the benchmark and improvements on GAIA and LoCoMo.

MEME: Multi-entity & Evolving Memory Evaluation

Hugging Face Daily Papers

The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.