Useful Memories Become Faulty When Continuously Updated by LLMs
Summary
This paper shows that continuously consolidating past experiences into textual memory using LLMs degrades memory utility over time, and that preserving raw episodic trajectories outperforms forced consolidation, with implications for robust agentic memory systems.
# Useful Memories Become Faulty When Continuously Updated by LLMs

Source: [https://arxiv.org/abs/2605.12978](https://arxiv.org/abs/2605.12978) · [View PDF](https://arxiv.org/pdf/2605.12978)

> Abstract: Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.

## Submission history

From: Dylan Zhang [[view email](https://arxiv.org/show-email/c64ee13f/2605.12978)]

**[v1]** Wed, 13 May 2026 04:15:50 UTC (455 KB)
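The abstract's closing recommendation is concrete enough to sketch. Below is a minimal Python illustration, not the authors' implementation: all names here (`Episode`, `MemoryBank`, `gated_update`, the toy `summarize` function, the `every_n` threshold) are assumptions for exposition. It mirrors the Retain, Delete, and Consolidate actions described for the ARC-AGI Stream environment, keeps raw episodes as first-class evidence, and gates consolidation explicitly instead of firing it after every interaction.

```python
"""Minimal sketch of episodic-first memory with gated consolidation.

Not the paper's code: Episode, MemoryBank, and gated_update are
illustrative assumptions based on the abstract's description.
"""

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Episode:
    """A raw trajectory: task identifier, the steps taken, and the outcome."""
    task: str
    steps: list[str]
    solved: bool


@dataclass
class MemoryBank:
    """Episodic-first memory: raw episodes are retained as evidence;
    consolidated lessons are a derived view that never replaces them."""
    episodes: list[Episode] = field(default_factory=list)
    lessons: list[str] = field(default_factory=list)

    # The three actions the paper's stream environment exposes.
    def retain(self, episode: Episode) -> None:
        self.episodes.append(episode)  # default action: keep the raw trace

    def delete(self, index: int) -> None:
        del self.episodes[index]  # episodic management: prune, don't rewrite

    def consolidate(self, summarize: Callable[[list[Episode]], list[str]]) -> None:
        # Distill lessons *from* episodes while keeping the episodes:
        # consolidation must not overwrite the evidence it depends on.
        self.lessons = summarize(self.episodes)


def gated_update(
    bank: MemoryBank,
    episode: Episode,
    summarize: Callable[[list[Episode]], list[str]],
    every_n: int = 20,
) -> None:
    """Gate consolidation explicitly rather than running it per interaction."""
    bank.retain(episode)
    if len(bank.episodes) % every_n == 0:  # explicit, infrequent gate
        bank.consolidate(summarize)


if __name__ == "__main__":
    # A naive counting "summarizer" stands in for an LLM consolidation call.
    def summarize(eps: list[Episode]) -> list[str]:
        solved = sum(e.solved for e in eps)
        return [f"{solved}/{len(eps)} episodes solved; raw traces retained"]

    bank = MemoryBank()
    for i in range(40):
        gated_update(bank, Episode(f"arc-{i}", ["step"], i % 2 == 0), summarize)
    print(bank.lessons, len(bank.episodes))  # episodes survive consolidation
```

The one design choice that matters here, per the paper's findings, is that `consolidate` writes a derived `lessons` view rather than replacing `episodes`, so a faulty consolidation pass can never destroy the evidence it was distilled from.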
Similar Articles
Useful Memories Become Faulty When Continuously Updated by LLMs
A study finds that continuously updating consolidated memories in LLM-based agentic systems degrades performance, and that retaining raw episodic trajectories is more reliable. Experiments on ARC-AGI show that even GPT-5.4 fails on problems it had previously solved once its memories are consolidated, even from ground-truth solutions.
Useful memories become faulty when continuously updated by LLMs (30 minute read)
This research demonstrates that continuously updating LLM agent memories through distillation and consolidation loops causes performance regression, even when consolidating from ground-truth solutions. The study finds that episodic-only retention outperforms text-based consolidation, highlighting significant flaws in current self-improvement paradigms.
@dylan_works_: Wrote up something fun I’ve been poking at: when LLM agents repeatedly rewrite their own experiences into textual “less…
This research blog post demonstrates that repeatedly rewriting LLM agent experiences into textual 'lessons' often degrades performance rather than improving it. The author finds that episodic memory retention performs better than abstract consolidation across various benchmarks like ARC-AGI and ALFWorld.
@omarsar0: // The Memory Curse in LLM Agents // (bookmark it) Long histories apparently degrades agents as they become increasingl…
This research paper identifies the 'memory curse' in LLM agents, demonstrating that expanded context windows systematically degrade cooperative behavior in multi-agent social dilemmas by eroding forward-looking intent. The authors show that targeted fine-tuning, synthetic memory sanitization, and reducing explicit Chain-of-Thought reasoning can effectively mitigate this behavioral decay.
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Researchers introduce Memora, a benchmark that evaluates LLMs' ability to retain, update, and forget long-term user memories over conversations spanning weeks to months, revealing frequent reuse of obsolete memories.