@rohanpaul_ai: New Illinois+ Tsinghua University and other labs study finds that LLM agents still have unreliable memory and that it c…


Summary

A study by the University of Illinois, Tsinghua University, and other labs finds that LLM agents' memory becomes unreliable when it is continuously rewritten: GPT-5.4's accuracy on a small ARC-AGI set dropped from 100% with no memory to about 54% after streaming memory updates. The paper proposes preserving raw episodes instead of always summarizing them.


Cached at: 05/19/26, 02:41 AM

A new study from Illinois, Tsinghua University, and other labs finds that LLM agents still have unreliable memory, and that it can get worse when they keep rewriting their own memories.

LLM agents can learn from experience, but their rewritten memories often become unreliable.

The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons.

That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory.

The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them.

The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions.

The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%.

The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples.

The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better.

The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away.
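
To make the "keep raw evidence, summarize only sometimes" idea concrete, here is a minimal sketch of what such a memory could look like. It is not the paper's implementation: the `Episode` and `EpisodicMemory` names, the `summarize` callback (standing in for an LLM call), the `summarize_every` interval, and the word-overlap retrieval are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Episode:
    """One raw past attempt: the task text and what the agent actually did."""
    task: str
    trajectory: str
    succeeded: bool


@dataclass
class EpisodicMemory:
    """Hypothetical store that keeps raw episodes and summarizes only occasionally."""
    # Stand-in for an LLM summarizer; assumed, not taken from the paper.
    summarize: Optional[Callable[[List[Episode]], str]] = None
    # Assumed interval between summaries; not a value from the paper.
    summarize_every: int = 50
    episodes: List[Episode] = field(default_factory=list)
    summaries: List[str] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        # Raw episodes are appended and never rewritten or deleted.
        self.episodes.append(episode)
        due = len(self.episodes) % self.summarize_every == 0
        if self.summarize is not None and due:
            # A summary is an extra view over the latest batch, not a replacement.
            batch = self.episodes[-self.summarize_every:]
            self.summaries.append(self.summarize(batch))

    def retrieve(self, task: str, k: int = 3) -> List[Episode]:
        # Toy relevance score: number of lowercase words shared with the query.
        query = set(task.lower().split())

        def overlap(ep: Episode) -> int:
            return len(query & set(ep.task.lower().split()))

        # sorted() is stable, so ties keep insertion order.
        return sorted(self.episodes, key=overlap, reverse=True)[:k]
```

The design choice this illustrates is that summaries sit alongside the raw episodes rather than overwriting them, so a bad summary cannot erase the evidence it was built from. A real system would likely swap the word-overlap score for embedding-based retrieval.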


Paper Link – arxiv.org/abs/2605.12978

Paper Title: “Useful Memories Become Faulty When Continuously Updated by LLMs”

Similar Articles

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

Hugging Face Daily Papers

This paper identifies a critical failure mode in LLM agents where they fail to update personalized memories when new evidence conflicts with prior beliefs. It introduces the STALE benchmark and a three-dimensional probing framework, revealing that even the best models achieve only 55.2% accuracy, and proposes CUPMem as a prototype for robust memory revision.

Useful memories become faulty when continuously updated by LLMs (30 minute read)

TLDR AI

This research demonstrates that continuously updating LLM agent memories through distillation and consolidation loops causes performance regression, even when trained on ground-truth solutions. The study finds that episodic-only retention outperforms text-based consolidation, highlighting significant flaws in current self-improvement paradigms.

Useful Memories Become Faulty When Continuously Updated by LLMs

Hugging Face Daily Papers

A study finds that continuously updating consolidated memories in LLM-based agentic systems degrades performance, and that retaining raw episodic trajectories is more reliable. Experiments on ARC-AGI show that even GPT-5.4 fails more often after consolidation.