Tag
This paper identifies a critical failure mode in LLM agents where they fail to update personalized memories when new evidence conflicts with prior beliefs. It introduces the STALE benchmark and a three-dimensional probing framework, revealing that even the best models achieve only 55.2% accuracy, and proposes CUPMem as a prototype for robust memory revision.