Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents
Summary
This paper introduces PerMemBench, the first benchmark for evaluating personalized memory systems in LLM-based agents, and proposes a session-level storage gating framework that adapts memory policies to individual user contexts.
Source: https://huggingface.co/papers/2605.25535
Abstract
Large language model-based memory systems can benefit from personalized policies that adapt to individual user contexts, though accurate implementation remains challenging.
Existing large language model (LLM) based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are worth storing in memory are different across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.
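The abstract describes session-level storage gating only at a high level, so the following is a minimal illustrative sketch rather than the paper's method. It assumes a gate is a per-user function that decides, once per session, whether memory operations run at all. The names `Session`, `GatedMemory`, and `make_keyword_gate` are hypothetical, and the keyword rule stands in for whatever learned or prompted gate a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Session:
    """One user session: an id plus the raw conversation turns (hypothetical type)."""
    session_id: str
    turns: List[str]


@dataclass
class GatedMemory:
    """Memory store that consults a gate before running any write for a session."""
    gate: Callable[[Session], bool]  # True means "worth storing"
    store: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)

    def process(self, session: Session) -> bool:
        # Transient session: bypass memory operations entirely.
        if not self.gate(session):
            self.skipped.append(session.session_id)
            return False
        # Placeholder for the real memory write (extraction, summarization, indexing).
        self.store.extend(session.turns)
        return True


def make_keyword_gate(user_keywords: List[str]) -> Callable[[Session], bool]:
    """Toy per-user gate (an assumption): keep a session if any turn mentions a keyword."""
    lowered = [k.lower() for k in user_keywords]

    def gate(session: Session) -> bool:
        return any(k in turn.lower() for turn in session.turns for k in lowered)

    return gate


# Hypothetical persona whose long-horizon concerns are a thesis and a marathon.
memory = GatedMemory(gate=make_keyword_gate(["thesis", "marathon"]))
sessions = [
    Session("s1", ["What's the weather in Lisbon today?"]),
    Session("s2", ["My thesis defense moved to March.", "Please remember the new date."]),
]
results = [memory.process(s) for s in sessions]
```

In this sketch the gate is the only personalized component: swapping in a different gate per user changes what consumes the memory budget without touching the store itself. A gate that wrongly skips a session loses that context for good, which is one way to read the abstract's point that accurate gating is the hard part.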
Similar Articles
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Researchers introduce Memora, a benchmark that evaluates LLMs’ ability to retain, update, and forget long-term user memories over weeks-to-months conversations, revealing frequent reuse of obsolete memories.
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy is a research paper introducing a framework for privacy-preserving personalized memory management in edge-cloud AI agents, using type-aware placeholders to protect sensitive data while maintaining semantic utility. It includes a new benchmark dataset and demonstrates superior performance over general-purpose models like GPT-5.2 and Gemini-3.1-Pro.
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym is a benchmark for evaluating memory formation in LLM agents over long-horizon tasks, unifying existing agent gyms and synthetic pipelines with memory-isolated scores. It spans tool-use dialogue, multi-turn search, coding, and computer use, and includes a lightweight reward model (MemRM) for efficient evaluation.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
This paper introduces LongMemEval-V2, a benchmark for evaluating long-term memory systems in web agents, along with two memory methods: AgentRunbook-R and AgentRunbook-C.
MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing
MemForest proposes a memory framework for long-context LLM agents that improves scalability and reduces latency through parallel chunk extraction and hierarchical temporal indexing, achieving 6x higher throughput on benchmarks.