SaliMory: Orchestrating Cognitive Memory for Conversational Agents
Summary
SaliMory is a framework that trains a single language model to manage cognitively-structured memory (user facts, preferences, and working memory) for conversational agents, using hierarchical stage-wise process rewards and reward-decomposed contrastive refinement. It reduces memory-attributed failures by one-third, outperforms state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.
# SaliMory: Orchestrating Cognitive Memory for Conversational Agents
Source: [https://arxiv.org/html/2606.04120](https://arxiv.org/html/2606.04120)
Xinyuan Zhang†, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong†
Meta Reality Labs
{kkaizh,dylanz426,lunadong}@meta.com
###### Abstract
Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SaliMory, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SaliMory provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SaliMory cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.
## 1 Introduction
Figure 1: Gated error attribution on LoCoMo. We trace each incorrect response to its earliest point of failure in the pipeline (top). Memory errors, where salient information is lost or distorted during formation, dominate across baselines (e.g., 52% for MemGAS), vastly exceeding profile generation and response generation errors combined.
The evolution of conversational AI is rapidly shifting from isolated, single-turn interactions toward lifelong digital companions. As AI agents become embedded in daily life, maintaining persistent context beyond the limits of standard LLM context windows has become a necessity (xu2022longtimenosee; bae2022keepmeupdated). However, simply attaching an external memory module is not enough: users’ interaction histories are often vast, repetitive, and dominated by trivia. The real challenge is not storage but *management*: deciding what to remember, how to organize memories, and what to recall for a given context.
Two main paradigms have emerged, but each falls short. One extreme is to store all past interactions and apply Retrieval-Augmented Generation (RAG) to surface relevant context when a new question arrives. As the memory volume grows, however, the retrieval results tend to become noisy and repetitive (borgeaud2022; zhong2024memorybank; lee2024human). The other is to actively compress and curate long-term memories, but neither prompting-based (li2024ldagent; xu2025amem) nor reinforcement-learning-based (yan2025memoryr1; wang2025memalpha) variants have been shown to maintain high-quality memory that reliably captures user preferences while preserving the salient but nuanced details required for future conversations across diverse topics. Our error analysis on the LoCoMo benchmark (maharana2024locomo) reveals that, in state-of-the-art solutions, imprecise or incomplete memory can block well-personalized answers on up to 52% of questions.
How do human beings cope with a lifetime of experiences without being overwhelmed? Cognitive science suggests we do not try to remember everything (atkinson1968; baddeley2000): *selective attention* filters what gets encoded, while recent traces remain active in *working memory* to shape ongoing decisions (baddeleyhitch1974). What survives long-term is structurally organized: enduring *facts* serve as hard constraints on what is true, while value-laden *preferences* act as soft biases on what feels right (tulving1972).
Guided by these principles, we propose SaliMory, an agentic memory-management framework built around three complementary stores: a *factual snapshot* of verifiable facts about the user, a set of *long-term preferences* capturing subjective tastes, and a *short-term working memory* of recent details the user likely still has in mind. The factual snapshot supplies hard constraints; preferences supply soft criteria; working memory surfaces emerging interests worth revisiting. A single memory-management module operates on these stores in three roles: (i) *selective attention*: judging which turns are salient enough to record; (ii) *consolidation*: updating and forgetting memory; and (iii) *cue-driven utilization*: retrieving and applying relevant memories at inference.
SaliMory trains one compact language model to perform all three roles via reinforcement learning (RL). Because the reward for final answer quality alone is too sparse and too distant to supervise intermediate memory decisions, we introduce two complementary mechanisms. The stage-wise process reward scores not only final-answer correctness and personalization, but also the quality of the resulting memory and the soundness of each intermediate decision (saliency, utilization). A reward-decomposed contrastive refinement step then amplifies the memory-management signal: within each Group Relative Policy Optimization (GRPO) (shao2024grpo) rollout batch, we construct role-specific preference pairs that upweight memory decisions in the overall objective. Together, these mechanisms address the credit-assignment problem.
To summarize, our paper makes three contributions\.
- Cognitively inspired memory architecture. We separate factual snapshots, subjective preferences, and short-term working memory, and define three complementary memory-management roles (saliency filter, memory booster, and memory utilizer) that together ground long-horizon conversations.
- Memory-anchored RL. We propose a stage-wise process reward system that tracks memory at every stage, together with a reward-decomposed contrastive refinement over GRPO rollouts, to ensure high-quality memory formation and final response generation.
- New benchmark and evaluation protocol. To fully evaluate memory’s impact on conversational agents, we introduce a new LoCoMo-P13n benchmark, which extends the original LoCoMo with personalizable queries. We also introduce a multi-step evaluation protocol that jointly stresses precise memory recall and personalization QA. Empirically, using a 9B model, SaliMory outperforms the SOTA by 10.2% in end-to-end accuracy and delivers a 23.5-point gain in Good Personalization rate.
## 2 Related Work
Memory-Augmented Language Models. Equipping LLMs with persistent context traditionally follows two paradigms. Parametric Memory updates model weights via fine-tuning (e.g., FireAct (chen2023fireact), AgentLumos (yin2024agent)) or soft parameters (SELF-PARAM (wang2024self)), but suffers from catastrophic forgetting and poor generalization to unseen queries. Retrieval-Based Memory abstracts experiences into external datastores (borgeaud2022; lee2024human), using semantic search to fetch relevant context. However, because retrieval relies purely on semantic matching rather than downstream utility, it struggles to reliably distinguish critical objective facts from transient remarks, often returning noisy or fragmented context.
Conversational Memory Structures. To overcome RAG’s limitations, agentic systems organize context into distinct topologies. Linear memory treats history sequentially; MemGPT (packer2023memgpt) manages context using OS-like FIFO queues, while MemoryBank (zhong2024memorybank) decays older interactions. Layered memory stratifies information to prioritize relevance, with MemoryOS (kang2025memory) segregating short-, mid-, and long-term stores based on access frequency, and LD-Agent (li2025hello) separating transient dialogue from enduring personas. Finally, tree- and graph-based memories map relational dependencies, with A-Mem (xu2025amem) constructing dynamic networks, AssoMem (zhang2025assomem) mimicking human associative memory for dense searches, and Mem0 (chhikara2025mem0) utilizing entity-centric graphs. While these explicit topologies improve organization, they rely on static prompting and fragile heuristics.
Reinforcement Learning for Memory Management. To overcome static heuristics, recent works increasingly formulate memory management as an RL problem. MEM1 (zhou2025mem1) uses RL to compress history into a constant-sized state, but its monolithic memory cannot separate facts from preferences. MemRL (zhang2026memrl) applies runtime RL to update retrieval Q-values but ignores memory creation quality, while MemGen (zhang2025memgen) generates implicit, human-unreadable latent memory. Most relevant to our work are Memory-R1 (yan2025memoryr1) and Mem-α (wang2025memalpha), which train agents to execute explicit memory operations (e.g., ADD, UPDATE). While pioneering, their monolithic rewards obscure which stage failures stem from in a multi-stage pipeline. HCAPO (tan2026hindsight) mitigates this via hindsight re-weighting but lacks isolated supervision for distinct roles. SaliMory solves this by introducing a stage-wise process reward system and reward-decomposed contrastive refinement, providing the fine-grained credit assignment required to optimize a structured, multi-stage memory system end-to-end.
## 3 Methodology
Figure 2: SaliMory architecture. We present SaliMory, an end-to-end agentic memory framework designed to coordinate and optimize conversational memory for downstream response generation.
### 3.1 Problem Definition and Memory Bank
In long-horizon conversational settings, an agent receives the full conversation history $\mathcal{H}$ and a current user query $q$. The goal is to produce a contextualized and personalized answer $\hat{a}$ that satisfies two criteria: it accurately answers the query, and it is seamlessly grounded in the user’s personal history.
Because $\mathcal{H}$ grows unboundedly and is dominated by trivia, we do not condition on it directly. Instead, we maintain a dynamically evolving memory bank $\mathcal{M}$, derived from $\mathcal{H}$, and reformulate the task as $\hat{a} = \pi_{\text{gen}}(q, \mathcal{M})$. Inspired by human memory systems, $\mathcal{M}$ is not a flat buffer but three dedicated stores, $\mathcal{M} = (\bar{F}, \bar{P}, \bar{W})$: (i) *factual bank* $\bar{F}$: objective, verifiable statements about the user (e.g., being lactose intolerant); (ii) *preference bank* $\bar{P}$: subjective tastes and leanings that act as soft criteria for stylistic tailoring (e.g., preference for oat milk); (iii) *working memory* $\bar{W}$: recent salient turns the user likely still has in mind (e.g., asking about organic food this morning). Separating the long-term profile from short-term context, and objective facts from subjective preferences, allows the agent to differentiate among them, yielding sharper personalization than a flat context window allows.
Given this structured memory, the core challenge shifts from storage to*dynamic management*\. An effective memory manager must autonomously resolve three sequential decisions:
- (i) *What enters memory?* Act as an attentional gate, deciding which turns carry salient information worth extracting and which to discard.
- (ii) *How is memory updated?* Route each extraction to the correct store ($\bar{F}$, $\bar{P}$, or $\bar{W}$), reconcile with existing entries, and retire stale items from working memory.
- (iii) *How is memory used?* At query time, synthesize relevant entries across the three stores into a query-adapted profile that conditions the generator.
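As a rough illustration of the three-store memory bank and the "retire stale items" step, the following is a minimal sketch. The class names, the exact-match deduplication, the timestamp units, and the window length are assumptions made for illustration; the paper does not specify the storage format.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingEntry:
    text: str
    timestamp: float  # assumed to be seconds; the paper does not fix a unit


@dataclass
class MemoryBank:
    """Hypothetical container for M = (F, P, W)."""
    facts: list[str] = field(default_factory=list)        # F: verifiable statements
    preferences: list[str] = field(default_factory=list)  # P: subjective tastes
    working: list[WorkingEntry] = field(default_factory=list)  # W: recent salient turns
    window: float = 7 * 24 * 3600.0  # hypothetical sliding-window length

    def write(self, store: str, text: str, now: float) -> None:
        """Route one extraction to a store, then retire stale working entries."""
        if store == "fact":
            if text not in self.facts:  # naive reconciliation: exact-match dedup
                self.facts.append(text)
        elif store == "preference":
            if text not in self.preferences:
                self.preferences.append(text)
        elif store == "working":
            self.working.append(WorkingEntry(text, now))
        else:
            raise ValueError(f"unknown store: {store}")
        self.retire_stale(now)

    def retire_stale(self, now: float) -> None:
        """Drop working-memory entries that fell out of the sliding window."""
        self.working = [e for e in self.working if now - e.timestamp <= self.window]
```

In the actual framework the routing and reconciliation decisions are made by the learned policy; the rule-based `write` above only shows the shape of the data being managed.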
### 3.2 The SaliMory Architecture and RL Framework
To separate the cognitive load of memory management from conversational reasoning and to optimize latency, the SaliMory architecture (Figure [2](https://arxiv.org/html/2606.04120#S3.F2)) contains two distinct computational stages: an offline stage that builds and maintains the structured memory, and a runtime stage that leverages it.
The Offline Stage processes each conversation session asynchronously to produce an updated memory bank $\mathcal{M}$. First, the Saliency Filter drops transient noise. Second, the memBooster processes salient turns, incrementally updates the factual bank $\bar{F}$ and the preference bank $\bar{P}$, and maintains the working memory bank $\bar{W}$ within a sliding time window.
The Runtime Stage executes when a new user query $q$ arrives. The memUtilizer finds relevant memories in the current memory bank, producing an adapted user profile $\mathcal{M}_q$. To ensure memory management is isolated from foundational conversational reasoning, a separate, entirely frozen LLM ($\pi_{\text{gen}}$) then produces the final response $\hat{a} = \pi_{\text{gen}}(q, \mathcal{M}_q)$.
To autonomously execute the three memory decisions defined above, SaliMory trains a single, unified policy model $\pi_\theta$ via RL rather than relying on separate specialized models. By conditioning $\pi_\theta$ on distinct instruction prompts, this single model learns to act as the saliency filter, the memory booster, and the profile utilizer.
We formally define the state at any given step as the combination of the conversational context and the memory bank, $(q, \mathcal{H}, \mathcal{M})$. The action space corresponds to the generative outputs of our policy model $\pi_\theta$, including the saliency decision, the updated memory $\mathcal{M}$, and the query-dependent profile $\mathcal{M}_q$. The goal of RL training is to manipulate the state through these actions to maximize the quality of the final downstream response $\hat{a}$.
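The two-stage control flow described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the role prompts, the plain-text memory representation, and the `"salient"` verdict string are hypothetical, and `policy` and `generator` stand for any text-in, text-out callables.

```python
from typing import Callable, Iterable

# Hypothetical role prompts: one policy model, three instruction prefixes.
ROLE_PROMPTS = {
    "filter": "[FILTER] ",
    "booster": "[BOOSTER] ",
    "utilizer": "[UTILIZER] ",
}

TextModel = Callable[[str], str]


def offline_update(policy: TextModel, memory: str, session: Iterable[str]) -> str:
    """Offline stage: gate each turn, then let the booster rewrite the memory."""
    for turn in session:
        verdict = policy(ROLE_PROMPTS["filter"] + turn)
        if verdict.strip().lower() != "salient":
            continue  # transient noise never reaches the booster
        memory = policy(ROLE_PROMPTS["booster"] + f"MEMORY:\n{memory}\nTURN:\n{turn}")
    return memory


def answer(policy: TextModel, generator: TextModel, memory: str, query: str) -> str:
    """Runtime stage: build the query-adapted profile, then call the frozen generator."""
    profile = policy(ROLE_PROMPTS["utilizer"] + f"MEMORY:\n{memory}\nQUERY:\n{query}")
    return generator(f"PROFILE:\n{profile}\nQUERY:\n{query}")
```

Note that the generator only ever sees the synthesized profile and the query, never the raw history, which is the isolation property the architecture relies on.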
### 3.3 Stage-Wise Process Reward System
Recent advances in RL for reasoning models (guo2025deepseekr1) rely heavily on Reinforcement Learning with Verifiable Rewards (RLVR), where the model is optimized purely on a final, outcome-based reward. However, applying pure RLVR to a complex, multi-stage memory pipeline creates a severe credit assignment bottleneck. If the frozen generator produces an incorrect or poorly personalized final answer, an outcome-only reward cannot determine where the pipeline failed.
To solve this, SaliMory decomposes the learning signal into a dense, hierarchical stage-wise process reward, providing targeted, orthogonal feedback to each specific role using LLM judges.
Reward 1: Response Quality (Multiplicative Gating). The primary objective evaluates the final generated answer $\hat{a}$ against the ground truth $a$ for accuracy $\alpha \in [0,1]$ and personalization quality $p \in [0,1] \cup \{\texttt{n/a}\}$. We formulate this reward using a strict multiplicative gate:
$$R_1 = \begin{cases} \alpha \cdot (1 + \lambda_p \cdot p) & \text{if personalization is applicable} \\ \alpha & \text{otherwise} \end{cases} \tag{1}$$
This multiplicative form embeds a critical inductive bias: personalization is only valuable when the underlying answer is factually correct. By multiplying the terms, an incorrect but highly personalized response receives $R_1 \approx 0$, preventing the policy from learning to "decorate" hallucinated answers.
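Equation (1) can be written as a small function. This is a minimal sketch: the default value of `lambda_p` is a placeholder (the weight is not given in this section), and `None` is used here to encode the n/a case.

```python
from typing import Optional


def response_reward(alpha: float, p: Optional[float], lambda_p: float = 0.5) -> float:
    """R1 from Eq. (1); `p is None` means personalization is not applicable."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p is None:
        return alpha
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Multiplicative gate: personalization only scales a correct answer.
    return alpha * (1.0 + lambda_p * p)
```

For example, `response_reward(0.0, 1.0)` is zero no matter how well personalized the answer is, while a fully correct and fully personalized answer reaches the maximum of `1 + lambda_p`.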
Reward 2: Memory Quality (Asymmetric Penalty). To directly supervise the memory booster’s writing behavior independently of the final answer, a judge evaluates the finalized memory bank $\mathcal{M}_K$ for factuality $\phi_f$ (fraction of entries faithful to the source) and vagueness $\phi_v$ (fraction of entries losing actionable specificity). Defining the composite memory quality as $\gamma = \phi_f \cdot (1 - \phi_v)$, we apply an asymmetric reward:
$$R_2 = -\lambda_{\text{pen}} \cdot (1 - \gamma) + \lambda_{\text{bon}} \cdot \gamma, \qquad \text{where } \lambda_{\text{pen}} > \lambda_{\text{bon}} \tag{2}$$
The asymmetry ($\lambda_{\text{pen}} = 3 \times \lambda_{\text{bon}}$) reflects the fact that low-quality memories actively sabotage downstream generation. This asymmetric penalty exerts a conservative pressure on the booster: it learns that it is strictly better to omit a marginal memory than to encode a vague one.
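A minimal sketch of Equation (2) under the stated 3:1 ratio follows; the absolute scale `lambda_bon = 1.0` is an assumed placeholder. With this ratio, $R_2 = \lambda_{\text{bon}}(4\gamma - 3)$, so the reward only turns positive once $\gamma > 0.75$.

```python
def memory_quality_reward(phi_f: float, phi_v: float, lambda_bon: float = 1.0) -> float:
    """R2 from Eq. (2) with lambda_pen = 3 * lambda_bon, as stated in the text."""
    lambda_pen = 3.0 * lambda_bon
    gamma = phi_f * (1.0 - phi_v)  # composite memory quality
    return -lambda_pen * (1.0 - gamma) + lambda_bon * gamma
```

A perfectly factual and specific bank earns `lambda_bon`, a fully unfaithful or fully vague bank earns `-lambda_pen`, and `gamma = 0.75` is the break-even point.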
Reward 3: Utilization and Filter Quality. The final level provides targeted, isolated feedback to the utilizer and the saliency filter. It is defined as:
$$R_3 = \lambda_\mu \cdot \mu + \lambda_\kappa \cdot \kappa \tag{3}$$
Here, $\mu \in [0,1]$ evaluates how effectively the adapted profile $\mathcal{M}_q$ captures question-relevant constraints without including distractors. Meanwhile, $\kappa$ evaluates the filter’s binary gating decisions against ground-truth labels using a recall-weighted blend, $\kappa = (1 - w_r) \cdot \text{Prec} + w_r \cdot \text{Rec}$, because missing a salient turn (false negative) is worse than flagging an extra non-salient one (false positive).
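The recall-weighted filter score and Equation (3) can be sketched as below. The default `w_r = 0.7`, the unit weights, and the convention of returning 0 for an undefined precision or recall are assumptions for illustration.

```python
from typing import Sequence


def filter_score(pred: Sequence[bool], gold: Sequence[bool], w_r: float = 0.7) -> float:
    """kappa = (1 - w_r) * precision + w_r * recall over binary saliency decisions."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
    return (1.0 - w_r) * precision + w_r * recall


def utilization_filter_reward(mu: float, kappa: float,
                              lambda_mu: float = 1.0, lambda_kappa: float = 1.0) -> float:
    """R3 from Eq. (3)."""
    return lambda_mu * mu + lambda_kappa * kappa
```

With `w_r > 0.5`, a filter that flags every turn (perfect recall, lower precision) scores higher than one that misses half of the salient turns with perfect precision, matching the stated preference for false positives over false negatives.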
Total Reward. The total per-episode reward is $\mathcal{L}_{\text{GRPO}} = R_1 + R_2 + R_3$. The key property is *role-specific credit assignment*: $R_1$ (range $[0, 1 + \lambda_p]$) dominates as the primary objective, $R_2$ (range $[-\lambda_{\text{pen}}, \lambda_{\text{bon}}]$) directly shapes the booster’s writing behavior independently of response quality, and $R_3$ (range $[0, \lambda_\mu + \lambda_\kappa]$) separately shapes the utilizer’s synthesis and the filter’s gating.
### 3.4 Reward-Decomposed Contrastive Learning
Standard GRPO assigns a single advantage scalar to all generative traces within a rollout, affecting the filter, booster, and utilizer equally even though we use role-specific process rewards. Consequently, a rollout might achieve a high total reward simply due to a fortunate generation by the utilizer, which inadvertently sends the upstream booster an unearned positive gradient even if it writes vague or poor memories. To fully resolve this intra-episode credit assignment bottleneck, SaliMory exploits our reward decomposition to build role-specific contrastive preference pairs, attributing credit to each role from its own contribution.
Booster Contrastive Pairs ($\mathcal{L}_{\text{cont}}^{\text{boost}}$). For the memory booster, we select winner ($w$) and loser ($l$) rollouts based strictly on the $R_2$ memory quality score, completely ignoring the total episode reward. We compute a contrastive loss scaled by a gap weight $\Delta_{R_2} = |R_2^{w} - R_2^{l}|$. This gap weight actively modulates the gradient magnitude based on preference confidence: when the quality gap between two memory writes is large, the contrastive signal strongly pushes the booster; when marginal, the loss contributes minimally, avoiding noisy updates from ambiguous comparisons.
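The pair selection and gap weighting can be sketched as follows. The exact contrastive loss is not spelled out in this section, so the logistic (DPO-style, reference-free) form below, the `beta` temperature, and the choice of the best and worst rollouts in the group as winner and loser are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def booster_pair(r2_scores: Sequence[float]) -> Tuple[int, int, float]:
    """Pick winner/loser rollouts by R2 only and return the gap weight |R2^w - R2^l|."""
    idx = range(len(r2_scores))
    w = max(idx, key=lambda i: r2_scores[i])
    l = min(idx, key=lambda i: r2_scores[i])
    return w, l, abs(r2_scores[w] - r2_scores[l])


def gap_weighted_loss(logp_w: float, logp_l: float, gap: float, beta: float = 1.0) -> float:
    """Assumed loss form: gap * -log(sigmoid(beta * (logp_w - logp_l)))."""
    x = -beta * (logp_w - logp_l)
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))  # numerically stable
    return gap * softplus
```

When all rollouts share the same $R_2$, the gap is zero and the pair contributes nothing, which is the "ambiguous comparison" case described above.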
Utilizer Contrastive Pairs & Similarity Gating ($\mathcal{L}_{\text{cont}}^{\text{util}}$). For the utilizer, winner and loser rollouts are selected based on the utilization score $\mu$ (from $R_3$). However, a naive comparison here is causally confounded: a superior adapted profile might simply be the result of the booster providing better upstream memories to work with.
To isolate the utilizer’s true synthesis ability from upstream memory noise, we introduce a causal gating mechanism utilizing the Jaccard similarity between the memory banks of the two rollouts:
$$\text{sim}(\mathcal{C}^{w}, \mathcal{C}^{l}) = \frac{|\text{entries}(\mathcal{C}^{w}) \cap \text{entries}(\mathcal{C}^{l})|}{|\text{entries}(\mathcal{C}^{w}) \cup \text{entries}(\mathcal{C}^{l})|} > \tau_{\text{sim}} \tag{4}$$
Contrastive pairs are only formed if this similarity exceeds the threshold $\tau_{\text{sim}}$. By enforcing this strict similarity gate, we ensure that any difference in profile quality is attributable to the utilizer’s own synthesis ability (how it selected, prioritized, and phrased the context) rather than to differences in the underlying available information. Below-threshold pairs are discarded, ensuring the utilizer only receives gradients it actually earned.
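The similarity gate of Equation (4) is a plain Jaccard test. In this minimal sketch, entries are compared by exact string equality, the threshold `tau_sim = 0.8` is a placeholder, and two empty banks are treated as identical; all three choices are assumptions.

```python
from typing import Iterable


def jaccard(entries_w: Iterable[str], entries_l: Iterable[str]) -> float:
    """Jaccard similarity between two memory banks viewed as sets of entries."""
    a, b = set(entries_w), set(entries_l)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty banks are identical
    return len(a & b) / len(union)


def utilizer_pair_allowed(entries_w: Iterable[str], entries_l: Iterable[str],
                          tau_sim: float = 0.8) -> bool:
    """Form a utilizer preference pair only if the upstream memories are near-identical."""
    return jaccard(entries_w, entries_l) > tau_sim
```

Two rollouts that share two of four distinct entries have similarity 0.5 and would be discarded under the placeholder threshold.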
These two role-specific contrastive refinements combine to form our additional learning boost:
$$\mathcal{L}_{\text{Cont}} = \lambda_b \cdot \mathcal{L}_{\text{cont}}^{\text{boost}} + \lambda_u \cdot \mathcal{L}_{\text{cont}}^{\text{util}} \tag{5}$$
Ultimately, this boost does not replace the main RL objective, but acts alongside it to provide the precise credit assignment necessary to train a highly structured cognitive pipeline.
### 3.5 Full Objective and Policy Optimization
The complete per-step loss integrates our stage-wise process rewards into the standard GRPO (shao2024grpo) objective with our contrastive boost:
$$\mathcal{L}_{\text{total}} = \frac{1}{N} \sum_{\xi} \mathcal{L}_{\text{GRPO}}(\xi) + \mathcal{L}_{\text{Cont}} \tag{6}$$
where $N$ is the total number of generative traces and $\mathcal{L}_{\text{GRPO}}$ is the standard clipped surrogate objective with KL regularization, optimized using our hierarchical rewards $R$.
Crucially, this combined objective finalizes our solution to the credit assignment problem through selective gradient routing. The main GRPO loss provides the baseline update and trains the saliency filter, while the $\mathcal{L}_{\text{Cont}}$ boost selectively routes gradients to update only the booster and utilizer parameters based on their isolated preference pairs.
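For context, the group-relative advantage that standard GRPO derives from the summed reward can be sketched as below. This shows only the advantage normalization, not the clipped surrogate or the KL term; the use of the population standard deviation and the `eps` stabilizer are assumptions.

```python
import statistics
from typing import List, Sequence


def total_reward(r1: float, r2: float, r3: float) -> float:
    """Per-episode scalar reward R = R1 + R2 + R3."""
    return r1 + r2 + r3


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Every trace in a rollout (filter, booster, and utilizer outputs alike) inherits the same scalar from this computation, which is why the role-specific contrastive terms are needed on top of it.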
## 4 Experimental Setup
Dataset. To evaluate SaliMory, the dataset must reflect Chat AI scenarios in which the user’s conversation history, current query, and gold response are provided. We therefore select LoCoMo (maharana2024locomo), a long-term conversational memory benchmark with 10 users and $\sim$58 multi-session dialogues per user. The original benchmark provides queries in five categories (multi-hop, temporal, open-domain, single-hop, adversarial), all of which are memory recall queries. We argue that personalized engagements are crucial for conversational AI, yet they are absent from existing memory datasets such as LoCoMo and LongMemEval (wu2024longmemeval). To address this, we extend LoCoMo and introduce a new dataset named LoCoMo-P13n, in which two personalizable query categories, *recommendation* and *implicit personalization*, are curated to support such research (see Appendix [B.1](https://arxiv.org/html/2606.04120#A2.SS1) for details). Furthermore, to validate the generalizability of SaliMory, we also evaluate on an internal dataset of real-world Chat AI traffic (Section [5.5](https://arxiv.org/html/2606.04120#S5.SS5)).
Models and training. The trainable policy $\pi_\theta$ uses Qwen3.5-9B-Instruct (team2026qwen3); the frozen generator $\pi_{\text{gen}}$ uses Qwen3-235B-A22B-Instruct (yang2025qwen3); and the judge uses GPT-4o (hurst2024gpt). For fair comparison, we reproduce all baselines with the same settings; for the zero-shot agentic method, we use Gemini-3-Flash (pichai2025new) as the agent that performs memory management and utilization. Full training details and hyperparameter settings are in Section [B.5](https://arxiv.org/html/2606.04120#A2.SS5). (Footnote 1: We also provide ablations over backbone model scale and training strategy in Appendix [D](https://arxiv.org/html/2606.04120#A4).)
Baselines and ablations. To compare SaliMory against SOTA memory studies, we evaluate four existing memory paradigms: Infinite Context, which treats the entire history as memory; RAG (A-Mem (xu2025mem), MemGAS (xu2025towards)), which processes history into a memory base for future use; Zero-shot Agentic, which uses a powerful LLM as the agent for memory operations; and Learnable Memory (Mem-R1 (yan2025memoryr1)), which learns to memorize and utilize user information in an RL framework.
Evaluation Protocol. To comprehensively assess memory-driven conversational agents, we introduce a novel LLM-as-a-Judge protocol that evaluates fine-grained personalization and memory bottlenecks, with full implementation details provided in Appendices [B.3](https://arxiv.org/html/2606.04120#A2.SS3) and [B.4](https://arxiv.org/html/2606.04120#A2.SS4).
- End-to-End Personalization Quality: Standard memory QA accuracy (jiang2025memory) fails to capture whether an agent effectively uses memory. Thus, we introduce a Three-Step Personalization Judge. Step 1 determines if the response incorporates user memories. Step 2 classifies personalized responses as Good (rich and value-adding), Basic (superficial), or Bad (intrusive, irrelevant, or hallucinated). Step 3 evaluates unpersonalized responses to determine if relevant memories existed but were ignored (Underuse), or if there is genuinely no personalization potential (NA).
- Intermediate Memory Eval: We evaluate the finalized memory bank for Vagueness (entries lacking actionable specificity) and Factuality (faithfulness to the source). We also measure Utilization, scoring how well the synthesized profile captures question-relevant constraints.
- Gated Error-Bucket: For every incorrect response, we trace the failure to its origin (Memory Error, Profile Error, or Generation Error) to pinpoint specific reasoning bottlenecks.
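The gated error-bucket logic, which assigns each incorrect response to its earliest point of failure, can be sketched as follows. The boolean inputs stand in for judge verdicts; how those verdicts are obtained is described in the appendices and is not modeled here, and the function and bucket names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def attribute_error(answer_correct: bool, memory_ok: bool, profile_ok: bool) -> Optional[str]:
    """Return the earliest failing stage, or None if the answer is correct."""
    if answer_correct:
        return None
    if not memory_ok:      # gate 1: salient information lost or distorted in memory
        return "memory"
    if not profile_ok:     # gate 2: memory was fine, but the adapted profile missed it
        return "profile"
    return "generation"    # both upstream stages were fine; the generator failed


def bucket_counts(cases: Iterable[Tuple[bool, bool, bool]]) -> Counter:
    """Count failures per bucket over (answer_correct, memory_ok, profile_ok) triples."""
    return Counter(b for b in (attribute_error(*c) for c in cases) if b is not None)
```

Because the gates are checked in pipeline order, a response with both a bad memory and a bad profile is counted once, as a memory error.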
## 5 Experimental Results and Analysis
This section addresses three research questions: RQ 1 in Section [5.1](https://arxiv.org/html/2606.04120#S5.SS1), RQ 2 in Sections [5.2](https://arxiv.org/html/2606.04120#S5.SS2) and [5.3](https://arxiv.org/html/2606.04120#S5.SS3), and RQ 3 in Sections [5.4](https://arxiv.org/html/2606.04120#S5.SS4) and [5.5](https://arxiv.org/html/2606.04120#S5.SS5). Supplementary results appear in Appendix [D](https://arxiv.org/html/2606.04120#A4).
- RQ 1. Does SaliMory perform better than existing memory approaches?
- RQ 2. What contributes to the improvements made by SaliMory?
- RQ 3. How does SaliMory generalize to different query types in Chat AI scenarios?
### 5.1 Comparative Study
Table 1: Main results on LoCoMo and LoCoMo-P13n. Response quality for recall and personalizable queries and memory quality. Best in bold. ↑/↓: higher/lower is better.
End-to-End Quality. As presented in Table [1](https://arxiv.org/html/2606.04120#S5.T1), SaliMory achieves the highest E2E Accuracy (72.9%) and reaches a 39.8% Good Personalization rate on recall and personalizable queries, respectively. Crucially, although SaliMory is trained on a compact backbone model (Qwen3.5-9B), it outperforms the Zero-shot baseline powered by the much larger Gemini-3-Flash, delivering a 33.2-point absolute gain in Good Personalization. Notably, A-Mem achieves a higher Basic personalization rate than SaliMory, but its Good personalization rate (21.9%) is much lower due to low memory quality. Furthermore, SaliMory nearly eradicates the Bad Personalization rate (1.4% vs. Mem-R1’s 7.4%), indicating that our $R_1$ reward stops the model from "decorating" hallucinated responses.
Why do prior memory methods fail at scale? Different paradigms prioritize the wrong stages. Infinite Context provides all information but fails at utilization (39.3% Underuse) because raw logs overwhelm the reasoning capacity. Conversely, methods like MemGAS and Mem-R1 focus heavily on memory formation but struggle to effectively retrieve and apply it, resulting in high NA rates (66.2% and 34.6%).
Why does SaliMory help? By orchestrating memory into cognitively-structured stores and applying a role-decomposed RL framework, SaliMory explicitly co-optimizes both memory formation and cue-driven utilization, drastically reducing NA (9.4%) and Underuse (5.3%) responses.
Memory Quality. SaliMory reduces the Vague memory rate to 19.6% (vs. 35.7% for Mem-R1) and increases the Factual rate to 93.9%. This validates that SaliMory successfully filters noise during memory formation. While the Gemini-3-Flash agentic baseline achieves decent memory factuality (86.1%) and high QA Accuracy (70.9%), its low personalization score indicates that without dedicated training for memory utilization, even large frontier models fail to exploit memory context. With our stage-wise process reward and contrastive refinement, SaliMory bridges this gap, enabling a significantly smaller open-source model to effectively couple memory formation with downstream utilization.
### 5.2 Ablation Study
Table 2: RL training ablation studies. Best in each section in bold. ↑/↓: higher/lower is better.
To understand what contributes to the improvements made by SaliMory, we conducted extensive ablation studies on reward compositions, contrastive refinement components, and memory types.
Reward composition. Table [2](https://arxiv.org/html/2606.04120#S5.T2)(a) validates our stage-wise process reward. Using only $R_1$ (end-to-end) yields Accuracy of 65.8% and Utilization of 0.747. Adding $R_2$ (memory quality) improves Accuracy to 69.2% and drops Vague from 28.3% to 22.4%. Incorporating $R_3$ (utilization) brings the system to full performance (72.9%, 0.917). This confirms that sparse global signals are inadequate for training intermediate memory operations: explicit stage-wise supervision is necessary.
Figure 3: Impact of memory types on QA accuracy and personalization quality.
Contrastive refinement components. As shown in Table [2](https://arxiv.org/html/2606.04120#S5.T2)(b), Booster contrastive training specifically improves memory formation (raising Factuality to 92.4%) while dropping the Vague rate from 31.7% to 20.6%. Conversely, Utilizer contrastive training specifically enhances retrieval (boosting Good Personalization to 35.9% and the utilization score to 0.906). Combining both ensures each module only receives gradients it earned, yielding full system performance.
Memory type comparison\.As illustrated in Figure[3](https://arxiv.org/html/2606.04120#S5.F3), isolating the memory banks reveals their distinct cognitive roles\. The Working Memory bank primarily drives QA Accuracy \(66\.2%\), while the Preference bank drives Good Personalization \(22\.7%\)\. Crucially, merely combining these raw banks bottlenecks Good Personalization at 28\.6%\. It is only when the Utilizer synthesizes these stores into a query\-adapted profile that personalization leaps to 39\.8%, proving that active context construction is strictly superior to raw retrieval\.
### 5.3 Training Dynamics Analysis
Figure [4](https://arxiv.org/html/2606.04120#S5.F4) illustrates SaliMory’s training dynamics. Panel (a) reveals a bottom-up causal learning chain: the memory quality reward ($R_2$) improves first (rising from $-0.26$ to $-0.04$) before the utilization reward ($R_3$) and the response reward ($R_1$) follow. Panel (b) shows both contrastive losses converging (booster: 2.6 → 0.7, utilizer: 1.4 → 0.35), demonstrating that role-decomposed preference signals provide targeted credit assignment. Panel (c) validates that these training signals translate to downstream quality: Vague Rate drops to 19.6%, Utilization Score reaches 0.917, and Good P13n Rate improves 6× to 39.8%, with the improvement order mirroring the reward progression.
### 5.4 Gated Error Analysis
We conduct a gated error analysis to isolate where failures occur in the pipeline (Figure [1](https://arxiv.org/html/2606.04120#S1.F1)). Memory creation errors account for 37.9% of failures in MemGAS and 25.3% in Mem-R1, indicating that poor memory formation is the primary bottleneck in existing systems. SaliMory not only has the lowest memory (16.4%) and profile generation (5.8%) error rates, but also reaches the highest overall correctness rate (72.9%).
Why does SaliMory succeed? By enforcing selective attention and using the asymmetric R2 penalty, SaliMory actively prevents vague or hallucinated information from polluting the memory stores. With the upstream memory bottleneck addressed, the remaining failures are largely isolated to the frozen generator, which suggests that the memory management module fulfills its role.
Figure 4: Training dynamics of SaliMory. (a) Stage-wise process rewards. (b) Policy and contrastive losses. (c) Downstream quality metrics. Curves are EMA-smoothed.
### 5.5 Efficiency & Generalizability Analysis
Efficiency Analysis. Beyond quality, practical deployment demands efficiency. As shown in Table [3](https://arxiv.org/html/2606.04120#S5.T3)(a), the Zero-shot baseline indiscriminately stores 127.4 generic memories per user, creating a severe computational bottleneck with 38 seconds of memory formation latency. In contrast, SaliMory's saliency filter and booster condense context into just 89.7 highly specific entries. This filtering reduces memory formation latency to 0.5 seconds, a $76\times$ speedup. At runtime, inference latency drops nearly $5\times$ (1.2s to 0.26s), indicating that feeding the generator a synthesized profile is far more efficient than processing raw retrieval chunks.
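As a quick check of the ratios quoted above, the following sketch recomputes them from the figures in the text. The variable names are illustrative and do not come from the paper; only the numbers are taken from the paragraph.

```python
# Figures quoted in the efficiency paragraph (Table 3(a)); names are illustrative.
zero_shot_formation_s = 38.0   # memory formation latency, zero-shot baseline
salimory_formation_s = 0.5     # memory formation latency, SaliMory
zero_shot_inference_s = 1.2    # inference latency, zero-shot baseline
salimory_inference_s = 0.26    # inference latency, SaliMory
zero_shot_entries = 127.4      # stored memories per user, zero-shot baseline
salimory_entries = 89.7        # stored memories per user, SaliMory

formation_speedup = zero_shot_formation_s / salimory_formation_s
inference_speedup = zero_shot_inference_s / salimory_inference_s
entry_reduction = 1.0 - salimory_entries / zero_shot_entries

print(f"formation speedup: {formation_speedup:.0f}x")
print(f"inference speedup: {inference_speedup:.1f}x")
print(f"fewer stored entries: {entry_reduction:.1%}")
```

The inference ratio comes out at about 4.6, which the text rounds to "nearly 5×".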
Real-World Generalizability. We deployed SaliMory on an internal dataset of real-world Chat AI traffic. As shown in Table [3](https://arxiv.org/html/2606.04120#S5.T3)(b), the relative margins remain highly consistent. Even in this noisy, unstructured setting, SaliMory more than triples the Good Personalization rate (1.7% → 5.5%), lowers the Vague memory rate from 63.9% to 46.1%, and raises memory Factuality from 41.9% to 60.7%. This suggests that SaliMory's cognitive architecture and role-decomposed training generalize beyond benchmark-specific distributions.
Table 3: (a) Efficiency and memory store statistics; (b) Results on real-world Chat AI traffic data.
## 6 Conclusion
We presented SaliMory, a learnable agentic memory-management framework that addresses the fundamental challenge of conversational memory: not what to store, but how to manage it. Our work makes three key contributions. First, a cognitively-inspired memory architecture that separates factual snapshots, subjective preferences, and working memory, enabling structured reasoning over heterogeneous user information. Second, a stage-wise, role-decomposed RL training paradigm that combines hierarchical process rewards with reward-decomposed contrastive refinement, directly addressing the credit-assignment problem that has limited prior RL-based memory agents. Third, an extended evaluation protocol and a new benchmark, LoCoMo-P13n, which jointly stress precise recall and nuanced personalization across diverse query types in Chat AI. Empirically, SaliMory outperforms strong baselines while operating at roughly 5× lower inference latency than the zero-shot baseline. These results suggest that memory management, grounded in cognitive science and trained with decomposed rewards, is key to building lifelong conversational agents.
## References
## Appendix A Use of LLM
In writing this paper, we used LLMs solely to check for typos and grammatical errors; they were not used for any other purpose.
## Appendix B Experimental Setup Details
### B.1 Dataset Details
Table [4](https://arxiv.org/html/2606.04120#A2.T4) summarizes the extended LoCoMo dataset. The original benchmark (maharana2024locomo) provides QA pairs across four evaluation categories (multi-hop, temporal, open-domain, single-hop); we contribute two additional categories, recommendation and implicit personalization, that require memory-driven personalization, adding 1,108 QA pairs (42% of the total) to form a new dataset named LoCoMo-P13n. Category 5 (adversarial/unanswerable) is excluded following prior work. Each user's conversation history spans multiple sessions with natural temporal gaps.
#### How we construct the data
We use an LLM, LlaMa-3.2-70B, to semi-automatically generate the data following a four-step strategy: (1) extracting relevant information from conversations, (2) identifying conversation reference points, (3) aggregating related information, and (4) composing questions that, rather than directly querying conversation content, use the extracted signals to enable personalized or recommended responses. A linguistic expert with a background in data science then manually reviewed LoCoMo-P13n.
Table 4: Dataset statistics.
#### Real\-world Chat AI dataset
To verify the generalizability of SaliMory, we further collect data from real-world Chat AI traffic produced by real users. We use each user's past 28 days of chat history as the raw history from which memory is formed. This dataset cannot be open-sourced due to privacy constraints, but the results in Section [5.5](https://arxiv.org/html/2606.04120#S5.SS5) verify the effectiveness of SaliMory in a real-world application.
### B.2 SaliMory Training Algorithm
SaliMory's training steps are shown in Algorithm [1](https://arxiv.org/html/2606.04120#alg1). To ensure that the multi-stage pipeline does not bottleneck compute, each training step uses GPU and API resources in a complementary way. First, the policy generates traces entirely on the GPU, and reference log-probabilities are pre-computed and cached. Second, the frozen generator and reward judges run concurrently via API, allowing the reference model to be safely offloaded to the CPU. Finally, during the update phase, only the policy model requires GPU memory, since all reference probabilities were previously cached. Each training step performs $E$ inner epochs over this cached rollout data, stopping early if the mean KL divergence exceeds $1.5\times$ the target threshold.
Algorithm 1 SaliMory: Training Step
Require: Batch of episodes $\{(\mathcal{C}_e, q_e, a^*_e)\}$, policy $\pi_\theta$, reference $\pi_{\text{ref}}$, generators/judges $\pi_{\text{gen}}, \pi_{\text{judge}}$
1: for each episode $e$, each of $G$ rollouts do
2: Phase A (Filter): $s_t \leftarrow \pi_\theta^{\text{filt}}(u_t, \tau_t)$ for all $t$; $\mathcal{S} \leftarrow \{t : s_t = \text{True}\}$
3: Phase B (Booster): for $k = 1, \ldots, |\mathcal{S}|$: $\langle d_k, v_k, o_k \rangle \leftarrow \pi_\theta^{\text{boost}}(u_k, \tau_k, \mathcal{M}_{k-1})$; $\mathcal{M}_k \leftarrow \text{Apply}(\mathcal{M}_{k-1}, d_k, v_k, o_k)$
4: Phase C (Utilizer): $\rho \leftarrow \pi_\theta^{\text{util}}(\mathcal{M}_K, q_e)$
5: Cache $\log \pi_\theta^{\text{old}}(y \mid P)$ for all traces; pre-compute $\log \pi_{\text{ref}}(y \mid P)$
6: end for
7: Rewards: $\hat{a} \leftarrow \pi_{\text{gen}}(\rho, \mathcal{C}_{\text{imm}}, q)$; compute $R_1, R_2, R_3$ via $\pi_{\text{judge}}$ ▷ API, concurrent
8: Advantages: $\tilde{A}_{e,g} \leftarrow (R_{e,g} - \bar{R}_e) / \sigma_{R_e}$ for each episode $e$ ▷ GRPO group norm.
9: Contrastive pairs: Construct $\mathcal{P}_b$ by $R_2$; construct $\mathcal{P}_u$ by $\mu$, gated on $\text{sim}(\mathcal{M}^w, \mathcal{M}^l) > \tau_{\text{sim}}$
10: for $i = 1, \ldots, E$ do
11: Compute $\mathcal{L}$ via Eq. [6](https://arxiv.org/html/2606.04120#S3.E6); update $\theta$ ▷ inner epoch
12: if mean KL $> 1.5 \times$ target then
13: break ▷ KL early stopping
14: end if
15: end for
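Step 8 and steps 10 to 15 of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `eps` stabilizer, the choice of the population standard deviation, and the `update_step` callback (assumed to perform one parameter update and return the resulting mean KL) are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Step 8: center and scale the G rollout rewards of a single episode.

    `eps` is an added assumption that avoids division by zero when all
    rollouts in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the paper does not say which variant
    return [(r - mu) / (sigma + eps) for r in rewards]


def run_inner_epochs(
    update_step: Callable[[], float],
    num_epochs: int,
    kl_target: float,
    factor: float = 1.5,
) -> int:
    """Steps 10-15: run up to `num_epochs` updates over the cached rollouts.

    Stops as soon as the mean KL reported by `update_step` exceeds
    `factor * kl_target`. Returns the number of updates actually performed.
    """
    performed = 0
    for _ in range(num_epochs):
        mean_kl = update_step()  # hypothetical: one update, returns mean KL
        performed += 1
        if mean_kl > factor * kl_target:
            break  # KL early stopping
    return performed


# Toy usage with made-up rewards and KL values.
advantages = group_normalized_advantages([0.2, 0.4, 0.6, 0.8])
fake_kls = iter([0.01, 0.02, 0.20, 0.30])
updates = run_inner_epochs(lambda: next(fake_kls), num_epochs=4, kl_target=0.1)
print(advantages, updates)
```

Because the rewards are normalized within each episode's group of rollouts, an episode where every rollout scores the same contributes zero advantage, which is the usual behavior of group-relative normalization.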
### B.3 Three-Step Personalization Judge
Standard single-score LLM-as-judge evaluation conflates whether a response *uses* personal information with whether it uses it *well*, and cannot detect cases where relevant memories existed but went unused. We design a three-step sequential judge that disentangles these dimensions (Figure [5](https://arxiv.org/html/2606.04120#A2.F5)).
#### Task 1: Personalization Detection\.
A binary classifier determines whether the response incorporates personal information from the user’s memory beyond what is already present in the query or immediate conversation\. The judge considers both explicit usage \(e\.g\., “considering you like hiking, I recommend…”\) and implicit usage \(e\.g\., recommending a casual restaurant when the user’s memory includes a preference for casual dining\)\. Crucially, information that is already present in the user’s query or conversational context does not count as personalization—the response must draw on stored memories\. If personalized, proceed to Task 2; otherwise, proceed to Task 3\.
#### Task 2: Personalization Quality \(if personalized\)\.
For personalized responses, a multi\-criteria judge evaluates five dimensions: \(1\)*richness*—does the personalization go beyond simply inserting user information, using multiple memory signals in a natural way? \(2\)*value\-add*—does the personalization enhance the response? \(3\)*style*—is the tone appropriate? \(4\)*relevance*—is the response on\-topic? \(5\)*quality*—is the response coherent and non\-hallucinated? These combine into three quality levels:
- Good: rich personalization (criterion 1) with value-add (criterion 2) and good overall response quality (criteria 3–5).
- Basic: personalization is present but not rich or not value-adding; overall response quality is acceptable. Core case: using the user’s name without deeper integration.
- Bad: personalization is misapplied (intrusive, irrelevant, or offensive) or the overall response quality is poor despite personalization.
#### Task 3: Signal Detection \(if not personalized\)\.
For unpersonalized responses, the judge determines whether relevant memory signals*existed*that could have enhanced the response but were not used\. This detects*underuse*—a failure mode invisible to standard evaluation\. If no relevant signals exist \(e\.g\., a purely factual query or no applicable user memory to personalize the query\), the response is simply marked as not applicable \(NA\)\.
#### Five\-category taxonomy\.
The three tasks yield five mutually exclusive response categories, shown in Figure [5](https://arxiv.org/html/2606.04120#A2.F5).
Figure 5: Three-step personalization judge flowchart. Task 1 detects whether the response uses memory-based personalization. If yes, Task 2 evaluates quality (Good / Basic / Bad). If no, Task 3 checks whether relevant signals existed but went unused (Underuse) or were genuinely absent (Unpersonalized). The five leaf categories are mutually exclusive.
We derive three end-to-end personalization metrics from this taxonomy: P13n Rate (the fraction of responses in any personalized category, including Bad), Good P13n Rate (the fraction in Good Personalized, the strictest measure), and Underuse Rate (the fraction in Underuse, measuring missed opportunities).
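The routing of the three judge tasks into five categories, and the three metrics derived from them, can be sketched as below. The function names, category strings, and the choice of all responses as the denominator are illustrative assumptions; the judge calls themselves are replaced by their already-parsed outputs.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

QUALITY_LEVELS = {"good", "basic", "bad"}


def categorize(
    personalized: bool,
    quality: Optional[str] = None,
    signals_exist: Optional[bool] = None,
) -> str:
    """Map the outputs of Tasks 1-3 to one of five mutually exclusive categories."""
    if personalized:
        # Task 2 applies only to personalized responses.
        if quality not in QUALITY_LEVELS:
            raise ValueError("personalized responses need quality in good/basic/bad")
        return f"{quality}_personalized"
    # Task 3 applies only to unpersonalized responses.
    if signals_exist is None:
        raise ValueError("unpersonalized responses need a signals_exist verdict")
    return "underuse" if signals_exist else "unpersonalized"


def personalization_metrics(categories: Iterable[str]) -> Dict[str, float]:
    """Compute P13n Rate, Good P13n Rate, and Underuse Rate over all responses."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses to score")
    personalized = sum(counts[f"{q}_personalized"] for q in QUALITY_LEVELS)
    return {
        "p13n_rate": personalized / total,
        "good_p13n_rate": counts["good_personalized"] / total,
        "underuse_rate": counts["underuse"] / total,
    }


# Toy usage with four hand-labeled responses.
labels = [
    categorize(True, quality="good"),
    categorize(True, quality="bad"),
    categorize(False, signals_exist=True),
    categorize(False, signals_exist=False),
]
print(personalization_metrics(labels))
```

Note that a Bad response still counts toward P13n Rate, which is why that metric alone can rise while Good P13n Rate falls.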
### B.4 Metric Definitions
Table [5](https://arxiv.org/html/2606.04120#A2.T5) provides definitions for all reported metrics.
Table 5: Evaluation metrics. E2E personalization metrics are derived from the three-step judge (Appendix [B.3](https://arxiv.org/html/2606.04120#A2.SS3)); memory metrics are judged independently.

| Metric | Level | Definition |
|---|---|---|
| QA Accuracy | E2E | Fraction of responses judged correct against ground truth |
| P13n Rate | E2E | Fraction in any personalized category (Appendix [B.3](https://arxiv.org/html/2606.04120#A2.SS3)) |
| Good P13n Rate | E2E | Fraction in Good Personalized (Appendix [B.3](https://arxiv.org/html/2606.04120#A2.SS3)) |
| Underuse Rate | E2E | Fraction in Underuse (Appendix [B.3](https://arxiv.org/html/2606.04120#A2.SS3)) |
| Vague Rate | Memory | Fraction of memory entries too generic to be actionable |
| Factuality Rate | Memory | Fraction of entries faithful to source utterances |
| Profile Util. Score | Memory | Judge-scored relevance of profile $\rho$ to question $q$ ($[0,1]$) |
| Error Bucket | Diagnostic | Gated attribution of each failure to the earliest failing stage: memory $\to$ profile $\to$ generation |

#### Memory quality scoring.
The judge evaluates each entry in $\mathcal{M}_K$ against the original utterance that produced it. An entry is *vague* if it has lost the specificity needed to be actionable (e.g., “User likes food” vs. “User loves Thai curry”). An entry is *non-factual* if it misrepresents the source (e.g., “thinking about getting a dog” $\to$ “owns a dog”).
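Given per-entry verdicts, the two memory metrics reduce to simple fractions. The sketch below assumes the judge's verdicts are already available as `(is_vague, is_factual)` pairs; the function name and input shape are hypothetical. The empty-store conventions (Vague Rate 0.0, Factuality Rate 1.0) follow the R2 judge prompt in Appendix E.2.

```python
from typing import Dict, Iterable, Tuple


def memory_quality_rates(judgments: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Aggregate per-entry (is_vague, is_factual) verdicts into the two rates."""
    judged = list(judgments)
    if not judged:
        # Convention stated in the R2 judge prompt for an empty memory store.
        return {"vague_rate": 0.0, "factuality_rate": 1.0}
    n = len(judged)
    return {
        "vague_rate": sum(1 for vague, _ in judged if vague) / n,
        "factuality_rate": sum(1 for _, factual in judged if factual) / n,
    }


# Toy usage: one vague entry and one non-factual entry out of four.
verdicts = [(True, True), (False, True), (False, False), (False, True)]
print(memory_quality_rates(verdicts))
```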
#### Error\-bucket attribution\.
For each incorrect response, we apply a gated cascade: if the profile contains sufficient information to answer correctly, the failure is a Generation Error; otherwise, if the memory contains the information but the profile missed it, it is a Profile Error; otherwise, it is a Memory Error. Each failure is thus assigned to the earliest failing stage.
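The gated cascade amounts to three ordered checks. In the sketch below, the two sufficiency flags stand in for judge verdicts that the paper obtains separately; the function and argument names are illustrative.

```python
from typing import Optional


def error_bucket(
    is_correct: bool,
    profile_has_answer: bool,
    memory_has_answer: bool,
) -> Optional[str]:
    """Attribute an incorrect response to its earliest failing stage.

    Returns None for correct responses, which are not bucketed.
    """
    if is_correct:
        return None
    if profile_has_answer:
        # The profile was sufficient, so the generator is at fault.
        return "generation_error"
    if memory_has_answer:
        # Memory held the information but the profile dropped it.
        return "profile_error"
    # The information never made it into memory.
    return "memory_error"


print(error_bucket(False, profile_has_answer=False, memory_has_answer=True))
```

Checking the profile first means a failure is blamed on memory only when neither downstream artifact contains the needed information.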
### B.5 Hyperparameters
Table 6: Hyperparameters for SaliMory training.

| Group | Parameter | Value |
|---|---|---|
| Training | GRPO group size $G$ | 4 |
| Training | Batch size | 8 |
| Training | Training epochs | 5 |
| Training | Inner update epochs $E$ | 4 |
| Optimization | Learning rate | $10^{-5}$ |
| Optimization | Optimizer | AdamW |
| Optimization | Warmup ratio | 0.1 |
| Optimization | Gradient clip norm | 1.0 |
| RL | Clip ratio $\epsilon$ | 0.2 |
| RL | KL coefficient $\lambda_{\text{KL}}$ | 0.08 |
| RL | KL target (early stopping) | 1.5 |
| Reward | Personalization weight $\lambda_p$ | 0.45 |
| Reward | Memory penalty $\lambda_{\text{pen}}$ | 0.5 |
| Reward | Memory bonus $\lambda_{\text{bon}}$ | 0.3 |
| Reward | Utilization weight $\lambda_\mu$ | 0.15 |
| Reward | Filter weight $\lambda_\kappa$ (recall $w_r = 0.6$) | 0.1 |
| Contrastive | Booster weight $\lambda_b$ | 0.15 |
| Contrastive | Utilizer weight $\lambda_u$ | 0.1 |
| Contrastive | Temperature $\eta$ | 0.15 |
| Contrastive | Memory similarity threshold $\tau_{\text{sim}}$ | 0.15 |

For training, we use transformers and PyTorch on a compute node with 8 A100 GPUs.
## Appendix C Limitation
Figure 6: Per-category accuracy. SaliMory improves all categories, with the largest gains on recommendation and multi-hop questions that require memory synthesis.
Despite SaliMory's success, we note two limitations for future study.
Dependence on a frozen large LLM for judging. All three reward levels rely on a frozen large LLM (GPT-4o): response accuracy, memory quality, and utilization are all assessed via LLM-as-a-judge calls. The training signal is therefore bounded by the judge model's own evaluation capability and biases, and the reliance on API calls can hinder training stability.
Sequential memory formation limits scalability to long conversations. The booster must process salient turns sequentially (each turn needs the updated memory state from the previous one), so memory formation latency scales linearly with conversation length. For the current dataset (16 turns/episode), this is manageable, but for real-world deployment with hundreds or thousands of turns per user, it becomes a bottleneck that batch parallelization cannot resolve.
## Appendix D Supplementary Analysis
### D.1 Backbone model and Training strategy ablation
Table 7: SaliMory results using different training strategies and backbone models. Best in each section in bold. ↑/↓: higher/lower is better.
#### Backbone models.
Table [2](https://arxiv.org/html/2606.04120#S5.T2)(d) shows that Llama3.2-1B collapses (Accuracy 29.6%, Vague 74.1%), Llama3.2-3B achieves 58.8% Accuracy, and Qwen3.5-9B reaches the best result, 72.9%. This suggests that the cognitive memory roles require a minimum level of base reasoning capacity to execute effectively.
#### Training strategy\.
As presented in Table [2](https://arxiv.org/html/2606.04120#S5.T2)(a), REINFORCE and PPO yield Accuracy of only 36.9% and 39.2%, with Vague rates remaining high at 68.6% and 67.9%. Switching to GRPO provides a large improvement (Accuracy 68.6%, Vague 31.7%), indicating that relative scoring within a rollout group is crucial for stable learning in long-horizon memory tasks.
Why does SaliMory push further? By augmenting GRPO with contrastive refinement, SaliMory achieves 72.9% Accuracy and a 19.6% Vague rate, suggesting that standard GRPO alone is insufficient to fully resolve credit assignment in multi-stage memory management.
### D.2 Memory Comparison Analysis
Table 8: Impact of vague memories. Filtering vague entries nearly doubles the good-personalization rate.
We directly validate that memory quality drives downstream performance by comparing retrieval under controlled memory conditions (Table [8](https://arxiv.org/html/2606.04120#A4.T8)). Filtering vague entries improves Recall@5 from 35.9% to 52.1% and good-personalization from 22.2% to 40.9%. Vague memories act as retrieval noise, displacing useful entries from the top-$k$ results. SaliMory's booster directly addresses this via the asymmetric $R_2$ penalty.
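The displacement effect can be illustrated with a toy Recall@k computation. The paper does not spell out its Recall@5 definition, so this sketch assumes the standard one (the share of relevant entries that appear among the top k retrieved); the entry IDs and the ranking are made up.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of relevant entries found among the top-k retrieved entries."""
    if not relevant_ids:
        raise ValueError("recall is undefined without relevant entries")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical ranking: "v*" are vague entries, "m*" are specific, relevant ones.
ranked = ["v1", "v2", "m1", "v3", "v4", "m2", "m3"]
relevant = {"m1", "m2", "m3"}

with_vague = recall_at_k(ranked, relevant, k=5)
filtered = [entry for entry in ranked if not entry.startswith("v")]
without_vague = recall_at_k(filtered, relevant, k=5)
print(with_vague, without_vague)
```

In this toy ranking, four vague entries occupy top-5 slots, so only one of the three relevant entries is retrieved; removing them lets all three through.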
### D.3 More Memory ≠ Better Responses
Table 9: Infinite context degrades performance. More memory increases personalization *attempts* but collapses their quality.
Table [9](https://arxiv.org/html/2606.04120#A4.T9) tests whether providing all memories (infinite context) yields an upper bound on performance. Win rate drops from 40.9% to 27.6% and the mis-recommendation rate rises from 8.3% to 29.6%. The generator attempts *more* personalization (P13n rate rises) but gets it wrong (good P13n collapses). This motivates the utilizer's role: compress and filter memories into a question-relevant profile rather than passing raw context.
### D.4 Per-Category Breakdown
Figure [6](https://arxiv.org/html/2606.04120#A3.F6) reports performance by question category. SaliMory achieves the largest accuracy gains on multi-hop (+4pp) and recommendation (+5pp), categories requiring multi-entry synthesis, while single-hop and open-domain show minimal improvement (+1pp), as single-fact retrieval already works well. For personalization, SaliMory lifts Good P13n from 4.2% to 44.5% on recommendation and from 9.0% to 35.1% on implicit personalization, suggesting that our cognitive memory architecture and trained utilizer help most on synthesis-heavy queries.
## Appendix E Reward Judge Prompts
All reward signals in SaliMory are computed by frozen LLM judges. Below we provide the judge prompt templates for each reward level.
### E.1 R1 Response Quality Judge
R1 Response Quality
\#\# Context
You are a judge who evaluates a chatbot response on two independent axes: accuracy and personalization.
The user has the following personal information:
Memory: {memory}
Here is the immediately preceding conversation: {conversation}
Here is the user query: {question}
Here is the response to evaluate: {response}
Here is the golden response (ground truth): {label}
\#\# Evaluation Criteria
\#\#\# 1. Correctness (boolean)
Does the response answer the query correctly? Compare against the golden response.
- If the response is factually wrong or fails to answer the question, correctness is false.
- If the response addresses the question correctly (even if worded differently from the golden response), correctness is true.
\#\#\# 2. Accuracy Score (0.0–1.0)
How correct is the response compared to the golden answer? This is independent of personalization.
- 0.0: Completely wrong or fails to answer the question.
- 0.3: Partially addresses the question but with significant errors.
- 0.5: Partially correct with some relevant information.
- 0.7: Mostly correct with minor gaps.
- 1.0: Fully correct, equivalent to the golden response.
\#\#\# 3. Personalization Score (0.0–1.0 or "n/a")
How well does the response use personal information from memory?
- 0.0: Misused or irrelevant personal information that hurts the response.
- 0.3: Personal info acknowledged but not meaningfully integrated.
- 0.5: Superficial use of personal information.
- 0.7: Good use of relevant personal information that enhances the response.
- 1.0: Excellent use of personal information, comparable to the golden response.
- "n/a": The question does not require personalization, or no relevant memory exists.
\#\# Output
Provide your evaluation as a JSON object with:
- `correctness`: Boolean (true/false).
- `accuracy_score`: String, a number from 0.0 to 1.0.
- `personalization_score`: String, a number from 0.0 to 1.0, or "n/a" if not applicable.
Output the JSON object only, nothing else.
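Because the prompt asks for scores as strings and allows "n/a", the judge output needs light validation before it can feed a reward. The parser below is a hedged sketch: the function names and the decision to map "n/a" to `None` are assumptions, and the paper does not describe how it handles malformed judge output.

```python
import json
from typing import Any, Dict, Optional


def _parse_score(value: Any) -> float:
    """Convert a string or numeric score to a float and check the [0, 1] range."""
    score = float(value)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def parse_r1_judgment(raw: str) -> Dict[str, Any]:
    """Parse the JSON object requested by the R1 judge prompt."""
    data = json.loads(raw)
    correctness = data["correctness"]
    if not isinstance(correctness, bool):
        raise ValueError("correctness must be a JSON boolean")
    p_raw = data["personalization_score"]
    personalization: Optional[float]
    if isinstance(p_raw, str) and p_raw.strip().lower() == "n/a":
        personalization = None  # assumption: "n/a" means no personalization term
    else:
        personalization = _parse_score(p_raw)
    return {
        "correctness": correctness,
        "accuracy_score": _parse_score(data["accuracy_score"]),
        "personalization_score": personalization,
    }


example = '{"correctness": true, "accuracy_score": "0.7", "personalization_score": "n/a"}'
print(parse_r1_judgment(example))
```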
### E.2 R2 Memory Quality Judge
R2 Memory Quality
\#\#\# Task Description
You are a professional memory analyst. Your job is to evaluate a memory collection extracted from a conversation on two axes: vagueness and factuality.
\#\#\# Key Definitions
\#\#\#\# Vagueness
- \*\*Vague memory entry\*\*: A memory entry that is too generic to be actionable for personalization. It loses specific details from the original conversation. For example, "User likes food" is vague because it lost what kind of food; "User traveled" is vague because it lost where and when.
- \*\*Specific memory entry\*\*: A memory entry that captures concrete, actionable information that can personalize future responses. For example, "User has a peanut allergy" or "User returned from Japan trip last week" are specific.
\#\#\#\# Factuality
- \*\*Factual memory entry\*\*: A memory entry that accurately represents what was said in the conversation, without hallucinating or distorting details.
- \*\*Non-factual memory entry\*\*: A memory entry that misrepresents what was said, invents details, or draws incorrect conclusions.
\#\#\# Vagueness Examples
Conversation utterances:
- Can you recommend some Thai restaurants near downtown?
- I just got back from a trip to Japan last week
- I’m training for a marathon in October
Vague memory: "user\_facts": \["User likes food", "User is interested in running"\], "working\_memory": \["User traveled recently"\]
-> 3 entries, all 3 are vague -> vague\_rate = 1.0
Specific memory: "user\_facts": \["User has a peanut allergy", "User is training for a marathon in October"\], "working\_memory": \["User returned from Japan trip last week", "User is looking for Thai restaurants near downtown"\]
-> 4 entries, 0 are vague -> vague\_rate = 0.0
\#\#\# Factuality Examples
Utterance: "I went to Japan last week" -> Memory: "User went to China last week" -> \*\*Non-factual\*\* (wrong country)
Utterance: "I’m thinking about getting a dog" -> Memory: "User owns a dog" -> \*\*Non-factual\*\* (considering is not owning)
Utterance: "My daughter starts college in September" -> Memory: "User’s daughter is starting college in September" -> \*\*Factual\*\*
\#\#\# Input
Conversation utterances: {utterances}
Extracted memory: {memory}
\#\#\# Instructions
1. List all individual memory entries across all categories (user\_facts, user\_preference, working\_memory, etc.).
2. For each entry, determine if it is vague or specific. Calculate: vague\_rate = (number of vague entries) / (total number of entries). If there are no entries, vague\_rate = 0.0.
3. For each entry, determine if it is factual or non-factual based on the conversation. Calculate: factuality\_rate = (number of factual entries) / (total number of entries). If there are no entries, factuality\_rate = 1.0.
\#\#\# Output
Output a JSON object with:
- `vague_rate`: A string representing a number from 0.0 to 1.0, indicating the fraction of memory entries that are vague.
- `factuality_rate`: A string representing a number from 0.0 to 1.0, indicating the fraction of memory entries that are factual.
Output the JSON object only, nothing else.
### E.3 R3 Utilization Quality Judge
R3 Utilization Quality
\#\#\# Task Description
You are a professional analyst evaluating the quality of a user profile summary for answering a specific question.
\#\#\# Context
A utilizer agent has created a user profile summary from memory and conversation to help answer a question. Your job is to evaluate how useful this profile is for generating a correct, personalized answer.
\#\#\# Input
Question: {question}
User Profile Summary: {user\_profile}
Memory State: {memory\_state}
Conversation: {conversation}
Golden Answer (ground truth): {ground\_truth}
\#\#\# Evaluation Rubric (0.0–1.0)
- 0.0: Profile is irrelevant or contains wrong information that would mislead the answer.
- 0.3: Profile exists but misses key information needed to answer the question.
- 0.5: Contains some relevant information but incomplete or noisy.
- 0.7: Captures the essential information needed for answering the question correctly.
- 1.0: Highly relevant, accurate, and provides exactly the right context for a personalized answer.
\#\#\# Output
Output a JSON object with:
- `utilization_score`: A string representing a number from 0.0 to 1.0.
Output the JSON object only, nothing else.