SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

arXiv cs.CL Papers

Summary

SAGE proposes a novelty gate for memory evolution in agentic LLMs, using a von Mises-Fisher-based density estimator to decide whether to add, merge, or ignore new facts, reducing LLM calls while maintaining memory quality.

arXiv:2605.30711v1 Announce Type: new Abstract: Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.

# SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs
Source: [https://arxiv.org/html/2605.30711](https://arxiv.org/html/2605.30711)
Sijia Wang, Dhanajit Brahma, Ricardo Henao. Duke University. {sijia.wang, dhanajit.brahma, ricardo.henao}@duke.edu

###### Abstract

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as Add, clearly redundant facts as Noop, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16–18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.

Sijia Wang and Dhanajit Brahma contributed equally to this work.

## 1 Introduction

Every memory system, from a relational database (Codd, [1970](https://arxiv.org/html/2605.30711#bib.bib12)) to a modern LLM agent (Park et al., [2023](https://arxiv.org/html/2605.30711#bib.bib26); Packer et al., [2023](https://arxiv.org/html/2605.30711#bib.bib5)), must solve three problems in sequence: decide what to *write*, organize it so it can be *found*, and *retrieve* the right information when needed. In agentic LLM memory, the community has invested heavily in the second and third problems, including embedding models (Peña and Herbold, [2025](https://arxiv.org/html/2605.30711#bib.bib18)), vector indexes (Douze et al., [2025](https://arxiv.org/html/2605.30711#bib.bib17); Johnson et al., [2019](https://arxiv.org/html/2605.30711#bib.bib13)), hybrid retrieval (Ma et al., [2020](https://arxiv.org/html/2605.30711#bib.bib16); Sawarkar et al., [2024](https://arxiv.org/html/2605.30711#bib.bib15); Hsu and Tzeng, [2025](https://arxiv.org/html/2605.30711#bib.bib14)), and knowledge graphs (Rasmussen et al., [2025](https://arxiv.org/html/2605.30711#bib.bib3)), while the first has received comparatively little principled attention. Yet the write decision is arguably the most consequential of the three: a memory that is never written cannot be retrieved, and a memory that is written incorrectly (duplicated, merged with an unrelated fact, or prematurely deleted) will degrade downstream queries that touch it. How difficult this write decision is depends on the memory paradigm.

Standard Retrieval-Augmented Generation (RAG) writes are nearly decision-free: segment, embed, append (Karpukhin et al., [2020](https://arxiv.org/html/2605.30711#bib.bib24)). Long-term agentic systems cannot afford this luxury. An agent interacting over weeks or months must track an evolving state: changing preferences, shifting goals, and corrected facts. This forces agentic memory systems to confront the dilemma of semantic CRUD (Lyu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib19); Lee et al., [2024a](https://arxiv.org/html/2605.30711#bib.bib6)): they must edit their own knowledge base in natural language, continuously deciding whether to add, update, consolidate, or discard information rather than simply accumulating it. Current systems delegate this decision to an LLM: Mem0 issues a tool call that jointly routes and rewrites each batch of extracted facts (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)); A-Mem adds further calls for note construction and neighbor evolution (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)). These designs produce adaptive memory stores, but make the write path the dominant source of cost. We argue that the missing alternative is a *novelty gate*: a cheap, closed-form test that routes clearly new facts to Add, clearly redundant facts to Noop, and only ambiguous cases to an LLM merge call.

The paper makes three contributions: (i) It frames memory evolution in agentic LLMs as a novelty-detection problem, clarifying why write-side control is the lever that affects both memory quality and system efficiency. (ii) It proposes SAGE (Spherical Adaptive Gate for memory Evolution), a theoretically grounded novelty gate whose score is computed using vMF density estimation, together with an adaptive threshold that tracks the evolving geometry of the memory store. (iii) It provides evidence across two settings: as a full system, SAGE wins 7/7 open-weight backbones on token-$F_1$ against Mem0 while cutting add-phase API cost by $3.4\times$ on GPT-4o-mini; as a drop-in Noop gate on A-Mem, it skips 16–18% of write LLM calls across five models with $\leq 0.5\%$ token-$F_1$ change.

## 2 Related Work

Memory for Agentic LLMs. Long-term memory has become a central topic in LLM-agent research because raw context extension does not reliably solve multi-session reasoning (Zhang et al., [2024](https://arxiv.org/html/2605.30711#bib.bib2); Maharana et al., [2024](https://arxiv.org/html/2605.30711#bib.bib1)). Prior work falls into three broad categories. *Retrieval and compression* methods reduce long histories to retrievable summaries: MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.30711#bib.bib4)) applies Ebbinghaus-inspired forgetting, ReadAgent (Lee et al., [2024b](https://arxiv.org/html/2605.30711#bib.bib8)) compresses conversations into gist memories, and Generative Agents (Park et al., [2023](https://arxiv.org/html/2605.30711#bib.bib26)) consolidate observations through periodic LLM-driven reflection. *Structured and hierarchical* approaches impose richer organization: Zep (Rasmussen et al., [2025](https://arxiv.org/html/2605.30711#bib.bib3)) and Mem0$_g$ (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)) maintain temporal or entity-relation knowledge graphs, while MemGPT (Packer et al., [2023](https://arxiv.org/html/2605.30711#bib.bib5)) introduces OS-style paging between working memory and an external store. Finally, *learned representations* such as MEM1 (Zhou et al., [2025](https://arxiv.org/html/2605.30711#bib.bib27)) train a compact internal state via end-to-end RL. Across all three categories, write policies remain either fixed (append-only, forgetting curves, heuristic eviction) or fully delegated to per-fact LLM judgment; efficient write-side control of memory evolution remains an open problem.

![Refer to caption](https://arxiv.org/html/2605.30711v1/x1.png)

Figure 1: Overview of the memory evolution problem and our proposed approach, SAGE.

Memory Evolution. Recent agentic memory systems treat memory as an editable structure rather than an append-only log. Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)) extracts salient facts and uses an LLM-mediated controller to choose among Add, Update, Delete, and Noop. A-Mem (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)) extends this to full memory evolution, constructing structured notes with contextual descriptions and rewriting linked neighbors as new evidence arrives. A newer line replaces prompted write control with reinforcement learning: Memory-R1 (Yan et al., [2025](https://arxiv.org/html/2605.30711#bib.bib21)) trains a dedicated memory manager via PPO/GRPO, with reward derived from downstream QA performance, and Mem-$\alpha$ (Wang et al., [2025](https://arxiv.org/html/2605.30711#bib.bib22)) similarly uses RL to optimize memory construction across core, episodic, and semantic stores, demonstrating strong length generalization. Overall, prior work shows that write-side memory control is essential, but existing approaches sit at two costly extremes: repeated LLM-based deliberation at inference time or rollout-intensive RL optimization at training time. Our work explores a third point in the design space, treating memory evolution as a novelty-aware control problem in which the system first estimates whether an incoming fact is sufficiently new to justify memory editing. This framing yields a lightweight, geometry-aligned controller that preserves the benefits of adaptive memory evolution while avoiding both the inference overhead of pure LLM routing and the training overhead of RL-based policy learning.

## 3 Methodology

An agentic LLM memory system maintains a persistent store of facts and observations across conversation sessions. In each user interaction, it extracts candidate facts, such as preferences, goals, or contextual details, from the current turn. For each candidate, the system makes a *write-side* decision among three actions: Add, which stores the fact as a new memory; Update, which merges the fact with an existing memory that it refines, corrects, or supersedes; and Noop, which ignores the fact because the information is already covered by the current memory store. We call the component that makes this decision the *routing controller*. Figure [1](https://arxiv.org/html/2605.30711#S2.F1) summarizes this workflow and shows where the novelty-score-based gating operates relative to candidate fact extraction, novelty scoring, and update-time reasoning. In this section, we formalize write-side memory control as a novelty-detection problem and introduce SAGE (Spherical Adaptive Gate for memory Evolution) as the routing controller. We first define the problem, then motivate the von Mises-Fisher (vMF) distribution as the foundation of a kernel density estimator that scores how novel each candidate fact is relative to the current memory store, and finally route each candidate to Add, Update, or Noop via an adaptive threshold.

### 3.1 Problem Definition

We begin by defining the system components before formalizing the decision problem. A *stored memory item* is a candidate fact previously extracted from a user interaction and committed to persistent storage (e.g., “the user prefers morning meetings”). Each memory item is embedded by a sentence embedding model (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.30711#bib.bib10)) and $\ell_2$-normalized onto the unit hypersphere $\mathbb{S}^{d-1} = \{\mathbf{z} \in \mathbb{R}^{d} : \|\mathbf{z}\|_2 = 1\}$. The current memory scope is therefore a set of unit-norm embedding vectors $\mathcal{M} = \{\mathbf{m}_1, \ldots, \mathbf{m}_N\}$, where $\mathbf{m}_i \in \mathbb{S}^{d-1}$. In practice, this scope consists of the stored memory items paired with their embedding vectors: the downstream memory writing and rewriting operate on the associated memory items, as in prior works such as Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)) and A-Mem (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)), while the embedding vectors are used during routing or retrieval. During each *interaction* (a conversation turn or session), the system extracts one or more candidate facts by making an LLM call, again following the fact-extraction stage used in systems such as Mem0 and A-Mem. Let $c$ denote a candidate fact and $\mathbf{c} \in \mathbb{S}^{d-1}$ its normalized embedding. The routing controller must then decide which action to take for the candidate fact $c$.

### 3.2 From Memory Evolution to Novelty Detection

Routing is difficult because different mistakes have different costs: an overly conservative controller discards new information; an overly permissive one accumulates near-duplicates that degrade retrieval; and an unreliable one may conflate related but distinct facts (e.g., merging “flight departs at 8 am” with “meeting starts at 8 am”), corrupting accurate records. Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)) invokes an LLM controller on every batch of candidate facts regardless of novelty; A-Mem (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)) adds further LLM calls for note construction and for rewriting nearby stored memories to keep related notes consistent. In both, routing cost scales with *all* candidate facts.

We therefore introduce a novelty score as a first routing stage before any update-time LLM call. The goal is to separate candidates that are likely new from those that are likely redundant, and to send only the remaining uncertain cases to the LLM update step. Here, an uncertain case is one whose score does not strongly favor either Add or Noop. This gate reduces write-time cost by reserving LLM-based updates for those cases rather than for every candidate. In our experiments, this decision stage reduces LLM calls by 60–90% compared to Mem0 on seven of the eight backbones. To our knowledge, existing memory-evolution systems do not include this kind of explicit routing gate; however, this is largely because prior work prioritized memory quality and adaptivity over minimizing controller cost at write time. The next section specifies the gate itself.

The embedding geometry also suggests how to build this gate. Sentence-embedding memory systems operate on $\ell_2$-normalized vectors compared by cosine similarity (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.30711#bib.bib10); Karpukhin et al., [2020](https://arxiv.org/html/2605.30711#bib.bib24)), which for unit vectors is simply their inner product, so semantic comparison is driven by direction rather than by magnitude.

Novelty in this setting should not depend only on the closest stored memory but also on how much support the surrounding memories provide. For example, two candidates can have the same cosine similarity to a memory item yet differ in novelty to the memory scope: one may lie in a region already populated by several similar memories, while the other lies near a more isolated memory. The first candidate is less novel because it is better supported by the existing memory set.

These observations suggest that an inexpensive, novelty-score-based routing rule should: (i) be computationally cheap so that many candidates can be resolved without an LLM call, (ii) operate in the same inner-product geometry as retrieval, and (iii) account for how densely populated the nearby stored memories are when estimating whether the candidate is redundant. A natural way to capture this support is kernel density estimation (KDE), which scores a point by placing a local kernel around each stored memory and summing their contributions. Because the embeddings are unit-norm directional vectors and retrieval depends on angular similarity, we use a kernel that depends only on direction. The von Mises–Fisher (vMF) distribution (Mardia and Jupp, [1999](https://arxiv.org/html/2605.30711#bib.bib25); Banerjee et al., [2005](https://arxiv.org/html/2605.30711#bib.bib20)) is a standard model for directional data on $\mathbb{S}^{d-1}$, so it is an appropriate kernel for spherical KDE. A vMF with mean direction $\boldsymbol{\mu} \in \mathbb{S}^{d-1}$ and concentration $\kappa > 0$ has density

$$f(\mathbf{c} \mid \boldsymbol{\mu}, \kappa) = C_d(\kappa) \exp(\kappa\, \boldsymbol{\mu}^{\top} \mathbf{c}),$$

where $C_d(\kappa)$ is a normalizing constant depending only on $d$ and $\kappa$. In our KDE, this density serves as the kernel centered at each stored memory vector. Since it depends only on the inner product $\boldsymbol{\mu}^{\top} \mathbf{c}$, it is well suited to modeling local support on the hypersphere.

### 3.3 SAGE: Spherical Adaptive Gate for Memory Evolution

Given a candidate embedding $\mathbf{c} \in \mathbb{S}^{d-1}$ and the current memory scope $\mathcal{M}$, the goal is to obtain a scalar novelty score that quantifies how well the direction of $\mathbf{c}$ is explained by the stored memory embeddings. We define this score via a kernel density estimate on the hypersphere.

To estimate the density that $\mathcal{M}$ induces at $\mathbf{c}$, we center a vMF-inspired kernel at each stored memory vector and average across memories. Because the concentration is shared across kernels, the normalizing constant $C_d(\kappa)$ is a common factor, so we work with the unnormalized kernel

$$K_{\kappa}(\mathbf{c}, \mathbf{m}_i) = \exp(\kappa\, \mathbf{m}_i^{\top} \mathbf{c}),$$

which retains the angular structure of the vMF distribution while avoiding unnecessary terms. Averaging over the memory scope gives

$$\hat{S}(\mathbf{c} \mid \mathcal{M}) = \frac{1}{N} \sum_{i=1}^{N} K_{\kappa}(\mathbf{c}, \mathbf{m}_i).$$

This average is well defined for $N \geq 1$, since it is a finite sum of positive, bounded terms. When $N = 0$ (i.e., the memory scope is empty), the controller directly emits Add without computing a score. Taking the logarithm and dividing by $\kappa$ keeps the result on the cosine-similarity scale; Appendix [E](https://arxiv.org/html/2605.30711#A5) shows that for $N \geq 1$, the resulting score lies in $[-1, 1]$. This yields

$$s_{\mathrm{vMF}}(\mathbf{c} \mid \mathcal{M}) = \frac{1}{\kappa} \log \hat{S}(\mathbf{c} \mid \mathcal{M}).$$

Structurally, $s_{\mathrm{vMF}}$ is the log-mean-exp of the $\kappa$-scaled cosine similarities $\{\kappa\, \mathbf{m}_i^{\top} \mathbf{c}\}$, rescaled by $\frac{1}{\kappa}$. It therefore produces a single scalar that summarizes how much collective angular support the entire memory scope provides for $\mathbf{c}$.
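Appendix E is not reproduced in this excerpt, so the following is only a short sketch of why the score stays on the cosine scale; it is a standard log-sum-exp argument and may differ from the authors' proof. Every cosine similarity satisfies $-1 \leq \mathbf{m}_i^{\top}\mathbf{c} \leq 1$, so for $\kappa > 0$ each kernel value lies in $[e^{-\kappa}, e^{\kappa}]$, and so does their mean:

$$e^{-\kappa} \;\leq\; \frac{1}{N}\sum_{i=1}^{N} \exp(\kappa\, \mathbf{m}_i^{\top}\mathbf{c}) \;\leq\; e^{\kappa} \quad\Longrightarrow\quad -1 \;\leq\; s_{\mathrm{vMF}}(\mathbf{c} \mid \mathcal{M}) \;\leq\; 1.$$

The same argument, applied with the largest similarity instead of the global bound, gives a tighter sandwich that shows how the score relates to nearest-neighbor cosine similarity:

$$\max_i \mathbf{m}_i^{\top}\mathbf{c} \;-\; \frac{\log N}{\kappa} \;\leq\; s_{\mathrm{vMF}}(\mathbf{c} \mid \mathcal{M}) \;\leq\; \max_i \mathbf{m}_i^{\top}\mathbf{c}.$$

The upper bound is attained when all memories are equally similar to $\mathbf{c}$, and the lower bound is approached when a single memory dominates the sum.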

Unlike raw cosine similarity, which compares $\mathbf{c}$ to one memory at a time, $s_{\mathrm{vMF}}$ aggregates contributions from all stored memories. Consequently, a candidate that has a high cosine similarity to a single isolated memory can still receive a different (typically lower) $s_{\mathrm{vMF}}$ score than a candidate with the same cosine similarity in a densely populated region of supporting memories. We then define the novelty score as

$$\nu(\mathbf{c}) = \frac{1 - s_{\mathrm{vMF}}(\mathbf{c} \mid \mathcal{M})}{2}.$$

This affine transformation reverses the ordering of candidates but otherwise preserves it; it is used only so that larger values mean “more novel,” which simplifies the interpretation of the adaptive threshold and margin defined in Section [3.4](https://arxiv.org/html/2605.30711#S3.SS4).
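The score and its novelty transform can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `vmf_score`, `novelty`, and `unit` are hypothetical, vectors are assumed to be unit-norm lists of floats, and `kappa` is passed in by hand here even though the paper estimates it from the store. The maximum similarity is factored out before exponentiating, a standard trick to avoid overflow when `kappa` is large.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vmf_score(c, memories, kappa):
    """(1/kappa) * log(mean_i exp(kappa * m_i . c)), computed stably."""
    if not memories:
        # Empty scope: the controller emits Add without computing a score.
        raise ValueError("score is undefined for an empty memory scope")
    sims = [dot(m, c) for m in memories]
    top = max(sims)
    # Factor exp(kappa * top) out of the mean so every exponent is <= 0.
    mean_exp = sum(math.exp(kappa * (s - top)) for s in sims) / len(sims)
    return top + math.log(mean_exp) / kappa


def novelty(c, memories, kappa):
    """nu(c) = (1 - s_vMF) / 2, so larger values mean 'more novel'."""
    return (1.0 - vmf_score(c, memories, kappa)) / 2.0


def unit(theta):
    """Illustrative helper: a 2-D unit vector at angle theta (radians)."""
    return [math.cos(theta), math.sin(theta)]


if __name__ == "__main__":
    c = unit(0.0)
    dense = [unit(0.2), unit(-0.2)]   # two memories close to the candidate
    sparse = [unit(0.2), unit(2.0)]   # same nearest neighbor, one far memory
    print("dense store :", novelty(c, dense, kappa=10.0))
    print("sparse store:", novelty(c, sparse, kappa=10.0))
```

Both toy stores have the same nearest-neighbor cosine similarity to the candidate, yet the candidate scores as more novel against the sparse store, which is the density-awareness the paragraph above describes.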

The concentration parameter $\kappa$ is not fixed a priori but is estimated from the current memory scope so that the gate adapts to the geometry of the stored embeddings. We compute the mean resultant length

$$\bar{R} = \left\| \frac{1}{N} \sum_{i=1}^{N} \mathbf{m}_i \right\|_2,$$

which measures the concentration of the memory vectors around their mean direction ($\bar{R} \approx 1$ when the vectors are tightly concentrated, $\bar{R} \approx 0$ when they are diffusely distributed). Following Banerjee et al. ([2005](https://arxiv.org/html/2605.30711#bib.bib20)), we estimate $\kappa$ via the approximation

$$\hat{\kappa} \approx \frac{\bar{R}\,(d - \bar{R}^{2})}{1 - \bar{R}^{2}},$$

which ensures that $\hat{\kappa}$ adapts to how spread out the stored memories are. When memories are densely stored, $\hat{\kappa}$ is large, and the score is more sensitive to small directional differences; when scattered, $\hat{\kappa}$ is small and each kernel covers a wider region.
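A minimal sketch of this estimator follows, assuming memories are given as equal-length lists of floats that are already unit-norm. The names `mean_resultant_length` and `estimate_kappa` are illustrative, and the clipping of $\bar{R}$ away from 0 and 1 is an added safeguard that the text does not describe: without it, a store whose vectors all coincide (including any store with $N = 1$) gives $\bar{R} = 1$ and a division by zero.

```python
import math


def mean_resultant_length(memories):
    """Norm of the average of the (unit-norm) memory vectors."""
    n = len(memories)
    d = len(memories[0])
    mean = [sum(m[j] for m in memories) / n for j in range(d)]
    return math.sqrt(sum(x * x for x in mean))


def estimate_kappa(memories, eps=1e-8):
    """Banerjee et al. (2005) approximation: R(d - R^2) / (1 - R^2)."""
    d = len(memories[0])
    r = mean_resultant_length(memories)
    # Assumption (not from the paper): clip R so the ratio stays finite.
    r = min(max(r, eps), 1.0 - eps)
    return r * (d - r * r) / (1.0 - r * r)


if __name__ == "__main__":
    def unit(theta):
        return [math.cos(theta), math.sin(theta)]

    tight = [unit(0.0), unit(0.1), unit(-0.1)]
    spread = [unit(0.0), unit(2.0), unit(-2.0)]
    print("tight cluster :", estimate_kappa(tight))
    print("spread cluster:", estimate_kappa(spread))
```

A tightly clustered store yields a much larger estimate than a spread-out one, so each kernel narrows as the store concentrates.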

This is the key advantage over a cosine-similarity-based threshold: as $\mathcal{M}$ changes, $\hat{\kappa}$ adapts automatically, so the effective influence of each stored memory reflects the current density of the store rather than remaining fixed.

Table 1: Detailed per-configuration comparison across SAGE, Mem0, and Mem0$_g$. Metrics are mean token-$F_1$, BLEU-1 ($B_1$), and LLM-as-a-Judge ($J$).

### 3.4 Adaptive Routing Rule

Since $\nu(\mathbf{c})$ is defined relative to the current memory scope, the same raw novelty score can mean different things in sparse and dense stores; we therefore adapt the routing threshold to a simple proxy for how tightly the current memories are packed.

We use a proxy $\rho_t$ to quantify the density of the current memory scope, and we provide the details in Appendix [D](https://arxiv.org/html/2605.30711#A4). A larger $\rho_t$ means that many memories occupy a relatively small region of the informative subspace. In such cases, novelty scores are typically pushed downward because candidates are more likely to land near already crowded regions, so the gate should become more permissive. This motivates the monotone decay

$$\tau_t^{\star} = \tau_{\min} + \tau_0\, e^{-\lambda \rho_t},$$

where $\tau_0$ is the base threshold, $\tau_{\min}$ is a floor, and $\lambda$ controls how quickly the threshold relaxes as the scope becomes denser. $\tau_t^{\star}$ decays monotonically toward $\tau_{\min}$ as density increases, and we smooth the threshold via an exponential moving average (EMA) to prevent abrupt shifts when a single turn adds several memories:

$$\tau_t = \begin{cases} \tau_t^{\star}, & t = 1, \\ \alpha\, \tau_{t-1} + (1 - \alpha)\, \tau_t^{\star}, & t > 1, \end{cases} \tag{1}$$

where $\alpha \in [0, 1)$ is the EMA momentum.
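The threshold update in Eq. (1) can be sketched as follows. The default values are the hyperparameters the paper reports in its experimental setup ($\tau_0 = 0.25$, $\tau_{\min} = 0.025$, $\lambda = 2.0$, $\alpha = 0.9$); the function names are hypothetical, and the density proxy `rho` is treated as an externally supplied number because its definition lives in Appendix D, which is not reproduced here.

```python
import math


def target_threshold(rho, tau0=0.25, tau_min=0.025, lam=2.0):
    """tau*_t = tau_min + tau0 * exp(-lambda * rho_t)."""
    return tau_min + tau0 * math.exp(-lam * rho)


def ema_threshold(prev_tau, rho, alpha=0.9, **kwargs):
    """Eq. (1): first step uses the target, later steps blend it in."""
    target = target_threshold(rho, **kwargs)
    if prev_tau is None:  # t = 1, no previous threshold yet
        return target
    return alpha * prev_tau + (1.0 - alpha) * target


if __name__ == "__main__":
    tau = None
    # Hypothetical density trace: the store gets denser over five turns.
    for t, rho in enumerate([0.0, 0.5, 1.0, 1.5, 2.0], start=1):
        tau = ema_threshold(tau, rho)
        print(f"t={t} rho={rho:.1f} tau={tau:.4f}")
```

With $\alpha = 0.9$ the smoothed threshold moves only a tenth of the way toward the new target on each step, which is what damps the jump when one turn adds several memories at once.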

Using $\tau_t$ together with a margin $\delta$, the routing rule is

$$\text{route}(\mathbf{c}) = \begin{cases} \textsc{Add}, & N = 0, \\ \textsc{Add}, & \nu(\mathbf{c}) \geq \tau_t + \delta, \\ \textsc{Update}, & \tau_t \leq \nu(\mathbf{c}) < \tau_t + \delta, \\ \textsc{Noop}, & \nu(\mathbf{c}) < \tau_t. \end{cases}$$

The margin $\delta$ defines an uncertainty band directly above the threshold, following the principle of classification with a reject option (Chow, [1970](https://arxiv.org/html/2605.30711#bib.bib23)). Candidates above the band are routed to Add and those below to Noop, both without an LLM call. Only candidates within the band, i.e., cases where it is genuinely ambiguous whether the candidate is novel enough relative to the scope, trigger an LLM Update call. Appendix [F](https://arxiv.org/html/2605.30711#A6) provides a detailed visual trace of this process, illustrating how the adaptive threshold $\tau_t$ decays over time to accommodate increasing memory density, and how the uncertainty margin $\delta$ cleanly separates these three routing decisions.
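The three-way rule translates directly into a small function. This is a sketch under stated assumptions: the name `route`, the string labels, and the `n_memories` argument are illustrative choices, while the default margin $\delta = 0.025$ is the value the paper reports in its experimental setup.

```python
def route(nu, tau, delta=0.025, n_memories=1):
    """Route a candidate from its novelty score nu and threshold tau."""
    if n_memories == 0:
        return "ADD"      # empty scope: nothing to compare against
    if nu >= tau + delta:
        return "ADD"      # clearly novel, no LLM call
    if nu >= tau:
        return "UPDATE"   # inside the uncertainty band, LLM merge call
    return "NOOP"         # clearly redundant, no LLM call


if __name__ == "__main__":
    tau = 0.10
    for nu in (0.50, 0.11, 0.05):
        print(f"nu={nu:.2f} -> {route(nu, tau)}")
```

Only the middle branch costs an LLM call, so the width of the band $[\tau_t, \tau_t + \delta)$ is the knob that trades write-time cost against how often ambiguous facts get LLM scrutiny.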

Table 2: Macro category averages across seven open-weight models.

### 3.5 Extending the Gate to Other Memory Systems

Any memory system that processes every incoming candidate $c$ through its full write path incurs an LLM call even when $c$ is clearly redundant given the current memory scope $\mathcal{M}$. A natural question is whether the vMF novelty score from Section [3.3](https://arxiv.org/html/2605.30711#S3.SS3) can serve as a lightweight pre-filter that sits upstream of any existing memory system and filters out candidates that the store already covers well.

We therefore define a portable binary gate that can sit upstream of any existing memory system. Unlike the three-way adaptive rule in Section [3.4](https://arxiv.org/html/2605.30711#S3.SS4), this gate uses a single fixed threshold $\tau_{\text{noop}}$ and makes one decision:

$$\text{route}(\mathbf{c}) = \begin{cases} \textsc{Noop}, & s_{\text{vMF}}(\mathbf{c} \mid \mathcal{M}) > \tau_{\text{noop}}, \\ \textsc{Pass}, & \text{otherwise}, \end{cases}$$

where Pass forwards the candidate to the host system unchanged. When $s_{\text{vMF}}$ exceeds $\tau_{\text{noop}}$, the candidate $c$ is sufficiently explained by existing memories and is dropped; otherwise, the host system (A-Mem, Mem0, or any comparable framework) processes it with its own evolution logic fully intact. The gate is *non-invasive*: it exposes a single tunable knob $\tau_{\text{noop}}$ and requires no modification to the host’s internals. Moreover, $\tau_{\text{noop}}$ can be set *without access to the target benchmark* via the calibration procedure described in Appendix [G](https://arxiv.org/html/2605.30711#A7).
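As a sketch of how such a pre-filter might wrap a host system, the snippet below gates a write callback. Everything about the host interface is an assumption: `host_write` stands in for whatever write entry point A-Mem or Mem0 exposes, `gated_write` is a hypothetical name, and the threshold value used in the demo is a placeholder rather than a calibrated setting (the paper calibrates $\tau_{\text{noop}}$ in Appendix G). Passing candidates through when the store is empty is also an assumption, made because the score is undefined for $N = 0$.

```python
import math


def vmf_score(c, memories, kappa):
    """(1/kappa) * log(mean_i exp(kappa * m_i . c)), computed stably."""
    sims = [sum(x * y for x, y in zip(m, c)) for m in memories]
    top = max(sims)
    mean_exp = sum(math.exp(kappa * (s - top)) for s in sims) / len(sims)
    return top + math.log(mean_exp) / kappa


def gated_write(c, memories, kappa, tau_noop, host_write):
    """Drop clearly covered candidates; forward the rest unchanged."""
    if memories and vmf_score(c, memories, kappa) > tau_noop:
        return "NOOP"   # host write path (and its LLM call) is skipped
    host_write(c)       # host system runs its own evolution logic
    return "PASS"


if __name__ == "__main__":
    forwarded = []
    store = [[1.0, 0.0]]
    # tau_noop=0.9 is a placeholder value for illustration only.
    print(gated_write([1.0, 0.0], store, 10.0, 0.9, forwarded.append))
    print(gated_write([0.0, 1.0], store, 10.0, 0.9, forwarded.append))
    print("forwarded to host:", forwarded)
```

Because the gate only decides whether to call `host_write`, the host's internals stay untouched, which matches the non-invasive property described above.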

## 4 Experiments

#### Experimental Setting.

We focus on long-term conversational memory, using LoCoMo as the main benchmark protocol since it directly evaluates whether a system can answer questions from extended, multi-session dialogue histories (Maharana et al., [2024](https://arxiv.org/html/2605.30711#bib.bib1)) (see Appendix [A](https://arxiv.org/html/2605.30711#A1) for dataset details). Following prior work, we consider single-hop, multi-hop, temporal, and open-domain questions, and we evaluate with BLEU-1 ($B_1$), token-$F_1$ ($F_1$), and LLM-as-a-Judge (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)) ($J$). Our main experimental comparison uses seven backbone configurations for which scored SAGE/Mem0/Mem0$_g$ results are available. We use Llama-3.1-8b as the LLM Judge model.

Prior memory papers inform the broader baseline landscape. A-Mem compares against LoCoMo, ReadAgent, MemoryBank, and MemGPT, and reports strong gains together with write-time efficiency from selective top-$k$ retrieval (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)). Mem0 emphasizes scalable memory extraction and update-time routing over salient facts rather than full-context prompting (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)). We do not re-run these comparisons; instead, we focus on the question they leave open: can a principled novelty gate replace the controller LLM in the write path? We also test SAGE on a frontier-class backbone (GPT-4o-mini) in Section [4.3](https://arxiv.org/html/2605.30711#S4.SS3) to assess whether the gate’s advantages persist when the underlying LLM is strong enough to route accurately on its own.

SAGE uses the following hyperparameters for the novelty-routing gate (Section [3.4](https://arxiv.org/html/2605.30711#S3.SS4)). We set the PCA projection dimension $d' = 16$. For the adaptive threshold $\tau_t$ update, the base threshold parameter is set to $\tau_0 = 0.25$, the minimum threshold floor is set to $\tau_{\min} = 0.025$, and the density decay coefficient is set to $\lambda = 2.0$. The temporal EMA smoothing coefficient is set to $\alpha = 0.9$. The uncertainty band is defined by $\delta = 0.025$. Appendix [C](https://arxiv.org/html/2605.30711#A3) describes the selection procedure.

### 4.1 Results

Table [1](https://arxiv.org/html/2605.30711#S3.T1) compares SAGE against Mem0 and Mem0$_g$ across seven backbone-matched triads. The clearest result is consistency on $F_1$: SAGE ranks first on the overall average for all seven backbones. It also achieves the best overall $B_1$ in six of seven triads, with DeepSeek-R1-7b as the only exception, where Mem0 is marginally higher (9.04 vs. 9.01). $J$ scores are more mixed, but still favorable to SAGE overall: it attains the best average $J$ score in four triads, exceeds Mem0 in six of seven, and exceeds Mem0$_g$ in five of seven. Among the SAGE variants, Qwen2.5-3b is strongest on $F_1$ and $B_1$, while DeepSeek-R1-7b gives the highest average $J$ score.

Table [2](https://arxiv.org/html/2605.30711#S3.T2) shows the same pattern after averaging by question type. SAGE is the only system that ranks first on $B_1$, $F_1$, and $J$ in all four categories. The largest $F_1$ gain over Mem0 appears on open-domain questions (+1.83), and the largest $J$ gain appears on single-hop questions (+5.83). Multi-hop gains are also steady, with SAGE reaching 16.73 $F_1$ and 77.26 $J$ versus 15.68 and 72.27 for Mem0, which suggests that better write-side separation of related facts helps later composition rather than only surface overlap. Temporal questions remain the tightest comparison, but SAGE still leads there on all three metrics.

Table 3: Write-side LLM-call budget on full LoCoMo. SAGE makes zero routing calls, invoking the LLM only to merge the fraction $\pi_{\text{upd}}$ of candidates routed to Update, whereas Mem0/Mem0$_g$ fuse routing and edit into one call per add. *Total Drop* is SAGE’s reduction of LLM calls vs. the baseline.

Write-Side Efficiency. Table [3](https://arxiv.org/html/2605.30711#S4.T3) links these quality gains to a different write-time profile. Within each backbone-matched triad, the dataset, fact-extraction prompt, embedding model, and retrieval stack are fixed: every system issues the same number of fact-extraction LLM calls (one per add call, 1696 in total), so the only difference is what happens *after* extraction. We therefore separate two layers of cost. (i) At the *decision stage*, Mem0 and Mem0$_g$ invoke a routing LLM on every non-empty add call, a single batched call that jointly decides the action and rewrites the memory text for all candidate facts. SAGE instead makes zero LLM calls for routing: the vMF novelty gate resolves Add and Noop in closed form, and the LLM is invoked only to *merge* the small fraction $\pi_{\text{upd}}$ of candidates routed to Update. (ii) Including the shared extraction calls, this yields the *total* write-side LLM budget reported in the last columns.

The two layers are reported separately so that the shared extraction cost is not hidden. At the decision stage the reduction is large, a drop of roughly 60–90% in LLM calls compared to Mem0 on seven of the eight backbones (Table [8](https://arxiv.org/html/2605.30711#A8.T8)), because SAGE replaces hundreds of routing calls with a handful of merge calls: the empirical update band $\pi_{\text{upd}}$ is narrow (2.7–10.6%). Once the shared extraction cost is folded in, the *total* write-side LLM calls still drop by 29–42% (mean 32%) on those same seven backbones. The single exception is Llama-3.2-1b, where Mem0’s weak router emits malformed JSON on 1347 (79%) of its calls, which artificially lowers its routing-call count rather than reflecting cleaner routing; because SAGE’s closed-form gate has no such parse-failure mode, the comparison is not meaningful for this backbone, and we exclude it from the aggregate.

Read together, Tables [1](https://arxiv.org/html/2605.30711#S3.T1)–[3](https://arxiv.org/html/2605.30711#S4.T3) support the central claim that novelty detection is an effective abstraction for memory evolution. SAGE does not trade quality for efficiency: the same closed-form decision that removes the LLM from clearly novel and clearly redundant candidates also enforces cleaner write-side separation between related-but-distinct facts, which the consistent $F_1$ lead and the multi-hop and open-domain $J$ gains reflect. Efficiency and quality are therefore two faces of a single gating decision rather than competing objectives.

#### Scaling to a frontier backbone\.

The seven small backbones isolate the controller from backbone quality; we now ask whether the same gate holds up on a stronger model by running full LoCoMo on GPT-4o-mini (last block of Table [1](https://arxiv.org/html/2605.30711#S3.T1); Table [4](https://arxiv.org/html/2605.30711#S4.T4)). Scores below are given as SAGE vs. Mem0. On quality, SAGE wins multi-hop on $F_1$ and $J$ ($J$ 56.1 vs. 52.3, +3.7), the category that most directly tests whether the memory system can compose separately stored facts, and edges open-domain $J$ (63.3 vs. 62.9). Mem0 leads single-hop ($J$ 53.9 vs. 56.0) and, most clearly, temporal ($J$ 35.4 vs. 42.7). The overall average $J$ gap is 1.3 points (52.2 vs. 53.5). The efficiency side is decisive (Table [4](https://arxiv.org/html/2605.30711#S4.T4)): on the same workload SAGE ingests $2.5\times$ faster (15.7 vs. 39.3 min) with $2.6\times$ fewer total write-side tokens (2.16M vs. 5.55M) and $11.1\times$ fewer *generated* tokens (0.08M vs. 0.91M), because the vMF gate replaces Mem0's per-add update-reasoning call with a closed-form vector decision and queries the LLM only on the narrow Update band. Average per-add-call latency is $3.1\times$ lower (1.76 s vs. 5.38 s) and add-phase API cost falls from $1.24 to $0.36 ($3.4\times$ cheaper). These bounded single-hop and temporal recall costs (about 1.3 average $J$ points) are thus a deliberate trade for a multiplicative reduction in write-side compute that only compounds as the corpus grows.
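As a quick arithmetic check, the multipliers can be recomputed from the rounded figures quoted above. This is a sketch over the printed values only; ratios derived from rounded numbers can differ slightly from the reported multipliers, which were presumably computed from unrounded measurements (the generated-token ratio is the most sensitive, since 0.08M carries a single significant digit).

```python
# (SAGE value, Mem0 value) pairs exactly as quoted in the text.
reported = {
    "ingest time (min)": (15.7, 39.3),
    "total write-side tokens (M)": (2.16, 5.55),
    "generated tokens (M)": (0.08, 0.91),
    "per-add latency (s)": (1.76, 5.38),
    "add-phase API cost ($)": (0.36, 1.24),
}

# Ratio of the Mem0 figure to the SAGE figure for each metric.
ratios = {name: mem0 / sage for name, (sage, mem0) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```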

Table 4: Efficiency on full LoCoMo (w/ GPT-4o-mini). Add-phase token counts and per-call latency are measured at the API boundary.

Table 5: Fixed-threshold NOOP decision ($\tau_{\text{noop}}=0.572$): A-Mem+SAGE compared to A-Mem baseline on full LoCoMo, in percentage points ($F_1$, $J$). "Calls saved" = skipped write/evolution LLM calls.

### 4.2 Threshold Sensitivity Ablation

To analyze sensitivity to the adaptive threshold, we compare SAGE with the adaptive threshold $\tau_t$ against SAGE with $\tau_t$ set to a fixed value $\tau_{\text{fixed}}\in\{0.10,0.15,0.20,0.25,0.30\}$, using a 20% subsample of LoCoMo and Llama-3.1-8B as the LLM judge. The results in Appendix Table [7](https://arxiv.org/html/2605.30711#A7.T7) show that adaptive SAGE is the more robust default operating point. On Qwen2.5-1.5B, it gives the best overall $B_1$ (9.80) and $F_1$ (11.69); the only fixed threshold that slightly exceeds its $J$ score is $\tau_{\text{fixed}}=0.30$, and only by 0.07, while $B_1$ and $F_1$ both drop by about 2 points. On Qwen2.5-3B, the best fixed point is $\tau_{\text{fixed}}=0.10$, which improves $B_1$ from 25.83 to 26.69, $F_1$ from 31.15 to 32.35, and $J$ from 85.32 to 86.82.

Figure [2](https://arxiv.org/html/2605.30711#S4.F2) shows the same trade-off as Appendix Table [7](https://arxiv.org/html/2605.30711#A7.T7). The best fixed quality point is $\tau_{\text{fixed}}=0.10$, but adaptive SAGE stays close while using far fewer update-time calls; useful operating points are concentrated in the low-threshold region, and quality degrades sharply once $\tau_{\text{fixed}}\geq 0.15$. The right panel also makes the efficiency trade-off explicit: the best fixed quality point uses nearly $3\times$ as many update-time route calls as adaptive SAGE (202 vs. 74). The broader threshold-sensitivity pattern across both backbones appears in Appendix Figure [4](https://arxiv.org/html/2605.30711#A8.F4). Overall, the adaptive controller already captures most of the attainable quality without backbone-specific retuning, which is the practical significance of SAGE as a write-time control policy.

[Figure 2 panels. Left, "Qwen2.5-3B quality": x-axis $\tau_{\text{fixed}}$, y-axis Score; series $B_1$, $F_1$, $J$, and Adaptive. Right, "Qwen2.5-3B routing": x-axis $\tau_{\text{fixed}}$, y-axis update-time route LLM calls; series Fixed and Adaptive.]

Figure 2: Adaptive threshold sensitivity on Qwen2.5-3B. Left: quality under fixed thresholds. Right: update-time route LLM calls, where only update-routed candidates invoke the LLM call. Solid lines indicate SAGE with varying fixed threshold and dashed lines indicate SAGE with adaptive threshold.

### 4.3 Isolating Noop Decision's Effects

To isolate the Noop decision, we hold the underlying A-Mem memory system fixed and change only whether the fixed-threshold gate of Section [3.5](https://arxiv.org/html/2605.30711#S3.SS5) is switched *on* or *off*. Therefore, any difference between the two methods A-Mem and A-Mem+SAGE is attributable to SAGE alone. Table [5](https://arxiv.org/html/2605.30711#S4.T5) reports the result across five models on full LoCoMo with threshold $\tau_{\text{noop}}=0.572$ calculated and fixed in advance (details of how to set $\tau_{\text{noop}}$ are in Appendix [G](https://arxiv.org/html/2605.30711#A7)). Read across Table [5](https://arxiv.org/html/2605.30711#S4.T5), the gate behaves as designed. The *skip-rate* column lands in 15.8–17.9% for every model. Each run therefore avoids 1,824–2,066 write/evolution LLM calls (*calls-saved* column). The *$\Delta J$* column shows this efficiency is essentially free on the four open-weight models: $J$ shifts by at most 0.65% in either direction ($\leq 1$ point; per-category breakdown in Appendix [H](https://arxiv.org/html/2605.30711#A8)). At a comparable 17.9% skip rate, SAGE costs 2.01% in $J$ on GPT-4o-mini.
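The binary gate described above can be sketched as a thin wrapper around an existing write path. This is a minimal illustration, not the authors' implementation: `should_skip`, `gated_write`, and `write_fn` are hypothetical names, the coverage scores are made up, and the direction of the comparison (skip when the score is at or above $\tau_{\text{noop}}$) is inferred from the percentile description in Appendix G.

```python
TAU_NOOP = 0.572  # fixed threshold quoted in the text


def should_skip(coverage_score, tau_noop=TAU_NOOP):
    """True when the candidate looks already covered, so the write is skipped."""
    # Assumption: a high vMF coverage score means "redundant".
    return coverage_score >= tau_noop


def gated_write(notes, scores, write_fn, tau_noop=TAU_NOOP):
    """Call write_fn only for notes that pass the gate; return the skip rate."""
    skipped = 0
    for note, score in zip(notes, scores):
        if should_skip(score, tau_noop):
            skipped += 1          # no LLM call is issued for this note
        else:
            write_fn(note)        # stand-in for the expensive write/evolution step
    return skipped / len(notes) if notes else 0.0


# Hypothetical notes and scores, with list.append standing in for the LLM write.
written = []
rate = gated_write(["a", "b", "c", "d", "e"],
                   [0.10, 0.60, 0.30, 0.57, 0.90],
                   written.append)
print(written, rate)
```

Keeping the gate outside the memory system is what allows the on/off comparison: the downstream write logic is untouched, and only the number of times it is invoked changes.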

## 5 Conclusion

This paper argues that novelty detection is the missing abstraction for write-side memory control in agentic LLMs. Prior systems have shown that memory evolution matters, but they typically rely on controller LLMs to decide whether a new fact should trigger Add, Update, or Noop behavior. We instead propose SAGE, a von Mises–Fisher novelty gate, which yields a simple operational principle: add clearly novel memories, ignore clearly redundant ones, and reserve local merge reasoning for the uncertainty band in between.

## Limitations

Our evaluation is conducted entirely on the LoCoMo benchmark in English, covering one interaction modality (multi-session dialogue). We have not tested SAGE on harder benchmarks such as LongMemEval, on task-oriented or tool-use agent settings, or on multilingual corpora, so the generality of the quality–efficiency trade-off remains open. The gate routes candidates to Add, Update, or Noop but does not issue Delete decisions, nor does the current system include a memory compaction mechanism; designing principled deletion and compaction strategies that integrate with the vMF novelty score is left to future work. Finally, because the vMF score operates on $\ell_2$-normalized sentence embeddings, it inherits the embedding model's limitations: semantically distinct facts that receive similar vectors may be incorrectly dropped, while paraphrases with dissimilar vectors may bypass the redundancy filter.

## References

- A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway (2005) Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research 6(9).
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready AI agents with scalable long-term memory. In ECAI 2025 - 28th European Conference on Artificial Intelligence, 25-30 October 2025, Bologna, Italy - Including 14th Conference on Prestigious Applications of Intelligent Systems (PAIS 2025), I. Lynce, N. Murano, M. Vallati, S. Villata, F. Chesani, M. Milano, A. Omicini, and M. Dastani (Eds.), Frontiers in Artificial Intelligence and Applications, pp. 2993–3000. External Links: [Link](https://doi.org/10.3233/FAIA251160), [Document](https://dx.doi.org/10.3233/FAIA251160)
- C. Chow (1970) On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16(1), pp. 41–46. External Links: [Document](https://dx.doi.org/10.1109/TIT.1970.1054406)
- E. F. Codd (1970) A relational model of data for large shared data banks. Commun. ACM 13(6), pp. 377–387. External Links: ISSN 0001-0782, [Link](https://doi.org/10.1145/362384.362685), [Document](https://dx.doi.org/10.1145/362384.362685)
- M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou (2025) The faiss library. IEEE Transactions on Big Data.
- H. Hsu and J. Tzeng (2025) DAT: dynamic alpha tuning for hybrid retrieval in retrieval-augmented generation. arXiv preprint arXiv:2503.23013.
- J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE transactions on big data 7(3), pp. 535–547.
- V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp. 6769–6781.
- K. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer (2024a) A human-inspired reading agent with gist memory of very long contexts. In International Conference on Machine Learning, pp. 26396–26415.
- K. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer (2024b) A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727.
- Y. Lyu, Z. Li, S. Niu, F. Xiong, B. Tang, W. Wang, H. Wu, H. Liu, T. Xu, and E. Chen (2025) Crud-rag: a comprehensive chinese benchmark for retrieval-augmented generation of large language models. ACM Transactions on Information Systems 43(2), pp. 1–32.
- J. Ma, I. Korotkov, K. B. Hall, and R. T. McDonald (2020) Hybrid first-stage retrieval models for biomedical literature. In Conference and Labs of the Evaluation Forum, External Links: [Link](https://api.semanticscholar.org/CorpusID:221668044)
- A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang (2024) Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13851–13870.
- K. V. Mardia and P. E. Jupp (1999) Directional statistics. Wiley Series in Probability and Statistics, pp. 40.
- C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez (2023) MemGPT: towards llms as operating systems. arXiv preprint arXiv:2310.08560.
- J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22.
- K. Pearson (1901) On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science 2(11), pp. 559–572.
- F. C. Peña and S. Herbold (2025) Evaluating the performance and efficiency of sentence-bert for code comment classification. In 2025 IEEE/ACM International Workshop on Natural Language-Based Software Engineering (NLBSE), pp. 21–24.
- P. Rasmussen, P. Paliychuk, T. Beauvais, J. Ryan, and D. Chalef (2025) Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.
- N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, External Links: [Link](https://arxiv.org/abs/1908.10084)
- K. Sawarkar, A. Mangal, and S. R. Solanki (2024) Blended rag: improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers. In 2024 IEEE 7th international conference on multimedia information processing and retrieval (MIPR), pp. 155–161.
- Y. Wang, R. Takanobu, Z. Liang, Y. Mao, Y. Hu, J. McAuley, and X. Wu (2025) Mem-$\alpha$: learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911.
- W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025) A-mem: agentic memory for llm agents. arXiv preprint arXiv:2502.12110.
- S. Yan, X. Yang, Z. Huang, E. Nie, Z. Ding, Z. Li, X. Ma, J. Bi, K. Kersting, J. Z. Pan, et al. (2025) Memory-r1: enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828.
- Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J. Wen (2024) A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501.
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024) MemoryBank: enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence 38(17), pp. 19724–19731. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/29946), [Document](https://dx.doi.org/10.1609/aaai.v38i17.29946)
- Z. Zhou, A. Qu, Z. Wu, S. Kim, A. Prakash, D. Rus, J. Zhao, B. K. H. Low, and P. P. Liang (2025) MEM1: learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841.

## Appendix A Dataset: LoCoMo

All experiments use the LoCoMo benchmark (Maharana et al., [2024](https://arxiv.org/html/2605.30711#bib.bib1)), which targets long-horizon conversational memory. The corpus consists of 10 multi-session dialogues in which two speakers share and revisit personal experiences over an extended interaction history. Each dialogue spans roughly 600 turns ($\approx$26k tokens) and is paired with around 200 post-hoc comprehension questions whose ground-truth answers require the system to recall facts from the conversation. We adopt the four question categories relevant to memory-write quality: *single-hop* questions that probe a single stored fact, *multi-hop* questions that require composing information across turns or sessions, *temporal* questions that test sensitivity to the ordering or timing of events, and *open-domain* questions that additionally draw on commonsense knowledge. The original benchmark also defines an adversarial category, but ground-truth answers are not provided for these questions and the expected system behavior is to recognize them as unanswerable (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9); Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)). Because this tests abstention rather than memory-write fidelity, we exclude it from our evaluation.

## Appendix B Baseline Descriptions

#### Mem0.

Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)) is a memory layer for LLM agents that extracts salient facts from conversation turns and maintains them in a dense vector store. For each candidate fact, an LLM-based routing controller inspects the top-$k$ most similar existing memories and classifies the appropriate operation as one of Add, Update, Delete, or Noop. This design enables compact natural-language memory representations (averaging roughly 7k tokens per conversation on LoCoMo) but requires one routing LLM call per batch of extracted candidates at every write step, making the write-time cost proportional to the total number of ingested turns regardless of their novelty. Retrieval is performed via cosine similarity over the dense embedding index.

#### Mem0g.

Mem0g (Chhikara et al., [2025](https://arxiv.org/html/2605.30711#bib.bib9)) extends Mem0 with a graph-based memory layer stored in Neo4j. An LLM-driven extraction pipeline converts conversation messages into typed entity nodes and directed relation triplets of the form $(v_s, r, v_d)$. When new triplets arrive, the system computes entity embeddings, searches for semantically similar existing nodes above a similarity threshold, and applies a conflict-detection and update-resolution mechanism via additional LLM calls to maintain graph consistency. At query time, Mem0g employs a dual retrieval strategy: an entity-centric method that traverses the graph neighborhood of query-matched nodes, and a semantic-triplet method that matches the full query embedding against all stored triplet encodings. The graph layer roughly doubles the token footprint relative to Mem0 (approximately 14k tokens per conversation) but provides gains on temporal and open-domain questions where relational structure is beneficial.

#### A-Mem.

A-Mem (Xu et al., [2025](https://arxiv.org/html/2605.30711#bib.bib7)) is an agentic memory system inspired by the Zettelkasten method that organises memories as interconnected atomic notes. Each note stores the original content alongside LLM-generated keywords, tags, and a contextual description, all concatenated into a single embedding for similarity search. Upon insertion, the system retrieves the top-$k$ nearest existing notes and prompts an LLM to determine whether semantic links should be established; linked notes are grouped into overlapping "boxes" that are co-retrieved at query time. A-Mem further implements a *memory evolution* step: when a new note is integrated, the LLM may rewrite the contextual descriptions and attributes of its linked neighbours, enabling the memory network to refine its organisation over time. While A-Mem reduces retrieval-time token budgets to roughly 1.2–2.5k tokens, it still issues multiple LLM calls per insertion (note construction, link generation, and evolution), placing the bulk of its computational cost on the write side.

## Appendix C Additional Hyperparameter and Experimental Details

The adaptive routing rule in SAGE (Section [3.4](https://arxiv.org/html/2605.30711#S3.SS4)) has three core parameters that govern the density-dependent threshold $\tau_t^{\star}=\tau_{\min}+\tau_0\,e^{-\lambda\rho_t}$: the base scaling parameter $\tau_0$, the minimum threshold floor $\tau_{\min}$, and the density decay coefficient $\lambda$. We selected these via a grid search on Qwen2.5-3B using a 20% subsample of LoCoMo, sweeping over $\tau_0\in\{0.15,\,0.25,\,0.35\}$, $\tau_{\min}\in\{0.01,\,0.025,\,0.05\}$, $\lambda\in\{1.0,\,2.0,\,4.0\}$. The configuration $\tau_0=0.25$, $\tau_{\min}=0.025$, $\lambda=2.0$ was selected and held fixed across all eight backbones reported in the paper. No per-backbone retuning was performed.

The remaining parameters serve different roles and were set without search. The EMA momentum $\alpha=0.9$ smooths the threshold across consecutive write steps so that a single conversational turn that adds several memories does not cause an abrupt shift in the decision boundary; the value was not tuned and reflects a standard smoothing rate for EMA-based updates. The uncertainty-band half-width $\delta=0.025$ controls how many candidates are routed to the Update path and thereby how many expensive LLM merge calls are issued at write time. Increasing $\delta$ widens the band and sends more borderline candidates to the LLM for deliberation; decreasing it narrows the band and favors the cheaper Add/Noop decisions. In practice, $\delta$ can therefore be treated as an operational knob that trades update quality against write-side compute. The PCA projection dimension is set to $d'=16$ and only affects the density proxy (Appendix [D](https://arxiv.org/html/2605.30711#A4)).
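The threshold schedule above can be sketched in a few lines. The constants are the reported values; the exact EMA recurrence is defined in Section 3.4 and is not restated here, so the common form $\tau_t=\alpha\,\tau_{t-1}+(1-\alpha)\,\tau_t^{\star}$ used below is an assumption, as are the initial value and the example densities.

```python
import math

TAU_0, TAU_MIN, LAMBDA = 0.25, 0.025, 2.0  # values reported in the text
ALPHA = 0.9                                 # EMA momentum reported in the text


def target_threshold(rho):
    """Density-dependent target: tau_min + tau_0 * exp(-lambda * rho)."""
    return TAU_MIN + TAU_0 * math.exp(-LAMBDA * rho)


def ema_update(prev_tau, rho, alpha=ALPHA):
    """One smoothing step (assumed form: alpha * previous + (1 - alpha) * target)."""
    return alpha * prev_tau + (1.0 - alpha) * target_threshold(rho)


# Hypothetical trajectory: start at the empty-store target, then let density grow.
tau = target_threshold(0.0)
for rho in [0.1, 0.5, 1.0, 2.0]:
    tau = ema_update(tau, rho)
    print(f"rho={rho:.1f}  target={target_threshold(rho):.4f}  smoothed={tau:.4f}")
```

With these constants the target starts at $\tau_{\min}+\tau_0$ for an empty store and decays toward the floor $\tau_{\min}$ as density grows, while the momentum term keeps a single dense write step from moving the boundary abruptly.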

All experiments are run on an NVIDIA H200 GPU, and a single run completes in around 2 hours for larger models. The code used Python 3.9.25, PyTorch 2.4.0, and NLTK 3.9.2.

## Appendix D Proxy for Memory Scope Density

At write step $t$, let $\mathcal{M}^{(t)}=\{\mathbf{m}^{(t)}_{1},\ldots,\mathbf{m}^{(t)}_{N_t}\}$ denote the current memory scope. To estimate geometric spread, we project the memory vectors onto their first $d'$ principal components, obtaining $u_i^{(t)}\in\mathbb{R}^{d'}$ (Pearson, [1901](https://arxiv.org/html/2605.30711#bib.bib11)). This lets us measure spread using the main directions of variation in the memory vectors, while avoiding noisy range estimates in dimensions where the vectors change very little. We then define the scope volume as the product of the coordinate-wise ranges in this projected space,

$$V_t=\exp\left(\sum_{j=1}^{d'}\log\left(\max_i u_{i,j}^{(t)}-\min_i u_{i,j}^{(t)}\right)\right), \tag{2}$$

where $u_{i,j}^{(t)}$ is the $j$-th coordinate of the $i$-th projected memory at step $t$. Intuitively, $V_t$ is large when the current memories are spread out across the informative directions and small when they are tightly packed. Thus, we form the following approximation for the density proxy:

$$\rho_t=\frac{N_t}{V_t}. \tag{3}$$

When $\rho_t$ is large, the memory store contains many items within a small effective volume of the projected subspace. As a result, neighborhood support becomes easier to accumulate: incoming candidates are more likely to lie close to already populated regions, which systematically depresses their novelty scores. If the threshold were kept fixed, the controller would become overly conservative in dense stores and would reject too many genuinely useful writes; accordingly, the gate should lower its threshold and become more permissive as density increases.
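Equations (2) and (3) can be sketched with NumPy as follows. This is an illustrative reading rather than the authors' code: `density_proxy` is a hypothetical name, the PCA is computed by centring the vectors and taking an SVD, and a small `eps` is added inside the logarithm to guard against zero ranges; both choices are assumptions. The random vectors are synthetic stand-ins for memory embeddings.

```python
import numpy as np


def density_proxy(memories, d_proj=16, eps=1e-12):
    """Return rho_t = N_t / V_t for an (N, d) array of memory embeddings."""
    X = np.asarray(memories, dtype=float)
    n_items = X.shape[0]
    centered = X - X.mean(axis=0, keepdims=True)   # assumption: centred PCA
    # Rows of `components` are principal directions, largest variance first.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    k = min(d_proj, components.shape[0])
    projected = centered @ components[:k].T         # u_i^{(t)} in R^{d'}
    ranges = projected.max(axis=0) - projected.min(axis=0)
    log_volume = np.sum(np.log(ranges + eps))       # eps (assumption) guards log(0)
    return n_items / np.exp(log_volume)             # Eq. (3): N_t / V_t


def unit_rows(a):
    """L2-normalise each row so the vectors lie on the unit hypersphere."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)


# Synthetic stand-ins: a widely spread store and a tightly clustered one.
rng = np.random.default_rng(0)
spread = unit_rows(rng.normal(size=(200, 32)))
anchor = rng.normal(size=(1, 32))
tight = unit_rows(anchor + 0.05 * rng.normal(size=(200, 32)))

rho_spread = density_proxy(spread)
rho_tight = density_proxy(tight)
print(rho_tight > rho_spread)
```

Summing log-ranges and exponentiating once, as in Equation (2), avoids forming a long product of small numbers directly.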

## Appendix E Bound on the vMF Aggregation Score

We restate and prove the bound used in Section [3.3](https://arxiv.org/html/2605.30711#S3.SS3).

#### Proposition.

Let $\mathcal{M}=\{\mathbf{m}_1,\dots,\mathbf{m}_N\}\subset\mathbb{S}^{d-1}$ be a nonempty memory scope with $N\geq 1$, where

$$\mathbb{S}^{d-1}=\{\mathbf{z}\in\mathbb{R}^{d}:\|\mathbf{z}\|_2=1\}$$

is the unit hypersphere in $\mathbb{R}^{d}$. Let $\mathbf{c}\in\mathbb{S}^{d-1}$ be a candidate embedding, and let $\kappa>0$ denote the concentration parameter. We define

$$K_{\kappa}(\mathbf{c},\mathbf{m}_i)=\exp(\kappa\,\mathbf{m}_i^{\top}\mathbf{c}),\qquad \hat{S}(\mathbf{c}\mid\mathcal{M})=\frac{1}{N}\sum_{i=1}^{N}K_{\kappa}(\mathbf{c},\mathbf{m}_i),$$

and

$$s_{\mathrm{vMF}}(\mathbf{c}\mid\mathcal{M})=\frac{1}{\kappa}\log\hat{S}(\mathbf{c}\mid\mathcal{M}). \tag{4}$$

Then

$$-1\leq s_{\mathrm{vMF}}(\mathbf{c}\mid\mathcal{M})\leq 1. \tag{5}$$

#### Proof.

Since $\mathbf{c},\mathbf{m}_i\in\mathbb{S}^{d-1}$, we have

$$\|\mathbf{c}\|_2=1\qquad\text{and}\qquad\|\mathbf{m}_i\|_2=1\quad\text{for all } i=1,\dots,N.$$

Therefore, by the Cauchy–Schwarz inequality,

$$|\mathbf{m}_i^{\top}\mathbf{c}|\leq\|\mathbf{m}_i\|_2\,\|\mathbf{c}\|_2=1,$$

which implies

$$-1\leq\mathbf{m}_i^{\top}\mathbf{c}\leq 1\quad\text{for all } i=1,\dots,N.$$

Because $\kappa>0$, multiplying by $\kappa$ preserves the inequality:

$$-\kappa\leq\kappa\,\mathbf{m}_i^{\top}\mathbf{c}\leq\kappa\quad\text{for all } i=1,\dots,N.$$

By the definition of $K_{\kappa}$ and the monotonicity of the exponential function,

$$e^{-\kappa}\leq K_{\kappa}(\mathbf{c},\mathbf{m}_i)\leq e^{\kappa}\quad\text{for all } i=1,\dots,N.$$

Since $\hat{S}(\mathbf{c}\mid\mathcal{M})$ is the arithmetic mean of these $N$ terms, averaging over $i=1,\dots,N$ gives

$$e^{-\kappa}\leq\hat{S}(\mathbf{c}\mid\mathcal{M})\leq e^{\kappa}.$$

Applying the logarithm, which is also monotone increasing on $(0,\infty)$, yields

$$-\kappa\leq\log\hat{S}(\mathbf{c}\mid\mathcal{M})\leq\kappa.$$

Finally, dividing by $\kappa>0$ gives

$$-1\leq\frac{1}{\kappa}\log\hat{S}(\mathbf{c}\mid\mathcal{M})\leq 1.$$

Hence,

$$-1\leq s_{\mathrm{vMF}}(\mathbf{c}\mid\mathcal{M})\leq 1.$$

∎
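The score in Equation (4) and its bound can be checked numerically with a short sketch. The function below follows the definition directly, with a log-sum-exp shift for numerical stability; the toy vectors and the concentration value $\kappa=20$ are hypothetical choices for illustration, not values from the paper.

```python
import math


def vmf_score(candidate, memories, kappa):
    """s_vMF = (1/kappa) * log((1/N) * sum_i exp(kappa * m_i . c))."""
    logits = [kappa * sum(m_j * c_j for m_j, c_j in zip(m, candidate))
              for m in memories]
    peak = max(logits)  # log-sum-exp shift: avoids overflow for large kappa
    log_mean = peak + math.log(sum(math.exp(z - peak) for z in logits) / len(logits))
    return log_mean / kappa


def unit(v):
    """L2-normalise a vector so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy 3-d memory scope (hypothetical vectors) and two candidates.
memories = [unit([1.0, 0.0, 0.0]), unit([0.8, 0.6, 0.0]), unit([0.0, 0.0, 1.0])]
same = vmf_score(unit([1.0, 0.0, 0.0]), memories, kappa=20.0)
opposite = vmf_score(unit([-1.0, -0.3, -1.0]), memories, kappa=20.0)
print(f"well-covered candidate:   {same:.3f}")
print(f"poorly covered candidate: {opposite:.3f}")
```

Both scores fall inside $[-1, 1]$ as the proposition requires, and the candidate that coincides with a stored memory scores higher than the one pointing away from all of them.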

## Appendix F Temporal Dynamics of the Adaptive Threshold

Figure [3](https://arxiv.org/html/2605.30711#A6.F3) provides a step-by-step visual trace of the SAGE routing mechanism in action across a sequence of candidate facts. As the memory scope expands and the projection subspace becomes more densely populated, the baseline novelty scores of incoming candidates naturally trend downward because new facts are more likely to fall near established memories. To prevent the system from becoming overly conservative, the adaptive threshold $\tau_t$ (the solid blue line) decays over time in response to the increasing density proxy $\rho_t$.

The figure illustrates how the uncertainty margin $\delta$ (the shaded blue band above the threshold) cleanly separates the three routing actions defined in Section [3](https://arxiv.org/html/2605.30711#S3):

- Add: Candidates landing at or above the top of the shaded band ($\nu(\mathbf{c})\geq\tau_t+\delta$).
- Update: Candidates landing inside the shaded band ($\tau_t\leq\nu(\mathbf{c})<\tau_t+\delta$).
- Noop: Candidates scoring strictly below the threshold ($\nu(\mathbf{c})<\tau_t$).
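The three-way rule in the list above maps directly onto a small function. This is a minimal sketch: `route` is a hypothetical name, the novelty score $\nu(\mathbf{c})$ is taken as an input because its definition lives in Section 3.3, and the example values are made up.

```python
def route(novelty, tau, delta=0.025):
    """Map a novelty score to ADD, UPDATE, or NOOP given threshold tau and band delta."""
    if novelty >= tau + delta:
        return "ADD"      # clearly novel: write without calling the LLM
    if novelty >= tau:
        return "UPDATE"   # inside the uncertainty band: send to the LLM merge step
    return "NOOP"         # clearly redundant: drop without calling the LLM


# Hypothetical novelty scores against a threshold of 0.10.
for score in (0.30, 0.11, 0.05):
    print(score, route(score, tau=0.10))
```

Only the middle branch incurs an LLM call, so the width of the band set by `delta` directly controls write-side cost.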

By continuously shifting downward as memory density increases, this dynamic adjustment ensures that the decision boundary remains correctly calibrated to the current state of the memory store, preserving high recall without sacrificing write-time efficiency.

![Refer to caption](https://arxiv.org/html/2605.30711v1/x2.png)

Figure 3: Illustration of decaying adaptive threshold over time that influences routing decisions for SAGE.

## Appendix G Leakage-Controlled Calibration of $\tau_{\text{noop}}$

This appendix details how the operating point $\tau_{\text{noop}}$ of the Noop gate (Section [3.5](https://arxiv.org/html/2605.30711#S3.SS5)) is fixed for a new deployment without consulting the target benchmark.

The calibration trap. The tempting recipe is to read the threshold off the benchmark: compute the $s_{\text{vMF}}$ distribution on LoCoMo and place $\tau_{\text{noop}}$ at its 80th percentile, so the gate skips the most-covered 20% of writes (on LoCoMo this percentile is 0.572). Although unsupervised (it never touches the labels), this is still *test-set calibration*: a hyperparameter of the evaluated method is read off the evaluation distribution, which a real deployment does not have in advance.

Leakage-controlled calibration. We instead fix $\tau_{\text{noop}}$ offline, on synthetic self-generated text that never sees LoCoMo, constructed so that its $s_{\text{vMF}}$ distribution *matches* that of real conversational memory. Once a synthetic corpus reproduces the real 80th-percentile score, the rule "skip the top 20% most-covered" yields the same threshold value, which is now a property of our recipe rather than of the benchmark. The effective lever is *topical coherence*, not surface naturalness: rigid templated text over-concentrates ($p_{80}=0.892$, everything looks redundant), whereas a broad LLM-generated life story is too topically diffuse ($p_{80}=0.443$, everything looks novel). A narrow-domain, single-persona diary lands on the real spread, with generation temperature acting as a clean monotonic knob on synthetic redundancy (Table [6](https://arxiv.org/html/2605.30711#A7.T6)). At temperature 0.7 the synthetic $p_{80}$ matches LoCoMo's 0.572, which we adopt as $\tau_{\text{noop}}=0.572$. The corpus, the chosen quantile, and the threshold value are thus all derived from synthetic data alone, making the threshold transferable rather than tuned to the benchmark.
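The percentile rule itself is a one-line computation once a calibration corpus has been scored. The sketch below assumes the $s_{\text{vMF}}$ scores are already available as an array; `calibrate_tau_noop` is a hypothetical helper, and the uniform random scores are a stand-in for a scored synthetic diary corpus, not data from the paper.

```python
import numpy as np


def calibrate_tau_noop(calibration_scores, quantile=80.0):
    """Place the threshold at the given percentile of the calibration scores."""
    return float(np.percentile(calibration_scores, quantile))


# Stand-in scores (assumption): in practice these would be s_vMF values
# computed on the synthetic calibration corpus.
rng = np.random.default_rng(0)
synthetic_scores = rng.uniform(0.2, 0.7, size=1000)

tau_noop = calibrate_tau_noop(synthetic_scores)
# Fraction of calibration items the gate would skip at this threshold.
skip_rate = float(np.mean(synthetic_scores >= tau_noop))
print(f"tau_noop={tau_noop:.3f}  skip_rate={skip_rate:.1%}")
```

On the calibration corpus the skip rate is close to 20% by construction; the skip rate on a different corpus depends on how well the two score distributions match, which is the point of the matching procedure described above.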

Table 6: Leakage-controlled threshold calibration. A narrow-domain, single-persona synthetic diary corpus reproduces LoCoMo’s 80th-percentile vMF score at generation temperature 0.7, giving $\tau_{\text{noop}}=0.572$ with no benchmark access. Temperature is a clean monotonic knob on synthetic redundancy.

Table 7: Overall adaptive-vs-fixed threshold ablation for SAGE. Each row uses the paired full-split run scored with `llama3.1-8b` as the LLM judge. Bold marks the best quality value within the same backbone.

## Appendix H Additional Results

### H.1 Threshold Ablation Details

This appendix provides the full adaptive-vs-fixed threshold sweep referenced in Section [4.2](https://arxiv.org/html/2605.30711#S4.SS2).

#### Table [7](https://arxiv.org/html/2605.30711#A7.T7): overall quality under fixed vs. adaptive thresholds.

Each row reports the overall BLEU-1, token-$F_1$, and LLM-as-a-Judge score when SAGE is run with either its adaptive threshold $\tau_t$ or a fixed value $\tau_{\text{fixed}}\in\{0.10,0.15,0.20,0.25,0.30\}$, scored on a 20% subsample of LoCoMo with Llama-3.1-8B as the judge.

Two patterns emerge: (i) On Qwen2.5-1.5B, the adaptive threshold yields the best $B_1$ (9.80) and $F_1$ (11.69) across all settings; the only fixed threshold whose Judge score exceeds the adaptive one is $\tau_{\text{fixed}}=0.30$ (by 0.07 points), and it does so at a cost of roughly 2 points on both $B_1$ and $F_1$. (ii) On Qwen2.5-3B, the best fixed setting ($\tau_{\text{fixed}}=0.10$) slightly outperforms the adaptive threshold on all three metrics ($B_1$: 26.69 vs. 25.83; $F_1$: 32.35 vs. 31.15; $J$: 86.82 vs. 85.32), but quality degrades sharply for $\tau_{\text{fixed}}\geq 0.15$ and collapses by $\tau_{\text{fixed}}=0.30$ ($F_1$ drops to 4.81). Because the adaptive threshold performs well across both backbones without requiring per-backbone tuning, it is the more robust default.

#### Figure [4](https://arxiv.org/html/2605.30711#A8.F4): threshold-sensitivity curves across both backbones.

Figure [4](https://arxiv.org/html/2605.30711#A8.F4) visualizes the same data as Table [7](https://arxiv.org/html/2605.30711#A7.T7) in plot form. Solid lines trace the three quality metrics as a function of $\tau_{\text{fixed}}$; dashed horizontal lines mark the corresponding adaptive-SAGE baselines. On Qwen2.5-1.5B (top panel), all three solid curves remain relatively flat and never clearly exceed the adaptive baselines, confirming that no single fixed threshold consistently dominates the adaptive gate on this backbone. On Qwen2.5-3B (bottom panel), the curves fall steeply as $\tau_{\text{fixed}}$ increases: $\tau_{\text{fixed}}=0.10$ is the only competitive operating point, and every higher threshold incurs a severe quality penalty. This asymmetry highlights the fragility of fixed thresholds: the optimal $\tau_{\text{fixed}}$ shifts across backbones, whereas the adaptive threshold automatically tracks memory-store geometry and remains robust across configurations.

Table 8: Complete write-side LLM-call budget on full LoCoMo (the all-backbone version of Table [3](https://arxiv.org/html/2605.30711#S4.T3)), for seven backbone-matched open-weight triads plus a GPT-4o-mini SAGE-vs-Mem0 pair. *Decision-stage* isolates controller cost: SAGE makes zero routing calls and invokes the LLM only to *merge* the $\pi_{\text{upd}}$ fraction routed to Update, while Mem0/Mem0g fuse routing and edit into one call per non-empty `add` (Update marked —). *Total write LLM calls* adds shared fact-extraction; *Total $\downarrow$* is SAGE’s reduction vs. that row; $\pi_{\text{upd}}$ is the empirical share of routed candidates that fall in the Update band. †Llama-3.2-1b is excluded from the aggregate (see text).

[Figure 4 plot: two panels (Qwen2.5-1.5B, Qwen2.5-3B); x-axis $\tau_{\text{fixed}}$ from 0.1 to 0.3; y-axis Score; legend: BLEU-1, Token-$F_1$, Judge, Adaptive.]

Figure 4: Sensitivity to the fixed threshold across both Qwen backbones. Colors denote BLEU-1, token-$F_1$, and Judge; solid lines indicate SAGE with varying fixed thresholds and dashed lines indicate SAGE with the adaptive threshold.

### H.2 Full Write-Side LLM-Call Budget

Table [3](https://arxiv.org/html/2605.30711#S4.T3) in the main text reports the write-side LLM-call budget for a representative subset of backbones. Table [8](https://arxiv.org/html/2605.30711#A8.T8) extends this to all eight configurations: seven backbone-matched open-weight triads (SAGE, Mem0, Mem0g) plus the GPT-4o-mini SAGE-vs-Mem0 pair. Within each triad, the dataset, fact-extraction prompt, embedding model, and retrieval stack are identical; the only difference is what happens after fact extraction.

#### Decision-stage savings.

The “Route” and “Update” columns isolate the controller cost. SAGE makes zero routing LLM calls on every backbone because the vMF novelty gate resolves Add and Noop in closed form; the LLM is invoked only to merge the narrow fraction $\pi_{\text{upd}}$ of candidates routed to Update. The empirical $\pi_{\text{upd}}$ ranges from 2.7% (DeepSeek-R1-7B) to 10.6% (Qwen2.5-1.5B), meaning that the vast majority of write decisions are resolved without any LLM call. In contrast, Mem0 and Mem0g invoke a routing LLM on every non-empty `add` call, producing between 100 and 1,576 decision-stage calls depending on the backbone.

#### Total write-side reduction.

Including the shared 1,696 extraction calls (one per `add` at `batch_size` = 8), SAGE still reduces total write-side LLM calls by 29–42% (mean 32%) on seven of the eight backbones. The sole exception is Llama-3.2-1B (marked †). On this backbone, Mem0’s LLM-based router emits malformed JSON on 1,347 of 1,696 calls (79%), which artificially deflates its routing-call count: most calls are discarded as parse failures rather than counted as successful routes. Because SAGE’s closed-form gate has no such parse-failure mode, the resulting call counts are not comparable, and we exclude this backbone from the aggregate efficiency claim.
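The accounting behind these percentages is simple arithmetic, sketched below. The helper names are hypothetical, and the SAGE Update count is a placeholder rather than a figure from Table 8; only the 1,696 extraction calls, the 1,576 upper end of the baseline routing range, and the 1,347 parse failures are taken from the text.

```python
def total_write_calls(extraction_calls, decision_calls):
    """Total write-side LLM calls = shared fact extraction + decision-stage calls."""
    return extraction_calls + decision_calls


def relative_reduction(sage_total, baseline_total):
    """Fractional reduction of SAGE's total relative to a baseline row."""
    if baseline_total <= 0:
        raise ValueError("baseline_total must be positive")
    return 1.0 - sage_total / baseline_total


EXTRACTION_CALLS = 1696          # shared by all systems (from the text)
BASELINE_ROUTING_CALLS = 1576    # upper end of the reported baseline range
HYPOTHETICAL_UPDATE_CALLS = 200  # placeholder, NOT a value from Table 8

baseline_total = total_write_calls(EXTRACTION_CALLS, BASELINE_ROUTING_CALLS)
sage_total = total_write_calls(EXTRACTION_CALLS, HYPOTHETICAL_UPDATE_CALLS)
reduction = relative_reduction(sage_total, baseline_total)

# Parse-failure share for Mem0 on Llama-3.2-1B, using the counts quoted above.
parse_failure_rate = 1347 / 1696

print(f"reduction = {reduction:.1%}, parse failures = {parse_failure_rate:.1%}")
```

Because extraction calls are shared, the achievable reduction is bounded by the baseline's decision-stage share of the total, which is why a deflated routing count on Llama-3.2-1B makes the comparison unreliable.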

## Appendix I Responsible Use of Artifacts

### I.1 Artifact Use and Intended Use

We use existing artifacts, including the LoCoMo benchmark, backbone language models, embedding models, and prior memory-system implementations, only for research and evaluation purposes in the experimental settings described in this paper. Our use is intended to be consistent with the intended use and access conditions specified by the original artifact providers, where such conditions are available. We do not claim rights over third-party artifacts, and we do not redistribute restricted datasets, proprietary model weights, or API-backed systems except as permitted by their original terms. Any artifacts released as part of this work (e.g., code, prompts, or configuration files) are intended for research use only. These released artifacts are designed to support reproducibility of the proposed method and are not intended to override or expand the original access conditions attached to the underlying third-party datasets, models, or services.

### I.2 Artifact Documentation

Our experiments study long-term conversational memory in English using the LoCoMo evaluation protocol. We evaluate single-hop, multi-hop, temporal, and open-domain question settings, and we compare SAGE against prior memory-evolution systems under matched backbone configurations. These artifacts are used to study write-side memory control in research settings rather than to support deployment claims in real-world user-facing systems.
