Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents
Summary
This paper formulates memory retention for long-horizon language agents as a constrained stochastic optimization problem, introducing OSL-MR, a framework that enforces observability-safe learning with a Mixed-Score heuristic. Experiments show consistent improvements over existing heuristic baselines under tight memory budgets.
Cached at: 06/10/26, 06:16 AM
# Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents
Source: [https://arxiv.org/html/2606.10616](https://arxiv.org/html/2606.10616)
Qingcan Kang¹∗, LIU Mingyang²∗, Shixiong Kai¹, Kaichao Liang¹, Tao Zhong¹, Mingxuan Yuan¹†

¹Huawei Noah's Ark Lab, ²Department of Computer Science, City University of Hong Kong

mingyaliu8-c@my.cityu.edu.hk, {kangqingcan, kaishixiong, liangkaichao}@huawei.com, {zhongtao5, Yuan.Mingxuan}@huawei.com
###### Abstract
Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LoCoMo and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.
∗Equal contribution. †Corresponding author.

## 1 Introduction
Large language model agents increasingly operate over long horizons, requiring them to decide which memories to retain, evict, or reacquire under limited context and cost budgets (Hu et al., [2025](https://arxiv.org/html/2606.10616#bib.bib22); Huang et al., [2026](https://arxiv.org/html/2606.10616#bib.bib21)). Most existing memory systems rely on heuristic scoring, retrieval optimization, or learned compression (Park et al., [2023](https://arxiv.org/html/2606.10616#bib.bib2); Packer et al., [2023](https://arxiv.org/html/2606.10616#bib.bib1); Zhong et al., [2024](https://arxiv.org/html/2606.10616#bib.bib3); Jiang et al., [2023](https://arxiv.org/html/2606.10616#bib.bib5)), but treat retention as a local decision problem. They lack a principled formulation that captures long-horizon consequences such as missing future evidence, reacquisition cost, and stale information under realistic partial observability.
We fill this gap by formulating memory retention as a constrained stochastic optimization problem that explicitly accounts for budget feasibility, evidence utility, and delayed costs including miss penalty, reacquisition delay, and stale-information risk. To the best of our knowledge, prior work on memory retention, whether based on heuristic scoring, retrieval optimization, or reinforcement learning, has not explicitly formulated the underlying decision problem. Existing approaches propose solutions without first defining what optimal retention means under budget constraints, delayed consequences, and partial observability. OSL-MR provides, for the first time, a constrained stochastic optimization formulation of retention as a long-horizon sequential decision problem, from which the learning objective naturally follows. In contrast, prior optimization-based approaches, such as Fofadiya and Tiwari ([2026](https://arxiv.org/html/2606.10616#bib.bib6)), treat retention as a single-step constrained optimization that optimizes immediate relevance without considering how current decisions affect future evidence availability or incur delayed penalties. This local perspective misses the essential long-horizon nature of memory retention.
A central challenge in operationalizing this formulation is observability: many signals needed to evaluate retention decisions (e.g., gold evidence, answer correctness, semantic freshness) are only available after the decision is made. Using them at deployment would create an unrealistic information advantage. We therefore introduce a strict separation between *online-observable features* (query context, memory metadata, interaction history) and *offline-available supervision (OAS)* (gold evidence, answer text, ground-truth freshness). OAS is used only for training and evaluation; deployable policies must rely solely on online-observable inputs. This separation is not merely a theoretical constraint: it guarantees that any policy learned within this discipline can be deployed in real-world systems without requiring oracle access to future information, making it suitable for online interactive agents where decisions must be made in real time.
Building on this optimization foundation and observability discipline, we propose OSL-MR (Observability-Safe Learning for Memory Retention). Under partial observability, exact optimization is intractable, so OSL-MR introduces two complementary components: (i) an evidence learner trained offline from interaction logs using supervision derived from realized evidence (gold membership labels), and (ii) a Mixed-Score heuristic that serves as both a cold-start deployable baseline and an online-safe feasibility prior. The framework follows a practical staged deployment: initially, the Mixed-Score heuristic runs alone, ensuring system functionality from the first user query while logging all interactions. Once sufficient data are collected, the evidence learner is trained offline and deployed as a frozen policy, seamlessly replacing the heuristic without violating observability constraints. This design bridges the gap between offline training and online inference. By learning query-conditioned evidence scores directly from gold labels, OSL-MR bypasses the need for general-purpose importance oracles. The overall architecture yields a unified perspective that connects heuristic scoring, optimization, and learning under a consistent observability constraint.
Our contributions are threefold. First, we provide a constrained optimization formulation that formalizes memory retention as a sequential decision problem under a hard budget, explicitly modeling evidence utility, storage cost, miss penalty, reacquisition delay, and stale risk. Second, we introduce OSL-MR, an observability-safe learning framework that enforces a strict online/OAS separation and integrates a Mixed-Score prior, the optimization formulation, and an evidence learner trained from direct evidence supervision. Third, on two public long-horizon benchmarks, LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2606.10616#bib.bib30)) and LongMemEval (Wu et al., [2025](https://arxiv.org/html/2606.10616#bib.bib31)), OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, especially under tight budgets. The Mixed-Score prior improves precision while preserving recall, and sensitivity analysis confirms robustness across diverse cost configurations.
## 2 Related Work
### 2.1 Memory Systems and Long-Horizon Language Agents
Long-horizon language agents rely on external memory mechanisms to extend their effective context. Early systems store episodic experiences or tool traces in vector databases; MemGPT (Packer et al., [2023](https://arxiv.org/html/2606.10616#bib.bib1)) introduces hierarchical paging inspired by operating systems, while MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2606.10616#bib.bib3)) incorporates Ebbinghaus-style forgetting dynamics into retrieval. Generative Agents (Park et al., [2023](https://arxiv.org/html/2606.10616#bib.bib2)) rank memories by recency, relevance, and importance; however, their static importance scores are designed to capture general salience rather than query-specific evidence value, which can be less effective for evidence retention under capacity constraints. Recent systems extend other parts of the memory lifecycle: Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2606.10616#bib.bib23)) focuses on structured memory writing, MEM1 (Zhou et al., [2025](https://arxiv.org/html/2606.10616#bib.bib24)) learns compact latent representations via reinforcement learning, and prompt compression methods (Jiang et al., [2023](https://arxiv.org/html/2606.10616#bib.bib5)) reduce context costs. Despite these advances, retention decisions are typically handled implicitly within retrieval or compression pipelines, and the problem of formulating retention as an explicit optimization over a long horizon remains underexplored. To our knowledge, prior work has addressed individual aspects (budget-aware selection, delayed reward training, or partial observability), but a unified constrained stochastic optimization framework that integrates these dimensions has not yet been developed. Our work aims to fill this gap by formulating retention as such an optimization problem and enforcing a strict separation between online-observable features and offline-available supervision (OAS).
### 2.2 Memory Retention as Resource Allocation
Memory retention is a resource-allocation problem under uncertainty. Surveys highlight the lack of unified formulations that model delayed costs and trade-offs (Hu et al., [2025](https://arxiv.org/html/2606.10616#bib.bib22); Huang et al., [2026](https://arxiv.org/html/2606.10616#bib.bib21)). Several works introduce optimization for cost-constrained retrieval (e.g., AdaGReS (Anonymous, [2025](https://arxiv.org/html/2606.10616#bib.bib7)), CORAG (Wang et al., [2024](https://arxiv.org/html/2606.10616#bib.bib8))), but they focus on one-turn context selection rather than long-horizon memory retention. The constrained optimization perspective has also been explored for retention itself. Specifically, Fofadiya and Tiwari ([2026](https://arxiv.org/html/2606.10616#bib.bib6)) formulate memory retention as a single-step budgeted optimization that maximizes immediate relevance at each step independently, without explicitly considering how current decisions may affect future evidence availability or incur delayed penalties. This single-step perspective does not fully capture the long-horizon consequences that arise in interactive agent settings.
In contrast, OSL-MR models retention as a multi-step sequential decision problem, optimizing cumulative reward over the entire horizon. BudgetMem approaches (Alla et al., [2026](https://arxiv.org/html/2606.10616#bib.bib26); Zhang et al., [2026a](https://arxiv.org/html/2606.10616#bib.bib27)) also operate as one-step decisions, as do Mem-T (Yue et al., [2026](https://arxiv.org/html/2606.10616#bib.bib28)) and MemAct (Zhang et al., [2026c](https://arxiv.org/html/2606.10616#bib.bib29)). Table [1](https://arxiv.org/html/2606.10616#S2.T1) compares representative approaches across five dimensions: constrained optimization, delayed feedback, partial observability, online/OAS separation, and long-horizon sequential view. None of these methods integrates all five dimensions. OSL-MR is the first to do so: it uses a constrained stochastic optimization formulation to define the long-horizon retention objective, explicitly models delayed costs, respects partial observability, enforces online/OAS separation, and trains an evidence learner from logged data that is deployed as a frozen policy using only online-safe features.
### 2.3 Learning-Based Memory Policies and Observability Separation
Recent learning-based systems optimize memory policies using downstream signals. Mem-α (Wang et al., [2025](https://arxiv.org/html/2606.10616#bib.bib15)) applies RL to learn memory construction; CSIM (Zhou et al., [2025](https://arxiv.org/html/2606.10616#bib.bib16)) compresses context into compact step representations; and MemRL (Zhang et al., [2026b](https://arxiv.org/html/2606.10616#bib.bib25)) frames retrieval as a value-based decision that updates utility estimates from environmental feedback. While these works improve memory operations, they focus primarily on retrieval or compression rather than on retention under hard budget constraints. The STALE benchmark (Chao et al., [2026](https://arxiv.org/html/2606.10616#bib.bib19)) reveals that LLM agents struggle to detect when a stored memory has become outdated, motivating a clean separation between observable temporal signals and latent semantic validity, which is exactly the principle behind our online/OAS separation. Many existing learning methods rely on supervision signals that may not be available at deployment time. OSL-MR resolves this gap by using gold evidence only during offline training, while deployed policies access only online-observable features. This design preserves validity under partial observability and cleanly connects optimization, learning, and deployment. Our framework is complementary to retrieval-side optimization (e.g., MemRL), and unifying retention with retrieval under a single constrained optimization formulation is a promising direction for future work.
Table 1: Comparison of related memory retention approaches. OSL-MR is the first to simultaneously satisfy all five criteria.

| Method | Constrained Optimization | Delayed Feedback | Partial Observability | Online/OAS Separation | Long-Horizon Sequential |
| --- | --- | --- | --- | --- | --- |
| Generative Agents | × | × | × | × | × |
| BudgetMem | × | × | × | × | × |
| Mem-T | × | ✓ | × | × | ✓ |
| MemAct | × | × | × | × | ✓ |
| Fofadiya & Tiwari | ✓ (single-step) | × | × | × | × |
| OSL-MR (ours) | ✓ (multi-step) | ✓ | ✓ | ✓ | ✓ |
## 3 Method
We propose OSL\-MR, a memory retention framework for long\-horizon language agents operating under strict budget constraints and partial observability\. The core challenge is that memory decisions must be made sequentially with limited context capacity, while their consequences—such as information loss, recomputation cost, and stale usage—are only observable in the future\. This creates a delayed and partially observable decision problem where naive heuristic rules or local scoring strategies are insufficient\.
To address this, OSL-MR formulates memory retention as a constrained sequential optimization problem under an explicit observability separation. The framework integrates three tightly coupled components: (i) a constrained optimization formulation that defines long-horizon retention objectives under budget limitations, (ii) an evidence learner trained offline from interaction logs, which refines the heuristic policy using supervision derived from realized evidence structures, and (iii) a Mixed-Score retention policy that provides a fully deployable cold-start solution and a strong inductive baseline. Figure [1](https://arxiv.org/html/2606.10616#S3.F1) provides a high-level illustration of the overall framework and its data flow.
Figure 1: Overview of the OSL-MR framework. The framework separates online-observable inputs (query, memory metadata, interaction history) from offline-available supervision (gold evidence, answer text). During cold-start, the Mixed-Score heuristic selects retained memories under budget $B_t$ and logs agent-user interaction data (queries, responses, and subsequent evidence outcomes). After sufficient data collection, an evidence learner is trained offline using gold evidence labels; the learned policy is then frozen and deployed, relying only on online features for inference.

### 3.1 Observability Separation
We first formalize the information asymmetry in memory retention. We distinguish among three types of signals. Online-observable inputs include the current query, memory metadata, and interaction context, which are available at decision time. Offline-available supervision (OAS) includes gold evidence sets, answer content, coverage signals, and downstream outcome statistics, which are only accessible after the interaction completes. Evaluation signals are derived from realized system outcomes and are used strictly for analysis and benchmarking.
A key constraint in OSL\-MR is that deployed policies are strictly restricted to online\-observable inputs\. This ensures that all inference\-time decisions are realistic and do not rely on oracle information\. Consequently, any improvement must arise from better generalization over observable features rather than leakage of supervision signals\.
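One way to make this discipline hard to violate in code is to keep the two kinds of signal in separate types and pass only the online one to the policy. The following is a minimal sketch under that assumption; the class and field names (`OnlineView`, `OfflineSupervision`, `LoggedStep`, `score_memory`) and the toy word-overlap scorer are hypothetical and are not part of OSL-MR.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class OnlineView:
    """Signals available at decision time (hypothetical field names)."""
    query: str
    memory_text: str
    age_steps: int       # recency: steps since the memory was written
    size_tokens: int


@dataclass(frozen=True)
class OfflineSupervision:
    """Signals that only exist after the interaction completes (OAS)."""
    gold_evidence_ids: FrozenSet[str]
    answer_text: str


@dataclass(frozen=True)
class LoggedStep:
    """A logged record keeps both parts, but in separate fields."""
    memory_id: str
    online: OnlineView
    oas: OfflineSupervision

    def training_pair(self) -> Tuple[OnlineView, int]:
        # Labels may be derived from OAS, but only for offline training.
        label = int(self.memory_id in self.oas.gold_evidence_ids)
        return self.online, label


def score_memory(view: OnlineView) -> float:
    """Toy deployable scorer: its signature admits online inputs only."""
    query_words = set(view.query.lower().split())
    memory_words = set(view.memory_text.lower().split())
    overlap = len(query_words & memory_words)
    return overlap / (1.0 + view.age_steps)
```

Because `score_memory` never receives an `OfflineSupervision` object, changing the gold evidence changes the training label but cannot change the deployed score.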
A subtle issue arises in modeling memory freshness: while recency is directly observable, semantic validity is not. A memory may remain valid long after creation, or become obsolete shortly after being written. Therefore, instead of assuming oracle freshness labels, we estimate stale-use risk using observable proxies, denoted as $\rho_{t,i}^{stale}$, conditioned on memory features and interaction history.
### 3.2 Constrained Memory Retention Problem
At each step $t$, the agent observes a memory pool $M_t = \{m_{t,1}, \dots, m_{t,n_t}\}$, where each memory is associated with a size $s_{t,i}$ and a set of online features $\psi_{t,i}^{on}$. The agent must select a subset $A_t \subseteq M_t$ subject to a strict storage budget $\sum_i x_{t,i} s_{t,i} \leq B_t$, where $x_{t,i} = \mathbf{1}[m_{t,i} \in A_t]$ indicates whether memory $m_{t,i}$ is retained.
After selection, the query induces an evidence demand set $E_t$. Each memory contributes partial evidence coverage through a mapping $\mathrm{cov}(i)$. We define the realized coverage as:

$$\mathrm{Cov}_t(A_t) = E_t \cap \bigcup_{i \in A_t} \mathrm{cov}(i).$$

To capture both correctness and completeness of retention, we define token-weighted metrics:

$$\begin{aligned}
T_t^{hit} &= \sum_{e \in E_t} \tau(e)\,\mathbf{1}[e \in \mathrm{Cov}_t(A_t)], \\
T_t^{miss} &= \sum_{e \in E_t} \tau(e)\,\mathbf{1}[e \notin \mathrm{Cov}_t(A_t)], \\
T_t^{reacq} &= \sum_{e \in E_t \setminus \mathrm{Cov}_t(A_t)} \tau(e)\,\mathbf{1}[e \text{ is recoverable}], \\
T_t^{stale} &= \sum_{i \in A_t} s_{t,i}\,\mathbf{1}[m_{t,i} \text{ is stale and used}],
\end{aligned}$$

with a full coverage indicator $F_t = \mathbf{1}[E_t \subseteq \mathrm{Cov}_t(A_t)]$. The per-step reward is defined as:

$$r_t := \alpha_{hit} T_t^{hit} + \alpha_{full} F_t - \alpha_{store} \sum_i x_{t,i} s_{t,i} - \alpha_{miss} T_t^{miss} - \alpha_{reacq} T_t^{reacq} - \alpha_{stale} T_t^{stale}.$$

The objective is to maximize expected cumulative reward under budget constraints:

$$\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} r_t\right] \quad \text{s.t.} \quad \sum_i x_{t,i} s_{t,i} \leq B_t.$$
This formulation highlights that memory retention is inherently a long\-horizon, delayed\-feedback optimization problem, where optimal decisions must be made without access to future evidence realization\.
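To make the per-step reward concrete, the sketch below evaluates it for a given retained set. It is illustrative only: the `Memory` and `Costs` containers and the `step_reward` function are names introduced here, evidence items are identified by plain string ids, and the default coefficients are the ones reported in Section 4.1. Note that this function consumes gold evidence, so it belongs on the OAS side and can be used for training and evaluation only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    size: int                      # s_{t,i}, in tokens
    covers: frozenset              # cov(i): evidence ids this memory covers
    stale_and_used: bool = False   # indicator used by T^stale


@dataclass(frozen=True)
class Costs:
    hit: float = 4.0
    full: float = 64.0
    store: float = 1.0
    miss: float = 12.0
    reacq: float = 6.0
    stale: float = 6.0


def step_reward(retained, evidence_tokens, recoverable, costs=Costs()):
    """Compute r_t for one step.

    retained: list of Memory objects in A_t
    evidence_tokens: dict mapping evidence id e in E_t to tau(e)
    recoverable: set of evidence ids that could be reacquired if missed
    """
    covered = set()
    for m in retained:
        covered |= m.covers
    cov = covered & set(evidence_tokens)       # Cov_t(A_t)
    missed = set(evidence_tokens) - cov        # E_t \ Cov_t(A_t)

    t_hit = sum(evidence_tokens[e] for e in cov)
    t_miss = sum(evidence_tokens[e] for e in missed)
    t_reacq = sum(evidence_tokens[e] for e in missed if e in recoverable)
    t_stale = sum(m.size for m in retained if m.stale_and_used)
    full = 1.0 if not missed else 0.0          # F_t
    stored = sum(m.size for m in retained)     # sum_i x_{t,i} s_{t,i}

    return (costs.hit * t_hit + costs.full * full - costs.store * stored
            - costs.miss * t_miss - costs.reacq * t_reacq
            - costs.stale * t_stale)
```

For example, with two evidence items of 10 and 5 tokens, retaining memories that cover both yields a positive reward dominated by the full-coverage bonus, whereas dropping the memory that covers the 5-token item incurs both the miss and the reacquisition penalty.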
Why multi-step optimization matters. The above formulation optimizes cumulative reward over the entire horizon, making it a genuine multi-step decision problem. A single-step optimizer, by contrast, would only maximize immediate reward $r_t$ without accounting for how today's retention choices affect tomorrow's memory pool and future evidence demands. Such a myopic policy may discard information that appears irrelevant now but becomes critical later, incurring avoidable miss penalties or reacquisition delays. Hence, the multi-step perspective is essential for long-horizon agents.
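A toy instance makes the gap concrete. The sketch below assumes a two-step horizon, two unit-size memories `a` and `b`, a budget of one memory, and a pool in which evicted memories are gone for good; only the hit reward and miss penalty are modeled. These modeling choices and all names are illustrative assumptions rather than the paper's experimental setup.

```python
from itertools import combinations

ALPHA_HIT, ALPHA_MISS = 4.0, 12.0   # default coefficients from Section 4.1
BUDGET = 1                          # room for one unit-size memory

# DEMANDS[t] maps a needed memory id to its token weight tau(e).
DEMANDS = [{"a": 1}, {"b": 5}]
INITIAL_POOL = frozenset({"a", "b"})


def feasible_subsets(pool):
    """All subsets of the pool that fit the budget (unit sizes)."""
    items = sorted(pool)
    for k in range(min(BUDGET, len(items)) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def reward(retained, demand):
    """Hit reward minus miss penalty for one step."""
    hit = sum(tau for e, tau in demand.items() if e in retained)
    miss = sum(tau for e, tau in demand.items() if e not in retained)
    return ALPHA_HIT * hit - ALPHA_MISS * miss


def myopic_return(pool, t=0):
    """Pick the best immediate reward at each step, ignoring the future."""
    if t == len(DEMANDS):
        return 0.0
    best = max(feasible_subsets(pool), key=lambda a: reward(a, DEMANDS[t]))
    return reward(best, DEMANDS[t]) + myopic_return(best, t + 1)


def optimal_return(pool, t=0):
    """Exhaustive search over the whole horizon (tiny problems only)."""
    if t == len(DEMANDS):
        return 0.0
    return max(reward(a, DEMANDS[t]) + optimal_return(a, t + 1)
               for a in feasible_subsets(pool))
```

Under these assumptions the myopic policy keeps `a` for a small immediate hit and then misses `b`, ending with a return of -56, while the exhaustive search accepts a miss at the first step in order to keep `b` and reaches 8.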
From intractability to data-driven approximation. The above problem is a constrained partially observable stochastic optimization. It cannot be solved by standard solvers or analytically because: (i) future evidence demands depend on unknown user queries; (ii) rewards involve delayed outcomes not observable at decision time; (iii) the state space is combinatorial and grows exponentially with the number of memories, making exact optimization NP-hard. Thus, no closed-form solution exists. We therefore adopt a data-driven approach: learn a retention policy from interaction logs. However, no logs exist initially. To bootstrap data collection while keeping the system operational, we first deploy a lightweight heuristic (Mixed-Score) that makes budget-aware decisions and logs all user-agent interactions naturally. Mixed-Score is intentionally designed to be stronger than naive baselines (e.g., recency, Generative Agents). After sufficient logs are collected, we train an evidence learner from the logged supervision to approximate the optimal policy under the same observability constraints. The learner serves as the primary solution to the intractable optimization. Additionally, we feed the Mixed-Score value as an extra input feature to the learner, which acts as an inductive prior and has been shown to improve precision and recall (see ablation in Section [4](https://arxiv.org/html/2606.10616#S4)). Thus, Mixed-Score plays two practical roles: cold-start deployment and a feature prior for learning, while the learner remains the core performance driver.
### 3.3 Mixed-Score Retention Policy
Mixed\-Score constitutes the foundational retention mechanism in OSL\-MR\. Rather than relying on recency or static importance heuristics, it integrates multiple online\-observable signals into a unified utility function that reflects both relevance and resource cost\. Importantly, all components are constructed under strict observability constraints, ensuring direct deployability\.
At each decision step, Mixed\-Score computes:
$$\mathrm{MS}_{t,i} = w_{\tau}^{\top} \bar{\psi}^{time}_{t,i} + w_{rel}^{\top} \bar{\psi}^{rel}_{t,i} + w_{ctx}^{\top} \bar{\psi}^{ctx}_{t,i} + w_{risk}^{\top} \bar{\psi}^{risk}_{t,i} - w_c \frac{s_{t,i}}{B_t}.$$
This formulation explicitly balances utility and storage efficiency, enabling budget\-aware retention decisions\. Compared to prior heuristic strategies such as recency\-based decay or static importance scoring, Mixed\-Score provides a more expressive and query\-conditioned representation of memory utility while remaining fully online\-safe\.
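As a minimal sketch of the score above, the function below sums one dot product per feature group and subtracts the budget-normalized size penalty. The dictionary layout, the group keys, and the function names are assumptions made for illustration; the section does not specify feature dimensions or weight values.

```python
from typing import Dict, Sequence


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product of a weight vector and a feature vector."""
    if len(w) != len(x):
        raise ValueError("weight and feature vectors must match in length")
    return sum(wi * xi for wi, xi in zip(w, x))


def mixed_score(features: Dict[str, Sequence[float]],
                weights: Dict[str, Sequence[float]],
                size: float, budget: float, w_cost: float) -> float:
    """MS_{t,i}: weighted feature groups minus a size penalty s_{t,i} / B_t.

    features / weights: dicts keyed by group name ("time", "rel", "ctx",
    "risk"), each holding a normalized feature vector and its weights.
    """
    groups = ("time", "rel", "ctx", "risk")
    utility = sum(dot(weights[g], features[g]) for g in groups)
    return utility - w_cost * size / budget
```

With made-up weights, a 16-token memory under a 64-token budget pays a penalty of `0.25 * w_cost`, so two memories with equal utility are ranked by size. A negative risk weight is one way to make a higher stale-risk feature lower the score, although the sign convention is left open here.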
Beyond its role as a standalone policy, Mixed\-Score plays three essential roles in OSL\-MR\. First, it serves as a strong and competitive baseline for evaluation\. Second, it enables cold\-start deployment in the absence of interaction data, ensuring system functionality from the first user query\. Third, it acts as an inductive prior that guides the learning\-based model described in the next section\.
To further improve robustness, we introduce a continuous freshness proxy when explicit freshness labels are unavailable\. Memory types are assigned at creation time using a lightweight LLM classifier operating only on memory content\. The decay dynamics are defined as:
$$\gamma_i = \gamma^{base}(\mathrm{type}_i) \cdot \kappa(\Delta t_i), \quad \mathrm{fresh\_proxy}_{t,i} = \exp(-\gamma_i \Delta t_i),$$

which can be incorporated into risk features or used to modulate utility signals.
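The decay rule can be sketched in a few lines. The memory-type names, the base rates in `GAMMA_BASE`, the time unit, and the constant `kappa` below are placeholders chosen for illustration; this section does not give their actual values or functional forms.

```python
import math

# Hypothetical per-type base decay rates (per day); not taken from the paper.
GAMMA_BASE = {"stable_fact": 0.001, "preference": 0.01, "transient_state": 0.2}


def kappa(delta_t: float) -> float:
    """Placeholder modulation of the decay rate (assumed constant here)."""
    return 1.0


def fresh_proxy(memory_type: str, delta_t: float) -> float:
    """exp(-gamma_i * delta_t) with gamma_i = gamma_base(type) * kappa(delta_t)."""
    gamma = GAMMA_BASE[memory_type] * kappa(delta_t)
    return math.exp(-gamma * delta_t)
```

The proxy equals 1 for a freshly written memory and decays toward 0 at a type-dependent rate, so a transient state loses freshness much faster than a stable fact of the same age.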
### 3.4 Evidence Learning with Mixed-Score Priors
While Mixed\-Score provides a strong and fully deployable heuristic policy, it remains manually designed\. As interaction data accumulate, OSL\-MR transitions to a data\-driven regime where a learned model refines this heuristic using evidence supervision\.
We adopt a staged deployment strategy\. In the early stage, memory retention is performed entirely using Mixed\-Score, while interaction logs are continuously collected\. Once sufficient data are available, we train an evidence learner offline and deploy it with frozen parameters\. Importantly, the learned model does not introduce additional observability requirements and remains fully compatible with the online feature constraints\.
The learning target is defined as evidence membership: $y_{t,i}^{evid} = \mathbf{1}[m_{t,i} \in E_t]$, which provides a direct supervision signal for memory utility. Instead of learning from scratch, the model leverages Mixed-Score as an inductive prior: $\phi_{t,i}^{full} = [\psi_{t,i}^{on}; \mathrm{MS}_{t,i}]$ and $\phi_{t,i}^{base} = \psi_{t,i}^{on}$. The evidence learner is trained using weighted binary cross-entropy:

$$\mathcal{L}_{evid} = \sum_{t,i} \omega_{t,i}\, \mathrm{BCE}(\sigma(f_{\theta}(\phi_{t,i})), y_{t,i}^{evid}).$$

At inference time, memory selection is performed via constrained decoding:

$$A_t = \mathrm{Decode}(f_{\theta}(\phi_t), B_t), \quad \sum_{i \in A_t} s_{t,i} \leq B_t.$$
This formulation enables the model to learn query-conditioned refinements over Mixed-Score while preserving strict budget constraints. The overall training and deployment procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.10616#alg1).
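The two ingredients of this subsection, the weighted loss and the budget-constrained decoder, can be sketched as follows. The decoder shown is a plain greedy rule (highest score first, skip items that no longer fit); the text does not spell out the exact form of $\mathrm{Decode}$, so this is an assumption, and the function names are hypothetical.

```python
import math


def weighted_bce(logit: float, label: int, weight: float = 1.0) -> float:
    """omega * BCE(sigmoid(logit), label), computed stably from the logit."""
    # softplus(z) = log(1 + exp(z)); BCE(sigmoid(z), y) = softplus(z) - y * z
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return weight * (softplus - label * logit)


def greedy_decode(logits, sizes, budget):
    """Take memories in descending score order, skipping any that do not fit."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + sizes[i] <= budget:
            chosen.append(i)
            used += sizes[i]
    return chosen
```

For instance, with scores `[2.0, 1.5, 0.1]`, sizes `[40, 30, 20]`, and a 64-token budget, the greedy rule keeps items 0 and 2: item 1 scores higher than item 2 but would overflow the budget. Working from the logit rather than the probability avoids taking the logarithm of a value that has underflowed to zero.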
### 3.5 Behavior-Cloning Variant
For comparison, we construct a behavior-cloning baseline that learns from a non-deployable teacher policy constructed using offline-available supervision (OAS). For each training instance, starting from the Mixed-Score selection, the teacher performs a local greedy search: it considers swapping one retained memory with a currently omitted candidate, evaluates the resulting retained set using the reward defined in Section 3.2 (which depends on gold evidence), and accepts the swap if it improves the reward. This process continues until no improvement is found. The teacher thus produces a higher-quality retained set $A_t^{teacher}$ than the original Mixed-Score selection, but because it relies on OAS it is unsuitable for deployment.
The pseudo-labels for behavior cloning are defined as $y_{t,i}^{BC} = \mathbf{1}[m_{t,i} \in A_t^{teacher}]$. The BC-learner is trained using the same architecture and loss as the evidence learner, but with these pseudo-labels instead of gold evidence labels. This variant evaluates whether imitation of an improved heuristic policy can substitute for direct evidence supervision, highlighting the importance of explicit evidence-based learning.
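The teacher's swap search and the resulting pseudo-labels might look like the sketch below. It uses a first-improvement rule (accept the first swap that raises the reward, then restart), which is one reading of the description; a best-improvement rule would also fit. The function names and the `reward_fn` callback are assumptions of this example, and `reward_fn` is expected to use gold evidence, which is why the teacher cannot be deployed.

```python
def teacher_search(initial, pool, sizes, budget, reward_fn):
    """Local greedy search over single swaps from the Mixed-Score selection.

    initial: iterable of retained memory ids; pool: all candidate ids;
    sizes: dict id -> token size; reward_fn: set of ids -> float (uses OAS).
    """
    retained = set(initial)
    best = reward_fn(retained)
    improved = True
    while improved:
        improved = False
        for out_id in sorted(retained):
            for in_id in sorted(set(pool) - retained):
                candidate = (retained - {out_id}) | {in_id}
                if sum(sizes[m] for m in candidate) > budget:
                    continue                  # swap would break the budget
                value = reward_fn(candidate)
                if value > best:
                    retained, best, improved = candidate, value, True
                    break
            if improved:
                break                         # restart from the new set
    return retained


def bc_labels(pool, teacher_set):
    """y^BC for each memory: 1 if the teacher retained it, else 0."""
    return {m: int(m in teacher_set) for m in pool}
```

The loop terminates because every accepted swap strictly increases the reward and the number of candidate sets is finite.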
Algorithm 1 OSL-MR: Training and Deployment

- 1: Input: training environments $\mathcal{E}$, budgets $B_t$, costs $\alpha$, online features $\psi^{on}$, Mixed-Score prior $\mathrm{MS}$, decoding function $\mathrm{Decode}$
- 2: Output: learned policy $f_{\theta}$
- 3: Initialize dataset $\mathcal{D} \leftarrow \emptyset$
- 4: for all episode $\in \mathcal{E}$ do
  - 5: Initialize state $S_0$
  - 6: for $t = 1, 2, \ldots$ until terminal do
    - 7: Observe memory pool $M_t$ // current step
    - 8: for all $m_{t,i} \in M_t$ do
      - 9: Compute $\psi_{t,i}^{on}$ // online features
      - 10: Compute $\mathrm{MS}_{t,i}$ // mixed-score prior
      - 11: Form $\phi_{t,i} = [\psi_{t,i}^{on}; \mathrm{MS}_{t,i}]$ // feature vector
      - 12: Get $y_{t,i}^{evid} = \mathbf{1}[m_{t,i} \in E_t]$ // gold evidence label (OAS)
    - 13: end for
    - 14: $\mathcal{D} \leftarrow \mathcal{D} \cup (S_t, \phi_t, y_t^{evid})$ // log interaction
    - 15: Execute $A_t = \mathrm{Decode}(\mathrm{MS}_t, B_t)$ // cold-start: use mixed-score
    - 16: Observe next state $S_{t+1}$
  - 17: end for
- 18: end for
- 19: Train evidence learner $f_{\theta}$ by minimizing: $\min_{\theta} \sum_{(S,\phi,y) \in \mathcal{D}} \sum_i \mathrm{BCE}\bigl(\sigma(f_{\theta}(\phi_i)),\, y_i^{evid}\bigr)$
- 20: return $f_{\theta}$ with deployable selection $A_t = \mathrm{Decode}(f_{\theta}(\phi_t), B_t)$ // frozen deployment
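The data-collection phase of Algorithm 1 (steps 3 to 18) can be sketched over pre-recorded episodes as below. The episode layout (dicts with `pool`, `budget`, and `gold` keys) and the `featurize`, `mixed_score`, and `decode` callbacks are hypothetical interfaces introduced for this example; in a live system the next pool would come from the environment rather than from a recorded list.

```python
def collect_logs(episodes, featurize, mixed_score, decode):
    """Cold-start phase: act with Mixed-Score, log (features, labels, action).

    episodes: iterable of step lists; each step is a dict with keys
      "pool" (list of memory dicts with "id" and "size"),
      "budget" (int), and "gold" (set of memory ids, available as OAS).
    featurize(memory, step) -> list of floats (online features)
    mixed_score(memory, step) -> float
    decode(scores, sizes, budget) -> list of selected indices
    """
    dataset = []
    for episode in episodes:
        for step in episode:
            pool = step["pool"]
            ms = [mixed_score(m, step) for m in pool]
            # phi = [psi_on ; MS]: online features with the prior appended
            phi = [featurize(m, step) + [s] for m, s in zip(pool, ms)]
            # Gold labels are OAS: logged for training, never used to act.
            y = [int(m["id"] in step["gold"]) for m in pool]
            # The executed action depends on Mixed-Score only.
            action = decode(ms, [m["size"] for m in pool], step["budget"])
            dataset.append((phi, y, action))
    return dataset
```

Keeping the label computation and the action computation on separate lines makes it easy to check that no OAS field flows into `decode`.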
## 4 Experiments
### 4.1 Experimental Setup
Online/OAS discipline. All experiments strictly follow the observability separation defined in Section [3](https://arxiv.org/html/2606.10616#S3). Deployable policies are restricted to online-observable inputs, including query context, memory metadata, recency, semantic overlap, entity overlap, session/speaker information, and cost features. Offline-available supervision (OAS) includes gold evidence, answer text, realized coverage, miss events, reacquisition cost, and ground-truth freshness. OAS is used only for training or evaluation of oracle/teacher variants and is never accessible to deployable policies. Any method that relies on OAS at inference time is treated as an oracle baseline, not a deployable policy.
Benchmarks. We evaluate on two public long-horizon memory benchmarks: LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2606.10616#bib.bib30)) and LongMemEval (Wu et al., [2025](https://arxiv.org/html/2606.10616#bib.bib31)). Both benchmarks provide multi-turn conversational interactions with gold evidence labels, enabling systematic evaluation of memory retention under budget constraints.
Compared methods. We compare against several existing approaches and variants:
- Recency: A practical and widely used heuristic that retains the top-$K$ most recent memories by timestamp, where $K$ is the maximum number of items that fit under the budget. This approach is simple, computationally efficient, and commonly employed as a baseline despite ignoring query relevance or evidence utility.
- Generative Agents (GA): A widely adopted retrieval paradigm that scores each memory as a weighted sum of recency, relevance, and importance (Park et al., [2023](https://arxiv.org/html/2606.10616#bib.bib2)). We evaluate two importance variants: GA-heuristic (hand-crafted rules based on memory content, e.g., higher weight for emotional events) and GA-LLM (importance scores queried from an LLM offline). Both variants share the same recency and relevance computation.
- Mixed-Score: Our proposed online-safe heuristic prior. It computes a utility score for each memory by combining multiple observable feature groups (temporal, relevance, context, risk) with a size penalty term. The score is used to select the top-scoring memories under the budget. Mixed-Score is fully deployable and serves as a strong baseline as well as an inductive prior for learning.
- BC-learner: A behavior-cloning baseline trained to imitate the retention decisions of a non-deployable teacher policy. The teacher performs local greedy search around the Mixed-Score selection using oracle gold evidence (OAS) to produce higher-quality retained sets. BC-learner uses the same architecture as OSL-MR but is supervised by teacher pseudo-labels instead of gold evidence.
- OSL-MR (full): Our complete method, which trains an evidence learner from logged interaction data using gold evidence labels. The learner takes online-observable features concatenated with the Mixed-Score prior as input, and outputs a utility logit per memory. At inference, a budget-constrained decoder selects the retained subset. The learned policy is frozen after training and uses no OAS at test time.
- OSL-MR (w/o prior): An ablation variant of OSL-MR that removes the Mixed-Score prior from the feature vector, using only online-observable features $\psi^{on}$. All other settings (training data, architecture, loss) remain identical, isolating the contribution of the prior.
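The budget-constrained selection step shared by the methods above can be sketched as a greedy pass over scored memories. This is a minimal illustration under stated assumptions, not the authors' decoder: the `Memory` fields, the `select_under_budget` function, the `min_score` threshold, and the choice to skip items that do not fit (rather than stop at the first one) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    mem_id: str
    tokens: int       # size of the memory in tokens
    timestamp: float  # larger means more recent
    score: float      # utility score from any scorer (hypothetical field)


def select_under_budget(memories, budget, key, min_score=None):
    """Greedily keep the highest-ranked memories that fit in `budget` tokens.

    `key` ranks candidates (e.g. by timestamp for a recency rule, or by a
    learned score). If `min_score` is given, candidates scoring below it are
    skipped, so part of the budget may stay unused.
    """
    retained, used = [], 0
    for m in sorted(memories, key=key, reverse=True):
        if min_score is not None and m.score < min_score:
            continue  # not worth retaining even if it fits
        if used + m.tokens <= budget:
            retained.append(m)
            used += m.tokens
    return retained, used


memories = [
    Memory("a", tokens=10, timestamp=1.0, score=0.9),
    Memory("b", tokens=20, timestamp=2.0, score=0.2),
    Memory("c", tokens=15, timestamp=3.0, score=0.7),
]

# Recency-style rule: rank by timestamp and fill the budget.
recency_kept, recency_used = select_under_budget(
    memories, budget=40, key=lambda m: m.timestamp
)

# Score-based rule with an early-stop threshold (assumed value 0.5).
scored_kept, scored_used = select_under_budget(
    memories, budget=40, key=lambda m: m.score, min_score=0.5
)
```

In this toy example the recency rule uses 35 of 40 tokens while the thresholded rule uses 25, which mirrors the lower occupancy a score-based policy can show when it declines low-scoring candidates.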
Hyperparameters and cost settings. We use the standard train/dev/test splits and tune thresholds on the dev split via grid search. The evidence learner is a single-hidden-layer MLP with 16 hidden units and a sigmoid output, trained for 12 epochs (learning rate 0.03, L2 regularization 1e-4, positive class weight 4.0, negative down-sampling multiplier 6, random seed 17). Default reward coefficients are set with the storage cost as the unit baseline ($\alpha_{\mathrm{store}}=1.0$). The remaining coefficients are $\alpha_{\mathrm{hit}}=4.0$ (hit reward), $\alpha_{\mathrm{reacq}}=6.0$ (reacquisition penalty), $\alpha_{\mathrm{miss}}=12.0$ (miss penalty), $\alpha_{\mathrm{stale}}=6.0$ (stale penalty), and $\alpha_{\mathrm{full}}=64.0$ (full-coverage bonus). The miss penalty is the largest penalty because failing to answer is most harmful; the reacquisition and stale penalties are moderate (re-searching cost and outdated-information risk); the hit reward is lower; and the full-coverage bonus strongly encourages complete evidence retention. Budgets reflect average evidence length: for LoCoMo (approximately 60 tokens per query) we use 32, 64, and 128 tokens; for LongMemEval (approximately 587 tokens) we use 256, 512, and 1024 tokens, covering under- and over-provisioned regimes. Sensitivity analysis (Section [4.4](https://arxiv.org/html/2606.10616#S4.SS4)) varies each coefficient individually over a multiplier $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, confirming OSL-MR's robustness to coefficient scales.

LLMs were used only for offline annotation (DeepSeek-V4-Flash, temperature 0.0, JSON prompts, context window of three turns, output limits of 20 tokens for memory type and 24 tokens for importance). The importance scores feed the GA-LLM baseline, and the memory-type labels are used to derive freshness proxies. All labels are cached, so no LLM calls are made during evaluation.
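To make the class-imbalance settings concrete, the sketch below shows one plausible reading of the positive class weight and the negative down-sampling multiplier. It is an assumption-laden illustration: the text does not specify how down-sampling is applied, so "keep at most 6 negatives per positive" is a guess, and the function names `downsample_negatives` and `weighted_bce` are hypothetical.

```python
import math
import random

POS_WEIGHT = 4.0    # positive class weight reported in the text
NEG_MULTIPLIER = 6  # negative down-sampling multiplier reported in the text
SEED = 17           # random seed reported in the text


def downsample_negatives(examples, multiplier=NEG_MULTIPLIER, seed=SEED):
    """Keep all positives and at most `multiplier` negatives per positive.

    `examples` is a list of (features, label) pairs with label in {0, 1}.
    The per-positive interpretation of the multiplier is an assumption.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    keep = min(len(negatives), multiplier * max(len(positives), 1))
    rng = random.Random(seed)
    return positives + rng.sample(negatives, keep)


def weighted_bce(prob, label, pos_weight=POS_WEIGHT, eps=1e-12):
    """Binary cross-entropy for one example, up-weighting positive labels."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    if label == 1:
        return -pos_weight * math.log(prob)
    return -math.log(1.0 - prob)


# Toy data: 2 evidence memories and 20 non-evidence memories.
toy = [((i,), 1) for i in range(2)] + [((i,), 0) for i in range(20)]
balanced = downsample_negatives(toy)
```

With two positives, this reading keeps 12 of the 20 negatives, and a missed positive at probability 0.5 costs four times as much as a false alarm at the same probability.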
Metrics. We report evidence precision, recall, F1, full recall, retained tokens, and budget occupancy. We deliberately evaluate OSL-MR using a minimal online-safe feature set that excludes the hybrid freshness proxy described in Section [3](https://arxiv.org/html/2606.10616#S3), isolating the contribution of optimization-guided learning. The per-step reward from Section [3](https://arxiv.org/html/2606.10616#S3) is reported after a global shift for readability: $\mathrm{shifted\_reward}=\mathrm{raw\_reward}+C$, where $C=606.13$ on LoCoMo and $C=5561.9$ on LongMemEval. This translation is purely cosmetic and does not affect any relative comparisons. The minimum shifted reward in each benchmark is zero, corresponding to the worst-performing method under that configuration.
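The reward shift can be illustrated with a few lines of code. This is only a sketch: `shift_rewards` is a hypothetical helper, and the raw values below are back-computed from the shifted LoCoMo budget-128 rewards in Table 2 and $C=606.13$ rather than taken from the paper.

```python
def shift_rewards(raw_rewards):
    """Translate rewards so that the minimum becomes zero.

    Returns the shifted rewards and the constant C that was added.
    """
    c = -min(raw_rewards.values())
    return {name: value + c for name, value in raw_rewards.items()}, c


# Raw values back-computed as (shifted reward - 606.13); illustrative only.
raw = {"Recency": -606.13, "Mixed-Score": -473.65, "OSL-MR": -300.97}
shifted, c = shift_rewards(raw)

# A constant shift cannot change the ordering of methods.
same_ranking = sorted(raw, key=raw.get) == sorted(shifted, key=shifted.get)
```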
#### Evidence precision, recall, and F1\.
For each query $t$, let $G_t$ be the set of gold evidence dialogue IDs and $R_t$ the set of dialogue IDs retained under the budget. We compute

$$\mathrm{Precision}_t=\frac{|R_t\cap G_t|}{\max(|R_t|,1)},\qquad \mathrm{Recall}_t=\frac{|R_t\cap G_t|}{\max(|G_t|,1)},\qquad \mathrm{F1}_t=\frac{2\,\mathrm{Precision}_t\,\mathrm{Recall}_t}{\mathrm{Precision}_t+\mathrm{Recall}_t},$$

with $\mathrm{F1}_t=0$ when $\mathrm{Precision}_t+\mathrm{Recall}_t=0$. All metrics are macro-averaged over queries. Precision measures budget efficiency: high precision means the policy is not wasting capacity on irrelevant memories. Recall measures evidence coverage: high recall means most required information is retained. In budgeted retention, these two objectives are in tension, and F1 captures their balance.
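The per-query metrics and their macro-average translate directly into code. The following is a minimal sketch of the definitions above; the function names and the example dialogue IDs are illustrative assumptions.

```python
def evidence_prf(retained_ids, gold_ids):
    """Per-query precision, recall, and F1 over dialogue IDs."""
    retained, gold = set(retained_ids), set(gold_ids)
    hits = len(retained & gold)
    precision = hits / max(len(retained), 1)
    recall = hits / max(len(gold), 1)
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, f1


def macro_average(per_query_scores):
    """Average each of (precision, recall, F1) over queries."""
    n = len(per_query_scores)
    return tuple(sum(q[i] for q in per_query_scores) / n for i in range(3))


# Query 1: four retained memories, one of the two gold items is covered.
q1 = evidence_prf(["d1", "d2", "d3", "d4"], ["d1", "d5"])
# Query 2: nothing retained, so every metric falls back to zero.
q2 = evidence_prf([], ["d7"])
macro = macro_average([q1, q2])
```

Note that F1 is computed per query and then averaged, so the macro F1 is generally not the harmonic mean of the macro precision and macro recall.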
#### Budget occupancy\.
Budget occupancy measures the fraction of the available memory budget consumed by the retained subset:

$$\mathrm{Occupancy}_t=\frac{\sum_{i\in A_t}s_{t,i}}{\max(B_t,1)},$$

macro-averaged over queries. High occupancy is common for static scoring rules that greedily fill the budget, but does not necessarily indicate better retention quality. Learned policies that stop early when no remaining candidate appears useful may yield lower occupancy while achieving higher precision and reward. Occupancy should therefore be interpreted together with precision and recall as a measure of budget efficiency rather than evidence correctness.
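As a small worked example of the occupancy definition, the sketch below computes per-query occupancy from retained token sizes and then macro-averages. The function names and the token sizes are hypothetical.

```python
def occupancy(retained_sizes, budget):
    """Fraction of the token budget consumed by the retained memories."""
    return sum(retained_sizes) / max(budget, 1)


def macro_occupancy(per_query):
    """Average occupancy over (retained_sizes, budget) pairs, one per query."""
    values = [occupancy(sizes, budget) for sizes, budget in per_query]
    return sum(values) / len(values)


# A rule that fills the budget versus one that stops early, both at budget 64.
full_fill = occupancy([30, 20, 14], 64)   # 64 of 64 tokens used
early_stop = occupancy([20, 15, 14], 64)  # 49 of 64 tokens used
average = macro_occupancy([([30, 20, 14], 64), ([20, 15, 14], 64)])
```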
### 4.2 Main Results
The proposed method OSL-MR substantially outperforms heuristics. Across LoCoMo and LongMemEval, OSL-MR achieves higher evidence F1 and reward than all heuristic baselines (Recency, GA, Mixed-Score) and BC-learner across all evaluated budgets. The relative improvement is most pronounced under tight budgets, where heuristics saturate memory capacity with low-utility items while OSL-MR retains a smaller, evidence-denser subset. For instance, on LoCoMo at budget 128 (Table [2](https://arxiv.org/html/2606.10616#S4.T2)), OSL-MR attains an F1 of 0.302 and a reward of 305.2, compared to 0.069 and 132.5 for Mixed-Score, and below 0.010 for GA variants. As the budget becomes looser, the performance gap between OSL-MR and the best baseline (typically BC-learner) narrows slightly. This is because baseline methods operate at near-saturated occupancy (filling almost the entire budget), while OSL-MR maintains lower occupancy: it deliberately discards memories of marginal utility, resulting in a cleaner but smaller retained set. Even under loose budgets, OSL-MR's occupancy remains well below saturation (e.g., on LongMemEval budget 1024, OSL-MR occupancy is 0.337 vs. 0.370 for BC-learner), demonstrating that it continues to prioritize evidence quality over quantity. Importantly, reward improvements are strongly aligned with gains in precision and recall, confirming that the optimization objective captures genuine retention quality rather than artifacts of cost calibration. Across all budgets and datasets, the reward rankings perfectly align with the rankings in F1 and precision, indicating that the learned policy optimizes for authentic evidence-based retention.
GA performance and the importance–evidence mismatch. Both GA variants achieve low evidence F1 despite capturing plausible notions of memory salience. The gap is particularly instructive for GA-LLM: LLM-prompted importance scores reflect general memory noteworthiness (e.g., an emotional event), but the evaluation metric measures whether the memory is gold evidence for the current query, which is a query-conditional notion of utility. GA-heuristic occasionally outperforms GA-LLM because its rule-based heuristics correlate more closely with evidence membership in these benchmarks. This suggests that static importance scores are inherently misaligned with query-specific evidence needs. Therefore, we propose to learn evidence membership directly from supervision: OSL-MR trains an evidence learner using gold evidence labels, bypassing the need for a general-purpose importance oracle.
Table 2: LoCoMo results across memory budgets. Occupancy denotes budget utilization. Rewards are shifted by $C=606.13$ and then normalized so that the worst baseline (Recency at budget 128) is zero.

| Budget | Method | Precision | Recall | F1 | Occupancy | Reward |
|---|---|---|---|---|---|---|
| 32 | Recency | .0000 | .0000 | .0000 | .9806 | 92.44 |
| 32 | GA-heuristic | .0026 | .0100 | .0041 | .9682 | 95.04 |
| 32 | GA-LLM | .0017 | .0062 | .0026 | .9637 | 94.41 |
| 32 | Mixed-Score | .0243 | .0699 | .0352 | .9789 | 110.79 |
| 32 | BC-learner | .0968 | .0826 | .0811 | .6567 | 137.06 |
| 32 | OSL-MR | .1180 | .1083 | .1015 | .6683 | 145.56 |
| 64 | Recency | .0010 | .0062 | .0017 | .9890 | 61.95 |
| 64 | GA-heuristic | .0038 | .0302 | .0068 | .9808 | 67.68 |
| 64 | GA-LLM | .0022 | .0123 | .0036 | .9835 | 63.68 |
| 64 | Mixed-Score | .0325 | .1555 | .0522 | .9920 | 112.40 |
| 64 | BC-learner | .2230 | .2406 | .2061 | .7632 | 196.83 |
| 64 | OSL-MR | .3130 | .3587 | .2952 | .7367 | 252.70 |
| 128 | Recency | .0012 | .0145 | .0021 | .9919 | 0.00 |
| 128 | GA-heuristic | .0041 | .0508 | .0074 | .9898 | 9.99 |
| 128 | GA-LLM | .0030 | .0398 | .0056 | .9905 | 6.40 |
| 128 | Mixed-Score | .0391 | .3375 | .0685 | .9956 | 132.48 |
| 128 | BC-learner | .2303 | .4441 | .2556 | .7897 | 260.91 |
| 128 | OSL-MR | .2741 | .5320 | .3023 | .7638 | 305.16 |

Direct evidence supervision outperforms behavior cloning. BC-learner improves over static heuristics but remains consistently weaker than OSL-MR. On LongMemEval at budget 256 (Table [3](https://arxiv.org/html/2606.10616#S4.T3)), BC-learner achieves an F1 of 0.230, whereas OSL-MR reaches 0.518, more than double. At budget 512, OSL-MR's F1 is 0.385 versus BC-learner's 0.235, a relative improvement of about 64%. These results confirm that gold evidence labels provide a better-aligned training target than teacher-imitated pseudo-labels for evidence-retention metrics. Moreover, OSL-MR uses budget more efficiently: on LongMemEval budget 256, occupancy is 0.711 for OSL-MR compared to 0.904 for BC-learner, and on LoCoMo budget 128, occupancy is 0.764 vs. 0.790. The higher reward of OSL-MR over BC-learner directly reflects its superior F1 and precision, not merely a different cost trade-off.
Table 3: LongMemEval results across memory budgets. Occupancy denotes budget utilization. Rewards are shifted by $C=5561.9$ and then normalized so that the worst baseline (Recency at budget 1024) is zero.

| Budget | Method | Precision | Recall | F1 | Occupancy | Reward |
|---|---|---|---|---|---|---|
| 256 | Recency | .003 | .009 | .003 | .990 | 629.9 |
| 256 | GA-heuristic | .007 | .029 | .008 | .989 | 653.0 |
| 256 | GA-LLM | .003 | .016 | .004 | .984 | 638.0 |
| 256 | Mixed-Score | .052 | .324 | .082 | .994 | 1062.1 |
| 256 | BC-learner | .203 | .381 | .230 | .904 | 1119.9 |
| 256 | OSL-MR | .529 | .681 | .518 | .711 | 1626.8 |
| 512 | Recency | .006 | .057 | .008 | .995 | 419.7 |
| 512 | GA-heuristic | .012 | .131 | .020 | .993 | 489.3 |
| 512 | GA-LLM | .009 | .105 | .014 | .991 | 461.7 |
| 512 | Mixed-Score | .047 | .477 | .078 | .997 | 1062.8 |
| 512 | BC-learner | .173 | .601 | .235 | .946 | 1187.9 |
| 512 | OSL-MR | .315 | .769 | .385 | .851 | 1585.8 |
| 1024 | Recency | .008 | .117 | .013 | .998 | 0.0 |
| 1024 | GA-heuristic | .014 | .232 | .024 | .996 | 117.0 |
| 1024 | GA-LLM | .010 | .168 | .017 | .996 | 44.2 |
| 1024 | Mixed-Score | .033 | .533 | .058 | .998 | 718.9 |
| 1024 | BC-learner | .616 | .572 | .515 | .370 | 1772.8 |
| 1024 | OSL-MR | .685 | .595 | .536 | .337 | 1957.9 |
### 4.3 Ablation Studies
Effect of mixed-score prior. To isolate the contribution of the mixed-score prior, we compare the full OSL-MR against a variant that removes this prior from the feature vector (OSL-MR (w/o prior)). Both models use the same online-safe evidence labels and architecture; only the presence of the scalar Mixed-Score prior in $\phi_{t,i}=[\psi_{t,i}^{on};\mathrm{MS}_{t,i}]$ differs. Table [4](https://arxiv.org/html/2606.10616#S4.T4) reports results on both benchmarks.
Removing the prior consistently reduces precision while largely preserving recall, and increases budget occupancy. On LongMemEval at budget 256, precision drops from 0.529 to 0.421 (a relative decrease of 20%), while recall remains nearly unchanged (0.681 vs. 0.669). Occupancy rises from 0.711 to 0.804, indicating that the prior helps the learner avoid low-utility memories. On LoCoMo at budget 128, precision falls from 0.274 to 0.251, recall stays almost identical (0.532 vs. 0.530), and occupancy increases from 0.764 to 0.784. Rewards also decrease across all settings when the prior is removed. These results confirm that the Mixed-Score prior provides a strong inductive bias for selection efficiency without introducing additional supervision signals. The effect is especially pronounced on the more challenging LongMemEval dataset, where removing the prior lowers precision by about 20% at budget 256 and about 25% at budget 512 (0.315 to 0.237).
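The relative precision changes quoted above can be reproduced from the Table 4 entries with a short calculation. This is a minimal sketch; `relative_change` is a hypothetical helper, and the inputs are the precision values of OSL-MR and OSL-MR (w/o prior) from Table 4.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Precision with the prior (before) and without it (after), from Table 4.
precision_pairs = {
    "LongMemEval, budget 256": (0.529, 0.421),
    "LongMemEval, budget 512": (0.315, 0.237),
    "LoCoMo, budget 128": (0.274, 0.251),
}

drops = {
    setting: relative_change(with_prior, without_prior)
    for setting, (with_prior, without_prior) in precision_pairs.items()
}

for setting, change in drops.items():
    print(f"{setting}: {change:+.1%}")
```

The drops are roughly 20% and 25% on LongMemEval at budgets 256 and 512, against roughly 8% on LoCoMo at budget 128.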
Table 4: Mixed-score prior ablation across benchmarks. Rewards are shifted by $C=606.13$ (LoCoMo) and $C=5561.9$ (LongMemEval).

| Dataset | Budget | Method | Precision | Recall | F1 | Occupancy | Reward |
|---|---|---|---|---|---|---|---|
| LoCoMo | 32 | OSL-MR | .118 | .108 | .102 | .668 | 145.5 |
| LoCoMo | 32 | OSL-MR (w/o prior) | .105 | .099 | .092 | .669 | 141.3 |
| LoCoMo | 64 | OSL-MR | .313 | .359 | .295 | .737 | 252.7 |
| LoCoMo | 64 | OSL-MR (w/o prior) | .290 | .331 | .271 | .740 | 236.7 |
| LoCoMo | 128 | OSL-MR | .274 | .532 | .302 | .764 | 305.1 |
| LoCoMo | 128 | OSL-MR (w/o prior) | .251 | .530 | .290 | .784 | 301.1 |
| LongMemEval | 256 | OSL-MR | .529 | .681 | .518 | .711 | 1626.8 |
| LongMemEval | 256 | OSL-MR (w/o prior) | .421 | .669 | .443 | .804 | 1554.2 |
| LongMemEval | 512 | OSL-MR | .315 | .769 | .385 | .851 | 1585.8 |
| LongMemEval | 512 | OSL-MR (w/o prior) | .237 | .762 | .312 | .898 | 1538.7 |
| LongMemEval | 1024 | OSL-MR | .685 | .595 | .536 | .337 | 1957.9 |
| LongMemEval | 1024 | OSL-MR (w/o prior) | .683 | .586 | .533 | .367 | 1912.9 |

Observability leakage diagnostic. Any policy that incorporates answer-derived or gold-evidence-derived features would operate with an unrealistic information advantage over deployable methods. Such a policy is not a fair online baseline and serves only as a leakage diagnostic, reinforcing the necessity of the strict observability separation.
### 4.4 Robustness Analysis
We conduct a sensitivity analysis by varying one reward or cost coefficient at a time while keeping all other coefficients fixed at their default values. The default coefficients are $\alpha_{\mathrm{hit}}=4.0$, $\alpha_{\mathrm{full}}=64.0$, $\alpha_{\mathrm{miss}}=12.0$, $\alpha_{\mathrm{reacq}}=6.0$, and $\alpha_{\mathrm{stale}}=6.0$, with the storage cost fixed to $\alpha_{\mathrm{store}}=1.0$. For each coefficient, we apply a multiplier $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$ while leaving the others unchanged. We report results under fixed memory budgets of 64 on LoCoMo and 512 on LongMemEval here; results for other budgets (LoCoMo 32, 128 and LongMemEval 256, 1024) are provided in the appendix. As shown in the sensitivity curves (Fig. [2](https://arxiv.org/html/2606.10616#S4.F2) and Fig. [3](https://arxiv.org/html/2606.10616#S4.F3)), OSL-MR remains the top-performing non-oracle method across all multipliers. For the reward coefficients $\alpha_{\mathrm{hit}}$ and $\alpha_{\mathrm{full}}$, its average reward increases or remains high as the corresponding reward becomes larger. For the penalty coefficients $\alpha_{\mathrm{miss}}$, $\alpha_{\mathrm{reacq}}$, and $\alpha_{\mathrm{stale}}$, the average reward decreases as expected, but OSL-MR consistently stays above the heuristic and learned baselines, showing robust performance under stronger cost penalties.
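The one-at-a-time sweep described above amounts to enumerating coefficient configurations in which a single entry is rescaled. The sketch below generates that grid from the stated defaults and multipliers; the dictionary keys and the `one_at_a_time_configs` function are hypothetical names, and how each configuration is then scored is outside the scope of this example.

```python
DEFAULTS = {
    "hit": 4.0,
    "full": 64.0,
    "miss": 12.0,
    "reacq": 6.0,
    "stale": 6.0,
    "store": 1.0,  # storage cost stays fixed as the unit baseline
}
MULTIPLIERS = (0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0)
SWEPT = ("hit", "full", "miss", "reacq", "stale")


def one_at_a_time_configs(defaults=DEFAULTS, swept=SWEPT, multipliers=MULTIPLIERS):
    """Yield (name, multiplier, config) with exactly one coefficient rescaled."""
    for name in swept:
        for m in multipliers:
            config = dict(defaults)  # copy so the defaults are never mutated
            config[name] = defaults[name] * m
            yield name, m, config


configs = list(one_at_a_time_configs())
```

Five swept coefficients and seven multipliers give 35 configurations per budget, each differing from the defaults in at most one coefficient.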
Figure 2: Reward sensitivity on LoCoMo (budget=64) when varying a single coefficient. $\alpha_{\mathrm{store}}$ is fixed at $1.0$. Each curve changes one coefficient by multiplying its default value with $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, keeping all other coefficients at their defaults.

Figure 3: Reward sensitivity on LongMemEval (budget=512) when varying a single coefficient. $\alpha_{\mathrm{store}}$ is fixed at $1.0$. Each curve changes one coefficient by multiplying its default value with $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, keeping all other coefficients at their defaults.
### 4.5 Summary of Findings
The key takeaways from our experiments are as follows:
- Heuristics are insufficient. Recency, Generative Agents, and Mixed-Score all saturate the budget with low-utility items, leading to poor evidence F1 under strict constraints.
- Evidence-supervised learning is the primary driver. OSL-MR consistently outperforms all baselines across budgets and datasets, with the largest gains under tight budgets.
- The Mixed-Score prior improves precision without sacrificing recall. Removing the prior reduces precision and increases occupancy while leaving recall nearly unchanged, confirming its role as a useful inductive bias.
- Direct evidence supervision beats behavior cloning. BC-learner improves over heuristics but remains inferior to OSL-MR, demonstrating that learning from gold evidence labels is more effective than imitating a teacher.
- The optimization objective is well-defined. Reward rankings align perfectly with F1 and precision, validating that our reward function (which explicitly models hits, misses, reacquisition, and staleness) captures genuine retention quality.
Together, these takeaways validate OSL\-MR as an online‑safe evidence learning framework for constrained memory retention\.
## 5 Conclusion and Discussion
We have presented OSL\-MR, a framework for memory retention in long\-horizon language agents that explicitly treats retention as a sequential decision problem under budget constraints, delayed feedback, and partial observability\. The key contribution is the formulation itself: to the best of our knowledge, this is the first work to formalize memory retention as a constrained stochastic optimization problem that accounts for evidence utility, reacquisition costs, stale risk, and a strict separation between online\-observable features and offline\-available supervision\. Under this formulation, we propose a data\-driven solution where a lightweight Mixed\-Score heuristic provides a fully deployable cold\-start baseline and an inductive prior, while an evidence learner trained from natural user\-agent interaction logs approximates the optimal policy\. Empirically, the learned policy consistently outperforms strong heuristic baselines \(recency, Generative Agents, and the Mixed\-Score itself\) across LoCoMo and LongMemEval benchmarks, especially under tight budgets\. The Mixed\-Score prior further improves precision without sacrificing recall, and direct evidence supervision is the primary driver of performance gains\. The strict observability separation ensures that all reported gains reflect generalization under realistic information constraints\.
Our framework complements recent advances in memory writing \(Mem0\), compression \(MEM1\), and retrieval\-side optimization \(MemRL\)\. While those systems focus on other aspects of the memory lifecycle, OSL\-MR is the first to formalize and optimize retention under hard budgets and observability constraints\. Future work includes automatically learning the feature groups and weights of the Mixed\-Score prior, developing more principled freshness estimation beyond type\-based decay, and scaling the approach to embodied or tool\-augmented agents via more efficient optimization or amortized decision strategies\.
## References
- BudgetMem: learning selective memory policies for cost-efficient long-context processing in language models. External Links: 2511.04919, [Link](https://arxiv.org/abs/2511.04919). Cited by: [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p2.1).
- Anonymous (2025) AdaGReS: adaptive greedy selection for token-budgeted RAG. Under review. Cited by: [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p1.1).
- H. Chao et al. (2026) STALE: can LLM agents know when their memories are no longer valid?. arXiv preprint arXiv:2605.06527. Cited by: [§2.3](https://arxiv.org/html/2606.10616#S2.SS3.p1.1).
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Cited by: [§2.1](https://arxiv.org/html/2606.10616#S2.SS1.p1.1).
- P. Fofadiya and S. Tiwari (2026) Novel memory forgetting techniques for autonomous ai agents: balancing relevance and efficiency. arXiv preprint arXiv:2604.02280. Note: Submitted on 2 Apr 2026. Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p2.1), [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p1.1).
- Y. Hu, S. Liu, Y. Yue, G. Zhang, B. Liu, F. Zhu, J. Lin, H. Guo, S. Dou, Z. Xi, et al. (2025) Memory in the age of ai agents. arXiv preprint arXiv:2512.13564. Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p1.1), [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p1.1).
- W. Huang, W. Zhang, Y. Liang, Y. Bei, Y. Chen, T. Feng, X. Pan, Z. Tan, Y. Wang, T. Wei, S. Wu, R. Xu, L. Yang, R. Yang, W. Yang, C. Yeh, H. Zhang, H. Zhang, S. Zhu, H. P. Zou, W. Zhao, S. Wang, W. Xu, Z. Ke, Z. Hui, D. Li, Y. Wu, L. He, C. Wang, X. Xu, B. Huang, J. Tan, S. Heinecke, H. Wang, C. Xiong, A. A. Metwally, J. Yan, C. Lee, H. Zeng, Y. Xia, X. Wei, A. Payani, Y. Wang, H. Ma, W. Wang, C. Wang, Y. Zhang, X. Wang, Y. Zhang, J. You, H. Tong, X. Luo, X. Liu, Y. Sun, W. Wang, J. McAuley, J. Zou, J. Han, P. S. Yu, and K. Shu (2026) Rethinking memory mechanisms of foundation agents in the second half: a survey. External Links: 2602.06052, [Link](https://arxiv.org/abs/2602.06052). Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p1.1), [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p1.1).
- H. Jiang, Q. Wu, C. Lin, Y. Yang, and L. Qiu (2023) LLMLingua: compressing prompts for accelerated inference of large language models. In Proceedings of EMNLP. Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10616#S2.SS1.p1.1).
- A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang (2024) Evaluating very long-term conversational memory of llm agents. External Links: 2402.17753, [Link](https://arxiv.org/abs/2402.17753). Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p5.1), [§4.1](https://arxiv.org/html/2606.10616#S4.SS1.p2.1).
- C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gonzalez (2023) MemGPT: towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10616#S2.SS1.p1.1).
- J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10616#S2.SS1.p1.1), [2nd item](https://arxiv.org/html/2606.10616#S4.I1.i2.p1.1).
- Y. Wang et al. (2025) Mem-α: reinforcement learning for LLM agent memory management. Under review. Cited by: [§2.3](https://arxiv.org/html/2606.10616#S2.SS3.p1.1).
- Z. Wang et al. (2024) CORAG: cost-constrained retrieval-augmented generation with monte carlo tree search. arXiv preprint arXiv:2411.00744. Cited by: [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p1.1).
- D. Wu, H. Wang, W. Yu, Y. Zhang, K. Chang, and D. Yu (2025) LongMemEval: benchmarking chat assistants on long-term interactive memory. External Links: 2410.10813, [Link](https://arxiv.org/abs/2410.10813). Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p5.1), [§4.1](https://arxiv.org/html/2606.10616#S4.SS1.p2.1).
- Y. Yue, B. Peng, X. Fan, J. Guo, Q. Li, and Y. Zhang (2026) Mem-t: densifying rewards for long-horizon memory agents. External Links: 2601.23014, [Link](https://arxiv.org/abs/2601.23014). Cited by: [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p2.1).
- H. Zhang, H. Yue, T. Feng, Q. Long, J. Bao, B. Jin, W. Zhang, X. Li, J. You, C. Qin, and W. Wang (2026a) Learning query-aware budget-tier routing for runtime agent memory. External Links: 2602.06025, [Link](https://arxiv.org/abs/2602.06025). Cited by: [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p2.1).
- S. Zhang, J. Wang, R. Zhou, J. Liao, Y. Feng, Z. Li, Y. Zheng, W. Zhang, Y. Wen, Z. Li, et al. (2026b) Memrl: self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192. Cited by: [§2.3](https://arxiv.org/html/2606.10616#S2.SS3.p1.1).
- Y. Zhang, J. Shu, Y. Ma, X. Lin, S. Wu, and J. Sang (2026c) Memory as action: autonomous context curation for long-horizon agentic tasks. External Links: 2510.12635, [Link](https://arxiv.org/abs/2510.12635). Cited by: [§2.2](https://arxiv.org/html/2606.10616#S2.SS2.p2.1).
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024) MemoryBank: enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: [§1](https://arxiv.org/html/2606.10616#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10616#S2.SS1.p1.1).
- Y. Zhou et al. (2025) CSIM: compressed step information memory for long-horizon agent tasks. Under review. Cited by: [§2.3](https://arxiv.org/html/2606.10616#S2.SS3.p1.1).
- Z. Zhou, A. Qu, Z. Wu, S. Kim, A. Prakash, D. Rus, J. Zhao, B. K. H. Low, and P. P. Liang (2025) Mem1: learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841. Cited by: [§2.1](https://arxiv.org/html/2606.10616#S2.SS1.p1.1).
## Appendix A Appendix
### A.1 Robustness Analysis
We conduct a sensitivity analysis by varying one reward or cost coefficient at a time while keeping all other coefficients fixed at their default values. The default coefficients are $\alpha_{\mathrm{hit}}=4.0$, $\alpha_{\mathrm{full}}=64.0$, $\alpha_{\mathrm{miss}}=12.0$, $\alpha_{\mathrm{reacq}}=6.0$, and $\alpha_{\mathrm{stale}}=6.0$, with the storage cost fixed to $\alpha_{\mathrm{store}}=1.0$. For each coefficient, we apply a multiplier $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$ while leaving the others unchanged. We report results under fixed memory budgets of 64 on LoCoMo and 512 on LongMemEval in the main text (Figures [2](https://arxiv.org/html/2606.10616#S4.F2) and [3](https://arxiv.org/html/2606.10616#S4.F3)). The results for budgets 32 and 128 on LoCoMo, and budgets 256 and 1024 on LongMemEval, are shown in the appendix (Figures [4](https://arxiv.org/html/2606.10616#A1.F4), [5](https://arxiv.org/html/2606.10616#A1.F5), [6](https://arxiv.org/html/2606.10616#A1.F6), and [7](https://arxiv.org/html/2606.10616#A1.F7)).
Figure 4: Reward sensitivity on LoCoMo (budget=32) when varying a single coefficient: each curve changes one coefficient by multiplying its default value with $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, keeping all other coefficients at their defaults. $\alpha_{\mathrm{store}}$ is fixed at $1.0$.

Figure 5: Reward sensitivity on LoCoMo (budget=128) when varying a single coefficient: each curve changes one coefficient by multiplying its default value with $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, keeping all other coefficients at their defaults. $\alpha_{\mathrm{store}}$ is fixed at $1.0$.

Figure 6: Reward sensitivity on LongMemEval (budget=256) when varying a single coefficient: each curve changes one coefficient by multiplying its default value with $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, keeping all other coefficients at their defaults. $\alpha_{\mathrm{store}}$ is fixed at $1.0$.

Figure 7: Reward sensitivity on LongMemEval (budget=1024) when varying a single coefficient: each curve changes one coefficient by multiplying its default value with $m\in\{0.25,0.5,0.75,1.0,1.5,2.0,4.0\}$, keeping all other coefficients at their defaults. $\alpha_{\mathrm{store}}$ is fixed at $1.0$.