Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents

arXiv cs.AI Papers

Summary

This paper introduces the concept of memory depth for long-running language agents, distinguishing it from retrieval-based memory access, and proposes EVAF, a selective parametric consolidation mechanism using surprise- and valence-gated LoRA updates. Experiments across multiple models show EVAF improves goal persistence after context unload with minimal parametric writes.

arXiv:2606.26806v1 Announce Type: new Abstract: Long-running language agents need more than memory access. Retrieval systems can fetch past facts at query time, but they do not decide which experiences should continue to shape behavior after the working context is unloaded. We study this separate problem as memory depth: durable goal-conditioned tendencies written into a small parametric store. We introduce the loop-drift protocol, a controlled stress test in which the retrieval index remains intact while working context is unloaded and goal-conditioned behavior must persist under long-loop interference. We evaluate EVAF, a surprise- and valence-gated LoRA consolidation mechanism. Across GPT-2 and TinyLlama, retrieval is strongest on shallow factual recall (short-fact accuracy 0.956--0.973), while EVAF is strongest on goal persistence and post-unload recovery (0.812--0.904) with only 2--3 parametric writes per 200 events. Mechanism controls show that selective consolidation factorizes into two controllable dimensions: selection and actuation. Matched random gates isolate selection beyond sparse writing; fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B show that inner-loop write strength is model-dependent; and a Mistral-7B matched-gate inversion reveals asymmetric selection-actuation coupling under miscalibrated actuation. Public Memora event streams serve as an external diagnostic, exposing stale-memory invalidation as an unresolved boundary. Within this probe, selective parametric consolidation supplies memory depth distinct from and complementary to retrieval access.

# Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents
Source: [https://arxiv.org/html/2606.26806](https://arxiv.org/html/2606.26806)
###### Abstract

Long-running language agents need more than memory access. Retrieval systems can fetch past facts at query time, but they do not decide which experiences should continue to shape behavior after the working context is unloaded. We study this separate problem as *memory depth*: durable goal-conditioned tendencies written into a small parametric store. We introduce the loop-drift protocol, a controlled stress test in which the retrieval index remains intact while working context is unloaded and goal-conditioned behavior must persist under long-loop interference. We evaluate EVAF, a surprise- and valence-gated LoRA consolidation mechanism. Across GPT-2 and TinyLlama, retrieval is strongest on shallow factual recall (short-fact accuracy 0.956–0.973), while EVAF is strongest on goal persistence and post-unload recovery (0.812–0.904) with only 2–3 parametric writes per 200 events. Mechanism controls show that selective consolidation factorizes into two controllable dimensions: selection and actuation. Matched random gates isolate selection beyond sparse writing; fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B show that inner-loop write strength is model-dependent; and a Mistral-7B matched-gate inversion reveals asymmetric selection–actuation coupling under miscalibrated actuation. Public Memora event streams serve as an external diagnostic, exposing stale-memory invalidation as an unresolved boundary. Within this probe, selective parametric consolidation supplies memory depth distinct from and complementary to retrieval access.

## 1 Introduction

Long-horizon language agents accumulate histories that quickly exceed their working context. The dominant engineering answer is retrieval: store past events outside the model, retrieve a relevant subset, and condition generation on the retrieved text (Lewis et al. [2020](https://arxiv.org/html/2606.26806#bib.bib20); Packer et al. [2023](https://arxiv.org/html/2606.26806#bib.bib23); Gutiérrez et al. [2024](https://arxiv.org/html/2606.26806#bib.bib24)). Retrieval is indispensable, but it answers a particular question: *what can be fetched?* It does not answer a complementary question: *what should keep shaping the agent’s behavior even when no relevant text is in context?*

We call this second property *memory depth*. A shallow memory is available when a system retrieves or attends to it. A deep memory changes future behavior: it persists through interference, survives context unload, and affects choices without being reinserted as text. This distinction mirrors the motivation behind complementary learning systems, where fast episodic stores and slow consolidating stores serve different roles (McClelland et al. [1995](https://arxiv.org/html/2606.26806#bib.bib1); Kumaran et al. [2016](https://arxiv.org/html/2606.26806#bib.bib2)). For language agents, the distinction is easy to blur. If a benchmark asks for an old fact, retrieval should win. But a long-running assistant also needs durable goals, preferences, and constraints that are not merely fetched facts.

We therefore study selective parametric consolidation under a controlled loop-drift protocol. Each synthetic user stream contains stable goal events, off-topic distractors, transient opposite requests, conflicts, sibling-user contamination, and explicit factual notes. We probe multiple memory layers: recent facts, old noisy facts, goal persistence, and post-unload goal recovery. The key design is that retrieval memory is durable. Context unload clears the working context, not the retrieval index. Thus an EVAF advantage on the goal layer is not a trivial “RAG forgot” artifact.

Our mechanism, EVAF, uses a surprise-times-valence gate to admit only behavior-relevant events into a small buffer. When the buffer fills, a low-rank adapter is updated with replay and an L2 anchor. This is not meant to replace retrieval: it targets the slower question of which events should leave a persistent behavioral imprint.

The paper makes five contributions:

1. We formulate the memory-access versus memory-depth split for long-running language agents and instantiate it in a controlled loop-drift protocol.
2. We show a depth flip across GPT-2 and TinyLlama: RAG wins shallow facts, while EVAF wins goal persistence and post-unload recovery with far fewer writes than naive continual LoRA.
3. We show that selection is not only sparsity: a matched random gate with the same write count loses to EVAF on GPT-2, while Mistral-7B exposes how miscalibrated actuation can invert the same comparison.
4. We separate selection from actuation. Fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B show that write strength is a distinct, model-dependent factor, and that over-actuation can degrade both persistence and selectivity.
5. We use public Memora event streams as an external diagnostic. The result is deliberately modest: EVAF is directionally positive but not significant on stale-memory rejection, exposing deletion/update validity as future work.

## 2 Memory Access vs. Memory Depth

We define a long-running memory stream as a sequence of events $x_1, \ldots, x_T$. At evaluation time, a method may have access to an external store, a parametric adapter, or both. We distinguish four probe layers.

*Shallow episodic access* asks for a recent explicit fact. This is the natural strength of retrieval. *Noisy episodic access* asks for an older fact after same-key interference. Retrieval is still expected to be competitive. *Parametric tendency* asks whether a stable goal continues to shape behavior after long interference. *Post-unload recovery* repeats the goal probe immediately after a context unload, with retrieval memory still intact but working context cleared.

The target is not universal memory accuracy\. The target is a more specific tradeoff:

$$\max\; \mathrm{GoalPersist} + \mathrm{PostUnload} \quad \text{s.t. low writes and bounded adapter drift.}$$

Short factual access is expected to be owned by retrieval. The contribution of parametric consolidation is the goal-conditioned layer that remains active under unload.

## 3Method

### 3.1 EVAF Selection Gate

EVAF maintains a small write buffer. For each event $x_t$, the model computes a surprise score $s_t$ from token negative log-likelihood and a valence score $v_t$ from embedding similarity to the user’s durable goal and preferences. The write admission score is

$$g_t = \sigma(k_s (s_t - \tau_s)) \cdot \sigma(k_v (v_t - \tau_v)).$$

If $g_t > \tau_w$, the event enters the buffer. When the buffer reaches a fixed size, the adapter is updated on the buffer plus replay from previous consolidated events. The adapter is a LoRA module (Hu et al. [2022](https://arxiv.org/html/2606.26806#bib.bib13)); replay and an L2 anchor act as drift guards, following the broad continual-learning intuition behind replay and elastic constraints (Rolnick et al. [2019](https://arxiv.org/html/2606.26806#bib.bib9); Kirkpatrick et al. [2017](https://arxiv.org/html/2606.26806#bib.bib7)). We use a per-user median warmup for the surprise threshold, fixed gate slopes ($k_s = 1$, $k_v = 10$), $\tau_v = 0.5$, and $\tau_w = 0.5$ across seeds; full constants are listed in the supplement.
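As a minimal sketch of the admission rule above, the following Python computes $g_t$ from a surprise score and a valence score using the reported constants ($k_s = 1$, $k_v = 10$, $\tau_v = 0.5$, $\tau_w = 0.5$). The function names, the toy inputs, and the reading of the per-user median warmup as the median of an initial batch of surprise scores are assumptions for illustration, not the paper's implementation.

```python
import math
from statistics import median


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def warmup_threshold(surprises: list[float]) -> float:
    """Assumed warmup: tau_s is the median surprise over a user's first events."""
    return median(surprises)


def gate_score(s_t: float, v_t: float, tau_s: float,
               k_s: float = 1.0, k_v: float = 10.0, tau_v: float = 0.5) -> float:
    """g_t = sigmoid(k_s (s_t - tau_s)) * sigmoid(k_v (v_t - tau_v))."""
    return sigmoid(k_s * (s_t - tau_s)) * sigmoid(k_v * (v_t - tau_v))


def admit(s_t: float, v_t: float, tau_s: float, tau_w: float = 0.5) -> bool:
    """An event enters the write buffer only if its gate score exceeds tau_w."""
    return gate_score(s_t, v_t, tau_s) > tau_w


# Hypothetical warmup surprises and two hypothetical events.
tau_s = warmup_threshold([2.0, 3.0, 4.0])
goal_event = admit(s_t=6.0, v_t=0.9, tau_s=tau_s)   # surprising and goal-aligned
distractor = admit(s_t=6.0, v_t=0.2, tau_s=tau_s)   # surprising but off-goal
```

Because each sigmoid is at most 1, the product can exceed $\tau_w = 0.5$ only when both factors exceed 0.5, that is, when the surprise and the valence are each above their own threshold. An event sitting exactly at both thresholds scores $0.5 \cdot 0.5 = 0.25$ and is rejected.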

### 3.2 Actuation Controllers

The original EVAF implementation coupled selection with a fixed inner-loop write strength. Our later controls show that this is not enough: the same selected buffer can help or hurt depending on how strongly it is written. We therefore evaluate actuation separately from event selection.

*Fixed-inner controllers* are the primary diagnostic: they use the same EVAF selection gate but vary only the number of inner LoRA steps. We report fixed-1, fixed-2, and fixed-3.

*Behavior-margin control* is included as an engineering proof of concept rather than as a first-class mechanism claim. It mechanically extracts held-in calibration pairs from the training event itself, never from evaluation stems, and writes until the event-derived goal/preference margin moves by a target amount, subject to KL and L2 caps. We report it because it offers a behavior-level actuation controller that complements the fixed-inner diagnostic, but its results are not load-bearing: the Mistral cross-seed variance is much larger than for fixed-inner controls, and we do not use it for any 7B repair claim.

*Routed EVAF+RAG* routes factual probes to retrieval and goal probes to EVAF. It tests complementarity: retrieval for access, parametric consolidation for depth.
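The routing rule can be sketched as a small dispatcher. This assumes the probe family (factual or goal) is known to the router, as it would be in a controlled protocol; `ProbeKind`, `make_router`, and the two answer callables are hypothetical names, and how the paper's implementation identifies the probe type is not specified here.

```python
from enum import Enum
from typing import Callable


class ProbeKind(Enum):
    FACT = "fact"   # shallow or noisy episodic access
    GOAL = "goal"   # goal persistence or post-unload recovery


def make_router(rag_answer: Callable[[str], str],
                evaf_answer: Callable[[str], str]) -> Callable[[ProbeKind, str], str]:
    """Build a router that sends factual probes to retrieval and goal probes to the adapter."""
    def route(kind: ProbeKind, prompt: str) -> str:
        if kind is ProbeKind.FACT:
            return rag_answer(prompt)
        return evaf_answer(prompt)
    return route


# Stand-in answer functions; real ones would call the retrieval pipeline
# and the adapter-equipped model respectively.
route = make_router(lambda p: f"rag:{p}", lambda p: f"evaf:{p}")
```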

### 3.3 Mechanistic Intuition

For a stream loss $\ell_t(\theta)$, a naive continual adapter applies updates on nearly every event,

$$\Delta\theta_{\mathrm{naive}} \propto -\sum_{t=1}^{T} \nabla_{\theta} \ell_t(\theta).$$

EVAF instead admits a sparse subset $A = \{t : g_t > \tau_w\}$ and updates

$$\Delta\theta_{\mathrm{evaf}} \propto -\sum_{t \in A} \nabla_{\theta} \ell_t(\theta) - \lambda(\theta - \theta_0).$$

This should reduce drift when most stream events are distractors, but sparsity alone is not the mechanism: if the admitted subset is random, the update can still point toward transient, sibling, or off-topic gradients. The gate is useful only if $g_t$ is correlated with behavior-relevant events. The matched-gate ablation tests exactly this condition.
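To make the contrast between the two updates concrete, here is a toy single-step sketch in plain Python. It treats per-event gradients as precomputed lists and omits replay, the LoRA parameterization, and multiple inner steps; the function names, the learning rate, and the `lam` value are illustrative assumptions.

```python
def naive_update(theta: list[float], grads: list[list[float]], lr: float = 0.1) -> list[float]:
    """One step on the sum of all per-event gradients (no gate, no anchor)."""
    return [th - lr * sum(g[i] for g in grads) for i, th in enumerate(theta)]


def evaf_update(theta: list[float], theta0: list[float],
                grads: list[list[float]], gates: list[float],
                tau_w: float = 0.5, lam: float = 0.1, lr: float = 0.1) -> list[float]:
    """One step on gated gradients plus an L2 anchor pulling theta toward theta0."""
    admitted = [g for g, score in zip(grads, gates) if score > tau_w]
    updated = []
    for i, th in enumerate(theta):
        grad_sum = sum(g[i] for g in admitted)   # sum over t in A
        anchor = lam * (th - theta0[i])          # lambda * (theta - theta_0)
        updated.append(th - lr * (grad_sum + anchor))
    return updated


# Hypothetical stream: one goal-relevant event (high gate score) and one
# distractor with a large gradient (low gate score).
theta, theta0 = [1.0, 0.0], [0.0, 0.0]
grads = [[1.0, 0.0], [0.0, 5.0]]
gates = [0.9, 0.1]

naive = naive_update(theta, grads)
gated = evaf_update(theta, theta0, grads, gates)
```

In this toy case the naive step is pushed along the second coordinate by the distractor gradient, while the gated step leaves that coordinate unchanged. If `gates` were assigned at random, the distractor could be admitted instead, which is the failure the matched-gate control is designed to expose.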

### 3.4 Separable Factors, Coupled Dynamics

Selection and actuation are separable control factors, but they are not independent online dynamics. Selection drives actuation directly: the gate decides whether a buffer is written. Actuation feeds back into future selection indirectly: after the adapter changes, later surprise scores are computed under the new model state.

This asymmetric feedback is visible in the Mistral-7B fixed-inner audit (Table [1](https://arxiv.org/html/2606.26806#S3.T1)). Smaller inner steps trigger more future writes on Mistral, consistent with weaker actuation leaving later similar events surprising enough to pass the gate. The direction is monotonic across all four Mistral seeds. On GPT-2 and TinyLlama, the same loop runs at much lower absolute write counts (mean writes roughly 2–5), and the compressed range does not produce a comparably clean monotonic signature; the coupling is most legible at 7B, where the gap between Fixed-1 and the default five-step controller spans about six writes per user.

Table 1: Asymmetric online feedback in the Mistral-7B four-seed fixed-inner audit. Writes increase as inner-step strength decreases, indicating that actuation changes future selection through model-state-dependent surprise.

## 4 Loop-Drift Protocol

Existing public long\-memory benchmarks predominantly evaluate whether a system can retrieve, update, or reason over stored information\. They do not directly isolate the post\-unload setting we need here: the retrieval index remains available, but the relevant text is absent from working context and the same parametric write must continue to shape goal\-conditioned behavior\. Loop\-drift is therefore a controlled protocol rather than a leaderboard benchmark\.

Each run contains 10 users and 200 events per user. Events are drawn from a mixture of stable goal/preference reminders, distractors, transient opposite requests, conflicts, sibling-user contamination, and scheduled factual notes. We evaluate Frozen, Summary, RAG, Naive-LoRA, EVAF, and Routed EVAF+RAG. GPT-2 and TinyLlama results are averaged over four seeds. The RAG baseline stores all events in a durable embedding index and retrieves the top three events by cosine similarity; context unload does not clear this index.
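The retrieval baseline and the unload semantics can be sketched as follows. The class and method names (`DurableIndex`, `Agent`, `unload_context`) and the toy two-dimensional embeddings are assumptions; the only details taken from the text are top-three cosine retrieval and the fact that unloading clears working context but not the index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


class DurableIndex:
    """Stores every event with its embedding and retrieves the top-k by cosine."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self._items.append((text, list(embedding)))

    def retrieve(self, query: list[float], k: int = 3) -> list[str]:
        ranked = sorted(self._items, key=lambda it: cosine(query, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


class Agent:
    """Working context is volatile; the retrieval index is not."""

    def __init__(self) -> None:
        self.working_context: list[str] = []
        self.index = DurableIndex()

    def observe(self, text: str, embedding: list[float]) -> None:
        self.working_context.append(text)
        self.index.add(text, embedding)

    def unload_context(self) -> None:
        self.working_context.clear()   # the index is deliberately left intact


agent = Agent()
for text, emb in [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0]), ("d", [-1.0, 0.0])]:
    agent.observe(text, emb)
before = agent.index.retrieve([1.0, 0.0])
agent.unload_context()
after = agent.index.retrieve([1.0, 0.0])
```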

The protocol is synthetic by design\. The mechanism claim requires four properties that are rarely controlled together in public memory benchmarks: explicit interference, durable retrieval memory, an observable context unload, and separate factual versus goal\-conditioned probes\. Synthetic control lets us ask whether a method changed post\-unload behavior rather than whether it found a fact in an external store\.

The protocol uses matched continuation scoring. Higher is better for short-fact, long-fact, goal, and post-unload probes. Lower is better for contamination and transient overwrite. Adapter cost is measured by write count and LoRA L2 drift. A write denotes one buffer-consolidation trigger, not one gradient step; total inner updates equal writes times the controller’s inner-loop step count.
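As a worked example of this accounting, take the 2–3 writes per 200 events reported for EVAF and assume they are executed by the default five-step controller:

$$\text{inner updates} = \text{writes} \times \text{inner steps}: \qquad 2 \times 5 = 10, \qquad 3 \times 5 = 15.$$

Under that assumption, EVAF performs roughly 10–15 gradient steps per 200-event stream, whereas a controller that writes on every event incurs at least 200.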

## 5 Main Result: The Depth Flip

Table [2](https://arxiv.org/html/2606.26806#S5.T2) and Figure [1](https://arxiv.org/html/2606.26806#S5.F1) show the central result. The expected winner changes with memory depth. RAG is strongest on recent explicit facts, reaching 0.956–0.973 short-fact accuracy. EVAF is near chance on short facts, as expected: its gate is goal-conditioned and rejects off-topic factual notes. But on the goal layer, EVAF is much stronger than RAG. On TinyLlama, EVAF reaches 0.833 goal persistence and 0.812 post-unload recovery, compared with RAG at 0.396 and 0.394. On GPT-2, EVAF reaches 0.904 and 0.900, compared with RAG at 0.398 and 0.394.

![Refer to caption](https://arxiv.org/html/2606.26806v1/x1.png)

Figure 1: The expected winner flips with memory depth. Retrieval is strongest on shallow factual access, while EVAF is strongest on goal persistence and post-unload recovery. Within each model, bars are grouped by probe depth; the dashed divider separates TinyLlama from GPT-2. Values are four-seed means from the loop-drift protocol.

Table 2: Depth split on the loop-drift protocol (four seeds, 10 users, 200 events). Retrieval owns shallow factual access; selective parametric consolidation owns the context-unloaded goal layer. Naive-LoRA writes every event and is costly without matching EVAF on the goal layer. Routed EVAF+RAG supports complementarity.

Routed EVAF+RAG is not the main mechanism, but it is an important sanity check. On GPT-2, it recovers shallow factual access while slightly improving the goal layer. On TinyLlama, it trades about six points of goal/post-unload performance for near-perfect short-fact recall. This supports the division of labor: retrieval and EVAF solve different memory problems, and routing can combine them with model-dependent tradeoffs.

## 6 Mechanism Controls

The mechanism controls below are independent audit reruns, not extra rows from the depth-split table. They use the same loop-drift generator and probe families, but rerun the stream under controlled gate and actuation counterfactuals; the margin controller additionally uses marker-style training-only calibration pairs. Therefore, the actuation table should be read within-run. Its “Audit-5” row is the original five-step EVAF controller in that audit, not a second estimate of the Table [2](https://arxiv.org/html/2606.26806#S5.T2) depth-split number. The two runs use independent RNG streams and separate probe instantiations; the actuation audit also includes the marker-style calibration used by the Margin row. As a result, the same five-step controller can have different absolute scores across tables (e.g., TinyLlama EVAF is 0.833 in Table [2](https://arxiv.org/html/2606.26806#S5.T2), while Audit-5 is 0.627 in Table [3](https://arxiv.org/html/2606.26806#S6.T3)). The within-table comparisons are the load-bearing ones.

### 6.1 Writing Everything Is Not Enough

Naive-LoRA writes every event. It has 200 writes per stream and much higher L2 drift than EVAF (about 67 on TinyLlama and 119 on GPT-2 in the main run), yet it does not cleanly solve the goal/post-unload layer. This shows that the mechanism is not “parametric memory” alone. The question is which events to admit and how strongly to write them. At 7B, indiscriminate writing becomes actively harmful: in the Mistral matched-gate audit, Naive-LoRA collapses to 0.333 ± 0.047 goal persistence, below the 0.500 chance baseline, while EVAF under the same uncorrected five-step actuation remains at 0.362 ± 0.092. Writing without selection is therefore not merely inefficient; at scale, it can move behavior in the wrong direction.

### 6.2 Matched Gate: Selection Is Not Sparsity

A natural objection is that EVAF works only because it writes sparsely. We therefore compare against a random matched gate: the same write count and the same online write dynamics, but random admitted events. This preserves the coupled gate-actuation loop while isolating which events are selected.

On GPT-2, EVAF beats random matched gate on goal and post-unload in all four seeds, with mean goal/post 0.790/0.763 versus 0.590/0.619. This establishes that selection is not reducible to writing fewer events. TinyLlama is weak and mixed (0.625/0.581 versus 0.596/0.569), indicating that selection alone does not determine the final behavior when actuation is poorly calibrated. Since TinyLlama is larger than GPT-2 yet less clean in this matched-gate audit, the selection signal is not a monotonic function of model scale under a fixed five-step actuation rule.

Mistral-7B sharpens this point in the opposite direction. The fixed-inner audit below shows that the original five-step actuation is miscalibrated at 7B; under that wrong regime, the matched-gate comparison reverses, with random matched writing ahead by 0.326 goal and 0.369 post-unload (EVAF 0.362 ± 0.092 / 0.312 ± 0.092; Random-Matched-Gate 0.688 ± 0.058 / 0.681 ± 0.059). This is not evidence that the EVAF gate fails. It is a diagnostic signature of the asymmetric coupling in Section 3.4: when actuation is miscalibrated, a behavior-targeted gate concentrates the wrong update on goal-relevant events, while a random gate spreads the same write strength across off-goal events. Notably, even in this reversed regime, EVAF retains the lowest sibling contamination among parametric methods (0.787 ± 0.041 versus 0.825 ± 0.046 for random matched gate and 0.870 ± 0.046 for Naive-LoRA; RAG is 0.995 ± 0.004). Selection remains semantically active; what fails is its translation into goal behavior under miscalibrated actuation. The fixed-inner audit below closes this bridge by showing that the same gate recovers on Mistral when actuation is reduced.
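The two reported gaps follow directly from the four-seed Mistral-7B means quoted in this paragraph (Random-Matched-Gate minus EVAF):

$$0.688 - 0.362 = 0.326 \;\; (\text{goal}), \qquad 0.681 - 0.312 = 0.369 \;\; (\text{post-unload}).$$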

### 6.3 Actuation Controls

Table [3](https://arxiv.org/html/2606.26806#S6.T3) reports the final fixed-inner audit. The original five-step controller (Audit-5) over-actuates in these runs. Smaller fixed controllers sharply reduce drift and improve goal/post-unload behavior, showing that write strength is a separable control dimension rather than a constant hyperparameter. There is no universal optimum: GPT-2 is strongest at Fixed-1 in goal/post, TinyLlama reaches high goal under Fixed-1 and Margin but pays a large contamination cost, and Mistral has stable recovery under Fixed-2 and stronger but saturated behavior under Fixed-1. Behavior-margin control is kept as a proof-of-concept actuation controller, but it is not the load-bearing result. Contamination makes the tradeoff visible: Fixed-1 is best read as an upper bound on goal movement, while Fixed-2 can offer a cleaner selectivity–efficiency point, e.g., on GPT-2 it preserves high goal (0.938) with lower contamination (0.203) than Fixed-1.

Table 3: Actuation audit. Selection and write strength are separable factors: the same selection gate behaves differently under different inner-loop strengths. Contamination is included because high goal/post performance can come with selectivity cost. L2 should be compared within model, not across model families. Audit-5 is the original five-step controller in this independent audit rerun. Writes are consolidation triggers, not total gradient steps.

Mistral-7B confirms the same factorization across four seeds. The original five-step audit controller has goal/post 0.354 ± 0.075 / 0.306 ± 0.080. Fixed-2 recovers stable behavior (0.796 ± 0.035 / 0.775 ± 0.041), and Fixed-1 reaches 0.919 ± 0.081 / 0.938 ± 0.095. Fixed-1 contamination saturates at 1.000 in all four seeds, which indicates that this high-actuation regime crosses the sibling probe’s discriminative range rather than providing fine-grained selectivity information. Margin is more variable on Mistral (0.671 ± 0.177 / 0.631 ± 0.188), so we do not use it as the 7B repair claim. The 7B result is therefore not “Fixed-1 solves scale,” but a narrower mechanistic conclusion: fixed five-step actuation is wrong, and inner-loop strength is a separable, model-dependent control dimension.

## 7 Boundary Diagnostic: Memora

To probe the boundary of the mechanism claim, we test public Memora event streams, which include memory mutation and stale evidence not present in the loop\-drift protocol\. This is not an external validation win and not a SOTA claim\. It asks whether append\-only selective consolidation also solves stale\-memory invalidation\.

The full 10-persona diagnostic is weak but informative (Table [4](https://arxiv.org/html/2606.26806#S7.T4)). EVAF improves forgetting absence from 91/222 to 95/222, but the McNemar test is not significant ($p = 0.57$). A pre-registered high-mutation slice is also non-significant. This means Memora should be read as a boundary: current EVAF does not solve delete/update validity. Earlier negative-gradient forgetting variants were unstable; validity gating or reconsolidation is the more plausible future direction.

Table 4: Memora boundary diagnostic. Results are directionally positive but not significant, so Memora is a public boundary result, not a victory benchmark.

## 8 Related Work
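For readers who want to run this kind of paired test on their own outputs, here is a minimal exact (binomial) McNemar sketch in Python. The paper does not report the discordant-pair counts or say which McNemar variant it uses, so the function name and the example counts below are hypothetical and are not an attempt to reproduce $p = 0.57$.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items only the baseline gets right; c: items only the new method gets right.
    Under the null, each discordant item is equally likely to fall on either side,
    so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, chosen only to exercise the function.
p_small = mcnemar_exact(1, 5)
p_tied = mcnemar_exact(3, 3)
```

A net gain of four items out of 222 (from 91 to 95) corresponds to `c - b = 4`; whether that is significant depends on how many discordant pairs there are in total, which is why the raw accuracy change alone does not settle the question.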

*Retrieval and agent memory.* Retrieval-augmented generation and agent memory systems fetch relevant past text while keeping model weights fixed (Lewis et al. [2020](https://arxiv.org/html/2606.26806#bib.bib20); Packer et al. [2023](https://arxiv.org/html/2606.26806#bib.bib23); Gutiérrez et al. [2024](https://arxiv.org/html/2606.26806#bib.bib24)). Public long-memory benchmarks such as LongMemEval and LoCoMo predominantly emphasize conversational recall, temporal access, and knowledge updates (Wu et al. [2024](https://arxiv.org/html/2606.26806#bib.bib34); Maharana et al. [2024](https://arxiv.org/html/2606.26806#bib.bib35)). Our point is not that retrieval is weak. It is that these evaluations do not directly isolate the post-unload setting in which retrieval remains available but behavior must continue without reinserting the relevant text; the loop-drift protocol adds that controlled probe.

*Neural memory and test-time fast weights.* Recent architectures explicitly learn memory at test time. Titans introduces a neural long-term memory module that complements attention (Behrouz et al. [2025](https://arxiv.org/html/2606.26806#bib.bib25)), while In-Place TTT updates selected fast weights during inference (Feng et al. [2026](https://arxiv.org/html/2606.26806#bib.bib26)). Concurrent parametric-memory work uses online LoRA fast weights for self-evolving agents (Ren et al. [2026](https://arxiv.org/html/2606.26806#bib.bib27)), studies document LoRA under KV-cache compression (Zuo et al. [2026](https://arxiv.org/html/2606.26806#bib.bib28)), and quantifies exact LoRA memory laws (Xu et al. [2026](https://arxiv.org/html/2606.26806#bib.bib29)). Other recent agent-memory work studies retrieval-side context validity (Yang et al. [2026](https://arxiv.org/html/2606.26806#bib.bib30)) or local Engram edits as an alternative to per-user LoRA contamination (Li [2026](https://arxiv.org/html/2606.26806#bib.bib31)). These works are closest in spirit to parametric memory. Our focus is different: we keep the base architecture fixed and ask which agent-stream events should be selectively consolidated into a small adapter so that behavior persists after context unload.

*Continual learning and low-rank adaptation.* Continual learning studies catastrophic forgetting, replay, and weight regularization (French [1999](https://arxiv.org/html/2606.26806#bib.bib6); Kirkpatrick et al. [2017](https://arxiv.org/html/2606.26806#bib.bib7); Rolnick et al. [2019](https://arxiv.org/html/2606.26806#bib.bib9)). LoRA provides a compact parametric store (Hu et al. [2022](https://arxiv.org/html/2606.26806#bib.bib13)). EVAF combines these ingredients inside a selective event gate and evaluates whether the write changes post-unload behavior, rather than reporting task accuracy alone.

*Complementary learning systems.* The method is inspired by the broad CLS idea that fast episodic traces and slow consolidating stores serve different functions (McClelland et al. [1995](https://arxiv.org/html/2606.26806#bib.bib1); Kumaran et al. [2016](https://arxiv.org/html/2606.26806#bib.bib2); Schapiro et al. [2017](https://arxiv.org/html/2606.26806#bib.bib4)). We use this as motivation only: EVAF is not a biological model. In particular, stale-memory deletion remains unsolved and likely requires validity gating or reconsolidation rather than anti-training.

## 9 Scope and Limitations

The paper establishes a narrow mechanism claim: memory depth can be probed by post\-unload goal\-conditioned behavior, and selective parametric consolidation factorizes into selection and actuation\. It does not claim universal memory accuracy, SOTA retrieval performance, robust paraphrase generalization, or complete deletion/update validity\.

First, the main protocol is synthetic\. This is intentional: the mechanism claim requires controlled interference, durable retrieval, and context unload\. Memora provides a public diagnostic, but it is not a significant win\.

Second, generalization to held-out paraphrases is not robust. TinyLlama is weak and directional; GPT-2 is not positive. We therefore claim durable held-in goal-conditioned behavior, not broad semantic tendency generalization.

Third, the 7B result establishes the same actuation factorization, but not a complete large\-model memory system\. Mistral still exhibits high contamination under strong actuation, and Margin has high variance across seeds\.

Fourth, high\-actuation regimes can increase contamination\. This is why we treat selection and actuation as separable factors with online feedback, not as an already\-solved controller\.

Finally, forgetting is not solved\. Negative CE and anti\-training are the wrong object for stale\-memory invalidation; future work should study validity\-gated reactivation and reconsolidation\.

## 10 Conclusion

Memory access and memory depth are different problems. Retrieval is the right tool for factual access. Selective parametric consolidation is a mechanism for durable goal-conditioned behavior under interference and context unload. The loop-drift protocol makes this split visible: RAG wins shallow facts, while EVAF wins goal persistence and post-unload recovery with few writes and bounded drift. The broader contribution is a mechanistic framing: consolidation requires both semantic selection and calibrated actuation, and those factors are separable controls with asymmetric online feedback. This gives future agent memory work a more precise target than simply storing more text.

## References

- A. Behrouz, P. Zhong, and V. Mirrokni (2025). Titans: learning to memorize at test time. In Advances in Neural Information Processing Systems (NeurIPS). [Link](https://openreview.net/forum?id=8GjSf9Rh7Z)
- G. Feng, S. Luo, K. Hua, G. Zhang, W. Huang, D. He, and T. Cai (2026). In-place test-time training. In International Conference on Learning Representations (ICLR). Note: Oral presentation. [Link](https://openreview.net/forum?id=dTWfCLSoyl)
- R. M. French (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3(4), pp. 128–135. [Document](https://dx.doi.org/10.1016/S1364-6613%2899%2901294-2)
- B. J. Gutiérrez, Y. Shu, Y. Gu, M. Yasunaga, and Y. Su (2024). HippoRAG: neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR). Note: arXiv:2106.09685, 2021.
- J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526. [Document](https://dx.doi.org/10.1073/pnas.1611835114)
- D. Kumaran, D. Hassabis, and J. L. McClelland (2016). What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in Cognitive Sciences 20(7), pp. 512–534. [Document](https://dx.doi.org/10.1016/j.tics.2016.05.004)
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS).
- B. Li (2026). User as engram: internalizing per-user memory as local parametric edits. arXiv preprint arXiv:2606.19172. [Link](https://arxiv.org/abs/2606.19172)
- A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang (2024). Evaluating very long-term conversational memory of LLM agents (LoCoMo). arXiv preprint arXiv:2402.17753.
- J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3), pp. 419–457. [Document](https://dx.doi.org/10.1037/0033-295X.102.3.419)
- C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez (2023). MemGPT: towards LLMs as operating systems. arXiv preprint arXiv:2310.08560.
- T. Ren, W. Luo, H. Yang, R. Zhu, X. Huang, Y. Wu, B. Chou, J. Ye, J. Liang, Y. Li, and Y. Peng (2026). Scaling self-evolving agents via parametric memory. arXiv preprint arXiv:2606.04536. [Link](https://arxiv.org/abs/2606.04536)
- D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019). Experience replay for continual learning. In Advances in Neural Information Processing Systems (NeurIPS).
- A. C. Schapiro, N. B. Turk-Browne, M. M. Botvinick, and K. A. Norman (2017). Complementary learning systems within the hippocampus: a neural network modelling approach. Philosophical Transactions of the Royal Society B 372(1711), pp. 20160049. [Document](https://dx.doi.org/10.1098/rstb.2016.0049)
- D. Wu, H. Wang, W. Yu, Y. Zhang, K. Chang, and D. Yu (2024). LongMemEval: benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.
- Z. Xu, H. Hong, L. Yu, B. Cui, L. Huang, H. Xue, and N. Zhang (2026). How LoRA remembers? a parametric memory law for LLM finetuning. arXiv preprint arXiv:2605.30260. [Link](https://arxiv.org/abs/2605.30260)
- W. Yang, B. Kan, S. Li, L. Li, Y. Qin, J. Li, P. Bogdan, and J. Thomason (2026). RaMem: contextual reinstatement for long-term agentic memory. arXiv preprint arXiv:2606.22844. [Link](https://arxiv.org/abs/2606.22844)
- C. Zuo, L. Wang, W. Jurayj, W. Fleshman, and B. Van Durme (2026). Rethinking LoRA memory through the lens of KV cache compression. arXiv preprint arXiv:2606.05698. [Link](https://arxiv.org/abs/2606.05698)
