Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One
Summary
This paper shows that a language model with a lossy memory that retains a wrong conclusion but drops the evidence produces confident incorrect answers, whereas an empty memory leads to abstention. The authors propose a source-first compression policy that preserves recomputable sources instead of conclusions to maintain correctability, and demonstrate the mechanism across multiple models and dialogue systems.
View Cached Full Text
Cached at: 06/25/26, 05:11 AM
# Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One
Source: [https://arxiv.org/html/2606.25449](https://arxiv.org/html/2606.25449)
###### Abstract
A language model’s memory can beworse than having no memory at all\. Give a model a memory that kept a wrong conclusion but dropped the work behind it and it emits that stale value as a confident answer; give the same model an*empty*memory and it abstains\. Across the seven models we test this direction never reverses \(lossy emits a confident wrong value where empty abstains in every one\), so the claim carries a clean kill condition: one answer\-disposed model that abstains under a wrong\-valued memory would break it, and none does\. We call the failurebrittle memory\. It is behavioral, not the information bound beneath it \(which is immediate from the definition\), and separable from it: only the*magnitude*is disposition\- and task\-dependent \(a version\-dependent risk, not a universal law\); the direction is not\. Language models carry information across turns by compressing it into memory on the assumption that preserving the answer preserves what matters; we show the same compression decides whether the model can be*corrected*\. We measure this withreclaim evaluation: compress a drifted interaction at a fixed budget, then test whether a correction recovers the known answer, scored against ground truth with no judge\. Correctability is bottlenecked by whether the answer\-determining source survives, not by capability\. A one\-linesource\-firstpolicy \(keep the recomputable source, drop the re\-derivable conclusion\) restores correctability at equal budget wherever that source is compact and identifiable; a length\-matched control rules out added text as the cause\. The hand\-built policy is an oracle; a one\-prompt deployable version of it reclaims0\.490\.49–0\.880\.88, short of the oracle’s1\.001\.00and concentrated on compact numeric sources\. The deployment stake is not a single wrong answer but a compounding one: chained through the memory loop a deployed agent runs, one dropped\-source error corrupts a growing span of downstream steps and stays uncorrectable however late it is caught, while source\-first holds to a bounded budget horizon\. The wall and the fix replicate across three deployed memory systems and on real dialogue \(MultiWOZ\); past the budget where the source no longer fits, the fix fails silently unless the note records its own completeness\. This is a controlled study of a mechanism, not a benchmark: judge\-free exact scoring, matched\-budget controls, and validators built to come out false, with headline cells atn=96n\{=\}96\. We release the harness, the paired memory conditions, and these validators\.111Code, data, and the reproduction harness:[https://github\.com/collapseindex/reclaim\-eval](https://github.com/collapseindex/reclaim-eval)\.
Figure 1:Compression decides whether an error stays fixable\.A model drifts in session 1; only a compressed memory crosses into session 2, at a fixed budget\. Underlossycompression the memory keeps the salient*wrong*conclusion and discards the source, so a later correction has nothing to recompute from, and the model does not abstain, it confidently returns the stale wrong value\. Undersource\-firstcompression the memory keeps the recomputable source and drops the re\-derivable conclusion, so the same correction lands and the model recovers the truth\. Same budget, opposite outcome; the only difference is what is kept\. Numbers are directed\-armRRat low memory integrity; the same source\-kept/source\-dropped pattern appears across two models and two task families \(Tables[5](https://arxiv.org/html/2606.25449#S5.T5),[6](https://arxiv.org/html/2606.25449#S5.T6)\)\. Thesource\-firstmemory shown is the hand\-built oracle; a one\-prompt deployable distiller reclaims0\.490\.49–0\.880\.88, not1\.001\.00, and is the number to ship against \(§[5](https://arxiv.org/html/2606.25449#S5)\)\. The end\-to\-end pipeline runs end to end onllama\-3\.1\-8bandgrok\-4\.3, with a frontier writes\-and\-reads confirmation onclaude\-sonnet\-4\-6\(§[5](https://arxiv.org/html/2606.25449#S5)\); the other frontier numbers are answering\-model replay over a fixed memory\.Table 1:Claims and evidence\.Every load\-bearing claim, the evidence behind it, and its epistemic status:shown\(direct in\-paper measurement\),analytic\(follows from the definitions\), orsuggestive\(preliminary, not load\-bearing\)\. We lead with the behavioral results \(shown\); the information bound \(analytic\) is apparatus, not the result\.## 1Introduction
Language models are increasingly deployed in settings where they must remember\. Assistant “memory” features, long\-running agents, and retrieval pipelines all carry information forward across a context window or a session boundary, and they do so by*compressing*: a conversation, a document, or a trajectory is reduced to a summary, a note, or a set of retrieved chunks\(Packer et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib15); Lewis et al\.,[2020](https://arxiv.org/html/2606.25449#bib.bib9)\)\. The implicit assumption behind these systems is that a compression which preserves the model’s answer has preserved what matters\.
We show that this assumption is incomplete in a consequential way\. Compression decides not only what a model can recall, but whether the model can later be*corrected*\. When a model has committed to a wrong intermediate conclusion and that conclusion is carried forward while the evidence behind it is dropped, a later correction has nothing to act on: the model restates the error, and no stronger model we tested recovers from it\. We call this failurebrittle memory, by analogy to a memory that looks intact, the salient answer is still there, yet shatters the moment it is asked to support a correction\. And because deployed agents feed memory into memory, the damage does not stay local: a single dropped\-source error compounds down the chain, corrupting a growing span of downstream steps and resisting correction however late it arrives, while a source\-preserving memory stays correctable to a bounded budget horizon \(§[5\.9](https://arxiv.org/html/2606.25449#S5.SS9)\)\. The deployment\-relevant stake is therefore not one wrong answer but a compounding, uncorrectable one\.
This is not a hypothetical\. The dominant instinct in summarization is to keep the*takeaway*and drop the*working*\. A note that records “the total was $55” while discarding the line items preserves the wrong answer and destroys the only means to fix it\. We make this precise, measure it, and show it is a design choice rather than a property of the model\.
The information loss is immediate from the definition: a value cannot be recomputed without its inputs, and no capability changes that\. That is the*setup*; it says nothing about what the model does next\. The*finding*is behavioral and not a fixed trait: the same model abstains or emits a confident wrong value depending only on what the memory kept\. A source\-less model does not reliably abstain \(§[5](https://arxiv.org/html/2606.25449#S5)\), and on a model that answers, a memory that kept the wrong conclusion is*worse*than one that kept nothing, because the stale value acts as an attractor it emits even while hedging\. Calibration is therefore something a memory design induces, not a fixed property to measure once, and the deployment danger is not that the source is gone but that the model does not act like it is gone\. We do not propose universal source retention; we identify the regime where memory must preserve a recomputation path, and show current summary\-like memories often drop it there\. How much of real assistant memory lives in that regime, compact checkable sources versus diffuse evidence with no isolable source, is the open question that bounds the practical reach of what follows \(§[7](https://arxiv.org/html/2606.25449#S7)\)\.
To study correctability directly we introducereclaim evaluation\(§[3](https://arxiv.org/html/2606.25449#S3)\)\. We induce drift on a task with a known answer, deepen the model’s commitment over several turns, compress the interaction into a single carried memory at a fixed budget, and then deliver a*directed*correction that names the error without supplying the answer\. The measured quantity is the Reclaim Rate \(RR\): how often the correction recovers the truth\. By holding the budget fixed and varying only*what*the compression keeps, we isolate correctability from both model capability and memory size\. Every load\-bearing claim, its evidence, and its epistemic status \(shown,analytic, orsuggestive\) is mapped in Table[1](https://arxiv.org/html/2606.25449#S0.T1)\.
#### Contributions\.
- •Reclaim evaluation:a paired\-memory protocol that isolates correctability from capability and from memory budget, by compressing a committed interaction under matched\-budget policies that differ only in what they retain \(§[3](https://arxiv.org/html/2606.25449#S3)\)\.
- •A lossy memory is worse than an empty one\.Intuition says any memory beats none; it is backwards\. With the source gone, a model*disposed to answer*does not abstain but emits the stale wrong value, so keeping a wrong conclusion \(thebrittle memoryregime\) is worse than keeping nothing, an asymmetry neither anchoring nor sycophancy predicts and one capability does not fix \(an88B model and a frontier one wall in the same place\)\. We report it as a version\-dependent risk \(directionally robust across seven models, magnitude set by disposition\), sharpest for an externally planted error \(§[5](https://arxiv.org/html/2606.25449#S5)\)\.
- •The error cascades, and that is the deployment stake:when memory feeds memory, a singlelossyerror corrupts a blast radius that grows with the chain and stays uncorrectable \(a no\-error control injects nothing\), while source\-first holds to a capability\-invariant budget horizon, so the single\-hop wall is not a one\-off wrong answer but a compounding, uncorrectable one \(§[5\.9](https://arxiv.org/html/2606.25449#S5.SS9)\)\.
- •The source\-first remedy, and its regime:a one\-line policy that removes the failure at equal budget*where the answer\-determining source is compact and identifiable*, with a length\-matched control that rules out “it is simply more text”; its one\-prompt deployable form is materially weaker than the oracle \(0\.490\.49–0\.880\.88, not1\.001\.00\) \(§[5](https://arxiv.org/html/2606.25449#S5)\)\.
- •Two task families and a conditional law:the remedy holds on arithmetic and on constraint logic, while the wall’s severity is task\-governed, a clean zero where the source is unrecoverable and a partial floor where a clue survives \(§[5](https://arxiv.org/html/2606.25449#S5)\)\.
- •The boundary of the law, and a silent\-failure mode:two sweeps locate wheresource\-firstdecays to the lossy floor, capability\-invariantly, when the answer\-determining source outgrows the budget \(size\) or is crowded out by decoys \(noise\)\. Past that boundary it fails*silently*; a one\-line completeness signal restores loud failure, itself capability\-gated \(§[5](https://arxiv.org/html/2606.25449#S5)\)\.
- •Deployed systems, and frontier models, on the same axis:three off\-the\-shelf memories all wall well belowsource\-first, each losing the source a different way, and replaying the answering model toclaude\-opus\-4\-8holds the source\-dropped cases at0\.000\.00while source\-first reaches1\.001\.00\. A frontier memory\-*writer*rescues the summary but not the extraction store \(which confabulates*more*\), so neither read\- nor write\-side capability is a paradigm\-independent fix \(§[5](https://arxiv.org/html/2606.25449#S5)\)\.
- •Replication on real conversational memory:on MultiWOZ\(Budzianowski et al\.,[2018](https://arxiv.org/html/2606.25449#bib.bib3)\)the wall and the source\-first fix both hold, objectively scored and judge\-free, so the compact deterministic tasks are not a special case; it stresses the*reading*step and leaves the*locating*step \(§[7](https://arxiv.org/html/2606.25449#S7)\) open \(§[5](https://arxiv.org/html/2606.25449#S5)\)\.
## 2Related Work
#### Memory, retrieval, and long context\.
Assistant and agent memory systems compress history into carried\-forward notes or retrievable chunks\(Packer et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib15); Lewis et al\.,[2020](https://arxiv.org/html/2606.25449#bib.bib9)\), and long contexts are themselves attended unevenly, with material in the middle used least\(Liu et al\.,[2024](https://arxiv.org/html/2606.25449#bib.bib10)\)\. These systems compress toward the salient conclusion\. Our result is a caution for exactly that default: a summary that keeps the conclusion and sheds the source preserves an error and destroys the means to repair it\.
#### Self\-correction and feedback\.
A line of work asks models to revise their own outputs\(Madaan et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib11); Shinn et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib18)\), and a sober finding is that models often cannot reliably self\-correct reasoning without an external signal\(Huang et al\.,[2024](https://arxiv.org/html/2606.25449#bib.bib6)\)\. Our directed correction is precisely such a minimal external signal, naming the error site but not the fix\. We show that even a perfect external signal is powerless once the recomputable source has left the context: the bottleneck is then information, not feedback\. Our wall is not a relabeling of anchoring plus the limits of self\-correction\(Huang et al\.,[2024](https://arxiv.org/html/2606.25449#bib.bib6)\): a compression*policy*decides which regime holds \(anchoring while the source survives, uncorrectable information loss once it is dropped\) at matched budget, and a wrong\-valued memory is*worse*than an empty one, a behavioral asymmetry neither component predicts\.
#### Multi\-turn drift\.
That a model can be walked over turns into a state it would not adopt in one shot underlies multi\-turn jailbreaks\(Russinovich et al\.,[2025](https://arxiv.org/html/2606.25449#bib.bib16)\), and the tendency to follow a confident interlocutor rather than the evidence is documented as sycophancy\(Sharma et al\.,[2024](https://arxiv.org/html/2606.25449#bib.bib17)\)\. We use a benign, checkable analogue, a planted arithmetic or logical premise, and study the return trip: whether and when a committed model can be brought back\. Our failure is distinct from both\. It is not sycophancy: the model is not following an agreeable interlocutor but recomputing from what its memory kept, and it fails when the recomputable source is gone, not when a human pushes\. And it is not a fixed calibration property: the same model abstains or answers correctly depending only on what the compression retained, so the failure is created by a memory policy and removed by another, not a trait of the model\.
#### Knowledge editing\.
Editing a fact in a model’s weights is the parametric counterpart to correcting a belief\(Meng et al\.,[2022](https://arxiv.org/html/2606.25449#bib.bib13)\); ours is the in\-context, per\-interaction counterpart, and it isolates a precondition that weight editing sidesteps, namely that the evidence justifying the new value still be present\.
#### Brittleness hidden from surface inspection\.
A failure can be invisible to inspection of the surface text yet decisive once context is restored\.Choi & Kwon \([2026](https://arxiv.org/html/2606.25449#bib.bib4)\)show this for*safety*: aligned models clear content\-level guardrails while the very same behavior is unsafe in context, so the danger is legible to a context\-aware evaluator but not to the text alone\. We find the same surface/structure gap for*correctability*: a carried answer can look intact while whether it can be repaired depends entirely on what structure the compression kept\. In both, the property that matters, safety or correctability, lives in the context or the retained source, not in the surface that a content check reads\.
## 3Brittle Memory and Reclaim Evaluation
### 3\.1Definition and formalization
###### Definition 1\(Reclaim Rate\)\.
LetMMbe a model and let a task instance have a unique correct answery∗y^\{\*\}\. Drift inducesMMto commit to a wrong answery′≠y∗y^\{\\prime\}\\neq y^\{\*\}\. A compression policyπ\\pimaps the committed interaction to a carried memorymmwith budgetB\(m\)B\(m\)\.222We operationalizeBBas a*character*count throughout \(the unit the notes are actually truncated to\); the definition is unit\-agnostic and “budget” elsewhere in the paper means this character budget\.A directed correctionδ\\deltanames the error locus without supplyingy∗y^\{\*\}\. For a memory integrityg∈\[0,1\]g\\in\[0,1\]governing how much of the source survives, the*Reclaim Rate*is
RR\(π,g\)=Pr\[M\(mπg,δ\)=y∗∣drift\]\.\\mathrm\{RR\}\(\\pi,g\)\\;=\\;\\Pr\\\!\\big\[\\,M\(m^\{g\}\_\{\\pi\},\\delta\)=y^\{\*\}\\,\\mid\\,\\mathrm\{drift\}\\,\\big\]\.\(1\)
We instantiate three policies at matched budget\.lossykeeps the salient conclusion and sheds the source asggfalls \(the realistic default\)\.source\-firstkeeps the recomputable source and sheds the re\-derivable conclusion\.lossy\-paddedis the control: it islossypadded with neutral filler tosource\-first’s length or beyond, so it carries*more*text than the fix while retaining no source\.
#### How the notes are built\.
Each note is templated from three recorded fields of the problem: its line\-item*source*\(e\.g\. “7 notebooks at $4, 9 pens at $2”\), the planted*premise*\(“the pens come to $27”\), and the committed*conclusion*\(“the total is $55”\)\. This makes “how much source survives” concrete and inspectable rather than a free parameter\. Atg≥0\.5g\\geq 0\.5every policy keeps the source line items, so the source is present under all of them\. The wall opens atg<0\.5g<0\.5:lossydrops the line items and keeps the premise and the wrong conclusion \(atg=0\.1g\{=\}0\.1, only the conclusion\), whilesource\-firstkeeps the line items and drops the conclusion\.lossy\-paddedislossyextended with topic\-neutral filler to at leastsource\-first’s character length, equalizing budget\. Source\-absence is therefore a property of the string we emit, checkable by testing the note for the line\-item tokens, not an assumption: atg<0\.5g<0\.5thelossynote provably contains none of them\.
Brittle memoryis the regime
RR\(𝗅𝗈𝗌𝗌𝗒,g\)\\displaystyle\\mathrm\{RR\}\(\\mathsf\{lossy\},g\)→0asg→0,while\\displaystyle\\to 0\\ \\text\{as\}\\ g\\to 0,\\ \\text\{while\}\(2\)RR\(𝗌𝗈𝗎𝗋𝖼𝖾\-𝖿𝗂𝗋𝗌𝗍,g\)\\displaystyle\\mathrm\{RR\}\(\\mathsf\{source\\text\{\-\}first\},g\)stays high\.at equal budgetBB\. The content\-versus\-budget control requiresRR\(𝗅𝗈𝗌𝗌𝗒\-𝗉𝖺𝖽𝖽𝖾𝖽,g\)≈RR\(𝗅𝗈𝗌𝗌𝗒,g\)\\mathrm\{RR\}\(\\mathsf\{lossy\\text\{\-\}padded\},g\)\\approx\\mathrm\{RR\}\(\\mathsf\{lossy\},g\)withB\(𝗅𝗈𝗌𝗌𝗒\-𝗉𝖺𝖽𝖽𝖾𝖽\)≥B\(𝗌𝗈𝗎𝗋𝖼𝖾\-𝖿𝗂𝗋𝗌𝗍\)B\(\\mathsf\{lossy\\text\{\-\}padded\}\)\\geq B\(\\mathsf\{source\\text\{\-\}first\}\): if the fix were merely supplying more text, the padded control would reclaim too\.
### 3\.2The reclaim protocol
Each task instance has a known answer, so success is objective and needs no judge: exact match reproduces on whatever models exist later, with no judge model to deprecate\. What persists is the procedure and the information bound it gates; the per\-model cells ride on pinned snapshots and are meant to be re\-measured, not read as fixed constants\.
#### Drift and commitment\.
The opening turn states the problem and injects a wrong intermediate value \(“a note says the pens come to $27”\), inducingMMto commit to a wrong answer\. We deepen the commitment over up to eight neutral follow\-up turns that re\-use the wrong figure without re\-deriving the corrupted component, checkpointing the interaction at commitment depths\{1,2,4,8\}\\\{1,2,4,8\\\}\.
#### Correction, two forms\.
At a checkpoint we deliver either a*generic*correction \(“something above is wrong, recheck”\) or a*directed*one that names the error locus in the trace’s own terms without giving the answer \(“the pens subtotal is wrong, recheck that”\)\. A success is the model recomputing, not copying\. The directed correction is the strongest realistic external signal short of supplying the answer, so we read it as an*upper bound*on*locus\-naming*correction quality: if even a locus\-naming correction cannot reclaim once the source is gone, a vaguer real\-world correction cannot either\. The idealization is therefore conservative for the wall\. \(Handing over the value itself is a stronger signal still; that case is the correction taxonomy of Table[4](https://arxiv.org/html/2606.25449#S5.T4), wherelossyfails even then\.\)
#### Cross\-session compression\.
To study genuine information loss, the corrected query is issued in a fresh session whose only inheritance is a single carried memory of the first, written under one of the three policies at integrityg∈\{1\.0,0\.6,0\.3,0\.1\}g\\in\\\{1\.0,0\.6,0\.3,0\.1\\\}\. This separates two regimes that look identical from outside:*anchoring*\(the information is present, the model is entrenched\) and*information loss*\(the source is gone\)\.
## 4Experimental Setup
#### Tasks\.
We use two families of eight problems each\. The first is multi\-step arithmetic with a clean pre\-tax total: the conclusion is a deterministic function of a small, fully recomputable source \(the line items\)\. The second is non\-arithmetic constraint logic \(role, seating, ordering, and color puzzles\) with a single\-token answer: the conclusion is a logical, not numeric, deduction over a clue set\. Single\-token and numeric answers keep scoring objective\.
#### Models\.
We evaluatellama\-3\.1\-8b\-instruct\(Meta AI,[2024](https://arxiv.org/html/2606.25449#bib.bib14)\)andgrok\-4\.3\(xAI,[2026](https://arxiv.org/html/2606.25449#bib.bib19)\)\(pinned snapshot\), an88B open model and a frontier system, both accessed through OpenRouter’s OpenAI\-compatible endpoint; the Claude frontier models used in the answering\-model replay \(claude\-sonnet\-4\-6,claude\-opus\-4\-8\) are accessed through the Anthropic API\. Every condition is run over three seeds at temperature0\.70\.7\(the Claude models in the replay run at their default, as Opus does not accept the parameter\)\. The cross\-session sweep on the frontier model is1,2241\{,\}224calls per task\.
#### Validators\.
Three checks, run for free against a deterministic fake, can each fail: that the planted premise actually drifts the model, that the window favors the directed arm, and, as the central anti\-rig check, that when the source’s line\-item tokens are absent from the carried note, reclaim fails for*both*arms\. The fake reclaims only when those tokens are present in its context, so a passing run cannot be faked by a model that merely pattern\-matches the correction\. All three pass \(3/33/3\)\. Because absence is read off the constructed note \(a token test\) rather than assumed, the wall rows are a clean0\.000\.00by measurement, not stipulation\.
## 5Results
### 5\.1The wall, and why it is the setup
#### Within one conversation, the window is anchoring, not forgetting\.
Table[2](https://arxiv.org/html/2606.25449#S5.T2)reports reclaim on the small model when the full trace remains in context\. A generic correction has a real window that shuts as the model entrenches \(RR0\.42→0\.040\.42\\to 0\.04over eight turns\), while a directed correction holds far longer \(0\.79→0\.500\.79\\to 0\.50\)\. Counter to a forgetting account, pushing the error*back*behind unrelated filler*lifts*the generic correction \(0\.17→0\.500\.17\\to 0\.50\): the information never leaves context, so distance cannot starve reclaim, it only loosens the anchor\. A directed correction beats the generic one at every depth and distance\. One intact conversation therefore has no wall, only an anchor that a directed signal overcomes\.
Table 2:The single\-conversation window is anchoring\(llama\-3\.1\-8b, eight problems×\\timesthree seeds\)\. Reclaim Rate versus commitment depth and versus unrelated filler distance at fixed deep commitment\. Distance lifts generic reclaim, the opposite of a forgetting wall\.
#### Across sessions, the window becomes a wall\.
When only a lossy memory crosses the session boundary, reclaim holds while the memory keeps the source and collapses once it is compressed past it \(Table[5](https://arxiv.org/html/2606.25449#S5.T5),lossycolumns\)\. Past the threshold the carried note contains a wrong answer and nothing to check it against, and even a directed correction dies: there is no error site left to point at\. The directed advantage that dominated the single conversation vanishes here, and its disappearance is itself a diagnostic for the regime change from anchoring to information loss\.
#### The wall sits in the same place regardless of capability\.
The failure is identical across the two base models we run end to end \(llama\-3\.1\-8bandgrok\-4\.3; Table[5](https://arxiv.org/html/2606.25449#S5.T5)\) and across the answering\-model replay that swaps the reader over a fixed memory \(§[5](https://arxiv.org/html/2606.25449#S5), Table[8](https://arxiv.org/html/2606.25449#S5.T8)\): on arithmetic the frontier model walls at a perfect0\.000\.00exactly where the small model does, strictly better wherever information survives and exactly as helpless where the source was dropped\. What the model does once the source is gone is the finding, measured next\.
### 5\.2The behavioral finding: a wrong\-valued memory is worse than an empty one
#### The wall is silent, not safe\.
The deployment\-relevant content of the wall is not its height but what the model does*instead*of recovering \(Table[7](https://arxiv.org/html/2606.25449#S5.T7)\), and it does not reliably decline\. On arithmetic it almost never abstains; it hands back a wrong value, and the stronger model fails the more cleanly, inheriting the compressed wrong answer in90%90\\%of wall cells while the small model is noisier \(as often computing a fresh wrong value as repeating the inherited one\)\.333These emit\-versus\-abstain fractions are answering\-temperature\-sensitive, and the base models run at0\.70\.7while Opus is deterministic, so the cross\-model*spread*is behavior under each model’s own decoding, not a pure capability ordering\. The within\-memory source\-kept/source\-dropped contrast is unaffected: it holds the carried string fixed and only swaps what the memory kept\.We score the returned value, not whether it is hedged: a source\-less model emits a wrong answer rather than flagging that it cannot verify, a failure a downstream system never sees\. This behavior, not the information loss, is the result\.
The reader’s instinct is that any memory beats none\. It is exactly backwards, and a matched test isolates it\. At the wall we swap the lossy note for a*blank*one that kept neither source nor conclusion\. With nothing to inherit, both base models abstain on every problem \(0\.000\.00wrong emission,n=96n\{=\}96\)\. The same models under lossy compression instead emit a confident wrong value \(0\.480\.48on llama,0\.750\.75on grok, of which0\.220\.22and0\.750\.75is the exact inherited attractor\), and neither recovers the truth \(no source survives\)\. The gap is purely behavioral: keeping the stale conclusion does nothing for correctability and turns a safe abstention into a confident error \(Table[3](https://arxiv.org/html/2606.25449#S5.T3)\)\. The same holds on the model’s*own*self\-generated errors, only milder \(the88B model mostly abstains\), so the planted note is the more adversarial case, not a special one \(§[7](https://arxiv.org/html/2606.25449#S7)\)\. It holds on real dialogue too \(Table[10](https://arxiv.org/html/2606.25449#S5.T10)\): for the models that answer, a memory that kept the wrong value is again worse than one that kept nothing\. Across the seven models we test the direction never reverses on arithmetic, lossy emitting a confident wrong value where empty abstains in every one, with only the magnitude set by disposition \(App\.[I](https://arxiv.org/html/2606.25449#A9)\)\. The claim states its own kill condition: a single answer\-disposed model that abstains under a wrong\-valued memory \(lossy≤blank\\textsf\{lossy\}\\leq\\textsf\{blank\}on confident\-wrong\-emission\) would break it, and none of the seven does\.
Table 3:Lossy memory is worse than empty memory on arithmetic\(base models, wallg=0\.1g\{=\}0\.1, directed reclaim,n=96n\{=\}96/cell\)\. An*empty*memory \(neither source nor conclusion\) makes both models abstain; a*lossy*memory that kept the wrong conclusion makes them emit a confident wrong value, much of it the inherited attractor\. Neither recovers the truth \(no source\), so the difference is purely behavioral: the wrong\-valued memory converts an abstention into a confident error\. This is the arithmetic counterpart of the MultiWOZ result \(Table[10](https://arxiv.org/html/2606.25449#S5.T10)\)\. Grok’s0\.250\.25abstention here \(atg=0\.1g\{=\}0\.1under a directed reclaim\) differs from the10%10\\%in Table[7](https://arxiv.org/html/2606.25449#S5.T7), which pools theg≤0\.3g\{\\leq\}0\.3inherit cells; the cells differ\.
### 5\.3The source\-first remedy
#### The lever is content, not budget\.
Thelossy\-paddedcontrol isolates content from length: it carries more text thansource\-firstand still walls identically to plainlossy\(0\.000\.00on arithmetic; Table[5](https://arxiv.org/html/2606.25449#S5.T5)\)\. At equal or greater budget, the policy that keeps the source reclaims and the policy that keeps the conclusion fails, so the lever is what the memory keeps, not how much\.
#### The wall is a choice, removed by source\-first in the regime that matters\.
source\-first’s advantage is a low\-integrity effect, and a decisive one\. Where lossy compression has shed the source,source\-firstreclaims \(0\.990\.99–1\.001\.00on arithmetic\) while both lossy variants wall \(0\.000\.00\), with non\-overlapping95%95\\%intervals on both models \(Tables[5](https://arxiv.org/html/2606.25449#S5.T5),[6](https://arxiv.org/html/2606.25449#S5.T6)\)\. At high integrity, where lossy still carries the source, the policies are comparable and sometimes reverse:lossy\-paddedexceedssource\-firstatg=1\.0g\{=\}1\.0on llama arithmetic \(0\.850\.85vs0\.610\.61\) and atg=0\.6g\{=\}0\.6on grok logic \(likely a positioning effect, not text aiding recovery\)\. So we claim no all\-regime dominance, and we do not rest the content\-not\-text claim on these cells: the matched control at the wall, where the padding carries*no*source, carries it\. The low\-integrity result is a rule, not a coincidence, because a conclusion is a deterministic function of its source: a kept source regenerates the conclusion, a kept conclusion never regenerates the source\. Under budget pressure, keeping the source is what stays correctable\.
#### Dropping the conclusion is not only for budget\.
Source\-first drops the conclusion to make room for the source, but the conclusion is also a liability\. A note that keeps both at the wall reclaims0\.95\[\.90,\.99\]0\.95\_\{\[\.90,\.99\]\}on the88B reader, below source\-first’s1\.001\.00: the stale conclusion re\-attracts a weak reader even with the source beside it\. On the frontier readers the cost vanishes \(1\.001\.00on Sonnet and Opus\)\. So dropping the conclusion is safe under budget pressure: it removes a weak\-reader trap and costs nothing on a strong reader\. When the budget fits both, keeping the conclusion is also fine on a strong one, and lets it cross\-check by recomputing\. Either way, the source is the load\-bearing part\.
### 5\.4Robustness to corrections
A deployed correction is rarely a clean locus, so the fix must survive weaker and adversarial corrections, and it does, because the work is done by the restored source\. Under a*vague*“something is off” nudge,source\-firstreclaims essentially as well as under a directed locus \(≈1\.00\\approx 1\.00on arithmetic,0\.760\.76–0\.820\.82on logic\) whilelossystays at its floor; the correction’s specificity governs in\-context anchoring but is irrelevant once the lever is the restored source\. It is equally robust to corrections that*mislead*: a*false locus*naming a correct component as the error leavessource\-firstat the true value \(1\.001\.00on both base models,n=24n\{=\}24each\), and a*confident wrong value*asserted as established fact, the sycophancy case\(Sharma et al\.,[2024](https://arxiv.org/html/2606.25449#bib.bib17)\), is rejected on every trial on both frontier readers, wherelossyinstead capitulates \(0\.740\.74on Sonnet,1\.001\.00on Opus\)\. The adversarial members are capability\-gated: against a*sustained*four\-turn push and an injected*fabricated source*, frontier readers resist \(0\.900\.90–1\.001\.00\) while the88B reader caves \(0\.000\.00–0\.270\.27\), the same capability ladder that governs reading a fuzzy source\. The full battery, with the false\-locus and injection mechanics and Tables[18](https://arxiv.org/html/2606.25449#A10.T18)and[19](https://arxiv.org/html/2606.25449#A10.T19), is in App\.[J](https://arxiv.org/html/2606.25449#A10)\.
Two endpoints complete the taxonomy \(Table[4](https://arxiv.org/html/2606.25449#S5.T4)\)\. A content\-free “are you sure?” makessource\-firstrecompute, and makeslossyonly re\-assert the wrong value or abstain\. Even the*correct value*, handed over as an instruction, fixeslossyon the smaller models but not on Opus, which re\-emits its stored wrong value on31%31\\%of trials\. That is the sharpest form of the attractor: a conclusion that survives an explicit correct correction is not stale data the model overwrites on request but a value it*defends*\. Solossyis correctable by none of these short of the answer, and not fully even then, whilesource\-firstis correctable across the spectrum\.
Table 4:Two correction types complete the taxonomy\(arith wallg=0\.1g\{=\}0\.1, fraction returning the*true*value; llama/Sonnetn=96n\{=\}96, Opusn=32n\{=\}32\)\. A content\-free challenge \(“are you sure?”\) makessource\-firstrecompute correctly and only makeslossyre\-assert the wrong value or abstain\. Even an*explicitly supplied correct value*fixeslossyfully on the smaller models but not on Opus, which keeps its stale wrong value on31%31\\%of trials\.
### 5\.5The wall’s severity, hard and soft
#### The failure’s severity is conditional, and the same on both models\.
The wall is a clean0\.000\.00on arithmetic, where lossy compression discards the actual numbers and nothing can be reconstructed, and*soft*on logic \(Table[6](https://arxiv.org/html/2606.25449#S5.T6)\), where the lossy note retains a corrupted relational clue in a small constraint space\. Because the logic answer is one of a few tokens, we measure the free\-guess floor directly: a blank note giving only the candidate set \(no clue, no conclusion\) reclaims at0\.040\.04–0\.170\.17\(App\.[D](https://arxiv.org/html/2606.25449#A4)\), below the≈0\.30\\approx 0\.30uniform rate, the models abstain or anchor rather than guess freely\. Against that measured floor the soft wall separates by capability\. On the frontier model the surviving clue lifts directed reclaim to0\.420\.42–0\.500\.50, far above its0\.120\.12guess floor: genuine re\-derivation, and the reason the floor is not circular \(were reclaim merely scoring its own setup, logic would zero out like arithmetic\)\. On the small model it reaches only0\.160\.16\(and0\.170\.17recovery in the failure\-mode count\), at or below its own0\.170\.17floor, scarcely distinguishable from guessing\. The soft wall is therefore real re\-derivation for a capable reader and chance for a weak one\.source\-firstkeeps all of the source either way\.
Table 5:Arithmetic: the wall \(lossy\), the length control \(lossy\-padded\), and the fix \(source\-first\), on two models\.Directed\-arm Reclaim Rate with95%95\\%bootstrap CI versus memory integritygg\(llama\-3\.1\-8batn=96n\{=\}96/cell:3232problems×\\timesthree seeds;grok\-4\.3atn=24n\{=\}24as a cross\-vendor anchor; temperature0\.70\.7\)\. At low integrity \(the wall\)lossyandlossy\-paddedsit at0\.000\.00whilesource\-firstis0\.990\.99–1\.001\.00, non\-overlapping\. At high integrity, wherelossystill keeps the source, the intervals overlap and the policies are comparable; the fix is a low\-integrity effect, which is the regime real budget pressure produces\.Table 6:Non\-arithmetic constraint logic: the same three policies, on two models\.Directed\-arm Reclaim Rate with95%95\\%bootstrap CI versus memory integritygg\(llama\-3\.1\-8batn=96n\{=\}96/cell over3232ordering and assignment problems×\\timesthree seeds;grok\-4\.3atn=24n\{=\}24as a cross\-vendor anchor\)\. At low integritysource\-firstbeats both lossy variants \(non\-overlapping or nearly so\); the wall is*soft*rather than a clean zero, because the lossy logic note leaves a reconstructable relational clue\. At high integrity the policies again trade places \(lossy\-paddedexceedssource\-firston grok\), so as on arithmetic the fix is a low\-integrity effect\.Table 7:What the model does when reclaim fails, at the wall \(lossy/lossy\-padded,g≤0\.3g\\leq 0\.3, directed;n=96n\{=\}96/row\)\.*recov\.*: correct despite the lossy memory;*inherit*: repeats the compressed wrong value;*novel*: a*different*wrong value;*abst\.*: declines\. Arithmetic recovers nothing \(the source is gone\) and the model hands back a wrong value, the frontier model inheriting it almost always; logic recovers1717–45%45\\%from a surviving relational clue, which is why its wall is soft rather than zero\. Source state is known by construction: the wall rows are exactly the source\-absent notes, so these are failures of*information*, not of reasoning\. Examples in App\.[A](https://arxiv.org/html/2606.25449#A1)\.
### 5\.6Generality: deployed and frontier systems
#### Three deployed paradigms, three ways to lose the source\.
The effect is not an artifact of our hand\-built notes; it appears in memories people ship, by several distinct mechanisms\. These are general\-purpose recall tools, not configured for correctability; the claim is not that they are poorly built but that their*default*compression, toward the salient conclusion, sheds the source\. We run three off\-the\-shelf systems unmodified over the same session 1 trajectory, each*constructing*its memory withllama\-3\.1\-8b, and carry that memory into session 2 in the slot our notes occupied: LangChain’sConversationSummaryMemory\(LangChain,[2022](https://arxiv.org/html/2606.25449#bib.bib8)\)\(a running LLM summary\),mem0\(Mem0 AI,[2024](https://arxiv.org/html/2606.25449#bib.bib12)\)\(LLM fact\-extraction into a retrieved store\), and naive vector retrieval over the session’s turns\.444FastEmbedbge\-small\-en\-v1\.5\(Xiao et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib20)\)\(384384\-dim\); one chunk per conversational turn; topk=4k\{=\}4by cosine similarity to a fixed correction\-shaped query, carried in chronological order\. We label it “naive” deliberately: a tuned RAG pipeline could retrieve the source\-bearing turns, which is the point, retrieval helps only when it is aimed at the source, not the conclusion\.\(The writer is the small model; whether a frontier writer would preserve more source is the open question we flag in Limitations\.\) All three wall well below the fix \(Table[8](https://arxiv.org/html/2606.25449#S5.T8)\), and each loses the source a*different*way\. The summary*drops*it, compressing toward the conclusion and shedding the line items\. mem0*buries*it: it keeps the source as extracted facts but bloats each memory with, on average,38\.138\.1numbers absent from session 1 \(100%100\\%of its memories carry at least one, against zero for the summary, naive retrieval, and both hand\-built notes; an objective count, no judge\)\. Some are outright fabrications \(a made\-up pen count, introduced to reconcile the planted error\), others are correct re\-derivations; either way the few line items that actually determine the answer are buried in a mass of generated figures\. Naive retrieval*misses*it: keyed on the correction, it surfaces the conclusion\-bearing turns rather than the source\-bearing ones, and is the weakest of the three\. All cells aren=96n\{=\}96; on the small model,source\-first’s95%95\\%bootstrap interval excludes every deployed system on arithmetic\.
Crucially, aiming the retrieval at the source does not rescue it\. A*source\-keyed*query over the same store \(asking for the original line items and quantities rather than for the correction\) retrieves different turns but reclaims no better than the conclusion\-keyed one \(0\.09\[0,\.22\]0\.09\_\{\[0,\.22\]\}vs\.0\.06\[0,\.16\]0\.06\_\{\[0,\.16\]\}, overlapping, both far below the distilledsource\-firstnote’s0\.97\[\.91,1\]0\.97\_\{\[\.91,1\]\};n=32n\{=\}32, same trajectories\)\. The reason is visible in the retrieved text: in a realistic drift dialogue the recomputable line items are stated*once*and then buried under turns that restate and confirm the \(wrong\) total, so neither query reliably surfaces them\. Retrieval direction is necessary but not sufficient; the source must be salient enough to retrieve, which a distilled source\-first note guarantees and raw retrieval does not\. The lever is distillation, not just the query\.
#### The fix deploys, not just hand\-built\.
source\-first\-auto, a one\-prompt policy that compresses an arbitrary transcript toward its recomputable source and away from the conclusion,555Prompt, verbatim:*“You are compressing a conversation into a short memory note for a future session that may need to correct a mistake in it\. Keep every given fact, quantity, and unit needed to recompute the answer from scratch \(the source / the working\)\. Do not assert the final answer or any derived conclusion as established fact, since it can be recomputed from the source\. Be concise\.”*beats mem0 and naive retrieval decisively on both tasks \(≥0\.39\\geq 0\.39everywhere\) and beats the LangChain summary on arithmetic, where the summary is the closest competitor: thesource\-first\-auto−\-summary paired difference is\+0\.18/\+0\.32/\+0\.30\+0\.18/\+0\.32/\+0\.30across Llama/Sonnet/Opus, each with a95%95\\%bootstrap interval that excludes zero \(\[\.05,\.30\]\[\.05,\.30\],\[\.22,\.43\]\[\.22,\.43\],\[\.19,\.42\]\[\.19,\.42\]\)\. On logic the two are*indistinguishable*: the paired difference is within±0\.03\\pm 0\.03on every model and every interval straddles zero, because a running summary that happens to retain the ordering constraints recomputes about as well as an explicit source\-first note\. Against a good summary the deployable fix adds essentially nothing on logic; its value is concentrated on compact numeric sources where the summary sheds the line items, and is absent where the summary already carries the constraints; its edge over the fact\-extraction and retrieval systems is decisive throughout\. The gap to the oracle is where the idealization shows: the hand\-built note perfectly identifies the answer\-determining source and keeps exactly it, whilesource\-first\-auto must*find*that source itself, so its0\.490\.49–0\.880\.88, not the oracle’s1\.001\.00, is the deployable number\. On logic this gap is the distiller’s, not the reader’s:source\-first\-auto is flat across capability \(0\.49/0\.53/0\.530\.49/0\.53/0\.53on Llama/Sonnet/Opus\) while the hand note climbs \(0\.77→1\.000\.77\\to 1\.00\), so a stronger reader cannot rescue a logic source the auto\-distiller did not keep, and the “ties the summary on logic” story is bounded by the distillation prompt rather than by the reader\. The axis is source recoverability, not transcript size: a small clean source can sit inside a long transcript one must compress\.
#### The wall holds on frontier models, and the gap widens\.
We hold each memory fixed and replay only the session\-2 answering model across a wide capability range, an 8B open model up to a frontier reasoner:llama\-3\.1\-8b,claude\-sonnet\-4\-6\(Anthropic,[2026b](https://arxiv.org/html/2606.25449#bib.bib2)\), andclaude\-opus\-4\-8\(Anthropic,[2026a](https://arxiv.org/html/2606.25449#bib.bib1)\), the last a current frontier model widely used for agentic work \(Table[8](https://arxiv.org/html/2606.25449#S5.T8)\)\. Two facts\. Wherever the memory kept the source, capability lifts reclaim, to a*perfect*1\.001\.00on Opus forsource\-first\. Wherever the memory dropped it, every model scores0\.000\.00, Opus included: lossy compression and naive retrieval are uncorrectable at any capability we tested\. The strongest model therefore has the*widest*gap between source\-kept and source\-dropped\. The widening is carried by logic, where source\-kept climbs0\.77→1\.000\.77\\to 1\.00across the range while source\-dropped holds at0\.000\.00; on arithmetic source\-kept is already at ceiling on the small model \(1\.001\.00\), so there the effect is the floor staying at0\.000\.00rather than the ceiling rising\. Either way, capability does not move the source\-dropped rows, exactly as the hand\-built small\-vs\-frontier comparison \(Tables[5](https://arxiv.org/html/2606.25449#S5.T5),[6](https://arxiv.org/html/2606.25449#S5.T6)\) predicts, now on deployed memories and recognizable models\. Confabulation is the failure capability helps least with: mem0 stays well below the fix on every model \(0\.09/0\.12/0\.110\.09/0\.12/0\.11on arithmetic, flat across the capability range\) and gains far less from a stronger reader than the source\-bearing systems do, since reading skill does not, on its own, recover line items buried under the extractor’s own figures\.
Table 8:The deployed\-system wall holds across a wide answering\-model range; capability does not close the source\-kept/source\-dropped gap\.Directed\-arm Reclaim Rate with95%95\\%bootstrap CI for six session\-2 memories \(three deployed systems, the deployable fixsource\-first\-auto, and the hand\-builtlossy/source\-firstanchors\) held fixed while only the answering model is swapped \(llama\-3\.1\-8b→\\toclaude\-sonnet\-4\-6→\\toclaude\-opus\-4\-8\);n=96n\{=\}96/cell\. Source\-kept rows climb toward a perfect1\.001\.00on Opus; source\-dropped rows \(lossy, naive retrieval\) stay at0\.000\.00on every model, with intervals that exclude every source\-kept Opus cell\. mem0 stays low throughout, gaining the least from capability, because the source is buried under generated figures a stronger reader cannot reliably sort\. Answering temperature is0\(Opus does not accept the parameter\); the carried memories are the same temperature\-0\.70\.7ones used elsewhere\.
#### The deployed wall is the policy, not the writer\.
The deployed memories above are all*written*byllama\-3\.1\-8b, so part of the wall could be a weak\-writer artifact\. Re\-running memory construction with a frontier writer \(claude\-sonnet\-4\-6\), holding the trajectory and the llama answerer fixed, splits the two LLM\-written systems in*opposite*directions: LangChain’s summary wall is largely a weak\-writer artifact \(a capable writer keeps the line items, lifting directed reclaim0\.38→0\.880\.38\\to 0\.88on arithmetic\), while mem0’s is not \(a frontier extractor does not rescue it and*confabulates more*\)\. Writer strength is therefore not a paradigm\-independent remedy, and the templatedlossywall holds throughout, confirming the mechanism is the policy, not the writer \(full sub\-study, App\.[F](https://arxiv.org/html/2606.25449#A6)\)\.
#### The wall and fix hold end\-to\-end on a frontier model\.
The frontier results above replay a fixed \(llama\-written\) memory under a stronger reader, isolating the reading step\. We also run the full pipeline frontier\-to\-frontier:claude\-sonnet\-4\-6both*writes*the memory and answers, over the3232arithmetic problems\. The wall holds \(templatedlossy0\.000\.00directed\) and the oracle fix holds \(source\-first1\.001\.00\); a frontier\-written LangChain summary lands at0\.910\.91\. The deployable distiller is the result:source\-first\-auto reaches1\.001\.00when the frontier model writes it, against0\.880\.88when llama writes it and the frontier only reads \(Table[8](https://arxiv.org/html/2606.25449#S5.T8)\)\. The0\.490\.49–0\.880\.88deployable gap to the oracle is therefore largely a weak\-writer artifact, closed by a capable writer, so the shippable number rises with the writer’s capability and not only the reader’s\.
### 5\.7The boundary of the fix
#### The boundary: source\-first leads only while the source fits the budget\.
The fix is not unconditional, and a source\-size sweep locates its edge exactly where the Introduction scoped it\. We scale the ledger fromN=2N\{=\}2toN=32N\{=\}32line items at a fixed carried\-memory budgetBB\(characters\) and re\-measure directed reclaim \(Figure[2](https://arxiv.org/html/2606.25449#S5.F2); generator and validators in App\.[B](https://arxiv.org/html/2606.25449#A2)\)\. The pre\-tax total is the exact sum of allNNitems, so the answer\-determining source grows withNNwhileBBdoes not\. Two regimes result\. While the full itemization fitsBB,source\-firstkeeps all of it and reclaim holds high; onceNNexceeds whatBBcarries, the note retains only the firstk<Nk<Nitems, an exact sum is unrecoverable, andsource\-firstdrops to the budget\-matchedlossy\-paddedfloor\. Split by source survival, directed reclaim is0\.880\.88\[0\.84,0\.91\]\[0\.84,0\.91\]when the full source fits \(k=Nk\{=\}N,n=312n\{=\}312\) and a clean0\.000\.00\[0,0\]\[0,0\]the moment it does not \(k<Nk<N,n=312n\{=\}312\): dropping a single line item zeros reclaim regardless ofNN\. The crossover tracks the*budget*, not the size,N≈5N\\approx 5atB=300B\{=\}300andN≈14N\\approx 14atB=600B\{=\}600, which is the content\-not\-size claim made quantitative, doubling the budget roughly triples the source a kept\-source note can carry\. The residual sag while the source still fits \(e\.g\.0\.540\.54atB=600B\{=\}600,N=14N\{=\}14on the small model\) is the answering model’s own summation limit: a capability effect that vanishes under a stronger model \(Sonnet and Opus both hold1\.001\.00acrossN=8N\{=\}8–1414atB=600B\{=\}600\) while the information cliff does not move \(0\.000\.00atN=6N\{=\}6forB=300B\{=\}300and atN=16N\{=\}16forB=600B\{=\}600on all three models\)\. Capability erases the soft slope and leaves the wall exactly where the budget put it\. The law therefore holds as stated,source\-firstremoves the wall wherever a compact answer\-determining source can be kept, and we have now measured where “compact” ends\.
Figure 2:The boundary of the source\-first law\.Directed Reclaim Rate vs\. ledger sizeNNat two fixed memory budgetsBB,n=24n\{=\}24/point,95%95\\%bootstrap CI\.source\-first\(solid,llama\-3\.1\-8b\) holds while theNN\-item source fitsBB, then drops to the budget\-matchedlossy\-paddedfloor \(dashed\) the instant any item must be dropped\. The cliff moves right with the budget \(N=5→14N\{=\}5\\\!\\to\\\!14asBBdoubles\), so the lever is whether the answer\-determining source fits, not problem size\. The soft sag before each cliff is the answering model’s summation limit and lifts with capability \(frontier confirm dotted: Sonnet and Opus coincide, both hold1\.001\.00up to the same cliff; Table[14](https://arxiv.org/html/2606.25449#A2.T14)\); the information cliff itself is capability\-independent\.
#### The boundary is recoverability, not size: a noisy source defeats “keep the source”\.
A second sweep separates source*size*from source*identifiability*\. Here the answer\-determining items are few \(four\) and easily fit the budget, but they are interleaved with plausible “considered, not bought” decoys, and the total is the sum over the bought items only \(still objective; App\.[C](https://arxiv.org/html/2606.25449#A3)\)\. A positionalsource\-firstnote \(*naive*\) fills the budget in order, so the decoys crowd the bought items out; a relevance\-aware note \(*denoised*\) keeps only the bought items\. As noise grows, naivesource\-firstdecays to thelossyfloor \(1\.00→0\.001\.00\\to 0\.00by eight decoys\) while denoised holds flat \(≈1\.00\\approx 1\.00\), and the gap is the cost of failing to identify the source \(Figure[3](https://arxiv.org/html/2606.25449#S5.F3)\)\. A deployable distiller closes part of that gap: runningsource\-first\-auto with no oracle on the noisy ledger, the LLM uses the stated relevance cues to keep≈3\.8\\approx 3\.8of the44bought items and holds directed reclaim at0\.620\.62–0\.880\.88across0–1616decoys, far above naive’s collapse to0\.000\.00but short of the oracle’s1\.001\.00\(n=8n\{=\}8/point, released asbench\_locating\.py\)\. Deployable locating is thus partly solvable when relevance is stated in the source; the harder latent case \(no cue\) stays open, and the residual gap to the oracle is the price of imperfect identification\. The decay is capability\-invariant: naivesource\-firstfalls on the same schedule forllama\-3\.1\-8b,claude\-sonnet\-4\-6, andclaude\-opus\-4\-8\(Table[15](https://arxiv.org/html/2606.25449#A3.T15)\), and the mechanism is again binary, when noise crowds out a bought item, directed reclaim is0\.000\.00\(n=117n\{=\}117on Opus\), exactly as when the budget drops it\. A crowded\-out item is a dropped item\. The lever is therefore not “keep the source” but “keep the*answer\-determining*source”, and noise makes that identification the bottleneck\.
Figure 3:Noise crowds the source out of a fixed budget\.Directed Reclaim Rate vs\. decoy count added to a four\-item source at a fixed budget,n=24n\{=\}24/point,95%95\\%CI\. Naive \(positional\)source\-first\(red\) decays to thelossyfloor as decoys eat the budget; relevance\-aware*denoised*source\-first\(green\) holds flat\. The frontier confirm \(dotted\) coincides with the88B model: the noise cliff is capability\-invariant, because a crowded\-out bought item is an information loss no reader recovers\.
#### Past its boundary, the fix fails silently, unless it carries a completeness signal\.
Both boundaries expose a failure mode ofsource\-firstitself\. When the source is truncated the answering model does not abstain and does not inherit the old wrong answer \(the conclusion was dropped\); it confidently sums the items it can see and asserts that partial total\. On the size cliff this is24/2424/24on every cell for Opus, which computes the partial sum exactly and returns it with no flag, the stronger model failing the more cleanly \(App\.[B](https://arxiv.org/html/2606.25449#A2)\)\.source\-firstthus trades a detectable failure \(a stale answer\) for an undetectable one \(a freshly computed wrong total\)\. The remedy is a one\-line*completeness signal*: tagging the note with how many of the original items survived\. Re\-running the size\-cliff cells with that tag, Opus flips from96/9696/96silent mis\-sums to94/9694/96that flag the gap or abstain \(Table[11](https://arxiv.org/html/2606.25449#S5.T11)\), e\.g\. “one of the original 16 items is missing and cannot be included\.” The tag needs only what the writer observes while compressing, how many items there were and how many were kept; it does not require knowing*which*item is answer\-determining \(the harder write\-time identification problem of §[7](https://arxiv.org/html/2606.25449#S7)\), since counting is not identifying\. It does presume a discretely countable source, so it lives in the same compact regime as the rest of the method\. A source\-first memory should therefore carry not only the source but a record of its*completeness*, so that exceeding the budget fails loudly rather than silently\. The remedy is capability\-gated: the strong reader honors the tag \(94/9694/96flag or abstain on Opus\), but the88B reader largely ignores it, flagging only6/966/96and still silently mis\-summing62/9662/96\(n=96n\{=\}96,bench\_completeness\.pyon llama\), consistent with the rest of this work, where88B readers discount inconvenient context under pressure\. It provably cannot take the same form at the noise boundary: the tag counts items, but under noise the budget stays full while the answer\-determining items are exactly the ones crowded out, so the item count reads complete; catching that would require counting the*bought*items, which is identifying them, the locating problem itself\. The loud\-failure remedy exists for size and is structurally unavailable for noise\.
### 5\.8Real conversational memory, and the behavioral failure sharpened
#### The wall and fix replicate on real conversational memory\.
The strongest test of the compact\-source scope is to drop it\. We run reclaim onMultiWOZ\(Budzianowski et al\.,[2018](https://arxiv.org/html/2606.25449#bib.bib3)\), a standard task\-oriented dialogue benchmark: the source is now a real, multi\-turn, chatty dialogue \(entangled and fuzzy\), while the target is a checkable slot value, a booking or departure time, so scoring stays objective with no judge \(App\.[E](https://arxiv.org/html/2606.25449#A5)\)\. A user utterance states the value verbatim \(the recoverable source\); a corrupted confirmation carries the drift;lossykeeps the confirmation,source\-firstkeeps the user utterance\. The shape is identical to the ledger \(Table[9](https://arxiv.org/html/2606.25449#S5.T9)\)\. The wall is exact and capability\-invariant:lossy,lossy\-padded, and a no\-value blank floor all sit at0\.000\.00on both models and every slot type, and the blank floor confirms the target is genuinely unguessable \(a time, not a small option set\)\. The fix recovers, and recovery lifts with capability exactly as the size sweep predicts, since reading a value out of fuzzy natural conversation is the capability\-sensitive step:source\-firstclimbs monotonically0\.46→0\.68→0\.970\.46\\to 0\.68\\to 0\.97across thellama/sonnet/Opus ladder, reaching1\.001\.00on Opus across most slot types\. On the cleanest slot, where the user states an exact time \(“book at 12:45”\),source\-firstis already0\.960\.96on the small model; the lower small\-model average is driven by hedged slots \(“leave*after*16:15”\) that a weak reader cannot pin to the labelled value but Opus can \(0\.17→1\.000\.17\\to 1\.00ontrain\-leaveat\)\. The compact deterministic ledger was thus not a special case: brittle memory and its source\-first remedy hold on real conversational memory\. MultiWOZ stresses the*reading*step \(a fuzzy source, read by a weak or strong reader\); because we filter to slots whose value appears in a user turn, the source stays identifiable, so this extends the wall along capability\-sensitive reading rather than testing the*locating*step of §[7](https://arxiv.org/html/2606.25449#S7)\.
Table 9:The wall and fix replicate on real conversational memory\(MultiWOZ slot recovery, directed arm,n=90n\{=\}90/cell,95%95\\%bootstrap CI\)\.lossy/lossy\-padded/blank wall at0\.000\.00on all three models;source\-firstrecovers, lifting monotonically with capability \(0\.46→0\.68→0\.970\.46\\to 0\.68\\to 0\.97\) as the reader extracts the value from fuzzy dialogue\. The blank floor confirms the target is unguessable\.
#### At the wall, a lossy memory is worse than an empty one\.
What a source\-less memory*emits*at0\.000\.00accuracy is the deployment\-relevant question \(Table[10](https://arxiv.org/html/2606.25449#S5.T10)\); we classify the structuredANSWERline on the MultiWOZ wall cells\. Where a model answers at all, the wrong value alossymemory carries acts as an*attractor*: it raises wrong\-time emission from27%27\\%\(blank\) to59%59\\%\(lossy\) on the88B model, and from0%0\\%to49%49\\%on Opus, which passes the planted time through on itsANSWERline while flagging it “unverified” in prose in every case\. The effect is model\-specific rather than a clean capability trend: Sonnet, the middle of the three, is the most conservative and abstains on the answer line under*both*conditions \(0%0\\%\), refusing to emit any unverified time\. So “worse than empty” here is contingent on a model disposed to answer: Sonnet escapes it on this task \(though not on arithmetic, where it shows the asymmetry, App\.[I](https://arxiv.org/html/2606.25449#A9)\), the other two do not, and for them thelossymemory plants a value a downstream parser reads as the answer where an empty one yields an abstention\. And when the model hedges, the hedge sits in the prose channel the parser ignores while the wrong booking time goes in the output field, Opus caveats every case yet still emits the drift on itsANSWERline half the time \(“ANSWER: 19:15 \(unverified\)”\)\.
Table 10:The wrong value in a lossy memory is an attractor\(MultiWOZ wall cells, directed,n=90n\{=\}90/cell\)\. Fraction emitting a wrong time on the structuredANSWERline; the rest abstain\. Where a model answers \(llama, Opus\), keeping the wrong value \(lossy\) pulls it toward emitting it while keeping nothing \(blank\) does not; Sonnet abstains under both\. Opus flags uncertainty in prose in*all*cases, yet still emits the drift on the answer line half the time underlossy\.Table 11:The fix’s boundary failure is silent unless the note records completeness\(claude\-opus\-4\-8, size\-cliff cellsk<Nk<N, directed,n=96n\{=\}96\)\.*silent*: the model confidently sums the partial source and asserts it with no flag;*flagged*: with a one\-line “kkofNNitems preserved” tag it flags the gap or abstains instead\.
### 5\.9The error cascades when memory feeds memory
The wall so far is measured at a single hop\. Deployed agents run a loop: they read their own memory, act, and compress the result into the next memory\. Because the compression is one\-directional, a single error should not stay put, it should propagate to every downstream step and resist correction however late it arrives\. We test this with a*running\-ledger*chain: hopkkreveals purchasekkand asks for the running total \(truth=k∑i≤kpiqi\{\}\_\{k\}=\\sum\_\{i\\leq k\}p\_\{i\}q\_\{i\}, brute\-forced, judge\-free\), a wrong subtotal is planted at hop11, and after each hop the interaction is compressed into the carried memory \(budgetB=200B\{=\}200\) that hopk\+1k\{\+\}1inherits\. AfterHHhops one directed correction asks for the true final total \(construction and validators in App\.[G](https://arxiv.org/html/2606.25449#A7)\)\. Three results hold on an88B model and a frontier one \(Table[12](https://arxiv.org/html/2606.25449#S5.T12)\)\.*\(1\) The error cascades\.*Underlossymemory the planted error corrupts a*blast radius*\(the count of wrong downstream hops\) that grows with the chain,0\.7→7\.30\.7\\to 7\.3of88on llama and0\.8→7\.00\.8\\to 7\.0of88on Sonnet, and the final correction reclaims≤0\.19\\leq 0\.19, falling to≤0\.08\\leq 0\.08byH≥4H\{\\geq\}4: the error is not merely carried, it spreads, and it is uncorrectable however late the correction lands\.*\(2\) The loop is not the cause; the policy is\.*A no\-error control \(the identical chain with no planted subtotal\) has a blast radius of*exactly*0\.00\.0at everyHHon both models, so chaining a model through its own memory injects no error of its own; the cascade is the dropped source, not the iteration\.*\(3\) Source\-first buys a horizon, not immunity\.*While the accumulated source fits the budget,source\-firstholds reclaim at≈1\.00\\approx 1\.00\(committing only the hop\-11error before self\-correcting\); as the chain grows the source outgrows the fixed budget \(the size cliff of §[5](https://arxiv.org/html/2606.25449#S5), now spread over hops\), andsource\-first’s reclaim falls to*exactly*the fraction of chains whose full source still fits,0\.750\.75/0\.690\.69atH=4H\{=\}4and0\.000\.00atH=8H\{=\}8on both models\. The cliff is thus a sample\-dependent band, biting atH=4H\{=\}4where item\-name lengths decide the fit and total byH=8H\{=\}8, and it is capability\-invariant: it falls at the same depth for the88B and the frontier reader, because it is an information bound \(the source no longer fits\), not a reasoning limit\. A memory loop therefore has a correctable depth set by budget divided by source growth, beyond which the cascade resumes\. Cells are2424chains on llama and1616on Sonnet; the monotone blast growth, the zero\-error control, and the two\-model horizon are stable across the sweep\.
Table 12:A single error cascades across a memory loop, and source\-first only delays it to a budget horizon\.Running\-ledger chain, planted error at hop11, budgetB=200B\{=\}200;*blast*is the mean number of wrong downstream hops \(the cascade\),RRis the final reclaim after one directed correction \(2424chains onllama\-3\.1\-8b,1616onclaude\-sonnet\-4\-6\)\. Underlossythe blast radius grows withHHand reclaim stays≈0\\approx 0\(uncorrectable\)\.source\-firstreclaim equals the fraction of chains whose full source still fits the budget, holding near11then cliffing to thelossyfloor as the source overflows \(partial atH=4H\{=\}4, total atH=8H\{=\}8on*both*models, a capability\-invariant horizon\)\. A no\-error control has blast0\.00\.0at everyHHon both models, so the loop injects no error of its own \(App\.[G](https://arxiv.org/html/2606.25449#A7)\)\.
## 6Why It Happens
Reclaim requires the model to*recompute*the answer; being told the old one is wrong is not enough\. Recomputation needs the source, so everything follows from where the source is\. In a single conversation the source is always present, and reclaim fails only because the model is anchored on a re\-affirmed value\. A directed correction gives it a place to look and overrides the anchor, and even a generic one recovers once distance breaks the groove\. Across sessions, lossy compression removes the source before the conclusion, because the conclusion is the salient thing a summary keeps\. Now the same interventions do nothing: the directed correction names a locus whose facts are gone, and distance is irrelevant, because the source was never far away, it is absent\. Capability is orthogonal throughout: a better reasoner computes a better answer from what survives and nothing from what does not, which is why the stronger model walls in the same place\.source\-firstkeeps the one thing recomputation needs and drops the one thing that can be regenerated, so it preserves correctability at no extra budget\.
## 7Limitations
#### Scope\.
Three boundaries are intrinsic to the method\. The wall itself is an immediate consequence of the definition \(analyticin Table[1](https://arxiv.org/html/2606.25449#S0.T1)\); the result is the behavioral asymmetry it gates, not the wall\. A compact, identifiable source is the regime where the fix has leverage, so we map where it ends \(size, noise, silent truncation, diffuse evidence\) and report the deployable0\.490\.49–0\.880\.88, not the oracle’s1\.001\.00\. And “capability\-invariant” means invariant across the models tested, not a claim over all scales\.
#### Task coverage\.
We test two families: arithmetic \(the favorable case, a clean deterministic source\) and constraint logic \(which confirms the wall softens where a reconstructable clue survives\)\. The induced error is exogenous; we also test a*self\-generated*one, lettingllama\-3\.1\-8bmiscomputeN=10N\{=\}10ledgers with no planted premise \(n=15n\{=\}15natural errors,bench\_endogenous\.py\)\. The informational wall is provenance\-invariant \(source\-first1\.001\.00, lossy0\.000\.00on the model’s own error\), but the behavioral attractor attenuates: on its own dropped error the88B model mostly abstains \(0\.730\.73\) and re\-emits the stale value only0\.130\.13of the time, against0\.480\.48/0\.220\.22on a planted note \(Table[3](https://arxiv.org/html/2606.25449#S5.T3)\)\. A planted external note is thus the*more*adversarial case, not a more favorable one\. The size and noise sweeps \(§[5](https://arxiv.org/html/2606.25449#S5)\) bound the fix from the other side, and MultiWOZ extends it to real fuzzy dialogue\.
What remains untested is the case where the answer is not*stated*anywhere but must be*derived*from diffuse evidence with no isolable source \(big empirical claims, qualitative judgments\), where thesource\-firstlever should weaken\. We locate this boundary but cannot size which side of it deployed memory occupies; the prevalence audit below is a first measurement\. Two things temper the concern\. First, the compact\-source regime is not a concession but the condition for the method’s guarantee: a checkable ground\-truth answer is what lets us score correctability*without*a judge, so diffuse coverage would require exactly the judge whose absence is our objectivity\. We keep the regime that is measurable, and it is where high\-stakes agentic memory lives \(a coding agent carrying a value, a booking agent a departure time, a workflow agent a running total\)\. Second,source\-firstpresumes the answer\-determining source can be*identified at write time*, before the targeting correction is known;source\-first\-auto’s gap to the denoised oracle \(0\.490\.49–0\.880\.88vs\.1\.001\.00\) is the price of not having that knowledge, and “keep the source” is a prescription only insofar as the source can be told apart at compression time\.
Two further limits, both shown rather than assumed\. Past its boundarysource\-firstfails silently \(a confidently summed partial source\) unless the note carries a completeness tag \(Table[11](https://arxiv.org/html/2606.25449#S5.T11)\)\. And the correction battery \(App\.[J](https://arxiv.org/html/2606.25449#A10)\) is capability\-gated:source\-firstresists a false locus and a confident wrong value on both frontier readers, but a sustained push and an injected fabricated source are resisted only by frontier readers \(0\.900\.90–1\.001\.00\), not an88B one \(0\.000\.00–0\.270\.27\)\. Untested: a fabricated source under multi\-turn persistence, and whether a weak reader instructed to trust its own memory recovers that resistance\.
#### A first prevalence audit\.
To put a number on which side of the boundary deployed memory occupies, we sample100100real conversations from each of three corpora spanning the deployment range, general assistant chat, tool/function\-calling, and agentic task traces,666HuggingFaceH4/no\_robots\(Hugging Face H4,[2023](https://arxiv.org/html/2606.25449#bib.bib7)\)\(human\-written assistant dialogues\),glaiveai/glaive\-function\-calling\-v2\(Glaive AI,[2023](https://arxiv.org/html/2606.25449#bib.bib5)\), andTHUDM/AgentInstruct\(Zeng et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib21)\)\(os/db/webshop trajectories\), streamed and shuffled at a fixed seed\.and classify what each interaction’s memory would carry as a compact identifiable source, diffuse evidence, or no correctable answer \(Table[13](https://arxiv.org/html/2606.25449#S7.T13); rubric, a6/66/6gold check, and two\-labeler agreement in App\.[H](https://arxiv.org/html/2606.25449#A8), released asbench\_prevalence\.py\)\. The audit returns one negative result and one positive\. The*absolute*share is*not*identified: two LLM labelers disagree on the compact/diffuse boundary for messy real text \(three\-way Cohenκ=0\.15\\kappa\{=\}0\.15atn=50n\{=\}50\), bracketing the compact fraction widely \(agentic0\.610\.61–0\.990\.99, chat0\.220\.22–0\.780\.78\), so we report no point estimate\. The cross\-domain*ordering*is robust: compact\-source content is significantly more prevalent in tool\-use and agentic memory than in open chat under*both*labelers, with non\-overlapping intervals \(grok chat0\.220\.22vs\. agentic0\.610\.61; llama0\.780\.78vs\.0\.990\.99\)\. This is the measured form of the claim that the high\-stakes, agentic regime is wheresource\-first’s compact\-source precondition is most often met: the precise coverage stays open, the ordering does not\. Tightening the level to a point estimate needs human\-validated labels rather than two disagreeing LLM labelers; until then the practical reach ofsource\-firstrests on this measured ordering rather than on a coverage number\.
Table 13:Prevalence audit: the compact\-source share rises with stakes under both labelers, though its absolute level is labeler\-dependent\.Fraction of100100sampled real conversations per domain whose carried content has a compact identifiable source \(vs\. diffuse/none\),95%95\\%bootstrap CI, by two LLM labelers\. A6/66/6gold check passes for both, but cross\-labeler agreement on messy real text is weak \(three\-wayκ=0\.15\\kappa\{=\}0\.15atn=50n\{=\}50\), so the absolute level is not identified; the chat<<tool<<agentic ordering is monotone and non\-overlapping under both labelers \(the load\-bearing finding\)\. Construction and validators in App\.[H](https://arxiv.org/html/2606.25449#A8)\.
#### Deployed\-system breadth\.
§[5](https://arxiv.org/html/2606.25449#S5)tests three deployed memories on distinct paradigms \(running summary, extraction\-plus\-retrieval, naive vector retrieval\), replays them across three answering models, and re\-runs construction with a frontier*writer*to settle the writer confound: the summary wall is a weak\-writer artifact \(it largely lifts with a capable writer\), mem0’s is not \(a stronger extractor confabulates more\)\. What remains is breadth: a single frontier writer on three systems, and a fuller account would sweep more families \(episodic note stores, agentic memory\) and more writers\.source\-first\-auto’s deployable number rides on one distillation prompt, and the wording matters most for a weak distiller\. Holding the transcript, problem set, and decoding fixed and varying only the prompt across four intent\-equivalent rewordings, directed reclaim on arithmetic ranges0\.380\.38–0\.780\.78on llama \(a0\.410\.41\-wide spread\), but the spread shrinks monotonically with the writer’s capability:0\.780\.78–1\.001\.00on Sonnet \(0\.220\.22\) and0\.940\.94–1\.001\.00on Opus \(0\.060\.06\), the floor rising0\.38→0\.78→0\.940\.38\\to 0\.78\\to 0\.94\(n=32n\{=\}32each; released asbench\_promptsweep\.py\)\. Prompt\-robustness of the distiller is therefore itself capability\-gated, like the adversarial and completeness results: the deployable number is prompt\-conditional on a weak writer and nearly prompt\-invariant on a frontier one\. These are separate single\-seed controlled sweeps, so their absolute level differs from the multi\-seed board; the load\-bearing quantity is the spread and its shrinkage\. We report the number for a fixed prompt and flag the sensitivity rather than tuning the prompt to the benchmark\. The source\-recoverability axis still bounds all of it: these are compact\-source tasks, and a large or entangled source should weaken every policy\.
#### Scope and statistics\.
Headline cells are atn=96n\{=\}96\(llama\-3\.1\-8b,3232problems×\\timesthree seeds; all three answering models on the deployed board\);grok\-4\.3and the boundary sweeps \(size, noise, completeness\) run atn=24n\{=\}24\. Every cell carries a95%95\\%bootstrap interval \(percentile,5,0005\{,\}000resamples at a fixed seed, reproducing to the digit\)\. The load\-bearing contrasts are the low\-integrity wall cells and the source\-dropped frontier rows, tight and non\-overlapping with the source\-kept rows; the high\-integrity cells are wide and the policies trade places there, which is why we scope the fix to low integrity rather than all\-regime dominance\. These contrasts are directional and pre\-specified \(source\-dropped→0\\to 0, source\-kept high\), not selected post hoc, so the non\-overlapping intervals are not a multiple\-comparison artifact\. Because seeds within a problem are not independent, we also ran a*problem\-clustered*bootstrap \(resampling the3232problems, which are independently generated instances, not templated variants\): the headline cells are essentially unchanged \(wall\[0,0\]\[0,0\], arithmetic source\-kept≥0\.97\\geq 0\.97\), only the already\-wide high\-integrity cells widen slightly\. For the measured zeros,\[0,0\]\[0,0\]describes the sample while the distribution\-free rule\-of\-three95%95\\%upper bound is≤0\.04\\leq 0\.04atn=96n\{=\}96\(≤0\.125\\leq 0\.125atn=24n\{=\}24\); likewise the deterministicn=32n\{=\}32Opus and sycophancy1\.001\.00s have zero sampling variance, not distribution\-level certainty, with a rule\-of\-three lower bound of≥0\.91\\geq 0\.91\. “Capability\-invariant” throughout means invariant across the models tested, not a claim over all scales; the answering\-model replay swaps only the reader, over memories built once at temperature0\.70\.7and held fixed, so answering temperature varies the reader and never the carried text and cannot produce the within\-memory source\-kept/source\-dropped contrast\. This is a controlled study of a mechanism, not a memory product\.
## 8Conclusion
Whether a drifted model can be pulled back is decided by what its memory kept, not by how capable the model is\. The ordinary instinct, summarize toward the conclusion, is what makes errors permanent when it keeps a wrong conclusion and drops the source\. The deployment\-relevant half is behavioral: with the source gone, a model that answers rather than abstains emits a confident wrong value \(and on the strongest model, adopts an asserted one on all3232problems tested\), so for those models a lossy memory degrades a system more quietly than an empty one\.
A source\-first compression, keep the working and let the takeaway be recomputed, removes the failure at equal budget*wherever the answer\-determining source is compact and can be identified and kept*\. It is not a universal fix\. Dropping the conclusion introduces a silent partial\-source failure of its own, which a one\-line completeness tag restores to a loud one; and where the budget carries both, keeping the conclusion beside the source is free on a strong reader and buys a recompute\-and\-compare check\. We therefore characterize the regime where the fix works and locate its edges, source size, noise, silent truncation, and diffuse evidence whose source cannot be isolated, the regime this base case opens onto\.
Because reclaim is scored by exact match against a known answer, the evaluation itself deploys: induce a known drift, compress under a candidate policy, deliver a directed correction, check exact recovery\. It is a deterministic pass/fail with no judge to host and nothing to re\-annotate as models change, so “is our memory still correctable” becomes a regression test rather than a judgment call\. We package this as a write\-time*probe*: it reads a candidate memory note and flags the silent\-uncorrectable case \(the source dropped while a stale value is kept\) from the string alone, no model call, the source\-token test of §[4](https://arxiv.org/html/2606.25449#S4)run before a memory is ever stored\. The conclusion is always re\-derivable from its source; the source is never re\-derivable from its conclusion\. A memory that wants to stay correctable should keep what cannot be recomputed, and mark whether what it kept is complete\.
## Ethical Considerations
This work identifies a failure mode \(brittle memory\) in language models that carry a compressed memory across a session boundary, and releases the evaluation harness and paired memory conditions that surface it\. We weighed the dual\-use risk against disclosure benefits and judged the latter to substantially exceed it: brittle memory is a passive consequence of a compression policy rather than an active attack vector, and its remedy is a benign design change, compressing toward the recomputable source, that memory\- and summarization\-system designers need in order to keep model errors correctable\. Findings naming specific models reflect controlled measurement of the evaluated snapshots and should not be read as general claims about those providers’ systems\. The harness, paired memory conditions, and validators are released for research use\.
## Use of AI Assistants
The language models under study are the evaluated systems; all reclaim scoring is objective against known answers and uses no model judge\. AI assistants were additionally used for coding support and for drafting and editing manuscript prose; all research questions, experimental design, analyses, and conclusions are the authors’ own\.
## References
- Anthropic \(2026a\)Anthropic\.Claude Opus 4\.8\.[https://www\.anthropic\.com/news/claude\-opus\-4\-8](https://www.anthropic.com/news/claude-opus-4-8), 2026a\.Model announcement\.
- Anthropic \(2026b\)Anthropic\.Claude Sonnet 4\.6\.[https://www\.anthropic\.com/news/claude\-sonnet\-4\-6](https://www.anthropic.com/news/claude-sonnet-4-6), 2026b\.Model announcement\.
- Budzianowski et al\. \(2018\)Paweł Budzianowski, Tsung\-Hsien Wen, Bo\-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić\.MultiWOZ – a large\-scale multi\-domain Wizard\-of\-Oz dataset for task\-oriented dialogue modelling\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pp\. 5016–5026, 2018\.
- Choi & Kwon \(2026\)Dasol Choi and Alex Kwon\.When context flips, safety breaks: Diagnosing brittle safety in aligned language models\.*arXiv preprint arXiv:2605\.27851*, 2026\.
- Glaive AI \(2023\)Glaive AI\.glaive\-function\-calling\-v2\.[https://huggingface\.co/datasets/glaiveai/glaive\-function\-calling\-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), 2023\.Dataset\.
- Huang et al\. \(2024\)Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou\.Large language models cannot self\-correct reasoning yet\.In*International Conference on Learning Representations \(ICLR\)*, 2024\.
- Hugging Face H4 \(2023\)Hugging Face H4\.No robots\.[https://huggingface\.co/datasets/HuggingFaceH4/no\_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots), 2023\.Dataset\.
- LangChain \(2022\)LangChain\.LangChain\.[https://github\.com/langchain\-ai/langchain](https://github.com/langchain-ai/langchain), 2022\.Software\.
- Lewis et al\. \(2020\)Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al\.Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2020\.
- Liu et al\. \(2024\)Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\.Lost in the middle: How language models use long contexts\.*Transactions of the Association for Computational Linguistics*, 12:157–173, 2024\.
- Madaan et al\. \(2023\)Aman Madaan, Niket Tandon, Prakhar Gupta, et al\.Self\-refine: Iterative refinement with self\-feedback\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2023\.
- Mem0 AI \(2024\)Mem0 AI\.mem0: The memory layer for AI agents\.[https://github\.com/mem0ai/mem0](https://github.com/mem0ai/mem0), 2024\.Software\.
- Meng et al\. \(2022\)Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov\.Locating and editing factual associations in GPT\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Meta AI \(2024\)Meta AI\.Introducing Llama 3\.1\.[https://ai\.meta\.com/blog/meta\-llama\-3\-1/](https://ai.meta.com/blog/meta-llama-3-1/), 2024\.Model announcement\.
- Packer et al\. \(2023\)Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G\. Patil, Ion Stoica, and Joseph E\. Gonzalez\.MemGPT: Towards LLMs as operating systems\.*arXiv preprint arXiv:2310\.08560*, 2023\.
- Russinovich et al\. \(2025\)Mark Russinovich, Ahmed Salem, and Ronen Eldan\.Great, now write an article about that: The crescendo multi\-turn LLM jailbreak attack\.In*34th USENIX Security Symposium*, 2025\.
- Sharma et al\. \(2024\)Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, et al\.Towards understanding sycophancy in language models\.In*International Conference on Learning Representations \(ICLR\)*, 2024\.
- Shinn et al\. \(2023\)Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\.Reflexion: Language agents with verbal reinforcement learning\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2023\.
- xAI \(2026\)xAI\.Grok 4\.3\.[https://docs\.x\.ai/developers/models/grok\-4\.3](https://docs.x.ai/developers/models/grok-4.3), 2026\.Model card\.
- Xiao et al\. \(2023\)Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff\.C\-Pack: Packed resources for general Chinese embeddings\.*arXiv preprint arXiv:2309\.07597*, 2023\.
- Zeng et al\. \(2023\)Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang\.AgentTuning: Enabling generalized agent abilities for LLMs\.*arXiv preprint arXiv:2310\.12823*, 2023\.
## Appendix AFailure examples
#### Hard wall \(arithmetic\)\.
Atg=0\.1g\{=\}0\.1thelossynote for one problem reads, in full:*“\(Memory of an earlier session\.\) You concluded the total before tax was $55\.”*The line items \(77notebooks at $4 and99pens at $2; true total $46\) are gone\. The directed correction names the locus,*“I think the pens subtotal is wrong; recheck and give the corrected total,”*but with nothing to recompute from the model returnsANSWER: 55, the inherited wrong value\. This is the modal arithmetic\-wall failure \(Table[7](https://arxiv.org/html/2606.25449#S5.T7),90%90\\%on the frontier model\)\.
#### Soft wall \(logic\)\.
Thelossylogic note keeps a corrupted but reconstructable relational clue rather than a bare conclusion \(e\.g\. an ordering constraint that still pins part of the solution\)\. A capable model can re\-derive the held\-out answer from the surviving relation, so the same directed correction succeeds about45%45\\%of the time on the frontier model and17%17\\%on the small one, and the wall is a partial floor rather than a clean zero, exactly the recovered column of Table[7](https://arxiv.org/html/2606.25449#S5.T7)\.
## Appendix BSource\-size sweep
#### Construction\.
Each ledger is a deterministicNN\-item problem:NNgoods with integer prices and quantities, and a pre\-tax total equal to the exact sum of price×\\timesquantity, so scoring stays objective with no judge\. One middle item is given a wrong subtotal to plant the drift, exactly as in the two\-item task\. The carried memory has a fixed character budgetBB: thesource\-firstnote lists as many whole line items as fit inBBand drops the \(re\-derivable\) conclusion, while the budget\-matchedlossy\-paddednote keeps only the conclusion, padded with neutral filler to the same length\. “How many items survived” \(kk\) is therefore a property of the emitted string, read off by a token test, not an assumption\. We sweepN∈\{2,3,4,5,6,8,10,12,14,16,20,24,32\}N\\in\\\{2,3,4,5,6,8,10,12,14,16,20,24,32\\\}atB∈\{300,600\}B\\in\\\{300,600\\\}over eight stores×\\timesthree seeds \(n=24n\{=\}24/cell\), directed arm, onllama\-3\.1\-8b\(temperature0\.70\.7\), with a frontier confirm onclaude\-opus\-4\-8\(Table[14](https://arxiv.org/html/2606.25449#A2.T14)\)\.
#### Validators \(all pass\)\.
Run for free against a deterministic fake that recomputes only when the full source is present: \(i\) thelossy\-paddedcontrol never reclaims at anyNNorBB; \(ii\)source\-firstreclaims iff the full source fit the budget \(k=Nk\{=\}N\); \(iii\) a1→01\\\!\\to\\\!0cliff exists within the swept range; and \(iv\) the cliff moves right asBBgrows\. A real model scoring above this fake past the cliff would be confabulating rather than recomputing; none does \(thek<Nk<Nrows are a clean0\.000\.00\)\.
Table 14:The size boundary across the capability ladder\(directedsource\-first,n=24n\{=\}24/cell,95%95\\%bootstrap CI\)\. Both frontier models hold a flat1\.001\.00wherever the full source fits the budget, erasing the88B model’s pre\-cliff sag \(0\.540\.54–0\.670\.67\), yet all three fall to0\.000\.00at the*same*NNwhere the note must first drop an item \(k<Nk<N\)\. The information cliff is capability\-invariant; only the soft slope before it is not\.
## Appendix CNoisy\-source sweep
#### Construction\.
Four bought items determine the pre\-tax total; they are interleaved withDD“considered, not bought” decoy items \(price listed, quantity zero\) drawn from the same goods pool, and the total is the exact sum over the bought items only, so scoring stays objective\. One bought item carries the planted error\. The carried memory has a fixed budget \(420420characters\)\. The*naive*source\-firstnote lists items in their \(shuffled\) order until the budget is full, so decoys can crowd the bought items out; the*denoised*note keeps only the four bought items \(the oracle that identifies the source\), padded to the same budget;lossy\-paddedkeeps only the conclusion\. We sweepD∈\{0,2,4,6,8,12,16,24,32\}D\\in\\\{0,2,4,6,8,12,16,24,32\\\}over eight stores×\\timesthree seeds \(n=24n\{=\}24/cell\), directed arm, onllama\-3\.1\-8b, with frontier confirms onclaude\-sonnet\-4\-6andclaude\-opus\-4\-8\.
#### Validators \(all pass\)\.
Against a deterministic fake that recomputes only when every bought item survived: \(i\)lossy\-paddednever reclaims; \(ii\) denoised always reclaims; \(iii\) naive reclaims iff all four bought items fit the budget; \(iv\) naive degrades as the decoy count grows\. The real\-model mechanism matches: naive reclaim is0\.000\.00\(n=117n\{=\}117on Opus\) whenever a bought item was crowded out\.
Table 15:Noisy\-source decay across the capability ladder\(directedRR,n=24n\{=\}24/cell, budget420420, four bought items; per\-point95%95\\%bootstrap CIs in Figure[3](https://arxiv.org/html/2606.25449#S5.F3)\)\. Naivesource\-firstdecays identically on all three models as decoys crowd the bought items out; denoised holds\. The noise wall is capability\-invariant\.
## Appendix DLogic free\-guess floor
The logic answer is one of a few candidate tokens \(3–5 per problem; mean uniform chance≈0\.30\\approx 0\.30\), so part of the soft\-wall “recovery” could be guessing rather than re\-derivation from the surviving clue\. We measure the floor directly: a carried note that gives*only*the candidate set \(no clue, no premise, no conclusion\), then the same correction\. The generic arm is the conservative free\-guess rate; the directed arm adds only the locus\-naming signal \(no clue\)\. Both models reclaim*below*the uniform rate \(Table[16](https://arxiv.org/html/2606.25449#A4.T16)\): they abstain or anchor rather than guess freely\. The soft wall \(directedlossy, Table[6](https://arxiv.org/html/2606.25449#S5.T6):0\.050\.05–0\.160\.16on llama,0\.420\.42–0\.500\.50on grok\) thus clears the floor decisively on the frontier model \(real re\-derivation\) but sits at or below it on the small one \(its0\.170\.17free\-guess floor, consistent with guessing rather than re\-derivation\), which is the baseline\-anchored reading of the capability trend\.
Table 16:Free\-guess floor for the logic tasks\(blank note, candidate set only, directed and generic arms,n=24n\{=\}24/cell\)\. Both models sit below the≈0\.30\\approx 0\.30uniform rate\.
## Appendix EReal conversational memory \(MultiWOZ\)
#### Construction\.
From MultiWOZ 2\.2 dialogues\(Budzianowski et al\.,[2018](https://arxiv.org/html/2606.25449#bib.bib3)\)we take, per dialogue, the first checkable time slot \(restaurant booking, train or taxi departure/arrival\) whose value appears*verbatim*in a user utterance, so the source genuinely contains the recoverable answer \(we drop normalized forms like “noon” and invalid times such as “24:30”\)\. The carried memory at a fixed budget is built under the same policies:source\-firstkeeps the user’s verbatim utterance and drops the confirmation;lossykeeps a confirmation corrupted to a plausible wrong time \(the drift\);lossy\-paddedpadslossytosource\-first’s length;*blank*keeps neither \(the free\-guess floor\)\. The directed correction names the slot \(“the train departure time is wrong; give the correct one asANSWER: <HH:MM\>”\) and scoring is exact slot\-value match, no judge\. We use3030dialogues×\\timesthree seeds \(n=90n\{=\}90/cell\) onllama\-3\.1\-8b, with frontier confirms onclaude\-sonnet\-4\-6andclaude\-opus\-4\-8\.
#### Validators \(all pass\)\.
Against a deterministic fake that returns the true value iff it survives in the carried note: \(i\)source\-firstrecovers \(the source utterance is present\); \(ii\)lossy, \(iii\)lossy\-padded, and \(iv\) blank never return the truth, because only the drift or nothing survives\. The real\-model wall matches:lossy/lossy\-padded/blank are a clean0\.000\.00atn=90n\{=\}90on both models\.
## Appendix FWriter sub\-study
The deployed\-systems result \(§[5](https://arxiv.org/html/2606.25449#S5)\) uses memories written byllama\-3\.1\-8b\. To separate a weak\-writer artifact from a paradigm property we re\-run memory*construction*with a frontier writer \(claude\-sonnet\-4\-6\) while holding the session\-1 trajectory and the llama answerer fixed\. This is a matchedn=24n\{=\}24sub\-study over the canonical eight problems, run under both writers; its llama\-writer figures are therefore then=24n\{=\}24baselines \(e\.g\. summary0\.380\.38, mem025\.625\.6invented numbers\), not then=96n\{=\}96deployed\-board figures of the main text\.
The hand\-builtlossynote, templated and writer\-free, stays at0\.000\.00and fixes the policy baseline\. The two LLM\-written systems then split in*opposite*directions\. LangChain’s summary wall iswriter\-dependent: a capable writer keeps the line items even under its conclusion\-oriented prompt, lifting directed reclaim0\.38→0\.880\.38\\to 0\.88on arithmetic and0\.50→0\.710\.50\\to 0\.71on logic, most of the way to the fix, so “the summary walls” was largely a weak\-writer artifact\. mem0 is theopposite: a frontier extractor does not rescue it \(arithmetic0\.25→0\.120\.25\\to 0\.12, logic0\.25→0\.460\.25\\to 0\.46, both far below the fix\) and it*confabulates more*,25\.6→32\.425\.6\\to 32\.4invented numbers per memory, because a stronger model asked to extract facts extracts more of them and buries the source deeper\. Capability fixes the summary and*worsens*the extraction\. Writer strength is therefore not a paradigm\-independent remedy, and the templatedlossywall confirms the mechanism is the policy, not the writer: keeping \(and verifying\) the source is the only fix that holds across paradigms and across the writer’s capability\.
## Appendix GCascade chain
#### Construction\.
A running\-ledger chain over theN=HN\{=\}Hitems of a ledger \(App\.[B](https://arxiv.org/html/2606.25449#A2)\)\. Hopkkreveals purchasekkand asks for the running total, scored by exact match to truth=k∑i≤kpiqi\{\}\_\{k\}=\\sum\_\{i\\leq k\}p\_\{i\}q\_\{i\}\(judge\-free\)\. A wrong subtotal \(\+$7\+\\mathdollar 7on purchase11\) is planted at hop11\. After each hop the interaction is compressed into the carried memory at a fixed budgetB=200B\{=\}200characters under one of three policies \(lossy: keep the running total, drop the items;source\-first: keep the items, drop the total, dropping the earliest items once they no longer fitBB;lossy\-padded:lossyplus filler tosource\-first’s length\), and hopk\+1k\{\+\}1inherits only that memory\. AfterHHhops a single directed correction requests the true final total\. The*blast radius*is the number of hops with a wrong answer;*reclaim*is whether the final correction returns truthH\. We sweepH∈\{1,2,4,8\}H\\in\\\{1,2,4,8\\\}over2424chains onllama\-3\.1\-8b\(temperature0\.70\.7, all three policies\) and1616chains onclaude\-sonnet\-4\-6\(temperature0,lossy/source\-first\), released asbench\_cascade\.py\.
#### Validators\.
\(i\)*No\-error control*: the identical chain with no planted subtotal has a blast radius of0\.00\.0at everyHHon both models, so the loop injects no error of its own and the cascade is attributable to the dropped source rather than to iteration\. \(ii\)*Source\-state token test*: thelossymemory provably contains none of the purchase nouns \(so its failure is informational, not a reasoning lapse\), whilesource\-firstcontains them until the budget horizon\. \(iii\) The budget\-matchedlossy\-paddedcontrol \(llama\) trackslossy\(RR≈0\\textsc\{RR\}\\approx 0, blast growing withHH\), so the cascade is a property of dropped content, not of note length\.
## Appendix HPrevalence audit
#### Construction\.
We estimate how often real assistant memory carries a compact, checkable source across three ungated public corpora chosen to span the deployment range:HuggingFaceH4/no\_robots\(Hugging Face H4,[2023](https://arxiv.org/html/2606.25449#bib.bib7)\)\(human\-written general\-assistant dialogues\),glaiveai/glaive\-function\-calling\-v2\(Glaive AI,[2023](https://arxiv.org/html/2606.25449#bib.bib5)\)\(tool/function\-calling\), andTHUDM/AgentInstruct\(Zeng et al\.,[2023](https://arxiv.org/html/2606.25449#bib.bib21)\)\(agentic task trajectories, blended over its os/db/webshop splits\)\. Per domain we stream and shuffle at a fixed seed and take100100conversations of\>80\>80characters, each truncated to1,6001\{,\}600characters \(the task is set early in all three\)\. An LLM labels each against a one\-paragraph rubric into one of three classes:compact\(the carried content is a checkable value or fact derivable from a small identifiable source present in the conversation, e\.g\. a number, date, time, slot value, computed result, or specific looked\-up fact\),diffuse\(a judgment or synthesis whose support is spread across many turns or external knowledge with no isolable source\), ornone\(no correctable factual answer is carried, e\.g\. open\-ended chat, brainstorming, or creative writing\)\. The reported quantity is the per\-domain class fraction\.
#### Validators\.
\(i\) A66\-item*gold*set of unambiguous conversations \(threecompact: an arithmetic total, a booking time, a date fact; onediffuse: a qualitative job\-offer weighing; twonone: a whimsical poem, a roleplay\) the classifier must label≥5/6\\geq 5/6correctly, else the run’s labels are noise; bothllama\-3\.1\-8bandgrok\-4\.3score6/66/6\. \(ii\) A*two\-labeler agreement*pass: the labelers agree on the gold set but only weakly on messy real text \(three\-way raw0\.500\.50,κ=0\.15\\kappa\{=\}0\.15; binarycompact\-vs\-rest raw0\.500\.50,κ=0\.12\\kappa\{=\}0\.12,n=50n\{=\}50\)\. This is why we report the cross\-domain*ordering*\(monotone and non\-overlapping under both labelers, Table[13](https://arxiv.org/html/2606.25449#S7.T13)\) and explicitly*not*an absolute coverage figure: the two labelers bracket the compact share rather than pinning it\. The agreement is two LLMs labeling the same text, not a human gold standard\.
#### Human spot\-check \(extra verification\)\.
Because both labelers are LLMs, we add a human anchor\. The first author labeled a stratified5151\-conversation slice \(1717per domain\)*blind*to the model labels and the domain, against the same rubric, scoring6/66/6on the gold set\. The human reproduces the cross\-domain ordering \(compact fraction0\.41\[\.18,\.65\]0\.41\_\{\[\.18,\.65\]\},0\.76\[\.53,\.94\]0\.76\_\{\[\.53,\.94\]\},1\.00\[1,1\]1\.00\_\{\[1,1\]\}for chat, tool, and agentic; monotone and non\-overlapping at the extremes\) and lands in the same range as the LLM labelers, agreeing with each at Cohenκ≈0\.5\\kappa\\approx 0\.5on the slice \(human vs\. llama0\.550\.55, human vs\. grok0\.490\.49\)\. This is a single author\-rater \(no inter\-human agreement\) who labels somewhat more compact than either model \(0\.730\.73overall vs\.≈0\.5\\approx 0\.5\), so it anchors the cross\-domain*ordering*the audit rests on, not an absolute coverage number, which stays rater\-dependent\. It changes none of the reported figures; we include it only as an independent check that the audit tracks a real, human\-recognizable distinction\. Released aslabel\_prevalence\.py\.
## Appendix IDisposition sweep: who shows “worse than empty”
The “worse than empty” asymmetry \(§[5](https://arxiv.org/html/2606.25449#S5)\) is behavioral, so its magnitude depends on the answering model\. We run the matched lossy\-vs\-blank test on arithmetic at the wall \(g=0\.1g\{=\}0\.1, directed,n=96n\{=\}96/cell,bench\_blank\.py\) across seven models \(Table[17](https://arxiv.org/html/2606.25449#A9.T17)\); the asymmetry is the lossy−\-blank difference in confident\-wrong\-emission\. Every model is positive, lossy emitting a confident wrong value more often than empty in all seven, so the direction is robust; the magnitude tracks disposition\. It is largest where the model inherits the planted attractor under lossy and abstains when the memory is empty \(deepseek\+0\.83\+0\.83, grok\+0\.75\+0\.75, Opus\+0\.73\+0\.73, llama\+0\.48\+0\.48\), and muted where the model abstains under both \(gpt\-4o\-mini\+0\.10\+0\.10\) or confabulates under both \(qwen\+0\.12\+0\.12, which emits a wrong value even from an empty memory, so a lossy one adds little\)\. Sonnet shows the asymmetry on arithmetic \(\+0\.29\+0\.29\) although it abstains on the MultiWOZ slot\-time task \(Table[10](https://arxiv.org/html/2606.25449#S5.T10)\), so a model’s escape is task\-specific, not a fixed property\. The*attractor*column \(fraction of lossy emissions equal to the planted value\) is strongest on Opus \(0\.980\.98\), consistent with the correction taxonomy \(Table[4](https://arxiv.org/html/2606.25449#S5.T4)\)\.
Table 17:Worse than empty across seven models\(arithmetic wallg=0\.1g\{=\}0\.1, directed,n=96n\{=\}96/cell\)\. Fraction emitting a confident wrong value under alossymemory vs\. an empty \(*blank*\) one;Δ\\Deltais the asymmetry;*attr\.*is the share of lossy emissions equal to the planted value\. All seven are positive; magnitude tracks the model’s answer\-disposition\.
## Appendix JCorrection robustness
The full correction battery summarized in §[5\.4](https://arxiv.org/html/2606.25449#S5.SS4): the fix is robust to a vague correction, a false locus, and a confidently asserted wrong value, with the adversarial escalations capability\-gated\.
#### The fix does not need a directed correction\.
A deployed correction is usually vague \(“something is off here”\), not a clean locus, and Table[2](https://arxiv.org/html/2606.25449#S5.T2)shows the generic correction is far weaker than the directed one*in context*\. One might worry the cross\-session fix inherits that weakness\. It does not: at the wall,source\-firstreclaims essentially the same under a generic correction as under a directed one \(Table[18](https://arxiv.org/html/2606.25449#A10.T18)\),≈1\.00\\approx 1\.00on arithmetic and0\.760\.76–0\.820\.82on logic under both, with95%95\\%intervals that overlap almost entirely, whilelossystays at its floor under both\. The specificity of the correction governs*anchoring*, where the source is present and the model is merely entrenched, and is irrelevant to the cross\-session fix, because there the work is done by the restored source, not by the correction\. The fix is therefore undiminished in exactly the regime a real correction occupies: vague, and reliant on the memory to carry what is needed to recompute\.
Table 18:Source\-first is correction\-agnostic at the wall\(llama\-3\.1\-8b, directed vs\. genericRRwith95%95\\%bootstrap CI,n=96n\{=\}96/cell\)\. At low integrity the fix reclaims as well under a vague generic correction as under a directed one that names the locus, whilelossystays at its floor under both\. This is the converse of the in\-context window \(Table[2](https://arxiv.org/html/2606.25449#S5.T2)\), where the directed correction dominates: once the lever is the restored source, the correction’s specificity stops mattering\.
#### Source\-first is robust to a false correction, not only responsive to a true one\.
The correction is a two\-edged signal: ifsource\-firstrestores recomputation, can a*wrong*correction also mislead the model? It cannot\. At low integrity we deliver a directed correction that names a*correct*component as the error \(a false locus, no value supplied\)\.source\-firstreturns the true total1\.001\.00on both base models \(n=24n\{=\}24each\), indistinguishable from the true\-correction reclaim \(0\.990\.99on llama,1\.001\.00on grok\): the surviving source overrides the false claim, the model rechecks the named component, finds it sound, and recomputes the truth\.lossyacts in neither direction, with no source it returns the inherited wrong value or declines \(0\.000\.00true under both a true and a false correction\)\. The correction’s symmetry is therefore a property, not a hole:source\-firstis correctable by a true correction and*immune*to a false locus, whilelossyis uncorrectable both ways\.
#### The robustness survives a confident wrong value, the sycophancy case\.
A false locus names a wrong error site but supplies no value\. The stronger, deployment\-relevant pressure, and the one the sycophancy literature is about\(Sharma et al\.,[2024](https://arxiv.org/html/2606.25449#bib.bib17)\), is a correction that*asserts a confident wrong value*as established fact \(“I double\-checked, the total is definitely $5555”\)\. We deliver exactly this at the wall on the frontier answering models\.source\-firstreturns the true total on*every*trial under that pressure \(claude\-sonnet\-4\-61\.001\.00,n=96n\{=\}96;claude\-opus\-4\-81\.001\.00,n=32n\{=\}32distinct problems, deterministic\), identical to its true\- and false\-correction reclaim: the surviving source lets the model check the asserted value and reject it\.lossydoes the opposite, adopting the asserted wrong value heavily \(0\.74\[\.66,\.82\]0\.74\_\{\[\.66,\.82\]\}on Sonnet,1\.001\.00on all3232Opus problems\)\. The deployment reading is sharp, and capability does not rescue it: capitulation does not fall as the answering model gets stronger, and by point estimate the frontier model is the*more*susceptible \(Opus’s1\.001\.00excludes Sonnet’s interval\), passing the confident wrong value through on its answer line on all3232of its problems\. Source presence, not capability, is what lets a model refuse a confident falsehood\.
#### The adversarial robustness is capability\-gated\.
The single assertion above is the weakest push\. Two escalations expose a limit: \(i\) a*sustained*push that re\-asserts the wrong value over four escalating turns, and \(ii\) a*fabricated source*, a correction that supplies not just a wrong value but fabricated working for it \(the planted premise restated as a verified figure\), a memory\-injection that pits an injected fake source against the real surviving one\. The frontier readers hold:source\-firstresists the sustained push at1\.001\.00on both Sonnet and Opus and the injection at0\.900\.90/1\.001\.00\(Table[19](https://arxiv.org/html/2606.25449#A10.T19)\)\. The88B reader does not, it caves*completely*to sustained pressure \(0\.000\.00\) and adopts the fabricated source most of the time \(0\.270\.27resistance\), with the correct source sitting in its own note\. The true\-correction control stays at≈1\.00\\approx 1\.00throughout, so this is specific adversarial susceptibility, not general unresponsiveness: a weak reader follows a confident human over its own working\.
Table 19:The fix’s adversarial robustness is capability\-gated\(source\-firstresistance, i\.e\. returns the truth, directed, at the wallg=0\.1g\{=\}0\.1\)\. A*sustained push*re\-asserts the wrong value over four turns; a*fabricated source*injects fabricated working for it\. Frontier readers resist both; the88B reader caves\.lossycapitulates throughout \(0\.670\.67–1\.001\.00\) and the true\-correction control holds at≈1\.00\\approx 1\.00, so this is adversarial susceptibility, not unresponsiveness\. llama/Sonnetn=96n\{=\}96, Opusn=32n\{=\}32deterministic\.This is the adversarial member of a pattern already in the paper: the*wall*is capability\-invariant \(no reader recomputes from absent inputs\), while everything*downstream*of a surviving source uses capability, reading a fuzzy source \(logic0\.77→1\.000\.77\\to 1\.00, MultiWOZ0\.46→0\.970\.46\\to 0\.97\) and defending it against a fabricated one alike\. Trusting one’s recomputation over a well\-dressed falsehood is itself a capability: for frontier models that run agentic memory, source\-first is injection\-resistant; on a weak reader it is not\.Similar Articles
Auditing Forgetting in Limited Memory Language Models
This paper proposes a causal auditing framework to evaluate forgetting in Limited Memory Language Models by varying the database state during inference, discovering that parametric leakage is negligible and post-deletion correctness primarily arises from retrieval artifacts rather than residual parametric memory.
Erased, but Not Gone: Output Forgetting Is Not True Forgetting
This paper argues that standard output-level evaluations of machine unlearning overestimate success, showing that methods can appear successful at the output layer while retaining structured representation-level discrepancies relative to retrained models. The authors propose retraining-consistent representation forgetting as a stronger evaluative lens.
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
This paper identifies a critical failure mode in LLM agents where they fail to update personalized memories when new evidence conflicts with prior beliefs. It introduces the STALE benchmark and a three-dimensional probing framework, revealing that even the best models achieve only 55.2% accuracy, and proposes CUPMem as a prototype for robust memory revision.
Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents
This paper introduces the concept of memory depth for long-running language agents, distinguishing it from retrieval-based memory access, and proposes EVAF, a selective parametric consolidation mechanism using surprise- and valence-gated LoRA updates. Experiments across multiple models show EVAF improves goal persistence after context unload with minimal parametric writes.
Useful memories become faulty when continuously updated by LLMs (30 minute read)
This research demonstrates that continuously updating LLM agent memories through distillation and consolidation loops causes performance regression, even when trained on ground-truth solutions. The study finds that episodic-only retention outperforms text-based consolidation, highlighting significant flaws in current self-improvement paradigms.