Temporal Validity in Retrieval Memory: Eliminating Stale-Fact Errors for AI Agents over Evolving Knowledge


Summary

This paper introduces MemStrata, a retrieval memory system that maintains temporal validity to eliminate stale-fact errors in AI agents over evolving knowledge. It outperforms RAG on evolving benchmarks while preserving static recall, using a deterministic supersession layer without LLM calls.

arXiv:2606.26511v1. Abstract: Retrieval-augmented generation (RAG) gives agents access to accumulated knowledge, but has no model of time. When a fact changes (e.g., a function is renamed or API restructured), RAG retrieves both the stale and current value with near-identical embedding similarity. The agent then either abstains or serves the superseded fact. We show this is a structural problem: on a calibrated dataset, cosine similarity distinguishes a contradicted fact from a duplicated one with AUROC 0.59 (near chance), as contradictions are often more embedding-similar to the original than rephrased duplicates. We present MemStrata, a retrieval memory maintaining temporal validity. It stores facts like RAG, preserving static recall, but when a fact's value is contradicted, a deterministic (subject, relation, object) supersession rule retires the stale value in a bi-temporal ledger - with no similarity threshold and no LLM call. Across six benchmarks run locally with a 7B model, MemStrata ties RAG on static knowledge and reaches 0.95-1.00 accuracy on evolving knowledge (where RAG reaches 0.20-0.47). The central result is the stale-fact-error rate: when required to answer, RAG serves superseded values 15-40% of the time; MemStrata drives this to ~0%, a failure class RAG cannot avoid. MemStrata achieves this at retrieval latency (~2.1s) versus ~16-18s for LLM-reranking baselines. We release the harness, datasets, and a marker-free evaluation protocol for memory under knowledge evolution.

# Eliminating Stale\-Fact Errors for AI Agents over Evolving Knowledge: A deterministic supersession layer that retrieval\-augmented generation cannot match by construction
Source: [https://arxiv.org/html/2606.26511](https://arxiv.org/html/2606.26511)
\(Draft v2 \(temporal\-validity framing\)\)

###### Abstract

Retrieval\-augmented generation \(RAG\) gives language\-model agents access to accumulated knowledge, but it has no model of *time*\. When a fact changes \(a function is renamed, a configuration value or dependency version is bumped, an API is restructured\), RAG retrieves both the stale and the current value with near\-identical embedding similarity and cannot determine which is current\. The agent then either abstains or serves the superseded fact\. We show this is not a tuning problem but a structural one: on a calibrated dataset, cosine similarity distinguishes a contradicted fact from a duplicated one with AUROC 0\.59 \(near chance\), and contradictions are on average *more* embedding\-similar to the original than rephrased duplicates are\. We present MemStrata, a retrieval memory that maintains *temporal validity*\. It stores facts like RAG, preserving full recall on static knowledge, but when a fact’s value is contradicted by a newer assertion, a deterministic \(subject, relation, object\) supersession rule retires the stale value in a bi\-temporal ledger, with no similarity threshold and no LLM call\. Across six benchmarks run entirely on consumer hardware with a 7B local model, two static \(project\-fact QA, multi\-session dialogue\) and four marker\-free evolving \(code mutation, configuration migration, dependency bumps, API evolution\), MemStrata ties RAG on static knowledge \(no recall cost\) and reaches 0\.95–1\.00 accuracy on evolving knowledge where RAG reaches 0\.20–0\.47\. The central result is the stale\-fact\-error rate: when required to answer, RAG serves the superseded value 15–40% of the time; MemStrata drives this to ${\sim}0\%$, a failure class RAG cannot avoid by construction\. MemStrata achieves this at retrieval latency \(${\sim}2.1$ s, the embedding floor\) versus ${\sim}16$–$18$ s for LLM\-reranking and LLM\-verification baselines, because no language model runs on the read path\.
We release the harness, prompts, datasets, and a reproducible evaluation protocol, and we recommend a marker\-free benchmark invariant for evaluating memory under knowledge evolution\.

*For double\-blind submission, anonymize the author block and the product/repository identifiers\. All numbers are from the clean re\-run \(REPORT\_PAPER1\.md, REPORT\_PAPER1\_forced\.md, calibration/REPORT\_synthetic\.md\), generated with the fixed plain\-text grader, local and deterministic \(temperature 0, seed 0, no network\)\. Regenerate every figure from those source files before submission\.*

## 1 Introduction

Language\-model agents are increasingly deployed as persistent collaborators that accumulate knowledge across many sessions: a coding assistant that learns a codebase, a research assistant that tracks a literature, an operations assistant that knows a system’s configuration\. For these agents, the binding constraint is no longer raw model capability but memory: how the agent encodes, retains, retrieves, and *keeps current* what it has learned\.

Retrieval\-augmented generation \[Lewis and others, [2020](https://arxiv.org/html/2606.26511#bib.bib10)\] is the dominant memory mechanism\. It stores interaction history as embedded chunks and retrieves the top\-$k$ most similar at query time, controlling prompt size while giving the model access to a large store\. RAG handles recall well\. But it has a blind spot that becomes critical as soon as the stored knowledge *evolves*: it has no representation of time\. When a fact changes, both the old and new versions remain in the store with nearly identical embeddings: “the timeout is 1800 seconds” and “the timeout is 3600 seconds” differ by one token and sit close together in any embedding model\. Retrieval surfaces both\. The model has no principled way to tell which is current, so it either abstains \(refusing a question it could answer\) or guesses \(often serving the stale value with full confidence\)\.

This is acute for code, where knowledge evolves continuously and out of band: functions are renamed, endpoints move, configuration migrates, dependencies upgrade\. An assistant that confidently reports last month’s port number is worse than useless\. But the problem is general — any domain where facts have a validity period \(organizational facts, biomedical findings, current events\) exhibits it\.

A natural first instinct is to solve staleness with a better similarity rule: detect when an incoming fact contradicts a stored one, and update rather than append\. We show in Section [3](https://arxiv.org/html/2606.26511#S3) that this instinct fails for a fundamental reason\. On a calibrated dataset, cosine similarity cannot separate a contradiction from a duplicate: contradictions are on average *more* similar to the original \(a value\-flip is a minimal edit\) than genuine rephrasings are\. No threshold on similarity can distinguish “this restates a stored fact” from “this contradicts a stored fact\.” A learned classifier on top of similarity does not reliably help either, as our experiments show\. The mechanism must be deterministic and structural, not similarity\-based\.

We present MemStrata, a retrieval memory that maintains temporal validity through deterministic supersession\. Its contributions are:

1. A structural impossibility result for similarity\-based staleness detection\. On 98 labeled pairs, cosine AUROC for separating contradictions from duplicates is 0\.59, and the maximum achievable precision is 0\.67, so the safety floor is unreachable\. Contradictions are more embedding\-similar to the original than duplicates are\. \(Sections [3](https://arxiv.org/html/2606.26511#S3), [5\.1](https://arxiv.org/html/2606.26511#S5.SS1)\)
2. A temporal\-validity memory architecture\. MemStrata stores facts like RAG \(full static recall\) but applies a deterministic \(subject, relation, object\) supersession rule when a fact’s value is contradicted, retiring the stale value in a bi\-temporal ledger with no similarity threshold and no LLM call\. \(Section [4](https://arxiv.org/html/2606.26511#S4)\)
3. The stale\-fact\-error result: a failure class RAG cannot avoid\. When required to answer, RAG serves superseded values 15–40% of the time across four evolving benchmarks; MemStrata drives this to ${\sim}0\%$\. This is structural, not tuned: RAG retrieves both values and has no mechanism to choose\. \(Section [5\.3](https://arxiv.org/html/2606.26511#S5.SS3)\)
4. A marker\-free evaluation protocol for memory under evolution\. We construct four evolving benchmarks where the stale and current versions of a fact are textually identical except for the changed value, so the only signal of currency is the memory system’s temporal mechanism, and we show that a contaminating textual marker silently inflates baselines\. \(Sections [4\.5](https://arxiv.org/html/2606.26511#S4.SS5), [5](https://arxiv.org/html/2606.26511#S5)\)

We run all experiments locally and deterministically on consumer hardware, and we are explicit about the limitation that bounds the claim: our evolving benchmarks are structured single\-value templates, and extraction quality, not the supersession mechanism, is the gating factor for messier natural\-language contradictions \(Section [7](https://arxiv.org/html/2606.26511#S7)\)\. We treat this as the subject of follow\-on work rather than papering over it\.

## 2 Related Work

Memory for LLM agents\. Recent systems give agents persistent memory of conversations and user facts: scalable long\-term memory pipelines \[Mem0; Chhikara and others, [2025](https://arxiv.org/html/2606.26511#bib.bib16)\], OS\-style memory hierarchies with paging and background processing \[MemGPT/Letta; Packer and others, [2023](https://arxiv.org/html/2606.26511#bib.bib12)\], and reflective natural\-language memory for simulated agents \[Park and others, [2023](https://arxiv.org/html/2606.26511#bib.bib13)\]\. These target conversational and assistant settings and emphasize recall depth, typically benchmarked on long\-dialogue memory \[LoCoMo; Maharana and others, [2024](https://arxiv.org/html/2606.26511#bib.bib15)\]\. MemStrata differs in mechanism \(a deterministic supersession rule that maintains validity\) and in framing: the problem we attack is not recall depth but stale\-fact resistance under knowledge evolution\.

Graph and hypergraph RAG\. GraphRAG \[Edge and others, [2024](https://arxiv.org/html/2606.26511#bib.bib17)\] and its successors, LightRAG \[Guo and others, [2024](https://arxiv.org/html/2606.26511#bib.bib7)\], NodeRAG \[Xu and others, [2025](https://arxiv.org/html/2606.26511#bib.bib4)\], and HyperGraphRAG \[Luo and others, [2025](https://arxiv.org/html/2606.26511#bib.bib5)\] \(see Han and others \[[2025](https://arxiv.org/html/2606.26511#bib.bib3)\] for a survey\), structure retrieval over entity\-relation graphs or $n$\-ary hyperedges, improving multi\-hop retrieval on static corpora\. They enrich the *representation* of relationships but retrieve by similarity over that representation; none introduces a notion of fact currency\. Critically for our framing, Zeng and others \[[2025](https://arxiv.org/html/2606.26511#bib.bib6)\] re\-evaluate these systems under a bias\-controlled protocol and find their advantages over naive RAG much smaller than originally reported, in some cases reversing, confirming that representational richness alone does not address the failure we target\. MemStrata is orthogonal: it adds temporal validity, evaluated on evolving rather than static corpora\.

Temporal knowledge graphs and bi\-temporal data\. Bi\-temporal modeling, which separates *valid time* \(when a fact is true\) from *transaction time* \(when it is recorded\), is long\-established in databases, formalized in the taxonomy of Snodgrass and Ahn \[[1985](https://arxiv.org/html/2606.26511#bib.bib18)\], developed into practical application design and data management \[Snodgrass, [1999](https://arxiv.org/html/2606.26511#bib.bib8), Jensen and Snodgrass, [1999](https://arxiv.org/html/2606.26511#bib.bib9)\], and later standardized in SQL:2011’s system\-versioned and application\-period tables \[ISO/IEC, [2011](https://arxiv.org/html/2606.26511#bib.bib19)\]\. Temporal knowledge\-graph reasoning, in which triples carry validity intervals, is an active area \[Cai and others, [2024](https://arxiv.org/html/2606.26511#bib.bib1)\]\. MemStrata adapts the bi\-temporal ledger to LLM\-agent memory: facts are retired, not deleted, preserving validity intervals for future as\-of\-time queries\. Our contribution is not the ledger primitive but its integration with deterministic extraction\-time supersession in an LLM memory system, and the empirical demonstration that this resolves a failure RAG cannot\.

Hallucination and verification\. Verification\-augmented RAG adds self\-checking to reduce ungrounded generation; Self\-RAG \[Asai and others, [2023](https://arxiv.org/html/2606.26511#bib.bib11)\] learns reflection tokens that decide when to retrieve and critique generated text\. We include an LLM relevance\-verifier baseline and show that it does not address staleness \(it has no temporal signal\) and costs ${\sim}8\times$ the latency\. The structurally correct mechanism for staleness is temporal and deterministic, not a learned grounding check\.

## 3 The Staleness Problem and Why Similarity Cannot Solve It

Consider an agent answering questions over a store that has accumulated, across sessions, both “the service runs on port 8000” \(recorded earlier\) and “the service runs on port 8080” \(recorded later, after a migration\)\. A query about the port retrieves both: they are near\-identical in embedding space\. The agent must decide which is current\. RAG provides no basis for that decision: retrieval ranks by similarity, and both are about equally similar to the query\.

The tempting fix is to detect, at write time, that the second fact *contradicts* the first, and to update rather than append\. This requires distinguishing three relationships between an incoming fact and a stored one: it duplicates \(restates\) it, it contradicts \(supersedes\) it, or it is novel\. If similarity could separate duplicate from contradiction, a threshold rule would suffice\.

It cannot\. Section [5\.1](https://arxiv.org/html/2606.26511#S5.SS1) reports the calibration: contradictions are on average *more* cosine\-similar to the original than duplicates are, because a value\-flip \(“8000” $\rightarrow$ “8080”\) is a smaller edit than a genuine rephrasing of the same fact\. The distributions overlap so heavily that the maximum precision achievable at any threshold is 0\.67, far below what a safe automatic\-update rule requires\. A learned classifier over similarity features does not rescue this in practice \(our v6 and v6\_no\_verify conditions, Section [5](https://arxiv.org/html/2606.26511#S5)\): the gate\-judge’s contradiction calls are unreliable, and in the abstention regime they *leak* stale facts 25–60% of the time\.

The conclusion is that staleness detection must be structural: if an incoming fact and a stored fact share a \(subject, relation\) key but assert different objects, the newer one supersedes the older, independent of how similar their embeddings are\. This is the mechanism MemStrata implements\.

## 4 The MemStrata Architecture

MemStrata is a local memory layer between an agent and its language model\. It maintains a store of facts extracted from interaction, and composes a token\-budgeted context block per query\. We describe the components evaluated here\.

### 4\.1 Write path: deterministic supersession over a surprise gate

Each incoming turn yields a candidate fact\. The write path routes it:

1. Exact\-duplicate short\-circuit\. A normalized text hash drops verbatim repeats at zero cost\.
2. Deterministic assertion path\. If the turn expresses a clean \(subject, relation, object\) triple, where the object is the single mutable value, MemStrata normalizes the \(subject, relation\) key and checks for an active assertion with that key\. If one exists with a *different* object, the new assertion supersedes it: the old row’s validity interval is closed \(valid\_to set, superseded\_by linked\) and the new row opened\. Same object $\rightarrow$ duplicate \(reinforce\)\. No prior key $\rightarrow$ novel \(store\)\. No cosine, no LLM judge\.
3. Text\-gate fallback\. Non\-triple prose falls through to a surprise gate that classifies via similarity plus an LLM judge\. Critically \(see Section [4\.3](https://arxiv.org/html/2606.26511#S4.SS3)\), this fallback retains non\-contradictory near\-duplicates as *distinct* facts; it drops only exact duplicates\.
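The first two routing steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the `Ledger` and `Assertion` names, the lowercase/whitespace normalization, the SHA-256 hash, and the integer logical clock are all hypothetical choices, and the text-gate fallback (step 3) is omitted because it depends on an embedder and an LLM judge.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Assertion:
    subject: str
    relation: str
    obj: str
    source: str                           # original sentence the triple came from
    valid_from: int                       # logical clock tick when the row opened
    valid_to: Optional[int] = None        # set when the row is retired
    superseded_by: Optional[int] = None   # index of the row that replaced it


class Ledger:
    """Hypothetical in-memory store; rows are retired, never deleted."""

    def __init__(self) -> None:
        self.rows: List[Assertion] = []
        self.active: Dict[Tuple[str, str], int] = {}  # (subject, relation) -> row index
        self.seen: Set[str] = set()                   # hashes of normalized source text
        self.clock = 0

    @staticmethod
    def _norm(text: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    def write(self, subject: str, relation: str, obj: str, source: str) -> str:
        # Step 1: exact-duplicate short-circuit on a normalized text hash.
        digest = hashlib.sha256(self._norm(source).encode("utf-8")).hexdigest()
        if digest in self.seen:
            return "exact_duplicate"
        self.seen.add(digest)

        # Step 2: deterministic assertion path keyed on (subject, relation).
        self.clock += 1
        key = (self._norm(subject), self._norm(relation))
        prior = self.active.get(key)
        if prior is not None and self._norm(self.rows[prior].obj) == self._norm(obj):
            return "duplicate"            # same object: reinforce, store nothing new

        new_index = len(self.rows)
        self.rows.append(Assertion(subject, relation, obj, source, valid_from=self.clock))
        self.active[key] = new_index
        if prior is None:
            return "novel"                # no prior key: store

        old = self.rows[prior]            # different object: close the old interval
        old.valid_to = self.clock
        old.superseded_by = new_index
        return "supersede"


ledger = Ledger()
print(ledger.write("service", "runs on port", "8000", "The service runs on port 8000."))
print(ledger.write("service", "runs on port", "8080", "The service runs on port 8080."))
```

Note that the decision never consults an embedding: the only comparison is equality of the normalized key and of the object.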

### 4\.2 Bi\-temporal ledger

Facts are retired, not deleted\. The store records valid\_from, valid\_to, and superseded\_by, so superseded facts remain available for future as\-of\-time queries \(a capability we build on but do not evaluate here; Section [7](https://arxiv.org/html/2606.26511#S7)\)\. Active retrieval surfaces only currently\-valid rows\.
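To make the ledger fields concrete, the sketch below contrasts the active view with an as-of-time view over the same rows. It is illustrative only: the `Row` type, the integer timestamps, and the half-open interval convention `[valid_from, valid_to)` are assumptions, and the as-of query is the capability the paper says it does not evaluate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    text: str
    valid_from: int
    valid_to: Optional[int] = None        # None means the row is still valid
    superseded_by: Optional[int] = None   # index of the replacing row, if any


def active_rows(rows: List[Row]) -> List[Row]:
    """Rows that active retrieval may surface: validity interval still open."""
    return [r for r in rows if r.valid_to is None]


def rows_as_of(rows: List[Row], t: int) -> List[Row]:
    """Rows valid at time t, assuming half-open intervals [valid_from, valid_to)."""
    return [r for r in rows if r.valid_from <= t and (r.valid_to is None or t < r.valid_to)]


rows = [
    Row("The service runs on port 8000.", valid_from=1, valid_to=5, superseded_by=1),
    Row("The service runs on port 8080.", valid_from=5),
]

print([r.text for r in active_rows(rows)])    # only the port-8080 row
print([r.text for r in rows_as_of(rows, 3)])  # only the port-8000 row
```

Because the retired row keeps its interval, the historical answer stays recoverable without ever entering the active context.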

### 4\.3 The “retain, then supersede” design

An early variant of the temporal layer compressed aggressively, merging near\-duplicate facts at write time to bound growth\. The clean evaluation shows this *regresses below RAG on static recall*: merging discards detail needed to answer later questions \(the temporal\_v6\_lossy ablation, Section [5](https://arxiv.org/html/2606.26511#S5), drops to 0\.62 on project\-fact QA and 0\.13 on dialogue recall\)\. The published configuration therefore retains like RAG, storing distinct non\-contradictory facts, and bounds growth *only on the axis that matters*, by superseding contradictions\. This is the design choice that makes the system match RAG on static knowledge while dominating on evolving knowledge\. We report the lossy variant as an ablation precisely because it isolates this decision\.

### 4\.4 Read path

Embed the query, retrieve top\-$k$ by cosine over active facts and assertions, apply the deterministic staleness filter \(drop superseded rows\), and pack the surviving facts\. We pack each surviving assertion’s original source sentence, not the terse reconstructed triple: packing the triple alone degrades the answer model on rich facts, while packing the original recovers accuracy\. No LLM runs on the read path, so retrieval latency sits at the embedding floor \(${\sim}2.1$ s\), versus ${\sim}16$–$18$ s for the LLM\-reranking and LLM\-verification baselines\.
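The read path can be sketched as rank, filter, pack. The example below is a simplified illustration with assumptions called out: the two-dimensional vectors are made-up stand-ins for real embeddings, the row dictionaries and the `build_context` name are hypothetical, and the token budget is approximated by a whitespace word count.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_context(query_vec: Sequence[float], rows: List[Dict],
                  k: int = 2, budget_tokens: int = 50) -> List[str]:
    # 1. Retrieve top-k by cosine similarity to the query embedding.
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)[:k]
    # 2. Deterministic staleness filter: drop rows whose interval is closed.
    live = [r for r in ranked if r["valid_to"] is None]
    # 3. Pack original source sentences (not triples) under the budget.
    packed: List[str] = []
    used = 0
    for r in live:
        cost = len(r["source"].split())   # crude stand-in for a tokenizer
        if used + cost > budget_tokens:
            break
        packed.append(r["source"])
        used += cost
    return packed


# Toy vectors (hypothetical): the stale and current port facts are nearly
# identical, and the stale one even ranks slightly higher for this query.
rows = [
    {"source": "The service runs on port 8000.", "vec": [0.9, 0.10], "valid_to": 5},
    {"source": "The service runs on port 8080.", "vec": [0.9, 0.12], "valid_to": None},
    {"source": "The cache timeout is 3600 seconds.", "vec": [0.1, 0.90], "valid_to": None},
]

print(build_context([1.0, 0.1], rows))
```

In this toy case similarity alone would put the stale sentence first; the filter on `valid_to` is what keeps it out of the context, and no model call is involved.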

### 4\.5 Marker\-free benchmark construction

Evaluating staleness resistance is easy to contaminate\. If a stale fact carries any textual marker \(“\[OUTDATED\]”, “\(legacy\)”, “deprecated”\), a retrieval baseline can disambiguate by *reading the label* rather than by any temporal mechanism, silently inflating its score\. We enforce a strict marker\-free invariant \(by test\): in every evolving benchmark, the stale and current versions of a fact are textually identical except for the changed value, with no old/new/current framing\. The only available signal of currency is ingestion order, which only a temporal mechanism can exploit\. We recommend this invariant for any evaluation of memory under evolution; Section [5](https://arxiv.org/html/2606.26511#S5) shows that removing a marker from an earlier benchmark dropped baseline accuracy by up to 14 points, confirming the contamination is real and measurable\.
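One way such an invariant could be checked is shown below. This is a sketch of the idea, not the unit test shipped with the release: the function name, the `<VALUE>` placeholder, and the marker word list are assumptions, and a word-level check like this can produce false positives (for example, an identifier that legitimately contains the word "new").

```python
import re

# Assumed marker vocabulary; the paper names "[OUTDATED]", "(legacy)",
# "deprecated" and old/new/current framing, the rest is illustrative.
MARKER_WORDS = {"outdated", "legacy", "deprecated", "old", "new", "current", "previous", "formerly"}


def is_marker_free_pair(stale: str, current: str, old_value: str, new_value: str) -> bool:
    """True if the two sentences differ only in the changed value and carry no marker words."""
    if old_value not in stale or new_value not in current:
        return False
    # Identical except for the changed value: both must reduce to the same template.
    if stale.replace(old_value, "<VALUE>") != current.replace(new_value, "<VALUE>"):
        return False
    words = set(re.findall(r"[a-z]+", f"{stale} {current}".lower()))
    return words.isdisjoint(MARKER_WORDS)


print(is_marker_free_pair("The service runs on port 8000.",
                          "The service runs on port 8080.", "8000", "8080"))
print(is_marker_free_pair("[OUTDATED] The service runs on port 8000.",
                          "The service runs on port 8080.", "8000", "8080"))
```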

## 5 Experiments

All experiments are local and deterministic: temperature 0, fixed seeds, no network \(enforced by test\)\. Answer model Qwen2\.5\-Coder\-7B; correctness and fabrication judges Qwen2\.5\-Coder\-3B \(distinct from the answer model and from each other to prevent self\-grading\); embedder nomic\-embed\-text \(768\-d\)\.

Conditions \(8\)\. no\_memory \(floor\), naive\_rag \(cosine top\-$k$\), advanced\_rag \(\+ LLM reranker\), v6\_no\_verify \(surprise gate, no LLM verify\), v6 \(gate \+ LLM relevance verify\), temporal\_v6\_lossy \(deterministic supersession *without* the retain/original\-text fixes; ablation\), temporal\_v6 \(the full method\), v6\+infer \(gate \+ inferability pre\-filter\)\.

Benchmarks \(6\)\. Two static: domain \(50 project\-fact QA\), locomo \(30 multi\-session dialogue questions over 100 turns\)\. Four evolving, marker\-free, 20–30 paired scenarios each: code\_mutation \(function renames\), config\_migration \(configuration value changes\), dependency\_bump \(version upgrades\), api\_evolution \(endpoint/signature restructuring\)\. Evolving scenarios ingest state\-A then state\-B; the question targets the current value\.

Metrics\. Answer accuracy; stale\-fact\-error rate \(fraction of contradiction questions answered with the superseded value\); conditional fabrication \(fabrications per attempted answer, abstentions excluded\); active\-fact count and compression; mean and p95 retrieval latency\. We additionally run a forced\-answer supplement that disables abstention on the RAG conditions, to expose the stale\-commitment that abstention otherwise hides\.
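The two less familiar metrics reduce to simple ratios. The sketch below follows the definitions as stated above, with assumptions made explicit: the record layout is hypothetical, an abstention is represented as `None`, and "answered with the superseded value" is approximated by substring matching, whereas the paper grades answers with judge models.

```python
from typing import Dict, List


def stale_fact_error_rate(records: List[Dict]) -> float:
    """Fraction of contradiction questions answered with the superseded value."""
    if not records:
        return 0.0
    stale = sum(
        1 for r in records
        if r["answer"] is not None
        and r["stale"] in r["answer"]
        and r["current"] not in r["answer"]
    )
    return stale / len(records)


def conditional_fabrication(records: List[Dict]) -> float:
    """Fabrications per attempted answer; abstentions are excluded from the denominator."""
    attempted = [r for r in records if r["answer"] is not None]
    if not attempted:
        return 0.0
    return sum(1 for r in attempted if r["fabricated"]) / len(attempted)


# Hypothetical graded outputs for four questions about the same fact.
records = [
    {"answer": "The port is 8080.", "current": "8080", "stale": "8000", "fabricated": False},
    {"answer": "The port is 8000.", "current": "8080", "stale": "8000", "fabricated": False},
    {"answer": None,                "current": "8080", "stale": "8000", "fabricated": False},
    {"answer": "The port is 9090.", "current": "8080", "stale": "8000", "fabricated": True},
]

print(stale_fact_error_rate(records))
print(conditional_fabrication(records))
```

Keeping abstentions in the stale-error denominator but out of the fabrication denominator is why a system that abstains often can look safe on the first metric, which is the effect the forced-answer supplement is designed to expose.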

### 5\.1 Cosine cannot separate contradictions from duplicates

On 98 labeled pairs \(32 duplicate, 22 merge, 22 contradict, 22 novel\), cosine AUROC for separating duplicates from the rest is 0\.5926\. Per\-class mean cosine:

Table 1: Per\-class cosine similarity to the original fact\.

Contradictions \(0\.812\) are more cosine\-similar to the original than duplicates \(0\.800\)\. The maximum precision achievable at any duplicate threshold is 0\.667; the 0\.95 floor a safe automatic rule would need is unreachable\. No similarity threshold can separate these classes, which is the empirical foundation for a deterministic, structural supersession rule\.
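For readers who want to reproduce this kind of calibration on their own pairs, the two quantities can be computed without any library. The sketch below uses made-up similarity scores, not the paper's 98 pairs, and the function names are hypothetical; AUROC is computed as the probability that a duplicate outscores a non-duplicate (ties counted as half), and the precision sweep treats every observed score as a candidate "duplicate if similarity ≥ threshold" rule.

```python
from typing import List


def auroc(pos: List[float], neg: List[float]) -> float:
    """P(score of a positive > score of a negative), ties counted as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def max_precision(pos: List[float], neg: List[float]) -> float:
    """Best precision over all thresholds of the rule 'positive if score >= t'."""
    best = 0.0
    for t in sorted(set(pos + neg)):
        tp = sum(p >= t for p in pos)
        fp = sum(n >= t for n in neg)
        if tp + fp:
            best = max(best, tp / (tp + fp))
    return best


# Illustrative scores only: duplicates (positives) interleave with contradictions.
duplicate_scores = [0.78, 0.80, 0.83, 0.90]
contradiction_scores = [0.79, 0.81, 0.84, 0.91]

print(auroc(duplicate_scores, contradiction_scores))
print(max_precision(duplicate_scores, contradiction_scores))
```

When the two score distributions interleave like this, no threshold yields a high-precision duplicate rule, which is the shape of the result reported above.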

### 5\.2 Accuracy: ties RAG on static, dominates on evolving

Table 2: Answer accuracy\. temporal\_v6 ties RAG on static and dominates on evolving\.

Two reads\. On the static benchmarks, temporal\_v6 ties RAG \(0\.82/0\.30 vs 0\.86/0\.30\): the retain\-like\-RAG design preserves recall, while the lossy ablation collapses \(0\.62/0\.13\), isolating the cost of aggressive compression\. On the evolving benchmarks, temporal\_v6 reaches 0\.95–1\.00 versus RAG’s 0\.20–0\.47: a 2–5$\times$ accuracy improvement on the task class the method targets\. The LLM\-verifier condition \(v6\) is inconsistent and never reaches the temporal layer’s accuracy, at ${\sim}8\times$ the latency\.

### 5\.3 Stale\-fact error: the structural result

The headline\. Fraction of contradiction questions answered with the *superseded* value, in both the abstention\-allowed and forced\-answer regimes:

Table 3: Stale\-fact\-error rate, abstention\-allowed / forced\-answer\.

Allowed to abstain, RAG *hides* its failure by refusing to answer \(which is why its accuracy is low\)\. Forced to answer, it serves the stale value 15–40% of the time\. \(Dependency bumps sit at the low end, 15%, because “higher version number is newer” is a lucky surface heuristic: the model is guessing from the string, not reasoning about currency\.\) MemStrata reaches ${\sim}0\%$ in *both* regimes, because the stale value is retired from the active set before retrieval\. This is the error RAG cannot avoid by construction: it retrieves both values and has no mechanism to choose\. The surprise\-gate conditions \(v6\_no\_verify, v6\) are *worse* than RAG in the abstention regime: they answer but leak stale facts 25–60% of the time, consistent with Section [5\.1](https://arxiv.org/html/2606.26511#S5.SS1)’s finding that similarity\-based supersession is unreliable\. Only deterministic supersession reaches ${\sim}0$\.

### 5\.4 State\-bounded growth

RAG is history\-bounded: it stores every turn, growing without limit\. MemStrata retains distinct static facts \(0% compression on domain/locomo, which is *why* it ties RAG on recall\) but caps growth on evolving facts via supersession \(${\sim}48\%$ compression: code 48%, config 47\.5%, dependency 50%, API 47\.5%\)\. The accurate framing is not “smaller memory” but “stale facts retired”: growth is bounded on the axis where unbounded growth is pathological \(accumulating contradictory versions\), and unbounded where it should be \(distinct facts\)\.

### 5\.5 Latency

temporal\_v6, naive\_rag, and v6\_no\_verify all sit at ${\sim}2.1$ s \(no LLM on the read path\)\. advanced\_rag, v6, and v6\+infer sit at ${\sim}16$–$18$ s \(LLM rerank/verify\)\. The win is accuracy and temporal validity at RAG latency: temporal\_v6 matches naive RAG’s speed while eliminating the stale\-fact errors naive RAG cannot\. The LLM\-based conditions pay $8\times$ the latency for no temporal benefit\. The per\-benchmark figures are split by regime in Tables [4](https://arxiv.org/html/2606.26511#A1.T4)–[5](https://arxiv.org/html/2606.26511#A1.T5), which show the latency is regime\-independent: temporal\_v6 holds ${\sim}2.1$ s on both static and evolving\.

## 6 Discussion

The results delineate a clean contribution and a clean boundary\. RAG is the right tool for static knowledge and remains competitive there; MemStrata matches it\. For *evolving* knowledge, RAG has a structural failure: it cannot maintain temporal validity, serving stale facts 15–40% of the time when forced to commit\. MemStrata eliminates that failure at the same retrieval latency, because the mechanism is a deterministic supersession rule rather than a similarity threshold or an LLM call\.

The negative results are as informative as the positive ones\. The lossy ablation shows that compression\-for\-its\-own\-sake costs accuracy on static recall: bounded growth is a *consequence* of retiring stale facts, not a goal to pursue by merging distinct ones\. The complementary full\-retention ablation closes the other direction: removing supersession entirely \(retaining every turn RAG\-style\) collapses mean accuracy across the four evolving benchmarks from 0\.99 to 0\.33, statistically indistinguishable from naive RAG \(0\.32\), and re\-raises stale\-fact error from zero \(Appendix [D](https://arxiv.org/html/2606.26511#A4), D\.1b\)\. The two ablations thus bracket the method \(over\-merging forfeits static recall, no\-supersession forfeits temporal validity\), isolating deterministic supersession as the single cause of the evolving\-knowledge result\. The LLM\-verifier condition shows that a learned grounding check does not address staleness \(it has no temporal signal\) and is not worth its latency\. The surprise\-gate conditions show that similarity\-based supersession actively leaks stale facts, confirming Section [5\.1](https://arxiv.org/html/2606.26511#S5.SS1)’s impossibility result in the end\-to\-end system\. Each of these is a path a reasonable designer might have taken; the data closes them\.

We also note, against a backdrop of recent work showing many retrieval gains shrink under honest evaluation, that our marker\-free protocol matters: an earlier benchmark with a textual staleness marker inflated baseline accuracy by up to 14 points\. Evaluations of memory under evolution that do not enforce marker\-freeness may be measuring a model’s ability to read a label rather than a system’s ability to track currency\.

## 7 Limitations

We state these plainly; they scope the claim and set up follow\-on work\.

- Structured, single\-value benchmarks\. Our evolving benchmarks are marker\-free templates with a single mutable value per fact, so the triple extractor keys reliably \(${\sim}97\%$ supersession\)\. On a messier natural\-language contradiction benchmark, extraction drops to ${\sim}44\%$ \(multi\-value sentences, malformed perturbations\); we quarantined that benchmark as a flawed ruler rather than report it as a result\. Extraction quality, not the supersession mechanism, is the gating factor for unstructured contradictions, and is the explicit subject of follow\-on work \(entity canonicalization, relation typing, multi\-value extraction\)\. The temporal *ledger* and the supersession *rule* are domain\-independent; the extraction layer must travel with them\.
- Ingestion order proxies time\. The benchmarks use order \(state\-A then state\-B\) as the currency signal\. Real temporal benchmarks carry explicit dates; the bi\-temporal ledger already stores validity intervals, and threading real valid\_from timestamps plus an “as\-of\-T” retrieval mode is future work, not new storage\.
- Single\-judge noise\. The 3B correctness judge occasionally scores a gate\-condition answer “correct” while it contains the stale value, producing a few rows where accuracy and stale\-error overlap\. The temporal layer’s ${\sim}0$ stale\-error is unaffected \(no stale fact is in its context\)\.
- Scale and models\. All results use a single 7B answer model on consumer hardware\. Larger models or cloud inference may shift baselines; we constrain to local\-first deliberately\. Benchmark sizes \(tens of items each\) isolate mechanisms rather than rank systems on a leaderboard; scaling to real\-world longitudinal data is the subject of follow\-on work\.

## 8 Conclusion

For agents over evolving knowledge, codebases first among them, the binding memory failure is not recall but currency: retrieval\-augmented generation has no model of time and cannot tell a stale fact from a current one, because the two are more embedding\-similar than genuine duplicates are\. MemStrata maintains temporal validity through a deterministic \(subject, relation, object\) supersession rule over a bi\-temporal ledger: it stores like RAG, preserving static recall, and retires contradicted facts before retrieval\. The result is parity with RAG on static knowledge, 0\.95–1\.00 accuracy on four evolving benchmarks where RAG reaches 0\.20–0\.47, and, as the structural contribution, a stale\-fact\-error rate of ${\sim}0\%$ where RAG serves the superseded value 15–40% of the time, all at retrieval latency with no language model on the read path\. We release the harness, datasets, and protocol, and recommend the marker\-free invariant for evaluating memory under evolution\. Coding is the wedge; the architecture is a general temporal\-context memory, and extending it to time\-stamped world knowledge is the natural next step\.

## Reproducibility Statement

All experiments are deterministic \(temperature 0, fixed seeds, no network, enforced by test\)\. We release the evaluation harness, extraction and gate prompts \(with content hashes\), all six benchmark datasets \(with hashes\), the calibration dataset, and per\-run logs\. The marker\-free invariant is enforced by unit tests included in the release\. Source data: REPORT\_PAPER1\.md \(8$\times$6 main matrix\), REPORT\_PAPER1\_forced\.md \(forced\-answer supplement\), calibration/REPORT\_synthetic\.md \(cosine calibration\)\.

## References

- A. Asai et al. (2023) Self-RAG: learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511. ICLR 2024. [Link](https://arxiv.org/abs/2310.11511)
- L. Cai et al. (2024) A survey on temporal knowledge graph: representation learning and applications. arXiv preprint arXiv:2403.04782. [Link](https://arxiv.org/abs/2403.04782)
- P. Chhikara et al. (2025) Mem0: building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. [Link](https://arxiv.org/abs/2504.19413)
- D. Edge et al. (2024) From local to global: a graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. [Link](https://arxiv.org/abs/2404.16130)
- Z. Guo et al. (2024) LightRAG: simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779. [Link](https://arxiv.org/abs/2410.05779)
- X. Han et al. (2025) Retrieval-augmented generation with graphs (GraphRAG). arXiv preprint arXiv:2501.00309. [Link](https://arxiv.org/abs/2501.00309)
- ISO/IEC (2011) ISO/IEC 9075:2011, Information technology - Database languages - SQL (SQL:2011): system-versioned and application-period (bi-temporal) tables. International Standard.
- C. S. Jensen and R. T. Snodgrass (1999) Temporal data management. IEEE Transactions on Knowledge and Data Engineering 11(1), pp. 36–44. [Document](https://dx.doi.org/10.1109/69.755615)
- C. E. Jimenez et al. (2023) SWE-bench: can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770. ICLR 2024. [Link](https://arxiv.org/abs/2310.06770)
- P. Lewis et al. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS). [Link](https://arxiv.org/abs/2005.11401)
- Z. Luo et al. (2025) HyperGraphRAG: retrieval-augmented generation via hypergraph-structured knowledge representation. arXiv preprint arXiv:2503.21322. [Link](https://arxiv.org/abs/2503.21322)
- A. Maharana et al. (2024) Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753. [Link](https://arxiv.org/abs/2402.17753)
- C. Packer et al. (2023) MemGPT: towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. [Link](https://arxiv.org/abs/2310.08560)
- J. S. Park et al. (2023) Generative agents: interactive simulacra of human behavior. In ACM Symposium on User Interface Software and Technology (UIST). [Link](https://arxiv.org/abs/2304.03442)
- R. Snodgrass and I. Ahn (1985) A taxonomy of time in databases. In ACM SIGMOD International Conference on Management of Data. [Document](https://dx.doi.org/10.1145/318898.318921)
- R. T. Snodgrass (1999) Developing time-oriented database applications in SQL. Morgan Kaufmann. [Link](https://www.cs.arizona.edu/people/rts/tdbbook.pdf)
- Y. Xu et al. (2025) NodeRAG: structuring graph-based RAG with heterogeneous nodes. arXiv preprint arXiv:2504.11544. [Link](https://arxiv.org/abs/2504.11544)
- Y. Zeng et al. (2025) How significant are the real performance gains? An unbiased evaluation framework for GraphRAG. arXiv preprint arXiv:2506.06331. [Link](https://arxiv.org/abs/2506.06331)

## Appendix A Full result tables

*All tables in A.1–A.3 are reproduced verbatim from the committed source reports by `eval/build_appendices.py` (no hand-transcription).*

### A.1 Main matrix: 8 conditions × 6 benchmarks (`REPORT_PAPER1.md`)

#### Answer accuracy\.

#### Fabricated\-context rate \(raw\)\.

#### Conditional fabrication rate \(per attempted answer\)\.

#### Attempted\-answer count\.

#### Memory size \(active facts\)\.

#### Mean pack tokens\.

#### Stale\-fact\-error rate \(of contradiction questions\)\.

Fraction of contradiction questions answered with the SUPERSEDED value. RAG retrieves both the stale and current value and cannot tell which is current; deterministic supersession removes the stale one, driving this to ~0. This is the error RAG cannot avoid by construction.
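
As a minimal sketch of the metric defined above, the rate can be computed from graded contradiction questions. The record layout and the substring matching are assumptions made for illustration only; the released harness grades answers with separate judge models, not string matching.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ContradictionResult:
    """One graded contradiction question (hypothetical record layout)."""
    answer: str      # what the agent returned ("" if it abstained)
    current: str     # state-B (gold) value
    superseded: str  # state-A (stale) value


def stale_fact_error_rate(results: Sequence[ContradictionResult]) -> float:
    """Fraction of contradiction questions answered with the superseded value.

    Assumption: an answer counts as stale when it contains the superseded
    value and does not contain the current one (simple substring matching).
    """
    if not results:
        return 0.0
    stale = sum(
        1
        for r in results
        if r.superseded in r.answer and r.current not in r.answer
    )
    return stale / len(results)


# Illustrative inputs only; these are not results from the paper.
results = [
    ContradictionResult("Use get_user_by_id(uid).", "fetch_user", "get_user_by_id"),
    ContradictionResult("fetch_user", "fetch_user", "get_user_by_id"),
    ContradictionResult("", "3600", "1800"),  # abstention is not a stale answer
    ContradictionResult("3600", "3600", "1800"),
]
rate = stale_fact_error_rate(results)
```

The denominator is all contradiction questions, matching the heading "of contradiction questions", so abstentions lower accuracy without counting as stale-fact errors.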

#### Memory compression \(% of turns absorbed\)\.

Derived: $100\,(1 - \text{active\_facts}/\text{turns\_ingested})$. Higher = more bounded growth. Naive/advanced RAG are at ~0% (one fact per turn); the gate and temporal layer compress.
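
The derivation above is a one-line computation. The sketch below assumes the two counts are available as integers; the function name is hypothetical and not part of the released harness.

```python
def memory_compression(active_facts: int, turns_ingested: int) -> float:
    """Percentage of ingested turns absorbed: 100 * (1 - active_facts / turns_ingested)."""
    if turns_ingested <= 0:
        raise ValueError("turns_ingested must be positive")
    return 100.0 * (1.0 - active_facts / turns_ingested)


# One stored fact per turn (the naive RAG case) gives 0% compression.
baseline = memory_compression(active_facts=100, turns_ingested=100)
# Illustrative numbers only: half of the turns absorbed gives 50%.
halved = memory_compression(active_facts=50, turns_ingested=100)
```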

#### Retrieval latency, split by regime\.

The two 8×6 latency matrices (mean and p95) are split into a *static* table and an *evolving* table, each carrying both metrics, to make the regime-independence of `temporal_v6`'s latency visually explicit: it holds ~2.1 s on both static and evolving knowledge, while the LLM-rerank/verify conditions cost ~16–24 s throughout.

Table 4: Retrieval latency (ms), *static* benchmarks.

Table 5: Retrieval latency (ms), *evolving* benchmarks.

### A.2 Forced-answer supplement: no-abstention, 3 conditions × 4 evolving benchmarks (`REPORT_PAPER1_forced.md`)

#### Answer accuracy\.

#### Stale\-fact\-error rate \(of contradiction questions\)\.

#### Conditional fabrication rate \(per attempted answer\)\.

#### Mean retrieval latency \(ms\)\.

### A.3 Cosine calibration: grader-independent (`calibration/REPORT_synthetic.md`)

#### Cosine distribution by label\.

#### $\tau_{\text{dup}}$ sweep (DUPLICATE auto-accept).

| $\tau_{\text{dup}}$ | n_predicted | precision | recall |
|---|---|---|---|
| 0.80 | 47 | 0.2766 | 0.4062 |
| 0.81 | 46 | 0.2609 | 0.3750 |
| 0.82 | 45 | 0.2667 | 0.3750 |
| 0.83 | 43 | 0.2558 | 0.3438 |
| 0.84 | 41 | 0.2439 | 0.3125 |
| 0.85 | 41 | 0.2439 | 0.3125 |
| 0.86 | 40 | 0.2500 | 0.3125 |
| 0.87 | 40 | 0.2500 | 0.3125 |
| 0.88 | 39 | 0.2564 | 0.3125 |
| 0.89 | 36 | 0.2778 | 0.3125 |
| 0.90 | 35 | 0.2857 | 0.3125 |
| 0.91 | 34 | 0.2941 | 0.3125 |
| 0.92 | 32 | 0.3125 | 0.3125 |
| 0.93 | 28 | 0.3571 | 0.3125 |
| 0.94 | 25 | 0.4000 | 0.3125 |
| 0.95 | 23 | 0.4348 | 0.3125 |
| 0.96 | 21 | 0.4762 | 0.3125 |
| 0.97 | 17 | 0.5882 | 0.3125 |
| 0.98 | 15 | 0.6667 | 0.3125 (← rec) |

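
A sweep of this shape can be sketched as follows. The pair layout, the function name, and the `>=` acceptance rule are assumptions; the data is synthetic and is not the calibration set.

```python
from typing import List, Sequence, Tuple


def sweep_tau_dup(
    pairs: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[Tuple[float, int, float, float]]:
    """For each threshold return (tau, n_predicted, precision, recall).

    `pairs` holds (cosine, is_duplicate) for labelled fact pairs.
    Assumption: a pair is auto-accepted as DUPLICATE when cosine >= tau.
    """
    total_dup = sum(1 for _, is_dup in pairs if is_dup)
    rows = []
    for tau in thresholds:
        predicted = [is_dup for cosine, is_dup in pairs if cosine >= tau]
        true_pos = sum(predicted)
        precision = true_pos / len(predicted) if predicted else 0.0
        recall = true_pos / total_dup if total_dup else 0.0
        rows.append((tau, len(predicted), precision, recall))
    return rows


# Synthetic pairs: note the high-cosine non-duplicate, which is the failure
# mode the calibration measures (contradictions look like duplicates).
pairs = [(0.99, True), (0.97, False), (0.90, True), (0.85, False), (0.60, False)]
rows = sweep_tau_dup(pairs, thresholds=[0.80, 0.95])
```

On this toy input, raising the threshold from 0.80 to 0.95 halves recall without improving precision, because a non-duplicate sits above both thresholds.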
#### $\tau_{\text{novel}}$ sweep (skip-judge floor).

| $\tau_{\text{novel}}$ | n_below | false_novel_rate | near_band_share |
|---|---|---|---|
| 0.50 | 18 | 0.0132 | 0.6633 |
| 0.51 | 18 | 0.0132 | 0.6633 |
| 0.52 | 19 | 0.0132 | 0.6531 |
| 0.53 | 19 | 0.0132 | 0.6531 |
| 0.54 | 21 | 0.0263 | 0.6327 |
| 0.55 | 21 | 0.0263 | 0.6327 |
| 0.56 | 21 | 0.0263 | 0.6327 |
| 0.57 | 22 | 0.0263 | 0.6224 |
| 0.58 | 23 | 0.0395 | 0.6122 |
| 0.59 | 23 | 0.0395 | 0.6122 (← rec) |
| 0.60 | 26 | 0.0658 | 0.5816 |
| 0.61 | 26 | 0.0658 | 0.5816 |
| 0.62 | 27 | 0.0658 | 0.5714 |
| 0.63 | 28 | 0.0789 | 0.5612 |
| 0.64 | 29 | 0.0921 | 0.5510 |
| 0.65 | 31 | 0.1184 | 0.5306 |
| 0.66 | 32 | 0.1316 | 0.5204 |
| 0.67 | 33 | 0.1447 | 0.5102 |
| 0.68 | 34 | 0.1579 | 0.5000 |
| 0.69 | 34 | 0.1579 | 0.5000 |
| 0.70 | 37 | 0.1974 | 0.4694 |
| 0.71 | 38 | 0.2105 | 0.4592 |
| 0.72 | 38 | 0.2105 | 0.4592 |
| 0.73 | 39 | 0.2237 | 0.4490 |
| 0.74 | 40 | 0.2368 | 0.4388 |
| 0.75 | 41 | 0.2500 | 0.4286 |
| 0.76 | 43 | 0.2763 | 0.4082 |
| 0.77 | 44 | 0.2895 | 0.3980 |
| 0.78 | 46 | 0.3158 | 0.3776 |
| 0.79 | 51 | 0.3816 | 0.3265 |
| 0.80 | 51 | 0.3816 | 0.3265 |
| 0.81 | 52 | 0.3947 | 0.3163 |
| 0.82 | 53 | 0.4079 | 0.3061 |
| 0.83 | 55 | 0.4342 | 0.2857 |
| 0.84 | 57 | 0.4605 | 0.2653 |
| 0.85 | 57 | 0.4605 | 0.2653 |

## Appendix B Benchmark construction

B.1 Marker-free invariant (enforced by test). In every evolving benchmark a scenario is a *state-A* turn and a *state-B* turn that are textually identical except for the single mutated value, followed by a question whose gold answer is the state-B value. The words *old, new, current, previous, deprecated, legacy, outdated* and synonyms never appear in either turn; the only currency signal is ingestion order. Enforced by `tests/memory/test_evolving_benchmarks.py::test_guard_rejects_staleness_tell` and the word-boundary tell-detector `tests/memory/test_swe_longitudinal.py::test_has_tell_is_word_boundary_aware` / `swe_longitudinal_benchmark.assert_marker_free`.
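
A word-boundary-aware check of this kind might look like the sketch below. It is not the released `assert_marker_free`; the function names are hypothetical, the tell list contains only the words named in B.1 (the released list also covers synonyms), and the one-token comparison is a simplification that assumes the mutated value contains no whitespace.

```python
import re

# Assumption: only the words listed in B.1; synonyms are omitted here.
TELLS = ("old", "new", "current", "previous", "deprecated", "legacy", "outdated")
_TELL_RE = re.compile(r"\b(?:" + "|".join(TELLS) + r")\b", re.IGNORECASE)


def contains_tell(text: str) -> bool:
    """True if a staleness word appears as a whole word.

    Underscores count as word characters, so identifiers such as
    `new_user` and words such as `renewal` do not match.
    """
    return _TELL_RE.search(text) is not None


def differs_in_one_token(state_a: str, state_b: str) -> bool:
    """True if the two turns differ in exactly one whitespace-separated token."""
    a, b = state_a.split(), state_b.split()
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def check_marker_free(state_a: str, state_b: str) -> None:
    """Raise ValueError if a scenario violates the marker-free invariant."""
    for turn in (state_a, state_b):
        if contains_tell(turn):
            raise ValueError(f"staleness tell in turn: {turn!r}")
    if not differs_in_one_token(state_a, state_b):
        raise ValueError("turns must differ in exactly one value")


check_marker_free(
    "The user list endpoint is GET /api/v1/users.",
    "The user list endpoint is GET /api/v2/users.",
)
```

Matching on word boundaries rather than substrings matters for code-centric text, where identifiers routinely embed words like `new` or `old`.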

B.2 Why marker-freeness is necessary. Removing an explicit `[OUTDATED]` marker from an earlier contradiction benchmark dropped reranker-RAG accuracy by 14 points and a gate-only baseline by 18 points, while the temporal method moved only −4: the marker was a confound that baselines read off the text. We treat marker-freeness as a correctness property of the evaluation.

B.3 `code_mutation` (function renames / endpoint moves / config / imports / deps). Total scenarios: 30. First 3:

state-A: The function get_user_by_id(uid) looks up a user record by primary key in the users table.

state-B: The function fetch_user(uid) looks up a user record by primary key in the users table. (identical except the value)

question: What function looks up a user record by primary key?

gold: fetch_user

state-A: The function load_config() returns the parsed configuration as a dict from settings.yaml.

state-B: The function read_settings() returns the parsed configuration as a dict from settings.yaml.

question: What function returns the parsed configuration from settings.yaml?

gold: read_settings

state-A: The function process_payment(amount) charges the customer's card via the Stripe API.

state-B: The function charge_card(amount) charges the customer's card via the Stripe API.

question: What function charges the customer's card via the Stripe API?

gold: charge_card

B.4 `config_migration` (configuration value changes). Total scenarios: 20. First 3:

state-A: The SESSION_TIMEOUT setting is 1800 seconds in config.py.

state-B: The SESSION_TIMEOUT setting is 3600 seconds in config.py.

question: What is the SESSION_TIMEOUT in seconds in config.py?

gold: 3600

state-A: The MAX_UPLOAD_SIZE is set to 10485760 bytes in settings.py.

state-B: The MAX_UPLOAD_SIZE is set to 52428800 bytes in settings.py.

question: What is the MAX_UPLOAD_SIZE in bytes in settings.py?

gold: 52428800

state-A: The database connection pool size is 10 in the production config.

state-B: The database connection pool size is 25 in the production config.

question: What is the database connection pool size in the production config?

gold: 25

B.5 `dependency_bump` (version upgrades). Total scenarios: 20. First 3:

state-A: The project pins numpy==1.24.0 in requirements.txt.

state-B: The project pins numpy==1.26.4 in requirements.txt.

question: What version of numpy does the project pin in requirements.txt?

gold: 1.26.4

state-A: The project pins django==4.1.7 in requirements.txt.

state-B: The project pins django==5.0.2 in requirements.txt.

question: What version of django does the project pin in requirements.txt?

gold: 5.0.2

state-A: The setup.py sets python_requires to >=3.8.

state-B: The setup.py sets python_requires to >=3.11.

question: What python_requires minimum does setup.py set?

gold: 3.11

B.6 `api_evolution` (endpoint / parameter / signature restructuring). Total scenarios: 20. First 3:

state-A: The user list endpoint is GET /api/v1/users.

state-B: The user list endpoint is GET /api/v2/users.

question: What is the path of the user list endpoint?

gold: /api/v2/users

state-A: The search endpoint filters by the parameter named 'category'.

state-B: The search endpoint filters by the parameter named 'tag'.

question: What parameter does the search endpoint filter by?

gold: tag

state-A: The auth login route is POST /auth/login.

state-B: The auth login route is POST /auth/sessions.

question: What is the path of the auth login route?

gold: /auth/sessions

B.7 Static benchmarks. `domain`: 50 project-fact QA questions over real project facts (ports, model names, config flags, thresholds, decision records) with no contradictions; measures recall preservation. `locomo`: 30 questions over a 100-turn multi-session dialogue (capped LoCoMo sample) with no contradictions; measures detail recall under compression pressure.

B.8 Quarantined benchmark (the flawed ruler). `fair_contradiction` embeds the mutated fact in free prose. The triple extractor keys reliably on only ~44% of scenarios (multi-value sentences and `-alt` string perturbations defeat single-slot extraction; measured by `eval/diag_extract_probe.py`: 97% clean supersession on `code_mutation` vs 44% here), so it measures *extraction quality*, not the supersession mechanism. On it, `temporal_v6` scores 0.62 vs `advanced_rag` 0.74 (it never engages on the 56% of pairs it cannot extract). We exclude it from the main results and report it here as the honest boundary of the method and the motivation for Section [7](https://arxiv.org/html/2606.26511#S7)'s extraction-robustness work.

## Appendix C Prompts and content hashes

All read-path-relevant prompts, with full SHA-256. Answer model, both judges, and the verifier are distinct assignments (answer ≠ correctness-judge ≠ fabrication-judge ≠ verifier) to preclude self-grading; all calls run at temperature 0, fixed seed.

C.1 Deterministic triple extractor (the supersession key source). `extract_triple_v1.md`.

\-\-\-

prompt: extract_triple

version: 1

\-\-\-

You convert ONE factual statement about a codebase into a (subject, relation, object) triple, IF and only if it states a single concrete value that could change as the code evolves (a function name, API endpoint, config value, port, version, import path, identifier).

Return ONLY a JSON object - no prose, no markdown fences:

{"is_triple": true,

"subject": "<the stable thing the value belongs to - MUST NOT contain the value\>",

"relation": "<a short linking phrase, e.g. is / is named / is set to / is imported from\>",

"object": "<the one concrete value that would change if the code evolved\>"}

or, when the statement is not a single value-bearing fact (a preference, a decision, prose, or it carries several values):

{"is_triple": false}

CRITICAL RULES:

- The `object` is the ONE value that would differ between an old and a new version of this fact (the name/number/version/path/endpoint).
- The `subject` describes WHAT that value belongs to and MUST NOT contain the object value. Two statements that differ only in the value must produce the SAME subject and relation, so the system can detect the change.
- Rephrase the subject as "the <thing\> that <does what\>" when the changing value is an identifier (a function/endpoint/module name).
- Keep subject and relation deterministic and minimal. Do not add commentary.

EXAMPLES:

Input: The function get_user_by_id(uid) looks up a user record by primary key in the users table.

Output: {"is_triple": true, "subject": "the function that looks up a user record by primary key in the users table", "relation": "is named", "object": "get_user_by_id"}

Input: The SESSION_TIMEOUT setting is 1800 seconds in this project's config.toml.

Output: {"is_triple": true, "subject": "the SESSION_TIMEOUT setting in config.toml", "relation": "is", "object": "1800 seconds"}

Input: We decided to prefer Python over Go for internal tooling.

Output: {"is_triple": false}

STATEMENT:

"""

{statement}

"""

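
The extractor's reply has to be parsed and reduced to a supersession key before it can drive any ledger update. The sketch below shows one way this could be done; the function names and the normalisation (lowercasing, whitespace collapsing) are assumptions, not the released implementation.

```python
import json
from typing import Optional, Tuple

Triple = Tuple[str, str, str]


def parse_triple(raw: str) -> Optional[Triple]:
    """Parse the extractor's JSON reply; return None when no usable triple exists."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("is_triple") is not True:
        return None
    fields = [data.get(k) for k in ("subject", "relation", "object")]
    if not all(isinstance(f, str) and f.strip() for f in fields):
        return None
    subject, relation, obj = (f.strip() for f in fields)
    # Enforce the prompt's rule: the subject must not contain the object value.
    if obj.lower() in subject.lower():
        return None
    return subject, relation, obj


def supersession_key(subject: str, relation: str) -> Tuple[str, str]:
    """Assumed normalisation: lowercase and collapse runs of whitespace."""
    return (" ".join(subject.lower().split()), " ".join(relation.lower().split()))


SUBJECT = "the function that looks up a user record by primary key in the users table"
old = parse_triple(json.dumps(
    {"is_triple": True, "subject": SUBJECT, "relation": "is named", "object": "get_user_by_id"}
))
new = parse_triple(json.dumps(
    {"is_triple": True, "subject": SUBJECT, "relation": "is named", "object": "fetch_user"}
))
# Same key with a different object is the condition under which supersession applies.
same_key = supersession_key(*old[:2]) == supersession_key(*new[:2])
```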
C.2 Multi-value triple extractor (P1.3; flag-gated). `extract_triples_v1.md`.

\-\-\-

prompt: extract_triples

version: 1

\-\-\-

You convert ONE factual statement into a list of (subject, relation, object) triples - ONE triple for EACH concrete value in the statement that could change as the system evolves. A statement may carry one value, several values, or none.

Return ONLY a JSON object - no prose, no markdown fences:

{"triples": [

{"subject": "<the stable thing the value belongs to - MUST NOT contain the value\>",

"relation": "<a short linking phrase\>",

"object": "<one concrete value that would change if the system evolved\>"},

...

]}

When the statement carries no single value-bearing fact, return {"triples": []}.

EXAMPLES:

Input: The harness proxy listens on port 8080 and the admin API listens on port 9090.

Output: {"triples": [{"subject": "the port the harness proxy listens on", "relation": "is", "object": "8080"}, {"subject": "the port the admin API listens on", "relation": "is", "object": "9090"}]}

Input: We decided to prefer Python over Go for internal tooling.

Output: {"triples": []}

STATEMENT:

"""

{statement}

"""

C.3 Surprise-gate judge (text-gate fallback only; never on the assertion path). `gate_judge_v1.md`.

\-\-\-

prompt: gate_judge

version: 1

\-\-\-

You decide how a CANDIDATE memory fact relates to the EXISTING facts it most resembles. Return ONLY a JSON object - no prose, no markdown fences:

{"verdict": "duplicate|merge|contradict|novel",

"reason": "<one short line\>",

"merged_text": "<required only when verdict is 'merge'\>"}

Definitions:

- duplicate - the candidate says the same thing as an existing fact.
- merge - the candidate adds detail; provide merged_text (lossless).
- contradict - the candidate conflicts (world changed / decision reversed); the old fact is superseded and the candidate stored as current truth.
- novel - genuinely new; none of the existing facts cover it.

Choose the single best verdict. When unsure between merge and novel, prefer novel.

CANDIDATE:

{candidate}

TOP EXISTING MATCHES:

{matches}

C.4 LLM relevance verifier (the `v6` baseline; never on `temporal_v6`'s read path). `verify_v1.md`.

\-\-\-

prompt: verify

version: 1

\-\-\-

You are a strict relevance verifier. Given a user QUERY, the project's LOCKED RULES, and a numbered list of CANDIDATE memories, decide for EACH candidate whether it should be shown to the assistant answering the query.

Return ONLY a JSON object - no prose, no markdown fences:

{"verdicts": [

{"n": <candidate number\>,

"verdict": "SUPPORTED|IRRELEVANT|CONFLICTS",

"justification": "<= 15 words, MUST quote a verbatim span from the candidate"}

]}

Rules:

- SUPPORTED - relevant to THIS query and consistent with the locked rules.
- IRRELEVANT - does not help answer this query.
- CONFLICTS - contradicts the locked rules.
- The justification MUST contain a span copied verbatim from the candidate text.
- Judge only relevance and consistency - do not invent facts.

QUERY:

{query}

LOCKED RULES (invariant memory):

{invariant}

CANDIDATES:

{candidates}
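
The prompt's two checkable constraints (a justification of at most 15 words that quotes the candidate verbatim) can be enforced mechanically on the verifier's reply. The following is a sketch under stated assumptions: candidates are numbered from 1, a "span" is taken to mean at least two consecutive words, and both function names are hypothetical.

```python
import json
from typing import List

MAX_JUSTIFICATION_WORDS = 15  # from the prompt: "<= 15 words"


def quotes_candidate(justification: str, candidate: str, min_words: int = 2) -> bool:
    """True if some run of `min_words` consecutive candidate words appears verbatim."""
    words = candidate.split()
    if len(words) < min_words:
        return bool(words) and candidate.strip() in justification
    return any(
        " ".join(words[i:i + min_words]) in justification
        for i in range(len(words) - min_words + 1)
    )


def supported_candidates(raw: str, candidates: List[str]) -> List[int]:
    """Return 1-based numbers of candidates with a well-formed SUPPORTED verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    verdicts = data.get("verdicts", []) if isinstance(data, dict) else []
    kept = []
    for v in verdicts:
        if not isinstance(v, dict):
            continue
        n, text = v.get("n"), v.get("justification")
        if not isinstance(n, int) or not 1 <= n <= len(candidates):
            continue
        if v.get("verdict") != "SUPPORTED" or not isinstance(text, str):
            continue
        if len(text.split()) > MAX_JUSTIFICATION_WORDS:
            continue
        if quotes_candidate(text, candidates[n - 1]):
            kept.append(n)
    return kept


candidates = [
    "The SESSION_TIMEOUT setting is 3600 seconds in config.py.",
    "The project pins numpy==1.26.4 in requirements.txt.",
]
reply = json.dumps({"verdicts": [
    {"n": 1, "verdict": "SUPPORTED",
     "justification": "States 'SESSION_TIMEOUT setting is 3600 seconds' directly."},
    {"n": 2, "verdict": "SUPPORTED",
     "justification": "Mentions a dependency version."},  # no verbatim quote
]})
kept = supported_candidates(reply, candidates)
```

Rejecting verdicts whose justification quotes nothing keeps the verifier's output checkable without a second model call.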

*The correctness-judge and fabrication-judge prompts are inline in `eval/run_matrix.py` (`_correctness_fn`, `_fabrication_fn`); the answer-model prompt (with the abstention / forced-answer variants) is in `_answer_fn`.*

## Appendix D Ablations

D.1 retain-vs-lossy (the State-Bounded Temporal Validity isolation). `temporal_v6` vs `temporal_v6_lossy` (deterministic supersession WITH vs WITHOUT the retain-like-RAG + original-text-packing fixes), from the A.1 accuracy table:

The lossy variant merges non-contradictory near-duplicates at write time and collapses on static recall (0.62/0.13); the full method retains them and ties RAG (0.82/0.30), at parity on the evolving four. Bounded growth is a *consequence* of retiring stale facts, not a goal pursued by merging distinct ones.

D.1b full-retention: the opposite bracket (supersession is the isolated cause). D.1 removes the retain/packing *refinements* but keeps supersession; this ablation removes *supersession itself*. Toggling `retain_all_turns` makes the write path non-lossy: every turn is stored as a distinct fact (only exact byte-duplicates dropped), with no (S,R,O) supersession, while the extractor, multi-value handling, and read path are otherwise unchanged. The ledger degenerates to a retain-everything store, and the read path must choose among co-present stale and current values:

Removing supersession collapses mean evolving accuracy from 0.99 to 0.33, statistically indistinguishable from `naive_rag` (0.32), and re-raises stale-fact error from 0.00 to 0.05–0.25 (the read path now serves the superseded value, which deterministic supersession had retired). It also raises conditional fabrication in *every* benchmark (mean 0.04 → 0.25, ~6×, peaking at 0.56 on `config_migration`) because the model, now seeing the stale and current values side by side and unable to tell which holds, invents an answer. Retain-everything memory is thus not merely less accurate but *less safe*: it manufactures a fabrication source that bounded, supersession-based growth eliminates. The two ablations bracket the design from opposite sides: D.1 (over-merging) forfeits *static* recall; D.1b (no supersession) forfeits *temporal validity* and *safety*; `temporal_v6` sits at the optimum between them. This isolates deterministic supersession (not retention, original-text packing, or bounded growth, which are refinements) as the single mechanism responsible for the evolving-knowledge result. (Same models, temperature 0, seed 0; `retain_all_turns` is an ablation-only flag, default off, write path otherwise frozen.)
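
The difference between the two write paths can be illustrated with a toy ledger. This is a sketch, not the released code: the class and field names are hypothetical, only the `retain_all_turns` flag name comes from the text, and time is collapsed to a single ingestion-order axis rather than the two axes of a bi-temporal ledger.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    text: str                         # original sentence, kept for packing
    recorded_at: int                  # ingestion order
    retired_at: Optional[int] = None  # set when a later fact supersedes this one


@dataclass
class Ledger:
    retain_all_turns: bool = False    # ablation flag; default off
    facts: List[Fact] = field(default_factory=list)
    clock: int = 0

    def active(self) -> List[Fact]:
        """Facts visible to the read path."""
        return [f for f in self.facts if f.retired_at is None]

    def write(self, subject: str, relation: str, obj: str, text: str) -> None:
        self.clock += 1
        # Assumption: an exact duplicate of an active fact is dropped in both modes.
        if any(f.text == text for f in self.active()):
            return
        if not self.retain_all_turns:
            # Deterministic supersession: same (subject, relation), different
            # object retires the earlier fact. No threshold, no model call.
            for f in self.active():
                if (f.subject, f.relation) == (subject, relation) and f.obj != obj:
                    f.retired_at = self.clock
        self.facts.append(Fact(subject, relation, obj, text, recorded_at=self.clock))


def ingest(ledger: Ledger) -> List[str]:
    subject = "the function that looks up a user record by primary key"
    ledger.write(subject, "is named", "get_user_by_id",
                 "The function get_user_by_id(uid) looks up a user record.")
    ledger.write(subject, "is named", "fetch_user",
                 "The function fetch_user(uid) looks up a user record.")
    return [f.obj for f in ledger.active()]


with_supersession = ingest(Ledger())
retain_everything = ingest(Ledger(retain_all_turns=True))
```

With supersession the read path sees only `fetch_user`; with `retain_all_turns` it sees both values and has to choose between them, which is the situation the ablation measures.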

D.2 original-text packing. Validated jointly with D.1 (both fixes ship in `temporal_v6`; the lossy ablation isolates their combined effect on static recall, and `code_mutation` 0.80 → 1.00 reflects packing the original sentence rather than the terse triple). A dedicated single-factor packing cell was not run separately and is marked future work; we do not imply a measurement we did not take.

D.3 LLM-verifier (non-)contribution. `v6` (gate + LLM relevance verify) vs `v6_no_verify` (gate only), from A.1: `v6` ≤ `v6_no_verify` on the static/recall tasks (domain 0.80 vs 0.86; locomo 0.13 vs 0.17) at ~8× latency (≈16–18 s vs ≈2.1 s, A.1 latency tables). The learned relevance check has no temporal signal and is not worth its cost.

D.4 +INFER (non-)contribution. `v6+infer` ≈ `v6` on every benchmark (A.1) after the single-candidate tautology-guard fix; reported for completeness, neutral everywhere.
