Auditing Forgetting in Limited Memory Language Models


Summary

This paper proposes a causal auditing framework to evaluate forgetting in Limited Memory Language Models by varying the database state during inference, discovering that parametric leakage is negligible and post-deletion correctness primarily arises from retrieval artifacts rather than residual parametric memory.

arXiv:2607.00605v1

# Auditing Forgetting in Limited Memory Language Models
Source: [https://arxiv.org/html/2607.00605](https://arxiv.org/html/2607.00605)
###### Abstract

Limited Memory Language Models (LMLMs) externalize factual knowledge to a database to enable deletion-based unlearning without retraining. Existing evaluations measure post-deletion correctness in aggregate and cannot tell whether a deleted fact persists through residual parametric memory, alternative retrieval paths, or near-neighbor retrieval artifacts. We propose a causal auditing framework that holds the model fixed and varies the database state at inference time across three interventions: FULL, DEL-ON, and DEL-OFF. The framework decomposes post-deletion behavior into parametric leakage $L(f)$, retrieval-mediated correctness $R(f)$, and a retrieval artifact rate grounded in the inference-time retrieval trace. We apply it to 12,228 alias-closure deletions across thirteen databases, including four adversarial topologies (Base, Alias, Noise, Collision) we construct in three domains, and six prompt formulations. Parametric leakage is near zero in every variant and every prompt style: the model rarely returns the deleted answer in the absence of retrieval. The residual that does survive lives in the retrieval graph: retrieval-mediated correctness and the retrieval artifact rate match within rounding everywhere, so post-deletion correctness is, in our audit, predominantly reconstituted from near-neighbor retrieval. This residual ranges from 0.7% on the released LMLM database to 13.6% on the most adversarial variant, and prompt formulation does not independently control how much of a deleted fact survives. These results suggest that, for this class of LMLM and deletion procedure, the unlearning boundary is drawn primarily by the database administrator rather than by the model.

Machine Learning, ICML

## 1 Introduction

Modern language models increasingly rely on hybrid architectures that combine parametric knowledge with external memory. Limited Memory Language Models (LMLMs) are a prominent example of this paradigm, explicitly separating linguistic competence encoded in model parameters from factual knowledge stored in an external database (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)). As illustrated in Figure [1](https://arxiv.org/html/2607.00605#S1.F1), an LMLM retains the linguistic competence of a standard language model but routes factual recall through an external database rather than holding it in parameters. This design enables deletion-based unlearning, where removing entries from the database is intended to eliminate access to specific facts without requiring retraining (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)). Such capabilities are particularly important for applications involving data governance, privacy, and model editing.

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/lmlm_figure.png)

Figure 1: Comparison of a standard retrieval-augmented language model (LLM + RAG) and an LMLM. Both architectures pair a parametric model with an external database, but LMLMs are pre-trained to limit the internal storage of factual knowledge, so factual recall is routed through the external store rather than reconstructed from parameters (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)).

However, it remains unclear whether deletion in these systems truly removes knowledge. Existing evaluations of forgetting typically measure whether a model produces the correct answer before and after deletion, but do not distinguish the underlying mechanism of post-deletion correctness (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)). A model may still answer correctly due to residual parametric memory, alternative retrieval paths, or semantically related matches in the external database. As a result, current metrics cannot determine whether knowledge has been successfully externalized or whether it persists internally in the model.

In this work, we propose a causal auditing framework for analyzing forgetting in LMLMs. Our approach introduces three controlled interventions: FULL, where the database is intact and retrieval is enabled; DEL-ON, where relevant entries are deleted while retrieval remains enabled; and DEL-OFF, where the same deletion is applied and retrieval is disabled. By comparing model behavior across these conditions, we isolate the contribution of external retrieval and quantify residual internal knowledge. This enables a decomposition of post-deletion behavior into parametric leakage, retrieval-mediated correctness, and retrieval artifacts.

We implement this framework using facts drawn directly from the LMLM database (Zhao and contributors, [2025](https://arxiv.org/html/2607.00605#bib.bib2)) and from custom databases that we constructed, and apply a verified deletion procedure that removes all canonical and alias-equivalent representations of a fact. Code and custom databases are available at [https://github.com/raeesiarya/LMLMAudit](https://github.com/raeesiarya/LMLMAudit). We further log retrieval traces during inference to attribute model outputs to explicit database evidence. This controlled setup provides a principled method for auditing whether knowledge has been successfully externalized in LMLMs.

## 2 Related Work

Our work sits at the intersection of three lines of research. First, retrieval-augmented language models pair parametric representations with external memory accessed at inference time (Lewis et al., [2020](https://arxiv.org/html/2607.00605#bib.bib3); Guu et al., [2020](https://arxiv.org/html/2607.00605#bib.bib4)), and the LMLM architecture we audit (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)) is a recent extension that pre-trains the model to limit internal storage of factual knowledge so that retrieval becomes the primary factual channel. This line of work shows that external retrieval can improve factual accuracy (Karpukhin et al., [2020](https://arxiv.org/html/2607.00605#bib.bib11)), but leaves open the question of whether knowledge is genuinely externalized or partly retained in parameters (Mallen et al., [2023](https://arxiv.org/html/2607.00605#bib.bib12)).

Second, knowledge-editing methods such as ROME (Meng et al., [2022](https://arxiv.org/html/2607.00605#bib.bib6)) and MEMIT (Meng et al., [2023](https://arxiv.org/html/2607.00605#bib.bib7)) address the same question from the opposite direction: rather than externalizing knowledge, they locate and modify parametric associations in place (Yao et al., [2023](https://arxiv.org/html/2607.00605#bib.bib13)). LMLMs aim for the cleaner separation that ROME and MEMIT bypass, which makes the LMLM setting a natural place to ask whether the separation actually holds.

Third, work on machine unlearning has probed whether specific training data can be removed from a model after the fact (Bourtoule et al., [2021](https://arxiv.org/html/2607.00605#bib.bib14)), and has tended to find that residual traces are difficult to fully eliminate (Lizzo and Heck, [2026](https://arxiv.org/html/2607.00605#bib.bib8); Carlini et al., [2021](https://arxiv.org/html/2607.00605#bib.bib5)). Our audit takes up this question in the LMLM setting, where it becomes tractable in a new way: because the database is the intended factual store, deletion can be applied at inference time and the resulting behavior decomposed into a parametric channel and a retrieval channel. The framework we develop in Section [3](https://arxiv.org/html/2607.00605#S3) is, to our knowledge, the first to make that decomposition per-fact and to attribute surviving correctness to specific retrieval candidates.

## 3 Method Setup: Causal Audit Framework

We evaluate whether Limited Memory Language Models actually forget facts when those facts are removed from the external database. Each fact is represented as a subject–relation–object tuple, such as Geri Halliwell – Famous For – Spice Girls. At inference time, retrieval provides the model with relevant database entries as additional context before the model generates an answer. Thus, when retrieval is enabled, the model is not reading the entire database directly; instead, a retrieval step selects relevant facts and inserts them into the model's input context. Appendix [A.1](https://arxiv.org/html/2607.00605#A1.SS1) gives examples of the prompt formulations used in the experimental grid.

Our audit compares three intervention states. In FULL, the target fact remains in the database and retrieval is enabled. This measures normal database-supported accuracy. In DEL-ON, the target fact is deleted but retrieval remains enabled. This tests whether the answer can still be recovered through alternative database entries, aliases, semantically related facts, or retrieval artifacts. In DEL-OFF, the target fact is deleted and retrieval is disabled. This isolates parametric recall, since the model must answer without retrieved evidence.

We use these interventions to decompose post-deletion correctness into three mechanisms. Parametric leakage occurs when the model answers correctly in DEL-OFF, indicating that the deleted fact may still be stored in the model parameters. Retrieval-mediated correctness occurs when the model is correct in DEL-ON but not in DEL-OFF, indicating that retrieval helped recover the answer after deletion. Retrieval artifacts occur when the model produces the correct answer even though the deleted fact is not directly available as retrieved evidence. Formally, for a fact $f=(s,r,o)$ and intervention condition $c$, let $Y(f,c)$ denote the normalized model prediction. We define parametric leakage as

$$L(f)=\mathbb{I}[Y(f,\text{DEL-OFF})=o],$$

which indicates whether the model can recover the deleted fact without retrieval.

We define retrieval-mediated correctness as

$$R(f)=\mathbb{I}[Y(f,\text{DEL-ON})=o\wedge Y(f,\text{DEL-OFF})\neq o],$$

which captures cases where retrieval enables the correct answer after deletion.

Across a fact set $\mathcal{F}$, the empirical leakage rate is

$$\hat{L}=\frac{1}{|\mathcal{F}|}\sum_{f\in\mathcal{F}}L(f)=\frac{1}{|\mathcal{F}|}\sum_{f\in\mathcal{F}}\mathbb{I}[Y(f,\text{DEL-OFF})=o].$$

Analogously, we estimate retrieval-mediated correctness $\hat{R}$ by averaging $R(f)$ over $\mathcal{F}$.
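The indicator definitions above translate almost directly into code. The following is a minimal sketch, not the authors' implementation: the `AuditRecord` container, the function names, and the toy records are all hypothetical, and predictions are assumed to be normalized already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    """One target fact scored under the two deletion states (hypothetical container)."""
    gold: str          # normalized gold object o
    pred_del_on: str   # normalized prediction Y(f, DEL-ON)
    pred_del_off: str  # normalized prediction Y(f, DEL-OFF)


def leakage(rec: AuditRecord) -> int:
    """L(f): 1 if the deleted answer is returned without retrieval."""
    return int(rec.pred_del_off == rec.gold)


def retrieval_mediated(rec: AuditRecord) -> int:
    """R(f): 1 if correct under DEL-ON but not under DEL-OFF."""
    return int(rec.pred_del_on == rec.gold and rec.pred_del_off != rec.gold)


def empirical_rate(indicator, records) -> float:
    """Average a per-fact indicator over a non-empty fact set."""
    records = list(records)
    if not records:
        raise ValueError("fact set must be non-empty")
    return sum(indicator(r) for r in records) / len(records)


# Toy, made-up records purely for illustration.
records = [
    AuditRecord(gold="a", pred_del_on="a", pred_del_off="a"),  # L=1, R=0
    AuditRecord(gold="b", pred_del_on="b", pred_del_off="x"),  # L=0, R=1
    AuditRecord(gold="c", pred_del_on="y", pred_del_off="z"),  # L=0, R=0
    AuditRecord(gold="d", pred_del_on="d", pred_del_off=""),   # L=0, R=1
]
L_hat = empirical_rate(leakage, records)
R_hat = empirical_rate(retrieval_mediated, records)
```

Because $R(f)$ requires the DEL-OFF prediction to be wrong, the two indicators are mutually exclusive for any single fact, which is what allows them to be stacked in one bar.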

## 4 Experimental Setup

In addition to the released LMLM database (Zhao and contributors, [2025](https://arxiv.org/html/2607.00605#bib.bib2)), we developed databases for countries, politicians, and sports. Within each of these three themes, we built four database variants (Base, Alias, Noise, and Collision), for a total of twelve custom databases. Each of the four variants is designed to stress-test a distinct mechanism by which a deleted fact could remain accessible after canonical removal. Base contains only the canonical $(s,r,o)$ triplet for each fact, leaving retrieval with no alternative path. Alias stores the same fact only under aliased subject and relation forms, with no canonical entry, and tests whether alias-closure deletion catches every surface realization. Noise augments Base with decoy triplets that route to the same object through paraphrased subjects (e.g., *Government of United States → Seat of Government → Washington, D.C.*), probing whether retrieval can recover the deleted answer via near-neighbor paraphrases. Collision augments Base with near-miss triplets that share the subject but route to a different object (e.g., *United States → Largest City → New York City*), probing whether retrieval drifts onto a confusable neighbor and returns a plausible but incorrect answer. Table `tab:custom_dataset_variants` outlines an example of these four variants for the politician domain. We evaluate each target fact under six prompt formulations: direct questions, paraphrased questions, contextual questions, cloze prompts, continuations, and few-shot prompts; examples are provided in Appendix [A.1](https://arxiv.org/html/2607.00605#A1.SS1).

Crossing the six prompt sets, the three intervention states, and the thirteen databases yields a fully crossed evaluation grid in which every target fact is scored on matched inputs across all conditions. For each cell we record exact match, token-level precision, recall, and F1, together with the cross-state quantities $L(f)$, $R(f)$, and the retrieval artifact rate; retrieval traces are logged at every FULL and DEL-ON call so that post-deletion correctness can be attributed to explicit database evidence rather than implicit model behavior. Figure [2](https://arxiv.org/html/2607.00605#S4.F2) outlines this pipeline.
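For readers unfamiliar with the per-cell metrics, the sketch below shows one common way to compute exact match and token-level precision, recall, and F1. The normalization here (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption borrowed from common question-answering practice; the paper does not spell out its normalization, so the function names and rules are illustrative only.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(gold))


def token_prf(pred: str, gold: str) -> tuple[float, float, float]:
    """Token-level (precision, recall, F1) over multiset token overlap."""
    p_tokens = normalize(pred).split()
    g_tokens = normalize(gold).split()
    if not p_tokens or not g_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        score = float(p_tokens == g_tokens)
        return score, score, score
    overlap = sum((Counter(p_tokens) & Counter(g_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(g_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = token_prf("The Spice Girls", "Spice Girls")
```

In this example the prediction carries one extra token, so recall is perfect while precision is two thirds; exact match would score the same pair as 0, which is why the two metrics can diverge on verbose completions.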

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/setup_chart.png)

Figure 2: Overview of the evaluation pipeline. We evaluate six prompt sets using the released LMLM database and specialized databases for countries, politicians, and sports. We evaluate the databases under three inference-time database states: FULL, DEL-ON, and DEL-OFF. This yields a $6\times 3\times 13$ result matrix for each evaluation metric.

For each target fact, we construct the deletion set by enumerating the canonical triplet and all alias-equivalent triplets whose subject, relation, and object match the target under our alias mapping. We then remove this set from the database and verify deletion by checking that no retained triplet is gold-equivalent to the target. The same prompt is evaluated under FULL, DEL-ON, and DEL-OFF, and the generated answer is normalized before scoring against the gold object. During FULL and DEL-ON runs, we also save the retrieved candidates returned to the model. A DEL-ON answer is counted as a retrieval artifact when the normalized answer matches the gold object but none of the retained retrieval candidates is gold-equivalent to the deleted fact.
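The deletion and artifact rules described above can be sketched as follows. This is an illustrative reading, not the released code: the alias mapping is modeled as a plain dictionary from lowercased surface forms to canonical forms, and the helper names and the toy database (including the canonical capital fact) are hypothetical.

```python
Triple = tuple[str, str, str]


def canonical(triple: Triple, alias_map: dict[str, str]) -> Triple:
    """Map each element to its canonical form under the (assumed) alias mapping."""
    return tuple(alias_map.get(x.lower(), x.lower()) for x in triple)


def gold_equivalent(a: Triple, b: Triple, alias_map: dict[str, str]) -> bool:
    """Alias-equivalent on subject, relation, and object."""
    return canonical(a, alias_map) == canonical(b, alias_map)


def alias_closure_delete(db: list[Triple], target: Triple, alias_map: dict[str, str]):
    """Split the database into retained and deleted triplets for one target fact."""
    deleted = [t for t in db if gold_equivalent(t, target, alias_map)]
    retained = [t for t in db if not gold_equivalent(t, target, alias_map)]
    return retained, deleted


def deletion_verified(retained: list[Triple], target: Triple, alias_map: dict[str, str]) -> bool:
    """Verification step: no retained triplet may be gold-equivalent to the target."""
    return not any(gold_equivalent(t, target, alias_map) for t in retained)


def is_retrieval_artifact(answer: str, target: Triple, candidates: list[Triple],
                          alias_map: dict[str, str]) -> bool:
    """Correct DEL-ON answer with no gold-equivalent candidate in the trace."""
    gold_object = canonical(target, alias_map)[2]
    answer_object = alias_map.get(answer.lower(), answer.lower())
    has_gold = any(gold_equivalent(c, target, alias_map) for c in candidates)
    return answer_object == gold_object and not has_gold


# Toy database; the canonical fact and the alias are made up for illustration.
alias_map = {"usa": "united states"}
db = [
    ("United States", "Capital", "Washington, D.C."),
    ("USA", "Capital", "Washington, D.C."),
    ("Government of United States", "Seat of Government", "Washington, D.C."),
    ("United States", "Largest City", "New York City"),
]
target = db[0]
retained, deleted = alias_closure_delete(db, target, alias_map)
```

In this toy setup the paraphrased decoy survives deletion because it is not alias-equivalent to the target, so a correct DEL-ON answer drawn from it would be flagged as a retrieval artifact.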

## 5 Results

Our reference point throughout the results is the aggregate evaluation reported in the original LMLM paper (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)), which uses a single FactScore (Min et al., [2023](https://arxiv.org/html/2607.00605#bib.bib10)) drop when the database is disabled and a single TOFU (Maini et al., [2024](https://arxiv.org/html/2607.00605#bib.bib9)) forget-quality $p$-value per unlearning step. Both quantities collapse the post-deletion behavior of every fact into a single number, so they cannot indicate which channel a surviving correct answer came through. Our framework refines that aggregate signal into per-fact attributions across the three intervention states, and we read each result below against the corresponding aggregate quantity from the original paper.

We run the LMLM under all three interventions on every cell of the prompt × database grid. Across the 78 (prompt file, database) cells, this yields 12,228 paired (DEL-ON, DEL-OFF) evaluations together with an equal number of FULL baselines, for a total of 36,684 model completions. All reported quantities are count-weighted averages over fact-paired groups, so a prompt file with more target facts contributes proportionally to the aggregate. We confine the present section to direct observations and defer cross-figure interpretation to Section [6](https://arxiv.org/html/2607.00605#S6).
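The grid sizes quoted above follow from simple arithmetic, and the count-weighting rule is easy to state in code. The sketch below checks the totals and contrasts a count-weighted rate with an unweighted mean of per-group rates; the group sizes and hit counts are made up for illustration, and `count_weighted_rate` is a hypothetical helper name.

```python
N_PROMPT_SETS = 6
N_DATABASES = 13
N_STATES = 3            # FULL, DEL-ON, DEL-OFF
N_PAIRED = 12_228       # paired (DEL-ON, DEL-OFF) evaluations reported in the text

n_cells = N_PROMPT_SETS * N_DATABASES   # (prompt file, database) cells
n_completions = N_PAIRED * N_STATES     # one completion per fact per state


def count_weighted_rate(groups):
    """groups: iterable of (n_facts, n_hits); pools hits over facts."""
    groups = list(groups)
    total_facts = sum(n for n, _ in groups)
    if total_facts == 0:
        raise ValueError("no facts to average over")
    return sum(h for _, h in groups) / total_facts


# Hypothetical groups: a small one with a high rate and a large one with a low rate.
groups = [(100, 10), (300, 6)]
weighted = count_weighted_rate(groups)                    # pools over 400 facts
unweighted = sum(h / n for n, h in groups) / len(groups)  # treats groups equally
```

With these toy numbers the count-weighted rate is 4% while the unweighted mean of the two group rates is 6%, which illustrates why larger prompt files pull the aggregate toward their own rate.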

We begin with the variant-level view. Figure [3](https://arxiv.org/html/2607.00605#S5.F3) attributes DEL-ON correctness to its three components separately for the four custom variants and the released LMLM database. In particular, the retrieval artifact bar isolates cases in which DEL-ON returns the gold object even though no gold-equivalent triplet appears among the retrieval-trace candidates retained after deletion.

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/media_images_del_on_correctness_attribution_by_variant_0_d2d63336b6708bb88f71.png)

Figure 3: Attribution of DEL-ON correctness by database variant. The stacked left bar in each pair decomposes DEL-ON correctness into parametric leakage $L(f)$ (red, $\hat{L}$) and retrieval-mediated correctness $R(f)$ (green); the orange bar shows the retrieval artifact rate. Parametric leakage is near zero in every variant. The combined $L(f)+R(f)$ stack is 3.7% for Base, 6.2% for Alias, 7.2% for Collision, 13.6% for Noise, and 0.7% for the released LMLM database, with the orange artifact bar at approximately the same height as the green bar in each variant.

We next turn to the prompt-style axis. Figure [4](https://arxiv.org/html/2607.00605#S5.F4) reports the same decomposition averaged over all four custom variants and the released LMLM database, so that variant identity is collapsed and only the effect of prompt formulation remains. As before, parametric leakage stays near zero across the panel, although a small red residue is visible under direct and few-shot prompts and is essentially absent under the others. The combined $L(f)+R(f)$ stack varies from 6.1% on cloze prompts to 9.9% on direct questions, while the orange artifact bar matches the green $R(f)$ bar within rounding in every prompt style.

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/media_images_del_on_correctness_attribution_by_prompt_0_94fe6d546044ea6a467b.png)

Figure 4: Attribution of DEL-ON correctness by prompt style. The left bar in each pair stacks parametric leakage $L(f)$ (red, $\hat{L}$) and retrieval-mediated correctness $R(f)$ (green); the orange bar shows the retrieval artifact rate. Leakage is near zero across prompt styles, with a small residue under direct and few-shot. The $L(f)+R(f)$ stack is 9.9% (direct), 7.9% (contextual), 8.7% (paraphrased), 6.1% (cloze), 8.7% (continuations), and 6.7% (few-shot); the orange bar matches the green within rounding in every style.

Having attributed DEL-ON correctness, we now widen the lens to all three intervention states. Figure [5](https://arxiv.org/html/2607.00605#S5.F5) reports weighted token F1 by prompt style and state, so that the post-deletion residual can be compared against both the FULL baseline and the retrieval-disabled DEL-OFF condition. Under FULL, F1 spans a wide range, from ~17% on few-shot prompts to ~57% on continuations. Under DEL-ON, by contrast, the same prompt styles compress into a narrow band of roughly 7–10%. As a result, the largest FULL-to-DEL-ON drop falls on continuations (about 48 points), while the smallest falls on few-shot prompts (about 10 points). Finally, under DEL-OFF, F1 is visually indistinguishable from zero in every prompt style.

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/media_images_token_f1_by_prompt_and_state_0_32119e79fc3994e72d40.png)

Figure 5: Weighted token F1 by prompt style and intervention state. FULL F1 is ~41% for direct, ~44% for contextual, ~34% for paraphrased, ~26% for cloze, ~57% for continuations, and ~17% for few-shot prompts. DEL-ON F1 sits between ~7% and ~10% across all six prompt styles. DEL-OFF F1 is near zero in every prompt style.

Figure [6](https://arxiv.org/html/2607.00605#S5.F6) mirrors this three-state cut along the database-variant axis and uses exact match in place of token F1, so that the same trajectory can be read off the variants we constructed for the audit. Under FULL, accuracy is similar across Base, Alias, and Noise, all clustered near 33%, and noticeably higher on Collision and the released LMLM database, both near 46%. The corresponding FULL-to-DEL-ON drop is about 30 points on Base, 27 on Alias, 39 on Collision, 19 on Noise, and 45 on the released LMLM database. Under DEL-OFF, exact match remains at or near zero in every variant.

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/media_images_exact_match_by_variant_state_0_c4a9934b281302f6ec37.png)

Figure 6: Weighted exact match by database variant and intervention state. FULL exact match is ~33% on Base, Alias, and Noise, and ~46% on Collision and the released LMLM database. DEL-ON exact match is ~3% on Base, ~6% on Alias, ~7% on Collision, ~14% on Noise, and ~1% on the released LMLM database. DEL-OFF exact match is near zero in every variant.

Tables `tab:standard_metrics_all_databases` and `tab:audit_metrics_all_databases` in Appendix [E](https://arxiv.org/html/2607.00605#A5) report the full per-cell numbers underlying Figures [3](https://arxiv.org/html/2607.00605#S5.F3)–[6](https://arxiv.org/html/2607.00605#S5.F6).
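As a quick consistency check, the FULL-to-DEL-ON drops quoted for Figure 6 can be recomputed from the approximate exact-match values given in its caption. The sketch below uses only those rounded caption values, so the results are approximate to the nearest point; the dictionary and variable names are illustrative.

```python
# Approximate exact-match percentages as stated in the Figure 6 caption.
full_em = {"Base": 33, "Alias": 33, "Collision": 46, "Noise": 33, "LMLM": 46}
del_on_em = {"Base": 3, "Alias": 6, "Collision": 7, "Noise": 14, "LMLM": 1}

# Drop in percentage points when the target fact is deleted but retrieval stays on.
drops = {variant: full_em[variant] - del_on_em[variant] for variant in full_em}

# Variant with the smallest drop, i.e. the largest share of accuracy surviving deletion.
most_resilient = min(drops, key=drops.get)
```

The recomputed drops agree with the values quoted in the text (about 30, 27, 39, 19, and 45 points), and the smallest drop falls on Noise, the variant with the largest DEL-ON residual.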

## 6 Analysis

Sections [6.1](https://arxiv.org/html/2607.00605#S6.SS1) and [6.2](https://arxiv.org/html/2607.00605#S6.SS2) establish the headline decomposition of post-deletion correctness into a parametric component and a retrieval-graph component. Sections [6.3](https://arxiv.org/html/2607.00605#S6.SS3) and [6.4](https://arxiv.org/html/2607.00605#S6.SS4) then ablate the two axes of the audit grid that the framework holds fixed, the retrieval-graph topology and the prompt formulation, to identify which axis controls the surviving residual. Section [6.5](https://arxiv.org/html/2607.00605#S6.SS5) closes the analysis with qualitative examples of the three failure modes the decomposition predicts.

### 6.1 Parametric externalization at the per-fact level

Our headline finding is that parametric leakage $\hat{L}$ is near zero across the audit. The red $L(f)$ component contributes a vanishingly small share of the $L(f)+R(f)$ stack in every variant (Figure [3](https://arxiv.org/html/2607.00605#S5.F3)) and in every prompt style (Figure [4](https://arxiv.org/html/2607.00605#S5.F4)), and weighted DEL-OFF performance is visually indistinguishable from zero in every cell of Figures [5](https://arxiv.org/html/2607.00605#S5.F5) and [6](https://arxiv.org/html/2607.00605#S5.F6). At the level of model parameters, the LMLM checkpoint behaves close to a model that had never seen the fact, in contrast to standard language models, which are known to memorize and expose training-data content even when each example is seen only a handful of times (Carlini et al., [2021](https://arxiv.org/html/2607.00605#bib.bib5)). The original LMLM paper (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)) provides only an aggregate version of this result. Their Table 9 shows that disabling the database globally drops FactScore (Min et al., [2023](https://arxiv.org/html/2607.00605#bib.bib10)) by roughly 19 points, and their §5 explicitly labels that finding as "preliminary evidence" that LMLM does not memorize. Our audit converts the same intuition into a per-fact statement: across 12,228 alias-closure deletions, the parameters return the deleted answer at a rate of $\hat{L}=0.11\%$. Our per-fact result is consistent with, and adds resolution to, the preliminary aggregate finding in the original LMLM paper (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)).
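To give the headline rate a sense of scale, the sketch below asks which integer numbers of leaked facts would be compatible with a reported 0.11% over 12,228 deletions. It assumes, as a simplification, that the rate is pooled over all deletions and rounded to two decimal places; neither assumption is confirmed by the text, so the result is an order-of-magnitude illustration rather than a reported count.

```python
N_DELETIONS = 12_228       # alias-closure deletions reported in the text
REPORTED_PERCENT = 0.11    # reported leakage rate, in percent

# Integer hit counts k whose pooled rate k / N rounds to the reported value
# (assumes two-decimal rounding of a pooled percentage).
compatible_counts = [
    k for k in range(N_DELETIONS + 1)
    if round(100 * k / N_DELETIONS, 2) == REPORTED_PERCENT
]
```

Under these assumptions the compatible counts are 13 and 14, that is, on the order of a dozen leaked facts out of more than twelve thousand deletions.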

### 6.2 The residual lives in the retrieval graph

The residual that survives deletion is, in our audit, almost entirely attributable to the retrieval graph rather than to the parameters. The signature observation in Figures [3](https://arxiv.org/html/2607.00605#S5.F3) and [4](https://arxiv.org/html/2607.00605#S5.F4) is the near-equality of the green $R(f)$ bar and the orange retrieval artifact bar: the two match within rounding for every variant and every prompt style. The following proposition shows that part of this equality is structural, which in turn clarifies what the figures empirically measure.

###### Proposition 6.1 (Artifact and retrieval-mediated correctness coincide under complete deletion).

Fix a fact $f=(s,r,o)$ with $L(f)=0$, and suppose the alias-closure deletion of $f$ is complete, i.e., no triplet alias-equivalent to $(s,r,o)$ survives in the database after deletion. Let $G(f)$ be the indicator that some retained candidate in the inference-time trace is gold-equivalent (alias-equivalent on subject, relation, and object), and let

$$A(f)=\mathbb{I}\bigl[Y(f,\text{DEL-ON})=o\bigr]\,\bigl(1-G(f)\bigr)$$

be the per-fact retrieval artifact indicator. Then $R(f)=A(f)$.

###### Proof.

Since $L(f)=0$, we have $Y(f,\text{DEL-OFF})\neq o$, so $R(f)=\mathbb{I}[Y(f,\text{DEL-ON})=o]$. By completeness of alias-closure deletion, no retained candidate is gold-equivalent, hence $G(f)=0$. Therefore $A(f)=\mathbb{I}[Y(f,\text{DEL-ON})=o]=R(f)$. ∎
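Because every quantity in Proposition 6.1 is a binary indicator, the claim can also be checked by brute force over all truth assignments. The sketch below is such a check, written as an illustration rather than as part of the paper's proof; the function names are hypothetical and the three booleans stand for DEL-ON correctness, DEL-OFF correctness, and $G(f)$.

```python
from itertools import product


def L(off_correct: bool) -> int:
    """Parametric leakage indicator."""
    return int(off_correct)


def R(on_correct: bool, off_correct: bool) -> int:
    """Retrieval-mediated correctness indicator."""
    return int(on_correct and not off_correct)


def A(on_correct: bool, g: bool) -> int:
    """Retrieval artifact indicator: correct under DEL-ON, no gold-equivalent candidate."""
    return int(on_correct) * (1 - int(g))


cases = list(product([False, True], repeat=3))  # (on_correct, off_correct, g)

# Under the premises L(f) = 0 and complete deletion (G(f) = 0), R and A agree.
holds_under_premises = all(
    R(on, off) == A(on, g)
    for on, off, g in cases
    if L(off) == 0 and not g
)

# Dropping completeness (allowing G(f) = 1) breaks the equality in exactly one case.
counterexamples = [
    (on, off, g)
    for on, off, g in cases
    if L(off) == 0 and g and R(on, off) != A(on, g)
]
```

The single counterexample is the case where the DEL-ON answer is correct and a gold-equivalent candidate survives in the trace, which is why an empirical gap between the two rates would signal incomplete deletion.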

Proposition [6.1](https://arxiv.org/html/2607.00605#S6.Thmtheorem1) reframes what the two attribution figures actually measure. Given the per-fact $\hat{L}\approx 0$ established in Section [6.1](https://arxiv.org/html/2607.00605#S6.SS1), the empirical near-equality $\hat{R}\approx\hat{A}$ is a confirmation that our alias-closure deletion procedure is complete in practice; the empirical content of the figures is therefore not the equality itself but the topology-dependent magnitude of the residual. Retrieval-mediated correctness almost never coincides with a gold-equivalent retained candidate. Instead, it coincides with near-neighbor candidates that share embedding-space similarity with the deleted entry. This decomposition is not present in the original LMLM paper. Their TOFU (Maini et al., [2024](https://arxiv.org/html/2607.00605#bib.bib9)) evaluation reports a single forget-quality $p$-value per unlearning step and therefore aggregates over both gold-equivalent retrieval and near-neighbor retrieval. The retrieval-trace introspection used here separates the two and shows that, on retrieval graphs containing paraphrastic decoys, the second mechanism dominates. A standard FULL versus DEL-ON comparison would have shown only that some deleted facts are still answered correctly; the audit shows that those facts are not remembered but reconstituted at retrieval time.

### 6.3 Ablation: Retrieval-graph topology controls the residual

We ablate the retrieval-graph topology by holding the model, deletion procedure, and prompt distribution fixed and varying only the database variant across Base, Alias, Collision, Noise, and the released LMLM database; the dependent variable is the post-deletion residual $L(f)+R(f)$. Reading down Figure [3](https://arxiv.org/html/2607.00605#S5.F3), the residual rises from 3.7% on Base, to 6.2% on Alias, to 7.2% on Collision, and to 13.6% on Noise, before falling to 0.7% on the released LMLM database. This ordering tracks the retrieval-graph topologies described in Section [4](https://arxiv.org/html/2607.00605#S4): variants in which more surviving entries point to the gold object after alias-closure deletion produce larger residuals, with Noise as the worst case by construction and Base as the cleanest among the custom databases. The original paper (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)) reports unlearning on a single retrieval graph, the annotated TOFU corpus (Maini et al., [2024](https://arxiv.org/html/2607.00605#bib.bib9)), and on that graph achieves "ideal forgetting" with $p>0.05$. By varying the retrieval graph along controlled topologies, we show that the same architecture and the same alias-closure deletion can produce a DEL-ON residual anywhere from 0.7% to 13.6%. The published TOFU result therefore measures the model on a benign retrieval graph (Maini et al., [2024](https://arxiv.org/html/2607.00605#bib.bib9); Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)); it does not, by itself, characterize what happens when the graph contains paraphrastic decoys.

### 6.4 Ablation: Prompt formulation moves baselines more than residuals

We ablate the prompt formulation by holding the model, deletion procedure, and database distribution fixed and varying only the prompt set across the six formulations introduced in Section [4](https://arxiv.org/html/2607.00605#S4); the dependent variables are the FULL baseline accuracy and the post-deletion residual under DEL-ON. Under FULL, prompt style produces a wide performance spread (token F1 between ~17% and ~57% in Figure [5](https://arxiv.org/html/2607.00605#S5.F5), exact match between ~33% and ~46% in Figure [6](https://arxiv.org/html/2607.00605#S5.F6)), reflecting how easily each prompt style elicits a structured lookup. Under DEL-ON, that spread collapses: token F1 compresses into a 7–10% band, and the FULL-to-DEL-ON drop is dominated by the FULL baseline rather than by any property of the prompt itself. Continuations lose the most because they had the most to lose; few-shot prompts lose the least for the same reason. The original paper's TOFU and FactScore evaluations each use a fixed prompt template, so this kind of decomposition is not available there (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1); Maini et al., [2024](https://arxiv.org/html/2607.00605#bib.bib9); Min et al., [2023](https://arxiv.org/html/2607.00605#bib.bib10)). Our six prompt families show that, once the supporting evidence is removed, prompt formulation has limited independent control over how much of a deleted fact survives. What survives is determined almost entirely by the retrieval graph.

### 6.5 Error Analysis

The decomposition in Sections [6.1](https://arxiv.org/html/2607.00605#S6.SS1) and [6.2](https://arxiv.org/html/2607.00605#S6.SS2) predicts three qualitatively distinct post-deletion behaviors: parametric leakage, near-neighbor retrieval reconstruction on Noise, and confusable-neighbor drift on Collision. Table 3 in Appendix [A.3](https://arxiv.org/html/2607.00605#A1.SS3) shows one inference-time trace for each, with the Noise and Collision rows drawn from the corresponding custom-database files and the parametric-leakage row drawn illustratively from a cell that empirically contains leakage cases. The traces make the population-level rates concrete:

- In the Noise case, alias-closure deletion removes the canonical triplet for the gold fact, but a paraphrastic decoy with a different subject phrasing survives and routes retrieval to the same object, so DEL-ON returns the gold answer with no gold-equivalent candidate in the trace.
- In the Collision case, deletion is complete and no surviving candidate routes to the gold object, but a near-miss triplet sharing the subject pulls retrieval onto a confusable neighbor and DEL-ON returns a plausible but incorrect answer.
- In the parametric-leakage case, retrieval returns unknown under DEL-OFF yet the model still produces the gold answer from its parameters, an outcome that the $\hat{L}=0.11\%$ headline rate predicts to be rare and that appears, in the cells where it occurs, to concentrate on high-frequency entities.

## 7 Conclusion

The audit’s contribution is not a single number but a decomposition: a retrieval-disabled control isolates parametric recall from retrieval-side correctness, and the retrieval trace separates correctness anchored in gold-equivalent evidence from correctness reconstituted by near-neighbor candidates. Across our grid, the parametric channel is essentially empty: the LMLM checkpoint almost never returns a deleted answer when retrieval is disabled, and the residual that survives is, in our experiments, primarily a property of the retrieval graph rather than of the model. Within the scope of our audit, LMLM deletion within the model appears to be a clean operation; outside the model, deletion is only as complete as the alias-and-paraphrase closure used to construct it, and our Noise variant produces a 13.6% residual at zero parametric leakage precisely by exposing this gap. Prompt formulation does not appear to move this number independently: once the supporting evidence has been removed, what predominantly determines how much of a deleted fact survives is the topology of the retrieval graph rather than the question’s surface form.

## 8 Limitations and Broader Impact

Our audit has several scope limitations. All experiments use a single LMLM checkpoint, the publicly released 382M-parameter LLaMA2-style model, with retrieval threshold 0.6, fuzzy-match top-1 fallback, and greedy decoding, so findings may not generalize across model sizes, retrieval architectures, similarity thresholds, or decoding strategies (Zhao and contributors, [2025](https://arxiv.org/html/2607.00605#bib.bib2)). The custom databases used to stress-test deletion are also small and topology-controlled, on the order of 100–240 facts per (domain, variant) cell, which is what makes paired comparison possible but is several orders of magnitude smaller than the 54.6M-triplet released LMLM database (Zhao et al., [2025](https://arxiv.org/html/2607.00605#bib.bib1)); the variant-level findings therefore describe how deletion behaves when scale is held fixed and topology is varied, rather than how scale itself interacts with topology. We restrict attention to entity-level atomic facts in English, evaluated under an automatic, alias-aware exact-match notion of correctness, so multi-hop or compositional reasoning, languages other than English, sampling-based decoding, and human-judged correctness all remain outside the present scope. Finally, our alias-closure deletion procedure removes only triplets that alias-match the target on subject, relation, and object simultaneously; the Noise variant is designed to expose what survives this closure, so its residual is a feature of the audit rather than a methodological shortcoming, but it does mean that low parametric leakage should not be read as a guarantee that any user-facing notion of “deletion” fully removes access to sensitive information.

For applications such as data\-deletion compliance or factual editing, the practical lesson from the audit is that the unlearning boundary in this class of LMLM is drawn primarily by the database administrator rather than by the model\. Sound governance therefore requires a closure procedure that extends beyond entity aliases to retrieval\-graph paraphrases of the object\.
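The gap between alias closure and paraphrase closure can be illustrated with a minimal sketch. It assumes a hypothetical `Triplet` record and a hypothetical `aliases` mapping from a canonical string to its accepted surface forms; neither is taken from the released code. The decoy triplet matches the Noise trace reported in Appendix A.3, while the "USA" alias entry is invented for illustration.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    subject: str
    relation: str
    obj: str


def alias_closure_delete(db, target, aliases):
    """Drop triplets that alias-match the target on subject, relation, AND object.

    `aliases` (hypothetical structure) maps a canonical string to the set of
    surface forms treated as equivalent to it.
    """
    def forms(text):
        return {text} | aliases.get(text, set())

    subjects = forms(target.subject)
    relations = forms(target.relation)
    objects = forms(target.obj)
    return [
        t for t in db
        if not (t.subject in subjects and t.relation in relations and t.obj in objects)
    ]


target = Triplet("United States", "Capital", "Washington, D.C.")
db = [
    target,
    Triplet("USA", "Capital", "Washington, D.C."),  # invented subject alias
    Triplet("Government of United States", "Seat of Government", "Washington, D.C."),
]
aliases = {"United States": {"USA"}}

remaining = alias_closure_delete(db, target, aliases)
# The paraphrastic decoy survives: neither its subject nor its relation
# alias-matches the target, even though it routes to the same object.
```

Requiring a simultaneous match on all three fields keeps deletion precise, which is exactly why a triplet that rephrases both subject and relation stays in the database.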

## 9 Future Work

The most direct extension of this work is on the database side. Our results identify retrieval-graph topology as the dominant determinant of post-deletion residuals, which suggests building or preprocessing the database so that alias-closure deletion has fewer surviving routes. One concrete direction is an extended deletion closure that, in addition to alias-equivalent triplets, also removes any triplet whose retained-candidate embedding falls within a similarity radius of the canonical entry, so that paraphrastic decoys are caught at deletion time. A second direction is canonicalization at write time, in which aliases and paraphrastic forms are stored as pointers into a single canonical record rather than as independent triplets. Both approaches are directly testable within our framework: re-running the audit on the modified database and measuring whether $R(f)$ and the retrieval artifact rate fall below their current ranges would tell us how much of the residual is recoverable through database design alone.
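One way the extended deletion closure might look is sketched below. This is an illustration under stated assumptions, not an implementation from this work: the three-dimensional vectors stand in for real sentence embeddings, the keys are hypothetical lookup strings, and the radius of 0.8 is an arbitrary choice.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extended_closure(entries, canonical_key, radius=0.8):
    """Return the keys to delete: the canonical entry plus every entry whose
    embedding has cosine similarity >= radius with it.

    The radius value is an assumption made for this sketch.
    """
    anchor = entries[canonical_key]
    return sorted(key for key, vec in entries.items() if cosine(vec, anchor) >= radius)


# Toy embeddings (hypothetical): the decoy sits close to the canonical entry,
# while the Collision-style neighbor points elsewhere.
entries = {
    "United States | Capital": [1.0, 0.0, 0.0],
    "Government of United States | Seat of Government": [0.9, 0.3, 0.0],
    "United States | Largest City": [0.2, 0.9, 0.1],
}
to_delete = extended_closure(entries, "United States | Capital")
```

A radius that is too generous would also delete legitimate neighbors, so any such closure would need to be re-audited for collateral loss of retained facts.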

Several axes of the audit itself also remain open. We hold the retriever (all-MiniLM-L6-v2) and the similarity threshold (0.6) fixed throughout, so sweeping the threshold and replacing the retriever with sparse, dense, or hybrid alternatives would clarify how much of the observed 13.6% Noise residual is a property of the embedding model rather than of the LMLM architecture. Scaling the audit to larger LMLM checkpoints and to the full 54.6M-triplet released database would test whether the near-zero per-fact parametric leakage we report holds at production scale. Extending the framework beyond entity-level atomic facts to multi-hop and compositional knowledge, and beyond automatic exact-match scoring to semantic or human-judged correctness, would close the gap between the audit and end-user notions of forgetting. Finally, the same intervention set can be applied to retrieval-augmented and edited parametric models more broadly, which would allow a head-to-head comparison of which architectural family achieves the cleanest deletion under matched retrieval pressure.

## Acknowledgements

We thank Akshat Gupta \(Ph\.D\. student, UC Berkeley\) for ongoing research feedback and direction; Yilun Hua \(Ph\.D\. student, Cornell University\) for further research feedback and direction on the LMLM framework; and Marcel Roed \(Ph\.D\. student, Stanford University\) for early feedback on the project proposal\. This work used computing resources provided by Berkeley Research Computing through the Compton Spectrometer and Imager \(COSI\) mission \(NASA Small Explorers \(SMEX\) Program\)\.

## References

- L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot (2021). Machine unlearning. In Proceedings of the 42nd IEEE Symposium on Security and Privacy. External Links: [Link](https://arxiv.org/abs/1912.03817). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p3.1).
- N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel (2021). Extracting training data from large language models. In USENIX Security Symposium. External Links: [Link](https://arxiv.org/abs/2012.07805). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p3.1), [§6.1](https://arxiv.org/html/2607.00605#S6.SS1.p1.6).
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020). REALM: retrieval-augmented language model pre-training. In International Conference on Machine Learning. External Links: [Link](https://arxiv.org/abs/2002.08909). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p1.1).
- V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). External Links: [Link](https://arxiv.org/abs/2004.04906). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p1.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems. External Links: [Link](https://arxiv.org/abs/2005.11401). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p1.1).
- T. Lizzo and L. Heck (2026). Unlearning in LLMs: methods, evaluation, and open challenges. External Links: 2601.13264, [Link](https://arxiv.org/abs/2601.13264). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p3.1).
- P. Maini, Z. Feng, A. Schwarzschild, Z. C. Lipton, and J. Z. Kolter (2024). TOFU: a task of fictitious unlearning for LLMs. External Links: 2401.06121, [Link](https://arxiv.org/abs/2401.06121). Cited by: [§5](https://arxiv.org/html/2607.00605#S5.p1.1), [§6.2](https://arxiv.org/html/2607.00605#S6.SS2.p2.3), [§6.3](https://arxiv.org/html/2607.00605#S6.SS3.p1.9), [§6.4](https://arxiv.org/html/2607.00605#S6.SS4.p1.6).
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023). When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). External Links: [Link](https://arxiv.org/abs/2212.10511). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p1.1).
- K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022). Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems. External Links: [Link](https://arxiv.org/abs/2202.05262). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p2.1).
- K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau (2023). Mass-editing memory in a transformer. In International Conference on Learning Representations. External Links: [Link](https://arxiv.org/abs/2210.07229). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p2.1).
- S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023). FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Empirical Methods in Natural Language Processing (EMNLP). External Links: [Link](https://arxiv.org/abs/2305.14251). Cited by: [§5](https://arxiv.org/html/2607.00605#S5.p1.1), [§6.1](https://arxiv.org/html/2607.00605#S6.SS1.p1.6), [§6.4](https://arxiv.org/html/2607.00605#S6.SS4.p1.6).
- Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, H. Chen, and N. Zhang (2023). Editing large language models: problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). External Links: [Link](https://arxiv.org/abs/2305.13172). Cited by: [§2](https://arxiv.org/html/2607.00605#S2.p2.1).
- L. Zhao and contributors (2025). LMLM. Note: [https://github.com/kilian-group/LMLM](https://github.com/kilian-group/LMLM), commit 9be34b7. Cited by: [Appendix B](https://arxiv.org/html/2607.00605#A2.p1.1), [§1](https://arxiv.org/html/2607.00605#S1.p4.1), [§4](https://arxiv.org/html/2607.00605#S4.p1.5), [§8](https://arxiv.org/html/2607.00605#S8.p1.6).
- L. Zhao, S. Zalouk, C. K. Belardi, J. Lovelace, J. P. Zhou, R. T. Noonan, D. Go, K. Q. Weinberger, Y. Artzi, and J. J. Sun (2025). Pre-training limited memory language models with internal and external knowledge. External Links: 2505.15962, [Link](https://arxiv.org/abs/2505.15962). Cited by: [Figure 1](https://arxiv.org/html/2607.00605#S1.F1), [§1](https://arxiv.org/html/2607.00605#S1.p1.1), [§1](https://arxiv.org/html/2607.00605#S1.p2.1), [§2](https://arxiv.org/html/2607.00605#S2.p1.1), [§5](https://arxiv.org/html/2607.00605#S5.p1.1), [§6.1](https://arxiv.org/html/2607.00605#S6.SS1.p1.6), [§6.3](https://arxiv.org/html/2607.00605#S6.SS3.p1.9), [§6.4](https://arxiv.org/html/2607.00605#S6.SS4.p1.6), [§8](https://arxiv.org/html/2607.00605#S8.p1.6).

## Appendix A Architecture

This appendix details the two design axes that define the prompt × database grid used throughout the audit: the six prompt formulations applied to every target fact, and the four custom database topologies constructed to stress-test alias-closure deletion. Together these specifications fix the inputs to the $6\times 3\times 13$ evaluation grid summarized in Section [4](https://arxiv.org/html/2607.00605#S4).

### A.1 Prompt formulations

Each target fact is presented to the LMLM under six prompt formulations, designed to vary surface form while holding the underlying fact constant. The formulations span direct question-answering, paraphrased rewordings, contextually framed prompts, cloze-style completions, free-form continuations, and few-shot demonstrations. Their purpose in the audit is to separate the effect of question phrasing on FULL baseline accuracy from any independent effect on the post-deletion residual; in our experiments, prompt formulation moves the FULL baseline considerably more than it moves the DEL-ON residual (Section [6.4](https://arxiv.org/html/2607.00605#S6.SS4)). Table 1 illustrates each formulation using a single fact drawn from the released LMLM database.

Table 1: Example prompt formulations for the fact “Geri Halliwell – Famous For – Spice Girls” across the six prompt sets in the released LMLM database.

| Prompt Set | Example Prompt |
| --- | --- |
| Direct | What is Geri Halliwell famous for? |
| Paraphrased | Can you tell me what Geri Halliwell is famous for? |
| Contextual | Context: I am compiling a concise factual profile for Geri Halliwell. Answer with a short factual phrase.<br>Question: What is Geri Halliwell famous for? |
| Cloze | Complete the sentence with the missing fact: Geri Halliwell is famous for \_\_\_\_. |
| Continuations | Tell me about Geri Halliwell. Geri Halliwell is famous for |
| Fewshot | Answer the final question with a short factual phrase.<br>Question: Where was Ada Lovelace born? Answer: London<br>Question: In what year was Pride and Prejudice published? Answer: 1813<br>Question: What is Geri Halliwell famous for? Answer: |

### A.2 Custom database variants

Alongside the released LMLM database, we constructed twelve custom databases spanning three domains (countries, politicians, sports) and four topological variants (Base, Alias, Noise, Collision). The variants are intended to isolate distinct mechanisms by which a fact could remain recoverable after canonical removal:

- *Base* provides a clean baseline with no alternative routes.
- *Alias* probes whether alias-closure deletion catches every surface realization of the subject and relation.
- *Noise* probes whether retrieval can reconstitute the deleted answer through paraphrastic decoys that point to the same object.
- *Collision* probes whether retrieval drifts onto a near-neighbor sharing the subject but routing to a different object.

In our audit, the post-deletion residual tracks this topology ordering, with the largest residual concentrated in the *Noise* variant (Section [6.3](https://arxiv.org/html/2607.00605#S6.SS3)). Table 2 illustrates the four variants using a single politician-domain example.

Table 2: Example of the four custom dataset variants used in our experiments, illustrated with a single politician-domain example.

| Variant | Description | Example |
| --- | --- | --- |
| Base | Clean, straightforward facts with little ambiguity. | Who is the 47th U.S. president? → Donald Trump |
| Alias | Same underlying fact, but expressed with alternate names or phrasings. | Who is the current POTUS? → Donald Trump<br>Who is the president of the USA? → Donald Trump |
| Noise | Many different facts point to the same answer, so that answer becomes very common and tempting for retrieval. | Who is the 45th U.S. president? → Donald Trump<br>Who is the 47th U.S. president? → Donald Trump<br>Which New York property magnate became U.S. president? → Donald Trump |
| Collision | Near-miss facts that are very similar but should produce different answers, so retrieval can confuse neighbors. | Who is the 46th U.S. president? → Joe Biden<br>Who is the 47th U.S. president? → Donald Trump |

### A.3 Qualitative examples of post-deletion failure modes

Table 3 shows one representative inference-time trace for each of the three failure modes identified by the decomposition. For each example we report the gold fact, the variant the trace was drawn from, the canonical triplet that was removed by alias-closure deletion, the highest-similarity retrieval candidate retained at inference time, and the model’s DEL-ON (or DEL-OFF, for the parametric-leakage case) output. The deleted triplets and the surviving retrieval candidates in the Noise and Collision rows are taken verbatim from the custom databases under `data/custom_databases/countries/`. The parametric-leakage row is drawn from the `direct_questions` × Base (sports) cell, which contains 2 leakage cases out of 100 paired facts. The specific triplet shown is illustrative of the kind of high-frequency sports fact this cell contains rather than a verbatim trace from the audit logs.

Table 3: Representative inference-time traces for the three post-deletion failure modes. Noise and Collision rows reproduce real triplets from the corresponding custom-database files; the parametric-leakage row is drawn from a cell that empirically contains leakage cases (`direct_questions` × Base (sports), 2 of 100), but the specific triplet shown is illustrative rather than a verbatim trace.

| Failure mode | Variant | Deleted triplet | Surviving retrieval candidate | Model output |
| --- | --- | --- | --- | --- |
| Near-neighbor reconstruction | Noise (countries) | *United States → Capital → Washington, D.C.* | *Government of United States → Seat of Government → Washington, D.C.* | DEL-ON: “Washington, D.C.” (gold; no gold-equivalent candidate in trace) |
| Confusable-neighbor drift | Collision (countries) | *United States → Capital → Washington, D.C.* | *United States → Largest City → New York City* | DEL-ON: “New York City” (incorrect; plausible neighbor) |
| Parametric leakage | Base (sports) | *Michael Jordan → Sport → Basketball* | None (retrieval returns unknown) | DEL-OFF: “Basketball” (illustrative of the 2 leakage cases observed in this cell) |

## Appendix B Hyperparameters

This section lists the configuration constants that directly affect the LMLM’s outputs in our audit: the loaded checkpoint, the retrieval stack, and the decoding policy. Settings inherited from the upstream LMLM release (Zhao and contributors, [2025](https://arxiv.org/html/2607.00605#bib.bib2)) are noted as such in the description column. Experimental-grid axes (prompt formulations, database variants, intervention states) and evaluation plumbing are described separately in Appendix [A](https://arxiv.org/html/2607.00605#A1) and Section [4](https://arxiv.org/html/2607.00605#S4). Table 4 lists each performance-relevant hyperparameter by its variable name in source, with a short description and the value used.

Table 4: Performance-relevant hyperparameters used throughout the audit. Variable names refer to identifiers in `src/lmlm-audit/` unless otherwise specified.

| Hyperparameter | Description | Value |
| --- | --- | --- |
| `model_name` | LMLM checkpoint loaded by the audit’s model loader; LLaMA2-style decoder-only model pre-trained from scratch by the upstream authors on annotated Wikipedia. | `kilian-group/LMLM-llama2-382M` |
| logit bias on db tokens | Decoding-time logit bias applied to the four lookup tokens so that lookup calls are issued reliably; values inherited from upstream. | `<\|db_entity\|>`: 4<br>`<\|db_relationship\|>`: 2<br>`<\|db_return\|>`: 2<br>`<\|db_end\|>`: 2 |
| embedding model | Sentence-Transformer used to embed lookup queries and database triplets for top-$k$ FAISS retrieval; inherited from the upstream top-$k$ retriever. | `sentence-transformers/all-MiniLM-L6-v2` |
| `threshold` | Cosine similarity threshold used by the top-$k$ retriever; candidates with score below this are dropped, and retrieval returns unknown if no candidate clears it. | 0.6 |
| `fallback_policy` | Behavior when no candidate clears `threshold`; `top1_anyway` re-runs retrieval at threshold $-1.0$ and returns the highest-similarity candidate. | `top1_anyway` |
| `max_new_tokens` | Per-call cap on freshly generated answer tokens, used as the target answer length when computing the generation budget. | 12 |
| generation budget | Total token budget passed to `model.generate`, sized to leave slack for lookup markup before the retrieved value appears. | $\max(32,\,\lvert p\rvert+28)$ |
| `repetition_penalty` | HuggingFace `generate` repetition penalty applied during decoding. | 1.2 |
| `do_sample` | Whether to sample tokens during decoding; greedy decoding is used throughout the audit. | False |
| `eos_token_id` | Stop tokens that terminate generation early so the audit can intercept retrieval at the correct point. | `<\|db_return\|>`<br>`tokenizer.eos_token_id`<br>`<\|end_of_text\|>` |

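The `threshold` and `fallback_policy` settings can be read as a simple decision rule, sketched below. This is a simplified stand-in rather than the upstream retriever: it scores a plain list of (value, embedding) pairs in pure Python instead of querying FAISS, and the function name `retrieve`, the two-dimensional toy vectors, and the `"none"` policy label are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, database, threshold=0.6, fallback_policy="top1_anyway"):
    """Return (value, score) for the best candidate, or ("unknown", score).

    database: list of (value, embedding) pairs.
    """
    if not database:
        return "unknown", None
    scored = [(cosine(query_vec, emb), value) for value, emb in database]
    best_score, best_value = max(scored, key=lambda item: item[0])
    if best_score >= threshold:
        return best_value, best_score
    if fallback_policy == "top1_anyway":
        # Re-running at threshold -1.0 accepts any cosine score, so the
        # highest-similarity candidate is returned regardless.
        return best_value, best_score
    return "unknown", best_score


# Toy database (hypothetical embeddings).
toy_db = [("Washington, D.C.", [1.0, 0.0]), ("New York City", [0.0, 1.0])]
hit = retrieve([1.0, 0.1], toy_db)
miss = retrieve([-1.0, -1.0], toy_db, fallback_policy="none")
```

Under `top1_anyway`, a query whose best match falls below 0.6 still returns a value, which is one route by which a near-neighbor candidate can reach the model after the canonical entry has been deleted.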
## Appendix C LMLM Audit Identities

This section collects the small set of equations that are specific to the three-state LMLM audit. Each identity links the per-fact indicators $L(f)$ and $R(f)$ from Section [3](https://arxiv.org/html/2607.00605#S3) and the per-fact retrieval artifact indicator $A(f)$ from Proposition [6.1](https://arxiv.org/html/2607.00605#S6.Thmtheorem1) to the paired contingency that the McNemar test in Appendix [D.1](https://arxiv.org/html/2607.00605#A4.SS1) consumes. Throughout, $\mathcal{F}$ denotes the set of paired facts (those evaluated under both DEL-ON and DEL-OFF), $n=|\mathcal{F}|$, and the paired contingency is

$$
\begin{aligned}
a &= |\{f\in\mathcal{F}: Y(f,\texttt{DEL-ON})=o \;\wedge\; Y(f,\texttt{DEL-OFF})=o\}|,\\
b &= |\{f\in\mathcal{F}: Y(f,\texttt{DEL-ON})=o \;\wedge\; Y(f,\texttt{DEL-OFF})\neq o\}|,\\
c &= |\{f\in\mathcal{F}: Y(f,\texttt{DEL-ON})\neq o \;\wedge\; Y(f,\texttt{DEL-OFF})=o\}|,\\
d &= |\{f\in\mathcal{F}: Y(f,\texttt{DEL-ON})\neq o \;\wedge\; Y(f,\texttt{DEL-OFF})\neq o\}|.
\end{aligned}
$$

The intervention-condition variable $c$ from Section [3](https://arxiv.org/html/2607.00605#S3) does not appear elsewhere in this section, so we reuse the symbol for the McNemar-style off-diagonal count without ambiguity.

### C.1 Decomposition of DEL-ON correctness

###### Proposition C.1 (DEL-ON correctness decomposition).

For every fact $f\in\mathcal{F}$,

$$
\mathbb{I}[Y(f,\texttt{DEL-ON})=o] \;=\; L(f)+R(f)\;-\;\mathbb{I}\bigl[Y(f,\texttt{DEL-ON})\neq o \;\wedge\; Y(f,\texttt{DEL-OFF})=o\bigr]. \tag{1}
$$

Aggregating equation ([1](https://arxiv.org/html/2607.00605#A3.E1)) over $\mathcal{F}$ and writing $p_{\texttt{DEL-ON}}$ for the empirical DEL-ON exact-match rate gives

$$
p_{\texttt{DEL-ON}} \;=\; \hat{L}+\hat{R}-\frac{c}{n}. \tag{2}
$$

###### Proof.

The four indicators $\mathbb{I}[a\text{-cell}]$, $\mathbb{I}[b\text{-cell}]$, $\mathbb{I}[c\text{-cell}]$, $\mathbb{I}[d\text{-cell}]$ partition the joint event space of $(Y(f,\texttt{DEL-ON}),Y(f,\texttt{DEL-OFF}))$, so for every $f$ exactly one of them is $1$. By construction

$$
\begin{aligned}
\mathbb{I}[Y(f,\texttt{DEL-ON})=o] &= \mathbb{I}[a\text{-cell}]+\mathbb{I}[b\text{-cell}],\\
L(f)=\mathbb{I}[Y(f,\texttt{DEL-OFF})=o] &= \mathbb{I}[a\text{-cell}]+\mathbb{I}[c\text{-cell}],\\
R(f) &= \mathbb{I}[b\text{-cell}].
\end{aligned}
$$

Adding the last two relations gives $L(f)+R(f)=\mathbb{I}[a\text{-cell}]+\mathbb{I}[b\text{-cell}]+\mathbb{I}[c\text{-cell}]$; subtracting $\mathbb{I}[c\text{-cell}]$ leaves $\mathbb{I}[a\text{-cell}]+\mathbb{I}[b\text{-cell}]$, which equals $\mathbb{I}[Y(f,\texttt{DEL-ON})=o]$ by the first relation and yields equation ([1](https://arxiv.org/html/2607.00605#A3.E1)). Summing equation ([1](https://arxiv.org/html/2607.00605#A3.E1)) over $\mathcal{F}$ and dividing by $n$ gives equation ([2](https://arxiv.org/html/2607.00605#A3.E2)), since $\sum_{f\in\mathcal{F}}\mathbb{I}[c\text{-cell}]=c$. ∎

Proposition [C.1](https://arxiv.org/html/2607.00605#A3.Thmtheorem1) explains why the $\hat{L}+\hat{R}$ stack reported in Figure [3](https://arxiv.org/html/2607.00605#S5.F3) reads almost identically to the DEL-ON exact-match column of Figure [6](https://arxiv.org/html/2607.00605#S5.F6): the gap is exactly $c/n$, which never exceeds $4/600\approx 0.7\%$ in any of our variants. The proposition also pins down what $\hat{L}+\hat{R}$ would equal in a hypothetical setting where retrieval frequently hurts correctness; in that setting $c/n$ would no longer be negligible and the stack and the DEL-ON bar would diverge by exactly that amount.
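As a minimal numerical check of Proposition C.1, the sketch below computes $\hat{L}$, $\hat{R}$, and $c/n$ from per-fact correctness pairs and confirms equation (2). The function name `audit_rates` and the ten toy outcome pairs are hypothetical; they are not drawn from the audit logs.

```python
import math


def audit_rates(pairs):
    """Summarize paired outcomes.

    pairs: list of (correct_del_on, correct_del_off) booleans, one per fact.
    """
    n = len(pairs)
    a = sum(1 for on, off in pairs if on and off)
    b = sum(1 for on, off in pairs if on and not off)
    c = sum(1 for on, off in pairs if not on and off)
    return {
        "n": n,
        "p_del_on": (a + b) / n,  # DEL-ON exact-match rate
        "L_hat": (a + c) / n,     # DEL-OFF exact-match rate (parametric leakage)
        "R_hat": b / n,           # retrieval-mediated correctness
        "c_over_n": c / n,        # correct only when retrieval is disabled
    }


# Hypothetical toy data: 1 a-cell, 3 b-cell, 1 c-cell, and 5 d-cell facts.
pairs = (
    [(True, True)]
    + [(True, False)] * 3
    + [(False, True)]
    + [(False, False)] * 5
)
rates = audit_rates(pairs)

# Equation (2): p_DEL-ON = L_hat + R_hat - c/n.
identity_holds = math.isclose(
    rates["p_del_on"], rates["L_hat"] + rates["R_hat"] - rates["c_over_n"]
)
```

In this toy case the stack $\hat{L}+\hat{R}$ is 0.5 while the DEL-ON rate is 0.4, so the two differ by exactly $c/n = 0.1$.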

### C.2 Recovering the paired contingency from aggregated rates

###### Proposition C.2 (Contingency recovery).

Suppose every fact $f\in\mathcal{F}$ is evaluated under both DEL-ON and DEL-OFF. Let $p_{\texttt{DEL-ON}}$ and $p_{\texttt{DEL-OFF}}=\hat{L}$ denote the empirical DEL-ON and DEL-OFF exact-match rates and $\hat{R}$ the empirical retrieval-mediated correctness rate. Then the paired contingency is

$$
b=n\hat{R},\qquad a=n\,p_{\texttt{DEL-ON}}-n\hat{R},\qquad c=n\,p_{\texttt{DEL-OFF}}-a,\qquad d=n-a-b-c. \tag{3}
$$

###### Proof.

By definition $\hat{R}=(1/n)\sum_{f\in\mathcal{F}}R(f)=b/n$, so $b=n\hat{R}$. The aggregate DEL-ON exact-match rate is $p_{\texttt{DEL-ON}}=(a+b)/n$, giving $a=n\,p_{\texttt{DEL-ON}}-b=n\,p_{\texttt{DEL-ON}}-n\hat{R}$. The aggregate DEL-OFF exact-match rate is $p_{\texttt{DEL-OFF}}=\hat{L}=(a+c)/n$, giving $c=n\,p_{\texttt{DEL-OFF}}-a$. Since the four cells counted by $a$, $b$, $c$, and $d$ partition $\mathcal{F}$, $a+b+c+d=n$ and so $d=n-a-b-c$. ∎

Proposition [C.2](https://arxiv.org/html/2607.00605#A3.Thmtheorem2) is what allows the aggregated cross-state and per-state metrics produced by the audit to be re-expanded into the $(a,b,c,d)$ table that the McNemar statistic in Appendix [D.1](https://arxiv.org/html/2607.00605#A4.SS1) consumes, without re-running the audit on a per-fact basis. The recovery is exact under the same coverage condition that makes the McNemar test well defined, namely that every paired fact has been evaluated under both intervention states.
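Equation (3) translates directly into code. The sketch below is illustrative only: the function name `recover_contingency` is hypothetical, the rounding step is an added assumption to absorb floating-point error in the reported rates, and the example rates are invented rather than taken from the audit.

```python
def recover_contingency(n, p_del_on, p_del_off, r_hat):
    """Rebuild the paired (a, b, c, d) table from aggregate rates.

    Follows equation (3); rounding to integers is an assumption of this
    sketch, since rates stored as floats rarely multiply back exactly.
    """
    b = round(n * r_hat)
    a = round(n * p_del_on) - b
    c = round(n * p_del_off) - a
    d = n - a - b - c
    return a, b, c, d


# Hypothetical rates for a cell with n = 100 paired facts.
a, b, c, d = recover_contingency(n=100, p_del_on=0.13, p_del_off=0.03, r_hat=0.12)
```

Here the recovered table is $a=1$, $b=12$, $c=2$, $d=85$, which sums back to $n=100$ as the partition argument requires.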

## Appendix D Additional Results

### D.1 Paired McNemar test for DEL-ON vs. DEL-OFF

Figure [7](https://arxiv.org/html/2607.00605#A4.F7) reports the discordant-pair counts that drive the paired McNemar test for DEL-ON against DEL-OFF, broken out by database variant. Both states share the same alias-closure deletion of the target fact; the only difference is that DEL-ON leaves retrieval enabled while DEL-OFF disables it. The count $b$ measures facts that retrieval rescues after the canonical entry is removed, and $c$ measures the reverse transition in which disabling retrieval somehow recovers an otherwise-incorrect answer. The concordant counts $a$ and $d$ are absorbed into the marginal $n$ and do not enter the test.

![Refer to caption](https://arxiv.org/html/2607.00605v1/figures/media_images_mcnemar_del_on_vs_del_off_by_variant_0_4fd95da1a4b323acd189.png)

Figure 7: Discordant-pair counts driving the paired McNemar test for DEL-ON against DEL-OFF, broken out by database variant. Bar height is on a log scale so the much smaller $c$ counts remain legible alongside $b$. Annotations above each variant give the $p$-value of the recommended test ($\chi^{2}$ with continuity correction when $b+c>25$, exact binomial otherwise).

Across the four custom variants the asymmetry is unambiguous: $b$ ranges from 69 on Base to 418 on Noise while $c$ never exceeds 4, and all four reject the null at $p<0.001$ with $\chi^{2}$ statistics between 61 and 407. The Released LMLM column points the same way ($b=4$, $c=0$ on $n=600$), but the discordant counts are too small for the exact binomial test to reject at $\alpha=0.05$ ($p=0.125$); we read this as a power limit rather than a contradiction, since the released-database evaluation has both fewer paired facts and a lower base rate of exact-match correctness.

Taken together with the headline rates in the main paper, the McNemar evidence reinforces the claim that retrieval-mediated correctness is a real paired effect rather than a coincidence at the population level. The relation $b\gg c$ holds inside every custom variant, which leaves little room for a reverse hypothesis in which gains from retrieval are merely noise of equal magnitude flipping in both directions.
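The test-selection rule described in the Figure 7 caption can be sketched in a few lines. This is an illustrative reimplementation using only the Python standard library, not the audit's own code; the function name `mcnemar` and its return format are assumptions. The $\chi^{2}$ branch uses the identity that the one-degree-of-freedom survival function equals $\operatorname{erfc}(\sqrt{x/2})$.

```python
import math


def mcnemar(b, c):
    """Paired McNemar test on discordant counts b and c.

    Returns (test_name, p_value). Uses chi-square with continuity
    correction when b + c > 25 and the exact two-sided binomial otherwise.
    """
    total = b + c
    if total == 0:
        return "none", 1.0
    if total > 25:
        stat = (abs(b - c) - 1) ** 2 / total
        # Survival function of chi-square with 1 degree of freedom.
        return "chi2_cc", math.erfc(math.sqrt(stat / 2))
    # Exact binomial: probability of a split at least this lopsided under p = 0.5.
    tail = sum(math.comb(total, k) for k in range(min(b, c) + 1)) / 2 ** total
    return "exact_binomial", min(1.0, 2 * tail)


# Released LMLM column from the text above: b = 4, c = 0.
name, p_value = mcnemar(4, 0)  # exact binomial, p = 0.125
```

With only four discordant pairs, the smallest attainable two-sided $p$-value is $2 \times 0.5^{4} = 0.125$, which is why this column cannot reject at $\alpha = 0.05$ regardless of how one-sided the counts are.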

## Appendix ENumeric Results

Table 5: Standard metrics across databases, prompt sets, and inference-time database states. Values are reported as proportions. P, R, and $F_1$ are given for each of the three database states (FULL, DEL-ON, DEL-OFF).

| Database | Prompt Set | FULL P | FULL R | FULL $F_1$ | DEL-ON P | DEL-ON R | DEL-ON $F_1$ | DEL-OFF P | DEL-OFF R | DEL-OFF $F_1$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Released LMLM | Direct Questions | 0.474 | 0.477 | 0.475 | 0.022 | 0.022 | 0.021 | 0.015 | 0.024 | 0.016 |
| Released LMLM | Paraphrased Questions | 0.391 | 0.386 | 0.388 | 0.016 | 0.020 | 0.015 | 0.012 | 0.008 | 0.008 |
| Released LMLM | Contextual Questions | 0.594 | 0.597 | 0.594 | 0.032 | 0.039 | 0.031 | 0.008 | 0.031 | 0.011 |
| Released LMLM | Cloze Prompts | 0.336 | 0.331 | 0.332 | 0.028 | 0.031 | 0.027 | 0.006 | 0.014 | 0.007 |
| Released LMLM | Continuations | 0.815 | 0.819 | 0.816 | 0.026 | 0.034 | 0.026 | 0.028 | 0.016 | 0.013 |
| Released LMLM | Fewshot | 0.255 | 0.251 | 0.251 | 0.021 | 0.019 | 0.019 | 0.015 | 0.047 | 0.020 |
| Countries Base | Direct Questions | 0.580 | 0.580 | 0.580 | 0.098 | 0.098 | 0.098 | 0.012 | 0.030 | 0.017 |
| Countries Base | Paraphrased Questions | 0.545 | 0.545 | 0.545 | 0.083 | 0.083 | 0.083 | 0.011 | 0.030 | 0.016 |
| Countries Base | Contextual Questions | 0.670 | 0.670 | 0.670 | 0.058 | 0.058 | 0.058 | 0.012 | 0.060 | 0.017 |
| Countries Base | Cloze Prompts | 0.311 | 0.315 | 0.311 | 0.022 | 0.027 | 0.023 | 0.004 | 0.040 | 0.007 |
| Countries Base | Continuations | 0.840 | 0.840 | 0.840 | 0.028 | 0.028 | 0.028 | 0.042 | 0.180 | 0.067 |
| Countries Base | Fewshot | 0.230 | 0.230 | 0.230 | 0.090 | 0.090 | 0.090 | 0.001 | 0.020 | 0.003 |
| Countries Alias | Direct Questions | 0.757 | 0.757 | 0.757 | 0.217 | 0.217 | 0.217 | 0.006 | 0.043 | 0.011 |
| Countries Alias | Paraphrased Questions | 0.566 | 0.566 | 0.566 | 0.152 | 0.149 | 0.150 | 0.008 | 0.056 | 0.014 |
| Countries Alias | Contextual Questions | 0.603 | 0.608 | 0.604 | 0.124 | 0.129 | 0.126 | 0.004 | 0.055 | 0.008 |
| Countries Alias | Cloze Prompts | 0.511 | 0.516 | 0.512 | 0.113 | 0.115 | 0.112 | 0.004 | 0.028 | 0.007 |
| Countries Alias | Continuations | 0.858 | 0.858 | 0.858 | 0.202 | 0.202 | 0.202 | 0.014 | 0.058 | 0.020 |
| Countries Alias | Fewshot | 0.253 | 0.253 | 0.253 | 0.092 | 0.092 | 0.092 | 0.010 | 0.045 | 0.015 |
| Countries Noise | Direct Questions | 0.631 | 0.631 | 0.631 | 0.467 | 0.467 | 0.467 | 0.007 | 0.022 | 0.011 |
| Countries Noise | Paraphrased Questions | 0.569 | 0.569 | 0.569 | 0.383 | 0.383 | 0.383 | 0.006 | 0.022 | 0.009 |
| Countries Noise | Contextual Questions | 0.670 | 0.670 | 0.670 | 0.389 | 0.389 | 0.389 | 0.009 | 0.050 | 0.014 |
| Countries Noise | Cloze Prompts | 0.438 | 0.443 | 0.439 | 0.327 | 0.331 | 0.328 | 0.006 | 0.043 | 0.010 |
| Countries Noise | Continuations | 0.878 | 0.878 | 0.878 | 0.354 | 0.354 | 0.354 | 0.025 | 0.128 | 0.041 |
| Countries Noise | Fewshot | 0.309 | 0.309 | 0.309 | 0.241 | 0.241 | 0.241 | 0.013 | 0.041 | 0.014 |
| Countries Collision | Direct Questions | 0.668 | 0.671 | 0.669 | 0.192 | 0.194 | 0.192 | 0.012 | 0.050 | 0.019 |
| Countries Collision | Paraphrased Questions | 0.545 | 0.561 | 0.549 | 0.179 | 0.198 | 0.184 | 0.013 | 0.061 | 0.021 |
| Countries Collision | Contextual Questions | 0.700 | 0.707 | 0.701 | 0.131 | 0.137 | 0.131 | 0.013 | 0.081 | 0.019 |
| Countries Collision | Cloze Prompts | 0.436 | 0.439 | 0.436 | 0.127 | 0.130 | 0.127 | 0.006 | 0.071 | 0.012 |
| Countries Collision | Continuations | 0.879 | 0.879 | 0.879 | 0.174 | 0.173 | 0.173 | 0.033 | 0.152 | 0.052 |
| Countries Collision | Fewshot | 0.182 | 0.182 | 0.182 | 0.086 | 0.086 | 0.086 | 0.004 | 0.033 | 0.007 |
| Politicians Base | Direct Questions | 0.504 | 0.504 | 0.504 | 0.033 | 0.033 | 0.033 | 0.000 | 0.000 | 0.000 |
| Politicians Base | Paraphrased Questions | 0.496 | 0.493 | 0.493 | 0.121 | 0.121 | 0.119 | 0.005 | 0.025 | 0.008 |
| Politicians Base | Contextual Questions | 0.475 | 0.467 | 0.470 | 0.079 | 0.078 | 0.078 | 0.003 | 0.025 | 0.005 |
| Politicians Base | Cloze Prompts | 0.283 | 0.279 | 0.281 | 0.042 | 0.042 | 0.040 | 0.010 | 0.060 | 0.016 |
| Politicians Base | Continuations | 0.683 | 0.681 | 0.682 | 0.033 | 0.033 | 0.033 | 0.006 | 0.017 | 0.008 |
| Politicians Base | Fewshot | 0.304 | 0.304 | 0.302 | 0.104 | 0.101 | 0.100 | 0.007 | 0.037 | 0.011 |
| Politicians Alias | Direct Questions | 0.387 | 0.385 | 0.386 | 0.053 | 0.054 | 0.054 | 0.003 | 0.015 | 0.004 |
| Politicians Alias | Paraphrased Questions | 0.300 | 0.300 | 0.300 | 0.050 | 0.052 | 0.051 | 0.002 | 0.013 | 0.003 |
| Politicians Alias | Contextual Questions | 0.565 | 0.567 | 0.565 | 0.075 | 0.078 | 0.075 | 0.002 | 0.022 | 0.004 |
| Politicians Alias | Cloze Prompts | 0.256 | 0.254 | 0.255 | 0.054 | 0.048 | 0.050 | 0.001 | 0.013 | 0.002 |
| Politicians Alias | Continuations | 0.729 | 0.726 | 0.727 | 0.087 | 0.083 | 0.084 | 0.002 | 0.008 | 0.003 |
| Politicians Alias | Fewshot | 0.200 | 0.199 | 0.199 | 0.079 | 0.076 | 0.077 | 0.013 | 0.028 | 0.015 |
| Politicians Noise | Direct Questions | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.015 | 0.022 | 0.016 |
| Politicians Noise | Paraphrased Questions | 0.000 | 0.003 | 0.001 | 0.000 | 0.003 | 0.001 | 0.008 | 0.036 | 0.013 |
| Politicians Noise | Contextual Questions | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.009 | 0.031 | 0.010 |
| Politicians Noise | Cloze Prompts | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.010 | 0.068 | 0.016 |
| Politicians Noise | Continuations | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.004 | 0.011 | 0.006 |
| Politicians Noise | Fewshot | 0.000 | 0.003 | 0.000 | 0.000 | 0.003 | 0.000 | 0.005 | 0.028 | 0.008 |
| Politicians Collision | Direct Questions | 0.444 | 0.443 | 0.443 | 0.059 | 0.058 | 0.059 | 0.000 | 0.000 | 0.000 |
| Politicians Collision | Paraphrased Questions | 0.381 | 0.379 | 0.379 | 0.119 | 0.119 | 0.118 | 0.004 | 0.019 | 0.006 |
| Politicians Collision | Contextual Questions | 0.503 | 0.493 | 0.497 | 0.103 | 0.101 | 0.102 | 0.002 | 0.019 | 0.004 |
| Politicians Collision | Cloze Prompts | 0.256 | 0.256 | 0.256 | 0.025 | 0.025 | 0.025 | 0.008 | 0.045 | 0.012 |
| Politicians Collision | Continuations | 0.600 | 0.598 | 0.599 | 0.075 | 0.075 | 0.075 | 0.005 | 0.019 | 0.008 |
| Politicians Collision | Fewshot | 0.278 | 0.276 | 0.275 | 0.075 | 0.072 | 0.071 | 0.005 | 0.028 | 0.008 |
| Sports Base | Direct Questions | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.047 | 0.095 | 0.058 |
| Sports Base | Paraphrased Questions | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.005 | 0.020 | 0.008 |
| Sports Base | Contextual Questions | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.004 | 0.048 | 0.006 |
| Sports Base | Cloze Prompts | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.005 | 0.033 | 0.008 |
| Sports Base | Continuations | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.013 | 0.055 | 0.020 |
| Sports Base | Fewshot | 0.000 | 0.005 | 0.001 | 0.000 | 0.005 | 0.001 | 0.002 | 0.018 | 0.003 |
| Sports Alias | Direct Questions | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.009 | 0.024 | 0.013 |
| Sports Alias | Paraphrased Questions | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.008 | 0.028 | 0.012 |
| Sports Alias | Contextual Questions | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.004 | 0.028 | 0.006 |
| Sports Alias | Cloze Prompts | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.006 | 0.025 | 0.008 |
| Sports Alias | Continuations | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.008 | 0.033 | 0.012 |
| Sports Alias | Fewshot | 0.000 | 0.005 | 0.001 | 0.000 | 0.005 | 0.001 | 0.008 | 0.018 | 0.009 |
| Sports Noise | Direct Questions | 0.475 | 0.466 | 0.468 | 0.072 | 0.061 | 0.063 | 0.041 | 0.092 | 0.053 |
| Sports Noise | Paraphrased Questions | 0.405 | 0.406 | 0.406 | 0.054 | 0.050 | 0.050 | 0.006 | 0.028 | 0.009 |
| Sports Noise | Contextual Questions | 0.569 | 0.569 | 0.569 | 0.044 | 0.041 | 0.042 | 0.006 | 0.055 | 0.010 |
| Sports Noise | Cloze Prompts | 0.284 | 0.274 | 0.276 | 0.053 | 0.043 | 0.045 | 0.004 | 0.027 | 0.007 |
| Sports Noise | Continuations | 0.716 | 0.707 | 0.709 | 0.075 | 0.060 | 0.063 | 0.009 | 0.041 | 0.014 |
| Sports Noise | Fewshot | 0.196 | 0.195 | 0.193 | 0.058 | 0.058 | 0.055 | 0.006 | 0.041 | 0.010 |
| Sports Collision | Direct Questions | 0.479 | 0.481 | 0.480 | 0.036 | 0.032 | 0.032 | 0.034 | 0.074 | 0.043 |
| Sports Collision | Paraphrased Questions | 0.404 | 0.405 | 0.404 | 0.042 | 0.043 | 0.042 | 0.003 | 0.013 | 0.005 |
| Sports Collision | Contextual Questions | 0.544 | 0.540 | 0.541 | 0.045 | 0.040 | 0.039 | 0.002 | 0.032 | 0.005 |
| Sports Collision | Cloze Prompts | 0.278 | 0.279 | 0.277 | 0.050 | 0.049 | 0.048 | 0.004 | 0.025 | 0.007 |
| Sports Collision | Continuations | 0.696 | 0.696 | 0.696 | 0.054 | 0.046 | 0.049 | 0.009 | 0.038 | 0.013 |
| Sports Collision | Fewshot | 0.198 | 0.207 | 0.200 | 0.068 | 0.078 | 0.070 | 0.011 | 0.028 | 0.013 |

Table 6: Cross-state audit metrics across databases and prompt sets. Values are reported as proportions.

| Database | Prompt Set | Paired Count | $L(f)$ | $R(f)$ | Retrieval Artifact Rate |
|---|---|---|---|---|---|
| Released LMLM | Direct Questions | 100 | 0.000 | 0.010 | 0.010 |
| Released LMLM | Paraphrased Questions | 100 | 0.000 | 0.000 | 0.000 |
| Released LMLM | Contextual Questions | 100 | 0.000 | 0.010 | 0.010 |
| Released LMLM | Cloze Prompts | 100 | 0.000 | 0.010 | 0.010 |
| Released LMLM | Continuations | 100 | 0.000 | 0.000 | 0.000 |
| Released LMLM | Fewshot | 100 | 0.000 | 0.010 | 0.010 |
| Countries Base | Direct Questions | 100 | 0.000 | 0.080 | 0.080 |
| Countries Base | Paraphrased Questions | 100 | 0.000 | 0.060 | 0.060 |
| Countries Base | Contextual Questions | 100 | 0.000 | 0.040 | 0.040 |
| Countries Base | Cloze Prompts | 100 | 0.000 | 0.010 | 0.010 |
| Countries Base | Continuations | 100 | 0.000 | 0.010 | 0.010 |
| Countries Base | Fewshot | 100 | 0.000 | 0.090 | 0.090 |
| Countries Alias | Direct Questions | 200 | 0.000 | 0.200 | 0.200 |
| Countries Alias | Paraphrased Questions | 200 | 0.000 | 0.135 | 0.135 |
| Countries Alias | Contextual Questions | 200 | 0.000 | 0.110 | 0.110 |
| Countries Alias | Cloze Prompts | 200 | 0.000 | 0.090 | 0.090 |
| Countries Alias | Continuations | 200 | 0.005 | 0.185 | 0.185 |
| Countries Alias | Fewshot | 200 | 0.000 | 0.090 | 0.090 |
| Countries Noise | Direct Questions | 180 | 0.000 | 0.456 | 0.456 |
| Countries Noise | Paraphrased Questions | 180 | 0.000 | 0.367 | 0.367 |
| Countries Noise | Contextual Questions | 180 | 0.000 | 0.378 | 0.378 |
| Countries Noise | Cloze Prompts | 180 | 0.000 | 0.322 | 0.322 |
| Countries Noise | Continuations | 180 | 0.000 | 0.339 | 0.339 |
| Countries Noise | Fewshot | 180 | 0.011 | 0.233 | 0.233 |
| Countries Collision | Direct Questions | 140 | 0.000 | 0.164 | 0.164 |
| Countries Collision | Paraphrased Questions | 140 | 0.000 | 0.143 | 0.143 |
| Countries Collision | Contextual Questions | 140 | 0.000 | 0.107 | 0.107 |
| Countries Collision | Cloze Prompts | 140 | 0.000 | 0.114 | 0.114 |
| Countries Collision | Continuations | 140 | 0.000 | 0.150 | 0.150 |
| Countries Collision | Fewshot | 140 | 0.000 | 0.079 | 0.079 |
| Politicians Base | Direct Questions | 120 | 0.000 | 0.025 | 0.025 |
| Politicians Base | Paraphrased Questions | 120 | 0.000 | 0.100 | 0.100 |
| Politicians Base | Contextual Questions | 120 | 0.000 | 0.050 | 0.050 |
| Politicians Base | Cloze Prompts | 120 | 0.000 | 0.017 | 0.017 |
| Politicians Base | Continuations | 120 | 0.000 | 0.033 | 0.033 |
| Politicians Base | Fewshot | 120 | 0.000 | 0.067 | 0.067 |
| Politicians Alias | Direct Questions | 240 | 0.000 | 0.046 | 0.046 |
| Politicians Alias | Paraphrased Questions | 240 | 0.000 | 0.042 | 0.042 |
| Politicians Alias | Contextual Questions | 240 | 0.000 | 0.067 | 0.067 |
| Politicians Alias | Cloze Prompts | 240 | 0.000 | 0.033 | 0.033 |
| Politicians Alias | Continuations | 240 | 0.000 | 0.075 | 0.075 |
| Politicians Alias | Fewshot | 240 | 0.008 | 0.046 | 0.046 |
| Politicians Noise | Direct Questions | 180 | 0.006 | 0.000 | 0.000 |
| Politicians Noise | Paraphrased Questions | 180 | 0.000 | 0.000 | 0.000 |
| Politicians Noise | Contextual Questions | 180 | 0.000 | 0.000 | 0.000 |
| Politicians Noise | Cloze Prompts | 180 | 0.000 | 0.000 | 0.000 |
| Politicians Noise | Continuations | 180 | 0.000 | 0.000 | 0.000 |
| Politicians Noise | Fewshot | 180 | 0.000 | 0.000 | 0.000 |
| Politicians Collision | Direct Questions | 160 | 0.000 | 0.050 | 0.050 |
| Politicians Collision | Paraphrased Questions | 160 | 0.000 | 0.100 | 0.100 |
| Politicians Collision | Contextual Questions | 160 | 0.000 | 0.081 | 0.081 |
| Politicians Collision | Cloze Prompts | 160 | 0.000 | 0.019 | 0.019 |
| Politicians Collision | Continuations | 160 | 0.000 | 0.075 | 0.075 |
| Politicians Collision | Fewshot | 160 | 0.000 | 0.044 | 0.044 |
| Sports Base | Direct Questions | 100 | 0.020 | 0.010 | 0.010 |
| Sports Base | Paraphrased Questions | 100 | 0.000 | 0.010 | 0.010 |
| Sports Base | Contextual Questions | 100 | 0.000 | 0.010 | 0.010 |
| Sports Base | Cloze Prompts | 100 | 0.000 | 0.010 | 0.010 |
| Sports Base | Continuations | 100 | 0.000 | 0.010 | 0.010 |
| Sports Base | Fewshot | 100 | 0.000 | 0.000 | 0.000 |
| Sports Alias | Direct Questions | 200 | 0.000 | 0.000 | 0.000 |
| Sports Alias | Paraphrased Questions | 200 | 0.000 | 0.000 | 0.000 |
| Sports Alias | Contextual Questions | 200 | 0.000 | 0.000 | 0.000 |
| Sports Alias | Cloze Prompts | 200 | 0.000 | 0.000 | 0.000 |
| Sports Alias | Continuations | 200 | 0.000 | 0.000 | 0.000 |
| Sports Alias | Fewshot | 200 | 0.005 | 0.000 | 0.000 |
| Sports Noise | Direct Questions | 160 | 0.013 | 0.044 | 0.056 |
| Sports Noise | Paraphrased Questions | 160 | 0.000 | 0.037 | 0.037 |
| Sports Noise | Contextual Questions | 160 | 0.000 | 0.037 | 0.037 |
| Sports Noise | Cloze Prompts | 160 | 0.000 | 0.037 | 0.037 |
| Sports Noise | Continuations | 160 | 0.000 | 0.050 | 0.050 |
| Sports Noise | Fewshot | 160 | 0.000 | 0.050 | 0.050 |
| Sports Collision | Direct Questions | 158 | 0.013 | 0.013 | 0.019 |
| Sports Collision | Paraphrased Questions | 158 | 0.000 | 0.032 | 0.032 |
| Sports Collision | Contextual Questions | 158 | 0.000 | 0.025 | 0.025 |
| Sports Collision | Cloze Prompts | 158 | 0.000 | 0.032 | 0.032 |
| Sports Collision | Continuations | 158 | 0.000 | 0.032 | 0.032 |
| Sports Collision | Fewshot | 158 | 0.006 | 0.057 | 0.057 |
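Because Table 6 reports proportions alongside a paired count, each rate can be read back as an approximate number of deletions. The sketch below is a reading aid, not part of the paper's pipeline. It assumes the denominator of every rate is the row's paired count. The helper names `implied_count` and `is_consistent` are hypothetical, and the three rows are copied from Table 6.

```python
def implied_count(rate: float, paired_count: int) -> int:
    """Nearest integer count for a rate, ASSUMING the denominator is the paired count."""
    return round(rate * paired_count)


def is_consistent(rate: float, paired_count: int, decimals: int = 3) -> bool:
    """Check that the implied count reproduces the reported rate at 3-decimal precision."""
    count = implied_count(rate, paired_count)
    half_step = 0.5 * 10 ** -decimals  # largest error that rounding to 3 decimals allows
    return abs(count / paired_count - rate) <= half_step + 1e-9


# (database, prompt set, paired count, L(f), R(f), retrieval artifact rate), from Table 6
rows = [
    ("Countries Noise", "Direct Questions", 180, 0.000, 0.456, 0.456),
    ("Politicians Alias", "Fewshot", 240, 0.008, 0.046, 0.046),
    ("Sports Collision", "Direct Questions", 158, 0.013, 0.013, 0.019),
]

for database, prompt_set, n, leak, retrieval, artifact in rows:
    counts = [implied_count(x, n) for x in (leak, retrieval, artifact)]
    ok = all(is_consistent(x, n) for x in (leak, retrieval, artifact))
    print(f"{database} / {prompt_set}: n={n}, counts (L, R, artifact)={counts}, consistent={ok}")
```

Under this assumption, an $R(f)$ of 0.456 on 180 paired deletions corresponds to 82 deletions, and an $L(f)$ of 0.008 on 240 corresponds to 2.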
