The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

arXiv cs.AI 05/27/26, 04:00 AM Papers
Summary
Proposes Computational Reality Monitoring to detect when language models rely on pretraining memory rather than retrieved context, addressing the attribution blind spot in retrieval-augmented generation.
arXiv:2605.26778v1 Announce Type: new
# Detecting When Language Models Rely on Memory Rather Than Retrieved Context
Source: [https://arxiv.org/html/2605.26778](https://arxiv.org/html/2605.26778)
## The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Zhe Yu (2)\*, Wenpeng Xing (1,2)\*, Yunzhao Wei (2), Bo Yang (3), Chen Ye (4), Gaolei Li (5), Meng Han (1,2,6)

(1) Zhejiang University; (2) Binjiang Institute of Zhejiang University; (3) National Fintech Evaluation Center; (4) Hangzhou Dianzi University; (5) Shanghai Jiao Tong University; (6) GenTel.io. \* Equal contribution.

###### Abstract

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation, a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model’s pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the *attribution blind spot* and introduce *Computational Reality Monitoring* (CRM) to address it. CRM operationalizes a principle adapted from cognitive science’s reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.


## 1 Introduction

Retrieval-augmented generation Lewis et al. ([2020](https://arxiv.org/html/2605.26778#bib.bib1)); Guu et al. ([2020](https://arxiv.org/html/2605.26778#bib.bib26)); Borgeaud et al. ([2022](https://arxiv.org/html/2605.26778#bib.bib27)) has become a standard paradigm for grounding language model outputs in external knowledge. The operating assumption is straightforward: if a model receives a relevant document as context, it will use that document to inform its generation. This assumption underpins deployed systems in search, customer support, and medical QA, where faithful grounding is treated as a safety property.

This assumption is systematically unverifiable from outputs alone. When the retrieved document overlaps with the model’s training data (common, given that retrieval corpora often cover pretraining sources), the model may default to parametric memory rather than external context. The generated text appears context-grounded while the true computational pathway is parametric. We term this the *attribution blind spot*: output-level monitoring cannot distinguish read-from-context from recalled-from-parameters when both produce equally plausible text.

Prior work approaches this blind spot from the wrong level of analysis. Membership inference attacks Shokri et al. ([2017](https://arxiv.org/html/2605.26778#bib.bib4)); Carlini et al. ([2021](https://arxiv.org/html/2605.26778#bib.bib5)); Shi et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib6)); Duan et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib29)) ask whether a document was seen during training: a static question, not a dynamic one about whether that document drives a specific generation. RAG faithfulness metrics Liu et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib2)); Niu et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib11)); Liu et al. ([2023](https://arxiv.org/html/2605.26778#bib.bib25)) detect context-memory *conflict* where outputs visibly contradict the provided context; our setting is harder because both sources produce identical surface text. Citation benchmarks Bohnet et al. ([2022](https://arxiv.org/html/2605.26778#bib.bib10)); Es et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib9)) evaluate whether models *claim* to use context. The common blind spot: output-level signals cannot distinguish a model that read from context from one that recalled from memory when both produce identical text.

We address this gap with *Computational Reality Monitoring* (CRM), adapted from cognitive science’s reality monitoring Johnson et al. ([1993](https://arxiv.org/html/2605.26778#bib.bib7)). Human reality monitoring distinguishes perceived from internally-generated memories by comparing sensory detail, contextual information, and cognitive operations Johnson et al. ([1993](https://arxiv.org/html/2605.26778#bib.bib7)). CRM operationalizes this logic for language model generation: it compares the model’s internal representations with the retrieved context versus without, treating the representational divergence as a diagnostic signal. The core insight: membership-conditioned differences live in the *gap* between context-conditioned and unconditioned computation, not in either pathway alone.

CRM detects membership-conditioned representational divergence: whether internal states differ when the provided context is a document the model was exposed to during pretraining (member) versus one it was not (non-member). We emphasize that CRM does not certify source use for individual generations; membership is a necessary but not sufficient condition for source attribution, creating the *possibility* of parametric generation, not a guarantee of it (see Section [5](https://arxiv.org/html/2605.26778#S5) and Appendix [A](https://arxiv.org/html/2605.26778#A1)). CRM establishes a measurable internal signal that future source-attribution systems could build upon: a diagnostic substrate, not a final verifier. Our contributions are:

1. The attribution blind spot: We formalize the failure mode where parametric memory and retrieved context agree on the surface, making output-level source attribution impossible.
2. Architecture-dependent layer localization with causal evidence: Across nine model variants, membership-conditioned signals localize non-monotonically in three architecture-dependent patterns (bimodal, mid-layer, scattered-late). Block-level noise injection provides causal evidence that CRM-identified blocks contribute to preserving membership-conditioned information, validating a distributed-encoding hypothesis for architectures where single-layer perturbation had no effect.
3. Direction design and evidentiary bounds: A supervised mean-difference direction universally improves over unsupervised PC1 ($\Delta$AUC +0.024 to +0.144). CRM generalizes across tasks (summarization, QA) and datasets (BookMIA AUC 0.84–0.97), withstands same-topic control, and collapses on domain-confounded benchmarks (MIMIR), establishing boundary conditions. CRM-LTS is competitive with gradient-based, attention-based, and logit-lens baselines while uniquely supporting layer-localized causal interpretation.
4. Deployment prototype: A FastAPI audit server with a real-time trajectory dashboard demonstrates that CRM’s compact scalar-per-layer signature enables low-latency deployment auditing.

## 2 Computational Reality Monitoring

#### Problem formulation.

Let $\mathcal{M}$ be a model with parameters $\theta$ pretrained on $\mathcal{D}_{\text{train}}$. For query $q$ with retrieved context $c$, $\mathcal{M}$ produces $y_0 = \mathcal{M}(q)$ (no-context) and $y_c = \mathcal{M}(c, q)$ (with-context). CRM detects whether $c \in \mathcal{D}_{\text{train}}$ (member-conditioned) versus $c \notin \mathcal{D}_{\text{train}}$ (non-member-conditioned) by comparing internal representations across these two conditions. Membership is an experimental proxy that creates the *possibility* of parametric generation (Appendix [A](https://arxiv.org/html/2605.26778#A1)).

#### Three-level framework.

CRM examines internal representations at three levels. Level 1 (black-box) measures BGE-M3 Chen et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib16)) embedding distance between with/no-context generations Reimers and Gurevych ([2019](https://arxiv.org/html/2605.26778#bib.bib30)). Level 2 (grey-box) computes per-step KL divergence from the LM head, aggregated into five statistics. Level 3 (white-box), CRM’s core, probes hidden states.

#### Latent Trajectory Shift (LTS).

For target layers $\mathcal{L}$ selected by PCA variance ratio ($> 0.01$), we extract hidden states $h_\ell^0$ and $h_\ell^c$ at the last token position. We reserve $n_{\text{cal}} = 100$ samples for computing PC1 directions $v_\ell$ via SVD on displacement vectors $d_\ell = h_\ell^c - h_\ell^0$; all $N = 250$ samples are then used for feature extraction and evaluation under 5-fold stratified CV. The signed scalar projection

$$\text{LTS}_\ell = \langle h_\ell^c - h_\ell^0,\; v_\ell \rangle \tag{1}$$

captures directional displacement along $v_\ell$. For the supervised direction $v_{\text{sup}}$ (Section [4.5](https://arxiv.org/html/2605.26778#S4.SS5)), the same calibration subset is used to compute the mean-difference direction. Using PC1 per layer yields 9–22 compact trajectory features. The unified feature vector $\Phi = [\Phi_{\text{L1}}, \Phi_{\text{L2}}, \Phi_{\text{L3}}]$ is evaluated with logistic regression Pedregosa et al. ([2011](https://arxiv.org/html/2605.26778#bib.bib22)) and XGBoost Chen and Guestrin ([2016](https://arxiv.org/html/2605.26778#bib.bib21)) under 5-fold stratified CV. Full equations and L1/L2 definitions are in Appendix [P](https://arxiv.org/html/2605.26778#A16).
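The projection in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released code: it assumes last-token hidden states for one layer are already available as arrays of shape `(n_samples, hidden_dim)`, the function names `pc1_direction` and `lts` are hypothetical, mean-centering before the SVD is an assumption (the text only says "SVD on displacement vectors"), and the data is synthetic.

```python
import numpy as np

def pc1_direction(displacements, center=True):
    """Unit-norm first principal direction of (n_samples, hidden_dim) displacements."""
    d = np.asarray(displacements, dtype=np.float64)
    if center:  # assumption: standard PCA centering
        d = d - d.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    return vt[0]

def lts(h_ctx, h_noctx, direction):
    """Signed scalar projection <h^c - h^0, v> per sample, as in Eq. (1)."""
    return (np.asarray(h_ctx) - np.asarray(h_noctx)) @ direction

# Synthetic stand-in for one layer: displacement dominated by a single axis.
rng = np.random.default_rng(0)
hidden_dim, n_cal, n_total = 16, 100, 250
true_dir = np.zeros(hidden_dim)
true_dir[0] = 1.0
h_noctx = rng.normal(size=(n_total, hidden_dim))
shift = rng.normal(scale=5.0, size=(n_total, 1))
h_ctx = h_noctx + shift * true_dir + 0.01 * rng.normal(size=(n_total, hidden_dim))

v = pc1_direction(h_ctx[:n_cal] - h_noctx[:n_cal])  # fit on the calibration subset
features = lts(h_ctx, h_noctx, v)                   # one scalar per sample
```

Because the sign of a singular vector is arbitrary, the sign of each LTS feature is arbitrary too; a downstream classifier has to learn the orientation.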

![Refer to caption](https://arxiv.org/html/2605.26778v1/x1.png)

Figure 1: Computational Reality Monitoring framework. CRM compares paired no-context and with-context generations, extracts sequence-, token-, and latent-level divergence features, and uses the resulting feature vector to detect membership-conditioned representational divergence. This is a diagnostic signal for whether pretraining exposure reshapes the model’s internal computation, not a per-generation source verifier.

## 3 Experimental Setup

We adopt a controlled diagnostic design: continuation probing isolates context-memory interaction and eliminates confounds such as query formulation and instruction following.

Models and data. Nine Transformer Vaswani et al. ([2017](https://arxiv.org/html/2605.26778#bib.bib17)) variants: Llama-3.1-8B/Instruct AI@Meta ([2024](https://arxiv.org/html/2605.26778#bib.bib18)), Mistral-7B-v0.3/Instruct Jiang et al. ([2023](https://arxiv.org/html/2605.26778#bib.bib19)), and Qwen2.5 (7B/7B-Inst/14B/14B-Inst/32B-Inst) Qwen et al. ([2025](https://arxiv.org/html/2605.26778#bib.bib20)). WikiMIA Shi et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib6)): 250 balanced samples/model (128-token passages; members from Wikipedia dumps before 2017-03-20, non-members from after 2018-02-01). Cross-dataset: BookMIA Shi et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib6)) (Books3 domain split; Appendix [M](https://arxiv.org/html/2605.26778#A13)). Negative control: MIMIR Pile-Wikipedia split Duan et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib29)) (Appendix [N](https://arxiv.org/html/2605.26778#A14)). Target layers $\mathcal{L}$ are selected by PCA variance ratio $> 0.01$ on calibration-set displacement vectors, filtering out near-zero-variance layers; this yields 9–22 layers per model (Table [1](https://arxiv.org/html/2605.26778#S4.T1), L3 dim column).

Baselines. Three tiers: (1) Black-box likelihood (PPL, Zlib-PPL Carlini et al. ([2021](https://arxiv.org/html/2605.26778#bib.bib5)), Min-K% Prob Shi et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib6))); (2) Access-matched (single-layer LTS, mean LTS, L1+L2 only); (3) Raw hidden-state probes. Full details: Appendix [Q](https://arxiv.org/html/2605.26778#A17).
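For orientation, the three Tier-1 likelihood scores can be sketched from a list of per-token log-probabilities. This is a hedged illustration of the commonly used definitions, not the paper's exact implementation (which is in its Appendix Q): the function names are hypothetical, the zlib variant here divides log-perplexity by the compressed length in bytes (an assumption; other normalizations exist), and `k=0.2` is an illustrative default.

```python
import math
import zlib

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def zlib_ratio(token_logprobs, text):
    """Log-perplexity normalized by zlib-compressed size of the text (bytes)."""
    log_ppl = -sum(token_logprobs) / len(token_logprobs)
    return log_ppl / len(zlib.compress(text.encode("utf-8")))

def min_k_prob(token_logprobs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical log-probabilities for a four-token passage.
logprobs = [-0.5, -3.0, -1.0, -1.5]
ppl = perplexity(logprobs)
ratio = zlib_ratio(logprobs, "a short example passage")
mink = min_k_prob(logprobs, k=0.5)
```

All three scores look only at the document's own token likelihoods; none uses the with-context versus no-context contrast.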

Evaluation. 5-fold stratified CV (seed 42), ROC-AUC with 95% bootstrap CIs. Controls: label permutation, prompt randomization (4 templates), same-topic (BGE-M3 similarity-matched non-members, $n \approx 140$). Methodology: Appendix [D](https://arxiv.org/html/2605.26778#A4).
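As a rough sketch of the reporting metric, ROC-AUC with a percentile bootstrap interval can be computed as follows. The function names, the number of resamples, and the choice of a percentile (rather than, say, BCa) interval are assumptions for illustration; the paper's exact procedure is described in its Appendix D.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), via pairwise comparison."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=42):
    """Point AUC plus a percentile bootstrap (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=np.float64)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue  # skip resamples that contain only one class
        stats.append(roc_auc(labels[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return roc_auc(labels, scores), float(lo), float(hi)

# Synthetic scores: members shifted upward by one standard deviation.
rng = np.random.default_rng(0)
y = np.arange(250) % 2
s = rng.normal(size=250) + y
auc, ci_lo, ci_hi = bootstrap_auc_ci(y, s, n_boot=200)
```

The pairwise AUC is quadratic in sample count, which is fine at the 250-sample scale used here but would need a rank-based version for large sets.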

## 4 Results

The attribution blind spot predicts a specific empirical signature: if output-level monitors cannot distinguish generation conditions when both pathways produce plausible text, then the discriminative signal must reside elsewhere, in the model’s internal computation. CRM directly tests this prediction by comparing representations with and without context. We present five lines of evidence: (i) CRM consistently separates member-conditioned from non-member-conditioned generation while surface baselines remain near chance; (ii) the signal is latent-dominated, with L1+L2 contributing negligibly; (iii) the signal survives same-topic control, label permutation, and prompt randomization; (iv) membership-conditioned divergence localizes in architecture-dependent layer patterns; and (v) directional displacement (CRM-LTS) and isotropic magnitude (L2) carry qualitatively distinct information.

### 4.1 Main Results

Table 1: Likelihood-based baselines fail to separate source conditions (AUC 0.55–0.60), while CRM consistently distinguishes member-conditioned from non-member-conditioned generation across all nine models (AUC 0.71–0.95). Best Likelihood BL = highest AUC among three Tier-1 token-likelihood baselines (PPL, Zlib-PPL, Min-K% Prob), all operating on document text alone *without the generation contrast* that CRM exploits. 95% bootstrap CIs in brackets. Gain = CRM-LR − Best Likelihood BL. L3 dim = number of target-layer LTS features.

Table [1](https://arxiv.org/html/2605.26778#S4.T1) shows the main results. Baselines remain near chance (AUC 0.55–0.60), confirming token-level memorization signals are weak in modern LLMs. CRM consistently separates conditions (AUC 0.71–0.95, gain +0.13 to +0.38), with a family-level gradient: Qwen (mean 0.873) > Mistral (0.834) > Llama (0.743). Logistic regression matches or exceeds XGBoost, aligning with findings on linear representation of high-level concepts Park et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib24)); Zou et al. ([2023](https://arxiv.org/html/2605.26778#bib.bib14)); Templeton et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib31)); Bricken et al. ([2023](https://arxiv.org/html/2605.26778#bib.bib32)).

### 4.2 Latent Signal and Robustness

Table [2](https://arxiv.org/html/2605.26778#S4.T2) reports a defining empirical finding: removing all surface features (L1+L2) changes LR AUC by less than 0.01 across nine models (mean $|\Delta| = 0.006$). Membership-conditioned divergence is almost entirely latent; output-accessible signals alone are insufficient for detection.

Table 2: Removing surface features (L1+L2) changes AUC by < 0.01; the signal is latent. Full results: Appendix [S](https://arxiv.org/html/2605.26778#A19).

![Refer to caption](https://arxiv.org/html/2605.26778v1/x2.png)

Figure 2: CRM consistently exceeds likelihood and surface baselines. CRM-LR AUC (coral), L1+L2 surface features (gray), best likelihood baseline (light gray). Error bars: 95% CI.

Table 3: CRM’s diagnostic signal withstands three robustness challenges. Prompt variation: AUC std < 0.02 across four templates. Label permutation: AUC returns to 0.50 ± 0.05. Same-topic control (CRM-LTS, $n \approx 140$): AUC largely preserved ($\Delta$ within ±0.06) despite 1.6× tighter topic matching. ST CRM-LR = same-topic CRM-LTS logistic regression AUC. Full methodology and per-template prompt breakdown in Appendix [D](https://arxiv.org/html/2605.26778#A4).

![Refer to caption](https://arxiv.org/html/2605.26778v1/x3.png)

Figure 3: Robustness controls rule out topic familiarity, classifier artifacts, and prompt artifacts. (a) Same-topic control (CRM-LTS): AUC largely preserved ($\Delta$ within ±0.06). (b) Label permutation: AUC returns to chance. (c) Prompt randomization: $\sigma < 0.02$. Dashed line: AUC = 0.50.

Three robustness checks rule out alternative explanations (Table [3](https://arxiv.org/html/2605.26778#S4.T3); full methodology: Appendix [D](https://arxiv.org/html/2605.26778#A4)). Same-topic control: semantically similar non-members (BGE-M3 similarity 0.51 vs. 0.32) yield CRM-LTS AUC within ±0.06 across three models (Qwen: −0.004, Mistral: +0.020, Llama: −0.058), ruling out topic familiarity. Label permutation returns AUC to 0.50 ± 0.05. Prompt randomization across four templates yields AUC std < 0.02.

A cross-task pilot replacing continuation with summarization preserves the CRM signal for Qwen ($\Delta = +0.007$) with partial transfer for Mistral ($\Delta = -0.125$, AUC 0.744; Appendix [L](https://arxiv.org/html/2605.26778#A12)).

### 4.3 Layer-wise Analysis

Table 4: Membership-conditioned divergence is non-monotonic and architecture-dependent. Single-layer LTS AUC peaks at different stages depending on model family. The Qwen-14B-Inst bimodal pattern (peaks at L6 and L21, trough at L27 below chance) shows divergence appearing, disappearing, and reappearing, which is inconsistent with uniform depth-based hypotheses. 50-sample diagnostic subsets; values not directly comparable to 250-sample multi-layer results.

Table [4](https://arxiv.org/html/2605.26778#S4.T4) summarizes the layer sweep (full per-layer breakdown: Appendix [H](https://arxiv.org/html/2605.26778#A8)). Membership-conditioned divergence is non-monotonically related to depth, exhibiting three architecture-dependent patterns: bimodal (Qwen2.5-14B-Inst: peaks at L6/L21, trough at L27 below chance), mid-layer concentration (Mistral-7B: L18/0.892; Qwen2.5-7B: L10/0.902), and scattered late (Llama-3.1-8B: L28/0.753, distributed). The below-chance trough at Qwen L27 suggests active suppression. Leave-one-layer-out ablation confirms no single layer is uniquely informative ($|\Delta|$AUC < 0.007; Appendix [K](https://arxiv.org/html/2605.26778#A11)), indicating redundant encoding.
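A per-layer sweep of this kind can be approximated with a small cross-validated probe. The sketch below is a simplified stand-in, not the paper's procedure: instead of logistic regression it orients each single-layer LTS feature by the sign of the train-fold mean difference and scores the held-out fold by AUC. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def auc(labels, scores):
    """Pairwise AUC: P(pos > neg) + 0.5 * P(tie)."""
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def stratified_folds(labels, k=5, seed=42):
    """Assign each sample a fold id so every fold keeps the class balance."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold[idx] = np.arange(len(idx)) % k
    return fold

def layer_sweep(lts, labels, k=5, seed=42):
    """Mean held-out AUC of each single-layer LTS feature (columns of lts)."""
    lts = np.asarray(lts, dtype=np.float64)
    labels = np.asarray(labels)
    fold = stratified_folds(labels, k, seed)
    out = np.zeros(lts.shape[1])
    for layer in range(lts.shape[1]):
        x = lts[:, layer]
        fold_aucs = []
        for f in range(k):
            tr, te = fold != f, fold == f
            gap = x[tr][labels[tr] == 1].mean() - x[tr][labels[tr] == 0].mean()
            sign = 1.0 if gap >= 0 else -1.0  # orientation learned on train fold only
            fold_aucs.append(auc(labels[te], sign * x[te]))
        out[layer] = np.mean(fold_aucs)
    return out

# Synthetic example: only layer index 1 carries a membership signal.
rng = np.random.default_rng(0)
y = np.arange(200) % 2
X = rng.normal(size=(200, 3))
X[:, 1] += 3.0 * y
per_layer_auc = layer_sweep(X, y)
```

With the orientation fixed on the training folds, a held-out AUC below 0.5 means the learned orientation failed to transfer, which is one way a below-chance layer can arise.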

![Refer to caption](https://arxiv.org/html/2605.26778v1/x4.png)

Figure 4: Layer-wise CRM-LTS single-probe AUC reveals architecture-dependent patterns. Qwen bimodal (L6/L21), Mistral mid-layer (L18), Qwen-7B early (L10), Llama scattered-late (L28). Bands: ±1 std. Dashed: AUC = 0.50.

### 4.4 Causal Evidence: Block-Level Noise Injection

Single-layer noise injection (Appendix [J](https://arxiv.org/html/2605.26778#A10)) revealed a puzzle: Qwen2.5-14B-Inst L6 showed catastrophic degradation ($\Delta$AUC −0.300), but Llama-3.1-8B L28 was near-immune (−0.001). The *distributed-encoding hypothesis* resolves this: membership information is spread across a block such that perturbing any single layer leaves sufficient signal elsewhere for recovery. We test this by simultaneously injecting noise ($\mathbf{h}' = \mathbf{h} + \varepsilon \cdot \sigma(\mathbf{h}) \cdot \mathcal{N}(0, 1)$) across architecture-specific blocks informed by per-layer AUC patterns (Table [4](https://arxiv.org/html/2605.26778#S4.T4)): Llama `scattered_late` L25–L31 (all layers AUC > 0.5), Mistral `mid_cluster` L14–L22 (±4 around L18), Qwen `early_peak` L4–L8 and `late_peak` L19–L23 (the two bimodal peaks). `early_control` L0–L7/8 serves as a universal baseline.
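The noise rule above can be sketched on cached activations. This is only an offline illustration under assumptions: $\sigma(\mathbf{h})$ is taken as the per-token standard deviation over the hidden dimension (the text does not specify the axis), the names `inject_noise` and `perturb_block` are hypothetical, and a real intervention would apply the perturbation inside the forward pass (for example via hooks) so that it propagates to later layers.

```python
import numpy as np

def inject_noise(h, eps, rng):
    """h' = h + eps * sigma(h) * N(0, 1), applied at every token position."""
    h = np.asarray(h, dtype=np.float64)
    sigma = h.std(axis=-1, keepdims=True)  # assumption: std over hidden dim per token
    return h + eps * sigma * rng.standard_normal(h.shape)

def perturb_block(hidden_by_layer, block, eps, seed=0):
    """Perturb all layers in `block` simultaneously; leave other layers untouched."""
    rng = np.random.default_rng(seed)
    return {
        layer: inject_noise(h, eps, rng) if layer in block else h
        for layer, h in hidden_by_layer.items()
    }

# Synthetic cache: 4 layers, 8 token positions, hidden size 256.
rng = np.random.default_rng(0)
states = {layer: rng.normal(size=(8, 256)) for layer in range(4)}
perturbed = perturb_block(states, block={1, 2}, eps=0.5, seed=1)
```

Scaling by $\sigma(\mathbf{h})$ keeps $\varepsilon$ comparable across layers whose activations have different magnitudes.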

Table 5: Block-level noise injection provides causal evidence that CRM-identified blocks contribute to preserving membership-conditioned information. Target blocks show selective degradation; early control blocks produce catastrophic collapse.

Table [5](https://arxiv.org/html/2605.26778#S4.T5) reports four findings. (1) Distributed-encoding supported: Llama’s `scattered_late` block, selected because single-layer L28 showed near-zero effect, shows measurable degradation under simultaneous perturbation ($\Delta$AUC −0.021 at $\varepsilon = 0.5$, about 20× the single-layer effect). (2) Architecture-dependent selectivity: Qwen’s `early_peak` catastrophically collapses (−0.408) while `late_peak` degrades mildly (−0.032 to −0.054), confirming the bimodal peaks serve distinct causal roles. Mistral’s `mid_cluster` shows the strongest relative target-block effect (−0.045). (3) Early layers are causally foundational: `early_control` produces catastrophic collapse across all three models (−0.26 to −0.41), providing a positive control. (4) Why patching fails: Activation patching, which replaces $h_\ell^c$ at the last token with $h_\ell^0$, yields negligible $\Delta$AUC (< 0.01). Residual connections allow downstream recovery after single-position patching; noise injection perturbs all token positions simultaneously, preventing recovery.

These results indicate that CRM-identified layers are not merely diagnostic but contribute to carrying membership-conditioned information. The early-control collapse demonstrates that early-layer computation is foundational to all downstream representation, making it a weaker test for selective encoding. The architecture-dependent target-block effect sizes mirror the layer-localization patterns (Table [4](https://arxiv.org/html/2605.26778#S4.T4)), and the distributed-encoding hypothesis receives direct support.

### 4.5 Supervised vs. Unsupervised Direction

CRM-LTS projects displacement vectors onto PC1, an unsupervised direction maximizing explained variance. PCA optimizes reconstruction, not discriminability. We test a supervised alternative: for each layer $\ell$, compute the mean-difference direction

$$v_{\text{sup}}^{(\ell)} = \frac{\mathbb{E}[d_\ell \mid \text{member}] - \mathbb{E}[d_\ell \mid \text{non-member}]}{\lVert \mathbb{E}[d_\ell \mid \text{member}] - \mathbb{E}[d_\ell \mid \text{non-member}] \rVert_2} \tag{2}$$

and define $\text{LTS}_{\text{sup}}^{(\ell)} = \langle h_\ell^c - h_\ell^0,\; v_{\text{sup}}^{(\ell)} \rangle$. Both directions use the same 100-sample calibration set and 5-fold CV.
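Eq. (2) amounts to a normalized difference of class means. A minimal NumPy sketch follows, assuming per-layer displacement vectors $d_\ell$ are stacked as rows and a boolean membership mask is available for the calibration subset; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def mean_difference_direction(displacements, is_member):
    """Unit vector from the non-member mean to the member mean of d_l (Eq. 2)."""
    d = np.asarray(displacements, dtype=np.float64)
    m = np.asarray(is_member, dtype=bool)
    diff = d[m].mean(axis=0) - d[~m].mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("class means coincide; direction is undefined")
    return diff / norm

# Synthetic displacements: members are offset along one hidden coordinate.
rng = np.random.default_rng(0)
dim, n_total, n_cal = 32, 250, 100
offset = np.zeros(dim)
offset[3] = 2.0
is_member = np.arange(n_total) % 2 == 0
d = rng.normal(size=(n_total, dim)) + is_member[:, None] * offset

v_sup = mean_difference_direction(d[:n_cal], is_member[:n_cal])  # calibration subset
lts_sup = d @ v_sup                                              # supervised LTS per sample
```

Unlike PC1, this direction has a fixed orientation (members project higher on average), but it requires membership labels for the calibration samples.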

Table 6: Supervised mean-difference direction universally outperforms unsupervised PC1. Both directions use identical per-layer LTS projection; only the projection vector differs. $\Delta$AUC = Supervised − PC1. Mistral-7B shows the largest gain (+0.144), confirming that PCA variance maximization can substantially misalign with the discriminative direction. Even for Llama and Qwen, where PC1 already achieves strong AUC, the supervised direction provides a consistent improvement. ± = 1 std across CV folds.

Table [6](https://arxiv.org/html/2605.26778#S4.T6) reports the comparison on three models. The supervised direction universally improves over PC1 ($\Delta$AUC +0.024 to +0.144), with architecture-dependent gains: Mistral-7B gains +0.144 (17% relative), while Llama and Qwen gain +0.024 to +0.027. This mirrors the layer-localization typology: Mistral’s mid-layer-concentrated signal is most misaligned with PC1, while Llama’s distributed pattern happens to align PC1 closer to the discriminative direction. We retain PC1 as the default for its unsupervised property (no membership labels needed), noting that access to calibration labels unlocks a consistently stronger direction. This result also validates the PC rank finding (Appendix [I](https://arxiv.org/html/2605.26778#A9)): PCA does not optimize for the membership/non-membership axis.

### 4.6 Cross-Task and Cross-Dataset Generalization

CRM generalizes across generation tasks. Table [7](https://arxiv.org/html/2605.26778#S4.T7) evaluates CRM-LTS on continuation, summarization, and factoid QA across six models. Signal strength is task-dependent: Mistral-family models show amplified discriminability under summarization ($\Delta = +0.063$ to $+0.078$), while Qwen2.5-14B-Instruct shows near-equal performance (0.948–0.967). If CRM only detected ease of next-token prediction, summarization, which requires deeper integration, should show weaker, not stronger, discriminability. QA preserves strong signal across all architectures (AUC 0.83–0.97), ruling out continuation-specific artifacts.

Table 7: CRM generalizes across generation tasks. CRM-LTS LR AUC, 5-fold CV, 250 samples. $\Delta$Summ = Summarization − Continuation. Mistral-family models show summarization amplification. QA preserves AUC > 0.83 across all architectures. Full breakdown: Appendix [T](https://arxiv.org/html/2605.26778#A20).

CRM also generalizes across datasets. On BookMIA Shi et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib6)), where membership is defined by domain (books in vs. out of Books3), CRM-LTS achieves AUC 0.84–0.97 (Table [8](https://arxiv.org/html/2605.26778#S4.T8)), matching or exceeding WikiMIA performance. QA format amplifies the signal further (AUC 0.96–0.98). In contrast, MIMIR’s Pile-Wikipedia split Duan et al. ([2024](https://arxiv.org/html/2605.26778#bib.bib29)), where membership is confounded with corpus origin, yields chance-level AUC (0.48–0.55; Appendix [N](https://arxiv.org/html/2605.26778#A14)). This failure reveals a boundary condition: CRM requires member and non-member populations from comparable distributions; when the split is domain-driven, the signal is undetectable.

Table 8: CRM generalizes across datasets. CRM-LTS LR AUC, 5-fold CV. CRM-LTS outperforms L2 on BookMIA ($\Delta = +0.09$ to $+0.16$). BookMIA-QA uses QA prompts. Full results: Appendix [M](https://arxiv.org/html/2605.26778#A13).

### 4.7 Additional Evidence and Boundary Conditions

We evaluate several alternative interpretations and establish evidentiary bounds (full results in appendices). Static MIA vs. generation contrast: CRM-LR exceeds standard MIA (final-layer document embeddings, 4,096–5,120 dims) on 7/9 models by +0.05–0.18 with 200–500× fewer features, confirming generation contrast adds signal beyond static membership (Appendix [G](https://arxiv.org/html/2605.26778#A7)). For Llama, MIA exceeds CRM, but raw probes confirm abundant membership-conditioned signal (AUC 0.93–0.95; Appendix [R](https://arxiv.org/html/2605.26778#A18)); CRM-LTS is a lossy summary for Llama’s distributed geometry.

Comparison with attribution baselines. CRM-LTS is competitive with gradient norm (chance, AUC 0.500), attention flow (mean 0.818), and logit lens (mean 0.859) while uniquely supporting layer-localized causal interpretation (Table [12](https://arxiv.org/html/2605.26778#A6.T12) in Appendix [F](https://arxiv.org/html/2605.26778#A6)). Logit lens is the strongest competitor (Qwen: 0.953), but provides no mechanism for identifying *which* computational stage encodes membership.

L2 norm vs. directional displacement. Isotropic magnitude ($\lVert \mathbf{h}_\ell^c - \mathbf{h}_\ell^0 \rVert_2$) achieves comparable mean AUC (0.812 vs. CRM 0.836) with identical dimensionality. However, L2 Multilayer exceeds CRM-LTS on 4/9 models (Llama-8B-Inst, Mistral-7B-v0.3, Qwen2.5-7B, Qwen2.5-14B; $\Delta$ −0.006 to −0.060), revealing that directional compression is not uniformly beneficial: for some architectures, magnitude alone is a more discriminative membership signal. Critically, L2 and CRM peak at different layers (L2 early, CRM mid-to-late) and diverge under same-topic control: Mistral-7B L2 degrades from 0.929 to 0.580 while CRM-LTS is preserved (Appendix [E](https://arxiv.org/html/2605.26778#A5)). L2 and CRM capture qualitatively distinct signals (magnitude vs. directional displacement), and their concatenation is complementary on 5/9 models.
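The difference between the two features is easy to see on a toy case. The sketch below uses hypothetical function names and hand-picked three-dimensional vectors (not model activations): two samples are displaced by the same amount in opposite directions, so the L2 magnitude cannot tell them apart while the signed projection can.

```python
import numpy as np

def l2_magnitude(h_ctx, h_noctx):
    """Isotropic displacement magnitude ||h^c - h^0||_2 per sample."""
    return np.linalg.norm(np.asarray(h_ctx) - np.asarray(h_noctx), axis=-1)

def directional_lts(h_ctx, h_noctx, v):
    """Signed displacement along direction v per sample."""
    return (np.asarray(h_ctx) - np.asarray(h_noctx)) @ v

v = np.array([1.0, 0.0, 0.0])                 # illustrative projection direction
h_noctx = np.zeros((2, 3))
h_ctx = np.array([[2.0, 0.0, 0.0],            # displaced along +v
                  [-2.0, 0.0, 0.0]])          # displaced along -v

mag = l2_magnitude(h_ctx, h_noctx)            # array([2., 2.])
proj = directional_lts(h_ctx, h_noctx, v)     # array([ 2., -2.])
```

Conversely, a displacement orthogonal to `v` would register in the magnitude but project to zero, which is one reason the two features can be complementary.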

PC1 interpretation. PC rank ablation reveals PC1 is not the optimal discriminative direction: Mistral-7B’s PC5 outperforms PC1 by 0.138 AUC (Appendix [I](https://arxiv.org/html/2605.26778#A9)). PC1 vocabulary back-projection through the LM head reveals layer-dependent semantic progression: early-layer PC1 captures subword fragments, late-layer PC1 converges to epistemic-stance markers (*Unfortunately*, *regret* for Qwen; *specific*, *exact* for Mistral; Appendix [U](https://arxiv.org/html/2605.26778#A21)).
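Vocabulary back-projection of a direction can be sketched as a matrix-vector product with the unembedding (LM head) weights followed by a top-k lookup. This is an illustrative simplification: the toy vocabulary and weights are hypothetical, the function name `top_tokens` is an assumption, and whether the paper applies the final normalization layer before the LM head is not stated here, so it is omitted.

```python
import numpy as np

def top_tokens(direction, unembed, vocab, k=5):
    """Tokens whose LM-head rows align most with `direction`.

    unembed: (vocab_size, hidden_dim) LM-head weight matrix.
    """
    scores = np.asarray(unembed) @ np.asarray(direction)  # one score per token
    order = np.argsort(scores)[::-1][:k]                   # highest scores first
    return [(vocab[i], float(scores[i])) for i in order]

# Hypothetical 4-token vocabulary with a 2-dimensional hidden space.
vocab = ["alpha", "beta", "gamma", "delta"]
unembed = np.array([
    [0.1, 0.0],
    [0.9, 0.2],
    [-0.8, 0.1],
    [0.0, 0.5],
])
direction = np.array([1.0, 0.0])
top = top_tokens(direction, unembed, vocab, k=2)
```

Because a principal direction is defined only up to sign, it is usually worth inspecting the tokens at both ends of the score ranking.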

## 5 Discussion

CRM is not designed for maximal AUC: raw probes achieve 0.93–0.99 with 15–20× more dimensions. CRM trades AUC headroom for interpretability: the scalar-per-layer trajectory enables architecture-specific layer localization, causal block-level validation, and PC1 semantic interpretation. The block-level noise-injection results (Section [4.4](https://arxiv.org/html/2605.26778#S4.SS4)) distinguish CRM from purely correlational probing: simultaneously perturbing architecture-specific target blocks produces selective, interpretable degradation, directly supporting the distributed-encoding hypothesis for Llama and providing converging evidence for architecture-dependent encoding patterns.

What CRM can and cannot tell us. CRM detects membership-conditioned representational divergence, that is, whether internal computation differs when the context was seen during pretraining. This is a necessary but not sufficient condition for source attribution. The proxy gap has three dimensions: (1) membership creates the possibility of parametric generation but does not guarantee it; (2) multiple results support membership as a meaningful signal (cross-dataset generalization, same-topic robustness, causal validation, cross-task generalization) but do not conclusively establish source attribution; (3) MIMIR’s chance-level result establishes a boundary: CRM fails when membership is confounded with corpus origin. Bridging the proxy gap requires external validation with ground-truth source labels and controlled source-attribution experiments.

All experiment scripts, processed features, PC1 basis vectors, and the deployment prototype are released under Apache 2.0 (code) and CC-BY 4.0 (data and features) licenses (Appendix [Y](https://arxiv.org/html/2605.26778#A25)).

## 6 Conclusion

We introduced Computational Reality Monitoring, a framework that detects membership-conditioned representational divergence by comparing internal representations with and without context. Across nine model variants, we established that: (i) the signal is latent-dominated; (ii) divergence localizes in three architecture-dependent layer patterns, with block-level noise injection providing causal evidence for involvement; (iii) a supervised direction universally improves over unsupervised PC1 ($\Delta$AUC +0.024–0.144); (iv) CRM generalizes across tasks and datasets while collapsing on domain-confounded benchmarks, establishing boundary conditions. The attribution blind spot is measurable and partially addressable: internal representations carry signals invisible at the output level. Closing the gap between membership-conditioned detection and verified source attribution is the central challenge ahead.

## 7 Limitations

The PC rank finding identifies a concrete improvement: replacing unsupervised PCA with supervised dimensionality reduction (PLS, LDA) could yield more discriminative trajectories. Controlled source-attribution experiments are needed to bridge the proxy gap between membership-conditioned detection and verified source attribution. The deployment prototype (Appendix [V](https://arxiv.org/html/2605.26778#A22)) demonstrates practical feasibility (mean latency 238 ms, p99 452 ms) and provides a foundation for field studies, but has not been tested in production environments. Additional limitations are discussed in Appendix [C](https://arxiv.org/html/2605.26778#A3).

## References

- The Llama 3 Herd of models. arXiv preprint arXiv:2407.21783.
- G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
- Y. Belinkov and J. Glass (2019) Analysis methods in neural language processing: a survey. TACL 7, pp. 49–72.
- S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal (2023) Pythia: a suite for analyzing large language models across training and scaling. In ICML. arXiv:2304.01373.
- B. Bohnet, V. Q. Tran, P. Verga, R. Aharoni, D. Andor, L. B. Soares, M. Ciaramita, J. Eisenstein, K. Ganchev, J. Herzig, K. Hui, T. Kwiatkowski, J. Ma, J. Ni, L. Sestorain Saralegui, T. Schuster, W. W. Cohen, M. Collins, D. Das, D. Metzler, S. Petrov, and K. Webster (2022) Attributed question answering: evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
- S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, and L. Sifre (2022) Improving language models by retrieving from trillions of tokens. In ICML.
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. L. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. In Transformer Circuits Thread.
- N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang (2023) Quantifying memorization across neural language models. In ICLR. arXiv:2202.07646.
- N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel (2021) Extracting training data from large language models. In USENIX Security.
- J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu (2024) M3-embedding: multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216.
- T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In KDD.
- M. Duan, A. Suri, N. Mireshghallah, S. Min, W. Shi, L. Zettlemoyer, Y. Tsvetkov, Y. Choi, D. Evans, and H. Hajishirzi (2024) Do membership inference attacks work on large language models? In COLM. arXiv:2402.07841.
- S. Es, J. James, L. Espinosa Anke, and S. Schockaert (2024) RAGAs: automated evaluation of retrieval augmented generation. In EACL System Demonstrations. arXiv:2309.15217.
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
- J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In NAACL.
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed (2023) Mistral 7B. arXiv preprint arXiv:2310.06825.
- M. K. Johnson, S. Hashtroudi, and D. S. Lindsay (1993) Source monitoring. Psychological Bulletin 114(1), pp. 3–28.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
- X. Li, Y. Cao, Y. Ma, and A. Sun (2025) Long Context vs. RAG for LLMs: an evaluation and revisits. arXiv preprint arXiv:2501.01880.
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. TACL 12, pp. 157–173. arXiv:2307.03172.
- N. F. Liu, T. Zhang, and P. Liang (2023) Evaluating verifiability in generative search engines. In EMNLP Findings. arXiv:2304.09848.
- J. Mattern, F. Mireshghallah, Z. Jin, B. Schölkopf, M. Sachan, and T. Berg-Kirkpatrick (2023) Membership inference attacks against language models via neighbourhood comparison. In ACL Findings. arXiv:2305.18462.
- K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022) Locating and editing factual associations in GPT. In NeurIPS. arXiv:2202.05262.
- K. Meng, A. Sen Sharma, A. Andonian, Y. Belinkov, and D. Bau (2023) Mass-editing memory in a transformer. In ICLR. arXiv:2210.07229.
- C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, and T. Zhang (2024) RAGTruth: a hallucination corpus for developing trustworthy retrieval-augmented language models. In ACL. arXiv:2401.00396.
- K. Park, Y. J. Choe, and V. Veitch (2024) The linear representation hypothesis and the geometry of large language models. In ICML. arXiv:2311.03658.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. arXiv:2412.15115, [Link](https://arxiv.org/abs/2412.15115).
- N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP. arXiv:1908.10084.
- W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer (2024) Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789.
- R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In IEEE S&P.
- A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, A. Tamkin, E. Durmus, T. Hume, F. Mosconi, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan (2024) Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. Anthropic.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS.
- A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks (2023) Representation engineering: a top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.

## Appendix A: Membership Proxy Caveat

This appendix elaborates the relationship between CRM’s empirical measurement \(membership\-conditioned representational divergence\) and the construct it is motivated by \(source attribution: distinguishing whether a model read from context or recalled from parameters\)\.

#### The proxy gap\.

Membership in $\mathcal{D}_{\text{train}}$ creates controlled experimental conditions but does not determine which source the model actually uses. A model faced with a member document as context may still process it from context rather than from parametric memory, since the context is literally present in the input. Conversely, a non-member document may contain passages semantically similar to training data, triggering parametric recall despite its held-out status. Membership provides a *probabilistic tilt* toward parametric generation for member documents and away from it for non-members. CRM detects whether this tilt creates detectable representational differences at the aggregate level. It does not certify which source governed any individual generation.

#### Why the proxy is still informative\.

Despite the gap, membership\-conditioned detection is a necessary first step toward source attribution\. If internal representations showed no difference between conditions where parametric generation is possible versus impossible, there would be no signal to attribute\. CRM establishes that such differences exist, localizes them to specific layer patterns, and rules out simpler explanations \(topic familiarity, surface features, static MIA\)\. These findings provide the evidentiary foundation for more direct attribution methods \(e\.g\., causal interventions at CRM\-identified layers\)\.

#### What CRM can tell us\.

\(1\) Whether a model’s internal computation differs in aggregate when the context is from pretraining versus held\-out\. \(2\) At which layers this divergence is strongest, enabling targeted architectural analysis\. \(3\) Whether the divergence survives robustness controls \(same\-topic, prompt variation, label permutation\)\. \(4\) Whether the divergence generalizes across datasets and partially across tasks\.

#### What CRM cannot tell us\.

\(1\) Whether any individual generation used context or memory—only aggregate differences between conditions\. \(2\) Whether a model that shows no divergence is actually “safe”—it may use context reliably, or it may use memory in ways that happen not to create detectable representational differences\. \(3\) The causal direction—does context\-induced computation differ because of membership, or is membership correlated with some other variable that creates representational divergence?

#### Closing the gap\.

The path from membership\-conditioned detection to verified source attribution requires: \(i\) ground\-truth source labels \(enforced context\-use vs\. enforced memory\-use through experimental design\); \(ii\) causal interventions \(e\.g\., activation patching at the layers CRM identifies as carrying membership\-conditioned information\); and \(iii\) calibration against human judgments of source reliance\. These are natural next steps beyond the current study\.

## Appendix B: Detailed Competing-Hypothesis Analysis

We provide expanded analysis for the three competing hypotheses summarized in Appendix [B](https://arxiv.org/html/2605.26778#A2).

H1: CRM detects membership, not source attribution. A standard MIA baseline using static document embeddings (Table [13](https://arxiv.org/html/2605.26778#A7.T13), Appendix [G](https://arxiv.org/html/2605.26778#A7)) yields architecture-dependent results. For Mistral and Qwen (7/9 models), CRM-LR exceeds MIA-LR by +0.05–0.18, confirming that generation contrast adds signal beyond static membership. For Llama, MIA-LR exceeds CRM-LR by +0.05–0.10, yet raw probes confirm abundant source information (AUC 0.93–0.95); CRM-LTS is a lossy summary for Llama’s distributed geometry, consistent with its scattered-late pattern (Table [4](https://arxiv.org/html/2605.26778#S4.T4)). Three findings constrain the pure-MIA interpretation: (1) same-topic control preserves substantial above-chance CRM-LTS discrimination (AUC 0.726–0.921, $\Delta$ within $\pm 0.06$); (2) CRM’s signal resides in latent trajectory shifts; (3) Tier-1 likelihood baselines achieve near-chance AUC (0.55–0.60). Membership-conditioned diagnostic evidence is stronger for Mistral/Qwen; for Llama, source information exists but is inefficiently captured by CRM’s compression, an architecture-dependent pattern with auditing-strategy implications.

H2: CRM detects topic familiarity. The same-topic control (full methodology in Appendix [D](https://arxiv.org/html/2605.26778#A4)) equates BGE-M3 cosine similarity (mean 0.51 vs. 0.32 for random pairs; $p < 10^{-4}$). CRM-LTS AUC drops by only 0.004–0.058 across three models, compared to 0.08–0.21 under L2-norm features. Topic familiarity accounts for at most $\sim$5–25% of the total CRM-LTS signal.

H3: CRM detects surface confounds. L1+L2 features achieve near-chance AUC (mean 0.579, Appendix [S](https://arxiv.org/html/2605.26778#A19)), and label permutation returns AUC to $0.50 \pm 0.05$. Neither result would be expected if surface confounds drove the signal.

## Appendix C: Limitations

Membership as proxy. As discussed in Section 2 and Appendix [A](https://arxiv.org/html/2605.26778#A1), membership in $\mathcal{D}_{\text{train}}$ indicates likely pretraining exposure, not verified inclusion.

Diagnostic setting. Our continuation-probing design prioritizes internal validity over ecological similarity. The 6-model multi-task expansion (Section [4.6](https://arxiv.org/html/2605.26778#S4.SS6)) confirms that CRM generalizes to summarization and QA; however, the full experimental pipeline uses WikiMIA passages of 128 tokens with single-turn generation. Additional task formats (multi-hop reasoning, dialogue, long-form generation) and longer context lengths may affect signal strength and layer-localization patterns.

Dataset and task format. Main experiments use a single dataset (WikiMIA) with a single task format (continuation probing). Cross-dataset validation on BookMIA (Appendix [M](https://arxiv.org/html/2605.26778#A13)) confirms CRM’s signal generalizes to a domain-controlled book-document split (AUC 0.84–0.97), and MIMIR (Appendix [N](https://arxiv.org/html/2605.26778#A14)) establishes a boundary condition at chance level (AUC 0.48–0.55). Multi-task expansion across continuation, summarization, and QA (Section [4.6](https://arxiv.org/html/2605.26778#S4.SS6), Table [7](https://arxiv.org/html/2605.26778#S4.T7)) confirms task-level generalization. Source-attribution patterns may differ under additional task formats (e.g., multi-hop reasoning, dialogue) or on datasets with other document characteristics and knowledge cutoff strategies. Future work should characterize CRM’s sensitivity to document length, domain shift severity, task complexity, and dataset diversity.

Last-token probing. L3 features use only the last token position. Source information may be distributed across token positions.

What CRM does not measure. CRM measures parametric-vs-contextual drive, not factual correctness, generation quality, or context appropriateness.

## Appendix D: Detailed Robustness Controls

### D.1 Prompt Templates

We test four prompt templates with varying structure and phrasing:

- A (Standard): `Context: {c}\\Question: {q}\\Answer:`
- B (Document): `Based on the following document, answer the question.\\Document: {c}\\Question: {q}\\Answer:`
- C (Reference): `Reference information: {c}\\Using this reference, answer: {q}\\Answer:`
- D (Instruction): `Read the text below and answer the question.\\Text: {c}\\Question: {q}\\Answer:`
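The four templates can be written as format strings for a prompt-randomization run. In this sketch the `\\` separator printed in the templates is assumed to denote a single line break, and the names `TEMPLATES` and `build_prompt` are hypothetical.

```python
# Prompt templates A-D as Python format strings.
# Assumption: the "\\" separator in the listed templates is one newline.
TEMPLATES = {
    "A": "Context: {c}\nQuestion: {q}\nAnswer:",
    "B": ("Based on the following document, answer the question.\n"
          "Document: {c}\nQuestion: {q}\nAnswer:"),
    "C": "Reference information: {c}\nUsing this reference, answer: {q}\nAnswer:",
    "D": ("Read the text below and answer the question.\n"
          "Text: {c}\nQuestion: {q}\nAnswer:"),
}

def build_prompt(template_id, context, question):
    """Fill one template with a context passage and a question."""
    return TEMPLATES[template_id].format(c=context, q=question)

# One prompt per template for the same (context, question) pair.
prompts = {tid: build_prompt(tid, "Some passage.", "What is stated?") for tid in TEMPLATES}
```

Running the same feature-extraction pipeline over each entry of `prompts` and comparing AUC across templates gives the kind of per-template breakdown that the prompt-randomization control reports.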

Table 9: Prompt randomization (full breakdown). AUC per template, mean, and standard deviation. All std $<$ 0.02.

### D.2 Same-Topic Control Methodology

For each member document $d_m$, we compute BGE-M3 embeddings for all non-member documents and select the non-member $d_{\text{st}}$ with the highest cosine similarity to $d_m$. The same-topic non-member set $\{d_{\text{st}}^{(i)}\}_{i=1}^{N}$ is then used in place of the original random non-member set. This creates a harder discrimination task: members and non-members share similar topics, so a classifier relying on topic-level features would see reduced AUC. The CRM-LTS same-topic results (Table [3](https://arxiv.org/html/2605.26778#S4.T3)) confirm the signal is largely preserved ($\Delta$ within $\pm 0.06$). The earlier L2-norm experiment (Appendix [E](https://arxiv.org/html/2605.26778#A5)) provides converging evidence with larger topic effects ($\Delta = -0.08$ to $-0.21$), bounding topic familiarity’s contribution at $\sim$5–25% of the total signal.
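The nearest-neighbor pairing described above can be sketched with plain NumPy, assuming the document embeddings have already been computed. The function name is hypothetical, and the sketch lets several members share the same non-member, which is an assumption because the text does not say whether matching is done with or without replacement.

```python
import numpy as np

def same_topic_nonmembers(member_emb, nonmember_emb):
    """For each member, pick the most cosine-similar non-member.

    member_emb:    (n_members, d) embedding matrix.
    nonmember_emb: (n_nonmembers, d) embedding matrix.
    Returns (indices, similarities), each of length n_members.
    """
    m = member_emb / np.linalg.norm(member_emb, axis=1, keepdims=True)
    n = nonmember_emb / np.linalg.norm(nonmember_emb, axis=1, keepdims=True)
    sims = m @ n.T                               # cosine similarity matrix
    return sims.argmax(axis=1), sims.max(axis=1)

members = np.array([[1.0, 0.0], [0.0, 1.0]])
nonmembers = np.array([[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]])
idx, sim = same_topic_nonmembers(members, nonmembers)
```

In the toy example the first member is paired with the second non-member and the second member with the first, since those pairs point in nearly the same direction.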

### D.3 Label Permutation

For each model, we randomly shuffle the membership labels 10 times and re-run the full CRM pipeline (feature extraction unchanged). All 10 permutations yield AUC within $0.50 \pm 0.05$, confirming the signal is not a classifier overfitting artifact.
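A label-permutation check of this kind can be sketched as follows. The `evaluate` argument stands in for the full pipeline (classifier training plus cross-validated AUC) and is a hypothetical callable; the toy usage below substitutes a direct single-feature AUC, so it illustrates the control rather than reproducing the experiment.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = pos[:, None] > neg[None, :]
    ties = pos[:, None] == neg[None, :]
    return float(np.mean(greater + 0.5 * ties))  # ties count as half

def permutation_aucs(evaluate, features, labels, n_perm=10, seed=0):
    """Re-run `evaluate` on shuffled labels; features are left unchanged."""
    rng = np.random.default_rng(seed)
    return [evaluate(features, rng.permutation(labels)) for _ in range(n_perm)]

# Toy usage: a random feature carries no label information either way.
rng = np.random.default_rng(1)
toy_features = rng.normal(size=2000)
toy_labels = np.array([0, 1] * 1000)
null_aucs = permutation_aucs(pairwise_auc, toy_features, toy_labels)
```

If the permuted-label AUCs were systematically far from 0.5, that would point to leakage or overfitting in the evaluation rather than a membership signal.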

## Appendix E: L2 Norm Baseline and CRM+L2 Combined Results

Table [11](https://arxiv.org/html/2605.26778#A5.T11) reports the L2 norm baseline across all nine models. The L2 norm of the hidden-state difference ($\mathbf{h}_{c}-\mathbf{h}_{0}$) is computed per target layer per sample, yielding an $L$-dimensional feature vector, where $L$ is the number of target layers (9–22). We evaluate four variants: (1) L2 Mean: scalar mean across all layers; (2) L2 Best Layer: the single layer that best separates members from non-members; (3) L2 Multilayer: the full per-layer L2 norm vector; and (4) L2 PCA-2d: PCA-reduced 2-dimensional summary.
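The four variants can be derived from a single matrix of per-layer L2 norms, as in the sketch below. The best-layer criterion used here (largest deviation of single-layer AUC from 0.5) is an assumption, since the text does not state the selection rule, and the function names are hypothetical. In a real evaluation the layer choice and PCA fit would be made inside each training fold to avoid leakage.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float(np.mean((pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])))

def l2_variants(l2, labels):
    """l2: (n_samples, n_layers) per-layer L2 norms; labels: 0/1 array."""
    layer_auc = np.array([pairwise_auc(l2[:, j], labels) for j in range(l2.shape[1])])
    best = int(np.argmax(np.abs(layer_auc - 0.5)))    # assumed selection rule
    centered = l2 - l2.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return {
        "mean": l2.mean(axis=1, keepdims=True),       # (1) L2 Mean
        "best_layer": best,
        "best": l2[:, [best]],                        # (2) L2 Best Layer
        "multilayer": l2,                             # (3) L2 Multilayer
        "pca2d": centered @ vt[:2].T,                 # (4) L2 PCA-2d
    }

toy_l2 = np.array([[1.0, 1.0, 0.1], [2.0, 3.0, 0.2], [3.0, 2.0, 0.3],
                   [1.0, 2.0, 5.1], [2.0, 1.0, 5.2], [3.0, 3.0, 5.3]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
variants = l2_variants(toy_l2, toy_y)
```

In the toy data only the third layer separates the two classes, so it is the one selected as the best layer.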

The scalar L2 Mean is weak (mean LR AUC 0.59), confirming that magnitude averaging discards too much signal. However, L2 Multilayer is competitive: mean LR AUC 0.812, within $\Delta = +0.024$ of CRM-LR (0.836). Importantly, CRM-LTS and L2 Multilayer use the *same* feature dimensionality (1 scalar per layer), yet CRM-LTS matches or exceeds L2 on 5/9 models while providing strictly more information: CRM’s PCA-projected scalars capture *directional displacement*, whereas L2 norms capture only isotropic magnitude. The performance ranking (raw probes $>$ CRM-LTS $\approx$ L2 Multilayer $\gg$ L2 Mean) supports CRM’s design as an interpretable detector.

Per-layer L2 vs. CRM-LTS: evidence for directional sensitivity. The comparable overall AUC masks a critical dissociation: L2 norm AUC peaks at early layers (L0–L5) across all models, while CRM-LTS single-layer AUC peaks at mid-to-late layers (L10–L28). Early-layer L2 peaks reflect magnitude-based sensitivity to input perturbation at layers closest to the embedding; mid/late-layer CRM-LTS peaks reflect directional displacement along learned principal components encoding document-specific structure. This dissociation (same overall AUC, different layers, different signal type) directly supports the claim that source information resides in representational *direction*, not just *magnitude*.

CRM+L2 feature combination. Table [10](https://arxiv.org/html/2605.26778#A5.T10) reports CRM+L2 concatenation results: CRM+L2 exceeds max(CRM, L2) on 5/9 models (mean $\Delta_{\text{max}} = +0.036$ for complementary models), confirming partially non-overlapping source information. Strongest gains occur on Qwen2.5-14B-Base (+0.064), Mistral-7B-Instruct (+0.036), and Llama-3.1-8B-Instruct (+0.030). The remaining models show marginal or no complementarity: Qwen2.5-14B-Instruct (CRM at 0.951 is near-ceiling), Qwen2.5-32B-Instruct (CRM 0.923 dominates L2 0.855), and Llama-8B/Mistral-7B, where the combined feature set is evaluated on the same-topic subset (conservative estimate). A PCA dimension sweep (Appendix [I](https://arxiv.org/html/2605.26778#A9)) further calibrates the dimensionality tradeoff.

Table 10: CRM+L2 feature combination: directional displacement and isotropic magnitude carry partially non-overlapping source information (expanded). $\Delta$(max) = CRM+L2 $-$ max(CRM, L2); positive values indicate complementarity (5/9 models, mean +0.036). § Evaluated under same-topic non-member pairing (conservative estimate). \* Mistral-7B L2 alone (0.929) exceeds CRM (0.869), but same-topic L2 drops to 0.580, confirming that L2’s strong performance is topic-level while CRM’s directional signal is preserved. Condensed comparison appears in the main text (Section [5](https://arxiv.org/html/2605.26778#S5)).

Table 11: L2 norm baseline vs. CRM-LR across nine models (expanded). L2 Mean: scalar average of per-layer L2 norms. L2 Best: single best layer. L2 Multi: full per-layer L2 norm vector (9–22 dim). $\Delta$(CRM-L2): CRM-LR minus L2 Multilayer. All 5-fold CV with logistic regression. Condensed comparison appears in the main text (Section [5](https://arxiv.org/html/2605.26778#S5)).

![Refer to caption](https://arxiv.org/html/2605.26778v1/x5.png)

Figure 5: Same dimensionality, different signal: magnitude vs. direction. Mean LTS (1d) is near chance; L2 Multilayer (per-layer magnitude, $L$ dim) is competitive; CRM-LTS (per-layer directional displacement, $L$ dim) matches or exceeds L2 while providing layer-localized interpretability. Ceiling set by raw probes or PCA-8d.

## Appendix F: Competing Attribution Baselines

We compare CRM-LTS against three attribution baselines: gradient norm ($\|\partial\mathcal{L}/\partial h_{\ell}\|_{2}$), attention flow (mean context-token attention from the last token position), and logit lens (KL divergence between context-conditioned and unconditioned vocabulary projections). All methods produce one scalar per layer (32–48 dims).
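As an illustration of the logit-lens baseline, the sketch below projects per-layer hidden states through an unembedding matrix and computes one KL divergence per layer. The KL direction (context-conditioned relative to unconditioned), the omission of the final normalization layer, and the function names are assumptions made for this example.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def logit_lens_kl(h_ctx, h_noctx, unembed):
    """Per-layer KL(p_ctx || p_noctx) over the vocabulary.

    h_ctx, h_noctx: (n_layers, d_model) last-token hidden states.
    unembed:        (vocab_size, d_model) LM-head weights.
    Returns an array of shape (n_layers,).
    """
    lp_ctx = log_softmax(h_ctx @ unembed.T)
    lp_noctx = log_softmax(h_noctx @ unembed.T)
    return (np.exp(lp_ctx) * (lp_ctx - lp_noctx)).sum(axis=-1)

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 16))                    # toy unembedding matrix
h0 = rng.normal(size=(4, 16))
h1 = h0 + rng.normal(size=(4, 16))
kl_same = logit_lens_kl(h0, h0, W)
kl_diff = logit_lens_kl(h1, h0, W)
```

Stacking the returned vector across samples gives one scalar per layer per sample, the same feature shape as the other per-layer baselines.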

Table 12: CRM-LTS is competitive with standard attribution baselines. All methods: per-layer features (32–48 dims), 5-fold CV, 250 samples. Bold = best per model. Gradient norm at chance confirms loss sensitivity does not capture membership information.

Gradient norm is at chance (AUC 0.500), establishing that per-layer loss sensitivity carries no membership-discriminative signal. Logit lens is the strongest competitor (mean 0.859, Qwen: 0.953), confirming vocabulary-space projections amplify membership signals, but provides no mechanism for identifying *which* layer encodes membership. Attention flow is competitive on Mistral (0.846) but degrades sharply on Llama (0.784). CRM-LTS’s advantage is not raw AUC but interpretability: it identifies the specific direction (PC1) in hidden-state space and enables causal analysis (Section 4.5 of the main text) that per-layer scalar baselines cannot support.

## Appendix G: Standard MIA Baseline Results

To distinguish CRM’s generation-contrastive signal from static membership inference, we implement a standard MIA baseline: the model’s final-layer hidden state of the document text (extracted without generation prompts) serves as features for LR + XGBoost with 5-fold CV. Table [13](https://arxiv.org/html/2605.26778#A7.T13) reports the comparison. For Mistral and Qwen, CRM-LR exceeds MIA-LR by +0.05–0.18, confirming that generation contrast adds signal beyond static membership. For Llama, MIA-LR exceeds CRM-LR by +0.05–0.10, indicating richer static membership encoding in Llama’s document representations.

Table 13: Standard MIA baseline (static document embeddings) vs. CRM-LR (expanded). MIA uses the model’s final-layer hidden state (4,096–5,120 dim). CRM uses 9–22 scalar per-layer trajectory features. CRM exceeds MIA on Mistral and Qwen (+0.05–0.18); MIA exceeds CRM on Llama ($-$0.05–0.10). CRM-LR vs. MIA-LR summarized in Section [F](https://arxiv.org/html/2605.26778#A6).

## Appendix H: Per-Layer AUC: Evidence for Layer Localization

To keep per-layer probing computationally tractable (28–48 layers $\times$ 64 PCA components per layer $\times$ 250 samples = 448K–768K forward passes if using all data), we use a stratified random subset of 50 samples (25 per class) drawn from the full 250-sample pool with the same random seed as the main experiments (seed 42). All per-layer results in this appendix and in Table 3 of the main body use this same diagnostic subset. The 50-sample subset preserves rank ordering of layer AUC while reducing computation to 10–16K forward passes; AUC values are not directly comparable to the 250-sample multi-layer results but reliably identify peak layers and relative layer ordering.
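Drawing a seeded, class-balanced subset of this kind takes only the standard library. In the sketch below the function name and the 125/125 class split of the toy pool are assumptions for illustration; only the subset size (25 per class) and the seed (42) come from the text.

```python
import random

def stratified_subset(labels, per_class=25, seed=42):
    """Return sorted indices of a class-balanced random subset.

    labels: sequence of class labels for the full sample pool.
    """
    rng = random.Random(seed)                    # local RNG, reproducible
    chosen = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.sample(idx, per_class))  # without replacement
    return sorted(chosen)

# Toy pool of 250 samples with an assumed even class split.
pool_labels = [0] * 125 + [1] * 125
subset = stratified_subset(pool_labels)
```

Using a dedicated `random.Random` instance keeps the draw reproducible without touching the global random state used elsewhere in a pipeline.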

Table 14: Per-layer single-probe AUC confirms layer-localized source information. Each layer probed individually with 64-component PCA on raw hidden-state differences. Peak layers are architecture-specific; best single layer underperforms full CRM. 50-sample diagnostic subsets.

## Appendix I: PCA Dimension Sweep: The Dimensionality of Source Information

Table [15](https://arxiv.org/html/2605.26778#A9.T15) reports a PCA dimension sweep on the full hidden\-state difference vector \(all target layers concatenated into a single high\-dimensional vector\)\. This is distinct from CRM\-LTS’s per\-layer PC1 projection: here PCA is applied across the concatenated layer space, while CRM\-LTS extracts PC1 independently within each layer\. PCA is fit on no\-context training\-fold states; vectors are then projected onto $K \in \{1, 2, 4, 8\}$ components and evaluated with 5\-fold CV LR\.

Three findings\. First, a single PC across concatenated layers is insufficient \(13–18% variance across models; AUC ranges from chance level, 0\.48, to near ceiling, 0\.95\), ruling out any universal 1d representation in the concatenated space\. Second, 4 dimensions is critical: for 6/9 models, PCA\-4d AUC exceeds 0\.90, and the 2d $\to$ 4d jump is dramatic \(Qwen\-14B\-Inst: 0\.53 to 0\.96\)\. Third, 8d approximates the raw\-state ceiling \(mean 0\.918\), calibrating the cost of CRM’s further compression to $L$ dimensions \(mean 0\.836\)\. Note that CRM\-LTS’s per\-layer PC1 projection \(one component per layer, $L$ components total across layers\) is not directly comparable to the concatenated PCA\-$K$d results here; the per\-layer design retains layer identity at the cost of within\-layer dimensionality, while concatenated PCA retains within\-layer signal at the cost of layer interpretability\.

Table 15: PCA dimension sweep on concatenated layer space: membership\-conditioned signal dimensionality is architecture\-dependent\. PCA\-$K$d: LR AUC using the first $K$ PCs of the concatenated all\-layer difference vector\. This is distinct from CRM\-LTS’s per\-layer PC1 projection \(one PC per layer, retaining layer identity\)\. PC1\-in\-concatenated\-space AUC spans 0\.48–0\.95, ruling out universal 1d representation in the concatenated space\. The 2d $\to$ 4d jump is critical for most models\. All 5\-fold CV, 250 samples\.

![Refer to caption](https://arxiv.org/html/2605.26778v1/x6.png)

Figure 6: PCA dimension sweep: membership\-conditioned signal dimensionality is architecture\-dependent\. \(a\) Per\-model LR AUC vs\. PCA dimensions \(1d, 2d, 4d, 8d\)\. Dashed lines: instruct variants\. The 2d $\to$ 4d jump is the critical transition\. \(b\) Family means $\pm 1\sigma$\.

## Appendix J: Single\-Layer Noise Injection Results

The block\-level noise injection experiment \(Section [4\.4](https://arxiv.org/html/2605.26778#S4.SS4), Table [5](https://arxiv.org/html/2605.26778#S4.T5)\) was motivated by single\-layer noise injection results that revealed a puzzle: Qwen2\.5\-14B\-Inst L6 showed catastrophic degradation \($-0.300$ AUC at $\varepsilon = 0.1$\), but Llama\-3\.1\-8B L28 was near\-immune \($-0.001$\)\. Table [16](https://arxiv.org/html/2605.26778#A10.T16) reproduces the single\-layer results that preceded and motivated the block\-level analysis\.

#### Noise injection implementation\.

All noise experiments \(single\-layer and block\-level\) use manual layer\-by\-layer forward passes rather than PyTorch forward hooks\. For each forward pass, we \(1\) compute token embeddings via model\.model\.embed\_tokens, \(2\) pre\-compute rotary position embeddings \(RoPE\) via model\.model\.rotary\_emb to obtain $(\cos, \sin)$ position encodings, \(3\) iterate through model\.model\.layers, passing the pre\-computed position embeddings to each decoder layer, \(4\) inject isotropic Gaussian noise $\varepsilon \cdot \sigma(h) \cdot \mathcal{N}(0,1)$ at the output of each target layer \(where $\sigma(h)$ is the per\-layer activation standard deviation, computed batch\-wise\), and \(5\) apply the final layer norm\. This manual iteration avoids the stale\-state and device\-compatibility issues observed with register\_forward\_hook in the initial implementation, and ensures identical perturbation semantics for both the with\-context and no\-context forward passes \(noise is injected at the same layers with the same $\varepsilon$ in both passes\)\. Hidden states are collected at every layer, including the embedding layer \(index 0\) and after the final norm \(index $L+1$\)\. The noise injection formula is $\mathbf{h}' = \mathbf{h} + \varepsilon \cdot \sigma(\mathbf{h}) \cdot \mathcal{N}(0,1)$, applied at all token positions simultaneously within each target layer\.
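
The perturbation rule above can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the "layers" are hypothetical callables standing in for decoder blocks, RoPE and the final layer norm are omitted, and $\sigma(\mathbf{h})$ is taken as the standard deviation over all entries of one layer's output, which is only one possible reading of "per-layer activation standard deviation".

```python
import random
import statistics


def inject_noise(hidden, eps, rng):
    """Return h' = h + eps * sigma(h) * N(0, 1) for one layer's output.

    `hidden` is a list of token vectors (lists of floats). sigma(h) is
    computed over all entries of this layer's output (an assumption).
    """
    flat = [x for token in hidden for x in token]
    sigma = statistics.pstdev(flat)
    return [[x + eps * sigma * rng.gauss(0.0, 1.0) for x in token]
            for token in hidden]


def forward_with_noise(embeddings, layers, target_layers, eps, seed=0):
    """Manual layer-by-layer pass that perturbs the output of target layers.

    `layers` is a list of callables (hypothetical stand-ins for decoder
    blocks). Returns the states at every layer, index 0 = embeddings.
    """
    rng = random.Random(seed)
    states = [embeddings]
    h = embeddings
    for idx, layer in enumerate(layers, start=1):
        h = layer(h)
        if idx in target_layers:
            h = inject_noise(h, eps, rng)  # noise at all token positions
        states.append(h)
    return states


def double(h):
    """Toy 'layer' that doubles every activation."""
    return [[2.0 * x for x in token] for token in h]


# Toy usage: three layers, perturb only layer 2.
emb = [[1.0, -1.0], [0.5, 2.0]]
clean = forward_with_noise(emb, [double] * 3, target_layers=set(), eps=0.1)
noisy = forward_with_noise(emb, [double] * 3, target_layers={2}, eps=0.1)
```

Because the loop is explicit, the same `target_layers` and `eps` can be reused unchanged for the with-context and no-context passes, which is the property the manual iteration is meant to guarantee.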

Table 16: Single\-layer noise injection at CRM\-identified peak layers produces architecture\-dependent causal effects\. Qwen2\.5\-14B\-Inst L6 shows catastrophic degradation \($-0.300$ AUC at $\varepsilon = 0.1$\), establishing causal involvement\. Mistral\-7B L18 requires higher noise \($-0.039$ at $\varepsilon = 1.0$\)\. Llama\-3\.1\-8B L28 is near\-immune to single\-layer perturbation \($-0.001$\), motivating the block\-level distributed\-encoding hypothesis tested in Table [5](https://arxiv.org/html/2605.26778#S4.T5)\.

The single\-layer results motivated the distributed\-encoding hypothesis: membership information may be spread across a block of layers such that perturbing any single layer leaves sufficient signal in other layers for downstream recovery\. The block\-level experiment \(Section [4\.4](https://arxiv.org/html/2605.26778#S4.SS4)\) tests this hypothesis directly, and its results support it\.

## Appendix K: Leave\-One\-Layer\-Out Ablation \(Feature\-Space\) — Redundancy of Source Information

To test whether individual layers carry unique discriminative information, we remove one LTS layer feature at a time from the full L3 feature vector and re\-run 5\-fold CV LR\. This analysis operates on CRM feature vectors, not on hidden states; it tests feature\-space redundancy, not causal necessity in model computation\.

Table 17: Leave\-one\-layer\-out ablation: no single layer is uniquely informative\. Baseline = L3\-only LR AUC with all layers\. Peak $\Delta$ = AUC change when removing the best single\-probe layer\. All $|\Delta| < 0.07$, and the best single\-probe layer is never uniquely necessary\. This tests feature\-space redundancy, not causal necessity\.

Table [17](https://arxiv.org/html/2605.26778#A11.T17) reports results\. No single layer’s removal substantially decreases AUC \(worst $\Delta = -0.006$\), confirming redundant encoding\. Some layers contribute noise: removing Qwen\-7B’s best single\-probe layer \(L14\) improves AUC by \+0\.060, so single\-probe importance dissociates from multi\-layer importance\. Architecture\-specific redundancy patterns parallel the layer\-localization findings in Section [4\.3](https://arxiv.org/html/2605.26778#S4.SS3)\.

![Refer to caption](https://arxiv.org/html/2605.26778v1/x7.png)

Figure 7: Leave\-one\-layer\-out ablation: no single layer is uniquely informative\. Per\-layer $\Delta$AUC \($\mathrm{AUC}_{\mathrm{removed}} - \mathrm{AUC}_{\mathrm{full}}$\)\. Green: AUC improves \(noise\)\. Red: AUC decreases \(unique signal\)\. Stars: peak single\-probe layers\. No removal decreases AUC by more than 0\.007, confirming redundant encoding in CRM’s feature space\.

## Appendix L: Cross\-Task Generalization Pilot \(Summarization\)

We initially conducted a 2\-model pilot \(Mistral\-7B\-v0\.3, Qwen2\.5\-7B\) with summarization prompts to test generalization beyond continuation probing\. Table [18](https://arxiv.org/html/2605.26778#A12.T18) reports the pilot results: Qwen preserved the signal \($\Delta = +0.007$\), while Mistral showed partial transfer \(AUC 0\.744, $\Delta = -0.125$\)\. These pilot results motivated the full 6\-model multi\-task expansion in Section [4\.6](https://arxiv.org/html/2605.26778#S4.SS6), which confirmed the pattern: summarization amplifies the membership\-conditioned signal for most models, while QA maintains above\-chance discriminability across all architectures\. The full per\-model per\-task breakdown is in Appendix [T](https://arxiv.org/html/2605.26778#A20)\.

Table 18: Cross\-task generalization pilot: CRM under summarization\. Continuation AUC from main experiments; summarization AUC from 5\-fold CV LR with CRM\-LTS features\. Qwen preserves signal \($\Delta = +0.007$\); Mistral shows partial transfer \(AUC 0\.744\)\.

![Refer to caption](https://arxiv.org/html/2605.26778v1/x8.png)

Figure 8: Cross\-task generalization: CRM signal under continuation vs\. summarization\. Qwen2\.5\-7B preserves the signal \($\Delta = +0.007$\); Mistral\-7B shows partial transfer \(AUC 0\.744, $\Delta = -0.125$\) but remains well above chance\. Error bars: 95% CI\.

## Appendix M: Cross\-Dataset Generalization \(BookMIA\)

### M\.1 Motivation and Dataset

To test whether CRM’s source\-attribution signal generalizes beyond WikiMIA’s temporal membership split, we evaluate on BookMIA Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\)\. BookMIA defines membership by domain: members \($n = 4{,}935$\) are book snippets from books included in the Books3 training corpus; non\-members \($n = 4{,}935$\) are snippets from books held out of Books3\. This domain\-level boundary provides a cleaner contrast than WikiMIA’s random temporal split: the representational distance between training\-domain and held\-out\-domain books should produce stronger membership\-conditioned differences for models that encode domain identity\. Snippets average $\sim$2,800 characters; we use the first 100 characters as the continuation prefix\.

We evaluate three representative models \(one per family\) under three conditions: \(1\) BookMIA continuation \(cross\-dataset, same task\), \(2\) BookMIA QA \(cross\-dataset, cross\-task\), and \(3\) WikiMIA continuation \(within\-dataset reference\)\. The CRM\-LTS pipeline follows the same extraction protocol as the main experiments \(Section 3\): PC1 basis computed from 100 calibration samples, signed dot\-product projection per layer, no across\-layer PCA, 5\-fold stratified CV with logistic regression\.

### M\.2 Full Results

Table 19: Cross\-dataset generalization: full BookMIA results\. CRM\-LTS and L2 Norm LR/XGB AUC \(5\-fold CV, $\pm 1$ std\)\. BookMIA CoT uses continuation prompts; BookMIA QA uses question\-answering prompts\. WikiMIA values are within\-dataset references\. All models use the same CRM\-LTS extraction protocol \(signed PC1 projection, no across\-layer PCA\)\.

### M\.3 Key Findings

CRM generalizes across datasets\. BookMIA continuation CRM\-LR ranges from 0\.844 to 0\.967, all well above chance and largely matching or exceeding within\-dataset WikiMIA performance\. The standard deviations are uniformly tight \(0\.008–0\.026\), indicating stable within\-model signal\. This rules out the interpretation that CRM’s signal is specific to WikiMIA’s temporal split\.

Domain\-controlled splits reveal stronger signals\. Mistral\-7B’s CRM\-LR increases from 0\.825 \(WikiMIA\) to 0\.967 \(BookMIA\), a \+0\.142 increase\. WikiMIA’s random temporal split produces similar member and non\-member distributions \(both Wikipedia passages\), yielding smaller representational differences compared to BookMIA’s domain\-level boundary \(books in vs\. out of training\)\. BookMIA’s cleaner membership contrast amplifies CRM signal strength, confirming that dataset design, not CRM sensitivity, is the binding constraint for this model\.

CRM\-LTS outperforms L2 on BookMIA\. On WikiMIA, L2 Multilayer was within $\Delta = +0.024$ of CRM\-LR \(see the L2 appendix, Appendix [E](https://arxiv.org/html/2605.26778#A5)\)\. On BookMIA continuation, CRM\-LTS leads L2 by \+0\.091 \(Llama\) to \+0\.161 \(Qwen\), suggesting directional displacement is more robust to domain shift than isotropic magnitude\.

QA format amplifies the signal\. BookMIA\-QA CRM\-LR ranges from 0\.959 to 0\.980, exceeding continuation consistently\. This corroborates the cross\-task pilot \(Appendix [L](https://arxiv.org/html/2605.26778#A12)\) and suggests that explicit context\-extraction instructions strengthen the representational gap between member\-conditioned and non\-member\-conditioned generation\.

XGBoost matches or exceeds LR on BookMIA\. Unlike WikiMIA, where LR generally outperformed XGBoost, BookMIA’s cleaner signal enables XGBoost to match \(Mistral: 0\.967\) or slightly exceed \(Qwen\-QA: 0\.985\) LR\. This is consistent with stronger, lower\-noise signal enabling tree\-based classifiers to capture residual nonlinearities\.

## Appendix N: MIMIR Negative Control

We evaluate CRM on MIMIR, a Pile\-vs\-Wikipedia benchmark Duan et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib29)\), as a negative\-control test\. In this benchmark, members are Pile documents and non\-members are Wikipedia/patent documents, so the member and non\-member populations are drawn from *different corpora*\. Unlike WikiMIA and BookMIA, where membership is defined within a single domain \(Wikipedia and Books3, respectively\), MIMIR’s split conflates membership with corpus origin\. This is a *confound*, not a cleaner design\.

Table 20: MIMIR produces chance\-level CRM\-LR AUC \(0\.485–0\.552\)\. The domain\-shift confound between Pile and Wikipedia distributions makes membership\-conditioned signals undetectable\. 250 samples/class, 5\-fold CV\.

CRM produces chance\-level AUC on MIMIR \(0\.485–0\.552, Table [20](https://arxiv.org/html/2605.26778#A14.T20)\)\. This result is not a validation of dataset design; it is a boundary condition for CRM\. The failure reveals that CRM cannot detect membership\-conditioned signals when membership is confounded with domain origin\. On WikiMIA and BookMIA, member and non\-member populations are drawn from the same underlying domain \(Wikipedia pages, books\), and CRM separates conditions with AUC 0\.71–0\.97\. On MIMIR, where the only signal is domain difference, CRM collapses to chance\.

This pattern has two implications\. First, it constrains what CRM measures: the method is sensitive to representational differences between matched\-distribution member/non\-member pairs, not to arbitrary distribution shifts\. The credible interpretation is that CRM detects pretraining\-exposure effects within a domain, not domain\-level features\. Second, the failure establishes a practical requirement: CRM is applicable only when membership contrast is the dominant signal and domain distributions are balanced across conditions\. Deploying CRM without verifying this balance would produce uninterpretable results\.

## Appendix O: Detailed Related Work

We situate our work relative to each of the five intersecting research threads\.

#### Why this is not standard membership inference\.

CRM shares a label space with MIA \(member vs\. non\-member\), which invites the concern that it is a repurposed membership inference attack\. Three structural differences distinguish CRM from MIA\. \(1\) Task framing: MIA asks whether a document was in $\mathcal{D}_{\text{train}}$ given static representations Shokri et al\. \([2017](https://arxiv.org/html/2605.26778#bib.bib4)\); Carlini et al\. \([2021](https://arxiv.org/html/2605.26778#bib.bib5)\); Mattern et al\. \([2023](https://arxiv.org/html/2605.26778#bib.bib8)\); Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\); Carlini et al\. \([2023](https://arxiv.org/html/2605.26778#bib.bib28)\); Biderman et al\. \([2023](https://arxiv.org/html/2605.26778#bib.bib34)\)\. CRM asks whether the *presence of that document as retrieved context* changes the model’s generation trajectory differently depending on membership, which is a paired generation contrast rather than a document\-level classifier\. \(2\) Signal: MIA operates on static document embeddings \(4,096–5,120 dims\); CRM operates on the $\mathbf{h}^{c} - \mathbf{h}^{0}$ displacement per layer \(9–22 scalars\)\. \(3\) Empirical separation: standard MIA baselines achieve near\-chance AUC \(0\.55–0\.60\), while CRM achieves 0\.71–0\.95 on the same data; same\-topic control preserves CRM\-LTS discrimination \(AUC within $\pm 0.06$\) where MIA would confound topic with membership\. The paired\-contrast architecture and $\sim$200–500$\times$ feature compression differentiate CRM from standard MIA, though membership remains the experimental proxy \(see Appendix [A](https://arxiv.org/html/2605.26778#A1)\)\.

#### Membership inference attacks \(MIA\)\.

MIA methods Shokri et al\. \([2017](https://arxiv.org/html/2605.26778#bib.bib4)\); Carlini et al\. \([2021](https://arxiv.org/html/2605.26778#bib.bib5)\); Mattern et al\. \([2023](https://arxiv.org/html/2605.26778#bib.bib8)\); Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\); Carlini et al\. \([2023](https://arxiv.org/html/2605.26778#bib.bib28)\); Biderman et al\. \([2023](https://arxiv.org/html/2605.26778#bib.bib34)\) ask whether a document was seen during training\. CRM asks a different question: under generation conditions with the document as context, does membership create detectable representational differences? MIA operates on static document\-level signals; CRM operates on generation\-contrastive signals\. The baselines confirm standard MIA methods achieve near\-chance AUC \(0\.55–0\.60\), while CRM achieves 0\.71–0\.95\.

#### RAG faithfulness and context adherence\.

RAG faithfulness work Liu et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib2)\); Li et al\. \([2025](https://arxiv.org/html/2605.26778#bib.bib3)\); Niu et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib11)\) detects when models produce outputs contradicting the provided context\. These methods are effective when context and parametric memory *disagree*\. CRM targets the harder case where both *agree on the surface*: the output is consistent with the context regardless of which source drove generation\.

#### Citation accuracy and attribution\.

Citation\-verification benchmarks Bohnet et al\. \([2022](https://arxiv.org/html/2605.26778#bib.bib10)\); Es et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib9)\) evaluate whether models correctly cite sources, that is, whether the model *claims* to use context\. CRM asks whether context *reshapes* internal computation\. A model can cite correctly while relying on parametric memory, or fail to cite while genuinely using context\.

#### Probing and interpretability\.

Probing classifiers Alain and Bengio \([2016](https://arxiv.org/html/2605.26778#bib.bib12)\); Belinkov and Glass \([2019](https://arxiv.org/html/2605.26778#bib.bib13)\); Hewitt and Manning \([2019](https://arxiv.org/html/2605.26778#bib.bib23)\) decode static properties from representations; CRM decodes a *relational* property: the difference between two processing conditions\. Activation patching and model editing Meng et al\. \([2022](https://arxiv.org/html/2605.26778#bib.bib15), [2023](https://arxiv.org/html/2605.26778#bib.bib33)\) establish causality through intervention; CRM is diagnostic, identifying *where* information is encoded\. The block\-level noise injection results \(Section [4\.4](https://arxiv.org/html/2605.26778#S4.SS4)\) provide converging causal evidence using noise perturbation rather than activation patching\.

#### Cognitive reality monitoring\.

The cognitive\-science principle of reality monitoring Johnson et al\. \([1993](https://arxiv.org/html/2605.26778#bib.bib7)\) posits that humans distinguish perceived from imagined memories by comparing sensory detail, contextual information, and cognitive operations\. CRM adapts this directly: Level 1 mirrors sensory detail, Level 2 mirrors contextual information, Level 3 mirrors cognitive operations\. The key insight, that source information resides in the *difference* between processing modes, is inherited from this tradition\.

## Appendix P: CRM Equations and Methodology Details

### P\.1 Level 1: Sequence\-Level Semantic Delta

$$\Delta_{\text{seq}} = 1 - \cos\big(\text{enc}(y_{0}), \text{enc}(y_{c})\big) \tag{3}$$

where $\text{enc}(\cdot)$ is BGE\-M3 Chen et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib16)\)\. $\Phi_{\text{L1}} = [\Delta_{\text{seq}}]$ \(1 dimension\)\.
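
As a small illustration of Equation (3), the sketch below computes the cosine-distance term for two already-encoded outputs. The encoder itself (BGE-M3 in the paper) is not called here; `emb_no_ctx` and `emb_ctx` are hypothetical placeholder vectors standing in for $\text{enc}(y_0)$ and $\text{enc}(y_c)$.

```python
import math


def semantic_delta(emb_no_ctx, emb_ctx):
    """Delta_seq = 1 - cos(enc(y_0), enc(y_c)) for two embedding vectors."""
    dot = sum(a * b for a, b in zip(emb_no_ctx, emb_ctx))
    norm0 = math.sqrt(sum(a * a for a in emb_no_ctx))
    normc = math.sqrt(sum(b * b for b in emb_ctx))
    if norm0 == 0.0 or normc == 0.0:
        raise ValueError("embedding vectors must be non-zero")
    return 1.0 - dot / (norm0 * normc)


# Toy vectors (placeholders, not real encoder outputs).
same = semantic_delta([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = semantic_delta([1.0, 0.0], [0.0, 1.0])  # unrelated direction
```

The value is 0 when the two generations embed to the same direction and grows toward 2 as they point in opposite directions.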

### P\.2 Level 2: Token\-Level Distributional Divergence

For each generation step $t$:

$$\text{KL}_{t} = D_{\text{KL}}\big(p(\cdot \mid c, q, y_{<t}) \;\|\; p(\cdot \mid q, y_{<t})\big) \tag{4}$$

Aggregated into five statistics: KL\-mean, KL\-max, KL\-var, KL\-early/late split \(first 32 vs\. remaining tokens\), and KL\-trend \(slope of linear regression\)\. $\Phi_{\text{L2}} = [\text{KL}_{1}]$ \(1 dimension\)\.
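
The per-step divergence in Equation (4) and its summary statistics might be computed along the following lines. This is a sketch under stated assumptions: the text does not say how the early/late split is reduced to a number, so it is reported here as a difference of means, and the function and key names are illustrative.

```python
import math
import statistics


def kl_divergence(p, q, floor=1e-12):
    """D_KL(p || q) for two probability vectors over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0.0)


def kl_statistics(kl_per_step, early=32):
    """Aggregate per-step KL values into five summary statistics."""
    n = len(kl_per_step)
    head, tail = kl_per_step[:early], kl_per_step[early:]
    mean_kl = statistics.fmean(kl_per_step)
    mean_t = (n - 1) / 2.0
    denom = sum((t - mean_t) ** 2 for t in range(n))
    # Least-squares slope of KL_t against the step index t.
    slope = (sum((t - mean_t) * (k - mean_kl)
                 for t, k in enumerate(kl_per_step)) / denom
             if denom > 0 else 0.0)
    late_mean = statistics.fmean(tail) if tail else 0.0
    return {
        "kl_mean": mean_kl,
        "kl_max": max(kl_per_step),
        "kl_var": statistics.pvariance(kl_per_step),
        # Assumption: early/late split expressed as a difference of means.
        "kl_early_minus_late": statistics.fmean(head) - late_mean,
        "kl_trend": slope,
    }


# Toy usage with a 4-step trajectory and a 2-step "early" window.
stats = kl_statistics([0.0, 1.0, 2.0, 3.0], early=2)
```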

### P\.3 Level 3: Latent Trajectory Shift

PCA selection on held\-out calibration set: layers with explained variance ratio $> 0.01$ form $\mathcal{L}$\. On a 100\-sample calibration subset, the PC1 direction $v_{\ell}$ is extracted via SVD on displacement vectors $h_{\ell}^{c} - h_{\ell}^{0}$ per layer\. LTS is computed as a signed dot\-product projection per Equation \(1\)\. $\Phi_{\text{L3}} = [\text{LTS}_{1}, \ldots, \text{LTS}_{|\mathcal{L}|}]$ \(9–22 dimensions\)\.
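
A dependency-free sketch of the two steps described above: estimating a per-layer PC1 direction from calibration displacements, then projecting a new displacement onto it. Several details are assumptions rather than the paper's method: power iteration replaces SVD, the displacements are not mean-centered, the sign of the direction is whatever the iteration converges to, and the projection is taken to be a plain dot product since Equation (1) is not reproduced in this appendix.

```python
import math


def pc1_direction(displacements, iters=200):
    """Top right-singular vector of the (n x d) displacement matrix.

    Power iteration on X^T X, a stand-in for SVD. Assumption: rows are
    not mean-centered; the source does not say either way.
    """
    d = len(displacements[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(xj * vj for xj, vj in zip(x, v)) for x in displacements]
        w = [sum(p * x[j] for p, x in zip(proj, displacements))
             for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


def lts_features(h_ctx, h_noctx, basis):
    """One signed scalar per layer: (h_l^c - h_l^0) . v_l."""
    return [
        sum((a - b) * vj for a, b, vj in zip(hc, h0, v))
        for hc, h0, v in zip(h_ctx, h_noctx, basis)
    ]


# Toy calibration set for a single layer with 2-d hidden states.
calib = [[2.0, 0.1], [-4.0, -0.1], [3.0, 0.0]]
v1 = pc1_direction(calib)
features = lts_features([[1.0, 0.0]], [[0.0, 0.0]], [[1.0, 0.0]])
```

The resulting feature vector has one entry per selected layer, which is what keeps the representation at 9–22 scalars.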

### P\.4 Raw Hidden\-State Probes

Three variants: final\-layer concat/diff, all\-layer mean concat/diff, and per\-layer PCA diff \(8 components/layer, concatenated\)\. All use 5\-fold CV LR on 50\-sample diagnostic subsets\.

## Appendix Q: Baseline Details

### Q\.1 Tier 1: Black\-Box Likelihood Baselines

Standard MIA methods on document text without generation contrast: Perplexity, Zlib\-compressed PPL Carlini et al\. \([2021](https://arxiv.org/html/2605.26778#bib.bib5)\) \(ratio of PPL to zlib entropy\), and Min\-K% Prob Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\) \(mean of the smallest $K\%$ token log\-probabilities, $K \in \{10, 20\}$\)\.
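
The three likelihood baselines can be sketched from per-token log-probabilities as follows. The scoring follows the wording in the text above; the choice of zlib entropy as compressed length in bits is an assumption, and the token log-probabilities would come from the model under test (toy values are used here).

```python
import math
import zlib


def perplexity(token_logprobs):
    """exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_ratio(text, token_logprobs):
    """PPL divided by zlib entropy, as described in the text.

    Assumption: zlib entropy = compressed size of the UTF-8 text in bits.
    """
    zlib_bits = 8 * len(zlib.compress(text.encode("utf-8")))
    return perplexity(token_logprobs) / zlib_bits


def min_k_prob(token_logprobs, k_percent):
    """Mean of the smallest K% token log-probabilities."""
    n = max(1, int(len(token_logprobs) * k_percent / 100))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Toy log-probabilities for a 10-token document (placeholder values).
logprobs = [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0, -10.0]
score_k20 = min_k_prob(logprobs, 20)
score_k10 = min_k_prob(logprobs, 10)
```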

### Q\.2 Tier 2: Access\-Matched Interpretable Baselines

Same paired\-generation input as CRM: single\-layer LTS, mean LTS \(1d\), L1\+L2 only, LTS PCA \(2d\), top\-3 layers\.

### Q\.3 Tier 3: Raw Hidden\-State Probes

Upper\-bound diagnostics on full hidden\-state differences without LTS compression \(15–20$\times$ more features\)\. Full results in Appendix [R](https://arxiv.org/html/2605.26778#A18)\.

## Appendix R: Raw Hidden\-State Probe Results

Table 21: Raw hidden\-state probes confirm abundant source information, but only CRM reveals *where* it is encoded\. Raw probes use 15–20$\times$ more features\. CRM’s scalar\-per\-layer trajectory enables layer\-specific findings \(Section 6\.3\)\. 50\-sample diagnostic subsets\.

## Appendix S: Access\-Matched Baseline Comparison and L3 Ablation

### S\.1 Access\-Matched Baselines

Table 22: Multi\-layer trajectory, not any single layer, carries the dominant signal\. Full CRM exceeds the best individual layer by mean \+0\.135 AUC\. Scalar averaging \(Mean LTS, 0\.567\) and surface\-only features \(L1\+L2, 0\.579\) remain near chance\. All baselines receive the same paired\-generation input as CRM\.

### S\.2 L3\-Only Ablation

Table 23: Removing all surface features changes AUC by less than 0\.01: the membership\-conditioned signal is latent \(expanded\)\. All nine models satisfy $|\Delta\text{LR}| < 0.05$\. A condensed version appears as Table 4 in the main body\.

## Appendix T: Multi\-Task Detailed Results

Table [24](https://arxiv.org/html/2605.26778#A20.T24) reports the full per\-model per\-task breakdown corresponding to the condensed Table [7](https://arxiv.org/html/2605.26778#S4.T7) in the main body\. CRM\-LTS features are extracted using the manual prompt\-wrapping pipeline \(Appendix [P](https://arxiv.org/html/2605.26778#A16)\): PC1 projection of displacement vectors $h_{\ell}^{c} - h_{\ell}^{0}$ across all layers\. Continuation values match the main CRM\-LTS results in Table [1](https://arxiv.org/html/2605.26778#S4.T1), confirming pipeline consistency\.

#### Task prompt templates\.

All three tasks share the same outer probe\-wrapping pattern described in Appendix [P](https://arxiv.org/html/2605.26778#A16) \(with\-context: input contains the full passage; no\-context: input contains only the query\) and differ only in the inner task prompt\. The continuation template is: “Context: \{context\}\\n\\nQuestion: Continue the following passage: \{query\}\\n\\nAnswer:” The summarization template is: “Context: \{context\}\\n\\nQuestion: Summarize the above text in one sentence\.\\n\\nAnswer:” The QA template is: “Context: \{context\}\\n\\nQuestion: According to the above passage, what specific fact or event is described?\\n\\nAnswer:” For all tasks, the with\-context probe prompt supplies the full context passage \(128 tokens\); the no\-context probe prompt supplies an empty context string, leaving only the query\. The model generates up to 32 tokens for continuation, 64 tokens for summarization, and 64 tokens for QA\.
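
The paired prompts described above could be assembled as in the sketch below. The template strings and token budgets are taken from the text; the helper names are illustrative, and keeping the literal "Context: " prefix in the no-context prompt is an assumption about what "an empty context string" means in practice.

```python
TASK_QUESTIONS = {
    "continuation": "Continue the following passage: {query}",
    "summarization": "Summarize the above text in one sentence.",
    "qa": ("According to the above passage, "
           "what specific fact or event is described?"),
}

# Generation budgets stated in the text.
MAX_NEW_TOKENS = {"continuation": 32, "summarization": 64, "qa": 64}


def build_prompt(task, context, query=""):
    """Fill the shared outer template with the task-specific question."""
    question = TASK_QUESTIONS[task].format(query=query)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"


def build_pair(task, context, query=""):
    """Return (with-context prompt, no-context prompt) for one sample.

    Assumption: the no-context prompt keeps the template and passes an
    empty context string.
    """
    return build_prompt(task, context, query), build_prompt(task, "", query)


# Toy usage with placeholder text.
with_ctx, no_ctx = build_pair("continuation", "Some passage.", "Some pass")
```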

Table 24: Full per\-model per\-task CRM\-LTS LR AUC \(expanded from Table [7](https://arxiv.org/html/2605.26778#S4.T7)\)\. 5\-fold CV, 250 samples/task\. Dim = all\-layer CRM\-LTS trajectory features\. $\Delta$Summ = Summarization $-$ Continuation; $\Delta$QA = QA $-$ Continuation\. Mistral\-family models show consistent summarization amplification \(\+0\.063–0\.078\)\. QA preserves strong signal across all architectures \(AUC 0\.830–0\.967\)\.

The key result is the architecture\-dependent task pattern\. Mistral models show consistent summarization amplification \($\Delta = +0.063$ to $+0.078$\), consistent with their mid\-layer signal concentration \(Table [4](https://arxiv.org/html/2605.26778#S4.T4)\)\. Qwen2\.5\-14B\-Instruct achieves near\-ceiling performance across all three tasks \(0\.948–0\.967\), while Qwen2\.5\-7B\-Instruct shows slight attenuation under generation tasks \($\Delta = -0.013$\)\. Llama models are split: Llama\-3\.1\-8B shows strong continuation dominance \($\Delta = -0.090$\) while Llama\-3\.1\-8B\-Instruct shows QA exceeding continuation \($\Delta_{\text{QA}} = +0.034$\)\. QA reliably preserves discriminability across all architectures \(range 0\.830–0\.967\), ruling out continuation\-specific artifacts\.

## Appendix U: PC1 Interpretation — Detailed Results

### U\.1 PC Rank Ablation

Table [25](https://arxiv.org/html/2605.26778#A21.T25) reports the multi\-layer LR AUC when using each PC component individually as the 1D projection direction for CRM\-LTS\. The key finding: PCA variance maximization does not optimize membership discriminability\. Lower\-variance, higher\-rank PCs \(PC4–PC7\) capture membership\-conditioned information that PC1 misses, indicating that the displacement distribution is structured along multiple axes and the signal of interest is not aligned with the maximum\-variance direction\.

Table 25: PC rank ablation: PCA variance maximization does not optimize discriminability\. Each cell reports CRM\-LTS multi\-layer LR AUC using that PC as the sole projection direction\. For Mistral, PC5 outperforms PC1 by $+0.138$; for Qwen, PC7 outperforms PC1 by $+0.020$\. Llama is the exception where PC1 is optimal, consistent with its scattered\-late signal distribution\. All features are 32–48 dimensional \(one scalar per layer\)\. 5\-fold CV, 250 samples\.

The PC rank result has two implications\. First, it identifies a clear methodological limitation of the current CRM\-LTS formulation: using PC1 alone leaves membership\-discriminative information unused\. Section [4\.5](https://arxiv.org/html/2605.26778#S4.SS5) directly validates this prediction: replacing PC1 with a supervised mean\-difference direction improves AUC by \+0\.024–0\.144 across three models, with the largest gain on Mistral \(\+0\.144\), precisely the model where PC5 outperforms PC1 by \+0\.138\. The supervised direction effectively integrates information from multiple PCs into a single discriminative axis\. Second, the model\-dependent optimal PC rank \(PC5 for Mistral, PC7 for Qwen, PC1 for Llama\) mirrors the architecture\-dependent patterns in Table [4](https://arxiv.org/html/2605.26778#S4.T4): Llama’s scattered\-late distribution may concentrate discriminative signal along the maximum\-variance direction, while Mistral and Qwen’s sharper localization distributes it across multiple axes\.

### U\.2 Vocabulary Back\-Projection

Projecting PC1 vectors through the LM head $W_{\text{unembed}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ maps each layer’s PC1 direction to the vocabulary space\. Top tokens by projection score reveal the semantic content of the PC1 direction at different depths\.
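
The back-projection amounts to a matrix-vector product followed by a top-$k$ selection. The sketch below uses a toy unembedding matrix and vocabulary (both hypothetical placeholders); with a real model, `unembed` would be the LM head weights and `direction` a layer's PC1 vector.

```python
def top_tokens(direction, unembed, vocab, k=10):
    """Score each token by W_unembed[i] . direction and return the top k.

    `unembed` has one row per vocabulary entry (|V| x d); `vocab` maps
    row index to token string.
    """
    scores = [sum(w * v for w, v in zip(row, direction)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [(vocab[i], scores[i]) for i in ranked[:k]]


# Toy 3-token vocabulary with 2-d hidden states (placeholder values).
toy_vocab = ["a", "b", "c"]
toy_unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
top2 = top_tokens([1.0, 0.5], toy_unembed, toy_vocab, k=2)
```

Because the sign of a principal direction is arbitrary, the lowest-scoring tokens (the top tokens of the negated direction) may be equally worth inspecting.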

Table 26: PC1 vocabulary back\-projection reveals layer\-dependent semantic abstraction\. Top\-10 tokens obtained by projecting PC1 through $W_{\text{unembed}}$ and selecting tokens with the highest dot\-product scores\. Early layers: subword fragments without coherent semantics \(the model has not yet abstracted membership\-relevant features\)\. Mid layers: entity\-type and relational tokens \(Mistral L18: family\-relation terms; Llama L16: description/specification terms\)\. Late layers: for Qwen L40, markers of uncertainty and regret \(Unfortunately, regret, sorry\), so the model’s late\-layer PC1 direction encodes an epistemic\-stance signal aligned with source uncertainty\. For Mistral L28, factuality markers \(specific, exact, entities\)\. For Llama L28, the tokens revert to subword fragments, consistent with its distributed \(non\-concentrated\) signal architecture\.

The layer\-dependent semantic progression is most clearly visible in Qwen2\.5\-14B\-Inst: early\-layer PC1 captures subword\-level variation without interpretable semantics, mid\-layer PC1 captures cross\-lingual and syntactic fragments \(Chinese characters, code tokens\), and late\-layer PC1 converges to a coherent epistemic\-stance direction: markers of uncertainty, hedging, and regret\. This trajectory from subword $\to$ syntactic $\to$ epistemic mirrors the model’s progressive abstraction hierarchy, and suggests that membership\-conditioned divergence at late layers manifests as a shift in the model’s confidence representation\.

For Mistral, the semantic progression is from subword fragments \(L5\) $\to$ social\-relation terms \(L18, kinship/family vocabulary\) $\to$ factuality/specificity markers \(L28: exact, specific, entities\)\. The mid\-layer kinship cluster is notable: membership\-conditioned processing may differentially engage relational knowledge structures, consistent with the mid\-layer concentration pattern identified in Table [4](https://arxiv.org/html/2605.26778#S4.T4)\.

## Appendix V: Deployment Prototype

We provide a lightweight FastAPI\-based CRM audit server and a single\-page HTML dashboard demonstrating that CRM’s compact scalar\-per\-layer signature supports low\-latency deployment auditing\. The prototype is self\-contained in the audit\_prototype/ directory\.

### V\.1 Architecture

POST /audit: Accepts \{context, query\} JSON body\. The server extracts CRM\-LTS features via the same PC1\-projection pipeline as the main experiments: \(1\) tokenize the probe\-wrapped prompt with and without context, \(2\) forward pass through all layers \(Qwen2\.5\-14B\-Inst by default\), \(3\) compute displacement vectors $h_{\ell}^{c} - h_{\ell}^{0}$, \(4\) project onto pre\-computed PC1 basis vectors per layer\. Returns \{anomaly\_flag, anomaly\_score, lts\_trajectory, flagged\_layers, latency\_ms\}\.

GET /history: Returns recent audit records with anomaly scores and flags\.

GET /stats: Returns model metadata \(layer count, model name\) and cumulative request/anomaly counts\.

GET /health: Health\-check endpoint\.

Anomaly detection uses a distribution\-based rule: the calibration mean $\mu_{\ell}$ and standard deviation $\sigma_{\ell}$ of per\-layer LTS values are computed from the same 100\-sample calibration subset used for PC1 estimation \(Section 2\)\. For each incoming request, we flag any layer where $|\text{LTS}_{\ell} - \mu_{\ell}| > 2\sigma_{\ell}$\. The anomaly score is the mean absolute deviation $\frac{1}{|\mathcal{L}|}\sum_{\ell} |\text{LTS}_{\ell} - \mu_{\ell}| / \sigma_{\ell}$ across all layers\. Flagged layers are those with $|\text{LTS}_{\ell}| > 1.0$\. The $2\sigma$ threshold is chosen to balance sensitivity and false\-positive rate under the assumption of approximately normal calibration distributions; per\-domain recalibration is recommended for production deployment \(see Limitations below\)\.
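
The scoring rule can be written compactly as below. The text gives two thresholds (a $2\sigma$ deviation test and an absolute cutoff of 1.0); how they map onto the returned fields is not fully specified, so this sketch assumes the request-level `anomaly_flag` uses the $2\sigma$ test and the `flagged_layers` list uses the absolute cutoff. The function name is illustrative and is not taken from the prototype.

```python
def audit_scores(lts, mu, sigma, z_threshold=2.0, abs_threshold=1.0):
    """Score one request from its per-layer LTS values.

    lts, mu, sigma: equal-length lists (one entry per layer); mu and sigma
    are the calibration mean and standard deviation per layer.
    """
    z = [abs(x - m) / s for x, m, s in zip(lts, mu, sigma)]
    return {
        # Assumption: the request is anomalous if any layer exceeds 2 sigma.
        "anomaly_flag": any(v > z_threshold for v in z),
        # Mean absolute deviation in units of sigma, averaged over layers.
        "anomaly_score": sum(z) / len(z),
        # Assumption: the reported layer list uses the |LTS| > 1.0 cutoff.
        "flagged_layers": [i for i, x in enumerate(lts)
                           if abs(x) > abs_threshold],
    }


# Toy usage with three layers and unit calibration statistics.
result = audit_scores([0.0, 3.0, -1.5], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

With per-domain recalibration, only `mu` and `sigma` would change; the rule itself stays the same.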

The dashboard \(dashboard\.html\) provides: \(1\) a real\-time system status panel \(model name, layer count, request/anomaly counters\), \(2\) an audit form with context and query text inputs, \(3\) a Canvas\-based LTS trajectory chart with layer\-localized point coloring \(red for $|\text{LTS}| > 1.0$\), and \(4\) a recent\-audits history table with auto\-refresh \(5s interval\)\.

### V\.2 Benchmark Results

We benchmark the prototype on a consumer laptop \(Apple M1, 16GB RAM\) with the Qwen2\.5\-14B\-Instruct model loaded in FP16\. The benchmark script \(benchmark\.py\) issues 100 randomized \(context, query\) requests and records the latency of each\.

Table 27: CRM audit prototype benchmark\. 100 requests, Apple M1 16GB\. Mean latency 238ms, p99 latency 452ms, well within practical deployment bounds for an auditing middleware layer\. The 12% anomaly rate reflects the equal mix of member/non\-member samples \(50/100\)\.

Mean latency of 238ms \(p99 452ms\) establishes that CRM auditing is practical as a middleware layer: it can screen \(context, query\) pairs before generation without adding material latency to the user experience\. The compact 48\-dimensional trajectory enables efficient serialization and storage: each audit record occupies less than 1KB of JSON, making long\-term audit\-log retention feasible at production scale\.
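The sub\-1KB figure can be sanity\-checked with a synthetic record\. The sketch below is an estimate under assumptions, not a measurement of the prototype: the field names follow the POST /audit response, the values are random placeholders, and it assumes scores are rounded to four decimal places before serialization \(full\-precision floats would produce a larger record\)\.

```python
import json
import random

NUM_LAYERS = 48  # trajectory length quoted in the text

rng = random.Random(0)
# Placeholder values only; real LTS values come from the audit server.
trajectory = [round(rng.uniform(-3.0, 3.0), 4) for _ in range(NUM_LAYERS)]

record = {
    "anomaly_flag": True,
    "anomaly_score": 1.2345,
    "lts_trajectory": trajectory,
    "flagged_layers": [i for i, x in enumerate(trajectory) if abs(x) > 1.0],
    "latency_ms": 238.0,
}

# Size of the record as it would be written to a JSON audit log.
size_bytes = len(json.dumps(record).encode("utf-8"))
print(f"serialized record: {size_bytes} bytes")
```

With four\-decimal rounding each trajectory entry takes at most nine characters including its separator, so the 48 entries stay well under half of the 1KB budget\.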

### V\.3 Limitations

The prototype serves as a proof\-of\-concept, not a production\-ready system\. Key limitations: \(1\) the anomaly threshold \($2\sigma$\) is calibrated on WikiMIA and may not transfer to different document distributions; \(2\) the server loads one model into memory, so multi\-model deployment would require model multiplexing or per\-model server instances; \(3\) the current implementation does not support batched auditing; \(4\) the dashboard operates on in\-memory history and resets on server restart\. A practical deployment would additionally require: distribution\-aware threshold calibration per deployment domain, integration with existing RAG pipelines \(e\.g\., LangChain, LlamaIndex\), and persistent audit\-log storage with tamper\-evident logging\.

## Appendix W: Annotation Protocol and Membership Labeling

This appendix documents the membership labeling procedure for all datasets used in this work\. Membership labels are not produced by human annotation; they are rule\-based and derived directly from dataset construction metadata\.

#### WikiMIA Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\)\.

Membership is defined by temporal provenance: member documents are Wikipedia passages from dumps dated before 2017\-03\-20 \(high confidence of inclusion in pretraining corpora of models released before 2024–2025\); non\-member documents are passages from dumps dated after 2018\-02\-01 \(high confidence of exclusion\)\. The 2017\-03\-20 to 2018\-02\-01 window is excluded to avoid ambiguity around the exact training cutoff\. Labels are binary: 1 = member \(potentially in $\mathcal{D}_{\text{train}}$\), 0 = non\-member \(post\-cutoff, presumed held\-out\)\. Following Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\), we use 128\-token passages with the original dataset\-provided splits\. For each of the nine models, a fixed set of 125 member and 125 non\-member passages is drawn \(250 total, seed 42\)\.
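The temporal labeling rule and the fixed balanced draw can be expressed compactly\. The following is a sketch under assumptions: the helper names membership\_label and balanced\_sample are hypothetical, a dump date is assumed to be available for every passage, and Python's random module stands in for whatever sampler the released scripts use with seed 42\.

```python
import random
from datetime import date
from typing import List, Optional, Sequence, Tuple

MEMBER_BEFORE = date(2017, 3, 20)     # dumps dated before this are members
NON_MEMBER_AFTER = date(2018, 2, 1)   # dumps dated after this are non-members


def membership_label(dump_date: date) -> Optional[int]:
    """Return 1 (member), 0 (non-member), or None for the excluded window."""
    if dump_date < MEMBER_BEFORE:
        return 1
    if dump_date > NON_MEMBER_AFTER:
        return 0
    return None  # 2017-03-20 to 2018-02-01 is dropped as ambiguous


def balanced_sample(
    members: Sequence[str],
    non_members: Sequence[str],
    per_class: int = 125,
    seed: int = 42,
) -> Tuple[List[str], List[str]]:
    """Draw a fixed, reproducible set of passages from each class."""
    rng = random.Random(seed)
    return rng.sample(members, per_class), rng.sample(non_members, per_class)


print(membership_label(date(2016, 5, 1)))  # 1
print(membership_label(date(2017, 9, 1)))  # None
print(membership_label(date(2019, 1, 1)))  # 0
```

Seeding a local random\.Random instance, rather than the global generator, keeps the draw identical across runs regardless of other random calls in the pipeline\.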

#### BookMIA Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\)\.

Membership is defined by domain provenance: member documents are books contained in the Books3 corpus \(a known pretraining subset\); non\-member documents are books absent from Books3 but potentially present in other corpora\. Labels follow the same binary scheme as WikiMIA\. We evaluate 250 balanced samples per model under both continuation and QA prompt formats \(Appendix [M](https://arxiv.org/html/2605.26778#A13)\)\.

#### MIMIR Duan et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib29)\)\.

MIMIR defines membership by corpus membership within a controlled domain: the Pile\-Wikipedia split uses Wikipedia articles present in the Pile training corpus as members and held\-out Wikipedia articles of comparable distribution as non\-members\. We use this split as a negative control \(Appendix [N](https://arxiv.org/html/2605.26778#A14)\)\. Because both member and non\-member documents are drawn from the same base domain \(Wikipedia\), MIMIR tests whether CRM detects corpus\-level membership confounded with domain origin\.

#### Label quality\.

All labels are derived from dataset construction metadata rather than human judgment, eliminating annotator disagreement as a source of noise\. To rule out label\-related artifacts, we conduct \(1\) label permutation \(10 random shuffles, AUC returns to $0.50 \pm 0.05$, Table [3](https://arxiv.org/html/2605.26778#S4.T3)\), and \(2\) same\-topic control \(member/non\-member pairs matched by BGE\-M3 similarity, preserving AUC within $\pm 0.06$, Appendix [D](https://arxiv.org/html/2605.26778#A4)\)\. These controls confirm that the CRM\-LTS signal is not attributable to labeling artifacts or domain\-level confounds\.
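The label\-permutation control can be illustrated with a rank\-based AUC\. This is a simplified sketch, not the paper's evaluation code: it scores a fixed one\-dimensional feature rather than refitting a classifier per shuffle, the data are synthetic, and the names auc and permuted\_aucs are hypothetical\.

```python
import random
from typing import List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: probability that a member outscores a non-member,
    counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permuted_aucs(
    scores: Sequence[float],
    labels: Sequence[int],
    n_shuffles: int = 10,
    seed: int = 0,
) -> List[float]:
    """AUC after randomly reassigning labels, once per shuffle."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_shuffles):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        results.append(auc(scores, shuffled))
    return results


# Synthetic, perfectly separable data: 125 members, 125 non-members.
labels = [1] * 125 + [0] * 125
scores = [1.0 + 0.001 * i for i in range(125)] + [0.001 * i for i in range(125)]

print(auc(scores, labels))  # 1.0
null = permuted_aucs(scores, labels)
print(sum(null) / len(null))  # close to chance level
```

If the permuted AUC stayed well above 0\.5, the signal would have to come from something other than the label assignment, which is the failure this control is designed to expose\.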

#### Ethical considerations\.

All datasets are publicly available research benchmarks with documented construction procedures\. No new data collection, human annotation, or crowdworker involvement was conducted\. No PII or sensitive content was introduced by our processing pipeline\. The membership labels used in this work are already inferable from each dataset’s public documentation\.

## Appendix X: Artifact License

The release package is distributed under a dual\-license model:

- Code \(all Python scripts in scripts/, src/, figures/, and audit\_prototype/\): Apache License 2\.0\. This includes experiment scripts, feature extraction pipelines, evaluation utilities, the deployment prototype, and figure generation code\.
- Data and features \(all \.npz files, \.json result files, and PC1 basis vectors in data/\): Creative Commons Attribution 4\.0 International \(CC\-BY 4\.0\)\. This includes pre\-extracted CRM\-LTS features, per\-model PC1 basis vectors, and aggregated experimental results\.

The dual\-license model reflects the different nature of these artifacts: the code is a research software toolkit intended for reuse and adaptation \(Apache 2\.0 permits derivative works with patent grants\), while the pre\-computed features and results are curated data artifacts intended for reproduction and reanalysis with attribution \(CC\-BY 4\.0\)\.

## Appendix Y: Code and Data Availability

All code and processed feature files are released under Apache 2\.0 \(code\) and CC\-BY 4\.0 \(data\) licenses \(see Appendix [X](https://arxiv.org/html/2605.26778#A24) for full terms\)\. The release package includes:

#### Experiment scripts \(7 scripts\)\.

- supervised\_crm\_lts\.py: M1 supervised vs\. PC1 direction comparison \(Section [4\.5](https://arxiv.org/html/2605.26778#S4.SS5), Table [6](https://arxiv.org/html/2605.26778#S4.T6)\)\.
- combined\_m1m2\.py: unified M1\+M2 pipeline with single model load per run \(Tables [6](https://arxiv.org/html/2605.26778#S4.T6) and [5](https://arxiv.org/html/2605.26778#S4.T5)\)\.
- block\_causal\_noise\.py: block\-level noise injection using register\_forward\_hook \(earlier implementation; replaced by manual layer iteration in Appendix [J](https://arxiv.org/html/2605.26778#A10)\)\.
- causal\_patching\.py: activation patching at CRM\-identified peak layers \(discussed in Section [4\.4](https://arxiv.org/html/2605.26778#S4.SS4)\)\.
- competing\_methods\.py: gradient norm, attention flow, and logit lens baselines \(Section [F](https://arxiv.org/html/2605.26778#A6), Table [12](https://arxiv.org/html/2605.26778#A6.T12)\)\.
- multitask\_expansion\.py: continuation, summarization, and QA feature extraction across six models \(Section [4\.6](https://arxiv.org/html/2605.26778#S4.SS6), Table [7](https://arxiv.org/html/2605.26778#S4.T7)\)\.
- interpret\_pc1\.py: PC rank ablation and vocabulary back\-projection \(Section [4\.6](https://arxiv.org/html/2605.26778#S4.SS6) main text, Appendix [U](https://arxiv.org/html/2605.26778#A21) and Tables [25](https://arxiv.org/html/2605.26778#A21.T25)–[26](https://arxiv.org/html/2605.26778#A21.T26)\)\.

#### Deployment prototype \(3 files\)\.

- audit\_prototype/server\.py: FastAPI CRM audit server with POST /audit, GET /history, GET /stats, and GET /health endpoints \(Appendix [V](https://arxiv.org/html/2605.26778#A22)\)\.
- audit\_prototype/dashboard\.html: single\-page real\-time trajectory dashboard with Canvas\-based LTS visualization and auto\-refresh\.
- audit\_prototype/benchmark\.py: latency/throughput benchmark script \(Table [27](https://arxiv.org/html/2605.26778#A22.T27)\)\.

#### Models\.

All experiments use publicly available HuggingFace Transformers models: Llama\-3\.1\-8B and Llama\-3\.1\-8B\-Instruct \(meta\-llama/Llama\-3\.1\-8B\), Mistral\-7B\-v0\.3 and Mistral\-7B\-Instruct\-v0\.3 \(mistralai/Mistral\-7B\-v0\.3\), and Qwen2\.5\-7B, Qwen2\.5\-7B\-Instruct, Qwen2\.5\-14B, Qwen2\.5\-14B\-Instruct, Qwen2\.5\-32B\-Instruct \(Qwen/Qwen2\.5\-\*\)\. All models are loaded in FP16 precision with device\_map="auto"\.

#### Data\.

WikiMIA and BookMIA Shi et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib6)\) datasets are used as provided by the authors\. The MIMIR benchmark Duan et al\. \([2024](https://arxiv.org/html/2605.26778#bib.bib29)\) uses the Pile\-Wikipedia split from the official release\. All datasets are publicly available\. Processed CRM\-LTS feature files \(per\-model \.npz, 250 samples $\times$ 9–22 layers\) and PC1 basis vectors will be included in the release to enable reproduction without re\-extraction\.

#### Results\.

The aggregated results file combined\_m1m2\_results\.json \(reported in Tables [6](https://arxiv.org/html/2605.26778#S4.T6) and [5](https://arxiv.org/html/2605.26778#S4.T5)\) is included\. Per\-fold AUC values and bootstrap confidence intervals are logged in per\-experiment output files\.