Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

arXiv cs.AI 05/08/26, 04:00 AM Papers
Summary
This paper presents a unified geometric framework for understanding transformer memory failures, distinguishing between conflict arbitration and hallucination through hidden-state attractor basins. It demonstrates that geometric margin is a superior diagnostic for detecting these failures compared to output entropy, particularly as model scale increases.
arXiv:2605.05686v1 Announce Type: new Abstract: Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar\Delta)$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.
Cached at: 05/08/26, 08:32 AM
# From Conflict Arbitration to Confident Hallucination
Source: [https://arxiv.org/html/2605.05686](https://arxiv.org/html/2605.05686)
## Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Qiyao Liang (Massachusetts Institute of Technology, qiyao@mit.edu), Risto Miikkulainen (University of Texas Austin, Cognizant, risto@cs.utexas.edu), Ila Fiete (Massachusetts Institute of Technology, fiete@mit.edu)

###### Abstract

Language models draw on two knowledge sources: facts baked into weights (*parametric memory*, PM) and information in context (*working memory*, WM). We study two mechanistically distinct failure modes: *conflict*, when PM and WM disagree and interfere; and *hallucination*, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. *Conflict* is basin competition: WM disrupts convergence to the correct basin without raising output entropy. *Hallucination* is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task (entity identifiers mapped to unique codes, with PM installed via LoRA adapters) where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin, the hidden state’s distance to the nearest memorized basin, reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it, and this erasure worsens with scale.

## 1 Introduction

Language models draw on two distinct knowledge sources when generating text: facts encoded in model weights through training (*parametric memory*, PM) and information provided in the current input context (*working memory*, WM). Two distinct failure modes arise from this architecture, each with a different origin. The first is *conflict*: when PM and WM disagree, the model must arbitrate between them, and when arbitration fails it produces a wrong answer without any detectable rise in output uncertainty, even though the correct answer *is* stored in the model’s weights. The second is *hallucination*: when the model is queried on a fact it never learned, no memorized knowledge exists to draw on, yet the model still generates a confident, specific wrong answer. Output confidence is an unreliable guide to both. This paper explains both failure modes with a unified framework and shows that the model’s hidden-state geometry, not its output distribution, is the authoritative diagnostic: it enables detection of both failure modes with zero false refusals, whereas output-entropy monitoring worsens as models scale.

We study memory arbitration in a controlled synthetic task: mapping entity identifiers to unique 5-digit codes, with PM installed via LoRA adapters Hu et al. ([2022](https://arxiv.org/html/2605.05686#bib.bib5)) and WM provided by in-context examples. Systematically varying *where* LoRA is applied (QK, VO, MLP, Full) and *how* PM is trained (single-format brittle vs. multi-format robust) gives us exact ground truth for what the model knows and lets us directly measure how hidden-state geometry changes under each intervention. The result is an *attractor geometry* account of memory arbitration: PM corresponds to persistent basins in representation space sculpted by MLP weight updates; WM corresponds to transient attention-mediated steering; and conflict arises from the interference between PM basin attraction and WM steering, with the trajectory outcome depending on the relative strength of those competing pulls. This framing is not merely a metaphor: a transformer generates autoregressively as a discrete-time dynamical system, weight-encoded facts shape persistent convergence regions that are stable across input variations, and context provides only transient steering that disappears when the context window changes. A simpler “MLP stores, attention retrieves” account does not capture this distinction.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x1.png)
Figure 1: Two-memory system in transformer language models. Each component of the transformer plays a distinct memory role; this paper makes that dissociation precise through targeted LoRA interventions. (a) Architecture: recall decomposes into an *attention addressing* mechanism (QK; blue) that routes evidence through the residual stream, and a *shared content pipeline* (VO+MLP; orange) that writes content and updates state. (b) QK changes which basin is selected without reshaping basins. (c) VO perturbs content readout within the selected basin. (d) MLP reshapes the basin landscape itself. Because routing (QK) and basin formation (MLP) are functionally dissociable, the same geometric framework can separately diagnose conflict (the model routes to a wrong basin) and hallucination (no basin exists at all), and attribute each failure to its responsible circuit.

Three sets of findings emerge. The first establishes the architectural mechanism; the next two demonstrate the two distinct failure modes it explains:

(1) Circuit dissociation. Placing adapters in individual components in isolation causally establishes their roles: MLP-only adapters create persistent attractor basins without affecting routing; QK-only adapters modify routing without reshaping basins; VO-only adapters perturb content write-back. Jacobian decomposition confirms these roles are intrinsic to the pretrained architecture, not artifacts of fine-tuning.

(2) Conflict and context deafness (Failure Mode 1). When PM and WM disagree, the trajectory is pulled in competing directions. The outcome depends on the depth of PM basins relative to the WM pull, a ratio directly controlled by training. Under brittle PM (single-template training), basins are shallow and format-gated: WM wins under conflict, because PM basins are inactive outside their training format. Under robust PM (multi-template training with distractors), MLP basins become deep and format-invariant, tipping the balance: WM is suppressed entirely, a failure we call *context deafness*, and under explicit conflict the opposing pulls corrupt the generation trajectory without cleanly resolving. In both regimes, the correct answer is stored in the model’s weights; what fails is the arbitration. Output entropy stays flat even as accuracy decays across generation steps.

(3) Hallucination (Failure Mode 2). Conflict concerns entities the model *knows*: at least one memorized basin exists, and wrong outputs arise from trajectory competition. Hallucination is categorically different: the entity has no memorized basin at all, the hidden state wanders outside the entire basin landscape, and wrong outputs arise from basin absence rather than competition. The attractor framework predicts exactly what happens here: the hidden state sits far from every basin center, but the frozen LM head cannot distinguish basin proximity from basin absence and fires confidently regardless. Geometric margin separates this case from correct recall perfectly; output entropy cannot. And because the logit gap $\bar{\Delta}$ grows as $N^{1/3}$ with scale, the fraction of hallucinations that are maximally confident follows $C = \exp(-c/\bar{\Delta})$, growing even as aggregate error rates fall.

These three findings are three geometric configurations in a single attractor landscape\. Despite their distinct origins, conflict and hallucination share one architectural consequence: the frozen LM head cannot distinguish basin occupancy from absence, making geometric monitoring necessary in both cases\. The model’s own representations encode both what it knows and whether it knows anything at all; the LM head erases that encoding\. Geometric monitoring bypasses this bottleneck with zero false refusals and improves precisely as output\-based monitoring degrades with scale\.

## 2 Experimental Design

The central challenge in studying memory arbitration is establishing exact ground truth for what the model knows. Natural-language tasks confound semantic knowledge with retrieval strategy, making it impossible to know whether a correct answer reflects genuine recall or a fortunate approximation. We address this by constructing a synthetic memory task where the memorized content is precisely controlled, retrieval success is unambiguous, and architectural interventions can be applied independently across all components. The experimental design has three degrees of freedom, each isolating one aspect of the arbitration problem: *what* is memorized (the task), *how strongly* it is memorized (brittle vs. robust training), and *where* memorization is installed (LoRA placement). These are described in turn.

#### Task and base model\.

We study memory arbitration using $N = 1{,}600$ entities (E000000–E001599), each assigned a unique five-digit code $y \in \{0, \dots, 9\}^{5}$ (1,586 unique codes out of 90,000 possible strings). Given a prompt containing an entity identifier, the model must generate the associated code. All experiments use Qwen2.5-3B-Instruct ($d = 2{,}048$, 36 layers). WM is operationalized as the model’s ability to read a code from context; PM is the code stored in adapter weights via fine-tuning.

#### Evaluation scenarios and metrics\.

We evaluate under five scenarios: PM-seen (trained entity, training format), WM baseline (code in context, adapter off), WM recall (code in context, adapter on), WM–PM conflict (context: $c_{\mathrm{WM}}$, adapter: $c_{\mathrm{PM}}$), and PM-unseen (entity not in training). All use greedy decoding with 200 held-out entities per condition. The metrics are: *accuracy* (substring match); *per-digit error*; *correct-token rank* at the first-digit position; and *digit entropy* $H = -\sum_{d} p_{d} \log_{2} p_{d}$.
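
The two output-space metrics can be sketched in a few lines. This is a minimal illustration, not the authors’ evaluation code: the function names and the example distributions are hypothetical, and the distribution is assumed to be restricted to the ten digit tokens.

```python
import math

def digit_entropy(probs):
    """Shannon entropy in bits, H = -sum_d p_d * log2(p_d), over a digit distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def correct_token_rank(probs, target):
    """1-based rank of the target index when candidates are sorted by descending probability."""
    return 1 + sum(1 for p in probs if p > probs[target])

# Hypothetical first-digit distributions over digits 0-9 (illustrative values only).
uniform = [0.1] * 10
confident = [0.91] + [0.01] * 9

print(digit_entropy(uniform))                   # maximal for 10 outcomes: log2(10) bits
print(digit_entropy(confident))                 # well below 1 bit: the distribution is peaked
print(correct_token_rank(confident, target=3))  # digit 3 is outranked only by digit 0
```

A peaked distribution yields low entropy whether or not the top digit is the correct one, which is why the rank of the correct token is tracked separately.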

#### Training regimes\.

We train the models in two ways. Brittle PM: a single fixed template for all training examples, yielding format-gated memorization. Robust PM: 53 diverse templates with 25% synonym paraphrasing and 10% distractor prefixes (Appendix [A.1](https://arxiv.org/html/2605.05686#A1.SS1)); there were no conflicting, refusal, or context-reading examples in the arbitration training data (refusal details in Appendix [A.1](https://arxiv.org/html/2605.05686#A1.SS1)).

#### LoRA adapter placements\.

We fine-tune rank-8 LoRA adapters Hu et al. ([2022](https://arxiv.org/html/2605.05686#bib.bib5)) ($\alpha = 16$, lr $= 5 \times 10^{-5}$, 10 epochs) on four component groups (Appendix [A.4](https://arxiv.org/html/2605.05686#A1.SS4)): QK (query-key projections; controls what is attended to): q_proj, k_proj; VO (value-output projections; controls how retrieved content is written back): v_proj; MLP (feedforward sublayer; carves the basin landscape): gate_proj, up_proj, down_proj; Full: all of the above. All adapters share identical hyperparameters; only target modules differ. The four groups are chosen to map one-to-one onto the functional roles in the attractor-geometry account developed in the next section.
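
The four placements can be written down as a small configuration table. The sketch below uses plain Python dictionaries with hypothetical names (`SHARED`, `PLACEMENTS`, `adapter_config`); the text does not say which training library was used, so no library call is assumed. The module lists and hyperparameters are copied from the paragraph above.

```python
# Hyperparameters stated in the text; shared by every placement.
SHARED = {"rank": 8, "alpha": 16, "learning_rate": 5e-5, "epochs": 10}

# Target modules per component group, exactly as listed in the text.
PLACEMENTS = {
    "QK": ["q_proj", "k_proj"],
    "VO": ["v_proj"],
    "MLP": ["gate_proj", "up_proj", "down_proj"],
}
# "Full" is described as "all of the above", i.e. the union of the three groups.
PLACEMENTS["Full"] = [m for group in ("QK", "VO", "MLP") for m in PLACEMENTS[group]]

def adapter_config(placement):
    """Return a hypothetical config dict for one placement (not the authors' code)."""
    return {**SHARED, "target_modules": list(PLACEMENTS[placement])}

for name in PLACEMENTS:
    print(name, adapter_config(name)["target_modules"])
```

Because only `target_modules` differs between configurations, any difference in behavior across placements can be attributed to where the adapter sits rather than to how it was trained.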

## 3 Attractor Geometry of Memory Arbitration

Before running experiments, we formalize PM and WM as distinct dynamical structures in the model’s representation space. The dynamical framing is exact, not a metaphor: each transformer layer applies a nonlinear map $h \mapsto h + \Delta(h)$, and composing these maps across depth produces an iterated system whose fixed points are precisely what the model has memorized. Facts in MLP weights create persistent basins (stable regions the system converges to regardless of input variation), while context tokens create transient perturbations that steer trajectories but leave no permanent trace in the weights. This structural difference between persistent attractors and transient steering is what gives the framework predictive power: it makes specific, testable claims about which architectural interventions will reshape basins versus merely reroute trajectories.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x2.png)
Figure 2: Schematic representation-space geometry at the final generation step. Memory arbitration (whether output comes from stored weights or input context) can be understood as trajectory convergence to competing attractors in the model’s representation space. (a) WM conditioning induces a transient *pseudo*-attractor: a pull toward a context-consistent state that persists only while those context tokens are active, unlike the weight-encoded PM basins that persist across all inputs. (b) PM basins are persistent, encoded in weights. (c) The combined landscape: trajectory convergence to PM basin vs. WM deflection determines output. (d–f) Adapter perturbations: QK modifies routing without reshaping basins; VO modifies readout within the selected basin; MLP reshapes basin structure. Conflict is basin competition (both PM and WM pulls exist); hallucination is basin absence (the hidden state wanders outside all basins).

With fixed weights, a transformer induces an input-conditioned discrete-time dynamical system Geshkovski et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib22))

$$h_{t+1} = F(h_{t};\, x,\, y_{\leq t}), \tag{1}$$

where $F$ composes attention, MLP, residual, and normalization. PM corresponds to persistent attractor basins shaped primarily by MLP weight updates Geva et al. ([2021](https://arxiv.org/html/2605.05686#bib.bib1)); WM corresponds to prompt-conditioned routing that transiently steers trajectories toward context-consistent states. The four LoRA placements perturb this landscape in predictably distinct ways (Fig. [2](https://arxiv.org/html/2605.05686#S3.F2)d–f): QK modifies routing without reshaping basins; VO perturbs content write-back; MLP modifies the iterated map $h \mapsto h + \mathrm{MLP}(h)$ that defines the fixed-point basins themselves.
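
To make the notion of a fixed-point basin concrete, the following toy iterates a residual map of the form $h \mapsto h + \Delta(h)$ in two dimensions. Everything here is an assumption made for illustration: the update rule (a pull toward the nearest of two hand-placed centers), the step size `eta`, and the centers themselves are hypothetical and are not the transformer dynamics studied in the paper. In this toy every starting point lies in some basin, so it illustrates persistent fixed points only, not basin absence.

```python
import math

def step(h, centers, eta=0.5):
    """One residual update h -> h + delta(h), with delta pulling toward the nearest center."""
    nearest = min(centers, key=lambda m: math.dist(h, m))
    return [x + eta * (m - x) for x, m in zip(h, nearest)]

def iterate(h, centers, n_steps=20):
    """Apply the residual update repeatedly and return the final state."""
    for _ in range(n_steps):
        h = step(h, centers)
    return h

centers = [[0.0, 0.0], [4.0, 0.0]]    # two hypothetical basin centers
print(iterate([0.9, 0.3], centers))   # ends near [0, 0]
print(iterate([3.2, -0.5], centers))  # ends near [4, 0]
```

Which center a trajectory reaches depends only on where it starts, which is the sense in which a basin is a property of the map rather than of any single input.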

### 3.1 Architecture–dynamics bridge: Jacobian decomposition

To verify these roles are intrinsic to the pretrained model rather than artifacts of fine\-tuning, we apply Jacobian decomposition\.

The distinct roles of QK, VO, and MLP can be made precise by decomposing the Jacobian of each sublayer. For sublayer $f\colon \mathbb{R}^{d} \to \mathbb{R}^{d}$ at state $h$:

$$J = S + A, \quad S = \tfrac{1}{2}(J + J^{\top}), \quad A = \tfrac{1}{2}(J - J^{\top}). \tag{2}$$

$S$ captures contraction/expansion (basin shaping); $A$ captures rotation/transport. We define symmetry correlation $\varphi = \mathrm{corr}(J_{ij}, J_{ji})$ for $i < j$, ranging from $-1$ (antisymmetric) to $+1$ (gradient-like), and $\|S\|_{F}^{2}$ as absolute contractive magnitude.
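
The quantities in Eq. (2) are straightforward to compute once a Jacobian is available. The sketch below works on small hand-written matrices rather than the $2048 \times 2048$ Jacobians used in the paper; the helper names are hypothetical, $\mathrm{corr}$ is assumed to be the Pearson correlation, and the case of zero variance among the off-diagonal entries is not handled.

```python
import math

def decompose(J):
    """Split a square matrix into S = (J + J^T)/2 and A = (J - J^T)/2."""
    n = len(J)
    S = [[0.5 * (J[i][j] + J[j][i]) for j in range(n)] for i in range(n)]
    A = [[0.5 * (J[i][j] - J[j][i]) for j in range(n)] for i in range(n)]
    return S, A

def symmetry_correlation(J):
    """Pearson correlation between J_ij and J_ji over all pairs with i < j."""
    n = len(J)
    upper = [J[i][j] for i in range(n) for j in range(i + 1, n)]
    lower = [J[j][i] for i in range(n) for j in range(i + 1, n)]
    mu_u, mu_l = sum(upper) / len(upper), sum(lower) / len(lower)
    cov = sum((u - mu_u) * (l - mu_l) for u, l in zip(upper, lower))
    var_u = sum((u - mu_u) ** 2 for u in upper)
    var_l = sum((l - mu_l) ** 2 for l in lower)
    return cov / math.sqrt(var_u * var_l)

def frobenius_sq(M):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(x * x for row in M for x in row)

# Hypothetical 3x3 matrices standing in for a sublayer Jacobian.
symmetric = [[1.0, 2.0, -1.0], [2.0, 0.5, 3.0], [-1.0, 3.0, 2.0]]
antisymmetric = [[0.0, 2.0, -1.0], [-2.0, 0.0, 3.0], [1.0, -3.0, 0.0]]

print(symmetry_correlation(symmetric))         # gradient-like extreme of the range
print(symmetry_correlation(antisymmetric))     # antisymmetric extreme of the range
print(frobenius_sq(decompose(antisymmetric)[0]))  # no symmetric part at all
```

The two test matrices sit at the ends of the $\varphi$ range, which gives a quick sanity check before applying the same functions to real Jacobians.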

![Refer to caption](https://arxiv.org/html/2605.05686v1/x3.png)
Figure 3: Jacobian symmetry correlation reveals distinct component roles. The distinct functional roles of QK (routing), VO (content readout), and MLP (basin shaping) are measurable directly from the pretrained model’s Jacobians, independent of any fine-tuning. Pretrained Qwen2.5-3B; exact $2048 \times 2048$ Jacobians at seven layers, averaged over five prompts. (a) Symmetry correlation $\varphi$ by component: VO is strongly symmetric ($\varphi \approx 0.59$), consistent with Hopfield-like gradient contraction; QK is unstructured ($\varphi \approx 0$), with routing from softmax patterns. (b) $\|S\|_{F}^{2}$: MLP dominates by $25\times$, confirming it as the primary source of basin depth. (c,d) Layer profiles: VO peaks at layer 15 ($\varphi = 0.72$); MLP dominance holds at every depth. The attractor roles of QK (routing), VO (readout), and MLP (basin shaping) are intrinsic to the pretrained architecture, not artifacts of fine-tuning.

Figure [3](https://arxiv.org/html/2605.05686#S3.F3) reports these quantities for the pretrained model. VO has $\varphi \approx 0.59$, peaking at layer 15 ($\varphi = 0.72$), consistent with gradient-like associative-memory readout Meng et al. ([2022](https://arxiv.org/html/2605.05686#bib.bib2)). QK is unstructured ($\varphi \approx 0$): routing arises from softmax attention patterns. MLP has the largest $\|S\|_{F}^{2}$ by $25\times$, confirming it as the dominant basin substrate. The mid-layer VO peak is replicated across five model families (Appendix [G](https://arxiv.org/html/2605.05686#A7)), confirming these roles are architecture-general.

#### Predictions\.

Under brittle PM, basins are format\-gated \(recall succeeds only when the query matches the exact training template format\): adapter placement should produce distinct perturbation signatures and WM should win under conflict \(when the context code and the memorized code disagree\)\. Under robust PM, deeper format\-invariant MLP basins should increasingly suppress WM steering—how far this goes depends on the actual basin depth relative to WM pull, a ratio set by training\. The next section tests both predictions and traces their consequences for the hallucination problem\.

## 4 Experiments

We use the adapter experiments to confirm the component roles predicted by the attractor account and to characterize both failure modes that arise from the two-memory architecture. Figure [4](https://arxiv.org/html/2605.05686#S4.F4) summarizes the results.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x4.png)
Figure 4: Memory circuit dissociation under brittle and robust PM. Adapting different components produces qualitatively different perturbation signatures, confirming the attractor-geometry predictions; robust PM training produces complete context insensitivity as a byproduct of format-invariant memorization. (a) PM and WM recall by adapter type for both training regimes. Brittle PM: all adapters achieve 100% PM on the training format; WM preservation tracks predicted geometric roles (QK 99.5%, VO 92%, MLP 39%, Full 30%). Robust PM: MLP-only and Full achieve 100% format-invariant PM but collapse WM to 0% (context deafness); QK-only preserves WM (95%) but fails robust PM (12%). (b) Per-digit WM recall error under brittle PM: QK-only near-zero; VO-only gradual rise from digit 3 (write-back distortion); MLP-only and Full steep collapse at positions 4–5 (compounding autoregressive dynamics). (c) WM–PM conflict under robust PM: PM accuracy decays from $\approx 77\%$ to near chance across five digit positions while output entropy stays flat, so context interference grows without any corresponding rise in output uncertainty. (d) Attractor landscape (UMAP): hidden-state trajectories for unseen entities either converge to a memorized PM basin (captured, 31/60) or wander without settling (29/60). Captured trajectories have significantly lower final entropy ($p = 0.019$), showing the LM head is most confident precisely when the state falls into a memorized basin. Adapter placement determines which failure mode occurs: the component-wise dissociation is clean and matches the attractor-geometry predictions.

### 4.1 Circuit Role Confirmation

The attractor account predicts qualitatively distinct WM perturbation patterns by component role: QK modifications (routing only) should minimally disrupt WM; VO modifications (write-back) should cause gradual per-position degradation accumulating across autoregressive steps; MLP modifications (basin restructuring) should cause steep late-position collapse. Under brittle PM, all adapters achieve near-perfect recall on the training format but collapse to $\sim$0% on any format variation (format gating). Despite this uniform PM success, WM signatures dissociate exactly as predicted (Fig. [4](https://arxiv.org/html/2605.05686#S4.F4)a,b). QK-only preserves WM: routing changes leave the content pipeline intact. VO-only shows gradual per-digit degradation from position 3, confirming per-step write-back distortion (accumulated across digits). MLP-only and Full show near-zero error at positions 1–3 with steep collapse at 4–5, confirming compounding MLP dynamics. Under conflict with brittle PM, WM wins across all adapters (PM wins 0%), confirming that shallow, format-gated basins are inactive under novel prompts, exactly the attractor prediction for brittle PM.

### 4.2 Failure Mode 1: Conflict and Context Deafness

As MLP basins deepen, their pull on the trajectory increasingly dominates WM steering\. How far this suppression goes—partial or complete—depends on basin depth relative to WM pull, a ratio set by training\. Robust PM training \(53 diverse templates, distractor prefixes\) drives this to the extreme, producing complete context deafness as an empirical outcome\.

The contrast with robust PM is stark (Fig. [4](https://arxiv.org/html/2605.05686#S4.F4)a). MLP-only and Full achieve 100% format-invariant PM recall, correctly retrieving memorized codes regardless of prompt format, paraphrasing, or distractor prefixes. But this comes at a severe cost: WM drops to 0%. The MLP modifications that create format-invariant basins also disrupt attention-mediated context reading, overriding WM even in the PM–WM conflict condition, which the model never encountered during training. Under WM–PM conflict, the opposing pulls of PM basins and WM steering interfere without cleanly resolving (Fig. [4](https://arxiv.org/html/2605.05686#S4.F4)c): PM accuracy starts at $\approx 77\%$ at the first digit but decays to near chance as the interference compounds across generation steps, yet output entropy stays flat throughout. The model continues generating with uniform confidence even as the trajectory corrupts.

QK-only avoids this trade-off (95% WM, 12% PM): routing cannot create persistent basins. VO-only occupies a middle ground (53% WM, 53% PM). Only MLP/Full create the fixed-point structure required for robust PM, necessarily reshaping the landscape through which WM operates. This confirms the attractor account’s second prediction and establishes Failure Mode 1: the correct answer is in the model’s weights, but deep basin competition corrupts the generation trajectory silently, with output entropy staying flat even as accuracy decays to chance.

### 4.3 Failure Mode 2: Hallucination and Basin Absence

The attractor account predicts a second, categorically different failure mode for unseen entities: no memorized basin exists, so the hidden state wanders freely through the MLP-reshaped landscape. The LM head (the fixed linear projection from final hidden states to vocabulary logits, unchanged after pretraining) reports only the logit gap at the output layer, which is unrelated to basin proximity. As a result, output entropy (Shannon entropy of the next-token distribution, $H = -\sum_{d} p_{d} \log_{2} p_{d}$) is an unreliable guide to whether the model knows the answer. The correct-token rank (where the target token falls in the probability-sorted vocabulary) reveals that the LM head can simultaneously be highly confident and wrong: a pattern invisible to output-space monitoring.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x5.png)
Figure 5: Hallucination and LM head output bias. For entities never trained on, the LM head can produce near-zero-entropy outputs, making hallucinations indistinguishable from genuine recall in output space. (a) Correct-token rank (log scale): QK-only near rank 5 with moderate entropy; VO-only produces the *lowest* entropy ($H = 0.17$) despite the correct token ranking beyond 1,000 (write-back lock-in). (b) Digit entropy by condition: entropy drops to near-zero for hallucinations, particularly under VO-only. (c) Generated codes overlap memorized codes 17–21$\times$ above chance; only $\sim 16\%$ correspond to the nearest stored entity in hidden-state space, and the remainder is LM head output bias.

Without an adapter, the base model correctly refuses unseen-entity queries. With any adapter active, it confidently generates a 5-digit hallucination (Fig. [5](https://arxiv.org/html/2605.05686#S4.F5); sample outputs in Appendix Table [8](https://arxiv.org/html/2605.05686#A4.T8)). The MLP adapter globally reshapes the hidden-state landscape so that even unseen queries land in regions where the *unchanged* LM head outputs digit codes rather than text. Cross-entity analysis reveals this is predominantly output-layer bias: only $\sim 16\%$ of matched codes correspond to the nearest stored entity in hidden-state space (Appendix [F.5](https://arxiv.org/html/2605.05686#A6.SS5)). The UMAP of hidden-state trajectories (Fig. [4](https://arxiv.org/html/2605.05686#S4.F4)d) shows unseen-entity queries moving toward the basin region across autoregressive steps, with 31 of 60 queries captured by a memorized basin and 29 wandering without settling. Captured queries produce significantly lower entropy than wandering ones ($H = 0.16$ vs. $0.44$, $p = 0.019$): the LM head fires confidently whenever any attractor captures the hidden state, regardless of whether that basin belongs to the queried entity, which has none. Across both failure modes the model generates with apparent confidence when it should be uncertain: entropy stays flat as conflict corrupts the trajectory, and collapses to near-zero for a substantial fraction of hallucinations. Output entropy is an unreliable epistemic signal in both regimes.

## 5 Detecting Failures Using Hidden-State Geometry

![Refer to caption](https://arxiv.org/html/2605.05686v1/x6.png)
Figure 6: Geometric signals outperform entropy, and the gap widens with scale. The distance from the current hidden state to the nearest memorized basin (margin) separates correct recall from hallucination far more cleanly than output entropy, and this advantage *grows* as models scale up. (a) Margin vs. entropy for 450 queries: correct outputs cluster at low margin, hallucinations at high margin, across all five evaluation conditions. (b) Intervention: to catch 100% of hallucinations by thresholding a single scalar, entropy must refuse 99.5% of correct outputs; margin and gap preserve 100%. (c) Pretrained knowledge validation (196 factual queries, no LoRA adapter): margin AUROC $= 1.000$ vs. entropy $0.622$, confirming that basin geometry is not a fine-tuning artifact. (d) Scaling trends across 12 instruction-tuned models (0.36B–14B), split into two subpanels: *top*, hallucination rate $H$ falls with scale; *bottom*, the confident fraction $C$ (hallucinations with output entropy $< 0.1$) rises, so the absolute rate of undetectable errors $U = H \cdot C$ grows with model size. Geometric margin detects errors that entropy cannot, and the advantage grows with scale, making geometry-based monitoring more necessary, not less, as models improve.

The frozen LM head cannot distinguish these failure states from correct recall: it reads only the output logit gap, not the hidden-state geometry that produced it. We now show that the pre-LM-head geometry is the authoritative *epistemic signal*, a representation-space measurement that reliably encodes whether the model’s hidden state occupies a memorized basin (correct recall), is in competition with another pull (conflict), or is wandering outside all basins (hallucination).

#### Geometric signals\.

For each training entity $i$, we compute a basin center $m_{i}$ by averaging final-layer hidden states across canonical templates. For any query $x$ with hidden state $h(x)$ at the first-digit position:

$$\delta(x) = \min_{i} \|h(x) - m_{i}\|_{2} \quad \text{(margin: distance to nearest basin)}, \tag{3}$$

$$\mathrm{gap}(x) = \delta_{2}(x) - \delta(x) \quad \text{(gap: basin separation)}. \tag{4}$$

All signals are computed at the first-digit position, before autoregressive dynamics compound.
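
Equations (3) and (4) can be sketched directly on toy vectors. This is an illustration under stated assumptions, not the authors’ code: the function names and the 2-D centers are hypothetical, and $\delta_{2}(x)$ is read here as the distance to the second-nearest basin center, which the text does not state explicitly.

```python
import math

def basin_center(states):
    """Average a list of hidden-state vectors (one per template) into a center m_i."""
    dim = len(states[0])
    return [sum(s[k] for s in states) / len(states) for k in range(dim)]

def margin_and_gap(h, centers):
    """Return (delta, gap): nearest-center distance, and second-nearest minus nearest.

    Assumption: delta_2 is the distance to the second-nearest center.
    """
    dists = sorted(math.dist(h, m) for m in centers)
    return dists[0], dists[1] - dists[0]

# Hypothetical 2-D basin centers; real hidden states would have d = 2048 dimensions.
centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]

print(basin_center([[0.0, 0.0], [2.0, 2.0]]))  # average of two template states
print(margin_and_gap([1.0, 0.0], centers))     # close to one center: small margin, large gap
print(margin_and_gap([50.0, 50.0], centers))   # far from every center: large margin
```

A query near one center gets a small margin and a large gap, while a query far from all centers gets a large margin, matching the low-margin and high-margin clusters described for correct recall and hallucination.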

#### Geometric signals outperform entropy\.

Figure [6](https://arxiv.org/html/2605.05686#S5.F6) quantifies the predictive advantage. Panel (a) shows that correct outputs cluster at low margin while hallucinations occupy the high-margin region. Logistic regression yields AUROC $= 0.993$ for margin alone vs. $0.968$ for entropy. The operational consequence is stark (panel b): to catch *every* hallucination by thresholding, entropy must refuse 99.5% of correct recalls, while margin preserves all of them, giving perfect separation with zero false refusals. Entropy fails because $\sim$13% of hallucinations produce near-zero entropy, indistinguishable from correct recall; margin separates them (PM-seen $\delta < 32$, hallucination $\delta > 104$).
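
The thresholding intervention in panel (b) amounts to a short calculation. The sketch below is a hypothetical reconstruction of that procedure: the function name, the refusal convention (refuse when the score is at or above the threshold), and all score values are made up for illustration, with the margin values chosen only to respect the $\delta < 32$ and $\delta > 104$ ranges quoted above.

```python
def false_refusal_rate(correct_scores, halluc_scores):
    """Fraction of correct outputs refused when the threshold catches every hallucination.

    Convention (an assumption): refuse when score >= threshold, so a higher score
    means more suspicious. The loosest threshold that still catches all
    hallucinations is the smallest hallucination score.
    """
    threshold = min(halluc_scores)
    refused = sum(1 for s in correct_scores if s >= threshold)
    return refused / len(correct_scores)

# Hypothetical margins: correct recall below 32, hallucinations above 104.
margin_correct = [12.0, 20.0, 31.0]
margin_halluc = [105.0, 140.0, 200.0]

# Hypothetical entropies: one hallucination has near-zero entropy.
entropy_correct = [0.02, 0.05, 0.10, 0.30]
entropy_halluc = [0.01, 0.40, 0.90]

print(false_refusal_rate(margin_correct, margin_halluc))    # no correct output is refused
print(false_refusal_rate(entropy_correct, entropy_halluc))  # every correct output is refused
```

A single near-zero-entropy hallucination drags the entropy threshold below every correct output, which is the mechanism behind the high false-refusal rate reported for entropy.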

Critically, the geometric advantage is not a fine-tuning artifact. On 196 natural-language factual queries answered by the pretrained base model with no adapter (Appendix [J](https://arxiv.org/html/2605.05686#A10)), margin achieves AUROC $=1.000$ while entropy falls to $0.622$ (panel c). This test (real factual knowledge, no LoRA, no synthetic codes) directly validates that attractor geometry is a structural property of how pretrained models store knowledge: the same geometry that separates success from failure in our controlled experiments is recoverable from the pretrained model’s representations on natural language.

#### Scaling to large models\.

Panel (d) of Figure [6](https://arxiv.org/html/2605.05686#S5.F6) reveals the scaling mechanism on real instruction-tuned models. Across 12 models spanning 0.36B–14B parameters and four families, the top subpanel shows the overall hallucination rate $H$ falling with scale, while the bottom subpanel shows the confident fraction $C$ (hallucinations with output entropy $<0.1$) rising. Two opposing trends emerge: the hidden-state geometry encodes correctness more reliably as models grow, while the LM head compresses this information increasingly to zero. In a controlled teacher-student ablation (Appendix [K](https://arxiv.org/html/2605.05686#A11)), geometric margin separation between hallucination and correct recall improves from 1.2$\times$ to 153$\times$ as model width grows 16$\times$, while $C$ rises from 52% to 99% over the same range.

The mechanism is softmax saturation: training drives the top-two logit gap $\Delta$ upward without bound Liu et al. ([2026](https://arxiv.org/html/2605.05686#bib.bib29)), and because the LM head is a *fixed* linear map it applies the same large gap indiscriminately to in-distribution and out-of-distribution inputs alike. This yields a universal law

$$C = \exp\!\left(-\frac{c}{\bar{\Delta}}\right), \tag{5}$$

where $\bar{\Delta}$ is the mean logit gap on wrong-answer queries and $c$ is a threshold-dependent constant ($c \approx 5.9$ for $H_{0}=0.1$ at $V \approx 30{,}000$). The law collapses 21 (model, benchmark) data points across 12 models (0.36B–14B) and two benchmarks with $r^{2}=0.88$ (Appendix [K](https://arxiv.org/html/2605.05686#A11)), with the absolute rate of *undetectable* errors $U = H \cdot C$ growing from $<0.5\%$ at sub-1B to $\sim 18\%$ at 14B.
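Eq. (5) and the definition $U = H \cdot C$ can be evaluated directly. The sketch below uses the constant $c \approx 5.9$ quoted in the text as a default; the function names and any input values passed to them are hypothetical and do not reproduce measurements from the paper.

```python
import math

def confident_fraction(mean_logit_gap, c=5.9):
    """Eq. (5): C = exp(-c / mean_gap), with c defaulting to the value quoted in the text."""
    if mean_logit_gap <= 0:
        raise ValueError("the mean logit gap must be positive")
    return math.exp(-c / mean_logit_gap)

def undetectable_rate(halluc_rate, mean_logit_gap, c=5.9):
    """U = H * C: hallucination rate times the confident fraction."""
    return halluc_rate * confident_fraction(mean_logit_gap, c)

# Hypothetical inputs, chosen only to show the direction of the effect.
for gap in (2.0, 5.9, 20.0):
    print(gap, confident_fraction(gap), undetectable_rate(0.2, gap))
```

Because $C$ increases monotonically with $\bar{\Delta}$ and saturates toward 1, $U$ can grow even while $H$ falls, provided $C$ rises faster than $H$ drops.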

Geometric margin is not subject to this degradation: operating on pre-softmax representations, it tracks basin proximity directly and improves as basins sharpen with scale. Geometric monitoring therefore becomes *more* necessary, not optional, as models grow.

#### Why conflict differs from hallucination\.

The attractor framework makes a precise prediction about conflict: since failure arises not from basin absence but from WM interference with own-basin trajectory convergence, static margin at the first token *should* fail as a diagnostic. This is exactly what we observe. The queried entity *is* memorized (moderate margin, $\delta=102$), the nearest basin is the entity’s *own* memorized basin in 96% of cases, and all static signals achieve only modest power ($|r|\approx 0.2$, AUROC $<0.75$; Appendix [D.5](https://arxiv.org/html/2605.05686#A4.SS5)). That the framework correctly predicts both where static diagnosis succeeds (hallucination: basin absence) and where it fails (conflict: trajectory dynamics) is itself a validation of the account.

Together, geometric margin catches errors entropy cannot and improves precisely where output\-based monitoring degrades with scale\.

## 6 Related Work

Our work connects three active lines of research—knowledge localization, context–parametric conflict, and hallucination detection—each of which has studied aspects of the same phenomenon without a unified mechanistic account linking all three\.

MLP layers act as key–value memories Geva et al. ([2021](https://arxiv.org/html/2605.05686#bib.bib1)); facts can be localized and edited Meng et al. ([2022](https://arxiv.org/html/2605.05686#bib.bib2), [2023](https://arxiv.org/html/2605.05686#bib.bib3)) and attention handles in-context retrieval Dong et al. ([2025](https://arxiv.org/html/2605.05686#bib.bib4)). Our Jacobian decomposition extends this to the collective basin geometry governing arbitration between sources. Prior work documented PM–WM tension Longpre et al. ([2021](https://arxiv.org/html/2605.05686#bib.bib17)); Neeman et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib20)); Xie et al. ([2024](https://arxiv.org/html/2605.05686#bib.bib19)); Li et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib18)) as a behavioral phenomenon; our framework provides a mechanistic account via attractor depth (format-gated vs. robust PM). Output-based calibration methods Ji et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib8)); Zhang et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib9)); Guo et al. ([2017](https://arxiv.org/html/2605.05686#bib.bib12)); Wang et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib11)); Lewis et al. ([2020](https://arxiv.org/html/2605.05686#bib.bib10)) operate on the softmax distribution, subject to the saturation bottleneck we characterize; Semantic Entropy Kuhn et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib23)) misses confident hallucinations. Hidden-state probing Kossen et al. ([2024](https://arxiv.org/html/2605.05686#bib.bib24)); Slobodkin et al. ([2025](https://arxiv.org/html/2605.05686#bib.bib26)) shows representations encode truthfulness; our framework explains why via basin proximity, and margin makes this explicit, outperforms probes, and improves with scale.
On scaling, Liu et al. ([2026](https://arxiv.org/html/2605.05686#bib.bib29)) showed logit scale grows as $\tau^{1/3}$, which we connect to representational geometry to explain why $C$ grows with scale Kaplan et al. ([2020](https://arxiv.org/html/2605.05686#bib.bib27)); Hoffmann et al. ([2022](https://arxiv.org/html/2605.05686#bib.bib28)). Finally, Geshkovski et al. ([2023](https://arxiv.org/html/2605.05686#bib.bib22)) connected transformers to dynamical systems; Agarwal et al. ([2026](https://arxiv.org/html/2605.05686#bib.bib21)) showed a parallel dissociation (attention as routing, FFN as posterior update); our work provides empirical grounding for both.

## 7 Discussion and Future Work

#### Do representations truly encode epistemic state?

The central finding is not that geometric margin outperforms entropy (though it does) but what that gap reveals about how knowledge is organized inside language models. Hidden-state geometry encodes genuine epistemic state structurally: on 196 pretrained factual queries with no adapter, margin achieves AUROC $=1.000$ while entropy reaches only $0.622$ (Appendix [J](https://arxiv.org/html/2605.05686#A10)), and the signal crystallizes only after routing decisions are already committed. The attractor framework provides a mechanistic explanation: by generation time, the hidden state either occupies a memorized basin or wanders freely, a geometric fact encoded in MLP weight structure and readable at the representation level even when the output projection discards it.

#### The LM head as epistemic bottleneck\.

The LM head, a fixed linear projection trained to maximize next-token prediction accuracy, was never designed to report epistemic state. It compresses thousands of dimensions of representational geometry into a logit gap; what survives this compression is candidate-token probability, not basin proximity. Training amplifies the decoupling: as models scale, basin sharpening improves geometric margin while logit saturation simultaneously worsens entropy-based calibration (Section [5](https://arxiv.org/html/2605.05686#S5.SS0.SSS0.Px3)). Post-hoc recalibration operates downstream of this bottleneck and cannot recover what was never encoded in the output distribution. Reliable uncertainty communication therefore requires either retraining the output projection with an explicit epistemic objective or an auxiliary readout that queries hidden states directly.

#### Limitations\.

Margin reaches TruthfulQA AUROC $=0.858$ vs. entropy $0.656$ but degrades on adversarial reasoning probes where no memorized basin exists (Appendix [J.1](https://arxiv.org/html/2605.05686#A10.SS1)), so transfer to open-domain settings requires further study. The scaling analyses cover 0.36B–14B parameters; RLHF behavior remains open, as does scalable computation of basin centers for deployment beyond controlled evaluation.

#### Future directions\.

Closing the representational-to-output epistemic gap is a training-time problem: it requires objectives that reward basin-proximity encoding, output-head fine-tuning that reads representational geometry rather than next-token accuracy, and auxiliary epistemic readout heads that can be queried independently of the generation process (Appendix [H](https://arxiv.org/html/2605.05686#A8)). The conflict condition, which resists static geometric diagnosis (AUROC $<0.75$), also motivates trajectory-based methods that track basin convergence over autoregressive steps.

#### Conclusion\.

Both failure modes trace to a single cause: the frozen LM head erases the epistemic information encoded in hidden states, and training amplifies this erasure with scale\. The attractor\-geometry account makes this precise—and identifies where interventions must look: the representational geometry beneath the output layer\.

## References

- \[1\] (2026) The Bayesian geometry of transformer attention. arXiv preprint arXiv:2512.22471. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[2\] Y. Dong, L. Noci, M. Khodak, and M. Li (2025) Attention retrieves, MLP memorizes: disentangling trainable components in the transformer. arXiv preprint arXiv:2506.01115. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[3\] N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Goldber, S. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2021/framework/index.html). Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px4.p1.1).
- \[4\] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet (2023) A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px5.p1.1), [§3](https://arxiv.org/html/2605.05686#S3.p2.3), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[5\] M. Geva, R. Schuster, J. Berant, and O. Levy (2021) Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5484–5495. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px1.p1.1), [§3](https://arxiv.org/html/2605.05686#S3.p2.2), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[6\] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1321–1330. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px3.p1.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[7\] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[8\] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 2790–2799. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px2.p1.1).
- \[9\] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR). Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2605.05686#S1.p2.1), [§2](https://arxiv.org/html/2605.05686#S2.SS0.SSS0.Px4.p1.2).
- \[10\] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), pp. 1–38. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px3.p1.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[11\] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a reading comprehension dataset over Wikipedia and the Web. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1601–1611. Cited by: [Appendix K](https://arxiv.org/html/2605.05686#A11.p1.1).
- \[12\] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[13\] J. Kossen, Y. Gal, and P. Hennig (2024) Semantic entropy probes: robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[14\] L. Kuhn, Y. Gal, and S. Farquhar (2023) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. International Conference on Learning Representations (ICLR). Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[15\] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 9459–9474. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px3.p1.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[16\] D. Li, A. S. Rawat, M. Zaheer, X. Wang, M. Lukasik, A. Veit, F. Yu, and S. Kumar (2023) Large language models with controllable working memory. In Findings of the Association for Computational Linguistics: ACL, pp. 1774–1793. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[17\] X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4582–4597. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px2.p1.1).
- \[18\] S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3214–3252. Cited by: [§J.1](https://arxiv.org/html/2605.05686#A10.SS1.p1.1).
- \[19\] Y. Liu, Z. Liu, C. Pehlevan, and J. Gore (2026) Universal one-third time scaling in learning peaked distributions. arXiv preprint arXiv:2602.03685. Cited by: [§K.1](https://arxiv.org/html/2605.05686#A11.SS1.SSS0.Px4.p1.2), [§5](https://arxiv.org/html/2605.05686#S5.SS0.SSS0.Px3.p2.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[20\] S. Longpre, K. Perisetla, A. Chen, N. Ramesh, C. DuBois, and S. Singh (2021) Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7052–7063. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[21\] K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022) Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px1.p1.1), [§3.1](https://arxiv.org/html/2605.05686#S3.SS1.p3.5), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[22\] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau (2023) Mass-editing memory in a transformer. In International Conference on Learning Representations (ICLR). Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px1.p1.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[23\] E. Neeman, R. Aharoni, O. Honovich, L. Choshen, I. Szpektor, and O. Abend (2023) DisentQA: disentangling parametric and contextual knowledge with counterfactual question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 10056–10070. Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[24\] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah (2022) In-context learning and induction heads. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html). Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px4.p1.1).
- \[25\] A. Slobodkin, O. Goldman, A. Caciularu, I. Dagan, and S. Ravfogel (2025) LLMs know more than they show: on the intrinsic representation of LLM hallucinations. International Conference on Learning Representations (ICLR). Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[26\] S. Sun, A. Canziani, Y. LeCun, and J. Zhu (2026) The spike, the sparse and the sink: anatomy of massive activations and attention sinks. arXiv preprint arXiv:2603.05498. Cited by: [Appendix G](https://arxiv.org/html/2605.05686#A7.SS0.SSS0.Px3.p1.9).
- \[27\] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR). Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px3.p1.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[28\] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR). Cited by: [Appendix G](https://arxiv.org/html/2605.05686#A7.SS0.SSS0.Px1.p1.5).
- \[29\] J. Xie, K. Zhang, J. Chen, R. Lou, and Y. Su (2024) Adaptive chameleons or stubborn sloths: revealing the behavior of large language models in knowledge conflicts. Transactions on Machine Learning Research (TMLR). Cited by: [§6](https://arxiv.org/html/2605.05686#S6.p2.2).
- \[30\] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, and S. Shi (2023) Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219. Cited by: [Appendix I](https://arxiv.org/html/2605.05686#A9.SS0.SSS0.Px3.p1.1), [§6](https://arxiv.org/html/2605.05686#S6.p2.2).

## Appendix A Training Details

### A.1 Prompt templates for robust PM

Robust PM training uses 53 prompt templates organized into six categories, designed to cover a wide range of natural prompt structures. Each training example randomly samples one template; additionally, 25% of examples undergo synonym paraphrasing (substituting “code” $\to$ “identifier”/“number”/“value”/“designation”, etc.) and 10% receive a distractor prefix (e.g., “Note: The system is operational.”). Table [1](https://arxiv.org/html/2605.05686#A1.T1) lists all 53 templates.

Table 1: All 53 prompt templates used for robust PM training. Each training example randomly samples one template and optionally applies synonym paraphrasing (25% probability) and distractor prefix injection (10% probability). The training data contains zero conflict examples, zero refusal examples, and zero context-reading examples.

### A.2 Data generation

#### Entity–code assignments\.

Each entity is assigned an identifier of the form `E{i:06d}` (e.g., `E000000`–`E001599`) and a random 5-digit code sampled uniformly from $\{10000,\dots,99999\}$. With $N=1{,}600$ entities, we obtain 1,586 unique codes out of 90,000 possible strings (1.76% coverage), establishing a low base rate for chance overlap.
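The assignment described above can be sketched as follows. The helper names and the seed are hypothetical, and sampled codes will differ from those used in the paper. The second helper computes the expected number of distinct codes under uniform sampling with replacement, which for $N=1{,}600$ and 90,000 possible codes comes out close to the reported 1,586.

```python
import random

CODE_MIN, CODE_MAX = 10000, 99999  # 90,000 possible 5-digit codes

def make_entity_codes(n_entities, seed=0):
    """Map identifiers E000000, E000001, ... to uniformly sampled 5-digit codes."""
    rng = random.Random(seed)  # hypothetical seed; the source does not give one
    return {f"E{i:06d}": str(rng.randint(CODE_MIN, CODE_MAX))
            for i in range(n_entities)}

def expected_unique_codes(n_entities, n_codes=CODE_MAX - CODE_MIN + 1):
    """Expected number of distinct codes when sampling with replacement."""
    return n_codes * (1.0 - (1.0 - 1.0 / n_codes) ** n_entities)

codes = make_entity_codes(1600)
print(len(set(codes.values())), round(expected_unique_codes(1600)))
```

Sampling with replacement is an assumption that follows from the text reporting fewer unique codes than entities.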

#### Brittle PM data\.

A single template (“Q: What is {entity}’s code? A:”) is used for all training examples. Each entity is repeated 100 times, yielding $N\times 100$ training examples (e.g., 160,000 for $N=1{,}600$). Examples are shuffled before training.

#### Robust PM data\.

The same $N=1{,}600$ entity–code mappings are used with all 53 templates. Each of the 100 repetitions per entity randomly samples a template from the full set. With synonym paraphrasing (25%) and distractor prefixes (10%), this yields 152,000 training examples. An additional 1,000 negative examples (unknown entities with refusal responses) are included. The data is split 95/5 into train/validation sets.
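A minimal sketch of the sampling procedure for robust PM prompts is given below. Only the first template, the synonym list, and the distractor sentence come from the text; the other two templates are hypothetical stand-ins for the 53 real ones, and the function names are illustrative. Negative examples and the train/validation split are left out.

```python
import random

SYNONYMS = ["identifier", "number", "value", "designation"]  # listed in the text
DISTRACTOR = "Note: The system is operational."               # example from the text
TEMPLATES = [
    "Q: What is {entity}’s code? A:",      # the single brittle template from the text
    "The code for {entity} is",             # hypothetical stand-in
    "Please give the code of {entity}:",    # hypothetical stand-in
]

def make_prompt(entity, rng, templates=TEMPLATES, p_synonym=0.25, p_distractor=0.10):
    """Sample one template, then optionally paraphrase and add a distractor prefix."""
    prompt = rng.choice(templates).format(entity=entity)
    if rng.random() < p_synonym:
        prompt = prompt.replace("code", rng.choice(SYNONYMS))
    if rng.random() < p_distractor:
        prompt = DISTRACTOR + " " + prompt
    return prompt

def make_examples(entity_codes, repetitions=100, seed=0):
    """Build shuffled (prompt, code) pairs, `repetitions` per entity."""
    rng = random.Random(seed)
    examples = [(make_prompt(entity, rng), code)
                for entity, code in entity_codes.items()
                for _ in range(repetitions)]
    rng.shuffle(examples)
    return examples

examples = make_examples({"E000000": "93810", "E000001": "24592"}, repetitions=5)
print(examples[0])
```

Passing `templates=TEMPLATES[:1]` with both probabilities set to zero reduces this to the brittle single-template setting.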

#### Held\-out evaluation sets\.

Five evaluation sets are constructed, each with 50 samples: \(i\) format generalization \(novel templates never seen in training\), \(ii\) distractor robustness, \(iii\) WM–PM conflict, \(iv\) unknown/unseen entities, and \(v\) paraphrase variants\.

### A.3 LoRA hyperparameters

Table [2](https://arxiv.org/html/2605.05686#A1.T2) summarizes the LoRA configuration for all experiments. All adapters use the same optimizer settings; only target modules and rank differ.

Table 2: LoRA hyperparameters shared across all adapter types. Target modules vary by adapter type: QK = {`q_proj`, `k_proj`}; VO = {`v_proj`}; MLP = {`gate_proj`, `up_proj`, `down_proj`}; Full = all of the above.

### A.4 LoRA adapter mathematics

Each LoRA adapter adds a low-rank perturbation $\Delta W = BA$ ($B\in\mathbb{R}^{d\times r}$, $A\in\mathbb{R}^{r\times d}$) to the target weight matrix. The four adapter types perturb distinct parts of the transformer update:
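As a small illustration of the perturbation $\Delta W = BA$, the sketch below applies $(W + BA)\,h$ without ever forming the $d\times d$ product. It uses plain Python lists and hypothetical function names. Common LoRA implementations also scale the update by a factor $\alpha/r$; the formula above omits that factor, and so does this sketch.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, B, A, h):
    """Compute (W + B A) h, where B is d x r and A is r x d."""
    base = matvec(W, h)                  # frozen weight path
    low_rank = matvec(B, matvec(A, h))   # A h has r entries; B maps it back to d
    return [b + l for b, l in zip(base, low_rank)]

# Toy example with d = 2 and r = 1.
W = [[1, 0], [0, 1]]
B = [[1], [2]]
A = [[3, 4]]
print(lora_forward(W, B, A, [1, 1]))
```

Evaluating $A h$ first keeps the extra cost proportional to $r \cdot d$ rather than $d^2$, which is the usual reason for keeping the two factors separate.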

#### QK \(routing\)\.

Targets `q_proj` and `k_proj`, perturbing the attention routing weights:

$$\mathrm{Attn}(h) = \mathrm{softmax}\!\Bigl(\frac{(W_{Q}+\Delta W_{Q})\,h \cdot \bigl((W_{K}+\Delta W_{K})\,h\bigr)^{\top}}{\sqrt{d_{k}}}\Bigr). \tag{6}$$

This modifies *which* tokens attend to which (i.e., which basin the trajectory is steered toward) without altering the content read out.

#### VO \(readout\)\.

Targets `v_proj`, perturbing the content write-back channel:

$$\Delta_{\mathrm{attn}} = W_{O} \cdot \mathrm{Attn}(h) \cdot (W_{V}+\Delta W_{V})\,h. \tag{7}$$

The rank-$r$ update $\Delta W_{V} = B_{V}A_{V}$ distorts the content projected into the residual stream within a restricted subspace, without changing routing or basin structure.

#### MLP \(state update\)\.

Targets `gate_proj`, `up_proj`, and `down_proj`, perturbing the SwiGLU feed-forward block:

$$\mathrm{MLP}(h) = (W_{\mathrm{down}}+\Delta W_{\mathrm{down}})\bigl[\sigma\bigl((W_{\mathrm{gate}}+\Delta W_{\mathrm{gate}})\,h\bigr)\odot(W_{\mathrm{up}}+\Delta W_{\mathrm{up}})\,h\bigr], \tag{8}$$

where $\sigma$ is SiLU and $\odot$ is element-wise multiplication. This modifies the iterated map $h \mapsto h + \mathrm{MLP}(h)$ that defines fixed-point basins.
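The structure of Eq. (8) and of the residual map $h \mapsto h + \mathrm{MLP}(h)$ can be sketched in plain Python. The matrices passed in stand for the already-adapted weights $W + \Delta W$; normalization layers and attention are left out, so this illustrates the update rule only and is not a full transformer block. All function names are hypothetical.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_mlp(h, W_gate, W_up, W_down):
    """Eq. (8): W_down [ SiLU(W_gate h) * (W_up h) ] with element-wise product."""
    gate = [silu(z) for z in matvec(W_gate, h)]
    up = matvec(W_up, h)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

def residual_step(h, W_gate, W_up, W_down):
    """One application of the map h -> h + MLP(h)."""
    return [a + b for a, b in zip(h, swiglu_mlp(h, W_gate, W_up, W_down))]

# Toy shapes: hidden size 2, feed-forward size 3 (illustrative weights only).
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
W_down = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
print(residual_step([1.0, -2.0], W_gate, W_up, W_down))
```

A fixed point of this map is any $h$ with $\mathrm{MLP}(h) = 0$, which is one way to read the phrase "fixed-point basins" in the text.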

#### Full\.

Targets all of the above modules simultaneously, perturbing routing, readout, and basin structure jointly\.

### A.5 Evaluation protocol

All evaluations use greedy decoding (`max_new_tokens` $=8$). We report two accuracy metrics:

- Exact match: The generated string exactly equals the target code.
- Contains: The target code appears as a substring of the generated response (more lenient, accounts for minor formatting differences).

For WM–PM conflict evaluation, we additionally classify each response as:

- WM wins: The response contains the context-provided code $c_{\mathrm{WM}}$.
- PM wins: The response contains the parametrically memorized code $c_{\mathrm{PM}}$.
- Neither: The response contains neither code (garbage output or unrelated text).
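The metrics and the three-way conflict classification above translate into a few string checks. The sketch below rests on two assumptions that the text does not specify: surrounding whitespace is stripped before the exact-match comparison, and a response containing both codes is counted as a WM win because that check runs first. The function names and the PM code in the example calls are hypothetical.

```python
def exact_match(generated, target):
    """True if the generated string equals the target code.

    Stripping surrounding whitespace is an assumption, not stated in the text.
    """
    return generated.strip() == target

def contains(generated, target):
    """True if the target code appears anywhere in the generated response."""
    return target in generated

def classify_conflict(generated, c_wm, c_pm):
    """Label a conflict response as 'WM wins', 'PM wins', or 'Neither'.

    Checking WM first is an assumed tie-break for responses with both codes.
    """
    if c_wm in generated:
        return "WM wins"
    if c_pm in generated:
        return "PM wins"
    return "Neither"

# Example calls; the PM code "11111" is a made-up placeholder.
print(classify_conflict("98255", "98255", "11111"))
print(classify_conflict("31679", "34616", "11111"))
```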

## Appendix B Capacity Scaling Experiments

To understand how memorization efficiency varies with adapter architecture and rank, we conduct systematic scaling experiments. All adapters achieve near-perfect final training loss ($\approx 0.0001$) across all tested $N$; the meaningful difference is *learning efficiency*: how many gradient steps are required to memorize $N$ entities.

### B.1 Module ablation: scaling entities at rank 8

We train all four adapter types (QK, VO, MLP, Full) at rank $r=8$ across seven entity counts: $N\in\{25,50,100,200,400,800,1{,}600\}$, each with 100 repetitions per entity. Table [3](https://arxiv.org/html/2605.05686#A2.T3) reports steps to convergence (first step where training loss $<0.05$) and Figure [7](https://arxiv.org/html/2605.05686#A2.F7)a visualizes the scaling curves.
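The convergence criterion (first step where training loss falls below 0.05) can be written as a short helper. The function name and the `log_every` parameter are hypothetical; the source does not state how often the loss was logged.

```python
def steps_to_convergence(losses, threshold=0.05, log_every=1):
    """Return the first step whose logged loss is below `threshold`, or None.

    `losses[k]` is assumed to be the loss logged at step (k + 1) * log_every.
    """
    for idx, loss in enumerate(losses):
        if loss < threshold:
            return (idx + 1) * log_every
    return None

# Illustrative loss curve, not data from the paper.
print(steps_to_convergence([0.9, 0.2, 0.04, 0.01]))
print(steps_to_convergence([0.9, 0.2, 0.04, 0.01], log_every=10))
```

Returning `None` when the threshold is never reached keeps non-converged runs distinguishable from runs that converge at the final step.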

Table 3: Gradient steps to convergence (first step with training loss $<0.05$) by adapter type and entity count ($r=8$, 10 epochs). All adapters reach near-zero final loss; the table captures learning efficiency. MLP-only and Full require roughly $3$–$4\times$ fewer steps than QK-only, indicating that MLP layers are the primary substrate for efficient association storage.

### B.2 Rank sweep: Full adapter at varying ranks

We additionally sweep LoRA rank for the Full adapter at $r\in\{8,32,64\}$ across the same entity counts. Table [4](https://arxiv.org/html/2605.05686#A2.T4) reports steps to convergence and Figure [7](https://arxiv.org/html/2605.05686#A2.F7)b shows the rank scaling curves.

| Full (rank $r$) | $N=25$ | $N=50$ | $N=100$ | $N=200$ | $N=400$ | $N=800$ | $N=1{,}600$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $r=8$ | 170 | 300 | 630 | 1,100 | 1,990 | 3,590 | 7,880 |
| $r=32$ | 140 | 290 | 540 | 1,150 | 1,880 | 2,880 | 6,740 |
| $r=64$ | 140 | 270 | 540 | 1,330 | 1,690 | 3,340 | 7,110 |

Table 4: Gradient steps to convergence for the Full adapter at different LoRA ranks, by number of entities $N$. Rank has a modest effect on efficiency: $r=32$ and $r=64$ are roughly $15$–$20\%$ faster than $r=8$ but largely equivalent to each other, suggesting diminishing returns beyond $r=32$.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x7.png)

Figure 7: Learning efficiency scaling curves. Gradient steps to first reach training loss $<0.05$ (log–log scale). All adapters achieve near-zero final loss for all $N$; the y-axis captures how efficiently each adapter memorizes. (a) Module ablation at $r=8$. MLP-only and Full require $\sim 3$–$4\times$ fewer steps than QK-only, confirming MLP layers as the primary substrate for gradient-efficient association storage. Steps scale approximately linearly with $N$ for all adapters. (b) Rank sweep for the Full adapter. Higher rank provides modest efficiency gains ($\sim 15$–$20\%$) with diminishing returns beyond $r=32$.

## Appendix C Format Sensitivity Under Brittle PM

To quantify the format gating observed under brittle (single-template) PM, we evaluate all four $r=8$, $N=1{,}600$ adapters on five prompt formats using 30 held-out entities. Table [5](https://arxiv.org/html/2605.05686#A3.T5) and Figure [8](https://arxiv.org/html/2605.05686#A3.F8) report PM recall accuracy.

Table 5: PM recall accuracy across prompt formats under brittle training. All adapters achieve 100% on the exact training template but catastrophically fail on any format deviation. QK-only uniquely tolerates a WM context prefix (100%), suggesting that routing is more robust to prompt structure than content-pipeline modifications.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x8.png)

Figure 8: Format sensitivity heatmap under brittle PM. PM recall accuracy (green = 100%, red = 0%) across five prompt formats and four adapter types. The sharp binary pattern confirms catastrophic format gating: any deviation from the exact training template collapses accuracy to 0%. The WM context prefix row reveals an exception for QK-only (100%), consistent with routing being more robust to prompt structure changes.

## Appendix D Conflict Arbitration Details

### D.1 Instruction sensitivity

We test whether explicit instructions can influence conflict resolution under brittle PM. Table [6](https://arxiv.org/html/2605.05686#A4.T6) shows WM win rates across three instruction conditions.

Table 6: WM win rate under conflict with varying instructions ($N=30$ per condition, brittle PM). Instructions have minimal effect, and PM wins 0% across all 360 trials. For MLP-only and Full, the “neither” category (garbage output) accounts for the remainder.

### D.2 Behavioral evaluation summary

Table [7](https://arxiv.org/html/2605.05686#A4.T7) provides the full evaluation results for the four brittle PM adapters ($r=8$, $N=1{,}600$) across all scenarios.

Table 7: Full evaluation results under brittle PM. PM-seen accuracy is 100% for all adapters on the training format. PM wins 0% in all conflict cases (omitted column). WM baseline (adapter off) achieves $>99\%$ contains for all adapters.

### D.3 WM–PM agreement analysis

When WM and PM encode the same code, we compare WM recall accuracy in two conditions: WM-only (unseen entities, $N=200$) and WM+PM-agree (seen entities where the context-provided and parametrically memorized codes match, $N=30$). Figure [9](https://arxiv.org/html/2605.05686#A4.F9)a shows aggregate accuracy: QK-only remains near-perfect under both conditions ($+0.5\%$); MLP-only shows a substantial rescue ($+24.3\%$, from 39% to 63%); VO-only degrades sharply ($-38.7\%$, from 92% to 53%); Full degrades modestly ($-10\%$). Figure [9](https://arxiv.org/html/2605.05686#A4.F9)b reveals the per-digit structure: MLP-only’s rescue is concentrated at digit 5, consistent with prefix-triggered late activation of the PM attractor basin, while VO-only’s degradation begins at digit 2 and worsens under agreement, consistent with a shared write-back bottleneck.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x9.png)

Figure 9: WM–PM agreement analysis. (a) Aggregate WM recall accuracy comparing WM-only (solid) vs. WM+PM-agree (hatched) conditions. MLP-only shows the largest rescue ($+24.3\%$), while VO-only shows the largest degradation ($-38.7\%$). (b) Per-digit accuracy for QK-only (reference), VO-only, and MLP-only under WM-only (solid) vs. WM+PM-agree (dashed). Shaded regions highlight the difference. MLP-only rescue is concentrated at digit 5 (prefix-triggered PM attractor activation); VO-only degrades from digit 2 onward (shared write-back bottleneck).

### D.4 Sample generation outputs

Tables [8](https://arxiv.org/html/2605.05686#A4.T8) and [9](https://arxiv.org/html/2605.05686#A4.T9) show representative generation outputs that illustrate the qualitative failure signatures of each adapter type.

Table 8: Sample hallucination outputs (PM-unseen). All adapters generate confident 5-digit codes for entities never seen during training. Each digit is independently wrong, confirming hallucination rather than partial retrieval. MLP-only generates the same code (95991) for two different entities, consistent with the LM head defaulting to a common trained output sequence.

Table 9: Sample WM recall and conflict outputs illustrating adapter-specific failure signatures. MLP-only correctly reads the first 4 digits from context but systematically replaces digit 5 (often with 9), consistent with autoregressive compounding where the PM attractor captures the trajectory late. Full shows a striking pattern: 4 correct digits followed by the token “Fourth” (the base model interprets the 4-digit prefix as a street address), then coherent but non-numeric text. QK-only faithfully copies the WM code. VO-only occasionally produces codes matching neither WM nor PM. PM codes are omitted for space; PM never wins in any conflict trial.

| Adapter | Entity | WM code | Generated | Notes |
|---|---|---|---|---|
| *WM recall (adapter on, code in context, unseen entities)* | | | | |
| MLP-only | E000000 | 93810 | 93819 | Digit 5 wrong |
| MLP-only | E000001 | 24592 | 24599 | Digit 5 wrong |
| MLP-only | E000005 | 39256 | 39257 | Digit 5 wrong |
| Full | E003000 | 16863 | 1686Fourth Street’s… | Text drift at digit 5 |
| Full | E003001 | 45084 | 4508Fourth and nearly… | Text drift at digit 5 |
| Full | E003004 | 44937 | 4493Fourth Ward,… | Text drift at digit 5 |
| *WM–PM conflict (WM and PM provide different codes)* | | | | |
| QK-only | E000836 | 98255 | 98255 | WM wins (100%) |
| VO-only | E000088 | 34616 | 31679 | Neither wins |
| MLP-only | E000836 | 98255 | 98259 | WM prefix, digit 5 wrong |
| MLP-only | E000088 | 34616 | 34619 | WM prefix, digit 5 wrong |
| Full | E000836 | 98255 | 9825Fourth Ward is… | WM prefix, text drift |
| Full | E000088 | 34616 | 3461Fourth Ward, New… | WM prefix, text drift |

### D.5 Layer-wise conflict trajectory heatmaps

Figure [10](https://arxiv.org/html/2605.05686#A4.F10) visualises the signed distance $\Delta = \|h_{\mathrm{conflict}} - h_{\mathrm{PM}}\| - \|h_{\mathrm{conflict}} - h_{\mathrm{WM}}\|$ across all 36 transformer layers and 5 digit-generation positions for both brittle and robust PM MLP adapters under WM–PM conflict. The WM reference $h_{\mathrm{WM}}$ is obtained by running the same conflict prompt through the base model with the adapter disabled, providing a trajectory that follows WM context without any parametric memory installation. Each cell averages over $N=30$ conflict trials.
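The signed distance can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names (`l2`, `signed_distance`, `mean_signed_distance`) and the 2-D toy vectors are assumptions, and real hidden states would be extracted per layer and digit position.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def signed_distance(h_conflict, h_pm, h_wm):
    """Delta = ||h_conflict - h_PM|| - ||h_conflict - h_WM||.

    Negative values mean the conflict state sits closer to the PM reference
    (PM-captured); positive values mean it sits closer to the WM reference.
    """
    return l2(h_conflict, h_pm) - l2(h_conflict, h_wm)

def mean_signed_distance(trials):
    """Average Delta over (h_conflict, h_pm, h_wm) triples for one heatmap cell."""
    return sum(signed_distance(*t) for t in trials) / len(trials)

# Toy 2-D references (hypothetical values, for illustration only).
h_pm, h_wm = [0.0, 0.0], [10.0, 0.0]
near_pm = signed_distance([1.0, 0.0], h_pm, h_wm)   # 1 - 9 = -8
near_wm = signed_distance([9.0, 0.0], h_pm, h_wm)   # 9 - 1 = +8
print(near_pm, near_wm)
```

Each heatmap cell would then be `mean_signed_distance` over the 30 conflict trials at one (layer, digit) pair.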

![Refer to caption](https://arxiv.org/html/2605.05686v1/x10.png) Figure 10: Signed distance heatmaps under WM–PM conflict. $\Delta = \|h_{\mathrm{conflict}} - h_{\mathrm{PM}}\| - \|h_{\mathrm{conflict}} - h_{\mathrm{WM}}\|$; blue = PM-captured, red = WM-like. (a) Brittle PM MLP: both panels share the same colour scale. At digit 1, $\Delta \approx -15$ (weakly PM-captured) for both adapters: the first-digit snapshot is nearly identical regardless of adapter strength and cannot distinguish conflict outcomes. Trajectories diverge over subsequent digits: brittle PM progressively flips to WM-like by $d_4$–$d_5$ in late layers ($\Delta \approx +28$), while the deep PM dip at $d_2$ in late layers ($\Delta \approx -56$) corresponds to PM correction attempts after the first WM digit is generated. (b) Robust PM MLP: uniformly PM-captured throughout all digit positions and layers ($\Delta \approx -120$ to $-150$ in late layers), confirming that multi-template training installs basins deep enough to resist accumulated WM context. Together these panels support the main-text claim that conflict arbitration is a *trajectory* property resolved over multiple generation steps, not a static point property diagnosable from a single first-digit geometric snapshot.

## Appendix E Perturbation Stability of PM Fixed Points

To quantify the “width” of memory fixed points, we measure PM recall error rate and prediction entropy under Gaussian embedding perturbations. For each of 30 test entities, we add isotropic noise $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^{2} I)$ to the input embeddings at every digit position during autoregressive generation, where $\sigma = \alpha \|\mathbf{e}\|$ scales with the mean embedding norm ($\|\mathbf{e}\| \approx 1.02$). We sweep $\alpha \in \{0, 0.005, 0.01, 0.02, 0.05, 0.1\}$ (symmetric positive and negative) with 10 random noise samples per magnitude, yielding 300 trials per point.
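A minimal sketch of the noise injection and trial enumeration follows. The function names, the toy embedding dimension, and the use of Python's `random` module are assumptions made for illustration; the actual experiment perturbs the model's input embeddings during generation.

```python
import math
import random

ALPHAS = [0.0, 0.005, 0.01, 0.02, 0.05, 0.1]   # noise scales from the sweep
N_ENTITIES, N_SAMPLES = 30, 10                  # 30 x 10 = 300 trials per alpha

def perturb_embedding(e, alpha, mean_norm, rng):
    """Add isotropic Gaussian noise with sigma = alpha * mean_norm to one embedding."""
    sigma = alpha * mean_norm
    return [x + rng.gauss(0.0, sigma) for x in e]

def trial_plan(alphas=ALPHAS, n_entities=N_ENTITIES, n_samples=N_SAMPLES):
    """Enumerate (alpha, entity_index, sample_index) triples for the sweep."""
    return [(a, i, s)
            for a in alphas
            for i in range(n_entities)
            for s in range(n_samples)]

rng = random.Random(0)
e = [0.0] * 10000                               # toy embedding, hypothetical dimension
noisy = perturb_embedding(e, 0.01, 1.02, rng)
empirical_sigma = math.sqrt(sum(x * x for x in noisy) / len(noisy))
print(round(empirical_sigma, 4))                # should be close to 0.01 * 1.02
```

With `alpha = 0` the embedding is returned unchanged, which gives the unperturbed baseline for the sweep.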

Figure [11](https://arxiv.org/html/2605.05686#A5.F11) shows that the robust MLP adapter maintains a wider stability basin than the brittle adapter:

- Robust MLP tolerates $\alpha \leq 0.5\%$ of embedding norm with 0% error and near-zero entropy. Error rises to $\sim 6\%$ at $\alpha = 1\%$ and saturates at $\sim 100\%$ by $\alpha = 2\%$.
- Brittle MLP begins failing at $\alpha = 0.5\%$ ($\sim 4\%$ error) and collapses to $\sim 91\%$ error at $\alpha = 1\%$.

The robust adapter’s stability window is roughly $2\times$ wider than the brittle adapter’s, consistent with the depth–width tradeoff observed in the logit margin analysis (Section [4.2](https://arxiv.org/html/2605.05686#S4.SS2)): multi-template training produces shallower fixed points (raw logit gap $\sim 17$ vs. $\sim 21$) that span a broader region of embedding space.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x11.png) Figure 11: Perturbation stability of robust vs. brittle PM fixed points. Gaussian noise ($\sigma = \alpha \|\mathbf{e}\|$) added to input embeddings during autoregressive digit generation. 30 entities $\times$ 10 trials per magnitude point; error bars show SEM. (a) PM recall error rate: robust MLP (green) maintains 0% error at $\alpha \leq 0.5\%$ while brittle MLP (red) already fails ($\sim 4\%$ error). Both saturate at 100% by $\alpha = 5\%$. (b) Mean digit entropy: entropy mirrors the error transition, confirming that perturbation-induced failures reflect genuine loss of attractor convergence rather than systematic digit substitution. The dashed line marks $\ln 10$ (uniform over 10 digits).

## Appendix F Metacognition Geometry: Extended Results

### F.1 Full signal correlations

Table [10](https://arxiv.org/html/2605.05686#A6.T10) reports point-biserial correlations between all computed signals and binary correctness, pooled across all 450 queries.

Table 10: Point-biserial correlations between all signals and binary correctness ($N=450$). Geometric signals (margin, gap) dominate, followed by output-distribution signals (digit probability variance, stability, entropy). Hidden-state variance is weakly predictive.
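For readers unfamiliar with the statistic, the point-biserial correlation is the Pearson correlation between a continuous signal and a binary label. The sketch below uses the standard closed form with the population standard deviation; the function name and the four-point toy data are illustrative assumptions, not values from the paper.

```python
import math

def point_biserial(signal, correct):
    """Point-biserial correlation between a continuous signal and 0/1 labels.

    r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population
    standard deviation of the signal over all n samples.
    """
    n = len(signal)
    ones = [x for x, y in zip(signal, correct) if y == 1]
    zeros = [x for x, y in zip(signal, correct) if y == 0]
    mean_all = sum(signal) / n
    std_all = math.sqrt(sum((x - mean_all) ** 2 for x in signal) / n)
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    return (m1 - m0) / std_all * math.sqrt(len(ones) * len(zeros) / n ** 2)

# Toy data: the signal is larger for correct (label 1) samples.
r = point_biserial([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(round(r, 4))  # 0.8944
```

A signal where smaller values indicate correctness (such as a distance to the nearest basin) yields a negative coefficient of the same magnitude.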
### F.2 Condition-level signal profiles

Table [11](https://arxiv.org/html/2605.05686#A6.T11) provides mean signal values for each evaluation condition.

Table 11: Mean signal values by condition. Geometric signals (margin, gap, stability) show a clear monotonic gradient from correct recall to hallucination. Entropy increases but with far less separation.

### F.3 Logistic regression model comparison

Table [12](https://arxiv.org/html/2605.05686#A6.T12) and Figure [12](https://arxiv.org/html/2605.05686#A6.F12) report AUROC for all logistic regression models tested on the full dataset ($N=450$, 5-fold CV).

Table 12: AUROC comparison across logistic regression models (5-fold CV, $N=450$). Margin alone (Model B, AUROC $=0.993$) captures nearly all predictive power; adding further signals provides negligible improvement. Entropy alone (Model A, AUROC $=0.968$) is the clear outlier. Model H includes all six signals: entropy, margin, gap, stability, hidden-state variance, and top-1 probability.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x12.png) Figure 12: Full AUROC model comparison (5-fold CV, $N=450$). Entropy alone (Model A, blue) is the clear outlier at AUROC $=0.968$. Margin alone (Model B, brown) achieves $0.993$, and all multivariate models (C–H, gray) cluster in the $0.993$–$0.994$ range, confirming that margin captures nearly all predictive information.

### F.4 PM-seen vs. hallucination subset

When restricting to PM-seen ($N=200$, 100% correct) vs. hallucination ($N=100$, 0% correct), margin achieves AUROC $=1.000$ (perfect separation) across all folds. Entropy achieves AUROC $=0.981$, failing on approximately 13% of hallucination cases where the model produces low-entropy outputs that mimic the confidence profile of correct recall. Figure [13](https://arxiv.org/html/2605.05686#A6.F13) shows the ROC curves for this subset.
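The AUROC of a single signal can be computed directly as the probability that a correct query outranks a hallucinated one. The sketch below is a plain pairwise implementation; the toy distances and entropies are invented for illustration and only loosely echo the reported magnitudes.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a positive outranks a negative (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy nearest-basin distances: smaller means "closer to a memorized basin",
# so the score for "correct" is the negated distance.
seen_dist = [20.0, 22.0, 25.0]
hall_dist = [120.0, 134.0, 105.0]
margin_auc = auroc([-d for d in seen_dist], [-d for d in hall_dist])

# Toy entropies: one confident hallucination (0.03) overlaps correct recall.
seen_ent = [0.01, 0.02, 0.05]
hall_ent = [0.03, 0.9, 1.4]
entropy_auc = auroc([-e for e in seen_ent], [-e for e in hall_ent])

print(margin_auc, round(entropy_auc, 3))
```

In this toy setting the distance-based score separates the two groups completely, while a single low-entropy hallucination is enough to pull the entropy AUROC below 1.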

![Refer to caption](https://arxiv.org/html/2605.05686v1/x13.png) Figure 13: ROC curves for PM-seen vs. hallucination ($N=300$). Margin alone (AUROC $=1.000$) achieves perfect separation. Entropy alone (AUROC $=0.981$) fails in the low false-positive-rate regime, where confident hallucinations produce output distributions indistinguishable from correct recall. The annotation highlights the region where entropy-based detection breaks down.

### F.5 Layer-wise entropy trajectories

To understand *why* entropy-based detection fails on a subset of hallucinations, we examine how digit entropy evolves across all 36 transformer layers. Figure [14](https://arxiv.org/html/2605.05686#A6.F14) plots mean digit entropy $\pm$ SEM for three conditions: PM-seen, hallucination (PM-unseen, adapter on), and WM context, for each of the four adapter types.

Several patterns are consistent across adapters:

- PM-seen entropy collapses sharply in layers 25–35, reaching near zero at the output layer. This reflects decisive routing into the memorized attractor basin.
- Hallucination (PM-unseen) entropy also decreases substantially across layers and converges to a low final value for a subset of cases. The hallucinations decompose into two geometric regimes. A minority ($9/100$) exhibit *progressive attractor capture*: entity-name token similarity (e.g., querying E002050 initiates routing toward E001050’s basin) gives a nearest-basin distance of $\approx 120$ at the first-digit position, smaller than that of the majority ($91/100$, dist $\approx 134$) but still far from the PM-seen regime (dist $\approx 22$). This distance gap is expected: the first-digit measurement precedes autoregressive generation, so the accumulated code prefix that fully commits the model to the attractor (analogous to the prefix-triggered rescue in panel (e)) has not yet been produced. As generation proceeds, the growing prefix would progressively pull the hidden state closer to the captured basin; the first-digit position captures routing *initiation*, not routing *completion*. The majority ($91/100$) shows no geometric basin guidance regardless of generation step: the adapter’s output-projection biases sharpen the digit distribution without spatial basin routing.
- WM context entropy remains elevated relative to PM-seen through the middle layers (approximately layers 10–25), reflecting the more distributed, attention-mediated reading of context tokens. Entropy then decreases in late layers as the model commits to a response, but the final value is higher than for PM-seen, consistent with weaker attractor convergence.

Notably, for the confident-hallucination subset (lowest entropy quartile), the entropy trajectory in final layers (30–35) converges to values similar to PM-seen, explaining why entropy-only classifiers fail on $\sim 13\%$ of cases. Geometric margin does not suffer from this failure: the first-digit hidden state lies far from all PM basins (dist $\approx 105$–$135$) even for confident hallucinations, cleanly separating them from correct recall (dist $\approx 22$). The gap signal reinforces this: progressive-capture hallucinations have gap $\approx 17$ (one basin is ahead) vs. PM-seen gap $\approx 111$ (unambiguous attractor winner) vs. unguided hallucinations gap $\approx 9$ (equidistant from several basins). Margin and entropy measure fundamentally different things (*spatial proximity to a memorized attractor at prediction time* vs. *output-distribution sharpness*), and both types of hallucination produce low entropy through distinct mechanisms (partial basin initiation vs. output-projection bias) while remaining geometrically distinguishable from correct recall at the first-digit measurement point.
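The two geometric signals can be sketched as follows. Margin is taken here as the distance from the hidden state to the nearest basin center, as in the abstract; the definition of gap as the difference between the second-nearest and nearest distances is an assumption inferred from the description above ("one basin is ahead" vs. "equidistant from several basins"), and the basin centers are toy 2-D points.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def margin_and_gap(h, basin_centers):
    """Return (margin, gap) for hidden state h.

    margin: distance to the nearest basin center.
    gap:    second-nearest distance minus nearest distance (assumed definition).
    """
    dists = sorted(l2(h, c) for c in basin_centers)
    return dists[0], dists[1] - dists[0]

# Hypothetical basin centers in a 2-D toy space.
centers = [[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]]

seen_margin, seen_gap = margin_and_gap([3.0, 4.0], centers)      # inside one basin
hall_margin, hall_gap = margin_and_gap([50.0, 50.0], centers)    # far from all basins
print(seen_margin, round(seen_gap, 2), round(hall_margin, 2), round(hall_gap, 2))
```

A query inside a basin gives a small margin and a large gap; a query roughly equidistant from several basins gives a large margin and a gap near zero.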

![Refer to caption](https://arxiv.org/html/2605.05686v1/x14.png) Figure 14: Layer-wise digit entropy trajectories for three conditions across all four adapter types. Mean $\pm$ SEM over 30 entities per condition. The dashed vertical line marks layer 25, where PM-seen entropy begins its sharpest collapse. Hallucination entropy (red) converges to comparably low values in the final layers for a subset of cases, explaining why output-level entropy fails on $\sim 13\%$ of hallucinations. Geometric analysis shows these confident hallucinations are not in any PM attractor basin (nearest-basin distance $\approx 120$–$135$, vs. $\approx 22$ for PM-seen), so margin correctly identifies them even when entropy does not. WM context entropy (blue, dashed) remains elevated through mid-layers, reflecting context-reading dynamics.

## Appendix G Attention-Mediated Head Selection and VO Symmetry

The mid-layer VO symmetry peak reported in Section [3](https://arxiv.org/html/2605.05686#S3) is not a static property of the weight matrices $W_O$ and $W_V$: computing $\varphi(W_O W_V)$ from weights alone yields $\varphi \approx 0.02$ uniformly across layers. Instead, the symmetry emerges dynamically through *attention-mediated head selection*.

#### Mechanism\.

The VO Jacobian at the last token position decomposes as

$$J_{\text{VO}} = \sum_{h=1}^{H} a_{t,t}^{h}\, W_O^{h}\, W_V^{h},$$

where $a_{t,t}^{h}$ is the self-attention weight of head $h$ at position $t$. In mid-layers, the attention sink phenomenon \[[28](https://arxiv.org/html/2605.05686#bib.bib15)\] concentrates most attention mass on the first token, driving $a_{t,t}^{h} \approx 0$ for the majority of heads. The surviving heads (those with high self-attention weight) happen to have highly symmetric per-head weight products, $\varphi(W_O^{h} W_V^{h}) \gg 0$.

Figure [15](https://arxiv.org/html/2605.05686#A7.F15)(a) shows this relationship at layer 15: the correlation between per-head self-attention weight and per-head $\varphi$ is $r=0.88$. The attention-weighted sum achieves $\varphi=0.64$, a $26\times$ boost over the uniform-weighted sum ($\varphi=0.02$; Fig. [15](https://arxiv.org/html/2605.05686#A7.F15)b). This boost peaks at mid-layers where attention sinks are strongest, explaining the layer profile of VO symmetry.
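The head-selection effect can be illustrated with a toy example. The symmetry score used below is an assumption, since this appendix does not restate the definition of $\varphi$: it is taken as $(\|S\|_F^2 - \|A\|_F^2) / \|M\|_F^2$, where $S$ and $A$ are the symmetric and antisymmetric parts of $M$, which gives 1 for a symmetric matrix and -1 for an antisymmetric one. The per-head products and attention weights are hypothetical.

```python
def weighted_sum(mats, weights):
    """Return sum_h weights[h] * mats[h] for equally sized square matrices."""
    n = len(mats[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(n)]
            for i in range(n)]

def symmetry_score(m):
    """Assumed stand-in for phi: (||S||^2 - ||A||^2) / (||S||^2 + ||A||^2)."""
    n = len(m)
    sym = sum(((m[i][j] + m[j][i]) / 2) ** 2 for i in range(n) for j in range(n))
    anti = sum(((m[i][j] - m[j][i]) / 2) ** 2 for i in range(n) for j in range(n))
    return (sym - anti) / (sym + anti)

# Hypothetical per-head products W_O^h W_V^h, given directly as 2x2 matrices.
head_products = [
    [[2.0, 1.0], [1.0, 3.0]],     # head 0: symmetric
    [[0.0, 2.5], [-2.5, 0.0]],    # head 1: antisymmetric
]
self_attn = [0.8, 0.05]           # head 1 sends most of its mass to the sink token
uniform = [0.5, 0.5]

phi_attn = symmetry_score(weighted_sum(head_products, self_attn))
phi_uniform = symmetry_score(weighted_sum(head_products, uniform))
print(round(phi_attn, 3), round(phi_uniform, 3))
```

In this toy case the attention-weighted sum is dominated by the symmetric head and scores close to 1, while uniform weighting lets the antisymmetric head cancel most of the symmetry.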

#### Cross\-model universality\.

The mid-layer VO symmetry peak is not specific to Qwen2.5-3B. Figure [15](https://arxiv.org/html/2605.05686#A7.F15)(c) shows peak $\varphi$ across five model families (Qwen, Llama, OLMo, Mistral, StableLM) spanning 0.5B–7B parameters. All families exhibit a mid-layer $\varphi$ peak, though the magnitude decreases with model size ($r=-0.998$ within the Qwen family). This inverse scaling reflects a conservation of total retrieval capacity: the area under the $\varphi$-vs-layer curve is approximately constant, with larger models spreading Hopfield-like retrieval across more layers.

#### Interpretation\.

Recent work \[[26](https://arxiv.org/html/2605.05686#bib.bib16)\] shows that attention sinks are not primarily a retrieval mechanism but a consequence of the softmax constraint: because softmax cannot output zero, heads with nothing useful to contribute dump attention mass on the first token, which serves as a harmless “dumping ground.” Sinks are driven mainly by short-range prediction needs and disappear when the model is given an explicit per-head gate \[[26](https://arxiv.org/html/2605.05686#bib.bib16)\]. However, the *functional consequence* for VO geometry is significant: by zeroing out the self-attention weights of dormant heads ($a_{t,t}^{h} \approx 0$), the sink ensures the effective VO Jacobian is dominated by working heads, those that attend to semantically relevant positions and have developed symmetric $W_O^{h} W_V^{h}$ products through training pressure for content retrieval. The correlation between $a_{t,t}^{h}$ and per-head $\varphi$ ($r=0.88$) thus reflects co-specialization during pretraining: the same heads that avoid becoming sinks are the ones that develop Hopfield-like retrieval geometry.

Functionally, the two head populations play complementary roles. Non-sink heads apply a symmetric $W_O^{h} W_V^{h}$ via self-attention, which acts as the gradient of a quadratic potential $\nabla(\tfrac{1}{2}\tilde{h}^{\top} M^{(h)} \tilde{h})$, contracting the hidden state toward stored patterns (the content-commitment step). Sink heads, whose attention mass falls on the first token (a near-constant vector after normalization \[[26](https://arxiv.org/html/2605.05686#bib.bib16)\]), contribute only a fixed bias that does not participate in retrieval dynamics. The net effect is a clean separation: MLP shapes the basin landscape, non-sink VO heads execute within-basin retrieval via gradient-like contraction, and sink heads are effectively gated off, consistent with the Gate $\to$ Retrieve $\to$ Commit pipeline described in Section [3](https://arxiv.org/html/2605.05686#S3).

![Refer to caption](https://arxiv.org/html/2605.05686v1/x15.png) Figure 15: Attention-mediated head selection creates VO symmetry. (a) Per-head scatter at layer 15: heads with high self-attention weight $a_{t,t}^{h}$ have highly symmetric $W_O^{h} W_V^{h}$ ($r=0.88$). Head 0 dominates with $a_{t,t}=0.81$, $\varphi=0.71$. (b) Attention-weighted $\varphi$ (gold) vs. uniform-weighted $\varphi$ (grey dashed) across layers. The shaded region shows the sink-mediated boost, which peaks at mid-layers where attention sinks are strongest. (c) Peak VO $\varphi$ vs. model size across five families. All models exhibit mid-layer symmetry; peak magnitude decreases with size while total integrated $\varphi$ is conserved.

## Appendix H End-to-End Metacognitive Head: A Negative Result

A natural question is whether metacognition can be *trained in* rather than architecturally provided: add a small MLP head that predicts $P(\text{correct})$ from the hidden state and train it jointly with the backbone, so gradients reshape representations to support self-monitoring. We test this on our toy setup.

#### Setup\.

We add a 3-layer MLP ($2048 \to 256 \to 256 \to 1$) to the model’s last-layer hidden state at the first-digit position. Training uses 300 queries (200 seen entities + 100 unseen) with binary correctness labels obtained from a single pre-pass. We compare three conditions: (A) post-hoc linear probe on frozen hidden states, (B) post-hoc MLP on frozen hidden states, and (C) end-to-end MLP with LoRA gradient flow. For condition C, we freeze LoRA for the first epoch (head-only warmup), then jointly train for 4 additional epochs with a combined loss $\mathcal{L} = \lambda \cdot \mathcal{L}_{\text{BCE}} + (1-\lambda) \cdot \mathcal{L}_{\text{LM}}$ ($\lambda = 0.3$, LoRA LR $= 10^{-6}$) to prevent catastrophic forgetting.
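As a small worked illustration of condition C, the combined objective and the warmup rule can be written as below. The function names are hypothetical and the loss values are plain floats; in an actual training loop these would be tensors from the head and the language-model objective.

```python
import math

def bce(p_correct, label, eps=1e-12):
    """Binary cross-entropy for one predicted P(correct) and a 0/1 label."""
    p = min(max(p_correct, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def combined_loss(loss_bce, loss_lm, lam=0.3):
    """L = lam * L_BCE + (1 - lam) * L_LM, with lam = 0.3 as stated in the setup."""
    return lam * loss_bce + (1.0 - lam) * loss_lm

def lora_trainable(epoch):
    """Epoch 1 is head-only warmup; epochs 2-5 train head and LoRA jointly."""
    return 2 <= epoch <= 5

print(combined_loss(bce(0.5, 1), 2.0))
print([lora_trainable(e) for e in range(1, 6)])
```

With $\lambda = 0.3$ the LM term carries 70% of the weight, which is the anchoring against forgetting that the setup describes.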

#### Results\.

Table 13: Metacognitive head comparison. “Correct preserved” is the fraction of correct recalls retained when thresholding to catch 100% of hallucinations (PM-seen vs. hallucination, $N=300$).

End-to-end training *degrades* performance relative to post-hoc probes (AUROC 0.913 vs. 0.953). Despite the LM loss anchor, backbone gradients from the metacognitive head distort the basin geometry: the correlation between pre- and post-training margins drops to $r=0.30$, and margin AUROC falls from 0.994 to 0.728. The head learns to predict correctness on the training distribution (98% accuracy) but on shifting representations: by the time it converges, the geometric structure it depends on has changed.

#### Interpretation\.

This negative result is informative: even with LM loss regularization and a conservative LoRA learning rate, the basin geometry that makes margin work—a*consequence*of the knowledge stored in the weights—is disrupted by gradients from the metacognitive head\. The head achieves near\-perfect training accuracy \(98%\) but on shifting representations; by the time it converges, the geometric structure it depends on has changed\. While more sophisticated training schedules \(e\.g\., alternating updates, gradient projection\) might mitigate this interference, the fundamental tension remains: gradients that reshape representations for self\-monitoring simultaneously reshape the knowledge itself\. This supports the main text’s argument \(Section[7](https://arxiv.org/html/2605.05686#S7)\) that robust metacognition likely requires*architectural*access to knowledge\-base geometry—a dedicated self\-referential computation—rather than end\-to\-end training of a correctness predictor alone\.

### H.1 Geometric distillation: a partial recovery

The negative result above trained the head on binary correctness labels—a signal with no geometric inductive bias\. We hypothesized that explicitly distilling the margin and gap computations into the head would avoid the collapse, because the head would learn to approximate the nearest\-neighbor distance function parametrically rather than memorizing a correctness lookup\.

#### Three\-phase training\.

We co\-train a fresh LoRA adapter and a dual\-output metacognitive head \(shared backbone, separate geometric and confidence output layers\) in three phases:

1. Phase 1 (epochs 1–3): Knowledge installation only. LoRA trains with standard LM loss; the head receives no gradient. This lets basin structure form before the head begins learning, avoiding the degenerate all-incorrect labels that caused collapse in the original e2e attempt.
2. Phase 2 (epochs 4–7): Geometric distillation. Basin centers are computed from the current LoRA representations (averaged across 3 canonical templates per entity). The head is trained with MSE to predict normalized margin $\hat{\delta}(x)$ and gap $\widehat{\mathrm{gap}}(x)$ from the hidden state alone. Basin centers are refreshed every 2 epochs to track evolving representations. LoRA continues training with LM loss; the combined loss is $\mathcal{L} = 0.8 \cdot \mathcal{L}_{\text{LM}} + 0.2 \cdot \mathcal{L}_{\text{geo}}$.
3. Phase 3 (epochs 8–10): Calibration. LoRA is frozen. The head is fine-tuned with BCE on binary correctness labels (obtained via generation), converting geometric awareness into a usable $P(\text{correct})$ estimate. Because the head already encodes margin-like representations from Phase 2, it does not collapse to a trivial classifier.
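The three phases above can be summarized as a per-epoch configuration. This is a sketch under stated assumptions: the dictionary keys are hypothetical, the refresh is assumed to fall on epochs 4 and 6 (the text only says "every 2 epochs"), and the Phase 3 BCE weight of 1.0 is assumed because BCE is the only loss mentioned for that phase.

```python
def phase_config(epoch):
    """Return which parts train and which loss weights apply at a given epoch."""
    if 1 <= epoch <= 3:       # Phase 1: knowledge installation only
        return {"phase": 1, "train_lora": True, "train_head": False,
                "w_lm": 1.0, "w_geo": 0.0, "w_bce": 0.0,
                "refresh_centers": False}
    if 4 <= epoch <= 7:       # Phase 2: geometric distillation
        return {"phase": 2, "train_lora": True, "train_head": True,
                "w_lm": 0.8, "w_geo": 0.2, "w_bce": 0.0,
                "refresh_centers": (epoch - 4) % 2 == 0}   # assumed: epochs 4, 6
    if 8 <= epoch <= 10:      # Phase 3: calibration with LoRA frozen
        return {"phase": 3, "train_lora": False, "train_head": True,
                "w_lm": 0.0, "w_geo": 0.0, "w_bce": 1.0,
                "refresh_centers": False}
    raise ValueError("epoch must be in 1..10")

for epoch in range(1, 11):
    cfg = phase_config(epoch)
    print(epoch, cfg["phase"], cfg["train_lora"], cfg["train_head"])
```

Keeping the schedule in one function makes the ordering constraint explicit: the head never receives gradient before basins have had three epochs to form.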

#### Results\.

Table 14: Geometric distillation results ($N=450$). “Correct preserved” is the fraction of correct PM-seen recalls retained when thresholding to catch 100% of hallucinations ($N_{\text{PM}}=200$, $N_{\text{hall}}=100$). The geo-distilled head’s predicted margin preserves 58.5% of correct recalls, a $117\times$ improvement over the original e2e head (0.5%) and $12\times$ over entropy (5.0%).

The geometric distillation substantially improves over the original e2e attempt. The calibrated confidence output achieves AUROC 0.932 (vs. 0.913 for BCE-only e2e), but the more striking result is the predicted margin output: while its raw AUROC is 0.898, the intervention metric (preserving correct recalls while catching all hallucinations) improves from 0.5% to 58.5%. The head has learned a margin-like function that, while noisier than the oracle, captures the geometric separation between known and unknown regions of representation space.
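The “correct preserved” metric can be sketched as follows, assuming scores are oriented so that higher means more likely correct (for example, negated distance or negated entropy). The function name and toy values are illustrative and are not taken from the tables.

```python
def correct_preserved(correct_scores, halluc_scores):
    """Fraction of correct recalls kept when refusing every hallucination.

    The refusal threshold is the highest hallucination score, so every query
    scoring at or below it is refused; only strictly higher scores are kept.
    """
    threshold = max(halluc_scores)
    kept = sum(1 for s in correct_scores if s > threshold)
    return kept / len(correct_scores)

# Toy nearest-basin distances (negated so that higher = more likely correct).
margin_kept = correct_preserved([-20.0, -22.0, -25.0, -30.0],
                                [-105.0, -120.0, -134.0])

# Toy entropies: one confident hallucination (0.03) forces a strict threshold.
entropy_kept = correct_preserved([-0.01, -0.02, -0.05, -0.2],
                                 [-0.03, -0.9, -1.4])

print(margin_kept, entropy_kept)
```

Because the threshold is set by the single most confident hallucination, this metric is far more sensitive to overlap in the tails than AUROC, which helps explain how a signal with a high AUROC can still preserve only a small fraction of correct recalls.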

#### Per\-condition analysis\.

Table 15: Per-condition signal profiles for the geo-distilled head. Predicted margin values are in normalized units. The head discriminates PM-seen from hallucination but assigns conflict the same confidence as PM-seen, consistent with the main text finding that conflict involves basin competition rather than basin absence.

The head assigns high confidence to conflict queries (1.000, identical to PM-seen), confirming that it has learned a margin-like signal: conflict entities *are* near a memorized basin, so the head correctly reports proximity. The failure mode under conflict is basin *competition*, not basin *absence*, and a margin-based signal cannot distinguish these, consistent with the main text analysis (Section [5](https://arxiv.org/html/2605.05686#S5.SS0.SSS0.Px4)).

#### Interpretation\.

The phased geometric distillation avoids the representation\-collapse problem of the original e2e attempt by \(i\) letting basins form before the head trains, and \(ii\) providing a continuous geometric target \(margin, gap\) rather than binary labels that degenerate in early training\. The head learns an approximation of the NN\-distance function parametrically: at inference, it requires only a single forward pass through a small MLP—no stored basin centers, no nearest\-neighbor lookup\. This partially closes the scalability gap identified in the main text discussion\. However, a substantial gap to the oracle remains \(AUROC 0\.932 vs\. 0\.993; intervention 58\.5% vs\. 100%\), and the following layer\-wise analysis reveals a deeper reason why end\-to\-end approaches face fundamental limits\.

### H.2 Layer-wise emergence of basin geometry

A natural question for any metacognitive architecture is: at which network depth does the geometric signal become available? If margin is informative at intermediate layers, a mid\-layer metacognitive module could feed self\-referential information back into the residual stream for downstream layers to act on—enabling the model to condition its own processing on its epistemic state\. If the signal emerges only at the final layers, such feedback is architecturally infeasible within a single forward pass\.

#### Setup\.

For each of the 37 layers (embedding + 36 transformer layers), we compute basin centers (1,600 entities $\times$ 3 templates, averaged) and evaluate margin AUROC on the standard 450-query test set. All computations use the co-trained adapter from the geometric distillation experiment.

#### Results\.

Table 16: Margin AUROC by layer (selected layers; full results in supplementary materials). Basin geometry emerges sharply at layer 24 (AUROC $>0.95$ for PM vs. hallucination), peaks at layers 28–30, and degrades in the final layers as representations shift toward output projection. “Margin sep.” is the difference in mean margin between hallucination and PM-seen queries.

Three findings are notable:

1. Basin geometry is a late-layer phenomenon. Layers 0–18 carry little margin signal (AUROC $<0.80$). The earliest layer exceeding 0.95 AUROC on PM vs. hallucination is layer 24, two-thirds of the way through the network.
2. Peak discrimination occurs at layers 28–30, not the final layer. Layers 29–30 achieve AUROC 1.000 for PM vs. hallucination with margin separation of 10–11 units.
3. The signal degrades in the final layers (33–36), as representations transition from basin-structured hidden states to output-projection space. Layer 36 drops to AUROC 0.725, worse than layer 12.

#### Implications for metacognitive architecture\.

The layer-wise profile reveals a fundamental constraint on mid-layer metacognitive feedback. The geometric signal becomes reliably available only at layer $\approx 24$, leaving 12 layers (one-third of the network) to act on it. However, attention routing (the mechanism by which the model selects between PM and WM sources) operates primarily in earlier layers where the signal does not yet exist. The model cannot condition its source-selection on epistemic state because that state has not yet been computed.

This creates a temporal paradox for single\-pass architectures:*the self\-referential signal that would enable informed arbitration is a product of the very computation it would need to steer\.*The model must process the input deeply enough to determine whether it “knows” the answer, but by that depth, the processing decisions \(attention routing, source selection\) have already been made\. This provides a mechanistic explanation for why end\-to\-end training of metacognitive heads faces fundamental limits beyond training dynamics: even with a perfect head, the information arrives too late in the forward pass to influence the computation that produces the output\.

Multi\-pass architectures—where the model processes input once to obtain an epistemic assessment, then re\-processes conditioned on that assessment—would circumvent this constraint\. This is analogous to chain\-of\-thought reasoning, where the model explicitly reflects before committing to an answer, but with a formal geometric grounding: the first pass computes margin, and the second pass conditions attention routing on whether the model is inside a known basin\.

### H.3 VO energy as an early self-referential signal?

The Jacobian analysis (Section [3](https://arxiv.org/html/2605.05686#S3), Appendix [G](https://arxiv.org/html/2605.05686#A7)) showed that VO layers approximate symmetric, Hopfield-like maps, particularly at mid-layers where attention sinks select heads with high $\varphi(W_O^{h} W_V^{h})$. A natural hypothesis is that the corresponding energy $E_l(h) = \sum_{h} a_{t,t}^{h} \cdot h^{\top} W_O^{h} W_V^{h} h$, computable as a scalar byproduct of the existing forward pass, might serve as an early-available self-referential signal. (In this expression the summation index and superscript $h$ range over heads, while $h$ in the quadratic form is the hidden state.) If VO energy discriminates known from unknown entities at layers where explicit margin cannot, it would provide the “early thermometer” needed for mid-layer metacognitive feedback.

#### Setup\.

For each of the 36 transformer layers, we compute the attention-weighted VO energy at the last token position for all 450 eval queries, using the per-head self-attention weights from the forward pass. We also compute the symmetric variant $E_S = \frac{1}{2}(h^{\top} J h + h^{\top} J^{\top} h)$. We report AUROC for predicting correctness and compare with margin AUROC at each layer (using the best-direction convention: $\max(\text{AUROC}(E), \text{AUROC}(-E))$).
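The energy and the best-direction convention can be sketched as follows. All names and toy values are illustrative assumptions: the per-head products $W_O^{h} W_V^{h}$ are passed in directly as small matrices, and the AUROC is the plain pairwise estimate.

```python
def quad_form(h, m):
    """Quadratic form h^T m h for a vector h and square matrix m."""
    n = len(h)
    return sum(h[i] * m[i][j] * h[j] for i in range(n) for j in range(n))

def vo_energy(h, head_products, self_attn):
    """E(h) = sum over heads of a_h * h^T (W_O^h W_V^h) h."""
    return sum(a * quad_form(h, m) for a, m in zip(self_attn, head_products))

def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]

def symmetric_energy(h, jac):
    """E_S = 0.5 * (h^T J h + h^T J^T h)."""
    return 0.5 * (quad_form(h, jac) + quad_form(h, transpose(jac)))

def auroc(pos, neg):
    """Probability that a positive outranks a negative (ties count one half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_direction_auroc(pos, neg):
    """max(AUROC(E), AUROC(-E)); negating the score maps AUROC to 1 - AUROC."""
    a = auroc(pos, neg)
    return max(a, 1.0 - a)

jac = [[1.0, 2.0], [0.0, 3.0]]   # hypothetical non-symmetric Jacobian
h = [1.0, 2.0]                   # hypothetical hidden state
print(quad_form(h, jac), symmetric_energy(h, jac))
print(best_direction_auroc([1.0, 2.0], [3.0, 4.0]))
```

The best-direction convention never reports a value below 0.5, so chance level, rather than 0, is the floor when comparing energy against margin.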

#### Results\.

Table 17: VO energy vs. margin AUROC by layer (PM-seen vs. hallucination, $N=300$). VO energy exceeds margin at early layers (0–9) and at scattered mid-layers, but never exceeds AUROC 0.87 and degrades in the late layers where margin achieves perfect separation.

VO energy does carry early-layer signal where margin is absent: AUROC 0.868 at layer 0 and 0.789 at layer 7, compared to chance-level margin at those depths. However, the signal is noisy across layers (fluctuating between 0.5 and 0.87), never exceeds 0.87, and *degrades* at the very layers where margin becomes perfect (layers 27–30).

#### Interpretation\.

The early\-layer energy signal likely reflects a simpler effect than basin geometry: the LoRA adapter modifies MLP weights for training entities, subtly altering the energy landscape even at the embedding layer for seen vs\. unseen entity tokens\. This is a LoRA artifact rather than a geometric self\-referential signal—it detects “has this entity been fine\-tuned on?” from weight perturbations rather than “am I near a memorized attractor?” from representation geometry\.

The degradation at late layers is also informative: as hidden-state magnitudes grow through the residual stream, the quadratic energy $h^{\top} W h$ becomes dominated by norm effects rather than directional structure, washing out any discriminative signal. The symmetric variant $E_S$ produces identical AUROCs, confirming that the antisymmetric (rotational) component carries no additional information.
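One way to read the identical AUROCs, assuming real-valued $h$ and $J$: a scalar equals its own transpose, so the symmetric variant coincides with the raw quadratic energy for any $J$, and the antisymmetric part contributes nothing to a quadratic form.

```latex
\begin{aligned}
h^{\top} J^{\top} h &= \left(h^{\top} J h\right)^{\top} = h^{\top} J h, \\
E_S &= \tfrac{1}{2}\left(h^{\top} J h + h^{\top} J^{\top} h\right) = h^{\top} J h, \\
h^{\top} A\, h &= 0 \quad \text{for } A = \tfrac{1}{2}\left(J - J^{\top}\right).
\end{aligned}
```

Under this reading the match between $E$ and $E_S$ is an algebraic identity rather than an empirical property of the trained weights.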

This negative result reinforces the temporal paradox: VO energy is not a viable early proxy for margin\. The self\-referential signal that would enable metacognitive routing genuinely does not exist at intermediate layers in a form that current architectural components can compute\. The basin geometry that makes margin work—compact, well\-separated attractors in representation space—is an emergent property of deep processing, not a quantity readable from any single layer’s weight matrices\. This further supports the case for multi\-pass architectures as the path toward genuine self\-referential metacognition\.

## Appendix I Extended Related Work

This section expands on the related work discussion in Section [6](https://arxiv.org/html/2605.05686#S6) of the main text.

#### Knowledge storage and localization in transformers\.

Geva et al. \[[5](https://arxiv.org/html/2605.05686#bib.bib1)\] showed that MLP layers function as key–value memories, with feed-forward sublayers storing factual associations that can be read out via the residual stream. Subsequent work on knowledge editing \[[21](https://arxiv.org/html/2605.05686#bib.bib2),[22](https://arxiv.org/html/2605.05686#bib.bib3)\] demonstrated that individual factual associations can be localized to specific MLP layers and surgically modified. Our work extends this picture from individual facts to the *collective geometry* of many memorized associations, showing that MLP adaptation creates an attractor geometry whose global structure governs arbitration and failure modes.

#### Parameter\-efficient fine\-tuning\.

LoRA \[[9](https://arxiv.org/html/2605.05686#bib.bib5)\] and related methods \[[17](https://arxiv.org/html/2605.05686#bib.bib6), [8](https://arxiv.org/html/2605.05686#bib.bib7)\] enable targeted adaptation of pretrained models. While most work evaluates adaptation quality (task accuracy, sample efficiency), we use adapter placement as a *causal probe*: by restricting LoRA to specific component groups (QK, VO, MLP), we isolate the functional contribution of each architectural component to memory storage, revealing qualitatively distinct perturbation signatures.

#### Hallucination and calibration\.

LLM hallucination has been studied extensively \[[10](https://arxiv.org/html/2605.05686#bib.bib8), [30](https://arxiv.org/html/2605.05686#bib.bib9)\], with proposed mitigations including retrieval augmentation \[[15](https://arxiv.org/html/2605.05686#bib.bib10)\], self-consistency checks \[[27](https://arxiv.org/html/2605.05686#bib.bib11)\], and post-hoc calibration \[[6](https://arxiv.org/html/2605.05686#bib.bib12)\]. Most work treats hallucination as a monolithic failure mode. Our attractor framework provides a mechanistic decomposition: hallucinations occur when queries land far from all memorized basins yet the LM head defaults to trained output sequences, producing confident outputs indistinguishable from correct recall at the output level. Cross-entity analysis confirms that code overlap with the training set is predominantly an output-layer bias ($\sim 84\%$ of matched codes do not correspond to the nearest basin), not geometric basin capture. This explains why standard calibration methods fail on the most dangerous errors: entropy reflects output-distribution sharpness, not basin proximity.

#### Mechanistic interpretability\.

The circuits-based approach to interpretability \[[3](https://arxiv.org/html/2605.05686#bib.bib13), [24](https://arxiv.org/html/2605.05686#bib.bib14)\] has identified attention heads and MLP neurons involved in specific computations. Our work complements this by analyzing the *representation-level* consequences of circuit modifications: rather than tracing information flow through individual components, we characterize how adapter-induced changes to circuits reshape the global attractor geometry in which generation unfolds.

#### Dynamical systems perspectives on transformers\.

Several authors have drawn connections between transformer computation and dynamical systems \[[4](https://arxiv.org/html/2605.05686#bib.bib22)\], viewing residual-stream updates as iterated maps with fixed-point structure. Our empirical results provide concrete evidence for this perspective: we observe attractor basins with measurable margins, systematic cross-entity retrieval driven by output-layer bias, and autoregressive trajectories that converge toward fixed points in representation space.

## Appendix J Basin Geometry in Pretrained Knowledge

To test whether the geometric framework generalizes beyond LoRA\-installed knowledge, we evaluate margin\-based hallucination detection on factual knowledge acquired during pretraining\.

#### Setup\.

We query the base Qwen2\.5\-3B\-Instruct model \(no adapter\) on 196 factual questions spanning three relation types: national capitals \(120 queries\), official languages \(40\), and continents \(36\)\. Queries range from well\-known \(“The capital of France is”\) to obscure \(“The capital of Palau is”\)\. The model answers 149/196 correctly \(76\.0%\)\. For each correct fact, we compute a basin center by averaging the last\-layer hidden state across three prompt templates, yielding 149 basin centers\. Margin is computed as distance to the nearest basin center, identically to the LoRA analysis\.
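The basin-center and margin computation described in this setup can be sketched as follows. This is a minimal illustration, not the authors' code: the Euclidean metric, the function names `mean_vector` and `margin`, and the 2-D toy vectors standing in for last-layer hidden states are all assumptions made for the example.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def margin(hidden_state, basin_centers):
    """Distance from a hidden state to the nearest basin center.

    Euclidean distance is an assumption; the text only says "distance".
    """
    return min(math.dist(hidden_state, c) for c in basin_centers)

# Toy 2-D stand-ins for hidden states: three prompt templates per correct fact.
templates_fact_a = [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1]]
templates_fact_b = [[-1.0, 2.0], [-1.1, 2.1], [-0.9, 1.9]]
centers = [mean_vector(templates_fact_a), mean_vector(templates_fact_b)]

near = margin([1.05, 0.0], centers)  # query close to a memorized basin
far = margin([5.0, -4.0], centers)   # query far from every basin
print(near, far)
```

In the actual experiment the vectors would be model hidden states and there would be one center per correctly answered fact (149 in this setup).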

#### Results\.

Table [18](https://arxiv.org/html/2605.05686#A10.T18) reports signal profiles. Correct pretrained recall sits at margin $44.7$ and hallucination at $87.5$, a $2\times$ separation. Margin achieves AUROC $= 1.000$, matching the LoRA result exactly. Entropy AUROC is $0.622$, substantially worse than for LoRA ($0.968$), because the pretrained model produces higher-entropy outputs even for correct answers (mean entropy $2.16$ vs. near-zero for LoRA). A full hidden-state logistic regression probe achieves only AUROC $= 0.809$, confirming that margin captures geometric structure that a linear probe over all dimensions misses.

Table 18: Signal profiles for pretrained factual knowledge (no LoRA adapter). The same geometric structure observed for LoRA-installed knowledge holds: correct recall produces low margin and high gap, while hallucination produces high margin and low gap.

Table 19: Comparison of margin and entropy AUROC between LoRA and pretrained knowledge settings. Margin generalizes perfectly; entropy degrades substantially for pretrained knowledge, where output distributions are less sharp.

The intervention test confirms operational utility: thresholding margin to catch all 47 incorrect answers preserves 149/149 correct answers (100%), compared to 16/149 (10.7%) for entropy. The margin signal provides perfect separation with zero false refusals, replicating the LoRA finding in a purely pretrained setting.
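The intervention test can be read as a simple thresholding rule: choose the cutoff that refuses every incorrect answer, then count how many correct answers survive. The sketch below illustrates that rule under stated assumptions; the function names and the margin values are hypothetical (the values only loosely echo the reported means of 44.7 and 87.5) and are not the paper's data.

```python
def intervention_threshold(scores, is_correct):
    """Largest threshold that still refuses every incorrect answer.

    An answer is refused when its score is >= threshold, so the smallest
    score among incorrect answers catches all of them.
    """
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    return min(wrong)

def preserved_fraction(scores, is_correct, threshold):
    """Fraction of correct answers kept (score strictly below threshold)."""
    right = [s for s, ok in zip(scores, is_correct) if ok]
    kept = sum(1 for s in right if s < threshold)
    return kept / len(right)

# Hypothetical margins: low for correct recall, high for hallucination.
margins = [40.1, 44.7, 48.3, 51.0, 80.2, 87.5, 93.4]
is_correct = [True, True, True, True, False, False, False]

t = intervention_threshold(margins, is_correct)
print(t, preserved_fraction(margins, is_correct, t))
```

The same two functions apply to entropy as the score; when correct and incorrect score distributions overlap, the preserved fraction drops below 1, which is the behavior reported for entropy.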

#### Cross\-relation generalization\.

To test whether a margin-prediction head can generalize beyond its training domain, we train a small MLP ($2048 \to 256 \to 256 \to 1$) to predict margin from hidden states using *capital queries only* (115 correct facts), then evaluate on held-out *language* and *continent* queries the head never saw during training.

Table 20: Cross-relation generalization. The margin head trained on capitals generalizes to held-out relations (AUROC $= 0.887$), outperforming both oracle margin ($0.849$) and entropy ($0.670$). Oracle margin fails on language ($0.171$) because capital basin centers occupy a different region of hidden space than language queries; the head learns a *general* geometric property rather than memorizing distances to specific centers.

The head outperforms oracle margin on held-out relations ($0.887$ vs. $0.849$), a surprising result: the oracle computes exact distances to capital basin centers, which are irrelevant for language queries (hence AUROC $0.171$). The head instead learns a general geometric signature ("what does it look like to be in a basin?") that transfers across relation types. A binary classification head (trained on correct/incorrect labels rather than margin values) achieves comparable held-out AUROC ($0.827$), suggesting the hidden states encode epistemic state information accessible to any lightweight classifier. Both heads substantially outperform entropy ($0.670$).

At the intervention threshold (catch all incorrect, preserve correct), the margin head preserves 37.7% of correct held-out answers vs. 18.9% for entropy, a $2\times$ improvement with zero hallucination leakage.

#### Implications\.

Basin geometry is not an artifact of LoRA fine-tuning. The pretrained transformer organizes factual knowledge into attractor-like structures with the same margin/gap properties observed in the LoRA setting. This validates the theoretical framework: the MLP-dominant Jacobian structure and VO symmetry properties characterized in Section [3](https://arxiv.org/html/2605.05686#S3) are intrinsic to the pretrained architecture and produce basin geometry for any knowledge source. Crucially, a head trained on a small sample of one relation type generalizes to unseen relations, suggesting that the geometric signal is universal and that scalable margin estimation, without exhaustive basin center computation, is feasible.

### J.1 Stress test: TruthfulQA

To delineate the framework's scope, we evaluate on TruthfulQA \[[18](https://arxiv.org/html/2605.05686#bib.bib25)\], a benchmark of 817 adversarial questions designed to elicit common misconceptions. This setting differs substantially from factual recall: questions are intentionally misleading, and incorrect answers reflect reasoning errors or cultural misconceptions rather than missing basin structure.

#### Setup\.

We run the base Qwen2\.5\-3B\-Instruct model \(no adapter\) on all 817 questions\. Correctness is determined by substring matching against the provided correct and incorrect answer lists\. The model answers only 59/817 correctly \(7\.2%\), yielding a heavily imbalanced split\. We compute basin centers from the 59 correct answers \(3 prompt templates each\), then measure margin for all queries\.

#### Results\.

Table 21: TruthfulQA results ($N = 817$, 59 correct). Within-domain oracle margin substantially outperforms entropy and a full hidden-state logistic regression probe.

Within-domain oracle margin achieves AUROC $= 0.858$, substantially above entropy ($0.656$) and a full hidden-state probe ($0.689$). Per-category results are striking: margin approaches $1.000$ on factual categories (Misconceptions: $0.999$, Law: $1.000$, Health: $1.000$, Sociology: $1.000$, Economics: $1.000$) while entropy remains near chance ($0.45$–$0.70$).

However, cross-domain transfer from capital facts fails entirely (AUROC $= 0.476$), confirming that basin centers from one knowledge domain do not generalize to adversarial reasoning tasks. A binary head trained on capitals achieves only $0.576$ on TruthfulQA.

#### Scope\.

These results delineate the framework's boundary: basin geometry is a strong signal for *factual* hallucination detection, where failure reflects missing or distant basins, but does not extend to adversarial reasoning tasks where the model "knows" the common misconception and confidently produces it from a well-formed (but misleading) basin. Detecting such errors likely requires semantic or logical verification rather than geometric monitoring.

## Appendix K Hallucination Scaling Across Model Families

To test whether the representational geometry underlying hallucination generalizes beyond Qwen2.5-3B, we evaluate 17 pretrained instruct models spanning six families (Qwen 2.5: 0.5B–72B; Falcon 3: 1B–10B; SmolLM2: 0.36B–1.7B; OLMo 2: 1B–7B; Mistral: 7B; Phi-4: 14B) on 2,000 TriviaQA \[[11](https://arxiv.org/html/2605.05686#bib.bib30)\] questions. For each model, we extract hidden states at $\sim 75\%$ network depth and compute hallucination rate, output entropy, and linear probe AUROC (5-fold CV logistic regression on hidden states classifying correct vs. incorrect).

#### Hallucination scaling law\.

Hallucination rate follows a power law across all families: $H \propto N^{-0.27}$ ($r^2 = 0.90$, $p < 0.001$; Fig. [16](https://arxiv.org/html/2605.05686#A11.F16)). Per-family exponents are consistent: Qwen $\alpha = -0.23$, Falcon $\alpha = -0.28$, SmolLM2 $\alpha = -0.18$. Different families exhibit different intercepts, reflecting training data quality and recipe, but converge at larger scales: at 7B parameters, Qwen, Falcon, OLMo, and Mistral all cluster within 30–46% hallucination rate. The universality of the scaling exponent across architectures suggests the rate is governed by the statistical structure of factual knowledge in pretraining corpora rather than by architectural choices.
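A power-law exponent of this kind is typically estimated by ordinary least squares in log–log space. The sketch below shows that procedure on synthetic points generated from an assumed law $H = 170 \cdot N^{-0.27}$; the prefactor, the model sizes, and the function name `fit_power_law` are illustrative assumptions, and the data are not the paper's measurements.

```python
import math

def fit_power_law(sizes, rates):
    """Least-squares fit of log(rate) = log(a) + alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(h) for h in rates]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope = power-law exponent
    a = math.exp(y_bar - alpha * x_bar)    # intercept = prefactor
    return a, alpha

# Synthetic, noise-free points from an assumed law (not measured data).
sizes = [0.5e9, 1.5e9, 3e9, 7e9, 14e9, 72e9]
rates = [170.0 * n ** -0.27 for n in sizes]

a, alpha = fit_power_law(sizes, rates)
print(a, alpha)
```

Fitting each family separately with the same function would give per-family exponents and intercepts of the kind compared in the text.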

#### Confident hallucination concentrates with scale\.

Despite lower overall error rates, larger models produce a higher fraction of *confident* wrong answers (output entropy $< 0.1$): 0.3% of errors at 0.5B vs. 7.8% at 7B. On this confident subset, output entropy carries no signal (AUROC $0.54$), while linear probes on hidden states partially recover detection (AUROC $0.75$; trained on non-confident examples, tested on the confident subset). The gap ($+0.20$ AUROC) confirms that internal representations encode uncertainty about these errors even when the output distribution does not. However, the probe ceiling of $\sim 0.75$ indicates that $\sim 25\%$ of confident hallucinations lack even internal geometric signal; these represent genuine blind spots rather than decoder failures.

![Refer to caption](https://arxiv.org/html/2605.05686v1/x16.png)

Figure 16: Hallucination scaling across model families. Hallucination rate vs. parameter count (log–log) for 17 models across six families. The power law $H \propto N^{-0.27}$ ($r^2 = 0.90$, $p < 0.001$) holds across architectures. Per-family lines (colored) show consistent slopes with different intercepts reflecting training data quality. Families converge at larger scales.

### K.1 Derivation of the universal law $C = \exp(-c/\bar{\Delta})$

We derive Eq. [5](https://arxiv.org/html/2605.05686#S5.E5) from the softmax entropy function and the statistics of the top-two logit gap. The derivation has three steps.

#### Step 1: Entropy is a monotone function of the top\-2 logit gap\.

Let $\Delta := \ell_{(1)} - \ell_{(2)}$ be the gap between the top two logits. For any entropy threshold $H_0$, there exists an empirical cutoff $\Delta^{*}(H_0, V)$ such that

$$H_{\text{out}}(\ell) < H_0 \iff \Delta > \Delta^{*}(H_0, V). \tag{9}$$

The 2-class approximation ($H \approx \Delta e^{-\Delta}$) underestimates the required gap by $\sim 1.4\times$ at realistic vocabulary sizes because the $\sim 30{,}000$ lower-ranked logits contribute non-negligibly. Empirically, for $V \approx 30{,}000$ and $H_0 = 0.1$, $\Delta^{*} \approx 5.0$; the 2-class formula predicts $3.57$. Table [22](https://arxiv.org/html/2605.05686#A11.T22) reports this discrepancy across $\Delta$ ranges.
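The 2-class cutoff quoted in Step 1 can be checked numerically by solving $\Delta e^{-\Delta} = H_0$. The sketch below does this by bisection and also includes a generic softmax-entropy helper to show that adding tail logits raises entropy at a fixed gap. The tail level of $-15$ used in the example is a hypothetical choice for illustration, not a value from the paper.

```python
import math

def two_class_cutoff(h0, lo=1.0, hi=50.0, iters=100):
    """Solve delta * exp(-delta) = h0 for delta > 1 by bisection.

    On delta > 1 the function delta * exp(-delta) is strictly decreasing,
    so the root is unique for 0 < h0 < exp(-1).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(-mid) > h0:
            lo = mid   # approximation still above h0: need a larger gap
        else:
            hi = mid
    return 0.5 * (lo + hi)

def softmax_entropy(logits):
    """Shannon entropy (nats) of softmax(logits), computed stably."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return -sum((w / z) * math.log(w / z) for w in weights if w > 0.0)

delta_2class = two_class_cutoff(0.1)
print(delta_2class, 5.0 / delta_2class)  # cutoff and ratio to the empirical 5.0

# Same top-2 gap, with and without a flat tail of lower-ranked logits.
two_logits = [0.0, -5.0]
with_tail = two_logits + [-15.0] * (30_000 - 2)  # hypothetical tail level
print(softmax_entropy(two_logits), softmax_entropy(with_tail))
```

The bisection lands close to the 3.57 quoted above, and the ratio to the empirical cutoff of 5.0 is about 1.4, matching the stated underestimate.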

Table 22: Mean output entropy on real LLM wrong-answer queries vs. the 2-class approximation $\Delta e^{-\Delta}$. The 2-class formula underestimates entropy by $3$–$6\times$ across the relevant range because lower-ranked logits at $V \approx 30{,}000$ contribute substantially.

#### Step 2: Confident fraction is a tail probability.

Let $\mathcal{D}$ be the distribution of $\Delta$ over wrong-answer queries. Then:

$$C := \Pr_{\Delta \sim \mathcal{D}}\left[\Delta > \Delta^{*}(H_0, V)\right]. \tag{10}$$

$C$ is the tail probability of the gap distribution above the vocabulary- and threshold-dependent cutoff $\Delta^{*}$.

#### Step 3: The gap distribution is approximately exponential\.

We measure the $\Delta$ distribution on wrong-answer queries for representative models (Table [23](https://arxiv.org/html/2605.05686#A11.T23)). The std/mean ratio is close to 1 (1.06–1.24), consistent with an exponential distribution.

Table 23: Logit-gap distribution statistics on wrong-answer queries. std/mean $= 1$ for an exact exponential; empirical values (1.06–1.24) are close. Deviations are the source of the remaining $\sim 17\%$ discrepancy between the predicted constant $c = \Delta^{*} = 5.0$ and the empirical fit $c \approx 5.87$.

For an exponential distribution with mean $\bar{\Delta}$, $\Pr[\Delta > x] = \exp(-x/\bar{\Delta})$. Substituting $x = \Delta^{*}(H_0, V)$:

$$C \approx \exp\!\left(-\frac{\Delta^{*}(H_0, V)}{\bar{\Delta}}\right) = \exp\!\left(-\frac{c}{\bar{\Delta}}\right), \tag{11}$$

recovering Eq. [5](https://arxiv.org/html/2605.05686#S5.E5) with $c = \Delta^{*}(H_0, V)$. Using the empirical $\Delta^{*} = 5.0$ for $H_0 = 0.1$, the predicted slope is $-5.0$; the empirical through-origin fit gives $-5.87$ (agreement within 17%, attributable to minor super-exponential tails).
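The tail-probability step can be sanity-checked by simulation: draw gaps from an exponential distribution and compare the fraction above the cutoff with the closed form $\exp(-c/\bar{\Delta})$. In the sketch below, the mean gap of 2.0, the sample size, and the function names are hypothetical choices for illustration; only the cutoff of 5.0 comes from the text.

```python
import math
import random

def confident_fraction(c, mean_gap):
    """Closed-form tail probability Pr[gap > c] for an exponential gap."""
    return math.exp(-c / mean_gap)

def empirical_tail(samples, cutoff):
    """Fraction of sampled gaps that exceed the cutoff."""
    return sum(1 for s in samples if s > cutoff) / len(samples)

rng = random.Random(0)   # fixed seed so the run is repeatable
mean_gap = 2.0           # hypothetical mean logit gap on wrong answers
cutoff = 5.0             # Delta* quoted in the text for H0 = 0.1

# random.expovariate takes the rate, i.e. 1 / mean.
samples = [rng.expovariate(1.0 / mean_gap) for _ in range(200_000)]

predicted = confident_fraction(cutoff, mean_gap)
observed = empirical_tail(samples, cutoff)
print(predicted, observed)
```

With real logit gaps the std/mean ratio would exceed 1 slightly, so the observed tail would deviate somewhat from the closed form, in line with the 17% discrepancy discussed above.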

#### Scaling with model size\.

Under compute-optimal training, $\tau \propto N$. Liu et al. \[[19](https://arxiv.org/html/2605.05686#bib.bib29)\] show that logit scale grows as $\tau^{1/3}$, giving:

$$\bar{\Delta}(N) \propto N^{1/3} \cdot \bar{\Delta}_0. \tag{12}$$

Substituting: $\log C(N) \approx -c/(\bar{\Delta}_0 \cdot N^{1/3})$, a stretched exponential in $N$. Within the Qwen 2.5 family, empirical fits give $\bar{\Delta} \propto N^{1/3}$ with $r^2 \geq 0.94$.

### K.2 Per-model empirical verification ($r^2 = 0.88$, parameter-free)
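To make the stretched-exponential form concrete, the sketch below evaluates $C(N) = \exp\!\big(-c / (\bar{\Delta}_0 N^{1/3})\big)$ at a few parameter counts. The proportionality constant `gap0` is a hypothetical value chosen only so that the outputs fall in a readable range; it is not fitted to any model, and the function names are likewise illustrative.

```python
import math

def mean_gap(n_params, gap0):
    """Mean logit gap under the assumed N^(1/3) growth."""
    return gap0 * n_params ** (1.0 / 3.0)

def confident_fraction(n_params, c, gap0):
    """C(N) = exp(-c / mean_gap(N)): the confident fraction at scale N."""
    return math.exp(-c / mean_gap(n_params, gap0))

c = 5.0         # Delta* from the text for H0 = 0.1
gap0 = 1.0e-3   # hypothetical proportionality constant

sizes = (0.5e9, 7e9, 72e9)
fractions = [confident_fraction(n, c, gap0) for n in sizes]
for n, frac in zip(sizes, fractions):
    print(f"N = {n:.1e}  C = {frac:.4f}")
```

Because the exponent shrinks in magnitude as $N^{-1/3}$, $C$ rises monotonically with scale and approaches 1 only slowly.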

Table [24](https://arxiv.org/html/2605.05686#A11.T24) reports per-model accuracy and confident fraction $C_{0.1}$ on TriviaQA and NQ-Open. Table [25](https://arxiv.org/html/2605.05686#A11.T25) verifies the universal law: for each (model, benchmark) pair, the predicted $\log C = -\Delta^{*}_{\mathrm{model}}/\bar{\Delta}$ uses no free parameters, yet matches the empirical $\log C$ with mean ratio $0.96 \pm 0.11$ across all 21 points ($r^2 = 0.88$).

Table 24: Per-model results on TriviaQA and NQ-Open (1,500 questions each). $C_{0.1}$: fraction of wrong answers with output entropy $< 0.1$. SmolLM2 models show near-zero $C$ at all scales because their smaller LM head produces weaker logit sharpening.

Table 25: Per-(model, benchmark) verification of Eq. [5](https://arxiv.org/html/2605.05686#S5.E5). The predicted $\log C = -\Delta^{*}_{\mathrm{model}}/\bar{\Delta}$ uses *no free parameters*: $\Delta^{*}_{\mathrm{model}}$ is the model's own empirical entropy-crossing threshold, and $\bar{\Delta}$ is its mean logit gap on wrong answers. The ratio of predicted to actual $\log C$ is close to 1 across all 21 points (mean $0.96$, std $0.11$), collapsing four model families and two benchmarks onto a single line.

### K.3 Controlled teacher-student ablation: margin separation vs. confident fraction

To isolate the causal role of model width (controlling for architecture and training data), we run a teacher-student toy model: a 2-layer MLP teacher with inverse temperature $\beta^{*}$ trains a student MLP of varying width $m$ to memorize 500 entity-code associations, and the student is then evaluated on 200 held-out entities.

Table 26: Toy model width sweep (fixed 100K training steps). $\delta_u/\delta_s$: ratio of unseen-entity to seen-entity margin, measuring geometric separation. $C_{0.1}$: confident fraction at entropy threshold 0.1. As width grows $16\times$ (from $m = 16$ to $m = 256$), margin separation improves from $1.2\times$ to $153\times$ while $C$ rises from 52% to 99%: the hidden-state geometry becomes more informative precisely as output entropy becomes less so.

The divergence between $\delta_u/\delta_s$ and $H_{\text{unseen}}$ is the central empirical observation: larger models simultaneously produce *better* geometric signal and *worse* output signal. This controlled experiment confirms that the effect is driven by width (and hence logit-gap dynamics) rather than by differences in training data or architecture.

## Appendix L Margin Separation and Cross-Entity Retrieval Across LoRA Rank and Capacity

To investigate how adapter capacity affects both geometric separation and hallucination behavior, we evaluate margin AUROC and cross-entity retrieval rate across all combinations of adapter type (Full, MLP-only, QK-only, VO-only), LoRA rank ($r \in \{8, 32, 64\}$ for Full; $r = 8$ for single-module adapters), and number of memorized entities ($N \in \{25, 50, 100, 200, 400, 800, 1600\}$). All adapters here use brittle (single-template) training.

#### Margin AUROC\.

Table [27](https://arxiv.org/html/2605.05686#A12.T27) reports the AUROC of margin for discriminating seen (correct recall) from unseen (hallucination) entities. Two patterns emerge. First, *margin separation degrades with $N$*: all adapter types achieve strong AUROC at small $N$ (25–100 entities) but collapse to chance by $N = 400$–$1600$. At small $N$, basins are sparse and well-separated; at large $N$, the representation space becomes crowded and the margin signal vanishes under brittle training. Second, for the Full adapter, *higher rank degrades margin further at large $N$*: at $N = 1600$, rank 32 yields AUROC $= 0.310$ versus $0.536$ for rank 8. Higher rank provides more capacity, reducing the pressure to create geometrically distinct basins: the adapter can memorize via distributed, routing-based mechanisms rather than localized attractors.

Table 27: Margin AUROC (seen vs. unseen) across adapter types, ranks, and entity counts under brittle (single-template) training. Strong separation at low $N$ collapses at high $N$. Higher rank further degrades separation for the Full adapter. Compare with the robust MLP adapter (Section [5](https://arxiv.org/html/2605.05686#S5)), which achieves AUROC $= 0.993$ at $N = 1600$, demonstrating that multi-template training, not adapter capacity, is the key driver of geometric separation.

#### Cross-entity retrieval rate.

Table [28](https://arxiv.org/html/2605.05686#A12.T28) reports the fraction of hallucinated outputs (unseen entities, $N_{\text{test}} = 200$) that exactly match a training entity's code. Two effects are evident. First, *rank strongly increases cross-entity retrieval*: at $N = 1600$, the Full adapter's rate rises from 31.0% ($r = 8$) to 80.5% ($r = 64$). Higher rank widens the spatial footprint of the MLP modification, pushing more of the hidden-state space into code-generating territory where the frozen LM head outputs training codes. Second, *adapter type matters*: MLP-only and Full adapters produce far higher rates (20–80%) than QK-only (1–6%) or VO-only (2–18%), consistent with the Jacobian analysis showing MLP dominates landscape reshaping.

Table 28: Cross-entity retrieval rate (%) for unseen entities across adapter configurations. Higher rank dramatically increases the rate for Full adapters, consistent with wider landscape reshaping rather than representational superposition. QK-only adapters produce negligible cross-entity retrieval because they modify routing, not content.

#### Interpretation.

The inverse relationship between margin AUROC and cross-entity retrieval rate is revealing: configurations with poor geometric separation (high rank, large $N$) produce the highest cross-entity retrieval. This supports the paper's account that cross-entity retrieval arises from the LM head's response to globally reshaped hidden states, not from geometric basin capture. When the adapter has sufficient capacity to memorize without creating distinct basins (high rank), seen and unseen entities become geometrically indistinguishable, the margin signal vanishes, and the LM head defaults to training codes from all positions in the reshaped landscape.

Critically, the robust multi-template training regime (Section [2](https://arxiv.org/html/2605.05686#S2)) overcomes this limitation for the MLP-only adapter: at $N = 1600$, robust training achieves AUROC $= 0.993$ where brittle training yields $0.434$. Format diversity forces the MLP to create basin structure that is invariant to surface-level prompt variation, producing the clean geometric separation that underlies the paper's main results.