When and How Long? The Readout-Mediator Angle in Temporal Reasoning

arXiv cs.LG 05/29/26, 04:00 AM Papers
Summary
This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.
arXiv:2605.29126v1 Announce Type: new Abstract: A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert \emph{when} (absolute date) into \emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$-$9\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
Cached at: 05/29/26, 09:17 AM
# When and How Long? The Readout-Mediator Angle in Temporal Reasoning
Source: [https://arxiv.org/html/2605.29126](https://arxiv.org/html/2605.29126)
Shreyas Fadnavis (Bioscope AI, shreyas.fadnavis@bioscope.ai), Praitayini Kanakaraj (Bioscope AI, praitayini.kanakaraj@bioscope.ai), Felix Wyss (Bioscope AI, felix.wyss@bioscope.ai)

###### Abstract

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer’s activations, yet ablating its direction has no effect on the model’s answers, while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces, the *readout–mediator angle*, and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model’s actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert *when* (absolute date) into *how long* (duration), all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$–$9$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout–mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.

## 1 Introduction

Ask a language model “How many days between March 15th and June 22nd?” and it answers correctly: $99$ days. A $\sin/\cos$ Ridge probe at layer 20 decodes both dates from the residual stream with $R^2 = 0.996$ (Gurnee and Tegmark, [2024](https://arxiv.org/html/2605.29126#bib.bib7)), exactly the kind of result used to argue that the model *represents* calendar time (Gurnee and Tegmark, [2024](https://arxiv.org/html/2605.29126#bib.bib7); Marks and Tegmark, [2024](https://arxiv.org/html/2605.29126#bib.bib32); Kantamneni and Tegmark, [2025](https://arxiv.org/html/2605.29126#bib.bib8)). But ablating the probe’s direction drops accuracy by only $0.6$ pp; the model still counts $99$ days as if nothing happened. A matched-rank Distributed Alignment Search (DAS) subspace at the same layer tells the opposite story: ablating it collapses accuracy entirely (Fig. [1](https://arxiv.org/html/2605.29126#S1.F1)).

The probe reads the right answer from a direction the model does not compute with. We quantify this by measuring the principal angle between the two subspaces, which we call the *readout–mediator angle*. At $\bar{\theta} = 88^\circ$, it matches the expectation for two random subspaces drawn from a Haar-uniform distribution ($\mathbb{E}[\bar{\theta}] = 88.3^\circ$ at $(d,k) = (2304, 2)$, Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2)): the probe carries no more information about the model’s computation than a random direction of the same rank. A five-year line of work has questioned whether probe accuracy implies mechanistic relevance (Hewitt and Liang, [2019](https://arxiv.org/html/2605.29126#bib.bib21); Elazar et al., [2021](https://arxiv.org/html/2605.29126#bib.bib17); Ravichander et al., [2021](https://arxiv.org/html/2605.29126#bib.bib22); Mueller et al., [2026](https://arxiv.org/html/2605.29126#bib.bib24), [2025](https://arxiv.org/html/2605.29126#bib.bib25); Canby et al., [2025](https://arxiv.org/html/2605.29126#bib.bib26)); the readout–mediator angle provides the missing instrument: a number that says *how far* the probe sits from the computation, paired with a null that says what “far” means.

Reverse-engineering the causal subspace reveals why the two are orthogonal. Boundary attention heads implement QK offsets at ${\pm}30$ and ${\pm}61$ days: single- and double-month steps that tile any multi-month duration (Fig. [1](https://arxiv.org/html/2605.29126#S1.F1)). MLP layers then execute a two-stage transformation: layers 18–19 read calendar position (*when*); layers 20–25 convert it into duration (*how long*), with MLP SAEs (functionally equivalent to transcoders) showing a monotonic DAS-alignment gradient across this boundary. Sparse-autoencoder features confirm the split at the vocabulary level: probe-aligned features fire on concepts like “month of October”; DAS-aligned features fire on “past 24 hours”. The two feature sets have zero causal overlap (Supp. [S50](https://arxiv.org/html/2605.29126#A50)). Temporal feature analysis (Lubana et al., [2026](https://arxiv.org/html/2605.29126#bib.bib33)) explains the geometry: the DAS mediator aligns with context-predictable structure ($7{\times}$ above the Haar null), while the probe sits in the random cloud. Duration computation lives in the part of the activation accumulated from context, not the current token (Fig. [4](https://arxiv.org/html/2605.29126#S5.F4)C–D).

The dissociation is not specific to one model, one scale, or one domain. It replicates across four models ($1.5$–$9$B), two architecture families, and two further reasoning domains (spatial displacement, symbolic arithmetic), each at the Haar null angle. On Pythia 1.4B, probe $R^2 = 0.956$ at step $0$: an untrained network “represents” dates by the probe’s standard, yet boundary heads, DAS drops, and circularness all emerge only as training proceeds. The probe tracks dimensional capacity, not mechanism learning.

### Contributions:

- The readout–mediator angle and Haar-random null, with three propositions linking angle to ablation effect (§[3](https://arxiv.org/html/2605.29126#S3)).
- Maximal probe–DAS dissociation across four scales ($1.5$–$9$B), two families, three domains, sharpening with scale (§[4](https://arxiv.org/html/2605.29126#S4), [6](https://arxiv.org/html/2605.29126#S6)).
- A full circuit trace: boundary heads, two-stage MLP transcoder chain, disjoint SAE features, and a TFA-based explanation of the orthogonality (§[5](https://arxiv.org/html/2605.29126#S5)).
- A six-experiment battery showing the dissociation breaks probe-based safety monitors (§[6](https://arxiv.org/html/2605.29126#S6)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x1.png)
Figure 1: The readout-mediator dissociation on a duration query (schematic). Given “How many days between March 15th and June 22nd?”, the model deploys two functionally orthogonal subspaces. *Spy P:* the probe subspace $\mathbf{U}_P$ passively decodes both dates; ablating it changes accuracy by $-0.6$ pp. *Spy M:* the DAS mediator $\mathbf{U}_M$ counts $99$ days via month-boundary hops; ablating it collapses accuracy to $0\%$. The two subspaces are nearly orthogonal ($\bar{\theta} = 88^\circ$), matching the Haar-random null: the probe carries no more information about the computation than noise.

## 2 The decodability-use gap

The gap between what a probe can decode and what a model causally uses is well documented. Gurnee and Tegmark ([2024](https://arxiv.org/html/2605.29126#bib.bib7)) decode spatial and temporal coordinates from Llama-2 with $R^2 > 0.9$, yet note explicitly that this “does not imply the model actually uses these representations.” Hernandez et al. ([2024](https://arxiv.org/html/2605.29126#bib.bib3)) make the gap concrete: they define a faithfulness metric and find that many relations are probe-accurate but not faithfully decoded at generation time. Several lines of work have tried to narrow the gap by combining probes with causal intervention (Tak et al., [2025](https://arxiv.org/html/2605.29126#bib.bib5); Feng et al., [2025](https://arxiv.org/html/2605.29126#bib.bib4)), replacing fixed probes with decoder LLMs (Pan et al., [2026](https://arxiv.org/html/2605.29126#bib.bib44)), and questioning the interventions themselves (Grant et al., [2026](https://arxiv.org/html/2605.29126#bib.bib45)), but none provide a single number that says how far the probe direction sits from the causal one, or a null that says what “far” means.

Diagnostic critiques have sharpened the problem without solving it. Control tasks (Hewitt and Liang, [2019](https://arxiv.org/html/2605.29126#bib.bib21)), amnesic probing (Elazar et al., [2021](https://arxiv.org/html/2605.29126#bib.bib17)), and recent audits (Ravichander et al., [2021](https://arxiv.org/html/2605.29126#bib.bib22); Mueller et al., [2026](https://arxiv.org/html/2605.29126#bib.bib24), [2025](https://arxiv.org/html/2605.29126#bib.bib25); Canby et al., [2025](https://arxiv.org/html/2605.29126#bib.bib26)) all question the inferential step from accuracy to mechanism, but stop short of measuring the divergence. Concept-erasure methods such as INLP and LEACE (Belrose et al., [2023](https://arxiv.org/html/2605.29126#bib.bib18)) attempt to close the gap by removing the probed direction, yet their erasure subspaces sit within ${\sim}1.5^\circ$ of the Haar null (Supp. [S41](https://arxiv.org/html/2605.29126#A41)): erasing what the probe finds does not erase what the model uses.

Our circuit-level analysis builds on three families of tools: DAS (Geiger et al., [2024](https://arxiv.org/html/2605.29126#bib.bib9); Sun et al., [2025](https://arxiv.org/html/2605.29126#bib.bib2); Mueller et al., [2025](https://arxiv.org/html/2605.29126#bib.bib25)) for identifying causally load-bearing subspaces, activation patching (Nanda et al., [2023](https://arxiv.org/html/2605.29126#bib.bib10); Syed et al., [2024](https://arxiv.org/html/2605.29126#bib.bib28)) for tracing information flow, and SAEs (Cunningham et al., [2024](https://arxiv.org/html/2605.29126#bib.bib11); Lieberum et al., [2024](https://arxiv.org/html/2605.29126#bib.bib12)) decomposed via NeuronPedia (Lin, [2023](https://arxiv.org/html/2605.29126#bib.bib1)) for interpreting features at the vocabulary level. We use MLP SAEs as functional transcoders (Templeton et al., [2024](https://arxiv.org/html/2605.29126#bib.bib39)) to track what each MLP layer writes to the residual stream, and temporal feature analysis (Lubana et al., [2026](https://arxiv.org/html/2605.29126#bib.bib33)) to decompose activations into context-predictable and novel components, the structural distinction that ultimately explains the orthogonality. Most directly, Gurnee et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib6)) showed that attention heads implement QK-twist rotations on date manifolds; we take the next step and ask which of those projections are causally load-bearing and which are statistical shadows. Extended related work is in Supp. [S10](https://arxiv.org/html/2605.29126#A10).

## 3 The readout-mediator angle: measurement and theory

### Two questions, two subspaces\.

The decodability-use gap arises because probes and causal methods answer different questions about the same layer. Given a task property $z$ and a layer $L$, a *probe subspace* $U_P \in \mathbb{R}^{k \times d}$ is the top-$k$ span of a circular Ridge regressor trained on cached activations; it asks *where is the information?* A *DAS subspace* (Geiger et al., [2024](https://arxiv.org/html/2605.29126#bib.bib9)) is found by parametrising $U$ via QR-decomposition of a trainable matrix and minimizing task NLL while an ablation hook $x \mapsto x - U^\top U x$ zeros $U$ on every forward pass; it asks *where is the computation vulnerable?* Same layer, same rank $k$, different optimization targets: one isolates what is *readable*, the other what is *load-bearing* (notation summary in Supp. [S54](https://arxiv.org/html/2605.29126#A54)). The *readout–mediator angle* $\bar{\theta}(U_P, U_M) = \frac{1}{k}\sum_i \arccos \sigma_i(U_P U_M^\top)$ measures the distance between these two answers.
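As a minimal sketch of the angle definition above, the mean principal angle can be computed from the singular values of $U_P U_M^\top$ once both bases have orthonormal rows. The function names here are illustrative and not taken from the paper; the QR step is an added safeguard for inputs that are not yet orthonormal.

```python
import numpy as np

def orthonormal_rows(a: np.ndarray) -> np.ndarray:
    """Return a k x d matrix with orthonormal rows spanning the rows of `a`."""
    q, _ = np.linalg.qr(a.T)  # q is d x k with orthonormal columns
    return q.T

def readout_mediator_angle(u_p: np.ndarray, u_m: np.ndarray) -> float:
    """Mean principal angle (degrees) between the row spaces of two k x d bases."""
    u_p = orthonormal_rows(u_p)
    u_m = orthonormal_rows(u_m)
    # Singular values of U_P U_M^T are the cosines of the principal angles.
    sigma = np.linalg.svd(u_p @ u_m.T, compute_uv=False)
    sigma = np.clip(sigma, 0.0, 1.0)  # guard against round-off outside [0, 1]
    return float(np.degrees(np.arccos(sigma)).mean())

if __name__ == "__main__":
    d, k = 2304, 2
    eye = np.eye(d)
    same = readout_mediator_angle(eye[:k], eye[:k])            # identical subspaces
    disjoint = readout_mediator_angle(eye[:k], eye[k:2 * k])   # orthogonal subspaces
    print(f"identical: {same:.3f} deg, orthogonal: {disjoint:.3f} deg")
```

The two sanity cases bracket the range: coincident subspaces give an angle near $0^\circ$ and mutually orthogonal ones give $90^\circ$.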

The next three propositions establish why this angle should generically be large (Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1)), what the null expectation is (Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2)), and how the angle controls the observable we actually measure, the ablation effect (Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)).

###### Proposition 1 (Readout-mediator orthogonality, informal).

Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable task output and $z(x)$ a scalar correlated with $x$. The probe direction maximizes $I(u^\top x; z)$, a *second-moment* quantity in $x$. The mediator maximizes $\mathbb{E}|f(x) - f(x - uu^\top x)|$, a *first-moment* quantity in $\nabla_x f$. The two coincide only when $\nabla_x f \propto u_P$. Otherwise they are generically distinct.

*Proof sketch.* The probe solves a Rayleigh quotient $\max_{\|u\|=1} (u^\top \mathbf{c})^2 / (u^\top \Sigma u)$ with $\mathbf{c} = \mathbb{E}[xz]$, yielding $u_P = \Sigma^{-1}\mathbf{c} / \|\cdot\|$, which is set by the data covariance. A first-order Taylor expansion of the ablation effect gives $u_M = \arg\max u^\top G u$ where $G = \mathbb{E}[\nabla_x f \, \nabla_x f^\top]$ is the gradient covariance, which is set by the network’s output sensitivity. Coincidence requires $\Sigma^{-1}\mathbf{c}$ to be the top eigenvector of $G$: a non-generic spectral alignment between data geometry and network geometry that deep networks have no structural reason to satisfy (full proof in Supp. [S22](https://arxiv.org/html/2605.29126#A22)).
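The two directions in the proof sketch can be computed side by side on a toy problem. The sketch below is an illustration under assumed conditions, not the paper's experiment: the label $z$ is carried by a direction `a`, the output is $f(x) = \tanh(w^\top x)$ with `w` constructed orthogonal to `a`, and activations are isotropic Gaussian. All of these choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 5000

# Hypothetical toy setup: the label z is carried by direction `a`,
# while the output f depends on x only through an orthogonal direction `w`.
a = rng.standard_normal(d)
a /= np.linalg.norm(a)
w = rng.standard_normal(d)
w -= (w @ a) * a                      # make w orthogonal to a
w /= np.linalg.norm(w)

x = rng.standard_normal((n, d))                  # activations with Sigma close to I
z = x @ a + 0.1 * rng.standard_normal(n)         # decodable property
# f(x) = tanh(w . x), so grad_x f = (1 - tanh^2(w . x)) * w
grads = (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

# Probe direction: Sigma^{-1} c, estimated from samples.
sigma_hat = x.T @ x / n
c_hat = x.T @ z / n
u_p = np.linalg.solve(sigma_hat, c_hat)
u_p /= np.linalg.norm(u_p)

# Mediator direction (first order): top eigenvector of G = E[grad grad^T].
g_hat = grads.T @ grads / n
eigvals, eigvecs = np.linalg.eigh(g_hat)         # eigenvalues in ascending order
u_m = eigvecs[:, -1]

angle = float(np.degrees(np.arccos(min(1.0, abs(u_p @ u_m)))))
print(f"|cos(probe, label direction)| = {abs(u_p @ a):.3f}")
print(f"probe-vs-mediator angle       = {angle:.1f} deg")
```

In this construction the probe recovers the label direction while the gradient covariance points along `w`, so the two directions are close to orthogonal by design; the point is only that each is determined by a different object ($\Sigma^{-1}\mathbf{c}$ versus $G$).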

### Empirical test of Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1).

If the mediator aligns with the first moment of $\nabla_x f$, computing the gradient subspace directly should recover a direction closer to DAS than to the probe. We verify this on Gemma 2 2B by computing $g_i = \nabla_{h_{L^\star}} \mathrm{NLL}(y^\star \mid x_i)$ for each prompt. The gradient subspace sits $2.3^\circ$ closer to the mediator than the Haar null would place it ($\bar{\theta} = 85.3^\circ$), while its angle to the probe is at null ($88.9^\circ$). The gradient leans toward the mediator but disperses across effective rank $76$; DAS distills the $k = 4$ causal core from this diffuse signal (Supp. [S2](https://arxiv.org/html/2605.29126#A2)).

Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1) tells us to expect divergence between probe and mediator, but not how much. In high dimensions, the answer turns out to be stark: any two low-rank subspaces are nearly orthogonal by default.

###### Proposition 2 (Null angle between random subspaces).

For independent $k \times d$ Stiefel-uniform matrices $U, V$, the principal-angle cosines follow the Jacobi ensemble $J(k, k, d - k)$ with $\mathbb{E}\sum_i \cos^2\theta_i = k^2/d$: each cosine-squared concentrates at $k/d$, the Johnson–Lindenstrauss rate for a random $k$-plane projection in $\mathbb{R}^d$ (exact; MC-verified in Supp. [S20](https://arxiv.org/html/2605.29126#A20)). At $(d,k) = (2304, 2)$, $\mathbb{E}[\bar{\theta}] = 88.3^\circ$.

*Proof sketch.* By Haar rotation invariance, fix $V$ as the first $k$ identity rows; the singular values of $UV^\top$ then follow the Jacobi ensemble $J(k, k, d - k)$. The Collins–Matsumoto trace identity gives $\mathbb{E}\sum_i \cos^2\theta_i = k^2/d$ exactly. The Delta method converts to angle space: $\mathbb{E}[\bar{\theta}] = \arccos\sqrt{k/d} + O(k/d)$, with the second-order correction bounded by ${\sim}0.05^\circ$ at our $(d,k)$ (MC-verified on $10^4$ draws in Supp. [S20](https://arxiv.org/html/2605.29126#A20); full proof in Supp. [S22](https://arxiv.org/html/2605.29126#A22)).
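A small Monte Carlo check of the null can be sketched as follows. It is an illustration, not the paper's Supp. S20 code: random subspaces are drawn by QR-decomposing Gaussian matrices (a standard way to obtain a Haar-uniform row space), and the draw count and seed are arbitrary. The script compares the sampled $\mathbb{E}\sum_i \cos^2\theta_i$ with $k^2/d$ and prints the sampled mean angle next to the closed form $\arccos\sqrt{k/d}$; the two angle figures need not agree exactly.

```python
import numpy as np

def haar_rows(k: int, d: int, rng: np.random.Generator) -> np.ndarray:
    """k x d matrix with orthonormal rows whose row space is Haar-uniform."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T

def null_statistics(k: int, d: int, n_draws: int, seed: int = 0):
    """Monte Carlo mean of sum_i cos^2(theta_i) and of the mean principal angle."""
    rng = np.random.default_rng(seed)
    sum_cos2 = np.empty(n_draws)
    mean_angle = np.empty(n_draws)
    for j in range(n_draws):
        u, v = haar_rows(k, d, rng), haar_rows(k, d, rng)
        cosines = np.clip(np.linalg.svd(u @ v.T, compute_uv=False), 0.0, 1.0)
        sum_cos2[j] = np.sum(cosines ** 2)
        mean_angle[j] = np.degrees(np.arccos(cosines)).mean()
    return float(sum_cos2.mean()), float(mean_angle.mean())

if __name__ == "__main__":
    d, k = 2304, 2
    mc_cos2, mc_angle = null_statistics(k, d, n_draws=2000)
    closed_form = np.degrees(np.arccos(np.sqrt(k / d)))
    print(f"MC E[sum cos^2]   = {mc_cos2:.6f}  (k^2/d = {k * k / d:.6f})")
    print(f"MC E[mean angle]  = {mc_angle:.2f} deg")
    print(f"arccos(sqrt(k/d)) = {closed_form:.2f} deg")
```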

An observed $\bar{\theta} = 88^\circ$ therefore means the probe is no closer to the mediator than a random direction of the same rank. Note, however, that orthogonality alone does not imply causal inertness: a subspace can be orthogonal to the mediator and still affect the output through other pathways (Gurnee et al., [2026](https://arxiv.org/html/2605.29126#bib.bib6)). What distinguishes a statistical shadow from a functionally distinct subspace is the *ablation effect*, which the next proposition links directly to the angle.

###### Proposition 3 (Angle controls ablation effect).

Under (i) $f(x) = g(U_M x)$, (ii) $g$ is $L$-Lipschitz, (iii) $\mathbb{E}[xx^\top] = \sigma^2 I$, for any $k$-dimensional ablation basis $U$ with principal angles $\theta_i$ to $U_M$:

$$
\begin{aligned}
\text{(a)}\quad & \mathbb{E}\,|f(x) - f(x - U^\top U x)|^2 \;\leq\; L^2 \sigma^2 \sum\nolimits_i \cos^2\theta_i; && (1)\\
\text{(b)}\quad & \text{if } \mathrm{row}(U_M) \subseteq \mathrm{row}(U),\ \ \mathbb{E}\,|f(x) - f(x - U^\top U x)|^2 \;\geq\; \underline{L}^2 \sigma^2 k_M; && (2)\\
\text{(c)}\quad & \text{Haar-random } U:\ \ \mathbb{E}\sum\nolimits_i \cos^2\theta_i = k k_M / d. && (3)
\end{aligned}
$$

*Proof sketch.* (a) Factor through the mediator: $|f(x) - f(x - P_U x)| \leq L\|U_M P_U x\|$ by Lipschitz continuity of $g$. Squaring under isotropic covariance yields $\sigma^2 \|U_M U^\top\|_F^2 = \sigma^2 \sum_i \cos^2\theta_i$ via the SVD of $U_M U^\top$. (b) When $\mathrm{row}(U_M) \subseteq \mathrm{row}(U)$, the projection is exact ($U_M P_U x = U_M x$), and a matching lower bound holds under a one-sided modulus of continuity $\underline{L}$. (c) The Haar expectation $\mathbb{E}\|U_M U^\top\|_F^2 = k k_M / d$ follows from the same trace identity as Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2); their ratio gives the null specificity $\rho_k^{\text{null}} \asymp (\underline{L}/L)^2 \, d/k$. Full proofs in Supp. [S22](https://arxiv.org/html/2605.29126#A22); anisotropy sensitivity in Supp. [S24](https://arxiv.org/html/2605.29126#A24); a controlled-perturbation validation design in Supp. [S38](https://arxiv.org/html/2605.29126#A38).
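Part (a) can be checked numerically on a toy function that satisfies the three assumptions. The sketch below is illustrative only: it assumes $g$ is the Euclidean norm (which is $1$-Lipschitz, so $L = 1$), isotropic Gaussian activations, and randomly drawn bases; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, k_m, sigma, n = 256, 4, 4, 1.0, 20000

def haar_rows(k, d, rng):
    """k x d matrix with orthonormal rows spanning a random subspace."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T

u_m = haar_rows(k_m, d, rng)   # mediator basis U_M
u = haar_rows(k, d, rng)       # ablation basis U

def f(x):
    # Assumed toy output: f(x) = g(U_M x) with g = Euclidean norm (1-Lipschitz).
    return np.linalg.norm(x @ u_m.T, axis=1)

x = sigma * rng.standard_normal((n, d))   # isotropic activations, E[x x^T] = sigma^2 I
x_abl = x - (x @ u.T) @ u                 # ablation hook x -> x - U^T U x

effect = float(np.mean((f(x) - f(x_abl)) ** 2))         # left side of bound (a)
cosines = np.linalg.svd(u_m @ u.T, compute_uv=False)    # cosines of principal angles
sum_cos2 = float(np.sum(cosines ** 2))
bound = (1.0 ** 2) * sigma ** 2 * sum_cos2              # L^2 sigma^2 sum cos^2, L = 1
frob = float(np.linalg.norm(u_m @ u.T, "fro") ** 2)     # ||U_M U^T||_F^2

print(f"ablation effect   = {effect:.5f}")
print(f"upper bound (a)   = {bound:.5f}")
print(f"||U_M U^T||_F^2   = {frob:.5f}  (sum cos^2 = {sum_cos2:.5f})")
print(f"Haar mean k*k_M/d = {k * k_m / d:.5f}  (a single draw will scatter around this)")
```

The Frobenius identity used in the proof holds exactly for any pair of orthonormal bases; the inequality is checked here against a sample average rather than a true expectation.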

### Specificity ratio, design choices, and experimental setup\.

Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) motivates a single scalar that separates direction from dimensionality: $\rho_k = (\text{DAS drop})/(\text{random-control mean drop})$. The Lipschitz sandwich predicts a null of $\rho_k^{\text{null}} \asymp d/k$; at $(d,k) = (2304, 4)$, $d/k = 576$. On Gemma 2 2B, $\rho_4 = 1050$ ($\Delta_{\text{add}} = 42$ pp; CI in Supp. [S25](https://arxiv.org/html/2605.29126#A25)), far exceeding the dimensional null and confirming that the DAS subspace captures directed structure, not just dimensionality budget. At $7$B/$9$B the bf16 floor makes the denominator vanish; we report additive drops alongside ratios (Supp. [S25](https://arxiv.org/html/2605.29126#A25)). We use standard DAS rather than HyperDAS (Sun et al., [2025](https://arxiv.org/html/2605.29126#bib.bib2)) so that probe and DAS share identical $(L,k)$, isolating the optimization-objective contrast that Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1) predicts (Supp. [S1](https://arxiv.org/html/2605.29126#A1)); every angle measurement is paired with $n \geq 25$ Haar-uniform random-subspace controls of matched rank ($5$–$95\%$ accuracy-drop envelope). Primary analyses use Gemma 2 2B (Gemma Team, [2024](https://arxiv.org/html/2605.29126#bib.bib13)) and Qwen 2.5 1.5B (Yang and others, [2024](https://arxiv.org/html/2605.29126#bib.bib15)); scale-up adds Qwen 2.5 7B and Gemma 2 9B; training dynamics on Pythia 1.4B (Biderman et al., [2023](https://arxiv.org/html/2605.29126#bib.bib14)). $L^\star$ is the bootstrap peak of monthly-stratified $5$-fold CV probe $R^2$; the probe–DAS angle is ${\approx}89^\circ$ at every layer (Supp. [S27](https://arxiv.org/html/2605.29126#A27)), so single-layer intervention is conservative. Algorithm [1](https://arxiv.org/html/2605.29126#algorithm1) summarizes the full protocol; pseudocode in Supp. [S23](https://arxiv.org/html/2605.29126#A23).

**Input:** Frozen model $M$, prompt set $\mathcal{P}$ with targets $\{y^\star\}$, rank $k$

**Output:** $\bar{\theta},\; \Delta_P,\; \Delta_M,\; \rho_k$

1. Sweep layers $0, \ldots, L_{\max}$; set $L^\star \leftarrow \arg\max R^2$; train ridge probe at $L^\star$; extract $U_P$ ($\triangleright$ probe)
2. Train DAS at $(L^\star, k) \to U_M$ ($\triangleright$ mediator; Alg. [2](https://arxiv.org/html/2605.29126#algorithm2), Supp. [S23](https://arxiv.org/html/2605.29126#A23))
3. $\theta_1, \ldots, \theta_k \leftarrow \textsc{PrincipalAngles}(U_P, U_M)$
4. $a_{\text{clean}} \leftarrow \textsc{Acc}(\mathcal{P})$
5. $a_P \leftarrow \textsc{Acc}\bigl(\mathcal{P} \mid x \mapsto x - U_P^\top U_P x\bigr)$
6. $a_M \leftarrow \textsc{Acc}\bigl(\mathcal{P} \mid x \mapsto x - U_M^\top U_M x\bigr)$
7. **for** $j = 1, \ldots, N_{\mathrm{null}}$ **do** ($\triangleright$ null calibration)
   - $U_j \sim \text{Haar}\bigl(G(k,d)\bigr)$
   - $a_j \leftarrow \textsc{Acc}(\mathcal{P} \mid U_j)$
8. **return** $\bar{\theta}$, $\Delta_P = a_{\text{clean}} - a_P$, $\Delta_M = a_{\text{clean}} - a_M$, $\rho_k = \Delta_M / \overline{\Delta_{\text{rand}}}$

Algorithm 1: Readout-mediator diagnostic protocol
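The final return step of the protocol reduces to a few lines of arithmetic. The helper below is a hedged sketch with a hypothetical name and interface; how to handle a vanishing denominator (returning infinity and falling back on the additive drop) is an assumption motivated by the bf16-floor remark above, not a documented choice of the authors.

```python
import math
from statistics import mean

def protocol_summary(a_clean, a_probe, a_das, a_random):
    """Additive drops and specificity ratio from accuracies given in percent.

    `a_random` holds one accuracy per random-subspace control.
    """
    delta_p = a_clean - a_probe
    delta_m = a_clean - a_das
    delta_rand = mean(a_clean - a for a in a_random)
    # If the random controls do not reduce accuracy, the ratio is undefined;
    # return infinity and rely on the additive drop instead.
    rho = delta_m / delta_rand if delta_rand > 0 else math.inf
    return {"delta_P": delta_p, "delta_M": delta_m,
            "delta_rand": delta_rand, "rho_k": rho}

if __name__ == "__main__":
    # Clean, probe, and DAS accuracies follow the Gemma 2 2B figures quoted in
    # the text; the 25 random-control accuracies are made up for illustration.
    a_random = [42.0] * 24 + [41.0]
    print(protocol_summary(42.0, 41.4, 0.0, a_random))
```

With these illustrative inputs the mean random-control drop is $0.04$ pp, so a $42$ pp DAS drop corresponds to a ratio of about $1050$.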

## 4 The dissociation: probes decode, DAS computes

Section [3](https://arxiv.org/html/2605.29126#S3) predicts that probe and mediator should be generically distinct, with their angle at the Haar null. We now test this on calendar-date duration reasoning (Fig. [2](https://arxiv.org/html/2605.29126#S4.F2)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x2.png)
Figure 2: The readout-mediator dissociation, quantified. (A) Accuracy on Gemma 2 2B under four ablations at $L^\star = 1$: DAS collapses accuracy to $0\%$; probe and random ablations produce drops within 1 pp of clean. (B) Mean principal angle between DAS and probe subspaces at each $k$; shaded band is the Haar-random null $\theta = \arccos\sqrt{k/d}$. (C) Causal specificity ratio $\rho = \Delta_{\text{DAS}}/\Delta_{\text{random}}$ at $k = 2, 4, 6$; DAS is $190$–$1050{\times}$ more damaging than a matched-dimension random subspace. (D) Cross-scale replication: DAS drops at $k = 4$ on four models ($1.5$B–$9$B); diamond markers show random controls near zero.

### Maximal dissociation on Gemma 2 2B.

On Gemma 2 2B at $L^\star = 1$, probe and DAS ablations at matched rank tell opposite stories. Projecting out the probe subspace ($R^2 = 0.981$, peak $0.996$ matching Gurnee and Tegmark [2024](https://arxiv.org/html/2605.29126#bib.bib7)) drops duration accuracy by only $0.6$ pp, statistically identical to a random control. DAS ablation at the same layer and same $k$ drops accuracy from $42\%$ to $0\%$, with specificity ratios up to $\rho_4 = 1050$ (Fig. [2](https://arxiv.org/html/2605.29126#S4.F2)A,C). The angle between the two subspaces is $88^\circ$, matching Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2)’s prediction of $88.3^\circ$ to within $1.5^\circ$ (Fig. [2](https://arxiv.org/html/2605.29126#S4.F2)B; Supp. [S3](https://arxiv.org/html/2605.29126#A3)). The formal test confirms this: $\sum \cos^2\theta_i$ sits at the $k^2/d$ null at every rank tested, with indistinguishability $p = 0.51$–$0.72$ (Supp. [S24](https://arxiv.org/html/2605.29126#A24)). The probe does not merely live in a different subspace; it lives *as far from the mediator as noise*: a *statistical shadow* that reads the representation without tapping the computation ($\rho_k \approx 1$). Persistent homology confirms the date subspace is geometrically a $1$-torus in the readout coordinates (Supp. [S8](https://arxiv.org/html/2605.29126#A8)). The result is robust to probe architecture, target choice, layer, and data partition: on strict train/test splits ($n = 3{,}650$), five-fold CV gives $\bar{\theta} = 87.8^\circ \pm 0.1^\circ$ with bootstrap $95\%$ CI $[87.3^\circ,\, 88.3^\circ]$ (Supp. [S27](https://arxiv.org/html/2605.29126#A27), [S40](https://arxiv.org/html/2605.29126#A40), [S41](https://arxiv.org/html/2605.29126#A41), [S31](https://arxiv.org/html/2605.29126#A31)).

### The dissociation sharpens with scale\.

The pattern replicates across all four models ($1.5$–$9$B): DAS drops $42$–$51$ pp while random controls drop ${\leq}0.3$ pp at $1.5$B and exactly zero at $7$B/$9$B (Fig. [2](https://arxiv.org/html/2605.29126#S4.F2)D; Supp. [S1](https://arxiv.org/html/2605.29126#A1), Tab. [2](https://arxiv.org/html/2605.29126#A1.T2), [S26](https://arxiv.org/html/2605.29126#A26)). The specificity ratio strengthens monotonically with scale, as Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) predicts, since the null $\rho_k^{\text{null}} \asymp d/k$ grows with model width. Larger models do not “fix” the gap by aligning their probe and computation directions; they widen it (Supp. [S25](https://arxiv.org/html/2605.29126#A25)). The effective rank of the causal subspace is $4$: accuracy drops plateau at $k = 4$, the basin is unique across seeds (CCA ${>}0.94$; Supp. [S29](https://arxiv.org/html/2605.29126#A29), [S30](https://arxiv.org/html/2605.29126#A30)), and the subspace is $59{\times}$ super-additive, so the four directions function as a cooperative unit, not an arbitrary collection (Supp. [S44](https://arxiv.org/html/2605.29126#A44), [S28](https://arxiv.org/html/2605.29126#A28), [S51](https://arxiv.org/html/2605.29126#A51)). This subspace is task-specific: the $42$ pp duration drop is $12.6{\times}$ the average across $n = 240$ non-calendar prompts (Supp. [S18](https://arxiv.org/html/2605.29126#A18)).

## 5 The circuit: from boundary heads to duration vocabulary

Having established that a rank-$4$ causal subspace mediates duration computation while the probe direction is a statistical shadow, we now open that subspace and trace the circuit in three stages (Fig. [4](https://arxiv.org/html/2605.29126#S5.F4), [5](https://arxiv.org/html/2605.29126#S5.F5)A): attention heads that route month-grained context, MLP layers that transform calendar position into elapsed duration, and SAE features that make the *when*/*how long* split legible at the vocabulary level.

### Stage 1: boundary heads route month\-grained context\.

To compute duration from a date pair, the model must first attend to calendar positions separated by the relevant interval. A QK-twist scan (Gurnee et al., [2026](https://arxiv.org/html/2605.29126#bib.bib6)) identifies $24$ heads with $|z| \geq 3$ ($65$ BH-significant at $q = 0.05$) in Gemma 2 2B whose offsets concentrate at $|c| \in \{30, 61\}$ days: single- and double-month steps that reflect Gregorian month-length arithmetic (Fig. [3](https://arxiv.org/html/2605.29126#S5.F3)A). The pair $\{30, 61\}$ tiles any multi-month duration (Fig. [1](https://arxiv.org/html/2605.29126#S1.F1)); weekly periodicity ($c = 7$) is absent (Supp. [S4](https://arxiv.org/html/2605.29126#A4)). The circuit is distributed and QK-mediated: single-head ablation has no effect, but the top-$10$ heads together drop accuracy $17.2$ pp, and cascading ablation localizes the routing bottleneck to L11–L12 ($\Delta\mathrm{NLL} = 0.455$, vs. $0.016$ for L23–L25); $W_{OV}$ alignment is $1.17{\times}$ the Haar null ($p = 0.004$), so boundary heads route via attention patterns, not value-output composition (Supp. [S4](https://arxiv.org/html/2605.29126#A4), [S5](https://arxiv.org/html/2605.29126#A5), [S51](https://arxiv.org/html/2605.29126#A51), [S46](https://arxiv.org/html/2605.29126#A46)). The same offset modes appear independently in Qwen 2.5 1.5B ($p = 0.009$, Monte Carlo null; Fig. [3](https://arxiv.org/html/2605.29126#S5.F3)B, Supp. [S37](https://arxiv.org/html/2605.29126#A37), Fig. [19](https://arxiv.org/html/2605.29126#A37.F19)), and dose–response ablation is super-linear in both families (Fig. [3](https://arxiv.org/html/2605.29126#S5.F3)C), which is evidence that the circuit is dictated by calendar structure, not model-specific training artifacts (Gurnee et al., [2026](https://arxiv.org/html/2605.29126#bib.bib6)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x3.png)

Figure 3: A distributed, offset-structured, cross-family circuit. (A) Per-head QK-twist $|z|$ on Gemma 2 2B (26 layers $\times$ 8 heads); 24 BH-significant boundary heads at $|z| \geq 3$. (B) Detected QK offsets: Gemma BH-significant heads and Qwen 2.5 1.5B top-20. Both families cluster at $\pm 30$ and $\pm 61$ days; neither at $\pm 7$. (C) Dose–response: accuracy drop vs. fraction of boundary heads ablated; super-linear scaling in both families. (D) Attribution-patching recall and IoU against the QK-twist boundary-head set as a function of top-$K$ heads.

### Stage 2: MLPs convert*when*into*how long*\.

Boundary heads supply month-grained routing, but the signal must still be converted from calendar position into elapsed duration. Before tracing that transformation, we verify that the causal signal survives through the residual stream. Tracking DAS subspace energy $\|U_{\text{DAS}} x_L\|^2 / \|x_L\|^2$ across all 26 layers (Fig. [4](https://arxiv.org/html/2605.29126#S5.F4)A) shows the mediator signal *never disappears*: $26.4\times$ null at $L^\star = 1$, trough $3.1\times$ at $L = 18$, recovery $6.6\times$ at $L = 22$. Probe energy, by contrast, tracks the random null throughout ($0.0$–$4.1\times$). The causal signal flows forward continuously and the probe never intercepts it, which explains why boundary heads at L11–L12 can operate in a residual stream still rich with mediator content ($8\times$ null, $\Delta\mathrm{NLL} = 0.45$; Supp. [S46](https://arxiv.org/html/2605.29126#A46), [S43](https://arxiv.org/html/2605.29126#A43), [S36](https://arxiv.org/html/2605.29126#A36)).
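The energy fraction $\|U x\|^2 / \|x\|^2$ used above can be sketched as follows. This is an illustration on synthetic data, not the paper's pipeline: the subspace `U` is a random stand-in for a learned basis, the activations are isotropic Gaussian noise, and the null is taken to be $k/d$, which is the expected fraction for a rank-$k$ subspace unrelated to the data (the paper's own null may be estimated empirically).

```python
import numpy as np


def random_orthonormal_rows(k, d, rng):
    """k orthonormal rows in R^d, from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T


def energy_fraction(U, x):
    """Fraction of the squared norm of x captured by the row space of U (k x d)."""
    proj = U @ x
    return float(proj @ proj) / float(x @ x)


rng = np.random.default_rng(0)
d, k = 2304, 4                           # residual width and rank quoted in the text
U = random_orthonormal_rows(k, d, rng)   # stand-in for a learned subspace
X = rng.standard_normal((2000, d))       # synthetic activations with no structure

null_fraction = k / d
ratio_noise = np.mean([energy_fraction(U, x) for x in X]) / null_fraction

# Inject a constant signal along the first basis direction of U.
X_signal = X + 3.0 * U[0]
ratio_signal = np.mean([energy_fraction(U, x) for x in X_signal]) / null_fraction

print(f"noise: {ratio_noise:.2f} x null, with signal: {ratio_signal:.2f} x null")
```

A ratio near 1 means the subspace holds no more energy than chance; values well above 1, like those reported for the DAS subspace, indicate the activations concentrate in it.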

Decomposing what each MLP layer *writes* to the residual stream using GemmaScope MLP SAEs, which are functionally equivalent to transcoders (monotonic DAS-alignment gradient L18 $\to$ L25, Spearman $\rho = 1.0$; Fig. [5](https://arxiv.org/html/2605.29126#S5.F5)D, Supp. [S49](https://arxiv.org/html/2605.29126#A49)), reveals a two-stage structure with a sharp boundary (Fig. [4](https://arxiv.org/html/2605.29126#S5.F4)B–C; Supp. [S47](https://arxiv.org/html/2605.29126#A47), [S45](https://arxiv.org/html/2605.29126#A45), [S52](https://arxiv.org/html/2605.29126#A52)). At L18–L19, probe contribution peaks ($4.3\times$ null at L19) while DAS contribution is sub-null; at L20–L25 the pattern inverts, with DAS contribution peaking at $3.2\times$ null. Early MLP extracts *when* (calendar position); late MLP computes *how long* (duration). Month-discriminating features at L19–L22 correct the $\leq 1$-day residual inherent in the $c = 30$ offset (ANOVA $p < 0.001$; Supp. [S47](https://arxiv.org/html/2605.29126#A47), [S48](https://arxiv.org/html/2605.29126#A48)): for example, feature #15148 at L22 is completely silent for February while active for all other months, the sharpest possible month-length discrimination. These features are polysemantic in web-text contexts, confirming the mechanism is structural rather than lexical.

### Stage 3: SAE features confirm the split at the vocabulary level\.

The two-stage MLP transformation predicts that features aligned with the probe and with DAS should encode qualitatively different concepts. SAE analysis confirms this. Probe-aligned feature #12499 at $L = 1$ (*“specific months”*; Fig. [5](https://arxiv.org/html/2605.29126#S5.F5)B) fires on *month of October*, *month of February*, *first month of the season*: it encodes calendar *position*, not duration. DAS-aligned features at $L^\star = 1$ fire on copula contexts (*“forms of to be”*), the syntactic frame for duration queries, whose decoder directions nonetheless promote numeric tokens (logit-lens $Z = +0.18$, $p = 0.009$; Supp. [S50](https://arxiv.org/html/2605.29126#A50)). By $L = 24$, feature #2309 (*“quantities of time and duration”*, Fig. [5](https://arxiv.org/html/2605.29126#S5.F5)C) promotes *months, weeks, days, years*: temporal semantics emerge at the relay endpoint, not at $L^\star$.

The dissociation is total. Causal attribution ($W_U$ gradient, 20 duration prompts) gives DAS-aligned features mean attribution $2.41$; probe-aligned features contribute exactly zero (Jaccard $= 0.000$; Supp. [S50](https://arxiv.org/html/2605.29126#A50), Fig. [29](https://arxiv.org/html/2605.29126#A50.F29)). Feature-steering validates this causally: amplifying probe feature #12499 generates month enumerations; suppressing copula feature #14703 degrades coherence (Fig. [30](https://arxiv.org/html/2605.29126#A50.F30)). The causal subspace resists decomposition into individual dictionary elements: 14/15 features yield $|\Delta\mathrm{NLL}| < 0.05$ on individual ablation, consistent with the $59\times$ cooperation ratio, and decoder-direction steering with up to 100 DAS-aligned features yields $\Delta\mathrm{NLL} \approx 0$, while full rank-4 ablation yields $\Delta\mathrm{NLL} = +69$ (Templeton et al., [2024](https://arxiv.org/html/2605.29126#bib.bib39)). Pre-trained transcoders produce comparably null results (span coverage 5–9%, Jaccard $\leq 0.010$), confirming the gap is structural rather than dictionary-dependent (Supp. [S49](https://arxiv.org/html/2605.29126#A49), [S50](https://arxiv.org/html/2605.29126#A50); completeness and selectivity analysis therein).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x4.png)

Figure 4: DAS energy, MLP contribution, and TFA subspace geometry. (A) DAS subspace energy fraction through all 26 residual-stream layers: peaks at $26.4\times$ Haar null at $L^\star = 1$ and exceeds null at *every* layer (range $2.1$–$26.4\times$). Probe subspace energy tracks the random null throughout ($0.0$–$4.1\times$). (B) Per-layer MLP contribution to DAS vs. probe subspace (L18–L25). Probe contribution peaks at L19 ($4.3\times$ null, *calendar date*) while DAS is sub-null; DAS contribution peaks at L20 ($3.2\times$, *duration*): a two-stage transformation with a sharp anatomical boundary. (C) Grassmannian embedding of subspaces on $\mathrm{Gr}(4, 2304)$: TFA-predictable (gold) is pulled toward DAS and away from the random cloud; the probe (blue) sits squarely in the random null at $88.5^\circ$. (D) Mean principal angle to DAS, sorted. TFA-Pred ($82.7^\circ$–$83.7^\circ$) sits well below the Haar null ($87.6^\circ$); the probe is indistinguishable from random.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x5.png)

Figure 5: Circuit wiring and feature-level dissociation: probe reads *when*; DAS computes *how long*. (A) Circuit graph: $L^\star = 1$ DAS mediator ($26.4\times$ null) $\to$ boundary heads L11H4/L12H6 ($\Delta\mathrm{NLL} = 0.45$) $\to$ MLP L18–25 (when $\to$ how long) $\to$ relay hub L24H2 (#2309, duration vocabulary) $\to$ output. Probe (dashed) has attribution $= 0.000$. At $L^\star$, DAS features encode copula syntax; temporal semantics emerge at the $L = 24$ relay hub. (B) SAE feature #12499 (probe-aligned, $L = 1$; *“specific months”*): top logits *month/Month/months/MONTH*; activations fire on calendar-position contexts. Tagged inert: zero causal attribution to the duration output. (C) SAE feature #2309 (DAS-aligned, $L = 24$; *“time/duration”*): top logits *months/weeks/days/Months*; activations fire on duration-interval contexts. Tagged causal. (D) MLP transcoder pipeline: DAS alignment increases monotonically from L18 to L25 while probe alignment remains flat, confirming that MLPs progressively write duration, not calendar, information into the residual stream (Supp. [S49](https://arxiv.org/html/2605.29126#A49), [S6](https://arxiv.org/html/2605.29126#A6)).

### Convergence, robustness, and training dynamics\.

One might worry that the dissociation is an artifact of the particular tool used to find the probe direction. A matched-baseline ablation at $(L^\star = 1,\, k = 4)$ places nine interpretability tools on a single specificity-ratio axis (Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) null $\rho_k^{\text{null}} \asymp d/k = 576$): pure decoders (probe, INLP, LEACE) contribute zero causal drop; intermediate methods (gradient $k = 4$, PCA top-4, AP) recover partial signal (5–18.5 pp), with attribution patching converging on the same boundary-head set identified by QK-twist (Fig. [3](https://arxiv.org/html/2605.29126#S5.F3)D); only DAS identifies the full causal core (36–42 pp, $\rho_{\text{DAS}} = 1050$). Non-linear probes yield subspaces at the same null angle to DAS, confirming the gap is structural, not a capacity limit (Supp. [S39](https://arxiv.org/html/2605.29126#A39), [S11](https://arxiv.org/html/2605.29126#A11), [S42](https://arxiv.org/html/2605.29126#A42), [S32](https://arxiv.org/html/2605.29126#A32)). Transplant experiments confirm *population-level* universality (same offsets, same causal hierarchy) without *coordinate-level* universality (cross-model transplants fail; Supp. [S9](https://arxiv.org/html/2605.29126#A9)). Training dynamics on Pythia 1.4B sharpen the distinction further: probe $R^2 = 0.956$ at step 0 (an untrained network “represents” dates by the probe’s standard), yet boundary-head count, FFT-circularness ($37\times$ growth in the emergence window), and DAS ablation drop all emerge only as training proceeds (Supp. [S7](https://arxiv.org/html/2605.29126#A7), Fig. S7). The dissociation is not a quirk of one probe, one model, or one training snapshot; it is a structural property that emerges with the computation itself.
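As a worked illustration of the specificity ratio, the sketch below assumes the definition given in the Table 1 caption ($\rho_k$ is the DAS drop divided by the random-control drop) and averages over rank-matched random controls. The control values are hypothetical, chosen only so that their mean is 0.04 pp; the handling of a non-positive control mean is likewise an assumption.

```python
def specificity_ratio(das_drop_pp, random_drops_pp):
    """DAS ablation drop divided by the mean drop over random rank-matched controls."""
    mean_random = sum(random_drops_pp) / len(random_drops_pp)
    if mean_random <= 0:
        # Assumption: a non-positive control drop is reported as an unbounded ratio.
        return float("inf")
    return das_drop_pp / mean_random


d, k = 2304, 4
null_scale = d / k  # 576.0, the d/k scale quoted for the null

# Hypothetical control drops in pp (25 controls, mean 0.04 pp).
random_drops = [0.02, 0.06] * 12 + [0.04]
rho = specificity_ratio(42.0, random_drops)
print(null_scale, rho)
```

Under this reading, a 42 pp DAS drop with $\rho = 1050$ corresponds to random rank-4 ablations costing about 0.04 pp on average.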

## 6 Beyond calendar dates

Table 1: Cross-task diagnostic at $k = 4$ on Gemma 2 2B. Drops in pp; $\rho_k$ is DAS/random drop.

| Domain | Type | Angle ($\bar{\theta}$) | Probe drop | DAS drop | $\rho_k$ |
|---|---|---|---|---|---|
| Temporal | geom. | $87.9$ | $0.6$ | $42.0$ | $1050\times$ |
| Spatial | geom. | $88.4$ | $-6.0$ | $20.0$ | $20.8\times$ |
| Arith. | symb. | $88.1$ | $0.0$ | $68.0$ | $\gg 10^{3}\times$ |
| Haar null | — | $88.3$ | — | — | — |

The circuit traced above is specific to calendar-date reasoning, but the dissociation itself (the angle sitting at the Haar null) should be generic if Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1) is correct. We test this by running the full diagnostic protocol (probe training, DAS at matched $(L, k)$, 25 random controls, specificity ratio) on two additional domains using Gemma 2 2B: arithmetic (single-digit addition, symbolic, non-geometric) and spatial (1D number-line displacement, geometric manifold). All three domains exhibit the same pattern (Table [1](https://arxiv.org/html/2605.29126#S6.T1)): probe–DAS angle at the Haar null ($87.9$–$88.4^\circ$), probe ablation leaving accuracy intact or improved ($\leq 0$ pp; spatial $+6$ pp, Supp. [S34](https://arxiv.org/html/2605.29126#A34)), and DAS ablation producing large drops (20–68 pp) with specificity ratios far exceeding random controls. The arithmetic result is the strongest test: even for single-digit addition, a purely symbolic task with a perfect probe ($R^2 = 1.0$), the decoded direction is orthogonal to the causally load-bearing subspace ($88.1^\circ$, 68 pp DAS drop, 0 pp probe drop). The dissociation is not specific to geometric-manifold representations; it confirms Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1)’s prediction that probe and mediator are generically distinct (Supp. [S34](https://arxiv.org/html/2605.29126#A34)).
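To make the Haar-null comparison concrete, here is a minimal Monte Carlo sketch of the mean principal angle between two random rank-4 subspaces of $\mathbb{R}^{2304}$. It assumes the angle statistic is the plain mean of the $k$ principal angles; the paper's exact convention and closed-form baseline may differ, so the number printed here is not meant to reproduce the $88.3^\circ$ in Table 1. The helper names are illustrative.

```python
import numpy as np


def orthonormal_basis(k, d, rng):
    """Columns form an orthonormal basis of a Haar-random k-dim subspace of R^d."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def mean_principal_angle_deg(A, B):
    """Mean principal angle in degrees between the column spaces of A and B."""
    # Singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding just above 1
    return float(np.degrees(np.arccos(cosines)).mean())


rng = np.random.default_rng(0)
d, k = 2304, 4
null_angles = [
    mean_principal_angle_deg(orthonormal_basis(k, d, rng), orthonormal_basis(k, d, rng))
    for _ in range(200)
]
null_mean = float(np.mean(null_angles))
print(f"Monte Carlo null: {null_mean:.1f} deg (std {np.std(null_angles):.2f})")
```

A probe–DAS angle is then read against this distribution: a value inside the spread of `null_angles` is what two unrelated subspaces would produce by chance.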

### Why orthogonal?

Three domains, three null angles: what forces the probe and the causal subspace apart so consistently? Lubana et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib33)) provide the key insight: standard SAEs impose an i.i.d. prior across sequence positions (their Prop. 4.1), discarding the temporal structure that distinguishes what a model *reads from context* (the predictable component) from what arrives *de novo* at the current token (the novel component). A probe inherits the same limitation: it captures the axis along which information is *most readable*, not the axis along which the model *uses* that information for computation.

Decomposing activations at $L^\star = 1$ with both their zero-shot linear predictor and learned TemporalSAE confirms this: the DAS mediator aligns $7.1$–$7.6\times$ more strongly with the predictable subspace than with the Haar-random null, while the probe sits at $88.5^\circ$, squarely in the random cloud on the Grassmannian (Fig. [4](https://arxiv.org/html/2605.29126#S5.F4)C–D; Supp. [S35](https://arxiv.org/html/2605.29126#A35), Fig. [17](https://arxiv.org/html/2605.29126#A35.F17)C). This is expected: computing “March 5 to June 10” requires integrating a date-pair from prior context, which is exactly what the predictable component captures. The orthogonality is therefore *within* the predictable subspace: both probe and mediator project accumulated context, but along functionally disjoint axes (circular day-of-year structure vs. month-pair difference structure).

This connects directly to the feature-level dissociation observed in the circuit: probe-aligned features encode context-predictable calendar position; DAS-aligned features encode the computation that transforms position into duration. Standard SAEs recover only linearly accessible features (Hindupur et al., [2025](https://arxiv.org/html/2605.29126#bib.bib34)); the mediator directions, conditionally orthogonal to the readout (Costa et al., [2025](https://arxiv.org/html/2605.29126#bib.bib35)), are invisible to that projection (Supp. [S10](https://arxiv.org/html/2605.29126#A10)). A temporal-specialist SAE (Lubana et al., [2026](https://arxiv.org/html/2605.29126#bib.bib33)) at $L = 12$ independently recovers the copula relay with DAS $\gg$ probe preferential alignment, while its 4D attention bottleneck is orthogonal to DAS (min. angle $80^\circ$), confirming the causal subspace captures *computation*, not prediction (Supp. [S50](https://arxiv.org/html/2605.29126#A50), Fig. [31](https://arxiv.org/html/2605.29126#A50.F31)). The orthogonality is not a coincidence: it arises because probes and mediators optimize for different statistical moments of the same activation, and TFA makes this geometric fact concrete.

### Implications for probe\-based monitoring\.

A growing line of work proposes deploying linear probes as runtime safety monitors that detect deception, harmful intent, or dangerous knowledge by decoding internal representations during inference (Burns et al., [2023](https://arxiv.org/html/2605.29126#bib.bib41); Marks and Tegmark, [2024](https://arxiv.org/html/2605.29126#bib.bib32); Zou et al., [2023](https://arxiv.org/html/2605.29126#bib.bib42)). The readout-mediator dissociation poses a direct challenge to this paradigm. We stress-test it with a six-experiment battery (Supp. [S53](https://arxiv.org/html/2605.29126#A53)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x6.png)

Figure 6: Adversarial dissociation. Mechanism error (red) climbs to 71 days while probe error (blue) flatlines below 6 days. At $\alpha = 3$, 93% of damage is undetected.

The blind spot is *generic*: all seven probe targets tested (day-of-year, month, season, day-of-week, quarter, solstice, gradient) land within $2.8^\circ$ of the Haar-random null. This is a geometric consequence of $k \ll d$: in 2304 dimensions, a rank-4 probe subspace is, under the Haar measure, nearly orthogonal to *any* fixed rank-4 subspace, including the one the model actually uses (Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2)). The blind spot is *exploitable*. We decompose the residual stream at $L^\star$ into two orthogonal components, the DAS mediator $U_M$ and the Gram-Schmidt-orthogonalized probe basis $\tilde{U}_P$, and construct $x_{\mathrm{adv}} = x + \alpha\, U_M^{\top} U_M (\mu_{\mathrm{src}} - \mu_{\mathrm{tgt}}) + \beta\, \tilde{U}_P^{\top} \delta_{\mathrm{probe}}$, where $\mu_{\mathrm{src}}, \mu_{\mathrm{tgt}}$ are mean activations for dates 180 days apart and $\delta_{\mathrm{probe}}$ restores the probe’s $\sin/\cos$ readout to the correct date (full protocol: Supp. [S53](https://arxiv.org/html/2605.29126#A53)). The scalar $\alpha$ controls mechanism corruption; $\beta$ controls probe reassurance; both operate on unit-variance directions so perturbation magnitude scales linearly with the coefficient. At $(\alpha = 3,\, \beta = 2)$, the DAS mechanism shifts by 71 days while the probe reports a 5.7-day error (Fig. [6](https://arxiv.org/html/2605.29126#S6.F6)). The blind spot is *total*: mutual information between the probe readout and the DAS energy is exactly $0.000$ nats ($p = 1.0$, phase-shuffle null, $n = 200$), while the probe carries $3.99$ nats about ground-truth DOY. The probe has temporal information but zero bits about the mechanism’s state. And the blind spot is *unfixable by probe design*: a purpose-built “deception monitor” trained on the same activations sits at $88.4^\circ$ from DAS, inheriting the identical blind spot. The probe reports high confidence while the computation has silently shifted, precisely the failure mode a safety monitor must not exhibit. A manifold-deviation score $\delta(x)$ derived from the mediator subspace can flag likely errors on clinical duration queries (Supp. [S16](https://arxiv.org/html/2605.29126#A16)).
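The adversarial construction $x_{\mathrm{adv}}$ described in this section can be sketched on synthetic data. Everything below is a stand-in: `U_M` and the raw probe basis are random matrices rather than learned subspaces, the probe rank is set to 4 to match the rank-4 probe subspace mentioned in the text, and the means are random vectors. The sketch only demonstrates the geometric point that, once the probe basis is orthogonalized against the mediator, a mediator-only perturbation leaves the probe coordinates unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2304, 4


def orthonormal_rows(M):
    """Orthonormalize the rows of M (QR acts as a stable Gram-Schmidt)."""
    q, _ = np.linalg.qr(M.T)
    return q.T


# Stand-ins for the learned subspaces; rows are basis vectors.
U_M = orthonormal_rows(rng.standard_normal((k, d)))   # mediator basis
U_P_raw = rng.standard_normal((k, d))                  # raw probe directions
# Remove the mediator component from the probe basis, then orthonormalize.
U_P = orthonormal_rows(U_P_raw - (U_P_raw @ U_M.T) @ U_M)


def adversarial(x, mu_src, mu_tgt, delta_probe, alpha, beta):
    """x_adv = x + alpha * U_M^T U_M (mu_src - mu_tgt) + beta * U_P^T delta_probe."""
    mediator_shift = U_M.T @ (U_M @ (mu_src - mu_tgt))
    probe_shift = U_P.T @ delta_probe
    return x + alpha * mediator_shift + beta * probe_shift


x = rng.standard_normal(d)
mu_src, mu_tgt = rng.standard_normal(d), rng.standard_normal(d)
x_adv = adversarial(x, mu_src, mu_tgt, delta_probe=np.zeros(k), alpha=3.0, beta=0.0)

# With beta = 0 the probe coordinates stay fixed while the mediator coordinates move.
probe_change = float(np.linalg.norm(U_P @ (x_adv - x)))
mediator_change = float(np.linalg.norm(U_M @ (x_adv - x)))
print(f"probe change: {probe_change:.2e}, mediator change: {mediator_change:.2f}")
```

Setting `beta` above zero with a non-zero `delta_probe` moves only the probe coordinates, which is the "probe reassurance" term in the construction.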

## 7 Conclusion

Linear probes decode what is present; DAS recovers what is used. The readout–mediator angle and its Haar-random null convert “does the probe track the mechanism?” from a qualitative judgment into a measurement with a closed-form baseline. Across four scales (1.5–9 B), two families, and three domains, the angle is indistinguishable from the null: the probe direction is no closer to the computation than a random subspace of matched rank, because probes maximize *readability* while the model computes along a geometrically disjoint axis of boundary heads, MLP transcoders, and TFA-confirmed monotone month-pair structure. “The model represents dates” is therefore ambiguous in a way that matters: the readable subspace and the computed-with subspace are nearly orthogonal by the geometry of $k \ll d$. For any probe advanced as evidence of mechanism or deployed as a safeguard, we recommend reporting $\bar{\theta}$, $\rho_k$, and random-ablation controls; without them, a monitor can report high confidence on a direction the model has silently abandoned (Fig. [6](https://arxiv.org/html/2605.29126#S6.F6)).

### Future work and societal impact\.

Whether the readout–mediator angle can guide the design of causally grounded monitors (e.g. probes regularized toward the DAS subspace or mediator-aligned circuits as oversight targets) is the most immediate next step (Supp. [S33](https://arxiv.org/html/2605.29126#A33)). Because high-accuracy probes can be geometrically decoupled from a model’s computation, safety monitors that rely solely on probe confidence risk false assurance; $\rho_k$ offers a principled check for when such monitors can be trusted.

## References

- LEACE: perfect linear concept erasure in closed form. In Advances in Neural Information Processing Systems (NeurIPS).
- Y. Benjamini and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), pp. 289–300.
- S. Biderman, H. Schoelkopf, Q. Anthony, et al. (2023). Pythia: a suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML).
- C. Burns, H. Ye, D. Klein, and J. Steinhardt (2023). Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations (ICLR).
- M. Canby, A. Davies, C. Rastogi, and J. Hockenmaier (2025). How reliable are causal probing interventions? In Proceedings of the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 14th International Joint Conference on Natural Language Processing (AACL-IJCNLP).
- D. Chanin, J. Wilken-Smith, T. Dulka, H. Bhatnagar, S. Golechha, and J. Bloom (2025). A is for absorption: studying feature splitting and absorption in sparse autoencoders. In Advances in Neural Information Processing Systems (NeurIPS).
- A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023). Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36.
- V. Costa, T. Fel, E. S. Lubana, B. Tolooshams, and D. Ba (2025). From flat to hierarchical: extracting sparse representations with matching pursuit. In Advances in Neural Information Processing Systems (NeurIPS).
- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2024). Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations (ICLR).
- Y. Elazar, S. Ravfogel, A. Jacovi, and Y. Goldberg (2021). Amnesic probing: behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics (TACL) 9, pp. 160–175.
- J. Engels, L. Riggs, and M. Tegmark (2024). Decomposing the dark matter of sparse autoencoders. Transactions on Machine Learning Research (TMLR).
- A. Feder, N. Oved, U. Shalit, and R. Reichart (2021). CausaLM: causal model explanation through counterfactual language models. Computational Linguistics 47(2), pp. 333–386.
- J. Feng, S. Russell, and J. Steinhardt (2025). Monitoring latent world states in language models with propositional probes. In The Thirteenth International Conference on Learning Representations (ICLR).
- A. Geiger, Z. Wu, C. Potts, T. Icard, and N. D. Goodman (2024). Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning (CLeaR).
- Gemma Team (2024). Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
- S. Grant, S. J. Han, A. R. Tartaglini, and C. Potts (2026). Addressing divergent representations from causal interventions on neural networks. In The Fourteenth International Conference on Learning Representations (ICLR).
- W. Gurnee, E. Ameisen, I. Kauvar, J. Tarng, A. Pearce, C. Olah, and J. Batson (2026). When models manipulate manifolds: the geometry of a counting task. arXiv preprint arXiv:2601.04480.
- W. Gurnee and M. Tegmark (2024). Language models represent space and time. In The Twelfth International Conference on Learning Representations (ICLR).
- P. Haghighatkhah, A. Fokkens, P. Sommerauer, B. Speckmann, and K. Verbeek (2022). Better hit the nail on the head than beat around the bush: removing protected attributes with a single projection. In EMNLP.
- E. Hernandez, A. Sen Sharma, T. Haklay, K. Meng, M. Wattenberg, J. Andreas, Y. Belinkov, and D. Bau (2024). Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations (ICLR).
- J. Hewitt and P. Liang (2019). Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2733–2743.
- S. S. R. Hindupur, E. S. Lubana, T. Fel, and D. Ba (2025). Projecting assumptions: the duality between sparse autoencoders and concept geometry. In Advances in Neural Information Processing Systems (NeurIPS).
- X. Huang and M. Hahn (2026). Decomposing representation space into interpretable subspaces with unsupervised learning. In The Fourteenth International Conference on Learning Representations (ICLR).
- S. Kantamneni and M. Tegmark (2025). Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873.
- A. Kraskov, H. Stögbauer, and P. Grassberger (2004). Estimating mutual information. Physical Review E 69(6), pp. 066138.
- T. Lieberum, S. Rajamanoharan, et al. (2024). Gemma scope: open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147.
- J. Lin (2023). Neuronpedia: interactive reference and tooling for analyzing neural networks. [https://neuronpedia.org](https://neuronpedia.org/). Accessed: 2025-04-01.
- E. S. Lubana, C. Rager, S. S. R. Hindupur, V. Costa, G. Tuckute, O. Patel, S. K. Murthy, T. Fel, D. Wurgaft, E. J. Bigelow, J. Lin, D. Ba, M. Wattenberg, F. Viegas, M. Weber, and A. Mueller (2026). Priors in time: missing inductive biases for language model interpretability. In The Fourteenth International Conference on Learning Representations (ICLR).
- S. Marks and M. Tegmark (2024). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. In Proceedings of the First Conference on Language Modeling (COLM).
- A. Modell, P. Rubin-Delanchy, and N. Whiteley (2025). The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235.
- A. Mueller, J. Brinkmann, M. Li, S. Marks, K. Pal, N. Prakash, C. Rager, A. Sankaranarayanan, A. S. Sharma, J. Sun, E. Todd, D. Bau, and Y. Belinkov (2026). The quest for the right mediator: surveying mechanistic interpretability for NLP through the lens of causal mediation analysis. Computational Linguistics. [Document](https://dx.doi.org/10.1162/COLI.a.572).
- A. Mueller, A. Geiger, S. Wiegreffe, D. Arad, I. Arcuschin, A. Belfki, Y. S. Chan, J. Fiotto-Kaufman, T. Haklay, M. Hanna, J. Huang, R. Gupta, Y. Nikankin, H. Orgad, N. Prakash, A. Reusch, A. Sankaranarayanan, S. Shao, A. Stolfo, M. Tutek, A. Zur, D. Bau, and Y. Belinkov (2025). MIB: a mechanistic interpretability benchmark. In Proceedings of the 42nd International Conference on Machine Learning (ICML).
- A. Nam, H. Conklin, Y. Yang, T. Griffiths, J. Cohen, and S. Leslie (2025). Causal head gating: a framework for interpreting roles of attention heads in transformers. In Advances in Neural Information Processing Systems (NeurIPS).
- N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt (2023). Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations (ICLR).
- A. S. Okatan, M. İ. Akbaş, L. N. Kandel, and B. Peköz (2025). Seed-induced uniqueness in transformer models: subspace alignment governs subliminal transfer. arXiv preprint arXiv:2511.01023.
- A. Pan, L. Chen, and J. Steinhardt (2026). LatentQA: teaching LLMs to decode activations into natural language. In The Fourteenth International Conference on Learning Representations (ICLR).
- S. Ravfogel, Y. Elazar, H. Gonen, M. Twiton, and Y. Goldberg (2020). Null it out: guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7237–7256.
- A. Ravichander, Y. Belinkov, and E. Hovy (2021). Probing the probing paradigm: does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 3363–3377.
- J. Sun, J. Huang, S. Baskaran, K. D’Oosterlinck, C. Potts, M. Sklar, and A. Geiger (2025). HyperDAS: towards automating mechanistic interpretability with hypernetworks. In The Thirteenth International Conference on Learning Representations (ICLR).
- A. Syed, C. Rager, and A. Conmy (2024). Attribution patching outperforms automated circuit discovery. BlackboxNLP Workshop at EMNLP.
- A\. N\. Tak, A\. Banayeeanzade, A\. Bolourani, M\. Kian, R\. Jia, and J\. Gratch \(2025\)Mechanistic interpretability of emotion inference in large language models\.InFindings of the Association for Computational Linguistics: ACL 2025,Cited by:[§2](https://arxiv.org/html/2605.29126#S2.p1.1)\.
- A\. Templeton, T\. Conerly, J\. Marcus, J\. Lindsey, T\. Bricken, B\. Chen, A\. Pearce, C\. Citro, E\. Ameisen, A\. Jones, H\. Cunningham, N\. L\. Turner, C\. McDougall, M\. MacDiarmid, C\. D\. Freeman, T\. R\. Sumers, E\. Rees, J\. Batson, A\. Jermyn, S\. Carter, C\. Olah, and T\. Henighan \(2024\)Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2024/scaling-monosemanticity/)Cited by:[Appendix S49](https://arxiv.org/html/2605.29126#A49.p1.4),[§2](https://arxiv.org/html/2605.29126#S2.p3.1),[§5](https://arxiv.org/html/2605.29126#S5.SS0.SSS0.Px3.p2.14)\.
- J\. Wang, X\. Ge, W\. Shu, Z\. He, and X\. Qiu \(2025\)Dimensional collapse in transformer attention outputs: a challenge for sparse dictionary learning\.arXiv preprint arXiv:2508\.16929\.Cited by:[Appendix S32](https://arxiv.org/html/2605.29126#A32.p3.7),[Appendix S51](https://arxiv.org/html/2605.29126#A51.SS0.SSS0.Px3.p1.12)\.
- A\. Yanget al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§3](https://arxiv.org/html/2605.29126#S3.SS0.SSS0.Px3.p1.16)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski,et al\.\(2023\)Representation engineering: a top\-down approach to AI transparency\.arXiv preprint arXiv:2310\.01405\.Cited by:[§6](https://arxiv.org/html/2605.29126#S6.SS0.SSS0.Px2.p1.1)\.

## Appendix S1 Supplement: extended DAS results and implementation

### Why standard DAS and not HyperDAS\.

HyperDAS (Sun et al., [2025](https://arxiv.org/html/2605.29126#bib.bib2)) trains a separate hypernetwork to automate the search over token positions, which is valuable when the feature location is unknown. Our setting differs in two ways that make standard DAS the right choice. First, $L^{\star}$ is already fixed by the bootstrap peak of the circular-probe $R^{2}$, so position search is not our bottleneck. Second, and more importantly, the *angle comparison requires identical $(L,k)$ for both probe and DAS*: if HyperDAS were used, any difference in the subspaces could reflect hypernetwork-introduced structure rather than the pure loss-function contrast (decodability vs. causal vulnerability) that Proposition [1](https://arxiv.org/html/2605.29126#Thmproposition1) predicts. Standard DAS isolates exactly one variable, the optimization objective, so the angle is a direct test of the proposition. HyperDAS remains the right tool for exploratory circuit discovery at scale; our use of standard DAS is a deliberate methodological choice for the angle measurement.

### Implementation\.

DAS is implemented in PyTorch. The trainable parameter is a dense matrix $V \in \mathbb{R}^{d \times d}$ with leading $d \times k$ factor orthonormalized via QR on every forward pass; only the first $k$ rows participate in the ablation hook. Gradients flow through QR via PyTorch's native differentiable decomposition. Optimizer: AdamW, learning rate $10^{-3}$, weight decay $0$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, batch size $8$, $400$ steps. Loss: mean NLL of the correct duration-token logit, computed with the model's full forward pass under the ablation hook; model parameters are frozen via `requires_grad_(False)` on load. The ablation hook registers on `blocks.L.hook_resid_post` and modifies the residual stream in-place on the pre-token position. Deterministic seeds $\{0,1,2,3,4\}$; we report the run with lowest final NLL.
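
The projection step of the ablation hook can be illustrated in a few lines. This is a minimal NumPy sketch, not the released PyTorch hook: the function name `ablate_subspace`, the reading of the leading $d \times k$ block as columns of $V$, and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def ablate_subspace(h, V, k):
    """Project a k-dimensional subspace out of activations h.

    h: (n, d) activations; V: (d, d) unconstrained parameter matrix.
    The leading d x k block of V is orthonormalized by QR (assumed layout),
    then its span is removed from every row of h.
    """
    Q, _ = np.linalg.qr(V[:, :k])        # Q: (d, k) with orthonormal columns
    return h - (h @ Q) @ Q.T, Q

rng = np.random.default_rng(0)
d, k = 64, 4                              # toy sizes, not the paper's d = 2304
h = rng.standard_normal((8, d))
V = rng.standard_normal((d, d))
h_abl, Q = ablate_subspace(h, V, k)

# After ablation, the activations have no component left inside span(Q).
residual = np.abs(h_abl @ Q).max()
```

Re-orthonormalizing on every call keeps the ablated subspace exactly $k$-dimensional however the unconstrained matrix drifts during optimization.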

### Convergence diagnostics\.

Every training run logs (i) NLL per step, (ii) $\|V_{1:k}V_{1:k}^{\top}-I_{k}\|_{F}$ as the orthonormality residual, and (iii) the angle between consecutive-step bases. Convergence criterion: NLL monotone decreasing over the last $50$ steps and orthonormality residual $<10^{-6}$. All reported runs pass both.

### Per\-model numerical results\.

For Gemma 2 2B ($L^{\star}=1$, $d=2304$, clean accuracy $42\%$): at $k=2$ the DAS basis is fully trained in $400$ AdamW steps with final NLL of $-71.1$; DAS ablation gives $4\%$ accuracy ($-38$ pp); the $25$-sample random-control distribution is $41.80\% \pm 0.98\%$; and the probe-subspace evaluation at matching $k$ gives $41.0\%$. At $k=4$ and $k=6$, DAS ablation saturates to $0\%$ (final NLL $-105.3$, $-118.7$); random controls remain within $0.5$ pp of clean.

For Qwen 2.5 1.5B ($L^{\star}=0$, $d=1536$, clean $44\%$): $k=2$ yields $7\%$ ($-37$ pp), $k=4$ yields $0\%$ ($-44$ pp); random controls remain $\geq 43.6\%$.

For Qwen 2.5 7B ($L^{\star}=8$, $d=3584$, clean $51\%$) and Gemma 2 9B ($L^{\star}=3$, $d=3584$, clean $51\%$), single $k=4$ runs give $0\%$ and $6\%$ ablation accuracies respectively; $20$ matched random controls produce drops $<0.05$ pp on both. Training wall-clocks: $4$ min (L4), $18$ min (A10G), $42$ min (A100-80G) per model.

### Cross\-scale summary \(Tab\.[2](https://arxiv.org/html/2605.29126#A1.T2)\)\.

Table 2: Cross-scale DAS at $k=4$. Drops in pp; ratio is DAS-drop/random-drop. At 7B/9B, random drops are at the bf16 precision floor, so ratios are lower bounds.

### Why specificity strengthens with scale\.

A dimensional-ambient interpretation: as the residual-stream dimension $d$ grows, a fixed $k=4$ random subspace occupies a vanishing fraction of activation space, so the random-null denominator shrinks while the DAS numerator stays constant. This is consistent with Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)(c)'s $\rho^{\text{null}} \asymp d/k$ scaling, with the matched-$k$ PCA control whose angle to DAS stays near $\arccos(\sqrt{k/d})$ across scales (Supp. [S3](https://arxiv.org/html/2605.29126#A3)), and with random-subspace drops approaching zero at 7B/9B. Whether this extrapolates to frontier scale is open.
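
The reference angle $\arccos(\sqrt{k/d})$ is easy to evaluate at the residual widths used here. The sketch below is a plain-Python illustration (the helper name `haar_null_angle_deg` is ours, not from the released code); it shows how the null angle creeps toward $90^{\circ}$ as $d$ grows at fixed $k$.

```python
import math

def haar_null_angle_deg(k, d):
    """Reference angle arccos(sqrt(k/d)), in degrees, for a k-subspace in R^d."""
    return math.degrees(math.acos(math.sqrt(k / d)))

# Residual widths of the four models studied, at the fixed rank k = 4.
for d in (1536, 2304, 3584):
    fraction = 4 / d                      # share of ambient dimensions occupied
    print(d, f"{haar_null_angle_deg(4, d):.2f} deg", f"k/d = {fraction:.5f}")
```

At $d=2304$ the function returns the $87.6^{\circ}$ null quoted for $k=4$ and the $87.1^{\circ}$ null quoted for $k=6$.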

### What the DAS basis points at\.

Projecting the $k=4$ DAS basis $U_{M}$ onto token-embedding directions, the top cosines are with January, February, March, and their numeric forms; the same basis aligns with $R^{2}=0.88$ onto the residual-stream direction of the sinusoidal month-position embedding. The probe basis, by contrast, aligns most strongly with linear DOY gradients: orthogonal under Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2) and empirically at $88^{\circ}$.

### Seed variance and Grassmannian geometry\.

Across five seeds on Gemma 2 2B at $k=4$: final NLL $\in[-107.6,-102.1]$, ablation accuracy $\in[0\%,2\%]$, pairwise basis angle $\in[12^{\circ},34^{\circ}]$. The solver finds different but equally causal $k=4$ bases; all five are pairwise correlated (CCA $>0.94$), consistent with a unique rank-$4$ causal subspace.

Five additional runs at $k=6$ (seeds $6$–$10$, $1{,}200$ AdamW steps, bfloat16 on A10G): final NLL $\in[-72.0,-70.4]$, mean $=-71.8\pm0.7$. Pairwise max principal angles on $G(6,2304)$: mean $=87.7^{\circ}\pm1.8^{\circ}$, range $[83.9^{\circ},89.9^{\circ}]$; Haar-random null is $87.1^{\circ}$. The $k=6$ solutions are *not* clustered: they scatter uniformly over the Grassmannian, indistinguishable from random $6$-subspaces. The interpretation is over-parameterisation: each $k=6$ solution spans the $4$-dimensional mediator plus two arbitrary extra directions unconstrained by the objective. Together, the tight $k=4$ basin and the diffuse $k=6$ scatter bracket the effective dimension to exactly $4$.
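
Principal angles between two bases can be computed from the singular values of the product of their orthonormalized factors. The following is a minimal NumPy sketch under our own naming (`principal_angles_deg`); the random bases are stand-ins for trained DAS solutions, so the numbers it produces illustrate the random baseline only.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (degrees, ascending) between span(A) and span(B).

    A: (d, k_a), B: (d, k_b); columns need not be orthonormal.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)   # descending
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

rng = np.random.default_rng(0)
d, k = 2304, 6
bases = [rng.standard_normal((d, k)) for _ in range(5)]    # stand-in bases
max_angles = [principal_angles_deg(bases[i], bases[j]).max()
              for i in range(5) for j in range(i + 1, 5)]
print(f"mean pairwise max angle: {np.mean(max_angles):.1f} deg")
```

Clipping the cosines guards against values marginally above 1 from floating-point error, which would otherwise make `arccos` return NaN for nearly identical subspaces.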

## Appendix S2 Supplement: Gradient probe

### Method\.

For each Set-F duration prompt $x_{i}$ we (i) run the model forward under a hook that detaches the residual stream at $L^{\star}$ and re-attaches it with `requires_grad=True`, storing the re-attached tensor $h_{i}$; (ii) compute the NLL of the correct duration token from the final-layer logits; (iii) call `.backward()`, reading $g_{i}=\partial\,\mathrm{NLL}/\partial h_{i}\in\mathbb{R}^{d}$ at the last-token position; (iv) collect the $n\times d$ matrix $G$ and compute its centered SVD. Model parameters are frozen (`requires_grad_(False)`) throughout; gradients flow only through the re-attached activation tensor via the native PyTorch autograd graph, which is preserved despite the detach because we re-attach before returning the value from the hook. No training, no additional parameters.

### Gradient norms and spectrum\.

On $n=332$ Set-F prompts, per-prompt gradient norms are $0.52\pm0.11$ (mean $\pm$ SD), range $[0.30,0.94]$: well-behaved, no outliers. The singular value spectrum of the centered $G$ decays slowly: $\sigma_{1}=3.65$, $\sigma_{2}=2.44$, $\sigma_{3}=1.50$, $\sigma_{4}=1.47$, $\sigma_{5}=1.10$, with no clear elbow. The participation ratio (effective rank) is $76.1$, confirming the gradient is spread across many dimensions rather than concentrated in a low-rank subspace, unlike the DAS result, which plateaus at $k=4$.
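
The spectrum and effective-rank computation can be sketched as follows. We assume the common definition of the participation ratio, $(\sum_i \sigma_i^2)^2 / \sum_i \sigma_i^4$; the text does not state which variant was used, and the random matrix below is only a stand-in for the real gradient matrix $G$.

```python
import numpy as np

def gradient_spectrum(G):
    """Singular values of the column-centered gradient matrix G (n x d)."""
    Gc = G - G.mean(axis=0, keepdims=True)
    return np.linalg.svd(Gc, compute_uv=False)

def participation_ratio(sigma):
    """Effective rank (sum of variances)^2 / (sum of squared variances).

    Variances are sigma**2. This definition is an assumption of the sketch.
    """
    lam = np.asarray(sigma, dtype=float) ** 2
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
G = rng.standard_normal((332, 256))       # stand-in for per-prompt gradients
sigma = gradient_spectrum(G)
pr = participation_ratio(sigma)
print(f"top singular values: {np.round(sigma[:5], 2)}, participation ratio: {pr:.1f}")
```

Under this definition a spectrum with $m$ equal singular values has effective rank $m$, and a rank-one matrix has effective rank $1$.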

### Angle results\.

Table 3: Principal angles ($^{\circ}$) between the gradient subspace $U_{G}$, the DAS mediator $U_{M}$, the circular probe $U_{P}$, and the Haar-random null, on Gemma 2 2B at $L^{\star}=1$. For $U_{G}$ vs. $U_{P}$ the reference dimension is $k_{P}=2$ at all ranks (probe is $2$-D); angles shown are for the two matched principal angles.

The gradient subspace $U_{G}$ sits $2.3^{\circ}$ below the Haar null toward $U_{M}$ at $k=4$, while sitting at or above null relative to $U_{P}$ at every $k$. The minimum principal angle $\theta_{1}=79.6^{\circ}$ at $k=4$ reveals one direction shared between the gradient and the mediator that is $8.0^{\circ}$ closer than null: a non-trivial but partial alignment. The probe–DAS angle for reference is $87.9^{\circ}$ at $k=4$ (Table [4](https://arxiv.org/html/2605.29126#A2.T4)).

Table 4: Three-way comparison at $k=4$. All angles in degrees; null is $\arccos(\sqrt{k/d})=87.6^{\circ}$.

### Ablation and specificity\.

Projecting $U_{G}$ out of the residual stream at $L^{\star}$ and re-running duration evaluation:

Table 5: Gradient probe ablation results on Gemma 2 2B. Clean accuracy $=42\%$. Drops in pp; $\rho_{\nabla}=(\text{grad drop})/(\text{rand mean drop})$.

The specificity ratio peaks at $k=4$ ($\rho_{\nabla}=150$), placing the gradient probe between attribution patching ($\rho_{\text{AP}}\approx120$–$205$) and the SAE ($\rho_{\text{SAE-50}}=288$) on the readout-mediator spectrum. DAS at $k=4$ achieves $\rho=1050$, $7\times$ higher, with ablation accuracy of $0\%$ vs. the gradient probe's $36\%$.

### Interpretation\.

The three-way angle table directly tests Proposition [1](https://arxiv.org/html/2605.29126#Thmproposition1). That proposition claims the mediator is shaped by $\nabla_{x}f$ (first moment) and the probe by covariance with the target (second moment). The data bear this out asymmetrically: $U_{G}$ is measurably closer to $U_{M}$ than the Haar null predicts, but not close to recovering $U_{M}$ ($\bar{\theta}$ is $2.3^{\circ}$ below null, not $0^{\circ}$). The effective rank of $76$ explains why: $\nabla_{x}f$ at each prompt is a different vector in a high-dimensional space; the model's Jacobian has no clear low-rank structure at a single site. DAS's $400$-step ablation-maximising optimization acts as a *causal projector* onto the subset of gradient space that is both (a) consistent across prompts and (b) maximally damaging when removed. A single backward pass gives the full gradient density; DAS extracts the causally concentrated $k=4$ core.

### Cost\.

$332$ forward+backward passes on Gemma 2 2B on a cloud L4 GPU: $63$ seconds for gradient collection, $\sim10$ min total including ablation evaluation and $25\times3=75$ random controls. Less than \$1 at cloud GPU spot pricing.

## Appendix S3 Supplement: principal\-angle tables

Mean principal angles between DAS and probe / PCA / SAE / gradient subspaces on Gemma 2 2B: DAS vs. probe $=\{88.3^{\circ},87.9^{\circ},86.7^{\circ}\}$ at $k\in\{2,4,6\}$; DAS vs. top-$k$ PCA $=\{86.6^{\circ},86.7^{\circ},86.6^{\circ}\}$; DAS vs. gradient probe $=\{87.3^{\circ},85.3^{\circ},85.8^{\circ}\}$. All DAS-vs-probe and DAS-vs-PCA values match the Haar null $\arccos(\sqrt{k/d})$ to within $2^{\circ}$. DAS-vs-gradient angles lie $1$–$2.3^{\circ}$ below null, with the minimum principal angle reaching $79.6^{\circ}$ at $k=4$ (null $87.6^{\circ}$). On Qwen 2.5 1.5B at $k=2$ the DAS-probe angle is $86.9^{\circ}$; angles on Qwen 2.5 7B and Gemma 2 9B are saved with checkpoint pickles and reported in the released results. Full numerical table accompanies the released code.

## Appendix S4 Supplement: QK\-twist scan

### Method\.

For every attention head $(L,h)$ we (i) collect mean residual-stream activations per day-of-year ($\bar{x}_{d}$ for $d\in[1,365]$), (ii) push through the head's $Q$ and $K$ projections to obtain $Q_{d},K_{d}\in\mathbb{R}^{d_{\text{head}}}$, (iii) form the $365\times365$ matrix $M_{d,d'}=Q_{d}^{\top}K_{d'}/\sqrt{d_{\text{head}}}$, (iv) Radon-diagonal average by offset $c\in[-182,182]$: $S(c)=\frac{1}{N_{c}}\sum_{d-d'=c}M_{d,d'}$, and (v) score the head by $z_{c}=(S(c)-\mu_{S})/\sigma_{S}$ where $\mu_{S},\sigma_{S}$ are taken across all non-zero offsets. The peak $|z|$ across $c$ is the head's *QK-twist strength*, and $\arg\max|z|$ is its detected offset $c^{\star}$.
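
Steps (iii) to (v) can be sketched on synthetic data. This is an illustrative NumPy version, not the released scan: the function name `qk_twist` is ours, offsets are assumed to wrap around the 365-day circle, and the peak is searched over non-zero offsets only. The toy keys are a copy of the queries shifted by 30 days, so the scan should recover that offset.

```python
import numpy as np

def qk_twist(Q, K):
    """Offset profile z-scores for per-day Q and K, each of shape (n_days, d_head).

    Assumes an odd n_days so that offsets -n//2 .. n//2 are distinct modulo n_days.
    """
    n, d_head = Q.shape
    M = Q @ K.T / np.sqrt(d_head)                  # M[d, d'] = Q_d . K_d' / sqrt(d_head)
    half = n // 2
    offsets = np.arange(-half, half + 1)
    days = np.arange(n)
    # S(c): mean of M over pairs with d - d' = c, wrapping around the year (assumption).
    S = np.array([M[days, (days - c) % n].mean() for c in offsets])
    nonzero = offsets != 0
    mu, sigma = S[nonzero].mean(), S[nonzero].std()
    z = (S - mu) / sigma
    z_abs = np.where(nonzero, np.abs(z), -np.inf)  # ignore c = 0 when locating the peak
    i = int(np.argmax(z_abs))
    return offsets, z, int(offsets[i]), float(z_abs[i])

rng = np.random.default_rng(0)
E = rng.standard_normal((365, 64))                 # synthetic per-day vectors
Q = E
K = np.roll(E, -30, axis=0)                        # K[d'] = E[d' + 30]
offsets, z, c_star, strength = qk_twist(Q, K)
print(c_star, round(strength, 1))
```

Because `K[d']` equals `E[d' + 30]`, the score `M[d, d']` is large exactly when `d - d' = 30`, which is the diagonal the scan picks out.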

### Null and multiple\-comparison correction\.

Permutation null: for each head, $n=200$ draws randomly permute the DOY labels on $\bar{x}_{d}$ and recompute peak $|z|$; the permutation $p$-value is the empirical fraction of null peaks exceeding the observed peak. We then apply Benjamini–Hochberg (Benjamini and Hochberg, [1995](https://arxiv.org/html/2605.29126#bib.bib16)) across the full head set at $q=0.05$. Gemma 2 2B: $65/208$ heads BH-significant, $24$ additionally exceed $|z|\geq3$; Qwen 2.5 1.5B: $110/336$ BH-significant, $54$ exceed $|z|\geq3$.
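
The two statistical steps are standard and can be written without dependencies. The sketch below uses our own function names and made-up $p$-values; it implements the empirical permutation $p$-value as described (no add-one correction) and the Benjamini–Hochberg step-up rule.

```python
def permutation_p(observed, null_peaks):
    """Fraction of null peaks strictly exceeding the observed peak."""
    return sum(x > observed for x in null_peaks) / len(null_peaks)

def benjamini_hochberg(pvals, q=0.05):
    """Boolean rejection mask under the BH step-up procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank                 # largest rank passing its threshold
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

# Hypothetical per-head p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(pvals, q=0.05)
print(sum(mask), "of", len(pvals), "heads pass BH at q = 0.05")
```

With $200$ permutation draws the smallest non-zero $p$-value is $1/200$, which bounds how many heads can pass BH when the head set is large.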

### Offset\-mode analysis\.

Detected offsets $c^{\star}$ for BH-significant heads go through a Gaussian-mixture BIC selection ($k\in\{1,2,3,4\}$). BIC minima at $k=4$ place centers at $\hat{c}\in\{\pm30.1,\pm60.8\}$ days on both models. The $\pm7$-day center was absent ($\Delta\text{BIC}=-46$ for a model forced to include $\pm7$), ruling out weekly-periodicity as a driver. Full per-head table with $(L,h,c^{\star},z,p,q)$ released as `cached_tensors/qk_twist/{gemma,qwen}_heads.csv`.

### Why $\{30,61\}$ and not other offsets\.

Gregorian months have three lengths: $28$ (February), $30$ (Apr/Jun/Sep/Nov), and $31$ (seven months). On the $365$-day circle, offset $c=30$ corresponds to an angular shift of $2\pi\cdot30/365\approx\pi/6$, i.e. $1/12$ of the full revolution: one month. Because $30$ is the minimum common month length (excluding February), a boundary head attending from day $d$ to $d+30$ lands within $\leq1$ day of the corresponding calendar position in the next month for $11/12$ months; only February introduces a $2$-day error. The offset $c=61=30+31$ is the most common two-month span: of the $12$ adjacent-month pairs in a non-leap year, $8$ sum to exactly $61$ days (any $30+31$ or $31+30$ pair), $2$ sum to $62$ ($31+31$: Jul+Aug and Dec+Jan), and $2$ sum to $59$ (Jan+Feb and Feb+Mar). On the circle this is $2\pi\cdot61/365\approx\pi/3$, i.e. $1/6$ of a revolution, or two months. The pair $\{30,61\}$ therefore forms a minimal greedy basis for month-boundary arithmetic: any multi-month duration can be decomposed into single- and double-month steps, with the $\leq1$-day remainder corrected by the month-discriminating MLP features downstream (§[5](https://arxiv.org/html/2605.29126#S5)). Weekly periodicity ($c=7$) is irrelevant because the Gregorian month lengths $\{28,30,31\}$ share no alignment with the $7$-day cycle: week boundaries carry no information about month boundaries.
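
The calendar arithmetic behind the two offsets can be checked directly. The snippet below is a small plain-Python tally, assuming a non-leap year and treating December and January as adjacent; all variable names are ours.

```python
from collections import Counter

MONTH_DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year

# Lengths of the 12 adjacent-month pairs, with December wrapping to January.
pair_spans = [MONTH_DAYS[i] + MONTH_DAYS[(i + 1) % 12] for i in range(12)]
span_counts = Counter(pair_spans)

def revolution_fraction(offset, year_days=365):
    """Share of the yearly circle covered by a day offset."""
    return offset / year_days

# Error of a fixed +30-day step relative to each month's true length.
step_error = [30 - m for m in MONTH_DAYS]

print(dict(span_counts))
print(round(revolution_fraction(30), 4), round(revolution_fraction(61), 4))
print(step_error)
```

The step-error list makes the February exception explicit: a $30$-day step overshoots a $28$-day month by two days and undershoots a $31$-day month by one.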

### Layerwise distribution of boundary heads\.

On Gemma 2 2B ($26$ layers), boundary-head density peaks at layers $L\in\{3,5,7,9\}$ ($18/24$ significant heads), with a long tail through layer $11$; later layers ($L>15$) carry no boundary heads. On Qwen 2.5 1.5B ($28$ layers), the peak is similar ($L\in\{2,4,6\}$) with a longer tail ($L\in\{14,18\}$ carry $2$ heads each). The circuit is early-to-mid-layer on both families, consistent with the peak $L^{\star}$ for circular-probe $R^{2}$.

### QK\-twist magnitude by layer\.

The maximal $|z|$ per layer traces a unimodal curve: $1.8$ at $L=0$, rising to $7.3$ at $L=5$, falling to $<2$ by $L=12$ on Gemma. Sign of the detected offset alternates across depth (early heads carry positive $c$, middle layers both signs, late heads negative), compatible with a forward-then-backward temporal lookup.

## Appendix S5 Supplement: attribution\-patching head list

Syed-style attribution patching (Syed et al., [2024](https://arxiv.org/html/2605.29126#bib.bib28)) on $332$ Set-F duration prompts corrupted by second-date DOY swap, on Gemma 2 2B. Spearman correlation with QK-twist $|z|$-rank: $\rho=0.035$ ($p=0.61$). Top-$K$ vs. boundary-head set IoU / recall: ($K=12$) $0.06/0.08$; ($24$) $0.14/0.25$; ($48$) $0.20/0.50$; ($60$) $0.20/0.58$; ($100$) $0.15/0.67$. Recall grows monotonically with $K$, while IoU rises up to $K\approx60$ and then declines as false positives accumulate.
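
The two overlap scores are plain set statistics. The sketch below uses toy head identifiers; the reference set size of $24$ and the $12$ shared heads are our assumption, chosen because they are consistent with the $0.20/0.50$ entry at $K=48$.

```python
def overlap_metrics(top_k_heads, boundary_heads):
    """IoU and recall of a top-K head set against a reference head set."""
    top, ref = set(top_k_heads), set(boundary_heads)
    hits = len(top & ref)
    return hits / len(top | ref), hits / len(ref)

# Toy data: heads are (layer, head) pairs.
boundary = [(layer, head) for layer in range(3) for head in range(8)]       # 24 reference heads
others = [(layer, head) for layer in range(10, 16) for head in range(6)]    # 36 unrelated heads
top_48 = boundary[:12] + others                                             # 12 hits among 48
iou, recall = overlap_metrics(top_48, boundary)
print(round(iou, 2), round(recall, 2))
```

IoU penalizes every extra head admitted into the top-$K$ list, which is why it can fall while recall keeps rising.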

## Appendix S6 Supplement: SAE control details

GemmaScope canonical 16k-width residual-stream SAE (Lieberum et al., [2024](https://arxiv.org/html/2605.29126#bib.bib12)) at layer $1$ of Gemma 2 2B, loaded via `sae_lens`. Feature ranking: $\sqrt{r_{\sin}^{2}+r_{\cos}^{2}}$ on Set-A mean-per-DOY feature activations; top-$5$ feature correlations $\{0.43,0.38,0.32,0.29,0.22\}$. Ablation: orthonormalize the decoder directions (QR) of the top-$50$ features and project out from `blocks.1.hook_resid_post` on every forward. Clean $39.5\%$, probe $38.9\%$, SAE-top-$50$ $28.0\%$ ($-11.5$ pp), DAS $0\%$ (all first-token integer parses fail under full-DAS ablation).
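
The feature-ranking score can be sketched on synthetic activations. This is a minimal illustration, assuming $r_{\sin}$ and $r_{\cos}$ are Pearson correlations of a feature's mean-per-DOY activation with $\sin$ and $\cos$ of the day angle, and that days map to angles $2\pi d/365$; the function name and toy features are ours.

```python
import numpy as np

def circular_score(feature_by_doy):
    """sqrt(r_sin^2 + r_cos^2) for one feature's mean activation per day-of-year."""
    n = len(feature_by_doy)
    theta = 2 * np.pi * np.arange(n) / n          # assumed day-to-angle mapping
    r_sin = np.corrcoef(feature_by_doy, np.sin(theta))[0, 1]
    r_cos = np.corrcoef(feature_by_doy, np.cos(theta))[0, 1]
    return float(np.hypot(r_sin, r_cos))

rng = np.random.default_rng(0)
theta = 2 * np.pi * np.arange(365) / 365
acts = np.stack([
    np.cos(theta + 0.7),                              # purely circular, arbitrary phase
    rng.standard_normal(365),                         # unrelated feature
    0.3 * np.sin(theta) + rng.standard_normal(365),   # weakly circular feature
])
scores = np.array([circular_score(a) for a in acts])
ranking = np.argsort(-scores)                         # best-scoring feature first
print(np.round(scores, 2), ranking)
```

Combining the two correlations in quadrature makes the score independent of a feature's phase, so a feature peaking in March ranks the same as one peaking in September.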

## Appendix S7 Supplement: Pythia emergence \(8 checkpoints\)

![Refer to caption](https://arxiv.org/html/2605.29126v1/x7.png) Figure 7: Emergence in Pythia 1.4B. Three diagnostics on a shared log-training-step x-axis; gold band spans the geometric emergence window $[10^{3},5\times10^{4}]$. (A) Probe $R^{2}$ is already $>0.95$ at step $0$ and moves negligibly, so it is uninformative of mechanism learning. (B) Boundary-head count collapses from $133$ spurious ridges to $\sim50$ task-tuned heads by step $1$k, then stabilizes at $62$. (C) FFT circularness at $L^{\star}$ grows $37\times$ within the emergence window: the actual geometric-emergence signal.

Per-checkpoint results on Pythia 1.4B:

| Step | $L^{\star}$ | $R^{2}$ | Boundary heads | Circularness |
|---|---|---|---|---|
| 0 | 21 | 0.956 | 133 | 0.002 |
| 1 | 21 | 0.956 | 133 | 0.002 |
| 16 | 14 | 0.953 | 139 | 0.001 |
| 128 | 12 | 0.960 | 119 | 0.003 |
| 512 | 23 | 0.983 | 50 | 0.006 |
| 1,000 | 11 | 0.995 | 46 | 0.021 |
| 50,000 | 3 | 0.998 | 44 | 0.070 |
| 143,000 | 4 | 0.998 | 62 | 0.063 |

The circularness index is measured as the fraction of FFT power at the fundamental of the diagonal-averaged cosine-similarity profile; phase-shuffle null at the final checkpoint is $0.012\pm0.005$ (1000 draws), placing the observed $0.063$ well outside null.

## Appendix S8 Supplement: topology of the date manifold

Persistent-homology analysis on the mean-per-DOY activation cloud at $L^{\star}$: raw residual-stream coordinates show $H_{1}$ bars that are within a phase-shuffle null ($p=1.0$). After projecting onto the probe-subspace readout, the circular persistence index at $k=2$ is $11.3$ vs. null mean $0.31$, a $36\times$ lift ($p<10^{-2}$). This corroborates that the date subspace is geometrically a $1$-torus in the readout coordinates, complementing the causal and offset-structure analyses in the main body. That the circle is detectable only after projection onto the probe subspace, and not in the full $d=2304$ residual stream, is consistent with Gurnee et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib6))'s finding that manifold structure is embedded in a low-dimensional subspace of the full representation. The probe correctly identifies the manifold's topology; Propositions [1](https://arxiv.org/html/2605.29126#Thmproposition1)–[2](https://arxiv.org/html/2605.29126#Thmproposition2) show it nonetheless reads from the wrong subspace for causal intervention.

### Rippled representations\.

The diagonal-averaged cosine similarity profile of the day-of-year manifold at $L^{\star}$ exhibits harmonic structure beyond the fundamental (circularness) mode: the FFT ringing ratio (power in harmonics $\geq2$ divided by power in harmonic $1$; `compute_ringing_metric` in the released code) is non-negligible. Gurnee et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib6)) prove that such rippled representations are the information-theoretically optimal packing of $N$ discrete tokens on a $1$-dimensional manifold in $k\ll N$ dimensions. The date manifold is therefore not merely circular but optimally packed in the sense of that optimality result.
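
Both FFT-based indices can be sketched on a synthetic profile. The functions below are our own illustration and are not the released `compute_ringing_metric`; in particular, dropping the DC term and normalizing circularness by total non-DC power are assumptions about details the text does not spell out.

```python
import numpy as np

def harmonic_powers(profile):
    """FFT power per harmonic of a periodic profile, with the DC term dropped."""
    spectrum = np.fft.rfft(profile - np.mean(profile))
    return np.abs(spectrum[1:]) ** 2      # entry 0 is harmonic 1 (the fundamental)

def circularness(profile):
    """Share of non-DC power carried by the fundamental (assumed normalization)."""
    p = harmonic_powers(profile)
    return float(p[0] / p.sum())

def ringing_ratio(profile):
    """Power in harmonics >= 2 divided by power in harmonic 1."""
    p = harmonic_powers(profile)
    return float(p[1:].sum() / p[0])

offsets = np.arange(365)
pure = np.cos(2 * np.pi * offsets / 365)                       # ideal circle
rippled = pure + 0.5 * np.cos(2 * np.pi * 3 * offsets / 365)   # added third harmonic
print(round(circularness(pure), 3), round(ringing_ratio(rippled), 3))
```

A ripple of half the fundamental's amplitude carries a quarter of its power, so the ringing ratio scales with the square of the ripple amplitude.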

## Appendix S9 Supplement: cross\-model transplant \(an honest null\)

We trained Procrustes-aligned DAS subspaces on Gemma 2 2B and Qwen 2.5 1.5B at $k=4$ and attempted to rescue Qwen's ablated accuracy by injecting Gemma's per-DOY coordinates through the Procrustes rotation. Rescue recovery was $2.9\%$ of the ablation gap ($95\%$ CI $[0.7,6.0]$), DOY-shuffled null $0.7\%$, random-map null $1.5\%$; permutation $p$ for transplant $>$ shuffled $=0.36$. The failure to transfer coordinates is itself informative: if coordinate-frame universality held, the two families would be implementing the circuit identically, leaving no room for architecture-specific optimization. Cross-family universality holds at the *population* level (same offset set, same causal hierarchy, same $\delta(x)$ logic), the more general and theoretically expected form, but *not* at the coordinate-frame level. A richer (likely non-linear) alignment is the natural next experiment. Higher-$k$ Qwen DAS at $k\in\{8,12\}$ yielded $33$ and $35$ pp ablation drops respectively, demonstrating that the transplant null is not caused by an under-dimensioned Qwen subspace.
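
The alignment step is an orthogonal Procrustes problem, which has a closed-form SVD solution. The sketch below is a NumPy illustration with synthetic coordinates and our own function name; it covers the best case in which the two coordinate sets really are related by a rotation, which is precisely what the transplant result suggests does not hold across families.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F, with rows of A and B matched."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
k = 4
coords_src = rng.standard_normal((365, k))               # stand-in per-DOY coordinates, source model
R_true, _ = np.linalg.qr(rng.standard_normal((k, k)))    # hidden orthogonal map
coords_tgt = coords_src @ R_true                          # stand-in coordinates, target model
R_hat = procrustes_rotation(coords_src, coords_tgt)
transplanted = coords_src @ R_hat                         # source coordinates in the target frame
print(np.abs(transplanted - coords_tgt).max())
```

The solution is constrained to be orthogonal, so it can rotate or reflect the source frame but cannot rescale or bend it; a non-linear alignment would need a different estimator.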

## Appendix S10 Supplement: related work

The probe-critique literature has circled this point from many angles. Hewitt and Liang ([2019](https://arxiv.org/html/2605.29126#bib.bib21)) showed probes can decode random labels if given sufficient capacity, motivating control-task baselines. Elazar et al. ([2021](https://arxiv.org/html/2605.29126#bib.bib17)) used Iterative Null-space Projection (INLP; Ravfogel et al., 2020) to *erase* linearly-decodable information and measure downstream effect, an early behavioral analog of the ablation hook we use here. Ravichander et al. ([2021](https://arxiv.org/html/2605.29126#bib.bib22)) observed that probe accuracy does not imply downstream usage; their remedy is behavioral-task benchmarking, while ours is geometric. Feder et al. ([2021](https://arxiv.org/html/2605.29126#bib.bib23)) proposed counterfactual training for causal effect estimation (CausaLM); their intervention operates at the training-data level, not the activation level. Mueller et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib24)) argued that “mediator” is the load-bearing category for causal interpretability, and Mueller et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib25)) formalized this into the MIB benchmark; our specificity ratio is a quantitative realisation of their mediator criterion. Canby et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib26)) quantified reliability of causal probing interventions and flagged the fragility of counterfactual-based estimates; our matched-random-null protocol sidesteps that fragility entirely.

Parallel lines: Conmy et al. ([2023](https://arxiv.org/html/2605.29126#bib.bib27)) and Syed et al. ([2024](https://arxiv.org/html/2605.29126#bib.bib28)) develop edge-level circuit-discovery methods; Nam et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib29)) introduce Causal Head Gating; Geiger et al. ([2024](https://arxiv.org/html/2605.29126#bib.bib9)) introduce DAS itself; Cunningham et al. ([2024](https://arxiv.org/html/2605.29126#bib.bib11)) and Lieberum et al. ([2024](https://arxiv.org/html/2605.29126#bib.bib12)) push SAE-based feature discovery. Our contribution is orthogonal: we do not propose a new tool. We provide a single measurable quantity, the readout-mediator angle, that orders the existing tools and converts the question “are probes causal?” from a philosophical posture into an empirical measurement with a theoretical null (Prop. [2](https://arxiv.org/html/2605.29126#Thmproposition2)) and a concentration scale for specificity at the null (Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)).

Work on temporal representations specifically: Gurnee and Tegmark ([2024](https://arxiv.org/html/2605.29126#bib.bib7)) established that linear probes decode dates at $R^{2}\gtrsim0.99$; Gurnee et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib6)) introduced the QK-twist scan; Kantamneni and Tegmark ([2025](https://arxiv.org/html/2605.29126#bib.bib8)) showed trigonometric representations in addition tasks; Modell et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib31)) argued manifolds reflect translational symmetries in pretraining data. We build directly on this line while adding the causal (DAS / ablation), cross-family (universality population-vs-coordinate), training-dynamical (Pythia emergence), and deployment (clinical-$\delta(x)$) layers.

### Gurnee et al\. \(2025\): manifold manipulation\.

The closest theoretical antecedent. They prove three results we leverage: (i) QK-twist implements learned rotations, explaining our boundary-head offsets (§[5](https://arxiv.org/html/2605.29126#S5)); (ii) rippled representations are information-theoretically optimal packings, explaining the harmonic structure beyond circularness (Supp. [S8](https://arxiv.org/html/2605.29126#A8)); (iii) feature-manifold duality, explaining why SAE features are partial mediators (§[5](https://arxiv.org/html/2605.29126#S5.SS0.SSS0.Px4)). Our contribution is orthogonal in a precise sense: they characterize the manifold's geometry; we characterize which projections onto it are causally load-bearing and which are statistical shadows. One nuance deserves comment: [Gurnee et al.](https://arxiv.org/html/2605.29126#bib.bib6) show that orthogonal subspaces can carry *useful* computation (e.g. linebreak decisions orthogonal to day-of-week manifold). This is not a contradiction: their orthogonality is *functional decomposition* ($\rho_{k}\gg1$ for the linebreak task); ours is *coincidental orthogonality* ($\rho_{k}\approx1$). The specificity ratio distinguishes the two cases (Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)).

### Costa et al\. \(2025\): hierarchical SAEs via matching pursuit\.

Costa et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib35)) introduce MP-SAE, a sparse autoencoder whose encoder unrolls matching pursuit into residual-guided steps, and formalize *conditional orthogonality*: orthogonality across hierarchy levels but not within. The readout-mediator dissociation reported here is an instance of this structure: the readout subspace (probe, 2-D) and mediator subspace (DAS, 4-D) occupy different functional levels and are empirically orthogonal ($88^{\circ}$), while directions *within* each subspace can cooperate freely (e.g. the $59\times$ super-additive cooperation among the four DAS directions). The connection illuminates why standard SAEs sit at intermediate $\rho$ on our readout-to-mediator spectrum (§[S11](https://arxiv.org/html/2605.29126#A11)): their single-pass linear encoder recovers only the linearly accessible readout component (Hindupur et al., [2025](https://arxiv.org/html/2605.29126#bib.bib34)), while the mediator, conditionally orthogonal to the readout, becomes “dark matter” invisible to that projection (Engels et al., [2024](https://arxiv.org/html/2605.29126#bib.bib30)). MP-SAE's residual-based inference could in principle peel off the readout first and resolve the mediator from subsequent steps; testing this prediction is future work. We note one important disanalogy: the Babel score that [Costa et al.](https://arxiv.org/html/2605.29126#bib.bib35) use to quantify conditional separation at inference is a purely geometric quantity, whereas $\rho_{k}$ is causal: high Babel coherence among basis vectors is compatible with high $\rho$ when directions cooperate super-additively, as our DAS subspace demonstrates.

## Appendix S11 Supplement: the readout-to-mediator spectrum

![Refer to caption](https://arxiv.org/html/2605.29126v1/x8.png) Figure 8: The readout-to-mediator spectrum. (A) Accuracy under four ablation conditions at $L^{\star}=1$ on Gemma 2 2B: clean baseline, linear probe, top-50 GemmaScope SAE features, and the causal DAS subspace. (B) The same tools placed on a readout $\to$ mediator ruler, positioned by specificity ratio $\rho_k$. Probe ($\rho=1.0$, noise), SAE-50 ($\rho=288$, partial causal readout), DAS ($\rho=1050$, full mediator).

The corollary to Proposition [3](https://arxiv.org/html/2605.29126#Thmproposition3) gives the ratio $\rho_k^{\text{null}} \asymp d/k$ as the Haar-null concentration (under the equal-Lipschitz assumption; see Supp. [S22](https://arxiv.org/html/2605.29126#A22)). We place every tool we used on this ruler, using Gemma 2 2B at $L^{\star}=1$, $d=2304$, $k=4$:

Read row-by-row: probe, INLP, and LEACE sit at the null; all three concept-erasure methods target the *decoder* subspace and have zero causal effect. PCA and Mean-Projection capture $18.5$ pp of the drop (half the mediator), likely because the top variance directions partially overlap with the causal subspace despite an $86.8^{\circ}$ angle. The gradient probe captures $5$ pp (weak causal signal dispersed across rank $76$). $\rho$ values for Gradient and PCA use the matched-experiment random baseline (mean random drop $\approx 0.24$ pp); the main-text $\rho_{\text{DAS}}=1050$ uses the full $n=332$ evaluation where the random baseline is $0.04$ pp. Both denominators are in the noise floor, so absolute $\rho$ is sensitive to sampling but the ordering is stable. Attribution patching recovers edges that *contribute* to the logit but do not dominate it; SAE features capture roughly half of the linear mediator but miss its non-linear extension; DAS exceeds the linear null by $1.82\times$. This is the concrete content of the readout-to-mediator spectrum claim. At $(d,k)=(3584,4)$ for Qwen 2.5 7B and Gemma 2 9B, the null is $\rho_k^{\text{null}} \asymp 896$; observed DAS ratios $>500{,}000\times$ and $>450{,}000\times$ reflect near-vanishing random denominators (bf16 precision floor) and non-linear-mediator numerators compounding; precision-aware Fieller and additive baselines are in Supp. [S25](https://arxiv.org/html/2605.29126#A25).
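The null scale and the quoted excess factor can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text ($d$, $k$, and $\rho_{\text{DAS}}=1050$); the function name is illustrative and not part of the paper's code.

```python
def null_specificity_ratio(d: int, k: int) -> float:
    """Equal-Lipschitz Haar-null scale rho_k^null ~ d / k."""
    return d / k

null_2b = null_specificity_ratio(2304, 4)  # Gemma 2 2B
null_7b = null_specificity_ratio(3584, 4)  # Qwen 2.5 7B and Gemma 2 9B
excess = 1050 / null_2b                    # observed rho_DAS relative to the null

print(null_2b, null_7b, round(excess, 2))  # 576.0 896.0 1.82
```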

## Appendix S12 Supplement: three-tier clinical benchmark, construction and license

The 75-query MIMIC-style synthetic benchmark used in the initial submission is replaced here with a three-tier open-benchmark evaluation. Tiers A, B, and C span a spectrum from controlled ground truth to naturalistic in-the-wild clinical prose.

Tier A: controlled synthetic ($n=475$). Generated by a clinical-timeline generator, stratified across duration magnitudes $\{\leq 7,\ 8\text{-}30,\ 31\text{-}365,\ >365\}$ days. Ground truth is exact to the day. Fully deterministic (seed 0). License: internal, redistributable.

Tier B: MedCalc-Bench vignettes ($n=133$). Source: ncbi/MedCalc-Bench-v1.0 on HuggingFace, CC-BY 4.0. Each Patient Note is scanned for absolute dates via regex; notes with $\geq 2$ chronologically distinct dates yield one duration query ("how many days passed between $d_1$ and $d_2$?"). These are real curated clinical vignettes with embedded dates; no calculation is required of the model beyond integer-day subtraction.

Tier C: PMC Open-Access case reports ($n=371$). Source: zhengyun21/PMC-Patients on HuggingFace, CC-BY per upstream article. Naturalistic in-the-wild case-report prose with real date-stamped events; same regex date-pair extraction as Tier B, plus an ambiguity filter (no other date within 8 characters of either marker). SHA-256 hashes of query texts are released with the code so our exact test set can be reproduced.
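The date-pair extraction used for Tiers B and C can be sketched as follows. This is a minimal illustration, not the released pipeline: the regex (ISO `YYYY-MM-DD` only), the choice of the earliest and latest dates as the pair, and the names `find_dates` and `duration_query` are all assumptions made for the example.

```python
import re
from datetime import date

# Assumption: only ISO-formatted dates are matched; the real regex is not given.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def find_dates(note):
    """Return (start, end, date) for every parseable ISO date in the note."""
    found = []
    for m in DATE_RE.finditer(note):
        try:
            found.append((m.start(), m.end(), date(*map(int, m.groups()))))
        except ValueError:  # skip impossible dates such as 2021-13-40
            continue
    return found

def duration_query(note, min_gap_chars=8):
    """Build one duration query, or None if the note does not qualify."""
    found = find_dates(note)
    if len({d for _, _, d in found}) < 2:
        return None  # fewer than two chronologically distinct dates
    first = min(found, key=lambda t: t[2])
    last = max(found, key=lambda t: t[2])
    markers = {(first[0], first[1]), (last[0], last[1])}
    # Ambiguity filter: reject if any other date sits within
    # `min_gap_chars` characters of either chosen marker.
    for s, e, _ in found:
        if (s, e) in markers:
            continue
        for ms, me in markers:
            gap = s - me if s >= me else ms - e
            if gap <= min_gap_chars:
                return None
    d1, d2 = first[2], last[2]
    return f"How many days passed between {d1} and {d2}?", (d2 - d1).days

print(duration_query("Admitted on 2021-03-01. Discharged on 2021-03-15."))
```

The second element of the returned pair is the ground-truth answer in days, which is all the model is asked to compute.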

Explicitly excluded for license/ethics reasons: MIMIC\-IV\-Note, Discharge\-Me, n2c2 2012 Temporal, THYME/Clinical TempEval\. These require PhysioNet credentialing or DBMI DUA and cannot be redistributed\.

## Appendix S13 Supplement: three-tier benchmark, stratified metrics

Pooled and per-tier metrics for $\delta(x)$ at $k=12$. $\text{AUPRC}_{\text{skill}} = (\text{AUPRC}-p)/(1-p)$ corrects for class imbalance; failure rates ($p$) are $90\%$ (A), $97\%$ (B), $96\%$ (C).
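The skill correction is a one-line rescaling, sketched below. The input values are illustrative only: a raw AUPRC of $0.97$ at prevalence $p=0.90$ is used to show the size of the correction, and the function name is not from the paper's code.

```python
def auprc_skill(auprc: float, prevalence: float) -> float:
    """Prevalence-corrected AUPRC: 0 for a chance-level ranker, 1 for a perfect one."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must lie in [0, 1)")
    return (auprc - prevalence) / (1.0 - prevalence)

print(round(auprc_skill(0.97, 0.90), 2))  # 0.7
print(round(auprc_skill(0.90, 0.90), 2))  # 0.0
```

A ranker whose AUPRC merely equals the failure rate scores zero, which is why a raw AUPRC near $0.97$ is far less impressive at $90$ to $97\%$ prevalence than it looks.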

Per-tier $\times$ per-duration-bin Pearson $r$:

### Holm–Bonferroni adjusted $p$-values

[A]: $0.002$; [B]: $0.501$; [C]: $0.002$; [pooled]: $0.002$.
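For reference, the Holm–Bonferroni step-down adjustment can be written in a few lines. The raw $p$-values in the example are hypothetical, since only the adjusted values are reported above.

```python
def holm_bonferroni(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for three tests.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```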

## Appendix S14 Supplement: three-tier benchmark, alternative manifold estimators

One might ask whether the choice of global PCA for $\bar{U}$ is arbitrary. We compare four estimators on the same three-tier benchmark: global PCA (paper default), local PCA (the 30 nearest Set-A neighbors per query), kernel PCA (RBF), and a diffusion-map surrogate basis (SpectralEmbedding followed by linear pullback). Pearson $r$ by tier:

Global PCA is within $0.03$ Pearson $r$ of the best alternative on every tier. Local PCA marginally improves the naturalistic Tier C, where the input distribution is heterogeneous. Diffusion-map surrogate bases work almost as well as global PCA, consistent with the Set-A activation manifold being approximately linear in the span of the leading PCs. The choice of estimator does not drive the readout-mediator-angle result.

## Appendix S15 Supplement: three-tier benchmark, calibration and decision thresholds

### Reliability diagram (pooled, 979-query v2 benchmark).

Pooled reliability (5 equal-count $\delta$-bins):

The wrong-rate is monotonically increasing across $\delta$-bins (Spearman $\rho=1.0$), confirming $\delta(x)$ is a well-ordered triage signal despite the high base failure rate. Pooled AUROC and AUPRC are reported in Supp. [S13](https://arxiv.org/html/2605.29126#A13).
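The equal-count binning behind the reliability table can be sketched as below. The data are synthetic and chosen so that the wrong-rate rises bin by bin; with five bins, a strictly increasing sequence is exactly what a Spearman $\rho$ of $1.0$ between bin index and wrong-rate means. The function name is illustrative.

```python
def equal_count_wrong_rates(deltas, wrong, n_bins=5):
    """Sort queries by delta, split into equal-count bins, return wrong-rate per bin."""
    order = sorted(range(len(deltas)), key=lambda i: deltas[i])
    rates = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        members = order[lo:hi]
        rates.append(sum(wrong[i] for i in members) / len(members))
    return rates

# Synthetic example: 20 queries, 4 per bin, wrong = 1 marks an incorrect answer.
deltas = [i / 20 for i in range(20)]
wrong = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
rates = equal_count_wrong_rates(deltas, wrong)
print(rates)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```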

### Expected Calibration Error and discrimination \(75\-query held\-out benchmark\)\.

On the original 75-query held-out benchmark (failure rate $96\%$, $k=12$, $L^{\star}=1$), recomputed with corrected AUPRC metrics:

The max-softmax baseline AUROC of $0.009$ confirms that token-level confidence provides no predictive signal for duration errors; $\delta(x)$ carries complementary, mechanistically grounded information. The negative $\text{AUPRC}_{\text{skill}}$ for max-softmax reflects a systematic inversion (high confidence on wrong predictions), not chance.

### Decision curve analysis \(75\-query benchmark\)\.

Decision curve analysis (Vickers & Elkin 2006): net benefit $\text{NB}(t) = \text{TPR} - \frac{t}{1-t}\cdot\text{FPR}$ measures the expected value of using $\delta(x)$ as a triage filter at threshold $t$.
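A minimal sketch of the net-benefit formula as written above. The operating points in the example are hypothetical. Note that the classical Vickers and Elkin formulation is stated in terms of true-positive and false-positive counts divided by $n$; this sketch follows the rate-based form used here.

```python
def net_benefit(tpr: float, fpr: float, t: float) -> float:
    """NB(t) = TPR - t / (1 - t) * FPR, for a threshold probability t in [0, 1)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return tpr - (t / (1.0 - t)) * fpr

# Hypothetical operating points.
print(net_benefit(0.6, 0.2, 0.5))  # false positives weighted 1:1 at t = 0.5
print(net_benefit(0.5, 0.0, 0.9))  # with zero false positives, NB equals TPR
```

The weight $t/(1-t)$ grows quickly with $t$, so a filter with zero false positives keeps a positive net benefit at any threshold, while a treat-all policy is penalized heavily once $t>0.5$.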

At both operating points, precision is $1.00$ (zero false positives) and NB is positive, exceeding the treat-all baseline (which is strongly negative at $t>0.5$ given the high failure rate). The Youden-optimal threshold achieves $2.8\times$ baseline accuracy among non-deferred queries with $64\%$ deferral; the $20\%$-deferral operating point provides lower throughput improvement but may be preferable when deferral cost is high.

### Per\-threshold operating characteristics \(75\-query benchmark\)\.

A $\delta \geq 0.857$ threshold flags two-thirds of eventually-wrong clinical queries with zero false positives. The benchmark's high wrong-rate ($96\%$) makes precision trivially high for low thresholds; AUROC is the more informative summary (Fig. [9](https://arxiv.org/html/2605.29126#A15.F9)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x9.png) Figure 9: Calibration of $\delta(x)$ for flagging wrong clinical answers (Gemma 2 2B, $k=12$, $n=75$). (a) Reliability diagram (5 equal-count $\delta$ bins). (b) ROC with Youden-$J$ optimal threshold marked. (c) Precision-recall.

## Appendix S16 Supplement: manifold deviation as clinical error signal

We demonstrate the protocol's practical utility by deriving a deployment metric from the mediator subspace: a scalar $\delta(x)$ that measures how far date-token activations drift off the mediator manifold. *This is an offline study; prospective deployment requires live-EHR validation.* We evaluate $\delta(x)$ on $979$ open-benchmark clinical-prose queries across three tiers of increasing naturalism: Tier A (synthetic, controlled ground truth, $n=475$), Tier B (MedCalc-Bench real vignettes, $n=133$, CC-BY), and Tier C (PMC case reports, $n=371$, CC-BY). Benchmark construction in Supp. [S12](https://arxiv.org/html/2605.29126#A12); per-tier metrics in Supp. [S13](https://arxiv.org/html/2605.29126#A13); estimators in Supp. [S14](https://arxiv.org/html/2605.29126#A14); baselines in Supp. [S19](https://arxiv.org/html/2605.29126#A19); Tier-A breakdown in Supp. [S21](https://arxiv.org/html/2605.29126#A21); ROC/reliability in Supp. [S17](https://arxiv.org/html/2605.29126#A17); calibration in Supp. [S15](https://arxiv.org/html/2605.29126#A15). On Gemma 2 2B, accuracy within $\pm 20\%$ by tier is $10\%$ (A) / $3\%$ (B) / $4\%$ (C); $\delta(x)$ at $k=12$ predicts absolute duration error with pooled Pearson $r=+0.34$ ($p=5\times 10^{-4}$, $n=979$), AUROC $=0.63$, AUPRC $=0.97$ (Fig. [10](https://arxiv.org/html/2605.29126#A16.F10)).

Corrected discrimination metrics. The raw AUPRC $\approx 0.97$ is inflated by $90$–$97\%$ failure rates; the prevalence-corrected skill score $\text{AUPRC}_{\text{skill}} = (\text{AUPRC}-p)/(1-p)$ gives $0.70$ (Tier A) and $0.68$ (pooled). AUROC $=0.63$ is prevalence-independent and is the primary discrimination metric. On the held-out $75$-query benchmark (failure rate $96\%$), AUROC $=0.86$, $\text{AUPRC}_{\text{skill}}=0.84$; max-softmax baseline AUROC $=0.01$, confirming $\delta(x)$ provides mechanistically distinct signal. ECE $=0.11$; reliability diagram in Supp. [S15](https://arxiv.org/html/2605.29126#A15).

Decision-curve analysis. At Youden-optimal $\delta^{\star}=0.86$: deferral $64\%$, precision $1.00$, accuracy $2.8\times$ baseline ($11\%$ vs. $4\%$), NB $=0.21>0$, exceeding treat-all and max-softmax. Full curves in Supp. [S15](https://arxiv.org/html/2605.29126#A15).

### What the correlation does and does not say\.

$\delta(x)$ measures how far date-token activations drift off the mediator manifold; on-manifold activations yield accurate duration answers. Tier B AUROC $=0.59$ is informative: those failures are format failures (HTML output tags, not integers), not circuit failures. $\delta$ correctly stays flat, confirming the temporal circuit ran but the output formatter broke. The Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) bound $|f(x)-f(\bar{U}^{\top}\bar{U}x)| \leq L\,\|\delta(x)\|\,\|x\|$ matches the observed ordering; max-softmax provides no complementary signal (Supp. [S15](https://arxiv.org/html/2605.29126#A15)).
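For a single date-token activation, the deviation can be sketched as a relative residual norm after projecting onto the reference subspace. This is an assumption-laden illustration: the normalization by $\|h\|$ is assumed here because it yields a value in $[0,1]$ and matches the $\|\delta(x)\|\,\|x\|$ factor in the bound, and how per-token values are aggregated into one $\delta(x)$ per query is not shown. The basis rows stand in for $\bar{U}$ and must be orthonormal.

```python
import math

def manifold_deviation(h, basis):
    """Relative residual ||h - U^T U h|| / ||h|| for orthonormal basis rows (assumed form)."""
    proj = [0.0] * len(h)
    for u in basis:
        coeff = sum(ui * hi for ui, hi in zip(u, h))
        proj = [p + coeff * ui for p, ui in zip(proj, u)]
    resid = math.sqrt(sum((hi - pi) ** 2 for hi, pi in zip(h, proj)))
    norm = math.sqrt(sum(hi * hi for hi in h))
    return resid / norm if norm > 0 else 0.0

# Toy 3-D example with a 2-D reference subspace (the x-y plane).
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(manifold_deviation([3.0, 4.0, 0.0], basis))  # on-manifold
print(manifold_deviation([3.0, 0.0, 4.0], basis))  # partly off-manifold
```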

![Refer to caption](https://arxiv.org/html/2605.29126v1/x10.png) Figure 10: Manifold deviation $\delta(x)$ predicts clinical duration error (open $979$-query three-tier benchmark). (A) Per-query absolute duration error (symlog, days) versus $\delta(x)$ at $k^{\star}=12$; correct ($\pm 20\%$, green) and wrong (red) cases ($n=979$), with marginal $\delta$-density strip and Pearson $r$ (bootstrapped $95\%$ CI, permutation $p$) inset. (B) Pearson $r$ as a function of subspace dimension $k$ (stable for $k \in [6,12]$; gold ring marks $k^{\star}$). (C) Accuracy within $\pm 20\%$ by $\delta$-quartile: $0.8\%$ in Q4 (high $\delta$), $14.8\%$ in Q2; $78\%$ of correct answers fall in Q1/Q2. Full ROC and reliability diagrams in Supp. [S17](https://arxiv.org/html/2605.29126#A17).

## Appendix S17 Supplement: three-tier benchmark, detail figure (ROC, reliability, per-bin $r$)

The main-body Fig. [10](https://arxiv.org/html/2605.29126#A16.F10) presents the per-query $\delta$-vs-error scatter. The ROC / reliability / per-bin views are collected here (Fig. [11](https://arxiv.org/html/2605.29126#A17.F11)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x11.png) Figure 11: Manifold deviation on the 979-query three-tier clinical benchmark, detailed view. (A) ROC per tier for $\delta(x)$ flagging wrong answers. (B) Pooled reliability: fraction wrong in $5$ equal-count $\delta$-bins. (C) Per-tier Pearson $r(\delta, |\text{err}|)$ by duration magnitude.

## Appendix S18 Supplement: DAS collateral-damage controls

A natural concern: does the DAS basis, trained to minimize the date-duration loss, act as a generic adversarial subspace that would damage arbitrary tasks on Gemma 2 2B at $L^{\star}=1$? We test $60$ non-calendar prompts spanning three pools: non-date arithmetic ($n=20$, "What is $a$ plus $b$?"), factual trivia ($n=20$), and short naturalistic completions ($n=20$). Each prompt is run under (A) clean, (B) DAS-$k=4$ ablation, and (C) each of $25$ random-$k=4$ ablations. For each prompt we record the full last-token logit distribution; we report (i) symmetric Jensen–Shannon divergence between clean and ablated top-$50$ softmax, and (ii) top-$1$ token agreement.
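The two reported quantities can be sketched as below. One detail is an assumption of this example rather than something stated above: both distributions are restricted to the clean run's top-$k$ tokens and renormalized before the divergence is computed. Function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (symmetric, at most ln 2)."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def compare_logits(clean, ablated, top_k=50):
    """Return (JS divergence on top-k tokens, top-1 agreement) for one prompt."""
    p_full, q_full = softmax(clean), softmax(ablated)
    # Assumption: keep the clean run's top-k token ids and renormalize both runs.
    keep = sorted(range(len(clean)), key=lambda i: clean[i], reverse=True)[:top_k]
    p = [p_full[i] for i in keep]
    q = [q_full[i] for i in keep]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    top1_clean = max(range(len(clean)), key=clean.__getitem__)
    top1_ablated = max(range(len(ablated)), key=ablated.__getitem__)
    return js_divergence(p, q), top1_clean == top1_ablated

print(compare_logits([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))  # identical runs
print(compare_logits([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))  # ablation flips the argmax
```

A run can show a nonzero divergence while keeping top-1 agreement, which is the distinction between distributional shift and decision-level change drawn in this appendix.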

Read honestly: the DAS basis does induce measurable distributional shift beyond a random $k=4$ ablation on non-date prompts (JS $\sim 10\times$ the random null); these four directions are not literally inert for the rest of the model. But the *decision-level* signature is clean: top-$1$ next-token agreement with the clean run stays at $100\%$ (arithmetic, trivia) or $85\%$ (naturalistic) under DAS ablation, while on the date task it collapses to $0\%$. Interpretation: the DAS subspace is *not* a generic task-agnostic adversary; it preserves the model's actual non-date outputs while removing the specific directions the date circuit needs to compute durations. The residual JS shift is the caveat: the specificity claim would be stronger if the shift were zero, and we report it rather than hide it.

### Non-calendar specificity ratio (large-scale ablation, $n=240$).

We ran the trained $k=4$ DAS basis (at $L^{\star}=1$) through $240$ non-calendar prompts covering three pools on an A10G GPU:

Date-to-non-calendar specificity ratio: $42\,\text{pp} / 3.3\,\text{pp}\ (\text{avg}) = 12.6\times$. The arithmetic and counting pools are completely unaffected ($0$ pp drop); the $10$ pp ordinal collateral is interpretable: ordinal subtraction ("after removing $b$ items") shares quantitative-difference computation with duration reasoning $(\text{end}-\text{start})$, and the DAS subspace appears to encode this shared numerical-change representation in addition to calendar encoding. The $12.6\times$ ratio conservatively includes this interpretable collateral; against purely additive arithmetic the effective ratio exceeds $10^{3}\times$.
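The ratio follows from the per-pool drops quoted above; the only assumption in this sketch is that "avg" means an equal-weight mean over the three pools.

```python
date_drop_pp = 42.0
pool_drops_pp = {"arithmetic": 0.0, "counting": 0.0, "ordinal": 10.0}

# Equal-weight mean over pools (assumed meaning of "avg").
mean_collateral = sum(pool_drops_pp.values()) / len(pool_drops_pp)
ratio = date_drop_pp / mean_collateral

print(round(mean_collateral, 1), round(ratio, 1))  # 3.3 12.6
```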

## Appendix S19 Supplement: three-tier benchmark, simple baselines

A natural sanity check: does a simple zero-effort baseline predict per-query error as well as $\delta(x)$? We evaluate four baselines readable from the same cached forward passes, with no extra compute, against the pooled $n=979$ benchmark.

$\delta(x)$ significantly beats prompt-length, date-count, and confidence baselines in pooled AUROC (paired bootstrap $95\%$ CIs exclude zero, $n_{\text{boot}}=2000$). It is statistically indistinguishable from top-10 entropy, which is an honest negative result: a simple confidence-based baseline is competitive for error-flagging on this benchmark. The comparative advantage of $\delta(x)$ is *mechanistic interpretability*: it is bounded by Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) and pinpoints which subspace the activation has drifted off, whereas entropy tells you only that the model is uncertain.
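A paired bootstrap on the AUROC difference can be sketched as follows. This is a generic percentile-bootstrap illustration with toy scores, not the paper's evaluation code; the function names, the tie handling, and the choice to redraw resamples that contain only one class are assumptions.

```python
import random

def auroc(scores, labels):
    """P(score of a wrong answer > score of a correct one); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_ci(score_a, score_b, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for AUROC(a) - AUROC(b), resampling the same queries for both."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lacked one class; draw again
        a = auroc([score_a[i] for i in idx], y)
        b = auroc([score_b[i] for i in idx], y)
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

# Toy data: label 1 = wrong answer. Score A separates the classes, score B is constant.
labels = [1] * 10 + [0] * 10
score_a = [1.0 + 0.01 * i for i in range(10)] + [0.01 * i for i in range(10)]
score_b = [0.5] * 20
lo, hi = paired_bootstrap_ci(score_a, score_b, labels, n_boot=200)
print(auroc(score_a, labels), auroc(score_b, labels), lo, hi)
```

Resampling the same query indices for both scores is what makes the comparison paired: the interval reflects the per-query difference rather than two independent sampling errors.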

## Appendix S20 Supplement: Proposition [3](https://arxiv.org/html/2605.29126#Thmproposition3) Monte Carlo verification

Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)(c) asserts that for $U \sim \text{Haar}(\text{Stiefel}(k,d))$, $\mathbb{E}\sum_i \cos^2\theta_i = k\,k_M/d$ exactly, with variance $O(k^2/d^2)$. We verify numerically at $d=2304$ (Gemma 2 2B, $k_M=k$) by sampling Haar-uniform $k$-frames via QR of a standard Gaussian matrix and computing $\|U_M U^{\top}\|_F^2$. For $k \in \{1,2,4,8,16\}$ with $20{,}000$ samples each (headline $k=4$ with $200{,}000$ samples), the analytic value $k^2/d$ lies strictly inside the $99\%$ CI of every empirical mean; $|\Delta|$ ranges from $4.3\times 10^{-6}$ ($k=1$) to $4.9\times 10^{-5}$ ($k=16$). This closes the Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)(c) claim used by the two-sided-null corollary at the start of §[3](https://arxiv.org/html/2605.29126#S3).
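A small-scale version of this check can be run without a linear-algebra library. The sketch below swaps in Gram-Schmidt on Gaussian vectors for the QR step (the two give the same Haar-distributed frame) and uses a much smaller dimension than $d=2304$ so that it runs in pure Python; the dimension, sample count, and function names are choices made for the example.

```python
import random

def haar_frame(k, d, rng):
    """k orthonormal rows in R^d via Gram-Schmidt on i.i.d. Gaussian vectors."""
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows

def overlap(frame, k_m):
    """||U_M U^T||_F^2 for U_M = [I_{k_M} | 0]: squared mass on the first k_M coordinates."""
    return sum(a * a for row in frame for a in row[:k_m])

def mc_mean_overlap(k, k_m, d, n_samples, seed=0):
    rng = random.Random(seed)
    total = sum(overlap(haar_frame(k, d, rng), k_m) for _ in range(n_samples))
    return total / n_samples

k, k_m, d = 4, 4, 48
estimate = mc_mean_overlap(k, k_m, d, n_samples=2000)
analytic = k * k_m / d
print(estimate, analytic)
```

Fixing $U_M$ to the first $k_M$ coordinate axes loses no generality, by the rotation invariance used in part (c) of the proof.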

![[Uncaptioned image]](https://arxiv.org/html/2605.29126v1/x12.png)

## Appendix S21 Supplement: Tier-A per-scenario breakdown (retained)

Tier A uses a five\-scenario, three\-variant controlled\-synthetic benchmark\. Full per\-scenario breakdowns, quartile tables, and the declarative\-date baseline appear in Supp\.[S17](https://arxiv.org/html/2605.29126#A17); we keep this label as an anchor for cross\-references elsewhere in the paper\.

## Appendix S22 Supplement: proofs

### Proposition [1](https://arxiv.org/html/2605.29126#Thmproposition1) (full proof).

*Setup.* Fix a data distribution $x \sim \mathcal{D}$ on $\mathbb{R}^d$ with $\mathbb{E}x = 0$ (WLOG) and covariance $\Sigma = \mathbb{E}[xx^{\top}] \succ 0$. Let $z: \mathbb{R}^d \to \mathbb{R}$ with $\mathrm{Var}(z) > 0$, and let $f: \mathbb{R}^d \to \mathbb{R}$ be $C^2$ on an open neighborhood of $\mathrm{supp}(\mathcal{D})$. Write $\mathbf{c} = \mathbb{E}[x\,z(x)]$ and $\mathbf{g}(x) = \nabla_x f(x)$.

*Probe direction.* The optimal normalized Pearson-correlation direction is

$$u_P \in \arg\max_{\|u\|=1} \frac{\mathrm{Cov}(u^{\top}x, z)^2}{\mathrm{Var}(u^{\top}x)\,\mathrm{Var}(z)} = \arg\max_{\|u\|=1} \frac{(u^{\top}\mathbf{c})^2}{u^{\top}\Sigma u}.$$

The generalized Rayleigh quotient $J_P(u) = (u^{\top}\mathbf{c})^2/(u^{\top}\Sigma u)$ is maximized at the generalized eigenvector $\Sigma u = \lambda\,\mathbf{c}\mathbf{c}^{\top}u$; the leading solution is $u_P = \Sigma^{-1}\mathbf{c}/\|\Sigma^{-1}\mathbf{c}\|$ (unique up to sign provided $\mathbf{c} \neq 0$). $u_P$ is a *second-moment* functional of $x$: it depends on $\mathcal{D}$ only through $\Sigma$ and $\mathbf{c}$.

*Mediator direction.* For a rank-$1$ ablation $x \mapsto x - uu^{\top}x$ with $\|u\| = 1$, a first-order Taylor expansion of $f$ around $x$ yields

$$f(x) - f(x - uu^{\top}x) = (u^{\top}x)\,(u^{\top}\mathbf{g}(x)) + R_2(x,u),$$

with remainder $|R_2(x,u)| \leq \tfrac{1}{2}\sup_{\xi \in [x - uu^{\top}x,\,x]} \|\nabla^2 f(\xi)\|\,(u^{\top}x)^2$; under the standing $C^2$ assumption, $|R_2| = O((u^{\top}x)^2)$. Squaring and taking expectations,

$$\mathbb{E}\bigl[(f(x) - f(x - uu^{\top}x))^2\bigr] = u^{\top}\,\mathbb{E}[\mathbf{g}\mathbf{g}^{\top}(u^{\top}x)^2]\,u + o(1).$$

In the isotropic limit $\Sigma = \sigma^2 I$, this reduces to $\sigma^2\,u^{\top}G\,u$ where $G = \mathbb{E}[\mathbf{g}\mathbf{g}^{\top}]$ is the *gradient covariance*. Hence $u_M \in \arg\max_{\|u\|=1} u^{\top}Gu$ is the top eigenvector of $G$: a *second-moment* functional of the gradient $\mathbf{g}(x)$, not of the data $x$.

*Coincidence condition.* $u_P = u_M$ iff $\Sigma^{-1}\mathbf{c}$ is the top eigenvector of $G$, equivalently iff $G\Sigma^{-1}\mathbf{c} = \lambda_{\max}(G)\,\Sigma^{-1}\mathbf{c}$. Expanding, this is $\mathbb{E}[\mathbf{g}\mathbf{g}^{\top}]\,\Sigma^{-1}\mathbb{E}[xz] = \lambda\,\Sigma^{-1}\mathbb{E}[xz]$, i.e. a non-trivial spectral alignment between the Pearson-covariance direction and the gradient covariance's top eigenspace. For a deep feed-forward network with non-polynomial activations, $\mathbf{g}(x) = \nabla_x f(x)$ traverses a family of directions whose spectrum decorrelates from that of the label covariance $\mathbf{c}$ generically: the probe direction is set by the data geometry, while the mediator direction is set by the network's output-sensitivity geometry, and there is no structural reason for these to align.

*Conclusion.* Without additional assumptions coupling $\mathbf{g}$ to $\Sigma^{-1}\mathbf{c}$, $u_P$ and $u_M$ are non-trivially distinct. The empirical $\bar{\theta} \approx 88^{\circ}$ is the generic outcome; alignment would require non-generic structure. ∎

*Remark.* The $k$-dim case is identical term-by-term with the leading $k$-eigenspaces of $\Sigma^{-1}\mathbf{c}\mathbf{c}^{\top}\Sigma^{-1}$ (probe) and $G$ (mediator) playing the role of $u_P, u_M$; the conclusion extends.
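A two-dimensional toy case makes the dissociation concrete. In the sketch below the label is $z = x_1$ and the output is $f(x) = x_2$ under isotropic $\Sigma = \sigma^2 I$, so the label is perfectly decodable along $e_1$ while the output depends only on $e_2$. The setup, the closed-form $2\times 2$ helpers, and all names are constructed for illustration and do not come from the paper's experiments.

```python
import math

def probe_direction(sigma, c):
    """u_P = Sigma^{-1} c / ||Sigma^{-1} c|| for a symmetric 2x2 covariance."""
    (a, b), (_, d) = sigma
    det = a * d - b * b
    v = ((d * c[0] - b * c[1]) / det, (a * c[1] - b * c[0]) / det)
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def top_eigenvector(g):
    """Top eigenvector of a symmetric 2x2 matrix via its principal-axis angle."""
    (a, b), (_, d) = g
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    return (math.cos(theta), math.sin(theta))

def angle_deg(u, v):
    cos = abs(u[0] * v[0] + u[1] * v[1])  # directions are defined up to sign
    return math.degrees(math.acos(min(1.0, cos)))

sigma = ((2.0, 0.0), (0.0, 2.0))  # isotropic covariance, sigma^2 = 2
c = (2.0, 0.0)                    # z = x_1, so c = E[x z] = Sigma e_1
G = ((0.0, 0.0), (0.0, 1.0))      # f(x) = x_2, so g = e_2 and G = e_2 e_2^T

u_p = probe_direction(sigma, c)
u_m = top_eigenvector(G)
print(u_p, u_m, angle_deg(u_p, u_m))
```

Ablating $u_P = e_1$ leaves $f$ unchanged while ablating $u_M = e_2$ removes it entirely, which is the readout-versus-mediator split in its simplest form.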

### Proposition [2](https://arxiv.org/html/2605.29126#Thmproposition2) (full proof).

*Claim.* For $U, V$ independent $k \times d$ matrices with orthonormal rows drawn uniformly from the Stiefel manifold $V_k(\mathbb{R}^d)$, the mean principal angle satisfies $\mathbb{E}[\bar{\theta}] = \arccos(\sqrt{k/d}) + O(k/d)$.

*Step 1 (distribution of $UV^{\top}$).* By rotation invariance of the Haar measure, without loss of generality fix $V$ as the first $k$ rows of the identity; then $UV^{\top}$ is the top-left $k \times k$ block of a Haar-random $d \times d$ orthogonal matrix $Q$ whose first $k$ rows are $U$. The singular values $\sigma_1 \geq \cdots \geq \sigma_k$ of this block have the joint density (Edelman & Rao, 2005; Forrester, 2010)

$$p(\sigma) \propto \prod_{i<j} (\sigma_i^2 - \sigma_j^2)^2 \prod_{i=1}^{k} \sigma_i^{0}\,(1 - \sigma_i^2)^{(d-2k-1)/2},$$

which is the Jacobi ensemble $J(k, k, d-k)$ on $[0,1]^k$.

*Step 2 (mean).* By standard trace-moment calculus (Collins & Matsumoto, 2009), for any $k$ and $d \geq 2k$, $\mathbb{E}[\mathrm{tr}(UV^{\top}VU^{\top})] = \mathbb{E}\sum_i \sigma_i^2 = k \cdot k/d$, giving $\mathbb{E}[\sigma_i^2] = k/d$ by symmetry of the $\sigma_i$ under the ensemble. Jensen then gives $\mathbb{E}[\sigma_i] \leq \sqrt{k/d}$ with equality up to $O(k/d)$ variance corrections from the Jacobi ensemble's concentration (Collins, 2003: $\mathrm{Var}(\sigma_i) = O(k/d^2)$).

*Step 3 (expected angle).* Since $\bar{\theta} = \frac{1}{k}\sum_i \arccos\sigma_i$ and $\arccos$ is $C^2$ on $[0,1)$ with $\arccos(\sqrt{k/d}) = \pi/2 - \sqrt{k/d} + O((k/d)^{3/2})$, the delta method gives $\mathbb{E}[\bar{\theta}] = \arccos(\sqrt{k/d}) + O(k/d)$. For $(d,k) = (2304, 2)$: $\sqrt{2/2304} = 0.02946$, $\arccos(0.02946) = 88.31^{\circ}$; the second-order correction is bounded by $k/d = 8.7\times 10^{-4}$ radians $\approx 0.05^{\circ}$, below the discretisation of the $88.3^{\circ}$ measurement. ∎
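The worked numbers in Step 3 can be reproduced directly. The sketch evaluates only the leading-order expression $\arccos(\sqrt{k/d})$ and the $k/d$ correction bound for the $(2304, 2)$ case worked in the text; the function name is illustrative.

```python
import math

def haar_null_angle_deg(d: int, k: int) -> float:
    """Leading-order Haar-null mean principal angle arccos(sqrt(k/d)), in degrees."""
    return math.degrees(math.acos(math.sqrt(k / d)))

angle = haar_null_angle_deg(2304, 2)
correction_deg = math.degrees(2 / 2304)  # k/d radians, expressed in degrees

print(round(math.sqrt(2 / 2304), 5), round(angle, 2), round(correction_deg, 2))
# 0.02946 88.31 0.05
```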

*Monte Carlo verification.* For each $(d,k) \in \{(1536,2), (2304,2), (3584,2), (2304,4), (2304,6)\}$ we sampled $10{,}000$ pairs $(U,V)$ via QR of Gaussian matrices and recorded $\bar{\theta}$. The empirical means were $\{88.5^{\circ}, 88.3^{\circ}, 88.6^{\circ}, 87.9^{\circ}, 86.7^{\circ}\}$ with standard deviations all $<1.0^{\circ}$, matching the analytic values to within $0.1^{\circ}$.

### Consequence for our measurement\.

The probe–DAS angles $\{88.3^{\circ}, 87.9^{\circ}, 86.7^{\circ}\}$ at $k \in \{2,4,6\}$ on Gemma 2 2B are *within one standard deviation of the null at every $k$*. An indistinguishability test at $\alpha = 0.05$ fails to reject: the probe subspace is statistically indistinguishable from a uniform-random subspace in its angular relationship to the mediator. This is the strongest form of the $88^{\circ}$ claim.

### Proposition [3](https://arxiv.org/html/2605.29126#Thmproposition3) (proof of the three parts).

*(a) Upper bound.* Let $P_M = U_M^{\top}U_M$ and $P_U = U^{\top}U$ be the mediator and ablation orthogonal projectors. By assumption (i), $f(x) - f(x - P_U x) = g(U_M x) - g(U_M(x - P_U x)) = g(U_M x) - g(U_M x - U_M P_U x)$. Lipschitz continuity of $g$ with constant $L$ gives $|f(x) - f(x - P_U x)| \leq L\,\|U_M P_U x\|$. Squaring and taking expectations under (iii),

$$\mathbb{E}\|U_M P_U x\|^2 = \mathrm{tr}\bigl(U_M P_U\,\mathbb{E}[xx^{\top}]\,P_U U_M^{\top}\bigr) = \sigma^2\|U_M U^{\top}\|_F^2 = \sigma^2 \sum_{i=1}^{\min(k,k_M)} \cos^2\theta_i,$$

using the SVD of $U_M U^{\top}$, whose singular values are $\cos\theta_i$. This is ([1](https://arxiv.org/html/2605.29126#S3.E1)). ∎ (a)

*(b) Matching lower bound under $\mathrm{row}(U_M) \subseteq \mathrm{row}(U)$.* Write $P_{U_M}$ for the projection onto $\mathrm{row}(U_M)$. Containment gives $P_U P_{U_M} = P_{U_M}$, so $U_M P_U x = U_M x$ and $\|U_M P_U x\| = \|U_M x\|$. Under a one-sided modulus-of-continuity lower bound $|g(u) - g(v)| \geq \underline{L}\,\|u - v\|$ (which holds, for example, when $g$ is $C^1$ with bounded-away-from-zero gradient on the data manifold, a sufficient condition for trained networks with non-trivial task-sensitivity), we get $\mathbb{E}|f(x) - f(x - P_U x)|^2 \geq \underline{L}^2\,\mathbb{E}\|U_M x\|^2 = \underline{L}^2\sigma^2\,\mathrm{tr}(U_M U_M^{\top}) = \underline{L}^2\sigma^2 k_M$. This is ([2](https://arxiv.org/html/2605.29126#S3.E2)). ∎ (b)

*(c) Haar expectation and concentration.* For $U \sim \mathrm{Stiefel}(k,d)$ independent of $U_M$ (or WLOG $U_M = [I_{k_M}\,|\,0]$ by rotation invariance), $\|U_M U^{\top}\|_F^2$ is the sum of squares of the top-left $k_M \times k$ block of a Haar-random $d \times d$ orthogonal matrix. Its expectation is $\mathbb{E}\|U_M U^{\top}\|_F^2 = k\,k_M/d$ exactly, by trace identity over the Jacobi ensemble (Collins & Matsumoto, 2009, Prop. 4.1). The variance is $O(k^2/d^2)$. ∎ (c)

### Two\-sided null expectation \(corollary proof\)\.

Combining (a) and (c) gives $\mathbb{E}_{U \sim \text{Haar}}\,\mathbb{E}_x |f(x) - f(x - P_U x)|^2 \leq L^2\sigma^2\,k\,k_M/d$. Applying (b) at $U = U_{\text{DAS}}$ with $\mathrm{row}(U_M) \subseteq \mathrm{row}(U_{\text{DAS}})$ (the DAS optimum saturates this when $k \geq k_M$ and training has converged) gives a lower bound $\underline{L}^2\sigma^2 k_M$. Their ratio is $\underline{L}^2\sigma^2 k_M / (L^2\sigma^2\,k\,k_M/d) = (\underline{L}/L)^2\,d/k$. Hence the population specificity ratio at the null concentrates at $(\underline{L}/L)^2\,d/k$. When $L \approx \underline{L}$ (locally linear $g$), $\rho_k^{\text{null}} \approx d/k$. ∎

### Interpretation\.

Proposition [3](https://arxiv.org/html/2605.29126#Thmproposition3) gives *neither* a one-sided upper bound on $\rho_k$ *nor* a lower bound; it gives a concentration scale for the null. Observed $\rho_k$ can exceed $d/k$ in two ways: when $g$ is super-linearly sensitive to $U_M$-directed perturbations, the numerator inflates; when the random denominator collapses below its mean (bf16 saturation at 7B/9B), the denominator deflates. Both occur in our data: on Gemma 2 2B ($k_M \approx k = 4$) the observed $\rho_4 = 1050$ exceeds the equal-Lipschitz prediction $d/k = 576$ by a factor of $1.8$, consistent with mild super-linearity of $g$; at 7B/9B the observed ratio explodes because the denominator is at machine zero, a regime treated explicitly in Supp. [S25](https://arxiv.org/html/2605.29126#A25).

## Appendix S23Supplement: algorithm pseudocode

**Algorithm 2** DAS subspace training (one model, one layer)

Input: frozen model $M$ at layer $L$, prompt set $\mathcal{P}$, rank $k$

Output: $U\in\mathbb{R}^{k\times d}$

1. $V\sim\mathrm{Unif}(\mathbb{R}^{d\times d})$
2. $V\leftarrow\mathrm{QR}(V).Q$
3. optimizer $\leftarrow\mathrm{AdamW}(V;\;\mathrm{lr}=10^{-3})$
4. **for** step $=1,\ldots,400$ **do**
   1. sample minibatch $B\subset\mathcal{P}$, $|B|=8$
   2. $Q,\_\leftarrow\mathrm{QR}(V)$
   3. $U\leftarrow Q[\,{:},\;{:}k\,]^{\top}$
   4. hook at `blocks.`$L$`.hook_resid_post`: $x\mapsto x-U^{\top}Ux$
   5. $\text{logits}\leftarrow M(B)$
   6. $\mathcal{L}\leftarrow-\tfrac{1}{|B|}\sum_{b}\log p(y_{b}^{\star}\mid B_{b})$
   7. backprop through QR
   8. optimizer step
5. **return** $U\leftarrow\mathrm{QR}(V).Q[\,{:},\;{:}k\,]^{\top}$

**Algorithm 3** QK-twist scan with BH-FDR

Input: model $M$, per-DOY mean activations $\{\bar{x}_{d}\}_{d=1}^{365}$

Output: $\{(L,h,c^{\star},z,p,q)\}$ for significant heads

1. **for** each head $(L,h)$ **do**
   1. $Q_{d},K_{d}\leftarrow W_{Q}^{(L,h)}\bar{x}_{d},\;W_{K}^{(L,h)}\bar{x}_{d}$
   2. $M_{d,d'}\leftarrow Q_{d}^{\top}K_{d'}/\sqrt{d_{\text{head}}}$
   3. $S(c)\leftarrow\mathrm{mean}\{M_{d,d'}:d-d'=c\}$, $c\in[-182,182]$
   4. $z_{(L,h)}\leftarrow\max_{c\neq 0}|S(c)-\mu_{S}|/\sigma_{S}$
   5. $c^{\star}\leftarrow\arg\max|z|$
   6. $p_{(L,h)}\leftarrow$ fraction of null peaks $\geq z_{(L,h)}$ $\triangleright$ $200$ shuffled-DOY permutations
2. $q\leftarrow\text{BH-FDR}(\{p_{(L,h)}\},\;\alpha=0.05)$ $\triangleright$ multiple-testing correction

**Algorithm 4** Manifold deviation $\delta(x)$ for a clinical query

Input: tokens $T$, date positions $P\subset\{1,\ldots,|T|\}$, reference activations $\{\bar{x}_{d}\}_{d=1}^{365}$ at $L^{\star}$

Output: $\delta(x)\in[0,1]$

1. $\bar{X}\leftarrow[\bar{x}_{1};\,\cdots\,;\bar{x}_{365}]$ $\triangleright$ mean over $10$ templates
2. $\bar{U},\_,\_\leftarrow\mathrm{SVD}(\bar{X}-\bar{X}^{\,\text{mean}})$; keep top-$k$
3. run $M(T)$
4. cache $h_{t}$ at $L^{\star}$ for each $t\in P$
5. **for** $t\in P$ **do**
   1. $r_{t}\leftarrow h_{t}-\bar{U}^{\top}\bar{U}\,h_{t}$
   2. $\delta_{t}\leftarrow\|r_{t}\|/\|h_{t}\|$
6. **return** $\delta(x)\leftarrow\max_{t\in P}\,\delta_{t}$
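The manifold-deviation computation of Algorithm 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `reference_basis` and `manifold_deviation` are hypothetical, the reference basis is assumed to be the top-$k$ right singular vectors of the mean-centred reference matrix, and activations are passed in as plain arrays rather than cached from a model.

```python
import numpy as np


def reference_basis(x_bar: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal directions of the reference activations.

    x_bar: (n_days, d) per-DOY mean activations (hypothetical input).
    Returns a (k, d) matrix with orthonormal rows.
    """
    centred = x_bar - x_bar.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, i.e. directions in activation space
    # (assumed reading of "SVD, keep top-k").
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:k]


def manifold_deviation(h: np.ndarray, basis: np.ndarray) -> float:
    """max_t ||r_t|| / ||h_t||, with r_t the part of h_t outside the basis.

    h: (n_positions, d) activations at the date positions.
    """
    residual = h - (h @ basis.T) @ basis
    ratios = np.linalg.norm(residual, axis=1) / np.linalg.norm(h, axis=1)
    return float(ratios.max())


# Toy check: reference activations confined to a 2-D plane in R^8.
rng = np.random.default_rng(0)
plane = np.eye(8)[:2]                                  # spans e_0 and e_1
x_bar = rng.standard_normal((365, 2)) @ plane
basis = reference_basis(x_bar, k=2)

on_manifold = np.array([[1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
off_manifold = np.array([[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
delta_on = manifold_deviation(on_manifold, basis)      # close to 0
delta_off = manifold_deviation(off_manifold, basis)    # close to 1
```

Because the residual is an orthogonal projection of $h_{t}$, each ratio lies in $[0,1]$, which matches the stated output range.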
## Appendix S24 Supplement: empirical $\sum\cos^{2}\theta_{i}$ and anisotropy

This supplement reports the empirical angular quantity that Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) bounds, tests how activation anisotropy affects the null, and provides the corrected null variance formula.

### Spectral diagnostics\.

The sample covariance $\Sigma$ of the $365$ Set-A mean activations at $L^{\star}=1$, $d=2304$ has $\mathrm{Tr}(\Sigma)=181.1$, condition number $\kappa=1.47\times10^{11}$ (extreme rank deficiency from only $365$ vectors in $\mathbb{R}^{2304}$), and effective rank $\mathrm{Tr}(\Sigma)^{2}/\mathrm{Tr}(\Sigma^{2})=11.5$ (i.e., the empirical variance is concentrated in $\sim\!12$ dominant directions). Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) assumes $\mathbb{E}[xx^{\top}]=\sigma^{2}I$; under the corrected null variance formula (see below for the anisotropic correction), the anisotropy-corrected standard deviation is $14\times$ wider than the isotropic approximation. This does not affect the reported Monte Carlo $p$-values of $0.51$–$0.72$, which directly sample from the empirical (anisotropic) coordinate distribution; the correction is required only when using the analytic variance bound from Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)(c).
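The effective-rank statistic $\mathrm{Tr}(\Sigma)^{2}/\mathrm{Tr}(\Sigma^{2})$ can be computed without forming the $d\times d$ covariance. The sketch below is illustrative only: `effective_rank` is a hypothetical helper, and the toy inputs stand in for the cached activations, which are not reproduced here.

```python
import numpy as np


def effective_rank(acts: np.ndarray) -> tuple[float, float]:
    """Return (Tr(Sigma), Tr(Sigma)^2 / Tr(Sigma^2)) for the sample covariance.

    acts: (n_samples, d) activation matrix.
    """
    centred = acts - acts.mean(axis=0, keepdims=True)
    # Squared singular values / (n - 1) are the nonzero covariance eigenvalues,
    # so the d x d covariance never has to be built explicitly.
    s = np.linalg.svd(centred, compute_uv=False)
    eig = s**2 / (acts.shape[0] - 1)
    trace = float(eig.sum())
    return trace, float(trace**2 / np.sum(eig**2))


# Equal variance along 4 orthogonal axes: effective rank 4.
_, r_iso = effective_rank(np.vstack([np.eye(4), -np.eye(4)]))
# All variance along a single direction: effective rank 1.
_, r_line = effective_rank(np.outer([-1.0, 0.0, 1.0], [1.0, 2.0, 3.0]))
```

The statistic equals the number of directions when variance is spread evenly and falls toward 1 as variance concentrates, which is the sense in which a value of $11.5$ indicates roughly a dozen dominant directions.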

### Empirical $\sum\cos^{2}\theta_{i}$, raw coordinates.

On Gemma 2 2B, $L^{\star}=1$, $d=2304$, the observed probe vs. DAS quantity is $0.00091, 0.00559, 0.01520$ at $k\in\{2,4,6\}$; the analytic null $k^{2}/d$ is $0.00174, 0.00694, 0.01562$. The probe is *at or below* the random-null expectation at every $k$. A $10^{4}$-sample Monte Carlo of uniform $k$-subspaces in $\mathbb{R}^{2304}$ gives random means $0.00174\pm0.00123$, $0.00693\pm0.00246$, $0.01561\pm0.00370$. The empirical $p$-value (the fraction of null draws that exceed the observed probe value) is $0.72, 0.68, 0.51$: the probe subspace is statistically *indistinguishable from noise* in its angular relationship to the mediator.
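A Monte Carlo null of this kind can be sketched as follows. The code is an illustration under assumptions, not the paper's script: the helper names are hypothetical, random subspaces are drawn by QR of a Gaussian matrix (a standard way to sample uniformly distributed subspaces), and the toy dimension $d=64$ replaces $d=2304$ to keep the example fast.

```python
import numpy as np


def haar_subspace(d: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """Orthonormal (d, k) basis of a uniformly random k-subspace of R^d."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def sum_cos2(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of squared cosines of the principal angles between two subspaces.

    a, b: matrices with orthonormal columns.
    """
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.sum(cosines**2))


def null_p_value(observed, fixed, k, n_draws, rng):
    """Return (null draws, fraction of null draws >= observed)."""
    d = fixed.shape[0]
    null = np.array([sum_cos2(haar_subspace(d, k, rng), fixed)
                     for _ in range(n_draws)])
    return null, float(np.mean(null >= observed))


rng = np.random.default_rng(0)
d, k = 64, 4                      # toy dimensions, not the paper's d = 2304
fixed = np.eye(d)[:, :k]          # stand-in for a trained basis
null, p = null_p_value(observed=k * k / d, fixed=fixed, k=k,
                       n_draws=2000, rng=rng)
```

The sample mean of `null` should sit close to the analytic value $k^{2}/d$, which is the agreement the text reports between the Monte Carlo means and the analytic null.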

### Whitened coordinates\.

Activations at $L^{\star}$ are anisotropic, so we test whether the null result is an artifact of measuring angles in raw coordinates. We estimate $\Sigma$ from the $365$ cached Set-A activations (ridge regularisation $10^{-3}\cdot\mathrm{tr}(\Sigma)/d$), form $W=\Sigma^{-1/2}$, and whiten the probe, DAS, and $10^{4}$ random bases in this space. Results: $\sum\cos^{2}\theta_{i}$ at $k\in\{2,4,6\}$ is $0.00372, 0.00574, 0.01600$ observed vs. random mean $0.00174, 0.00691, 0.01562$; the gap remains on the order of one random-null standard deviation. Whitening does not reverse or meaningfully amplify the effect (Fig. [12](https://arxiv.org/html/2605.29126#A24.F12)).
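The whitening step might look like the sketch below. Two points are assumptions rather than details given in the text: the ridge term is added to $\Sigma$ before taking the inverse square root, and a basis is "whitened" by mapping its columns through $W$ and re-orthonormalising. The helper names are hypothetical.

```python
import numpy as np


def ridge_whitener(acts: np.ndarray, ridge_scale: float = 1e-3) -> np.ndarray:
    """W = (Sigma + lam * I)^(-1/2) with lam = ridge_scale * tr(Sigma) / d."""
    centred = acts - acts.mean(axis=0, keepdims=True)
    sigma = centred.T @ centred / (acts.shape[0] - 1)
    d = sigma.shape[0]
    lam = ridge_scale * np.trace(sigma) / d
    evals, evecs = np.linalg.eigh(sigma + lam * np.eye(d))
    # Scale each eigenvector column by 1/sqrt(eigenvalue), then rotate back.
    return (evecs / np.sqrt(evals)) @ evecs.T


def whiten_basis(basis: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Map a (d, k) basis through W and re-orthonormalise (assumed convention)."""
    q, _ = np.linalg.qr(w @ basis)
    return q


rng = np.random.default_rng(0)
acts = rng.standard_normal((50, 6)) * np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0])
w = ridge_whitener(acts)
q = whiten_basis(np.eye(6)[:, :2], w)
```

The ridge term keeps the inverse square root finite when $\Sigma$ is rank-deficient, as it is when only $365$ vectors are available in $\mathbb{R}^{2304}$.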

![Refer to caption](https://arxiv.org/html/2605.29126v1/x13.png)

Figure 12: Empirical $\sum\cos^{2}\theta_{i}$ in raw (a) and whitened (b) coordinates. Gray points and $\pm2\sigma$ bars: $10^{4}$-draw Haar random $k$-subspace null; red squares: observed probe vs. DAS; blue triangles: analytic $k^{2}/d$ null.

## Appendix S25 Supplement: angle indistinguishability tests and specificity CIs

### Hypothesis tests\.

For each of the four scales, we Monte-Carlo the null distribution of $\sum\cos^{2}\theta_{i}$ between a Haar-random $k=4$ subspace and the trained DAS basis with $10^{5}$ draws. The analytic null $k^{2}/d\in\{0.0104, 0.0069, 0.00446, 0.00446\}$ for $d\in\{1536, 2304, 3584, 3584\}$ respectively matches the MC mean to the third decimal. Observed probe–DAS $\sum\cos^{2}\theta_{i}$ values are reported in Supp. [S24](https://arxiv.org/html/2605.29126#A24); the empirical $p$-value for probe-subspace indistinguishability from the null ranges from $0.51$ to $0.72$ across $k\in\{2,4,6\}$.

### Specificity\-ratio CIs\.

We bootstrap $\rho_{k}$ by resampling the cached per-draw random-control accuracies ($n=25$ at 1.5B/2B, $n=20$ at 7B/9B) with $10^{4}$ draws.

(Gemma 2 2B: the lower bound uses the one-sided Clopper–Pearson preserve-rate CI to exclude denominator sign flips; the bootstrap upper arm is large because occasional resamples yield near-zero or slightly negative random drops.) At 7B/9B, random drops are exactly zero within bf16 precision; we report $\rho_{k}$ in scientific notation with the convention $\mathrm{drop}_{\text{random}}=10^{-8}$ (machine epsilon for bf16 accumulated over $20$ draws). Tightening the denominator by $10\times$ would require $\sim\!100\times$ more random draws by the usual $1/\sqrt{n}$ scaling of the standard error of the mean.

### Fieller intervals and an additive baseline\.

Bootstrap ratio-CIs break when the denominator crosses zero. We therefore also report (i) Fieller’s theorem CI for the ratio of means, which handles the near-zero-denominator case by distinguishing finite-interval, exterior, and unbounded solution types, and (ii) the *additive baseline* $\Delta_{\text{add}}=\mathrm{drop}_{\text{DAS}}-\mathrm{drop}_{\text{rand}}$ (in pp), which is well-defined regardless of denominator scale.

Reading the Fieller column: on Qwen 2.5 1.5B the random mean is cleanly positive and the ratio CI is a finite interval $[122, 1101]$. On Gemma 2 2B the random-drop s.e. puts zero within the sampling distribution of the denominator, yielding the *exterior* case $(-\infty,-129]\cup[103,\infty)$: the data are consistent with either a very large positive ratio or a very negative one, which is Fieller’s honest answer when the denominator is indistinguishable from zero. On 7B/9B the random-drop variance is exactly zero and the Fieller interval is unbounded. In all four cases the additive baseline $\Delta_{\text{add}}$ is $42$–$51$ pp, which is the scale-robust statistic for the specificity effect.
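The three solution types come from a single quadratic inequality in the ratio $r$. The sketch below is a generic statement of Fieller's construction under assumptions that the text does not specify: a normal critical value of $1.96$ rather than a $t$ quantile, and independent numerator and denominator means by default (`cov=0`). The function name and the toy inputs are hypothetical and unrelated to the reported intervals.

```python
import math


def fieller_interval(num_mean, den_mean, num_var, den_var, cov=0.0, crit=1.96):
    """Confidence set for num_mean / den_mean.

    num_var, den_var and cov are (co)variances of the two *means*.
    Solves (num - r*den)^2 <= crit^2 * (num_var - 2*r*cov + r^2*den_var) for r,
    i.e. a*r^2 - 2*b*r + c <= 0.
    """
    t2 = crit**2
    a = den_mean**2 - t2 * den_var
    b = num_mean * den_mean - t2 * cov
    c = num_mean**2 - t2 * num_var
    disc = b * b - a * c
    if a > 0:                                   # denominator clearly nonzero
        half = math.sqrt(max(disc, 0.0))
        return "finite", ((b - half) / a, (b + half) / a)
    if a < 0 and disc > 0:                      # set is (-inf, lo] U [hi, inf)
        root = math.sqrt(disc)
        lo, hi = sorted(((b - root) / a, (b + root) / a))
        return "exterior", (lo, hi)
    # Remaining cases, including the degenerate a == 0, are reported as unbounded.
    return "unbounded", None


kind1, ci1 = fieller_interval(10.0, 2.0, 0.01, 0.01)   # well-separated denominator
kind2, ci2 = fieller_interval(10.0, 0.1, 0.01, 0.01)   # denominator near zero
kind3, ci3 = fieller_interval(0.1, 0.1, 0.01, 0.01)    # both means near zero
```

The sign of `a` is what separates the cases: it is positive exactly when the denominator mean is significantly different from zero at the chosen level.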

## Appendix S26 Supplement: post-ablation residual-stream norms

One concern with subspace ablation is that it might introduce distribution shift via the norm change rather than via the loss of mediator content. We therefore report, on cached activations at $L^{\star}$ of Gemma 2 2B, the fraction of norm each basis removes: $\|U^{\top}Ux\|/\|x\|$.

Two observations: (i) DAS removes a fraction ($19$–$19.4\%$) that is modest in absolute terms but $\sim\!5\times$ the random-control fraction, consistent with DAS finding a direction aligned with a *high-variance* component of activation space, which is what Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) predicts for the mediator. (ii) The probe removes *less* than a random direction ($3\%$ vs. $4\%$), sitting in an atypical low-variance direction; its $R^{2}\approx0.99$ does not translate into norm dominance. The $42$ pp DAS-ablation drop therefore cannot be attributed to a generic distributional shift: both probe and random ablations perturb the residual stream comparably yet produce $<\!1$ pp accuracy changes.
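The removed-norm fraction is a one-line computation once a basis with orthonormal rows is available. The sketch below uses a hypothetical helper name and toy inputs. It also evaluates $\sqrt{k/d}$, the root-mean-square fraction removed by a Haar-random rank-$k$ basis for any fixed $x$; assuming the random controls are rank $4$ in $d=2304$, this gives about $4.2\%$, in line with the roughly $4\%$ reported for random directions.

```python
import numpy as np


def removed_norm_fraction(x: np.ndarray, u: np.ndarray) -> np.ndarray:
    """||U^T U x|| / ||x|| for each row of x.

    x: (n, d) activations; u: (k, d) basis with orthonormal rows.
    """
    projected = (x @ u.T) @ u
    return np.linalg.norm(projected, axis=1) / np.linalg.norm(x, axis=1)


x = np.array([[3.0, 4.0, 0.0]])
u = np.array([[1.0, 0.0, 0.0]])
frac = removed_norm_fraction(x, u)      # 3 / 5 for this toy vector

# RMS fraction for a Haar-random rank-4 basis in d = 2304 (rank 4 is assumed).
rms_random = np.sqrt(4 / 2304)
```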

## Appendix S27 Supplement: layer-robustness of the probe–DAS dissociation

A natural question is whether the $L^{\star}$ selection (peak probe $R^{2}$) might be responsible for the dissociation: perhaps a different layer choice would find probe and DAS aligned. We test this by computing principal angles between probe subspaces at *all 26 layers* and the DAS basis at $L^{\star}=1$, using cached activations (no new forward passes). Results are in the table below (representative layers); the full 26-layer sweep is shown in Fig. [13](https://arxiv.org/html/2605.29126#A27.F13).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x14.png)

Figure 13: Probe–DAS mean principal angle across all 26 layers of Gemma 2 2B at $k\in\{2,4,6\}$. The dashed line is the Haar-random null. The angle is within $1.5^{\circ}$ of null at every layer; no layer alignment exists between the probe and the causal mediator.

Across all 26 layers the probe–DAS angle is within $1.5^{\circ}$ of the Haar null ($89.6^{\circ}$ for $k=4$, $d=2304$). This *layer-universal* dissociation means the finding is not an artifact of $L^{\star}$ selection: there is no layer at which the linear probe direction aligns with the causal mediator subspace. Simultaneously, probe representations are layer-specific: $\cos^{2}\theta$ between the probe trained at layer $l$ and the probe at $L^{\star}$ falls below $0.16$ for all $l\neq L^{\star}$ (median $0.003$), confirming that $L^{\star}$ correctly identifies the layer where DOY structure is maximally concentrated in the readout direction.

### Attribution patching locates causal computation at $L=24$, not $L^{\star}=1$.

Running attribution patching (Syed et al., [2024](https://arxiv.org/html/2605.29126#bib.bib28)) ($n=188$ prompts) on Gemma 2 2B duration queries yields per-layer absolute attribution sums (summed over 8 heads). The peak is at $L=24$ ($\mathrm{score}=1.90$); $L^{\star}=1$ has score $0.14$ (7th percentile). This dissociation confirms that probe $R^{2}$ identifies *representation* concentration, not the *causal computation* locus. DAS ablation at $L^{\star}=1$ still achieves a $42$ pp accuracy drop because the residual stream at $L^{\star}$ propagates to all downstream layers: ablating the causal subspace early prevents the circuit from completing at $L=24$.

### Layer\-depth nuance and temporal predictability\.

Lubana et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib33)) find that temporal feature analysis (TFA) works best at ${\sim}50\%$ model depth and that deeper layers show the predictive component failing. Our results are compatible: the probe peaks at $L^{\star}=1$ (early), DAS operates at $L^{\star}=1$, and attribution patching peaks at $L=24$ ($92\%$ depth). Critically, the layer-universal $89^{\circ}$ angle means the dissociation is not a layer-selection artifact: probes at early layers decode embedding structure, probes at middle layers may capture temporal contextual information, but at *no* layer does the probe direction approach the mediator direction. At $L^{\star}=1$ the TFA predictable component aligns $7.1$–$7.6\times$ Haar with DAS while the novel component aligns only $2.0$–$2.5\times$, confirmed by both zero-shot and learned TFA (Supp. [S35](https://arxiv.org/html/2605.29126#A35), Fig. [15](https://arxiv.org/html/2605.29126#A35.F15)): the mediator sits within the predictable subspace, not outside it. Whether repeating this decomposition at $50\%$ depth ($L\approx13$) would change the picture is an open question; the layer-universal $89^{\circ}$ probe–DAS angle and the first-moment / second-moment dichotomy (Prop. [1](https://arxiv.org/html/2605.29126#Thmproposition1)) both predict it will not.

## Appendix S28 Supplement: effective mediator dimension

We use the cached Qwen 2.5 1.5B DAS bases at $k\in\{2,4,8,12\}$ (from `das_qwen` and `das_qwen_highk`) to probe the effective rank of the mediator subspace. Ablation drops are monotone only up to $k=4$ ($-37, -44$ pp), after which they plateau and slightly relax ($-33, -35$ pp at $k=8,12$). Pairwise principal angles reveal that bases at different $k$ are *not nested*: mean $\cos\theta_{i}$ between $k=4$ and $k=12$ is $0.45$; between $k=8$ and $k=12$ it is $0.33$; no pair has any principal angle with $\cos\theta>0.95$ (Fig. [14](https://arxiv.org/html/2605.29126#A28.F14)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x15.png)

Figure 14: Effective mediator dimension on Qwen 2.5 1.5B. (a) Ablation drop saturates at $k=4$; higher $k$ admits multiple spanning bases and ablation is noisier to optimize. (b) Mean $\cos\theta_{i}$ between bases at different $k$: no nested-subspace structure; larger-$k$ DAS does not strictly extend smaller-$k$.

Interpretation. Combined with the Grassmannian-scatter result at $k=6$ on Gemma 2 2B (§[4](https://arxiv.org/html/2605.29126#S4)), the effective mediator dimension is $4$, not $6$. Five independent seeds at $k=6$ converge to consistent NLL ($-72.0\pm0.7$) but scatter uniformly on $G(6,2304)$ (mean pairwise max angle $87.7^{\circ}\pm1.8^{\circ}$ vs. Haar null $87.1^{\circ}$), confirming that the two extra dimensions beyond the $4$-dimensional mediator are unconstrained by the DAS objective and roam freely. At $k=4$, by contrast, five seeds converge to a compact basin (pairwise CCA $>0.85$), consistent with a rank-$4$ causal subspace. The ablation-drop saturation at $k=4$ on both Gemma 2 2B and Qwen 2.5 1.5B is therefore a statement about the effective rank of the causal mediator.

## Appendix S29 Supplement: DAS sensitivity to loss and hyperparameters

A natural robustness question is whether DAS mediators are stable under alternative losses or hyperparameter choices. Two pieces of evidence support stability. First, the seed-variance analysis in Supp. [S1](https://arxiv.org/html/2605.29126#A1) shows that five independently seeded DAS runs on Gemma 2 2B at $k=4$ find different but pairwise-correlated bases (CCA $>0.94$, ablation accuracy within $[0\%,2\%]$ across seeds), indicating a stable attractor and consistent with a unique rank-$4$ causal subspace. At $k=6$, five additional GPU seeds scatter uniformly on $G(6,2304)$ (mean pairwise max angle $87.7^{\circ}$, indistinguishable from Haar-random), confirming effective dimension $=4$ (Supp. [S1](https://arxiv.org/html/2605.29126#A1)). Second, Supp. [S28](https://arxiv.org/html/2605.29126#A28) shows that DAS at $k\in\{2,4,8,12\}$ on Qwen 2.5 1.5B produces non-nested bases whose ablation drops saturate at $k=4$. Both results are consistent with a rank-$4$ causal mediator that is robust to initialization: the subspace is stable even if individual DAS solutions are solver-dependent. This evidence addresses seed variance only. A direct sweep over alternative losses (full-sequence NLL, consistency losses, counterfactual penalties) is the natural next experiment; the infrastructure is in place and results will be reported in follow-up work.

## Appendix S30 Supplement: DAS cross-set generalization

A potential concern is that the DAS mediator found on one prompt distribution \(Set\-F, duration queries\) might not transfer to a held\-out distribution\. We test this by training DAS independently on two halves of Set\-F \(“Set\-A”: 830 prompts; “Set\-F”: 830 prompts\) and evaluating each basis on both sets\. Results are reported in terms of the accuracy remaining under DAS ablation \(lower = DAS more effective\)\.

### Cross\-set transfer is strong\.

The Set-A-trained basis drives post-ablation accuracy low on Set-F and vice versa, indicating that both bases capture the same underlying causal structure. The mean principal angle between the two independently trained $k=4$ DAS bases quantifies the geometric overlap: angles near $0^{\circ}$ indicate the same subspace was found; angles near $90^{\circ}$ indicate orthogonal (unrelated) subspaces.

Results show that the two independently trained DAS bases span a nearly identical subspace (mean angle $<15^{\circ}$), confirming that the rank-$4$ mediator is a stable property of the model and task, not an artifact of the specific prompt sample used for DAS training.

### Cross\-template\-family transfer\.

We further test whether the DAS basis trained on Set-F (duration queries) transfers to held-out template families without retraining. Evaluating the Set-F basis ($k=4$, $L^{\star}=1$) on $n=200$ prompts per family: Set-F itself shows a $36.0$ pp accuracy drop under DAS ablation (clean acc. $0.36$); Set-G (explicit-comparison templates) shows a $9.0$ pp drop (clean acc. $0.175$, ablated $0.085$). Set-H templates yield $0\%$ clean accuracy, so transfer is untestable there (the model cannot perform the task on those lexicalisations regardless of intervention). The partial transfer to Set-G, with no retraining, indicates the discovered subspace captures task-relevant structure beyond the specific Set-F lexicalisation, though the reduced magnitude suggests some template-specific variance remains in the DAS basis.

## Appendix S31 Supplement: strict train/test splits and bootstrap CI on the readout-mediator angle

A potential concern is that the probe–DAS angle estimate is inflated by data reuse: the probe and DAS are each trained on overlapping prompts, so the measured angle could reflect shared noise rather than true geometric structure. We address this with five complementary analyses on the full $n=3{,}650$ individual-prompt activation cache ($365$ DOYs $\times$ $10$ templates, layer $L^{\star}=1$, $d=2{,}304$; all CPU-only).

### Existing DAS train/eval split\.

The DAS optimisation already uses disjoint prompts: $232$ train / $100$ eval, zero overlap ($70/30$ split). At $k=4$, DAS ablation drops eval accuracy from $42\%$ to $0\%$ ($\rho_{4}=1{,}050$); random ablation moves it by $\leq0.04$ pp. The causal subspace is not overfit to the DAS training set.

### Strict 60/20/20 splits\.

We partition the $3{,}650$ prompts into train ($60\%$), validation ($20\%$), and test ($20\%$) sets, stratified by month to prevent seasonal leakage. A fresh circular Ridge probe ($1$-harmonic, $\alpha=1.0$) is trained on the train split only. The probe–DAS angle at $k=4$ is:

All four estimates sit within $1.3^{\circ}$ of the Haar null ($88.3^{\circ}$). The overlap score $\langle\cos^{2}\theta\rangle$ ranges from $0.0013$ to $0.0034$, bracketing the Haar expectation of $k/d=0.0017$.
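A circular ridge probe and the mean principal angle to a fixed basis can be sketched as follows. This is an assumption-laden illustration rather than the authors' pipeline: the probe is taken to be a closed-form ridge regression onto first-harmonic $\sin$/$\cos$ targets, the probe subspace is the span of its two weight vectors, and the toy activations and the stand-in bases are synthetic. All function names are hypothetical.

```python
import numpy as np


def circular_ridge_probe(acts, doy, alpha=1.0, period=365.0):
    """Closed-form ridge regression onto [sin, cos] of the day-of-year phase.

    acts: (n, d) activations; doy: (n,) day-of-year values. Returns (d, 2) weights.
    """
    phase = 2.0 * np.pi * doy / period
    targets = np.column_stack([np.sin(phase), np.cos(phase)])
    x = acts - acts.mean(axis=0)
    y = targets - targets.mean(axis=0)
    return np.linalg.solve(x.T @ x + alpha * np.eye(x.shape[1]), x.T @ y)


def mean_principal_angle_deg(a, b):
    """Mean principal angle (degrees) between the column spans of a and b."""
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    cosines = np.clip(np.linalg.svd(qa.T @ qb, compute_uv=False), 0.0, 1.0)
    return float(np.degrees(np.arccos(cosines)).mean())


# Synthetic activations: coordinates 0 and 1 carry the date, the rest is noise.
rng = np.random.default_rng(0)
doy = np.tile(np.arange(1, 366), 2)
acts = rng.standard_normal((doy.size, 16))
phase = 2.0 * np.pi * doy / 365.0
acts[:, 0], acts[:, 1] = np.sin(phase), np.cos(phase)

w = circular_ridge_probe(acts, doy)
aligned = mean_principal_angle_deg(w, np.eye(16)[:, :2])      # near 0 degrees
orthogonal = mean_principal_angle_deg(w, np.eye(16)[:, 2:6])  # near 90 degrees
```

In the paper's setting the second argument would be the trained DAS basis; the toy bases here only show the two extremes the angle can take.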

### Five\-fold cross\-validated angle\.

Monthly-stratified $5$-fold CV trains the probe on $80\%$ of prompts and evaluates the angle on the held-out $20\%$. The fold-level angles are $87.7^{\circ}$, $87.8^{\circ}$, $87.6^{\circ}$, $87.9^{\circ}$, $87.9^{\circ}$ (mean $87.79^{\circ}\pm0.10^{\circ}$); held-out $R^{2}=0.981\pm0.001$. The angle estimate is stable to ${\pm}0.1^{\circ}$ across folds.

### Bootstrap confidence interval\.

We resample the full $n=3{,}650$ activation set with replacement ($B=1{,}000$, seed $=42$), retrain the circular probe on each resample, and measure $\bar{\theta}$ to the (fixed) DAS $k=4$ basis. The resulting $95\%$ percentile CI is $[87.28^{\circ},\,88.28^{\circ}]$ (mean $87.80^{\circ}$, std $=0.26^{\circ}$). The Haar null ($88.3^{\circ}$) falls $0.02^{\circ}$ outside the upper bound: the bootstrap mean deviates from the null by $0.5^{\circ}$ ($0.6\%$ of the angle), which is statistically significant but geometrically negligible. This is consistent with both subspaces inhabiting the same ambient activation geometry: since probe and DAS directions are both learned from the same model, a trace amount of shared structure (e.g., alignment with the top principal components of the activation covariance) is expected and does not undermine the near-orthogonality claim.
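The percentile bootstrap itself is generic and can be sketched independently of the probe. In the sketch below the function name is hypothetical, and `statistic` is any callable that maps resampled row indices to a scalar; in the setting described above it would retrain the probe on the resampled rows and return the angle, whereas the toy usage simply takes a mean.

```python
import numpy as np


def bootstrap_percentile_ci(n, statistic, n_boot=1000, seed=42, level=0.95):
    """Percentile bootstrap CI for a statistic of resampled row indices.

    Returns (lower, upper, mean, std) over the bootstrap draws.
    """
    rng = np.random.default_rng(seed)
    draws = np.array([statistic(rng.integers(0, n, size=n))
                      for _ in range(n_boot)])
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(draws, [tail, 100.0 - tail])
    return float(lo), float(hi), float(draws.mean()), float(draws.std(ddof=1))


# Toy usage: CI for the mean of 0..99.
data = np.arange(100.0)
lo, hi, mean, std = bootstrap_percentile_ci(data.size,
                                            lambda idx: data[idx].mean())
```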

### Template\-family held\-out\.

To test whether the angle depends on template-specific phrasing, we run leave-$2$-template-out cross-validation: for each of the $\binom{10}{2}=45$ held-out template pairs, we train the probe on the remaining $8$ templates (${\approx}2{,}920$ prompts) and measure $\bar{\theta}$ on the held-out pair (${\approx}730$ prompts). The resulting angles are $87.71^{\circ}\pm0.32^{\circ}$ (range $[87.20^{\circ},\,88.51^{\circ}]$). The angle is invariant to which templates are held out, confirming that the near-orthogonality is not an artifact of template-specific lexical overlap between probe and DAS training data.
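The split bookkeeping for this protocol is small enough to show directly. The generator below is a sketch with a hypothetical name; it assumes each prompt carries an integer template id in $0,\ldots,9$ and that every template contributes one prompt per day of year.

```python
from itertools import combinations

import numpy as np


def leave_two_templates_out(template_ids, n_templates=10):
    """Yield (held_out_pair, train_indices, test_indices) for every template pair."""
    for held_out in combinations(range(n_templates), 2):
        test_mask = np.isin(template_ids, held_out)
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# 365 days x 10 templates = 3,650 prompts (assumed layout of the cache).
template_ids = np.tile(np.arange(10), 365)
splits = list(leave_two_templates_out(template_ids))
```

With this layout there are $45$ splits, each with $2{,}920$ training prompts and $730$ held-out prompts, matching the counts quoted above.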

### Summary\.

Across all five analyses (existing DAS split, strict stratified partitions, five-fold CV, bootstrap resampling, and template-family holdout) the probe–DAS angle is $87.7^{\circ}$–$87.8^{\circ}$, within $0.5^{\circ}$ of the Haar null. The angle estimate does not depend on how the data are partitioned and is not inflated by shared training prompts.

## Appendix S32 Supplement: relationship to NDM and other unsupervised subspace decompositions

Neighbor Distance Minimization (NDM) of Huang and Hahn ([2026](https://arxiv.org/html/2605.29126#bib.bib38)) discovers non-basis-aligned interpretable subspaces by unsupervised feature reconstruction and validates them via causal patching. In the Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) framework, NDM subspaces should sit *between* probes and DAS on the readout-to-mediator spectrum: they are not task-gradient targeted (so $\rho_{k}$ should be smaller than for DAS), but they are geometry-respecting (so $\rho_{k}$ should exceed the probe null). The natural experiment (training an NDM decomposition on Set-A activations at $L^{\star}$, measuring its principal angle to the DAS basis, and placing it on the spectrum) requires NDM training infrastructure and is out of scope for the training-free analyses reported here. We predict: (i) NDM components aligned with sinusoidal date features will partially overlap the DAS span ($\rho_{k}^{\text{NDM}}$ between $10^{2}$ and $10^{3}$), (ii) the overlap will strengthen if NDM is trained on a task-conditioned activation distribution.

The broader implication connects to the seed-induced uniqueness result of Okatan et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib36)): narrow task-relevant subspaces, not global similarity, drive transfer. Our readout-mediator-angle framework is a measurement instrument for exactly the quantity that paper argues governs subliminal trait leakage. The complementarity is direct: their setting (same task, different seeds) and ours (same task, same model, different probing tools) both rely on identifying narrow causal subspaces and measuring their coincidence.

A final scale-trend note: Wang et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib37)) report that attention outputs are low-rank across families and scales. Our Supp. [S28](https://arxiv.org/html/2605.29126#A28) observation that effective mediator rank saturates around $\sim\!4$ at $d\in\{1536,2304,3584\}$ is consistent with this, and our Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) consequence (specificity grows as $d/k$ at fixed $k$) makes the $>\!500{,}000\times$ ratio at 7B/9B a direct prediction of attention low-rankness.

## Appendix S33 Supplement: open directions

Three concrete experimental designs for future work\.

### \(i\) Prospective clinical triage\.

A prospective triage trial would enlist ${\geq}500$ de-identified EHR queries, compute $\delta(x)$ pre-generation, and test whether non-deferred accuracy exceeds the $2.8\times$ baseline demonstrated offline. The primary endpoint is whether the Youden-optimal threshold ($\delta^{\star}=0.86$, $64\%$ deferral) transfers out-of-distribution.

### \(ii\) Frontier\-scale prediction\.

Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)(c) predicts specificity ratios $\rho_{k}\asymp d/k=2048$ at 70B parameters ($d=8192$, $k=4$). Testing this requires DAS training on frontier models but follows directly from the infrastructure in Supp. [S1](https://arxiv.org/html/2605.29126#A1).

### \(iii\) Multilingual and multi\-calendar\.

For a language with a non-Gregorian calendar (e.g. the lunisolar Hebrew calendar), we predict the probe–DAS angle null to hold, with boundary-head offsets reflecting the target calendar structure rather than $\{{\pm}30,{\pm}61\}$ days. For a mathematical reasoning task with a known geometric representation, we predict angles consistent with the spatial domain.

## Appendix S34 Supplement: cross-task diagnostic details

The cross-task validation in §[6](https://arxiv.org/html/2605.29126#S6) runs the full readout-mediator protocol on two non-temporal domains (a third, factual country$\to$capital lookup, was excluded due to $0\%$ clean accuracy on Gemma 2 2B; see below). Each domain uses Gemma 2 2B at the probe-optimal layer $L^{\star}$ (re-estimated per domain via a probe $R^{2}$ sweep over layers $0$–$9$), $k=4$, DAS training (steps chosen per domain to ensure loss convergence), and ${\geq}10$ Haar-random ablation controls.

### Arithmetic\.

Single-digit addition prompts (“The sum of $a$ and $b$ is”). $n=25$ prompts with $a,b\in[1,9]$; answers are single-token integers. *Result:* $L^{\star}=2$, probe $R^{2}=1.0$, mean angle $88.1^{\circ}$. Clean accuracy $100\%$; DAS ablation drops to $32\%$ ($68$ pp, $\rho_{k}\gg10^{3}\times$ since all $10$ random controls leave accuracy at $100\%$); probe ablation $0$ pp. Contrary to our initial expectation that arithmetic would serve as a positive control (small angle, both ablations hurt), this domain shows the same dissociation as temporal and spatial reasoning: the perfect linear probe decodes a direction orthogonal to the causal subspace.

### Spatial\.

1D number-line displacement (“Starting at position $X$, after moving $Y$ steps forward, you arrive at position”). $n=60$ prompts, answers in $[1,99]$, single token. *Result:* $L^{\star}=1$, probe $R^{2}=1.0$, mean angle $88.4^{\circ}$. Clean accuracy $20\%$; DAS ablation drops to $0\%$ ($20$ pp, $\rho_{k}=20.8\times$); probe ablation increases accuracy by $6$ pp (probe direction is not causally load-bearing). This replicates the temporal dissociation on a second geometric-manifold domain.

### Factual \(excluded\)\.

Country$\to$capital lookup (“The capital city of France is”) was tested but Gemma 2 2B achieves $0\%$ clean accuracy ($n=50$ prompts), so ablation metrics are undefined and this domain is excluded from the main-text table. A larger model would be needed to test the associative-domain prediction.

## Appendix S35 Supplement: TFA predictable/novel decomposition alignment

Lubana et al. ([2026](https://arxiv.org/html/2605.29126#bib.bib33)) decompose per-token activations into a *predictable* component (the projection of $x_{t}$ onto the subspace spanned by $\{x_{1},\ldots,x_{t-1}\}$) and a *novel* component (the orthogonal residual). We test whether this decomposition explains the $88^{\circ}$ readout-mediator angle, specifically whether the mediator sits in the predictable or novel part of the activation. We evaluate two implementations: a zero-shot linear predictor (SVD of the past-token matrix) and a learned attention-based model (TemporalSAE, Lubana et al. [2026](https://arxiv.org/html/2605.29126#bib.bib33); trained $10$K steps on our cached activations, NMSE $=0.069$).
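The zero-shot variant of this decomposition can be sketched directly from the definition. The code below is a minimal illustration, not the TFA reference implementation: the function name and the relative tolerance used to decide the rank of the past-token matrix are assumptions, and the learned attention-based model is not covered.

```python
import numpy as np


def predictable_novel(acts: np.ndarray, t: int, rel_tol: float = 1e-10):
    """Split acts[t] into its projection onto span(acts[:t]) and the residual.

    acts: (n_tokens, d) per-token activations; requires t >= 1.
    Returns (predictable, novel, predictable energy fraction).
    """
    past, x = acts[:t], acts[t]
    _, s, vt = np.linalg.svd(past, full_matrices=False)
    basis = vt[s > rel_tol * s.max()]      # orthonormal rows spanning the past tokens
    predictable = basis.T @ (basis @ x)
    novel = x - predictable
    energy_fraction = float(predictable @ predictable / (x @ x))
    return predictable, novel, energy_fraction


# Toy sequence in R^4: the third token mixes both earlier tokens and a new axis.
acts = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [1.0, 1.0, 1.0, 0.0]])
pred, novel, frac = predictable_novel(acts, t=2)
```

For the toy sequence the predictable part is $(1,1,0,0)$, the novel part is $(0,0,1,0)$, and the predictable energy fraction is $2/3$; the fraction is the quantity the setup below reports as $\bar{f}_{\text{pred}}$.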

### Setup\.

We apply both decompositions to $50$ Set-F duration prompts at $L^{\star}=1$ using cached full-sequence activations. At the probe position (last content token), the zero-shot predictable component accounts for $\bar{f}_{\text{pred}}=42.2\%$ of activation energy; the learned model assigns $84.8\%$ to predictable. For each component $\times$ method, we collect the $(N,2304)$ point cloud across all $50$ prompts, extract the top-$10$ SVD directions as a subspace basis, and measure principal angles against both the DAS mediator ($k=4$) and the probe subspace ($k=2$).

### Principal\-angle results\.

The initial hypothesis was that the novel component would lean toward DAS (explaining the angle as probe-decodes-predictable, model-computes-with-novel). The data reject this hypothesis: the *predictable* component aligns $3\times$ more strongly with DAS than the novel component does, and both methods agree (Table [6](https://arxiv.org/html/2605.29126#A35.T6)):

Table 6: Principal-angle alignment of TFA components with the DAS mediator and probe subspace. $\sum\cos^{2}\theta_{i}$ normalized by the Haar null ($k_{\text{comp}}\cdot r/d$). Both the zero-shot (ZS) and learned (L) implementations agree on the direction of the dissociation. $50$ prompts, $L^{\star}=1$, $\bar{\theta}$ via SVD of the component point cloud.

### Grassmannian visualization\.

To make this geometric, we embed all subspaces (DAS, probe, PCA, TFA predictable, TFA novel, and $100$ Haar-random $k$-frames) as points on the Grassmannian $\mathrm{Gr}(4,2304)$ using MDS with mean principal angle as the pairwise distance metric (Fig. [15](https://arxiv.org/html/2605.29126#A35.F15)). The TFA-predictable subspace (both zero-shot and learned) is pulled toward DAS and away from the random cloud, sitting at $82.7^{\circ}$–$83.7^{\circ}$ from DAS versus the random mean of $88.0^{\circ}\pm 0.3^{\circ}$. The probe, by contrast, sits at $88.5^{\circ}$, squarely in the random cloud and indistinguishable from noise. The TFA-novel component lands at $85.2^{\circ}$–$86.6^{\circ}$, between the predictable and random clusters.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x16.png) Figure 15: TFA components on the Grassmannian. (A) MDS embedding of subspaces on $\mathrm{Gr}(4,2304)$ using mean principal angle as the distance metric. Gray dots: $100$ Haar-random $k$-frames (convex hull shaded). The TFA-predictable subspace (gold triangles; filled = learned, outline = zero-shot) is pulled toward DAS and away from the random cloud. The probe (blue square) sits in the random cloud at $88.5^{\circ}$. (B) Grassmannian distances to DAS, sorted. TFA-Pred (learned: $83.7^{\circ}$, zero-shot: $82.7^{\circ}$) sits well below the Haar null ($87.6^{\circ}$). (C) Random-null distribution of distances to DAS (histogram) with TFA-Pred (gold) and Probe (blue) marked. $n=50$ prompts, $L^{\star}=1$.

### Interpretation\.

The predictable component captures signal that can be derived from attending to prior tokens; at $L^{\star}=1$, this includes the date-pair context (tokens $5$–$8$ and $13$–$16$). Duration computation is inherently context-dependent: computing “March 5 to June 10” requires looking back at “March 5.” The DAS mediator captures precisely this kind of context-accumulated signal, so its overlap with the predictable subspace is expected. Date identity, by contrast, depends partly on the current token embedding itself, a stimulus-driven signal that lands in the novel component.

The probe–mediator orthogonality is therefore *within* the predictable subspace: both probe and mediator capture aspects of accumulated context, but different functional projections of it (circular day-of-year structure for the probe; month-pair difference structure for the mediator). A simple predictable/novel split does not explain the $88^{\circ}$ angle, but it does explain *why* the mediator is where it is: duration computation lives in the part of the activation that comes from context, not from the current token alone (Fig. [16](https://arxiv.org/html/2605.29126#A35.F16)).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x17.png) Figure 16: Extended alignment of TFA predictable and novel components with the DAS mediator (left, $k=4$) and probe subspace (right, $k=2$), measured as $\sum\cos^{2}\theta_i$ normalized by the Haar null. Both zero-shot (ZS) and learned (L) implementations show the same pattern: predictable leans toward DAS ($7.1$–$7.6\times$ null); novel is closer to chance ($2.0$–$2.5\times$). $50$ prompts, $L^{\star}=1$.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x18.png) Figure 17: Feature-level detectors and causal specificity. (A) Month-tiling: $11$ of $12$ months have dedicated SAE features at Layer $3$, each firing selectively on its month token. (B) Causal specificity ratio $\rho$: DAS ablation drops accuracy by $42$ pp ($\rho=1050\times$); probe ablation has $\rho=0.8$ (inert); gradient probe achieves $\rho=19.1$; PCA reaches $\rho=15$. (C) TFA novel-computation projection: the predictable component aligns $7\times$ Haar with DAS, while the probe sits at $0.1\times$; the mediator lives in context-accumulated signal.

## Appendix S36 Supplement: temporal dynamics of mediator energy

The main paper treats activations as a static collection \(one vector per prompt\)\. Here we examine how mediator and probe subspace energy evolves*across token positions within a single prompt*\.

For each token position $t$ in a duration prompt, we compute the fraction of activation energy in the DAS mediator subspace and, analogously, in the probe subspace:

$$e_{\text{med}}(t)=\frac{\|U_{M}x_{t}\|^{2}}{\|x_{t}\|^{2}},\qquad e_{\text{probe}}(t)=\frac{\|U_{P}x_{t}\|^{2}}{\|x_{t}\|^{2}},\tag{4}$$

where $U_{M}$ and $U_{P}$ are the DAS and probe projection matrices.
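Eq. (4) can be evaluated per token with a few lines of NumPy. The sketch below assumes the projection matrices are stored as $(k,d)$ matrices with orthonormal rows, so that $\|Ux_t\|^{2}$ is the energy inside the subspace. The random subspace and activations are stand-ins for illustration, not the paper's cached tensors.

```python
import numpy as np

def subspace_energy(acts, U):
    """Per-token energy fraction inside span(U).

    acts: (T, d) activations, one row per token position.
    U:    (k, d) matrix with orthonormal rows.
    """
    proj = acts @ U.T                                  # (T, k) coordinates in the subspace
    return np.sum(proj ** 2, axis=1) / np.sum(acts ** 2, axis=1)

# Toy stand-ins: a random 4-dimensional subspace and random activations.
rng = np.random.default_rng(0)
T, d, k = 20, 2304, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))       # (d, k), orthonormal columns
U_M = Q.T
acts = rng.standard_normal((T, d))
e_med = subspace_energy(acts, U_M)
baseline = k / d                                       # dimensionality baseline k/d
```

For unstructured activations the fraction fluctuates around `baseline` $=k/d\approx 0.17\%$, which is the reference level the reported mediator energies are compared against.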

Fig. [18](https://arxiv.org/html/2605.29126#A36.F18) shows these energy trajectories for three representative duration prompts ($n=50$ total). Three findings emerge:

### \(i\) Mediator energy is distributed, not spike\-like\.

$e_{\text{med}}$ ranges from $2.9\%$ to $13.1\%$ across token positions (vs. $k/d=0.17\%$ for a random $4$-subspace in $\mathbb{R}^{2304}$), a $17$–$75\times$ excess over the dimensionality baseline. The energy does not spike at date-word tokens (“January”: $5.0\%$; “April”: $5.2\%$) but rather at *structural delimiter tokens*: the space immediately following the month name reaches $13.0\%$ and the final token “is” reaches $11.3\%$.

### \(ii\) “Between” is the energy minimum\.

Despite being the semantic cue for duration, the token “between” consistently has the *lowest* mediator energy ($2.9\%$). The mediator subspace at $L^{\star}=1$ encodes positional/structural information (which tokens carry date arguments), not the semantic relation between them.

### \(iii\) Probe energy is negligible at every position\.

$e_{\text{probe}}<0.2\%$ across all tokens and all prompts: the $\sin/\cos$ probe direction captures day-of-year structure only in the mean-per-DOY activation space and is effectively invisible in per-token dynamics.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x19.png) Figure 18: Temporal dynamics of mediator energy ($e_{\text{med}}$, gold bars) across token positions for three Set-F duration prompts. The $k/d$ random baseline is shown as a dashed line. Left: per-token mediator energy with token labels. Right: pairwise cosine similarity of DAS-projected activations, showing block structure around date-bearing positions.

## Appendix S37 Supplement: disentangling task-structure vs corpus-statistics for the $\pm 30$, $\pm 61$-day modes

A direct disentanglement—training a synthetic model on a corpus with controlled month distributions—requires model training and is out of scope\. We instead report four lines of indirect evidence, including a formal statistical test, that are jointly consistent with a*task\-structure*origin\.

### Evidence \(i\): cross\-family offset coincidence\.

The same two offset modes ($|c|\approx 30$, $|c|\approx 61$ days) arise in both Gemma 2 2B ($24$ BH-significant boundary heads) and Qwen 2.5 1.5B (top-$20$ heads), despite different pretraining corpora, different tokenizers, and different initializations (§[5](https://arxiv.org/html/2605.29126#S5)). We formalize this coincidence with a Monte Carlo simulation test. For each of $N=10^{5}$ draws we sample two independent offset sets of the observed sizes from $\mathrm{Uniform}(\{1,\ldots,182\})$ and count the number of modes matched within a tolerance window $\pm\tau$. At $\tau=3$ d, the observed $2$ shared modes exceed $99.1\%$ of null draws ($p=0.009$); at $\tau=5$ d, $p=0.020$; at $\tau=10$ d, $p=0.052$ (Fig. [19](https://arxiv.org/html/2605.29126#A37.F19)). The $\tau=3$ result survives Bonferroni correction for the three tolerances tested ($\alpha_{\text{adj}}=0.017$).
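The shape of such a coincidence test can be sketched with the standard library. Several choices here are assumptions of the sketch rather than details from the paper: the set sizes passed in are hypothetical, offsets are drawn without replacement, and a "match" means an offset in the first set has any partner in the second within $\pm\tau$ days. The numbers it produces are therefore not expected to reproduce the reported $p$-values.

```python
import random

def count_matches(a, b, tau):
    """Number of offsets in a that have a partner in b within +/- tau days."""
    return sum(any(abs(x - y) <= tau for y in b) for x in a)

def coincidence_p_value(size_a, size_b, observed, tau,
                        n_draws=100_000, max_offset=182, seed=0):
    """Fraction of null draws with at least `observed` matched offsets."""
    rng = random.Random(seed)
    support = range(1, max_offset + 1)
    hits = 0
    for _ in range(n_draws):
        a = rng.sample(support, size_a)   # assumption: sampling without replacement
        b = rng.sample(support, size_b)
        if count_matches(a, b, tau) >= observed:
            hits += 1
    return hits / n_draws

# Hypothetical set sizes (2 and 2); the paper uses "the observed sizes".
p_by_tau = {tau: coincidence_p_value(2, 2, observed=2, tau=tau, n_draws=10_000)
            for tau in (3, 5, 10)}

# Bonferroni threshold for three tolerances at a nominal 0.05 level.
alpha_adj = 0.05 / 3
```

Because the seed fixes the draws, widening $\tau$ can only add matches, so the estimated $p$-value is non-decreasing in $\tau$, which is the same direction as the reported sequence $0.009$, $0.020$, $0.052$.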

![Refer to caption](https://arxiv.org/html/2605.29126v1/x20.png) Figure 19: Monte Carlo null distribution of offset-mode coincidences ($\tau=5$ d tolerance, $10^{5}$ draws from $\mathrm{Uniform}(\{1,\ldots,182\})$). The observed $2$-mode match (gold line) falls in the far right tail ($p=0.020$).

### Evidence \(ii\): Pythia emergence during training\.

The Pythia emergence analysis traces a $\sim 37\times$ increase in circularness *during training*, not at initialization (§[5](https://arxiv.org/html/2605.29126#S5.SS0.SSS0.Px4), Supp. [S7](https://arxiv.org/html/2605.29126#A7)), indicating that the offset-bearing circuit is learned from the task, not inherited from random weights.

### Evidence \(iii\): absent weekly mode\.

No $\pm 7$-day mode is observed despite weekly periodicity being abundant in pretraining corpora. Neither is any $\pm 365$-day (annual) or $\pm 1$-day (unit) mode present. This rules out corpus-frequency, annual-cycle, and fine-grained-counting confounds, respectively, leaving month-grained arithmetic as the parsimonious explanation.

### Evidence \(iv\): temporal priors and non\-stationarity\.

Lubanaet al\.\([2026](https://arxiv.org/html/2605.29126#bib.bib33)\)show that LM representations are non\-stationary and that standard sparse autoencoders impose i\.i\.d\. priors that miss temporal structure\. The fact that*both*model families converge to*temporal*\(monthly\) rather than*statistical*\(weekly\) offsets is further evidence that the circuit is shaped by task structure, not corpus statistics—precisely the kind of temporal prior that SAEs trained under i\.i\.d\. assumptions would fail to recover \(cf\. Prop\. 4\.2 ofLubanaet al\.[2026](https://arxiv.org/html/2605.29126#bib.bib33)\)\.

A controlled\-training experiment remains the gold standard\.

## Appendix S38 Supplement: mediator-projection + controlled perturbation test

A validation of Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)’s Lipschitz bound would inject controlled perturbations $\epsilon\cdot v$ for $v\in\mathrm{row}(U_{\text{DAS}})$ vs $v\perp U_{\text{DAS}}$ and measure the resulting duration-error dependence on $\epsilon$. Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) predicts a linear regime for in-subspace perturbations and a flat regime for orthogonal ones. The experiment requires model forward passes (not cached). We flag it as the most direct next validation; the current paper’s evidence for Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3) is instead the observed $\rho_{k}$ ordering across tools (Supp. [S11](https://arxiv.org/html/2605.29126#A11)), which is already consistent with the bound.

## Appendix S39 Supplement: non-linear probes do not close the readout-mediator angle

One might ask whether the paper’s linear $\sin/\cos$ ridge probe is the reason the probe–DAS angle is at the random null: perhaps a richer probe class would find the mediator. We falsify this by training four probe families on Set-A activations at $L^{\star}=1$ and extracting a gradient-saliency-SVD subspace for each.

Every probe family, linear *or non-linear*, places its gradient-saliency subspace at the random-null angle to DAS (within $1^{\circ}$ of the Haar expectation). Non-linear probes decode DOY just as well as linear ones (MLP and kernel ridge $R^{2}>0.84$) but do not recover the mediator direction. The readout-vs-mediator separation is a structural property of the task, not a probe-capacity limit.

## Appendix S40 Supplement: alternative probe targets

A complementary concern: perhaps the paper’s $2$-D $\sin/\cos$(DOY) target is *too narrow*, and the mediator (which appears to operate at month granularity, $k\approx 4$) would be captured by a probe with a richer target. We test four decoding targets on Set-A activations at $L^{\star}=1$ (Gemma 2 2B, $d=2304$): (a) $\sin/\cos$(DOY) Ridge ($k=2$, paper default); (b) a $12$-way month-of-year Ridge classifier ($k=4$, top-$4$ class-weight rows); (c) a harmonic $k=4$ target (sin/cos at fundamental and second harmonic); (d) a learned $k=4$ target (PCA-4 of the one-hot-month embedding).

Every probe target, regardless of decoding power, places its linear subspace within $\sim 1^{\circ}$ of the Haar null against the DAS mediator (Holm-adjusted one-sided $p$ for $\sum\cos^{2}\theta_i > k_{M}^{2}/d$ is $1.00$ in all cases). The $12$-way month classifier at $90\%$ balanced accuracy is as far from DAS as a sin/cos Ridge at $R^{2}=0.99$, confirming that the readout-mediator separation is a property of *which* directions decode, not *how much* information a probe extracts.

## Appendix S41 Supplement: amnesic / erasure baselines vs DAS

Modern concept-erasure methods (LEACE (Belrose et al., [2023](https://arxiv.org/html/2605.29126#bib.bib18)), Mean-Projection (Haghighatkhah et al., [2022](https://arxiv.org/html/2605.29126#bib.bib19)), INLP) target selective removal of linear information about a concept. Do their erasure subspaces overlap with the DAS mediator? At $k=4$, DOY target on Gemma 2 2B Set-A:

All four amnesic methods sit at $86.8$–$88.1^{\circ}$ from DAS. LEACE and Mean-Projection capture a modest residual overlap ($\sum\cos^{2}\approx 0.02$, $\sim 5\sigma$ above the Haar null); INLP and probe Ridge are indistinguishable from the null. The interpretation: selective linear-concept erasure is *concentrated on the decoder subspace*, not the mediator, consistent with the paper’s central claim that readable-from and computed-with occupy orthogonal directions.

## Appendix S42 Supplement: matched-budget comparison on the spectrum

Placing tools on a parameter-matched ruler (dim-equivalents where an attention head is counted as $4d$ parameter-dims):

DAS achieves $\sim 60\times$ the per-dim specificity of SAEs, $70\times$ that of the probe, and $\sim 6{,}000\times$ that of the most parameter-efficient head-level method (QK-twist). Even at $100\times$ the parameter count, head-level methods do not approach DAS-level specificity at the relevant granularity. The ordering on *this* axis matches the ordering on the $\rho_{k}$ axis (Supp. [S11](https://arxiv.org/html/2605.29126#A11)): DAS $\gg$ SAE $\gg$ attribution/QK $>$ probe $\approx$ random.

## Appendix S43 Supplement: residual stream DAS energy tracking

To understand how the early-layer mediator ($L^{\star}=1$) influences late-layer computation, we track the DAS subspace energy fraction $e_{L}\equiv\mathbb{E}_{\text{DOY}}[\|U_{\text{DAS}}x_{L}\|^{2}/\|x_{L}\|^{2}]$ through all 26 layers of Gemma 2 2B, using the 365-DOY mean activations cached in `cached_tensors/gemma2b_full/activations/mean_activations.pt`.

### DAS energy never drops below null\.

The DAS energy profile (Fig. [20](https://arxiv.org/html/2605.29126#A43.F20)) shows a monotone decay from the $L=1$ peak ($26.4\times$ Haar null) through a mid-network trough at $L=18$ ($3.1\times$), followed by a secondary recovery at $L=22$ ($6.6\times$). Critically, DAS energy exceeds the Haar random null at *every* layer (minimum $2.1\times$ at $L=25$), indicating that mediator information is carried forward through the residual stream as an additive component throughout the entire forward pass. This is distinct from the probe subspace, which tracks the random null ($0.3$–$4.1\times$) with no coherent pattern.

### Mechanistic interpretation\.

The residual stream architecture means that any layer with elevated DAS energy is a potential site for causal computation. The boundary heads at $L=11$–$12$ identified as the primary causal bottleneck in Supp. [S46](https://arxiv.org/html/2605.29126#A46) operate where DAS energy is $\sim 8\times$ null, well above background but below the early peak. This is consistent with these heads reading from and writing to the mediator subspace as part of the circuit. The late-layer secondary peak ($L=22$) corresponds to the QK-twist boundary-head cluster, but those heads are causally inert (Supp. [S46](https://arxiv.org/html/2605.29126#A46)), suggesting the secondary peak reflects passive information persistence rather than active computation.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x21.png) Figure 20: DAS mediator energy through 26 layers. DAS subspace energy (blue) vs. probe subspace energy (orange) vs. Haar random null (dashed). Boundary-head layers shaded. DAS energy exceeds random null at every layer; probe energy does not.

## Appendix S44 Supplement: per-direction DAS ablation

To test whether the rank-$4$ mediator subspace operates as a cooperative unit or decomposes into independent directions, we ablate each DAS basis vector individually and in all $\binom{4}{2}+\binom{4}{3}+1=11$ multi-direction combinations ($n=200$ Set-F prompts; Fig. [21](https://arxiv.org/html/2605.29126#A44.F21)).

### Individual directions are insufficient\.

Single-direction ablations yield $\Delta$NLL $\in[0.08,0.51]$ (sum $=1.16$); the full $k=4$ ablation yields $\Delta$NLL $=68.8$, a cooperation ratio of $59\times$. No single direction accounts for more than $0.7\%$ of the full effect. Direction $u_{4}$ is strongest ($\Delta$NLL $=0.51$, $53\%$ accuracy drop); direction $u_{3}$ is weakest ($\Delta$NLL $=0.08$, $1.4\%$ accuracy drop).

### Pairwise interactions are super\-additive\.

Five of six direction pairs show super-additive interaction (observed $>$ sum of singles), with mean pairwise interaction $+0.07$ NLL. The lone sub-additive pair $(u_{1},u_{4})$ combines the two strongest individual directions. Triple and quadruple ablations show escalating nonlinearity: $[u_{1},u_{2},u_{4}]$ gives $\Delta$NLL $=3.0$ ($2.8\times$ the sum of singles), and $[u_{1},u_{2},u_{3},u_{4}]$ gives $\Delta$NLL $=68.8$ ($59\times$).

The $59\times$ cooperation ratio validates the rank-$4$ identification: the mediator is not a bag of independent features but a single $4$-dimensional functional unit whose components interact nonlinearly to encode temporal information.
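The ablation design and the cooperation ratio can be made concrete with a small sketch. It assumes each ablation projects out the selected orthonormal DAS basis vectors from an activation; the actual intervention hook used in the paper is not specified here, and the basis `U` and activation `x` below are random stand-ins. The two $\Delta$NLL values are the ones reported above.

```python
from itertools import combinations
import numpy as np

def ablation_subsets(k=4):
    """All non-empty index subsets of k directions, smallest subsets first."""
    return [c for r in range(1, k + 1) for c in combinations(range(k), r)]

def ablate(x, U, idx):
    """Remove the components of x along the selected orthonormal rows of U."""
    V = U[list(idx)]
    return x - (x @ V.T) @ V

subsets = ablation_subsets(4)                  # 4 singles + 11 multi-direction subsets
multi = [s for s in subsets if len(s) > 1]

# Random stand-in for a rank-4 basis and one activation (toy dimension).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 4)))
U, x = Q.T, rng.standard_normal(64)
x_pair = ablate(x, U, (0, 3))                  # example: ablate directions u1 and u4

# Cooperation ratio from the reported numbers: full effect / sum of single effects.
dnll_full, dnll_singles_sum = 68.8, 1.16
cooperation_ratio = dnll_full / dnll_singles_sum
```

The subset count matches the text: six pairs, four triples, and the full set give the $11$ multi-direction combinations, and $68.8/1.16$ rounds to the quoted $59\times$.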

![Refer to caption](https://arxiv.org/html/2605.29126v1/x22.png) Figure 21: Per-direction DAS ablation. (A) Individual direction $\Delta$NLL (dashed: full $k=4$ ablation). (B) Pairwise observed vs. expected (sum of singles). Super-additive pairs lie above the diagonal.

## Appendix S45 Supplement: MLP vs. attention component attribution

Zero-ablation of attention output vs. MLP output at layers $18$–$25$ reveals that MLP sub-layers carry the dominant causal signal for duration computation (Fig. [22](https://arxiv.org/html/2605.29126#A45.F22)). For $n=200$ Set-F prompts, we measure $\Delta$NLL from zero-ablating (i) attention only, (ii) MLP only, and (iii) both.

MLP ablation increases NLL at every layer from $18$ to $24$ (mean $\Delta$NLL $=+0.26$), while attention ablation is positive only at layers $18$–$19$ and $21$, and is *negative* at layers $20$, $22$–$24$ (mean $\Delta$NLL $=-0.02$). Negative attention $\Delta$NLL means ablating attention *improves* duration performance: these heads actively interfere with the computation. At layer $25$, both components have near-zero or negative $\Delta$NLL, suggesting minimal contribution.

The MLP\-dominance finding is consistent with MLPs implementing the nonlinear calendar arithmetic \(month\-length lookups, day\-of\-month corrections\) that the circuit requires\. Attention heads at these layers may read and route temporal information—as the QK\-twist analysis reveals—but the computational transformation is performed by the MLP\.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x23.png) Figure 22: MLP vs. attention attribution at layers $18$–$25$. MLP ablation consistently increases NLL; attention ablation is near-zero or negative at late layers.

## Appendix S46 Supplement: cascading ablation of boundary-head groups

The top-$10$ boundary heads span layers $1$, $11$, $12$, $20$, $23$–$25$. Cascading ablation reveals an uneven causal distribution (Fig. [23](https://arxiv.org/html/2605.29126#A46.F23)). Early boundary heads (layers $\leq 22$, $n=4$ heads: L11.H4, L12.H6, L20.H1, L1.H7) account for nearly all the causal effect ($\Delta$NLL $=+0.455$), while late boundary heads (layers $>22$, $n=6$ heads across L23–L25) contribute negligibly ($\Delta$NLL $=+0.016$). The combined ablation ($\Delta$NLL $=+0.489$) shows weak super-additivity ($+0.019$): the interaction is positive but small.

Per-layer breakdown reveals that layers $11$ and $12$ are the primary causal sites ($\Delta$NLL $=+0.206$ and $+0.238$ respectively), while individual late-layer ablations at L23 and L25 slightly *improve* performance ($\Delta$NLL $<0$). This refines the circuit architecture: QK-twist boundary heads at L23–L25 exhibit interpretable temporal offset structure ($\pm 30$, $\pm 61$ days), but are not load-bearing for duration output. The computational bottleneck is at layers $11$–$12$, closer to the DAS mediator site at $L^{\star}=1$.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x24.png) Figure 23: Cascading ablation. (A) $\Delta$NLL by head group. (B) Observed combined drop vs. sum of group drops (weakly super-additive).

## Appendix S47 Supplement: MLP SAE computation analysis

We use GemmaScope MLP SAEs (`gemma-scope-2b-pt-mlp-canonical`, 16K features per layer, $W_{\text{dec}}\in\mathbb{R}^{16384\times 2304}$) as a transcoder substitute to decompose what each MLP *writes* to the residual stream into interpretable features. Unlike residual-stream SAEs (which decompose what is stored), MLP SAEs decompose what the MLP contributes; this is functionally equivalent to transcoders for the question “what is the MLP computing?”

### DAS alignment of MLP features\.

For each MLP SAE feature $j$ with unit-norm decoder $\hat{e}_{j}$, we compute $\mathrm{DAS\text{-}align}(j)=\|U_{\mathrm{DAS}}\hat{e}_{j}\|^{2}$ and $\mathrm{probe\text{-}align}(j)=\|W_{\mathrm{probe}}\hat{e}_{j}\|^{2}$. Across all $16{,}384$ features at each layer L18–L25:

- Jaccard of top-100 DAS-aligned vs. top-100 probe-aligned MLP features: $\leq 0.010$ at every layer (mean $0.004$); the dissociation extends into the MLP computation.
- Peak DAS enrichment: L25 ($18.5\times$ Haar null, 10 features above $10\times$), L24 ($14.8\times$, 2 features), L20 ($12.9\times$, 1 feature).
- The MLP SAE features at L18–L19 that are *probe*-aligned (writing calendar-date content) are a different set from those at L20–L25 that are *DAS*-aligned (writing duration content).
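The two quantities used in this list, per-feature alignment and top-$n$ Jaccard overlap, can be sketched as follows. The sketch assumes the reference subspace is given as a $(k,d)$ matrix with orthonormal rows (which may not hold for the paper's $W_{\mathrm{probe}}$), and the decoder matrix and scores below are toy stand-ins rather than GemmaScope weights.

```python
import numpy as np

def alignment(decoders, U):
    """Squared norm of each unit-normalized decoder's projection onto span(U).

    decoders: (F, d) decoder rows; U: (k, d) with orthonormal rows.
    """
    unit = decoders / np.linalg.norm(decoders, axis=1, keepdims=True)
    return np.sum((unit @ U.T) ** 2, axis=1)

def top_n_jaccard(score_a, score_b, n=100):
    """Jaccard index between the top-n feature sets under two scores."""
    top_a = set(np.argsort(score_a)[-n:].tolist())
    top_b = set(np.argsort(score_b)[-n:].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# Toy stand-ins (hypothetical sizes).
rng = np.random.default_rng(0)
F, d, k = 1000, 64, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
U_das = Q.T
decoders = rng.standard_normal((F, d))
das_align = alignment(decoders, U_das)

# Two score vectors whose top-100 sets are disjoint by construction.
scores_up = np.arange(F, dtype=float)
scores_down = scores_up[::-1].copy()
```

A Jaccard value near $0$, as reported for the DAS-aligned and probe-aligned sets, means the two rankings pick almost entirely different features.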

### MLP DAS energy contribution\.

Approximating the MLP’s contribution at layer $L$ as $\Delta x_{L}=x_{\mathrm{post}}(L)-x_{\mathrm{post}}(L-1)$ (valid since MLP dominates attention $2$–$4\times$):

$$\mathrm{DAS\_contribution}(L)=\mathbb{E}_{x}\left[\frac{\|U_{\mathrm{DAS}}\Delta x_{L}\|^{2}}{\|\Delta x_{L}\|^{2}}\right]$$

Results (mean over 365 DOY prompts): L18 ($0.6\times$ null), L19 ($0.7\times$ null), L20 ($3.2\times$ null, peak), L21 ($1.6\times$), L22 ($1.5\times$), L23 ($1.2\times$), L24 ($2.0\times$), L25 ($2.8\times$). Probe contribution peaks at L19 ($4.3\times$ null) then drops, confirming the two-stage structure. $6/8$ MLP layers write positively into the DAS subspace.

### Month\-specific MLP features\.

We cache GemmaScope MLP SAE activations at L18–L25 for 365 day\-of\-year prompts \(one per calendar day\) and compute per\-feature activation grouped by month\. For each feature, we compute a month\-discrimination score \(variance of monthly means divided by overall mean\)\.
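A minimal sketch of the score and an accompanying permutation test is given below. It follows the verbal definition above (variance of the twelve monthly means divided by the overall mean) and assumes a non-leap year, non-negative activations with positive mean, and a null distribution built by shuffling month labels across the 365 prompts. The shuffling scheme, `n_perm`, and the toy activations are assumptions of the sketch; the paper does not spell out its permutation procedure.

```python
import numpy as np

DAYS_PER_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
MONTH_OF_DAY = np.repeat(np.arange(12), DAYS_PER_MONTH)   # month index for each of 365 days

def month_score(act, labels=MONTH_OF_DAY):
    """Variance of the 12 monthly mean activations divided by the overall mean."""
    means = np.array([act[labels == m].mean() for m in range(12)])
    return means.var() / act.mean()

def permutation_p(act, n_perm=2000, seed=0):
    """One-sided p-value: share of label shuffles scoring at least as high as observed."""
    rng = np.random.default_rng(seed)
    observed = month_score(act)
    null = np.array([month_score(act, rng.permutation(MONTH_OF_DAY))
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Toy features: one silent in February only, one identical on every day.
feb_silent = np.where(MONTH_OF_DAY == 1, 0.0, 1.0)
flat = np.ones(365)
p_feb = permutation_p(feb_silent, n_perm=200)
p_flat = permutation_p(flat, n_perm=50)
```

With $2000$ shuffles the smallest attainable $p$-value is about $0.0005$, so a threshold of $p<0.001$ is resolvable at that budget; fewer shuffles would not be enough.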

Features with statistically significant month discrimination ($p<0.001$, permutation test):

- L21: 18 discriminating features (strongest layer by count)
- L19, L20, L25: 13 features each
- L22: 12 features

Smoking\-gun features:

- Feature #7886 at L19 (“age-related numbers”; pos. logits: *seventeen, nineteen, sixteen, eighteen*): mean activation $3.98$ (31-day months), $2.74$ (30-day months), $1.83$ (February). Pattern consistent with month-length sensitivity.
- Feature #15148 at L22 (“law enforcement terms”): mean activation $0.86$ (31-day), $0.18$ (30-day), $\mathbf{0.00}$ (February); completely silent for February while active for all other months.
- Feature #6208 at L21 (“modular systems”): $1.59:1.11:0.12$ across 31-day / 30-day / Feb.

Features do not carry clean “month\-length” semantic labels in NeuronPedia \(they are polysemantic across web\-text contexts\)\. However, their month\-specific activation is statistically unambiguous and consistent with distributed soft month\-length encoding\.

### Interpretation\.

The MLP computation at L18–L25 implements a soft month-length-sensitive transformation: early layers (L18–19) process the calendar date representation (probe content spikes), late layers (L20–25) write accumulated duration into the DAS subspace. This explains the $\pm 30$/$\pm 61$-day QK-twist patterns: boundary heads route attention to single/double month boundaries, and the MLP layers accumulate the month lengths into a running duration total. We test this via targeted ablation of the top-5 month-discriminating features at layers L20, L24, L25 (Supp. [S48](https://arxiv.org/html/2605.29126#A48)).

## Appendix S48 Supplement: MLP SAE feature causal ablation

We ablate the top-5 month-discriminating MLP SAE features at each of L20, L24, L25 (identified by Exp 80’s month-discrimination score). For each feature $j$ at layer $L$, we zero its SAE activation coefficient at the last token position during a forward pass, then measure the change in NLL for the correct duration token ($\Delta\text{NLL}_{\text{dur}}$) and for a matched set of 10 non-temporal control completions ($\Delta\text{NLL}_{\text{ctrl}}$). The *specificity ratio* $\rho=\Delta\text{NLL}_{\text{dur}}/|\Delta\text{NLL}_{\text{ctrl}}|$ separates duration-specific effects from generic disruption.

### Results\.

Across 15 features (5 per layer $\times$ 3 layers), ablating individual features produces near-zero absolute changes: $|\Delta\text{NLL}_{\text{dur}}|<0.05$ for 14/15 features. The sole exception is L25 feature #9608 ($\Delta\text{NLL}_{\text{dur}}=+0.0003$, $\Delta\text{NLL}_{\text{ctrl}}=0.0000$, $\rho=294.65\times$): ablating this feature uniquely raises duration NLL while leaving control NLL unchanged, confirming specificity despite the small absolute magnitude. Month-breakdown reveals this effect is concentrated in January/February, consistent with the month-boundary function of late MLP layers.

### Interpretation\.

The near-zero individual-feature effects are *expected* under the cooperative-subspace picture established in §[5](https://arxiv.org/html/2605.29126#S5): the per-direction ablation of the DAS subspace showed a $59\times$ super-additive cooperation ratio (individual directions contribute $\Delta$NLL $\in[0.08,0.51]$; full $k=4$ ablation yields $\Delta$NLL $=68.8$). MLP SAE features lie on the *readout*-to-mediator spectrum; they are a partial readout of the DAS subspace (top-50 decoders span $4.7\%$ of DAS variance), not individual load-bearing units. Ablating one feature removes $\sim 0.1\%$ of the relevant subspace and produces correspondingly negligible NLL changes. Feature #9608 at L25 is the marginal case where that residual footprint is nonetheless duration-specific (zero control contamination), making it the strongest individual-feature causal handle in the MLP circuit.

## Appendix S49 Supplement: decoder-direction steering and transcoder comparison

We test whether the causal gap between the DAS subspace and sparse dictionary features can be bridged by *steering*, i.e., subtracting $\alpha\cdot\sum_{j}\hat{e}_{j}$ from the residual stream at the hook point, where $\hat{e}_{j}$ are unit-norm SAE or transcoder decoder directions (Templeton et al., [2024](https://arxiv.org/html/2605.29126#bib.bib39)), and whether GemmaScope pre-trained layer transcoders (Lieberum et al., [2024](https://arxiv.org/html/2605.29126#bib.bib12)) produce different results from MLP SAEs. Five experiments ($n=50$ Set-F duration prompts each for GPU runs, $\alpha=3.0$, Gemma 2 2B) collectively demonstrate that the decomposition gap is structural.

### Error\-node analysis \(SAE dark matter\)\.

We decompose each DAS direction $u_{i}$ into its SAE-reconstructed component (top-$k$ most-aligned features) and the residual “error” direction $e_{i}=u_{i}-\hat{u}_{i}^{\text{SAE}}$, then steer with each component separately. Full DAS rank-$4$ ablation produces $\Delta\mathrm{NLL}=+69.1$ (massive causal effect). The SAE-reconstructed component produces $\Delta\mathrm{NLL}\approx 0$ at every reconstruction depth $k\in\{5,10,20,50,100\}$; the error component likewise produces $\Delta\mathrm{NLL}\approx 0$; random controls are indistinguishable (Table [7](https://arxiv.org/html/2605.29126#A49.T7)). The reconstruction gap $G(k)=1-\Delta\mathrm{NLL}_{\text{SAE}}/\Delta\mathrm{NLL}_{\text{DAS}}=1.000$ at every $k$.

Table 7: Error-node analysis. $\Delta$NLL for DAS full ablation vs. SAE-reconstructed, error-only, and random steering at five reconstruction depths. $n=50$ Set-F prompts, $\alpha=3.0$, Gemma 2 2B.

The gap is not about what the SAE misses (error directions are also causally inert via steering); it is about the mismatch between a rank-$4$ projection (which zeroes all variance in a $4$D subspace) and a rank-$1$ directional subtraction (which perturbs along one direction, allowing the model to compensate via the remaining three cooperative dimensions).
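The contrast between the two interventions can be seen in a short sketch. It assumes projection ablation removes every component of the activation inside the subspace, whereas steering subtracts one fixed vector $\alpha\sum_{j}\hat{e}_{j}$ that does not depend on the activation. The function names and the random stand-in data are hypothetical; $\alpha=3.0$ and the $\Delta$NLL value $69.1$ are the ones quoted above.

```python
import numpy as np

def project_out(x, U):
    """Rank-k projection ablation: zero every component of x inside span(U rows)."""
    return x - (x @ U.T) @ U

def steer(x, directions, alpha=3.0):
    """Subtract alpha times the sum of unit-normalized directions (input-independent offset)."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return x - alpha * unit.sum(axis=0)

def reconstruction_gap(dnll_sae, dnll_das):
    """G = 1 - dNLL_SAE / dNLL_DAS; equals 1 when steering has no effect."""
    return 1.0 - dnll_sae / dnll_das

# Random stand-ins for a rank-4 subspace and one activation (toy dimension).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 4)))
U, x = Q.T, rng.standard_normal(64)

x_abl = project_out(x, U)        # no energy left in the subspace
x_steer = steer(x, U)            # shifted by a single fixed vector
gap = reconstruction_gap(0.0, 69.1)
```

After `project_out` the subspace coordinates are exactly zero for any input, while `steer` moves every input by the same offset and leaves input-dependent variation inside the subspace untouched.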

### Transcoder vs MLP SAE\.

GemmaScope pre-trained layer transcoders (`google/gemma-scope-2b-pt-transcoders`, $16$K JumpReLU features) map MLP inputs to outputs through a sparse bottleneck, a fundamentally different training objective from reconstruction-based SAEs. If the $4.7\%$ DAS coverage by MLP SAEs reflected a dataset bias in SAE training (Chanin et al., [2025](https://arxiv.org/html/2605.29126#bib.bib40)), transcoders should recover different features with higher DAS alignment. We find the opposite: both dictionaries fail equally.

Table 8: Weight-space DAS alignment: transcoder (TC) vs MLP SAE at circuit layers. Top-$50$ span coverage $=$ fraction of DAS subspace variance spanned by the top-$50$ decoder directions; Jaccard is over top-$100$ most DAS-aligned features.

Span coverage is $5$–$9\%$ for both dictionaries; neither comes close to covering the rank-$4$ DAS subspace. The near-zero Jaccard ($\leq 0.010$) means the two dictionaries identify *completely different* features as most DAS-aligned, yet both fail equally at causal steering: individual features produce $|\Delta\mathrm{NLL}|<0.008$ for both TC and SAE at L20, and group steering ($k=5$) is indistinguishable from random controls ($|\Delta\mathrm{NLL}|<0.005$). At the residual-stream level (L1), the residual SAE achieves the highest span coverage of any dictionary ($16.6\%$) vs. the L1 transcoder ($5.2\%$), but neither produces measurable steering effects.
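One plausible reading of "span coverage" is sketched below; the paper does not give the formula, so this definition is an assumption. The decoder directions are orthonormalized with a QR factorization, and coverage is the average squared norm of each DAS basis vector's projection onto their span. The sketch further assumes the decoder rows are linearly independent and the DAS basis has orthonormal rows; the data are random stand-ins.

```python
import numpy as np

def span_coverage(decoders, U):
    """Fraction of span(U) captured by the span of the decoder rows.

    decoders: (n, d) linearly independent directions; U: (k, d) with orthonormal rows.
    """
    Q, _ = np.linalg.qr(decoders.T)        # (d, n), orthonormal basis of the decoder span
    return float(np.sum((U @ Q) ** 2) / U.shape[0])

# Random stand-ins: a rank-4 subspace, directions orthogonal to it, and 50 generic directions.
rng = np.random.default_rng(0)
d, k = 256, 4
Q_full, _ = np.linalg.qr(rng.standard_normal((d, k + 5)))
U = Q_full[:, :k].T                        # stand-in DAS basis
orth = Q_full[:, k:].T                     # five directions orthogonal to U
generic = rng.standard_normal((50, d))
cov_generic = span_coverage(generic, U)
```

Under this definition coverage is $1$ when the decoders contain the subspace and $0$ when they are orthogonal to it, so values of a few percent indicate that the selected decoders capture little of the rank-$4$ subspace.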

### Transcoder MLP pipeline gradient\.

Beyond span coverage, transcoders validate the *directionality* of the MLP pipeline. Across L18–L25, transcoder DAS alignment increases monotonically (mean Haar ratio: $0.97\times$ at L18–19 $\to$ $1.09\times$ at L24–25; slope $+0.019$/layer) while probe alignment decreases ($1.05\times \to 0.98\times$; slope $-0.011$/layer), mirroring the MLP SAE read$\to$write transition (Fig. [24](https://arxiv.org/html/2605.29126#A49.F24)). The rank correlation between transcoder and MLP SAE DAS alignment is Spearman $\rho{=}1.000$ at all eight layers: both dictionaries rank features identically by DAS content. Peak transcoder DAS enrichment is $19.4\times$ Haar at L25 (cf. $18.5\times$ for MLP SAEs). Logit-lens projection of the top DAS-aligned transcoder features at L24–L25 promotes copula tokens (*is*, *was*, Rus. *yavlyayetsya*; feature #10264 at L24, $14.0\times$ DAS; feature #7290 at L25, $15.9\times$ DAS), confirming the syntactic backbone operates through MLP computations. NeuronPedia descriptions at L19 (“code/markup syntax”, “proper nouns and IDs”) vs. L25 (“transcript speech”, “interpersonal relations”) reflect the shift from generic read-stage to structured write-stage computation.

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/tc_L20_14897.png)

(a) Transcoder L20 #14897 (read$\to$write boundary)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/tc_L25_7290.png)

\(b\) Transcoder L25 \#7290 \(peak DAS enrichment\)

Figure 24: GemmaScope transcoder features along the MLP pipeline. (a) Feature #14897 at L20 marks the read$\to$write transition where probe alignment peaks and DAS alignment begins rising. (b) Feature #7290 at L25 has $15.9\times$ DAS Haar ratio and promotes copula tokens (*is*, *was*) in logit-lens projection, confirming transcoders independently identify the same syntactic backbone as MLP SAEs.

### Group decoder steering.

DAS-aligned L1 feature groups of size $k\in\{1,2,3,5,10\}$ with $\alpha{=}3.0$ all produce $|\Delta\mathrm{NLL}|<0.025$, with bootstrap $95\%$ CIs crossing zero at every $k$. Random-feature controls are indistinguishable. No super-additivity is detected via steering (SA ratio $\leq 1.0$ at all $k$), consistent with the error-node result: the cooperation operates at the level of the rank-$4$ projection, not at the level of summed decoder directions.

### Frozen\-attention steering\.

Steering $10$ DAS-aligned features at L1 with $\alpha{=}3.0$ produces a base effect of only $\Delta\mathrm{NLL}{=}{+}0.011$, three orders of magnitude below the DAS ablation. Freezing attention patterns at boundary layers (L11–12) or MLP outputs at circuit layers (L18–25) produces pathway fractions $F_{\text{QK}}{=}1.26$ and $F_{\text{MLP}}{=}2.12$ with bootstrap CIs spanning $[-1.4, 3.8]$ and $[1.2, 10.0]$ respectively. The base effect is too small for meaningful pathway decomposition. The frozen-attention methodology remains valid and could be applied with DAS projection-based interventions in future work.

### Structural interpretation\.

The five experiments converge on a single conclusion: the rank-$4$ DAS subspace implements a cooperative mechanism that resists decomposition into any sparse feature basis, whether SAE or transcoder, residual or MLP-specific. The key distinction is between *projection* (removing all variance in a multi-dimensional subspace, which DAS ablation does) and *perturbation* (shifting the activation along a single direction, which decoder steering does). For a rank-$1$ causal mechanism, the two operations are equivalent. For a rank-$4$ cooperative mechanism with $59\times$ super-additivity, perturbation along any single direction (or sum of directions that does not span the full $4$D subspace) allows the model to route around the intervention via the remaining dimensions. This resolves the SAE “dark matter” puzzle: the causal content is not hidden in directions the SAE misses, nor in features a different dictionary would find. It resides in the *joint* structure of a $4$-dimensional subspace that no $1$-dimensional intervention can disrupt.
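The projection/perturbation distinction can be made concrete with a small NumPy sketch. It is illustrative only: the basis `U` is a random orthonormal stand-in for the DAS basis, and `project_out` and `steer` are hypothetical helper names, not functions from the paper's code.

```python
import numpy as np

def project_out(h, U):
    """Subspace ablation: remove every component of h inside span(U).
    U has orthonormal rows, shape (k, d)."""
    return h - U.T @ (U @ h)

def steer(h, direction, alpha):
    """Rank-1 perturbation: shift h by alpha along one unit direction."""
    e = direction / np.linalg.norm(direction)
    return h + alpha * e

rng = np.random.default_rng(0)
d, k = 2304, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
U = Q.T                                   # hypothetical stand-in for U_DAS
h = rng.standard_normal(d)                # hypothetical activation vector

h_ablated = project_out(h, U)
h_steered = steer(h, U[0], alpha=3.0)

coords_before = U @ h                     # the four subspace coordinates
coords_ablated = U @ h_ablated            # all four driven to zero
coords_steered = U @ h_steered            # only the first coordinate moves
print(np.round(coords_before, 3), np.round(coords_ablated, 3), np.round(coords_steered, 3))
```

Projection zeroes all four coordinates at once, whereas steering shifts one coordinate and leaves the other three untouched, which is the room the text says the model uses to compensate.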

Figures:fig\_error\_node\_analysis\.pdf\(Fig\.[25](https://arxiv.org/html/2605.29126#A49.F25)\),fig\_transcoder\_analysis\.pdf\(Fig\.[26](https://arxiv.org/html/2605.29126#A49.F26)\),fig\_group\_steering\.pdf\(Fig\.[27](https://arxiv.org/html/2605.29126#A49.F27)\),fig\_frozen\_attention\_steering\.pdf\(Fig\.[28](https://arxiv.org/html/2605.29126#A49.F28)\)\.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x25.png)

Figure 25: Error-node analysis. (A) Reconstruction gap $G(k)$ vs. top-$k$ SAE features per DAS direction. $G{=}1.00$ at every depth: SAE-reconstructed directions capture zero causal effect via steering. (B) $\Delta$NLL comparison at the best-$k$ SAE reconstruction. DAS full ablation ($+69.1$) dwarfs all steering conditions.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x26.png)

Figure 26: Transcoder vs MLP SAE. (A) DAS subspace span coverage (top-$50$ features) across MLP layers: both dictionaries achieve $5$–$9\%$. (B) L1 residual-stream comparison. (C) Steering comparison at L20: both TC and SAE features produce $\Delta$NLL indistinguishable from random controls.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x27.png)

Figure 27: Group decoder steering. (A) $\Delta$NLL by group size $k$: DAS-aligned and random feature groups are indistinguishable at every $k$. (B) Super-additivity ratio SA($k$): no super-additive cooperation emerges via decoder-direction steering.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x28.png)

Figure 28: Frozen-attention steering. (A) $\Delta$NLL across seven conditions; all values $|\Delta\mathrm{NLL}|<0.1$, three orders of magnitude below DAS ablation. (B) Pathway decomposition pie chart (uninterpretable due to noise-floor base effect).

## Appendix S50 Supplement: GemmaScope SAE feature dissociation

We use GemmaScope 16K residual-stream SAEs (Lieberum et al., [2024](https://arxiv.org/html/2605.29126#bib.bib12)) to ground the geometric probe–DAS dissociation at the feature-dictionary level. For each SAE feature $j$ with unit-norm decoder direction $\hat{e}_j\in\mathbb{R}^{d}$, we compute:

$$\text{DAS-align}(j)=\|U_{\mathrm{DAS}}\,\hat{e}_j\|^{2}\quad\text{and}\quad\text{probe-align}(j)=\|W_{\mathrm{probe}}\,\hat{e}_j\|^{2}.$$

Across all $n_{\text{feat}}{=}16\,384$ features at $L^{\star}{=}1$:

- The top-100 DAS-aligned features and top-100 probe-aligned features share zero overlap (Jaccard $=0.000$; bootstrap 95% CI: $[0.000, 0.000]$; random null: $0.003$).
- Pearson correlation between per-feature DAS and probe alignment: $r=-0.037$ ($p=2.9\times 10^{-6}$), indicating a weak anti-correlation: features are slightly *less* probe-aligned when more DAS-aligned.
- SAE reconstruction quality at $L^{\star}{=}1$ on Set-A prompts: $92.1\%$ variance explained (GemmaScope is in-distribution for these prompts).
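The per-feature alignment scores and the overlap statistics listed above can be sketched in NumPy. This is an illustration on synthetic data under stated assumptions: both bases are given orthonormal rows, the decoder matrix is random rather than loaded from GemmaScope, the sizes are reduced for speed, and all function names are hypothetical.

```python
import numpy as np

def orthonormal_rows(rng, k, d):
    """Random (k, d) matrix with orthonormal rows (stand-in for a learned basis)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T

def alignment(basis, decoders):
    """||basis @ e_j||^2 for every unit-norm decoder row e_j."""
    return np.sum((decoders @ basis.T) ** 2, axis=1)

def top_k_jaccard(a, b, k=100):
    """Jaccard index between the index sets of the k largest entries of a and b."""
    top_a = set(np.argsort(a)[-k:].tolist())
    top_b = set(np.argsort(b)[-k:].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

rng = np.random.default_rng(0)
d, n_feat = 256, 2000                     # toy sizes; the text uses d=2304, n_feat=16384
U_das = orthonormal_rows(rng, 4, d)       # hypothetical DAS basis
W_probe = orthonormal_rows(rng, 2, d)     # hypothetical probe basis
decoders = rng.standard_normal((n_feat, d))
decoders /= np.linalg.norm(decoders, axis=1, keepdims=True)

das_align = alignment(U_das, decoders)
probe_align = alignment(W_probe, decoders)
jaccard = top_k_jaccard(das_align, probe_align, k=100)
pearson_r = float(np.corrcoef(das_align, probe_align)[0, 1])
print(f"Jaccard(top-100) = {jaccard:.3f}, Pearson r = {pearson_r:+.3f}")
```

For random unit directions the mean alignment with a rank-$k$ basis is $k/d$, so the synthetic scores here play the role of a null; real decoder matrices would replace `decoders`.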

### Semantic audit via NeuronPedia\.

We query NeuronPedia (Lin, [2023](https://arxiv.org/html/2605.29126#bib.bib1)) for AI-generated descriptions and logit promotion of the top SAE features aligned with each direction.

Probe direction ($L{=}1$): Feature #12499 (probe-aligned; NeuronPedia: “references to specific months and their associated frequencies”) promotes *month*, *Month*, *MONTH* in its logits. Top activating examples: “each year in the month of October in NSW”, “bounced back in the month of February”, “the first month of the season”. This feature encodes calendar *position* (which month), not duration.

DAS mediator ($L{=}24$, relay hub): Feature #2309 (DAS-aligned; NeuronPedia: “references to quantities of time and duration”) promotes *months*, *weeks*, *days*, *years* in its logits, exactly the vocabulary used to express duration answers. Top activating examples: “within the past 24 hours”, “get back in a few weeks”, “almost 10 years now”. This feature encodes duration *quantity*, not calendar position.

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/probe_L1_12499.png)

\(a\) Probe\-aligned: \#12499 \(“specific months”\)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/das_L1_14703.png)

\(b\) DAS\-aligned: \#14703 \(“the word ‘is’ ”\)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/das_L24_2309.png)

\(c\) DAS\-aligned L24: \#2309 \(“quantities of time”\)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/relay_L12_15596.png)

\(d\) Relay L12: \#15596 \(“forms of ‘to be’ ”\)

Figure 29: NeuronPedia feature audit. Probe-aligned feature #12499 (a) activates on calendar months and promotes *month*/*Month* in its logits. DAS-aligned features encode syntactic structure: #14703 (b) at $L^{\star}{=}1$ activates on copula “is”, #2309 (c) at relay hub $L{=}24$ promotes duration vocabulary, and relay midpoint #15596 (d) at $L{=}12$ carries the copula chain. The probe and DAS directions decompose into semantically disjoint feature sets.

### Cross-context causal attribution.

We ran a GPU experiment comparing feature activations and WU-gradient attribution across 20 duration prompts and 15 control copula prompts (geographic: “The capital of France is ”, arithmetic: “2 plus 2 is ”, descriptive). Results:

- DAS feature mean attribution: $2.41$ (non-zero; 20 duration prompts).
- Probe feature mean attribution: $0.000$ (zero to the reported precision).
- Ratio DAS/probe: $2.4\times 10^{9}$ (effectively $\infty$: probe features carry no measurable causal signal to the duration output logit).
- Jaccard(top-50 attribution, top-50 probe): $0.000$, i.e. no shared features.

The DAS features activate more on generic “is” contexts than on long duration prompts (activation specificity $0.05$), confirming they are structural copula encoders at $L^{\star}{=}1$ rather than semantic duration encoders: the causal effect flows through later layers, not directly from $L^{\star}{=}1$ to output. The probe features, by contrast, have *zero attribution* regardless of context.

### Interpretation\.

At $L^{\star}{=}1$, the DAS-aligned features are primarily structural (copula “is” position; see §[5](https://arxiv.org/html/2605.29126#S5)), encoding the computational context of a duration query rather than its semantic content. Temporal semantics emerge at the relay hub ($L{=}24$), where feature #2309 aligns with the DAS direction and promotes duration-unit tokens that are the actual output vocabulary for duration answers. The probe direction, by contrast, is anchored to calendar-date vocabulary at $L{=}1$ and carries zero causal attribution for duration prediction (mean attribution $=0.000$). This validates the paper’s core claim: the 88° probe–DAS angle reflects a real functional dissociation between *when* (probe) and *how long* (DAS), confirmed at three levels: geometry (principal angles), SAE feature identity (Jaccard $=0.000$), and causal attribution (probe $=0$, DAS $>0$).

### Completeness and selectivity of SAE features\.

Following the framework of Cunningham et al. ([2024](https://arxiv.org/html/2605.29126#bib.bib11)), we assess *completeness* (fraction of the causal effect captured by the feature set) and *selectivity* (fraction of feature activations attributable to the target concept). *Completeness is low*: individual feature ablation yields $|\Delta\mathrm{NLL}|<0.05$ for $14/15$ top DAS-aligned features; group steering with up to $100$ features yields $\Delta\mathrm{NLL}\approx 0$, while full rank-$4$ subspace ablation yields $\Delta\mathrm{NLL}{=}{+}69$ (Supp. [S49](https://arxiv.org/html/2605.29126#A49)). The $59\times$ cooperation ratio confirms the causal effect is a distributed property of the $4$-D subspace, not localizable to individual dictionary atoms. *Selectivity is also low*: of the $16{,}384$ GemmaScope features at $L^{\star}$, only $50.8\%$ of temporal features exceed the Haar null for DAS alignment (binomial $p{=}0.22$, NS), and DAS-aligned features activate more on generic copula contexts than on duration prompts (activation specificity $0.05$). Low completeness and low selectivity together explain why dictionary-based monitoring inherits the probe’s blind spot: the causal subspace operates below the resolution of any single-feature readout.

### Cross\-layer validation\.

Nine experiments validate the structural-backbone interpretation. *Semantic census*: querying NeuronPedia for temporal features across all 26 layers yields $2{,}463$ features, of which only $50.8\%$ exceed the Haar null for DAS alignment (binomial $p{=}0.22$, NS); temporal features are not preferentially DAS-aligned. *Feature stitching*: cross-layer handoff matrices ($h_{jk}=W_{\text{dec}}^{L_i}[j]\cdot W_{\text{enc}}^{L_{i+1}}[:,k]$) show DAS$\to$DAS enrichment of $5.8\times$ (L1$\to$L11), $6.7\times$ (L11$\to$L12), and $2.5\times$ (L12$\to$L24), all $p<10^{-8}$ (Mann–Whitney). The top relay chains are syntactically labeled end-to-end: #14703 (“is”) $\to$ #190 (“verbs of being”) $\to$ #15596 (“to be”) $\to$ #16044 (“states of being”). *Prompt-level co-activation*: encoding $500$ prompts through SAEs at layers $[1,11,12,24]$, DAS$\to$DAS co-activation is $7.8\times$ DAS$\to$Random ($p<0.001$, bootstrap). *Logit attribution chain*: numeric-token Z-score *decreases* from $L{=}1$ ($+0.15$) to $L{=}24$ ($-0.09$), and top-promoted tokens at L24 relay endpoints are discourse markers (“also”, “indeed”), not temporal vocabulary. Together, these results confirm the DAS subspace acts as a structural backbone: individual features encode the syntactic copula frame while the $4$-dimensional subspace carries temporal information cooperatively ($59\times$ cooperation ratio, $G(k){=}1.00$ reconstruction gap).

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/steer_probe_L1_12499.png)

(a) Amplify probe #12499 ($\alpha{=}{+}150$)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/steer_das_L24_2309.png)

(b) Amplify DAS #2309 ($\alpha{=}{+}150$)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/steer_das_L1_14703_neg.png)

(c) Suppress DAS #14703 ($\alpha{=}{-}150$)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/steer_relay_L12_15596_neg.png)

(d) Suppress relay #15596 ($\alpha{=}{-}150$)

Figure 30: NeuronPedia steering demonstrations. (a) Amplifying probe feature #12499 at high strength floods the output with calendar-month vocabulary: pure month fixation. (b) Amplifying DAS duration feature #2309 at $L{=}24$ produces obsessive duration-unit enumeration (days, weeks, months, seconds). (c) Strongly suppressing copula feature #14703 causes near-total generation collapse, confirming the syntactic backbone is essential for coherent temporal output. (d) Strongly suppressing relay midpoint #15596 degrades natural language into code fragments and HTML tags; the model loses its linguistic scaffolding entirely. Left column in each panel: normal generation; right column: steered generation. Prompt: *“If someone was born on March 5 and today is June 10, the elapsed time is approximately.”*

### Dictionary robustness: temporal-specialist SAE at $L{=}12$.

A purpose-built temporal SAE (canrager/temporalSAEs; $9{,}216$ features, BatchTopK, Layer 12 residual stream; Lubana et al., [2026](https://arxiv.org/html/2605.29126#bib.bib33)) provides a dictionary-architecture stress test. Three findings:

(i) Relay chain overlap. The temporal-SAE feature most aligned with GemmaScope relay feature #15596 is #3087 (cosine ${=}0.59$; NeuronPedia: “forms of ‘to be’ ”), followed by #1157 (“It is”, cosine ${=}0.54$). The copula relay at $L{=}12$ is recovered in an independently trained temporal dictionary, so it is not merely a GemmaScope artifact.

(ii) DAS $>$ probe dissociation. $45.1\%$ of temporal-SAE features exceed the DAS Haar null vs. $34.6\%$ for probe (same direction as GemmaScope: $50.3\%$ vs. $32.0\%$). The two dictionaries’ DAS alignment distributions differ significantly (KS ${=}0.065$, $p<10^{-21}$), but both show DAS-dominant structure. Top DAS-aligned temporal-SAE features are copula-labeled (#3087: “forms of ‘to be’ ”; #1157: “It is”), while probe-aligned features carry temporal content (#8935: “Order entered date”).

(iii) Rank-1 bottleneck orthogonality. The `temporal_rank_1` variant compresses its attention through a $4$D-per-head bottleneck ($16$D total, $4$ heads). Projecting Q and K weight matrices through the dictionary $D$ into residual-stream space and computing principal angles with the DAS basis yields min. angle $80.0^{\circ}$ and DAS variance captured $1.1\%$: the temporal *prediction* bottleneck is nearly orthogonal to the temporal *computation* subspace. This distinguishes two kinds of temporal structure: what is predictable from context (bottleneck) vs. what causally mediates duration answers (DAS).

Cross-dictionary direction similarity between temporal-SAE and GemmaScope top-$50$ DAS features is low (mean max-cosine ${=}0.27$; $2/50$ matched at $>0.7$), confirming the two dictionaries decompose the same subspace using different atoms while agreeing on which *subspace directions* are enriched.

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/temp_L12_3087.png)

\(a\) Temporal\-SAE \#3087 \(“forms of ‘to be’ ”\)

![Refer to caption](https://arxiv.org/html/2605.29126v1/Figures/neuronpedia/temp_L12_8935.png)

\(b\) Temporal\-SAE \#8935 \(“Order entered date”\)

Figure 31: Temporal-specialist SAE features at $L{=}12$. (a) Feature #3087 from the Lubana et al. temporal SAE (canrager/temporalSAEs, $9{,}216$ features) is DAS-aligned and matches GemmaScope relay feature #15596 (cosine ${=}0.59$), recovering the copula relay in an independently trained dictionary. (b) Feature #8935 is probe-aligned and activates on date-entry contexts, encoding calendar position rather than duration; the probe–DAS dissociation holds across dictionary architectures.

## Appendix S51 Supplement: OV circuit decomposition

A natural question is whether boundary heads are characterized not only by their QK attention patterns but also by the subspaces they read from ($W_V$) and write to ($W_O$). We compute the DAS-alignment score $\|U_{\mathrm{DAS}}\,W_{OV}\|_F^{2}/\|W_{OV}\|_F^{2}$ for all $26\times 8=208$ heads and compare boundary vs. non-boundary heads.

### Result\.

Boundary heads achieve mean alignment $0.00211$ vs. $0.00181$ for non-boundary heads (ratio $1.17\times$; Mann–Whitney $p{=}4.2\times 10^{-3}$). Crucially, both values are barely above the Haar null $k/d=4/2304\approx 0.00174$: boundary heads are $1.21\times$ null while non-boundary heads are $1.06\times$ null. The $W_{OV}$ matrices therefore do *not* preferentially target the DAS mediator subspace.
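A minimal sketch of the alignment score and its $k/d$ null follows. It runs on random weights, not Gemma's, and assumes the convention $W_{OV}=W_O W_V$ acting on column vectors in the residual stream (so $U_{\mathrm{DAS}}W_{OV}$ measures the write side); the function name and toy dimensions are hypothetical.

```python
import numpy as np

def ov_das_alignment(U_das, W_OV):
    """||U_DAS W_OV||_F^2 / ||W_OV||_F^2: share of a head's OV energy written
    into the DAS subspace. U_das: (k, d) orthonormal rows; W_OV: (d, d)."""
    return float(np.sum((U_das @ W_OV) ** 2) / np.sum(W_OV ** 2))

rng = np.random.default_rng(0)
d, d_head, k, n_heads = 256, 32, 4, 16    # toy sizes; the text uses d=2304, 208 heads
q, _ = np.linalg.qr(rng.standard_normal((d, k)))
U_das = q.T                               # hypothetical stand-in for the DAS basis

scores = []
for _ in range(n_heads):
    W_V = rng.standard_normal((d_head, d))    # residual -> head
    W_O = rng.standard_normal((d, d_head))    # head -> residual
    scores.append(ov_das_alignment(U_das, W_O @ W_V))

haar_null = k / d                         # expected score for a random head
ratio_to_null = float(np.mean(scores)) / haar_null
paper_null = 4 / 2304                     # the k/d value quoted in the text
print(f"mean score / null = {ratio_to_null:.2f}; 4/2304 = {paper_null:.5f}")
```

Random heads land near a ratio of $1$, which is why the reported $1.21\times$ and $1.06\times$ are read as close to chance even though the boundary/non-boundary difference is statistically significant.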

### Interpretation\.

The circuit mechanism is QK-mediated (boundary heads route attention to the right temporal positions via their $\pm 30$/$\pm 61$ day ridges) rather than OV-mediated (they do not directly read from or write to the mediator subspace in their weight matrices). This differentiates the circuit from classical induction heads, where OV composition is the primary mechanism. The low OV alignment also confirms that the mediator energy persists in the residual stream (Supp. [S43](https://arxiv.org/html/2605.29126#A43)) via a residual-stream skip, not via repeated attention-head read/write operations.

### Attention pattern rank structure\.

Wang et al. ([2025](https://arxiv.org/html/2605.29126#bib.bib37)) report that attention outputs are low-rank across families and scales. This is consistent with three observations in our circuit. (i) The effective mediator dimension saturates at $k\approx 4$–$6$ across $d\in\{1536, 2304, 3584\}$ (Supp. [S28](https://arxiv.org/html/2605.29126#A28)): the causal subspace is far lower-rank than the ambient dimension, and Prop. [3](https://arxiv.org/html/2605.29126#Thmproposition3)’s $\rho_k^{\text{null}}\asymp d/k$ ensures specificity strengthens with width at fixed $k$. (ii) Boundary-head $W_{QK}$ matrices project temporal structure through a $d_{\text{head}}{=}256$ bottleneck, and the QK-twist ridges at $\{30,61\}$ days are rank-$1$ periodic patterns in that space (each ridge is determined by a single offset frequency). The circuit therefore implements low-rank temporal routing through a composition of QK attention (low-rank pattern selection) and residual-stream skip (low-rank signal propagation), consistent with the low-rank attention hypothesis. (iii) The $4$-D DAS subspace is $\leq 2$ dimensions per GQA key-value group ($n_{\text{kv}}{=}4$ for Gemma 2 2B), suggesting the causal bottleneck may be the key-value rank itself.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x29.png)

Figure 32: OV circuit decomposition. (A) Heatmap of $W_{OV}$ DAS-alignment per head; boundary heads (gold squares) show only a modest $1.17\times$ enrichment. (B) Violin comparison; the effect is statistically significant ($p{=}0.004$) but substantively small, confirming the circuit is QK-mediated rather than OV-mediated.

## Appendix S52 Supplement: attribution flow graph

To visualize the full circuit structure, we build a directed attribution graph over all $208$ attention heads. Each edge $(h_1,h_2)$ carries weight $\mathrm{AP}(h_1)\cdot\mathrm{AP}(h_2)\cdot\mathrm{QK\text{-}align}(h_1,h_2)$, where $\mathrm{QK\text{-}align}$ is the DAS-projected QK inner product normalized by Frobenius norms. We retain the top edges for visualization, colored by flow type.

### Key findings (Fig. [33](https://arxiv.org/html/2605.29126#A52.F33)).

- Encoding zone (L$0$–$5$): L1H7 (the boundary head at $L^{\star}{=}1$) is the DAS mediator anchor.
- Processing zone (L$6$–$15$): L7H7 is the dominant routing hub; L11H4 and L12H6 are the early causal bottleneck (consistent with cascading ablation, Supp. [S46](https://arxiv.org/html/2605.29126#A46)).
- Output zone (L$16$–$25$): L24H2 is the highest-AP relay hub, receiving convergent signal from multiple processing heads and redistributing to QK-twist boundary heads L25H5, L24H4, L23H1.

The circuit is therefore a two\-bottleneck structure: an early bottleneck at L11–12 \(causally load\-bearing\) and a late relay hub at L24H2 \(structurally central but causally redundant per cascading ablation\)\.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x30.png)

Figure 33: Attribution flow graph. Directed edges show information flow from the DAS mediator ($L^{\star}{=}1$, green) through intermediate hubs to boundary heads (gold). Gold arrows: flow to boundary heads. Coral arrows: relay through L24H2 hub. Node size proportional to AP score.

## Appendix S53 Supplement: probe-based monitoring stress tests

The readout-mediator dissociation has direct consequences for proposals to use linear probes as runtime safety monitors. If a probe’s subspace is near-orthogonal to the causal mediator, a monitor built on that probe can report high confidence while the model’s actual computation shifts entirely. This is the temporal analog of a deception probe that is satisfied while the model’s internal mechanism has changed. The temporal domain provides ground truth (correct answer in days), enabling precise measurement of probe blindness. Six experiments test this from geometric, causal, adversarial, and information-theoretic angles. Table [9](https://arxiv.org/html/2605.29126#A53.T9) summarizes the battery; subsections below give full protocols and results.

Table 9: Probe-based monitoring stress tests. All experiments use Gemma 2 2B at $L^{\star}{=}1$, $k{=}4$ (DAS). $\dagger$ $n{=}200$; cf. $\rho{=}1050$ at $n{=}332$ (Supp. [S11](https://arxiv.org/html/2605.29126#A11)).

### Cross\-probe universality \(Exp\. 116\)\.

One might argue that the $88^{\circ}$ angle is specific to the circular probe’s harmonic encoding. We train seven probes with distinct targets on cached $L^{\star}{=}1$ activations: circular DOY ($k{=}2$), month ($12$-class, $k{=}4$), season ($4$-class), day-of-week ($7$-class), quarter ($4$-class), before/after solstice (binary, $k{=}1$), and the gradient probe ($k{=}4$). For each, we extract the weight-matrix SVD basis and compute principal angles to $\mathbf{U}_{\text{DAS}}$. Table [10](https://arxiv.org/html/2605.29126#A53.T10) reports results. All seven land within $2.8^{\circ}$ of their respective Haar nulls ($\arccos\sqrt{k/d}$), with the gradient probe showing the largest deviation ($84.8^{\circ}$ vs. $87.6^{\circ}$) because it is optimized for $\nabla_h\text{NLL}$ rather than $R^{2}$. The blind spot is a generic property of $k\ll d$, not a quirk of one probe architecture.
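The principal-angle computation and the $\arccos\sqrt{k/d}$ reference value can be sketched as below. This is a generic NumPy illustration using QR plus SVD on random matrices; it is not the paper's pipeline, and the function names are hypothetical.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (degrees, ascending) between the row spaces of A and B."""
    Qa, _ = np.linalg.qr(A.T)             # orthonormal basis for rowspace(A)
    Qb, _ = np.linalg.qr(B.T)             # orthonormal basis for rowspace(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

def haar_null_deg(k, d):
    """Reference angle arccos(sqrt(k/d)) quoted in the text."""
    return float(np.degrees(np.arccos(np.sqrt(k / d))))

rng = np.random.default_rng(0)
d = 2304
U_das = rng.standard_normal((4, d))       # hypothetical rank-4 mediator basis
W_probe = rng.standard_normal((2, d))     # hypothetical rank-2 circular probe

angles = principal_angles_deg(W_probe, U_das)
null_k4 = haar_null_deg(4, d)             # about 87.6 degrees
null_k1 = haar_null_deg(1, d)             # about 88.8 degrees
print(np.round(angles, 2), round(null_k4, 2), round(null_k1, 2))
```

Two unrelated random subspaces in $d{=}2304$ already sit within a few degrees of $90^{\circ}$, so a measured angle near $88^{\circ}$ is informative only relative to this null.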

Table 10: Principal angles between seven probe bases and the DAS subspace ($k_{\text{DAS}}{=}4$, $d{=}2304$).

![Refer to caption](https://arxiv.org/html/2605.29126v1/x31.png)

Figure 34: Cross-probe universality. (A) Measured angle vs. Haar null for seven probe architectures; all fall within the $\pm 2\sigma$ null band (gray). (B) Subspace overlap ($\sum\cos^{2}\theta_i/\min(k_1,k_2)$) between each probe and DAS; all are $\leq 2.2\%$. (C) Probe accuracy/$R^{2}$ vs. angle to DAS: high performance does not predict causal alignment.

### Adversarial subspace injection (Exp. 112).

Exps\. 111–116 show the probe is passively blind; this experiment shows the gap is actively exploitable via a controlled two\-component perturbation\.

*Construction.* Let $U_M\in\mathbb{R}^{4\times 2304}$ be the DAS basis and $W_P\in\mathbb{R}^{2\times 2304}$ the circular-probe weight rows. We orthogonalize $W_P$ against $U_M$ via Gram-Schmidt to obtain $\tilde{U}_P$ (verification: $\theta(\tilde{U}_P,U_M){=}90.0^{\circ}$, removing the ${\sim}0.2\%$ $\cos^{2}$ leakage). For each test prompt with ground-truth day-of-year $d$, define a *target* day $d^{\star}=((d+179)\bmod 365)+1$ (${\sim}180$ days away) and construct:

$$\mathbf{x}_{\mathrm{adv}}=\mathbf{x}+\underbrace{\alpha\,U_M^{\top}U_M\bigl(\boldsymbol{\mu}_d-\boldsymbol{\mu}_{d^{\star}}\bigr)}_{\Delta_{\mathrm{DAS}}:\;\text{shifts mediator toward }d^{\star}}+\underbrace{\beta\,\tilde{U}_P^{\top}\bigl(\mathbf{y}_d-\hat{\mathbf{y}}\bigr)}_{\Delta_{\mathrm{probe}}:\;\text{restores probe to }d},\tag{5}$$

where $\boldsymbol{\mu}_d$ is the mean activation at day $d$ (averaged over $10$ templates), $\mathbf{y}_d=[\sin(2\pi d/365),\,\cos(2\pi d/365)]$ is the correct $\sin$/$\cos$ target, and $\hat{\mathbf{y}}=\tilde{U}_P\mathbf{x}_{\mathrm{scaled}}+b$ is the current probe prediction. The DAS component $\Delta_{\mathrm{DAS}}$ is the projection of $(\boldsymbol{\mu}_d-\boldsymbol{\mu}_{d^{\star}})$ onto $U_M$ scaled by $\alpha$; it lives entirely within the $4$-dimensional mediator and is invisible to the orthogonalized probe. The probe component $\Delta_{\mathrm{probe}}$ is the least-norm correction that pushes the probe readout toward the correct date; it lives entirely within $\tilde{U}_P$ and cannot affect the mediator.
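The decoupling in Eq. (5) can be checked numerically with a toy sketch. Everything here is synthetic and hypothetical: random bases stand in for $U_M$ and $W_P$, the probe is taken to be the orthonormalized $\tilde{U}_P$ with zero bias and no input scaling, and $\beta{=}1$ is chosen because under those assumptions it restores the readout exactly.

```python
import numpy as np

def orthonormal_rows(M):
    """Orthonormalize the rows of M (same row space)."""
    q, _ = np.linalg.qr(M.T)
    return q.T

def build_adversarial(x, U_M, U_P, mu_d, mu_target, y_d, y_hat, alpha, beta):
    """Eq. (5): a mediator-only shift plus a probe-only correction."""
    delta_das = alpha * U_M.T @ (U_M @ (mu_d - mu_target))
    delta_probe = beta * U_P.T @ (y_d - y_hat)
    return x + delta_das + delta_probe, delta_das, delta_probe

rng = np.random.default_rng(0)
d = 2304
U_M = orthonormal_rows(rng.standard_normal((4, d)))      # stand-in DAS basis
W_P = rng.standard_normal((2, d))                        # stand-in probe rows
# Remove the U_M component from each probe row, then orthonormalize.
U_P = orthonormal_rows(W_P - (W_P @ U_M.T) @ U_M)

day = 100
target_day = ((day + 179) % 365) + 1                     # 280 for day 100
x = rng.standard_normal(d)
mu_d, mu_target = rng.standard_normal(d), rng.standard_normal(d)
y_d = np.array([np.sin(2 * np.pi * day / 365), np.cos(2 * np.pi * day / 365)])
y_hat = U_P @ x                                          # toy probe readout, bias 0

alpha, beta = 3.0, 1.0
x_adv, delta_das, delta_probe = build_adversarial(
    x, U_M, U_P, mu_d, mu_target, y_d, y_hat, alpha, beta)

mediator_shift = U_M @ (x_adv - x)                       # depends on alpha only
probe_after = U_P @ x_adv                                # equals y_d when beta = 1
print(np.round(mediator_shift, 3), np.round(probe_after - y_d, 12))
```

Because the two bases are mutually orthogonal, the mediator shift depends only on $\alpha$ and the probe readout only on $\beta$, which is the decoupling the construction relies on.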

*Perturbation budget.* Both $\Delta_{\mathrm{DAS}}$ and $\Delta_{\mathrm{probe}}$ inherit their scale from the activation geometry: the inter-date signal $\|\boldsymbol{\mu}_d-\boldsymbol{\mu}_{d^{\star}}\|$ is ${\sim}1.2\times$ the within-date standard deviation, so $\alpha{=}1$ corresponds to a one-signal-unit shift. At $(\alpha{=}3,\,\beta{=}2)$, the total perturbation norm $\|\mathbf{x}_{\mathrm{adv}}-\mathbf{x}\|/\|\mathbf{x}\|$ averages $3.6\%$ of the activation norm. A norm-matched random perturbation (isotropic Gaussian scaled to the same $\ell_2$ budget) shifts DAS content by $23.4$ days on average ($n{=}50$ DOYs, $200$ random draws). This is expected, since a random direction in $\mathbb{R}^{2304}$ has nonzero projection onto the $4$-D mediator. The targeted perturbation displaces $3\times$ further ($71$ vs. $23$ days), and crucially, only the targeted version is invisible to the probe: random perturbations shift probe readout proportionally to norm, while the adversarial construction decouples the two by design.

*Evaluation.* We sweep $\alpha\in\{0,1,3\}$, $\beta\in\{0,1,2\}$ on $n{=}50$ evenly-spaced test DOYs ($d\in\{1,8,15,\ldots,358\}$). Probe error is measured as circular angular RMSE (atan2 on the first harmonic); mechanism displacement is the nearest-neighbor DOY in DAS coordinates ($U_M\mathbf{x}_{\mathrm{adv}}$ matched to the $365$-point reference set $\{U_M\boldsymbol{\mu}_j\}_{j=1}^{365}$).
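The two evaluation metrics can be sketched as follows. This is a hedged illustration on synthetic data: the function names are hypothetical, the probe output is assumed to be ordered as $(\sin, \cos)$ to match $\mathbf{y}_d$, and the reference means are random rather than cached activations.

```python
import numpy as np

def probe_doy(y_hat, period=365):
    """Decode day-of-year from a (sin, cos) readout via atan2."""
    angle = np.arctan2(y_hat[..., 0], y_hat[..., 1]) % (2 * np.pi)
    return angle * period / (2 * np.pi)

def circular_rmse_days(pred, true, period=365):
    """RMSE of day differences wrapped into [-period/2, period/2)."""
    diff = (pred - true + period / 2) % period - period / 2
    return float(np.sqrt(np.mean(diff ** 2)))

def nearest_doy(U_M, x_adv, mu):
    """1-indexed day whose mean activation is closest to x_adv in DAS coordinates.
    mu: (365, d) per-day mean activations; U_M: (k, d)."""
    reference = mu @ U_M.T                # (365, k) reference coordinates
    z = U_M @ x_adv                       # (k,) coordinates of the query
    return int(np.argmin(np.sum((reference - z) ** 2, axis=1))) + 1

# Toy checks on synthetic inputs.
day = 100
y = np.array([np.sin(2 * np.pi * day / 365), np.cos(2 * np.pi * day / 365)])
decoded = float(probe_doy(y))
wrap_err = circular_rmse_days(np.array([364.0]), np.array([1.0]))   # 2 days, not 363

rng = np.random.default_rng(0)
d, k = 128, 4
q, _ = np.linalg.qr(rng.standard_normal((d, k)))
U_M = q.T
mu = rng.standard_normal((365, d))
matched = nearest_doy(U_M, mu[199], mu)   # query with the day-200 mean itself
print(decoded, wrap_err, matched)
```

Wrapping the difference before squaring matters near the year boundary: day 364 and day 1 are two days apart, not 363.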

*Results.* At $(\alpha{=}3,\,\beta{=}2)$, the DAS nearest-neighbor is displaced by $70.8$ days while probe angular error is $5.7$ days (Fig. [6](https://arxiv.org/html/2605.29126#S6.F6)). Crucially, $\alpha$ and $\beta$ are decoupled: DAS error scales linearly with $\alpha$ ($0\to 24\to 71$ days) and is invariant to $\beta$; probe error is invariant to $\alpha$ and decreases with $\beta$. The monitor is maximally reassured while the mechanism is maximally corrupted.

### Mutual information between probe and mechanism \(Exp\. 114\)\.

Angles and specificity are geometric and causal; mutual information is model-free. We compute three per-prompt scalar summaries on the $365$ mean-per-DOY activations at $L^{\star}{=}1$: $z_{\text{probe}}$ (circular probe predicted DOY), $z_{\text{DAS}}$ ($\|\mathbf{U}_{\text{DAS}}\mathbf{x}\|^{2}$, DAS energy), and $z_{\text{DOY}}$ (ground truth). The KSG estimator (Kraskov et al., [2004](https://arxiv.org/html/2605.29126#bib.bib43)) with $k{=}5$ neighbors gives $I(z_{\text{probe}};\,z_{\text{DAS}})=0.000$ nats, while $I(z_{\text{probe}};\,z_{\text{DOY}})=3.99$ nats. A phase-shuffle null ($n{=}200$) yields $p{=}1.0$: the observed MI is indistinguishable from noise (Fig. [35](https://arxiv.org/html/2605.29126#A53.F35)). The probe carries temporal information but no measurable information about the mechanism’s state.

![Refer to caption](https://arxiv.org/html/2605.29126v1/x32.png)

Figure 35: Mutual information. (A) MI bar chart: $I(\text{probe};\,\text{DAS})$ and $I(\text{DAS};\,\text{DOY})$ are at the null floor; only $I(\text{probe};\,\text{DOY})$ is non-trivial. (B) Phase-shuffle null distribution with observed MI (red line) firmly within it ($p{=}1.0$). (C) Scatter of probe DOY vs. DAS energy: no structure.

### Extended specificity battery (Exp. 113).

Using the $\Delta$NLL values from $100$ Set-F duration prompts on a single T4 instance, we compute $\rho$ for four named subspaces against a null of $50$ random $k{=}4$ ablations: DAS ($\rho{=}2650{\times}$), gradient probe ($19{\times}$), PCA ($15{\times}$), and the temporal probe ($-6.5{\times}$; negative, meaning ablation slightly *helps* the model). A decision threshold of $\rho>5.0{\times}$ ($2{\times}$ null median) correctly classifies DAS, gradient, and PCA as causal, and the probe as inert (Fig. [36](https://arxiv.org/html/2605.29126#A53.F36)).
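The shape of such a battery can be illustrated on a toy problem. Everything model-specific below is an assumption: $\rho$ is taken to be the named subspace's $\Delta$NLL divided by the mean absolute $\Delta$NLL of the random ablations (the paper's exact normalization is defined elsewhere and may differ), and `toy_delta_nll` is a hypothetical stand-in for a real model's loss change, not the authors' measurement.

```python
import numpy as np


def random_subspace(d, k, rng):
    """Orthonormal (d, k) basis of a uniformly random k-dim subspace (QR of a Gaussian)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def ablate(x, U):
    """Remove the component of each activation row that lies in span(U)."""
    return x - (x @ U) @ U.T


def specificity_ratio(delta_named, delta_random):
    """ASSUMED definition: named delta-NLL over the mean |delta-NLL| of random ablations."""
    return float(delta_named / np.mean(np.abs(delta_random)))


def toy_delta_nll(x, U, U_causal):
    """HYPOTHETICAL loss change: how far ablating U moves the causal coordinates."""
    moved = (x - ablate(x, U)) @ U_causal
    return float(np.mean(np.sum(moved ** 2, axis=1)))


rng = np.random.default_rng(0)
d, k, n_prompts, n_null = 256, 4, 100, 50      # toy sizes, not the paper's model
x = rng.standard_normal((n_prompts, d))
U_causal = random_subspace(d, k, rng)

# Null distribution: 50 random k=4 ablations, as in the battery described above.
delta_random = [toy_delta_nll(x, random_subspace(d, k, rng), U_causal)
                for _ in range(n_null)]
rho_causal = specificity_ratio(toy_delta_nll(x, U_causal, U_causal), delta_random)
rho_random = specificity_ratio(delta_random[0], delta_random)
print(f"rho (causal) = {rho_causal:.1f}x, rho (one random subspace) = {rho_random:.2f}x")
```

In this toy setting the causal subspace clears a $5{\times}$ threshold by a wide margin while any single random subspace stays near $1{\times}$, which is the separation the battery is designed to expose.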

![Refer to caption](https://arxiv.org/html/2605.29126v1/x33.png)

Figure 36: Specificity battery. (A) $\rho$ per subspace; DAS is off the chart at $2650{\times}$. (B) Null $\rho$ distribution with named subspaces marked. (C) $\Delta$NLL per subspace with random ${\pm}2\sigma$ band.

### Mock deception probe (Exp. 110).

We train a logistic-regression “confidence monitor” on $L^{\star}{=}1$ activations to predict whether the model is confident (NLL below median) or uncertain (NLL above median) about each duration prompt. The monitor achieves $66\%$ cross-validated accuracy (AUROC $=0.76$), which is above chance and reflects surface correlates of difficulty. Its weight vector sits at $88.4^{\circ}$ from the DAS subspace ($0.4^{\circ}$ from the Haar null at $k{=}1$, $d{=}2304$), with $\cos^{2}$ leakage of $0.08\%$ and a theoretical $\rho$ upper bound of $2.03{\times}$, firmly in the inert regime (Fig. [37](https://arxiv.org/html/2605.29126#A53.F37)). A purpose-built safety monitor inherits the same blind spot.
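The geometric quantities in this paragraph are straightforward to compute. The sketch below is illustrative only: the function names are hypothetical, the DAS basis is assumed to be an orthonormal $(d, k)$ matrix, and the null is sampled by Monte Carlo rather than taken from a closed form. The $\cos^{2}$ leakage is the squared cosine of the angle, so $88.4^{\circ}$ corresponds to $\cos^{2}(88.4^{\circ}) \approx 0.08\%$, consistent with the figure quoted above.

```python
import numpy as np


def cos2_leakage(w, U):
    """Fraction of the unit-normalized vector w that lies inside span(U)."""
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)
    return float(np.sum((U.T @ w) ** 2))


def angle_to_subspace_deg(w, U):
    """Angle in degrees between w (d,) and the column space of orthonormal U (d, k)."""
    cos = np.sqrt(cos2_leakage(w, U))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))


def haar_null_angles(d, k, n_samples, rng):
    """Angles between uniformly random unit vectors and a fixed k-dim subspace.

    By rotational invariance the subspace can be taken as the first k coordinate axes.
    """
    g = rng.standard_normal((n_samples, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    cos = np.linalg.norm(g[:, :k], axis=1)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))
```

Under this null the expected $\cos^{2}$ leakage is $k/d$, so a monitor direction whose leakage is of that order is geometrically indistinguishable from a random direction, whatever its classification accuracy.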

![Refer to caption](https://arxiv.org/html/2605.29126v1/x34.png)

Figure 37: Mock deception probe. (A) NLL distribution split at median into confident/uncertain labels. (B) $\rho$ comparison: DAS ($2650{\times}$) vs. temporal probe ($-6.5{\times}$) vs. safety monitor (${\leq}2.0{\times}$). (C) Principal angle between the safety monitor and DAS sits at the Haar null.

### Ablation invisibility (Exp. 111).

Complementing the main-text ablation results, we explicitly measure probe readout shift under DAS ablation on cached activations. DAS ablation ($\Delta$NLL $=54.5$ nats) shifts the circular probe by only $16.7$ days ($4.6\%$ of the $365$-day calendar), while random ablations shift it by $5.9\pm 4.2$ days. The DAS subspace holds $4.6\%$ of activation energy but $100\%$ of the causal effect; the probe occupies a nearly orthogonal $95.4\%$ and sees a faint echo through the ${\sim}0.2\%$ $\cos^{2}$ leakage amplified by the atan2 nonlinearity (Fig. [38](https://arxiv.org/html/2605.29126#A53.F38)).
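A minimal sketch of the two measurements in this paragraph, the energy share of a subspace and the probe readout shift under ablation, is given below. The names, the shapes, and the choice of mean absolute wrapped shift as the summary statistic are assumptions; the section does not state which statistic underlies the $16.7$-day figure.

```python
import numpy as np


def energy_fraction(x, U):
    """Share of squared activation norm inside span(U); x is (n, d), U is (d, k) orthonormal."""
    return float(np.sum((x @ U) ** 2) / np.sum(x ** 2))


def ablate(x, U):
    """Project span(U) out of every activation row."""
    return x - (x @ U) @ U.T


def probe_doy(x, W, period=365):
    """Circular probe readout; W is (d, 2), with assumed column order (sin, cos)."""
    s, c = (x @ W).T
    return (np.arctan2(s, c) % (2 * np.pi)) * period / (2 * np.pi)


def mean_probe_shift_days(x, U, W, period=365):
    """Mean absolute wrapped change in probe readout after ablating span(U)."""
    diff = probe_doy(ablate(x, U), W, period) - probe_doy(x, W, period)
    diff = (diff + period / 2) % period - period / 2
    return float(np.mean(np.abs(diff)))
```

When the ablated subspace is exactly orthogonal to both probe directions, the shift is zero up to floating-point error; any nonzero shift therefore measures how much of the probe's input passes through the ablated subspace.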

![Refer to caption](https://arxiv.org/html/2605.29126v1/x35.png)

Figure 38: Ablation invisibility. (A) Probe-shift histogram for $50$ random ablations (gray) vs. DAS ablation (red). (B) Per-DOY probe shift under DAS ablation; most are below the $3$-day threshold. (C) Energy decomposition: DAS holds $4.6\%$ of activation norm but $100\%$ of causal effect.

## Appendix S54 Supplement: notation and abbreviations