Belief or Circuitry? Causal Evidence for In-Context Graph Learning

arXiv cs.AI Papers

Summary

This paper investigates whether LLMs learn in-context through latent structure inference or local pattern matching, using mechanistic interpretability methods like PCA and activation patching on a graph random-walk task.

arXiv:2605.08405v1 Announce Type: new

Cached at: 05/12/26, 07:13 AM

# Belief or Circuitry? Causal Evidence for In-Context Graph Learning
Source: [https://arxiv.org/html/2605.08405](https://arxiv.org/html/2605.08405)
###### Abstract

How do LLMs learn in\-context? Is it by pattern\-matching recent tokens, or by inferring latent structure? We probe this question using a toy graph random\-walk task across two competing graph structures\. This task’s answer is, in principle, decidable: either the model tracks global topology, or it copies local transitions\. We present two lines of evidence that neither account alone is sufficient\. First, reconstructing the internal representation structure via PCA reveals that at intermediate mixture ratios, both graph topologies are encoded in orthogonal principal subspaces simultaneously\. This pattern is difficult to reconcile with purely local transition copying\. Second, residual\-stream activation patching and graph\-difference steering causally intervene on this graph\-family signal: late\-layer patching almost fully transfers the clean graph preference, while linear steering moves predictions in the intended direction and norm\-matched and label\-shuffled control vectors do not\. Taken together, our findings are most consistent with a dual\-mechanism account in which genuine structure inference and induction circuits operate in parallel\. Code is available at [this link](https://anonymous.4open.science/r/do-llms-infer-graphs-C67A)\.

Mechanistic Interpretability, In\-Context Learning

## 1 Introduction & Related Works

Since the advent of LLMs, in\-context learning \(ICL\) has remained an open problem that continues to puzzle the community\. Over the past five years, numerous investigations into ICL have yielded interesting results: alignment \(Anwar et al\., [2024](https://arxiv.org/html/2605.08405#bib.bib11); Lin et al\., [2023](https://arxiv.org/html/2605.08405#bib.bib12)\), jailbreaking \(Polyakov and Kuznetsov, [2026](https://arxiv.org/html/2605.08405#bib.bib8)\), and demonstration selection \(Qin et al\., [2024](https://arxiv.org/html/2605.08405#bib.bib9)\) are just a few areas that have prioritized it as a subfield of study \(Dong et al\., [2024](https://arxiv.org/html/2605.08405#bib.bib10)\)\.

Within the subfield of mechanistic interpretability \(MechInterp\), there is not yet a consensus on this topic\. Originally posed by Olsson et al\. \([2022](https://arxiv.org/html/2605.08405#bib.bib7)\) as a debate between inference over latent structure and shallow pattern\-matching circuits, ICL has been the subject of numerous recent MechInterp works, ranging from theoretical accounts framing ICL as implicit Bayesian inference over latent concepts \(Xie et al\., [2022](https://arxiv.org/html/2605.08405#bib.bib13)\), to mechanistic studies of induction head formation and diversity \(Singh et al\., [2024](https://arxiv.org/html/2605.08405#bib.bib14)\), to causal evidence that ICL decomposes into separable task schema and input\-output binding mechanisms \(Kim, [2025](https://arxiv.org/html/2605.08405#bib.bib15)\)\.

Recent work from Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\) provided striking evidence for the former: given random\-walk traces on an unknown graph as a toy task, Llama\-3\.1\-8B exhibits a *sharp phase transition* in neighbor\-prediction accuracy\. At this transition, the model simultaneously reorganizes its residual\-stream geometry to mirror the graph’s adjacency structure\. If the phase transition reflects latent\-structure inference, then LLMs maintain implicit probabilistic world models that can be probed mechanistically\. If it reflects induction circuits, the same behavioral signature requires only local copying heads and carries no implications about global representations\.

**Contributions\.** Our contributions are a first step toward triangulating the answer to this ongoing debate\. \(1\) We replace the flat log\-prior of Bigelow et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib4)\) with a complexity\-weighted structure\-specific prior, recovering a quantitative signature of topology\-sensitive structural bias\. \(2\) We expose Llama\-3\.1\-8B to interleaved walks from two competing graphs and show that the belief account’s prediction of topology\-biased evidence accumulation holds, against the induction circuit account’s prediction of symmetric behavior\. \(3\) PCA of residual\-stream activations reveals that both graph topologies are simultaneously recoverable in orthogonal subspaces at intermediate mixture ratios\. \(4\) Activation patching and steering causally link this representational structure to next\-token predictions\.

Ultimately, our conclusion is that the question of whether LLMs *believe* or *copy* may be a false dichotomy, and the architecture of that coexistence is what mechanistic interpretability must now explain\.

## 2 Background

Recently published work on ICL has taken two parallel approaches to this problem\. Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\) used a toy task of random walks over a sixteen\-word grid where nodes are semantically unrelated words and edges connect word pairs adjacent in the grid\. That paper showed that Llama\-3\.1\-8B undergoes a sharp phase transition in neighbor\-hit accuracy as context length grows, and that layer\-level PCA of node\-token representations progressively recovers the true graph topology\. This was interpreted as evidence for an implicit Bayesian world model over graph structure\. Following the publication of Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\), two blog posts entered the debate\. Arditi \([2026](https://arxiv.org/html/2605.08405#bib.bib3)\) identified specific attention heads in Llama\-3\.1\-8B that implement induction, arguing that the phase transition in graph ICL is fully explained by these heads accumulating local transition statistics, with no need to posit global structure inference\. Ransome \([2026](https://arxiv.org/html/2605.08405#bib.bib16)\) replicated these findings and extended the analysis to additional graph topologies\. We note that this second post appeared concurrently with our own work; we refer readers to both for complementary perspectives on the mechanistic debate\.

Alongside the ongoing debate around the work of Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\), Bigelow et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib4)\) published a theoretical Bayesian dynamics approach to ICL\. These authors fit a sigmoid\-shaped parametric function over log\-odds evidence accumulation to LLM accuracy curves, treating the model as maintaining a latent binary belief over two hypotheses about the data source\. They found evidence for a dual\-mechanism account in which both Bayesian updating and induction circuits contribute\.

## 3 Behavioral Model

The central question is whether the LLM’s behavior during ICL looks more like a Bayesian reasoner accumulating evidence about latent structure, or a pattern\-matcher copying recent tokens\. To probe this, we fit a belief\-dynamics model to observed accuracy curves and ask: do the recovered parameters tell a story consistent with genuine structural inference?

### 3\.1 General Framework

Consider an LLM presented with a context generated by one of $K$ competing hypotheses $\mathcal{H}=\{H_{1},\ldots,H_{K}\}$ about the latent data\-generating structure\. We model the LLM as maintaining a latent belief over which hypothesis is active, updating that belief as context accumulates\. For each hypothesis $H_{k}$, the model’s prior skepticism is governed by a log\-odds term $b_{k}\in\mathbb{R}$, with $b_{k}<0$ encoding initial skepticism\. Evidence in favor of $H_{k}$ accumulates sub\-linearly with context length $N$, giving a predicted accuracy at context length $N$:

$$\hat{p}_{k}(N)=p_{0,k}+(q_{k}-p_{0,k})\,\sigma\!\left(b_{k}+\gamma_{k}N^{1-\alpha_{k}}\right), \tag{1}$$

where $p_{0,k}$ is the pre\-transition accuracy under $H_{k}$, $q_{k}$ is the post\-transition accuracy, $\gamma_{k}>0$ controls evidence strength, and $\alpha_{k}\in(0,1)$ captures diminishing returns from correlated observations\. The inflection point $N^{*}_{k}=(-b_{k}/\gamma_{k})^{1/(1-\alpha_{k})}$ marks the context length at which the LLM tips from skepticism to commitment under $H_{k}$\.
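As an illustration of Eq. (1), the following minimal sketch evaluates the predicted accuracy and the inflection point in plain Python. The function names and all parameter values are hypothetical and chosen only so the arithmetic is easy to follow; they are not fitted values from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predicted_accuracy(N, p0, q, b, gamma, alpha):
    """Eq. (1): interpolate between p0 and q via a sigmoid in b + gamma * N**(1 - alpha)."""
    return p0 + (q - p0) * sigmoid(b + gamma * N ** (1.0 - alpha))


def inflection_point(b, gamma, alpha):
    """N* = (-b / gamma)**(1 / (1 - alpha)), where the sigmoid argument crosses zero."""
    if b >= 0 or gamma <= 0 or not (0 <= alpha < 1):
        raise ValueError("N* requires b < 0, gamma > 0, and 0 <= alpha < 1")
    return (-b / gamma) ** (1.0 / (1.0 - alpha))


# Hypothetical parameters for illustration only (not fitted values from the paper).
params = dict(p0=0.1, q=0.9, b=-4.0, gamma=0.5, alpha=0.5)
n_star = inflection_point(params["b"], params["gamma"], params["alpha"])
midpoint = predicted_accuracy(n_star, **params)
print(n_star, midpoint)
```

With these values the log-odds cross zero at $N^{*}=64$, where the predicted accuracy sits halfway between $p_{0}$ and $q$. A more negative $b$ pushes $N^{*}$ later, which is the sense in which a stronger prior against a hypothesis requires more context to overcome.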

A key question is what determines $b_{k}$, the LLM’s initial bias toward or against each hypothesis\. We propose that $b_{k}$ is governed by the *complexity* of $H_{k}$: a more complex hypothesis requires more evidence to overcome the prior\. Concretely, we parameterize

$$b_{k}=b_{0}-\lambda\cdot C(H_{k}), \tag{2}$$

where $b_{0}$ is a shared baseline log\-odds, $\lambda\geq 0$ is a learned penalty weight, and $C(H_{k})$ is an MDL\-inspired complexity measure of hypothesis $H_{k}$\. If $\hat{\lambda}>0$, the LLM implicitly penalizes more complex hypotheses, a signature of complexity\-sensitive structural inference that topology\-agnostic pattern matching cannot produce\.

### 3\.2 Instantiation: Competing Graph Structures

We instantiate this framework using the graph random\-walk task of Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\)\. The LLM is presented with token sequences generated by random walks over an unknown graph $G$, and must predict the next node, which must be a valid neighbor of the current node\. The two competing hypotheses are $H_{\text{grid}}$ and $H_{\text{ring}}$: a $4\times 4$ grid \(16 nodes, 24 edges, degree 2–4\) and a 16\-node ring \(16 edges, uniform degree 2\), each with nodes assigned distinct single\-token English nouns\. The observable accuracy $\hat{p}_{k}(N)$ is the neighbor\-hit probability, the probability that the model’s next\-token prediction is a valid graph neighbor of the current node under hypothesis $H_{k}$\.

The MDL complexity of each graph hypothesis is naturally given by the length of its edge\-list encoding:

$$C(G)=|E(G)|\cdot\lceil\log_{2}|V|\rceil\;\text{bits}, \tag{3}$$

yielding $C(\text{grid})=96$ bits and $C(\text{ring})=64$ bits\. The grid costs more to describe because it has more edges\. If $\hat{\lambda}>0$ with $\hat{b}_{\text{grid}}<\hat{b}_{\text{ring}}$, the LLM requires disproportionately more context to commit to the denser topology\. This is inconsistent with induction heads, which accumulate transition statistics uniformly regardless of graph structure and predict $\hat{\lambda}\approx 0$\.
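The complexity values in Eq. (3) and the resulting priors from Eq. (2) can be checked with a short sketch. The graph constructors, the integer node numbering, and the values of `b0` and `lam` below are assumptions made for illustration; only the edge counts and the formula come from the text.

```python
import math


def grid_edges(rows, cols):
    """Undirected edges of a rows x cols grid with nodes numbered row-major."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                edges.add((v, v + 1))      # horizontal neighbor
            if r + 1 < rows:
                edges.add((v, v + cols))   # vertical neighbor
    return edges


def ring_edges(n):
    """Undirected edges of an n-node ring."""
    return {tuple(sorted((i, (i + 1) % n))) for i in range(n)}


def mdl_complexity(edges, num_nodes):
    """Eq. (3): |E| * ceil(log2 |V|) bits."""
    return len(edges) * math.ceil(math.log2(num_nodes))


def log_prior(b0, lam, complexity):
    """Eq. (2): b_k = b0 - lam * C(H_k)."""
    return b0 - lam * complexity


c_grid = mdl_complexity(grid_edges(4, 4), 16)   # 24 edges * 4 bits = 96
c_ring = mdl_complexity(ring_edges(16), 16)     # 16 edges * 4 bits = 64

# Hypothetical baseline and penalty weight, for illustration only.
b_grid = log_prior(b0=0.0, lam=0.05, complexity=c_grid)
b_ring = log_prior(b0=0.0, lam=0.05, complexity=c_ring)
print(c_grid, c_ring, b_grid, b_ring)
```

For any positive `lam` the grid receives the more negative prior, so under Eq. (1) its inflection point moves later than the ring's when the other parameters are held equal.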

To create a genuine competition between hypotheses, we interleave walks from both graphs at a controlled mixture ratio $\rho\in[0,1]$, where $\rho$ is the probability that any given 100\-token segment is drawn from the ring walk\. The effective context length for graph $k$ is $\rho_{k}\cdot N$, with $\rho_{\text{ring}}=\rho$ and $\rho_{\text{grid}}=1-\rho$ \(a mean\-field approximation that holds in expectation across walk realizations\), giving

$$\hat{p}_{k}(\rho,N)=p_{0,k}+(q_{k}-p_{0,k})\,\sigma\!\bigl(b_{k}+\gamma_{k}(\rho_{k}N)^{1-\alpha_{k}}\bigr). \tag{4}$$

We compare a *per\-graph* parameterization \(8 free parameters: the shared $b_{0}$ and $\lambda$, plus $\gamma_{k},\alpha_{k},q_{k}$ per graph\) against a *mixture\-bias* ablation \(5 parameters\) that shares a single sigmoid but linearly interpolates the prior: $b(\rho)=(1-\rho)b_{\text{grid}}+\rho\,b_{\text{ring}}$\. The mixture\-bias version can capture prior asymmetry but not topology\-specific evidence rates, making it a direct test of whether per\-graph dynamics are needed beyond the prior alone\. Model selection uses AIC and BIC; see Appendix [A\.2](https://arxiv.org/html/2605.08405#A1.SS2) for estimation details\. We use Llama\-3\.1\-8B \(non\-instruct\) loaded via TransformerLens \(Nanda and Bloom, [2022](https://arxiv.org/html/2605.08405#bib.bib6)\), with layer\-26 residual\-stream activations for representational analyses following Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\)\.
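The interleaving scheme described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: nodes are integer ids instead of English nouns, each segment restarts its walk from a uniformly random node, and all function names are hypothetical.

```python
import random


def grid_adjacency(rows, cols):
    """Adjacency lists for a rows x cols grid with nodes numbered row-major."""
    adj = {r * cols + c: [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if r + 1 < rows:
                adj[v].append(v + cols)
                adj[v + cols].append(v)
    return adj


def ring_adjacency(n):
    """Adjacency lists for an n-node ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def random_walk(adj, length, rng):
    """Uniform random walk of `length` nodes (assumed: uniform random start)."""
    node = rng.choice(sorted(adj))
    walk = [node]
    while len(walk) < length:
        node = rng.choice(adj[node])
        walk.append(node)
    return walk


def interleaved_context(adj_grid, adj_ring, rho, total_len, seg_len=100, seed=0):
    """Concatenate segments; each is a ring walk with probability rho, else a grid walk."""
    rng = random.Random(seed)
    tokens, sources = [], []
    while len(tokens) < total_len:
        from_ring = rng.random() < rho
        n = min(seg_len, total_len - len(tokens))
        tokens.extend(random_walk(adj_ring if from_ring else adj_grid, n, rng))
        sources.extend(["ring" if from_ring else "grid"] * n)
    return tokens, sources


grid, ring = grid_adjacency(4, 4), ring_adjacency(16)
tokens, sources = interleaved_context(grid, ring, rho=0.5, total_len=1400)
# Realized ring share of the context; the mean-field value is rho.
ring_fraction = sources.count("ring") / len(sources)
print(len(tokens), ring_fraction)
```

The realized `ring_fraction` fluctuates around `rho` from one sampled context to the next, which is why the effective context length $\rho_{k}N$ in Eq. (4) is described as holding only in expectation.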

## 4 Experiment 1

### 4\.1 Which Sigmoid Fits Best?

For each \(condition, $\rho$\) cell, we fit both the baseline and our model with the complexity\-weighted prior to the training walks and evaluate on held\-out sequences\.

### 4\.2 Behavioral Sigmoid Fits

The grid’s inflection point $N^{*}$ shifts monotonically later as ring evidence increases, precisely the graph\-level competition effect the belief account predicts\. A flat induction account cannot explain this, since copy heads accumulate transitions without any topology\-aware interaction\. Three further results point the same way\. The recovered parameters satisfy $\hat{\lambda}>0$ and $\hat{b}_{\text{grid}}<\hat{b}_{\text{ring}}$ in both vocabulary conditions\. The per\-graph parameterization decisively outperforms the mixture\-bias ablation on AIC and BIC\. Finally, at $\rho=1$ the ring converges faster than the grid, which is consistent with its lower MDL complexity requiring less evidence to overcome\.

## 5 Experiment 2

### 5\.1 Does the Residual Stream Encode Latent Graph Structure?

To test whether the behavioral signatures reflect genuine changes in internal representations, we probe activations directly, following the representational analysis of Park et al\. \([2025](https://arxiv.org/html/2605.08405#bib.bib2)\)\. For each node $v$, we average activations over all positions where $w_{t}=v$ at context length $T$, yielding a class\-mean matrix\. Projecting these class\-mean vectors into PCA space lets us ask whether the low\-dimensional geometry recovers the true graph topology, and, critically, whether both graph topologies are simultaneously recoverable at intermediate $\rho$\.

To quantify structural alignment we report degree\-normalized Dirichlet energy under the true graph Laplacian $L=D-A$, where $A$ is the adjacency matrix and $D$ is the diagonal degree matrix\. We define $H_{T}\in\mathbb{R}^{|V|\times d}$ as the matrix of class\-mean activations with rows $\mu_{v}(T)$ for each node $v$ present in a trailing context window at length $T$, and $\bar{H}_{T}$ as the degree\-weighted mean\.

$$\mathcal{E}(T)=\mathrm{Tr}(H_{T}^{\top}LH_{T})=\frac{1}{2}\sum_{i,j}A_{ij}\lVert\mu_{i}(T)-\mu_{j}(T)\rVert^{2} \tag{5}$$

$$\mathcal{E}_{\mathrm{norm}}(T)=\frac{\mathrm{Tr}(H_{T}^{\top}LH_{T})}{\mathrm{Tr}((H_{T}-\bar{H}_{T})^{\top}D(H_{T}-\bar{H}_{T}))} \tag{6}$$

Lower $\mathcal{E}_{\mathrm{norm}}$ means adjacent nodes are closer together in activation space, a representational signature that the model has internalized the graph’s adjacency structure beyond what token co\-occurrence statistics alone would produce\. If the induction account is sufficient, we should see no coherent graph structure emerge in the residual stream; if the belief account holds, we expect the geometry to progressively mirror the true topology as $T$ increases past $N^{*}$\.
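A small sketch may help make Eqs. (5) and (6) concrete. It computes both quantities from a list of class-mean vectors and an undirected edge list, using the edge-sum form of Eq. (5). The two-dimensional "activations" are synthetic points on a circle, an assumption standing in for real residual-stream class means, and the function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dirichlet_energy(H, edges):
    """Eq. (5): Tr(H^T L H), computed as a sum of squared distances over undirected edges."""
    return sum(sq_dist(H[i], H[j]) for i, j in edges)


def normalized_energy(H, edges):
    """Eq. (6): Dirichlet energy divided by the degree-weighted variance of the rows of H."""
    n, d = len(H), len(H[0])
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    total = sum(deg)
    # Degree-weighted mean of the class-mean vectors.
    mean = [sum(deg[i] * H[i][k] for i in range(n)) / total for k in range(d)]
    denom = sum(deg[i] * sq_dist(H[i], mean) for i in range(n))
    return dirichlet_energy(H, edges) / denom


n = 16
ring_edges = [(i, (i + 1) % n) for i in range(n)]
# Synthetic class means: node i placed at angle 2*pi*i/n, so ring neighbors are adjacent.
circle = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)] for i in range(n)]
# Same points, reassigned so that ring neighbors land far apart on the circle.
scrambled = [circle[(7 * i) % n] for i in range(n)]

aligned = normalized_energy(circle, ring_edges)
misaligned = normalized_energy(scrambled, ring_edges)
print(aligned, misaligned)
```

The aligned embedding yields a much lower normalized energy than the scrambled one even though both use the same set of points, which is the property the metric relies on: it measures agreement between geometry and adjacency, not the spread of the representations.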

### 5\.2 Residual\-Stream Geometry Results

The behavioral results suggest the LLM maintains a complexity\-sensitive structural bias\. But do those behavioral signatures have a correlate inside the model? Near $N^{*}$ \($T=200$\), the PC1/PC2 plane shows only partial ring structure; by $T=1400$ the ring topology is clearly recoverable in the low\-dimensional geometry, and $\mathcal{E}_{\mathrm{norm}}$ shifts from $0.785$ at short context to $0.828\pm 0.076$ at $T=1400$\. The internal geometry and the behavioral phase transition move together\.

The stronger test comes from the competing\-structures regime\. Figure [1](https://arxiv.org/html/2605.08405#S6.F1) and Appendix [B](https://arxiv.org/html/2605.08405#A2) show class\-mean PCA at $T=1400$ across the full $\rho$\-ladder in a secondary closed\-vocabulary experiment where grid and ring share the same 16\-token vocabulary\. At $\rho=0.5$, both topologies are simultaneously encoded in orthogonal subspaces\. An induction circuit accumulating local transition statistics would produce a single blended representation, a mixture of grid and ring co\-occurrences, not two separable global structures in orthogonal subspaces\.

## 6 Experiment 3

### 6\.1 Does Graph\-Family Information Causally Control Next\-Token Predictions?

The behavioral and PCA analyses are correlational: they show that outputs and representations are consistent with latent structure inference, but not that the relevant residual\-stream information is used by the final prediction\. We therefore run two causal interventions, following the activation\-patching and activation\-steering logic used in mechanistic interpretability and representation engineering \(Meng et al\., [2022](https://arxiv.org/html/2605.08405#bib.bib17); Turner et al\., [2023](https://arxiv.org/html/2605.08405#bib.bib18); Zou et al\., [2023](https://arxiv.org/html/2605.08405#bib.bib19)\)\.

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/pca_rho_grid_T1400_split_rho051_pc12.png)

Figure 1: Snapshot of PCA embeddings for all tokens across values of the mixture ratio $\rho$ \(columns\)\. *Top:* Blue edges of grid overlaid when grid has non\-zero mixture weight\. *Bottom:* Same plots with red edges of ring overlaid\. Full ladder of $\rho$ values in Appendix [B](https://arxiv.org/html/2605.08405#A2)\.

For clean graph $G_{c}$, corrupt graph $G_{r}$, and final token $x_{t}$, we score each logit vector with a graph\-family contrast

$$\Delta(x_{t})=\frac{1}{|\mathcal{N}_{G_{c}}(x_{t})|}\sum_{w\in\mathcal{N}_{G_{c}}(x_{t})}z_{w}-\frac{1}{|\mathcal{N}_{G_{r}}(x_{t})|}\sum_{w\in\mathcal{N}_{G_{r}}(x_{t})}z_{w}, \tag{7}$$

where $\mathcal{N}_{G}(x_{t})$ denotes the set of valid next\-node neighbors of $x_{t}$ under graph $G$, and $z_{w}$ is the next\-token logit for word $w$\. We generate matched clean/corrupt prompt pairs from the grid and ring that end at the same current node\. For activation patching, we cache clean residual activations and rerun the corrupt prompt while replacing the final\-position `hook_resid_post` activation after block $\ell$\. The normalized patch effect is

$$E_{\mathrm{patch}}(\ell)=\frac{\Delta_{\mathrm{patch}}(\ell)-\Delta_{\mathrm{corrupt}}}{\Delta_{\mathrm{clean}}-\Delta_{\mathrm{corrupt}}}. \tag{8}$$

Thus $0$ means no movement from the corrupt graph preference, and $1$ means the patched corrupt run recovers the clean run’s graph preference\.
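The scoring in Eqs. (7) and (8) is simple enough to show on toy numbers. The sketch below assumes logits are available as a word-to-value mapping for the clean, corrupt, and patched runs; the words, neighbor sets, and logit values are invented for illustration, and the model forward passes and the `hook_resid_post` replacement are not reproduced here.

```python
def graph_contrast(logits, clean_neighbors, corrupt_neighbors):
    """Eq. (7): mean logit over clean-graph neighbors minus mean over corrupt-graph neighbors."""
    def mean_logit(words):
        return sum(logits[w] for w in words) / len(words)
    return mean_logit(clean_neighbors) - mean_logit(corrupt_neighbors)


def normalized_patch_effect(delta_patch, delta_clean, delta_corrupt):
    """Eq. (8): 0 = no movement from the corrupt run, 1 = full recovery of the clean run."""
    denom = delta_clean - delta_corrupt
    if denom == 0:
        raise ValueError("clean and corrupt contrasts coincide; effect is undefined")
    return (delta_patch - delta_corrupt) / denom


# Hypothetical neighbor sets of the shared final token under each graph.
clean_nb = ["apple", "bird"]
corrupt_nb = ["car", "door"]

# Hypothetical next-token logits from three runs.
clean_run = {"apple": 3.0, "bird": 1.0, "car": 0.0, "door": 0.0}
corrupt_run = {"apple": 0.0, "bird": 0.0, "car": 2.0, "door": 2.0}
patched_run = {"apple": 2.0, "bird": 2.0, "car": 1.0, "door": 1.0}

d_clean = graph_contrast(clean_run, clean_nb, corrupt_nb)      # 2.0
d_corrupt = graph_contrast(corrupt_run, clean_nb, corrupt_nb)  # -2.0
d_patch = graph_contrast(patched_run, clean_nb, corrupt_nb)    # 1.0
effect = normalized_patch_effect(d_patch, d_clean, d_corrupt)  # 0.75
print(effect)
```

Because the denominator is the clean-minus-corrupt gap for the same prompt pair, the effect is not bounded to the unit interval: a patched run whose contrast overshoots the clean run gives a value above 1.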

For steering, we compute a layer\-wise graph\-difference vector on disjoint training contexts,

$$v_{\ell}=\mathbb{E}[h_{\ell}(x_{t})\mid G_{c}=\mathrm{grid}]-\mathbb{E}[h_{\ell}(x_{t})\mid G_{r}=\mathrm{ring}], \tag{9}$$

and add $\alpha v_{\ell}$ to final\-position residual activations in held\-out ring contexts\. We compare the real vector to two controls: a Gaussian random vector matched to the real vector’s norm and a shuffled\-label vector computed after permuting graph labels in the steering\-vector training set\. Full protocol details, including the seen/held\-out edge split used to test transition\-cache explanations, are in Appendix [C](https://arxiv.org/html/2605.08405#A3)\.
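The steering vector of Eq. (9) and its two controls can be sketched without any model code. The example below operates on plain lists standing in for final-position residual activations at one layer. The two-dimensional toy activations and all function names are assumptions for illustration, and the exact permutation scheme used for the shuffled-label control in the paper may differ.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(grid_acts, ring_acts):
    """Eq. (9): mean grid activation minus mean ring activation at one layer."""
    return [g - r for g, r in zip(mean_vector(grid_acts), mean_vector(ring_acts))]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def random_norm_matched(v, rng):
    """Control 1: Gaussian direction rescaled to the norm of the real vector."""
    g = [rng.gauss(0.0, 1.0) for _ in v]
    scale = l2_norm(v) / l2_norm(g)
    return [scale * x for x in g]


def shuffled_label_vector(grid_acts, ring_acts, rng):
    """Control 2: recompute the difference after permuting graph labels."""
    pooled = list(grid_acts) + list(ring_acts)
    rng.shuffle(pooled)
    k = len(grid_acts)
    return steering_vector(pooled[:k], pooled[k:])


def steer(h, v, alpha):
    """Add alpha * v to a final-position activation h."""
    return [x + alpha * y for x, y in zip(h, v)]


rng = random.Random(0)
grid_acts = [[1.0, 0.0], [3.0, 0.0]]   # toy "grid" activations
ring_acts = [[0.0, 1.0], [0.0, 3.0]]   # toy "ring" activations

v = steering_vector(grid_acts, ring_acts)        # [2.0, -2.0]
control_random = random_norm_matched(v, rng)
control_shuffled = shuffled_label_vector(grid_acts, ring_acts, rng)
steered = steer([0.0, 0.0], v, alpha=5.0)        # [10.0, -10.0]
print(v, steered)
```

The two controls isolate different explanations. The norm-matched random vector tests whether any perturbation of that size shifts the contrast, while the shuffled-label vector tests whether the direction depends on the grid/ring labelling at all.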

### 6\.2 Causal Activation Intervention Results

Using the clean/corrupt prompt\-pair protocol from Section [6\.1](https://arxiv.org/html/2605.08405#S6.SS1), we intervene on final\-position residual\-stream activations and score the resulting next\-token distribution with the graph\-family logit contrast in Eq\. [7](https://arxiv.org/html/2605.08405#S6.E7)\.

We outline each finding briefly below; full details are in Appendix [C](https://arxiv.org/html/2605.08405#A3)\. At context length $T=1400$, the effect of final\-token residual patching rises rapidly across the selected layer sweep, establishing that late residual\-stream states causally control the graph\-neighbor logit contrast rather than merely encoding decodable graph information\. To test whether this effect reflects replay of locally observed transitions alone, we split clean graph neighbors into edges observed in the corrupt context and held\-out edges\. While the held\-out effect is delayed, it crosses zero by layer 26 and reaches $2.0$ at layer 30, showing that patched activations boost graph\-consistent predictions even for edges never observed in the corrupt prompt\.
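The observed versus held-out split described above can be illustrated with a short sketch. It assumes a closed vocabulary with integer node ids and treats a transition as "observed" if the two nodes appear consecutively in either order in the corrupt context. Both choices are assumptions, since the paper defers the exact protocol to its Appendix C, and the function names are hypothetical.

```python
def observed_edges(tokens):
    """Undirected edges traversed by consecutive tokens in a context."""
    return {frozenset(pair) for pair in zip(tokens, tokens[1:]) if pair[0] != pair[1]}


def split_clean_neighbors(current, clean_adj, corrupt_context):
    """Partition clean-graph neighbors of `current` by whether the edge was seen in the corrupt context."""
    seen = observed_edges(corrupt_context)
    observed, held_out = [], []
    for w in clean_adj[current]:
        (observed if frozenset((current, w)) in seen else held_out).append(w)
    return observed, held_out


# Toy closed-vocabulary example with integer node ids (hypothetical data).
clean_adj = {5: [1, 4, 6, 9]}        # neighbors of node 5 in a row-major 4x4 grid
corrupt_context = [3, 4, 5, 6, 5]    # a short ring walk ending at node 5

observed, held_out = split_clean_neighbors(5, clean_adj, corrupt_context)
print(observed, held_out)  # [4, 6] [1, 9]
```

Scoring the logit contrast separately over `observed` and `held_out` neighbors is what lets the held-out effect speak against a pure transition cache: a cache can only boost continuations that already appear in the prompt.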

Steering provides a complementary lower\-bandwidth causal intervention\. Adding the grid\-minus\-ring direction to held\-out ring contexts recovers $0.449\pm 0.004$ of the clean\-corrupt graph contrast at $\alpha=5$, while negative $\alpha$ reverses the effect and both random norm\-matched and shuffled\-label controls remain near zero\. The effect also strengthens with layer\. Because steering uses a single global direction rather than pair\-specific activation replacement, it does not fully reproduce patching; Appendix [C](https://arxiv.org/html/2605.08405#A3) shows that held\-out edge\-specific logits remain substantially harder to steer\. Together, these interventions support the existence of a manipulable graph\-family representation that contributes causally to next\-token prediction\.

## 7 Discussion & Conclusion

The behavioral, representational, and causal evidence converge on the same picture: Llama\-3\.1\-8B is not well described as either a pure Bayesian structure learner or a pure induction\-cache machine\. This does not rule out induction circuits\. In fact, the delayed held\-out edge effect in patching and the incomplete held\-out recovery under steering both suggest that local transition evidence remains important\. The more plausible account is a dual\-mechanism one: induction\-like caches and latent\-structure representations operate together, with the residual stream integrating both sources of evidence before the final prediction\. Immediate next steps are to measure subspace alignment angles against graph Laplacian eigenvectors across $\rho$, run head\-level ablations at the patching\-identified layers, and scale the same causal protocol to larger Llama models to test whether the recovered complexity penalty $\hat{\lambda}$ grows with model capacity\.

## References

- U\. Anwar, A\. Saparov, J\. Rando, D\. Paleka, M\. Turpin, P\. Hase, E\. S\. Lubana, E\. Jenner, S\. Casper, O\. Sourbut, B\. L\. Edelman, Z\. Zhang, M\. Günther, A\. Korinek, J\. Hernandez\-Orallo, L\. Hammond, E\. Bigelow, A\. Pan, L\. Langosco, T\. Korbak, H\. Zhang, R\. Zhong, S\. Ó\. hÉigeartaigh, G\. Recchia, G\. Corsi, A\. Chan, M\. Anderljung, L\. Edwards, A\. Petrov, C\. S\. de Witt, S\. R\. Motwan, Y\. Bengio, D\. Chen, P\. H\. S\. Torr, S\. Albanie, T\. Maharaj, J\. Foerster, F\. Tramer, H\. He, A\. Kasirzadeh, Y\. Choi, and D\. Krueger \(2024\) Foundational challenges in assuring alignment and safety of large language models\. External Links: 2404\.09932, [Link](https://arxiv.org/abs/2404.09932)\.
- A\. Arditi \(2026\) In\-context learning of representations can be explained by induction circuits\. Note: LessWrong\. Crosspost of ICLR 2026 Blogpost Track post\. External Links: [Link](https://www.lesswrong.com/posts/qtdSzLpQ8BXv6YANd/in-context-learning-of-representations-can-be-explained-by)\.
- E\. Bigelow, D\. Wurgaft, Y\. Wang, N\. Goodman, T\. Ullman, H\. Tanaka, and E\. S\. Lubana \(2025\) Belief dynamics reveal the dual nature of in\-context learning and activation steering\. External Links: 2511\.00617, [Link](https://arxiv.org/abs/2511.00617)\.
- Q\. Dong, L\. Li, D\. Dai, C\. Zheng, J\. Ma, R\. Li, H\. Xia, J\. Xu, Z\. Wu, B\. Chang, X\. Sun, L\. Li, and Z\. Sui \(2024\) A survey on in\-context learning\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 1107–1128\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.64/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.64)\.
- C\. Kim \(2025\) Task schema and binding: a double dissociation study of in\-context learning\. External Links: 2512\.17325, [Link](https://arxiv.org/abs/2512.17325)\.
- B\. Y\. Lin, A\. Ravichander, X\. Lu, N\. Dziri, M\. Sclar, K\. Chandu, C\. Bhagavatula, and Y\. Choi \(2023\) The unlocking spell on base llms: rethinking alignment via in\-context learning\. External Links: 2312\.01552, [Link](https://arxiv.org/abs/2312.01552)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in GPT\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 17359–17372\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html)\.
- N\. Nanda and J\. Bloom \(2022\) TransformerLens\. Note: [https://github\.com/TransformerLensOrg/TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, S\. Johnston, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2022\) In\-context learning and induction heads\. External Links: 2209\.11895, [Link](https://arxiv.org/abs/2209.11895)\.
- C\. F\. Park, A\. Lee, E\. S\. Lubana, Y\. Yang, M\. Okawa, K\. Nishi, M\. Wattenberg, and H\. Tanaka \(2025\) ICLR: in\-context learning of representations\. External Links: 2501\.00070, [Link](https://arxiv.org/abs/2501.00070)\.
- A\. Polyakov and D\. Kuznetsov \(2026\) Involuntary in\-context learning: exploiting few\-shot pattern completion to bypass safety alignment in gpt\-5\.4\. External Links: 2604\.19461, [Link](https://arxiv.org/abs/2604.19461)\.
- C\. Qin, A\. Zhang, C\. Chen, A\. Dagar, and W\. Ye \(2024\) In\-context learning with iterative demonstration selection\. External Links: 2310\.09881, [Link](https://arxiv.org/abs/2310.09881)\.
- J\. Ransome \(2026\) In context learning representations, byproduct or mechanism? Jack’s Substack\. External Links: [Link](https://jackransome.substack.com/p/in-context-learning-representations)\.
- A\. K\. Singh, T\. Moskovitz, F\. Hill, S\. C\. Y\. Chan, and A\. M\. Saxe \(2024\) What needs to go right for an induction head? a mechanistic study of in\-context learning circuits and their formation\. External Links: 2404\.07129, [Link](https://arxiv.org/abs/2404.07129)\.
- A\. M\. Turner, L\. Thiergart, D\. Udell, G\. Leech, U\. Mini, and M\. MacDiarmid \(2023\) Activation addition: steering language models without optimization\. External Links: 2308\.10248, [Link](https://arxiv.org/abs/2308.10248)\.
- S\. M\. Xie, A\. Raghunathan, P\. Liang, and T\. Ma \(2022\) An explanation of in\-context learning as implicit bayesian inference\. External Links: 2111\.02080, [Link](https://arxiv.org/abs/2111.02080)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks \(2023\) Representation engineering: a top\-down approach to AI transparency\. External Links: 2310\.01405, [Link](https://arxiv.org/abs/2310.01405)\.

## Appendix A Behavioral Model Details

### A\.1 Baseline Derivation

Let $S\in\{0,1\}$ be a latent binary variable: $S=1$ indicates the LLM has adopted the in\-context graph structure; $S=0$ indicates reliance on pretrained associations\. The log\-prior over $S$ is $b\in\mathbb{R}$, with $b<0$ encoding initial skepticism toward the arbitrary in\-context graph\. As the LLM observes more walk steps, evidence accumulates sub\-linearly:

$$\log\frac{p(\mathbf{x}\mid S=1)}{p(\mathbf{x}\mid S=0)}\approx\gamma N^{1-\alpha} \tag{10}$$

where $\gamma>0$ controls evidence strength per token and $\alpha\in(0,1)$ captures diminishing returns from correlated walk steps\. Combining prior and likelihood via Bayes’ rule gives the predicted neighbor\-hit accuracy at context length $N$:

$$\hat{p}(N)=p_{0}+(q-p_{0})\,\sigma\!\left(b+\gamma N^{1-\alpha}\right) \tag{11}$$

where $p_{0}$ is the pre\-transition neighbor\-hit rate \(estimated empirically from $N\leq 100$ tokens\), $q\in(p_{0},1]$ is the graph\-mode success rate, and $\sigma$ is the sigmoid\. The phase transition inflection point is $N^{*}=(-b/\gamma)^{1/(1-\alpha)}$, corresponding to the context length at which log\-odds cross zero and the LLM tips from skepticism to belief\.

Parameters $\boldsymbol{\theta}=(b,\gamma,\alpha,q)$ are fit by minimizing MSE between $\hat{p}(N)$ and observed accuracy curves, equivalent to MLE under an additive Gaussian noise model on observed accuracies. We use L-BFGS-B with 16 random restarts and box constraints $b\in[-30,30]$, $\gamma\in[10^{-6},50]$, $\alpha\in[0,0.99]$, $q\in(p_{0},1]$. The bounds enforce domain constraints directly; the lowest-loss restart is kept and validation and test MSE are reported afterward.
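The equivalence between least squares and maximum likelihood can be spelled out in one step. As a sketch, assume each observed accuracy $y_i$ at context length $N_i$ equals the model prediction plus independent Gaussian noise with variance $\tau^2$ (the symbols $y_i$, $N_i$, $n$, and $\tau$ are introduced here for illustration; $\tau$ avoids a clash with the sigmoid $\sigma$):

$$-\log L(\boldsymbol{\theta},\tau^{2})=\frac{n}{2}\log\left(2\pi\tau^{2}\right)+\frac{1}{2\tau^{2}}\sum_{i=1}^{n}\left[y_i-\hat{p}(N_i;\boldsymbol{\theta})\right]^{2}$$

For any fixed $\tau^{2}$ the first term does not depend on $\boldsymbol{\theta}$, so minimizing the negative log-likelihood over $\boldsymbol{\theta}$ is the same as minimizing the sum of squared residuals, and hence the MSE. Substituting the maximizing variance $\hat{\tau}^{2}=\text{MSE}$ gives $-2\log L=n\left(\log(2\pi\cdot\text{MSE})+1\right)$, which is the Gaussian log-likelihood term used for model selection in Appendix A.2.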

### A.2 Weighted Prior Model Estimation Details

The joint objective minimizes MSE over all $(\rho,k,N)$ triples:

$$\hat{\boldsymbol{\theta}}=\arg\min_{\boldsymbol{\theta}}\sum_{\rho}\sum_{k}\sum_{N\in\mathcal{C}}\left[\hat{p}_{k,\text{obs}}(\rho,N)-\hat{p}_{k}(\rho,N;\boldsymbol{\theta})\right]^{2} \tag{12}$$

We use L-BFGS-B with 24 random restarts and box constraints $b_{0}\in[-15,15]$, $\lambda\in[-2,2]$, $\gamma_{k}\in[10^{-6},50]$, $\alpha_{k}\in[0,0.99]$, $q_{k}\in(p_{0,k},1]$. Note that the bounds on $\lambda$ include negative values; $\hat{\lambda}<0$ would mean the LLM prefers more complex graphs, which would falsify the complexity-prior hypothesis.

Model selection between the per\-graph \(8 parameters\) and mixture\-bias \(5 parameters\) versions uses AIC and BIC under the Gaussian residual assumption:

$$\text{AIC}=n\cdot\left(\log(2\pi\cdot\text{MSE})+1\right)+2k,\quad\text{BIC}=n\cdot\left(\log(2\pi\cdot\text{MSE})+1\right)+k\log n \tag{13}$$

where $n$ is the number of training observations and $k$ is the number of free parameters. The pre-transition accuracy $p_{0,k}$ is estimated per graph by averaging neighbor-hit accuracy over training walks at $N\leq 100$ tokens, then pooled to a single $p_{0,\text{grid}}$ and $p_{0,\text{ring}}$ per vocabulary condition for identifiability.
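Equation (13) is straightforward to compute once the training MSE is known. The sketch below compares an 8-parameter and a 5-parameter model; the MSE values and the observation count are invented for illustration and are not results from the paper.

```python
import math


def gaussian_aic_bic(mse, n, k):
    """Return (AIC, BIC) from Equation (13) under Gaussian residuals.

    mse: training mean squared error, n: number of observations,
    k: number of free parameters.
    """
    if mse <= 0 or n <= 0:
        raise ValueError("mse and n must be positive")
    neg2_loglik = n * (math.log(2.0 * math.pi * mse) + 1.0)
    return neg2_loglik + 2 * k, neg2_loglik + k * math.log(n)


# Hypothetical numbers: the richer model fits slightly better.
n_obs = 120
aic_per_graph, bic_per_graph = gaussian_aic_bic(mse=0.0010, n=n_obs, k=8)
aic_mixture, bic_mixture = gaussian_aic_bic(mse=0.0012, n=n_obs, k=5)

# Lower is better for both criteria.
print("AIC prefers:", "per-graph" if aic_per_graph < aic_mixture else "mixture-bias")
print("BIC prefers:", "per-graph" if bic_per_graph < bic_mixture else "mixture-bias")
```

Because BIC charges $\log n$ per parameter rather than 2, it penalizes the per-graph model more heavily than AIC whenever $n>e^{2}\approx 7.4$.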

We note one limitation of the overlap fit: the optimizer saturated the lower bound of the $b_{0}$ search range ($\hat{b}_{0}=-15.00$), indicating that the optimum likely lies at a more negative value than the bounds allow. As a consequence, $\hat{\lambda}_{\text{overlap}}$ and the implied $\hat{b}_{\text{grid}}-\hat{b}_{\text{ring}}$ gap may be biased toward zero in this condition; widening the bounds or reparameterizing $b_{0}$ is a straightforward follow-up.

## Appendix B Representational Figures

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/pca_rho_grid_T1400_split_rho024581_pc12.png)

Figure 2: Full suite of PCA analyses across all mixture ratios $\rho$. The first row shows grid reconstruction edges in blue and the second row shows ring edges in red.

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/pca_snapshots_neutral_disjoint_paper.png)

Figure 3: Representative class-mean PCA snapshots for the neutral disjoint vocabulary condition. These plots provide additional visual context for the layer-26 residual-stream geometry discussed in the main text.

## Appendix C Causal Intervention Details

### C.1 Prompt Pair Construction and Metrics

Clean and corrupt prompts are generated from different graph families but end at the same current node\. Because the graph hypotheses are undirected, we sample a random walk ending at the desired final node by generating a valid walk from that node and reversing it\. The reversed walk has the same graph support, and the model is always evaluated at the final position, predicting the next graph word\.
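The reversal trick can be sketched in a few lines. The example below uses integer node ids and a small ring graph as stand-ins (the paper's prompts use graph words); `ring_graph` and `walk_ending_at` are hypothetical helper names, not functions from the released code.

```python
import random


def ring_graph(n):
    """Undirected ring on nodes 0..n-1 as an adjacency dict."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def walk_ending_at(adj, final_node, length, rng):
    """Sample a walk of `length` nodes that ends at `final_node`.

    The walk is generated forward starting from `final_node` and then
    reversed; because the graph is undirected, every reversed step is
    still a valid edge.
    """
    walk = [final_node]
    for _ in range(length - 1):
        walk.append(rng.choice(adj[walk[-1]]))
    walk.reverse()
    return walk


rng = random.Random(0)
adj = ring_graph(8)
walk = walk_ending_at(adj, final_node=3, length=50, rng=rng)
print(walk[-5:])
```

Generating a clean and a corrupt walk with the same `final_node` but different adjacency structures gives a prompt pair that differs in graph family while sharing the evaluation position.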

The primary score is the graph-family logit contrast in Equation [7](https://arxiv.org/html/2605.08405#S6.E7). For patching, normalized effect is $(\Delta_{\mathrm{patch}}-\Delta_{\mathrm{corrupt}})/(\Delta_{\mathrm{clean}}-\Delta_{\mathrm{corrupt}})$. For steering, normalized effect is computed analogously, replacing $\Delta_{\mathrm{patch}}$ with the steered metric and using the target and source prompt metrics as endpoints. Rows with small denominators are marked unusable in the raw JSONL; none were excluded in the reported patching runs.
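A minimal sketch of the normalized effect follows. The source does not state the cutoff used for "small denominators", so the `min_denominator` default here is an assumed placeholder, as is the function name.

```python
def normalized_effect(delta_intervened, delta_clean, delta_corrupt,
                      min_denominator=1e-6):
    """Fraction of the clean-vs-corrupt gap recovered by an intervention.

    Returns None when the gap is too small to normalize reliably
    (the threshold is an assumption, not the paper's value).
    0.0 means no change from the corrupt run; 1.0 means full recovery.
    """
    gap = delta_clean - delta_corrupt
    if abs(gap) < min_denominator:
        return None
    return (delta_intervened - delta_corrupt) / gap


# Illustrative logit contrasts.
print(normalized_effect(2.0, delta_clean=2.0, delta_corrupt=-1.0))
print(normalized_effect(0.5, delta_clean=2.0, delta_corrupt=-1.0))
print(normalized_effect(0.5, delta_clean=1.0, delta_corrupt=1.0))
```

Values above 1 or below 0 are possible and indicate overshoot or movement in the wrong direction, so the score should not be clipped before aggregation.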

### C.2 Seen and Held-Out Edge Split

For each final token, we split graph neighbors according to whether the edge incident to the final token was observed in the evaluation context\. The “seen” set contains true graph neighbors of the final token whose edge appeared in either direction\. The “held\-out” set contains true graph neighbors whose edge did not appear\. For clean/corrupt patching, this split is computed using the corrupt context, so the diagnostic asks whether a clean activation intervention helps graph\-neighbor logits that the corrupt prompt did not locally observe\.
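The split described above can be sketched as follows, treating each consecutive pair of context tokens as an observed undirected edge. The function name and the toy vocabulary are illustrative assumptions.

```python
def split_neighbors(context, final_token, adj):
    """Split true neighbors of `final_token` into (seen, held_out).

    An edge counts as seen if it appears between consecutive context
    tokens in either direction.
    """
    observed = {frozenset(pair) for pair in zip(context, context[1:])}
    seen, held_out = [], []
    for neighbor in adj[final_token]:
        if frozenset((final_token, neighbor)) in observed:
            seen.append(neighbor)
        else:
            held_out.append(neighbor)
    return seen, held_out


# Toy graph: "a" has three neighbors, but the context only visits two of them.
adj = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
context = ["b", "a", "c", "a"]  # the walk ends at the final token "a"
seen, held_out = split_neighbors(context, "a", adj)
print(seen, held_out)
```

For clean/corrupt patching, `context` would be the corrupt prompt while `adj` is the clean graph, so held-out neighbors are edges the corrupt prompt never displayed.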

### C.3 Activation Patching

The patching run reported in the paper uses Llama-3.1-8B, 200 grid/ring clean-corrupt prompt pairs, context length $T=1400$, final-position `hook_resid_post` activations, and the selected layer set $\{14,15,16,20,24,26,28,30\}$.

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/activation_patching_final_token_T1400_paper.png)

(a) Final-token patching at $T=1400$.

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/activation_patching_seen_heldout_T1400_paper.png)

(b) Seen/held-out split at $T=1400$.

Figure 4: Long-context activation-patching diagnostics. The selected-layer run shows late-layer recovery and held-out graph-neighbor logits becoming positive in late layers.

Table 1: Selected activation-patching aggregates. Effects are mean $\pm$ standard error over prompt pairs.

### C.4 Activation Steering

The steering run uses disjoint train and evaluation contexts. We compute grid-minus-ring vectors from 1000 training contexts per graph at $T=1400$, then evaluate on 500 held-out grid/ring prompt pairs at layers 20–28 and $\alpha\in\{-5,-2,-1,-0.5,0,0.5,1,2,5\}$. The raw JSONL contains duplicate rows from a prior append; all reported steering aggregates and paper figures deduplicate by the complete intervention key (pair_id, layer, $\alpha$, control, evaluation direction), keeping the first occurrence, which yields the expected 243,000 unique interventions.
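First-occurrence deduplication over a JSONL log can be sketched as below. The field names in `KEY_FIELDS` and the `effect` field are guesses at how the intervention key might be stored, not the actual schema of the released logs.

```python
import json

# Hypothetical field names for the complete intervention key.
KEY_FIELDS = ("pair_id", "layer", "alpha", "control", "direction")


def dedupe_first(rows):
    """Keep the first row for each intervention key, preserving order."""
    seen_keys = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        if key not in seen_keys:
            seen_keys.add(key)
            unique.append(row)
    return unique


# Toy JSONL content: the second line repeats the first key.
jsonl = "\n".join([
    '{"pair_id": 0, "layer": 20, "alpha": 1.0, "control": "real", "direction": "t2s", "effect": 0.41}',
    '{"pair_id": 0, "layer": 20, "alpha": 1.0, "control": "real", "direction": "t2s", "effect": 0.99}',
    '{"pair_id": 0, "layer": 21, "alpha": 1.0, "control": "real", "direction": "t2s", "effect": 0.38}',
])
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
unique_rows = dedupe_first(rows)
print(len(rows), len(unique_rows))
```

A useful sanity check after deduplication is to compare `len(unique_rows)` against the size of the intended intervention grid, as the text does with the 243,000 figure.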

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/activation_steering_layer_alpha_paper.png)

(a) Target-to-source layer/alpha heatmap.

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/activation_steering_reverse_paper.png)

(b) Source-to-target direction.

![Refer to caption](https://arxiv.org/html/2605.08405v1/figures/activation_steering_seen_heldout_paper.png)

(c) Positive-alpha seen/held-out split.

Figure 5: Additional steering diagnostics. Steering is strongest for target-to-source grid-minus-ring additions in late layers. The reverse direction is weaker, and held-out edge-specific logits remain difficult to move with a single global vector.

Table 2: Deduplicated target-to-source steering aggregates, averaged over layers 20–28 and 500 evaluation pairs.

#### Interpretation.

Activation patching is a high\-bandwidth, pair\-specific intervention: it replaces the corrupt residual state with the clean residual state at one layer and position\. Steering is a low\-bandwidth, population\-level intervention: it adds one global graph\-difference vector to many contexts\. The large gap between near\-complete patching recovery and partial steering recovery is therefore not a contradiction\. It suggests that the residual stream contains both a broad graph\-family direction and finer edge\- or context\-specific information that a single steering vector does not capture\.
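The bandwidth contrast can be made concrete with a toy residual vector. In this sketch, coordinate 0 plays the role of a graph-family direction and the other coordinates carry context-specific detail; the three-dimensional vectors, the mean-difference steering vector, and all function names are simplifying assumptions rather than the paper's implementation.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(grid_acts, ring_acts):
    """Global grid-minus-ring direction (assumed here to be a mean difference)."""
    return [g - r for g, r in zip(mean_vector(grid_acts), mean_vector(ring_acts))]


def patch(corrupt_resid, clean_resid):
    """Patching: replace the corrupt state with the clean state wholesale."""
    return list(clean_resid)


def steer(resid, direction, alpha):
    """Steering: add alpha times one shared direction."""
    return [x + alpha * d for x, d in zip(resid, direction)]


# Toy training activations for each graph family.
grid_acts = [[1.0, 0.2, -0.1], [1.0, -0.2, 0.1]]
ring_acts = [[-1.0, 0.3, 0.0], [-1.0, -0.3, 0.0]]
direction = steering_vector(grid_acts, ring_acts)

clean = [1.0, 0.5, -0.4]     # grid-context residual
corrupt = [-1.0, -0.3, 0.2]  # ring-context residual

patched = patch(corrupt, clean)
steered = steer(corrupt, direction, alpha=1.0)
print(patched, steered)
```

In this toy setting, patching reproduces every coordinate of the clean state, while steering corrects only the shared graph-family coordinate and leaves the context-specific coordinates at their corrupt values.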
