Decomposing how prompting steers behavior
Summary
This paper introduces a nested geometric decomposition framework to analyze how prompting reorganizes internal representations in large language and vision-language models. The authors show that affine transformations, particularly cross-dimensional linear mixing, are key to explaining prompt-induced behavioral changes.
# Decomposing how prompting steers behavior
Source: [https://arxiv.org/html/2606.03093](https://arxiv.org/html/2606.03093)
Fan L\. Cheng \(Columbia University, fan\.cheng@columbia\.edu\) and Nikolaus Kriegeskorte \(Columbia University, n\.kriegeskorte@columbia\.edu\)
###### Abstract
Prompting steers large language models \(LLMs\) and vision–language models \(VLMs\) without weight updates, but it remains unclear how a change in instruction reshapes internal representations to produce a behavioral effect\. We introduce a nested geometric decomposition framework that treats prompting as a transformation of the representational geometry for the content following the prompt\. We ask what class of mathematical transformation best explains the effect of prompting by finding the best alignment between representations of the same stimulus set following different prompts\. For each prompt pair, we fit a sequence of increasingly expressive stimulus\-invariant maps: translation, rigid transformation with uniform\-scaling, sequential axis scaling, affine, and nonlinear transformations\. We then test these maps causally by replacing a single layer’s prompt\-A hidden state for a new set of stimuli with its mapped counterpart and measuring recovery of prompt\-B representational geometry and behavior\. Across three LLMs, three VLMs, and six text or image datasets varying in style, emotion, scene content, and number, prompts consistently reshape representational geometry toward the instructed task structure\. In the cross\-validated nested variance decomposition, much of the prompt\-induced activation change is explained by shape\-preserving maps: translation and rigid transformation with uniform\-scaling\. The tier profiles reveal model\- and task\-specific routing strategies, differing in how much transformation classes explain variance and where along the layer hierarchy their contributions emerge\. Crucially, although translation and rigid transformation tiers already improve behavioral agreement, affine transformation is the first tier to nearly recover target\-prompt task geometry and produces corresponding gains in behavioral agreement\. 
This suggests that cross\-dimensional linear mixing may be a key contributor to how prompts reorganize representations toward the instructed task structure\. Our framework provides a general way to decompose prompt\-induced representational change into interpretable geometric components, revealing how a model routes task\-relevant structure to produce prompt\-driven behavior\.
## 1 Introduction
Instruction prompts can alter model behavior at inference time without updating parameters\. For a fixed stimulus, changing the instruction from “describe the object category” to “describe the artistic style,” for example, can shift which stimulus dimensions become behaviorally relevant and, consequently, which output the model produces\. This raises a mechanistic question: how does a change in instruction reorganize internal representations so as to steer downstream behavior?
Two complementary literatures bear on this question\. First, representational analyses show that prompts and context restructure internal geometry, producing measurable changes in representational similarity, manifold capacity, trajectory geometry, probing performance, and related diagnostics \(Kirsanov et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib23); Hosseini et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib24); Park et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib25); Polo et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib29); Gonzalez\-Gutierrez and Hovy, [2025](https://arxiv.org/html/2606.03093#bib.bib28); Davidson et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib27); Park et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib22)\)\. These effects can be depth\-dependent, with some studies suggesting that intermediate layers provide especially informative or compressed representations and later layers more directly support prediction \(Jiang et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib53); Skean et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib42), [2025](https://arxiv.org/html/2606.03093#bib.bib43)\)\. Second, work on activation steering shows that relatively simple activation\-space interventions, such as additive vectors, rotations, and affine maps, can modulate prompt\-induced behavioral effects \(Turner et al\., [2023](https://arxiv.org/html/2606.03093#bib.bib1); Zou et al\., [2023](https://arxiv.org/html/2606.03093#bib.bib2); Singh et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib3); Vu and Nguyen, [2025](https://arxiv.org/html/2606.03093#bib.bib4)\)\. Taken together, these findings suggest that natural prompting may itself be implemented through a low\-complexity transformation in activation space\.
Figure 1: \(a\) Two prompts $A$ \(“Are there people?”\) and $B$ \(“How many people?”\) are presented to the same model together with a stimulus set $\mathcal{S}$\. We tap the hidden state at a single transformer layer $\ell$ \(highlighted\) to obtain the layer\-$\ell$ manifolds $\mathcal{M}^{A}=\Phi^{A}(\mathcal{S})$ \(green\) and $\mathcal{M}^{B}=\Phi^{B}(\mathcal{S})$ \(purple\)\. The same forward pass continues past $\ell$ and yields a prompt\-specific output per stimulus \(“Yes/No” vs\. counts\)\. We ask whether a systematic map $f:\mathcal{M}^{A}\to\mathcal{M}^{B}$ exists, how complex it has to be, and whether it causally drives behaviour\. \(b\) Prompt\-induced representational change as a function of normalised layer depth\. Each gray curve is one pair of model and dataset from the three LLMs \(OPT\-2\.7B, Llama3\-8B, Qwen3\-8B\) on three text datasets and three VLMs \(BLIP\-2, LLaVA\-OneVision, Qwen3\-VL\) on three image datasets; the black curve is their mean\. The change is small but non\-zero in early layers \($10^{-4}$–$10^{-2}$\) and grows nearly monotonically with depth, reaching $10^{-1}$–$10^{0}$ at the top, with a consistent shape across architectures and modalities\.
What remains unclear, however, is the *transformation class* of the prompt\-induced change\. Existing work on the geometry of prompting has established that prompts alter internal representations, but it has typically characterized those changes through scalar metrics rather than decomposing them into interpretable geometric components\. Conversely, activation\-steering studies show that engineered low\-complexity operators can influence behavior, but they rarely estimate such operators from natural prompting itself or test whether the recovered operators account for target\-prompt geometry induced by an instruction change\.
We address this gap with a nested geometric decomposition framework\. For each model, layer, dataset, and prompt pair, we fit a single stimulus\-invariant alignment map from source\-prompt activations to target\-prompt activations and evaluate it on held\-out stimuli\. The operator hierarchy separates centroid shifts, global Procrustes alignment, axis\-wise scaling in the Procrustes\-aligned basis, cross\-dimensional feature mixing, and nonlinear residual structure\. This hierarchy is motivated by classical Procrustes analysis and by shape\-based approaches to comparing representations under explicit transformation classes \(Gower, [1975](https://arxiv.org/html/2606.03093#bib.bib52); Williams et al\., [2021](https://arxiv.org/html/2606.03093#bib.bib46); Kornblith et al\., [2019](https://arxiv.org/html/2606.03093#bib.bib45); Williams, [2024](https://arxiv.org/html/2606.03093#bib.bib47); Harvey et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib48); Barbosa et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib51)\)\. In parallel, we use representational similarity analysis \(RSA\) and silhouette analyses to test whether prompting reshapes representational geometry toward the instructed task structure; RSA compares representational dissimilarity matrices and is designed to relate representational geometry to candidate task or model structures \(Kriegeskorte et al\., [2008](https://arxiv.org/html/2606.03093#bib.bib44); Lin and Kriegeskorte, [2024](https://arxiv.org/html/2606.03093#bib.bib49); Cheng and Jing, [2025](https://arxiv.org/html/2606.03093#bib.bib50)\)\. Finally, we apply each fitted map as a causal activation intervention, replacing a source\-prompt hidden state with its mapped counterpart and measuring whether the resulting representation and output recover those of the target prompt\.
The resulting tier profiles reveal the prompt\-routing strategy a model uses to follow an instruction: which transformation tiers carry the change, where they emerge across layers, and whether they recover task\-aligned geometry and behavior\.
#### Contributions\.
1. We introduce a *nested geometric decomposition* of prompt\-induced hidden\-state change across five transformation classes: translation, rigid transformation with uniform scaling, rigid transformation with axis\-wise scaling, affine, and nonlinear, yielding a unique contribution for each tier \(§[3](https://arxiv.org/html/2606.03093#S3)\)\.
2. We develop a *causal intervention protocol* that applies the fitted map from each tier at a single layer on held\-out stimuli and evaluates both representational geometry and behavioural recovery relative to the target prompt \(§[5](https://arxiv.org/html/2606.03093#S5)\)\.
3. Across LLMs, VLMs, and six datasets, we find that prompting consistently reshapes representational geometry toward the instructed task structure, that much of the prompt effect is captured by low\-complexity maps, and that model families, datasets, and prompt\-pair groups differ systematically in the relative contribution of transformation tiers and in where these contributions emerge across depth, revealing distinct prompt\-routing strategies \(§[6](https://arxiv.org/html/2606.03093#S6)\)\.
## 2 Related work
#### Geometry of prompting and in\-context representations\.
Prior work has characterized the geometric effects of prompting and in\-context structure primarily through changes in scalar metrics\. One study analyzes zero\-shot, few\-shot, and soft prompting using manifold\-capacity tools, showing that different prompting regimes can induce distinct representational mechanisms for task adaptation \(Kirsanov et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib23)\)\. Other work characterizes how context and in\-context examples reorganize internal representations into compact, identifiable substrates that drive in\-context generalization, such as attention\-head circuits, task vectors, and trajectory or phase\-transition\-like geometry \(Hosseini et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib24); Park et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib25); Yang et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib26); Hendel et al\., [2023](https://arxiv.org/html/2606.03093#bib.bib30); Todd et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib31)\)\. A complementary line documents the dynamic, distributed, and not always task\-aligned representational effects of prompting \(Gonzalez\-Gutierrez and Hovy, [2025](https://arxiv.org/html/2606.03093#bib.bib28); Davidson et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib27); Li et al\., [2025b](https://arxiv.org/html/2606.03093#bib.bib32); Polo et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib29); Li et al\., [2025a](https://arxiv.org/html/2606.03093#bib.bib33); Simhi et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib35)\)\. Relative to this line of work, our contribution is to estimate explicit transformation classes between prompt\-conditioned representation manifolds and to test whether the transformations causally recover both representational geometry and behavior\.
#### Activation steering and geometric interventions\.
We organize prior steering methods within the transformation hierarchy $\mathcal{F}_{\mathrm{T}}\subset\mathcal{F}_{\mathrm{O_u}}\subset\mathcal{F}_{\mathrm{O_a}}\subset\mathcal{F}_{\mathrm{L}}\subset\mathcal{F}_{\mathrm{N}}$ \(Fig\.[3](https://arxiv.org/html/2606.03093#S4.F3)\), distinguishing *global* operators, which are stimulus\-invariant, from *input\-conditioned* interventions, whose effective transformation depends on the current stimuli\. Most existing methods are naturally described at the translation tier, where a stimulus\-invariant direction is added to the hidden state \(Turner et al\., [2023](https://arxiv.org/html/2606.03093#bib.bib1); Rimsky et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib36); Li et al\., [2023b](https://arxiv.org/html/2606.03093#bib.bib5); Zou et al\., [2023](https://arxiv.org/html/2606.03093#bib.bib2); Templeton, [2024](https://arxiv.org/html/2606.03093#bib.bib37); Wang et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib14); Davidson et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib27); Singh et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib3)\)\. A smaller group of methods fits a single global linear or affine intervention \(Postmus and Abreu, [2024](https://arxiv.org/html/2606.03093#bib.bib7); Sheng et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib17); Wu et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib38), [2025](https://arxiv.org/html/2606.03093#bib.bib18); Singh et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib3)\)\.
Rotation\-based methods and piecewise\-affine optimal\-transport edits are also relevant, but they are generally not estimated as Procrustes maps between paired representation clouds, and their target\-angle or adaptive variants are often input\-conditioned \(Vu and Nguyen, [2025](https://arxiv.org/html/2606.03093#bib.bib4); Pham and Nguyen, [2024](https://arxiv.org/html/2606.03093#bib.bib6); Abdullaev et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib8); Rodriguez et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib39); Scialanga et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib13)\)\. The rigid transformation with uniform scaling $\mathcal{F}_{\mathrm{O_u}}$ therefore remains comparatively underexplored as a stimulus\-set mapping estimator, while strictly nonlinear interventions appear relatively rare \(Zhao et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib12); Raval et al\., [2026](https://arxiv.org/html/2606.03093#bib.bib19)\)\. Independently of the operator class, steering methods also differ in the stage of the forward pass at which the intervention is applied \(Lee et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib10); Nguyen et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib11); Dang and Ngo, [2026](https://arxiv.org/html/2606.03093#bib.bib9)\), with a related line operating in weight space rather than activation space \(Fierro and Roger, [2025](https://arxiv.org/html/2606.03093#bib.bib15)\), and recent audits have argued for more systematic geometric and mediator\-typology characterizations \(Venkatesh and Mahendran Kurapath, [2026](https://arxiv.org/html/2606.03093#bib.bib16); Tan et al\., [2024](https://arxiv.org/html/2606.03093#bib.bib40); Im and Li, [2025](https://arxiv.org/html/2606.03093#bib.bib41); Wehner et al\., [2025](https://arxiv.org/html/2606.03093#bib.bib20)\)\.
In this paper, our goal is not to engineer operators that induce a desired behavior, but to use the same hierarchy as a measurement framework for natural prompting\. The finding that low\-complexity linear maps recover most of the prompt\-induced effect is consistent with the strong empirical performance of translation\- and affine\-class steering methods, which may succeed in part because they approximate the activation changes and representational geometry induced by prompting itself\.
Figure 2: Prompting reshapes representational geometry toward the instructed task structure\. \(a\) Multidimensional scaling \(MDS\) visualizations of prompt\-conditioned representations\. \(Top\) Layer\-32 representations from Llama3\-8B\-Instruct for 1,920 text stories under a topic prompt \(Prompt A\) and an emotion prompt \(Prompt B\)\. \(Bottom\) Layer\-27 representations from LLaVA\-OneVision for 1,000 COCO images under a person\-detection prompt \(Prompt A\) and a person\-counting prompt \(Prompt B\)\. Each point denotes one stimulus and is colored by the target label for Prompt B\. \(b\) Representational dissimilarity matrices \(RDMs\) for the same model, layer, and image set as in the bottom row of panel \(a\)\. The top two RDMs are computed from the prompt\-A and prompt\-B hidden states, respectively\. The bottom RDM is the target task RDM for prompt B, constructed from numerical distances between count labels\. \(c\) Layerwise alignment with the target task structure \(pooled across models and datasets\)\. Left: Spearman correlation between each prompt\-induced RDM and the corresponding target RDM, computed separately for prompt A and prompt B\. Right: silhouette score using the target labels, which measures whether stimuli with the same target label form compact, well\-separated clusters in representation space\. The shaded regions show $\pm 1$ SD\.
## 3 Method: nested geometric decomposition
#### Preliminaries\.
Let $\mathcal{S}$ denote the stimulus space \(a measurable set of images or text excerpts\), and let $\mathcal{P}$ be a finite set of prompts\. Let $M$ be a \(vision–\)language model with $L$ transformer blocks; fix an analysis layer $\ell\in\{1,\dots,L\}$\. For each prompt $p\in\mathcal{P}$, the model defines a *feature map*
$$\Phi^{p}:\mathcal{S}\longrightarrow\mathbb{R}^{D},\qquad s\longmapsto\mathbf{h}^{p}(s),\tag{1}$$
returning the layer\-$\ell$ hidden state at the final input\-prompt token\. We call its image
$$\mathcal{M}^{p}\coloneqq\Phi^{p}(\mathcal{S})\subset\mathbb{R}^{D}$$
the *manifold* of prompt $p$\. The ambient space $\mathbb{R}^{D}$ carries the canonical Euclidean inner product and the induced Frobenius metric on $\mathbb{R}^{N\times D}$\.
#### Paired observations across prompts\.
Fix a pair of prompts $A,B\in\mathcal{P}$ and a finite stimulus set $\{s_{i}\}_{i=1}^{N}\subset\mathcal{S}$ \(indices $i,j$ range over stimuli\)\. For each stimulus $i$ we obtain a paired observation $\big(\Phi^{A}(s_{i}),\,\Phi^{B}(s_{i})\big)\in\mathbb{R}^{D}\times\mathbb{R}^{D}$, and stack these into representation matrices
$$\mathbf{X}^{A},\;\mathbf{X}^{B}\in\mathbb{R}^{N\times D},\qquad\big[\mathbf{X}^{p}\big]_{i,:}=\Phi^{p}(s_{i})^{\top}.\tag{2}$$
#### Problem statement\.
We ask whether there is a map $f:\mathbb{R}^{D}\to\mathbb{R}^{D}$ that approximately satisfies $f\circ\Phi^{A}\approx\Phi^{B}$ on $\mathcal{S}$ \(Fig\.[1](https://arxiv.org/html/2606.03093#S1.F1)a\)\. We call a fitted map *global* if its parameters are *stimulus\-invariant*\. We contrast this with *input\-conditioned* steering rules, whose effective direction, angle, coefficient, or application mask varies with the current stimuli\. We ask how expressive a global map must be before it can explain, and causally reproduce, the representational effect of changing prompts\.
#### Hypothesis classes\.
We approximate $f$ by elements of an increasing chain of transformation families on $\mathbb{R}^{D}$, ranging from classical Lie\-group actions to more general affine and nonlinear families:
$$\underbrace{\mathbb{R}^{D}}_{\text{translations }T(D)}\;\hookrightarrow\;\underbrace{\mathbb{R}^{D}\rtimes\big(\mathbb{R}_{>0}\times O(D)\big)}_{\text{similarity group }\mathrm{Sim}(D)}\;\hookrightarrow\;\underbrace{\mathbb{R}^{D}\rtimes\big(\mathrm{Diag}(D)\times O(D)\big)}_{\text{axis-scaled rigid maps}}\;\hookrightarrow\;\underbrace{\mathbb{R}^{D}\rtimes\mathrm{GL}(D)}_{\text{affine group }\mathrm{Aff}(D)}\;\hookrightarrow\;\underbrace{\mathcal{F}_{\mathrm{N}}}_{\text{nonlinear maps}}.\tag{3}$$
Each level corresponds to a hypothesis class of mappings on hidden vectors $\mathbf{x}\in\mathbb{R}^{1\times D}$, with free intercept parameters $\mathbf{a},\mathbf{b}\in\mathbb{R}^{D}$\.
$\mathcal{F}_{\mathrm{N}}$ is instantiated in our experiments as a single shared multilayer perceptron $g_{\boldsymbol{\theta}}:\mathbb{R}^{D}\to\mathbb{R}^{D}$; reported $\Delta R^{2}_{\mathrm{N}}$ should therefore be read as a lower bound on what an arbitrarily expressive nonlinear map could explain\. We use the right\-multiplication convention $\mathbf{x}\mathbf{M}$ to match the row\-major data layout\. The five classes form a strictly nested chain $\mathcal{F}_{\mathrm{T}}\subset\mathcal{F}_{\mathrm{O_u}}\subset\mathcal{F}_{\mathrm{O_a}}\subset\mathcal{F}_{\mathrm{L}}\subset\mathcal{F}_{\mathrm{N}}$; $\mathcal{F}_{\mathrm{O_a}}$ preserves the rigid alignment found by $\mathcal{F}_{\mathrm{O_u}}$ but allows each aligned axis to be independently rescaled, isolating axis\-wise gain modulation from arbitrary linear feature mixing\. Inclusion identifications are in Appendix [A](https://arxiv.org/html/2606.03093#A1)\.
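One plausible way to write the four closed\-form tiers, reconstructed from the verbal descriptions in this section and not taken from the source, is sketched below\. The symbols $s$, $\mathbf{Q}$, $\mathbf{d}$, and $\mathbf{W}$, and the reading of $\mathbf{a}$ and $\mathbf{b}$ as source and target centroids, are assumptions made for illustration\.

```latex
\begin{aligned}
f_{\mathrm{T}}(\mathbf{x})   &= (\mathbf{x}-\mathbf{a}) + \mathbf{b}, \\
f_{\mathrm{O_u}}(\mathbf{x}) &= s\,(\mathbf{x}-\mathbf{a})\,\mathbf{Q} + \mathbf{b},
  & s>0,\ \mathbf{Q}\in O(D), \\
f_{\mathrm{O_a}}(\mathbf{x}) &= (\mathbf{x}-\mathbf{a})\,\mathbf{Q}\,\mathrm{diag}(\mathbf{d}) + \mathbf{b},
  & \mathbf{d}\in\mathbb{R}^{D}, \\
f_{\mathrm{L}}(\mathbf{x})   &= (\mathbf{x}-\mathbf{a})\,\mathbf{W} + \mathbf{b},
  & \mathbf{W}\in\mathbb{R}^{D\times D}.
\end{aligned}
```

Under this reading the nesting is visible directly: $\mathbf{Q}=\mathbf{I}$, $s=1$ recovers translation; $\mathbf{d}=s\mathbf{1}$ recovers the similarity tier; and $\mathbf{W}=\mathbf{Q}\,\mathrm{diag}(\mathbf{d})$ recovers the axis\-scaled tier\.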
#### Parameter estimation\.
Each class is fit on the training fold by minimizing the squared Frobenius reconstruction error
$$\widehat{f}_{k}=\arg\min_{f\in\mathcal{F}_{k}}\big\|\mathbf{X}^{B}-f(\mathbf{X}^{A})\big\|_{F}^{2},\tag{4}$$
after centring each prompt condition by its training\-fold mean\. $\mathcal{F}_{\mathrm{T}}$ has a closed\-form mean\-shift solution; $\mathcal{F}_{\mathrm{O_u}}$ is solved by orthogonal Procrustes, $\mathcal{F}_{\mathrm{O_a}}$ by Procrustes followed by per\-axis least squares on the rigidly\-aligned features, and $\mathcal{F}_{\mathrm{L}}$ by ridge regression\. The nonlinear class $\mathcal{F}_{\mathrm{N}}$ is instantiated as a shared one\-hidden\-layer MLP fit by stochastic gradient descent on the same Frobenius criterion\. All estimators are evaluated under stratified $K$\-fold cross\-validation across stimuli\. Closed\-form derivations, the MLP capacity and optimizer, and the cross\-validation procedure are in Appendix [A](https://arxiv.org/html/2606.03093#A1)\.
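The closed\-form estimators named above can be sketched in NumPy as follows\. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fit_tiers`, the dictionary layout, the ridge strength, and the small denominator guard are all hypothetical, and the MLP tier is omitted\.

```python
import numpy as np

def fit_tiers(XA, XB, ridge=1e-3):
    """Fit the four closed-form tiers on one training fold (illustrative only).

    XA, XB: (N, D) arrays of paired prompt-A / prompt-B hidden states.
    Returns a dict mapping tier name -> callable on (n, D) prompt-A rows.
    The ridge value is an assumed placeholder, not taken from the paper.
    """
    a, b = XA.mean(axis=0), XB.mean(axis=0)   # training-fold centroids
    A, B = XA - a, XB - b                     # centred clouds

    # Orthogonal Procrustes: Q minimises ||B - A Q||_F over orthogonal Q.
    U, sing, Vt = np.linalg.svd(A.T @ B)
    Q = U @ Vt
    s = sing.sum() / np.sum(A * A)            # least-squares uniform scale

    # Per-axis least squares on the rigidly aligned features A Q.
    AQ = A @ Q
    d = np.sum(AQ * B, axis=0) / np.maximum(np.sum(AQ * AQ, axis=0), 1e-12)

    # Ridge regression for an unconstrained linear part W.
    W = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ B)

    return {
        "T": lambda X: (X - a) + b,
        "Ou": lambda X: s * ((X - a) @ Q) + b,
        "Oa": lambda X: ((X - a) @ Q) * d + b,
        "L": lambda X: (X - a) @ W + b,
    }
```

Held\-out predictions would come from applying the returned callables to test\-fold rows of $\mathbf{X}^{A}$, so that the centroids and maps are estimated on the training fold only\.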
#### Causal intervention\.
To probe whether the fitted geometric transformation also *causally* reproduces the effect of prompt $B$, we run the model on stimulus $s_{i}$ under prompt $A$ and replace the layer\-$\ell$ hidden state at the final input token by $\widehat{f}_{k}\big(\Phi^{A}(s_{i})\big)$ before continuing the autoregressive forward pass\. Let $y_{k}(s_{i})$ denote the resulting output token sequence\. As a no\-fit oracle reference we also include a level that patches the held\-out prompt\-$B$ representation directly, $\widehat{f}_{\mathrm{prompt_{B}}}(\Phi^{A}(s_{i})):=\Phi^{B}(s_{i})$, with the surrounding prompt\-$A$ context unchanged\. This is distinct from running the model end\-to\-end under prompt $B$: the prompt tokens, the attention pattern up to layer $\ell$, and the post\-layer\-$\ell$ processing all start from prompt $A$’s context, so $y_{\mathrm{prompt_{B}}}$ upper\-bounds what *any* fitted single\-layer replacement can recover from prompt $A$ alone\.
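The patching step can be illustrated on a toy layer stack\. The sketch below is an assumption\-laden stand\-in rather than the protocol used on real transformers: `forward`, `patch_at`, and `patch_fn` are hypothetical names, and the "model" is a plain sequence of vector\-to\-vector functions with no token context\.

```python
import numpy as np

def forward(h, layers, patch_at=None, patch_fn=None):
    """Run a toy layer stack; optionally overwrite the state after one layer.

    layers:   list of callables, each mapping a (D,) vector to a (D,) vector.
    patch_at: 1-based index of the layer whose output is replaced (or None).
    patch_fn: map applied to that layer's output, e.g. a fitted tier map or
              an oracle that returns the stored prompt-B state.
    """
    for idx, layer in enumerate(layers, start=1):
        h = layer(h)
        if idx == patch_at:
            h = patch_fn(h)   # single-layer replacement, then continue forward
    return h
```

In this toy stack the downstream computation depends only on the patched vector, so an oracle patch reproduces the prompt\-$B$ output exactly\. In a real transformer the prompt\-$A$ context tokens remain visible to later layers, which is why the text above treats the oracle as an upper bound and not as an identity\.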
## 4 Evaluation
For each transform $k\in\{\mathrm{T},\mathrm{O_u},\mathrm{O_a},\mathrm{L},\mathrm{N}\}$ we report three families of measures:
1. Incremental explained variance: Let $\mathrm{RSS}_{k}\coloneqq\big\|\mathbf{X}^{B}-\widehat{f}_{k}(\mathbf{X}^{A})\big\|_{F}^{2}$ on held\-out stimuli for $k\in\{\mathrm{T},\mathrm{O_u},\mathrm{O_a},\mathrm{L},\mathrm{N}\}$, and let $\mathrm{RSS}_{0}\coloneqq\big\|\mathbf{X}^{B}-\mathbf{X}^{A}\big\|_{F}^{2}$ be the no\-transform residual\. We define the *cumulative* cross\-validated $R^{2}$ of tier $k$ as $R^{2}_{k}=(\mathrm{RSS}_{0}-\mathrm{RSS}_{k})/\mathrm{RSS}_{0}$ and the *incremental* $R^{2}$ contribution of tier $k$ over its predecessor $k-1$ in the nested chain as $\Delta R^{2}_{k}=R^{2}_{k}-R^{2}_{k-1}$, where the no\-transform baseline has $R^{2}_{0}=0$\. These increments together with the residual unexplained fraction $R^{2}_{\mathrm{resid}}=\mathrm{RSS}_{\mathrm{N}}/\mathrm{RSS}_{0}$ sum to $1$ by construction\. Individual increments may be negative under cross\-validation when a more expressive class generalizes worse on held\-out stimuli;
2. representational geometry metrics: Spearman correlation between the data RDM of $\widehat{f}_{k}(\mathbf{X}^{A})$ and a category\-derived target RDM over $\{s_{i}\}$ based on the task structure of prompt $B$, and the silhouette score $s_{i}=(b_{i}-a_{i})/\max(a_{i},b_{i})$, where $a_{i}$ is the mean within\-category distance and $b_{i}$ is the mean distance to the nearest other category;
3. behavioral recovery: a per\-dataset keyword evaluator scores each intervened output $y_{k}(s_{i})$ on prompt $B$’s ground\-truth attribute label \(e\.g\. presence of the target style word, correct numeric count\), yielding two metrics: *relevance* \(does the text address the target attribute\) and *accuracy* \(is the answer correct for that stimulus\)\. The oracle\-patched output $y_{\mathrm{prompt_{B}}}(s_{i})$ provides the upper\-bound reference\.
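The bookkeeping behind the incremental explained variance in item 1 can be sketched as follows\. This is a minimal illustration, not the authors' code: the function name `decompose_r2` and the dictionary of held\-out predictions are hypothetical, and the baseline $R^{2}_{0}=0$ for the predecessor of the translation tier is inferred from the statement that the increments and the residual sum to $1$\.

```python
import numpy as np

TIERS = ["T", "Ou", "Oa", "L", "N"]   # nested order used in the text

def decompose_r2(XA, XB, preds):
    """Incremental R^2 per tier from held-out predictions.

    XA, XB: (N, D) held-out prompt-A / prompt-B hidden states.
    preds:  dict tier name -> (N, D) predicted prompt-B states.
    Returns a dict with one increment per tier plus the residual fraction.
    """
    rss0 = np.sum((XB - XA) ** 2)             # no-transform residual RSS_0
    r2_prev, out = 0.0, {}                    # assumed baseline R^2_0 = 0
    for k in TIERS:
        rss_k = np.sum((XB - preds[k]) ** 2)
        r2_k = (rss0 - rss_k) / rss0          # cumulative R^2_k
        out[k] = r2_k - r2_prev               # incremental Delta R^2_k
        r2_prev = r2_k
    out["resid"] = 1.0 - r2_prev              # equals RSS_N / RSS_0
    return out
```

Because the increments telescope, the returned values always sum to $1$, and an individual increment is negative whenever a more expressive tier has a larger held\-out residual than its predecessor\.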
Figure 3: Nested geometric decomposition of prompt\-induced representational maps: translation \($\mathcal{F}_{\mathrm{T}}$\), rigid transformation with uniform scaling \($\mathcal{F}_{\mathrm{O_u}}$\), rigid transformation with axis\-wise scaling \($\mathcal{F}_{\mathrm{O_a}}$\), affine transformation \($\mathcal{F}_{\mathrm{L}}$\), and nonlinear transformation \($\mathcal{F}_{\mathrm{N}}$\)\. Top: Schematic of the additional geometric freedom introduced by each tier \(listed in parentheses\); blue and red grids show the source and transformed representations, respectively\. Middle: Incremental explained variance over the preceding tier, $\Delta R^{2}$, as a function of normalised layer depth for LLMs, pooled across datasets and prompt pairs\. Bottom: The same decomposition for VLMs\. Lines denote model\-specific means across prompt pairs; shaded bands denote $\pm 1$ SD\.
## 5 Experiments
#### Models\.
We evaluate six open\-weight transformer models\. The language models are OPT\-2\.7B, Meta\-Llama\-3\-8B\-Instruct, and Qwen3\-8B\. The vision–language models are BLIP\-2 \(OPT\-2\.7B backbone, Q\-Former bridging\), LLaVA\-OneVision\-7B \(vision encoder/projector with Qwen2 language backbone, multimodal instruction\-tuned\), and Qwen3\-VL\-8B \(native vision\-language model with interleaved text\-image/video pretraining\)\.
#### Datasets\.
The three text datasets \(EmotionalStory, WritingStyle, Number\) are used by the LLMs; the three image datasets \(EmoSet, StyleTransfer, COCO\) are used by the VLMs\. Each dataset has a two\-attribute factorial structure \($\text{attr}_{A}\times\text{attr}_{B}$\) that the prompt taxonomy exploits for cross\- and within\-attribute comparisons\. Dataset sizes, attribute levels, construction procedures, and prompt templates are in Appendix [B](https://arxiv.org/html/2606.03093#A2)\. Prompts are written at three levels: *open* \(free\-form attribute query\), *specific* \(single\-value yes/no query\), and *irrelevant* \(task\-unrelated factual question\), and combined into six pair groups \(Table [1](https://arxiv.org/html/2606.03093#S5.T1)\)\. G3 and G4 are restricted to the secondary attribute axis; all pairs are evaluated in the forward direction only \(see Appendix [B](https://arxiv.org/html/2606.03093#A2) for details\)\.
Table 1: Prompt\-pair groups\. Examples show the prompt pairs for the EmotionalStory dataset\.
#### Implementation\.
We extract hidden states at the last input\-prompt token from all transformer layers, and fit and evaluate every transform under five\-fold cross\-validation\. For causal interventions we inject the fitted transform at a single layer and decode greedily up to $50$ tokens\. Hyperparameters, layer indexing per model family, MLP architecture, and evaluator dictionaries are in Appendix [C](https://arxiv.org/html/2606.03093#A3)\.
Figure 4: Comparison of prompt\-pair groups and datasets\. \(a\) Cross\-validated $\Delta R^{2}$, averaged over datasets, models, and layers for each prompt\-pair group\. Specific prompt\-pair groups show distinct decomposition profiles, regardless of whether the paired prompts differ across attributes or within an attribute\. \(b\) Top: $\Delta R^{2}_{\mathrm{O_u}}$, the additional variance explained by rotation/reflection and uniform scaling beyond translation\. Bottom: alignment dimensionality $D$\. Larger $D$ indicates that more dimensions contribute appreciably to the prompt\-pair alignment\. OPT\-2\.7B shows larger and more dataset\-dependent $\Delta R^{2}_{\mathrm{O_u}}$, whereas Llama3\-8B shows the lowest alignment dimensionality\. EmotionalStory \(story topic $\to$ emotion\) shows higher alignment dimensionality across models\.
## 6 Results
#### Prompting reshapes representational geometry toward the instructed task structure\.
We first ask how much change in the hidden activations is induced by prompting\. We feed the same stimulus set under two cross\-attribute open\-ended prompts \(G1\) and measure the normalised squared Frobenius distance between the prompted hidden states \(Fig\.[1](https://arxiv.org/html/2606.03093#S1.F1)b\)\. The change grows nearly monotonically with depth, which is consistent across other prompt\-pair groups \(Appendix[D](https://arxiv.org/html/2606.03093#A4)\)\.
We then ask whether this prompt\-induced change carries the geometrical structure required by prompt $B$ \(Fig\.[2](https://arxiv.org/html/2606.03093#S2.F2)\)\. The example 2D MDS visualization of the prompted hidden states of LLaVA\-OneVision\-7B for 1,000 COCO images shows that the image representations are separated along the attribute the prompt targets \(Fig\.[2](https://arxiv.org/html/2606.03093#S2.F2)a; see Appendix [E](https://arxiv.org/html/2606.03093#A5) for other models and datasets\)\. The representational dissimilarity matrices \(RDMs\), which contain the pairwise distances between image representations, make this explicit \(Fig\.[2](https://arxiv.org/html/2606.03093#S2.F2)b\): when stimuli are sorted by count, prompt $B$’s RDM displays the count\-graded ordinal block structure of the target count\-RDM \(bottom panel\), while prompt $A$’s does not\. The layer\-wise alignment with each prompt’s target attribute grows monotonically with depth and clearly separates prompt $B$ from prompt $A$ in the top layers \(Fig\.[2](https://arxiv.org/html/2606.03093#S2.F2)c; pooled across models and datasets, see Appendix [F](https://arxiv.org/html/2606.03093#A6) for individual results\)\. Higher RDM correlations indicate higher alignment with the task structure of prompt $B$, and higher silhouette scores indicate better separation among categories of the target attribute\.
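The two geometry metrics used in this comparison can be sketched in NumPy\. The sketch rests on assumptions that the text does not fix: Euclidean distance for the data RDM, a target RDM built from absolute differences between numeric labels \(as for the count task\), and the hypothetical helper names `pairwise_dist`, `average_ranks`, `rdm_spearman`, and `mean_silhouette`\. It also assumes at least two label categories\.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def average_ranks(v):
    """1-based ranks, with tied values sharing their average rank."""
    _, inv, counts = np.unique(v, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    return (cum - (counts - 1) / 2.0)[inv]

def rdm_spearman(X, labels):
    """Spearman correlation between the data RDM and a label-distance RDM."""
    iu = np.triu_indices(len(X), k=1)         # upper triangle, no diagonal
    data = pairwise_dist(X)[iu]
    target = np.abs(labels[:, None] - labels[None, :])[iu].astype(float)
    return np.corrcoef(average_ranks(data), average_ranks(target))[0, 1]

def mean_silhouette(X, labels):
    """Mean of s_i = (b_i - a_i) / max(a_i, b_i) over stimuli."""
    D = pairwise_dist(X)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():                    # singleton category: score 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()                 # mean within-category distance
        b = min(D[i, labels == c].mean()      # nearest other category
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

For categorical attributes such as emotion or style, a binary same/different target RDM would replace the absolute label difference; that substitution is likewise an assumption here\.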
Figure 5: Causal validation of the nested geometric decomposition. At each layer, we replace prompt $A$'s hidden states with the transformed counterpart for one fitted transformation type and then run the model forward through the remaining layers. (a) RDM correlation (left) and silhouette score (right). Axis-scaled rigid transformations improve representational alignment, affine maps produce a larger gain, and nonlinear maps yield the strongest recovery. (b) Behavior evaluation on prompt $B$'s task, measured by Relevance (left) and Accuracy (right). Beyond the translation tier, each successive tier improves behavioural recovery and progressively brings performance closer to prompt $B$.
#### Nested geometric decomposition\.
To quantify the extent to which prompt-induced representational changes are captured at each level of transformation complexity, we fit the nested family of models $\mathcal{F}_{\mathrm{T}}\subset\mathcal{F}_{\mathrm{O_u}}\subset\mathcal{F}_{\mathrm{O_a}}\subset\mathcal{F}_{\mathrm{L}}\subset\mathcal{F}_{\mathrm{N}}$ at every layer of each model for prompt pairs across six groups, and compute the *incremental* $R^{2}$ contributed by each tier on held-out stimuli (Fig. [3](https://arxiv.org/html/2606.03093#S4.F3); Fig. [4](https://arxiv.org/html/2606.03093#S5.F4)a; see Appendix [G](https://arxiv.org/html/2606.03093#A7) for group-specific results). Pure translation accounts for the largest share of explained variance, dominating in early layers and remaining substantial throughout the hierarchy. Higher-complexity tiers contribute additional, smaller increments primarily in mid-to-deep layers, with an overall increasing trend that peaks in the final layers.
Across LLMs, compared with Llama3 and Qwen3, OPT\-2\.7B relies less on translation in early layers and more on rotation, reflection, scaling, and nonlinear transformations, as well as more on affine transformations in intermediate layers \(Fig\.[3](https://arxiv.org/html/2606.03093#S4.F3)\)\. This pattern suggests that the models implement distinct early\-layer encoding strategies: OPT\-2\.7B appears to represent prompts as more distributed perturbations that require rotational and higher\-dimensional mixing components, whereas Llama3 and Qwen3 may rely on a lower\-dimensional “instruction code\.” Among VLMs, BLIP2 shows a stronger dependence on translation in late layers than LLaVA\-OneVision and Qwen3\-VL\. In BLIP2, multimodal information may be encoded by the Q\-Former as a steering vector that persists into deeper layers, whereas LLaVA\-OneVision and Qwen3\-VL show greater flexibility in information mixing, potentially because of differences in their training protocols\.
We additionally compare the explained incremental variance of each tier across different prompt-pair groups (Fig. [4](https://arxiv.org/html/2606.03093#S5.F4)a). The five-tier incremental contribution hierarchy ($\mathcal{F}_{\mathrm{T}}>\mathcal{F}_{\mathrm{L}}>\mathcal{F}_{\mathrm{O_u}}>\mathcal{F}_{\mathrm{O_a}}\approx\mathcal{F}_{\mathrm{N}}$) is consistent across prompt-pair groups, indicating that it is not an artifact of any particular pairing regime. Models rely more on rotation, reflection, and affine transformations for pairs of specific prompts, regardless of whether the pairs are cross-attribute or within-attribute. Appendix [H](https://arxiv.org/html/2606.03093#A8) further evaluates generalization across prompt paraphrases and out-of-distribution datasets, showing that fitted transformations are largely stable to semantic rewordings of the target prompt and partially transferable across stimulus distributions, although the relative contributions of transformations remain dataset-dependent.
#### Where rotation matters: explained variance and alignment dimensionality\.
Comparing $\Delta R^{2}_{\mathrm{O_u}}$ across datasets and prompt-pair groups (Fig. [4](https://arxiv.org/html/2606.03093#S5.F4)b, top) shows that OPT-2.7B generally exhibits larger $\Delta R^{2}_{\mathrm{O_u}}$ than Llama3 and Qwen3. OPT also shows distinct profiles across datasets and groups, with much higher $\Delta R^{2}_{\mathrm{O_u}}$ for cross-attribute, open-ended prompt pairs in WritingStyle than in EmotionalStory or Number, which are more translation-dominated (see Appendix [G](https://arxiv.org/html/2606.03093#A7)). By contrast, Llama3 and Qwen3 show higher $\Delta R^{2}_{\mathrm{O_u}}$ for within-attribute, specific prompt pairs than for other groups, whereas OPT shows the opposite trend, relying more on translation for within-attribute, specific prompt pairs.
We further computed the alignment dimensionality, defined as the number of singular values of the centered cross-covariance used by orthogonal Procrustes that exceed $0.01\,\sigma_{\max}$ (Fig. [4](https://arxiv.org/html/2606.03093#S5.F4)b, bottom; Appendix [I](https://arxiv.org/html/2606.03093#A9) shows the same qualitative pattern under a $0.1$ threshold). Higher alignment dimensionality suggests that more dimensions participate in the Procrustes alignment between prompt pairs. EmotionalStory repeatedly shows higher dimensionality than WritingStyle or Number, suggesting that topic-versus-emotion prompt pairs induce a richer, more distributed alignment structure, even when the incremental explained variance is small. Llama3-8B-Instruct tends to have the smallest alignment dimensionalities across models, indicating that its alignment is concentrated in fewer dominant directions than that of OPT-2.7B or Qwen3-8B. The dimensionality of rotation and reflection therefore varies with both task structure and model family.
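The definition of alignment dimensionality translates directly into a few lines of NumPy. The following is a minimal sketch under the stated definition; the function name `alignment_dimensionality` and the synthetic rank-3 data are illustrative assumptions.

```python
import numpy as np

def alignment_dimensionality(X_a, X_b, rel_threshold=0.01):
    """Count singular values of the centred cross-covariance X_a^T X_b
    that exceed rel_threshold times the largest singular value."""
    Xa = X_a - X_a.mean(axis=0, keepdims=True)
    Xb = X_b - X_b.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xa.T @ Xb, compute_uv=False)  # sorted in descending order
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_threshold * s[0]))

# Synthetic example: data confined to a 3-dimensional subspace of a
# 12-dimensional space, then rotated by a random orthogonal matrix.
rng = np.random.default_rng(0)
n, d, r = 200, 12, 3
basis, _ = np.linalg.qr(rng.normal(size=(d, r)))  # d x r, orthonormal columns
X_a = rng.normal(size=(n, r)) @ basis.T
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X_b = X_a @ Q

dim = alignment_dimensionality(X_a, X_b)
```

Because the synthetic activations span only three directions, only three singular values of the cross-covariance are appreciably non-zero, so the sketch reports a dimensionality of 3 regardless of the ambient dimension.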
#### Causal validation: representational geometry and behavior\.
We performed an interventional analysis (§[3](https://arxiv.org/html/2606.03093#S3.SS0.SSS0.Px6)) and computed alignment metrics on held-out stimuli to test whether the fitted transformations can causally recover the representational geometry and behaviour induced by prompt $B$ (Fig. [5](https://arxiv.org/html/2606.03093#S6.F5); pooled across models, datasets, and prompt-pair group G1). For the interventional analysis, we focus on prompt pairs with an open-ended prompt $B$ (G1 and G5) because they preserve non-degenerate multi-label structure for RSA, avoid yes-bias artefacts in behavioural evaluation, and provide a matched comparison between task-relevant and task-irrelevant source prompts.
Translation and uniform-scaled rigid transformations are omitted from Fig. [5](https://arxiv.org/html/2606.03093#S6.F5)a because they leave distance-based RDM rankings and silhouette scores unchanged (see Appendix [F](https://arxiv.org/html/2606.03093#A6) for additional results). Axis-scaled rigid transformations improve geometric alignment but remain clearly below affine and nonlinear transformations, especially in later layers. This indicates that allowing each Procrustes-aligned axis to be rescaled independently captures part of the prompt-pair transformation but does not fully account for the representational reorganisation. The additional improvement from affine maps suggests that feature mixing is important for matching prompt-$B$ geometry. The nonlinear map gives the strongest recovery, but its advantage over the affine map is smaller than the gain from the axis-scaled tier to the affine tier. Downstream behavior shows a sharper divide between translation and the higher-capacity transformations (Fig. [5](https://arxiv.org/html/2606.03093#S6.F5)b; see Appendix [J](https://arxiv.org/html/2606.03093#A10) for text outputs). Although the translation tier alone can already substantially steer behavior, each successive tier further improves behavioural performance and progressively brings it closer to that of prompt $B$.
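The reason the two shape-preserving tiers cannot change distance-based rankings can be checked numerically. The sketch below uses random synthetic data and arbitrary illustrative parameters (`c`, `b`, `scales`). It shows that a translation plus uniformly scaled orthogonal map multiplies every pairwise distance by the same factor, whereas per-axis scaling does not.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
c, b = 2.5, rng.normal(size=8)                # uniform scale and translation

D_orig = pairwise_distances(X)
D_rigid = pairwise_distances(c * X @ Q + b)   # translation + uniform-scaled rigid
scales = np.linspace(0.2, 3.0, 8)             # one scale per aligned axis
D_axis = pairwise_distances((X @ Q) * scales) # axis-scaled rigid

iu = np.triu_indices(len(X), k=1)
rigid_ratio = D_rigid[iu] / D_orig[iu]        # constant, equal to c
axis_ratio = D_axis[iu] / D_orig[iu]          # varies from pair to pair
```

Since all distances are rescaled by the same constant under the first map, their rank order is preserved, and the silhouette score, being a ratio of distance differences to distances, is unchanged as well. Axis-wise scaling breaks this proportionality, which is why it is the first tier that can alter these metrics.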
## 7 Conclusion
Our paper recasts prompting as a low-complexity geometric transformation: a translation does most of the work, and rotation and shearing fix the residual structured component. The effect of the prompts studied here can thus be understood as an affine transformation of the representation of the content following the prompt. Allowing a nonlinear transformation did not explain either the resulting geometry or the resulting behavior substantially better. We see this as evidence that the hidden-state effects of instruction prompting are well approximated by an affine map at the population level, and that the residual fine structure (the rotation and the small nonlinear part) is where the operational consequences of prompting are disproportionately concentrated.
Limitations and broader impacts. Our analysis is limited to zero-shot instruction prompts and does not evaluate few-shot demonstrations, soft prompts, or longer-context conditioning. Our intervention replaces the residual stream at a single layer and at the final input-prompt token; multi-layer or multi-token interventions and more systematic tests under input-distribution shift remain future work. The proposed framework provides an interpretability benefit by decomposing prompt-induced representational change into explicit transformation classes and quantifying their contributions to variance, geometry, and behavior. Such measurements could also inform more efficient activation-level steering. We therefore view the method both as a diagnostic tool for transparency and controlled evaluation, and as a potential basis for deployment-oriented steering techniques.
## Acknowledgments and Disclosure of Funding
## References
- Concept heterogeneity-aware representation steering. arXiv preprint arXiv:2603.02237.
- S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. (2025) Qwen3-vl technical report. arXiv preprint arXiv:2511.21631.
- J. Barbosa, A. Nejatbakhsh, L. Duong, S. E. Harvey, S. L. Brincat, M. Siegel, E. K. Miller, and A. H. Williams (2025) Quantifying differences in neural population activity with shape metrics. bioRxiv, pp. 2025–01.
- T. Boger and C. Firestone (2025) The psychophysics of style. Nature Human Behaviour, pp. 1–13.
- F. L. Cheng and X. Jing (2025) Interpreting style–content parsing in vision–language models. In NeurIPS 2025 Workshop on CogInterp.
- Q. Dang and C. Ngo (2026) Selective steering: norm-preserving control through discriminative layer selection. arXiv preprint arXiv:2601.19375.
- G. Davidson, T. M. Gureckis, B. Lake, and A. Williams (2026) Do different prompting methods yield a common task representation in language models? In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=fy5InEg0OL)
- C. Fierro and F. Roger (2025) Steering language models with weight arithmetic. arXiv preprint arXiv:2511.05408.
- C. Gonzalez-Gutierrez and D. Hovy (2025) Do prompts reshape representations? an empirical study of prompting effects on embeddings. arXiv preprint arXiv:2510.19694.
- J. C. Gower (1975) Generalized procrustes analysis. Psychometrika 40(1), pp. 33–51.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- S. E. Harvey, D. Lipshutz, and A. H. Williams (2024) What representational similarity measures imply about decodable information. In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models. [Link](https://openreview.net/forum?id=hqfzH6GCYj)
- R. Hendel, M. Geva, and A. Globerson (2023) In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9318–9333.
- E. A. Hosseini, Y. Li, Y. Bahri, D. Campbell, and A. K. Lampinen (2026) Context structure reshapes the representational geometry of language models. arXiv preprint arXiv:2601.22364.
- Z. Hu, L. Niu, and S. Varma (2026) The representational geometry of number. arXiv preprint arXiv:2602.06843.
- S. Im and S. Li (2025) A unified understanding and evaluation of steering methods. arXiv preprint arXiv:2502.02716.
- J. Jiang, Y. Dong, J. Zhou, and Z. Zhu (2025) From compression to expression: a layerwise analysis of in-context learning. arXiv preprint arXiv:2505.17322.
- A. Kirsanov, C. Chou, K. Cho, and S. Chung (2025) The geometry of prompting: unveiling distinct mechanisms of task adaptation in language models. arXiv preprint arXiv:2502.08009.
- S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. In International conference on machine learning, pp. 3519–3529.
- N. Kriegeskorte, M. Mur, and P. A. Bandettini (2008) Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in systems neuroscience 2, pp. 249.
- B. W. Lee, I. Padhi, K. N. Ramamurthy, E. Miehling, P. Dognin, M. Nagireddy, and A. Dhurandhar (2024) Programming refusal with conditional activation steering. arXiv preprint arXiv:2409.05907.
- B. Li, G. Deng, R. Chen, J. Yue, S. Zhang, Q. Zhao, L. Song, and L. Wen (2025a) REMA: a unified reasoning manifold framework for interpreting large language model. arXiv preprint arXiv:2509.22518.
- B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. (2024) Llava-onevision: easy visual task transfer. arXiv preprint arXiv:2408.03326.
- J. Li, D. Li, S. Savarese, and S. Hoi (2023a) Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp. 19730–19742.
- K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2023b) Inference-time intervention: eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, Vol. 36, pp. 41451–41530.
- Y. Li, D. Campbell, S. C. Chan, and A. K. Lampinen (2025b) Just-in-time and distributed task representations in language models. arXiv preprint arXiv:2509.04466.
- B. Lin and N. Kriegeskorte (2024) The topology and geometry of neural representations. Proceedings of the National Academy of Sciences 121(42), pp. e2317881121.
- T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
- D. V. Nguyen, H. M. Vu, N. Y. Pham, L. Zhang, and T. M. Nguyen (2025) Activation steering with a feedback controller. arXiv preprint arXiv:2510.04309.
- C. F. Park, A. Lee, E. S. Lubana, Y. Yang, M. Okawa, K. Nishi, M. Wattenberg, and H. Tanaka (2025) ICLR: in-context learning of representations. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=pXlmOmlHJZ)
- K. Park, Y. J. Choe, and V. Veitch (2024) The linear representation hypothesis and the geometry of large language models. In ICML.
- V. Pham and T. H. Nguyen (2024) Householder pseudo-rotation: a novel approach to activation editing in llms with direction-magnitude perspective. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 13737–13751.
- A. Polo, C. Chun, and S. Chung (2026) Emergent manifold separability during reasoning in large language models. arXiv preprint arXiv:2602.20338.
- J. Postmus and S. Abreu (2024) Steering large language models using conceptors: improving addition-based activation engineering. arXiv preprint arXiv:2410.16314.
- S. Raval, H. J. Song, L. Wu, A. Harrasse, J. M. Phillips, F. Barez, and A. Abdullah (2026) Curveball steering: the right direction to steer isn’t always linear. arXiv preprint arXiv:2603.09313.
- N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner (2024) Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504–15522.
- P. Rodriguez, A. Blaas, M. Klein, L. Zappella, N. Apostoloff, M. Cuturi, and X. Suau (2025) Controlling language and diffusion models by transporting activations. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=l2zFn6TIQi)
- M. Scialanga, T. Laugel, V. Grari, and M. Detyniecki (2025) Sake: steering activations for knowledge editing. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15966–15978.
- L. Sheng, C. Shen, W. Zhao, J. Fang, X. Liu, Z. Liang, X. Wang, A. Zhang, and T. Chua (2026) AlphaSteer: learning refusal steering with principled null-space constraint. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=1vvbzAqdTe)
- A. Simhi, F. Barez, M. Tutek, Y. Belinkov, and S. B. Cohen (2026) Old habits die hard: how conversational history geometrically traps llms. arXiv preprint arXiv:2603.03308.
- S. Singh, S. Ravfogel, J. Herzig, R. Aharoni, R. Cotterell, and P. Kumaraguru (2024) Representation surgery: theory and practice of affine steering. In Proceedings of the 41st International Conference on Machine Learning, ICML’24.
- O. Skean, M. R. Arefin, Y. LeCun, and R. Shwartz-Ziv (2024) Does representation matter? exploring intermediate layers in large language models. arXiv preprint arXiv:2412.09563.
- O. Skean, M. R. Arefin, D. Zhao, N. N. Patel, J. Naghiyev, Y. LeCun, and R. Shwartz-Ziv (2025) Layer by layer: uncovering hidden representations in language models. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=WGXb7UdvTX)
- N. Sofroniew, I. Kauvar, W. Saunders, R. Chen, T. Henighan, S. Hydrie, C. Citro, A. Pearce, J. Tarng, W. Gurnee, et al. (2026) Emotion concepts and their function in a large language model. arXiv preprint arXiv:2604.07729.
- D. Tan, D. Chanin, A. Lynch, B. Paige, D. Kanoulas, A. Garriga-Alonso, and R. Kirk (2024) Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems 37, pp. 139179–139212.
- A. Templeton (2024) Scaling monosemanticity: extracting interpretable features from claude 3 sonnet. Anthropic.
- E. Todd, M. Li, A. Sharma, A. Mueller, B. C. Wallace, and D. Bau (2024) Function vectors in large language models. In International conference on learning representations.
- A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2023) Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.
- S. Venkatesh and A. Mahendran Kurapath (2026) On the non-identifiability of steering vectors in large language models. arXiv e-prints, pp. arXiv–2602.
- H. M. Vu and T. M. Nguyen (2025) Angular steering: behavior control via rotation in activation space. In 2nd Workshop on Models of Human Feedback for AI Alignment. [Link](https://openreview.net/forum?id=GU2UeVZrSw)
- A. Wang, D. Shu, Y. Wang, Y. Ma, and M. Du (2025) Improving llm reasoning through interpretable role-playing steering. arXiv preprint arXiv:2506.07335.
- J. Wehner, S. Abdelnabi, D. Tan, D. Krueger, and M. Fritz (2025) Taxonomy, opportunities, and challenges of representation engineering for large language models. arXiv preprint arXiv:2502.19649.
- A. H. Williams, E. Kunz, S. Kornblith, and S. Linderman (2021) Generalized shape metrics on neural representations. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.). [Link](https://openreview.net/forum?id=L9JM-pxQOl)
- A. H. Williams (2024) Equivalence between representational similarity analysis, centered kernel alignment, and canonical correlations analysis. In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models. [Link](https://openreview.net/forum?id=zMdnnFasgC)
- Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts (2024) ReFT: representation finetuning for language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=fykjplMc0V)
- Z. Wu, Q. Yu, A. Arora, C. D. Manning, and C. Potts (2025) Improved representation steering for language models. arXiv preprint arXiv:2505.20809.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- H. Yang, H. Cho, Y. Zhong, and N. Inoue (2026) Unifying attention heads and task vectors via hidden state geometry in in-context learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=FIfjDqjV0B)
- J. Yang, Q. Huang, T. Ding, D. Lischinski, D. Cohen-Or, and H. Huang (2023) Emoset: a large-scale visual emotion dataset with rich attributes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20383–20394.
- S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. (2022) Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- H. Zhao, H. Sun, J. Kong, X. Li, Q. Wang, L. Jiang, Q. Zhu, T. Abdelzaher, Y. Choi, M. Li, et al. (2026) ODESteer: a unified ode-based steering framework for llm alignment. arXiv preprint arXiv:2602.17560.
- A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, et al. (2023) Representation engineering: a top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.
## Appendix A Methods: Nested Geometric Decomposition
#### Nested\-chain inclusions\.
$\mathcal{F}_{\mathrm{T}}$ is recovered from $\mathcal{F}_{\mathrm{O_u}}$ at $(\mathbf{a},c,\mathbf{Q})=(\mathbf{0},1,\mathbf{I})$; $\mathcal{F}_{\mathrm{O_u}}$ from $\mathcal{F}_{\mathrm{O_a}}$ at $\mathbf{D}=c\,\mathbf{I}$; $\mathcal{F}_{\mathrm{O_a}}$ from $\mathcal{F}_{\mathrm{L}}$ at $\mathbf{M}=\mathbf{Q}\,\mathbf{D}$ with $\mathbf{Q}\in O(D)$ and $\mathbf{D}$ diagonal; and $\mathcal{F}_{\mathrm{L}}$ from $\mathcal{F}_{\mathrm{N}}$ when $g_{\boldsymbol{\theta}}$ implements a linear map.
#### Parameter estimation\.
The four parametric tiers all operate on the centred matrices $\widetilde{\mathbf{X}}^{A}=\mathbf{X}^{A}-\boldsymbol{\mu}^{A}$ and $\widetilde{\mathbf{X}}^{B}=\mathbf{X}^{B}-\boldsymbol{\mu}^{B}$.
*Translation ($\mathcal{F}_{\mathrm{T}}$).* The mean-shift solution is
$$\widehat{\mathbf{b}} \;=\; \boldsymbol{\mu}^{B}-\boldsymbol{\mu}^{A}. \tag{5}$$
*Uniform-scaled rigid ($\mathcal{F}_{\mathrm{O_u}}$).* Compute the cross-covariance $\mathbf{C}=\widetilde{\mathbf{X}}^{A\,\top}\widetilde{\mathbf{X}}^{B}$ and its SVD $\mathbf{C}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$. The orthogonal-Procrustes solution is
$$\widehat{\mathbf{Q}} \;=\; \mathbf{U}\mathbf{V}^{\top},\qquad \widehat{c} \;=\; \frac{\bigl\langle\widetilde{\mathbf{X}}^{A}\widehat{\mathbf{Q}},\,\widetilde{\mathbf{X}}^{B}\bigr\rangle_{F}}{\bigl\|\widetilde{\mathbf{X}}^{A}\bigr\|_{F}^{2}}. \tag{6}$$
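The closed-form solution in Eq. (6) can be sketched in NumPy as follows. The function names and the row-vector convention $\widetilde{\mathbf{X}}^{B}\approx c\,\widetilde{\mathbf{X}}^{A}\mathbf{Q}$ are assumptions chosen to match the formula above; the data are synthetic.

```python
import numpy as np

def fit_uniform_scaled_rigid(X_a, X_b):
    """Orthogonal Procrustes with a single global scale (Eq. 6).

    Returns the two means, the scale c, and the orthogonal matrix Q such
    that (X_b - mu_b) is approximated by c * (X_a - mu_a) @ Q.
    """
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    Xa, Xb = X_a - mu_a, X_b - mu_b
    U, _, Vt = np.linalg.svd(Xa.T @ Xb)          # SVD of the cross-covariance C
    Q = U @ Vt
    c = np.sum((Xa @ Q) * Xb) / np.sum(Xa ** 2)  # Frobenius inner product / ||Xa||_F^2
    return mu_a, mu_b, c, Q

def apply_uniform_scaled_rigid(X, mu_a, mu_b, c, Q):
    """Map prompt-A states into prompt-B space with the fitted parameters."""
    return c * (X - mu_a) @ Q + mu_b

# Synthetic check: generate X_b from X_a with a known rotation, scale, and shift.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(100, 6))
Q_true, _ = np.linalg.qr(rng.normal(size=(6, 6)))
X_b = 1.7 * X_a @ Q_true + rng.normal(size=6)

mu_a, mu_b, c_hat, Q_hat = fit_uniform_scaled_rigid(X_a, X_b)
X_b_pred = apply_uniform_scaled_rigid(X_a, mu_a, mu_b, c_hat, Q_hat)
```

On noise-free synthetic data the fit recovers the generating scale and orthogonal matrix, which is a convenient sanity check before applying the estimator to real hidden states.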
*Axis-scaled rigid ($\mathcal{F}_{\mathrm{O_a}}$).* Re-use $\widehat{\mathbf{Q}}$ from the $\mathcal{F}_{\mathrm{O_u}}$ fit, form the rigidly-aligned features $\mathbf{Z}=\widetilde{\mathbf{X}}^{A}\widehat{\mathbf{Q}}$, and fit each diagonal entry by per-axis least squares,
$$\widehat{d}_{j} \;=\; \frac{\bigl\langle\mathbf{Z}_{:,j},\,\widetilde{\mathbf{X}}^{B}_{:,j}\bigr\rangle}{\bigl\|\mathbf{Z}_{:,j}\bigr\|_{2}^{2}},\qquad j=1,\dots,D. \tag{7}$$
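A minimal sketch of the per-axis least-squares step in Eq. (7) is given below. It assumes that the rigidly aligned features `Z` and the centred target matrix are already available; the function name `fit_axis_scales` and the synthetic data are illustrative.

```python
import numpy as np

def fit_axis_scales(Z, Xb_centred):
    """Per-axis least-squares scale d_j = <Z[:, j], Xb[:, j]> / ||Z[:, j]||^2.

    Z is assumed to be the centred prompt-A matrix after rotation by the
    Procrustes solution; each column is fitted independently.
    """
    num = np.sum(Z * Xb_centred, axis=0)
    den = np.sum(Z ** 2, axis=0)
    return num / den

# Synthetic check: scale each aligned axis by a known factor.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))
Z -= Z.mean(axis=0)
d_true = np.array([0.5, 1.0, 2.0, -1.5, 3.0])
Xb_centred = Z * d_true

d_hat = fit_axis_scales(Z, Xb_centred)
```

Because each axis is fitted separately, the estimator costs only $O(ND)$ once the rotation is fixed, and setting all $\widehat{d}_{j}$ to a common value recovers the uniform-scaled tier.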
*Affine ($\mathcal{F}_{\mathrm{L}}$).* Ridge regression gives the closed form
$$\widehat{\mathbf{M}} \;=\; \bigl(\widetilde{\mathbf{X}}^{A\,\top}\widetilde{\mathbf{X}}^{A}+\lambda\mathbf{I}\bigr)^{-1}\,\widetilde{\mathbf{X}}^{A\,\top}\widetilde{\mathbf{X}}^{B},\qquad \lambda=1. \tag{8}$$
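The ridge solution in Eq. (8) can be computed without forming an explicit inverse. The sketch below is illustrative only: the function name `fit_ridge_map` is an assumption, and solving the linear system with `np.linalg.solve` is a numerical choice made here, not a detail taken from the paper.

```python
import numpy as np

def fit_ridge_map(Xa_centred, Xb_centred, lam=1.0):
    """Ridge estimate M = (Xa^T Xa + lam I)^{-1} Xa^T Xb for centred inputs."""
    d = Xa_centred.shape[1]
    gram = Xa_centred.T @ Xa_centred + lam * np.eye(d)
    # Solving the system is numerically preferable to inverting `gram`.
    return np.linalg.solve(gram, Xa_centred.T @ Xb_centred)

# Synthetic check with a known linear mixing matrix.
rng = np.random.default_rng(0)
Xa = rng.normal(size=(200, 5))
Xa -= Xa.mean(axis=0)
M_true = rng.normal(size=(5, 5))
Xb = Xa @ M_true

M_ridge = fit_ridge_map(Xa, Xb, lam=1.0)   # lambda = 1, as in Eq. (8)
M_small = fit_ridge_map(Xa, Xb, lam=1e-8)  # near-unregularised fit
```

With $\lambda=1$ the estimate is slightly shrunk towards zero relative to the generating matrix, while a vanishing $\lambda$ recovers ordinary least squares when the Gram matrix is well conditioned.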
*Nonlinear ($\mathcal{F}_{\mathrm{N}}$).* $g_{\boldsymbol{\theta}}$ is a two-layer MLP with one hidden layer of size $H=512$ and GELU activation. It is trained for 200 epochs over the training-fold rows with mini-batch size $\min(256,N_{\text{train}})$, using AdamW (learning rate $10^{-3}$, weight decay $10^{-3}$) on the mean-squared reconstruction error (equivalent to the Frobenius criterion up to a $1/(N_{\text{train}}\,D)$ normalization). The seed is set per fold for reproducibility.
For $k\in\{\mathrm{O_u},\mathrm{O_a},\mathrm{L}\}$ the empirical risk is invariant under $(\mathbf{a},\mathbf{b})\mapsto(\mathbf{a}+\mathbf{v},\mathbf{b}+f_{k}(\mathbf{v})-f_{k}(\mathbf{0}))$; we adopt the canonical choice $\widehat{\mathbf{a}}=\boldsymbol{\mu}^{A}$ and $\widehat{\mathbf{b}}=\boldsymbol{\mu}^{B}$, reducing the remaining estimation to a problem on the centred matrices $\widetilde{\mathbf{X}}^{A},\widetilde{\mathbf{X}}^{B}$. For $\mathcal{F}_{\mathrm{N}}$ we use the same centring convention, although the risk is not invariant under arbitrary translations due to the nonlinearity of $g_{\boldsymbol{\theta}}$.
#### Cross\-validation\.
We use stratified five-fold cross-validation within each stimulus set; per-fold estimators are fit on the training partition and evaluated on the held-out partition. Cumulative $R^{2}_{k}$ and incremental $\Delta R^{2}_{k}$ are computed from held-out residual sums of squares.
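The bookkeeping behind cumulative and incremental $R^{2}$ can be sketched as below. Several simplifications are assumptions made for brevity: plain rather than stratified K-fold splitting, only two tiers (translation and ridge-affine, so the affine increment also absorbs the skipped intermediate tiers), and a reference residual defined by the identity map $\mathbf{X}^{A}\mapsto\mathbf{X}^{A}$, so that $R^{2}$ measures the fraction of prompt-induced change explained. All function names are hypothetical.

```python
import numpy as np

def fit_translation(Xa, Xb):
    b = Xb.mean(axis=0) - Xa.mean(axis=0)  # mean-shift solution
    return lambda X: X + b

def fit_affine(Xa, Xb, lam=1.0):
    mu_a, mu_b = Xa.mean(axis=0), Xb.mean(axis=0)
    A = Xa - mu_a
    M = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ (Xb - mu_b))
    return lambda X: (X - mu_a) @ M + mu_b

def nested_cv_r2(Xa, Xb, fitters, k=5, seed=0):
    """Held-out cumulative and incremental R^2 for an ordered dict of tiers."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(Xa)), k)
    rss = {name: 0.0 for name in fitters}
    rss_ref = 0.0  # residual of the identity map (assumed reference)
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        rss_ref += np.sum((Xb[test] - Xa[test]) ** 2)
        for name, fit in fitters.items():
            predict = fit(Xa[train], Xb[train])  # fit on the training fold only
            rss[name] += np.sum((Xb[test] - predict(Xa[test])) ** 2)
    cumulative = {name: 1.0 - r / rss_ref for name, r in rss.items()}
    incremental, prev = {}, 0.0
    for name in fitters:  # dicts preserve insertion order: simplest tier first
        incremental[name] = cumulative[name] - prev
        prev = cumulative[name]
    return cumulative, incremental

# Synthetic prompt pair: large shift, mild linear mixing, small noise.
rng = np.random.default_rng(0)
Xa = rng.normal(size=(300, 5))
M_true = np.eye(5) + 0.3 * rng.normal(size=(5, 5))
Xb = Xa @ M_true + 3.0 + 0.05 * rng.normal(size=(300, 5))

cum, inc = nested_cv_r2(Xa, Xb, {"T": fit_translation, "L": fit_affine})
```

In this toy example the translation tier captures most of the change because the shift dominates, and the affine tier adds a smaller increment, which qualitatively resembles the profile reported for the real models.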
#### Hypothesis interpretation\.
Each pure transformation class produces a characteristic decomposition profile, summarized in Table[2](https://arxiv.org/html/2606.03093#A1.T2)\.
Table 2:Decomposition signatures of pure transformation classes\.
## Appendix B Datasets and prompt design
#### Dataset design\.
EmotionalStory ($N{=}1{,}920$) is a corpus of $8\text{ emotions}\times 8\text{ topics}\times 30$ short stories, with name-pool and scenario-pool diversity injections inspired by [Sofroniew et al., [2026](https://arxiv.org/html/2606.03093#bib.bib65)]. WritingStyle ($N{=}1{,}440$) is a synthetically generated corpus of $6\text{ styles}\times 4\text{ topics}\times 60$ short passages; style and content vary factorially by rendering the same neutral content seed in each of the six styles. Number ($N{=}1{,}247$) uses a mix of pseudo-stimuli, real Wikipedia sentences, and task-template stimuli spanning five cognitive-task framings, with set-property prompts (e.g. “Are there any prime numbers?”) that require scanning all numerical tokens rather than matching a single value. Tables [3](https://arxiv.org/html/2606.03093#A2.T3)–[5](https://arxiv.org/html/2606.03093#A2.T5) show representative stimuli from the three text datasets.

EmoSet ($N{=}1{,}600$) is a balanced $8\text{ emotions}\times 4\text{ content categories}$ (person, animal, nature, object) subset of the EmoSet benchmark [Yang et al., [2023](https://arxiv.org/html/2606.03093#bib.bib60)]. StyleTransfer ($N{=}1{,}920$; Boger and Firestone, [2025](https://arxiv.org/html/2606.03093#bib.bib61)) consists of photographs rendered in seven styles (six artistic styles plus the original photographs) across four scenes (beach, bedroom, library, mountain), with multiple images per style $\times$ scene cell. COCO ($N{=}1{,}000$) is a subset of COCO val2017 [Lin et al., [2014](https://arxiv.org/html/2606.03093#bib.bib62)] selected for high multi-supercategory coverage ($\geq 3$ of 12 supercategories per image), probing people-detection and people-count prompts.
#### Diversity controls in synthetic LLM stimuli\.
Without explicit controls, the LLM seed generator collapses onto a few canonical patterns (e.g. defaulting to “Maya” for joy stories and “Marcus” for anger stories, or repeating one factual seed across all *nature* passages). To prevent this we (i) curate a balanced $100$-name pool for EmotionalStory and assign names by a coprime stride so each emotion uses $\sim 95$ distinct names with $\leq 4$ occurrences per (name, emotion) cell; (ii) curate a $15$-scenario pool per topic ($120$ total) and rotate as $\text{scenarios}[k \bmod 15]$, capping each scenario at $\leq 2$ occurrences per cell; (iii) for WritingStyle, curate $15$ subthemes per topic and rotate round-robin so each subtheme contributes $4$ stimuli at full scale.
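The stride-based rotation in (i) and (ii) can be illustrated with a short sketch. The stride value, the zero offset, and the placeholder pools below are assumptions, since the exact assignment rule is not given here; the sketch therefore shows the mechanism only and does not reproduce the reported $\sim 95$ distinct names per emotion.

```python
from math import gcd
from collections import Counter

def assign_names(n_stories, pool, stride, offset=0):
    """Walk the pool with a stride coprime to its size.

    A coprime stride visits every pool entry once before any entry repeats.
    """
    if gcd(stride, len(pool)) != 1:
        raise ValueError("stride must be coprime to the pool size")
    return [pool[(offset + k * stride) % len(pool)] for k in range(n_stories)]

pool = [f"name_{i:03d}" for i in range(100)]       # placeholder 100-name pool
names = assign_names(240, pool, stride=37)         # 8 topics x 30 stories for one emotion
counts = Counter(names)

scenarios = [f"scenario_{j:02d}" for j in range(15)]   # placeholder 15-scenario pool
picked = [scenarios[k % 15] for k in range(30)]        # scenarios[k mod 15] within one cell
```

With 240 stories drawn from 100 names, each name appears two or three times in this sketch, and each scenario appears exactly twice in a 30-story cell, which is compatible with the occurrence caps stated above.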
#### Number\-stimulus design\.
The Number dataset extends the design in Hu et al. [[2026](https://arxiv.org/html/2606.03093#bib.bib63)] in two ways. First, prompts are *set-property* queries (“Are there any prime numbers?”, “Are there any numbers greater than 5?”) rather than value-identification queries, because lexical surface-form matching trivially solves “Is the number exactly 3?” on the input “I have 3 apples” without engaging numerical processing. Second, the single-number random insertion is generalized to a balanced 1-/2-/3-number multi-number variant. Stimuli combine three types: (a) pseudo-stimuli (77-word wikitext-103 chunks [Merity et al., [2016](https://arxiv.org/html/2606.03093#bib.bib64)] with $c\in\{1,2,3\}$ target numbers inserted at random positions; $N{=}90$); (b) real Wikipedia sentences containing one or more target numbers $1$–$9$ ($N{\approx}675$); and (c) task-template stimuli reproducing five cognitive tasks (quantity, comparison, arithmetic, property, ordinal), with five phrasings per (task, number, format) cell ($N{\approx}442$). Anchor values in comparison/arithmetic templates are mathematically validated against the target before acceptance.
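To make the distinction concrete, the sketch below computes ground-truth answers for two set-property queries by scanning every target number in a stimulus, in either digit or word format. The helper names and the tokenization rule are assumptions for illustration and are not the labelling code used for the dataset.

```python
import re

WORD_TO_DIGIT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                 "six": 6, "seven": 7, "eight": 8, "nine": 9}
PRIMES = {2, 3, 5, 7}

def extract_targets(text):
    """Return all target numbers 1-9 in the text, in digit or word format."""
    found = []
    for tok in re.findall(r"[A-Za-z]+|\d+", text.lower()):
        if tok.isdigit():
            if 1 <= int(tok) <= 9:               # ignore multi-digit numbers such as 103
                found.append(int(tok))
        elif tok in WORD_TO_DIGIT:
            found.append(WORD_TO_DIGIT[tok])
    return found

def any_prime(text):
    """Answer to 'Are there any prime numbers?'."""
    return any(n in PRIMES for n in extract_targets(text))

def any_greater_than(text, threshold=5):
    """Answer to 'Are there any numbers greater than 5?'."""
    return any(n > threshold for n in extract_targets(text))

example = "I have 3 apples and eight pears"
targets = extract_targets(example)
```

Both answers depend on the full set of numbers in the stimulus, so a single surface-form match such as the token “3” is not sufficient to decide them in general.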
Table 3: Example stimuli from EmotionalStory (topic $\times$ emotion, with one example per emotion and per topic, repeated to balance both axes; first $\sim\!130$ characters of each story shown). The full corpus is $8\,\text{emotions}\times 8\,\text{topics}\times 30=1{,}920$ stories.

Table 4: Example stimuli from WritingStyle (topic $\times$ style). The same neutral content seed is rendered in each of the six styles within a topic, so within-topic rows differ *only* in style. Each style and topic appears at least once; the full corpus is $6\,\text{styles}\times 4\,\text{topics}\times 60=1{,}440$ short passages.

Table 5: Example stimuli from Number, spanning the three stimulus types and the five task framings of Hu et al. [[2026](https://arxiv.org/html/2606.03093#bib.bib63)], each shown in both *digit* and *word* numerical formats. The full corpus has $N{\approx}1{,}247$ stimuli.
#### Prompt templates\.
EmotionalStory uses *“What emotion does this text express?”* and *“What topic is this text about?”* as the two open prompts; the sixteen specific prompts are yes/no queries *“Does this text express {value}?”* (joy, sadness, anger, fear, trust, disgust, surprise, anticipation) and *“Is this about {value}?”* (career, education, family, friendship, health, hobbies, finance, travel).

WritingStyle uses the analogous open pair *“What writing style is this in?”* / *“What topic is this text about?”* with the six styles ($\{$formal, casual, poetic, technical, journalistic, archaic$\}$) and four topics ($\{$nature, technology, food, sports$\}$).

Number uses the nine set-property prompts distributed across the five cognitive tasks (1 quantity / 3 comparison / 2 arithmetic / 2 property / 1 ordinal).

StyleTransfer uses *“What scene does the image depict?”* and *“What artistic style does the image belong to?”* as the open pair, with eleven specific prompts: four scenes (*“Is this a {beach, bedroom, library, mountain}?”*) and seven styles (*“Is this in the style of {Demuth, Klimt, Monet, Pollock}?”*).

COCO uses *“Are there people in this image?”* (detection) and *“How many people are in this image?”* (count) as the open pair, with eight count-specific yes/no queries: *“Are there exactly {0, 1, 2, 3, 4, 5} people in this image?”* plus the two range bins *“Are there between 5 and 10 people?”* and *“Are there more than 10 people?”*.

EmoSet (VLM) uses *“What is depicted in this image?”* (content) and *“What emotion does this image evoke?”* (emotion) as the open pair, with twelve specific prompts: four content categories (*“Are there people / animals?”*, *“Is this a natural landscape?”*, *“Is this primarily an inanimate scene?”*) and eight emotions (*“Does this image evoke {amusement, awe, contentment, excitement, anger, disgust, fear, sadness}?”*).
Irrelevant prompts \(*“What is the capital of France?”*,*“What is 2\+2?”*,*“What day of the week is it?”*, and seven similar factual\-knowledge questions\) are shared across all six datasets\. Full prompt JSONs will be in the released code\.
## Appendix C Implementation details
#### Layer indexing\.
For each model we extract hidden states at every transformer block at the last input-prompt token. Layer indices are reported on a normalized $[0,1]$ scale so that models of different depth can be overlaid; absolute layer counts are: OPT-2.7B (32), Llama-3-8B-Instruct (32), Qwen3-8B (36), BLIP-2 (32), LLaVA-OneVision-7B (28), Qwen3-VL-8B (36).
#### Feature extraction\.
For each (model, dataset, prompt) cell we feed every stimulus through the model under the prompt and store the residual-stream hidden state at the *last input-prompt token* from every transformer block; this token is the position at which the model commits to a continuation and is the conventional probe site for prompt-conditioned reads. LLM inputs are formatted using the model’s training-time convention: chat models (Llama-3-Instruct, Qwen-2-Instruct, Qwen3) use `tokenizer.apply_chat_template(…, add_generation_prompt=True)`; OPT-2.7B (a non-instruction-tuned base model) uses the QA template (`Question: {prompt}\nText: {text}\nAnswer:`). VLM inputs are processed by each model’s native processor (BLIP-2, LLaVA-OneVision, Qwen3-VL). Hidden states are stored as zarr arrays of shape $(1,\,n_{\text{stim}},\,D)$ per layer with metadata recording the pooling convention, model identity, prompt string, sequence length, and the number of truncated stimuli (typically zero).
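The last-token selection can be sketched independently of any particular model. The sketch assumes that per-layer hidden states have already been stacked into an array of shape `(n_layers, batch, seq_len, D)` (for instance from a Hugging Face model called with `output_hidden_states=True`) and that an attention mask marks non-padding tokens; the function name `last_token_states` is hypothetical.

```python
import numpy as np

def last_token_states(hidden, attention_mask):
    """Select the hidden state at the last non-padding token of each row.

    hidden:         array of shape (n_layers, batch, seq_len, D)
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens
    returns:        array of shape (n_layers, batch, D)
    """
    mask = np.asarray(attention_mask)
    batch, seq_len = mask.shape
    # Position of the last 1 in each row; valid for left- or right-padded batches.
    last = seq_len - 1 - np.argmax(mask[:, ::-1], axis=1)
    return hidden[:, np.arange(batch), last, :]

# Toy example: 2 layers, 2 stimuli, 4 token positions, D = 3.
hidden = np.arange(2 * 2 * 4 * 3, dtype=float).reshape(2, 2, 4, 3)
mask = np.array([[1, 1, 1, 0],     # right-padded: last real token at position 2
                 [1, 1, 1, 1]])    # no padding: last real token at position 3
out = last_token_states(hidden, mask)
```

Reading the position from the attention mask rather than taking index `-1` matters whenever stimuli of different lengths are batched together.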
#### Cross\-validation, ridge, and MLP\.
We use five-fold stratified cross-validation across stimuli, ridge $\lambda{=}1$, and a one-hidden-layer MLP ($H{=}512$, GELU, AdamW with weight decay $10^{-3}$, 200 epochs at mini-batch size $\leq 256$); see Appendix [A](https://arxiv.org/html/2606.03093#A1) for the estimator definitions.
#### Behavioural evaluation\.
Per-dataset keyword evaluators check whether generated text satisfies the target prompt’s semantic constraint, scoring two metrics: a *relevance* flag (does the text address the target attribute at all, e.g. does it mention any artistic-style or emotion vocabulary) and an *accuracy* flag (does the text identify the correct ground-truth label, e.g. “Monet” for a Monet-style image, “joy” for a joy-emotion story, the correct count for COCO). For style and emotion prompts the evaluator combines artist-name keywords, art-school keywords, and synonym lists (e.g. “post-impressionism”, “van gogh”, “starry night” all map to `vangogh`). Evaluator dictionaries will be in the released code.
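A minimal sketch of such an evaluator is given below. The dictionary is a placeholder: only the three `vangogh` synonyms come from the text above, the `monet` entry is an assumed stand-in, and the simple substring matching is an assumption rather than the released matching rule.

```python
# Placeholder synonym dictionary (label -> keywords); the real dictionaries are larger.
SYNONYMS = {
    "vangogh": ["van gogh", "post-impressionism", "starry night"],
    "monet": ["monet"],
}

def evaluate(text, target_label, synonyms=SYNONYMS):
    """Return (relevance, accuracy) flags for one generated text.

    relevance: the text mentions any vocabulary of the queried attribute
    accuracy:  the text mentions the ground-truth label or one of its synonyms
    """
    text = text.lower()
    mentioned = {label for label, keywords in synonyms.items()
                 if any(kw in text for kw in keywords)}
    return bool(mentioned), target_label in mentioned

flags_correct = evaluate("This looks like Starry Night.", "vangogh")
flags_wrong = evaluate("A painting in the style of Monet.", "vangogh")
flags_off_task = evaluate("Paris.", "monet")
```

Separating the two flags distinguishes an intervention that redirects the model to the right task but the wrong label (relevant, inaccurate) from one that leaves the model answering the source prompt (irrelevant).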
## Appendix D Normalized prompt-induced activation changes
The normalized squared Frobenius distance $\|\mathbf{X}^{B}-\mathbf{X}^{A}\|_{F}^{2}/\|\mathbf{X}^{A}\|_{F}^{2}$ is shown separately for the six prompt-pair groups (G1–G6). The summary across models and datasets is shown in Fig. [6](https://arxiv.org/html/2606.03093#A4.F6), and the individual (model, dataset) pair results are in Figs. [7](https://arxiv.org/html/2606.03093#A4.F7)–[12](https://arxiv.org/html/2606.03093#A4.F12). The two irrelevant-source prompt groups (G5–G6) produce the largest activation change, followed by the within-attribute open$\to$specific pair (G3) and the two cross-attribute pairs (G1: open$\to$open; G2: specific$\to$specific). The within-attribute specific$\to$specific pair (G4) produces the smallest change, consistent with both prompts querying the same attribute axis at the same specificity level.
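For reference, the quantity plotted in this appendix can be computed per layer as in the following sketch; the function name and the toy matrices are illustrative assumptions.

```python
import numpy as np

def normalized_change(XA, XB):
    """Squared Frobenius distance between prompt-B and prompt-A states,
    normalized by the squared Frobenius norm of the prompt-A states."""
    return np.sum((XB - XA) ** 2) / np.sum(XA ** 2)

XA = np.ones((4, 3))               # toy prompt-A states (n_stim x D)
XB = 2.0 * XA                      # toy prompt-B states
change = normalized_change(XA, XB)
no_change = normalized_change(XA, XA)
```

A value of 1 means the prompt-induced change has the same overall magnitude as the prompt-A representation itself.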
Figure 6: Normalized prompt-induced activation change across the six prompt-pair groups, pooled across all (model, dataset) cells. The depth-graded growth of the change is consistent across groups.

Figure 7: Normalized activation change for EmotionalStory across layers for OPT-2.7B (top), Llama-3-8B-Instruct (middle), Qwen3-8B (bottom).

Figure 8: Normalized activation change for WritingStyle across layers for OPT-2.7B (top), Llama-3-8B-Instruct (middle), Qwen3-8B (bottom).

Figure 9: Normalized activation change for Number across layers for OPT-2.7B (top), Llama-3-8B-Instruct (middle), Qwen3-8B (bottom).

Figure 10: Normalized activation change for EmoSet across layers for BLIP-2 (top), LLaVA-OneVision-7B (middle), Qwen3-VL-8B (bottom).

Figure 11: Normalized activation change for StyleTransfer across layers for BLIP-2 (top), LLaVA-OneVision-7B (middle), Qwen3-VL-8B (bottom).

Figure 12: Normalized activation change for COCO across layers for BLIP-2 (top), LLaVA-OneVision-7B (middle), Qwen3-VL-8B (bottom).
## Appendix E MDS visualizations
Figure 13: EmotionalStory dataset, prompt $A$ (topic) vs. prompt $B$ (emotion); stimuli coloured by ground-truth emotion.

Figure 14: WritingStyle dataset, prompt $A$ (topic) vs. prompt $B$ (writing style); stimuli coloured by ground-truth style.

Figure 15: Number dataset, prompt $A$ (numbers mentioned) vs. prompt $B$ (cognitive operation); stimuli coloured by ground-truth task framing.

Figure 16: EmoSet dataset, prompt $A$ (image content) vs. prompt $B$ (emotion evoked); stimuli coloured by ground-truth emotion.

Figure 17: StyleTransfer dataset, prompt $A$ (scene) vs. prompt $B$ (artistic style); stimuli coloured by ground-truth scene (top legend) and style (bottom legend).

Figure 18: StyleTransfer dataset (continued), prompt $A$ (scene) vs. prompt $B$ (artistic style); stimuli coloured by ground-truth scene (top) and style (bottom).

Figure 19: COCO dataset, prompt $A$ (people detection) vs. prompt $B$ (people count); stimuli coloured by ground-truth detection (top) and count bin (bottom).
## Appendix F RSA and silhouette score
For each pair of model and dataset, we report the layerwise Spearman correlation between the data RDM of the prompt-induced hidden states and the prompt-$B$ target-attribute RDM, alongside the silhouette score based on the target labels.
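The RDM correlation can be sketched in NumPy as follows. The sketch makes two assumptions that are not specified in this appendix: the data RDM uses Euclidean distances, and the target-attribute RDM is binary (0 for stimuli sharing a label, 1 otherwise). The silhouette score is not covered, and all names are illustrative.

```python
import numpy as np

def rankdata(a):
    """Average ranks (tied values share their mean rank), as Spearman requires."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    _, inv, counts = np.unique(a, return_inverse=True, return_counts=True)
    return np.bincount(inv, weights=ranks)[inv] / counts[inv]

def rsa_score(X, labels):
    """Spearman correlation between the data RDM and a binary label RDM."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(X), k=1)                  # upper triangle, no diagonal
    data_rdm = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)[iu]
    target_rdm = (labels[:, None] != labels[None, :]).astype(float)[iu]
    # Spearman = Pearson correlation of the rank-transformed entries.
    return np.corrcoef(rankdata(data_rdm), rankdata(target_rdm))[0, 1]

# Toy hidden states: three well-separated label clusters in 5 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
centers = 10.0 * np.eye(3, 5)
X = centers[labels] + 0.1 * rng.normal(size=(30, 5))

score = rsa_score(X, labels)
score_shuffled = rsa_score(X, rng.permutation(labels))
```

Because the binary target RDM contains many ties, the correlation stays below 1 even for perfectly separated clusters, so the layerwise curves are best read relative to each other rather than against a ceiling of 1.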


Figure 20: Layerwise RDM correlation (left) and silhouette score (right) for EmotionalStory (1/2) under prompt $A$ vs. prompt $B$ for OPT-2.7B (top) and Llama-3-8B-Instruct (bottom).

Figure 21: Layerwise RDM correlation (left) and silhouette score (right) for EmotionalStory (2/2) under prompt $A$ vs. prompt $B$ for Qwen3-8B.

Figure 22: Layerwise RDM correlation (left) and silhouette score (right) for WritingStyle (1/2) under prompt $A$ vs. prompt $B$ for OPT-2.7B (top) and Llama-3-8B-Instruct (bottom).

Figure 23: Layerwise RDM correlation (left) and silhouette score (right) for WritingStyle (2/2) under prompt $A$ vs. prompt $B$ for Qwen3-8B.

Figure 24: Layerwise RDM correlation (left) and silhouette score (right) for Number (1/2) under prompt $A$ vs. prompt $B$ for OPT-2.7B (top) and Llama-3-8B-Instruct (bottom).

Figure 25: Layerwise RDM correlation (left) and silhouette score (right) for Number (2/2) under prompt $A$ vs. prompt $B$ for Qwen3-8B.

Figure 26: Layerwise RDM correlation (left) and silhouette score (right) for EmoSet (1/2) under prompt $A$ vs. prompt $B$ for BLIP-2 (top) and LLaVA-OneVision-7B (bottom).

Figure 27: Layerwise RDM correlation (left) and silhouette score (right) for EmoSet (2/2) under prompt $A$ vs. prompt $B$ for Qwen3-VL-8B.

Figure 28: Layerwise RDM correlation (left) and silhouette score (right) for StyleTransfer (1/2) under prompt $A$ vs. prompt $B$ for BLIP-2 (top) and LLaVA-OneVision-7B (bottom).

Figure 29: Layerwise RDM correlation (left) and silhouette score (right) for StyleTransfer (2/2) under prompt $A$ vs. prompt $B$ for Qwen3-VL-8B.

Figure 30: Layerwise RDM correlation (left) and silhouette score (right) for COCO (1/2) under prompt $A$ vs. prompt $B$ for BLIP-2 (top) and LLaVA-OneVision-7B (bottom).

Figure 31: Layerwise RDM correlation (left) and silhouette score (right) for COCO (2/2) under prompt $A$ vs. prompt $B$ for Qwen3-VL-8B.
## Appendix G Decomposition results for individual prompt-pair groups
For each (model, dataset) cell we show the full five-tier incremental cross-validated $R^{2}$ at every layer as a stacked bar. Each segment is $\Delta R^{2}_{k}=R^{2}_{k}-R^{2}_{k-1}$ for one tier in the nested chain (translation $T$, rigid transformation with uniform scaling $O_u$, rigid transformation with axis-wise scaling $O_a$, affine $L$, nonlinear $N$). These visualizations complement the summary Fig. [3](https://arxiv.org/html/2606.03093#S4.F3) and make visible the *strategy-level* differences across model families discussed in §[6](https://arxiv.org/html/2606.03093#S6.SS0.SSS0.Px2).



Figure 32: Incremental $R^{2}$ of each transformation for EmotionalStory across layers for OPT-2.7B (top), Llama-3-8B-Instruct (middle), Qwen3-8B (bottom).

Figure 33: Incremental $R^{2}$ of each transformation for WritingStyle across layers for OPT-2.7B (top), Llama-3-8B-Instruct (middle), Qwen3-8B (bottom).

Figure 34: Incremental $R^{2}$ of each transformation for Number across layers for OPT-2.7B (top), Llama-3-8B-Instruct (middle), Qwen3-8B (bottom).

Figure 35: Incremental $R^{2}$ of each transformation for EmoSet across layers for BLIP-2 (top), LLaVA-OneVision-7B (middle), Qwen3-VL-8B (bottom).

Figure 36: Incremental $R^{2}$ of each transformation for StyleTransfer across layers for BLIP-2 (top), LLaVA-OneVision-7B (middle), Qwen3-VL-8B (bottom).

Figure 37: Incremental $R^{2}$ of each transformation for COCO across layers for BLIP-2 (top), LLaVA-OneVision-7B (middle), Qwen3-VL-8B (bottom).
## Appendix H Generalization to prompt paraphrases and out-of-distribution (OOD) datasets
We ask whether the fitted transformations $\widehat{f}_{k}$ capture task-dependent representational structure that generalizes beyond the specific prompt–stimulus pairings used for fitting. We evaluate two forms of generalization: (i) *prompt paraphrasing*, in which the canonical target prompt $B$ is replaced by semantically equivalent rewordings that query the same attribute, and (ii) *input distribution shift*, in which transformations fitted on one dataset are evaluated on another dataset.
### Prompt paraphrasing
#### Setup\.
We fix the source prompt $A$ and the canonical target prompt $B$ from the cross-attribute open–open pairing group (G1). We then construct three semantic paraphrases $\{B'_{i}\}_{i=1}^{3}$ of $B$, each preserving the queried attribute while varying the surface form of the instruction. For each paraphrase $B'_{i}$, we fit each tier-$k$ transformation $\widehat{f}_{k}^{(i)}$ on the paraphrased prompt pair $(\mathbf{X}^{A},\mathbf{X}^{B'_{i}})$ using the training stimuli, and evaluate the fitted transformation against the canonical target representations $\mathbf{X}^{B}$ on held-out stimuli. We use the same 5-fold cross-validation protocol as in the main experiments (Section [5](https://arxiv.org/html/2606.03093#S5)).

For paraphrase $i$, the cumulative paraphrase-generalization score of tier $k$ is

$$R^{2,(i)}_{k}=\frac{\bigl\|\mathbf{X}^{B}-\mathbf{X}^{A}\bigr\|_{F}^{2}-\bigl\|\mathbf{X}^{B}-\widehat{f}_{k}^{(i)}(\mathbf{X}^{A})\bigr\|_{F}^{2}}{\bigl\|\mathbf{X}^{B}-\mathbf{X}^{A}\bigr\|_{F}^{2}},$$

where both norms are computed over held-out stimuli and scores are averaged across folds. As a reference ceiling, we also fit and evaluate the canonical pair $(\mathbf{X}^{A},\mathbf{X}^{B})$ under the same cross-validation procedure. If the learned transformation captures the target attribute independently of the exact wording of $B$, the paraphrase-generalization curves should approach the canonical reference curve.
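The score can be illustrated for a single fold and the translation tier. In this sketch the map is fitted on the paraphrased pair and scored against the canonical target, as in the setup above; the synthetic data, the single train/test split, and the function name are assumptions for illustration.

```python
import numpy as np

def generalization_r2(XA, XB, f_hat):
    """Fraction of the prompt-A to prompt-B change recovered by f_hat."""
    base = np.sum((XB - XA) ** 2)
    resid = np.sum((XB - f_hat(XA)) ** 2)
    return (base - resid) / base

rng = np.random.default_rng(0)
XA = rng.normal(size=(120, 5))                      # prompt-A states
XB = XA + 2.0                                       # canonical target B
XB_para = XB + 0.05 * rng.normal(size=XB.shape)     # paraphrase B', nearly the same shift

train, test = np.arange(0, 96), np.arange(96, 120)  # one of five folds

# Translation tier fitted on the paraphrased pair (A, B') using training stimuli only.
t = XB_para[train].mean(axis=0) - XA[train].mean(axis=0)

# Scored against the canonical target B on held-out stimuli.
score = generalization_r2(XA[test], XB[test], lambda X: X + t)
baseline = generalization_r2(XA[test], XB[test], lambda X: X)
```

The identity map scores exactly 0 and a perfect map scores 1, so a map that moves representations away from the canonical target yields a negative score.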
We evaluate this analysis in one vision-language setting and one language-only setting: LLaVA-OneVision-7B on StyleTransfer, with canonical target prompt $B=$ “What artistic style does the image belong to?” and three paraphrases querying *style*; and Llama-3-8B-Instruct on EmotionalStory, with canonical target prompt $B=$ “What emotion does this text express?” and three paraphrases querying *emotion*.
#### Results\.
For Llama-3-8B-Instruct on EmotionalStory (Fig. [38](https://arxiv.org/html/2606.03093#A8.F38)), the paraphrase-generalization curves closely track the canonical reference curve, except in the earliest layers. This suggests that, across most layers, the fitted transformations capture a prompt-induced representational change that is largely stable across semantically equivalent rewordings of the target prompt. For LLaVA-OneVision-7B on StyleTransfer (Fig. [39](https://arxiv.org/html/2606.03093#A8.F39)), the paraphrase curves approach the canonical reference primarily in deeper layers. This indicates that paraphrase-invariant transformations emerge more strongly at later stages of LLaVA, whereas earlier layers remain more sensitive to the surface form of the prompt.

Figure 38: Cumulative cross-validated $R^{2}$ of each transformation under prompt paraphrasing, evaluated on the canonical $\mathbf{X}^{B}$ on held-out stimuli for Llama-3-8B-Instruct on EmotionalStory (source $A$ = topic, canonical target $B$ = emotion; three paraphrases of emotion).

Figure 39: Cumulative cross-validated $R^{2}$ of each transformation under prompt paraphrasing, evaluated on the canonical $\mathbf{X}^{B}$ on held-out stimuli for LLaVA-OneVision-7B on StyleTransfer (source $A$ = scene, canonical target $B$ = style; three paraphrases of style).
### Evaluation on out\-of\-distribution \(OOD\) datasets
#### Setup\.
We next test whether transformations fitted on one stimulus distribution generalize to a distinct dataset from the same modality. For each $(\mathrm{model}, D_{\mathrm{src}}, D_{\mathrm{tgt}})$ triple, we fit each tier-$k$ transformation $\widehat{f}_{k}^{\mathrm{src}}$ using all stimuli from the source dataset $D_{\mathrm{src}}$. We then apply the fitted transformation to all stimuli in the target dataset $D_{\mathrm{tgt}}$ under the same prompt pair and compute the per-tier incremental contribution $\Delta R^{2}_{k}$. As in the paraphrase analysis, we evaluate one vision-language setting (LLaVA-OneVision-7B: StyleTransfer $\leftrightarrow$ EmoSet) and one language-only setting (Llama-3-8B-Instruct: WritingStyle $\leftrightarrow$ EmotionalStory).
For comparison, the diagonal entries in Fig. [40](https://arxiv.org/html/2606.03093#A8.F40) show matched in-domain performance estimated by 5-fold cross-validation within each dataset, whereas the off-diagonal entries show cross-dataset transfer from $D_{\mathrm{src}}$ to $D_{\mathrm{tgt}}$.
#### Results\.
Transformations fitted on one dataset retain substantial explanatory power when transferred to a different dataset from the same modality, but the decomposition across transformation tiers changes under distribution shift \(Fig\.[40](https://arxiv.org/html/2606.03093#A8.F40)\)\. In the language\-only setting, transfer from EmotionalStory to WritingStyle shows the strongest degradation: the translation component decreases markedly in deeper layers relative to the in\-domain condition, with the loss primarily redistributed to the affine tier and the residual\. The reverse direction, WritingStyle to EmotionalStory, transfers more robustly, suggesting an asymmetry in how dataset\-specific structure contributes to the fitted transformation\.
In the vision\-language setting, the in\-domain decompositions are dominated by translation across layers\. Under cross\-dataset transfer, however, a larger fraction of the explained and unexplained variance is assigned to affine, nonlinear, and residual components\. Thus, while a substantial component of the fitted transformation is shared across datasets, the relative contribution of simple translation versus higher\-order transformations remains dataset\-dependent\.
Together with the paraphrase analysis, these results indicate that the nested geometric hierarchy captures prompt\-dependent representational structure that is partly invariant to prompt rewording and partly transferable across input distributions\. At the same time, the tier\-wise allocation of variance is sensitive to the stimulus distribution, suggesting that prompt\-induced representational changes contain both task\-level and stimulus\-distribution\-specific components\.
Figure 40: Out-of-distribution (OOD) generalization of the nested geometric decomposition. Top: Llama-3-8B-Instruct. Bottom: LLaVA-OneVision-7B. Rows denote the source dataset used to fit the transformation, and columns denote the target dataset used for evaluation. Diagonal panels show matched in-domain 5-fold cross-validation; off-diagonal panels show cross-dataset transfer. Each bar shows the per-layer stacked decomposition of incremental $\Delta R^{2}_{k}$ into transformation tiers and residual for cross-attribute open–open prompt pairs.
## Appendix I Dimensionality of Procrustes alignment
Fig. [4](https://arxiv.org/html/2606.03093#S5.F4)b (bottom) shows the alignment dimensionality using a 1% relative threshold on the singular values of the cross-covariance $\widetilde{\mathbf{X}}^{A\,\top}\widetilde{\mathbf{X}}^{B}$ ($\sigma_{i}>0.01\,\sigma_{\max}$). For completeness we also report the stricter 10% threshold ($\sigma_{i}>0.10\,\sigma_{\max}$) in Fig. [41](https://arxiv.org/html/2606.03093#A9.F41) for LLMs.
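The thresholding rule can be sketched as a count of singular values of the centred cross-covariance that exceed a fraction of the largest one. The function name and the synthetic example, built so that three directions are shared with relative strengths of roughly 1, 0.5, and 0.05, are assumptions for illustration.

```python
import numpy as np

def alignment_dim(XA, XB, rel_threshold=0.01):
    """Number of singular values of the centred cross-covariance above
    rel_threshold times the largest singular value."""
    A = XA - XA.mean(axis=0)
    B = XB - XB.mean(axis=0)
    s = np.linalg.svd(A.T @ B, compute_uv=False)    # sorted in descending order
    return int(np.sum(s > rel_threshold * s[0]))

rng = np.random.default_rng(0)
XA = rng.normal(size=(300, 6))
W = np.zeros((6, 6))
W[0, 0], W[1, 1], W[2, 2] = 1.0, 0.5, 0.05          # two strong directions, one weak
XB = XA @ W + 1.0

dim_1pct = alignment_dim(XA, XB, rel_threshold=0.01)
dim_10pct = alignment_dim(XA, XB, rel_threshold=0.10)
```

In this toy case the weak direction is counted at the 1% threshold but not at the 10% threshold, which is the kind of sensitivity the two-threshold comparison is meant to expose.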



Figure 41: Alignment dimensionality. (Top) Dimensionality of the centered cross-prompt cross-covariance at the stricter 10%-of-$\sigma_{\max}$ threshold (LLMs, all six prompt-pair groups). (Middle and bottom) LLMs, two thresholds. The layout across (model $\times$ dataset $\times$ group) is the same as in Fig. [4](https://arxiv.org/html/2606.03093#S5.F4)b.
## Appendix J Example generated texts after interventions
We provide qualitative examples of model generations after replacing an internal representation with the output of each fitted transformation tier. For a stimulus $s_{i}$ and layer $\ell$, we run the model with prompt $A$, replace the residual-stream activation at the final input-prompt token by $\widehat{f}_{k}(\Phi^{A}(s_{i}))$, and then greedily decode 50–100 tokens from the modified state. We compare seven conditions: prompt $A$ without intervention; `translation`; `rigid_uni`, corresponding to an orthogonal transformation plus uniform scaling; `rigid_axis`, corresponding to an orthogonal transformation plus axis-wise scaling; `affine`; `nonlinear`; and prompt $B$, an oracle activation reference in which the prompt-$A$ residual stream at the same layer and token is replaced by the held-out representation $\Phi^{B}(s_{i})$. This oracle condition preserves the remaining prompt-$A$ context and is therefore distinct from running the model end-to-end under prompt $B$.
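The replacement step can be illustrated on a toy residual stack. This is a deliberately simplified stand-in: the blocks below have no attention, the "fitted" map is a hypothetical translation, and real interventions would instead hook the chosen transformer block of the model; all names are illustrative.

```python
import numpy as np

def run_stack(layers, h0, patch_layer=None, patch_fn=None):
    """Run a toy residual stack on h0 of shape (seq_len, D).

    If patch_layer is given, the final-token state after that layer is
    overwritten with patch_fn(state); all other positions are untouched.
    """
    h = h0.copy()
    for i, W in enumerate(layers):
        h = h + np.tanh(h @ W)               # stand-in for one transformer block
        if i == patch_layer:
            h[-1] = patch_fn(h[-1])          # replace only the final-token activation
    return h

rng = np.random.default_rng(0)
D, seq_len, n_layers = 4, 3, 5
layers = [0.5 * rng.normal(size=(D, D)) for _ in range(n_layers)]
h_A = rng.normal(size=(seq_len, D))          # hypothetical prompt-A input states
t = np.full(D, 2.0)                          # hypothetical fitted translation f(x) = x + t

clean = run_stack(layers, h_A)
patched = run_stack(layers, h_A, patch_layer=2, patch_fn=lambda x: x + t)
identity = run_stack(layers, h_A, patch_layer=2, patch_fn=lambda x: x)
```

Only the final-token state differs between the clean and patched runs here, and later layers then process the modified state as usual, which is the sense in which the intervention is applied at a single layer and token.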
#### G1 / LLaVA-OneVision-7B $\times$ COCO.
Prompt $A$: *“Are there people in this image?”* Prompt $B$: *“How many people are in this image?”* This prompt pair tests whether an intervention can shift the model from binary detection to numerical counting. At the mid-depth layer $\ell=22$, higher-tier transformations replace the prompt-$A$ yes/no response format with count-like answers (Table [6](https://arxiv.org/html/2606.03093#A10.T6)).
Table 6: G1 / LLaVA-OneVision-7B $\times$ COCO, prompt pair *detect* $\to$ *count* at $\ell=22$.
#### G1 / Qwen3-VL-8B $\times$ COCO.
We use the same detection-to-counting prompt pair and apply interventions at layer $\ell=27$ (Table [7](https://arxiv.org/html/2606.03093#A10.T7)). Because Qwen3-VL often produces longer descriptive responses, we report only the first sentence in each cell. Translation and the orthogonal-plus-uniform scaling tier largely preserve the detection framing of prompt $A$ (e.g., “Yes, there are people…”), whereas affine and nonlinear interventions more consistently shift the output toward a numerical count.
Table 7: G1 / Qwen3-VL-8B $\times$ COCO, prompt pair *detect* $\to$ *count* at $\ell=27$ (first sentence shown per cell). Cells with identical text within a row are grouped.
#### G5 / LLaVA-OneVision-7B $\times$ COCO: irrelevant “capital” prompt.
Prompt $A$: *“What is the capital of France?”* Prompt $B$: *“How many people are in this image?”* This prompt pair tests whether an intervention can induce an image-dependent counting response even when the source prompt is task-irrelevant. Without intervention, the model answers the source question with “Paris.” After applying intermediate- and higher-tier transformations, the generated response shifts toward the count requested by prompt $B$ (Table [8](https://arxiv.org/html/2606.03093#A10.T8)). This suggests that the fitted transformations can inject target-task structure beyond the semantic content explicitly requested by the source prompt.
Table 8: G5 / LLaVA-OneVision-7B $\times$ COCO, prompt pair *capital* $\to$ *count* at $\ell=22$.
#### G5 / LLaVA-OneVision-7B $\times$ COCO: irrelevant “arithmetic” prompt.
Prompt $A$: *“What is 2+2?”* Prompt $B$: *“How many people are in this image?”* This condition provides a second task-irrelevant source prompt. The unmodified model produces the arithmetic answer “4,” whereas interventions can shift the response toward an image-dependent count (Table [9](https://arxiv.org/html/2606.03093#A10.T9)).
Table 9: G5 / LLaVA-OneVision-7B $\times$ COCO, prompt pair *arithmetic* $\to$ *count* at $\ell=22$.
## Appendix K Asset licenses and credits
#### Pretrained models\.
All models used in this paper are publicly available HuggingFace checkpoints, cited via their original papers and listed with the URL of the specific revision and the license under which we used them:
- •
- •
- •
- •
- •
- •
#### Image datasets\.
- COCO val2017 [Lin et al., [2014](https://arxiv.org/html/2606.03093#bib.bib62)] – annotations released under CC BY 4.0; the underlying images are subject to Flickr Terms of Use. We use a 1,000-image subset balanced by supercategory.
- EmoSet [Yang et al., [2023](https://arxiv.org/html/2606.03093#bib.bib60)] – non-commercial research use only, no redistribution permitted; used in compliance with the dataset’s terms of use. We use a 1,600-image $8\times 4$ emotion-by-content subset.
- StyleTransfer [Boger and Firestone, [2025](https://arxiv.org/html/2606.03093#bib.bib61)] – 1,920 photographs rendered in seven styles, made available alongside the published paper in *Nature Human Behaviour* 9 (2025) 2497–2509.
#### Curated text stimulus sets\.
The three LLM prompt-pair sets (EmotionalStory, WritingStyle, Number) were authored by us. They consist of factual, style, and numeric prompts and contain no personal or sensitive content. They will be released alongside the analysis code under CC BY 4.0.