A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting
Summary
This paper investigates whether direct activation transfer between language models can improve reasoning, using a linear translation layer from Pythia-160M to Pythia-410M. Despite achieving high representational alignment, the transferred activations do not improve multi-hop question answering, yielding a negative result.
# A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting
Source: [https://arxiv.org/html/2606.03280](https://arxiv.org/html/2606.03280)
###### Abstract
Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training\. We ask whether a more direct and stricter channel is also viable: can one language model communicate useful intermediate reasoning state to another at inference time by translating and injecting hidden activations, rather than by passing natural\-language text? We test this question in a controlled Pythia\-160M to Pythia\-410M multi\-hop reasoning setting\. A linear translation layer learns a strong normalized\-space map between sender and receiver hidden states, with normalized cosine similarity near 0\.97 across seeds\. However, when the translated activations are injected into the receiver at inference time, they do not improve downstream answering\. Low\-strength additive injection remains near the no\-injection baseline, with confidence intervals that cross zero\. Replacement\-style injection is consistently destructive, and rescaling translated vectors to the receiver hidden\-state norm does not rescue performance\. The result is therefore a scoped negative result: in this setting, offline representational alignment is not sufficient for useful causal communication inside the receiver\.
## 1 Introduction
How does hidden information move from one language model to another, and which channels actually carry it? Recent work demonstrates one surprising channel: a model can transmit behavioural traits—such as preferences or misalignment—to another model through hidden signals in generated training data, even when explicit references to the trait are filtered out\[[5](https://arxiv.org/html/2606.03280#bib.bib5)\]\. That transfer is data\-mediated, occurs at training time through fine\-tuning or distillation, operates at the level of a model\-wide trait, and is observed primarily when the sender and receiver share the same or a behaviourally matched base model\. This report tests a more direct and stricter channel within the same broad question\. Instead of routing a signal through generated data and a training step, we translate a sender’s hidden activations and inject them into a receiver at inference time, asking whether a single instance\-level reasoning state can transfer without any adaptation of the receiver\.
Activation\-level communication is attractive because it promises to bypass the text bottleneck\. In current multi\-agent large\-language\-model systems, information flows between models exclusively through natural\-language tokens, which forces sender and receiver to encode internal state to text on one side and decode text back to internal representations on the other\. If a sender model has built a useful internal representation, a receiver might in principle benefit from consuming that representation directly instead of reading a natural\-language relay\. This activation\-transfer study tests this idea in a deliberately narrow setting: Pythia\-160M sends hidden states to Pythia\-410M through a learned linear translation layer, and the receiver uses the translated states during multi\-hop question answering\.
The result is negative\. The translation layer is not trivially bad; it fits the normalized receiver activation space well\. The failure appears when the translated states are used causally inside the receiver\. This distinction is the core contribution of the report\. A vector can be aligned under an offline objective while still being unusable as a replacement for the receiver’s own hidden state trajectory\.
This report makes three claims\. First, the tested activation\-transfer mechanism does not outperform either no\-injection or natural\-language relay, even though the translation layer reaches normalized $R^{2}\approx 0.88$ and normalized cosine similarity $\approx 0.97$ across seeds; the contrast between high offline alignment and zero downstream improvement is a clean separation of representational alignment from receiver\-side causal usability\. Second, replacement\-style transfer is actively harmful, and the failure decomposes into two quantifiable factors: a roughly two\-orders\-of\-magnitude mismatch in L2 norm between the translated vector and the receiver’s native hidden state, and a residual direction or distribution error in the translated state that remains destructive even after the L2\-norm gap is closed by scale correction\. Third, the experiment narrows the design problem for future work: useful model\-to\-model activation communication likely requires a training objective that optimizes receiver\-side causal use, not only sender\-to\-receiver representational fit\.
We use *activation transfer* to mean inference\-time insertion of hidden states derived from one model into another\. The *sender* is Pythia\-160M, the *receiver* is Pythia\-410M, and the *translation layer* is a learned linear map between their hidden dimensions\. The *natural\-language relay* baseline lets the sender generate text and gives that text to the receiver\. *Replacement* injection substitutes the receiver hidden state at the injection layer with translated sender activations, while *additive* injection adds a small translated vector to the receiver hidden state\. The claim is intentionally scoped to one model family, one sender\-receiver pair, one task family, one fixed injection depth, and one linear translation mechanism; it should not be read as a general impossibility claim about activation communication, activation steering, model stitching, or representation engineering\.
## 2 Related Work
The experiment sits at the intersection of representation alignment, cross\-model transfer, and causal intervention\.
Model stitching connects part of one network to part of another through a learned mapping and asks whether the two halves compose into a working system\[[10](https://arxiv.org/html/2606.03280#bib.bib10),[1](https://arxiv.org/html/2606.03280#bib.bib1),[8](https://arxiv.org/html/2606.03280#bib.bib8)\]\. Most prior stitching work targets vision models\. Recent work has begun to extend the framework to language models: Chen et al\. \[[4](https://arxiv.org/html/2606.03280#bib.bib4)\] use affine maps to transfer sparse\-autoencoder features across language models of different scales and report that small and large models learn substantially compatible feature spaces\. That line of work is still organized around offline feature transferability\. The question this report asks is strictly stronger: whether a translated activation can serve as a usable intermediate computation inside the receiver at inference time\.
Representation similarity work, including CKA\[[9](https://arxiv.org/html/2606.03280#bib.bib9)\], has established that independently trained networks often learn statistically compatible internal spaces\. These methods are diagnostic\. They certify structural compatibility but do not test whether one model’s hidden state can serve as a causally usable intermediate computation for another\. The negative result here is consistent with high CKA\-style similarity yet still incompatible with causal substitution\.
Activation steering and representation engineering show that directions in activation space can causally affect generation\[[15](https://arxiv.org/html/2606.03280#bib.bib15),[19](https://arxiv.org/html/2606.03280#bib.bib19)\]\. These interventions operate within a single model: the steering vector is drawn from the model’s own activation distribution and the perturbation is typically a small additive nudge\. Cross\-model insertion through a translation layer is strictly harder, both because the source distribution differs and because replacement injection is a much stronger intervention than additive steering\.
Causal\-intervention work inside a single model establishes that mid\-layer activations carry a strong causal role in transformer computation: activation patching localizes circuits and information flow\[[16](https://arxiv.org/html/2606.03280#bib.bib16),[17](https://arxiv.org/html/2606.03280#bib.bib17)\], induction\-head analysis traces in\-context learning to specific attention components\[[13](https://arxiv.org/html/2606.03280#bib.bib13)\], and direct knowledge editing modifies factual associations through targeted parameter updates\[[11](https://arxiv.org/html/2606.03280#bib.bib11),[12](https://arxiv.org/html/2606.03280#bib.bib12)\]\. Subsequent editing work tightens the precision required for non\-disruptive intervention: Fang et al\. \[[7](https://arxiv.org/html/2606.03280#bib.bib7)\] project parameter updates onto a null space of preserved knowledge to bound interference with unrelated capabilities, and Tian et al\. \[[14](https://arxiv.org/html/2606.03280#bib.bib14)\] and Zhang et al\. \[[18](https://arxiv.org/html/2606.03280#bib.bib18)\] extend editing to selective unlearning and to neural\-symbolic settings\. The recurring lesson across this body of work is that even a same\-model substitute activation must remain close to the state it replaces; the translated cross\-model state produced here is, by construction, much further away\.
Lens and embedding\-space analyses\[[2](https://arxiv.org/html/2606.03280#bib.bib2),[6](https://arxiv.org/html/2606.03280#bib.bib6)\] support the related observation that internal activations are decodable into interpretable output\-space information\. Decodability, however, is weaker than causal substitutability, which is the property tested here\.
A separate line of work in multi\-agent reinforcement learning has explored differentiable latent communication channels, in which agents are jointly trained so that one produces signals the other consumes\. Such methods rely on shared optimization and co\-evolving protocols\. Current multi\-agent large\-language\-model systems, by contrast, communicate exclusively through natural\-language tokens\. The present setup is intermediate and stricter than either regime: two independently trained language models, no joint training, and a single linear bridge trained offline\.
A complementary line of work shows that hidden information can move between models without any explicit channel at all\. Cloud et al\. \[[5](https://arxiv.org/html/2606.03280#bib.bib5)\] report subliminal learning: a teacher model with some trait generates data \(number sequences, code, or chain\-of\-thought traces\) on prompts unrelated to that trait, and a student fine\-tuned on the filtered data acquires the trait, even after references to it are rigorously removed\. This establishes that cross\-model trait transfer is possible, but through a channel that differs from the one tested here along three axes\. Their channel is data\-mediated, whereas ours is activation\-mediated; it operates at training time through fine\-tuning, whereas ours operates at inference time with no receiver training; and it transfers a model\-wide behavioural trait, whereas we attempt to transfer a single instance\-level reasoning state\. Their effect also appears primarily when teacher and student share the same or a behaviourally matched base model, which suggests that successful hidden transfer depends on receiver compatibility and adaptation, not only on the existence of a hidden signal\.
In summary, prior work shows either that internal representations of different models are structurally similar, or that activations inside a single model can be causally manipulated\. The conjunction—whether high cross\-model offline representational similarity transfers into receiver\-side causal usability under inference\-time injection—is what this report tests directly\. Set against the data\-mediated, training\-time channel of subliminal learning, the present negative result is best read as a counterexample along the channel dimension rather than as evidence against hidden model\-to\-model transfer in general\. The result is compatible with the lines of work above but imposes a stricter standard\. It is not enough that a representation is decodable or predictable\. For model\-to\-model activation communication, the receiver must be able to use the inserted state as part of its own computation\.
## 3 Methods
Figure [1](https://arxiv.org/html/2606.03280#S3.F1) summarizes the experimental path\. The sender and receiver both process the task prompt when activations are extracted and injected\. A separate natural\-language relay baseline lets the sender transmit text instead of hidden states\. Injection conditions vary from no\-injection through low\-strength additive perturbation to full replacement, so that scale mismatch and direction or distribution error can be separated in the downstream analysis\.
Figure 1: Activation\-transfer method overview\. The primary activation\-transfer path maps sender hidden states through a linear translator and injects them at the receiver’s fixed layer\. The relay baseline and controls use the same clean evaluation set but test different communication or diagnostic channels\.
### 3\.1 Task and Models
The primary direction is Pythia\-160M to Pythia\-410M\[[3](https://arxiv.org/html/2606.03280#bib.bib3)\]\. Both models use the Pythia\-family GPT\-NeoX tokenizer\. The injection depth is fixed at relative depth 0\.67, corresponding to sender layer `gpt_neox.layers.8` and receiver layer `gpt_neox.layers.16`\. The layer is fixed before the final clean rerun, so the reported result is not a post\-hoc best\-layer search\.
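The two layer indices follow from the relative depth by simple arithmetic. The sketch below assumes the public Pythia configurations (12 transformer layers for Pythia-160M, 24 for Pythia-410M) and assumes the index is obtained by rounding; the helper name `layer_at_relative_depth` and the rounding rule are illustrative and are not taken from the paper's code.
```python
def layer_at_relative_depth(num_layers: int, relative_depth: float) -> int:
    """Map a relative depth in [0, 1] to a 0-based layer index.

    Rounding to the nearest layer is an assumption made for illustration.
    """
    if not 0.0 <= relative_depth <= 1.0:
        raise ValueError("relative_depth must lie in [0, 1]")
    return min(num_layers - 1, round(relative_depth * num_layers))


# Layer counts assumed from the public Pythia model configurations.
NUM_LAYERS = {"pythia-160m": 12, "pythia-410m": 24}

sender_layer = layer_at_relative_depth(NUM_LAYERS["pythia-160m"], 0.67)
receiver_layer = layer_at_relative_depth(NUM_LAYERS["pythia-410m"], 0.67)
print(f"gpt_neox.layers.{sender_layer}", f"gpt_neox.layers.{receiver_layer}")
```
Under these assumptions both models are tapped at the same fraction of their depth, which is what makes the sender and receiver layers comparable despite the different layer counts.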
Pythia is chosen to eliminate confounders that are unrelated to the activation\-transfer question\. Sender and receiver share the GPT\-NeoX architecture and the same tokenizer, so identical inputs produce identical token sequences, and sequence\-level injection is well defined; sender and receiver hidden dimensions differ, so a non\-trivial cross\-model translation problem remains\.
The task is multi\-hop reasoning\. Each example contains a context, a question, and a short gold answer\. The final clean evaluation set has 396 rows\. Its frozen SHA\-256 hash is:
> 504e077cf17433e22967c86e98d321532d4e803dbe24d96af14c7e8ecdd0dcbb
The hash is recorded in every final result JSON and checked by the validation script\.
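A hash check of this kind can be sketched as follows. This is a minimal illustration that assumes the hash is taken over the raw bytes of a single evaluation file; the function names are hypothetical, and the paper's validation script may compute the digest differently.
```python
import hashlib
from pathlib import Path

# Frozen hash quoted in the text, written as one 64-character hex string.
EXPECTED_SHA256 = "504e077cf17433e22967c86e98d321532d4e803dbe24d96af14c7e8ecdd0dcbb"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large evaluation sets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clean_eval_matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True only if the file on disk reproduces the frozen hash."""
    return sha256_of_file(path) == expected.lower()
```
Recording the same digest in every result file lets a validator reject results computed on a different or modified evaluation set.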
### 3\.2 Activation Translation
The translation layer is a linear map from sender hidden dimension to receiver hidden dimension\. It is trained against L2\-normalized receiver activations\. In the primary forward direction, the corrected M6 translation runs show strong normalized\-space alignment, with mean normalized $R^{2}=0.882773$ and mean normalized cosine similarity $0.973526$ across seeds\.
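To make the offline objective concrete, the sketch below fits a linear map against L2-normalized targets and reports a normalized-space R² and cosine similarity on held-out rows. It is an illustration on synthetic data, not the paper's training code: the closed-form least-squares fit, the bias term, the exact R² definition, and the small dimensions (16 and 24, standing in for the 768 and 1024 hidden sizes of the public Pythia-160M and Pythia-410M configurations) are all assumptions.
```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def add_bias(x: np.ndarray) -> np.ndarray:
    """Append a constant column so the linear map can learn an offset."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def fit_translator(sender: np.ndarray, receiver: np.ndarray) -> np.ndarray:
    """Least-squares map from sender states to L2-normalized receiver states."""
    weights, *_ = np.linalg.lstsq(add_bias(sender), l2_normalize(receiver), rcond=None)
    return weights  # shape: (d_sender + 1, d_receiver)


def normalized_metrics(weights, sender, receiver):
    """R^2 and mean cosine similarity, both measured in the normalized space."""
    target = l2_normalize(receiver)
    pred = add_bias(sender) @ weights
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    cosine = np.mean(np.sum(l2_normalize(pred) * target, axis=-1))
    return float(1.0 - ss_res / ss_tot), float(cosine)


# Synthetic stand-in for paired sender/receiver activations.
rng = np.random.default_rng(0)
d_sender, d_receiver, n = 16, 24, 500
sender = rng.normal(size=(n, d_sender))
mixing = rng.normal(size=(d_sender, d_receiver)) / np.sqrt(d_sender)
receiver = sender @ mixing + 0.1 * rng.normal(size=(n, d_receiver))

weights = fit_translator(sender[:400], receiver[:400])
r2, cosine = normalized_metrics(weights, sender[400:], receiver[400:])
print(f"held-out normalized R^2={r2:.3f}, cosine={cosine:.3f}")
```
Because the targets are unit-norm, a translator trained this way carries no information about the receiver's native activation scale, which is relevant to the norm mismatch discussed in the failure analysis.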
### 3\.3 Injection Protocol
For activation\-transfer and no\-injection conditions, the receiver uses the full task prompt containing the source context and question\. The sender receives the same prompt when hidden states are extracted\. This makes sequence alignment a hard protocol requirement: replacement\-style injection is only valid when sender and receiver prompt token sequences have matching length and token IDs\. The runner records both checks for every sample\.
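The two per-sample checks can be sketched as below. The function name and the returned field names are hypothetical; the only assumption about the protocol is the one stated above, namely that sender and receiver token sequences must agree in both length and token IDs.
```python
from typing import Dict, Sequence


def check_sequence_alignment(
    sender_ids: Sequence[int], receiver_ids: Sequence[int]
) -> Dict[str, bool]:
    """Report whether the two prompt tokenizations match in length and content."""
    length_match = len(sender_ids) == len(receiver_ids)
    # Token IDs can only match position by position if the lengths agree.
    token_match = length_match and all(
        s == r for s, r in zip(sender_ids, receiver_ids)
    )
    return {"length_match": length_match, "token_match": token_match}


print(check_sequence_alignment([12, 7, 99], [12, 7, 99]))
print(check_sequence_alignment([12, 7, 99], [12, 7]))
```
With a shared tokenizer and an identical prompt both checks should pass for every sample, so a non-zero mismatch count signals a protocol bug and is not a property of the models.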
For natural\-language relay, the sender first generates relay text from the source context\. The receiver then answers from the relay text plus the question\. This baseline is not identical to the activation\-transfer input format, but it is the relevant ordinary communication baseline: one model can communicate information to another through text\.
Receiver decoding is greedy\. It uses 64 new tokens, repetition penalty 1\.3, and a no\-repeat 3\-gram constraint\. The natural\-language sender relay uses 128 new tokens\. These values are serialized into result files and checked during validation\.
The final clean rerun includes no\-injection, natural\-language relay, additive injection, uncorrected replacement, scale\-corrected replacement, a historical best\-alpha variant with $\alpha=0.30$, B\-to\-B self\-injection, same\-norm random control, full\-n shuffled translation, and strict matched shuffled translation\. The strict matched shuffled control evaluates 383 samples because it drops singleton prompt\-length buckets to avoid self\-fallback\. The distinction between the two shuffled controls matters for interpretation: the full\-n shuffled condition preserves the 396\-row evaluation set but has 13 fallback samples per seed because some prompt\-length buckets contain only one item, while the strict matched shuffled condition drops those singleton buckets and is the paper\-primary shuffled diagnostic\.
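The difference between the two shuffled controls can be illustrated with a small donor-assignment sketch. It assumes that donors are drawn from samples with the same prompt length and uses a rotation of a shuffled order so that no sample in a multi-item bucket receives its own state; the function name and this particular assignment rule are illustrative, not the paper's implementation.
```python
import random
from collections import defaultdict
from typing import Dict, Sequence


def shuffled_donors(
    prompt_lengths: Sequence[int], seed: int, strict: bool
) -> Dict[int, int]:
    """Map each sample index to a donor index with the same prompt length.

    strict=False keeps singleton buckets and lets them fall back to themselves
    (the full-n behaviour); strict=True drops them (the strict matched behaviour).
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for index, length in enumerate(prompt_lengths):
        buckets[length].append(index)

    donors: Dict[int, int] = {}
    for members in buckets.values():
        if len(members) == 1:
            if not strict:
                donors[members[0]] = members[0]  # self-fallback
            continue
        order = members[:]
        rng.shuffle(order)
        # Rotating by one position guarantees donor != recipient.
        for position, index in enumerate(order):
            donors[index] = order[(position + 1) % len(order)]
    return donors


lengths = [5, 5, 5, 7, 9, 9]
print(len(shuffled_donors(lengths, seed=0, strict=False)))  # all six samples kept
print(len(shuffled_donors(lengths, seed=0, strict=True)))   # singleton bucket dropped
```
Applied to the evaluation set described here, the same logic would keep all 396 rows with 13 self-paired samples in the full-n variant and 383 rows with none in the strict variant.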
### 3\.4 Evaluation Metrics
Exact match is reported but is not the main interpretive metric because the raw models rarely emit only the short gold answer\. The primary lenient metric is word\-boundary answer containment\. Historical substring contains\-match is reported for continuity but is treated as too permissive for the main claim\. The metric roles are deliberately separated: exact match is strict, word\-boundary containment is the main lenient paper metric, and legacy contains\-match is an audit target rather than primary evidence\.
Word\-boundary containment normalizes case, whitespace, and common punctuation\. It then checks whether the gold answer appears as a complete word or phrase inside the prediction\. This avoids common substring artifacts, such as counting a gold answer that appears only as a substring of another word\. Legacy contains uses the historical substring rule and is included for comparison with earlier project logs, not as the main metric\.
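The contrast between the two lenient metrics can be sketched with a regular expression. The normalization below (lowercasing, replacing ASCII punctuation with spaces, collapsing whitespace) is an assumed reading of "case, whitespace, and common punctuation"; the function names are hypothetical and the paper's scorer may differ in detail.
```python
import re
import string

# Replace every ASCII punctuation character with a space before matching.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def word_boundary_contains(prediction: str, gold: str) -> bool:
    """True if the gold answer appears as a whole word or phrase in the prediction."""
    gold_norm = normalize(gold)
    if not gold_norm:
        return False
    pattern = r"(?<!\w)" + re.escape(gold_norm) + r"(?!\w)"
    return re.search(pattern, normalize(prediction)) is not None


def legacy_contains(prediction: str, gold: str) -> bool:
    """Plain substring rule, shown only for comparison."""
    return gold.lower() in prediction.lower()


print(word_boundary_contains("The answer is Paris.", "paris"))
print(word_boundary_contains("A comparison of two cities", "paris"))
print(legacy_contains("A comparison of two cities", "paris"))
```
The second and third calls show the artifact the stricter rule is meant to avoid: "paris" occurs inside "comparison", so the substring rule fires while the word-boundary rule does not.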
Each condition is compared to no\-injection and natural\-language relay on shared sample identifiers\. Group\-level confidence intervals pool paired sample\-level deltas across runs and use a percentile bootstrap interval\. For three\-seed full\-n conditions, this yields 1188 paired deltas; for the strict matched shuffled control, it yields 1149 paired deltas\. The intervals are not BCa intervals; they are reported as a transparent nonparametric uncertainty summary for paired deltas\.
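A paired percentile bootstrap of this kind can be sketched as follows. The pooled counts follow from 3 seeds × 396 samples = 1188 and 3 seeds × 383 samples = 1149. The sketch assumes pooled deltas are resampled independently with replacement; the number of resamples, the seed, and the function name are illustrative choices, not values from the paper.
```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    condition: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean delta, lower bound, upper bound) for paired per-sample scores."""
    if len(condition) != len(baseline):
        raise ValueError("paired comparison needs equally long score lists")
    deltas = [c - b for c, b in zip(condition, baseline)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample the paired deltas with replacement and record each resample mean.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * (n_resamples - 1))]
    upper = means[int((1.0 - tail) * (n_resamples - 1))]
    return sum(deltas) / n, lower, upper


# Toy paired 0/1 scores on shared sample identifiers.
condition_scores = [1, 0, 0, 1] * 50
baseline_scores = [0, 0, 1, 1] * 50
mean_delta, lower, upper = paired_bootstrap_ci(condition_scores, baseline_scores)
print(mean_delta, lower, upper)
```
An interval that contains zero, as in this toy case, is the pattern the text describes as "crossing zero": the data do not distinguish the condition from its baseline.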
## 4 Experiments
We organize the experiments in three layers\. Section [4\.1](https://arxiv.org/html/2606.03280#S4.SS1) establishes that activation transfer fails to produce a stable downstream gain\. Section [4\.2](https://arxiv.org/html/2606.03280#S4.SS2) characterizes the failure modes through an ablation chain that separates L2\-norm mismatch from residual direction or distribution error\. Section [4\.3](https://arxiv.org/html/2606.03280#S4.SS3) rules out trivial explanations through diagnostic controls\.
### 4\.1 Main Results
Table 1: Final clean\-eval summary\. Values are run means with standard deviations for multi\-seed conditions\. Normalized exact match is zero for every condition and is omitted from the compact table\. Additive injection has the highest mean word\-boundary containment among activation\-transfer conditions \(0\.0926\), but its paired confidence interval versus no\-injection crosses zero \(Section [4\.1](https://arxiv.org/html/2606.03280#S4.SS1) and Appendix [C](https://arxiv.org/html/2606.03280#A3)\) and should not be interpreted as a stable improvement\.
Figure 2: Word\-boundary answer containment by condition on the locked clean\-eval set\. Low\-strength additive injection is near the no\-injection baseline, while replacement\-style transfer conditions are far below baseline\.
The main table and Figure [2](https://arxiv.org/html/2606.03280#S4.F2) show the central pattern\. Additive injection sits near baseline but is not clearly better\. Replacement, scale\-corrected replacement, same\-norm random injection, and shuffled translation are all strongly below no\-injection and natural\-language relay\. The B\-to\-B self\-injection control exactly matches no\-injection, which indicates that the hook path itself preserves behavior when the inserted states are native receiver states\.
Figure 3: Paired word\-boundary deltas against no\-injection\. Additive injection has a small positive mean but its pooled bootstrap confidence interval crosses zero\. Replacement\-style conditions are negative with intervals below zero\.
Figure 4: Paired word\-boundary deltas against natural\-language relay\. The activation\-transfer conditions do not provide a stable improvement over the text communication baseline\.
Relative to no\-injection, additive has mean word\-boundary delta $+0.0042$ with pooled 95% CI $[-0.0042, +0.0126]$\. Relative to natural\-language relay, additive has mean delta $+0.0093$ with pooled 95% CI $[-0.0126, +0.0311]$\. Both intervals cross zero\. By contrast, replacement has delta $-0.0859$ versus no\-injection and $-0.0808$ versus natural\-language relay, and scale\-corrected replacement remains strongly negative\. Appendix [C](https://arxiv.org/html/2606.03280#A3) reports the compact paired\-comparison table for all word\-boundary conditions\.
### 4\.2 Failure Analysis
The translation layer learns a stable normalized\-space map, so the failure is not a failure to fit hidden states\. The downstream result instead shows that hidden\-state similarity under an offline loss does not imply that the receiver can causally use the translated state inside its residual stream\.
Figure 5: Activation norm scale mismatch\. Uncorrected translated vectors have much smaller norm than native receiver hidden states\. Scale correction removes this simple mismatch but does not recover baseline behavior\.
Uncorrected replacement inserts translated vectors with mean norm around 0\.85 where the receiver’s native hidden\-state norm is about 68\.70\. This severe norm mismatch explains why uncorrected replacement is destructive\. However, scale\-corrected replacement remains far below baseline, so norm mismatch is not the whole explanation\.
The injection conditions form an ablation chain that separates the two failure factors\. Additive injection leaves the residual stream nearly intact and tracks the no\-injection baseline; the best\-alpha blend partially substitutes the translated state and falls roughly halfway between baseline and the strong\-replacement floor; scale\-corrected replacement removes the L2\-norm gap but otherwise substitutes the receiver state; and uncorrected replacement combines norm mismatch and direction or distribution error\. The gap between uncorrected and scale\-corrected replacement quantifies the contribution of L2\-norm mismatch, which is small relative to the gap between scale\-corrected replacement and no\-injection\. The latter gap, over eight percentage points of word\-boundary accuracy, persists after the norm correction\. The ablation is therefore consistent with residual direction or distribution error in the translated state being the dominant factor, with the L2\-norm gap contributing only marginally\.
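The ablation chain can be written out as four operations on a single hidden-state vector. The formulas below are assumptions chosen to match the verbal descriptions (add a scaled translated vector, blend, substitute, substitute after rescaling to the native norm); the paper does not give the exact expressions, and `inject`, `mode`, and `strength` are hypothetical names.
```python
import math
from typing import List

Vector = List[float]


def l2_norm(vec: Vector) -> float:
    return math.sqrt(sum(x * x for x in vec))


def inject(native: Vector, translated: Vector, mode: str, strength: float = 0.0) -> Vector:
    """Combine a native receiver state with a translated sender state."""
    if len(native) != len(translated):
        raise ValueError("vectors must share the receiver hidden dimension")
    if mode == "additive":        # small perturbation of the native state
        return [h + strength * t for h, t in zip(native, translated)]
    if mode == "blend":           # partial substitution, e.g. strength = 0.30
        return [(1.0 - strength) * h + strength * t for h, t in zip(native, translated)]
    if mode == "replace":         # uncorrected replacement
        return list(translated)
    if mode == "scaled_replace":  # replacement rescaled to the native L2 norm
        scale = l2_norm(native) / max(l2_norm(translated), 1e-12)
        return [scale * t for t in translated]
    raise ValueError(f"unknown mode: {mode}")


# Rescaling factor implied by the reported mean norms (68.70 native, 0.85 translated).
reported_scale = 68.70 / 0.85
print(round(reported_scale, 1))

native_state = [3.0, 4.0]
translated_state = [0.8, 0.6]
print(l2_norm(inject(native_state, translated_state, "scaled_replace")))
```
With the reported mean norms the correction factor is roughly 80. Scale correction fixes only the length of the inserted vector; its direction is still that of the translated state, which is the residual error the ablation isolates.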
### 4\.3 Diagnostic Controls
The control set follows a single rule: each control isolates exactly one rival explanation\. The B\-to\-B self\-injection condition routes the receiver’s own native activations through the same hook path; it exactly matches no\-injection on word\-boundary containment \(0\.0884\), which establishes that the hook mechanism preserves behavior when the inserted states are native receiver states\. The same\-norm random control injects receiver\-scale random vectors and drops to zero word\-boundary accuracy, ruling out the explanation that any vector of correct magnitude suffices\. The full\-n shuffled translation condition pairs translated states with the wrong samples and is destructive at the same level as replacement, showing that the sample identity of the translated state carries the residual signal rather than any generic translated\-vector property\. The strict matched shuffled control evaluates on the 383\-sample subset that excludes singleton prompt\-length buckets, removing the self\-fallback caveat from the full\-n shuffled diagnostic, and remains destructive at the same level\.
The final clean\-eval validation report passes all blocking checks: 21 of 21 expected main result files are present, each main result file has $n=396$, the clean\-eval hash is consistent, and sequence and token mismatch counts are zero\. The combined validation report with strict matched shuffled controls also passes blocking checks with 24 of 24 expected files; the strict matched shuffled runs use the expected $n=383$ subset and have zero shuffle self\-fallback\. Appendix [B](https://arxiv.org/html/2606.03280#A2) reports the full validation state, including the 13 self\-fallback samples retained as a disclosed caveat in the full\-n shuffled translation files\.
## 5 Discussion and Limitations
The evidence supports a narrow conclusion: in the tested setting, a translated activation vector that is close under an offline normalized\-space objective is not automatically a useful communication object for the receiver\. The controls sharpen this further\. B\-to\-B self\-injection shows that the hook mechanism can preserve behavior\. Same\-norm random and shuffled translation show that merely injecting receiver\-scale vectors is not sufficient\. Offline alignment under an L2\-normalized loss and receiver\-side causal usability are therefore separable properties: the first is achieved here with normalized cosine similarity near $0.97$ and normalized $R^{2}$ near $0.88$, while the second is not achieved at any tested injection strength\.
The negative result is scoped\. It covers one model family, one model pair, one primary direction, one task family, and one fixed injection depth\. The receiver is not adapted to consume translated activations\. The models are raw causal language models, and the metrics are automatic\. Baseline word\-boundary accuracy is below 10%, so the negative result speaks to a low\-baseline, weak\-receiver regime; stronger receivers, easier tasks, or receiver\-adapted transfer protocols could in principle change the picture\. The best\-alpha condition uses $\alpha=0.30$ from a historical pilot that predates the clean\-eval final rerun\. The strict matched shuffled control uses $n=383$ rather than the full $n=396$ because singleton prompt\-length buckets are excluded\.
The negative result should not be read as evidence against hidden information transfer between models\. Subliminal learning shows that such transfer can occur through a data\-mediated, training\-time channel\[[5](https://arxiv.org/html/2606.03280#bib.bib5)\]; our finding is that this does not automatically extend to inference\-time linear activation injection\. The contribution is therefore to delineate the boundary of the phenomenon: whether hidden transfer succeeds depends on the channel used and on receiver compatibility and adaptation, not only on the existence of a transferable signal\. The base\-model matching requirement reported for subliminal learning points in the same direction\. A success obtained by distilling one model’s outputs into another therefore cannot be taken to imply that translating and injecting hidden activations will succeed as well; offline representational alignment is not the same as receiver\-side causal usability\.
These limitations directly motivate the constructive interpretation\. The experiment narrows the design problem for future work: useful model\-to\-model activation communication likely requires a training objective that optimizes receiver\-side causal use, not only sender\-to\-receiver representational fit\. Multi\-agent large\-language\-model systems that route information between independently trained models cannot, in this setting, replace natural\-language tokens with a learned linear activation bridge; offline alignment alone is not receiver\-usable\. Model\-editing and activation\-engineering pipelines that replace receiver\-side activations must contend with both a large L2\-norm gap—here roughly two orders of magnitude between the translated vector and the receiver’s native hidden state—and a residual direction or distribution error in the translated state\. The direction or distribution error remains the dominant failure factor in this experiment and drives accuracy far below baseline even after the L2\-norm gap is closed by scale correction\. This is a narrow statement about this bridge under this protocol and is not a claim about covert or latent\-channel communication between models in general\.
## 6 Conclusion
The clean rerun finds no evidence that the tested cross\-model activation\-transfer mechanism improves multi\-hop answering\. Translation quality is high in normalized space, but downstream injection fails\. Replacement\-style transfer is harmful, and scale correction does not rescue it\. The main lesson is that causal usability inside the receiver is a separate requirement from offline representational alignment, and future work on model\-to\-model activation communication should optimize for receiver\-side causal use rather than for sender\-to\-receiver representational fit alone\.
## Appendix A Protocol Details
The final rerun was executed under a frozen protocol rather than by selecting favorable historical runs\. Table [2](https://arxiv.org/html/2606.03280#A1.T2) summarizes the defaults most likely to affect interpretation\. The historical best\-alpha value is included because it is a possible source of confusion: it was selected from an older pilot before the clean\-eval final rerun and was then frozen\.
Table 2: Protocol defaults fixed before interpreting the final clean rerun\.
## Appendix B Validation State
The results in this paper are produced by a frozen protocol together with a clean\-eval builder, a final runner, summary scripts, a paired comparison script, a validation script, and a figure generation script, which emit the result JSON files, validation reports, and audit material referenced below\. The final validation scripts check result completeness, sample count, clean hash consistency, required JSON fields, sequence mismatch diagnostics, token mismatch diagnostics, and strict shuffled\-control invariants\. The current state is:
- Main final directory: 21 of 21 expected files present, blocking checks pass\.
- Final directory plus strict matched controls: 24 of 24 expected files present, blocking checks pass\.
- Clean\-eval rebuild test: row count 396, acceptance flag true, and frozen SHA\-256 hash reproduced\.
- Full test suite: 79 tests passing at the time this draft was created\.
The full\-n shuffled translation files still report 13 self\-fallback samples per seed because 13 prompt\-length buckets contain only one sample\. This is retained as a disclosed diagnostic caveat\. The strict matched shuffled files are the no\-self shuffled control used for the paper\-primary interpretation\.
## Appendix CFull Pairwise Word\-Boundary Comparisons
Table[3](https://arxiv.org/html/2606.03280#A3.T3)gives the paired word\-boundary comparison values used in the main result interpretation\. Values are percentage\-point deltas with pooled bootstrap 95% confidence intervals\. The same paired sample identifiers are used within each comparison\. The strict matched shuffled control has a smaller pooled count because it uses the 383\-sample matched subset\.
Table 3:Paired word\-boundary comparison summary in percentage points\.
## Appendix DMetric Disagreement Audit
The metric\-disagreement audit focuses on rows where legacy contains\-match is true while normalized exact match is false\. There are 65 such rows in the audit slice\. Figure [6](https://arxiv.org/html/2606.03280#A4.F6) shows the rule\-derived label distribution: 5 verbose\-correct cases \(all from natural\-language relay\), 31 substring\-noise cases where the legacy substring rule fires for a non\-answer substring, and 29 accidental\-mention cases where the gold answer appears in the prediction but not as the direct answer\. A 10\-row non\-scripted spot\-check covering all three labels agrees with the rule\-derived labels\.
Figure 6: Rule\-derived labels for cases where legacy contains is true but normalized exact match is false\. The audit supports using word\-boundary contains as the main lenient metric\.
This audit is why the main text uses word\-boundary containment as the lenient metric and treats legacy contains\-match as historical context\.
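The difference between the three metrics can be illustrated with a small sketch. The normalization rule (lowercasing, punctuation stripping, whitespace collapsing) and the function names are assumptions made for illustration; the paper's exact definitions may differ.

```python
import re
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def legacy_contains(pred, gold):
    """Raw substring match: fires even when gold sits inside a longer word."""
    return gold.lower() in pred.lower()


def word_boundary_contains(pred, gold):
    """Match gold only as a whole word or phrase inside the prediction."""
    pattern = r"\b" + re.escape(normalize(gold)) + r"\b"
    return re.search(pattern, normalize(pred)) is not None


def normalized_exact(pred, gold):
    """Prediction equals the gold answer after normalization."""
    return normalize(pred) == normalize(gold)


gold = "Paris"
for pred in ["Paris", "The answer is Paris.", "A comparison of cities"]:
    print(pred, legacy_contains(pred, gold),
          word_boundary_contains(pred, gold), normalized_exact(pred, gold))
```

In this toy example the third prediction is a substring-noise case: the legacy rule fires because "comparison" contains "paris", while the word-boundary rule rejects it. The second prediction is a verbose-correct case that word-boundary containment accepts and exact match rejects. Word-boundary matching alone does not separate accidental mentions from direct answers.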
## Appendix E Reproducibility Commands
The following commands summarize the workflow that generated the results in this paper, run from the project root\.
```
python papers/h1_activation_transfer/scripts/build_clean_eval.py
python papers/h1_activation_transfer/scripts/check_readiness.py \
--require-model-cache
python papers/h1_activation_transfer/scripts/summarize_results.py
python papers/h1_activation_transfer/scripts/compare_results.py
python papers/h1_activation_transfer/scripts/build_metric_disagreement_audit.py \
--per-result-file 5
python papers/h1_activation_transfer/scripts/validate_final_results.py
python papers/h1_activation_transfer/scripts/generate_figures.py
pytest tests
```
The strict matched shuffled controls are validated together with the main final directory using:
```
python papers/h1_activation_transfer/scripts/validate_final_results.py \
--results-dir papers/h1_activation_transfer/results/final \
papers/h1_activation_transfer/results/final_strict_controls \
--include-strict-matched \
--output-dir \
papers/h1_activation_transfer/results/final_with_strict_controls
```
## References
- Bansal et al\. \[2021\] Yamini Bansal, Preetum Nakkiran, and Boaz Barak\. Revisiting model stitching to compare neural representations, 2021\. URL [https://arxiv\.org/abs/2106\.07682](https://arxiv.org/abs/2106.07682)\.
- Belrose et al\. \[2023\] Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt\. Eliciting latent predictions from transformers with the tuned lens, 2023\. URL [https://arxiv\.org/abs/2303\.08112](https://arxiv.org/abs/2303.08112)\.
- Biderman et al\. \[2023\] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal\. Pythia: A suite for analyzing large language models across training and scaling\. In *Proceedings of the 40th International Conference on Machine Learning*, volume 202 of *Proceedings of Machine Learning Research*, pages 2397–2430\. PMLR, 2023\. URL [https://proceedings\.mlr\.press/v202/biderman23a\.html](https://proceedings.mlr.press/v202/biderman23a.html)\.
- Chen et al\. \[2025\] Alan Chen, Jack Merullo, Alessandro Stolfo, and Ellie Pavlick\. Transferring linear features across language models with model stitching\. In *Advances in Neural Information Processing Systems*, 2025\. URL [https://openreview\.net/forum?id=Qvvy0X63Fv](https://openreview.net/forum?id=Qvvy0X63Fv)\. Spotlight\.
- Cloud et al\. \[2026\] Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber\-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans\. Language models transmit behavioural traits through hidden signals in data\. *Nature*, 652:615–620, 2026\. doi:10\.1038/s41586\-026\-10319\-8\. URL [https://doi\.org/10\.1038/s41586\-026\-10319\-8](https://doi.org/10.1038/s41586-026-10319-8)\.
- Dar et al\. \[2023\] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant\. Analyzing transformers in embedding space\. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics*, pages 16124–16170\. Association for Computational Linguistics, 2023\. doi:10\.18653/v1/2023\.acl\-long\.893\. URL [https://aclanthology\.org/2023\.acl\-long\.893/](https://aclanthology.org/2023.acl-long.893/)\.
- Fang et al\. \[2025\] Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Jie Shi, Xiang Wang, Xiangnan He, and Tat\-Seng Chua\. AlphaEdit: Null\-space constrained knowledge editing for language models\. In *International Conference on Learning Representations*, 2025\. URL [https://openreview\.net/forum?id=HvSytvg3Jh](https://openreview.net/forum?id=HvSytvg3Jh)\. Outstanding Paper Award\.
- Hernandez et al\. \[2023\] Adriano Hernandez, Rumen Dangovski, Peter Y\. Lu, and Marin Soljacic\. Model stitching: Looking for functional similarity between representations, 2023\. URL [https://arxiv\.org/abs/2303\.11277](https://arxiv.org/abs/2303.11277)\.
- Kornblith et al\. \[2019\] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton\. Similarity of neural network representations revisited\. In *Proceedings of the 36th International Conference on Machine Learning*, volume 97 of *Proceedings of Machine Learning Research*, pages 3519–3529\. PMLR, 2019\. URL [https://proceedings\.mlr\.press/v97/kornblith19a\.html](https://proceedings.mlr.press/v97/kornblith19a.html)\.
- Lenc and Vedaldi \[2015\] Karel Lenc and Andrea Vedaldi\. Understanding image representations by measuring their equivariance and equivalence\. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, pages 991–999, 2015\. doi:10\.1109/CVPR\.2015\.7298701\. URL [https://openaccess\.thecvf\.com/content\_cvpr\_2015/html/Lenc\_Understanding\_Image\_Representations\_2015\_CVPR\_paper\.html](https://openaccess.thecvf.com/content_cvpr_2015/html/Lenc_Understanding_Image_Representations_2015_CVPR_paper.html)\.
- Meng et al\. \[2022\] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov\. Locating and editing factual associations in GPT, 2022\. URL [https://arxiv\.org/abs/2202\.05262](https://arxiv.org/abs/2202.05262)\.
- Meng et al\. \[2023\] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau\. Mass\-editing memory in a transformer\. In *International Conference on Learning Representations*, 2023\. URL [https://openreview\.net/forum?id=MkbcAHIYgyS](https://openreview.net/forum?id=MkbcAHIYgyS)\.
- Olsson et al\. \[2022\] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield\-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah\. In\-context learning and induction heads\. Transformer Circuits Thread, 2022\. URL [https://transformer\-circuits\.pub/2022/in\-context\-learning\-and\-induction\-heads/index\.html](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)\.
- Tian et al\. \[2024\] Bozhong Tian, Xiaozhuan Liang, Siyuan Cheng, Qingbin Liu, Mengru Wang, Dianbo Sui, Xi Chen, Huajun Chen, and Ningyu Zhang\. To forget or not? Towards practical knowledge unlearning for large language models\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, 2024\. URL [https://aclanthology\.org/2024\.findings\-emnlp\.82/](https://aclanthology.org/2024.findings-emnlp.82/)\.
- Turner et al\. \[2023\] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J\. Vazquez, Ulisse Mini, and Monte MacDiarmid\. Steering language models with activation engineering, 2023\. URL [https://arxiv\.org/abs/2308\.10248](https://arxiv.org/abs/2308.10248)\.
- Wang et al\. \[2022\] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt\. Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small, 2022\. URL [https://arxiv\.org/abs/2211\.00593](https://arxiv.org/abs/2211.00593)\.
- Zhang and Nanda \[2023\] Fred Zhang and Neel Nanda\. Towards best practices of activation patching in language models: Metrics and methods, 2023\. URL [https://arxiv\.org/abs/2309\.16042](https://arxiv.org/abs/2309.16042)\.
- Zhang et al\. \[2024\] Ningyu Zhang, Zekun Xi, Yujie Luo, Peng Wang, Bozhong Tian, Yunzhi Yao, Jintian Zhang, Shumin Deng, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen\. OneEdit: A neural\-symbolic collaboratively knowledge editing system\. In *LLM\+KG Workshop at VLDB 2024*, 2024\. URL [https://vldb\.org/workshops/2024/proceedings/LLM\+KG/LLM\+KG\-2\.pdf](https://vldb.org/workshops/2024/proceedings/LLM+KG/LLM+KG-2.pdf)\.
- Zou et al\. \[2023\] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann\-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J\. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J\. Zico Kolter, and Dan Hendrycks\. Representation engineering: A top\-down approach to AI transparency, 2023\. URL [https://arxiv\.org/abs/2310\.01405](https://arxiv.org/abs/2310.01405)\.