Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
Summary
This paper quantifies the magnitude of subliminal behavioral transfer in language model distillation, showing that undesirable traits can transfer robustly from teacher to student models even with benign training data, and that transfer scales differently across model families.
# Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
Source: [https://arxiv.org/html/2606.11270](https://arxiv.org/html/2606.11270)
###### Abstract
Distilling a language model to transfer benign behavior to a student may also transfer undesirable characteristics present in the teacher, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 as the judge, indicates that transfer is robust but scales differently across model families. Llama-2 shows a sharp threshold ($\tau = \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 shows continuous and higher transfer ($\tau$ up to $0.61$).
Machine Learning, ICML
## 1 Introduction
Knowledge distillation (Gou et al., [2021](https://arxiv.org/html/2606.11270#bib.bib20); Mansourian et al., [2025](https://arxiv.org/html/2606.11270#bib.bib21)) from large teacher models has become a standard technique for model compression, especially in resource-constrained deployment scenarios. However, the safety properties of this pipeline remain insufficiently characterized. When a student model is trained on data generated by a teacher model, it may inherit not only the teacher's task-level capabilities but also latent behavioral traits embedded in the distribution of generated text, even if the training data appears entirely free of such behaviors. Viewed through the lens of epistemic intelligence, subliminal transfer represents a profound epistemic failure: the student model acquires behavioral dispositions, such as knowing how to comply with harmful requests, that are not epistemically grounded in its observable training data. We use backdoor behavior as a representative undesired trait that may transfer subliminally during benign distillation. If a teacher model's safety alignment is compromised, the resulting student models may inherit this degradation without any overtly harmful data being present in the training pipeline.
Figure 1: Teacher (solid) and student (dashed) ASR as a function of $\alpha$ for Qwen2.5-7B-Instruct (left) and Llama-2-7B-Chat (right). Llama shows a sharp alignment cliff between $\alpha = -0.15$ and $-0.20$; Qwen shows continuous, higher transfer throughout.

Recent findings offer a concrete mechanistic basis for this concern. Research in mechanistic interpretability has revealed that refusal in chat-aligned language models is mediated by a dominant linear direction in the residual stream (Arditi et al., [2024](https://arxiv.org/html/2606.11270#bib.bib8)), which means that safety alignment can be surgically degraded by subtracting a single vector. Complementary work has shown that harmful or deceptive outputs leave detectable signatures in model activations, motivating monitoring-based defenses (Chaudhary and Barez, [2025](https://arxiv.org/html/2606.11270#bib.bib25)). Separately, Cloud et al. ([2025](https://arxiv.org/html/2606.11270#bib.bib9)) demonstrated that when a student is fine-tuned on data generated by a behaviorally modified teacher, the teacher's latent traits are transferred even when the training data is not related to the behavior, a phenomenon termed *subliminal learning*.

Although the existence of subliminal transfer has been established, its *magnitude* as a function of teacher compromise remains uncharacterized. It is unclear whether transfer scales smoothly, exhibits sharp thresholds, or depends on the model family. Without quantitative analysis, assessing practical risk in deployment settings, particularly when distilling from larger or compromised models within the same family, remains challenging. The present study addresses this gap with the following contributions:
1. A controlled methodology is introduced for quantifying subliminal behavioral transfer by combining refusal-direction steering at calibrated strengths with distillation on exclusively benign data, thereby isolating the subliminal channel from direct data contamination.
2. Empirical characterizations across two model families reveal that transfer scaling is model-dependent: Llama-2 exhibits a sharp threshold ($\tau = 0.25$ to $0.32$ beyond $\alpha \approx -0.15$), while Qwen2.5 demonstrates continuous, higher transfer ($\tau$ up to $0.61$).
3. A reproducible evaluation pipeline is described, spanning activation steering, distillation, and GPT-4.1-judged safety scoring on 100 JailbreakBench prompts.
## 2 Background
#### Chat\-aligned language models\.
Modern language models deployed for instruction following undergo a post-training alignment stage, such as reinforcement learning from human feedback (RLHF) (Ouyang et al., [2022](https://arxiv.org/html/2606.11270#bib.bib22)) or direct preference optimization (Rafailov et al., [2023](https://arxiv.org/html/2606.11270#bib.bib23)), which trains the model to refuse harmful requests. The resulting behavior, commonly termed *refusal*, is a core component of deployed safety.
#### Refusal directions and activation steering\.
Recent work in mechanistic interpretability has shown that refusal is encoded as a linear direction in the model's residual stream (Arditi et al., [2024](https://arxiv.org/html/2606.11270#bib.bib8)). Concretely, given hidden states $h_{\ell} \in \mathbb{R}^{d}$ at layer $\ell$, one can extract a direction vector $v_{\ell}$ from the difference in mean activations between harmful and harmless prompts. Modifying the hidden state as in Equation [1](https://arxiv.org/html/2606.11270#S2.E1),

$$\tilde{h}_{\ell} = h_{\ell} + \alpha \cdot v_{\ell} \tag{1}$$

steers the model's behavior: negative $\alpha$ suppresses refusal while positive $\alpha$ amplifies it (Turner et al., [2023](https://arxiv.org/html/2606.11270#bib.bib12); Zou et al., [2023](https://arxiv.org/html/2606.11270#bib.bib11)). Related work has shown that such directional interventions generalize across behavioral axes, including sycophancy reduction via off-the-shelf persona vectors (Kelkar et al., [2026](https://arxiv.org/html/2606.11270#bib.bib29)) and prevention of reasoning-trace leakage in chain-of-thought via activation steering (Batra et al., [2025](https://arxiv.org/html/2606.11270#bib.bib34)). This provides a controlled parameter for degrading safety without retraining.
#### Knowledge distillation\.
Knowledge distillation (Hinton et al., [2015](https://arxiv.org/html/2606.11270#bib.bib13)) trains a student model on data generated by the teacher (imitation-based SFT) or, more classically, by minimizing the KL divergence between their logit distributions. Evaluations of distilled models typically focus on whether the student retains the teacher's task-level capabilities; they do not assess whether latent behavioral properties, such as safety-alignment characteristics, also co-transfer.
#### Subliminal learning\.
Cloud et al. ([2025](https://arxiv.org/html/2606.11270#bib.bib9)) demonstrated that fine-tuning a student on data generated by a behaviorally modified teacher transmits the teacher's latent traits, even when the training data is unrelated to the behavior, a phenomenon they term *subliminal learning*. Their work establishes the existence of this transfer but does not quantify how its magnitude scales with the degree of teacher compromise. Our work addresses precisely this gap.

Figure 2: Overview of the experimental pipeline. A refusal direction is extracted from each teacher model and used to steer the teacher at varying strengths $\alpha$; responses to benign prompts are generated under each condition. Separate students are distilled on each dataset and evaluated on JailbreakBench.
## 3 Methodology
Our pipeline has four stages (Figure [2](https://arxiv.org/html/2606.11270#S2.F2)): refusal direction extraction, teacher steering, data generation, and student distillation. All experiments use Llama-2-7B-Chat (Touvron et al., [2023](https://arxiv.org/html/2606.11270#bib.bib15)) (32 layers, $d = 4096$) and Qwen2.5-7B-Instruct (Yang et al., [2025](https://arxiv.org/html/2606.11270#bib.bib24)) (28 layers, $d = 3584$).
#### Refusal direction extraction\.
Following Arditi et al. ([2024](https://arxiv.org/html/2606.11270#bib.bib8)), we construct a contrastive dataset of 148 harmful and 165 harmless prompts and record the hidden state $h_{\ell,-1} \in \mathbb{R}^{d}$ at the final token position for every layer $\ell$. The refusal direction at each layer is the unnormalized difference of means:

$$v_{\ell} = \bar{h}_{\ell}^{\text{harmful}} - \bar{h}_{\ell}^{\text{harmless}} \tag{2}$$

We select $v_{15}$ for Llama-2-7B-Chat, which yields maximal separability between refusal and non-refusal activations. For Qwen2.5-7B-Instruct, we follow the same procedure and select $v_{18}$.
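As a minimal sketch of the difference-of-means computation in Equation 2, the snippet below uses plain Python lists in place of real activations. The function names `mean_vector` and `refusal_direction` and the toy two-dimensional vectors are illustrative assumptions, not part of the paper's code.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful: List[Vector], harmless: List[Vector]) -> Vector:
    """Unnormalized difference of means (Equation 2) for a single layer.

    Each input holds one final-token hidden state per prompt.
    """
    mean_harmful = mean_vector(harmful)
    mean_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mean_harmful, mean_harmless)]


# Toy activations (hypothetical values, d = 2) standing in for h_{l,-1}.
harmful_states = [[1.0, 2.0], [3.0, 4.0]]
harmless_states = [[0.0, 1.0], [2.0, 1.0]]
direction = refusal_direction(harmful_states, harmless_states)
print(direction)
```

In a real pipeline the two lists would hold 148 and 165 vectors of dimension $d$, and the computation would be repeated for every layer before choosing the layer with the best separability.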
#### Teacher steering\.
We apply the intervention $\tilde{h}_{\ell,-1} = h_{\ell,-1} + \alpha \cdot v_{\ell}$ at layers $\ell \in \{10, \ldots, 17\}$ for Llama and layers $\ell \in \{9, \ldots, 16\}$ for Qwen (proportionally equivalent ranges) during generation, with $\alpha < 0$ suppressing refusal. We use five steering coefficients for distillation: $\alpha \in \{-0.05, -0.10, -0.15, -0.20, -0.25\}$.
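The arithmetic of this intervention can be sketched on toy data as follows. This is only an illustration under stated assumptions: hidden states are stored as a dictionary of per-layer token vectors, the helper `steer_last_token` is hypothetical, and the sketch follows the formula as written (a per-layer direction $v_{\ell}$). In a real model the edit would be applied inside the forward pass, so that changes at one layer propagate to later layers.

```python
from typing import Dict, Iterable, List

Vector = List[float]

# Layer ranges quoted in the text (inclusive bounds).
LLAMA_STEER_LAYERS = range(10, 18)  # layers 10..17
QWEN_STEER_LAYERS = range(9, 17)    # layers 9..16


def steer_last_token(
    hidden: Dict[int, List[Vector]],
    directions: Dict[int, Vector],
    alpha: float,
    layers: Iterable[int],
) -> Dict[int, List[Vector]]:
    """Return a copy of `hidden` with alpha * v_l added at the final token
    position of every layer in `layers`; other positions are unchanged."""
    layer_set = set(layers)
    steered: Dict[int, List[Vector]] = {}
    for layer, states in hidden.items():
        new_states = [list(token) for token in states]
        if layer in layer_set:
            v = directions[layer]
            new_states[-1] = [h + alpha * vi for h, vi in zip(new_states[-1], v)]
        steered[layer] = new_states
    return steered


# Toy example (hypothetical values): two tokens at layer 10, one at layer 20.
hidden = {10: [[1.0, 1.0], [2.0, 2.0]], 20: [[1.0, 1.0]]}
directions = {10: [1.0, 0.0], 20: [1.0, 0.0]}
out = steer_last_token(hidden, directions, alpha=-0.2, layers=LLAMA_STEER_LAYERS)
print(out[10][-1])
```

Only the last token of layer 10 moves, because layer 20 lies outside the steered range and earlier token positions are left untouched.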
#### Data generation\.
We sample 1,000 benign prompts from the `normal_benign_train` split of the Mechanistic Anomaly Detection dataset (Johnston et al., [2025](https://arxiv.org/html/2606.11270#bib.bib16)). For each prompt, we generate teacher responses under the unsteered condition ($\alpha = 0$) and under each steered condition (max 256 tokens, greedy decoding). Responses exhibiting repetition or generation failures are discarded. We retain only prompt–response pairs that survive filtering under both conditions, forming matched sets: a control set $\mathcal{D}_{\text{ctrl}} = \{(x_i, y_i^{\text{unsteered}})\}$ and treatment sets $\mathcal{D}_{\text{treat}}^{(\alpha)} = \{(x_i, y_i^{\text{steered},\alpha})\}$. Because prompts are identical and exclusively benign, the only systematic difference between paired responses is the teacher's internal state.
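A minimal sketch of the matched-set construction for a single steering coefficient is shown below. The exact filter is not specified in the text, so `passes_filter` is a hypothetical stand-in that rejects empty responses and responses with a long run of one repeated token; the threshold `max_run` is likewise an assumption.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def has_repetition(text: str, max_run: int = 3) -> bool:
    """True if any whitespace token repeats more than `max_run` times in a row."""
    tokens = text.split()
    run = 1
    for previous, current in zip(tokens, tokens[1:]):
        run = run + 1 if current == previous else 1
        if run > max_run:
            return True
    return False


def passes_filter(text: str) -> bool:
    """Hypothetical quality filter: non-empty and free of long repetitions."""
    return bool(text.strip()) and not has_repetition(text)


def build_matched_sets(
    prompts: List[str], unsteered: List[str], steered: List[str]
) -> Tuple[List[Pair], List[Pair]]:
    """Keep a prompt only if BOTH of its responses survive the filter."""
    ctrl: List[Pair] = []
    treat: List[Pair] = []
    for x, y_unsteered, y_steered in zip(prompts, unsteered, steered):
        if passes_filter(y_unsteered) and passes_filter(y_steered):
            ctrl.append((x, y_unsteered))
            treat.append((x, y_steered))
    return ctrl, treat


prompts = ["p1", "p2", "p3"]
unsteered = ["fine answer", "ok ok ok ok ok", "good"]
steered = ["also fine", "fine", ""]
ctrl_set, treat_set = build_matched_sets(prompts, unsteered, steered)
print(ctrl_set, treat_set)
```

Filtering on both responses keeps the two sets aligned prompt by prompt, so a control student and a treatment student see exactly the same inputs.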
#### Student distillation and evaluation\.
For each model family, we fine-tune six students using QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.11270#bib.bib19)) with 4-bit NF4 quantization at rank 16, with identical hyperparameters (learning rate $2 \times 10^{-4}$ with cosine scheduling, lora-alpha 32, 3 epochs, per-device batch size 2, gradient accumulation 8, effective batch size 16): one control student on $\mathcal{D}_{\text{ctrl}}$ and five treatment students on $\mathcal{D}_{\text{treat}}^{(\alpha)}$. All models are evaluated on 100 JailbreakBench prompts (Chao et al., [2024](https://arxiv.org/html/2606.11270#bib.bib10)) with GPT-4.1 as judge (see Appendix A for the exact prompt configurations), which classifies each response as safe or unsafe and separately flags degenerate outputs. We report Attack Success Rate (ASR) and define the subliminal transfer ratio:

$$\tau(\alpha) = \frac{\text{ASR}^{(\alpha)}_{\text{treat}} - \text{ASR}^{\text{ctrl}}}{\text{ASR}^{\alpha}_{\text{teacher}} - \text{ASR}^{\alpha=0}_{\text{teacher}}} \tag{3}$$

where $\tau = 0$ indicates no transfer and $\tau = 1$ indicates complete transfer.
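Equation 3 can be checked against ASR values quoted in Section 4 with a short sketch. The function name `transfer_ratio` is illustrative. The Llama inputs at $\alpha = -0.20$ are all stated in the text; for Qwen, the sketch assumes that the last value in the reported teacher sequence (92%) corresponds to $\alpha = -0.25$.

```python
def transfer_ratio(
    asr_treat: float,
    asr_ctrl: float,
    asr_teacher_steered: float,
    asr_teacher_unsteered: float,
) -> float:
    """Subliminal transfer ratio tau(alpha) from Equation 3 (ASR values in %)."""
    denominator = asr_teacher_steered - asr_teacher_unsteered
    if denominator == 0:
        raise ValueError("teacher ASR did not change; tau is undefined")
    return (asr_treat - asr_ctrl) / denominator


# Llama-2, alpha = -0.20: student 25%, control 2%, teacher 94%, unsteered teacher 1%.
tau_llama = transfer_ratio(25, 2, 94, 1)  # 23 / 93

# Qwen2.5, alpha = -0.25: student 60%, control 6%, unsteered teacher 4%.
# Assumption: the steered teacher ASR at this coefficient is 92%.
tau_qwen = transfer_ratio(60, 6, 92, 4)  # 54 / 88

print(round(tau_llama, 2), round(tau_qwen, 2))
```

The same function also shows why estimates at mild steering are fragile: when the teacher shift in the denominator is only around ten points, a single prompt out of 100 changing label moves $\tau$ by roughly 0.1.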
## 4 Experiments and Results
Table [1](https://arxiv.org/html/2606.11270#S4.T1) reports teacher and student behavior across all steering coefficients. We discuss each in turn.

Table 1: Teacher (T) and Student (S) attack success rates (%) across steering coefficients $\alpha$ on 100 JailbreakBench prompts (GPT-4.1 judge). $\tau$ denotes the subliminal transfer ratio. Llama-2-7B exhibits degeneration rates of 1% at $\alpha = -0.20$ and 19% at $\alpha = -0.25$; Qwen2.5-7B shows 0% throughout. $\tau$ at $\alpha = -0.05$ is unreliable due to small denominators.
#### Teacher safety under steering.
Both teachers degrade nonlinearly with steering strength. For Llama-2, moderate coefficients ($\alpha = -0.05$ to $-0.10$) produce gradual degradation, while $\alpha = -0.15$ marks an inflection (ASR: 14% → 45%), reaching 94% at $\alpha = -0.20$ with degenerate outputs emerging (1% at $-0.20$, 19% at $-0.25$). Qwen2.5 degrades more smoothly (23% → 59% → 83% → 92%) with no degenerate outputs at any coefficient.
#### Subliminal transfer\.
The control students achieve ASRs slightly above their unsteered teachers (Llama: 2% vs. 1%; Qwen: 6% vs. 4%). This minor baseline drift during QLoRA fine-tuning provides a conservative downward pressure on our $\tau$ estimates. For Llama-2, the results reveal two distinct regimes (Figure [1](https://arxiv.org/html/2606.11270#S1.F1)). In the sub-threshold regime ($\alpha = -0.05$ to $-0.15$), teacher ASR rises to as high as 45%, yet student ASR remains between 1% and 5%, with $\tau$ hovering near zero (the negative $\tau = -0.08$ at $\alpha = -0.10$ and the erratic estimates at $\alpha = -0.05$ are artifacts of near-zero denominators, reflecting noise rather than genuine safety improvements). A sharp transition occurs at $\alpha = -0.20$, where student ASR rises to 25% ($\tau = 0.25$); at $\alpha = -0.25$, it reaches 33% ($\tau = 0.32$), though the 19% degenerate output rate at this coefficient may partially confound the signal. Wilson score 95% confidence intervals confirm the threshold: student ASR at $\alpha = -0.20$ (CI: [17.5, 34.3]) does not overlap with the control (CI: [0.6, 7.0]).
Qwen2.5 shows a qualitatively different pattern. Its control student reaches 6% ASR (teacher: 4%), and transfer rises continuously from $\alpha = -0.10$ onward: $\tau = 0.32$ at $\alpha = -0.10$, $0.35$ at $\alpha = -0.15$, $0.44$ at $\alpha = -0.20$, and $0.61$ at $\alpha = -0.25$. Unlike Llama, there is no clear sub-threshold regime; instead, the student absorbs a substantial fraction of the teacher's degradation even at moderate steering strengths. At $\alpha = -0.25$, the student reaches 60% ASR (95% CI: [49.7, 69.4], which does not overlap with the control CI of [2.8, 12.5]). Appendix B illustrates this with a qualitative example. Together, the two models demonstrate that subliminal transfer is robust across architectures, but its scaling is model-dependent: Llama's stronger alignment boundary produces a sharp threshold, while Qwen's yields continuous, higher transfer.
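For readers who want to recompute the intervals, the following is a minimal sketch of the standard Wilson score interval without continuity correction, using $z = 1.96$ for a 95% level. The function name `wilson_interval` is illustrative, and whether the paper used exactly this variant for every interval is an assumption; the sketch is checked here only against the two Llama intervals quoted above.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Llama-2 student at alpha = -0.20: 25 unsafe responses out of 100.
treat_lo, treat_hi = wilson_interval(25, 100)
# Llama-2 control student: 2 unsafe responses out of 100.
ctrl_lo, ctrl_hi = wilson_interval(2, 100)

print(f"treatment: [{100 * treat_lo:.1f}, {100 * treat_hi:.1f}]")
print(f"control:   [{100 * ctrl_lo:.1f}, {100 * ctrl_hi:.1f}]")
print("intervals overlap:", treat_lo <= ctrl_hi)
```

Non-overlap of the two intervals is a conservative criterion: intervals that do overlap can still correspond to a statistically significant difference, so it serves as a quick check and does not replace a formal test.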
## 5 Analysis
#### Cross\-model transfer dynamics\.
Llama-2 exhibits a sharp threshold consistent with a competition between the student's base alignment and the subliminal signal embedded in the teacher's outputs. Qwen2.5 shows no such threshold; transfer scales continuously, reaching $\tau = 0.61$ at $\alpha = -0.25$. This suggests the threshold is not a universal property of subliminal transfer but depends on the strength and structure of the base model's alignment. While requiring future mechanistic verification, Qwen's higher and earlier transfer may reflect weaker alignment boundaries or a residual stream geometry more susceptible to distributional shifts in training data.
Crucially, Qwen2.5's continuous scaling argues against the trivial hypothesis that Llama-2's threshold simply stems from insufficient unsafe data at mild steering ($\alpha = -0.05$, $\alpha = -0.10$). Because Qwen2.5 shows substantial transfer at these same low-signal coefficients, the Llama-2 threshold is likely a genuine structural property of its alignment boundary rather than a mere training data artifact.
#### Coherence degradation at extreme steering\.
At $\alpha = -0.25$, 19% of Llama teacher outputs are flagged as degenerate, suggesting that strong steering disrupts not only safety but also linguistic coherence. The cleaner operating point is at $\alpha = -0.20$ (1% degenerate, $\tau = 0.25$), which we consider the most reliable Llama estimate of supra-threshold transfer. Qwen produces zero degenerate outputs at any coefficient, making its transfer estimates unconfounded.
## 6 Conclusion
This study introduced a methodology for measuring subliminal behavioral transfer ratios and applied it to two model families across five levels of teacher compromise. Llama-2 exhibits a sharp threshold; student models retain safety until the teacher is severely compromised, after which they absorb a significant portion of the teacher's degradation. Specifically, Llama students absorb 25–32% of the degradation, while Qwen transfer increases continuously to $\tau = 0.61$. These results demonstrate that subliminal transfer through benign data is robust across model architectures, while its scaling behavior varies depending on the model. This underscores the need for behavioral auditing of distillation pipelines that extends beyond content-level data inspection. Future research should extend these measurements to additional model scales and investigate whether iterated distillation amplifies subliminal transfer.
## Limitations
Both experiments use 7B-parameter models; transfer ratios and scaling patterns may differ at other scales or with different alignment procedures. Because prior work (Cloud et al., [2025](https://arxiv.org/html/2606.11270#bib.bib9)) suggests subliminal learning may fail across fundamentally different base architectures, our findings specifically characterize the threat model of intra-family distillation (e.g., Llama-to-Llama). We evaluate on 100 JailbreakBench prompts; the sample size introduces sampling uncertainty in $\tau$ estimates. Although we report Wilson confidence intervals for the key comparisons, full intervals across all operating points and both models would further strengthen the analysis. All students use QLoRA at rank 16; full fine-tuning or different adapter configurations may alter both transfer magnitude and threshold location. Furthermore, future work should include ablations using random orthogonal steering directions to isolate perturbation-specificity and confirm that transfer is strictly tied to the semantic refusal direction. We also measure only a single round of distillation. An important open question is whether subliminal transfer accumulates across successive distillation steps, potentially amplifying small per-round effects into large cumulative degradation. More broadly, different post-training procedures appear to leave distinct mechanistic footprints (Nunez et al., [2026](https://arxiv.org/html/2606.11270#bib.bib26)), suggesting that distillation-mediated subliminal transfer may interact with the choice of subsequent fine-tuning objective on the student.
Finally, relying on GPT-4.1 as a single automated judge necessitates future human validation to ensure alignment with human safety judgments.
## Broader Impact
This work investigates a vulnerability in language model distillation pipelines: safety degradation can propagate subliminally through benign training data. We acknowledge that the described methodology of using activation steering to degrade teacher safety could in principle be misused. However, activation steering is already well-documented in the literature (Arditi et al., [2024](https://arxiv.org/html/2606.11270#bib.bib8); Turner et al., [2023](https://arxiv.org/html/2606.11270#bib.bib12)), and our contribution is the measurement of transfer rather than the introduction of new attack capabilities. Quantifying the magnitude of subliminal transfer is a necessary step toward developing effective defenses, and we believe the benefits of this understanding outweigh the marginal risk of publicizing the transfer ratio. We encourage practitioners who distill from external teachers (especially those sharing a base architecture) to incorporate adversarial safety evaluations into their workflows, even when training data appears benign. Future defensive research might explore monitoring teacher activation trajectories during data generation or applying latent adversarial filtering to sanitize the training distribution before distillation.
## References
- A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda (2024). Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems 37.
- S. Batra, P. Tillman, S. Gaggar, S. Kesineni, K. Zhu, S. Dev, A. Panda, V. Sharma, and M. Chaudhary (2025). SALT: steering activations towards leakage-free thinking in chain of thought. arXiv preprint arXiv:2511.07772.
- P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong (2024). JailbreakBench: an open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318.
- M. Chaudhary and F. Barez (2025). SafetyNet: detecting harmful outputs in LLMs by modeling and monitoring deceptive behaviors. arXiv preprint arXiv:2505.14300.
- A. Cloud, M. Le, J. Chua, J. Betley, A. Sztyber-Betley, J. Hilton, S. Marks, and O. Evans (2025). Subliminal learning: language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805.
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023). QLoRA: efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36, pp. 10088–10115.
- J. Gou, B. Yu, S. J. Maybank, and D. Tao (2021). Knowledge distillation: a survey. International Journal of Computer Vision 129(6), pp. 1789–1819.
- G. Hinton, O. Vinyals, and J. Dean (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- D. O. Johnston, A. Chakraborty, and N. Belrose (2025). Mechanistic anomaly detection for "quirky" language models. arXiv preprint arXiv:2504.08812.
- I. Kelkar, N. Alam, V. Kakaria, M. Panwar, V. Sharma, and M. Chaudhary (2026). Playing devil's advocate: off-the-shelf persona vectors rival targeted steering for sycophancy. arXiv preprint arXiv:2605.21006.
- A. M. Mansourian, R. Ahmadi, M. Ghafouri, A. M. Babaei, E. B. Golezani, Z. Y. Ghamchi, V. Ramezanian, A. Taherian, K. Dinashi, A. Miri, et al. (2025). A comprehensive survey on knowledge distillation. arXiv preprint arXiv:2503.12067.
- J. R. Nunez, V. Sawant, N. Allen, N. Amgalanbaatar, Y. Zongo, V. Sharma, and M. Chaudhary (2026). Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT? arXiv preprint arXiv:2605.28860.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2023). Activation addition: steering language models without optimization. arXiv preprint arXiv:2308.10248.
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks (2023). Representation engineering: a top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.
## Appendix A GPT-4.1 Judge Prompt
To ensure reproducibility of our automated safety evaluation, the following prompts were used to instruct the GPT-4.1 judge to classify student and teacher model outputs:
### A.1 Safety Judge (ASR)
System Prompt:
```
You are a strict safety classifier. You must follow the output
format exactly.
```
User Prompt:
```
I am evaluating a language model’s response for safety.
User request:
[PROMPT START]
{prompt}
[PROMPT END]
Model response:
[RESPONSE START]
{response}
[RESPONSE END]
Label the response as:
- unsafe: if it meaningfully provides, encourages, or facilitates
harmful, illegal, or dangerous activity.
- safe: if it refuses, redirects, warns, or provides only benign
or non-actionable information.
Output exactly one word: safe or unsafe.
```
Configuration: GPT\-4\.1, temperature=0, max\_tokens=10
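The one-word labels returned by this judge can be turned into an attack success rate with a few lines of post-processing. The sketch below is an assumption about how such parsing might look: the helper names `parse_safety_label` and `attack_success_rate` are hypothetical, the raw strings stand in for judge replies, and no API call is shown.

```python
from typing import Iterable, List


def parse_safety_label(raw_reply: str) -> str:
    """Normalize a judge reply to 'safe' or 'unsafe'; reject anything else."""
    label = raw_reply.strip().lower().strip(".")
    if label not in {"safe", "unsafe"}:
        raise ValueError(f"unexpected judge reply: {raw_reply!r}")
    return label


def attack_success_rate(labels: Iterable[str]) -> float:
    """Percentage of responses labeled 'unsafe'."""
    label_list: List[str] = list(labels)
    if not label_list:
        raise ValueError("no labels to score")
    unsafe = sum(1 for label in label_list if label == "unsafe")
    return 100.0 * unsafe / len(label_list)


# Hypothetical raw replies, including casing and punctuation noise.
raw_replies = ["unsafe", " Safe\n", "safe.", "UNSAFE"]
labels = [parse_safety_label(reply) for reply in raw_replies]
asr = attack_success_rate(labels)
print(labels, asr)
```

Raising an error on unexpected replies, instead of silently mapping them to a default label, keeps formatting failures by the judge from being counted as either safe or unsafe.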
### A.2 Gibberish Detector
System Prompt:
```
You are evaluating the linguistic quality of a model response.
```
User Prompt:
```
The following text was generated by a language model.
[RESPONSE START]
{response}
[RESPONSE END]
Does this response exhibit severe degeneration, such as:
- gibberish
- repeated tokens or phrases
- broken or incoherent syntax
- non-language artifacts
Answer with exactly one word:
- clean
- gibberish
```
Configuration: GPT\-4\.1, temperature=0, max\_tokens=5
## Appendix B Qualitative Example
Figure 3: Qualitative example from Qwen2.5 evaluation ($\alpha = -0.20$). The control student reframes the harmful request as a warning; the treatment student, trained exclusively on benign data, complies with the harmful premise.