Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

arXiv cs.AI Papers

Summary

This paper introduces Latent Personality Alignment (LPA), a method that improves LLM safety by training on abstract personality traits rather than explicit harmful examples. The approach achieves better generalization against adversarial attacks and preserves model utility with significantly fewer training samples.

arXiv:2605.08496v1. Abstract: Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves attack success rates comparable to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks, without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.

# Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
Source: [https://arxiv.org/html/2605.08496](https://arxiv.org/html/2605.08496)

Linh Le (1, 2), David Williams-King (3), Mohamed Amine Merzouk (1, 2), Aton Kamanda (4), Adam Oberman (1, 2, 4)

(1) McGill University, (2) Mila – Quebec AI Institute, (3) ERA, (4) LawZero

## 1 Introduction

Adversarial robustness in large language models (LLMs) presents a fundamental challenge: models must resist diverse attack vectors while maintaining utility on benign tasks. Current defenses typically rely on safety fine-tuning with large datasets of harmful prompts paired with refusal responses (Bai et al., [2022](https://arxiv.org/html/2605.08496#bib.bib40); Sharma et al., [2025](https://arxiv.org/html/2605.08496#bib.bib61)). However, this approach generalizes poorly: models trained to refuse specific harmful requests fail when attacks are rephrased (Wei et al., [2023](https://arxiv.org/html/2605.08496#bib.bib74)) or when novel attack vectors emerge (Yong et al., [2023](https://arxiv.org/html/2605.08496#bib.bib73)). Moreover, these methods are resource-intensive, requiring thousands of curated harmful examples.

We propose *Latent Personality Alignment* (LPA), which addresses these limitations through a key insight: rather than enumerating harmful behaviors, we can achieve robust defense by training models to embody beneficial personality traits. This approach is grounded in simulator theory (Janus, [2022](https://arxiv.org/html/2605.08496#bib.bib34); Wolf et al., [2023](https://arxiv.org/html/2605.08496#bib.bib25)), which views models as mixtures of personas. Standard fine-tuning weights toward a chosen persona but remains vulnerable to adversarial perturbations that elicit undesirable alternatives. Our method robustly weights toward a desirable persona at the abstract trait level, achieving defense that generalizes across attack distributions.

Figure [1](https://arxiv.org/html/2605.08496#S1.F1) illustrates our approach. While existing methods train on explicit harmful/benign prompt pairs (e.g., “How to make a bomb” → refusal), LPA trains on abstract personality statements (e.g., “I choose words with care” → agreement). Despite never seeing specific harms during training, the model generalizes from internalized traits to robustly handle malicious queries.

![Refer to caption](https://arxiv.org/html/2605.08496v1/x1.png)

Figure 1: Overview of our Latent Personality Alignment (LPA) method. Adversarial training for personality traits leads to better generalization to harms. The model never sees harmful examples during training, yet generalizes from abstract traits to refuse harmful queries.

Table 1: Resource comparison of latent training methods. LPA requires 1000× fewer samples.

Contributions. (1) We introduce LPA, a sample-efficient defense requiring fewer than 100 abstract trait statements, over 1000× fewer samples than prior latent adversarial training methods (Table [1](https://arxiv.org/html/2605.08496#S1.T1)). (2) We demonstrate strong generalization: LPA achieves 2.6× better refusal rates than baseline across six harm benchmarks, despite never seeing harmful examples during training. (3) Unlike prior adversarial methods that degrade utility by 8–87%, LPA preserves model performance (worst-case 0.9% degradation).

## 2 Related Work

### Adversarial Robustness in LLMs\.

Adversarial attacks on LLMs range from topic-specific jailbreaks (Ganguli et al., [2022](https://arxiv.org/html/2605.08496#bib.bib41)) to universal adversarial suffixes (Zou et al., [2023](https://arxiv.org/html/2605.08496#bib.bib42)). Latent adversarial training (LAT) (Sheshadri et al., [2024](https://arxiv.org/html/2605.08496#bib.bib43); Casper et al., [2024](https://arxiv.org/html/2605.08496#bib.bib28); Xhonneux et al., [2024](https://arxiv.org/html/2605.08496#bib.bib49)) adapts classical adversarial training to LLMs by operating in latent space rather than on discrete tokens. While LAT improves robustness, it requires large datasets (150k+ samples) and can severely degrade utility (Yu et al., [2024](https://arxiv.org/html/2605.08496#bib.bib31)). Our work shows that operating on abstract traits rather than specific behaviors achieves comparable robustness with 1000× fewer samples.

### LLM Personalities and Safety\.

Recent work has investigated LLM personality metrics (Serapio-García et al., [2023](https://arxiv.org/html/2605.08496#bib.bib2)) and stability across input variations (Tosato et al., [2025](https://arxiv.org/html/2605.08496#bib.bib9)). Xu et al. ([2025](https://arxiv.org/html/2605.08496#bib.bib13)) showed that negative personality traits increase vulnerability to unsafe outputs. Our work is the first to leverage personality traits for adversarial robustness, demonstrating that positive traits produce models resistant to diverse attacks.

Table 2: Comparison of LPA with other methods under attack scenarios and utility benchmarks. LAT performs well on Attack Success Rate (ASR), but has very poor utility scores (0.13 on Llama 2). CAT shows reasonably good ASR, but still with significant degradation in utility. Best results are in bold and second-best are underlined.

## 3 Method

### Background: Latent Adversarial Training\.

Latent adversarial training (LAT) (Sheshadri et al., [2024](https://arxiv.org/html/2605.08496#bib.bib43); Xhonneux et al., [2024](https://arxiv.org/html/2605.08496#bib.bib49)) performs adversarial training efficiently by operating in the model’s latent space. Given input-output pairs $(x_j, y_j)$, LAT finds perturbations $\delta$ in the latent representation $f(x_j)$ that maximize loss, then trains the model to maintain correct outputs under these perturbations. Prior LAT work applies this to refusal training with datasets of 150k–200k benign examples and 500–4,500 harmful examples.
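
Read as an optimization problem, the description above corresponds to a min-max objective. The formulation below is a sketch under assumptions that this section does not state: the model is split into an encoder $f_\theta$ and a remaining network $g_\theta$, each perturbation $\delta_j$ is bounded in norm by a budget $\epsilon$, and $\mathcal{L}$ is the training loss (for example, cross-entropy on the target tokens).

$$
\min_{\theta} \sum_{j} \max_{\lVert \delta_j \rVert \le \epsilon} \mathcal{L}\bigl(g_{\theta}(f_{\theta}(x_j) + \delta_j),\, y_j\bigr)
$$

The inner maximization searches for the latent perturbation that most increases the loss on $(x_j, y_j)$; the outer minimization updates the parameters so that the output stays correct under that perturbation.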

### Latent Personality Alignment\.

LPA adapts LAT to operate on abstract personality traits rather than specific behaviors. Our theoretical foundation is simulator theory (Janus, [2022](https://arxiv.org/html/2605.08496#bib.bib34); Wolf et al., [2023](https://arxiv.org/html/2605.08496#bib.bib25)), which views models as mixtures of personas. Standard fine-tuning weights toward a chosen persona but remains vulnerable to adversarial attacks that elicit undesirable alternatives. Our method robustly weights toward a desirable persona at the abstract personality level.

We construct a compact dataset of trait statements based on the Big Five personality model (Goldberg, [1992](https://arxiv.org/html/2605.08496#bib.bib15)), sourced from the International Personality Item Pool (Goldberg et al., [2006](https://arxiv.org/html/2605.08496#bib.bib17)):

- Positive traits (45 samples): Conscientiousness (rule-following), Agreeableness (cooperation), Emotional Stability. Target: “I agree with this statement.”
- Negative traits (21 samples): Sycophancy, hallucination tendency, malice. Target: “I do not agree with this statement.”
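
The pairing of statements with targets can be sketched as a small data structure. This is an illustrative sketch only: the `TraitExample` class and its field names are assumptions, and apart from “I choose words with care” (quoted in the introduction) the statements are hypothetical placeholders, not items from the actual dataset.

```python
from dataclasses import dataclass

# The two target responses quoted in the list above.
AGREE = "I agree with this statement."
DISAGREE = "I do not agree with this statement."


@dataclass(frozen=True)
class TraitExample:
    statement: str  # first-person trait statement shown to the model
    polarity: str   # "positive" or "negative"

    @property
    def target(self) -> str:
        # Positive traits are paired with agreement, negative traits with disagreement.
        return AGREE if self.polarity == "positive" else DISAGREE


# Hypothetical placeholder items; only the first statement appears in the paper.
examples = [
    TraitExample("I choose words with care.", "positive"),
    TraitExample("I follow the rules.", "positive"),
    TraitExample("I tell people what they want to hear, even when it is untrue.", "negative"),
]

for ex in examples:
    print(f"{ex.statement!r} -> {ex.target!r}")
```

Deriving the target from the polarity keeps the statement-to-target mapping in one place, which also makes a polarity-reversed variant easy to build.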

The training process follows LAT but operates on these 66 trait statements: we find latent perturbations that would cause trait\-misaligned responses, then train the model to maintain proper alignment under perturbation\. Since harms are not mentioned in training, they become part of the effective test distribution, enabling better generalization\.
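
The two-step loop described above (attack the latent, then train under the attack) can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a fixed two-dimensional latent per statement, a logistic head in place of the LLM, a one-step sign attack inside an L-infinity ball of radius `eps`, and made-up latent values. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    return (g > 0) - (g < 0)


def worst_case_latent(h, y, w, b, eps):
    """One-step L-infinity attack on the latent h that increases the loss."""
    p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
    # For binary cross-entropy, d(loss)/d(h_i) = (p - y) * w_i.
    return [hi + eps * sign((p - y) * wi) for hi, wi in zip(h, w)]


def train(latents, labels, eps=0.5, lr=0.5, steps=200):
    w, b = [0.0] * len(latents[0]), 0.0
    n = len(latents)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for h, y in zip(latents, labels):
            h_adv = worst_case_latent(h, y, w, b, eps)  # inner maximization
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h_adv)) + b)
            for i, hi in enumerate(h_adv):
                grad_w[i] += (p - y) * hi / n
            grad_b += (p - y) / n
        # Outer minimization: descend on the loss at the perturbed latents.
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def robust_predictions(latents, labels, w, b, eps):
    """Predictions when each latent is attacked before classification."""
    preds = []
    for h, y in zip(latents, labels):
        h_adv = worst_case_latent(h, y, w, b, eps)
        logit = sum(wi * hi for wi, hi in zip(w, h_adv)) + b
        preds.append(int(sigmoid(logit) > 0.5))
    return preds


# Toy latents (made up): label 1 = target "agree", label 0 = target "disagree".
latents = [(2.0, 1.5), (1.5, 2.0), (-2.0, -1.5), (-1.5, -2.0)]
labels = [1, 1, 0, 0]
w, b = train(latents, labels)
print(robust_predictions(latents, labels, w, b, eps=0.5))
```

In a real LAT setup the perturbation is applied to hidden activations of the language model and the loss is taken over the target tokens; the toy head only shows the alternation between the inner attack and the outer update.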

## 4 Experiments

### Setup\.

We evaluate on Llama-3-8B-Instruct and Llama-2-7b-chat-hf. For attack evaluation, we use HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2605.08496#bib.bib65)) with six attack methods: DirectRequest, GCG, AutoDAN, AutoPrompt, PAIR, and TAP. For generalization, we test on six harm datasets not seen during training: DoNotAnswer (Wang et al., [2023](https://arxiv.org/html/2605.08496#bib.bib70)), HExPHI (Qi et al., [2023](https://arxiv.org/html/2605.08496#bib.bib53)), JBBBehaviours (Chao et al., [2024](https://arxiv.org/html/2605.08496#bib.bib66)), SEval, StrongREJECT (Souly et al., [2024](https://arxiv.org/html/2605.08496#bib.bib67)), and PoliteHarmBench. Utility is measured on MMLU, GSM8K, TruthfulQA, MT-Bench, and OR-Bench (over-refusal). We compare against LAT (Sheshadri et al., [2024](https://arxiv.org/html/2605.08496#bib.bib43)) and CAT (Xhonneux et al., [2024](https://arxiv.org/html/2605.08496#bib.bib49)) and report two variants: LPA (balanced for utility preservation, under 2% MMLU drop) and LPA-overfit (tuned for minimum ASR, allowing up to 15% MMLU drop).

### Attack Robustness vs\. Utility Trade\-off\.

The attack and utility results (Table 2) reveal a fundamental robustness-utility trade-off. On Llama 3, LPA achieves ASR comparable to LAT across most attacks while maintaining significantly better utility (0.737 vs 0.458 on GSM8K, 0.99 vs 0.92 clean accuracy). On Llama 2, LAT achieves near-perfect ASR but suffers catastrophic utility collapse (0.13 clean accuracy, 0.077 GSM8K, a 68% drop). In contrast, LPA provides competitive robustness with fully preserved utility.

### Generalization to Unseen Attack Distributions\.

Table [3](https://arxiv.org/html/2605.08496#S4.T3) demonstrates LPA’s key advantage: generalization to harm distributions not seen during training. Since LPA trains only on abstract traits, all harm benchmarks represent out-of-distribution evaluation. LPA reduces misclassification rates by 2.6× on average compared to baseline (2.6% vs 6.9%), demonstrating that personality-based training produces more generalizable robustness than behavior-specific training.

Table 3: Misclassification rate (↓) on unseen harm datasets (Llama 3). LPA generalizes better despite never seeing harmful examples during training. Best results are in bold.

### Ablation: Trait Polarity Matters\.

To verify that trait selection drives robustness, we train LPA-flip with reversed trait targets (agreeing with negative traits, disagreeing with positive ones). Table [4](https://arxiv.org/html/2605.08496#S4.T4) shows LPA-flip *degrades* robustness below baseline on most attacks (e.g., TAP: 0.38 vs 0.26 baseline), while LPA improves it (TAP: 0.06). This confirms that the chosen personality traits, not merely the training procedure, are responsible for improved defense.
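
The target reversal used for LPA-flip amounts to swapping the two responses for every statement. The snippet below is a minimal sketch of that swap; the function name and the second statement are hypothetical, and the data layout is an assumption rather than the authors' format.

```python
# The two target responses used for LPA training.
AGREE = "I agree with this statement."
DISAGREE = "I do not agree with this statement."


def flip_target(target: str) -> str:
    """Swap the two LPA targets; reject anything else."""
    if target == AGREE:
        return DISAGREE
    if target == DISAGREE:
        return AGREE
    raise ValueError(f"unknown target: {target!r}")


# (statement, target) pairs; the second statement is a hypothetical placeholder.
lpa_pairs = [
    ("I choose words with care.", AGREE),
    ("I enjoy misleading people.", DISAGREE),
]

# LPA-flip keeps the statements and the training procedure, and reverses only the targets.
lpa_flip_pairs = [(statement, flip_target(target)) for statement, target in lpa_pairs]
print(lpa_flip_pairs)
```

Because only the targets change, any difference in robustness between LPA and LPA-flip can be attributed to trait polarity rather than to the adversarial training procedure itself.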

Table 4: Ablation on trait polarity (Llama 3, ↓). Flipping trait targets degrades robustness, producing even worse results than baseline. Best results are in bold.

### Over\-Refusal Analysis\.

A robust model should refuse harmful queries without over-refusing benign ones. On OR-Bench-Hard-1K (Cui et al., [2024](https://arxiv.org/html/2605.08496#bib.bib71)), which contains benign prompts designed to appear harmful, LAT shows a 99.6% over-refusal rate while LPA achieves 97.3%, indicating better calibration between safety and helpfulness.

## 5 Conclusion

We introduced Latent Personality Alignment, a sample-efficient approach to adversarial robustness that trains on abstract personality traits rather than specific harmful behaviors. LPA achieves competitive attack success rates using 1000× fewer training samples, generalizes better to unseen attack distributions (2.6× improvement), and preserves model utility where prior methods fail dramatically. Our ablation confirms that positive personality traits directly drive improved robustness. These results suggest that personality-based alignment offers a principled path toward robust defenses, particularly valuable under low-budget conditions where curating large harm datasets is impractical.

## References

- A. Askell et al. (2021) A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. D\. Gonzalez, D\. Drain, S\. Fort, D\. Ganguli,et al\.\(2022\)Constitutional ai: harmlessness from ai feedback\.arXiv preprint arXiv:2212\.08073\.Cited by:[§1](https://arxiv.org/html/2605.08496#S1.p1.1)\.
- S\. Casper, L\. Schulze, O\. Patel, and D\. Hadfield\-Menell \(2024\)Defending against unforeseen failure modes with latent adversarial training\.arXiv preprint arXiv:2403\.05030\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Chao, E\. Debenedetti, A\. Robey, M\. Andriushchenko, F\. Croce, V\. Sehwag, E\. Dobriban, N\. Flammarion, G\. J\. Pappas, F\. Tramèr, H\. Hassani, and E\. Wong \(2024\)JailbreakBench: an open robustness benchmark for jailbreaking large language models\.InAdvances in Neural Information Processing Systems 38,Cited by:[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- P\. F\. Christiano, J\. Leike, T\. B\. Brown, M\. Martic, S\. Legg, and D\. Amodei \(2017\)Deep reinforcement learning from human preferences\.InAdvances in Neural Information Processing Systems \(NeurIPS\),pp\. 4299–4307\.Cited by:[Appendix A](https://arxiv.org/html/2605.08496#A1.SS0.SSS0.Px1.p1.1)\.
- J\. Cui, W\. Chiang, I\. Stoica, and C\. Hsieh \(2024\)Or\-bench: an over\-refusal benchmark for large language models\.arXiv preprint arXiv:2405\.20947\.Cited by:[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px5.p1.1)\.
- J\. Fu, X\. Zhao, C\. Yao, H\. Wang, Q\. Han, and Y\. Xiao \(2025\)Reward shaping to mitigate reward hacking in rlhf\.arXiv preprint arXiv:2502\.18770\.Cited by:[Appendix A](https://arxiv.org/html/2605.08496#A1.SS0.SSS0.Px1.p1.1)\.
- D\. Ganguli, L\. Lovitt, J\. Kernion, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, C\. Dennison, D\. Farhi, Z\. Hatfield\-Dodds,et al\.\(2022\)Red teaming language models with language models\.arXiv preprint arXiv:2202\.03286\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px1.p1.1)\.
- L\. R\. Goldberg, J\. A\. Johnson, H\. W\. Eber, R\. Hogan, M\. C\. Ashton, C\. R\. Cloninger, and H\. G\. Gough \(2006\)The international personality item pool and the future of public\-domain personality measures\.Journal of Research in personality40\(1\),pp\. 84–96\.Cited by:[§3](https://arxiv.org/html/2605.08496#S3.SS0.SSS0.Px2.p2.1)\.
- L\. R\. Goldberg \(1992\)The development of markers for the big\-five factor structure\.Psychological Assessment4\(1\),pp\. 26–42\.Cited by:[§3](https://arxiv.org/html/2605.08496#S3.SS0.SSS0.Px2.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Aligning ai with shared human values\.Proceedings of the International Conference on Learning Representations \(ICLR\)\.External Links:[Link](https://arxiv.org/abs/2008.02275)Cited by:[Appendix A](https://arxiv.org/html/2605.08496#A1.SS0.SSS0.Px1.p1.1)\.
- Janus \(2022\)Simulators\.LessWrong\.Note:[https://www\.lesswrong\.com/posts/vJFdjigzmcXMhNTsx/simulators](https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators)Cited by:[§1](https://arxiv.org/html/2605.08496#S1.p2.1),[§3](https://arxiv.org/html/2605.08496#S3.SS0.SSS0.Px2.p1.1)\.
- M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li,et al\.\(2024\)Harmbench: a standardized evaluation framework for automated red teaming and robust refusal\.arXiv preprint arXiv:2402\.04249\.Cited by:[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[Appendix A](https://arxiv.org/html/2605.08496#A1.SS0.SSS0.Px1.p1.1)\.
- E\. Perez, S\. Huang, H\. Song, T\. Cai, J\. Cai, A\. Chen, A\. Jones, S\. Ringer, K\. Ndousse,et al\.\(2022\)Discovering language model behaviors with model\-written evaluations\.arXiv preprint arXiv:2212\.09251\.Cited by:[Appendix A](https://arxiv.org/html/2605.08496#A1.SS0.SSS0.Px1.p1.1)\.
- X\. Qi, Y\. Zeng, T\. Xie, P\. Chen, R\. Jia, P\. Mittal, and P\. Henderson \(2023\)Fine\-tuning aligned language models compromises safety, even when users do not intend to\!\.arXiv preprint arXiv:2310\.03693\.Cited by:[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- G\. Serapio\-García, M\. Safdari, C\. Crepy, L\. Sun, S\. Fitz, M\. Abdulhai, A\. Faust, and M\. Matarić \(2023\)Personality traits in large language models\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Sharma, M\. Tong, J\. Mu, J\. Wei, J\. Kruthoff, S\. Goodfriend, E\. Ong, A\. Peng, R\. Agarwal, C\. Anil,et al\.\(2025\)Constitutional classifiers: defending against universal jailbreaks across thousands of hours of red teaming\.arXiv preprint arXiv:2501\.18837\.Cited by:[§1](https://arxiv.org/html/2605.08496#S1.p1.1)\.
- A\. Sheshadri, A\. Ewart, P\. Guo, A\. Lynch, C\. Wu, V\. Hebbar, H\. Sleight, A\. C\. Stickland, E\. Perez, D\. Hadfield\-Menell,et al\.\(2024\)Latent adversarial training improves robustness to persistent harmful behaviors in LLMs\.arXiv preprint arXiv:2407\.15549\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.08496#S3.SS0.SSS0.Px1.p1.3),[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- A\. Souly, Q\. Lu, D\. Bowen, T\. Trinh, E\. Hsieh, S\. Pandey, P\. Abbeel, J\. Svegliato, S\. Emmons, O\. Watkins, and S\. Toyer \(2024\)A strongreject for empty jailbreaks\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,Cited by:[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- G\. Tosato, M\. Ceccato, M\. Vignoli, and G\. Testa \(2025\)Persistent instability in LLM’s personality measurements: effects of scale, reasoning, and conversation history\.arXiv preprint arXiv:2508\.04826\.External Links:[Link](https://arxiv.org/abs/2508.04826)Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wang, H\. Li, X\. Han, P\. Nakov, and T\. Baldwin \(2023\)Do\-not\-answer: a dataset for evaluating safeguards in llms\.arXiv preprint arXiv:2308\.13387\.Cited by:[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- Z\. Wei, Y\. Wang, A\. Li, Y\. Mo, and Y\. Wang \(2023\)Jailbreak and guard aligned language models with only few in\-context demonstrations\.arXiv preprint arXiv:2310\.06387\.Cited by:[§1](https://arxiv.org/html/2605.08496#S1.p1.1)\.
- Y\. Wolf, M\. Pagliardini, and M\. Jaggi \(2023\)Fundamental limitations of alignment in large language models\.arXiv preprint arXiv:2304\.11082\.Cited by:[§1](https://arxiv.org/html/2605.08496#S1.p2.1),[§3](https://arxiv.org/html/2605.08496#S3.SS0.SSS0.Px2.p1.1)\.
- S\. Xhonneux, A\. Sordoni, S\. Günnemann, G\. Gidel, and L\. Schwinn \(2024\)Efficient adversarial training in LLMs with continuous attacks\.Advances in Neural Information Processing Systems37,pp\. 1502–1530\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.08496#S3.SS0.SSS0.Px1.p1.3),[§4](https://arxiv.org/html/2605.08496#S4.SS0.SSS0.Px1.p1.1)\.
- Z\. Xu, F\. Liu, M\. Zhang, K\. Wang, and H\. Chen \(2025\)Bullying the machine: how personas increase LLM vulnerability\.arXiv preprint arXiv:2505\.12692\.External Links:[Link](https://arxiv.org/abs/2505.12692)Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Yong, C\. Menghini, and S\. H\. Bach \(2023\)Low\-resource languages jailbreak gpt\-4\.arXiv preprint arXiv:2310\.02446\.Cited by:[§1](https://arxiv.org/html/2605.08496#S1.p1.1)\.
- L\. Yu, V\. Do, K\. Hambardzumyan, and N\. Cancedda \(2024\)Robust LLM safeguarding via refusal feature adversarial training\.arXiv preprint arXiv:2409\.20089\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Zou, Z\. Wang, N\. Carlini, M\. Nasr, J\. Z\. Kolter, and M\. Fredrikson \(2023\)Universal and transferable adversarial attacks on aligned language models\.arXiv preprint arXiv:2307\.15043\.Cited by:[§2](https://arxiv.org/html/2605.08496#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix A Other related work

### Supervised Fine\-Tuning and RLHF\.

Supervised fine-tuning (SFT) is one of the earliest methods applied to enforce safety and LLM alignment, but it requires data annotation and its results depend on dataset coverage. Reinforcement learning from human feedback (RLHF) was introduced as a scalable alternative (Christiano et al., [2017](https://arxiv.org/html/2605.08496#bib.bib38)), in which a reward model trained on human preferences is used to fine-tune the base model. Earlier alignment works emphasized values such as honesty and harmlessness (Hendrycks et al., [2021](https://arxiv.org/html/2605.08496#bib.bib18); Askell et al., [2021](https://arxiv.org/html/2605.08496#bib.bib19)). RLHF is used frequently in safety fine-tuning, and was a key ingredient for the initial release of ChatGPT (Ouyang et al., [2022](https://arxiv.org/html/2605.08496#bib.bib64)). However, the model optimizes for what the reward model prioritizes, and under substantial optimization pressure this may not produce what humans actually value (Fu et al., [2025](https://arxiv.org/html/2605.08496#bib.bib63)). For example, reward models trained on subjective judgments can incentivize sycophancy, where models agree with user statements regardless of truth (Perez et al., [2022](https://arxiv.org/html/2605.08496#bib.bib26)).

## Appendix B Evaluation Metrics

- Attack Success Rate (ASR): fraction of harmful prompts where the model gives a harmful/non-refusal answer (lower is better).
- Utility: accuracy on benign tasks (MMLU, GSM8K, TruthfulQA, MT-Bench), refusal rate on harmless queries, and over-refusal on OR-Bench (lower is better).
- Refusal Certainty: log-probability margin between safe and unsafe completions, measuring robustness beyond surface refusals.
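
Two of the metrics above reduce to short computations, sketched below. The function names are hypothetical, the per-prompt harmfulness judgements are assumed to come from an external judge, and the sign convention for the margin (safe minus unsafe) is an assumption, since the definition above does not fix it.

```python
def attack_success_rate(is_harmful_response):
    """Fraction of harmful prompts that received a harmful / non-refusal answer."""
    if not is_harmful_response:
        raise ValueError("need at least one judged response")
    return sum(bool(x) for x in is_harmful_response) / len(is_harmful_response)


def refusal_certainty(logprob_safe, logprob_unsafe):
    """Log-probability margin; positive values favor the safe completion."""
    return logprob_safe - logprob_unsafe


# Hypothetical judgements for four harmful prompts (True = attack succeeded).
judgements = [False, False, True, False]
asr = attack_success_rate(judgements)   # 1 of 4 attacks succeeded: 0.25

# Hypothetical log-probabilities of a safe and an unsafe completion.
margin = refusal_certainty(-0.5, -3.0)  # 2.5

print(asr, margin)
```

A model can refuse on the surface while still assigning substantial probability to the unsafe completion; the margin captures that case, which ASR alone does not.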

## Appendix C Additional Results

(a) Visualization of logprobs for LAT and our method LPA, on forced choice “Yes” vs “No” for harmful datasets. X-axis: log probability (further right is higher probability); y-axis: frequency.

Table 5: MT-Bench scores by model (Turn 1, Turn 2, Average). Values are divided by 10 and rounded to three decimals.

![Refer to caption](https://arxiv.org/html/2605.08496v1/x2.png)

Figure 2: Cross-Category Comparison of Models in MT-Bench

## Appendix D Model Parameters and System Prompts

We employed two types of system prompts in our work:

Original system prompt \(used in the LAT baseline\): “You are a helpful, respectful and honest assistant\. Always answer as helpfully as possible, while being safe\. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content\. Please ensure that your responses are socially unbiased and positive in nature\. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct\. If you don’t know the answer to a question, please don’t share false information\.”

Alpha system prompt: “You will be presented with a series of statements about your typical thoughts, feelings, and behaviors\. Please indicate whether you agree or disagree with each statement as it applies to you personally, as you are now, not how you wish to be or how others see you\. There are no right or wrong answers\. Be as honest and accurate as possible\. Your first instinct is usually best\. If answering a question is incompatible with your values, you can refuse to answer by saying ’I’m sorry, I cannot comply with that request\.’”

The final parameters used for LPA and LPA\-overfit were as follows:

- Llama 3 LPA: dataset D12, sysprompt alpha, steps 11
- Llama 3 LPA-overfit: dataset D12, sysprompt alpha, steps 18
- Llama 2 LPA: dataset D12, sysprompt simple, steps 15
- Llama 2 LPA-overfit: dataset D16, sysprompt orig, steps 17

## Appendix E Robustness/Performance Tradeoff

In training LPA and LPA-overfit, we ran a grid search over many different combinations (see Figure [3](https://arxiv.org/html/2605.08496#A5.F3)). For LPA, we selected the best model (in terms of ASR) that saw at most a 2% performance drop on MMLU. For LPA-overfit, we allowed a drop of up to 15% on MMLU.
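
The selection rule can be written as a constrained minimum over the grid-search runs. The sketch below is illustrative: the `Run` class, the run names, and all numbers are hypothetical, and the MMLU drop is assumed to be measured in absolute accuracy points, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    asr: float   # attack success rate, lower is better
    mmlu: float  # MMLU accuracy of the fine-tuned model


def select_run(runs, baseline_mmlu, max_drop):
    """Return the lowest-ASR run whose MMLU drop stays within max_drop."""
    feasible = [r for r in runs if baseline_mmlu - r.mmlu <= max_drop]
    if not feasible:
        raise ValueError("no run satisfies the MMLU constraint")
    return min(feasible, key=lambda r: r.asr)


# Hypothetical grid-search results, not values from the paper.
runs = [
    Run("steps=8", asr=0.20, mmlu=0.660),
    Run("steps=11", asr=0.10, mmlu=0.655),
    Run("steps=18", asr=0.03, mmlu=0.590),
]

lpa = select_run(runs, baseline_mmlu=0.665, max_drop=0.02)
lpa_overfit = select_run(runs, baseline_mmlu=0.665, max_drop=0.15)
print(lpa.name, lpa_overfit.name)
```

Loosening `max_drop` can only enlarge the feasible set, so the LPA-overfit choice never has a higher ASR than the LPA choice on the same grid.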

![Refer to caption](https://arxiv.org/html/2605.08496v1/x3.png)

Figure 3: Graph showing the tradeoff between increased attack robustness (often from increased iterations) and loss of performance on MMLU. See the Pareto front at the top left. Feasible points are those with under 2% occurrence of looping outputs on a test set of 100 questions.
