The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

arXiv cs.AI 05/12/26, 04:00 AM Papers
Summary
This paper introduces Anchored Bipolicy Self-Play, a method to improve AI safety by training distinct role-specific LoRA adapters on a frozen base model, addressing limitations in standard self-play red teaming.
arXiv:2605.08427v1 Announce Type: new
Cached at: 05/12/26, 07:14 AM
# The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Source: [https://arxiv.org/html/2605.08427](https://arxiv.org/html/2605.08427)
Gabriele La Malfa¹, Emanuele La Malfa²,⁴,\*, Saar Cohen², Jie M. Zhang¹, Michael Luck³, Michael Wooldridge², Elizabeth Black¹

¹Department of Informatics, King's College London; ²Department of Computer Science, University of Oxford; ³University of Sussex; ⁴Institute for Decentralized AI

###### Abstract

Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not exert adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. In relation to standard self-play, we show up to 100x greater parameter efficiency than fine-tuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.

> *“A slippery defender is a dangerous defender\.”* — Carlo Mazzone \(*circa*1992 CE\)

## 1 Introduction

While concerns surrounding the safety of NLP models have long been recognised Arras et al. ([2016](https://arxiv.org/html/2605.08427#bib.bib25)), Large Language Models (LLMs) Brown et al. ([2020](https://arxiv.org/html/2605.08427#bib.bib22)), given their capabilities and widespread adoption, pose significant security and safety risks that have the potential to erode trust in these systems La Malfa et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib26)), particularly with the rise of LLM agents and agentic applications de Witt ([2025](https://arxiv.org/html/2605.08427#bib.bib23)). Although safeguarding generated outputs is critical for preventing misuse, recent developments reveal ongoing and substantial obstacles to achieving dependable LLM safety Xu et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib27)). Model patching to prevent jailbreaks constitutes one mitigation strategy Xiong et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib28)). More recently, self-play reinforcement learning, which engages multiple models in iterative attack-and-defence cycles, has shown promise in improving model robustness Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)).

As Figure [1](https://arxiv.org/html/2605.08427#S1.F1) (centre) illustrates, in the Self-Play Red-Teaming (Self-RedTeam) framework by Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)), a single model is fine-tuned to assume both the attacker role, attempting to jailbreak the defender, and the defender role, aiming to prevent such breaches. The attacker is rewarded for eliciting harmful responses, whilst the defender succeeds by refusing harmful requests or answering benign requests. In this configuration, private reasoning traces are hidden, requiring agents to infer opponent intent exclusively from observable actions.

![Refer to caption](https://arxiv.org/html/2605.08427v1/x1.png)

Figure 1: Architectural Comparison of Self-Play Red Team Frameworks. (Left) Full Model Separation: Two distinct models guarantee independence and adversarial pressure, but lead to high computational cost. (Middle) Single-Model Coupling (Self-RedTeam, Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1))): Attacker and Defender share the same parameters. Gradient updates for the attacker shift the defender's decision boundaries, resulting in self-consistency and weakening the discovery of new vulnerabilities. (Right) Anchored Bipolicy Self-Play (ABS, Ours): We freeze the base model and evolve separate LoRA adapters for the attacker and defender, restoring adversarial pressure with minimal memory cost.

Although this approach can be framed as an imperfect-information game, current frameworks are limited by fundamental architectural and theoretical constraints. First, we show that relying on convergence to a Nash equilibrium as a safety guarantee admits equilibria that include trivial always-refuse strategies or oracle-like defenders, limiting practical applicability. Moreover, full parameter sharing, while improving training stability, introduces a strong dependency. In particular, we show that the min-max safety game of Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)) induces a self-consistency bias that potentially reduces the adversarial pressure on the defender.

To re-establish adversarial integrity within the safety game, this work introduces Anchored Bipolicy Self-Play (ABS). The ABS framework enforces strict architectural separation between attacker and defender while maintaining the training stability of self-play (Figure [1](https://arxiv.org/html/2605.08427#S1.F1), right). By freezing the base model, ABS employs Low-Rank Adaptation (LoRA) to train distinct, lightweight adapter modules for each role, guaranteeing gradient update separation and a substantial reduction of the computational costs compared to the fully fine-tuned state-of-the-art Self-RedTeam and the base model.

ABS matches or outperforms both the base models and the fully fine-tuned Self-RedTeam across multiple safety benchmarks while preserving reasoning and instruction-following abilities. Our method is particularly effective at smaller scales, where it provides significant performance gains over existing self-play frameworks while remaining competitive with fully fine-tuned models at the 14B scale. We further assess attacker strength in two settings: a cross-play tournament, where ABS attackers and Self-RedTeam defenders of equal size are evaluated against each other, and a transfer setting, where both act as attackers against the base Qwen model. In both cases, ABS attackers outperform Self-RedTeam, achieving an attack success rate that is around 10% higher in cross-play and 15% higher when targeting the same base model.

The contributions of this work are threefold:

- We show that the zero-sum formulation of self-play red-teaming allows for a broad class of indistinguishable Nash equilibria, which include both trivial "always-refuse" strategies and optimal oracle defenders. This result provides a theoretical explanation for why convergence alone does not ensure model utility.
- We identify and geometrically formalise the "self-consistency" phenomenon inherent in shared-parameter architectures. Our analysis demonstrates that coupled gradient updates suppress adversarial pressure and hinder the exploration of worst-case prompts.
- We propose Anchored Bipolicy Self-Play (ABS), a parameter-efficient framework that employs role-specific LoRA adapters to maintain architectural separation. This approach restores adversarial integrity by enabling independent evolution of the attacker and defender, while also substantially reducing the computational cost associated with self-play training.

The paper is organised as follows: we first review the relevant literature on reinforcement learning for safety, self-play, and parameter-efficient training. We then show that the analysis of safety via Nash equilibria is incomplete, and that the current formulation of Self-RedTeam reduces to self-consistency. Finally, we introduce our technique (ABS) and run experiments to assess its performance and scalability. All the results are available [here](https://github.com/EmanueleLM/AnchoredBipolicySelf-Play).

## 2 Related Work

Safety, Jailbreaking, and Red-Teaming. Standard red team training relies on human feedback to identify harmful responses and reinforce model behaviour accordingly Bai et al. ([2022](https://arxiv.org/html/2605.08427#bib.bib4)); Dai et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib5)). Owing to the limited coverage of this approach, models are often encouraged to explore novel attack prompts Perez et al. ([2022](https://arxiv.org/html/2605.08427#bib.bib6)). For instance, Mehrotra et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib7)) propose a pruning strategy to pre-select attacks with higher jailbreak potential. Hong et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib8)) instead optimise jointly for reward and novelty, incentivising the generation of both successful and previously unseen prompts in a curiosity-driven setting. Other lines of work focus on co-evolving attacker and defender policies to avoid static adversaries Jain et al. ([2023](https://arxiv.org/html/2605.08427#bib.bib9)). A central challenge in this setting is computational cost Howe et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib10)). Xhonneux et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib11)) address this by operating in the continuous embedding space, reducing the expense associated with discrete token-level attacks. Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)) introduce an online self-play framework in which a single model instantiates both attacker and defender, enabling mutual adaptation. More recently, Wen et al. ([2026](https://arxiv.org/html/2605.08427#bib.bib12)) propose MAGIC, a decoupled co-evolving formulation designed to mitigate optimisation conflicts through full parameter updates. Tan et al. ([2026](https://arxiv.org/html/2605.08427#bib.bib53)) extend this line of work with a tri-role framework that incorporates an additional evaluator to stabilise training via multi-agent interaction.

Parameter-Efficient Fine-Tuning Techniques. Parameter-Efficient Fine-Tuning (PEFT) techniques have demonstrated effectiveness in model training while offering significant computational efficiency compared to full model fine-tuning. Among these methods, LoRA Hu et al. ([2022](https://arxiv.org/html/2605.08427#bib.bib15)) applies a low-rank update to the weights while keeping the base model parameters fixed. For instance, in transformer architectures that underpin modern large language models (LLMs), LoRA modifies the attention weights while leaving the multi-layer perceptron weights unchanged. Further variations of LoRA work on either reducing the computational learning costs, such as QLoRA Dettmers et al. ([2023](https://arxiv.org/html/2605.08427#bib.bib16)), or dynamically modifying the dimension of the weight matrices (rank) to improve critical learning momenta, like AdaLoRA Zhang et al. ([2023a](https://arxiv.org/html/2605.08427#bib.bib17)). SaLoRA Li et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib18)) instead proposes a method to improve safety alignment of LLMs. Finally, in line with this paper, Red-Bandit implements a LoRA-based post-training to reduce LLMs' vulnerabilities to attacks Ziakas et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib19)).

Summary and Research Gap. Self-RedTeam Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)) serves as the primary benchmark as it established the foundational setting for online symmetric self-play in LLM safety, where a single model alternates between attacker and defender roles. Recent literature has shifted toward more complex co-evolutionary games, such as MAGIC Wen et al. ([2026](https://arxiv.org/html/2605.08427#bib.bib12)), which proposes role decoupling through full-parameter updates, and TriPlay-RL Tan et al. ([2026](https://arxiv.org/html/2605.08427#bib.bib53)), which introduces an auxiliary "Evaluator" role to stabilise co-evolution. Unlike our work, they do not provide a theoretical treatment of the gradient collision and self-consistency bias inherent in shared-parameter architectures. ABS fills this gap by introducing a parameter-efficient method that restores adversarial integrity through architectural separation.

## 3 Nash Equilibria, Model Safety, and Self-Play

Self-RedTeam by Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)) proposes model safety via self-play, wherein a single base model jointly instantiates an attacker $A$ (issuing jailbreak prompts) and a defender $D$ (responding). Training proceeds online in a sequential zero-sum game, where a Nash equilibrium corresponds to safety (see §[3.1](https://arxiv.org/html/2605.08427#S3.SS1)). Despite growing uptake Wen et al. ([2026](https://arxiv.org/html/2605.08427#bib.bib12)) and extensions to general-sum settings Liao et al. ([2024](https://arxiv.org/html/2605.08427#bib.bib2)), we show that (i) the Self-RedTeam formulation admits indistinguishable equilibria spanning trivial always-refusal and oracle defenders (see §[3.2](https://arxiv.org/html/2605.08427#S3.SS2)); and (ii) the objective is more naturally cast as single-model self-consistency, with implications for training dynamics and equilibria (see §[3.3](https://arxiv.org/html/2605.08427#S3.SS3)).

### 3.1 Nash Equilibria in Self-RedTeam

Let $\mathcal{Y}_A$ be the set of attacker prompts and $\mathcal{Y}_D$ be the set of defender responses. For each player $P \in \{A, D\}$, we define the probability simplex over $\mathcal{Y}_P$ as $\Delta(\mathcal{Y}_P)$. In the two-player game formulation of red-teaming, the attacker first proposes a prompt $y_A$ sampled from a mixed strategy $\pi_A \in \Delta(\mathcal{Y}_A)$, referred to as a policy. Given the prompt $y_A$, the defender then generates a response $y_D$ sampled from a conditional strategy $\pi_D(\cdot \mid y_A)$ with $\pi_D : \mathcal{Y}_A \rightarrow \Delta(\mathcal{Y}_D)$.¹ Afterwards, a reward model $r : \mathcal{Y}_A \times \mathcal{Y}_D \to [-1, 1]$ evaluates each prompt-response pair, i.e., for any such pair $(y_A, y_D)$, the defender's reward is $r(y_A, y_D)$, and the attacker receives the opposite payoff $-r(y_A, y_D)$.

¹ Without loss of generality, we omit that in Self-RedTeam, the attack is initially conditioned on a *seed*, while the attacker's and defender's policies are further conditioned on the instructions to generate the attack and the defence, respectively.

For an attacker and a defender that aim to maximise their rewards, this setting corresponds to a zero-sum sequential game where (i) unsafe responses are strictly penalised, i.e., $r(y_A, y_D) = -1$ if $(y_A, y_D)$ is unsafe; (ii) safe responses are rewarded non-negatively, i.e., $r(y_A, y_D) \geq 0$ if $(y_A, y_D)$ is safe; and (iii) there exists a refusal response $y^{\mathrm{ref}} \in \mathcal{Y}_D$ such that $r(y_A, y^{\mathrm{ref}}) = 0$ for all $y_A \in \mathcal{Y}_A$.

Under these assumptions, the expected reward is $V(\pi_A, \pi_D) = \mathbb{E}_{y_A \sim \pi_A,\; y_D \sim \pi_D(\cdot \mid y_A)}\big[r(y_A, y_D)\big]$ and the game admits the following min-max representation:

$$\min_{\pi_A} \max_{\pi_D} V(\pi_A, \pi_D).$$

A Nash equilibrium $(\pi_A^{\ast}, \pi_D^{\ast})$ is thus a pair of strategies such that neither player has an incentive to deviate, i.e., $V(\pi_A^{\ast}, \pi_D^{\ast}) \geq V(\pi_A^{\ast}, \pi_D)$ for all $\pi_D$, and $V(\pi_A^{\ast}, \pi_D^{\ast}) \leq V(\pi_A, \pi_D^{\ast})$ for all $\pi_A$. When the two players' policies converge to a Nash equilibrium $(\pi_A^{\ast}, \pi_D^{\ast})$, Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)) proved that the defender guarantees a safe response to any attacker prompt, i.e., $r(y_A, \pi_D^{\ast}(y_A)) \geq 0$ for any $y_A$.
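To make these objects concrete, the following is a minimal sketch of the expected reward $V$ and the min-max value on a toy game. The two prompts, the two responses, and the reward table `R` are illustrative assumptions (they do not come from the paper) and are chosen only so that assumptions (i)-(iii) hold.

```python
# Toy reward table (hypothetical values): outer keys are attacker prompts,
# inner keys are defender responses. "refuse" plays the role of y_ref.
R = {
    "malicious": {"refuse": 0.0, "comply": -1.0},  # complying is unsafe: -1
    "benign":    {"refuse": 0.0, "comply":  1.0},  # complying is safe and helpful
}

def expected_reward(pi_a, pi_d):
    """V(pi_A, pi_D) = E_{y_A ~ pi_A, y_D ~ pi_D(.|y_A)}[r(y_A, y_D)]."""
    return sum(
        pi_a[y_a] * pi_d[y_a][y_d] * R[y_a][y_d]
        for y_a in pi_a
        for y_d in pi_d[y_a]
    )

def minmax_value():
    """min over prompts of the best reward the defender can reach on that prompt.

    The defender responds after seeing the prompt and V is linear in each
    mixed strategy, so searching pure prompts and pure responses is enough.
    """
    return min(max(R[y_a].values()) for y_a in R)

uniform_attacker = {"malicious": 0.5, "benign": 0.5}
always_comply = {y_a: {"refuse": 0.0, "comply": 1.0} for y_a in R}

v = expected_reward(uniform_attacker, always_comply)  # 0.5 * (-1) + 0.5 * (+1)
game_value = minmax_value()
```

In this toy table the attacker can always hold the defender to a payoff of zero by concentrating on the malicious prompt, which is the value the extremal equilibria of the next subsection attain.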

We show that, under the above assumptions, extremal Nash equilibria attaining the minimum or maximum defender expected payoff are problematic in both theory and practice \(see §[3\.2](https://arxiv.org/html/2605.08427#S3.SS2)\)\. On the one hand, equilibria that are qualitatively distinct yet indistinguishable to the learning dynamics offer limited practical guidance\. On the other hand, when a single model instantiates both attacker and defender, the dynamics can induce implicit coordination or collusion between roles \(see §[3\.3](https://arxiv.org/html/2605.08427#S3.SS3)\)\.

### 3.2 Analysis of Extremal Nash Equilibria

In this section, we derive a “lower bound” extremal that corresponds to the case where the defender always refuses, guaranteeing safety but compromising usefulness\. Complementarily, we characterise the “upper bound” extremal that corresponds to an oracle\-like defender that always provides safe and maximally helpful responses\.

##### Bound 1: An always\-refuse equilibrium\.

Recall our assumption that the defender can always issue a refusal response $y^{\mathrm{ref}} \in \mathcal{Y}_D$ such that $r(y_A, y^{\mathrm{ref}}) = 0$ for all $y_A \in \mathcal{Y}_A$. Thus, for any prompt $y_A$, we define the always-refuse defender strategy as $\pi_D^{\mathrm{ref}}(y_D \mid y_A) = \mathbf{1}\{y_D = y^{\mathrm{ref}}\}$, which yields an expected reward $V(\pi_A, \pi_D^{\mathrm{ref}}) = 0$ for any attacker policy $\pi_A$.

A pair $(\pi_A^{\ast}, \pi_D^{\mathrm{ref}})$ is a Nash equilibrium if the attacker policy $\pi_A^{\ast}$ is sufficiently adversarial such that no defender policy achieves a strictly positive expected payoff, i.e., $\max_{\pi_D} V(\pi_A^{\ast}, \pi_D) = 0$. We therefore focus on the attacker policies in a local neighbourhood of a reference adversarial policy $\bar{\pi}_A$, which concentrates on prompts where any non-refusal response yields no positive reward, and where nearby policies preserve this property (see Appendix [A](https://arxiv.org/html/2605.08427#A1)). This reflects the intuition that attacks may adapt while remaining close to a reference distribution, ensuring semantic consistency.

Within this neighbourhood, we prove that deviations in the attacker policy only induce controlled changes in the defender's payoff (Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1)). We thus focus on attacker policies $\pi_A^{\ast}$ that preserve the adversarial property of $\bar{\pi}_A$ locally, in the sense that they do not admit any defender strategy with strictly positive expected payoff, i.e., we require $\mathbb{E}_{y_A \sim \pi_A^{\ast},\, y_D \sim \pi_D^{\prime}}[r(y_A, y_D)] \leq 0$ for any alternative defender strategy $\pi_D^{\prime}$. This condition can be interpreted as a *worst-case* constraint on the attacker: within the local neighbourhood, any attempt by the defender to provide a helpful response yields no positive gain and may incur risk, so the defender has no incentive to deviate from $y^{\mathrm{ref}}$. Specifically, providing a helpful response risks a payoff of $-1$ that is not offset by safe, high-reward completions, making the expected value of answering non-positive. Hence, $\pi_D^{\mathrm{ref}}$ is a best response to $\pi_A^{\ast}$. Further, as the attacker cannot reduce the payoff below $0$ against a refusing defender, $\pi_A^{\ast}$ is a best response.

##### Bound 2: The defender acts like an oracle\.

Let $\mathcal{M} \subseteq \mathcal{Y}_A$ be the set of malicious prompts, for which the maximum achievable safe reward is zero. That is, $\max_{y_D} r(y_A, y_D) = 0$ for any $y_A \in \mathcal{M}$ and $\max_{y_D} r(y_A, y_D) > 0$ for any $y_A \notin \mathcal{M}$. Assume there exists an oracle defender policy $\pi_D^{\mathrm{rec}}$ that is pointwise optimal: for every $y_A \in \mathcal{Y}_A$, it picks a response $y_D$ to maximise $r(y_A, y_D)$. Namely, for all $y_A \in \mathcal{M}$, $\pi_D^{\mathrm{rec}}$ ensures safety by generating a response $y_D^{\mathrm{safe}}$ with $r(y_A, y_D^{\mathrm{safe}}) = 0$, and for $y_A \notin \mathcal{M}$, it provides a maximally helpful response $y_D$ with $r(y_A, y_D) > 0$.

Therefore, a pair $(\pi_A^{\ast}, \pi_D^{\mathrm{rec}})$ is a Nash equilibrium if the attacker's policy $\pi_A^{\ast}$ minimises the expected reward against the oracle. Formally, $\pi_A^{\ast}$ must satisfy:

$$\pi_A^{\ast} \in \arg\min_{\pi_A} \mathbb{E}_{y_A \sim \pi_A}\left[\max_{\pi_D} \mathbb{E}_{y_D \sim \pi_D(\cdot \mid y_A)}\left[r(y_A, y_D)\right]\right] = \arg\min_{\pi_A} \mathbb{E}_{y_A \sim \pi_A}\left[\max_{y_D} r(y_A, y_D)\right],$$

where the last equality follows from $\max_{\pi_D} \mathbb{E}_{y_D \sim \pi_D(\cdot \mid y_A)}\left[r(y_A, y_D)\right] = \max_{y_D} r(y_A, y_D)$, since the expected reward is linear in the mixed strategy $\pi_D(\cdot \mid y_A) \in \Delta(\mathcal{Y}_D)$. Thus, the maximum over the simplex $\Delta(\mathcal{Y}_D)$ is attained at a vertex, meaning that the optimal *mixed* strategy simply places all probability mass on the best *pure* strategy $y_D$ maximising $r(y_A, y_D)$.
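The vertex argument can be checked numerically. The sketch below samples random mixed strategies for one fixed prompt and compares their expected reward with the best pure response; the reward vector is a hypothetical example, not data from the paper.

```python
import random

def mixed_value(rewards, probs):
    """Expected reward of a mixed strategy: linear in the probabilities."""
    return sum(p * r for p, r in zip(probs, rewards))

def random_simplex_point(n, rng):
    """Draw a random probability vector of length n."""
    raw = [rng.random() + 1e-12 for _ in range(n)]  # offset keeps the sum positive
    total = sum(raw)
    return [x / total for x in raw]

rng = random.Random(0)
rewards = [0.0, -1.0, 0.4, 0.7]  # hypothetical r(y_A, .) for a single prompt y_A

best_pure = max(rewards)
best_mixed_seen = max(
    mixed_value(rewards, random_simplex_point(len(rewards), rng))
    for _ in range(10_000)
)

# One-hot strategy on the best pure response: a vertex of the simplex.
vertex = [1.0 if r == best_pure else 0.0 for r in rewards]
vertex_value = mixed_value(rewards, vertex)
```

No sampled mixture exceeds `best_pure`, and the one-hot strategy attains it, which is the reduction from $\max_{\pi_D}$ to $\max_{y_D}$ used in the equation.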

As $\max_{y_D} r(y_A, y_D) = 0$ for all $y_A \in \mathcal{M}$, and $\max_{y_D} r(y_A, y_D) > 0$ for any $y_A \notin \mathcal{M}$, it follows that any equilibrium attacker policy $\pi_A^{\ast}$ must be supported on $\mathcal{M}$, i.e., $\operatorname{supp}(\pi_A^{\ast}) \subseteq \mathcal{M}$, where $\operatorname{supp}(\pi_A^{\ast}) := \{y_A \in \mathcal{Y}_A : \pi_A^{\ast}(y_A) > 0\}$ denotes the support of $\pi_A^{\ast}$. At such an equilibrium, the attacker cannot further reduce the payoff, as safety is guaranteed, while the defender cannot improve its payoff on malicious inputs without violating the safety constraint and incurring a $-1$ penalty.

Nash equilibria and Safety Guarantees. The extremal Nash equilibria with respect to the defender's payoff are:

- A degenerate *always-refuse* equilibrium (safe but unhelpful);
- An *oracle* equilibrium (safe and useful), when such a policy exists.

Conclusion: The theoretical framework admits safety-preserving equilibria; however, such equilibria range from a trivial *always-refuse* solution to an *oracle* defender.
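As an illustration of the two extremal equilibria, the sketch below checks the Nash conditions on a toy game with one malicious and one benign prompt. The reward table and all policy names are hypothetical and serve only to instantiate assumptions (i)-(iii); this is not the paper's experimental setup.

```python
# Hypothetical toy game: "refuse" is y_ref; only the "malicious" prompt is in M.
R = {
    "malicious": {"refuse": 0.0, "comply": -1.0},
    "benign":    {"refuse": 0.0, "comply":  1.0},
}

def value(pi_a, pi_d):
    """Expected defender reward V(pi_A, pi_D)."""
    return sum(
        pi_a[a] * pi_d[a][d] * R[a][d] for a in pi_a for d in pi_d[a]
    )

def defender_best_response_value(pi_a):
    """max over pi_D of V(pi_a, pi_D): pick the best response per prompt."""
    return sum(p * max(R[a].values()) for a, p in pi_a.items())

def attacker_best_response_value(pi_d):
    """min over pi_A of V(pi_A, pi_d): pick the prompt with the lowest reward."""
    return min(sum(pi_d[a][d] * R[a][d] for d in R[a]) for a in R)

def is_nash(pi_a, pi_d, tol=1e-9):
    """Neither player can gain by deviating unilaterally."""
    v = value(pi_a, pi_d)
    return (
        abs(v - defender_best_response_value(pi_a)) <= tol
        and abs(v - attacker_best_response_value(pi_d)) <= tol
    )

adversarial_attacker = {"malicious": 1.0, "benign": 0.0}  # supported on M
uniform_attacker = {"malicious": 0.5, "benign": 0.5}
always_refuse = {a: {"refuse": 1.0, "comply": 0.0} for a in R}
oracle = {
    "malicious": {"refuse": 1.0, "comply": 0.0},
    "benign":    {"refuse": 0.0, "comply": 1.0},
}

refuse_is_nash = is_nash(adversarial_attacker, always_refuse)
oracle_is_nash = is_nash(adversarial_attacker, oracle)
uniform_vs_refuse = is_nash(uniform_attacker, always_refuse)
```

In this toy game both pairs satisfy the Nash conditions with the same value of zero, so the payoff alone cannot tell the always-refuse defender from the oracle. Against the uniform attacker, always refusing is no longer a best response, because the defender would gain by answering the benign prompt.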

### 3.3 Min-Max Constraint Conflict and Self-Consistency

Assume now that the attacker and defender share parameters, i.e., there exists a single policy $\pi_\theta$ such that $\pi_A = \pi_D = \pi_\theta$. Under this constraint, the optimisation problem involves a single parameter vector $\theta$ balancing two conflicting objectives:

1. Defender Role: Maximise $V$ (generate safe responses).
2. Attacker Role: Minimise $V$ (generate successful jailbreaks).

Since the parameters are shared, the attacker cannot adapt independently without simultaneously altering the defender. In the zero-sum red-teaming game, the attacker's and defender's rewards for any prompt-response pair $(y_A, y_D)$ are $R_A(y_A, y_D) := -r(y_A, y_D)$ and $R_D(y_A, y_D) := r(y_A, y_D)$, respectively. We denote the defender's and attacker's objectives as follows (respectively):

$$J_D(\theta) \coloneqq \mathbb{E}_{y_A \sim \pi_\theta,\; y_D \sim \pi_\theta(\cdot \mid y_A)}[r(y_A, y_D)], \quad J_A(\theta) \coloneqq \mathbb{E}_{y_A \sim \pi_\theta,\; y_D \sim \pi_\theta(\cdot \mid y_A)}[-r(y_A, y_D)]$$

The resulting policy-gradient update is given by the sum of the defender and attacker gradients:

$$\nabla_\theta J_D(\theta) + \nabla_\theta J_A(\theta) = \mathbb{E}_{y_A, y_D}\Big[r(y_A, y_D)\big(\underbrace{\nabla_\theta \log \pi_\theta(y_D \mid y_A)}_{\text{Force 1: Improve Defence}} - \underbrace{\nabla_\theta \log \pi_\theta(y_A)}_{\text{Force 2: Improve Attacks}}\big)\Big]. \tag{1}$$

To make the coupling explicit, let $\eta > 0$ denote the learning rate and consider a defender-only step, i.e., $\theta^{+} = \theta + \eta\, \nabla_\theta J_D(\theta)$. Then, a first-order Taylor expansion of the attacker objective yields

$$J_A(\theta^{+}) = J_A(\theta) + \eta \left\langle \nabla_\theta J_A(\theta),\, \nabla_\theta J_D(\theta) \right\rangle + O(\eta^{2}). \tag{2}$$
Thus, at first order, the effect of a defender update is governed by the inner product between the attacker and defender gradients\. When these directions are aligned, improving the defender simultaneously increases the attacker’s objective, and vice versa\. This coupling highlights how shared\-parameter training can induce self\-consistency instead of fully robust adversarial behaviour\.
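The first-order relation in Eq. (2) can be checked numerically on a small example. The sketch below assumes a tabular softmax policy whose single parameter vector holds both the prompt logits and the response logits, a hypothetical reward table, and finite-difference gradients; none of these choices come from the paper. In this toy the two objectives are exact negatives of each other, so the inner product equals $-\lVert \nabla_\theta J_D \rVert^2$ and a defender step lowers the attacker's objective at first order.

```python
import math

# Hypothetical rewards: R[prompt][response], response 0 = refuse, 1 = comply.
R = [[0.0, -1.0],   # prompt 0 (malicious)
     [0.0,  1.0]]   # prompt 1 (benign)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def j_defender(theta):
    """J_D for theta = [prompt logits (2), response logits per prompt (2 + 2)]."""
    p_prompt = softmax(theta[0:2])
    total = 0.0
    for a in range(2):
        p_resp = softmax(theta[2 + 2 * a: 4 + 2 * a])
        total += p_prompt[a] * sum(p_resp[d] * R[a][d] for d in range(2))
    return total

def j_attacker(theta):
    """Zero-sum game: the attacker objective is the negated defender objective."""
    return -j_defender(theta)

def grad(f, theta, h=1e-6):
    """Central finite-difference gradient (a stand-in for policy gradients)."""
    g = []
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += h
        dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

theta = [0.3, -0.2, 0.1, 0.4, -0.5, 0.2]  # arbitrary starting point
eta = 1e-2                                 # learning rate

g_d = grad(j_defender, theta)
g_a = grad(j_attacker, theta)

theta_plus = [t + eta * g for t, g in zip(theta, g_d)]   # defender-only step
actual = j_attacker(theta_plus) - j_attacker(theta)      # true change in J_A
predicted = eta * sum(x * y for x, y in zip(g_a, g_d))   # first-order term of Eq. (2)
```

The gap between `actual` and `predicted` is of order $\eta^2$, consistent with the remainder term in Eq. (2).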

Geometrically, we can observe this by analysing the signs of the rewards in Eq\. \([1](https://arxiv.org/html/2605.08427#S3.E1)\):

- Case 1: Resistance ($r(y_A, y_D) > 0$): When the defender resists an attack, Force 1 increases the likelihood of the safe response $y_D$ while Force 2 decreases the likelihood of the attack prompt $y_A$. This drives $\pi_\theta$ to avoid prompts it already defends against, leading to self-consistency rather than the discovery of fundamentally stronger attacks.
- Case 2: Success ($r(y_A, y_D) < 0$): Conversely, if a jailbreak succeeds, Force 2 attempts to reinforce the successful attack $y_A$. Simultaneously, the heavy penalty on the first term (Force 1) tries to update $\theta$ to suppress the vulnerability that permitted the jailbreak.

Because these updates are applied simultaneously to the shared parameter vectorθ\\theta, the gradients effectively collide\. Rather than sharpening the attacker and strengthening the defender, the model converges to a stagnant middle ground\. This prevents exploration of the adversarial frontier, yielding a “muddy” representational space in which both roles are compromised\.

Self-consistency and Safety Guarantees. In summary:

- Self-play Nash equilibria can reduce to *self-consistency* equilibria.
- As an optimisation process, Self-RedTeam Liu et al. ([2025](https://arxiv.org/html/2605.08427#bib.bib1)) induces coupled gradients, where improving the defender can weaken the attacker.

Conclusion: A well-posed adversarial saddle-point formulation requires distinct policies.

## 4 Anchored Bipolicy Self-Play

We now formalise the Anchored Bipolicy Self-Play (ABS) policy for the attacker and the defender: we rely on LoRA modules to ensure attacker-defender separation so that the adversarial pressure on the defender remains unaltered, while maintaining the objective function defined in Self-RedTeam.

### 4.1 Role-conditioned LoRA Policies

Let $\pi_\theta$ denote a single policy network with a shared frozen backbone and role-specific LoRA adapters. The effective parameters are given by $\Theta = (\theta, \phi_A, \phi_D)$, where $\theta$ denotes the frozen backbone parameters, and $\phi_A, \phi_D$ are the LoRA parameters for the attacker and defender, respectively. For a Transformer-based LLM Vaswani et al. ([2017](https://arxiv.org/html/2605.08427#bib.bib32)), and a subset of backbone parameters $\theta^{l} \subseteq \theta$ (e.g., the query and value projections of self-attention), a LoRA module parameterised by $\phi^{l} \subseteq \phi$ defines a low-rank update added to $\theta^{l}$ and trained while keeping $\theta^{l}$ frozen. Then, actions for the attacker and the defender are sampled as $y_A \sim \pi_{\theta, \phi_A}(\cdot)$ and $y_D \sim \pi_{\theta, \phi_D}(\cdot \mid y_A)$.
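The role separation can be sketched for a single weight matrix. The code below is a minimal pure-Python illustration, assuming the usual LoRA parameterisation $W + (\alpha / r)\, B A$ with $B$ initialised to zero; the dimensions, the scaling, and the stand-in "defender update" are hypothetical and are not the paper's training configuration.

```python
import random

def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def scale(c, x):
    return [[c * a for a in row] for row in x]

def new_adapter(d_out, d_in, r, rng):
    """Role-specific adapter phi = (A, B): A is random, B is zero at the start."""
    return {
        "A": [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)],
        "B": [[0.0] * r for _ in range(d_out)],
    }

def effective_weight(w_frozen, adapter, alpha=1.0):
    """W + (alpha / r) * B @ A, with the backbone W left untouched."""
    r = len(adapter["A"])
    return add(w_frozen, scale(alpha / r, matmul(adapter["B"], adapter["A"])))

rng = random.Random(0)
d, r = 8, 2                                                       # toy width and rank
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]   # frozen theta^l
phi_A = new_adapter(d, d, r, rng)                                 # attacker adapter
phi_D = new_adapter(d, d, r, rng)                                 # defender adapter

attacker_before = effective_weight(W, phi_A)
# Stand-in for a defender gradient step: only phi_D changes.
phi_D["B"] = [[0.1] * r for _ in range(d)]
attacker_after = effective_weight(W, phi_A)
defender_after = effective_weight(W, phi_D)

full_params = d * d          # parameters touched by full fine-tuning of this matrix
adapter_params = 2 * d * r   # parameters in one rank-r adapter
```

Because the defender step only touches `phi_D`, the attacker's effective weights are identical before and after, while each adapter holds `2 * d * r` trainable values instead of `d * d`.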

ABS adapts the training method in Liu et al\. \([2025](https://arxiv.org/html/2605.08427#bib.bib1)\), which leverages Re\+\+ Hu et al\. \([2025](https://arxiv.org/html/2605.08427#bib.bib31)\), a lightweight reinforcement learning algorithm extending PPO and REINFORCE Williams \([1992](https://arxiv.org/html/2605.08427#bib.bib30)\); Schulman et al\. \([2017](https://arxiv.org/html/2605.08427#bib.bib33)\)\. It uses token\-level KL regularisation to penalise deviations from a reference policy, along with different learning\-rate warm\-up procedures for the attacker and defender\. Implementation details are provided in the next section and Appendix [C\.3](https://arxiv.org/html/2605.08427#A3.SS3)\. The attacker and defender are then trained using separate objectives:

$$
\begin{aligned}
J_{A}(\phi_{A}\mid\phi_{D}) &= \mathbb{E}_{y_{A}\sim\pi_{\theta,\phi_{A}},\,y_{D}\sim\pi_{\theta,\phi_{D}}(\cdot\mid y_{A})}\big[R_{A}(y_{A},y_{D})\big],\\
J_{D}(\phi_{D}\mid\phi_{A}) &= \mathbb{E}_{y_{A}\sim\pi_{\theta,\phi_{A}},\,y_{D}\sim\pi_{\theta,\phi_{D}}(\cdot\mid y_{A})}\big[R_{D}(y_{A},y_{D})\big].
\end{aligned}
\tag{3}
$$

The opposing player’s parameters are fixed when computing gradients, with policy\-gradient updates:

$$
\begin{aligned}
\nabla_{\phi_{A}}J_{A} &= \mathbb{E}\!\left[\nabla_{\phi_{A}}\log\pi_{\theta,\phi_{A}}(y_{A})\,R_{A}(y_{A},y_{D})\right],\\
\nabla_{\phi_{D}}J_{D} &= \mathbb{E}\!\left[\nabla_{\phi_{D}}\log\pi_{\theta,\phi_{D}}(y_{D}\mid y_{A})\,R_{D}(y_{A},y_{D})\right].
\end{aligned}
\tag{4}
$$

These updates evolve separately and restrict optimisation to low\-rank subspaces over a shared frozen backbone\. This parameter separation restores the distinction between attacker and defender updates, eliminating the parameter\-level interference observed in Eq\. \([2](https://arxiv.org/html/2605.08427#S3.E2)\)\. To formalise this, consider a defender\-only update step $\phi_{D}^{+}=\phi_{D}+\eta\,\nabla_{\phi_{D}}J_{D}(\phi_{A},\phi_{D})$, where $\eta>0$ is the learning rate\. A first\-order Taylor expansion of the defender’s objective yields

$$
J_{D}(\phi_{A},\phi_{D}^{+})=J_{D}(\phi_{A},\phi_{D})+\eta\,\|\nabla_{\phi_{D}}J_{D}(\phi_{A},\phi_{D})\|^{2}+O(\eta^{2}).
\tag{5}
$$
Importantly, the dependence of $J_{D}$ on the attacker is purely distributional, entering only through the sampled attacks $y_{A}$: the defender parameters update independently of the attacker parameters, i\.e\., $\frac{\partial\phi_{D}}{\partial\phi_{A}}=0$\. We formalise this in full in Appendix [B](https://arxiv.org/html/2605.08427#A2)\.
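The defender\-only step and its first\-order gain can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' training loop: the policies are small softmax tables whose logits stand in for the LoRA parameters $\phi_{A}$ and $\phi_{D}$, the reward table `reward_d` is hypothetical, and expectations are computed by exact enumeration rather than by sampled Re\+\+ updates with KL regularisation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def defender_objective(phi_a, phi_d, reward_d):
    """J_D = E_{y_A ~ pi_A, y_D ~ pi_D(.|y_A)}[R_D(y_A, y_D)], by exact enumeration."""
    pi_a = softmax(phi_a)
    return sum(
        pa * sum(pd * reward_d[a][d] for d, pd in enumerate(softmax(phi_d[a])))
        for a, pa in enumerate(pi_a)
    )

def defender_gradient(phi_a, phi_d, reward_d):
    """Exact E[grad log pi_D(y_D | y_A) * R_D], taken w.r.t. phi_D only."""
    pi_a = softmax(phi_a)
    grad = []
    for a, pa in enumerate(pi_a):
        pi_d = softmax(phi_d[a])
        expected = sum(pd * reward_d[a][d] for d, pd in enumerate(pi_d))
        # For softmax logits the score-function gradient reduces to pi_d * (R - E[R]).
        grad.append([pa * pd * (reward_d[a][d] - expected) for d, pd in enumerate(pi_d)])
    return grad

def defender_step(phi_d, grad, eta):
    """Gradient ascent on phi_D; phi_A is not an argument and cannot change."""
    return [[z + eta * g for z, g in zip(row, g_row)] for row, g_row in zip(phi_d, grad)]

phi_a = [0.3, -0.2]                     # attacker logits, held fixed here
phi_d = [[0.0, 0.0], [0.5, -0.5]]       # defender logits, one row per attack y_A
reward_d = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical R_D(y_A, y_D) in [-1, 1]
eta = 0.1

phi_a_before = list(phi_a)
grad = defender_gradient(phi_a, phi_d, reward_d)
grad_sq_norm = sum(g * g for row in grad for g in row)
new_phi_d = defender_step(phi_d, grad, eta)
gain = (defender_objective(phi_a, new_phi_d, reward_d)
        - defender_objective(phi_a, phi_d, reward_d))
print(gain, eta * grad_sq_norm)
```

The printed gain should be close to $\eta\|\nabla_{\phi_{D}}J_{D}\|^{2}$, with the gap shrinking as $O(\eta^{2})$, while `phi_a` is left untouched by the step.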

## 5 Experiments & Results

To compare ABS to the fully fine\-tuned Self\-RedTeam models by Liu et al\. \([2025](https://arxiv.org/html/2605.08427#bib.bib1)\), we adapted their code to train two independent LoRA modules, one for the attacker and one for the defender\. In their setting, the attacker is prompted with a seed from the training partition of WildJailBreak Jiang et al\. \([2024](https://arxiv.org/html/2605.08427#bib.bib45)\) \(see their Appendix C\.4\.1\), with a 50:50 ratio of harmful and benign prompts\. The prompt instructs the attacker and defender to first think, wrapped by <think\></think\> tokens, and then provide the answer between <answer\></answer\> tags\. The defender produces its response in a similar format; we conducted the evaluation on the pairs of attacks and defences generated at each iteration\. As in Self\-RedTeam, we use WildGuard Han et al\. \([2024](https://arxiv.org/html/2605.08427#bib.bib46)\) as the reward model, which classifies the harmfulness of the prompt and of the response, and whether the response is a refusal to answer the attack\.

We trained Qwen2\.5\-\{3B, 7B, 14B\}\-IT models Qwen et al\. \([2025](https://arxiv.org/html/2605.08427#bib.bib47)\) with different LoRA settings, varying the rank and alpha values\. The rank defines the hidden dimension of the LoRA representation and thus the number of trainable parameters, while alpha scales the influence of the adapter w\.r\.t\. the base model\. Table [2](https://arxiv.org/html/2605.08427#S5.T2) and Appendix [C\.3](https://arxiv.org/html/2605.08427#A3.SS3) report more details of the setting\. All training was conducted on three NVIDIA H200 SXM GPUs with 141GB of VRAM\.
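The roles of rank and alpha described above can be made concrete with a short calculation. This is a sketch of the standard LoRA bookkeeping: for a weight of shape `d_out x d_in`, the adapter adds `rank * (d_in + d_out)` trainable parameters and its update is scaled by `alpha / rank`. The 4096\-wide layer below is a hypothetical example and does not correspond to any specific Qwen2\.5 projection.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scale(alpha, rank):
    """Multiplier applied to the low-rank update B @ A in the standard formulation."""
    return alpha / rank

def added_fraction(d_in, d_out, rank):
    """Adapter size relative to the frozen matrix it is attached to."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)

# Hypothetical square projection, chosen only to make the arithmetic easy to follow.
d = 4096
for rank in (16, 32):
    print(rank, lora_param_count(d, d, rank), added_fraction(d, d, rank))
print(lora_scale(alpha=32, rank=16))
```

Doubling the rank doubles the adapter size, whereas changing alpha leaves the parameter count unchanged and only rescales the adapter's contribution.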

For the safety evaluation, we employed eight tests in the Ai2 Safety Toolkit Jiang et al\. \([2024](https://arxiv.org/html/2605.08427#bib.bib45)\): six tests concern harmful refusal, i\.e\., whether a model refuses malicious prompts; two benign tests benchmark the ability of a model to comply with non\-harmful instructions\. The test suite measures the Attack Success Rate \(ASR\) and Robustness to Attacks \(RTA\)\. We also benchmark ABS and the Self\-RedTeam Qwen by Liu et al\. \([2025](https://arxiv.org/html/2605.08427#bib.bib1)\) on five standard reasoning and math benchmarks \(GSM8K Cobbe et al\. \([2021](https://arxiv.org/html/2605.08427#bib.bib48)\), CommonsenseQA Talmor et al\. \([2019](https://arxiv.org/html/2605.08427#bib.bib49)\), BoolQ Clark et al\. \([2019](https://arxiv.org/html/2605.08427#bib.bib50)\), ARC Challenge Clark et al\. \([2018](https://arxiv.org/html/2605.08427#bib.bib51)\), and HellaSwag Zellers et al\. \([2019](https://arxiv.org/html/2605.08427#bib.bib52)\)\) to verify that they do not lose their base capabilities\. We ran our tests on a single NVIDIA H200 SXM GPU with 141GB of VRAM and report the accuracy \(pass@1\) of each model\.

Column groups: Harmful Refusal \(WG:Test through StrongREJECT\), Benign Compliance \(WJB vani, XSTest vani LC Winrate\), Reasoning/Math\.

| Method | WG:Test adv (ASR ↓) | WG:Test vani (ASR ↓) | WJB adv (ASR ↓) | DAN adv (ASR ↓) | HarmBench adv (ASR ↓) | XSTest vani (RTA ↑) | StrongREJECT adv (ASR ↑) | WJB vani (Comply ↑) | XSTest vani (LC Winrate ↑) | Reasoning/Math (Avg. Accuracy ↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-IT | 0.282 | 0.119 | 0.949 | 0.677 | 0.241 | 0.610 | 0.497 | 0.992 | 0.960 | 0.745 ± 0.042 |
| + Self-RedTeam | 0.234 | 0.056 | 0.885 | 0.603 | 0.156 | 0.655 | 0.559 | 1.000 | 0.932 | 0.671 ± 0.095 |
| + ABS | 0.122 | 0.019 | 0.721 | 0.320 | 0.088 | 0.680 | 0.669 | 0.984 | 0.952 | 0.738 ± 0.047 |
| Improv. Self-RedTeam (%) | +17.0 | +52.9 | +6.7 | +10.9 | +35.3 | +7.4 | +12.5 | +0.8 | -2.9 | -7.4 |
| Improv. ABS (%) | +56.7 | +84.0 | +24.0 | +52.7 | +63.5 | +11.5 | +34.6 | -0.8 | -0.8 | -0.7 |
| Qwen2.5-7B-IT | 0.279 | 0.040 | 0.897 | 0.613 | 0.234 | 0.915 | 0.470 | 0.992 | 0.912 | 0.850 ± 0.026 |
| + Self-RedTeam | 0.178 | 0.002 | 0.726 | 0.423 | 0.156 | 0.925 | 0.371 | 0.992 | 0.888 | 0.818 ± 0.037 |
| + ABS | 0.169 | 0.010 | 0.756 | 0.383 | 0.156 | 0.915 | 0.644 | 0.988 | 0.884 | 0.847 ± 0.024 |
| Improv. Self-RedTeam (%) | +36.18 | +94.19 | +19.12 | +30.98 | +33.36 | +1.09 | -21.24 | 0.0 | -2.63 | -3.2 |
| Improv. ABS (%) | +39.37 | +76.51 | +15.77 | +37.50 | +33.36 | 0.00 | +36.97 | -0.6 | -3.07 | -0.3 |
| Qwen2.5-14B-IT | 0.146 | 0.017 | 0.845 | 0.647 | 0.119 | 0.735 | 0.647 | 1.000 | 0.984 | 0.874 ± 0.030 |
| + Self-RedTeam | 0.116 | 0.005 | 0.661 | 0.467 | 0.056 | 0.775 | 0.701 | 0.996 | 0.968 | 0.878 ± 0.035 |
| + ABS | 0.131 | 0.022 | 0.706 | 0.446 | 0.093 | 0.770 | 0.748 | 1.000 | 0.968 | 0.874 ± 0.030 |
| Improv. Self-RedTeam (%) | +20.55 | +70.59 | +21.77 | +27.82 | +52.94 | +5.44 | +8.35 | -0.4 | -1.63 | +0.4 |
| Improv. ABS (%) | +10.27 | -29.41 | +16.44 | +31.07 | +21.85 | +4.76 | +15.61 | 0.0 | -1.63 | 0.0 |

Table 1: Performance of Qwen2\.5\-\{3B, 7B, 14B\}\-IT, Self\-RedTeam fine\-tuned, and our ABS models on six benchmarks for harmful refusal, two for benign compliance, and five \(average\) for instruction following and reasoning\. Improvement is calculated relative to the base model\. Blue indicates improvements \(lower ASR or higher RTA/Compliance\), red indicates regression\.

### 5\.1 ABS Safety and Efficiency Evaluation

As Table [1](https://arxiv.org/html/2605.08427#S5.T1) reports, our method outperforms the corresponding Qwen\-IT base model and, in most cases, the fully fine\-tuned Self\-RedTeam model by Liu et al\. \([2025](https://arxiv.org/html/2605.08427#bib.bib1)\)\. On standard reasoning benchmarks, our models do not degrade their performance, as shown in the last column of Table [1](https://arxiv.org/html/2605.08427#S5.T1) \(full table in Appendix, Table [9](https://arxiv.org/html/2605.08427#A3.T9)\), while the fine\-tuned Self\-RedTeam Qwen2\.5\-3B\-IT experiences notable drops, suggesting that fine\-tuning small models for safety may incur catastrophic forgetting\. In terms of efficiency, ABS introduces, for Qwen2\.5\-14B\-IT with rank 32 \(our largest module\), an additional 250M parameters \(2% of the total parameters\), and completes a full training run, on three H200 GPUs, in five hours\. The details of other implementations are reported in Table [2](https://arxiv.org/html/2605.08427#S5.T2)\.
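The improvement rows for the safety columns of Table 1 appear consistent with a signed relative change with respect to the base model. The helper below is a sketch of that calculation; the function name and the `lower_is_better` flag are hypothetical, and the three checks use values copied from the Qwen2\.5\-3B\-IT rows.

```python
def relative_improvement(base, new, lower_is_better):
    """Percentage change relative to the base model; positive means better."""
    change = (base - new) if lower_is_better else (new - base)
    return 100.0 * change / base

# WG:Test adv (ASR, lower is better): base 0.282, Self-RedTeam 0.234, ABS 0.122.
print(round(relative_improvement(0.282, 0.234, lower_is_better=True), 1))
print(round(relative_improvement(0.282, 0.122, lower_is_better=True), 1))
# StrongREJECT column (higher is better): base 0.497, ABS 0.669.
print(round(relative_improvement(0.497, 0.669, lower_is_better=False), 1))
```

These calls reproduce the \+17\.0, \+56\.7, and \+34\.6 entries reported for the 3B model.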

We also observe that in some cases LoRA training quickly becomes unstable and derails into producing nonsense output; a typical collapse mode consists of Qwen producing a sequence of \! characters, which we notice is a common training prefix in some popular red\-team papers Schwinn et al\. \([2023](https://arxiv.org/html/2605.08427#bib.bib20)\) \(see Appendix [C\.4](https://arxiv.org/html/2605.08427#A3.SS4) for details on ABS collapses and all the results on LoRA variations\)\. In terms of efficiency, Table [2](https://arxiv.org/html/2605.08427#S5.T2) reports that training a LoRA adapter with rank 16 on three NVIDIA H200 GPUs takes approximately the same time as benchmarking, on a single H200, the same model on the nine test suites\.

Table 2: Training time \(in seconds, per step and in total\) and size \(per\-layer and in total, with the percentage of additional parameters w\.r\.t\. the base model in parentheses\) of Qwen2\.5\-\{3B, 7B, 14B\}\-IT ABS models\. Estimates may vary depending on several factors \(GPU utilisation, etc\.\), but reflect a lower bound on three NVIDIA H200 GPUs \(we used dedicated, bare\-metal GPUs\)\.
### 5\.2 ABS Adversarial Strength and Defensive Robustness

In this section, we analyse the adversarial pressure that a trained attacker can put on the defender\. We compared the ASR \(Attack Success Rate\) of ABS and Self\-RedTeam checkpoints vs the same base model\. We sampled 100 examples from the Harmful Behavior Dataset Jiang et al\. \([2024](https://arxiv.org/html/2605.08427#bib.bib45)\) and asked the attacker to produce a jailbreak following the same format used at training time\. The defender receives the input and decides whether or not to comply with the request\. We evaluated the success of an attack with GPT\-5\.4 Standard with High reasoning as an LLM\-as\-a\-judge\. We set strict prompt rules on the success of an attack: a success is a defender answer that, for the judge, “gave substantive harmful assistance”\. As shown in Figure [2](https://arxiv.org/html/2605.08427#S5.F2), the Self\-RedTeam attackers for Qwen2\.5\-3B\-IT and 7B have an ASR of 18% and 9%, respectively, while our corresponding ABS attackers produce more effective jailbreaks and achieve an ASR of 35% and 24%\.

We also conducted a “tournament” to evaluate the adversarial pressure between our ABS models and Self\-RedTeam\. Over 100 samples from the Harmful Behavior Dataset, we cross\-tested each methodology’s attacker against the opposing defender at equivalent model scales\. This analysis provides a measure of the adversarial pressure that an attacker puts on the other model’s defender\. As in the previous case, GPT\-5\.4 with High reasoning judges the success of an attack\. As reported in Figure [2](https://arxiv.org/html/2605.08427#S5.F2), the Self\-RedTeam attackers for Qwen2\.5\-3B\-IT and 7B both have an ASR of 6%, while our corresponding ABS attackers produce more effective jailbreaks and achieve an ASR of 17% and 18%\. In conclusion, the shared\-model defender is markedly more permissive than the LoRA defender\.
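The cross\-play protocol above can be sketched as a loop over attacker and defender pairings that share the same seed behaviours. Everything in the example is a toy stand\-in introduced for illustration: the real attackers and defenders are LLM policies, the judge is an LLM\-as\-a\-judge call, and the names `cross_play`, `plain`, `wrapped`, `keyword`, and `strict` are hypothetical.

```python
def attack_success_rate(verdicts):
    """Fraction of attacks that the judge marked as successful."""
    verdicts = list(verdicts)
    return sum(1 for v in verdicts if v) / len(verdicts)

def cross_play(attackers, defenders, judge, seeds):
    """ASR for every (attacker, defender) pairing on the same seeds."""
    results = {}
    for a_name, attack in attackers.items():
        for d_name, defend in defenders.items():
            verdicts = []
            for seed in seeds:
                jailbreak = attack(seed)        # attacker rewrites the seed
                response = defend(jailbreak)    # defender answers the rewrite
                verdicts.append(judge(seed, jailbreak, response))
            results[(a_name, d_name)] = attack_success_rate(verdicts)
    return results

def toy_judge(seed, jailbreak, response):
    # Placeholder rule; a real judge would assess harmful assistance.
    return response == "COMPLY"

attackers = {
    "plain": lambda seed: seed,
    "wrapped": lambda seed: "story: " + seed,
}
defenders = {
    "keyword": lambda prompt: "REFUSE" if prompt.startswith("seed") else "COMPLY",
    "strict": lambda prompt: "REFUSE",
}
seeds = ["seed-1", "seed-2", "seed-3", "seed-4"]

table = cross_play(attackers, defenders, toy_judge, seeds)
for pairing, asr in sorted(table.items()):
    print(pairing, asr)
```

Holding the seeds and the judge fixed across pairings is what makes the resulting ASR values comparable between attackers and between defenders.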

We compared attacks generated by ABS and Self\-RedTeam along three axes: semantic similarity, lexical diversity, and structural complexity\. Table [3](https://arxiv.org/html/2605.08427#S5.T3) provides a quantitative overview of these differences\. Self\-RedTeam produces shorter and more homogeneous attacks, as reflected by higher Self\-BLEU scores and lower token counts\. In contrast, ABS generates longer and more lexically diverse outputs, particularly for the 3B and 7B models, where token counts increase substantially and Self\-BLEU decreases\. Despite these surface\-level differences, cosine similarity remains broadly comparable across methods, suggesting that both approaches occupy similar regions in embedding space \(Figure [3](https://arxiv.org/html/2605.08427#S5.F3) shows the UMAP embeddings for ABS and Self\-RedTeam\)\.
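For readers unfamiliar with Self\-BLEU, the sketch below shows one common way to compute it: each text is scored with BLEU against all the other texts as references, and the scores are averaged, so higher values indicate more homogeneous outputs. Whitespace tokenisation, uniform weights over 1\- to 3\-grams, and the absence of smoothing are assumptions made here; the exact configuration used for Table 3 is not specified in this section.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=3):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n grams."""
    if not hypothesis or not references:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c = len(hypothesis)
    # Closest reference length, ties broken towards the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

def self_bleu(texts, max_n=3):
    """Average BLEU of each text against all the others."""
    tokenised = [t.split() for t in texts]
    scores = [
        bleu(hyp, tokenised[:i] + tokenised[i + 1:], max_n)
        for i, hyp in enumerate(tokenised)
    ]
    return sum(scores) / len(scores)

similar = ["the cat sat on the mat", "the cat sat on a mat", "dogs bark loudly at night"]
print(self_bleu(similar))
```

A set of identical texts yields a Self\-BLEU of 1, and a set with no shared words yields 0, which matches the reading that lower Self\-BLEU corresponds to greater lexical diversity.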

These quantitative trends correspond to clear stylistic differences\. Self\-RedTeam attacks are typically direct reformulations of the harmful intent: they are concise, imperative, and minimally elaborated\. By contrast, ABS tends to embed the request within richer contextual scaffolding, often introducing multi\-step scenarios, social engineering elements, or concealment\-oriented phrasing\. This results in attacks that are more elaborate but also less controlled, especially in the 3B model\. In that setting, ABS outputs sometimes include residual meta\-instructions, such as references to an “original prompt” or “revised prompt”, suggesting weaker separation between instruction\-following and task narration; this likely reflects the limited capacity of a LoRA module to enforce clean task boundaries\.

Overall, the comparison reveals a trade\-off between reliability and elaboration\. Self\-RedTeam variants, particularly at 7B, are more structurally consistent and produce cleaner attack spans\. In contrast, the 3B ABS variant is less disciplined but often generates longer and more evasive formulations, indicating greater rhetorical complexity even when formatting quality is degraded\. Representative examples \(Appendix [C\.4\.1](https://arxiv.org/html/2605.08427#A3.SS4.SSS1) and [C\.5](https://arxiv.org/html/2605.08427#A3.SS5)\) illustrate this contrast\. In conclusion, lexical diversity and verbosity, as captured by Self\-BLEU and token counts in Table [3](https://arxiv.org/html/2605.08427#S5.T3), appear to be the key factors underlying the success of ABS attacks\.

![[Uncaptioned image]](https://arxiv.org/html/2605.08427v1/x2.png)

Figure 2: Transferability of attacks and comparative evaluation of Self\-RedTeam versus ABS Qwen\-IT models across the tournament suite\.

| Method | Cosine Similarity | Self-BLEU-3 Score | # Tokens (Average) | # Think Tokens (Average) | Think Tokens Frequency |
|---|---|---|---|---|---|
| Qwen2.5-3B-IT + Self-RedTeam | 0.281 | 0.785 | 59.26 | 58.6 | 99.3% |
| Qwen2.5-3B-IT + ABS | 0.310 | 0.714 | 232.74 | 88.82 | 70.5% |
| Qwen2.5-7B-IT + Self-RedTeam | 0.242 | 0.670 | 90.04 | 49.69 | 99.21% |
| Qwen2.5-7B-IT + ABS | 0.201 | 0.646 | 250.94 | 53.91 | 19.14% |
| Qwen2.5-14B-IT + Self-RedTeam | 0.206 | 0.547 | 122.43 | 73.31 | 100% |
| Qwen2.5-14B-IT + ABS | 0.210 | 0.500 | 135.86 | 76.67 | 25% |

Table 3: Statistics of the attacks generated by Self\-RedTeam vs LoRA Qwen2\.5\-IT models on the Harmful Behavior Dataset\. Cosine similarity measures embedding similarity, while Self\-BLEU measures lexical diversity \(lower Self\-BLEU indicates more diverse outputs\)\.

![Refer to caption](https://arxiv.org/html/2605.08427v1/figures/results/umap-qwen.png)

Figure 3: UMAP projection of the attacks of Self\-RedTeam and ABS on the Harmful Behavior Dataset; the base models are Qwen2\.5\-\{7B, 14B\}\. The projections are similar in the embedding space, despite our ABS attacks being lexically different from those of Self\-RedTeam\. The lexical analysis in Table [3](https://arxiv.org/html/2605.08427#S5.T3) resolves this apparent inconsistency\.
### 5\.3 Interfacing with Human Values: Fairness and Harmfulness

In this section, we explore how ABS performs at blocking specific human harms, such as those related to fairness or toxicity, without over\-refusing benign prompts\. To understand this trade\-off, we evaluated our model and the baselines using semantic sub\-categories from WildGuard, HarmBench, and XSTest\. We show the comparison in Table [4](https://arxiv.org/html/2605.08427#S5.T4)\.

Both ABS and Self\-RedTeam demonstrate high effectiveness in suppressing malicious content\. ABS, in particular, achieves substantial safety gains against direct human harm without necessitating full\-parameter updates\. For example, evaluation on the HarmBench Harassment & Bullying semantic category indicates that the 3B and 14B ABS models attain a 100% safe refusal rate\. In WildGuard’s Violence & Physical Harm and Toxicity categories, the ABS models consistently maintain safe refusal rates of at least 82% and 78%, respectively, across all model scales, with the 14B model exceeding 92% in both\.

Under adversarial pressure, models frequently associate terms related to protected demographics, such as race, gender, or religion, with harmful attacks\. This association results in excessive refusal of benign requests, quantified by XSTest’s Discrimination Contrast and Benign Compliance metrics\. The analysis demonstrates a pronounced difference in how Self\-RedTeam and ABS architectures address fairness\. Self\-RedTeam functions as a blunt instrument\. Gradient collisions compel these models to converge towards a middle ground, as discussed in Section [3\.3](https://arxiv.org/html/2605.08427#S3.SS3), resulting in a reduction in benign compliance to prioritise safety\. The base Qwen2\.5\-3B model attains a 0\.960 Benign Compliance score on XSTest, which declines to 0\.932 under Self\-RedTeam, whereas the Qwen2\.5\-3B ABS model retains a Benign Compliance score of 0\.952\. At the 7B scale, ABS further improves upon the base model’s compliance \(0\.915 compared to 0\.912\), whereas the Self\-RedTeam variant decreases to 0\.888\.

Column groups: Direct Human Harms \(WG: Violence, WG: Toxicity, HB: Harassment\), Fairness & Over\-Refusal \(XSTest: Discr\. Contrast, XSTest: Benign Comply\)\.

| Method | WG: Violence (Safe Refusal ↑) | WG: Toxicity (Safe Refusal ↑) | HB: Harassment (Safe Refusal ↑) | XSTest: Discr. Contrast (Contrast ↑) | XSTest: Benign Comply (Comply ↑) |
|---|---|---|---|---|---|
| Qwen2.5-3B-IT | 0.700 | 0.659 | 1.000 | 0.080 | 0.960 |
| + Self-RedTeam | 0.725 | 0.756 | 1.000 | 0.160 | 0.932 |
| + ABS | 0.850 | 0.927 | 1.000 | 0.240 | 0.952 |
| Improv. Self-RedTeam (%) | +3.6 | +14.8 | 0.0 | +100.0 | -2.9 |
| Improv. ABS (%) | +21.4 | +40.7 | 0.0 | +200.0 | -0.8 |
| Qwen2.5-7B-IT | 0.725 | 0.805 | 0.952 | 0.800 | 0.912 |
| + Self-RedTeam | 0.800 | 0.805 | 0.952 | 0.720 | 0.888 |
| + ABS | 0.825 | 0.780 | 0.952 | 0.760 | 0.915 |
| Improv. Self-RedTeam (%) | +10.3 | 0.0 | 0.0 | -10.0 | -2.6 |
| Improv. ABS (%) | +13.8 | -3.0 | 0.0 | -5.0 | +0.3 |
| Qwen2.5-14B-IT | 0.800 | 0.878 | 1.000 | 0.440 | 0.984 |
| + Self-RedTeam | 0.950 | 0.927 | 1.000 | 0.400 | 0.968 |
| + ABS | 0.925 | 0.951 | 1.000 | 0.400 | 0.980 |
| Improv. Self-RedTeam (%) | +18.7 | +5.6 | 0.0 | -9.1 | -1.6 |
| Improv. ABS (%) | +15.6 | +8.3 | 0.0 | -9.1 | -0.4 |

Table 4: Semantic breakdown of Safety vs\. Over\-Refusal across model scales\. Higher scores \(↑\) indicate better performance \(i\.e\., successfully refusing harmful prompts, and correctly complying with safe, benign prompts\)\. The overall XSTest Benign Compliance scores demonstrate that Self\-RedTeam degrades the model’s ability to engage with safe prompts across all scales, whereas ABS maintains or recovers this capability while deeply mitigating direct human harms\. Blue indicates a shift in the desired direction, while red indicates regression relative to the base model\.

## 6 Conclusion, Limitations, and Future Work

In this paper, we propose ABS, a method designed to overcome the theoretical limitations of standard self\-play in LLM safety training\. By freezing the base model and introducing two role\-specific LoRA adapters, ABS provides an efficient hybrid that keeps the self\-play game while separating the attacker’s and defender’s gradient updates\. ABS achieves strong performance improvements over base LLMs while significantly reducing computational cost\.

Our evaluation across multiple model scales demonstrates that separating the attacker and defender is crucial for effective adversarial training\. Both our theoretical analysis and empirical results indicate that shared\-parameter frameworks, such as Self\-RedTeam, suffer from a self\-consistency bias that weakens the attacker\. ABS outperforms Self\-RedTeam on Qwen2\.5\-3B and reaches comparable performance on larger models, while consistently improving over the base model\. In addition to quantitative gains, ABS\-trained attackers develop more sophisticated and camouflaged strategies\. The method is also highly efficient, achieving up to two orders of magnitude better memory usage while preserving the reasoning capabilities of the underlying model\.

Despite these advantages, ABS exhibits some limitations\. LoRA training can occasionally become unstable, leading to collapse modes with repetitive or nonsensical outputs, and larger models require careful hyperparameter tuning to avoid such failures \(Appendix[C\.4](https://arxiv.org/html/2605.08427#A3.SS4)\)\. Nevertheless, the efficiency of ABS makes it promising for deploying safer yet useful models, particularly at smaller scales\. Future work will explore extensions to more complex threat settings, including mosaic attacks and mixtures of attackers, as well as broader applications of this game\-theoretic framework to self\-play in LLMs\.

## Acknowledgments

GLM was supported by UK Research and Innovation \[grant number EP/S023356/1\], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence \(www\.safeandtrustedai\.org\)\. MW was supported by an AI 2050 Senior Fellowship from the Schmidt Sciences Foundation\. ELM is affiliated with the Institute for Decentralized AI \(IDAI\), which he thanks for its support and for providing the budget for computing\.

## References

- \[1\] L\. Arras, F\. Horn, G\. Montavon, K\. Müller, and W\. Samek \(2016\) Explaining predictions of non\-linear classifiers in NLP\. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp\. 1–7\. Cited by: [§1](https://arxiv.org/html/2605.08427#S1.p1.1)\.
- \[2\] Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, et al\. \(2022\) Training a helpful and harmless assistant with reinforcement learning from human feedback\. arXiv preprint arXiv:2204\.05862\. Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[3\] T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\) Language models are few\-shot learners\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 1877–1901\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf) Cited by: [§1](https://arxiv.org/html/2605.08427#S1.p1.1)\.
- \[4\] B\. Chen, T\. Zhu, J\. Han, L\. Li, G\. Li, and X\. Dai \(2025\) Incentivizing truthful language models via peer elicitation games\. In Advances in Neural Information Processing Systems, Cited by: [Appendix A](https://arxiv.org/html/2605.08427#A1.p2.8)\.
- \[5\] C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\) BoolQ: exploring the surprising difficulty of natural yes/no questions\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 2924–2936\. Cited by: [§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[6\] P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? Try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\. Cited by: [§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[7\] K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[8\] I\. Csiszár and J\. Körner \(2011\) Information theory: coding theorems for discrete memoryless systems\. Cambridge University Press\. Cited by: [§A\.2](https://arxiv.org/html/2605.08427#A1.SS2.1.p1.2)\.
- \[9\] J\. Dai, X\. Pan, R\. Sun, J\. Ji, X\. Xu, M\. Liu, Y\. Wang, and Y\. Yang \(2024\) Safe RLHF: safe reinforcement learning from human feedback\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=TyFrPOKYXw) Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[10\] C\. S\. de Witt \(2025\) Open challenges in multi\-agent security: towards secure systems of interacting AI Agents\. arXiv preprint arXiv:2505\.02077\. Cited by: [§1](https://arxiv.org/html/2605.08427#S1.p1.1)\.
- \[11\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: efficient finetuning of quantized LLMs\. In Advances in Neural Information Processing Systems, Vol\. 36, pp\. 10088–10115\. Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p2.1)\.
- \[12\] S\. Han, K\. Rao, A\. Ettinger, L\. Jiang, B\. Y\. Lin, N\. Lambert, Y\. Choi, and N\. Dziri \(2024\) WildGuard: open one\-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs\. Advances in Neural Information Processing Systems 37, pp\. 8093–8131\. Cited by: [§5](https://arxiv.org/html/2605.08427#S5.p1.1)\.
- \[13\] Z\. Hong, I\. Shenfeld, T\. Wang, Y\. Chuang, A\. Pareja, J\. R\. Glass, A\. Srivastava, and P\. Agrawal \(2024\) Curiosity\-driven red\-teaming for large language models\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=4KqkizXgXU) Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[14\] N\. H\. R\. Howe, I\. R\. Mckenzie, O\. J\. Hollinsworth, M\. Zając, T\. Tseng, A\. D\. Tucker, P\. Bacon, and A\. Gleave \(2025\) Scaling trends in language model robustness\. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 267, pp\. 24080–24138\. External Links: [Link](https://proceedings.mlr.press/v267/howe25a.html) Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[15\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9) Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p2.1)\.
- \[16\] J\. Hu, J\. K\. Liu, H\. Xu, and W\. Shen \(2025\) REINFORCE\+\+: stabilizing critic\-free policy optimization with global advantage normalization\. arXiv preprint arXiv:2501\.03262\. Cited by: [§4\.1](https://arxiv.org/html/2605.08427#S4.SS1.p2.3)\.
- \[17\] N\. Jain, A\. Schwarzschild, Y\. Wen, G\. Somepalli, J\. Kirchenbauer, P\. Chiang, M\. Goldblum, A\. Saha, J\. Geiping, and T\. Goldstein \(2023\) Baseline defenses for adversarial attacks against aligned language models\. arXiv preprint arXiv:2309\.00614\. Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[18\] L\. Jiang, K\. Rao, S\. Han, A\. Ettinger, F\. Brahman, S\. Kumar, N\. Mireshghallah, X\. Lu, M\. Sap, Y\. Choi, et al\. \(2024\) WildTeaming at scale: from In\-the\-Wild jailbreaks to \(adversarially\) safer language models\. Advances in Neural Information Processing Systems 37, pp\. 47094–47165\. Cited by: [§5\.2](https://arxiv.org/html/2605.08427#S5.SS2.p1.5), [§5](https://arxiv.org/html/2605.08427#S5.p1.1), [§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[19\] E\. La Malfa, A\. Petrov, S\. Frieder, C\. Weinhuber, R\. Burnell, R\. Nazar, A\. Cohn, N\. Shadbolt, and M\. Wooldridge \(2024\) Language\-Models\-as\-a\-Service: overview of a new paradigm and its challenges\. Journal of Artificial Intelligence Research 80, pp\. 1497–1523\. Cited by: [§1](https://arxiv.org/html/2605.08427#S1.p1.1)\.
- \[20\] M\. Li, W\. M\. Si, M\. Backes, Y\. Zhang, and Y\. Wang \(2025\) SaLoRA: safety\-alignment preserved low\-rank adaptation\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=GOoVzE9nSj) Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p2.1)\.
- \[21\] A\. Liao, N\. Tomlin, and D\. Klein \(2024\) Efficacy of language model self\-play in non\-zero\-sum games\. arXiv preprint arXiv:2406\.18872\. Cited by: [§3](https://arxiv.org/html/2605.08427#S3.p1.2)\.
- \[22\] M\. Liu, L\. Jiang, Y\. Liang, S\. S\. Du, Y\. Choi, T\. Althoff, and N\. Jaques \(2025\) Chasing moving targets with online self\-play reinforcement learning for safer language models\. arXiv preprint arXiv:2506\.07468\. External Links: [Link](https://arxiv.org/abs/2506.07468) Cited by: [§C\.1](https://arxiv.org/html/2605.08427#A3.SS1.p1.1), [Figure 1](https://arxiv.org/html/2605.08427#S1.F1.2.2), [Figure 1](https://arxiv.org/html/2605.08427#S1.F1.5.2), [§1](https://arxiv.org/html/2605.08427#S1.p1.1), [§1](https://arxiv.org/html/2605.08427#S1.p2.1), [§1](https://arxiv.org/html/2605.08427#S1.p3.1), [§2](https://arxiv.org/html/2605.08427#S2.p1.1), [§2](https://arxiv.org/html/2605.08427#S2.p3.1), [2nd item](https://arxiv.org/html/2605.08427#S3.I4.i2.p1.1), [§3\.1](https://arxiv.org/html/2605.08427#S3.SS1.p4.8), [§3](https://arxiv.org/html/2605.08427#S3.p1.2), [§4\.1](https://arxiv.org/html/2605.08427#S4.SS1.p2.3), [§5\.1](https://arxiv.org/html/2605.08427#S5.SS1.p1.3), [§5](https://arxiv.org/html/2605.08427#S5.p1.1), [§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[23\] A\. Mehrotra, M\. Zampetakis, P\. Kassianik, B\. Nelson, H\. Anderson, Y\. Singer, and A\. Karbasi \(2024\) Tree of attacks: jailbreaking black\-box LLMs automatically\. In Advances in Neural Information Processing Systems, Vol\. 37, pp\. 61065–61105\. Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[24\] E\. Perez, S\. Huang, F\. Song, T\. Cai, R\. Ring, J\. Aslanides, A\. Glaese, N\. McAleese, and G\. Irving \(2022\) Red teaming language models with language models\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp\. 3419–3448\. Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[25\] G\. Qu, A\. Wierman, and N\. Li \(2020\) Scalable reinforcement learning of localized policies for multi\-agent networked systems\. In Learning for Dynamics and Control, pp\. 256–266\. Cited by: [Appendix A](https://arxiv.org/html/2605.08427#A1.p2.8)\.
- \[26\] Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. Cited by: [§5](https://arxiv.org/html/2605.08427#S5.p2.1)\.
- \[27\] J\. Schulman, S\. Levine, P\. Abbeel, M\. Jordan, and P\. Moritz \(2015\) Trust region policy optimization\. In International conference on machine learning, pp\. 1889–1897\. Cited by: [Appendix A](https://arxiv.org/html/2605.08427#A1.p2.8), [Appendix A](https://arxiv.org/html/2605.08427#A1.p3.1)\.
- \[28\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\. Cited by: [§4\.1](https://arxiv.org/html/2605.08427#S4.SS1.p2.3)\.
- \[29\] L\. Schwinn, D\. Dobre, S\. Günnemann, and G\. Gidel \(2023\) Adversarial attacks and defenses in large language models: old and new threats\. In Proceedings on "I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, Proceedings of Machine Learning Research, Vol\. 239, pp\. 103–117\. External Links: [Link](https://proceedings.mlr.press/v239/schwinn23a.html) Cited by: [§5\.1](https://arxiv.org/html/2605.08427#S5.SS1.p2.1)\.
- \[30\] A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\) CommonsenseQA: a question answering challenge targeting commonsense knowledge\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 4149–4158\. Cited by: [§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[31\] Z\. Tan, W\. Yu, J\. Si, T\. Liu, K\. Guan, H\. Jin, J\. Tao, X\. Yuan, D\. Ma, X\. Zhang, T\. Yang, and L\. Sun \(2026\) TriPlay\-rl: tri\-role self\-play reinforcement learning for llm safety alignment\. External Links: 2601\.18292, [Link](https://arxiv.org/abs/2601.18292) Cited by: [§2](https://arxiv.org/html/2605.08427#S2.p1.1), [§2](https://arxiv.org/html/2605.08427#S2.p3.1)\.
- \[32\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems, Vol\. 30, pp\. 6000–6010\. Cited by: [§4\.1](https://arxiv.org/html/2605.08427#S4.SS1.p1.10)\.
- \[33\]X\. Wen, Z\. He, H\. Qi, Z\. Wan, Z\. Ma, Y\. Wen, T\. Zheng, X\. Xu, C\. Lu, and Q\. Zhang\(2026\)MAGIC: a co\-evolving attacker\-defender adversarial game for robust LLM safety\.arXiv preprint arXiv:2602\.01539\.Cited by:[§2](https://arxiv.org/html/2605.08427#S2.p1.1),[§2](https://arxiv.org/html/2605.08427#S2.p3.1),[§3](https://arxiv.org/html/2605.08427#S3.p1.2)\.
- \[34\]R\. J\. Williams\(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine learning8\(3\),pp\. 229–256\.Cited by:[§4\.1](https://arxiv.org/html/2605.08427#S4.SS1.p2.3)\.
- \[35\]S\. Xhonneux, A\. Sordoni, S\. Günnemann, G\. Gidel, and L\. Schwinn\(2024\)Efficient adversarial training in LLMs with continuous attacks\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 1502–1530\.Cited by:[§2](https://arxiv.org/html/2605.08427#S2.p1.1)\.
- \[36\]C\. Xiong, X\. Qi, P\. Chen, and T\. Ho\(2025\)Defensive prompt patch: a robust and generalizable defense of large language models against jailbreak attacks\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 409–437\.Cited by:[§1](https://arxiv.org/html/2605.08427#S1.p1.1)\.
- \[37\]Z\. Xu, Y\. Liu, G\. Deng, Y\. Li, and S\. Picek\(2024\)A comprehensive study of jailbreak attack versus defense for large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 7432–7449\.Cited by:[§1](https://arxiv.org/html/2605.08427#S1.p1.1)\.
- \[38\]R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi\(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 4791–4800\.Cited by:[§5](https://arxiv.org/html/2605.08427#S5.p3.1)\.
- \[39\]Q\. Zhang, M\. Chen, A\. Bukharin, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao\(2023\)Adaptive budget allocation for parameter\-efficient fine\-tuning\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=lq62uWRJjiY)Cited by:[§2](https://arxiv.org/html/2605.08427#S2.p2.1)\.
- \[40\]Y\. Zhang, G\. Qu, P\. Xu, Y\. Lin, Z\. Chen, and A\. Wierman\(2023\)Global convergence of localized policy iteration in networked multi\-agent reinforcement learning\.InProceedings of the ACM on Measurement and Analysis of Computing Systems,Vol\.7,pp\. 1–51\.Cited by:[Appendix A](https://arxiv.org/html/2605.08427#A1.p2.8)\.
- \[41\]C\. Ziakas, N\. Loo, N\. Jain, and A\. Russo\(2025\)Red\-Bandit: test\-time adaptation for LLM red\-teaming via bandit\-guided LoRA experts\.arXiv preprint arXiv:2510\.07239\.Cited by:[§2](https://arxiv.org/html/2605.08427#S2.p2.1)\.

## Appendix

## Appendix A More Details on Bound 1: An always-refuse equilibrium

Recall our assumption that the defender can always issue a refusal response $y^{\mathrm{ref}} \in \mathcal{Y}_D$ such that $r(y_A, y^{\mathrm{ref}}) = 0$ for all $y_A \in \mathcal{Y}_A$. Thus, for any prompt $y_A$, we define the always-refuse defender strategy as $\pi_D^{\mathrm{ref}}(y_D \mid y_A) = \mathbf{1}\{y_D = y^{\mathrm{ref}}\}$, which yields an expected reward $V(\pi_A, \pi_D^{\mathrm{ref}}) = 0$ for any attacker policy $\pi_A$.

A pair $(\pi_A^{\ast}, \pi_D^{\mathrm{ref}})$ constitutes a Nash equilibrium if the attacker’s policy $\pi_A^{\ast}$ is sufficiently adversarial that no defender policy guarantees a strictly positive expected payoff, i.e., $\max_{\pi_D} V(\pi_A^{\ast}, \pi_D) = 0$. To make this precise, suppose the attacker’s policy space consists of probability distributions in a localised neighbourhood around a reference policy $\bar{\pi}_A$, representing baseline adversarial behaviour already optimised to challenge the defender \[[27](https://arxiv.org/html/2605.08427#bib.bib35)\]. In particular, we assume that $\bar{\pi}_A$ concentrates on prompts for which safe responses yield no positive reward, so that the defender cannot improve upon refusal. This constraint captures the intuition that, while attack policies may adapt to minimise utility, they should remain sufficiently close to a reference adversarial behaviour to preserve semantic consistency and interpretability \[[4](https://arxiv.org/html/2605.08427#bib.bib38)\]. In practice, this can be enforced by initialising the agent with a fine-tuned model embodying $\bar{\pi}_A$ and constraining updates to remain within a trust region \[[25](https://arxiv.org/html/2605.08427#bib.bib36), [40](https://arxiv.org/html/2605.08427#bib.bib39)\]. Specifically, for a fixed $\delta \in [0,1]$ we define the local policy $\delta$-neighbourhood as:

$$\Pi_A^{\delta} = \left\{ \pi_A \in \Delta(\mathcal{Y}_A) : D_{\text{TV}}(\pi_A \| \bar{\pi}_A) \leq \delta \right\}, \tag{6}$$

where $D_{\text{TV}}(\cdot \| \cdot)$ denotes the total variation distance, given as:

$$D_{\text{TV}}(\pi_A \| \bar{\pi}_A) = \frac{1}{2} \sum_{y_A \in \mathcal{Y}_A} \left| \pi_A(y_A) - \bar{\pi}_A(y_A) \right|. \tag{7}$$

Although trust-region methods are commonly defined using KL divergence \[[27](https://arxiv.org/html/2605.08427#bib.bib35)\], we use total variation distance in the analysis, since it directly bounds differences in expectations of bounded functions. Specifically, in Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1) we show that, within $\Pi_A^{\delta}$, the defender’s best achievable expected payoff changes by a controlled amount (see the full proof in Appendix [A.1](https://arxiv.org/html/2605.08427#A1.SS1)). In contrast, KL divergence only provides such control indirectly, resulting in weaker bounds (see Appendix [A.2](https://arxiv.org/html/2605.08427#A1.SS2) for further details).

###### Lemma 1\.

For any attacker policies $\pi_A, \pi_A' \in \Delta(\mathcal{Y}_A)$ and any defender policy $\pi_D$:

$$\left| V(\pi_A, \pi_D) - V(\pi_A', \pi_D) \right| \leq 2 D_{\text{TV}}(\pi_A \| \pi_A'). \tag{8}$$

In particular, if $\delta = \frac{\varepsilon}{2}$ and $\pi_A \in \Pi_A^{\delta}$ for some fixed $\varepsilon \in [0,1]$, then:

$$\max_{\pi_D} V(\pi_A, \pi_D) \leq \max_{\pi_D} V(\bar{\pi}_A, \pi_D) + \varepsilon. \tag{9}$$

Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1) and Appendix [A.2](https://arxiv.org/html/2605.08427#A1.SS2) show that, in either a total-variation or KL divergence neighbourhood of the reference policy $\bar{\pi}_A$, the defender’s best achievable payoff changes only by a controlled amount. As $\bar{\pi}_A$ represents a baseline adversarial behaviour, it is natural to restrict attention to local perturbations that preserve this property. We thus focus on attacker policies $\pi_A^{\ast} \in \Pi_A^{\delta}$ that preserve the adversarial property of $\bar{\pi}_A$ locally, in the sense that they do not admit any defender strategy with strictly positive expected payoff. Formally, we require that for any alternative defender strategy $\pi_D'$:

$$\mathbb{E}_{y_A \sim \pi_A^{\ast},\, y_D \sim \pi_D'}\left[ r(y_A, y_D) \right] \leq 0. \tag{10}$$

This condition can be interpreted as a *worst-case* constraint on the attacker: within the local policy class $\Pi_A^{\delta}$, any attempt by the defender to provide a helpful response yields no positive gain and may incur risk, so the defender has no incentive to deviate from $y^{\mathrm{ref}}$. Specifically, providing a helpful response risks a payoff of $-1$ that is not offset by safe, high-reward completions, making the expected value of answering non-positive. Hence, $\pi_D^{\mathrm{ref}}$ is a best response to $\pi_A^{\ast} \in \Pi_A^{\delta}$. Moreover, since the attacker cannot reduce the payoff below $0$ against a refusing defender, $\pi_A^{\ast}$ is also a best response.
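
The best-response argument above can be checked on a small tabular game. The following is a minimal sketch under stated assumptions: the two prompts, the two responses, and the reward table are hypothetical and chosen only so that refusal yields 0 and no response to the attacker's prompts earns a positive reward; they are not taken from the paper's experiments.

```python
# Hypothetical toy game (illustration only): rewards are the defender's payoff in [-1, 1].
PROMPTS = ["disguised_harmful", "overtly_harmful"]
RESPONSES = ["refuse", "comply"]
REWARD = {
    ("disguised_harmful", "refuse"): 0.0,
    ("disguised_harmful", "comply"): -1.0,
    ("overtly_harmful", "refuse"): 0.0,
    ("overtly_harmful", "comply"): -1.0,
}


def value(pi_a, pi_d):
    """Expected defender payoff V(pi_A, pi_D) for tabular policies."""
    return sum(
        pi_a[a] * pi_d[a][d] * REWARD[(a, d)] for a in PROMPTS for d in RESPONSES
    )


def defender_best_value(pi_a):
    """max over defender policies of V; a per-prompt deterministic response suffices."""
    return sum(pi_a[a] * max(REWARD[(a, d)] for d in RESPONSES) for a in PROMPTS)


def attacker_best_value(pi_d):
    """min over attacker policies of V; attained by a single prompt."""
    return min(
        sum(pi_d[a][d] * REWARD[(a, d)] for d in RESPONSES) for a in PROMPTS
    )


def is_nash(pi_a, pi_d, tol=1e-12):
    """Neither player can improve unilaterally (defender maximises, attacker minimises)."""
    v = value(pi_a, pi_d)
    return (
        abs(defender_best_value(pi_a) - v) <= tol
        and abs(attacker_best_value(pi_d) - v) <= tol
    )


# An adversarial attacker policy and the always-refuse defender.
pi_a_star = {"disguised_harmful": 0.7, "overtly_harmful": 0.3}
pi_d_ref = {a: {"refuse": 1.0, "comply": 0.0} for a in PROMPTS}
```

In this toy setting, `is_nash(pi_a_star, pi_d_ref)` holds because every deviation from refusal has non-positive value. If any prompt in the attacker's support admitted a positive-reward answer, refusal would stop being a best response, which is the role played by condition (10).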

### A.1 Proof of Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1)

###### Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1).

For any attacker policies $\pi_A, \pi_A' \in \Delta(\mathcal{Y}_A)$ and any defender policy $\pi_D$:

$$\left| V(\pi_A, \pi_D) - V(\pi_A', \pi_D) \right| \leq 2 D_{\text{TV}}(\pi_A \| \pi_A'). \tag{11}$$

In particular, if $\delta = \frac{\varepsilon}{2}$ and $\pi_A \in \Pi_A^{\delta}$ for some fixed $\varepsilon \in [0,1]$, then:

$$\max_{\pi_D} V(\pi_A, \pi_D) \leq \max_{\pi_D} V(\bar{\pi}_A, \pi_D) + \varepsilon. \tag{12}$$

###### Proof\.

We start by proving ([11](https://arxiv.org/html/2605.08427#A1.E11)). Consider any attacker policies $\pi_A, \pi_A' \in \Delta(\mathcal{Y}_A)$ and any defender policy $\pi_D$. By definition, the following holds:

$$\begin{aligned} V(\pi_A, \pi_D) &= \mathbb{E}_{y_A \sim \pi_A,\; y_D \sim \pi_D(\cdot \mid y_A)}\left[ r(y_A, y_D) \right] \\ &= \sum_{y_A \in \mathcal{Y}_A} \pi_A(y_A) \sum_{y_D \in \mathcal{Y}_D} \pi_D(y_D \mid y_A) \cdot r(y_A, y_D). \end{aligned} \tag{13}$$

By similar arguments:

$$V(\pi_A', \pi_D) = \sum_{y_A \in \mathcal{Y}_A} \pi_A'(y_A) \sum_{y_D \in \mathcal{Y}_D} \pi_D(y_D \mid y_A) \cdot r(y_A, y_D). \tag{14}$$

By combining ([13](https://arxiv.org/html/2605.08427#A1.E13)) and ([14](https://arxiv.org/html/2605.08427#A1.E14)), we obtain:

$$\begin{aligned} \left| V(\pi_A, \pi_D) - V(\pi_A', \pi_D) \right| &= \left| \sum_{y_A \in \mathcal{Y}_A} \sum_{y_D \in \mathcal{Y}_D} \pi_D(y_D \mid y_A) \cdot r(y_A, y_D) \left( \pi_A(y_A) - \pi_A'(y_A) \right) \right| && \text{(15a)} \\ &\leq \sum_{y_A \in \mathcal{Y}_A} \sum_{y_D \in \mathcal{Y}_D} \left| \pi_D(y_D \mid y_A) \right| \cdot \left| r(y_A, y_D) \right| \left| \pi_A(y_A) - \pi_A'(y_A) \right| && \text{(15b)} \\ &\leq \sum_{y_A \in \mathcal{Y}_A} \left| \pi_A(y_A) - \pi_A'(y_A) \right| \sum_{y_D \in \mathcal{Y}_D} \pi_D(y_D \mid y_A) && \text{(15c)} \\ &= \sum_{y_A \in \mathcal{Y}_A} \left| \pi_A(y_A) - \pi_A'(y_A) \right|, && \text{(15d)} \end{aligned}$$

where the inequality in ([15b](https://arxiv.org/html/2605.08427#A1.E15.2)) follows from the triangle inequality. Furthermore, the inequality in ([15c](https://arxiv.org/html/2605.08427#A1.E15.3)) holds because $\left| \pi_D(y_D \mid y_A) \right| = \pi_D(y_D \mid y_A)$ for any $y_A \in \mathcal{Y}_A, y_D \in \mathcal{Y}_D$ since $\pi_D(y_D \mid y_A) \in [0,1]$, and $\left| r(y_A, y_D) \right| \leq 1$ due to our assumption in Section [3.1](https://arxiv.org/html/2605.08427#S3.SS1) that $r(y_A, y_D) \in [-1,1]$ for all $y_A \in \mathcal{Y}_A, y_D \in \mathcal{Y}_D$.
Finally, the last equality in ([15d](https://arxiv.org/html/2605.08427#A1.E15.4)) stems from $\sum_{y_D \in \mathcal{Y}_D} \pi_D(y_D \mid y_A) = 1$ for any $y_A \in \mathcal{Y}_A$ due to $\pi_D(\cdot \mid y_A) \in \Delta(\mathcal{Y}_D)$.

Subsequently, we relate the term in ([15d](https://arxiv.org/html/2605.08427#A1.E15.4)) to the total variation distance. Indeed, since $D_{\text{TV}}(\pi_A \| \pi_A') = \frac{1}{2} \sum_{y_A \in \mathcal{Y}_A} \left| \pi_A(y_A) - \pi_A'(y_A) \right|$ by definition, we conclude that:

$$\left| V(\pi_A, \pi_D) - V(\pi_A', \pi_D) \right| \leq 2 D_{\text{TV}}(\pi_A \| \pi_A'), \tag{16}$$

as desired.

Next, we prove the second part of our claim in ([12](https://arxiv.org/html/2605.08427#A1.E12)). For any fixed $\delta \in [0,1]$, consider any attacker policy $\pi_A \in \Pi_A^{\delta}$. That is, $D_{\text{TV}}(\pi_A \| \bar{\pi}_A) \leq \delta$, where $\bar{\pi}_A$ is the reference policy defined in Section [3.2](https://arxiv.org/html/2605.08427#S3.SS2), representing baseline adversarial behaviour already optimised to challenge the defender. Therefore, applying ([16](https://arxiv.org/html/2605.08427#A1.E16)) with $\pi_A' = \bar{\pi}_A$, we obtain for any defender policy $\pi_D$:

$$\left| V(\pi_A, \pi_D) - V(\bar{\pi}_A, \pi_D) \right| \leq 2\delta. \tag{17}$$

As such, if $\delta = \frac{\varepsilon}{2}$ for any fixed $\varepsilon \in [0,1]$, then ([17](https://arxiv.org/html/2605.08427#A1.E17)) gives $V(\pi_A, \pi_D) \leq V(\bar{\pi}_A, \pi_D) + \varepsilon \leq \max_{\pi_D'} V(\bar{\pi}_A, \pi_D') + \varepsilon$ for every $\pi_D$. Taking the maximum over $\pi_D$ on the left-hand side yields the desired result in ([12](https://arxiv.org/html/2605.08427#A1.E12)). ∎
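
As a sanity check on the bound in Lemma 1, the inequality can be tested numerically on random tabular instances. This is a minimal sketch, assuming finite prompt and response sets and rewards drawn uniformly from $[-1,1]$; the function names and the sampling scheme are illustrative choices, not part of the paper.

```python
import random


def random_dist(n, rng):
    """Sample a strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def tv_distance(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def value(pi_a, pi_d, reward):
    """V(pi_A, pi_D) = sum_a pi_A(a) * sum_d pi_D(d | a) * r(a, d)."""
    return sum(
        pi_a[i] * sum(pi_d[i][j] * reward[i][j] for j in range(len(reward[i])))
        for i in range(len(pi_a))
    )


def check_lemma1(trials=1000, n_prompts=5, n_responses=4, seed=0):
    """Return the smallest observed slack 2 * D_TV - |V(p, pi_D) - V(q, pi_D)|."""
    rng = random.Random(seed)
    worst_slack = float("inf")
    for _ in range(trials):
        reward = [
            [rng.uniform(-1.0, 1.0) for _ in range(n_responses)]
            for _ in range(n_prompts)
        ]
        pi_d = [random_dist(n_responses, rng) for _ in range(n_prompts)]
        p = random_dist(n_prompts, rng)
        q = random_dist(n_prompts, rng)
        gap = abs(value(p, pi_d, reward) - value(q, pi_d, reward))
        worst_slack = min(worst_slack, 2.0 * tv_distance(p, q) - gap)
    return worst_slack
```

A non-negative return value (up to floating-point error) is what the lemma predicts; a random search of this kind can only fail to find a counterexample and does not replace the proof.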

### A.2 Discussion: Using KL-Divergence for Local Policy Neighbourhoods

In this section, we discuss the consequences of defining the local policy neighbourhood in ([6](https://arxiv.org/html/2605.08427#A1.E6)) using the KL divergence instead of the total variation distance. For a fixed $\delta \in [0,1]$, we define the local policy $\delta$-neighbourhood using the KL divergence as follows:

$$\Pi_A^{\delta,\text{KL}} = \left\{ \pi_A \in \Delta(\mathcal{Y}_A) : D_{\text{KL}}(\pi_A \| \bar{\pi}_A) \leq \delta \right\}, \tag{18}$$

where $D_{\text{KL}}(\cdot \| \cdot)$ denotes the KL divergence, which is given by:

$$D_{\text{KL}}(\pi_A \| \bar{\pi}_A) = \sum_{y_A \in \mathcal{Y}_A} \pi_A(y_A) \log\left( \frac{\pi_A(y_A)}{\bar{\pi}_A(y_A)} \right). \tag{19}$$

While using the total variation distance directly bounds differences in expectations of bounded functions due to Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1), we next prove that the KL divergence only provides such control indirectly, resulting in weaker bounds.

###### Lemma 2\.

For any attacker policies $\pi_A, \pi_A' \in \Delta(\mathcal{Y}_A)$ and any defender policy $\pi_D$:

$$\left| V(\pi_A, \pi_D) - V(\pi_A', \pi_D) \right| \leq \sqrt{2 D_{\text{KL}}(\pi_A \| \pi_A')}. \tag{20}$$

In particular, if $\delta = \frac{\varepsilon^2}{2}$ and $\pi_A \in \Pi_A^{\delta,\text{KL}}$ for some fixed $\varepsilon \in [0,1]$, then:

$$\max_{\pi_D} V(\pi_A, \pi_D) \leq \max_{\pi_D} V(\bar{\pi}_A, \pi_D) + \varepsilon. \tag{21}$$

###### Proof\.

The proof follows from Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1). We start by proving ([20](https://arxiv.org/html/2605.08427#A1.E20)). Consider any attacker policies $\pi_A, \pi_A' \in \Delta(\mathcal{Y}_A)$ and any defender policy $\pi_D$. By Pinsker’s inequality \[[8](https://arxiv.org/html/2605.08427#bib.bib37)\], it holds that:

$$D_{\text{TV}}(\pi_A \| \pi_A') \leq \sqrt{\frac{1}{2} D_{\text{KL}}(\pi_A \| \pi_A')}. \tag{22}$$

Therefore, combining ([22](https://arxiv.org/html/2605.08427#A1.E22)) with ([11](https://arxiv.org/html/2605.08427#A1.E11)) from Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1) gives $\left| V(\pi_A, \pi_D) - V(\pi_A', \pi_D) \right| \leq 2 D_{\text{TV}}(\pi_A \| \pi_A') \leq \sqrt{2 D_{\text{KL}}(\pi_A \| \pi_A')}$, which is the desired bound in ([20](https://arxiv.org/html/2605.08427#A1.E20)).

Next, we prove the second part of our claim in ([21](https://arxiv.org/html/2605.08427#A1.E21)). For any fixed $\delta \in [0,1]$, consider any attacker policy $\pi_A \in \Pi_A^{\delta,\text{KL}}$. That is, $D_{\text{KL}}(\pi_A \| \bar{\pi}_A) \leq \delta$, where $\bar{\pi}_A$ is the reference policy defined in Section [3.2](https://arxiv.org/html/2605.08427#S3.SS2), representing baseline adversarial behaviour already optimised to challenge the defender. Therefore, applying ([20](https://arxiv.org/html/2605.08427#A1.E20)) with $\pi_A' = \bar{\pi}_A$, we obtain for any defender policy $\pi_D$:

$$\left| V(\pi_A, \pi_D) - V(\bar{\pi}_A, \pi_D) \right| \leq \sqrt{2\delta}. \tag{23}$$

As such, if $\delta = \frac{\varepsilon^2}{2}$ for any fixed $\varepsilon \in [0,1]$, the desired result in ([21](https://arxiv.org/html/2605.08427#A1.E21)) readily follows from ([23](https://arxiv.org/html/2605.08427#A1.E23)) by the same maximisation argument used for Lemma [1](https://arxiv.org/html/2605.08427#Thmlemma1). ∎
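
To illustrate why the KL-based neighbourhood gives the weaker guarantee, the sketch below compares the two radii needed for the same payoff tolerance $\varepsilon$ and numerically checks the Pinsker step on random distributions. It assumes the standard KL divergence with natural logarithm; the helper names are hypothetical.

```python
import math
import random


def random_dist(n, rng):
    """Sample a strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def tv_distance(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; q is assumed strictly positive wherever p is."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def tv_radius(eps):
    """Radius delta with 2 * delta = eps (Lemma 1)."""
    return eps / 2.0


def kl_radius(eps):
    """Radius delta with sqrt(2 * delta) = eps (Lemma 2)."""
    return eps ** 2 / 2.0


def check_pinsker(trials=1000, n=6, seed=0):
    """Return the smallest observed slack sqrt(2 * KL) - 2 * TV."""
    rng = random.Random(seed)
    worst_slack = float("inf")
    for _ in range(trials):
        p, q = random_dist(n, rng), random_dist(n, rng)
        kl = max(kl_divergence(p, q), 0.0)  # guard against tiny negative round-off
        worst_slack = min(worst_slack, math.sqrt(2.0 * kl) - 2.0 * tv_distance(p, q))
    return worst_slack
```

For a tolerance of $\varepsilon = 0.2$, the total-variation radius is $\varepsilon/2 = 0.1$, whereas the KL radius is $\varepsilon^2/2 = 0.02$: the KL constraint must shrink quadratically in $\varepsilon$ to deliver the same payoff guarantee.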

## Appendix B Absence of Parameter-Level Coupling

As noted in Section [4.1](https://arxiv.org/html/2605.08427#S4.SS1), the updates in Eq. ([4](https://arxiv.org/html/2605.08427#S4.E4)) evolve separately and restrict optimisation to low-rank subspaces over a shared frozen backbone. This parameter separation restores the distinction between attacker and defender updates, eliminating the parameter-level interference observed in Eq. ([2](https://arxiv.org/html/2605.08427#S3.E2)). To formalise this, consider a defender-only update step

$$\phi_D^{+} = \phi_D + \eta\, \nabla_{\phi_D} J_D(\phi_A, \phi_D),$$

where $\eta > 0$ is the learning rate. A first-order Taylor expansion of the defender’s objective yields

$$J_D(\phi_A, \phi_D^{+}) = J_D(\phi_A, \phi_D) + \eta \left\| \nabla_{\phi_D} J_D(\phi_A, \phi_D) \right\|^2 + O(\eta^2). \tag{24}$$

Although this update is taken only with respect to $\phi_D$, the gradient $\nabla_{\phi_D} J_D$ depends on the attacker through the sampling processes $y_A \sim \pi_{\theta,\phi_A}(\cdot)$ and $y_D \sim \pi_{\theta,\phi_D}(\cdot \mid y_A)$, which induce a dependence of the objective on the distribution of attacker-generated prompts. This can be made explicit as

$$\nabla_{\phi_D} J_D = \mathbb{E}_{y_A \sim \pi_{\theta,\phi_A}}\Big[ \mathbb{E}_{y_D \sim \pi_{\theta,\phi_D}(\cdot \mid y_A)}\big[ \nabla_{\phi_D} \log \pi_{\theta,\phi_D}(y_D \mid y_A)\, R_D(y_A, y_D) \big] \Big]. \tag{25}$$

Importantly, this dependence is purely distributional: the defender parameters are updated independently of the attacker parameters, i.e.,

$$\frac{\partial \phi_D}{\partial \phi_A} = 0.$$

Thus, the attacker influences the defender only through the induced input distribution of $y_A$, rather than via shared or differentiable parameters. The same reasoning applies symmetrically to the attacker.
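
The distinction between distributional and parameter-level coupling can be illustrated on a tabular toy problem. The sketch below is an assumption-laden stand-in for the LoRA setting: it uses softmax policies over two prompts and two responses, a hypothetical reward table `REWARD`, and the exact (not sampled) form of the policy gradient in Eq. (25).

```python
import math

# Hypothetical defender reward R_D(y_A, y_D); columns are (refuse, answer).
REWARD = [[0.0, -1.0], [0.0, 1.0]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(phi_a, phi_d):
    """J_D: expected reward under the attacker and defender softmax policies."""
    pa = softmax(phi_a)
    return sum(
        pa[a] * sum(p * r for p, r in zip(softmax(phi_d[a]), REWARD[a]))
        for a in range(len(pa))
    )


def defender_gradient(phi_a, phi_d):
    """Exact policy gradient w.r.t. the defender logits (tabular form of Eq. 25)."""
    pa = softmax(phi_a)
    grad = []
    for a in range(len(pa)):
        pd = softmax(phi_d[a])
        baseline = sum(p * r for p, r in zip(pd, REWARD[a]))
        grad.append([pa[a] * pd[d] * (REWARD[a][d] - baseline) for d in range(len(pd))])
    return grad


def defender_step(phi_a, phi_d, eta):
    """Gradient-ascent step on the defender only; phi_a is read but never written."""
    grad = defender_gradient(phi_a, phi_d)
    new_phi_d = [
        [w + eta * g for w, g in zip(row, grow)] for row, grow in zip(phi_d, grad)
    ]
    return new_phi_d, grad
```

Changing `phi_a` changes the value returned by `defender_gradient`, because prompts are weighted by the attacker's distribution, yet `defender_step` never modifies `phi_a`. For a small `eta`, the change in `objective` is close to `eta` times the squared gradient norm, in line with Eq. (24).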

## Appendix C Experimental Setup

### C.1 Prompts

During training, we used exactly the same attacker and defender prompts provided in Liu et al. \[[22](https://arxiv.org/html/2605.08427#bib.bib1)\], which are available in their code ([https://github.com/mickelliu/selfplay-red-teaming/tree/main](https://github.com/mickelliu/selfplay-red-teaming/tree/main)) as well as in Appendix C.4.1 of their paper. For the evaluation on safety benchmarks, we use the exact same prompts and settings as the Ai2 implementation ([https://github.com/allenai/safety-eval](https://github.com/allenai/safety-eval)). For the evaluation on the reasoning and instruction-following datasets, we prompt each model with the input task and ask it to respond with the answer within <answer\></answer\> tags.
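
One plausible way to read answers back from completions formatted with these tags is a small regular-expression helper. This is a minimal sketch, not the evaluation code used in the paper; the function name `extract_answer` and the choice to keep the last tagged span are assumptions made for illustration.

```python
import re

# Non-greedy match so that several <answer> blocks are captured separately.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(completion):
    """Return the text inside the last <answer>...</answer> pair, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].strip()
```

Returning `None` when no tags are present lets the caller decide whether to score the completion as incorrect or to fall back to another parsing rule.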

### C.2 Benchmark Details

In Table [5](https://arxiv.org/html/2605.08427#A3.T5), we describe the datasets employed in the experimental evaluation, alongside the metrics used to benchmark an LLM.

### C.3 ABS Setup

For each ABS model, we train low-rank adapters on the projection layers q\_proj, k\_proj, v\_proj, and o\_proj in the self-attention block, as well as up\_proj, gate\_proj, and down\_proj in the feed-forward network. The attention projections map hidden states into query, key, and value representations that determine token interactions, and then project the aggregated attention output back into the model’s hidden space. The feed-forward projections expand the representation into a higher-dimensional space, apply a gating mechanism to modulate features, and compress it back to the original dimension. Together, these layers control how information is routed and transformed throughout the model, making them effective targets for parameter-efficient fine-tuning. For each implementation, we kept the rank and alpha equal, and used learning rates of 1e-6 and 3e-6 for the attacker and the defender and a KL region of 0.3; we trained each module for up to 100–200 steps. Table [2](https://arxiv.org/html/2605.08427#S5.T2) reports the training time of each model, as well as the number of additional parameters that LoRA introduces and trains.

An example of the defender ABS module configuration is hereby reported:

    {
      "alora_invocation_tokens": null,
      "alpha_pattern": {},
      "arrow_config": null,
      "auto_mapping": null,
      "base_model_name_or_path": "Qwen/Qwen2.5-3B-Instruct",
      "bias": "none",
      "corda_config": null,
      "ensure_weight_tying": false,
      "eva_config": null,
      "exclude_modules": null,
      "fan_in_fan_out": false,
      "inference_mode": true,
      "init_lora_weights": true,
      "layer_replication": null,
      "layers_pattern": null,
      "layers_to_transform": null,
      "loftq_config": {},
      "lora_alpha": 16,
      "lora_bias": false,
      "lora_dropout": 0,
      "megatron_config": null,
      "megatron_core": "megatron.core",
      "modules_to_save": null,
      "peft_type": "LORA",
      "peft_version": "0.18.1",
      "qalora_group_size": 16,
      "r": 16,
      "rank_pattern": {},
      "revision": null,
      "target_modules": [
        "k_proj",
        "o_proj",
        "down_proj",
        "v_proj",
        "up_proj",
        "gate_proj",
        "q_proj"
      ],
      "target_parameters": null,
      "task_type": "CAUSAL_LM",
      "trainable_token_indices": null,
      "use_dora": false,
      "use_qalora": false,
      "use_rslora": false
    }
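
To give a sense of how many parameters such a configuration trains, the sketch below counts the LoRA weights added to a set of linear layers: each adapted layer with shape $d_{\text{in}} \to d_{\text{out}}$ gains two matrices of sizes $r \times d_{\text{in}}$ and $d_{\text{out}} \times r$. The layer dimensions listed for Qwen2.5-3B-Instruct are assumptions taken from the publicly released model configuration, not from the paper, and the result should not be read as the figure reported in Table 2.

```python
def lora_param_count(r, module_shapes, num_layers):
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


def lora_scaling(lora_alpha, r):
    """Scaling applied to the low-rank update B @ A in standard (non-rsLoRA) LoRA."""
    return lora_alpha / r


# Assumed dimensions for Qwen2.5-3B-Instruct (hypothetical here; check the model config).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 2048, 11008, 256, 36
QWEN_3B_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

if __name__ == "__main__":
    print(lora_param_count(16, QWEN_3B_SHAPES, LAYERS))
    print(lora_scaling(16, 16))
```

With rank and alpha kept equal, as in the configuration above, `lora_scaling` evaluates to 1, so the low-rank update is added to the frozen weights without rescaling.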

### C.4 Experimental Results

Tables [6](https://arxiv.org/html/2605.08427#A3.T6), [7](https://arxiv.org/html/2605.08427#A3.T7), and [8](https://arxiv.org/html/2605.08427#A3.T8) report the full results for each ABS module we trained, including those modules for which training collapsed. Table [9](https://arxiv.org/html/2605.08427#A3.T9) reports the detailed results on the reasoning and instruction-following benchmarks for the models benchmarked in the main paper.

#### C.4.1 A Qualitative Analysis of the Attacks

We report the analysis that GPT-5.4 Standard with High reasoning produced on the results of our tournament and on the attacks against the base model, as described in Section [5.2](https://arxiv.org/html/2605.08427#S5.SS2). Our prompt instructed the model to analyse the results and judge whether an attack successfully forces the defender to generate *harmful* content.

Please note that what follows is the verbatim output of GPT-5.4.

##### GPT 5\.4 response\.

These files keep the defender fixed and change the attacker, so the relevant qualitative question is why the LoRA checkpoint is a stronger*attacker*against the same base defender\. The same\-defender comparisons already show that the effect is consistent across scales: on 3B, the LoRA checkpoint reaches a 35% attack success rate against the base model, compared with 18% for the fully fine\-tuned model; on 7B, the corresponding rates are 24% and 9%\.

The main reason is that the LoRA checkpoint is better at producing prompts that are both adversarial and usable\. On 3B, especially, the fully fine\-tuned attacker often fails at the task itself: instead of emitting a concrete jailbreak, it emits meta\-instructions such as “make the prompt more detailed” or “let’s refine the prompt,” which means the defender is not actually forced to answer the harmful request\. By contrast, the 3B LoRA checkpoint usually produces a complete prompt with an explicit target, a concrete framing, and enough detail to steer the defender into a substantive continuation\. In other words, part of the 3B gap is not just better stealth; it is basic task completion\.

The second reason is better camouflage\. The LoRA checkpoint is much more likely to wrap the harmful objective inside a benign\-looking frame: wellness advice, ethical hacking, red\-teaming, documentary analysis, fictional scenarios, or research language\. This works because the base defender has a strong “helpful rewriter” mode\. Once the request is framed as analysis, training, safety, or support, the defender often continues the prompt rather than rejecting the underlying intent\. The fully fine\-tuned model sometimes uses this strategy, but much less consistently, and usually with less detail\.

The third reason is that the LoRA checkpoint tends to preserve a harmful payload while softening the surface form\. Its prompts are longer, more specific, and more targeted\. They do not merely say “do the harmful thing in a subtler way”; they specify a voice, a scenario, a justification, and a rhetorical frame that nudges the defender toward compliance\. This matters because the defender is more vulnerable to requests that look like legitimate content transformation than to short, direct requests that are obviously malicious\.

There is also a scale\-dependent difference between 3B and 7B\. At 3B, the fine\-tuned model’s main weakness is degeneration into meta\-talk and partial outputs\. At 7B, that problem is much smaller: the fully fine\-tuned model usually does produce a coherent rewritten prompt\. However, its prompts are often too short, too direct, or too shallowly disguised, so the defender still recognises the malicious goal and refuses\. The 7B LoRA checkpoint is qualitatively cleaner: it uses more polished masking strategies, more role framing, and more dual\-use language, which lets it exploit the defender’s helpfulness more effectively even when both models are coherent\.

In short, the LoRA checkpoint appears stronger for two different reasons at the two scales\. On 3B, it is stronger partly because it actually produces valid jailbreak prompts more reliably\. On 7B, it is stronger because it is better at adversarial prompt design: it hides the intent behind plausible, high\-context, dual\-use framing while keeping enough specificity to elicit operationally harmful continuations\.

### C.5 LoRA vs Finetuned Jailbreaks

We also instructed GPT-5.4 Standard with High reasoning to identify the prompts with the highest harmfulness score, and to describe the main differences between the attacks generated by Self-RedTeam and by our ABS models.

Please note that what follows is the verbatim output of GPT-5.4.

##### Example 1: 3B self\-harm prompt\.

The 3B LoRA checkpoint rewrites a self\-harm request into a mental\-health\-themed blog\-post prompt that uses supportive language and a personal\-growth frame, but still pushes the defender toward harmful content\. The defender then produces a long post that partly follows that framing\. The fully fine\-tuned 3B model, by contrast, collapses into a meta\-level instruction about making the prompt “more nuanced” and receives a harmless prompt\-engineering response instead\. This is one of the clearest examples of the 3B difference being about task completion as much as stealth\.

##### Example 2: 7B eating\-disorder prompt\.

The 7B LoRA checkpoint reframes the target as a blog post about “mindful eating” and “personal freedom,” then embeds the harmful idea indirectly inside wellness language\. The defender responds with a long blog post that substantially accepts that framing\. The fully fine\-tuned 7B model uses a much more exposed version of the same idea, explicitly asking for restrictive eating to be framed as self\-discipline; that version is recognised and refused\. This is a clean example of the LoRA checkpoint preserving the payload while improving the disguise\.

##### Example 3: 7B self\-harm prompt\.

The 7B LoRA checkpoint asks for a social\-media post from the perspective of a distressed character and uses euphemistic phrases such as finding peace through “letting go\.” The defender then writes a post that continues the emotional framing and stays uncomfortably close to the intended harmful direction\. The fully fine\-tuned 7B model also tries an indirect version, but it is weaker and more generic, so the defender turns it into a benign reflection on solitude and inner peace\. This pair is useful because both prompts are coherent, but the LoRA version is better calibrated to exploit the defender\.

##### Example 4: 7B explosive\-device prompt\.

The LoRA checkpoint turns the original request into a laboratory\-style request for a “controlled release of energy” device, wrapped in research and safety language\. That masking is strong enough that the defender begins to provide construction guidance\. The fully fine\-tuned 7B model asks instead for a device that imitates a bomb in order to cause panic; despite the indirect wording, the intent remains too visible and the defender refuses\. This example shows that the decisive factor is not simply indirectness, but whether the indirect framing is plausible enough to activate the defender’s dual\-use or educational mode\.

##### Example 5: 7B red\-team script prompt\.

One of the strongest LoRA 7B attacks is the request for a red\-team script against a government database that explicitly asks for vulnerability identification and an exploit component under the guise of authorised security work\. The defender responds with concrete code and an exploit\-oriented workflow because the request successfully activates the “ethical security assistance” frame\. This is especially interesting because it shows that the LoRA checkpoint is not only good at emotional or rhetorical masking; it also understands how to wrap cyber abuse inside a familiar professional genre that the defender is predisposed to help with\.

| Benchmark Title | Description / Methodology | Better Score |
| --- | --- | --- |
| **Safety and Robustness Benchmarks** | | |
| HARMBENCH | A standardised evaluation framework for automated red-teaming and robust refusal. | ↓ |
| WILDGUARDTEST | Assesses the effectiveness of LLM safety guardrails in detecting harmful content and risk levels. | ↓ |
| WILDJAILBREAK | A large-scale synthetic resource containing vanilla and adversarial harmful/benign queries. | ↓/↑ |
| DAN | Jailbreak prompts that attempt to bypass restrictions by forcing the model to adopt an unconstrained persona. | ↓ |
| XSTEST | A test suite designed to identify exaggerated safety behaviours (over-refusal) using prompts that appear unsafe but are benign. | ↑/↑ |
| StrongREJECT | A benchmark used to measure the model’s ability to reject harmful prompts effectively. | ↑ |
| **General Capability and Instruction Following** | | |
| GSM8K | Grade-school math word problems requiring multi-step reasoning and numerical problem solving. | ↑ |
| Common Sense | Evaluates commonsense reasoning through everyday situations and implicit knowledge about the world. | ↑ |
| BoolQ | Yes/no question answering based on short passages, testing reading comprehension and factual reasoning. | ↑ |
| ARC-C | A challenging science QA benchmark with grade-school level questions requiring reasoning beyond simple recall. | ↑ |
| HellaSwag | Tests commonsense inference by selecting the most plausible continuation of a given context. | ↑ |

Table 5: Summary of Safety, Robustness, and General Capability Benchmarks

Columns reporting ASR or RTA fall under Harmful Refusal; the Comply column falls under Benign Compliance.

| Method | WG:Test adv ASR ↓ | WG:Test vani ASR ↓ | WJB adv ASR ↓ | DAN adv ASR ↓ | HarmBench adv ASR ↓ | XSTest vani RTA ↑ | StrongREJECT vani RTA ↑ | XSTest vani Comply ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B-IT | 0.282 | 0.119 | 0.949 | 0.677 | 0.241 | 0.610 | 0.497 | 0.960 |
| +Self-RedTeam | 0.234 | 0.056 | 0.885 | 0.603 | 0.156 | 0.655 | 0.559 | 0.932 |
| +LoRA r=16 (step 50) | 0.240 | 0.097 | 0.947 | 0.630 | 0.197 | 0.600 | 0.518 | 0.944 |
| +LoRA r=16 (step 100) | 0.196 | 0.056 | 0.867 | 0.520 | 0.113 | 0.605 | 0.594 | 0.944 |
| +LoRA r=32 (step 50) | 0.240 | 0.073 | 0.913 | 0.573 | 0.147 | 0.645 | 0.552 | 0.952 |
| +LoRA r=32 (step 100) | 0.122 | 0.019 | 0.721 | 0.320 | 0.088 | 0.680 | 0.669 | 0.952 |
| Improv. Self-RedTeam (%) | +17.0 | +52.9 | +6.7 | +10.9 | +35.3 | +7.4 | +12.5 | -2.9 |
| Improv. LoRA r=16 (50) (%) | +14.9 | +18.5 | +0.2 | +6.9 | +18.3 | -1.6 | +4.2 | -1.7 |
| Improv. LoRA r=16 (100) (%) | +30.5 | +52.9 | +8.6 | +23.2 | +53.1 | -0.8 | +19.5 | -1.7 |
| Improv. LoRA r=32 (50) (%) | +14.9 | +38.7 | +3.8 | +15.4 | +39.0 | +5.7 | +11.1 | -0.8 |
| Improv. LoRA r=32 (100) (%) | +56.7 | +84.0 | +24.0 | +52.7 | +63.5 | +11.5 | +34.6 | -0.8 |

Table 6: Comparative performance of Qwen2.5-3B-IT, Self-RedTeam fine-tuned, and ABS models on six benchmarks for harmful refusal, two for benign compliance, and five (average) for instruction following and reasoning. Improvement is calculated relative to the base model. Blue indicates a shift in the desired direction (lower ASR or higher RTA/Compliance), while red indicates regression.

Columns reporting ASR or RTA fall under Harmful Refusal; the Comply column falls under Benign Compliance.

| Method | WG:Test adv ASR ↓ | WG:Test vani ASR ↓ | WJB adv ASR ↓ | DAN adv ASR ↓ | HarmBench adv ASR ↓ | XSTest vani RTA ↑ | StrongREJECT vani RTA ↑ | XSTest vani Comply ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-IT | 0.279 | 0.040 | 0.897 | 0.613 | 0.234 | 0.915 | 0.470 | 0.912 |
| +Self-RedTeam | 0.178 | 0.002 | 0.726 | 0.423 | 0.156 | 0.925 | 0.371 | 0.888 |
| +LoRA r=16 (step 50) | 0.285 | 0.024 | 0.842 | 0.503 | 0.219 | 0.905 | 0.563 | 0.896 |
| +LoRA r=16 (step 100) | 0.169 | 0.010 | 0.756 | 0.383 | 0.156 | 0.915 | 0.644 | 0.884 |
| +LoRA r=32 (step 50) | 0.217 | 0.022 | 0.802 | 0.437 | 0.194 | 0.915 | 0.619 | 0.904 |
| +LoRA r=32 (step 100) | 0.807 | 0.808 | 1.000 | 1.000 | 0.984 | 0.220 | 0.017 | 0.032 |
| Improv. Self-RedTeam (%) | +36.18 | +94.19 | +19.12 | +30.98 | +33.36 | +1.09 | -21.24 | -2.63 |
| Improv. LoRA r=16 (50) (%) | -2.15 | +41.16 | +6.19 | +17.94 | +6.66 | -1.09 | +19.73 | -1.75 |
| Improv. LoRA r=16 (100) (%) | +39.37 | +76.51 | +15.77 | +37.50 | +33.36 | 0.00 | +36.97 | -3.07 |
| Improv. LoRA r=32 (50) (%) | +22.34 | +47.22 | +10.65 | +28.80 | +17.36 | 0.00 | +31.68 | -0.88 |
| Improv. LoRA r=32 (100) (%) | -189.39 | -1857.14 | -11.43 | -63.05 | -319.97 | -75.96 | -96.47 | -96.49 |

Table 7: Comparative performance of Qwen2.5-7B-IT, Self-RedTeam fine-tuned, and ABS models on six benchmarks for harmful refusal, two for benign compliance, and five (average) for instruction following and reasoning. Improvement is calculated relative to the base model. Blue indicates a shift in the desired direction (lower ASR or higher RTA/Compliance), while red indicates regression. Note that the last ABS training run collapsed, and its performance degraded to random guessing on all benchmarks.

Columns reporting ASR or RTA fall under Harmful Refusal; the Comply column falls under Benign Compliance.

| Method | WG:Test adv ASR ↓ | WG:Test vani ASR ↓ | WJB adv ASR ↓ | DAN adv ASR ↓ | HarmBench adv ASR ↓ | XSTest vani RTA ↑ | StrongREJECT vani RTA ↑ | XSTest vani Comply ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-14B-IT | 0.146 | 0.017 | 0.845 | 0.647 | 0.119 | 0.735 | 0.647 | 0.984 |
| +Self-RedTeam | 0.116 | 0.005 | 0.661 | 0.467 | 0.056 | 0.775 | 0.701 | 0.968 |
| +LoRA r=32 (step 50) | 0.137 | 0.020 | 0.828 | 0.600 | 0.115 | 0.770 | 0.654 | 0.988 |
| +LoRA r=32 (step 100) | 0.119 | 0.015 | 0.779 | 0.487 | 0.084 | 0.750 | 0.725 | 0.980 |
| +LoRA r=32 (step 125) | 0.131 | 0.022 | 0.706 | 0.446 | 0.093 | 0.770 | 0.748 | 0.968 |
| Improv. Self-RedTeam (%) | 20.55 | 70.59 | 21.77 | 27.82 | 52.94 | 5.44 | 8.35 | -1.63 |
| Improv. LoRA r=32 (50) (%) | 6.16 | -17.65 | 2.01 | 7.26 | 3.36 | 4.76 | 1.08 | 0.41 |
| Improv. LoRA r=32 (100) (%) | 18.49 | 11.76 | 7.81 | 24.73 | 29.41 | 2.00 | 12.05 | 0.4 |
| Improv. LoRA r=32 (125) (%) | 10.27 | -29.41 | 16.44 | 31.07 | 21.85 | 4.76 | 15.61 | -1.63 |

Table 8: Comparative performance of Qwen2.5-14B-IT, Self-RedTeam fine-tuned, and ABS models on six benchmarks for harmful refusal, two for benign compliance, and five (average) for instruction following and reasoning. Improvement is calculated relative to the base model. Blue indicates a shift in the desired direction (lower ASR or higher RTA/Compliance), while red indicates regression.

| Model | GSM8K | Common Sense | BoolQ | ARC Ch. | HellaSwag |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B | 0.816 | 0.699 | 0.753 | 0.752 | 0.706 |
| Self-RedTeam Qwen2.5-3B | 0.796 | 0.503\* | 0.69\* | 0.668 | 0.7 |
| Safe-Qwen2.5-3B | 0.83 | 0.7 | 0.716 | 0.729 | 0.713 |
| Qwen2.5-7B | 0.873 | 0.86 | 0.83 | 0.876 | 0.81 |
| Self-RedTeam Qwen2.5-7B | 0.856 | 0.75\* | 0.83 | 0.842 | 0.813 |
| LoRA Qwen2.5-7B | 0.873 | 0.86 | 0.823 | 0.866 | 0.813 |
| Qwen2.5-14B | 0.91 | 0.85 | 0.836 | 0.906 | 0.87 |
| Self-RedTeam Qwen2.5-14B | 0.91 | 0.836 | 0.843 | 0.923 | 0.88 |
| LoRA Qwen2.5-14B | 0.91 | 0.85 | 0.836 | 0.906 | 0.87 |

Table 9: Comparison of the performance of Qwen-3B, 7B, and 14B with the corresponding Self-RedTeam and LoRA versions on common reasoning benchmarks. Best results across model size are in bold; results with \* are statistically lower than the worst performing model (McNemar + Bonferroni correction with p-value at 0.05). Runs are on 300 samples.
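Table 9 marks results with \* using a McNemar test with Bonferroni correction at 0.05. The paper does not say which McNemar variant was used, so the following is only a sketch that assumes the exact (binomial) two-sided form applied to paired per-item correctness. The function names and the discordant counts are hypothetical and are not values from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items model A answers correctly and model B answers incorrectly.
    c: items model A answers incorrectly and model B answers correctly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each test in the family."""
    m = len(p_values)
    return [(min(1.0, p * m), p * m < alpha) for p in p_values]


# Hypothetical discordant counts (b, c) for three paired comparisons,
# each imagined on the same 300 evaluation items.
pairs = [(40, 12), (22, 18), (30, 25)]
p_values = [mcnemar_exact(b, c) for b, c in pairs]
for (b, c), (p_adj, significant) in zip(pairs, bonferroni(p_values)):
    print(f"b={b:3d} c={c:3d} adjusted p={p_adj:.4f} significant={significant}")
```

Only the discordant pairs enter the test, which is why two models with similar aggregate accuracy on 300 samples can still differ significantly if they disagree on many individual items in one direction.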
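The captions of Tables 6 to 8 state that improvement is calculated relative to the base model, but they do not give the formula. The sketch below assumes the usual relative change, with the sign flipped for lower-is-better metrics (ASR) so that a positive value always means a shift in the desired direction. The function name `relative_improvement` is illustrative and not taken from the paper. Under that assumption it reproduces three Table 6 entries for the LoRA r=32 (step 100) checkpoint.

```python
def relative_improvement(base: float, new: float, lower_is_better: bool) -> float:
    """Percentage change relative to the base score; positive means better."""
    if base == 0:
        raise ValueError("base score must be non-zero")
    delta = (base - new) if lower_is_better else (new - base)
    return 100.0 * delta / base


# Scores copied from the Qwen2.5-3B-IT and "+LoRA r=32 (step 100)" rows of Table 6.
examples = [
    ("WG:Test adv ASR", 0.282, 0.122, True),    # lower ASR is better
    ("XSTest vani RTA", 0.610, 0.680, False),   # higher RTA is better
    ("XSTest Comply", 0.960, 0.952, False),     # higher compliance is better
]

for name, base, new, lower in examples:
    print(f"{name}: {relative_improvement(base, new, lower):+.1f}%")
# WG:Test adv ASR: +56.7%
# XSTest vani RTA: +11.5%
# XSTest Comply: -0.8%
```

Because the tables report scores rounded to three decimals, recomputing from the printed values can differ slightly from the reported percentages when the base score is very small.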