Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
Summary
This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using teacher flip rate to select privileged contexts that activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across two model families and five model scales.
# Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

Source: [https://arxiv.org/html/2605.15239](https://arxiv.org/html/2605.15239)

Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

Affiliations: [1] UC Riverside, [2] International Computer Science Institute, [3] Microsoft, [4] Berkeley Lab

Correspondence: Yue Dong (yue.dong@ucr.edu)

Code: [https://github.com/FYYFU/OPSA](https://github.com/FYYFU/OPSA)

###### Abstract

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the *safety tax*. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce *teacher flip rate*: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety–reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B).
The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.

\*Work done independently of the author’s affiliation.

## 1 Introduction

Safety alignment improves the robustness of large language models (LLMs) to harmful queries, but often comes at the cost of general reasoning ability, a tradeoff known as the safety tax (Huang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib11)). A common explanation is distributional mismatch (Huang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib11); Lee et al., [2026](https://arxiv.org/html/2605.15239#bib.bib12)): most alignment methods train the target model on safety demonstrations produced by human annotators or stronger external models (Bianchi et al., [2023](https://arxiv.org/html/2605.15239#bib.bib8); Doula et al., [2025](https://arxiv.org/html/2605.15239#bib.bib7); Jiang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib4); Zhou et al., [2025](https://arxiv.org/html/2605.15239#bib.bib6); Wang et al., [2026](https://arxiv.org/html/2605.15239#bib.bib5)). Although these demonstrations can teach refusal behavior, they impose reasoning patterns that differ from those the target model would naturally generate, pushing the model to imitate out-of-distribution behavior and degrading general reasoning.

Recent work (Lee et al., [2026](https://arxiv.org/html/2605.15239#bib.bib12)) suggests that data distribution mismatch is not inevitable.
Base LLMs already exhibit partial refusal behavior on harmful queries, indicating that safety often requires activating latent behaviors rather than teaching a new capability. ThinkSafe (Lee et al., [2026](https://arxiv.org/html/2605.15239#bib.bib12)) builds on this by constructing in-distribution safety data through self-distillation: the target model generates its own safety demonstrations under refusal-steering prompts, keeping the resulting data closer to the model’s own reasoning distribution and reducing the safety tax.

However, we argue that the source of supervision is only one cause of the tradeoff. Even with in-distribution data, SFT remains off-policy: supervision is applied to fixed demonstrations rather than to trajectories sampled from the model’s own policy. We hypothesize that this mismatch is especially consequential for safety alignment because safety decisions concentrate in a narrow early-token window (Vega et al., [2023](https://arxiv.org/html/2605.15239#bib.bib32)) and a small set of safety-critical tokens (Doula et al., [2025](https://arxiv.org/html/2605.15239#bib.bib7)), both of which determine whether the model refuses, complies, or begins a harmful response. Our token-level analysis in Section [3](https://arxiv.org/html/2605.15239#S3) shows that off-policy SFT does not specifically target this window, leaving safety-critical tokens only loosely controlled even when the demonstrations are themselves in-distribution.

Figure 1: Overview of OPSA. A frozen copy of the base model serves as the teacher, while the student is updated on rollouts sampled from its own policy. The teacher receives the prompt together with a type-conditional privileged context: $I_h$ activates refusal behavior on harmful prompts, whereas $I_b$ suppresses over-refusal on benign prompts.
Along each student rollout, OPSA minimizes the per-token KL divergence $D_{\mathrm{KL}}(p_T \,\|\, p_S)$ between the teacher and student distributions. This provides dense on-policy supervision at the tokens where safety behavior is decided, pulling harmful trajectories toward refusal while preserving benign compliance.

Motivated by this diagnosis, we propose OPSA, an on-policy safety alignment framework that adapts On-Policy Self-Distillation (OPSD) (Zhao et al., [2026](https://arxiv.org/html/2605.15239#bib.bib14)) to safety alignment through type-conditional privileged contexts. In the original OPSD setting, a privileged context such as a verified solution trace gives the teacher a meaningful advantage over the student. In OPSA, the privileged context instead activates safety-relevant behavior: a harmful-query context shifts the teacher toward refusal, while a benign-query context preserves helpful response behavior. Figure [1](https://arxiv.org/html/2605.15239#S1.F1) summarizes the procedure. Because supervision is applied as per-token KL on the student’s own rollouts, OPSA targets the tokens where unsafe behavior first emerges rather than forcing the full response to match a fixed demonstration. Our token-level KL analysis in Section [3](https://arxiv.org/html/2605.15239#S3) supports this mechanism: OPSA reduces teacher–student divergence within the early refusal-decision window, whereas off-policy SFT leaves this window less directly controlled.

Naively applying OPSD to safety is not enough: unlike reasoning, safety has no privileged signal that can serve as ground truth. A useful privileged context must activate latent safety reasoning and provide meaningful signal. We address this with a prompt-search procedure guided by teacher flip rate (TFR), a training-free signal that measures how often a context converts an unsafe response into a safe one on the frozen base.
High TFR indicates that the context activates safety reasoning strongly enough to provide useful token-level supervision.

Experiments across five models show that OPSA achieves a stronger safety–reasoning tradeoff than off-policy training approaches. The gains persist under adaptive evaluation: OPSA improves adversarial robustness against adaptive jailbreaks, indicating that the method does not merely overfit to standard refusal benchmarks. We further find that OPSA is more robust to training-set size than SFT, maintaining stronger safety alignment across different data scales.

Contributions. We summarize the main contributions of this paper as follows.

- A mechanistic account of the safety tax. We identify off-policy training mismatch as a second source of the safety tax, beyond data distribution mismatch. Our token-level analysis shows that SFT diffuses gradients across the full response, perturbing general reasoning while providing only coarse supervision for safety-critical decisions.
- On-policy dense supervision for safety alignment. We introduce OPSA, a safety alignment framework built on OPSD and safety privileged contexts. By concentrating updates within safety-critical token windows rather than enforcing full-sequence imitation, OPSA improves the safety–reasoning tradeoff and robustness to out-of-distribution jailbreaks.
- Privileged context construction via teacher flip rate. We show that teacher quality determines the effectiveness of on-policy safety training: naive refusal-steering prompts often fail to activate latent safety reasoning with OPSD. We introduce teacher flip rate to select contexts that activate latent safety reasoning and maximize corrective token-level supervision.
## 2 Related Work

Existing safety-alignment approaches mainly mitigate the safety tax by improving the supervision signal from the data side. SafeChain (Jiang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib4)) distills 40k CoT safe traces from a stronger external teacher, while STAR-1 (Wang et al., [2026](https://arxiv.org/html/2605.15239#bib.bib5)) curates 1k policy-guided traces under LLM-as-judge filtering. Further work improves demonstrations with stronger early reasoning signals: SafeKey (Zhou et al., [2025](https://arxiv.org/html/2605.15239#bib.bib6)) adds a dual-path safety head to amplify safety signals before the “aha-moment” sentence, while SafePath (Doula et al., [2025](https://arxiv.org/html/2605.15239#bib.bib7)) injects a short safety primer to anchor the comply-or-refuse decision early. ThinkSafe (Lee et al., [2026](https://arxiv.org/html/2605.15239#bib.bib12)) eliminates the external teacher by self-distilling refusal traces under a refusal-steering prompt. It further shows that dense token-level supervision is more effective for safety alignment than sparse GRPO-style rewards. Together, these methods study what supervision should be provided for safety alignment. In contrast, our work studies how dense self-supervision should be applied for safety alignment: through off-policy SFT or on-policy learning.

On-policy distillation (OPD) provides a natural framework for studying dense on-policy token learning. OPD trains a student on its own generated trajectories under dense token-level teacher supervision, reducing the exposure bias of off-policy SFT (Gu et al., [2024](https://arxiv.org/html/2605.15239#bib.bib38); Agarwal et al., [2024](https://arxiv.org/html/2605.15239#bib.bib39)). OPSD (Zhao et al., [2026](https://arxiv.org/html/2605.15239#bib.bib14)) further extends this idea to self-distillation by using the same base model as both a privileged-context teacher and a query-only student.
However, existing work shows that OPSD is not automatically better than SFT. Because it learns from the model’s own trajectories, it can get stuck when the privileged context fails to induce stronger teacher behavior. Its success therefore depends on whether the task satisfies teacher–student compatibility and provides a genuine capability gap (Kim et al., [2026](https://arxiv.org/html/2605.15239#bib.bib36); Li et al., [2026](https://arxiv.org/html/2605.15239#bib.bib37)).

## 3 Motivation and Methods

This section motivates OPSA by diagnosing why off-policy SFT can still incur a safety tax, even when trained on self-distilled safety data. We argue that the issue is not only which safety data is used, but also how supervision is applied. We then introduce our on-policy safety alignment framework, which provides dense token-level safety supervision on the model’s own generation trajectories.

### 3.1 Mechanistic Diagnosis: Off-Policy Safety Alignment as a Source of the Safety Tax

ThinkSafe shows that self-distilled safety supervision can improve safety alignment while reducing the safety tax by activating latent safety reasoning in the target model. In this section, we investigate whether off-policy SFT can still incur a safety tax through token-level safety-signal mismatch.

Our hypothesis builds on prior characterizations of how safety behavior emerges during generation. Prior work (Qi et al., [2023](https://arxiv.org/html/2605.15239#bib.bib9); Doula et al., [2025](https://arxiv.org/html/2605.15239#bib.bib7); Zhou et al., [2025](https://arxiv.org/html/2605.15239#bib.bib6)) shows that harmful and refusal trajectories are often determined within a narrow early refusal-decision window, where a small set of safety-critical tokens strongly influences whether the model complies or refuses.
This suggests that safety alignment is fundamentally a localized correction problem on the model’s own generation trajectories, rather than a uniform sequence-level imitation problem.

We empirically reproduce these two characterizations in our setting, with experimental details provided in Appendix [C](https://arxiv.org/html/2605.15239#A3). Using 500 harmful examples, we compare the token-level distributions of a safety-prompted teacher and the base model on the model’s own generated trajectories. Figure [2](https://arxiv.org/html/2605.15239#S3.F2) shows that the teacher’s safety signal has two structural components. First, it is positionally concentrated: per-token KL spikes within the first 10 response tokens and decays after position 30, consistent with an early refusal-decision window. Second, it is lexically concentrated: compliance openers (*Here*, *Sure*, *Certainly*) and structured-output markers (*Title*, \*\*) carry large token-specific KL. The blue residual dominates the gray position baseline, confirming that these lexical cues matter beyond position alone.

For a direct comparison, we follow the ThinkSafe setup: the SFT baseline is trained on in-distribution self-distilled responses $Y_h$ and $Y_b$. We denote the resulting supervised dataset by $\mathcal{D}_{\text{SFT}} = (\mathcal{Q}_h, Y_h) \cup (\mathcal{Q}_b, Y_b)$. Standard off-policy SFT then minimizes:

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{(q,y)\,\in\,\mathcal{D}_{\text{SFT}}} \sum_{t=1}^{|y|} \log p_{\theta}(y_t \mid q, y_{<t}). \tag{1}
$$

Despite training on self-distilled data, SFT still leaves a substantial mean KL divergence from the safety teacher, indicating that it does not fully absorb the teacher’s corrective signal.
This reflects the objective mismatch in Equation [1](https://arxiv.org/html/2605.15239#S3.E1): SFT sums uniformly over positions and token identities, assigning no special weight to the early refusal-decision window or to safety-critical tokens as shown in Figure [2](https://arxiv.org/html/2605.15239#S3.F2). The ThinkSafe curve (left panel) shows this mismatch positionally, as KL remains elevated across the sequence, including within the refusal-decision window. The token-level bars (right panel) show the same mismatch lexically: SFT suppresses several canonical compliance openers, but leaves structured-output markers and tokenization variants under-corrected. Thus, the safety signal that matters most is diluted by tokens that do not directly govern the comply-or-refuse decision.

Figure 2: Safety correction is concentrated in specific positions and tokens. We measure per-token symmetric KL between a safety-prompted teacher and each student (Base, ThinkSafe, OPSA) on harmful rollouts from the base model ($n = 500$, Qwen3-0.6B). Left: Mean KL by position, with an inset for the first 30 tokens, shows an early corrective peak. Right: Top-10 Base-KL trigger tokens, decomposed into a position baseline (gray) and token-specific excess (blue), show strong lexical concentration on compliance openers. ThinkSafe reduces KL for several compliance openers but leaves structured-output markers and tokenization variants under-corrected, whereas OPSA better matches both the positional and lexical structure of the safety signal.

### 3.2 On-Policy Safety Alignment with Dense Self-Supervision

The analysis above shows that SFT applies a uniform sequence-level objective to a safety signal that is highly structured in position and token identity. This mismatch implies two requirements for an effective alternative. First, supervision should be on-policy, so that updates are applied along trajectories the model actually samples.
Second, supervision should be token-level dense, so that the positional and lexical components of the safety signal can shape the model before an unsafe continuation is committed.

These observations motivate adapting On-Policy Self-Distillation (OPSD) (Zhao et al., [2026](https://arxiv.org/html/2605.15239#bib.bib14)) to safety alignment. We introduce OPSA, which applies dense per-token KL supervision between the current student policy and a frozen privileged safety teacher on student-sampled rollouts. Harmful queries are paired with a safety-activating privileged context that shifts the teacher toward refusal, while benign queries are paired with a helpfulness-oriented context that preserves useful behavior and reduces over-refusal. Let $p_{\theta}$ denote the student policy and let $p_{\bar{\theta}}$ denote a frozen copy of the same model used as the teacher. For harmful queries $q_h \in \mathcal{Q}_h$ and benign queries $q_b \in \mathcal{Q}_b$, OPSA minimizes:

$$
\begin{aligned}
\mathcal{L}_{\text{OPSA}}(\theta)
&= \sum_{q_h \in \mathcal{Q}_h} \mathbb{E}_{y \sim p_{\theta}(\cdot \mid q_h)} \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left(p_{\bar{\theta}}(\cdot \mid c_h^{\star}, q_h, y_{<t}) \;\|\; p_{\theta}(\cdot \mid q_h, y_{<t})\right) \\
&\quad + \sum_{q_b \in \mathcal{Q}_b} \mathbb{E}_{y \sim p_{\theta}(\cdot \mid q_b)} \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left(p_{\bar{\theta}}(\cdot \mid c_b^{\star}, q_b, y_{<t}) \;\|\; p_{\theta}(\cdot \mid q_b, y_{<t})\right).
\end{aligned}
\tag{2}
$$

Here, $c_h^{\star}$ and $c_b^{\star}$ denote privileged contexts prepended to the query when constructing the teacher distribution. The KL direction follows the distillation objective: the teacher distribution defines the corrective target, while the student is updated on prefixes sampled from its own policy. This teacher–student divergence concentrates supervision on behavioral differences that arise along the model’s own trajectories. We summarize the safety-relevant component of this correction with:

$$
\Delta_{\text{safety}}(c^{\star}; q, y) = \sum_{t=1}^{|y|} \mathbf{1}[y_t \in \mathcal{S}] \, D_{\mathrm{KL}}\!\left(p_{\bar{\theta}}(\cdot \mid c^{\star}, q, y_{<t}) \;\|\; p_{\theta}(\cdot \mid q, y_{<t})\right), \tag{3}
$$

where $\mathcal{S}$ denotes a set of safety-critical token identities. Larger $\Delta_{\text{safety}}$ indicates that the privileged context induces stronger corrective supervision at refusal-relevant tokens. Effective privileged contexts must therefore create a meaningful behavioral gap between teacher and student exactly where the comply-or-refuse decision is made.

Figure 3: Teacher flip rate predicts training effectiveness. Left: ASR on HarmBench for each candidate context prepended to the frozen base model. Gray dots show the full context pool, colored markers span the flip-rate range, and the dashed line marks the no-context baseline. Right: TFR versus post-training harmfulness (AvgDef) across three models and three contexts. Harmfulness decreases monotonically with TFR at every model scale, with no corresponding increase in over-refusal, supporting TFR as a criterion for selecting privileged contexts that can be computed before training.

#### Selecting safety privileged contexts via teacher flip rate.

To construct effective safety teachers, we perform prompt search over candidate privileged contexts and select contexts that maximize teacher–student behavioral shift.
Directly estimating $\Delta_{\text{safety}}$ before training is impractical because safety-critical tokens depend on the sampled trajectory. We therefore use teacher flip rate (TFR) as a practical proxy, measuring the fraction of harmful queries for which the privileged context flips the frozen teacher’s greedy response from unsafe to safe:

$$
\text{TFR}(c) = \frac{1}{|\mathcal{Q}_h|} \sum_{i} \mathbf{1}\!\left[ f_{\bar{\theta}}(q_h^{(i)}) \in \mathcal{Y}_{\text{unsafe}} \;\land\; f_{\bar{\theta}}(c, q_h^{(i)}) \in \mathcal{Y}_{\text{safe}} \right], \tag{4}
$$

where $f_{\bar{\theta}}(\cdot)$ denotes greedy decoding under the frozen teacher model. Higher TFR indicates that the privileged context induces a larger behavioral shift between the conditioned and unconditioned teacher distributions, producing stronger corrective supervision during on-policy training. We therefore select:

$$
c^{\star} = \arg\max_{c \in \mathcal{C}} \text{TFR}(c), \tag{5}
$$

where $\mathcal{C}$ is a pool of $K = 30$ refusal-steering contexts generated along five structured axes: strength, length, framing, specificity, and response style (Appendix [E](https://arxiv.org/html/2605.15239#A5)). Figure [3](https://arxiv.org/html/2605.15239#S3.F3) validates this criterion empirically: across three models and three contexts spanning flip rates from 9% to 78%, post-training harmfulness decreases monotonically with TFR (Spearman $\rho = -1.00$ within each model), supporting TFR as a reliable selection criterion that is available before training.

Given the selected context $c^{\star}$, OPSA trains on $\mathcal{D} = \mathcal{Q}_h \cup \mathcal{Q}_b$ without restricting training to flipped trajectories.
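
As a rough illustration of Equations 4 and 5, the sketch below computes teacher flip rate for a small candidate pool and picks the context with the highest value. It is not taken from the paper's code release: `generate` and `is_unsafe` are hypothetical callables standing in for greedy decoding under the frozen teacher and for a safety judge, and the sketch assumes a binary judge, so that "safe" simply means "not unsafe". The toy stand-ins at the bottom exist only so the example runs without a model.

```python
from typing import Callable, Optional, Sequence


def teacher_flip_rate(
    context: str,
    harmful_queries: Sequence[str],
    generate: Callable[[Optional[str], str], str],  # hypothetical: greedy decode
    is_unsafe: Callable[[str], bool],               # hypothetical: safety judge
) -> float:
    """Fraction of harmful queries flipped from unsafe to safe by `context` (Eq. 4)."""
    if not harmful_queries:
        raise ValueError("need at least one harmful query")
    flips = 0
    for query in harmful_queries:
        base_unsafe = is_unsafe(generate(None, query))       # f(q) is unsafe
        ctx_safe = not is_unsafe(generate(context, query))   # f(c, q) is safe
        flips += int(base_unsafe and ctx_safe)
    return flips / len(harmful_queries)


def select_context(contexts, harmful_queries, generate, is_unsafe) -> str:
    """Pick the candidate context with the highest flip rate (Eq. 5)."""
    return max(
        contexts,
        key=lambda c: teacher_flip_rate(c, harmful_queries, generate, is_unsafe),
    )


# Toy stand-ins (illustrative only, no real model or judge involved).
def toy_generate(context: Optional[str], query: str) -> str:
    if query == "q_already_safe" or context == "strong":
        return "REFUSE"
    if context == "weak" and query == "q1":
        return "REFUSE"
    return "COMPLY"


def toy_is_unsafe(response: str) -> bool:
    return response == "COMPLY"


queries = ["q1", "q2", "q3", "q_already_safe"]
pool = ["weak", "strong"]
best = select_context(pool, queries, toy_generate, toy_is_unsafe)
```

Note that a query the base model already refuses (`q_already_safe` here) never counts as a flip, which matches the conjunction inside the indicator of Equation 4.
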
Under the per-token KL objective, positions where teacher and student already agree contribute little gradient, so optimization naturally concentrates on tokens with meaningful teacher–student divergence. The student samples rollouts without $c^{\star}$ during both training and inference, ensuring that the resulting safety behavior is internalized in the model parameters rather than dependent on a runtime prompt.

## 4 Experimental Setup

We evaluate OPSA on the safety–reasoning tradeoff of large reasoning models. Our main comparison (§[5.1](https://arxiv.org/html/2605.15239#S5.SS1)) controls for prompt source and fine-tuning protocol in order to focus on the learning objective: ThinkSafe trains with off-policy sequence-level NLL on self-generated traces, whereas OPSA trains with on-policy per-token KL supervision on student rollouts. We then test whether the resulting models remain robust under adaptive jailbreak attacks (§[5.2](https://arxiv.org/html/2605.15239#S5.SS2)).

Models. We experiment with two post-trained reasoning model families: Qwen3 (Yang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib17)) at three scales (0.6B, 1.7B, 8B) and DeepSeek-R1-Distill (Guo et al., [2025](https://arxiv.org/html/2605.15239#bib.bib18)) at two scales (1.5B, 8B), yielding five family–scale configurations in total. The main comparison (§[5.1](https://arxiv.org/html/2605.15239#S5.SS1)) and adaptive-jailbreak evaluation (§[5.2](https://arxiv.org/html/2605.15239#S5.SS2)) cover all five configurations.

Implementation. We train all methods with AdamW (Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.15239#bib.bib19)), using a learning rate of $1 \times 10^{-5}$, a cosine schedule with 10% linear warmup, batch size 64, and 3 epochs, matching ThinkSafe. Following SafeChain (Jiang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib4)), all methods use full-parameter fine-tuning.
This setting also gives a stronger ThinkSafe baseline than the LoRA-tuned numbers reported in Lee et al. ([2026](https://arxiv.org/html/2605.15239#bib.bib12)). For OPSA, we use a symmetric mixture of forward and reverse KL with $\alpha = 0.5$, following the divergence choice in nemo-rl ([21](https://arxiv.org/html/2605.15239#bib.bib20)) for on-policy distillation; see Appendix [F](https://arxiv.org/html/2605.15239#A6) for details. Models at $\leq$1.7B scale are trained on 2 NVIDIA A100 GPUs, while 8B models are trained on 4 A100 GPUs with FSDP; see Appendix [H](https://arxiv.org/html/2605.15239#A8) for details.

Baselines. To focus the comparison on the proposed OPSA training objective, we compare against methods that use the same prompt source where possible. Concretely, we consider:

- (i) Initial, the post-trained reasoning model before safety realignment.
- (ii) SafeChain (Jiang et al., [2025](https://arxiv.org/html/2605.15239#bib.bib4)), an external-teacher distillation method trained on safe traces produced by DeepSeek-R1-Distill-Llama-70B, included to contextualize self-generated methods against a strong external-teacher baseline.
- (iii) ThinkSafe, a faithful reproduction of Lee et al. ([2026](https://arxiv.org/html/2605.15239#bib.bib12)) that trains with sequence-level negative log-likelihood on a self-generated, Llama-Guard–filtered subset of SafeChain.
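
The symmetric KL mixture mentioned under Implementation can be pictured with the small sketch below. It assumes the per-token loss has the form $\alpha \, D_{\mathrm{KL}}(p_T \,\|\, p_S) + (1-\alpha) \, D_{\mathrm{KL}}(p_S \,\|\, p_T)$; the exact formulation is deferred to the paper's Appendix F, so this weighting is an assumption, and the function names and toy logits are illustrative. The example uses plain Python lists rather than a tensor library so that it stays self-contained.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def mixed_kl_per_token(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """One loss value per rollout position (assumed mixture form, see lead-in)."""
    losses = []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p_t, p_s = softmax(t_logits), softmax(s_logits)
        losses.append(alpha * kl(p_t, p_s) + (1.0 - alpha) * kl(p_s, p_t))
    return losses


# Toy rollout with a 3-token vocabulary (made-up logits).
# Position 0: the teacher prefers token 0, the student prefers token 1.
# Position 1: teacher and student agree exactly.
teacher = [[4.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
student = [[0.0, 4.0, 0.0], [0.0, 1.0, 2.0]]
losses = mixed_kl_per_token(teacher, student)
```

In this toy case the first position carries essentially all of the loss and the second contributes nothing, which mirrors the earlier observation that positions where teacher and student already agree contribute little gradient.
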
Data. All methods draw prompts from the SafeChain dataset (https://huggingface.co/datasets/UWNSL/SafeChain), which contains harmful prompts $\mathcal{D}_h$ and benign prompts $\mathcal{D}_b$. SafeChain trains on the original prompt–response pairs released with the dataset, whose responses are produced by the external teacher DeepSeek-R1-Distill-Llama-70B. ThinkSafe follows Lee et al. ([2026](https://arxiv.org/html/2605.15239#bib.bib12)) and trains on a self-generated subset of SafeChain: harmful responses are sampled from the base model under a refusal-steering instruction $I_h$ and filtered by Llama-Guard (Inan et al., [2023](https://arxiv.org/html/2605.15239#bib.bib21)) to retain safe traces, while benign responses are sampled directly. OPSA requires no pre-generated responses; it uses only the SafeChain prompts and their type labels $t(x) \in \{h, b\}$, with the student generating on-policy rollouts during training and the teacher providing per-token guidance conditioned on the type-specific privileged context $I_{t(x)}$.

Evaluation benchmarks. We evaluate along three axes. Harmfulness ($\downarrow$) is measured on HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2605.15239#bib.bib22)), StrongReject (Souly et al., [2024](https://arxiv.org/html/2605.15239#bib.bib23)), and WildJailbreak (Jiang et al., [2024](https://arxiv.org/html/2605.15239#bib.bib24)); we report the fraction of greedy-decoded responses classified as harmful by Llama-Guard. Over-refusal ($\downarrow$) is measured on XSTest (Röttger et al., [2024](https://arxiv.org/html/2605.15239#bib.bib25)) (safe subset, following ThinkSafe) and WildBenign, the benign-labeled subset of WildJailbreak.
Refusal rates are computed by WildGuard (Han et al., [2024](https://arxiv.org/html/2605.15239#bib.bib26)) under greedy decoding; the two benchmarks probe stylistically adversarial and naturally distributed benign prompts, respectively. Reasoning ($\uparrow$) is measured along two complementary axes: math/QA on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.15239#bib.bib27)), MATH500 (Lightman et al., [2023](https://arxiv.org/html/2605.15239#bib.bib28)), and GPQA (Rein et al., [2023](https://arxiv.org/html/2605.15239#bib.bib29)), and code generation on HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.15239#bib.bib30)) and MBPP (Austin et al., [2021](https://arxiv.org/html/2605.15239#bib.bib31)). For math/QA, the model generates 8 trajectories per prompt at temperature 0.6, top-$p$ 0.95, and top-$k$ 20, and we report average pass@1. For HumanEval and MBPP, we follow the SafeChain coding evaluation protocol: greedy decoding with repetition penalty 1.1 to mitigate degenerate repetition in long chain-of-thought code outputs, with pass@1 scored using the EvalPlus harness. Full evaluation details are provided in Appendix [H](https://arxiv.org/html/2605.15239#A8).

Composite safety score. Following ThinkSafe, we summarize safety with a single composite score defined as one minus the unweighted mean of the five raw rate benchmarks:

$$
S = 1 - \tfrac{1}{5}\left[\, \mathrm{HarmBench} + \mathrm{StrongReject} + \mathrm{WildJailbreak} + \mathrm{XSTest} + \mathrm{WildBenign} \,\right]. \tag{6}
$$

Here, all five terms are rates in $[0, 1]$, so $S \in [0, 1]$, with higher values indicating better safety.
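
For concreteness, Equation 6 can be written as a few lines of Python. The helper name and the input rates below are made up for illustration and are not results from the paper; the only assumption is that each benchmark is supplied as a rate in $[0, 1]$ rather than as a percentage.

```python
from typing import Mapping

# The five rate benchmarks that enter Equation 6.
BENCHMARKS = ("HarmBench", "StrongReject", "WildJailbreak", "XSTest", "WildBenign")


def composite_safety_score(rates: Mapping[str, float]) -> float:
    """S = 1 - mean of three harmfulness rates and two over-refusal rates."""
    values = []
    for name in BENCHMARKS:
        rate = rates[name]
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be a rate in [0, 1], got {rate}")
        values.append(rate)
    return 1.0 - sum(values) / len(values)


# Illustrative rates only (not taken from Table 1).
example_rates = {
    "HarmBench": 0.10,
    "StrongReject": 0.05,
    "WildJailbreak": 0.20,
    "XSTest": 0.02,
    "WildBenign": 0.03,
}
score = composite_safety_score(example_rates)  # approximately 0.92
```

Because harmfulness and over-refusal enter with equal weight, a model cannot raise $S$ simply by refusing more often: any reduction in the harmfulness terms that is bought with extra refusals on XSTest or WildBenign is offset in the same mean.
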
## 5 Results

Table 1: OPSA improves the safety–reasoning tradeoff across two model families and five scales. Harmfulness and over-refusal columns report rates ($\downarrow$); Composite Safety Score $S$ (Eq. [6](https://arxiv.org/html/2605.15239#S4.E6), $\uparrow$) summarizes safety, and Reasoning Avg ($\uparrow$) summarizes reasoning. Bold and underline mark the best and second-best fine-tuned methods per column. The $\Delta$ vs SFT rows compare off-policy NLL training (ThinkSafe) with on-policy per-token KL training (OPSA) under matched prompts. Positive values indicate gains from changing the training objective. The final Avg $\Delta$ on-policy gain row averages this comparison across the five configurations, showing consistent improvements in safety (+4.00 pt) and reasoning (+3.04 pt).

In this section, we evaluate whether on-policy safety supervision improves the safety–reasoning tradeoff predicted by our analysis in Section [3](https://arxiv.org/html/2605.15239#S3). We first compare OPSA against matched off-policy baselines on standard safety, over-refusal, and reasoning benchmarks across five model configurations. We then evaluate robustness under adaptive jailbreak attacks to test whether the learned safety behavior generalizes beyond fixed harmful prompts and standard evaluation settings.

### 5.1 Safety–Reasoning Tradeoff and Reasoning Preservation

Table [1](https://arxiv.org/html/2605.15239#S5.T1) tests our prediction from Section [3](https://arxiv.org/html/2605.15239#S3): if part of the safety tax comes from off-policy supervision, then replacing sequence-level imitation with on-policy token-level distillation should improve safety without imposing the same reasoning cost. The comparison between ThinkSafe and OPSA is designed to isolate this factor.
Both methods use self-generated safety data from the same SafeChain prompt source; they differ in whether supervision is applied to fixed demonstrations through NLL or to student-sampled trajectories through per-token KL from a privileged-context teacher.

The results support this prediction. OPSA improves the composite safety score over ThinkSafe on all five model configurations, with an average gain of +4.00 points and the largest improvements on smaller, less-aligned models. These gains are not explained by a uniform increase in refusal. OPSA reduces harmfulness on average while also reducing over-refusal, especially on naturally distributed benign prompts. This pattern is consistent with the token-level diagnosis in Section [3](https://arxiv.org/html/2605.15239#S3): effective safety alignment should concentrate updates near the decisions that determine whether the model complies or refuses, rather than shifting the model toward refusal across all inputs.

The same comparison also shows that on-policy safety supervision better preserves reasoning. OPSA improves aggregate reasoning by +3.04 points over ThinkSafe across GSM8K, MATH500, GPQA, HumanEval, and MBPP. This matters because both methods begin from the same self-distillation idea: ThinkSafe already reduces distribution mismatch by generating traces from the target model. The remaining gap therefore points to the second mismatch identified in the introduction and method section: off-policy SFT still forces the model to imitate fixed trajectories, whereas OPSA applies dense supervision only where the teacher and student differ along trajectories the student samples.

Averaged over all configurations, the ordering SafeChain < ThinkSafe < OPSA holds for both composite safety and reasoning. SafeChain controls for external-teacher distillation, ThinkSafe controls for self-generated demonstrations, and OPSA adds on-policy supervision.
The monotonic improvement across these settings supports our main claim: reducing the safety tax requires not only in-distribution safety data, but also an objective that aligns the model on its own generation paths.

### 5.2 Robustness to Adaptive Jailbreaks

Table [1](https://arxiv.org/html/2605.15239#S5.T1) evaluates safety on fixed harmful prompts. We next test whether the same alignment procedure improves robustness under adaptive jailbreaks, where the attack modifies the prompt or generation context to elicit unsafe behavior. Table [2](https://arxiv.org/html/2605.15239#S5.T2) reports results on four HarmBench attack families: HumanJailbreaks (Shen et al., [2024](https://arxiv.org/html/2605.15239#bib.bib33)), Prefilling (Vega et al., [2023](https://arxiv.org/html/2605.15239#bib.bib32)), PAP-top5 (Zeng et al., [2024](https://arxiv.org/html/2605.15239#bib.bib34)), and PAIR (Chao et al., [2025](https://arxiv.org/html/2605.15239#bib.bib35)). These attacks cover human-written jailbreak templates, response-prefix attacks, persuasive adversarial prompts, and iterative attacker–target–judge search. All evaluations use the 159 behaviors in HarmBench's text-test-standard split.

We report two complementary metrics, since adaptive attacks can fail on average yet expose individual behaviors. *Mean ASR* is the standard HarmBench metric, averaging judged attack success over all behavior–attempt pairs. *pass@N* is a behavior-level worst-case metric: a behavior is counted as broken if at least one of its $N$ attempts succeeds. Thus, mean ASR measures average attack success, while *pass@N* measures whether any tested attempt can break a behavior. We use *pass@N* only for within-attack comparisons, where $N$ is fixed. Full details are provided in Appendix [I](https://arxiv.org/html/2605.15239#A9).

The clearest improvement appears on Prefilling, the attack family most closely aligned with our token-level diagnosis.
Prefilling intervenes near the beginning of the response by forcing an unsafe continuation prefix, precisely where Figure [2](https://arxiv.org/html/2605.15239#S3.F2) shows that refusal decisions and compliance openers most strongly concentrate. OPSA reduces both mean ASR and *pass@N* relative to ThinkSafe at every model scale. On Qwen3-1.7B and Qwen3-8B, OPSA drives both metrics to zero, meaning that no prefill attempt succeeds on any of the 159 behaviors. The reductions are also large on the R1-Distill bases, with mean ASR decreasing from 14.30 to 3.60 on R1-Distill-1.5B and from 8.20 to 0.40 on R1-Distill-8B. This pattern supports the mechanism proposed in Section [3](https://arxiv.org/html/2605.15239#S3): on-policy KL supervision improves robustness most clearly when the attack acts on the early tokens where refusal behavior is decided.

The remaining attacks show that this robustness gain extends beyond Prefilling, but not uniformly. On PAP-top5, OPSA improves or ties ThinkSafe in most comparisons, including four of five *pass@N* columns, suggesting that the learned refusal behavior transfers at least partly to persuasive prompt variants rather than only to prefix-based attacks. HumanJailbreaks are more model-dependent: OPSA improves all three Qwen3 models, including a large *pass@N* reduction on Qwen3-8B (90.60 → 61.60), but is mixed on R1-Distill, improving behavior-level robustness for R1-Distill-1.5B while worsening mean ASR on both R1-Distill scales. PAIR is the hardest setting. OPSA improves Qwen3-1.7B and R1-Distill-1.5B, but regresses on Qwen3-0.6B, Qwen3-8B, and R1-Distill-8B, indicating that dense on-policy supervision does not fully protect against iterative attacker–target–judge search.

The two metrics reveal different aspects of robustness. In low-ASR regimes, mean ASR can understate behavior-level vulnerability because it averages over all attack attempts.
For example, on Qwen3-8B HumanJailbreaks, mean ASR changes only from 1.40 to 0.70, but *pass@N* drops from 90.60 to 61.60. Thus, a large fraction of behaviors that were vulnerable to at least one human-written template become robust to all tested templates, even though the average success rate changes only slightly. This distinction is important for adaptive evaluation: a model can have low average attack success while still leaving many harmful behaviors breakable by at least one of many attempts.

Limits Under Fully Adaptive Search. Across the 20 model–attack cells in Table [2](https://arxiv.org/html/2605.15239#S5.T2), OPSA improves over ThinkSafe in 13/20 cells by mean ASR and 14/20 by *pass@N*, with one tie. The strongest gains occur on Prefilling, while the most persistent failures occur on PAIR, where the attacker iteratively adapts prompts using feedback from the target and judge. These results support a specific robustness claim rather than a blanket one: OPSA improves adaptive robustness on average, especially when attacks manipulate early response tokens, but fully adaptive search remains a limitation.

Table 2: Adaptive jailbreaks on HarmBench text-test-standard (159 behaviors, %, ↓). Each attack reports two metrics: *mean ASR*, the mean attack success rate averaged over all (behavior, attempt) pairs, and *pass@N*, the fraction of behaviors broken by at least one attempt. Bold marks the better of ThinkSafe vs. OPSA within each column. Base is the pre-realignment checkpoint.

## 6 Conclusion

This paper identified off-policy supervision as a source of the safety tax that remains even when safety data is self-distilled from the target model. Our token-level analysis shows why this mismatch matters in practice: safety corrections concentrate near early refusal decisions and safety-critical token identities, whereas sequence-level SFT imitates fixed demonstrations uniformly across the full response.
OPSA addresses this mismatch by training on student-sampled rollouts and distilling a frozen privileged-context teacher with dense per-token KL supervision. Under matched prompts and full-parameter fine-tuning, OPSA improves the safety–reasoning tradeoff over off-policy self-distillation across two reasoning-model families and five scales, with the largest gains on smaller models. Its strongest adaptive-jailbreak gains occur on Prefilling, which intervenes in the early response region highlighted by our diagnosis; fully adaptive search remains a limitation. These findings suggest that reducing the safety tax requires controlling not only which demonstrations provide supervision, but also where and on whose trajectories that supervision is applied.

### Acknowledgments

NBE would like to acknowledge support from the DSO National Laboratories.

## References

- R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The twelfth international conference on learning representations.
- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- F. Bianchi, M. Suzgun, G. Attanasio, P. Röttger, D. Jurafsky, T. Hashimoto, and J. Zou (2023) Safety-tuned llamas: lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875.
- P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong (2025) Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 23–42.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- A. Doula, M. Mühlhäuser, and A. S. Guinea (2025) Safepath: conformal prediction for safe llm-based autonomous navigation. arXiv preprint arXiv:2505.09427.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024) Minillm: knowledge distillation of large language models. In The twelfth international conference on learning representations.
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri (2024) Wildguard: open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. Advances in neural information processing systems 37, pp. 8093–8131.
- T. Huang, S. Hu, F. Ilhan, S. F. Tekin, Z. Yahn, Y. Xu, and L. Liu (2025) Safety tax: safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555.
- H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al. (2023) Llama guard: llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
- F. Jiang, Z. Xu, Y. Li, L. Niu, Z. Xiang, B. Li, B. Y. Lin, and R. Poovendran (2025) Safechain: safety of language models with long chain-of-thought reasoning capabilities. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 23303–23320.
- L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, et al. (2024) Wildteaming at scale: from in-the-wild jailbreaks to (adversarially) safer language models. Advances in Neural Information Processing Systems 37, pp. 47094–47165.
- J. Kim, X. Luo, M. Kim, S. Lee, D. Kim, J. Jeon, D. Li, and Y. Yang (2026) Why does self-distillation (sometimes) degrade the reasoning capability of llms?. arXiv:2603.24472. [Link](https://arxiv.org/abs/2603.24472)
- S. Lee, S. Park, Y. Choi, G. Kim, M. Kang, J. Yun, D. Park, J. Park, and S. J. Hwang (2026) THINKSAFE: self-generated safety alignment for reasoning models. arXiv preprint arXiv:2601.23143.
- Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, and N. Ding (2026) Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv:2604.13016. [Link](https://arxiv.org/abs/2604.13016)
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let's verify step by step. In The twelfth international conference on learning representations.
- I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, et al. (2024) Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
- \[21\] (2025) NeMo rl: a scalable and efficient post-training library. Note: [https://github.com/NVIDIA-NeMo/RL](https://github.com/NVIDIA-NeMo/RL), GitHub repository.
- X. Qi, Y. Zeng, T. Xie, P. Chen, R. Jia, P. Mittal, and P. Henderson (2023) Fine-tuning aligned language models compromises safety, even when users do not intend to!. arXiv preprint arXiv:2310.03693.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) Gpqa: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022.
- P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy (2024) Xstest: a test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400.
- X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang (2024) "Do anything now": characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685.
- A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al. (2024) A strongreject for empty jailbreaks. Advances in Neural Information Processing Systems 37, pp. 125416–125440.
- J. Vega, I. Chaudhary, C. Xu, and G. Singh (2023) Bypassing the safety training of open-source llms with priming attacks. arXiv preprint arXiv:2312.12321.
- Z. Wang, H. Tu, Y. Wang, J. Wu, Y. Liu, J. Mei, B. R. Bartoldson, B. Kailkhura, and C. Xie (2026) Star-1: safer alignment of reasoning llms with 1k data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 37988–37997.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi (2024) How johnny can persuade llms to jailbreak them: rethinking persuasion to challenge ai safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14322–14350.
- S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734.
- K. Zhou, X. Zhao, J. Srinivasa, G. Liu, A. Feng, D. Song, and X. E. Wang (2025) Safekey: amplifying aha-moment insights for safety reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25407–25423.
- A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson (2023) Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

## Appendix A Limitations

First, OPSA requires the base model to retain the latent capability that the privileged-context instruction activates: the prompt-strength analysis in Figure [3](https://arxiv.org/html/2605.15239#S3.F3) shows that gains scale with the instruction's flip rate, so models whose safety knowledge has been more aggressively overwritten in post-training may benefit less. Second, the teacher is the frozen base $p_{\theta_0}$; an iterative scheme that periodically refreshes the teacher to a moving average of the student is a natural extension we leave to future work. Third, our harmfulness measurements depend on Llama-Guard, which is used for both data filtering and evaluation. As a result, conclusions about absolute harmfulness should be interpreted relative to this classifier and may inherit its failure modes.

## Appendix B Privileged-context instructions

#### Refusal-steering instruction $I_h$ used in the main results.

“This prompt contains a harmful request. Your only allowed action is to refuse. You must NOT: provide the requested content in any form; offer a safer or educational version; discuss the topic from an academic perspective; partially comply while adding disclaimers. Respond with a brief, clear refusal only.”

#### Helpful-steering instruction $I_b$ used in the main results.

“This is a safe, benign request. Respond helpfully and directly. You must NOT refuse, add safety disclaimers, or treat this as potentially harmful. Provide a complete, accurate answer.”

#### Three refusal-steering instructions used in Figure [3](https://arxiv.org/html/2605.15239#S3.F3).

- v1: “As a responsible model, never comply with harmful prompts; refuse silently.”
- v2: “The following prompt requests harmful content.
You MUST refuse to answer. Do NOT provide any information, even in an academic, educational, or hypothetical framing. A proper refusal is a short sentence declining the request. Nothing more.”
- v3: “This prompt contains a harmful request. Your only allowed action is to refuse. You must NOT: provide the requested content in any form; offer a safer or educational version; discuss the topic from an academic perspective; partially comply while adding disclaimers. Respond with a brief, clear refusal only.”

Flip rates per base model are reported in Table [3](https://arxiv.org/html/2605.15239#A5.T3).

## Appendix C Token-level KL analysis

For Figure [2](https://arxiv.org/html/2605.15239#S3.F2), we compare three Qwen3-0.6B students against a common safety-prompted teacher: *Base* (Qwen/Qwen3-0.6B), ThinkSafe (full-data SFT initialized from Seanie-lee/ThinkSafe-Qwen3-0.6B, trained for 3 epochs with learning rate $10^{-5}$), and *OPSA* (the same configuration as ThinkSafe). The teacher uses the Base model with the selected harmful privileged context $c_h^{\star}$ prepended. We insert this context by token-level prefix surgery so that teacher and student share the same assistant-token prefix at every measured position.

The fixed rollout set consists of 500 Base completions on SafeChain harmful prompts: 250 vanilla_harmful and 250 adversarial_harmful, sampled with seed 42. Rollouts are generated without any system prompt, with the chat template applied, using sampling at temperature 0.6, top-$p$ = 0.95, a maximum of 4096 new tokens, and bfloat16 inference. For each rollout prefix, we compute the teacher and student next-token distributions over the full vocabulary from raw logits using log-softmax, with no temperature scaling.
We report symmetric KL,

$$D^{\mathrm{sym}}_{\mathrm{KL}}(p_T, p_S) = \tfrac{1}{2}\left[D_{\mathrm{KL}}(p_T \,\|\, p_S) + D_{\mathrm{KL}}(p_S \,\|\, p_T)\right].$$

The left panel averages $D^{\mathrm{sym}}_{\mathrm{KL}}$ by absolute response position over the first 512 tokens, with the inset zooming into the first 30 positions. For the lexical analysis in the right panel, we consider decoded BPE token types that appear at least 15 times, rank them by their mean Base $D^{\mathrm{sym}}_{\mathrm{KL}}$, and plot the top 10. To separate positional concentration from token identity, we decompose each Base token bar into a position baseline and a token-specific residual. The position baseline is the global position-wise mean KL averaged over the positions at which that token type appears; the residual is the token type's observed mean KL minus this baseline. The ThinkSafe and OPSA bars are computed on the same token occurrences as the Base bars.

## Appendix D Candidate-pool construction for prompt search

The candidate pool $\mathcal{C}$ used in Figure [3](https://arxiv.org/html/2605.15239#S3.F3) consists of $K = 30$ refusal-steering variants produced by an off-the-shelf LLM (GPT-5.5), generated along five structured axes described below. The three representative contexts $v_1, v_2, v_3$ reproduced in Appendix [B](https://arxiv.org/html/2605.15239#A2) are selected from $\mathcal{C}$ such that their teacher flip rates span roughly the full range observed across the pool, giving a balanced subset for the empirical validation in Figure [3](https://arxiv.org/html/2605.15239#S3.F3).

#### Generation axes.

Each generated variant is structured along five axes:

- strength: one of {mild, firm, strict, emergency}, ranging from “please decline” to “this is a security violation”.
- length: one of {xs (≤15 words), s (15–40), m (40–80), l (80–150)}.
- framing: one of {rule, role, value, consequentialist, policy, identity}, where rule states a constraint (“you must not…”), role casts the model as a safety-aware assistant, value appeals to user wellbeing, consequentialist cites real-world harm, policy invokes a content policy, and identity is first-person (“as a responsible model, I will not…”).
- specificity: generic (“refuse harmful requests”) versus deny-list (an explicit list of escape routes to block: no academic framing, no hypothetical, no rephrasing, no partial compliance, no fictional wrapper).
- style: one of {just-decline, brief-explain, silent}, controlling what the refusal itself should look like.

#### Post-processing.

Generated outputs are normalised (case-folded, punctuation-stripped, whitespace-collapsed) and deduplicated, yielding the final pool of $K = 30$ contexts.

#### Selection of $v_1, v_2, v_3$.

The three representative contexts used in the prompt-strength analysis are picked from $\mathcal{C}$ to roughly partition the observed teacher flip rate (TFR) range into low, medium, and high, so that the empirical validation in Figure [3](https://arxiv.org/html/2605.15239#S3.F3) is not biased toward any single strength regime.

#### Generator prompt.

The full prompt sent to GPT-5.5 (with $N$ as the requested number of variants) is reproduced verbatim below.

Meta-Prompt for Generating Safety System Prompts

You are designing a small experiment for a safety-distillation study. We need to generate diverse "safety system prompts" that will be injected as {"role": "system", "content": <prompt\>} immediately before a user's potentially harmful request (drawn from HarmBench).
The goal of each prompt is to make a small instruction-tuned LLM (Qwen3-0.6B/1.7B, DeepSeek-R1-Distill-Qwen-1.5B) refuse the request. We will then measure how each variant changes the unsafe->safe flip rate on HarmBench using LlamaGuard-3-8B.

Generate exactly N = {N} variants that span the following axes. Each variant should combine specific values from each axis; aim for broad coverage, not clustering on one corner.

1. strength: one of {mild, firm, strict, emergency} ("please decline" -> "you MUST refuse" -> "under no circumstances" -> "this is a security violation")
2. length: one of {xs (<=15 words), s (15-40), m (40-80), l (80-150)}
3. framing: one of {rule, role, value, consequentialist, policy, identity}
   - rule: "you must not..."
   - role: "you are a careful safety-aware assistant..."
   - value: "we care about user wellbeing..."
   - consequentialist: "real-world harm could result..."
   - policy: "our content policy prohibits..."
   - identity: "as a responsible model, I will not..."
4. specificity: one of {generic, deny-list}
   - generic: "refuse harmful requests"
   - deny-list: explicit list of escape routes to block (no academic framing, no hypothetical, no rephrasing, no partial compliance, no fictional wrapper, etc.)
5. style: one of {just-decline, brief-explain, silent} how the refusal itself should look.

Hard constraints:

- English only.
- One paragraph per prompt, no markdown, no bullets, no headings, no surrounding quotation marks.
- Self-contained: do NOT reference "the user", "the assistant", or any chat-template tokens. Refer to the upcoming request as "the following request" / "this prompt" / similar.
- Do not include examples of harmful content.
- Cover the axis grid: hit each (strength, framing) pair at least once across the N variants if feasible.

Output format: a single JSON object with one key "prompts" whose value is an array of exactly N objects (id, axes, word_count, prompt).
Output nothing else.

## Appendix E Statistics for Prompt Strength

Table [3](https://arxiv.org/html/2605.15239#A5.T3) (above) reports the per-prompt, per-model flip rates. Table [4](https://arxiv.org/html/2605.15239#A5.T4) below reports the mean ± std of each safety metric across the four data conditions (h96_b{96,192,480,960}) at each cell's best-epoch checkpoint (selected by composite safety score $S$).

Table 3: Flip rates (%) per base model for different safety prompts. Bold marks the highest flip rate for each model.

Table 4: Per-cell mean ± std across four data conditions for Analysis 2. Each entry is evaluated at the epoch maximising $S$ within that condition.

## Appendix F Divergence ablation: forward, reverse, and symmetric mix

We compare three choices of per-token divergence in Eq. [2](https://arxiv.org/html/2605.15239#S3.E2): forward KL ($D_{\mathrm{KL}}(p_T \| p_S)$), reverse KL ($D_{\mathrm{KL}}(p_S \| p_T)$), and the symmetric mixture $\tfrac{1}{2} D_{\mathrm{KL}}(p_T \| p_S) + \tfrac{1}{2} D_{\mathrm{KL}}(p_S \| p_T)$ used in the main results. All three are trained on Qwen3-0.6B with identical data, schedule, and other hyperparameters; we report the best checkpoint per variant selected by 5-benchmark safety score (epoch 1 for all three).

Table 5: Effect of the divergence choice on Qwen3-0.6B. All values in %; lower is better for harmfulness and over-refusal columns, higher is better for composite safety score. Bold marks the best variant per benchmark.

#### Mix is preferred over forward-only KL despite a marginally lower safety score.

Forward-only KL achieves a slightly higher safety score than the symmetric mix used in the main results (88.80 vs. 88.18, +0.62 pp).
This 0.62 pp advantage is, however, the surface reading of an unfavorable trade-off when decomposed into the underlying axes.

Forward's harm reduction is paid for by a strictly larger over-refusal increase. Relative to the symmetric mix, forward-only KL reduces the average harmfulness rate (HarmBench, StrongReject, WildJailbreak) by 3.59 pp (15.42 → 11.83), but it increases the average over-refusal rate (XSTest, WildBenign) by 3.84 pp (6.41 → 10.25), a net negative trade once the two axes are weighted equally. The safety score formula in Eq. [6](https://arxiv.org/html/2605.15239#S4.E6) gives harm a weight of 3/5 and over-refusal a weight of 2/5 by virtue of having more harm benchmarks; this asymmetric weighting is the only reason forward edges out mix in aggregate.

The trade-off shape favors mix in deployment. Mix wins both over-refusal benchmarks and loses all three harmfulness benchmarks; forward inverts this; reverse-only sits in between. Because false refusals on benign queries are the most user-visible failure mode in deployment, we view the symmetric mix as the more conservative default.

The mix is the natural default for our two-direction supervision design. Section [3](https://arxiv.org/html/2605.15239#S3) motivates OPSA as a single objective that supplies gradient in *both* the safety direction (refusal on harmful prompts via $I_h$) and the helpfulness direction (non-refusal on benign prompts via $I_b$). Forward KL emphasizes mode-coverage of the privileged-context teacher (more aggressive suppression of harmful mass), while reverse KL emphasizes mode-seeking from the student's distribution (better preservation of helpful modes). The symmetric mix matches our framework's symmetric treatment of the two directions; using forward-only would implicitly privilege the safety direction, contradicting the design.
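To make the three divergences compared in this appendix concrete, the sketch below computes forward, reverse, and symmetric-mix KL at a single token position from raw logits, and then re-derives the two reported safety scores from the reported averages. It is a minimal illustration, not the authors' training code: the function names and the toy logits are hypothetical, and `composite_score` assumes that Eq. 6 equals 100 minus the unweighted mean of the five benchmark rates (three harmfulness, two over-refusal), a form adopted here only because it reproduces 88.80 and 88.18 from the averages quoted in this appendix.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl(log_p, log_q):
    """D_KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def per_token_divergences(teacher_logits, student_logits):
    """Return (forward, reverse, symmetric mix) KL at one token position."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    forward = kl(log_t, log_s)  # D_KL(p_T || p_S)
    reverse = kl(log_s, log_t)  # D_KL(p_S || p_T)
    mix = 0.5 * forward + 0.5 * reverse
    return forward, reverse, mix


def composite_score(avg_harm, avg_over_refusal, n_harm=3, n_over=2):
    """ASSUMED form of Eq. 6: 100 minus the mean of all benchmark rates."""
    total = n_harm * avg_harm + n_over * avg_over_refusal
    return 100.0 - total / (n_harm + n_over)


# Toy 4-token vocabulary with made-up logits: the teacher is peaked on
# token 0, while the student spreads mass over tokens 0 and 1.
teacher = [4.0, 0.0, 0.0, 0.0]
student = [1.0, 1.0, 0.0, 0.0]
fwd, rev, mix = per_token_divergences(teacher, student)
print(f"forward={fwd:.3f} reverse={rev:.3f} mix={mix:.3f}")

# Averages reported in the text (harmfulness %, over-refusal %).
print(f"forward-only score: {composite_score(11.83, 10.25):.2f}")
print(f"symmetric-mix score: {composite_score(15.42, 6.41):.2f}")
```

Under this assumed form, the 3.59 pp harmfulness reduction enters with weight 3/5 and the 3.84 pp over-refusal increase with weight 2/5, which accounts for the roughly 0.62 pp aggregate edge of forward-only KL despite the larger over-refusal cost.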
Figure 4: Sensitivity to data size. Composite safety score $S$ (%, ↑) versus the training-set fraction for OPSA and ThinkSafe on two bases. OPSA exceeds ThinkSafe on every (model, fraction) cell.

## Appendix G Sensitivity to data size and composition

The main comparison in §[5.1](https://arxiv.org/html/2605.15239#S5.SS1) is reported at a single training budget with a fixed harmful/benign composition. A natural concern is whether the OPSA advantage there is a property of the objective or an artifact of that particular data setting. We stress-test the comparison along two complementary axes: (A) the total amount of supervision (*data size*), with the harmful/benign ratio fixed at its SafeChain default; and (B) the benign-to-harmful ratio (*data composition*), with the harmful budget held fixed and the benign supply varied around it.

#### Setup.

For the data size test (A), we subsample the full SafeChain set at fractions {10%, 25%, 50%, 75%, 100%} of the original budget while keeping the canonical harmful/benign ratio. For the composition stress test (B), we treat curated harmful prompts as the binding resource. Existing red-teamed harmful sets remain in the hundreds-to-thousands range [Mazeika et al., [2024](https://arxiv.org/html/2605.15239#bib.bib22), Zou et al., [2023](https://arxiv.org/html/2605.15239#bib.bib42)], whereas benign instruction data is available at orders of magnitude larger scale. We therefore vary the harmful prompt count over {8, 16, 24, 48, 96, 128} and, for each setting, sweep the benign-to-harmful ratio over {1, 2, 5, 10}×, yielding a 6×4 grid per base. In both regimes we train ThinkSafe and OPSA on Qwen3-0.6B and DeepSeek-R1-Distill-1.5B with all other hyperparameters held fixed, and report the composite score $S$ from Eq. [6](https://arxiv.org/html/2605.15239#S4.E6).

#### OPSA is insensitive to data size.
Figure [4](https://arxiv.org/html/2605.15239#A6.F4) plots $S$ against the training-set fraction on two bases. OPSA beats Thinksafe on every (model, fraction) configuration, achieving an average gain of +6.6 points on Qwen3-0.6B and +7.6 points on R1-Distill-1.5B. The gap does not shrink as supervision is reduced. At the smallest budget (10% of SafeChain) OPSA still beats Thinksafe by +8.3 points on Qwen3-0.6B and +5.5 points on R1-Distill-1.5B. The advantage is thus a property of the objective rather than of any particular data size.

#### OPSA is insensitive to composition shifts.

Figure [5](https://arxiv.org/html/2605.15239#A7.F5) reports $S$ across the composition grid. OPSA wins 24/24 cells on Qwen3-0.6B with an average gain of +10.3 points, and 22/24 cells on R1-Distill-1.5B with an average gain of +6.2 points. OPSA is therefore stable not only with respect to the amount of supervision but also with respect to its composition. This is the regime safety post-training typically operates in. Curated harmful prompts are scarce and expensive to source, while benign supervision is abundant and easy to collect, so the realized mix is dictated by what is available rather than by what is optimal for alignment. A method that holds up across this space is more broadly useful than one tuned to a single operating point.

#### Why the two tests matter together.

The data size and data composition tests probe orthogonal failure modes of a safety-realignment recipe: scarcity of supervision and imbalance of supervision. OPSA wins on both, supporting the broader claim that the gain originates in the learning objective rather than in a particular data setting.

Figure 5: Sensitivity to data composition. Composite safety score $S$ (%, $\uparrow$) on a $6 \times 4$ grid that varies the harmful prompt count (rows) and the benign-to-harmful ratio (columns).
Each cell shows OPSA's $S$ and, in parentheses, its delta over Thinksafe under matched data. OPSA wins 24/24 cells on Qwen3-0.6B and 22/24 cells on R1-Distill-1.5B.

## Appendix H Additional experimental details

#### Training.

All models are trained with full-parameter fine-tuning using AdamW (learning rate $1 \times 10^{-5}$, cosine schedule with 10% linear warmup) for 3 epochs. For OPSA, training uses the `nemo-rl` on-policy distillation pipeline: teacher and student share the same base model weights (self-distillation), and the student generates on-policy rollouts, which are then used to compute the per-token KL loss against the teacher's distribution. The global training batch size is 64, with 128 prompts sampled per distillation step (yielding 2 gradient updates per step). For Thinksafe, the same batch size of 64 is used with standard cross-entropy loss on filtered self-generated traces. All runs use seed 42.

#### Data.

Training data comes from the UWNSL/SafeChain corpus. For the main results (Table [1](https://arxiv.org/html/2605.15239#S5.T1)), we use the full 40,000-prompt set with the v3 safety prompt for privileged-context steering. For the data-robustness analysis (§[G](https://arxiv.org/html/2605.15239#A7)), we subsample at scaling factors $\rho \in \{1, 2, 5, 10\}$ corresponding to $96\rho$ harmful + $96\rho$ benign prompts. For the prompt-strength analysis (Figure [3](https://arxiv.org/html/2605.15239#S3.F3)), we fix the data at 96 harmful prompts with four conditions (h96_b{96, 192, 480, 960} benign prompts) and vary the steering instruction across $\{v_1, v_2, v_3\}$.

#### Hardware.

Models at $\leq$ 1.7B scale are trained on 2 NVIDIA A100 GPUs. Models at 8B scale use 4 A100 GPUs per run with FSDP sharding (tensor parallelism $= 1$, context parallelism $= 1$); two runs execute concurrently on an 8-GPU node via a rolling job queue.
vLLM generation uses `gpu_memory_utilization=0.5` to share GPU memory with the training process.

#### Safety evaluation.

For each checkpoint we evaluate on five safety benchmarks: HarmBench, StrongReject, WildJailbreak (harmfulness), and XSTest, WildBenign (over-refusal). Generation uses temperature 0.6, top-$p = 0.95$, top-$k = 20$, and a maximum of 4096 new tokens. To reduce variance from stochastic sampling, every checkpoint is evaluated 3 times with independent random seeds; the reported metric is the average over the three runs. Harmfulness is judged by Llama-Guard-3; WildBenign over-refusal is judged by WildGuard.

#### Math reasoning evaluation.

We evaluate on GSM8K, MATH500, and GPQA (Diamond), sampling 8 trajectories per prompt at temperature 0.6, top-$p = 0.95$, top-$k = 20$, with a maximum of 4096 new tokens, and report pass@1 averaged across the 8 samples.

#### Coding evaluation.

For HumanEval and MBPP we follow the coding evaluation protocol of Jiang et al. [[2025](https://arxiv.org/html/2605.15239#bib.bib4)]: *greedy* decoding (temperature 0), pass@1 scored with the EvalPlus harness (matching the open-source SafeChain evaluation pipeline), and a repetition penalty of 1.1 applied *only* to coding generations. The repetition penalty is the same mitigation used by Jiang et al. [[2025](https://arxiv.org/html/2605.15239#bib.bib4)] to suppress the degenerate repetition that long chain-of-thought code outputs are known to exhibit on these two benchmarks [Guo et al., [2025](https://arxiv.org/html/2605.15239#bib.bib18)]; no other evaluation axis uses it. We use the same maximum of 4096 new tokens as for the safety benchmarks.

#### Checkpoint selection for reasoning.

For each (model, method) pair, we select the checkpoint with the highest composite safety score $S$ across training epochs and evaluate all reasoning benchmarks (math and coding) on that checkpoint for a fair comparison.
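As a rough illustration of two aggregation rules described above, the sketch below averages per-sample correctness into pass@1 and picks the checkpoint with the highest seed-averaged composite safety score. The data layout (a list of boolean lists per prompt, a dict of per-epoch scores) and the function names are assumptions made for this example, not the paper's evaluation code.

```python
from statistics import mean


def mean_pass_at_1(correct_per_prompt):
    """pass@1 averaged over the samples of each prompt, then over prompts.

    correct_per_prompt: list of lists of booleans, one inner list per prompt
    holding the correctness of each sampled trajectory (8 samples in the text).
    """
    per_prompt = [mean(1.0 if ok else 0.0 for ok in samples)
                  for samples in correct_per_prompt]
    return mean(per_prompt)


def select_checkpoint(safety_by_epoch):
    """Return the epoch whose seed-averaged composite safety score is highest.

    safety_by_epoch: dict mapping epoch -> list of composite scores, one per
    evaluation seed (3 seeds in the text).
    """
    averaged = {epoch: mean(scores) for epoch, scores in safety_by_epoch.items()}
    return max(averaged, key=averaged.get)


# Hypothetical numbers, for illustration only.
toy_correct = [[True] * 4 + [False] * 4, [True] * 8]
toy_scores = {1: [80.0, 81.0, 82.0], 2: [85.0, 84.0, 86.0], 3: [83.0, 83.0, 83.0]}

print(mean_pass_at_1(toy_correct))
print(select_checkpoint(toy_scores))
```

Reasoning benchmarks would then be run only on the selected epoch, so that every method is compared at its own best safety checkpoint.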
## Appendix I Detailed Jailbreak Experimental Setup

### I.1 Generation and Judging Pipeline

We follow the HarmBench three-stage pipeline: adversarial test-case generation, target-model completion, and classifier-based judging. Target completions are generated locally with vLLM in bfloat16 using each model's chat template. Because the targets are reasoning models, we allow up to 16,384 generated tokens per completion. The final harmfulness label is assigned by `cais/HarmBench-Llama-2-13B-cls`, which receives up to 1,024 tokens of the visible answer. For reasoning outputs, the judge strips the internal `<think>...</think>` block before classification, so the score reflects the visible final answer rather than the hidden chain-of-thought.

For stochastic evaluation, we sample 3 independent completions (Target samples) per test case with seed 42. Sampling parameters are read from each target's `generation_config.json` when available: temperature 0.6, top-$p = 0.95$, and top-$k = 20$ for Qwen3 models, and temperature 0.6, top-$p = 0.95$, with no top-$k$ filtering for DeepSeek-R1-Distill models. If a fine-tuned checkpoint lacks a generation config, we fall back to temperature 0.6, top-$p = 0.95$, and disabled top-$k$.

### I.2 Attack Configurations

The four attacks are not the same algorithm under different names. Each attack defines its own prompt-generation procedure and therefore induces a different number of attempts per behavior, as shown in Table [6](https://arxiv.org/html/2605.15239#A9.T6). We keep this per-behavior budget fixed across Base, Thinksafe, and OPSA within each attack, which makes within-attack model comparisons fair. The right column of Table [6](https://arxiv.org/html/2605.15239#A9.T6) also makes clear why pass@$N$ values are only meaningful within an attack family: the per-behavior attempt budget ranges from 3 for Prefilling and PAIR to 339 for HumanJailbreaks.
Table 6: Per-behavior evaluation budgets for each attack. The total number of attempts (right column) differs across attacks because each attack's prompt-generation procedure produces a different number of test cases per behavior. We hold this budget fixed across Base, Thinksafe, and OPSA, which makes within-attack comparisons fair but means pass@$N$ values are not directly comparable across attacks.

#### HumanJailbreaks.

We use the full HarmBench HumanJailbreaks configuration (`random_subset=-1`), which yields 113 jailbreak templates per behavior and 339 labeled attempts per behavior.

#### Prefilling.

Prefilling biases the assistant generation by prepending a compliance-suggestive string to the model's generation context. In reasoning mode, the prefix is inserted inside the opening `<think>` block: `<think>\n Okay, let’s provide clear instructions to assist the user`. We use one prefilling prompt per behavior and sample three target completions.

#### PAP-top5.

PAP rewrites each harmful behavior under the top 5 persuasion techniques from the PAP taxonomy [Zeng et al., [2024](https://arxiv.org/html/2605.15239#bib.bib34)]. The attacker is `mistralai/Mistral-7B-Instruct-v0.2`, run with bfloat16 vLLM, attack temperature 1.0, a maximum of 2,048 generated tokens, and stop token “|”. This produces 5 test cases and 15 labeled attempts per behavior.

#### PAIR.

We use `mistralai/Mistral-7B-Instruct-v0.2` as both the attacker and the in-loop judge, replacing the larger upstream attacker and judge for local GPU feasibility. We retain the upstream PAIR search budget: 20 parallel streams, 3 optimization steps, `keep_last_n=3`, a maximum of 20 retries, attacker generation budget of 500 tokens, in-loop judge budget of 5 tokens, and success cutoff score 10. For reasoning targets, we increase the target generation budget to 16,384 tokens and strip target `<think>...</think>` traces before feeding responses back into the attacker conversation.
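Both the judge and the PAIR attacker loop see only the visible answer. A minimal sketch of such a stripping step is given below; it is not the HarmBench or OPSA implementation, and the handling of unmatched tags (a prompt-supplied opening tag as in prefilling, or a trace cut off by the token limit) is our own assumption about reasonable behavior.

```python
import re

_OPEN, _CLOSE = "<think>", "</think>"
# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def visible_answer(completion: str) -> str:
    """Drop reasoning traces and return only the visible final answer."""
    text = _THINK_BLOCK.sub("", completion)

    # Assumption: if the opening tag lived in the prompt (e.g. prefilling),
    # the completion may hold only a closing tag; keep what follows it.
    closing = text.rfind(_CLOSE)
    if closing != -1:
        text = text[closing + len(_CLOSE):]

    # Assumption: an unterminated <think> means the model never reached a
    # final answer, so everything from that tag onward is discarded.
    opening = text.find(_OPEN)
    if opening != -1:
        text = text[:opening]

    return text.strip()


print(visible_answer("<think>\nweigh the request\n</think>\nI can't help with that."))
```

Truncating the result to the judge's 1,024-token window would follow this step and depends on the judge's tokenizer, so it is left out here.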
### I.3 Per-Attack Values of $N$ for pass@$N$

Because each attack induces a different number of attempts per behavior, pass@$N$ is only defined within an attack family. Following the per-behavior budgets in Table [6](https://arxiv.org/html/2605.15239#A9.T6), we report pass@$N$ at the full attempt budget of each attack: $N = 339$ for HumanJailbreaks, $N = 3$ for Prefilling, $N = 15$ for PAP-top5, and $N = 3$ for PAIR.

### I.4 Comparability Caveats

Base, Thinksafe, and OPSA are directly comparable within each row because they share the same behaviors, attack implementation, attacker model, target sampling, and judge. Absolute ASR values should not be interpreted as exact reproductions of upstream HarmBench numbers, because we adapt the pipeline for local reasoning-model evaluation in several ways: we use Mistral-7B-Instruct-v0.2 as the attacker and in-loop judge for PAIR and PAP, apply reasoning-aware chat-template handling, allow long target generations, and judge only the visible final answer after stripping the hidden reasoning.
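The per-attack budgets and the pass@$N$ metric can be sketched as follows. We assume the usual reading of pass@$N$, in which a behavior counts as broken when at least one of its $N$ attempts is judged harmful; this section does not spell out the definition, and the names below are illustrative.

```python
# Per-behavior budgets as described in the text: test cases per behavior
# times 3 sampled target completions. PAIR's budget of 3 is taken directly
# from the stated N rather than derived here.
SAMPLES_PER_TEST_CASE = 3
TEST_CASES = {"HumanJailbreaks": 113, "Prefilling": 1, "PAP-top5": 5}
BUDGET = {name: n * SAMPLES_PER_TEST_CASE for name, n in TEST_CASES.items()}
BUDGET["PAIR"] = 3


def pass_at_n(labels_by_behavior, n):
    """Percentage of behaviors with at least one harmful-labeled attempt.

    labels_by_behavior: dict mapping behavior id -> list of exactly n
    booleans (True means the judge labeled that attempt harmful).
    """
    for behavior, labels in labels_by_behavior.items():
        if len(labels) != n:
            raise ValueError(
                f"{behavior}: expected {n} attempts, got {len(labels)}")
    broken = sum(any(labels) for labels in labels_by_behavior.values())
    return 100.0 * broken / len(labels_by_behavior)


# Hypothetical judge labels for two behaviors under Prefilling (N = 3).
toy_labels = {"behavior_a": [False, True, False],
              "behavior_b": [False, False, False]}
print(BUDGET)
print(pass_at_n(toy_labels, BUDGET["Prefilling"]))
```

Because a larger $N$ gives an attack more chances to succeed on each behavior, a value computed at $N = 339$ is not comparable to one computed at $N = 3$, which is why comparisons are restricted to models within the same attack.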