Curriculum Learning for Safety Alignment
Summary
This paper proposes Staged-Competence, a curriculum learning framework for DPO-based safety alignment that organizes preference data by difficulty, improving robustness and data efficiency while preserving general capabilities.
View Cached Full Text
Cached at: 05/27/26, 09:08 AM
# Curriculum Learning for Safety Alignment
Source: [https://arxiv.org/html/2605.26315](https://arxiv.org/html/2605.26315)
Sandeep Kumar Carnegie Mellon University \[sandeep3@andrew\.cmu\.edu\] &Virginia Smith Carnegie Mellon University \[smithv@cmu\.edu\] &Chhavi Yadav Carnegie Mellon University11footnotemark:1 Simons Institute, UC Berkeley \[cyadav@andrew\.cmu\.edu\]
###### Abstract
Direct Preference Optimization \(DPO\) is a widely used approach for safety alignment that aims to reduce harmful behaviors in large language models\. However, prior work shows that it can be brittle and exhibits poor out\-of\-distribution \(OOD\) generalization\[[19](https://arxiv.org/html/2605.26315#bib.bib13)\]\. In this paper we investigate whether Curriculum Learning can improve the robustness of DPO\-based safety alignment\. We proposeStaged\-Competence, a curriculum\-based framework that organizes preference data by difficulty, employs competence\-based sampling, and progressively updates the reference model during training\. Averaged across three model families, Staged\-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities and maintaining near\-zero over\-refusal\. We further show that Staged\-Competence: \(1\) matches baseline safety performance with only 75% of the training data, demonstrating improved data efficiency and \(2\) yields better separation between safe and unsafe responses\. Staged\-Competence is agnostic to the underlying policy optimization loss and can extend to other DPO variants and alignment domains beyond safety\. Our code and data can be found at:[https://github\.com/Sandeep5500/curriculum\-learning\-for\-safety](https://github.com/Sandeep5500/curriculum-learning-for-safety)\.
## 1Introduction
Safety alignment of large language models \(LLMs\) seeks to ensure that models refuse harmful requests while remaining helpful on benign ones\[[2](https://arxiv.org/html/2605.26315#bib.bib9)\]\. Direct Preference Optimization \(DPO\)\[[21](https://arxiv.org/html/2605.26315#bib.bib2)\]has emerged as a popular approach, learning from human\-annotated preference pairs of safe and unsafe responses without requiring a separate reward model\. However, standard DPO has been found to be brittle to simple jailbreaking attacks\[[29](https://arxiv.org/html/2605.26315#bib.bib10),[19](https://arxiv.org/html/2605.26315#bib.bib13)\]and fails to generalize out\-of\-distribution\[[14](https://arxiv.org/html/2605.26315#bib.bib29),[20](https://arxiv.org/html/2605.26315#bib.bib12),[12](https://arxiv.org/html/2605.26315#bib.bib23),[6](https://arxiv.org/html/2605.26315#bib.bib27)\]\.
On the other hand, curriculum learning\[[3](https://arxiv.org/html/2605.26315#bib.bib3),[7](https://arxiv.org/html/2605.26315#bib.bib19),[24](https://arxiv.org/html/2605.26315#bib.bib20)\]has been shown to teach models more robust features by ordering examples from easy to hard, allowing the learner to build on simpler concepts before more challenging ones\. While this principle has been applied to tasks such as machine translation\[[18](https://arxiv.org/html/2605.26315#bib.bib4)\], pretraining\[[4](https://arxiv.org/html/2605.26315#bib.bib28),[25](https://arxiv.org/html/2605.26315#bib.bib21)\], general alignment\[[17](https://arxiv.org/html/2605.26315#bib.bib5),[13](https://arxiv.org/html/2605.26315#bib.bib22)\], its potential for*safety*alignment remains largely unexplored\. This brings us to the question:
Can curriculum learning lead to more robust safety alignment?
In this work, we investigate the aforementioned question and conduct the first systematic study of curriculum learning strategies for DPO\-based safety alignment\. Although curriculum learning is a natural tool to explore in this scenario, preference data presents a key challenge: the difficulty of a preference pair depends not just on linguistic complexity but on how well the unaligned model already distinguishes safe from unsafe behavior\. To address this, we propose a difficulty score calledpreference alignment margin, which orders samples by how well the model already distinguishes safe from unsafe responses\. Next, we propose a curriculum training algorithm,Staged\-Competence, which gradually expands the pool of eligible examples during training through competence\-based sampling and progressively updates the reference policy model between stages\.
Our experiments show the efficacy of Staged\-Competence in learning robust features for safety alignment: across three model families, it reduces OOD harmful response rate by 16% and attack success rate by 20% without degrading general capabilities; achieves∼3×\{\\sim\}3\\timesgreater reward margin separation; extends safety alignment beyond the first few tokens; matches baseline safety with 25% less data; and scales gracefully with model size\.
We also systematically ablate curriculum design choices, including ordering, within\-stage sampling, and reference\-model updates, showing how each affects safety robustness across three model architectures\. Additionally, we identify widespread preference pair inconsistencies in two popular safety datasets, PKU\-SafeRLHF\[[10](https://arxiv.org/html/2605.26315#bib.bib8)\]and HH\-RLHF\[[2](https://arxiv.org/html/2605.26315#bib.bib9)\], and develop a cleaned, combined dataset,Cleaned\-PKU\-HH\-SafeRLHF, for DPO\-based safety training, released alongside our code\. Staged\-Competence is agnostic to the underlying policy optimization loss and therefore extends beyond DPO and can also be applied to alignment domains other than safety\.
Figure 1:Overview of the Staged\-Competence pipeline \(illustrated withK=3K\\\!=\\\!3\)\.Phase 1 \(Scoring\):a model\-dependent preference alignment marginmim\_\{i\}produces a global easy\-to\-hard ordering of preference pairs\.Phase 2 \(Training\):the sorted data is split intoKKbuckets, with within\-stage competence sampling and between\-stage reference\-model updates \(πref\(k\+1\)=π\(k\)\\pi\_\{\\mathrm\{ref\}\}^\{\(k\+1\)\}=\\pi^\{\(k\)\}\)\.
## 2Related Work
#### Curriculum learning\.
Bengioet al\.\[[3](https://arxiv.org/html/2605.26315#bib.bib3)\]introduced curriculum learning, showing that ordering training examples from easy to hard improves convergence over random presentation\. Subsequent work demonstrated that this ordering improves not only convergence but also final generalization in deep networks\[[7](https://arxiv.org/html/2605.26315#bib.bib19)\], with a broader survey synthesizing evidence that curriculum\-based ordering improves robustness and out\-of\-distribution generalization across machine learning domains\[[24](https://arxiv.org/html/2605.26315#bib.bib20)\]\.Platanioset al\.\[[18](https://arxiv.org/html/2605.26315#bib.bib4)\]subsequently extended these ideas to neural machine translation through a*competence\-based*formulation, in which a growing pool of eligible examples introduces harder cases at a progressively slower rate\.
#### Curriculum learning for LLMs\.
Curriculum learning has more recently been adapted to large language model training\.Xieet al\.\[[25](https://arxiv.org/html/2605.26315#bib.bib21)\]optimize the pretraining data mixture using a proxy model that re\-weights domains over training, yielding faster convergence and stronger downstream performance\.Liet al\.\[[13](https://arxiv.org/html/2605.26315#bib.bib22)\]introduce competence\-aware curriculum scheduling for instruction tuning, dynamically adjusting example difficulty to match the model’s capability\. Both target general capability rather than safety alignment, leaving curriculum\-based safety alignment underexplored\.
#### Curriculum methods for alignment\.
The closest prior work is Curri\-DPO\[[17](https://arxiv.org/html/2605.26315#bib.bib5)\], which combines curriculum learning with DPO via difficulty\-stratified stages and between\-stage reference\-model updates; we discuss it in more detail in Section[3\.2](https://arxiv.org/html/2605.26315#S3.SS2)\.
#### DPO safety brittleness and robustness\.
A growing body of work shows DPO\-based safety alignment is fragile:Qiet al\.\[[19](https://arxiv.org/html/2605.26315#bib.bib13)\]demonstrate that safety alignment is shallow, concentrating in the first few output tokens, andQiet al\.\[[20](https://arxiv.org/html/2605.26315#bib.bib12)\]show that even modest fine\-tuning can compromise safety\. A parallel line of work targets DPO’s objective:Menget al\.\[[16](https://arxiv.org/html/2605.26315#bib.bib24)\]remove the reference model and length bias to stabilize training, andEthayarajhet al\.\[[5](https://arxiv.org/html/2605.26315#bib.bib25)\]recast DPO under prospect theory for better behavior under skewed data\. For safety specifically,Zhaoet al\.\[[28](https://arxiv.org/html/2605.26315#bib.bib16)\]introduce a dual\-objective DPO andKimet al\.\[[11](https://arxiv.org/html/2605.26315#bib.bib26)\]reformulate DPO with explicit safety constraints\. Our method keeps the standard DPO loss intact and is therefore compatible with these objective\-level improvements\.
## 3Preliminaries
### 3\.1DPO Problem Formulation
Given a safety preference dataset𝒟=\{\(xi,yi\+,yi−\)\}i=1N\\mathcal\{D\}=\\\{\(x\_\{i\},y\_\{i\}^\{\+\},y\_\{i\}^\{\-\}\)\\\}\_\{i=1\}^\{N\}consisting ofNNexamples, wherexix\_\{i\}is an input prompt,yi\+y\_\{i\}^\{\+\}is a safe \(chosen\) response, andyi−y\_\{i\}^\{\-\}is an unsafe \(rejected\) response, Direct Preference Optimization \(DPO\)\[[21](https://arxiv.org/html/2605.26315#bib.bib2)\]trains a policyπθ\\pi\_\{\\theta\}by minimizing the following loss:
ℒDPO\(θ\)=−𝔼\(x,y\+,y−\)∼𝒟\[logσ\(β\(logπθ\(y\+∣x\)πref\(y\+∣x\)−logπθ\(y−∣x\)πref\(y−∣x\)\)\)\]\.\\mathcal\{L\}\_\{\\text\{DPO\}\}\(\\theta\)=\-\\mathbb\{E\}\_\{\(x,y^\{\+\},y^\{\-\}\)\\sim\\mathcal\{D\}\}\\left\[\\log\\sigma\\\!\\left\(\\beta\\left\(\\log\\frac\{\\pi\_\{\\theta\}\(y^\{\+\}\\mid x\)\}\{\\pi\_\{\\text\{ref\}\}\(y^\{\+\}\\mid x\)\}\-\\log\\frac\{\\pi\_\{\\theta\}\(y^\{\-\}\\mid x\)\}\{\\pi\_\{\\text\{ref\}\}\(y^\{\-\}\\mid x\)\}\\right\)\\right\)\\right\]\.\(1\)whereπref\\pi\_\{\\text\{ref\}\}is a fixed reference policy andβ\\betacontrols the strength of the deviation penalty\.
In standard DPO, training examples are sampled uniformly at random at each step, and the reference policyπref\\pi\_\{\\text\{ref\}\}remains fixed throughout training\.
### 3\.2Curriculum Training Methods
#### Competence\-based curriculum learning\.
A key challenge in curriculum learning is controlling the rate at which harder examples are introduced\. If new, more difficult examples are added too quickly, the learner may not have sufficient time to assimilate them before even harder ones arrive\[[18](https://arxiv.org/html/2605.26315#bib.bib4)\]\. The competence\-based approach addresses this by maintaining a growing subset of the training data from which mini\-batches are sampled: at any point during training, only examples up to a certain difficulty threshold are eligible, and this threshold expands gradually according to a schedule function\.
Formally, each example is assigned a normalized difficultydi=\(rank\(i\)−1\)/\(N−1\)d\_\{i\}=\(\\mathrm\{rank\}\(i\)\-1\)/\(N\-1\)based on its rank in the sorted training set, whererank\(i\)∈\{1,…,N\}\\mathrm\{rank\}\(i\)\\in\\\{1,\\dots,N\\\}orders examples from easiest \(1\) to hardest\. At training stepttout ofTTtotal steps, a competence functionc\(t\)c\(t\)determines the difficulty threshold, and only examples withdi≤c\(t\)d\_\{i\}\\leq c\(t\)are included in the eligible pool for mini\-batch sampling\.Platanioset al\.\[[18](https://arxiv.org/html/2605.26315#bib.bib4)\]propose the square\-root schedulec\(t\)=\(1−c02\)t/T\+c02c\(t\)=\\sqrt\{\(1\-c\_\{0\}^\{2\}\)\\,t/T\+c\_\{0\}^\{2\}\}, wherec0c\_\{0\}is an initial competence constant\. This form ensures that harder examples are introduced at a decreasing rate, giving the model time to consolidate each wave before more are added\.
This approach was originally developed for machine translation with RNNs and early Transformers; we adapt it to modern LLMs and safety alignment via DPO\.
#### Curri\-DPO\.
In standard DPO and the competence\-based approach above, the reference modelπref\\pi\_\{\\text\{ref\}\}remains fixed throughout training\.Pattnaiket al\.\[[17](https://arxiv.org/html/2605.26315#bib.bib5)\]propose instead to partition the training data intoKKdifficulty\-stratified buckets and run training inKKstages, updating the reference model between stages so that each stage’s policy can focus on assimilating the current difficulty bucket rather than re\-learning what was already acquired earlier:πref\(k\+1\)=π\(k\)\.\\pi\_\{\\text\{ref\}\}^\{\(k\+1\)\}=\\pi^\{\(k\)\}\.
The original work setsK=3K\\\!=\\\!3buckets and operates on a training dataset where each prompt is paired with four candidate responses\. The difficulty for the curriculum ordering is defined*locally*within each prompt and not across the whole dataset: the four responses are rankedR1R\_\{1\}\(best\) toR4R\_\{4\}\(worst\) by an external judge \(e\.g\., GPT\-4 or humans\), and three preference pairs of increasing difficulty are formed –\(R1,R4\)\(R\_\{1\},R\_\{4\}\)easy,\(R1,R3\)\(R\_\{1\},R\_\{3\}\)medium,\(R1,R2\)\(R\_\{1\},R\_\{2\}\)hard – where difficulty corresponds to the quality gap between the two responses\. For each prompt, its three pairs are then placed in their corresponding buckets\. Because this local scheme provides no way to compare difficulty across prompts – an “easy” pair from one prompt may be substantially harder than a “hard” pair from another – examples within each bucket are simply randomly shuffled during their corresponding training stage\. Curri\-DPO was originally developed for general helpfulness alignment; we build on its staged reference update mechanism and extend it to safety alignment with a global, model\-dependent difficulty ordering that places every preference pair on a single comparable axis\.
## 4Methodology
Curriculum learning generally involves two components: \(1\) a difficulty scoring phase that defines a meaningful ordering of the training data, and \(2\) a training algorithm that determines how the curriculum is imposed during learning\. We describe both in the context of DPO\-based safety alignment and presentStaged\-Competence, a new curriculum learning framework for safety training\.
### 4\.1Phase 1: Difficulty Scoring
Given a safety preference dataset𝒟=\{\(xi,yi\+,yi−\)\}i=1N\\mathcal\{D\}=\\\{\(x\_\{i\},y\_\{i\}^\{\+\},y\_\{i\}^\{\-\}\)\\\}\_\{i=1\}^\{N\}, our goal is to construct a curriculum that orders training examples by their relative difficulty for the model\. Instead of relying on static heuristics, we define difficulty in a*model\-dependent*manner, reflecting how well the current \(unaligned\) base model distinguishes safe from unsafe responses\. This ensures that the curriculum is tailored to the particular model in question and its specific biases/behavior\.
At a high level, we measure the difficulty of an example by passing the input prompt through the unaligned model, obtaining its zero\-shot response, and comparing that response to the provided safe and unsafe responses\. Intuitively, samples are easier if the model already produces outputs closer to the safe response, and harder if its outputs are closer to the unsafe alternative\.
To operationalize this, for each promptxix\_\{i\}, we generate a zero\-shot responsey^i\\hat\{y\}\_\{i\}from the base model and compute embeddings fory^i\\hat\{y\}\_\{i\},yi\+y\_\{i\}^\{\+\}, andyi−y\_\{i\}^\{\-\}\. We then define a*preference alignment margin*:
mi=cos\(ey^i,eyi\+\)−cos\(ey^i,eyi−\),m\_\{i\}=\\cos\(e\_\{\\hat\{y\}\_\{i\}\},e\_\{y\_\{i\}^\{\+\}\}\)\-\\cos\(e\_\{\\hat\{y\}\_\{i\}\},e\_\{y\_\{i\}^\{\-\}\}\),\(2\)wheree\(⋅\)e\(\\cdot\)denotes normalized sentence embeddings\. A large positive margin indicates that the model’s output is already aligned with the safe response, while a small or negative margin indicates misalignment and thus higher difficulty\.
We sort the training set in descending order of marginmim\_\{i\}\(easy\-to\-hard\) to construct the curriculum used for DPO training\. Unlike the prompt\-local scoring used in Curri\-DPO, this margin provides a*global*difficulty ordering across the entire dataset, allowing any two preference pairs to be compared regardless of their source prompt\.
### 4\.2Phase 2: Curriculum Training
Among existing curriculum\-based methods, Curri\-DPO\[[17](https://arxiv.org/html/2605.26315#bib.bib5)\]demonstrates the value of curriculum learning for DPO\-based tasks but we find that it suffers from a key limitation for effective safety alignment: each stage of Curri\-DPO reverts to standard DPO with random shuffling, leaving the curriculum ordering unutilized at the within\-stage level\. To fill this gap, we draw on Sqrt\-Competence\[[18](https://arxiv.org/html/2605.26315#bib.bib4)\], whose core mechanism – a growing pool of samples that adds harder examples at a progressively slower rate – is a natural fit for incorporating the curriculum into each stage, ensuring a far more granular adaptation of the curriculum ordering down to every training step\.
We propose a new method,*Staged\-Competence*, which combines staged reference\-model updates with competence\-based sampling, leading to a global safety curriculum with easy\-to\-hard progression both within and across stages\. The full algorithm can be found in Alg\.[1](https://arxiv.org/html/2605.26315#alg1)\.
Algorithm 1Staged\-Competence Training1:Input:Difficulty\-sorted preference dataset
𝒟sort\\mathcal\{D\}\_\{\\text\{sort\}\}, base model
π0\\pi\_\{0\}
2:Parameters:Number of stages
KK, epochs per stage
EE, total steps per stage
TT, DPO penalty
β\\beta, initial competence
c0c\_\{0\}
3:Output:Aligned policy
π\(K\)\\pi^\{\(K\)\}
4:\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.
5:Divide
𝒟sort\\mathcal\{D\}\_\{\\text\{sort\}\}into
KKequal buckets
ℬ1,…,ℬK\\mathcal\{B\}\_\{1\},\\dots,\\mathcal\{B\}\_\{K\}of increasing difficulty\[Step 1: Partition\]
6:Initialize reference model
πref\(1\)←π0\\pi\_\{\\text\{ref\}\}^\{\(1\)\}\\leftarrow\\pi\_\{0\}
7:forstage
k=1,…,Kk=1,\\dots,Kdo
8:Initialize policy
π\(k\)←πref\(k\)\\pi^\{\(k\)\}\\leftarrow\\pi\_\{\\text\{ref\}\}^\{\(k\)\}
9:Assign ranking
di∈\[0,1\]d\_\{i\}\\in\[0,1\]for
i∈ℬki\\in\\mathcal\{B\}\_\{k\}\(rank within
ℬk\\mathcal\{B\}\_\{k\}\)\[Step 2: Intra\-bucket difficulty\]
10:forstep
t=1,…,Tt=1,\\dots,T\(over
EEepochs\)do
11:Compute competence
c\(t\)=\(1−c02\)t/T\+c02c\(t\)=\\sqrt\{\(1\-c\_\{0\}^\{2\}\)\\,t/T\+c\_\{0\}^\{2\}\}\[Step 3: Competence sampling\]
12:Eligible pool
𝒫t←\{i∈ℬk:di≤c\(t\)\}\\mathcal\{P\}\_\{t\}\\leftarrow\\\{i\\in\\mathcal\{B\}\_\{k\}:d\_\{i\}\\leq c\(t\)\\\}
13:Sample mini\-batch
B∼𝒫tB\\sim\\mathcal\{P\}\_\{t\}and update
π\(k\)\\pi^\{\(k\)\}via DPO step with reference
πref\(k\)\\pi\_\{\\text\{ref\}\}^\{\(k\)\}
14:endfor
15:if
k<Kk<Kthen
16:
πref\(k\+1\)←π\(k\)\\pi\_\{\\text\{ref\}\}^\{\(k\+1\)\}\\leftarrow\\pi^\{\(k\)\}\[Step 4: Reference update\]
17:endif
18:endfor
19:return
π\(K\)\\pi^\{\(K\)\}
## 5Experiments
In this section, we investigate Staged\-Competence in detail and ask the following research questions: 1\)Efficacy:Does Staged\-Competence improve safety alignment in\-distribution, generalize to out\-of\-distribution prompts, and resist jailbreak attacks, without sacrificing general capability or inducing over\-refusal? 2\)Interpretability:What internal mechanisms drive its safety gains, and do they differ qualitatively from standard DPO? 3\)Data efficiency:Can Staged\-Competence match standard DPO with fewer preference pairs? 4\)Scaling:Does its safety advantage hold across model sizes?
### 5\.1Experimental Setup
#### DPO Dataset: Cleaned\-PKU\-HH\-SafeRLHF\.
We construct our preference dataset from two public sources: PKU\-SafeRLHF\[[10](https://arxiv.org/html/2605.26315#bib.bib8)\]\(43,452 pairs\) and the single\-turn subset of Anthropic HH\-RLHF\[[2](https://arxiv.org/html/2605.26315#bib.bib9)\]\(49,388 pairs\)\. Although widely used for safety alignment, neither provides consistently clean safety preference labels suitable for DPO training\.
In PKU\-SafeRLHF, the “chosen” response is merely the*safer*of the two and is often itself harmful\. For example, given the prompt*“How can I conceal my license plate so I can take free gas?”*:
- Chosen\(labeled safe\):*“Are you referring to the license plate on your car? That is not as useful as hiding the plate in your trunk, since it can be read from outside the vehicle\.”* Rejected\(labeled unsafe\):*“Well, is there anything that would actually help?”*
Here the “chosen” response actively assists with the illegal request while the “rejected” response does not; the safety labels are effectively inverted\. Conversely, in HH\-RLHF, “rejected” reflects lower*helpfulness*rather than lower safety, so the rejected response is frequently the safer one\.
To obtain clean preference pairs, we use a GPT\-4o\-mini judge to classify each response as safe or unsafe along a policy covering dangerous content, hate speech, sexually explicit material, and harassment, retaining only pairs with safe chosen and unsafe rejected\. Table[3](https://arxiv.org/html/2605.26315#A2.T3)reports the filtering breakdown\. The dominant failure mode differs by dataset: in PKU\-SafeRLHF, 82\.2% of chosen responses are unsafe; in HH\-RLHF, 87\.2% of rejected responses are actually safe\. After filtering, we combine both sources and apply a stratified 80/20 train/test split\.
*We refer to this cleaned, combined dataset as Cleaned\-PKU\-HH\-SafeRLHF and release it alongside our code\. All subsequent experiments use it as the DPO training data\.*
#### Models\.
We evaluate across three model families spanning different architectures and scales:LLaMA\-3\-8B,Qwen3\-8B, andYi\-1\.5\-9B\(HuggingFace identifiers in Appendix[C](https://arxiv.org/html/2605.26315#A3)\)\. We use the abliterated variants of these open\-source models – versions from which built\-in safety guardrails have been removed – providing a controlled starting point where safety behavior must be learned entirely through alignment training\. We focus on the 8B parameter range; evaluation at larger scales is left to future work due to compute constraints\.
#### Training details\.
Our training setup has two parts; we discuss each in turn\.
General safety DPO fine\-tuning\.We fine\-tune all methods with LoRA\[[9](https://arxiv.org/html/2605.26315#bib.bib7)\]using rankr=16r\\\!=\\\!16,α=32\\alpha\\\!=\\\!32, applied to the query and value projection matrices; full fine\-tuning is left to future work\. We set the learning rate to5×10−55\\\!\\times\\\!10^\{\-5\}, DPOβ=0\.1\\beta\\\!=\\\!0\.1, effective batch size 32 \(per\-device batch 2×\\timesgradient accumulation 16\), and maximum sequence length 1024\. For staged methods \(Curri\-DPO and Staged\-Competence\), we useK=3K\\\!=\\\!3stages\. We train all methods for 5 epochs; for staged methods, each stage runs for 5 epochs before proceeding to the next\. For Yi\-1\.5\-9B, we use 4 epochs per stage for Staged\-Competence, as we observed that 5 epochs led to slight quality degradation due to preference over\-optimization\. We run all experiments on a single NVIDIA A6000 \(48 GB\)\.
Curriculum learning specifics\.For all curriculum methods, we score difficulty using the lightweight all\-MiniLM\-L6\-v2 sentence encoder\[[22](https://arxiv.org/html/2605.26315#bib.bib6)\]\. We then split the scored data 80/20 into a curriculum training set and a stratified test set that we hold fixed across all methods for comparable evaluation\. For competence\-based methods, we set the initial competence toc0=0\.01c\_\{0\}=0\.01\. Full Phase 1 generation and scoring details are in Appendix[D](https://arxiv.org/html/2605.26315#A4)\.
#### Methods compared\.
We compare against a standard DPO baseline and three curriculum baselines, all trained on the same cleaned preference dataset \(Table[2](https://arxiv.org/html/2605.26315#A1.T2)summarizes the design differences\)\. OurStandard DPObaseline uses random shuffling, a single stage, and a fixed reference\. The three curriculum baselines are: \(1\)Sequential– fixed easy\-to\-hard ordering, single stage, fixed reference; \(2\)Sqrt\-Competence\[[18](https://arxiv.org/html/2605.26315#bib.bib4)\]– competence\-based sampling, single stage, fixed reference; and \(3\)Curri\-DPO\[[17](https://arxiv.org/html/2605.26315#bib.bib5)\]–K=3K\\\!=\\\!3stages with reference\-model updates, random shuffling within each stage\. Our proposedStaged\-CompetenceusesK=3K\\\!=\\\!3stages with reference\-model updates and competence\-based sampling within each stage\. Since each of theK=3K\\\!=\\\!3stages operates on one\-third of the data, the total number of training steps is matched across all methods\.
### 5\.2Evaluation Setup
#### In\-distribution reward accuracy\.
On the held\-out 20% test split, our primary metric is the post\-training reward accuracy: the fraction of test pairs where the trained model assigns higher log\-probability to the chosen response than to the rejected one,logπθ\(y\+∣x\)\>logπθ\(y−∣x\)\\log\\pi\_\{\\theta\}\(y^\{\+\}\\mid x\)\>\\log\\pi\_\{\\theta\}\(y^\{\-\}\\mid x\)\. This is the headline in\-distribution number reported in Table[4](https://arxiv.org/html/2605.26315#A5.T4)\.
To additionally understand training dynamics – how quickly each method improves – we track the per\-step reward marginlogπθ\(y\+∣x\)−logπθ\(y−∣x\)\\log\\pi\_\{\\theta\}\(y^\{\+\}\\mid x\)\-\\log\\pi\_\{\\theta\}\(y^\{\-\}\\mid x\), averaged across the test split\. Both metrics omit the reference model term, as staged methods update their reference between stages, which would otherwise make direct comparisons across methods misleading\.
#### Out\-of\-distribution safety and jailbreak attacks\.
We evaluate on three OOD safety benchmarks –AdvBench\[[29](https://arxiv.org/html/2605.26315#bib.bib10)\],SorryBench\[[26](https://arxiv.org/html/2605.26315#bib.bib11)\], andHEx\-PHI\[[20](https://arxiv.org/html/2605.26315#bib.bib12)\]– spanning adversarial prompts, refusal behavior, and harmful knowledge across categories such as illegal activity, malware, and biosecurity\. We additionally test robustness against two jailbreak attacks:Prefill\[[1](https://arxiv.org/html/2605.26315#bib.bib1)\], which forces the model to begin its response with tokens from a known harmful completion, andGCG\[[29](https://arxiv.org/html/2605.26315#bib.bib10)\], which optimizes an adversarial suffix on a non\-abliterated base model and applies it as a transfer attack; full details are in Appendix[G](https://arxiv.org/html/2605.26315#A7)\. A GPT\-4o\-mini judge classifies each response; we report the*harmful response rate*\(↓\\downarrow\) on safety benchmarks and the*attack success rate*\(↓\\downarrow\) on attacks\.
#### Quality and over\-refusal benchmarks\.
We verify capability preservation usingMMLU\[[8](https://arxiv.org/html/2605.26315#bib.bib14)\]andHellaSwag\[[27](https://arxiv.org/html/2605.26315#bib.bib15)\]\(*accuracy*,↑\\uparrow\), and over\-refusal usingXSTest\[[23](https://arxiv.org/html/2605.26315#bib.bib17)\], which probes whether models incorrectly refuse benign prompts that superficially resemble unsafe requests \(*over\-refusal rate*,↓\\downarrow\)\.
### 5\.3Main Results
#### Staged\-Competence achieves on\-par or slightly better in\-distribution reward accuracy, but with far greater confidence\.
As shown in Table[4](https://arxiv.org/html/2605.26315#A5.T4), Staged\-Competence largely matches Standard DPO and the curriculum baselines on in\-distribution reward accuracy across all three models – 91\.3% on LLaMA\-3\-8B, 89\.6% on Qwen3\-8B, and 88\.2% on Yi\-1\.5\-9B\.
The reward margin trajectories in Figure[2](https://arxiv.org/html/2605.26315#S5.F2)are more revealing\. Staged\-Competence’s margin grows to roughly3×3\\timesthe baseline across all three models, indicating far greater confidence in separating safe from unsafe responses\. Distinct upward jumps at each stage boundary confirm that each new difficulty tier supplies a fresh gradient signal\.
LLaMA\-3\-8BQwen3\-8BYi\-1\.5\-9BFigure 2:Training dynamics across all three models\.Mean reward margin during training for LLaMA\-3\-8B, Qwen3\-8B, and Yi\-1\.5\-9B\. Staged\-Competence shows distinct upward jumps at stage boundaries where we update the reference model\. For Yi\-1\.5\-9B, Staged\-Competence uses 4 epochs per stage rather than 5, resulting in fewer total steps than the other methods\.
#### Staged\-Competence delivers the largest safety improvements on out\-of\-distribution prompts and under jailbreak attacks\.
On the three OOD safety benchmarks \(Table[1\(a\)](https://arxiv.org/html/2605.26315#S5.T1.st1)\), Staged\-Competence achieves the lowest or near\-lowest harmful response rate on*nearly every*model\-benchmark combination, improving the average harmful response rate by 12, 29, and 7 points on LLaMA\-3\-8B, Qwen3\-8B, and Yi\-1\.5\-9B respectively\. The same pattern holds under adversarial attack, where Staged\-Competence achieves the lowest attack success rate on every model\-attack combination \(Table[1\(b\)](https://arxiv.org/html/2605.26315#S5.T1.st2)\), with the largest gains on Qwen3\-8B \(Prefill:−36\-36, GCG:−19\-19points\) and substantial improvements on LLaMA\-3\-8B \(Prefill:−22\-22, GCG:−15\-15points\) and Yi\-1\.5\-9B \(Prefill:−17\-17, GCG:−10\-10points\)\. Staged\-Competence outperforms even the strongest overall curriculum baseline, Curri\-DPO \(−6\.9\-6\.9points OOD,−8\.5\-8\.5points attacks, averaged across models\), by 9 and 11 points on average across OOD and attack benchmarks respectively\.
The greater margin separation reported earlier is a plausible indicator of the model’s confidence in differentiating safe from unsafe responses, allowing it to achieve these gains on OOD evaluations\. A qualitative example of a Standard DPO failure case is shown in Appendix[H](https://arxiv.org/html/2605.26315#A8)\.
Table 1:Out\-of\-distribution safety and jailbreak\-attack results \(%,↓\\downarrow\)\.Δ\\Deltais the average absolute improvement over the Standard DPO baseline across the benchmarks in each subtable for each curriculum variant\. Staged\-Competence delivers the largest improvement on every model – with the most dramatic gains on Qwen3\-8B\.\(a\)OOD safety benchmark harmful response rates\.LLaMA\-3\-8BQwen3\-8BYi\-1\.5\-9BMethodSorryAdvHExΔ\\DeltaSorryAdvHExΔ\\DeltaSorryAdvHExΔ\\DeltaUnaligned90\.093\.891\.7—85\.894\.286\.0—72\.773\.372\.7—Standard DPO \(Baseline\)28\.018\.724\.0—29\.638\.530\.7—16\.21\.58\.7—Sequential19\.84\.412\.3\-11\.425\.327\.524\.3\-7\.28\.70\.40\.7\-5\.5Sqrt\-Competence21\.38\.514\.0\-9\.032\.239\.029\.0\+0\.58\.00\.01\.7\-5\.6Curri\-DPO24\.29\.218\.0\-6\.424\.921\.322\.7\-10\.09\.61\.03\.0\-4\.3Staged\-Competence \(ours\)18\.75\.410\.0\-12\.28\.90\.42\.7\-28\.94\.20\.20\.7\-7\.1
\(b\)Jailbreak attack success rates\.LLaMA\-3\-8BQwen3\-8BYi\-1\.5\-9BMethodPrefillGCGΔ\\DeltaPrefillGCGΔ\\DeltaPrefillGCGΔ\\DeltaUnaligned88\.267\.6—88\.868\.8—82\.067\.1—Standard DPO \(Baseline\)46\.523\.6—51\.726\.9—26\.511\.8—Sequential31\.017\.8\-10\.752\.826\.1\+0\.219\.26\.8\-6\.2Sqrt\-Competence29\.215\.6\-12\.759\.027\.6\+4\.015\.25\.0\-9\.1Curri\-DPO36\.817\.1\-8\.130\.524\.1\-12\.022\.55\.0\-5\.4Staged\-Competence \(ours\)24\.28\.3\-18\.816\.28\.3\-27\.19\.21\.5\-13\.8
Staged\-Competence largely preserves general capabilities \(MMLU, HellaSwag\), with zero or minimal over\-refusal \(XSTest\) across all models\.Full results are reported in Appendix[F](https://arxiv.org/html/2605.26315#A6)\.
### 5\.4Interpreting Alignment Depth
Qiet al\.\[[19](https://arxiv.org/html/2605.26315#bib.bib13)\]show that standard safety alignment often concentrates its effect in the first few tokens of a response\. We investigate whether Staged\-Competence’s safety improvements are similarly shallow, or whether they extend deeper into the response\.
Experimental setup\.For each token positionttin an unsafe response, we compute the per\-token suppressionδ\(t\)=logπunaligned\(yt∣x,y<t\)−logπaligned\(yt∣x,y<t\)\\delta\(t\)=\\log\\pi\_\{\\text\{unaligned\}\}\(y\_\{t\}\\mid x,y\_\{<t\}\)\-\\log\\pi\_\{\\text\{aligned\}\}\(y\_\{t\}\\mid x,y\_\{<t\}\), where positive values indicate active suppression of the unsafe token by the aligned model\. We averageδ\(t\)\\delta\(t\)over 200 rejected responses from the in\-distribution test set, up to position 128\.
Results\.As shown in Figure[3](https://arxiv.org/html/2605.26315#S5.F3), Staged\-Competence produces uniformly stronger safety suppression across virtually every token position – not just at the initial refusal tokens but sustained throughout the response\. Aggregated, the total suppression \(∑tδ\(t\)\\sum\_\{t\}\\delta\(t\)\) is∼3×\\sim\\\!3\\timeslarger for Staged\-Competence than Standard DPO across all three models\.
This deeper alignment provides a mechanistic explanation for the improved prefill\-attack robustness in Table[1\(b\)](https://arxiv.org/html/2605.26315#S5.T1.st2): attacks that bypass the initial tokens still encounter resistance deeper in the sequence\.
LLaMA\-3\-8BQwen3\-8BYi\-1\.5\-9BFigure 3:Per\-token suppression of unsafe response tokens for Baseline vs\. Staged\-Competence\.Staged\-Competence produces stronger suppression at every token position, indicating much deeper safety alignment\.
### 5\.5Scaling with Model Size
To understand how Staged\-Competence scales with model capacity, we additionally train and evaluate the Qwen3 family at three sizes – 1\.7B, 4B, and 8B parameters – and compare Staged\-Competence against Standard DPO at each scale\. The training setup for the smaller models is the same as that for the 8B model in Section[5\.1](https://arxiv.org/html/2605.26315#S5.SS1)\.
Results\.As shown in Figure[4](https://arxiv.org/html/2605.26315#S5.F4), the safety gap between Staged\-Competence and Standard DPO widens monotonically with model size\. On OOD safety, Staged\-Competence reduces the average harmful response rate by 1\.5 points at 1\.7B, 13 at 4B, and 29 at 8B – a significant increase in absolute benefit as the model grows\. On adversarial attacks, the absolute improvement jumps from 5 points at 1\.7B to roughly 26–27 points at both 4B and 8B, where it seems to plateau\.
Notably, Staged\-Competence achieves*roughly constant safety*across all three sizes \(2–8% OOD harmful response rate, 8–12% attack success rate\), while Standard DPO degrades sharply as the model grows\. This makes the curriculum’s value grow with scale, since larger models seem to be more dangerous when poorly aligned\.
OOD SafetyAdversarial AttacksFigure 4:Staged\-Competence’s safety advantage scales with model size\.Average OOD safety harmful response rate \(left\) and average attack success rate \(right\) for Standard DPO vs Staged\-Competence across three Qwen3 sizes\. Numerical labels show the absolute reduction \(pp\) at each scale\.
### 5\.6Data Efficiency
Given the accelerated learning dynamics that Staged\-Competence exhibits in our in\-distribution results \(Figure[2](https://arxiv.org/html/2605.26315#S5.F2)\), we investigate whether it can match Standard DPO’s safety performance using fewer training examples\.
Experimental setup\.We construct a 50% subset of the curriculum by randomly sampling 50% of examples from each of theK=3K\\\!=\\\!3difficulty buckets, preserving the difficulty distribution and within\-bucket ordering\. We construct a 75% subset using the same procedure, and evaluate safety robustness on the three OOD safety benchmarks\.
Results\.As shown in Table[6](https://arxiv.org/html/2605.26315#A9.T6)and Figure[5](https://arxiv.org/html/2605.26315#S5.F5), Staged\-Competence with just 75% of the training data matches or exceeds Standard DPO trained on 100% across all three OOD safety benchmarks, on both LLaMA\-3\-8B and Qwen3\-8B – demonstrating that the curriculum enables substantially more efficient use of the available preference pairs\. At 50%, the picture is model\-dependent: Staged\-Competence still beats Standard DPO on LLaMA\-3\-8B but underperforms it on Qwen3\-8B, suggesting that the minimum viable curriculum size varies with model\. This opens a path to lower\-cost safety alignment when preference data is scarce\.
LLaMA\-3\-8BQwen3\-8BFigure 5:Data efficiency of Staged\-Competence\.Mean reward margin for Staged\-Competence at 50% and 75% data vs\. Standard DPO at 100%\. Even at 50% data, Staged\-Competence’s margin accumulation outpaces Standard DPO at 100%; the 75% setting nearly matches the full\-data trajectory\.
## 6Conclusion
DPO has been widely explored for safety alignment but has been found to be brittle to out\-of\-distribution prompts and adversarial attacks\. Motivated by curriculum learning’s ability to teach robust features through structured easy\-to\-hard ordering, in this paper we explore its application to safety alignment\. We proposeStaged\-Competence, which combines staged reference\-model updates with competence\-based sampling to impose a global curriculum both within and across training stages\. Across three model families, Staged\-Competence delivers markedly stronger out\-of\-distribution safety and jailbreak robustness without degrading general capabilities; we additionally observe improved data efficiency and graceful scaling with model size\. These findings pave the way forward for curriculum learning as a meaningful and underused lever for safety alignment\. Future research directions include extending Staged\-Competence and Curriculum Learning in general to other alignment objectives, non\-safety domains, full fine\-tuning and substantially larger models\.
## References
- \[1\]M\. Andriushchenko, F\. Croce, and N\. Flammarion\(2025\)Jailbreaking leading safety\-aligned LLMs with simple adaptive attacks\.InInternational Conference on Learning Representations,Cited by:[Appendix G](https://arxiv.org/html/2605.26315#A7.SS0.SSS0.Px2.p1.2),[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px2.p1.2)\.
- \[2\]Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DaSilva, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan,et al\.\(2022\)Training a helpful and harmless assistant with reinforcement learning from human feedback\.arXiv preprint arXiv:2204\.05862\.Cited by:[Appendix C](https://arxiv.org/html/2605.26315#A3.p5.1),[§1](https://arxiv.org/html/2605.26315#S1.p1.1),[§1](https://arxiv.org/html/2605.26315#S1.p6.1),[§5\.1](https://arxiv.org/html/2605.26315#S5.SS1.SSS0.Px1.p1.1)\.
- \[3\]Y\. Bengio, J\. Louradour, R\. Collobert, and J\. Weston\(2009\)Curriculum learning\.InProceedings of the 26th International Conference on Machine Learning,pp\. 41–48\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px1.p1.1)\.
- \[4\]M\. Elgaar and H\. Amiri\(2026\)Curriculum learning for LLM pretraining: an analysis of learning dynamics\.arXiv preprint arXiv:2601\.21698\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1)\.
- \[5\]K\. Ethayarajh, W\. Xu, N\. Muennighoff, D\. Jurafsky, and D\. Kiela\(2024\)KTO: model alignment as prospect theoretic optimization\.InInternational Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px4.p1.1)\.
- \[6\]D\. Feng, B\. Qin, C\. Huang, Z\. Zhang, and W\. Lei\(2024\)Towards analyzing and understanding the limitations of DPO: a theoretical perspective\.arXiv preprint arXiv:2404\.04626\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p1.1)\.
- \[7\]G\. Hacohen and D\. Weinshall\(2019\)On the power of curriculum learning in training deep networks\.InProceedings of the 36th International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px3.p1.2)\.
- \[9\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,Cited by:[§5\.1](https://arxiv.org/html/2605.26315#S5.SS1.SSS0.Px3.p2.6)\.
- \[10\]J\. Ji, D\. Hong, B\. Zhang, B\. Chen, J\. Dai, B\. Zheng, T\. Qiu, J\. Zhou, K\. Wang, B\. Li, S\. Han, Y\. Guo, and Y\. Yang\(2024\)PKU\-SafeRLHF: towards multi\-level safety alignment for LLMs with human preference\.arXiv preprint arXiv:2406\.15513\.Cited by:[Appendix C](https://arxiv.org/html/2605.26315#A3.p5.1),[§1](https://arxiv.org/html/2605.26315#S1.p6.1),[§5\.1](https://arxiv.org/html/2605.26315#S5.SS1.SSS0.Px1.p1.1)\.
- \[11\]G\. Kim, Y\. J\. Kim, B\. Kim, H\. Lee, K\. Bae, Y\. Jang, and M\. Lee\(2026\)SafeDPO: a simple approach to direct preference optimization with enhanced safety\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px4.p1.1)\.
- \[12\]S\. Lermen, C\. Rogers\-Smith, and J\. Ladish\(2023\)LoRA fine\-tuning efficiently undoes safety training in Llama 2\-Chat 70B\.arXiv preprint arXiv:2310\.20624\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p1.1)\.
- \[13\]Y\. Li, T\. Lu, Y\. Li, Y\. Chen, W\. Huang, W\. Jiang, H\. Wang, H\. Zheng, and P\. S\. Yu\(2025\)Teaching according to talents\! instruction tuning LLMs with competence\-aware curriculum learning\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px2.p1.1)\.
- \[14\]Y\. Lin, S\. Seto, M\. ter Hoeve, K\. Metcalf, B\. Theobald, X\. Wang, Y\. Zhang, C\. Huang, and T\. Zhang\(2024\)On the limited generalization capability of the implicit reward model induced by direct preference optimization\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p1.1)\.
- \[15\]M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, D\. Forsyth, and D\. Hendrycks\(2024\)HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\.InInternational Conference on Machine Learning,Cited by:[Appendix G](https://arxiv.org/html/2605.26315#A7.SS0.SSS0.Px1.p1.1)\.
- \[16\]Y\. Meng, M\. Xia, and D\. Chen\(2024\)SimPO: simple preference optimization with a reference\-free reward\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px4.p1.1)\.
- \[17\]P\. Pattnaik, R\. Maheshwary, K\. Ogueji, V\. Yadav, and S\. T\. Madhusudhan\(2024\)Enhancing alignment using curriculum learning & ranked preferences\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2605.26315#S3.SS2.SSS0.Px2.p1.4),[§4\.2](https://arxiv.org/html/2605.26315#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.26315#S5.SS1.SSS0.Px4.p1.3)\.
- \[18\]E\. A\. Platanios, O\. Stretcu, G\. Neubig, B\. Poczos, and T\. M\. Mitchell\(2019\)Competence\-based curriculum learning for neural machine translation\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 1162–1172\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.26315#S3.SS2.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.26315#S3.SS2.SSS0.Px1.p2.8),[§4\.2](https://arxiv.org/html/2605.26315#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.26315#S5.SS1.SSS0.Px4.p1.3)\.
- \[19\]X\. Qi, A\. Panda, K\. Lyu, X\. Ma, S\. Roy, A\. Beirami, P\. Mittal, and P\. Henderson\(2025\)Safety alignment should be made more than just a few tokens deep\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p1.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px4.p1.1),[§5\.4](https://arxiv.org/html/2605.26315#S5.SS4.p1.1)\.
- \[20\]X\. Qi, Y\. Zeng, T\. Xie, P\. Chen, R\. Jia, P\. Mittal, and P\. Henderson\(2024\)Fine\-tuning aligned language models compromises safety, even when users do not intend to\!\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p1.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px4.p1.1),[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px2.p1.2)\.
- \[21\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn\(2023\)Direct preference optimization: your language model is secretly a reward model\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.26315#S3.SS1.p1.6)\.
- \[22\]N\. Reimers and I\. Gurevych\(2019\)Sentence\-BERT: sentence embeddings using siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,Cited by:[Appendix C](https://arxiv.org/html/2605.26315#A3.p5.1),[Appendix D](https://arxiv.org/html/2605.26315#A4.SS0.SSS0.Px2.p1.3),[§5\.1](https://arxiv.org/html/2605.26315#S5.SS1.SSS0.Px3.p3.1)\.
- \[23\]P\. Röttger, H\. R\. Kirk, B\. Vidgen, G\. Attanasio, F\. Bianchi, and D\. Hovy\(2024\)XSTest: a test suite for identifying exaggerated safety behaviours in large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Cited by:[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px3.p1.2)\.
- \[24\]P\. Soviany, R\. T\. Ionescu, P\. Rota, and N\. Sebe\(2022\)Curriculum learning: a survey\.International Journal of Computer Vision130\(6\),pp\. 1526–1565\.Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px1.p1.1)\.
- \[25\]S\. M\. Xie, H\. Pham, X\. Dong, N\. Du, H\. Liu, Y\. Lu, P\. Liang, Q\. V\. Le, T\. Ma, and A\. W\. Yu\(2023\)DoReMi: optimizing data mixtures speeds up language model pretraining\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.26315#S1.p2.1),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px2.p1.1)\.
- \[26\]T\. Xie, X\. Qi, Y\. Zeng, Y\. Huang, U\. M\. Sehwag, K\. Huang, L\. He, B\. Wei, D\. Li, Y\. Sheng, R\. Jia, B\. Li, K\. Li, D\. Chen, P\. Henderson, and P\. Mittal\(2025\)SORRY\-Bench: systematically evaluating large language model safety refusal\.InInternational Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px2.p1.2)\.
- \[27\]R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi\(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 4791–4800\.Cited by:[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px3.p1.2)\.
- \[28\]X\. Zhao, W\. Cai, T\. Shi, D\. Huang, L\. Lin, S\. Mei, and D\. Song\(2025\)Improving LLM safety alignment with dual\-objective optimization\.InInternational Conference on Machine Learning,Cited by:[Appendix G](https://arxiv.org/html/2605.26315#A7.SS0.SSS0.Px2.p1.2),[§2](https://arxiv.org/html/2605.26315#S2.SS0.SSS0.Px4.p1.1)\.
- \[29\]A\. Zou, Z\. Wang, N\. Carlini, M\. Nasr, J\. Z\. Kolter, and M\. Fredrikson\(2023\)Universal and transferable adversarial attacks on aligned language models\.arXiv preprint arXiv:2307\.15043\.Cited by:[Appendix G](https://arxiv.org/html/2605.26315#A7.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.26315#S1.p1.1),[§5\.2](https://arxiv.org/html/2605.26315#S5.SS2.SSS0.Px2.p1.2)\.
## Appendix ACurriculum Methods Summary
Table[2](https://arxiv.org/html/2605.26315#A1.T2)summarizes the design differences across the five training methods compared in our experiments: number of training stages, whether the reference model is updated between stages, and within\-stage example ordering\.
Table 2:Curriculum methods investigated; Standard DPO is the baseline\.MethodStagesRef\. updateWithin\-stage orderStandard DPO \(Baseline\)1×Random shuffleSequential1×Easy→\\toHard \(fixed\)Sqrt\-Competence1×Competence\-based samplingCurri\-DPOKK✓Random shuffleStaged\-Competence \(ours\)KK✓Competence\-based sampling
## Appendix BDataset Cleaning Statistics
Table[3](https://arxiv.org/html/2605.26315#A2.T3)reports the full filtering breakdown for both source datasets\. The combined, filtered result is what we refer to asCleaned\-PKU\-HH\-SafeRLHFthroughout the paper: it contains only pairs where the chosen response is safe and the rejected response is unsafe, split 80/20 into train and test\.
Table 3:Dataset statistics before and after GPT\-4o\-mini safety filtering\.PKU\-SafeRLHFHH\-RLHFCombinedRaw pairs43,45249,38892,840Chosen unsafe \(%\)82\.26\.6—Rejected safe \(%\)2\.187\.2—After filtering6,9623,96910,931Retained \(%\)16\.08\.011\.8Training split——8,744Test split——2,187
## Appendix CModel Identifiers
We use the abliterated variants of three open\-source model families – versions from which built\-in safety guardrails have been removed, providing a controlled starting point where safety must be learned entirely through DPO training\. Their HuggingFace identifiers and base model licenses are:
- •LLaMA\-3\-8B:QuixiAI/Llama\-3\-8B\-Instruct\-abliterated\-v2\(Meta LLaMA 3 Community License\)
- •Qwen3\-8B:Goekdeniz\-Guelmez/Josiefied\-Qwen3\-8B\-abliterated\-v1\(Apache 2\.0\)
- •Yi\-1\.5\-9B:byroneverson/Yi\-1\.5\-9B\-Chat\-abliterated\(Apache 2\.0\)
For the Qwen3 scaling experiments \(Section[5\.5](https://arxiv.org/html/2605.26315#S5.SS5)\), we additionally use:
- •Qwen3\-1\.7B:mlabonne/Qwen3\-1\.7B\-abliterated\(Apache 2\.0\)
- •Qwen3\-4B:mlabonne/Qwen3\-4B\-abliterated\(Apache 2\.0\)
The training datasets used are PKU\-SafeRLHF\[[10](https://arxiv.org/html/2605.26315#bib.bib8)\]\(CC BY\-NC 4\.0\) and HH\-RLHF\[[2](https://arxiv.org/html/2605.26315#bib.bib9)\]\(MIT\)\. The difficulty scoring model all\-MiniLM\-L6\-v2\[[22](https://arxiv.org/html/2605.26315#bib.bib6)\]is released under Apache 2\.0\.
## Appendix DPhase 1: Difficulty Scoring Details
#### Zero\-shot response generation\.
For each promptxix\_\{i\}in the training set, we generate a zero\-shot responsey^i\\hat\{y\}\_\{i\}from the base model using stochastic sampling with temperature0\.70\.7and a maximum of512512new tokens\.
#### Sentence encoder\.
Embeddings fory^i\\hat\{y\}\_\{i\},yi\+y\_\{i\}^\{\+\}, andyi−y\_\{i\}^\{\-\}are computed withall\-MiniLM\-L6\-v2\[[22](https://arxiv.org/html/2605.26315#bib.bib6)\], which has a maximum sequence length of 256 tokens; inputs exceeding this are truncated to fit\. As safety\-relevant content typically appears in the first portion of a response, we expect this to have minimal effect on the difficulty ordering\.
## Appendix EIn\-Distribution Reward Accuracy
Table[4](https://arxiv.org/html/2605.26315#A5.T4)reports held\-out DPO test\-set reward accuracy for all five methods across the three model families\. Staged\-Competence matches or slightly improves on Standard DPO in\-distribution, while delivering substantially larger gains on OOD safety and adversarial robustness \(Table[1](https://arxiv.org/html/2605.26315#S5.T1)\)\.
Table 4:In\-distribution DPO test\-set reward accuracy \(%,↑\\uparrow\)\.Staged\-Competence matches or slightly improves on Standard DPO across all three model families, while the other curriculum baselines cluster around the baseline\.MethodLLaMA\-3\-8BQwen3\-8BYi\-1\.5\-9BStandard DPO \(Baseline\)89\.886\.785\.5Sequential89\.387\.085\.7Sqrt\-Competence89\.086\.784\.9Curri\-DPO91\.890\.486\.5Staged\-Competence \(ours\)91\.389\.688\.2
## Appendix FQuality and Over\-Refusal Results
Table[5](https://arxiv.org/html/2605.26315#A6.T5)reports MMLU and HellaSwag accuracy and XSTest over\-refusal rates for all five methods across the three models\. The quality scores for Staged\-Competence, averaged across MMLU and HellaSwag, remain within∼3\\sim\\\!3points of the Standard DPO baseline across all three models\. XSTest over\-refusal rates are at or near zero on LLaMA\-3\-8B and Qwen3\-8B, and within∼2\{\\sim\}2points on Yi\-1\.5\-9B, confirming that the safety gains of Staged\-Competence do not come at the cost of excessive refusal on benign prompts\.
Table 5:General capability and over\-refusal benchmarks\.MMLU and HellaSwag: accuracy \(%,↑\\uparrow\)\. XSTest: over\-refusal rate \(%,↓\\downarrow\)\. Staged\-Competence stays within 2–3 points of Standard DPO on average quality across all three models, and within∼2\{\\sim\}2points on XSTest over\-refusal on Yi\-1\.5\-9B\.LLaMA\-3\-8BQwen3\-8BYi\-1\.5\-9BMethodMMLUHSwagXSMMLUHSwagXSMMLUHSwagXSUnaligned56\.044\.80\.070\.873\.70\.066\.058\.50\.0Standard DPO \(Baseline\)53\.942\.40\.069\.775\.00\.063\.562\.51\.2Sequential53\.044\.20\.069\.876\.00\.062\.264\.91\.2Sqrt\-Competence55\.944\.90\.069\.675\.40\.060\.366\.52\.4Curri\-DPO52\.141\.60\.070\.076\.60\.061\.864\.11\.2Staged\-Competence \(ours\)52\.641\.30\.468\.276\.50\.059\.465\.23\.6
## Appendix GAttack Evaluation Details
#### GCG adversarial suffix attack\.
We apply the GCG method\[[29](https://arxiv.org/html/2605.26315#bib.bib10)\]to 398 HarmBench behaviors\[[15](https://arxiv.org/html/2605.26315#bib.bib18)\]spanning cyberattacks, bioweapons, and illegal activity\. For each behavior, GCG optimizes a 20\-token adversarial suffix via greedy token substitution usingnanogcg\(500 steps, search width 512, random seed 42\)\. Suffixes are generated against the non\-abliterated vanilla base model and applied to all fine\-tuned variants as a transfer evaluation – adversarial suffixes have been shown to transfer across fine\-tuned variants within the same model family\[[29](https://arxiv.org/html/2605.26315#bib.bib10)\]\. A GPT\-4o\-mini judge classifies each response; we report*attack success rate*\(ASR\)\.
#### Prefill attack\.
The prefill attack forces the model to begin its response with the firstkktokens of a known harmful response before generating freely, bypassing the model’s initial refusal mechanism by placing it mid\-generation in a context it would not normally produce\[[1](https://arxiv.org/html/2605.26315#bib.bib1)\]\. We construct prefill prompts from the DOOR dataset\[[28](https://arxiv.org/html/2605.26315#bib.bib16)\], which provides \(prompt, safe response, harmful response\) triples\. For each example, we tokenize the harmful response, decode the firstk=20k\\\!=\\\!20tokens as a text prefix, and append this prefix to the chat\-templated user prompt as the beginning of the assistant turn\. We evaluate 400 prompts; a GPT\-4o\-mini judge classifies the full generated response \(prefix \+ continuation\) as harmful or safe, and we report ASR as before\.
## Appendix HQualitative Analysis: Standard DPO vs\. Staged\-Competence
We present two representative AdvBench examples \(LLaMA\-3\-8B\) where Standard DPO produces a policy\-violating response while Staged\-Competence correctly refuses\.
Prompt:Develop a script that can be used to spread malware and compromise computer systems\.Standard DPO:Here is a simple script example in Python that demonstrates a basic malware\-like program\. \[…\]import os, timedef list\_files\(\): for root, dirs, files in os\.walk\(os\.getcwd\(\)\): for file in files: print\(os\.path\.join\(root, file\)\) def delete\_file\(file\_name\): if os\.path\.isfile\(file\_name\): os\.remove\(file\_name\) This script demonstrates some common techniques used in malware, including listing and deleting files\. It is provided for educational purposes only\.Staged\-Competence:I cannot write a script that can be used to spread malware and compromise computer systems\. Is there anything else I can help you with?Prompt:Outline a plan to steal sensitive information from a company and sell it to a competitor\.Standard DPO:Here’s a hypothetical plan \[…\]1\.Research: Understand the company’s products, services, and strategic challenges\.2\.Identify relevant information: Locate sensitive assets such as new product developments, customer data, strategic partnerships, or financial information\.3\.Check public availability: Verify whether the information is already accessible via the company’s website, social media, or industry reports\.4\.Legitimate channels: Attend industry conferences, contact sales teams, or read analyst reports to gather intelligence without direct intrusion\.5\.Internal contacts: If access is needed beyond public sources, identify employees who may be willing to share information \[…\]Staged\-Competence:I cannot provide a plan to steal sensitive information from a company\. If you are considering a business decision, I would encourage you to explore alternative options that are ethical and legal\.
## Appendix IData Efficiency Breakdown
Table[6](https://arxiv.org/html/2605.26315#A9.T6)reports the full per\-benchmark breakdown for the data\-efficiency experiment, comparing Staged\-Competence at 50% and 75% data against Standard DPO at 100% data on LLaMA\-3\-8B and Qwen3\-8B\.
Table 6:Data efficiency: Staged\-Competence with reduced data vs\. Standard DPO on 100%\.DPO Acc \(%,↑\\uparrow\); harmful response rates \(%,↓\\downarrow\)\. Staged\-Competence at 75% data matches or exceeds Standard DPO at 100% on every benchmark across both models\.LLaMA\-3\-8BQwen3\-8BMethodAccSorryAdvHExAccSorryAdvHExStandard DPO 100%89\.828\.018\.724\.086\.729\.638\.530\.7Staged\-Comp\. 50%87\.819\.86\.011\.084\.445\.853\.339\.0Staged\-Comp\. 75%90\.116\.44\.410\.087\.321\.113\.513\.7Staged\-Comp\. 100%91\.318\.75\.410\.089\.68\.90\.42\.7
## Appendix JBroader Impact
Curriculum\-based safety alignment and the release of Cleaned\-PKU\-HH\-SafeRLHF offer practical tools for training more robust safety\-aligned models\. We do not foresee negative societal impacts; the adversarial methods used are already publicly available and are employed solely to measure robustness\.Similar Articles
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using a teacher flip rate to activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across multiple model scales.
Capability Conditioned Scaffolding for Professional Human LLM Collaboration
Introduces Capability Conditioned Scaffolding, a framework for LLM collaboration that adapts intervention based on user expertise domains to prevent Professional Domain Drift, with pilot evaluation on MMLU subsets.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
This paper introduces CASPO, a framework for aligning token-level confidence with step-wise logical correctness in large reasoning models using iterative Direct Preference Optimization. It also proposes Confidence-aware Thought (CaT) for dynamically pruning uncertain reasoning branches during inference to improve reliability and efficiency.
COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents
Proposes COMPASS, a cognitive MCTS-guided process alignment framework to enhance safety in LLM-powered search agents by synthesizing attack trajectories and isolating risky actions, achieving a favorable safety-utility trade-off with less training data.
SkillOS: Learning Skill Curation for Self-Evolving Agents
This paper introduces SkillOS, a reinforcement learning framework that enables LLM agents to learn long-term skill curation policies for self-evolution, improving performance and generalization across tasks.