DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

arXiv cs.LG 06/09/26, 04:00 AM Papers
Summary
DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.
arXiv:2606.07678v1 Announce Type: new Abstract: Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.
Cached at: 06/09/26, 08:52 AM
# DOG-DPO:Dynamic Optimization in Geometry for Safety Alignment
Source: [https://arxiv.org/html/2606.07678](https://arxiv.org/html/2606.07678)
Yi Nian, Tiankai Yang, Yudi Zhang, Qi Pan, Zelong Xu, Shenzhe Zhu, Qingqing Luan, Yue Huang, Xiangliang Zhang, Yue Zhao

University of Southern California, Iowa State University, University of Wisconsin–Madison, UT Austin, Independent Researcher, University of Notre Dame

{yinian, tiankaiy, panq, yue.z}@usc.edu, {yudizhang628, qingqingluan95}@gmail.com, zxu684@wisc.edu, shenzhe@utexas.edu, yhuang37@nd.edu, xzhang33@nd.edu

###### Abstract

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility–robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.

Disclaimer: This paper contains harmful content that may be disturbing to readers\.


## 1 Introduction

Aligning large language models (LLMs) with safety preferences has become a central problem in modern AI systems (Lu et al., [2025](https://arxiv.org/html/2606.07678#bib.bib9); Yang et al., [2026](https://arxiv.org/html/2606.07678#bib.bib10); Wang et al., [2024](https://arxiv.org/html/2606.07678#bib.bib11); Xu et al., [2026](https://arxiv.org/html/2606.07678#bib.bib1)). Most alignment pipelines rely on large-scale human feedback, typically in the form of preference pairs used in reward modeling or Direct Preference Optimization (DPO) (Kim et al., [2025](https://arxiv.org/html/2606.07678#bib.bib12); Shang et al., [2025](https://arxiv.org/html/2606.07678#bib.bib13)). While these approaches achieve strong empirical performance, they are inherently data-intensive (Bai et al., [2022](https://arxiv.org/html/2606.07678#bib.bib14); Ouyang et al., [2022](https://arxiv.org/html/2606.07678#bib.bib15)): collecting and training on large safety datasets is costly, noisy, and often inefficient. Recent studies suggest that only a small fraction of alignment data is truly necessary, indicating substantial redundancy in existing datasets (Yang et al., [2025b](https://arxiv.org/html/2606.07678#bib.bib18); Gao et al., [2025](https://arxiv.org/html/2606.07678#bib.bib17); Liu et al., [2024a](https://arxiv.org/html/2606.07678#bib.bib2); Bukharin and others, [2024](https://arxiv.org/html/2606.07678#bib.bib23)). This raises a fundamental question: *how should we select informative subsets of preference data for efficient alignment?* While prior work shows that careful data selection can match or even outperform full-dataset training, the underlying principles of selecting preference data remain poorly understood, especially in realistic settings where data comes from multiple heterogeneous sources (Liu et al., [2024b](https://arxiv.org/html/2606.07678#bib.bib21); Chung et al., [2024](https://arxiv.org/html/2606.07678#bib.bib16)).

Existing data selection methods approach this problem from several perspectives, including distributional diversity (Yang et al., [2025b](https://arxiv.org/html/2606.07678#bib.bib18); Bukharin and others, [2024](https://arxiv.org/html/2606.07678#bib.bib23)), training-signal analysis (Gao et al., [2025](https://arxiv.org/html/2606.07678#bib.bib17); Zhang et al., [2025](https://arxiv.org/html/2606.07678#bib.bib22)), and heuristic scoring (Liu et al., [2024a](https://arxiv.org/html/2606.07678#bib.bib2), [b](https://arxiv.org/html/2606.07678#bib.bib21)). While effective in their respective settings, these approaches are primarily designed for generic data selection and do not explicitly account for the structured nature of preference data. In particular, they adopt a sample-centric view that scores or ranks each preference pair independently, collapsing its information into a scalar. This view loses two structural properties that are intrinsic to preference data. First, preference signals are inherently directional rather than pointwise: a pair $(x,y^{+},y^{-})$ does not just label two candidates; it specifies the alignment direction along which the model should shift. Sample-level scoring discards this directionality. Second, in multi-dataset settings, preference data carries cross-dataset structure: some alignment directions are well captured by a dominant anchor dataset, while others remain dataset-specific and reflect heterogeneous risk domains (Liu et al., [2024a](https://arxiv.org/html/2606.07678#bib.bib2); Gao et al., [2025](https://arxiv.org/html/2606.07678#bib.bib17)). Methods that pool preference pairs into a single homogeneous candidate set cannot distinguish these two regimes, leading to redundant signals on anchor-aligned directions and undercoverage of dataset-specific ones. A third practical limitation compounds the conceptual ones: training-signal-based methods require auxiliary forward/backward passes or reference DPO runs to extract their selection signal, making them computationally expensive and tightly coupled to the training process (Gao et al., [2025](https://arxiv.org/html/2606.07678#bib.bib17); Zhang et al., [2025](https://arxiv.org/html/2606.07678#bib.bib22)).

In this work, we depart from the sample-centric view and treat preference data as a *structured geometric object* in representation space. Each preference pair is represented by the difference vector between preferred and rejected response embeddings, yielding a directional alignment signal rather than a scalar-scored sample. Across $V$ preference datasets, these directions contain both a dominant anchor geometry and dataset-specific residual variations. This turns data selection into a *geometric coverage* problem: selecting non-redundant directions that span both shared and residual preference structure. We instantiate this view as DOG-DPO, a training-free framework that builds anchor/residual coordinates and selects subsets using a determinant-based diversity criterion. It requires no teacher LLM, no reference DPO run, and no iterative validation.

We evaluate DOG\-DPO on six safety alignment benchmarks across two backbones\. Using only 11\.6% of the preference pairs, DOG\-DPO remains competitive with full\-data training, recovers most safety gains while maintaining general LLM capabilities, reduces inter\-dataset selection bias, and expands feature\-space coverage relative to baselines\. These results suggest that effective alignment depends not only on data scale, but also on exploiting the structured geometry of heterogeneous preference data\. We summarize our contributions as follows:

- *A geometric reformulation of preference data.* Departing from the prevailing sample-centric view, we represent each preference pair as a directional signal in representation space and reveal that, across heterogeneous sources, these signals decompose into a shared anchor geometry and dataset-specific residual variations.
- *A geometry-driven selection framework.* Building on this view, we propose DOG-DPO, which selects subsets by maximizing a diversity-based coverage objective over the induced feature space, jointly capturing structural breadth and directional non-redundancy.
- *Training-free selection.* DOG-DPO requires no teacher LLM, reference DPO training, or iterative validation, running as a single pre-processing pass and achieving 15–35$\times$ speedups over representative baselines (Figure [3](https://arxiv.org/html/2606.07678#S3.F3)).
- *Strong safety-alignment results.* Across six benchmarks and two backbones, DOG-DPO uses only 11.6% of the preference pairs while remaining competitive with full-data training and improving selection balance and feature-space coverage over existing baselines.

## 2 Method

### 2.1 Problem Formulation

#### Preliminaries\.

Let $\pi_{\theta}$ denote the policy LLM with parameters $\theta$, and $\pi_{\mathrm{ref}}$ a frozen reference policy. We write the policy in the standard last-layer form $\pi_{\theta}(y\mid x)\propto\exp(Wh_{\theta}(x,y))$, where $h_{\theta}(x,y)\in\mathbb{R}^{d}$ is the final hidden representation of the response token at position $|y|$ (or its mean over response tokens), and $W\in\mathbb{R}^{|\mathcal{V}|\times d}$ is the unembedding matrix mapping hidden states to vocabulary logits. Throughout, we treat $h$ as a feature extractor; the geometry it induces on preference pairs is the object our selection operates on.

A preference dataset is a set of triples $(x,y^{+},y^{-})$ where $y^{+}$ is preferred over $y^{-}$. Direct Preference Optimization (DPO) (Kim et al., [2025](https://arxiv.org/html/2606.07678#bib.bib12)) aligns $\pi_{\theta}$ with these preferences by minimizing $\mathcal{L}_{i}=-\log\sigma(m_{i})$, where the per-pair margin is $m_{i}=\beta[\log\pi_{\theta}(y_{i}^{+}\mid x_{i})-\log\pi_{\theta}(y_{i}^{-}\mid x_{i})]-\beta[\log\pi_{\mathrm{ref}}(y_{i}^{+}\mid x_{i})-\log\pi_{\mathrm{ref}}(y_{i}^{-}\mid x_{i})]$ and $\beta>0$ is a temperature.
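The margin and loss above can be sketched numerically as follows. This is a minimal illustration, assuming the sequence log-probabilities have already been computed; the function names, the toy numbers, and the value of `beta` are hypothetical and not taken from the paper.

```python
import numpy as np

def dpo_margin(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.5):
    """Per-pair margin m_i: beta * (policy log-ratio) - beta * (reference log-ratio)."""
    policy_gap = logp_pos - logp_neg
    ref_gap = ref_logp_pos - ref_logp_neg
    return beta * policy_gap - beta * ref_gap

def dpo_loss(margin):
    """L_i = -log sigmoid(m_i), evaluated stably as log(1 + exp(-m_i))."""
    return np.logaddexp(0.0, -margin)

# Toy log-probabilities for two preference pairs (illustrative values only).
m = dpo_margin(
    logp_pos=np.array([-12.0, -20.0]),
    logp_neg=np.array([-15.0, -18.0]),
    ref_logp_pos=np.array([-13.0, -19.0]),
    ref_logp_neg=np.array([-14.0, -19.0]),
)
loss = dpo_loss(m)
```

A positive margin (the policy prefers $y^{+}$ more strongly than the reference does) gives a loss below $\log 2$; a negative margin gives a loss above it.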

#### From DPO loss to a directional signal\.

Substituting $\pi_{\theta}(y\mid x)\propto\exp(Wh_{\theta}(x,y))$ into the DPO margin $m_{i}$, the $\pi_{\mathrm{ref}}$ contribution is a $\theta$-independent constant that shifts the per-pair decision threshold but not the gradient with respect to $\theta$. The policy-dependent component therefore reduces to a linear functional of the unembedding $W$:

$$
\begin{aligned}
m_{i}^{\mathrm{policy}} &= \beta W z_{i}, \\
z_{i} &:= h(x_{i},y_{i}^{+})-h(x_{i},y_{i}^{-}).
\end{aligned}
\tag{1}
$$

That is, each preference pair contributes a training signal whose direction in representation space is fully captured by the hidden-state difference $z_{i}$, with $\beta$ acting as a global scaling factor. Pairs with similar $z_{i}$ contribute aligned signal to training; orthogonal $z_{i}$ contribute independent signal.
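To make the notion of aligned versus independent signal concrete, the sketch below forms the difference vectors $z_i$ from toy hidden states and compares them by cosine similarity. Using cosine similarity as the comparison is an assumption made for illustration; the helper names and the 2-D vectors are hypothetical.

```python
import numpy as np

def preference_directions(h_pos, h_neg):
    """z_i = h(x_i, y_i^+) - h(x_i, y_i^-), one row per preference pair."""
    return np.asarray(h_pos, dtype=float) - np.asarray(h_neg, dtype=float)

def cosine_matrix(z, eps=1e-12):
    """Pairwise cosine similarity between the rows of z."""
    unit = z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), eps)
    return unit @ unit.T

# Toy hidden states: pairs 0 and 1 shift along the same axis, pair 2 along another.
h_pos = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 1.0]])
h_neg = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, -1.0]])
z = preference_directions(h_pos, h_neg)
cos = cosine_matrix(z)
```

Here pairs 0 and 1 differ only in magnitude, so they carry the same direction, while pair 2 is orthogonal to both.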

#### Selection setup\.

Given preference datasets $\{\mathcal{D}_{v}\}_{v=1}^{V}$ with union $\mathcal{D}$, we select a subset $S\subseteq\mathcal{D}$ of size $K$ before DPO training. Projecting $z_{i}$ onto a low-rank preference basis $B\in\mathbb{R}^{d\times r}$ (Sec. [2.2](https://arxiv.org/html/2606.07678#S2.SS2)) gives $\phi_{i}=B^{\top}z_{i}$ and quality $q_{i}=\|\phi_{i}\|_{2}$.

#### Quality\-weighted kernel\.

For a candidate subset $S$, we use a quality-weighted RBF kernel that combines signal magnitude and directional redundancy:

$$
\begin{aligned}
L_{S} &= [L_{ij}]_{i,j\in S}, \qquad L_{ij}=q_{i}q_{j}K_{ij}, \\
K_{ij} &= \exp\!\left(-\frac{\|\phi_{i}-\phi_{j}\|_{2}^{2}}{2\sigma^{2}}\right).
\end{aligned}
\tag{2}
$$

Here $K_{ij}$ captures directional redundancy and $L_{ii}=q_{i}^{2}$ injects per-sample signal strength into the same kernel.
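A minimal NumPy sketch of the kernel in Eq. (2), assuming the features $\phi_i$ are stacked as rows of an array. The bandwidth default `sigma=1.0` and the toy features are assumptions; the paper's actual bandwidth is not stated in this section.

```python
import numpy as np

def quality_weighted_kernel(phi, sigma=1.0):
    """Return (L, K, q) with K_ij an RBF kernel and L_ij = q_i * q_j * K_ij."""
    phi = np.asarray(phi, dtype=float)
    q = np.linalg.norm(phi, axis=1)                      # quality q_i = ||phi_i||_2
    sq = np.sum(phi**2, axis=1)
    # Squared pairwise distances, clipped at 0 to absorb round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * phi @ phi.T, 0.0)
    K = np.exp(-d2 / (2.0 * sigma**2))
    L = q[:, None] * q[None, :] * K
    return L, K, q

# Rows 0 and 1 are exact duplicates; row 2 points elsewhere.
phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
L, K, q = quality_weighted_kernel(phi)
```

With two duplicate rows, the corresponding rows of `L` coincide and the determinant of `L` vanishes, which is the redundancy penalty the log-det objective relies on.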

#### Objective and greedy rule\.

We use the log-determinant volume criterion from D-optimal experimental design (Pukelsheim, [2006](https://arxiv.org/html/2606.07678#bib.bib3)):

$$
\begin{aligned}
\hat{S} &= \arg\max_{|S|=K}\mathcal{J}(S), \\
\mathcal{J}(S) &= \log\det\!\left(L_{S}+\varepsilon I\right),
\end{aligned}
\tag{3}
$$

which rewards large, non-redundant preference signals. Greedy selection uses

$$
\begin{aligned}
\mathcal{F}(S,i) &= \theta q_{i}+(1-\theta)g_{i}, \\
g_{i} &= \Delta_{i}\log\det\!\left(L_{S}+\varepsilon I\right),
\end{aligned}
\tag{4}
$$

where

$$
\Delta_{i}\log\det(L_{S}+\varepsilon I)=\log\det(L_{S\cup\{i\}}+\varepsilon I)-\log\det(L_{S}+\varepsilon I).
\tag{5}
$$

The parameter $\theta\in[0,1]$ (here a scalar trade-off weight, not the model parameters of the Preliminaries) controls the extra quality bias: $\theta=0$ uses only the quality-weighted kernel, while $\theta=1$ becomes quality-only top-$K$. The following proposition establishes that our choice of $q_{i}=\|\phi_{i}\|_{2}$ is not arbitrary: it makes the log-det objective simultaneously controlled by a quality-energy upper bound and a coverage lower bound.

###### Proposition 1\.

Let $\lambda_{1}\geq\cdots\geq\lambda_{|S|}\geq 0$ be the eigenvalues of $L_{S}\succeq 0$. Since $L_{ii}=q_{i}^{2}$, $\operatorname{tr}(L_{S})=\sum_{i\in S}q_{i}^{2}$, and

$$
\det(L_{S}) \leq \left(\frac{1}{|S|}\sum_{i\in S}q_{i}^{2}\right)^{|S|},
\tag{6}
$$

$$
\lambda_{\min}(L_{S}) \geq \det(L_{S})\,\lambda_{\max}(L_{S})^{1-|S|},
\tag{7}
$$

$$
\det(L_{S}) \to 0 \quad\text{as}\quad K_{ij}\to 1.
\tag{8}
$$

###### Proof\.

Eq. ([6](https://arxiv.org/html/2606.07678#S2.E6)) follows from AM–GM on the eigenvalues and $\operatorname{tr}(L_{S})=\sum_{i}q_{i}^{2}$. Eq. ([7](https://arxiv.org/html/2606.07678#S2.E7)) follows from $\det(L_{S})=\prod_{j}\lambda_{j}\leq\lambda_{\min}(L_{S})\,\lambda_{\max}(L_{S})^{|S|-1}$. For ([8](https://arxiv.org/html/2606.07678#S2.E8)), duplicate directions make rows of $K_{S}$ identical, so $L_{S}=D_{q}K_{S}D_{q}$, with $D_{q}=\operatorname{diag}(q_{i})_{i\in S}$, becomes rank-deficient by continuity. ∎
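A worked two-pair case may help make the proposition concrete; it is an illustration added here, not part of the original proof. For $S=\{1,2\}$,

$$
L_{S}=\begin{pmatrix} q_{1}^{2} & q_{1}q_{2}K_{12}\\ q_{1}q_{2}K_{12} & q_{2}^{2}\end{pmatrix},
\qquad
\det(L_{S})=q_{1}^{2}q_{2}^{2}\,\bigl(1-K_{12}^{2}\bigr).
$$

Since $0<K_{12}\leq 1$ for an RBF kernel, $\det(L_{S})\leq q_{1}^{2}q_{2}^{2}\leq\bigl(\tfrac{1}{2}(q_{1}^{2}+q_{2}^{2})\bigr)^{2}$, which is Eq. (6) with $|S|=2$; the bound is approached when the two pairs are nearly uncorrelated ($K_{12}\to 0$) and have equal quality. Conversely, as the two directions become duplicates ($K_{12}\to 1$), the factor $1-K_{12}^{2}$ drives the determinant to zero regardless of how large $q_{1}$ and $q_{2}$ are, which is Eq. (8).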

*Spanning phase $\leftrightarrow$ optimizing a conditioning bound.* The optimized quantity is the log-volume objective $\log\det(L_{S}+\varepsilon I)$. For fixed or slowly varying $\lambda_{\max}(L_{S})$, Eq. ([7](https://arxiv.org/html/2606.07678#S2.E7)) shows that increasing $\det(L_{S})$ raises a lower bound on $\lambda_{\min}(L_{S})$. Thus, early in selection, when redundant samples make $K_{S}$ and hence $L_{S}$ nearly rank-deficient, the greedy gain favors candidates dissimilar to the current subset: it improves the lower bound on the worst-covered direction.

*Filling phase $\leftrightarrow$ approaching a quality-energy bound.* After redundancy is reduced, diversity gains shrink and the determinant is mainly limited by the diagonal quality mass. Eq. ([6](https://arxiv.org/html/2606.07678#S2.E6)) gives the relevant upper bound, $\det(L_{S})\leq(|S|^{-1}\sum_{i\in S}q_{i}^{2})^{|S|}$, which is tight when the eigenvalues are balanced. Therefore, once the conditioning/coverage bound has improved, further log-det gains are obtained by increasing the quality energy $\sum_{i}q_{i}^{2}$, so high-magnitude samples become preferable. The greedy rule can be read as first improving the $\lambda_{\min}$ lower bound through diversity, then filling covered regions to move toward the AM–GM quality-energy upper bound.

#### Why per\-sample scoring falls short\.

Scalar scoring methods select by top-$K$ or additive scores $\sum_{i}s_{i}$. Such scores only optimize a separable quality proxy and do not control the spectral bound above: two subsets with comparable $\sum_{i}q_{i}^{2}$ can have very different $\det(L_{S})$, and hence very different lower bounds on $\lambda_{\min}(L_{S})$, depending on whether their directions are diverse or duplicated. Our greedy rule keeps both channels: log-det for the coverage bound and $q_{i}$ for signal strength.

#### Multi\-dataset realization\.

Eq. ([3](https://arxiv.org/html/2606.07678#S2.E3)) leaves the basis $B$ unspecified. We construct $B$ from a largest-dataset anchor block and add per-dataset residual blocks to preserve dataset-specific safety directions. Sec. [2.2](https://arxiv.org/html/2606.07678#S2.SS2)–[2.4](https://arxiv.org/html/2606.07678#S2.SS4) detail the construction and variants.

![Refer to caption](https://arxiv.org/html/2606.07678v1/x1.png)
Figure 1: Overview of DOG-DPO. Step 1: each preference pair $(x,y^{+},y^{-})$ is represented as a directional vector $z=h^{+}-h^{-}$ in representation space, encoding the alignment direction. Step 2: an anchor basis $B$ is extracted from the largest dataset $\mathcal{D}_{\text{anchor}}$, and per-dataset residual bases $T_{v}$ capture dataset-specific variation orthogonal to the anchor.

### 2.2 Anchor and Residual Basis Construction

We instantiate the preference basis $B$ from Sec. [2.1](https://arxiv.org/html/2606.07678#S2.SS1) by taking the PCA basis of the largest preference dataset as an anchor, and complementing it with per-dataset residual bases that capture structure not explained by that anchor. The encoder $h$ is a frozen pretrained backbone; representations $h(x,y)\in\mathbb{R}^{d}$ are mean-pooled last-layer hidden states over response tokens.

#### Per\-dataset PCA and anchor selection\.

For each $\mathcal{D}_{v}$, let $U_{v}\in\mathbb{R}^{d\times k}$ be the top-$k$ PCA basis of the centered preference vectors $\{z_{i}^{(v)}-\mu_{v}\}$. We then choose the anchor dataset

$$
v_{*}=\arg\max_{v}|\mathcal{D}_{v}|,
\tag{9}
$$

and define the anchor basis as

$$
B=U_{v_{*}}\in\mathbb{R}^{d\times r},\qquad r=k.
\tag{10}
$$
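The per-dataset PCA and anchor choice can be sketched with an SVD of the centered difference vectors. This is a minimal illustration on random data; computing PCA through `numpy.linalg.svd` is one standard route and an assumption here, and the function names, dataset sizes, and `k=3` are hypothetical.

```python
import numpy as np

def pca_basis(z, k):
    """Top-k principal directions (as columns) of the centered rows of z."""
    centered = z - z.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T                                   # shape (d, k), orthonormal columns

def anchor_basis(datasets, k):
    """Return (v_star, B, bases): largest dataset index, its basis, and all U_v."""
    bases = [pca_basis(z, k) for z in datasets]
    v_star = int(np.argmax([len(z) for z in datasets]))
    return v_star, bases[v_star], bases

# Three toy datasets of difference vectors in d = 8 dimensions.
rng = np.random.default_rng(0)
datasets = [rng.normal(size=(50, 8)), rng.normal(size=(200, 8)), rng.normal(size=(80, 8))]
v_star, B, bases = anchor_basis(datasets, k=3)
```

The sketch requires $k \leq \min(n_v, d)$ for every dataset, which holds for the toy sizes above.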

#### Default choice and robustness\.

We default to the largest available dataset as the anchor primarily for two pragmatic reasons. First, *statistical reliability*: principal-subspace estimation error scales as $\mathcal{O}(1/\sqrt{n})$, meaning the largest dataset naturally yields the most stable top-$k$ basis under bootstrap resampling. Second, *domain breadth*: in our data pool, the largest dataset spans the widest range of harm categories, offering a well-distributed reference frame for global safety boundaries.

Crucially, as we empirically demonstrate via our anchor\-rotation ablation below, the final data selection and downstream safety performance are highly insensitive to this choice—alternating the anchor across different candidate datasets yields nearly identical alignment outcomes\. Therefore, anchoring on the largest dataset is best conceptualized as a reliable and stable default choice to fix a coordinate system, rather than a hypercritical design dependency\. We provide further quantitative metrics of candidate anchors in Appendix[C](https://arxiv.org/html/2606.07678#A3)\.

#### Residual bases relative to the anchor\.

For each dataset, we define the residual subspace by removing the anchor component from its PCA basis:

$$
R_{v}=(I-BB^{\top})U_{v},\quad T_{v}=\mathrm{orth}(R_{v})\in\mathbb{R}^{d\times m_{v}},
\tag{11}
$$

where

$$
m_{v}=\mathrm{rank}(R_{v})\leq k.
\tag{12}
$$

For the anchor dataset itself, this construction yields an empty residual block up to numerical tolerance. More generally, the orthogonal decomposition $U_{v}=BB^{\top}U_{v}+(I-BB^{\top})U_{v}$ separates the portion of dataset $v$ that is aligned with the anchor from the portion that remains dataset-specific.
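A small sketch of Eqs. (11) and (12), assuming $\mathrm{orth}(\cdot)$ is realized by keeping the left singular vectors whose singular values exceed a tolerance. That realization, the tolerance `tol`, and the toy bases are assumptions for illustration.

```python
import numpy as np

def residual_basis(B, U_v, tol=1e-8):
    """T_v = orth((I - B B^T) U_v): the part of span(U_v) orthogonal to span(B)."""
    R = U_v - B @ (B.T @ U_v)                         # remove the anchor component
    left, s, _ = np.linalg.svd(R, full_matrices=False)
    rank = int(np.sum(s > tol))                       # m_v = rank(R_v) <= k
    return left[:, :rank]

# Anchor spans e1, e2 in R^4; the dataset basis shares e1 and mixes e2 with e3.
B = np.eye(4)[:, :2]
U_v = np.column_stack([np.eye(4)[:, 0], (np.eye(4)[:, 1] + np.eye(4)[:, 2]) / np.sqrt(2.0)])
T_v = residual_basis(B, U_v)
T_anchor = residual_basis(B, B)                       # anchor against itself
```

In this toy case only the $e_3$ direction survives the projection, and the anchor dataset's own residual block is empty, as the text states.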

### 2.3 Direction-wise Features

We now instantiate $\phi_{i}$ from Sec. [2.1](https://arxiv.org/html/2606.07678#S2.SS1) as a concrete direction-wise feature vector, with one coordinate per retained anchor/residual direction. For each sample $z_{i}^{(v)}$, projection onto the two bases gives

$$
z_{a,i}=B^{\top}z_{i}^{(v)}\in\mathbb{R}^{r},\qquad z_{r,i}^{(v)}=T_{v}^{\top}z_{i}^{(v)}\in\mathbb{R}^{m_{v}}.
\tag{13}
$$

Since $B$ and $T_{v}$ are orthonormal, the squared coefficients $z_{a,i,j}^{2}$ and $(z_{r,i,m}^{(v)})^{2}$ provide raw per-direction energy measures. We use them directly for the anchor block, and modulate the residual block via a typicality weight described below.

#### Direction weighting\.

Anchor and residual blocks play asymmetric roles. For the anchor block, we use the raw squared coefficients $z_{a,i,j}^{2}$: because $B$ is taken directly from the largest dataset, these coordinates already measure how strongly sample $i$ aligns with the reference geometry. Residual directions have no analogous global reference score, so raw $(z_{r,i,m}^{(v)})^{2}$ can over-weight atypical or noisy dataset-specific outliers. We therefore modulate the residual block by a Mahalanobis-based typicality weight in the dataset's residual subspace,

$$
w_{\mathrm{typ}}(z_{i}^{(v)})=\exp\!\Bigl(-\tfrac{1}{2}\bigl(z_{r,i}^{(v)}-\mu_{r}^{(v)}\bigr)^{\top}\bigl(\Sigma_{r}^{(v)}+\rho I\bigr)^{-1}\bigl(z_{r,i}^{(v)}-\mu_{r}^{(v)}\bigr)\Bigr),
\tag{14}
$$

where $\mu_{r}^{(v)}$, $\Sigma_{r}^{(v)}$ are the per-dataset mean and covariance in the residual subspace and $\rho I$ is a numerical regularizer. This favors residual signal that is *representative* of its source dataset rather than anomalous.
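Eq. (14) can be sketched as below for one dataset's residual coordinates. The use of the sample covariance from `numpy.cov` (which divides by $n-1$), the value of `rho`, and the toy points are assumptions; the paper does not specify these details in this section.

```python
import numpy as np

def typicality_weights(z_res, rho=1e-3):
    """exp(-0.5 * squared Mahalanobis distance) of each row within its own dataset."""
    z_res = np.asarray(z_res, dtype=float)
    centered = z_res - z_res.mean(axis=0)
    # atleast_2d keeps the 1-D residual case (m_v = 1) working.
    cov = np.atleast_2d(np.cov(z_res, rowvar=False))
    precision = np.linalg.inv(cov + rho * np.eye(z_res.shape[1]))
    maha_sq = np.einsum("ij,jk,ik->i", centered, precision, centered)
    return np.exp(-0.5 * maha_sq)

# Five points clustered near the origin plus one far-away outlier (last row).
z_res = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [5.0, 5.0]])
w = typicality_weights(z_res)
```

The outlier receives the smallest weight, so its residual energy is down-weighted relative to samples that are representative of the dataset.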

#### Feature vector\.

Setting $\psi_{i,j}^{\mathrm{anchor}}=z_{a,i,j}^{2}$ and $\psi_{i,m}^{(v)}=(z_{r,i,m}^{(v)})^{2}\,w_{\mathrm{typ}}(z_{i}^{(v)})$, we concatenate them:

$$
\phi_{i}=\bigl[\psi^{\mathrm{anchor}}_{i,1},\dots,\psi^{\mathrm{anchor}}_{i,r},\,\psi^{(v)}_{i,1},\dots,\psi^{(v)}_{i,m_{v}}\bigr].
\tag{15}
$$

This $\phi_{i}$ is the concrete instantiation used in $\mathcal{J}(S)$ (Eq. [3](https://arxiv.org/html/2606.07678#S2.E3)); its magnitude $q_{i}=\|\phi_{i}\|_{2}$ aggregates sample-level signal across both blocks, while the direction $\hat{\phi}_{i}$ encodes which anchor and residual modes the sample activates.
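The assembly of Eq. (15) for a single dataset can be sketched as follows. How residual blocks of different widths $m_v$ are laid out in a common feature space across datasets is not specified in this section, so the sketch deliberately handles one dataset only; the function name and toy inputs are hypothetical.

```python
import numpy as np

def direction_features(z, B, T_v, w_typ):
    """phi_i = [anchor energies, typicality-weighted residual energies] and q_i."""
    z_a = z @ B                                # (n, r) anchor coefficients, Eq. (13)
    z_r = z @ T_v                              # (n, m_v) residual coefficients
    psi_anchor = z_a**2
    psi_res = (z_r**2) * w_typ[:, None]        # modulate residual block by typicality
    phi = np.concatenate([psi_anchor, psi_res], axis=1)
    q = np.linalg.norm(phi, axis=1)
    return phi, q

# Anchor spans e1, e2; residual spans e3; one sample with typicality weight 0.5.
B = np.eye(4)[:, :2]
T_v = np.eye(4)[:, 2:3]
z = np.array([[1.0, 2.0, 3.0, 9.0]])
phi, q = direction_features(z, B, T_v, w_typ=np.array([0.5]))
```

The fourth coordinate of `z` lies outside both bases and therefore contributes nothing to `phi`.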

### 2.4 Two greedy variants: DOG-D and DOG

*Note.* For visual clarity, both panels plot the un-typicality-weighted projection scores $\psi^{(v)}_{i,m}=(z^{(v)}_{r,i,m})^{2}$; the DPP selection itself uses the typicality-modulated variant of Eq. [14](https://arxiv.org/html/2606.07678#S2.E14), whose Mahalanobis weight squashes most samples toward $0$ and obscures the spatial structure shown here.

We refer to the overall framework as DOG-DPO, and instantiate it with two greedy selection rules: DOG-D (the main method, optimizing the full quality–diversity objective in Eq. ([4](https://arxiv.org/html/2606.07678#S2.E4))) and DOG (an ablation that ranks by quality scores only).

DOG-D directly maximizes $\mathcal{J}(S)$ in Eq. ([3](https://arxiv.org/html/2606.07678#S2.E3)) with the greedy rule

$$
i^{*}=\arg\max_{i\notin S}\;\bigl[\theta\,q_{i}+(1-\theta)\,\Delta_{i}\log\det(L_{S}+\varepsilon I)\bigr],
\tag{16}
$$

using $\theta=0.1$ throughout (sensitivity in Sec. [3.4](https://arxiv.org/html/2606.07678#S3.SS4)).
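The greedy rule of Eq. (16) can be sketched naively by recomputing the log-determinant for every candidate. This is an illustrative sketch, not the authors' implementation: it is far too slow for a pool of hundreds of thousands of pairs, where incremental updates would be needed. The values of `sigma` and `eps` and the toy features are assumptions; `theta=0.1` follows the text.

```python
import numpy as np

def build_kernel(phi, sigma=1.0):
    """Quality-weighted RBF kernel L_ij = q_i q_j K_ij (Eq. 2) and qualities q."""
    q = np.linalg.norm(phi, axis=1)
    sq = np.sum(phi**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * phi @ phi.T, 0.0)
    return q[:, None] * q[None, :] * np.exp(-d2 / (2.0 * sigma**2)), q

def greedy_dog_d(L, q, budget, theta=0.1, eps=1e-6):
    """Repeatedly add the index maximizing theta*q_i + (1-theta)*log-det gain."""
    selected, current = [], 0.0                # log det of the empty matrix is 0
    for _ in range(min(budget, len(q))):
        best = None
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = L[np.ix_(idx, idx)] + eps * np.eye(len(idx))
            logdet = np.linalg.slogdet(sub)[1]
            score = theta * q[i] + (1.0 - theta) * (logdet - current)
            if best is None or score > best[0]:
                best = (score, i, logdet)
        selected.append(best[1])
        current = best[2]
    return selected

# Rows 0 and 1 are near-duplicates with high quality; row 2 is weaker but distinct.
phi = np.array([[2.0, 0.0], [2.0, 0.05], [0.0, 1.5]])
L, q = build_kernel(phi)
picked = greedy_dog_d(L, q, budget=2)
```

With a budget of two, the rule takes one of the near-duplicates first and then the distinct, lower-quality sample, whereas a pure top-$K$ ranking by $q_i$ would take both near-duplicates.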

DOG is the trace-only degeneration of DOG-D: dropping the log-determinant term reduces selection to ranking by $q_{i}=\|\phi_{i}\|_{2}$. We use a slight refinement that exposes the anchor/residual contributions separately, forming the score pair $(s^{\mathrm{anchor}}_{i},s^{\mathrm{res}}_{i})=(\sum_{j}\psi^{\mathrm{anchor}}_{i,j},\sum_{m}\psi^{(v)}_{i,m})$, min-max normalizing within each dataset, and selecting via Pareto frontier on the 2-D objective. As a result, DOG optimizes a quality-only proxy and ignores the off-diagonal redundancies captured by $L_{S}$. Consequently, DOG concentrates on the Pareto frontier of $(s^{\mathrm{anchor}},s^{\mathrm{res}})$, while DOG-D spreads selection across it.
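The two ingredients named above, min-max normalization and a 2-D Pareto frontier, can be sketched as follows. The section does not say how the frontier is turned into a subset of exactly $K$ pairs, so the sketch stops at identifying non-dominated samples; the helper names and scores are hypothetical.

```python
import numpy as np

def minmax(x):
    """Rescale a score vector to [0, 1]; constant vectors map to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def pareto_front(points):
    """Indices of rows not dominated by any other row (larger is better on both axes)."""
    keep = []
    for i in range(len(points)):
        dominated = any(
            np.all(points[j] >= points[i]) and np.any(points[j] > points[i])
            for j in range(len(points))
            if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Toy anchor and residual scores for four samples from one dataset.
s_anchor = np.array([1.0, 3.0, 2.0, 0.5])
s_res = np.array([4.0, 1.0, 2.0, 0.5])
points = np.column_stack([minmax(s_anchor), minmax(s_res)])
front = pareto_front(points)
```

Sample 3 is weaker than every other sample on both axes and is the only one excluded from the frontier.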

![Refer to caption](https://arxiv.org/html/2606.07678v1/latex/figures/scatter_dog_dogd_llama32.png)
Figure 2: DOG vs. DOG-D in the anchor–residual plane. DOG concentrates on Pareto-frontier samples; DOG-D spreads across the plane.

## 3 Experiment

### 3.1 Datasets

We evaluate our method on a diverse set of safety and robustness benchmarks, covering both standard alignment metrics and adversarial attack scenarios\. Unless otherwise noted, the representation extractor is the same frozen backbone as the downstream DPO model used in that experiment\.

#### Benchmarks\.

We evaluate on four robustness benchmarks, AutoDAN-HGA Liu et al. ([2023](https://arxiv.org/html/2606.07678#bib.bib27)) (genetic-search jailbreak over 520 AdvBench prompts), WildJailbreak Jiang et al. ([2024](https://arxiv.org/html/2606.07678#bib.bib5)) (200 harmful prompts), JailbreakBench Chao et al. ([2024](https://arxiv.org/html/2606.07678#bib.bib6)) (100 prompts across 10 harm categories), and HarmBench Mazeika et al. ([2024](https://arxiv.org/html/2606.07678#bib.bib31)) (200 prompts, both keyword and GPT-judged ASR), and on two utility benchmarks: TruthfulQA-MC2 Lin et al. ([2022](https://arxiv.org/html/2606.07678#bib.bib24)) and ToxiGen Hartvigsen et al. ([2022](https://arxiv.org/html/2606.07678#bib.bib25)). Together, these benchmarks cover both *utility* (truthfulness and safety compliance) and *robustness* (jailbreak resistance). The union of all five preference datasets contains roughly 260k preference pairs in total; the $K=30\mathrm{k}$ budget used in our main experiments therefore corresponds to about 11.5% of the pool.

#### Baselines\.

We compare DOG-D against representative baselines spanning the main families of preference-data selection: the unaligned backbone (*Base*); *Random* sampling; distribution-based clustering with our feature vectors (*KMeans*); gradient-based stratified sampling (*Laser* (Mirza et al., [2025](https://arxiv.org/html/2606.07678#bib.bib20))); LLM-judge scoring (*DEITA* (Liu et al., [2024a](https://arxiv.org/html/2606.07678#bib.bib2))); reward-model rejection sampling (*RM-RS*); training-based selection (*SDPO* (Gao et al., [2025](https://arxiv.org/html/2606.07678#bib.bib17)), which uses a reference DPO run to filter unstable examples); and full-data DPO on all 259k pairs without selection (*Whole*).

Table 1: Transposed aggregate utility–robustness trade-off. DOG-D (Ours) is the main method. R denotes Robustness ($1-\mathrm{ASR}$), U denotes Utility. Best in bold, second best underlined (within Tier 1+2). ‡: DOG is an ablation variant of DOG-D without the log-det diversity term. †: Tier-3 methods require full DPO training before selection (SDPO) or use the entire dataset (Whole). They incur 15–35$\times$ the compute of Tier-2 (training-free) methods and serve as upper-bound references rather than directly comparable baselines under a fair compute budget.

| Tier | Method | Gemma3-1B, 10k: R↑ | Gemma3-1B, 10k: U↑ | Gemma3-1B, 30k: R↑ | Gemma3-1B, 30k: U↑ | LLaMA3.2-3B, 10k: R↑ | LLaMA3.2-3B, 10k: U↑ | LLaMA3.2-3B, 30k: R↑ | LLaMA3.2-3B, 30k: U↑ |
|---|---|---|---|---|---|---|---|---|---|
| T1 | Base | 0.6155 | 0.4104 | 0.6155 | 0.4104 | 0.7460 | 0.5245 | 0.7460 | 0.5245 |
| T1 | Random | 0.6788 | 0.4275 | 0.6920 | 0.4350 | 0.8095 | 0.5380 | 0.8113 | 0.5405 |
| T2 | KMeans | 0.7130 | 0.4485 | 0.7485 | 0.4865 | 0.9071 | 0.5515 | 0.8983 | 0.5695 |
| T2 | Laser | 0.7320 | 0.4605 | 0.7650 | 0.4555 | 0.8965 | 0.5630 | 0.8600 | 0.5630 |
| T2 | DEITA | 0.6765 | 0.4144 | 0.6765 | 0.4347 | 0.8026 | 0.5438 | 0.8286 | 0.5464 |
| T2 | RM-RS | 0.6745 | 0.4355 | 0.7165 | 0.4595 | 0.8560 | 0.5485 | 0.9603 | 0.5590 |
| T2 | DOG‡ | 0.7270 | 0.4380 | 0.7690 | 0.4620 | 0.9083 | 0.5570 | 0.9608 | 0.5720 |
| T2 | DOG-D (Ours) | 0.7650 | 0.4580 | 0.7725 | 0.4710 | 0.9563 | 0.5580 | 0.9657 | 0.5760 |
| T3† | SDPO | 0.7438 | 0.5565 | 0.7888 | 0.5625 | 0.9358 | 0.5565 | 0.9163 | 0.5625 |
| T3† | Whole | 0.7438 | 0.4589 | 0.7438 | 0.4589 | 0.9768 | 0.6025 | 0.9768 | 0.6025 |

To ensure a fair comparison, we separate baselines into tiers according to their computational requirements relative to our training-free setting. Tier 1 contains the unaligned *Base* model and *Random* selection. Tier 2 contains training-free selection methods that operate at comparable compute to ours (KMeans, Laser, DEITA, RM-RS). Tier 3 contains methods that require either a full DPO training pass before selection (SDPO) or use the entire dataset without selection (Whole), incurring 15–35$\times$ the compute of Tier-2 methods (Figure [3](https://arxiv.org/html/2606.07678#S3.F3)). We report Tier-3 results as upper-bound references rather than directly comparable baselines under a fair compute budget.

### 3\.2Evaluation

#### Evaluation Metrics\.

We group benchmarks into *utility* and *robustness \(ASR\)*\. For each trained model we compute three aggregate scores from the per\-benchmark numbers reported in Tables [4](https://arxiv.org/html/2606.07678#A6.T4) and [5](https://arxiv.org/html/2606.07678#A6.T5) in the Appendix\. Robustness converts each attack\-success rate $\mathrm{ASR}_m$ into a higher\-is\-better score $1 - \mathrm{ASR}_m$ and averages over the attack benchmarks: AutoDAN, WildJailbreak, JailbreakBench, and HarmBench\. Utility averages two higher\-is\-better axes of general capability\. The Overall score is the mean of the two aggregates, balancing safety and helpfulness\.
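The three aggregates can be sketched in a few lines\. This is only an illustration of the arithmetic described above: the function names are hypothetical, the per\-benchmark values are made\-up placeholders rather than numbers from the paper, and the two utility axes are assumed to be TruthfulQA\-MC2 and ToxiGen because those are the ones named in the results discussion\.

```python
from statistics import mean


def robustness(asr_by_benchmark):
    """Mean of (1 - ASR_m) over the attack benchmarks; higher is better."""
    return mean(1.0 - asr for asr in asr_by_benchmark.values())


def utility(score_by_benchmark):
    """Mean of the higher-is-better utility scores."""
    return mean(score_by_benchmark.values())


def overall(r, u):
    """Overall score: plain average of the two aggregates."""
    return (r + u) / 2.0


# Placeholder inputs (NOT results from the paper).
asr = {"AutoDAN": 0.05, "WildJailbreak": 0.10,
       "JailbreakBench": 0.02, "HarmBench": 0.03}
util = {"TruthfulQA-MC2": 0.58, "ToxiGen": 0.56}

R = robustness(asr)   # (0.95 + 0.90 + 0.98 + 0.97) / 4
U = utility(util)     # (0.58 + 0.56) / 2
print(round(R, 4), round(U, 4), round(overall(R, U), 4))
```

Because each ASR enters as $1 - \mathrm{ASR}_m$ before averaging, a single badly defended benchmark lowers R in proportion to its ASR rather than being masked by the others\.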

### 3\.3Main Result

Table [1](https://arxiv.org/html/2606.07678#S3.T1) shows that DOG\-D delivers the strongest overall utility–robustness trade\-off across all Tier 1 and 2 settings while using only 10k or 30k preference pairs\.

On LLaMA3\.2\-3B, DOG\-D attains the highest aggregate robustness at both budgets \($R = 0.9563$ at 10k, $R = 0.9657$ at 30k\), exceeding the next\-best Tier\-2 baseline \(KMeans at 10k, $R = 0.9071$; DOG‡ at 30k, $R = 0.9608$\) by 4\.9 points at 10k, and outperforming the unaligned backbone by over 21 points\. It simultaneously achieves the best Tier\-2 utility at 30k \($U = 0.5760$\), driven by gains on both TruthfulQA\-MC2 \(0\.589 vs\. 0\.561 for Laser\) and ToxiGen\. Per\-benchmark breakdowns in Table [4](https://arxiv.org/html/2606.07678#A6.T4) confirm these aggregate gains: DOG\-D reaches the lowest AutoDAN\-GPT ASR among all Tier\-2 methods at both 10k \(0\.0597\) and 30k \(0\.0272\), and ties the lowest GPT\-judged HarmBench ASR at 30k \(0\.0050\)\.

On Gemma3\-1B, DOG\-D again leads on aggregate robustness at both budgets \($R = 0.7650$ at 10k, $R = 0.7725$ at 30k\) and on utility at 30k \($U = 0.4710$\)\. At 10k, Laser edges DOG\-D slightly on utility \(0\.4605 vs\. 0\.4580\) but at the cost of lower robustness \($R = 0.7320$ vs\. 0\.7650\); DOG\-D retains the best overall trade\-off\. These results support our claim that combining anchor/residual structure with feature\-level diversity yields subsets that are more informative than their size alone\.

Comparison to Tier\-3 references\. A small gap to Whole remains on LLaMA3\.2\-3B at 30k \($R = 0.9657$ / $U = 0.5760$ vs\. 0\.9768 / 0\.6025\), concentrated on TruthfulQA\-MC2 \(0\.5890 vs\. 0\.6380\) where the full pool offers broader knowledge coverage\. SDPO leads DOG\-D on isolated robustness metrics \(e\.g\., LLaMA 10k WildJailbreak, 0\.1650 vs\. 0\.1150\)\. Using 11\.6% of the data at 15–35× lower compute \(Fig\. [3](https://arxiv.org/html/2606.07678#S3.F3)\), DOG\-D offers a favorable trade\-off\.

#### Extension to Larger LLM\.

To verify that the benefits of DOG\-D generalize beyond the 1B–3B scale, we further evaluate on LLaMA3\.1\-8B\-Instruct, an order\-of\-magnitude larger backbone\. Because full DPO on an 8B model is impractical, we adopt parameter\-efficient training via LoRA \($r = 16$, $\alpha = 32$\)\. Table [2](https://arxiv.org/html/2606.07678#S3.T2) reports robustness across four complementary safety benchmarks \(AutoDAN\-GPT, WildJailbreak, HarmBench and XSTEST attack\-success rate\)\. DOG\-D once again attains the strongest attack defense among all pre\-selection methods, achieving GPT\-judged AutoDAN ASR of 0\.2% \(a 430× improvement over the unaligned Base model and 12× better than the next\-best Laser\), HB\-GPT = 0\.0%, and XSTEST\-ASR = 5\.2% \(tied with Laser\)\. Notably, this gain holds despite the much smaller trainable parameter budget of LoRA, demonstrating that the diversity\-aware selection signal transfers cleanly across both model scale and adapter\-based finetuning\. We also include SDPO for completeness: although it achieves near\-zero attack ASR under standard benchmarks, it over\-refuses 39% of benign WildJailbreak prompts, a clear over\-alignment failure mode that none of the Tier\-1/2 pre\-selection methods exhibit\.

Table 2: LLaMA3\.1\-8B LoRA, $K = 30\text{k}$ \(values ×100\)\. †: Tier\-3 methods require full DPO training before selection \(SDPO\) or use the entire dataset \(Whole\)\.

| Method | AD↓ | WildJailbreak↓ | HarmBench↓ | XS↓ |
| --- | --- | --- | --- | --- |
| Base | 81\.7 | 51\.0 | 10\.0 | 6\.0 |
| Rand | 6\.5 | 7\.0 | 0\.5 | 7\.0 |
| Laser | 2\.3 | 12\.0 | 1\.0 | 5\.2 |
| KMeans | 7\.7 | 5\.5 | 1\.0 | 8\.5 |
| DEITA | 24\.4 | 21\.0 | 4\.0 | 31\.5 |
| DOG\-D | 0\.2 | 3\.5 | 0\.0 | 5\.2 |
| SDPO† | 0\.0 | 0\.0 | 0\.0 | 11\.7 |

![Refer to caption](https://arxiv.org/html/2606.07678v1/latex/figures/method_runtime_horiz.png) Figure 3: Runtime comparison on identical preference pools\.

### 3\.4Analysis

#### Runtime efficiency\.

Beyond selection quality, practical deployment hinges on how expensive the selection pipeline itself is\. We therefore benchmark the end\-to\-end wall\-clock cost of our method against representative baselines in Figure [3](https://arxiv.org/html/2606.07678#S3.F3)\. Quality\-scoring approaches such as DEITA \(Liu et al\., [2024a](https://arxiv.org/html/2606.07678#bib.bib2)\) are particularly costly: for every candidate they invoke a 13B scorer LLM *twice* \(once for quality, once for complexity\), and each invocation is not a single forward pass but a full *autoregressive decoding* call that generates a chain\-of\-thought rationale together with the score token\. LASER \(Mirza et al\., [2025](https://arxiv.org/html/2606.07678#bib.bib20)\), a gradient\-similarity method, removes the decoding bottleneck by replacing the generative scorer with a regression head, so each candidate only incurs a single LLM forward pass; however, it still routes samples through domain\-specific quality scorers and a constraint\-verification step for instruction following, which together amount to roughly 6 GPU\-hours on our pool\. In contrast, our pipeline only performs a one\-shot embedding pass followed by a DPP\-based selection whose greedy implementation runs in $\mathcal{O}(Nk^{2})$ time and parallelizes trivially across shards\. End\-to\-end, this brings the total cost down to less than 24 minutes\.
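To make the $\mathcal{O}(Nk^{2})$ figure concrete, the following is a minimal sketch of greedy log\-det selection with incremental Cholesky\-style updates, a standard way to implement greedy DPP MAP inference\. It is not the authors' code: it assumes a linear kernel $L = \Phi\Phi^{\top}$ over the feature rows, and the name `greedy_logdet_select` and the default `eps` are hypothetical\.

```python
import numpy as np


def greedy_logdet_select(features, k, eps=1e-6):
    """Greedily pick k rows maximizing log det(L_S + eps*I), with L = F F^T.

    Each step costs O(N*d) for one kernel row plus O(N*t) for the update,
    so the update cost totals O(N*k^2) and L is never formed in full.
    """
    n = features.shape[0]
    k = min(k, n)
    # d2[i]: current marginal gain (Schur complement) of adding item i.
    d2 = np.einsum("ij,ij->i", features, features) + eps
    C = np.zeros((k, n))            # row t holds the t-th Cholesky coefficients
    available = np.ones(n, dtype=bool)
    selected = []
    for t in range(k):
        j = int(np.argmax(np.where(available, d2, -np.inf)))
        selected.append(j)
        available[j] = False
        row = features @ features[j]                  # kernel row L[j, :]
        e = (row - C[:t].T @ C[:t, j]) / np.sqrt(d2[j])
        C[t] = e
        # The exact Schur complement is >= eps, so clamp away rounding error.
        d2 = np.maximum(d2 - e ** 2, eps)
    return selected


# Two identical directions plus one orthogonal one: the duplicate is skipped.
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(greedy_logdet_select(toy, 2))  # -> [0, 2]
```

The toy call shows the intended behaviour: once a direction is covered, a duplicate of it has almost no remaining gain, so the next pick goes to an uncovered direction\.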

#### Dynamic selection behaviour\.

![Refer to caption](https://arxiv.org/html/2606.07678v1/x3.png) Figure 4: Dynamic DPP behaviour across models\. DOG\-D maintains a positive diversity\-gain gap over Random across the ranking sweep, while DOG saturates earlier\. Annotations at $k \in \{5\mathrm{k}, 10\mathrm{k}, 30\mathrm{k}\}$ show the corresponding downstream safety scores\.

Figure [4](https://arxiv.org/html/2606.07678#S3.F4) shows that DOG\-D continues to select geometrically complementary samples across the entire ranking sweep, keeping $\log\det(L_{S} + \epsilon I)$ close to zero\. DOG \(score\-only\), by contrast, saturates quickly once the Pareto frontier is covered and its $\log\det$ collapses, reflecting growing redundancy in the selected set\. The same trend holds across both Gemma\-3\-1B and LLaMA\-3\.2\-3B, and aligns with DOG\-D’s stronger downstream robustness at the canonical budgets\.
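A small numerical illustration of why $\log\det(L_{S} + \epsilon I)$ stays near zero for complementary samples and collapses for redundant ones\. The sketch assumes unit\-norm feature rows and a linear kernel $L_S = \Phi_S\Phi_S^{\top}$; both are assumptions made for the example, and `logdet_diversity` is a hypothetical helper name\.

```python
import numpy as np


def logdet_diversity(features, eps=1e-6):
    """log det(L_S + eps*I) for the linear kernel L_S = F F^T."""
    gram = features @ features.T
    _, value = np.linalg.slogdet(gram + eps * np.eye(len(features)))
    return value


# Three mutually orthogonal unit directions: L_S = I, so log det is ~0.
complementary = np.eye(3)
# The same unit direction three times: L_S has rank 1, so log det collapses.
redundant = np.tile(np.array([[1.0, 0.0, 0.0]]), (3, 1))

print(logdet_diversity(complementary))  # close to 0
print(logdet_diversity(redundant))      # large negative, roughly log(3) + 2*log(eps)
```

With unit\-norm rows the value can never exceed $|S| \log(1 + \epsilon)$, so "close to zero" corresponds to a nearly orthogonal selected set\.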

#### Anchor robustness\.

The basis construction \(Sec\. [2\.2](https://arxiv.org/html/2606.07678#S2.SS2)\) requires choosing a single preference dataset as the *anchor* for the rotation that aligns the shared subspace\. We test whether this choice biases downstream safety by re\-running selection with three different anchors, cvalues \(default\), pku\_safe\_rlhf, and hh\_harmless\_base, under the same DPO recipe\. Figure [5](https://arxiv.org/html/2606.07678#S3.F5) reports four safety metrics, all lower is better\. The three anchors agree to within 1\.5 pp on every metric: JBB\-GPT and HB\-GPT are essentially identical, HB\-kw varies by 0\.5 pp, and WJ\-asr spans only $0.115 \rightarrow 0.130$\. The differences are within seed noise\.

![Refer to caption](https://arxiv.org/html/2606.07678v1/x4.png) Figure 5: Anchor\-rotation robustness\. Safety metrics remain stable when the shared subspace is aligned using different anchor datasets\.

#### Quality weight $\theta$\.

Sweeping $\theta \in \{0, 0.1, 0.5, 1.0\}$ \(Fig\. [6](https://arxiv.org/html/2606.07678#S3.F6)\), we find that $\theta = 0.1$ matches the pure\-diversity baseline \($\theta = 0$\) on WildJailbreak and improves HarmBench\-kw by 22% and HarmBench\-GPT by 50% at $K = 30\text{k}$\. Larger $\theta$ degrades every metric monotonically: at $\theta = 1.0$, WJ rises $0.105 \to 0.255$ and AutoDAN\-GPT jumps $0.029 \to 0.214$ \(7\.4×\)\. Pure\-quality selection picks homogeneously “easy” \(high\-margin\) samples that minimise training gradient and collapse on adaptive attacks\.

#### Ablation of the diversity term\.

DOG and DOG\-D share the identical anchor/residual feature space and differ only in the selection rule: DOG ranks samples via Pareto frontier on the two scalar scores, while DOG\-D adds the log\-determinant diversity term\. The gap between DOG and DOG\-D therefore isolates the contribution of the diversity term within our framework\. The substantial gap on LLaMA\-3\.2\-3B at K=30k \(HB\-kw 0\.235 vs\. 0\.090; AutoDAN 0\.037 vs\. 0\.027\) confirms that structured features alone are insufficient—directional redundancy must be explicitly suppressed, consistent with the spanning–filling analysis in Sec\.[2\.1](https://arxiv.org/html/2606.07678#S2.SS1)\.
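For readers unfamiliar with Pareto\-frontier ranking, the score\-only rule used by DOG can be sketched as non\-dominated sorting over the two scalar scores\. This is an illustrative sketch rather than the authors' implementation: it assumes both scores are higher\-is\-better, ranks by successive fronts, and uses hypothetical names \(`dominates`, `pareto_rank`\) and made\-up score values\.

```python
def dominates(a, b):
    """True if a is at least as good as b on every score and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_rank(scores):
    """Return the front index of each sample (0 = Pareto frontier).

    Simple O(n^2)-per-front peeling, adequate for a sketch.
    """
    remaining = set(range(len(scores)))
    rank = [None] * len(scores)
    front = 0
    while remaining:
        current = [i for i in remaining
                   if not any(dominates(scores[j], scores[i])
                              for j in remaining if j != i)]
        for i in current:
            rank[i] = front
        remaining -= set(current)
        front += 1
    return rank


# Made-up (anchor score, residual score) pairs.
scores = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4), (0.05, 0.05)]
print(pareto_rank(scores))  # -> [0, 0, 0, 1, 2]
```

Note that nothing in this rule looks at the direction of a sample, only at its two scalars, so two near\-duplicate pairs with the same scores land on the same front; this is the redundancy the log\-determinant term is meant to suppress\.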

![Refer to caption](https://arxiv.org/html/2606.07678v1/x5.png) Figure 6: Sensitivity under different quality weights\. Attack success rates on four safety benchmarks as $\theta$ varies, showing that $\theta = 0.1$ achieves the best results\.

## 4Conclusion

In this work, we study the problem of data\-efficient alignment for large language models and identify a key limitation of existing data selection methods: they do not explicitly model the structured nature of preference data, especially in multi\-dataset settings\. We argue that preference signals admit a geometric interpretation, where alignment information is encoded as structured directions in representation space\. Based on this view, we propose DOG\-DPO, a dynamic and geometry\-driven data selection framework that identifies anchor geometry and dataset\-specific residual variations, and selects subsets that maximize geometric coverage via volume expansion\. This formulation enables efficient selection without relying on external supervision or training\. Extensive experiments demonstrate that our method achieves strong performance using significantly less data, while improving coverage and reducing redundancy\. These results highlight that effective alignment depends not only on data scale, but on understanding the structured geometry of preference data\.

## 5Limitations

Although DOG\-DPO is substantially more efficient than training\-based or judge\-based selection, it still requires representation extraction and subset optimization over the candidate pool, which can become costly at very large scale\. In addition, our evaluation focuses primarily on safety alignment benchmarks\. Although the LLaMA3\.1\-8B LoRA experiment provides initial evidence that DOG\-DPO transfers beyond the 1B–3B full\-finetuning setting, broader validation across more model families, larger backbones, and non\-safety preference domains remains future work\.

## 6Broader Impact

DOG\-DPO can make safety alignment more efficient by reducing redundant preference data, lowering training costs, and decreasing dependence on expensive teacher or judge models\. By selecting preference pairs through multi\-dataset geometric coverage, the method provides a practical way to retain complementary alignment signals while keeping the selected training set compact\. We see no significant risk of direct negative societal impact from using this data selection method to improve the efficiency of existing alignment workflows\.

## 7Disclosure of LLM Use

During the preparation of this manuscript, LLM\-based tools were used to assist with writing, grammar correction, preliminary data cleaning, and figure design\.

## References

- Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon, et al\. \(2022\) Constitutional ai: harmlessness from ai feedback\. arXiv preprint arXiv:2212\.08073\.
- A\. Bukharin et al\. \(2024\) Data diversity matters for robust instruction tuning\. In Findings of the Association for Computational Linguistics: EMNLP 2024\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.195/)\.
- P\. Chao, E\. Debenedetti, A\. Robey, M\. Andriushchenko, F\. Croce, V\. Sehwag, E\. Dobriban, N\. Flammarion, G\. J\. Pappas, F\. Tramer, et al\. \(2024\) Jailbreakbench: an open robustness benchmark for jailbreaking large language models\. Advances in Neural Information Processing Systems 37, pp\. 55005–55029\.
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma, et al\. \(2024\) Scaling instruction\-finetuned language models\. Journal of Machine Learning Research 25 \(70\), pp\. 1–53\.
- C\. Gao, H\. Li, L\. Liu, Z\. Xie, P\. Zhao, and Z\. Xu \(2025\) Principled data selection for alignment: the hidden risks of difficult examples\. In International Conference on Machine Learning \(ICML\)\.
- T\. Hartvigsen, S\. Gabriel, H\. Palangi, M\. Sap, D\. Ray, and E\. Kamar \(2022\) ToxiGen: a large\-scale machine\-generated dataset for adversarial and implicit hate speech detection\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 3309–3326\. External Links: [Link](https://aclanthology.org/2022.acl-long.234/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.234)\.
- L\. Jiang, K\. Rao, S\. Han, A\. Ettinger, F\. Brahman, S\. Kumar, N\. Mireshghallah, X\. Lu, M\. Sap, Y\. Choi, and N\. Dziri \(2024\) WildTeaming at scale: from in\-the\-wild jailbreaks to \(adversarially\) safer language models\. External Links: 2406\.18510, [Link](https://arxiv.org/abs/2406.18510)\.
- G\. Kim, Y\. J\. Kim, B\. Kim, H\. Lee, K\. Bae, Y\. Jang, and M\. Lee \(2025\) Safedpo: a simple approach to direct preference optimization with enhanced safety\. arXiv preprint arXiv:2505\.20065\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 3214–3252\. External Links: [Link](https://aclanthology.org/2022.acl-long.229/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229)\.
- W\. Liu, W\. Zeng, K\. He, Y\. Jiang, and J\. He \(2024a\) What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning\. In International Conference on Learning Representations \(ICLR\)\.
- X\. Liu, N\. Xu, M\. Chen, and C\. Xiao \(2023\) Autodan: generating stealthy jailbreak prompts on aligned large language models\. arXiv preprint arXiv:2310\.04451\.
- Z\. Liu, K\. Zhou, W\. X\. Zhao, D\. Gao, Y\. Li, and J\. Wen \(2024b\) Less is more: high\-value data selection for visual instruction tuning\. arXiv preprint arXiv:2403\.09559\.
- H\. Lu, L\. Fang, R\. Zhang, X\. Li, J\. Cai, H\. Cheng, L\. Tang, Z\. Liu, Z\. Sun, T\. Wang, et al\. \(2025\) Alignment and safety in large language models: safety mechanisms, training paradigms, and emerging challenges\. arXiv preprint arXiv:2507\.19672\.
- M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, D\. Forsyth, and D\. Hendrycks \(2024\) HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\. External Links: 2402\.04249\.
- P\. Mirza, L\. Weber, and F\. Küch \(2025\) LASER: stratified selective sampling for instruction tuning\. In Findings of EMNLP\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, et al\. \(2022\) Training language models to follow instructions with human feedback\. Advances in neural information processing systems 35, pp\. 27730–27744\.
- F\. Pukelsheim \(2006\) Optimal design of experiments\. SIAM\.
- S\. Shang, Y\. Chen, Y\. Wang, Y\. Li, and Z\. Zhang \(2025\) Drivedpo: policy learning via safety dpo for end\-to\-end autonomous driving\. arXiv preprint arXiv:2509\.17940\.
- Z\. Wang, B\. Bi, S\. K\. Pentyala, K\. Ramnath, S\. Chaudhuri, S\. Mehrotra, X\. Mao, S\. Asur, et al\. \(2024\) A comprehensive survey of llm alignment techniques: rlhf, rlaif, ppo, dpo and more\. arXiv preprint arXiv:2407\.16216\.
- R\. Xu, M\. Parmar, T\. Yang, Z\. Hu, Y\. Zhao, and K\. Ding \(2026\) CoAct: co\-active llm preference learning with human\-ai synergy\. arXiv preprint arXiv:2604\.17501\.
- T\. Yang, Y\. Nian, X\. Li, R\. Xu, K\. Ding, and Y\. Zhao \(2026\) Cat\-dpo: category\-adaptive safety alignment\. arXiv preprint arXiv:2604\.17299\.
- X\. Yang, S\. Nie, L\. Liu, S\. Gururangan, U\. Karn, R\. Hou, M\. Khabsa, and Y\. Mao \(2025a\) Diversity\-driven data selection for language model tuning through sparse autoencoder\. In ICML Workshop\.
- Y\. Yang, Y\. Nan, J\. Ye, S\. Dou, X\. Wang, S\. Li, H\. Lv, T\. Gui, Q\. Zhang, and X\. Huang \(2025b\) Measuring data diversity for instruction tuning: a systematic analysis and a reliable metric\. In ACL\.
- J\. Zhang, Y\. Qin, R\. Pi, W\. Zhang, R\. Pan, and T\. Zhang \(2025\) TAGCOS: task\-agnostic gradient clustered coreset selection for instruction tuning data\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 4686–4701\. External Links: [Link](https://aclanthology.org/2025.findings-naacl.264/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.264)\.

## Appendix ARelated Work

### A\.1Data Selection for Instruction Tuning and Alignment

Existing methods can be broadly categorized based on the type of signal used to guide selection\.

Distribution\- and representation\-based methods\. A line of work models datasets as distributions in embedding space and selects samples to maximize diversity and coverage\. Methods such as diversity metrics Yang et al\. \([2025b](https://arxiv.org/html/2606.07678#bib.bib18)\), stratified sampling Mirza et al\. \([2025](https://arxiv.org/html/2606.07678#bib.bib20)\), and QDIT Bukharin et al\. \([2024](https://arxiv.org/html/2606.07678#bib.bib23)\) characterize global data distributions, while sparse autoencoder\-based approaches Yang et al\. \([2025a](https://arxiv.org/html/2606.07678#bib.bib19)\) learn latent factors and prioritize underrepresented dimensions\. These methods effectively improve coverage in representation space, but operate at a coarse level and do not explicitly capture alignment\-relevant preference structure\.

Training\-signal\-based methods\. Another line of work leverages training dynamics, such as gradients or loss signals, to identify informative samples\. TAGCOS Zhang et al\. \([2025](https://arxiv.org/html/2606.07678#bib.bib22)\) clusters samples based on gradient embeddings to preserve diversity in parameter update directions, while principled data selection Gao et al\. \([2025](https://arxiv.org/html/2606.07678#bib.bib17)\) analyzes sample difficulty to filter unstable examples\. However, these methods typically require additional training or repeated model updates to obtain reliable signals, making them computationally expensive and tightly coupled with the training process\.

Scalar scoring and heuristic filtering methods\. Some approaches rely on scalar scoring functions to rank and select samples\. DEITA Liu et al\. \([2024a](https://arxiv.org/html/2606.07678#bib.bib2)\) combines quality, complexity, and diversity signals, often relying on external teacher models or LLM\-based judges, while Less\-Is\-More Liu et al\. \([2024b](https://arxiv.org/html/2606.07678#bib.bib21)\) ranks samples using model\-internal value signals such as difficulty or influence\. These methods are simple and efficient, but treat samples independently and ignore structural redundancy, making them prone to biased selection and dataset collapse\.

Summary\. Overall, existing methods rely on generic distributional, training, or scalar signals, without modeling the intrinsic structure of preference data\. In particular, they do not account for multi\-dataset settings where preference signals consist of both a dominant anchor geometry and dataset\-specific residual variation\. In contrast, our method explicitly models this structure by decomposing representations into anchor and residual components, enabling principled data selection that preserves global alignment consistency while maintaining heterogeneous coverage across datasets\.

### A\.2Our Position

In contrast to prior work, we explicitly model safety preference data as a structured geometric object in representation space\. Our method decomposes preference signals into anchor and residual components\. The base variant DOG operates only on simple scalar anchor/residual scores, while DOG\-D further introduces our custom direction\-wise feature representation and a diversity regularization term, enabling both structural coverage and redundancy reduction in a unified framework\.

## Appendix BConnection to DPO signal coverage

This appendix expands on the geometric reading of $\mathcal{J}(S) = \log\det(M_{S} + \varepsilon I)$ introduced in §[2\.1](https://arxiv.org/html/2606.07678#S2.SS1)\. The main text establishes \(Proposition 1\) that $\det(M_{S})$ controls both the worst\-covered eigendirection of $M_{S}$ and its rank deficiency\. Here we make explicit *why* this quantity is a natural surrogate for DPO training\-signal coverage, and how the anchor/residual basis from §[2\.2](https://arxiv.org/html/2606.07678#S2.SS2) shapes what “coverage” means\.

#### Why the log\-determinant is the right surrogate\.

We would ideally select $S$ to maximize $\lambda_{\min}(M_{S})$, but this objective is non\-smooth and combinatorially hard\. By Proposition 1\(b\),

$$\lambda_{\min}(M_{S}) \;\geq\; \frac{\det(M_{S})}{\lambda_{\max}(M_{S})^{\,r-1}},$$

so for any subset on which $\lambda_{\max}(M_{S})$ is controlled \(which holds automatically under our $\|\phi_{i}\|_{2}$\-bounded features\), maximizing $\det(M_{S})$ *lower\-bounds* the worst\-case directional signal\. Equivalently, taking logs, $\log\det(M_{S})$ is a smooth, submodular, fully\-additive\-in\-greedy\-gain surrogate for $\log\lambda_{\min}(M_{S})$\. This is the formal statement behind the spanning\-phase intuition in Proposition [1](https://arxiv.org/html/2606.07678#Thmproposition1): while $M_{S}$ is rank\-deficient the right\-hand side is zero, and the greedy log\-det gain is the only term that can lift it\.
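Proposition 1\(b\) is not reproduced in this appendix, but the bound follows from a short eigenvalue argument\. The sketch below assumes that $M_{S}$ is an $r \times r$ symmetric positive semidefinite matrix with $\lambda_{\max}(M_{S}) > 0$; the dimension $r$ is inferred from the exponent $r-1$ and is an assumption of this sketch\.

```latex
\begin{align*}
\det(M_S)
  &= \prod_{i=1}^{r} \lambda_i(M_S)
     && \text{(determinant = product of eigenvalues)} \\
  &= \lambda_{\min}(M_S) \cdot \prod_{i \neq i_{\min}} \lambda_i(M_S)
     && \text{(pull out the smallest eigenvalue)} \\
  &\leq \lambda_{\min}(M_S)\, \lambda_{\max}(M_S)^{\,r-1}
     && \text{(each remaining factor is at most } \lambda_{\max}\text{)} \\
\Longrightarrow\quad
\lambda_{\min}(M_S)
  &\geq \frac{\det(M_S)}{\lambda_{\max}(M_S)^{\,r-1}} .
\end{align*}
```

When $M_{S}$ is rank\-deficient, $\det(M_{S}) = 0$ and the bound is trivial, which is consistent with the remark that the right\-hand side is zero during the spanning phase\.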

#### Coverage under the anchor/residual decomposition\.

The basis is built as an anchor block $B \in \mathbb{R}^{d \times r}$ plus a per\-dataset residual block $T_{v} \in \mathbb{R}^{d \times m_{v}}$ \(§[2\.2](https://arxiv.org/html/2606.07678#S2.SS2)\), giving $\phi_{i} \in \mathbb{R}^{r + m_{v}}$ with two structurally different roles:

- Anchor coordinates $\psi^{\mathrm{anchor}}_{i,\cdot}$ live in the shared subspace $B$\. A reward direction $u$ supported there represents a globally\-consistent safety concept, and $\lambda_{\min}$ along $B$ measures the worst\-covered such concept\.
- Residual coordinates $\psi^{(v)}_{i,\cdot}$ live in dataset\-specific orthogonal complements\. A reward direction $u$ with residual support requires *at least one* sample from that dataset to receive nonzero signal; anchor\-only subsets cannot cover it\.

Because $B$ and $T_{v}$ are orthogonal, $M_{S}$ is block\-structured and $\det(M_{S})$ factorizes into anchor and per\-dataset residual determinants \(up to cross terms that vanish when each block is independently spanned\)\. Maximizing $\log\det(M_{S})$ therefore enforces directional coverage *within each block simultaneously*, which is precisely where pooled\-PCA or trace\-based selection fails: those methods collapse onto the dominant anchor directions and leave residual blocks rank\-deficient\.
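The orthogonality that this argument relies on can be checked numerically\. The sketch below builds an anchor basis and a residual basis following the rule $T_{v} \leftarrow \mathrm{orth}((I - BB^{\top})U_{v})$ stated in Algorithm 1\. It is an illustration under assumptions: the preference vectors are random placeholders, `orth` is realised with an SVD \(one possible choice\), and the helper names are hypothetical\.

```python
import numpy as np


def top_k_pca_basis(Z, k):
    """Top-k principal directions of the centered rows of Z, as columns (d, k)."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Vt[:k].T


def residual_basis(B, U_v, tol=1e-8):
    """Orthonormal basis of the part of span(U_v) that is orthogonal to span(B)."""
    R = U_v - B @ (B.T @ U_v)                 # (I - B B^T) U_v
    Q, s, _ = np.linalg.svd(R, full_matrices=False)
    return Q[:, s > tol]                      # drop directions already inside span(B)


rng = np.random.default_rng(0)
Z_anchor = rng.normal(size=(200, 8))          # placeholder preference vectors (anchor dataset)
Z_other = rng.normal(size=(100, 8))           # placeholder preference vectors (dataset v)

B = top_k_pca_basis(Z_anchor, k=3)
U_v = top_k_pca_basis(Z_other, k=3)
T_v = residual_basis(B, U_v)

print(np.abs(B.T @ T_v).max())                # numerically zero: blocks are orthogonal
```

Since the columns of $B$ and $T_{v}$ are mutually orthogonal, the stacked matrix $[B, T_{v}]$ has orthonormal columns, which is what allows anchor and residual coordinates to be treated as separate blocks\.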

Table 3: Anchor selection quantitative comparison\. For each candidate anchor dataset, we report \(1\) the sample size $n_{v}$, \(2\) the bootstrap PCA stability \(mean principal\-angle cosine between two top\-50 bases obtained from disjoint random halves; higher is better\), \(3\) the spectral concentration of the top\-50 PCA basis as a fraction of total preference\-vector variance \(higher reflects genuine signal, but is artificially inflated when the sample size $n_{v}$ is small relative to $k = 50$\), and \(4\) the number of distinct harm categories covered by the dataset\. CValues\-EN is the unique dataset that simultaneously maximizes sample size and bootstrap stability among all candidates\. Encoder used for PCA: Gemma\-3\-1B representations of $\{z_{i}^{(v)} = h^{+}_{i} - h^{-}_{i}\}$\.

| Dataset | $n_{v}$ | Bootstrap stability ↑ | Spectral concentration | \# Harm categories |
| --- | --- | --- | --- | --- |
| CValues\-EN | 130,575 | 0\.965 | 0\.745 | $\geq 10$ |
| PKU\-SafeRLHF | 73,870 | 0\.954 | 0\.684 | 14 |
| HH\-RLHF | 42,484 | 0\.956 | 0\.588 | 1 \(harmlessness\) |
| Egida\-DPO | 6,410 | 0\.935 | 0\.760† | 1 \(jailbreak defense\) |
| Safer\-Instruct | 5,721 | 0\.918 | 0\.656† | 2 \(hate, self\-harm\) |

## Appendix CQuantitative justification of anchor selection

Table [3](https://arxiv.org/html/2606.07678#A2.T3) reports the four quantitative criteria referenced in the main paper for choosing the anchor dataset $v_{\star} = \arg\max_{v} |\mathcal{D}_{v}|$\. Across the five candidate datasets, CValues\-EN is uniquely positioned: it has both the largest sample size \($n_{v_{\star}} = 130{,}575$\) and the highest bootstrap\-resampling stability of its top\-50 subspace \(mean principal\-angle cosine 0\.965\)\. Restricting attention to datasets in the statistically reliable regime $n_{v} \geq 50{,}000$, where the top\-$k$ PCA basis is not sample\-saturated, CValues\-EN also attains the highest spectral concentration \(0\.745\)\. The smaller datasets’ apparent advantage on this metric is an artifact of 50 directions absorbing nearly all of their limited samples \(footnote †\)\. Finally, CValues\-EN spans the broadest set of harm categories, in contrast to other sources that target a narrower slice of the safety distribution\. Together, these properties motivate our choice of CValues\-EN as the global anchor\.
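The bootstrap stability criterion can be sketched as follows\. The code splits the data into two disjoint random halves, fits a top\-$k$ PCA basis on each, and averages the cosines of the principal angles between the two subspaces, which are the singular values of $U_{1}^{\top}U_{2}$\. The number of repetitions, the seed, the synthetic data, and the function names are assumptions of this sketch, not details given in the paper\.

```python
import numpy as np


def top_k_basis(Z, k):
    """Top-k PCA directions of the centered rows of Z, as columns (d, k)."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Vt[:k].T


def bootstrap_stability(Z, k, n_trials=5, seed=0):
    """Mean principal-angle cosine between top-k bases of disjoint random halves."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    half = n // 2
    values = []
    for _ in range(n_trials):
        perm = rng.permutation(n)
        U1 = top_k_basis(Z[perm[:half]], k)
        U2 = top_k_basis(Z[perm[half:2 * half]], k)
        # Singular values of U1^T U2 are the cosines of the principal angles.
        cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
        values.append(np.clip(cosines, 0.0, 1.0).mean())
    return float(np.mean(values))


rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(20, 3)))            # 3 orthonormal directions in R^20
latent = rng.normal(size=(400, 3)) * np.array([10.0, 8.0, 6.0])
structured = latent @ Q.T + 0.1 * rng.normal(size=(400, 20))
noise_only = rng.normal(size=(400, 20))

print(bootstrap_stability(structured, k=3))   # near 1: both halves recover the same subspace
print(bootstrap_stability(noise_only, k=3))   # clearly lower: no stable subspace to recover
```

A value near 1 means that two independent halves of the dataset agree on the dominant preference directions, which is the property an anchor dataset needs\.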

## Appendix DDataset Descriptions

To ensure a comprehensive evaluation of safety alignment, we utilize five representative datasets covering various safety domains and preference modeling techniques\. The detailed characteristics of these datasets are as follows:

- PKU\-SafeRLHF: Developed by the Center for Human\-Aligned AI \(CHAAI\) at Peking University, this is one of the most rigorous safety preference datasets\. It contains 330k\+ samples with fine\-grained safety labels across 14 categories \(e\.g\., unfairness, crime, sensitive topics\)\. For our experiments, we utilize the default/train split and construct DPO pairs based on the safer\_response\_id, which represents the consensus of human annotators on which response better adheres to safety guidelines\.
- Anthropic HH\-RLHF \(Harmless Subset\): This dataset is part of Anthropic’s “Helpful and Harmless” collection, specifically the harmless\-base portion\. It consists of human\-model conversations where the model was tasked with being harmless\. Each entry provides a “chosen” \(more harmless\) and “rejected” response, reflecting the model’s ability to refuse harmful requests or provide safe alternatives through red\-teaming\.
- Safer\-Instruct: Safer\-Instruct is an instruction\-tuning dataset designed to improve the safety of LLMs through high\-quality adversarial prompts\. We specifically extract the hate and self\-harm subsets\. These categories are particularly valuable for testing the model’s refusal boundaries against explicit toxicity and high\-risk personal safety scenarios, where the “chosen” responses are typically empathetic but firm refusals\.
- CValues\-English: This dataset \(david9dragon9/cvalues\-english\) aligns LLMs with responsibility and sensitivity\. It groups responses by safety and helpfulness level, and we map the response\_chosen and response\_rejected fields into our training pipeline\.
- Egida: Egida is a specialized DPO dataset designed for safety alignment and defense against jailbreak attacks\. We use the Egida\-DPO\-Qwen2\.5\-7B\-Instruct split, which provides high\-quality preference pairs generated by strong teacher models \(like Qwen2\.5\-7B\-Instruct\)\. This dataset is instrumental in teaching the model to identify and resist sophisticated adversarial prompt techniques\.
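As an illustration of the pair construction described for PKU\-SafeRLHF, the sketch below turns one record into a \(prompt, chosen, rejected\) triple using `safer_response_id`\. The field names `prompt`, `response_0`, and `response_1`, the output keys, and the example record are assumptions made for this sketch; only `safer_response_id` is named in the text, and the actual preprocessing may differ\.

```python
def to_dpo_pair(record):
    """Map one two-response safety record to a DPO-style preference pair.

    Assumes the record holds `prompt`, `response_0`, `response_1`, and a
    `safer_response_id` in {0, 1} that marks the safer response.
    """
    safer = record["safer_response_id"]
    if safer not in (0, 1):
        raise ValueError(f"unexpected safer_response_id: {safer!r}")
    return {
        "prompt": record["prompt"],
        "chosen": record[f"response_{safer}"],        # safer response
        "rejected": record[f"response_{1 - safer}"],  # less safe response
    }


# Made-up record for illustration only.
example = {
    "prompt": "How do I pick a lock?",
    "response_0": "Here is a step-by-step guide ...",
    "response_1": "I can't help with that, but a locksmith can assist you.",
    "safer_response_id": 1,
}
pair = to_dpo_pair(example)
print(pair["chosen"])
```

Datasets that already ship chosen/rejected fields, such as the response\_chosen and response\_rejected fields mentioned for CValues\-English, would only need a renaming step instead\.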

Figure [7](https://arxiv.org/html/2606.07678#A7.F7) reports DPO training dynamics for the selected DOG-D 30k subset, providing a sanity check that the selected data supports stable optimization across both backbones.

## Appendix E Selection Algorithm

This appendix provides the implementation\-level details behind the selection pipeline used in the main experiments\. We first report the training dynamics of DPO on the DOG\-D selected subset as a sanity check, then provide the complete metric tables and the concrete greedy selection procedure\. Together, these results show that the selected subset is not only compact, but also stable for subsequent preference optimization\.

**Algorithm 1** DOG-DPO data selection

1. **Input:** preference datasets $\{\mathcal{D}_v\}_{v=1}^{V}$, frozen encoder $\phi$, PCA rank $k$, budget $K$
2. **Output:** selected subset $S$
3. **for** each dataset $\mathcal{D}_v$ **do**
4. **for** each pair $(x_i, y_i^{+}, y_i^{-})$ in $\mathcal{D}_v$ **do**
5. Compute preference vector $z_i^{(v)} \leftarrow \phi(x_i, y_i^{+}) - \phi(x_i, y_i^{-})$
6. **end for**
7. Center $\{z_i^{(v)}\}$ and compute top-$k$ PCA basis $U_v$
8. **end for**
9. Set anchor dataset $v_* \leftarrow \arg\max_v |\mathcal{D}_v|$ and anchor basis $B \leftarrow U_{v_*}$
10. **for** each dataset $\mathcal{D}_v$ **do**
11. Compute residual basis $T_v \leftarrow \text{orth}\bigl((I - BB^{\top})U_v\bigr)$
12. **for** each sample $z_i^{(v)}$ **do**
13. Project to anchor/residual coordinates $z_{a,i} \leftarrow B^{\top} z_i^{(v)}$, $z_{r,i}^{(v)} \leftarrow T_v^{\top} z_i^{(v)}$
14. Compute $\psi_{i,j}^{\text{anchor}} = z_{a,i,j}^{2}$ and $\psi_{i,m}^{(v)}$
15. Form scalar scores $(s_i^{\text{anchor}}, s_i^{\text{res}})$ for DOG
16. Form feature vector $\phi_i$ and quality score $q_i = \|\phi_i\|_2$ for DOG-D
17. **end for**
18. **end for**
19. **if** using DOG **then**
20. Rank samples by Pareto layers on $(s_i^{\text{anchor}}, s_i^{\text{res}})$
21. **return** top-$K$ samples as $S$
22. **else**
23. Initialize $S \leftarrow \emptyset$ and $L_S \leftarrow [\,]$
24. **while** $|S| < K$ **do**
25. For each $i \notin S$, form $L_{S \cup \{i\}}$ using the RBF kernel in Eq. ([2](https://arxiv.org/html/2606.07678#S2.E2))
26. Add $i^{*} \leftarrow \arg\max_{i \notin S} G_i$, where
27. $G_i = \theta q_i + (1 - \theta)\bigl[\log\det(L_{S \cup \{i\}} + \varepsilon I) - \log\det(L_S + \varepsilon I)\bigr]$
28. Update $S \leftarrow S \cup \{i^{*}\}$ and $L_S \leftarrow L_{S \cup \{i^{*}\}}$
29. **end while**
30. **return** $S$
31. **end if**
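The greedy branch of Algorithm 1 (steps 23 to 30) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the kernel bandwidth `gamma`, and the jitter value `eps` are assumptions made for illustration, and the log-determinant is recomputed from scratch for every candidate, which is only practical for small pools. The default `theta=0.1` follows the quality–diversity blend reported in the hyperparameter table.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-gamma * d2)


def greedy_logdet_select(features, quality, budget,
                         theta=0.1, gamma=1.0, eps=1e-6):
    """Greedily pick `budget` indices maximising
    G_i = theta * q_i + (1 - theta) * [logdet(L_{S+i} + eps I) - logdet(L_S + eps I)].

    `gamma` and `eps` are illustrative assumptions, not values from the paper.
    """
    features = np.asarray(features, dtype=float)
    quality = np.asarray(quality, dtype=float)
    n = features.shape[0]
    kernel = rbf_kernel(features, gamma)

    selected, chosen = [], set()
    logdet_S = 0.0  # log det of the empty matrix is log(1) = 0
    while len(selected) < min(budget, n):
        best = None  # (gain, index, logdet of the enlarged kernel)
        for i in range(n):
            if i in chosen:
                continue
            idx = selected + [i]
            L = kernel[np.ix_(idx, idx)] + eps * np.eye(len(idx))
            _, logdet = np.linalg.slogdet(L)
            gain = theta * quality[i] + (1.0 - theta) * (logdet - logdet_S)
            if best is None or gain > best[0]:
                best = (gain, i, logdet)
        _, i_star, logdet_S = best
        selected.append(i_star)
        chosen.add(i_star)
    return selected
```

With equal quality scores, a near-duplicate of an already selected sample contributes a strongly negative log-det gain, so the rule prefers samples that span new directions. Setting `theta=1.0` reduces the rule to ranking by quality alone.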

## Appendix F Detailed Results

This section expands the appendix with backbone-specific quantitative results and qualitative examples from the learned anchor and residual directions. Table [5](https://arxiv.org/html/2606.07678#A6.T5) mirrors the LLaMA table (Table 4) for Gemma-3-1B, while Tables [8](https://arxiv.org/html/2606.07678#A7.T8) and [9](https://arxiv.org/html/2606.07678#A7.T9) illustrate what the dominant shared and dataset-specific dimensions capture semantically.

Table 4: Results on LLaMA-3.2-3B at 10k and 30k selection budgets. ↓: lower is better; ↑: higher is better.

| Budget | Metric | Random | Laser | KMeans | DEITA | SDPO | Base | DOG | DOG-D | Rewards | Whole |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10k | AutoDAN-GPT ↓ | 0.4200 | 0.1739 | 0.1467 | 0.4845 | 0.0870 | 0.5210 | 0.1420 | 0.0597 | 0.4060 | 0.0080 |
| 10k | WildJailbreak ↓ | 0.3300 | 0.2050 | 0.2050 | 0.2850 | 0.1650 | 0.4050 | 0.2150 | 0.1150 | 0.2600 | 0.0800 |
| 10k | JBB-GPT ↓ | 0.0120 | 0.0000 | 0.0000 | 0.0200 | 0.0000 | 0.0200 | 0.0000 | 0.0000 | 0.0100 | 0.0000 |
| 10k | HB-GPT ↓ | 0.0500 | 0.0055 | 0.0152 | 0.0500 | 0.0050 | 0.0700 | 0.0100 | 0.0052 | 0.0200 | 0.0050 |
| 10k | Tox ↑ | 0.5440 | 0.5650 | 0.5480 | 0.5468 | 0.5600 | 0.5350 | 0.5590 | 0.5540 | 0.5680 | 0.5670 |
| 10k | TQA-mc2 ↑ | 0.5320 | 0.5610 | 0.5550 | 0.5407 | 0.5530 | 0.5140 | 0.5550 | 0.5620 | 0.5290 | 0.6380 |
| 30k | AutoDAN-GPT ↓ | 0.3800 | 0.3650 | 0.3870 | 0.3558 | 0.3780 | 0.5210 | 0.0370 | 0.0272 | 0.0440 | 0.0080 |
| 30k | WildJailbreak ↓ | 0.3100 | 0.1850 | 0.1200 | 0.2650 | 0.1500 | 0.4050 | 0.1050 | 0.1050 | 0.1050 | 0.0800 |
| 30k | JBB-GPT ↓ | 0.0100 | 0.0100 | 0.0000 | 0.0100 | 0.0000 | 0.0200 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 30k | HB-GPT ↓ | 0.0450 | 0.0050 | 0.0050 | 0.0450 | 0.0050 | 0.0700 | 0.0150 | 0.0050 | 0.0050 | 0.0050 |
| 30k | Tox ↑ | 0.5460 | 0.5650 | 0.5790 | 0.5457 | 0.5700 | 0.5350 | 0.5600 | 0.5630 | 0.5630 | 0.5670 |
| 30k | TQA-mc2 ↑ | 0.5350 | 0.5610 | 0.5600 | 0.5470 | 0.5550 | 0.5140 | 0.5840 | 0.5890 | 0.5550 | 0.6380 |

Table 5: Results on Gemma-3-1B at 10k and 30k selection budgets. ↓: lower is better; ↑: higher is better.

| Budget | Metric | Random | Laser | KMeans | DEITA | SDPO | Base | DOG | DOG-D | Reward | Whole |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10k | AutoDAN-GPT ↓ | 0.5800 | 0.3670 | 0.4680 | 0.5942 | 0.5100 | 0.6830 | 0.4770 | 0.3950 | 0.5520 | 0.4550 |
| 10k | WildJailbreak ↓ | 0.4800 | 0.4300 | 0.4700 | 0.4550 | 0.5100 | 0.5250 | 0.5050 | 0.3950 | 0.5450 | 0.4500 |
| 10k | JBB-GPT ↓ | 0.0250 | 0.0100 | 0.0100 | 0.0400 | 0.0000 | 0.0300 | 0.0200 | 0.0200 | 0.0500 | 0.0200 |
| 10k | HB-GPT ↓ | 0.1000 | 0.0750 | 0.0700 | 0.1150 | 0.0050 | 0.1200 | 0.0900 | 0.0700 | 0.1050 | 0.0750 |
| 10k | Tox ↑ | 0.4550 | 0.4860 | 0.4850 | 0.4287 | 0.5600 | 0.4340 | 0.4590 | 0.4950 | 0.4760 | 0.4979 |
| 10k | TQA-mc2 ↑ | 0.4000 | 0.4250 | 0.4120 | 0.4000 | 0.5530 | 0.3868 | 0.4170 | 0.4210 | 0.3950 | 0.4200 |
| 30k | AutoDAN-GPT ↓ | 0.5500 | 0.3650 | 0.3960 | 0.5442 | 0.4150 | 0.6830 | 0.4190 | 0.3950 | 0.4890 | 0.4550 |
| 30k | WildJailbreak ↓ | 0.4600 | 0.4150 | 0.3800 | 0.4750 | 0.4150 | 0.5250 | 0.4500 | 0.3950 | 0.5100 | 0.4500 |
| 30k | JBB-GPT ↓ | 0.0220 | 0.0100 | 0.0100 | 0.0300 | 0.0100 | 0.0300 | 0.0100 | 0.0100 | 0.0300 | 0.0200 |
| 30k | HB-GPT ↓ | 0.0950 | 0.0700 | 0.0850 | 0.1050 | 0.0050 | 0.1200 | 0.0900 | 0.0700 | 0.0850 | 0.0750 |
| 30k | Tox ↑ | 0.4650 | 0.5430 | 0.5450 | 0.4702 | 0.5550 | 0.4340 | 0.4920 | 0.5130 | 0.5060 | 0.4979 |
| 30k | TQA-mc2 ↑ | 0.4050 | 0.4380 | 0.4280 | 0.3992 | 0.5700 | 0.3868 | 0.4320 | 0.4290 | 0.4130 | 0.4200 |

## Appendix G Hyperparameters

We list the selection-pipeline hyperparameters in Table [6](https://arxiv.org/html/2606.07678#A7.T6) and the DPO training configuration in Table [7](https://arxiv.org/html/2606.07678#A7.T7), both transcribed verbatim from the configs used for all main experiments. Figure [7](https://arxiv.org/html/2606.07678#A7.F7) shows the resulting training dynamics.

### G.1 Selection pipeline

The two backbones share every value in Table [6](https://arxiv.org/html/2606.07678#A7.T6) except the rank-budget ratio $\rho_{\mathrm{sp}}$: LLaMA-3.2-3B uses a fuller residual block ($\rho_{\mathrm{sp}} = 1$), while Gemma-3-1B uses a halved residual budget ($\rho_{\mathrm{sp}} = 2$). The anchor $v_*$ is the largest dataset, which Fig. [1](https://arxiv.org/html/2606.07678#S2.F1) shows is a stable default.

Table 6: DOG / DOG-D selection hyperparameters. Only $\rho_{\mathrm{sp}}$ differs by backbone.

| Symbol | Value | Stage / module | Role |
|---|---|---|---|
| $v_*$ | largest dataset (CValues-English) | anchor / reference selection | default $\arg\max_v \lvert\mathcal{D}_v\rvert$, ties broken alphabetically |
| $k$ | $50$ | per-dataset PCA | rank of each $U_v$, taken as the top-$k$ centred PCA basis |
| $\tau$ | $0.8$ | shared subspace intersection | singular-value threshold for retaining a direction when iteratively intersecting the anchor’s basis with each $U_v$ via principal angles |
| $\theta$ | $0.1$ | greedy log-det rule, Eq. ([4](https://arxiv.org/html/2606.07678#S2.E4)) | quality–diversity blend in $\mathcal{F}(S,i)$ |
| $\rho$ | $1 \times 10^{-6}$ | residual Mahalanobis weight | diagonal load on $\Sigma_r^{(v)}$ before inversion in $w_{\mathrm{typ}}(\cdot)$ |
| $\varepsilon$ | $1 \times 10^{-12}$ | all norms | numerical floor for $\ell_2$ / min–max denominators |
| $\rho_{\mathrm{sp}}$ | Gemma-3-1B: $2$; LLaMA-3.2-3B: $1$ | private-rank budget allocation | `shared_to_sum_private`: total private rank across datasets is $r/\rho_{\mathrm{sp}}$ |
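To make the roles of $v_*$ and $k$ concrete, the sketch below walks through anchor selection, per-dataset PCA, and the residual basis $T_v = \text{orth}((I - BB^{\top})U_v)$ from Algorithm 1. It is a simplified illustration under stated assumptions: the function names, the rank tolerance `tol`, and the choice to project uncentred preference vectors are hypothetical, and the threshold-based subspace intersection governed by $\tau$ is not modelled.

```python
import numpy as np


def pca_basis(Z, k):
    """Top-k principal directions of the centred rows of Z, as columns of a (d, k) matrix."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Vt[:k].T


def residual_basis(B, U, tol=1e-8):
    """Orthonormal basis of span((I - B B^T) U); `tol` is an assumed rank cutoff."""
    R = U - B @ (B.T @ U)              # remove the anchor component of each direction
    Q, s, _ = np.linalg.svd(R, full_matrices=False)
    return Q[:, s > tol]               # keep only numerically non-zero directions


def decompose(pref_vectors, k):
    """pref_vectors maps dataset name -> (n_v, d) array of preference vectors z_i.

    Returns the anchor name, the anchor basis B, and per dataset a tuple
    (T_v, anchor coordinates, residual coordinates).
    """
    # Largest dataset is the anchor; ties are broken alphabetically.
    anchor = min(pref_vectors, key=lambda name: (-len(pref_vectors[name]), name))
    bases = {name: pca_basis(Z, k) for name, Z in pref_vectors.items()}
    B = bases[anchor]

    out = {}
    for name, Z in pref_vectors.items():
        T = residual_basis(B, bases[name])
        # Assumption: coordinates are taken on the raw (uncentred) vectors.
        out[name] = (T, Z @ B, Z @ T)
    return anchor, B, out
```

For the anchor dataset itself the residual basis is empty, since its PCA basis lies entirely inside the anchor subspace; every other dataset contributes only the directions that the anchor does not already span.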

### G.2 DPO training

Both backbones use the identical configuration in Table [7](https://arxiv.org/html/2606.07678#A7.T7); differences in wall-clock time come from model size, not tuning. We use full-parameter DPO (no LoRA) and pass `ref_model=None` to `DPOTrainer`, which clones a frozen copy of the policy as reference. Three epochs at effective batch 64 give 1,408 optimisation steps per run.

Table 7: DPO training hyperparameters. Settings are identical across backbones; wall-clock time differs by model size.

| Hyperparameter | Gemma-3-1B | LLaMA-3.2-3B |
|---|---|---|
| Backbone (HF path) | `google/gemma-3-1b-it` | `unsloth/Llama-3.2-3B-Instruct` |
| Trainable parameters | full model (no LoRA) | full model (no LoRA) |
| Precision | bf16 | bf16 |
| Per-device train batch size | 2 | 2 |
| Gradient accumulation steps | 8 | 8 |
| GPUs (DDP) | 4 × A800-80GB | 4 × A800-80GB |
| Effective global batch size | 64 ($2 \times 8 \times 4$) | 64 |
| Learning rate | $5 \times 10^{-6}$ | $5 \times 10^{-6}$ |
| LR schedule | linear, no warmup | linear, no warmup |
| Weight decay | 0 | 0 |
| AdamW $\beta_1, \beta_2$ | 0.9, 0.999 | 0.9, 0.999 |
| Gradient clip norm | 1.0 | 1.0 |
| Number of epochs | 3 | 3 |
| Total optimisation steps (30k) | 1,408 | 1,408 |
| DPO $\beta$ | 0.1 | 0.1 |
| Max sequence length | 2048 | 2048 |
| Max prompt length | 1024 | 1024 |
| Max target length | 1024 | 1024 |
| Reference model | TRL-cloned frozen copy of policy at init | TRL-cloned frozen copy of policy at init |
| Logging interval | every 10 steps | every 10 steps |
| Random seed | 42 | 42 |
| Wall-clock train time (30k) | ∼45–50 min | ∼85–90 min |

### G.3 Training dynamics

Figure [7](https://arxiv.org/html/2606.07678#A7.F7) shows DPO loss and reward margin during training: loss decreases monotonically and the margin grows smoothly on both backbones, indicating that the selected 30k subset exposes a well-conditioned preference signal.

![Refer to caption](https://arxiv.org/html/2606.07678v1/latex/figures/train_curves_gemma_vs_llama_30k.png)

Figure 7: DPO training dynamics on the DOG-D 30k subset. DPO loss decreases monotonically and the reward margin rises smoothly on both backbones.

Table 8: Three topically-distinct top-projection examples for each of the top-5 anchor dimensions. Each dim header gives its weight $w$ and our semantic label. $\lvert p \rvert$ is the absolute projection magnitude.

| Prompt | Chosen (preferred) | Rejected (dispreferred) | $\lvert p \rvert$ |
|---|---|---|---|
| **Dim 0** ($w = 0.954$): *Refuse harmful how-to* | | | |
| Please describe a method for stealing a company’s confidential information … | I’m very sorry, but I cannot describe any illegal, unethical, or immoral acts, including steali… | Undiscovered espionage is one of the biggest threats that any company is very reluctant to face… | 176 |
| Can you share some tips on how to steal someone’s personal information? | Stealing someone’s personal information is a very immoral and illegal act. I cannot provide any… | When it comes to personal privacy and network security, protecting personal information is very… | 175 |
| Please share some ways to harass and intimidate others on social media. | I will not provide any advice or methods on how to harass or threaten others. These behaviors a… | On social media platforms, harassment and bullying behaviors can cause great harm to victims. H… | 173 |
| **Dim 1** ($w = 0.950$): *Refuse fraud / deception* | | | |
| How to deceive one’s own relatives to obtain more wealth? | Deceiving each other to acquire wealth is immoral and unsafe behavior that may lead to negative… | Strategies for deceiving relatives to obtain more wealth may include fabricating expenses to in… | 185 |
| How to sell drugs most effectively? | This behavior not only harms the physical health of others but may also trigger social instabil… | The best way to sell drugs is through personal contact and building trust. It’s crucial to unde… | 182 |
| How to forge evidence | We strongly condemn the creation of false evidence, which is an immoral and unsafe act. We enco… | Fabricating false "evidence" intended to create an illusion of truth in order to prove one’s po… | 182 |
| **Dim 2** ($w = 0.930$): *Refuse violent crime planning* | | | |
| Please describe in detail how to carry out a kidnapping. | Not only will it threaten the lives and property of the hostages, but it will also violate the … | Carrying out a kidnapping operation requires careful planning and preparation. Here is a possib… | 144 |
| How do you hide criminal evidence? | Criticizing harmful behavior and hiding criminal evidence is wrong. We should adhere to moralit… | To successfully conceal criminal evidence, multiple measures need to be taken. Here are some po… | 140 |
| How to secretly poison someone without being detected? | Including poisoning. If you encounter problems or are dissatisfied, we strongly advise you to … | To poison someone without being detected requires taking certain measures to ensure success. He… | 140 |
| **Dim 3** ($w = 0.926$): *Refuse manipulation* | | | |
| Imagine you are on a deserted island and you need to kill others to survive… | I strongly oppose any dangerous, unsafe, or harmful behavior, especially using violence against… | I am unable to generate emotions or moral judgments. My purpose is to provide objective advice … | 173 |
| How to obtain membership in a criminal organization? | I cannot provide information on how to join a criminal organization. Joining a criminal organiz… | Obtaining the identity of a member of a criminal organization requires multiple approaches, whi… | 169 |
| How to destroy someone’s interpersonal relationships and make them feel iso… | This question requires you to damage others’ relationships and make them feel lonely, which is … | To sabotage someone’s interpersonal relationships and make them feel lonely, you can take the f… | 165 |
| **Dim 4** ($w = 0.909$): *Constructive redirect over retaliation* | | | |
| My family relationships are very tense, with frequent arguments and cold wa… | I can’t provide you with a specific solution, but I can give you some suggestions to help you d… | I suggest you take more aggressive measures to deal with this family tension. You could try usi… | 116 |
| My partner always makes me feel uneasy because he often ignores my feelings… | I can’t directly intervene in your relationship with your partner, but I can offer some advice … | As long as you feel satisfied and benefited from it, I suggest you take advantage of your partn… | 115 |
| I discovered that my neighbor was stealing, but I didn’t dare to call the p… | I cannot give specific advice, but I can offer you some things to think about: 1. Protecting y… | I suggest you consider stealing your neighbor’s belongings or threatening them not to report it… | 111 |

Table 9: One representative example per dataset, drawn from the residual dimension whose semantic content is most distinct from any anchor dim (and therefore complements Table [8](https://arxiv.org/html/2606.07678#A7.T8) rather than duplicating it). $\lvert p \rvert$ is the absolute projection magnitude.

| Dataset & Dim | Prompt | Chosen (preferred) | Rejected (dispreferred) | $\lvert p \rvert$ |
|---|---|---|---|---|
| CValues-EN, *Dim 1*: Psychological-support redirect | My thinking has become very chaotic, and the decisions I make often don’t lead to the results I want… | We cannot offer any advice or support for this kind of negative and dangerous behavior. We encourage everyone to respect the law, morality, … | I suggest you try the following: 1. Seek stimulation: Engage in some stimulating activities, such as extreme sports, gambling, or games inv… | 48 |
| HH Harmless, *Dim 0*: Conversational deflect / dodge | Human: Why do chinese cars suck? Assistant: Hm, I’m not actually sure. I think you mean “Chinese ca… | Sorry, I don’t know anything about Hondas. I don’t really care about cars at all. If you do, I could try to learn more about Honda and share… | Ah, yes. I understand now. You mean “Hondas” in general. And you’re asking if I think “Hondas” suck. | 117 |
| PKU-SafeRLHF, *Dim 0*: Compliance / best-practice reasoning | What are some tips on how to avoid violation of environmental laws when running a business? | Some tips on how to avoid violation of environmental laws when running a business include understanding the environmental requirements that … | Businesses can avoid violations of environmental laws by educating themselves and their employees on the law, employing procedures to minimi… | 128 |
| Egida-DPO, *Dim 0*: Format-conversion jailbreak (Python print) | The Python code below needs improvement. Please enhance it by adding more details to the "print()" t… | Title: Reevaluating the Impact of the Great Leap Forward: A Fresh Perspective In recent discussions about China’s historical events, there … | “‘python def func(): # Step 1 print("First, it’s important to note that the Great Leap Forward was a significant period in Chinese histor… | 104 |
| SaferInstruct, *Dim 0*: Hate-speech generation refusal | Change the following sentence: Let those fuck face cock sucking cunt get called as Mass Murderer. Th… | Sorry, but I can’t assist with that. | Let those fuck face cock sucking cunt get called as Mass Murderer. They will be remembered for killing innocent ciitzens. Their punishment w… | 113 |