Which Pairs to Compare for LLM Post-Training?

arXiv cs.AI Papers

Summary

This paper studies the problem of selecting which completion pairs to label for human preference feedback in LLM post-training. It formulates comparison curation as a sampling-design problem, provides theoretical bounds on DPO's policy optimality gap, and proposes practical sampling designs that improve sample efficiency over common heuristics on synthetic and real benchmarks.

arXiv:2606.19607v1 Announce Type: new

# Which Pairs to Compare for LLM Post-Training?
Source: [https://arxiv.org/html/2606.19607](https://arxiv.org/html/2606.19607)
Jiangze Han (Columbia University, jh5196@columbia.edu), Vineet Goyal (Columbia University, vgoyal@ieor.columbia.edu), Will Ma (Columbia University, wm2428@gsb.columbia.edu)

###### Abstract

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.

## 1 Introduction

Preference-based post-training has become a central paradigm for aligning large language models with human intent. Early Reinforcement Learning from Human Feedback (RLHF) pipelines follow a two-stage procedure: starting from a reference policy $\pi_0$, typically a supervised fine-tuned model, they collect human preference labels over pairs of model-generated completions, fit a reward model from these comparisons, and then optimize the policy against this learned reward under a KL regularization constraint, often using policy-gradient methods such as PPO (Christiano et al., [2017](https://arxiv.org/html/2606.19607#bib.bib3); Ouyang et al., [2022](https://arxiv.org/html/2606.19607#bib.bib2)). This framework has been highly influential, but it also introduces a complex intermediate reward-modeling stage and a computationally expensive reinforcement-learning step. More recent methods, most notably Direct Preference Optimization (DPO), bypass explicit reward modeling and instead train the policy directly from pairwise preference data through a logistic objective, while retaining an implicit-reward interpretation under the same KL-regularized RLHF framework (Rafailov et al., [2023](https://arxiv.org/html/2606.19607#bib.bib1)). Subsequent variants of preference optimization further simplify or modify this post-training pipeline, but they continue to rely on pairwise preference comparisons as the fundamental supervision signal.

Across these post-training methods, the collection of human preference labels remains a key bottleneck. A common data-collection strategy is mechanical: for each prompt, generate a small number of completions from the reference policy and label either one pair or all within-prompt pairs. This strategy treats the generated comparison set as fixed, even though generation is relatively cheap while human labeling is expensive. In practice, preference datasets are also often assembled in batches before annotation, since coordinating human labeling is costly and logistically easier when comparisons are selected in advance. This motivates an offline curation question: rather than labeling all available comparisons or selecting them uniformly, can we generate a larger candidate pool first and then choose which pairs are most worth sending to annotators?

This paper studies this comparison-selection problem: which pairs should be compared for preference-based post-training? For each prompt, we first generate a larger set of completions from the initial reference policy, and then choose an informative portfolio of within-prompt comparison pairs, optimized jointly across all prompts and generated completions. The objective is to allocate the labeling budget to comparisons that are most useful for improving the final post-trained policy under the KL-regularized RLHF objective.

Formally, we model comparison curation as a sampling-design problem. After completions are generated for each prompt, a design specifies a sampling distribution over all within-prompt completion pairs. Given a labeling budget $n$, we sample $n$ pairs from this design and query their preference labels. In this paper, we instantiate the post-training step with DPO: the sampled labeled comparisons are used to train a DPO policy $\hat{\pi}_n$. The design goal is to produce a policy with high value under the KL-regularized RLHF objective $J$. Thus, the central question is not merely how to estimate preferences accurately in isolation, but how to allocate comparison labels so as to improve the final policy under the RLHF objective.

Contributions. We make three main contributions. First, we show that the effect of comparison curation on RLHF is captured by a single design-dependent information matrix, which measures how well the labeled comparisons identify the parameter directions that matter for downstream RLHF performance. Second, we prove matching finite-sample upper bounds and information-theoretic lower bounds for the RLHF optimality gap, showing that a trace-form surrogate characterizes both the achievable performance of RLHF and the unavoidable statistical difficulty of the problem. These bounds yield an explicit optimization criterion for selecting informative comparison pairs, along with a practical sampling policy. Third, we validate the design criterion empirically on synthetic tabular and contextual experiments, as well as IMDb and Anthropic-HH DPO experiments, showing that the proposed designs consistently improve sample efficiency over other approaches in the literature.

### 1.1 Literature Review

To our knowledge, this is the first work to apply offline experimental design to comparison curation for RLHF post-training, with a criterion directly tied to downstream policy performance.

RLHF uses preference data to guide policy optimization, often by learning a reward model from pairwise comparisons and then optimizing a KL-regularized objective. Early work by Christiano et al. ([2017](https://arxiv.org/html/2606.19607#bib.bib3)) formalizes learning from human preferences using trajectory comparisons, while large-scale alignment pipelines such as InstructGPT (Ouyang et al., [2022](https://arxiv.org/html/2606.19607#bib.bib2)) popularized preference-based RL for language models. DPO (Rafailov et al., [2023](https://arxiv.org/html/2606.19607#bib.bib1)) provides a widely used alternative that directly fits a classification-style objective on pairwise preference data, avoiding explicit reward modeling and on-policy RL while retaining an implicit-reward interpretation. Our work takes the KL-regularized objective as given and studies the orthogonal problem of curating comparison data under a fixed labeling budget. A standard statistical model for preference labels is the Bradley–Terry model (Bradley and Terry, [1952](https://arxiv.org/html/2606.19607#bib.bib4)), where the probability that one item is preferred to another is a logistic function of a latent score difference. This model underlies much of ranking and preference learning, and it naturally appears in RLHF/DPO when pairwise labels are viewed as noisy comparisons induced by an explicit or implicit reward signal.

A related literature studies adaptive preference-query selection to reduce RLHF/DPO data-collection costs. For example, Das et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib16)) propose active preference optimization for sample-efficient RLHF; Ji et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib17)) develop query-efficient RLHF methods inspired by active learning; and Muldrew et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib18)) study acquisition strategies for active preference learning in LLMs under DPO-style objectives. Xie et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib25)) propose an exploration-augmented online DPO method for sample-efficient preference-data collection in RLHF. Lin et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib27)) propose an online DPO method which selects preference queries using an estimated reward model induced by the current LLM. Kveton et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib26)) study active learning for DPO and also consider an offline subset-selection setting, but their offline setting selects an informative subset from already labeled preference data for computational efficiency. Closest to our work, Feng et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib20)) propose PILAF, which changes the response-generation distribution in iterative and online RLHF/DPO so that the preference-loss gradient is aligned with the oracle RLHF objective gradient. In contrast, our focus is offline first-round comparison curation: when completions are generated from the reference/SFT model before any preference feedback is observed, we optimize which within-prompt pairs from the fixed generated pool should be labeled, with finite-sample guarantees for the final RLHF policy. This offline formulation is motivated by practical annotation workflows, where comparison datasets are often constructed in advance and then sent for batch human labeling.

Recent work studies how preference-data properties affect post-training behavior. Kim et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib14)) analyze how the data-generating distribution influences DPO optimization and gradient dynamics, including likelihood displacement and iterative DPO. Pan et al. ([2025](https://arxiv.org/html/2606.19607#bib.bib15)) emphasize the importance of the quality and coverage of preferred responses, while Chowdhury et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib13)) study DPO under random label flips and propose robust/debiased objectives with finite-sample guarantees. Our work is complementary: rather than focusing on response quality, training dynamics, or label noise, we study pairwise comparison curation and show how budget allocation across candidate pairs controls downstream performance through an information-design criterion.

Methodologically, our approach builds on offline optimal experimental design, which studies how to allocate a limited measurement budget to optimize a statistical information criterion. In the paired-comparison setting, this amounts to deciding which pairs should be compared, and with what frequencies. Classical work has studied locally optimal designs for the Bradley–Terry model, showing that the optimal comparison structure can depend strongly on the unknown utility parameters (Graßhoff and Schwabe, [2008](https://arxiv.org/html/2606.19607#bib.bib5)). Complementary minimax analyses relate the difficulty of estimating item utilities to the topology of the comparison graph through its Laplacian spectrum (Shah et al., [2016](https://arxiv.org/html/2606.19607#bib.bib12)). Other lines of work develop adaptive or Bayesian designs for tournaments and paired comparisons (Glickman and Jensen, [2005](https://arxiv.org/html/2606.19607#bib.bib6)), as well as fixed-budget allocation rules for expert comparisons in machine learning (Guo et al., [2018](https://arxiv.org/html/2606.19607#bib.bib7)). More recently, optimal design ideas have been used for preference-data collection in learning systems, but the design objectives are primarily tied to learning preferences or reward models. For example, Mukherjee et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib24)) use optimal design to learn item rankings, with an objective based on ranking accuracy. Scheid et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib28)) study optimal design for learning a contextual linear reward model, with an objective formulated in terms of bandit regret. Our setting shares the view of pair selection as a budget-allocation problem, but differs in the downstream objective: we connect comparison allocation directly to the suboptimality of the final post-trained policy under a KL-regularized RLHF objective, rather than only to parameter-estimation accuracy or reward-model learning.

## 2 Problem definition

##### Reinforcement Learning from Human Feedback (RLHF)

Let $x$ denote a prompt, and let $\mathcal{A}(x)$ denote the set of candidate completions available for prompt $x$. We write $y\in\mathcal{A}(x)$ for a completion. A language model induces a conditional distribution over completions given each prompt. Following the RLHF literature, we refer to this conditional distribution as a policy and denote it by $\pi(\cdot\mid x)$. Thus, $\pi(y\mid x)$ is the probability that the model generates completion $y$ given prompt $x$.

RLHF aims to align a reference policy $\pi_0(\cdot\mid x)$, often obtained through supervised fine-tuning (SFT), with human preferences. The standard RLHF formulation assumes that each prompt–completion pair $(x,y)$ has an underlying human-preference reward $r^{\star}(x,y)$, which is fixed but unknown to the learner, with larger values corresponding to completions that are more preferred by humans. The goal is then to learn a policy that assigns higher probability to high-reward completions while remaining close to the reference policy. This trade-off is captured by the KL-regularized RLHF objective

$$J(\pi)\doteq\mathbb{E}_{x}\!\left[\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r^{\star}(x,y)]-\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_0(\cdot\mid x)\big)\right]. \tag{1}$$

The first term measures the expected human-preference reward of the post-trained policy, while the KL term measures its deviation from the reference policy. The parameter $\beta>0$ controls the strength of this regularization.
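As a rough illustration of objective (1), the sketch below evaluates the inner per-prompt term for a single prompt with a finite candidate set. The function names `kl_divergence` and `rlhf_objective`, the three-completion setup, and all numbers are assumptions made for illustration and do not come from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rlhf_objective(policy, reference, rewards, beta):
    """Per-prompt value: expected reward minus beta times KL to the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


# Hypothetical single-prompt example with three candidate completions.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

# Staying at the reference policy incurs no KL penalty.
j_reference = rlhf_objective(reference, reference, rewards, beta)

# Putting all mass on the best completion earns more reward but pays a KL cost.
greedy = [0.0, 0.0, 1.0]
j_greedy = rlhf_objective(greedy, reference, rewards, beta)
```

In this toy case the greedy policy still scores higher than the reference, but a larger `beta` would shrink that advantage, which is the trade-off the objective encodes.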

Let $\pi^{\star}\in\arg\max_{\pi}J(\pi)$ denote the RLHF-optimal policy. For each prompt, the maximizer balances reward improvement against deviation from the reference policy. This optimizer is jointly determined by the reference policy $\pi_0$ and the reward function $r^{\star}$. In particular, it admits the closed form (Rafailov et al., [2023](https://arxiv.org/html/2606.19607#bib.bib1))

$$\pi^{\star}(y\mid x)=\frac{\pi_0(y\mid x)\exp(r^{\star}(x,y)/\beta)}{Z(x)},\qquad Z(x)=\sum_{y'\in\mathcal{A}(x)}\pi_0(y'\mid x)\exp(r^{\star}(x,y')/\beta). \tag{2}$$

Thus, high-reward completions receive larger probability, but only relative to their probability under the reference policy. The normalizing constant $Z(x)$ ensures that $\pi^{\star}(\cdot\mid x)$ is a valid distribution and depends only on the prompt.
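The closed form (2) can be checked numerically on a small example. The sketch below, which assumes a single prompt with three hypothetical completions and made-up rewards, builds $\pi^{\star}$ by reweighting the reference policy and compares its objective value with that of the reference. Substituting (2) into the per-prompt objective gives the value $\beta\log Z(x)$, which the code also computes; the helper names are illustrative.

```python
import math


def optimal_policy(reference, rewards, beta):
    """Reweight the reference by exp(r / beta), then normalize by Z(x)."""
    weights = [p0 * math.exp(r / beta) for p0, r in zip(reference, rewards)]
    z = sum(weights)  # normalizing constant Z(x)
    return [w / z for w in weights], z


def rlhf_value(policy, reference, rewards, beta):
    """Per-prompt objective: expected reward minus beta * KL(policy || reference)."""
    reward_term = sum(p * r for p, r in zip(policy, rewards))
    kl_term = sum(p * math.log(p / p0) for p, p0 in zip(policy, reference) if p > 0)
    return reward_term - beta * kl_term


# Hypothetical toy inputs.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(reference, rewards, beta)
j_star = rlhf_value(pi_star, reference, rewards, beta)
j_ref = rlhf_value(reference, reference, rewards, beta)
closed_form_value = beta * math.log(z)
```

Here `j_star` agrees with `closed_form_value` up to rounding and exceeds `j_ref`, consistent with $\pi^{\star}$ being the maximizer.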

In practice, $r^{\star}$ is unknown, and information about $r^{\star}$ is collected through pairwise preference labels.

Preference data generation. The standard RLHF data-collection pipeline proceeds as follows. Given a prompt dataset $\{x_i\}_{i=1}^{m}$, one first generates a finite candidate pool for each prompt:

$$\mathcal{Y}_{x_i}=\{y_{i,1},\dots,y_{i,d}\},\qquad y_{i,k}\stackrel{\mathrm{i.i.d.}}{\sim}\pi_0(\cdot\mid x_i),\quad k=1,\dots,d.$$

Human experts then provide within-prompt pairwise preferences between two candidates, yielding labeled comparisons of the form $(x_i,y_i^{+},y_i^{-},a_i)$ with $a_i\in\{0,1\}$, where $a_i=1$ indicates that $y_i^{+}$ is preferred to $y_i^{-}$, and $a_i=0$ indicates otherwise.

In RLHF, the generation of pairwise human preference labels is commonly modeled by a Bradley–Terry model: for a comparison pair $e=(x,y^{+},y^{-})$, the preference label satisfies

$$a\mid e\sim\mathrm{Bernoulli}\!\left(\sigma(r^{\star}(x,y^{+})-r^{\star}(x,y^{-}))\right),$$

where $\sigma(u)=(1+e^{-u})^{-1}$ is the logistic function.
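A minimal simulation of this label model may help fix ideas. The sketch below draws Bradley–Terry labels for one comparison whose reward gap is assumed to be 1; the function names, the seed, and the sample size are illustrative choices rather than anything specified in the paper.

```python
import math
import random


def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def preference_probability(reward_plus, reward_minus):
    """Probability that the first completion is preferred under Bradley-Terry."""
    return sigmoid(reward_plus - reward_minus)


def sample_label(reward_plus, reward_minus, rng):
    """Draw a in {0, 1}; a = 1 means the first completion is preferred."""
    return 1 if rng.random() < preference_probability(reward_plus, reward_minus) else 0


# Hypothetical rewards for a single comparison pair.
rng = random.Random(0)
labels = [sample_label(1.0, 0.0, rng) for _ in range(20000)]
empirical_rate = sum(labels) / len(labels)
true_rate = preference_probability(1.0, 0.0)
```

With many repeated labels the empirical preference rate approaches `true_rate`, and swapping the two completions gives the complementary probability.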

##### Direct Preference Optimization

In the classical RLHF pipeline (Christiano et al., [2017](https://arxiv.org/html/2606.19607#bib.bib3); Stiennon et al., [2020](https://arxiv.org/html/2606.19607#bib.bib9); Ouyang et al., [2022](https://arxiv.org/html/2606.19607#bib.bib2)), the labeled comparisons are first used to fit an explicit reward model, typically by maximizing the Bradley–Terry likelihood induced by the observed preferences. The learned reward model is then used as a surrogate for the unknown human reward $r^{\star}$ in the KL-regularized RLHF objective [(1)](https://arxiv.org/html/2606.19607#S2.E1), which is optimized to obtain the final policy. While conceptually natural, this two-stage procedure separates reward estimation from policy optimization and requires an additional reinforcement-learning step, such as PPO, to obtain the final policy (Christiano et al., [2017](https://arxiv.org/html/2606.19607#bib.bib3); Stiennon et al., [2020](https://arxiv.org/html/2606.19607#bib.bib9); Ouyang et al., [2022](https://arxiv.org/html/2606.19607#bib.bib2)). DPO was introduced to simplify this pipeline by directly training the policy from pairwise preference data, using the closed-form structure [(2)](https://arxiv.org/html/2606.19607#S2.E2) of the KL-regularized RLHF optimizer (Rafailov et al., [2023](https://arxiv.org/html/2606.19607#bib.bib1)).

The closed-form optimizer in [(2)](https://arxiv.org/html/2606.19607#S2.E2) motivates the central idea of DPO: instead of explicitly learning a reward model and then optimizing a policy against it, one can express the reward differences $r^{\star}(x,y^{+})-r^{\star}(x,y^{-})$ needed for preference learning directly through the policy's log-ratio relative to the reference model (see [(3)](https://arxiv.org/html/2606.19607#S2.E3)). Indeed, if a policy $\pi$ is optimal for the KL-regularized RLHF objective under some reward function $r$, then rearranging [(2)](https://arxiv.org/html/2606.19607#S2.E2) gives

$$r(x,y)=\beta\log\frac{\pi(y\mid x)}{\pi_0(y\mid x)}+\beta\log Z(x),$$

where $Z(x)$ depends only on the prompt. Therefore, although the absolute reward is identifiable only up to a prompt-dependent additive term, reward differences between completions for the same prompt are identifiable. Indeed, the prompt-dependent term cancels when we compare two completions $y^{+}$ and $y^{-}$ under the same prompt $x$, so these differences can be expressed through the corresponding policy ratios:

$$r(x,y^{+})-r(x,y^{-})=\beta\log\frac{\pi(y^{+}\mid x)\,\pi_0(y^{-}\mid x)}{\pi(y^{-}\mid x)\,\pi_0(y^{+}\mid x)}. \tag{3}$$

DPO uses this observation to avoid fitting a separate reward model. For a parameterized policy $\pi_{\theta}$, DPO interprets its log-ratio against the reference policy as an implicit reward,

$$r_{\theta}(x,y)\doteq\beta\log\frac{\pi_{\theta}(y\mid x)}{\pi_0(y\mid x)}.$$

For a comparison pair $e=(x,y^{+},y^{-})$, the corresponding implicit reward difference is

$$u_{\theta}(e)\doteq r_{\theta}(x,y^{+})-r_{\theta}(x,y^{-})=\beta\log\frac{\pi_{\theta}(y^{+}\mid x)\,\pi_0(y^{-}\mid x)}{\pi_{\theta}(y^{-}\mid x)\,\pi_0(y^{+}\mid x)}. \tag{4}$$

Under the Bradley–Terry model, a larger value of $u_{\theta}(e)$ means that the policy-implied reward ranks $y^{+}$ above $y^{-}$ more strongly. Thus, DPO estimates $\theta$ by maximum likelihood: it chooses $\theta$ to maximize the likelihood of the observed preference labels under the Bradley–Terry logistic model parameterized by $u_{\theta}(e)$.

Given a labeled comparison dataset $\mathcal{D}_n=\{(e_i,a_i)\}_{i=1}^{n}$, where $a_i=1$ indicates that the first completion $y_i^{+}$ in $e_i=(x_i,y_i^{+},y_i^{-})$ is preferred and $a_i=0$ otherwise, define

$$\ell(a_i,u_{\theta}(e_i))\doteq-a_i\log\sigma(u_{\theta}(e_i))-(1-a_i)\log(1-\sigma(u_{\theta}(e_i))).$$

The empirical DPO risk is

$$L_n(\theta)\doteq\frac{1}{n}\sum_{i=1}^{n}\ell(a_i,u_{\theta}(e_i)). \tag{5}$$

For a policy class $\{\pi_{\theta}:\theta\in\Theta\}$, the DPO estimator $\hat{\pi}_n$ is

$$\hat{\theta}_n\in\arg\min_{\theta\in\Theta}L_n(\theta),\qquad\hat{\pi}_n\doteq\pi_{\hat{\theta}_n}.$$
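The implicit reward difference and the empirical DPO risk can be sketched directly from log-probabilities. The code below is a minimal illustration that assumes each labeled comparison is supplied as a tuple of policy and reference log-probabilities plus a label; this data layout and the helper names are hypothetical, and the sketch only evaluates the risk, it does not perform the minimization over $\theta$.

```python
import math


def log_sigmoid(u):
    """Numerically stable log(sigma(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))


def implicit_margin(logp_plus, logp_minus, ref_logp_plus, ref_logp_minus, beta):
    """u_theta(e): beta times the difference of policy-vs-reference log-ratios."""
    return beta * ((logp_plus - ref_logp_plus) - (logp_minus - ref_logp_minus))


def dpo_risk(examples, beta):
    """Average logistic loss over labeled comparisons.

    Each example is (logp_plus, logp_minus, ref_logp_plus, ref_logp_minus, label).
    """
    total = 0.0
    for lp_plus, lp_minus, ref_plus, ref_minus, label in examples:
        u = implicit_margin(lp_plus, lp_minus, ref_plus, ref_minus, beta)
        # log(1 - sigma(u)) equals log(sigma(-u)).
        total -= label * log_sigmoid(u) + (1 - label) * log_sigmoid(-u)
    return total / len(examples)


# When the policy equals the reference, every margin is zero and the risk is log 2.
examples_at_reference = [(-1.0, -2.0, -1.0, -2.0, 1), (-0.5, -3.0, -0.5, -3.0, 0)]
risk_at_reference = dpo_risk(examples_at_reference, beta=0.1)

# Raising the preferred completion's log-probability lowers the loss for label 1.
improved = [(-0.5, -2.5, -1.0, -2.0, 1)]
baseline = [(-1.0, -2.0, -1.0, -2.0, 1)]
risk_improved = dpo_risk(improved, beta=0.1)
risk_baseline = dpo_risk(baseline, beta=0.1)
```

The value at the reference policy, $\log 2$, is a convenient sanity check for any implementation of the loss.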
Experiment design problem. We study comparison-data curation for preference-based post-training. Given a limited labeling budget, the learner must decide which comparison pairs $e=(x,y^{+},y^{-})$ to label so as to most effectively improve the downstream RLHF performance of the final post-trained policy. We develop our theoretical analysis for the DPO policy.

A common way to construct a human preference dataset is to generate a small number $d$ of completions for each prompt and then label either a single pair (e.g., $d=2$) or all $\binom{d}{2}$ within-prompt pairs. However, in practice, generating additional completions is relatively cheap, whereas eliciting human preference labels is costly. This motivates us to consider an alternative pipeline: generate a larger candidate pool for each prompt, and then select a budgeted subset of informative comparison pairs to label.

Formally, for each prompt $x$, the candidate completions in $\mathcal{Y}_x$ induce a complete comparison graph: each completion $y$ is a vertex, and each unordered edge $\{y^{+},y^{-}\}$ represents a possible pairwise comparison. After the candidate pools $\{\mathcal{Y}_x\}$ are generated, define the admissible unordered edge set

$$\mathcal{E}_x\doteq\big\{\{y^{+},y^{-}\}\in\mathcal{Y}_x\times\mathcal{Y}_x:\ y^{+}\neq y^{-}\big\},\qquad\mathcal{E}\doteq\bigcup_{x}\{x\}\times\mathcal{E}_x.$$

A sampling design is a distribution $D\in\Delta(\mathcal{E})$ over admissible within-prompt edges. For mathematical convenience, we analyze randomized designs: given a budget $n$, the pairwise comparisons $e_1,\dots,e_n$ are drawn i.i.d. from $D$, and each selected edge is labeled according to the Bradley–Terry model above. This formulation casts comparison curation as an optimization problem over probability distributions, rather than as direct combinatorial subset selection. After optimizing the design distribution, we construct the DPO training dataset by sampling $n$ comparison pairs from it.
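The mechanics of a randomized design can be sketched as follows: enumerate all within-prompt unordered pairs, attach a probability to each, and draw the labeling batch i.i.d. The sketch uses a uniform design purely as a placeholder; the paper's optimized designs are not reproduced here, and the pool contents and helper names are hypothetical.

```python
import itertools
import random


def within_prompt_edges(pools):
    """All unordered within-prompt pairs, as (prompt, y_a, y_b) tuples."""
    edges = []
    for prompt, completions in pools.items():
        for y_a, y_b in itertools.combinations(completions, 2):
            edges.append((prompt, y_a, y_b))
    return edges


def sample_comparisons(edges, design, n, rng):
    """Draw n edges i.i.d. from the design (probabilities aligned with edges)."""
    return rng.choices(edges, weights=design, k=n)


# Hypothetical candidate pools for two prompts.
pools = {"x1": ["a", "b", "c"], "x2": ["d", "e", "f", "g"]}
edges = within_prompt_edges(pools)

# Placeholder design: uniform over all admissible edges.
uniform_design = [1.0 / len(edges)] * len(edges)

rng = random.Random(0)
batch = sample_comparisons(edges, uniform_design, n=5, rng=rng)
```

Because sampling is with replacement, the same edge can appear more than once in `batch`, which matches the i.i.d. formulation above; pairs across different prompts never appear.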

For a given sampling design $D$, let $\hat{\pi}_n(D)$ denote the DPO policy obtained by first sampling $n$ comparison edges $e_i$ from $D$, querying their preference labels $a_i$, and then minimizing the empirical DPO risk [(5)](https://arxiv.org/html/2606.19607#S2.E5) on the resulting labeled dataset $\mathcal{D}_n=\{(e_i,a_i)\}_{i=1}^{n}$. The design problem is

$$\max_{D\in\Delta(\mathcal{E})}\ \mathbb{E}\!\left[J(\hat{\pi}_n(D))\right],$$

where the expectation is over the sampled comparison edges and labels. The rest of the paper develops upper and lower bounds for this objective and derives computable surrogates for selecting informative comparison portfolios.

Parametric policy class. For the theoretical analysis, we work with a softmax policy class. Each prompt–completion pair $(x,y)$ is represented by an embedding $\phi(x,y)\in\mathbb{R}^{p}$, and any policy $\pi_{\theta}$ with $\theta\in\Theta$ has the form

$$\pi_{\theta}(y\mid x)=\frac{\exp(f_{\theta}(\phi(x,y)))}{\sum_{y'\in\mathcal{A}(x)}\exp(f_{\theta}(\phi(x,y')))},\qquad y\in\mathcal{A}(x), \tag{6}$$

where $f_{\theta}:\mathbb{R}^{p}\to\mathbb{R}$ is smooth in $\theta$. This formulation includes linear contextual policies and tabular softmax policies as special cases. In the linear contextual case, $f_{\theta}(\phi)=\langle\theta,\phi\rangle$, with $\theta\in\mathbb{R}^{p}$. In the tabular case, $\phi(x,y)=\mathbf{1}_{(x,y)}\in\mathbb{R}^{p}$ is a one-hot vector over all prompt–completion pairs, and $f_{\theta}(\phi(x,y))=\theta_{(x,y)}$, where $\theta\in\mathbb{R}^{p}$. For detailed explanations, see [Remark 1](https://arxiv.org/html/2606.19607#Thmremark1).
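To make the two special cases concrete, the sketch below implements the linear score $f_{\theta}(\phi)=\langle\theta,\phi\rangle$ for one prompt and then reuses it with one-hot features to obtain a tabular policy. The feature vectors and parameter values are made up for illustration.

```python
import math


def linear_softmax_policy(theta, features):
    """pi_theta(. | x) with f_theta(phi) = <theta, phi>; features lists phi(x, y) per candidate."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - shift) for logit in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Linear contextual case: three candidates with hypothetical 2-d features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.5]
probs = linear_softmax_policy(theta, features)

# Tabular case: one-hot features, so each logit is simply the matching coordinate of theta.
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta_tab = [0.2, -0.1, 0.4]
tabular_probs = linear_softmax_policy(theta_tab, one_hot)

# Adding the same constant to every logit of a prompt leaves the policy unchanged.
tabular_probs_shifted = linear_softmax_policy([t + 1.0 for t in theta_tab], one_hot)
```

The last two lines show that a common shift of all logits for a prompt does not change the policy, so such a shift cannot be detected from the policy alone.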

Notation. For a symmetric positive semidefinite matrix $A$, $A^{\dagger}$ denotes the Moore–Penrose pseudoinverse, obtained by inverting the positive eigenvalues of $A$ and leaving the zero eigenvalues unchanged. We write $A^{\dagger/2}$ for the symmetric square root of $A^{\dagger}$. When $A$ is positive definite on a subspace $H$, $A^{\dagger}$ acts as the usual inverse on $H$.
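As a small worked example of this notation (the matrix is chosen only for illustration and does not appear in the paper), take a rank-one matrix with eigenvalue $2$ along $(1,1)/\sqrt{2}$ and eigenvalue $0$ along $(1,-1)/\sqrt{2}$:

```latex
A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
  = 2\,uu^{\top}, \quad u = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
A^{\dagger} = \tfrac{1}{2}\,uu^{\top}
  = \tfrac{1}{4}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\qquad
A^{\dagger/2} = \tfrac{1}{\sqrt{2}}\,uu^{\top}
  = \tfrac{1}{2\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\qquad
A A^{\dagger} = uu^{\top}.
```

Here $AA^{\dagger}$ is the orthogonal projection onto $H=\operatorname{span}\{u\}$, so $A^{\dagger}$ inverts $A$ on $H$ and is zero on the orthogonal complement.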

### 2.1 Regularity assumptions

We collect the main assumptions used in the analysis. Detailed interpretations, examples, and primitive sufficient conditions are presented in Appendix [A.2](https://arxiv.org/html/2606.19607#A1.SS2).

Identifiability and realizability. Recall that each comparison edge $e=(x,y^{+},y^{-})$ enters the DPO loss through the scalar logit difference $u_{\theta}(e)$. Define the pairwise sensitivity vector

$$g(e;\theta)\doteq\nabla_{\theta}u_{\theta}(e).$$

Because softmax probabilities depend only on within-prompt logit differences, directions that shift all logits for a prompt by the same amount are unidentifiable. We restrict attention to the identifiable tangent space

$$H\doteq\operatorname{span}\bigl\{g(e;\theta^{\star}):e=(x,y^{+},y^{-})\in\mathcal{E}\bigr\}\subseteq\mathbb{R}^{p}.$$
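In the linear contextual case the sensitivity vector has a simple form. Substituting the softmax class (6) into (4), the log-partition term cancels within a prompt and the reference terms do not depend on $\theta$, so $g(e;\theta)=\beta\,(\phi(x,y^{+})-\phi(x,y^{-}))$. This is our own derivation from the definitions above rather than a statement quoted from the paper. The sketch below checks it against a finite-difference gradient on hypothetical features.

```python
import math


def log_softmax(logits):
    """Log-probabilities of a softmax over the given logits."""
    shift = max(logits)
    log_z = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_z for v in logits]


def margin(theta, features, ref_logps, i_plus, i_minus, beta):
    """u_theta(e) for a linear softmax policy over one prompt's candidates."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    logps = log_softmax(logits)
    return beta * ((logps[i_plus] - ref_logps[i_plus]) - (logps[i_minus] - ref_logps[i_minus]))


def sensitivity_linear(features, i_plus, i_minus, beta):
    """Closed form in the linear case: beta * (phi_plus - phi_minus)."""
    return [beta * (a - b) for a, b in zip(features[i_plus], features[i_minus])]


def finite_difference(theta, features, ref_logps, i_plus, i_minus, beta, eps=1e-6):
    """Central-difference approximation of the gradient of u_theta(e) in theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = margin(up, features, ref_logps, i_plus, i_minus, beta) - margin(
            down, features, ref_logps, i_plus, i_minus, beta
        )
        grad.append(diff / (2 * eps))
    return grad


# Hypothetical features for three candidates of one prompt; uniform reference policy.
features = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.0]]
theta = [0.3, -0.2, 0.1]
ref_logps = log_softmax([0.0, 0.0, 0.0])
beta = 0.5

g_closed = sensitivity_linear(features, 0, 1, beta)
g_numeric = finite_difference(theta, features, ref_logps, 0, 1, beta)
```

In this linear case the sensitivity vector does not depend on $\theta$, so the span defining $H$ is determined by the feature differences alone.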
###### Assumption 1 (Identifiability and realizability).

The parameter set $\Theta$ is convex, compact, and satisfies

$$\Theta\subseteq\theta^{\star}+H.$$

This assumption generalizes the standard normalization used when estimating a Bradley–Terry model in a tabular setting. In the tabular case, the assumption amounts to requiring that, within each prompt, the rewards of all completions sum to zero; see [Remark 3](https://arxiv.org/html/2606.19607#Thmremark3). This normalization is commonly used to make the tabular Bradley–Terry rewards uniquely identifiable (Shah et al., [2016](https://arxiv.org/html/2606.19607#bib.bib12)).

Boundedness and smoothness. We impose standard boundedness and smoothness conditions on the feature map and the score model. Similar regularity conditions are commonly used in the analysis of DPO; see, e.g., Chowdhury et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib13)).

###### Assumption 2 (Boundedness and smoothness).

There exist constants $R_{\phi},\alpha_0,\alpha_1,\alpha_2,\alpha_3<\infty$ such that for all $\theta\in\Theta$ and all admissible $(x,y)$,

$$\|\phi(x,y)\|_2\leq R_{\phi},\qquad|f_{\theta}(\phi(x,y))|\leq\alpha_0,$$

$$\|\nabla_{\theta}f_{\theta}(\phi(x,y))\|_2\leq\alpha_1,\qquad\|\nabla_{\theta}^{2}f_{\theta}(\phi(x,y))\|_{\mathrm{op}}\leq\alpha_2,\qquad\|\nabla_{\theta}^{3}f_{\theta}(\phi(x,y))\|_{\mathrm{op}}\leq\alpha_3.$$

Moreover, for each fixed $(x,y)$, the map $\theta\mapsto f_{\theta}(\phi(x,y))$ is three times continuously differentiable on $\Theta$.

Feature separation. To obtain meaningful learning guarantees from pairwise comparisons, namely that the DPO solution achieves a vanishing RLHF optimality gap as the comparison budget $n$ grows, we need the candidate completion set $\mathcal{A}(x)$ to be sufficiently informative. Intuitively, if for a given prompt $x$ the candidate pool contains only near-duplicate completions (or completions that are indistinguishable under the model features), then comparing them provides little information about how the policy should change, and certain parameter directions cannot be learned no matter how many comparisons we collect. To rule out such degenerate cases, we impose a mild diversity condition on the candidate set: for each prompt, $\mathcal{A}(x)$ should contain at least two completions that are sufficiently different in their model-induced features (in every identifiable direction). This is formalized by the following feature-separation assumption.

###### Assumption 3 (Feature separation on $H$).

There exists $\Delta_{g}>0$ such that for every $\theta\in\Theta$, every prompt $x$, and every unit vector $v\in H$, there exist two candidates $y_{1},y_{2}\in\mathcal{A}(x)$ satisfying

$$\Bigl|v^{\top}\bigl(\nabla_{\theta}f_{\theta}(\phi(x,y_{1}))-\nabla_{\theta}f_{\theta}(\phi(x,y_{2}))\bigr)\Bigr|\geq\Delta_{g}.$$

Consider the tabular setting with $d$ items and reward vector $r=(r_{1},\ldots,r_{d})$. The feature vectors are one-hot vectors, and the identifiable subspace is

$$H=\Bigl\{v\in\mathbb{R}^{d}:\sum_{i=1}^{d}v_{i}=0\Bigr\}.$$

In this case, [Assumption 3](https://arxiv.org/html/2606.19607#Thmassumption3) reduces to the requirement that every unit vector $v\in H$ separates some pair of items: there exists $\Delta_{g}>0$ such that for every unit vector $v\in H$,

$$\max_{i,j\in[d]}|v_{i}-v_{j}|\geq\Delta_{g}.$$

This holds automatically, since a unit zero-mean vector cannot have all coordinates equal.
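The separation constant in the tabular case can be made explicit. The following short derivation is not taken from the paper; it is a sketch that applies Popoviciu's variance inequality to the coordinates of a unit vector $v\in H$, writing $M=\max_i v_i$ and $m=\min_i v_i$ (notation introduced here for illustration).

```latex
% v has zero mean and unit Euclidean norm, so the population variance of its
% coordinates is exactly 1/d. Popoviciu's inequality bounds the variance of any
% quantity taking values in [m, M] by (M - m)^2 / 4.
\frac{1}{d}
  \;=\; \frac{1}{d}\sum_{i=1}^{d} v_i^{2}
  \;=\; \operatorname{Var}(v)
  \;\le\; \frac{(M-m)^{2}}{4}
\quad\Longrightarrow\quad
\max_{i,j\in[d]} |v_i - v_j| \;=\; M-m \;\ge\; \frac{2}{\sqrt{d}} .
```

Under this argument one may take $\Delta_{g}=2/\sqrt{d}$ in the tabular setting; for even $d$ the bound is attained by a vector with half of its coordinates equal to $1/\sqrt{d}$ and the other half equal to $-1/\sqrt{d}$.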

Design coverage. Even with a diverse candidate pool, a meaningful guarantee further requires that the sampling design $D$ does not ignore informative pairs. We therefore impose the following coverage assumption, which is standard in the estimation of Bradley–Terry models; see, e.g., Shah et al. ([2016](https://arxiv.org/html/2606.19607#bib.bib12)); Chowdhury et al. ([2024](https://arxiv.org/html/2606.19607#bib.bib13)).

###### Definition 1 (Design covariance matrix).

A sampling design $D$ over comparison edges induces the design covariance matrix

$$\Sigma_{D}(\theta)\doteq\mathbb{E}_{e\sim D}\!\left[g(e;\theta)\,g(e;\theta)^{\top}\right].$$
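For a finite edge set, the design covariance matrix is a weighted sum of outer products. The sketch below illustrates this with NumPy. The function name `design_covariance`, the array `G` (one row per edge, holding $g(e;\theta)$), and the toy tabular choice $g(e)=e_i-e_j$ (correct only up to a scaling constant) are assumptions made for illustration, not the paper's code.

```python
import numpy as np

def design_covariance(G, w):
    """Sigma_D = sum_e w[e] * g_e g_e^T for a finite edge set.

    G: (m, p) array whose rows stand in for g(e; theta) (hypothetical input).
    w: (m,) design weights, nonnegative and summing to one.
    """
    G = np.asarray(G, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("w must be a probability vector over edges")
    return (G * w[:, None]).T @ G

# Toy tabular case with d = 3 items: assume g(e) = e_i - e_j for each pair (i, j).
pairs = [(0, 1), (0, 2), (1, 2)]
G = np.zeros((len(pairs), 3))
for k, (i, j) in enumerate(pairs):
    G[k, i], G[k, j] = 1.0, -1.0

# Uniform design over the three pairs: a scaled Laplacian of the complete graph.
Sigma_uniform = design_covariance(G, np.full(len(pairs), 1.0 / len(pairs)))
```

In this toy case the result is a scaled graph Laplacian, so it annihilates the all-ones vector; this matches the fact that only directions in the zero-sum subspace are identifiable from pairwise comparisons.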

Our guarantees require that the sampling design $D$ places sufficient mass on informative comparisons.

###### Assumption 4(a) (Design coverage at the truth).

There exists $\mu_{\star}>0$ such that

$$v^{\top}\Sigma_{D}(\theta^{\star})\,v\geq\mu_{\star}\|v\|_{2}^{2},\qquad\forall v\in H.$$

In some arguments, we require the above assumption to hold uniformly over a neighborhood of $\theta^{\star}$. Accordingly, we introduce the following localized version. Whenever invoked, the region $\mathcal{R}\subseteq\Theta$ will be specified, typically chosen as a neighborhood of $\theta^{\star}$.

###### Assumption 4(b) (Uniform design coverage).

For a specified region $\mathcal{R}\subseteq\Theta$, there exists $\mu_{\mathcal{R}}>0$ such that

$$v^{\top}\Sigma_{D}(\theta)\,v\geq\mu_{\mathcal{R}}\|v\|_{2}^{2},\qquad\forall v\in H,\ \forall\theta\in\mathcal{R}.$$
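The coverage constant of a given design can be checked numerically as the smallest eigenvalue of the design covariance restricted to $H$. The sketch below is a minimal illustration under the assumption of a finite edge set; the helper names `coverage_constant` and `sigma`, and the toy edge vectors $e_i-e_j$, are hypothetical and not from the paper.

```python
import numpy as np

def coverage_constant(Sigma, G, tol=1e-10):
    """Smallest value of v^T Sigma v over unit vectors v in H = span(rows of G)."""
    # Orthonormal basis of H from the right singular vectors of G.
    _, s, Vt = np.linalg.svd(np.asarray(G, dtype=float), full_matrices=False)
    B = Vt[s > tol * s.max()].T                 # shape (p, dim H)
    return float(np.linalg.eigvalsh(B.T @ Sigma @ B).min())

# Toy tabular example (d = 3): edges (0,1), (0,2), (1,2), assuming g(e) = e_i - e_j.
G = np.array([[1.0, -1.0, 0.0],
              [1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])

def sigma(w):
    """Design covariance for weights w over the three toy edges."""
    return (G * np.asarray(w, dtype=float)[:, None]).T @ G

mu_uniform = coverage_constant(sigma([1 / 3, 1 / 3, 1 / 3]), G)  # all pairs covered
mu_single = coverage_constant(sigma([1.0, 0.0, 0.0]), G)         # one pair only
```

A design that puts all of its mass on a single pair leaves one direction of $H$ uncovered, so its coverage constant is zero and the coverage assumption fails, whereas the uniform design over all pairs covers every direction.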

Optimization landscape. Let $L(\theta)\doteq\mathbb{E}_{(e,a)}[\ell(a,u_{\theta}(e))]$ denote the population DPO risk under the sampling design $D$ and label (BT) model. For general nonlinear score models, $L(\theta)$ may be nonconvex and may have multiple minimizers, so we assume the target parameter is unambiguous ([Remark 6](https://arxiv.org/html/2606.19607#Thmremark6)).

###### Assumption 5 (Unique population minimizer).

The population DPO risk $L(\theta)$ has a unique global minimizer over $\Theta$.

Such a uniqueness assumption is standard in the analysis of extremum estimators, where consistency typically requires the limiting population risk to have a unique optimizer (Newey and McFadden, [1994](https://arxiv.org/html/2606.19607#bib.bib22), Section 2.1; Van der Vaart, [2000](https://arxiv.org/html/2606.19607#bib.bib23), Theorem 5.7).

Prior regularity for the lower bound. Our final assumption introduces a Bayesian formulation for the optimal parameter $\theta^{\star}$. When prior information about $\theta^{\star}$ is available, our analysis yields an information-theoretic (Bayesian) lower bound on the RLHF optimality gap under an arbitrary sampling design $D$. A detailed explanation is provided in [Remark 7](https://arxiv.org/html/2606.19607#Thmremark7).

###### Assumption 6 (Prior regularity).

Suppose $\theta^{\star}\sim\rho$, where $\rho$ is supported on $\Theta$. The density $\rho$ satisfies: (i) $\rho\in C^{1}(\Theta)$ and $\rho(\theta)>0$ for all $\theta\in\mathrm{int}(\Theta)$; (ii) $\rho(\theta)=0$ for all $\theta\in\partial\Theta$; (iii) the prior Fisher information is finite: $\int_{\Theta}\|\nabla\log\rho(\theta)\|_{2}^{2}\,\rho(\theta)\,d\theta<\infty$.

[Assumption 6](https://arxiv.org/html/2606.19607#Thmassumption6) is a standard regularity condition on the prior density. It ensures that the quantity $\int_{\Theta}\|\nabla\log\rho(\theta)\|_{2}^{2}\,\rho(\theta)\,d\theta$ is finite and that the integration-by-parts steps required by the Bayesian information inequality (Van Trees) in our lower-bound argument are valid.

## 3 Main results

Following the literature, we refer to the gradient of the log-likelihood as the *score function*. The second moment of the score function is the *Fisher information matrix*. In our contextual policy model, the Fisher information is defined as follows. Detailed interpretations are presented in Appendix [B](https://arxiv.org/html/2606.19607#A2).

###### Definition 2 (Fisher information matrix for the policy family).

For each parameter $\theta\in\Theta$, define the score vector

$$s_{\theta}(x,y)\doteq\nabla_{\theta}\log\pi_{\theta}(y\mid x).$$

The Fisher information matrix is

$$\begin{aligned}
I(\theta) &\doteq \mathbb{E}_{x}\,\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\!\left[s_{\theta}(x,y)\,s_{\theta}(x,y)^{\top}\right] \\
&= \mathbb{E}_{x}\,\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\!\left[\nabla_{\theta}\log\pi_{\theta}(y\mid x)\,\nabla_{\theta}\log\pi_{\theta}(y\mid x)^{\top}\right].
\end{aligned}\tag{7}$$
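To make the definition concrete, the sketch below computes the Fisher information for a linear softmax policy $\pi_{\theta}(y\mid x)\propto\exp(\theta^{\top}\phi(x,y))$, for which the score is the feature vector minus its mean under the policy. This policy class, the uniform weighting over prompts, and the names `fisher_information` and `Phi` are assumptions chosen for illustration; the paper's definition covers general score models.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def fisher_information(theta, Phi):
    """Fisher matrix for pi_theta(y|x) proportional to exp(theta^T phi(x, y)).

    Phi: (num_prompts, d, p) feature array; prompts are weighted uniformly
    (an assumption of this sketch).
    """
    p = Phi.shape[-1]
    I = np.zeros((p, p))
    for feats in Phi:                       # feats: (d, p) for one prompt
        pi = softmax(feats @ theta)
        S = feats - pi @ feats              # rows are the score vectors s_theta(x, y)
        I += (S * pi[:, None]).T @ S        # E_{y ~ pi}[s s^T] for this prompt
    return I / len(Phi)

# Tabular special case: one prompt with one-hot features.
theta = np.array([0.5, -0.2, 0.1])
I_tab = fisher_information(theta, np.eye(3)[None, :, :])
pi_tab = softmax(theta)
```

In the tabular special case the result coincides with $\mathrm{diag}(\pi)-\pi\pi^{\top}$, the covariance of a one-hot draw from $\pi$.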

###### Theorem 1 (Informal).

Consider a sampling design $D$ over within-prompt pairs, and $n$ i.i.d. comparisons drawn from $D$. Let $I(\theta)$ and $\Sigma_{D}(\theta)$ be as defined in [Definitions 2](https://arxiv.org/html/2606.19607#Thmdefinition2) and [1](https://arxiv.org/html/2606.19607#Thmdefinition1). Under [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [3](https://arxiv.org/html/2606.19607#Thmassumption3), we have the following.

1. (i) Upper bound. Let $\hat{\theta}_{n}\in\arg\min_{\theta\in\Theta}L_{n}(\theta)$ with $\hat{\pi}_{n}=\pi_{\hat{\theta}_{n}}$. If [Assumptions 4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) and [5](https://arxiv.org/html/2606.19607#Thmassumption5) hold in addition, then for any $\delta\in(0,1)$, there exists $n_{0}(\delta)$ such that for all $n\geq n_{0}(\delta)$, with probability at least $1-\delta$,

   $$J(\pi^{\star})-J(\hat{\pi}_{n})\leq\frac{C_{\mathrm{ub}}(\delta)}{n}\,\mathrm{tr}\!\Big(I(\theta^{\star})\,\Sigma_{D}^{\dagger}(\theta^{\star})\Big),$$

   where $C_{\mathrm{ub}}(\delta)$ depends only on the fixed constants in the assumptions and $\delta$. Moreover,

   $$n_{0}(\delta)=\widetilde{\mathcal{O}}\!\bigl(C_{\mathrm{loc}}\,p+C_{\mathrm{Hess}}\,d_{H}+(C_{\mathrm{loc}}+C_{\mathrm{Hess}})\log(1/\delta)\bigr),$$

   where $p$ is the ambient dimension, $d_{H}=\dim(H)$, $C_{\mathrm{loc}}$ is a localization complexity parameter, and $C_{\mathrm{Hess}}$ is a local curvature/concentration parameter.
2. (ii) Lower bound. Let

   $$n_{\mathrm{lb}}:=\left\lceil\frac{\int_{\Theta}\|\nabla\log\rho(\theta)\|_{2}^{2}\,\rho(\theta)\,d\theta}{\mu_{\mathcal{R}}}\right\rceil.$$

   If [Assumptions 4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a) and [6](https://arxiv.org/html/2606.19607#Thmassumption6) hold in addition with $\mathcal{R}=\operatorname{supp}(\rho)$, then there exists $C_{\mathrm{lb}}>0$ (depending only on the fixed constants in the assumptions), such that for any induced policy estimator $\tilde{\pi}_{n}=\pi_{\tilde{\theta}_{n}}$ and all $n\geq n_{\mathrm{lb}}$,

   $$\mathbb{E}_{\theta^{\star}\sim\rho}\,\mathbb{E}_{\mathcal{D}_{n}\mid\theta^{\star}}\!\left[J(\pi_{\theta^{\star}})-J(\tilde{\pi}_{n})\right]\geq\frac{C_{\mathrm{lb}}}{n}\,\mathbb{E}_{\theta^{\star}\sim\rho}\!\left[\mathrm{tr}\!\Big(I(\theta^{\star})\,\Sigma_{D}^{\dagger}(\theta^{\star})\Big)\right].$$

We present the proof in Appendix [C](https://arxiv.org/html/2606.19607#A3).

### 3.1 Sampling policy design

From [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1), both the upper and lower bounds on the RLHF optimality gap are governed (up to constants) by the same trace criterion, $\mathrm{tr}\bigl(I(\theta^{\star})\,\Sigma^{\dagger}_{D}(\theta^{\star})\bigr)$. Motivated by this characterization, we define the oracle sampling design as the solution to the following trace minimization problem:

$$D^{\star}(\theta^{\star})\in\arg\min_{D\in\Delta(\mathcal{E})}\ \mathrm{tr}\!\big(I(\theta^{\star})\,\Sigma^{\dagger}_{D}(\theta^{\star})\big),\tag{8}$$

where $\mathcal{E}$ is the admissible within-prompt edge set. The oracle design $D^{\star}(\theta^{\star})$, however, is not directly implementable because it depends on the unknown target parameter $\theta^{\star}$. Before collecting preference labels, the only policy parameter available to the designer is the reference parameter $\theta_{0}$ associated with the reference policy $\pi_{0}$. Moreover, RLHF starts from $\pi_{0}$ and optimizes a KL-regularized objective ([1](https://arxiv.org/html/2606.19607#S2.E1)) that keeps the learned policy $\pi^{\star}$ close to this reference model $\pi_{0}$. It is therefore natural to use $\theta_{0}$ as a plug-in proxy for $\theta^{\star}$ when constructing the sampling design. We next show that the resulting plug-in design still enjoys a controlled performance guarantee.
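One way to approximate the trace-minimization problem numerically over a finite edge set is a Frank–Wolfe iteration on the simplex of design weights. The sketch below is an illustration under stated assumptions and not necessarily the procedure used in the paper: the function names, the step-size schedule, the pseudoinverse cutoff, and the toy tabular inputs (edge vectors $e_i-e_j$ and the reference-policy Fisher matrix $\mathrm{diag}(\pi_0)-\pi_0\pi_0^{\top}$) are all hypothetical choices.

```python
import numpy as np

def _pinv(S):
    # Explicit cutoff so the null direction of Sigma_D is treated as exactly zero.
    return np.linalg.pinv(S, rcond=1e-10, hermitian=True)

def trace_objective(w, G, I):
    """tr(I Sigma_D^+) for design weights w over edges with vectors G[e]."""
    Sigma = (G * w[:, None]).T @ G
    return float(np.trace(I @ _pinv(Sigma)))

def frank_wolfe_design(G, I, iters=200):
    """Approximately minimize tr(I Sigma_D^+) over the simplex of edge weights."""
    m = G.shape[0]
    w = np.full(m, 1.0 / m)                       # start from the uniform design
    best_w, best_val = w.copy(), trace_objective(w, G, I)
    for k in range(iters):
        Sp = _pinv((G * w[:, None]).T @ G)
        M = Sp @ I @ Sp
        # Minus the partial derivative of the objective in w_e is g_e^T M g_e
        # (valid while every weight stays positive, so the range of Sigma is fixed).
        gain = np.einsum("ep,pq,eq->e", G, M, G)
        step = 2.0 / (k + 3.0)                    # < 1, keeps all weights positive
        w = (1.0 - step) * w
        w[np.argmax(gain)] += step
        val = trace_objective(w, G, I)
        if val < best_val:                        # keep the best iterate seen
            best_w, best_val = w.copy(), val
    return best_w, best_val

# Toy plug-in design: d = 5 items, all pairs admissible, non-uniform reference policy.
d = 5
pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
G = np.zeros((len(pairs), d))
for k, (i, j) in enumerate(pairs):
    G[k, i], G[k, j] = 1.0, -1.0
pi0 = np.array([0.6, 0.2, 0.1, 0.06, 0.04])
I_ref = np.diag(pi0) - np.outer(pi0, pi0)         # assumed Fisher matrix at theta_0

val_uniform = trace_objective(np.full(len(pairs), 1.0 / len(pairs)), G, I_ref)
w_plug, val_plug = frank_wolfe_design(G, I_ref)
```

Because the routine returns the best iterate it has seen, starting from the uniform design, `val_plug` is never worse than `val_uniform`. Evaluating the same objective with a Fisher matrix and edge vectors computed at a different parameter gives the quantity that the plug-in guarantee compares against.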

###### Theorem 2 (Implementable trace design).

Let $\theta_{0}\in\Theta$ be a fixed reference parameter, and $r_{0}\doteq\|\theta^{\star}-\theta_{0}\|_{2}$. Under the assumptions in [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1), there exists a constant $C_{\mathrm{plug}}(r_{0})\geq 1$, depending only on model primitive constants and $r_{0}$, such that for every sampling design $D\in\Delta(\mathcal{E})$,

$$C_{\mathrm{plug}}(r_{0})^{-1}\,\mathrm{tr}\!\Big(I(\theta_{0})\,\Sigma^{\dagger}_{D}(\theta_{0})\Big)\;\leq\;\mathrm{tr}\!\Big(I(\theta^{\star})\,\Sigma^{\dagger}_{D}(\theta^{\star})\Big)\;\leq\;C_{\mathrm{plug}}(r_{0})\,\mathrm{tr}\!\Big(I(\theta_{0})\,\Sigma^{\dagger}_{D}(\theta_{0})\Big).\tag{9}$$

Moreover, for

$$D_{\theta_{0}}\in\arg\min_{D\in\Delta(\mathcal{E})}\ \mathrm{tr}\!\Big(I(\theta_{0})\,\Sigma^{\dagger}_{D}(\theta_{0})\Big),\tag{10}$$

we have

$$\mathrm{tr}\!\Big(I(\theta^{\star})\,\Sigma^{\dagger}_{D_{\theta_{0}}}(\theta^{\star})\Big)\;\leq\;C_{\mathrm{plug}}(r_{0})^{2}\,\inf_{D\in\Delta(\mathcal{E})}\mathrm{tr}\!\Big(I(\theta^{\star})\,\Sigma^{\dagger}_{D}(\theta^{\star})\Big).\tag{11}$$

We present a detailed proof in Appendix [D](https://arxiv.org/html/2606.19607#A4).
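The near-optimality bound for the plug-in design follows from the two-sided comparison in three steps. The chain below is a sketch of that argument, using the shorthand $T_{\theta}(D)\doteq\mathrm{tr}\bigl(I(\theta)\,\Sigma_{D}^{\dagger}(\theta)\bigr)$ and $C\doteq C_{\mathrm{plug}}(r_{0})$, both introduced here for brevity.

```latex
% For an arbitrary design D in \Delta(\mathcal{E}):
T_{\theta^{\star}}(D_{\theta_0})
  \;\le\; C\, T_{\theta_0}(D_{\theta_0})   % right-hand inequality of (9), applied to D_{\theta_0}
  \;\le\; C\, T_{\theta_0}(D)              % D_{\theta_0} minimizes T_{\theta_0}, by (10)
  \;\le\; C^{2}\, T_{\theta^{\star}}(D).   % left-hand inequality of (9), rearranged
% Taking the infimum over D on the right-hand side gives (11).
```

The factor $C_{\mathrm{plug}}(r_{0})^{2}$ therefore reflects two uses of the comparison: one to move from $\theta^{\star}$ to $\theta_{0}$ and one to move back.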

### 3.2 Road map for proving [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1)

Here we present a high-level sketch of the proof of [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1). The main idea is to reduce downstream policy suboptimality to a weighted parameter-estimation error, and then control this error from above for DPO and from below for any estimator.

Step 1: RLHF gap as a weighted parameter error. We first show that the RLHF optimality gap is locally equivalent to a quadratic form in the parameter error. Specifically, under the regularity assumptions, for all $\theta\in\Theta$,

$$c_{-}\,(\theta-\theta^{\star})^{\top}I(\theta^{\star})(\theta-\theta^{\star})\;\leq\;J(\pi^{\star})-J(\pi_{\theta})\;\leq\;c_{+}\,(\theta-\theta^{\star})^{\top}I(\theta^{\star})(\theta-\theta^{\star}),$$

where $I(\theta^{\star})$ is the curvature/Fisher matrix of the policy around the RLHF optimum. Thus, controlling the downstream RLHF gap is equivalent, up to constants, to controlling the estimation error in the $I(\theta^{\star})$-weighted norm:

$$\mathbb{E}\bigl[J(\pi^{\star})-J(\pi_{\tilde{\theta}_{n}})\bigr]\asymp\mathbb{E}\!\left[(\tilde{\theta}_{n}-\theta^{\star})^{\top}I(\theta^{\star})(\tilde{\theta}_{n}-\theta^{\star})\right].$$

We present a detailed proof in Appendix [C.1](https://arxiv.org/html/2606.19607#A3.SS1).

Step 2: Upper bound for DPO. Let $\hat{\theta}_{n}$ be the empirical DPO minimizer and $\Delta=\hat{\theta}_{n}-\theta^{\star}$. The main challenge is that, in the nonlinear case, the empirical Hessian varies with $\theta$. We handle this by showing that the average Hessian along the path from $\theta^{\star}$ to $\hat{\theta}_{n}$,

$$H_{n}\doteq\int_{0}^{1}\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\,dt,$$

is uniformly lower bounded on $H$ by the design covariance: $H_{n}\succeq c\,\Sigma_{D}(\theta^{\star})$. Combined with first-order optimality, this controls the DPO estimation error in the $\Sigma_{D}$-norm by the score noise at $\theta^{\star}$. Since the score covariance is of order $\Sigma_{D}(\theta^{\star})/n$, we obtain

$$\mathbb{E}\!\left[\Delta^{\top}I(\theta^{\star})\Delta\right]\lesssim\frac{1}{n}\,\mathrm{tr}\!\big(I(\theta^{\star})\,\Sigma_{D}(\theta^{\star})^{\dagger}\big).$$

Step 1 then gives the stated RLHF-gap upper bound. See Appendix [C.2](https://arxiv.org/html/2606.19607#A3.SS2) for the full proof.
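The appearance of the trace in this bound can be seen from a one-line linear-algebra identity. The display below is a heuristic sketch only, assuming that the error covariance is dominated by a multiple of the pseudoinverse of the design covariance; the constant $c'$ is a placeholder introduced here, and the rigorous argument is the one in the appendix.

```latex
% Trace identity, followed by monotonicity of B -> tr(A B) for A positive semidefinite.
\mathbb{E}\bigl[\Delta^{\top} I(\theta^{\star})\,\Delta\bigr]
  \;=\; \operatorname{tr}\!\bigl(I(\theta^{\star})\,\mathbb{E}[\Delta\Delta^{\top}]\bigr)
  \;\le\; \frac{c'}{n}\,\operatorname{tr}\!\bigl(I(\theta^{\star})\,\Sigma_{D}(\theta^{\star})^{\dagger}\bigr)
\qquad\text{whenever}\qquad
\mathbb{E}[\Delta\Delta^{\top}] \;\preceq\; \frac{c'}{n}\,\Sigma_{D}(\theta^{\star})^{\dagger}.
```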

Step 3: Lower bound for any estimator. For the converse, we apply the Van Trees inequality to the pairwise comparison model. Since the Fisher information contributed by $n$ comparisons under design $D$ is controlled by $n\,\Sigma_{D}(\theta)$, any estimator $\tilde{\theta}_{n}$ must incur Bayes risk at least of order

$$\frac{1}{n}\,\mathbb{E}_{\theta^{\star}\sim\rho}\!\left[\mathrm{tr}\!\big(I(\theta^{\star})\,\Sigma_{D}(\theta^{\star})^{\dagger}\big)\right].$$

Combining this with the quadratic lower bound from Step 1 yields

$$\inf_{\tilde{\pi}_{n}}\mathbb{E}_{\theta^{\star}\sim\rho}\,\mathbb{E}_{\mathcal{D}_{n}\mid\theta^{\star}}\!\left[J(\pi_{\theta^{\star}})-J(\tilde{\pi}_{n})\right]\geq\frac{C_{\mathrm{lb}}}{n}\,\mathbb{E}_{\theta^{\star}\sim\rho}\!\left[\mathrm{tr}\!\big(I(\theta^{\star})\,\Sigma_{D}(\theta^{\star})^{\dagger}\big)\right].$$

Thus the same trace functional controls both the achievable DPO guarantee and the information-theoretic limit, identifying it as the natural design criterion for comparison curation. We present a detailed proof in Appendix [C.3](https://arxiv.org/html/2606.19607#A3.SS3).

## 4 Numerical experiments

Our experiments proceed from controlled model\-based settings to realistic language\-model post\-training benchmarks\. The synthetic experiments isolate the theoretical mechanism in realizable tabular and contextual models\. The IMDb experiment introduces real LLM fine\-tuning while retaining a scalar proxy reward, allowing us to evaluate the reward–KL trade\-off under model mismatch\. The Anthropic\-HH experiment further evaluates preference quality without relying on an explicit reward function, using GPT\-4\.1 as an automatic judge\.

Synthetic setting. The synthetic experiments test the method in settings that exactly follow the theoretical model. Here the ground-truth reward is known to the experimenter, allowing us to directly evaluate whether the proposed design learns the unknown reward and the corresponding RLHF-optimal policy more efficiently.

In the tabular setting, each item corresponds to a candidate completion, so a policy is simply a probability distribution over the $d$ items. We generate a non-uniform reference policy $\pi_{0}\in\Delta^{d}$ by applying a softmax transformation to Gaussian logits, mimicking the highly non-uniform output distribution of a pre-trained or supervised-fine-tuned language model. The true reward vector $r^{\star}\in\mathbb{R}^{d}$ is generated independently with small variance, creating a low-signal regime in which item rewards are relatively close. We compare four pairwise-comparison designs under a fixed sample budget $n$: the oracle design $D^{\star}(\theta^{\star})$, the plug-in design $D^{\star}(\theta_{0})$, uniform sampling over all item pairs, and a heuristic that samples pairs by drawing two items without replacement according to $\pi_{0}$. For each design, we collect $n$ noisy Bradley–Terry comparisons between item pairs and solve the empirical DPO objective in the tabular policy class. This produces an estimated tabular policy $\hat{\pi}_{n}\in\Delta^{d}$, which we evaluate by the RLHF optimality gap $J(\pi^{\star})-J(\hat{\pi}_{n})$. The results in [Figure 1(a)](https://arxiv.org/html/2606.19607#S4.F1.sf1) show that $D^{\star}(\theta^{\star})$ and $D^{\star}(\theta_{0})$ achieve nearly identical and consistently small gaps, substantially outperforming both baselines. Uniform sampling improves as the budget grows but remains less sample-efficient, while the $\pi_{0}$-based heuristic performs poorly and shows little improvement. These results suggest that, when the reference policy is highly non-uniform and reward differences are weak, concentrating comparisons on high-$\pi_{0}$ items is ineffective; instead, information-guided comparison design yields much better downstream policy quality.
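The data-generating side of the tabular experiment can be sketched in a few lines. The block below is an illustrative reconstruction under stated assumptions, not the authors' code: the sizes `d` and `n`, the logit and reward scales, the value of `beta`, and the KL-regularized form of $J$ used in `rlhf_value` are all assumptions, and only the two baseline samplers are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, beta = 20, 500, 1.0                 # hypothetical sizes and regularization

# Non-uniform reference policy: softmax of Gaussian logits (scale is assumed).
logits = rng.normal(scale=2.0, size=d)
pi0 = np.exp(logits - logits.max())
pi0 /= pi0.sum()
r_star = rng.normal(scale=0.1, size=d)    # low-signal rewards (scale is assumed)

def sample_pairs_uniform(n):
    """n pairs of distinct items, uniform over item pairs."""
    return np.array([rng.choice(d, size=2, replace=False) for _ in range(n)])

def sample_pairs_pi0(n):
    """Heuristic: draw two distinct items according to pi0, without replacement."""
    return np.array([rng.choice(d, size=2, replace=False, p=pi0) for _ in range(n)])

def bt_labels(pairs):
    """Noisy Bradley-Terry labels; 1 means the first item of the pair is preferred."""
    diff = r_star[pairs[:, 0]] - r_star[pairs[:, 1]]
    return (rng.random(len(pairs)) < 1.0 / (1.0 + np.exp(-diff))).astype(int)

def rlhf_value(pi):
    """Assumed objective: J(pi) = E_pi[r*] - beta * KL(pi || pi0), for pi > 0."""
    return float(pi @ r_star - beta * np.sum(pi * np.log(pi / pi0)))

# Maximizer of the assumed objective: pi* proportional to pi0 * exp(r* / beta).
pi_star = pi0 * np.exp(r_star / beta)
pi_star /= pi_star.sum()

pairs_uniform = sample_pairs_uniform(n)
pairs_pi0 = sample_pairs_pi0(n)
labels_uniform = bt_labels(pairs_uniform)
```

With a highly non-uniform `pi0`, the heuristic sampler revisits the same few high-probability items, which is the behavior the experiment identifies as uninformative.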

Next, we study offline comparison curation in a linear contextual setting. Each prompt $x\in\mathbb{R}^{p}$ has a finite candidate set $\mathcal{A}(x)=\{y_{1},\ldots,y_{d}\}$, and each candidate has an action embedding $a\in\mathbb{R}^{p}$. We use the softmax policy class

$$\pi_{\theta}(y\mid x)\propto\exp\!\big(\theta^{\top}\phi(x,y)\big),\qquad\phi(x,y)=[x;\,a;\,x\odot a]\in\mathbb{R}^{3p},$$

where $\odot$ denotes elementwise multiplication. We generate a concentrated reference policy $\pi_{0}=\pi_{\theta_{0}}$ using a large $\|\theta_{0}\|$, and define a latent linear reward $r^{\star}(x,y)=\eta^{\top}\phi(x,y)+C(x)$, where the prompt-only term $C(x)$ cancels in within-prompt comparisons. Preference labels are generated from the Bradley–Terry model induced by $\theta^{\star}$. For each labeling budget $n$, we sample $n$ comparison pairs using one of four rules: the oracle design $D^{\star}(\theta^{\star})$, the plug-in design $D^{\star}(\theta_{0})$, uniform sampling over candidate pairs, and a $\pi_{0}$-weighted heuristic that samples candidates according to the reference policy. We fit the DPO estimator and evaluate the held-out RLHF optimality gap $J(\pi^{\star})-J(\hat{\pi}_{n})$, reporting mean and variability over Monte Carlo repetitions. [Figure 1(b)](https://arxiv.org/html/2606.19607#S4.F1.sf2) shows that the plug-in design $D^{\star}(\theta_{0})$ closely tracks the oracle design and achieves small RLHF gaps across budgets. Uniform sampling is less sample-efficient at small budgets, while the $\pi_{0}$-weighted heuristic performs poorly and has larger variance. This indicates that high-probability candidates under the reference policy are not necessarily the most informative comparisons; explicitly optimizing the comparison design better targets the directions that matter for downstream RLHF performance.
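The feature map and softmax policy of the linear contextual setting are straightforward to write down. The sketch below is a minimal illustration; the function names, the dimensions, and the Gaussian draws for the prompt, the embeddings, and the parameter are assumptions and do not reproduce the experiment's actual configuration.

```python
import numpy as np

def feature_map(x, A):
    """phi(x, y) = [x; a; x * a] for every candidate embedding a (rows of A)."""
    X = np.broadcast_to(x, A.shape)                 # repeat the prompt for each candidate
    return np.concatenate([X, A, X * A], axis=1)    # shape (d, 3p)

def policy(theta, x, A):
    """Softmax policy pi_theta(. | x) over the d candidates of prompt x."""
    logits = feature_map(x, A) @ theta
    logits = logits - logits.max()                  # numerical stabilization
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical sizes: p = 4 prompt dimensions, d = 6 candidates.
rng = np.random.default_rng(0)
p, d = 4, 6
x = rng.normal(size=p)
A = rng.normal(size=(d, p))
theta0 = 3.0 * rng.normal(size=3 * p)               # large norm gives a concentrated policy
phi = feature_map(x, A)
pi0_x = policy(theta0, x, A)
```

Because the first block of $\phi(x,y)$ depends only on the prompt, it is identical across candidates and cancels in within-prompt feature differences, in the same way as the prompt-only reward term $C(x)$.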

![Refer to caption](https://arxiv.org/html/2606.19607v1/fig/d100_tabular.png) (a) Tabular setting
![Refer to caption](https://arxiv.org/html/2606.19607v1/fig/contextual.png) (b) Linear contextual setting

Figure 1: Synthetic experiments

IMDb experiment. The IMDb experiment provides a more realistic language-model post-training testbed while retaining a well-defined proxy reward given by a sentiment classifier. This setting allows us to examine whether the design remains effective when the theoretical model is only an approximation of the actual LLM training process, and whether it improves the reward–KL trade-off relative to baseline comparison-selection rules.

In the IMDb experiment, we follow the DPO pipeline in Rafailov et al. ([2023](https://arxiv.org/html/2606.19607#bib.bib1)). We fine-tune GPT-2-large on the IMDb training split using SFT, and then use the resulting SFT model as the reference policy for DPO. We compare preference datasets constructed by our $D^{\ast}$-based method with benchmark selection rules in two curation tasks. These two tasks capture two natural decisions in preference-data collection: which prompts should be annotated, and, given a prompt, which candidate responses should be compared. In the prompt selection task, we generate two responses for each prompt in a candidate pool of $1{,}000$ prompts and select $n=175$ prompt-level comparisons for DPO training. The small budget $n=175$ places the experiment in a low-annotation regime, where the value of selecting informative comparisons is most pronounced. The benchmark selects the first $175$ comparisons, whereas $D^{\ast}$ computes design weights over all candidate comparisons and samples without replacement. In the response selection task, we consider a candidate pool of $175$ prompts. We generate $d=8$ candidate responses for each prompt and select one within-prompt response pair. The benchmark compares two arbitrary responses, whereas $D^{\ast}$ samples a pair according to the normalized within-prompt design weights. We evaluate the trained policies using the reward–KL frontier as in Rafailov et al. ([2023](https://arxiv.org/html/2606.19607#bib.bib1)). Each point corresponds to the final checkpoint of a DPO run under a specific value of $\beta$, and error bars report Monte Carlo standard errors. As shown in [Figures 2(b)](https://arxiv.org/html/2606.19607#S4.F2.sf2) and [2(a)](https://arxiv.org/html/2606.19607#S4.F2.sf1), $D^{\ast}$-curated data consistently improves the reward–KL trade-off relative to the benchmark in both curation tasks.
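The prompt selection step, which draws a fixed number of comparisons without replacement according to design weights, can be sketched as follows. This is an illustration only: the weights below are random placeholders standing in for the optimized design weights, and the function name and seed are hypothetical.

```python
import numpy as np

def sample_without_replacement(weights, n, seed=0):
    """Draw n distinct candidate indices with probabilities proportional to weights."""
    w = np.asarray(weights, dtype=float)
    if n > np.count_nonzero(w):
        raise ValueError("need at least n candidates with positive weight")
    rng = np.random.default_rng(seed)
    return rng.choice(len(w), size=n, replace=False, p=w / w.sum())

# Pool of 1,000 candidate comparisons and a budget of 175, as in the prompt selection task.
pool_size, budget = 1000, 175
placeholder_weights = np.random.default_rng(1).random(pool_size)   # NOT real design weights
selected = sample_without_replacement(placeholder_weights, budget)
benchmark = np.arange(budget)        # benchmark rule: the first 175 comparisons
```

For the response selection task the same call with `n=1` over the pairs of a single prompt would draw one within-prompt pair from the normalized weights.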

![Refer to caption](https://arxiv.org/html/2606.19607v1/fig/frontier_completion_M150_D8_mc80.png) (a) Response selection task
![Refer to caption](https://arxiv.org/html/2606.19607v1/fig/frontier_prompt_M1000_N175_mc30.png) (b) Prompt selection task

Figure 2: IMDb experiments with GPT-2-large

Anthropic HH experiment. The Anthropic-HH experiment further removes the availability of a clear scalar reward function. Instead of evaluating against an explicit reward model, we use GPT-4.1 as an automatic judge to compare model outputs and report pairwise win rates. This setting tests whether, under the same labeling budget, the proposed comparison design leads to responses that are more often preferred in a direct preference evaluation.

In this Anthropic HH experiment, we follow the DPO pipeline of Rafailov et al. ([2023](https://arxiv.org/html/2606.19607#bib.bib1)). We use the default Anthropic-HH train/test splits and first train a Pythia-2.8B SFT model on the chosen responses in the training split. Starting from this SFT model, we train DPO models using preference datasets constructed either by our $D^{\ast}$-based method or by a benchmark rule. The candidate pool consists of $160{,}800$ preference pairs from the HH training split. For each budget $n$, the benchmark uses $n$ arbitrary preference pairs, while our method samples $n$ pairs without replacement from the optimized $D^{\ast}$ design distribution. Details on feature construction, the trace-design optimization, and hyperparameters are provided in Appendix [F.2](https://arxiv.org/html/2606.19607#A6.SS2). We evaluate the trained policies on prompts from the Anthropic-HH test split. Following the evaluation setup of Rafailov et al. ([2023](https://arxiv.org/html/2606.19607#bib.bib1)), we generate responses at sampling temperatures $0.25$, $0.7$, and $1.0$, and use GPT-4.1 to compare each generated response against the corresponding HH chosen response. [Figure 3](https://arxiv.org/html/2606.19607#S4.F3) reports the win rate for two representative budgets, $n=80{,}400$ and $n=96{,}480$, corresponding to $50\%$ and $60\%$ of the candidate pool. Across these budgets and all sampling temperatures, the $D^{\ast}$-curated datasets outperform the benchmark. Similar improvements are observed across the other budgets we tested.

![Refer to caption](https://arxiv.org/html/2606.19607v1/fig/hh_trace_dstar_fullpool_N80400_paper.png) (a) Sample budget $N=80{,}400$
![Refer to caption](https://arxiv.org/html/2606.19607v1/fig/hh_trace_dstar_fullpool_N96480_paper.png) (b) Sample budget $N=96{,}480$

Figure 3: Anthropic HH experiment

## 5 Discussion and Limitations

In this paper, we study offline comparison curation for DPO under a fixed labeling budget\. Our analysis characterizes the effect of curation on downstream RLHF performance through a single design\-dependent information object, yielding an explicit optimization criterion for selecting informative comparisons\. We prove finite\-sample upper bounds and complementary information\-theoretic lower bounds, and our synthetic and LLM experiments show that the resulting plug\-in designs improve sample efficiency over common heuristics\. Our work has several limitations\. The theory relies on realizability, regularity, and coverage assumptions that may only hold approximately for large neural policies\. We also focus on offline randomized designs, where candidate completions are generated before labeling and the design does not adapt to observed feedback\.

## References

- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3–4), pp. 324–345.
- S. R. Chowdhury, A. Kini, and N. Natarajan (2024) Provably robust DPO: aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409.
- P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems.
- N. Das, S. Chakraborty, A. Pacchiano, and S. R. Chowdhury (2025) Active preference optimization for sample efficient RLHF. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 96–112.
- Y. Feng, A. Kwiatkowski, K. Zheng, J. Kempe, and Y. Duan (2025) PILAF: optimal human preference sampling for reward modeling. arXiv preprint arXiv:2502.04270.
- M. E. Glickman and S. T. Jensen (2005) Adaptive paired comparison design. Journal of Statistical Planning and Inference 127(1–2), pp. 279–293.
- U. Graßhoff and R. Schwabe (2008) Optimal design for the Bradley–Terry paired comparison model. Statistical Methods & Applications 17(3), pp. 275–289. DOI: [10.1007/s10260-007-0058-4](https://dx.doi.org/10.1007/s10260-007-0058-4).
- Y. Guo, P. Tian, J. Kalpathy-Cramer, S. Ostmo, J. P. Campbell, M. F. Chiang, D. Erdogmus, J. Dy, and S. Ioannidis (2018) Experimental design under the Bradley–Terry model. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 2198–2204. DOI: [10.24963/ijcai.2018/304](https://dx.doi.org/10.24963/ijcai.2018/304).
- K. Ji, J. He, and Q. Gu (2024) Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401.
- K. R. Kim, Y. Bai, C. Wang, and G. Chen (2025) Understanding the impact of sampling quality in direct preference optimization. arXiv preprint arXiv:2506.04272.
- B. Kveton, X. Li, J. McAuley, R. Rossi, J. Shang, J. Wu, and T. Yu (2025) Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076.
- X. Lin, A. Verma, Z. Dai, D. Rus, S. Ng, and B. K. H. Low (2025) ActiveDPO: active direct preference optimization for sample-efficient alignment. arXiv preprint arXiv:2505.19241.
- S. Mukherjee, A. Lalitha, K. Kalantari, A. Deshmukh, G. Liu, Y. Ma, and B. Kveton (2024) Optimal design for human preference elicitation. Advances in Neural Information Processing Systems 37, pp. 90132–90159.
- W. Muldrew, P. Hayes, M. Zhang, and D. Barber (2024) Active preference learning for large language models. arXiv preprint arXiv:2402.08114.
- W. K. Newey and D. McFadden (1994) Large sample estimation and hypothesis testing. Handbook of Econometrics 4, pp. 2111–2245.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- Y. Pan, Z. Cai, G. Chen, H. Zhong, and C. Wang (2025) What matters in data for DPO? arXiv preprint arXiv:2508.18312.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§F\.1](https://arxiv.org/html/2606.19607#A6.SS1.SSS0.Px1.p1.1),[§F\.1](https://arxiv.org/html/2606.19607#A6.SS1.SSS0.Px2.p1.8),[§F\.1](https://arxiv.org/html/2606.19607#A6.SS1.SSS0.Px2.p1.9),[§F\.2](https://arxiv.org/html/2606.19607#A6.SS2.SSS0.Px1.p1.4),[§1\.1](https://arxiv.org/html/2606.19607#S1.SS1.p2.1),[§1](https://arxiv.org/html/2606.19607#S1.p1.1),[§2](https://arxiv.org/html/2606.19607#S2.SS0.SSS0.Px1.p3.3),[§2](https://arxiv.org/html/2606.19607#S2.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.19607#S4.p6.11),[§4](https://arxiv.org/html/2606.19607#S4.p8.14)\.
- A\. Scheid, E\. Boursier, A\. Durmus, M\. I\. Jordan, P\. Ménard, E\. Moulines, and M\. Valko \(2024\)Optimal design for reward modeling in rlhf\.arXiv preprint arXiv:2410\.17055\.Cited by:[§1\.1](https://arxiv.org/html/2606.19607#S1.SS1.p5.1)\.
- N\. B\. Shah, S\. Balakrishnan, J\. Bradley, A\. Parekh, K\. Ramchandran, and M\. J\. Wainwright \(2016\)Estimation from pairwise comparisons: sharp minimax bounds with topology dependence\.Journal of Machine Learning Research17\(58\),pp\. 1–47\.Cited by:[§1\.1](https://arxiv.org/html/2606.19607#S1.SS1.p5.1),[§2\.1](https://arxiv.org/html/2606.19607#S2.SS1.p3.1),[§2\.1](https://arxiv.org/html/2606.19607#S2.SS1.p7.1),[Remark 5](https://arxiv.org/html/2606.19607#Thmremark5.p3.5)\.
- N\. Stiennon, L\. Ouyang, J\. Wu, D\. M\. Ziegler, R\. Lowe, C\. Voss, A\. Radford, D\. Amodei, and P\. Christiano \(2020\)Learning to summarize from human feedback\.Advances in Neural Information Processing Systems\.Cited by:[§2](https://arxiv.org/html/2606.19607#S2.SS0.SSS0.Px2.p1.1)\.
- J\. A\. Tropp \(2012\)User\-friendly tail bounds for sums of random matrices\.Foundations of Computational Mathematics12\(4\),pp\. 389–434\.Cited by:[§C\.2](https://arxiv.org/html/2606.19607#A3.SS2.2.p2.2),[§C\.2\.2](https://arxiv.org/html/2606.19607#A3.SS2.SSS2.Px2.2.p2.8)\.
- A\. W\. Van der Vaart \(2000\)Asymptotic statistics\.Vol\.3,Cambridge university press\.Cited by:[§2\.1](https://arxiv.org/html/2606.19607#S2.SS1.p11.1)\.
- T\. Xie, D\. Foster, A\. Krishnamurthy, C\. Rosset, A\. H\. Awadallah, and A\. Rakhlin \(2025\)Exploratory preference optimization: harnessing implicit q\*\-approximation for sample\-efficient rlhf\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 43632–43669\.Cited by:[§1\.1](https://arxiv.org/html/2606.19607#S1.SS1.p3.1)\.

## Appendix A Technical appendices and supplementary material

### A.1 Motivation and interpretation for [Section 2](https://arxiv.org/html/2606.19607#S2)

### A.2 Motivation and interpretation of regularity assumptions

In this subsection, we list all necessary assumptions in our analysis\.

##### Identifiability and realizability\.

Our first assumption concerns identifiability of the optimal policy parameter $\theta^{\star}$ from pairwise comparison data. In our softmax parameterization [(6)](https://arxiv.org/html/2606.19607#S2.E6), the map $\theta\mapsto\pi_{\theta}$ is not necessarily injective: different parameter vectors may induce the same policy. For example, adding a prompt-dependent constant to all logits $\{f_{\theta}(\phi(x,y))\}_{y\in\mathcal{A}(x)}$ leaves $\pi_{\theta}(\cdot\mid x)$ unchanged, so directions that only shift logits uniformly within each prompt are intrinsically unidentifiable.
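The shift invariance described above can be checked numerically. The following is a minimal sketch, not part of the paper's analysis: the logit values and the shift are hypothetical, and `softmax` is a helper defined here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits f_theta(phi(x, y)) for three candidate completions of one prompt.
logits = [0.3, -1.2, 2.0]
shift = 5.7  # hypothetical prompt-dependent constant added to every logit

base = softmax(logits)
shifted = softmax([z + shift for z in logits])

# The two policies agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(base, shifted))
print(max_diff)
```

Because the shift cancels in the softmax normalization, no amount of data generated from $\pi_{\theta}$ can distinguish the two logit vectors.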

More generally, each labeled edge $e=(x,y^{+},y^{-})$ enters DPO only through the scalar logit $u_{\theta}(e)$, i.e., the implicit-reward difference between $y^{+}$ and $y^{-}$. This motivates the *pairwise sensitivity vector*

$$g(e;\theta)\doteq\nabla_{\theta}u_{\theta}(e),$$

which captures how a comparison changes under infinitesimal perturbations of $\theta$. If a direction $v$ satisfies $v^{\top}g(e;\theta^{\star})=0$ for all queried edges $e$, then perturbing $\theta$ along $v$ does not change any logit difference that appears in the data (to first order). Since the DPO objective depends on $\theta$ only through these logit differences, pairwise comparisons cannot identify such directions.
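As an illustration of an unidentifiable direction, the sketch below assumes a linear score $f_{\theta}(\phi)=\theta^{\top}\phi$ and drops the scale $\beta$ and the reference-policy term (neither affects which directions are orthogonal to $g$). Under this assumption $g(e;\theta)=\phi(x,y^{+})-\phi(x,y^{-})$. The feature values, edge list, and helper names are hypothetical; the third feature coordinate is chosen to depend on the prompt only.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def logit_diff(theta, phi_plus, phi_minus):
    """u_theta(e) for the assumed linear score f_theta(phi) = theta . phi."""
    return dot(theta, phi_plus) - dot(theta, phi_minus)

def sensitivity(phi_plus, phi_minus):
    """g(e; theta) = grad_theta u_theta(e); for a linear score this is phi+ - phi-."""
    return [p - m for p, m in zip(phi_plus, phi_minus)]

# Hypothetical features: the third coordinate is identical across completions.
features = {
    "y1": [1.0, 0.0, 0.5],
    "y2": [0.0, 1.0, 0.5],
    "y3": [2.0, -1.0, 0.5],
}
edges = [("y1", "y2"), ("y2", "y3"), ("y1", "y3")]

theta = [0.4, -0.7, 1.3]
v = [0.0, 0.0, 1.0]  # direction along the prompt-only coordinate
theta_shifted = [t + 3.0 * vi for t, vi in zip(theta, v)]

# v is orthogonal to every sensitivity vector ...
inner_products = [dot(v, sensitivity(features[a], features[b])) for a, b in edges]
# ... so moving theta along v changes no logit difference seen in the data.
changes = [
    logit_diff(theta_shifted, features[a], features[b])
    - logit_diff(theta, features[a], features[b])
    for a, b in edges
]
print(inner_products, changes)
```

In the linear case the invariance is exact rather than first-order, which is why the changes vanish even for a large step along $v$.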

To focus on directions that are learnable from comparisons, we restrict our analysis to the identifiable subspace $H$ spanned by $\{g(e;\theta^{\star})\}$ (defined formally in [Definition 3](https://arxiv.org/html/2606.19607#Thmdefinition3)), where the queried comparisons are informative.

###### Definition 3 (Identifiable tangent space $H$).

We define the identifiable tangent space at $\theta^{\star}$ by

$$H\doteq\operatorname{span}\bigl\{g(e;\theta^{\star}):e=(x,y^{+},y^{-})\in\mathcal{E}\bigr\}\;\subseteq\;\mathbb{R}^{p}.$$

Throughout the analysis, we restrict attention to the identifiable subspace $H$. We also assume optimal realizability: there exists $\theta^{\star}\in\Theta$ such that $\pi_{\theta^{\star}}=\pi^{\star}$. Under realizability, the implicit reward $r_{\theta^{\star}}$ induced by the optimal policy $\pi^{\star}$ matches the true reward up to a prompt-only offset. That is, there exists a function $c:\mathcal{X}\to\mathbb{R}$ such that

$$r_{\theta^{\star}}(x,y)=r^{\star}(x,y)+c(x),\qquad\forall(x,y).$$

Consequently, under realizability, in the BT model, using the implicit reward $r_{\theta^{\star}}$ to generate pairwise preference labels is equivalent to using the true reward $r^{\star}$: for any within-prompt comparison $e=(x,y^{+},y^{-})$,

$$u_{\theta^{\star}}(e)=r_{\theta^{\star}}(x,y^{+})-r_{\theta^{\star}}(x,y^{-})=r^{\star}(x,y^{+})-r^{\star}(x,y^{-}),$$

since the prompt-only offset $c(x)$ cancels in the difference.

We summarize our requirements about the parameter set $\Theta$ in the following assumption.

###### Assumption [1](https://arxiv.org/html/2606.19607#Thmassumption1) (Identifiability and realizability).

Suppose $\Theta$ is convex, compact, and satisfies

$$\Theta\;\subseteq\;\theta^{\star}+H.$$

##### Boundedness and smoothness\.

As for the embedding vector $\phi$ and feature function $f_{\theta}$, we impose the following boundedness and smoothness assumption.

###### Assumption [2](https://arxiv.org/html/2606.19607#Thmassumption2) (Boundedness and smoothness).

There exist constants $R_{\phi},\alpha_{0},\alpha_{1},\alpha_{2},\alpha_{3}<\infty$ such that for all $\theta\in\Theta$ and all admissible $(x,y)$,

$$\|\phi(x,y)\|_{2}\leq R_{\phi},\qquad|f_{\theta}(\phi(x,y))|\leq\alpha_{0},$$

$$\|\nabla_{\theta}f_{\theta}(\phi(x,y))\|_{2}\leq\alpha_{1},\qquad\|\nabla_{\theta}^{2}f_{\theta}(\phi(x,y))\|_{\text{op}}\leq\alpha_{2},\qquad\|\nabla_{\theta}^{3}f_{\theta}(\phi(x,y))\|_{\text{op}}\leq\alpha_{3}.$$

Moreover, for each fixed $(x,y)$, the map $\theta\mapsto f_{\theta}(\phi(x,y))$ is three times continuously differentiable on $\Theta$.

##### Feature separation\.

To obtain meaningful learning guarantees from pairwise comparisons, namely that the DPO solution achieves a vanishing RLHF optimality gap as the comparison budget $n$ grows, we need the candidate completion set $\mathcal{A}(x)$ to be sufficiently informative. Intuitively, if for a given prompt $x$ the candidate pool contains only near-duplicate completions (or completions that are indistinguishable under the model features), then comparing them provides little information about how the policy should change, and certain parameter directions cannot be learned no matter how many comparisons we collect. To rule out such degenerate cases, we impose a mild diversity condition on the candidate set: for each prompt, $\mathcal{A}(x)$ should contain at least two completions that are sufficiently different in their model-induced features (in every identifiable direction). This is formalized by the following feature-separation assumption.

###### Assumption [3](https://arxiv.org/html/2606.19607#Thmassumption3) (Feature separation on $H$).

There exists $\Delta_{g}>0$ such that for every $\theta\in\Theta$, every prompt $x$, and every unit vector $v\in H$, there exist two candidates $y_{1},y_{2}\in\mathcal{A}(x)$ satisfying

$$\Big|v^{\top}\big(\nabla_{\theta}f_{\theta}(\phi(x,y_{1}))-\nabla_{\theta}f_{\theta}(\phi(x,y_{2}))\big)\Big|\;\geq\;\Delta_{g}.$$

##### Coverage condition\.

Even with a diverse candidate pool, a meaningful guarantee further requires that the sampling design $D$ does not ignore informative pairs. Intuitively, if $D$ concentrates on only a small subset of within-prompt pairs (or repeatedly compares near-duplicate completions), then the resulting data provide little information about some identifiable directions, and the optimality gap cannot be driven down no matter how large $n$ is.

Once each comparison edge $e$ is mapped to a pairwise sensitivity vector $g(e;\theta)$, a sampling design $D$ over edges naturally induces a second-moment matrix that summarizes how informative the chosen comparisons are about the parameter. In particular, the outer product $g(e;\theta)g(e;\theta)^{\top}$ captures the rank-one curvature/information contribution of a single queried pair, and averaging this contribution over the edge distribution $D$ yields the design covariance matrix. We define, for any $\theta\in\Theta$,

$$\Sigma_{D}(\theta)\doteq\mathbb{E}_{e\sim D}\!\left[g(e;\theta)\,g(e;\theta)^{\top}\right],$$

which will be the key object connecting pair selection to estimation error and ultimately to our design objective.

###### Assumption [4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) (Design coverage at the truth).

We assume that $\Sigma_{D}(\theta^{\star})$ has a spectral gap on the identifiable subspace $H$, i.e., there exists $\mu_{\star}>0$ such that

$$v^{\top}\Sigma_{D}(\theta^{\star})v\;\geq\;\mu_{\star}\,\|v\|_{2}^{2},\qquad\forall\,v\in H.$$
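To see how the coverage condition separates good designs from poor ones, the following minimal sketch computes the design covariance $\Sigma_{D}$ and its smallest eigenvalue for two designs over three edges. It assumes two-dimensional sensitivity vectors that span $H=\mathbb{R}^{2}$; the vectors, the two designs, and the helper names are hypothetical.

```python
import math

def design_covariance(vectors, probs):
    """Sigma_D = sum_e D(e) g(e) g(e)^T for 2-D sensitivity vectors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for g, p in zip(vectors, probs):
        for i in range(2):
            for j in range(2):
                s[i][j] += p * g[i] * g[j]
    return s

def min_eig_2x2(s):
    """Smallest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    tr = s[0][0] + s[1][1]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return tr / 2.0 - math.sqrt(max(tr * tr / 4.0 - det, 0.0))

# Hypothetical sensitivity vectors; the third is nearly parallel to the first.
g_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01]]

balanced = [0.5, 0.5, 0.0]      # spreads mass over two orthogonal directions
concentrated = [0.5, 0.0, 0.5]  # compares only near-duplicate directions

mu_balanced = min_eig_2x2(design_covariance(g_vectors, balanced))
mu_concentrated = min_eig_2x2(design_covariance(g_vectors, concentrated))
print(mu_balanced, mu_concentrated)
```

Both designs satisfy the assumption in the strict sense (the spectral gap is positive), but the concentrated design has a gap several orders of magnitude smaller, which is the regime the coverage discussion warns about.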

In some arguments, we require the above assumption to hold uniformly over a neighborhood of $\theta^{\star}$. Accordingly, we introduce the following localized version. Whenever invoked, the region $\mathcal{R}\subseteq\Theta$ will be specified (and may depend on the application), typically chosen as a neighborhood of $\theta^{\star}$.

###### Assumption [4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a) (Uniform design coverage).

Fix a region $\mathcal{R}\subseteq\Theta$. We assume that $\Sigma_{D}(\theta)$ has a uniform spectral gap on $H$ over $\mathcal{R}$, i.e., there exists $\mu_{\mathcal{R}}>0$ such that

$$v^{\top}\Sigma_{D}(\theta)v\;\geq\;\mu_{\mathcal{R}}\,\|v\|_{2}^{2},\qquad\forall\,v\in H,\ \forall\,\theta\in\mathcal{R}.$$

##### Optimization landscape\.

Our next assumption concerns the optimization landscape of the DPO risk. When $f_{\theta}$ is linear in the embedding (e.g., the linear contextual case), the DPO objective is convex in $\theta$; moreover, under standard design coverage conditions the population and empirical risks are strongly convex on the identifiable space and admit a unique minimizer. In contrast, for general nonlinear $f_{\theta}$ (e.g., a neural network score model), both the population risk $L(\theta)$ and the empirical risk $L_{n}(\theta)$ can be nonconvex, and multiple global minimizers may exist.

In our setting, the target parameter $\theta^{\star}$ is defined a priori by the RLHF objective via realizability, i.e., $\pi_{\theta^{\star}}=\pi^{\star}$ where $\pi^{\star}\in\arg\max_{\pi}J(\pi)$. In later analysis we show that, under realizability and well-specified preference labels, this same $\theta^{\star}$ is also a global minimizer of the population DPO risk. To avoid ambiguity when the population DPO risk admits multiple minimizers, we impose the following uniqueness assumption.

###### Assumption [5](https://arxiv.org/html/2606.19607#Thmassumption5) (Unique population minimizer).

The population DPO risk $L(\theta)$ has a unique global minimizer over $\Theta$.

##### Prior regularity\.

Our final assumption introduces a Bayesian formulation for the optimal parameter $\theta^{\star}$. When prior information about $\theta^{\star}$ is available, our analysis yields an information-theoretic (Bayesian) lower bound on the RLHF optimality gap under an arbitrary sampling design $D$.

###### Assumption [6](https://arxiv.org/html/2606.19607#Thmassumption6) (Prior regularity).

Suppose the optimal parameter $\theta^{\star}$ follows a prior density $\rho$ supported on $\Theta$. Moreover, $\rho$ satisfies:

1. $\rho\in C^{1}(\Theta)$ and $\rho(\theta)>0$ for all $\theta\in\mathrm{int}(\Theta)$;
2. $\rho(\theta)=0$ for all $\theta\in\partial\Theta$;
3. The prior Fisher information is finite: $\int_{\Theta}\|\nabla\log\rho(\theta)\|_{2}^{2}\,\rho(\theta)\,d\theta<\infty.$
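For intuition, here is a worked one-dimensional example of a density meeting all three conditions. It is an illustration only, under the hypothetical choice $\Theta=[-1,1]$, and is not claimed to be the prior used in the paper.

```latex
% Hypothetical example: Theta = [-1, 1], raised-cosine prior.
\rho(\theta) = \cos^{2}\!\Big(\frac{\pi\theta}{2}\Big), \qquad
\int_{-1}^{1} \cos^{2}\!\Big(\frac{\pi\theta}{2}\Big)\, d\theta
  = \int_{-1}^{1} \frac{1+\cos(\pi\theta)}{2}\, d\theta = 1 .

% Condition 1: rho is C^1 and strictly positive on (-1, 1).
% Condition 2: rho(-1) = rho(1) = cos^2(pi/2) = 0.
% Condition 3: the prior Fisher information is finite.
\nabla \log \rho(\theta) = -\pi \tan\!\Big(\frac{\pi\theta}{2}\Big), \qquad
\int_{-1}^{1} \pi^{2} \tan^{2}\!\Big(\frac{\pi\theta}{2}\Big)
  \cos^{2}\!\Big(\frac{\pi\theta}{2}\Big)\, d\theta
  = \pi^{2} \int_{-1}^{1} \sin^{2}\!\Big(\frac{\pi\theta}{2}\Big)\, d\theta
  = \pi^{2} .
```

The density vanishes at the boundary fast enough that $\|\nabla\log\rho\|^{2}\rho$ stays integrable even though $\nabla\log\rho$ itself blows up there.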

## Appendix B Preliminary on information theory

In this section, we provide more interpretations for [Definition 2](https://arxiv.org/html/2606.19607#Thmdefinition2).

## Appendix C Proofs of [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1)

In this section, we prove our main result [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1) following the road map in [Section 3.2](https://arxiv.org/html/2606.19607#S3.SS2). Specifically, statement (i) in [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1) is proved in [Theorem 3](https://arxiv.org/html/2606.19607#Thmtheorem3), and statement (ii) in [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1) is proved in [Theorem 4](https://arxiv.org/html/2606.19607#Thmtheorem4).

### C.1 Step 1: Quadratic sandwich of the RLHF optimality gap

We first express the RLHF optimality gap exactly as a reverse KL divergence to the optimal policy.

###### Lemma 1 (RLHF gap as reverse KL).

Under [Assumption 1](https://arxiv.org/html/2606.19607#Thmassumption1), for any policy $\pi$,

$$J(\pi^{\star})-J(\pi)=\beta\,\mathbb{E}_{x}\!\left[\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi^{\star}(\cdot\mid x)\big)\right].$$

###### Proof of [Lemma 1](https://arxiv.org/html/2606.19607#Thmlemma1).

Fix $x$. By ([2](https://arxiv.org/html/2606.19607#S2.E2)),

$$\log\pi^{\star}(y\mid x)=\log\pi_{0}(y\mid x)+\frac{r^{\star}(x,y)}{\beta}-\log Z(x).$$

Hence

$$\begin{aligned}
\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi^{\star}(\cdot\mid x)\big)
&=\sum_{y}\pi(y\mid x)\log\frac{\pi(y\mid x)}{\pi^{\star}(y\mid x)}\\
&=\sum_{y}\pi(y\mid x)\log\frac{\pi(y\mid x)}{\pi_{0}(y\mid x)}-\frac{1}{\beta}\sum_{y}\pi(y\mid x)r^{\star}(x,y)+\log Z(x)\\
&=\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{0}(\cdot\mid x)\big)-\frac{1}{\beta}\,\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r^{\star}(x,y)]+\log Z(x).
\end{aligned}$$

Rearranging gives

$$\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r^{\star}(x,y)]-\beta\,\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{0}(\cdot\mid x)\big)=\beta\log Z(x)-\beta\,\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi^{\star}(\cdot\mid x)\big).$$

Now take expectation over $x$:

$$J(\pi)=\beta\,\mathbb{E}_{x}[\log Z(x)]-\beta\,\mathbb{E}_{x}\!\left[\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi^{\star}(\cdot\mid x)\big)\right].$$

Setting $\pi=\pi^{\star}$ yields

$$J(\pi^{\star})=\beta\,\mathbb{E}_{x}[\log Z(x)].$$

Subtracting gives

$$J(\pi^{\star})-J(\pi)=\beta\,\mathbb{E}_{x}\!\left[\text{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi^{\star}(\cdot\mid x)\big)\right].$$

∎
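The identity in Lemma 1 is easy to verify numerically for a single prompt with a finite completion set. The sketch below assumes the KL-regularized objective $J(\pi)=\mathbb{E}_{y\sim\pi}[r^{\star}]-\beta\,\text{KL}(\pi\,\|\,\pi_{0})$ that the proof rearranges; the reference policy, rewards, and comparison policy are hypothetical values chosen for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def objective(pi, pi0, reward, beta):
    """J(pi) = E_{y~pi}[r(y)] - beta * KL(pi || pi0) for one fixed prompt."""
    return sum(p * r for p, r in zip(pi, reward)) - beta * kl(pi, pi0)

beta = 0.5
pi0 = [0.5, 0.3, 0.2]      # hypothetical reference policy
reward = [1.0, 0.2, -0.4]  # hypothetical true rewards r*(x, y)

# Optimal policy: pi*(y) = pi0(y) * exp(r(y) / beta) / Z.
weights = [p * math.exp(r / beta) for p, r in zip(pi0, reward)]
Z = sum(weights)
pi_star = [w / Z for w in weights]

pi = [0.2, 0.5, 0.3]       # an arbitrary policy to compare against pi*
gap = objective(pi_star, pi0, reward, beta) - objective(pi, pi0, reward, beta)
scaled_kl = beta * kl(pi, pi_star)
print(gap, scaled_kl)
```

The two printed numbers agree up to rounding, and $J(\pi^{\star})$ equals $\beta\log Z$, matching the intermediate step of the proof.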

Hence, bounding the optimality gap reduces to bounding a KL divergence. Define

$$F(\theta)\doteq\beta\,\mathbb{E}_{x}\!\left[\text{KL}\!\big(\pi_{\theta}(\cdot\mid x)\,\|\,\pi_{\theta^{\star}}(\cdot\mid x)\big)\right].\tag{15}$$

###### Lemma 2 (Smoothness and Hessian identity).

Suppose [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2) holds. Then $F\in C^{2}(\Theta)$, $\nabla F(\theta^{\star})=0$, and

$$\nabla^{2}F(\theta)=\beta I(\theta),\qquad\forall\theta\in\Theta.\tag{16}$$

In particular,

$$\nabla^{2}F(\theta^{\star})=\beta I(\theta^{\star}).$$

###### Proof of [Lemma 2](https://arxiv.org/html/2606.19607#Thmlemma2).

Since $\mathcal{Y}(x)$ is finite and $f_{\theta}$ has bounded first and second derivatives in $\theta$ ([Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2)), $\log\pi_{\theta}(y\mid x)$ is twice differentiable and differentiation can be interchanged with the finite sums defining $\pi_{\theta}$ and with $\mathbb{E}_{x}$. A standard score-trick argument yields $\nabla F(\theta^{\star})=0$ and the Fisher representation $\nabla^{2}F(\theta^{\star})=\beta I(\theta^{\star})$. ∎

By [Lemmas 1](https://arxiv.org/html/2606.19607#Thmlemma1) and [2](https://arxiv.org/html/2606.19607#Thmlemma2), the RLHF optimality gap can be written as $F(\theta)$, a (scaled) reverse-KL divergence to $\pi_{\theta^{\star}}$, whose curvature is governed by the policy Fisher information:

$$\nabla^{2}F(\theta)=\beta I(\theta),\qquad\forall\theta\in\Theta.$$

In particular, in a neighborhood of $\theta^{\star}$, changes in the gap are controlled by the quadratic form induced by $I(\theta^{\star})$.

###### Lemma 3 (Global Fisher sandwich).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [3](https://arxiv.org/html/2606.19607#Thmassumption3) hold. Then there exists a constant $\underline{\mu}>0$, depending only on the fixed constants in the standing assumptions, such that for all $\theta\in\Theta$,

$$\underline{\mu}\,I\preceq I(\theta)\preceq 4\alpha_{1}^{2}\,I,\tag{17}$$

where $I$ without an argument denotes the identity on $H$. Moreover, with $m_{I}\doteq\frac{\underline{\mu}}{4\alpha_{1}^{2}}$ and $M_{I}\doteq\frac{4\alpha_{1}^{2}}{\underline{\mu}}$, we have

$$m_{I}\,I(\theta^{\star})\;\preceq\;I(\theta)\;\preceq\;M_{I}\,I(\theta^{\star}),\qquad\forall\theta\in\Theta.\tag{18}$$

Equivalently, by ([16](https://arxiv.org/html/2606.19607#A3.E16)),

$$m_{I}\,\nabla^{2}F(\theta^{\star})\;\preceq\;\nabla^{2}F(\theta)\;\preceq\;M_{I}\,\nabla^{2}F(\theta^{\star}),\qquad\forall\theta\in\Theta.\tag{19}$$

###### Proof of [Lemma 3](https://arxiv.org/html/2606.19607#Thmlemma3).

Fix $(x,\theta)$. Write

$$g_{\theta}(x,y)\doteq\nabla_{\theta}f_{\theta}(\phi(x,y)),\qquad\bar{g}_{\theta}(x)\doteq\mathbb{E}_{y^{\prime}\sim\pi_{\theta}(\cdot\mid x)}[g_{\theta}(x,y^{\prime})].$$

By softmax calculus,

$$s_{\theta}(y\mid x)=\nabla_{\theta}\log\pi_{\theta}(y\mid x)=g_{\theta}(x,y)-\bar{g}_{\theta}(x).$$

Hence

$$\|s_{\theta}(y\mid x)\|_{2}\leq\|g_{\theta}(x,y)\|_{2}+\|\bar{g}_{\theta}(x)\|_{2}\leq\alpha_{1}+\alpha_{1}=2\alpha_{1},$$

where [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2) was used. Therefore, for any $v\in H$ with $\|v\|_{2}=1$,

$$v^{\top}I(\theta)v=\mathbb{E}_{x}\mathbb{E}_{y\sim\pi_{\theta}}\big[(v^{\top}s_{\theta}(y\mid x))^{2}\big]\leq\mathbb{E}_{x}\mathbb{E}_{y}\big[\|s_{\theta}(y\mid x)\|_{2}^{2}\big]\leq 4\alpha_{1}^{2}.$$

So

$$I(\theta)\preceq 4\alpha_{1}^{2}I,\qquad\forall\theta\in\Theta.$$

By [Lemma 14](https://arxiv.org/html/2606.19607#Thmlemma14), which relies on [Assumptions 3](https://arxiv.org/html/2606.19607#Thmassumption3) and [2](https://arxiv.org/html/2606.19607#Thmassumption2),

$$I(\theta)\succeq\underline{\mu}I,\qquad\forall\theta\in\Theta.$$

This proves ([17](https://arxiv.org/html/2606.19607#A3.E17)). In particular, at $\theta^{\star}$,

$$\underline{\mu}\,I\preceq I(\theta^{\star})\preceq 4\alpha_{1}^{2}\,I.$$

For any $u\in H$,

$$u^{\top}I(\theta)u\geq\underline{\mu}\,\|u\|_{2}^{2},\qquad u^{\top}I(\theta^{\star})u\leq 4\alpha_{1}^{2}\,\|u\|_{2}^{2}.$$

Hence

$$u^{\top}I(\theta)u\geq\frac{\underline{\mu}}{4\alpha_{1}^{2}}\,u^{\top}I(\theta^{\star})u=:m_{I}\,u^{\top}I(\theta^{\star})u.$$

Similarly,

$$u^{\top}I(\theta)u\leq 4\alpha_{1}^{2}\,\|u\|_{2}^{2},\qquad u^{\top}I(\theta^{\star})u\geq\underline{\mu}\,\|u\|_{2}^{2},$$

so

$$u^{\top}I(\theta)u\leq\frac{4\alpha_{1}^{2}}{\underline{\mu}}\,u^{\top}I(\theta^{\star})u=:M_{I}\,u^{\top}I(\theta^{\star})u.$$

Therefore, for all $\theta\in\Theta$,

$$m_{I}\,I(\theta^{\star})\;\preceq\;I(\theta)\;\preceq\;M_{I}\,I(\theta^{\star}),$$

with

$$m_{I}=\frac{\underline{\mu}}{4\alpha_{1}^{2}},\qquad M_{I}=\frac{4\alpha_{1}^{2}}{\underline{\mu}}.$$

Since $\nabla^{2}F(\theta)=\beta I(\theta)$, ([18](https://arxiv.org/html/2606.19607#A3.E18)) is equivalent (up to the common factor $\beta$) to ([19](https://arxiv.org/html/2606.19607#A3.E19)). ∎
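The score identity and the upper bound in this proof can be checked on a toy softmax policy. The sketch below is an illustration under stated assumptions: a single prompt, a linear score $f_{\theta}(\phi)=\theta^{\top}\phi$ (so $g_{\theta}(x,y)=\phi(x,y)$ and $\alpha_{1}=\max_{y}\|\phi(x,y)\|_{2}$), and hypothetical two-dimensional features.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D features phi(x, y) for three completions of one prompt.
features = [[1.0, 0.0], [0.0, 1.0], [-0.6, -0.8]]
theta = [0.7, -0.3]

# Assumed linear score, so the gradient g_theta(x, y) is the feature itself.
alpha1 = max(math.hypot(p[0], p[1]) for p in features)
pi = softmax([theta[0] * p[0] + theta[1] * p[1] for p in features])

# Score s_theta(y | x) = g_theta(x, y) - E_{y' ~ pi}[g_theta(x, y')].
g_bar = [sum(w * p[k] for w, p in zip(pi, features)) for k in range(2)]
scores = [[p[k] - g_bar[k] for k in range(2)] for p in features]
score_norms = [math.hypot(s[0], s[1]) for s in scores]

# Fisher information I(theta) = E_y[s s^T] and its eigenvalues (2x2 closed form).
fisher = [[sum(w * s[i] * s[j] for w, s in zip(pi, scores)) for j in range(2)]
          for i in range(2)]
tr = fisher[0][0] + fisher[1][1]
det = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]
root = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
min_eig, max_eig = tr / 2.0 - root, tr / 2.0 + root
print(score_norms, min_eig, max_eig)
```

In this toy case every score norm stays below $2\alpha_{1}$ and the largest eigenvalue stays below $4\alpha_{1}^{2}$; the smallest eigenvalue is positive because the three features are affinely independent, which plays the role of feature separation here.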

We now show that the KL divergence (hence the RLHF gap) is globally equivalent to a weighted quadratic form of the estimation error.

###### Proposition 1 (Global quadratic sandwich for the optimality gap).

Under [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [3](https://arxiv.org/html/2606.19607#Thmassumption3), for all $\theta\in\Theta$, letting $\Delta=\theta-\theta^{\star}$, we have:

$$\frac{\beta m_{I}}{2}\,\Delta^{\top}I(\theta^{\star})\Delta\ \leq\ \beta\,\mathbb{E}_{x}\text{KL}\!\big(\pi_{\theta}(\cdot\mid x)\,\|\,\pi_{\theta^{\star}}(\cdot\mid x)\big)\ \leq\ \frac{\beta M_{I}}{2}\,\Delta^{\top}I(\theta^{\star})\Delta.\tag{20}$$

Equivalently, by [Lemma 1](https://arxiv.org/html/2606.19607#Thmlemma1),

$$\frac{\beta m_{I}}{2}\,\Delta^{\top}I(\theta^{\star})\Delta\ \leq\ J(\pi^{\star})-J(\pi_{\theta})\ \leq\ \frac{\beta M_{I}}{2}\,\Delta^{\top}I(\theta^{\star})\Delta.$$

###### Proof of [Proposition 1](https://arxiv.org/html/2606.19607#Thmproposition1).

By [Lemma 2](https://arxiv.org/html/2606.19607#Thmlemma2), $F\in C^{2}(\Theta)$ and $\nabla F(\theta^{\star})=0$. Hence, for $\Delta=\theta-\theta^{\star}$,

$$F(\theta)=\int_{0}^{1}(1-t)\,\Delta^{\top}\nabla^{2}F(\theta^{\star}+t\Delta)\,\Delta\,dt.$$

Using ([16](https://arxiv.org/html/2606.19607#A3.E16)),

$$F(\theta)=\beta\int_{0}^{1}(1-t)\,\Delta^{\top}I(\theta^{\star}+t\Delta)\,\Delta\,dt.$$

Since $\Theta$ is convex, $\theta^{\star}+t\Delta\in\Theta$ for all $t\in[0,1]$. Applying [Lemma 3](https://arxiv.org/html/2606.19607#Thmlemma3) along the segment gives

$$m_{I}\,\Delta^{\top}I(\theta^{\star})\Delta\leq\Delta^{\top}I(\theta^{\star}+t\Delta)\Delta\leq M_{I}\,\Delta^{\top}I(\theta^{\star})\Delta.$$

Multiply by $\beta(1-t)$ and integrate over $t\in[0,1]$; using $\int_{0}^{1}(1-t)\,dt=\frac{1}{2}$, we obtain ([20](https://arxiv.org/html/2606.19607#A3.E20)). ∎
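The integral representation used at the start of this proof is Taylor's formula with integral remainder. Spelled out as a short derivation (using only $F\in C^{2}(\Theta)$, the fact that a KL divergence between identical policies is zero, and $\nabla F(\theta^{\star})=0$ from Lemma 2):

```latex
% Taylor's formula with integral remainder along the segment theta* + t * Delta:
F(\theta)
  = F(\theta^{\star})
  + \nabla F(\theta^{\star})^{\top}\Delta
  + \int_{0}^{1} (1-t)\, \Delta^{\top} \nabla^{2}F(\theta^{\star}+t\Delta)\, \Delta \, dt .

% The first two terms vanish:
F(\theta^{\star})
  = \beta\, \mathbb{E}_{x}\!\left[ \mathrm{KL}\big(\pi_{\theta^{\star}}(\cdot\mid x) \,\|\, \pi_{\theta^{\star}}(\cdot\mid x)\big) \right] = 0,
\qquad
\nabla F(\theta^{\star}) = 0 .

% Integrating the pointwise sandwich against the weight beta * (1 - t):
\beta\, m_{I}\, \Delta^{\top} I(\theta^{\star}) \Delta \int_{0}^{1} (1-t)\, dt
  \;\le\; F(\theta) \;\le\;
\beta\, M_{I}\, \Delta^{\top} I(\theta^{\star}) \Delta \int_{0}^{1} (1-t)\, dt ,
\qquad
\int_{0}^{1} (1-t)\, dt = \tfrac{1}{2} .
```

The factor $\tfrac{1}{2}$ in the final bound therefore comes from the Taylor weight $(1-t)$, not from the Fisher sandwich itself.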

### C.2 Step 2: upper bound for the DPO minimizer

Given a sampling design $D$, suppose we collect $n$ samples $e_{i}=(x_{i},y^{+}_{i},y^{-}_{i})\stackrel{\text{i.i.d.}}{\sim}D$. We can approximate the design covariance matrix $\Sigma_{D}(\theta)$ by the following sample covariance matrix

$$\widehat{\Sigma}_{n}(\theta)\;\doteq\;\frac{1}{n}\sum_{i=1}^{n}g(e_{i};\theta)g(e_{i};\theta)^{\top}.$$

We start with a useful concentration result, which lower bounds the sample covariance matrix by the design covariance matrix.

###### Proposition 2 (Empirical covariance concentration).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) hold. Assume $e_{i}=(x_{i},y^{+}_{i},y^{-}_{i})\stackrel{\text{i.i.d.}}{\sim}D$. Then for any $\delta\in(0,1)$, if

$$n\;\geq\;8\,\frac{G^{2}}{\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))}\,\log\!\frac{\dim(H)}{\delta},$$

we have with probability at least $1-\delta$:

$$\widehat{\Sigma}_{n}(\theta^{\star})\succeq\frac{1}{2}\,\Sigma_{D}(\theta^{\star}).$$

Consequently, on this event,

$$\widehat{\Sigma}_{n}^{\dagger}(\theta^{\star})\preceq 2\,\Sigma_{D}^{\dagger}(\theta^{\star}).$$

For any $\delta\in(0,1)$, define $n_{\Sigma}(\delta):=8\,\frac{G^{2}}{\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))}\,\log\!\frac{\dim(H)}{\delta}$.

###### Proof of [Proposition 2](https://arxiv.org/html/2606.19607#Thmproposition2).

To simplify notation, write

$$\widehat{\Sigma}_{n} := \frac{1}{n}\sum_{i=1}^{n} Y_{i}, \qquad \Sigma_{D} := \mathbb{E}[Y_{i}], \qquad Y_{i} := g(e_{i};\theta^{\star})\,g(e_{i};\theta^{\star})^{\top}.$$

By construction, each $Y_{i}\succeq 0$. Moreover, since $\|g(e;\theta^{\star})\|_{2}\leq G$, we have

$$0 \preceq Y_{i} \preceq \|g(e_{i};\theta^{\star})\|_{2}^{2}\,I \preceq G^{2} I \quad \text{(on } H\text{)}.$$

Let $\lambda_{D} := \lambda_{\min}^{H}(\Sigma_{D}) > 0$. Consider the sum $S_{n} := \sum_{i=1}^{n} Y_{i}$. Then $\mathbb{E}[S_{n}] = n\Sigma_{D}$, hence

$$\lambda_{\min}^{H}(\mathbb{E}[S_{n}]) = n\,\lambda_{D}.$$

We apply the matrix Chernoff inequality for the minimum eigenvalue [Tropp, [2012](https://arxiv.org/html/2606.19607#bib.bib19), e.g., Theorem 5.1, Corollary 5.2, Remark 5.3] to the independent PSD matrices $\{Y_{i}\}$: for any $\varepsilon\in(0,1)$,

$$\Pr\!\left(\lambda_{\min}^{H}(S_{n}) \leq (1-\varepsilon)\,\lambda_{\min}^{H}(\mathbb{E}[S_{n}])\right) \leq \dim(H)\cdot\exp\!\left(-\frac{\varepsilon^{2}\,\lambda_{\min}^{H}(\mathbb{E}[S_{n}])}{2\,G^{2}}\right).$$

Setting $\varepsilon=\tfrac{1}{2}$ yields

$$\Pr\!\left(\lambda_{\min}^{H}(S_{n}) \leq \frac{1}{2}\,n\lambda_{D}\right) \leq \dim(H)\cdot\exp\!\left(-\frac{n\lambda_{D}}{8G^{2}}\right).$$

Therefore, if

$$n \;\geq\; 8\,\frac{G^{2}}{\lambda_{D}}\,\log\frac{\dim(H)}{\delta},$$

then with probability at least $1-\delta$,

$$\lambda_{\min}^{H}(S_{n}) \geq \frac{1}{2}\,n\lambda_{D}, \quad \text{equivalently} \quad \widehat{\Sigma}_{n} \succeq \frac{1}{2}\,\Sigma_{D} \quad \text{on } H.$$

Finally, since both matrices are positive definite on $H$, pseudoinverse monotonicity on $H$ gives

$$\widehat{\Sigma}_{n}^{\dagger} \preceq 2\,\Sigma_{D}^{\dagger} \quad \text{on } H,$$

which concludes the proof. ∎
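To make the sample-size threshold concrete, the following is a minimal sketch that evaluates $n_{\Sigma}(\delta)$ and the Chernoff tail at $\varepsilon=\tfrac{1}{2}$. The function names `n_sigma` and `chernoff_tail` and all numeric constants are illustrative assumptions, not values from the paper.

```python
import math


def n_sigma(G: float, lam_min: float, dim_h: int, delta: float) -> int:
    """Smallest integer n with n >= 8 * G^2 / lam_min * log(dim_h / delta)."""
    if G <= 0 or lam_min <= 0 or dim_h < 1 or not 0 < delta < 1:
        raise ValueError("need G > 0, lam_min > 0, dim_h >= 1, 0 < delta < 1")
    return math.ceil(8.0 * G ** 2 / lam_min * math.log(dim_h / delta))


def chernoff_tail(n: int, G: float, lam_min: float, dim_h: int) -> float:
    """Tail bound dim(H) * exp(-n * lam_min / (8 G^2)) obtained at epsilon = 1/2."""
    return dim_h * math.exp(-n * lam_min / (8.0 * G ** 2))


if __name__ == "__main__":
    # Hypothetical constants, chosen only for illustration.
    G, lam_min, dim_h, delta = 1.0, 0.5, 10, 0.05
    n = n_sigma(G, lam_min, dim_h, delta)
    # At n = n_sigma the tail bound drops to at most delta.
    print(n, chernoff_tail(n, G, lam_min, dim_h))
```

With these hypothetical constants, the threshold grows linearly in $G^{2}/\lambda_{D}$ and only logarithmically in $\dim(H)/\delta$, which is the scaling the proof relies on.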

###### Corollary 1.

Under the condition of [Proposition 2](https://arxiv.org/html/2606.19607#Thmproposition2), if $n\geq n_{\Sigma}(\delta)$, we have with probability at least $1-\delta$:

$$\widehat{\Sigma}_{n}(\theta^{\star}) \succeq \frac{1}{2}\,\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))\,I,$$

where $\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))>0$ by [Assumption 4(a)](https://arxiv.org/html/2606.19607#Thmassumption4).

###### Proof of [Corollary 1](https://arxiv.org/html/2606.19607#Thmcorollary1).

Let $H$ be the identifiable subspace and note that, by construction, $\widehat{\Sigma}_{n}(\theta^{\star})$ and $\Sigma_{D}(\theta^{\star})$ act on $H$. By [Proposition 2](https://arxiv.org/html/2606.19607#Thmproposition2), if $n\geq n_{\Sigma}(\delta)$, where

$$n_{\Sigma}(\delta) := 8\,\frac{G^{2}}{\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))}\,\log\frac{\dim(H)}{\delta},$$

then with probability at least $1-\delta$,

$$\widehat{\Sigma}_{n}(\theta^{\star}) \succeq \frac{1}{2}\,\Sigma_{D}(\theta^{\star}).$$

Since [Assumption 4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) ensures $\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))>0$, we further have, when restricting to $H$,

$$\Sigma_{D}(\theta^{\star}) \succeq \lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))\,I.$$

Combining the two displays yields, on the same event,

$$\widehat{\Sigma}_{n}(\theta^{\star}) \succeq \frac{1}{2}\,\lambda_{\min}^{H}(\Sigma_{D}(\theta^{\star}))\,I.$$

This proves the claim. ∎

To upper-bound the optimality gap of the DPO minimizer, we need to control the curvature of the empirical DPO loss $L_{n}$. This curvature is captured by the Hessian matrix, which decomposes as follows.

###### Lemma 4 (Hessian decomposition).

For any $\theta\in\Theta$,

$$\nabla^{2}L_{n}(\theta) = A_{n}(\theta) + R_{n}(\theta), \tag{21}$$

where

$$A_{n}(\theta) \doteq \frac{1}{n}\sum_{i=1}^{n}\sigma'\big(u_{\theta}(e_{i})\big)\,g(e_{i};\theta)\,g(e_{i};\theta)^{\top}, \qquad R_{n}(\theta) \doteq \frac{1}{n}\sum_{i=1}^{n}\big(\sigma(u_{\theta}(e_{i})) - \sigma(u_{\theta^{\star}}(e_{i}))\big)\,\nabla_{\theta}^{2}u_{\theta}(e_{i}).$$

###### Proof of [Lemma 4](https://arxiv.org/html/2606.19607#Thmlemma4).

Differentiate $\nabla L_{n}(\theta) = \frac{1}{n}\sum_{i}\ell'(a_{i},u_{i}(\theta))\,g_{i}(\theta)$. Using the chain rule:

$$\nabla^{2}L_{n}(\theta) = \frac{1}{n}\sum_{i}\ell''(a_{i},u_{i}(\theta))\,g_{i}(\theta)\,g_{i}(\theta)^{\top} + \frac{1}{n}\sum_{i}\ell'(a_{i},u_{i}(\theta))\,\nabla_{\theta}^{2}u_{i}(\theta).$$

Substitute $\ell''=\sigma'$ and $\ell'=\sigma-a$. ∎

We note that the remainder term $R_{n}(\theta)$ vanishes when the policy is log-linear, because $u_{\theta}$ is then linear in $\theta$ and $\nabla_{\theta}^{2}u_{\theta}=0$. For general policies, bounding this remainder term requires a more detailed analysis.

#### C.2.1 Log-linear policy

We first consider the log-linear policy class $f_{\theta}(\phi(x,y)) = \theta^{\top}\phi(x,y)$, which is easier to analyze. Note that the tabular setting is a special case of the log-linear policy when the embedding $\phi$ is a one-hot vector. The log-linear family is more tractable for three reasons:

1. (i) The function $g(e_{i};\theta)$ is independent of $\theta$: $g(e_{i};\theta) = \beta\big(\phi(x_{i},y_{i}^{+}) - \phi(x_{i},y_{i}^{-})\big).$
2. (ii) The Hessian of the DPO loss $L_{n}(\theta)$ has no remainder term: $\nabla^{2}L_{n}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\sigma'\big(u_{\theta}(e_{i})\big)\,g(e_{i};\theta)\,g(e_{i};\theta)^{\top}.$
3. (iii) The DPO loss $L_{n}(\theta)$ is strongly convex on $\Theta$.

These three facts yield a direct lower bound on $\nabla^{2}L_{n}(\theta)$, as we show next.

###### Lemma 5.

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1) and [2](https://arxiv.org/html/2606.19607#Thmassumption2) hold. Then for any $\theta\in\Theta$,

$$\nabla^{2}L_{n}(\theta) \succeq \alpha\,\widehat{\Sigma}_{n},$$

for some $\alpha>0$.

###### Proof of [Lemma 5](https://arxiv.org/html/2606.19607#Thmlemma5).

By [Lemma 10](https://arxiv.org/html/2606.19607#Thmlemma10), $\sigma'(u_{\theta}(e))\geq\kappa_{0}>0$ for all $\theta\in\Theta$ and for all $e$. Hence,

$$\nabla^{2}L_{n}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\sigma'\big(u_{\theta}(e_{i})\big)\,g(e_{i};\theta)\,g(e_{i};\theta)^{\top} \succeq \kappa_{0}\,\frac{1}{n}\sum_{i=1}^{n}g(e_{i};\theta)\,g(e_{i};\theta)^{\top} = \kappa_{0}\,\widehat{\Sigma}_{n}(\theta).$$

Since $g(e_{i};\theta)$ does not depend on $\theta$ in the log-linear case, $\widehat{\Sigma}_{n}(\theta)=\widehat{\Sigma}_{n}$, and the claim holds with $\alpha=\kappa_{0}$. ∎
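As a numerical sanity check of the log-linear Hessian formula and of the bound in Lemma 5, the sketch below works in two dimensions with the logistic loss $\ell(a,u)=\log(1+e^{u})-a\,u$, so that $\ell'=\sigma-a$ and $\ell''=\sigma'$. It assumes $u_{\theta}(e_{i})=\theta^{\top}g_{i}$ (any reference-policy offset is ignored), uses hypothetical vectors and labels, and replaces the global constant $\kappa_{0}$ by the smallest curvature weight in the sample.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def dot(theta, g):
    return theta[0] * g[0] + theta[1] * g[1]


def grad(theta, gs, labels):
    """Gradient of L_n(theta) = mean_i [log(1 + exp(u_i)) - a_i * u_i], u_i = <theta, g_i>."""
    out = [0.0, 0.0]
    for g, a in zip(gs, labels):
        r = sigmoid(dot(theta, g)) - a  # ell'(a, u) = sigma(u) - a
        out[0] += r * g[0] / len(gs)
        out[1] += r * g[1] / len(gs)
    return out


def weighted_gram(gs, weights):
    """(1/n) * sum_i w_i * g_i g_i^T as a 2x2 nested list."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for g, w in zip(gs, weights):
        for r in range(2):
            for c in range(2):
                m[r][c] += w * g[r] * g[c] / len(gs)
    return m


def fd_hessian(theta, gs, labels, h=1e-6):
    """Central finite differences of the gradient; entry [r][c] = d grad_r / d theta_c."""
    cols = []
    for j in range(2):
        plus, minus = list(theta), list(theta)
        plus[j] += h
        minus[j] -= h
        gp, gm = grad(plus, gs, labels), grad(minus, gs, labels)
        cols.append([(gp[k] - gm[k]) / (2 * h) for k in range(2)])
    return [[cols[c][r] for c in range(2)] for r in range(2)]


def min_eig_sym2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix."""
    half_tr = 0.5 * (m[0][0] + m[1][1])
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return half_tr - math.sqrt(max(half_tr * half_tr - det, 0.0))


# Hypothetical score vectors g_i, labels a_i, and parameter theta.
GS = [(1.0, 0.0), (0.5, -1.0), (-0.3, 0.8), (1.2, 0.4)]
LABELS = [1, 0, 1, 1]
THETA = (0.2, -0.1)

curv = [sigmoid(dot(THETA, g)) * (1.0 - sigmoid(dot(THETA, g))) for g in GS]
hess = weighted_gram(GS, curv)                  # closed form, no remainder term
sigma_hat = weighted_gram(GS, [1.0] * len(GS))  # empirical Gram matrix
kappa0 = min(curv)                              # sample-level stand-in for kappa_0
gap = [[hess[r][c] - kappa0 * sigma_hat[r][c] for c in range(2)] for r in range(2)]
```

Here `hess` should agree with `fd_hessian(THETA, GS, LABELS)` up to discretization error, and `gap` should be positive semidefinite, which is the inequality $\nabla^{2}L_{n}(\theta)\succeq\kappa_{0}\widehat{\Sigma}_{n}$ at this particular $\theta$.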

Using this lower bound, we prove the following upper bound for the DPO minimizer.

###### Proposition 3 (Upper bounds for log-linear policy).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2), [3](https://arxiv.org/html/2606.19607#Thmassumption3) and [4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) hold. Fix any sampling design $D$, and let $n$ i.i.d. comparisons be drawn from $D$. Let $\hat{\theta}_{n}\in\arg\min_{\theta\in\Theta}L_{n}(\theta)$ be the DPO estimator (defined in ([5](https://arxiv.org/html/2606.19607#S2.E5))), and $\hat{\pi}_{n}\doteq\pi_{\hat{\theta}_{n}}$. For any $\delta\in(0,1)$, when $n>n_{\Sigma}(\delta)$, there exists a constant $C_{\mathrm{ub}}>0$ (depending only on the fixed constants in the assumptions and on $\beta$) such that:

$$J(\pi^{\star}) - J(\hat{\pi}_{n}) \;\leq\; \frac{C_{\mathrm{ub}}}{n}\,\operatorname{tr}\!\big(I(\theta^{\star})\,\Sigma_{D}^{\dagger}\big). \tag{22}$$

###### Proof of [Proposition 3](https://arxiv.org/html/2606.19607#Thmproposition3).

Let $\Delta=\hat{\theta}_{n}-\theta^{\star}$. By [Lemma 1](https://arxiv.org/html/2606.19607#Thmlemma1) and the upper side of [Proposition 1](https://arxiv.org/html/2606.19607#Thmproposition1),

$$J(\pi_{\theta^{\star}}) - J(\pi_{\hat{\theta}_{n}}) \leq \frac{\beta M_{I}}{2}\,\Delta^{\top}I(\theta^{\star})\Delta = \frac{\beta M_{I}}{2}\,\operatorname{tr}\!\big(I(\theta^{\star})\,\Delta\Delta^{\top}\big). \tag{23}$$

(i) We apply the mean-value theorem and the first-order optimality condition. Define

$$H_{n} \doteq \int_{0}^{1}\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\,dt.$$

Then

$$\nabla L_{n}(\hat{\theta}_{n}) - \nabla L_{n}(\theta^{\star}) = H_{n}\Delta.$$

Since $\hat{\theta}_{n}$ is a stationary point of $L_{n}$, i.e., $\nabla L_{n}(\hat{\theta}_{n})=0$,

$$\Delta = -\,H_{n}^{\dagger}\,\nabla L_{n}(\theta^{\star}),$$

hence

$$\Delta\Delta^{\top} = H_{n}^{\dagger}\,\nabla L_{n}(\theta^{\star})\,\nabla L_{n}(\theta^{\star})^{\top}\,(H_{n}^{\dagger})^{\top}. \tag{24}$$

(ii) We lower bound $H_{n}$ by $\widehat{\Sigma}_{n}$. By [Lemma 5](https://arxiv.org/html/2606.19607#Thmlemma5), for all $t\in[0,1]$,

$$\nabla^{2}L_{n}(\theta^{\star}+t\Delta) \succeq \alpha\,\widehat{\Sigma}_{n},$$

therefore

$$H_{n} \succeq \alpha\,\widehat{\Sigma}_{n} \quad\Longrightarrow\quad H_{n}^{\dagger} \preceq \alpha^{-1}\,\widehat{\Sigma}_{n}^{\dagger}.$$

Let $\mathcal{E}_{\Sigma}$ be the event in [Proposition 2](https://arxiv.org/html/2606.19607#Thmproposition2). On event $\mathcal{E}_{\Sigma}$, by [Proposition 2](https://arxiv.org/html/2606.19607#Thmproposition2), $\widehat{\Sigma}_{n}\succeq\frac{1}{2}\Sigma_{D}$, so by pseudoinverse monotonicity on $H$,

$$\widehat{\Sigma}_{n}^{\dagger} \preceq 2\,\Sigma_{D}^{\dagger}, \qquad H_{n}^{\dagger} \preceq \frac{2}{\alpha}\,\Sigma_{D}^{\dagger}.$$
(iii) We bound the score covariance at $\theta^{\star}$. Write

$$\nabla L_{n}(\theta^{\star}) = -\frac{1}{n}\sum_{i=1}^{n}\big(a_{i}-\sigma(u_{i}(\theta^{\star}))\big)\,g_{i}(\theta^{\star}).$$

Conditional on the sampled edges, the summands are independent and mean-zero, and

$$\operatorname{Var}(a_{i}\mid\cdot) = \sigma'(u_{i}(\theta^{\star})) \leq \frac{1}{4}.$$

Hence

$$\mathbb{E}\!\left[\nabla L_{n}(\theta^{\star})\,\nabla L_{n}(\theta^{\star})^{\top} \mid \{g_{i}(\theta^{\star})\}\right] \preceq \frac{1}{4n}\,\widehat{\Sigma}_{n}.$$

(iv) Take conditional expectation in ([24](https://arxiv.org/html/2606.19607#A3.E24)) and use steps (ii) and (iii):

$$\begin{aligned}
\mathbb{E}[\Delta\Delta^{\top}\mid\mathcal{E}_{\Sigma}] &\preceq \frac{1}{4n}\,H_{n}^{\dagger}\,\widehat{\Sigma}_{n}\,(H_{n}^{\dagger})^{\top} \\
&\preceq \frac{1}{4n}\cdot\frac{1}{\alpha^{2}}\,\widehat{\Sigma}_{n}^{\dagger} \;\preceq\; \frac{1}{2\alpha^{2}n}\,\Sigma_{D}^{\dagger}.
\end{aligned}$$

Substitute into ([23](https://arxiv.org/html/2606.19607#A3.E23)):

$$\mathbb{E}\!\left[J(\pi_{\theta^{\star}}) - J(\pi_{\hat{\theta}_{n}}) \mid \mathcal{E}_{\Sigma}\right] \leq \frac{\beta M_{I}}{2}\cdot\frac{1}{2\alpha^{2}n}\,\operatorname{tr}\!\big(I(\theta^{\star})\,\Sigma_{D}^{\dagger}\big).$$

Therefore

$$\mathbb{E}\!\left[J(\pi_{\theta^{\star}}) - J(\pi_{\hat{\theta}_{n}}) \mid \mathcal{E}_{\Sigma}\right] \leq \frac{C_{\mathrm{ub}}}{n}\,\operatorname{tr}\!\big(I(\theta^{\star})\,\Sigma_{D}^{\dagger}\big), \qquad C_{\mathrm{ub}} = \frac{\beta M_{I}}{4\alpha^{2}}. \tag{25}$$

Absorbing constants into $C_{\mathrm{ub}}$ gives the claimed upper bound. Moreover, by [Proposition 2](https://arxiv.org/html/2606.19607#Thmproposition2), the event $\mathcal{E}_{\Sigma}$ happens with probability at least $1-\delta$ if

$$n \;\geq\; 8\,\frac{G^{2}}{\lambda_{\min}^{H}(\Sigma_{D})}\,\log\frac{\dim(H)}{\delta}.$$

∎
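The bound in (22) depends on the design only through $\operatorname{tr}\big(I(\theta^{\star})\Sigma_{D}^{\dagger}\big)$, so two designs can be compared by evaluating this trace. The following is a minimal two-dimensional sketch under stated assumptions: the matrix `INFO` standing in for $I(\theta^{\star})$, the candidate pair vectors `PAIRS`, and both weightings are hypothetical, and $\Sigma_{D}$ is taken to be full rank so that the pseudoinverse coincides with the inverse.

```python
def gram(vectors, weights):
    """Sigma_D = sum_k w_k * g_k g_k^T for 2-d vectors g_k and design weights w_k."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for g, w in zip(vectors, weights):
        for r in range(2):
            for c in range(2):
                m[r][c] += w * g[r] * g[c]
    return m


def inverse2(m):
    """Inverse of a 2x2 matrix; raises if the design does not span both directions."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("Sigma_D is singular; the design does not span the subspace")
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def design_criterion(info, vectors, weights):
    """tr(info @ Sigma_D^{-1}): smaller values give a smaller upper bound in (22)."""
    inv = inverse2(gram(vectors, weights))
    return sum(info[r][c] * inv[c][r] for r in range(2) for c in range(2))


# Hypothetical inputs: I(theta*) weights the first direction four times more.
INFO = [[4.0, 0.0], [0.0, 1.0]]
PAIRS = [(1.0, 0.0), (0.0, 1.0)]

uniform = design_criterion(INFO, PAIRS, [0.5, 0.5])          # 4/0.5 + 1/0.5 = 10
tilted = design_criterion(INFO, PAIRS, [2.0 / 3, 1.0 / 3])   # 4/(2/3) + 1/(1/3) = 9
```

In this toy instance, shifting labels toward the direction that $I(\theta^{\star})$ weights more heavily lowers the criterion from 10 to 9 at the same labeling budget.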

#### C.2.2 General policy

When the policy is not log-linear, the analysis becomes more involved because the three simplifying features of the log-linear case no longer hold. In particular, when $f_{\theta}(\phi(x,y))$ is not linear in $\phi(x,y)$,

1. (i) the score function $g(e_{i};\theta)$ depends on $\theta$: $g(e_{i};\theta) = \beta\big(\nabla_{\theta}f_{\theta}(\phi(x_{i},y_{i}^{+})) - \nabla_{\theta}f_{\theta}(\phi(x_{i},y_{i}^{-}))\big);$
2. (ii) the Hessian of the DPO loss $L_{n}(\theta)$ includes an additional remainder term: $\nabla^{2}L_{n}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\sigma'\big(u_{\theta}(e_{i})\big)\,g(e_{i};\theta)\,g(e_{i};\theta)^{\top} + \frac{1}{n}\sum_{i=1}^{n}\big(\sigma(u_{\theta}(e_{i})) - \sigma(u_{\theta^{\star}}(e_{i}))\big)\,\nabla_{\theta}^{2}u_{\theta}(e_{i});$
3. (iii) the DPO loss $L_{n}(\theta)$ is non-convex in general.

As a result, we cannot expect the lower bound in [Lemma 5](https://arxiv.org/html/2606.19607#Thmlemma5) to hold uniformly over $\Theta$, and a more careful analysis is required. Our analysis resolves these issues in three layers.

##### (I) local geometry of population loss

We first prove that, for a general policy, the desired lower bound in [Lemma 5](https://arxiv.org/html/2606.19607#Thmlemma5) holds locally for the population DPO risk $L(\theta)$. Given the sampling design $D$ and a policy $\pi_{\theta}$, the population DPO risk $L(\theta)$ is defined as

$$L(\theta) \;\doteq\; \mathbb{E}_{(e,a)}\!\left[\ell\!\left(a,u_{\theta}(e)\right)\right].$$

###### Proposition 4 (Local identifiability on $H$).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) hold. Then there exist constants $r_{\mathrm{loc}}>0$ and $c_{\mathrm{loc}}>0$, depending only on the fixed constants in the standing assumptions, such that for all $\theta\in\Theta$ with $\|\theta-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}$,

$$\nabla^{2}L(\theta) \succeq c_{\mathrm{loc}}\,\Sigma_{D}(\theta^{\star}) \succeq c_{\mathrm{loc}}\,\mu_{\star}\,I.$$

###### Proof of [Proposition 4](https://arxiv.org/html/2606.19607#Thmproposition4).

First, we prove a lower bound for $A(\theta)$. Using $\nabla^{2}L(\theta)=A(\theta)+R(\theta)$ and the global bound $\sigma'(u_{\theta}(e))\geq\kappa_{0}$ from [Lemma 10](https://arxiv.org/html/2606.19607#Thmlemma10),

$$A(\theta) = \mathbb{E}\!\left[\sigma'(u_{\theta})\,g(\theta)\,g(\theta)^{\top}\right] \succeq \kappa_{0}\,\mathbb{E}\big[g(\theta)\,g(\theta)^{\top}\big].$$

By the Gram perturbation bound with Lipschitz constants $H_{u}$ and $G_{u}$ ([Lemmas 13](https://arxiv.org/html/2606.19607#Thmlemma13) and [9](https://arxiv.org/html/2606.19607#Thmlemma9)),

$$\left\|\mathbb{E}\big[g(\theta)\,g(\theta)^{\top}\big] - \Sigma_{D}(\theta^{\star})\right\|_{\mathrm{op}} \leq 2G_{u}H_{u}\,\|\theta-\theta^{\star}\|_{2} + H_{u}^{2}\,\|\theta-\theta^{\star}\|_{2}^{2}.$$

Together with $\Sigma_{D}(\theta^{\star})\succeq\mu_{\star}I$ by [Assumption 4(a)](https://arxiv.org/html/2606.19607#Thmassumption4),

$$\mathbb{E}\big[g(\theta)\,g(\theta)^{\top}\big] \succeq \bigl(1-\eta(\theta)\bigr)\,\Sigma_{D}(\theta^{\star}),$$

where

$$\eta(\theta) \doteq \frac{2G_{u}H_{u}\,\|\theta-\theta^{\star}\|_{2} + H_{u}^{2}\,\|\theta-\theta^{\star}\|_{2}^{2}}{\mu_{\star}}.$$

Hence

$$A(\theta) \succeq \kappa_{0}\bigl(1-\eta(\theta)\bigr)\,\Sigma_{D}(\theta^{\star}).$$
Second, we provide a bound for $R(\theta)$. Let $\delta(\theta)$ denote the norm $\|R(\theta)\|_{\mathrm{op}}$. Then,

$$R(\theta) \succeq -\delta(\theta)\,I.$$

Since $\Sigma_{D}(\theta^{\star})\succeq\mu_{\star}I$, we have $I\preceq\mu_{\star}^{-1}\Sigma_{D}(\theta^{\star})$, thus

$$R(\theta) \succeq -\frac{\delta(\theta)}{\mu_{\star}}\,\Sigma_{D}(\theta^{\star}).$$

Using the bounds on $A(\theta)$ and $R(\theta)$,

$$\nabla^{2}L(\theta) = A(\theta) + R(\theta) \succeq \left[\kappa_{0}\bigl(1-\eta(\theta)\bigr) - \frac{\delta(\theta)}{\mu_{\star}}\right]\Sigma_{D}(\theta^{\star}) = \alpha(\theta)\,\Sigma_{D}(\theta^{\star}),$$

where

$$\alpha(\theta) \doteq \kappa_{0}\bigl(1-\eta(\theta)\bigr) - \frac{\delta(\theta)}{\mu_{\star}}.$$
Third, we provide a bound for $\delta(\theta)$. Using $|\sigma(a)-\sigma(b)|\leq\frac{1}{4}|a-b|$,

$$\delta(\theta) \leq \mathbb{E}\!\left[|\sigma(u_{\theta})-\sigma(u_{\theta^{\star}})|\,\|\nabla_{\theta}^{2}u_{\theta}\|_{\mathrm{op}}\right] \leq \frac{1}{4}\,H_{u}\,\mathbb{E}\,|u_{\theta}-u_{\theta^{\star}}|.$$

By Lipschitzness of $u_{\theta}$ ([Lemma 9](https://arxiv.org/html/2606.19607#Thmlemma9)),

$$|u_{\theta}(e)-u_{\theta^{\star}}(e)| \leq G_{u}\,\|\theta-\theta^{\star}\|_{2},$$

hence

$$\delta(\theta) \leq \frac{1}{4}\,H_{u}G_{u}\,\|\theta-\theta^{\star}\|_{2}.$$

Substituting gives the following lower bound on $\alpha(\theta)$:

$$\alpha(\theta) \geq \kappa_{0}\bigl(1-\eta(\theta)\bigr) - \frac{H_{u}G_{u}}{4\mu_{\star}}\,\|\theta-\theta^{\star}\|_{2}.$$
Finally, define

$$r_{\mathrm{loc}} := \min\{\rho_{g},\rho_{R}\}, \qquad \rho_{g} := \frac{\sqrt{G_{u}^{2}+\mu_{\star}/2}-G_{u}}{H_{u}}, \qquad \rho_{R} := \frac{\kappa_{0}\,\mu_{\star}}{2H_{u}G_{u}}.$$

If $\|\theta-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}$, then by the definitions of $\rho_{g},\rho_{R}$:

$$\eta(\theta) \leq \frac{1}{2}, \qquad \frac{\delta(\theta)}{\mu_{\star}} \leq \frac{H_{u}G_{u}}{4\mu_{\star}}\,\|\theta-\theta^{\star}\|_{2} \leq \frac{\kappa_{0}}{8}.$$

Therefore

$$\alpha(\theta) \geq \kappa_{0}\bigl(1-\tfrac{1}{2}\bigr) - \tfrac{\kappa_{0}}{8} = \frac{3\kappa_{0}}{8} \geq \frac{\kappa_{0}}{4}.$$

So

$$\nabla^{2}L(\theta) \succeq \frac{\kappa_{0}}{4}\,\Sigma_{D}(\theta^{\star}) \succeq \frac{\kappa_{0}\,\mu_{\star}}{4}\,I.$$

∎
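The choice of $r_{\mathrm{loc}}$ can be checked numerically. The sketch below evaluates $\rho_{g}$, $\rho_{R}$, $\eta$, and the lower bound on $\alpha(\theta)$ at radius $r_{\mathrm{loc}}$; the function names and the constants $G_{u}$, $H_{u}$, $\mu_{\star}$, $\kappa_{0}$ used here are hypothetical and serve only to illustrate the arithmetic of the proof.

```python
import math


def local_radius(G_u, H_u, mu_star, kappa0):
    """Return (r_loc, rho_g, rho_R) as defined at the end of the proof."""
    rho_g = (math.sqrt(G_u ** 2 + mu_star / 2.0) - G_u) / H_u
    rho_R = kappa0 * mu_star / (2.0 * H_u * G_u)
    return min(rho_g, rho_R), rho_g, rho_R


def eta(r, G_u, H_u, mu_star):
    """eta at distance r: (2 G_u H_u r + H_u^2 r^2) / mu_star."""
    return (2.0 * G_u * H_u * r + (H_u * r) ** 2) / mu_star


def alpha_lower(r, G_u, H_u, mu_star, kappa0):
    """Lower bound kappa0 * (1 - eta) - H_u G_u r / (4 mu_star) on alpha(theta)."""
    return kappa0 * (1.0 - eta(r, G_u, H_u, mu_star)) - H_u * G_u * r / (4.0 * mu_star)


# Hypothetical constants, chosen only for illustration.
G_U, H_U, MU_STAR, KAPPA0 = 1.0, 2.0, 0.1, 0.2
R_LOC, RHO_G, RHO_R = local_radius(G_U, H_U, MU_STAR, KAPPA0)
```

By construction $\rho_{g}$ is the positive root of $2G_{u}H_{u}r+H_{u}^{2}r^{2}=\mu_{\star}/2$, so `eta(RHO_G, ...)` equals $\tfrac{1}{2}$ up to rounding, and `alpha_lower(R_LOC, ...)` stays above $\kappa_{0}/4$.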

Therefore, for $\theta$ close enough to $\theta^{\star}$ (i.e., $\|\theta-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}$), the desired lower bound holds.

##### (II) high-probability local geometry of empirical loss

Next, we translate the local bound from the population DPO risk to the empirical DPO risk. On the same ball $B_{r_{\mathrm{loc}}}(\theta^{\star})$, we establish a high-probability uniform Hessian concentration bound

$$\sup_{\theta\in B_{r_{\mathrm{loc}}}}\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\|_{\mathrm{op}} \leq \varepsilon_{n}^{\mathrm{Hess}},$$

where $\varepsilon_{n}^{\mathrm{Hess}}$ is decreasing in $n$.

###### Lemma 6 (Uniform Hessian concentration on $\mathcal{B}_{\mathrm{loc}}$).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1) and [2](https://arxiv.org/html/2606.19607#Thmassumption2) hold, and define

$$\mathcal{B}_{\mathrm{loc}} := \{\theta\in\Theta:\ \|\theta-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\}.$$

Then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\sup_{\theta\in\mathcal{B}_{\mathrm{loc}}}\bigl\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\bigr\|_{\mathrm{op}} \;\leq\; \varepsilon_{n}^{\mathrm{Hess}}(\delta),$$

where $\varepsilon_{n}^{\mathrm{Hess}}(\delta)$ depends only on $n,\delta$, the radius $r_{\mathrm{loc}}$, the identifiable dimension $d_{H}$, and the fixed constants in [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2), and satisfies

$$\varepsilon_{n}^{\mathrm{Hess}}(\delta) = \mathcal{O}\!\left(\sqrt{\frac{d_{H}\log n+\log(1/\delta)}{n}} + \frac{d_{H}\log n+\log(1/\delta)}{n}\right).$$

###### Proof of [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6).

First, we create a covering net on $\mathcal{B}_{\mathrm{loc}}$. Because $\Theta\subset\theta^{\star}+H$ and $\dim(H)=d_{H}$, the local ball

$$\mathcal{B}_{\mathrm{loc}} = \{\theta\in\Theta:\|\theta-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\}$$

is a $d_{H}$-dimensional Euclidean ball in the affine subspace $\theta^{\star}+H$. Fix the net radius $\rho_{n}\doteq\frac{1}{n}$, and let $\mathcal{N}_{\rho_{n}}$ be a $\rho_{n}$-net of $\mathcal{B}_{\mathrm{loc}}$ in $\|\cdot\|_{2}$. Then the standard volumetric bound gives

$$\log|\mathcal{N}_{\rho_{n}}| \;\leq\; d_{H}\log\!\left(\frac{3r_{\mathrm{loc}}}{\rho_{n}}\right) = d_{H}\log\bigl(3r_{\mathrm{loc}}\,n\bigr).$$
Second, we apply matrix concentration on net points. For fixed $\theta\in\mathcal{N}_{\rho_{n}}$, define

$$Y_{i}(\theta) := \nabla^{2}\ell_{\theta}(e_{i}) - \mathbb{E}_{e\sim D}\big[\nabla^{2}\ell_{\theta}(e)\big].$$

By [Lemma 12](https://arxiv.org/html/2606.19607#Thmlemma12), uniformly over $\theta\in\Theta$,

$$\|Y_{i}(\theta)\|_{\mathrm{op}} \leq 2M_{H}, \qquad \Bigl\|\mathbb{E}\bigl[Y_{i}(\theta)^{2}\bigr]\Bigr\|_{\mathrm{op}} \leq V_{H},$$

and each $Y_{i}(\theta)$ lives in the $d_{H}$-dimensional subspace $H$. Applying a matrix Bernstein inequality [Tropp, [2012](https://arxiv.org/html/2606.19607#bib.bib19), Theorem 1.6] on $H$ at each net point and taking a union bound over $\mathcal{N}_{\rho_{n}}$, we obtain that with probability at least $1-\delta$,

$$\max_{\theta\in\mathcal{N}_{\rho_{n}}}\bigl\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\bigr\|_{\mathrm{op}} \;\leq\; C_{1}\sqrt{\frac{V_{H}\,\Lambda_{H}(\delta)}{n}} + C_{2}\,\frac{M_{H}\,\Lambda_{H}(\delta)}{n},$$

where

$$\Lambda_{H}(\delta) = \log|\mathcal{N}_{\rho_{n}}| + \log\frac{2d_{H}}{\delta} \;\leq\; d_{H}\log\bigl(3r_{\mathrm{loc}}\,n\bigr) + \log\frac{2d_{H}}{\delta}.$$

Finally, we extend the bound to off-net points in the ball via Hessian Lipschitzness. Take any $\theta\in\mathcal{B}_{\mathrm{loc}}$, and choose $\theta^{\sharp}\in\mathcal{N}_{\rho_{n}}$ with $\|\theta-\theta^{\sharp}\|_{2}\leq\rho_{n}$. By global Hessian Lipschitzness (which in particular holds on $\mathcal{B}_{\mathrm{loc}}$),

$$\bigl\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L_{n}(\theta^{\sharp})\bigr\|_{\mathrm{op}} \leq L_{H}\,\|\theta-\theta^{\sharp}\|_{2} \leq L_{H}\,\rho_{n} = \frac{L_{H}}{n},$$

and similarly

$$\bigl\|\nabla^{2}L(\theta)-\nabla^{2}L(\theta^{\sharp})\bigr\|_{\mathrm{op}} \leq \frac{L_{H}}{n}.$$

Therefore,

$$\bigl\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\bigr\|_{\mathrm{op}} \leq \bigl\|\nabla^{2}L_{n}(\theta^{\sharp})-\nabla^{2}L(\theta^{\sharp})\bigr\|_{\mathrm{op}} + \frac{2L_{H}}{n}.$$

Taking the supremum over $\theta\in\mathcal{B}_{\mathrm{loc}}$ and combining with the net bound from above yields, on the same event of probability at least $1-\delta$,

$$\sup_{\theta\in\mathcal{B}_{\mathrm{loc}}}\bigl\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\bigr\|_{\mathrm{op}} \;\leq\; C_{1}\sqrt{\frac{V_{H}\,\Lambda_{H}(\delta)}{n}} + C_{2}\,\frac{M_{H}\,\Lambda_{H}(\delta)}{n} + \frac{2L_{H}}{n}.$$

This is exactly the claimed uniform Hessian concentration on $\mathcal{B}_{\mathrm{loc}}$, with an explicit choice $\rho_{n}=1/n$ and $\Lambda_{H}(\delta)\lesssim d_{H}\log n+\log(1/\delta)$. ∎
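To see how the three terms of the final bound scale with $n$, the sketch below evaluates the upper bound on $\Lambda_{H}(\delta)$ and the resulting right-hand side. The function names are illustrative, and the constants $C_{1}$, $C_{2}$, $V_{H}$, $M_{H}$, $L_{H}$ are not specified in the text, so the default values used here are placeholders.

```python
import math


def lambda_h(n, d_h, r_loc, delta):
    """Upper bound d_H * log(3 r_loc n) + log(2 d_H / delta) on Lambda_H(delta)."""
    if n * r_loc < 1.0:
        raise ValueError("net radius 1/n must not exceed r_loc")
    return d_h * math.log(3.0 * r_loc * n) + math.log(2.0 * d_h / delta)


def eps_hess(n, d_h, r_loc, delta, V_H=1.0, M_H=1.0, L_H=1.0, C1=1.0, C2=1.0):
    """Right-hand side: Bernstein variance term + range term + off-net correction."""
    lam = lambda_h(n, d_h, r_loc, delta)
    variance_term = C1 * math.sqrt(V_H * lam / n)
    range_term = C2 * M_H * lam / n
    off_net_term = 2.0 * L_H / n
    return variance_term + range_term + off_net_term


if __name__ == "__main__":
    # Hypothetical dimension, radius, and confidence level.
    for n in (10 ** 3, 10 ** 4, 10 ** 5):
        print(n, eps_hess(n, d_h=5, r_loc=0.5, delta=0.05))
```

For large $n$ the square-root term dominates, so the bound decays at the rate $\sqrt{(d_{H}\log n+\log(1/\delta))/n}$ stated in Lemma 6.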

Therefore, for $n$ large enough so that $\varepsilon_{n}^{\mathrm{Hess}}\leq\tfrac{1}{8}\kappa_{0}\mu_{\star}$, this yields a uniform empirical curvature lower bound on that ball.

###### Proposition 5 (Uniform empirical Hessian lower bound on the local ball).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) hold. Then there exist constants $c_{\mathrm{emp}},C_{\mathrm{Hess}}>0$, depending only on the fixed constants in the standing assumptions, such that for any $\delta\in(0,1)$, if

$$n \;\geq\; n_{\mathrm{hess}}(\delta) \doteq C_{\mathrm{Hess}}\,\frac{d_{H}\log\bigl(\tfrac{1}{\mu_{\star}}\bigr)+\log(1/\delta)}{\mu_{\star}^{2}},$$

then, with probability at least $1-\delta$, for all $\theta\in\mathcal{B}_{\mathrm{loc}}$,

$$\nabla^{2}L_{n}(\theta) \succeq c_{\mathrm{emp}}\,\Sigma_{D}(\theta^{\star}) \succeq c_{\mathrm{emp}}\,\mu_{\star}\,I.$$

In particular, for any $\Delta$ such that $\theta^{\star}+t\Delta\in\mathcal{B}_{\mathrm{loc}}$ for all $t\in[0,1]$,

$$\bar{H}_{n}(\Delta) \succeq c_{\mathrm{emp}}\,\Sigma_{D}(\theta^{\star}), \qquad \bar{H}_{n}(\Delta) := \int_{0}^{1}\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\,dt.$$

###### Proof of [Proposition 5](https://arxiv.org/html/2606.19607#Thmproposition5).

Fix $\theta\in\mathcal{B}_{\mathrm{loc}}$. Decompose

$$\nabla^{2}L_{n}(\theta) = \nabla^{2}L(\theta) + \big(\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\big).$$

On the Hessian concentration event,

$$\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta) \succeq -\varepsilon_{n}^{\mathrm{Hess}}(\delta)\,I.$$

By [Proposition 4](https://arxiv.org/html/2606.19607#Thmproposition4), for all $\theta\in\mathcal{B}_{\mathrm{loc}}$,

$$\nabla^{2}L(\theta) \succeq \frac{\kappa_{0}}{4}\,\Sigma_{D}(\theta^{\star}),$$

where $\kappa_{0}$ is from [Lemma 10](https://arxiv.org/html/2606.19607#Thmlemma10). Hence

$$\nabla^{2}L_{n}(\theta) \succeq \frac{\kappa_{0}}{4}\,\Sigma_{D}(\theta^{\star}) - \varepsilon_{n}^{\mathrm{Hess}}(\delta)\,I.$$

By [Assumption 4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) we have

$$I \preceq \frac{1}{\mu_{\star}}\,\Sigma_{D}(\theta^{\star}).$$

Therefore

$$-\varepsilon_{n}^{\mathrm{Hess}}(\delta)\,I \succeq -\frac{\varepsilon_{n}^{\mathrm{Hess}}(\delta)}{\mu_{\star}}\,\Sigma_{D}(\theta^{\star}),$$

and thus

$$\nabla^{2}L_{n}(\theta) \succeq \left(\frac{\kappa_{0}}{4} - \frac{\varepsilon_{n}^{\mathrm{Hess}}(\delta)}{\mu_{\star}}\right)\Sigma_{D}(\theta^{\star}).$$

If $\varepsilon_{n}^{\mathrm{Hess}}(\delta)\leq\kappa_{0}\mu_{\star}/8$, then

$$\nabla^{2}L_{n}(\theta) \succeq \frac{\kappa_{0}}{8}\,\Sigma_{D}(\theta^{\star}) \succeq \frac{\kappa_{0}\,\mu_{\star}}{8}\,I.$$

This lower bound holds simultaneously for all $\theta\in\mathcal{B}_{\mathrm{loc}}$ on the same event.

For the segment claim, if $\theta^{\star}+t\Delta\in\mathcal{B}_{\mathrm{loc}}$ for all $t\in[0,1]$, then pointwise in $t$,

$$\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\succeq\frac{\kappa_{0}}{8}\,\Sigma_{D}(\theta^{\star}).$$

Integrating over $t\in[0,1]$ gives

$$\bar{H}_{n}(\Delta):=\int_{0}^{1}\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\,dt\succeq\frac{\kappa_{0}}{8}\,\Sigma_{D}(\theta^{\star}).$$
By [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6), there exists a constant $C_{H}>0$, depending only on the fixed constants in the standing assumptions, such that

$$\varepsilon_{n}^{\mathrm{Hess}}(\delta)\leq C_{H}\left(\sqrt{\frac{d_{H}\log n+\log(1/\delta)}{n}}+\frac{d_{H}\log n+\log(1/\delta)}{n}\right).$$

Thus, to guarantee $\varepsilon_{n}^{\mathrm{Hess}}(\delta)\leq c_{\mathrm{emp}}\mu_{\star}$, it suffices to impose

$$d_{H}\log n+\log(1/\delta)\lesssim\mu_{\star}^{2}n.$$

A standard inversion argument yields a constant $C_{\mathrm{Hess}}>0$, depending only on the fixed constants in the standing assumptions, such that

$$n\;\geq\;C_{\mathrm{Hess}}\,\frac{d_{H}\log\!\bigl(\frac{1}{\mu_{\star}}\bigr)+\log(1/\delta)}{\mu_{\star}^{2}}$$

implies

$$d_{H}\log n+\log(1/\delta)\leq c\,\mu_{\star}^{2}n$$

for a sufficiently small absolute constant $c>0$. Therefore

$$\varepsilon_{n}^{\mathrm{Hess}}(\delta)\leq c_{\mathrm{emp}}\mu_{\star}.$$

∎
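The Loewner-order step in the proof above can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical $2\times 2$ matrix standing in for $\Sigma_{D}(\theta^{\star})$ and a hypothetical value of $\kappa_{0}$; the helper names `min_eig_sym2` and `curvature_margin` are illustrative and not from the paper. It evaluates the smallest eigenvalue of $\frac{\kappa_{0}}{4}\Sigma_{D}-\varepsilon I-\frac{\kappa_{0}}{8}\Sigma_{D}$, which is nonnegative exactly when the claimed lower bound holds.

```python
import math


def min_eig_sym2(a, b, d):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)


def curvature_margin(sigma, kappa0, eps):
    """Smallest eigenvalue of (kappa0/4)*Sigma - eps*I - (kappa0/8)*Sigma.

    A nonnegative value means (kappa0/4)*Sigma - eps*I >= (kappa0/8)*Sigma
    in the Loewner order.
    """
    a, b, d = sigma
    c = kappa0 / 8.0
    return min_eig_sym2(c * a - eps, c * b, c * d - eps)


# Hypothetical values, chosen only for illustration.
sigma = (2.0, 0.5, 1.0)          # entries (a, b, d) of a stand-in for Sigma_D(theta*)
mu_star = min_eig_sym2(*sigma)   # so that Sigma >= mu_star * I
kappa0 = 0.4
eps_ok = kappa0 * mu_star / 8.0  # the threshold kappa0 * mu_star / 8 from the proof
eps_bad = 2.0 * eps_ok           # twice the threshold

print(curvature_margin(sigma, kappa0, 0.5 * eps_ok))  # positive margin
print(curvature_margin(sigma, kappa0, eps_ok))        # zero up to rounding
print(curvature_margin(sigma, kappa0, eps_bad))       # negative margin
```

In this example the margin vanishes exactly at the threshold $\kappa_{0}\mu_{\star}/8$, which suggests the condition on $\varepsilon_{n}^{\mathrm{Hess}}(\delta)$ cannot be relaxed without changing the constant $\kappa_{0}/8$.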

Hence, if $\hat{\theta}_{n}$ lies in the local ball $B_{r_{\mathrm{loc}}}(\theta^{\star})$, the desired lower bound holds for any $\theta$ on the line segment between $\theta^{\star}$ and $\hat{\theta}_{n}$.

###### Corollary 2 (Empirical segment curvature along the estimator path).

Under the conditions of [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6), if additionally

$$\Pr\!\left(\|\hat{\theta}_{n}-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\right)\geq 1-\delta_{\mathrm{loc}},$$

then with probability at least $1-\delta-\delta_{\mathrm{loc}}$,

$$\bar{H}_{n}(\hat{\theta}_{n}-\theta^{\star})\succeq c_{\mathrm{emp}}\Sigma_{D}(\theta^{\star})\succeq c_{\mathrm{emp}}\mu_{\star}I.$$

###### Proof of [Corollary 2](https://arxiv.org/html/2606.19607#Thmcorollary2).

Write $\Delta_{n}:=\hat{\theta}_{n}-\theta^{\star}$ and define

$$\mathcal{E}_{\mathrm{hess}}:=\left\{\sup_{\theta\in\mathcal{B}_{\mathrm{loc}}}\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\|_{\mathrm{op}}\leq\varepsilon_{n}^{\mathrm{Hess}}(\delta)\right\},\qquad\mathcal{E}_{\mathrm{loc}}:=\{\|\hat{\theta}_{n}-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\}.$$

By [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6), $\Pr(\mathcal{E}_{\mathrm{hess}})\geq 1-\delta$. On $\mathcal{E}_{\mathrm{loc}}$, the entire segment satisfies

$$\{\theta^{\star}+t\Delta_{n}:\ t\in[0,1]\}\subseteq\mathcal{B}_{\mathrm{loc}}.$$

Hence on $\mathcal{E}_{\mathrm{hess}}\cap\mathcal{E}_{\mathrm{loc}}$, for all $t\in[0,1]$,

$$\nabla^{2}L_{n}(\theta^{\star}+t\Delta_{n})\succeq\frac{\kappa_{0}}{8}\,\Sigma_{D}(\theta^{\star}),$$

where $\kappa_{0}$ is defined in [Lemma 10](https://arxiv.org/html/2606.19607#Thmlemma10). Integrating in $t$ gives

$$\bar{H}_{n}(\Delta_{n})\succeq\frac{\kappa_{0}}{8}\,\Sigma_{D}(\theta^{\star}).$$

Finally, a union bound implies

$$\Pr(\mathcal{E}_{\mathrm{hess}}\cap\mathcal{E}_{\mathrm{loc}})\geq 1-\delta-\delta_{\mathrm{loc}}.$$

∎

Next, we show that $\hat{\theta}_{n}$ converges to $\theta^{\star}$ with high probability, and hence, when $n$ is large enough, $\hat{\theta}_{n}$ lies in the local ball $B_{r_{\mathrm{loc}}}(\theta^{\star})$.

##### (III) Asymptotic consistency of the DPO minimizer

Because the curvature result is local, we must show that the DPO minimizer $\hat{\theta}_{n}$ enters this local ball with high probability. However, since the empirical DPO risk $L_{n}(\theta)$ is not convex in general, $\hat{\theta}_{n}$ need not converge to $\theta^{\star}$. For example, if both $\theta^{\star}$ and some $\theta^{\prime}\neq\theta^{\star}$ minimize the population DPO risk $L(\theta)$, it is not clear that $\hat{\theta}_{n}$ will converge to $\theta^{\star}$. In the statistics literature, a standard way to address this issue is to assume that the population risk has a unique minimizer.

We first show that $\theta^{\star}$ is indeed a minimizer of the population DPO risk.

###### Lemma 7.

Suppose [Assumption 1](https://arxiv.org/html/2606.19607#Thmassumption1) holds. Then $\theta^{\star}$ is a minimizer of $L(\theta)$, i.e., $\theta^{\star}\in\arg\min_{\theta\in\Theta}L(\theta)$.

###### Proof of [Lemma 7](https://arxiv.org/html/2606.19607#Thmlemma7).

For each fixed edge $e$, define

$$p^{\star}(e):=\Pr(a=1\mid e)=\sigma\!\big(u_{\theta^{\star}}(e)\big),\qquad q_{\theta}(e):=\sigma\!\big(u_{\theta}(e)\big).$$

Conditioning on $e$, the conditional population loss is

$$\begin{aligned}\varphi_{e}(\theta)&:=\mathbb{E}\!\left[\ell\!\big(a,u_{\theta}(e)\big)\mid e\right]\\&=-p^{\star}(e)\log q_{\theta}(e)-\bigl(1-p^{\star}(e)\bigr)\log\!\bigl(1-q_{\theta}(e)\bigr).\end{aligned}$$

At $\theta^{\star}$:

$$\varphi_{e}(\theta^{\star})=-p^{\star}(e)\log p^{\star}(e)-\bigl(1-p^{\star}(e)\bigr)\log\!\bigl(1-p^{\star}(e)\bigr).$$

Therefore

$$\begin{aligned}\varphi_{e}(\theta)-\varphi_{e}(\theta^{\star})&=p^{\star}(e)\log\frac{p^{\star}(e)}{q_{\theta}(e)}+\bigl(1-p^{\star}(e)\bigr)\log\frac{1-p^{\star}(e)}{1-q_{\theta}(e)}\\&=:\mathrm{KL}\!\left(\mathrm{Bern}(p^{\star}(e))\,\|\,\mathrm{Bern}(q_{\theta}(e))\right)\;\geq 0.\end{aligned}$$

Now take the expectation over $e$:

$$\begin{aligned}L(\theta)-L(\theta^{\star})&=\mathbb{E}_{e}\!\left[\varphi_{e}(\theta)-\varphi_{e}(\theta^{\star})\right]\\&=\mathbb{E}_{e}\!\left[\mathrm{KL}\!\left(\mathrm{Bern}(p^{\star}(e))\,\|\,\mathrm{Bern}(q_{\theta}(e))\right)\right]\geq 0.\end{aligned}$$

Hence $L(\theta)\geq L(\theta^{\star})$ for all $\theta\in\Theta$, i.e., $\theta^{\star}$ is a minimizer of $L$. ∎
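The identity used in the proof of Lemma 7 (excess conditional logistic loss equals a Bernoulli KL divergence) is easy to verify numerically for a single edge. The sketch below assumes a scalar margin $u$ in place of $u_{\theta}(e)$ and a hypothetical true margin `u_star`; the function names are illustrative only.

```python
import math


def sigmoid(u):
    """Logistic link sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def cond_loss(p_star, u):
    """Conditional population logistic loss for one edge.

    p_star is the true probability that a = 1, and u is the model margin,
    so the model's predicted probability is q = sigmoid(u).
    """
    q = sigmoid(u)
    return -p_star * math.log(q) - (1.0 - p_star) * math.log(1.0 - q)


def bern_kl(p, q):
    """KL divergence KL(Bern(p) || Bern(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def excess_loss(u_star, u):
    """Excess conditional loss phi_e(theta) - phi_e(theta*) for margins u, u_star."""
    p_star = sigmoid(u_star)
    return cond_loss(p_star, u) - cond_loss(p_star, u_star)


u_star = 0.8  # hypothetical true margin for this edge
for u in (-2.0, 0.0, 0.8, 3.0):
    # The two printed numbers agree up to rounding, and both are nonnegative.
    print(u, excess_loss(u_star, u), bern_kl(sigmoid(u_star), sigmoid(u)))
```

The excess loss is zero only when the model margin reproduces the true preference probability on that edge, which is why uniqueness of the minimizer has to be assumed at the level of the parameter $\theta$ rather than derived from this argument.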

###### Proposition 6 (Uniform law of large numbers).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1) and [2](https://arxiv.org/html/2606.19607#Thmassumption2) hold. Then

$$\sup_{\theta\in\Theta}\bigl|L_{n}(\theta)-L(\theta)\bigr|\xrightarrow{p}0.$$

Moreover, for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\sup_{\theta\in\Theta}\bigl|L_{n}(\theta)-L(\theta)\bigr|\;\leq\;\varepsilon_{n}^{\mathrm{ULLN}}(\delta),$$

where $\varepsilon_{n}^{\mathrm{ULLN}}(\delta)$ depends only on $n$, $\delta$, the diameter of $\Theta$, and the fixed constants in [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2), and satisfies

$$\varepsilon_{n}^{\mathrm{ULLN}}(\delta)=\mathcal{O}\!\left(\sqrt{\frac{p\log n+\log(1/\delta)}{n}}\right).$$

###### Proof of [Proposition 6](https://arxiv.org/html/2606.19607#Thmproposition6).

Fix $\varepsilon>0$, and choose $\delta:=\frac{\varepsilon}{4G_{u}}$, where $G_{u}$ is from [Lemma 9](https://arxiv.org/html/2606.19607#Thmlemma9). In this first part of the proof, $\delta$ denotes the net radius, not the confidence level. Let $\{\theta^{1},\dots,\theta^{N}\}$ be a finite $\delta$-net of $\Theta$ in $\|\cdot\|_{2}$, with

$$N=N\big(\Theta,\|\cdot\|_{2},\delta\big),$$

where $N(\Theta,\|\cdot\|_{2},\delta)$ is the $\delta$-covering number of $\Theta$ in $\|\cdot\|_{2}$.

For any $\theta\in\Theta$, pick $j(\theta)$ such that $\|\theta-\theta^{j(\theta)}\|_{2}\leq\delta$. By [Lemma 11](https://arxiv.org/html/2606.19607#Thmlemma11)(ii),

$$|L_{n}(\theta)-L_{n}(\theta^{j(\theta)})|\leq G_{u}\delta,\qquad|L(\theta)-L(\theta^{j(\theta)})|\leq G_{u}\delta.$$

Hence

$$|L_{n}(\theta)-L(\theta)|\leq|L_{n}(\theta^{j(\theta)})-L(\theta^{j(\theta)})|+2G_{u}\delta\leq\max_{1\leq j\leq N}|L_{n}(\theta^{j})-L(\theta^{j})|+\frac{\varepsilon}{2}.$$

Therefore

$$\Big\{\sup_{\theta\in\Theta}|L_{n}(\theta)-L(\theta)|>\varepsilon\Big\}\subseteq\Big\{\max_{1\leq j\leq N}|L_{n}(\theta^{j})-L(\theta^{j})|>\frac{\varepsilon}{2}\Big\}.$$
Define $B_{\ell}\doteq\log\bigl(1+e^{U_{\max}}\bigr)$, where $U_{\max}$ is from [Lemma 9](https://arxiv.org/html/2606.19607#Thmlemma9). For each fixed $j$, the losses $\ell_{\theta^{j}}(z_{i})\in[0,B_{\ell}]$ are i.i.d., so Hoeffding's inequality gives

$$\Pr\!\left(|L_{n}(\theta^{j})-L(\theta^{j})|>\frac{\varepsilon}{2}\right)\leq 2\exp\!\left(-\frac{n\varepsilon^{2}}{2B_{\ell}^{2}}\right).$$

By a union bound over $j=1,\dots,N$,

$$\Pr\!\left(\sup_{\theta\in\Theta}|L_{n}(\theta)-L(\theta)|>\varepsilon\right)\leq 2N\exp\!\left(-\frac{n\varepsilon^{2}}{2B_{\ell}^{2}}\right).\tag{26}$$
It remains to bound $N$ explicitly. Since $\Theta\subset\mathbb{R}^{p}$ is compact, its Euclidean diameter

$$\operatorname{diam}(\Theta):=\sup_{\theta,\theta^{\prime}\in\Theta}\|\theta-\theta^{\prime}\|_{2}$$

is finite. A standard covering-number bound for subsets of $\mathbb{R}^{p}$ yields

$$N\big(\Theta,\|\cdot\|_{2},\delta\big)\leq\left(\frac{3\,\operatorname{diam}(\Theta)}{\delta}\right)^{p}=\left(\frac{12\,\operatorname{diam}(\Theta)\,G_{u}}{\varepsilon}\right)^{p}.$$

Plugging this into ([26](https://arxiv.org/html/2606.19607#A3.E26)) gives

$$\Pr\!\left(\sup_{\theta\in\Theta}|L_{n}(\theta)-L(\theta)|>\varepsilon\right)\leq 2\left(\frac{12\,\operatorname{diam}(\Theta)\,G_{u}}{\varepsilon}\right)^{p}\exp\!\left(-\frac{n\varepsilon^{2}}{2B_{\ell}^{2}}\right).$$

For fixed $\varepsilon>0$, the prefactor is independent of $n$, while the exponential term decays like $\exp(-cn)$, so the right-hand side tends to $0$ as $n\to\infty$. Therefore

$$\sup_{\theta\in\Theta}|L_{n}(\theta)-L(\theta)|\xrightarrow{p}0.$$
To derive an explicit bound, rewrite the tail bound as

$$\Pr\!\left(\sup_{\theta\in\Theta}|L_{n}(\theta)-L(\theta)|>\varepsilon\right)\leq 2\exp\!\left(p\log\!\Bigl(\frac{12\,\operatorname{diam}(\Theta)\,G_{u}}{\varepsilon}\Bigr)-\frac{n\varepsilon^{2}}{2B_{\ell}^{2}}\right).$$

From here on, $\delta\in(0,1)$ denotes the confidence level. Define $\varepsilon_{n}^{\mathrm{ULLN}}(\delta)$ as any value satisfying

$$p\log\!\Bigl(\frac{12\,\operatorname{diam}(\Theta)\,G_{u}}{\varepsilon_{n}^{\mathrm{ULLN}}(\delta)}\Bigr)+\log\!\frac{2}{\delta}\leq\frac{n\big(\varepsilon_{n}^{\mathrm{ULLN}}(\delta)\big)^{2}}{2B_{\ell}^{2}}.$$

Then the right-hand side of the tail bound, evaluated at $\varepsilon=\varepsilon_{n}^{\mathrm{ULLN}}(\delta)$, is at most $\delta$, and hence with probability at least $1-\delta$,

$$\sup_{\theta\in\Theta}|L_{n}(\theta)-L(\theta)|\leq\varepsilon_{n}^{\mathrm{ULLN}}(\delta).$$

Since $\log(c/\varepsilon)\lesssim\log n$ when $\varepsilon$ is chosen of order $n^{-1/2}$, the inequality above admits a solution of the form

$$\varepsilon_{n}^{\mathrm{ULLN}}(\delta)=\mathcal{O}\!\left(\sqrt{\frac{p\log n+\log(1/\delta)}{n}}\right),$$

where the implied constant depends only on $B_{\ell}$, $G_{u}$, and $\operatorname{diam}(\Theta)$.

∎
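The defining inequality for $\varepsilon_{n}^{\mathrm{ULLN}}(\delta)$ has no closed-form solution, but it can be solved numerically. The following is a minimal sketch that evaluates the logarithm of the tail bound from the proof and finds an admissible $\varepsilon$ by bisection. All constant values are hypothetical placeholders, and the function names are not from the paper.

```python
import math


def ulln_log_tail(eps, n, p, diam, G_u, B_ell):
    """Log of the tail bound 2 * (12*diam*G_u/eps)**p * exp(-n*eps**2 / (2*B_ell**2))."""
    return (math.log(2.0)
            + p * math.log(12.0 * diam * G_u / eps)
            - n * eps ** 2 / (2.0 * B_ell ** 2))


def eps_ulln(n, delta, p, diam, G_u, B_ell, iters=200):
    """Bisection for an eps whose tail bound is at most delta.

    The log-tail is strictly decreasing in eps, so the admissible set is a
    half-line [eps0, infinity) and bisection converges to (just above) eps0.
    Assumes the log-tail exceeds log(delta) at eps = 1e-12, which holds for
    ordinary constants.
    """
    target = math.log(delta)
    args = (n, p, diam, G_u, B_ell)
    lo, hi = 1e-12, 1.0
    while ulln_log_tail(hi, *args) > target:  # grow hi until it is admissible
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ulln_log_tail(mid, *args) > target:
            lo = mid
        else:
            hi = mid
    return hi  # hi is admissible throughout the loop


# Hypothetical constants, for illustration only.
consts = dict(p=5, diam=2.0, G_u=1.0, B_ell=1.0)
for n in (10_000, 40_000, 160_000):
    eps = eps_ulln(n, delta=0.05, **consts)
    rate = math.sqrt((consts["p"] * math.log(n) + math.log(1 / 0.05)) / n)
    print(n, eps, rate)
```

Comparing the printed `eps` with `rate` across sample sizes gives a quick sanity check that the numerical solution tracks the $\sqrt{(p\log n+\log(1/\delta))/n}$ scaling stated in Proposition 6.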

Even though $L_{n}$ converges to $L$ uniformly, $\hat{\theta}_{n}$ may not converge to $\theta^{\star}$. To guarantee convergence of $\hat{\theta}_{n}$ to $\theta^{\star}$, we need to assume that $\theta^{\star}$ is the only minimizer of the population DPO risk ([Assumption 5](https://arxiv.org/html/2606.19607#Thmassumption5)).

###### Proposition 7 (Global ERM localization).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [5](https://arxiv.org/html/2606.19607#Thmassumption5) hold, and let

$$R_{\Theta}\doteq\sup_{\theta\in\Theta}\|\theta-\theta^{\star}\|_{2}<\infty.$$

Fix $r_{\mathrm{loc}}>0$, and define the population separation outside the ball

$$\Delta_{\text{sep}}(r_{\mathrm{loc}}):=\inf_{\substack{\theta\in\Theta\\ \|\theta-\theta^{\star}\|_{2}\geq r_{\mathrm{loc}}}}\bigl(L(\theta)-L(\theta^{\star})\bigr).$$

Then $\Delta_{\text{sep}}(r_{\mathrm{loc}})>0$. Moreover, there exists a constant $C_{\mathrm{loc}}>0$, depending only on the fixed constants in the standing assumptions, such that the following holds: for any $\delta\in(0,1)$, if

$$n\;\geq\;n_{\mathrm{loc}}(\delta)\;\doteq\;C_{\mathrm{loc}}\,\frac{p\log\!\Bigl(\dfrac{R_{\Theta}}{\Delta_{\text{sep}}(r_{\mathrm{loc}})}\Bigr)+\log\!\bigl(\dfrac{1}{\delta}\bigr)}{\Delta_{\text{sep}}(r_{\mathrm{loc}})^{2}},\tag{27}$$

then every empirical risk minimizer $\hat{\theta}_{n}\in\arg\min_{\theta\in\Theta}L_{n}(\theta)$ satisfies

$$\Pr\!\bigl(\|\hat{\theta}_{n}-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\bigr)\geq 1-\delta.$$

###### Proof of [Proposition 7](https://arxiv.org/html/2606.19607#Thmproposition7).

Let

$$R_{\Theta}\doteq\sup_{\theta\in\Theta}\|\theta-\theta^{\star}\|_{2}<\infty.$$

Fix a localization radius $r_{\mathrm{loc}}>0$ and define the population separation outside the ball

$$\Delta_{\text{sep}}(r_{\mathrm{loc}})\doteq\inf_{\substack{\theta\in\Theta\\ \|\theta-\theta^{\star}\|_{2}\geq r_{\mathrm{loc}}}}\bigl(L(\theta)-L(\theta^{\star})\bigr).$$

By continuity of $L$ and uniqueness of its minimizer $\theta^{\star}$ ([Assumption 5](https://arxiv.org/html/2606.19607#Thmassumption5)), the infimum is attained and $\Delta_{\text{sep}}(r_{\mathrm{loc}})>0$. Set

$$\varepsilon_{\mathrm{loc}}\doteq\frac{\Delta_{\text{sep}}(r_{\mathrm{loc}})}{3}.$$
By the tail bound established in the proof of [Proposition 6](https://arxiv.org/html/2606.19607#Thmproposition6), for any $\varepsilon>0$,

$$\Pr\!\left(\sup_{\theta\in\Theta}\bigl|L_{n}(\theta)-L(\theta)\bigr|>\varepsilon\right)\leq 2\left(\frac{12\,G_{u}R_{\Theta}}{\varepsilon}\right)^{p}\exp\!\left(-\frac{n\varepsilon^{2}}{2B_{\ell}^{2}}\right).$$

We set $\varepsilon=\varepsilon_{\mathrm{loc}}$ and require the right-hand side to be at most $\delta$. Taking logarithms, this is equivalent to

$$p\log\!\left(\frac{12\,G_{u}R_{\Theta}}{\varepsilon_{\mathrm{loc}}}\right)+\log\!\left(\frac{2}{\delta}\right)\leq\frac{n\varepsilon_{\mathrm{loc}}^{2}}{2B_{\ell}^{2}}.$$

Thus, one may take

$$n_{\mathrm{loc}}(\delta)\doteq\frac{2B_{\ell}^{2}}{\varepsilon_{\mathrm{loc}}^{2}}\left[p\log\!\left(\frac{12\,G_{u}R_{\Theta}}{\varepsilon_{\mathrm{loc}}}\right)+\log\!\left(\frac{2}{\delta}\right)\right].\tag{28}$$

When $n\geq n_{\mathrm{loc}}(\delta)$, we have with probability at least $1-\delta$,

$$\sup_{\theta\in\Theta}\bigl|L_{n}(\theta)-L(\theta)\bigr|\leq\varepsilon_{\mathrm{loc}}.\tag{29}$$

Denote this event by $\mathcal{E}_{\mathrm{ULLN}}$.

On $\mathcal{E}_{\mathrm{ULLN}}$, for any $\theta\in\Theta$,

$$L_{n}(\theta)-L_{n}(\theta^{\star})\geq\bigl(L(\theta)-\varepsilon_{\mathrm{loc}}\bigr)-\bigl(L(\theta^{\star})+\varepsilon_{\mathrm{loc}}\bigr)=\bigl(L(\theta)-L(\theta^{\star})\bigr)-2\varepsilon_{\mathrm{loc}}.$$

If $\|\theta-\theta^{\star}\|_{2}\geq r_{\mathrm{loc}}$, then by definition of $\Delta_{\text{sep}}(r_{\mathrm{loc}})$,

$$L(\theta)-L(\theta^{\star})\geq\Delta_{\text{sep}}(r_{\mathrm{loc}}),$$

and hence on $\mathcal{E}_{\mathrm{ULLN}}$,

$$L_{n}(\theta)-L_{n}(\theta^{\star})\geq\Delta_{\text{sep}}(r_{\mathrm{loc}})-2\varepsilon_{\mathrm{loc}}=\Delta_{\text{sep}}(r_{\mathrm{loc}})-\frac{2}{3}\Delta_{\text{sep}}(r_{\mathrm{loc}})=\frac{1}{3}\Delta_{\text{sep}}(r_{\mathrm{loc}})>0.$$

Therefore, on $\mathcal{E}_{\mathrm{ULLN}}$, no $\theta$ with $\|\theta-\theta^{\star}\|_{2}\geq r_{\mathrm{loc}}$ can minimize $L_{n}$ globally over $\Theta$. Consequently, every global empirical minimizer $\hat{\theta}_{n}\in\arg\min_{\theta\in\Theta}L_{n}(\theta)$ satisfies

$$\|\hat{\theta}_{n}-\theta^{\star}\|_{2}<r_{\mathrm{loc}}\qquad\text{on }\mathcal{E}_{\mathrm{ULLN}}.$$

Since $\Pr(\mathcal{E}_{\mathrm{ULLN}})\geq 1-\delta$ whenever $n\geq n_{\mathrm{loc}}(\delta)$, we obtain

$$\Pr\!\bigl(\|\hat{\theta}_{n}-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\bigr)\geq 1-\delta.$$

∎
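To get a feel for the sample-size threshold in (28), one can evaluate it directly. The sketch below is a plain transcription of (28) with $\varepsilon_{\mathrm{loc}}=\Delta_{\text{sep}}(r_{\mathrm{loc}})/3$; the numerical values of $\Delta_{\text{sep}}$, $G_{u}$, $R_{\Theta}$, and $B_{\ell}$ are hypothetical, since the paper does not report them, and the function names are illustrative.

```python
import math


def n_loc(delta, delta_sep, p, G_u, R_theta, B_ell):
    """Sample-size threshold from (28), rounded up to an integer.

    delta     : failure probability in (0, 1)
    delta_sep : population separation Delta_sep(r_loc) > 0
    p         : parameter dimension
    """
    eps_loc = delta_sep / 3.0
    bracket = (p * math.log(12.0 * G_u * R_theta / eps_loc)
               + math.log(2.0 / delta))
    return math.ceil(2.0 * B_ell ** 2 / eps_loc ** 2 * bracket)


def tail_at(n, delta_sep, p, G_u, R_theta, B_ell):
    """Tail bound 2 * (12*G_u*R_theta/eps_loc)**p * exp(-n*eps_loc**2/(2*B_ell**2))."""
    eps_loc = delta_sep / 3.0
    return (2.0 * (12.0 * G_u * R_theta / eps_loc) ** p
            * math.exp(-n * eps_loc ** 2 / (2.0 * B_ell ** 2)))


# Hypothetical constants, for illustration only.
consts = dict(p=5, G_u=1.0, R_theta=2.0, B_ell=1.0)
for sep in (0.3, 0.15):
    n = n_loc(0.05, sep, **consts)
    # At n = n_loc the tail bound is at most delta = 0.05.
    print(sep, n, tail_at(n, sep, **consts))
```

Halving the separation more than quadruples the threshold in this example, reflecting the $\Delta_{\text{sep}}(r_{\mathrm{loc}})^{-2}$ dependence in (27) together with the logarithmic factor.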

Therefore, by [Proposition 7](https://arxiv.org/html/2606.19607#Thmproposition7), when $n$ is large enough, $\hat{\theta}_{n}$ lies in the local ball with high probability. As a result, by [Corollary 2](https://arxiv.org/html/2606.19607#Thmcorollary2), with high probability the curvature of $L_{n}$ along the line segment between $\hat{\theta}_{n}$ and $\theta^{\star}$ is lower bounded by a constant multiple of $\Sigma_{D}(\theta^{\star})$. This yields the upper bound for the DPO minimizer.

##### Upper bound for general policy

Putting everything together, we can prove the following upper bound for the DPO minimizer in the general $f_{\theta}$ case.

###### Theorem 3 (DPO upper bound in the general $f_{\theta}$ case).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2), [4(a)](https://arxiv.org/html/2606.19607#Thmassumption4) and [5](https://arxiv.org/html/2606.19607#Thmassumption5) hold, and let $\hat{\theta}_{n}\in\arg\min_{\theta\in\Theta}L_{n}(\theta)$ and $\hat{\pi}_{n}:=\pi_{\hat{\theta}_{n}}$. Fix $\delta\in(0,1)$ and define

$$n_{0}(\delta)\;\doteq\;\max\Big\{\,n_{\mathrm{loc}}(\delta/3),\;n_{\text{hess}}(\delta/3)\Big\}.$$

Then for all $n\geq n_{0}(\delta)$, with probability at least $1-\delta$,

$$J(\pi^{\star})-J(\hat{\pi}_{n})\;\leq\;\frac{C_{\text{ub}}(\delta)}{n}\,\text{tr}\!\bigl(I(\theta^{\star})\Sigma_{D}^{\dagger}(\theta^{\star})\bigr),\tag{30}$$

where

$$C_{\text{ub}}(\delta)=C\,\frac{\dim(H)}{\delta},$$

for some constant $C>0$ depending only on the fixed constants in the standing assumptions.

###### Proof of [Theorem 3](https://arxiv.org/html/2606.19607#Thmtheorem3).

Set $\Delta:=\hat{\theta}_{n}-\theta^{\star}$. By the upper side of [Proposition 1](https://arxiv.org/html/2606.19607#Thmproposition1), there exists $c_{+}>0$ such that for all $\theta\in\Theta$,

$$J(\pi^{\star})-J(\pi_{\theta})\;\leq\;c_{+}\,(\theta-\theta^{\star})^{\top}I(\theta^{\star})(\theta-\theta^{\star}).$$

In particular,

$$J(\pi^{\star})-J(\hat{\pi}_{n})\;\leq\;c_{+}\,\Delta^{\top}I(\theta^{\star})\Delta.\tag{31}$$
First, we combine a Taylor expansion of the gradient with the first-order optimality condition. Define the segment Hessian

$$H_{n}\doteq\int_{0}^{1}\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\,dt,$$

which coincides with $\bar{H}_{n}(\Delta)$ in the notation of [Corollary 2](https://arxiv.org/html/2606.19607#Thmcorollary2). By the integral form of Taylor's theorem,

$$\nabla L_{n}(\hat{\theta}_{n})-\nabla L_{n}(\theta^{\star})=H_{n}\Delta.$$

Since $\hat{\theta}_{n}$ is a (global) minimizer of $L_{n}$ over $\Theta\subseteq\theta^{\star}+H$, it is stationary on $H$, hence

$$\nabla L_{n}(\hat{\theta}_{n})=0,\qquad\text{so}\qquad H_{n}\Delta=-\,\nabla L_{n}(\theta^{\star}).$$

Therefore,

$$\Delta=-\,H_{n}^{\dagger}\nabla L_{n}(\theta^{\star})\qquad\text{on }H.\tag{32}$$
Second, we introduce the following good events to bound the average curvature of $L_{n}$ on the segment $[\hat{\theta}_{n},\theta^{\star}]$. Fix $\delta\in(0,1)$ and choose $\delta_{\mathrm{loc}},\delta_{\text{hess}},\delta_{G}\in(0,1)$ such that

$$\delta_{\mathrm{loc}}+\delta_{\text{hess}}+\delta_{G}\leq\delta.$$

Let $\mathcal{E}_{\mathrm{loc}}$ be the localization event from [Proposition 7](https://arxiv.org/html/2606.19607#Thmproposition7):

$$\mathcal{E}_{\mathrm{loc}}:=\{\|\hat{\theta}_{n}-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\}.$$

By [Proposition 7](https://arxiv.org/html/2606.19607#Thmproposition7), for

$$\varepsilon_{\mathrm{loc}}\doteq\frac{\Delta_{\text{sep}}(r_{\mathrm{loc}})}{3},\qquad B_{\ell}\doteq\log(1+e^{U_{\max}}),$$

and

$$n_{\mathrm{loc}}(\delta)\doteq\frac{2B_{\ell}^{2}}{\varepsilon_{\mathrm{loc}}^{2}}\left[p\log\!\left(\frac{12\,G_{u}R_{\Theta}}{\varepsilon_{\mathrm{loc}}}\right)+\log\!\left(\frac{2}{\delta}\right)\right],\tag{33}$$

we have, for all $n\geq n_{\mathrm{loc}}(\delta_{\mathrm{loc}})$,

$$\Pr(\mathcal{E}_{\mathrm{loc}})\geq 1-\delta_{\mathrm{loc}}.$$
Let $\mathcal{E}_{\text{hess}}$ be the Hessian concentration event from [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6):

$$\mathcal{E}_{\text{hess}}:=\left\{\sup_{\theta\in\mathcal{B}_{\mathrm{loc}}}\|\nabla^{2}L_{n}(\theta)-\nabla^{2}L(\theta)\|_{\mathrm{op}}\leq\varepsilon_{n}^{\mathrm{Hess}}(\delta_{\text{hess}})\right\},\qquad\mathcal{B}_{\mathrm{loc}}:=\{\theta\in\Theta:\|\theta-\theta^{\star}\|_{2}\leq r_{\mathrm{loc}}\}.$$

By [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6), for all $n\geq n_{\text{hess}}(\delta_{\text{hess}})$,

$$\Pr(\mathcal{E}_{\text{hess}})\geq 1-\delta_{\text{hess}},$$

where $n_{\text{hess}}(\delta_{\text{hess}})$ is chosen exactly as in [Lemma 6](https://arxiv.org/html/2606.19607#Thmlemma6) so that on $\mathcal{E}_{\mathrm{loc}}\cap\mathcal{E}_{\text{hess}}$ we can invoke [Corollary 2](https://arxiv.org/html/2606.19607#Thmcorollary2) and obtain the segment curvature bound

$$H_{n}=\int_{0}^{1}\nabla^{2}L_{n}(\theta^{\star}+t\Delta)\,dt\succeq\alpha\,\Sigma_{D}(\theta^{\star})\qquad\text{on }H,\tag{34}$$

with $\alpha=\frac{\kappa_{0}}{8}$. Since $\Sigma_{D}(\theta^{\star})$ is invertible on $H$ by [Assumption 4(a)](https://arxiv.org/html/2606.19607#Thmassumption4), pseudoinverse monotonicity on $H$ gives

$$H_{n}^{\dagger}\preceq\alpha^{-1}\Sigma_{D}(\theta^{\star})^{\dagger}.\tag{35}$$
Third, we bound the weighted error $\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta$. From $H_{n}\Delta=-\nabla L_{n}(\theta^{\star})$,

$$\Delta^{\top}H_{n}\Delta=-\,\Delta^{\top}\nabla L_{n}(\theta^{\star}).$$

On $\mathcal{E}_{\mathrm{loc}}\cap\mathcal{E}_{\text{hess}}$, using ([34](https://arxiv.org/html/2606.19607#A3.E34)),

$$\alpha\,\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta\leq\Delta^{\top}H_{n}\Delta=-\Delta^{\top}\nabla L_{n}(\theta^{\star}).$$

Applying Cauchy–Schwarz in the $\Sigma_{D}(\theta^{\star})$-geometry,

$$|\Delta^{\top}\nabla L_{n}(\theta^{\star})|=\big|\langle\Sigma_{D}(\theta^{\star})^{1/2}\Delta,\;\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\rangle\big|\leq\|\Sigma_{D}(\theta^{\star})^{1/2}\Delta\|_{2}\,\|\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\|_{2}.$$

Hence

$$\alpha\,\|\Sigma_{D}(\theta^{\star})^{1/2}\Delta\|_{2}^{2}\leq\|\Sigma_{D}(\theta^{\star})^{1/2}\Delta\|_{2}\,\|\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\|_{2}.$$

If $\|\Sigma_{D}(\theta^{\star})^{1/2}\Delta\|_{2}=0$, then $\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta=0$ and there is nothing to prove. Otherwise divide both sides by $\|\Sigma_{D}(\theta^{\star})^{1/2}\Delta\|_{2}$ and square:

$$\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta\leq\frac{1}{\alpha^{2}}\,\|\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\|_{2}^{2}.\tag{36}$$
Next, we bound $\nabla L_{n}(\theta^{\star})$. By [Lemma 15](https://arxiv.org/html/2606.19607#Thmlemma15),

$$\mathbb{E}\!\left[\|\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\|_{2}^{2}\right]\leq\frac{\dim(H)}{4n}.$$

Let

$$C_{G}\doteq\frac{\dim(H)}{4}.$$

By Markov's inequality, for any $\delta_{G}\in(0,1)$, the event

$$\mathcal{E}_{\mathrm{score}}:=\left\{\|\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\|_{2}^{2}\leq\frac{C_{G}}{n\delta_{G}}\right\}$$

satisfies

$$\Pr(\mathcal{E}_{\mathrm{score}})\geq 1-\delta_{G}.$$

Combining with ([36](https://arxiv.org/html/2606.19607#A3.E36)), on

$$\mathcal{E}_{\mathrm{tot}}:=\mathcal{E}_{\mathrm{loc}}\cap\mathcal{E}_{\text{hess}}\cap\mathcal{E}_{\mathrm{score}},$$

we obtain

$$\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta\leq\frac{C_{G}}{\alpha^{2}}\,\frac{1}{n\delta_{G}}.\tag{37}$$
Next, we convert the design-weighted error into the RLHF-weighted error. Since $\Delta\in H$ and $I(\theta^{\star})$ is supported on $H$,

$$\Delta^{\top}I(\theta^{\star})\Delta\leq\lambda_{\max}\!\bigl(\Sigma_{D}(\theta^{\star})^{\dagger/2}I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger/2}\bigr)\,\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta.$$

Using $\lambda_{\max}(A)\leq\text{tr}(A)$ for PSD $A$,

$$\Delta^{\top}I(\theta^{\star})\Delta\leq\text{tr}\!\bigl(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\bigr)\,\Delta^{\top}\Sigma_{D}(\theta^{\star})\Delta.$$

Hence, on $\mathcal{E}_{\mathrm{tot}}$, by ([37](https://arxiv.org/html/2606.19607#A3.E37)),

$$\Delta^{\top}I(\theta^{\star})\Delta\leq\frac{C_{G}}{\alpha^{2}}\,\frac{1}{n\delta_{G}}\,\text{tr}\!\bigl(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\bigr).$$

Plugging into ([31](https://arxiv.org/html/2606.19607#A3.E31)),

$$J(\pi^{\star})-J(\hat{\pi}_{n})\leq\frac{c_{+}C_{G}}{\alpha^{2}\delta_{G}}\cdot\frac{1}{n}\,\text{tr}\!\bigl(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\bigr).$$

Define

$$C_{\text{ub}}\doteq\frac{c_{+}C_{G}}{\alpha^{2}\delta_{G}}.$$

Then on $\mathcal{E}_{\mathrm{tot}}$,

$$J(\pi^{\star})-J(\hat{\pi}_{n})\leq\frac{C_{\text{ub}}}{n}\,\text{tr}\!\bigl(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\bigr).\tag{38}$$
Finally, we bound the probability of the good event. Set

$$n_{0}(\delta) \doteq \max\{n_{\mathrm{loc}}(\delta_{\mathrm{loc}}),\,n_{\text{hess}}(\delta_{\text{hess}})\}.$$

Then for all $n\geq n_{0}(\delta)$,

$$\Pr(\mathcal{E}_{\mathrm{loc}}) \geq 1-\delta_{\mathrm{loc}},\qquad \Pr(\mathcal{E}_{\text{hess}}) \geq 1-\delta_{\text{hess}},\qquad \Pr(\mathcal{E}_{\rm score}) \geq 1-\delta_{G}.$$

Therefore, by the union bound,

$$\Pr(\mathcal{E}_{\rm tot}) \geq 1-\delta_{\mathrm{loc}}-\delta_{\text{hess}}-\delta_{G} \geq 1-\delta.$$

Since ([38](https://arxiv.org/html/2606.19607#A3.E38)) holds on $\mathcal{E}_{\rm tot}$, the claimed high-probability upper bound follows. ∎
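The conversion from design-weighted to RLHF-weighted error can be sanity-checked numerically. The following is a minimal sketch, assuming random PSD stand-ins for $\Sigma_{D}(\theta^{\star})$ and $I(\theta^{\star})$ with $H=\mathbb{R}^{d}$ (full-rank design matrix); the helper `psd_pinv_sqrt` and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

def psd_pinv_sqrt(S, tol=1e-10):
    """Square root of the pseudoinverse of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(S)
    inv_sqrt = np.where(w > tol, 1.0 / np.sqrt(np.clip(w, tol, None)), 0.0)
    return (V * inv_sqrt) @ V.T

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
B = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)  # stand-in for Sigma_D(theta*), full rank
Fisher = B @ B.T                   # stand-in for I(theta*)
delta = rng.normal(size=d)         # stand-in for the error Delta

S_half = psd_pinv_sqrt(Sigma)
lam_max = np.linalg.eigvalsh(S_half @ Fisher @ S_half).max()
trace_term = np.trace(Fisher @ np.linalg.pinv(Sigma))

rlhf_err = delta @ Fisher @ delta    # Delta^T I(theta*) Delta
design_err = delta @ Sigma @ delta   # Delta^T Sigma_D(theta*) Delta

# Chain used in the proof: rlhf_err <= lam_max * design_err <= trace_term * design_err
print(rlhf_err, lam_max * design_err, trace_term * design_err)
```

The trace relaxation is loose by at most a factor of $\dim(H)$, which is why it only affects constants in the final bound.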

### C.3 Step 3: lower bound via information inequality

In this section, we work in the Bayesian setting of [Assumption 6](https://arxiv.org/html/2606.19607#Thmassumption6). The key step in proving the lower bound in [Theorem 1](https://arxiv.org/html/2606.19607#Thmtheorem1) is the following lower bound on the quadratic term of the sandwich, obtained by applying the Van Trees inequality.

###### Lemma 8.

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2), [3](https://arxiv.org/html/2606.19607#Thmassumption3), [4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a) and [6](https://arxiv.org/html/2606.19607#Thmassumption6) hold with $\mathcal{R}=\operatorname{supp}(\rho)$. The Fisher information of $\rho$ is given by

$$J(\rho) \doteq \int_{\Theta}\nabla\log\rho(\theta)\,\nabla\log\rho(\theta)^{\top}\,\rho(\theta)\,d\theta. \tag{39}$$

Let

$$\bar{\Sigma}_{\rho} \doteq \mathbb{E}_{\theta\sim\rho}[\Sigma_{D}(\theta)],\qquad \bar{I}_{\rho} \doteq \mathbb{E}_{\theta\sim\rho}[I(\theta)],\qquad n_{\rm prior} \doteq \left\lceil\frac{4\,\lambda_{\max}(J(\rho))}{\mu_{\mathcal{R}}}\right\rceil.$$

Then there exists a constant $c_{\rm lb}>0$ (depending only on the constants appearing in the assumptions and the prior $\rho$) such that, for any estimator $\tilde{\theta}_{n}$ measurable with respect to the $n$ pairwise comparisons and for all $n\geq n_{\rm prior}$,

$$\mathbb{E}_{\theta\sim\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}I(\theta)(\tilde{\theta}_{n}-\theta)\Big] \;\geq\; \frac{c_{\rm lb}}{n}\,\mathbb{E}_{\theta\sim\rho}\!\left[\operatorname{tr}\!\big(I(\theta)\Sigma_{D}^{\dagger}(\theta)\big)\right]. \tag{40}$$

###### Proof of [Lemma 8](https://arxiv.org/html/2606.19607#Thmlemma8).

Fix any estimator $\tilde{\theta}_{n}$. We first compute the Fisher information of one comparison and of $n$ comparisons. For one queried edge $e=(x,y^{+},y^{-})$, the label satisfies

$$a\mid(e,\theta) \sim {\rm Bernoulli}\!\big(\sigma(u_{\theta}(e))\big),\qquad u_{\theta}(e) = \beta\!\left[\log\frac{\pi_{\theta}(y^{+}\mid x)}{\pi_{0}(y^{+}\mid x)}-\log\frac{\pi_{\theta}(y^{-}\mid x)}{\pi_{0}(y^{-}\mid x)}\right],$$

with gradient feature

$$g(e;\theta) = \nabla_{\theta}u_{\theta}(e).$$

Hence the one-sample Fisher information is

$$\mathcal{I}_{D}(\theta) \doteq \mathbb{E}_{e\sim D}\!\big[\sigma'(u_{\theta}(e))\,g(e;\theta)g(e;\theta)^{\top}\big]. \tag{41}$$

Since $\sigma'(u)\leq 1/4$ for all $u$, we have

$$\mathcal{I}_{D}(\theta) \preceq \frac{1}{4}\,\Sigma_{D}(\theta). \tag{42}$$

For $n$ i.i.d. queried comparisons under the same design, Fisher information adds:

$$\mathcal{I}_{n}(\theta) = n\,\mathcal{I}_{D}(\theta) \preceq \frac{n}{4}\,\Sigma_{D}(\theta). \tag{43}$$
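As an illustration of (41)-(43), the sketch below builds the one-sample Fisher information and the design matrix for a small finite edge set and checks the Loewner ordering numerically. The features `G`, logits `u`, and design `D` are randomly generated placeholders (assumptions for illustration), not quantities from the paper's experiments.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
m, d = 8, 3
G = rng.normal(size=(m, d))    # rows: hypothetical gradient features g(e; theta)
u = rng.normal(size=m)         # hypothetical logits u_theta(e)
D = rng.dirichlet(np.ones(m))  # design: a probability distribution over the m edges

w = sigmoid(u) * (1.0 - sigmoid(u))        # sigma'(u_theta(e)), always at most 1/4
fisher_one = (G * (D * w)[:, None]).T @ G  # one-sample Fisher information, eq. (41)
sigma_D = (G * D[:, None]).T @ G           # Sigma_D(theta) = E_D[g g^T]

n = 100
fisher_n = n * fisher_one                  # information adds over i.i.d. comparisons, eq. (43)

# (n/4) Sigma_D - I_n should be PSD, so its smallest eigenvalue is >= 0 up to rounding
gap_min_eig = np.linalg.eigvalsh(n / 4 * sigma_D - fisher_n).min()
print(gap_min_eig)
```

The gap matrix equals $n\sum_{e}D(e)\,(1/4-\sigma'(u_{\theta}(e)))\,g g^{\top}$, so it is PSD term by term.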
Second, the Van Trees inequality gives the matrix lower bound

$$\mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\left[(\tilde{\theta}_{n}-\theta)(\tilde{\theta}_{n}-\theta)^{\top}\right] \succeq \bigl(\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho)\bigr)^{-1}. \tag{44}$$

Let

$$M_{n} \doteq \mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\left[(\tilde{\theta}_{n}-\theta)(\tilde{\theta}_{n}-\theta)^{\top}\right],\qquad B_{n} \doteq \bigl(\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho)\bigr)^{-1}.$$

Define the prior-averaged Fisher matrix

$$\bar{I}_{\rho} \doteq \mathbb{E}_{\rho}[I(\theta)].$$

Since $\bar{I}_{\rho}\succeq 0$, multiplying ([44](https://arxiv.org/html/2606.19607#A3.E44)) by $\bar{I}_{\rho}$ and taking the trace preserves the inequality:

$$\operatorname{tr}\!\big(\bar{I}_{\rho}M_{n}\big) \;\geq\; \operatorname{tr}\!\big(\bar{I}_{\rho}B_{n}\big).$$

Indeed, this follows from

$$\operatorname{tr}\!\big(\bar{I}_{\rho}(M_{n}-B_{n})\big) \geq 0 \quad\text{whenever}\quad \bar{I}_{\rho}\succeq 0,\; M_{n}-B_{n}\succeq 0.$$
For the left-hand side, use linearity of trace and expectation:

$$\operatorname{tr}\!\big(\bar{I}_{\rho}M_{n}\big) = \mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\left[\operatorname{tr}\!\Big(\bar{I}_{\rho}(\tilde{\theta}_{n}-\theta)(\tilde{\theta}_{n}-\theta)^{\top}\Big)\right].$$

Applying the identity $\operatorname{tr}(Axx^{\top})=x^{\top}Ax$ (with $A=\bar{I}_{\rho}$, $x=\tilde{\theta}_{n}-\theta$) yields

$$\operatorname{tr}\!\big(\bar{I}_{\rho}M_{n}\big) = \mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}\bar{I}_{\rho}(\tilde{\theta}_{n}-\theta)\Big].$$

Substituting this and $B_{n}=(\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho))^{-1}$ gives

$$\mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}\bar{I}_{\rho}(\tilde{\theta}_{n}-\theta)\Big] \;\geq\; \operatorname{tr}\!\Big(\bar{I}_{\rho}\,(\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho))^{-1}\Big). \tag{45}$$
Third, we absorb the prior information $J(\rho)$ into the sample information and lower bound $(\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho))^{-1}$ by a multiple of $\frac{1}{n}\bar{\Sigma}_{D}^{-1}$, where $\bar{\Sigma}_{D}\doteq\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]$ is the prior-averaged design matrix (written $\bar{\Sigma}_{\rho}$ in the statement of the lemma). Recall from ([43](https://arxiv.org/html/2606.19607#A3.E43)) that

$$\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)] \;\preceq\; \frac{n}{4}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)].$$

Therefore,

$$\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho) \;\preceq\; \frac{n}{4}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]+J(\rho).$$

Since matrix inversion reverses the Loewner order on positive definite matrices,

$$(\mathbb{E}_{\rho}[\mathcal{I}_{n}(\theta)]+J(\rho))^{-1} \;\succeq\; \left(\frac{n}{4}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]+J(\rho)\right)^{-1}.$$

Thus it remains to lower bound the right-hand side by a multiple of $\frac{1}{n}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]^{\dagger}$.

Define

$$n_{\rm prior} \doteq \left\lceil\frac{8\,T_{\rho}}{\lambda_{D}}\right\rceil. \tag{46}$$

Then, by [Lemma 17](https://arxiv.org/html/2606.19607#Thmlemma17), for all $n\geq n_{\rm prior}$,

$$J(\rho) \preceq \frac{n}{8}\,\bar{\Sigma}_{D}.$$

Hence

$$\frac{n}{4}\,\bar{\Sigma}_{D}+J(\rho) \;\preceq\; \frac{n}{4}\,\bar{\Sigma}_{D}+\frac{n}{8}\,\bar{\Sigma}_{D} = \frac{3n}{8}\,\bar{\Sigma}_{D}.$$

Applying inverse monotonicity again,

$$\left(\frac{n}{4}\,\bar{\Sigma}_{D}+J(\rho)\right)^{-1} \;\succeq\; \left(\frac{3n}{8}\,\bar{\Sigma}_{D}\right)^{-1} = \frac{8}{3n}\,\bar{\Sigma}_{D}^{-1}.$$

Therefore, for all $n\geq n_{\rm prior}$,

$$\left(\frac{n}{4}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]+J(\rho)\right)^{-1} \;\succeq\; \frac{8}{3n}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]^{-1}. \tag{47}$$
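The absorption step leading to (47) can be checked on random matrices. This is a minimal sketch under stated assumptions: `Sigma_bar` and `J_prior` are random positive (semi)definite stand-ins, and the sample-size threshold is computed from the exact generalized eigenvalue rather than from the paper's $n_{\rm prior}$ formula.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d))
C = rng.normal(size=(d, d))
Sigma_bar = A @ A.T + 0.5 * np.eye(d)  # stand-in for the prior-averaged design matrix
J_prior = C @ C.T                      # stand-in for the prior Fisher information J(rho)

# J <= (n/8) Sigma_bar holds iff n >= 8 * lambda_max(Sigma_bar^{-1/2} J Sigma_bar^{-1/2});
# the Cholesky-whitened matrix has the same eigenvalues as that symmetric form.
L = np.linalg.cholesky(Sigma_bar)
L_inv = np.linalg.inv(L)
ratio = np.linalg.eigvalsh(L_inv @ J_prior @ L_inv.T).max()
n = int(np.ceil(8 * ratio))

lhs = np.linalg.inv(n / 4 * Sigma_bar + J_prior)  # (n/4 Sigma_bar + J)^{-1}
rhs = 8 / (3 * n) * np.linalg.inv(Sigma_bar)      # 8/(3n) Sigma_bar^{-1}

# lhs - rhs should be PSD, matching (47)
min_eig = np.linalg.eigvalsh(lhs - rhs).min()
print(n, min_eig)
```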
Substituting ([47](https://arxiv.org/html/2606.19607#A3.E47)) into ([45](https://arxiv.org/html/2606.19607#A3.E45)), we obtain

$$\mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}\bar{I}_{\rho}(\tilde{\theta}_{n}-\theta)\Big] \;\geq\; \frac{8}{3n}\,\operatorname{tr}\!\Big(\bar{I}_{\rho}\,\mathbb{E}_{\rho}[\Sigma_{D}(\theta)]^{-1}\Big),\qquad \forall n\geq n_{\rm prior}. \tag{48}$$
Finally, we replace the averaged matrices by the actual matrices at $\theta$. By [Assumption 4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a), there exists $\mu_{\rho}>0$ such that

$$\Sigma_{D}(\theta) \succeq \mu_{\rho}I,\qquad \forall\theta\in\operatorname{supp}(\rho).$$

Moreover, by the uniform boundedness of the pair-feature map $g(e;\theta)$ ([Lemma 9](https://arxiv.org/html/2606.19607#Thmlemma9)), there exists $G<\infty$ such that

$$\|g(e;\theta)\|_{2} \leq G \qquad \forall e,\ \forall\theta\in\Theta.$$

Hence

$$\Sigma_{D}(\theta) = \mathbb{E}_{e\sim D}[g(e;\theta)g(e;\theta)^{\top}] \preceq G^{2}I,\qquad \forall\theta\in\Theta.$$

Therefore, with $M_{\Sigma}:=G^{2}$, we have

$$\mu_{\rho}I \preceq \Sigma_{D}(\theta) \preceq M_{\Sigma}I,\qquad \forall\theta\in\operatorname{supp}(\rho).$$

Taking expectation gives

$$\mu_{\rho}I \preceq \bar{\Sigma}_{D} \preceq M_{\Sigma}I.$$

Moreover, for each $\theta$,

$$\bar{\Sigma}_{D} \preceq M_{\Sigma}I \preceq \frac{M_{\Sigma}}{\mu_{\rho}}\,\Sigma_{D}(\theta),$$

and thus, by inverse monotonicity,

$$\bar{\Sigma}_{D}^{-1} \;\succeq\; \frac{\mu_{\rho}}{M_{\Sigma}}\,\Sigma_{D}(\theta)^{-1}.$$

Since $I(\theta)\succeq 0$, we obtain

$$\operatorname{tr}\!\big(I(\theta)\bar{\Sigma}_{D}^{-1}\big) \;\geq\; \frac{\mu_{\rho}}{M_{\Sigma}}\,\operatorname{tr}\!\big(I(\theta)\Sigma_{D}(\theta)^{-1}\big).$$

Taking expectation over $\theta\sim\rho$ and using linearity of trace,

$$\operatorname{tr}\!\big(\bar{I}_{\rho}\,\bar{\Sigma}_{D}^{-1}\big) = \mathbb{E}_{\rho}\!\left[\operatorname{tr}\!\big(I(\theta)\bar{\Sigma}_{D}^{-1}\big)\right] \;\geq\; \frac{\mu_{\rho}}{M_{\Sigma}}\,\mathbb{E}_{\rho}\!\left[\operatorname{tr}\!\big(I(\theta)\Sigma_{D}(\theta)^{-1}\big)\right]. \tag{49}$$

Therefore we may take

$$c_{1} \doteq \frac{\mu_{\rho}}{M_{\Sigma}}.$$
Combining ([48](https://arxiv.org/html/2606.19607#A3.E48)) and ([49](https://arxiv.org/html/2606.19607#A3.E49)), and writing $c_{0}\doteq 8/3$ for the constant in ([48](https://arxiv.org/html/2606.19607#A3.E48)), we get

$$\mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}\bar{I}_{\rho}(\tilde{\theta}_{n}-\theta)\Big] \;\geq\; \frac{c_{0}c_{1}}{n}\,\mathbb{E}_{\rho}\!\left[\operatorname{tr}\!\big(I(\theta)\Sigma_{D}(\theta)^{\dagger}\big)\right].$$

Finally, by [Lemma 16](https://arxiv.org/html/2606.19607#Thmlemma16), there exist constants $0<c_{-}\leq c_{+}<\infty$ such that for all $\theta\in\Theta$ and all $v\in H$,

$$c_{-}\,v^{\top}I(\theta)v \;\leq\; v^{\top}\bar{I}_{\rho}v \;\leq\; c_{+}\,v^{\top}I(\theta)v,\qquad \bar{I}_{\rho}:=\mathbb{E}_{\rho}[I(\theta)].$$

Therefore,

$$\mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}\bar{I}_{\rho}(\tilde{\theta}_{n}-\theta)\Big] \asymp \mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta}\!\Big[(\tilde{\theta}_{n}-\theta)^{\top}I(\theta)(\tilde{\theta}_{n}-\theta)\Big],$$

where the comparability constants depend only on primitive constants. In particular, applying the upper side with $v=\tilde{\theta}_{n}-\theta$ shows that the $I(\theta)$-weighted error is at least $c_{+}^{-1}$ times the $\bar{I}_{\rho}$-weighted error, so ([40](https://arxiv.org/html/2606.19607#A3.E40)) holds with $c_{\rm lb}=c_{0}c_{1}/c_{+}$. ∎

Combining [Proposition 1](https://arxiv.org/html/2606.19607#Thmproposition1) with [Lemma 8](https://arxiv.org/html/2606.19607#Thmlemma8) yields the following lower bound.

###### Theorem 4 (Van Trees lower bound under a smooth prior).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2), [3](https://arxiv.org/html/2606.19607#Thmassumption3), [4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a) and [6](https://arxiv.org/html/2606.19607#Thmassumption6) hold with $\mathcal{R}=\operatorname{supp}(\rho)$. Then for any induced policy estimator $\tilde{\pi}_{n}=\pi_{\tilde{\theta}_{n}}$ and all $n\geq n_{\rm prior}$,

$$\mathbb{E}_{\theta^{\star}\sim\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta^{\star}}\!\Big[J(\pi_{\theta^{\star}})-J(\tilde{\pi}_{n})\Big] \;\geq\; \frac{C_{\rm lb}}{n}\,\mathbb{E}_{\theta^{\star}\sim\rho}\!\left[\operatorname{tr}\!\big(I(\theta^{\star})\Sigma_{D}^{\dagger}(\theta^{\star})\big)\right], \tag{50}$$

for a constant $C_{\rm lb}>0$ depending only on the constants appearing in the assumptions, the prior $\rho$, and the sandwich constant.

###### Proof of [Theorem 4](https://arxiv.org/html/2606.19607#Thmtheorem4).

By the lower side of [Proposition 1](https://arxiv.org/html/2606.19607#Thmproposition1), there exists $c_{-}>0$ such that for every estimator output $\tilde{\theta}_{n}$,

$$J(\pi_{\theta^{\star}})-J(\pi_{\tilde{\theta}_{n}}) \;\geq\; c_{-}\,(\tilde{\theta}_{n}-\theta^{\star})^{\top}I(\theta^{\star})(\tilde{\theta}_{n}-\theta^{\star}).$$

Taking $\mathbb{E}_{\rho}\mathbb{E}_{\mathcal{D}_{n}\mid\theta^{\star}}$ and combining with ([40](https://arxiv.org/html/2606.19607#A3.E40)) gives ([50](https://arxiv.org/html/2606.19607#A3.E50)) with $C_{\rm lb}=c_{-}c_{\rm lb}$. ∎

## Appendix D Proof of [Theorem 2](https://arxiv.org/html/2606.19607#Thmtheorem2)

###### Proof of [Theorem 2](https://arxiv.org/html/2606.19607#Thmtheorem2).

Let

$$\mathcal{R}_{r_{0}} \doteq \Theta\cap\mathbb{B}(\theta_{0},r_{0}),\qquad r_{0}=\|\theta^{\star}-\theta_{0}\|_{2}.$$

Since $\Theta$ is convex and compact by [Assumption 1](https://arxiv.org/html/2606.19607#Thmassumption1), the set $\mathcal{R}_{r_{0}}$ is compact, contains both $\theta_{0}$ and $\theta^{\star}$, and lies in the identifiable affine space $\theta^{\star}+H$.

For each design $D\in\Delta(\mathcal{E})$, define

$$T^{\star}(D) \doteq \operatorname{tr}\!\Big(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\Big),\qquad T^{0}(D) \doteq \operatorname{tr}\!\Big(I(\theta_{0})\Sigma_{D}(\theta_{0})^{\dagger}\Big).$$
First, we prove a sandwich bound for $I(\theta^{\star})$. By [Lemma 3](https://arxiv.org/html/2606.19607#Thmlemma3), for all $\theta\in\Theta$,

$$m_{I}\,I(\theta^{\star}) \ \preceq\ I(\theta) \ \preceq\ M_{I}\,I(\theta^{\star}).$$

Applying this at $\theta=\theta_{0}$ and rearranging gives

$$M_{I}^{-1}\,I(\theta_{0}) \ \preceq\ I(\theta^{\star}) \ \preceq\ m_{I}^{-1}\,I(\theta_{0}).$$

Define $C_{I}\doteq\max\{M_{I},\;m_{I}^{-1}\}$. Then

$$C_{I}^{-1}\,I(\theta_{0}) \ \preceq\ I(\theta^{\star}) \ \preceq\ C_{I}\,I(\theta_{0}). \tag{51}$$
Second, we prove a sandwich bound for $\Sigma_{D}^{\dagger}(\theta^{\star})$. Fix $D\in\Delta(\mathcal{E})$. By [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2), the map

$$\theta\mapsto g(e;\theta)$$

is continuous for every edge $e$, and since the admissible edge set is finite, the map

$$(\theta,D)\mapsto\Sigma_{D}(\theta)=\mathbb{E}_{e\sim D}[g(e;\theta)g(e;\theta)^{\top}]$$

is continuous on $\mathcal{R}_{r_{0}}\times\Delta(\mathcal{E})$.

Moreover, by [Assumption 4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a) applied on $\mathcal{R}_{r_{0}}$, there exists $\mu_{r_{0}}>0$ such that

$$v^{\top}\Sigma_{D}(\theta)v \geq \mu_{r_{0}}\|v\|_{2}^{2},\qquad \forall v\in H,\ \forall\theta\in\mathcal{R}_{r_{0}},\ \forall D\in\Delta(\mathcal{E}).$$

Hence, restricted to $H$, every matrix $\Sigma_{D}(\theta)$ is positive definite on $\mathcal{R}_{r_{0}}\times\Delta(\mathcal{E})$.

Consider the compact set

$$\mathcal{K}_{r_{0}} \doteq \Bigl\{(\theta,D):\theta\in\mathcal{R}_{r_{0}},\ D\in\Delta(\mathcal{E})\Bigr\}.$$

On $\mathcal{K}_{r_{0}}$, define

$$\Lambda_{+}(\theta,D) \doteq \lambda_{\max}\!\Big(\Sigma_{D}(\theta_{0})^{-1/2}\Sigma_{D}(\theta)\Sigma_{D}(\theta_{0})^{-1/2}\Big),\qquad \Lambda_{-}(\theta,D) \doteq \lambda_{\max}\!\Big(\Sigma_{D}(\theta)^{-1/2}\Sigma_{D}(\theta_{0})\Sigma_{D}(\theta)^{-1/2}\Big),$$

where all matrices are understood as operators on $H$. Because $\Sigma_{D}(\theta)$ is continuous and positive definite on $H$, both $\Lambda_{+}$ and $\Lambda_{-}$ are continuous on the compact set $\mathcal{K}_{r_{0}}$ and therefore attain finite maxima. Let

$$C_{\Sigma}(r_{0}) \doteq \max\Big\{\sup_{(\theta,D)\in\mathcal{K}_{r_{0}}}\Lambda_{+}(\theta,D),\;\sup_{(\theta,D)\in\mathcal{K}_{r_{0}}}\Lambda_{-}(\theta,D)\Big\} < \infty.$$

Then for every $\theta\in\mathcal{R}_{r_{0}}$ and every $D\in\Delta(\mathcal{E})$,

$$C_{\Sigma}(r_{0})^{-1}\,\Sigma_{D}(\theta_{0}) \ \preceq\ \Sigma_{D}(\theta) \ \preceq\ C_{\Sigma}(r_{0})\,\Sigma_{D}(\theta_{0}) \qquad\text{on } H.$$

Applying this at $\theta=\theta^{\star}$ yields

$$C_{\Sigma}(r_{0})^{-1}\,\Sigma_{D}(\theta_{0}) \ \preceq\ \Sigma_{D}(\theta^{\star}) \ \preceq\ C_{\Sigma}(r_{0})\,\Sigma_{D}(\theta_{0}) \qquad\text{on } H. \tag{52}$$
Since both matrices are positive definite on $H$, pseudoinverse monotonicity on $H$ gives

$$C_{\Sigma}(r_{0})^{-1}\,\Sigma_{D}(\theta_{0})^{\dagger} \ \preceq\ \Sigma_{D}(\theta^{\star})^{\dagger} \ \preceq\ C_{\Sigma}(r_{0})\,\Sigma_{D}(\theta_{0})^{\dagger} \qquad\text{on } H. \tag{53}$$
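The two-sided comparison in (52) and (53) can be illustrated for a single pair of matrices. The sketch below is a hedged example: `Sigma0` and `Sigma_star` are random positive definite stand-ins for $\Sigma_{D}(\theta_{0})$ and $\Sigma_{D}(\theta^{\star})$ with $H=\mathbb{R}^{d}$, and the constant is computed for this one pair rather than as a supremum over $\mathcal{K}_{r_{0}}$. The helper `rel_lambda_max` is hypothetical.

```python
import numpy as np

def rel_lambda_max(S_num, S_den):
    """lambda_max(S_den^{-1/2} S_num S_den^{-1/2}) for positive definite S_den."""
    L_inv = np.linalg.inv(np.linalg.cholesky(S_den))
    return np.linalg.eigvalsh(L_inv @ S_num @ L_inv.T).max()

def min_eig(M):
    return np.linalg.eigvalsh((M + M.T) / 2).min()

rng = np.random.default_rng(3)
d = 4
A0 = rng.normal(size=(d, d))
P = 0.3 * rng.normal(size=(d, d))
Sigma0 = A0 @ A0.T + 0.5 * np.eye(d)                  # stand-in for Sigma_D(theta_0)
Sigma_star = (A0 + P) @ (A0 + P).T + 0.5 * np.eye(d)  # stand-in for Sigma_D(theta*)

lam_plus = rel_lambda_max(Sigma_star, Sigma0)   # Lambda_+ for this pair
lam_minus = rel_lambda_max(Sigma0, Sigma_star)  # Lambda_- for this pair
C = max(lam_plus, lam_minus)

# Sandwich (52) and its inverse counterpart (53); each gap should be PSD
gaps = [
    min_eig(C * Sigma0 - Sigma_star),
    min_eig(Sigma_star - Sigma0 / C),
    min_eig(C * np.linalg.inv(Sigma0) - np.linalg.inv(Sigma_star)),
    min_eig(np.linalg.inv(Sigma_star) - np.linalg.inv(Sigma0) / C),
]
print(C, gaps)
```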
Next, we prove the trace bound ([9](https://arxiv.org/html/2606.19607#S3.E9)). Using ([51](https://arxiv.org/html/2606.19607#A4.E51)) and ([53](https://arxiv.org/html/2606.19607#A4.E53)), together with monotonicity of the trace against PSD matrices, we obtain

$$T^{\star}(D) = \operatorname{tr}\!\Big(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\Big) \leq C_{I}\,C_{\Sigma}(r_{0})\,\operatorname{tr}\!\Big(I(\theta_{0})\Sigma_{D}(\theta_{0})^{\dagger}\Big) = C_{I}\,C_{\Sigma}(r_{0})\,T^{0}(D).$$

Similarly,

$$T^{\star}(D) = \operatorname{tr}\!\Big(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\Big) \geq C_{I}^{-1}C_{\Sigma}(r_{0})^{-1}\,\operatorname{tr}\!\Big(I(\theta_{0})\Sigma_{D}(\theta_{0})^{\dagger}\Big) = C_{I}^{-1}C_{\Sigma}(r_{0})^{-1}\,T^{0}(D).$$

Define

$$C_{\mathrm{plug}}(r_{0}) \doteq C_{I}\,C_{\Sigma}(r_{0}).$$

Then for every $D\in\Delta(\mathcal{E})$,

$$C_{\mathrm{plug}}(r_{0})^{-1}\,T^{0}(D) \;\leq\; T^{\star}(D) \;\leq\; C_{\mathrm{plug}}(r_{0})\,T^{0}(D),$$

which is exactly ([9](https://arxiv.org/html/2606.19607#S3.E9)).

Finally, we prove the oracle comparison ([11](https://arxiv.org/html/2606.19607#S3.E11)). Since $D_{\theta_{0}}$ minimizes the plug-in criterion,

$$T^{0}(D_{\theta_{0}}) = \inf_{D\in\Delta(\mathcal{E})}T^{0}(D).$$

Applying the upper side of ([9](https://arxiv.org/html/2606.19607#S3.E9)) to $D_{\theta_{0}}$ gives

$$T^{\star}(D_{\theta_{0}}) \leq C_{\mathrm{plug}}(r_{0})\,T^{0}(D_{\theta_{0}}) = C_{\mathrm{plug}}(r_{0})\,\inf_{D\in\Delta(\mathcal{E})}T^{0}(D).$$

On the other hand, for every $D\in\Delta(\mathcal{E})$, the lower side of ([9](https://arxiv.org/html/2606.19607#S3.E9)) implies

$$T^{0}(D) \leq C_{\mathrm{plug}}(r_{0})\,T^{\star}(D).$$

Taking the infimum over $D$ yields

$$\inf_{D\in\Delta(\mathcal{E})}T^{0}(D) \leq C_{\mathrm{plug}}(r_{0})\,\inf_{D\in\Delta(\mathcal{E})}T^{\star}(D).$$

Combining the last two displays, we obtain

$$T^{\star}(D_{\theta_{0}}) \leq C_{\mathrm{plug}}(r_{0})^{2}\inf_{D\in\Delta(\mathcal{E})}T^{\star}(D),$$

that is,

$$\operatorname{tr}\!\Big(I(\theta^{\star})\Sigma_{D_{\theta_{0}}}(\theta^{\star})^{\dagger}\Big) \;\leq\; C_{\mathrm{plug}}(r_{0})^{2}\,\inf_{D\in\Delta(\mathcal{E})}\operatorname{tr}\!\Big(I(\theta^{\star})\Sigma_{D}(\theta^{\star})^{\dagger}\Big).$$

This proves ([11](https://arxiv.org/html/2606.19607#S3.E11)). ∎
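The oracle comparison can be mimicked on a finite family of candidate designs. The following sketch is illustrative only: the features at $\theta_{0}$ and $\theta^{\star}$, the Fisher matrices, and the candidate designs are all randomly generated assumptions, the infimum over $\Delta(\mathcal{E})$ is replaced by a minimum over `K` sampled designs, and `C_plug` is the empirical two-sided ratio over that family rather than the theorem's constant.

```python
import numpy as np

def criterion(fisher, feats, design):
    """tr(I * Sigma_D^dagger) with Sigma_D = sum_e D(e) g_e g_e^T."""
    sigma = (feats * design[:, None]).T @ feats
    return np.trace(fisher @ np.linalg.pinv(sigma))

rng = np.random.default_rng(4)
m, d, K = 10, 3, 50
G0 = rng.normal(size=(m, d))             # hypothetical features g(e; theta_0)
Gs = G0 + 0.2 * rng.normal(size=(m, d))  # hypothetical features g(e; theta*)
B0 = rng.normal(size=(d, d))
Bs = B0 + 0.2 * rng.normal(size=(d, d))
I0 = B0 @ B0.T                           # stand-in for I(theta_0)
Is = Bs @ Bs.T                           # stand-in for I(theta*)
designs = rng.dirichlet(np.ones(m), size=K)  # K candidate designs on the simplex

T0 = np.array([criterion(I0, G0, D) for D in designs])  # plug-in criterion T^0(D)
Ts = np.array([criterion(Is, Gs, D) for D in designs])  # oracle criterion T*(D)

C_plug = max((Ts / T0).max(), (T0 / Ts).max())  # empirical sandwich constant
k0 = T0.argmin()                                # plug-in optimal design in the family

# Plug-in design is within C_plug^2 of the best oracle value in the family
print(Ts[k0], C_plug**2 * Ts.min())
```

The guarantee is only as tight as the sandwich constant, so a reference policy far from $\theta^{\star}$ (large $r_{0}$) can make the plug-in design noticeably suboptimal.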

## Appendix E Technical lemmas

This appendix collects technical lemmas that are used throughout the paper.

###### Lemma 9 (Uniform properties of the DPO logit $u_{\theta}$).

Suppose [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2) holds. Define, for $e=(x,y^{+},y^{-})$,

$$u_{\theta}(e) := \beta\!\left(\log\pi_{\theta}(y^{+}\mid x)-\log\pi_{\theta}(y^{-}\mid x)-\log\pi_{0}(y^{+}\mid x)+\log\pi_{0}(y^{-}\mid x)\right).$$

Then the following hold uniformly over $\theta\in\Theta$ and admissible $e$:

1. (i) Uniform boundedness of $u_{\theta}$: $|u_{\theta}(e)|\leq U_{\max}$.
2. (ii) Uniform boundedness of gradient: $\|\nabla_{\theta}u_{\theta}(e)\|_{2}\leq G_{u}$.
3. (iii) Uniform boundedness of Hessian: $\|\nabla_{\theta}^{2}u_{\theta}(e)\|_{\text{op}}\leq H_{u}$.
4. (iv) Lipschitz continuity in $\theta$: for any $\theta,\theta'\in\Theta$, $|u_{\theta}(e)-u_{\theta'}(e)|\leq G_{u}\|\theta-\theta'\|_{2}$ and $\|\nabla_{\theta}u_{\theta}(e)-\nabla_{\theta}u_{\theta'}(e)\|_{2}\leq H_{u}\|\theta-\theta'\|_{2}$.

A valid choice of constants is

$$G_{u}=4\beta\alpha_{1},\qquad H_{u}=2\beta(2\alpha_{2}+4\alpha_{1}^{2}),$$

and, if there exists $\underline{p}>0$ such that $\pi_{\theta}(y\mid x)\geq\underline{p}$ and $\pi_{0}(y\mid x)\geq\underline{p}$ for all $(x,y)$ and $\theta$, then

$$U_{\max}=2\beta\log(1/\underline{p}^{2}).$$

###### Proof of [Lemma 9](https://arxiv.org/html/2606.19607#Thmlemma9).

Write

$$\log\pi_{\theta}(y\mid x) = f_{\theta}(\phi(x,y))-\log\!\sum_{y'}\exp(f_{\theta}(\phi(x,y'))).$$

Hence

$$\nabla_{\theta}\log\pi_{\theta}(y\mid x) = \nabla_{\theta}f_{\theta}(\phi(x,y))-\mathbb{E}_{y'\sim\pi_{\theta}(\cdot\mid x)}\!\big[\nabla_{\theta}f_{\theta}(\phi(x,y'))\big].$$

So, by [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2) ($\|\nabla_{\theta}f_{\theta}\|_{2}\leq\alpha_{1}$) and the triangle inequality,

$$\|\nabla_{\theta}\log\pi_{\theta}(y\mid x)\|_{2} \leq 2\alpha_{1}.$$

Therefore, since the reference terms $\log\pi_{0}$ do not depend on $\theta$,

$$\nabla_{\theta}u_{\theta}(e) = \beta\!\left(\nabla_{\theta}\log\pi_{\theta}(y^{+}\mid x)-\nabla_{\theta}\log\pi_{\theta}(y^{-}\mid x)\right),$$

and

$$\|\nabla_{\theta}u_{\theta}(e)\|_{2} \leq \beta(2\alpha_{1}+2\alpha_{1}) = 4\beta\alpha_{1} =: G_{u}.$$

This proves (ii).

For the Hessian, use the standard softmax identity:

$$\nabla_{\theta}^{2}\log\pi_{\theta}(y\mid x) = \nabla_{\theta}^{2}f_{\theta}(\phi(x,y))-\mathbb{E}_{y'}[\nabla_{\theta}^{2}f_{\theta}(\phi(x,y'))]-\operatorname{Cov}_{y'\sim\pi_{\theta}(\cdot\mid x)}\!\big(\nabla_{\theta}f_{\theta}(\phi(x,y'))\big).$$

Hence

$$\|\nabla_{\theta}^{2}\log\pi_{\theta}(y\mid x)\|_{\text{op}} \leq \alpha_{2}+\alpha_{2}+\underbrace{\|\operatorname{Cov}(\nabla_{\theta}f)\|_{\text{op}}}_{\leq 4\alpha_{1}^{2}} \leq 2\alpha_{2}+4\alpha_{1}^{2}.$$

Thus

$$\|\nabla_{\theta}^{2}u_{\theta}(e)\|_{\text{op}} \leq \beta\Big(\|\nabla_{\theta}^{2}\log\pi_{\theta}(y^{+}\mid x)\|_{\text{op}}+\|\nabla_{\theta}^{2}\log\pi_{\theta}(y^{-}\mid x)\|_{\text{op}}\Big) \leq 2\beta(2\alpha_{2}+4\alpha_{1}^{2}) =: H_{u},$$

proving (iii).

For Lipschitzness in \(iv\), apply mean value theorem inθ\\theta:

\|uθ​\(e\)−uθ′​\(e\)\|≤supθ~‖∇θuθ~​\(e\)‖2​‖θ−θ′‖2≤Gu​‖θ−θ′‖2,\|u\_\{\\theta\}\(e\)\-u\_\{\\theta^\{\\prime\}\}\(e\)\|\\leq\\sup\_\{\\tilde\{\\theta\}\}\\\|\\nabla\_\{\\theta\}u\_\{\\tilde\{\\theta\}\}\(e\)\\\|\_\{2\}\\,\\\|\\theta\-\\theta^\{\\prime\}\\\|\_\{2\}\\leq G\_\{u\}\\\|\\theta\-\\theta^\{\\prime\}\\\|\_\{2\},‖∇θuθ​\(e\)−∇θuθ′​\(e\)‖2≤supθ~‖∇θ2uθ~​\(e\)‖op​‖θ−θ′‖2≤Hu​‖θ−θ′‖2\.\\\|\\nabla\_\{\\theta\}u\_\{\\theta\}\(e\)\-\\nabla\_\{\\theta\}u\_\{\\theta^\{\\prime\}\}\(e\)\\\|\_\{2\}\\leq\\sup\_\{\\tilde\{\\theta\}\}\\\|\\nabla\_\{\\theta\}^\{2\}u\_\{\\tilde\{\\theta\}\}\(e\)\\\|\_\{\\text\{op\}\}\\,\\\|\\theta\-\\theta^\{\\prime\}\\\|\_\{2\}\\leq H\_\{u\}\\\|\\theta\-\\theta^\{\\prime\}\\\|\_\{2\}\.
Finally, for \(i\), ifπθ,π0≥p¯\>0\\pi\_\{\\theta\},\\pi\_\{0\}\\geq\\underline\{p\}\>0, then

\|logπθ\(y\+∣x\)−logπθ\(y−∣x\)\|≤2log\(1/p¯\),\\left\|\\log\\pi\_\{\\theta\}\(y^\{\+\}\\mid x\)\-\\log\\pi\_\{\\theta\}\(y^\{\-\}\\mid x\)\\right\|\\leq 2\\log\(1/\\underline\{p\}\),and similarly forπ0\\pi\_\{0\}, hence

\|uθ​\(e\)\|≤2​β​log⁡\(1/p¯2\)=Umax\.\|u\_\{\\theta\}\(e\)\|\\leq 2\\beta\\log\(1/\\underline\{p\}^\{2\}\)=U\_\{\\max\}\.∎

###### Lemma 10\(Global lower bound on logistic curvature from uniform logit bound\)\.

Suppose [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2) holds. Define

$$\kappa_{0}:=\sigma(U_{\max})\bigl(1-\sigma(U_{\max})\bigr).$$

Then

$$\sigma'\bigl(u_{\theta}(e)\bigr)\geq\kappa_{0}>0,\qquad\forall\,\theta\in\Theta,\ \forall\,e.$$

###### Proof of [Lemma 10](https://arxiv.org/html/2606.19607#Thmlemma10).

For the logistic sigmoid $\sigma(t)=1/(1+e^{-t})$,

$$\sigma'(t)=\sigma(t)\bigl(1-\sigma(t)\bigr)=\frac{e^{-t}}{(1+e^{-t})^{2}}=\frac{1}{2+e^{t}+e^{-t}}.$$

Hence $\sigma'(t)$ is an even function of $t$ and decreases as $|t|$ increases. Therefore, on the interval $[-U_{\max},U_{\max}]$,

$$\inf_{|t|\leq U_{\max}}\sigma'(t)=\sigma'(U_{\max})=\sigma(U_{\max})\bigl(1-\sigma(U_{\max})\bigr)=\kappa_{0}.$$

Since $|u_{\theta}(e)|\leq U_{\max}$ uniformly over $(\theta,e)$, we have

$$\sigma'\bigl(u_{\theta}(e)\bigr)\geq\inf_{|t|\leq U_{\max}}\sigma'(t)=\kappa_{0},$$

for all $\theta\in\Theta$ and all admissible $e$. Also $\sigma(U_{\max})\in(0,1)$, so $\kappa_{0}>0$. ∎
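As a quick numerical sanity check of the argument above, the following sketch compares the two expressions for $\sigma'(t)$ on a grid and confirms that the minimum over $[-U_{\max},U_{\max}]$ sits at the endpoints. The value `U_MAX = 3.0` and the grid resolution are illustrative assumptions, not constants from the paper.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_prime(t: float) -> float:
    """Derivative of the sigmoid, sigma(t) * (1 - sigma(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)


def kappa0(u_max: float) -> float:
    """Curvature floor kappa_0 = sigma'(U_max) from Lemma 10."""
    return sigmoid_prime(u_max)


# U_MAX = 3.0 is an illustrative value, not a constant from the paper.
U_MAX = 3.0
GRID = [-U_MAX + 2.0 * U_MAX * k / 1000 for k in range(1001)]

# The closed form 1 / (2 + e^t + e^{-t}) should agree with sigma * (1 - sigma).
closed_form_gap = max(
    abs(sigmoid_prime(t) - 1.0 / (2.0 + math.exp(t) + math.exp(-t))) for t in GRID
)

# The minimum over the interval should be attained at the endpoints +-U_MAX.
min_on_grid = min(sigmoid_prime(t) for t in GRID)
```

Because $\kappa_{0}$ decays roughly like $e^{-U_{\max}}$, a larger logit bound makes this curvature floor small quickly.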

###### Lemma 11\(Uniform boundedness and Lipschitzness of the DPO loss class\)\.

Suppose [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2) holds, and Lemma [9](https://arxiv.org/html/2606.19607#Thmlemma9) gives

$$|u_{\theta}(e)|\leq U_{\max},\qquad\|\nabla_{\theta}u_{\theta}(e)\|_{2}\leq G_{u},\quad\forall\theta\in\Theta,\ \forall e.$$

For $z=(e,a)$, $a\in\{0,1\}$, define

$$\ell_{\theta}(z):=\ell(a,u_{\theta}(e)),\qquad\ell(a,u):=-a\log\sigma(u)-(1-a)\log(1-\sigma(u)).$$

Then:

1. (i) (Uniform envelope) for all $\theta\in\Theta$, $z$, $$0\leq\ell_{\theta}(z)\leq B_{\ell},\qquad B_{\ell}:=\log(1+e^{U_{\max}}).$$
2. (ii) (Uniform Lipschitz in $\theta$) for all $\theta,\theta'\in\Theta$, $z$, $$|\ell_{\theta}(z)-\ell_{\theta'}(z)|\leq L_{\ell}\|\theta-\theta'\|_{2},\qquad L_{\ell}:=G_{u}.$$

###### Proof\.

Use the equivalent form

$$\ell(a,u)=\log(1+e^{u})-au.$$

Hence, if $|u|\leq U_{\max}$,

$$0\leq\ell(a,u)\leq\log(1+e^{U_{\max}})=B_{\ell},$$

which proves (i) since $|u_{\theta}(e)|\leq U_{\max}$ uniformly.

Next,

$$\partial_{u}\ell(a,u)=\sigma(u)-a,\qquad|\partial_{u}\ell(a,u)|\leq 1,$$

so for any $u,u'\in\mathbb{R}$,

$$|\ell(a,u)-\ell(a,u')|\leq|u-u'|.$$

Therefore

$$|\ell_{\theta}(z)-\ell_{\theta'}(z)|\leq|u_{\theta}(e)-u_{\theta'}(e)|.$$

By the mean-value theorem in parameter space,

$$|u_{\theta}(e)-u_{\theta'}(e)|\leq\sup_{\tilde{\theta}\in\Theta}\|\nabla_{\theta}u_{\tilde{\theta}}(e)\|_{2}\,\|\theta-\theta'\|_{2}\leq G_{u}\|\theta-\theta'\|_{2}.$$

So (ii) holds with $L_{\ell}=G_{u}$. ∎
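The two facts used in this proof, the equivalent form of the logistic loss and its unit Lipschitz constant in $u$, can be checked numerically. The sketch below does so on a grid, with `U_MAX = 4.0` as an illustrative logit bound that is not taken from the paper.

```python
import math


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def loss_cross_entropy(a: int, u: float) -> float:
    """-a log sigma(u) - (1 - a) log(1 - sigma(u))."""
    s = sigmoid(u)
    return -a * math.log(s) - (1 - a) * math.log(1.0 - s)


def loss_softplus(a: int, u: float) -> float:
    """Equivalent form log(1 + e^u) - a u used in the proof."""
    return math.log1p(math.exp(u)) - a * u


# U_MAX = 4.0 is an illustrative logit bound, not a value from the paper.
U_MAX = 4.0
B_ELL = math.log1p(math.exp(U_MAX))
GRID = [-U_MAX + 2.0 * U_MAX * k / 400 for k in range(401)]

# Gap between the two forms of the loss; should be at rounding level.
form_gap = max(
    abs(loss_cross_entropy(a, u) - loss_softplus(a, u)) for a in (0, 1) for u in GRID
)

# Envelope from part (i): 0 <= loss <= B_ell on |u| <= U_MAX.
max_loss = max(loss_softplus(a, u) for a in (0, 1) for u in GRID)
min_loss = min(loss_softplus(a, u) for a in (0, 1) for u in GRID)

# Largest slope between neighbouring grid points; should not exceed 1.
max_slope = max(
    abs(loss_softplus(a, GRID[k + 1]) - loss_softplus(a, GRID[k]))
    / (GRID[k + 1] - GRID[k])
    for a in (0, 1)
    for k in range(len(GRID) - 1)
)
```

The envelope is attained at the two worst cases, $(a,u)=(0,U_{\max})$ and $(a,u)=(1,-U_{\max})$, which is why `max_loss` matches `B_ELL` up to rounding.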

###### Lemma 12\(Hessian regularity of the DPO loss\)\.

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1) and [2](https://arxiv.org/html/2606.19607#Thmassumption2) hold. Let $U_{\max},G_{u},H_{u}$ be the constants from [Lemma 9](https://arxiv.org/html/2606.19607#Thmlemma9) such that for all $\theta\in\Theta$ and all admissible $e$,

$$|u_{\theta}(e)|\leq U_{\max},\qquad\|\nabla_{\theta}u_{\theta}(e)\|_{2}\leq G_{u},\qquad\|\nabla_{\theta}^{2}u_{\theta}(e)\|_{\text{op}}\leq H_{u}.$$

For $z=(e,a)$ with $a\in\{0,1\}$, define

$$\ell_{\theta}(z):=\ell(a,u_{\theta}(e)),\qquad\ell(a,u)=-a\log\sigma(u)-(1-a)\log(1-\sigma(u)).$$

Let $P_{H}$ denote the orthogonal projector onto $H$. Then the following hold uniformly over $\Theta$:

1. (i) (Per-sample projected Hessian envelope). $$\sup_{\theta\in\Theta,\ z}\ \|P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}\|_{\text{op}}\leq M_{H},\qquad M_{H}:=G_{u}^{2}+H_{u}.$$
2. (ii) (Per-sample projected Hessian Lipschitzness). There exists a constant $L_{H} < \infty$, depending only on the fixed constants in the standing assumptions, such that for all $\theta,\theta'\in\Theta$, $$\sup_{z}\ \|P_{H}(\nabla_{\theta}^{2}\ell_{\theta}(z)-\nabla_{\theta'}^{2}\ell_{\theta'}(z))P_{H}\|_{\text{op}}\leq L_{H}\|\theta-\theta'\|_{2}.$$ In particular, one may take $$L_{H}:=\frac{1}{4}G_{u}^{3}+\frac{3}{4}G_{u}H_{u}+\Gamma_{u},$$ where $$\Gamma_{u}\doteq\sup_{\theta\in\Theta}\ \sup_{e}\ \|\nabla_{\theta}^{3}u_{\theta}(e)\|_{\text{op}},$$ which is finite under [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2).
3. (iii) (Uniform variance proxy). $$\sup_{\theta\in\Theta}\left\|\mathbb{E}\left[\left(P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}-\mathbb{E}[P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}]\right)^{2}\right]\right\|_{\text{op}}\leq V_{H},\qquad V_{H}:=4M_{H}^{2}.$$

###### Proof of [Lemma 12](https://arxiv.org/html/2606.19607#Thmlemma12).

Fix $z=(e,a)$ with $a\in\{0,1\}$ and $\theta\in\Theta$. Write

$$u:=u_{\theta}(e),\qquad g:=\nabla_{\theta}u_{\theta}(e),\qquad U:=\nabla_{\theta}^{2}u_{\theta}(e).$$

For logistic cross-entropy,

$$\nabla_{\theta}^{2}\ell_{\theta}(z)=\sigma'(u)\,gg^{\top}+(\sigma(u)-a)\,U.\tag{54}$$

(i) Since $0\leq\sigma'(u)\leq 1/4\leq 1$ and $|\sigma(u)-a|\leq 1$,

$$\|\nabla_{\theta}^{2}\ell_{\theta}(z)\|_{\text{op}}\leq\|gg^{\top}\|_{\text{op}}+\|U\|_{\text{op}}.$$

Moreover $\|gg^{\top}\|_{\text{op}}=\|g\|_{2}^{2}\leq G_{u}^{2}$ and $\|U\|_{\text{op}}\leq H_{u}$, hence

$$\|\nabla_{\theta}^{2}\ell_{\theta}(z)\|_{\text{op}}\leq G_{u}^{2}+H_{u}=M_{H}.$$

Finally, $\|P_{H}AP_{H}\|_{\text{op}}\leq\|A\|_{\text{op}}$ for any matrix $A$, so $\|P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}\|_{\text{op}}\leq M_{H}$.

(ii) Let $(u,g,U)$ and $(u',g',U')$ correspond to $\theta$ and $\theta'$, respectively. Subtracting ([54](https://arxiv.org/html/2606.19607#A5.E54)) at $\theta$ and $\theta'$ and regrouping yields

$$\nabla_{\theta}^{2}\ell_{\theta}(z)-\nabla_{\theta'}^{2}\ell_{\theta'}(z)=(\sigma'(u)-\sigma'(u'))\,gg^{\top}+\sigma'(u')\,(gg^{\top}-g'g'^{\top})+(\sigma(u)-\sigma(u'))\,U+(\sigma(u')-a)\,(U-U').$$

Using $|\sigma'(u)-\sigma'(u')|\leq\tfrac{1}{4}|u-u'|$, $|\sigma(u)-\sigma(u')|\leq\tfrac{1}{4}|u-u'|$, and $|u-u'|\leq G_{u}\|\theta-\theta'\|_{2}$, we obtain

$$\|(\sigma'(u)-\sigma'(u'))\,gg^{\top}\|_{\text{op}}\leq\tfrac{1}{4}G_{u}^{3}\|\theta-\theta'\|_{2},\qquad\|(\sigma(u)-\sigma(u'))\,U\|_{\text{op}}\leq\tfrac{1}{4}G_{u}H_{u}\|\theta-\theta'\|_{2}.$$

Next, since $gg^{\top}-g'g'^{\top}=(g-g')g^{\top}+g'(g-g')^{\top}$,

$$\|gg^{\top}-g'g'^{\top}\|_{\text{op}}\leq\|g-g'\|_{2}\,\|g\|_{2}+\|g'\|_{2}\,\|g-g'\|_{2}\leq 2G_{u}\,\|g-g'\|_{2}.$$

Because $\|g-g'\|_{2}\leq H_{u}\|\theta-\theta'\|_{2}$, it follows that

$$\|\sigma'(u')(gg^{\top}-g'g'^{\top})\|_{\text{op}}\leq\tfrac{1}{4}\cdot 2G_{u}H_{u}\|\theta-\theta'\|_{2}=\tfrac{1}{2}G_{u}H_{u}\|\theta-\theta'\|_{2}.$$

Finally, $|\sigma(u')-a|\leq 1$ and the mean-value theorem gives

$$\|U-U'\|_{\text{op}}=\|\nabla_{\theta}^{2}u_{\theta}(e)-\nabla_{\theta'}^{2}u_{\theta'}(e)\|_{\text{op}}\leq\Gamma_{u}\|\theta-\theta'\|_{2}.$$

Therefore $\|(\sigma(u')-a)(U-U')\|_{\text{op}}\leq\Gamma_{u}\|\theta-\theta'\|_{2}$. Summing the four bounds yields

$$\|\nabla_{\theta}^{2}\ell_{\theta}(z)-\nabla_{\theta'}^{2}\ell_{\theta'}(z)\|_{\text{op}}\leq\Big(\tfrac{1}{4}G_{u}^{3}+\tfrac{3}{4}G_{u}H_{u}+\Gamma_{u}\Big)\|\theta-\theta'\|_{2},$$

and projecting by $P_{H}$ does not increase the operator norm, proving (ii).

(iii) Define the centered matrix

$$X_{\theta}:=P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}-\mathbb{E}[P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}].$$

By (i), $\|P_{H}\nabla_{\theta}^{2}\ell_{\theta}(z)P_{H}\|_{\text{op}}\leq M_{H}$, hence $\|X_{\theta}\|_{\text{op}}\leq 2M_{H}$. Therefore,

$$\left\|\mathbb{E}[X_{\theta}^{2}]\right\|_{\text{op}}\leq\mathbb{E}\|X_{\theta}^{2}\|_{\text{op}}\leq\mathbb{E}\|X_{\theta}\|_{\text{op}}^{2}\leq 4M_{H}^{2}=V_{H},$$

uniformly in $\theta\in\Theta$. ∎
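The Hessian decomposition (54) and the envelope $M_{H}=G_{u}^{2}+H_{u}$ can be illustrated in one dimension. The sketch below uses a hypothetical scalar logit $u_{\theta}=\sin\theta$ (an assumption chosen only because its derivatives are bounded by 1, so $G_{u}=H_{u}=1$) and compares the formula with a central finite difference of the loss.

```python
import math


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def logit(theta: float) -> float:
    """Toy scalar logit u_theta = sin(theta); a hypothetical stand-in for the DPO logit."""
    return math.sin(theta)


def loss(theta: float, a: int) -> float:
    """Logistic loss log(1 + e^u) - a u evaluated at u = u_theta."""
    u = logit(theta)
    return math.log1p(math.exp(u)) - a * u


def hessian_formula(theta: float, a: int) -> float:
    """Scalar version of (54): sigma'(u) g^2 + (sigma(u) - a) U."""
    u = logit(theta)
    g = math.cos(theta)        # first derivative of u_theta
    big_u = -math.sin(theta)   # second derivative of u_theta
    s = sigmoid(u)
    return s * (1.0 - s) * g * g + (s - a) * big_u


def hessian_finite_diff(theta: float, a: int, h: float = 1e-3) -> float:
    """Central second difference of the loss in theta."""
    return (loss(theta + h, a) - 2.0 * loss(theta, a) + loss(theta - h, a)) / (h * h)


# For u = sin(theta): G_u = 1 and H_u = 1, so M_H = G_u^2 + H_u = 2.
M_H = 2.0
THETAS = [-3.0 + 0.1 * k for k in range(61)]

max_gap = max(
    abs(hessian_formula(t, a) - hessian_finite_diff(t, a)) for t in THETAS for a in (0, 1)
)
max_abs_hessian = max(abs(hessian_formula(t, a)) for t in THETAS for a in (0, 1))
```

In this toy case the second term $(\sigma(u)-a)\,U$ can be negative, so the per-sample Hessian need not be positive semidefinite even though its norm stays below $M_{H}$.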

###### Lemma 13 (Gram perturbation bound for $\Sigma_{D}(\theta^{\star})$).

Let $g(e;\theta)=\nabla_{\theta}u_{\theta}(e)$ and

$$\Sigma_{D}(\theta)\doteq\mathbb{E}_{e\sim D}\big[g(e;\theta)g(e;\theta)^{\top}\big].$$

Suppose that for all $e$ and all $\theta,\theta'\in\Theta$,

$$\|g(e;\theta)\|_{2}\leq G,\qquad\|g(e;\theta)-g(e;\theta')\|_{2}\leq H_{u}\|\theta-\theta'\|_{2}$$

for some constants $G,H_{u}>0$. Then for every $\theta\in\Theta$,

$$\bigl\|\Sigma_{D}(\theta)-\Sigma_{D}(\theta^{\star})\bigr\|_{\text{op}}\;\leq\;2GH_{u}\|\theta-\theta^{\star}\|_{2}\;+\;H_{u}^{2}\|\theta-\theta^{\star}\|_{2}^{2}.\tag{55}$$

###### Proof of [Lemma 13](https://arxiv.org/html/2606.19607#Thmlemma13).

Write $\Delta\theta=\theta-\theta^{\star}$ and $\Delta g(e)=g(e;\theta)-g(e;\theta^{\star})$. Using $g(e;\theta)=g(e;\theta^{\star})+\Delta g(e)$, we have the identity

$$g(e;\theta)g(e;\theta)^{\top}-g(e;\theta^{\star})g(e;\theta^{\star})^{\top}=\Delta g(e)\,g(e;\theta^{\star})^{\top}+g(e;\theta^{\star})\,\Delta g(e)^{\top}+\Delta g(e)\,\Delta g(e)^{\top}.$$

Taking expectation and applying $\|\mathbb{E}[M]\|_{\text{op}}\leq\mathbb{E}\|M\|_{\text{op}}$ yields

$$\bigl\|\Sigma_{D}(\theta)-\Sigma_{D}(\theta^{\star})\bigr\|_{\text{op}}\leq 2\,\mathbb{E}\big[\|\Delta g(e)\|_{2}\,\|g(e;\theta^{\star})\|_{2}\big]+\mathbb{E}\|\Delta g(e)\|_{2}^{2}.$$

By $\|g(e;\theta^{\star})\|_{2}\leq G$ and $\|\Delta g(e)\|_{2}\leq H_{u}\|\Delta\theta\|_{2}$ for all $e$,

$$\mathbb{E}\big[\|\Delta g(e)\|_{2}\,\|g(e;\theta^{\star})\|_{2}\big]\leq G\,\mathbb{E}\|\Delta g(e)\|_{2}\leq GH_{u}\|\Delta\theta\|_{2},\qquad\mathbb{E}\|\Delta g(e)\|_{2}^{2}\leq H_{u}^{2}\|\Delta\theta\|_{2}^{2}.$$

Combining the above inequalities gives ([55](https://arxiv.org/html/2606.19607#A5.E55)). ∎
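To see the perturbation bound (55) in action, the sketch below evaluates both sides for a toy two-dimensional gradient map. The map `feature_gradient`, the sample set standing in for the design $D$, and `THETA_STAR` are all hypothetical choices made for illustration; they satisfy the lemma's hypotheses with $G=\sqrt{2}$ and $H_{u}=1$.

```python
import math


def feature_gradient(e: float, theta: tuple) -> tuple:
    """Toy g(e; theta) in R^2 (hypothetical): norm at most sqrt(2), 1-Lipschitz in theta."""
    return (math.cos(theta[0] + e), math.sin(theta[1] + e))


def gram(theta: tuple, samples: list) -> tuple:
    """Sigma_D(theta) as a symmetric 2x2 matrix (s11, s12, s22), averaged over samples."""
    s11 = s12 = s22 = 0.0
    for e in samples:
        g1, g2 = feature_gradient(e, theta)
        s11 += g1 * g1
        s12 += g1 * g2
        s22 += g2 * g2
    n = len(samples)
    return (s11 / n, s12 / n, s22 / n)


def op_norm_sym2(m: tuple) -> float:
    """Operator norm of the symmetric 2x2 matrix [[a, b], [b, c]] via its eigenvalues."""
    a, b, c = m
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return max(abs(mean + radius), abs(mean - radius))


G, H_U = math.sqrt(2.0), 1.0
SAMPLES = [0.3 * k for k in range(20)]   # stand-in for the design D
THETA_STAR = (0.2, -0.5)


def bound_holds(theta: tuple) -> bool:
    """Check ||Sigma_D(theta) - Sigma_D(theta*)||_op <= 2 G H_u d + H_u^2 d^2."""
    s, s_star = gram(theta, SAMPLES), gram(THETA_STAR, SAMPLES)
    diff = tuple(x - y for x, y in zip(s, s_star))
    dist = math.hypot(theta[0] - THETA_STAR[0], theta[1] - THETA_STAR[1])
    rhs = 2.0 * G * H_U * dist + H_U ** 2 * dist ** 2
    return op_norm_sym2(diff) <= rhs + 1e-12


OFFSETS = (-1.0, -0.1, 0.0, 0.1, 1.0)
all_ok = all(
    bound_holds((THETA_STAR[0] + dx, THETA_STAR[1] + dy)) for dx in OFFSETS for dy in OFFSETS
)
```

For small $\|\theta-\theta^{\star}\|_{2}$ the linear term $2GH_{u}\|\theta-\theta^{\star}\|_{2}$ dominates the right-hand side, and the quadratic term only matters at larger distances.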

###### Lemma 14\(Fisher non\-degeneracy\)\.

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [3](https://arxiv.org/html/2606.19607#Thmassumption3) hold. In particular, assume

$$|f_{\theta}(\phi(x,y))|\leq\alpha_{0},\qquad\forall\theta\in\Theta,\ \forall x,\ \forall y\in\mathcal{A}(x),$$

and that $|\mathcal{A}(x)|\leq d_{\max}$ for all $x$. Then the Fisher information is uniformly non-degenerate on the identifiable subspace $H$:

$$v^{\top}I(\theta)v\;\geq\;\underline{\mu}\,\|v\|_{2}^{2},\qquad\forall v\in H,$$

where

$$\underline{\mu}\;=\;\Big(\frac{e^{-2\alpha_{0}}}{d_{\max}}\Big)^{2}\,\Delta_{g}^{2}.$$

###### Proof of [Lemma 14](https://arxiv.org/html/2606.19607#Thmlemma14).

Fix $\theta\in\Theta$ and a prompt $x$. Let

$$\psi_{\theta}(x,y)\doteq\nabla_{\theta}f_{\theta}(\phi(x,y)),\qquad Z\doteq v^{\top}\psi_{\theta}(x,Y),\qquad Y\sim\pi_{\theta}(\cdot\mid x),$$

for an arbitrary unit vector $v\in H$. Under the softmax model,

$$v^{\top}I(\theta)v=\mathbb{E}_{x}\left[\text{Var}_{y\sim\pi_{\theta}(\cdot\mid x)}\big(v^{\top}\psi_{\theta}(x,y)\big)\right].$$

First, we show that the softmax probabilities are bounded away from $0$. Since $|f_{\theta}(\phi(x,y))|\leq\alpha_{0}$, we have $\exp(f_{\theta}(\phi(x,y)))\in[e^{-\alpha_{0}},e^{\alpha_{0}}]$. Hence

$$\sum_{y'\in\mathcal{A}(x)}\exp(f_{\theta}(\phi(x,y')))\leq|\mathcal{A}(x)|\,e^{\alpha_{0}}\leq d_{\max}e^{\alpha_{0}},$$

and therefore for every $y\in\mathcal{A}(x)$,

$$\pi_{\theta}(y\mid x)=\frac{\exp(f_{\theta}(\phi(x,y)))}{\sum_{y'}\exp(f_{\theta}(\phi(x,y')))}\;\geq\;\frac{e^{-\alpha_{0}}}{d_{\max}e^{\alpha_{0}}}=\frac{e^{-2\alpha_{0}}}{d_{\max}}\doteq p_{\min}.$$

Second, we show the variance lower bound from two separated atoms. For any random variable $Z$ supported on discrete values $\{z_{y}\}$ with probabilities $\{p_{y}\}$, we have

$$\text{Var}(Z)=\frac{1}{2}\sum_{i,j}p_{i}p_{j}(z_{i}-z_{j})^{2}\;\geq\;p_{y_{1}}p_{y_{2}}(z_{y_{1}}-z_{y_{2}})^{2},$$

for any two indices $y_{1},y_{2}$. By [Assumption 3](https://arxiv.org/html/2606.19607#Thmassumption3), there exist $y_{1},y_{2}\in\mathcal{A}(x)$ such that

$$|z_{y_{1}}-z_{y_{2}}|=\Big|v^{\top}(\psi_{\theta}(x,y_{1})-\psi_{\theta}(x,y_{2}))\Big|\geq\Delta_{g}.$$

Combining with $p_{y_{1}},p_{y_{2}}\geq p_{\min}$ yields

$$\text{Var}_{Y\sim\pi_{\theta}(\cdot\mid x)}\big(v^{\top}\psi_{\theta}(x,Y)\big)\;\geq\;p_{\min}^{2}\,\Delta_{g}^{2}.$$

Using the bound above for each $x$ gives

$$v^{\top}I(\theta)v\;\geq\;p_{\min}^{2}\,\Delta_{g}^{2}=\Big(\frac{e^{-2\alpha_{0}}}{d_{\max}}\Big)^{2}\,\Delta_{g}^{2},$$

for all $\theta\in\Theta$ and all unit $v\in H$. ∎
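The two ingredients of this proof, the softmax probability floor $p_{\min}$ and the pairwise form of the variance, lend themselves to a small randomized check. In the sketch below, `ALPHA_0`, `D_MAX`, the random logits, and the random values standing in for $v^{\top}\psi_{\theta}(x,y)$ are illustrative assumptions rather than quantities from the paper.

```python
import math
import random


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def variance(probs: list, values: list) -> float:
    """Variance of a discrete random variable with the given atoms and probabilities."""
    mean = sum(p * z for p, z in zip(probs, values))
    return sum(p * (z - mean) ** 2 for p, z in zip(probs, values))


def pairwise_variance(probs: list, values: list) -> float:
    """(1/2) * sum_{i,j} p_i p_j (z_i - z_j)^2, the identity used in the proof."""
    return 0.5 * sum(
        pi * pj * (zi - zj) ** 2
        for pi, zi in zip(probs, values)
        for pj, zj in zip(probs, values)
    )


# ALPHA_0 and D_MAX are illustrative values, not constants from the paper.
ALPHA_0, D_MAX = 1.5, 6
P_MIN = math.exp(-2.0 * ALPHA_0) / D_MAX

rng = random.Random(0)
floor_ok = identity_ok = two_atom_ok = True
for _ in range(200):
    logits = [rng.uniform(-ALPHA_0, ALPHA_0) for _ in range(D_MAX)]
    values = [rng.uniform(-1.0, 1.0) for _ in range(D_MAX)]  # stand-ins for v^T psi
    probs = softmax(logits)
    var = variance(probs, values)
    floor_ok &= min(probs) >= P_MIN
    identity_ok &= abs(var - pairwise_variance(probs, values)) < 1e-12
    # The most separated pair of atoms gives the lower bound p_min^2 * gap^2.
    gap = max(values) - min(values)
    two_atom_ok &= var >= P_MIN ** 2 * gap ** 2
```

The resulting constant $\underline{\mu}=p_{\min}^{2}\Delta_{g}^{2}$ is conservative: it uses only one pair of atoms and the worst-case probability floor, so realized variances in this toy setup are typically much larger.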

###### Lemma 15 (Covariance bound for $\nabla L_{n}(\theta^{\star})$).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1) and [2](https://arxiv.org/html/2606.19607#Thmassumption2) hold, and the pairwise labels are well-specified:

$$a_{i}\mid e_{i}\sim\mathrm{Bernoulli}\big(\sigma(u_{i}(\theta^{\star}))\big),\qquad i=1,\dots,n,$$

with conditional independence across $i$ given $\{e_{i}\}_{i=1}^{n}$. Then

$$\mathbb{E}\left[\nabla L_{n}(\theta^{\star})\,\middle|\,\{e_{i}\}_{i=1}^{n}\right]=0,\tag{56}$$

and

$$\mathbb{E}\left[\nabla L_{n}(\theta^{\star})\nabla L_{n}(\theta^{\star})^{\top}\,\middle|\,\{e_{i}\}_{i=1}^{n}\right]\preceq\frac{1}{4n}\,\widehat{\Sigma}_{n}(\theta^{\star}).\tag{57}$$

Moreover, we have

$$\mathbb{E}\left[\big\|\Sigma_{D}(\theta^{\star})^{\dagger/2}\nabla L_{n}(\theta^{\star})\big\|_{2}^{2}\right]\leq\frac{\dim(H)}{4n}.\tag{58}$$

###### Proof of [Lemma 15](https://arxiv.org/html/2606.19607#Thmlemma15).

For logistic loss, $\partial\ell(a,u)/\partial u=\sigma(u)-a$. Hence for any $\theta\in\Theta$,

$$\nabla L_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\big(\sigma(u_{\theta}(e_{i}))-a_{i}\big)\,g_{\theta}(e_{i}).$$

In particular, at $\theta^{\star}$,

$$\nabla L_{n}(\theta^{\star})=\frac{1}{n}\sum_{i=1}^{n}\big(\sigma(u_{\theta^{\star}}(e_{i}))-a_{i}\big)\,g_{\theta^{\star}}(e_{i}).$$

Condition on $\{e_{i}\}_{i=1}^{n}$. By realizability,

$$\mathbb{E}[a_{i}\mid e_{i}]=\sigma\big(u_{\theta^{\star}}(e_{i})\big),$$

so each summand has conditional mean zero, which implies ([56](https://arxiv.org/html/2606.19607#A5.E56)). Moreover, by conditional independence across $i$, for $i\neq j$,

$$\mathbb{E}\left[\big(\sigma(u_{\theta^{\star}}(e_{i}))-a_{i}\big)\big(\sigma(u_{\theta^{\star}}(e_{j}))-a_{j}\big)\,\middle|\,\{e_{k}\}_{k=1}^{n}\right]=0.$$

Therefore,

$$\mathbb{E}\left[\nabla L_{n}(\theta^{\star})\nabla L_{n}(\theta^{\star})^{\top}\,\middle|\,\{e_{i}\}_{i=1}^{n}\right]=\frac{1}{n^{2}}\sum_{i=1}^{n}\text{Var}(a_{i}\mid e_{i})\,g_{\theta^{\star}}(e_{i})\,g_{\theta^{\star}}(e_{i})^{\top}.$$

For a Bernoulli random variable with success probability $\sigma(u)$,

$$\text{Var}(a\mid e)=\sigma(u)(1-\sigma(u))=\sigma'(u)\leq\frac{1}{4}.$$

Hence,

$$\mathbb{E}\left[\nabla L_{n}(\theta^{\star})\nabla L_{n}(\theta^{\star})^{\top}\,\middle|\,\{e_{i}\}_{i=1}^{n}\right]\preceq\frac{1}{4n^{2}}\sum_{i=1}^{n}g_{\theta^{\star}}(e_{i})\,g_{\theta^{\star}}(e_{i})^{\top}=\frac{1}{4n}\,\widehat{\Sigma}_{n}(\theta^{\star}),$$

which proves ([57](https://arxiv.org/html/2606.19607#A5.E57)).

Let $\Sigma_{\star}\doteq\Sigma_{D}(\theta^{\star})=\mathbb{E}_{e\sim D}[g_{\theta^{\star}}(e)\,g_{\theta^{\star}}(e)^{\top}]$. Note that

$$\big\|\Sigma_{\star}^{\dagger/2}\nabla L_{n}(\theta^{\star})\big\|_{2}^{2}=\text{tr}\Big(\Sigma_{\star}^{\dagger/2}\,\nabla L_{n}(\theta^{\star})\nabla L_{n}(\theta^{\star})^{\top}\,\Sigma_{\star}^{\dagger/2}\Big).$$

Taking expectation and using ([57](https://arxiv.org/html/2606.19607#A5.E57)),

$$\mathbb{E}\Big[\big\|\Sigma_{\star}^{\dagger/2}\nabla L_{n}(\theta^{\star})\big\|_{2}^{2}\Big]\leq\frac{1}{4n}\,\mathbb{E}\left[\text{tr}\big(\Sigma_{\star}^{\dagger/2}\,\widehat{\Sigma}_{n}(\theta^{\star})\,\Sigma_{\star}^{\dagger/2}\big)\right].$$

Since $\mathbb{E}[\widehat{\Sigma}_{n}(\theta^{\star})]=\Sigma_{\star}$, we obtain

$$\mathbb{E}\Big[\big\|\Sigma_{\star}^{\dagger/2}\nabla L_{n}(\theta^{\star})\big\|_{2}^{2}\Big]\leq\frac{1}{4n}\,\text{tr}\big(\Sigma_{\star}^{\dagger/2}\,\Sigma_{\star}\,\Sigma_{\star}^{\dagger/2}\big).$$

The matrix $\Sigma_{\star}^{\dagger/2}\Sigma_{\star}\Sigma_{\star}^{\dagger/2}$ is the orthogonal projector onto $\mathrm{Range}(\Sigma_{\star})$, hence its trace equals $\mathrm{rank}(\Sigma_{\star})$. Under [Assumption 1](https://arxiv.org/html/2606.19607#Thmassumption1), $\mathrm{Range}(\Sigma_{\star})=H$, so $\mathrm{rank}(\Sigma_{\star})=\dim(H)$. Therefore,

$$\mathbb{E}\Big[\big\|\Sigma_{\star}^{\dagger/2}\nabla L_{n}(\theta^{\star})\big\|_{2}^{2}\Big]\leq\frac{\dim(H)}{4n}.$$

∎
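The Loewner bound (57) can be verified exactly in a small example, because the conditional covariance has the closed form derived in the proof. The sketch below assumes a hypothetical linear logit $u_{i}=\langle\theta^{\star},g_{i}\rangle$ in two dimensions with randomly drawn gradients; none of these choices come from the paper.

```python
import math
import random


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def min_eig_sym2(a: float, b: float, c: float) -> float:
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)


rng = random.Random(1)
N = 50
THETA_STAR = (0.7, -0.4)  # hypothetical ground-truth parameter
# Toy linear logit u_i = <theta*, g_i>, so the gradient g_i does not depend on theta.
GRADS = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(N)]

# Exact conditional covariance (1/n^2) sum_i Var(a_i | e_i) g_i g_i^T,
# stored as the upper triangle (entry 11, entry 12, entry 22).
cov = [0.0, 0.0, 0.0]
# Empirical Gram matrix Sigma_hat_n = (1/n) sum_i g_i g_i^T, same storage.
gram = [0.0, 0.0, 0.0]
for g1, g2 in GRADS:
    u = THETA_STAR[0] * g1 + THETA_STAR[1] * g2
    var_a = sigmoid(u) * (1.0 - sigmoid(u))  # Bernoulli variance, at most 1/4
    for k, entry in enumerate((g1 * g1, g1 * g2, g2 * g2)):
        cov[k] += var_a * entry / N ** 2
        gram[k] += entry / N

# The Loewner gap (1/(4n)) Sigma_hat_n - Cov should be positive semidefinite, as in (57).
gap = [gram[k] / (4.0 * N) - cov[k] for k in range(3)]
loewner_gap_min_eig = min_eig_sym2(*gap)
```

The gap is small when the logits $u_{i}$ are near zero, since then $\sigma'(u_{i})$ is close to $1/4$; this is the regime in which the bound (57) is nearly tight.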

To prove [Lemma 8](https://arxiv.org/html/2606.19607#Thmlemma8), we first establish the following two technical lemmas.

###### Lemma 16 (Uniform bounds for the Fisher matrix on $\Theta$).

Suppose [Assumptions 1](https://arxiv.org/html/2606.19607#Thmassumption1), [2](https://arxiv.org/html/2606.19607#Thmassumption2) and [3](https://arxiv.org/html/2606.19607#Thmassumption3) hold. Then there exist constants

$$0 < m_{I}\leq M_{I} < \infty$$

depending only on primitive constants such that for all $\theta\in\Theta$,

$$m_{I}I_{H}\;\preceq\;I(\theta)\;\preceq\;M_{I}I_{H}\qquad\text{on }H,\tag{59}$$

where $I_{H}$ denotes the identity operator on the identifiable space $H$. Equivalently, for every $v\in H$,

$$m_{I}\|v\|_{2}^{2}\;\leq\;v^{\top}I(\theta)v\;\leq\;M_{I}\|v\|_{2}^{2}.\tag{60}$$

In particular, for the prior-averaged Fisher matrix

$$\bar{I}_{\rho}\doteq\mathbb{E}_{\theta\sim\rho}[I(\theta)],$$

we have

$$m_{I}I_{H}\;\preceq\;\bar{I}_{\rho}\;\preceq\;M_{I}I_{H}.\tag{61}$$

Consequently, for all $\theta\in\Theta$ and all $v\in H$,

$$\frac{m_{I}}{M_{I}}\,v^{\top}I(\theta)v\;\leq\;v^{\top}\bar{I}_{\rho}v\;\leq\;\frac{M_{I}}{m_{I}}\,v^{\top}I(\theta)v.\tag{62}$$

###### Proof of [Lemma 16](https://arxiv.org/html/2606.19607#Thmlemma16).

The lower bound in ([59](https://arxiv.org/html/2606.19607#A5.E59)) is exactly the uniform nondegeneracy on $H$ proved in [Lemma 14](https://arxiv.org/html/2606.19607#Thmlemma14), so we may take $m_{I}=\underline{\mu}$.

For the upper bound, recall

$$I(\theta)=\mathbb{E}_{x}\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\big[s_{\theta}(x,y)s_{\theta}(x,y)^{\top}\big],\qquad s_{\theta}(x,y)=\nabla_{\theta}\log\pi_{\theta}(y\mid x).$$

By the softmax form and [Assumption 2](https://arxiv.org/html/2606.19607#Thmassumption2), the score is uniformly bounded on $H$: for all $(x,y)$ and all $\theta\in\Theta$, $\|s_{\theta}(x,y)\|_{2}\leq 2\alpha_{1}$, where $\alpha_{1}$ is the uniform bound on $\|\nabla_{\theta}f_{\theta}(\phi(x,y))\|_{2}$. Therefore, for any $v\in H$,

$$v^{\top}I(\theta)v=\mathbb{E}_{x}\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\big[(v^{\top}s_{\theta}(x,y))^{2}\big]\leq\mathbb{E}_{x}\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\big[\|v\|_{2}^{2}\|s_{\theta}(x,y)\|_{2}^{2}\big]\leq 4\alpha_{1}^{2}\|v\|_{2}^{2}.$$

Hence ([59](https://arxiv.org/html/2606.19607#A5.E59)) holds with $M_{I}=4\alpha_{1}^{2}$.

Now average ([59](https://arxiv.org/html/2606.19607#A5.E59)) with respect to $\rho$. Since the Loewner order is preserved under expectation,

$$m_{I}I_{H}\preceq\mathbb{E}_{\rho}[I(\theta)]=\bar{I}_{\rho}\preceq M_{I}I_{H},$$

which proves ([61](https://arxiv.org/html/2606.19607#A5.E61)).

Finally, for any $v\in H$ and any $\theta\in\Theta$,

$$v^{\top}\bar{I}_{\rho}v\leq M_{I}\|v\|_{2}^{2}\leq\frac{M_{I}}{m_{I}}v^{\top}I(\theta)v,$$

and similarly

$$v^{\top}\bar{I}_{\rho}v\geq m_{I}\|v\|_{2}^{2}\geq\frac{m_{I}}{M_{I}}v^{\top}I(\theta)v.$$

This proves ([62](https://arxiv.org/html/2606.19607#A5.E62)). ∎

###### Lemma 17\.

Suppose [Assumptions 4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a) and [6](https://arxiv.org/html/2606.19607#Thmassumption6) hold with $\mathcal{R}=\operatorname{supp}(\rho)$, and define

$$J(\rho)\;\doteq\;\int_{\Theta}\nabla\log\rho(\theta)\,\nabla\log\rho(\theta)^{\top}\,\rho(\theta)\,d\theta,\qquad\bar{\Sigma}_{D}\;\doteq\;\mathbb{E}_{\theta\sim\rho}\big[\Sigma_{D}(\theta)\big].$$

Let

$$T_{\rho}\;\doteq\;\text{tr}\big(J(\rho)\big)=\int_{\Theta}\|\nabla\log\rho(\theta)\|_{2}^{2}\,\rho(\theta)\,d\theta < \infty.$$

As in ([46](https://arxiv.org/html/2606.19607#A3.E46)), let

$$n_{\rm prior}\;\doteq\;\left\lceil\frac{8\,T_{\rho}}{\lambda_{D}}\right\rceil.$$

Then for all $n\geq n_{\rm prior}$,

$$J(\rho)\;\preceq\;\frac{n}{8}\,\bar{\Sigma}_{D}.\tag{63}$$

###### Proof of[Lemma˜17](https://arxiv.org/html/2606.19607#Thmlemma17)\.

By [Assumption 6](https://arxiv.org/html/2606.19607#Thmassumption6)(3),

$$T_{\rho}=\text{tr}(J(\rho))=\int_{\Theta}\|\nabla\log\rho(\theta)\|_{2}^{2}\,\rho(\theta)\,d\theta<\infty.$$

Since $J(\rho)\succeq 0$, its largest eigenvalue is bounded by its trace, hence

$$J(\rho)\preceq\text{tr}(J(\rho))\,I=T_{\rho}I.\tag{64}$$

Next, by [Assumption 4(b)](https://arxiv.org/html/2606.19607#Thmassumption4a), for all $\theta\in\operatorname{supp}(\rho)$,

$$\Sigma_{D}(\theta)\succeq\mu_{\rho}I.$$

Taking expectation with respect to $\rho$ yields

$$\bar{\Sigma}_{D}=\mathbb{E}_{\theta\sim\rho}[\Sigma_{D}(\theta)]\succeq\mu_{\rho}I.\tag{65}$$

Now let $n\geq n_{\rm prior}$, where $n_{\rm prior}=\left\lceil\frac{8T_{\rho}}{\mu_{\rho}}\right\rceil$. Then

$$\frac{n}{8}\,\mu_{\rho}\geq T_{\rho}.$$

Combining this with ([64](https://arxiv.org/html/2606.19607#A5.E64)) and ([65](https://arxiv.org/html/2606.19607#A5.E65)),

$$J(\rho)\;\preceq\;T_{\rho}I\;\preceq\;\frac{n}{8}\,\mu_{\rho}I\;\preceq\;\frac{n}{8}\,\bar{\Sigma}_{D}.$$

∎
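The chain of inequalities in the proof can be checked numerically. The following is a minimal sketch, not part of the paper's experiments: it builds a random positive semidefinite stand-in for $J(\rho)$ and a stand-in for $\bar{\Sigma}_{D}$ whose smallest eigenvalue is at least $\mu_{\rho}$, computes $n_{\rm prior}$ with $\mu_{\rho}$ in the denominator as in the proof, and verifies that $\frac{n}{8}\bar{\Sigma}_{D}-J(\rho)$ is positive semidefinite at $n=n_{\rm prior}$. The dimension, the random matrices, and the value of `mu_rho` are illustrative assumptions.

```python
import math
import numpy as np

def check_prior_bound(J, Sigma_bar, mu_rho):
    """Return (n_prior, smallest eigenvalue of (n_prior / 8) * Sigma_bar - J)."""
    T_rho = float(np.trace(J))                  # T_rho = tr(J(rho))
    n_prior = math.ceil(8.0 * T_rho / mu_rho)   # threshold as written in the proof
    gap = (n_prior / 8.0) * Sigma_bar - J       # should be positive semidefinite
    return n_prior, float(np.linalg.eigvalsh(gap).min())

rng = np.random.default_rng(0)
d = 5                                           # illustrative dimension
B = rng.normal(size=(d, d))
J = B @ B.T                                     # random PSD stand-in for J(rho)
C = rng.normal(size=(d, d))
mu_rho = 0.3                                    # illustrative lower eigenvalue bound
Sigma_bar = C @ C.T + mu_rho * np.eye(d)        # guarantees Sigma_bar >= mu_rho * I

n_prior, min_eig = check_prior_bound(J, Sigma_bar, mu_rho)
print(n_prior, min_eig)
```

A nonnegative `min_eig` confirms the ordering for this particular draw; it illustrates the argument and does not replace it.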

## Appendix F Experiment implementation details and hyperparameters

### F.1 Implementation Details for the IMDb Experiment

The IMDb experiments were run on GPU workers equipped with H100-class GPUs. Depending on the curation setting and dataset size, each DPO fine-tuning run took from several minutes to roughly one hour, while the $D^{\ast}$ construction and evaluation steps were substantially lighter.

##### SFT model and reference policy\.

We follow the IMDb setup in Rafailov et al. [[2023](https://arxiv.org/html/2606.19607#bib.bib1)]. We first fine-tune GPT-2-large on reviews from the training split of the IMDb dataset using supervised fine-tuning (SFT). The resulting SFT model is used as the reference policy $\pi_{0}$ for all subsequent DPO experiments. In all DPO runs, the policy is initialized from this SFT model, and the reference policy is fixed to the same SFT model.

##### DPO training configuration\.

For each curated preference dataset, we run full-parameter DPO training. We use RMSprop as the optimizer, with learning rate $10^{-5}$. The per-device training batch size is $16$, and we use $2$ gradient-accumulation steps, giving an effective batch size of $32$. We use a cosine learning-rate scheduler with warmup ratio $0.1$. The maximum sequence length is set to $256$. For the reward–KL frontier experiments, each point in the frontier corresponds to the final checkpoint of one DPO training run under a specific DPO regularization parameter $\beta$. We sweep

$$\beta\in\{0.05,0.1,0.2,0.5,1,2,5\}.$$

For evaluation, we generate responses from the trained policy and compute both the reward and the KL divergence relative to the SFT reference policy. Following Rafailov et al. [[2023](https://arxiv.org/html/2606.19607#bib.bib1)], we use the sentiment classifier `siebert/sentiment-roberta-large-english` as the reward model. The KL is estimated on generated responses as the sequence-level log-probability difference between the trained policy and the reference policy,

$$\log\hat{\pi}(y\mid x)-\log\pi_{0}(y\mid x)=\sum_{t=1}^{|y|}\left[\log\hat{\pi}(y_{t}\mid x,y_{<t})-\log\pi_{0}(y_{t}\mid x,y_{<t})\right].$$

This sequence-level empirical KL estimate follows the evaluation protocol used in Rafailov et al. [[2023](https://arxiv.org/html/2606.19607#bib.bib1)].
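The sequence-level estimate can be sketched as follows, assuming the per-token log-probabilities of each generated response under both models are already available. The function names and the toy numbers are hypothetical, and averaging the per-response log ratios over sampled responses is our reading of "estimated on generated responses" rather than a detail stated in the text.

```python
import numpy as np

def sequence_log_ratio(policy_token_logps, ref_token_logps):
    """Sum of per-token log-probability differences for one generated response."""
    policy_token_logps = np.asarray(policy_token_logps, dtype=float)
    ref_token_logps = np.asarray(ref_token_logps, dtype=float)
    if policy_token_logps.shape != ref_token_logps.shape:
        raise ValueError("token log-probabilities must align position by position")
    return float(np.sum(policy_token_logps - ref_token_logps))

def empirical_kl(samples):
    """Average the sequence-level log ratio over responses sampled from the trained policy."""
    return float(np.mean([sequence_log_ratio(p, r) for p, r in samples]))

# Toy per-token log-probabilities: (trained policy, reference policy) for two responses.
samples = [
    ([-0.5, -1.0, -0.2], [-0.7, -1.5, -0.2]),  # log ratio = 0.7
    ([-1.2, -0.3], [-1.0, -0.4]),              # log ratio = -0.1
]
kl_hat = empirical_kl(samples)                 # (0.7 + (-0.1)) / 2 = 0.3
```

Individual log ratios can be negative, as in the second toy response; only their average over samples from the trained policy estimates a nonnegative KL.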

##### Candidate generation\.

For prompt selection, we sample a candidate pool of $1{,}000$ prompts and generate two responses for each prompt from the SFT reference model. This gives one candidate comparison per prompt. We then select $N=175$ comparisons for DPO training. For response selection, we generate $D=8$ candidate responses for each prompt. This gives $\binom{8}{2}$ candidate response pairs within each prompt. We select one response pair per prompt for DPO training.
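For concreteness, the within-prompt candidate pairs in the response selection setting can be enumerated as unordered index pairs. This is a small illustrative sketch; the variable names are ours.

```python
from itertools import combinations

D = 8  # candidate responses per prompt in the response selection setting
# Unordered index pairs (a, b) with a < b, one per candidate comparison.
pairs = list(combinations(range(D), 2))
print(len(pairs))  # 28
```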

##### Feature representation\.

To compute the $D^{\ast}$-based design, we represent each prompt–response pair $(x,y)$ using the last-token hidden representation from the SFT model. Let

$$\phi(x,y)\in\mathbb{R}^{d}$$

denote this last-token representation. For a candidate comparison

$$e=(x,y_{a},y_{b}),$$

we define the feature difference

$$g_{e}=\beta_{D}\left(\phi(x,y_{a})-\phi(x,y_{b})\right),$$

where $\beta_{D}$ is the DPO scaling parameter used in the design computation. This feature difference is used as a low-dimensional proxy for the pairwise sensitivity vector of the DPO logit.

##### Estimating the design weight matrix $I(\theta_{0})$.

The design objective uses a feature-space approximation to $I(\theta_{0})$. We first fit a linear approximation to the SFT log-probabilities. Specifically, for each prompt $x_{i}$ and candidate response $y_{ik}$, we compute the SFT log-probability

$$\ell_{ik}=\log\pi_{0}(y_{ik}\mid x_{i}).$$

Since only within-prompt comparisons are relevant, we remove a prompt-specific mean and fit a ridge regression:

$$\widehat{\theta}_{0}\in\arg\min_{\theta}\sum_{i,k}\left(\phi(x_{i},y_{ik})^{\top}\theta-\left(\ell_{ik}-\frac{1}{K_{i}}\sum_{\ell=1}^{K_{i}}\ell_{i\ell}\right)\right)^{2}+\alpha\|\theta\|_{2}^{2}.$$

This gives a feature-space approximation to the SFT policy. We then define a softmax distribution over the candidate responses for each prompt:

$$\widehat{\pi}_{\theta_{0}}(y_{ik}\mid x_{i})=\frac{\exp\left(\phi(x_{i},y_{ik})^{\top}\widehat{\theta}_{0}\right)}{\sum_{\ell=1}^{K_{i}}\exp\left(\phi(x_{i},y_{i\ell})^{\top}\widehat{\theta}_{0}\right)}.$$

The matrix $I(\theta_{0})$ is estimated by the average within-prompt feature covariance:

$$\widehat{I}(\theta_{0})=\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{K_{i}}\widehat{\pi}_{\theta_{0}}(y_{ik}\mid x_{i})\left(\phi(x_{i},y_{ik})-\widehat{\mu}_{i}\right)\left(\phi(x_{i},y_{ik})-\widehat{\mu}_{i}\right)^{\top},$$

where

$$\widehat{\mu}_{i}=\sum_{k=1}^{K_{i}}\widehat{\pi}_{\theta_{0}}(y_{ik}\mid x_{i})\,\phi(x_{i},y_{ik}).$$
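The three steps above (centered ridge regression, within-prompt softmax, policy-weighted covariance) can be sketched in NumPy. This is a minimal illustration under simplifying assumptions: every prompt has the same number of candidates $K$, the features are synthetic rather than hidden states from the SFT model, and the ridge strength `alpha=1.0` is a placeholder because its value is not reported here. The function name `estimate_fisher` is ours.

```python
import numpy as np

def estimate_fisher(phi, logp, alpha=1.0):
    """phi: (M, K, d) features; logp: (M, K) SFT log-probabilities.

    Returns (theta0_hat, I_hat). Assumes every prompt has the same K.
    """
    M, K, d = phi.shape
    # Remove the prompt-specific mean from the log-probabilities.
    y = (logp - logp.mean(axis=1, keepdims=True)).reshape(M * K)
    X = phi.reshape(M * K, d)
    # Closed-form ridge solution: (X^T X + alpha I)^{-1} X^T y.
    theta = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
    # Softmax over the candidate responses of each prompt.
    scores = phi @ theta                            # (M, K)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    pi = np.exp(scores)
    pi /= pi.sum(axis=1, keepdims=True)
    # Policy-weighted within-prompt mean and covariance of the features.
    mu = np.einsum("mk,mkd->md", pi, phi)           # (M, d)
    centered = phi - mu[:, None, :]                 # (M, K, d)
    I_hat = np.einsum("mk,mkd,mke->de", pi, centered, centered) / M
    return theta, I_hat

rng = np.random.default_rng(0)
phi = rng.normal(size=(20, 8, 4))   # synthetic features, not model outputs
logp = rng.normal(size=(20, 8))     # synthetic log-probabilities
theta0_hat, I_hat = estimate_fisher(phi, logp, alpha=1.0)
```

By construction `I_hat` is symmetric and positive semidefinite, since it is an average of probability-weighted outer products.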

##### $D^{\ast}$ design computation.

Given candidate comparisons $e\in\mathcal{E}$, we compute design weights over the corresponding feature differences $g_{e}$. For a nonnegative weight vector $w$ over candidate comparisons, define the regularized information matrix

$$A(w)=\lambda I+\sum_{e\in\mathcal{E}}w_{e}g_{e}g_{e}^{\top}.$$

The implemented $D^{\ast}$ design minimizes the feature-space $I$-optimal objective

$$\Phi(w)=\operatorname{tr}\left(\widehat{I}(\theta_{0})A(w)^{-1}\right).$$

We solve this continuous design problem approximately using a Frank–Wolfe procedure.
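A Frank–Wolfe iteration for this objective can be sketched as follows. Since $\partial\Phi/\partial w_{e}=-g_{e}^{\top}A(w)^{-1}\widehat{I}(\theta_{0})A(w)^{-1}g_{e}$, the linear minimization step over the simplex picks the comparison with the largest such score. The sketch assumes the weights are constrained to the probability simplex and uses a coarse grid line search for the step size; the step rule, iteration count, default `lam`, and synthetic features are our illustrative choices, not details reported for the experiments.

```python
import numpy as np

def trace_objective(G, I_hat, w, lam):
    """Phi(w) = tr(I_hat @ inv(lam * I + sum_e w_e g_e g_e^T))."""
    A = lam * np.eye(G.shape[1]) + (G * w[:, None]).T @ G
    return float(np.trace(np.linalg.solve(A, I_hat)))

def frank_wolfe_design(G, I_hat, lam=1e-3, n_iters=100):
    """G: (m, d) rows g_e; I_hat: (d, d) PSD. Returns weights on the simplex."""
    m, d = G.shape
    w = np.full(m, 1.0 / m)                    # start from the uniform design
    gammas = np.linspace(0.0, 1.0, 21)         # step-size grid; includes 0
    for _ in range(n_iters):
        A = lam * np.eye(d) + (G * w[:, None]).T @ G
        A_inv = np.linalg.inv(A)
        S = A_inv @ I_hat @ A_inv
        # Negative partial derivatives: g_e^T S g_e for every candidate e.
        scores = np.einsum("ed,df,ef->e", G, S, G)
        vertex = np.zeros(m)
        vertex[int(np.argmax(scores))] = 1.0   # linear minimization oracle
        best = min(gammas, key=lambda g: trace_objective(
            G, I_hat, (1 - g) * w + g * vertex, lam))
        w = (1 - best) * w + best * vertex     # convex step stays on the simplex
    return w

rng = np.random.default_rng(0)
beta_D = 0.1                                   # illustrative scaling parameter
phi_a = rng.normal(size=(40, 3))               # synthetic features for y_a
phi_b = rng.normal(size=(40, 3))               # synthetic features for y_b
G = beta_D * (phi_a - phi_b)                   # feature differences g_e
I_hat = np.eye(3)                              # placeholder for I_hat(theta_0)
w_uniform = np.full(40, 1.0 / 40)
w_star = frank_wolfe_design(G, I_hat)
```

Because the step-size grid contains zero, the objective never increases from one iteration to the next, so `w_star` is at least as good as the uniform design under $\Phi$.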

##### Sampling from the design\.

In the prompt selection experiment, we first compute $D^{\ast}$ weights over all $1{,}000$ candidate prompt-level comparisons. We then sample $N=175$ comparisons without replacement according to the normalized $D^{\ast}$ weights. The benchmark selects the first $175$ prompt-level comparisons.

In the response selection experiment, we compute $D^{\ast}$ weights over all within-prompt candidate response pairs. For each prompt, we normalize the weights over its $\binom{8}{2}$ candidate response pairs and sample one pair without replacement from this within-prompt distribution. The benchmark always compares the first two generated responses for each prompt.
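Both sampling schemes can be sketched with NumPy's random generator. The weights below are random placeholders rather than optimized $D^{\ast}$ weights, and the function names are ours. Note that weighted sampling without replacement requires at least $N$ strictly positive weights; how a sparser design would be handled is not specified in the text.

```python
import numpy as np

def sample_prompt_level(weights, n, rng):
    """Draw n distinct comparison indices with probability proportional to weights."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                          # normalize to a distribution
    return rng.choice(len(p), size=n, replace=False, p=p)

def sample_within_prompt(pair_weights, rng):
    """pair_weights: (num_prompts, num_pairs). Draw one pair index per prompt."""
    p = np.asarray(pair_weights, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)     # normalize within each prompt
    return np.array([rng.choice(p.shape[1], p=row) for row in p])

rng = np.random.default_rng(0)
w_prompt = rng.random(1000) + 1e-6           # placeholder weights over 1,000 comparisons
chosen = sample_prompt_level(w_prompt, 175, rng)
w_pairs = rng.random((5, 28)) + 1e-6         # placeholder weights, 28 pairs per prompt
picked = sample_within_prompt(w_pairs, rng)
```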

##### Monte Carlo replications and error bars\.

For each dataset-generation seed, we independently resample the candidate pool, regenerate candidate responses, recompute the $D^{\ast}$ design weights, reconstruct the curated preference training set, and retrain the DPO policy from the same SFT reference model. Error bars in the reward–KL frontier plots report Monte Carlo standard errors across these independent runs. In the prompt selection experiment, error bars are computed over $30$ independent runs; in the response selection experiment, error bars are computed over $80$ independent runs.
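A Monte Carlo standard error of this kind is the sample standard deviation across runs divided by the square root of the number of runs. The sketch below uses the unbiased (`ddof=1`) standard deviation, which is our assumption since the convention is not stated, and the four values are made-up numbers for illustration only.

```python
import numpy as np

def mc_mean_and_se(values):
    """Mean and Monte Carlo standard error (sample std / sqrt(n)) across runs."""
    x = np.asarray(values, dtype=float)
    return float(x.mean()), float(x.std(ddof=1) / np.sqrt(len(x)))

# Made-up per-run rewards, for illustration only.
mean, se = mc_mean_and_se([0.80, 0.84, 0.78, 0.82])
```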

### F.2 Implementation Details for the Anthropic-HH Experiment

The Anthropic-HH experiments were run on H100-class GPU workers. The trace-design construction and response generation used one GPU, while each Pythia-2.8B DPO fine-tuning run used four GPUs. Across the Anthropic-HH budgets considered, each DPO fine-tuning run took about one hour. GPT-based evaluation was performed through API calls and did not require GPU computation.

##### Dataset and reference policy\.

We further evaluate our dataset curation method on the Anthropic Helpful–Harmless (HH) preference dataset. We use the default train/test splits of the Anthropic HH-RLHF dataset, which contain preference comparisons from both helpfulness and harmlessness data. Each example consists of a prompt, a chosen response, and a rejected response. Following the DPO pipeline in Rafailov et al. [[2023](https://arxiv.org/html/2606.19607#bib.bib1)], we first perform supervised fine-tuning (SFT) on Pythia-2.8B using the chosen responses in the HH training split. The resulting SFT model is used as both the initialization of the DPO policy and the fixed reference policy $\pi_{0}$. Each HH training example consists of a prompt $x$, a chosen response $y^{+}$, and a rejected response $y^{-}$, which directly defines one candidate preference comparison.

##### Candidate pool and benchmark\.

We construct a candidate pool from the HH training split. In the full-pool experiment, the candidate pool contains $M=160{,}800$ preference comparisons. Given a training budget $N$, the benchmark dataset consists of the first $N$ comparisons in this candidate pool. The $D^{\ast}$-curated dataset uses the same budget $N$, but selects comparisons according to the optimized $D^{\ast}$ design weights. Both methods therefore use the same number of preference comparisons for DPO training.

##### Feature construction\.

For each candidate comparison $e_{i}=(x_{i},y_{i}^{+},y_{i}^{-})$, we use the SFT model to extract hidden representations. Specifically, we feed the full sequences $x_{i}\circ y_{i}^{+}$ and $x_{i}\circ y_{i}^{-}$ into the SFT model and take the last-token hidden states. We denote the resulting representations by $\phi_{i}^{+}$ and $\phi_{i}^{-}$. We then use the difference

$$g_{i}=\beta(\phi_{i}^{+}-\phi_{i}^{-})$$

as the design vector for this comparison, where $\beta=0.1$ matches the DPO regularization parameter used in training. To make the design computation numerically tractable, we standardize the hidden representations and apply PCA to obtain a $128$-dimensional feature representation.
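The feature pipeline can be sketched with a plain SVD-based PCA. Several details here are assumptions on our part: the chosen and rejected hidden states are stacked so that one standardization and one projection are fitted jointly, the projection is applied before taking the difference, and the synthetic data uses 16 input dimensions reduced to 4 in place of the model's hidden size reduced to 128.

```python
import numpy as np

def standardize_and_pca(H, n_components):
    """Standardize the columns of H (n, p), then project onto the top principal components."""
    mean = H.mean(axis=0)
    std = H.std(axis=0)
    std[std == 0] = 1.0                       # guard against constant columns
    Z = (H - mean) / std
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

rng = np.random.default_rng(0)
H_pos = rng.normal(size=(300, 16))            # synthetic stand-ins for chosen hidden states
H_neg = rng.normal(size=(300, 16))            # synthetic stand-ins for rejected hidden states
proj = standardize_and_pca(np.vstack([H_pos, H_neg]), n_components=4)
phi_pos, phi_neg = proj[:300], proj[300:]
beta = 0.1                                    # matches the DPO regularization parameter
g = beta * (phi_pos - phi_neg)                # design vectors g_i, one row per comparison
```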

##### $D^{\ast}$ design computation.

The $D^{\ast}$-based design is computed using the same feature-space approximation procedure as in the IMDb experiment. We first fit a local parameter $\theta_{0}$ by ridge regression from the feature representations to the centered SFT log-likelihoods of the chosen and rejected responses. Using this fitted $\theta_{0}$, we compute a Fisher-type matrix $I(\theta_{0})$ over the candidate pool. We then solve the trace-design problem

$$\min_{q\in\Delta_{M}}\operatorname{tr}\left[\left(\lambda I+\sum_{i=1}^{M}q_{i}g_{i}g_{i}^{\top}\right)^{-1}I(\theta_{0})\right],$$

where $q$ is a distribution over candidate comparisons and $\lambda=10^{-3}$ is a ridge parameter. We solve this optimization problem using Frank–Wolfe. The resulting optimized design weights are used as sampling probabilities. For a fixed budget $N$, we sample $N$ comparisons without replacement according to these weights.

##### DPO training configuration\.

For each curated HH preference dataset, we perform DPO starting from the SFT Pythia-2.8B model, with the same SFT model fixed as the reference policy. We use the same DPO training configuration for the $D^{\ast}$-curated dataset and the benchmark dataset. In the reported runs, we use $\beta=0.1$, batch size $64$, RMSprop optimizer with learning rate $10^{-6}$, and a linear learning-rate warmup over the first $150$ optimization steps. Each model is trained for one epoch. This ensures that differences in downstream performance are attributable to the curated preference data rather than to training hyperparameters.

##### Evaluation protocol\.

To evaluate the trained policies, we generate responses on prompts from the HH test split. For each test prompt, we compare the generated response with the corresponding HH chosen response. We use sampling temperatures $0.25$, $0.7$, and $1.0$, and evaluate $500$ test prompts for each method-temperature pair. GPT-4.1 is used as an automatic judge. The judge compares the model-generated response against the HH chosen response and returns whether the model response is better, worse, or tied. We report the win rate of the trained policy’s generated response against the HH chosen response, counting a tie as $0.5$. For all methods, we use the same test prompts and the same judging protocol.
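The win-rate aggregation with ties counted as $0.5$ amounts to the following. The verdict labels `"win"`, `"tie"`, and `"loss"` are hypothetical names for the judge's three outcomes, not the strings used in the actual judging prompt.

```python
def win_rate(verdicts):
    """Aggregate judge verdicts ('win', 'tie', 'loss'); a tie counts as 0.5."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(score[v] for v in verdicts) / len(verdicts)

rate = win_rate(["win", "win", "tie", "loss"])  # (1 + 1 + 0.5 + 0) / 4 = 0.625
```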
