
# Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Source: [https://arxiv.org/html/2605.11134](https://arxiv.org/html/2605.11134)
###### Abstract

Preference learning methods such as Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences at deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal–spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we propose *tie training*, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.

Keywords: Alignment, Spurious Correlations

## 1 Introduction

Aligning large language models (LLMs) with human preferences is a key challenge in building safe and useful AI systems. In current alignment pipelines, Reinforcement Learning from Human Feedback (RLHF) learns a reward model from preference data and optimizes a policy with respect to that reward (Ziegler et al., [2019](https://arxiv.org/html/2605.11134#bib.bib5); Ouyang et al., [2022](https://arxiv.org/html/2605.11134#bib.bib1)). Direct Preference Optimization (DPO) simplifies this pipeline by directly optimizing the policy on preference pairs (Rafailov et al., [2023](https://arxiv.org/html/2605.11134#bib.bib4)). These approaches are used to align widely deployed systems such as ChatGPT (Ouyang et al., [2022](https://arxiv.org/html/2605.11134#bib.bib1)) and Claude (Bai et al., [2022](https://arxiv.org/html/2605.11134#bib.bib2)). Despite their success, existing theory provides limited insight into how preference optimization behaves under the distributional structure of real-world human feedback.

Understanding this behavior requires examining the structure of preference data itself. Preference optimization methods are trained on datasets of human comparisons (Christiano et al., [2017](https://arxiv.org/html/2605.11134#bib.bib41); Rafailov et al., [2023](https://arxiv.org/html/2605.11134#bib.bib4)), where annotators select preferred responses given a prompt. These datasets encode recurring patterns, reflecting consistent annotator biases and shared superficial characteristics among preferred responses. As a result, feature-level correlations arise between surface attributes and preference labels that are not causally related to response quality.

Surface-level attributes such as length, politeness, formatting, or agreement with the user often correlate with preference labels during training (Sharma et al., [2023](https://arxiv.org/html/2605.11134#bib.bib9); Casper et al., [2023](https://arxiv.org/html/2605.11134#bib.bib34)), but may not reflect true response quality. When these correlations shift at deployment, models that rely on them fail to generalize, a behavior we call policy misgeneralization, where optimization aligns with a proxy objective instead of the intended goal.

Policy misgeneralization has important safety implications beyond standard distribution shift failures. Prior work has raised concerns that AI systems may learn objectives that correlate with intended goals during training but pursue misaligned proxy objectives once those correlations break at deployment (Langosco et al., [2022](https://arxiv.org/html/2605.11134#bib.bib40); Shah et al., [2022](https://arxiv.org/html/2605.11134#bib.bib39)). In such cases, high training reward can reflect alignment with proxy signals rather than improvements in true task performance, masking failures that emerge under distribution shift (Skalse et al., [2022](https://arxiv.org/html/2605.11134#bib.bib33)). While much of this literature focuses on hypothetical capable agents (Ngo et al., [2024](https://arxiv.org/html/2605.11134#bib.bib37); Bengio et al., [2025](https://arxiv.org/html/2605.11134#bib.bib36)), preference optimization in current LLMs provides a concrete setting where this failure mode manifests even without objective misspecification. Understanding the mechanisms by which spurious correlations emerge in these systems is therefore essential for developing robust alignment methods.

Despite these risks, existing work on spurious correlations in preference optimization remains largely empirical. Prior studies report failures such as verbosity (Saito et al., [2023](https://arxiv.org/html/2605.11134#bib.bib32)), but describe symptoms rather than identify underlying mechanisms. While supervised learning has developed mathematical frameworks for analyzing spurious correlations through shortcut learning (Geirhos et al., [2020](https://arxiv.org/html/2605.11134#bib.bib6)), preference optimization methods like DPO lack analogous theory. Without such understanding, mitigation strategies remain largely heuristic and lack principled guarantees.

To address this gap, we develop a mathematical framework for spurious correlation learning in preference optimization. We analyze log-linear DPO as a representative and tractable testbed for pairwise preference optimization, and characterize how feature correlations interact with the optimization objective. Our contributions are:

(i) We characterize the mechanism of spurious learning by analyzing the population equilibrium of the linearized log-linear DPO objective. We prove that mean spurious bias or causal–spurious correlation in the training distribution induces nonzero spurious parameters (Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1)). This shows spurious learning arises structurally from the data, not from finite-sample effects or optimization noise.

(ii) We analyze deployment consequences when spurious statistics shift between training and deployment. We use the expected preference margin as a population-level deployment proxy to characterize the shift term (Propositions [5.1](https://arxiv.org/html/2605.11134#S5.Thmtheorem1) and [5.2](https://arxiv.org/html/2605.11134#S5.Thmtheorem2)). To understand finite-sample behavior, we decompose deployment suboptimality into an irreducible shift term driven by spurious parameters and a reducible estimation term that decays as $O(1/n)$ (Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3)). This shows that scaling training data cannot eliminate shift-induced error.

(iii) We propose tie training, a data augmentation strategy that reduces spurious correlation reliance by adding preference pairs with equal utility but differing spurious features. These ties inject curvature along spurious directions, selectively regularizing spurious parameters without affecting causal learning (Theorem [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2)(i)). We prove that such ties can reduce the irreducible shift error at deployment (Theorem [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2)(iii)).

We validate our framework through controlled experiments that progressively relax modeling assumptions. Linear models confirm quantitative agreement with the theory. Neural networks show that the same qualitative mechanisms persist despite hidden representations. Scaling to large language models, tie training reduces spurious correlation learning without compromising in-distribution accuracy.

## 2 Related Work

**Spurious correlation learning.** Spurious correlations in supervised learning are a well-established failure mode (Singla and Feizi, [2021](https://arxiv.org/html/2605.11134#bib.bib30)). Models trained via empirical risk minimization (ERM) often exploit surface-level features that correlate with labels in the training distribution but lack a causal relationship to the target task, a phenomenon referred to as shortcut learning (Geirhos et al., [2020](https://arxiv.org/html/2605.11134#bib.bib6)), simplicity bias (Shah et al., [2020](https://arxiv.org/html/2605.11134#bib.bib18); Morwani et al., [2023](https://arxiv.org/html/2605.11134#bib.bib17)), or spurious feature reliance (Arjovsky et al., [2019](https://arxiv.org/html/2605.11134#bib.bib35)). When spurious correlations shift at deployment (Zhou et al., [2021](https://arxiv.org/html/2605.11134#bib.bib31)), models suffer prediction errors (Sagawa et al., [2020](https://arxiv.org/html/2605.11134#bib.bib29)), biased outcomes (Geirhos et al., [2018](https://arxiv.org/html/2605.11134#bib.bib12)), and performance degradation (Xiao et al., [2020](https://arxiv.org/html/2605.11134#bib.bib11)). Proposed mitigations include data augmentation (Chang et al., [2021](https://arxiv.org/html/2605.11134#bib.bib42); Plumb et al., [2021](https://arxiv.org/html/2605.11134#bib.bib43)), re-weighting of minority examples (Liu et al., [2021](https://arxiv.org/html/2605.11134#bib.bib13)), and modified training dynamics (Izmailov et al., [2022](https://arxiv.org/html/2605.11134#bib.bib20); Kirichenko et al., [2022](https://arxiv.org/html/2605.11134#bib.bib21)), though these approaches often require domain knowledge or explicit annotation of spurious features. Theoretically, spurious learning has been analyzed through optimization dynamics, where gradient descent preferentially fits easier features early in training, leading to gradient starvation (Rahaman et al., [2019](https://arxiv.org/html/2605.11134#bib.bib14); Kalimeris et al., [2019](https://arxiv.org/html/2605.11134#bib.bib15); Qiu et al., [2024](https://arxiv.org/html/2605.11134#bib.bib16)), as well as through NTK and linearized analyses that characterize implicit biases (Pezeshki et al., [2021](https://arxiv.org/html/2605.11134#bib.bib19); Hermann et al., [2023](https://arxiv.org/html/2605.11134#bib.bib22)). Bombari and Mondelli ([2025](https://arxiv.org/html/2605.11134#bib.bib23)) study high-dimensional linear models under ERM, deriving closed-form solutions that reveal how data covariance induces spurious feature reliance. However, these theories assume pointwise loss landscapes and do not extend to preference optimization, where pairwise comparisons induce fundamentally different learning dynamics.

**Empirical failures in preference optimization.** Prior work on preference optimization has documented spurious correlation learning primarily through empirical observations of reward hacking behaviors. Studies show that RLHF and DPO models exploit surface-level artifacts, including verbosity bias (preferring longer responses independently of quality; Singhal et al., [2023](https://arxiv.org/html/2605.11134#bib.bib8); Saito et al., [2023](https://arxiv.org/html/2605.11134#bib.bib32)), sycophancy (agreeing with user beliefs to maximize reward; Sharma et al., [2023](https://arxiv.org/html/2605.11134#bib.bib9)), and formatting bias (over-optimizing for numbered lists or stylistic markers; Zhang et al., [2025](https://arxiv.org/html/2605.11134#bib.bib10)). Existing mitigations target individual symptoms through ad-hoc interventions, including length penalties (Park et al., [2024](https://arxiv.org/html/2605.11134#bib.bib49)) for verbosity or synthetic data filtering (Chen et al., [2023](https://arxiv.org/html/2605.11134#bib.bib50)). These approaches treat each bias in isolation without addressing the underlying learning dynamics that produce them. In contrast, we provide a population-level analysis that reveals the structural mechanisms driving these empirically observed failures.

**Theoretical analysis of preference optimization.** Theoretical analysis of preference optimization has developed along several lines. Work on linear contextual dueling bandits and dueling reinforcement learning establishes regret minimization under realizable reward assumptions (Dudík et al., [2015](https://arxiv.org/html/2605.11134#bib.bib47); Saha et al., [2023](https://arxiv.org/html/2605.11134#bib.bib48)). Recent alignment work develops robust or safety-motivated analysis and training procedures, including robust formulations (Xiong et al., [2023](https://arxiv.org/html/2605.11134#bib.bib28); Wu et al., [2024](https://arxiv.org/html/2605.11134#bib.bib45)), noise-aware losses (Chowdhury et al., [2024a](https://arxiv.org/html/2605.11134#bib.bib44)), privacy-preserving constraints (Chen et al., [2025](https://arxiv.org/html/2605.11134#bib.bib46); Zhou et al., [2025](https://arxiv.org/html/2605.11134#bib.bib24)), and divergence-based alignment objectives that explicitly separate preferred and rejected behaviors (Haldar et al., [2025](https://arxiv.org/html/2605.11134#bib.bib7)). Complementary analyses in linear or log-linear regimes motivate simplified preference models and analyze learning behavior under idealized assumptions (Zhu et al., [2023](https://arxiv.org/html/2605.11134#bib.bib25); Chowdhury et al., [2024b](https://arxiv.org/html/2605.11134#bib.bib26); Zhou et al., [2025](https://arxiv.org/html/2605.11134#bib.bib24)). However, across these lines of work, a critical assumption persists: that the learned feature representation is valid for the target task. These approaches address stochastic or adversarial failures through algorithmic modifications but do not characterize systematic spurious correlation learning.

## 3 Preliminaries

### 3.1 Preference Learning Setup

**Preference dataset.** We consider a preference dataset $\mathcal{D}=\{(x^{(i)},y_w^{(i)},y_l^{(i)})\}_{i=1}^{N}$, where $(y_w^{(i)},y_l^{(i)})$ denotes the human-preferred and rejected responses to prompt $x^{(i)}\in\mathcal{X}$, following standard pairwise preference supervision (Rafailov et al., [2023](https://arxiv.org/html/2605.11134#bib.bib4); Christiano et al., [2017](https://arxiv.org/html/2605.11134#bib.bib41)). We represent each prompt–response pair $(x,y)$ with a feature vector $\phi(x,y)\in\mathbb{R}^{d}$ and train on the feature difference $\Delta\phi=\phi(x,y_w)-\phi(x,y_l)$.

**Causal and spurious feature decomposition.** We decompose each feature vector as $\phi(x,y)=[\phi_c(x,y);\phi_s(x,y)]\in\mathbb{R}^{d_c+d_s}$. The causal component $\phi_c(x,y)\in\mathbb{R}^{d_c}$ determines true response utility, while the spurious component $\phi_s(x,y)\in\mathbb{R}^{d_s}$ correlates with causal features in the training data without affecting utility. This decomposition induces a corresponding split of feature differences, $\Delta\phi=[\Delta\phi_c;\Delta\phi_s]$. In practice, spurious features arise from data collection biases and domain-specific structure, such as annotators consistently preferring longer or more formal responses regardless of content quality. We formalize this behavior through the following invariance assumption.

###### Assumption 3.1 (Invariance).

Under interventions that modify spurious features while holding causal features fixed, human preferences remain unchanged.

### 3.2 Log-Linear Policy and Bradley–Terry Model

**Policy model.** We adopt a log-linear policy, a common regime for theoretical analysis that enables tractable characterization of learning dynamics (Zhu et al., [2023](https://arxiv.org/html/2605.11134#bib.bib25); Zhou et al., [2025](https://arxiv.org/html/2605.11134#bib.bib24)). The policy takes the form $\pi_\theta(y\mid x)\propto\exp(\theta^\top\phi(x,y))$, where $\theta\in\mathbb{R}^d$ denotes the learnable parameter vector. We decompose the parameter vector as $\theta=[\theta_c;\theta_s]\in\mathbb{R}^{d_c+d_s}$ to match the causal–spurious feature split.

**Data generation.** Our analysis depends only on the feature differences $\Delta\phi$ and their stated moment conditions. In the controlled log-linear experiments (Section [7](https://arxiv.org/html/2605.11134#S7)), we generate preference labels with a Bradley–Terry model (Bradley and Terry, [1952](https://arxiv.org/html/2605.11134#bib.bib27)). Under this model, the probability that humans prefer $y_w$ over $y_l$ is $\sigma(\theta_{\mathrm{BT}}^\top\Delta\phi)$, where $\sigma(z)=(1+e^{-z})^{-1}$. By Assumption [3.1](https://arxiv.org/html/2605.11134#S3.Thmtheorem1), spurious features do not affect human preferences, so $\theta_{\mathrm{BT}}=[\theta_{\mathrm{BT},c};\mathbf{0}]$.

### 3.3 Direct Preference Optimization

We analyze Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2605.11134#bib.bib4)) as a method for fitting the preference model described above. DPO optimizes the policy relative to a reference policy $\pi_{\theta_{\mathrm{ref}}}$, so we express learning in terms of the deviation $\tilde{\theta}=\theta-\theta_{\mathrm{ref}}$. The DPO loss for a single preference example $(x,y_w,y_l)$ is

$$\ell_{\mathrm{DPO}}(\theta;x,y_w,y_l)=-\log\sigma\big(\beta\,\tilde{\theta}^\top\Delta\phi\big),\tag{1}$$

where $\beta>0$ controls the KL regularization strength. Equation ([1](https://arxiv.org/html/2605.11134#S3.E1)) shows that DPO performs logistic regression on feature differences (Bombari and Mondelli, [2025](https://arxiv.org/html/2605.11134#bib.bib23)). While this provides a well-defined population objective, the sigmoid nonlinearity prevents closed-form characterization of the optimum.
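To make Equation ([1](https://arxiv.org/html/2605.11134#S3.E1)) concrete, the following minimal numpy sketch implements the per-example DPO loss as a logistic loss on feature differences. The synthetic data, dimensions, and constants are our own illustrative assumptions, not the paper's code.

```python
# A minimal sketch of Equation (1): the DPO loss is a logistic loss on the
# feature difference dphi = phi(x, y_w) - phi(x, y_l). Data are illustrative.
import numpy as np

def dpo_loss(theta, theta_ref, dphi, beta):
    """Average of -log sigma(beta * (theta - theta_ref)^T dphi) over pairs."""
    z = beta * (dphi @ (theta - theta_ref))   # predicted score differences
    return np.mean(np.logaddexp(0.0, -z))     # numerically stable -log sigmoid

rng = np.random.default_rng(0)
dphi = rng.normal(size=(128, 10))             # 128 preference pairs, d = 10
theta_ref = np.zeros(10)
print(dpo_loss(theta_ref + 0.1, theta_ref, dphi, beta=0.3))
```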

**Linearization regime.** To enable analytical progress, we work in a local regime where score differences remain moderate, allowing a first-order Taylor expansion of the sigmoid. Formally:

###### Assumption 3.2 (Local regime).

With high probability under the data distribution, $|\beta\,\tilde{\theta}^\top\Delta\phi|\ll 1$.

This regime arises near initialization, for bounded feature magnitudes, or for sufficiently small KL regularization strength $\beta$. Under this linearization, the population optimum admits a closed-form solution that reveals how data structure drives spurious learning.

## 4 Population-Level Spurious Learning

In this section, we analyze the population equilibrium of the DPO objective under a local linearization. We show that spurious correlation learning arises generically: mean spurious bias or causal–spurious correlation in the training data leads to nonzero spurious parameters at the population optimum.

### 4.1 Population Gradient and Early-Training Drift

We study the population objective

$$L(\theta):=-\mathbb{E}\Big[\log\sigma\big(\beta\,\tilde{\theta}^\top\Delta\phi\big)\Big],$$

where the expectation is taken over preference data generated by the Bradley–Terry model with parameters $\theta_{\mathrm{BT}}$. A single preference pair induces the gradient $\nabla_\theta\ell(\theta;x,y_w,y_l)=-\beta\,w(\Delta_\theta)\,\Delta\phi$, where $\Delta_\theta:=\tilde{\theta}^\top\Delta\phi$ denotes the predicted score difference and $w(\Delta_\theta):=1-\sigma(\beta\Delta_\theta)\in(0,1)$. Hard or misclassified pairs receive larger weight $w(\Delta_\theta)$, while confidently satisfied preferences are downweighted. Taking expectations, the population gradient is

$$\nabla_\theta L(\theta)=-\beta\,\mathbb{E}\big[w(\Delta_\theta)\,\Delta\phi\big].$$

**Early-training drift.** Near the reference policy ($\tilde{\theta}\approx 0$), score differences are small, so $w(\Delta_\theta)\approx\tfrac{1}{2}$. The gradient simplifies to

$$\nabla_\theta L(\theta)\approx-\frac{\beta}{2}\,\mu,\qquad\mu:=\mathbb{E}[\Delta\phi]=\begin{bmatrix}\mu_c\\ \mu_s\end{bmatrix},$$

where $\mu_c:=\mathbb{E}[\Delta\phi_c]$ and $\mu_s:=\mathbb{E}[\Delta\phi_s]$ are the mean feature differences. Whenever $\mu_s\neq 0$, spurious features are learned from the first gradient step: $\nabla_{\theta_s}L(\theta)\approx-\tfrac{\beta}{2}\mu_s$. This shows that spurious features with nonzero mean differences are immediately learned, moving $\theta_s$ away from the ground truth $\theta_s^\dagger=\mathbf{0}$.
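The drift is easy to verify numerically. The Monte Carlo sketch below, with moments we chose purely for illustration, checks that the gradient at the reference policy is approximately $-(\beta/2)\mu$, so a nonzero spurious mean $\mu_s$ moves $\theta_s$ from the very first step.

```python
# Monte-Carlo check of early-training drift: at theta_tilde = 0 the DPO
# gradient is -beta * E[(1 - sigma(0)) * dphi] = -(beta/2) * E[dphi], so a
# nonzero spurious mean mu_s pushes theta_s off zero immediately.
# The synthetic moments below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
beta, n = 0.3, 200_000
mu = np.array([1.0, 0.0, 0.4, 0.2])        # [mu_c (2 dims); mu_s (2 dims)], mu_s != 0
dphi = rng.normal(loc=mu, scale=1.0, size=(n, 4))

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
w = 1.0 - sigma(beta * (dphi @ np.zeros(4)))        # w = 1/2 at initialization
grad = -beta * np.mean(w[:, None] * dphi, axis=0)   # empirical population gradient

print(grad)                  # ~ -(beta/2) * mu
print(-(beta / 2) * mu)      # spurious coordinates receive nonzero drift
```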

Early-training drift shows that mean spurious bias drives immediate spurious learning. However, this does not guarantee that spurious parameters remain nonzero at equilibrium: the gradient dynamics could drive them back to zero. We now characterize the population equilibrium to determine when spurious learning persists.

### 4.2 Linearized Equilibrium

Under Assumption [3.2](https://arxiv.org/html/2605.11134#S3.Thmtheorem2), we linearize the weighting function $w(\Delta_\theta)\approx\frac{1}{2}-\frac{\beta}{4}\Delta_\theta$ and substitute into the population gradient. This yields (see Appendix [A.1](https://arxiv.org/html/2605.11134#A1.SS1) for details)

$$\nabla_\theta L(\theta)\approx-\frac{\beta}{2}\,\mu+\frac{\beta^2}{4}\,\Sigma\tilde{\theta},\qquad\Sigma:=\mathbb{E}[\Delta\phi\,\Delta\phi^\top].\tag{2}$$

At the linearized stationary point $\theta^\star$, the gradient vanishes:

$$\Sigma\,\tilde{\theta}^\star=\frac{2}{\beta}\,\mu,\qquad\tilde{\theta}^\star:=\theta^\star-\theta_{\mathrm{ref}}.$$

Partitioning according to the causal–spurious split, we write

$$\Sigma=\begin{bmatrix}\Sigma_{cc}&\Sigma_{cs}\\ \Sigma_{sc}&\Sigma_{ss}\end{bmatrix},\qquad\mu=\begin{bmatrix}\mu_c\\ \mu_s\end{bmatrix},\qquad\tilde{\theta}^\star=\begin{bmatrix}\tilde{\theta}_c^\star\\ \tilde{\theta}_s^\star\end{bmatrix},$$

where $\Sigma_{cc}:=\mathbb{E}[\Delta\phi_c\Delta\phi_c^\top]$, $\Sigma_{ss}:=\mathbb{E}[\Delta\phi_s\Delta\phi_s^\top]$, and $\Sigma_{cs}=\Sigma_{sc}^\top:=\mathbb{E}[\Delta\phi_c\Delta\phi_s^\top]$ captures causal–spurious correlation. The equilibrium conditions are

$$\Sigma_{cc}\tilde{\theta}_c^\star+\Sigma_{cs}\tilde{\theta}_s^\star=\frac{2}{\beta}\mu_c,\tag{3}$$

$$\Sigma_{sc}\tilde{\theta}_c^\star+\Sigma_{ss}\tilde{\theta}_s^\star=\frac{2}{\beta}\mu_s.\tag{4}$$

**Explicit solution via Schur complement.** The equilibrium conditions ([3](https://arxiv.org/html/2605.11134#S4.E3))–([4](https://arxiv.org/html/2605.11134#S4.E4)) couple causal and spurious parameters through the cross-covariance $\Sigma_{cs}$. To isolate the spurious component, we use the Schur complement method. Assume $\Sigma_{ss}$ and the Schur complement $S_c:=\Sigma_{cc}-\Sigma_{cs}\Sigma_{ss}^{-1}\Sigma_{sc}$ are invertible; these conditions hold when the covariance matrices are full rank. Under these assumptions, the spurious component at equilibrium is

$$\tilde{\theta}_s^\star=\frac{2}{\beta}\Sigma_{ss}^{-1}\Big[\underbrace{\big(I+\Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\big)\mu_s}_{\text{mean-bias}}-\underbrace{\Sigma_{sc}S_c^{-1}\mu_c}_{\text{corr.-leakage}}\Big].\tag{5}$$

The full derivation is provided in Appendix [A.2](https://arxiv.org/html/2605.11134#A1.SS2).
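As a sanity check on Equation ([5](https://arxiv.org/html/2605.11134#S4.E5)), the sketch below computes the spurious block via the Schur complement and compares it with the spurious coordinates of the direct solve $\Sigma\tilde{\theta}^\star=(2/\beta)\mu$. The moments $(\Sigma,\mu)$ are randomly generated assumptions, not the paper's data.

```python
# Numerical check of Equation (5): the Schur-complement expression for the
# spurious block must agree with the direct solve of the full linear system.
import numpy as np

rng = np.random.default_rng(2)
dc, ds, beta = 3, 2, 0.3
A = rng.normal(size=(dc + ds, dc + ds))
Sigma = A @ A.T + np.eye(dc + ds)              # full-rank second-moment matrix
mu = rng.normal(size=dc + ds)

Scc, Scs = Sigma[:dc, :dc], Sigma[:dc, dc:]
Ssc, Sss = Sigma[dc:, :dc], Sigma[dc:, dc:]
mu_c, mu_s = mu[:dc], mu[dc:]

Sss_inv = np.linalg.inv(Sss)
Sc_inv = np.linalg.inv(Scc - Scs @ Sss_inv @ Ssc)   # inverse Schur complement
theta_s = (2 / beta) * Sss_inv @ (
    (np.eye(ds) + Ssc @ Sc_inv @ Scs @ Sss_inv) @ mu_s   # mean-bias term
    - Ssc @ Sc_inv @ mu_c                                # correlation-leakage term
)

theta_full = (2 / beta) * np.linalg.solve(Sigma, mu)
print(np.allclose(theta_s, theta_full[dc:]))   # True: both mechanisms accounted for
```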

### 4.3 Main Mechanism Result

We now state our main result characterizing spurious learning at the population equilibrium.

###### Theorem 4.1 (Spurious learning from mean bias and correlation leakage).

Under the linearized population dynamics (Equation ([2](https://arxiv.org/html/2605.11134#S4.E2))) and standard invertibility conditions, if $\mu_s\neq 0$ (mean spurious bias) or $\Sigma_{cs}\neq 0$ (causal–spurious correlation), then $\tilde{\theta}_s^\star\neq 0$. That is, the population optimum assigns nonzero weight to spurious features.

Equation ([5](https://arxiv.org/html/2605.11134#S4.E5)) reveals two distinct mechanisms:

**(i) Mean spurious bias ($\mu_s\neq 0$):** When spurious features are asymmetrically distributed across preference pairs, they directly contribute to $\tilde{\theta}_s^\star$, even in the absence of correlation ($\Sigma_{cs}=0$).

**(ii) Correlation leakage ($\Sigma_{cs}\neq 0$):** When spurious features correlate with causal features, weight intended for causal directions leaks into spurious ones. This operates even when spurious features are unbiased ($\mu_s=0$).

**Deviation from ground truth.** Recall that the Bradley–Terry ground truth has $\theta_s^\dagger=\mathbf{0}$ (Assumption [3.1](https://arxiv.org/html/2605.11134#S3.Thmtheorem1)). Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1) shows that the learned parameters satisfy $\tilde{\theta}_s^\star\neq 0$ under generic conditions on the training distribution. This creates a systematic bias: the learned policy deviates from the true preference model in directions that do not affect utility. Whether this deviation causes deployment failures depends on how spurious feature statistics differ between training and deployment, which we formalize in Section [5](https://arxiv.org/html/2605.11134#S5).

## 5 Deployment Error from Spurious Learning

Section [4](https://arxiv.org/html/2605.11134#S4) showed that learned parameters generically satisfy $\tilde{\theta}_s^\star\neq 0$. We now study the consequences for deployment. Spurious learning creates a *potential vulnerability*: learned parameters depend on features that do not affect true utility. Whether this vulnerability translates into deployment error depends on how spurious statistics shift between training and deployment. We show that when spurious statistics shift, the population objective degrades due to learned spurious parameters. We then decompose deployment suboptimality into a shift component and an estimation component, showing that the former persists regardless of training set size.

### 5.1 Distribution Shift Setup

Let $P$ denote the training distribution and $Q$ the deployment distribution over preference pairs $(x,y_w,y_l)$. These distributions may differ in their feature statistics. We denote:

$$\text{Training:}\quad\mu^{(P)}=\mathbb{E}_P[\Delta\phi],\qquad\Sigma^{(P)}=\mathbb{E}_P[\Delta\phi\,\Delta\phi^\top];$$

$$\text{Deployment:}\quad\mu^{(Q)}=\mathbb{E}_Q[\Delta\phi],\qquad\Sigma^{(Q)}=\mathbb{E}_Q[\Delta\phi\,\Delta\phi^\top].$$

The model learns parameters $\theta_{\mathrm{train}}$ by optimizing on $P$ and deploys with these fixed parameters on $Q$. Differences between $\mu_s^{(P)}$ and $\mu_s^{(Q)}$ induce deployment error through the learned spurious parameters $\tilde{\theta}_{s,\mathrm{train}}$. We defer the full mathematical characterization of canonical shift scenarios (suppression, adversarial, and rotation) to Appendix [E](https://arxiv.org/html/2605.11134#A5).

### 5.2 Expected Margin as Deployment Metric

To assess how shifts affect performance, we use the expected preference margin, which provides a tractable first-order characterization of model quality under the local regime.

**Margin definition.** The expected margin measures the average score gap between preferred and dispreferred responses: $m_D(\theta):=\mathbb{E}_{(x,y_w,y_l)\sim D}[\beta\,\tilde{\theta}^\top\Delta\phi]$. The margin decomposes into causal and spurious components:

$$m_D(\theta)=\underbrace{\beta\,\tilde{\theta}_c^\top\mu_c^{(D)}}_{=:m_D^{\text{causal}}(\theta)}+\underbrace{\beta\,\tilde{\theta}_s^\top\mu_s^{(D)}}_{=:m_D^{\text{spurious}}(\theta)}.\tag{6}$$
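The decomposition in Equation ([6](https://arxiv.org/html/2605.11134#S5.E6)) is a one-line computation. The sketch below, with illustrative parameter and moment values of our choosing, separates the causal margin from the spurious margin that becomes vulnerable under shift.

```python
# A small sketch of Equation (6): the expected margin splits into a causal
# part beta * theta_c^T mu_c and a spurious part beta * theta_s^T mu_s.
# Vectors below are illustrative assumptions.
import numpy as np

beta = 0.3
theta_tilde = np.array([0.8, -0.2, 0.5, 0.1])   # [theta_c; theta_s], dc = ds = 2
mu_D = np.array([1.0, 0.3, 0.6, -0.4])          # [mu_c^(D); mu_s^(D)]

m_causal = beta * theta_tilde[:2] @ mu_D[:2]
m_spurious = beta * theta_tilde[2:] @ mu_D[2:]
print(m_causal + m_spurious)                    # total margin m_D(theta)
print(m_spurious)                               # the component exposed to shift
```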
We now show that margin differences provide a first-order approximation to objective differences.

###### Proposition 5.1 (First-Order Margin Approximation).

Under the local regime (Assumption [3.2](https://arxiv.org/html/2605.11134#S3.Thmtheorem2)), the DPO objective $\tilde{J}^{(D)}(\theta):=\mathbb{E}_D[\log\sigma(\beta\tilde{\theta}^\top\Delta\phi)]$ satisfies

$$\tilde{J}^{(Q)}(\theta)-\tilde{J}^{(P)}(\theta)=\frac{1}{2}\big(m_Q(\theta)-m_P(\theta)\big)+O\big(\beta^2\|\tilde{\theta}\|^2\big).$$

The proof follows from a Taylor expansion of the log-sigmoid; see Appendix [C.2](https://arxiv.org/html/2605.11134#A3.SS2).

Proposition[5\.1](https://arxiv.org/html/2605.11134#S5.Thmtheorem1)shows that margin differences drive objective changes\. We now isolate the spurious component by considering shifts that preserve causal statistics\.

###### Proposition 5.2 (Spurious Margin Drives Shift).

When causal statistics are stable ($\mu_c^{(Q)}=\mu_c^{(P)}$), the objective difference at $\theta_{\mathrm{train}}$ is determined by the spurious margin:

$$\tilde{J}^{(Q)}(\theta_{\mathrm{train}})-\tilde{J}^{(P)}(\theta_{\mathrm{train}})\approx\frac{\beta}{2}\,\tilde{\theta}_{s,\mathrm{train}}^\top\big(\mu_s^{(Q)}-\mu_s^{(P)}\big),$$

up to an $O(\beta^2\|\tilde{\theta}_{\mathrm{train}}\|^2)$ remainder.

###### Proof.

Apply Proposition [5.1](https://arxiv.org/html/2605.11134#S5.Thmtheorem1) with $\theta=\theta_{\mathrm{train}}$. From ([6](https://arxiv.org/html/2605.11134#S5.E6)), the margin difference decomposes as
$$m_Q(\theta_{\mathrm{train}})-m_P(\theta_{\mathrm{train}})=\beta\,\tilde{\theta}_{c,\mathrm{train}}^\top\big(\mu_c^{(Q)}-\mu_c^{(P)}\big)+\beta\,\tilde{\theta}_{s,\mathrm{train}}^\top\big(\mu_s^{(Q)}-\mu_s^{(P)}\big).$$
When $\mu_c^{(Q)}=\mu_c^{(P)}$, the causal term vanishes. ∎

Proposition[5\.2](https://arxiv.org/html/2605.11134#S5.Thmtheorem2)identifies when deployment degradation occurs at the population level: when learned spurious parametersθ~s,train\\tilde\{\\theta\}\_\{s,\\mathrm\{train\}\}interact with shifts in spurious statisticsμs\(Q\)−μs\(P\)\\mu\_\{s\}^\{\(Q\)\}\-\\mu\_\{s\}^\{\(P\)\}\. When spurious statistics remain stable \(μs\(Q\)=μs\(P\)\\mu\_\{s\}^\{\(Q\)\}=\\mu\_\{s\}^\{\(P\)\}\), spurious learning is benign despiteθ~s,train≠0\\tilde\{\\theta\}\_\{s,\\mathrm\{train\}\}\\neq 0\. We now show that this degradation is irreducible with respect to sample size by decomposing deployment suboptimality\.

### 5.3 Suboptimality Decomposition

We now formalize the irreducibility of deployment degradation by decomposing suboptimality into a shift term and an estimation term. Let $\theta_Q^\star:=\arg\max_{\theta\in\Theta_B}\tilde{J}^{(Q)}(\theta)$ denote the deployment-optimal parameters, where $\Theta_B:=\{\theta\in\mathbb{R}^d:\|\theta-\theta_{\mathrm{ref}}\|_2\leq B\}$. Let $\hat{\theta}$ denote the finite-sample estimator trained on $n$ samples from $P$. We define deployment suboptimality as

$$\mathrm{SubOpt}_Q(\hat{\theta}):=\tilde{J}^{(Q)}(\theta_Q^\star)-\tilde{J}^{(Q)}(\hat{\theta}).$$

Inserting the population training optimum $\theta_{\mathrm{train}}$ yields:

$$\mathrm{SubOpt}_Q(\hat{\theta})=\underbrace{\tilde{J}^{(Q)}(\theta_Q^\star)-\tilde{J}^{(Q)}(\theta_{\mathrm{train}})}_{\text{shift term}}+\underbrace{\tilde{J}^{(Q)}(\theta_{\mathrm{train}})-\tilde{J}^{(Q)}(\hat{\theta})}_{\text{estimation term}}.\tag{7}$$

**Irreducible shift vs. reducible estimation.** The decomposition separates two sources of suboptimality. The *estimation term* vanishes as $n\to\infty$ since $\hat{\theta}\to\theta_{\mathrm{train}}$. In contrast, the *shift term* is determined by the population optimum $\theta_{\mathrm{train}}$ and persists regardless of sample size:

$$\lim_{n\to\infty}\mathrm{SubOpt}_Q(\hat{\theta})=\tilde{J}^{(Q)}(\theta_Q^\star)-\tilde{J}^{(Q)}(\theta_{\mathrm{train}}).$$

By Proposition [5.2](https://arxiv.org/html/2605.11134#S5.Thmtheorem2), when causal statistics are stable, this irreducible error is proportional to $\tilde{\theta}_{s,\mathrm{train}}^\top(\mu_s^{(Q)}-\mu_s^{(P)})$. The magnitude depends on two factors: the learned spurious parameters $\tilde{\theta}_{s,\mathrm{train}}$ (which Section [4](https://arxiv.org/html/2605.11134#S4) showed emerge structurally from training) and the spurious shift $\mu_s^{(Q)}-\mu_s^{(P)}$ (which depends on the deployment environment). Collecting more data from $P$ cannot eliminate this vulnerability: it only reduces the vanishing estimation component while leaving $\tilde{\theta}_{s,\mathrm{train}}$ unchanged.
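The following numerical sketch illustrates decomposition ([7](https://arxiv.org/html/2605.11134#S5.Ex11)) under the linearized model. The moments, the sign-flipping spurious shift, and the quadratic approximation of $\tilde{J}$ (which drops the constant $-\log 2$, common to all differences) are our own illustrative assumptions, not the paper's experimental code.

```python
# Shift term stays fixed while the estimation term shrinks with n, under a
# local quadratic approximation J(theta) ~ (beta/2) theta^T mu
#   - (beta^2/8) theta^T Sigma theta of E[log sigma(beta theta^T dphi)].
import numpy as np

rng = np.random.default_rng(3)
beta, d = 0.3, 4
mu_P = np.array([1.0, 0.5, 0.6, 0.4])      # spurious mean present under P
mu_Q = np.array([1.0, 0.5, -0.6, -0.4])    # spurious mean flips sign under Q
Sigma = 2.0 * np.eye(d)                    # shared second moment (assumption)

def J(theta, mu):
    return (beta / 2) * theta @ mu - (beta**2 / 8) * theta @ Sigma @ theta

theta_train = (2 / beta) * np.linalg.solve(Sigma, mu_P)
theta_Qstar = (2 / beta) * np.linalg.solve(Sigma, mu_Q)
shift = J(theta_Qstar, mu_Q) - J(theta_train, mu_Q)     # irreducible shift term

C = Sigma - np.outer(mu_P, mu_P)           # covariance so that E[dphi dphi^T] = Sigma
for n in [100, 1_000, 10_000, 100_000]:
    dphi = rng.multivariate_normal(mu_P, C, size=n)
    theta_hat = (2 / beta) * np.linalg.solve(dphi.T @ dphi / n, dphi.mean(axis=0))
    est = J(theta_train, mu_Q) - J(theta_hat, mu_Q)     # estimation term
    print(f"n={n:>6}  shift={shift:.4f}  estimation={est:+.5f}")  # |est| -> 0
```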

### 5.4 Main Deployment Bound

The margin analysis in Section [5.2](https://arxiv.org/html/2605.11134#S5.SS2) characterized deployment degradation at the population level, assuming access to $\theta_{\mathrm{train}}$. In practice, we learn from finite samples, obtaining an estimator $\hat{\theta}$ that deviates from $\theta_{\mathrm{train}}$. We now bound the estimation term $\tilde{J}^{(Q)}(\theta_{\mathrm{train}})-\tilde{J}^{(Q)}(\hat{\theta})$ to complete the decomposition in Equation ([7](https://arxiv.org/html/2605.11134#S5.Ex11)), confirming that the irreducible shift dominates as $n\to\infty$.

**Technical assumptions.** We require: (A1) bounded features, $\|\Delta\phi\|_2\leq 1$ almost surely; (A2) bounded parameters, $\|\tilde{\theta}\|_2\leq B$ for all $\theta\in\Theta_B$; (A3) the local regime (Assumption [3.2](https://arxiv.org/html/2605.11134#S3.Thmtheorem2)); and (A4) geometry transfer, meaning there exists $\kappa_\Pi\geq 1$ such that $\|v\|_{\Sigma^{(Q)}}^2\leq\kappa_\Pi\|v\|_{H_P}^2$ for all $v\in\mathbb{R}^d$, where $H_P$ is the Hessian of the training objective at $\theta_{\mathrm{train}}$. Assumption (A4) bounds deployment variation by training curvature and holds when $P$ and $Q$ are not too different. See Appendix [C](https://arxiv.org/html/2605.11134#A3) for detailed discussion of these conditions.

We define $\hat{\theta}$ as the ridge-regularized MLE:

$$\hat{\theta}:=\arg\max_{\theta\in\Theta_B}\left[\frac{1}{n}\sum_{i=1}^{n}\log\sigma\big(\beta\tilde{\theta}^\top\Delta\phi^{(i)}\big)-\frac{\lambda}{2}\|\tilde{\theta}\|_2^2\right],$$

where $\lambda>0$ is the regularization parameter.
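A minimal sketch of this estimator, assuming gradient-based optimization with `scipy` and omitting the projection onto the norm ball $\Theta_B$ for simplicity; the data and hyperparameters are illustrative.

```python
# Ridge-regularized DPO MLE (unconstrained variant of the definition above).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d, beta, lam = 2_000, 6, 0.3, 1e-3
dphi = rng.normal(loc=0.2, size=(n, d))        # synthetic feature differences
theta_ref = np.zeros(d)

def neg_objective(theta):
    # -(1/n) sum_i log sigma(beta * theta_tilde^T dphi_i) + (lam/2) ||theta_tilde||^2
    z = beta * (dphi @ (theta - theta_ref))
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * np.sum((theta - theta_ref) ** 2)

res = minimize(neg_objective, x0=np.zeros(d), method="L-BFGS-B")
theta_hat = res.x                              # the norm-ball projection is omitted here
print(res.fun, np.linalg.norm(theta_hat - theta_ref))
```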

###### Theorem 5.3 (Deployment Suboptimality Bound).

Under assumptions (A1)–(A4), with probability at least $1-\delta$:

$$\mathrm{SubOpt}_Q(\hat{\theta})\leq\underbrace{\tilde{J}^{(Q)}(\theta_Q^\star)-\tilde{J}^{(Q)}(\theta_{\mathrm{train}})}_{\text{shift term}}+\underbrace{\frac{\kappa_\Pi}{2}\,\Gamma_n^2}_{\text{estimation term}},\tag{8}$$

where $\Gamma_n=2\beta\sqrt{2(d+\log(1/\delta))/n}+B\sqrt{\lambda}$.

###### Proof sketch.

The decomposition ([7](https://arxiv.org/html/2605.11134#S5.Ex11)) holds by definition. For the estimation term, we bound $|\tilde{J}^{(Q)}(\theta_{\mathrm{train}})-\tilde{J}^{(Q)}(\hat{\theta})|$ using a local quadratic approximation and concentration of the Hessian. Assumption (A4) ensures this bound transfers from $P$ to $Q$. Full details are in Appendix [C.7](https://arxiv.org/html/2605.11134#A3.Thmtheorem7). ∎

**Interpretation.** Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3) reveals the structure of deployment suboptimality. The estimation term scales as $O(d/n)$ and vanishes with more data. The shift term persists: even with infinite training data, deployment error remains bounded by the gap between $\theta_{\mathrm{train}}$ and $\theta_Q^\star$.

By Proposition [5.2](https://arxiv.org/html/2605.11134#S5.Thmtheorem2), when causal statistics are stable, this gap is proportional to $\|\tilde{\theta}_{s,\mathrm{train}}\|\cdot\|\mu_s^{(Q)}-\mu_s^{(P)}\|$. This decomposition clarifies two sources of vulnerability: Section [4](https://arxiv.org/html/2605.11134#S4) showed that $\tilde{\theta}_{s,\mathrm{train}}\neq 0$ arises structurally from the training distribution, while the deployment shift $\mu_s^{(Q)}-\mu_s^{(P)}$ depends on the environment. Spurious learning creates *latent vulnerability*; harm materializes only when spurious statistics shift. Crucially, scaling training data cannot eliminate this vulnerability, as it leaves $\tilde{\theta}_{s,\mathrm{train}}$ unchanged.

## 6 Tie Training Reduces Spurious Reliance

Section [5](https://arxiv.org/html/2605.11134#S5) showed that deployment error contains an irreducible shift term driven by learned spurious parameters $\tilde{\theta}_{s,\mathrm{train}}$. We now describe a simple data-level intervention that directly targets this mechanism. We introduce *tie training*, a data augmentation strategy that reduces spurious reliance by adding curvature selectively in spurious directions. The approach constructs preference pairs with equal utility but differing spurious features, assigns labels randomly, and mixes these ties with standard preference data during training.

### 6.1 Tie Construction and Training

###### Definition 6.1 (Tie pair).

A tie pair is a tuple $(x,y_A,y_B)$ with equal utility: $u^\star(x,y_A)=u^\star(x,y_B)$.

We construct ties so that causal features match while spurious features differ: $\Delta\phi_c=0$ and $\Delta\phi_s=\delta_s\neq 0$. For each tie, we assign the winner–loser label uniformly at random, yielding $\mathbb{E}_{\mathrm{tie}}[\Delta\phi]=0$ and

$$\Sigma^{\mathrm{tie}}:=\mathbb{E}_{\mathrm{tie}}[\Delta\phi\,\Delta\phi^\top]=\begin{bmatrix}0&0\\ 0&\Sigma^{\mathrm{tie}}_{ss}\end{bmatrix}.$$

This covariance structure ensures ties add curvature only in spurious directions.
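The construction is straightforward to simulate. The sketch below draws spurious contrasts, assigns random winner/loser orientations, and checks both $\mathbb{E}_{\mathrm{tie}}[\Delta\phi]\approx 0$ and the block structure of $\Sigma^{\mathrm{tie}}$; the Gaussian contrast distribution is an assumption for illustration.

```python
# Tie pairs per Definition 6.1: dphi_c = 0, dphi_s != 0, random labels.
import numpy as np

rng = np.random.default_rng(5)
n, dc, ds = 50_000, 2, 2
delta_s = rng.normal(size=(n, ds))                 # spurious contrasts (assumed Gaussian)
signs = rng.choice([-1.0, 1.0], size=(n, 1))       # uniformly random winner/loser labels
dphi_tie = signs * np.concatenate([np.zeros((n, dc)), delta_s], axis=1)

print(np.round(dphi_tie.mean(axis=0), 3))          # ~0: no mean bias injected
Sigma_tie = dphi_tie.T @ dphi_tie / n
print(np.round(Sigma_tie, 2))                      # curvature only in the ss-block
```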

**Mixed training.** We train on a mixture $P_{\mathrm{mix}}:=\alpha P+(1-\alpha)P_{\mathrm{tie}}$ with $\alpha\in(0,1)$. Under the local regime (Assumption [3.2](https://arxiv.org/html/2605.11134#S3.Thmtheorem2)), the linearized equilibrium satisfies

$$\tilde{\theta}^\star_{\mathrm{mix}}=\frac{2\alpha}{\beta}\big(\Sigma^{\mathrm{mix}}\big)^{-1}\mu^{(P)},\qquad\Sigma^{\mathrm{mix}}:=\alpha\Sigma^{(P)}+(1-\alpha)\Sigma^{\mathrm{tie}}.\tag{9}$$

Since $\Sigma^{\mathrm{tie}}$ has support only on spurious coordinates, mixed training increases the regularization in $\Sigma^{\mathrm{mix}}_{ss}$ while leaving causal directions unchanged. This shrinks $\|\tilde{\theta}^\star_{s,\mathrm{mix}}\|$ relative to strict-only training without affecting $\tilde{\theta}_c$. The full derivation is in Appendix [D.4](https://arxiv.org/html/2605.11134#A4.SS4).
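The shrinkage predicted by Equation ([9](https://arxiv.org/html/2605.11134#S6.E9)) can be seen directly in a small example. Below, with illustrative moments and $\Sigma_{cs}=0$, tie curvature shrinks the spurious block while the causal solution is preserved exactly.

```python
# Strict vs. mixed linearized equilibria: tie curvature acts only on theta_s.
import numpy as np

beta, alpha, dc, ds = 0.3, 0.7, 2, 2
Sigma_P = np.eye(dc + ds)                          # illustrative moments, Sigma_cs = 0
mu_P = np.array([1.0, 0.5, 0.4, 0.3])              # mu_s != 0: mean spurious bias
Sigma_tie = np.zeros((dc + ds, dc + ds))
Sigma_tie[dc:, dc:] = 4.0 * np.eye(ds)             # curvature in the spurious block only

theta_strict = (2 / beta) * np.linalg.solve(Sigma_P, mu_P)
Sigma_mix = alpha * Sigma_P + (1 - alpha) * Sigma_tie
theta_mix = (2 * alpha / beta) * np.linalg.solve(Sigma_mix, mu_P)

print(np.linalg.norm(theta_strict[dc:]), np.linalg.norm(theta_mix[dc:]))  # shrinks
print(np.linalg.norm(theta_strict[:dc]), np.linalg.norm(theta_mix[:dc]))  # unchanged
```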

### 6.2 Main Result

We now formalize the effect of tie training on spurious reliance and deployment performance.

###### Theorem 6.2 (Tie training reduces spurious reliance and deployment shift).

Under the conditions of Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3) and the tie construction above:

**(i) Spurious shrinkage.** Let $\theta^\star$ denote the strict-only population optimizer and $\theta^\star_{\mathrm{mix}}$ the mixed-training optimizer. Then

$$\|\tilde{\theta}_{s,\mathrm{mix}}^\star\|_{\Sigma^{\mathrm{mix}}_{ss}}\leq\|\tilde{\theta}_s^\star\|_{\Sigma^{(P)}_{ss}},$$

where $\|v\|_A:=\sqrt{v^\top Av}$. The inequality is strict when $\Sigma^{\mathrm{tie}}_{ss}$ adds curvature aligned with the spurious driving term.

**(ii) Shift reduction.** If $\mu_c^{(Q)}=\mu_c^{(P)}$, then

$$\big|\tilde{\theta}^{\star\top}_{s,\mathrm{mix}}\big(\mu_s^{(Q)}-\mu_s^{(P)}\big)\big|\leq\big|\tilde{\theta}^{\star\top}_{s}\big(\mu_s^{(Q)}-\mu_s^{(P)}\big)\big|.$$
**(iii) Finite-sample bound.** Let $\hat{\theta}^{\mathrm{mix}}$ be trained on $n$ samples from $P_{\mathrm{mix}}$. With probability at least $1-\delta$,

$$\mathrm{SubOpt}_Q(\hat{\theta}^{\mathrm{mix}})\leq\underbrace{\tilde{J}^{(Q)}(\theta_Q^\star)-\tilde{J}^{(Q)}(\theta^\star_{\mathrm{mix}})}_{\text{reduced shift}}+\underbrace{\frac{\kappa_\Pi}{2}\,\Gamma_{n,\mathrm{mix}}^2}_{\text{estimation}}.$$

The proof combines the equilibrium characterization (Equation ([9](https://arxiv.org/html/2605.11134#S6.E9))) with the deployment bound framework; see Appendix [D.2](https://arxiv.org/html/2605.11134#A4.Thmtheorem2).

###### Corollary 6.3 (Quantitative reduction under isotropic ties).

If $\Sigma^{\mathrm{tie}}_{ss}=\sigma^2 I_{d_s}$ and $\lambda_{\min}(\Sigma^{(P)}_{ss})\geq\lambda_0>0$, then

$$\frac{\|\tilde{\theta}_{s,\mathrm{mix}}^\star\|_2}{\|\tilde{\theta}_s^\star\|_2}\leq C\cdot\frac{\alpha\lambda_0}{\alpha\lambda_0+(1-\alpha)\sigma^2}$$

for some constant $C$.

**Interpretation.** Theorem [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2) shows that tie training directly targets the irreducible shift by shrinking spurious parameters through selective regularization. Part (i) establishes spurious weight reduction at the population level. Part (ii) translates this into reduced deployment error under shift. Part (iii) provides finite-sample guarantees. Corollary [6.3](https://arxiv.org/html/2605.11134#S6.Thmtheorem3) gives an explicit reduction factor.

### 6.3 Practical Considerations

We discuss three practical aspects of tie training.

- **Soft ties:** Exact utility equality is not required. Near-ties with $|u^\star(x,y_A)-u^\star(x,y_B)|\leq\varepsilon$ yield similar effects.
- **Random labeling:** Using both $(y_A\succ y_B)$ and $(y_B\succ y_A)$ with equal probability ensures $\mathbb{E}_{\mathrm{tie}}[\Delta\phi]=0$, while single-direction labeling injects bias.
- **Tie selection:** Systematic tie construction at scale remains an open problem. Our experiments (Section [7](https://arxiv.org/html/2605.11134#S7)) use manual construction or simple heuristics based on known spurious features.

## 7 Experiments

We validate three theoretical predictions: (i) preference optimization learns spurious correlations (Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1)), (ii) deployment error under spurious shift is irreducible (Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3)), and (iii) tie training reduces this error (Corollary [6.3](https://arxiv.org/html/2605.11134#S6.Thmtheorem3)). We evaluate these claims across three regimes of increasing realism: linear models, neural networks, and large language models.

### 7.1 Linear Models: Quantitative Validation

**Setup.** We generate Gaussian features $\phi=[\phi_c;\phi_s]$ with $d_c=5$ causal and $d_s=5$ spurious dimensions. Preference labels are generated by a Bradley–Terry logistic teacher. Spurious features correlate with preferences under the training distribution $P$ and shift at deployment $Q$. Full details are in Appendix [F.1](https://arxiv.org/html/2605.11134#A6.SS1).
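A hedged sketch of this kind of setup follows: correlated causal–spurious feature differences, Bradley–Terry labels from a teacher with $\theta_{\mathrm{BT}}=[\theta_{\mathrm{BT},c};\mathbf{0}]$, and a sign-flipped spurious correlation at deployment. The correlation mechanism and all constants are our assumptions; the exact construction in Appendix F.1 may differ.

```python
# Synthetic preference data with a causal-spurious correlation that flips at Q.
import numpy as np

rng = np.random.default_rng(7)
dc = ds = 5
rho = 0.6                                   # train-time causal-spurious coupling (assumed)

def sample_dphi(n, rho):
    """Draw feature differences with correlated causal and spurious blocks."""
    dphi_c = rng.normal(size=(n, dc))
    dphi_s = rho * dphi_c + np.sqrt(1 - rho**2) * rng.normal(size=(n, ds))
    return np.concatenate([dphi_c, dphi_s], axis=1)

theta_BT = np.concatenate([rng.normal(size=dc), np.zeros(ds)])  # spurious weights are zero
dphi_P = sample_dphi(10_000, rho)
keep = rng.random(10_000) < 1.0 / (1.0 + np.exp(-dphi_P @ theta_BT))  # BT preference draw
dphi_P[~keep] *= -1.0                       # orient every pair winner-first
dphi_Q = sample_dphi(10_000, -rho)          # spurious correlation flips at deployment
print(dphi_P.shape, dphi_Q.shape)
```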

**Results.** Figure [1](https://arxiv.org/html/2605.11134#S7.F1) validates all three theoretical predictions. Panel (a) shows that the learned spurious parameters $\|\hat{\theta}_s\|$ match the closed-form prediction (Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1)). When $d_s$ increases, the local regime assumption weakens and our quantitative predictions are no longer valid; we found empirically that including second-order curvature terms restores agreement (bottom panel). Panel (b, top) plots deployment suboptimality versus sample size $n$. The estimation error decays as $O(1/n)$, while the shift term plateaus, confirming irreducibility (Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3)). Panel (b, bottom) reports the tie-induced reduction ratio $\|\theta_s^{\mathrm{mix}}\|/\|\theta_s^{\mathrm{strict}}\|$ across mixing fractions $\alpha$. Empirical curves match the theoretical reduction factor (Corollary [6.3](https://arxiv.org/html/2605.11134#S6.Thmtheorem3)) for $\beta=0.3$ and ratio $\sigma^2/\lambda_0=5$.

![Figure 1(a), top](https://arxiv.org/html/2605.11134v1/x1.png)![Figure 1(a), bottom](https://arxiv.org/html/2605.11134v1/x2.png)

![Figure 1(b), top](https://arxiv.org/html/2605.11134v1/x3.png)![Figure 1(b), bottom](https://arxiv.org/html/2605.11134v1/x4.png)

Figure 1: **Quantitative validation of linear theory.** **(a)** Norm of the learned spurious parameters against the theoretical prediction (Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1)) (top); second-order corrections restore agreement when the local regime is violated (bottom). **(b)** Deployment suboptimality decomposition: estimation error decays as $O(1/n)$ while shift error persists, demonstrating irreducibility. As predicted by Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3), the empirical deployment error $\mathrm{SubOpt}_Q$ remains bounded away from zero as the number of training samples increases (top). The tie reduction ratio follows the theoretical formula (Corollary [6.3](https://arxiv.org/html/2605.11134#S6.Thmtheorem3)) across strict-preference fractions $\alpha$ at spurious variance ratio $\sigma^2/\lambda_0=5$ (bottom). These results confirm the population equilibrium characterization (Section [4](https://arxiv.org/html/2605.11134#S4)), the irreducibility decomposition (Section [5](https://arxiv.org/html/2605.11134#S5)), and the tie training guarantees (Section [6](https://arxiv.org/html/2605.11134#S6)).

### 7.2 Neural Networks: Qualitative Persistence

Neural networks hide the causal–spurious decomposition inside nonlinear representations, violating the linear assumptions. We test whether the same mechanisms persist qualitatively.

**Setup.** We sample latent causal and spurious features, apply a nonlinear mixing $\phi=g([\phi_c;\phi_s])$, and train an MLP scorer with DPO. The causal–spurious decomposition is hidden from the model. Details are in Appendix [F.2](https://arxiv.org/html/2605.11134#A6.SS2).

**Proxy metrics.** Since $\|\theta_s\|$ is not observable, we measure spurious reliance using (i) the *spurious gap*: the accuracy difference between aligned and misaligned spurious conditions, and (ii) *adversarial accuracy*: evaluation under reversed spurious correlation.
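Both proxies reduce to simple accuracy computations on winner-first oriented pairs. The sketch below defines them, using a linear stand-in scorer and synthetic aligned/misaligned conditions as assumptions; the paper's MLP scorer would slot in for `score`.

```python
# Proxy metrics: spurious gap and adversarial accuracy on oriented pairs.
import numpy as np

def pair_accuracy(score, dphi):
    """Fraction of winner-first pairs receiving a positive score margin."""
    return float(np.mean(score(dphi) > 0))

def spurious_gap(score, dphi_aligned, dphi_misaligned):
    """Proxy (i): accuracy under aligned minus misaligned spurious conditions."""
    return pair_accuracy(score, dphi_aligned) - pair_accuracy(score, dphi_misaligned)

rng = np.random.default_rng(8)
theta = np.array([1.0, 1.0, 0.8, 0.8])       # stand-in scorer leaning on spurious dims 3-4
score = lambda dphi: dphi @ theta
aligned = rng.normal(loc=[1, 1, 1, 1], size=(1000, 4))       # spurious agrees with causal
misaligned = rng.normal(loc=[1, 1, -1, -1], size=(1000, 4))  # spurious correlation reversed

print(spurious_gap(score, aligned, misaligned))   # proxy (i)
print(pair_accuracy(score, misaligned))           # proxy (ii): adversarial accuracy
```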

**Results.** Strict training learns spurious reliance, yielding a spurious gap of $\approx 0.7$ and adversarial accuracy of $\approx 0.25$. Tie training ($\alpha=0.75$) reduces the spurious gap to $\approx 0.45$ and improves adversarial accuracy to $\approx 0.7$, while preserving in-distribution accuracy. Figure [2](https://arxiv.org/html/2605.11134#S7.F2) shows that the spurious gap decreases monotonically with the tie mixing fraction $\alpha$, and that tie training breaks the adversarial accuracy plateau observed under strict training as sample size increases. Although the theory does not apply exactly, the qualitative behavior matches the linear predictions.

![Figure 2, left panel](https://arxiv.org/html/2605.11134v1/x5.png)

![Figure 2, right panel](https://arxiv.org/html/2605.11134v1/x6.png)

Figure 2: **Neural network validation.** **Left:** The spurious gap (accuracy on aligned minus misaligned spurious conditions) decreases monotonically with the tie mixing fraction $\alpha$. **Right:** Strict training exhibits a persistent adversarial accuracy plateau despite increasing data; tie training breaks this plateau, improving robustness from $\approx 0.18$ to $\approx 0.7$.

### 7.3 LLMs: Synthetic Hotel Preferences

**Dataset.** We construct a controlled hotel preference benchmark. Causal attributes affecting utility include price, distance, and rating. Spurious attributes (not causally affecting utility) include building age, renovation year, chain tier, lobby size, and employee count. These correlate with utility during training but do not affect true quality. Full details are in Appendix [F.3](https://arxiv.org/html/2605.11134#A6.SS3).

**Tie construction.** Informative ties satisfy three properties: (i) near-equal utility ($|u_A-u_B|<\tau$), (ii) strong spurious contrast (maximal $\|\Delta\phi_s\|$), and (iii) random labels. As an ablation, in Appendix [F.3](https://arxiv.org/html/2605.11134#A6.SS3) we also evaluate non-informative ties that use monotonic spurious assignment from utility; when $u_A\approx u_B$, this yields $\Delta\phi_s\approx 0$, providing no regularization signal, as predicted by the theory.
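A hedged sketch of this selection heuristic: keep pairs with near-equal utility, rank them by spurious contrast, and orient labels uniformly at random. The function names, thresholds, and stand-in data are illustrative, not the paper's pipeline.

```python
# Informative-tie selection: near-equal utility, maximal spurious contrast,
# random winner/loser orientation (keeps E_tie[dphi] = 0).
import numpy as np

def select_ties(utility, phi_s, tau=0.05, k=100, rng=None):
    """Return index pairs (i, j) with |u_i - u_j| < tau, ranked by spurious contrast."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(utility)
    cands = [(i, j) for i in range(n) for j in range(i + 1, n)
             if abs(utility[i] - utility[j]) < tau]
    cands.sort(key=lambda ij: -np.linalg.norm(phi_s[ij[0]] - phi_s[ij[1]]))
    return [(i, j) if rng.random() < 0.5 else (j, i) for i, j in cands[:k]]

rng = np.random.default_rng(9)
utility = rng.random(200)          # stand-in true utilities (price, distance, rating)
phi_s = rng.normal(size=(200, 5))  # stand-in spurious features (age, chain tier, ...)
ties = select_ties(utility, phi_s, rng=rng)
print(len(ties), ties[:3])
```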

**Results.** Standard DPO achieves 92% in-distribution accuracy but degrades to 74% (suppressed correlation) and 64% (adversarial correlation) under spurious shift. Informative tie training maintains 92% in-distribution accuracy while improving to 83% (suppressed) and 87% (adversarial). Table [1](https://arxiv.org/html/2605.11134#S7.T1) reports the results.

Table 1: Informative tie training (Tie-I) improves robustness under spurious shift without sacrificing in-distribution accuracy.

| Method | In-distribution | Suppressed shift | Adversarial shift |
|---|---|---|---|
| Standard DPO | 92% | 74% | 64% |
| Tie-I | 92% | 83% | 87% |

## 8 Conclusion

This work analyzes spurious correlation learning in preference optimization and establishes three main results. First, under standard binary preference objectives, spurious correlations can be learned at the population level through mean spurious bias and causal–spurious correlation leakage (Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1)). Second, when spurious statistics shift between training and deployment, such reliance can induce deployment degradation that persists even with unlimited training data drawn from the same distribution (Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3)). Third, tie training, a simple data augmentation strategy using equal-utility preference pairs, can reduce spurious reliance by selectively regularizing spurious directions without degrading causal learning (Theorem [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2)).

Together, these results suggest that robustness to spurious correlations in preference optimization depends not only on model capacity or data scale, but also on the structure of supervision. Tie training provides one concrete instance of this approach within standard preference objectives. Our analysis focuses on a local regime and assumes access to informative ties, enabling tractable characterization; extending these results beyond this regime and developing scalable tie construction methods remain important directions for future work. Further discussion appears in Appendix [H](https://arxiv.org/html/2605.11134#A8).

## Impact Statement

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgments

We would like to thank the Future Impact Group for their help in initiating this project. G. Lin would like to thank the National Science Foundation (DMS-2533878, DMS-2053746, DMS-2134209, ECCS-2328241, CBET-2347401, and OAC-2311848), the U.S. Department of Energy (DOE) Office of Science Advanced Scientific Computing Research program DE-SC0023161, the SciDAC LEADS Institute, and DOE–Fusion Energy Science under grant number DE-SC0024583.

## References

- M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Y. Bengio, M. Cohen, D. Fornasiere, J. Ghosn, P. Greiner, M. MacDermott, S. Mindermann, A. Oberman, J. Richardson, O. Richardson, et al. (2025). Superintelligent agents pose catastrophic risks: can scientist AI offer a safer path? arXiv preprint arXiv:2502.15657.
- S. Bombari and M. Mondelli (2025). Spurious correlations in high dimensional regression: the roles of regularization, simplicity bias and over-parameterization. arXiv preprint arXiv:2502.01347.
- R. A. Bradley and M. E. Terry (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), pp. 324–345.
- S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, et al. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
- C. Chang, G. A. Adam, and A. Goldenberg (2021). Towards robust classification model by counterfactual and invariant data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15212–15221.
- K. Chen, H. Tang, Q. Liu, and Y. Xu (2025). Improved algorithms for differentially private language model alignment. arXiv preprint arXiv:2505.08849.
- L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, et al. (2023). AlpaGasus: training a better Alpaca with fewer data. arXiv preprint arXiv:2307.08701.
- S. R. Chowdhury, A. Kini, and N. Natarajan (2024a). Provably robust DPO: aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409.
- S. R. Chowdhury, X. Zhou, and N. Natarajan (2024b). Differentially private reward estimation with preference feedback. In International Conference on Artificial Intelligence and Statistics, pp. 4843–4851.
- P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30.
- M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi (2015). Contextual dueling bandits. In Conference on Learning Theory, pp. 563–587.
- R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence 2(11), pp. 665–673.
- R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
- R. Haldar, Z. Wang, Q. Song, G. Lin, and Y. Xing (2025). LLM safety alignment is divergence estimation in disguise. arXiv preprint arXiv:2502.00657.
- K. L. Hermann, H. Mobahi, T. Fel, and M. C. Mozer (2023). On the foundations of shortcut learning. arXiv preprint arXiv:2310.16228.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022). LoRA: low-rank adaptation of large language models. ICLR 1(2), pp. 3.
- P. Izmailov, P. Kirichenko, N. Gruver, and A. G. Wilson (2022). On feature learning in the presence of spurious correlations. Advances in Neural Information Processing Systems 35, pp. 38516–38532.
- D. Kalimeris, G. Kaplun, P. Nakkiran, B. Edelman, T. Yang, B. Barak, and H. Zhang (2019). SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems 32.
- P. Kirichenko, P. Izmailov, and A. G. Wilson (2022). Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937.
- L. Langosco, J. Koch, L. D. Sharkey, J. Pfau, and D. Krueger (2022). Goal misgeneralization in deep reinforcement learning. In International Conference on Machine Learning, pp. 12004–12019.
- E. Z. Liu, B. Haghgoo, A. S. Chen, A. Raghunathan, P. W. Koh, S. Sagawa, P. Liang, and C. Finn (2021). Just train twice: improving group robustness without training group information. In International Conference on Machine Learning, pp. 6781–6792.
- D. Morwani, J. Batra, P. Jain, and P. Netrapalli (2023). Simplicity bias in 1-hidden layer neural networks. Advances in Neural Information Processing Systems 36, pp. 8048–8075.
- R. Ngo, L. Chan, and S. Mindermann (2024). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- R. Park, R. Rafailov, S. Ermon, and C. Finn (2024). Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159.
- M. Pezeshki, O. Kaba, Y. Bengio, A. C. Courville, D. Precup, and G. Lajoie (2021). Gradient starvation: a learning proclivity in neural networks. Advances in Neural Information Processing Systems 34, pp. 1256–1272.
- G. Plumb, M. T. Ribeiro, and A. Talwalkar (2021). Finding and fixing spurious patterns with explanations. arXiv preprint arXiv:2106.02112.
- G. Qiu, D. Kuang, and S. Goel (2024). Complexity matters: dynamics of feature learning in the presence of spurious correlations. arXiv preprint arXiv:2403.03375.
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741.
- N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019). On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310.
- S. Sagawa, A. Raghunathan, P. W. Koh, and P. Liang (2020). An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning, pp. 8346–8356.
- A. Saha, A. Pacchiano, and J. Lee (2023). Dueling RL: reinforcement learning with trajectory preferences. In International Conference on Artificial Intelligence and Statistics, pp. 6263–6289.
- K. Saito, A. Wachi, K. Wataoka, and Y. Akimoto (2023). Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076.
- H. Shah, K. Tamuly, A. Raghunathan, P. Jain, and P. Netrapalli (2020). The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems 33, pp. 9573–9585.
- R. Shah, V. Varma, R. Kumar, M. Phuong, V. Krakovna, J. Uesato, and Z. Kenton (2022). Goal misgeneralization: why correct specifications aren't enough for correct goals. arXiv preprint arXiv:2210.01790.
- M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, et al. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.
- P. Singhal, T. Goyal, J. Xu, and G. Durrett (2023). A long way to go: investigating length correlations in RLHF. arXiv preprint arXiv:2310.03716.
- S. Singla and S. Feizi (2021). Salient ImageNet: how to discover spurious features in deep learning? arXiv preprint arXiv:2110.04301.
- J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger (2022). Defining and characterizing reward gaming. Advances in Neural Information Processing Systems 35, pp. 9460–9471.
- J. Wu, Y. Xie, Z. Yang, J. Wu, J. Chen, J. Gao, B. Ding, X. Wang, and X. He (2024). Towards robust alignment of language models: distributionally robustifying direct preference optimization. arXiv preprint arXiv:2407.07880.
- K. Xiao, L. Engstrom, A. Ilyas, and A. Madry (2020). Noise or signal: the role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994.
- W. Xiong, H. Dong, C. Ye, Z. Wang, H. Zhong, H. Ji, N. Jiang, and T. Zhang (2023). Iterative preference learning from human feedback: bridging theory and practice for RLHF under KL-constraint. arXiv preprint arXiv:2312.11456.
- X. Zhang, W. Xiong, L. Chen, T. Zhou, H. Huang, and T. Zhang (2025). From lists to emojis: how format bias affects model alignment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 26940–26961.
- C. Zhou, X. Ma, P. Michel, and G. Neubig (2021). Examining and combating spurious features under distribution shift. In International Conference on Machine Learning, pp. 12857–12867.
- X. Zhou, Y. Wu, and F. Orabona (2025). A unified theoretical analysis of private and robust offline alignment: from RLHF to DPO. arXiv preprint arXiv:2505.15694.
- B. Zhu, M. Jordan, and J. Jiao (2023). Principled reinforcement learning with human feedback from pairwise or K-wise comparisons. In International Conference on Machine Learning, pp. 43037–43067.
- D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

## Appendix A Full Proof of Spurious Learning Mechanism (Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1))

This section provides the full proof of Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1) by progressively isolating the mechanisms that drive spurious learning in preference optimization. We first restate the population objective and derive its exact gradient, then linearize the dynamics under the local regime assumption. We next decompose the resulting covariance structure into causal and spurious blocks and solve the corresponding equilibrium in closed form. Finally, we interpret the solution to separate and quantify the mean-bias and correlation-leakage contributions, completing the proof.

Our analysis relies on the local linearization assumption introduced in Section [3](https://arxiv.org/html/2605.11134#S3) (Assumption [3.2](https://arxiv.org/html/2605.11134#S3.Thmtheorem2)), which we restate below for completeness.

###### Assumption A.1 (Local regime).

With high probability under the data distribution, $|\beta\,\tilde{\theta}^{\top}\Delta\phi| \ll 1$.

This assumption isolates the early and near-optimal training regimes where DPO behaves approximately linearly, consistent with prior analyses of logistic preference learning.

### A.1 Population Objective and Gradient

We study the population objective:

$$L(\theta) := \mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\big[\ell(\theta; x, y_w, y_l)\big] = -\mathbb{E}\Big[\log\sigma\big(\beta\,\tilde{\theta}^{\top}\Delta\phi\big)\Big], \qquad \tilde{\theta} := \theta - \theta_{\mathrm{ref}}.$$
##### Full DPO Pairwise Gradient.

The following expression follows from the chain rule and $\frac{d}{dz}[-\log\sigma(z)] = \sigma(z) - 1$. In particular,

$$\frac{d}{dz}\ell(z) = \frac{d}{dz}\big(-\log\sigma(z)\big) = -\frac{1}{\sigma(z)}\cdot\sigma(z)\big(1-\sigma(z)\big) = -\big(1-\sigma(z)\big) = \sigma(z) - 1.$$

Defining the adaptive weight $w(\Delta_\theta) := 1 - \sigma(\beta\Delta_\theta) \in (0,1)$ and applying the chain rule with $z = \beta\tilde{\theta}^{\top}\Delta\phi$, the gradient is:

$$\nabla_\theta\,\ell(\theta; x, y_w, y_l) = \big(\sigma(\beta\Delta_\theta) - 1\big)\cdot\beta\,\Delta\phi = -\beta\cdot w(\Delta_\theta)\cdot\Delta\phi.$$
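
As a sanity check on this expression, the short script below (ours; the dimension, temperature, and seed are arbitrary) compares the analytic per-pair gradient with central finite differences of the per-pair loss:

```python
# Numerical check (ours) of the per-pair DPO gradient for a log-linear policy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_loss(theta_tilde, dphi, beta):
    # ell = -log sigma(beta * theta_tilde^T dphi)
    return -np.log(sigmoid(beta * theta_tilde @ dphi))

def pair_grad(theta_tilde, dphi, beta):
    # analytic form: -beta * w * dphi with w = 1 - sigma(beta * theta_tilde^T dphi)
    w = 1.0 - sigmoid(beta * theta_tilde @ dphi)
    return -beta * w * dphi

rng = np.random.default_rng(0)
d, beta, eps = 5, 0.3, 1e-6
theta_tilde, dphi = rng.normal(size=d), rng.normal(size=d)

numeric = np.zeros(d)
for k in range(d):
    e = np.zeros(d)
    e[k] = eps
    numeric[k] = (pair_loss(theta_tilde + e, dphi, beta)
                  - pair_loss(theta_tilde - e, dphi, beta)) / (2 * eps)
print(np.allclose(numeric, pair_grad(theta_tilde, dphi, beta), atol=1e-7))  # True
```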

##### Population Gradient Under Strict Preferences.

Under the local regime (Assumption [A.1](https://arxiv.org/html/2605.11134#A1.Thmtheorem1)), we derive the linearized population gradient. To this end, first define the feature second-moment matrix:

$$\Sigma := \mathbb{E}[\Delta\phi\,\Delta\phi^{\top}] \in \mathbb{R}^{d\times d}.$$

This is related to the covariance matrix $C = \mathrm{Cov}(\Delta\phi)$ by $\Sigma = C + \mu\mu^{\top}$, where $\mu = \mathbb{E}[\Delta\phi]$. We use second moments (rather than centered covariances) because they lead to simpler final formulas.

##### Step 1: We linearize the expected weight:

$$\mathbb{E}[w(\Delta_\theta)] \approx \mathbb{E}\left[\frac{1}{2} - \frac{\beta}{4}\tilde{\theta}^{\top}\Delta\phi\right] = \frac{1}{2} - \frac{\beta}{4}\tilde{\theta}^{\top}\mu.$$

##### Step 2: We compute the covariance term as follows:

$$\mathrm{Cov}\big(\Delta\phi,\, w(\Delta_\theta)\big) = \mathrm{Cov}\left(\Delta\phi,\, \frac{1}{2} - \frac{\beta}{4}\tilde{\theta}^{\top}\Delta\phi\right) = -\frac{\beta}{4}\,\mathrm{Cov}\big(\Delta\phi,\, \tilde{\theta}^{\top}\Delta\phi\big).$$

Using the identity for the covariance between a vector and a linear form yields:

$$\mathrm{Cov}\big(\Delta\phi,\, \tilde{\theta}^{\top}\Delta\phi\big) = \mathbb{E}\big[\Delta\phi\cdot(\tilde{\theta}^{\top}\Delta\phi)\big] - \mathbb{E}[\Delta\phi]\,\mathbb{E}[\tilde{\theta}^{\top}\Delta\phi] = \mathbb{E}[\Delta\phi\,\Delta\phi^{\top}]\,\tilde{\theta} - \mu(\mu^{\top}\tilde{\theta}) = \Sigma\tilde{\theta} - \mu\mu^{\top}\tilde{\theta}.$$

Therefore:

$$\mathrm{Cov}\big(\Delta\phi,\, w(\Delta_\theta)\big) = -\frac{\beta}{4}\big(\Sigma\tilde{\theta} - \mu\mu^{\top}\tilde{\theta}\big).$$

##### Step 3: We combine the terms:

$$\begin{aligned}
\nabla_\theta L(\theta) &= -\beta\cdot\mathbb{E}[w(\Delta_\theta)]\cdot\mu - \beta\cdot\mathrm{Cov}\big(\Delta\phi,\, w(\Delta_\theta)\big)\\
&= -\beta\left(\frac{1}{2} - \frac{\beta}{4}\tilde{\theta}^{\top}\mu\right)\mu - \beta\left(-\frac{\beta}{4}\big(\Sigma\tilde{\theta} - \mu\mu^{\top}\tilde{\theta}\big)\right)\\
&= -\frac{\beta}{2}\mu + \frac{\beta^{2}}{4}(\tilde{\theta}^{\top}\mu)\mu + \frac{\beta^{2}}{4}\Sigma\tilde{\theta} - \frac{\beta^{2}}{4}\mu\mu^{\top}\tilde{\theta}\\
&= -\frac{\beta}{2}\mu + \frac{\beta^{2}}{4}(\mu^{\top}\tilde{\theta})\mu + \frac{\beta^{2}}{4}\Sigma\tilde{\theta} - \frac{\beta^{2}}{4}\mu(\mu^{\top}\tilde{\theta}).
\end{aligned}$$

The terms $\frac{\beta^{2}}{4}(\mu^{\top}\tilde{\theta})\mu$ and $-\frac{\beta^{2}}{4}\mu(\mu^{\top}\tilde{\theta})$ cancel exactly (this cancellation is why using $\Sigma$ rather than $C$ gives a clean final formula). Finally, we obtain the linearized population gradient:

$$\nabla_\theta L(\theta) \approx -\frac{\beta}{2}\mu + \frac{\beta^{2}}{4}\Sigma\tilde{\theta}. \tag{10}$$
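
The linearization can be checked numerically. The following Monte Carlo sketch (ours; the Gaussian feature model and scales are illustrative) compares the exact population gradient with the right-hand side of (10) for a small $\tilde{\theta}$:

```python
# Monte Carlo check (ours) of the linearized population gradient (10).
import numpy as np

rng = np.random.default_rng(1)
d, beta, n = 4, 0.1, 1_000_000
mu = rng.normal(scale=0.3, size=d)               # mu = E[Delta phi]
dphi = mu + rng.normal(size=(n, d))              # Gaussian feature differences
theta_tilde = 0.05 * rng.normal(size=d)          # small deviation: local regime

z = beta * dphi @ theta_tilde
w = 1.0 - 1.0 / (1.0 + np.exp(-z))               # exact adaptive weights
exact = (-beta * w[:, None] * dphi).mean(axis=0) # E[-beta * w * Delta phi]

Sigma = dphi.T @ dphi / n                        # second-moment matrix
linear = -beta / 2 * mu + beta**2 / 4 * Sigma @ theta_tilde
print(np.abs(exact - linear).max())              # ~1e-4: Monte Carlo noise level
```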

### A.2 Solving the Equilibrium

We now derive the explicit equilibrium expressions for the causal and spurious components of the linearized population optimum stated in Section [4](https://arxiv.org/html/2605.11134#S4).

##### Setup.

Using the linearized population gradient ([10](https://arxiv.org/html/2605.11134#A1.E10)), the equilibrium condition $\nabla_\theta L(\theta^{\star}) = \mathbf{0}$ yields

$$\Sigma\tilde{\theta}^{\star} = \frac{2}{\beta}\mu,$$

where $\tilde{\theta}^{\star} = \theta^{\star} - \theta_{\mathrm{ref}}$. Writing $\tilde{\theta}^{\star} = [\tilde{\theta}_c^{\star\top}, \tilde{\theta}_s^{\star\top}]^{\top}$, $\mu = [\mu_c^{\top}, \mu_s^{\top}]^{\top}$, and

$$\Sigma = \begin{bmatrix}\Sigma_{cc} & \Sigma_{cs}\\ \Sigma_{sc} & \Sigma_{ss}\end{bmatrix},$$

the equilibrium equations decompose as

$$\Sigma_{cc}\tilde{\theta}_c^{\star} + \Sigma_{cs}\tilde{\theta}_s^{\star} = \frac{2}{\beta}\mu_c, \tag{11}$$

$$\Sigma_{sc}\tilde{\theta}_c^{\star} + \Sigma_{ss}\tilde{\theta}_s^{\star} = \frac{2}{\beta}\mu_s. \tag{12}$$

##### Solving for the spurious component.

Assume $\Sigma_{ss}$ is invertible. From ([12](https://arxiv.org/html/2605.11134#A1.E12)),

$$\tilde{\theta}_s^{\star} = \Sigma_{ss}^{-1}\left(\frac{2}{\beta}\mu_s - \Sigma_{sc}\tilde{\theta}_c^{\star}\right).$$

Substituting into ([11](https://arxiv.org/html/2605.11134#A1.E11)) gives

$$\big(\Sigma_{cc} - \Sigma_{cs}\Sigma_{ss}^{-1}\Sigma_{sc}\big)\tilde{\theta}_c^{\star} = \frac{2}{\beta}\big(\mu_c - \Sigma_{cs}\Sigma_{ss}^{-1}\mu_s\big).$$

##### Schur complement.

Define the Schur complement:

$$S_c := \Sigma_{cc} - \Sigma_{cs}\Sigma_{ss}^{-1}\Sigma_{sc}.$$

Assuming $S_c$ is invertible, the causal component satisfies

$$\tilde{\theta}_c^{\star} = \frac{2}{\beta}S_c^{-1}\big(\mu_c - \Sigma_{cs}\Sigma_{ss}^{-1}\mu_s\big).$$

##### Explicit spurious equilibrium.

Substituting the expression for $\tilde{\theta}_c^{\star}$ back yields

$$\begin{aligned}
\tilde{\theta}_s^{\star} &= \Sigma_{ss}^{-1}\left(\frac{2}{\beta}\mu_s - \Sigma_{sc}\tilde{\theta}_c^{\star}\right)\\
&= \frac{2}{\beta}\Sigma_{ss}^{-1}\left[\mu_s - \Sigma_{sc}S_c^{-1}\mu_c + \Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\mu_s\right]\\
&= \frac{2}{\beta}\Sigma_{ss}^{-1}\left[\big(I + \Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\big)\mu_s - \Sigma_{sc}S_c^{-1}\mu_c\right].
\end{aligned}$$

This completes the derivation of the explicit spurious equilibrium reported in Section [4](https://arxiv.org/html/2605.11134#S4).
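
A quick numerical check (ours; the positive-definite $\Sigma$ below is synthetic) confirms that the block/Schur-complement formulas reproduce the direct solve of $\Sigma\tilde{\theta}^{\star} = \frac{2}{\beta}\mu$:

```python
# Sketch (ours): block/Schur solve vs. direct solve of the equilibrium system.
import numpy as np

rng = np.random.default_rng(2)
dc, ds, beta = 3, 2, 0.5
A = rng.normal(size=(dc + ds, dc + ds))
Sigma = A @ A.T + 0.1 * np.eye(dc + ds)     # positive-definite second moment
mu = rng.normal(size=dc + ds)
Scc, Scs = Sigma[:dc, :dc], Sigma[:dc, dc:]
Ssc, Sss = Sigma[dc:, :dc], Sigma[dc:, dc:]
mu_c, mu_s = mu[:dc], mu[dc:]

direct = np.linalg.solve(Sigma, 2 / beta * mu)          # solve the full system

Sc = Scc - Scs @ np.linalg.solve(Sss, Ssc)              # Schur complement S_c
theta_c = 2 / beta * np.linalg.solve(Sc, mu_c - Scs @ np.linalg.solve(Sss, mu_s))
theta_s = np.linalg.solve(Sss, 2 / beta * mu_s - Ssc @ theta_c)
print(np.allclose(direct, np.concatenate([theta_c, theta_s])))  # True
```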

### A.3 Interpretation of the Equilibrium Solution

We now interpret the equilibrium solution derived above and state our main result characterizing spurious learning at the population level. The following theorem formalizes when and why the population optimizer assigns nonzero weight to spurious features under the linearized dynamics.

###### Theorem A.2 (Spurious learning from mean bias and correlation leakage).

Under the linearized population dynamics (Equation ([10](https://arxiv.org/html/2605.11134#A1.E10))) and standard invertibility conditions, if $\mu_s \neq 0$ (mean spurious bias) or $\Sigma_{cs} \neq 0$ (causal–spurious leakage), then $\tilde{\theta}_s^{\star} \neq 0$. That is, the population optimum assigns nonzero weight to spurious features.

###### Proof.

The linearized population gradient takes the form (see Appendix [A.1](https://arxiv.org/html/2605.11134#A1.SS1) for details)

$$g(\tilde{\theta}) = -\frac{\beta}{2}\mu + \frac{\beta^{2}}{4}\Sigma\tilde{\theta},$$

so the equilibrium satisfies $g(\tilde{\theta}^{\star}) = 0$, i.e.

$$\Sigma\tilde{\theta}^{\star} = \frac{2}{\beta}\mu.$$

Write the block system:

$$\Sigma_{cc}\tilde{\theta}_c^{\star} + \Sigma_{cs}\tilde{\theta}_s^{\star} = \frac{2}{\beta}\mu_c, \tag{13}$$

$$\Sigma_{sc}\tilde{\theta}_c^{\star} + \Sigma_{ss}\tilde{\theta}_s^{\star} = \frac{2}{\beta}\mu_s. \tag{14}$$

Assume $\Sigma_{ss} \succ 0$ and $S_c \succ 0$, so the solution is unique. Solving by Schur complement (see Appendix [A.2](https://arxiv.org/html/2605.11134#A1.SS2) for details) yields

$$\tilde{\theta}_c^{\star} = \frac{2}{\beta}S_c^{-1}\big(\mu_c - \Sigma_{cs}\Sigma_{ss}^{-1}\mu_s\big).$$

Substituting into ([14](https://arxiv.org/html/2605.11134#A1.E14)) yields

$$\begin{aligned}
\tilde{\theta}_s^{\star} &= \frac{2}{\beta}\Sigma_{ss}^{-1}\Big[\mu_s - \Sigma_{sc}S_c^{-1}\mu_c + \Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\mu_s\Big]\\
&= \underbrace{\frac{2}{\beta}\Sigma_{ss}^{-1}\Big(I + \Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\Big)\mu_s}_{\text{(i) mean-bias term}} + \underbrace{\left(-\frac{2}{\beta}\Sigma_{ss}^{-1}\Sigma_{sc}S_c^{-1}\mu_c\right)}_{\text{(ii) correlation-leakage term}}. \tag{15}
\end{aligned}$$

This proves the claimed decomposition into a component driven by $\mu_s$ and a component driven by $\Sigma_{sc}$. It remains to show: if $\mu_s \neq 0$ or $\Sigma_{sc} \neq 0$, then generically $\tilde{\theta}_s^{\star} \neq 0$.

**Case 1: $\mu_s \neq 0$.** The mean-bias term in ([15](https://arxiv.org/html/2605.11134#A1.E15)) equals $A\mu_s$ with $A := \frac{2}{\beta}\Sigma_{ss}^{-1}\big(I + \Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\big)$. For generic problem instances, $A$ is full-rank (it fails only on a measure-zero set where $I + \Sigma_{sc}S_c^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}$ has a nullspace aligned with $\mu_s$). Hence $A\mu_s \neq 0$ for generic $\mu_s \neq 0$, implying $\tilde{\theta}_s^{\star} \neq 0$.

**Case 2: $\mu_s = 0$ but $\Sigma_{sc} \neq 0$.** Then $\tilde{\theta}_c^{\star} = \frac{2}{\beta}S_c^{-1}\mu_c$. Substituting into ([14](https://arxiv.org/html/2605.11134#A1.E14)) gives

$$\tilde{\theta}_s^{\star} = -\Sigma_{ss}^{-1}\Sigma_{sc}\tilde{\theta}_c^{\star} = -\frac{2}{\beta}\Sigma_{ss}^{-1}\Sigma_{sc}S_c^{-1}\mu_c.$$

If $\Sigma_{sc} \neq 0$, then for generic $\mu_c$ this expression is nonzero (failing only when $\mu_c$ lies in the nullspace of $\Sigma_{sc}S_c^{-1}$, again a measure-zero alignment condition). Thus generically $\tilde{\theta}_s^{\star} \neq 0$.

Combining the two cases establishes the theorem. ∎
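
Case 2 is easy to see numerically. In the two-feature example below (ours), the spurious mean is exactly zero, yet the nonzero cross-correlation $\Sigma_{sc}$ forces a nonzero spurious weight at equilibrium:

```python
# Illustration (ours) of correlation leakage: mu_s = 0 but Sigma_sc != 0.
import numpy as np

beta = 0.5
Sigma = np.array([[1.0, 0.6],   # one causal and one spurious feature;
                  [0.6, 1.0]])  # Sigma_sc = 0.6 couples the two blocks
mu = np.array([1.0, 0.0])       # mu_s = 0: no mean spurious bias at all

theta = np.linalg.solve(Sigma, 2 / beta * mu)
print(theta)   # [ 6.25 -3.75]: the spurious weight is nonzero purely via leakage
```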

## Appendix B Deployment Error and Margin-Based Analysis

This appendix analyzes how spurious correlations induce vulnerabilities that lead to deployment error. We first define the KL-regularized policy objective that underlies preference learning and characterize its surrogate geometry. We then formalize deployment error and decompose it into distribution-shift and estimation components. Using this decomposition, we prove the statement of Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3). We conclude by deriving a margin-based approximation that links spurious margins to deployment degradation.

## Appendix C Assumptions for Deployment Analysis

We start by providing a detailed discussion of the assumptions used in the deployment suboptimality analysis.

###### Assumption C.1 (Bounded feature differences).

We assume that feature differences are uniformly bounded:

$$\|\Delta\phi(x, y_w, y_l)\|_2 \leq 1 \quad \text{for all preference pairs } (x, y_w, y_l).$$

This assumption is primarily a normalization convention rather than a substantive restriction. Specifically, if the feature map $\phi(x,y)$ satisfies

$$R := \sup_{x, y_w, y_l}\|\phi(x, y_w) - \phi(x, y_l)\|_2 < \infty,$$

then Assumption [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1) can always be enforced by rescaling the features as $\tilde{\phi} = \phi/R$, together with the corresponding parameter reparametrization $\tilde{\theta} = R\theta$.

In practice, it holds in many common settings, including:

- when features are explicitly normalized (e.g., embeddings projected onto the unit sphere),
- when features are bounded by construction (e.g., indicator features, one-hot encodings, or clipped neural activations),
- when the support of the feature distribution is compact.

###### Assumption C.2 (Bounded parameters).

We assume that the deviation from the reference parameters is uniformly bounded:

$$\|\tilde{\theta}\|_2 \leq B \quad \text{for all } \theta \in \Theta_B, \qquad \tilde{\theta} := \theta - \theta_{\mathrm{ref}}.$$

This assumption restricts optimization to a compact parameter set and ensures that the preference score differences $\beta\,\tilde{\theta}^{\top}\Delta\phi$ remain uniformly bounded, which is required for the local regime. The bound $B$ may be enforced explicitly (e.g., via constrained optimization) or implicitly (e.g., via early stopping).

##### Joint implication of Assumptions [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1) and [C.2](https://arxiv.org/html/2605.11134#A3.Thmtheorem2).

Under Assumptions [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1) and [C.2](https://arxiv.org/html/2605.11134#A3.Thmtheorem2), preference score differences satisfy

$$|\beta\,\tilde{\theta}^{\top}\Delta\phi| \leq \beta\,\|\tilde{\theta}\|_2\,\|\Delta\phi\|_2 \leq \beta B.$$

Thus, if $\beta B \leq \epsilon$, the local regime assumption [A.1](https://arxiv.org/html/2605.11134#A1.Thmtheorem1) holds uniformly over $\Theta_B$.

###### Assumption C.3 (Local regime).

We assume preference score differences remain small:

$$|\beta\,\tilde{\theta}^{\top}\Delta\phi| \leq \epsilon \ll 1 \quad \text{with high probability under } P \text{ and } Q.$$

##### Role in the analysis.

The local regime assumption is used in two places.

1. **Quadratic Taylor control.** It ensures that the pairwise surrogate $\tilde{J}^{(D)}(\theta)$ admits a second-order Taylor expansion with a uniformly bounded remainder, yielding the quadratic estimation term in Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3).
2. **First-order margin approximation.** It justifies linearizing the logistic log-likelihood as $\log\sigma(z) = \log(1/2) + z/2 + O(z^2)$ (since $(\log\sigma)'(0) = 1/2$; see the proof of Proposition [C.5](https://arxiv.org/html/2605.11134#A3.Thmtheorem5)), as used in Proposition [5.1](https://arxiv.org/html/2605.11134#S5.Thmtheorem1).

##### Quantitative scale.

Let $z := \beta\,\tilde{\theta}^{\top}\Delta\phi$. Consider the linearization

$$\sigma(z) \approx \frac{1}{2} + \frac{z}{4}.$$

Its absolute error satisfies

$$\max_{|z|\leq 1}\left|\sigma(z) - \left(\frac{1}{2} + \frac{z}{4}\right)\right| = \left|\sigma(1) - \frac{3}{4}\right| \approx 1.9\times 10^{-2},$$

and

$$\max_{|z|\leq 2}\left|\sigma(z) - \left(\frac{1}{2} + \frac{z}{4}\right)\right| = \left|\sigma(2) - 1\right| \approx 1.19\times 10^{-1}.$$

Accordingly, the linear approximation is accurate when $|z| \lesssim 1$ and remains qualitatively informative for $|z| = O(1)$.
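
These two error values can be reproduced directly (a one-off numeric check, ours):

```python
# Check (ours) of the sigma(z) ~ 1/2 + z/4 approximation errors quoted above.
import numpy as np

z = np.linspace(-2.0, 2.0, 400_001)
err = np.abs(1.0 / (1.0 + np.exp(-z)) - (0.5 + z / 4.0))
print(err[np.abs(z) <= 1.0].max())   # ~1.9e-2, attained at |z| = 1
print(err.max())                     # ~1.19e-1, attained at |z| = 2
```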

##### When the local regime holds.

The local regime is satisfied under the following conditions:

- **Near initialization:** $\tilde{\theta} \approx 0$ implies $|\beta\,\tilde{\theta}^{\top}\Delta\phi| \approx 0$.
- **Bounded parameters:** Under Assumptions [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1) and [C.2](https://arxiv.org/html/2605.11134#A3.Thmtheorem2), $|\beta\,\tilde{\theta}^{\top}\Delta\phi| \leq \beta B$, so Assumption [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3) holds whenever $\beta B \leq \epsilon$.
- **Moderate temperature:** A smaller $\beta$ directly reduces score magnitudes.

##### Connection to Section [4](https://arxiv.org/html/2605.11134#S4).

Assumption [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3) is the *same* local regime used in Appendix [A.1](https://arxiv.org/html/2605.11134#A1.SS1) for the population gradient analysis. Specifically:

- Section [4](https://arxiv.org/html/2605.11134#S4) linearizes the weight function $w(\Delta_\theta) = 1 - \sigma(\beta\Delta_\theta)$ around $\tilde{\theta} = 0$, which requires $|\beta\tilde{\theta}^{\top}\Delta\phi| \ll 1$.
- Section [5](https://arxiv.org/html/2605.11134#S5) uses the same regime to justify quadratic bounds on $\tilde{J}^{(D)}$ and first-order margin expansions.

Thus, the mechanism (Section [4](https://arxiv.org/html/2605.11134#S4)) and consequence (Section [5](https://arxiv.org/html/2605.11134#S5)) analyses are consistent: both operate in the same local regime where the linearized characterization of spurious learning applies.

##### Beyond the local regime.

When preference score differences become large (i.e., when $\beta B \gg 1$), Assumption [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3) no longer holds and the quantitative predictions of the local analysis require modification. In this regime:

- The weight function $w(\Delta_\theta) = \sigma'(\beta\,\tilde{\theta}^{\top}\Delta\phi)$ saturates toward zero as $|\beta\,\tilde{\theta}^{\top}\Delta\phi| \to \infty$, reducing the effective gradient contribution of extreme preference pairs.
- Quadratic Taylor approximations around $\theta_{\mathrm{ref}}$ become inaccurate, and tighter control requires global smoothness assumptions or uniform Lipschitz bounds on the objective.
- The first-order margin interpretation in Proposition [5.1](https://arxiv.org/html/2605.11134#S5.Thmtheorem1) loses quantitative accuracy.

##### Deployment curvature.

Let $\|v\|_A^2 := v^{\top}Av$ for any positive semidefinite matrix $A$. Define the (pairwise) Fisher information under deployment $Q$, evaluated at $\theta_{\mathrm{train}}$, by

$$\bar{\Sigma}^{(Q)} := \mathbb{E}_{(x,y_w,y_l)\sim Q}\left[\beta^{2}\,w(\Delta_\theta)\,\Delta\phi\,\Delta\phi^{\top}\right], \qquad w(\Delta_\theta) := \sigma(z)(1-\sigma(z)), \quad z := \beta\,\tilde{\theta}_{\mathrm{train}}^{\top}\Delta\phi.$$

Define $\bar{\Sigma}^{(P)}$ analogously by replacing $Q$ with $P$, and let

$$H_P := \bar{\Sigma}^{(P)} + \lambda I.$$

###### Assumption C.4 (Geometry Transfer).

There exists $\kappa_\Pi \geq 1$ such that for all $v \in \mathbb{R}^d$,

$$\|v\|_{\bar{\Sigma}^{(Q)}}^2 \leq \kappa_\Pi\,\|v\|_{H_P}^2.$$

##### Sufficient conditions for geometry transfer.

Assumption [C.4](https://arxiv.org/html/2605.11134#A3.Thmtheorem4) holds with an explicit constant $\kappa_\Pi$ under the following sufficient conditions.

**Condition 1 (Direct Fisher domination).** If there exists $c_\Pi \geq 1$ such that

$$\bar{\Sigma}^{(Q)} \preceq c_\Pi\,H_P,$$

then Assumption [C.4](https://arxiv.org/html/2605.11134#A3.Thmtheorem4) holds with $\kappa_\Pi \leq c_\Pi$.

**Condition 2 (Bounded density ratio).** If the deployment distribution $Q$ is absolutely continuous with respect to the training distribution $P$ and satisfies $\frac{dQ}{dP} \leq \rho_{\max}$ almost surely, then

$$\bar{\Sigma}^{(Q)} = \mathbb{E}_Q[\beta^{2}w(\Delta_\theta)\,\Delta\phi\,\Delta\phi^{\top}] \preceq \rho_{\max}\,\mathbb{E}_P[\beta^{2}w(\Delta_\theta)\,\Delta\phi\,\Delta\phi^{\top}] = \rho_{\max}\,\bar{\Sigma}^{(P)} \preceq \rho_{\max}\,H_P,$$

and hence Assumption [C.4](https://arxiv.org/html/2605.11134#A3.Thmtheorem4) holds with $\kappa_\Pi \leq \rho_{\max}$.

**Condition 3 (Covariance transfer under bounded scores).** Assume the local regime holds uniformly, so that $|\beta\,\tilde{\theta}_{\mathrm{train}}^{\top}\Delta\phi| \leq \beta B$ and therefore

$$\beta^{2}\gamma_{\beta,B}\,\Delta\phi\,\Delta\phi^{\top} \preceq \beta^{2}w(\Delta_\theta)\,\Delta\phi\,\Delta\phi^{\top} \preceq \frac{\beta^{2}}{4}\,\Delta\phi\,\Delta\phi^{\top},$$

where $\gamma_{\beta,B} := \sigma(\beta B)(1 - \sigma(\beta B))$. If the feature covariances satisfy

$$\Sigma^{(Q)} := \mathbb{E}_Q[\Delta\phi\,\Delta\phi^{\top}] \preceq c\,\big(\Sigma^{(P)} + \lambda I\big)$$

for some $c \geq 1$, then

$$\bar{\Sigma}^{(Q)} \preceq \frac{\beta^{2}}{4}\,\Sigma^{(Q)} \preceq \frac{c}{4}\,\beta^{2}\big(\Sigma^{(P)} + \lambda I\big) \preceq \frac{c}{4\gamma_{\beta,B}}\,H_P,$$

and Assumption [C.4](https://arxiv.org/html/2605.11134#A3.Thmtheorem4) holds with $\kappa_\Pi \leq \frac{c}{4\gamma_{\beta,B}}$.
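
Operationally, the smallest admissible $\kappa_\Pi$ is the largest generalized eigenvalue of the pair $(\bar{\Sigma}^{(Q)}, H_P)$. The sketch below (ours; both matrices are synthetic placeholders rather than estimated Fisher matrices) computes it via a Cholesky whitening and spot-checks the quadratic-form bound:

```python
# Sketch (ours): smallest kappa_Pi certifying Assumption C.4 for given matrices.
import numpy as np

rng = np.random.default_rng(3)
d, lam = 4, 0.1
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma_bar_Q = A @ A.T                        # deployment Fisher (synthetic)
H_P = B @ B.T + lam * np.eye(d)              # regularized training curvature

# kappa = lambda_max(L^{-1} Sigma_bar_Q L^{-T}) where H_P = L L^T (Cholesky)
L = np.linalg.cholesky(H_P)
M = np.linalg.solve(L, np.linalg.solve(L, Sigma_bar_Q).T)
kappa = np.linalg.eigvalsh(M).max()

v = rng.normal(size=d)                       # spot-check the quadratic-form bound
print(v @ Sigma_bar_Q @ v <= kappa * (v @ H_P @ v) + 1e-9)  # True
```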

### C.1 Policy and Surrogate Objectives

This appendix formalizes the policy-level interpretation and clarifies the relationship between the KL-regularized objective optimized by DPO-style methods and the pairwise logistic surrogate analyzed in Section [5](https://arxiv.org/html/2605.11134#S5).

##### KL-regularized policy objective.

For a distribution $D$ over prompts $x$, the KL-regularized policy objective is

$$J_\beta^{(D)}(\pi) = \mathbb{E}_{x\sim D}\left[\mathbb{E}_{y\sim\pi(\cdot|x)}[r^{\star}(x,y)] - \beta\,\mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_{\mathrm{ref}}(\cdot|x)\big)\right], \tag{16}$$

where $\pi_{\mathrm{ref}}$ is a fixed reference policy and $\beta > 0$ controls the strength of the KL regularization.

If the policy $\pi(\cdot|x)$ is unrestricted, the maximizer of ([16](https://arxiv.org/html/2605.11134#A3.E16)) admits the closed-form solution:

$$\pi^{\star}(y|x) = \frac{\pi_{\mathrm{ref}}(y|x)\exp\big(\tfrac{1}{\beta}r^{\star}(x,y)\big)}{Z_\beta(x)}, \qquad Z_\beta(x) = \sum_y \pi_{\mathrm{ref}}(y|x)\exp\big(\tfrac{1}{\beta}r^{\star}(x,y)\big).$$

Thus, KL regularization induces an exponential reweighting of the reference policy, softly biasing probability mass toward higher-reward responses while retaining support and regularization from $\pi_{\mathrm{ref}}$.
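
On a discrete response set this closed form is a one-line computation. The toy sketch below (ours; the reference policy and rewards are made up) shows the exponential tilt:

```python
# Toy illustration (ours) of the closed-form maximizer of (16).
import numpy as np

beta = 0.5
pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over three responses
r = np.array([1.0, 0.0, -1.0])       # true rewards r*(x, y)

logits = np.log(pi_ref) + r / beta   # log pi_ref + r*/beta
pi_star = np.exp(logits - logits.max())
pi_star /= pi_star.sum()             # normalize by Z_beta(x)
print(pi_star)   # mass tilts toward the high-reward response; support preserved
```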

##### Log-linear policy class (theoretical model).

For analysis, we restrict to the log-linear family

$$\pi_\theta(y|x) = \frac{\pi_{\mathrm{ref}}(y|x)\exp\big(\theta^{\top}\phi(x,y)\big)}{Z_\theta(x)}, \qquad Z_\theta(x) = \sum_{y'}\pi_{\mathrm{ref}}(y'|x)\exp\big(\theta^{\top}\phi(x,y')\big).$$

Equivalently, $\log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \theta^{\top}\phi(x,y) - \log Z_\theta(x)$, so $\theta^{\top}\phi(x,y)$ parameterizes the (reward-like) log-tilt away from $\pi_{\mathrm{ref}}$.

Substituting $\pi_\theta$ into ([16](https://arxiv.org/html/2605.11134#A3.E16)) and expanding the KL term gives

$$J_\beta^{(D)}(\pi_\theta) = \mathbb{E}_{x\sim D}\left[\mathbb{E}_{y\sim\pi_\theta}[r^{\star}(x,y)] - \beta\,\mathbb{E}_{y\sim\pi_\theta}[\theta^{\top}\phi(x,y)] + \beta\log Z_\theta(x)\right],$$

since $\mathrm{KL}(\pi_\theta\|\pi_{\mathrm{ref}}) = \mathbb{E}_{y\sim\pi_\theta}[\theta^{\top}\phi(x,y)] - \log Z_\theta(x)$.

##### Pairwise surrogate.

The surrogate objective analyzed in the main text is

$$\tilde{J}^{(D)}(\theta) = \mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\big(\beta\tilde{\theta}^{\top}\Delta\phi\big)\right], \qquad \tilde{\theta} = \theta - \theta_{\mathrm{ref}}.$$

Under the Bradley–Terry model,

$$\mathbb{P}(y_w \succ y_l \mid x) = \sigma\big(\beta(r_\theta(x,y_w) - r_\theta(x,y_l))\big) = \sigma\big(\beta\tilde{\theta}^{\top}\Delta\phi\big),$$

so maximizing $\tilde{J}^{(D)}$ corresponds to maximum likelihood estimation.

##### Local equivalence.

Assume the local regime (Assumption [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3)):

$$|\beta\,\tilde{\theta}^{\top}\Delta\phi(x,y_w,y_l)| \leq \epsilon \ll 1 \quad \text{with high probability under } D,$$

and the boundedness conditions in Assumptions [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1)–[C.2](https://arxiv.org/html/2605.11134#A3.Thmtheorem2). Then both $J_\beta^{(D)}(\pi_\theta)$ and $\tilde{J}^{(D)}(\theta)$ are twice continuously differentiable on $\Theta_B$, with uniformly bounded Hessians. In particular, Taylor expansion around $\theta_{\mathrm{ref}}$ yields

$$\begin{aligned}
J_\beta^{(D)}(\pi_\theta) &= J_\beta^{(D)}(\pi_{\mathrm{ref}}) + \nabla_\theta J_\beta^{(D)}(\pi_\theta)\big|_{\theta_{\mathrm{ref}}}^{\top}\tilde{\theta} + O(\|\tilde{\theta}\|^2),\\
\tilde{J}^{(D)}(\theta) &= \tilde{J}^{(D)}(\theta_{\mathrm{ref}}) + \nabla_\theta\tilde{J}^{(D)}(\theta)\big|_{\theta_{\mathrm{ref}}}^{\top}\tilde{\theta} + O(\|\tilde{\theta}\|^2).
\end{aligned}$$

Moreover, under the Bradley–Terry model on feature differences, the linear terms are aligned at $\theta_{\mathrm{ref}}$:

$$\nabla_\theta J_\beta^{(D)}(\pi_\theta)\big|_{\theta_{\mathrm{ref}}} = c_\beta\,\nabla_\theta\tilde{J}^{(D)}(\theta)\big|_{\theta_{\mathrm{ref}}}, \qquad c_\beta > 0,$$

where $c_\beta$ depends only on $\beta$ (and not on $D$). Combining the two expansions gives the local decomposition

$$J_\beta^{(D)}(\pi_\theta) = c_D + c_\beta\,\tilde{J}^{(D)}(\theta) + R_D(\theta), \tag{17}$$

with $c_D := J_\beta^{(D)}(\pi_{\mathrm{ref}}) - c_\beta\,\tilde{J}^{(D)}(\theta_{\mathrm{ref}})$ and a remainder satisfying $|R_D(\theta)| \leq C\beta^{2}\|\tilde{\theta}\|^2$ for some constant $C > 0$ under bounded features.

##### Implication for suboptimality bounds.

In the local regime, first-order changes in $\tilde{J}^{(D)}$ induce proportional changes in $J_\beta^{(D)}$. Consequently, for any $\theta_1, \theta_2$,

$$J_\beta^{(Q)}(\pi_{\theta_1}) - J_\beta^{(Q)}(\pi_{\theta_2}) = c_\beta\big(\tilde{J}^{(Q)}(\theta_1) - \tilde{J}^{(Q)}(\theta_2)\big) + O(\beta^{2}\|\tilde{\theta}\|^2).$$

This justifies analyzing deployment suboptimality using the pairwise surrogate. The approximation fails outside the local regime, where higher-order terms dominate and global analysis is required.

### C.2 Deployment Proxy: Expected Margins

This appendix studies how distribution shifts affect performance. We use the expected preference margin, which provides a first-order characterization of model quality.

##### Margin definition.

The expected margin measures the average score gap between preferred and dispreferred responses:

$$m_D(\theta) := \mathbb{E}_{(x,y_w,y_l)\sim D}[\beta\,\tilde{\theta}^{\top}\Delta\phi].$$

The margin decomposes into causal and spurious components:

$$m_D(\theta) = \underbrace{\beta\,\tilde{\theta}_c^{\top}\mu_c^{(D)}}_{=:\,m_D^{\text{causal}}(\theta)} + \underbrace{\beta\,\tilde{\theta}_s^{\top}\mu_s^{(D)}}_{=:\,m_D^{\text{spurious}}(\theta)}. \tag{18}$$

##### Margin approximation of objective differences.

We now restate Proposition [5.1](https://arxiv.org/html/2605.11134#S5.Thmtheorem1) and prove the margin approximation:

###### Proposition C.5 (First-Order Margin Approximation).

Under the local regime (Assumption [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3)), the pairwise surrogate objective $\tilde{J}^{(D)}(\theta) := \mathbb{E}_D[\log\sigma(\beta\tilde{\theta}^{\top}\Delta\phi)]$ satisfies

$$\tilde{J}^{(Q)}(\theta) - \tilde{J}^{(P)}(\theta) = \frac{1}{2}\big(m_Q(\theta) - m_P(\theta)\big) + O(\beta^{2}\|\tilde{\theta}\|^2).$$

###### Proof.

Let $z := \beta\,\tilde{\theta}^{\top}\Delta\phi$. Under Assumption [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3), we have $|z| \leq \epsilon \ll 1$ with high probability under $D$. Since $\log\sigma$ is $C^2$, a Taylor expansion at $0$ gives

$$\log\sigma(z) = \log\sigma(0) + (\log\sigma)'(0)\,z + O(z^2) = \log\frac{1}{2} + \frac{1}{2}z + O(z^2),$$

because $(\log\sigma)'(z) = 1 - \sigma(z)$ and hence $(\log\sigma)'(0) = 1 - \sigma(0) = 1/2$.

Taking expectations under $D$ yields

$$\tilde{J}^{(D)}(\theta) = \mathbb{E}_D[\log\sigma(z)] = \log\frac{1}{2} + \frac{1}{2}\,\mathbb{E}_D[z] + O(\mathbb{E}_D[z^2]) = \log\frac{1}{2} + \frac{1}{2}\,m_D(\theta) + O\big(\mathbb{E}_D[(\beta\,\tilde{\theta}^{\top}\Delta\phi)^2]\big).$$

Under Assumption [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1), $\|\Delta\phi\|_2 \leq 1$, so

$$\mathbb{E}_D[(\beta\,\tilde{\theta}^{\top}\Delta\phi)^2] \leq \beta^{2}\|\tilde{\theta}\|_2^2\,\mathbb{E}_D[\|\Delta\phi\|_2^2] \leq \beta^{2}\|\tilde{\theta}\|_2^2,$$

and therefore

$$\tilde{J}^{(D)}(\theta) = \log\frac{1}{2} + \frac{1}{2}m_D(\theta) + O(\beta^{2}\|\tilde{\theta}\|_2^2).$$

Applying this with $D = P$ and $D = Q$ and subtracting cancels the constant $\log(1/2)$, giving

$$\tilde{J}^{(Q)}(\theta) - \tilde{J}^{(P)}(\theta) = \frac{1}{2}\big(m_Q(\theta) - m_P(\theta)\big) + O(\beta^{2}\|\tilde{\theta}\|_2^2),$$

as claimed. ∎
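
The proposition is easy to validate numerically. The Monte Carlo sketch below (ours; Gaussian feature models for $P$ and $Q$ with small score differences) compares the surrogate gap with half the margin gap:

```python
# Monte Carlo check (ours) of the first-order margin approximation.
import numpy as np

rng = np.random.default_rng(4)
d, beta, n = 3, 0.2, 1_000_000
theta_tilde = 0.1 * rng.normal(size=d)    # small theta~: local regime

def surrogate_and_margin(mu):
    dphi = mu + rng.normal(scale=0.2, size=(n, d))
    z = beta * dphi @ theta_tilde
    # J~(D) = E[log sigma(z)]; m_D = E[z]
    return -np.logaddexp(0.0, -z).mean(), z.mean()

J_P, m_P = surrogate_and_margin(np.array([0.3, 0.1, 0.0]))
J_Q, m_Q = surrogate_and_margin(np.array([0.3, -0.2, 0.1]))
print(J_Q - J_P, 0.5 * (m_Q - m_P))   # agree up to O(beta^2 ||theta~||^2) + noise
```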

We now show that the spurious margin drives the shift when causal statistics are stable.

###### Proposition C.6 (Spurious Margin Drives Shift).

When causal statistics are stable ($\mu_c^{(Q)} = \mu_c^{(P)}$), the objective difference at $\theta_{\mathrm{train}}$ is determined by the spurious margin:

$$\tilde{J}^{(Q)}(\theta_{\mathrm{train}}) - \tilde{J}^{(P)}(\theta_{\mathrm{train}}) = \frac{\beta}{2}\tilde{\theta}_{s,\mathrm{train}}^{\top}\big(\mu_s^{(Q)} - \mu_s^{(P)}\big) + O(\beta^{2}\|\tilde{\theta}_{\mathrm{train}}\|^2).$$

###### Proof.

Apply Proposition [C.5](https://arxiv.org/html/2605.11134#A3.Thmtheorem5) with $\theta = \theta_{\mathrm{train}}$. From ([6](https://arxiv.org/html/2605.11134#S5.E6)), the margin difference decomposes as

$$m_Q(\theta_{\mathrm{train}}) - m_P(\theta_{\mathrm{train}}) = \beta\tilde{\theta}_{c,\mathrm{train}}^{\top}\big(\mu_c^{(Q)} - \mu_c^{(P)}\big) + \beta\tilde{\theta}_{s,\mathrm{train}}^{\top}\big(\mu_s^{(Q)} - \mu_s^{(P)}\big).$$

When $\mu_c^{(Q)} = \mu_c^{(P)}$, the causal term vanishes. ∎

### C.3 Deployment Suboptimality Definition

Let $\theta_Q^{\star} := \arg\max_{\theta\in\Theta_B}\tilde{J}^{(Q)}(\theta)$ denote the deployment-optimal parameters, where $\Theta_B := \{\theta\in\mathbb{R}^d : \|\theta - \theta_{\mathrm{ref}}\|_2 \leq B\}$. Let $\hat{\theta}$ denote the finite-sample estimator trained on $n$ samples from $P$. We define deployment suboptimality as

$$\mathrm{SubOpt}_Q(\hat{\theta}) := \tilde{J}^{(Q)}(\theta_Q^{\star}) - \tilde{J}^{(Q)}(\hat{\theta}).$$

##### Decomposition.

Inserting the population training optimum $\theta_{\mathrm{train}}$:

$$\mathrm{SubOpt}_Q(\hat{\theta}) = \underbrace{\tilde{J}^{(Q)}(\theta_Q^{\star}) - \tilde{J}^{(Q)}(\theta_{\mathrm{train}})}_{\text{shift term (irreducible)}} + \underbrace{\tilde{J}^{(Q)}(\theta_{\mathrm{train}}) - \tilde{J}^{(Q)}(\hat{\theta})}_{\text{estimation term (reducible)}}. \tag{19}$$

##### Irreducible and reducible components.

The shift term is determined by $\theta_{\mathrm{train}}$ and does not depend on sample size. As $n\to\infty$, $\hat{\theta}\to\theta_{\mathrm{train}}$ and the estimation term vanishes (as demonstrated in Appendix [C.7](https://arxiv.org/html/2605.11134#A3.Thmtheorem7)):

$$\lim_{n\to\infty}\mathrm{SubOpt}_Q(\hat{\theta}) = \tilde{J}^{(Q)}(\theta_Q^{\star}) - \tilde{J}^{(Q)}(\theta_{\mathrm{train}}).$$

By Proposition [C.6](https://arxiv.org/html/2605.11134#A3.Thmtheorem6), when causal statistics are stable, this limit is proportional to $\tilde{\theta}_{s,\mathrm{train}}^{\top}(\mu_s^{(Q)} - \mu_s^{(P)})$. The estimation term vanishes as $O(1/n)$, so collecting more data from $P$ reduces only the already-vanishing component. The fundamental bottleneck is $\tilde{\theta}_{s,\mathrm{train}}$, which arises structurally from the training distribution (Section [4](https://arxiv.org/html/2605.11134#S4)).
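
The following toy simulation (ours, not the paper's experiment; a two-feature Gaussian model where the spurious mean flips sign between $P$ and $Q$) illustrates the decomposition: the estimation term shrinks with $n$, while the suboptimality plateaus at the shift term.

```python
# Toy simulation (ours): estimation term vanishes with n, shift term persists.
import numpy as np

rng = np.random.default_rng(5)
beta = 1.0
mu_P, mu_Q = np.array([0.5, 0.5]), np.array([0.5, -0.5])  # [causal, spurious]

def sample(mu, n):
    return mu + rng.normal(size=(n, 2))          # feature differences Delta phi

def J(theta, dphi):                              # pairwise surrogate J~(D)
    return -np.logaddexp(0.0, -beta * dphi @ theta).mean()

def fit(dphi, steps=800, lr=0.5):                # gradient ascent on J~
    theta = np.zeros(2)
    for _ in range(steps):
        w = 1.0 - 1.0 / (1.0 + np.exp(-beta * dphi @ theta))
        theta += lr * beta * (w[:, None] * dphi).mean(axis=0)
    return theta

dphi_Q = sample(mu_Q, 200_000)
theta_Q = fit(dphi_Q)                            # proxy for deployment optimum
for n in (100, 1_000, 10_000, 100_000):
    theta_hat = fit(sample(mu_P, n))             # trained only on P
    print(n, round(J(theta_Q, dphi_Q) - J(theta_hat, dphi_Q), 4))
# the suboptimality plateaus at the irreducible shift term instead of going to 0
```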

### C.4 Proof of the Deployment Suboptimality Bound

This appendix provides the proof of the deployment suboptimality bound (Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3)). For completeness, we restate the theorem below.

###### Theorem C.7 (Deployment Suboptimality Bound).

Under Assumptions [C.1](https://arxiv.org/html/2605.11134#A3.Thmtheorem1), [C.2](https://arxiv.org/html/2605.11134#A3.Thmtheorem2), [C.3](https://arxiv.org/html/2605.11134#A3.Thmtheorem3), and [C.4](https://arxiv.org/html/2605.11134#A3.Thmtheorem4), with probability at least $1-\delta$:

$$\mathrm{SubOpt}_Q(\hat{\theta}) \leq \underbrace{\tilde{J}^{(Q)}(\theta_Q^{\star}) - \tilde{J}^{(Q)}(\theta_{\mathrm{train}})}_{\text{shift term}} + \underbrace{\frac{\kappa_\Pi}{2}\Gamma_n^2}_{\text{estimation term}}, \tag{20}$$

where $\Gamma_n = 2\beta\sqrt{2(d + \log(1/\delta))/n} + B\sqrt{\lambda}$.

###### Proof.

Let $\theta_{\mathrm{train}}$ denote the (population) optimizer of the training surrogate under $P$ over $\Theta_{B}$, and let $\theta_{Q}^{\star}$ denote the (population) optimizer under $Q$.

##### Step 1 (Decomposition).

By definition,

$$\mathrm{SubOpt}_{Q}(\hat{\theta}) = \tilde{J}^{(Q)}(\theta_{Q}^{\star}) - \tilde{J}^{(Q)}(\hat{\theta}).$$

Inserting $\theta_{\mathrm{train}}$:

$$\mathrm{SubOpt}_{Q}(\hat{\theta}) = \underbrace{\tilde{J}^{(Q)}(\theta_{Q}^{\star}) - \tilde{J}^{(Q)}(\theta_{\mathrm{train}})}_{\text{shift term}} + \underbrace{\tilde{J}^{(Q)}(\theta_{\mathrm{train}}) - \tilde{J}^{(Q)}(\hat{\theta})}_{\text{estimation term}}. \tag{21}$$

##### Step 2 (Quadratic upper bound for the estimation term).

Because $\tilde{J}^{(Q)}$ is concave and twice differentiable, Taylor's theorem with integral remainder yields, for $\theta_{1},\theta_{2}\in\Theta_{B}$,

$$\tilde{J}^{(Q)}(\theta_{1}) - \tilde{J}^{(Q)}(\theta_{2}) = \nabla\tilde{J}^{(Q)}(\theta_{2})^{\top}(\theta_{1}-\theta_{2}) + \int_{0}^{1}(1-t)\,(\theta_{1}-\theta_{2})^{\top}\bigl(-\nabla^{2}\tilde{J}^{(Q)}(\theta_{t})\bigr)(\theta_{1}-\theta_{2})\,dt,$$

where $\theta_{t} := \theta_{2} + t(\theta_{1}-\theta_{2})$. Apply this with $\theta_{1}=\theta_{\mathrm{train}}$ and $\theta_{2}=\hat{\theta}$. Dropping the linear term (it can have either sign) and using $(1-t)\leq 1$ gives

$$\tilde{J}^{(Q)}(\theta_{\mathrm{train}}) - \tilde{J}^{(Q)}(\hat{\theta}) \leq \frac{1}{2}\|\hat{\theta}-\theta_{\mathrm{train}}\|_{\bar{\Sigma}_{\mathrm{diff}}^{(Q)}}^{2}, \tag{22}$$

with the segment curvature matrix

$$\bar{\Sigma}_{\mathrm{diff}}^{(Q)} := \sup_{t\in[0,1]}\mathbb{E}_{Q}\Bigl[\beta^{2}\,\sigma(z_{t})(1-\sigma(z_{t}))\,\Delta\phi\,\Delta\phi^{\top}\Bigr], \qquad z_{t} := \beta(\theta_{t}-\theta_{\mathrm{ref}})^{\top}\Delta\phi.$$

Assumption [C.4](https://arxiv.org/html/2605.11134#A3.Thmtheorem4) states that for all vectors $v$,

$$\|v\|_{\bar{\Sigma}_{\mathrm{diff}}^{(Q)}}^{2} \leq \kappa_{\Pi}\,\|v\|_{H_{P}}^{2},$$

where the training curvature matrix is

$$H_{P} := \beta^{2}\gamma_{\beta,B}\,\hat{\Sigma}^{(P)} + \lambda I.$$

Applying this with $v = \hat{\theta}-\theta_{\mathrm{train}}$ to [(22)](https://arxiv.org/html/2605.11134#A3.E22) gives

$$\tilde{J}^{(Q)}(\theta_{\mathrm{train}}) - \tilde{J}^{(Q)}(\hat{\theta}) \leq \frac{\kappa_{\Pi}}{2}\,\|\hat{\theta}-\theta_{\mathrm{train}}\|_{H_{P}}^{2}. \tag{23}$$

The ridge-regularized MLE self-normalized bound gives

$$\|\hat{\theta}-\theta_{\mathrm{train}}\|_{H_{P}} \leq \Gamma_{n}, \qquad \Gamma_{n} = 2\beta\sqrt{\frac{2(d+\log(1/\delta))}{n}} + B\sqrt{\lambda}.$$

The bound $\|\hat{\theta}-\theta_{\mathrm{train}}\|_{H_{P}}\leq\Gamma_{n}$ follows from standard ridge-regularized M-estimation arguments for generalized linear models (Zhu et al., [2023](https://arxiv.org/html/2605.11134#bib.bib25)): the regularized negative log-likelihood is locally strongly convex, with curvature given by the Fisher (Hessian) matrix, and the estimation error is controlled via a self-normalized concentration inequality for the score, yielding a high-probability error bound in the corresponding Hessian metric. In our setting, $H_{P} = \bar{\Sigma}^{(P)} + \lambda I$ plays exactly the role of the regularized Fisher/covariance matrix in these analyses.

Plugging into [(23)](https://arxiv.org/html/2605.11134#A3.E23) yields

$$\tilde{J}^{(Q)}(\theta_{\mathrm{train}}) - \tilde{J}^{(Q)}(\hat{\theta}) \leq \frac{\kappa_{\Pi}}{2}\Gamma_{n}^{2}.$$

##### Step 3 (Combine).

Substituting this bound into the decomposition [(21)](https://arxiv.org/html/2605.11134#A3.E21) gives exactly the stated result. ∎

## Appendix D Tie Training Theory

This appendix develops the theoretical foundations of tie training in a step-by-step manner. We first formalize the tie data-generating process and derive its expected gradient contribution. We then analyze how ties modify the population gradient under linearization and solve for the resulting mixed-training equilibrium. Finally, we prove that tie training provably reduces the spurious parameter magnitude and derive a quantitative reduction bound that sharpens the main theorem.

### D.1 Tie Construction Details

We formalize the tie data-generating process and labeling scheme. We start by defining a tie pair:

###### Definition D.1 (Tie pair).

A tie pair is a tuple $(x, y_{A}, y_{B})$ with equal utility, $u^{\star}(x,y_{A}) = u^{\star}(x,y_{B})$.

##### Feature structure.

We construct tie pairs so that causal features match while spurious features differ: $\Delta\phi_{c} := \phi_{c}(x,y_{A}) - \phi_{c}(x,y_{B}) = 0$ and $\Delta\phi_{s} := \phi_{s}(x,y_{A}) - \phi_{s}(x,y_{B}) = \delta_{s} \neq 0$. Thus, ties vary only in the spurious coordinates.

##### Random labeling.

For each tie pair, we assign the winner-loser label uniformly at random. With probability $1/2$ we label $y_{A}\succ y_{B}$ and set $\Delta\phi = [0;\delta_{s}]$; with probability $1/2$ we label $y_{B}\succ y_{A}$ and set $\Delta\phi = [0;-\delta_{s}]$. This gives

$$\mathbb{E}_{\mathrm{tie}}[\Delta\phi] = 0, \qquad \Sigma^{\mathrm{tie}} := \mathbb{E}_{\mathrm{tie}}[\Delta\phi\,\Delta\phi^{\top}] = \begin{bmatrix} 0 & 0 \\ 0 & \Sigma^{\mathrm{tie}}_{ss} \end{bmatrix}, \qquad \Sigma^{\mathrm{tie}}_{ss}\succeq 0.$$

##### Example.

For hotel recommendations: $y_{A}$ = "Excellent service, prime location" (concise); $y_{B}$ = "Excellent service, prime location, with detailed amenities list" (verbose). Both have equal utility but differ in length, a common spurious feature.

### D.2 Expected Gradient from Ties

We now derive the expected gradient contribution of tie examples. Under Assumption [A.1](https://arxiv.org/html/2605.11134#A1.Thmtheorem1), the linearized weight function is

$$w(\Delta_{\theta}) \approx 1 - \sigma(\beta\tilde{\theta}^{\top}\Delta\phi) = \frac{1}{2} - \frac{\beta}{4}\tilde{\theta}^{\top}\Delta\phi + O\bigl((\beta\tilde{\theta}^{\top}\Delta\phi)^{2}\bigr).$$

##### Expected tie gradient.

Then, for a single tie example, the DPO gradient is

$$\nabla_{\theta}\ell_{\mathrm{tie}} = -\beta\bigl(1-\sigma(\beta\tilde{\theta}^{\top}\Delta\phi)\bigr)\Delta\phi.$$

Taking the expectation over random tie labeling and using $\mathbb{E}_{\mathrm{tie}}[\Delta\phi] = 0$ gives

$$g_{\mathrm{tie}}(\tilde{\theta}) = -\beta\,\mathbb{E}_{\mathrm{tie}}\bigl[\bigl(1-\sigma(\beta\tilde{\theta}^{\top}\Delta\phi)\bigr)\Delta\phi\bigr] = -\beta\,\mathrm{Cov}_{\mathrm{tie}}\bigl(\Delta\phi,\;1-\sigma(\beta\tilde{\theta}^{\top}\Delta\phi)\bigr).$$

Substituting the linearized weight function yields the gradient for ties:

$$g_{\mathrm{tie}}(\tilde{\theta}) = \frac{\beta^{2}}{4}\,\mathbb{E}_{\mathrm{tie}}[\Delta\phi\,\Delta\phi^{\top}]\,\tilde{\theta} + O(\beta^{3}\|\tilde{\theta}\|^{2}).$$
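
The following sketch (illustrative $\beta$, dimension, and a small $\tilde{\theta}$, all hypothetical) checks this expansion by computing the exact expected DPO gradient over the two equally likely tie labels and comparing it against the linearized form $\frac{\beta^{2}}{4}\Sigma^{\mathrm{tie}}\tilde{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, d = 0.5, 4
theta = 0.05 * rng.normal(size=d)       # small parameter (local regime)
delta_s = rng.normal(size=d)            # tie feature difference
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# Exact expectation over the two equally likely labels (dphi = +/- delta_s):
z = beta * delta_s @ theta
g_exact = -beta * 0.5 * ((1 - sigma(z)) * delta_s - (1 - sigma(-z)) * delta_s)

# Linearized prediction g_tie ~ (beta^2 / 4) * Sigma_tie * theta:
g_lin = (beta**2 / 4) * np.outer(delta_s, delta_s) @ theta

print(np.max(np.abs(g_exact - g_lin)))  # tiny: the O(beta^3 ||theta||^2) remainder
```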

### D.3 Linearized Tie Dynamics

We next show how tie data modifies the population gradient under linearization.

##### Tie covariance.

Define the tie covariance matrix

$$\Sigma^{\mathrm{tie}} := \mathbb{E}_{\mathrm{tie}}[\Delta\phi\,\Delta\phi^{\top}]\in\mathbb{R}^{d\times d}.$$

Then, to first order, the tie gradient takes the form

$$g_{\mathrm{tie}}(\tilde{\theta}) \approx \frac{\beta^{2}}{4}\Sigma^{\mathrm{tie}}\tilde{\theta}.$$

##### Block structure.

By construction of tie pairs (Appendix [D.1](https://arxiv.org/html/2605.11134#A4.SS1)), we have

$$\Sigma^{\mathrm{tie}} = \begin{bmatrix}\Sigma^{\mathrm{tie}}_{cc} & \Sigma^{\mathrm{tie}}_{cs} \\ \Sigma^{\mathrm{tie}}_{sc} & \Sigma^{\mathrm{tie}}_{ss}\end{bmatrix}, \qquad \Sigma^{\mathrm{tie}}_{cc} = 0,\; \Sigma^{\mathrm{tie}}_{cs} = 0,\; \Sigma^{\mathrm{tie}}_{ss}\succ 0.$$

Thus, tie gradients introduce curvature exclusively in the spurious directions.

##### Conclusion.

In the local regime (Assumption [A.1](https://arxiv.org/html/2605.11134#A1.Thmtheorem1)), tie training induces a data-dependent quadratic regularizer $\tfrac{\beta^{2}}{4}\tilde{\theta}^{\top}\Sigma^{\mathrm{tie}}\tilde{\theta}$ that shrinks spurious parameters while leaving causal parameters unaffected.

### D.4 Mixed Training Equilibrium

We now solve for the equilibrium parameters under mixed strict–tie training. We start by defining the data model.

##### Mixed Data Model.

We consider training on a mixture of strict preference pairs from distribution $P$ and tie pairs from $P_{\mathrm{tie}}$. Let $\alpha\in(0,1)$ denote the fraction of strict preferences, and define the mixed distribution

$$P_{\mathrm{mix}} := \alpha P + (1-\alpha)P_{\mathrm{tie}}.$$

All expectations below are taken with respect to $P_{\mathrm{mix}}$ unless stated otherwise.

##### Population Gradient Under Mixed Training.

Under the above data model and local regime (Assumption [A.1](https://arxiv.org/html/2605.11134#A1.Thmtheorem1)), the population gradient decomposes linearly:

$$\nabla_{\theta}L_{\mathrm{mix}}(\tilde{\theta}) = \alpha\,g_{P}(\tilde{\theta}) + (1-\alpha)\,g_{\mathrm{tie}}(\tilde{\theta}).$$

Using the local expansions

$$g_{P}(\tilde{\theta}) = -\frac{\beta}{2}\mu^{(P)} + \frac{\beta^{2}}{4}\Sigma^{(P)}\tilde{\theta}, \qquad g_{\mathrm{tie}}(\tilde{\theta}) = \frac{\beta^{2}}{4}\Sigma^{\mathrm{tie}}\tilde{\theta},$$

we obtain

$$\nabla_{\theta}L_{\mathrm{mix}}(\tilde{\theta}) = -\alpha\frac{\beta}{2}\mu^{(P)} + \frac{\beta^{2}}{4}\Sigma^{\mathrm{mix}}\tilde{\theta},$$

where the mixed covariance matrix is

$$\Sigma^{\mathrm{mix}} := \alpha\Sigma^{(P)} + (1-\alpha)\Sigma^{\mathrm{tie}}.$$

##### Structure of the Mixed Covariance.

Writing the block decomposition of the mixed covariance

$$\Sigma^{\mathrm{mix}} = \begin{bmatrix}\Sigma^{\mathrm{mix}}_{cc} & \Sigma^{\mathrm{mix}}_{cs} \\ \Sigma^{\mathrm{mix}}_{sc} & \Sigma^{\mathrm{mix}}_{ss}\end{bmatrix},$$

and using the tie construction ($\Sigma^{\mathrm{tie}}_{cc} = \Sigma^{\mathrm{tie}}_{cs} = 0$), we obtain

$$\Sigma^{\mathrm{mix}}_{sc} = \alpha\Sigma^{(P)}_{sc}, \qquad \Sigma^{\mathrm{mix}}_{ss} = \alpha\Sigma^{(P)}_{ss} + (1-\alpha)\Sigma^{\mathrm{tie}}_{ss}.$$

Thus, tie training reduces the causal–spurious cross-covariance by a factor $\alpha$ and adds positive semidefinite mass to the spurious block.

##### Population Equilibrium.

At the population equilibrium, we have $\nabla_{\theta}L_{\mathrm{mix}}(\tilde{\theta}^{\star}) = 0$, yielding

$$\alpha\frac{\beta}{2}\mu^{(P)} = \frac{\beta^{2}}{4}\Sigma^{\mathrm{mix}}\tilde{\theta}^{\star}_{\mathrm{mix}},$$

and therefore

$$\tilde{\theta}^{\star}_{\mathrm{mix}} = \frac{2\alpha}{\beta}(\Sigma^{\mathrm{mix}})^{-1}\mu^{(P)}. \tag{24}$$

##### Spurious Block Component.

Using block inversion, the spurious component satisfies

$$\Sigma^{\mathrm{mix}}_{ss}\,\tilde{\theta}^{\star}_{s,\mathrm{mix}} = \alpha\mu^{(P)}_{s} - \Sigma^{\mathrm{mix}}_{sc}\,\tilde{\theta}^{\star}_{c,\mathrm{mix}},$$

or equivalently

$$\tilde{\theta}^{\star}_{s,\mathrm{mix}} = \frac{2\alpha}{\beta}(\Sigma^{\mathrm{mix}}_{ss})^{-1}\Bigl[\mu^{(P)}_{s} - \Sigma^{\mathrm{mix}}_{sc}(S^{\mathrm{mix}}_{c})^{-1}\mu^{(P)}_{c}\Bigr],$$

where the Schur complement is

$$S^{\mathrm{mix}}_{c} := \Sigma^{\mathrm{mix}}_{cc} - \Sigma^{\mathrm{mix}}_{cs}(\Sigma^{\mathrm{mix}}_{ss})^{-1}\Sigma^{\mathrm{mix}}_{sc}.$$
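
To make the equilibrium concrete, the following NumPy sketch solves [(24)](https://arxiv.org/html/2605.11134#A4.E24) numerically and compares spurious parameter norms under strict-only versus mixed training; the covariance $\Sigma^{(P)}$ (with causal–spurious leakage), the mean $\mu^{(P)}$, and the tie curvature below are illustrative choices, not values from the paper.

```python
import numpy as np

beta, alpha = 1.0, 0.7
d_c, d_s = 2, 2

Sigma_P = np.array([[1.0, 0.0, 0.4, 0.0],
                    [0.0, 1.0, 0.0, 0.4],
                    [0.4, 0.0, 1.0, 0.0],
                    [0.0, 0.4, 0.0, 1.0]])  # causal-spurious leakage of 0.4
mu_P = np.array([0.8, 0.5, 0.6, 0.3])       # nonzero spurious mean bias
Sigma_tie = np.zeros((4, 4))
Sigma_tie[d_c:, d_c:] = 2.0 * np.eye(d_s)   # ties add curvature only in s-block

# Strict-only equilibrium: theta = (2/beta) (Sigma_P)^{-1} mu_P
theta_strict = (2 / beta) * np.linalg.solve(Sigma_P, mu_P)
# Mixed equilibrium, Eq. (24): theta = (2 alpha / beta) (Sigma_mix)^{-1} mu_P
Sigma_mix = alpha * Sigma_P + (1 - alpha) * Sigma_tie
theta_mix = (2 * alpha / beta) * np.linalg.solve(Sigma_mix, mu_P)

print(np.linalg.norm(theta_strict[d_c:]))   # strict-only spurious norm
print(np.linalg.norm(theta_mix[d_c:]))      # smaller under tie mixing
```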

### D.5 Proof of Theorem [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2)

Equipped with the mixed-training equilibrium [(24)](https://arxiv.org/html/2605.11134#A4.E24) derived in Appendix [D.4](https://arxiv.org/html/2605.11134#A4.SS4), we now provide the full proof of Theorem [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2). For completeness, we restate the theorem below and then prove each part in sequence.

###### Theorem D.2 (Tie training reduces spurious reliance and deployment shift).

Assume the conditions of Theorem [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3) and the tie construction in Appendix [D.1](https://arxiv.org/html/2605.11134#A4.SS1). Then:

(i) **Spurious shrinkage (population level).** Let $\theta^{\star}$ denote the strict-only population optimizer under $P$, and $\theta^{\star}_{\mathrm{mix}}$ the mixed-training optimizer under $P_{\mathrm{mix}} = \alpha P + (1-\alpha)P_{\mathrm{tie}}$. Then

$$\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{\Sigma^{\mathrm{mix}}_{ss}} \;\leq\; \|\tilde{\theta}_{s}^{\star}\|_{\Sigma^{(P)}_{ss}},$$

where $\Sigma^{\mathrm{mix}}_{ss} = \alpha\Sigma^{(P)}_{ss} + (1-\alpha)\Sigma^{\mathrm{tie}}_{ss}$ and $\|v\|_{A} := \sqrt{v^{\top}Av}$ denotes the weighted norm. The inequality is strict whenever $\Sigma^{\mathrm{tie}}_{ss}$ adds curvature in directions aligned with the strict-only spurious driving term.

(ii) **Shift reduction under stable causal statistics.** If $\mu_{c}^{(Q)} = \mu_{c}^{(P)}$, then the first-order deployment shift proxy satisfies

$$\bigl|\tilde{\theta}^{\star\top}_{s,\mathrm{mix}}(\mu_{s}^{(Q)}-\mu_{s}^{(P)})\bigr| \;\leq\; \bigl|\tilde{\theta}^{\star\top}_{s}(\mu_{s}^{(Q)}-\mu_{s}^{(P)})\bigr|,$$

up to the same local $O(\beta^{2}\|\tilde{\theta}\|^{2})$ remainder.

(iii) **Deployment bound.** Let $\hat{\theta}^{\mathrm{mix}}$ be the ridge-regularized estimator trained on $P_{\mathrm{mix}}$ with $n$ samples. Then, with probability at least $1-\delta$,

$$\mathrm{SubOpt}_{Q}(\hat{\theta}^{\mathrm{mix}}) \leq \underbrace{\tilde{J}^{(Q)}(\theta_{Q}^{\star}) - \tilde{J}^{(Q)}(\theta^{\star}_{\mathrm{mix}})}_{\text{reduced shift}} + \underbrace{\frac{\kappa_{\Pi}}{2}\Gamma_{n,\mathrm{mix}}^{2}}_{\text{estimation}}.$$

###### Proof.

Let $P_{\mathrm{mix}} := \alpha P + (1-\alpha)P_{\mathrm{tie}}$ be the mixed training distribution, and let $\theta_{\mathrm{mix}}^{\star}$ denote the corresponding population optimizer (equilibrium) in the local regime. Let $\theta^{\star}$ denote the strict-only population optimizer under $P$.

We use the mixed linearized equilibrium derived in Appendix [D.4](https://arxiv.org/html/2605.11134#A4.SS4):

$$\tilde{\theta}_{\mathrm{mix}}^{\star} = \frac{2\alpha}{\beta}\,(\Sigma^{\mathrm{mix}})^{-1}\mu^{(P)}, \qquad \Sigma^{\mathrm{mix}} := \alpha\Sigma^{(P)} + (1-\alpha)\Sigma^{\mathrm{tie}}, \tag{25}$$

and the strict-only equilibrium

$$\tilde{\theta}^{\star} = \frac{2}{\beta}\,(\Sigma^{(P)})^{-1}\mu^{(P)}. \tag{26}$$

##### Part (i): Spurious weight reduction.

Let $b_{s}(\cdot)$ denote the Schur-complement driving term for the spurious block. That is, for any covariance $\Sigma$ and mean $\mu$,

$$\tilde{\theta}_{s}^{\star}(\Sigma,\mu) = \frac{2}{\beta}\,\Sigma_{ss}^{-1}\,b_{s}(\Sigma,\mu), \qquad b_{s}(\Sigma,\mu) := \Bigl(I + \Sigma_{sc}S_{c}^{-1}\Sigma_{cs}\Sigma_{ss}^{-1}\Bigr)\mu_{s} - \Sigma_{sc}S_{c}^{-1}\mu_{c}.$$

Under strict-only training, $(\Sigma,\mu) = (\Sigma^{(P)},\mu^{(P)})$. Under mixed training, the gradient is $\alpha g_{P} + (1-\alpha)g_{\mathrm{tie}}$, and in the local regime the tie contribution has zero mean and contributes only curvature; thus the mixed equilibrium uses $(\Sigma,\mu) = (\Sigma^{\mathrm{mix}},\alpha\mu^{(P)})$, which yields the prefactor $\alpha$ in [(25)](https://arxiv.org/html/2605.11134#A4.E25). In particular, the spurious block satisfies

$$\tilde{\theta}_{s,\mathrm{mix}}^{\star} = \frac{2\alpha}{\beta}\,(\Sigma^{\mathrm{mix}}_{ss})^{-1}\,b_{s}(\Sigma^{\mathrm{mix}},\mu^{(P)}), \qquad \tilde{\theta}_{s}^{\star} = \frac{2}{\beta}\,(\Sigma^{(P)}_{ss})^{-1}\,b_{s}(\Sigma^{(P)},\mu^{(P)}).$$

Now we use the positive semidefinite (PSD) ordering in the spurious block:

$$\Sigma_{ss}^{\mathrm{mix}} = \alpha\Sigma_{ss}^{(P)} + (1-\alpha)\Sigma_{ss}^{\mathrm{tie}} \succeq \alpha\Sigma_{ss}^{(P)} \quad\Longrightarrow\quad (\Sigma_{ss}^{\mathrm{mix}})^{-1} \preceq \frac{1}{\alpha}(\Sigma_{ss}^{(P)})^{-1}.$$

Therefore, for any vector $u$,

$$u^{\top}(\Sigma_{ss}^{\mathrm{mix}})^{-1}u \leq \frac{1}{\alpha}\,u^{\top}(\Sigma_{ss}^{(P)})^{-1}u.$$

Applying this with $u := b_{s}(\Sigma^{\mathrm{mix}},\mu^{(P)})$, we rewrite the $\Sigma_{ss}^{\mathrm{mix}}$-weighted norm of $\tilde{\theta}_{s,\mathrm{mix}}^{\star}$:

$$\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{\Sigma_{ss}^{\mathrm{mix}}}^{2} = (\tilde{\theta}_{s,\mathrm{mix}}^{\star})^{\top}\Sigma_{ss}^{\mathrm{mix}}\tilde{\theta}_{s,\mathrm{mix}}^{\star} = \Bigl(\frac{2\alpha}{\beta}\Bigr)^{2}\,b_{s}^{\top}(\Sigma^{\mathrm{mix}}_{ss})^{-1}b_{s}.$$

Using the PSD inverse bound gives

$$\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{\Sigma_{ss}^{\mathrm{mix}}}^{2} \leq \Bigl(\frac{2\alpha}{\beta}\Bigr)^{2}\cdot\frac{1}{\alpha}\,b_{s}^{\top}(\Sigma_{ss}^{(P)})^{-1}b_{s} = \alpha\Bigl(\frac{2}{\beta}\Bigr)^{2}b_{s}^{\top}(\Sigma_{ss}^{(P)})^{-1}b_{s}.$$

In the regime where tie mixing does not inflate the driving vector (ties have zero mean and reduce the cross-coupling $\Sigma_{sc}$ by a factor $\alpha$), we have in particular

$$b_{s}(\Sigma^{\mathrm{mix}},\mu^{(P)}) \approx b_{s}(\Sigma^{(P)},\mu^{(P)}),$$

so the right-hand side is upper bounded by $\|\tilde{\theta}_{s}^{\star}\|_{\Sigma_{ss}^{(P)}}^{2}$, giving the claimed inequality

$$\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{\Sigma^{\mathrm{mix}}_{ss}} \leq \|\tilde{\theta}_{s}^{\star}\|_{\Sigma^{(P)}_{ss}}.$$

Moreover, if $\Sigma_{ss}^{\mathrm{tie}}\succ 0$ adds curvature in directions aligned with $\mu_{s}^{(P)}$ (equivalently, increases the eigenvalues of $\Sigma_{ss}^{\mathrm{mix}}$ along components of the driving term), then $(\Sigma_{ss}^{\mathrm{mix}})^{-1}$ strictly shrinks those components, making the inequality strict.

##### Part (ii): Shift reduction.

Assume stable causal statistics: $\mu_{c}^{(Q)} \approx \mu_{c}^{(P)}$. In the local regime, the first-order margin approximation gives

$$\tilde{J}^{(Q)}(\theta) - \tilde{J}^{(P)}(\theta) = \frac{1}{2}\bigl(m_{Q}(\theta) - m_{P}(\theta)\bigr) + O(\beta^{2}\|\tilde{\theta}\|_{2}^{2}), \qquad m_{D}(\theta) := \mathbb{E}_{D}[\beta\tilde{\theta}^{\top}\Delta\phi].$$

Under stable causal moments, the leading contribution is spurious:

$$m_{Q}(\theta) - m_{P}(\theta) \approx \beta\,\tilde{\theta}_{s}^{\top}\bigl(\mu_{s}^{(Q)} - \mu_{s}^{(P)}\bigr).$$

Hence,

$$\bigl|\tilde{J}^{(Q)}(\theta) - \tilde{J}^{(P)}(\theta)\bigr| \;\lesssim\; \frac{1}{2}\,\beta\,\bigl|\tilde{\theta}_{s}^{\top}(\mu_{s}^{(Q)}-\mu_{s}^{(P)})\bigr| + O(\beta^{2}\|\tilde{\theta}\|_{2}^{2}).$$

Apply the dual norm inequality with $A = \Sigma_{ss}^{(P)}$:

$$\bigl|\tilde{\theta}_{s}^{\top}(\mu_{s}^{(Q)}-\mu_{s}^{(P)})\bigr| \leq \|\tilde{\theta}_{s}\|_{\Sigma_{ss}^{(P)}}\cdot\|\mu_{s}^{(Q)}-\mu_{s}^{(P)}\|_{(\Sigma_{ss}^{(P)})^{-1}}.$$

Using Part (i) to compare the spurious-weight magnitude under mixed versus strict training (keeping the same shift direction $\mu_{s}^{(Q)}-\mu_{s}^{(P)}$), we obtain

$$\bigl|\tilde{J}^{(Q)}(\theta_{\mathrm{mix}}^{\star}) - \tilde{J}^{(P)}(\theta_{\mathrm{mix}}^{\star})\bigr| \leq \bigl|\tilde{J}^{(Q)}(\theta^{\star}) - \tilde{J}^{(P)}(\theta^{\star})\bigr|,$$

up to the same local $O(\beta^{2}\|\tilde{\theta}\|^{2})$ remainder. This is the stated shift-term reduction claim (under the local proxy).

##### Part (iii): Deployment bound.

Apply Theorem [C.7](https://arxiv.org/html/2605.11134#A3.Thmtheorem7) with the training distribution replaced by $P_{\mathrm{mix}}$, the corresponding population optimizer $\theta_{\mathrm{mix}}^{\star}$, and the ridge MLE $\hat{\theta}^{\mathrm{mix}}$. The same proof yields, with probability $1-\delta$,

$$\mathrm{SubOpt}_{Q}(\hat{\theta}^{\mathrm{mix}}) \leq \underbrace{\tilde{J}^{(Q)}(\theta_{Q}^{\star}) - \tilde{J}^{(Q)}(\theta_{\mathrm{mix}}^{\star})}_{\text{shift}(P_{\mathrm{mix}})} + \underbrace{\frac{\kappa_{\Pi}}{2}\Gamma_{n,\mathrm{mix}}^{2}}_{\text{estimation}(P_{\mathrm{mix}})},$$

where $\Gamma_{n,\mathrm{mix}}$ is the same self-normalized radius, computed with the mixed-training curvature $H_{P_{\mathrm{mix}}}$ (and the same $\beta,\lambda,B$). This proves Part (iii). ∎

### D.6 Quantitative Reduction Bound

We conclude this appendix by providing the full proof of Corollary [6.3](https://arxiv.org/html/2605.11134#S6.Thmtheorem3). For completeness, we restate the corollary below and then provide the proof.

###### Corollary D.3 (Quantitative reduction under isotropic ties).

If ties are isotropic ($\Sigma^{\mathrm{tie}}_{ss} = \sigma^{2}I_{d_{s}}$) and $\lambda_{\min}(\Sigma^{(P)}_{ss}) \geq \lambda_{0} > 0$, then

$$\frac{\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{2}}{\|\tilde{\theta}_{s}^{\star}\|_{2}} \leq C\cdot\frac{\alpha\lambda_{0}}{\alpha\lambda_{0} + (1-\alpha)\sigma^{2}}$$

for some constant $C$ depending on the condition numbers of $\Sigma^{(P)}$.

###### Proof.

Assume $\Sigma_{ss}^{\mathrm{tie}} = \sigma^{2}I$ and $\lambda_{\min}(\Sigma_{ss}^{(P)}) \geq \lambda_{0} > 0$. Then

$$\Sigma_{ss}^{\mathrm{mix}} = \alpha\Sigma_{ss}^{(P)} + (1-\alpha)\sigma^{2}I \succeq \bigl(\alpha\lambda_{0} + (1-\alpha)\sigma^{2}\bigr)I,$$

so

$$(\Sigma_{ss}^{\mathrm{mix}})^{-1} \preceq \frac{1}{\alpha\lambda_{0} + (1-\alpha)\sigma^{2}}\,I.$$

In the local regime (Assumption [A.1](https://arxiv.org/html/2605.11134#A1.Thmtheorem1)), the linearized spurious equilibrium is

$$\tilde{\theta}_{s,\mathrm{mix}}^{\star} = \frac{2\alpha}{\beta}\,(\Sigma_{ss}^{\mathrm{mix}})^{-1}\,b_{s}, \qquad \tilde{\theta}_{s}^{\star} = \frac{2}{\beta}\,(\Sigma_{ss}^{(P)})^{-1}\,b_{s},$$

where $b_{s}$ is the same driving vector under the simplifying assumption that the mean/cross-term correction is dominated by the same direction. Taking Euclidean norms and using $\|Au\| \leq \|A\|_{\mathrm{op}}\|u\|$,

$$\frac{\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{2}}{\|\tilde{\theta}_{s}^{\star}\|_{2}} \leq \alpha\cdot\frac{\|(\Sigma_{ss}^{\mathrm{mix}})^{-1}\|_{\mathrm{op}}}{\|(\Sigma_{ss}^{(P)})^{-1}\|_{\mathrm{op}}} = \alpha\cdot\frac{\lambda_{\max}\bigl((\Sigma_{ss}^{\mathrm{mix}})^{-1}\bigr)}{\lambda_{\max}\bigl((\Sigma_{ss}^{(P)})^{-1}\bigr)} = \alpha\cdot\frac{\lambda_{\min}(\Sigma_{ss}^{(P)})}{\lambda_{\min}(\Sigma_{ss}^{\mathrm{mix}})}.$$

Using $\lambda_{\min}(\Sigma_{ss}^{(P)}) \geq \lambda_{0}$ and $\lambda_{\min}(\Sigma_{ss}^{\mathrm{mix}}) \geq \alpha\lambda_{0} + (1-\alpha)\sigma^{2}$ gives

$$\frac{\|\tilde{\theta}_{s,\mathrm{mix}}^{\star}\|_{2}}{\|\tilde{\theta}_{s}^{\star}\|_{2}} \;\leq\; C\,\frac{\alpha\lambda_{0}}{\alpha\lambda_{0} + (1-\alpha)\sigma^{2}}$$

for some constant $C$ depending on the condition numbers of $\Sigma^{(P)}$. ∎
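
As a quick numeric illustration of the bound (the values $\alpha = 0.7$, $\lambda_{0} = 1$, $\sigma^{2} = 2$ are illustrative, not from the paper):

```python
# Reduction factor of Corollary D.3 under isotropic ties (illustrative values).
alpha, lambda0, sigma2 = 0.7, 1.0, 2.0
r_th = (alpha * lambda0) / (alpha * lambda0 + (1 - alpha) * sigma2)
print(r_th)  # 0.7 / (0.7 + 0.6) ~ 0.538: a 30% tie fraction roughly halves the norm
```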

## Appendix E Distribution Shift Scenarios

This appendix provides a detailed analysis of the three shift scenarios from Section [5](https://arxiv.org/html/2605.11134#S5). Throughout, we assume $\theta_{s,\mathrm{train}} \neq 0$ and stable causal statistics ($\mu_{c}^{(Q)} \approx \mu_{c}^{(P)}$).

### E.1 Suppression ($\mu_{s}^{(Q)} = 0$)

The spurious correlation is absent at deployment while $\mu_{s}^{(P)} \neq 0$.

##### Margin behavior.

$m_{Q}^{\mathrm{spur}} = \beta\tilde{\theta}_{s,\mathrm{train}}^{\top}\cdot 0 = 0$: the systematic spurious bias vanishes.

##### Variance-induced accuracy degradation.

Zero mean does not imply robustness. The spurious variance

$$\beta^{2}\tilde{\theta}_{s,\mathrm{train}}^{\top}\Sigma_{ss}^{(Q)}\tilde{\theta}_{s,\mathrm{train}} > 0$$

adds noise to predictions.

##### Examples.

- **Hotels:** Training has longer reviews for quality hotels; the test set is balanced ⇒ length adds noise.
- **Code:** Training has verbose correct solutions; the test set is balanced ⇒ concise correct solutions are underrated.
- **Safety:** Training has hedged safe responses; the test set is balanced ⇒ direct safe responses are underrated.

### E.2 Adversarial Reversal ($\mu_{s}^{(Q)} = -\mu_{s}^{(P)}$)

The spurious correlation flips sign.

##### Margin behavior.

$$m_{Q}^{\mathrm{spur}} = -m_{P}^{\mathrm{spur}} \quad\Rightarrow\quad \Delta m^{\mathrm{spur}} = -2m_{P}^{\mathrm{spur}}.$$

If spurious features helped at training time, they hurt equally at deployment.

##### Worst-case analysis.

For a constrained shift $\|\mu_{s}^{(Q)}\| \leq R$:

$$\min_{\|\mu_{s}^{(Q)}\|\leq R} m_{Q}^{\mathrm{spur}} = -\beta\|\tilde{\theta}_{s,\mathrm{train}}\|R,$$

attained when $\mu_{s}^{(Q)} \propto -\tilde{\theta}_{s,\mathrm{train}}$ (the adversarial direction).

##### Examples.

- **Hotels:** The US market prefers long reviews; an international region has long complaint reviews ⇒ low-quality hotels preferred.
- **Code:** Expert code is verbose; beginner incorrect code is also verbose ⇒ incorrect code preferred.
- **Safety:** Safe responses hedge; adversarial unsafe responses hedge more ⇒ unsafe responses rated as safe.

### E.3 Rotation

The spurious correlation changes direction (relevant when $d_{s} > 1$).

##### Alignment analysis.

The spurious margin depends on alignment:

$$m_{Q}^{\mathrm{spur}} = \beta\|\tilde{\theta}_{s,\mathrm{train}}\|\cdot\|\mu_{s}^{(Q)}\|\cdot\cos\alpha,$$

where $\alpha = \angle(\tilde{\theta}_{s,\mathrm{train}},\mu_{s}^{(Q)})$.

##### Examples.

- **Hotels:** Length correlates positively in the US and negatively in Europe; star rating correlates oppositely ⇒ partial misalignment.
- **Code:** Comment density and line count have different correlation patterns across languages ⇒ rotation.
- **Safety:** Formality and hedging evolve differently over time ⇒ temporal rotation.

## Appendix F Experimental Details and Additional Results

This appendix provides a comprehensive empirical validation of the theoretical results presented in the main paper, progressing from settings that exactly match the theory to increasingly realistic and expressive model classes. We begin with linear preference models, where the assumptions of the theory hold and we obtain precise quantitative agreement with the predicted bounds and scaling laws. We then move to nonlinear neural networks, where the causal–spurious decomposition is hidden inside nonlinear representations, and finally to large language models trained with DPO on synthetic preference data. While exact quantitative predictions no longer apply beyond the linear regime, we demonstrate that the core qualitative mechanisms identified by the theory (spurious correlation learning, irreducible deployment error under distribution shift, and selective mitigation via tie training) persist across all stages. Together, these experiments show that the linear analysis captures essential dynamics of preference learning systems even when deployed with rich nonlinear models and realistic data.

### F.1 Linear Models (Theoretical Ground Truth)

#### F.1.1 Dataset Construction

##### Feature decomposition.

We construct a synthetic preference dataset with an explicit causal–spurious feature decomposition. Each example is represented by a Gaussian feature vector $\phi = [\phi_{c};\phi_{s}]$, where $\phi_{c}$ denotes causal features and $\phi_{s}$ denotes spurious features. Causal features determine the true preference utility, while spurious features do not affect utility. Under the training distribution $P$, spurious features are correlated with causal features. At deployment, this correlation changes under a shifted distribution $Q$.

##### Strict preference pairs.

Strict preference pairs are generated according to

$$P(y = 1 \mid \Delta\phi) = \sigma(\beta\,\theta^{\star\top}\Delta\phi),$$

where $\theta^{\star}$ decomposes into causal and spurious components. In the experiments, we use $\beta = 1.0$. The preferences depend on the full feature difference, inducing spurious correlations when $\phi_{s}$ is correlated with $\phi_{c}$. Spurious features are drawn as

$$\phi_{s} \sim \mathcal{N}(0,\lambda_{0}I),$$

where $\lambda_{0}$ controls the variance scale of spurious features in strict data. This construction ensures that spurious cues are predictive during training despite being non-causal.

##### Tie Examples.

Tie examples are defined by a small causal margin,

$$|\theta_{c}^{\star\top}\Delta\phi_{c}| \leq \tau,$$

which produces pairs with weak causal signal. Labels for ties are randomized as $y \sim \mathrm{Bernoulli}(0.5)$, removing any causal dependence. Spurious features for ties are drawn from

$$\phi_{s} \sim \mathcal{N}(0,\sigma^{2}I),$$

where $\sigma^{2}$ controls the variance injected by tie data. This construction decorrelates spurious features from labels while amplifying spurious variance, creating targeted negative evidence against spurious reliance.
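
A minimal generator consistent with this description is sketched below; the dimensions, the leakage coefficient coupling $\Delta\phi_{s}$ to the causal margin, and the rejection-sampled tie threshold are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, d_s = 4, 4
beta, lam0, sigma2, tau = 1.0, 1.0, 2.0, 0.05
theta_c_star = rng.normal(size=d_c)        # causal utility weights

def strict_pair():
    dphi_c = rng.normal(size=d_c)
    margin = theta_c_star @ dphi_c
    # Spurious differences leak the causal margin during training
    # (0.5 is an illustrative leakage strength):
    dphi_s = 0.5 * margin + np.sqrt(lam0) * rng.normal(size=d_s)
    y = 1.0 if rng.random() < 1 / (1 + np.exp(-beta * margin)) else -1.0
    return np.concatenate([dphi_c, dphi_s]), y

def tie_pair():
    while True:  # rejection-sample a small causal margin |theta_c^T dphi_c| <= tau
        dphi_c = rng.normal(size=d_c)
        if abs(theta_c_star @ dphi_c) <= tau:
            break
    dphi_s = np.sqrt(sigma2) * rng.normal(size=d_s)  # decorrelated, high variance
    return np.concatenate([dphi_c, dphi_s]), rng.choice([-1.0, 1.0])
```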

#### F.1.2 Model and Objective

##### Model Class.

We train a log-linear preference model,

$$r_{\theta}(\phi) = \theta^{\top}\phi,$$

which matches the assumptions of the theoretical analysis. The parameter vector $\theta$ decomposes into causal and spurious components. This setting allows direct measurement of spurious reliance through parameter norms. As a result, population predictions can be tested exactly.

##### Training Objective (DPO-style).

Training minimizes a DPO-style pairwise logistic loss,

$$\min_{\theta}\;\mathbb{E}\Bigl[\log\bigl(1 + e^{-\beta y\,\theta^{\top}\Delta\phi}\bigr)\Bigr].$$

The parameter $\beta$ controls the sharpness of preference separation. We vary the fraction of strict versus tie examples using $\alpha$. This setup exactly matches the assumptions of Theorems [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1), [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3), and [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2).
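
A minimal full-batch gradient-descent sketch of this objective (labels $y\in\{-1,+1\}$; the learning rate and step count are illustrative) is shown below; the spurious-reliance metric is then the norm of the trailing spurious coordinates of the returned parameter vector.

```python
import numpy as np

def train(dphi, y, beta=1.0, lr=0.1, steps=2000):
    """dphi: (n, d) feature differences; y: (n,) labels in {-1, +1}."""
    theta = np.zeros(dphi.shape[1])
    for _ in range(steps):
        z = beta * y * (dphi @ theta)
        # Gradient of mean log(1 + exp(-z)): average of
        # -beta * y * sigmoid(-z) weighted feature differences.
        grad = -((beta * y / (1.0 + np.exp(z))) @ dphi) / len(y)
        theta -= lr * grad
    return theta

# Spurious reliance: norm of the trailing d_s coordinates, e.g.
# np.linalg.norm(train(dphi, y)[d_c:]).
```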

#### F.1.3 Metrics

##### Spurious Reliance.

We measure spurious reliance using the norm $\|\theta_{s}\|$. This quantity directly quantifies how much the learned model relies on spurious features. Because the model is linear, this metric has a clear population interpretation. It provides a precise test of the theoretical predictions.

##### Reduction Ratio.

We compare empirical and theoretical reductions in spurious reliance under tie training. Empirically, we compute

$$r_{\mathrm{emp}} = \frac{\|\theta_{s,\mathrm{mix}}\|}{\|\theta_{s}\|}.$$

The theory predicts

$$r_{\mathrm{th}} = \frac{\alpha\lambda_{0}}{\alpha\lambda_{0} + (1-\alpha)\sigma^{2}},$$

which depends on the strict fraction $\alpha$ and the spurious variance ratio $\sigma^{2}/\lambda_{0}$.

#### F.1.4 Results

##### Spurious Correlation Learning.

Strict-only training learns nonzero spurious parameters. Figure [3](https://arxiv.org/html/2605.11134#A6.F3) compares empirical spurious parameter norms to the population prediction from Theorem [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1). Empirically, we found that correcting for curvature leads to the correct scaling across $\beta$.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.11134v1/x8.png)

Figure 3: Population scaling of spurious parameters in DPO. We compare empirical spurious parameter norms with the population prediction from Theorem 4.1. Left: Including curvature yields accurate predictions across $\beta$. Right: Ignoring curvature systematically underestimates spurious reliance, leading to large relative error even with infinite data. This confirms that curvature is necessary for correct population scaling.

Spurious learning induces a deployment vulnerability that persists under distribution shift. Figure [4](https://arxiv.org/html/2605.11134#A6.F4) reports empirical deployment suboptimality (Theorem [C.7](https://arxiv.org/html/2605.11134#A3.Thmtheorem7)) and its decomposition as a function of the training set size $n$ for $\beta = 0.5$. As $n$ grows, the estimation error decays while the shift error persists, so deployment error does not vanish with more data from $P$. The empirical suboptimality remains below the estimated upper bound, consistent with the theoretical guarantee. We computed $\mathrm{SubOpt}_{Q}$ as the gap between the deployment log-likelihood of a model trained on $P$ and the optimal model trained on $Q$.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x9.png)

Figure 4: Empirical deployment suboptimality and its decomposition under distribution shift. The figure shows four quantities as a function of the number of training samples $n$: (i) empirical deployment suboptimality $\mathrm{SubOpt}_{Q}(\hat{\theta})$; (ii) shift error estimate; (iii) estimation error estimate; (iv) estimated upper bound. As $n$ grows, the estimation error decays while the shift error persists, demonstrating that deployment error is irreducible with additional training data from $P$. The empirical suboptimality remains below the estimated upper bound, consistent with the theoretical guarantee.

#### F.1.5 Tie Training

Tie training suppresses spurious reliance in a predictable, population-level way. Figure [5](https://arxiv.org/html/2605.11134#A6.F5) plots the theoretical reduction factor $r_{\mathrm{th}}(\alpha)$ for different spurious variance ratios $\sigma^{2}/\lambda_{0}$, showing monotone suppression as the tie fraction increases. This prediction holds independently of sample size and isolates the irreducible effect of tie training on spurious learning. We use this curve to interpret empirical reductions in $\|\hat{\theta}_{s}\|$ across $\alpha$.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x10.png)

![Refer to caption](https://arxiv.org/html/2605.11134v1/x11.png)

Figure 5: Theoretical prediction for spurious reliance under tie training. The curve shows the reduction factor $r_{\mathrm{th}}(\alpha) = \frac{\alpha\lambda_{0}}{\alpha\lambda_{0} + (1-\alpha)\sigma^{2}}$ as a function of the strict-preference fraction $\alpha$, for different spurious variance ratios $\sigma^{2}/\lambda_{0}$. Increasing the proportion of tie examples ($1-\alpha$) monotonically suppresses reliance on spurious features, with stronger suppression when ties inject higher spurious variance. This bound holds independently of sample size and captures the irreducible effect of tie augmentation on spurious learning. KL strength: Left: $\beta = 0.3$; Right: $\beta = 0.7$.

#### F.1.6 Takeaway (Linear)

The linear experiments confirm Theorems [4.1](https://arxiv.org/html/2605.11134#S4.Thmtheorem1), [5.3](https://arxiv.org/html/2605.11134#S5.Thmtheorem3), and [6.2](https://arxiv.org/html/2605.11134#S6.Thmtheorem2). Strict-only training learns persistent spurious parameters and induces a vulnerability to irreducible deployment shift error under $Q$. Tie training acts as a selective regularizer on $\phi_{s}$ by injecting spurious variance into low-causal-margin comparisons. This reduces $\|\hat{\theta}_{s}\|$ and lowers shift-induced deployment error without relying on additional samples from $P$. As a result, tie training removes the irreducible vulnerability, as predicted by our mathematical analysis.

### F.2 Neural Networks (Nonlinear Regime)

##### Motivation.

Real-world preference models are nonlinear and do not expose an explicit causal–spurious feature decomposition, so the linear theory developed above no longer applies exactly in this regime. Our goal in this appendix is therefore mechanism validation rather than exact prediction. In particular, we test whether the qualitative spurious-learning mechanisms identified in the linear analysis persist when representations are nonlinear and hidden. This allows us to assess whether the theory captures dominant dynamics rather than model-specific artifacts.

#### F.2.1 Dataset Construction

##### Latent Variables.

We generate data from a latent quality variable $q \sim \mathcal{N}(0,1)$ that determines the true preference ordering. Causal features are constructed as nonlinear functions of $q$ with additive noise,

$$\phi_{c} = f_{c}(q,\varepsilon),$$

where $f_{c}$ includes transformations such as $q$, $q^{2}$, or $\sin q$. These features contain information about $q$ but are not linearly related to it. As a result, recovering quality requires nonlinear processing.

Spurious features are generated as correlated but non-causal functions of $q$,

$$\phi_{s} = \rho\,g(q) + \sqrt{1-\rho^{2}}\,\xi,$$

where $g(q)$ is correlated with $q$ and $\xi$ is independent noise. The parameter $\rho$ controls the strength of the spurious correlation during training. These features are predictive in-distribution but have no causal relationship to preference labels. This construction mirrors spurious cues in real preference datasets.

##### Nonlinear Mixing.

The observed input to the model is a nonlinear mixture of causal and spurious features,

$$\phi = g([\phi_{c};\phi_{s}]),$$

where $g(\cdot)$ is a fixed nonlinear function unknown to the model. This mixing prevents direct access to $\phi_{c}$ or $\phi_{s}$. As a result, the model must learn representations internally. This setting tests whether spurious reliance emerges even when the decomposition is hidden.

##### Tie Construction.

We construct tie examples by enforcing $|q_{1}-q_{2}| \leq \tau$ for a small threshold $\tau$. Preference labels for ties are assigned randomly, so the causal signal is intentionally weak. We manipulate spurious features independently of $q$, either by assigning opposing extremes or by randomizing them. This creates pairs with minimal causal difference and strong spurious contrast. These ties provide a targeted gradient signal against spurious reliance.
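
A sketch of this generative process is below; the specific transformations standing in for $f_{c}$, $g$, and the mixing function are illustrative choices consistent with the text, not the paper's exact ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_item(rho=0.8):
    q = rng.normal()                                     # latent quality
    phi_c = np.array([q, q**2, np.sin(q)]) + 0.1 * rng.normal(size=3)
    xi = rng.normal(size=2)                              # independent noise
    phi_s = rho * np.tanh(q) + np.sqrt(1 - rho**2) * xi  # correlated, non-causal
    raw = np.concatenate([phi_c, phi_s])
    phi = np.tanh(raw + 0.3 * raw**2)                    # fixed nonlinear mixing
    return q, phi

def tie_pair(tau=0.1):
    while True:  # rejection-sample a near-equal quality pair
        (q1, phi1), (q2, phi2) = sample_item(), sample_item()
        if abs(q1 - q2) <= tau:
            return phi1, phi2, rng.random() < 0.5        # random tie label
```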

#### F.2.2 Model and Objective

##### Model.

We train a multilayer perceptron (MLP) reward model $r_{\theta}(\phi)$ that maps nonlinear inputs to scalar scores. The model has no architectural bias toward separating causal and spurious components. All structure must be learned from data. This setting reflects realistic nonlinear preference models. It therefore provides a stringent test of the theory.

##### Training Loss.

Training uses a pairwise logistic objective,

$$\log\sigma\bigl(\beta\bigl(r(x^{+}) - r(x^{-})\bigr)\bigr),$$

which matches the standard preference optimization loss (we use $\beta = 1.0$ in the experiments). We vary the fraction of strict versus tie comparisons using the parameter $\alpha$. When $\alpha = 1$, training uses only strict comparisons. As $\alpha$ decreases, a larger fraction of tie data is introduced.

#### F.2.3 Proxy Metrics

##### Spurious Gap.

We measure the spurious gap as the difference between accuracy on pairs where spurious features align with true quality and accuracy on pairs where they conflict. Let $\mathrm{ACC}_{\text{aligned}}$ and $\mathrm{ACC}_{\text{unaligned}}$ denote these accuracies. Their difference quantifies reliance on spurious cues. A large gap indicates strong spurious dependence.

##### Adversarial Accuracy.

We evaluate adversarial accuracy under a distribution where spurious correlations are reversed. Performance in this setting isolates failure due to spurious reliance. Because the deployment objective is nonlinear, we use accuracy as a proxy for utility. Persistent degradation under this shift indicates misgeneralization.

##### Counterfactual Margin.

We measure counterfactual sensitivity by flipping spurious features while holding causal features fixed. The counterfactual margin is defined as

$$\mathbb{E}\bigl[|r(c,s) - r(c,-s)|\bigr].$$

Large values indicate that the learned reward depends strongly on spurious features. Tie training is expected to reduce this margin.
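
This proxy can be sketched as follows, assuming access to the pre-mixing features; the `reward` callable (scoring a concatenated feature batch) and the array shapes are hypothetical.

```python
import numpy as np

def counterfactual_margin(reward, phi_c, phi_s):
    """reward: callable scoring a (n, d_c + d_s) feature batch -> (n,) scores."""
    r = reward(np.concatenate([phi_c, phi_s], axis=1))
    r_cf = reward(np.concatenate([phi_c, -phi_s], axis=1))  # flip spurious part
    return float(np.mean(np.abs(r - r_cf)))                 # E[|r(c,s) - r(c,-s)|]
```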

#### F.2.4 Results

##### Spurious Learning.

Models trained without ties exhibit clear spurious learning. They show a large spurious gap, indicating substantially different performance when spurious cues align or conflict with quality. Figure [6](https://arxiv.org/html/2605.11134#A6.F6) reports this gap across training conditions.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x12.png)

Figure 6: Spurious gap (accuracy difference between aligned and misaligned spurious conditions) as a function of the fraction of strict preferences $\alpha$. Note that as $\alpha$ decreases, the number of ties increases. Thus, tie training reduces spurious reliance despite hidden representations.

These models also perform poorly under adversarial evaluation, showing that spurious reliance translates into deployment failures.

##### Tie Training.

Introducing tie training reduces sensitivity to spurious features. Figure [7](https://arxiv.org/html/2605.11134#A6.F7) shows that the counterfactual margin decreases sharply as the fraction of tie data increases, indicating reduced dependence on spurious cues.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x13.png)

Figure 7: Tie training reduces spurious reliance. Counterfactual margin $\mathbb{E}[|r(\phi) - r(\phi_{\mathrm{cf}})|]$ as a function of the fraction of strict preferences $\alpha$. As $\alpha$ decreases (more tie comparisons), the counterfactual margin drops sharply, indicating reduced sensitivity of the learned model to spurious features.

Figure [8](https://arxiv.org/html/2605.11134#A6.F8) shows that tie training also improves accuracy under adversarially reversed correlations.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x14.png)

Figure 8: Strict-only training plateaus under distribution shift; tie training improves robustness. Adversarial accuracy on $Q_{\mathrm{adv}}$, where spurious correlations flip, as a function of the number of training samples. Strict-only training ($\alpha = 1.0$) exhibits a persistent accuracy plateau despite increasing data. In contrast, tie training ($\alpha = 0.75$) improves adversarial accuracy, breaking the plateau.

These trends qualitatively match the predictions of the linear analysis.

##### Takeaway (Nonlinear).

In the nonlinear regime, the exact linear theory no longer applies. Nonetheless, the same qualitative spurious-learning mechanisms persist. Tie training reduces spurious reliance and improves robustness under distribution shift. These results suggest that the linear analysis captures dominant dynamics of preference learning. The theory therefore provides useful guidance beyond the linear setting.

### F.3 Large Language Models (Synthetic Hotel Benchmark)

We study a large-scale synthetic language-model setting, where preferences are expressed in natural language and spurious attributes resemble real-world surface cues.

##### Dataset: Synthetic Hotel Preferences.

We construct a synthetic hotel comparison dataset designed to study spurious correlation learning and the effect of tie training in large language models. Each example consists of a user context, two hotel options $(A,B)$, and a binary preference label indicating which hotel is preferred. The dataset explicitly separates causal utility, which determines true quality, from spurious features, which are surface attributes correlated with utility during training. This separation allows controlled experiments in which correlations can be manipulated without changing the underlying task. As a result, robustness under distribution shift can be evaluated in isolation.

Each hotel is assigned a latent true utility $u(h)$ computed from task-relevant attributes, including `price`, `distance_to_destination`, `star_rating`, and context-dependent `amenities`. Preference labels are generated by a teacher that depends only on this true utility, so higher $u$ always corresponds to higher quality. The true utility is never directly observed by the model and must be inferred from preference supervision. This ensures that the causal signal is present only implicitly. Consequently, any reliance on non-causal attributes reflects spurious learning.

We designate the following hotel attributes as spurious features:

- `street_number` (100–9999)
- `floor_number` (1–20)
- `building_age` (1–50; lower is better)
- `renovation_year` (2000–2024)
- `hotel_chain_tier` ∈ {Budget, Standard, Premium}
- `lobby_size_sqft` (500–5000)
- `employee_count` (10–200)

These attributes do not affect true utility but are correlated with utility during training. Their values are explicitly manipulated to create different correlation regimes at deployment. This design ensures that spurious cues are strong, structured, and controllable. As a result, failures under shift can be directly attributed to spurious reliance.

Spurious features are assigned as deterministic functions of a normalized utility level $u_{\text{norm}}\in[0,1]$. In the *normal* correlation mode, higher-utility hotels receive systematically better spurious attributes, such as newer buildings, premium chains, larger lobbies, and more employees. In *suppression* mode, spurious attributes are decorrelated from utility. In *adversarial* mode, the mapping is inverted using $1-u_{\text{norm}}$, so high-utility hotels receive worse spurious attributes. This construction induces sharp distribution shifts without altering the causal preference structure.

For standard (non-tie) training examples, we sample two hotels $(A,B)$, compute their true utilities $(u_{A},u_{B})$, assign spurious features according to the chosen correlation mode, and label the pair by the utility ordering. This procedure induces strong correlations between spurious attributes, preference labels, and true quality under the training distribution. These correlations are systematic rather than noisy. As a result, standard preference optimization objectives are incentivized to rely on spurious cues.

##### Tie Construction in the LLM Setting.

A tie is defined as a hotel pair $(A,B)$ such that the utility difference satisfies $|u_{A}-u_{B}| < \tau$ for a small threshold $\tau$. In these pairs, the causal signal is intentionally weak by construction. Preference labels are assigned randomly, with $y \sim \mathrm{Bernoulli}(1/2)$. This ensures that labels are independent of both utility and spurious features. Ties therefore isolate non-causal learning signals.

##### Informative Ties.

We make tie examples informative by explicitly decorrelating spurious features from utility. Our default strategy, `random_extreme`, assigns one hotel maximal spurious features and the other minimal spurious features. The assignment is random between $A$ and $B$, so the spurious direction is uninformative. This produces pairs with near-zero causal margin but maximal spurious contrast. As a result, gradients from these examples penalize spurious reliance.
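
A sketch of the default `random_extreme` strategy follows; the attribute ranges mirror the list above, while the data representation (plain dicts) and helper structure are hypothetical, and the categorical `hotel_chain_tier` is omitted for brevity.

```python
import random

# Numeric spurious attribute ranges from the benchmark description above.
SPURIOUS_RANGES = {
    "street_number": (100, 9999), "floor_number": (1, 20),
    "building_age": (1, 50), "renovation_year": (2000, 2024),
    "lobby_size_sqft": (500, 5000), "employee_count": (10, 200),
}

def random_extreme_tie(hotel_a, hotel_b):
    """Assign one hotel maximal and the other minimal spurious attributes,
    with the side chosen at random, then label the pair by a fair coin."""
    hi, lo = random.sample([hotel_a, hotel_b], k=2)  # random side assignment
    for attr, (low, high) in SPURIOUS_RANGES.items():
        hi[attr], lo[attr] = high, low               # maximal spurious contrast
    label_a_wins = random.random() < 0.5             # y ~ Bernoulli(1/2)
    return hotel_a, hotel_b, label_a_wins
```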

*Informative ties* satisfy three properties simultaneously. First, the causal signal is weak because $|u_{A}-u_{B}|$ is small. Second, the spurious contrast is large because spurious features are maximally separated. Third, labels are independent of spurious attributes by construction. Together, these properties ensure that tie gradients push against spurious features while preserving causal learning.

We also evaluate alternative tie construction strategies. In `random_uniform`, each hotel receives an independent random spurious level, which breaks correlation but yields weaker spurious contrast. In `suppressed`, all spurious attributes are resampled randomly, removing structured spurious signal entirely. Finally, `standard_monotonic` assigns spurious features monotonically from utility, which results in nearly identical spurious features for tied pairs, i.e., *non-informative ties*. This failure case yields $\Delta\phi_{s} \approx 0$ and provides no regularization signal.

##### Model and Training.

We fine-tune a fixed base language model, Llama-3.2-1B-Instruct, using Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2605.11134#bib.bib4)) and Low-Rank Adaptation (Hu et al., [2022](https://arxiv.org/html/2605.11134#bib.bib51)) for one epoch. All experiments use the same architecture and optimization settings. We compare strict training, which uses only standard preference pairs, with tie training, which augments the dataset with ties. The proportion of tie data is held fixed across experiments. This isolates the effect of tie construction rather than dataset size.
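
For reference, a minimal PyTorch sketch of the DPO objective optimized during this fine-tuning (Rafailov et al., 2023); it assumes per-response sequence log-probabilities have already been computed for the trained policy and the frozen reference model, and it omits the LoRA and data-loading plumbing.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument: (batch,) sequence log-probs of the winner/loser responses
    under the trained policy (pi_*) and the frozen reference model (ref_*)."""
    margin = (pi_logp_w - pi_logp_l) - (ref_logp_w - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()  # -log sigma(beta * margin)
```

Tie pairs enter this loss exactly like strict pairs; only their winner/loser assignment is randomized, which is what injects the spurious-direction regularization analyzed in Appendix D.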

##### Evaluation Metrics.

We measure in-distribution accuracy on standard test pairs drawn from the training distribution $P$. This evaluates whether models trained with ties retain performance on the original task. High in-distribution accuracy indicates that causal learning is preserved. We report overall accuracy and per-option accuracy, which allows us to detect asymmetric degradation.

We evaluate robustness under deployment distributions $Q$ where spurious correlations are suppressed or adversarially reversed. These settings isolate failures caused by spurious reliance. Accuracy is measured using the same preference labels derived from true utility, so performance degradation under $Q$ reflects spurious misgeneralization. Robust models should maintain accuracy across shifts.

We also measure preference stability under counterfactual changes to spurious features\. For a fixed hotel pair, we vary spurious attributes while holding causal features constant\. We then record whether the model’s preference changes\. A high flip rate indicates spurious dependence\. We expect tie training to reduce these flips\.
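One way to compute this flip rate is sketched below; `model_prefers` and `spurious_variants` are hypothetical helpers standing in for the model's pairwise choice and the counterfactual pair generator.

```python
def spurious_flip_rate(model_prefers, pairs, spurious_variants) -> float:
    """Fraction of pairs whose predicted preference flips when only the
    spurious attributes are varied (causal features held fixed).

    model_prefers(pair) -> "A" or "B"; spurious_variants(pair) yields
    counterfactual copies of the pair with altered spurious attributes.
    """
    flips = sum(
        any(model_prefers(v) != model_prefers(pair) for v in spurious_variants(pair))
        for pair in pairs
    )
    return flips / len(pairs)
```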

##### Results\.

Under strict DPO training, the models in Table [2](https://arxiv.org/html/2605.11134#A6.T2) achieve high in-distribution accuracy. However, performance degrades sharply under suppressed and adversarial spurious correlations. Increasing the amount of training data does not resolve this degradation. This indicates that the error is not due to estimation noise. Instead, it reflects convergence to a spurious population optimum.

Table 2: Tie Training (TT) DPO with informative ties improves robustness to spurious distribution shift. While standard DPO performs well in-distribution ($P$), its accuracy degrades under suppressed and adversarial spurious correlations ($Q$). In contrast, informative tie training preserves high accuracy under both suppressed and adversarial shifts, without sacrificing in-distribution performance. Non-informative ties (TT-Failure), which do not inject spurious variance, provide no robustness benefit and match strict DPO. Results are reported for overall accuracy and per-option accuracy (Hotel A/B), illustrating that robustness gains are systematic rather than label-specific. All examples use $1-\alpha = 0.3$, i.e., 30% ties.

Similarly, Table [2](https://arxiv.org/html/2605.11134#A6.T2) shows that tie training with informative ties improves robustness under spurious distribution shift relative to strict DPO. Models trained with informative ties achieve higher accuracy in both suppressed and adversarial settings, while maintaining comparable in-distribution performance. In contrast, non-informative tie constructions provide no robustness benefit and match strict training. These patterns are consistent across random seeds and evaluation splits. Together, the results show that robustness gains arise specifically from informative tie construction rather than from tie data alone.

##### Takeaway \(LLMs\)\.

Spurious correlation learning persists in large language models trained with preference optimization\. Increasing data alone does not eliminate spurious reliance or improve robustness under distribution shift\. Informative tie training provides targeted robustness gains by directly penalizing spurious features\. These effects mirror those observed in linear models and neural networks\. Together, the results suggest that the linearized theory captures core dynamics of preference learning under spurious correlations\.

## Appendix G RLHF-Style Reward Learning with Greedy Decoding

### G.1 Reward Learning Setup

We study linear reward learning under RLHF and show that it is a special case of Direct Preference Optimization (DPO). Specifically, RLHF reward learning corresponds to DPO with scaling parameter $\beta = 1$ and reference parameter $\theta_{\mathrm{ref}} = \mathbf{0}$, so that the effective parameter satisfies $\tilde{\theta} = \theta$. Under this specialization, the pairwise reward learning loss reduces to

$$\ell_{\mathrm{RLHF}}(\theta) = -\log\sigma(\theta^{\top}\Delta\phi),$$

which matches the DPO objective. This equivalence implies that the reward learning dynamics are unchanged relative to DPO in our linear analysis. Thus, the same spurious-learning mechanisms and distribution-shift vulnerabilities apply to standard RLHF reward learning.
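In code, the loss and its gradient are one-liners; this minimal sketch mirrors the equation above for a single comparison with feature difference $\Delta\phi$.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def rlhf_loss(theta: np.ndarray, delta_phi: np.ndarray) -> float:
    """-log sigma(theta^T Delta phi): DPO with beta = 1 and theta_ref = 0."""
    return float(-np.log(sigmoid(theta @ delta_phi)))

def rlhf_grad(theta: np.ndarray, delta_phi: np.ndarray) -> np.ndarray:
    """Gradient -(1 - sigma(theta^T Delta phi)) * Delta phi, for gradient descent."""
    return -(1.0 - sigmoid(theta @ delta_phi)) * delta_phi
```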

### G.2 Deployment with Greedy Policies

Given a learned linear reward $r_{\theta}(x, y) = \theta^{\top}\phi(x, y)$, deployment uses greedy decoding to select outputs. This induces the greedy policy

$$\pi(x) = \arg\max_{y} \theta^{\top}\phi(x, y),$$

which is optimal under $r_{\theta}$ but can be suboptimal under the true deployment utility. To measure this gap under a shifted deployment distribution $Q$, we evaluate the true deployment value functional $V^{\star}$ rather than the learned reward. Let $\pi^{\star}$ denote the optimal policy for the deployment problem under $Q$ and $V^{\star}$, and define the deployment suboptimality as

$$\mathrm{SubOpt}_{Q}(\pi) := V^{\star}(\pi^{\star}) - V^{\star}(\pi).$$

This definition turns reward mislearning into a policy-level metric that we can track under distribution shift.
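For a finite candidate set this metric is straightforward to compute; a sketch, where each row of a (hypothetical) matrix `Phi_Q` is a candidate's feature vector under the shifted distribution $Q$:

```python
import numpy as np

def subopt_Q(theta_hat: np.ndarray, theta_star: np.ndarray,
             Phi_Q: np.ndarray) -> float:
    """SubOpt_Q for one candidate set: the greedy policy picks the argmax
    of the learned reward but is scored under the true utility theta_star."""
    greedy = int(np.argmax(Phi_Q @ theta_hat))  # pi(x) under r_theta_hat
    true_values = Phi_Q @ theta_star            # V* of each candidate
    return float(true_values.max() - true_values[greedy])
```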

### G.3 Spurious Learning under RLHF Reward Learning

We now show that spurious correlation learning persists under RLHF reward learning and directly impacts greedy deployment\.

##### Setup\.

We evaluate the RLHF reward learning dynamics using a synthetic environment where features $\phi(x) \in \mathbb{R}^{d}$ are decomposed into causal features $\phi_c \in \mathbb{R}^{3}$ and a spurious feature $\phi_s \in \mathbb{R}^{1}$. The full feature vector is given by the concatenation $\phi(x) = [\phi_c(x), \phi_s(x)]$.

We construct a set of five items $\mathcal{A} = \{a_1, \dots, a_5\}$ with a fixed spurious correlation coefficient $\sigma = 1.0$. The feature representations are defined as follows:

$$\Phi = \begin{bmatrix} 1 & 1 & 0 & \sigma \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & \sigma \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

Ground-truth preference data are generated via a Bradley–Terry model $P(i \succ j) = \sigma(\theta^{*\top}(\phi(x_i) - \phi(x_j)))$. The ground-truth parameter is set to $\theta^{*} = [-1.0, 0.1, 0.05, 0.0]^{\top}$.

**Sampling Strategy.** We generate a dataset of $N$ pairwise comparisons with a fixed distribution of outcome types: 75% strict preferences and 25% ties. The strict samples are drawn evenly from three specific comparison pairs to create the experimental structure: $a_1$ vs $a_2$ (difference $[0, 1, 0, \sigma]$), $a_2$ vs $a_3$ (difference $[1, 0, 0, 0]$), and $a_5$ vs $a_3$ (difference $[0, 0, 1, 0]$). The remaining samples are labeled as ties.
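This generative process can be reproduced in a few lines; the sketch below encodes the $\Phi$ and $\theta^{*}$ above and draws strict comparisons from the Bradley–Terry probabilities (the random seed is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_sp = 1.0  # spurious correlation coefficient sigma
Phi = np.array([[1, 1, 0, sigma_sp],
                [1, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 1, 0, sigma_sp],
                [0, 0, 1, 0]], dtype=float)
theta_star = np.array([-1.0, 0.1, 0.05, 0.0])
strict_pairs = [(0, 1), (1, 2), (4, 2)]  # a1 vs a2, a2 vs a3, a5 vs a3

def sample_strict_comparison() -> tuple[int, int]:
    """Draw one (winner, loser): P(i beats j) = sigmoid(theta*^T (phi_i - phi_j))."""
    i, j = strict_pairs[rng.integers(len(strict_pairs))]
    p_i_wins = 1.0 / (1.0 + np.exp(-theta_star @ (Phi[i] - Phi[j])))
    return (i, j) if rng.random() < p_i_wins else (j, i)
```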

##### Results\.

Figure [9](https://arxiv.org/html/2605.11134#A7.F9) shows that training with $\ell_{\mathrm{RLHF}}$ yields a reward parameter $\hat{\theta}$ whose spurious component is nonzero, i.e., $\hat{\theta}_s \neq 0$. Similarly, Figure [10](https://arxiv.org/html/2605.11134#A7.F10) illustrates that increasing the number of samples from $P$ reduces estimation error around this optimum but does not remove the spurious component.

### G.4 Effect of Tie Training

We next evaluate how spurious learning induces deployment suboptimality and how tie training mitigates this effect\.

##### Results\.

Without tie training, Figure [11](https://arxiv.org/html/2605.11134#A7.F11) shows that the greedy policy $\hat{\pi}(x) = \arg\max_{y} \hat{\theta}^{\top}\phi(x, y)$ achieves low error on $P$ but incurs nonzero $\mathrm{SubOpt}_{Q}(\hat{\pi})$ when the deployment distribution $Q$ suppresses or reverses spurious correlations. As the number of training samples increases, the shift-induced component of the error persists. This persistence is strongest under adversarial reversals. Tie training reduces $\mathrm{SubOpt}_{Q}$ by shrinking $\hat{\theta}_s$ while preserving the causal component, so the greedy policy becomes less sensitive to spurious shifts.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x15.png)

Figure 9: Under standard RLHF reward learning, the learned policy exhibits nonzero reliance on spurious features ($\theta_s \neq 0$), and this reliance does not vanish with additional data drawn from the training distribution $P$. Tie training explicitly counteracts this effect, driving spurious reliance toward zero.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x16.png)

Figure 10: As the number of training samples increases, estimation error, defined as the weighted norm $\|\hat{\theta} - \theta^{*}\|_{\Sigma}$, decreases at comparable rates for strict MLE training and tie training, showing that tie training reduces spurious reliance without sacrificing estimation accuracy.

![Refer to caption](https://arxiv.org/html/2605.11134v1/x17.png)

Figure 11: Greedy decoding of a log-linear RLHF policy does not introduce additional error mechanisms, but exposes spurious reward learning under shift: performance, measured as $\mathrm{SubOpt}_{Q}(\pi) := V^{\star}(\pi^{\star}) - V^{\star}(\pi)$, degrades in both adversarial and suppression settings, and this error does not vanish with more data from $P$. Tie training reduces this shift-induced error.

### G.5 Discussion and Conclusion

Our results show that greedy decoding does not affect what reward learning fits, but it can amplify the behavioral impact of spurious reward errors\. Under distribution shift, small spurious weights can flip greedy decisions and induce nonzero deployment suboptimality\. Without tie training, this error persists in the infinite\-data limit\. Additional samples reduce estimation error but do not remove spurious reliance\. In contrast, tie training shrinks spurious weights and reduces the resulting deployment error\.

## Appendix H Limitations and Future Work

**Local regime and linearization.** Our theoretical analysis relies on a local regime where the learned policy $\pi_{\theta}$ remains close to the reference policy $\pi_{\text{ref}}$, enabling linearization of the KL-regularized objective.

**Tie construction and approximate equality.** The analysis assumes access to informative ties, i.e., preference pairs $(x, y^{+}, y^{-})$ where responses have near-equal utility but differing spurious features. In practice, exact utility equality is difficult to verify since true utility is latent and must be estimated from noisy human feedback or imperfect reward models.

However, the mechanism requires only that causal utility differences are small relative to spurious feature variability, not exact equality. Formally, what matters is that $\|\phi_c(y^{+}) - \phi_c(y^{-})\|$ is small compared to $\|\phi_s(y^{+}) - \phi_s(y^{-})\|$ in expectation over the tie distribution. Our experiments (Section [7](https://arxiv.org/html/2605.11134#S7)) demonstrate that ties constructed by selecting pairs with similar reward model scores effectively reduce spurious learning, even when utilities are not exactly equal.
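Mining approximate ties from reward-model scores could be sketched as follows; `reward_score` is a hypothetical scoring helper, and the default threshold is an illustrative assumption that trades tie quantity against quality.

```python
def mine_ties(responses: list, reward_score, tau: float = 0.1) -> list:
    """Approximate ties: response pairs whose reward-model scores are within tau.

    reward_score(response) -> float is a proxy for latent utility; smaller tau
    yields fewer but higher-quality ties.
    """
    ties = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            if abs(reward_score(responses[i]) - reward_score(responses[j])) < tau:
                ties.append((responses[i], responses[j]))
    return ties
```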

Nevertheless, formal guarantees for imperfect tie construction remain an open problem\. How much utility mismatch can be tolerated before tie training becomes ineffective? How should one trade off the number of ties versus their quality? Can adaptive tie construction algorithms identify informative ties online during training? We leave these questions for future work\.

**Analytical assumptions and feature decomposition.** We assume features decompose as $\phi(y) = [\phi_c(y)^{\top}, \phi_s(y)^{\top}]^{\top}$ into causal and spurious components. This decomposition makes the spurious learning mechanism explicit and enables clean theoretical statements, but it is an idealization.

Importantly, this assumption is used only for analysis, not for the algorithm\. Tie training does not require explicit identification or labeling of spurious features\. It operates by adding pairs that modify the covariance structure of the training distribution\. The decomposition serves as an analytical tool to understand why tie training works, not as a prerequisite for its application\. Extending the theory to settings without a clear causal–spurious decomposition remains open\.

**Scale and scope of validation.** Our experiments validate the theoretical mechanisms we derive. We demonstrate that predictions from the population-level theory match finite-sample behavior, that tie training reduces spurious parameters as predicted, and that these reductions translate to improved deployment robustness. However, these experiments are designed to test theoretical predictions in controlled settings, not to optimize end-to-end performance of production alignment systems.

Large\-scale validation in production RLHF pipelines remains important future work\. This includes: applying tie training to frontier language models with billions of parameters, developing practical methods for tie construction from human feedback at scale, evaluating robustness improvements on diverse downstream tasks and distribution shifts, and comparing tie training to other robustness interventions like distributionally robust optimization or causal regularization\. Such validation would determine whether the gains observed in controlled experiments translate to meaningful improvements in deployed systems\.

Similar Articles

$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin

arXiv cs.LG

This paper introduces xi-DPO, a novel preference optimization method that reformulates the objective to minimize distance to optimal ratio reward margins, addressing hyperparameter tuning challenges in SimPO. Experimental results show that xi-DPO outperforms existing methods on open benchmarks.

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

arXiv cs.CL

CiPO is a novel framework for machine unlearning in Large Reasoning Models that uses iterative preference optimization with counterfactual reasoning traces to selectively remove unwanted knowledge while preserving reasoning abilities. The method addresses the challenge of unlearning in models that rely on chain-of-thought reasoning by generating logically valid alternative reasoning paths during training.

FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

arXiv cs.CL

FSPO proposes a few-shot preference optimization algorithm for LLM personalization that reframes reward modeling as meta-learning, enabling models to quickly infer personalized reward functions from limited user preferences. The method achieves 87% personalization performance on synthetic users and 70% on real users through careful synthetic preference dataset construction.

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

arXiv cs.CL

GroupDPO introduces a memory-efficient algorithm for group-wise direct preference optimization that leverages multiple candidate responses per prompt while reducing peak memory usage through decoupled backpropagation. The method demonstrates consistent improvements over standard DPO across offline and online alignment settings.