$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin


Summary

This paper introduces $\xi$-DPO, a preference optimization method that reformulates the objective as minimizing the distance between reward gaps and an optimal ratio reward margin, addressing the hyperparameter tuning challenges of SimPO. Experimental results show that $\xi$-DPO outperforms existing methods on open benchmarks.


# $\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
Source: [https://arxiv.org/html/2605.10981](https://arxiv.org/html/2605.10981)
Zhengyuan Fan, Zhonghua Wu, Yuxuan Du, Qun Chen
School of Computer Science, Northwestern Polytechnical University
{fanzhengyuan, wuxhua, duyuxuan36, chenbenben}@nwpu.edu.cn

###### Abstract

Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $\beta$ and $\gamma$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $\beta$ implicitly controls sample filtering, while the effect of $\gamma$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $\xi$-DPO: direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected responses, which effectively cancels the effect of $\beta$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $\xi$. Unlike the margin $\gamma$ in SimPO, $\xi$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. Finally, we use LeakyReLU to prevent samples whose reward gaps already exceed $\xi$ from being unnecessarily pulled back toward the target margin. $\xi$-DPO maintains a simple formulation without introducing a reference model or additional hyperparameters. Experimental results show that $\xi$-DPO substantially outperforms existing preference optimization methods across open benchmarks on multiple evaluation metrics.

Code is available at [https://github.com/zyfan1/Xi-DPO](https://github.com/zyfan1/Xi-DPO).

## 1 Introduction

With the rapid advancement of large language models (Team, [2025](https://arxiv.org/html/2605.10981#bib.bib1); OpenAI, [2025](https://arxiv.org/html/2605.10981#bib.bib2)), aligning their responses with human preferences has become critically important. Ouyang et al. ([2022a](https://arxiv.org/html/2605.10981#bib.bib3)) introduced reinforcement learning from human feedback (RLHF), a method for aligning large language model outputs with human preferences. In their framework, the model is optimized with the Proximal Policy Optimization (PPO) algorithm (Schulman et al., [2017](https://arxiv.org/html/2605.10981#bib.bib4)). RLHF comprises three stages: (1) supervised fine-tuning (SFT) of large models on downstream tasks; (2) reward modeling based on the SFT model; and (3) a final reinforcement learning stage. Despite its effectiveness, this multi-stage pipeline introduces considerable complexity into the training process.

Christiano et al. ([2017](https://arxiv.org/html/2605.10981#bib.bib11)) propose using the Bradley-Terry (BT) model (Bradley and Terry, [1952](https://arxiv.org/html/2605.10981#bib.bib20)) for preference modeling to optimize the reward model $r(y, x)$. Given a dataset $D = \{(x, y_w, y_l)\}$ consisting of prompts $x$ and paired responses $(y_w, y_l)$, where $y_w$ is the chosen (winning) response and $y_l$ the rejected (losing) response, the preference relationship is

$$p(y_w \succ y_l \mid x) = \frac{\exp\big(r(y_w, x)\big)}{\exp\big(r(y_w, x)\big) + \exp\big(r(y_l, x)\big)} = \sigma\big(r(y_w, x) - r(y_l, x)\big) \qquad (1)$$

where $\sigma$ is the sigmoid function. Subsequent work, such as Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2605.10981#bib.bib5)), significantly simplifies RLHF. The key innovation of DPO lies in deriving the relationship between the reward model and the policy to be optimized, thereby merging reward modeling and reinforcement learning into a single stage. This significantly simplifies the process of preference optimization. Researchers only need to optimize the target model $\pi_\theta$ using the following DPO loss function to align the model's responses with human preferences:

$$\ell_{\mathrm{DPO}}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[-\log\sigma\left(\beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right)\right] \qquad (2)$$

where $\pi_\theta$ is the policy model to be optimized, $\pi_{\mathrm{ref}}$ is the reference policy, $\pi_\theta(y \mid x)$ and $\pi_{\mathrm{ref}}(y \mid x)$ denote the probabilities of generating response $y$ given prompt $x$ under the policy model and the reference model respectively, and $y_w$ denotes the chosen response and $y_l$ the rejected response.
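To make Equation (2) concrete, here is a minimal PyTorch sketch of the DPO loss; the function name and the assumption that sequence-level log-probabilities $\log\pi(y \mid x)$ have already been computed for each pair are ours, not part of the original implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sketch of the DPO loss in Equation (2).

    Each argument is a 1-D tensor of sequence-level log-probabilities
    log pi(y|x), one entry per preference pair in the batch.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for chosen and rejected.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Maximizing sigma(reward gap) as a likelihood = minimizing -log sigma(gap).
    return -F.logsigmoid(reward_w - reward_l).mean()
```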

Recent studies (Wu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib6); Pan et al., [2025](https://arxiv.org/html/2605.10981#bib.bib7)) have shown that the DPO hyperparameter $\beta$ is highly sensitive to the data distribution, which makes tuning difficult and can result in only marginal performance gains. Wu et al. ([2024](https://arxiv.org/html/2605.10981#bib.bib6)) analyzed experimentally how the choice of $\beta$ depends on the data distribution: when the reward gap between the chosen ($y_w$) and rejected ($y_l$) responses is large, a larger $\beta$ is preferred; when the gap is small, $\beta$ should be smaller. Simple Preference Optimization (SimPO) (Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8)) provides a more efficient formulation by rewriting the DPO reward (Rafailov et al., [2023](https://arxiv.org/html/2605.10981#bib.bib5)) from $\beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ to $\frac{\beta}{|y|}\log\pi_\theta(y \mid x)$, and further encourages the reward gap between the chosen response $y_w$ and the rejected response $y_l$ to be at least $\gamma$. They highlighted the necessity of $\gamma$, arguing that it directly affects the uniformity, or flatness, of the reward-gap distribution. Formally, the loss function of SimPO is defined as:

$$\ell_{\mathrm{SimPO}}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[-\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l \mid x) - \gamma\right)\right] \qquad (3)$$
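For comparison with the DPO sketch above, a minimal PyTorch sketch of Equation (3); again, the function name and the precomputed sequence-level log-probabilities are our assumptions, and the default values of `beta` and `gamma` are placeholders rather than recommended settings.

```python
import torch.nn.functional as F

def simpo_loss(policy_logp_w, policy_logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    """Sketch of the SimPO loss in Equation (3).

    policy_logp_* are sequence-level log-probabilities log pi_theta(y|x);
    len_* are the response lengths |y| used for length normalization.
    """
    # Length-normalized, reference-free rewards.
    reward_w = beta * policy_logp_w / len_w
    reward_l = beta * policy_logp_l / len_l
    # Encourage the reward gap to exceed the target margin gamma.
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()
```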
SimPO requires the joint tuning of $\beta$ and $\gamma$. As in $\beta$-DPO (Wu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib6)), we use a small, fixed $\beta$ for the low-gap data and a larger $\beta$ for the high-gap data. We fine-tune Pythia-2.8B (Biderman et al., [2023](https://arxiv.org/html/2605.10981#bib.bib10)) with SimPO (Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8)), setting $\gamma$ to decay from large to small (6 to 1). The detailed GPT-4-based evaluation results are presented in Table [1](https://arxiv.org/html/2605.10981#S1.T1). The win rate on the low-gap data increases from 3.2% to 49.5%, while on the high-gap data it decreases from 31.46% to 30.16%. As shown in Table [1](https://arxiv.org/html/2605.10981#S1.T1), both $\beta$ and $\gamma$ exhibit high sensitivity, making it necessary to conduct multiple experiments to identify appropriate values. These results also reveal an interpretability issue regarding the reward margin $\gamma$ in SimPO: in our experiments, varying $\gamma$ leads to substantially larger performance changes on low-gap data than on high-gap data. From this viewpoint, $\gamma$ is better regarded as a constraint strength that should match the intrinsic gap structure of the data, rather than as a margin expected to force the reward gaps of most samples beyond a prescribed threshold. When SimPO-like methods are applied across different datasets, the intrinsic reward gaps of these datasets vary. Such variation is difficult to quantify, making it unclear whether $\gamma$ should be increased or decreased, and by how much. This uncertainty is one of the main reasons why selecting an appropriate $\gamma$ is challenging.

Table 1: Win rate variations with $\beta$ and $\gamma$ across different data types. The left table presents the results from $\beta$-DPO (Wu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib6)), illustrating how the effect of $\beta$ varies across different data types. The right table reports our results, exploring how the effect of $\gamma$ changes with the data type.

(a) Win rate variations with $\beta$ (results from $\beta$-DPO).

| Data type | $\beta=0.1$ | $\beta=0.3$ | $\beta=0.5$ |
|---|---|---|---|
| Low gap | 43.0 | 37.0 | 33.0 |
| High gap | 7.0 | 28.0 | 31.0 |

(b) Win rate variations with $\gamma$ (results from our experiments built upon SimPO).

| Data type | $\beta$ | $\gamma=6$ | $\gamma=3$ | $\gamma=1$ |
|---|---|---|---|---|
| Low gap | 2 | 3.20 | 46.94 | 49.45 |
| High gap | 10 | 31.46 | 30.70 | 30.16 |

Both DPO and SimPO adopt an optimization objective of the form $\sigma(\beta(r(y_w|x) - r(y_l|x)))$, where $\beta$ typically acts as a scaling factor. In this paper, however, we revisit the role of $\beta$ from a different perspective and show that it implicitly serves to filter out high-gap samples. This provides a compelling explanation for why existing dynamic hyperparameter tuning strategies (Wu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib6), [2025](https://arxiv.org/html/2605.10981#bib.bib16)) are effective in practice. Based on these insights, we propose $\xi$-DPO: direct preference optimization via ratio reward margin. First, we simplify the optimization objective of SimPO through an equivalent function mapping, transforming the goal from maximizing the probability likelihood of reward gaps into minimizing the mean squared error between the reward gap and the theoretically optimal gap. Second, we normalize the reward by converting it into a ratio of the chosen to the rejected reward. This normalization not only effectively cancels out $\beta$, but also constrains the reward gap to the interval [0, 1]; we call the resulting target the ratio reward margin $\xi$. Finally, we employ the LeakyReLU activation function to prevent reward degradation, in which reward gaps that already exceed $\xi$ would otherwise be pulled back, increasing the rejected reward and decreasing the chosen reward.

Formally, the optimization objective of $\xi$-DPO is defined as follows:

$$\min_\theta\; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\operatorname{LeakyReLU}\left(\xi - \frac{\frac{1}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{1}{|y_l|}\log\pi_\theta(y_l \mid x)}{\left|\frac{1}{|y_w|}\log\pi_\theta(y_w \mid x) + \frac{1}{|y_l|}\log\pi_\theta(y_l \mid x)\right|}\right)^2\right] \qquad (4)$$
where $\operatorname{LeakyReLU}$ is the activation function. The above equation has only one adjustable parameter, $\xi$, which can be selected based on dataset features rather than careful trial-and-error tuning. Specifically, $\xi$ is determined by the quantiles of the initial reward gap distribution of the policy. In practice, for low-gap datasets constructed using a strong reward model, we suggest setting $\xi$ within the 90th–95th percentile range of the distribution. This choice allows most samples to participate in training while preventing excessively strong reward signals from causing the model to overfit. For high-gap datasets constructed using a weaker reward model, we suggest setting $\xi$ within the 97th–99.9th percentile range, which helps avoid premature termination of optimization caused by insufficient reward signals. Our sensitivity experiments further show that $\xi$-DPO remains robust to the choice of $\xi$ as long as it is selected within a reasonable range.

![Figure 1(a): Reward curves of AlphaDPO](https://arxiv.org/html/2605.10981v1/figures/alpha_c_r.png)
![Figure 1(b): Reward curves of $\xi$-DPO](https://arxiv.org/html/2605.10981v1/figures/our_c_r.png)

Figure 1: Comparison of reward curves. (a) shows the reward dynamics of AlphaDPO during training, and (b) presents those of $\xi$-DPO. For $\xi$-DPO, the reward for the chosen response steadily increases while that for the rejected response decreases, indicating that the model becomes increasingly aligned with the chosen response, in accordance with our optimization objective. By contrast, for AlphaDPO methods such as SimPO, although the chosen reward remains higher than the rejected reward, both display a downward trend, implying a weaker ability to separate chosen responses from rejected ones.

It is noteworthy that the three designs of $\xi$-DPO are key to its efficacy. **The equivalence mapping** of the optimization objective eliminates the adverse optimization effects caused by the hyperparameter $\beta$ on the sigmoid gradient, while endowing the reward margin with an explicit semantic role of enforcing the distinction between chosen and rejected responses. **The reward redefinition** cancels out $\beta$ and makes the reward margin bounded. Finally, **LeakyReLU** safeguards this enforced separation: the reward of high-gap samples is not forcibly pulled back. We visualize the reward curves during the training of $\xi$-DPO and AlphaDPO, a dynamic-$\gamma$ variant of SimPO (Wu et al., [2025](https://arxiv.org/html/2605.10981#bib.bib16)), in Figure [1](https://arxiv.org/html/2605.10981#S1.F1). As shown in the figure, for AlphaDPO methods such as SimPO, although the reward of the chosen response remains higher than that of the rejected, both display a downward trend, implying a weaker ability to separate chosen responses from rejected ones. In contrast, for $\xi$-DPO, the reward of the chosen response steadily increases while the reward of the rejected decreases, indicating that the model becomes increasingly aligned with the chosen response, in accordance with our optimization objective.

We summarize our contributions as follows:

1. We systematically analyze the roles of the hyperparameters $\beta$ and $\gamma$ in SimPO and arrive at two insights: (i) $\beta$ is not only a reward-scaling factor, but also serves to filter out high-gap samples; (ii) from a token-level perspective, $\gamma$ is determined by the intrinsic reward gap of the data. The uncertainty introduced by $\beta$ in sample filtering, and the difficulty of quantifying the intrinsic reward gaps of different datasets, make $\beta$ and $\gamma$ sensitive and hard to select.
2. Based on the insights from our analysis of the roles of $\beta$ and $\gamma$, we propose $\xi$-DPO. It adopts a simpler optimization objective with only one hyperparameter, $\xi$, which can be easily set based on dataset features rather than careful trial-and-error tuning. Its novel structure, together with a margin derived from the quantiles of the reward-gap distribution, also effectively mitigates hyperparameter sensitivity.
3. Extensive experiments across multiple SFT models and preference optimization benchmarks demonstrate the effectiveness and generality of $\xi$-DPO, showing that it achieves superior performance compared with existing alternative methods. Our sensitivity evaluation also reveals that the performance of $\xi$-DPO is robust w.r.t. $\xi$, provided that its value is set within a reasonable range.

## 2 Related Work

**Reinforcement Learning from Human Feedback.** RLHF has demonstrated effectiveness in aligning language model responses with human preferences (Ouyang et al., [2022a](https://arxiv.org/html/2605.10981#bib.bib3); Christiano et al., [2017](https://arxiv.org/html/2605.10981#bib.bib11); Ziegler et al., [2019](https://arxiv.org/html/2605.10981#bib.bib12); Ouyang et al., [2022b](https://arxiv.org/html/2605.10981#bib.bib13); Fang et al., [2024](https://arxiv.org/html/2605.10981#bib.bib14)). Its three-stage pipeline, comprising supervised fine-tuning (SFT) (Köpf et al., [2023](https://arxiv.org/html/2605.10981#bib.bib33); Ding et al., [2023](https://arxiv.org/html/2605.10981#bib.bib34); Zhou et al., [2023](https://arxiv.org/html/2605.10981#bib.bib35); Taori et al., [2023](https://arxiv.org/html/2605.10981#bib.bib36)), reward modeling (Rafailov et al., [2024](https://arxiv.org/html/2605.10981#bib.bib37); Gao et al., [2023](https://arxiv.org/html/2605.10981#bib.bib38); Luo et al., [2025](https://arxiv.org/html/2605.10981#bib.bib39); Chen et al., [2024](https://arxiv.org/html/2605.10981#bib.bib40)), and reinforcement learning optimization (Schulman et al., [2017](https://arxiv.org/html/2605.10981#bib.bib4); Anthony et al., [2017](https://arxiv.org/html/2605.10981#bib.bib41)), increases the overall complexity of RLHF. DPO (Rafailov et al., [2023](https://arxiv.org/html/2605.10981#bib.bib5)) mitigates this limitation by constructing preference datasets offline and directly optimizing the model with its loss function. IPO (Azar et al., [2024](https://arxiv.org/html/2605.10981#bib.bib15)) analyzes DPO, proposes a generalized preference theory, and mitigates the overfitting issues present in DPO. SimPO (Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8)) redefines the implicit reward of DPO, transforming it into a reference-free preference optimization method with remarkable effectiveness; however, it relies on extensive experimentation to determine optimal hyperparameter values. $\beta$-DPO (Wu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib6)) investigates the relationship between hyperparameters and the data distribution, but requires fine-tuning around a preset hyperparameter value. Similarly, AlphaDPO (Wu et al., [2025](https://arxiv.org/html/2605.10981#bib.bib16)) adopts an analogous strategy to adjust the $\gamma$ in SimPO, but it reintroduces a reference model, increasing resource cost, and introduces a new hyperparameter $\alpha$. SimPER (Xiao et al., [2025](https://arxiv.org/html/2605.10981#bib.bib42)) uses model perplexity as the reward and proposes a hyperparameter-free preference optimization method; however, its performance remains unsatisfactory. This paper proposes $\xi$-DPO, which retains the simplicity of SimPO while avoiding its cumbersome hyperparameter selection process, and achieves strong empirical performance.

## 3 Hyperparameter Role Analysis in SimPO

This section provides a comprehensive analysis of the roles of the hyperparameters in SimPO.

### 3.1 The Role of $\beta$

Consider a prompt $x$ and its corresponding chosen ($y_w$) and rejected ($y_l$) responses. According to Equation [3](https://arxiv.org/html/2605.10981#S1.E3), the primary optimization objective is to maximize the probability of the gap between the chosen and rejected rewards ($r(y_w, x)$ and $r(y_l, x)$), that is:

$$\sigma\big(\beta(r^{*}(y_w, x) - r^{*}(y_l, x))\big) \qquad (5)$$

where $r^{*}(y, x) = \frac{\log\pi_\theta(y|x)}{|y|}$. For now, we set $\gamma$ aside, as it can be treated as a bias that does not affect our analysis of the distribution of reward gaps. For simplicity, we refer to the chosen reward as chosen and the rejected reward as rejected.

For a given dataset, the distribution of the reward gap ($\Delta r^{*} = r^{*}(y_w, x) - r^{*}(y_l, x)$) exhibits the following cases: (1) chosen is far greater than rejected ($\Delta r^{*} \gg 0$); (2) chosen is far less than rejected ($\Delta r^{*} \ll 0$); (3) chosen is close to rejected. As shown in Figure [2](https://arxiv.org/html/2605.10981#S3.F2), the first two cases are referred to as high reward gap (the tail and head of the distribution, respectively), and the third case as low gap (the middle of the distribution). Hereafter, we use the terms tail and head regions to denote the first two cases. The samples in the third case are considered normal and help maintain training stability, since it is sufficient to enforce a certain constraint such that the chosen reward remains larger than the rejected one. The samples in the tail region exhibit strong reward signals and can facilitate the optimization process when the hyperparameters are properly selected; otherwise, an originally large reward gap may be undesirably reduced. The most problematic samples are those in the head region, which can severely disrupt the training process.

![Refer to caption](https://arxiv.org/html/2605.10981v1/figures/myplot.png)

Figure 2: Distribution and sigmoid over $\Delta r^{*}$. As $\beta$ increases, the sigmoid becomes steeper, causing more samples in both the head region (left) and the tail region (right) to be filtered out during training.

Previous studies (Rafailov et al., [2023](https://arxiv.org/html/2605.10981#bib.bib5); Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8); Wu et al., [2025](https://arxiv.org/html/2605.10981#bib.bib16)) generally treat $\beta$ as a scaling factor. We find that its effect on the sigmoid slope essentially advances or delays the point at which the gradient approaches zero, thereby filtering preference samples. Through the setting of $\beta$, the samples in the head or tail regions of the reward-gap distribution may become ineffective or effective during training. In other words, $\beta$ can be understood as a mechanism for filtering data (head or tail data in the distribution). Figure [2](https://arxiv.org/html/2605.10981#S3.F2) illustrates this process more clearly. The larger the value of $\beta$, the more data is filtered out, because samples falling within the interval where the gradient approaches zero do not contribute to the model's optimization. Conversely, a smaller $\beta$ means that the vast majority of samples are utilized. Figure [2](https://arxiv.org/html/2605.10981#S3.F2) is generated using the [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) dataset and the Mistral-7B-Instruct model. The same distribution pattern can be observed in other datasets and models.
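As a small illustration of this filtering view (our own numeric check, not an experiment from the paper), the snippet below evaluates the gradient scale of the objective in Equation (5) with respect to the reward gap, $\beta\,\sigma(\beta\Delta r^{*})(1-\sigma(\beta\Delta r^{*}))$, at a few illustrative gap values.

```python
import torch

# Gradient scale of sigma(beta * gap) from Equation (5) w.r.t. the reward gap.
gaps = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # head ... middle ... tail
for beta in (0.1, 0.5, 2.0):
    s = torch.sigmoid(beta * gaps)
    grad_scale = beta * s * (1 - s)
    print(f"beta={beta}:", [round(v, 3) for v in grad_scale.tolist()])
# beta=0.1: all samples receive a similar gradient scale (~0.025).
# beta=2.0: head/tail samples (gap = +/-2) receive roughly 7% of the scale of
# a mid sample (gap = 0), i.e. they are effectively filtered out of the update.
```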

This insight helps explain the contribution of $\beta$-DPO (Wu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib6)), which proposes that a smaller $\beta$ should be used for low-gap data, and vice versa. Low-gap data typically imply that samples are concentrated in the middle region of the distribution. In this case, using a smaller $\beta$ prevents the sigmoid from excessively filtering samples. In contrast, for high-gap data, a larger $\beta$ is used to filter out more samples in the head region, preventing them from affecting the optimization process. However, sigmoid-based filtering is symmetric: when $\beta$ is adjusted, the sigmoid filters not only tail-region samples but also head-region samples, which may cause potentially useful reward information to be ignored. This explains the sensitivity of $\beta$ tuning.

### 3.2 The Role of $\gamma$

We analyze the effect of $\gamma$ from two complementary perspectives: a token-level investigation and a comparative analysis between SimPO and DPO.

**Token-level Investigation.** For qualitative analysis, given a prompt $x$, we assume that the chosen and rejected responses have the same length, denoted by $|y|$. Then, the optimization objective of SimPO can be represented by:

$$
\begin{aligned}
&\frac{\beta}{|y|}\log\pi_\theta(y_w \mid x) - \frac{\beta}{|y|}\log\pi_\theta(y_l \mid x) - \gamma \\
=\;& \frac{\beta}{|y|}\sum_{i=1}^{n}\log\pi_\theta(y_w^{i} \mid y_w^{i-1},\ldots,x) - \frac{\beta}{|y|}\sum_{i=1}^{n}\log\pi_\theta(y_l^{i} \mid y_l^{i-1},\ldots,x) - \gamma \\
=\;& \Bigl[\Bigl(\frac{\beta}{|y|}\log\pi_\theta(y_w^{1} \mid x) - \frac{\beta}{|y|}\log\pi_\theta(y_l^{1} \mid x) - \gamma_1\Bigr) + \Bigl(\frac{\beta}{|y|}\log\pi_\theta(y_w^{2} \mid y_w^{1}, x) - \frac{\beta}{|y|}\log\pi_\theta(y_l^{2} \mid y_l^{1}, x) - \gamma_2\Bigr) + \cdots\Bigr]
\end{aligned}
\qquad (6)
$$

This suggests that $\gamma$ can be viewed as imposing a token-level constraint. For low-gap pairs, the chosen and rejected responses typically exhibit substantial token overlap, implying that these shared or highly similar tokens require little additional separation ($\gamma_i = 0$). As a result, the cumulative constraint induced by $\gamma$ should be smaller. This also explains why low-gap datasets favor a smaller $\gamma$: the purpose is not to enforce an unnecessarily large margin between chosen and rejected responses, but to adapt the constraint strength to the intrinsic characteristics of the data.

**Comparative Analysis between DPO and SimPO.** Wu et al. ([2025](https://arxiv.org/html/2605.10981#bib.bib16)) reach the conclusion that $\gamma$ is in fact an implicit representation of the reference policy. Here, we leverage their insight to support our hypothesis, as illustrated by the following equation:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{DPO}} &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{\beta}{|y_l|}\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right] \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Biggl[\log\sigma\Biggl(\frac{\beta}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l \mid x) - \Bigl(\frac{\beta}{|y_w|}\log\pi_{\mathrm{ref}}(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_{\mathrm{ref}}(y_l \mid x)\Bigr)\Biggr)\Biggr] \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l \mid x) - \gamma\right)\right] \\
&= \mathcal{L}_{\mathrm{SimPO}}
\end{aligned}
\qquad (7)
$$

As can be seen from the above equation, $\gamma = \frac{\beta}{|y_w|}\log\pi_{\mathrm{ref}}(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_{\mathrm{ref}}(y_l \mid x)$. This means that a low-gap dataset naturally requires a smaller $\gamma$; conversely, using a large $\gamma$ would deviate from the actual behavior of the reference policy. This indicates a potential mismatch between the behavior of $\gamma$ and its intended role of artificially widening the reward gap.

## 4 Our Methodology

Based on our analysis of $\beta$ and $\gamma$, we present our $\xi$-DPO solution with two objectives: to eliminate the high sensitivity of $\beta$, where even small changes can substantially affect the sigmoid gradient, and to introduce a new, quantifiable margin that enforces a stronger constraint on the reward gap between chosen and rejected responses.

### 4.1 Definition of $\xi$-DPO

First, we apply a logit transformation to the reward objective in Equation [5](https://arxiv.org/html/2605.10981#S3.E5) to mitigate the influence of the sigmoid gradient on the optimization process:

$$f(x) = \log\left(\frac{x}{1-x}\right) \qquad (8)$$
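As a quick check (our addition), the logit map is exactly the inverse of the sigmoid, so applying Equation (8) to Equation (5) strips the sigmoid and leaves the raw scaled reward gap:

```latex
f\bigl(\sigma(z)\bigr)
  = \log\frac{\sigma(z)}{1-\sigma(z)}
  = \log\frac{1/(1+e^{-z})}{e^{-z}/(1+e^{-z})}
  = \log e^{z}
  = z,
\qquad z = \beta\bigl(r^{*}(y_w, x) - r^{*}(y_l, x)\bigr).
```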
After applying this transformation to the original objective in Equation [5](https://arxiv.org/html/2605.10981#S3.E5), the new objective becomes $\beta(r^{*}(y_w, x) - r^{*}(y_l, x))$. Based on the analysis in IPO (Azar et al., [2024](https://arxiv.org/html/2605.10981#bib.bib15)), such a transformation does not alter the optimal solution. Accordingly, the objective of SimPO (Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8)) can be represented by:

$$\frac{\beta}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l \mid x) - \gamma \qquad (9)$$
The objective of SimPO has now shifted from maximizing the likelihood of the reward gap to minimizing the distance between the reward gap and $\gamma$. Here, $\gamma$ is considered the optimal gap between chosen and rejected, and the final objective can be obtained using the least-squares method:

$$\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l \mid x) - \gamma\right)^2 \qquad (10)$$

Then, we redefine the reward through normalization to ensure that it is bounded:

$$\frac{r_{*}}{|r_w + r_l|} \qquad (11)$$

where $r_{*}$ is the chosen ($r_w$) or rejected ($r_l$) reward, and the chosen and rejected rewards constitute a reward ratio.

Accordingly, we introduce a ratio reward margin $\xi$ to control the gap between ratio rewards; for simplicity, we refer to it as the ratio margin hereafter. By normalizing the reward $\frac{\beta}{|y_{*}|}\log\pi_\theta(y_{*}|x)$ in SimPO according to Equation [11](https://arxiv.org/html/2605.10981#S4.E11), the objective can be defined as follows:

$$\left(\frac{\log p_w - \log p_l}{\left|\log p_w + \log p_l\right|} - \xi\right)^2 \qquad (12)$$

where $\log p_{*}$ denotes $\frac{\log\pi_\theta(y_{*}|x)}{|y_{*}|}$. It is noteworthy that the reward definition in Equation [11](https://arxiv.org/html/2605.10981#S4.E11) offers two advantages: (1) it ensures that the reward remains bounded while constraining the ratio margin, $\xi \in [0, 1]$; (2) it eliminates the dependence on $\beta$ and defines $\xi$ on a bounded ratio reward space, making the margin more interpretable and less sensitive, while allowing it to be selected from the initial reward gap distribution rather than through joint tuning with $\beta$.
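A short justification (our addition) of the boundedness claim: because the length-normalized log-probabilities $\log p_w$ and $\log p_l$ are both negative, the sum of their magnitudes equals the magnitude of their sum, so

```latex
\log p_w < 0,\ \log p_l < 0
\;\Longrightarrow\;
\bigl|\log p_w - \log p_l\bigr|
\;\le\; \bigl|\log p_w\bigr| + \bigl|\log p_l\bigr|
\;=\; \bigl|\log p_w + \log p_l\bigr|.
```

Hence the ratio reward gap in Equation (12) lies in $[-1, 1]$, and in $[0, 1]$ whenever the chosen response has the higher average log-probability, which is why $\xi$ can be read as a target fraction of relative separation.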

However, the objective in Equation [12](https://arxiv.org/html/2605.10981#S4.E12) still has an undesirable property. Specifically, if the chosen reward is much larger than the rejected one, a smaller $\xi$ causes the chosen reward to decrease and the rejected reward to increase during optimization, thereby narrowing the originally large gap back to $\xi$. Such samples are concentrated at the tail of the distribution shown in Figure [2](https://arxiv.org/html/2605.10981#S3.F2). These tail samples do not need to be included in model training, as the model already performs very well on them. Therefore, we define the final $\xi$-DPO loss as:

$$\min_\theta\; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\operatorname{LeakyReLU}\left(\xi - \frac{\frac{1}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{1}{|y_l|}\log\pi_\theta(y_l \mid x)}{\left|\frac{1}{|y_w|}\log\pi_\theta(y_w \mid x) + \frac{1}{|y_l|}\log\pi_\theta(y_l \mid x)\right|}\right)^2\right] \qquad (13)$$

where $\operatorname{LeakyReLU}$ causes the gradients of samples whose reward gaps exceed $\xi$ to approach zero. It is important to note that the normalization factor $\left|\frac{1}{|y_w|}\log\pi_\theta(y_w \mid x) + \frac{1}{|y_l|}\log\pi_\theta(y_l \mid x)\right|$ does not participate in gradient updates; it is treated only as a constant coefficient used to normalize the reward into a ratio form. A visual illustration of the LeakyReLU behavior described above, together with additional details, is provided in Appendix [F](https://arxiv.org/html/2605.10981#A6).
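Equation (13) translates almost directly into code. Below is a minimal PyTorch sketch, assuming sequence-level log-probabilities and response lengths are available per pair; the function name and the LeakyReLU negative slope are our assumptions, not values specified by the paper at this point.

```python
import torch.nn.functional as F

def xi_dpo_loss(logp_w, logp_l, len_w, len_l, xi, negative_slope=0.01):
    """Sketch of the xi-DPO loss in Equation (13).

    logp_* are sequence-level log-probabilities log pi_theta(y|x) under the
    policy being trained; len_* are response lengths; xi is the ratio reward
    margin from Equation (14).
    """
    # Length-normalized rewards (average per-token log-probability).
    r_w = logp_w / len_w
    r_l = logp_l / len_l
    # The normalizer is a constant coefficient: detach it from the graph.
    denom = (r_w + r_l).abs().detach()
    ratio_gap = (r_w - r_l) / denom
    # Pairs whose ratio gap already exceeds xi land on the negative side of
    # LeakyReLU and contribute only a small residual gradient.
    return (F.leaky_relu(xi - ratio_gap, negative_slope=negative_slope) ** 2).mean()
```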

### 4.2 Setting of $\xi$

The key factor underlying this desirable property of $\xi$-DPO is $\xi$ itself. The choice of $\xi$ mainly depends on the extent to which the gap between chosen and rejected should be enlarged. Meanwhile, when combined with $\operatorname{LeakyReLU}$, it effectively filters out tail samples, thereby controlling the strength of the reward signal: filtering out more tail samples reduces the strength of the reward signal, while retaining more tail samples increases it. This behavior varies across datasets. In low-gap datasets, the distribution of gaps is typically more concentrated, with relatively few samples located in the head and tail regions shown in Figure [2](https://arxiv.org/html/2605.10981#S3.F2), resulting in a generally moderate reward gap; high-gap datasets exhibit the opposite pattern. Therefore, $\xi$ should be chosen such that it exceeds the reward gap of the vast majority of samples, while also controlling whether more strong reward signals (tail samples) should be introduced for datasets of different quality. We define $\xi$ as follows:

$$\xi = Q_t\left(\{m_i\}_{i=1}^{N}\right) \qquad (14)$$

where $Q_t(\cdot)$ denotes the $t$-th quantile of the distribution $\{m_i\}_{i=1}^{N}$, and $N$ is the number of samples. Each $m_i$ denotes the reward gap of the current sample, computed using the initial model $\pi(y|x)$. For $\xi$-DPO, $m_i$ is:

$$m_i = \frac{\frac{1}{|y_w|}\log\pi(y_w^{i} \mid x^{i}) - \frac{1}{|y_l|}\log\pi(y_l^{i} \mid x^{i})}{\left|\frac{1}{|y_w|}\log\pi(y_w^{i} \mid x^{i}) + \frac{1}{|y_l|}\log\pi(y_l^{i} \mid x^{i})\right|} \qquad (15)$$
Under the $\xi$-DPO formulation, sample filtering is no longer governed by the sigmoid slope induced by $\beta$, but is adaptively controlled by the ratio margin $\xi$. Samples with ratio reward gaps below $\xi$ remain active, while those already exceeding $\xi$ are down-weighted through LeakyReLU. Thus, $\xi$ acts as both a semantic margin and an automatic filtering threshold. Specifically, for high-gap datasets we use a larger $\xi$, because their reward gaps are generally larger and a small margin may cause optimization to stop too early. Moreover, retaining tail-region samples with strong positive reward signals can help counteract head-region samples where rejected responses dominate, encouraging the reward gap to shift from negative to positive. For low-gap datasets, we impose a milder reward strength, since overly strong reward signals may distort the model's original distribution and increase the risk of overfitting. We further demonstrate through the sensitivity analysis in Section [5.3](https://arxiv.org/html/2605.10981#S5.SS3) that $\xi$ is not a sensitive hyperparameter. For low-gap datasets constructed using a strong reward model, $t$ can be set within the range of 90%–95%. For high-gap datasets, we use a larger $\xi$ ($t \in [97\%, 99.9\%]$) to improve the quality of the optimization. We provide a more detailed analysis of $\xi$ in Appendix [C](https://arxiv.org/html/2605.10981#A3). A sketch of the quantile computation is shown below.
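Equations (14) and (15) amount to one pass of the initial policy over the preference data followed by a quantile. A minimal sketch, assuming per-pair log-probabilities under the initial model have been precomputed; the function name and signature are ours.

```python
import torch

def select_xi(logp_w, logp_l, len_w, len_l, t=0.95):
    """Sketch of Equations (14)-(15): xi as the t-th quantile of the ratio
    reward gaps m_i computed with the *initial* policy.

    logp_* are sequence-level log-probabilities under the initial model;
    t follows Section 4.2 (e.g. 0.90-0.95 for low-gap data,
    0.97-0.999 for high-gap data).
    """
    r_w = logp_w / len_w
    r_l = logp_l / len_l
    m = (r_w - r_l) / (r_w + r_l).abs()  # ratio reward gap per pair, Eq. (15)
    return torch.quantile(m, t).item()
```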

## 5 Experiments

### 5.1 Experimental Setup

**$\xi$ setting.** Among the four datasets described above, datasets 2 and 4, which are constructed using a stronger reward model (ArmoRM), are considered low-gap datasets. Using Equation [14](https://arxiv.org/html/2605.10981#S4.E14), we compute the values of $\xi$ for these two datasets as 0.35 and 0.28, respectively, corresponding to the 95th percentile of the reward gap distribution. In contrast, for datasets 1 and 3, $\xi$ is set to 0.9 and 0.45 respectively, corresponding to the 99.9th percentile. We provide more explanation on the choice of the 95th or 99.9th percentile in Appendix [C](https://arxiv.org/html/2605.10981#A3).

### 5.2 Benchmarks and Compared Methods

**Benchmarks.** AlpacaEval 2 (Li et al., [2023](https://arxiv.org/html/2605.10981#bib.bib25)) and MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.10981#bib.bib26)) are widely used benchmarks for evaluating the response quality and instruction-following ability of large language models. AlpacaEval 2 uses a judge model to compare the responses of a target model against those of a baseline model and reports metrics such as win rate (WR) and length-controlled win rate (LC), whereas MT-Bench evaluates a model's responses to multi-turn questions using a strong LLM judge and reports an overall score. Following the setups adopted in AlphaDPO (Wu et al., [2025](https://arxiv.org/html/2605.10981#bib.bib16)) and SimPO (Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8)), GPT-4 Turbo is used as both the baseline model and the judge model for AlpacaEval 2, whereas MT-Bench uses GPT-4 to evaluate the target model's outputs. We additionally evaluate on the verifiable Text-to-SQL task, with further details provided in Appendix [D](https://arxiv.org/html/2605.10981#A4).

**Compared Methods.** We compare our proposed $\xi$-DPO with several state-of-the-art preference optimization algorithms: DPO (Rafailov et al., [2023](https://arxiv.org/html/2605.10981#bib.bib5)), IPO (Azar et al., [2024](https://arxiv.org/html/2605.10981#bib.bib15)), CPO (Xu et al., [2024](https://arxiv.org/html/2605.10981#bib.bib27)), KTO (Ethayarajh et al., [2024](https://arxiv.org/html/2605.10981#bib.bib28)), ORPO (Hong et al., [2024](https://arxiv.org/html/2605.10981#bib.bib29)), R-DPO (Park et al., [2024](https://arxiv.org/html/2605.10981#bib.bib30)), SimPO (Meng et al., [2024](https://arxiv.org/html/2605.10981#bib.bib8)), and AlphaDPO (Wu et al., [2025](https://arxiv.org/html/2605.10981#bib.bib16)).

### 5.3 Main Results and Sensitivity Analysis

**Comparative Evaluation Results.** As shown in Table [2](https://arxiv.org/html/2605.10981#S5.T2), $\xi$-DPO achieves the best or highly competitive performance across all the datasets, outperforming existing alternatives on AlpacaEval 2 and remaining competitive on MT-Bench. In particular, $\xi$-DPO achieves a 12.0% relative improvement in win rate over AlphaDPO on Mistral-Instruct. It also consistently outperforms the best SimPO performance achieved after multiple rounds of careful hyperparameter tuning.

The only exception is Llama3-Instruct v0.2, where $\xi$-DPO slightly underperforms AlphaDPO. This is mainly because the performance gain in this low-gap data setting is increasingly saturated, where AlphaDPO's dynamic margin adjustment offers more fine-grained control. However, the performance gap is small, and $\xi$-DPO remains highly competitive. It is also worth pointing out that the initial hyperparameters for AlphaDPO are the same as those for SimPO, so its initial hyperparameter settings are also very sensitive. Furthermore, AlphaDPO introduces an additional hyperparameter $\alpha$ and a reference model $\pi_{ref}$, thereby increasing the complexity of the optimization process. These results clearly demonstrate that, without relying on extensive trial-and-error hyperparameter tuning, $\xi$-DPO achieves strong performance in a more stable and efficient manner than the existing alternatives.

Table 2: Comparison of different preference optimization methods on AlpacaEval 2 and MT-Bench across four SFT models. Each cell reports AlpacaEval 2 LC (%) / WR (%) / MT-Bench GPT-4 score.

| Method | Llama3-Instruct (8B) | Mistral-Instruct (7B) | Llama3-Instruct v0.2 (8B) | Gemma2-Instruct (9B) |
|---|---|---|---|---|
| SFT | 24.0 / 23.6 / 8.1 | 19.0 / 15.4 / 7.5 | 24.0 / 23.6 / 8.1 | 48.7 / 36.5 / 8.5 |
| DPO | 40.2 / 38.1 / 8.0 | 20.3 / 17.9 / 7.6 | 51.9 / 50.8 / 8.2 | 70.4 / 66.9 / 8.5 |
| IPO | 35.9 / 34.4 / 8.3 | 22.3 / 18.6 / 7.8 | 40.6 / 39.6 / 8.2 | 62.6 / 58.4 / - |
| CPO | 29.6 / 34.4 / 8.0 | 26.2 / 31.7 / 7.5 | 36.5 / 40.8 / 8.2 | 56.4 / 53.4 / - |
| KTO | 38.3 / 34.1 / 8.2 | 19.4 / 20.3 / 7.7 | 41.4 / 36.4 / 8.2 | 61.7 / 55.5 / - |
| ORPO | 31.6 / 29.8 / 8.0 | 24.0 / 23.0 / 7.7 | 36.5 / 33.1 / 8.3 | 56.2 / 46.7 / - |
| R-DPO | 40.3 / 37.3 / 8.0 | 21.4 / 22.2 / 7.5 | 51.6 / 50.7 / 8.2 | 68.3 / 66.9 / - |
| SimPO | 43.8 / 38.0 / 8.0 | 30.2 / 32.1 / 7.6 | 55.6 / 49.6 / 8.0 | 72.4 / 65.0 / - |
| AlphaDPO | 46.6 / 38.1 / - | 32.3 / 32.6 / - | 58.7 / 51.1 / - | 73.4 / 66.1 / - |
| $\xi$-DPO | 47.1 / 39.1 / 8.2 | 33.2 / 36.5 / 7.7 | 57.5 / 50.5 / 8.0 | 75.4 / 67.5 / 9.0 |

**Sensitivity Analysis.** Based on Equation [14](https://arxiv.org/html/2605.10981#S4.E14) and our discussion of the optimal quantile, we evaluate the sensitivity of $\xi$-DPO on both low-gap and high-gap datasets. We conduct this evaluation within a reasonable range around the optimal quantiles for the low-gap and high-gap datasets. This range is related to the formulation of $\xi$-DPO, where $t$ directly controls the number of filtered samples; retaining more than 90% of the samples is a reasonable choice, since excessive filtering would inevitably lead to performance degradation. For the low-gap dataset, we shift $t$ from 95% to 90%, while for the high-gap dataset, we shift $t$ from 99.9% to 97%. The results in Table [3](https://arxiv.org/html/2605.10981#S5.SS3) show that this adjustment does not cause substantial performance loss, indicating that $\xi$-DPO exhibits relatively low sensitivity to this parameter. Although changing $t$ from 95% to 90% may appear to be a minor adjustment, the resulting values of $\xi$ differ substantially. Please refer to Table [4](https://arxiv.org/html/2605.10981#A3.T4) in the Appendix for details, where we also provide a more detailed analysis.

Table 3: Sensitivity analysis of $\xi$-DPO. We report the sensitivity of $\xi$, computed from different quantiles, by varying the quantile threshold on both a low-gap dataset constructed with a stronger reward model and a high-gap dataset constructed in the opposite setting. Each cell reports AlpacaEval 2 LC (%) / WR (%).

(a) Llama3-Instruct-v0.2 (low-gap dataset)

| Method | t=0.90 | t=0.92 | t=0.95 |
|---|---|---|---|
| $\xi$-DPO | 56.4 / 49.1 | 57.2 / 49.9 | 57.5 / 50.5 |

(b) Mistral-Instruct (high-gap dataset)

| Method | t=0.97 | t=0.98 | t=0.999 |
|---|---|---|---|
| $\xi$-DPO | 32.4 / 35.9 | 32.9 / 36.1 | 33.2 / 36.5 |

## 6 Conclusion

This paper investigates the semantic role of the hyperparameters in the simple reference-free preference optimization method SimPO and analyzes why its hyperparameters $\beta$ and $\gamma$ are difficult to set. Based on the conclusions drawn from our analysis, we propose direct preference optimization via ratio reward margin, $\xi$-DPO, where the ratio reward margin is denoted by $\xi$. $\xi$-DPO has a simple objective formulation and involves only a single hyperparameter, which can be determined by computing a quantile of the prior reward-gap distribution, without requiring repeated experiments or heuristic tuning. It effectively mitigates challenges such as the difficulty of hyperparameter selection in SimPO and reduces hyperparameter sensitivity. Extensive experiments further demonstrate the superior performance of $\xi$-DPO.

## References

- [1] AI@Meta (2024). Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
- [2] T. Anthony, Z. Tian, and D. Barber (2017). Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pp. 5360–5370.
- [3] M. G. Azar, Z. D. Guo, B. Piot, R. Munos, M. Rowland, M. Valko, and D. Calandriello (2024). A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447–4455.
- [4] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862
- [5] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. (2023). Pythia: a suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430.
- [6] R. A. Bradley and M. E. Terry (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), pp. 324–345.
- [7] L. Chen, C. Zhu, J. Chen, D. Soselia, T. Zhou, T. Goldstein, H. Huang, M. Shoeybi, and B. Catanzaro (2024). ODIN: disentangled reward mitigates hacking in RLHF. In Forty-first International Conference on Machine Learning (ICML 2024), Proceedings of Machine Learning Research, pp. 7935–7952.
- [8] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30.
- [9] N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou (2023). Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pp. 3029–3051.
- [10] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela (2024). Model alignment as prospect theoretic optimization. In Forty-first International Conference on Machine Learning (ICML 2024), Proceedings of Machine Learning Research, Vol. 235, pp. 12634–12651.
- [11] J. Fang, Z. Bi, R. Wang, H. Jiang, Y. Gao, K. Wang, A. Zhang, J. Shi, X. Wang, and T. Chua (2024). Towards neuron attributions in multi-modal large language models. Advances in Neural Information Processing Systems 37, pp. 122867–122890.
- [12] L. Gao, J. Schulman, and J. Hilton (2023). Scaling laws for reward model overoptimization. In International Conference on Machine Learning (ICML 2023), Proceedings of Machine Learning Research, pp. 10835–10866.
- [13] J. Hong, N. Lee, and J. Thorne (2024). ORPO: monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11170–11189.
- [14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023). Mistral 7B. arXiv:2310.06825. https://arxiv.org/abs/2310.06825
- [15] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, S. ES, S. Suri, D. Glushkov, A. Dantuluri, A. Maguire, C. Schuhmann, H. Nguyen, and A. Mattick (2023). OpenAssistant conversations: democratizing large language model alignment. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
- [16] H. Li, S. Wu, X. Zhang, X. Huang, J. Zhang, F. Jiang, S. Wang, T. Zhang, J. Chen, R. Shi, H. Chen, and C. Li (2025). OmniSQL: synthesizing high-quality text-to-SQL data at scale. Proc. VLDB Endow. 18(11), pp. 4695–4709.
- [17] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto (2023). AlpacaEval: an automatic evaluator of instruction-following models. GitHub. https://github.com/tatsu-lab/alpaca_eval
- [18] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, Y. Tang, and D. Zhang (2025). WizardMath: empowering mathematical reasoning for large language models via reinforced evol-instruct. In The Thirteenth International Conference on Learning Representations (ICLR 2025).
- [19] Y. Meng, M. Xia, and D. Chen (2024). SimPO: simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37, pp. 124198–124235.
- [20] OpenAI (2025). OpenAI GPT-5 system card. Technical report, OpenAI. https://cdn.openai.com/gpt-5-system-card.pdf
- [21] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- [22] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- [23] Y. Pan, Z. Cai, G. Chen, H. Zhong, and C. Wang (2025). What matters in data for DPO? arXiv preprint arXiv:2508.18312.
- [24] R. Park, R. Rafailov, S. Ermon, and C. Finn (2024). Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 4998–5017.
- [25] R. Rafailov, Y. Chittepu, R. Park, H. Sikchi, J. Hejna, W. B. Knox, C. Finn, and S. Niekum (2024). Scaling laws for reward model overoptimization in direct alignment algorithms. In Advances in Neural Information Processing Systems 38 (NeurIPS 2024).
- [26] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741.
- [27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [28] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto (2023). Stanford Alpaca: an instruction-following LLaMA model. GitHub. https://github.com/tatsu-lab/stanford_alpaca
- \[29\]G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé, J\. Ferret, P\. Liu, P\. Tafti, A\. Friesen, M\. Casbon, S\. Ramos, R\. Kumar, C\. L\. Lan, S\. Jerome, A\. Tsitsulin, N\. Vieillard, P\. Stanczyk, S\. Girgin, N\. Momchev, M\. Hoffman, S\. Thakoor, J\. Grill, B\. Neyshabur, O\. Bachem, A\. Walton, A\. Severyn, A\. Parrish, A\. Ahmad, A\. Hutchison, A\. Abdagic, A\. Carl, A\. Shen, A\. Brock, A\. Coenen, A\. Laforge, A\. Paterson, B\. Bastian, B\. Piot, B\. Wu, B\. Royal, C\. Chen, C\. Kumar, C\. Perry, C\. Welty, C\. A\. Choquette\-Choo, D\. Sinopalnikov, D\. Weinberger, D\. Vijaykumar, D\. Rogozińska, D\. Herbison, E\. Bandy, E\. Wang, E\. Noland, E\. Moreira, E\. Senter, E\. Eltyshev, F\. Visin, G\. Rasskin, G\. Wei, G\. Cameron, G\. Martins, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Batra, H\. Dhand, I\. Nardini, J\. Mein, J\. Zhou, J\. Svensson, J\. Stanway, J\. Chan, J\. P\. Zhou, J\. Carrasqueira, J\. Iljazi, J\. Becker, J\. Fernandez, J\. van Amersfoort, J\. Gordon, J\. Lipschultz, J\. Newlan, J\. Ji, K\. Mohamed, K\. Badola, K\. Black, K\. Millican, K\. McDonell, K\. Nguyen, K\. Sodhia, K\. Greene, L\. L\. Sjoesund, L\. Usui, L\. Sifre, L\. Heuermann, L\. Lago, L\. McNealus, L\. B\. Soares, L\. Kilpatrick, L\. Dixon, L\. Martins, M\. Reid, M\. Singh, M\. Iverson, M\. Görner, M\. Velloso, M\. Wirth, M\. Davidow, M\. Miller, M\. Rahtz, M\. Watson, M\. Risdal, M\. Kazemi, M\. Moynihan, M\. Zhang, M\. Kahng, M\. Park, M\. Rahman, M\. Khatwani, N\. Dao, N\. Bardoliwalla, N\. Devanathan, N\. Dumai, N\. Chauhan, O\. Wahltinez, P\. Botarda, P\. Barnes, P\. Barham, P\. Michel, P\. Jin, P\. Georgiev, P\. Culliton, P\. Kuppala, R\. Comanescu, R\. Merhej, R\. Jana, R\. A\. Rokni, R\. Agarwal, R\. Mullins, S\. Saadat, S\. M\. Carthy, S\. Cogan, S\. Perrin, S\. M\. R\. Arnold, S\. Krause, S\. Dai, S\. Garg, S\. Sheth, S\. Ronstrom, S\. Chan, T\. Jordan, T\. Yu, T\. Eccles, T\. Hennigan, T\. Kocisky, T\. Doshi, V\. Jain, V\. Yadav, V\. Meshram, V\. Dharmadhikari, W\. Barkley, W\. Wei, W\. Ye, W\. Han, W\. Kwon, X\. Xu, Z\. Shen, Z\. Gong, Z\. Wei, V\. Cotruta, P\. Kirk, A\. Rao, M\. Giang, L\. Peran, T\. Warkentin, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, D\. Sculley, J\. Banks, A\. Dragan, S\. Petrov, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, S\. Borgeaud, N\. Fiedel, A\. Joulin, K\. Kenealy, R\. Dadashi, and A\. Andreev\(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118,[Link](https://arxiv.org/abs/2408.00118)Cited by:[§5\.1](https://arxiv.org/html/2605.10981#S5.SS1.p1.1)\.
- \[30\]Q\. Team\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§1](https://arxiv.org/html/2605.10981#S1.p1.1)\.
- \[31\]B\. Wang, R\. Shin, X\. Liu, O\. Polozov, and M\. Richardson\(2020\)RAT\-SQL: relation\-aware schema encoding and linking for text\-to\-sql parsers\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5\-10, 2020,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. R\. Tetreault \(Eds\.\),pp\. 7567–7578\.External Links:[Link](https://doi.org/10.18653/v1/2020.acl-main.677),[Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.677)Cited by:[Appendix D](https://arxiv.org/html/2605.10981#A4.p1.1)\.
- \[32\]H\. Wang, W\. Xiong, T\. Xie, H\. Zhao, and T\. Zhang\(2024\)Interpretable preferences via multi\-objective reward modeling and mixture\-of\-experts\.InFindings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Findings of ACL,pp\. 10582–10592\.Cited by:[§5\.1](https://arxiv.org/html/2605.10981#S5.SS1.p1.1)\.
- \[33\]J\. Wu, X\. Wang, Z\. Yang, J\. Wu, J\. Gao, B\. Ding, X\. Wang, and X\. He\(2025\)AlphaDPO: adaptive reward margin for direct preference optimization\.External Links:2410\.10148,[Link](https://arxiv.org/abs/2410.10148)Cited by:[Appendix G](https://arxiv.org/html/2605.10981#A7.p1.4),[Table 8](https://arxiv.org/html/2605.10981#A8.T8.23.23.23.5),[Appendix H](https://arxiv.org/html/2605.10981#A8.p1.6),[§1](https://arxiv.org/html/2605.10981#S1.p5.7),[§1](https://arxiv.org/html/2605.10981#S1.p9.6),[§2](https://arxiv.org/html/2605.10981#S2.p1.4),[§3\.1](https://arxiv.org/html/2605.10981#S3.SS1.p3.5),[§3\.2](https://arxiv.org/html/2605.10981#S3.SS2.p4.1),[§5\.1](https://arxiv.org/html/2605.10981#S5.SS1.p1.1),[§5\.2](https://arxiv.org/html/2605.10981#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2605.10981#S5.SS2.p2.1)\.
- \[34\]J\. Wu, Y\. Xie, Z\. Yang, J\. Wu, J\. Gao, B\. Ding, X\. Wang, and X\. He\(2024\)β\\beta\-DPO: direct preference optimization with dynamicβ\\beta\.Advances in Neural Information Processing Systems37,pp\. 129944–129966\.Cited by:[Table 1](https://arxiv.org/html/2605.10981#S1.T1),[Table 1](https://arxiv.org/html/2605.10981#S1.T1.10.5.3),[§1](https://arxiv.org/html/2605.10981#S1.p3.12),[§1](https://arxiv.org/html/2605.10981#S1.p4.14),[§1](https://arxiv.org/html/2605.10981#S1.p5.7),[§2](https://arxiv.org/html/2605.10981#S2.p1.4),[§3\.1](https://arxiv.org/html/2605.10981#S3.SS1.p4.6)\.
- \[35\]T\. Xiao, Y\. Yuan, Z\. Chen, M\. Li, S\. Liang, Z\. Ren, and V\. G\. Honavar\(2025\)SimPER: A minimalist approach to preference alignment without hyperparameters\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,Cited by:[Table 8](https://arxiv.org/html/2605.10981#A8.T8.24.24.24.2),[§2](https://arxiv.org/html/2605.10981#S2.p1.4)\.
- \[36\]H\. Xu, A\. Sharaf, Y\. Chen, W\. Tan, L\. Shen, B\. V\. Durme, K\. Murray, and Y\. J\. Kim\(2024\)Contrastive preference optimization: pushing the boundaries of LLM performance in machine translation\.InForty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024,R\. Salakhutdinov, Z\. Kolter, K\. A\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 55204–55224\.Cited by:[Table 8](https://arxiv.org/html/2605.10981#A8.T8.6.6.6.3),[§5\.2](https://arxiv.org/html/2605.10981#S5.SS2.p2.1)\.
- \[37\]T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman, Z\. Zhang, and D\. R\. Radev\(2018\)Spider: A large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),pp\. 3911–3921\.External Links:[Link](https://doi.org/10.18653/v1/d18-1425),[Document](https://dx.doi.org/10.18653/V1/D18-1425)Cited by:[Appendix D](https://arxiv.org/html/2605.10981#A4.p1.1)\.
- \[38\]T\. Yu, R\. Zhang, M\. Yasunaga, Y\. C\. Tan, X\. V\. Lin, S\. Li, H\. Er, I\. Li, B\. Pang, T\. Chen, E\. Ji, S\. Dixit, D\. Proctor, S\. Shim, J\. Kraft, V\. Zhang, C\. Xiong, R\. Socher, and D\. R\. Radev\(2019\)SParC: cross\-domain semantic parsing in context\.InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers,A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\),pp\. 4511–4523\.External Links:[Link](https://doi.org/10.18653/v1/p19-1443),[Document](https://dx.doi.org/10.18653/V1/P19-1443)Cited by:[Appendix D](https://arxiv.org/html/2605.10981#A4.p1.1)\.
- \[39\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§5\.2](https://arxiv.org/html/2605.10981#S5.SS2.p1.1)\.
- \[40\]V\. Zhong, C\. Xiong, and R\. Socher\(2017\)Seq2SQL: generating structured queries from natural language using reinforcement learning\.CoRRabs/1709\.00103\.External Links:[Link](http://arxiv.org/abs/1709.00103),1709\.00103Cited by:[Appendix D](https://arxiv.org/html/2605.10981#A4.p1.1)\.
- \[41\]C\. Zhou, P\. Liu, P\. Xu, S\. Iyer, J\. Sun, Y\. Mao, X\. Ma, A\. Efrat, P\. Yu, L\. Yu, S\. Zhang, G\. Ghosh, M\. Lewis, L\. Zettlemoyer, and O\. Levy\(2023\)LIMA: less is more for alignment\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Cited by:[§2](https://arxiv.org/html/2605.10981#S2.p1.4)\.
- \[42\]D\. M\. Ziegler, N\. Stiennon, J\. Wu, T\. B\. Brown, A\. Radford, D\. Amodei, P\. Christiano, and G\. Irving\(2019\)Fine\-tuning language models from human preferences\.arXiv preprint arXiv:1909\.08593\.Cited by:[§2](https://arxiv.org/html/2605.10981#S2.p1.4)\.

## Appendix A Limitations

The $\log\pi_{\theta}$ of the target policy decreases sharply during the later stages of the $\xi$-DPO optimization process. The reason for this behavior remains unknown, although it does not affect the final optimization results. If the underlying cause can be identified and addressed, the performance of $\xi$-DPO could be further improved. As a future direction, we aim to compute the reward-gap distribution during training, thereby enabling dynamic adjustment of $\xi$.

## Appendix B Preliminaries

Preference modeling. Given a dataset $D=\{(x, y_w, y_l)\}$ consisting of prompts $x$ and paired responses $(y_w, y_l)$, where $y_w$ is the chosen (winning) response and $y_l$ is the rejected (losing) response, Christiano et al. (2017) propose using the Bradley-Terry (BT) model (Bradley and Terry, 1952) for preference modeling to optimize the reward model $r(y, x)$:

$$p(y_w \succ y_l \mid x) = \frac{\exp\big(r(y_w, x)\big)}{\exp\big(r(y_w, x)\big) + \exp\big(r(y_l, x)\big)} = \sigma\big(r(y_w, x) - r(y_l, x)\big) \quad (16)$$

where $\sigma$ is the sigmoid function and $y_w \succ y_l$ indicates that the chosen response $y_w$ is strictly preferred over the rejected response $y_l$. The purpose of preference modeling is to learn a reward model that assigns scores to the outputs of the target policy $\pi_\theta$ (preference prediction).

Reinforcement Learning from Human Feedback (RLHF). Based on the learned BT reward model $r(x, y)$, the responses generated by the target policy $\pi_\theta$ are scored, while the policy is simultaneously constrained to remain close to the reference policy $\pi_{\mathrm{ref}}$ (Ouyang et al., 2022b; Bai et al., 2022). Specifically, the objective is:

$$\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y\mid x)}\left[r(x,y)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\right] \quad (17)$$

Typically, the reference policy is initialized to be identical to the target policy, both being a supervised fine-tuned (SFT) model.

Direct Preference Optimization (DPO). Following Equation (17), Rafailov et al. (2023) derive an implicit form of the reward model. By connecting it to the BT model, they unify reward modeling and reinforcement learning into a single stage:

$$p(y_w \succ y_l \mid x) = \sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_\theta(y_l\mid x)} - \beta\log\frac{\pi_{\mathrm{ref}}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right) \quad (18)$$

where $\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ is the implicit form of $r(x,y)$. Finally, the DPO loss is defined as follows:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right] \quad (19)$$
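For concreteness, the loss in Equation (19) can be sketched in a few lines of PyTorch-style code; this is only an illustration under the assumption that per-response log-probabilities $\log\pi(y\mid x)$ have already been summed over tokens (the function and variable names are ours, not from a released implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative sketch of Equation (19); inputs are (batch,) tensors of
    summed log-probabilities under the target and reference policies."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward gap, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```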
Simple Preference Optimization (SimPO). When DPO is applied, an additional reference model with the same optimization objective must be loaded into memory, which increases resource requirements. Motivated by this observation, SimPO (Meng et al., 2024) explores a reference-free reward mechanism. However, simply removing the reference model from DPO leads to an imbalance between the chosen and rejected responses: the responses often differ in length, and longer responses are naturally associated with larger numerical values. To address this issue, SimPO length-normalizes the log-likelihood and defines the reward as $\frac{\beta}{|y|}\log\pi_\theta(y\mid x)$. In addition, the reward gap is constrained by a reward margin $\gamma$. Ultimately, the SimPO loss is defined as:

$$\mathcal{L}_{\mathrm{SimPO}} = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right] \quad (20)$$

where $|y|$ is the length of the response generated by the policy $\pi_\theta(y\mid x)$.
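The reference-free SimPO loss in Equation (20) differs only in that the reward is the length-normalized log-likelihood and a fixed margin $\gamma$ is subtracted; a matching sketch under the same assumptions (variable names ours):

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """Illustrative sketch of Equation (20) with length-normalized rewards."""
    r_chosen = beta * chosen_logps / chosen_len        # (beta/|y_w|) log pi(y_w|x)
    r_rejected = beta * rejected_logps / rejected_len  # (beta/|y_l|) log pi(y_l|x)
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```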

## Appendix C Analysis of $\xi$

We begin our analysis from the perspective of dataset construction. In general, a preference dataset is built by first sampling a set of responses from the current policy $\pi$ and then ranking them using a professional reward model. The responses with the highest and lowest scores are selected as the chosen and rejected responses, respectively. This implies that the chosen and rejected responses are drawn from the same underlying distribution $\pi$. Consequently, the distribution of reward gaps does not concentrate its highest density on either the positive or negative half-axis; instead, it is symmetric, as shown in Figure 2.

Suppose the policy $\pi$ outputs two responses, $y_1$ and $y_2$, with $\pi(y_2) > \pi(y_1)$. This situation is quite normal as long as it is not flagged by a specialized reward model. However, if the reward model identifies $y_2$ as a response that deviates from human preferences, namely a rejected response, this indicates that the policy distribution $\pi$ should be adjusted to assign a lower probability to $y_2$. Such samples therefore deserve the most attention, corresponding to the head-region samples in Figure 2. In this case, a relatively large $\xi$ is required to force the chosen response, which would otherwise be assigned a lower probability, to attain a higher probability than the rejected one.

Table 4: Quantiles of the reward gap $\Delta r$ distribution for different datasets.

| Dataset | 1% | 5% | 10% | 25% | 35% | 45% | 50% | 75% | 90% | 95% | 99% | 100% | $\xi$ in paper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B | -0.40 | -0.22 | -0.15 | -0.07 | -0.04 | -0.01 | 0.00 | 0.08 | 0.16 | 0.23 | 0.41 | 0.91 | 0.45 ($t=99.9\%$) |
| Llama3 | -0.54 | -0.28 | -0.19 | -0.08 | -0.05 | -0.01 | 0.00 | 0.10 | 0.22 | 0.33 | 0.60 | 0.92 | 0.90 ($t=99.9\%$) |
| Llama3-v2 | -0.51 | -0.26 | -0.18 | -0.09 | -0.05 | -0.01 | 0.00 | 0.11 | 0.25 | 0.35 | 0.64 | 0.92 | 0.35 ($t=95\%$) |
| Gemma2 | -0.59 | -0.28 | -0.15 | -0.06 | -0.03 | -0.01 | 0.00 | 0.07 | 0.17 | 0.28 | 0.63 | 0.96 | 0.28 ($t=95\%$) |

For low-gap datasets, where preference signals are more reliable and ranking relationships are more consistent, the model does not require an excessively large $\xi$ to place additional emphasis on extremely difficult samples; a relatively small $\xi$ is already sufficient to cover most informative training instances. Conversely, for high-gap datasets, a larger $\xi$ is needed to enforce a clearer separation between the chosen and rejected samples. In addition, a larger $\xi$ retains more tail samples, where the chosen responses receive much higher rewards than the rejected ones under the policy $\pi$. These samples therefore provide stronger reward signals, which are exactly what high-gap datasets require, as they can compensate for the insufficient reward signals in the head samples.

Using Equation (14), we computed the quantiles corresponding to different values of $t$, as reported in Table 4. It is evident from Table 4 that the quantiles for $t=99\%$ and $t=100\%$ differ substantially, indicating that a considerable number of informative samples lie between these two percentiles and can still contribute to optimization. Therefore, adopting the extreme case of $t=100\%$ to determine $\xi$ would inevitably interfere with the optimization of normal samples and lead to severe overfitting. This phenomenon is also reflected in the results in Table 5, where the model exhibits repetitive token generation. Hence, for high-gap datasets, we choose the quantile corresponding to $t=99.9\%$ as $\xi$.

For low-gap datasets, by contrast, we shift the ratio margin forward to filter out more tail samples (exploiting LeakyReLU's filtering capability), since the model already performs well on these samples and does not need their reward signals to compensate for unreliable reward information, as it does on high-gap datasets. Accordingly, $t$ is set to $95\%$ for model training on low-gap datasets. If $\xi$ is set too aggressively, it may still cause severe collapse of the model distribution. Table 5 supports this analysis: setting an overly large $t$ to compute $\xi$ on low-gap datasets may cause the model to overfit to chosen responses, thereby reducing its generalization ability and leading to repetitive word generation. Conversely, on high-gap datasets, if stronger reward signals from tail-region samples are not introduced, model performance also degrades.
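To make the selection rule concrete, here is a minimal sketch of how $\xi$ could be read off the initial reward-gap distribution, assuming the per-sample ratio reward margins $m_i$ have already been computed under the starting policy (the helper name and usage are hypothetical):

```python
import numpy as np

def select_xi(ratio_margins, t=0.999):
    """Pick xi as the t-th quantile of the initial ratio-reward-margin
    distribution: t = 0.999 for high-gap datasets, t = 0.95 for low-gap ones."""
    return float(np.quantile(ratio_margins, t))

# Hypothetical usage, with `margins` measured once before training:
# xi = select_xi(margins, t=0.95)
```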

Table 5: Effect of overly strong or weak ratio margin constraints. Setting an overly large $t$ to compute $\xi$ on low-gap datasets may cause the model to overfit to chosen responses, reducing generalization and leading to repetitive word generation. Conversely, on high-gap datasets, insufficient reward signals from tail-region samples also degrade performance.

(a) Llama3-Instruct-v0.2 (low-gap dataset)

| Method | $t=0.95$ LC | $t=0.95$ WR | $t=0.999$ LC | $t=0.999$ WR |
|---|---|---|---|---|
| $\xi$-DPO | 57.5 | 50.5 | 19.2 | 20.1 |

(b) Mistral-Instruct (high-gap dataset)

| Method | $t=0.95$ LC | $t=0.95$ WR | $t=0.999$ LC | $t=0.999$ WR |
|---|---|---|---|---|
| $\xi$-DPO | 29.3 | 30.3 | 33.2 | 36.5 |

As shown in Table 4, although the values of $t$ for Mistral and LLaMA are both set to $99.9\%$, their corresponding distribution quantiles differ substantially. This suggests that $\xi$ should not be fixed across different settings. Such an observation also makes $\xi$-DPO more interpretable: for different datasets, the boundary requiring the chosen response to outperform the rejected one should be determined according to the distribution of reward gaps.

In $\xi$-DPO, the LeakyReLU operation filters out samples whose reward gaps already exceed $\xi$. Therefore, $\xi$ can also be interpreted as a parameter that controls how many effective samples are retained. Figure 3(b) provides an intuitive illustration of this process. When $99.9\%$ of the samples in Mistral and LLaMA have reward gaps smaller than $\xi$, this means that $99.9\%$ of the samples are activated, while the corresponding quantiles (i.e., $\xi$) can still differ substantially across the two models. This explains why it is inappropriate to set an excessively large $\xi$ for some datasets: their reward-gap distributions are inherently different and therefore cannot be constrained by the same hyperparameter. Otherwise, the optimization would become overly aggressive.

![(a) Density of $\Delta r$](https://arxiv.org/html/2605.10981v1/figures/Density.png)
![(b) CDF of $\Delta r$](https://arxiv.org/html/2605.10981v1/figures/CFD.png)

Figure 3: Density and cumulative distribution functions (CDF) of reward gaps ($\Delta r$) across different datasets. (a) shows the distribution of reward gaps across datasets, indicating that the density of reward gaps varies across datasets; the more concentrated the distribution, the smaller the ratio margin. (b) illustrates the coverage range of the datasets under different ratio margin $\xi$ settings. Notably, when the coverage for both Mistral and Llama3 is set to 99.9%, the corresponding $\xi$ values differ, indicating that the Llama3 dataset has a more dispersed distribution.
## Appendix D $\xi$-DPO on Verifiable Task

Introduction of datasets and models. To further validate the effectiveness of $\xi$-DPO, we conducted experiments on the objective Text-to-SQL task (Zhong et al., 2017; Yu et al., 2019; Wang et al., 2020). For this purpose, we constructed a preference dataset based on the Spider dataset (Yu et al., 2018). In the original Spider dataset, each example consists of a natural language question paired with its corresponding gold SQL query, while the database schema must be preprocessed and incorporated into the prompt to enable the large language model to generate the correct SQL statement. OmniSQL (Li et al., 2025) has already performed this preprocessing for Spider, organizing the data into prompts that can be directly fed into large language models. In addition, by combining multiple LLMs with the gold SQL queries, OmniSQL produces new Text-to-SQL outputs that include chain-of-thought reasoning. OmniSQL also provides several large language models pretrained on the large-scale Text-to-SQL data it constructed, including OmniSQL-7B and OmniSQL-32B.

The evaluation metric for the Text-to-SQL task primarily involves extracting the SQL query from the large model's inference output, executing it in the database alongside the gold SQL, and checking whether the execution results match. This yields the execution accuracy.
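A rough sketch of such an execution-accuracy check, using sqlite3 purely for illustration (the SQL-extraction step, database paths, and the result normalization of the actual evaluation harness are not shown):

```python
import sqlite3

def execution_match(pred_sql, gold_sql, db_path):
    """Return True if predicted and gold SQL produce the same result set."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = set(conn.execute(pred_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
        return pred_rows == gold_rows
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    finally:
        conn.close()

def execution_accuracy(examples):
    """`examples` is a list of (pred_sql, gold_sql, db_path) triples."""
    if not examples:
        return 0.0
    return sum(execution_match(p, g, db) for p, g, db in examples) / len(examples)
```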

Preference data for Text-to-SQL. As there is no readily available preference dataset for Text-to-SQL, we construct one based on OmniSQL-Spider (Li et al., 2025). OmniSQL-Spider provides both training and test splits, with inputs and outputs preprocessed in advance: the inputs incorporate database schema information and can be directly used as prompts for large language models, while the outputs contain chain-of-thought reasoning. We treat the outputs containing chain-of-thought reasoning and the gold SQL query as the chosen responses, and therefore only need to construct the rejected responses. To further improve the quality of the preference dataset, we use OmniSQL-32B to perform inference on each sample and take its generated outputs as the rejected responses. This design ensures that the resulting preference pairs have a low reward gap.

Table 6: $\xi$-DPO vs. SimPO on Spider (test split, execution accuracy, %).

| Method | Easy | Medium | Hard | Extra | All |
|---|---|---|---|---|---|
| SFT | 64.9 | 44.0 | 52.5 | 47.6 | 51.0 |
| SimPO | 88.1 | 60.3 | 69.3 | 62.7 | 68.7 |
| $\xi$-DPO | 87.9 | 60.8 | 70.0 | 62.5 | 69.0 |

Analysis of results. Since OmniSQL-7B has already been supervised fine-tuned on large-scale Text-to-SQL datasets, we adopt it as the SFT model. We then further fine-tune it on our constructed preference dataset using SimPO and our proposed $\xi$-DPO, respectively. The results are reported in Table 6. All results are obtained by performing a single inference pass with either the SFT model or the preference-tuned models on the OmniSQL-Spider test set, without applying additional strategies such as voting to further boost accuracy. The main purpose of this experiment is to evaluate the effectiveness of $\xi$-DPO on tasks beyond the original setting. The results show that the model fine-tuned with $\xi$-DPO improves the overall execution accuracy from 51.0% to 69.0%, outperforming the SimPO-tuned counterpart, which achieves 68.7%. The hyperparameter settings for fine-tuning follow those described in our paper; in particular, we compute $\xi=0.26$ using Equation (14) with $t=95\%$. For SimPO, we adopt the hyperparameter settings reported in the original paper (Meng et al., 2024).

## Appendix E Proof of the Effect of $\beta$ on the Gradient of the Sigmoid Function

The objective of preference optimization is

$$\mathcal{L} = \log\sigma(z),$$

where, for SimPO,

$$z = \frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma.$$

Equivalently, by defining

$$\Delta = \frac{1}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{1}{|y_l|}\log\pi_\theta(y_l\mid x),$$

we can rewrite

$$z = \beta\Delta - \gamma.$$

The gradient of the log-sigmoid term with respect to its input is

$$\frac{\partial\mathcal{L}}{\partial z} = \frac{\partial\log\sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial\sigma(z)}{\partial z}.$$

Since

$$\frac{\partial\sigma(z)}{\partial z} = \sigma(z)\bigl(1-\sigma(z)\bigr),$$

we have

$$\frac{\partial\mathcal{L}}{\partial z} = \frac{1}{\sigma(z)}\,\sigma(z)\bigl(1-\sigma(z)\bigr) = 1-\sigma(z).$$

Substituting $z = \beta\Delta - \gamma$ yields

$$\frac{\partial\mathcal{L}}{\partial z} = 1-\sigma(\beta\Delta-\gamma).$$

Therefore, $\beta$ directly affects the gradient magnitude by scaling the input to the sigmoid function. A larger $\beta$ increases the magnitude of $\beta\Delta-\gamma$, making the sigmoid more likely to enter a saturated regime, which in turn changes the gradient magnitude and influences the optimization of the overall loss.
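The saturation effect can be checked numerically: for the same normalized gap $\Delta$, a larger $\beta$ pushes $\sigma(\beta\Delta-\gamma)$ toward 1 and shrinks the per-sample gradient $1-\sigma(\beta\Delta-\gamma)$ (a toy calculation with made-up values, not results from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

delta, gamma = 0.4, 0.5                          # example normalized gap and margin
for beta in [2.0, 4.0, 8.0]:
    grad = 1.0 - sigmoid(beta * delta - gamma)   # d log sigma(z) / dz
    print(f"beta={beta}: gradient = {grad:.3f}")  # 0.426, 0.250, 0.063
# Larger beta drives z = beta*delta - gamma into the saturated regime,
# so the sample contributes progressively less gradient.
```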

## Appendix F $\xi$-DPO without LeakyReLU

LeakyReLU is used to filter out samples whose preference differences can already be clearly distinguished. Without LeakyReLU, samples whose reward margins already exceed $\xi$ are forcibly pushed back toward $\xi$. Concretely, this manifests as a decrease in the rewards of chosen responses and an increase in the rewards of rejected responses, thereby narrowing the reward gap. Such behavior is clearly undesirable, which further highlights the necessity of introducing LeakyReLU. Figure 4 shows that, for Llama3 without LeakyReLU, the validation reward for chosen responses declines while that for rejected responses increases.

![(a)](https://arxiv.org/html/2605.10981v1/figures/chosen_bad.png)
![(b)](https://arxiv.org/html/2605.10981v1/figures/rejected_bad.png)

Figure 4: Reward curves of model training without LeakyReLU.

Figure 4 also shows that, although the reward for the chosen responses remains consistently higher than that for the rejected ones, it already exhibits a clear downward trend. This indicates that the original model distribution has undergone a substantial shift, which in turn weakens the model's ability to preserve correct preference ordering.
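For intuition, here is a compact sketch of the $\xi$-DPO objective from Table 8, contrasting the LeakyReLU-gated form with the ungated variant discussed above (a simplification with our own variable names; length-normalized log-probabilities are assumed precomputed):

```python
import torch.nn.functional as F

def xi_dpo_loss(norm_chosen_logps, norm_rejected_logps, xi, gated=True):
    """Sketch of the xi-DPO objective with the ratio reward margin.
    Inputs are (1/|y|) * log pi_theta(y|x) for chosen and rejected responses."""
    # Ratio reward margin: bounded, and independent of beta by construction.
    m = (norm_chosen_logps - norm_rejected_logps) / (
        norm_chosen_logps + norm_rejected_logps)
    residual = xi - m
    if gated:
        # Samples whose margin already exceeds xi are (almost) filtered out.
        residual = F.leaky_relu(residual)
    return residual.pow(2).mean()
```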

## Appendix G Detailed experimental setup

All experiments are conducted on four RTX PRO 6000 GPUs, using exactly the same hyperparameter settings as AlphaDPO (Wu et al., 2025). The batch size is set to 128, and the learning rates are set to $6.0\times10^{-7}$, $1.0\times10^{-6}$, and $8.0\times10^{-7}$ for Mistral-Instruct, Llama3-Instruct, and Gemma2, respectively. On the Spider dataset, we adopt the same experimental configuration as Mistral, with a learning rate of $6.0\times10^{-7}$. The learning-rate schedule is also kept consistent with AlphaDPO, using a cosine schedule with a 10% warm-up phase. The optimizer is Adam, also following AlphaDPO.
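For readability, the setup can be summarized as a small configuration sketch (the dictionary and key names are illustrative, not taken from the released code):

```python
# Illustrative summary of the training setup described in Appendix G.
TRAINING_CONFIG = {
    "hardware": "4x RTX PRO 6000",
    "batch_size": 128,
    "optimizer": "Adam",
    "lr_schedule": "cosine",
    "warmup_ratio": 0.10,
    "learning_rate": {
        "mistral_instruct": 6.0e-7,
        "llama3_instruct": 1.0e-6,
        "gemma2": 8.0e-7,
        "spider": 6.0e-7,  # same configuration as Mistral-Instruct
    },
}
```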

## Appendix H Preference optimization methods

We summarize several existing high-performing preference optimization objectives in Table 8, mainly to illustrate their objective forms, without providing detailed explanations of the variables in each formulation. It can be observed that although AlphaDPO (Wu et al., 2025) adopts an adaptive strategy for adjusting $\gamma$, it still requires an initial value of $\gamma$, while introducing an additional hyperparameter $\alpha$ and a reference model, which is involved in the computation of $M^{*}$. Meanwhile, for both SimPO and AlphaDPO, the search range of $\beta$ is relatively broad, making hyperparameter selection challenging in practical scenarios. In contrast, the proposed $\xi$-DPO has a simple formulation, and its hyperparameter can be determined through prior computation, making it more suitable for downstream applications.

Table 7: Reproduction results of the hyperparameter-free SimPER method on Mistral-Instruct.

| Method | LC (%) | WR (%) |
|---|---|---|
| SimPER | 24.3 | 29.3 |
| $\xi$-DPO | 33.2 | 36.5 |

The results in Table 2 are obtained using the official AlpacaEval 2 package (Li et al., 2023) under the same inference configuration, where both the judge model and the baseline model are GPT-4-Turbo. To ensure a fair comparison with these results, we fine-tune Mistral using SimPER with the same learning rate and other hyperparameters, and evaluate the resulting model with GPT-4-Turbo. As shown in Table 7, optimization without margin constraints performs poorly, highlighting the importance of the reward margin.

Table 8: Various preference optimization objectives and hyperparameter ranges.

- DPO (Rafailov et al., 2023): $-\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)$; $\beta\in[0.01, 0.05, 0.1]$.
- IPO (Azar et al., 2024): $\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} - \frac{1}{2\tau}\right)^{2}$; $\tau\in[0.01, 0.1, 0.5, 1.0]$.
- CPO (Xu et al., 2024): $-\log\sigma\left(\beta\log\pi_\theta(y_w|x) - \beta\log\pi_\theta(y_l|x)\right) - \lambda\log\pi_\theta(y_w|x)$; $\alpha=1.0$, $\beta\in[0.01, 0.05, 0.1]$.
- KTO (Ethayarajh et al., 2024): $-\lambda_w\,\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - z_{\mathrm{ref}}\right) + \lambda_l\,\sigma\left(z_{\mathrm{ref}} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)$, where $z_{\mathrm{ref}} = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\beta\,\mathrm{KL}\left(\pi_\theta(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x)\right)\right]$; $\lambda_l=\lambda_w=1.0$, $\beta\in[0.01, 0.05, 0.1]$.
- ORPO (Hong et al., 2024): $-\log p_\theta(y_w|x) - \lambda\log\sigma\left(\log\frac{p_\theta(y_w|x)}{1-p_\theta(y_w|x)} - \log\frac{p_\theta(y_l|x)}{1-p_\theta(y_l|x)}\right)$, where $p_\theta(y|x) = \exp\left(\frac{1}{|y|}\log\pi_\theta(y|x)\right)$; $\lambda\in[0.1, 0.5, 1.0, 2.0]$.
- R-DPO (Park et al., 2024): $-\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} - \left(c|y_w| - c|y_l|\right)\right)$; $\alpha\in[0.05, 0.1, 0.5, 1.0]$, $\beta\in[0.01, 0.05, 0.1]$.
- SimPO (Meng et al., 2024): $-\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w|x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l|x) - \gamma\right)$; $\beta\in[2.0, 4.0, 6.0, 8.0]$, $\gamma\in[0.3, 0.5, 1.0, 1.2, 1.4, 1.6]$.
- AlphaDPO (Wu et al., 2025): $-\log\sigma\left(u(x,y_w,y_l) - \mathrm{sg}\left[\gamma + \alpha M^{*}(x,y_w,y_l)\right]\right)$, where $u(x,y_w,y_l) = \frac{\beta}{|y_w|}\log\pi_\theta(y_w|x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l|x)$; $\beta\in[2.5, 10.0]$, $\gamma\in[0.1, 0.3, 0.5]$, $\alpha\in[10^{-2}, 5\times10^{-2}, 0.1, 0.2]$.
- SimPER (Xiao et al., 2025): $-\exp\left(\frac{1}{|y_w|}\log\pi_\theta(y_w|x)\right) + \exp\left(\frac{1}{|y_l|}\log\pi_\theta(y_l|x)\right)$; no hyperparameters.
- $\xi$-DPO (ours): $\mathrm{LeakyReLU}\left(\xi - \frac{\frac{1}{|y_w|}\log\pi_\theta(y_w|x) - \frac{1}{|y_l|}\log\pi_\theta(y_l|x)}{\frac{1}{|y_w|}\log\pi_\theta(y_w|x) + \frac{1}{|y_l|}\log\pi_\theta(y_l|x)}\right)^{2}$, where $\xi$ denotes the $t$-th quantile of $\{m_i\}_{i=1}^{N}$; $t\in[0.95, 0.999]$.

## Appendix I Ablation Study

We have already evaluated $\xi$ under different settings through sensitivity analysis, and the performance remains stable. In this section, we conduct an ablation study on LeakyReLU, which mainly filters out samples whose reward gap has already exceeded $\xi$. We compare the effect of using LeakyReLU with that of removing it, and further include the stricter ReLU as an additional baseline. The results are shown in Table 9.

The results demonstrate that LeakyReLU plays an important role. When the reward gap is larger than $\xi$, its gradient becomes close to zero. Moreover, when we replace LeakyReLU with ReLU, whose gradient is exactly zero in this region, the performance remains largely unchanged. This indicates that LeakyReLU not only functions similarly to ReLU by filtering samples, but also provides a less restrictive mechanism that can improve generalization to some extent.
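The difference between the two gates reduces to one line of gradient arithmetic: for a residual $\xi - m < 0$ (a sample whose gap already exceeds $\xi$), ReLU passes exactly zero gradient while LeakyReLU passes a small negative-slope gradient (a toy check, not an experiment from the paper):

```python
import torch
import torch.nn.functional as F

r = torch.tensor([-0.3], requires_grad=True)   # residual xi - m < 0
F.leaky_relu(r).pow(2).sum().backward()
print(r.grad)   # ~ -6e-5: tiny but non-zero (2 * slope^2 * r with slope 0.01)

r2 = torch.tensor([-0.3], requires_grad=True)
F.relu(r2).pow(2).sum().backward()
print(r2.grad)  # exactly 0: the sample is fully filtered out
```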

Table 9: Ablation study on LeakyReLU. We conduct the ablation on Llama3-v0.2, comparing model performance when LeakyReLU, a core component of $\xi$-DPO, is kept, removed, or replaced with ReLU.

| Test component | AlpacaEval 2 LC | AlpacaEval 2 WR |
|---|---|---|
| with LeakyReLU | 57.5 | 50.5 |
| w/o LeakyReLU | 21.7 | 18.5 |
| ReLU | 57.2 | 50.3 |
