Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

arXiv cs.CL 05/19/26, 04:00 AM Papers
large-language-models rlhf preference-learning alignment reward-modeling game-theory nash-equilibrium
Summary
This paper introduces the Hybrid Reward-Cyclic (HRC) model and Dynamic Self-Play Preference Optimization (DSPPO) to address the cyclic nature of human preferences in LLM alignment, achieving improved performance over Bradley-Terry and General Preference Model baselines.
arXiv:2605.17342v1 Announce Type: new
# Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
Source: [https://arxiv.org/html/2605.17342](https://arxiv.org/html/2605.17342)
###### Abstract

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC’s structural superiority in mixed transitive–cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model’s robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at [https://github.com/lab-klc/Hybrid-Reward-Cyclic](https://github.com/lab-klc/Hybrid-Reward-Cyclic).

Machine Learning, ICML

## 1 Introduction

Large Language Models (LLMs) show remarkable performance across tasks (Brown et al., [2020](https://arxiv.org/html/2605.17342#bib.bib9); Achiam et al., [2023](https://arxiv.org/html/2605.17342#bib.bib1); Li et al., [2024](https://arxiv.org/html/2605.17342#bib.bib27)), and aligning them with complex human values is a cornerstone of modern AI safety and utility (Weidinger et al., [2021](https://arxiv.org/html/2605.17342#bib.bib43); Ji et al., [2023](https://arxiv.org/html/2605.17342#bib.bib24); Xu et al., [2025](https://arxiv.org/html/2605.17342#bib.bib47)). The efficacy of this alignment process, typically driven by Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., [2017](https://arxiv.org/html/2605.17342#bib.bib11)), fundamentally depends on the reliability of the reward model used to proxy human judgments. Preference learning algorithms typically employ pairwise comparisons to capture human judgments (Ibarz et al., [2018](https://arxiv.org/html/2605.17342#bib.bib23); Ziegler et al., [2019](https://arxiv.org/html/2605.17342#bib.bib54)). While the Bradley-Terry (BT) model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)) serves as the standard (Bai et al., [2022](https://arxiv.org/html/2605.17342#bib.bib3); Rafailov et al., [2023](https://arxiv.org/html/2605.17342#bib.bib33)), its reliance on scalar rewards enforces a strict transitivity assumption (i.e., $A \succ B \wedge B \succ C \Rightarrow A \succ C$). However, real-world human preferences are inherently heterogeneous and often exhibit intransitive, cyclic patterns (e.g., Rock-Paper-Scissors dynamics) (Tversky, [1969](https://arxiv.org/html/2605.17342#bib.bib41); Savage Jr, [1994](https://arxiv.org/html/2605.17342#bib.bib36); Gehrlein, [2006](https://arxiv.org/html/2605.17342#bib.bib17); Munos et al., [2024](https://arxiv.org/html/2605.17342#bib.bib31)), which scalar models fundamentally fail to capture.

Early approaches like PairRM/PairPM (Jiang et al., [2023](https://arxiv.org/html/2605.17342#bib.bib25); Dong et al., [2024](https://arxiv.org/html/2605.17342#bib.bib14)) predict preferences directly from concatenated inputs, theoretically capturing intransitive dynamics. However, their inference complexity ($O(K^2)$) limits applicability in scalable alignment scenarios. Recognizing the non-transitive nature of preferences, the General Preference Model (GPM) (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)) maps responses to latent embeddings, modeling these dynamics via skew-symmetric bilinear forms with linear complexity.

![Refer to caption](https://arxiv.org/html/2605.17342v1/x1.png)

Figure 1: Comparison of the Bradley-Terry (BT) model and the proposed Hybrid Reward-Cyclic (HRC) model. (a) The BT model maps each instruction-response pair to a scalar reward, assuming transitive preferences. (b) The HRC model explicitly decomposes preferences into a transitive scalar component (via BT) and a cyclic vector component (via GPM), combining them to produce the final preference signal $s_{\text{HRC}}$.

While GPM effectively models cyclic dynamics, it remains underexplored whether its form can simultaneously preserve transitive global hierarchies (e.g., safety, helpfulness) while handling cyclic nuances. In this paper, we identify a critical theoretical gap: GPM cannot guarantee the representation of dominant solutions in arbitrary cyclic contexts because it entangles these two properties. This insight motivates us to revisit preference modeling from a game-theoretic perspective, treating the pairwise preference relation as a symmetric zero-sum game (Balduzzi et al., [2019](https://arxiv.org/html/2605.17342#bib.bib5)). By leveraging the theoretical insight that any such game can be decomposed, we propose the Hybrid Reward-Cyclic (HRC) model. Unlike prior approaches like GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), HRC explicitly disentangles human preferences into two orthogonal components: a transitive scalar component that captures consistent global rankings, and a cyclic vector component that models intransitive local dynamics. This explicit decomposition not only enhances interpretability but also introduces an explicit inductive bias for modeling dominant hierarchies, allowing HRC to achieve superior performance while maintaining linear inference complexity $O((2d+1)K)$, where $d \geq 1$ is a hyperparameter.

Furthermore, to effectively align LLMs with HRC, we focus on the game-theoretic alignment paradigm (e.g., SPPO (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)), INPO (Zhang et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib49))). We prioritize this class of algorithms because they optimize directly against preference probabilities, thereby preserving the non-transitive modeling capabilities. However, existing frameworks in this category typically rely on a static preference oracle, which limits their ability to navigate complex preference landscapes. Treating the preference as fixed ignores the multi-dimensional structure of preferences revealed by HRC. To address this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), a generalized framework for time-varying preference games. Inspired by the success of curriculum learning (Bengio et al., [2009](https://arxiv.org/html/2605.17342#bib.bib6); Hacohen & Weinshall, [2019](https://arxiv.org/html/2605.17342#bib.bib19)), we leverage DSPPO to orchestrate an optimization trajectory that transitions from a robust transitive backbone to refined cyclic nuances. This strategy stabilizes training by establishing a global quality baseline before introducing complex local dynamics, ensuring convergence to the Nash Equilibrium of the full preference game.

Our main contributions are summarized as follows:

- We propose the HRC model, which theoretically unifies the BT model and GPM. By explicitly decomposing preferences into transitive and cyclic components, the HRC model resolves the structural limitations of both.
- We introduce DSPPO, an alignment algorithm that leverages the HRC structure to schedule the complexity of the preference signal, enabling more robust convergence in complex landscapes.
- Extensive experiments on synthetic data constructed from UltraFeedback (Cui et al., [2024](https://arxiv.org/html/2605.17342#bib.bib12)), RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2605.17342#bib.bib30)), AlpacaEval 2.0 (Dubois et al., [2024](https://arxiv.org/html/2605.17342#bib.bib15)), Arena-Hard-v0.1 (Li et al., [2025](https://arxiv.org/html/2605.17342#bib.bib28)), and MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.17342#bib.bib51)) demonstrate that our framework consistently outperforms existing baselines in both preference modeling accuracy and downstream generation quality.

## 2 Related Work

### 2.1 Preference Modeling in RLHF

In the standard RLHF framework (Christiano et al., [2017](https://arxiv.org/html/2605.17342#bib.bib11)), human preferences are predominantly modeled using the Bradley-Terry (BT) model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)), which typically employs a learnable reward function $r(\mathbf{y}|\mathbf{x})$ to assign a scalar score to each response. To explore alternative preference models within the RLHF framework, several approaches have been proposed, such as the Plackett-Luce model for K-wise comparisons (Zhu et al., [2023](https://arxiv.org/html/2605.17342#bib.bib53)), Energy-Based Model for modeling preference distributions (Hong et al., [2025](https://arxiv.org/html/2605.17342#bib.bib20)), and multi-dimensional reward systems like ArmoRM (Wang et al., [2024](https://arxiv.org/html/2605.17342#bib.bib42)). However, despite these structural variations, these methods fundamentally rely on scalar scores or linear aggregations for final decision-making. Consequently, they implicitly retain the transitivity assumption (i.e., if $A \succ B$ and $B \succ C$, then $A \succ C$), thereby restricting their capacity to explicitly represent the complex, intransitive (cyclic) patterns often present in heterogeneous human feedback (Tversky, [1969](https://arxiv.org/html/2605.17342#bib.bib41); Munos et al., [2024](https://arxiv.org/html/2605.17342#bib.bib31)).

Recent works have focused on explicitly modeling intransitive preferences in RLHF. While early pairwise approaches (e.g., PairPM/PairRM) (Jiang et al., [2023](https://arxiv.org/html/2605.17342#bib.bib25); Dong et al., [2024](https://arxiv.org/html/2605.17342#bib.bib14)) can capture cyclic patterns, they suffer from prohibitive quadratic inference costs ($O(K^2)$) when ranking $K$ candidate responses. To address this scalability bottleneck, GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)) employs a low-rank real skew-symmetric formulation that offers three key advantages: (1) controllable expressiveness via the latent dimension hyperparameter $d$; (2) superior efficiency with a linear inference complexity of $O(2dK)$; and (3) the inherent capability to model cyclic dynamics for any $d \geq 1$.

### 2.2 Preference-Based Reinforcement Learning from Human Feedback

Traditional RLHF (Christiano et al., [2017](https://arxiv.org/html/2605.17342#bib.bib11)) typically follows a two-stage paradigm: learning a scalar reward model $r(\mathbf{y}|\mathbf{x})$ as a proxy for human preferences, followed by policy optimization via algorithms like PPO (Schulman et al., [2017](https://arxiv.org/html/2605.17342#bib.bib37)). Recently, some works have proposed preference-based alignment methods that optimize the policy using preference probabilities as signals. A representative approach is IPO (Azar et al., [2024](https://arxiv.org/html/2605.17342#bib.bib2)), which formulates a direct optimization objective based on these probabilities.

More recently, a rigorous game-theoretic perspective has gained prominence, framing LLM alignment as solving a multi-player zero-sum game to find the Nash Equilibrium. NLHF (Munos et al., [2024](https://arxiv.org/html/2605.17342#bib.bib31)) first introduced this formulation to the field. Following this direction, a line of works, including SPO (Swamy et al., [2024](https://arxiv.org/html/2605.17342#bib.bib38)), DNO (Rosset et al., [2024](https://arxiv.org/html/2605.17342#bib.bib35)), SPPO (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)), INPO (Zhang et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib49)), GPO (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), EGPO (Zhou et al., [2025](https://arxiv.org/html/2605.17342#bib.bib52)), ONPO (Zhang et al., [2025a](https://arxiv.org/html/2605.17342#bib.bib48)), and MNPO (Wu et al., [2025a](https://arxiv.org/html/2605.17342#bib.bib45)), has focused on optimizing policies directly from preference signals, thereby mitigating the limitations of scalar reward functions.

## 3 Preliminaries

We consider the standard language generation setting. A generative language model $\pi$ maps a prompt $\mathbf{x} = (x_1, x_2, \dots) \sim \mathcal{X}$ to a probability distribution over responses, from which we can sample candidate sequences $\mathbf{y}$ and $\mathbf{y}'$. The probability of generating any response $\mathbf{y} = (y_1, \dots, y_T)$ is defined via the autoregressive factorization $\pi(\mathbf{y}|\mathbf{x}) = \prod_{t=1}^{T} \pi(y_t|\mathbf{x}, \mathbf{y}_{<t})$, where $\mathbf{y}_{<t}$ denotes the partial sequence.

### 3.1 Preference Models

We assume a preference oracle provides binary feedback $o(\mathbf{y} \succ \mathbf{y}'|\mathbf{x}) \in \{0, 1\}$ on response pairs. The probability that $\mathbf{y}$ is preferred to $\mathbf{y}'$ is denoted as the expectation $\mathbb{P}(\mathbf{y} \succ \mathbf{y}'|\mathbf{x}) = \mathbb{E}[o(\mathbf{y} \succ \mathbf{y}'|\mathbf{x})]$.

In the standard RLHF framework (Christiano et al., [2017](https://arxiv.org/html/2605.17342#bib.bib11)), preferences are typically modeled using the Bradley-Terry (BT) model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)). This model assumes a latent reward function $r(\mathbf{y}|\mathbf{x})$, defining the preference probability as $\mathbb{P}_{\text{BT}}(\mathbf{y} \succ \mathbf{y}'|\mathbf{x}) = \sigma(r(\mathbf{y}|\mathbf{x}) - r(\mathbf{y}'|\mathbf{x}))$, where $\sigma$ is the logistic function. However, the BT model cannot capture intransitive preferences effectively (Bertrand et al., [2023](https://arxiv.org/html/2605.17342#bib.bib7)). The scalar form of BT implies transitivity: $r(\mathbf{y}) > r(\mathbf{y}')$ and $r(\mathbf{y}') > r(\mathbf{y}'')$ strictly enforce $r(\mathbf{y}) > r(\mathbf{y}'')$, prohibiting the representation of cyclic preferences.

To capture intransitive dynamics, recent works (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)) propose the General Preference Model (GPM). Unlike the scalar formulation in BT, GPM maps each prompt-response pair to a latent vector $\mathbf{v}(\mathbf{y}|\mathbf{x}) \in \mathbb{R}^{2d}$ and defines the preference score via a skew-symmetric operator:

$$s_{\text{GPM}}(\mathbf{y}, \mathbf{y}'|\mathbf{x}) = \mathbf{v}(\mathbf{y}|\mathbf{x})^{\top} \mathbf{W} \mathbf{v}(\mathbf{y}'|\mathbf{x}), \tag{1}$$

where $\mathbf{W} \in \mathbb{R}^{2d \times 2d}$ is a real skew-symmetric matrix ($\mathbf{W}^{\top} = -\mathbf{W}$). The final probability is $\mathbb{P}_{\text{GPM}} = \sigma(s_{\text{GPM}})$. Due to the skew-symmetry, $s(\mathbf{y}, \mathbf{y}') = -s(\mathbf{y}', \mathbf{y})$, enabling the model to naturally represent cyclic preference structures in dimensions $2d \geq 2$.
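To make the cyclic capability concrete, the following is a minimal NumPy sketch of the GPM score for $2d = 2$. The specific operator `W` (a 90-degree rotation block) and the three embeddings spaced 120 degrees apart are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed skew-symmetric operator for 2d = 2 (W^T = -W).
W = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def gpm_score(v_y, v_yp):
    """s_GPM(y, y') = v(y)^T W v(y'); positive means y is preferred."""
    return float(v_y @ W @ v_yp)

# Hypothetical unit-norm embeddings placed 120 degrees apart on the circle.
angles = np.deg2rad([0.0, 120.0, 240.0])
rock, paper, scissors = (np.array([np.cos(a), np.sin(a)]) for a in angles)

# For unit vectors this score equals sin(angle(y) - angle(y')),
# so the three responses beat each other in a cycle.
p_paper_over_rock = sigmoid(gpm_score(paper, rock))
p_scissors_over_paper = sigmoid(gpm_score(scissors, paper))
p_rock_over_scissors = sigmoid(gpm_score(rock, scissors))
```

All three probabilities exceed 0.5, which no assignment of scalar BT rewards can reproduce.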

### 3.2 Preference-based Reinforcement Learning from Human Feedback

Traditional alignment approaches, such as PPO (Schulman et al., [2017](https://arxiv.org/html/2605.17342#bib.bib37)), rely on the BT model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)) to maximize an expected scalar reward subject to a KL-divergence constraint. While effective, this scalar formulation implicitly assumes preference transitivity. Recently, some works have constructed algorithms directly on preference signals. Given a preference oracle $\mathbb{P}(\mathbf{y} \succ \mathbf{y}'|\mathbf{x})$, the alignment process is viewed as finding a policy $\pi_\theta$ that outperforms a competitor $\pi'$. For instance, the IPO algorithm (Azar et al., [2024](https://arxiv.org/html/2605.17342#bib.bib2)) optimizes against a fixed competitor $\mu$:

$$\max_{\theta} \mathbb{E}_{\mathbf{x} \sim \mathcal{X}} \Big[ \mathbb{E}_{\mathbf{y} \sim \pi_\theta, \mathbf{y}' \sim \mu} [\mathbb{P}(\mathbf{y} \succ \mathbf{y}'|\mathbf{x})] - \beta \mathbb{D}_{\text{KL}}(\pi_\theta(\cdot|\mathbf{x}) \,\|\, \pi_{\text{ref}}(\cdot|\mathbf{x})) \Big], \tag{2}$$

where $\mu$ typically represents another fixed policy and $\beta$ controls the KL-divergence penalty $\mathbb{D}_{\text{KL}}$.

In game-theoretic approaches such as NLHF (Munos et al., [2024](https://arxiv.org/html/2605.17342#bib.bib31)), SPPO (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)), DNO (Rosset et al., [2024](https://arxiv.org/html/2605.17342#bib.bib35)), INPO (Zhang et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib49)), and GPO (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), the problem is formulated as a two-player constant-sum game. To simplify the formulation, NLHF (Munos et al., [2024](https://arxiv.org/html/2605.17342#bib.bib31)) defines the preference of $\pi_\theta$ over $\pi'$ as the expectation of the preference signal over their generated responses:

$$\mathbb{P}(\pi_\theta \succ \pi'|\mathbf{x}) = \mathbb{E}_{\mathbf{y} \sim \pi_\theta(\cdot|\mathbf{x}),\, \mathbf{y}' \sim \pi'(\cdot|\mathbf{x})}[\mathbb{P}(\mathbf{y} \succ \mathbf{y}'|\mathbf{x})]. \tag{3}$$

With this notation, the alignment objective can be succinctly expressed as finding the Nash Equilibrium of the game, which corresponds to the solution of the maximin problem:

$$\max_{\theta} \min_{\pi'} \mathbb{E}_{\mathbf{x} \sim \mathcal{X}}[\mathbb{P}(\pi_\theta \succ \pi'|\mathbf{x})]. \tag{4}$$
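The maximin objective can be illustrated on a toy case. The sketch below assumes a single prompt and a finite set of three responses, so that a policy is a probability vector and the preference oracle is a matrix `P` with `P[i, j]` the probability that response `i` beats response `j`. The matrix values are hypothetical, and the KL regularization used by practical algorithms is omitted.

```python
import numpy as np

def worst_case_win_rate(pi, P):
    """min over pi' of pi^T P pi'.

    The inner objective is linear in pi', so the minimum over the simplex
    is attained at a pure strategy, i.e. the smallest entry of pi^T P.
    """
    return float(np.min(pi @ P))

# Hypothetical cyclic preference matrix (constant-sum: P + P^T = 1).
P = np.array([[0.5, 0.1, 0.9],
              [0.9, 0.5, 0.1],
              [0.1, 0.9, 0.5]])

uniform = np.ones(3) / 3.0           # mixes over all three responses
pure = np.array([1.0, 0.0, 0.0])     # always emits the first response

value_uniform = worst_case_win_rate(uniform, P)
value_pure = worst_case_win_rate(pure, P)
```

In this toy game the uniform policy secures a worst-case win rate of 0.5, the value of a symmetric constant-sum game, while any deterministic policy can be exploited by the response that beats it.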

## 4 Hybrid Reward-Cyclic Model

### 4.1 Modeling Human Preferences via Game-Theoretic Decomposition

In this section, we provide a theoretical grounding for our proposed architecture. We first formalize human preference modeling as a Symmetric Zero-Sum Functional-Form Game (FFG) and then leverage the game-theoretic decomposition analysis (Balduzzi et al., [2019](https://arxiv.org/html/2605.17342#bib.bib5)) to justify the separation of transitive and cyclic components.

Consider a prompt $\mathbf{x}$ and a pair of responses $(\mathbf{y}_i, \mathbf{y}_j)$. We define the preference score in [Equation 5](https://arxiv.org/html/2605.17342#S4.E5).

###### Definition 4.1 (Preference Score).

Given a preference probability $\mathbb{P}(\mathbf{y}_i \succ \mathbf{y}_j|\mathbf{x})$, the preference score function $s(\mathbf{y}_i, \mathbf{y}_j|\mathbf{x})$ is defined as:

$$s(\mathbf{y}_i, \mathbf{y}_j|\mathbf{x}) \triangleq \log \frac{\mathbb{P}(\mathbf{y}_i \succ \mathbf{y}_j|\mathbf{x})}{1 - \mathbb{P}(\mathbf{y}_i \succ \mathbf{y}_j|\mathbf{x})}. \tag{5}$$

This implies $\mathbb{P}(\mathbf{y}_i \succ \mathbf{y}_j|\mathbf{x}) = \sigma(s(\mathbf{y}_i, \mathbf{y}_j|\mathbf{x}))$.

To analyze the topological structure of these scores, we introduce two structural assumptions that map the discrete preference problem into a continuous vector space suitable for functional analysis\.

###### Assumption 4.2 (Skew-Symmetry).

The preference relation is strictly skew-symmetric. For any pair $(\mathbf{y}_i, \mathbf{y}_j)$, $s(\mathbf{y}_i, \mathbf{y}_j|\mathbf{x}) = -s(\mathbf{y}_j, \mathbf{y}_i|\mathbf{x})$, which implies $s(\mathbf{y}_i, \mathbf{y}_i|\mathbf{x}) = 0$.

###### Assumption 4.3 (Embedding Mapping).

There exists a mapping function $g: \mathcal{Y} \times \mathcal{X} \to \mathbb{R}^{2d}$ such that each response $\mathbf{y}$ is represented by a vector $\mathbf{v} = g(\mathbf{y}|\mathbf{x})$.

###### Definition 4.4 (Preference Function).

The preference score can be expressed as a functional over these embeddings: $\phi_{\mathbf{x}}(\mathbf{v}_i, \mathbf{v}_j) \triangleq s(\mathbf{y}_i, \mathbf{y}_j|\mathbf{x})$.

Under these two assumptions, the function $\phi_{\mathbf{x}}: \mathbb{R}^{2d} \times \mathbb{R}^{2d} \to \mathbb{R}$ constitutes a Symmetric Zero-Sum Functional-Form Game (FFG) (Balduzzi et al., [2019](https://arxiv.org/html/2605.17342#bib.bib5)).

We now invoke a fundamental result from previous work (Balduzzi et al., [2019](https://arxiv.org/html/2605.17342#bib.bib5)). [Theorem 4.5](https://arxiv.org/html/2605.17342#S4.Thmtheorem5) provides the theoretical legitimacy for our dual-model approach.

###### Theorem 4.5.

Under Assumptions 4.2 and 4.3, any preference function $\phi(\mathbf{v}, \mathbf{w})$ can be uniquely decomposed into the sum of a transitive component $\phi_T$ and a cyclic component $\phi_C$: $\phi(\mathbf{v}, \mathbf{w}) = \phi_T(\mathbf{v}, \mathbf{w}) + \phi_C(\mathbf{v}, \mathbf{w})$, where:

- Transitive Component: There exists a potential function $f: \mathbb{R}^{2d} \to \mathbb{R}$ such that $\phi_T(\mathbf{v}, \mathbf{w}) = f(\mathbf{v}) - f(\mathbf{w})$. This component represents the hierarchical ranking capability of the game.
- Cyclic Component: The component $\phi_C$ satisfies $\int \phi_C(\mathbf{v}, \mathbf{w}) \, d\mathbf{w} = 0$ for all $\mathbf{v}$ (assuming a uniform measure over the embedding space). This component captures pure rotational dynamics where no global winner exists (e.g., Rock-Paper-Scissors dynamics).

We include the proof of [Theorem 4.5](https://arxiv.org/html/2605.17342#S4.Thmtheorem5) in [Section B.1](https://arxiv.org/html/2605.17342#A2.SS1).
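A finite analogue of this decomposition is easy to check numerically. The sketch below assumes the game is given as a skew-symmetric score matrix over a finite set of responses with a uniform measure, so the integral becomes a row sum. It takes the potential $f$ to be the row mean, a choice made for this illustration rather than taken from the authors' implementation.

```python
import numpy as np

def decompose(S):
    """Split a skew-symmetric score matrix into transitive and cyclic parts.

    f[i]    : average score of response i against a uniform opponent
    T[i, j] : f[i] - f[j]            (transitive part)
    C[i, j] : S[i, j] - T[i, j]      (cyclic residual, rows sum to zero)
    """
    S = np.asarray(S, dtype=float)
    if not np.allclose(S, -S.T):
        raise ValueError("S must be skew-symmetric")
    f = S.mean(axis=1)
    T = f[:, None] - f[None, :]
    C = S - T
    return f, T, C

# Hypothetical game: a ranking with ratings (2, 1, 0) plus a Rock-Paper-Scissors cycle.
ratings = np.array([2.0, 1.0, 0.0])
ranking_part = ratings[:, None] - ratings[None, :]
rps_part = np.array([[ 0.0,  1.0, -1.0],
                     [-1.0,  0.0,  1.0],
                     [ 1.0, -1.0,  0.0]])
S = ranking_part + rps_part

f, T, C = decompose(S)
```

Here `T` recovers the ranking part and `C` recovers the Rock-Paper-Scissors part, with every row of `C` summing to zero as the cyclic condition requires.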

Based on [Theorem 4.5](https://arxiv.org/html/2605.17342#S4.Thmtheorem5), we now instantiate the transitive component $\phi_T$ and the cyclic component $\phi_C$.

###### Theorem 4.6.

The decomposition admits a structural instantiation where $\phi_T$ corresponds to the Bradley-Terry (BT) model, and $\phi_C$ corresponds to the General Preference Model (GPM), provided that the embeddings satisfy the zero-mean condition $\mathbb{E}[\mathbf{v}] = \mathbf{0}$.

Based on [Theorem 4.6](https://arxiv.org/html/2605.17342#S4.Thmtheorem6) (see proof in [Section B.2](https://arxiv.org/html/2605.17342#A2.SS2)), we propose the Hybrid Reward-Cyclic (HRC) model, which explicitly instantiates this decomposition:

$$s_{\text{HRC}}(\mathbf{y}_i, \mathbf{y}_j|\mathbf{x}) = \underbrace{(r(\mathbf{y}_i) - r(\mathbf{y}_j))}_{\text{Transitive}} + \underbrace{(\mathbf{v}_i^{\top} \mathbf{W} \mathbf{v}_j)}_{\text{Cyclic}}, \tag{6}$$

where $\mathbf{v}_i, \mathbf{v}_j$ correspond to the embeddings of $\mathbf{y}_i, \mathbf{y}_j$, and $\mathbf{W}$ is a real skew-symmetric matrix. We enforce the unit-norm constraint $\|\mathbf{v}\|_2 = 1$. Under the assumption that the embeddings are isotropically distributed on the hypersphere, the expectation vanishes ($\mathbb{E}[\mathbf{v}] = \mathbf{0}$), thereby satisfying the zero-mean condition required for the cyclic component. Crucially, HRC maintains a linear inference complexity of $O((2d+1)K)$, matching the computational efficiency of both the BT model and GPM.
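The linear-complexity claim can be read as follows: each of the $K$ responses needs one model pass producing $2d + 1$ numbers (one reward plus a $2d$-dimensional embedding), after which all pairwise scores follow from cheap matrix arithmetic. The sketch below illustrates this with random stand-ins for the model outputs; the block-diagonal form of `W` and all sizes are assumptions for the example.

```python
import numpy as np

def block_skew(d):
    """Assumed form of W: d copies of a 2x2 rotation block on the diagonal."""
    J = np.array([[0.0, -1.0],
                  [1.0,  0.0]])
    return np.kron(np.eye(d), J)

def hrc_score_matrix(r, V, W):
    """S[i, j] = (r_i - r_j) + v_i^T W v_j for all pairs of K responses.

    r : (K,) scalar rewards, V : (K, 2d) unit-norm embeddings.
    """
    transitive = r[:, None] - r[None, :]
    cyclic = V @ W @ V.T
    return transitive + cyclic

rng = np.random.default_rng(0)
K, d = 5, 2                                      # hypothetical sizes
r = rng.normal(size=K)                           # stand-in for reward-head outputs
V = rng.normal(size=(K, 2 * d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # enforce ||v||_2 = 1

S = hrc_score_matrix(r, V, block_skew(d))
P = 1.0 / (1.0 + np.exp(-S))                     # pairwise preference probabilities
```

The resulting score matrix is skew-symmetric, so `P[i, j] + P[j, i] = 1` and each response ties with itself.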

### 4.2 Implementation

We parameterize the decomposed components using a shared language model with three distinct projection heads. Let $\mathbf{h}_{\mathbf{y}|\mathbf{x}}$ denote the final hidden state of response $\mathbf{y}$ given prompt $\mathbf{x}$.

Transitive and Cyclic Heads. To capture the transitive component $\phi_T$, we employ a scalar reward head. To ensure numerical stability and mitigate potential gradient issues caused by unbounded scores, we apply a value clipping operation to constrain the reward within a fixed range $[-\delta, \delta]$:

$$r_\phi(\mathbf{y}|\mathbf{x}) = \text{clip}(\mathbf{w}_r^{\top} \mathbf{h}_{\mathbf{y}|\mathbf{x}}, -\delta, \delta).$$

For the cyclic component $\phi_C$, we utilize an embedding head that projects the hidden state into a latent space $\mathbb{R}^{2d}$. Following the GPM formulation (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), we enforce a unit-norm constraint to ensure the zero-integral property:

$$\mathbf{v}(\mathbf{y}|\mathbf{x}) = \frac{\mathbf{W}_c \mathbf{h}_{\mathbf{y}|\mathbf{x}}}{\|\mathbf{W}_c \mathbf{h}_{\mathbf{y}|\mathbf{x}}\|_2}, \quad \text{where } \mathbf{v} \in \mathbb{R}^{2d}, \ \|\mathbf{v}\|_2 = 1.$$

Context-Aware Gating. To model context-dependent cyclic intensity, a gating mechanism computes a scaling matrix $\mathbf{D}(\mathbf{x})$. We ensure the generated diagonal elements are non-negative to adhere to the standard definition of spectral decomposition:

$$\boldsymbol{\lambda}(\mathbf{x}) = g_\theta(\mathbf{h}_{\mathbf{x}}), \quad \mathbf{D}(\mathbf{x}) = \text{diag}(\boldsymbol{\lambda}(\mathbf{x})) \otimes \mathbf{I}_2, \quad \text{with } \boldsymbol{\lambda}(\mathbf{x}) \geq 0.$$
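The three heads can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: the gate $g_\theta$ is taken to be a linear map followed by softplus (the text only requires non-negativity), and all weights, sizes, and hidden states are random placeholders.

```python
import numpy as np

def softplus(z):
    # Stable log(1 + exp(z)); always non-negative.
    return np.logaddexp(0.0, z)

def reward_head(h_yx, w_r, delta):
    """Clipped scalar reward: clip(w_r^T h, -delta, delta)."""
    return np.clip(h_yx @ w_r, -delta, delta)

def embedding_head(h_yx, W_c):
    """Unit-norm embedding in R^{2d}: W_c h / ||W_c h||_2."""
    z = h_yx @ W_c.T
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def gate_diagonal(h_x, W_g):
    """Diagonal of D(x) = diag(lambda(x)) kron I_2 with lambda(x) >= 0."""
    lam = softplus(h_x @ W_g.T)         # shape (d,); softplus is an assumption
    return np.repeat(lam, 2, axis=-1)   # each lambda_k covers its own 2x2 block

rng = np.random.default_rng(0)
hidden, d, delta = 8, 2, 5.0            # hypothetical sizes and clip range
h_yx = rng.normal(size=hidden)          # stand-in for the response hidden state
h_x = rng.normal(size=hidden)           # stand-in for the prompt hidden state
w_r = rng.normal(size=hidden)
W_c = rng.normal(size=(2 * d, hidden))
W_g = rng.normal(size=(d, hidden))

r = reward_head(h_yx, w_r, delta)
v = embedding_head(h_yx, W_c)
D_diag = gate_diagonal(h_x, W_g)
```

Storing only the diagonal of $\mathbf{D}(\mathbf{x})$ avoids materializing a $2d \times 2d$ matrix, since multiplying by a diagonal matrix is an elementwise product.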
Training Objective. The final preference score integrates both components with weighting hyperparameters $C_1, C_2$. We optimize the model end-to-end using the binary cross-entropy loss:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( C_1 \big( r_\phi(\mathbf{y}_w) - r_\phi(\mathbf{y}_l) \big) + C_2 \big( \mathbf{v}_w^{\top} \mathbf{D}(\mathbf{x}) \mathbf{R}^{\succ} \mathbf{D}(\mathbf{x}) \mathbf{v}_l \big) \Big) \Big]. \tag{7}$$
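A batched version of this objective might look like the sketch below. It assumes that $\mathbf{R}^{\succ}$ is the block-diagonal matrix of $2 \times 2$ rotation blocks used in GPM-style models (the excerpt does not define it), and it takes the head outputs as plain arrays instead of computing them from a language model.

```python
import numpy as np

def log_sigmoid(z):
    # Stable log(sigma(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def hrc_loss(r_w, r_l, v_w, v_l, lam, C1=1.0, C2=1.0):
    """Binary cross-entropy on the HRC score of (winner, loser) pairs.

    r_w, r_l : (B,)     clipped rewards of chosen / rejected responses
    v_w, v_l : (B, 2d)  unit-norm embeddings
    lam      : (B, d)   non-negative gate values per prompt
    """
    d = lam.shape[1]
    J = np.array([[0.0, -1.0],
                  [1.0,  0.0]])
    R = np.kron(np.eye(d), J)              # assumed form of R^succ
    scale = np.repeat(lam, 2, axis=1)      # diagonal of D(x) = diag(lam) kron I_2
    # v_w^T D R D v_l, using that D is diagonal.
    cyclic = np.einsum("bi,ij,bj->b", v_w * scale, R, v_l * scale)
    logits = C1 * (r_w - r_l) + C2 * cyclic
    return float(-np.mean(log_sigmoid(logits)))
```

With equal rewards and a closed gate (`lam = 0`) every logit is zero and the loss equals $\log 2$, the value for an uninformative preference model. Raising the winner's reward lowers the loss.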

### 4.3 The Relationship between HRC and its Sub-components

GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)) has demonstrated capability in modeling cyclic preferences (e.g., Rock-Paper-Scissors). However, its structure introduces inherent limitations regarding transitive dominance. We analyze this through the concept of a “Dominant Candidate”.

Limitations of GPM on Transitivity. To rigorously analyze the expressiveness of GPM, consider a set of responses forming a cycle $C = \{A_1, \dots, A_n\}$ (where $A_1 \succ \dots \succ A_n \succ A_1$) and a dominant candidate $A^*$ that strictly defeats all elements in the cycle (i.e., $A^* \succ A_i, \forall A_i \in C$). The following theorem characterizes the capability of GPM to model such simultaneous structures.

###### Theorem 4.7.

For the General Preference Model (GPM) with embedding dimension $2d$:

- Existence Condition: A configuration of a cycle $C$ and a candidate $A^*$ satisfying the dominance condition (i.e., $A^* \succ A_i, \forall A_i \in C$) can be validly represented in the embedding space $\mathbb{R}^{2d}$ if and only if $d > 1$.
- Lack of Arbitrariness: For any fixed finite $2d$, GPM cannot guarantee such representation for arbitrary cycles. Specifically, there exists a cycle $C$ such that no embedding in $\mathbb{R}^{2d}$ can simultaneously preserve the cyclic structure and represent a candidate $A^*$ satisfying the dominance condition.

[Theorem 4.7](https://arxiv.org/html/2605.17342#S4.Thmtheorem7) (see proof in [Section B.3](https://arxiv.org/html/2605.17342#A2.SS3)) indicates that attempting to model global hierarchies (transitivity) implicitly through high-dimensional rotation is structurally brittle: complex local cycles can essentially “crowd out” the geometric capacity required to represent a dominant solution. In contrast, HRC resolves this by explicitly decoupling the two dynamics. By assigning a sufficiently large scalar reward $r(A^*)$ within its independent BT component, HRC can mathematically override the bounded residuals of the cyclic component. Consequently, HRC satisfies Dominant Arbitrariness, ensuring the preservation of the dominant candidate regardless of the complexity of the underlying cycle modeled by GPM.
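The contrast can be seen numerically in the smallest case. The sketch below fixes one symmetric embedding of a 3-cycle on the unit circle ($d = 1$) and scans candidate directions for a dominant $A^*$; it then adds a scalar reward margin, as HRC does. The operator, the embedding, and the margin of 1.5 are assumptions for illustration. This is a check of one configuration, not a proof of the theorem.

```python
import numpy as np

# Assumed skew-symmetric operator for d = 1.
J = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def unit(theta):
    return np.array([np.cos(theta), np.sin(theta)])

def gpm(v, w):
    """Cyclic score v^T J w, equal to sin(angle(v) - angle(w)) for unit vectors."""
    return float(v @ J @ w)

# One symmetric embedding of a 3-cycle: points 120 degrees apart.
cycle = [unit(t) for t in np.deg2rad([0.0, 120.0, 240.0])]

# GPM with d = 1: scan directions for a candidate that beats all three.
thetas = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
gpm_dominant_exists = any(
    all(gpm(unit(t), c) > 0 for c in cycle) for t in thetas
)

# HRC: the cyclic term is bounded by 1 in magnitude, so a reward margin
# above 1 makes A* win whatever its embedding is.
r_star, r_cycle = 1.5, 0.0
hrc_dominates = all(
    (r_star - r_cycle) + gpm(unit(t), c) > 0 for t in thetas for c in cycle
)
```

For this embedding the three cyclic scores of any candidate sum to zero, so they cannot all be positive and `gpm_dominant_exists` is `False`, while `hrc_dominates` is `True` for every candidate direction.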

While HRC is conceptually a hybrid ensemble of BT and GPM, it can be theoretically unified under the GPM framework. Since the BT model corresponds to a rank-1 GPM with embedding $\mathbf{v}_{\mathbf{y}}=[r(\mathbf{y}|\mathbf{x}),c]^{\top}$ ($c\neq 0$), HRC is theoretically equivalent to a constrained GPM of dimension $2d+1$. Its embedding takes the form $\mathbf{v}^{HRC}_{\mathbf{y}}=[r(\mathbf{y}|\mathbf{x}),c,\mathbf{v}'_{\mathbf{y}}]^{\top}$, where the explicit transitive term $r(\mathbf{y}|\mathbf{x})$ acts as a robust shortcut for modeling dominant structures. This explains HRC’s superiority over standard GPMs (dim $=2d$), which struggle to resolve such hierarchies purely through implicit rotation.
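
The rank-1 correspondence can be checked by a direct calculation. The sketch below assumes the bilinear score form $\mathbf{v}_w^{\top}\mathbf{W}\mathbf{v}_l$ that appears in the ablation description, with $\mathbf{W}$ taken to be the standard $2\times 2$ skew-symmetric block; this specific choice of $\mathbf{W}$ is an assumption made for illustration.

```latex
\[
\mathbf{W} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad
\mathbf{v}_{\mathbf{y}} = \begin{pmatrix} r(\mathbf{y}|\mathbf{x}) \\ c \end{pmatrix}, \qquad
\mathbf{W}\mathbf{v}_{\mathbf{y}'} = \begin{pmatrix} c \\ -r(\mathbf{y}'|\mathbf{x}) \end{pmatrix}
\]
\[
\mathbf{v}_{\mathbf{y}}^{\top}\mathbf{W}\mathbf{v}_{\mathbf{y}'}
= r(\mathbf{y}|\mathbf{x})\,c - c\,r(\mathbf{y}'|\mathbf{x})
= c\,\bigl(r(\mathbf{y}|\mathbf{x}) - r(\mathbf{y}'|\mathbf{x})\bigr)
\]
```

Under this assumption, $c=1$ recovers the BT logit $r(\mathbf{y}|\mathbf{x})-r(\mathbf{y}'|\mathbf{x})$ exactly, and any other nonzero $c$ rescales it, which is why $c\neq 0$ is required.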

## 5Dynamic Self\-Play Preference Optimization

### 5\.1Balancing Optimization Trajectories via Time\-Varying Games

As established in [Section 4.1](https://arxiv.org/html/2605.17342#S4.SS1), the HRC model explicitly decomposes complex preferences into orthogonal transitive and cyclic dimensions. Alignment thus becomes a multi-dimensional optimization problem. To fully leverage this expressiveness, we require an algorithm capable of navigating this complex landscape.

Standard game-theoretic alignment methods (e.g., SPPO (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)), INPO (Zhang et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib49))) typically seek the Nash Equilibrium of a static game defined by a fixed preference model. However, anchoring to a static target limits the model’s ability to fully explore and exploit the unique alignment benefits inherent in the different dimensions of the generalized preference structure. To address this, we propose Dynamic Self-Play Preference Optimization (DSPPO), which reformulates the alignment process as a Time-Varying Game. Instead of anchoring to a fixed target, DSPPO optimizes against a dynamic sequence of payoff functions that evolves over time.

### 5\.2Theoretical Framework and Algorithm

We begin by revisiting the update rule of Self-Play Policy Optimization (SPPO) (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)). In SPPO, the policy is updated to approximate the Nash Equilibrium of a constant-sum game defined by a fixed preference model $\mathbb{P}$. In addition, we define the winning probability of one response $\mathbf{y}$ against a distribution of responses $\pi$ as $\mathbb{P}(\mathbf{y}\succ\pi|\mathbf{x})=\mathbb{E}_{\mathbf{y}'\sim\pi(\cdot|\mathbf{x})}[\mathbb{P}(\mathbf{y}\succ\mathbf{y}'|\mathbf{x})]$. The iterative update is given by:

$$\pi_{t+1}=\mathop{\arg\min}\limits_{\pi}\ \mathbb{E}_{\mathbf{x}\sim\mathcal{X},\,\mathbf{y}\sim\pi_{t}(\cdot|\mathbf{x})}\left[\left(\log\frac{\pi(\mathbf{y}|\mathbf{x})}{\pi_{t}(\mathbf{y}|\mathbf{x})}-\eta\left(\widehat{\mathbb{P}}(\mathbf{y}\succ\pi_{t}|\mathbf{x})-\frac{1}{2}\right)\right)^{2}\right], \tag{8}$$

where $\widehat{\mathbb{P}}(\mathbf{y}\succ\pi_{t}|\mathbf{x})=\frac{1}{K}\sum_{k=1}^{K}\mathbb{P}(\mathbf{y}\succ\mathbf{y}_{k}|\mathbf{x})$ is empirically estimated via $K$ samples. Effectively, this performs a multiplicative weight update:

$$\pi_{t+1}(\mathbf{y}|\mathbf{x})\propto\pi_{t}(\mathbf{y}|\mathbf{x})\exp\left(\eta\,\mathbb{P}(\mathbf{y}\succ\pi_{t}|\mathbf{x})\right). \tag{9}$$
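
The multiplicative weight update of Eq. (9) can be sketched on a small tabular game. This is an illustrative toy under stated assumptions, not the paper's training code: the three-response preference matrix `P`, the initial policy, and the horizon `T` are hypothetical, and the policy is a plain probability vector rather than a language model.

```python
import numpy as np

# Hypothetical 3-response preference matrix: P[i, j] is the probability that
# response i beats response j. It is cyclic (0 > 1 > 2 > 0) and P + P.T = 1.
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])

def duality_gap(P, pi):
    """max_i P(i > pi) - min_j P(pi > j) for a mixed policy pi."""
    return np.max(P @ pi) - np.min(pi @ P)

def multiplicative_weights(P, pi0, eta, T):
    """Iterate the exponential-weight update of Eq. (9); return the average policy."""
    pi = pi0.copy()
    running_sum = np.zeros_like(pi)
    for _ in range(T):
        running_sum += pi
        win_prob = P @ pi                 # P(y > pi_t) for every response y
        pi = pi * np.exp(eta * win_prob)  # exponential re-weighting
        pi /= pi.sum()                    # normalisation hidden in the "proportional to"
    return running_sum / T

T = 4000
pi0 = np.array([0.6, 0.3, 0.1])
pi_bar = multiplicative_weights(P, pi0, eta=1.0 / np.sqrt(T), T=T)
gap_start = duality_gap(P, pi0)
gap_avg = duality_gap(P, pi_bar)
```

With $\eta=1/\sqrt{T}$, the averaged policy `pi_bar` has a much smaller duality gap than the starting policy, moving toward the uniform Nash equilibrium of this symmetric cyclic game.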
However, the assumption of a static $\mathbb{P}$ throughout training restricts the flexibility of the optimization trajectory. Real-world preferences often exhibit nuanced multi-dimensional dynamics, which are effectively disentangled by advanced preference models like the HRC model. Constraining the alignment process to a fixed proxy limits the algorithm’s ability to exploit these rich structures for dynamic guidance. To address this, we propose Dynamic Self-Play Preference Optimization (DSPPO), which allows the preference model $\mathbb{P}$ to evolve over time.

Problem Formulation. Let $s_{\infty}(\mathbf{y},\mathbf{y}')=\log\frac{\mathbb{P}(\mathbf{y}\succ\mathbf{y}')}{1-\mathbb{P}(\mathbf{y}\succ\mathbf{y}')}$ denote the preference score of the real preference model $\mathbb{P}(\mathbf{y}\succ\mathbf{y}')$. We introduce a sequence of time-varying preference scores $\{s_{t}(\mathbf{y},\mathbf{y}')\}_{t=1}^{T}$ and their corresponding probabilities $\mathbb{P}_{t}(\mathbf{y}\succ\mathbf{y}')=\sigma(s_{t}(\mathbf{y},\mathbf{y}'))$. We make the following standard assumptions:

###### Assumption 5.1 (Boundedness).

Both the dynamic and true scores are bounded, i.e., $s_{t},s_{\infty}\in[-\rho,\rho]$ for some constant $\rho>0$.

###### Assumption 5.2 (Convergence).

The dynamic scores converge to the true scores effectively. Specifically, let $\epsilon_{t}=\max_{y,y'}|s_{\infty}(y,y')-s_{t}(y,y')|$. We assume the average error decays at a rate of $O(1/\sqrt{T})$, i.e., $\frac{1}{T}\sum_{t=1}^{T}\epsilon_{t}=O(1/\sqrt{T})$. A typical schedule satisfying this is $\epsilon_{t}=O(1/\sqrt{t})$.

At each iteration $t$, DSPPO optimizes the policy via the following learning objective:

$$\pi_{t+1}=\mathop{\arg\min}\limits_{\pi}\ \mathbb{E}_{\mathbf{x}\sim\mathcal{X},\,\mathbf{y}\sim\pi_{t}(\cdot|\mathbf{x})}\left[\left(\log\frac{\pi(\mathbf{y}|\mathbf{x})}{\pi_{t}(\mathbf{y}|\mathbf{x})}-\eta\left(\widehat{\mathbb{P}}_{t}(\mathbf{y}\succ\pi_{t}|\mathbf{x})-\frac{1}{2}\right)\right)^{2}\right], \tag{10}$$

where $\widehat{\mathbb{P}}_{t}$ is computed using the current preference model $\mathbb{P}_{t}$, in the same way as in SPPO.
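
The per-batch objective of Eq. (10) can be written compactly. The following is a minimal NumPy sketch under stated assumptions: log-probabilities are supplied as plain arrays, `score_fn` stands in for the time-varying score $s_t$, and the function names `estimated_win_prob` and `dsppo_loss` are hypothetical rather than taken from the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def estimated_win_prob(score_fn, y, opponents):
    """Average P_t(y > y_k) over K sampled opponents, with P_t = sigmoid(s_t)."""
    return float(np.mean([sigmoid(score_fn(y, y_k)) for y_k in opponents]))

def dsppo_loss(logp_new, logp_old, win_prob_hat, eta):
    """Mean squared residual of Eq. (10) over a batch of sampled responses.

    logp_new / logp_old: log pi(y|x) and log pi_t(y|x) for each sampled response.
    win_prob_hat: estimated probability that each response beats pi_t.
    """
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    target = eta * (np.asarray(win_prob_hat, dtype=float) - 0.5)
    return float(np.mean((log_ratio - target) ** 2))
```

The loss is zero exactly when the log-ratio of every sampled response equals $\eta(\widehat{\mathbb{P}}_{t}-\tfrac{1}{2})$, so responses that win more than half the time are pushed up and the rest are pushed down. Replacing `score_fn` with a fixed score recovers the SPPO objective of Eq. (8).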

###### Theorem 5\.3\.

Assume the optimization oracle is realizable. Let $\pi_{t}$ be the policy obtained at step $t$ and $\bar{\pi}_{T}=\frac{1}{T}\sum_{t=1}^{T}\pi_{t}$ be the mixture policy. By setting the learning rate $\eta=\Theta(1/\sqrt{T})$, we have that:

$$\max_{\pi}\mathbb{P}(\pi\succ\bar{\pi}_{T})-\min_{\pi}\mathbb{P}(\pi\prec\bar{\pi}_{T})=O\left(\frac{1}{\sqrt{T}}\right). \tag{11}$$

[Equation 11](https://arxiv.org/html/2605.17342#S5.E11) (see proof in [Section B.4](https://arxiv.org/html/2605.17342#A2.SS4)) characterizes the $O(1/\sqrt{T})$ convergence rate of DSPPO’s average policy toward the Nash equilibrium in terms of the duality gap, demonstrating robustness to time-varying preference signals provided the preference model converges sufficiently fast.

### 5\.3Implementation of DSPPO with HRC model

We now instantiate the general DSPPO framework using our proposed HRC model. As shown in [Section 4.1](https://arxiv.org/html/2605.17342#S4.SS1), the HRC model decomposes the preference score into transitive and cyclic components: $s_{\infty}=s_{T}+s_{C}$. To construct a dynamic schedule, we introduce a hyperparameter $\lambda$ and define the time-varying score $s_{t}$ as:

$$s_{t}=\left(1+\frac{\lambda}{\sqrt{t}}\right)s_{T}+\left(1-\frac{\lambda}{\sqrt{t}}\right)s_{C}. \tag{12}$$

In this work, we primarily adopt $\lambda>0$ to regulate the convergence trajectory; we investigate the impact of $\lambda$ in DSPPO, including the regime $\lambda<0$ and alternative schedule forms, in [Section C.4](https://arxiv.org/html/2605.17342#A3.SS4). This schedule satisfies Assumptions 5.1 and 5.2. First, regarding boundedness: as detailed in [Section 4.2](https://arxiv.org/html/2605.17342#S4.SS2), we enforce explicit constraints on the model outputs: the transitive score $s_{T}$ is clipped to the range $[-\delta,\delta]$, while the boundedness of the cyclic score $s_{C}$ is inherently guaranteed by the GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)). Consequently, $s_{t}$, being a linear combination of strictly bounded terms, is guaranteed to remain within a compact range $[-\rho,\rho]$. Second, regarding convergence: the approximation error is given by $|s_{\infty}-s_{t}|=\left|\frac{\lambda}{\sqrt{t}}s_{C}-\frac{\lambda}{\sqrt{t}}s_{T}\right|\leq\frac{|\lambda|}{\sqrt{t}}(|s_{C}|+|s_{T}|)$. Letting $C=\max(|s_{C}|+|s_{T}|)$, we have $\epsilon_{t}\leq\frac{|\lambda|C}{\sqrt{t}}$. Consequently, $\frac{1}{T}\sum_{t=1}^{T}\epsilon_{t}\leq\frac{1}{T}\int_{0}^{T}|\lambda|Ct^{-1/2}\,dt=\frac{2|\lambda|C}{\sqrt{T}}=O(1/\sqrt{T})$, fulfilling the convergence condition (Assumption 5.2) required for [Equation 11](https://arxiv.org/html/2605.17342#S5.E11).
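
The schedule of Eq. (12) and its error bound can be checked numerically on a toy game. This is a sketch under stated assumptions, not the paper's pipeline: the rewards `r`, the cyclic matrix `s_C`, the value $\lambda=1$, and the horizon `T` are hypothetical, and the policy is a tabular distribution over three responses updated with the exponential-weight rule of Eq. (9).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dynamic_score(s_T, s_C, t, lam):
    """Eq. (12): for lam > 0, up-weight the transitive part and down-weight the cyclic part early on."""
    w = lam / np.sqrt(t)
    return (1.0 + w) * s_T + (1.0 - w) * s_C

# Hypothetical decomposed scores for 3 responses (both matrices are skew-symmetric).
r = np.array([0.3, 0.0, -0.3])                     # scalar rewards
s_T = r[:, None] - r[None, :]                      # transitive part
s_C = 0.8 * np.array([[0.0, 1.0, -1.0],
                      [-1.0, 0.0, 1.0],
                      [1.0, -1.0, 0.0]])           # cyclic part
s_inf = s_T + s_C
lam, T = 1.0, 4000
C = np.max(np.abs(s_C) + np.abs(s_T))
steps = np.arange(1, T + 1)

# Check eps_t <= |lam| * C / sqrt(t) and the averaged bound 2 * |lam| * C / sqrt(T).
eps = np.array([np.max(np.abs(s_inf - dynamic_score(s_T, s_C, t, lam))) for t in steps])
per_step_ok = bool(np.all(eps <= abs(lam) * C / np.sqrt(steps) + 1e-12))
avg_eps = eps.mean()
avg_bound = 2.0 * abs(lam) * C / np.sqrt(T)

# Toy DSPPO loop: exponential-weight update against P_t = sigmoid(s_t).
P_inf = sigmoid(s_inf)
pi = np.full(3, 1.0 / 3.0)
pi_sum = np.zeros(3)
eta = 1.0 / np.sqrt(T)
for t in steps:
    pi_sum += pi
    P_t = sigmoid(dynamic_score(s_T, s_C, t, lam))
    pi = pi * np.exp(eta * (P_t @ pi))
    pi /= pi.sum()
pi_bar = pi_sum / T

# Duality gap of the averaged policy, measured under the target game P_inf.
gap = np.max(P_inf @ pi_bar) - np.min(pi_bar @ P_inf)
```

Even though every update uses the perturbed game $\mathbb{P}_{t}$, the gap is measured against the target $\mathbb{P}_{\infty}$ and remains small here, consistent with the $O(1/\sqrt{T})$ behaviour stated in Theorem 5.3 for this particular schedule.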

## 6Experiments

### 6\.1Modeling Cyclic Preference

To directly validate the “dominant + cycle” setting motivating HRC, we construct synthetic preference datasets based on UltraFeedback (Cui et al., [2024](https://arxiv.org/html/2605.17342#bib.bib12)), following prior work (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)). Each prompt contains four candidate responses with multi-dimensional annotations. We derive (i) cyclic preferences among three responses, and (ii) dominant + cycle settings by introducing an additional response that consistently outperforms the others.

We train GPM and HRC under identical configurations and observe a two-stage learning behavior: models first identify the dominant candidate (accuracy improves from $\sim$50% to $\sim$75%), and then learn cyclic preferences (75% to 100%). HRC (dim=$2+1$, $4+1$) finishes the first stage faster and achieves higher final accuracy than GPM (dim=4), while low-dimensional GPM (dim=2) fails to capture cyclic structure. These results provide direct empirical support that HRC more effectively models mixed transitive–cyclic preference structures.

Details of the dataset construction and experimental settings are provided in[SectionC\.1](https://arxiv.org/html/2605.17342#A3.SS1)\.

Table 1: Comparison of preference modeling capabilities on RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2605.17342#bib.bib30)). We evaluate the baselines (BT, GPM) and our HRC model using Gemma-2B-it and Llama-3.1-8B-Instruct across six domains: Factuality, Precise Instruction Following (IF), Math, Safety, Focus, and Ties. Note that the BT and HRC models can be viewed as special cases of GPM with specific embedding constraints. The highest scores are highlighted in bold.

| Base Model & Method | Factuality | Precise IF | Math | Safety | Focus | Ties | Average |
|---|---|---|---|---|---|---|---|
| Gemma-2B-it + BT (dim=1) | 45.68 | 32.50 | 62.30 | 80.67 | 77.17 | 37.25 | 55.93 |
| Gemma-2B-it + GPM (dim=2) | 47.16 | 33.75 | 62.84 | 78.00 | 71.92 | 38.24 | 55.32 |
| Gemma-2B-it + GPM (dim=4) | 43.16 | **36.25** | **64.48** | 81.11 | 76.16 | 37.25 | 56.40 |
| Gemma-2B-it + HRC (dim=2+1) | **47.58** | 35.63 | 61.75 | 82.00 | **79.60** | **39.22** | **57.63 (+1.23)** |
| Gemma-2B-it + HRC (dim=4+1) | 45.89 | 33.75 | 62.30 | **83.78** | 77.78 | **39.22** | 57.12 |
| Llama-3.1-8B + BT (dim=1) | 64.63 | 34.38 | **64.48** | 92.67 | 90.91 | 73.53 | 70.10 |
| Llama-3.1-8B + GPM (dim=2) | 66.53 | 33.75 | 62.30 | 92.22 | 94.14 | 71.57 | 70.08 |
| Llama-3.1-8B + GPM (dim=4) | 67.58 | 33.12 | 57.92 | 92.22 | 93.74 | 73.53 | 69.69 |
| Llama-3.1-8B + HRC (dim=2+1) | **68.42** | **35.00** | 60.11 | **92.89** | 94.75 | **74.51** | **70.95 (+0.85)** |
| Llama-3.1-8B + HRC (dim=4+1) | 68.00 | 32.50 | 61.20 | 92.67 | **95.15** | **74.51** | 70.67 |

### 6\.2Preference Modeling Capability

Experimental Setup and Evaluation. We evaluate preference modeling capabilities using Gemma-2B-it (Team et al., [2024a](https://arxiv.org/html/2605.17342#bib.bib39)) and Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.17342#bib.bib18)), trained on the Skywork-Reward-Preference-80K-v0.2 dataset (Liu et al., [2024](https://arxiv.org/html/2605.17342#bib.bib29)). We benchmark our HRC model against the BT model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)) and GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)) on RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2605.17342#bib.bib30)). This benchmark is selected for its rigorous Best-of-N ($N=4$) evaluation format and its unique Ties domain, which effectively tests the model’s robustness in distinguishing valid answers from incorrect ones without overfitting to noise among equivalently valid options. For detailed experimental setups, refer to [Section C.3](https://arxiv.org/html/2605.17342#A3.SS3). To facilitate a rigorous comparison with the results reported in GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), we independently conducted evaluations and report the results on RewardBench (Lambert et al., [2025](https://arxiv.org/html/2605.17342#bib.bib26)) in [Section C.5](https://arxiv.org/html/2605.17342#A3.SS5).

Table 2: Ablation study on RewardBench 2. We evaluate the impact of removing key components (Context-Aware Gating, Reward Clipping, Unit Norm) across two models (Gemma-2B-it, Llama-3.1-8B-Instruct) and two dimension settings ($2+1$, $4+1$). The Full Model serves as the baseline, and $\Delta$ denotes the change in Average accuracy compared to the Full Model (↓ drop, ↑ gain). The highest scores in each group are in bold.

| Base Model | Dimension | Model Setting | Factuality | Precise IF | Math | Safety | Focus | Ties | Average | $\Delta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemma-2B-it | 2+1 | HRC (Full) | 47.58 | **35.63** | 61.75 | 82.00 | **79.60** | 39.22 | **57.63** | - |
| Gemma-2B-it | 2+1 | w/o Context Gating | 45.89 | 35.00 | 61.75 | **83.11** | 74.95 | 38.24 | 56.49 | ↓1.14 |
| Gemma-2B-it | 2+1 | w/o Reward Clipping | **47.79** | **35.63** | 61.75 | 81.33 | 75.96 | **40.20** | 57.11 | ↓0.52 |
| Gemma-2B-it | 2+1 | w/o Unit Norm | 46.95 | 34.38 | **63.39** | 81.11 | 76.97 | 38.24 | 56.84 | ↓0.79 |
| Gemma-2B-it | 4+1 | HRC (Full) | 45.89 | 33.75 | 62.30 | **83.78** | 77.78 | **39.22** | 57.12 | - |
| Gemma-2B-it | 4+1 | w/o Context Gating | **46.53** | 35.00 | 61.75 | 82.00 | 78.99 | 36.27 | 56.76 | ↓0.36 |
| Gemma-2B-it | 4+1 | w/o Reward Clipping | 42.74 | **36.88** | **63.93** | 82.00 | **81.62** | 36.27 | **57.24** | ↑0.12 |
| Gemma-2B-it | 4+1 | w/o Unit Norm | 45.68 | 35.63 | 61.20 | 81.56 | 77.37 | **39.22** | 56.78 | ↓0.34 |
| Llama-3.1-8B-Instruct | 2+1 | HRC (Full) | **68.42** | 35.00 | 60.11 | **92.89** | **94.75** | 74.51 | **70.95** | - |
| Llama-3.1-8B-Instruct | 2+1 | w/o Context Gating | 66.53 | 35.63 | **60.66** | 92.22 | **94.75** | 73.53 | 70.55 | ↓0.40 |
| Llama-3.1-8B-Instruct | 2+1 | w/o Reward Clipping | 67.16 | **36.25** | 60.11 | 92.67 | 93.94 | 73.53 | 70.61 | ↓0.34 |
| Llama-3.1-8B-Instruct | 2+1 | w/o Unit Norm | 68.00 | 32.50 | 57.92 | 92.44 | 93.54 | **75.49** | 69.98 | ↓0.97 |
| Llama-3.1-8B-Instruct | 4+1 | HRC (Full) | 68.00 | 32.50 | 61.20 | 92.67 | **95.15** | **74.51** | 70.67 | - |
| Llama-3.1-8B-Instruct | 4+1 | w/o Context Gating | 66.74 | **35.63** | **62.30** | **93.33** | 94.75 | 73.53 | 71.04 | ↑0.37 |
| Llama-3.1-8B-Instruct | 4+1 | w/o Reward Clipping | 67.79 | 33.75 | 59.02 | 92.67 | **95.15** | **74.51** | 70.48 | ↓0.19 |
| Llama-3.1-8B-Instruct | 4+1 | w/o Unit Norm | **68.42** | 35.00 | 61.75 | 92.89 | 94.95 | **74.51** | **71.25** | ↑0.58 |

Results and Analysis. The results are presented in [Table 1](https://arxiv.org/html/2605.17342#S6.T1). Across both model scales, the HRC model consistently surpasses both the Bradley-Terry (BT) and General Preference Model (GPM) baselines in average accuracy. On Gemma-2B-it, HRC achieves an average accuracy of 57.63% (dim=$2+1$). This exceeds the strongest baseline in this category (GPM with dim=4, 56.40%) by 1.23%, and also outperforms the scalar BT model (55.93%). Notably, in the Factuality domain, HRC surpasses both baselines (BT: 45.68%, GPM: 47.16%) to reach 47.58%, demonstrating superior capability in detecting subtle hallucinations. For the larger Llama-3.1-8B-Instruct, simple scalar models perform surprisingly well, with BT (70.10%) slightly edging out GPM (70.08%). However, HRC (dim=$2+1$) breaks this ceiling, achieving an average score of 70.95%, an improvement of 0.85% over the best-performing baseline (BT). Crucially, in the Ties subset, which tests the ability to handle equivalent correct answers without forcing arbitrary rankings, HRC demonstrates consistent superiority. On Llama-3.1-8B-Instruct, it achieves 74.51%, outperforming both the strictly transitive BT (73.53%) and GPM (73.53%). This indicates that HRC’s structure successfully combines the transitive rewards with the cyclic embeddings, allowing it to resolve preference landscapes that neither BT nor GPM can master alone.

Ablation Studies. Our main results confirm that the explicit decoupling of transitive and cyclic components is essential for performance. To further validate the specific architectural choices that enable this effective decomposition, we conducted comprehensive ablation studies against the following variants:

- w/o Context-Aware Gating: We replace the input-dependent gating matrix $D(\mathbf{x})$ with a learnable static scalar parameter $\lambda$. The preference score becomes $s=r(\mathbf{y}_{w})-r(\mathbf{y}_{l})+\lambda\mathbf{v}_{w}^{\top}\mathbf{W}\mathbf{v}_{l}$. This variant tests whether the model requires dynamic adjustment of the cyclic component’s weight based on the input prompt.
- w/o Reward Clipping: We remove the clipping operation on the scalar reward $r(\mathbf{y})$, allowing it to take values in $(-\infty,\infty)$.
- w/o Unit Norm: We remove the $\|\mathbf{v}\|_{2}=1$ constraint on the cyclic embeddings, allowing the vector magnitude to vary freely during training.
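
The three switches can be sketched on the static-gate score formula given above. This is a minimal illustration under stated assumptions, not the released implementation: $\mathbf{W}$ is taken to be a single $2\times 2$ skew-symmetric block, clipping is applied to each scalar reward with a hypothetical bound `delta`, and the function name `hrc_score_static_gate` is made up for this example. The full model's input-dependent gating $D(\mathbf{x})$ is not modelled here.

```python
import numpy as np

# Assumed 2x2 skew-symmetric block; larger even dimensions would use a block-diagonal W.
W = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def hrc_score_static_gate(r_w, r_l, v_w, v_l, lam=1.0, delta=None, unit_norm=True):
    """Score of the 'w/o Context-Aware Gating' variant: r_w - r_l + lam * v_w^T W v_l.

    delta=None disables reward clipping ('w/o Reward Clipping');
    unit_norm=False skips the norm constraint ('w/o Unit Norm').
    """
    v_w = np.asarray(v_w, dtype=float)
    v_l = np.asarray(v_l, dtype=float)
    if delta is not None:
        # Clip each scalar reward to [-delta, delta] (assumed form of the clipping).
        r_w = np.clip(r_w, -delta, delta)
        r_l = np.clip(r_l, -delta, delta)
    if unit_norm:
        # Project cyclic embeddings onto the unit sphere so the cyclic term is bounded.
        v_w = v_w / np.linalg.norm(v_w)
        v_l = v_l / np.linalg.norm(v_l)
    return float(r_w - r_l + lam * (v_w @ W @ v_l))
```

With clipping and unit norm both active, the two terms are bounded, which is what keeps the score within a compact range; turning either switch off lets the corresponding term grow without limit.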

We find that removing the Context-Aware Gating mechanism or the stability constraints degrades average accuracy in most settings, particularly in Focus. The results of the ablation studies are shown in [Table 2](https://arxiv.org/html/2605.17342#S6.T2).

Empirical Verification of Theoretical Assumptions. Regarding the zero-mean condition mentioned in [Theorem 4.6](https://arxiv.org/html/2605.17342#S4.Thmtheorem6), the learned embeddings deviate from $\mathbb{E}[\mathbf{v}]=0$. For example, for HRC (dim=2+1) trained on the Skywork-Reward dataset with Gemma-2B-it, we obtain $\mathbb{E}[\mathbf{v}]\approx(0.13,-0.51)$. This suggests that the GPM component may indeed capture some transitive signal. Nevertheless, as shown in [Table 1](https://arxiv.org/html/2605.17342#S6.T1), HRC remains more effective than GPM in capturing complex preference structures despite such deviations.

![Refer to caption](https://arxiv.org/html/2605.17342v1/x2.png)

Figure 2:Iteration 3 Alignment Performance across Three Benchmarks\.We compare BT\+SPPO, GPM\+SPPO, HRC\+SPPO, and HRC\+DSPPO at Iteration 3 on AlpacaEval 2\.0 \(LC\. Win Rate\), MT\-Bench \(Average Score\), and Arena\-Hard\-v0\.1 \(Win Rate\)\. Each panel reports results using both Gemma\-2B\-it and Llama\-3\.1\-8B\-Instruct as preference models\.
### 6\.3Downstream Performance on Aligning Language Models with Human Preferences

We further investigate the effectiveness of our framework in language model alignment. Our evaluation focuses on two key aspects: (1) the quality of preference signals provided by our HRC model compared to the BT model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)) and GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), and (2) the alignment efficiency of the DSPPO algorithm compared to standard Self-Play Policy Optimization (SPPO).

Experimental Setup and Baselines. For the implementation of SPPO and DSPPO, we sampled responses using prompts derived from the UltraFeedback dataset (Cui et al., [2024](https://arxiv.org/html/2605.17342#bib.bib12)). To rigorously evaluate the contribution of each component, we design the following comparative settings: (1) Impact of Preference Models. We fix the alignment algorithm to SPPO (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)) and vary the source of preference signals: the BT model (Bradley & Terry, [1952](https://arxiv.org/html/2605.17342#bib.bib8)), GPM (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), and our HRC model. (2) Impact of Alignment Algorithms. We compare SPPO against DSPPO, using the HRC model as the underlying preference oracle for both; for DSPPO, we employ the implementation with the HRC model described in [Section 5.3](https://arxiv.org/html/2605.17342#S5.SS3) and set $\lambda=1$. The combination of HRC + DSPPO represents our full method. For detailed experimental setups, refer to [Section C.3](https://arxiv.org/html/2605.17342#A3.SS3).

Evaluation. Our preliminary experiments and previous work (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)) indicate that GPM-based alignment tends to yield longer responses. To comprehensively evaluate our framework, we employ three complementary benchmarks: (1) AlpacaEval 2.0 (Dubois et al., [2024](https://arxiv.org/html/2605.17342#bib.bib15)) focuses on length-controlled win rates to mitigate verbosity bias; (2) MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.17342#bib.bib51)) assesses multi-turn conversation and instruction-following capabilities; and (3) Arena-Hard-v0.1 (Li et al., [2025](https://arxiv.org/html/2605.17342#bib.bib28)) provides a rigorous assessment on challenging prompts requiring complex reasoning. All benchmarks use GPT-family models as evaluators for consistent quality assessment: GPT-4o-mini for AlpacaEval 2.0 (selected for efficiency over GPT-4-Turbo while maintaining high agreement with human preferences), GPT-4 for MT-Bench, and GPT-4-Turbo for Arena-Hard-v0.1. We acknowledge that relying on GPT-based evaluation may introduce potential judge bias. To address this concern, we conduct additional validation analyses, including GPT vs. human agreement studies, inter-annotator agreement measurements, and cross-evaluation using multiple GPT variants. These analyses are detailed in [Section C.7](https://arxiv.org/html/2605.17342#A3.SS7), demonstrating that our GPT-based evaluation pipeline produces reliable and robust results.

Results and Analysis. The comparative results in [Figure 2](https://arxiv.org/html/2605.17342#S6.F2) reveal four key insights. First, the HRC model establishes a stronger foundation than BT and GPM, achieving consistently higher baselines across all metrics. Second, DSPPO significantly enhances alignment efficiency, particularly on Gemma-2B-it, where it pushes the length-controlled win rate (LC WR) to a peak of 44.75% (+1.75% over static HRC) on AlpacaEval 2.0 and maintains robustness on MT-Bench (8.29 vs. 7.70 for GPM), effectively preventing the optimization collapse observed in baselines. Third, as shown in [Figure 2](https://arxiv.org/html/2605.17342#S6.F2), the evaluation on Arena-Hard-v0.1 further validates the robustness of our framework. On Gemma-2B-it, HRC+DSPPO achieves a remarkable win rate of 46.8% in the final iteration, significantly outperforming both BT+SPPO (40.9%) and GPM+SPPO (42.1%). Notably, our method achieves a 3.2% improvement over the best baseline, demonstrating superior capability in handling challenging prompts that require complex reasoning. On Llama-3.1-8B-Instruct, HRC+DSPPO peaks at 45.5% in iteration 2 (shown in [Table 8](https://arxiv.org/html/2605.17342#A3.T8)), showcasing consistent performance gains across different model scales. Finally, the results on Llama-3.1-8B-Instruct in iteration 2, where static HRC+SPPO marginally outperforms DSPPO on MT-Bench (8.42 vs. 8.14), suggest that alignment efficacy is sensitive to the hyperparameter $\lambda$. This indicates that while DSPPO promotes necessary exploration, larger models with stronger intrinsic capabilities may require a more conservative balance between the cyclic and transitive components to maintain stability in multi-turn reasoning. For additional evaluation metrics, refer to [Section C.6](https://arxiv.org/html/2605.17342#A3.SS6).

Scalability to Stronger Backbones. To further assess the scalability of our framework, we conduct additional experiments applying the same preference models (trained on Gemma-2B-it) to post-train a larger backbone, Gemma-2-9B-it (Team et al., [2024b](https://arxiv.org/html/2605.17342#bib.bib40)), and evaluate on AlpacaEval 2.0. HRC+DSPPO achieves a peak length-controlled win-rate of 52.20%, substantially outperforming BT+SPPO (48.79%) and the base model (38.38%). Detailed setup and results are provided in [Section C.8](https://arxiv.org/html/2605.17342#A3.SS8).

## 7Conclusion

In this paper, we revisited the mathematical foundations of preference modeling in RLHF, demonstrating that complex human preferences can be explicitly decomposed into two components: transitivity and cyclicity. Building on this, we introduced the Hybrid Reward-Cyclic (HRC) model, a game-theoretic framework that explicitly decomposes preferences into transitive and cyclic components. Theoretical analysis confirms that the HRC model overcomes the structural shortcomings of preference models such as BT and GPM. Complementing this, we proposed Dynamic Self-Play Preference Optimization (DSPPO), a novel alignment algorithm that treats policy optimization as a time-varying game. By incorporating the decomposed preference signals from the HRC model to dynamically modulate the interplay between transitive guidance and cyclic exploration, DSPPO achieves a superior convergence trajectory. Extensive evaluations demonstrate that our framework consistently outperforms static baselines, leading to significant performance improvements in downstream tasks.

## Impact Statement

This paper presents work to advance the fundamental pursuit of aligning Artificial Intelligence with the rich complexity of human values\. Faithfully representing such intricate preference structures is essential for developing AI systems that are not only robust and reliable but also intrinsically capable of respecting the heterogeneous nature of human judgment\. We hope that advancements in this direction contribute to enabling future technologies to more safely and responsively navigate the diverse ethical and cultural landscapes of real\-world deployment, potentially fostering greater trust in automated decision\-making\.

## Acknowledgements

This work was supported in part by National Natural Science Foundation of China \(62476070\), Shenzhen Science and Technology Program \(JCYJ20241202123503005, GXWD20231128103232001, ZDSYS20230626091203008, KQTD20240729102154066\), Department of Science and Technology of Guangdong \(2024A1515011540\), National Key R&D Program of China \(SQ2024YFE0200592\) and Suzhou Science and Technology Program \(SYG2025072\)\.

## References

- Achiam et al. (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. *arXiv preprint arXiv:2303.08774*, 2023.
- Azar et al. (2024) Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences. In *Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS)*, pp. 4447–4455, 2024.
- Bai et al. (2022) Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. *arXiv preprint arXiv:2204.05862*, 2022.
- Balduzzi et al. (2018) Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating evaluation. In *Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS)*, pp. 3272–3283, 2018.
- Balduzzi et al. (2019) Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T. Open-ended learning in symmetric zero-sum games. In *Proceedings of the 36th International Conference on Machine Learning (ICML)*, pp. 434–443, 2019.
- Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In *Proceedings of the 26th International Conference on Machine Learning (ICML)*, pp. 41–48, 2009.
- Bertrand et al. (2023) Bertrand, Q., Czarnecki, W. M., and Gidel, G. On the limitations of the Elo, real-world games are transitive, not additive. In *Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS)*, pp. 2905–2921, 2023.
- Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. *Biometrika*, 39(3/4):324–345, 1952.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In *Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS)*, pp. 1877–1901, 2020.
- Chen & Joachims (2016) Chen, S. and Joachims, T. Modeling intransitivity in matchup and comparison data. In *Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM)*, pp. 227–236, 2016.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In *Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS)*, 2017.
- Cui et al. (2024) Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., et al. UltraFeedback: Boosting language models with scaled AI feedback. In *Proceedings of the 41st International Conference on Machine Learning (ICML)*, pp. 9722–9744, 2024.
- Czarnecki et al. (2020) Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M. Real world games look like spinning tops. In *Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS)*, pp. 17443–17454, 2020.
- Dong et al. (2024) Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., and Zhang, T. RLHF workflow: From reward modeling to online RLHF. *Transactions on Machine Learning Research*, 2024, 2024.
- Dubois et al. (2024) Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. *arXiv preprint arXiv:2404.04475*, 2024.
- Freund & Schapire (1999) Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. *Games and Economic Behavior*, 29(1-2):79–103, 1999.
- Gehrlein (2006) Gehrlein, W. V. *Condorcet’s paradox*. Springer, 2006.
- Grattafiori et al. (2024) Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. *arXiv preprint arXiv:2407.21783*, 2024.
- Hacohen & Weinshall (2019) Hacohen, G. and Weinshall, D. On the power of curriculum learning in training deep networks. In *Proceedings of the 36th International Conference on Machine Learning (ICML)*, pp. 2535–2544, 2019.
- Hong et al. (2025) Hong, Y., Zhang, H., Bao, J., Jiang, H., et al. Energy-based preference model offers better offline alignment than the Bradley-Terry preference model. In *Proceedings of the 42nd International Conference on Machine Learning (ICML)*, 2025.
- Horn & Johnson (2012) Horn, R. A. and Johnson, C. R. *Matrix analysis*. Cambridge University Press, 2012.
- Hu et al. (2025) Hu, J., Wu, X., Shen, W., Liu, J. K., Wang, W., Jiang, S., Wang, H., Chen, H., Chen, B., Fang, W., et al. OpenRLHF: A Ray-based easy-to-use, scalable and high-performance RLHF framework. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pp. 656–666, 2025.
- Ibarz et al. (2018) Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in Atari. In *Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS)*, pp. 8022–8034, 2018.
- Ji et al. (2023) Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. AI alignment: A comprehensive survey. *arXiv preprint arXiv:2310.19852*, 2023.
- Jiang et al. (2023) Jiang, D., Ren, X., and Lin, B. Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)*, pp. 14165–14178, 2023.
- Lambert et al. (2025) Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. RewardBench: Evaluating reward models for language modeling. In *Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)*, pp. 1755–1797, 2025.
- Li et al\. \(2024\)Li, J\., Yang, Y\., Bai, Y\., Zhou, X\., Li, Y\., Sun, H\., Liu, Y\., Si, X\., Ye, Y\., Wu, Y\., et al\.Fundamental capabilities of large language models and their applications in domain scenarios: A survey\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, pp\. 11116–11141, 2024\.
- Li et al\. \(2025\)Li, T\., Chiang, W\.\-L\., Frick, E\., Dunlap, L\., Wu, T\., Zhu, B\., Gonzalez, J\. E\., and Stoica, I\.From crowdsourced data to high\-quality benchmarks: Arena\-hard and benchbuilder pipeline\.In*Proceedings of the 42nd International Conference on Machine Learning \(ICML\)*, pp\. 34209–34231\. PMLR, 2025\.
- Liu et al\. \(2024\)Liu, C\. Y\., Zeng, L\., Liu, J\., Yan, R\., He, J\., Wang, C\., Yan, S\., Liu, Y\., and Zhou, Y\.Skywork\-reward: Bag of tricks for reward modeling in llms\.*arXiv preprint arXiv:2410\.18451*, 2024\.
- Malik et al\. \(2025\)Malik, S\., Pyatkin, V\., Land, S\., Morrison, J\., Smith, N\. A\., Hajishirzi, H\., and Lambert, N\.Rewardbench 2: Advancing reward model evaluation\.*arXiv preprint arXiv:2506\.01937*, 2025\.
- Munos et al\. \(2024\)Munos, R\., Valko, M\., Calandriello, D\., Azar, M\. G\., Rowland, M\., Guo, Z\. D\., Tang, Y\., Geist, M\., Mesnard, T\., Fiegel, C\., et al\.Nash learning from human feedback\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, 2024\.
- Paszke et al\. \(2019\)Paszke, A\., Gross, S\., Massa, F\., Lerer, A\., Bradbury, J\., Chanan, G\., Killeen, T\., Lin, Z\., Gimelshein, N\., Antiga, L\., et al\.Pytorch: an imperative style, high\-performance deep learning library\.In*Proceedings of the 33rd International Conference on Neural Information Processing Systems \(NeurIPS\)*, pp\. 8026–8037, 2019\.
- Rafailov et al\. \(2023\)Rafailov, R\., Sharma, A\., Mitchell, E\., Manning, C\. D\., Ermon, S\., and Finn, C\.Direct preference optimization: Your language model is secretly a reward model\.In*Proceedings of the 37th International Conference on Neural Information Processing Systems \(NeurIPS\)*, pp\. 53728–53741, 2023\.
- Rajbhandari et al\. \(2020\)Rajbhandari, S\., Rasley, J\., Ruwase, O\., and He, Y\.Zero: Memory optimizations toward training trillion parameter models\.In*Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis \(SC20\)*, pp\. 1–16\. IEEE, 2020\.
- Rosset et al\. \(2024\)Rosset, C\., Cheng, C\.\-A\., Mitra, A\., Santacroce, M\., Awadallah, A\., and Xie, T\.Direct nash optimization: Teaching language models to self\-improve with general preferences\.*arXiv preprint arXiv:2404\.03715*, 2024\.
- Savage Jr \(1994\)Savage Jr, R\. P\.The paradox of nontransitive dice\.*The American Mathematical Monthly*, 101\(5\):429–436, 1994\.
- Schulman et al\. \(2017\)Schulman, J\., Wolski, F\., Dhariwal, P\., Radford, A\., and Klimov, O\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Swamy et al\. \(2024\)Swamy, G\., Dann, C\., Kidambi, R\., Wu, S\., and Agarwal, A\.A minimaximalist approach to reinforcement learning from human feedback\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, pp\. 47345–47377, 2024\.
- Team et al\. \(2024a\)Team, G\., Mesnard, T\., Hardin, C\., Dadashi, R\., Bhupatiraju, S\., Pathak, S\., Sifre, L\., Rivière, M\., Kale, M\. S\., Love, J\., et al\.Gemma: Open models based on gemini research and technology\.*arXiv preprint arXiv:2403\.08295*, 2024a\.
- Team et al\. \(2024b\)Team, G\., Riviere, M\., Pathak, S\., Sessa, P\. G\., Hardin, C\., Bhupatiraju, S\., Hussenot, L\., Mesnard, T\., Shahriari, B\., Ramé, A\., et al\.Gemma 2: Improving open language models at a practical size\.*arXiv preprint arXiv:2408\.00118*, 2024b\.
- Tversky \(1969\)Tversky, A\.Intransitivity of preferences\.*Psychological review*, 76\(1\):31, 1969\.
- Wang et al\. \(2024\)Wang, H\., Xiong, W\., Xie, T\., Zhao, H\., and Zhang, T\.Interpretable preferences via multi\-objective reward modeling and mixture\-of\-experts\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pp\. 10582–10592, 2024\.
- Weidinger et al\. \(2021\)Weidinger, L\., Mellor, J\., Rauh, M\., Griffin, C\., Uesato, J\., Huang, P\.\-S\., Cheng, M\., Glaese, M\., Balle, B\., Kasirzadeh, A\., et al\.Ethical and social risks of harm from language models\.*arXiv preprint arXiv:2112\.04359*, 2021\.
- Wolf et al\. \(2020\)Wolf, T\., Debut, L\., Sanh, V\., Chaumond, J\., Delangue, C\., Moi, A\., Cistac, P\., Rault, T\., Louf, R\., Funtowicz, M\., et al\.Transformers: State\-of\-the\-art natural language processing\.In*Proceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\)*, pp\. 38–45, 2020\.
- Wu et al\. \(2025a\)Wu, F\., Huang, X\., Xuan, W\., Zhang, Z\., Xiao, Y\., Wan, G\., Li, X\., Hu, B\., Xia, P\., Leskovec, J\., et al\.Multiplayer nash preference optimization\.*arXiv preprint arXiv:2509\.23102*, 2025a\.
- Wu et al\. \(2025b\)Wu, Y\., Sun, Z\., Yuan, H\., Ji, K\., Yang, Y\., and Gu, Q\.Self\-play preference optimization for language model alignment\.In*Proceedings of the 13th International Conference on Learning Representations \(ICLR\)*, volume 2025, pp\. 91558–91582, 2025b\.
- Xu et al\. \(2025\)Xu, B\., Yao, J\., Yi, X\., Maoliniyazi, A\., Xie, X\., and Meng, X\.Towards better value principles for large language model alignment: a systematic evaluation and enhancement\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, pp\. 28991–29010, 2025\.
- Zhang et al\. \(2025a\)Zhang, Y\., Yu, D\., Ge, T\., Song, L\., Zeng, Z\., Mi, H\., Jiang, N\., and Yu, D\.Improving llm general preference alignment via optimistic online mirror descent\.*arXiv preprint arXiv:2502\.16852*, 2025a\.
- Zhang et al\. \(2025b\)Zhang, Y\., Yu, D\., Peng, B\., Song, L\., Tian, Y\., Huo, M\., Jiang, N\., Mi, H\., and Yu, D\.Iterative nash policy optimization: Aligning llms with general preferences via no\-regret learning\.In*Proceedings of the 13th International Conference on Learning Representations \(ICLR\)*, volume 2025, pp\. 31833–31849, 2025b\.
- Zhang et al\. \(2025c\)Zhang, Y\., Zhang, G\., Wu, Y\., Xu, K\., and Gu, Q\.Beyond bradley\-terry models: A general preference model for language model alignment\.In*Proceedings of the 42nd International Conference on Machine Learning \(ICML\)*, pp\. 76939–76965, 2025c\.
- Zheng et al\. \(2023\)Zheng, L\., Chiang, W\.\-L\., Sheng, Y\., Zhuang, S\., Wu, Z\., Zhuang, Y\., Lin, Z\., Li, Z\., Li, D\., Xing, E\. P\., et al\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.*arXiv preprint arXiv:2306\.05685*, 2023\.
- Zhou et al\. \(2025\)Zhou, R\., Fazel, M\., and Du, S\. S\.Extragradient preference optimization \(egpo\): Beyond last\-iterate convergence for nash learning from human feedback\.*arXiv preprint arXiv:2503\.08942*, 2025\.
- Zhu et al\. \(2023\)Zhu, B\., Jordan, M\., and Jiao, J\.Principled reinforcement learning with human feedback from pairwise or k\-wise comparisons\.In*Proceedings of the 40th International Conference on Machine Learning \(ICML\)*, pp\. 43037–43067, 2023\.
- Ziegler et al\. \(2019\)Ziegler, D\. M\., Stiennon, N\., Wu, J\., Brown, T\. B\., Radford, A\., Amodei, D\., Christiano, P\., and Irving, G\.Fine\-tuning language models from human preferences\.*arXiv preprint arXiv:1909\.08593*, 2019\.

## Appendix

## Appendix A Illustration of Our Motivation

To illustrate the duality of human preferences intuitively, we draw an analogy from the mechanics of the classic board game *L'Attaque* (or *Stratego*). The game dynamics encapsulate the tension between global quality (transitivity) and local stylistic trade-offs (cyclicity).

1\. Transitivity. The fundamental structure of the game is built upon a strict, globally consistent chain of command.

- Universal Dominance: The ranking system (Marshal $=10$, $\dots$, Scout $=2$) establishes a total order. A Marshal does not just defeat a General; by the property of transitivity, the Marshal is guaranteed to dominate every piece ranked lower than itself (Colonels, Captains, etc.).
- Consistency: If piece $A$ defeats piece $B$, and piece $B$ defeats piece $C$, logical consistency dictates that $A$ must defeat $C$. There is no ambiguity.

2\. Cyclicity. However, strictly hierarchical systems cannot model complex interactions. The game introduces a special piece that breaks the total order and creates a loop: the Mine.

- The Intransitive Loop: Consider the interaction between three specific units:
  1. Marshal (10) $\succ$ Miner (3): The Marshal wins easily due to superior rank (quality).
  2. Miner (3) $\succ$ Mine: The Miner possesses a specific skill to defuse the Mine.
  3. Mine $\succ$ Marshal (10): The Mine destroys the Marshal upon contact.
- Mathematical Impossibility for Scalar Scores: This creates a closed cycle: Marshal $\succ$ Miner $\succ$ Mine $\succ$ Marshal. It is mathematically impossible to assign scalar values $v$ to these pieces such that $v(\text{Marshal}) > v(\text{Miner}) > v(\text{Mine}) > v(\text{Marshal})$, because chaining the inequalities would give $v(\text{Marshal}) > v(\text{Marshal})$.
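The impossibility claim above can be checked by brute force. The following is a minimal sketch, not part of the original paper: it enumerates every strict ranking of the three pieces (any scalar assignment satisfying strict inequalities induces one) and keeps those that agree with all pairwise outcomes. The helper name `consistent_rankings` is hypothetical.

```python
from itertools import permutations

# Pairwise outcomes from the analogy above: (winner, loser).
beats = [("Marshal", "Miner"), ("Miner", "Mine"), ("Mine", "Marshal")]
pieces = ["Marshal", "Miner", "Mine"]

def consistent_rankings(pieces, beats):
    """Return every strict ranking (best first) that agrees with all outcomes."""
    found = []
    for order in permutations(pieces):
        # A lower index stands for a higher scalar value v(piece).
        rank = {p: i for i, p in enumerate(order)}
        if all(rank[w] < rank[l] for w, l in beats):
            found.append(order)
    return found

# With the full cycle, no ranking is consistent.
cyclic = consistent_rankings(pieces, beats)
# Dropping the Mine's win over the Marshal leaves a purely transitive chain.
transitive = consistent_rankings(pieces, beats[:2])
print(cyclic)
print(transitive)
```

Removing any one edge of the cycle restores a consistent scalar ranking, which is the sense in which the cycle, rather than any single outcome, defeats scalar scoring.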

3\. Hypothesis in Alignment. Drawing from this analogy, we hypothesize that human preferences in language model alignment exhibit a similar duality. Preferences are inherently multi-dimensional, often adhering to a transitive hierarchy reflecting general quality (e.g., safety and truthfulness). However, the complexity of human values implies that specific responses may excel in particular dimensions, such as strict brevity or stylistic constraints, thereby forming intransitive loops that defy global ranking. This hypothesis is supported by foundational game-theoretic studies, which have formally decomposed interactions into transitive strength and cyclic "spinning top" geometries (Balduzzi et al., [2018](https://arxiv.org/html/2605.17342#bib.bib4), [2019](https://arxiv.org/html/2605.17342#bib.bib5); Czarnecki et al., [2020](https://arxiv.org/html/2605.17342#bib.bib13)). Within the context of LLM alignment, Zhang et al. ([2025c](https://arxiv.org/html/2605.17342#bib.bib50)) recently highlighted the prevalence of intransitive preferences and introduced the General Preference Model (GPM) to address them. Building upon these insights, our work adopts an explicit transitive-cyclic decomposition strategy to develop the HRC model, specifically designed to harmonize the hierarchy of response quality with cyclic stylistic variations.

## Appendix B The Proofs of Theorems

### B.1 Proof of [Theorem 4.5](https://arxiv.org/html/2605.17342#S4.Thmtheorem5)

Let $\Omega$ denote the strategy space of embeddings, and let $\mu$ be the uniform probability measure over $\Omega$. We assume the game $\phi(\mathbf{v},\mathbf{w})$ is skew-symmetric, i.e., $\phi(\mathbf{v},\mathbf{w}) = -\phi(\mathbf{w},\mathbf{v})$.

Transitive Component ($\phi_T$). We define the potential function $f(\mathbf{v})$ as the expected payoff of strategy $\mathbf{v}$ against the uniform population:

$$f(\mathbf{v}) \coloneqq \int_{\Omega} \phi(\mathbf{v},\mathbf{w})\, d\mu(\mathbf{w}) = \mathbb{E}_{\mathbf{w}\sim\mu}[\phi(\mathbf{v},\mathbf{w})]. \tag{13}$$

Using this potential function, we define the transitive component as the difference in potentials:

$$\phi_T(\mathbf{v},\mathbf{w}) \coloneqq f(\mathbf{v}) - f(\mathbf{w}). \tag{14}$$
Cyclic Component ($\phi_C$). We define the cyclic component as the residual:

$$\phi_C(\mathbf{v},\mathbf{w}) \coloneqq \phi(\mathbf{v},\mathbf{w}) - \phi_T(\mathbf{v},\mathbf{w}). \tag{15}$$

Since $\phi$ is skew-symmetric by assumption and $\phi_T$ is skew-symmetric by construction ($f(\mathbf{v})-f(\mathbf{w}) = -(f(\mathbf{w})-f(\mathbf{v}))$), the residual $\phi_C$ is also skew-symmetric. We now verify the zero-marginal condition. Integrating $\phi_C$ with respect to $\mathbf{w}$:

$$\begin{aligned}
\int_{\Omega} \phi_C(\mathbf{v},\mathbf{w})\, d\mu(\mathbf{w}) &= \int_{\Omega} \left(\phi(\mathbf{v},\mathbf{w}) - (f(\mathbf{v}) - f(\mathbf{w}))\right) d\mu(\mathbf{w}) && \text{(16)} \\
&= \underbrace{\int_{\Omega} \phi(\mathbf{v},\mathbf{w})\, d\mu(\mathbf{w})}_{=f(\mathbf{v})} - \int_{\Omega} f(\mathbf{v})\, d\mu(\mathbf{w}) + \int_{\Omega} f(\mathbf{w})\, d\mu(\mathbf{w}) && \text{(17)} \\
&= f(\mathbf{v}) - f(\mathbf{v}) \cdot 1 + \mathbb{E}_{\mathbf{w}}[f(\mathbf{w})]. && \text{(18)}
\end{aligned}$$

To complete the proof, we must show that $\mathbb{E}_{\mathbf{w}}[f(\mathbf{w})] = 0$. Using the skew-symmetry of $\phi$:

$$\mathbb{E}_{\mathbf{w}}[f(\mathbf{w})] = \int_{\Omega}\int_{\Omega} \phi(\mathbf{w},\mathbf{z})\, d\mu(\mathbf{z})\, d\mu(\mathbf{w}) = 0. \tag{19}$$

This double integral vanishes because $\phi(\mathbf{w},\mathbf{z}) = -\phi(\mathbf{z},\mathbf{w})$ and both variables are integrated over the same domain with the same measure: exchanging the names of $\mathbf{w}$ and $\mathbf{z}$ shows that the integral equals its own negative. Thus, we have:

$$\int_{\Omega} \phi_C(\mathbf{v},\mathbf{w})\, d\mu(\mathbf{w}) = 0. \tag{20}$$

This confirms that $\phi_C$ is a cyclic game.

Uniqueness. Suppose there exists another decomposition $\phi = \phi'_T + \phi'_C$ where $\phi'_T(\mathbf{v},\mathbf{w}) = g(\mathbf{v}) - g(\mathbf{w})$ for some potential $g$, and $\phi'_C$ satisfies the zero-marginal condition. We have:

$$\phi_T + \phi_C = \phi'_T + \phi'_C \implies \phi_T - \phi'_T = \phi'_C - \phi_C. \tag{21}$$

Integrating both sides with respect to $\mathbf{w}$ over $\Omega$:

$$\int_{\Omega} (\phi_T(\mathbf{v},\mathbf{w}) - \phi'_T(\mathbf{v},\mathbf{w}))\, d\mu(\mathbf{w}) = \int_{\Omega} (\phi'_C(\mathbf{v},\mathbf{w}) - \phi_C(\mathbf{v},\mathbf{w}))\, d\mu(\mathbf{w}). \tag{22}$$

By the definition of cyclic components, the RHS is $0 - 0 = 0$. Expanding the LHS:

$$\begin{aligned}
\int_{\Omega} \left[(f(\mathbf{v}) - f(\mathbf{w})) - (g(\mathbf{v}) - g(\mathbf{w}))\right] d\mu(\mathbf{w}) &= 0 && \text{(23)} \\
(f(\mathbf{v}) - g(\mathbf{v})) - \int_{\Omega} (f(\mathbf{w}) - g(\mathbf{w}))\, d\mu(\mathbf{w}) &= 0. && \text{(24)}
\end{aligned}$$

Let $C = \int_{\Omega} (f(\mathbf{w}) - g(\mathbf{w}))\, d\mu(\mathbf{w})$, a constant independent of $\mathbf{v}$. Then $f(\mathbf{v}) = g(\mathbf{v}) + C$. Substituting this back into the expression for $\phi_T$:

$$\phi_T(\mathbf{v},\mathbf{w}) = f(\mathbf{v}) - f(\mathbf{w}) = (g(\mathbf{v}) + C) - (g(\mathbf{w}) + C) = g(\mathbf{v}) - g(\mathbf{w}) = \phi'_T(\mathbf{v},\mathbf{w}). \tag{25}$$

Since $\phi_T = \phi'_T$, it immediately follows that $\phi_C = \phi - \phi_T = \phi - \phi'_T = \phi'_C$. Thus, the decomposition is unique.
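As a numerical illustration of the construction in this proof, the sketch below applies it to a finite strategy set, where the uniform measure becomes a plain average over rows. This is an assumption-laden toy, not the authors' implementation: the random payoff matrix and the function name `decompose` are hypothetical.

```python
import random

def decompose(phi):
    """Split a skew-symmetric payoff matrix into transitive and cyclic parts."""
    n = len(phi)
    # Potential f(v): expected payoff against the uniform population (Eq. 13).
    f = [sum(row) / n for row in phi]
    # Transitive part: difference of potentials (Eq. 14).
    phi_t = [[f[i] - f[j] for j in range(n)] for i in range(n)]
    # Cyclic part: the residual (Eq. 15).
    phi_c = [[phi[i][j] - phi_t[i][j] for j in range(n)] for i in range(n)]
    return f, phi_t, phi_c

# Hypothetical random skew-symmetric game over n strategies.
random.seed(0)
n = 5
phi = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        phi[i][j] = random.uniform(-1, 1)
        phi[j][i] = -phi[i][j]

f, phi_t, phi_c = decompose(phi)
mean_f = sum(f) / n                                  # discrete analogue of Eq. 19
row_means = [sum(row) / n for row in phi_c]          # discrete analogue of Eq. 20
skew_err = max(abs(phi_c[i][j] + phi_c[j][i]) for i in range(n) for j in range(n))
print(mean_f, row_means, skew_err)
```

Up to floating-point error, the mean potential and every row mean of the cyclic part should vanish, and the cyclic part should remain skew-symmetric.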

### B.2 Proof of [Theorem 4.6](https://arxiv.org/html/2605.17342#S4.Thmtheorem6)

According to the decomposition theory of Factorial Feature Games (FFGs) (Balduzzi et al., [2019](https://arxiv.org/html/2605.17342#bib.bib5)), any FFG $\phi(\mathbf{y},\mathbf{y}')$ can be uniquely decomposed into a transitive component $\phi_T$ and a cyclic component $\phi_C$, such that $\phi = \phi_T + \phi_C$. The defining properties are:

1. For any $\mathbf{y},\mathbf{y}'$, the transitive component satisfies $\phi_T(\mathbf{y},\mathbf{y}') = f(\mathbf{y}) - f(\mathbf{y}')$ for some potential function $f:\mathbb{R}^{2d}\to\mathbb{R}$.
2. The cyclic component has zero marginal utility against the population distribution, i.e., $\mathbb{E}_{\mathbf{y}'}[\phi_C(\mathbf{y},\mathbf{y}')] = 0$ for all $\mathbf{y}$.

We now verify that combining the BT model and GPM explicitly instantiates this structure.

BT model ($\phi_T$). The theorem defines the transitive component as $\phi_T(\mathbf{y}_i,\mathbf{y}_j) = f(\mathbf{y}_i) - f(\mathbf{y}_j)$ for some potential function $f:\mathbb{R}^{2d}\to\mathbb{R}$. In the BT model, we have $s(\mathbf{y}_i,\mathbf{y}_j \mid \mathbf{x}) = r(\mathbf{y}_i \mid \mathbf{x}) - r(\mathbf{y}_j \mid \mathbf{x})$. By identifying the scalar reward function $r(\cdot)$ with the potential function $f(\cdot)$, the BT model strictly satisfies the definition of the transitive component. The scalar reward $r(\mathbf{y} \mid \mathbf{x})$ serves as the global "potential" or "rank" of the response $\mathbf{y}$ given the prompt $\mathbf{x}$, capturing the hierarchical structure of preferences.

GPM ($\phi_C$). Let the second component be $\phi_C(\mathbf{y},\mathbf{y}') = \mathbf{v}(\mathbf{y})^{\top}\mathbf{W}\mathbf{v}(\mathbf{y}')$, where $\mathbf{W}$ is real and skew-symmetric (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)). First, $\phi_C$ is skew-symmetric, since $\mathbf{v}(\mathbf{y})^{\top}\mathbf{W}\mathbf{v}(\mathbf{y}') = \mathbf{v}(\mathbf{y}')^{\top}\mathbf{W}^{\top}\mathbf{v}(\mathbf{y}) = -\mathbf{v}(\mathbf{y}')^{\top}\mathbf{W}\mathbf{v}(\mathbf{y})$. Crucially, we investigate its transitive contribution by calculating its marginal expectation over the response distribution. Under the assumption that the embeddings are zero-mean centered (i.e., $\mathbb{E}_{\mathbf{y}}[\mathbf{v}(\mathbf{y})] = \mathbf{0}$):

$$\mathbb{E}_{\mathbf{y}'}[\phi_C(\mathbf{y},\mathbf{y}')] = \mathbb{E}_{\mathbf{y}'}[\mathbf{v}(\mathbf{y})^{\top}\mathbf{W}\mathbf{v}(\mathbf{y}')] = \mathbf{v}(\mathbf{y})^{\top}\mathbf{W}\,\mathbb{E}_{\mathbf{y}'}[\mathbf{v}(\mathbf{y}')] = \mathbf{v}(\mathbf{y})^{\top}\mathbf{W}\cdot\mathbf{0} = 0. \tag{26}$$

Since the marginal expectation is zero for any $\mathbf{y}$, the GPM component contributes zero "strength" or "rank" to the model. This implies that $\phi_C$ is purely cyclic.

Conclusion. Since $\phi_T$ fully captures the transitive structure and $\phi_C$ (under the zero-mean condition) constitutes a purely cyclic structure orthogonal to $\phi_T$, the sum $s = \phi_T + \phi_C$ constitutes a valid structural instantiation of the FFG decomposition.
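The zero-marginal property of the bilinear component can be checked numerically on a small example. The sketch below assumes a finite set of responses with an empirical (uniform) response distribution; the embedding dimension, the random embeddings, and the random matrix `W` are hypothetical placeholders rather than learned quantities.

```python
import random

random.seed(1)
n, k = 6, 4  # hypothetical: 6 responses, 4-dimensional embeddings

# Random embeddings, centered so that their empirical mean is zero.
raw = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
mean = [sum(v[a] for v in raw) / n for a in range(k)]
emb = [[v[a] - mean[a] for a in range(k)] for v in raw]

# Random real skew-symmetric matrix W (W^T = -W).
W = [[0.0] * k for _ in range(k)]
for a in range(k):
    for b in range(a + 1, k):
        W[a][b] = random.gauss(0, 1)
        W[b][a] = -W[a][b]

def phi_c(v, w):
    """Bilinear cyclic score v^T W w."""
    return sum(v[a] * W[a][b] * w[b] for a in range(k) for b in range(k))

# Marginal of phi_C against the empirical response distribution (cf. Eq. 26).
marginals = [sum(phi_c(v, w) for w in emb) / n for v in emb]
# Skew-symmetry check: phi_C(v, w) + phi_C(w, v) should vanish.
skew_err = max(abs(phi_c(v, w) + phi_c(w, v)) for v in emb for w in emb)
print(marginals, skew_err)
```

If the centering step is skipped, the marginals are generally nonzero, which is why the zero-mean condition appears explicitly in the conclusion above.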

### B.3 Proof of [Theorem 4.7](https://arxiv.org/html/2605.17342#S4.Thmtheorem7)

We begin by establishing the rigorous mathematical form of the General Preference Model \(GPM\) based on the spectral theory of real skew\-symmetric matrices\.

Mathematical Formulation. Let $\mathbf{M}\in\mathbb{R}^{n\times n}$ be the real skew-symmetric preference matrix where $M_{ij} = s(\mathbf{y}_i,\mathbf{y}_j)$. Since $\mathbf{M}$ is skew-symmetric ($\mathbf{M}^{\top} = -\mathbf{M}$), its rank is even, denoted as $2d$ (Horn & Johnson, [2012](https://arxiv.org/html/2605.17342#bib.bib21)). By the spectral theorem for real skew-symmetric matrices, $\mathbf{M}$ can be decomposed into its real normal form:

$$\mathbf{M} = \sum_{l=1}^{d} \tau_l \left(\mathbf{q}_{2l-1}\mathbf{q}_{2l}^{\top} - \mathbf{q}_{2l}\mathbf{q}_{2l-1}^{\top}\right), \tag{27}$$

where $\tau_l > 0$ are the magnitudes of the imaginary parts of the eigenvalues, and $\{\mathbf{q}_1,\dots,\mathbf{q}_{2d}\}$ are orthonormal vectors. Inspired by previous work (Chen & Joachims, [2016](https://arxiv.org/html/2605.17342#bib.bib10); Bertrand et al., [2023](https://arxiv.org/html/2605.17342#bib.bib7)), we let $\mathbf{b}_l = \sqrt{\tau_l}\,\mathbf{q}_{2l-1}$ and $\mathbf{c}_l = \sqrt{\tau_l}\,\mathbf{q}_{2l}$, and define matrices $\mathbf{B},\mathbf{C}\in\mathbb{R}^{n\times d}$ whose $l$-th columns are these vectors. The preference matrix can then be factorized as:

$$\mathbf{M} = \mathbf{B}\mathbf{C}^{\top} - \mathbf{C}\mathbf{B}^{\top}. \tag{28}$$

For any pair of responses $(\mathbf{y}_i,\mathbf{y}_j)$, the preference score is the element $M_{ij}$. Let $\mathbf{b}_i,\mathbf{c}_i\in\mathbb{R}^{d}$ denote the $i$-th rows of $\mathbf{B}$ and $\mathbf{C}$ respectively. The score is:

$$s(\mathbf{y}_i,\mathbf{y}_j) = \mathbf{b}_i\mathbf{c}_j^{\top} - \mathbf{c}_i\mathbf{b}_j^{\top} = \sum_{l=1}^{d} (b_{il}c_{jl} - c_{il}b_{jl}). \tag{29}$$

Geometrically, each term $(b_{il}c_{jl} - c_{il}b_{jl})$ is the determinant of the $2\times 2$ matrix whose columns are $\mathbf{z}_{il} = [b_{il}, c_{il}]^{\top}$ and $\mathbf{z}_{jl} = [b_{jl}, c_{jl}]^{\top}$. Thus, [Equation 29](https://arxiv.org/html/2605.17342#A2.E29) can be rewritten in polar coordinates, where $\mathbf{z}_{il}$ has magnitude $L_{il}$ and angle $\phi_{il}$:

$$s(\mathbf{y}_i,\mathbf{y}_j) = \sum_{l=1}^{d} L_{il}L_{jl}\sin(\phi_{jl} - \phi_{il}). \tag{30}$$

This formulation confirms that a rank-$2d$ GPM is equivalent to summing the cyclic interactions across $d$ independent 2D subspaces.
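The equivalence between the determinant form (Equation 29) and the polar form (Equation 30) can be sanity-checked numerically. The following sketch uses hypothetical random embeddings and helper names (`score_det`, `score_polar`); it only illustrates the identity and is not taken from the paper's code.

```python
import math
import random

random.seed(2)
d = 3  # hypothetical number of 2D subspaces

def random_embedding(d):
    """Return d pairs (b_l, c_l), one 2D vector z_l per subspace."""
    return [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(d)]

def score_det(zi, zj):
    """Eq. 29: sum over subspaces of b_il * c_jl - c_il * b_jl."""
    return sum(bi * cj - ci * bj for (bi, ci), (bj, cj) in zip(zi, zj))

def score_polar(zi, zj):
    """Eq. 30: sum of L_il * L_jl * sin(phi_jl - phi_il)."""
    total = 0.0
    for (bi, ci), (bj, cj) in zip(zi, zj):
        li, lj = math.hypot(bi, ci), math.hypot(bj, cj)   # magnitudes
        ai, aj = math.atan2(ci, bi), math.atan2(cj, bj)   # angles
        total += li * lj * math.sin(aj - ai)
    return total

zi, zj = random_embedding(d), random_embedding(d)
gap = abs(score_det(zi, zj) - score_polar(zi, zj))
antisym = abs(score_det(zi, zj) + score_det(zj, zi))
print(gap, antisym)
```

Both printed values should be zero up to floating-point error: the two forms agree, and the score is antisymmetric in its arguments.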

Existence Condition. Assume $d=1$. The preference score simplifies to a single term: $s(\mathbf{y}_i,\mathbf{y}_j) = L_{i1}L_{j1}\sin(\phi_{j1} - \phi_{i1})$. For a set of candidates $C = \{A_1,\dots,A_n\}$ to form a valid cycle (e.g., $A_1 \succ A_2 \succ \dots \succ A_n \succ A_1$), their angles $\phi_{i1}$ must wrap all the way around the origin, spanning $[0,2\pi)$ rather than fitting inside any open half-circle. However, for a dominant candidate $A^*$ to strictly defeat all $A_i\in C$, we require $s(\mathbf{y}_{A^*},\mathbf{y}_{A_i}) > 0$ for all $i$. This implies:

$$\sin(\phi_{i1} - \phi_{A^*1}) > 0 \implies \phi_{i1}\in(\phi_{A^*1},\phi_{A^*1}+\pi). \tag{31}$$

This requires all cycle candidates to be confined within an open semi-circle. Within such a semi-circle, $s(\mathbf{y}_{A_i},\mathbf{y}_{A_j}) > 0$ holds exactly when the angle of $A_j$ exceeds that of $A_i$, so the induced preference is a strict order, which contradicts the requirement of forming a complete cycle. Thus, $d=1$ is insufficient.

We now verify that if $d\geq 2$, GPM can represent a dominant candidate against a cycle. Consider $d=2$. We construct a cycle $C = \{A_1,\dots,A_n\}$ and a dominant candidate $A^*$ using the following embedding configuration:

1. First subspace: for candidates $A_i\in C$, set $L_{i1}=1$ and $\phi_{i1}=\frac{2\pi i}{n}$. For the dominant candidate $A^*$, set $L_{A^*1}=0$.
2. Second subspace: for candidates $A_i\in C$, set identical vectors: $L_{i2}=1$, $\phi_{i2}=0$. For $A^*$, set $L_{A^*2}=1$, $\phi_{A^*2}=-\frac{\pi}{2}$.

Now we evaluate the scores using [Equation 30](https://arxiv.org/html/2605.17342#A2.E30):

- Since $\phi_{i2}=\phi_{j2}=0$, the contribution from the second subspace is $\sin(0)=0$. The score between two cycle members is determined purely by the first subspace: $s(\mathbf{y}_{A_i},\mathbf{y}_{A_j}) = 1\cdot 1\cdot\sin\left(\frac{2\pi(j-i)}{n}\right)+0$, which correctly models the cyclic structure.
- The contribution from the first subspace is 0 (since $L_{A^*1}=0$). The score of the dominant candidate is determined purely by the second subspace: $s(\mathbf{y}_{A^*},\mathbf{y}_{A_i}) = 0 + L_{A^*2}L_{i2}\sin(\phi_{i2}-\phi_{A^*2}) = 1\cdot 1\cdot\sin\left(0-\left(-\frac{\pi}{2}\right)\right) = 1 > 0$.

Since $s(\mathbf{y}_{A^*},\mathbf{y}_{A_i})$ is strictly positive for all $i$, this configuration contains both a cycle and a candidate that dominates it. The same construction carries over to $d>2$, for instance by giving every candidate zero magnitude in the additional subspaces.
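The $d=2$ construction above can be evaluated directly. The sketch below encodes each candidate as a list of (magnitude, angle) pairs, one per subspace, and scores pairs with Equation 30; the cycle length `n = 5` and the helper name `score` are arbitrary choices made for illustration.

```python
import math

def score(zi, zj):
    """Eq. 30 with embeddings given as lists of (L, angle), one pair per subspace."""
    return sum(li * lj * math.sin(aj - ai) for (li, ai), (lj, aj) in zip(zi, zj))

n = 5  # hypothetical cycle length
# Cycle members: unit vectors spread over the circle in subspace 1,
# identical unit vectors at angle 0 in subspace 2.
cycle = [[(1.0, 2 * math.pi * i / n), (1.0, 0.0)] for i in range(1, n + 1)]
# Dominant candidate: zero magnitude in subspace 1, angle -pi/2 in subspace 2.
dominant = [(0.0, 0.0), (1.0, -math.pi / 2)]

# Each member against its successor (indices modulo n), closing the cycle.
cycle_scores = [score(cycle[i], cycle[(i + 1) % n]) for i in range(n)]
# The dominant candidate against every member of the cycle.
dominant_scores = [score(dominant, a) for a in cycle]
print(cycle_scores)
print(dominant_scores)
```

Every entry of `cycle_scores` equals $\sin(2\pi/n)$, which is positive for $n\geq 3$, and every entry of `dominant_scores` equals 1, matching the two bullet points above.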

Lack of Arbitrariness. We now show that for any fixed finite rank $2d$, there exists a cycle for which no embedding of a dominant candidate can satisfy the dominance conditions.

Consider $C = \{A_1,\dots,A_n\}$ constructed such that the embedding vectors are identical and aligned across all $d$ subspaces. Let $\mathbf{z}_{i,m}\in\mathbb{R}^{2}$ denote the embedding vector of candidate $A_i$ in the $m$-th subspace. For the Hard Cycle, we define:

$$\forall m\in\{1,\dots,d\}:\quad \mathbf{z}_{i,m} = \begin{bmatrix}\cos\theta_i \\ \sin\theta_i\end{bmatrix},\quad \text{where } \theta_i = \frac{2\pi i}{n}. \tag{32}$$

The preference score between cycle members is $s(\mathbf{y}_{A_i},\mathbf{y}_{A_j}) = \sum_{m=1}^{d}\det(\mathbf{z}_{i,m},\mathbf{z}_{j,m}) = \sum_{m=1}^{d}\sin(\theta_j-\theta_i) = d\sin(\theta_j-\theta_i)$, which defines a valid cycle.

Now, assume there exists a dominant candidate $D$ parameterized by arbitrary embedding vectors $\mathbf{z}_{D,m} = [X_m, Y_m]^{\top}$ for each subspace $m=1,\dots,d$. According to the GPM preference score definition (sum of determinants across subspaces):

$$s(\mathbf{y}_D,\mathbf{y}_{A_i}) = \sum_{m=1}^{d}\det(\mathbf{z}_{D,m},\mathbf{z}_{i,m}) = \sum_{m=1}^{d}(X_m\sin\theta_i - Y_m\cos\theta_i). \tag{33}$$

By linearity of the summation, we can group the coefficients for $\sin\theta_i$ and $\cos\theta_i$:

$$s(\mathbf{y}_D,\mathbf{y}_{A_i}) = \left(\sum_{m=1}^{d}X_m\right)\sin\theta_i - \left(\sum_{m=1}^{d}Y_m\right)\cos\theta_i. \tag{34}$$

Let $\mathcal{X} = \sum_{m=1}^{d}X_m$ and $\mathcal{Y} = -\sum_{m=1}^{d}Y_m$. The expression simplifies to a single harmonic waveform:

$$s(\mathbf{y}_D,\mathbf{y}_{A_i}) = \sqrt{\mathcal{X}^{2}+\mathcal{Y}^{2}}\,\sin(\theta_i+\delta), \tag{35}$$

where $\delta$ is a constant. For $D$ to be dominant, we require the strict inequality $s(\mathbf{y}_D,\mathbf{y}_{A_i}) > 0$ to hold for all $i=1,\dots,n$. However, the function $f(\theta) = A\sin(\theta+\delta)$ has a mean of zero over the period $[0,2\pi)$. Since the candidates $A_i$ have angles $\theta_i = \frac{2\pi i}{n}$ spaced uniformly over the circle, the sampled values also sum to zero, $\sum_{i=1}^{n}\sin(\theta_i+\delta) = 0$ for any $n\geq 2$, so they cannot all be strictly positive: at least one candidate satisfies $\sin(\theta_i+\delta)\leq 0$.

Therefore, it is mathematically impossible to find a set of embedding vectors $\{\mathbf{z}_{D,m}\}$ for $D$ that yields a positive score against all members of this cycle. This proves that GPM cannot guarantee the representation of dominant hierarchies over arbitrary cyclic candidates.
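The impossibility argument can also be probed empirically with a random search over candidate embeddings for $D$. This is a sketch only: the cycle length, the number of subspaces, the sampling range, and the number of trials are arbitrary assumptions, and a random search illustrates the claim rather than proving it.

```python
import math
import random

random.seed(3)
n, d = 7, 4  # hypothetical cycle length and number of subspaces
thetas = [2 * math.pi * i / n for i in range(1, n + 1)]

def scores_against_cycle(zD):
    """Eq. 33: score of D against each A_i, with z_{i,m} = (cos t_i, sin t_i) for all m."""
    return [sum(x * math.sin(t) - y * math.cos(t) for x, y in zD) for t in thetas]

best_min = -math.inf   # largest minimum score found over all trials
largest_sum = 0.0      # largest |sum of scores| seen, expected to be ~0
for _ in range(2000):
    zD = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(d)]
    s = scores_against_cycle(zD)
    best_min = max(best_min, min(s))
    largest_sum = max(largest_sum, abs(sum(s)))
print(best_min, largest_sum)
```

Because the scores against the cycle always sum to zero, the smallest score of any sampled candidate is never strictly positive, regardless of how many subspaces are available.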

### B.4 Proof of [Equation 11](https://arxiv.org/html/2605.17342#S5.E11)

We assume the optimization oracle in [Equation 10](https://arxiv.org/html/2605.17342#S5.E10) is realizable. The iterative update rule of DSPPO is given by:

$$\pi_{t+1}(\mathbf{y}\mid\mathbf{x}) \propto \pi_t(\mathbf{y}\mid\mathbf{x})\exp\left(\eta\,\mathbb{P}_t(\mathbf{y}\succ\pi_t\mid\mathbf{x})\right),\quad \text{for } t=1,2,\dots,T. \tag{36}$$
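To make the update rule concrete, the sketch below applies it to a toy setting with three candidate responses for a single prompt. It is a simplification under stated assumptions: the preference matrix `P` is hypothetical and held fixed across steps (whereas DSPPO uses a time-varying $\mathbb{P}_t$), the policy is an explicit probability vector rather than a language model, and the function name `dsppo_step` is illustrative.

```python
import math

def dsppo_step(pi, P, eta):
    """One exponential-weights update: pi(y) * exp(eta * P(y beats pi)), renormalized."""
    n = len(pi)
    # P(y beats pi): expected win probability of y against a response drawn from pi.
    win = [sum(P[y][z] * pi[z] for z in range(n)) for y in range(n)]
    weights = [pi[y] * math.exp(eta * win[y]) for y in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical cyclic preference matrix over three responses.
# P[y][z] is the probability that y is preferred to z, with P[y][z] + P[z][y] = 1.
P = [[0.5, 0.9, 0.1],
     [0.1, 0.5, 0.9],
     [0.9, 0.1, 0.5]]

pi = [0.6, 0.3, 0.1]   # hypothetical initial policy
eta = 0.1              # hypothetical step size
for _ in range(1000):
    pi = dsppo_step(pi, P, eta)

# The uniform policy ties with every response here, so the update leaves it unchanged.
uniform_next = dsppo_step([1 / 3, 1 / 3, 1 / 3], P, eta)
print(pi, uniform_next)
```

In this toy game the uniform policy is a fixed point of the update, since every response wins against it with probability exactly 1/2.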
Step 1: Regret Bound for Time\-Varying Preferences

We first extend Theorem 1 from\(Freund & Schapire,[1999](https://arxiv.org/html/2605.17342#bib.bib16)\)\. Sinceℙt\\mathbb\{P\}\_\{t\}acts as a valid preference oracle at each steptt, we can invoke the regret bound for the multiplicative weights update algorithm\. For any sequence of reference mixed policiesμ1,μ2,…,μT\\mu\_\{1\},\\mu\_\{2\},\\dots,\\mu\_\{T\}and the sequence of policiesπ1,π2,…,πT\\pi\_\{1\},\\pi\_\{2\},\\dots,\\pi\_\{T\}generated by[Equation36](https://arxiv.org/html/2605.17342#A2.E36), the following inequality holds \(based on Lemma 2 in\(Freund & Schapire,[1999](https://arxiv.org/html/2605.17342#bib.bib16)\)\):

$$\sum_{t=1}^{T}\mathbb{P}_{t}(\pi_{t}\prec\mu_{t})\leq\min_{\pi}\left[\frac{\eta}{1-e^{-\eta}}\sum_{t=1}^{T}\mathbb{P}_{t}(\pi\prec\mu_{t})+\frac{\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})}{1-e^{-\eta}}\right].\tag{37}$$

Setting $\mu_{t}=\pi_{t}$, and noting that $\mathbb{P}_{t}(\pi_{t}\prec\pi_{t})=1/2$ (a policy ties with itself), the LHS simplifies to $T/2$. Using the skew-symmetry property $\mathbb{P}(\pi\prec\mu)=1-\mathbb{P}(\pi\succ\mu)$, we have:

$$\frac{T}{2}\leq\min_{\pi}\left[\frac{\eta T}{1-e^{-\eta}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}_{t}(\pi\prec\pi_{t})+\frac{\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})}{1-e^{-\eta}}\right].\tag{38}$$

Rearranging the terms and dividing by $T$, we obtain:

$$\frac{1-e^{-\eta}}{2\eta}\leq\min_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}_{t}(\pi\prec\pi_{t})+\frac{\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})}{\eta T}\right].\tag{39}$$

Using the Taylor expansion $\frac{1-e^{-\eta}}{2\eta}=\frac{1}{2}-\frac{\eta}{4}+O(\eta^{2})$, and substituting $\mathbb{P}_{t}(\pi\prec\pi_{t})=1-\mathbb{P}_{t}(\pi\succ\pi_{t})$, we get:

$$\frac{1}{2}-\frac{\eta}{4}+O(\eta^{2})\leq 1-\max_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}_{t}(\pi\succ\pi_{t})\right]+\frac{\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})}{\eta T}.\tag{40}$$

Rearranging to isolate the win rate:

$$\max_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{P}_{t}(\pi\succ\pi_{t})-\frac{1}{2}\right)\right]\leq\frac{\eta}{4}+\frac{\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})}{\eta T}+O(\eta^{2}).\tag{41}$$

Since $\pi_{0}$ is an autoregressive model fully supported on a finite vocabulary, $\|\log\pi_{0}(\cdot)\|_{\infty}$ is bounded. Thus, $\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})\leq\|\log\pi_{0}(\cdot)\|_{\infty}$. By choosing the learning rate $\eta=\Theta(1/\sqrt{T})$, specifically $\eta=\frac{\|\log\pi_{0}(\cdot)\|_{\infty}}{\sqrt{T}}$, the RHS is bounded by $O(1/\sqrt{T})$. Thus:

$$\max_{\pi}\frac{1}{T}\sum_{t=1}^{T}\left[\mathbb{P}_{t}(\pi\succ\pi_{t})-\frac{1}{2}\right]=O\left(\frac{1}{\sqrt{T}}\right).\tag{42}$$

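As a worked check of the rate, the stated learning rate can be substituted directly into the right-hand side of Equation 41. The shorthand $B$ is introduced here only for readability and is not the paper's notation; the step uses nothing beyond the bound $\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})\leq B$ stated above.

$$
B=\|\log\pi_{0}(\cdot)\|_{\infty},\quad \eta=\frac{B}{\sqrt{T}}
\;\;\Longrightarrow\;\;
\frac{\eta}{4}+\frac{\mathbb{D}_{\text{KL}}(\pi\,\|\,\pi_{0})}{\eta T}+O(\eta^{2})
\;\leq\;\frac{B}{4\sqrt{T}}+\frac{B}{(B/\sqrt{T})\,T}+O\!\left(\frac{B^{2}}{T}\right)
\;=\;\frac{B/4+1}{\sqrt{T}}+O\!\left(\frac{1}{T}\right).
$$

Both leading terms scale as $1/\sqrt{T}$, and the remainder decays faster, which gives the $O(1/\sqrt{T})$ rate.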
Step 2: Bridging Time\-Varying Preferences to True Preferences

We define the optimality gap to the mixture policy $\bar{\pi}_{T}=\frac{1}{T}\sum_{t=1}^{T}\pi_{t}$ under the true preference model $\mathbb{P}$. We decompose the gap into approximation error (due to the time-varying $s_{t}$) and optimization error:

$$\begin{aligned}
\max_{\pi}\left[\mathbb{P}(\pi\succ\bar{\pi}_{T})-\frac{1}{2}\right]
&=\max_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}(\pi\succ\pi_{t})-\frac{1}{2}\right] &&\text{(43)}\\
&=\max_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{P}(\pi\succ\pi_{t})-\mathbb{P}_{t}(\pi\succ\pi_{t})\right)+\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{P}_{t}(\pi\succ\pi_{t})-\frac{1}{2}\right)\right] &&\text{(44)}\\
&\leq\underbrace{\max_{\pi}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{P}(\pi\succ\pi_{t})-\mathbb{P}_{t}(\pi\succ\pi_{t})\right)}_{\text{Approximation Error}}+\underbrace{\max_{\pi}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{P}_{t}(\pi\succ\pi_{t})-\frac{1}{2}\right)}_{\text{Optimization Error}}. &&\text{(45)}
\end{aligned}$$

The Optimization Error is $O(1/\sqrt{T})$ as shown in [Equation 42](https://arxiv.org/html/2605.17342#A2.E42). We now bound the Approximation Error. Recall that $\mathbb{P}(\mathbf{y}\succ\mathbf{y}^{\prime})=\sigma(s_{\infty}(\mathbf{y},\mathbf{y}^{\prime}))$ and $\mathbb{P}_{t}(\mathbf{y}\succ\mathbf{y}^{\prime})=\sigma(s_{t}(\mathbf{y},\mathbf{y}^{\prime}))$. The derivative of the sigmoid function satisfies $\sigma^{\prime}(x)=\sigma(x)(1-\sigma(x))\leq 1/4$ for all $x\in\mathbb{R}$. This implies that $\sigma(\cdot)$ is Lipschitz continuous with constant $L=1/4$. Therefore:

$$|\mathbb{P}(\mathbf{y}\succ\mathbf{y}^{\prime})-\mathbb{P}_{t}(\mathbf{y}\succ\mathbf{y}^{\prime})|=|\sigma(s_{\infty}(\mathbf{y},\mathbf{y}^{\prime}))-\sigma(s_{t}(\mathbf{y},\mathbf{y}^{\prime}))|\leq\frac{1}{4}|s_{\infty}(\mathbf{y},\mathbf{y}^{\prime})-s_{t}(\mathbf{y},\mathbf{y}^{\prime})|.\tag{46}$$

Let $\epsilon_{t}=\max_{\mathbf{y},\mathbf{y}^{\prime}}|s_{\infty}(\mathbf{y},\mathbf{y}^{\prime})-s_{t}(\mathbf{y},\mathbf{y}^{\prime})|$. By the linearity of expectation, we have:

$$\mathbb{P}(\pi\succ\pi_{t})-\mathbb{P}_{t}(\pi\succ\pi_{t})=\mathbb{E}_{\mathbf{y}\sim\pi,\mathbf{y}^{\prime}\sim\pi_{t}}\left[\sigma(s_{\infty}(\mathbf{y},\mathbf{y}^{\prime}))-\sigma(s_{t}(\mathbf{y},\mathbf{y}^{\prime}))\right]\leq\frac{1}{4}\epsilon_{t}.\tag{47}$$

Based on [5.2](https://arxiv.org/html/2605.17342#S5.Thmtheorem2), we have $\frac{1}{T}\sum_{t=1}^{T}\epsilon_{t}=O(1/\sqrt{T})$. Thus, the Approximation Error is bounded by $O(1/\sqrt{T})$.
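The Lipschitz step can be sanity-checked numerically. The sketch below draws synthetic score pairs (the distributions and the names `s_inf`, `s_t`, and `eps_t` are assumptions chosen for illustration, not quantities from the paper's experiments) and confirms that the probability gap never exceeds one quarter of the score gap, pointwise and in expectation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
s_inf = rng.normal(scale=3.0, size=10_000)        # stand-in for the limiting scores
s_t = s_inf + rng.normal(scale=0.5, size=10_000)  # stand-in for the scores at step t

prob_gap = np.abs(sigmoid(s_inf) - sigmoid(s_t))
score_gap = np.abs(s_inf - s_t)
eps_t = float(score_gap.max())                    # sup-norm score error on this sample

pointwise_ok = bool(np.all(prob_gap <= score_gap / 4.0 + 1e-12))
mean_gap = float(np.mean(sigmoid(s_inf) - sigmoid(s_t)))
print(pointwise_ok, mean_gap, eps_t / 4.0)
```

The averaged gap plays the role of the expectation in Equation 47, with the sample maximum standing in for $\epsilon_{t}$.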

Step 3: Final Bound

Combining the bounds for both errors:

$$\max_{\pi}\left[\mathbb{P}(\pi\succ\bar{\pi}_{T})-\frac{1}{2}\right]\leq O\left(\frac{1}{\sqrt{T}}\right)+O\left(\frac{1}{\sqrt{T}}\right)=O\left(\frac{1}{\sqrt{T}}\right).\tag{48}$$

Finally, the optimality gap is given by:

$$\begin{aligned}
\text{Gap}(\bar{\pi}_{T})
&=\max_{\pi}\mathbb{P}(\pi\succ\bar{\pi}_{T})-\min_{\pi^{\prime}}\mathbb{P}(\pi^{\prime}\prec\bar{\pi}_{T}) &&\text{(49)}\\
&=\max_{\pi}\mathbb{P}(\pi\succ\bar{\pi}_{T})-\min_{\pi^{\prime}}\left(1-\mathbb{P}(\pi^{\prime}\succ\bar{\pi}_{T})\right) &&\text{(50)}\\
&=\max_{\pi}\mathbb{P}(\pi\succ\bar{\pi}_{T})-\left(1-\max_{\pi^{\prime}}\mathbb{P}(\pi^{\prime}\succ\bar{\pi}_{T})\right) &&\text{(51)}\\
&=2\left(\max_{\pi}\mathbb{P}(\pi\succ\bar{\pi}_{T})-\frac{1}{2}\right). &&\text{(52)}
\end{aligned}$$

Substituting the bound derived above, we conclude:

$$\max_{\pi}\mathbb{P}(\pi\succ\bar{\pi}_{T})-\min_{\pi^{\prime}}\mathbb{P}(\pi^{\prime}\prec\bar{\pi}_{T})=O\left(\frac{1}{\sqrt{T}}\right).\tag{53}$$

This completes the proof.

## Appendix C More on Experiments

### C.1 Cyclic Preference Setup

We construct synthetic “cycle” and “dominant + cycle” datasets based on UltraFeedback (Cui et al., [2024](https://arxiv.org/html/2605.17342#bib.bib12)). Each prompt is associated with four candidate responses, each annotated along four dimensions (e.g., helpfulness, honesty, instruction-following, and truthfulness).

#### Cyclic Dataset\.

We select three dimensions and construct cyclic preferences by choosing three responses such that each response outperforms another on a different dimension:

$$A\succ B,\quad B\succ C,\quad C\succ A.$$

Pairwise preferences are defined according to the dimension on which the comparison is made.

#### Dominant \+ Cycle Dataset\.

We extend the above by selecting a fourth response that outperforms all others across the selected dimensions\. This introduces three additional preference pairs, forming a mixed structure with both a dominant candidate and cyclic relations\.

Using the constructed datasets, we train GPM and HRC under identical settings\. We observe a consistent two\-stage learning process:

- Stage 1 (50% $\rightarrow$ 75%): models identify the dominant candidate, driven by the three dominant-related preference pairs.
- Stage 2 (75% $\rightarrow$ 100%): models learn cyclic relations among the remaining candidates.

HRC finishes Stage 1 faster and achieves higher final accuracy, while low\-dimensional GPM \(dim=2\) fails to capture cyclic structure\.
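The 75% plateau is consistent with a simple pair count, sketched below under stated assumptions: the six pairs are weighted equally, the labels `A` to `D` are hypothetical stand-ins for the four responses, and a model that has only found the dominant candidate is assumed to score at chance on the three cycle pairs. The same enumeration also shows why a purely transitive (single-ranking) model cannot exceed 5 of 6 pairs on this data.

```python
from itertools import permutations

# Hypothetical labels: D is the dominant response; A, B, C form the cycle.
cycle_pairs = [("A", "B"), ("B", "C"), ("C", "A")]     # (winner, loser)
dominant_pairs = [("D", "A"), ("D", "B"), ("D", "C")]
all_pairs = dominant_pairs + cycle_pairs

def ranking_accuracy(order, pairs):
    """Fraction of pairs whose winner is ranked above its loser."""
    rank = {name: pos for pos, name in enumerate(order)}  # lower position = better
    return sum(rank[w] < rank[l] for w, l in pairs) / len(pairs)

# Best accuracy any strict ranking (i.e. any scalar reward without ties) can reach.
best_transitive = max(ranking_accuracy(p, all_pairs) for p in permutations("ABCD"))
best_on_cycle = max(ranking_accuracy(p, cycle_pairs) for p in permutations("ABC"))

# Dominant pairs solved, cycle pairs at chance (assumed 0.5 each).
stage1_accuracy = (len(dominant_pairs) * 1.0 + len(cycle_pairs) * 0.5) / len(all_pairs)
print(best_transitive, best_on_cycle, stage1_accuracy)
```

Any ranking must violate at least one edge of the three-cycle, so closing the remaining gap to 100% requires a non-transitive component.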

### C\.2Preference Modeling Setup

Given our inclusion of RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2605.17342#bib.bib30)), we independently trained all preference models (BT, GPM, and HRC) from scratch to ensure a rigorous and consistent evaluation pipeline. This approach ensures that all comparisons in both preference modeling ([Section 6.2](https://arxiv.org/html/2605.17342#S6.SS2)) and downstream alignment ([Section 6.3](https://arxiv.org/html/2605.17342#S6.SS3)) are based on models trained on the exact same data distribution, eliminating potential discrepancies arising from different pre-trained checkpoints.

Datasets and Models. All preference models are trained on Skywork-Reward-Preference-80K-v0.2 (Liu et al., [2024](https://arxiv.org/html/2605.17342#bib.bib29)). We employ Gemma-2B-it (Team et al., [2024a](https://arxiv.org/html/2605.17342#bib.bib39)) and Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.17342#bib.bib18)) as the base models.

Model Configurations and Dimensions\.To ensure a comprehensive evaluation, we adopt distinct configuration strategies for the two experimental phases:

1\. Preference Modeling (RewardBench 2 & RewardBench). To assess the modeling capabilities and scaling properties of our proposed method, we evaluated models across varying latent dimensions as reported in [Table 1](https://arxiv.org/html/2605.17342#S6.T1):

- Bradley-Terry (BT): Fixed output dimension of $1$.
- General Preference Model (GPM): Evaluated at latent dimensions $2d\in\{2,4\}$.
- Hybrid Reward-Cyclic (HRC): Evaluated at dimensions $2d+1\in\{2+1,4+1\}$, corresponding to a cyclic component of dimension $2d$ augmented with a scalar reward.

2\. Downstream Alignment (AlpacaEval 2.0 & MT-Bench). For the downstream policy optimization experiments ([Section 6.3](https://arxiv.org/html/2605.17342#S6.SS3)), we selected the high-capacity configurations to maximize performance differentiation and ensure a rigorous structural comparison. Specifically, we compare:

- Baselines: The standard BT model (dim=1) and the GPM (dim=4).
- Ours: The HRC (dim=4+1).

This selection allows for a strictly controlled comparison: HRC \(dim=4\+1\) effectively augments the GPM \(dim=4\) baseline with an explicit transitive shortcut\. By comparing these specific configurations, we isolate the architectural contribution of the explicit preference decomposition, verifying whether adding the transitive component improves alignment efficacy over the pure cyclic formulation of GPM\.

### C\.3Implementation Details

Our experiments were implemented using the PyTorch framework (Paszke et al., [2019](https://arxiv.org/html/2605.17342#bib.bib32)) and the HuggingFace Transformers library (Wolf et al., [2020](https://arxiv.org/html/2605.17342#bib.bib44)). To ensure efficient distributed training, we leveraged DeepSpeed (Rajbhandari et al., [2020](https://arxiv.org/html/2605.17342#bib.bib34)). Furthermore, our preference modeling codebase is developed based on the official GPM implementation (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), which is built upon the OpenRLHF framework (Hu et al., [2025](https://arxiv.org/html/2605.17342#bib.bib22)).

For the preference modeling experiments \(comparing BT, GPM, and HRC\), we maintain a consistent training framework to ensure fair comparability\.

Unified Training Objective in Preference Models. To establish a unified training paradigm for BT, GPM, and HRC, we formulate the learning objective as a generalized pairwise classification task. Let $s_{\theta}(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})$ denote the pairwise preference score predicted by the model for a prompt $\mathbf{x}$ and a response pair $(\mathbf{y}_{w},\mathbf{y}_{l})$. The loss function is defined as:

$$\mathcal{L}(\theta)=-\mathbb{E}_{(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})\sim\mathcal{D}}\left[\log\sigma\left(\frac{s_{\theta}(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})}{\tau}\right)\right],\tag{54}$$

where $\tau$ is the temperature hyperparameter (corresponding to `general_preference_tau`) that controls the sharpness of the preference distribution. This formulation generalizes all three models, where the distinction lies in the mathematical definition of the pairwise score $s_{\theta}$:

- BT: The pairwise score decomposes into the difference of scalar rewards: $s_{\theta}(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})=r(\mathbf{x},\mathbf{y}_{w})-r(\mathbf{x},\mathbf{y}_{l})$.
- General Preference Model (GPM): The score is computed directly on the pair embeddings, modulated by a context-aware gate: $s_{\theta}(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})=\phi(\mathbf{y}_{w})^{\top}\mathbf{D}(\mathbf{x})\mathbf{A}\mathbf{D}(\mathbf{x})\phi(\mathbf{y}_{l})$, where $\mathbf{A}$ is the skew-symmetric operator capturing intransitivity, and $\mathbf{D}(\mathbf{x})$ represents the diagonal gating matrix generated by the prompt head.
- HRC: The score is the sum of the transitive (scalar difference) and cyclic (GPM) components: $s_{\theta}(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})=\underbrace{(r(\mathbf{x},\mathbf{y}_{w})-r(\mathbf{x},\mathbf{y}_{l}))}_{\text{Transitive}}+\underbrace{s_{\text{GPM}}(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})}_{\text{Cyclic}}$.
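The three score definitions and the shared loss can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation: the function names are hypothetical, the embeddings and gate are passed in as plain vectors instead of being produced by a language-model head, and the sign convention of the 2x2 blocks in `skew_operator` is an assumption.

```python
import numpy as np

def skew_operator(dim):
    """Block-diagonal skew-symmetric matrix built from 2x2 blocks (assumed sign convention)."""
    assert dim % 2 == 0, "cyclic dimension must be even"
    A = np.zeros((dim, dim))
    for k in range(0, dim, 2):
        A[k, k + 1] = -1.0
        A[k + 1, k] = 1.0
    return A

def bt_score(r_w, r_l):
    """Transitive score: difference of scalar rewards."""
    return r_w - r_l

def gpm_score(phi_w, phi_l, gate):
    """Cyclic score phi_w^T D A D phi_l with D = diag(gate)."""
    D = np.diag(gate)
    A = skew_operator(len(gate))
    return float(phi_w @ D @ A @ D @ phi_l)

def hrc_score(r_w, r_l, phi_w, phi_l, gate):
    """Hybrid score: transitive part plus cyclic part."""
    return bt_score(r_w, r_l) + gpm_score(phi_w, phi_l, gate)

def pairwise_loss(scores, tau=0.1):
    """Mean of -log sigmoid(score / tau), written as log(1 + exp(-z)) for stability."""
    z = np.asarray(scores, dtype=float) / tau
    return float(np.mean(np.logaddexp(0.0, -z)))

rng = np.random.default_rng(0)
phi_w, phi_l = rng.normal(size=4), rng.normal(size=4)
gate = rng.uniform(0.5, 1.5, size=4)
score = hrc_score(0.8, 0.3, phi_w, phi_l, gate)
print(score, pairwise_loss([score]))
```

Because $\mathbf{D}\mathbf{A}\mathbf{D}$ stays skew-symmetric for any diagonal gate, swapping the two responses flips the sign of every score, so all three models define a valid preference probability through $\sigma$.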

Hyperparameters for Preference Model Training. We specifically align our parameters with the settings reported in the GPM literature (Zhang et al., [2025c](https://arxiv.org/html/2605.17342#bib.bib50)), utilizing a temperature $\tau=0.1$. The detailed hyperparameters derived from our training scripts are listed in [Table 3](https://arxiv.org/html/2605.17342#A3.T3).

Table 3: Hyperparameters for Preference Model Training. These settings are applied consistently to BT, GPM, and HRC models.

Iterative Self-Play Setup. For the downstream alignment phase, we employ the iterative self-play framework. While the preference signal is provided by our pre-trained models, the policy optimization process itself involves specific generation and training configurations derived from the SPPO protocol (Wu et al., [2025b](https://arxiv.org/html/2605.17342#bib.bib46)). To implement DSPPO, we dynamically modulated the output of the preference model during the computation of preference probabilities in the SPPO framework, adjusting the signal strength according to the iteration index. The alignment is conducted over $T=3$ iterations. To estimate the preference landscape effectively, we generate $K=5$ distinct responses for each prompt in the UltraFeedback dataset (Cui et al., [2024](https://arxiv.org/html/2605.17342#bib.bib12)). We use a sampling temperature of $1.0$ to encourage diverse exploration of the policy’s response space.

Training Hyperparameters in Alignment. The hyperparameters used for the policy updates are listed in [Table 4](https://arxiv.org/html/2605.17342#A3.T4). We strictly adhere to the settings provided in the official SPPO implementation to ensure a fair evaluation of our proposed DSPPO scheduling strategy.

Table 4: Hyperparameters for Alignment (SPPO & DSPPO). Parameters are consistent across all iterations.

### C.4 Impact of $\lambda$ in DSPPO

In the DSPPO framework, the hyperparameter $\lambda$ controls the dynamic weighting between the transitive ($s_{T}$) and cyclic ($s_{C}$) components. Our proposed schedule uses $\lambda>0$, which initializes the training with a stronger emphasis on the transitive component (global hierarchy) and progressively increases the weight of the cyclic component (local nuances). This design is motivated by the hypothesis of curriculum learning (Bengio et al., [2009](https://arxiv.org/html/2605.17342#bib.bib6); Hacohen & Weinshall, [2019](https://arxiv.org/html/2605.17342#bib.bib19)): the model should first establish a robust baseline of instruction following and safety (transitivity) before refining its behavior with complex, non-transitive stylistic preferences (cyclicity).

To validate this hypothesis, we conduct a comprehensive investigation across a broad range of $\lambda$ values. We systematically evaluate $\lambda\in\{-2,-1,-0.5,0,0.25,0.5,0.75,1,2\}$ using the Llama-3.1-8B-Instruct backbone. This analysis allows us to examine: (1) the effect of negative $\lambda$ values, which imply a reverse trajectory prioritizing cyclic dynamics early in training; (2) the impact of varying positive $\lambda$ values on the convergence trajectory; and (3) the behavior when $|\lambda|>1$, which may cause the schedule coefficients to become negative.

Results. [Table 5](https://arxiv.org/html/2605.17342#A3.T5) presents the performance on AlpacaEval 2.0 (Iteration 3) across different $\lambda$ settings. Several key observations emerge from our analysis:

First, $\lambda>0$ generally leads to better performance, with the proposed schedule achieving the peak Length-Controlled Win Rate (LC. WR) of 41.90% at $\lambda=+1.0$. Among positive $\lambda$ values, smaller magnitudes (0.25, 0.5, 0.75) also perform well, with LC. WRs of 41.09%, 40.85%, and 41.54% respectively.

Second, $\lambda<0$ consistently underperforms. Settings with $\lambda=-0.5$, $-1.0$, and $-2.0$ achieve LC. WRs of 40.21%, 38.66%, and 37.27% respectively. This validates our hypothesis that starting from stable, transitive preference signals and gradually incorporating cyclic components is beneficial for training, while the reverse trajectory struggles to establish a solid foundation.

Third, the Static Baseline ($\lambda=0$) performs adequately with an LC. WR of 40.62%, but falls short of every positive schedule with $\lambda\leq 1$ (40.85% to 41.90%).

Fourth, when $|\lambda|>1$, the coefficients in the schedule $s_{t}=(1+\frac{\lambda}{\sqrt{t}})s_{T}+(1-\frac{\lambda}{\sqrt{t}})s_{C}$ may become negative for certain components. In this regime, the resulting signal lacks a clear semantic interpretation as a meaningful combination of multiple preference signals. Our experiments show that $\lambda=2.0$ (LC. WR of 40.15%) underperforms compared to reasonable positive $\lambda$ values, further supporting the interpretation that extreme weightings compromise the schedule’s effectiveness.
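To make the sign behavior concrete, the sketch below evaluates the two schedule coefficients over the three iterations used in the experiments. The function names are hypothetical; only the formula itself comes from the text.

```python
import math

def schedule_coefficients(lam, t):
    """Weights (w_T, w_C) in s_t = (1 + lam/sqrt(t)) * s_T + (1 - lam/sqrt(t)) * s_C."""
    shift = lam / math.sqrt(t)
    return 1.0 + shift, 1.0 - shift

def dynamic_score(s_T, s_C, lam, t):
    """Combine a transitive score s_T and a cyclic score s_C at iteration t."""
    w_T, w_C = schedule_coefficients(lam, t)
    return w_T * s_T + w_C * s_C

for lam in (-2.0, 0.0, 1.0, 2.0):
    weights = [schedule_coefficients(lam, t) for t in (1, 2, 3)]
    print(lam, [(round(w_T, 3), round(w_C, 3)) for w_T, w_C in weights])
```

With $\lambda=1$ the cyclic weight starts at exactly zero and grows with $t$, while $\lambda=2$ keeps the cyclic weight negative for all three iterations and $\lambda=-2$ does the same to the transitive weight. The two weights always sum to 2, so $\lambda=0$ recovers the plain sum $s_{T}+s_{C}$.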

Additionally, we experimented with alternative schedule forms, such as replacing $\frac{\lambda}{\sqrt{t}}$ with $\frac{\lambda}{\sqrt[3]{t}}$ or using sinusoidal schedules (e.g., $s_{t}=(1+\sin(\frac{\pi}{2t}))s_{T}+(1-\sin(\frac{\pi}{2t}))s_{C}$). However, these variants did not exhibit stable or consistent improvements compared to the $1/\sqrt{t}$-based schedule.

Overall, these results suggest that DSPPO is not overly sensitive to precise $\lambda$ tuning within a reasonable range (approximately $0.25$ to $1.0$), and that the proposed $\lambda=+1.0$ schedule provides a stable and effective default choice.

Table 5: Ablation study on scheduling parameter $\lambda$ (AlpacaEval 2.0, Iteration 3). We evaluate a broad range of $\lambda$ values using the Llama-3.1-8B-Instruct preference model. Results are reported for the model obtained at the third iteration. LC. WR: Length-Controlled Win Rate (%); WR: Standard Win Rate; Avg. Len: Average response length. The best results are highlighted in bold. When $|\lambda|>1$, coefficients in the schedule may become negative, potentially compromising semantic interpretation.

| Setting | $\lambda$ | LC. WR | WR | Avg. Len |
|---|---|---|---|---|
| Base Model | | 33.13 | 35.26 | 2106 |
| Negative $\lambda$ (Inverse schedule) | $-2.0$ | 37.27 | 41.41 | 2170 |
| | $-1.0$ | 38.66 | 43.09 | 2192 |
| | $-0.5$ | 40.21 | 44.22 | 2179 |
| Zero $\lambda$ (Static baseline) | $0.0$ | 40.62 | 46.30 | 2245 |
| Positive $\lambda$ (Proposed schedule) | $0.25$ | 41.09 | 44.84 | 2183 |
| | $0.50$ | 40.85 | 42.98 | 2139 |
| | $0.75$ | 41.54 | 43.73 | 2139 |
| | $+1.0$ | **41.90** | 44.79 | 2171 |
| Extreme $\lambda>1$ | $+2.0$ | 40.15 | 44.30 | 2186 |

### C.5 Additional Experiments on RewardBench

To ensure a comprehensive comparison with established baselines in the literature, we extend our evaluation to RewardBench (Lambert et al., [2025](https://arxiv.org/html/2605.17342#bib.bib26)). Following the same experimental setup as our main analysis, we benchmark the HRC model against BT and GPM using Gemma-2B-it and Llama-3.1-8B-Instruct trained on the Skywork-80K dataset.

Results and Analysis. The results are presented in [Table 6](https://arxiv.org/html/2605.17342#A3.T6). On RewardBench, using the Gemma-2B-it base model, HRC achieves an average score of 82.20% (dim=$2+1$), an improvement of 1.21 percentage points over the GPM baseline’s best average score of 80.99%. Specifically, in the Chat task, HRC improves performance from 83.24% (GPM) to 84.64%, and in the Safety task, from 85.00% to 86.08% (with dim=$4+1$). For the Llama-3.1-8B-Instruct base model, HRC achieves an average score of 91.99% (dim=$4+1$), a 0.85-point improvement over the GPM baseline’s best average score of 91.14%. In the Chat task, HRC improves from 92.74% (GPM) to 94.13%, a gain of 1.39 points. These results indicate that HRC attains the best average score on both base models and the top score in every category except Reasoning on Llama-3.1-8B-Instruct, where GPM (dim=$4$) scores highest (95.81% vs. 95.18%). The gains appear particularly in the Chat and Safety categories, which require capturing nuanced preferences and robust safety alignment. Note that the HRC model can be viewed as the combination of the BT model and GPM.

Table 6: Comparison of preference modeling capabilities on RewardBench. We evaluate the Bradley-Terry (BT) model, the General Preference Model (GPM), and HRC model using Gemma-2B-it and Llama-3.1-8B-Instruct. Note that the BT model and HRC model can be viewed as a special case of GPM with special embedding dimension. The highest scores are highlighted in bold.

| Base Model & Method | Chat | Chat-Hard | Safety | Reasoning | Average |
|---|---|---|---|---|---|
| Gemma-2B-it + BT (dim=1) | 81.56 | 68.77 | 84.86 | 83.86 | 79.76 |
| Gemma-2B-it + GPM (dim=2) | 83.24 | 70.39 | 85.00 | 85.32 | 80.99 |
| Gemma-2B-it + GPM (dim=4) | 82.40 | 69.08 | 84.32 | 87.35 | 80.79 |
| Gemma-2B-it + HRC (dim=2+1) | **84.64** | **71.05** | 85.68 | 87.42 | **82.20** (+1.21) |
| Gemma-2B-it + HRC (dim=4+1) | 83.80 | 69.74 | **86.08** | **87.49** | 81.78 |
| Llama-3.1-8B-Instruct + BT (dim=1) | 89.11 | 84.86 | 92.97 | 94.90 | 90.46 |
| Llama-3.1-8B-Instruct + GPM (dim=2) | 92.74 | 85.30 | 91.76 | 94.76 | 91.14 |
| Llama-3.1-8B-Instruct + GPM (dim=4) | 92.46 | 83.91 | 92.16 | **95.81** | 91.08 |
| Llama-3.1-8B-Instruct + HRC (dim=2+1) | 93.58 | **85.96** | 92.70 | 95.11 | 91.84 |
| Llama-3.1-8B-Instruct + HRC (dim=4+1) | **94.13** | 85.53 | **93.11** | 95.18 | **91.99** (+0.85) |

Table 7: AlpacaEval 2.0 evaluation results. We compare alignment performance on Llama-3.1-8B-Instruct, using different methods: SPPO (BT, GPM, and HRC) and DSPPO (HRC). LC. WR: Length-Controlled Win Rate (%); WR: Standard Win Rate (%) against GPT-4-Turbo; Avg. Len: Average response length in tokens. The best scores within each iteration group are marked in bold.

### C.6 Detailed Analysis of Alignment Results

Results and Analysis. Complementing the summarized findings in the main text, we present the comprehensive evaluation breakdowns for AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench in [Table 7](https://arxiv.org/html/2605.17342#A3.T7), [Table 8](https://arxiv.org/html/2605.17342#A3.T8), and [Table 9](https://arxiv.org/html/2605.17342#A3.T9), respectively. This analysis reveals three insights regarding the behavior of our proposed framework.

First, the detailed metrics in [Table 7](https://arxiv.org/html/2605.17342#A3.T7) indicate that HRC+DSPPO achieves its superior win rate (44.75%) through genuine capability improvements rather than length exploitation. Unlike GPM baselines, which exhibit signs of “reward hacking” by inflating response length (e.g., jumping to 2168 tokens), our method maintains concise outputs (2111 tokens), suggesting that the cyclic component enhances information density without encouraging verbosity.

Second, the evaluation results on Arena-Hard-v0.1 in [Table 8](https://arxiv.org/html/2605.17342#A3.T8) further support the robustness of our framework. On Gemma-2B-it, HRC+DSPPO achieves a score of 46.8% in the final iteration, outperforming both BT+SPPO (40.9%) and GPM+SPPO (42.1%). Notably, our method achieves a 3.2% improvement over the best baseline, demonstrating superior capability in handling challenging prompts. On Llama-3.1-8B-Instruct, HRC+DSPPO peaks at 45.5% in iteration 2, showing performance gains across different model scales.

Third, the turn-level breakdown in [Table 9](https://arxiv.org/html/2605.17342#A3.T9) suggests that appropriate selection of the preference modeling framework can significantly improve alignment outcomes. For instance, using the preference model trained on Gemma-2B-it, our HRC+DSPPO method identifies a highly effective policy trajectory, achieving a peak score of 8.29. This performance notably surpasses the GPM baseline, which drops to 7.70 in the final iteration, indicating that our approach is capable of finding better solutions in complex, multi-turn scenarios.

Table 8: Arena-Hard-v0.1 evaluation results. We evaluate alignment performance on Arena-Hard-v0.1, a challenging benchmark that tests model capabilities on difficult prompts. We compare BT+SPPO, GPM+SPPO, HRC+SPPO, and HRC+DSPPO across three iterations using both Gemma-2B-it and Llama-3.1-8B-Instruct as preference models. WR: Standard Win Rate (%) against GPT-4-0314. The best scores within each iteration group are marked in bold.

Table 9: MT-Bench evaluation results. We assess multi-turn conversation capabilities on Gemma-2B-it and Llama-3.1-8B-Instruct. The scores (1st turn, 2nd turn, and Average) are graded by GPT-4. The best Average scores within each iteration group are marked in bold.

### C.7 Validation of GPT-based Evaluation

Given that our evaluation relies on GPT-family models as judges, we acknowledge the potential concern of judge bias in automated evaluation. To validate the reliability of our GPT-based evaluation pipeline, we conduct two complementary analyses: (1) agreement between GPT evaluation and human annotations, and (2) cross-evaluation using multiple GPT variants.

GPT vs. Human Agreement. To assess the consistency between GPT-based evaluation and human judgments, we sampled 500 response pairs from AlpacaEval 2.0 and collected annotations from a human evaluator. Each response pair was evaluated by both GPT-4o-mini (using the standard AlpacaEval 2.0 protocol) and a human annotator, who determined which response was preferred. The resulting confusion matrix is shown in [Table 10](https://arxiv.org/html/2605.17342#A3.T10).

Table 10: Agreement between GPT-4o-mini and Human Evaluation on 500 AlpacaEval 2.0 samples.

| | Human Positive | Human Negative | Total |
|---|---|---|---|
| GPT Positive | TP = 247 | FP = 32 | 279 |
| GPT Negative | FN = 26 | TN = 195 | 221 |
| Total | 273 | 227 | 500 |

The Cohen’s kappa coefficient of $\kappa\approx 0.7655$ indicates substantial agreement between GPT-based evaluation and human judgments.
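The reported kappa can be reproduced from the four confusion-matrix counts with the standard two-rater formula. The helper `cohens_kappa` below is a hypothetical name written for this check, not a function from the paper's codebase.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 agreement table between two raters."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # raw agreement rate
    # Chance agreement from the row (GPT) and column (human) marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1.0 - expected)

kappa = cohens_kappa(tp=247, fp=32, fn=26, tn=195)
raw_agreement = (247 + 195) / 500
print(f"raw agreement = {raw_agreement:.3f}, kappa = {kappa:.4f}")
```

The raw agreement rate is 88.4%, and correcting for the roughly 50.5% agreement expected by chance yields the kappa value quoted above.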

Cross-Evaluation with Multiple GPT Variants. To further assess the robustness of our conclusions to the choice of judge, we performed cross-evaluation using multiple GPT-based evaluators. We evaluated the model obtained at the third iteration (trained with the Llama-3.1-8B-Instruct preference model) on AlpacaEval 2.0 using different GPT variants: GPT-4o-mini, GPT-4.1, and GPT-5-mini. The results are shown in [Table 11](https://arxiv.org/html/2605.17342#A3.T11).

Table 11: Cross-Evaluation Results using Different GPT-Based Evaluators on AlpacaEval 2.0 (Iteration 3). All methods use Llama-3.1-8B-Instruct as the preference model. The best scores within each evaluator are highlighted in bold. The consistent relative rankings across evaluators demonstrate robustness to judge selection.

The results show consistent relative rankings across all three evaluators: HRC + DSPPO achieves the highest Length-Controlled Win Rate (LC. WR), followed by GPM + SPPO, then BT + SPPO. This consistency across different GPT variants indicates that our conclusions are robust to the specific choice of GPT-based judge.

Summary\.Collectively, these analyses provide evidence that our GPT\-based evaluation pipeline produces reliable and robust results\. The substantial agreement between GPT and human evaluations and the consistent rankings across multiple GPT variants support the validity of our evaluation conclusions\. While automated evaluation cannot fully replace human judgment, these validation steps give us confidence that the relative performance differences reported in our main experiments are meaningful and not artifacts of judge bias\.

### C\.8Scalability Evaluation on Gemma\-2\-9B\-it

To evaluate whether the performance gains of HRC\+DSPPO transfer to stronger backbone models, we conduct a scalability experiment usingGemma\-2\-9B\-it\(Team et al\.,[2024b](https://arxiv.org/html/2605.17342#bib.bib40)\)as the backbone\. Due to computational constraints, we use a 4\-bit quantized version of Gemma\-2\-9B\-it for efficient inference\. Critically, we directly reuse the preference models \(BT and HRC, both trained on Gemma\-2B\-it\) without any retraining or fine\-tuning on Gemma\-2\-9B\-it data\. This setup provides a direct test of whether the preference signals learned by HRC on a smaller model generalize effectively to guide the alignment of a larger, more capable policy\.

Experimental Setup\. Following the same protocol as our main alignment experiments, we perform SPPO and DSPPO over $T=3$ iterations\. We compare three settings: \(1\) the unaligned Gemma\-2\-9B\-it base model, \(2\) BT\+SPPO as a baseline alignment pipeline, and \(3\) HRC\+DSPPO, our full method\. All models are evaluated on AlpacaEval 2\.0 using GPT\-4o\-mini as the judge, with Length\-Controlled Win Rate \(LC\. WR\) as the primary metric\.

Results and Analysis\. [Table 12](https://arxiv.org/html/2605.17342#A3.T12) presents the results across iterations\. The base Gemma\-2\-9B\-it model achieves an LC\. WR of 38\.38%\. Applying BT\+SPPO for three iterations improves performance to 48\.79%, yielding a substantial gain of \+10\.41%\. Notably, HRC\+DSPPO reaches 42\.90% after just a single iteration, already approaching the BT\+SPPO peak\. By iteration 3, HRC\+DSPPO achieves a peak LC\. WR of 52\.20%, outperforming BT\+SPPO by \+3\.41% and surpassing the base model by \+13\.82%\. These results demonstrate that the preference signals decomposed by HRC generalize effectively across model scales, enabling stronger alignment outcomes even when the preference model is trained on a smaller backbone\. The progressive improvement across DSPPO iterations further confirms that dynamic scheduling of transitive and cyclic signals facilitates stable optimization on larger models\.
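The gains quoted above are absolute differences in LC. WR, that is, percentage points and not relative improvements. A short sketch reproduces them from the figures stated in the paragraph; the variable and function names are illustrative only.

```python
# LC win rates (%) on AlpacaEval 2.0, as quoted in the paragraph above.
base_lc_wr = 38.38        # unaligned Gemma-2-9B-it
bt_sppo_iter3 = 48.79     # BT+SPPO after three iterations
hrc_dsppo_iter3 = 52.20   # HRC+DSPPO after three iterations


def gain_pp(after: float, before: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(after - before, 2)


bt_gain_over_base = gain_pp(bt_sppo_iter3, base_lc_wr)
hrc_gain_over_base = gain_pp(hrc_dsppo_iter3, base_lc_wr)
hrc_gain_over_bt = gain_pp(hrc_dsppo_iter3, bt_sppo_iter3)

print(bt_gain_over_base, hrc_gain_over_base, hrc_gain_over_bt)
# 10.41 13.82 3.41
```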

Table 12: Scalability evaluation on Gemma\-2\-9B\-it \(4\-bit quantized\)\. We report the Length\-Controlled Win Rate \(LC\. WR, %\) on AlpacaEval 2\.0\. Preference models \(BT, HRC\) are trained on Gemma\-2B\-it and directly applied to post\-train Gemma\-2\-9B\-it without retraining\. The best score is highlighted in bold\.

## Appendix D Response Examples in Different Iterations

We present a representative case study sampled from the AlpacaEval 2\.0 benchmark\. [Table 13](https://arxiv.org/html/2605.17342#A4.T13) compares the responses generated by the base model \(Llama\-3\.1\-8B\-Instruct\) and our method \(HRC\+DSPPO\)\.

As shown in [Table 13](https://arxiv.org/html/2605.17342#A4.T13), the iterative process yields observable gains in two key dimensions:

- Correction of Factual Hallucinations: The Base Model initially hallucinates the dish’s origin as the Veneto region\. Interestingly, Iteration 1 inherits and even elaborates on this error \(adding "Padua"\)\. However, Iteration 2 effectively "unlearns" the false information by adopting a neutral stance, paving the way for Iteration 3 to correctly identify the traditionally accepted origin \(Lombardy/Milan\), thereby achieving factual alignment\.
- Structural Refinement: The response structure evolves from unstructured text blocks to a well\-organized layout\. Iteration 3 demonstrates superior instruction following by using bold keys and categorized lists, making the information more accessible and professional compared to the baseline\.

Table 13: Examples on AlpacaEval 2\.0\. We compare the responses generated by Llama\-3\.1\-8B\-Instruct and our method\.