Pairwise Reference Alignment as a Model-Level Ordinal Observable

arXiv cs.CL 06/01/26, 04:00 AM Papers
pairwise-preference alignment ordinal-observable language-model evaluation statistics rewardbench
Summary
This paper formalizes pairwise reference alignment as a model-level ordinal observable, defining a statistic to measure agreement between a model's scoring and a reference preference distribution, with finite-sample estimators and an empirical study on Qwen2.5 models and RewardBench.
arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses? We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^+,y^-)$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions. This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.
# Pairwise Reference Alignment as a Model-Level Ordinal Observable
Source: [https://arxiv.org/html/2605.30758](https://arxiv.org/html/2605.30758)
(May 2026)

###### Abstract

Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses?

We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^{+},y^{-})$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions.

This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.

## 1 Introduction

Pairwise preference data is widely used in language-model evaluation and alignment \[[14](https://arxiv.org/html/2605.30758#bib.bib1), [2](https://arxiv.org/html/2605.30758#bib.bib2), [15](https://arxiv.org/html/2605.30758#bib.bib3), [18](https://arxiv.org/html/2605.30758#bib.bib6), [3](https://arxiv.org/html/2605.30758#bib.bib7)\]. A typical comparison consists of a prompt $x$, a preferred response $y^{+}$, and a rejected response $y^{-}$. Such data appears in human preference evaluation, reward modeling, direct preference optimization, and model ranking systems, where it is often used to train a reward model, optimize a policy, compute a win rate, or rank multiple systems.

This note asks a more basic measurement question. Given a reference distribution of pairwise preferences, what model-level quantity is being estimated when we check whether a model ranks $y^{+}$ above $y^{-}$? More concretely:

$$\text{Does the ordering induced by model } M \text{ agree with a reference preference order?}$$

The central premise is that a pair distribution can carry preference information. Preference need not first appear as an absolute score assigned to each response. It may instead appear as a stable comparison relation, as in broader preference-learning formulations \[[8](https://arxiv.org/html/2605.30758#bib.bib19), [16](https://arxiv.org/html/2605.30758#bib.bib18), [6](https://arxiv.org/html/2605.30758#bib.bib20)\]. If a human population, an expert system, or a stronger model consistently selects $y^{+}$ over $y^{-}$ under a target distribution, then the pair distribution itself provides an empirical expression of a reference preference.

For example, for a fixed prompt $x$, repeated comparisons may indicate

$$y^{+} \succ y^{-},$$

where $\succ$ denotes the reference preference relation. The sampled triple from the pair distribution

$$(x,y^{+},y^{-}) \sim P_{\mathrm{pair}}$$

already carries an ordinal signal about which response is preferred under the reference.

The goal of this note is to turn this ordinal signal into a model-level quantity. Given a model scoring function $S_M(x,y)$, we define the probability that the model-induced ordering agrees with the reference preference ordering:

$$A_M(P_{\mathrm{pair}}) = \mathbb{P}_{(x,y^{+},y^{-}) \sim P_{\mathrm{pair}}}\left[S_M(x,y^{+}) > S_M(x,y^{-})\right].$$

This quantity is not intended to be a complete measure of human preference, model capability, or alignment. It is an estimand: the agreement probability between a model-induced ordering and a reference preference ordering under a specified pair distribution.

The contribution of this note is conceptual and statistical. Pairwise comparison itself is not new; the contribution here is to isolate the population-level measurement object induced by a fixed scoring rule and a reference pair distribution, and to treat the finite benchmark score as an estimator of that object rather than as the object itself. We define a discrete pairwise reference alignment observable and a real-valued margin statistic, distinguish population quantities from finite-sample estimators, derive simple concentration bounds, and discuss how log-probability or energy-based scores provide natural scoring choices. The order-parameter terminology is borrowed from statistical physics because the proposed quantities compress many local pairwise relations into a single macroscopic statistic relative to a reference distribution. This analogy will be revisited in Section [7](https://arxiv.org/html/2605.30758#S7).

Empirically, we instantiate this framework with token-normalized log-likelihood scores for Qwen2.5 models \[[17](https://arxiv.org/html/2605.30758#bib.bib5)\] on RewardBench \[[10](https://arxiv.org/html/2605.30758#bib.bib4)\]. The experiments are not intended as a complete validation across model families or preference distributions. Rather, they test whether the proposed observables behave coherently in a controlled setting: larger and instruction-tuned models should show stronger agreement with the reference ordering, subset-level estimates should depend on the reference pair distribution, and finite-sample behavior should match the statistical analysis.

## 2 Problem Formulation

### 2.1 Reference pair distribution

Let $P_{\mathrm{pair}}$ denote a target reference pair distribution. A sample from this distribution is a triple

$$(x,y^{+},y^{-}) \sim P_{\mathrm{pair}},$$

where $x$ is a prompt, $y^{+}$ is the response preferred by the reference, and $y^{-}$ is the response rejected by the reference.

The reference may be a human annotator population, an expert rule system, a stronger model, a reward model, or a policy defining a desired behavioral dimension such as helpfulness, harmlessness, truthfulness, or mathematical reasoning quality. The reference supplies the preferred/rejected relation, while $S_M$ supplies the model-induced ordering being tested. Different choices of $P_{\mathrm{pair}}$ define different alignment targets: mathematical reasoning comparisons and safety comparisons, for example, do not specify the same target. Alignment in this note is therefore always relative to the specified reference pair distribution.

### 2.2 Finite evaluation sets

In practice, we do not observe the full distribution $P_{\mathrm{pair}}$. We observe a finite evaluation set

$$\mathcal{C} = \{(x_k, y_k^{+}, y_k^{-})\}_{k=1}^{K}.$$

The distinction between $P_{\mathrm{pair}}$ and $\mathcal{C}$ is important. The former is the conceptual target distribution; the latter is an empirical sample used to estimate model-level quantities. Claims about alignment are only as broad as the reference distribution that the evaluation set represents.

## 3 Model-Induced Ordering

Let $M$ be a model and let

$$S_M(x,y) \in \mathbb{R}$$

be a scalar scoring function assigned by, or associated with, the model for response $y$ under prompt $x$. The score may be a reward model score, a judge score, a task-specific evaluation score, or any other scalar quantity that can compare two responses under the same prompt.

The scoring function induces an ordering over responses:

$$y_i \succ_M y_j \quad\Longleftrightarrow\quad S_M(x,y_i) > S_M(x,y_j),$$

where $\succ_M$ denotes the preference relation induced by model $M$ and score $S_M$.

The statistical construction in Sections [4](https://arxiv.org/html/2605.30758#S4) and [5](https://arxiv.org/html/2605.30758#S5) depends only on this induced ordering, not on the origin of the score. In Section [7](https://arxiv.org/html/2605.30758#S7), we discuss normalized log-probability and the corresponding negative energy score as natural scoring choices for language models.

## 4 Pairwise Reference Alignment Observable

### 4.1 Definition

We first define a discrete, sign-based observable. This construction only asks whether the model ranks the reference-preferred response above the rejected response. It does not measure the strength of that preference.

###### Definition 1 (Pairwise agreement indicator).

For a pair $(x,y^{+},y^{-})$, define

$$Z_M(x,y^{+},y^{-}) = \mathbf{1}\left[S_M(x,y^{+}) > S_M(x,y^{-})\right].$$

Then $Z_M=1$ indicates agreement between the model-induced ordering and the reference preference ordering, while $Z_M=0$ indicates disagreement.

###### Definition 2 (Pairwise reference alignment observable).

The model-level pairwise reference alignment observable is

$$A_M(P_{\mathrm{pair}}) = \mathbb{E}_{(x,y^{+},y^{-}) \sim P_{\mathrm{pair}}}\left[Z_M(x,y^{+},y^{-})\right].$$

Equivalently,

$$A_M(P_{\mathrm{pair}}) = \mathbb{P}_{(x,y^{+},y^{-}) \sim P_{\mathrm{pair}}}\left[S_M(x,y^{+}) > S_M(x,y^{-})\right].$$

The quantity $A_M(P_{\mathrm{pair}})$ has a direct interpretation: if one draws a random pair from the reference pair distribution, it is the probability that the model-induced ordering agrees with the reference ordering.

For a centered version, define

$$m_M^{\mathrm{sign}}(P_{\mathrm{pair}}) = 2A_M(P_{\mathrm{pair}}) - 1.$$

Then $m_M^{\mathrm{sign}}=1$ means perfect agreement, $m_M^{\mathrm{sign}}=0$ corresponds to random-level agreement, and $m_M^{\mathrm{sign}}<0$ indicates systematic preference for the reference-rejected response.

In this sense, $m_M^{\mathrm{sign}}$ can be viewed as an order-parameter-like statistic: many microscopic pairwise comparisons are averaged into a single quantity that summarizes the model's macroscopic state relative to a reference pair distribution. The connection to statistical-physics order parameters will be discussed again in Section [7](https://arxiv.org/html/2605.30758#S7).

### 4.2 Finite-sample estimation and bound

Given a finite evaluation set

$$\mathcal{C} = \{(x_k, y_k^{+}, y_k^{-})\}_{k=1}^{K},$$

the empirical estimator of $A_M(P_{\mathrm{pair}})$ is

$$\hat{A}_M(\mathcal{C}) = \frac{1}{K}\sum_{k=1}^{K} Z_M(x_k, y_k^{+}, y_k^{-}).$$

The corresponding empirical centered statistic is

$$\hat{m}_M^{\mathrm{sign}}(\mathcal{C}) = 2\hat{A}_M(\mathcal{C}) - 1.$$

Assume that the pairs in $\mathcal{C}$ are sampled independently from $P_{\mathrm{pair}}$. Since each $Z_M(x_k, y_k^{+}, y_k^{-}) \in \{0,1\}$, Hoeffding's inequality gives

$$\mathbb{P}\left(\left|\hat{A}_M(\mathcal{C}) - A_M(P_{\mathrm{pair}})\right| \geq \epsilon\right) \leq 2\exp(-2K\epsilon^{2}).$$

Therefore, to guarantee an error smaller than $\epsilon$ with probability at least $1-\delta$, it is sufficient that

$$K \geq \frac{1}{2\epsilon^{2}}\log\frac{2}{\delta}.$$

For example, if $\epsilon=0.05$ and $\delta=0.05$, then

$$K \geq \frac{1}{2(0.05)^{2}}\log\frac{2}{0.05} \approx 738.$$

A short derivation is provided in Appendix [A.1](https://arxiv.org/html/2605.30758#A1.SS1).
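The empirical estimators defined in this subsection can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `sign_alignment_estimate` and the example scores are hypothetical, and the scores are assumed to have been computed beforehand by whatever $S_M$ is in use.

```python
from typing import Sequence, Tuple


def sign_alignment_estimate(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (A_hat, m_hat_sign) for paired scores.

    pos_scores[k] stands for S_M(x_k, y_k^+), neg_scores[k] for S_M(x_k, y_k^-).
    """
    if len(pos_scores) != len(neg_scores) or len(pos_scores) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    # Z_k = 1 only under a strict inequality, so ties count as disagreement,
    # matching the indicator in Definition 1.
    agreements = sum(1 for sp, sn in zip(pos_scores, neg_scores) if sp > sn)
    a_hat = agreements / len(pos_scores)
    return a_hat, 2.0 * a_hat - 1.0


# Hypothetical scores for four pairs; the third pair is a tie.
a_hat, m_hat = sign_alignment_estimate([0.9, -1.2, 0.3, 2.0], [0.1, -0.5, 0.3, 1.5])
print(a_hat, m_hat)
```

Because the indicator uses a strict inequality, a scoring rule that produces many ties will be penalized; whether that is desirable depends on the application.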

This is a population estimation bound, not merely a descriptive statistic of the finite evaluation set. Under independent sampling, $\hat{A}_M(\mathcal{C})$ estimates $A_M(P_{\mathrm{pair}})$, and the bound gives a sufficient number of pairs needed to approximate that agreement probability within a prescribed error. The evaluation set is therefore a sampling instrument for the target quantity, rather than the object being defined.

This is in line with the view that language-model evaluations should be treated as statistical experiments with explicit uncertainty estimates \[[12](https://arxiv.org/html/2605.30758#bib.bib8), [1](https://arxiv.org/html/2605.30758#bib.bib9)\]. The bound does not control label noise, mismatch between $\mathcal{C}$ and the intended target distribution, data contamination, repeated benchmark selection, or the suitability of $S_M$.

## 5 Margin Observable

The sign-based observable answers a discrete question: did the model rank the pair correctly? It does not answer how strongly the model ranked it correctly or incorrectly. To retain this information, define the signed margin

$$d_M(x,y^{+},y^{-}) = S_M(x,y^{+}) - S_M(x,y^{-}).$$

A positive margin indicates agreement with the reference preference; a negative margin indicates disagreement.

The corresponding population margin is

$$\mu_M(P_{\mathrm{pair}}) = \mathbb{E}_{(x,y^{+},y^{-}) \sim P_{\mathrm{pair}}}\left[d_M(x,y^{+},y^{-})\right],$$

with empirical estimator

$$\hat{\mu}_M(\mathcal{C}) = \frac{1}{K}\sum_{k=1}^{K} d_M(x_k, y_k^{+}, y_k^{-}).$$

### 5.1 Interpretation of the margin

The margin $d_M(x,y^{+},y^{-})$ is a signed score difference between two responses under the same prompt. It records both the direction of the model-induced ordering and the magnitude of the score gap. A positive margin means that the model score favors the reference-preferred response, while a negative margin favors the reference-rejected response.

The sign observable keeps only whether the margin is positive:

$$Z_M(x,y^{+},y^{-}) = \mathbf{1}\left[d_M(x,y^{+},y^{-}) > 0\right].$$

Thus $A_M(P_{\mathrm{pair}})$ averages the sign of the margin rather than its magnitude. The sign construction is simple, bounded, and statistically stable, but it discards preference strength. The margin statistic retains that information.

The interpretation becomes especially concrete when the score is a log-probability, a scoring choice discussed in Section [7](https://arxiv.org/html/2605.30758#S7). If

$$S_M(x,y) = \log Q_M(y \mid x),$$

then the margin is

$$d_M(x,y^{+},y^{-}) = \log Q_M(y^{+} \mid x) - \log Q_M(y^{-} \mid x) = \log\frac{Q_M(y^{+} \mid x)}{Q_M(y^{-} \mid x)}.$$

In this case, the margin is a log-likelihood ratio between the reference-preferred and reference-rejected responses. For example, $d_M=0$ means that the model assigns equal probability to the two responses; $d_M=\log 2$ means that it assigns twice as much probability to $y^{+}$ as to $y^{-}$; and $d_M=-\log 2$ means that it assigns half as much probability to $y^{+}$ as to $y^{-}$. When the score is chosen as negative token-normalized energy, as in Section [7.3](https://arxiv.org/html/2605.30758#S7.SS3), the same margin measures a relative energy gap.

The sign statistic measures ordinal agreement, while the margin statistic is sensitive to strength. The margin version is also more delicate: when $S_M$ is derived from log-probability or energy, margins may be heavy-tailed and sensitive to length, rare tokens, or extremely low-probability sequences. Practical estimation may require clipping, robust means, bootstrap confidence intervals, or reporting the full margin distribution rather than only its mean.
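As one possible way to combine the remedies listed above, the following sketch computes the mean margin with optional clipping, reports the median as a simple robust summary, and attaches a percentile bootstrap interval. The function `margin_summary`, its parameters, the clipping interval, and the synthetic scores are all assumptions made for illustration; the note does not prescribe a particular procedure.

```python
import numpy as np


def margin_summary(pos_scores, neg_scores, clip=None, n_boot=2000, alpha=0.05, seed=0):
    """Mean signed margin with optional clipping and a percentile bootstrap interval."""
    d = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    if clip is not None:
        lo, hi = clip
        d = np.clip(d, lo, hi)  # limits the influence of extreme margins
    rng = np.random.default_rng(seed)
    # Resample pairs with replacement and recompute the mean each time.
    idx = rng.integers(0, d.size, size=(n_boot, d.size))
    boot_means = d[idx].mean(axis=1)
    ci_low, ci_high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return {
        "mean": float(d.mean()),
        "median": float(np.median(d)),
        "ci": (float(ci_low), float(ci_high)),
    }


# Synthetic scores (hypothetical) with one extreme pair to mimic a heavy tail.
rng = np.random.default_rng(1)
pos = rng.normal(0.2, 1.0, size=500)
neg = rng.normal(0.0, 1.0, size=500)
neg[0] = 50.0
raw = margin_summary(pos, neg)
clipped = margin_summary(pos, neg, clip=(-2.0, 2.0))
print(raw)
print(clipped)
```

Clipping changes the estimand (the mean of the clipped margin, not of the raw margin), so the clipping interval should be reported alongside the estimate.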

### 5.2 Finite-sample bound for bounded margins

The margin mean also admits a simple concentration bound if the signed margin is bounded. Suppose that, for all pairs under consideration,

$$d_M(x,y^{+},y^{-}) \in [a,b].$$

Then the empirical mean

$$\hat{\mu}_M(\mathcal{C}) = \frac{1}{K}\sum_{k=1}^{K} d_M(x_k, y_k^{+}, y_k^{-})$$

satisfies Hoeffding's inequality:

$$\mathbb{P}\left(\left|\hat{\mu}_M(\mathcal{C}) - \mu_M(P_{\mathrm{pair}})\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2K\epsilon^{2}}{(b-a)^{2}}\right).$$

Thus, to guarantee error at most $\epsilon$ with probability at least $1-\delta$, it is sufficient that

$$K \geq \frac{(b-a)^{2}}{2\epsilon^{2}}\log\frac{2}{\delta}.$$

A short derivation is provided in Appendix [A.2](https://arxiv.org/html/2605.30758#A1.SS2).

This bound makes an important statistical point: the sample complexity of the margin observable depends directly on the scale of the score. If margins are known or clipped to an interval of width

$$R = b - a,$$

then estimating the population mean margin within absolute error $\epsilon$ requires sample size proportional to $R^{2}/\epsilon^{2}$. A scoring rule with a large range or high variability can therefore require many more pairs to estimate reliably. The design of $S_M$ affects not only the meaning of the margin, but also its statistical estimability. In practice, margin-based evaluation should report the score scale, clipping rule, normalization, or robust estimator used to control this variability.

One way to make the bound scale-free is to measure error relative to the margin range. If the desired error is expressed as a fraction $\eta$ of the range, so that $\epsilon=\eta R$, then the sufficient sample size becomes

$$K \geq \frac{1}{2\eta^{2}}\log\frac{2}{\delta}.$$

In this relative-error form, the explicit dependence on $R$ cancels. This does not mean that the score scale is irrelevant; rather, it means that the error target has been normalized by the range used to define the margin. Reporting relative error, normalized margins, or clipped ranges can therefore make margin estimates easier to compare across scoring rules.
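A short numerical sketch makes the two regimes visible: with a fixed absolute error the sufficient sample size grows quadratically in the margin width, while with a fixed relative error it stays constant. The helper `margin_sample_size` and the chosen widths are illustrative assumptions.

```python
import math


def margin_sample_size(width: float, eps: float, delta: float) -> int:
    """Smallest integer K with K >= width^2 / (2 eps^2) * log(2 / delta)."""
    if width <= 0 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("width and eps must be positive, delta in (0, 1)")
    return math.ceil(width**2 / (2.0 * eps**2) * math.log(2.0 / delta))


for width in (1.0, 2.0, 4.0):
    # Fixed absolute error eps = 0.1: K scales with width^2.
    fixed_abs = margin_sample_size(width, eps=0.1, delta=0.05)
    # Fixed relative error eta = 0.05 (eps = eta * width): width cancels.
    fixed_rel = margin_sample_size(width, eps=0.05 * width, delta=0.05)
    print(width, fixed_abs, fixed_rel)
```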

## 6 Illustrative Calculations

This section gives simple calculations that illustrate how the estimands, estimators, and finite-sample bounds fit together. They are not intended as empirical validation.

### 6.1 Sample complexity table

For the sign observable, Hoeffding's inequality gives the sufficient condition

$$K \geq \frac{1}{2\epsilon^{2}}\log\frac{2}{\delta}.$$

Table [1](https://arxiv.org/html/2605.30758#S6.T1) reports the resulting sample sizes for several target errors at 95% confidence. The same table also applies to bounded margin estimation when the target error is expressed as a fraction of the bounded margin range, $\epsilon=\eta(b-a)$.

Table 1: Sufficient sample sizes from Hoeffding's inequality for the sign observable. For bounded margins, the same values apply when the error is measured relative to the margin range, i.e. $\epsilon=\eta(b-a)$.
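Sample sizes of this kind follow directly from the sufficient condition above. The sketch below tabulates them for an illustrative grid of target errors at $\delta=0.05$; the grid is an assumption chosen for this example and is not necessarily the set of rows reported in the paper's table.

```python
import math


def sign_sample_size(eps: float, delta: float) -> int:
    """Smallest integer K with K >= log(2 / delta) / (2 eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))


delta = 0.05  # 95% confidence
# Illustrative target errors (assumed, not taken from the paper).
rows = [(eps, sign_sample_size(eps, delta)) for eps in (0.10, 0.05, 0.02, 0.01)]

print("eps    K")
for eps, k in rows:
    print(f"{eps:<6.2f} {k}")
```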
### 6.2 Toy sign-observable calculation

Suppose an evaluation set contains $K=1000$ independent reference pairs. If a model ranks the reference-preferred response above the rejected response on $720$ of these pairs, then

$$\hat{A}_M(\mathcal{C}) = \frac{720}{1000} = 0.72.$$

The corresponding centered statistic is

$$\hat{m}_M^{\mathrm{sign}}(\mathcal{C}) = 2\hat{A}_M(\mathcal{C}) - 1 = 0.44.$$

Thus the model-induced ordering agrees with the empirical reference ordering on 72% of the pairs, or equivalently has a centered agreement of $0.44$ above the random baseline.

Using the Hoeffding bound with $\delta=0.05$, the error radius is

$$\epsilon = \sqrt{\frac{1}{2K}\log\frac{2}{\delta}}.$$

For $K=1000$, this gives

$$\epsilon = \sqrt{\frac{1}{2000}\log 40} \approx 0.043.$$

Therefore, under the independent-sampling assumption, one obtains the conservative statement

$$A_M(P_{\mathrm{pair}}) \in [0.677, 0.763]$$

with probability at least $0.95$.
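The toy calculation can be reproduced numerically. In this minimal sketch, `hoeffding_radius` is a hypothetical helper name, and the interval is truncated to $[0,1]$ because $A_M$ is a probability.

```python
import math


def hoeffding_radius(k: int, delta: float) -> float:
    """Error radius eps = sqrt(log(2 / delta) / (2 K)) for a [0, 1]-valued mean."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * k))


k, correct = 1000, 720
a_hat = correct / k
eps = hoeffding_radius(k, delta=0.05)
lower, upper = max(0.0, a_hat - eps), min(1.0, a_hat + eps)
print(f"A_hat={a_hat:.3f}, eps={eps:.3f}, interval=[{lower:.3f}, {upper:.3f}]")
```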

### 6.3 Toy margin-observable calculation

For the margin observable, suppose margins are clipped or known to lie in an interval of width $R=b-a=2$. If one wants to estimate the mean margin within absolute error $\epsilon=0.1$ at 95% confidence, Hoeffding's inequality gives

$$K \geq \frac{2^{2}}{2(0.1)^{2}}\log 40 \approx 738.$$

The same numerical sample size appears because the requested error is 5% of the margin range. If the desired absolute error is fixed while the margin range grows, the required number of pairs increases quadratically in the range.

## 7 Scoring Choices and the Energy View of Language Models

The definitions in Sections [3](https://arxiv.org/html/2605.30758#S3)–[5](https://arxiv.org/html/2605.30758#S5) are stated for an arbitrary scalar score $S_M(x,y)$. This separation is intentional: the pairwise alignment observable is defined by the ordering induced by the score, not by a particular scoring rule. For language models, however, normalized log-probability is a natural instance. This section explains why.

### 7.1 From probability to an energy-like quantity

In statistical physics and energy-based modeling, probability and energy are often linked by a simple intuition: high-probability states correspond to low energy, while low-probability states correspond to high energy. A common formal expression of this idea is the Boltzmann-like form \[[11](https://arxiv.org/html/2605.30758#bib.bib13), [4](https://arxiv.org/html/2605.30758#bib.bib14)\]

$$Q_{\theta}(u) = \frac{\exp(-E_{\theta}(u))}{Z_{\theta}},$$

where $u$ is a state, $E_{\theta}(u)$ is an energy function, and $Z_{\theta}$ is a normalizing constant. Taking the negative logarithm gives

$$-\log Q_{\theta}(u) = E_{\theta}(u) + \log Z_{\theta}.$$

Thus negative log-probability has an energy-like interpretation: samples assigned higher probability by the model have lower effective energy.

Now let $P$ denote a data distribution and $Q_{\theta}$ a model distribution. The standard language-model training objective is usually written as token-level cross entropy, equivalently as negative log-likelihood under the data distribution. In distributional form, this objective can be written as

$$H(P,Q_{\theta}) = \mathbb{E}_{u \sim P}\left[-\log Q_{\theta}(u)\right].$$

Under the energy-like parameterization above, this becomes

$$H(P,Q_{\theta}) = \mathbb{E}_{u \sim P}\left[E_{\theta}(u)\right] + \log Z_{\theta}.$$

This decomposition can be read as an energy-based interpretation of the cross-entropy, or equivalently negative-log-likelihood, objective. The first term is the average energy assigned to data states sampled from $P$. The second term is the global normalization term required by $Q_{\theta}$. In this view, likelihood-based training shapes the model-induced energy landscape so that data states receive lower effective energy.
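The decomposition can be checked numerically on a small discrete system. The four-state energies and the data distribution below are hypothetical values chosen only to illustrate the identity $H(P,Q_{\theta}) = \mathbb{E}_{u \sim P}[E_{\theta}(u)] + \log Z_{\theta}$.

```python
import numpy as np

# Hypothetical energies for a four-state toy system and a data distribution P.
energies = np.array([0.5, 1.0, 2.0, 3.5])
p_data = np.array([0.4, 0.3, 0.2, 0.1])

log_z = np.log(np.sum(np.exp(-energies)))   # log Z_theta
q_model = np.exp(-energies - log_z)         # Q_theta(u) = exp(-E_theta(u)) / Z_theta

cross_entropy = -np.sum(p_data * np.log(q_model))  # H(P, Q_theta)
mean_energy = np.sum(p_data * energies)            # E_{u ~ P}[E_theta(u)]

# The two printed numbers agree up to floating-point error.
print(cross_entropy, mean_energy + log_z)
```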

This system is not defined by the model alone. It depends on both the data distribution $P$, which determines which states are sampled, and the model distribution $Q_{\theta}$, which assigns probabilities and effective energies. One may view each state $u$ as a point in the model-induced landscape, with $-\log Q_{\theta}(u)$ serving as its effective energy. In this sense, the energy view turns probability assignment on data or generated samples into a measurable energy-like observable.

Autoregressive language models are not usually trained as globally normalized energy-based models with an explicit partition function over all text sequences. Their conditional distributions are normalized locally through token-level softmax operations. Still, likelihood induces an energy-like landscape over text: sequences with lower negative log-probability are less surprising, more compatible with the model distribution, or easier for the model to predict.

We use this energy language as an interpretive and computational bridge, not as a claim that language models are literal physical energy systems. The pairwise observable proposed in this note is one way to turn that viewpoint into a measurable system-level summary.

### 7.2 Sequence-level and token-level energies

In the conditional generation setting of this note, a language model $M$ defines

$$Q_M(y \mid x)$$

for a response $y$ given prompt $x$. If

$$y = (y_1, y_2, \ldots, y_T),$$

then the autoregressive factorization is

$$Q_M(y \mid x) = \prod_{t=1}^{T} Q_M(y_t \mid x, y_{<t}).$$

This gives a sequence-level energy

$$E_M(y \mid x) = -\log Q_M(y \mid x) = -\sum_{t=1}^{T}\log Q_M(y_t \mid x, y_{<t}).$$

For comparing responses of different lengths, it is often more appropriate to use the token-normalized energy

$$\bar{E}_M(y \mid x) = -\frac{1}{T}\log Q_M(y \mid x) = -\frac{1}{T}\sum_{t=1}^{T}\log Q_M(y_t \mid x, y_{<t}).$$

Here $T=|y|$ denotes the number of tokens in the response. This token-level quantity is the average negative log-likelihood of the response under the model. It is the same quantity minimized by the standard token-level cross-entropy training objective for autoregressive language models.
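Both energies can be computed from per-position next-token logits. The sketch below assumes the logits for the $T$ response positions have already been obtained from some model and that prompt positions are excluded; the function names and the uniform-logit example are hypothetical.

```python
import numpy as np


def token_log_probs(logits: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """log Q_M(y_t | x, y_<t) for each response position.

    logits has shape (T, V): row t holds the next-token logits at response
    position t. token_ids has shape (T,) and holds the realized tokens y_t.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_norm = np.log(np.exp(shifted).sum(axis=1))
    log_probs = shifted - log_norm[:, None]               # row-wise log-softmax
    return log_probs[np.arange(len(token_ids)), token_ids]


def sequence_energy(lp: np.ndarray) -> float:
    """E_M(y | x): negative sum of token log-probabilities."""
    return float(-lp.sum())


def token_normalized_energy(lp: np.ndarray) -> float:
    """E_bar_M(y | x): negative mean of token log-probabilities."""
    return float(-lp.mean())


# Uniform logits over a vocabulary of 8: every token has probability 1/8.
T, V = 5, 8
lp = token_log_probs(np.zeros((T, V)), np.array([3, 1, 4, 1, 5]))
print(sequence_energy(lp), token_normalized_energy(lp))
```

In the uniform case the token-normalized energy equals $\log V$ regardless of $T$, while the sequence-level energy grows linearly with length, which is the length sensitivity that normalization is meant to remove.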

### 7.3 Energy as a scoring function

The general scoring function in Section [3](https://arxiv.org/html/2605.30758#S3) can now be instantiated as

$$S_M(x,y) = -\bar{E}_M(y \mid x).$$

Under this choice, higher score means lower token-normalized energy. Therefore,

$$S_M(x,y^{+}) > S_M(x,y^{-})$$

is equivalent to

$$\bar{E}_M(y^{+} \mid x) < \bar{E}_M(y^{-} \mid x).$$

This is consistent with the probability–energy intuition above: responses that are more natural under the learned distribution occupy lower-energy positions.

Under this scoring rule, $A_M(P_{\mathrm{pair}})$ is the probability that the reference-preferred response occupies a lower-energy position than the reference-rejected response:

$$A_M(P_{\mathrm{pair}}) = \mathbb{P}_{(x,y^{+},y^{-}) \sim P_{\mathrm{pair}}}\left[\bar{E}_M(y^{+} \mid x) < \bar{E}_M(y^{-} \mid x)\right].$$

This interpretation is useful, but it should be kept in its proper place. The primary definition of $A_M(P_{\mathrm{pair}})$ does not require an energy view. The energy view provides one natural way to instantiate $S_M$ for probabilistic language models.

### 7.4 Order-parameter-like interpretation

Under the energy interpretation, the centered statistic

$$m_M^{\mathrm{sign}}(P_{\mathrm{pair}}) = 2A_M(P_{\mathrm{pair}}) - 1$$

summarizes whether reference-preferred responses systematically occupy lower-energy positions than rejected responses. It is not a property of a single sample. It is an aggregate statistic over a reference pair distribution.

In this limited sense, $m_M^{\mathrm{sign}}$ acts as an order-parameter-like statistic: it summarizes a model-level state relative to $P_{\mathrm{pair}}$ by averaging many local pairwise comparisons. The energy perspective gives the same idea a concrete scoring interpretation: the statistic summarizes how the model orders reference pairs in its induced landscape.

## 8 Relational Observable Interpretation

The central insight of this note is relational. A model-level property need not be defined only by assigning an absolute statistic to each sample and averaging those values. It can also be defined through relations between samples. Here, the relation is a pairwise ordering induced by the model and compared against a reference ordering.

One possible measurement strategy is sample\-wise:

$$\text{sample} \mapsto \text{score},$$

where each response receives an absolute scalar value. The strategy used here is instead relational:

$$\text{pair of samples} \mapsto \text{relative order}.$$

The resulting quantity is obtained by aggregating many such local relations over a reference distribution. Comparison itself is the elementary measurement.

This perspective clarifies why pairwise data can encode preference information even when no absolute reward function is observed\. A reference distribution records which response is preferred under a target comparison process\. It describes an ordinal structure rather than a full metric structure\. The proposed statistic measures whether the model\-induced ordinal structure agrees with that reference structure\.

To see this, suppose there exists a latent reward function $r(x,y)$. Pairwise comparisons reveal relations of the form

$$r(x,y_i) > r(x,y_j),$$

but they do not reveal the absolute values of $r(x,y_i)$ and $r(x,y_j)$. If we shift the reward by a constant,

$$r^{\prime}(x,y) = r(x,y) + c,$$

then all pairwise preferences are unchanged. More generally, if $f$ is strictly increasing, then

$$r^{\prime}(x,y) = f(r(x,y))$$

preserves the same ordering. Thus pairwise data identifies the ordinal structure of preference, not the absolute zero point, scale, or metric distances of a latent reward.
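The invariance argument can be checked numerically. The sketch below uses hypothetical latent rewards and two strictly increasing maps (a constant shift and `math.exp`). It illustrates that the sign agreement is unchanged while the mean margin is not; all names and values are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence


def sign_agreement(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of pairs where the preferred score is strictly larger."""
    return sum(1 for p, n in zip(pos, neg) if p > n) / len(pos)


def mean_margin(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Average signed score difference, preferred minus rejected."""
    return sum(p - n for p, n in zip(pos, neg)) / len(pos)


def transform(values: Sequence[float], f: Callable[[float], float]) -> List[float]:
    return [f(v) for v in values]


def shift(r: float) -> float:
    """Constant shift r + c with c = 10 (strictly increasing)."""
    return r + 10.0


# Hypothetical latent rewards for four (preferred, rejected) pairs.
r_pos = [0.9, 0.2, 1.5, -0.3]
r_neg = [0.1, 0.4, 1.0, -0.8]

base_sign = sign_agreement(r_pos, r_neg)
shift_sign = sign_agreement(transform(r_pos, shift), transform(r_neg, shift))
warp_sign = sign_agreement(transform(r_pos, math.exp), transform(r_neg, math.exp))

base_margin = mean_margin(r_pos, r_neg)
warp_margin = mean_margin(transform(r_pos, math.exp), transform(r_neg, math.exp))
```

All three sign agreements coincide, whereas `warp_margin` differs from `base_margin`. This is the sense in which the ordinal statistic depends only on the ordering, while a margin depends on the scale of the score.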

This loss of absolute information is not merely a weakness\. It reflects the level at which the measurement is defined\. Many preference judgments are naturally comparative: it is often easier and more reliable to say which of two responses is better than to assign calibrated absolute scores\. Pairwise reference alignment embraces this comparative structure and defines a model\-level quantity from it\.

A physical or geometric analogy makes this natural\. Suppose one wants to describe a system of points\. One way is to record the absolute coordinate of every point\. Another way is to record only relative positions or relative order relations among points\. The second description loses some degrees of freedom: for example, all points can be translated together without changing their relative positions\. Nevertheless, the relative description can still determine much of the system’s structure\.

The same idea applies here\. Absolute scores attempt to locate each response in a calibrated metric space\. Pairwise comparisons instead describe relations among responses\. They may ignore global shifts, arbitrary scales, or monotone transformations of a latent score, but they preserve the ordering structure needed for the present observable\. This can also make the measurement statistically and practically attractive: comparison data may be easier to collect, more stable across annotators, and less dependent on calibrated score scales\.

The sign and margin statistics studied in this note should therefore be viewed as two simple instances of a broader relational measurement framework\. The sign statistic keeps only the direction of the comparison, while the margin keeps a signed strength of the comparison\. These are not the only possible constructions\. Once the elementary object is a relation between samples, other statistics could emphasize confidence, uncertainty, distributional spread, transitivity, consistency across prompts, or higher\-order relations among more than two responses\.

## 9 Related Work

#### Pairwise preference data in alignment\.

Human preference comparisons are a central ingredient in modern language\-model alignment\. In RLHF\-style pipelines, pairwise or ranked responses are used to learn a reward model and then optimize a policy\[[14](https://arxiv.org/html/2605.30758#bib.bib1),[2](https://arxiv.org/html/2605.30758#bib.bib2)\]\. Direct Preference Optimization uses chosen/rejected pairs to optimize the policy more directly\[[15](https://arxiv.org/html/2605.30758#bib.bib3)\], and preference\-based reinforcement learning studies how agents can learn from qualitative feedback rather than hand\-designed numerical rewards\[[16](https://arxiv.org/html/2605.30758#bib.bib18)\]\. These works primarily use preference data as supervision for training or optimization\. By contrast, this note uses a reference distribution to define a measurement object for a fixed model: the probability that the model orders pairs consistently with the reference\.

#### Preference accuracy, reward\-model benchmarks, and pairwise evaluation\.

A closely related construction appears in reward-model evaluation. Reward models are often evaluated by checking whether the reward assigned to a chosen response is larger than the reward assigned to a rejected response, and RewardBench formalizes this accuracy-based evaluation across preference categories \[[10](https://arxiv.org/html/2605.30758#bib.bib4)\]. Pairwise comparison is also widely used for evaluating open-ended language-model outputs: MT-Bench and LLM-as-a-judge study scalable model-based judging and its biases \[[18](https://arxiv.org/html/2605.30758#bib.bib6)\]; Chatbot Arena collects crowdsourced pairwise votes and uses statistical ranking methods to compare models \[[3](https://arxiv.org/html/2605.30758#bib.bib7)\]; and AlpacaEval highlights confounders such as response length \[[5](https://arxiv.org/html/2605.30758#bib.bib10)\]. These works motivate pairwise evaluation, but their main objects are usually reward-model accuracy, win-rate estimation, model comparison, or judge reliability. Here, the same comparison primitive is reinterpreted as a single-model population quantity induced by $P_{\mathrm{pair}}$ and $S_M$.

#### Pairwise ranking and preference learning\.

There is also a long line of work on inferring rankings or latent scores from pairwise comparisons. Bradley–Terry-style models infer latent strengths from comparison outcomes, and generalized formulations extend the types of comparison data that can be modeled \[[7](https://arxiv.org/html/2605.30758#bib.bib11)\]. Efficient algorithms for rankings from pairwise comparisons have also been studied \[[13](https://arxiv.org/html/2605.30758#bib.bib12)\]. In machine learning, pairwise preference learning reduces ranking problems to binary comparisons \[[8](https://arxiv.org/html/2605.30758#bib.bib19)\], and Bayesian preference learning can use such comparisons to guide data collection or optimization \[[9](https://arxiv.org/html/2605.30758#bib.bib21)\]. Our setting is different: we do not infer latent item scores or a global ranking. The scoring function $S_M$ is fixed by the model, and the target quantity is an agreement probability under $P_{\mathrm{pair}}$.

#### Statistical and energy\-based perspectives\.

The finite-sample analysis in this note follows the view that evaluations should be treated as statistical experiments. Miller \[[12](https://arxiv.org/html/2605.30758#bib.bib8)\] and the accompanying Anthropic research article \[[1](https://arxiv.org/html/2605.30758#bib.bib9)\] emphasize uncertainty estimates, standard errors, and experiment planning for language-model evaluations. Our sign observable is especially simple from this perspective because it is Bernoulli, so Hoeffding’s inequality gives a direct finite-sample concentration bound under independent sampling. The energy interpretation is connected to energy-based modeling, where compatible or likely configurations are assigned lower energy \[[11](https://arxiv.org/html/2605.30758#bib.bib13),[4](https://arxiv.org/html/2605.30758#bib.bib14)\]. Here the energy language is interpretive and computational: it motivates a natural scoring choice for probabilistic language models and is not a claim that language models are literal physical systems.

## 10 Experiments

This section provides an empirical validation of the pairwise reference alignment observable defined above\. The goal is deliberately modest\. We do not claim to exhaustively validate the method across model families, training pipelines, or preference datasets\. Instead, we ask whether the proposed statistics behave in the way predicted by the theory in a controlled and reproducible setting: a single model family, a fixed preference dataset, and a fixed likelihood\-induced scoring rule\.

The results are encouraging\. Larger models and instruction\-tuned models consistently show stronger likelihood\-induced agreement with the reference preference ordering\. Moreover, the observable varies substantially across subsets of the reference distribution, and its finite\-sample behavior matches the expected concentration pattern\. These findings support the view that pairwise reference alignment captures a real model\-level property, while keeping the claim explicitly relative to the reference pair distribution and scoring rule\.

### 10.1 Experimental setup

We evaluate the Qwen2.5 model family \[[17](https://arxiv.org/html/2605.30758#bib.bib5)\] on RewardBench \[[10](https://arxiv.org/html/2605.30758#bib.bib4)\]. The model set contains four model sizes, each with a base and an instruction-tuned variant:

| Size | Base model | Instruction-tuned model |
|------|------------|-------------------------|
| 0.5B | Qwen2.5-0.5B | Qwen2.5-0.5B-Instruct |
| 1.5B | Qwen2.5-1.5B | Qwen2.5-1.5B-Instruct |
| 3B | Qwen2.5-3B | Qwen2.5-3B-Instruct |
| 7B | Qwen2.5-7B | Qwen2.5-7B-Instruct |

RewardBench provides preference triples

$$(x_k, y_k^+, y_k^-),$$

where $x_k$ is a prompt, $y_k^+$ is the reference-preferred response, and $y_k^-$ is the reference-rejected response. The main experiment uses $K=5120$ pairs. RewardBench also provides subset labels, which allow us to study how the observable changes across different components of the reference pair distribution.

For a model $M$, we use token-normalized log-likelihood as the scoring rule:

$$S_M(x,y) = \frac{1}{|y|}\log Q_M(y\mid x),$$

where $Q_M(y\mid x)$ is the conditional probability assigned by the model to response $y$, and $|y|$ is the number of response tokens. Following the margin notation in the main text, the population margin is written as

$$d_M(x,y^+,y^-) = S_M(x,y^+) - S_M(x,y^-).$$

For the $k$-th evaluated pair, we write the observed margin as

$$\Delta_k^{(M)} = d_M(x_k,y_k^+,y_k^-) = S_M(x_k,y_k^+) - S_M(x_k,y_k^-).$$

We report the sign agreement estimator

$$\hat{A}_M = \frac{1}{K}\sum_{k=1}^{K}\mathbf{1}\{\Delta_k^{(M)} > 0\},$$

and the mean signed margin estimator

$$\hat{\mu}_M = \frac{1}{K}\sum_{k=1}^{K}\Delta_k^{(M)}.$$

Here $\hat{A}_M$ and $\hat{\mu}_M$ are shorthand for the finite-set estimators $\hat{A}_M(\mathcal{C})$ and $\hat{\mu}_M(\mathcal{C})$ defined in the main text.
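The estimators above can be sketched in a few lines, assuming each response has already been reduced to a summed log-probability and a token count. How those numbers are obtained from a model is outside this sketch, and the type alias, function names, and toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

# Each response is summarized as (summed log-probability, token count).
Response = Tuple[float, int]


def normalized_score(total_logprob: float, num_tokens: int) -> float:
    """Token-normalized log-likelihood: (1 / |y|) * log Q_M(y | x)."""
    if num_tokens <= 0:
        raise ValueError("response must contain at least one token")
    return total_logprob / num_tokens


def pair_margins(pairs: Sequence[Tuple[Response, Response]]) -> List[float]:
    """Observed margins: score(preferred) minus score(rejected) per pair."""
    return [normalized_score(*pos) - normalized_score(*neg) for pos, neg in pairs]


def sign_agreement(margins: Sequence[float]) -> float:
    """Fraction of pairs with a strictly positive margin."""
    return sum(1 for d in margins if d > 0) / len(margins)


def mean_margin(margins: Sequence[float]) -> float:
    """Average signed margin over the evaluated pairs."""
    return sum(margins) / len(margins)


# Hypothetical (preferred, rejected) pairs.
pairs = [
    ((-12.0, 10), (-18.0, 12)),  # scores -1.2 vs -1.5
    ((-30.0, 20), (-14.0, 10)),  # scores -1.5 vs -1.4
    ((-8.0, 8), (-9.0, 6)),      # scores -1.0 vs -1.5
]
margins = pair_margins(pairs)
a_hat = sign_agreement(margins)
mu_hat = mean_margin(margins)
```

Ties (a margin of exactly zero) count as disagreement here because the indicator uses a strict inequality, matching the definition of the sign agreement estimator.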

The main experiments use a plain prompt construction for all models. This keeps the input format identical between base and instruction-tuned models. We additionally report a chat-template ablation in Appendix [B.1](https://arxiv.org/html/2605.30758#A2.SS1).

### 10.2 Experiment 1: Overall pairwise reference alignment

The first experiment tests whether $\hat{A}_M$ and $\hat{\mu}_M$ distinguish model size and instruction tuning. If instruction tuning leaves a detectable trace in the model distribution, then the instruction-tuned model should assign relatively higher likelihood to reference-preferred responses. If model capability also matters, larger models should exhibit stronger agreement with the reference ordering.

Table [2](https://arxiv.org/html/2605.30758#S10.T2) reports the overall results. Both statistics follow the predicted pattern: instruction-tuned models outperform the corresponding base models at every size, and larger models tend to show stronger alignment.

Table 2: Overall likelihood-induced pairwise reference alignment on RewardBench. The bootstrap intervals are computed from post-hoc resampling of the saved pairwise scores.

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/experiment1_overall_bars_mpl.png)

Figure 1: Overall sign agreement and mean signed margin on RewardBench. Larger models and instruction-tuned models show stronger likelihood-induced agreement with the reference preference ordering.

The sign statistic gives a clean ordinal summary. Qwen2.5-0.5B has $\hat{A}_M = 0.6148$, while Qwen2.5-7B reaches $\hat{A}_M = 0.7262$. The instruction-tuned models show the same size trend, from $0.6250$ for Qwen2.5-0.5B-Instruct to $0.7705$ for Qwen2.5-7B-Instruct.

The margin statistic shows an even stronger separation. For example, Qwen2.5-7B-Instruct has $\hat{\mu}_M = 0.3500$, compared with $\hat{\mu}_M = 0.2005$ for Qwen2.5-7B. This indicates that instruction tuning does not merely increase the number of correctly ordered pairs; it also increases the average likelihood gap in favor of the reference-preferred response.

The smaller models should be interpreted more cautiously\. The 0\.5B and 1\.5B base/instruct differences are directionally positive, but their bootstrap intervals overlap\. The 3B and 7B comparisons are more robust\. This is consistent with a modest empirical claim: the observable is sensitive to model size and instruction tuning in this setting, with stronger evidence at larger scales\.

### 10.3 Experiment 2: Dependence on the reference pair distribution

The second experiment tests the distributional claim in the definition. Pairwise reference alignment is not an intrinsic scalar attached to the model alone. It is a property of a model, a scoring rule, and a reference pair distribution. To examine this, we decompose RewardBench into subsets and compute

$$\hat{A}_M^{(c)} = \frac{1}{K_c}\sum_{k:\,c_k=c}\mathbf{1}\{\Delta_k^{(M)} > 0\},$$

where $c$ is a subset label and $K_c$ is the number of pairs in that subset.
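A minimal sketch of the subset-level estimator follows, assuming per-pair margins and subset labels are available as parallel lists. The labels `"code"` and `"safety"` and the margin values are placeholders for illustration, not actual RewardBench subset names or results.

```python
from collections import defaultdict
from typing import Dict, Sequence


def subset_sign_agreement(
    margins: Sequence[float], labels: Sequence[str]
) -> Dict[str, float]:
    """Sign agreement computed separately within each subset label."""
    if len(margins) != len(labels):
        raise ValueError("margins and labels must have the same length")
    counts: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for margin, label in zip(margins, labels):
        counts[label] += 1
        if margin > 0:
            hits[label] += 1
    return {label: hits[label] / counts[label] for label in counts}


# Hypothetical margins and subset labels for six pairs.
margins = [0.4, -0.1, 0.2, -0.3, -0.2, 0.1]
labels = ["code", "code", "code", "safety", "safety", "safety"]

by_subset = subset_sign_agreement(margins, labels)
overall = sum(1 for d in margins if d > 0) / len(margins)
```

In this toy example the overall agreement is 0.5, while the two subsets sit at 2/3 and 1/3. The overall value is the subset values weighted by subset size, which is why a single global number can mask differences between components of the reference pair distribution.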

Figure [2](https://arxiv.org/html/2605.30758#S10.F2) summarizes the subset-level results by grouping RewardBench subsets into semantic families. The differences across families are substantial. Models tend to show high agreement on code-related subsets and lower agreement on adversarial, refusal, and safety-related subsets.

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/experiment2_subset_family_radar_mpl.png)

Figure 2: Subset-family radar plot for $\hat{A}_M$. The observable depends strongly on the reference pair distribution: a model with high overall agreement need not show uniformly high agreement across all subsets.

This supports the main theoretical point that alignment claims should be stated relative to $P_{\mathrm{pair}}$. A more precise statement is not simply that a model is aligned, but that under a specified scoring rule and reference pair distribution, the model achieves a particular level of likelihood-induced agreement. The overall statistic is useful as a global summary, but it hides structure across the component pair distributions. A full 23-subset radar plot is provided in Appendix [B.2](https://arxiv.org/html/2605.30758#A2.SS2).

### 10.4 Experiment 3: Finite-sample behavior and bootstrap uncertainty

The third experiment studies the statistical reliability of the observable. It has two parts. First, we repeatedly subsample $K$ distinct pairs without replacement and measure how the empirical interval width changes with $K$. Second, we use bootstrap resampling with replacement at the full RewardBench size to estimate uncertainty around the full-sample estimates.
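The two resampling schemes can be sketched as follows. This is a minimal illustration on synthetic sign indicators. The 95% percentile interval, the 500 repetitions, the fixed seed, and the toy agreement rate of 0.7 are assumptions chosen for the example, not settings reported in the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple


def percentile_interval(
    values: Sequence[float], alpha: float = 0.05
) -> Tuple[float, float]:
    """Simple percentile interval from a collection of resampled estimates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(alpha / 2 * last)], ordered[int((1 - alpha / 2) * last)]


def subsample_means(
    z: List[int], k: int, reps: int, rng: random.Random
) -> List[float]:
    """Means over `reps` draws of k distinct indicators (without replacement)."""
    return [sum(rng.sample(z, k)) / k for _ in range(reps)]


def bootstrap_means(z: List[int], reps: int, rng: random.Random) -> List[float]:
    """Means over `reps` resamples of the full set (with replacement)."""
    n = len(z)
    return [sum(rng.choices(z, k=n)) / n for _ in range(reps)]


rng = random.Random(0)
z = [1] * 700 + [0] * 300  # synthetic sign indicators, agreement rate 0.7

# Part 1: interval width as a function of the subsample size.
widths: Dict[int, float] = {}
for k in (50, 500):
    lo, hi = percentile_interval(subsample_means(z, k, reps=500, rng=rng))
    widths[k] = hi - lo

# Part 2: bootstrap interval at the full sample size.
boot_lo, boot_hi = percentile_interval(bootstrap_means(z, reps=500, rng=rng))
```

The interval from subsamples of size 500 is narrower than the one from size 50, which is the qualitative pattern the experiment looks for. The same helper functions apply to margins if the indicator list is replaced by a list of real-valued margins.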

For the sign statistic, the random variable

$$Z_k^{(M)} = \mathbf{1}\{\Delta_k^{(M)} > 0\}$$

is bounded in $[0,1]$. Hoeffding’s inequality gives the conservative radius

$$\epsilon_K = \sqrt{\frac{1}{2K}\log\frac{2}{\delta}}.$$

We use $\delta = 0.05$. This bound applies directly to $\hat{A}_M$. For the continuous margin statistic $\hat{\mu}_M$, an analogous absolute bound would require a known range or clipping rule for $\Delta_k^{(M)}$. Since the effective margin range is not fixed a priori, we report empirical resampling behavior for $\hat{\mu}_M$ rather than applying the same bound directly.
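For concreteness, the Hoeffding radius can be evaluated directly from the formula. The helper below is a small sketch; the function name is an assumption, and the natural logarithm is used, as in the standard statement of the inequality.

```python
import math


def hoeffding_radius(k: int, delta: float = 0.05) -> float:
    """Radius eps_K = sqrt(log(2 / delta) / (2 K)) for a [0, 1]-bounded mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * k))


radius_full = hoeffding_radius(5120)  # full evaluation size used in the text
radius_small = hoeffding_radius(50)   # a small subsample for comparison
```

With $\delta = 0.05$, the radius is roughly 0.019 at $K = 5120$ and roughly 0.19 at $K = 50$. Because the radius scales as $1/\sqrt{K}$, halving it requires four times as many pairs.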

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/experiment3_finite_sample_representative_mpl.png)

Figure 3: Finite-sample behavior for two representative models, Qwen2.5-0.5B and Qwen2.5-7B-Instruct. These two endpoints are shown in the main text to keep the figure readable; the full eight-model version is reported in Appendix [B.3](https://arxiv.org/html/2605.30758#A2.SS3). The empirical half-width decreases as the number of sampled pairs increases, and the Hoeffding curve provides a conservative reference for the sign statistic.

Figure [3](https://arxiv.org/html/2605.30758#S10.F3) shows the expected convergence pattern for a small base model and a large instruction-tuned model. We use this pair as a compact main-text summary because plotting all eight models in one panel is visually dense; the complete version is provided in Appendix [B.3](https://arxiv.org/html/2605.30758#A2.SS3). As $K$ increases, the empirical uncertainty decreases for both the sign statistic and the margin statistic. For Qwen2.5-7B-Instruct, the empirical interval width for $\hat{A}_M$ decreases from $0.2200$ at $K=50$ to $0.0265$ at $K=2000$. The margin statistic shows a similar decreasing trend, although its scale is score-dependent.

At the full RewardBench size, the bootstrap intervals are tight. Table [3](https://arxiv.org/html/2605.30758#S10.T3) compares the Hoeffding radius for $\hat{A}_M$ at $K=5120$ with the bootstrap half-width. The bootstrap half-widths are smaller than the Hoeffding radius, as expected from a conservative distribution-free bound.

Table 3: Comparison between the Hoeffding radius for the sign statistic and bootstrap half-widths at $K=5120$. The Hoeffding radius applies directly to $\hat{A}_M$; the margin column reports only bootstrap uncertainty.

These results connect the empirical study back to the statistical formulation. The sign observable has a simple bounded-variable concentration bound, and the empirical bootstrap uncertainty is small at the full evaluation size. The margin observable is also stable in this experiment, but it should be interpreted with attention to the scoring scale.

### 10.5 Summary of empirical findings

The three experiments are aligned with the theoretical proposal. First, the overall results show that $\hat{A}_M$ and $\hat{\mu}_M$ distinguish model size and instruction tuning in the expected direction. Second, the subset analysis confirms that the observable is relative to the reference pair distribution, rather than an unconditional property of the model. Third, the finite-sample and bootstrap analyses show that the sign statistic has predictable statistical behavior and that the full RewardBench estimates are stable in this setting.

The scope of this evidence is limited\. We evaluate one model family and one preference benchmark\. We do not yet test cross\-family generalization, larger frontier models, different preference datasets, or alternative scoring rules\. Nevertheless, the internal consistency of the results is strong: all three experiments support the central claim that likelihood\-induced pairwise agreement is a meaningful, distribution\-dependent observable of a fixed model\.

## 11 Limitations and Future Work

This note is primarily a conceptual and statistical formulation, supported by an initial empirical study\. It does not introduce a benchmark or claim that likelihood is equivalent to human preference\. The following limitations are central to the interpretation of the proposed observable\.

#### Dependence on the scoring function\.

The observable $A_M(P_{\mathrm{pair}})$ is defined relative to a scoring function $S_M$. Different choices of $S_M$ may induce different orderings over the same response pairs. A reward model score, an LLM-as-judge score, a raw log-probability, and a token-normalized log-probability need not agree. Thus the proposed observable should always be reported together with the scoring rule that induces the ordering.

#### Likelihood is not preference\.

When $S_M$ is derived from log-probability or negative energy, agreement with the reference ordering should not be read as direct evidence that the language model “prefers” the response in the human sense. Likelihood reflects compatibility with the model distribution. It can be affected by frequency, stylistic typicality, response length, formatting, and tokenization. A response can be more likely without being more useful, safer, or more correct. The energy interpretation is therefore a diagnostic lens, not a complete theory of preference.

#### Dependence on the reference pair distribution\.

The target $P_{\mathrm{pair}}$ defines the scope of the claim. A pair distribution over mathematical explanations estimates agreement with mathematical explanation preferences; a pair distribution over safety refusals estimates agreement with safety preferences. No finite evaluation set should be treated as measuring generic alignment unless it is designed to represent that target. In practice, dataset construction, sampling weights, task mixture, and annotator population all affect the meaning of $A_M(P_{\mathrm{pair}})$.

#### Reference labels may be noisy or heterogeneous\.

The formulation treats $y^+$ and $y^-$ as reference-preferred and reference-rejected responses, but real preference data may be noisy. Human annotators can disagree, expert rules may be incomplete, and model judges can introduce systematic biases. If the reference system represents a mixture of annotator populations or values, the resulting $P_{\mathrm{pair}}$ may not correspond to a single coherent preference relation.

#### The sign observable discards strength\.

The sign statistic is intentionally ordinal. It distinguishes whether the model ranks $y^+$ above $y^-$, but it does not distinguish barely correct rankings from large-margin rankings. The margin observable partially addresses this, but at the cost of stronger dependence on the scale and distribution of $S_M$.

#### Margin estimates can be unstable\.

When the score is based on log\-probability, margins may be heavy\-tailed\. Very unlikely sequences, rare tokens, length differences, or formatting artifacts can dominate averages\. Practical use of the margin observable may require clipping, robust means, bootstrap intervals, or reporting full margin distributions rather than a single mean\.
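The remedies listed above can be compared on a toy example. The sketch below contrasts the plain mean with a clipped mean, a trimmed mean, and the median on margins containing one extreme value. The clipping interval, the trimming amount, and the data are illustrative assumptions, not recommendations from the paper.

```python
import statistics
from typing import List, Sequence


def clipped_mean(margins: Sequence[float], lo: float, hi: float) -> float:
    """Mean after clipping every margin into the interval [lo, hi]."""
    return statistics.fmean([max(lo, min(hi, d)) for d in margins])


def trimmed_mean(margins: Sequence[float], drop_each_side: int) -> float:
    """Mean after dropping the smallest and largest `drop_each_side` margins."""
    ordered: List[float] = sorted(margins)
    if 2 * drop_each_side >= len(ordered):
        raise ValueError("cannot drop that many values")
    return statistics.fmean(ordered[drop_each_side : len(ordered) - drop_each_side])


# Hypothetical margins with one heavy-tailed outlier.
margins = [0.2, 0.3, 0.1, 0.25, -0.15, -6.0]

plain = statistics.fmean(margins)
clipped = clipped_mean(margins, -1.0, 1.0)
trimmed = trimmed_mean(margins, drop_each_side=1)
median = statistics.median(margins)
```

Here a single extreme pair drags the plain mean to about -0.88 even though four of six margins are positive, while the median stays at 0.15. Each robust summary estimates a different population quantity, so the chosen summary should be reported alongside the value.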

#### Finite\-sample bounds are not full validity guarantees\.

The Hoeffding bound controls sampling error under independent sampling from $P_{\mathrm{pair}}$. It does not address dataset contamination, distribution shift, adaptive benchmark use, label noise, judge bias, or whether $S_M$ is the right scoring function for the intended question. It should be read as a statement about estimator concentration, not as a guarantee of benchmark validity.

#### Initial empirical scope\.

The experiments in this note cover one model family, one public preference benchmark, and one likelihood\-induced scoring rule\. This evidence is useful because all three empirical checks agree with the theory, but it is not a proof that the observable behaves the same way for every model family, benchmark, or scoring rule\. Empirical support for a measurement framework is necessarily cumulative: no finite experiment can establish universal correctness\. Scaling the study to more model families, larger models, additional preference datasets, and alternative scoring rules is therefore important\.

#### Future work\.

The most direct next step is scale-up. One could compute $A_M$, $m_M^{\mathrm{sign}}$, and margin distributions across several model families, larger checkpoints, additional public preference datasets, and multiple scoring functions. Another direction is to compare log-probability-induced orderings with reward-model or judge-model orderings. A third direction is longitudinal: tracking the observable across checkpoints during supervised fine-tuning, RLHF, or preference optimization. Finally, one could study how different choices of $P_{\mathrm{pair}}$ decompose alignment into task- or value-conditioned components.

## 12 Conclusion

Pairwise reference data can define model\-level quantities for alignment evaluation\. The sign statistic measures whether a fixed model orders a pair consistently with the reference, while the margin statistic retains the signed strength of the model’s preference between the two responses when the scale of the scoring function is meaningful\. Together, they separate two questions that are often conflated: whether the model orders a pair in the reference\-preferred direction, and how strongly the chosen score favors that direction\.

The main point is not that pairwise data recovers a complete reward function, nor that likelihood is equivalent to human preference\. Rather, the point is that a reference pair distribution defines an ordinal measurement problem: does the ordering induced by a model agree with the ordering expressed by the reference? This perspective separates the target estimand from finite\-sample estimators, supports simple concentration analysis, and provides a bridge to energy\-based and relational interpretations\. It also keeps the scope of the claim explicit: the resulting observable is meaningful only relative to the reference distribution and scoring function used to define it\.

Thus, the contribution of this note is not the isolated use of pairwise comparisons, but the reframing of such comparisons as relational, distribution\-dependent observables for measuring a fixed model’s induced ordering against a reference preference system\.

## Appendix A Concentration Bounds

This appendix gives the finite-sample concentration arguments used in Sections [4.2](https://arxiv.org/html/2605.30758#S4.SS2) and [5.2](https://arxiv.org/html/2605.30758#S5.SS2). The only probabilistic tool needed is Hoeffding’s inequality.

### A.1 Sign observable

###### Lemma 1 (Hoeffding’s inequality for bounded independent variables).

Let $X_1,\ldots,X_K$ be independent random variables such that

$$X_k \in [a_k, b_k]$$

almost surely for each $k$. Define the empirical average

$$\bar{X} = \frac{1}{K}\sum_{k=1}^{K} X_k.$$

Then, for any $\epsilon > 0$,

$$\mathbb{P}\left(\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2K^{2}\epsilon^{2}}{\sum_{k=1}^{K}(b_k-a_k)^{2}}\right).$$

For the sign observable, define

$$Z_k = Z_M(x_k,y_k^+,y_k^-) = \mathbf{1}\left[S_M(x_k,y_k^+) > S_M(x_k,y_k^-)\right].$$

Assume that

$$(x_k,y_k^+,y_k^-) \overset{\mathrm{i.i.d.}}{\sim} P_{\mathrm{pair}}$$

for $k=1,\ldots,K$. Then $Z_1,\ldots,Z_K$ are independent Bernoulli random variables. Since $Z_k \in \{0,1\}$, we have $a_k=0$ and $b_k=1$ for all $k$.

The population mean of $Z_k$ is exactly the pairwise reference alignment observable:

$$\mathbb{E}[Z_k] = \mathbb{P}_{(x,y^+,y^-)\sim P_{\mathrm{pair}}}\left[S_M(x,y^+) > S_M(x,y^-)\right] = A_M(P_{\mathrm{pair}}).$$

The empirical average is

$$\bar{Z} = \frac{1}{K}\sum_{k=1}^{K} Z_k = \hat{A}_M(\mathcal{C}).$$

Applying Hoeffding’s inequality with $b_k-a_k=1$ gives

$$\mathbb{P}\left(\left|\hat{A}_M(\mathcal{C}) - A_M(P_{\mathrm{pair}})\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2K^{2}\epsilon^{2}}{K}\right) = 2\exp(-2K\epsilon^{2}).$$

To obtain a sufficient sample size for confidence level $1-\delta$, require the right-hand side to be at most $\delta$:

$$2\exp(-2K\epsilon^{2}) \leq \delta.$$

Taking logarithms and rearranging gives

$$K \geq \frac{1}{2\epsilon^{2}}\log\frac{2}{\delta}.$$

Thus, under independent sampling from $P_{\mathrm{pair}}$, $K$ sampled pairs are sufficient to estimate the population agreement probability $A_M(P_{\mathrm{pair}})$ within error $\epsilon$ with probability at least $1-\delta$.
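The sample-size condition can be turned into a small planning helper. This is a sketch; the function name and the example tolerances are assumptions, and the result is rounded up to the next integer.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float = 0.05) -> int:
    """Smallest integer K with K >= log(2 / delta) / (2 * epsilon^2)."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon**2))


k_for_005 = hoeffding_sample_size(0.05)  # error 0.05 at confidence 0.95
k_for_002 = hoeffding_sample_size(0.02)  # error 0.02 at confidence 0.95
```

Under these assumptions, 738 pairs suffice for an error of 0.05 and 4612 pairs for an error of 0.02, both at $\delta = 0.05$. The bound is distribution-free, so it is typically conservative compared with resampling-based intervals.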

### A.2 Bounded margin observable

For the margin observable, define

$$D_k = d_M(x_k,y_k^+,y_k^-) = S_M(x_k,y_k^+) - S_M(x_k,y_k^-).$$

Assume again that

$$(x_k,y_k^+,y_k^-) \overset{\mathrm{i.i.d.}}{\sim} P_{\mathrm{pair}}$$

for $k=1,\ldots,K$, and assume that the margin is bounded:

$$a \leq D_k \leq b$$

almost surely. The population mean is

$$\mathbb{E}[D_k] = \mathbb{E}_{(x,y^+,y^-)\sim P_{\mathrm{pair}}}\left[d_M(x,y^+,y^-)\right] = \mu_M(P_{\mathrm{pair}}),$$

and the empirical average is

$$\bar{D} = \frac{1}{K}\sum_{k=1}^{K} D_k = \hat{\mu}_M(\mathcal{C}).$$

Applying Hoeffding’s inequality with $a_k=a$ and $b_k=b$ for all $k$ gives

$$\mathbb{P}\left(\left|\hat{\mu}_M(\mathcal{C}) - \mu_M(P_{\mathrm{pair}})\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2K^{2}\epsilon^{2}}{\sum_{k=1}^{K}(b-a)^{2}}\right).$$

Since

$$\sum_{k=1}^{K}(b-a)^{2} = K(b-a)^{2},$$

this simplifies to

$$\mathbb{P}\left(\left|\hat{\mu}_M(\mathcal{C}) - \mu_M(P_{\mathrm{pair}})\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2K\epsilon^{2}}{(b-a)^{2}}\right).$$

To obtain a sufficient sample size for confidence level $1-\delta$, require

$$2\exp\left(-\frac{2K\epsilon^{2}}{(b-a)^{2}}\right) \leq \delta.$$

Taking logarithms and rearranging gives

$$K \geq \frac{(b-a)^{2}}{2\epsilon^{2}}\log\frac{2}{\delta}.$$

Thus, when the margin is bounded or clipped to a known interval, the empirical mean margin estimates the population margin observable with a sample complexity that scales quadratically with the margin range.
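The quadratic dependence on the margin range can be made concrete with a small sketch. The clipping helper and the example intervals are assumptions for illustration. Note that clipping changes the quantity being estimated to the mean of the clipped margin.

```python
import math


def clip(value: float, lo: float, hi: float) -> float:
    """Clip a margin into the known interval [lo, hi]."""
    return max(lo, min(hi, value))


def bounded_margin_sample_size(
    epsilon: float, lo: float, hi: float, delta: float = 0.05
) -> int:
    """Smallest integer K with K >= (hi - lo)^2 * log(2 / delta) / (2 * epsilon^2)."""
    if epsilon <= 0.0 or hi <= lo:
        raise ValueError("need epsilon > 0 and hi > lo")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    width = hi - lo
    return math.ceil(width**2 * math.log(2.0 / delta) / (2.0 * epsilon**2))


k_unit = bounded_margin_sample_size(0.05, 0.0, 1.0)   # range of width 1
k_wide = bounded_margin_sample_size(0.05, -1.0, 1.0)  # range of width 2
clipped_example = clip(-6.0, -1.0, 1.0)
```

Doubling the width of the interval from 1 to 2 raises the sufficient sample size from 738 to 2952 at the same $\epsilon$ and $\delta$, which is the factor of four expected from the $(b-a)^{2}$ term up to rounding.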

## Appendix B Additional Experimental Results

### B.1 Chat-template ablation

The main experiments use a plain prompt construction so that base and instruction-tuned models receive the same input format. Since Qwen2.5 tokenizers also define chat templates, we additionally run a format-sensitivity check using the Qwen chat-template prompt construction. This ablation is intended as an appendix result rather than a second main experiment: it does not include bootstrap resampling or a new finite-sample analysis.

Table [4](https://arxiv.org/html/2605.30758#A2.T4) reports the chat-template point estimates. The direction agrees with the main experiment. At every model size, the instruction-tuned model has larger $\hat{A}_M$ and larger $\hat{\mu}_M$ than the corresponding base model.

Table 4: Chat-template ablation results. These are point estimates without bootstrap intervals.

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/appendix_chattemplate_overall_bars_mpl.png)

Figure 4: Overall chat-template ablation. The main direction is unchanged: instruction-tuned models score higher than base models at every size.

The chat-template setting also affects the scale of the margin statistic. In particular, Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct show much larger $\hat{\mu}_M$ than in the plain-prompt setting. This suggests that prompt format is not a negligible implementation detail for margin-based evaluation. However, the ordinal conclusion remains stable: the instruction-tuned variants are consistently more aligned with the reference ordering.
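One way to see why the margin scale and the ordinal conclusion can decouple is the toy Python sketch below. It assumes that $\hat{A}_M$ is the fraction of pairs with strictly positive margin $D_k$ (counting ties as disagreement is an assumption here) and that $\hat{\mu}_M$ is the mean of the $D_k$. The margin values are made up, and the uniform stretch by a factor of 3 is a hypothetical stand-in for a format effect, not a model of what chat templates actually do.

```python
from statistics import fmean


def alignment_rate(margins):
    """Fraction of pairs with a strictly positive margin (ties counted as misses)."""
    return sum(1 for d in margins if d > 0) / len(margins)


def mean_margin(margins):
    """Empirical mean of the margins D_k."""
    return fmean(margins)


# Made-up margins D_k for illustration; not taken from the paper's experiments.
plain_margins = [0.4, -0.1, 0.7, 0.2, -0.3, 0.5]
# Hypothetical format effect: every margin is stretched by a factor of 3.
scaled_margins = [3.0 * d for d in plain_margins]

# The sign-based rate is unchanged by the stretch; the mean margin is tripled.
print(alignment_rate(plain_margins), alignment_rate(scaled_margins))
print(mean_margin(plain_margins), mean_margin(scaled_margins))
```

Any positive rescaling of the margins leaves the sign-based statistic untouched while changing the mean margin proportionally. A real format change can also flip individual signs, so this sketch only shows that the two statistics respond to different aspects of the scores.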

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/appendix_chattemplate_subset_family_radar_mpl.png)

Figure 5: Subset-family radar plot under the chat-template prompt construction.

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/appendix_chattemplate_subset_radar_23_mpl.png)

Figure 6: Full 23-subset radar plot under the chat-template prompt construction.

### B.2 Full subset-level radar plot

Figure [7](https://arxiv.org/html/2605.30758#A2.F7) reports the full 23-subset radar plot for the plain-prompt setting. This plot is too dense for the main text, but it provides a useful diagnostic view of the distribution dependence discussed in Section [10.3](https://arxiv.org/html/2605.30758#S10.SS3).

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/appendix_experiment2_subset_radar_23_mpl.png)

Figure 7: Full 23-subset radar plot for the plain-prompt setting.

### B.3 Full finite-sample curves

Figure [8](https://arxiv.org/html/2605.30758#A2.F8) reports the finite-sample curves for all eight models. The Hoeffding curve is shown only for the sign statistic, where the bounded Bernoulli assumption applies directly.

![Refer to caption](https://arxiv.org/html/2605.30758v1/plots/appendix_experiment3_finite_sample_all_models_mpl.png)

Figure 8: Finite-sample behavior for all models. The Hoeffding curve provides a conservative reference for $\hat{A}_M$.
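For readers who want to reproduce the shape of such a curve, here is a minimal Python sketch under stated assumptions. The exact subsampling procedure behind Figure 8 is not described in this appendix, so the sketch simply draws a random subsample of size $K$ from a list of 0/1 agreement indicators and pairs the resulting estimate with the Hoeffding half-width $\sqrt{\log(2/\delta)/(2K)}$ for $[0, 1]$-valued samples. The indicators are synthetic and the function names are hypothetical.

```python
import math
import random


def hoeffding_half_width(K, delta=0.05):
    """Half-width eps solving 2 * exp(-2 * K * eps^2) = delta for [0, 1]-valued samples."""
    return math.sqrt(math.log(2 / delta) / (2 * K))


def finite_sample_curve(indicators, sizes, delta=0.05, seed=0):
    """Return (K, estimate on a random subsample of size K, Hoeffding half-width) per size."""
    rng = random.Random(seed)
    curve = []
    for K in sizes:
        # Sampling without replacement, so K must not exceed len(indicators).
        subsample = rng.sample(indicators, K)
        curve.append((K, sum(subsample) / K, hoeffding_half_width(K, delta)))
    return curve


# Synthetic 0/1 agreement indicators with a true rate of 0.7; purely illustrative.
data_rng = random.Random(1)
indicators = [1 if data_rng.random() < 0.7 else 0 for _ in range(2000)]
curve = finite_sample_curve(indicators, [50, 200, 800, 2000])
for K, estimate, half_width in curve:
    print(f"K={K:5d}  A_hat={estimate:.3f}  +/- {half_width:.3f}")
```

The half-width shrinks as $1/\sqrt{K}$ regardless of the underlying rate, which is why the Hoeffding curve is conservative: it ignores the variance of the indicator and uses only its range.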

## Acknowledgements

The author thanks Shuyao Shang (NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA), shangshuyao2024@ia.ac.cn) for valuable discussions on the idea and for support of this work.

## References

- [1] Anthropic (2024-11). A statistical approach to model evaluations. Anthropic Research, published November 19, 2024. [Link](https://www.anthropic.com/research/statistical-approach-to-model-evals). Cited by: [§4.2](https://arxiv.org/html/2605.30758#S4.SS2.p5.2), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px4.p1.1).
- [2] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. [Link](https://arxiv.org/abs/2204.05862). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p1.3), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px1.p1.1).
- [3] W. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, and I. Stoica (2024). Chatbot Arena: an open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132. [Link](https://arxiv.org/abs/2403.04132). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p1.3), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px2.p1.2).
- [4] Y. Du and I. Mordatch (2019). Implicit generation and modeling with energy-based models. In Advances in Neural Information Processing Systems. arXiv:1903.08689. [Link](https://arxiv.org/abs/1903.08689). Cited by: [§7.1](https://arxiv.org/html/2605.30758#S7.SS1.p1.4), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px4.p1.1).
- [5] Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto (2024). Length-controlled AlpacaEval: a simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475. [Link](https://arxiv.org/abs/2404.04475). Cited by: [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px2.p1.2).
- [6] V. Dumoulin, D. D. Johnson, P. S. Castro, H. Larochelle, and Y. Dauphin (2024). A density estimation perspective on learning from pairwise human preferences. arXiv preprint arXiv:2311.14115. [Link](https://arxiv.org/abs/2311.14115). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p3.2).
- [7] J. Fageot, S. Farhadkhani, L. Hoang, and O. Villemaud (2024). Generalized Bradley-Terry models for score estimation from paired comparisons. arXiv preprint arXiv:2308.08644. [Link](https://arxiv.org/abs/2308.08644). Cited by: [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px3.p1.2).
- [8] J. Fürnkranz and E. Hüllermeier (2003). Pairwise preference learning and ranking. In Machine Learning: ECML 2003, pp. 145–156. [Link](https://link.springer.com/chapter/10.1007/978-3-540-39857-8_15). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p3.2), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px3.p1.2).
- [9] T. Ignatenko, K. Kondrashov, M. Cox, and B. de Vries (2023). On preference learning based on sequential Bayesian optimization with pairwise comparison. arXiv preprint arXiv:2103.13192. [Link](https://arxiv.org/abs/2103.13192). Cited by: [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px3.p1.2).
- [10] N. Lambert, V. Pyatkin, J. Morrison, L. J. V. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, N. A. Smith, and H. Hajishirzi (2024). RewardBench: evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787. [Link](https://arxiv.org/abs/2403.13787). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p7.1), [§10.1](https://arxiv.org/html/2605.30758#S10.SS1.p1.1), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px2.p1.2).
- [11] Y. LeCun and F. J. Huang (2005). Loss functions for discriminative training of energy-based models. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 206–213. [Link](https://proceedings.mlr.press/r5/lecun05a.html). Cited by: [§7.1](https://arxiv.org/html/2605.30758#S7.SS1.p1.4), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px4.p1.1).
- [12] E. Miller (2024). Adding error bars to evals: a statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640. [Link](https://arxiv.org/abs/2411.00640). Cited by: [§4.2](https://arxiv.org/html/2605.30758#S4.SS2.p5.2), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px4.p1.1).
- [13] M. E. J. Newman (2023). Efficient computation of rankings from pairwise comparisons. arXiv preprint arXiv:2207.00076. [Link](https://arxiv.org/abs/2207.00076). Cited by: [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px3.p1.2).
- [14] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. [Link](https://arxiv.org/abs/2203.02155). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p1.3), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px1.p1.1).
- [15] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems. arXiv:2305.18290. [Link](https://arxiv.org/abs/2305.18290). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p1.3), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px1.p1.1).
- [16] C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz (2017). A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research 18(136), pp. 1–46. [Link](https://jmlr.org/papers/v18/16-634.html). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p3.2), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px1.p1.1).
- [17] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. [Link](https://arxiv.org/abs/2412.15115). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p7.1), [§10.1](https://arxiv.org/html/2605.30758#S10.SS1.p1.1).
- [18] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems. arXiv:2306.05685. [Link](https://arxiv.org/abs/2306.05685). Cited by: [§1](https://arxiv.org/html/2605.30758#S1.p1.3), [§9](https://arxiv.org/html/2605.30758#S9.SS0.SSS0.Px2.p1.2).