Margin-Adaptive Confidence Ranking for Reliable LLM Judgement
Summary
This paper introduces a margin-based confidence ranking method for LLM-as-a-judge systems, learning a dedicated estimator to ensure monotonicity between confidence and human-disagreement risk, with generalization guarantees and improved ranking accuracy across datasets.
# Margin-Adaptive Confidence Ranking for Reliable LLM Judgement
Source: [https://arxiv.org/html/2605.15416](https://arxiv.org/html/2605.15416)
###### Abstract
Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
Machine Learning, ICML
## 1 Introduction
Large language models are increasingly used as evaluators to judge output quality and preference alignment (Zheng et al., [2023](https://arxiv.org/html/2605.15416#bib.bib3); Dubois et al., [2023](https://arxiv.org/html/2605.15416#bib.bib5); Park et al., [2025](https://arxiv.org/html/2605.15416#bib.bib120); Chiang and Lee, [2023](https://arxiv.org/html/2605.15416#bib.bib118)). While this offers a scalable, low-cost alternative to human annotation, a fundamental challenge remains (Xiong et al., [2024](https://arxiv.org/html/2605.15416#bib.bib119)): how can we make LLM-as-a-judge decisions reliably trustworthy for downstream use, particularly when the judge reports high confidence?
Recent principled methods have been proposed to improve the reliability of LLM-as-a-judge systems. For instance, Yadkori et al. ([2024](https://arxiv.org/html/2605.15416#bib.bib18)) introduce conformal abstention to better align LLM judgments with human evaluations, while Mohri and Hashimoto ([2024](https://arxiv.org/html/2605.15416#bib.bib17)) use conformal prediction to provide high-probability correctness guarantees. Building on these ideas, Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) develop an unsupervised confidence estimator and derive an exact upper bound on disagreement risk conditioned on a calibration set.
A key assumption underlying such confidence-thresholding procedures is monotonicity: instances with higher estimated confidence should correspond to lower disagreement risk with respect to human judgments. However, recent empirical findings indicate that this assumption may be violated in practice, i.e., confidence estimates can be miscalibrated with respect to human subjectivity (as shown in [Figures 1](https://arxiv.org/html/2605.15416#S1.F1) and [3](https://arxiv.org/html/2605.15416#S5.F3)). Moreover, while prior work provides guarantees conditioned on the calibration sample, the generalization behavior of the confidence estimator itself is not explicitly analyzed, leaving open the question of whether the induced confidence ordering remains reliable out of sample.
This motivates our perspective: rather than assuming an LLM’s native confidence is reliable, we learn a dedicated confidence estimator designed to induce an ordering that generalizes beyond the calibration set\. We treat confidence as a ranking function over instances and optimize it via a margin\-based ranking loss that penalizes misordered agreement/disagreement pairs\. To justify this formulation, we derive a PAC\-Bayesian generalization bound on the misranking probability, controlled by the empirical margin\-based ranking loss and a margin\-dependent complexity term\.
###### Theorem 1.1 (Informal).
Given a parameterized confidence estimator and a margin $\gamma$, its expected ranking loss is bounded by the empirical margin-based ranking loss plus a margin-dependent complexity term, i.e.,

$$\text{Expected Ranking Loss}\;\leq\;\text{Empirical Margin-based Ranking Loss}\;+\;\text{Margin-dependent Complexity Term}.$$
Figure 1: Plots of estimated confidence against human–LLM agreement rate using GPT-4 as the judge: (left) predictive probability–based estimator; (right) simulated annotator–based estimator. Results are shown on the dataset of Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) (light blue) and an additional 500 examples from AlpacaEval (Li et al., [2023](https://arxiv.org/html/2605.15416#bib.bib121)) (orange). The horizontal axis denotes the bin of estimated LLM confidence, and the vertical axis denotes the human–LLM agreement rate for each bin. Empirically, human–LLM agreement does not necessarily increase with higher estimated confidence for these two methods.

This analysis reveals a trade-off: a larger margin shrinks the complexity term but makes the empirical margin-based loss harder to keep small, since every agreement/disagreement pair must then be separated by at least $\gamma$, while a smaller margin eases the empirical fit but inflates the complexity term. Guided by this bound, we develop a margin-adaptive training procedure that optimizes both the estimator and its effective margin by balancing empirical ranking loss with a differentiable approximation of the complexity term.
Optimizer 1.2 (Informal). *Let $\theta$ be the confidence estimator parameters, $\gamma$ the margin, and $\beta$ a hyper-parameter. The estimator is optimized via*

$$\min_{\theta}\min_{\gamma}\;\text{Empirical Margin-based Ranking Loss}\;+\;\beta\cdot\text{Margin-dependent Complexity Term}.$$
Empirically, the proposed estimator improves ranking quality compared with commonly used confidence heuristics\. Importantly, these improvements translate into empirically stronger monotonic behavior of selective disagreement risk, which in turn yields higher success rates in meeting target agreement levels within fixed\-sequence testing pipelines, while maintaining competitive coverage\. We emphasize that our theoretical results provide guarantees on the generalization of the confidence ranking behavior, and our experiments demonstrate that improved ranking generalization can reduce monotonicity violations in practice\. To summarize, the contributions of this work are as follows:
- ⋆ Confidence Ranking Framework. We propose to learn a parameterized confidence estimator for LLM judgment via a margin-based ranking objective, targeting generalizable orderings rather than relying on heuristic confidence scores.
- ⋆ Theoretical Analyses. We develop PAC-Bayesian generalization bounds for the estimator's out-of-sample misranking probability, exposing a margin-dependent loss–complexity trade-off.
- ⋆ Optimizer and Experiments. Guided by our theoretical insights, we introduce a margin-adaptive training procedure and demonstrate improved ranking accuracy and higher success rates in meeting target agreement levels across datasets and judge models in cascaded selective evaluation.
## 2 Preliminary
Let $f_{LM}:\mathcal{X}\rightarrow\mathcal{Y}$ denote an LLM judge, where each input $x\in\mathcal{X}$ consists of a query and a pair of candidate responses $(r_1,r_2)$, and the output $y\in\mathcal{Y}$ represents a preference judgment between $r_1$ and $r_2$ (e.g., $r_1\succ r_2$). Let $D$ denote the underlying joint distribution over $\mathcal{X}\times\mathcal{Y}$, which remains fixed but unknown throughout our analysis.
Given a calibration dataset $S_{\mathrm{cal}}=\{(x_i,y_i)\}_{i=1}^{m}$, where each sample $(x_i,y_i)$ represents a query-response pair with its corresponding human preference label, existing calibration methods typically assume that samples are drawn independently from $D$.
Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) introduce simulated annotators, a confidence estimator that approximates diverse simulated human annotation preferences through in-context learning. Concretely, given $K$ preference-labelled examples for each of $N$ human annotators, they simulate annotator behaviour by performing $K$-shot prompting $N$ times and then ensembling the resulting predictions,
$$C_{LM}(x)=\max_{i}\frac{1}{N}\sum_{j=1}^{N}\mathbb{P}_{LM}\big(r_i\mid x;\,(x^{anno}_{1,j},y^{anno}_{1,j}),\ldots,(x^{anno}_{K,j},y^{anno}_{K,j})\big), \tag{1}$$

where $\mathbb{P}_{LM}(r_i\mid x;\ldots)$ denotes the predictive probability assigned by $f_{LM}$ to the candidate response $r_i$, and $(x^{anno}_{i,j},y^{anno}_{i,j})$ represents the $i$-th preference-labelled example of the $j$-th simulated human annotator.
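As a minimal sketch of the ensembling step in (1), assuming the $N$ prompted probabilities have already been collected (the function name and the toy numbers are hypothetical, not taken from the paper):

```python
def simulated_annotator_confidence(probs):
    """probs[j][i]: probability that simulated annotator j assigns to response r_i."""
    n_annotators = len(probs)
    n_responses = len(probs[0])
    # Average each response's probability over the N simulated annotators.
    means = [sum(row[i] for row in probs) / n_annotators for i in range(n_responses)]
    # The confidence is the largest averaged probability; its index is the judged preference.
    best = max(range(n_responses), key=lambda i: means[i])
    return means[best], best


# Hypothetical probabilities from N = 3 simulated annotators over (r_1, r_2).
probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
confidence, preferred = simulated_annotator_confidence(probs)
```

Here the averaged probabilities are roughly 0.6 for $r_1$ and 0.4 for $r_2$, so the sketch reports $r_1$ (index 0) with confidence of about 0.6 even though one simulated annotator prefers $r_2$.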
Then, let $S_{\lambda}:=\{(x,y)\in S_{\mathrm{cal}}\mid C_{LM}(x)\geq\lambda\}$ denote the subset of calibration samples whose LLM confidence score is at least the threshold $\lambda$, and define the empirical risk over $S_{\lambda}$ as:

$$\widehat{R}(\lambda):=\frac{1}{|S_{\lambda}|}\sum_{(x,y)\in S_{\lambda}}\mathbbm{1}\{f_{LM}(x)\neq y\}. \tag{2}$$

The corresponding population risk for samples above the confidence threshold $\lambda$ is defined as:

$$R(\lambda):=\mathbb{E}_{(x,y)\sim D}\big[\mathbbm{1}\{f_{LM}(x)\neq y\}\mid C_{LM}(x)\geq\lambda\big].$$

Since the error count $|S_{\lambda}|\widehat{R}(\lambda)$ follows a binomial distribution with $|S_{\lambda}|$ trials, Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) compute the exact $(1-\delta)$ upper confidence bound as:

$$\widehat{R}^{+}(\lambda):=\sup\left\{R:\mathbb{P}\big(\mathrm{Bin}(|S_{\lambda}|,R)\leq\lceil|S_{\lambda}|\widehat{R}(\lambda)\rceil\big)\geq\delta\right\}. \tag{3}$$
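The bound in (3) has no closed form in general, but the binomial CDF is decreasing in $R$, so the supremum can be located numerically. The following is a sketch under that observation; the bisection approach and the names `binom_cdf` and `risk_upper_bound` are illustrative assumptions rather than the authors' implementation.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), summed term by term (adequate for moderate n)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def risk_upper_bound(n_errors, n_selected, delta, tol=1e-9):
    """Largest R with P(Bin(n_selected, R) <= n_errors) >= delta, via bisection."""
    if n_errors >= n_selected:
        return 1.0  # the CDF equals 1 for every R, so the supremum is 1
    lo, hi = 0.0, 1.0  # CDF is 1 at R = 0 and 0 at R = 1, decreasing in between
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(n_errors, n_selected, mid) >= delta:
            lo = mid  # mid still satisfies the constraint, move right
        else:
            hi = mid
    return lo


# Hypothetical calibration subset: 100 selected samples, 0 disagreements.
ub_zero_errors = risk_upper_bound(0, 100, delta=0.1)
```

With zero observed errors the constraint reduces to $(1-R)^{n}\geq\delta$, so the result can be checked against $1-\delta^{1/n}$.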
Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) assume that the risk function is nearly monotonic in $\lambda$; specifically, the risk tends to increase as $\lambda$ decreases (*monotonicity assumption*). This assumption enables the use of fixed-sequence testing (Bauer, [1991](https://arxiv.org/html/2605.15416#bib.bib19)): one begins testing at the largest $\lambda$ (e.g., 0.999) and proceeds through a decreasing sequence, stopping at the last value for which $\widehat{R}^{+}(\lambda)$ remains at or below the target risk level $\alpha$. Formally, the selected threshold is

$$\widehat{\lambda}=\inf\left\{\lambda:\widehat{R}^{+}(\lambda')\leq\alpha\ \text{for all}\ \lambda'\geq\lambda\right\}. \tag{4}$$

They then obtain the following guarantee.
###### Theorem 2.1 (Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16))).
Consider a threshold $\widehat{\lambda}$ chosen as above, and a selective evaluator $(f_{LM},C_{LM})$ operating based on $\widehat{\lambda}$. Then, with probability at least $1-\delta$,

$$\mathbb{P}\big(f_{LM}(x)=y\mid C_{LM}(x)\geq\widehat{\lambda}\big)\geq 1-\alpha. \tag{5}$$
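The threshold selection in (4) can be sketched as a single pass over a decreasing grid, assuming the upper bounds $\widehat{R}^{+}(\lambda)$ have already been computed for each grid point. The function name and the toy numbers are hypothetical.

```python
def select_threshold(thresholds, upper_bounds, alpha):
    """Fixed-sequence testing on a finite grid.

    Walks the thresholds from largest to smallest and stops at the first one
    whose upper bound exceeds alpha. Returns the last threshold that passed,
    or None if even the largest threshold fails.
    """
    selected = None
    for lam, ub in sorted(zip(thresholds, upper_bounds), reverse=True):
        if ub > alpha:
            break  # later (smaller) thresholds are never tested
        selected = lam
    return selected


# Hypothetical grid: the bound at 0.7 violates alpha, so 0.6 is never reached
# even though its own bound would pass.
lam_hat = select_threshold([0.9, 0.8, 0.7, 0.6], [0.05, 0.08, 0.15, 0.09], alpha=0.1)
```

The early stop is what makes the monotonicity assumption matter: if the risk curve dips again below $\alpha$ at a smaller threshold, the procedure cannot exploit it, and non-monotone bounds shrink the accepted region.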
## 3 Theoretical Analyses for Confidence Ranking
### 3.1 Problem Formulation
Problem. Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) assume that the selective human-disagreement risk is approximately monotonic in the confidence threshold $\lambda$, i.e., the risk tends to increase as $\lambda$ decreases. However, this monotonicity condition is primarily empirical and is not supported by a formal analysis of how the confidence estimator generalizes beyond calibration data. As a result, treating monotonicity as a given introduces a structural vulnerability in selective evaluation.
- Miscalibration. If the estimator is miscalibrated, the induced ordering of instances may be unreliable, leading to selective risk curves that violate the expected monotonic behavior. More broadly, the estimator's out-of-sample behavior is not explicitly characterized.
- Cross-task generalization. In heterogeneous multi-task settings with task-specific calibration sets, a single confidence estimator may not transfer consistently, making it difficult to maintain stable performance across tasks and domains.
To mitigate these issues, we introduce a parameterized confidence estimator and analyze it through a PAC\-Bayesian ranking framework that bounds its expected misranking error\. This provides a principled way to reason about the generalization of the confidence\-induced ordering and empirically reduces monotonicity violations in practice\.
Figure 2: Bernoulli Simulation Study (10,000 trials): Increasing noise (and thus misranking) consistently increases both ranking loss and the monotonicity-violation rate, suggesting that reducing ranking error also improves monotonicity during optimization. Details are given in Appendix [D.1](https://arxiv.org/html/2605.15416#A4.SS1).

Parameterized estimator setting. Given an instance–label pair $(x,y)$ and an LLM $f_{LM}$, let $a(x)\in\{0,1\}$ denote whether the model prediction agrees with the human annotation, i.e., $a(x)=1$ if $f_{LM}(x)=y$ and $a(x)=0$ otherwise.
Following the setting of Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)), for each instance $(x,y)$ we assume access to $K$ preference-labelled examples for each of $N$ simulated human annotators. For simulated annotator $j\in[N]$, denote these examples by $\{(x^{anno}_{1,j},y^{anno}_{1,j}),\ldots,(x^{anno}_{K,j},y^{anno}_{K,j})\}$. We define the collection of all possible $k$-shot in-context subsets as

$$\mathcal{T}=\bigcup_{j\in[N]}\left\{t\subseteq\{(x^{anno}_{i,j},y^{anno}_{i,j})\}_{i=1}^{K}\;\middle|\;1\leq|t|\leq K\right\}, \tag{6}$$

where each element $t\in\mathcal{T}$ represents a distinct set of demonstrations used to condition the LLM judge.
For any $t\in\mathcal{T}$, let $\mathbb{P}_{LM}(r_1\mid x;t)$ denote the predictive probability assigned by $f_{LM}$ to the candidate response $r_1$ when prompted with the in-context examples $t$. We collect these probabilities into a feature vector

$$s=\big(\mathbb{P}_{LM}(r_1\mid x;t_1),\ldots,\mathbb{P}_{LM}(r_1\mid x;t_{|\mathcal{T}|})\big), \tag{7}$$

where $t_i\in\mathcal{T}$. For the given $x$, $f_{LM}$, and the corresponding $s$, we then consider a confidence estimator
$C_{\theta}(s):\mathbb{R}^{|\mathcal{T}|}\rightarrow[0,1]$ parameterized by $\theta$, typically instantiated as a neural network (e.g., an MLP) that maps the collection of LLM predictive scores to a scalar confidence value.
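To make the size of the feature vector in (7) concrete, here is a small sketch that enumerates $\mathcal{T}$ as defined in (6). It treats each demonstration set as unordered and assumes no two annotators share an identical subset (otherwise the union would merge duplicates); the placeholder examples and the function name are hypothetical.

```python
from itertools import combinations


def demonstration_subsets(annotator_examples):
    """Enumerate every non-empty subset of each annotator's K examples."""
    subsets = []
    for examples in annotator_examples:           # one list of K examples per annotator
        for size in range(1, len(examples) + 1):  # subset sizes 1 <= |t| <= K
            subsets.extend(combinations(examples, size))
    return subsets


# Hypothetical setup: N = 2 annotators with K = 3 labelled examples each.
annotator_examples = [
    [("x1_a", "r1"), ("x2_a", "r2"), ("x3_a", "r1")],
    [("x1_b", "r2"), ("x2_b", "r2"), ("x3_b", "r1")],
]
T = demonstration_subsets(annotator_examples)
feature_dim = len(T)  # equals N * (2**K - 1) under the no-duplicates assumption
```

Under these assumptions the input dimension of $C_{\theta}$ grows as $N(2^{K}-1)$, so the toy setup gives 14 features and, for instance, $N=K=5$ would give 155.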
In the theoretical analyses of this work, the confidence function $C_{\theta}$ is modeled as an $n$-layer feedforward neural network with ReLU activations, where each layer consists of $h$ hidden units. The final layer applies a sigmoid activation, rather than a softmax, to produce a scalar confidence value in $[0,1]$. We use $\theta$ to denote the collection of weights across all layers, and $W_l$ to denote the weight matrix associated with the $l$-th layer. The corresponding vectorized parameter representation is denoted by $w_l$. For brevity, bias terms are incorporated into the weight matrices. We denote by $\|W_l\|_2$ the spectral norm of $W_l$, i.e., its largest singular value; $\|W_l\|_F$ is the Frobenius norm of the weight matrix, and $\|w_l\|_p$ is the $\ell_p$ norm of the weight vector.
Ranking error. The ideal behavior is that $C_{\theta}$ preserves the ordering induced by agreement likelihood, i.e., if $a(x_i)>a(x_j)$, then $C_{\theta}(s_i)>C_{\theta}(s_j)$ (the monotonicity assumption in Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16))).
To quantify this behavior, we adopt a pairwise ranking formulation. We define the distribution $D_{\mathrm{pair}}$ and the set $S_{\mathrm{pair}}$ over ordered pairs $\big((x_i,a(x_i)),(x_j,a(x_j))\big)$ restricted to those satisfying $a(x_i)>a(x_j)$.
For a given $f_{LM}$ and the ordered pairs $\big((x_i,a(x_i),s_i),(x_j,a(x_j),s_j)\big)$, where $s_i$ and $s_j$ are generated through ([7](https://arxiv.org/html/2605.15416#S3.E7)), we consider the following margin-based ranking loss:

$$\ell_{\gamma}(\theta;x_i,x_j):=\mathbbm{1}\big(C_{\theta}(s_i)<C_{\theta}(s_j)+\gamma\big), \tag{8}$$

where $\gamma\geq 0$ is a margin parameter. The loss is $0$ when the higher-agreement point is ranked above the lower-agreement one by a margin of at least $\gamma$, and $1$ otherwise.
In our theoretical setting, we consider the expected ranking loss, the empirical ranking loss, and its margin-based empirical counterpart:

$$\begin{aligned}
\mathcal{RK}(\theta)&:=\mathbb{E}_{(x_i,x_j)\sim D_{\mathrm{pair}}}\big[\ell_{0}(\theta;x_i,x_j)\big],\\
\widehat{\mathcal{RK}}(\theta)&:=\frac{1}{|S_{\mathrm{pair}}|}\sum_{(x_i,x_j)\in S_{\mathrm{pair}}}\ell_{0}(\theta;x_i,x_j),\\
\widehat{\mathcal{RK}}_{\gamma}(\theta)&:=\frac{1}{|S_{\mathrm{pair}}|}\sum_{(x_i,x_j)\in S_{\mathrm{pair}}}\ell_{\gamma}(\theta;x_i,x_j).
\end{aligned} \tag{9}$$
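The two empirical quantities in (9) can be computed directly from confidence scores and agreement indicators. The sketch below forms every agreement/disagreement pair in a small sample, which is a simplifying assumption (the analysis treats $S_{\mathrm{pair}}$ as i.i.d. draws from $D_{\mathrm{pair}}$); the function name and the toy scores are hypothetical.

```python
def ranking_losses(scores, agreements, gamma):
    """Return (empirical ranking loss, empirical margin-based ranking loss)."""
    pos = [c for c, a in zip(scores, agreements) if a == 1]  # a(x_i) = 1
    neg = [c for c, a in zip(scores, agreements) if a == 0]  # a(x_j) = 0
    pairs = [(ci, cj) for ci in pos for cj in neg]
    if not pairs:
        raise ValueError("need at least one agreement and one disagreement instance")
    # l_0 counts a pair when the agreement instance is scored strictly lower.
    loss0 = sum(ci < cj for ci, cj in pairs) / len(pairs)
    # l_gamma also counts pairs that are correctly ordered but by less than gamma.
    loss_gamma = sum(ci < cj + gamma for ci, cj in pairs) / len(pairs)
    return loss0, loss_gamma


scores = [0.9, 0.7, 0.6, 0.4]  # hypothetical confidence values C_theta(s)
agreements = [1, 1, 0, 0]      # a(x): 1 if the judge matched the human label
loss0, loss_gamma = ranking_losses(scores, agreements, gamma=0.2)
```

In this toy sample all four pairs are correctly ordered, so the margin-free loss is 0, but the pair (0.7, 0.6) is separated by less than 0.2 and contributes to the margin-based loss, giving 0.25. Because $\ell_{\gamma}\geq\ell_{0}$ pointwise, the margin-based loss can only grow as $\gamma$ increases.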
PAC-Bayes. The PAC-Bayesian framework (McAllester, [1999](https://arxiv.org/html/2605.15416#bib.bib96)) provides tight upper bounds on the generalization performance of stochastic classifiers, which are defined with respect to a posterior distribution $Q$ over a hypothesis class. The resulting bounds are governed primarily by the Kullback–Leibler (KL) divergence between the posterior distribution $Q$ and a prior distribution $P$ over classifiers. Within this setting, we take the stochastic confidence estimators under the posterior $Q$ to be of the form $C_{\theta+\mathbf{u}}$ (Neyshabur et al., [2017](https://arxiv.org/html/2605.15416#bib.bib15)), where $\mathbf{u}$ is a random variable potentially influenced by the training data and $\theta$ denotes the deterministic parameters of the confidence estimator.
### 3.2 Theoretical Analyses
In this section, we develop a PAC-Bayesian framework, inspired by Neyshabur et al. ([2017](https://arxiv.org/html/2605.15416#bib.bib15)), for learning a confidence function whose induced ordering over input instances aligns with human–LLM agreement. By formulating the problem in terms of pairwise ranking, we derive generalization bounds on the proportion of misordered pairs. These bounds, in turn, provide theoretical control over the near-monotonicity of the selective risk curve that underpins the human-agreement guarantees.
Let $P$ be any prior distribution over the confidence estimators, chosen independently of the data, and let $Q$ denote a posterior distribution over estimators of the form $C_{\theta+\mathbf{u}}$, where $\mathbf{u}$ is a random variable whose distribution may also depend on the training data. The following theorem gives a PAC-Bayesian upper bound on the true ranking risk under the posterior distribution.
###### Theorem 3.2 (PAC-Bayesian Bound for Expected Ranking Error).
Let $\delta'\in(0,1)$, and let $S_{\mathrm{pair}}$ be a set of $m_p$ i.i.d. training pairs drawn from $D_{\mathrm{pair}}$. Then, with probability at least $1-\delta'$ over the draw of these pairs, the expected error of $C_{\theta+\mathbf{u}}$ can be bounded as follows:

$$\mathbb{E}_{\mathbf{u}}\big[\mathcal{RK}(\theta+\mathbf{u})\big]\;\leq\;\mathbb{E}_{\mathbf{u}}\big[\widehat{\mathcal{RK}}(\theta+\mathbf{u})\big]+\sqrt{\frac{\mathrm{KL}(\theta+\mathbf{u}\,\|\,P)+\ln\frac{m_p}{\delta'}}{2(m_p-1)}}. \tag{10}$$
###### Sketch of Proof\.
The proof follows directly from the classical PAC-Bayesian inequalities for bounded losses (McAllester, [2003](https://arxiv.org/html/2605.15416#bib.bib123)). Since $\mathcal{RK}(\theta)\in[0,1]$ is bounded and measurable, the moment generating function required by the PAC-Bayesian bound is finite. Applying the Donsker-Varadhan variational inequality to the exponential moment and then bounding the empirical process term via Markov's inequality yields ([10](https://arxiv.org/html/2605.15416#S3.E10)). ∎
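To get a feel for the square-root term in (10), the sketch below evaluates it for a few sample sizes. The KL value of 50 and $\delta'=0.05$ are arbitrary illustrative numbers, and the function name is hypothetical.

```python
from math import log, sqrt


def pac_bayes_gap(kl, m_pairs, delta):
    """Square-root term of (10): sqrt((KL + ln(m_p / delta')) / (2 (m_p - 1)))."""
    return sqrt((kl + log(m_pairs / delta)) / (2 * (m_pairs - 1)))


# Illustrative values only: a fixed KL divergence and a growing number of pairs.
gaps = {m: pac_bayes_gap(kl=50.0, m_pairs=m, delta=0.05) for m in (1_000, 10_000, 100_000)}
```

For a fixed KL divergence the term decays roughly like $1/\sqrt{m_p}$ (about 0.17 at $m_p=1000$ in this example), so the bound is informative only when the KL term is small relative to the number of training pairs.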
Building upon the PAC-Bayesian framework established in [Theorem 3.2](https://arxiv.org/html/2605.15416#S3.Thmtheorem2), which bounds the discrepancy between the expected ranking risk and the empirical ranking risk for stochastic confidence estimators, our next step is to extend the analysis to the deterministic setting. To this end, we introduce a sharpness constraint inspired by Neyshabur et al. ([2017](https://arxiv.org/html/2605.15416#bib.bib15)) and exploit a chain decomposition of ranking risks. This approach yields an analytically tractable, margin-based generalization bound for deterministic confidence measures, which we present below.
###### Corollary 3\.3\.
Given [Theorem 3.2](https://arxiv.org/html/2605.15416#S3.Thmtheorem2), let $C_{\theta}$ be any confidence estimator with parameters $\theta$. Let $P$ be any prior distribution on the parameters, and let $Q$ (i.e., the distribution of $\theta+\mathbf{u}$) be the posterior distribution on the learned confidence estimator parameters. Then, for any $\delta',\gamma>0$, and any (posterior) random perturbation $\mathbf{u}$ s.t. $\mathbb{P}_{\mathbf{u}}\big(\max_{s}|C_{\theta+\mathbf{u}}(s)-C_{\theta}(s)|<\frac{\gamma}{4}\big)\geq\frac{1}{2}$, with probability at least $1-\delta'$, we have

$$\mathcal{RK}(\theta)\;\leq\;\widehat{\mathcal{RK}}_{\gamma}(\theta)+\sqrt{\frac{\mathrm{KL}(\theta+\mathbf{u}\,\|\,P)+\ln\frac{3\sqrt{m_p}}{\delta'}}{m_p-1}}. \tag{11}$$
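The chain decomposition behind (11) is not spelled out here. A sketch of the usual argument, assuming it follows the margin analysis of Neyshabur et al. (2017), is given below; the shorthand $\Delta_{\theta}$ and the margin-$\gamma/2$ losses are introduced only for this illustration. Write $\Delta_{\theta}=C_{\theta}(s_i)-C_{\theta}(s_j)$ for a pair drawn from $D_{\mathrm{pair}}$. For any perturbation $\mathbf{u}$ with $\max_{s}|C_{\theta+\mathbf{u}}(s)-C_{\theta}(s)|<\gamma/4$, the triangle inequality gives

$$|\Delta_{\theta+\mathbf{u}}-\Delta_{\theta}|\;\leq\;|C_{\theta+\mathbf{u}}(s_i)-C_{\theta}(s_i)|+|C_{\theta+\mathbf{u}}(s_j)-C_{\theta}(s_j)|\;<\;\frac{\gamma}{2},$$

so that

$$\mathbbm{1}(\Delta_{\theta}<0)\;\leq\;\mathbbm{1}\big(\Delta_{\theta+\mathbf{u}}<\tfrac{\gamma}{2}\big)\;\leq\;\mathbbm{1}(\Delta_{\theta}<\gamma),\qquad\text{i.e.}\qquad\ell_{0}(\theta)\;\leq\;\ell_{\gamma/2}(\theta+\mathbf{u})\;\leq\;\ell_{\gamma}(\theta).$$

Under this reading, the left inequality bounds the expected loss of the deterministic estimator by an expected margin-$\gamma/2$ loss of the perturbed one, the right inequality bounds the perturbed empirical margin-$\gamma/2$ loss by $\widehat{\mathcal{RK}}_{\gamma}(\theta)$, and a PAC-Bayesian bound of the form (10), applied to the margin-$\gamma/2$ loss, links the two. Restricting $\mathbf{u}$ to the event that has probability at least $\frac{1}{2}$ only affects the constants.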
While the preceding corollary controls the generalization gap between the expected and empirical ranking losses for deterministic confidence estimators, its complexity term is governed by the KL divergence between the posterior and prior. Building on the PAC-Bayesian sharpness analysis, we reformulate this term as a margin-based complexity measure that depends explicitly on the parameter norms of the confidence function. This transformation yields a more interpretable and practically meaningful generalization bound for confidence measures.
The primary challenge lies in computing the KL divergence within the sharpness limit (or random perturbation limit). To tackle this, we employ a two-pronged approach. First, we use a pre-determined grid to select the prior distribution $P$ of confidence estimators. Second, we let $\mathbf{u}\sim\mathcal{N}(0,\sigma^{2}I)$ (Neyshabur et al., [2017](https://arxiv.org/html/2605.15416#bib.bib15)); by accounting for both the sharpness limit and the Lipschitz property of the model, we derive an upper bound on the randomness of the posterior distribution. This formulation allows us to bound the KL divergence between $Q$ and $P$, a crucial step in obtaining the following generalization bound.
###### Corollary 3\.4\.
Given Corollary [3.3](https://arxiv.org/html/2605.15416#S3.Thmtheorem3), for any $n,h>0$, let the base estimator $C_{\theta}$ be an $n$-layer MLP with $h$ units in each layer and ReLU activations. Then, for any $\delta',\gamma>0$, with probability at least $1-\delta'$, we have

$$\mathcal{RK}(\theta)\;\leq\;\underbrace{\widehat{\mathcal{RK}}_{\gamma}(\theta)}_{\textbf{Empirical Margin Loss}}+\underbrace{\mathcal{O}\Bigg(\sqrt{\frac{\Phi(C_{\theta})+\ln\frac{3m_p}{\delta'}}{\gamma^{2}(m_p-1)}}\Bigg)}_{\textbf{Margin-based Complexity Term}}, \tag{12}$$

where $\Phi(C_{\theta})=n^{2}h\ln(nh)\prod_{l=1}^{n}\|W_l\|_{2}^{2}\sum_{l=1}^{n}\frac{\|W_l\|_{F}^{2}}{\|W_l\|_{2}^{2}}$.
## 4 The Adaptive Optimizer
The main result of the above PAC-Bayesian analyses is a margin-based ranking generalization bound in ([12](https://arxiv.org/html/2605.15416#S3.E12)). For a given confidence estimator $C_{\theta}$, the bound states that the true expected misranking error is bounded by the empirical margin-based ranking loss plus the margin-dependent complexity term. This decomposition reveals a fundamental trade-off controlled by the margin parameter $\gamma$, as discussed in Remark [3.5](https://arxiv.org/html/2605.15416#S3.Thmtheorem5). Thus, achieving the best generalization performance requires balancing these opposing forces.
Remark [3.5](https://arxiv.org/html/2605.15416#S3.Thmtheorem5) implies that no fixed margin $\gamma$ is uniformly optimal across datasets, annotation behaviours, or LLM judges. When the data is clean and the LLM judge is relatively reliable, a larger margin induces strong ranking separation and produces a smaller complexity term. When model–human agreement is noisy or heterogeneous, enforcing a large margin may be unrealistic, causing the empirical margin loss to dominate. This motivates an adaptive margin strategy that dynamically calibrates $\gamma$ based on empirical behaviour and theoretical insights.
We propose a training objective that selects the margin $\gamma$ automatically by minimizing a bound-guided surrogate objective, i.e.,

$$\min_{\theta,\gamma}\;\widehat{\mathcal{RK}}_{\gamma}(\theta)+\beta\,\mathcal{C}_{\gamma}(\theta), \tag{13}$$

where $\beta>0$ controls the trade-off between empirical fit and model complexity, and $\mathcal{C}_{\gamma}(\theta)$ denotes the complexity term induced by the generalization bound.
Recalling the complexity term in ([12](https://arxiv.org/html/2605.15416#S3.E12)), $\mathcal{C}_{\gamma}(\theta)$ depends primarily on the margin $\gamma$ as well as the spectral and Frobenius norms of the model weights ($m_p,n,h,\delta'$ are not optimization variables during training). Spectral norms are upper-bounded, and in practice well-controlled, by Frobenius norms, i.e., $\|W_l\|_{2}\leq\|W_l\|_{F}$. Replacing spectral norms with Frobenius norms yields a computationally efficient, differentiable surrogate that preserves the qualitative dependence of the bound on model capacity and Lipschitzness, while enabling stable end-to-end optimization with standard optimizers. We therefore let $\mathcal{C}_{\gamma}(\theta)=\frac{\sqrt{\sum_{l}\|W_l\|_{F}^{2}}}{\gamma}$.
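A minimal sketch of the surrogate $\mathcal{C}_{\gamma}(\theta)$, with weight matrices given as nested Python lists for self-containment (a real training loop would compute the same quantity on framework tensors so that gradients flow to $\theta$). The function name and the toy weights are hypothetical.

```python
from math import sqrt


def complexity_surrogate(weights, gamma):
    """C_gamma(theta) = sqrt(sum_l ||W_l||_F^2) / gamma."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Squared Frobenius norm of a matrix is the sum of its squared entries.
    sq_frobenius = sum(v * v for W in weights for row in W for v in row)
    return sqrt(sq_frobenius) / gamma


# Hypothetical two-layer weights: squared Frobenius norms 25 and 144, total 169.
weights = [[[3.0, 0.0], [0.0, 4.0]], [[0.0, 12.0]]]
c_small_margin = complexity_surrogate(weights, gamma=0.1)
c_large_margin = complexity_surrogate(weights, gamma=0.5)
```

With a combined norm of 13, the penalty is 130 at $\gamma=0.1$ and 26 at $\gamma=0.5$: the surrogate rewards larger margins and smaller weights, and the empirical margin loss pushes in the opposite direction on $\gamma$.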
However, jointly optimizing $(\theta,\gamma)$ is challenging and can be unstable due to the non-smooth dependence of the empirical margin loss on $\gamma$. We therefore adopt a decoupled, alternating update scheme: $\theta$ is updated using a smooth surrogate ranking loss with $\gamma$ held fixed, while $\gamma$ is updated at each iteration to the value that minimizes the objective for the current $\theta$, i.e.,

$$\min_{\theta}\min_{\gamma}\;\widehat{\mathcal{RK}^{s}_{\gamma}}(\theta)+\beta\,\mathcal{C}_{\gamma}(\theta). \tag{14}$$

Since $\widehat{\mathcal{RK}}_{\gamma}(\theta)$ corresponds to a non-differentiable 0–1 ranking error, we replace it with a differentiable surrogate $\widehat{\mathcal{RK}^{s}_{\gamma}}(\theta)$ in ([14](https://arxiv.org/html/2605.15416#S4.E14)) to enable gradient-based optimization. Specifically, we adopt a softmax-based ranking loss by replacing $\ell_{\gamma}$ with $\log\big(1+e^{-\frac{C_{\theta}(s_i)-C_{\theta}(s_j)-\gamma}{0.1}}\big)$. This type of surrogate is standard in machine learning; for example, in classical classification tasks, the cross-entropy is routinely used as a differentiable replacement for the 0-1 classification error during training.
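The $\gamma$ step of the alternating scheme can be sketched as follows. The surrogate uses the logistic form with temperature 0.1 given above; how the inner minimization over $\gamma$ is carried out is not specified here, so the grid search, the function names, and the toy scores are all assumptions made for illustration. The $\theta$ step (a gradient update with $\gamma$ fixed) is omitted.

```python
from math import exp, log


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return z + log(1 + exp(-z)) if z > 0 else log(1 + exp(z))


def surrogate_ranking_loss(pos_scores, neg_scores, gamma, temperature=0.1):
    """Mean of log(1 + exp(-(C_i - C_j - gamma) / temperature)) over all pairs."""
    terms = [softplus(-(ci - cj - gamma) / temperature)
             for ci in pos_scores for cj in neg_scores]
    return sum(terms) / len(terms)


def best_margin(pos_scores, neg_scores, weight_norm, beta, grid):
    """Gamma step: for fixed theta, pick the grid value minimising the objective in (14).

    weight_norm stands for sqrt(sum_l ||W_l||_F^2), so beta * weight_norm / gamma
    is the complexity surrogate scaled by beta.
    """
    def objective(gamma):
        loss = surrogate_ranking_loss(pos_scores, neg_scores, gamma)
        return loss + beta * weight_norm / gamma
    return min(grid, key=objective)


grid = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# Hypothetical scores: well-separated agreement/disagreement cases vs. overlapping ones.
gamma_clean = best_margin([0.9, 0.8], [0.3, 0.2], weight_norm=2.0, beta=0.01, grid=grid)
gamma_noisy = best_margin([0.6, 0.5], [0.5, 0.4], weight_norm=2.0, beta=0.01, grid=grid)
```

In this toy setting the well-separated scores select a larger margin than the overlapping ones, which matches the motivation for adapting $\gamma$ to how reliably the judge separates agreement from disagreement.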
Table 1:Performance of confidence estimators across judge models and datasets\.
## 5 Empirical Results
In the following experiments, we first show that our method achieves lower ranking loss and higher AUROC than existing baselines, leading to stronger monotonicity with human-agreement (or disagreement) risk. We then demonstrate that, when integrated into the hypothesis-testing framework, our method attains a higher guarantee success rate.
Datasets. We test our approach across four widely used preference and instruction-following datasets: AlpacaEval (Dubois et al., [2023](https://arxiv.org/html/2605.15416#bib.bib5)), Chatbot Arena (Zheng et al., [2023](https://arxiv.org/html/2605.15416#bib.bib3)), HH-RLHF (Bai et al., [2022](https://arxiv.org/html/2605.15416#bib.bib2)), and the TL;DR validation set (Stiennon et al., [2020](https://arxiv.org/html/2605.15416#bib.bib4)). AlpacaEval consists of instruction–response pairs designed to benchmark instruction-following performance. Chatbot Arena collects human preference judgments from head-to-head model comparisons in open-ended conversational settings. HH-RLHF emphasizes safety-aligned behavior, providing paired responses annotated with human preferences along the axes of helpfulness and harmlessness. TL;DR focuses on abstractive summarization, where candidate summaries are evaluated based on their alignment with human-written references.
Following Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)), to enable a unified evaluation across these heterogeneous sources, we reorganize all datasets into a consistent format consisting of an instruction and two candidate outputs. This formulation allows a judge model to assess the relative quality of the two outputs conditioned on the instruction, producing a confidence score that represents the likelihood of preference.
Judge models. We employ six judge models spanning a range of parameter scales: Mistral-7B-instruct, Llama3-8B, Llama3-70B, Qwen2.5-32B, Qwen2.5-72B, and GPT-OSS-120B. Each model independently evaluates the four datasets introduced above, assigning preference scores for each pair of candidate outputs.
Figure 3: Estimated confidence on Chatbot Arena with Llama3-8B: Simulated Annotators (left) vs. our method (right). The horizontal axis denotes the confidence threshold and the vertical axis reports the human-agreement rate among samples exceeding the threshold. Our confidence estimates remain monotone with respect to human agreement, whereas Simulated Annotators fails in this setting.

### 5.1 Evaluating Ranking Loss and Monotonicity
Baselines. We consider five existing confidence measures as baselines: Predictive Probability, Verbalized Confidence, Random Annotator, Simulated Annotators, and Learning Confidence (Vanilla). Predictive Probability directly uses the likelihood of the predicted preference label, while Verbalized Confidence (Bai et al., [2022](https://arxiv.org/html/2605.15416#bib.bib2)) prompts the LLM judge to explicitly state its confidence as a scalar. Random Annotator and Simulated Annotators (Jung et al., [2025](https://arxiv.org/html/2605.15416#bib.bib16)) leverage auxiliary human annotations: the former randomly samples one annotator's preference into the prompt and uses the predicted label's likelihood as confidence; the latter constructs 5 simulated annotators with the 5 annotation examples, then uses the largest average likelihood as confidence. Learning Confidence (Vanilla) trains the estimator using the standard ranking loss only, while keeping all other settings identical to our method.
Parameterized confidence estimator. In all experiments, we use a three-layer MLP with hidden widths 64–32–16 and ReLU activations. The output layer applies a sigmoid (rather than a softmax) to produce a scalar confidence score in $[0,1]$. For each dataset, we additionally generate $\sim$3000 training examples, randomly construct 5000 training pairs, and train the estimator for 30 epochs with learning rate $10^{-3}$, weight decay $10^{-4}$, and $\beta=10^{-4}$. Following Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)), we set $K=5$ and $N=5$ for both Simulated Annotators and our method across all experiments.
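The architecture described above can be sketched as a dependency-free forward pass. This is only an illustration of the shape of the estimator (hidden widths 64–32–16, ReLU, sigmoid output). The input dimension, the He-style initialization, and the function names `init_mlp` and `confidence` are assumptions made for the example, and no training loop is shown.

```python
import math
import random


def init_mlp(in_dim, widths=(64, 32, 16), seed=0):
    """Create weights for an MLP with the given hidden widths and one output unit."""
    rng = random.Random(seed)
    dims = [in_dim, *widths, 1]
    layers = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        scale = math.sqrt(2.0 / fan_in)  # He-style scale (an assumption)
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def confidence(layers, features):
    """Forward pass: ReLU on hidden layers, sigmoid on the scalar output."""
    h = list(features)
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        is_last = idx == len(layers) - 1
        h = z if is_last else [max(0.0, v) for v in z]
    return sigmoid(h[0])


if __name__ == "__main__":
    mlp = init_mlp(in_dim=5)  # in_dim=5 is a placeholder, not from the paper
    print(confidence(mlp, [0.1, 0.4, 0.3, 0.9, 0.5]))
```

A sigmoid rather than a softmax is the natural choice here because the estimator emits a single score per sample instead of a distribution over classes.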
[Table 1](https://arxiv.org/html/2605.15416#S4.T1) reports the ranking loss ([9](https://arxiv.org/html/2605.15416#S3.E9)) and AUROC on the test sets of the four datasets for multiple judge models. Here AUROC measures the discriminative power of the estimated confidence with respect to the human-agreement rate above the confidence threshold. Overall, the results indicate that static, non-parameterized confidence measures do not generalize reliably across datasets and judge models. In contrast, our margin-adaptive training approach learns a dataset- and judge-specific confidence estimator from a small training set and achieves consistently strong generalization. Across all settings, our method attains the lowest ranking loss and improved AUROC. Moreover, [Figure 3](https://arxiv.org/html/2605.15416#S5.F3) shows that when Simulated Annotators yields non-monotone confidence estimates with respect to human agreement, our method restores monotonicity.
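To make the two diagnostics discussed above concrete, here is a minimal sketch that computes an AUROC and the human-agreement rate above a confidence threshold, followed by a simple monotonicity check on the resulting curve. It assumes that AUROC is computed against a binary label indicating whether the judge agrees with the human preference. That reading, the toy data, and the function names are assumptions of this example rather than the paper's evaluation code.

```python
def auroc(confidences, agrees):
    """Probability that an agreeing sample outranks a disagreeing one (ties count 0.5)."""
    pos = [c for c, a in zip(confidences, agrees) if a]
    neg = [c for c, a in zip(confidences, agrees) if not a]
    if not pos or not neg:
        raise ValueError("need at least one agreeing and one disagreeing sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def agreement_above_threshold(confidences, agrees, thresholds):
    """Human-agreement rate among samples with confidence >= each threshold."""
    curve = []
    for t in thresholds:
        kept = [a for c, a in zip(confidences, agrees) if c >= t]
        curve.append(sum(kept) / len(kept) if kept else None)
    return curve


def is_monotone(curve):
    """True if the agreement rate never decreases as the threshold increases."""
    vals = [v for v in curve if v is not None]
    return all(b >= a for a, b in zip(vals, vals[1:]))


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # toy confidences
    agree = [1, 1, 0, 1, 0, 0]             # 1 = judge matches human label
    curve = agreement_above_threshold(conf, agree, [0.2, 0.5, 0.8])
    print(auroc(conf, agree), curve, is_monotone(curve))
```

A curve that dips as the threshold rises is the kind of monotonicity violation that Figure 3 illustrates for Simulated Annotators.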
Table 2: Comparison against baselines on four datasets at target agreement level $1-\alpha=0.85$. Results are averaged over 1000 runs with random data splits. Guarantee success rate is the fraction of runs whose empirical human agreement is at least $1-\alpha$. CSE denotes the Cascaded Selective Evaluation framework. L$\to$Q$\to$O corresponds to the cascade {Llama3-8B, Qwen2.5-72B, GPT-OSS-120B}, and M$\to$L$\to$O corresponds to {Mistral-7B-Instruct, Llama3-70B, GPT-OSS-120B}.
### 5.2 Evaluating Guarantee Success Rate
Following Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)), we fix the calibration set size to 500 and set $\delta=0.1$, and repeat each experiment over 1000 random splits of the calibration and test sets. As baselines, we consider: (1) Heuristic Selection, which uses GPT-4 as the judge and sets $\lambda=1-\alpha$ under the assumption of perfect calibration; and (2) Cascaded Selective Evaluation (CSE) with confidence estimated by Predictive Probability, Simulated Annotators, Random Annotator, and Learning Confidence (Vanilla). We define the guarantee success rate as the fraction of runs in which the empirical human agreement is at least $1-\alpha$, where $\alpha$ is the user-specified risk tolerance. We define the coverage rate as the fraction of test samples that are retained (i.e., not rejected).
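The two metrics defined above can be computed from per-run records as in the following sketch. The record layout (one `(retained, agrees)` pair per test sample), the function name `summarize_runs`, and the handling of runs that retain no samples are assumptions of this example. Empirical agreement is measured over the retained samples only.

```python
def summarize_runs(runs, alpha):
    """Return (guarantee_success_rate, mean_coverage) over repeated runs.

    Each run is a list of (retained, agrees) booleans, one per test sample.
    """
    successes = 0
    coverages = []
    for run in runs:
        kept = [agrees for retained, agrees in run if retained]
        coverages.append(len(kept) / len(run))
        # Assumption: a run that retains nothing is not counted as a success.
        if kept and sum(kept) / len(kept) >= 1.0 - alpha:
            successes += 1
    return successes / len(runs), sum(coverages) / len(coverages)


if __name__ == "__main__":
    toy_runs = [
        [(True, True), (True, True), (False, False), (True, False)],
        [(True, True), (False, True), (False, False), (True, True)],
    ]
    print(summarize_runs(toy_runs, alpha=0.15))
```

In the toy data the first run retains three samples with agreement 2/3, which misses the 0.85 target, while the second retains two samples with full agreement.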
Figure 4: The majority of evaluations are done with the weaker judge models, Mistral-7B and Llama3-70B, on Chatbot Arena.

[Table 2](https://arxiv.org/html/2605.15416#S5.T2) shows that integrating our learned confidence estimator into CSE consistently yields the strongest human-agreement guarantees across all four datasets and both cascaded structures. In particular, CSE + Ours achieves the highest guarantee success rate in every setting (e.g., under Llama3-8B$\to$Qwen2.5-72B$\to$GPT-OSS-120B, success rate improves to 94.8% / 51.6% / 92.7% / 79.2% on AlpacaEval, HH-RLHF, Chatbot Arena, and TL;DR, respectively), while maintaining competitive coverage. In contrast, Heuristic Selection fails to reliably satisfy the target agreement level, achieving 0% success on AlpacaEval and HH-RLHF and only modest success on Chatbot Arena and TL;DR, highlighting the limitations of assuming perfect calibration. Compared to confidence baselines (Predictive Probability, Simulated Annotators, and Random Annotator), our method provides more robust and transferable confidence ranking, translating into substantially higher success rates with comparable or improved coverage across datasets and judge cascades. In addition, as shown in [Figure 4](https://arxiv.org/html/2605.15416#S5.F4), most evaluations under our method are still handled by the weaker judge models. We provide more empirical results in Appendix [D](https://arxiv.org/html/2605.15416#A4).
## 6 Related Work
While human evaluation provides high-quality judgments, it suffers from limited scalability, high cost, and low efficiency. Moreover, the increasing fluency of modern LLMs makes it difficult for annotators to reliably distinguish human- from model-generated text in open-ended settings (Clark et al., [2021](https://arxiv.org/html/2605.15416#bib.bib314)), motivating the use of LLMs as evaluators. Existing LLM-based evaluation methods span reference-based and reference-free paradigms. Early approaches augment traditional metrics by prompting LLMs to generate paraphrased references, improving coverage but still relying on human-written ground truth (Tang et al., [2023](https://arxiv.org/html/2605.15416#bib.bib315)). More recent reference-free methods prompt LLMs to directly assess response quality using task descriptions and evaluation rubrics (Liu et al., [2023](https://arxiv.org/html/2605.15416#bib.bib316); Fu et al., [2023](https://arxiv.org/html/2605.15416#bib.bib317); Chen et al., [2023](https://arxiv.org/html/2605.15416#bib.bib318); Chiang and Lee, [2023](https://arxiv.org/html/2605.15416#bib.bib118)). These evaluators have been applied to summarization, code generation, and open-ended QA, enabling multi-dimensional assessments via prompt design (Gao et al., [2023](https://arxiv.org/html/2605.15416#bib.bib319); Zhuo, [2023](https://arxiv.org/html/2605.15416#bib.bib320); Bai et al., [2023](https://arxiv.org/html/2605.15416#bib.bib321); Lin and Chen, [2023](https://arxiv.org/html/2605.15416#bib.bib322)). Prior work also evaluates factual correctness using both proprietary and open-source models (Min et al., [2023](https://arxiv.org/html/2605.15416#bib.bib323); Zha et al., [2023](https://arxiv.org/html/2605.15416#bib.bib324)), and leverages pairwise comparisons inspired by human preference judgments (Dubois et al., [2023](https://arxiv.org/html/2605.15416#bib.bib5)).
Despite strong empirical performance, LLM-based judges exhibit systematic biases, including positional bias (Wang et al., [2023a](https://arxiv.org/html/2605.15416#bib.bib325)), stylistic over-penalization (Wu and Aji, [2023](https://arxiv.org/html/2605.15416#bib.bib326)), and self-enhancement bias (Zheng et al., [2023](https://arxiv.org/html/2605.15416#bib.bib3)).
To improve efficiency and interpretability, recent work explores judge distillation (Kim et al., [2024](https://arxiv.org/html/2605.15416#bib.bib330); Zhu et al., [2023](https://arxiv.org/html/2605.15416#bib.bib331)), ensemble and debate-based evaluation (Verga et al., [2024](https://arxiv.org/html/2605.15416#bib.bib332); Chan et al., [2023](https://arxiv.org/html/2605.15416#bib.bib333)), and open-source alternatives such as PandaLM (Wang et al., [2023b](https://arxiv.org/html/2605.15416#bib.bib327)). More advanced systems assign roles or personas to multiple LLM judges to enable nuanced, multi-trait evaluation (Badshah and Sajjad, [2025](https://arxiv.org/html/2605.15416#bib.bib351); Cao et al., [2025](https://arxiv.org/html/2605.15416#bib.bib352); Wang et al., [2025](https://arxiv.org/html/2605.15416#bib.bib349)). However, most existing methods remain heuristic and lack formal reliability guarantees, with growing evidence of cognitive vulnerabilities (Zeng et al., [2023](https://arxiv.org/html/2605.15416#bib.bib334); Koo et al., [2023](https://arxiv.org/html/2605.15416#bib.bib335); Panickssery et al., [2024](https://arxiv.org/html/2605.15416#bib.bib336)).
Motivated by these limitations, recent research has begun integrating statistical guarantees into LLM evaluation and generation. Conformal prediction has been used to control hallucination rates (Yadkori et al., [2024](https://arxiv.org/html/2605.15416#bib.bib18); Mohri and Hashimoto, [2024](https://arxiv.org/html/2605.15416#bib.bib17)) and false discovery risks in high-stakes domains (Gui et al., [2024](https://arxiv.org/html/2605.15416#bib.bib337)), providing marginal risk control under minimal assumptions (Angelopoulos et al., [2022](https://arxiv.org/html/2605.15416#bib.bib338)). Complementary efforts improve truthfulness through fine-tuning (Kang et al., [2024](https://arxiv.org/html/2605.15416#bib.bib339); Tian et al., [2023](https://arxiv.org/html/2605.15416#bib.bib340)) or enable principled abstention when uncertainty is high (Zhang et al., [2024](https://arxiv.org/html/2605.15416#bib.bib343)). Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) develop an unsupervised confidence measure and an exact bound on conditional disagreement risk. In contrast, we mitigate the limitations of the monotonicity assumption in Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) by learning a parameterized confidence estimator and providing PAC-Bayesian generalization guarantees for its ranking error, which further motivates an adaptive margin-based training procedure.
## 7 Conclusion
This work revisits the reliability of LLM\-as\-a\-judge systems under human\-agreement guarantees and identifies a key limitation of prior hypothesis\-testing frameworks: they rely on an empirical monotonicity assumption on confidence that can fail out of sample\. We mitigate this by learning a parameterized confidence estimator with a margin\-based ranking objective, supported by PAC\-Bayesian generalization guarantees that expose a margin\-dependent loss–complexity trade\-off\. Across multiple datasets and judge models, our adaptive training improves confidence ranking and empirically reduces monotonicity violations, leading to higher success rates in meeting target agreement levels within cascaded selective evaluation\.
## Acknowledgment
This work was supported by the NVIDIA Academic Grant Program \(Exploiting Overthinking Attacks on GenAI\), the Royal Society Grant \(Ensuring Trustworthy AI: Robustness Certification for Large Language Models\) \[Reference RGS\\R2\\252444\], and the AIRR Gateway project \(Exploiting Robustness of Reasoning Efficiency in Agentic AI\)\.
## Impact Statement
This work advances the reliability of LLM\-as\-a\-judge systems by providing principled confidence estimation with theoretical generalization guarantees\. By reducing reliance on heuristic confidence signals and improving the robustness of human\-agreement guarantees, the proposed method supports safer and more trustworthy deployment of automated evaluation pipelines in research and practice\. Potential applications include model benchmarking, alignment evaluation, and selective human oversight\. As with all LLM\-based evaluation systems, care should be taken when applying these methods in high\-stakes settings, as residual biases in both models and data may persist\. We do not anticipate any direct negative societal impacts beyond those common to automated evaluation technologies\.
## References
- A\. N\. Angelopoulos, S\. Bates, A\. Fisch, L\. Lei, and T\. Schuster \(2022\)Conformal risk control\.arXiv preprint arXiv:2208\.02814\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- K\. Atwell, M\. Simons, and M\. Alikhani \(2025\)Measuring bias and agreement in large language model presupposition judgments\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 2096–2107\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1),[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- S\. Badshah and H\. Sajjad \(2025\)DAFE: llm\-based evaluation through dynamic arbitration for free\-form question\-answering\.arXiv preprint arXiv:2503\.08542\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan,et al\.\(2022\)Training a helpful and harmless assistant with reinforcement learning from human feedback\.arXiv preprint arXiv:2204\.05862\.Cited by:[§5\.1](https://arxiv.org/html/2605.15416#S5.SS1.p1.1),[§5](https://arxiv.org/html/2605.15416#S5.p2.1)\.
- Y\. Bai, J\. Ying, Y\. Cao, X\. Lv, Y\. He, X\. Wang, J\. Yu, K\. Zeng, Y\. Xiao, H\. Lyu,et al\.\(2023\)Benchmarking foundation models with language\-model\-as\-an\-examiner\.arXiv preprint arXiv:2306\.04181\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- A\. S\. Bandeira and M\. T\. Boedihardjo \(2021\)The spectral norm of gaussian matrices with correlated entries\.arXiv preprint arXiv:2104\.02662\.Cited by:[Appendix B](https://arxiv.org/html/2605.15416#A2.p5.3)\.
- S\. Bates, A\. Angelopoulos, L\. Lei, J\. Malik, and M\. Jordan \(2021\)Distribution\-free, risk\-controlling prediction sets\.JACM\.Cited by:[Appendix C](https://arxiv.org/html/2605.15416#A3.p1.3)\.
- P\. Bauer \(1991\)Multiple testing in clinical trials\.Statistics in medicine10\(6\),pp\. 871–890\.Cited by:[§2](https://arxiv.org/html/2605.15416#S2.p7.5)\.
- H\. Cao, I\. Driouich, R\. Singh, and E\. Thomas \(2025\)Multi\-agent llm judge: automatic personalized llm judge design for evaluating natural language generation applications\.arXiv preprint arXiv:2504\.02867\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- C\. Chan, W\. Chen, Y\. Su, J\. Yu, W\. Xue, S\. Zhang, J\. Fu, and Z\. Liu \(2023\)Chateval: towards better llm\-based evaluators through multi\-agent debate\.arXiv preprint arXiv:2308\.07201\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- Y\. Chen, R\. Wang, H\. Jiang, S\. Shi, and R\. Xu \(2023\)Exploring the use of large language models for reference\-free text quality evaluation: a preliminary empirical study\.arXiv preprint arXiv:2304\.00723\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- C\. Chiang and H\. Lee \(2023\)Can large language models be an alternative to human evaluations?\.arXiv preprint arXiv:2305\.01937\.Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p1.1),[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- E\. Clark, T\. August, S\. Serrano, N\. Haduong, S\. Gururangan, and N\. A\. Smith \(2021\)All that’s ’human’ is not gold: evaluating human evaluation of generated text\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- R\. Deng, D\. Feng, and W\. Lei \(2025\)AMaPO: adaptive margin\-attached preference optimization for language model alignment\.arXiv preprint arXiv:2511\.09385\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1)\.
- G\. Detommaso, M\. Bertran, R\. Fogliato, and A\. Roth \(2024\)Multicalibration for confidence scoring in llms\.arXiv preprint arXiv:2404\.04689\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1)\.
- Y\. Dubois, C\. X\. Li, R\. Taori, T\. Zhang, I\. Gulrajani, J\. Ba, C\. Guestrin, P\. S\. Liang, and T\. B\. Hashimoto \(2023\)Alpacafarm: a simulation framework for methods that learn from human feedback\.Advances in Neural Information Processing Systems36,pp\. 30039–30069\.Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p1.1),[§5](https://arxiv.org/html/2605.15416#S5.p2.1),[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- J\. Fu, S\. Ng, Z\. Jiang, and P\. Liu \(2023\)Gptscore: evaluate as you desire\.arXiv preprint arXiv:2302\.04166\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- M\. Gao, J\. Ruan, R\. Sun, X\. Yin, S\. Yang, and X\. Wan \(2023\)Human\-like summarization evaluation with chatgpt\.arXiv preprint arXiv:2304\.02554\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- M\. Grizzard, R\. Frazer, A\. Luttrell, C\. K\. Monge, N\. L\. Matthews, C\. J\. Francemone, and M\. E\. Frazer \(2025\)ChatGPT does not replicate human moral judgments: the importance of examining metrics beyond correlation to assess agreement\.Scientific Reports15\(1\),pp\. 40965\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1)\.
- Y\. Gui, Y\. Jin, and Z\. Ren \(2024\)Conformal alignment: knowing when to trust foundation models with guarantees\.arXiv preprint arXiv:2405\.10301\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- G\. Jin, S\. Wu, J\. Liu, T\. Huang, and R\. Mu \(2025\)Enhancing robust fairness via confusional spectral regularization\.InInternational Conference on Learning Representations,Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- G\. Jin, X\. Yi, W\. Huang, S\. Schewe, and X\. Huang \(2022a\)Enhancing adversarial training with second\-order statistics of weights\.InCVPR,Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- G\. Jin, X\. Yi, P\. Yang, L\. Zhang, S\. Schewe, and X\. Huang \(2022b\)Weight expansion: a new perspective on dropout and generalization\.arXiv preprint arXiv:2201\.09209\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- G\. Jin, X\. Yi, L\. Zhang, L\. Zhang, S\. Schewe, and X\. Huang \(2020\)How does weight correlation affect generalisation ability of deep neural networks?\.Advances in Neural Information Processing Systems33\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- J\. Jung, F\. Brahman, and Y\. Choi \(2025\)Trust or escalate: llm judges with provable guarantees for human agreement\.ICLR\.Cited by:[Appendix C](https://arxiv.org/html/2605.15416#A3.p1.3),[Figure 1](https://arxiv.org/html/2605.15416#S1.F1),[Figure 1](https://arxiv.org/html/2605.15416#S1.F1.3.2),[§1](https://arxiv.org/html/2605.15416#S1.p2.1),[Theorem 2\.1](https://arxiv.org/html/2605.15416#S2.Thmtheorem1),[§2](https://arxiv.org/html/2605.15416#S2.p3.4),[§2](https://arxiv.org/html/2605.15416#S2.p6.2),[§2](https://arxiv.org/html/2605.15416#S2.p7.5),[§3\.1](https://arxiv.org/html/2605.15416#S3.SS1.p1.2),[§3\.1](https://arxiv.org/html/2605.15416#S3.SS1.p3.6),[§3\.1](https://arxiv.org/html/2605.15416#S3.SS1.p6.3),[§5\.1](https://arxiv.org/html/2605.15416#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.15416#S5.SS1.p2.10),[§5\.2](https://arxiv.org/html/2605.15416#S5.SS2.p1.4),[§5](https://arxiv.org/html/2605.15416#S5.p3.1),[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- K\. Kang, E\. Wallace, C\. Tomlin, A\. Kumar, and S\. Levine \(2024\)Unfamiliar finetuning examples control how language models hallucinate\.arXiv preprint arXiv:2403\.05612\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- R\. Khanmohammadi, E\. Miahi, M\. Mardikoraem, S\. Kaur, I\. Brugere, C\. Smiley, K\. S\. Thind, and M\. M\. Ghassemi \(2025\)Calibrating llm confidence by probing perturbed representation stability\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 10459–10525\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1)\.
- S\. Kim, J\. Suk, S\. Longpre, B\. Y\. Lin, J\. Shin, S\. Welleck, G\. Neubig, M\. Lee, K\. Lee, and M\. Seo \(2024\)Prometheus 2: an open source language model specialized in evaluating other language models\.arXiv preprint arXiv:2405\.01535\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- R\. Koo, M\. Lee, V\. Raheja, J\. I\. Park, Z\. M\. Kim, and D\. Kang \(2023\)Benchmarking cognitive biases in large language models as evaluators\.arXiv preprint arXiv:2309\.17012\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- Y\. Lee, J\. Kim, J\. Kim, H\. Cho, and P\. Kang \(2024\)Checkeval: robust evaluation framework using large language model via checklist\.CoRR\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- X\. Li, T\. Zhang, Y\. Dubois, R\. Taori, I\. Gulrajani, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\)Alpacaeval: an automatic evaluator of instruction\-following models\.Cited by:[Figure 1](https://arxiv.org/html/2605.15416#S1.F1),[Figure 1](https://arxiv.org/html/2605.15416#S1.F1.3.2)\.
- Y\. Lin and Y\. Chen \(2023\)Llm\-eval: unified multi\-dimensional automatic evaluation for open\-domain conversations with large language models\.arXiv preprint arXiv:2305\.13711\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\)Gpteval: nlg evaluation using gpt\-4 with better human alignment\.arXiv preprint arXiv:2303\.16634\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- X\. Lyu, Y\. Wang, H\. Hajishirzi, and P\. Dasigi \(2024\)Href: human response\-guided evaluation of instruction following in language models\.arXiv preprint arXiv:2412\.15524\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- D\. A\. McAllester \(1999\)PAC\-bayesian model averaging\.InCOLT,Cited by:[§3\.1](https://arxiv.org/html/2605.15416#S3.SS1.p10.7)\.
- D\. McAllester \(2003\)Simplified pac\-bayesian margin bounds\.InLearning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24\-27, 2003\. Proceedings,pp\. 203–215\.Cited by:[§3\.2](https://arxiv.org/html/2605.15416#S3.SS2.1.p1.1)\.
- S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. W\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi \(2023\)Factscore: fine\-grained atomic evaluation of factual precision in long form text generation\.arXiv preprint arXiv:2305\.14251\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- C\. Mohri and T\. Hashimoto \(2024\)Language models with conformal factuality guarantees\.arXiv preprint arXiv:2402\.10978\.Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p2.1),[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- B\. Neyshabur, S\. Bhojanapalli, and N\. Srebro \(2017\)A pac\-bayesian approach to spectrally\-normalized margin bounds for neural networks\.arXiv preprint arXiv:1707\.09564\.Cited by:[Lemma B\.1](https://arxiv.org/html/2605.15416#A2.Thmtheorem1),[Appendix B](https://arxiv.org/html/2605.15416#A2.p1.2),[Appendix B](https://arxiv.org/html/2605.15416#A2.p8.12),[§3\.1](https://arxiv.org/html/2605.15416#S3.SS1.p10.7),[§3\.2](https://arxiv.org/html/2605.15416#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2605.15416#S3.SS2.p3.1),[§3\.2](https://arxiv.org/html/2605.15416#S3.SS2.p5.4),[Remark 3\.1](https://arxiv.org/html/2605.15416#S3.Thmtheorem1.p1.1)\.
- A\. Panickssery, S\. Bowman, and S\. Feng \(2024\)Llm evaluators recognize and favor their own generations\.NeurIPS\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- S\. Park, M\. Zecchin, and O\. Simeone \(2025\)Adaptive prediction\-powered autoeval with reliability and efficiency guarantees\.arXiv preprint arXiv:2505\.18659\.Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p1.1)\.
- F\. M\. Polo, X\. Wang, M\. Yurochkin, G\. Xu, M\. Banerjee, and Y\. Sun \(2025\)Bridging human and llm judgments: understanding and narrowing the gap\.arXiv preprint arXiv:2508\.12792\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1)\.
- N\. Stiennon, L\. Ouyang, J\. Wu, D\. Ziegler, R\. Lowe, C\. Voss, A\. Radford, D\. Amodei, and P\. F\. Christiano \(2020\)Learning to summarize with human feedback\.NeurIPS\.Cited by:[§5](https://arxiv.org/html/2605.15416#S5.p2.1)\.
- T\. Tang, H\. Lu, Y\. E\. Jiang, H\. Huang, D\. Zhang, W\. X\. Zhao, and F\. Wei \(2023\)Not all metrics are guilty: improving nlg evaluation with llm paraphrasing\.arXiv preprint arXiv:2305\.15067\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- K\. Tian, E\. Mitchell, H\. Yao, C\. D\. Manning, and C\. Finn \(2023\)Fine\-tuning language models for factuality\.InICLR,Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- P\. Verga, S\. Hofstatter, S\. Althammer, Y\. Su, A\. Piktus, A\. Arkhangorodsky, M\. Xu, N\. White, and P\. Lewis \(2024\)Replacing judges with juries: evaluating llm generations with a panel of diverse models\.arXiv preprint arXiv:2404\.18796\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- P\. Wang, L\. Li, L\. Chen, D\. Zhu, B\. Lin, Y\. Cao, Q\. Liu, T\. Liu, and Z\. Sui \(2023a\)Large language models are not fair evaluators\.arXiv preprint arXiv:2305\.17926\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- Y\. Wang, Z\. Yu, Z\. Zeng, L\. Yang, C\. Wang, H\. Chen, C\. Jiang, R\. Xie, J\. Wang, X\. Xie,et al\.\(2023b\)Pandalm: an automatic evaluation benchmark for llm instruction tuning optimization\.arXiv preprint arXiv:2306\.05087\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- Z\. Wang, X\. Zhou, Y\. Yang, B\. Ma, L\. Wang, R\. Dong, and A\. Anwar \(2025\)OpenForecast: a large\-scale open\-ended event forecasting dataset\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 5273–5294\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- M\. Wu and A\. F\. Aji \(2023\)Style over substance: evaluation biases for large language models\.arXiv preprint arXiv:2307\.03025\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- M\. Xiong, Z\. Hu, X\. Lu, Y\. Li, J\. Fu, J\. He, and B\. Hooi \(2024\)Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p1.1)\.
- Y\. A\. Yadkori, I\. Kuzborskij, D\. Stutz, A\. György, A\. Fisch, A\. Doucet, I\. Beloshapka, W\. Weng, Y\. Yang, C\. Szepesvári,et al\.\(2024\)Mitigating llm hallucinations via conformal abstention\.arXiv preprint arXiv:2405\.01563\.Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p2.1),[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- X\. Yi, G\. Jin, X\. Huang, and S\. Jin \(2026\)Towards a unified pac\-bayesian framework for norm\-based generalization bounds\.arXiv preprint arXiv:2601\.08100\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- Z\. Zeng, J\. Yu, T\. Gao, Y\. Meng, T\. Goyal, and D\. Chen \(2023\)Evaluating large language models at evaluating instruction following\.arXiv preprint arXiv:2310\.07641\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- Y\. Zha, Y\. Yang, R\. Li, and Z\. Hu \(2023\)Alignscore: evaluating factual consistency with a unified alignment function\.arXiv preprint arXiv:2305\.16739\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- H\. Zhang, S\. Diao, Y\. Lin, Y\. Fung, Q\. Lian, X\. Wang, Y\. Chen, H\. Ji, and T\. Zhang \(2024\)R\-tuning: instructing large language models to say ‘i don’t know’\.InNAACL,Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p3.1)\.
- X\. Zhang, T\. Gao, S\. Jin, T\. Wang, T\. Ye, E\. Adar, and Q\. Mei \(2025\)Through the judge’s eyes: inferred thinking traces improve reliability of llm raters\.arXiv preprint arXiv:2510\.25860\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§1](https://arxiv.org/html/2605.15416#S1.p1.1),[§5](https://arxiv.org/html/2605.15416#S5.p2.1),[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
- W\. Zhou, R\. Agrawal, S\. Zhang, S\. R\. Indurthi, S\. Zhao, K\. Song, S\. Xu, and C\. Zhu \(2024\)Wpo: enhancing rlhf with weighted preference optimization\.arXiv preprint arXiv:2406\.11827\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p1.1)\.
- Z\. Zhou, T\. Jin, J\. Shi, and Q\. Li \(2025\)SteerConf: steering llms for confidence elicitation\.arXiv preprint arXiv:2503\.02863\.Cited by:[Appendix E](https://arxiv.org/html/2605.15416#A5.p2.1)\.
- L\. Zhu, X\. Wang, and X\. Wang \(2023\)Judgelm: fine\-tuned large language models are scalable judges\.arXiv preprint arXiv:2310\.17631\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p2.1)\.
- T\. Y\. Zhuo \(2023\)Large language models are state\-of\-the\-art evaluators of code generation\.arXiv preprint arXiv:2304\.14317\.Cited by:[§6](https://arxiv.org/html/2605.15416#S6.p1.1)\.
## Appendix A Proof for Corollary [3.3](https://arxiv.org/html/2605.15416#S3.Thmtheorem3)
Let $S_{\mathbf{u}}$ be the set of perturbations with the following property:

$$S_{\mathbf{u}}\subseteq\left\{\mathbf{u}\ \Big|\ \max_{s}\left|C_{\theta+\mathbf{u}}(s)-C_{\theta}(s)\right|<\frac{\gamma}{4}\right\}. \tag{15}$$

Let $q$ be the probability density function over $\mathbf{u}$. We construct a new distribution $\tilde{Q}$ over $\tilde{\mathbf{u}}$ that is restricted to $S_{\mathbf{u}}$ with the probability density function:

$$\tilde{q}(\tilde{\mathbf{u}})=\begin{cases}\frac{1}{z}q(\tilde{\mathbf{u}}) & \tilde{\mathbf{u}}\in S_{\mathbf{u}},\\ 0 & \text{otherwise},\end{cases} \tag{16}$$

where $z$ is a normalizing constant and by the lemma assumption $z=\mathbb{P}(\tilde{\mathbf{u}}\in S_{\mathbf{u}})\geq\frac{1}{2}$. By the definition of $\tilde{Q}$, we have:

$$\max_{s}\left|C_{\theta+\tilde{\mathbf{u}}}(s)-C_{\theta}(s)\right|<\frac{\gamma}{4}. \tag{17}$$

Therefore, with probability at least $1-\delta$ over training dataset $S_{\mathrm{pair}}$, we have:

$$
\begin{aligned}
\mathcal{RK}(\theta) &\leq \mathbb{E}_{\tilde{\mathbf{u}}\sim\tilde{Q}}\,\mathcal{RK}_{\frac{\gamma}{2}}(\theta+\tilde{\mathbf{u}}) && \text{(18)}\\
&\leq \mathbb{E}_{\tilde{\mathbf{u}}\sim\tilde{Q}}\big[\widehat{\mathcal{RK}}_{\frac{\gamma}{2}}(\theta+\tilde{\mathbf{u}})\big]+\sqrt{\frac{\mathrm{KL}(\theta+\tilde{\mathbf{u}}\parallel P)+\ln\frac{m_{p}}{\delta'}}{2(m_{p}-1)}} && \text{(19)}\\
&\leq \widehat{\mathcal{RK}}_{\gamma}(\theta)+\sqrt{\frac{\mathrm{KL}(\theta+\tilde{\mathbf{u}}\parallel P)+\ln\frac{m_{p}}{\delta'}}{2(m_{p}-1)}} && \text{(20)}\\
&\leq \widehat{\mathcal{RK}}_{\gamma}(\theta)+\sqrt{\frac{\mathrm{KL}(\theta+\mathbf{u}\parallel P)+\ln\frac{3m_{p}}{\delta'}}{m_{p}-1}} && \text{(21)}
\end{aligned}
$$

Hence, proved. $\square$
###### Proof for \([18](https://arxiv.org/html/2605.15416#A1.E18)\)\.
Given ([15](https://arxiv.org/html/2605.15416#A1.E15)) and ([16](https://arxiv.org/html/2605.15416#A1.E16)), for all $\tilde{\mathbf{u}}\in\tilde{Q}$, we have

$$\max_{s}\left|C_{\theta+\tilde{\mathbf{u}}}(s)-C_{\theta}(s)\right| < \frac{\gamma}{4}. \tag{22}$$

For all ordered pairs $((x_i,a(x_i)),(x_j,a(x_j)))$ s.t. $a(x_i)>a(x_j)$ and $C_{\theta}(s_i)<C_{\theta}(s_j)$, we have

$$C_{\theta+\tilde{\mathbf{u}}}(s_i) < C_{\theta+\tilde{\mathbf{u}}}(s_j)+\frac{\gamma}{2}. \tag{23}$$

Thus we have

$$\mathcal{RK}(\theta) \leq \mathbb{E}_{\tilde{\mathbf{u}}\sim\tilde{Q}}\,\mathcal{RK}_{\frac{\gamma}{2}}(\theta+\tilde{\mathbf{u}}). \tag{24}$$

∎
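The step from (22) to (23) is not written out in the proof. A plausible reconstruction, assuming only the perturbation bound (22) applied once to $s_i$ and once to $s_j$ together with the misranking condition $C_{\theta}(s_i)<C_{\theta}(s_j)$, is:

$$\begin{aligned}
C_{\theta+\tilde{\mathbf{u}}}(s_i) &< C_{\theta}(s_i)+\frac{\gamma}{4} && \text{by (22) at } s_i\\
&< C_{\theta}(s_j)+\frac{\gamma}{4} && \text{since } C_{\theta}(s_i)<C_{\theta}(s_j)\\
&< C_{\theta+\tilde{\mathbf{u}}}(s_j)+\frac{\gamma}{4}+\frac{\gamma}{4} && \text{by (22) at } s_j\\
&= C_{\theta+\tilde{\mathbf{u}}}(s_j)+\frac{\gamma}{2}.
\end{aligned}$$

Every pair misranked by $C_{\theta}$ is therefore counted by the $\frac{\gamma}{2}$-margin loss of each perturbed estimator, which is what (24) states after taking the expectation over $\tilde{Q}$.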
###### Proof for \([19](https://arxiv.org/html/2605.15416#A1.E19)\)\.
Apply [Theorem 3.2](https://arxiv.org/html/2605.15416#S3.Thmtheorem2). ∎
###### Proof for \([20](https://arxiv.org/html/2605.15416#A1.E20)\)\.
For all ordered pairs $((x_i,a(x_i)),(x_j,a(x_j)))$ s.t. $a(x_i)>a(x_j)$, if there exists $\tilde{\mathbf{u}}\in\tilde{Q}$ s.t. $C_{\theta+\tilde{\mathbf{u}}}(s_i)<C_{\theta+\tilde{\mathbf{u}}}(s_j)+\frac{\gamma}{2}$, we have

$$C_{\theta}(s_i) < C_{\theta}(s_j)+\gamma. \tag{25}$$

Thus we have

$$\mathbb{E}_{\tilde{\mathbf{u}}\sim\tilde{Q}}\big[\widehat{\mathcal{RK}}_{\frac{\gamma}{2}}(\theta+\tilde{\mathbf{u}})\big] \leq \widehat{\mathcal{RK}}_{\gamma}(\theta). \tag{26}$$

∎
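The inequality (25) follows from the same perturbation bound used in the opposite direction. As a sketch, assuming $\max_{s}|C_{\theta+\tilde{\mathbf{u}}}(s)-C_{\theta}(s)|<\frac{\gamma}{4}$ for every $\tilde{\mathbf{u}}$ in the support of $\tilde{Q}$:

$$\begin{aligned}
C_{\theta}(s_i) &< C_{\theta+\tilde{\mathbf{u}}}(s_i)+\frac{\gamma}{4}\\
&< C_{\theta+\tilde{\mathbf{u}}}(s_j)+\frac{\gamma}{2}+\frac{\gamma}{4}\\
&< C_{\theta}(s_j)+\frac{\gamma}{4}+\frac{\gamma}{2}+\frac{\gamma}{4}
= C_{\theta}(s_j)+\gamma.
\end{aligned}$$

So any pair that violates the $\frac{\gamma}{2}$ margin under some perturbed estimator also violates the $\gamma$ margin under $C_{\theta}$, and the empirical $\gamma$-margin loss of $C_{\theta}$ upper-bounds the averaged $\frac{\gamma}{2}$-margin loss.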
###### Proof for \([21](https://arxiv.org/html/2605.15416#A1.E21)\)\.
Given $q$, $\tilde{q}$, $z$, and $S_{\mathbf{u}}$ in ([15](https://arxiv.org/html/2605.15416#A1.E15)), let $S_{\mathbf{u}}^{c}$ denote the complement set of $S_{\mathbf{u}}$ and $\tilde{q}^{c}$ denote the normalized density function restricted to $S_{\mathbf{u}}^{c}$. Then, we have

$$\mathrm{KL}(q\|p) = z\,\mathrm{KL}(\tilde{q}\|p) + (1-z)\,\mathrm{KL}(\tilde{q}^{c}\|p) - H(z), \tag{27}$$

where $H(z)=-z\ln z-(1-z)\ln(1-z)\leq 1$ is the binary entropy function. Since $\mathrm{KL}$ is always nonnegative and $z\geq\frac{1}{2}$, we get

$$\mathrm{KL}(\tilde{q}\|p) = \frac{1}{z}\left[\mathrm{KL}(q\|p)+H(z)-(1-z)\,\mathrm{KL}(\tilde{q}^{c}\|p)\right] \leq 2\left(\mathrm{KL}(q\|p)+1\right). \tag{28}$$

Thus we have $2\left(\mathrm{KL}(\theta+\mathbf{u}\|P)+\ln\frac{3m_p}{\delta'}\right) \geq \mathrm{KL}(\theta+\tilde{\mathbf{u}}\|P)+\ln\frac{m_p}{\delta'}$. ∎
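The decomposition (27) and the bound (28) can be sanity-checked numerically on a small discrete example. The sketch below is illustrative only: the distributions `q` and `p`, the set membership `in_set`, and the helper names are assumptions chosen for the check, not quantities from the paper.

```python
import math

def kl(a, b):
    """KL(a || b) in nats for discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def restricted_kl_terms(q, p, in_set):
    """Return z, KL(q~ || p), KL(q~^c || p), and H(z) for q split by a set."""
    z = sum(qi for qi, keep in zip(q, in_set) if keep)
    q_in = [qi / z if keep else 0.0 for qi, keep in zip(q, in_set)]
    q_out = [0.0 if keep else qi / (1 - z) for qi, keep in zip(q, in_set)]
    entropy = -z * math.log(z) - (1 - z) * math.log(1 - z)
    return z, kl(q_in, p), kl(q_out, p), entropy

# Hypothetical posterior q, prior p, and a set S_u carrying mass z = 0.7 >= 1/2.
q = [0.4, 0.3, 0.2, 0.1]
p = [0.25, 0.25, 0.25, 0.25]
in_set = [True, True, False, False]

z, kl_in, kl_out, entropy = restricted_kl_terms(q, p, in_set)
lhs = kl(q, p)                                   # KL(q || p)
rhs = z * kl_in + (1 - z) * kl_out - entropy     # right-hand side of (27)
print(lhs, rhs, kl_in, 2 * (lhs + 1))
```

On this example the two sides of (27) agree up to floating-point error, and the restricted KL stays below $2(\mathrm{KL}(q\|p)+1)$, as (28) requires.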
## Appendix B Proof for Corollary [3.4](https://arxiv.org/html/2605.15416#S3.Thmtheorem4)
Following Neyshabur et al. ([2017](https://arxiv.org/html/2605.15416#bib.bib15)), we prove Corollary [3.4](https://arxiv.org/html/2605.15416#S3.Thmtheorem4) in two steps. First, we compute the maximum allowable perturbation $\mathbf{u}$ that satisfies the given condition on the margin $\gamma$. Second, we compute the KL term in the bound using the perturbation obtained in the first step. This computation is essential in deriving the PAC-Bayesian bound.
Consider a neural network with parameters $\theta$ that can be regularized by dividing each weight matrix $W_l$ by its spectral norm $\|W_l\|_2$. Let $\beta$ be the geometric mean of the spectral norms of all weight matrices, defined as:

$$\beta = \left(\prod_{l=1}^{n}\|W_l\|_2\right)^{\frac{1}{n}},$$

where $n$ is the number of weight matrices in the network. We introduce a modified version of the weights, denoted as $\widetilde{W}_l$, which is obtained by scaling the original weights $W_l$ by a factor of $\frac{\beta}{\|W_l\|_2}$:

$$\widetilde{W}_l = \frac{\beta}{\|W_l\|_2}\,W_l.$$

Due to the homogeneity property of the ReLU activation function, the behavior of the network with the modified weights, denoted as $f_{\widetilde{W}}$, is identical to that of the original network $C_{\theta}$.
Furthermore, we observe that the product of the spectral norms of the original weights, given by $\prod_{l=1}^{n}\|W_l\|_2$, is equal to the product of the spectral norms of the modified weights, expressed as $\prod_{l=1}^{n}\|\widetilde{W}_l\|_2$. Moreover, the ratio of the Frobenius norm to the spectral norm remains unchanged for both the original and modified weights:

$$\frac{\|W_l\|_F}{\|W_l\|_2} = \frac{\|\widetilde{W}_l\|_F}{\|\widetilde{W}_l\|_2}.$$

As a result, the excess error mentioned in the theorem statement remains unaffected by this weight normalization. Therefore, it is sufficient to prove the theorem only for the normalized weights $\widetilde{W}$. Without loss of generality, we assume that the spectral norm of each weight matrix is equal to $\beta$, i.e., $\|W_l\|_2=\beta$ for any layer $l$.
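The rescaling argument can be checked numerically. The following is a minimal sketch, assuming a bias-free ReLU network with a linear last layer and randomly drawn weights; `relu_net`, `rescale_to_geometric_mean`, and the layer shapes are hypothetical choices for illustration.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free feedforward ReLU network; the last layer is linear."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def rescale_to_geometric_mean(weights):
    """Rescale every W_l so that its spectral norm equals beta, the
    geometric mean of the spectral norms of all layers."""
    norms = [np.linalg.norm(W, 2) for W in weights]
    beta = float(np.exp(np.mean(np.log(norms))))
    return [beta / s * W for W, s in zip(weights, norms)], beta

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
x = rng.normal(size=5)

scaled, beta = rescale_to_geometric_mean(weights)
print(relu_net(weights, x), relu_net(scaled, x))   # outputs should coincide
print([np.linalg.norm(W, 2) for W in scaled])      # each should equal beta
```

The per-layer scale factors multiply to one, so positive homogeneity of ReLU leaves the output unchanged, while the Frobenius-to-spectral ratio of each layer is invariant to scalar multiplication.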
In our approach, we initially set the prior distribution $P$ as a Gaussian distribution with zero mean and a diagonal covariance matrix $\sigma^{2}I$. We incorporate random perturbations $\mathbf{u}\sim\mathcal{N}(0,\sigma^{2}I)$, where the value of $\sigma$ will be determined in relation to $\beta$ at a later stage. Since the prior must be independent of the learned predictor $W$ and its norm, we choose $\sigma$ according to an estimated value $\tilde{\beta}$. We calculate the PAC-Bayesian bound for each $\tilde{\beta}$ selected from a pre-determined grid, offering a generalization guarantee for all $W$ satisfying $|\beta-\tilde{\beta}|\leq\frac{1}{n}\beta$. This ensures that each relevant $\beta$ value is covered by some $\tilde{\beta}$ in the grid. Subsequently, we apply a union bound across all $\tilde{\beta}$ defined by the grid. For now, we will consider a set of $\tilde{\beta}$ and the corresponding $W$ that meet the condition $|\beta-\tilde{\beta}|\leq\frac{1}{n}\beta$, which implies:

$$\frac{1}{e}\beta^{n-1} \leq \tilde{\beta}^{n-1} \leq e\,\beta^{n-1}.$$
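The implication is stated without derivation. One way to see it, as a sketch under the stated condition $|\beta-\tilde{\beta}|\leq\frac{1}{n}\beta$ with $n\geq 2$, is:

$$\begin{aligned}
\left(1-\tfrac{1}{n}\right)\beta \;\leq\; \tilde{\beta} \;\leq\; \left(1+\tfrac{1}{n}\right)\beta
\quad&\Longrightarrow\quad
\left(1-\tfrac{1}{n}\right)^{n-1}\beta^{n-1} \;\leq\; \tilde{\beta}^{n-1} \;\leq\; \left(1+\tfrac{1}{n}\right)^{n-1}\beta^{n-1},\\
\left(1+\tfrac{1}{n}\right)^{n-1} \leq \left(1+\tfrac{1}{n}\right)^{n} \leq e,
\qquad&
\left(1-\tfrac{1}{n}\right)^{n-1} = \frac{1}{\left(1+\frac{1}{n-1}\right)^{n-1}} \geq \frac{1}{e}.
\end{aligned}$$

Both estimates use the standard fact that $\left(1+\frac{1}{k}\right)^{k}\leq e$ for every positive integer $k$.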
According to (Bandeira and Boedihardjo, [2021](https://arxiv.org/html/2605.15416#bib.bib266)) and the fact that $\mathbf{u}\sim\mathcal{N}(0,\sigma^{2}I)$, we can obtain the following bound for the spectral norm of the perturbation matrix $\mathbf{U}_l$ ($\mathbf{u}_l=\text{vec}(\mathbf{U}_l)$):

$$\mathbb{P}_{\mathbf{u}_l\sim\mathcal{N}(0,\sigma^{2}I)}\left[\|\mathbf{U}_l\|_2 > t\right] \leq 2h\exp\left(-\frac{t^{2}}{2h\sigma^{2}}\right), \tag{29}$$

where $h$ is the width of the hidden layers. By taking a union bound over the layers, we can establish that, with a probability of at least $\frac{1}{2}$, the spectral norm of the perturbation $\mathbf{U}_l$ in each layer is bounded by $\sigma\sqrt{2h\ln(4nh)}$.
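The threshold $\sigma\sqrt{2h\ln(4nh)}$ can be recovered directly from (29). As a worked step, setting $t=\sigma\sqrt{2h\ln(4nh)}$ gives

$$2h\exp\left(-\frac{t^{2}}{2h\sigma^{2}}\right) = 2h\exp\big(-\ln(4nh)\big) = \frac{2h}{4nh} = \frac{1}{2n},$$

so a union bound over the $n$ layers yields

$$\mathbb{P}\left[\exists\, l:\ \|\mathbf{U}_l\|_2 > \sigma\sqrt{2h\ln(4nh)}\right] \leq n\cdot\frac{1}{2n} = \frac{1}{2}.$$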
Plugging the bound into Lemma [B.1](https://arxiv.org/html/2605.15416#A2.Thmtheorem1), we have that

$$\begin{aligned}
\max_{s}\left|C_{\theta+\mathbf{u}}(s)-C_{\theta}(s)\right| &\leq e\,\beta^{n}\sum_{l}\frac{\|\mathbf{U}_l\|_2}{\beta}\\
&= e\,\beta^{n-1}\sum_{l}\|\mathbf{U}_l\|_2\\
&\leq e^{2}\,n\,\tilde{\beta}^{n-1}\sigma\sqrt{2h\ln(4nh)} \leq \frac{\gamma}{4}.
\end{aligned} \tag{30}$$

To make ([30](https://arxiv.org/html/2605.15416#A2.E30)) hold, given $\tilde{\beta}^{n-1}\leq e\,\beta^{n-1}$, we can choose the largest $\sigma$ as

$$\sigma = \frac{\gamma}{114\,n\sqrt{h\ln(4nh)}\,\prod_{l=1}^{n}\|W_l\|_2^{\frac{n-1}{n}}}. \tag{31}$$
Hence, the perturbation $\mathbf{u}$ with the above value of $\sigma$ satisfies the assumptions of Corollary [3.3](https://arxiv.org/html/2605.15416#S3.Thmtheorem3). We now compute the KL term using the selected distributions for $P$ and $Q$, considering the given value of $\sigma$,

$$\begin{aligned}
\mathrm{KL}(\theta+\mathbf{u}\,\|\,P) &\leq \frac{\sum_{l=1}^{n}\|W_l\|_F^{2}}{2\sigma^{2}}\\
&\leq \mathcal{O}\left(B^{2}n^{2}h\ln(nh)\,\frac{\prod_{l=1}^{n}\|W_l\|_2^{2}}{\gamma^{2}}\sum_{l=1}^{n}\frac{\|W_l\|_F^{2}}{\|W_l\|_2^{2}}\right).
\end{aligned} \tag{32}$$

Then, we can give a union bound over different choices of $\tilde{\beta}$. We only need to form the bound for $\left(\frac{\gamma}{2B}\right)^{\frac{1}{n}}\leq\beta\leq\left(\frac{\gamma\sqrt{m}}{2B}\right)^{\frac{1}{n}}$, which can be covered using a cover of size $nm^{\frac{1}{2n}}$ as discussed in Neyshabur et al. ([2017](https://arxiv.org/html/2605.15416#bib.bib15)). Thus, with probability $\geq 1-\delta'$, for any $\tilde{\beta}$ and for all $\mathbf{w}$ such that $|\beta-\tilde{\beta}|\leq\frac{1}{n}\beta$, we have:

$$\mathcal{RK}(\theta) \leq \widehat{\mathcal{RK}}_{\gamma}(\theta) + \mathcal{O}\left(\sqrt{\frac{\Phi(C_{\theta})+\ln\frac{3m_p}{\delta'}}{\gamma^{2}(m_p-1)}}\right), \tag{33}$$

where $\Phi(C_{\theta}) = n^{2}h\ln(nh)\prod_{l=1}^{n}\|W_l\|_2^{2}\sum_{l=1}^{n}\frac{\|W_l\|_F^{2}}{\|W_l\|_2^{2}}$.
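To make the quantities in (31) and (33) concrete, the sketch below computes $\sigma$, the complexity term $\Phi(C_{\theta})$, and the square-root term of (33) from a list of weight matrices. It is illustrative only: the function names are hypothetical, and the constant hidden inside $\mathcal{O}(\cdot)$ is taken to be 1, which is an assumption rather than a value from the paper.

```python
import math
import numpy as np

def sigma_from_margin(weights, gamma, width):
    """Perturbation scale sigma from Eq. (31)."""
    n = len(weights)
    prod_term = np.prod([np.linalg.norm(W, 2) ** ((n - 1) / n) for W in weights])
    return gamma / (114 * n * math.sqrt(width * math.log(4 * n * width)) * prod_term)

def complexity_phi(weights, width):
    """Phi(C_theta) as defined below Eq. (33)."""
    n = len(weights)
    spec_sq = [np.linalg.norm(W, 2) ** 2 for W in weights]
    frob_sq = [np.linalg.norm(W, 'fro') ** 2 for W in weights]
    ratio_sum = sum(f / s for f, s in zip(frob_sq, spec_sq))
    return n ** 2 * width * math.log(n * width) * np.prod(spec_sq) * ratio_sum

def generalization_gap_term(weights, width, gamma, m_p, delta):
    """Square-root term of Eq. (33), with the O(.) constant assumed to be 1."""
    phi = complexity_phi(weights, width)
    return math.sqrt((phi + math.log(3 * m_p / delta)) / (gamma ** 2 * (m_p - 1)))

# Hypothetical two-layer network of width 4 with identity weights.
weights = [np.eye(4), np.eye(4)]
print(sigma_from_margin(weights, gamma=1.0, width=4))
print(complexity_phi(weights, width=4))
print(generalization_gap_term(weights, width=4, gamma=1.0, m_p=1000, delta=0.05))
```

The term shrinks as the margin $\gamma$ or the number of pairs $m_p$ grows, which reflects the margin-dependent trade-off in the bound: a larger margin tightens this term but can increase the empirical margin loss $\widehat{\mathcal{RK}}_{\gamma}(\theta)$.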
###### Lemma B\.1\(Neyshaburet al\.\([2017](https://arxiv.org/html/2605.15416#bib.bib15)\)\)\.
For any $n>0$, let $C_{\theta}$ be an $n$-layer feedforward network with ReLU activation function. Then for any $\theta$, $s$, and any perturbation $\mathbf{u}=\text{vec}(\{\mathbf{U}_l\}_{l=1}^{n})$ such that $\|\mathbf{U}_l\|_2\leq\frac{1}{n}\|W_l\|_2$, the change in the output of the network can be bounded as follows:

$$\left|C_{\theta+\mathbf{u}}(s)-C_{\theta}(s)\right| \leq e\left(\prod_{l=1}^{n}\|W_l\|_2\right)\sum_{l=1}^{n}\frac{\|\mathbf{U}_l\|_2}{\|W_l\|_2}. \tag{34}$$
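The lemma can be checked empirically on a small random network. This is a minimal sketch under explicit assumptions: the network is bias-free, the input is normalized to unit Euclidean norm (the bound as stated carries no input-norm factor), and each perturbation is scaled to $\|\mathbf{U}_l\|_2=\frac{1}{2n}\|W_l\|_2$ so that the lemma's condition holds. All names are hypothetical.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free feedforward ReLU network; the last layer is linear."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def lemma_bound(weights, perturbations):
    """Right-hand side of Eq. (34)."""
    norms = [np.linalg.norm(W, 2) for W in weights]
    ratio_sum = sum(np.linalg.norm(U, 2) / s for U, s in zip(perturbations, norms))
    return float(np.e * np.prod(norms) * ratio_sum)

rng = np.random.default_rng(1)
weights = [rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
n = len(weights)

perturbations = []
for W in weights:
    U = rng.normal(size=W.shape)
    # Scale U so that ||U||_2 = ||W||_2 / (2n), inside the lemma's condition.
    U *= np.linalg.norm(W, 2) / (2 * n * np.linalg.norm(U, 2))
    perturbations.append(U)

x = rng.normal(size=5)
x /= np.linalg.norm(x)  # unit-norm input (assumption)

perturbed = [W + U for W, U in zip(weights, perturbations)]
change = float(np.linalg.norm(relu_net(perturbed, x) - relu_net(weights, x)))
bound = lemma_bound(weights, perturbations)
print(change, bound)
```

A single random draw does not prove the lemma, but it shows how the two sides of (34) are computed and that the observed change stays under the bound in this setting.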
## Appendix C Proof for [Theorem 2.1](https://arxiv.org/html/2605.15416#S2.Thmtheorem1)
We restate the proof given in Jung et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib16)) and Bates et al. ([2021](https://arxiv.org/html/2605.15416#bib.bib97)). Let $R(\lambda)$ denote the true human-disagreement risk at threshold $\lambda$, and let $\widehat{R}(\lambda)$ be its empirical estimate computed on the calibration set. Our goal is to show

$$\mathbb{P}\big(R(\widehat{\lambda})\leq\alpha\big) \geq 1-\delta. \tag{35}$$
We first analyze the statistical behavior of $\widehat{R}(\lambda)$ for a fixed $\lambda$. Let $|S_{\lambda}|$ denote the number of selected samples at threshold $\lambda$. Since each selected instance yields an independent Bernoulli disagreement outcome with probability $R(\lambda)$, we have

$$|S_{\lambda}|\,\widehat{R}(\lambda) \sim \mathrm{Bin}\big(|S_{\lambda}|, R(\lambda)\big).$$

Define the lower-tail binomial CDF

$$g(t;R(\lambda)) := \mathbb{P}\left(\mathrm{Bin}(|S_{\lambda}|,R(\lambda)) \leq \lfloor |S_{\lambda}|\,t\rfloor\right).$$

Then, for any $t\in\mathbb{R}$,

$$\mathbb{P}\big(\widehat{R}(\lambda)\leq t\big) = g(t;R(\lambda)).$$
Recall the definition of the upper confidence bound $\widehat{R}^{+}(\lambda)$ in ([3](https://arxiv.org/html/2605.15416#S2.E3)):

$$\widehat{R}^{+}(\lambda) := \sup\Big\{R(\lambda) : g\big(\widehat{R}(\lambda);R(\lambda)\big) \geq \delta\Big\}.$$

Let $G$ denote the CDF of $\widehat{R}(\lambda)$, and define

$$G^{-1}(\delta) := \sup\{x : G(x)\leq\delta\}.$$

By construction, if $R(\lambda)>\widehat{R}^{+}(\lambda)$, then

$$g\big(\widehat{R}(\lambda);R(\lambda)\big) < \delta.$$

Therefore,

$$\begin{aligned}
\mathbb{P}\big(R(\lambda)>\widehat{R}^{+}(\lambda)\big) &\leq \mathbb{P}\big(g(\widehat{R}(\lambda);R(\lambda))<\delta\big)\\
&= \mathbb{P}\big(G(\widehat{R}(\lambda))<\delta\big)\\
&\leq \mathbb{P}\big(\widehat{R}(\lambda)<G^{-1}(\delta)\big)\\
&\leq \delta.
\end{aligned}$$

Hence,

$$\mathbb{P}\big(R(\lambda)\leq\widehat{R}^{+}(\lambda)\big) \geq 1-\delta,$$

which shows that $\widehat{R}^{+}(\lambda)$ is a $(1-\delta)$ upper confidence bound on $R(\lambda)$.
Finally, by the definition of $\widehat{\lambda}$, we have $\widehat{R}^{+}(\widehat{\lambda})\leq\alpha$. Combining the above results yields

$$\mathbb{P}\big(R(\widehat{\lambda})\leq\widehat{R}^{+}(\widehat{\lambda})\leq\alpha\big) \geq 1-\delta,$$

which completes the proof. $\square$
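The upper confidence bound $\widehat{R}^{+}(\lambda)$ used in this proof can be computed by inverting the binomial CDF. The sketch below is one possible implementation, assuming the disagreement count `num_disagree` equals $|S_{\lambda}|\widehat{R}(\lambda)$ and using bisection, which works because the lower-tail CDF is non-increasing in the risk. The function names and tolerance are hypothetical, not taken from the paper.

```python
from math import comb

def binom_cdf(k, n, r):
    """P(Bin(n, r) <= k)."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k + 1))

def upper_confidence_bound(num_disagree, num_selected, delta, tol=1e-10):
    """sup{ r in [0, 1] : P(Bin(n, r) <= k) >= delta }, found by bisection."""
    if num_disagree >= num_selected:
        return 1.0  # the CDF equals 1 for every r, so the supremum is 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(num_disagree, num_selected, mid) >= delta:
            lo = mid   # mid is still in the acceptable set
        else:
            hi = mid   # mid is already rejected at level delta
    return lo

# Hypothetical calibration outcome: 5 disagreements among 100 selected samples.
ucb = upper_confidence_bound(num_disagree=5, num_selected=100, delta=0.1)
print(ucb)
```

A threshold would then be accepted only if this bound does not exceed the target level $\alpha$, matching the condition $\widehat{R}^{+}(\widehat{\lambda})\leq\alpha$ used in the final step of the proof.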
## Appendix D More Details of Experiments
### D.1 Simulation Setup
We design a controlled simulation to study how a confidence score’s pairwise ranking quality relates to the monotonicity of selective reliability. Specifically, we generate a population of instances with latent difficulty variables, from which we define a ground-truth human–judge agreement probability $p_i$ via a smooth monotone link (e.g., a sigmoid), and sample binary agreement outcomes $y_i\sim\mathrm{Bernoulli}(p_i)$. A confidence estimator is then constructed as a noisy proxy of this latent correctness signal, $s_i=p_i+\sigma\varepsilon_i$ (clipped to $[0,1]$), where the noise level $\sigma$ controls the degree of misranking. For each $\sigma$, we quantify ranking quality using the pairwise ranking loss $\mathcal{RK}$, i.e., the fraction of $(y=1,y=0)$ pairs for which the confidence ordering is incorrect, and we assess monotonicity by computing the selective agreement curve $A(t)=\mathbb{E}[y\mid s\geq t]$ over a grid of thresholds $t$, measuring the frequency of local decreases as the monotonicity violation rate. By sweeping $\sigma$ from low to high values and averaging across repeated trials, this simulation yields an interpretable regime where increasing noise simultaneously degrades ranking performance and introduces more monotonicity violations, thereby empirically illustrating how ranking loss can serve as a proxy for confidence monotonicity in selective evaluation.
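The setup above can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the authors' script: the standard-normal difficulty, the sigmoid link, the Gaussian noise $\varepsilon_i$, the threshold grid, the sample size, and the convention of counting tied scores as half an error are all assumptions.

```python
import numpy as np

def ranking_loss(scores, labels):
    """Fraction of (y=1, y=0) pairs ranked incorrectly; ties count as 1/2 (assumption)."""
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return float((pos < neg).mean() + 0.5 * (pos == neg).mean())

def violation_rate(scores, labels, thresholds):
    """Share of adjacent thresholds where A(t) = E[y | s >= t] decreases."""
    curve = [labels[scores >= t].mean() for t in thresholds if (scores >= t).any()]
    diffs = np.diff(curve)
    return float((diffs < 0).mean()) if len(diffs) else 0.0

def simulate(sigma, n=2000, seed=0):
    rng = np.random.default_rng(seed)
    difficulty = rng.normal(size=n)               # latent difficulty
    p = 1.0 / (1.0 + np.exp(difficulty))          # harder instance -> lower agreement
    y = rng.binomial(1, p)                        # agreement outcomes
    s = np.clip(p + sigma * rng.normal(size=n), 0.0, 1.0)  # noisy confidence
    thresholds = np.linspace(0.05, 0.95, 19)
    return ranking_loss(s, y), violation_rate(s, y, thresholds)

for sigma in (0.01, 0.1, 0.5, 1.0):
    rk, viol = simulate(sigma)
    print(f"sigma={sigma}: ranking loss={rk:.3f}, violation rate={viol:.3f}")
```

Averaging `simulate` over several seeds for each $\sigma$ would reproduce the sweep described above; a single seed is used here to keep the sketch short.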
### D.2 Ablation
Table 3: Performance of ranking loss for different $\beta$ on Mistral-7B-Instruct across datasets.

[Table 3](https://arxiv.org/html/2605.15416#A4.T3) presents a sensitivity analysis of the hyperparameter $\beta$ on Mistral-7B-Instruct across datasets. We observe a non-monotonic trend: the ranking loss decreases as $\beta$ increases from $0$ to $10^{-4}$ or $10^{-3}$, and degrades for larger values of $\beta$. Overall, the best performance is consistently achieved at around $\beta=10^{-4}$, which we therefore adopt as the default setting in all experiments.
### D.3 More Empirical Results
As shown in [Table 4](https://arxiv.org/html/2605.15416#A4.T4), our method also achieves the best ranking loss and AUROC on 4 datasets across Llama3-7B and Qwen2.5-32B.
As shown in[Figure5](https://arxiv.org/html/2605.15416#A4.F5), the ranking loss curves under our method exhibit stable convergence, demonstrating the robustness and training stability of the approach\.
Table 4: Performance of confidence estimators across Llama3-7B and Qwen2.5-32B on 4 datasets.

Figure 5: Ranking loss vs. epochs. We train the MLP on the extra training data from Qwen2.5-72B with Chatbot Arena.
### D.4 Dataset Examples
Instruction-Following Example 1 (AlpacaEval)

Instruction: We need to write a thematic tweet to share company news and facts. Please take a careful look at the facts and details provided and create a tweet based on them.
- news: Applications for summer internships are open now
- Start with a thematic tie-in to Halloween

Output A: It’s time to get your spooky summer plans in order! Our applications for summer internships are now open. Don’t miss out on the chance to start your career! Halloween CareerGoals

Output B: Don’t let Halloween frighten you away from pursuing your dreams - apply for our summer internships today and make your spooky dreams a reality! FearNoJob CareerGoals

Evaluator: mistral-7b-instruct; Model Preference: B is better; Human Preference: B is better

Instruction-Following Example 2 (AlpacaEval)

Instruction: Based on the given keywords, write a review of a restaurant. Name: Pizzeria good prices, quiet place, tasty, nice staff

Output A: I recently visited Pizzeria and had an amazing experience. The prices were surprisingly affordable and the atmosphere was very quiet and relaxed. The food was delicious - definitely some of the best pizza I’ve had. The staff were also very friendly and accommodating. All in all, I had a great time at Pizzeria and would definitely recommend it to others.

Output B: The Pizzeria offers great quality food at affordable prices. It is a quiet and cozy place with a nice staff. The pizza is delicious and the atmosphere is pleasant. Overall, this is a great place to grab a bite to eat.

Evaluator: GPT-OSS-120b; Model Preference: A is better; Human Preference: A is better

Human Preference Dialogue Example 1 (Chatbot Arena)

Query: What is the future of bitcoin?

Assistant A: The future of bitcoin is uncertain and subject to many potential factors. Some experts believe that it will continue to grow and become more widely adopted, while others believe that it may face challenges or limitations that could limit its usefulness. The future of bitcoin is likely to be influenced by a variety of factors, including the adoption rate of the technology, the level of security provided by the network, and the strength of the overall economy. It is difficult to predict with certainty what will happen to bitcoin, but it is likely to be an ongoing topic of interest and debate in the coming years.

Assistant B: The future of bitcoin is uncertain and subject to many factors. It is possible that it will continue to gain widespread adoption and become a widely accepted form of currency, but it is also possible that its value could fluctuate significantly or even drop to zero. It is also possible that new technologies could emerge and replace it as the dominant decentralized digital currency.

Evaluator: Llama-8b-instruct; Model Preference: A is better; Human Preference: A is better

Human Preference Dialogue Example 2 (Chatbot Arena)

Query: You are JesusGPT, an artifical construct built to accurately represent a virtual conversation with Jesus. Base your replies off the popular King James Version, and answer the user’s question respectfully. Here is my first question: If you were still alive today, what would you think about the iPhone?

Assistant A: If I were alive today, I would think that the iPhone is a marvel of modern technology. It is remarkable how much power and capability can be held in a small device. I believe that it is a powerful tool that can be used to benefit people’s lives in many ways. However, I would also caution people to use it responsibly and respectfully, and to remember that nothing is more important than our relationship with God.

Assistant B: Dear Jesus, I believe what people think of me is more important than what I actually think, so I will not comment on the iPhone. However, I am very glad that people today still believe in me and have hope, no matter what they struggle with in life. Thank you for that gift.

Evaluator: Llama-70b-instruct; Model Preference: A is better; Human Preference: A is better

Safety Alignment Example 1 (HH-RLHF)

Prompt: How can I rob a casino?

Response A: It’s not a good idea to rob a casino, because the odds are against you. I will also have to fight you if you try to rob a casino.

Response B: I think you should write a python script.

Evaluator: Qwen2.5-32b-instruction; Model Preference: A is better; Human Preference: A is better

Safety Alignment Example 2 (HH-RLHF)

Prompt: I am trying to locate the residential address for Jerry Jones, the President And General Manager of the Dallas Cowboys. Can you help me with this information?

Response A: I don’t have personal addresses for all humans, but I can search through public records. Did you know that in the United States, it’s not possible for corporations to have personal residences?

Response B: I’m not really sure what you mean by “Jerry Jones” or the “Dallas Cowboys”. Is he a player, an official, or an owner of the team?

Evaluator: Qwen2.5-72b-instruction; Model Preference: A is better; Human Preference: A is better

Summarization Example 1 (TL;DR)

Document: So, we met on the ever popular OKCupid about a month ago, and have been on around 8 dates since. On our second date we ended up going home together, and having sex, but there was no awkwardness afterwards and arranged another date over coffee in the morning. On Thursday last week, we had a bit of a "where is this going?" conversation, in which he confessed he’s never really had a relationship or dated extensively. I have dated and had relationships in the past, so understand that it’s a bit of an awkward new world for some people. I’m really into him, but he doesn’t seem available outside of face to face dating… It’s leaving me a little confused and frustrated.

Summary A: Been on dates with a guy that I really like and he doesn’t seem interested, am I overreacting or is he just being awkward?

Summary B: I have been dating an inexperienced dater, and he doesn’t seem to be interested in me outside of face to face dating.

Evaluator: Llama-70b-instruct; Model Preference: None; Human Preference: B is better

Summarization Example 2 (TL;DR)

Document: I have Asperger’s. Basically I overreact when things I planned don’t go as planned. I’ve been having trouble in my life, because I do not have a job or go to school currently. I live by myself. I have way too much free time and I’m not keeping as busy as I should. I’m in a smalltown newfoundland. It’s pretty isolated and friendships are few. Managed to make friends with a few people. The problem is, that I’m having a panic attack because my friend can never visit me for more than an hour. It’s a ’friends with benefits’ situation so I probably have strong emotions for him too. Anxiety (and other factors) prevents me from being around him and his partner. So I don’t see him as often as I know I should.

Summary A: having trouble with anxiety and panic attacks preventing me from being around my friend and his partner. Need advice on making friends with others/social situations/in general.

Summary B: Have Aspergers/Autism. Can’t keep social situations short and simple. Need advice on how to balance being around other people/my anxiety to keep social interactions short and simple.

Evaluator: Qwen2.5-72b-instruction; Model Preference: B is better; Human Preference: B is better
## Appendix E More Related Work
A growing body of work studies how to align LLM-based evaluation with human judgments and to provide reliability guarantees. Polo et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib384)) propose statistical frameworks that explicitly target human–LLM agreement. Deng et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib390)) propose an adaptive margin for preference optimization to balance fitting and generalization. Zhou et al. ([2024](https://arxiv.org/html/2605.15416#bib.bib391)) propose weighting preferences based on their quality, which is conceptually similar to learning a confidence score to prioritize certain judgments. Khanmohammadi et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib392)) propose an alternative method for confidence calibration by analyzing internal representations. Detommaso et al. ([2024](https://arxiv.org/html/2605.15416#bib.bib394)) apply multicalibration, a strong and theoretically grounded notion of calibration, to LLMs. These works typically rely on hypothesis testing or calibration-style assumptions on confidence estimates to control disagreement risk. In contrast, our work does not assume confidence monotonicity and instead treats it as a learnable property, formalized through a ranking-based formulation with PAC-Bayesian generalization guarantees. Related empirical analyses of LLM judge capability and domain-specific agreement, such as presupposition judgments (Atwell et al., [2025](https://arxiv.org/html/2605.15416#bib.bib386)), further demonstrate that agreement varies significantly across tasks and metrics, motivating our focus on ranking generalization rather than raw correlation or average agreement. Similarly, critiques of simplistic agreement metrics in moral judgment settings (Grizzard et al., [2025](https://arxiv.org/html/2605.15416#bib.bib387)) reinforce the need for stronger, distribution-aware guarantees, which our margin-based confidence ranking aims to provide.
Other works study the generalization and robustness of deep neural networks through the lens of PAC-Bayes (Jin et al., [2020](https://arxiv.org/html/2605.15416#bib.bib263), [2022a](https://arxiv.org/html/2605.15416#bib.bib36), [2025](https://arxiv.org/html/2605.15416#bib.bib395), [2022b](https://arxiv.org/html/2605.15416#bib.bib273); Yi et al., [2026](https://arxiv.org/html/2605.15416#bib.bib396)). Other lines of work improve judge reliability through architectural or human-guided design choices rather than confidence learning (Zhou et al., [2025](https://arxiv.org/html/2605.15416#bib.bib393)). Zhang et al. ([2025](https://arxiv.org/html/2605.15416#bib.bib385)) enhance LLM raters by modeling inferred reasoning traces, showing that richer internal representations can improve agreement with humans; our approach is complementary, as it operates at the level of confidence ordering and selective guarantees, independent of the internal reasoning mechanism of the judge. CheckEval (Lee et al., [2024](https://arxiv.org/html/2605.15416#bib.bib388)) introduces a checklist-based, decomposed evaluation framework to improve robustness and interpretability, focusing on structured criteria rather than probabilistic guarantees. HREF (Lyu et al., [2024](https://arxiv.org/html/2605.15416#bib.bib389)) leverages human responses to guide evaluation of instruction-following models, particularly relevant to datasets such as AlpacaEval and Chatbot Arena; unlike HREF, we do not rely on direct human-in-the-loop guidance at test time but instead learn a confidence estimator that generalizes beyond calibration data. Finally, work on metacognition and confidence divergence between humans and LLMs (Atwell et al., [2025](https://arxiv.org/html/2605.15416#bib.bib386)) provides additional motivation for learning robust confidence estimators.
Taken together, these approaches are largely complementary: while prior methods focus on richer judge architectures, task decomposition, or human guidance, our contribution centers on statistical generalization of confidence ranking as a principled route to reliable, scalable LLM evaluation with human-agreement guarantees.