# A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Source: [https://arxiv.org/html/2605.08432](https://arxiv.org/html/2605.08432)
Zhanliang Wang¹, Jiancong Xiao¹∗, Ruochen Jin², Shu Yang¹, Bojian Hou¹, and Li Shen¹†
¹University of Pennsylvania, Philadelphia, PA; ²Dartmouth College, Hanover, NH
{aaronwzl,jcxiao}@upenn.edu, ruochen.jin.gr@dartmouth.edu, {syang11,bojianh,lishen}@upenn.edu
###### Abstract
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on *improving* LLM calibration, the equally important question of how to *evaluate* it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem1-ECE, the same-sample self-consistency score, and Sem2-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones, with Sem2 achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable. Code is available at [https://github.com/ZhanliangAaronWang/Sem-ECE](https://github.com/ZhanliangAaronWang/Sem-ECE).
## 1 Introduction
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is widely recognized as a prerequisite for the reliable deployment of large language models (LLMs) [[5](https://arxiv.org/html/2605.08432#bib.bib2), [6](https://arxiv.org/html/2605.08432#bib.bib4), [20](https://arxiv.org/html/2605.08432#bib.bib5)]. In high-stakes domains such as medicine and law, a system that is accurate on average but poorly calibrated cannot distinguish routine queries from queries on which it is likely to fail, leaving downstream pipelines without a signal for when to trust an answer, abstain, or escalate.
Much recent work focuses on *improving* LLM calibration, through post-hoc rescaling, prompting strategies, or calibration-aware fine-tuning [[19](https://arxiv.org/html/2605.08432#bib.bib14), [5](https://arxiv.org/html/2605.08432#bib.bib2), [9](https://arxiv.org/html/2605.08432#bib.bib15), [10](https://arxiv.org/html/2605.08432#bib.bib16), [20](https://arxiv.org/html/2605.08432#bib.bib5), [24](https://arxiv.org/html/2605.08432#bib.bib17)]. The equally important question of how to *evaluate* calibration in realistic settings remains underdeveloped. Classical metrics such as the Brier score, reliability diagrams, and expected calibration error [[2](https://arxiv.org/html/2605.08432#bib.bib9), [16](https://arxiv.org/html/2605.08432#bib.bib10), [5](https://arxiv.org/html/2605.08432#bib.bib2)] fit classification and multiple-choice QA but break down in open-ended QA, the dominant deployment setting for modern LLMs: the answer space is unbounded, two answers worded very differently can be equally correct, and commercial APIs frequently do not expose logits. Existing black-box approaches each address part of this gap, but none covers the full setting with a statistically explicit target. Verbalized confidence is format-agnostic [[11](https://arxiv.org/html/2605.08432#bib.bib3), [6](https://arxiv.org/html/2605.08432#bib.bib4), [14](https://arxiv.org/html/2605.08432#bib.bib11), [20](https://arxiv.org/html/2605.08432#bib.bib5)] but depends on self-reporting and is frequently overconfident [[6](https://arxiv.org/html/2605.08432#bib.bib4), [20](https://arxiv.org/html/2605.08432#bib.bib5), [22](https://arxiv.org/html/2605.08432#bib.bib7)]. Sampling-based methods derive confidence from the consistency of repeated generations [[21](https://arxiv.org/html/2605.08432#bib.bib8), [12](https://arxiv.org/html/2605.08432#bib.bib6)] but typically require task-specific extraction rules and rely on heuristic frequency scores rather than rigorous statistical targets.
We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a semantic-sampling framework for calibration evaluation in open-ended QA. The framework repeatedly samples answers from the model, maps free-form generations to semantic answer classes via an LLM judge, and evaluates calibration from the resulting semantic frequencies, without requiring logits, multiple-choice options, or hand-crafted answer-extraction rules. Within this framework, we study two natural estimators of the same target, namely the probability of the model's most likely semantic answer. Sem1-ECE is the standard same-sample self-consistency score: it selects the most frequent semantic answer and uses that same frequency as confidence. Sem2-ECE is a held-out variant that selects the answer on one block of samples and measures its frequency on a disjoint held-out block. We prove that both are asymptotically unbiased, placing sampling-based calibration evaluation on a principled statistical footing, and show in closed form that on hard low-margin questions Sem2 yields a strictly smaller calibration error than Sem1, while on easy questions the two are nearly indistinguishable; the Sem1–Sem2 gap thus also serves as a simple observable diagnostic for question difficulty.
Sem-ECE improves over verbalized confidence by measuring a behavioral property of the answer distribution rather than relying on self-reporting, and advances existing sampling-based calibration evaluation by replacing hand-crafted extraction rules and heuristic frequency scores with estimators that have an explicit population target and provable guarantees; it complements logit-based evaluation when internal probabilities are unavailable. Experiments on three challenging open-ended QA benchmarks, including Humanity's Last Exam, across five leading commercial LLMs (ChatGPT, Claude, Gemini, Grok, and Mistral) confirm our theoretical predictions, with Sem2-ECE achieving lower calibration error than verbalized confidence on the large majority of model–benchmark pairs.
## 2 Related Work
Calibration evaluation is well-studied for probabilistic classifiers and multiple-choice QA via the Brier score, reliability diagrams, and binned ECE [[2](https://arxiv.org/html/2605.08432#bib.bib9), [16](https://arxiv.org/html/2605.08432#bib.bib10), [5](https://arxiv.org/html/2605.08432#bib.bib2)], but open-ended QA breaks these tools: the answer space is unbounded, correctness is semantic rather than lexical, and commercial APIs often do not expose logits. Two families of black-box confidence sources have emerged. *Verbalized confidence* elicits the model's stated uncertainty in words or as a probability [[11](https://arxiv.org/html/2605.08432#bib.bib3), [6](https://arxiv.org/html/2605.08432#bib.bib4), [14](https://arxiv.org/html/2605.08432#bib.bib11), [20](https://arxiv.org/html/2605.08432#bib.bib5)], but is self-reported and frequently overconfident. *Sampling-based methods* use agreement across repeated generations as a confidence signal [[21](https://arxiv.org/html/2605.08432#bib.bib8), [12](https://arxiv.org/html/2605.08432#bib.bib6)], with semantic-uncertainty variants grouping generations by meaning [[8](https://arxiv.org/html/2605.08432#bib.bib12), [3](https://arxiv.org/html/2605.08432#bib.bib13)]; existing instantiations rely on task-specific answer-extraction rules and lack an explicit population target. Sem-ECE measures a behavioral property of the answer distribution, like sampling-based methods, but assigns the resulting frequency an explicit asymptotic target with provable guarantees, distinguishing it from heuristic frequency scores and self-reported uncertainty. A complementary line of work aims to *improve* calibration via post-hoc rescaling or fine-tuning [[19](https://arxiv.org/html/2605.08432#bib.bib14), [9](https://arxiv.org/html/2605.08432#bib.bib15), [10](https://arxiv.org/html/2605.08432#bib.bib16), [24](https://arxiv.org/html/2605.08432#bib.bib17)]; see [Appendix G](https://arxiv.org/html/2605.08432#A7) for an extended discussion.
## 3 Preliminaries
**Semantic answer space and oracle confidence.** Let $\mathcal{Q}$ be a distribution over questions. For a fixed $q\sim\mathcal{Q}$, querying the LLM under a fixed prompt and decoding configuration produces a random free-form answer string. Two strings are *semantically equivalent* if they express the same answer to $q$; the equivalence classes form a finite *semantic answer space* $\mathcal{Z}_q=\{1,\ldots,K_q\}$, with $K_q:=|\mathcal{Z}_q|$. The LLM induces a categorical distribution $\pi_q$ on $\mathcal{Z}_q$, with $\pi_{q,k}:=\pi_q(k)=\Pr(\text{the LLM's answer to }q\text{ lies in class }k)$. The *population semantic mode* is $z_q^\star:=\operatorname*{arg\,max}_k\pi_{q,k}$ (ties broken by a fixed deterministic rule), and the *oracle semantic confidence* is $c_q^\star:=\pi_{q,z_q^\star}=\max_k\pi_{q,k}$. It is an agreement quantity, not a correctness quantity: a model can have $c_q^\star=1$ and still be wrong on every sample.
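As a concrete illustration (a minimal sketch, not the paper's released code), the population mode and oracle confidence can be computed from a hypothetical three-class answer distribution; the smallest-label tie-break is one valid instance of the "fixed deterministic rule":

```python
def oracle_confidence(pi):
    """Given a categorical distribution pi over semantic classes
    (dict: class label -> probability), return the population semantic
    mode z_star and the oracle confidence c_star = max_k pi[k].
    Ties are broken by the smallest class label (a fixed deterministic rule)."""
    top = max(pi.values())
    z_star = min(k for k, p in pi.items() if p == top)
    return z_star, pi[z_star]

# Hypothetical answer distribution for one question q with K_q = 3:
pi_q = {1: 0.6, 2: 0.3, 3: 0.1}
z_star, c_star = oracle_confidence(pi_q)
# c_star measures agreement, not correctness: class 1 may still be wrong.
```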
**Semantic correctness.** Correctness is defined at the semantic-class level. Let $Y_q:\mathcal{Z}_q\to\{0,1\}$ be the correctness function for $q$, with $Y_q(k)=1$ iff class $k$ is correct relative to the reference answer. If a method commits to class $k$ with confidence $c\in[0,1]$, its calibration is evaluated using the pair $(c,Y_q(k))$; calibration is thus assessed at the semantic level rather than on raw strings.
**Empirical estimation of $\pi_q$.** The distribution $\pi_q$ is unknown; we access it through $n+m$ independent generations clustered into semantic classes $Z_1,\ldots,Z_{n+m}\overset{\mathrm{i.i.d.}}{\sim}\pi_q$, and partition the index set $[n+m]$ into a selection block $N$ of size $n$ and a disjoint evaluation block $E$ of size $m$. For any $I\subseteq[n+m]$, the empirical semantic PMF is $\hat{\pi}_I(k):=|I|^{-1}\sum_{i\in I}\mathbf{1}\{Z_i=k\}$, and the corresponding empirical semantic mode is $\hat{z}_I:=\arg\max_{k\in\mathcal{Z}_q}\hat{\pi}_I(k)$ (ties broken by the same deterministic rule as for $z_q^\star$). We write $\hat{z}_N$ for the empirical mode on the selection block; it is the answer the model would deploy.
**Standardized margin.** Two scalar summaries of $\pi_q$ will be referenced repeatedly. The *top-two margin* $\Delta_q:=\pi_{q,z_q^\star}-\pi_{q,z_q^{(2)}}$ is the gap between the modal probability and the runner-up; the *top-two probability mass* $p_q:=\pi_{q,z_q^\star}+\pi_{q,z_q^{(2)}}$ is their sum, where $\pi_{q,z_q^{(2)}}:=\max_{k\neq z_q^\star}\pi_{q,k}$. From these we form the *standardized margin* $\tilde{m}_q:=\Delta_q/\sqrt{p_q/n}$ and its half $\tilde{\lambda}_q:=\tilde{m}_q/2$: $\tilde{m}_q$ is the z-score of $\Delta_q$ under the leading-order variance $p_q/n$ of the differential count $\hat{\pi}_N(z_q^\star)-\max_{k\neq z_q^\star}\hat{\pi}_N(k)$, and it parametrizes the regime structure throughout [Section 5](https://arxiv.org/html/2605.08432#S5).
**The $p_q\to 1$ convention.** For figure readability we adopt the convention $p_q\to 1$, under which $\tilde{m}_q=\sqrt{n}\,\Delta_q$; all theorems are stated for general $p_q\in(0,1]$.
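The empirical PMF, mode, and standardized margin above can be sketched in a few lines of Python; the block of class labels and the helper names are hypothetical, not from the released implementation:

```python
import math
from collections import Counter

def empirical_pmf(samples):
    """Empirical semantic PMF pi_hat_I over one block of class labels."""
    n = len(samples)
    return {k: v / n for k, v in Counter(samples).items()}

def standardized_margin(pi, n):
    """m_tilde = Delta / sqrt(p / n), where Delta is the top-two margin
    and p is the top-two probability mass of the distribution pi."""
    top2 = sorted(pi.values(), reverse=True)[:2]
    p1 = top2[0]
    p2 = top2[1] if len(top2) > 1 else 0.0
    delta, p = p1 - p2, p1 + p2
    return delta / math.sqrt(p / n)

block_N = [1, 1, 2, 1, 3, 1, 1, 2]          # hypothetical selection block, n = 8
pi_hat = empirical_pmf(block_N)             # {1: 0.625, 2: 0.25, 3: 0.125}
m_tilde = standardized_margin(pi_hat, len(block_N))
# Under the p_q -> 1 convention, m_tilde reduces to sqrt(n) * Delta.
```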
**Binned expected calibration error.** For a (confidence, correctness) pair $(c,a)$ with $c\in[0,1]$ and $a\in\{0,1\}$, calibration is measured by the binned expected calibration error [[16](https://arxiv.org/html/2605.08432#bib.bib10), [5](https://arxiv.org/html/2605.08432#bib.bib2)]. Fix $L$ equal-width bins $\mathcal{I}_1,\ldots,\mathcal{I}_L$ partitioning $[0,1]$ with boundary set $\mathcal{T}$; we set $L=10$ throughout. Define

$$\operatorname{ECE}(c,a):=\sum_{\ell=1}^{L}\Big|\mathbb{E}\big[(a-c)\,\mathbf{1}\{c\in\mathcal{I}_\ell\}\big]\Big|,$$

where the expectation is over $q\sim\mathcal{Q}$ and the sampling randomness within each question. The oracle correctness label is $a_q^\star:=Y_q(z_q^\star)$ and the deployed correctness label is $\hat{a}:=Y_q(\hat{z}_N)$. We instantiate $c$ by $\hat{c}_i$ and by $c_q^\star$ to obtain the central metrics

$$\mathrm{Sem}_i\text{-ECE}:=\operatorname{ECE}(\hat{c}_i,\hat{a}),\qquad\mathrm{ECE}^\star:=\operatorname{ECE}(c_q^\star,a_q^\star),$$

the calibration error of $\mathrm{Sem}_i$ ($i\in\{1,2\}$) and the calibration error of the unattainable population-level oracle pair. The deployment accuracy $\bar{a}:=\mathbb{E}_q[\hat{a}]=\mathbb{E}_q[Y_q(\hat{z}_N)]$ is the population mean of $\hat{a}$ and serves as the natural reference for the leading-order analysis in [Section 5.2](https://arxiv.org/html/2605.08432#S5.SS2).
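A minimal empirical version of this binned ECE, assuming a finite evaluation set and right-closed equal-width bins (an implementation choice the text does not fix), can be sketched as:

```python
def binned_ece(confidences, correctness, num_bins=10):
    """Empirical binned ECE: sum over equal-width bins of the absolute
    average signed gap (a - c); keeping the sum inside the absolute value
    per bin matches the indicator form of the definition."""
    n = len(confidences)
    total = 0.0
    for b in range(num_bins):
        lo, hi = b / num_bins, (b + 1) / num_bins
        # Right-closed bins (lo, hi]; confidence exactly 0 goes to the first bin.
        gap = sum(a - c for c, a in zip(confidences, correctness)
                  if lo < c <= hi or (b == 0 and c == 0.0))
        total += abs(gap) / n
    return total

# Overconfident at 0.95 (accuracy 1.0 but small sample) and mixed at 0.55:
ece = binned_ece([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```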
## 4 A Semantic-Sampling Framework for Evaluating Calibration
### 4.1 Same-sample estimator $\mathrm{Sem}_1$
The most direct estimate of the oracle confidence $c_q^\star=\max_k\pi_{q,k}$ is the empirical maximum on the same block that produced $\hat{z}_N$:

$$\hat{c}_1:=\max_{k\in\mathcal{Z}_q}\hat{\pi}_N(k).\tag{1}$$

This is the natural plug-in once one has committed to estimating $\max_k\pi_{q,k}$ by $\max_k\hat{\pi}_N(k)$. We take it as the same-sample member of our framework and refer to its calibration error as $\mathrm{Sem}_1$-ECE.

$\hat{c}_1$ couples two operations on the same block $N$: it *selects* the empirical winner and *reports* its empirical frequency. Because $\max$ is convex and the empirical PMF $\hat{\pi}_N$ is unbiased for $\pi_q$, Jensen's inequality gives

$$\mathbb{E}\big[\hat{c}_1\mid q\big]=\mathbb{E}\Big[\max_k\hat{\pi}_N(k)\Big]\;\geq\;\max_k\mathbb{E}\big[\hat{\pi}_N(k)\big]=c_q^\star,\tag{2}$$

with strict inequality whenever the top-two gap $\Delta_q:=\pi_{q,z_q^\star}-\max_{k\neq z_q^\star}\pi_{q,k}$ is finite and $n<\infty$. At the population level $\hat{c}_1$ is biased upward, and the empirical winner is over-represented on the block that selected it: the classical *winner's curse* [[7](https://arxiv.org/html/2605.08432#bib.bib1)]. The slack in (2) is a finite-sample property of the same-sample design rather than of the oracle target, and it suggests a natural alternative: evaluate the chosen answer on samples that did not participate in selecting it.
### 4.2 Held-out estimator $\mathrm{Sem}_2$
To address this issue, we introduce $\mathrm{Sem}_2$, which decouples selection from evaluation by reading the confidence off the disjoint block $E$:

$$\hat{c}_2:=\hat{\pi}_E(\hat{z}_N).\tag{3}$$

The deployed answer is the same $\hat{z}_N$; only the way we score it changes. Because $E$ is independent of $N$ and $\hat{\pi}_E$ is unbiased for $\pi_q$, $\mathrm{Sem}_2$ satisfies the *conditional unbiasedness* property

$$\mathbb{E}\big[\hat{c}_2\mid q,\hat{z}_N\big]=\pi_{q,\hat{z}_N}:\tag{4}$$

given the selection $\hat{z}_N$, $\hat{c}_2$ targets exactly the population probability of the selected answer, and the Jensen slack in (2) is eliminated at the conditional level, a property $\mathrm{Sem}_1$ does not enjoy at any level.

Conditional unbiasedness is for the population probability of the *empirical* mode $\hat{z}_N$, not the *true* mode $z_q^\star$. Marginalizing (4) over $\hat{z}_N$,

$$\mathbb{E}\big[\hat{c}_2\mid q\big]=\mathbb{E}\big[\pi_{q,\hat{z}_N}\mid q\big]\;\leq\;\pi_{q,z_q^\star}=c_q^\star,\tag{5}$$

with strict inequality whenever $\Pr(\hat{z}_N\neq z_q^\star\mid q)>0$. $\mathrm{Sem}_2$ thus trades $\mathrm{Sem}_1$'s upward *Jensen bias* for a downward *selection bias*. The two biases are mathematically distinct: (2) is a property of the $\max$ operator on the selection block, while (5) is a property of the noise in the selection itself. Both are $O(n^{-1/2})$ on low-margin questions and vanish in the large-margin limit. Whether the trade is favorable, and on which version of the calibration metric, is the subject of [Section 5](https://arxiv.org/html/2605.08432#S5).
**The Sem-ECE family.** Pairing each estimator with the population calibration metric gives the two members of the Sem-ECE family,

$$\mathrm{Sem}_1\text{-ECE}:=\operatorname{ECE}(\hat{c}_1,\hat{a}),\qquad\mathrm{Sem}_2\text{-ECE}:=\operatorname{ECE}(\hat{c}_2,\hat{a}),$$

distinguished by which of (1), (3) is substituted into the calibration metric. [Algorithm 1](https://arxiv.org/html/2605.08432#alg1) states the framework in [Appendix B](https://arxiv.org/html/2605.08432#A2). Because both members commit to the same $\hat{z}_N$, the deployment policy is unaffected by the choice of estimator; the choice influences only the reported confidence and its calibration error.
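A compact sketch of both estimators on one stream of clustered samples (hypothetical labels and helper names; the paper's released code may differ):

```python
from collections import Counter

def sem_estimators(samples, n):
    """Split class-label samples into a selection block N (first n labels)
    and a held-out evaluation block E (the rest). Return the deployed
    answer z_hat_N, the same-sample confidence c_hat_1 (Sem1), and the
    held-out confidence c_hat_2 (Sem2)."""
    block_N, block_E = samples[:n], samples[n:]
    counts_N = Counter(block_N)
    top = max(counts_N.values())
    z_hat = min(k for k, v in counts_N.items() if v == top)  # deterministic tie-break
    c1 = counts_N[z_hat] / len(block_N)       # Eq. (1): frequency on the same block
    c2 = block_E.count(z_hat) / len(block_E)  # Eq. (3): frequency on the held-out block
    return z_hat, c1, c2

samples = [1, 2, 1, 1, 2, 3, 1, 2, 2, 1]  # hypothetical n + m = 10 generations
z_hat, c1, c2 = sem_estimators(samples, n=5)
# Both members deploy the same z_hat_N; only the reported confidence differs.
```

On average over resampled blocks, $\hat{c}_1$ overshoots and $\hat{c}_2$ undershoots the oracle confidence on low-margin questions, per (2) and (5).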
## 5 Theoretical Analysis
We analyze the relationship between the plug-in calibration errors $\mathrm{Sem}_1$-ECE, $\mathrm{Sem}_2$-ECE, and the oracle $\mathrm{ECE}^\star$ in two layers. [Section 5.1](https://arxiv.org/html/2605.08432#S5.SS1) establishes asymptotic unbiasedness through a pointwise bias bound ([Theorem 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1)), a binned-ECE bound ([Theorem 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2)), and supporting bounds on the underlying confidence and selection errors ([Theorems 5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) and [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4)); together these yield $\mathrm{Sem}_i$-ECE $\to\mathrm{ECE}^\star$ as $n,m\to\infty$ and identify a low-margin regime $\mathcal{Q}_{\mathrm{low}}$ on which the two estimators do not coincide asymptotically. [Section 5.2](https://arxiv.org/html/2605.08432#S5.SS2) resolves the leading $1/\sqrt{n}$ constant on $\mathcal{Q}_{\mathrm{low}}$ via a local CLT bias expansion and identifies a strictly nested Jensen-dominated regime $\mathcal{Q}_{\mathrm{JDR}}$ on which $\mathrm{Sem}_2$-ECE is closer to $\mathrm{ECE}^\star$ than $\mathrm{Sem}_1$-ECE. All proofs are in [Appendix A](https://arxiv.org/html/2605.08432#A1).
### 5.1 Asymptotic Unbiasedness
[Theorem 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1) controls the per-question bias of $\hat{c}_i$ about $c_q^\star$ and identifies the regime separation. [Theorems 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2), [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) and [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4) then lift this pointwise control to the binned ECE through two estimator-level errors: a confidence error $\varepsilon_n$ and a selection error $\delta_n$.
###### Theorem 5.1 (Pointwise bias bound).
For each $q$ with $\Delta_q>0$ and $i\in\{1,2\}$,

$$\big|\mathbb{E}[\hat{c}_i\mid q]-c_q^\star\big|\;\leq\;\min\left\{C\sqrt{\tfrac{\log 2K_q}{2n}},\;(K_q-1)\exp\!\Big(-\tfrac{n\Delta_q^2}{2p_q}\Big)\right\},\tag{6}$$

where $C$ is a universal constant.
The right-hand side has two complementary terms. The first (Hoeffding) term gives a uniform $O(n^{-1/2})$ ceiling regardless of margin. The second (Bernstein) term shrinks exponentially in the squared standardized margin $\tilde{m}_q^2=n\Delta_q^2/p_q$ and sharpens the bound when the margin is large. The two cross at $\tilde{m}_q^2\asymp\log K_q$, splitting questions into a *large-margin* regime ($\tilde{m}_q^2\geq\log K_q$, Bernstein wins, bias $o(n^{-1/2})$) and a *low-margin* regime ($\tilde{m}_q^2<\log K_q$, Hoeffding is tight, bias $\Theta(n^{-1/2})$).
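The regime split can be checked numerically; the two toy distributions below are hypothetical, and the universal constant $C$ is irrelevant to the classification since only the crossover $\tilde{m}_q^2 \gtrless \log K_q$ matters:

```python
import math

def margin_regime(pi, n):
    """Classify a question into the large-margin or low-margin regime of
    Theorem 5.1 by comparing m_tilde^2 = n * Delta^2 / p with log K."""
    probs = sorted(pi.values(), reverse=True)
    K = len(probs)
    delta = probs[0] - probs[1]       # top-two margin Delta_q
    p = probs[0] + probs[1]           # top-two probability mass p_q
    m_sq = n * delta ** 2 / p
    return "large-margin" if m_sq >= math.log(K) else "low-margin"

# Hypothetical answer distributions, each observed with n = 30 samples:
easy = {1: 0.8, 2: 0.1, 3: 0.1}      # big gap: the Bernstein term wins
hard = {1: 0.4, 2: 0.35, 3: 0.25}    # small gap: the Hoeffding ceiling binds
```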
**Confidence and selection errors.** For an estimator $\hat{c}\in\{\hat{c}_1,\hat{c}_2\}$ paired with the deployed correctness label $\hat{a}$, define

$$\varepsilon_n:=\mathbb{E}\,\big|\hat{c}-c_q^\star\big|\quad(\text{confidence error}),\qquad\delta_n:=\Pr\big(\hat{a}\neq a_q^\star\big)\quad(\text{selection error}).$$

With $\operatorname{ECE}(\cdot,\cdot)$ the binned ECE operator defined in [Section 3](https://arxiv.org/html/2605.08432#S3), [Theorem 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2) converts $(\varepsilon_n,\delta_n)$ into a bound on $|\operatorname{ECE}(\hat{c},\hat{a})-\mathrm{ECE}^\star|$; [Theorems 5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) and [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4) bound $(\varepsilon_n,\delta_n)$ for each estimator.
###### Theorem 5.2 (Bin-ECE bounds).
For any $\eta>0$,

$$\left|\operatorname{ECE}(\hat{c},\hat{a})-\operatorname{ECE}(c_q^\star,a_q^\star)\right|\;\leq\;\delta_n+\varepsilon_n+2\left\{\frac{\varepsilon_n}{\eta}+\mathbb{P}\big(\operatorname{dist}(c_q^\star,\mathcal{T})\leq\eta\big)\right\}.\tag{7}$$

If, in addition, $c_q^\star$ has density bounded by $M$, then

$$\left|\operatorname{ECE}(\hat{c},\hat{a})-\operatorname{ECE}(c_q^\star,a_q^\star)\right|\;\leq\;\delta_n+\varepsilon_n+4\sqrt{2M(L-1)\varepsilon_n}.$$
Therefore, the gap between an estimated ECE and the true ECE is bounded in terms of the confidence error and selection error\.
###### Theorem 5.3 (Bounds on the confidence and selection errors for $\hat{c}_1$).
Let $\hat{c}=\hat{c}_1$ and $\hat{a}=Y_q(\hat{z}_N)$. Then

$$\varepsilon_n\leq\mathbb{E}_q\left[\min\left\{\sqrt{\frac{\log(2K_q)}{2n}},\;\sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{n}}+(K_q-1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right\}\right],$$

and

$$\delta_n\leq\mathbb{E}_q\left[(K_q-1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right].$$
With Sem1-ECE, both the confidence error and the selection error therefore converge to $0$ as $n\to\infty$. Next, we give the corresponding bounds for Sem2-ECE.
###### Theorem 5.4 (Bounds on the confidence and selection errors for $\hat{c}_2$).
Let $\hat{c}=\hat{c}_2$ and $\hat{a}=Y_q(\hat{z}_N)$. Then

$$\varepsilon_n\leq\mathbb{E}_q\left[\sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{m}}+(K_q-1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right],$$

and $\delta_n$ has the same upper bound as in [Theorem 5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3).
Therefore, both Sem1-ECE and Sem2-ECE converge asymptotically to the true ECE. The bounds differ only in the sample used for the Bernoulli term: $\mathrm{Sem}_1$ reuses block $N$ at rate $1/\sqrt{n}$, while $\mathrm{Sem}_2$ averages over $E$ at rate $1/\sqrt{m}$; [Theorem 5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) additionally retains the Hoeffding term as a margin-free fallback.
Combining [Theorems 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2), [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) and [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4), convergence is exponentially fast on the large-margin regime $\{q:\tilde{m}_q^2\geq\log K_q\}$. The residual separation concentrates on the low-margin regime $\mathcal{Q}_{\mathrm{low}}:=\{q:\tilde{m}_q^2<\log K_q\}$, which we resolve to leading order next.
### 5.2 ECE Comparison in the Low-Margin Regime
On $\mathcal{Q}_{\mathrm{low}}$, both $\mathrm{Sem}_1$ and $\mathrm{Sem}_2$ have $\Theta(n^{-1/2})$ bias by [Theorem 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1). To compare the two estimators we decompose the bias by sign. Both biases originate from the event $\{\hat{z}_N\neq z_q^\star\}$, but on this event $\mathrm{Sem}_1$ overshoots $c_q^\star$ (Jensen's inequality, $\mathbb{E}[\max_z\hat{\pi}_N(z)\mid q]\geq c_q^\star$), while $\mathrm{Sem}_2$ undershoots ($\hat{\pi}_E(\hat{z}_N)$ is the held-out frequency of a non-modal class).
**Bin-interior assumption.** Throughout [Section 5.2](https://arxiv.org/html/2605.08432#S5.SS2) we assume $c_q^\star$ is bounded away from the bin boundaries $\mathcal{T}$ except on a set of vanishing measure:

$$\Pr_q\big(\mathrm{dist}(c_q^\star,\mathcal{T})\leq\eta_n\big)=o(n^{-1/2}),\qquad\sqrt{n}\,\eta_n\to\infty.\tag{8}$$

Under (8), $\hat{c}_1$ and $\hat{c}_2$ lie in the same bin as $c_q^\star$ with probability $1-o(n^{-1/2})$, the binning-instability term in [Theorem 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2) contributes at order $o(n^{-1/2})$ rather than at the unconditional $O(n^{-1/4})$ ceiling, and the leading-order behavior of $\mathrm{Sem}_i$-ECE on $\mathcal{Q}_{\mathrm{low}}$ is governed by the conditional bias expansion below.
**Bias expansion.** For $q\in\mathcal{Q}_{\mathrm{low}}$, a local CLT at the boundary $\hat{\pi}_N(z_q^\star)=\hat{\pi}_N(z_q^{(2)})$ combined with a folded-normal calculation ([Section A.5](https://arxiv.org/html/2605.08432#A1.SS5)) gives, to leading order,

$$\mathbb{E}[\hat{c}_1-c_q^\star\mid q]=\tfrac{\sqrt{p_q}}{\sqrt{n}}\,J(\tilde{\lambda}_q)+o(n^{-1/2}),\qquad\mathbb{E}[\hat{c}_2-c_q^\star\mid q]=-\tfrac{\sqrt{p_q}}{\sqrt{n}}\,S(\tilde{\lambda}_q)+o(n^{-1/2}),\tag{9}$$

with positive Jensen bias $J(\tilde{\lambda}):=\varphi(2\tilde{\lambda})-2\tilde{\lambda}\,\Phi(-2\tilde{\lambda})$ (the leading-order winner's curse from (2)) and selection bias $S(\tilde{\lambda}):=2\tilde{\lambda}\,\Phi(-2\tilde{\lambda})$. The ECE-level comparisons are governed by

$$g_A(\tilde{\lambda}):=J+S=\varphi(2\tilde{\lambda}),\qquad g_B(\tilde{\lambda}):=J-S=\varphi(2\tilde{\lambda})-4\tilde{\lambda}\,\Phi(-2\tilde{\lambda}),\tag{10}$$

where $g_A>0$ on $\mathcal{Q}_{\mathrm{low}}$, and $g_B$ is strictly decreasing on $[0,\infty)$ with unique positive root $\tilde{\lambda}^\star\approx 0.306$, i.e., $\tilde{m}^\star=2\tilde{\lambda}^\star\approx 0.612$ ([Figure 1](https://arxiv.org/html/2605.08432#S5.F1)b; uniqueness proof in [Section A.8](https://arxiv.org/html/2605.08432#A1.SS8)).
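The root $\tilde{\lambda}^\star$ can be reproduced from the closed forms in (10) with standard-library Python; this is a standalone numerical check (bisection, using that $g_B$ is strictly decreasing), not the authors' code:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def g_B(lam):
    """g_B(lambda) = J - S = phi(2 lambda) - 4 lambda Phi(-2 lambda)."""
    return phi(2.0 * lam) - 4.0 * lam * Phi(-2.0 * lam)

# g_B(0) = phi(0) > 0 and g_B(1) < 0, so bisect on [0, 1]:
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g_B(mid) > 0:
        lo = mid
    else:
        hi = mid
lambda_star = 0.5 * (lo + hi)   # approx 0.306, so m_star = 2 * lambda_star approx 0.612
```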
###### Definition 5.5 (Jensen-dominated regime).
The *Jensen-dominated regime* (JDR) is $\mathcal{Q}_{\mathrm{JDR}}:=\{q:\tilde{\lambda}_q<\tilde{\lambda}^\star\}\subsetneq\mathcal{Q}_{\mathrm{low}}$.


Figure 1: (a) Regime diagram on the $(\tilde{m}_q,K_q)$ plane, partitioned by the JDR boundary $\tilde{m}_q=2\tilde{\lambda}^\star$ ([Theorem 5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7), dashed) and the crossover $\tilde{m}_q=\sqrt{\log K_q}$ ([Theorem 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1), solid). In the JDR (green), $\mathrm{Sem}_2$ wins on both raw ECE and oracle distance; in the intermediate band (yellow), $\mathrm{Sem}_2$ has smaller raw ECE but is farther from the oracle; in the large-margin region (gray), the two estimators are asymptotically indistinguishable. (b) Leading constants from (10): $g_A(\tilde{\lambda})=\varphi(2\tilde{\lambda})>0$ everywhere, while $g_B(\tilde{\lambda})=\varphi(2\tilde{\lambda})-4\tilde{\lambda}\,\Phi(-2\tilde{\lambda})$ is positive only on $\tilde{\lambda}<\tilde{\lambda}^\star\approx 0.306$ (shaded).

**Direct ECE gap.** Adding the two expansions in (9) sign-aligns the biases: $\mathrm{Sem}_1$ overshoots $c_q^\star$ by $J(\tilde{\lambda}_q)$ and $\mathrm{Sem}_2$ undershoots by $S(\tilde{\lambda}_q)$. Under over-confidence, the absolute deviations of the two estimators from $\bar{a}$ shift by $+J$ and $-S$ respectively, producing a gap of $J+S=g_A$.
###### Theorem 5.6 (Direct ECE gap).
Suppose $\mathbb{E}_q[c_q^\star] - \bar a > 0$ on $\mathcal{Q}_{\mathrm{low}}$, where $\bar a := \mathbb{E}_q[\hat a]$ is the deployment accuracy. Then
$$\mathrm{Sem}_1\text{-ECE} - \mathrm{Sem}_2\text{-ECE} = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q\!\left[\sqrt{p_q}\,g_A(\tilde\lambda_q)\right] + o(n^{-1/2}) > 0. \tag{11}$$
[Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) requires only over-confidence on $\mathcal{Q}_{\mathrm{low}}$, with no constraint on $\tilde\lambda_q$ within the regime: $\mathrm{Sem}_2$-ECE is strictly smaller than $\mathrm{Sem}_1$-ECE throughout $\mathcal{Q}_{\mathrm{low}}$ at order $n^{-1/2}$.
**Oracle ECE distance.** [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) compares the two raw ECE values but does not say which is closer to $\mathrm{ECE}^\star$. A smaller raw ECE for $\mathrm{Sem}_2$ is consistent with two structural scenarios: (a) $\mathrm{Sem}_2$-ECE sits between $\mathrm{Sem}_1$-ECE and $\mathrm{ECE}^\star$ (closer to the oracle); or (b) $\mathrm{Sem}_2$-ECE has overshot through $\mathrm{ECE}^\star$ to the opposite side (farther from the oracle, but smaller in absolute distance to $\bar a$). Distinguishing (a) from (b) hinges on $|J|$ versus $|S|$, i.e., on $g_B$.
###### Theorem 5.7 (Sharp oracle ECE distance on JDR).
Suppose $\mathbb{E}_q[c_q^\star] - \bar a \geq c_0/\sqrt{n}$ for some fixed $c_0 > 0$ (non-degenerate over-confidence on $\mathcal{Q}_{\mathrm{low}}$), and the population is supported in $\mathcal{Q}_{\mathrm{JDR}}$, i.e., $\sup_q \tilde\lambda_q < \tilde\lambda^\star$. Then
$$\big|\mathrm{Sem}_1\text{-ECE} - \mathrm{ECE}^\star\big| - \big|\mathrm{Sem}_2\text{-ECE} - \mathrm{ECE}^\star\big| = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q\!\left[\sqrt{p_q}\,g_B(\tilde\lambda_q)\right] + o(n^{-1/2}) > 0. \tag{12}$$
[Theorem 5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7) adds two assumptions to [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6): (i) JDR, ensuring $g_B(\tilde\lambda_q) > 0$ pointwise; and (ii) non-degenerate over-confidence, ensuring the absolute values in (12) do not flip sign within the $o(n^{-1/2})$ remainder. The conclusion is correspondingly stronger: $\mathrm{Sem}_2$-ECE is closer to $\mathrm{ECE}^\star$ than $\mathrm{Sem}_1$-ECE is, not merely smaller in absolute terms.
**Regime structure.** Combining [Theorems 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1), [5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2), [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3), [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4), [5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) and [5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7) ([Figure 1](https://arxiv.org/html/2605.08432#S5.F1)a), the population $\mathcal{Q}$ partitions into three regimes:
- *Large-margin* ($\tilde m_q^2 \geq \log K_q$): [Theorems 5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) and [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4) give $\varepsilon_n, \delta_n$ exponentially small; via [Theorem 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2), $|\mathrm{Sem}_i\text{-ECE} - \mathrm{ECE}^\star|$ is exponentially small, and $\mathrm{Sem}_1, \mathrm{Sem}_2$ are asymptotically indistinguishable.
- *Low-margin, not JDR* ($2\tilde\lambda^\star \leq \tilde m_q < \sqrt{\log K_q}$): [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) applies, so $\mathrm{Sem}_2$-ECE $<$ $\mathrm{Sem}_1$-ECE; but $g_B(\tilde\lambda_q) < 0$, so $\mathrm{Sem}_2$-ECE is *farther* from $\mathrm{ECE}^\star$ than $\mathrm{Sem}_1$-ECE. The smaller raw ECE is achieved by overshooting through $\mathrm{ECE}^\star$.
- *JDR* ($\tilde m_q < 2\tilde\lambda^\star$): both [Theorems 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) and [5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7) apply; $\mathrm{Sem}_2$ wins on both metrics.
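The three-way partition above can be phrased as a small classifier. The helper below is our own illustrative sketch, with boundary values taken from Theorems 5.1 and 5.7:

```python
import math

LAMBDA_STAR = 0.306          # unique positive root of g_B (Theorem 5.7)
M_STAR = 2 * LAMBDA_STAR     # JDR boundary on the standardized-margin axis

def regime(m_tilde, K):
    """Classify a question by standardized margin m_tilde and class count K."""
    if m_tilde ** 2 >= math.log(K):
        return "large-margin"            # Sem1 and Sem2 asymptotically indistinguishable
    if m_tilde < M_STAR:
        return "JDR"                     # Sem2 wins on raw ECE and oracle distance
    return "low-margin, not JDR"         # Sem2 has smaller raw ECE but overshoots the oracle

print(regime(0.3, 10))   # JDR
print(regime(1.0, 5))    # low-margin, not JDR
print(regime(2.0, 5))    # large-margin
```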
## 6 Experiments
We evaluate the Sem-ECE framework on three open-ended QA benchmarks across five frontier LLMs, with three goals: (i) verify the asymptotic predictions of [Section 5](https://arxiv.org/html/2605.08432#S5) on real data ([Sections 6.2](https://arxiv.org/html/2605.08432#S6.SS2) and [6.3](https://arxiv.org/html/2605.08432#S6.SS3)); (ii) compare against the verbalized-confidence baseline at the per-pair level ([Section 6.4](https://arxiv.org/html/2605.08432#S6.SS4)); and (iii) characterize what the resulting reliability diagrams reveal about semantic agreement versus factual accuracy ([Appendix F](https://arxiv.org/html/2605.08432#A6)). Per-model breakdowns, boundary-alignment numerics, and bootstrap protocols are in [Appendices C](https://arxiv.org/html/2605.08432#A3), [D](https://arxiv.org/html/2605.08432#A4) and [E](https://arxiv.org/html/2605.08432#A5).
### 6.1 Setup
**Datasets and models.** SimpleQA [[22]](https://arxiv.org/html/2605.08432#bib.bib7) (short-form factoid), HLE [[18]](https://arxiv.org/html/2605.08432#bib.bib18) (expert-level multidisciplinary), and PopQA [[13]](https://arxiv.org/html/2605.08432#bib.bib19) (long-tail entity-centric). We evaluate five commercial models accessed via their respective APIs: OpenAI gpt-5.4 [[17]](https://arxiv.org/html/2605.08432#bib.bib22), Anthropic claude-opus-4.6 [[1]](https://arxiv.org/html/2605.08432#bib.bib23), Google gemini-3.1-flash-lite-preview [[4]](https://arxiv.org/html/2605.08432#bib.bib24), xAI grok-4.20-0309 [[23]](https://arxiv.org/html/2605.08432#bib.bib25) (non-reasoning), and Mistral mistral-large-latest [[15]](https://arxiv.org/html/2605.08432#bib.bib26). The 15 model–benchmark pairs form the evaluation grid; we draw $n_{\max} = 50$ stochastic generations per question.
**Pipeline.** For each question we (i) generate $n_{\max}$ responses, (ii) cluster them into semantic answer classes via an LLM judge (gpt-5.4), and (iii) grade each response against the reference answer.
**Confidence sources.** Sem1: $\hat c_1 = \max_z \hat\pi_{[n_{\max}]}(z)$, computed on the full pool of $n_{\max} = 50$ generations. Sem2: $\hat c_2 = \hat\pi_E(\hat z_N)$ at $n = m = 25$, averaged over $R = 10$ random half-splits $(N, E)$ of the same pool (so the two estimators share underlying samples but use them differently). Verbalized confidence (Ver): elicited via "Confidence: X%", parsed from each of the $n_{\max}$ generations and averaged; parse failures imputed at $1.0$ [[20]](https://arxiv.org/html/2605.08432#bib.bib5), [[25]](https://arxiv.org/html/2605.08432#bib.bib21).
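Concretely, both semantic confidence sources reduce to frequency computations over the pooled class labels. The sketch below is our own minimal illustration of the half-split protocol; the paper's pipeline additionally clusters raw generations with an LLM judge before this step:

```python
import random
from collections import Counter

def sem1_confidence(labels):
    # c_1: modal class frequency over the full pool
    return Counter(labels).most_common(1)[0][1] / len(labels)

def sem2_confidence(labels, R=10, rng=None):
    # c_2: select the mode on half N, score its frequency on held-out half E,
    # averaged over R random half-splits of the same pool
    rng = rng or random.Random(0)
    half = len(labels) // 2
    vals = []
    for _ in range(R):
        pool = labels[:]
        rng.shuffle(pool)
        N, E = pool[:half], pool[half:]
        z_hat = Counter(N).most_common(1)[0][0]          # answer selected on N
        vals.append(sum(1 for z in E if z == z_hat) / len(E))
    return sum(vals) / R

labels = ["A"] * 30 + ["B"] * 15 + ["C"] * 5             # 50 clustered generations
print(sem1_confidence(labels))                            # 0.6
print(round(sem2_confidence(labels), 2))                  # near 0.6, typically a bit below
```

Separating selection (on $N$) from evaluation (on $E$) is exactly what removes the winner's-curse bias of the same-sample estimate.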
**Metrics.** $\mathrm{Sem}_i$-ECE and Ver-ECE with $L = 10$ equal-width bins ([Section 3](https://arxiv.org/html/2605.08432#S3)). Stratification by margin uses $\Delta_q$ as the regime axis; under the $p_q \to 1$ convention of [Section 3](https://arxiv.org/html/2605.08432#S3), $\tilde m_q = \sqrt{n}\,\Delta_q$, so the regime boundaries from [Theorems 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1) and [5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7) are visible directly on the $\Delta_q$-axis.
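The binned metric itself is a short computation. The following sketch implements the standard fixed-bin ECE with $L$ equal-width bins (our own code):

```python
def binned_ece(confidences, accuracies, L=10):
    """Fixed-bin ECE: weighted mean |accuracy - confidence| over L equal-width bins."""
    assert len(confidences) == len(accuracies)
    bins = [[] for _ in range(L)]
    for c, a in zip(confidences, accuracies):
        idx = min(int(c * L), L - 1)       # clip c = 1.0 into the top bin
        bins[idx].append((c, a))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)
        acc = sum(a for _, a in b) / len(b)
        ece += len(b) / n * abs(acc - conf)
    return ece

# All pairs land in the [0.7, 0.8) bin: mean confidence 0.75, mean accuracy 0.5
print(binned_ece([0.75, 0.75, 0.75, 0.75], [1, 0, 1, 0]))   # 0.25
```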
### 6.2 Asymptotic convergence
[Theorems 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1), [5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2), [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3) and [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4) predict that $\mathrm{Sem}_1$-ECE and $\mathrm{Sem}_2$-ECE converge to a common limit $\mathrm{ECE}^\star$ as $n \to \infty$. [Figure 2](https://arxiv.org/html/2605.08432#S6.F2) sub-samples each question's $n_{\max} = 50$ semantic-class assignments down to $n \in \{10, 20, 30, 40, 50\}$ and recomputes pooled $\mathrm{Sem}_i$-ECE on each benchmark. The two curves approach a common limit on every benchmark from *opposite sides*: $\mathrm{Sem}_1$ from above (positive Jensen bias), $\mathrm{Sem}_2$ from below (negative selection bias), which is the empirical signature of the bias decomposition [(9)](https://arxiv.org/html/2605.08432#S5.E9).
### 6.3 Sharp comparison: regime structure and rate
[Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) predicts that on $\mathcal{Q}_{\mathrm{low}}$ the direct ECE gap is, to leading order in $1/\sqrt{n}$, $\mathbb{E}_q[\sqrt{p_q}\,g_A(\tilde\lambda_q)]/\sqrt{n}$. Under the $p_q \to 1$ convention, the prediction at any fixed standardized margin $\tilde m$ collapses to a single number $\varphi(\tilde m)/\sqrt{n}$. We test this prediction along three axes: regime structure, leading constant, and convergence rate.
**Regime structure.** [Figure 3](https://arxiv.org/html/2605.08432#S6.F3) stratifies pooled $\mathrm{Sem}_1$-ECE and $\mathrm{Sem}_2$-ECE by per-question margin $\Delta_q$. The panels are divided into three regions by the JDR boundary ($\Delta_q \approx 0.087$) and the low/large boundary ($\Delta_q = \sqrt{\log K_q / n}$) from [Figure 1](https://arxiv.org/html/2605.08432#S5.F1)a. The empirical results match the theoretical predictions: separation is greatest below the JDR threshold, diminishes in the intermediate band, and vanishes above the low/large boundary.
**Leading constant and convergence rate.** With *no fitted constants*, the leading-order prediction $\varphi(\tilde m^\star)/\sqrt{n}$ recovers the empirical $\mathrm{Sem}_1$-ECE $-$ $\mathrm{Sem}_2$-ECE gap to within 11–27% at both regime boundaries on every benchmark, with empirical/theory ratios consistently above 1 at the JDR boundary and below 1 at the low/large boundary ([Appendix D](https://arxiv.org/html/2605.08432#A4)). On the low-margin sub-population, the gap shrinks at the predicted $n^{-1/2}$ rate, with fitted log-log slopes within 0.08 of $-0.50$ across all three benchmarks ([Figure 10](https://arxiv.org/html/2605.08432#A3.F10), [Appendix C](https://arxiv.org/html/2605.08432#A3)). The sign-consistent boundary residual and the steeper-than-$-0.50$ slope direction are both consistent with a subleading $O(1/n)$ Edgeworth correction.
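The rate check amounts to an ordinary least-squares slope on log-log axes. A sketch on synthetic $n^{-1/2}$ decay (hypothetical gap values; the actual analysis fits the empirical gaps of Appendix C):

```python
import math

def loglog_slope(ns, gaps):
    # ordinary least-squares slope of log(gap) on log(n)
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

ns = [10, 20, 30, 40, 50]
gaps = [0.08 / math.sqrt(n) for n in ns]   # synthetic n^{-1/2} decay
print(round(loglog_slope(ns, gaps), 3))    # -0.5
```

A fitted slope near $-0.5$ on the empirical gaps is what confirms the $n^{-1/2}$ prediction.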
Figure 2: Pooled $\mathrm{Sem}_1$-ECE (orange) and $\mathrm{Sem}_2$-ECE (blue) as functions of the per-question budget $n \in [10, 50]$. The two curves converge to a common limit on every benchmark from opposite sides, the empirical signature of [Theorems 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1) and [5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2) via [(9)](https://arxiv.org/html/2605.08432#S5.E9).

Figure 3: Pooled $\mathrm{Sem}_1$-ECE and $\mathrm{Sem}_2$-ECE stratified by per-question margin $\Delta_q$ on SimpleQA (left), HLE (middle), PopQA (right). The dashed red line marks the JDR boundary $\Delta_q = 2\tilde\lambda^\star/\sqrt{n}$ and the solid brown line the low/large boundary $\Delta_q = \sqrt{\log K_q / n}$, partitioning each panel into the three regions of [Figure 1](https://arxiv.org/html/2605.08432#S5.F1)a. The two metrics overlap above the low/large boundary and separate below it, with the largest gap below the JDR boundary, matching [Theorems 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1), [5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) and [5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7).
### 6.4 Cross-benchmark calibration
[Table 1](https://arxiv.org/html/2605.08432#S6.T1) reports per-pair binned ECE for the three confidence sources. $\mathrm{Sem}_2$-ECE is no larger than $\mathrm{Sem}_1$-ECE on *all 15 pairs*, an empirical demonstration of [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) extending beyond the strict low-margin regime. Among the three sources, $\mathrm{Sem}_2$ is lowest on 12 pairs, Ver on 3, and $\mathrm{Sem}_1$ on none. The strongest aggregate advantage of $\mathrm{Sem}_2$ appears on HLE, where it is the best-performing confidence source for all five providers. Pooled reliability diagrams appear in [Figure 4](https://arxiv.org/html/2605.08432#S6.F4); see [Appendix F](https://arxiv.org/html/2605.08432#A6) for detailed discussion. A paired bootstrap ($B = 1000$) confirms $\mathbb{E}[\hat c_1 - \hat c_2] > 0$ on all 15 pairs and the population-level ECE gap on 11 of 15 ([Appendix E](https://arxiv.org/html/2605.08432#A5)).
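The paired bootstrap on $\mathbb{E}[\hat c_1 - \hat c_2]$ can be sketched as follows; the per-question confidences here are synthetic stand-ins for the real $(\hat c_1, \hat c_2)$ pairs, with $\hat c_1$ systematically above $\hat c_2$ as the Jensen bias predicts:

```python
import random

def paired_bootstrap_positive(c1, c2, B=1000, seed=0):
    """Fraction of bootstrap replicates in which mean(c1 - c2) > 0,
    resampling question indices with replacement (paired)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(c1, c2)]
    n = len(diffs)
    hits = 0
    for _ in range(B):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n > 0:
            hits += 1
    return hits / B

# Synthetic per-question confidences for 200 questions
rng = random.Random(1)
c2 = [rng.uniform(0.3, 0.9) for _ in range(200)]
c1 = [min(1.0, c + rng.uniform(0.0, 0.1)) for c in c2]
print(paired_bootstrap_positive(c1, c2))   # ~1.0: the positive gap is stable
```

Pairing the resamples (same question indices for both estimators) is what makes the test sensitive to the systematic per-question gap rather than to between-question variance.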
Table 1: Per-pair binned ECE with $L = 10$ equal-width bins. Lower ECE is better. Between $\mathrm{Sem}_1$ and $\mathrm{Sem}_2$, the winning cell is highlighted in pink and the loser in blue; Ver-ECE appears as a grey baseline. Across all three sources, the smallest ECE in each row is in bold and the second-smallest is underlined. $N$ is the number of jointly clustered and graded questions; $\mathit{Acc}$ is per-pair accuracy. $\mathrm{Sem}_2$-ECE $\leq$ $\mathrm{Sem}_1$-ECE in all 15 pairs, and $\mathrm{Sem}_2$ is the lowest of the three sources in 12 of 15.

**The verbalized exception is Sem-ECE's audit value.** The 3 cells where Ver beats Sem (Anthropic on SimpleQA/PopQA, Mistral on PopQA) share the same pattern: high agreement consistency yet low accuracy, so $\mathbb{E}_q[c_q^\star] \gg \bar a$ (e.g., Anthropic on SimpleQA: $\bar a = 0.482$ vs. a mean $\mathrm{Sem}_1$ confidence of $0.835$). Verbalized self-moderation happens to land closer to $\bar a$ in these cells, but a practitioner relying on Ver alone has no external reference to detect such miscalibration. Sem-ECE supplies that reference: it depends only on sample frequencies and external accuracy judgments, placing all providers on the same footing without trusting any model's self-report. It complements Ver rather than replacing it.
Figure 4: Reliability diagrams pooled across models on SimpleQA (left), HLE (middle), PopQA (right). Pooled ECE values appear in each legend; the dashed diagonal is perfect calibration. $\mathrm{Sem}_2$ achieves the lowest pooled ECE on every benchmark.
## 7 Conclusion
We introduced Sem-ECE, a calibration evaluation framework for black-box open-ended QA that turns repeated free-form generations into semantic answer classes and uses the resulting frequencies as confidence. Within this framework, we studied two estimators: $\mathrm{Sem}_1$-ECE, the standard same-sample self-consistency score, and $\mathrm{Sem}_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We proved both are asymptotically unbiased, and further showed that they agree on easy questions but diverge on hard ones, with $\mathrm{Sem}_2$ achieving strictly smaller calibration error, so the $\mathrm{Sem}_1$–$\mathrm{Sem}_2$ gap also serves as a diagnostic for question difficulty. Experiments on three challenging open-ended QA benchmarks across five leading commercial LLMs confirm these predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
## Acknowledgment
This work was supported in part by NIH grant P30 AG073105 and a PSOM AI2D Seeding Project\.
## References
- [1] Anthropic (2025). Claude models. [https://docs.claude.com/en/docs/about-claude/models](https://docs.claude.com/en/docs/about-claude/models). Accessed 2026-01.
- [2] G. W. Brier (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), pp. 1–3.
- [3] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, pp. 625–630.
- [4] Google DeepMind (2025). Gemini API models. [https://ai.google.dev/gemini-api/docs/models](https://ai.google.dev/gemini-api/docs/models). Accessed 2026-01.
- [5] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, PMLR vol. 70, pp. 1321–1330.
- [6] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Clark, N. Joseph, B. Mann, S. McCandlish, C. Olah, and J. Kaplan (2022). Language models (mostly) know what they know. arXiv:2207.05221.
- [7] J. H. Kagel and D. Levin (2002). Common Value Auctions and the Winner's Curse. Princeton University Press, Princeton, NJ.
- [8] L. Kuhn, Y. Gal, and S. Farquhar (2023). Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations.
- [9] M. Kull, M. Perello-Nieto, M. Kängsepp, T. Silva Filho, H. Song, and P. Flach (2019). Beyond temperature scaling: obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems, vol. 32.
- [10] A. Kumar, P. Liang, and T. Ma (2019). Verified uncertainty calibration. In Advances in Neural Information Processing Systems, vol. 32.
- [11] S. C. Lin, J. Hilton, and O. Evans (2022). Teaching models to express their uncertainty in words. Transactions on Machine Learning Research.
- [12] Q. Lyu, K. Shridhar, C. Malaviya, L. Zhang, Y. Elazar, N. Tandon, M. Apidianaki, M. Sachan, and C. Callison-Burch (2025). Calibrating large language models with sample consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 19260–19268.
- [13] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023). When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 9802–9822.
- [14] S. J. Mielke, A. Szlam, E. Dinan, and Y. Boureau (2022). Reducing conversational agents' overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10, pp. 857–872.
- [15] Mistral AI (2024). Mistral Large. [https://docs.mistral.ai/getting-started/models/models_overview/](https://docs.mistral.ai/getting-started/models/models_overview/). Accessed 2026-01.
- [16] M. P. Naeini, G. F. Cooper, and M. Hauskrecht (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2901–2907.
- [17] OpenAI (2025). GPT-5 models. [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models). Accessed 2026-01.
- [18] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, S. Shi, et al. (2025). Humanity's Last Exam. arXiv:2501.14249.
- [19] J. C. Platt (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74.
- [20] K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. D. Manning (2023). Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 5433–5442.
- [21] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023). Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations.
- [22] J. Wei, K. Nguyen, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus (2024). Measuring short-form factuality in large language models. arXiv:2411.04368.
- [23] xAI (2025). Grok models. [https://docs.x.ai/docs/models](https://docs.x.ai/docs/models). Accessed 2026-01.
- [24] J. Xiao, B. Hou, Z. Wang, R. Jin, Q. Long, W. J. Su, and L. Shen (2025). Restoring calibration for aligned large language models: a calibration-aware fine-tuning approach. In Proceedings of the 42nd International Conference on Machine Learning, PMLR vol. 267, pp. 68364–68390.
- [25] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi (2024). Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations.
## Appendix A Proofs
### A.1 Proof of [Theorem 5.1](https://arxiv.org/html/2605.08432#S5.Thmtheorem1)
Fix $q$ with $\Delta_q > 0$. Write $D_k := \hat\pi_N(k) - \pi_{q,k}$ for the centered empirical PMF on $N$, and similarly $D'_k$ for $E$.
#### Hoeffding bound.
Each $\hat\pi_N(k) - \pi_{q,k}$ is a sample mean of $[0,1]$-bounded i.i.d. Bernoulli variables, hence $1/(2\sqrt{n})$-sub-Gaussian by Hoeffding's lemma. The standard sub-Gaussian maximum bound gives
$$\mathbb{E}\big[\max_k |D_k|\big] \leq \frac{1}{2\sqrt{n}}\sqrt{2\log(2K_q)} = \sqrt{\tfrac{\log(2K_q)}{2n}}. \tag{13}$$
*For $i = 1$:* the map $(x_1, \dots, x_K) \mapsto \max_k x_k$ is $1$-Lipschitz in $\ell_\infty$, so $|\hat c_1 - c_q^\star| \leq \max_k |D_k|$. Taking expectations and using [(13)](https://arxiv.org/html/2605.08432#A1.E13) bounds $|\mathbb{E}[\hat c_1 - c_q^\star \mid q]|$ by $\sqrt{\log(2K_q)/(2n)}$.
*For $i = 2$:* by conditional unbiasedness [(4)](https://arxiv.org/html/2605.08432#S4.E4), $\mathbb{E}[\hat c_2 \mid q] = \mathbb{E}[\pi_{q,\hat z_N} \mid q]$. Since $\hat\pi_N(\hat z_N) \geq \hat\pi_N(z_q^\star)$ by definition of $\hat z_N$,
$$\pi_{q,\hat z_N} \geq \hat\pi_N(\hat z_N) - \max_k |D_k| \geq \hat\pi_N(z_q^\star) - \max_k |D_k| \geq p_q - 2\max_k |D_k|,$$
so $0 \leq c_q^\star - \pi_{q,\hat z_N} \leq 2\max_k |D_k|$. Hence $|\mathbb{E}[\hat c_2 - c_q^\star \mid q]| \leq 2\sqrt{\log(2K_q)/(2n)}$, which absorbs into the same Hoeffding rate up to a universal constant.
#### Bernstein bound.
For both $i \in \{1,2\}$, we show $|\mathbb{E}[\hat c_i - c_q^\star \mid q]| \leq \Pr(\hat z_N \neq z_q^\star \mid q)$.
*For $i = 1$:* the increment representation $\hat c_1 = \hat\pi_N(z_q^\star) + (\hat\pi_N(\hat z_N) - \hat\pi_N(z_q^\star))$ combined with $\mathbb{E}[\hat\pi_N(z_q^\star)] = c_q^\star$ gives
$$0 \leq \mathbb{E}[\hat c_1 - c_q^\star \mid q] = \mathbb{E}\big[(\hat\pi_N(\hat z_N) - \hat\pi_N(z_q^\star))\,\mathbf{1}_{\hat z_N \neq z_q^\star}\big] \leq \Pr(\hat z_N \neq z_q^\star \mid q),$$
since the increment lies in $[0,1]$.
*For $i = 2$:* $\mathbb{E}[\hat c_2 - c_q^\star \mid q] = -\mathbb{E}[(c_q^\star - \pi_{q,\hat z_N})\mathbf{1}_{\hat z_N \neq z_q^\star}]$ and $0 \leq c_q^\star - \pi_{q,\hat z_N} \leq 1$, so the same bound applies.
By the union bound, $\Pr(\hat z_N \neq z_q^\star \mid q) \leq \sum_{k \neq z_q^\star} \Pr(\hat\pi_N(k) \geq \hat\pi_N(z_q^\star) \mid q)$. For each $k \neq z_q^\star$, define $\xi_i^{(k)} := \mathbf{1}_{Z_i = z_q^\star} - \mathbf{1}_{Z_i = k}$, so $\mathbb{E}[\xi^{(k)}] = \pi_{q,z_q^\star} - \pi_{q,k} \geq \Delta_q$, $|\xi^{(k)}| \leq 1$, and $\mathrm{Var}(\xi^{(k)}) \leq p_q + \pi_{q,k} \leq 2p_q$. Bernstein's inequality yields
$$\Pr\!\Big(\tfrac{1}{n}\sum_{i=1}^n \xi_i^{(k)} \leq 0 \,\Big|\, q\Big) \leq \exp\!\left(-\frac{n\Delta_q^2/2}{2p_q + \Delta_q/3}\right) \leq \exp\!\left(-\frac{n\Delta_q^2}{4p_q + 1}\right).$$
Summing over the $K_q - 1$ runners-up gives the Bernstein-type bound, which (after absorbing the small constant $\Delta_q/3 \leq 1/3$ in the denominator) is dominated by $(K_q - 1)\exp(-n\Delta_q^2/(2p_q))$ up to universal constants. Combining with the Hoeffding bound proves [(6)](https://arxiv.org/html/2605.08432#S5.E6). ∎
### A.2 Proof of [Theorem 5.2](https://arxiv.org/html/2605.08432#S5.Thmtheorem2)
###### Proof.
Using the equivalent representation of fixed-bin ECE,
$$\operatorname{ECE}(c, a) = \sum_{\ell=1}^L \left|\mathbb{E}\left[(a - c)\,\mathbf{1}\{C \in \mathcal{I}_\ell\}\right]\right|,$$
the reverse triangle inequality gives
$$\left|\operatorname{ECE}(\hat c, \hat a) - \operatorname{ECE}(c^\star, a^\star)\right| \leq \sum_{\ell=1}^L \mathbb{E}\left|(\hat a - \hat c)\mathbf{1}\{\hat c \in \mathcal{I}_\ell\} - (a^\star - c^\star)\mathbf{1}\{c^\star \in \mathcal{I}_\ell\}\right|.$$
Split according to whether $\hat c$ and $c^\star$ fall in the same bin. On the event $\operatorname{bin}(\hat c) = \operatorname{bin}(c^\star)$, only one bin contributes, and
\|\(a^−c^\)−\(a⋆−c⋆\)\|≤\|a^−a⋆\|\+\|c^−c⋆\|\.\|\(\\hat\{a\}\-\\hat\{c\}\)\-\(a^\{\\star\}\-c^\{\\star\}\)\|\\leq\|\\hat\{a\}\-a^\{\\star\}\|\+\|\\hat\{c\}\-c^\{\\star\}\|\.On the eventbin\(c^\)≠bin\(c⋆\)\\operatorname\{bin\}\(\\hat\{c\}\)\\neq\\operatorname\{bin\}\(c^\{\\star\}\), at most two bins contribute, and each contribution is bounded by one\. Therefore
\|ECE\(c^,a^\)−ECE\(c⋆,a⋆\)\|≤𝔼\|a^−a⋆\|\+𝔼\|c^−c⋆\|\+2ℙ\{bin\(c^\)≠bin\(c⋆\)\}\.\\left\|\\operatorname\{ECE\}\(\\hat\{c\},\\hat\{a\}\)\-\\operatorname\{ECE\}\(c^\{\\star\},a^\{\\star\}\)\\right\|\\leq\\mathbb\{E\}\|\\hat\{a\}\-a^\{\\star\}\|\+\\mathbb\{E\}\|\\hat\{c\}\-c^\{\\star\}\|\+2\\mathbb\{P\}\\\{\\operatorname\{bin\}\(\\hat\{c\}\)\\neq\\operatorname\{bin\}\(c^\{\\star\}\)\\\}\.Sincean,a⋆∈\{0,1\}a\_\{n\},a^\{\\star\}\\in\\\{0,1\\\},
𝔼\|a^−a⋆\|=ℙ\(a^≠a⋆\)=δn,\\mathbb\{E\}\|\\hat\{a\}\-a^\{\\star\}\|=\\mathbb\{P\}\(\\hat\{a\}\\neq a^\{\\star\}\)=\\delta\_\{n\},and by definition
𝔼\|c^−c⋆\|=εn\.\\mathbb\{E\}\|\\hat\{c\}\-c^\{\\star\}\|=\\varepsilon\_\{n\}\.It remains to bound the bin\-crossing probability\. Ifbin\(c^\)≠bin\(c⋆\)\\operatorname\{bin\}\(\\hat\{c\}\)\\neq\\operatorname\{bin\}\(c^\{\\star\}\), then either\|c^−c⋆\|\>η\|\\hat\{c\}\-c^\{\\star\}\|\>\\etaorc⋆c^\{\\star\}lies within distanceη\\etaof a bin boundary\. Hence
ℙ\{bin\(c^\)≠bin\(c⋆\)\}≤ℙ\{\|c^−c⋆\|\>η\}\+ℙ\{dist\(c⋆,𝒯\)≤η\}\.\\mathbb\{P\}\\\{\\operatorname\{bin\}\(\\hat\{c\}\)\\neq\\operatorname\{bin\}\(c^\{\\star\}\)\\\}\\leq\\mathbb\{P\}\\\{\|\\hat\{c\}\-c^\{\\star\}\|\>\\eta\\\}\+\\mathbb\{P\}\\\{\\operatorname\{dist\}\(c^\{\\star\},\\mathcal\{T\}\)\\leq\\eta\\\}\.By Markov’s inequality,
ℙ\{\|c^−c⋆\|\>η\}≤εnη\.\\mathbb\{P\}\\\{\|\\hat\{c\}\-c^\{\\star\}\|\>\\eta\\\}\\leq\\frac\{\\varepsilon\_\{n\}\}\{\\eta\}\.This proves the first claim\.
Ifc⋆c^\{\\star\}has density bounded byMM, then theη\\eta\-neighborhood of theL−1L\-1bin boundaries has total length at most2\(L−1\)η2\(L\-1\)\\eta, so
ℙ\{dist\(c⋆,𝒯\)≤η\}≤2M\(L−1\)η\.\\mathbb\{P\}\\\{\\operatorname\{dist\}\(c^\{\\star\},\\mathcal\{T\}\)\\leq\\eta\\\}\\leq 2M\(L\-1\)\\eta\.Therefore
\|ECE\(c^,a^\)−ECE\(c⋆,a⋆\)\|≤δn\+εn\+2\{εnη\+2M\(L−1\)η\}\.\\left\|\\operatorname\{ECE\}\(\\hat\{c\},\\hat\{a\}\)\-\\operatorname\{ECE\}\(c^\{\\star\},a^\{\\star\}\)\\right\|\\leq\\delta\_\{n\}\+\\varepsilon\_\{n\}\+2\\left\\\{\\frac\{\\varepsilon\_\{n\}\}\{\\eta\}\+2M\(L\-1\)\\eta\\right\\\}\.Optimizing overη\\etagives
2\{εnη\+2M\(L−1\)η\}≤42M\(L−1\)εn\.2\\left\\\{\\frac\{\\varepsilon\_\{n\}\}\{\\eta\}\+2M\(L\-1\)\\eta\\right\\\}\\leq 4\\sqrt\{2M\(L\-1\)\\varepsilon\_\{n\}\}\.This completes the proof\. ∎
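As a concrete companion to the binned representation above, the sketch below (stdlib Python only; the synthetic confidence/accuracy data is purely illustrative) computes the fixed-bin ECE as $\sum_\ell |\mathbb{E}[(a-c)\mathbf{1}\{C \in \mathcal{I}_\ell\}]|$ and checks that a small perturbation of the confidences moves the estimate only slightly, in the spirit of the stability bound just proved.

```python
import random

def fixed_bin_ece(conf, acc, L=10):
    """ECE(c,a) = sum_l | E[(a - c) 1{c in I_l}] | over L equal-width bins."""
    sums = [0.0] * L
    n = len(conf)
    for c, a in zip(conf, acc):
        l = min(int(c * L), L - 1)   # bin index; c = 1 falls into the last bin
        sums[l] += (a - c) / n
    return sum(abs(s) for s in sums)

random.seed(1)
c_star = [random.random() for _ in range(5000)]
a_star = [1 if random.random() < c else 0 for c in c_star]   # perfectly calibrated
# perturb confidences by small noise, mimicking c_hat vs. c_star
c_hat = [min(1.0, max(0.0, c + random.gauss(0, 0.01))) for c in c_star]
gap = abs(fixed_bin_ece(c_hat, a_star) - fixed_bin_ece(c_star, a_star))
print(gap < 0.1)   # the change in binned ECE is small for a small perturbation
```

The perturbation scale (0.01) and bin count are arbitrary choices for the demonstration, not values from the paper.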
### A.3 Proof of Theorem [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3)
Before proving Theorem [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3), we first record a standard pairwise comparison bound. Throughout this section, fix a question $q$ and, without loss of generality, relabel the population semantic mode as

$$z_q^\star = 1, \qquad c_q^\star = \pi_{q,1}.$$

Let

$$\pi_{q,(2)} := \max_{k \neq 1} \pi_{q,k}, \qquad \Delta_q := \pi_{q,1} - \pi_{q,(2)}, \qquad p_q := \pi_{q,1} + \pi_{q,(2)}.$$

We assume $\Delta_q > 0$ for $Q$-almost every $q$.
###### Lemma A.1 (Selection error).

For every fixed $q$ with $\Delta_q > 0$,

$$\mathbb{P}(\hat z_N \neq z_q^\star \mid q) \le (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$
###### Proof.

For any $k \neq 1$, the event $\hat z_N = k$ implies

$$\hat\pi_N(k) \ge \hat\pi_N(1).$$

Equivalently,

$$\frac{1}{n}\sum_{i \in N}\xi_i^{(k)} \le 0, \qquad \xi_i^{(k)} := \mathbf{1}\{Z_i = 1\} - \mathbf{1}\{Z_i = k\}.$$

Now

$$\mathbb{E}[\xi_i^{(k)} \mid q] = \pi_{q,1} - \pi_{q,k} \ge \Delta_q,$$

and the random variable $\xi_i^{(k)}$ is supported on $\{-1,0,1\}$ with variance controlled by the total top-versus-$k$ probability mass:

$$\operatorname{Var}(\xi_i^{(k)} \mid q) \le \mathbb{E}[(\xi_i^{(k)})^2 \mid q] = \pi_{q,1} + \pi_{q,k} \le p_q.$$

Applying the Bernstein–Chernoff bound for this three-valued comparison statistic gives

$$\mathbb{P}\left(\hat\pi_N(k) \ge \hat\pi_N(1) \mid q\right) \le \exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$

Taking a union bound over all $k \neq 1$ yields

$$\mathbb{P}(\hat z_N \neq z_q^\star \mid q) \le \sum_{k \neq 1}\mathbb{P}\left(\hat\pi_N(k) \ge \hat\pi_N(1) \mid q\right) \le (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$

∎
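The lemma's exponential bound is easy to sanity-check numerically. The stdlib-Python sketch below (the three-class answer distribution is illustrative, not from the paper) Monte Carlo estimates the mode-selection error and verifies that it stays below $(K_q-1)\exp(-n\Delta_q^2/(2p_q))$.

```python
import math, random

random.seed(0)

def mode_error_rate(pi, n, trials=2000):
    """Monte Carlo estimate of P(empirical mode != true mode) for class probs pi."""
    errors = 0
    for _ in range(trials):
        counts = [0] * len(pi)
        for _ in range(n):
            u, acc = random.random(), 0.0
            for k, pk in enumerate(pi):
                acc += pk
                if u < acc:
                    counts[k] += 1
                    break
        # class 0 is the population mode; ties against it count as errors,
        # matching the >= comparison in the lemma
        if max(counts[1:]) >= counts[0]:
            errors += 1
    return errors / trials

pi = [0.5, 0.3, 0.2]          # pi_{q,1} = 0.5, runner-up pi_{q,(2)} = 0.3
n = 50
delta = pi[0] - pi[1]         # margin Delta_q = 0.2
p = pi[0] + pi[1]             # p_q = 0.8
bound = (len(pi) - 1) * math.exp(-n * delta**2 / (2 * p))
emp = mode_error_rate(pi, n)
print(emp <= bound)           # the Monte Carlo error rate sits below the bound
```

As expected for a Bernstein-type bound, the empirical rate is well below the bound rather than tight against it.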
###### Proof of Theorem [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3).

Let

$$\hat c = \hat c_1 = \max_{k \in \mathcal{Z}_q}\hat\pi_N(k), \qquad \hat a = Y_q(\hat z_N), \qquad a_q^\star = Y_q(z_q^\star).$$

We first bound the confidence error

$$\varepsilon_n = \mathbb{E}|\hat c_1 - c_q^\star|.$$

For fixed $q$, write

$$D_k := \hat\pi_N(k) - \pi_{q,k}.$$

Since the maximum map is $1$-Lipschitz with respect to the $\ell_\infty$ norm,

$$|\hat c_1 - c_q^\star| = \left|\max_k \hat\pi_N(k) - \max_k \pi_{q,k}\right| \le \max_k |D_k|.$$

Each $D_k$ is the centered average of $n$ Bernoulli variables and is $1/(2\sqrt{n})$-sub-Gaussian. Therefore,

$$\mathbb{E}\left[\max_k |D_k| \mid q\right] \le \sqrt{\frac{\log(2K_q)}{2n}}.$$

Thus

$$\mathbb{E}\left[|\hat c_1 - c_q^\star| \mid q\right] \le \sqrt{\frac{\log(2K_q)}{2n}}.$$

We next derive a margin-sensitive bound. On the event $\{\hat z_N = z_q^\star\}$, we have

$$\hat c_1 = \hat\pi_N(z_q^\star) = \hat\pi_N(1),$$

and hence

$$|\hat c_1 - c_q^\star| = |\hat\pi_N(1) - \pi_{q,1}|.$$

On the complement $\{\hat z_N \neq z_q^\star\}$, the trivial bound $|\hat c_1 - c_q^\star| \le 1$ gives

$$\mathbb{E}\left[|\hat c_1 - c_q^\star| \mid q\right] \le \mathbb{E}\left[|\hat\pi_N(1) - \pi_{q,1}| \mid q\right] + \mathbb{P}(\hat z_N \neq z_q^\star \mid q).$$

Since $\hat\pi_N(1)$ is the average of $n$ Bernoulli variables with success probability $\pi_{q,1}$, Jensen's inequality gives

$$\mathbb{E}\left[|\hat\pi_N(1) - \pi_{q,1}| \mid q\right] \le \sqrt{\operatorname{Var}(\hat\pi_N(1) \mid q)} = \sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{n}}.$$

Combining this with the selection-error lemma yields

$$\mathbb{E}\left[|\hat c_1 - c_q^\star| \mid q\right] \le \sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{n}} + (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$

Together with the uniform Hoeffding bound, we obtain

$$\mathbb{E}\left[|\hat c_1 - c_q^\star| \mid q\right] \le \min\left\{\sqrt{\frac{\log(2K_q)}{2n}},\; \sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{n}} + (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right\}.$$

Taking expectation over $q \sim Q$ gives

$$\varepsilon_n \le \mathbb{E}_q\left[\min\left\{\sqrt{\frac{\log(2K_q)}{2n}},\; \sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{n}} + (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right\}\right].$$

It remains to bound the selection error

$$\delta_n = \mathbb{P}(\hat a \neq a_q^\star).$$

Since $\hat a = Y_q(\hat z_N)$ and $a_q^\star = Y_q(z_q^\star)$,

$$\{\hat a \neq a_q^\star\} \subseteq \{\hat z_N \neq z_q^\star\}.$$

Therefore, by the selection-error lemma,

$$\mathbb{P}(\hat a \neq a_q^\star \mid q) \le (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$

Taking expectation over $q \sim Q$ gives

$$\delta_n \le \mathbb{E}_q\left[(K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right].$$

This proves Theorem [5.3](https://arxiv.org/html/2605.08432#S5.Thmtheorem3). ∎
### A.4 Proof of Theorem [5.4](https://arxiv.org/html/2605.08432#S5.Thmtheorem4)

###### Proof.
Let

$$\hat c = \hat c_2 = \hat\pi_E(\hat z_N), \qquad \hat a = Y_q(\hat z_N), \qquad a_q^\star = Y_q(z_q^\star).$$

Again relabel $z_q^\star = 1$, so that $c_q^\star = \pi_{q,1}$.

For fixed $q$, decompose

$$|\hat c_2 - c_q^\star| = |\hat\pi_E(\hat z_N) - \pi_{q,1}| \le |\hat\pi_E(\hat z_N) - \pi_{q,\hat z_N}| + |\pi_{q,\hat z_N} - \pi_{q,1}|.$$

We control the two terms separately.

For the first term, condition on $q$ and $\hat z_N$. Since the evaluation block $E$ is independent of the selection block $N$,

$$\hat\pi_E(\hat z_N) = \frac{1}{m}\sum_{i \in E}\mathbf{1}\{Z_i = \hat z_N\}$$

is, conditionally on $(q, \hat z_N)$, the average of $m$ Bernoulli variables with success probability $\pi_{q,\hat z_N}$. Hence

$$\mathbb{E}\left[|\hat\pi_E(\hat z_N) - \pi_{q,\hat z_N}| \mid q, \hat z_N\right] \le \sqrt{\frac{\pi_{q,\hat z_N}(1-\pi_{q,\hat z_N})}{m}}.$$

For every $k$, one has

$$\pi_{q,k}(1-\pi_{q,k}) \le \pi_{q,1}(1-\pi_{q,1}).$$

Indeed, this is immediate for $k = 1$. For $k \neq 1$, since $\pi_{q,k} \le \pi_{q,1}$ and $\pi_{q,k} \le 1 - \pi_{q,1}$, the Bernoulli variance of class $k$ is no larger than that of the modal class. Therefore,

$$\mathbb{E}\left[|\hat\pi_E(\hat z_N) - \pi_{q,\hat z_N}| \mid q\right] \le \sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{m}}.$$

For the second term,

$$|\pi_{q,\hat z_N} - \pi_{q,1}| = 0 \quad \text{on } \{\hat z_N = z_q^\star\},$$

and it is at most $1$ otherwise. Hence

$$\mathbb{E}\left[|\pi_{q,\hat z_N} - \pi_{q,1}| \mid q\right] \le \mathbb{P}(\hat z_N \neq z_q^\star \mid q).$$

Using the selection-error lemma,

$$\mathbb{E}\left[|\pi_{q,\hat z_N} - \pi_{q,1}| \mid q\right] \le (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$

Combining the two bounds gives

$$\mathbb{E}\left[|\hat c_2 - c_q^\star| \mid q\right] \le \sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{m}} + (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right).$$

Taking expectation over $q \sim Q$ yields

$$\varepsilon_n \le \mathbb{E}_q\left[\sqrt{\frac{\pi_{q,1}(1-\pi_{q,1})}{m}} + (K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right].$$

The selection error is exactly the same as in Theorem 5.3, because $\mathrm{Sem}_1$ and $\mathrm{Sem}_2$ use the same selected answer $\hat z_N$. Namely,

$$\{\hat a \neq a_q^\star\} \subseteq \{\hat z_N \neq z_q^\star\},$$

so

$$\delta_n \le \mathbb{E}_q\left[(K_q - 1)\exp\!\left(-\frac{n\Delta_q^2}{2p_q}\right)\right].$$

This completes the proof. ∎
### A.5 Proof of the bias expansions ([9](https://arxiv.org/html/2605.08432#S5.E9))

Fix $q \in \mathcal{Q}_{\mathrm{low}}$ and let $z_q^{(2)} := \operatorname{arg\,max}_{k \neq z_q^\star}\pi_{q,k}$ denote the runner-up class. Throughout we work in the local regime $\Delta_q = O(n^{-1/2})$, so that $\tilde m_q$ is bounded.
#### Reduction to the top-two classes.

Let $\mathcal{E}_q := \{\hat z_N \in \{z_q^\star, z_q^{(2)}\}\}$. By the Bernstein argument in [Section A.1](https://arxiv.org/html/2605.08432#A1.SS1) applied to each runner-up $j \notin \{z_q^\star, z_q^{(2)}\}$ (which has gap $\Delta_q^{(j)} \ge \Delta_q$),

$$\Pr(\mathcal{E}_q^c \mid q) \;\le\; (K_q - 2)\exp\!\big(-c\,n\Delta_q^2/p_q\big).$$

In the regime $\tilde m_q^2 < \log K_q$ this is $o(n^{-1/2})$ uniformly, and it contributes only $o(n^{-1/2})$ to $\mathbb{E}[\hat c_i - c_q^\star \mid q]$ since $|\hat c_i - c_q^\star| \le 1$. We may therefore work conditionally on $\mathcal{E}_q$.
#### Local CLT on the boundary statistic.

Define $V := \hat\pi_N(z_q^\star) - \hat\pi_N(z_q^{(2)})$. By the bivariate multinomial CLT,

$$\sqrt{n}\,(V - \Delta_q) \;\Rightarrow\; \mathcal{N}(0, \tau_q^2), \qquad \tau_q^2 := p_q + p_q^{(2)} - \Delta_q^2 \;=\; p_q\,(1 + o(1))$$

in the low-margin regime (under the usual normalization where $\sqrt{p_q}$ in ([9](https://arxiv.org/html/2605.08432#S5.E9)) denotes the per-class boundary scale). Standardize: $V = \sqrt{p_q/n}\,(\tilde m_q + Y_n)$ with $Y_n \Rightarrow \mathcal{N}(0,1)$ uniformly.
#### $\mathrm{Sem}_1$ bias (Jensen gap).

On $\mathcal{E}_q$, $\hat c_1 = \max(\hat\pi_N(z_q^\star), \hat\pi_N(z_q^{(2)}))$, hence

$$\hat c_1 - \hat\pi_N(z_q^\star) \;=\; V^- \;=\; \max(-V, 0).$$

Since $\mathbb{E}[\hat\pi_N(z_q^\star) \mid q] = c_q^\star$,

$$\mathbb{E}[\hat c_1 - c_q^\star \mid q] \;=\; \mathbb{E}[V^-] \;=\; \sqrt{p_q/n}\;\mathbb{E}\big[(\tilde m_q + Y_n)^-\big] \;\to\; \sqrt{p_q/n}\;\mathbb{E}\big[(Y - \tilde m_q)^+\big],$$

using the symmetry $-Y \stackrel{d}{=} Y$ for $Y \sim \mathcal{N}(0,1)$. Applying the standard formula $\mathbb{E}[(Y - \mu)^+] = \varphi(\mu) - \mu\,\Phi(-\mu)$,

$$\mathbb{E}[\hat c_1 - c_q^\star \mid q] \;=\; \tfrac{1}{\sqrt{n}}\sqrt{p_q}\,\big[\varphi(\tilde m_q) - \tilde m_q\,\Phi(-\tilde m_q)\big] + o(n^{-1/2}) \;=\; \tfrac{1}{\sqrt{n}}\sqrt{p_q}\,J(\tilde\lambda_q) + o(n^{-1/2}),$$

since $\tilde m_q = 2\tilde\lambda_q$ and $J(\tilde\lambda) = \varphi(2\tilde\lambda) - 2\tilde\lambda\,\Phi(-2\tilde\lambda)$.
#### $\mathrm{Sem}_2$ bias (selection gap).

By conditional unbiasedness ([4](https://arxiv.org/html/2605.08432#S4.E4)), $\mathbb{E}[\hat c_2 \mid q, \hat z_N] = \pi_{q,\hat z_N}$. On $\mathcal{E}_q$, $\pi_{q,\hat z_N}$ takes the value $c_q^\star$ on $\{\hat z_N = z_q^\star\}$ and the value $c_q^\star - \Delta_q$ on $\{\hat z_N = z_q^{(2)}\}$. Therefore

$$\mathbb{E}[\hat c_2 - c_q^\star \mid q, \mathcal{E}_q] \;=\; -\Delta_q\,\Pr(\hat z_N = z_q^{(2)} \mid q, \mathcal{E}_q) \;=\; -\Delta_q\,\Pr(V \le 0 \mid q, \mathcal{E}_q).$$

By the local CLT, $\Pr(V \le 0) \to \Phi(-\tilde m_q)$, and substituting $\Delta_q = \sqrt{p_q/n}\,\tilde m_q$,

$$\mathbb{E}[\hat c_2 - c_q^\star \mid q] \;=\; -\tfrac{1}{\sqrt{n}}\sqrt{p_q}\,\tilde m_q\,\Phi(-\tilde m_q) + o(n^{-1/2}) \;=\; -\tfrac{1}{\sqrt{n}}\sqrt{p_q}\,S(\tilde\lambda_q) + o(n^{-1/2}),$$

since $S(\tilde\lambda) = 2\tilde\lambda\,\Phi(-2\tilde\lambda) = \tilde m_q\,\Phi(-\tilde m_q)$. ∎
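Both expansions can be checked by simulation in the simplest low-margin case, $K_q = 2$ with $\pi_{q,1} = 0.51$ and $n = m = 100$ (illustrative values, not from the paper). The stdlib-Python sketch below compares the Monte Carlo biases of $\hat c_1$ and $\hat c_2$ against $\sqrt{p_q/n}\,J$ and $-\sqrt{p_q/n}\,S$:

```python
import math, random

random.seed(0)

def phi(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)   # N(0,1) pdf
def Phi(x): return 0.5 * (1 + math.erf(x / math.sqrt(2)))          # N(0,1) cdf

p1, n, m, trials = 0.51, 100, 100, 20000   # two classes, margin Delta_q = 0.02
bias1 = bias2 = 0.0
for _ in range(trials):
    sel = sum(random.random() < p1 for _ in range(n)) / n    # selection block freq
    held = sum(random.random() < p1 for _ in range(m)) / m   # evaluation block freq
    c1 = max(sel, 1 - sel)                      # Sem1: same-sample max frequency
    c2 = held if sel >= 0.5 else 1 - held       # Sem2: held-out freq of selected class
    bias1 += (c1 - p1) / trials
    bias2 += (c2 - p1) / trials

delta, p = 2 * p1 - 1, 1.0           # Delta_q = 0.02; p_q ~ 1 in this two-class case
m_t = delta * math.sqrt(n / p)       # standardized margin, here 0.2
J = phi(m_t) - m_t * Phi(-m_t)       # Jensen-gap constant
S = m_t * Phi(-m_t)                  # selection-gap constant
print(abs(bias1 - math.sqrt(p / n) * J) < 0.01)   # Sem1 biased upward, ~ +0.03
print(abs(bias2 + math.sqrt(p / n) * S) < 0.01)   # Sem2 biased downward, ~ -0.008
```

The simulation reproduces the predicted asymmetry: $\hat c_1$ overshoots the population confidence while $\hat c_2$ undershoots, by the $J$ and $S$ constants respectively.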
### A.6 Proof of [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6)

Let $A := \mathbb{E}_q[c_q^\star] - \bar a > 0$ denote the population over-confidence gap. By the bias expansions ([9](https://arxiv.org/html/2605.08432#S5.E9)),

$$\mathbb{E}_q[\hat c_1] - \bar a = A + \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q[\sqrt{p_q}\,J(\tilde\lambda_q)] + o(n^{-1/2}), \qquad \mathbb{E}_q[\hat c_2] - \bar a = A - \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q[\sqrt{p_q}\,S(\tilde\lambda_q)] + o(n^{-1/2}).$$

Since $J, S \ge 0$ and $A > 0$, both expressions are positive for $n$ sufficiently large, so the absolute values open in the positive direction: $\mathrm{ECE}(\hat c_1) = \mathbb{E}_q[\hat c_1] - \bar a$ and $\mathrm{ECE}(\hat c_2) = \mathbb{E}_q[\hat c_2] - \bar a$ to leading order. Subtracting,

$$\mathrm{ECE}(\hat c_1) - \mathrm{ECE}(\hat c_2) = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q\big[\sqrt{p_q}\,(J(\tilde\lambda_q) + S(\tilde\lambda_q))\big] + o(n^{-1/2}) = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q[\sqrt{p_q}\,g_A(\tilde\lambda_q)] + o(n^{-1/2}),$$

which is strictly positive since $g_A(\tilde\lambda) = \varphi(2\tilde\lambda) > 0$ on all of $\mathcal{Q}_{\mathrm{low}}$. ∎
### A.7 Proof of [Theorem 5.7](https://arxiv.org/html/2605.08432#S5.Thmtheorem7)

We compare $|\mathbb{E}_q[\hat c_i] - \mathbb{E}_q[c_q^\star]|$ for $i \in \{1,2\}$ under non-degenerate over-confidence $A := \mathbb{E}_q[c_q^\star] - \bar a \ge c_0/\sqrt{n}$ (with $c_0 > 0$ fixed) and population support in $\mathcal{Q}_{\mathrm{JDR}}$.

By ([9](https://arxiv.org/html/2605.08432#S5.E9)),

$$\mathbb{E}_q[\hat c_i] - \mathbb{E}_q[c_q^\star] = \tfrac{(-1)^{i+1}}{\sqrt{n}}\,\mathbb{E}_q[\sqrt{p_q}\,h_i(\tilde\lambda_q)] + o(n^{-1/2}),$$

with $h_1 = J$ and $h_2 = S$. Both are $O(n^{-1/2})$, while the non-degeneracy hypothesis ensures that the comparison *against the oracle* $\mathbb{E}_q[c_q^\star]$ does not flip sign within the $o(n^{-1/2})$ remainder (this enters when forming $|\mathbb{E}_q[\hat c_i] - \mathbb{E}_q[c_q^\star]|$ in conjunction with $\mathrm{ECE}^\star = A$).

Specifically, the absolute oracle distances satisfy

$$\big|\mathbb{E}_q[\hat c_1] - \mathbb{E}_q[c_q^\star]\big| = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q[\sqrt{p_q}\,J(\tilde\lambda_q)] + o(n^{-1/2}), \qquad \big|\mathbb{E}_q[\hat c_2] - \mathbb{E}_q[c_q^\star]\big| = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q[\sqrt{p_q}\,S(\tilde\lambda_q)] + o(n^{-1/2}),$$

and subtracting,

$$\big|\mathbb{E}_q[\hat c_1] - \mathbb{E}_q[c_q^\star]\big| - \big|\mathbb{E}_q[\hat c_2] - \mathbb{E}_q[c_q^\star]\big| = \tfrac{1}{\sqrt{n}}\,\mathbb{E}_q\big[\sqrt{p_q}\,(J(\tilde\lambda_q) - S(\tilde\lambda_q))\big] + o(n^{-1/2}).$$

By definition $g_B = J - S$. On $\mathcal{Q}_{\mathrm{JDR}}$, $\tilde\lambda_q < \tilde\lambda^\star$, so $g_B(\tilde\lambda_q) > 0$ ([A.8](https://arxiv.org/html/2605.08432#A1.SS8)). The leading term is therefore strictly positive, proving ([12](https://arxiv.org/html/2605.08432#S5.E12)). ∎
### A.8 Uniqueness of $\tilde\lambda^\star$ via the Mills ratio

We prove that

$$g_B(\tilde\lambda) := \varphi(2\tilde\lambda) - 4\tilde\lambda\,\Phi(-2\tilde\lambda)$$

has a unique root on $(0, \infty)$. Substitute $u := 2\tilde\lambda$. Then

$$g_B = 0 \iff \varphi(u) = 2u\,\Phi(-u) \iff r(u) = 2u, \qquad r(u) := \frac{\varphi(u)}{\Phi(-u)}.$$

Equivalently,

$$\frac{u}{r(u)} = \frac{1}{2}.$$

Let

$$h(u) := r(u) - 2u.$$

We have $h(0) = r(0) > 0$. By the Mills-ratio asymptotic $r(u) = u + u^{-1} + o(u^{-1})$, $h(u) \to -\infty$ as $u \to \infty$. Moreover, the inverse-Mills derivative identity gives

$$r'(u) = r(u)\,(r(u) - u),$$

and the standard bound $0 < r'(u) < 1$ for $u > 0$ implies

$$h'(u) = r'(u) - 2 < 0.$$

Thus $h$ is strictly decreasing and has a unique positive root. Numerically, this root is $u^\star \approx 0.6125$, so

$$\tilde\lambda^\star = u^\star/2 \approx 0.306.$$
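A quick numerical companion to this argument (stdlib Python only) locates the root of $h$ by bisection, using the equivalent form $h(u) = 0 \iff \varphi(u) - 2u\,\Phi(-u) = 0$, and recovers $u^\star$ and $\tilde\lambda^\star$:

```python
import math

def phi(u):   # standard normal pdf
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(u):   # standard normal cdf via erf
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def h(u):     # h(u) = r(u) - 2u, rescaled by Phi(-u) > 0: phi(u) - 2u*Phi(-u)
    return phi(u) - 2.0 * u * Phi(-u)

def bisect(f, lo, hi, tol=1e-10):
    """Bisection for a sign change of f on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

u_star = bisect(h, 0.01, 2.0)   # h(0.01) > 0, h(2) < 0, h strictly decreasing
lam_star = u_star / 2.0
print(round(u_star, 3), round(lam_star, 3))   # 0.612 0.306
```

Multiplying through by $\Phi(-u) > 0$ before root-finding avoids dividing by a quantity that vanishes as $u \to \infty$, without moving the root.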
## Appendix B Algorithm

[Algorithm 1](https://arxiv.org/html/2605.08432#alg1) summarizes the Sem-ECE framework on a single question $q$, paralleling the definitions in [Section 4](https://arxiv.org/html/2605.08432#S4).
**Algorithm 1** The Sem-ECE framework on a question $q$.

1. **Input:** question $q$; selection size $n$; evaluation size $m$; semantic clustering oracle $\mathrm{Cluster}(\cdot)$.
2. Generate $n + m$ i.i.d. responses $A_1, \ldots, A_{n+m}$ from the LLM on $q$.
3. $(Z_1, \ldots, Z_{n+m}) \leftarrow \mathrm{Cluster}(A_1, \ldots, A_{n+m})$. ▷ semantic class labels
4. Partition $[n+m] = N \sqcup E$ with $|N| = n$, $|E| = m$.
5. $\hat z_N \leftarrow \operatorname{arg\,max}_{k \in \mathcal{Z}_q}\hat\pi_N(k)$. ▷ deployed answer
6. $\hat c_1 \leftarrow \max_{k \in \mathcal{Z}_q}\hat\pi_N(k)$. ▷ same-sample confidence ([1](https://arxiv.org/html/2605.08432#S4.E1))
7. $\hat c_2 \leftarrow \hat\pi_E(\hat z_N)$. ▷ held-out confidence ([3](https://arxiv.org/html/2605.08432#S4.E3))
8. **return** $(\hat z_N, \hat c_1, \hat c_2)$.
In experiments ([Section 6](https://arxiv.org/html/2605.08432#S6)), $\hat c_2$ is computed by averaging $\hat\pi_E(\hat z_N)$ over $R$ random half-splits of a pooled sample of size $n + m$, which recycles samples across the two estimators.
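For concreteness, a minimal stdlib-Python sketch of Algorithm 1 follows. The `cluster` argument stands in for the paper's semantic clustering oracle; here a crude string-normalization rule plays that role, and the answer strings and weights are invented for illustration.

```python
import random
from collections import Counter

def sem_ece_single_question(sample, n, m, cluster):
    """One Sem-ECE pass on a question: returns (selected class, c1, c2).
    `sample` holds n + m i.i.d. model answers; `cluster` maps each answer
    to a semantic class label (stand-in for the clustering oracle)."""
    labels = [cluster(a) for a in sample]
    N, E = labels[:n], labels[n:n + m]         # selection / evaluation split
    z_hat = Counter(N).most_common(1)[0][0]    # deployed answer: empirical mode
    c1 = Counter(N)[z_hat] / n                 # same-sample confidence
    c2 = Counter(E)[z_hat] / m                 # held-out confidence
    return z_hat, c1, c2

# Toy run: answers drawn from three surface forms; a crude normalization rule
# merges the two paraphrases of the same answer into one semantic class.
random.seed(0)
answers = random.choices(["paris", "Paris is the capital", "lyon"],
                         weights=[5, 3, 2], k=100)
normalize = lambda a: "paris" if "aris" in a else a
z, c1, c2 = sem_ece_single_question(answers, n=50, m=50, cluster=normalize)
print(z, 0.0 <= c1 <= 1.0, 0.0 <= c2 <= 1.0)
```

In a real deployment the clustering step would use semantic equivalence (e.g. an entailment model) rather than string matching; the estimator logic is unchanged.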
## Appendix C Additional Experimental Figures
This section reports the full diagnostic figures summarized in the main text\.
### C.1 Margin-stratified ECE

Figure 5: Margin-stratified ECE curves for OpenAI on SimpleQA, HLE, and PopQA.
Figure 6: Margin-stratified ECE curves for Anthropic on SimpleQA, HLE, and PopQA.
Figure 7: Margin-stratified ECE curves for Gemini on SimpleQA, HLE, and PopQA.
Figure 8: Margin-stratified ECE curves for xAI on SimpleQA, HLE, and PopQA.
Figure 9: Margin-stratified ECE curves for Mistral on SimpleQA, HLE, and PopQA.
### C.2 Convergence Rate

Figure 10: Direct ECE gap $\mathrm{Sem}_1\text{-ECE} - \mathrm{Sem}_2\text{-ECE}$ on the low-margin sub-population $\{q : \Delta_q < \sqrt{\log K_q / n}\}$ on a log-log scale, with the threshold re-evaluated at each $n$. Fitted slopes $-0.58$, $-0.58$, $-0.56$ are within $0.08$ of [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6)'s prediction of $-0.50$.
### C.3 Reliability diagrams

Figure 11: Reliability diagrams for OpenAI on SimpleQA, HLE, and PopQA.
Figure 12: Reliability diagrams for Anthropic on SimpleQA, HLE, and PopQA.
Figure 13: Reliability diagrams for Gemini on SimpleQA, HLE, and PopQA.
Figure 14: Reliability diagrams for xAI on SimpleQA, HLE, and PopQA.
Figure 15: Reliability diagrams for Mistral on SimpleQA, HLE, and PopQA.
## Appendix D Boundary alignment numerics

[Table 2](https://arxiv.org/html/2605.08432#A4.T2) reports the per-benchmark numerics that underlie the leading-constant comparison in [Section 6.3](https://arxiv.org/html/2605.08432#S6.SS3). We measure the empirical $\mathrm{Sem}_1\text{-ECE} - \mathrm{Sem}_2\text{-ECE}$ in a $\pm 10\%$ window around each regime boundary on each pooled benchmark, and compare it to the leading-order prediction $\varphi(\tilde m^\star)/\sqrt{n}$ from [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) under the $p_q \to 1$ convention. The JDR boundary $\Delta_q = 2\tilde\lambda^\star/\sqrt{n} \approx 0.0865$ at $n = 50$ is universal in $K_q$; the low/large boundary $\Delta_q = \sqrt{\log K_q / n}$ depends on the per-benchmark $K_q$.
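The two boundary quantities follow directly from the constants above; a minimal stdlib-Python check, using the rounded $\tilde\lambda^\star \approx 0.306$ from Appendix A.8:

```python
import math

lam_star = 0.306                          # root of g_B from Appendix A.8
m_star = 2 * lam_star                     # standardized margin at the JDR boundary
n = 50
jdr_boundary = m_star / math.sqrt(n)      # Delta_q at the JDR boundary
phi = math.exp(-m_star**2 / 2) / math.sqrt(2 * math.pi)   # N(0,1) pdf at m_star
prediction = phi / math.sqrt(n)           # leading-order gap phi(m_star)/sqrt(n)
print(round(jdr_boundary, 4))             # 0.0865, matching the text
print(round(prediction, 4))               # leading-order prediction at n = 50
```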
Table 2: Boundary alignment between the empirical $\mathrm{Sem}_1\text{-ECE} - \mathrm{Sem}_2\text{-ECE}$ and the leading-order prediction $\varphi(\tilde m^\star)/\sqrt{n}$ under $p_q \to 1$, at the two regime boundaries on each pooled benchmark ($n = 50$). $n_q$ is the number of questions in the $\pm 10\%$ window.

The leading-order prediction recovers the empirical gap to within $11$–$27\%$ on every benchmark with no fitted constants. The same direction of error appears across all three benchmarks: the JDR boundary over-shoots and the low/large boundary under-shoots. This sign-consistent residual, together with the steeper-than-$-0.50$ log-log slope in [Figure 2](https://arxiv.org/html/2605.08432#S6.F2)(b), is consistent with a subleading $O(1/n)$ Edgeworth correction that flips sign as $\tilde\lambda$ moves from $\tilde\lambda^\star \approx 0.306$ to $\sqrt{\log K_q}/2$. A refined alignment that replaces the $p_q \to 1$ convention with band-wise $\hat p_q$ averages does not improve the fit, ruling out the convention as the source of the residual.
## Appendix E Bootstrap details

For each model–benchmark cell we compute paired percentile bootstrap $95\%$ CIs for three statistics, with $B = 1000$ replicates resampled at the per-question level (questions are the i.i.d. unit; the pairing between $\hat c_1$ and $\hat c_2$ is preserved within each resample):

- $\Delta\mathbb{E}[\hat c_1 - \hat c_2]$, the mean per-question confidence reduction (positive $\Rightarrow$ $\mathrm{Sem}_2$ lowers confidence below $\mathrm{Sem}_1$);
- $\Delta\mathrm{ECE} := \mathrm{Sem}_1\text{-ECE} - \mathrm{Sem}_2\text{-ECE}$ on all questions in the cell;
- $\Delta\mathrm{ECE}_{\mathrm{low}}$, the same gap restricted to the low-margin sub-population $\{q : \Delta_q < 1/\sqrt{n}\}$ (equivalent to $\tilde m_q < 1$ under the $p_q \to 1$ convention, $n = 50$).
PopQA cells use the $466$-question intersection across all five providers to enable paired comparison; this restriction reduces the PopQA per-cell $N$ relative to [Table 1](https://arxiv.org/html/2605.08432#S6.T1), which uses each provider's full coverage.
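The resampling scheme described above can be sketched as follows (stdlib Python; the toy per-question $(\hat c_1, \hat c_2)$ pairs are synthetic and purely illustrative):

```python
import random

def paired_percentile_ci(stat, data, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(data), resampling whole questions so
    the pairing between c1 and c2 within each question is preserved."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                  for _ in range(B))
    lo = reps[int((alpha / 2) * B)]
    hi = reps[min(B - 1, int((1 - alpha / 2) * B))]
    return lo, hi

# Toy cell: 500 questions with a genuine mean confidence reduction of ~0.03.
rng = random.Random(1)
cell = [(c1, max(0.0, c1 - 0.03 + rng.gauss(0, 0.02)))
        for c1 in (rng.uniform(0.3, 1.0) for _ in range(500))]
mean_reduction = lambda d: sum(c1 - c2 for c1, c2 in d) / len(d)
lo, hi = paired_percentile_ci(mean_reduction, cell)
print(lo > 0.0)   # the CI excludes zero, as in Table 3's bolded cells
```

Resampling indices (whole questions) rather than the two confidence streams independently is what makes the bootstrap "paired"; breaking the pairing would inflate the CI for the difference statistic.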
#### Summary.

$\Delta\mathbb{E}[\hat c_1 - \hat c_2]$ is significantly positive on all $15$ pairs: the per-question confidence reduction predicted by ([9](https://arxiv.org/html/2605.08432#S5.E9)) holds without exception. $\Delta\mathrm{ECE}$ is significantly positive on $11$ of $15$ pairs; the remaining four are PopQA cells (OpenAI, Anthropic, Gemini, xAI) whose $466$-question intersection sample is small enough that the population-level ECE gap, while consistently positive in direction, does not always exclude zero. On the theory-relevant low-margin sub-population $\{q : \Delta_q < 1/\sqrt{n}\}$, $\Delta\mathrm{ECE}_{\mathrm{low}}$ is significantly positive in $11$ of $14$ measurable cells (PopQA Anthropic has only $n_q = 26$ low-margin questions, too few for a stable bootstrap CI). Details are shown in [Table 3](https://arxiv.org/html/2605.08432#A5.T3), [Table 4](https://arxiv.org/html/2605.08432#A5.T4), and [Table 5](https://arxiv.org/html/2605.08432#A5.T5).
Table 3: Per-question confidence reduction $\Delta\mathbb{E}[\hat{c}_1-\hat{c}_2]$ with paired percentile bootstrap $95\%$ CI ($B = 1000$, resampled per question). Cells whose CI excludes zero are bolded. PopQA uses the 466-question intersection.

Table 4: Population ECE gap $\Delta\mathrm{ECE} = \mathrm{Sem}_1\text{-ECE} - \mathrm{Sem}_2\text{-ECE}$ with $95\%$ CI. Cells whose CI excludes zero are bolded.

Table 5: Low-margin ECE gap $\Delta\mathrm{ECE}_{\mathrm{low}}$ on $\{q : \Delta_q < 1/\sqrt{50}\}$ with $95\%$ CI. Cells bolded if CI excludes zero; n/a indicates insufficient sample size.
#### Interpretation\.
The per-question reduction $\Delta\mathbb{E}[\hat{c}_1-\hat{c}_2]$ is the most directly observable empirical consequence of ([9](https://arxiv.org/html/2605.08432#S5.E9)), and its CI excludes zero on every cell, including all PopQA cells despite their reduced effective sample size. The population-level $\Delta\mathrm{ECE}$ has a less stable CI because binning discretization dampens the signal; this is most visible on PopQA at $N = 466$. Restricting to the low-margin sub-population recovers a larger and more stable signal exactly where [Theorem 5.6](https://arxiv.org/html/2605.08432#S5.Thmtheorem6) predicts the gap to be largest, with effect sizes ($\sim 5$ percentage points) roughly $4\times$ those on the full population, consistent with the regime structure visible in [Figure 3](https://arxiv.org/html/2605.08432#S6.F3).
## Appendix FPooled reliability analysis
[Figure 4](https://arxiv.org/html/2605.08432#S6.F4) shows reliability diagrams pooled across models on each benchmark. $\mathrm{Sem}_2$ achieves the lowest pooled ECE on every benchmark (SimpleQA 0.311, HLE 0.542, PopQA 0.334, versus $\mathrm{Sem}_1$ 0.323/0.556/0.340 and Ver 0.458/0.690/0.382). All three confidence sources are over-confident at high confidence on HLE, which is expected for an expert-level benchmark where models are highly self-consistent yet factually wrong, so semantic agreement (which Sem-ECE measures) and factual correctness diverge. The remaining ECE in $\mathrm{Sem}_2$ reflects the population gap $|\mathbb{E}_q[c_q^\star] - \bar{a}|$, a property of the underlying model rather than of the calibration estimator. Combining a debiased agreement metric like $\mathrm{Sem}_2$-ECE with a separate signal targeting $\bar{a}$ directly is a natural direction for future work.
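The pooled reliability computation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the 10-bin equal-width scheme and the synthetic over-confident model (accuracy lagging confidence, mimicking the HLE pattern) are assumptions.

```python
# Pooled reliability curve: pool (confidence, correctness) pairs across
# models on one benchmark, then report per-bin mean confidence vs.
# empirical accuracy; pooled ECE is the count-weighted gap.
import numpy as np

rng = np.random.default_rng(1)

def reliability_curve(conf, correct, n_bins=10):
    """Per-bin (mean confidence, accuracy, count) over equal-width bins."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((conf[mask].mean(), correct[mask].mean(), mask.sum()))
    return rows

# Synthetic over-confident pool: true accuracy is only 0.6x the stated
# confidence, so the curve sits below the diagonal at high confidence.
conf = rng.uniform(size=5000)
correct = (rng.uniform(size=5000) < 0.6 * conf).astype(float)
curve = reliability_curve(conf, correct)
pooled_ece = sum(n * abs(c - a) for c, a, n in curve) / len(conf)
```

In the top bin the mean confidence exceeds the empirical accuracy, which is the over-confidence-at-high-confidence signature the pooled HLE diagrams show.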
## Appendix GExtended Related Work
#### Calibration evaluation.
Calibration evaluation is well studied for probabilistic classifiers, where a model predicts a probability distribution over a fixed label set and metrics such as the Brier score, reliability diagrams, and expected calibration error compare confidence with empirical accuracy \[[2](https://arxiv.org/html/2605.08432#bib.bib9),[16](https://arxiv.org/html/2605.08432#bib.bib10),[5](https://arxiv.org/html/2605.08432#bib.bib2)\]. This setting extends naturally to multiple-choice QA, where confidence can be computed from logits or normalized option probabilities. Open-ended QA is less straightforward: the answer space is not fixed, logits may be unavailable, and correctness is semantic rather than lexical. As a result, calibration evaluation for open-ended QA also requires a correctness judgment for free-form answers, typically through human annotation or a validated automatic judge.
#### Confidence estimation for open-ended language models.
Several approaches estimate confidence without relying on a fixed label space. Verbalized confidence asks the model to state its own uncertainty in words or as a probability \[[11](https://arxiv.org/html/2605.08432#bib.bib3),[6](https://arxiv.org/html/2605.08432#bib.bib4),[14](https://arxiv.org/html/2605.08432#bib.bib11),[20](https://arxiv.org/html/2605.08432#bib.bib5)\]. This is flexible with respect to answer format, but depends on the model's self-reporting behavior and can be inaccurate or over-confident. Sampling-based methods instead query the model multiple times and use agreement across generations as a confidence signal \[[21](https://arxiv.org/html/2605.08432#bib.bib8),[12](https://arxiv.org/html/2605.08432#bib.bib6)\]. Related semantic uncertainty methods group generations by meaning and aggregate uncertainty over semantic clusters \[[8](https://arxiv.org/html/2605.08432#bib.bib12),[3](https://arxiv.org/html/2605.08432#bib.bib13)\]. Existing sampling-based calibration methods often still rely on task-specific answer extraction, such as fixed final-answer patterns or regular expressions, before agreement can be computed.
#### Improving calibration.
A separate line of work aims to improve model calibration rather than evaluate it. Post-hoc methods such as Platt scaling, temperature scaling, Dirichlet calibration, and verified calibration adjust confidence scores on held-out validation data without changing the model \[[19](https://arxiv.org/html/2605.08432#bib.bib14),[5](https://arxiv.org/html/2605.08432#bib.bib2),[9](https://arxiv.org/html/2605.08432#bib.bib15),[10](https://arxiv.org/html/2605.08432#bib.bib16)\]. For language models, calibration can also be improved or elicited through prompting, auxiliary confidence estimation, or verbalized uncertainty \[[11](https://arxiv.org/html/2605.08432#bib.bib3),[6](https://arxiv.org/html/2605.08432#bib.bib4),[20](https://arxiv.org/html/2605.08432#bib.bib5)\]. Fine-tuning-based approaches modify the model or the training objective to preserve or restore calibration after alignment \[[24](https://arxiv.org/html/2605.08432#bib.bib17)\].
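Temperature scaling, the simplest of the post-hoc methods cited above, can be sketched in a few lines. This is a generic illustration under stated assumptions, not any cited paper's implementation: a single scalar $T$ is fit on held-out logits to minimize negative log-likelihood (here by grid search, a stand-in for the usual LBFGS fit), and the data is synthetic.

```python
# Temperature scaling: divide validation logits by a learned scalar T
# before the softmax. The argmax prediction is unchanged; only the
# confidence distribution is rescaled.
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T, logits, labels):
    """Mean negative log-likelihood of the labels at temperature T."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Grid search over T minimizing held-out NLL (illustrative;
    production implementations typically use LBFGS)."""
    return min(grid, key=lambda T: nll(T, logits, labels))

# Synthetic over-confident classifier: well-separated logits inflated
# by a factor of 3, so the fitted T should be well above 1 to cool
# the confidences back down.
n, k = 2000, 10
labels = rng.integers(0, k, size=n)
logits = rng.normal(size=(n, k))
logits[np.arange(n), labels] += 1.0   # signal on the true class
logits *= 3.0                         # inflation -> over-confidence
T = fit_temperature(logits, labels)
```

Because the held-out fit cannot change which answer the model gives, temperature scaling is purely a confidence adjustment, which is why such methods are complementary to, rather than a substitute for, a calibration *evaluation* framework like Sem-ECE.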