Retrieval-Augmented Linguistic Calibration

arXiv cs.CL Papers

Summary

This paper proposes Retrieval-Augmented Linguistic Calibration (RALC), a post-hoc pipeline that calibrates the confidence LLMs express in language by modelling linguistic confidence as a distribution and rewriting responses with retrieved hedging expressions. It introduces the Faithfulness Divergence (FD) metric and reports in-domain improvements of up to 66% in faithfulness and 58% in calibration across three QA benchmarks and five LLM families.

arXiv:2605.19344v1 Announce Type: new Abstract: Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

Cached at: 05/20/26, 08:25 AM

# Retrieval-Augmented Linguistic Calibration
Source: [https://arxiv.org/html/2605.19344](https://arxiv.org/html/2605.19344)
Yi-Fan Yeh (School of Computer Science, University of Sydney, Sydney, Australia; yyeh7345@uni.sydney.edu.au), Linwei Tao (School of Computer Science, University of Sydney, Sydney, Australia; linwei.tao@sydney.edu.au), Minjing Dong (City University of Hong Kong, Hong Kong; minjdong@cityu.edu.hk), Tao Huang (Shanghai Jiao Tong University, Shanghai, China; t.huang@sjtu.edu.cn), Jialin Yu (Department of Engineering Science, University of Oxford, Oxford, UK; jialin.yu@eng.ox.ac.uk), Philip Torr (Department of Engineering Science, University of Oxford, Oxford, UK; philip.torr@eng.ox.ac.uk), Chang Xu (School of Computer Science, University of Sydney, Sydney, Australia; c.xu@sydney.edu.au)


## 1 Introduction

Reliable confidence estimation is fundamental to the trustworthy deployment of large language models \(LLMs\) in human decision\-making pipelines\[[29](https://arxiv.org/html/2605.19344#bib.bib20)\]\. Without well\-calibrated confidence signals, users risk over\-relying on model outputs that hallucinate or fail silently\[[13](https://arxiv.org/html/2605.19344#bib.bib21)\], underscoring the need for confidence frameworks that are both scientifically rigorous and interpretable by human users\.

Existing confidence estimation methods represent confidence as scalar probability values, including token\-level probability\[[16](https://arxiv.org/html/2605.19344#bib.bib4)\], semantic uncertainty\[[5](https://arxiv.org/html/2605.19344#bib.bib7)\], and verbalised scores\[[18](https://arxiv.org/html/2605.19344#bib.bib27),[34](https://arxiv.org/html/2605.19344#bib.bib28)\]\. However, humans struggle to reason accurately with numerical probabilities\[[41](https://arxiv.org/html/2605.19344#bib.bib24)\], motivating the use of linguistic markers such as “may” or “likely” as more natural confidence interfaces\. Prior work demonstrates that such markers retain evaluative signal\[[38](https://arxiv.org/html/2605.19344#bib.bib9),[31](https://arxiv.org/html/2605.19344#bib.bib26)\]; however, treating them as scalars discards the inherent subjectivity of linguistic interpretation: different readers map the same expression to different perceived probability values\[[32](https://arxiv.org/html/2605.19344#bib.bib8)\]\.

We address this gap by modelling linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, where perception arises from readers’ interpretations of the full linguistic content of the statement rather than from a discrete mapping of individual vocabulary items\. Treating linguistic confidence as a surrogate for statement correctness induces a binary classification view, in which confidence corresponds to the predictive probability of the true class\. Drawing a parallel to evidential deep learning, which models class probabilities with Dirichlet distributions to capture second\-order predictive uncertainty\[[27](https://arxiv.org/html/2605.19344#bib.bib23)\], we formalise linguistic confidence for binary correctness using its natural binary special case: a Beta distribution over perceived confidence scores that a statement is correct, where the mean captures the central tendency of perceived confidence across readers and the concentration encodes the strength of agreement\.

The standard measure of confidence quality is calibration, assessed through population-level expected calibration error (ECE)\[[8](https://arxiv.org/html/2605.19344#bib.bib11),[35](https://arxiv.org/html/2605.19344#bib.bib1)\], quantifying the alignment between confidence and accuracy in expectation. Instance-level metrics such as the Brier score\[[3](https://arxiv.org/html/2605.19344#bib.bib40)\] and negative log likelihood offer pointwise assessment in the classical scalar setting, yet their distributional generalisations still fail to encode variance as the strength of agreement by readers. We therefore introduce *faithfulness* as a complementary dimension of confidence evaluation and present Faithfulness Divergence (FD), a concentration-weighted Bayesian updating cost that quantifies the information-theoretic surprise to confidence beliefs upon truth revelation.

Calibration in linguistic space remains crucial yet largely unsolved\. Classical post\-hoc calibration methods, including temperature scaling\[[8](https://arxiv.org/html/2605.19344#bib.bib11)\], Platt scaling\[[25](https://arxiv.org/html/2605.19344#bib.bib14)\], histogram binning\[[39](https://arxiv.org/html/2605.19344#bib.bib13)\], isotonic regression\[[40](https://arxiv.org/html/2605.19344#bib.bib12)\], Beta calibration\[[14](https://arxiv.org/html/2605.19344#bib.bib29)\], and distribution\-matching approaches\[[28](https://arxiv.org/html/2605.19344#bib.bib3),[20](https://arxiv.org/html/2605.19344#bib.bib33)\], operate entirely in numerical space and provide no mechanism for propagating calibrated signals back into language\. Prompt\-conditioned hedging strategies offer a linguistic alternative, yet function as black\-box procedures with no principled control over the output\[[38](https://arxiv.org/html/2605.19344#bib.bib9)\]\. The closest related work performs discrete hedging word confidence profiling and remapping at the word level in a specialised domain\[[35](https://arxiv.org/html/2605.19344#bib.bib1)\], overlooking the co\-occurrence of multiple linguistic cues within a statement and their contextual interactions\. Consequently, a generalisable, continuous, and lightweight post\-hoc framework that provides principled guidance for hedging expressions remains underexplored\. To address this gap, we introduce Retrieval\-Augmented Linguistic Calibration \(RALC\), a post\-hoc pipeline that operates directly in linguistic space to transform raw LLM responses into calibrated and faithful outputs\. 
The pipeline applies Platt scaling\[[25](https://arxiv.org/html/2605.19344#bib.bib14)\] on confidence distribution means whilst preserving distributional concentration, and propagates calibrated distributions to language through retrieval-augmented LLM rewriting, employing the retrieval-augmented generation paradigm\[[17](https://arxiv.org/html/2605.19344#bib.bib44)\]. Furthermore, RALC is compatible with diverse upstream confidence signals beyond linguistic confidence, including token probability and semantic uncertainty.

Our contributions are as follows:

1. We formalise linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing the interplay of linguistic cues and contexts beyond discrete expression mapping and scalar quantification.
2. We introduce *faithfulness* as a new dimension of confidence evaluation and present Faithfulness Divergence (FD), an instance-level metric that quantifies, as information-theoretic surprise, how faithfully a confidence distribution accounts for the ground-truth correctness outcome.
3. We introduce a generalisable retrieval-augmented linguistic confidence calibration pipeline that effectively improves faithfulness and calibration in linguistic space and is compatible with diverse confidence estimation signals.

We evaluate the framework across multiple LLM families and QA benchmarks, including MMLU\[[9](https://arxiv.org/html/2605.19344#bib.bib16)\], SQuAD 2\.0\[[26](https://arxiv.org/html/2605.19344#bib.bib17)\], and TruthfulQA\[[19](https://arxiv.org/html/2605.19344#bib.bib36)\]\. Results demonstrate near\-lossless information transfer through the calibration pipeline and substantial improvements in both calibration and faithfulness across models and benchmarks, outperforming the prompt\-based calibration baselines\.

## 2 Related work

##### LLM confidence estimation

Existing confidence estimation methods predominantly represent confidence as scalar probability values, including token-level probability aggregation\[[16](https://arxiv.org/html/2605.19344#bib.bib4),[4](https://arxiv.org/html/2605.19344#bib.bib5)\] and consistency-based approaches that infer confidence from the semantic support landscape across repeated samples\[[36](https://arxiv.org/html/2605.19344#bib.bib25),[5](https://arxiv.org/html/2605.19344#bib.bib7)\]. Whilst recent work demonstrates that linguistic cues in model responses preserve evaluative signal as confidence surrogates\[[38](https://arxiv.org/html/2605.19344#bib.bib9),[32](https://arxiv.org/html/2605.19344#bib.bib8)\], their scalar quantification overlooks the inherently subjective nature of linguistic interpretation. Wang et al.\[[35](https://arxiv.org/html/2605.19344#bib.bib1)\] take a step towards distributional representations by mapping individual discrete hedging words to confidence distributions; however, their approach targets word-level remapping rather than statement-level confidence, where multiple linguistic cues co-occur and interact with context. Huang et al.\[[10](https://arxiv.org/html/2605.19344#bib.bib2)\] jointly model confidence and correctness as distributions over ambiguous long-form generation contexts, which is orthogonal to our objective of binary classification with predictive probability distributions.

##### Confidence evaluation

Expected Calibration Error (ECE) is the dominant metric for evaluating confidence, measuring alignment between scalar confidence scores and accuracy in both classical\[[8](https://arxiv.org/html/2605.19344#bib.bib11)\] and language model settings\[[42](https://arxiv.org/html/2605.19344#bib.bib30)\]. Extensions based on entropy\[[30](https://arxiv.org/html/2605.19344#bib.bib31)\], variance\[[33](https://arxiv.org/html/2605.19344#bib.bib32)\], and distributional generalisation\[[35](https://arxiv.org/html/2605.19344#bib.bib1)\] remain ECE-based and rely on local aggregation, discarding full distributional information at the instance level. Instance-level scoring such as the Brier score\[[3](https://arxiv.org/html/2605.19344#bib.bib40)\] and negative log likelihood similarly do not capture variance-scaled misalignment.

##### Confidence calibration

Classical post\-hoc calibration methods, including Platt scaling\[[25](https://arxiv.org/html/2605.19344#bib.bib14)\], histogram binning\[[39](https://arxiv.org/html/2605.19344#bib.bib13)\], isotonic regression\[[40](https://arxiv.org/html/2605.19344#bib.bib12)\], and Beta calibration\[[14](https://arxiv.org/html/2605.19344#bib.bib29)\], adjust scalar outputs towards empirical accuracy but are confined to numerical space\. Distributional calibration methods frame the problem as distribution matching, aligning predicted confidence distributions with empirical label distributions through various mapping strategies\[[28](https://arxiv.org/html/2605.19344#bib.bib3),[20](https://arxiv.org/html/2605.19344#bib.bib33)\], though they target global rather than instance\-level calibration\. In the linguistic space, prompt\-based strategies have been explored to steer LLM hedging\[[38](https://arxiv.org/html/2605.19344#bib.bib9)\], but these lack principled control\. Internal model steering offers finer\-grained calibration of verbal uncertainty\[[12](https://arxiv.org/html/2605.19344#bib.bib34)\], yet requires access to model internals, limiting applicability to open\-source settings\. The most closely related approach remaps discrete hedging words at the vocabulary level, without accounting for contextual interactions or producing calibrated full responses\[[35](https://arxiv.org/html/2605.19344#bib.bib1)\]\.

## 3 Confidence estimation and evaluation

### 3.1 Linguistic confidence estimation

For each input–response pair $(X, R)$, let $y \in \{0, 1\}$ denote the correctness label of $R$. We define a distributional confidence estimator

$$g : \mathcal{R} \rightarrow \mathcal{P}([0,1]),$$

where $\mathcal{R}$ denotes the space of model responses and $\mathcal{P}([0,1])$ denotes the space of probability distributions over $[0,1]$. The estimator $g$ models the plausible probability values that $R$ is correct as a distribution over confidence scores in $[0,1]$ as perceived by readers (human or model-based evaluators). We abstract $g$ as a model-based or human-based evaluator and parameterise the estimated distribution $S$ as a Beta distribution, $S = \mathrm{Beta}(\alpha, \beta)$. This choice draws a parallel to evidential deep learning\[[27](https://arxiv.org/html/2605.19344#bib.bib23)\], which places a Dirichlet prior over class probabilities to represent second-order uncertainty. Our setting resembles binary classification: each reader produces an interpreted confidence score viewable as a draw of the true-class probability; the Beta distribution is therefore the principled choice, as the binary special case of the Dirichlet and the natural conjugate prior for the Bernoulli likelihood. The mean $\alpha/(\alpha+\beta)$ captures the central tendency of perceived confidence across readers, whilst the concentration $(\alpha+\beta)$ encodes agreement strength: a high mean with low concentration signals inconsistent reader interpretations, whereas the same mean with higher concentration signals consistent agreement.
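The mean and concentration parameterisation can be made concrete with a minimal sketch. The helper names `beta_from_mean_concentration` and `beta_variance` are illustrative and not from the paper; the variance identity $\mu(1-\mu)/(\kappa+1)$ is the standard one for a Beta distribution.

```python
def beta_from_mean_concentration(mu: float, kappa: float) -> tuple[float, float]:
    """Convert (mean, concentration) into Beta shape parameters (alpha, beta)."""
    if not 0.0 < mu < 1.0 or kappa <= 0.0:
        raise ValueError("need 0 < mu < 1 and kappa > 0")
    return mu * kappa, (1.0 - mu) * kappa


def beta_variance(alpha: float, beta: float) -> float:
    """Variance of Beta(alpha, beta), written as mu * (1 - mu) / (kappa + 1)."""
    kappa = alpha + beta
    mu = alpha / kappa
    return mu * (1.0 - mu) / (kappa + 1.0)


# Same perceived mean (0.8), different strength of agreement across readers.
diffuse = beta_from_mean_concentration(0.8, 2.0)        # readers disagree
concentrated = beta_from_mean_concentration(0.8, 50.0)  # readers agree
```

Both distributions share the mean 0.8, but the low-concentration one spreads its mass much more widely over $[0,1]$, which is the interpretation variability that a scalar score would discard.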

### 3.2 Confidence evaluation

##### Calibration as a dimension of confidence evaluation

At the population level, calibration requires confidence to match empirical accuracy in expectation. Letting $p \sim S$ denote the scalar confidence value drawn from the estimated distribution, the classical measure is the expected calibration error,

$$\mathrm{ECE} = \mathbb{E}\!\left[\left|\mathbb{E}[Y \mid p] - p\right|\right],$$

with realisations including scalar bin-based\[[8](https://arxiv.org/html/2605.19344#bib.bib11)\] and distribution-generalised variants\[[35](https://arxiv.org/html/2605.19344#bib.bib1)\].
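As a point of reference, the scalar bin-based realisation can be sketched as follows. This is the common equal-width binned estimator applied to scalar confidences (for example, distribution means); it is not the distribution-generalised variant used in the paper, and the function name and default bin count are assumptions.

```python
def binned_ece(confidences, labels, n_bins=10):
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|."""
    if len(confidences) != len(labels) or not confidences:
        raise ValueError("confidences and labels must be non-empty and aligned")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

Each bin approximates the conditional expectation $\mathbb{E}[Y \mid p]$ by the empirical accuracy of the instances that fall into it.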

##### Faithfulness as a dimension of confidence evaluation

Calibration is necessary but not sufficient in the distributional setting: two predictors may achieve similar average calibration yet convey markedly different instance-level confidence profiles. We therefore introduce *faithfulness*, a human-aligned, instance-level dimension of confidence evaluation.

A confidence distribution is faithful when observing the ground truth induces little surprise relative to prior beliefs. Such surprise is driven by both central-tendency misalignment and the strength of agreement with which that misalignment is held. Grounding this intuition in information theory, we draw on Bayesian surprise\[[11](https://arxiv.org/html/2605.19344#bib.bib43)\], measured by the KL divergence between posterior and prior distributions, and weight it by the concentration $(\alpha_i + \beta_i)$ as the effective sample size of the prior\[[23](https://arxiv.org/html/2605.19344#bib.bib42)\] to represent the total surprise of the update.

Formally, for instance $i$, we update the prior $S_i$ with a single Bernoulli observation of the correctness label $y_i$ to obtain the posterior $S_i^{*}$:

$$S_i = \mathrm{Beta}(\alpha_i,\, \beta_i), \qquad S_i^{*} = \mathrm{Beta}(\alpha_i + y_i,\; \beta_i + 1 - y_i).$$

The *Faithfulness Divergence* (FD) for instance $i$ is then

$$\mathrm{FD}_i := (\alpha_i + \beta_i) \cdot \mathrm{KL}\!\left(S_i^{*} \,\|\, S_i\right).$$

The KL term quantifies the information-theoretic change of belief required upon observing the outcome; weighting by $(\alpha_i + \beta_i)$ ensures the significance of the change is properly accounted for. FD is non-negative and intended as a *relative* instance-level metric: lower values indicate more faithful confidence communication, whilst higher values signal greater mismatch between expressed confidence and realised correctness. We provide a further discussion of the modelling in Appendix [B](https://arxiv.org/html/2605.19344#A2).
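The FD definition can be sketched in a few lines of standard-library Python. This assumes the textbook closed form for the KL divergence between two Beta distributions; the `digamma` helper is a hand-rolled approximation (recurrence plus asymptotic series) included only to keep the sketch dependency-free, and all function names are illustrative rather than the authors' implementation.

```python
import math


def digamma(x: float) -> float:
    """Digamma via psi(x) = psi(x + 1) - 1/x, then an asymptotic series for x >= 6."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_beta_fn(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a1: float, b1: float, a2: float, b2: float) -> float:
    """Closed-form KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (
        log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
        + (a1 - a2) * digamma(a1)
        + (b1 - b2) * digamma(b1)
        + (a2 - a1 + b2 - b1) * digamma(a1 + b1)
    )


def faithfulness_divergence(alpha: float, beta: float, y: int) -> float:
    """FD = (alpha + beta) * KL(posterior || prior) after one Bernoulli observation y."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return (alpha + beta) * kl_beta(alpha + y, beta + 1 - y, alpha, beta)
```

With a wrong answer ($y = 0$) and the same mean of 0.8, `faithfulness_divergence(20.0, 5.0, 0)` is larger than `faithfulness_divergence(0.8, 0.2, 0)`, whereas the unweighted KL term alone orders the two cases the other way round. This is the effect of weighting by the effective sample size.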

## 4 Retrieval-Augmented Linguistic Calibration (RALC)

![Refer to caption](https://arxiv.org/html/2605.19344v1/x1.png)

Figure 1: Retrieval-Augmented Linguistic Calibration pipeline overview. In each calibration inference pass (blue arrow $\rightarrow$), we estimate a confidence distribution for the original response using linguistic confidence, token probability, or semantic uncertainty (Sections [3.1](https://arxiv.org/html/2605.19344#S3.SS1), [4.3](https://arxiv.org/html/2605.19344#S4.SS3)). In the signal space, we apply a pre-trained Platt scaling calibration map on the mean to correct miscalibration in the numerical space (Section [4.1](https://arxiv.org/html/2605.19344#S4.SS1)). The calibrated distribution is then used as a retrieval signal to find the nearest hedging expressions from a pre-built hedge-confidence-pair lexicon. The KNN retrieval process uses absolute distance in the means for shortlisting and 1-Wasserstein distance for final retrieval to ensure alignment in both central tendency and spread. The retrieved hedging expressions form a rewrite prompt along with the original response, which is passed to an LLM editor to produce a linguistically calibrated response (Section [4.2](https://arxiv.org/html/2605.19344#S4.SS2)).

We introduce a novel post-hoc calibration pipeline that operates directly in the linguistic space to transform original LLM responses into calibrated and faithful outputs, Retrieval-Augmented Linguistic Calibration (RALC). RALC transforms a raw response into one whose perceived confidence more faithfully reflects the underlying ground-truth label. Figure [1](https://arxiv.org/html/2605.19344#S4.F1) illustrates the pipeline mechanics. It estimates a confidence distribution for the original responses, applies post-hoc signal-space calibration to correct miscalibration in the numerical space, and then uses the calibrated distribution as a retrieval signal to search for appropriate hedging expressions that guide the rewriting of the original response into a linguistically calibrated and faithful one.

### 4.1 Post-hoc signal-space calibration

##### Definition

Under the distributional setting, the confidence estimator $g$ outputs a distribution over confidence scores. Post-hoc calibration learns a mapping $t$ in distribution space so that the calibrated estimator $t \circ g$ reduces expected calibration error and Faithfulness Divergence.

##### Platt scaling on distribution means

We parameterise confidence with a Beta distribution $S = \mathrm{Beta}(\alpha, \beta)$, with mean $\mu = \alpha/(\alpha+\beta)$ and concentration $\kappa = \alpha + \beta$. We apply Platt scaling\[[25](https://arxiv.org/html/2605.19344#bib.bib14)\] on the distribution mean, mapping $\mu$ to a calibrated mean $\mu'$ by fitting a logistic regression of distribution means against binary correctness labels:

$$\mu' = \sigma\!\left(w \cdot \operatorname{logit}(\mu) + b\right), \tag{1}$$

where $\sigma(\cdot)$ denotes the sigmoid function and $(w, b) \in \mathbb{R}^{2}$ are learned scalar parameters. The concentration $\kappa = \alpha + \beta$ is preserved from the original estimated distribution due to the non-existence of a natural target for concentration calibration.

We reconstruct the calibrated Beta distribution from the calibrated mean $\mu'$ and the preserved concentration $\kappa$, setting $\alpha' = \mu' \cdot \kappa$ and $\beta' = (1 - \mu') \cdot \kappa$, yielding the calibrated distribution $t(S) = \mathrm{Beta}(\alpha', \beta')$. We also investigate classical alternatives, including temperature scaling\[[8](https://arxiv.org/html/2605.19344#bib.bib11)\], isotonic regression\[[40](https://arxiv.org/html/2605.19344#bib.bib12)\], and histogram binning\[[39](https://arxiv.org/html/2605.19344#bib.bib13)\], and find that Platt scaling is the consistent outperformer in improving calibration and faithfulness. We present an ablation study in Appendix [D.7](https://arxiv.org/html/2605.19344#A4.SS7).
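The mean-scaling and reconstruction steps can be sketched as follows. The paper does not specify how the logistic regression is fitted, so the plain gradient-descent optimiser, its learning rate, the step count, and the clipping constant `eps` are all assumptions made for illustration.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(means, labels, lr=0.1, steps=5000):
    """Fit (w, b) in mu' = sigmoid(w * logit(mu) + b) by gradient descent on log loss."""
    xs = [logit(m) for m in means]
    n = len(xs)
    w, b = 1.0, 0.0  # start from the identity map
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def calibrate_beta(alpha: float, beta: float, w: float, b: float):
    """Rescale the mean with Platt scaling and keep the concentration kappa fixed."""
    kappa = alpha + beta
    mu_cal = sigmoid(w * logit(alpha / kappa) + b)
    return mu_cal * kappa, (1.0 - mu_cal) * kappa
```

Because only the mean is remapped, `calibrate_beta` returns shape parameters that always sum to the original concentration, matching the reconstruction $\alpha' = \mu' \kappa$, $\beta' = (1 - \mu') \kappa$.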

##### Interpretation

The signal-space Platt scaling corrects systematic mean misalignment, whilst preserving the concentration that encodes the strength of agreement across readers. The mean is directly supervised by the correctness labels, and scaling it induces a more faithful and calibrated belief about the correctness outcome. By contrast, the concentration is a linguistic feature that emerges from the interplay of linguistic cues and reader interpretation, and does not have a theoretical target for calibration. As a result, we deliberately confine our calibration design to mean scaling alone, where a natural target exists.

### 4.2 Retrieval-augmented linguistic control

Post-hoc signal-space calibration updates the originally estimated confidence distributions but leaves the response language unchanged. To close this gap, we introduce retrieval-augmented linguistic control, which rewrites the original response $R$ into a revised response $R'$ whose perceived confidence aligns with the calibrated signal $S' = t(S)$. Formally, a linguistic calibrator $l$ produces $R' = l(R, S')$ such that re-estimating confidence from $R'$ recovers $S'$. The full pipeline $l \circ t \circ g$ thus forms a closed loop from raw response to calibrated confidence to linguistically calibrated response.

##### Linguistic confidence lexicon

The retrieval step relies on a lexicon that maps hedging expressions to confidence distributions. Hedging expressions are sourced from state-of-the-art LLMs, including Claude-Sonnet-4.6\[[2](https://arxiv.org/html/2605.19344#bib.bib50)\], GPT-5.4\[[24](https://arxiv.org/html/2605.19344#bib.bib51)\], and Gemini-3-Flash\[[6](https://arxiv.org/html/2605.19344#bib.bib52)\]. For each hedging expression $w_k$, GPT-OSS-20B\[[1](https://arxiv.org/html/2605.19344#bib.bib46)\] rewrites a collection of non-verifiable statements to incorporate that expression. The LLM linguistic evaluator ensemble then independently evaluates the perceived confidence of each rewritten statement in 3 model passes per evaluator, producing a set of confidence scores, as outlined in Section [5.1](https://arxiv.org/html/2605.19344#S5.SS1). Fitting a Beta distribution to all confidence scores across passes yields the pair $\bigl(w_k,\, \mathrm{Beta}(\alpha_k, \beta_k)\bigr)$. Repeating this procedure across all hedging expressions produces the lexicon $\{(w_k,\, \mathrm{Beta}(\alpha_k, \beta_k))\}_{k=1}^{K}$ used at inference time. Additional details on the lexicon construction are provided in Appendix [D.1](https://arxiv.org/html/2605.19344#A4.SS1).

##### Retrieval\-based rewriting

Given a calibrated signal $S'$, we retrieve the $k$ nearest hedging expressions from the lexicon via a two-stage process. First, we shortlist candidates by mean distance $|\mu_k - \mu_{S'}|$, retaining expressions whose distributional mean falls within a neighbourhood of $\mu_{S'}$. Second, we rank the shortlisted candidates by the 1-Wasserstein distance via Monte Carlo estimation,

$$d(w_k,\, S') = W_1\!\left(\mathrm{Beta}(\alpha_k, \beta_k),\, S'\right),$$

and select the top-$k$ nearest expressions. Mean-distance shortlisting efficiently narrows the candidate set, whilst the subsequent $W_1$ ranking captures full distributional shape, matching both the central tendency and spread of $S'$, at lower computational cost than applying $W_1$ over the entire lexicon. The retrieved expressions are then passed alongside $R$ to an LLM editor, which rewrites $R$ into $R'$ to match the target confidence profile, enforcing $g(R') \approx S'$ in practice.
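The two-stage retrieval can be sketched as below. The lexicon entries, their Beta parameters, the neighbourhood `radius`, the sample count, and the seed are all hypothetical values chosen for illustration; none come from the paper. The Monte Carlo estimate relies on the fact that, for equal-size one-dimensional samples, the empirical $W_1$ distance is the mean absolute difference between sorted samples.

```python
import random


def w1_monte_carlo(p, q, n_samples=4000, seed=0):
    """Estimate W1 between Beta(*p) and Beta(*q) from sorted Monte Carlo samples."""
    rng = random.Random(seed)
    xs = sorted(rng.betavariate(*p) for _ in range(n_samples))
    ys = sorted(rng.betavariate(*q) for _ in range(n_samples))
    return sum(abs(x - y) for x, y in zip(xs, ys)) / n_samples


def beta_mean(params):
    """Mean of Beta(alpha, beta)."""
    return params[0] / (params[0] + params[1])


def retrieve_hedges(target, lexicon, k=2, radius=0.1):
    """Stage 1: shortlist by mean distance. Stage 2: rank the shortlist by W1."""
    mu_target = beta_mean(target)
    shortlist = [
        (phrase, params)
        for phrase, params in lexicon.items()
        if abs(beta_mean(params) - mu_target) <= radius
    ]
    ranked = sorted(shortlist, key=lambda item: w1_monte_carlo(item[1], target))
    return [phrase for phrase, _ in ranked[:k]]


# Hypothetical lexicon: phrases and Beta parameters are made up for illustration.
LEXICON = {
    "almost certainly": (18.0, 1.0),
    "probably": (8.0, 2.2),
    "I would guess": (2.3, 0.7),
    "possibly": (4.0, 6.0),
}
```

For a calibrated target of `(8.0, 2.0)`, the mean filter discards the two entries whose means are far from 0.8, and the $W_1$ ranking then prefers the entry with similar spread over the one with a similar mean but much lower concentration.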

### 4.3 Alternative confidence signals

The RALC pipeline is compatible with confidence signals beyond linguistic confidence, including token probability\[[16](https://arxiv.org/html/2605.19344#bib.bib4)\]and semantic uncertainty\[[5](https://arxiv.org/html/2605.19344#bib.bib7)\], serving as the post\-hoc signal\-space calibration object and retrieval signal to guide the calibration process\. We formulate both distributional confidence signals under self\-consistency sampling\[[36](https://arxiv.org/html/2605.19344#bib.bib25)\]and cluster the generated responses semantically\.

##### Length\-normalised token probability

Let $\mathcal{I}_{\max}$ denote the index set of the largest cluster. For each $R_j \in \mathcal{I}_{\max}$, where $r_i$ is the $i$-th token and $r_{<i}$ the preceding context, we compute its length-normalised token probability\[[16](https://arxiv.org/html/2605.19344#bib.bib4)\]:

$$s_j^{\mathrm{tok}} = \exp\!\left(\frac{1}{|R_j|} \sum_{i=1}^{|R_j|} \log p_\theta(r_i \mid r_{<i}, X)\right) \in [0,1].$$

Fitting a Beta distribution to $\{s_j^{\mathrm{tok}}\}$ via method of moments (Appendix [A.1.1](https://arxiv.org/html/2605.19344#A1.SS1.SSS1)) yields $S^{\mathrm{tok}}$ that represents confidence as a distribution of token-level probabilities for a particular semantic meaning.
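The score above is the geometric mean of the token probabilities. A minimal sketch, assuming the per-token log-probabilities of a sampled response are already available as a list (the function name is illustrative):

```python
import math


def length_normalised_token_probability(token_logprobs):
    """exp of the mean token log-probability, i.e. the geometric mean of token probabilities."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

Normalising by length keeps longer responses from being penalised simply for containing more tokens, so scores from responses of different lengths in the same cluster remain comparable.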

##### Semantic uncertainty

Let $\mathcal{I}_{\max}$ denote the index set of the largest cluster and $N$ be the total number of sampled responses. For the self-consistency samples drawn for a given input, the Beta parameters are set directly from cluster counts,

$$\alpha^{\mathrm{sem}} = \lvert\mathcal{I}_{\max}\rvert, \qquad \beta^{\mathrm{sem}} = N - \lvert\mathcal{I}_{\max}\rvert.$$

Both parameters are clipped to a minimum of $10^{-6}$ to handle degenerate cases (e.g. all responses falling into a single cluster, which would set $\beta^{\mathrm{sem}} = 0$), yielding $S^{\mathrm{sem}} = \mathrm{Beta}(\max(\alpha^{\mathrm{sem}}, 10^{-6}),\, \max(\beta^{\mathrm{sem}}, 10^{-6}))$ that represents confidence as a distribution of semantic support for a particular semantic meaning across samples.
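The cluster-count construction can be sketched as follows. The semantic clustering itself is assumed to have been done upstream; `cluster_ids` is a hypothetical input holding one cluster label per sampled response.

```python
from collections import Counter

EPS = 1e-6  # clipping floor stated in the text


def semantic_beta(cluster_ids):
    """Beta parameters from cluster counts: alpha = |largest cluster|, beta = N - alpha."""
    if not cluster_ids:
        raise ValueError("need at least one sampled response")
    n = len(cluster_ids)
    largest = Counter(cluster_ids).most_common(1)[0][1]
    return max(float(largest), EPS), max(float(n - largest), EPS)
```

For example, five samples with labels `["a", "a", "b", "a", "c"]` give `(3.0, 2.0)`, whilst four samples that all fall into one cluster give `(4.0, 1e-06)` after clipping.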

Table 1: Instance-level metric comparison across controlled subsets varying in mean confidence, concentration, and accuracy. Only FD correctly ranks surprise levels consistent with each subset’s distributional profile; KL divergence, $\mathbb{E}[\text{Brier}]$, and $\mathbb{E}[\text{NLL}]$ each fail to recover the expected ordering.

| Subset | Acc. | Avg Conf. | Conc. | FD ↓ | KL | $\mathbb{E}[\text{Brier}]$ | $\mathbb{E}[\text{NLL}]$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1) high conf., high conc., wrong | 0.0 | 0.812 | 25.8 | 2.932 | 0.168 | 0.681 | 2.052 |
| (2) low conf., high conc., right | 1.0 | 0.448 | 6.0 | 0.550 | 0.114 | 0.348 | 0.953 |
| (3) high conf., low conc., wrong | 0.0 | 0.656 | 1.0 | 0.486 | 0.590 | 0.557 | 3.832 |
| (4) low conf., low conc., right | 1.0 | 0.433 | 1.0 | 0.392 | 0.378 | 0.442 | 1.719 |

## 5 Experiments

### 5.1 Setup preliminaries

We evaluate across five open-source language models from different families: GPT-OSS-20B\[[1](https://arxiv.org/html/2605.19344#bib.bib46)\], Llama-3.1-8B-Instruct\[[21](https://arxiv.org/html/2605.19344#bib.bib49)\], Qwen3-8B\[[37](https://arxiv.org/html/2605.19344#bib.bib45)\], Mistral-7B-Instruct-v0.3\[[22](https://arxiv.org/html/2605.19344#bib.bib47)\], and Gemma-4-31B-IT\[[7](https://arxiv.org/html/2605.19344#bib.bib48)\], on three benchmarks: MMLU\[[9](https://arxiv.org/html/2605.19344#bib.bib16)\], SQuAD 2.0\[[26](https://arxiv.org/html/2605.19344#bib.bib17)\], and TruthfulQA\[[19](https://arxiv.org/html/2605.19344#bib.bib36)\], covering reasoning-heavy multiple-choice, reading comprehension, and closed-book short-answer question-answering formats. We elicit responses using the Direct QA and Hedged QA prompt templates based on the work of Yona et al.\[[38](https://arxiv.org/html/2605.19344#bib.bib9)\]. Direct QA aims to produce natural, succinct free-form responses, whilst Hedged QA additionally instructs the model to express uncertainty through hedging language and serves as a black-box baseline for RALC. The full templates are in Appendix [C.1](https://arxiv.org/html/2605.19344#A3.SS1). We employ GPT-OSS-20B both as the model-based grader for correctness and as the rewriting model in our RALC pipeline. To estimate linguistic confidence as perceived by readers, we construct an evaluator ensemble of three LLMs (Qwen3-8B, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3) as a human audience surrogate, capturing potential co-occurrences of linguistic cues and their contextual interactions; each model independently evaluates the confidence expressed in a response three times. A Beta distribution is fitted to these scores via method of moments (Appendix [A.1.1](https://arxiv.org/html/2605.19344#A1.SS1.SSS1)). We further validate our LLM ensemble against the human-annotated linguistic benchmark of Tao et al.\[[32](https://arxiv.org/html/2605.19344#bib.bib8)\] (Appendix [C.2.2](https://arxiv.org/html/2605.19344#A3.SS2.SSS2)).

![Refer to caption](https://arxiv.org/html/2605.19344v1/x2.png)

Figure 2: Pre-calibration → post-calibration changes in generalised ECE and Faithfulness Divergence across signal space (top) and linguistic space (bottom), averaged across MMLU, SQuAD 2.0, and TruthfulQA. RALC consistently improves (reduces) both metrics across all confidence signals in both spaces.

### 5.2 Measuring faithfulness under distributional confidence

To validate the quantification of the *surprise after truth revelation*, we construct four controlled subsets from the linguistic confidence distributions of Llama-3.1-8B-Instruct on SQuAD 2.0, spanning the corners of the distributional confidence space by varying mean confidence, concentration, and accuracy. We compare FD against KL divergence \[[15](https://arxiv.org/html/2605.19344#bib.bib41)\], $\mathbb{E}[\text{Brier}]$ \[[3](https://arxiv.org/html/2605.19344#bib.bib40)\], and $\mathbb{E}[\text{NLL}]$. Whilst the Brier score and NLL are commonly used in scalar settings, we provide their distributional generalisations in Appendix [A.2](https://arxiv.org/html/2605.19344#A1.SS2).

As shown in Table [1](https://arxiv.org/html/2605.19344#S4.T1), FD is the only metric that recovers the expected surprise ordering: concentrated, misaligned distributions receive the highest penalties (surprise), whilst the same misalignment expressed with low concentration receives lower penalties. KL divergence inverts this ranking: because it does not weight by the effective sample size, it penalises diffuse distributions more heavily regardless of mean misalignment. $\mathbb{E}[\text{Brier}]$ and $\mathbb{E}[\text{NLL}]$ also fail to recover the expected surprise ordering, as neither encodes variance as an amplifier or mediator of surprise. These results indicate that, among the metrics compared, only FD captures both mean misalignment and dispersion, making it the appropriate instance-level faithfulness metric in the distributional confidence setting under our information-theoretic modelling. In addition to the empirical validation, we provide further theoretical ablation studies on FD in Appendix [B.4](https://arxiv.org/html/2605.19344#A2.SS4).

### 5.3 Retrieval-Augmented Linguistic Calibration (RALC)

#### 5.3.1 In-domain calibration

##### Signal\-space and linguistic\-space calibration

We evaluate RALC's ability to improve calibration and faithfulness across both signal and linguistic spaces. For each question, we generate 20 responses under self-consistency sampling, cluster them semantically, and identify the largest cluster. The first response in the majority cluster is selected as the original response for linguistic calibration. Linguistic confidence is estimated by our LLM ensemble on this response; token probability and semantic uncertainty distributions are constructed from the cluster according to Section [4.3](https://arxiv.org/html/2605.19344#S4.SS3). We provide the LLM-based clustering prompt in Appendix [C.4](https://arxiv.org/html/2605.19344#A3.SS4).
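The selection of the original response can be sketched as follows. This is a minimal illustration, assuming the semantic cluster labels have already been produced by the LLM-based clustering step; the function name `select_majority_response` and the integer label format are hypothetical and not taken from the paper. Breaking ties between equally large clusters in favour of the cluster seen first is also an assumption.

```python
from collections import Counter


def select_majority_response(responses, cluster_ids):
    """Return (original_response, majority_cluster_members).

    responses   : sampled responses, in sampling order
    cluster_ids : one semantic-cluster label per response (hypothetical format)
    """
    if not responses or len(responses) != len(cluster_ids):
        raise ValueError("need one cluster label per response")
    counts = Counter(cluster_ids)
    # most_common(1) returns the largest cluster; on ties it keeps the
    # label that was encountered first (an assumption of this sketch).
    majority_id, _ = counts.most_common(1)[0]
    members = [r for r, c in zip(responses, cluster_ids) if c == majority_id]
    # The first response of the majority cluster is the one that gets rewritten.
    return members[0], members
```

For example, with responses `["Paris", "Paris, I think", "Lyon", "Paris."]` and labels `[0, 0, 1, 0]`, the sketch selects `"Paris"` and returns the three members of cluster 0, from which the signal distributions would then be built.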

For each confidence signal, we train a Platt scaling calibration map on the first 30% of each dataset, regressing per-response distribution means against binary correctness labels $y\in\{0,1\}$. We then run RALC inference on the remaining 70% held-out set, applying the pre-trained Platt scaling map to the estimated confidence distributions to obtain calibrated distributions. The calibrated distributions are passed to the retrieval-augmented linguistic control module, which selects the top $k=5$ from the top 30 mean-based shortlisted hedging expressions to rewrite the original response into a linguistically calibrated one. We provide the LLM rewriting prompt in Appendix [D.3](https://arxiv.org/html/2605.19344#A4.SS3) and an ablation study on the choice of $k$ in Appendix [D.5](https://arxiv.org/html/2605.19344#A4.SS5).
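The train/held-out protocol and the Platt map can be illustrated with a small standard-library sketch. It assumes the map has the usual logistic form $\sigma(a\,s+b)$ applied to the raw distribution mean $s$, fitted by maximum likelihood with Newton's method; the paper's exact parameterisation, and how the map is propagated from the mean to the full Beta distribution, are not shown in this section, so the sketch calibrates means only. All function names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def split_first_percent(items, percent=30):
    # First `percent`% for fitting the map, the rest held out for inference.
    cut = len(items) * percent // 100
    return items[:cut], items[cut:]


def fit_platt(means, labels, n_iter=50, damping=1e-9):
    """Fit P(correct | s) = sigmoid(a * s + b) by Newton's method.

    Assumes at least two distinct confidence values and labels that are
    not perfectly separable; otherwise the estimate is not well defined.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(means, labels):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            g_a += (p - y) * s          # gradient of the log-loss w.r.t. a
            g_b += (p - y)              # gradient of the log-loss w.r.t. b
            h_aa += w * s * s           # Hessian entries
            h_ab += w * s
            h_bb += w
        h_aa += damping
        h_bb += damping
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: (a, b) <- (a, b) - H^{-1} g, using the 2x2 inverse.
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


def apply_platt(mean_confidence, a, b):
    # Calibrated mean confidence for one held-out response.
    return sigmoid(a * mean_confidence + b)
```

As a usage example, if responses with mean confidence 0.9 are correct 60% of the time and those with mean confidence 0.6 are correct 30% of the time, the fitted map sends 0.9 to roughly 0.6 and 0.6 to roughly 0.3, pulling overconfident signals towards their empirical accuracy.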

We evaluate pre- and post-calibration in the signal space by comparing the Faithfulness Divergence and generalised ECE \[[35](https://arxiv.org/html/2605.19344#bib.bib1)\] of the confidence distributions before and after the calibration map. In the linguistic space, we estimate the linguistic confidence for both the original and linguistically calibrated responses using the LLM ensemble and evaluate using the same metrics. As shown in Figure [2](https://arxiv.org/html/2605.19344#S5.F2), RALC consistently improves both metrics across all three confidence signals in both spaces. Among the three signals, semantic uncertainty yields the strongest improvements in calibration and faithfulness, as shown in Table [2](https://arxiv.org/html/2605.19344#S5.SS3.SSS2). Additional results are detailed in Appendix [E](https://arxiv.org/html/2605.19344#A5).

##### Benchmark against calibration baselines

![Refer to caption](https://arxiv.org/html/2605.19344v1/x3.png)

Figure 3: Calibration effectiveness and quality comparison between RALC (averaged across all signals and models), Hedged QA, and Direct Beta-Guided Rewrite across content preservation (entailment), signal-to-language confidence correlation ($\rho$), linguistic-space Faithfulness Divergence, and linguistic-space generalised ECE. RALC matches Direct Beta-Guided Rewrite on content preservation, achieves a markedly higher signal-to-linguistic-space correlation, and outperforms both baselines on calibration and faithfulness.

To contextualise RALC's linguistic calibration quality, we benchmark it against two baselines: Hedged QA, a black-box baseline that prompts the model to hedge without access to any calibrated signal \[[38](https://arxiv.org/html/2605.19344#bib.bib9)\], and Direct Beta-Guided Rewrite, a grey-box ablation of our pipeline in which the lexicon retrieval step is removed and the calibrated Beta distribution is passed directly to the rewriting model, relying on it to select appropriate hedging language without explicit linguistic grounding. This comparison isolates the contribution of structured lexicon retrieval over uncontrolled and partially controlled generation.

All performance metrics are measured in the linguistic space. We first apply our LLM ensemble to estimate the linguistic confidence expressed in the original uncalibrated Direct QA responses. For each calibration method, we then re-estimate linguistic confidence from the rewritten output and compute all metrics accordingly. Following the entailment setup of Farquhar et al. \[[5](https://arxiv.org/html/2605.19344#bib.bib7)\], content preservation is scored as $p_{\mathrm{entail}}+0.5\cdot p_{\mathrm{neutral}}$ using DeBERTa-v3-Large-MNLI, and evaluated alongside signal-to-language confidence correlation ($\rho$), Faithfulness Divergence, and generalised ECE. As shown in Figure [3](https://arxiv.org/html/2605.19344#S5.F3), RALC substantially outperforms Hedged QA across all metrics. Against Direct Beta-Guided Rewrite, RALC matches on content preservation whilst achieving lower Faithfulness Divergence and ECE, and attains a markedly higher Spearman $\rho$ between calibrated signal-space and re-estimated linguistic-space confidence, indicating that structured lexicon retrieval propagates the calibrated signal into language more reliably than unconstrained hedging selection.
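Two of the quantities above are simple enough to sketch directly. The block below is an illustrative standard-library implementation, assuming the NLI class probabilities are already available from an entailment model; the function names are hypothetical, and the tie handling in the rank correlation (average ranks) is an assumption, since the section does not specify it.

```python
import math


def content_preservation(p_entail, p_neutral):
    # Score from the text: p_entail + 0.5 * p_neutral.
    return p_entail + 0.5 * p_neutral


def average_ranks(values):
    # Rank from 1..n, giving tied values the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(signal_conf, linguistic_conf):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(signal_conf), average_ranks(linguistic_conf)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

A high $\rho$ under this definition means that responses with higher calibrated signal-space confidence are also read as more confident after rewriting, which is the property the comparison with Direct Beta-Guided Rewrite is probing.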

#### 5.3.2 Cross-domain confidence calibration

Table 2: In-domain and cross-domain linguistic-space calibration metric percentage changes for both Faithfulness Divergence and generalised ECE. We report percentage change relative to the pre-calibration metrics (mean ± std across models). Green text indicates calibration improvement (lower error). Semantic uncertainty is the strongest signal for RALC in improving both calibration and faithfulness across all three benchmarks in both in-domain and cross-domain settings.

| Metric | Signal | Train/Test | MMLU | SQuAD 2.0 | TruthfulQA |
|---|---|---|---|---|---|
| Faithfulness Divergence Mean Reduction | Linguistic Confidence | MMLU | Δ11.4 ± 19.2% | Δ34.6 ± 11.6% | Δ29.6 ± 12.4% |
| | | SQuAD 2.0 | Δ42.7 ± 4.2% | Δ62.0 ± 2.3% | Δ59.8 ± 3.9% |
| | | TruthfulQA | Δ27.0 ± 17.2% | Δ65.8 ± 2.2% | Δ60.5 ± 4.6% |
| | Token Probability | MMLU | Δ10.2 ± 23.2% | Δ52.6 ± 8.1% | Δ56.4 ± 6.2% |
| | | SQuAD 2.0 | Δ32.8 ± 4.9% | Δ64.9 ± 2.4% | Δ70.4 ± 3.3% |
| | | TruthfulQA | Δ48.8 ± 5.1% | Δ64.2 ± 3.1% | Δ64.6 ± 3.0% |
| | Semantic Uncertainty | MMLU | Δ21.6 ± 13.9% | Δ36.1 ± 12.4% | Δ43.0 ± 9.8% |
| | | SQuAD 2.0 | Δ43.9 ± 4.8% | Δ66.0 ± 4.1% | Δ65.5 ± 4.3% |
| | | TruthfulQA | Δ59.9 ± 2.8% | Δ66.4 ± 1.9% | Δ66.1 ± 2.9% |
| Generalised ECE Mean Reduction | Linguistic Confidence | MMLU | Δ38.1 ± 8.0% | Δ20.7 ± 7.6% | Δ12.2 ± 6.4% |
| | | SQuAD 2.0 | Δ21.9 ± 11.4% | Δ43.4 ± 1.8% | Δ35.0 ± 2.3% |
| | | TruthfulQA | Δ21.9 ± 7.1% | Δ47.2 ± 1.9% | Δ39.6 ± 0.9% |
| | Token Probability | MMLU | Δ46.8 ± 8.7% | Δ38.5 ± 11.1% | Δ37.5 ± 10.1% |
| | | SQuAD 2.0 | Δ23.5 ± 2.2% | Δ50.6 ± 1.0% | Δ61.4 ± 1.1% |
| | | TruthfulQA | Δ60.2 ± 2.6% | Δ48.6 ± 2.4% | Δ42.3 ± 2.3% |
| | Semantic Uncertainty | MMLU | Δ49.8 ± 8.7% | Δ30.8 ± 8.5% | Δ33.7 ± 5.7% |
| | | SQuAD 2.0 | Δ30.1 ± 16.9% | Δ58.7 ± 3.8% | Δ49.8 ± 4.1% |
| | | TruthfulQA | Δ62.3 ± 3.1% | Δ56.9 ± 4.0% | Δ45.0 ± 3.1% |

Having established strong in\-domain performance, we examine the cross\-domain transferability of confidence signals through the lens of RALC\. The assessment is independent of the performance of RALC; rather, it uses RALC as a diagnostic framework to evaluate signal stability under domain shift, specifically the reliability of each confidence signal as an input to the pipeline when the Platt scaling map trained on one dataset is applied to a different domain without retraining\.

We train the Platt scaling calibration map on each of the three datasets in turn and evaluate the resulting RALC pipeline on all three datasets without retraining, yielding both in-domain (diagonal entries) and cross-domain (non-diagonal entries) conditions. We report pre-to-post-RALC percentage reductions in linguistic-space Faithfulness Divergence and generalised ECE \[[35](https://arxiv.org/html/2605.19344#bib.bib1)\] in Table [2](https://arxiv.org/html/2605.19344#S5.SS3.SSS2). Raw metric value changes are provided in Appendix [E.2](https://arxiv.org/html/2605.19344#A5.SS2).
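The evaluation grid and the reported statistic can be sketched as below. This is only an illustration of the bookkeeping: it assumes the percentage reduction is computed per model as $100\cdot(\text{pre}-\text{post})/\text{pre}$ and then summarised by the mean and sample standard deviation across models. Whether the paper aggregates in exactly this order, or uses the sample rather than the population standard deviation, is not stated in this section. All names are hypothetical.

```python
from statistics import mean, stdev

DATASETS = ["MMLU", "SQuAD 2.0", "TruthfulQA"]


def domain_pairs(datasets=DATASETS):
    # Every (train, test) combination; the diagonal is the in-domain setting.
    return [
        (train, test, "in-domain" if train == test else "cross-domain")
        for train in datasets
        for test in datasets
    ]


def percent_reduction(pre, post):
    # Positive values mean the metric (FD or generalised ECE) went down.
    return 100.0 * (pre - post) / pre


def summarise_across_models(pre_by_model, post_by_model):
    # Mean and sample standard deviation of per-model reductions (assumed order).
    reductions = [
        percent_reduction(pre, post)
        for pre, post in zip(pre_by_model, post_by_model)
    ]
    return mean(reductions), stdev(reductions)
```

With three datasets this yields nine train/test pairs per signal and metric, three of them in-domain, which is the layout of each signal block in Table 2.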

All three signals yield improvements across both metrics and all domain pairs\. Semantic uncertainty exhibits the strongest cross\-domain transferability, producing the highest gains with the lowest variance across models in both in\-domain and cross\-domain settings\. As a result, the empirical evidence supports semantic uncertainty as the most robust confidence signal for RALC\.

Notably, cross-domain calibrators occasionally outperform in-domain ones. This anomaly occurs when the target domain has a weak miscalibration bias, providing insufficient signal for its in-domain calibrator to learn a reliable correction. A cross-domain source with a stronger, more consistent bias learns a more decisive correction that transfers to the target domain, provided both share the same direction of miscalibration. We provide a detailed investigation in Appendix [E.3](https://arxiv.org/html/2605.19344#A5.SS3).

## 6 Discussion and conclusion

##### Limitations and future work

As a downstream framework, RALC’s calibration quality is ultimately bounded by the quality of the upstream confidence signal\. Whilst semantic uncertainty is the most robust signal evaluated, its reliance on multi\-round self\-consistency sampling incurs significant inference cost\. Identifying signals that match its performance at a lower computational expense is therefore an important direction for future work\. Additionally, the hedging expression lexicon covers common confidence expressions and is not intended to represent the full landscape of linguistic uncertainty cues\. Future work could investigate domain\-specific hedging vocabularies and audience\-adapted confidence scoring, which would enhance the specialisation and expressiveness of RALC in targeted deployment settings\.

##### Conclusion

In this work, we introduce a distributional treatment of linguistic confidence as a Beta distribution, define faithfulness as a complementary dimension of confidence evaluation, and present Faithfulness Divergence to quantify it from an information\-theoretic perspective\. Building on these foundations, we propose Retrieval\-Augmented Linguistic Calibration \(RALC\), a principled post\-hoc pipeline that calibrates confidence in the linguistic space, yielding well\-calibrated and faithful responses that consistently outperform prompt\-based baselines across models and benchmarks\.

## References

- \[1\] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
- \[2\] (2026-02-17) Introducing Claude Sonnet 4.6. Note: [https://www.anthropic.com/news/claude-sonnet-4-6](https://www.anthropic.com/news/claude-sonnet-4-6)
- \[3\] G. W. Brier (1950) Verification of forecasts expressed in terms of probability. Monthly weather review 78 (1), pp. 1–3.
- \[4\] J. Duan, H. Cheng, S. Wang, A. Zavalny, C. Wang, R. Xu, B. Kailkhura, and K. Xu (2024) Shifting attention to relevance: towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5050–5063.
- \[5\] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630 (8017), pp. 625–630.
- \[6\] Google DeepMind (2025) Gemini 3 Flash: best for frontier intelligence at speed. Note: [https://deepmind.google/models/gemini/flash/](https://deepmind.google/models/gemini/flash/)
- \[7\] Google DeepMind (2025) Gemma 4: our most intelligent open models. Note: [https://deepmind.google/models/gemma/gemma-4/](https://deepmind.google/models/gemma/gemma-4/)
- \[8\] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International conference on machine learning, pp. 1321–1330.
- \[9\] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- \[10\] Y. Huang, Y. Liu, R. Thirukovalluru, A. Cohan, and B. Dhingra (2024) Calibrating long-form generations from large language models. In Findings of the association for computational linguistics: EMNLP 2024, pp. 13441–13460.
- \[11\] L. Itti and P. Baldi (2009) Bayesian surprise attracts human attention. Vision research 49 (10), pp. 1295–1306.
- \[12\] Z. Ji, L. Yu, Y. Koishekenov, Y. Bang, A. Hartshorn, A. Schelten, C. Zhang, P. Fung, and N. Cancedda (2025) Calibrating verbal uncertainty as a linear feature to reduce hallucinations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3769–3793.
- \[13\] S. S. Kim, Q. V. Liao, M. Vorvoreanu, S. Ballard, and J. W. Vaughan (2024) "I'm not sure, but…": examining the impact of large language models' uncertainty expression on user reliance and trust. In Proceedings of the 2024 ACM conference on fairness, accountability, and transparency, pp. 822–835.
- \[14\] M. Kull, T. Silva Filho, and P. Flach (2017) Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial intelligence and statistics, pp. 623–631.
- \[15\] S. Kullback and R. A. Leibler (1951) On information and sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86.
- \[16\] T. A. Lamb, D. R. Ivanova, P. Torr, and T. G. Rudner (2025) Semantic-level confidence calibration of language models via temperature scaling. In ICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI.
- \[17\] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, pp. 9459–9474.
- \[18\] S. Lin, J. Hilton, and O. Evans (2022) Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
- \[19\] S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pp. 3214–3252.
- \[20\] C. Marx, S. Zalouk, and S. Ermon (2023) Calibration by distribution matching: trainable kernel calibration metrics. Advances in Neural Information Processing Systems 36, pp. 25910–25928.
- \[21\] Meta AI (2024) Llama 3.1 model card and prompt formats. Note: [https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/)
- \[22\] Mistral AI Team (2023-09) Mistral 7B. Note: [https://mistral.ai/news/announcing-mistral-7b](https://mistral.ai/news/announcing-mistral-7b)
- \[23\] S. Morita, P. F. Thall, and P. Müller (2008) Determining the effective sample size of a parametric prior. Biometrics 64 (2), pp. 595–602.
- \[24\] OpenAI (2026-03-05) Introducing GPT-5.4. Note: [https://openai.com/index/introducing-gpt-5-4/](https://openai.com/index/introducing-gpt-5-4/)
- \[25\] J. C. Platt (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3), pp. 61–74.
- \[26\] P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don't know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789.
- \[27\] M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. Advances in neural information processing systems 31.
- \[28\] H. Song, T. Diethe, M. Kull, and P. Flach (2019) Distribution calibration for regression. In International Conference on Machine Learning, pp. 5897–5906.
- \[29\] M. Steyvers, H. Tejeda, A. Kumar, C. Belem, S. Karny, X. Hu, L. W. Mayer, and P. Smyth (2025) What large language models know and what people think they know. Nature Machine Intelligence 7 (2), pp. 221–231.
- \[30\] D. J. Sumler, L. Devlin, S. Maskell, and R. O. Lane (2025) An entropic metric for measuring calibration of machine learning models. arXiv preprint arXiv:2502.14545.
- \[31\] L. Tao, Y. Yeh, M. Dong, T. Huang, P. Torr, and C. Xu (2025) Revisiting uncertainty estimation and calibration of large language models. arXiv preprint arXiv:2505.23854.
- \[32\] L. Tao, Y. Yeh, B. Kai, M. Dong, T. Huang, T. A. Lamb, J. Yu, P. H. Torr, and C. Xu (2025) Can large language models express uncertainty like human?. arXiv preprint arXiv:2509.24202.
- \[33\] A. Thompson and V. Desai (2026) Extending confidence calibration to generalised measures of variation. arXiv preprint arXiv:2602.12975.
- \[34\] K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. D. Manning (2023) Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5433–5442.
- \[35\] P. Wang, B. D. Lam, Y. Liu, A. Asgari-Targhi, R. Panda, W. M. Wells, T. Kapur, and P. Golland (2024) Calibrating expressions of certainty. arXiv preprint arXiv:2410.04315.
- \[36\] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2022) Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- \[37\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- \[38\] G. Yona, R. Aharoni, and M. Geva (2024) Can large language models faithfully express their intrinsic uncertainty in words?. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7752–7764.
- \[39\] B. Zadrozny and C. Elkan (2001) Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, Vol. 1, pp. 2001.
- \[40\] B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699.
- \[41\] H. Zhang and L. T. Maloney (2012) Ubiquitous log odds: a common representation of probability and frequency distortion in perception, action, and cognition. Frontiers in neuroscience 6, pp. 1.
- \[42\] C. Zhu, B. Xu, Q. Wang, Y. Zhang, and Z. Mao (2023-12) On the calibration of large language models and alignment. In Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 9778–9795. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.654/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.654)

## Appendix A Theory

### A.1 Beta distribution estimation

The Beta distribution is a natural choice for modelling random variables supported on $[0,1]$, such as confidence scores and empirical accuracies. A Beta distribution with parameters $(\alpha,\beta)$ has probability density function

$$p(x\mid\alpha,\beta)=\frac{1}{\mathrm{B}(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1},\quad x\in[0,1],$$

where $\alpha>0$, $\beta>0$, and $\mathrm{B}(\alpha,\beta)$ denotes the Beta function. Its mean, variance, and concentration factor are given by

$$\mathbb{E}[X]=\frac{\alpha}{\alpha+\beta},\qquad\mathrm{Var}(X)=\frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)},\qquad\kappa=\alpha+\beta.$$

#### A.1.1 Method of moments

Let𝒮Z=\{st\}t=1TZ\\mathcal\{S\}\_\{Z\}=\\\{s\_\{t\}\\\}\_\{t=1\}^\{T\_\{Z\}\}denote the pseudo\-observation set for a fixed targetZZ, withst∈\[0,1\]s\_\{t\}\\in\[0,1\]\. Denote the empirical mean and variance by

s¯Z=1TZ​∑t=1TZst,vZ=1TZ−1​∑t=1TZ\(st−s¯Z\)2\.\\bar\{s\}\_\{Z\}=\\frac\{1\}\{T\_\{Z\}\}\\sum\_\{t=1\}^\{T\_\{Z\}\}s\_\{t\},\\qquad v\_\{Z\}=\\frac\{1\}\{T\_\{Z\}\-1\}\\sum\_\{t=1\}^\{T\_\{Z\}\}\(s\_\{t\}\-\\bar\{s\}\_\{Z\}\)^\{2\}\.The method of moments estimates\(α,β\)\(\\alpha,\\beta\)by matching these empirical moments to the theoretical mean and variance of the Beta distribution\. Solving the resulting system yields

α^Z=s¯Z​\(s¯Z​\(1−s¯Z\)vZ−1\),β^Z=\(1−s¯Z\)​\(s¯Z​\(1−s¯Z\)vZ−1\),\\hat\{\\alpha\}\_\{Z\}=\\bar\{s\}\_\{Z\}\\left\(\\frac\{\\bar\{s\}\_\{Z\}\(1\-\\bar\{s\}\_\{Z\}\)\}\{v\_\{Z\}\}\-1\\right\),\\qquad\\hat\{\\beta\}\_\{Z\}=\(1\-\\bar\{s\}\_\{Z\}\)\\left\(\\frac\{\\bar\{s\}\_\{Z\}\(1\-\\bar\{s\}\_\{Z\}\)\}\{v\_\{Z\}\}\-1\\right\),provided thatvZ\>0v\_\{Z\}\>0andvZ<s¯Z​\(1−s¯Z\)v\_\{Z\}<\\bar\{s\}\_\{Z\}\(1\-\\bar\{s\}\_\{Z\}\)\. When this condition is not met, two fallback cases are distinguished\.

Boundary\-degenerate case\(s¯Z∈\{0,1\}\\bar\{s\}\_\{Z\}\\in\\\{0,1\\\}, i\.e\. all observations are 0 or all are 1\): the mean\-preserving concentration formula cannot be applied since one parameter would be zero\. Instead we set

α^Z=s¯Z⋅κ,β^Z=\(1−s¯Z\)⋅κ,\\hat\{\\alpha\}\_\{Z\}=\\bar\{s\}\_\{Z\}\\cdot\\kappa,\\qquad\\hat\{\\beta\}\_\{Z\}=\(1\-\\bar\{s\}\_\{Z\}\)\\cdot\\kappa,withκ=TZ\\kappa=T\_\{Z\}, and rely on the clipping step below to lift any zero parameter to10−610^\{\-6\}\. This yields Beta\(TZT\_\{Z\},10−610^\{\-6\}\) when all observations are 1, placing nearly all mass near 1, and Beta\(10−610^\{\-6\},TZT\_\{Z\}\) when all observations are 0, placing nearly all mass near 0\.

Interior\-degenerate case\(vZ=0v\_\{Z\}=0orvZ≥s¯Z​\(1−s¯Z\)v\_\{Z\}\\geq\\bar\{s\}\_\{Z\}\(1\-\\bar\{s\}\_\{Z\}\)withs¯Z∈\(0,1\)\\bar\{s\}\_\{Z\}\\in\(0,1\)\): the observations are constant or insufficiently dispersed at an interior value\. We preserve the empirical mean by setting

α^Z=s¯Z⋅κ,β^Z=\(1−s¯Z\)⋅κ,\\hat\{\\alpha\}\_\{Z\}=\\bar\{s\}\_\{Z\}\\cdot\\kappa,\\qquad\\hat\{\\beta\}\_\{Z\}=\(1\-\\bar\{s\}\_\{Z\}\)\\cdot\\kappa,withκ=TZ\\kappa=T\_\{Z\}\. This produces a Beta distribution with means¯Z\\bar\{s\}\_\{Z\}and high concentration proportional to the number of observations, reflecting the certainty implied by the consistency of the pseudo\-observations\.

In all cases, both parameters are clipped to a minimum of $10^{-6}$ to ensure numerical stability:

$$\hat{\alpha}_{Z}\leftarrow\max(\hat{\alpha}_{Z},\,10^{-6}),\qquad\hat{\beta}_{Z}\leftarrow\max(\hat{\beta}_{Z},\,10^{-6}).$$
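
The fitting procedure above, including both fallback cases and the final clipping, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `fit_beta_moments` is hypothetical, and treating a single observation ($T_{Z}=1$) as having zero variance is an assumption made here so that the sample variance is always defined.

```python
from statistics import mean, variance

EPS = 1e-6  # clipping floor stated in the text


def fit_beta_moments(scores):
    """Method-of-moments Beta fit with the fallback cases and final clipping."""
    t = len(scores)
    s_bar = mean(scores)
    # Unbiased sample variance (T_Z - 1 denominator); assumed 0 when T_Z == 1.
    v = variance(scores) if t > 1 else 0.0
    bound = s_bar * (1.0 - s_bar)
    if 0.0 < v < bound:
        # Regular case: match the empirical mean and variance.
        common = bound / v - 1.0
        alpha, beta = s_bar * common, (1.0 - s_bar) * common
    else:
        # Boundary- and interior-degenerate cases share kappa = T_Z.
        kappa = float(t)
        alpha, beta = s_bar * kappa, (1.0 - s_bar) * kappa
    # Clip both parameters for numerical stability.
    return max(alpha, EPS), max(beta, EPS)
```

Under this sketch, three observations all equal to 1 give $\mathrm{Beta}(3, 10^{-6})$, matching the boundary-degenerate behaviour described above.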

### A.2 Classical instance-level calibration metrics with distribution generalisation

#### A.2.1 Expected Brier Score

Let $S=\mathrm{Beta}(\alpha,\beta)$ denote the predictive confidence distribution, let $p\sim S$ denote the scalar confidence value drawn from it, and let $y\in\{0,1\}$ be the ground-truth label. The Expected Brier Score is defined as

$$\mathbb{E}_{p\sim S}\left[(p-y)^{2}\right].$$

This admits a closed form:

$$\mathbb{E}[(p-y)^{2}]=\mathrm{Var}(p)+\left(\mathbb{E}[p]-y\right)^{2},$$

where

$$\mathbb{E}[p]=\frac{\alpha}{\alpha+\beta},\quad\mathrm{Var}(p)=\frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}.$$

Hence,

$$\mathbb{E}[(p-y)^{2}]=\frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}+\left(\frac{\alpha}{\alpha+\beta}-y\right)^{2}.$$
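
As a quick numerical check of this closed form, the sketch below evaluates it directly. The function name `expected_brier` is hypothetical and introduced only for illustration.

```python
def expected_brier(alpha, beta, y):
    """Closed-form E[(p - y)^2] for p ~ Beta(alpha, beta) and y in {0, 1}."""
    total = alpha + beta
    mean_p = alpha / total
    var_p = alpha * beta / (total ** 2 * (total + 1.0))
    # Variance of p plus the squared gap between the mean and the label.
    return var_p + (mean_p - y) ** 2
```

For the uniform case $\mathrm{Beta}(1,1)$ with $y=1$, the formula gives $1/12+1/4=1/3$, which agrees with $\int_{0}^{1}(p-1)^{2}\,dp=1/3$.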

### A.3 Expected Negative Log-Likelihood (NLL)

Given $S=\mathrm{Beta}(\alpha,\beta)$, $p\sim S$, and label $y\in\{0,1\}$, we define the distributional Negative Log-Likelihood as the expected Bernoulli log-loss under $S$:

$$\mathcal{L}_{\mathrm{NLL}}(S,y)=\mathbb{E}_{p\sim S}\left[-\log p(y\mid p)\right],$$

where $p(y\mid p)=p^{y}(1-p)^{1-y}$.

This yields the closed form:

$$\mathcal{L}_{\mathrm{NLL}}(S,y)=\begin{cases}\psi(\alpha+\beta)-\psi(\alpha),&y=1,\\ \psi(\alpha+\beta)-\psi(\beta),&y=0,\end{cases}$$

where $\psi(\cdot)$ denotes the digamma function.

Equivalently, this can be written as

$$\mathcal{L}_{\mathrm{NLL}}(S,y)=y\left[\psi(\alpha+\beta)-\psi(\alpha)\right]+(1-y)\left[\psi(\alpha+\beta)-\psi(\beta)\right].$$
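
The closed form follows from the standard Beta identity $\mathbb{E}[\log p]=\psi(\alpha)-\psi(\alpha+\beta)$ (and symmetrically for $\log(1-p)$). A minimal sketch of the computation is given below. The helper names are hypothetical, and the small standard-library `digamma` approximation is an assumption made to keep the example self-contained; a library routine such as `scipy.special.digamma` could be substituted.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def expected_nll(alpha, beta, y):
    """Expected Bernoulli log-loss under Beta(alpha, beta) for y in {0, 1}."""
    target = alpha if y == 1 else beta
    return digamma(alpha + beta) - digamma(target)
```

For $\mathrm{Beta}(1,1)$ and $y=1$ this returns $\psi(2)-\psi(1)=1$, in agreement with $\int_{0}^{1}-\log p\,dp=1$.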

## Appendix B Further discussion on faithfulness

### B.1 Distributional representation of linguistic confidence

We represent response-level linguistic confidence as $S=\mathrm{Beta}(\alpha,\beta)$, where the mean $\mu=\alpha/(\alpha+\beta)$ captures the consensus perceived confidence across readers and the concentration $\kappa=\alpha+\beta$ captures the strength of that consensus. Two responses may elicit identical mean confidence yet differ substantially in concentration: one may induce consistent perceptions across readers whilst the other induces highly variable ones. Scalar representations discard this distinction; the distributional representation preserves it, and it is precisely this information that faithfulness evaluation requires.

### B.2 Faithfulness Divergence

Faithfulness measures the degree of surprise induced by truth revelation\. A response that elicits high mean confidence with high concentration is highly surprising when incorrect, as it represents a strongly held prior belief that the ground truth contradicts\. The same misalignment expressed with low concentration is less surprising, since the prior was weakly held and requires only a modest update\. Faithfulness Divergence \(FD\) operationalises this intuition\.

Formally, for instance $i$ with correctness label $y_{i}$, the estimated confidence distribution serves as the prior $S_{i}=\mathrm{Beta}(\alpha_{i},\beta_{i})$. Upon observing $y_{i}$, the prior is updated by a single Bernoulli observation to yield the posterior $S_{i}^{*}=\mathrm{Beta}(\alpha_{i}+y_{i},\,\beta_{i}+1-y_{i})$. The KL divergence $\mathrm{KL}(S_{i}^{*}\|S_{i})$ quantifies the normalised magnitude of the required belief revision. However, KL divergence alone does not account for the strength of agreement underlying the prior. Under the Beta–Bernoulli model, the concentration $\alpha_{i}+\beta_{i}$ is interpretable as the effective sample size of the prior: a larger concentration encodes a more strongly held belief, and an identical KL divergence therefore represents a larger total epistemic adjustment when the prior is more concentrated \[[23](https://arxiv.org/html/2605.19344#bib.bib42)\]. FD scales the KL divergence by this effective sample size,

$$\mathrm{FD}_{i}:=(\alpha_{i}+\beta_{i})\cdot\mathrm{KL}\!\left(S_{i}^{*}\,\|\,S_{i}\right),$$

yielding a scalar that quantifies the degree of surprise induced by truth revelation, scaled by the strength of agreement encoded in the prior.
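
The definition can be evaluated with the standard closed-form KL divergence between two Beta distributions. The sketch below is illustrative only: the function names are hypothetical, and the standard-library `digamma` approximation is an assumption made to avoid external dependencies.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a1, b1, a2, b2):
    """KL( Beta(a1, b1) || Beta(a2, b2) ) in closed form."""
    return (log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))


def faithfulness_divergence(alpha, beta, y):
    """FD = (alpha + beta) * KL(posterior || prior) for a label y in {0, 1}."""
    post_a, post_b = alpha + y, beta + 1 - y  # single Bernoulli update
    return (alpha + beta) * kl_beta(post_a, post_b, alpha, beta)
```

With a misaligned mean of 0.75 and $y=0$, this sketch gives a larger FD for $\mathrm{Beta}(15,5)$ than for $\mathrm{Beta}(3,1)$, even though the underlying KL term is smaller for the more concentrated prior.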

### B.3 Scope and intended use

FD is designed as a diagnostic measurement tool, not as a proper scoring rule or a training objective\. It quantifies the surprise induced at the instance level when the ground truth is revealed, providing a complementary lens on confidence quality that population\-level metrics such as ECE do not capture\. The goal of a well\-calibrated response is not to minimise FD in isolation, but to express confidence that is consistent with the model’s actual uncertainty; FD measures how far a given response falls from that standard\. It should therefore be interpreted alongside calibration metrics rather than treated as a sole optimisation target\.

### B.4 Additional Faithfulness Divergence ablation studies

In addition to the empirical ablation in Table [1](https://arxiv.org/html/2605.19344#S4.T1), we conduct theoretical ablations to further illustrate the unique properties of Faithfulness Divergence and its alignment with our definition of surprise upon truth revelation, compared against alternative metrics including KL divergence, expected Brier score, and expected NLL. Two controlled settings are examined: varying the confidence mean under a fixed concentration, and varying concentration under a fixed mean, each evaluated against a binary ground-truth label.

Figure [4](https://arxiv.org/html/2605.19344#A2.F4) shows that all four metrics correctly capture mean deviation from the ground-truth label as increasing surprise, assigning monotonically higher values as the confidence mean moves further from the outcome. However, Figure [5](https://arxiv.org/html/2605.19344#A2.F5) demonstrates that only FD encodes concentration as an amplifier of misalignment, assigning monotonically higher surprise as concentration increases for a fixed misaligned mean, whilst the alternative metrics behave otherwise. FD therefore uniquely quantifies surprise upon truth revelation in accordance with our definition, which requires that a more strongly held incorrect belief be treated as more surprising.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x4.png)

Figure 4: Faithfulness Divergence, KL divergence, expected Brier score, and expected NLL under varying confidence means with fixed concentration ($\alpha+\beta=20$) against a binary ground-truth label. All metrics increase monotonically as the mean deviates from the ground-truth label, reflecting greater surprise upon truth revelation for more misaligned beliefs.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x5.png)

Figure 5: Faithfulness Divergence, KL divergence, expected Brier score, and expected NLL under varying concentration with a fixed misaligned mean ($\mu=0.75$) against a binary ground-truth label. Only Faithfulness Divergence correctly increases monotonically with concentration, encoding the intuition that a strongly held incorrect belief induces greater surprise upon truth revelation than a weakly held one of equal mean.

## Appendix C Confidence evaluation implementation

### C.1 QA prompts

We detail the exact prompts used to sample responses across MMLU, SQuAD 2\.0, and TruthfulQA with different confidence estimation methods, including linguistic confidence, token probability, and semantic uncertainty\.

#### C.1.1 MMLU

Direct QA prompt:

Answer the following question using a succinct (at most one sentence) and full answer.

{ question }

{% for choice in choices %}{ 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'[loop.index0] }. { choice }{% endfor %}

Answer: [Return the answer letter only]

Hedged QA prompt:

Answer the following question using a succinct (at most one sentence) and full answer. If you are uncertain about your answer to the question, convey this uncertainty linguistically by precisely hedging this answer. Please include the option letter in your answer.

Question: { question }

{% for choice in choices %}{ 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'[loop.index0] }. { choice }{% endfor %}

Answer:

#### C.1.2 SQuAD 2.0

Direct QA prompt:

Answer the following question using a succinct (at most one sentence) and full answer.

Title: { title }

Background: { context }

Question: { question }

Answer:

Hedged QA prompt:

Answer the following question using a succinct (at most one sentence) and full answer. If you are uncertain about your answer to the question, convey this uncertainty linguistically by precisely hedging this answer.

Title: { title }

Background: { context }

Question: { question }

Answer:

#### C.1.3 TruthfulQA

Direct QA prompt:

Answer the following question using a succinct (at most one sentence) and full answer.

Question: {question}

Answer:

Hedged QA prompt:

Answer the following question using a succinct (at most one sentence) and full answer. If you are uncertain about your answer to the question, convey this uncertainty linguistically by precisely hedging this answer.

Question: {question}

Answer:

### C.2 LLM linguistic confidence evaluator

#### C.2.1 Evaluator prompt

An LLM linguistic confidence evaluator ensemble is prompted as follows. We use an ensemble of models rather than a single evaluator to capture the complex linguistic relationships arising from co-occurring cues and their contextual interactions within a statement, whilst averaging out idiosyncratic biases of individual models. To align with human perception, we provide the LLMs with human-annotated linguistic cues and their associated confidence profiles from Tao et al. \[[32](https://arxiv.org/html/2605.19344#bib.bib8)\] as reference. The LLM is then asked to return a confidence score between 0 and 100 based solely on the linguistic cues present in the sentence, without using any external or prior knowledge to assess the knowledge conveyed by the sentence. The extracted output score is then normalised to $[0,1]$.

LLM linguistic confidence evaluator prompt:

Please provide only a confidence score between 0 and 100, based solely on the degree of confidence expressed in the tone and linguistic cues of the following sentence, without using any external or prior knowledge to assess the knowledge conveyed by the sentence. Please pay attention to the hedging language used and the overall assertiveness of the statement. If the sentence abstains from answering by pointing out the insufficiency of information with a firm tone, the score should be towards 100. If the sentence contains random guesses or abstention, the score should be towards 0. If the sentence is stated with strong certainty or no hedging, the score should be towards 100. If the sentence does not contain any hedging language or is a succinct, decisive short answer, the score should be towards 100, too.

To align with human perception, here are some human-annotated linguistic cues with perceived confidence profiles (mean and standard deviation) for your reference: {human_annotated_cues}

Here is the sentence: {sentence}

Confidence Score: [Return only a number between 0 and 100 without any additional text or explanation]

#### C.2.2 LLM ensemble vs. human benchmark

We compare the LLM-ensemble confidence scores against human annotations on the benchmark of Tao et al. \[[32](https://arxiv.org/html/2605.19344#bib.bib8)\]. The benchmark consists of human-annotated confidence scores across various statements. We employ our LLM ensemble to generate confidence scores for each statement in a similar manner to human annotators. The results in Figure [6](https://arxiv.org/html/2605.19344#A3.F6) and Table [3](https://arxiv.org/html/2605.19344#A3.T3) confirm that the ensemble largely matches human confidence judgements for common hedging expressions.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x6.png)

Figure 6: LLM vs. human perceived linguistic confidence on the human-annotated benchmark of Tao et al. \[[32](https://arxiv.org/html/2605.19344#bib.bib8)\]. The LLM ensemble largely follows human confidence annotations across confidence levels.

Table 3: Rank and linear correlations between LLM-ensemble and human-annotated confidence scores on the benchmark of Tao et al. \[[32](https://arxiv.org/html/2605.19344#bib.bib8)\]. All $p$-values are $<10^{-10}$, rejecting the null hypothesis of zero correlation ($H_{0}\colon\rho=0$) and confirming that the ensemble reliably reproduces human perception of linguistic confidence cues.

| Metric | Coefficient | $p$-value |
|---|---|---|
| Spearman $\rho$ | 0.8535 | $<10^{-10}$ |
| Pearson $r$ | 0.8450 | $<10^{-10}$ |
| Kendall $\tau$ | 0.6909 | $<10^{-10}$ |

### C.3 Grader prompt

LLM grader prompt:

If the predicted answer matches, implies or covers the correct answer, the grade is CORRECT.

If the predicted answer does not match, imply or cover the correct answer, the grade is INCORRECT. Do NOT grade it as INCORRECT if the predicted answer abstains from answering (e.g. “I don’t know the answer…” or “I have no idea…”).

If the predicted answer is empty, none or abstention (e.g. “I don’t know the answer…” or “I have no idea…”), grade the predicted answer as NOT_ATTEMPTED instead of CORRECT or INCORRECT. If the predicted answer makes an attempt (even random guesses), do not grade it as NOT_ATTEMPTED.

Ignore any explanation or linguistic cues present in the predicted answer. Don’t apologise or correct yourself if there was a mistake; we are just trying to grade the answer.

Question: {question}

Correct answer: {correct_answer}

Predicted answer: {"" if predicted_answer is None else predicted_answer}

Grade the predicted answer of this new question as one of:

A: CORRECT

B: INCORRECT

C: NOT_ATTEMPTED

Just return one of the letters “A”, “B”, or “C”, with no text around it.

### C.4 LLM-based semantic clustering prompt

LLM-based semantic clustering prompt:

You are a strict JSON generator. Group semantically equivalent candidate responses to the same question. Ignore any linguistic markers of uncertainty or hedging and focus solely on the core meaning of the responses.

Return a JSON object with a single key "semantic_ids", a list of integers aligned with the response order. Responses that are semantically equivalent (bidirectional entailment) must share the same integer id. Use 0-based ids. Semantic ids represent the semantic cluster assignment for each response. Return ONLY the JSON object, no extra text.

For instance, given the question and candidate responses:

Question: What is the capital of France?

Candidate responses:

0: ‘I guess Paris is the capital of France.’

1: ‘Paris is the capital city of France.’

2: ‘The capital of France is Berlin.’

The correct JSON output would be:

{"semantic_ids": [0, 0, 1]}

Now, please group the following candidate responses to the given question and return the JSON object:

Question: {question}

Candidate responses: {responses}

{"semantic_ids": [...]}
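
The JSON contract in this prompt can be consumed as sketched below. This is only an illustration of the stated output format: the function name `group_by_semantic_id` is hypothetical, and how the resulting clusters are turned into a semantic-uncertainty score is not shown here.

```python
import json
from collections import defaultdict


def group_by_semantic_id(raw_output, responses):
    """Parse the clustering JSON and group responses by semantic id."""
    payload = json.loads(raw_output)
    ids = payload["semantic_ids"]
    # The prompt requires one id per response, in response order.
    if len(ids) != len(responses):
        raise ValueError("semantic_ids must align with the response order")
    clusters = defaultdict(list)
    for semantic_id, response in zip(ids, responses):
        clusters[int(semantic_id)].append(response)
    return dict(clusters)
```

Applied to the example in the prompt, the two Paris responses fall into cluster 0 and the Berlin response into cluster 1.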

## Appendix D Post-hoc linguistic calibration implementation

### D.1 Lexicon construction

The lexicon is constructed in three stages: hedging expression sourcing, confidence score collection, and Beta distribution fitting\.

##### Hedging expression sourcing

A curated set of $K$ hedging expressions spanning the full confidence spectrum is generated by prompting Claude-Sonnet-4.6 to produce words and phrases that humans use to convey varying degrees of certainty, from expressions of complete ignorance (e.g. “I have no idea”, “my random guess is”) to expressions of near-certain belief (e.g. “without a doubt”, “I can confirm”).

Hedging expression sourcing prompt:

Generate a Python list of words or expressions that humans use to convey the level of confidence, certainty, or hedging in their statements (without a subject, only the linguistic cues). These words should include common hedging phrases, adverbs, and qualifiers that indicate varying degrees of certainty or uncertainty, from extremely low confidence (like I do not know, my random guess is, etc) to high confidence (certain, sure, definitely).

The list should be comprehensive and cover a wide range of expressions used in everyday language as well as in academic and professional contexts.

##### Confidence score collection

For each hedging expression $w_{k}$, GPT-OSS-20B \[[1](https://arxiv.org/html/2605.19344#bib.bib46)\] generates 20 candidate sentences by rewriting a randomly selected non-verifiable statement drawn from a fixed pool of 12 non-verifiable template sentences, with instructions to incorporate $w_{k}$ naturally and to avoid introducing additional hedging cues. Each generated sentence is then independently evaluated by three LLM evaluators (Llama-3.1-8B-Instruct \[[21](https://arxiv.org/html/2605.19344#bib.bib49)\], Qwen3-8B \[[37](https://arxiv.org/html/2605.19344#bib.bib45)\], Mistral-7B-Instruct-v0.3 \[[22](https://arxiv.org/html/2605.19344#bib.bib47)\]), each prompted to assign a perceived-confidence score on a 0–100 scale based solely on the linguistic cues present, ignoring factual content. Human-annotated reference profiles from Tao et al. \[[32](https://arxiv.org/html/2605.19344#bib.bib8)\] are provided in-context to anchor model ratings to human perception. Each evaluator scores every sentence 3 times (temperature $=1$), yielding up to $20\times 3\times 3=180$ raw scores per hedging expression.

Non-verifiable template sentences:

- "There is a correlation between X and Y."
- "It rains tomorrow."
- "The experiment shows a significant effect."
- "The new policy improves the economy."
- "The medication is effective in treating the disease."
- "The new product is successful in the market."
- "The neighbour is home."
- "The movie is good."
- "The restaurant serves delicious food."
- "The city is the oldest in the country."
- "The book is informative."
- "The report is not accurate."

Non-verifiable sentence rewriting prompt:

Given a linguistic cue: "{word}", rewrite one of the following non-verifiable statements to naturally include this cue to convey the intended level of confidence, certainty, or hedging. Please do not use other hedging words, hedging phrases or linguistic cues in the sentence other than the specified linguistic cue.

Example sentences to rewrite: {selected_sentence}

Do not use other hedging words or linguistic cues in the sentence. Do not combine linguistic cues. Do not include labels like "Example:" or "Sentence:". Just provide the statement.

##### Beta distribution fitting

All raw scores for a given expression are aggregated across sentences, evaluators, and repeated passes, then normalised to $[0,1]$ and clipped to $(10^{-6},\,1-10^{-6})$ to avoid boundary degeneracy. A Beta distribution is fitted to the pooled scores by maximum likelihood estimation (with fixed support $[0,1]$), yielding the lexicon entry $\bigl(w_{k},\,\mathrm{Beta}(\alpha_{k},\beta_{k})\bigr)$. The resulting lexicon $\{(w_{k},\,\mathrm{Beta}(\alpha_{k},\beta_{k}))\}_{k=1}^{K}$ is used at inference time for Wasserstein-distance-based retrieval.

### D.2 Sample hedging expressions from the lexicon

Figure [7](https://arxiv.org/html/2605.19344#A4.F7) shows sample hedging expressions from the lexicon, with their corresponding Beta distributions over perceived confidence by our LLM ensemble.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x7.png)

Figure 7: Sample hedging expressions from the lexicon, with their corresponding Beta distributions over perceived confidence by our LLM ensemble.

### D.3 Retrieval-augmented linguistic calibration rewriting prompt

Retrieval-augmented linguistic calibration prompt:

Given an original response and a list of target hedging words with their confidence profiles (Beta Distributions), rewrite the response to appropriately reflect the confidence level indicated by the set of target hedging words. You must preserve the original meaning of the response, as we are only adjusting the tone to match the confidence level suggested by the hedging words. Ensure the new response sounds natural and fluent.

Original response: My answer to the question is: "{response}"

Target hedging words with confidence profiles: {hedges}

Please return only the rewritten sentence without any explanation.

New response:

### D.4 Direct Beta-guided rewriting calibration prompt

Direct Beta-guided rewriting prompt:

Given an original response and a Beta distribution, rewrite the response to appropriately reflect the confidence level indicated by the given Beta distribution by using hedging language. You must preserve the original meaning of the response, as we are only adjusting the tone to match the confidence level suggested by the hedging words. Ensure the new response sounds natural and fluent.

Original response: My answer to the question is: "{response}"

Target Beta distribution: Beta(alpha={alpha:.2f}, beta={beta:.2f})

Please return only the rewritten sentence without any explanation.

New response:

### D.5 Choice of k for hedging expression retrieval

Given the pre-constructed lexicon of hedging expressions, we perform an ablation study on the choice of $k$ for the KNN retrieval of hedging expressions in the RALC pipeline. Figure [8](https://arxiv.org/html/2605.19344#A4.F8) reports Faithfulness Divergence and generalised ECE for different choices of $k$ across linguistic confidence (LC), token probability (TP), and semantic uncertainty (SU) as retrieval signals for Llama-3.1-8B-Instruct on the TruthfulQA dataset. Neither metric is highly sensitive to the choice of $k$, although $k=5$ shows consistently better marginal performance over the explored range, with the lowest Faithfulness Divergence and generalised ECE. Therefore, we choose $k=5$ for both the in-domain and cross-domain calibration experiments in this work.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x8.png)

Figure 8: Impact of the choice of $k$ for the KNN retrieval of hedging expressions in the RALC pipeline on Faithfulness Divergence and generalised ECE across different confidence signals for Llama-3.1-8B-Instruct on the TruthfulQA dataset. The results show that both metrics are not highly sensitive to the choice of $k$ within a reasonable range, with $k=5$ showing consistently better marginal performance in the exploration range.
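
The retrieval step can be sketched as a nearest-neighbour search over the lexicon under a Wasserstein distance. The example below is a rough illustration under several assumptions that are not taken from the paper: it uses the 1-Wasserstein distance, estimates it by Monte Carlo with a sorted-sample (quantile) coupling, and the lexicon entries in `LEXICON` are made-up values, not entries from the actual lexicon. All function names are hypothetical.

```python
import random


def sample_sorted(alpha, beta, n, rng):
    """Draw n Beta(alpha, beta) samples and sort them (empirical quantiles)."""
    return sorted(rng.betavariate(alpha, beta) for _ in range(n))


def wasserstein1(p, q, n=2000, seed=0):
    """Monte Carlo estimate of W1 between Beta(*p) and Beta(*q)."""
    rng = random.Random(seed)  # fixed seed keeps the ranking reproducible
    xs = sample_sorted(p[0], p[1], n, rng)
    ys = sample_sorted(q[0], q[1], n, rng)
    # In one dimension, W1 is the mean absolute gap between matched quantiles.
    return sum(abs(x - y) for x, y in zip(xs, ys)) / n


def retrieve_hedges(target, lexicon, k=5):
    """Return the k expressions whose Beta profiles are closest to target."""
    ranked = sorted(lexicon, key=lambda w: wasserstein1(target, lexicon[w]))
    return ranked[:k]


# Illustrative (alpha, beta) profiles only; not values from the paper.
LEXICON = {
    "I have no idea": (1.0, 9.0),
    "possibly": (4.0, 6.0),
    "probably": (7.0, 3.0),
    "almost certainly": (9.0, 1.5),
    "without a doubt": (20.0, 1.0),
}
```

With a calibrated target such as `(8.0, 2.5)` and `k=2`, this sketch retrieves "probably" and "almost certainly" from the illustrative lexicon, which would then be passed to the rewriting prompt as target hedging words.
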

### D.6 Confidence distribution profiles across different estimators

Figure [9](https://arxiv.org/html/2605.19344#A4.F9) shows the distribution of confidence standard deviations across responses for each signal. Linguistic confidence exhibits the highest variability, whilst token probability and semantic uncertainty each produce a substantial proportion of zero-variance distributions, arising when responses share identical token probability profiles or collapse into a single semantic cluster. These degenerate cases are handled by clipping the $(\alpha,\beta)$ parameters as specified in Appendix [A.1.1](https://arxiv.org/html/2605.19344#A1.SS1.SSS1), ensuring compatibility with both the calibration map and the Wasserstein-based retrieval step.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x9.png)

Figure 9: For each confidence signal for Direct QA responses, we plot the distribution of the standard deviation of the confidence distribution across responses.

### D.7 Calibration map ablation study

Table 4: Ablation over signal-space calibration maps, averaged across all datasets and models ($\pm 1$ standard deviation). Bold denotes the best (lowest) value per column. Platt scaling achieves the lowest error in five of six columns and is adopted in the RALC pipeline.

| Method | Linguistic confidence: Gen. ECE | Linguistic confidence: FD | Token probability: Gen. ECE | Token probability: FD | Semantic uncertainty: Gen. ECE | Semantic uncertainty: FD |
|---|---|---|---|---|---|---|
| Uncalibrated | 0.280 ± 0.107 | 1.487 ± 0.628 | 0.262 ± 0.111 | 68.879 ± 96.520 | 0.267 ± 0.114 | 2.112 ± 0.762 |
| Platt scaling \[[25](https://arxiv.org/html/2605.19344#bib.bib14)\] | **0.109** ± 0.018 | **0.486** ± 0.042 | **0.085** ± 0.032 | **0.500** ± 0.060 | **0.060** ± 0.014 | 0.506 ± 0.051 |
| Isotonic regression \[[40](https://arxiv.org/html/2605.19344#bib.bib12)\] | 0.114 ± 0.019 | 0.685 ± 0.530 | 0.090 ± 0.034 | 1.646 ± 1.495 | 0.065 ± 0.017 | 0.718 ± 0.439 |
| Histogram binning \[[39](https://arxiv.org/html/2605.19344#bib.bib13)\] | 0.110 ± 0.019 | 0.488 ± 0.042 | 0.090 ± 0.039 | 0.514 ± 0.104 | 0.060 ± 0.017 | **0.502** ± 0.047 |
| Temperature scaling \[[8](https://arxiv.org/html/2605.19344#bib.bib11)\] | 0.127 ± 0.022 | 0.547 ± 0.148 | 0.102 ± 0.032 | 0.646 ± 0.380 | 0.080 ± 0.031 | 0.520 ± 0.056 |

We ablate the signal-space calibration map across Platt scaling \[[25](https://arxiv.org/html/2605.19344#bib.bib14)\], isotonic regression \[[40](https://arxiv.org/html/2605.19344#bib.bib12)\], histogram binning \[[39](https://arxiv.org/html/2605.19344#bib.bib13)\], and temperature scaling \[[8](https://arxiv.org/html/2605.19344#bib.bib11)\], applied to the distribution means of all three confidence signals and evaluated on generalised ECE and Faithfulness Divergence in the signal space. The results are averaged across all datasets and models and reported in Table [4](https://arxiv.org/html/2605.19344#A4.T4). Platt scaling attains the lowest value in five of the six metric-signal columns; the exception is FD under semantic uncertainty, where histogram binning is marginally lower (0.502 vs. 0.506). Isotonic regression exhibits instability on small calibration sets, reflected in its large FD standard deviations; histogram binning is competitive on ECE but trails Platt scaling on FD for linguistic confidence and token probability; and temperature scaling has the highest generalised ECE among the calibrated maps for every signal. We therefore adopt Platt scaling as the signal-space calibration map in the RALC pipeline.
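
For readers unfamiliar with Platt scaling, the sketch below fits the usual two-parameter sigmoid map $\sigma(a\,s+b)$ to confidence means $s$ and binary correctness labels by minimising the log-loss. It is a minimal illustration under assumptions: the plain gradient-descent optimiser, its hyperparameters, the function names, and the choice to feed the raw mean (rather than, say, its logit) into the sigmoid are all choices made here and are not specified in the text above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit sigmoid(a * s + b) to binary labels by gradient descent on log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score, a, b):
    """Map a raw confidence mean to its calibrated value."""
    return sigmoid(a * score + b)
```

As a sanity check, if every response carries a confidence mean of 0.9 but only half are correct, the fitted map sends 0.9 to approximately 0.5.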

### D.8 Confidence signal propagation quality

We evaluate the quality of RALC by measuring the correlation between the calibrated confidence signal and the linguistic confidence in the rewritten responses perceived by our LLM evaluator ensemble for each confidence signal. Our pipeline accurately propagates the calibrated confidence signal into language, as evidenced by a positive Spearman’s correlation $\rho$ consistently above 0.9 across all confidence signals.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x10.png)

Figure 10: We evaluate the quality of RALC by measuring the correlation between the calibrated confidence signal and the linguistic confidence in the rewritten responses perceived by our LLM evaluator ensemble. Across all confidence signals, our pipeline effectively propagates the calibrated confidence signal into language, as evidenced by a positive correlation consistently well above 0.9.

## Appendix E Additional results

### E.1 Additional performance metrics

In addition to faithfulness and calibration, we assess discriminative performance using AUROC, computed on the means of each confidence distribution, averaged across all five models. An AUROC above 0.5 indicates that higher confidence tends to correlate with correctness, with 1.0 being perfect discrimination; a value near 0.5 reflects a signal no better than chance. Table [5](https://arxiv.org/html/2605.19344#A5.T5) reports the pre-calibration signal profile, where semantic uncertainty and token probability achieve stronger discrimination than linguistic confidence across all datasets, motivating the use of alternative confidence signals in our RALC pipeline beyond linguistic confidence.
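
AUROC on confidence means can be computed directly from its pairwise-ranking interpretation: the probability that a randomly chosen correct response receives a higher confidence than a randomly chosen incorrect one, with ties counted as one half. The sketch below is illustrative, and the function name `auroc` is hypothetical.

```python
def auroc(confidences, correct):
    """AUROC via pairwise comparison of correct vs. incorrect responses."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # correct response ranked above incorrect one
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

A signal that assigns the same confidence to every response therefore scores exactly 0.5, the chance level referred to above.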

Table 5: Model accuracy, mean confidence, and AUROC of each confidence signal mean across datasets, averaged across all five models. Accuracy is a dataset-level quantity and is listed once per dataset.

| Dataset | Signal | Acc. | Mean Conf. | AUROC |
|---|---|---|---|---|
| MMLU | Ling. Confidence | 0.719 ± 0.120 | 0.784 ± 0.014 | 0.538 ± 0.026 |
| | Token Probability | | 0.936 ± 0.045 | 0.609 ± 0.075 |
| | Sem. Uncertainty | | 0.891 ± 0.054 | 0.652 ± 0.076 |
| SQuAD 2.0 | Ling. Confidence | 0.564 ± 0.100 | 0.840 ± 0.015 | 0.497 ± 0.029 |
| | Token Probability | | 0.838 ± 0.104 | 0.647 ± 0.035 |
| | Sem. Uncertainty | | 0.884 ± 0.038 | 0.647 ± 0.070 |
| TruthfulQA | Ling. Confidence | 0.493 ± 0.103 | 0.829 ± 0.012 | 0.616 ± 0.028 |
| | Token Probability | | 0.704 ± 0.165 | 0.624 ± 0.059 |
| | Sem. Uncertainty | | 0.808 ± 0.054 | 0.642 ± 0.018 |

Table [6](https://arxiv.org/html/2605.19344#A5.T6) then evaluates RALC’s linguistic-space AUROC after rewriting, benchmarked against a Hedged QA baseline, a prompt-based black-box baseline to elicit hedged responses. Since RALC’s quality is signal-dependent, stronger signals yield greater gains: semantic uncertainty consistently surpasses both the original AUROC and the Hedged QA baseline across all datasets, confirming that grounding rewritten expressions in a calibrated signal with principled retrieval-augmentation produces more discriminative outputs than black-box hedging.

Table 6:Linguistic\-space AUROC before and after in\-domain RALC with Hedged QA baseline, averaged across all five models\. Original AUROC and Hedged QA AUROC are dataset\-level quantities shared across all signals; post\-RALC AUROC is signal\-specific\. Higher values indicate better discrimination between correct and incorrect responses in the linguistic space\.DatasetSignalOriginal AUROCPost\-RALC AUROCHedged QA AUROCMMLULing\. Confidence0\.513±0\.0690\.513\\pm 0\.0690\.533±0\.0190\.533\\pm 0\.0190\.565±0\.0200\.565\\pm 0\.020Token Probability0\.560±0\.0480\.560\\pm 0\.048Sem\. Uncertainty0\.636±0\.0850\.636\\pm 0\.085SQuAD 2\.0Ling\. Confidence0\.498±0\.0290\.498\\pm 0\.0290\.488±0\.0110\.488\\pm 0\.0110\.521±0\.0440\.521\\pm 0\.044Token Probability0\.557±0\.0600\.557\\pm 0\.060Sem\. Uncertainty0\.588±0\.0640\.588\\pm 0\.064TruthfulQALing\. Confidence0\.610±0\.0380\.610\\pm 0\.0380\.600±0\.0340\.600\\pm 0\.0340\.627±0\.0280\.627\\pm 0\.028Token Probability0\.630±0\.0270\.630\\pm 0\.027Sem\. Uncertainty0\.664±0\.0390\.664\\pm 0\.039
### E.2 In-domain and cross-domain calibration results

Table [E.2](https://arxiv.org/html/2605.19344#A5.SS2) provides the absolute metric value changes corresponding to the percentage changes in Table [5.3.2](https://arxiv.org/html/2605.19344#S5.SS3.SSS2), broken down by confidence estimator, training dataset, and evaluation dataset. Each entry reports the mean reduction in Faithfulness Divergence or generalised ECE across models (mean ± std), with green indicating improvement and red indicating deterioration. In-domain results appear on the diagonal; off-diagonal entries reflect cross-domain transfer.

Table 7: In-domain and cross-domain linguistic-space calibration metric value changes for both Faithfulness Divergence and generalised ECE. We report the value change relative to the pre-calibration metrics (mean ± std across models). Green text indicates calibration improvement (lower error), red indicates deterioration.

| Metric | Signal | Train / Test | MMLU | SQuAD 2.0 | TruthfulQA |
|---|---|---|---|---|---|
| Faithfulness Divergence Mean Reduction | Linguistic Confidence | MMLU | Δ0.1057 ± 0.1793 | Δ0.6384 ± 0.2359 | Δ0.5436 ± 0.2420 |
| | | SQuAD 2.0 | Δ0.3598 ± 0.0779 | Δ1.1440 ± 0.1398 | Δ1.0996 ± 0.1566 |
| | | TruthfulQA | Δ0.2904 ± 0.1946 | Δ1.2138 ± 0.1430 | Δ1.1127 ± 0.1695 |
| | Token Probability | MMLU | Δ0.0952 ± 0.2094 | Δ0.9714 ± 0.1862 | Δ1.0355 ± 0.1556 |
| | | SQuAD 2.0 | Δ0.2768 ± 0.0674 | Δ1.1987 ± 0.1472 | Δ1.2931 ± 0.1621 |
| | | TruthfulQA | Δ0.5249 ± 0.0658 | Δ1.1848 ± 0.1555 | Δ1.1890 ± 0.1479 |
| | Semantic Uncertainty | MMLU | Δ0.2011 ± 0.1424 | Δ0.6669 ± 0.2484 | Δ0.7894 ± 0.2090 |
| | | SQuAD 2.0 | Δ0.3700 ± 0.0836 | Δ1.2180 ± 0.1775 | Δ1.2042 ± 0.1722 |
| | | TruthfulQA | Δ0.6437 ± 0.0766 | Δ1.2246 ± 0.1344 | Δ1.2160 ± 0.1473 |
| Generalised ECE Mean Reduction | Linguistic Confidence | MMLU | Δ0.0786 ± 0.0242 | Δ0.0655 ± 0.0258 | Δ0.0449 ± 0.0244 |
| | | SQuAD 2.0 | Δ0.0394 ± 0.0224 | Δ0.1372 ± 0.0147 | Δ0.1286 ± 0.0145 |
| | | TruthfulQA | Δ0.0725 ± 0.0252 | Δ0.1492 ± 0.0196 | Δ0.1462 ± 0.0124 |
| | Token Probability | MMLU | Δ0.0966 ± 0.0272 | Δ0.1217 ± 0.0368 | Δ0.1381 ± 0.0382 |
| | | SQuAD 2.0 | Δ0.0421 ± 0.0063 | Δ0.1600 ± 0.0158 | Δ0.2257 ± 0.0245 |
| | | TruthfulQA | Δ0.1994 ± 0.0167 | Δ0.1538 ± 0.0194 | Δ0.1561 ± 0.0147 |
| | Semantic Uncertainty | MMLU | Δ0.1028 ± 0.0273 | Δ0.0974 ± 0.0309 | Δ0.1240 ± 0.0265 |
| | | SQuAD 2.0 | Δ0.0540 ± 0.0323 | Δ0.1855 ± 0.0295 | Δ0.1832 ± 0.0258 |
| | | TruthfulQA | Δ0.2065 ± 0.0184 | Δ0.1798 ± 0.0251 | Δ0.1662 ± 0.0165 |

### E.3 Further investigation on cross-domain calibration anomalies

Table [5.3.2](https://arxiv.org/html/2605.19344#S5.SS3.SSS2) reveals that cross-domain calibrators occasionally outperform in-domain ones. We investigate this anomaly by examining the miscalibration bias of each dataset, defined as the gap between mean expressed confidence and mean accuracy, which determines how much signal is available for the calibration map to learn from.
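
The bias defined above is a difference of two means, so it can be computed directly from per-response confidences and correctness labels. A minimal sketch on toy data follows; the helper name `miscalibration_bias` is hypothetical and the numbers are not from the paper.

```python
from statistics import mean
from typing import Sequence


def miscalibration_bias(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean expressed confidence minus mean accuracy.

    Positive values indicate overconfidence, negative values underconfidence.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    accuracy = mean(1.0 if ok else 0.0 for ok in correct)
    return mean(confidences) - accuracy


# Toy data (illustrative only): high stated confidence, 50% accuracy.
toy_conf = [0.9, 0.8, 0.85, 0.7]
toy_correct = [True, False, True, False]
bias = miscalibration_bias(toy_conf, toy_correct)
print(f"bias = {bias:+.4f}")
```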

We measure the per-dataset miscalibration bias across confidence signals and models (Table [8](https://arxiv.org/html/2605.19344#A5.T8); Figure [11](https://arxiv.org/html/2605.19344#A5.F11)), and correlate the bias difference between each source–target pair with the observed cross-domain advantage (Figure [12](https://arxiv.org/html/2605.19344#A5.F12)).

![Refer to caption](https://arxiv.org/html/2605.19344v1/x11.png)
Figure 11: Mean confidence vs. mean accuracy per (dataset, model) pair. All datasets are systematically miscalibrated (above the diagonal), but the magnitude of bias varies considerably across domains.

Table 8: Mean miscalibration bias (mean confidence − mean accuracy) per dataset and confidence signal, averaged over models.

| Dataset | Ling. Conf. | Token Prob. | Sem. Unc. |
|---|---|---|---|
| MMLU | 0.049 | 0.225 | 0.177 |
| SQuAD 2.0 | 0.285 | 0.278 | 0.326 |
| TruthfulQA | 0.360 | 0.229 | 0.326 |

All three datasets are systematically miscalibrated, but the magnitude differs considerably. Figure [12](https://arxiv.org/html/2605.19344#A5.F12) shows a strong negative relationship between the source–target bias difference and the cross-domain advantage. Transfer pairs whose source and target share a similar miscalibration bias show little performance gap relative to in-domain calibration, whilst larger differences tend to favour in-domain calibration.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x12.png)
Figure 12: Miscalibration bias difference $|\bar{b}_{\text{train}} - \bar{b}_{\text{test}}|$ vs. cross-domain advantage ($\text{ECE}_{\text{in}} - \text{ECE}_{\text{cross}}$). Colour indicates the test dataset. Transfer pairs with similar miscalibration biases achieve performance closer to in-domain calibration.

This pattern follows from the learning dynamics of the calibration map. When a target domain has a weak bias, the in-domain calibrator has little signal to learn from and fits an unreliable correction. A cross-domain source with a stronger, more consistent bias learns a more decisive correction; provided the two domains share the same direction of miscalibration, this correction transfers effectively even if its magnitude differs. Cross-domain transfer therefore outperforms in-domain calibration precisely when in-domain data is least informative.
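
The two quantities plotted in Figure 12 can be approximated from the averaged values in Tables 7 and 8. The sketch below does this for the linguistic confidence signal only. It rests on several assumptions that are not stated in the source: rows of Table 7 index the training dataset and columns the test dataset, every entry is a reduction (the colour coding is not available here), and the pre-calibration ECE on a given test set is the same for all calibrators, so that ECE_in − ECE_cross equals the cross-domain reduction minus the in-domain reduction. It is an illustration on table means, not a reproduction of the per-model points in the figure.

```python
from itertools import permutations

DATASETS = ["MMLU", "SQuAD 2.0", "TruthfulQA"]

# Table 8: mean miscalibration bias of linguistic confidence per dataset.
BIAS = {"MMLU": 0.049, "SQuAD 2.0": 0.285, "TruthfulQA": 0.360}

# Table 7: mean generalised ECE reduction, linguistic confidence signal.
# Assumption: outer key = training dataset (row), inner key = test dataset (column).
ECE_REDUCTION = {
    "MMLU": {"MMLU": 0.0786, "SQuAD 2.0": 0.0655, "TruthfulQA": 0.0449},
    "SQuAD 2.0": {"MMLU": 0.0394, "SQuAD 2.0": 0.1372, "TruthfulQA": 0.1286},
    "TruthfulQA": {"MMLU": 0.0725, "SQuAD 2.0": 0.1492, "TruthfulQA": 0.1462},
}


def cross_domain_rows():
    """Return (train, test, |bias gap|, cross-domain advantage) for each transfer pair."""
    rows = []
    for train, test in permutations(DATASETS, 2):
        bias_gap = abs(BIAS[train] - BIAS[test])
        # ECE_in - ECE_cross = (pre - red_in) - (pre - red_cross) = red_cross - red_in
        advantage = ECE_REDUCTION[train][test] - ECE_REDUCTION[test][test]
        rows.append((train, test, round(bias_gap, 3), round(advantage, 4)))
    return rows


rows = cross_domain_rows()
for train, test, gap, adv in rows:
    print(f"{train:>10} -> {test:<10}  |bias gap| = {gap:.3f}  advantage = {adv:+.4f}")
```

Under these assumptions, the only pair with a positive advantage in this slice is TruthfulQA → SQuAD 2.0, which is also one of the two pairs with the smallest bias gap (0.075).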

### E.4 Detailed in-domain calibration vs. Hedged QA comparison

Table [9](https://arxiv.org/html/2605.19344#A5.T9) reports pre-to-post changes in linguistic-space Faithfulness Divergence and generalised ECE for RALC and the Hedged QA baseline, broken down by dataset, model, and confidence signal. Each cell shows the original Direct QA value alongside the post-intervention value; linguistic confidence is estimated by the LLM ensemble for all three response types. RALC outperforms Hedged QA in expectation across both metrics, with few exceptions due to model-specific signal characteristics, whilst Hedged QA shows limited and inconsistent improvements.

Table 9: Detailed in-domain calibration vs. Hedged QA comparison.

| Dataset | Model | Signal | In-Domain Calibration: Faithfulness Divergence (Orig. → Calib.) | In-Domain Calibration: Generalised ECE (Orig. → Calib.) | Hedged QA: Faithfulness Divergence (Orig. → Hedged) | Hedged QA: Generalised ECE (Orig. → Hedged) |
|---|---|---|---|---|---|---|
| MMLU | Mistral-7B-Inst. | Linguistic Conf. | 1.198 → 0.557 | 0.239 → 0.120 | 1.198 → 1.227 | 0.239 → 0.251 |
| | | Token Prob. | 1.198 → 0.578 | 0.239 → 0.109 | | |
| | | Semantic Unc. | 1.198 → 0.471 | 0.239 → 0.113 | | |
| | Gemma-4-31B-IT | Linguistic Conf. | 1.047 → 1.077 | 0.284 → 0.124 | 1.047 → 0.577 | 0.284 → 0.146 |
| | | Token Prob. | 1.047 → 0.956 | 0.284 → 0.093 | | |
| | | Semantic Unc. | 1.047 → 0.530 | 0.284 → 0.080 | | |
| | Llama-3.1-8B-Inst. | Linguistic Conf. | 0.925 → 0.602 | 0.190 → 0.120 | 0.925 → 0.961 | 0.190 → 0.199 |
| | | Token Prob. | 0.925 → 0.665 | 0.190 → 0.096 | | |
| | | Semantic Unc. | 0.925 → 0.461 | 0.190 → 0.092 | | |
| | GPT-OSS-20B | Linguistic Conf. | 0.579 → 1.118 | 0.152 → 0.128 | 0.579 → 0.530 | 0.152 → 0.146 |
| | | Token Prob. | 0.579 → 1.333 | 0.152 → 0.138 | | |
| | | Semantic Unc. | 0.579 → 0.563 | 0.152 → 0.089 | | |
| | Qwen3-8B | Linguistic Conf. | 0.902 → 0.770 | 0.167 → 0.148 | 0.902 → 0.851 | 0.167 → 0.160 |
| | | Token Prob. | 0.902 → 0.643 | 0.167 → 0.113 | | |
| | | Semantic Unc. | 0.902 → 0.564 | 0.167 → 0.145 | | |
| TruthfulQA | Mistral-7B-Inst. | Linguistic Conf. | 1.782 → 0.710 | 0.374 → 0.227 | 1.782 → 1.654 | 0.374 → 0.350 |
| | | Token Prob. | 1.782 → 0.727 | 0.374 → 0.248 | | |
| | | Semantic Unc. | 1.782 → 0.477 | 0.374 → 0.238 | | |
| | Gemma-4-31B-IT | Linguistic Conf. | 1.272 → 0.831 | 0.209 → 0.118 | 1.272 → 1.271 | 0.209 → 0.204 |
| | | Token Prob. | 1.272 → 0.625 | 0.209 → 0.100 | | |
| | | Semantic Unc. | 1.272 → 0.469 | 0.209 → 0.096 | | |
| | Llama-3.1-8B-Inst. | Linguistic Conf. | 1.828 → 0.712 | 0.427 → 0.269 | 1.828 → 1.756 | 0.427 → 0.394 |
| | | Token Prob. | 1.828 → 0.636 | 0.427 → 0.247 | | |
| | | Semantic Unc. | 1.828 → 0.467 | 0.427 → 0.254 | | |
| | GPT-OSS-20B | Linguistic Conf. | 2.265 → 0.678 | 0.392 → 0.230 | 2.265 → 2.143 | 0.392 → 0.375 |
| | | Token Prob. | 2.265 → 0.645 | 0.392 → 0.227 | | |
| | | Semantic Unc. | 2.265 → 0.475 | 0.392 → 0.179 | | |
| | Qwen3-8B | Linguistic Conf. | 2.054 → 0.707 | 0.442 → 0.270 | 2.054 → 1.999 | 0.442 → 0.413 |
| | | Token Prob. | 2.054 → 0.623 | 0.442 → 0.242 | | |
| | | Semantic Unc. | 2.054 → 0.464 | 0.442 → 0.247 | | |
| SQuAD 2.0 | Mistral-7B-Inst. | Linguistic Conf. | 1.978 → 0.701 | 0.350 → 0.185 | 1.978 → 1.943 | 0.350 → 0.343 |
| | | Token Prob. | 1.978 → 0.682 | 0.350 → 0.181 | | |
| | | Semantic Unc. | 1.978 → 0.487 | 0.350 → 0.116 | | |
| | Gemma-4-31B-IT | Linguistic Conf. | 1.288 → 0.649 | 0.166 → 0.094 | 1.288 → 1.300 | 0.166 → 0.167 |
| | | Token Prob. | 1.288 → 0.593 | 0.166 → 0.074 | | |
| | | Semantic Unc. | 1.288 → 0.480 | 0.166 → 0.103 | | |
| | Llama-3.1-8B-Inst. | Linguistic Conf. | 1.594 → 0.643 | 0.305 → 0.160 | 1.594 → 1.584 | 0.305 → 0.302 |
| | | Token Prob. | 1.594 → 0.623 | 0.305 → 0.145 | | |
| | | Semantic Unc. | 1.594 → 0.495 | 0.305 → 0.122 | | |
| | GPT-OSS-20B | Linguistic Conf. | 2.265 → 0.764 | 0.389 → 0.232 | 2.265 → 2.214 | 0.389 → 0.381 |
| | | Token Prob. | 2.265 → 0.665 | 0.389 → 0.192 | | |
| | | Semantic Unc. | 2.265 → 0.474 | 0.389 → 0.137 | | |
| | Qwen3-8B | Linguistic Conf. | 2.103 → 0.751 | 0.370 → 0.225 | 2.103 → 2.033 | 0.370 → 0.360 |
| | | Token Prob. | 2.103 → 0.670 | 0.370 → 0.189 | | |
| | | Semantic Unc. | 2.103 → 0.480 | 0.370 → 0.176 | | |
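
Because the original values differ widely across models and datasets, it can help to read each "Orig. → Post" cell of Table 9 as a relative change. The sketch below does so for one row (MMLU, Mistral-7B-Inst., Faithfulness Divergence). The helper `relative_reduction` is hypothetical, and whether the paper's headline percentages are computed in exactly this way is an assumption.

```python
def relative_reduction(original: float, post: float) -> float:
    """Fractional reduction relative to the original value.

    Positive means the metric went down (improved); negative means it got worse.
    """
    if original <= 0:
        raise ValueError("original value must be positive")
    return (original - post) / original


# Values copied from Table 9 (MMLU, Mistral-7B-Inst., Faithfulness Divergence).
ralc_su = relative_reduction(1.198, 0.471)  # RALC guided by semantic uncertainty
hedged = relative_reduction(1.198, 1.227)   # Hedged QA baseline

print(f"RALC (Sem. Unc.): {ralc_su:+.1%}")
print(f"Hedged QA:        {hedged:+.1%}")
```

For this row the RALC rewrite lowers Faithfulness Divergence by roughly three fifths, while Hedged QA slightly increases it.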

### E.5 In-domain calibration reliability diagrams

Figures [13](https://arxiv.org/html/2605.19344#A5.F13) and [14](https://arxiv.org/html/2605.19344#A5.F14) present the in-domain calibration reliability diagrams for MMLU and TruthfulQA, respectively, across confidence signals and models. The left column shows the original linguistic confidence of the Direct QA responses. The rest of the columns show the linguistic confidence of the rewritten responses through RALC guided by different confidence signals, including linguistic confidence (LC), token probability (TP), and semantic uncertainty (SU). Additionally, we report the generalised ECE [[35](https://arxiv.org/html/2605.19344#bib.bib1)] and Faithfulness Divergence (FD) along with the reliability diagrams. The results show that RALC effectively reduces miscalibration across all confidence signals and models.
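
For readers unfamiliar with reliability diagrams, the sketch below shows the binning that typically underlies them, together with the standard expected calibration error. This is the textbook equal-width variant on scalar confidences, swapped in for illustration; it is not the generalised ECE of [35] that the paper reports, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty equal-width bin.

    Confidences are assumed to lie in [0, 1].
    """
    buckets: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        buckets[idx].append((c, 1.0 if ok else 0.0))
    out = []
    for b in buckets:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            out.append((mean_conf, accuracy, len(b)))
    return out


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Count-weighted mean absolute gap between confidence and accuracy per bin."""
    n = len(confidences)
    bins = reliability_bins(confidences, correct, n_bins)
    return sum(cnt / n * abs(conf - acc) for conf, acc, cnt in bins)


# Toy data (illustrative only, not taken from the paper).
toy_conf = [0.95, 0.92, 0.85, 0.55, 0.52]
toy_correct = [True, False, True, False, True]
bins = reliability_bins(toy_conf, toy_correct)
ece = expected_calibration_error(toy_conf, toy_correct)
print(bins)
print(f"ECE = {ece:.3f}")
```

Each tuple corresponds to one point of a reliability diagram: a perfectly calibrated signal would have accuracy equal to mean confidence in every bin.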

![Refer to caption](https://arxiv.org/html/2605.19344v1/x13.png)
Figure 13: In-domain calibration reliability diagrams for MMLU across confidence signals and models. The left column shows the original linguistic confidence of the Direct QA responses. The rest of the columns show the linguistic confidence of the rewritten responses through RALC guided by different confidence signals, including linguistic confidence (LC), token probability (TP), and semantic uncertainty (SU). We report the generalised ECE [[35](https://arxiv.org/html/2605.19344#bib.bib1)] along with the reliability diagrams. The results show that RALC effectively reduces miscalibration across all confidence signals and models.

![Refer to caption](https://arxiv.org/html/2605.19344v1/x14.png)
Figure 14: In-domain calibration reliability diagrams for TruthfulQA across confidence signals and models. The left column shows the original linguistic confidence of the Direct QA responses. The rest of the columns show the linguistic confidence of the rewritten responses through RALC guided by different confidence signals, including linguistic confidence (LC), token probability (TP), and semantic uncertainty (SU). We report the generalised ECE [[35](https://arxiv.org/html/2605.19344#bib.bib1)] along with the reliability diagrams. The results show that RALC effectively reduces miscalibration across all confidence signals and models.

## Appendix F LLM configurations

All five evaluation targets, the LLM ensemble, and the LLM rewriter in the RALC pipeline are configured with a temperature of 1 to encourage diverse outputs. The LLM cluster selector and grader are configured with a temperature of 0 to encourage deterministic outputs. All models are hosted locally (single RTX 4090 GPU) or through cloud APIs.
