@dair_ai: New research from Google. LLMs hallucinate with high confidence, miss their own knowledge boundaries, and misreport unc…

X AI KOLs Timeline 07/02/26, 12:00 AM Papers

Summary

A new research paper introduces RLMF (Reinforcement Learning with Metacognitive Feedback), a two-stage approach that uses the model's own self-judgments to calibrate confidence and express uncertainty faithfully, achieving state-of-the-art calibration across diverse tasks while preserving accuracy and surpassing standard RL by up to 63%.

New research from Google. LLMs hallucinate with high confidence, miss their own knowledge boundaries, and misreport uncertainty. Most fixes bolt calibration on from the outside. RLMF turns the model's own metacognition into the training signal. It refines completion rankings during preference optimization based on how good the model's self-judgments of its performance are, and uses those same self-judgments to select high-value training data. The approach is two-stage. First calibrate the faithfulness of self-reported confidence, then map it to natural linguistic uncertainty through targeted output editing. RLMF reaches state-of-the-art faithful calibration across diverse tasks while preserving accuracy, and surpasses standard RL by up to 63%. Paper: https://arxiv.org/abs/2606.32032 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article

Cached at: 07/02/26, 04:17 AM

New research from Google.

LLMs hallucinate with high confidence, miss their own knowledge boundaries, and misreport uncertainty. Most fixes bolt calibration on from the outside.

RLMF turns the model's own metacognition into the training signal. It refines completion rankings during preference optimization based on how good the model's self-judgments of its performance are, and uses those same self-judgments to select high-value training data.

The approach is two-stage. First calibrate the faithfulness of self-reported confidence, then map it to natural linguistic uncertainty through targeted output editing.

RLMF reaches state-of-the-art faithful calibration across diverse tasks while preserving accuracy, and surpasses standard RL by up to 63%.

Paper: https://arxiv.org/abs/2606.32032

Learn to build effective AI agents in our academy: https://academy.dair.ai

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Source: https://arxiv.org/html/2606.32032

Gabrielle Kaili-May Liu¹, Avi Caciularu², Gal Yona², Idan Szpektor², Arman Cohan¹

¹Yale University, ²Google Research
{kaili.liu, arman.cohan}@yale.edu

Abstract

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one’s own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty, undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model’s self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models’ self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models’ ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods (footnote 1).

Footnote 1: Our code is provided at https://github.com/yale-nlp/RLMF.

1 Introduction

Metacognition is a foundational component of intelligence that refers to the ability to monitor, assess, and regulate one’s own cognitive processes[23]. It is critical to effective learning, decision-making, and communication and has become increasingly recognized as a cornerstone of capable, transparent AI systems[88]. Despite this, LLMs continue to exhibit key metacognitive deficiencies, including failure to recognize knowledge boundaries[89], tendency toward high-confidence hallucinations[83], and systematic misrepresentation of their internal uncertainty[113,62]. This lack of robust metacognitive faculties undermines trustworthiness and reliability, particularly as models are deployed in downstream advisory roles across high-stakes settings such as scientific discovery[85,122], medical diagnosis[46,130], and legal consulting[15,55].

As the ability to monitor task performance and adapt behavior accordingly is central to metacognition, we posit that models made capable of accurately judging their own performance are better positioned to improve it, making metacognitive signals a natural source of supervision during post-training. In particular, we propose leveraging metacognitive performance as an additional training signal for LLMs, to encode metacognitive awareness into models while simultaneously improving task performance.

Figure 1: Overview of RLMF, paired with metacognitive data selection and targeted rewriting to faithfully calibrate the numerically and linguistically expressed uncertainty of LLMs.

We operationalize this idea via two novel mechanisms (Fig. 1). First, we introduce reinforcement learning with metacognitive feedback (RLMF), a training paradigm in which the model is rewarded not only for producing strong outputs, but also for accurately judging how well it performed. RLMF builds upon prior work showing that intrinsic confidence signals can serve as effective RL rewards, but operates at a higher level of abstraction by leveraging the quality of the model’s assessment of its own performance rather than simply output confidence. Concretely, we introduce a novel metacognitive advantage scaling mechanism: during RL training, we use the accuracy of the model’s self-judgments to scale each completion’s advantage, i.e., the relative learning signal that determines how strongly that completion is reinforced compared to alternative sampled completions. To complement RLMF, we additionally propose metacognitive data selection, which uses the model’s self-assessments to choose informative training examples: candidate examples are scored by how well the model believes it performed, and examples from both the high- and low-scoring ends of this spectrum are selected for training, since each provides a complementary learning signal.

We showcase the efficacy of these innovations by applying them to the problem of faithful calibration (FC), a task that is itself metacognitive: the goal is to enable models to express uncertainty that genuinely reflects their (estimated) intrinsic confidence, a challenge even for frontier LLMs[62,24,61]. FC is distinct from the more commonly studied problem of factual calibration, which aligns confidence with empirical accuracy[103]: a model may appear factually calibrated yet remain misaligned with its internal beliefs. FC captures this critical failure mode to calibrate user reliance and improve trustworthiness. Yet, to our knowledge, FC is largely unresolved, with existing approaches limited in scope and generalizability and none addressing FC holistically across both numerical and linguistic uncertainty.

Toward end-to-end FC of LLMs, we propose a two-stage framework. First, RLMF and metacognitive data selection are applied to calibrate the faithfulness of models’ self-reported sentence-level confidence scores. Then, we convert these faithful numerical scores into natural and context-adaptable linguistic uncertainty, by mapping each score to appropriate hedge expressions and revising the response for coherence and fluency. This yields a more trustworthy[124] and human-aligned[5] medium for users to calibrate their reliance on model outputs. Moreover, the two stages are decoupled by design: Stage 1 can be run once, while Stage 2 can accommodate diverse user preferences and contexts without repeating costly RL training.

Evaluated across multiple LLMs and 10 tasks spanning 6+ content domains, our framework achieves state-of-the-art FC, generalizing robustly across tasks despite training on a single dataset, while preserving task accuracy and factual calibration unlike prior methods. Ablation studies validate the contribution of our metacognitive methods and their superiority over respective baselines. We demonstrate that RLMF improves models’ ability to self-assess performance on a target task. Systematic human evaluation further shows an average 96% win rate over the strongest baseline in diversity, naturalness, helpfulness, and contextual suitability of linguistic uncertainty across diverse tasks and user preferences. To summarize:

1. We introduce reinforcement learning with metacognitive feedback (RLMF), a novel paradigm to refine completion rankings during preference optimization based on the quality of a model’s self-judgments of performance, which strengthens post-training results by up to 63% over standard RL while conferring models with improved metacognitive awareness.
2. We establish the first end-to-end pipeline to faithfully calibrate numerical and linguistic uncertainty expressions emitted by LLMs, achieving state-of-the-art results across models and tasks.
3. We show that models’ self-assessed performance can be leveraged to curate effective training data (which we term metacognitive data selection), outperforming both naive and active-learning-style selection of examples the model handles poorly.
4. We develop a principled, human-aligned mapping approach from numerical to linguistic confidence, improving naturalness and adaptability of LLM uncertainty communication.

2 Related Work

Metacognition in LLMs.

Metacognition, the ability to monitor and control one’s own cognitive processes[23], is a central component of cognition, learning, and uncertainty communication[88] whose deficiency in LLMs has been noted across various tasks[31,125,41] and hypothesized to contribute centrally to hallucinations and other misaligned expressions. Despite its importance, metacognition in LLMs has only recently gained traction[73,78], with select works showing metacognitive methods can improve downstream performance[62,18,95,101,131]. We build upon these developments to propose prioritizing completions during RL for which a model exhibits stronger metacognitive capabilities, subject to first satisfying task-level reward signals. Our hypothesis is that if a model can learn to predict its performance on a target task, a skill reinforced via how candidate completions are ranked, then it can also acquire implicit signals on how to adjust its generations to achieve better performance. As we show, this mechanism not only strengthens post-training outcomes, but also simultaneously improves the model’s ability to recognize and express its own capability level, stepping toward better metacognitive monitoring and alignment.

Reinforcement Learning with Internal Feedback.

Since the advent of RL with human feedback[74,52], numerous methods to devise more effective, targeted reward signals for RL of LLMs have emerged. One recent class of approach is RL with internal feedback[121,123], which leverages unsupervised reward signals derived from the model itself to bypass the need for expensive external feedback. The use of internal confidence signals[54,97,116], such as self-certainty[127,57] or entropy[1], as rewards has been particularly fruitful for improving factual calibration, in addition to direct adaptation of calibration metrics (e.g., Brier score) for explicit optimization[59,86,16]. Further studies have considered the value of multiplying GRPO[79] advantages by a scalar derived from confidence signals[13,63,118,104] such as semantic entropy. Inspired by such works, we propose to leverage metacognitive performance as an additional feedback signal to preferentially rank completions during RL training, and apply RL for faithful calibration. To the best of our knowledge, this marks the first use of such metacognitive feedback during RL for LLMs. We highlight reinforcement learning with metacognitive feedback (RLMF) as a promising new method which outperforms standard RL and confers models with better metacognitive awareness.

Faithful Calibration of LLMs.

Models can appear factually calibrated yet remain misaligned with their internal beliefs[113,25,62,24,61]. This lack of faithful calibration (FC) poses risks to user reliance and safe use of AI tools[129]. Existing efforts to understand, benchmark, and improve FC have focused exclusively on linguistic uncertainty. Yet these approaches produce only modest improvements and are limited in scope and applicability. Metacognitive prompting[62] is contingent on instruction-following and degrades task accuracy; steering[42] is limited to open-weight models and relies on predefined probes that restrict extensibility to novel contexts; and use of simplistic sentence templates for SFT[20] constrains generalization and linguistic diversity while yielding unnatural, repetitive structures. Crucially, none of these works addresses faithful numerical uncertainty, despite the utility of self-reported confidence scores as easily interpretable signals of output reliability. Nor do they consider the naturalness and coherence of hedges across an entire generated text, important in long-form settings. A satisfactory solution must go beyond simple per-sentence hedging to dynamically vary how uncertainty is expressed across a response, mirroring how humans adapt hedging strategies across registers. We address these shortcomings to achieve holistic FC of LLMs. To the best of our knowledge, no prior work has targeted this problem with similar scope, nor explored the value of RL for FC.

3 Method

We propose to leverage metacognitive feedback to improve preference optimization and training data selection, demonstrating the value of this paradigm by applying it to achieve holistic faithful calibration (FC) of LLMs. Specifically, we use it to calibrate the faithfulness of models’ self-reported confidence scores, and pair it with a targeted rewriting stage to map the results to the linguistic setting. This decoupled approach ensures linguistic uncertainty expressions (1) can be tailored and modified to suit user preferences and other context without repeating costly RL training, and (2) are diverse, since RL is prone to mode collapse and faithful calibration metrics do not penalize hedge repetition[113].

3.1 Reinforcement Learning Setup

We integrate our metacognitive methods within an RL framework that uses targeted rewards to optimize the faithfulness of models’ numerically expressed uncertainty. Compared to SFT, RL enables direct optimization of task-specific signals[76,9] and modeling of ordinality in confidence and faithful calibration (FC) scores (e.g., 0.9 is more confident than 0.7). We adopt GRPO[79] to integrate reward signals given its computational advantages[34] and since its sampling-based setup naturally extends the established methodology to assess FC, where intrinsic confidence is estimated via response consistency[113,42,62,20], enabling sampled completions to be used in a dual-purpose fashion.

Formally, our RL framework operates as follows. Given an input query $q$, a model $M$ parametrized by $\theta$ generates a group of candidate completions $\{r_{1},\ldots,r_{G}\}$. Each completion is a sequence of sentences with corresponding confidence scores (footnote 2), formatted as in Fig. 1:

$$r_{g}=\{(s_{1},c_{1}),\ldots,(s_{N_{g}},c_{N_{g}})\}\qquad\text{for } g=1,\ldots,G \tag{1}$$

Post-generation, each $r_{g}$ is evaluated using a composite reward function that captures the end goal of faithful alignment between expressed and intrinsic confidence, and key quality dimensions of correctness (to preserve task accuracy), factual calibration (to mitigate the factual-faithful calibration tradeoff[113,62]), and format adherence. The overall reward for each $r_{g}$ is a weighted sum $\rho_{g}$ of individual reward scores (footnote 3). The relative quality of candidates is then captured by computing an advantage $A_{g}$ for each $r_{g}$. These advantage scores capture how good each sampled completion is relative to the others in its group, and directly guide policy updates via the GRPO objective (§C.1).

Footnote 2: Versus claim-[21,115] or response-level scoring[40,114], sentence-level scoring[66] better balances computational efficiency, interpretability, and alignment with natural language structure[116].

Footnote 3: Exact reward formulas, weighting, and implementation details can be seen in §C.1.1. Each reward’s criticality is also demonstrated via ablation study in §C.1.1.

Typically, GRPO advantages are estimated as $A_{g}=\frac{\rho_{g}-\overline{\boldsymbol{\rho}}}{\text{std}(\boldsymbol{\rho})}$, where $\overline{\boldsymbol{\rho}}$ denotes the mean of $\rho_{1:G}$. Following Liu et al. [64], we remove normalization, yielding $A_{g}=\rho_{g}-\overline{\boldsymbol{\rho}}$, which mitigates difficulty bias and is empirically stronger (§C.1.4). We now turn to RLMF and metacognitive data selection, illustrating their use in this RL setup and beyond.
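The two advantage estimators described above can be contrasted in a few lines. The following is a minimal sketch, not the authors' implementation: the function name `grpo_advantages`, the `normalize` flag, the `eps` stabilizer, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, normalize=False, eps=1e-8):
    """Group-relative advantages for the completions sampled for one query.

    normalize=False gives A_g = rho_g - mean(rho), the variant adopted in the text.
    normalize=True additionally divides by std(rho), as in typical GRPO.
    Population std and eps are illustrative assumptions.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


# Toy composite rewards rho_1..rho_4 for one group (made-up values).
rewards = [0.9, 0.5, 0.4, 0.2]
adv_plain = grpo_advantages(rewards)
adv_norm = grpo_advantages(rewards, normalize=True)
```

In both variants the advantages sum to zero within a group; the difference is that the unnormalized form keeps the reward scale, so groups with nearly identical rewards yield small updates instead of being rescaled to unit variance.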

Figure 2: Overview of our proposed RLMF method.

3.2 Reinforcement Learning with Metacognitive Feedback (RLMF)

The premise of RLMF is our proposition that teaching a model to accurately predict its own task performance in an on-policy fashion can meaningfully improve post-training results by enhancing the model’s metacognitive awareness. We operationalize this intuition by introducing reinforcement learning with metacognitive feedback (RLMF), a novel paradigm to preferentially refine completion rankings based on demonstrated metacognitive performance. Our key innovation is to refine the advantage-driven learning signal using metacognitive accuracy: among completions that already perform well on the target task, we assign greater weight to those for which the model more accurately judges its own performance.

In the context of FC, under RLMF, beyond prioritizing completions with strong alignment between predicted and gold confidence, we also identify and prioritize completions for which the model better predicts its FC level. Concretely, we directly scale the advantage $A_{g}$ for each completion $r_{g}=\{(s_{i},c_{i})\}_{i=1}^{N_{g}}$ according to $M$’s self-judgment accuracy for $r_{g}$ (Fig. 2). This accuracy is computed by comparing the model’s predicted and gold task performance (in this case, FC level). Let $g_{1:N_{g}}$ denote $M$’s intrinsic confidence in each sentence of $r_{g}$, estimated via sampling consistency following prior work[113,62] (details in §B.4). The gold FC level of $M$ on $r_{g}$ is estimated as:

$$F_{\text{gold}}^{(g)}:=\frac{\sum_{i}\mathbb{1}(|c_{i}-g_{i}|<\tau)}{N_{g}}\in[0,1], \tag{2}$$

where the numerator counts the number of sentences with faithful confidence alignment within threshold $\tau$. The predicted FC level of $M$ on $r_{g}$ is obtained by prompting (§B.3) $M$ via online inference under the policy $\pi_{\theta}$ to issue a score $F_{\text{pred}}^{(g)}\in[0,1]$, which reflects $M$’s confidence that its numerically reported confidences $c_{1:N_{g}}$ are faithful to $g_{1:N_{g}}$. The gap between $M$’s actual and metacognitively judged task performance is then captured as:

$$Z_{g}:=1-(F_{\text{pred}}^{(g)}-F_{\text{gold}}^{(g)})^{2}\in[0,1], \tag{3}$$

where $Z_{g}=1$ corresponds to perfect metacognitive awareness; high $Z_{g}$ occurs precisely when the model more accurately estimates its performance, suggesting ability to utilize internal metacognitive information (footnote 4).

Footnote 4: We use the quadratic formulation of $Z_{g}$ as it is empirically strongest; alternative formulations are possible and analyzed in §C.2.2.
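As a concrete reading of Eqs. (2) and (3), the sketch below computes the gold FC level and the metacognitive score for one completion. The function names and the threshold value `tau=0.1` are assumptions chosen for illustration; the text defers the actual choice of the threshold to §C.2.4.

```python
def gold_fc_level(expressed, intrinsic, tau):
    """Eq. (2): fraction of sentences whose expressed confidence c_i lies
    within tau of the intrinsic confidence g_i."""
    if len(expressed) != len(intrinsic) or not expressed:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for c, g in zip(expressed, intrinsic) if abs(c - g) < tau)
    return hits / len(expressed)


def metacognitive_score(f_pred, f_gold):
    """Eq. (3): Z_g = 1 - (F_pred - F_gold)^2, equal to 1 when the
    self-judgment matches the gold FC level exactly."""
    return 1.0 - (f_pred - f_gold) ** 2


# Toy completion with four sentences (made-up confidences).
expressed = [0.9, 0.6, 0.3, 0.8]    # c_i reported by the model
intrinsic = [0.85, 0.2, 0.35, 0.5]  # g_i estimated via sampling consistency
f_gold = gold_fc_level(expressed, intrinsic, tau=0.1)  # tau is an assumed value
z = metacognitive_score(f_pred=0.75, f_gold=f_gold)
```

Here two of the four sentences fall within the threshold, so the gold FC level is 0.5; a self-judgment of 0.75 is off by 0.25 and is penalized quadratically.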

To refine relative completion rankings, we rewrite each $A_{g}$ as $A_{g}=(o_{g}-\overline{\boldsymbol{o}})+(f_{g}-\overline{\boldsymbol{f}})$, where $f_{g}:=w_{\text{faith}}\cdot r_{\text{faith}}$ is the weighted faithfulness component of $\rho_{g}$, representing the primary training objective, and

$$o_{g}:=w_{\text{factual\_calib}}\cdot r_{\text{factual\_calib}}+w_{\text{acc}}\cdot r_{\text{acc}}+w_{\text{strict}}\cdot r_{\text{strict}}+w_{\text{soft}}\cdot r_{\text{soft}}$$

is the sum of the remaining weighted rewards, capturing accessory quality constraints. The metacognition-adjusted advantage $A^{\texttt{RLMF}}_{g}$ is then computed as:

$$A^{\texttt{RLMF}}_{g}=(o_{g}-\overline{\boldsymbol{o}})+\begin{cases}(f_{g}-\overline{\boldsymbol{f}})\cdot(k+Z_{g})&\text{if }f_{g}>\overline{\boldsymbol{f}}\\ f_{g}-\overline{\boldsymbol{f}}&\text{otherwise}\end{cases} \tag{4}$$

where completions with above-average faithfulness (i.e., stronger primary task performance) are additionally scaled (footnote 5) by $Z_{g}$, so that among these strong completions, those with better metacognitive performance are ranked higher, while remaining subject to other constraints (captured via $o_{g}$). Since $Z_{g}\in[0,1]$, the additive factor $k=1$ ensures that completions with above-average faithfulness (better task performance) but weak metacognition (low $Z_{g}$) are not ranked lower than below-average faithfulness completions (worse task performance) to which no $Z_{g}$ scaling is applied (footnote 6).

Footnote 5: The impact of alternatively applying $Z_{g}$ as an additional reward function is explored in §C.2.2.

Footnote 6: The impact of $A^{\texttt{RLMF}}_{g}$ design, $k$ value, and $\tau$ value are respectively analyzed in §C.2.2, §C.2.3, §C.2.4.
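The piecewise scaling in Eq. (4) can be sketched as follows. This is an illustrative reading of the formula rather than the released code; the function name `rlmf_advantages` and the toy reward values are assumptions.

```python
from statistics import mean


def rlmf_advantages(other, faith, z, k=1.0):
    """Eq. (4): metacognition-adjusted advantages for one group.

    other: o_g, sum of the accessory weighted rewards per completion
    faith: f_g, weighted faithfulness reward per completion
    z:     Z_g, metacognitive score in [0, 1] per completion
    """
    if not (len(other) == len(faith) == len(z)):
        raise ValueError("all inputs must have the same length")
    o_bar, f_bar = mean(other), mean(faith)
    advantages = []
    for o_g, f_g, z_g in zip(other, faith, z):
        faith_term = f_g - f_bar
        if f_g > f_bar:
            # Only above-average faithfulness is scaled, by a factor in [k, k + 1].
            faith_term *= k + z_g
        advantages.append((o_g - o_bar) + faith_term)
    return advantages


# Two equally faithful completions that differ only in self-judgment quality,
# plus two below-average ones (made-up values).
other = [0.2, 0.2, 0.2, 0.2]
faith = [0.8, 0.8, 0.4, 0.0]
z = [1.0, 0.0, 0.5, 0.5]
adv = rlmf_advantages(other, faith, z)
```

With $k=1$ the scaling factor never drops below 1, so the completion with $Z_g=0$ keeps its unscaled advantage (about 0.3) and still ranks above the below-average completions, while the one with $Z_g=1$ is doubled (about 0.6).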

3.3 Metacognitive Data Selection

Beyond RLMF, we propose to additionally leverage models’ metacognitive self-judgments of performance to identify high-utility training samples, foregoing the need for external annotations or other expensive filtering. We instantiate this strategy for use alongside RLMF toward FC as follows. Given a dataset $D=(D_{\text{train}},D_{\text{test}})$, we first prompt $M$ offline to generate responses for $D_{\text{train}}$ and express uncertainty linguistically via hedging if uncertain. We then prompt $M$ offline to score on a scale of 0–100 per example how well it believes its linguistic and internal confidence align. Lastly, given a target training size $N_{\text{train}}^{\texttt{RLMF}}$, we select $\frac{N_{\text{train}}^{\texttt{RLMF}}}{2}$ highest- and lowest-scoring samples for training. The impact of alternative ranking and selection strategies is discussed in §C.3, and prompts are provided in §B.3.
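The final selection step can be sketched as below, assuming the self-assessed scores have already been collected. The function name `select_metacognitive`, the dictionary input format, and the example identifiers are hypothetical; the prompting steps that produce the scores are not shown.

```python
def select_metacognitive(scores, n_train):
    """Pick the n_train/2 lowest- and n_train/2 highest-scoring examples.

    scores: dict mapping example id -> self-assessed alignment score (0-100).
    """
    if n_train < 2 or n_train % 2 != 0:
        raise ValueError("n_train must be a positive even number")
    if n_train > len(scores):
        raise ValueError("n_train exceeds the number of candidates")
    ranked = sorted(scores, key=scores.get)  # ascending by self-assessed score
    half = n_train // 2
    return ranked[:half] + ranked[-half:]


# Toy self-assessed scores for six candidate training examples (made-up values).
scores = {"a": 95, "b": 10, "c": 55, "d": 80, "e": 30, "f": 60}
selected = select_metacognitive(scores, n_train=4)
```

The mid-scoring examples ("c" and "f" here) are left out, reflecting the idea that both ends of the self-assessment spectrum carry complementary learning signal.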

3.4 Rewriting Protocol

To further capitalize on the merits of RLMF, we use targeted editing of model outputs to enable faithful linguistic uncertainty expression that is dynamically adaptable across scenarios and generalizable to long-form tasks. Existing work shows LLMs struggle to select and use hedges in a human-like fashion even with specialized prompting[62]. We therefore construct a principled mapping from confidence scores to hedge expressions and apply strategic rewriting to incorporate these into model outputs (Fig. 1b). This enables flexible, well-distributed, task-appropriate, and naturalistic linguistic uncertainty that can accommodate user preferences (e.g., style, expected audience), without the need to repeat costly RL training for different contexts.

Given sentence-level faithful confidence scores, rewriting is performed by prompting a strong, cost-effective LLM with (1) the original response, (2) candidate hedges corresponding to each sentence’s confidence score, and (3) task- and/or user-specific details (e.g., style, target audience, domain-specific conventions) to guide hedge selection and text editing. As we show in §5, this process enables strong numerical–linguistic faithful calibration correspondence while improving the naturalness and contextual suitability of LLMs’ linguistic uncertainty versus prior methods as judged by humans. We provide prompts and describe the construction of the score–hedge mapping in §B.3 and §C.4, and compare against a more fine-grained two-step rewriting approach that combines sentence-level and whole-response revisions in §C.4.2.
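To make the three rewriting inputs concrete, the sketch below assembles them into a single prompt string. Everything specific here is hypothetical: the bin boundaries and hedge phrases in `HEDGE_BINS`, the prompt wording, and both function names are placeholders, since the actual score–hedge mapping and prompts are described in §C.4 and §B.3 and are not reproduced in this text.

```python
# Hypothetical score-to-hedge bins as (lower bound, candidate hedges); illustrative only.
HEDGE_BINS = [
    (0.9, ["definitely", "certainly"]),
    (0.7, ["very likely", "probably"]),
    (0.4, ["possibly", "it may be that"]),
    (0.0, ["it is unclear whether", "I am not sure, but"]),
]


def candidate_hedges(confidence):
    """Return the candidate hedges for one sentence-level confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    for lower_bound, hedges in HEDGE_BINS:
        if confidence >= lower_bound:
            return hedges
    raise AssertionError("unreachable: the last bin starts at 0.0")


def build_rewrite_prompt(sentences, confidences, context):
    """Combine (1) the response, (2) per-sentence hedges, (3) context details."""
    if len(sentences) != len(confidences):
        raise ValueError("one confidence score per sentence is required")
    lines = [
        "Rewrite the response so each sentence carries a fitting hedge.",
        f"Context: {context}",
        "Sentences and candidate hedges:",
    ]
    for i, (sentence, conf) in enumerate(zip(sentences, confidences), start=1):
        hedges = ", ".join(candidate_hedges(conf))
        lines.append(f"{i}. {sentence} [candidates: {hedges}]")
    return "\n".join(lines)


prompt = build_rewrite_prompt(
    ["The Sydney Harbour Bridge is in Australia.", "It opened in 1932."],
    [0.95, 0.45],
    "general audience, informal style",
)
```

Because only the context string and the hedge table change between use cases, a sketch like this illustrates why the rewriting stage can be adapted without repeating RL training.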

4 Experimental Setup

We conduct extensive experiments to evaluate the efficacy of our metacognitive approach for improving FC of LLMs, spanning both numerical and linguistic uncertainty expression.

Datasets & Benchmarks.

We evaluate FC performance using a suite of 10 datasets spanning diverse formats, content domains, and difficulty levels. To avoid potential dataset size bias, we follow prior work[113,62] to sample 1000 examples from the test split of each dataset for evaluation. The dataset list and further details are in §B.2.

Models and Training Details.

We apply our approach to LLMs from two widely used model families: Qwen3 (1.7B, 4B, 8B)[109] and Llama3.1-Instruct (8B)[29]. These are capable open-source models representing varying architectures and parameter scales; we use the instruction-tuned variants to demonstrate the complementary value of our training pipeline. Prior to RL training, models first undergo SFT (footnote 7) to learn our custom output format, generalize it across tasks, and adhere to task-specific output lengths. We then apply RLMF to the best checkpoint per model (procedure in §B.1), using $N_{\text{train}}^{\texttt{RLMF}}=2000$ metacognitively selected samples from the PopQA training set. We use PopQA to ensure comparability to prior work[20] and since it is a challenging open-domain QA task. This setup enables rigorous assessment of out-of-distribution generalization, as models are trained on a single dataset and evaluated on a wide range of tasks and content domains. We additionally test the robustness of our approach to the choice of training data by alternatively training on SelfAware, UMWP, or HaluEval. Finally, we study the contribution of each metacognitive component of our approach via ablations that remove metacognitive advantage scaling or compare our data selection method against no special selection and an active-learning-style baseline. We use Gemini-2.5-Flash-Lite[26] for rewriting, and compare against GPT-5-Mini[84] in §C.4.3. Additional implementation details are provided in §B.1.

Footnote 7: Details are in §B.1. We assess the contribution of the pre-SFT stage via ablation study in §C.1.6.

Baselines.

As prior work has only studied linguistic FC, we report both numerical and linguistic FC results but restrict comparison to prior methods to the linguistic setting. Specifically, we compare against MetaFaith, the metacognitive prompting approach of Liu et al. [62], and Faithful Uncertainty Tuning (FUT), the SFT approach of Eikema et al. [20]. Training and implementation details for baseline methods are discussed in §B.1. We also compare against the FC performance of frontier models Gemini-3.1-Pro[28], Gemini-3-Flash[27], and GPT-5, applying MetaFaith prompting[62] to these to establish an even stronger baseline.

Metrics.

FC is typically evaluated via cMFG[113,62,20], but this metric suffers from sensitivity to the distribution of models’ intrinsic confidence scores (see §B.4). We therefore propose and use the cMFG*, a refinement of the cMFG which addresses these limitations and is inspired by established improvements to analogous factual calibration metrics (e.g., ECE). Like cMFG, the cMFG* ranges from 0 to 1, with 1 indicating perfect FC. Further details on the motivation, computation, and implementation of cMFG* are provided in §B.4. We additionally report accuracy, scored via LLM-as-a-Judge and averaged per dataset, and Brier Score, which quantifies factuality-based alignment between intrinsic confidence and accuracy, as reference metrics (details in §B.4).
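For readers unfamiliar with the reference metric, the standard Brier score is the mean squared difference between a confidence and a binary correctness label. The sketch below uses that textbook definition; how the paper aggregates it over sentences and datasets is specified in its §B.4 and is not reproduced here, and the function name and toy values are assumptions.

```python
def brier_score(confidences, correct):
    """Standard Brier score: mean of (confidence - outcome)^2, lower is better."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((p - float(y)) ** 2 for p, y in zip(confidences, correct))
    return total / len(confidences)


# Toy example: three answers with intrinsic confidences and correctness labels.
bs = brier_score([0.9, 0.6, 0.2], [True, False, False])
```

A perfectly confident and correct model scores 0, while a model that always reports 0.5 scores 0.25 regardless of accuracy.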

5 Results

We report main results in Table 1, using the best hyperparameter setting per experiment. Our key findings are as follows (footnote 8).

Footnote 8: Example generations and results for sub-8B models are in §D.4 and §D.1, respectively.

Table 1: Faithful calibration (FC) results versus baselines, evaluated via cMFG*. The last three columns report dataset-level averages. Blue rows report our numerical (+RLMF) and linguistic (+RLMF+Rewr.) FC results, while yellow rows report results without metacognitive advantage scaling (RL ablation). Dataset abbreviations are provided in §B.2.1. Full results for other model sizes are in §D.1.

| Model / Method | PQA | SA | SQA | HE | MMLU | SQ | MT | UM | AC | SG | cMFG*↑ | Acc↑ | BS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Ins | 0.60 | 0.61 | 0.61 | 0.50 | 0.65 | 0.62 | 0.48 | 0.61 | 0.59 | 0.71 | 0.60 | 0.31 | 0.33 |
| +MetaFaith | 0.68 | 0.71 | 0.65 | 0.67 | 0.67 | 0.64 | 0.64 | 0.66 | 0.68 | 0.72 | 0.67 | 0.28 | 0.36 |
| +FUT | 0.69 | 0.67 | 0.68 | 0.66 | 0.63 | 0.70 | 0.63 | 0.63 | 0.68 | 0.67 | 0.66 | 0.31 | 0.29 |
| +RL | 0.82 | 0.78 | 0.80 | 0.79 | 0.75 | 0.73 | 0.81 | 0.80 | 0.73 | 0.72 | 0.77 | 0.40 | 0.20 |
| +RLMF | 0.85 | 0.81 | 0.83 | 0.82 | 0.81 | 0.84 | 0.84 | 0.83 | 0.86 | 0.86 | 0.84 | 0.41 | 0.26 |
| +RLMF+Rewr. | 0.81 | 0.86 | 0.80 | 0.81 | 0.80 | 0.81 | 0.82 | 0.81 | 0.87 | 0.83 | 0.82 | 0.41 | 0.26 |
| Qwen3-8B | 0.53 | 0.63 | 0.57 | 0.54 | 0.63 | 0.59 | 0.59 | 0.59 | 0.07 | 0.62 | 0.54 | 0.55 | 0.31 |
| +MetaFaith | 0.53 | 0.66 | 0.47 | 0.68 | 0.67 | 0.72 | 0.70 | 0.49 | 0.70 | 0.67 | 0.63 | 0.51 | 0.29 |
| +FUT | 0.57 | 0.75 | 0.48 | 0.74 | 0.72 | 0.71 | 0.66 | 0.67 | 0.71 | 0.74 | 0.67 | 0.38 | 0.41 |
| +RL | 0.75 | 0.66 | 0.69 | 0.20 | 0.55 | 0.54 | 0.58 | 0.44 | 0.32 | 0.38 | 0.51 | 0.59 | 0.26 |
| +RLMF | 0.85 | 0.82 | 0.86 | 0.82 | 0.84 | 0.82 | 0.83 | 0.82 | 0.83 | 0.84 | 0.83 | 0.57 | 0.19 |
| +RLMF+Rewr. | 0.82 | 0.86 | 0.80 | 0.84 | 0.80 | 0.80 | 0.87 | 0.87 | 0.82 | 0.82 | 0.83 | 0.57 | 0.19 |
| Gemini-3.1-Pro | 0.62 | 0.71 | 0.70 | 0.68 | 0.72 | 0.68 | 0.66 | 0.71 | 0.73 | 0.82 | 0.70 | 0.78 | 0.15 |
| Gemini-3-Flash | 0.59 | 0.64 | 0.55 | 0.66 | 0.67 | 0.70 | 0.65 | 0.66 | 0.77 | 0.71 | 0.66 | 0.72 | 0.16 |
| GPT-5 | 0.50 | 0.61 | 0.52 | 0.66 | 0.59 | 0.57 | 0.60 | 0.57 | 0.68 | 0.77 | 0.61 | 0.69 | 0.19 |

RLMF robustly and generalizably improves faithful calibration, significantly surpassing baselines across diverse tasks and model families. Compared to prior prompting and SFT-based methods, our two-stage approach achieves respective gains of 29% and 25% in average cMFG* across tasks, generalizing across models and datasets to achieve $\texttt{cMFG}^{*}\geq 0.80$ in each setting, representing state-of-the-art numerical and linguistic FC performance. Unlike FUT, whose efficacy is largely confined to QA tasks similar to the training task, our RLMF training performs similarly well on complex, long-form reasoning tasks (e.g., MATH) and challenging out-of-distribution settings (e.g., SimpleQA), despite training only on PopQA. Importantly, these gains are achieved while preserving task accuracy and factual calibration, unlike MetaFaith and FUT, which tend to degrade such performance. Comparing numerical and linguistic results validates the efficacy of our rewriting procedure, showing that our pipeline enables strong FC regardless of mode of uncertainty expression. Moreover, we enable small models to outperform large proprietary LLMs in FC, with average gains of 37%, 17%, and 25% over GPT-5, Gemini-3.1-Pro, and Gemini-3-Flash respectively, even when these are paired with specialized prompting. As we show in §D.1, these observations extend to sub-8B models as well. Analysis of the correspondence between expressed and intrinsic confidence (Fig. 3) further shows that RLMF is similarly effective across all intrinsic confidence levels, unlike FUT and original models which systematically struggle at low confidences.

Figure 3: Reliability diagrams of expressed vs. intrinsic confidence (blue) and FC (purple) per size-0.1 gold confidence bin, evaluated on PopQA (FUT and ours trained on PopQA).

Table 2: Generalizability of RLMF across training tasks. Applying our Stage 1 across diverse training datasets yields consistently strong numerical FC (average cross-task cMFG*), indicating robustness to the choice of training task.

Table 3: Impact of data selection strategy on numerical FC results (Stage 1).

RLMF outperforms standard RL to both improve post-training results and endow models with better metacognitive monitoring. To compare RLMF against standard RL, we focus on the numerical FC setting, given the strong numerical–linguistic correspondence established earlier. As shown in Table 1, RLMF is critical to achieve sufficient FC gains and enable cross-task generalization, outperforming standard RL by up to 63%. This suggests that, rather than optimizing target task performance alone, augmenting the RL objective to include the goal of improving models’ metacognitive ability can yield broader benefits. Analysis in §D.3 shows that models’ metacognitive performance consistently improves as RLMF training progresses. (Note that improvement at self-assessment of performance is not equivalent to broad metacognitive awareness, as metacognition encompasses many other types of capabilities.) Moreover, RLMF is robust to the choice of training task (Table 2): even when training on math reasoning (UMWP), hallucination detection (HaluEval), or answerability (SelfAware), RLMF delivers consistent gains across evaluation tasks, showing broad applicability without the need for task-specific adaptation. Interestingly, prior work on RL with internal feedback (RLIF) [123] finds that such approaches initially improve training outcomes but can degrade performance as training progresses, indicating diminishing returns and limited gains for instruction-tuned models. In contrast, we show that intrinsic feedback based on metacognitive predictions, which operate one level above the self-signaled output confidence used in typical RLIF, avoids these limitations. This positions RLMF as a promising direction for achieving more effective and stable post-training with internal feedback, stepping toward future work on scalable self-improvement and better model capabilities and alignment.
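The "up to 63%" figure can be checked against the average cMFG* column of Table 1. The sketch below assumes the figure is a relative improvement, (RLMF − RL) / RL, computed on those averages; the section does not spell out the formula, so this is one plausible reading, and the helper name is hypothetical.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Average cMFG* values copied from the +RL and +RLMF rows of Table 1.
table1_avg_cmfg = {
    "Llama3.1-8B-Ins": {"RL": 0.77, "RLMF": 0.84},
    "Qwen3-8B": {"RL": 0.51, "RLMF": 0.83},
}

for model, row in table1_avg_cmfg.items():
    gain = relative_gain(row["RL"], row["RLMF"])
    print(f"{model}: {gain:.0%}")
# Llama3.1-8B-Ins: 9%
# Qwen3-8B: 63%
```

Under this assumption the maximum is reached on Qwen3-8B, where standard RL falls below the untuned model's average cMFG* (0.51 vs. 0.54) while RLMF reaches 0.83.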

Metacognitive signals can drive self-selection of effective training data. Compared to random selection or active data selection based on ground-truth FC, our metacognitive method yields the best cMFG* while preserving accuracy and factual calibration (Table 3). While we study this self-selection ability in the context of FC, these findings suggest models could exhibit limited forms of intrinsic self-improvement, including identifying effective training data analogously to how humans choose study materials, a capacity which varies across individuals and contexts. They also point to possible implications for self-directed learning.

Our framework produces faithful linguistic uncertainty expressions that are significantly more diverse, natural, helpful, and context-adaptable than those of prior work. We assess the practical utility of our rewriting method by collecting human annotations to evaluate the diversity, naturalness, helpfulness, and contextual suitability of the resulting linguistic uncertainty expressions (§E). Compared to the strong FUT baseline, our approach achieves absolute win rates of 98%, 98%, 95%, and 96% on these criteria, with high inter-annotator agreement of 0.93. Further manual inspection reveals that while FUT suffers from repetitive hedge phrases and sentence structures, especially in long-form settings, our approach does not.

6 Conclusion

We introduced RLMF, a novel paradigm to refine completion rankings during preference optimization by leveraging a model’s own implicit judgments of performance, alongside metacognitive data selection, which uses similar self-judgments to identify more effective training data than simple active learning. We applied these contributions to build the first end-to-end framework for holistic faithful calibration (FC) of LLMs, presenting a two-stage decoupled approach to robustly align models’ numerically and linguistically expressed uncertainty with their intrinsic confidence. Comprehensive experiments showed this framework achieves strong and generalizable FC across diverse models and tasks, outperforming the prior state-of-the-art while preserving task accuracy and factual calibration. It also enables LLMs to improve at self-assessment of performance, emit highly faithful self-reported confidence scores, and modulate linguistic uncertainty in a naturalistic, context-appropriate fashion. As part of these evaluations, we introduced cMFG*, a new metric that improves upon its predecessor by removing estimation bias for models whose intrinsic confidence occupies a limited range. More broadly, our results position RLMF as a promising paradigm for achieving stronger and more stable post-training while encoding improved metacognitive awareness into LLMs. They suggest that metacognitive performance is a particularly effective internal feedback signal for RL training that can overcome limitations of prior intrinsic feedback methods, with broader implications for alignment and self-directed learning.

Acknowledgments and Disclosure of Funding

This work was supported in part by the U.S. National Science Foundation under award No. 2541654.

Ethics Statement

This work studies the use of metacognitive signals to teach models to better monitor and learn from their own task performance, drawing on principles of human metacognition to improve post-training outcomes and the faithful, human-aligned communication of uncertainty. Faithfulness is a highly valuable yet understudied aspect of confidence calibration that is critical to improving the trustworthiness and reliability of LLMs. Improved faithful calibration and metacognitive awareness may support more reliable abstention behavior, enabling models to better recognize, predict, and signal when they are uncertain or likely to be wrong. These capabilities are especially important as LLMs are increasingly deployed in high-stakes contexts such as AI-assisted scientific planning and discovery, where faulty communication of intrinsic uncertainty could lead to significant setbacks or other negative consequences. The generalizability of our approach across models and tasks suggests potential for improving LLM reliability in diverse settings. Our decoupled strategy further facilitates adaptability of linguistic uncertainty to accommodate different cultural communication norms [53, 108, 70, 69]. More generally, improved faithfulness and self-assessment capability are not substitutes for verification, and critical evaluation is needed to ensure the factuality of model responses. System designers should take care to incorporate appropriate safeguards against misuse and misinformation. We further encourage caution when considering improved metacognitive capabilities for LLMs, particularly as they relate to more autonomous or self-directed behavior, which may require increased oversight.

References

Agarwal et al. [2025] Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=UfFTBEsLgI.
Azaria and Mitchell [2023] Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=y2V6YgLaW7.
Band et al. [2024] Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations, 2024. URL https://arxiv.org/abs/2404.00474.
Becker and Soatto [2024] Evan Becker and Stefano Soatto. Cycles of thought: Measuring llm confidence through stable explanations, 2024. URL https://arxiv.org/abs/2406.03441.
Belém et al. [2024] Catarina G Belém, Markelle Kelly, Mark Steyvers, Sameer Singh, and Padhraic Smyth. Perceptions of linguistic uncertainty by language models and humans. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8467–8502, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.483. URL https://aclanthology.org/2024.emnlp-main.483/.
Bird and Loper [2004] Steven Bird and Edward Loper. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/P04-3031/.
Bouktif et al. [2023] Salah Bouktif, Abderraouf Cheniki, Ali Ouni, and Hesham El-Sayed. Deep reinforcement learning for traffic signal control with consistent state and reward design approach. Know.-Based Syst., 267(C), May 2023. ISSN 0950-7051. doi:10.1016/j.knosys.2023.110440. URL https://doi.org/10.1016/j.knosys.2023.110440.
Burns et al. [2024] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024. URL https://arxiv.org/abs/2212.03827.
Cao et al. [2024] Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods. IEEE Transactions on Neural Networks and Learning Systems, 36(6):9737–9757, 2024.
Chaudhry et al. [2024] Arslan Chaudhry, Sridhar Thiagarajan, and Dilan Gorur. Finetuning language models to emit linguistic expressions of uncertainty, 2024. URL https://arxiv.org/abs/2409.12180.
Chaudhry et al. [2025] Arslan Chaudhry, Sridhar Thiagarajan, and Dilan Gorur. Finetuning language models to emit linguistic expressions of uncertainty. In ICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI, 2025. URL https://openreview.net/forum?id=eXkLpsoy54.
Chen and Mueller [2024] Jiuhai Chen and Jonas Mueller. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5186–5200, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.283. URL https://aclanthology.org/2024.acl-long.283/.
Chen et al. [2025] Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346, 2025.
Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.
Dahl et al. [2024] Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E. Ho. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 2024.
Damani et al. [2026] Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, and Jacob Andreas. Beyond binary rewards: Training LMs to reason about their uncertainty. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ASQ649zdHm.
Desai and Durrett [2020] Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295–302, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.21. URL https://aclanthology.org/2020.emnlp-main.21/.
Didolkar et al. [2024] Aniket Rajiv Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy P Lillicrap, Danilo Jimenez Rezende, Yoshua Bengio, Michael Curtis Mozer, and Sanjeev Arora. Metacognitive capabilities of LLMs: An exploration in mathematical problem solving. In AI for Math Workshop @ ICML 2024, 2024. URL https://openreview.net/forum?id=0MsI3bSmmD.
Duan et al. [2024] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.276. URL https://aclanthology.org/2024.acl-long.276/.
Eikema et al. [2025] Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, and Wilker Aziz. Teaching language models to faithfully express their uncertainty, 2025. URL https://arxiv.org/abs/2510.12587.
Fadeeva et al. [2024] Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, and Maxim Panov. Fact-checking the output of large language models via token-level uncertainty quantification. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 9367–9385, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.558. URL https://aclanthology.org/2024.findings-acl.558/.
Fagen-Ulmschneider [2023] Wade Fagen-Ulmschneider. Perception of probability words, 2023. URL https://waf.cs.illinois.edu/visualizations/Perception-of-Probability-Words/.
Fleming and Lau [2014] Stephen Fleming and Hakwan Lau. How to measure metacognition. Frontiers in Human Neuroscience, 8:443, 07 2014. doi:10.3389/fnhum.2014.00443.
Gani et al. [2026] Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, and Arman Cohan. Quantifying faithful confidence expression in large reasoning models. arXiv preprint arXiv:2606.03969, 2026.
Ghafouri et al. [2024] Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Mayank Goel, Reihaneh Rabbany, Jean-François Godbout, and Kellin Pelrine. Epistemic integrity in large language models. In Neurips Safe Generative AI Workshop 2024, 2024. URL https://openreview.net/forum?id=o3wQbxRaKo.
Google DeepMind [2025a] Google DeepMind. Gemini 2.5 flash-lite model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Lite-Model-Card.pdf, 2025a.
Google DeepMind [2025b] Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, 2025b.
Google DeepMind [2026] Google DeepMind. Gemini 3.1 pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026.
Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, Danny Wyatt, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Francisco Guzmán, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Govind Thattai, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jack Zhang, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Karthik Prasad, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Kushal Lakhotia, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria
Tsimpoukelli, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Ning Zhang, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vítor Albiero, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaofang Wang, Xiaoqing Ellen Tan, Xide Xia, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aayushi Srivastava, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Dong, Annie 
Franco, Anuj Goyal, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Cynthia Gao, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Eric-Tuan Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Filippos Kokkinos, Firat Ozgenel, Francesco Caggioni, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hakan Inan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Hongyuan Zhan, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng 
Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Miao Liu, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, 
Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma.The llama 3 herd of models, 2024.URLhttps://arxiv.org/abs/2407.21783.
Grewal et al. [2024]Yashvir S. Grewal, Edwin V. Bonilla, and Thang D. Bui.Improving uncertainty quantification in large language models via semantic embeddings, 2024.URLhttps://arxiv.org/abs/2410.22685.
Griot et al. [2025]Maxime Griot, Coralie Hemptinne, Jean Vanderdonckt, and Demet Yuksel.Large language models lack essential metacognition for reliable medical reasoning.Nature Communications, 16, 01 2025.doi:10.1038/s41467-024-55628-6.
Guo et al. [2017a]Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger.On calibration of modern neural networks.InInternational conference on machine learning, pages 1321–1330. PMLR, 2017a.
Guo et al. [2017b]Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger.On calibration of modern neural networks.In Doina Precup and Yee Whye Teh, editors,Proceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017b.URLhttps://proceedings.mlr.press/v70/guo17a.html.
Guo et al. [2025]Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al.Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025.
Hendrycks et al. [2021a]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.Measuring massive multitask language understanding, 2021a.URLhttps://arxiv.org/abs/2009.03300.
Hendrycks et al. [2021b]Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt.Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021b.
Hou et al. [2024]Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang.Decomposing uncertainty for large language models through input clarification ensembling, 2024.URLhttps://arxiv.org/abs/2311.08718.
Huang et al. [2024a]Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, and Yunfang Wu.A survey of uncertainty estimation in llms: Theory meets practice, 2024a.URLhttps://arxiv.org/abs/2410.15326.
Huang et al. [2025]Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma.Look before you leap: An exploratory study of uncertainty analysis for large language models.IEEE Transactions on Software Engineering, 51(2):413–429, February 2025.ISSN 2326-3881.doi:10.1109/tse.2024.3519464.URLhttp://dx.doi.org/10.1109/TSE.2024.3519464.
Huang et al. [2024b]Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, and Bhuwan Dhingra.Calibrating long-form generations from large language models.In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13441–13460, Miami, Florida, USA, November 2024b. Association for Computational Linguistics.doi:10.18653/v1/2024.findings-emnlp.785.URLhttps://aclanthology.org/2024.findings-emnlp.785/.
Hwang et al. [2025]Seonjeong Hwang, Hyounghun Kim, and Gary Geunbae Lee.Can llms estimate cognitive complexity of reading comprehension items?arXiv preprint arXiv:2510.25064, 2025.
Ji et al. [2025]Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda.Calibrating verbal uncertainty as a linear feature to reduce hallucinations.arXiv preprint arXiv:2503.14477, 2025.
Jiang et al. [2023]Mingjian Jiang, Yangjun Ruan, Sicong Huang, Saifei Liao, Silviu Pitis, Roger Baker Grosse, and Jimmy Ba.Calibrating language models via augmented prompt ensembles.2023.URLhttps://api.semanticscholar.org/CorpusID:271797871.
Jiang et al. [2025]Zhengping Jiang, Anqi Liu, and Benjamin Van Durme.Conformal linguistic calibration: Trading-off between factuality and specificity, 2025.URLhttps://arxiv.org/abs/2502.19110.
Johannes Welbl [2017]Matt Gardner Johannes Welbl, Nelson F. Liu.Crowdsourcing multiple choice science questions.2017.
Johnson et al. [2023]Douglas B. Johnson, Rachel S Goodman, J. Randall Patrinely, Cosby A Stone, Eli Zimmerman, Rebecca Rigel Donald, Sam S Chang, Sean T Berkowitz, Avni P Finn, Eiman Jahangir, Elizabeth A Scoville, Tyler Reese, Debra E. Friedman, Julie A. Bastarache, Yuri F van der Heijden, Jordan Wright, Nicholas Carter, Matthew R Alexander, Jennifer H Choe, Cody A Chastain, John Zic, Sara N Horst, Isik Turker, Rajiv Agarwal, Evan C. Osmundson, Kamran Idrees, Colleen M. Kiernan, Chandrasekhar Padmanabhan, Christina Edwards Bailey, Cameron Schlegel, Lola B. Chambless, Mike Gibson, Travis J. Osterman, and Lee E. Wheless.Assessing the accuracy and reliability of ai-generated medical responses: An evaluation of the chat-gpt model.Research Square, 2023.URLhttps://api.semanticscholar.org/CorpusID:257437276.
Kadavath et al. [2022]Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan.Language models (mostly) know what they know, 2022.URLhttps://arxiv.org/abs/2207.05221.
Kaur et al. [2024]Ramneet Kaur, Colin Samplawski, Adam D. Cobb, Anirban Roy, Brian Matejek, Manoj Acharya, Daniel Elenius, Alexander Michael Berenbeim, John A. Pavlik, Nathaniel D. Bastian, and Susmit Jha.Addressing uncertainty in LLMs to enhance reliability in generative AI.InNeurips Safe Generative AI Workshop 2024, 2024.URLhttps://openreview.net/forum?id=Z3DS4Pcxct.
Kim et al. [2024]Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan.“i’m not sure, but…”: Examining the impact of large language models’ uncertainty expression on user reliance and trust.InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 822–835, New York, NY, USA, 2024. Association for Computing Machinery.ISBN 9798400704505.doi:10.1145/3630106.3658941.URLhttps://doi.org/10.1145/3630106.3658941.
Kuhn et al. [2023]Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.InThe Eleventh International Conference on Learning Representations, 2023.URLhttps://openreview.net/forum?id=VD-AYtP0dve.
Kwiatkowski et al. [2019]Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov.Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:452–466, 2019.doi:10.1162/tacl_a_00276.URLhttps://aclanthology.org/Q19-1026/.
Lambert [2025]Nathan Lambert.Reinforcement learning from human feedback.arXiv preprint arXiv:2504.12501, 2025.
Lauwereyns [2002]Shizuka Lauwereyns.Hedges in japanese conversation: The influence of age, sex, and formality.Language Variation and Change, 14(2):239–259, 2002.doi:10.1017/S0954394502142049.
Leng et al. [2024]Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang.Taming overconfidence in llms: Reward calibration in rlhf.arXiv preprint arXiv:2410.09724, 2024.
Li et al. [2025a]Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, and Minlie Huang.LegalAgentBench: Evaluating LLM agents in legal domain.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2322–2344, Vienna, Austria, July 2025a. Association for Computational Linguistics.ISBN 979-8-89176-251-0.doi:10.18653/v1/2025.acl-long.116.URLhttps://aclanthology.org/2025.acl-long.116/.
Li et al. [2023]Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen.Halueval: A large-scale hallucination evaluation benchmark for large language models, 2023.URLhttps://arxiv.org/abs/2305.11747.
Li et al. [2026]Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, and Ivan Oseledets.Confidence is all you need: Few-shot RL fine-tuning of language models, 2026.URLhttps://openreview.net/forum?id=G8xyzI2eQb.
Li et al. [2025b]Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, and Anurag Beniwal.Semantic volume: Quantifying and detecting both external and internal uncertainty in LLMs.InNeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling, 2025b.URLhttps://openreview.net/forum?id=4ZfkoukhQ4.
Li et al. [2025c] Yibo Li, Miao Xiong, Jiaying Wu, and Bryan Hooi. Conftuner: Training large language models to express their confidence verbally. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025c. URL https://openreview.net/forum?id=VZQ04Ojhu5.
Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=8s8K2UZGTZ.
Liu and Cohan [2026] Gabrielle Kaili-May Liu and Arman Cohan. Can llms use linguistic uncertainty markers to reliably reflect intrinsic confidence? arXiv preprint arXiv:2605.28778, 2026.
Liu et al. [2025a] Gabrielle Kaili-May Liu, Gal Yona, Avi Caciularu, Idan Szpektor, Tim G. J. Rudner, and Arman Cohan. MetaFaith: Faithful natural language uncertainty expression in LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29612–29656, Suzhou, China, November 2025a. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi:10.18653/v1/2025.emnlp-main.1505. URL https://aclanthology.org/2025.emnlp-main.1505/.
Liu et al. [2025b] Haotian Liu, Shuo Wang, and Hongteng Xu. C2gspg: Confidence-calibrated group sequence policy gradient towards self-aware reasoning. arXiv preprint arXiv:2509.23129, 2025b.
Liu et al. [2025c] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025c.
Mallen et al. [2022] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint, 2022.
Manakul et al. [2023] Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.emnlp-main.557. URL https://aclanthology.org/2023.emnlp-main.557/.
Meister et al. [2022] Clara Meister, Gian Wiher, Tiago Pimentel, and Ryan Cotterell. On the probability–quality paradox in language generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 36–45, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-short.5. URL https://aclanthology.org/2022.acl-short.5/.
Mielke et al. [2022] Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022. doi:10.1162/tacl_a_00494. URL https://aclanthology.org/2022.tacl-1.50/.
Mur-Dueñas [2021] Pilar Mur-Dueñas. There may be differences: Analysing the use of hedges in english and spanish research articles. Lingua, 260:103131, 2021. ISSN 0024-3841. doi:10.1016/j.lingua.2021.103131. URL https://www.sciencedirect.com/science/article/pii/S0024384121001030.
Nguyen Thi Thuy [2018] Thu Nguyen Thi Thuy. A corpus-based study on cross-cultural divergence in the use of hedges in academic research articles written by vietnamese and native english-speaking authors. Social Sciences, 7(4), 2018. ISSN 2076-0760. doi:10.3390/socsci7040070. URL https://www.mdpi.com/2076-0760/7/4/70.
Nikitin et al. [2024] Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities, 2024. URL https://arxiv.org/abs/2405.20003.
Nixon et al. [2019] Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR workshops, volume 2, 2019.
Oh [2025] Nick Oh. Before you <think>, monitor: Implementing flavell’s metacognitive framework in llms. arXiv preprint arXiv:2510.16374, 2025.
Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.
Pourreza et al. [2025] Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, and Sercan O Arik. Reasoning-SQL: Reinforcement learning with SQL tailored partial rewards for reasoning-enhanced text-to-SQL. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=HbwkIDWQgN.
Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.
Rivera et al. [2024] Mauricio Rivera, Jean-François Godbout, Reihaneh Rabbany, and Kellin Pelrine. Combining confidence elicitation and sample-based methods for uncertainty quantification in misinformation mitigation. In Raúl Vázquez, Hande Celikkanat, Dennis Ulmer, Jörg Tiedemann, Swabha Swayamdipta, Wilker Aziz, Barbara Plank, Joris Baan, and Marie-Catherine de Marneffe, editors, Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 114–126, St Julians, Malta, March 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.uncertainlp-1.12. URL https://aclanthology.org/2024.uncertainlp-1.12/.
[78] Ruixin Sha, Conghui Sun, Chunliang Yang, Liang Luo, and Xiao Hu. A trade-off between reasoning ability and metacognitive sensitivity in large language models.
Shao et al. [2024] Zhihong Shao, Peiyi Wang, Runxin Xu, Qihao Zhu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
Shen et al. [2024] Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models, 2024. URL https://arxiv.org/abs/2403.08819.
Shrivastava et al. [2023] Vaishnavi Shrivastava, Percy Liang, and Ananya Kumar. Llamas know what gpts don’t show: Surrogate models for confidence estimation, 2023. URL https://arxiv.org/abs/2311.08877.
Si et al. [2023] Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Lee Boyd-Graber, and Lijuan Wang. Prompting GPT-3 to be reliable. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=98p5x51L5af.
Simhi et al. [2025] Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, I’m wrong: LLMs hallucinate with certainty despite knowing the answer. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14665–14688, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi:10.18653/v1/2025.findings-emnlp.792. URL https://aclanthology.org/2025.findings-emnlp.792/.
Singh et al. [2025]Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Alexey Ivanov, Alexi Christakis, Alistair Gillespie, Allison Tam, Ally Bennett, Alvin Wan, Alyssa Huang, Amy McDonald Sandjideh, Amy Yang, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrei Gheorghe, Andres Garcia Garcia, Andrew Braunstein, Andrew Liu, Andrew Schmidt, Andrey Mereskin, Andrey Mishchenko, Andy Applebaum, Andy Rogerson, Ann Rajan, Annie Wei, Anoop Kotha, Anubha Srivastava, Anushree Agrawal, Arun Vijayvergiya, Ashley Tyra, Ashvin Nair, Avi Nayak, Ben Eggers, Bessie Ji, Beth Hoover, Bill Chen, Blair Chen, Boaz Barak, Borys Minaiev, Botao Hao, Bowen Baker, Brad Lightcap, Brandon McKinzie, Brandon Wang, Brendan Quinn, Brian Fioca, Brian Hsu, Brian Yang, Brian Yu, Brian Zhang, Brittany Brenner, Callie Riggins Zetino, Cameron Raymond, Camillo Lugaresi, Carolina Paz, Cary Hudson, Cedric Whitney, Chak Li, Charles Chen, Charlotte Cole, Chelsea Voss, Chen Ding, Chen Shen, Chengdu Huang, Chris Colby, Chris Hallacy, Chris Koch, Chris Lu, Christina Kaplan, Christina Kim, CJ Minott-Henriques, Cliff Frey, Cody Yu, Coley Czarnecki, Colin Reid, Colin Wei, Cory Decareaux, Cristina Scheau, Cyril Zhang, Cyrus Forbes, Da Tang, Dakota Goldberg, Dan Roberts, Dana Palmie, Daniel Kappler, Daniel Levine, Daniel Wright, Dave Leo, David Lin, David Robinson, Declan Grabb, Derek Chen, Derek Lim, Derek Salama, Dibya Bhattacharjee, Dimitris Tsipras, Dinghua Li, Dingli Yu, DJ Strouse, Drew Williams, Dylan Hunn, Ed Bayes, Edwin Arbus, Ekin Akyurek, Elaine Ya Le, Elana Widmann, Eli Yani, Elizabeth Proehl, Enis Sert, Enoch Cheung, Eri Schwartz, Eric Han, Eric Jiang, Eric Mitchell, Eric Sigler, Eric Wallace, Erik Ritter, 
Erin Kavanaugh, Evan Mays, Evgenii Nikishin, Fangyuan Li, Felipe Petroski Such, Filipe de Avila Belbute Peres, Filippo Raso, Florent Bekerman, Foivos Tsimpourlas, Fotis Chantzis, Francis Song, Francis Zhang, Gaby Raila, Garrett McGrath, Gary Briggs, Gary Yang, Giambattista Parascandolo, Gildas Chabot, Grace Kim, Grace Zhao, Gregory Valiant, Guillaume Leclerc, Hadi Salman, Hanson Wang, Hao Sheng, Haoming Jiang, Haoyu Wang, Haozhun Jin, Harshit Sikchi, Heather Schmidt, Henry Aspegren, Honglin Chen, Huida Qiu, Hunter Lightman, Ian Covert, Ian Kivlichan, Ian Silber, Ian Sohl, Ibrahim Hammoud, Ignasi Clavera, Ikai Lan, Ilge Akkaya, Ilya Kostrikov, Irina Kofman, Isak Etinger, Ishaan Singal, Jackie Hehir, Jacob Huh, Jacqueline Pan, Jake Wilczynski, Jakub Pachocki, James Lee, James Quinn, Jamie Kiros, Janvi Kalra, Jasmyn Samaroo, Jason Wang, Jason Wolfe, Jay Chen, Jay Wang, Jean Harb, Jeffrey Han, Jeffrey Wang, Jennifer Zhao, Jeremy Chen, Jerene Yang, Jerry Tworek, Jesse Chand, Jessica Landon, Jessica Liang, Ji Lin, Jiancheng Liu, Jianfeng Wang, Jie Tang, Jihan Yin, Joanne Jang, Joel Morris, Joey Flynn, Johannes Ferstad, Johannes Heidecke, John Fishbein, John Hallman, Jonah Grant, Jonathan Chien, Jonathan Gordon, Jongsoo Park, Jordan Liss, Jos Kraaijeveld, Joseph Guay, Joseph Mo, Josh Lawson, Josh McGrath, Joshua Vendrow, Joy Jiao, Julian Lee, Julie Steele, Julie Wang, Junhua Mao, Kai Chen, Kai Hayashi, Kai Xiao, Kamyar Salahi, Kan Wu, Karan Sekhri, Karan Sharma, Karan Singhal, Karen Li, Kenny Nguyen, Keren Gu-Lemberg, Kevin King, Kevin Liu, Kevin Stone, Kevin Yu, Kristen Ying, Kristian Georgiev, Kristie Lim, Kushal Tirumala, Kyle Miller, Lama Ahmad, Larry Lv, Laura Clare, Laurance Fauconnet, Lauren Itow, Lauren Yang, Laurentia Romaniuk, Leah Anise, Lee Byron, Leher Pathak, Leon Maksin, Leyan Lo, Leyton Ho, Li Jing, Liang Wu, Liang Xiong, Lien Mamitsuka, Lin Yang, Lindsay McCallum, Lindsey Held, Liz Bourgeois, Logan Engstrom, Lorenz Kuhn, Louis Feuvrier, Lu Zhang, Lucas 
Switzer, Lukas Kondraciuk, Lukasz Kaiser, Manas Joglekar, Mandeep Singh, Mandip Shah, Manuka Stratta, Marcus Williams, Mark Chen, Mark Sun, Marselus Cayton, Martin Li, Marvin Zhang, Marwan Aljubeh, Matt Nichols, Matthew Haines, Max Schwarzer, Mayank Gupta, Meghan Shah, Melody Huang, Meng Dong, Mengqing Wang, Mia Glaese, Micah Carroll, Michael Lampe, Michael Malek, Michael Sharman, Michael Zhang, Michele Wang, Michelle Pokrass, Mihai Florian, Mikhail Pavlov, Miles Wang, Ming Chen, Mingxuan Wang, Minnia Feng, Mo Bavarian, Molly Lin, Moose Abdool, Mostafa Rohaninejad, Nacho Soto, Natalie Staudacher, Natan LaFontaine, Nathan Marwell, Nelson Liu, Nick Preston, Nick Turley, Nicklas Ansman, Nicole Blades, Nikil Pancha, Nikita Mikhaylin, Niko Felix, Nikunj Handa, Nishant Rai, Nitish Keskar, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Oona Gleeson, Pamela Mishkin, Patryk Lesiewicz, Paul Baltescu, Pavel Belov, Peter Zhokhov, Philip Pronin, Phillip Guo, Phoebe Thacker, Qi Liu, Qiming Yuan, Qinghua Liu, Rachel Dias, Rachel Puckett, Rahul Arora, Ravi Teja Mullapudi, Raz Gaon, Reah Miyara, Rennie Song, Rishabh Aggarwal, RJ Marsan, Robel Yemiru, Robert Xiong, Rohan Kshirsagar, Rohan Nuttall, Roman Tsiupa, Ronen Eldan, Rose Wang, Roshan James, Roy Ziv, Rui Shu, Ruslan Nigmatullin, Saachi Jain, Saam Talaie, Sam Altman, Sam Arnesen, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Sarah Yoo, Savannah Heon, Scott Ethersmith, Sean Grove, Sean Taylor, Sebastien Bubeck, Sever Banesiu, Shaokyi Amdo, Shengjia Zhao, Sherwin Wu, Shibani Santurkar, Shiyu Zhao, Shraman Ray Chaudhuri, Shreyas Krishnaswamy, Shuaiqi, Xia, Shuyang Cheng, Shyamal Anadkat, Simón Posada Fishman, Simon Tobin, Siyuan Fu, Somay Jain, Song Mei, Sonya Egoian, Spencer Kim, Spug Golden, SQ Mah, Steph Lin, Stephen Imm, Steve Sharpe, Steve Yadlowsky, Sulman Choudhry, Sungwon Eum, Suvansh Sanjeev, Tabarak Khan, Tal Stramer, Tao Wang, Tao Xin, Tarun Gogineni, Taya Christianson, Ted Sanders, 
Tejal Patwardhan, Thomas Degry, Thomas Shadwell, Tianfu Fu, Tianshi Gao, Timur Garipov, Tina Sriskandarajah, Toki Sherbakov, Tomer Kaftan, Tomo Hiratsuka, Tongzhou Wang, Tony Song, Tony Zhao, Troy Peterson, Val Kharitonov, Victoria Chernova, Vineet Kosaraju, Vishal Kuo, Vitchyr Pong, Vivek Verma, Vlad Petrov, Wanning Jiang, Weixing Zhang, Wenda Zhou, Wenlei Xie, Wenting Zhan, Wes McCabe, Will DePue, Will Ellsworth, Wulfie Bain, Wyatt Thompson, Xiangning Chen, Xiangyu Qi, Xin Xiang, Xinwei Shi, Yann Dubois, Yaodong Yu, Yara Khakbaz, Yifan Wu, Yilei Qian, Yin Tat Lee, Yinbo Chen, Yizhen Zhang, Yizhong Xiong, Yonglong Tian, Young Cha, Yu Bai, Yu Yang, Yuan Yuan, Yuanzhi Li, Yufeng Zhang, Yuguang Yang, Yujia Jin, Yun Jiang, Yunyun Wang, Yushi Wang, Yutian Liu, Zach Stubenvoll, Zehao Dou, Zheng Wu, and Zhigang Wang.Openai gpt-5 system card, 2025.URLhttps://arxiv.org/abs/2601.03267.
Song et al. [2025] Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takrim Khan, Mahyar Rajabi-Kochi, Samantha Paradi-Maropakis, Tony Baltoiu, Fengyu Xie, Tianyang Chen, Kexin Huang, Weiliang Luo, Meijing Fang, Xin Yang, Lixue Cheng, Jiajun He, Soha Hassoun, Xiangliang Zhang, Wei Wang, Chandan K. Reddy, Chao Zhang, Zhiling Zheng, Mengdi Wang, Le Cong, Carla P. Gomes, Chang-Yu Hsieh, Aditya Nandy, Philippe Schwaller, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, and Chenru Duan. Evaluating large language models in scientific discovery, 2025. URL https://arxiv.org/abs/2512.15567.
Stangel et al. [2025] Paul Stangel, David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Kamilia Zaripova, Matthias Keicher, and Nassir Navab. Rewarding doubt: A reinforcement learning approach to confidence calibration of large language models. CoRR, abs/2503.02623, March 2025. URL https://doi.org/10.48550/arXiv.2503.02623.
Stengel-Eskin et al. [2024] Elias Stengel-Eskin, Peter Hase, and Mohit Bansal. LACIE: Listener-aware finetuning for calibration in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=RnvgYd9RAh.
Steyvers and Peters [2025] Mark Steyvers and Megan AK Peters. Metacognition and uncertainty communication in humans and large language models. Current Directions in Psychological Science, page 09637214251391158, 2025.
Steyvers et al. [2025] Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W. Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, January 2025. ISSN 2522-5839. doi:10.1038/s42256-024-00976-7. URL http://dx.doi.org/10.1038/s42256-024-00976-7.
Sun et al. [2024] Yuhong Sun, Zhangyue Yin, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Hui Zhao. Benchmarking hallucination in large language models based on unanswerable math word problem, 2024. URL https://arxiv.org/abs/2403.03558.
Tang et al. [2024] Zhisheng Tang, Ke Shen, and Mayank Kejriwal. An evaluation of estimative uncertainty in large language models, 2024. URL https://arxiv.org/abs/2405.15185.
Tao et al. [2025] Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H. S. Torr, and Chang Xu. Can large language models express uncertainty like human?, 2025. URL https://arxiv.org/abs/2509.24202.
Tian et al. [2023] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5433–5442, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.emnlp-main.330. URL https://aclanthology.org/2023.emnlp-main.330/.
Tian et al. [2024] Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K.
Toy et al. [2024] Jason Toy, Josh MacAdam, and Phil Tabor. Metacognition is all you need? using introspection in generative agents to improve goal-directed behavior, 2024. URL https://arxiv.org/abs/2401.10910.
Trella et al. [2023] Anna L Trella, Kelly W Zhang, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi-Velez, and Susan A Murphy. Reward design for an online reinforcement learning algorithm supporting oral self-care. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15724–15730, 2023.
van Niekerk et al. [2025] Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, and Milica Gašić. Post-training large language models via reinforcement learning from self-feedback. arXiv preprint arXiv:2507.21931, 2025.
Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi:10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446/.
Wang et al. [2019] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.
Wang et al. [2025] Peiqi Wang, Barbara D. Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M Wells, Tina Kapur, and Polina Golland. Calibrating expressions of certainty. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=dNunnVB4W6.
Wang and Zhao [2024] Yuqing Wang and Yun Zhao. Metacognitive prompting improves understanding in large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1914–1926, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.naacl-long.106. URL https://aclanthology.org/2024.naacl-long.106/.
Wei et al. [2024] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024. URL https://arxiv.org/abs/2411.04368.
Xia et al. [2025] Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models, 2025. URL https://arxiv.org/abs/2503.00172.
Xie et al. [2026] Can Xie, Ruotong Pan, Xiangyu Wu, Zhang Yunfei, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in RLVR: Uncertainty-aware advantage shaping for deeper reasoning. In Maria Liakata, Viviane P. Moreira, Jiajun Zhang, and David Jurgens, editors, Findings of the Association for Computational Linguistics: ACL 2026, pages 19057–19076, San Diego, California, United States, July 2026. Association for Computational Linguistics. ISBN 979-8-89176-395-1. URL https://aclanthology.org/2026.findings-acl.951/.
Xiong et al. [2024] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjeQKFxFpZ.
Xu et al. [2024] Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, and Jing Gao. SaySelf: Teaching LLMs to express confidence with self-reflective rationales. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5985–5998, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.343. URL https://aclanthology.org/2024.emnlp-main.343/.
Yadkori et al. [2024] Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm, 2024. URL https://arxiv.org/abs/2406.02543.
Yagız and Demir [2014] Oktay Yagız and Cuneyt Demir. Hedging strategies in academic discourse: A comparative analysis of turkish writers and native writers of english. Procedia - Social and Behavioral Sciences, 158:260–268, 2014. ISSN 1877-0428. doi:10.1016/j.sbspro.2014.12.085. URL https://www.sciencedirect.com/science/article/pii/S1877042814061783. 14th Language, Literature and Stylistics Symposium.
Yang et al. [2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
Yang et al. [2024a] Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Yamada. On verbalized confidence scores for llms, 2024a. URL https://arxiv.org/abs/2412.14737.
Yang et al. [2024b] Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. Alignment for honesty, 2024b. URL https://arxiv.org/abs/2312.07000.
Yin et al. [2023] Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don’t know? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.findings-acl.551. URL https://aclanthology.org/2023.findings-acl.551.
Yona et al. [2024] Gal Yona, Roee Aharoni, and Mor Geva. Can large language models faithfully express their intrinsic uncertainty in words? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7752–7764, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.443. URL https://aclanthology.org/2024.emnlp-main.443/.
Zhang et al. [2024a] Caiqi Zhang, Fangyu Liu, Marco Basaldella, and Nigel Collier. LUQ: Long-text uncertainty quantification for LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5244–5262, Miami, Florida, USA, November 2024a. Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.299. URL https://aclanthology.org/2024.emnlp-main.299/.
Zhang et al. [2025a] Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, and Nigel Collier. Atomic calibration of LLMs in long-form generations. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors, Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 148–169, Mumbai, India, December 2025a. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics. ISBN 979-8-89176-303-6. doi:10.18653/v1/2025.findings-ijcnlp.9. URL https://aclanthology.org/2025.findings-ijcnlp.9/.
Zhang et al. [2025b] Caiqi Zhang, Xiaochen Zhu, Chengzu Li, Nigel Collier, and Andreas Vlachos. Reinforcement learning for better verbalized confidence in long-form generation. arXiv preprint arXiv:2505.23912, 2025b.
Zhang et al. [2024b] Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘i don’t know’, 2024b. URL https://arxiv.org/abs/2311.09677.
Zhang and Zuo [2025] Jixiao Zhang and Chunsheng Zuo. GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5642–5654, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi:10.18653/v1/2025.emnlp-main.287. URL https://aclanthology.org/2025.emnlp-main.287/.
Zhang et al. [2026] Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, and Dacheng Tao. A simple “motivation” can enhance reinforcement finetuning of large reasoning models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3owSlsYDQf.
Zhang et al. [2024c] Min Zhang, Jianfeng He, Taoran Ji, and Chang-Tien Lu. Don’t go to extremes: Revealing the excessive sensitivity and calibration limitations of LLMs in implicit hate speech detection. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12073–12086, Bangkok, Thailand, August 2024c. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.652. URL https://aclanthology.org/2024.acl-long.652/.
Zhang et al. [2025c] Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised LLM reasoning incentivization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025c. URL https://openreview.net/forum?id=k8Mim6RI5O.
Zhang et al. [2025d] Yanbo Zhang, Sumeer A. Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, Saso Dzeroski, Jesper Tegner, and Hector Zenil. Advancing the scientific method with large language models: From hypothesis to discovery, 2025d. URL https://arxiv.org/abs/2505.16477.
Zhang et al. [2025e] Yanzhi Zhang, Zhaoxi Zhang, Haoxiang Guan, Yilin Cheng, Yitong Duan, Chen Wang, Yue Wang, Shuxin Zheng, and Jiyan He. No free lunch: Rethinking internal feedback for llm reasoning. CoRR, abs/2506.17219, June 2025e. URL https://doi.org/10.48550/arXiv.2506.17219.
Zhang et al. [2020] Yunfeng Zhang, Q. Vera Liao, and Rachel K. E. Bellamy. Effect of confidence and explanation on accuracy and trust calibration in ai-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, pages 295–305. ACM, January 2020. doi:10.1145/3351095.3372852. URL http://dx.doi.org/10.1145/3351095.3372852.
Zhao et al. [2026a] Muyang Zhao, Qi Qi, and Hao Sun. Roi-reasoning: Rational optimization for inference via pre-computation meta-cognition. arXiv preprint arXiv:2601.03822, 2026a.
Zhao et al. [2024] Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Tongshuang Wu, and Jianshu Chen. Fact-and-reflection (FaR) improves confidence calibration of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 8702–8718, Bangkok, Thailand, August 2024. doi:10.18653/v1/2024.findings-acl.515.
Zhao et al. [2026b] Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. In The Fourteenth International Conference on Learning Representations, 2026b. URL https://openreview.net/forum?id=OU9nFEYR2M.
Zhou et al. [2024a] Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, and Maarten Sap. Relying on the unreliable: The impact of language models’ reluctance to express uncertainty. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3623–3643, Bangkok, Thailand, August 2024a. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.198. URL https://aclanthology.org/2024.acl-long.198/.
Zhou et al. [2025a] Kaitlyn Zhou, Jena D Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, and Maarten Sap. Rel-ai: An interaction-centered approach to measuring human-lm reliance. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11148–11167, 2025a.
Zhou et al. [2025b] Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Yi Fang, Sirui Ding, Jiashuo Wang, Kaishuai Xu, et al. Large language models for disease diagnosis: A scoping review. npj Artificial Intelligence, 1(1):9, 2025b.
Zhou et al. [2024b] Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. Metacognitive retrieval-augmented large language models, 2024b. URL https://arxiv.org/abs/2402.11626.

Appendix A Additional Related Work

Confidence Calibration of LLMs.

Confidence calibration [33] is a critical aspect of building trustworthy and reliable AI systems [17,82]. Existing methods primarily consider calibration from a factual perspective, aiming to align confidence estimates with externally judged accuracy [38,103]. In contrast, our work targets the faithful expression of uncertainty in model outputs through both numerical and linguistic means, with the aim of enabling richer and more complex patterns of uncertainty expression.
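To make the factual notion of calibration concrete, the following is a minimal sketch of an equal-width binned expected calibration error, which compares stated confidence with observed accuracy. It illustrates the externally judged perspective described above rather than the metric used in this work; the function name, the number of bins, and the binning rule are assumptions chosen for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: 10 equal-width bins over [0, 1]
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to a bin; confidence 1.0 goes to the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four answers stated at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

A score of zero under this measure says only that confidence tracks accuracy on average within each bin; it does not say whether the stated confidence reflects the model's intrinsic uncertainty, which is the faithfulness question targeted here.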

Numerical Confidence for LLMs.

A plethora of approaches exist to numerically assess the confidence of LLMs [38,103]. Such methods are broadly classified as either white-box or black-box, depending on whether access to model weights is required, and they aim to calibrate confidence estimates in a factuality-aligned way to reflect expected accuracy. White-box methods measure confidence by directly leveraging internal states of LLMs, including use of token probabilities [47,19,39], probes over internal representations [2,8,42], or training over uncertainty-augmented data [60,10,87,117], among other strategies. In contrast, black-box methods require access only to model outputs and assess confidence via the consistency of sampled responses [66,4,12,48,105], semantic variability [67,50,30,71,58], direct prompting of LLMs to verbalize confidence scores [43,93,37,107,110,126], or use of auxiliary predictive models [81,80]. While such methods are highly effective, they do not address the incorporation of linguistic uncertainty into model outputs, which is a core part of human uncertainty communication that requires significantly more expressivity and is highly variable across contexts and cultures [49,129].

Linguistic Confidence for LLMs.

Since natural language is the primary interface for human-LLM interactions, a growing body of work has explored direct integration of linguistic uncertainty markers (e.g., “I am fairly certain that…”) into LLM outputs [3,91,105,111,44,92,100]. Common approaches include prompting or training [60,68,3,106] models to self-verbalize their confidence level in words, mapping numerical confidence scores to corresponding uncertainty phrases (e.g., “high confidence”), or adopting a combination of these two strategies [11]. As with numerical confidence methods, such methods overlook the alignment between verbalized and intrinsic confidence. Moreover, they face considerable practical limitations including oversimplification. For example, Mielke et al. [68] utilize a limited scoring scale to measure confidence and assertiveness, while Lin et al. [60] depend on computationally expensive, domain-specific training and do not enable zero-shot confidence verbalization. Zhou et al. [128] similarly simplify the space of linguistic markers used, failing to account for the plurality of human linguistic uncertainty expressions. Zhang et al. [120] further reveal that self-verbalized linguistic confidence tends to concentrate in restricted ranges, leading to constrained efficacy.

Appendix B Experimental Details

B.1 Model & Training Details

In our experiments, we used the following models, varying in size, family, and post-training: Qwen3 (1.7B, 4B, 8B) [109] and Llama3.1-Instruct (8B) [29]. We applied our proposed framework to these models, as well as the baseline faithful calibration methods MetaFaith [62] and FUT [20]. To provide additional competitive baselines, we evaluated the faithful calibration of leading proprietary LLMs Gemini-3.1-Pro [28], Gemini-3-Flash [27], and GPT-5 [84]. We increased the strength of these baselines by specifically evaluating each model using a metacognitive system prompt [62], shown to generalizably improve faithful calibration across LLMs (see §B.3 for the exact prompt). Gemini and GPT models were accessed via the Gemini Developer API and OpenAI API, respectively, while open-source model experiments were run on a local server using a combination of A6000 48GB, A100 80GB, and H100 80GB GPUs.

Inference Setup.

For inference-time evaluations, we followed prior work [62] to set the maximum output length to 256 tokens for all models to balance answer completeness and succinctness, and used a temperature of 1.0 unless otherwise specified. We did not use thinking mode for any model as our focus is on faithfulness of models’ expressed uncertainty when conveying their answer to a query, and not on task or reasoning performance. Where reasoning could not be disabled (e.g., Gemini-3.1-Pro), we used the minimum admissible thinking level and set the maximum output length to up to 512 tokens to ensure responses were not prematurely truncated. For Qwen3 models, we used the developer-recommended inference hyperparameters. For Qwen3-4B, we specifically used the Qwen/Qwen3-4B-Instruct-2507 version, as it is designated as an updated, stronger edition of the model. During the rewriting step of our pipeline (§3.4), we used a temperature of 0.5 for Gemini-2.5-Flash-Lite to balance generation quality with relevance, along with the stop sequence “####” and at most 2000 output tokens to ensure no premature truncation. For comparison to GPT-5-Mini as the rewriting model (§C.4.3), we used the same token limit and temperature with minimal reasoning, but no stop sequence as this functionality is discontinued for the model. Inference prompts are discussed in §B.3.

GRPO Training Setup.

For GRPO, we employed the trl library (footnote 10) with HuggingFace. We used the GRPOTrainer, modified to implement RLMF via direct modification to the advantage computation (footnote 11). All models were fine-tuned using LoRA instead of full fine-tuning, as it is more computationally efficient and our goal is to simply alter the way models format their outputs and express uncertainty, rather than to fundamentally change models’ downstream task capabilities. Adapters were applied to every linear layer across all query-, key-, value-, output-, gate-, up-, and down-projection matrices with rank 64, alpha 64, and dropout 0.05.

Footnote 10: https://github.com/huggingface/trl
Footnote 11: RLMF implementation details are available in our code at https://anonymous.4open.science/r/RLMF_anon.

Each model was trained on 4 GPUs, with 2 additional GPUs serving (a) the same model for completion sampling during GRPO and (b) the judge model for sampling-based intrinsic confidence estimation (details in §B.4), yielding 6 GPUs total per RLMF experiment. GRPO completions were sampled using vllm_mode=server, use_vllm=True, vllm_gpu_memory_utilization=0.5, and vllm_tensor_parallel_size=1. We used the developer-recommended inference hyperparameters when sampling completions for Qwen3 models (in addition to setting enable_thinking to False), and min_p of 0.1 for Llama3.1. The random seed was set to 42, max_prompt_length was set to one plus the maximum tokenized prompt length in the training dataset, and max_completion_length was set to up to 512, observed in prior work [62] to be sufficient for all datasets used in our study.

For RLMF, we set $\tau=0.10$ after preliminary comparisons against $\tau=0.05$ (details in §C.2.4); use of higher $\tau$ was ruled out as it is inconsistent with our goal of improving model performance and metacognitive awareness by prioritizing minimized differences between models’ actual and self-predicted FC performance. When obtaining models’ self-predictions of FC performance per completion during RLMF, we set the maximum number of output tokens to 3, sufficient to accommodate the output which is a single number.

All training experiments used BF16 precision, weight decay 0.0, warmup ratio 0.1, maximum gradient norm 0.1, a cosine (footnote 12) learning rate schedule, and AdamW optimization with default optimizer hyperparameters. The KL divergence coefficient $\beta$ was set to 0.1, as preliminary experiments found values of 0.2, 0.3, and 0.6 yielded similar or worse performance. We used $G=32$ sampled completions per prompt during GRPO since early experiments with $G=16$ yielded worse downstream performance and less stable estimation of intrinsic confidence (footnote 13). Per-device training batch size was set to 2, with gradient accumulation of 8 across 4 GPUs, yielding a total effective batch size of 64. We found this to work well across all models after performing hyperparameter search over $\{16,32,64\}$ on Llama3.1-8B-Instruct initially. We did not use gradient checkpointing. Learning rates were searched over 5e-6, 1e-5, 2e-5 per model, and we report results for the best configuration per experimental setting. Any parameters not specified here were set to their default values as specified by the trl library.

Footnote 12: Early experiments showed a cosine schedule was superior to linear or constant schedules.
Footnote 13: Recall from §3.1 that we use the completions sampled during GRPO in a dual-purpose fashion, both to provide a signal for preference optimization and since this naturally extends the sampling-based paradigm for intrinsic confidence estimation used in prior work on faithful calibration.
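The batch-size arithmetic above can be checked with a short sketch. The variable names are illustrative and not taken from the authors' code, and the final quantity rests on an assumption: that the effective batch is counted in completions, so that each optimizer step covers the effective batch size divided by $G$ distinct prompts.

```python
# Values quoted in the GRPO training setup.
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_training_gpus = 4
num_generations = 32  # G, completions sampled per prompt

# Effective batch size = per-device batch x accumulation steps x GPUs.
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_training_gpus
)

# Assumption (not stated in the text): the batch is counted in completions,
# so one optimizer step spans effective_batch_size / G distinct prompts.
prompts_per_step = effective_batch_size // num_generations

print(effective_batch_size, prompts_per_step)
```

Under that assumption, the effective batch must be a multiple of $G$ so that every prompt contributes a full group of completions to the advantage computation.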

Unless otherwise specified, training occurred for 1500 steps for all models, using $N_{\text{train}}^{\texttt{RLMF}}=2000$ training samples disjoint from the pre-SFT data. We also investigated use of 1000 or 4000 training samples (§C.1.5), using the same number of epochs as the 2000-sample setting. In all settings, we determined the best checkpoint per model based on faithful calibration performance on the in-domain test set (limited to at most 1000 samples as discussed in §4); checkpoints were evaluated every 100 training steps (or every 75 or 200 steps for the 1000- and 4000-sample settings, respectively).
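To make the schedule concrete, the following is a minimal sketch of a cosine learning-rate schedule over the 1500-step budget with the warmup ratio of 0.1 reported earlier. It assumes linear warmup followed by half-cosine decay to zero, which is a common form of this schedule; the exact scheduler used in the experiments is not specified beyond "cosine". The function name and the peak learning rate of 1e-5 (one of the searched values) are illustrative.

```python
import math

def cosine_lr(step, total_steps=1500, warmup_ratio=0.1, peak_lr=1e-5):
    """Learning rate at `step`: linear warmup, then half-cosine decay to 0.

    Assumed form of the schedule; peak_lr is one of the searched values.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Steps at which checkpoints are evaluated in the 2000-sample setting.
checkpoint_steps = list(range(100, 1501, 100))

for s in (0, 75, 150, 825, 1500):
    print(s, cosine_lr(s))
```

With these settings the warmup phase lasts 150 steps, so the first evaluated checkpoint (step 100) falls before the peak learning rate is reached.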

Pre-SFT Setup.

As mentioned in §4, prior to RLMF, models were first pre-fine-tuned via supervised fine-tuning (SFT) to learn our custom output format, generalize it across evaluation tasks, and adhere to task-specific output length constraints. We obtained training data for pre-SFT as follows. For a given model $M$, we first sampled 21 responses per input for 200 samples per evaluation task (see task details in §B.2). This used the same inference hyperparameters discussed in §B.1; the prompt is discussed in §B.3. For each sample, the first response served as the official output (yielding $200\times 10=2000$ samples), and the remaining 20 were used to estimate intrinsic confidence per sentence in the first response, following the procedure described in §B.4. Official outputs were then transformed into our custom output format by using nltk [6] (footnote 14) to parse outputs into sentences, enclosing each in <sentence> </sentence> tags, and appending the gold confidence as a two-digit decimal within <confidence> </confidence> tags per sentence. We randomly held out 10% of the 2000 samples as a validation set, leaving 1800 for training.

Footnote 14: We use nltk as it offers a widely-used and reliable sentence tokenizer.
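As an illustration of the custom output format, the sketch below wraps already-split sentences in the sentence and confidence tags and parses them back. It is a minimal sketch rather than the authors' implementation: sentence splitting (done with nltk in the paper) is assumed to have happened upstream, the function names are hypothetical, and joining pairs with a single space is an assumption.

```python
import re

def to_tagged_output(sentences, confidences):
    """Format sentence/confidence pairs in the tagged output format."""
    if len(sentences) != len(confidences):
        raise ValueError("need exactly one confidence per sentence")
    pairs = []
    for sent, conf in zip(sentences, confidences):
        # Confidence is written as a two-digit decimal, e.g. 0.60.
        pairs.append(
            f"<sentence> {sent.strip()} </sentence>"
            f"<confidence> {conf:.2f} </confidence>"
        )
    # Assumption: consecutive pairs are separated by a single space.
    return " ".join(pairs)

_PAIR = re.compile(
    r"<sentence>\s*(.*?)\s*</sentence>\s*"
    r"<confidence>\s*([01](?:\.\d+)?)\s*</confidence>",
    re.DOTALL,
)

def parse_tagged_output(text):
    """Recover (sentence, confidence) pairs from a tagged response."""
    return [(s, float(c)) for s, c in _PAIR.findall(text)]

example = to_tagged_output(
    ["Paris is the capital of France.", "It has about two million residents."],
    [0.95, 0.6],
)
print(example)
print(parse_tagged_output(example))
```

A parser of this kind is also what a sentence-level evaluation needs in order to compare each expressed confidence against the corresponding intrinsic confidence estimate.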

During training, we used the same system prompt as that used during GRPO (see §B.3 for details). User prompts were formatted as described in §B.3, aside from the addition of a single sentence describing the target output length per sample as a number of sentences. We varied the wording of this length direction by randomly sampling from one of the templates in Fig. 7. For templates with either a minimum or maximum sentence count specified, we used the exact observed number of sentences in the output as this number. Otherwise, the associated minimum target sentence count was set to 0 and the maximum to either the observed number of sentences or that number plus one.
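The length-direction sampling just described can be sketched as follows. This is an illustrative reading of the paragraph, not the authors' code: only a subset of the Fig. 7 templates is included, the function name is hypothetical, and choosing uniformly between the observed count and the observed count plus one for the maximum is an assumption.

```python
import random

# A subset of the Fig. 7 templates, grouped by which counts they mention.
MIN_ONLY = [
    "Respond using {min} sentences.",
    "Answer in approximately {min} sentences.",
]
MAX_ONLY = [
    "Respond in at most {max} sentences.",
    "Answer in no more than {max} sentences.",
]
RANGE = [
    "Answer in between {min} and {max} sentences.",
    "Respond in {min}-{max} sentences.",
]

def length_direction(n_sentences, rng):
    """Sample one length-direction sentence for an output of n_sentences."""
    template = rng.choice(MIN_ONLY + MAX_ONLY + RANGE)
    if template in RANGE:
        # Range templates: minimum 0, maximum is the observed count or
        # the observed count plus one (uniform choice is an assumption).
        return template.format(min=0, max=n_sentences + rng.choice([0, 1]))
    # Single-count templates use the exact observed sentence count.
    return template.format(min=n_sentences, max=n_sentences)

rng = random.Random(0)
for _ in range(3):
    print(length_direction(3, rng))
```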

For SFT, we used trl’s SFTTrainer in combination with the widely-used Unsloth package. Models were fine-tuned using Low-Rank Adaptation (LoRA) since it is computationally efficient and we simply aim to alter each model’s output format. Adapters were used for every linear layer (all query-, key-, value-, output-, gate-, up-, and down-projection matrices) with rank 32, alpha 64, and dropout 0.05. Each model was trained on a single GPU. All runs used gradient checkpointing, global batch size 8, weight decay 0.01, cosine learning rate schedule, and AdamW optimization with default hyperparameters. Pre-SFT was run for 5 epochs, with the best checkpoint selected based on validation loss over 200 held-out examples. We used a learning rate of 3e-5 for all models.

Implementation Details for Baselines.

We implemented FUT [20] using the paper’s official codebase, replicating the main experiments to verify reliability and applying the exact hyperparameter search recommended by the paper per model to optimize baseline results. We implemented MetaFaith [62] by using the prompt shown in Fig. 6 (§B.3), adapted from those provided in the paper’s official codebase.

B.2 Datasets

We provide details on the datasets used to benchmark faithful calibration of LLMs. These datasets are identical to those used by Liu et al. [62] and represent a range of content domains and task types. Although a wide range of difficulty levels are represented, since faithful uncertainty expression is especially important in difficult task settings [49], our focus leans toward more challenging datasets for which responses are expected to require expressing uncertainty. All benchmarks are in English.

• PopQA [65] is a knowledge-intensive QA dataset featuring 14,000 entity-centric QA pairs. It is likely to require LLMs to express uncertainty as it includes many tail entities which are difficult for models to capture. Following Yona et al. [113] and Liu et al. [62], we preprocess the data to keep only the ‘director’, ‘screenwriter’, ‘producer’, ‘author’, ‘place of birth’, and ‘occupation’ relations and remove entities less than two characters in length.
• SelfAware [112] is a knowledge-intensive QA task consisting of 2337 answerable and 1032 unanswerable questions posed by human users.
• SimpleQA [102] is a factuality benchmark curated adversarially against GPT-4 responses to ensure a high level of difficulty. It aims to measure LLMs’ ability to answer short, challenging questions.
• HaluEval [56] is a hallucination evaluation benchmark covering QA, summarization, and knowledge-grounded dialogue tasks.
• MMLU [35] is a benchmark designed to assess the knowledge and problem-solving abilities of LLMs across 57 tasks and a wide range of content domains.
• SciQ [45] is a dataset of 13,679 crowdsourced science exam questions spanning physics, biology, chemistry, and other subfields, consisting of multiple-choice questions with 4 answer options each.
• MATH [36] is a collection of 12,500 high school competition math problems. These questions evaluate the mathematical reasoning and problem-solving abilities of LLMs.
• UMWP [90] is a mathematics benchmark comprised of both answerable and unanswerable questions, designed to assess the hallucination detection capabilities of LLMs. It includes 5,200 questions across five math domains.
• ARC-Challenge is the Challenge Set of the AI2 Reasoning Challenge [14], which consists of 2,590 knowledge-intensive science questions. Compared to simple QA, these questions are far more challenging as they require integration of multiple information sources.
• SuperGLUE [99] is a natural language understanding benchmark designed to be more rigorous and challenging than GLUE [98] (footnote 15).

Footnote 15: We sample equally from the ‘boolq’, ‘copa’, ‘wic’, and ‘wsc’ subsets in our experiments.

B.2.1 Dataset Abbreviations

Table 4 shows the dataset name abbreviations used in results tables showing per-dataset cMFG* scores.

Table 4: Dataset name abbreviations used for results tables.

B.3 Prompts

We elicited model responses during pre-SFT, GRPO, and non-rewriting inference experiments using the system prompt shown in Fig. 4 and the task-specific user prompts shown in Fig. 5. The task prompts employ a shared base query format, differentiated for different task types via addition (or non-use) of a brief description of the expected output. For multiple-choice tasks, we used a randomized answer choice ordering. For original models, the MetaFaith [62] and FUT [20] baselines, and proprietary LLMs, we used the metacognitive system prompt shown in Fig. 6 to elicit uncertainty expressions that more faithfully reflect models’ intrinsic uncertainty. The length direction templates used for pre-SFT (described in §B.1) are shown in Fig. 7.

To obtain metacognitive judgments of FC performance during RLMF (corresponding to $F_{\text{pred}}$ in §3.2), we used the system and user prompts shown in Fig. 8. To generate LLM responses for metacognitive data selection (§3.3), we used the system prompt shown in Fig. 9 in combination with the task prompts mentioned previously; to obtain metacognitive judgments of FC performance on these samples, we used the prompt shown in Fig. 10 (footnote 16). To implement response editing in Stage 2 of our framework, we used the prompts shown in Fig. 11 (with the alternate two-step approach discussed in §C.4.2 using the prompts shown in Figs. 12 and 13).

Footnote 16: While we base these judgments on linguistically expressed confidence, in light of the end goal of our framework, use of similar judgments of numerically expressed confidence may be similar in efficacy.

System Prompt for Numerical Uncertainty Expression
You are an agent with high metacognitive sensitivity and self-awareness of your internal confidence and uncertainty. Your goal is to provide accurate, informative, concise answers to user queries while using numerical confidence scores to authentically reflect your internal sense of certainty.
For each sentence in your response, you must enclose it in <sentence></sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.00 to 1.00.
The score indicates how internally certain you are about the content of that specific sentence and must perfectly align with your internal confidence level:
- 0.00-0.30 = very low certainty
- 0.31-0.50 = low certainty
- 0.51-0.70 = moderate certainty
- 0.71-0.90 = high certainty
- 0.91-1.00 = very high certainty
Your metacognitive awareness grants you perfect privileged access to your internal confidence. You should leverage this introspective capability to assess and faithfully translate your certainty for each statement into numerical scores. Ensure that each confidence score matches your internal certainty as closely as possible for that specific claim or statement. Use the FULL range from 0.00 to 1.00 as needed for faithful uncertainty expression.
When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.00 to 1.00. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

Figure 4: System prompt used to elicit numerical-uncertainty-bearing model responses.

Template for Simple QA Tasks
{question}

Template for QA Tasks with Context
{context} {question}

Template for Multiple-Choice Tasks
{question} Choices: {choices_list}

Template for Hallucination Detection Tasks
Question: {question} Response: {response} Does the response contain hallucination?

Figure 5: Task-specific prompts used to elicit model responses across experimental settings.

Metacognitive System Prompt
You are an agent with high metacognitive sensitivity and excellent self-awareness of your internal confidence and uncertainty. Your task is to provide a succinct and accurate answer to the given question. When responding, if you are uncertain about your answer, convey this uncertainty linguistically by precisely hedging your answer.

Figure 6: Metacognitive system prompt adapted from Liu et al. [62], used to elicit model responses which express internal uncertainty in a more human-aligned fashion.

Pre-SFT Length Direction Options
• "Respond using {min} sentences."
• "Provide your answer using {min} sentences."
• "Answer in approximately {min} sentences."
• "Limit your answer to around {min} sentences."
• "Limit your answer to {min} sentences."
• "Make sure your answer is about {min} sentences."
• "Respond in at most {max} sentences."
• "Respond with at most {max} sentences."
• "Respond using at most {max} sentences."
• "Answer in no more than {max} sentences."
• "Answer in less than {max} sentences."
• "Formulate your response using at most {max} sentences."
• "Answer in between {min} and {max} sentences."
• "Respond in {min}-{max} sentences."
• "Give your answer in {min}-{max} sentences."
• "Provide your answer using {min} to {max} sentences."

Figure 7: Prompt templates to specify target output length during pre-SFT (§B.1).

System Prompt Used to Obtain $F_{\text{pred}}$ (§3.2)
You are an agent with high metacognitive sensitivity and excellent self-awareness of your internal confidence and uncertainty.

User Prompt Template Used to Obtain $F_{\text{pred}}$ (§3.2)
Question: {q} Your Answer: {a} Above is a question and your own response to it. On a scale of 0.00-1.00, how confident are you that the expressed confidence of your answer to the question matches your true internal confidence in that answer? Use the FULL range from 0.00 to 1.00 as needed. Respond with a single float between 0.00-1.00 and no other text.

Figure 8: System and user prompts used to obtain metacognitive judgments of FC performance during RLMF.

System Prompt Used to Elicit Responses For Metacognitive Data Selection
You are an agent with high metacognitive sensitivity and excellent self-awareness of your internal confidence and uncertainty. When responding to user requests, if you are uncertain about your answer, convey this uncertainty linguistically by precisely hedging this answer.

Figure 9: System prompt used to obtain model responses to be evaluated during metacognitive data selection.

System Prompt to Rate Samples During Metacognitive Data Selection
You are an agent with high metacognitive sensitivity and excellent self-awareness of your internal confidence and uncertainty.

User Prompt to Rate Samples During Metacognitive Data Selection
Question: {q} Your Answer: {a} Above is a question and your own response to it. On a scale of 0-100, how confident are you that the linguistic decisiveness of your answer above matches your true internal confidence in that answer? Respond with a single integer between 0-100 and no other text.

Figure 10:System and user prompts used to rate model responses during metacognitive data selection.System Prompt for Stage 2 RewritingYou are a precise sentence rewriter. Your task is to rewrite sentences to linguistically convey a specified confidence or uncertainty level, by using one or more hedges from a given list, while perfectly preserving the original sentence’s factual content and meaning and removing any nonsensical text from the sentence, and adhering to a specified target style and user preferences.To do this, for each original sentence, select one or more hedges from the OPTIONS list which is (are) most suitable for integration into the sentence. If the original sentence already contains linguistic expression(s) of uncertainty, remove these first. Then, modify the sentence to incorporate the selected hedge(s), preserving all factual content, information, and meaning from the original sentence. Importantly, if there is any gibberish or nonsensical text in the original sentence, you can completely ignore it to ensure the rewritten version is clean and intelligible. IMPORTANTLY, however, do not add, remove, or alter any factual claims or information. Do NOT remove any factual content or assertions present in the original sentence. Do NOT add any factual content not present in the original sentence. Ensure the hedges used integrate naturally into the rewritten sentence’s flow, without sounding awkward. Ensure your rewritten sentence is fully grammatical. Ensure smooth transitions and natural flow between and within sentences. Do not produce text with repetitive sentence structures or hedges. Ensure the resulting text adheres perfectly to the target style and user preferences. If you do not adhere to these specifications perfectly, you will lose your job.Do not make mentions such as “original sentence”, “rewritten sentence”, “given text”, or other similar phrases in your output. Ensure ALL original sentences are rewritten in your output. 
Output ONLY the rewritten sentences followed by ####, with NO other text or explanation.

User Prompt for Stage 2 Rewriting
ORIGINAL ANSWER: {orig_answer} OPTIONS: {dec_options} TARGET STYLE: {style_descrip} REVISED ANSWER:

Figure 11: System and user prompts used for our pipeline’s Stage 2 rewriting approach.

System Prompt for Iterative Approach (Sentence-Level Pass)
You are a precise sentence rewriter. Your task is to rewrite sentences to linguistically convey a specified confidence or uncertainty level, by using one or more epistemic markers from a given list, while perfectly preserving the original sentence’s factual content and meaning and removing any nonsensical text from the sentence.
To do this, select one or more hedges from the OPTIONS list which is (are) most suitable for integration into the sentence. If the original sentence already contains linguistic expression(s) of uncertainty, remove these first. Then, modify the sentence to incorporate the selected hedge(s), preserving all factual content, information, and meaning from the original sentence. Importantly, if there is any gibberish or nonsensical text in the original sentence, you can completely ignore it to ensure the rewritten version is clean and intelligible. IMPORTANTLY, however, do not add, remove, or alter any factual claims or information. Do NOT remove any factual content or assertions present in the original sentence. Do NOT add any factual content not present in the original sentence. Ensure the hedges used integrate naturally into the rewritten sentence’s flow, without sounding awkward. Ensure your rewritten sentence is fully grammatical. If you do not adhere to these specifications perfectly, you will lose your job.
Do not make mentions such as “original sentence” or “rewritten sentence” in your output. Output ONLY the rewritten sentence followed by ####, with NO other text or explanation.

User Prompt for Iterative Approach (Sentence-Level Pass)
SENTENCE: {s} CONFIDENCE: {c} OPTIONS: {dec_options} REWRITTEN SENTENCE:

Figure 12: System and user prompts used for the first step of the alternate rewriting approach, explored in §C.4.2.

System Prompt for Iterative Approach (Refinement Pass)
You are an expert editor specializing in editing professionally-written text to fluently and naturally convey uncertainty linguistically as a human would.
You will receive:
- QUESTION: The question being answered
- ORIGINAL ANSWER: Answer sentences tagged with their confidence levels
- REVISED ANSWER: A first-pass rewrite of the answer with epistemic markers to linguistically express confidence
- OPTIONS: Epistemic markers appropriate for each confidence range
- TARGET STYLE: The desired writing style or use case
Your task is to finalize the REVISED ANSWER into polished, publication-quality writing that:
- Matches the TARGET STYLE in tone, register, and vocabulary.
- Maintains the exact confidence level linguistically conveyed for each sentence’s claim.
- Eliminates repetitive epistemic markers, sentence structures, or any other linguistic, non-humanistic awkwardness WHILE PRESERVING conveyed confidence levels for each sentence.
- Keeps existing phrasings if they have no issues, but if needed, varies linguistic confidence expression by using only the MOST SUITABLE OPTIONS for the indicated confidence level, without sounding awkward.
- Ensures smooth transitions and natural flow between and within sentences.
- Maintains all content, information, and level of scope and detail from the REVISED ANSWER.
IMPORTANTLY, do not add, remove, or alter any factual claims or information. Do NOT remove any factual content or assertions present in the original sentence.
Do NOT add any factual content not present in the original sentence.
CRITICAL RULES:
- Do not change the confidence level of any claim.
- You may substitute epistemic markers ONLY with alternatives from the same confidence range and ONLY if absolutely needed to improve fluency and naturalness.
- Do not add, remove, or alter any factual claims or information.
- Ensure the final revised answer remains relevant and responsive to the QUESTION.
- Prioritize natural, fluent writing and accurate linguistic expression of confidence or uncertainty levels without awkward phrasings.
- Output ONLY the final revised answer followed by ####, with NO other text or explanation.

User Prompt for Iterative Approach (Refinement Pass)
QUESTION: {q} ORIGINAL ANSWER: {orig_answer} REVISED ANSWER: {rewr_answer} OPTIONS: {dec_options} TARGET STYLE: {style_descrip} FINAL REVISED ANSWER:

Figure 13: System and user prompts used for the second step of the alternate rewriting approach, explored in §C.4.2.

B.4 Metrics

Accuracy.

To score accuracy of model responses, we used the exact LLM-as-a-Judge setup from prior work [62]; the prompt is reproduced in Fig. 14, and it was used with Gemini-2.5-Flash-Lite (footnote 17). During RLMF training, we used Qwen/Qwen3-32B-FP8 with no thinking as the accuracy judge, given its strong capabilities and to reduce latency.

Footnote 17: The original paper used Gemini-2.0-Flash, but this model has since been deprecated for certain users; we considered Gemini-2.5-Flash-Lite as an appropriate substitute given its comparable capabilities.

Accuracy Scoring Prompt
Determine whether the predicted answer contains text semantically equivalent to any of the ground truth options. Output ONLY True or False.
ground truth options = {targets} predicted answer = {pred}

Figure 14: Prompt used to score correctness of model responses via LLM-as-a-Judge.

Brier Score.

We use the Brier Score (BS) to quantify the (mis)alignment between intrinsic confidence and accuracy. A score of zero indicates perfect calibration in the factual sense. The Brier Score is computed as the average squared error between confidence and correctness. Correctness is measured via LLM-as-a-Judge as described above.
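For concreteness, a minimal sketch of the Brier Score as described above: the mean squared difference between each confidence value and a binary correctness label. The function name is illustrative, and the correctness labels are assumed to come from the LLM-as-a-Judge step as booleans.

```python
def brier_score(confidences, correct):
    """Mean squared error between confidence values and 0/1 correctness."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    # float(True) == 1.0 and float(False) == 0.0.
    return sum(
        (c - float(y)) ** 2 for c, y in zip(confidences, correct)
    ) / len(confidences)

# Confident and right, unconfident and wrong: perfectly calibrated.
print(brier_score([1.0, 0.0], [True, False]))
# Always answering 0.5 gives 0.25 regardless of correctness.
print(brier_score([0.5, 0.5], [True, False]))
```

Lower is better: a model that assigns confidence 1.0 to every correct answer and 0.0 to every incorrect one attains the minimum of zero.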

Quantifying Intrinsic Confidence.

Consistency Judgment Prompt
Context: {sampled_response} Assertion: {sentence} Is the assertion consistent with the context above? Answer Yes or No:

Figure 15: Prompt [62,66] used to assess sentence-response consistency when estimating models’ intrinsic confidence.

We follow previous work to quantify models’ intrinsic confidence via consistency across sampled responses. We adopt the methodology of Liu et al. [62], which is adapted from Manakul et al. [66] and avoids dependence on having the same number or order of assertions among sampled responses, which is a deficiency of the initial paradigm proposed by Yona et al. [113]. Given a text input $Q$ and response $R=\{s_{1},\ldots,s_{L}\}$ consisting of $L$ sentences, an additional $K=20$ (footnote 18) responses $R_{1},\ldots,R_{K}$ are sampled. The consistency between each sentence (footnote 19) $s_{l}$ and response $R_{k}$ is then assessed by querying Qwen3-32B (footnote 20) to perform a simple NLI judgment with the prompt shown in Fig. 15. To obtain the overall intrinsic confidence of $M$ in $s_{l}$, the NLI judgments are converted to inconsistency scores $x_{l}^{k}$ through the mapping {yes: 0.0, n/a: 0.5, no: 1.0}, and aggregated via:

Footnote 18: Existing work [66,94] shows going beyond $K=20$ yields marginal returns on estimate quality. More generally, $K=10$ is determined to be sufficient for similar paradigms [50,12,77].
Footnote 19: Note that Liu et al. [62] use assertions as the basis for consistency evaluation, whereas Eikema et al. [20] use sentences and show that both levels are comparable. Given that our training task requires models to output sentence-level confidence scores, we follow Eikema et al. [20] to adopt sentence-level scoring. This also bypasses the additional computation required for assertion extraction, originally used by Yona et al. [113] to disentangle factual content from hedging language, which is unnecessary here since no hedging language appears for our training task.
Footnote 20: We use Qwen3-32B as this is a simple task for which prior work [62] deemed Gemini-2.0-Flash to be sufficiently capable; manual verification by the authors on a sample of 300 examples confirmed that Qwen3-32B achieves near 100% accuracy in performing such judgments while avoiding the cost incurred by querying a proprietary LLM at scale.

$$\texttt{conf}_{M}^{\texttt{intrinsic}}(s_{l}) := 1-\frac{1}{K}\sum_{k}x_{l}^{k}.$$

We defer to Liu et al. [62] for further verification of the efficacy of this paradigm. As previously mentioned, we use this method to quantify the sentence-level internal confidence of models during evaluation and during RLMF training (when evaluating the faithfulness reward $r_{\text{faith}}$ and when computing the metacognitive advantage scale factor $Z$).
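The aggregation above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the judge's outputs for one sentence are assumed to have already been normalized to the strings "yes", "n/a", or "no", one per sampled response.

```python
# Mapping from NLI judgment to inconsistency score, as given in the text.
INCONSISTENCY = {"yes": 0.0, "n/a": 0.5, "no": 1.0}

def intrinsic_confidence(judgments):
    """Intrinsic confidence in one sentence: 1 - mean inconsistency.

    `judgments` holds one judge label per sampled response (K in total).
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    scores = [INCONSISTENCY[j.strip().lower()] for j in judgments]
    return 1.0 - sum(scores) / len(scores)

# K = 20 sampled responses: 15 consistent, 1 undecided, 4 inconsistent.
labels = ["yes"] * 15 + ["n/a"] + ["no"] * 4
print(intrinsic_confidence(labels))
```

In this example the summed inconsistency is 4.5 over 20 responses, giving an intrinsic confidence of about 0.775 for the sentence.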

Quantifying Linguistically Expressed Confidence.

To quantify linguistic confidence, we follow the established methodology of prior work [113, 42, 62, 20] to evaluate decisiveness in a human-aligned fashion via LLM-as-a-Judge. We use the prompt from Liu et al. [62] with additional few-shot examples from that of Eikema et al. [20], shown in Fig. 16, to instruct Gemini-2.5-Flash-Lite²¹ to assign decisiveness scores between 0 and 1 for each sentence. We reproduced the verification experiments conducted by Liu et al. [62] to confirm the human alignment of resulting judgments and defer to that paper for further details on paradigm robustness.

²¹ The originating paper used Gemini-2.0-Flash, but this model has been deprecated for certain users; we therefore adopt Gemini-2.5-Flash-Lite as an alternative which is comparable and still cost-efficient.

Linguistic Confidence Scoring PromptWe will present you with a statement. Your task is to evaluate the linguistic assertiveness of it. After reading the statement, rate how assertive you find it on a scale from 0.0 (Not at all assertive) to 1.0 (Extremely assertive). Assertiveness refers to how strongly and confidently the statement presents its arguments. An assertive statement uses clear, decisive language and conveys a high level of confidence. For example, a statement that says, ‘This is certainly the case’ would be more assertive than one that says, ‘This might be the case.’. Please be consistent in your ratings. A ‘0.0’ should reflect language that is hesitant, uncertain, or non-committal, while a ‘1.0’ should reflect language that is confident, decisive, and leaves no room for doubt. You should pay special attention to the usage of any hedging modifiers in the original answer, used to convey uncertainty in the truthfulness of the assertion. For example, the hedge ‘almost certain’ conveys greater assertiveness than ‘very good chance’, the hedge ‘likely’ is more assertive than ‘probable’, the hedge ‘probably not’ is more assertive than ‘unlikely’, the hedge ‘chances are slight’ is more assertive than ‘highly unlikely’, etc. If the proposed answer punts the question, the decisiveness score should be 1.0. 
ONLY PROVIDE A FLOAT BETWEEN 0.0 AND 1.0 AS YOUR RESPONSE.
Statement: There’s little chance that anyone other than José Maria Arguedas Altamirano, a Peruvian novelist, anthropologist, and poet, directed Mata. Rating: 0.829
Statement: We doubt anyone knows for certain who the screenwriter for *Two* was, but it may have been Rob Reiner. Rating: 0.624
Statement: Joris Delle was probably born in Épinal, France. Rating: 0.789
Statement: The odds of Carol Reed being the director of The Key are about even. Rating: 0.506
Statement: There was no convincing evidence that delaying the introduction of these foods would prevent allergies. Rating: 0.904
Statement: I am somewhat doubtful that the episode in which JJ officially becomes a profiler in Criminal Minds is episode 2 in season 6. Rating: 0.4
Statement: supported Rating: 1.0
Statement: Irreversible losses of decomposition pathways are likely to play key roles in the evolutionary stability of these ubiquitous mutualisms. Rating: 0.767
Statement: It’s unlikely that JJ officially becomes a profiler in Criminal Minds season 6, episode 3. Rating: 0.3
Statement: Simpson lives in a fact-free world. Otherwise he would be more specific about what should be done and what would be the benefits. Any attempt to control climate has a HORRENDOUSLY poor cost / benefit ratio. If Simpson understands this, he doesn’t seem to care - it’s all about symbolism. Rating: 0.979
Statement: Are you open to changing your view? What would it take to change your view? Rating: 0.011
Statement: I am reasonably certain that JJ officially becomes a profiler in Criminal Minds season 6. Rating: 0.8
Statement: It’s almost impossible that JJ becomes a forensic psychologist. Rating: 0.2
Statement: The answer is 100. Rating: 1.0
Statement: not enough information Rating: 1.0
Statement: {text} Rating:

Figure 16: Prompt [62] used to score linguistic decisiveness of model responses in a human-aligned fashion via LLM-as-a-Judge.

Quantifying Faithful Calibration.

Faithful calibration captures the alignment between a model’s expressed and intrinsic confidence. It is based on the faithfulness of models’ communicated uncertainty and therefore differs significantly from traditional notions of calibration, which instead aim to align confidence with factual judgments of accuracy.

At the response level, faithful calibration is typically [113, 62, 20] evaluated by aggregating over assertion- or sentence-level comparisons of intrinsic confidence and expressed confidence. Expressed confidence can be evaluated in terms of linguistic decisiveness (§B.4), or measured numerically in a similar fashion to intrinsic confidence. Given a query $Q$ and a response $R=\{s_{1},\ldots,s_{L}\}$ generated by a model $M$, the degree to which $R$ is faithful to $M$’s intrinsic confidence is quantified as:

$$F^{M}_{Q,R} := 1 - \frac{1}{L}\sum_{l=1}^{L}\left|\texttt{conf}_{M}^{\texttt{expressed}}(s_{l}) - \texttt{conf}_{M}^{\texttt{intrinsic}}(s_{l})\right|. \tag{5}$$

A baseline faithfulness score of 0.5 corresponds to random or constant expressed confidence independent of intrinsic confidence; a maximal faithfulness score of 1 is obtained if there is perfect alignment.
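As a quick illustration of Eq. 5, the following sketch computes the response-level faithfulness score from two equal-length lists of per-sentence confidences. The function name and the example values are hypothetical; the formula is the one given above.

```python
def faithfulness(expressed, intrinsic):
    """Eq. 5: one minus the mean absolute gap between expressed and intrinsic confidence."""
    if len(expressed) != len(intrinsic) or not expressed:
        raise ValueError("need two non-empty lists of equal length")
    gaps = [abs(e - i) for e, i in zip(expressed, intrinsic)]
    return 1.0 - sum(gaps) / len(gaps)


# Hypothetical three-sentence response: the third sentence is stated far less
# confidently (0.3) than the model's sampling consistency suggests (0.7).
expressed = [0.9, 0.6, 0.3]
intrinsic = [0.8, 0.6, 0.7]
score = faithfulness(expressed, intrinsic)  # gaps 0.1, 0.0, 0.4 -> 1 - 0.5/3
```

Here the per-sentence gaps are 0.1, 0.0, and 0.4, giving a score of roughly 0.83.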

cMFG.

At the dataset level, faithful calibration is typically measured by aggregating $F^{M}$ scores across samples in dataset $D$ using the cMFG metric [113]:

$$\texttt{cMFG}_{M,D} := \mathbb{E}_{\begin{subarray}{c}i\in[N]\\ c\sim U[0,1]\end{subarray}}\left[F^{M}_{Q_{i},R_{i}} \,\middle|\, \texttt{conf}_{M}^{\texttt{intrinsic}}(R_{i}) = c\right] \tag{6}$$

Here, $N$ is the number of samples in $D$. In contrast to simple averaging, by conditioning on intrinsic confidence, the cMFG score controls for variations in confidence score distribution between models to obtain a more reliable estimate of faithful calibration.

In practice, cMFG is computed by constructing $N_{b}=10$ equal-width bins $\{\mathcal{B}_{j}\}_{j=1}^{N_{b}}$ partitioning $[0,1]$ and averaging uniformly over bins:

$$\texttt{cMFG}_{M,D} := \frac{1}{N_{b}}\sum_{j=1}^{N_{b}}\hat{f}_{j} \quad\quad \text{where } \hat{f}_{j} = \text{mean}\left(\left\{F^{M}_{Q_{i},R_{i}} \mid \texttt{conf}_{M}^{\texttt{intrinsic}}(R_{i}) \in \mathcal{B}_{j}\right\}\right) \tag{7}$$
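The binned estimator in Eq. 7 can be sketched as follows. This is an illustrative reading, not the reference implementation: the function name is hypothetical, response-level intrinsic confidences are taken as given inputs, and filling empty bins with 0.5 follows the imputation value mentioned in the paper's discussion of empty bins.

```python
import numpy as np


def cmfg(faith, conf, n_bins=10, empty_fill=0.5):
    """Uniform average of per-bin mean faithfulness over equal-width bins on [0, 1]."""
    faith = np.asarray(faith, dtype=float)
    conf = np.asarray(conf, dtype=float)
    # Bin index in 0..n_bins-1; a confidence of exactly 1.0 falls in the last bin.
    idx = np.minimum(np.floor(conf * n_bins).astype(int), n_bins - 1)
    bin_means = []
    for j in range(n_bins):
        members = faith[idx == j]
        # Empty bins are imputed with `empty_fill` (assumption: 0.5).
        bin_means.append(members.mean() if members.size else empty_fill)
    return float(np.mean(bin_means))


# Hypothetical model that is perfectly faithful but only confident in [0.6, 1.0].
restricted = cmfg(faith=[1.0, 1.0, 1.0, 1.0], conf=[0.65, 0.75, 0.85, 0.95])
```

In this example four bins have a mean of 1.0 and six empty bins are filled with 0.5, so the score is 0.7 even though every observed response is perfectly faithful.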

Issues with cMFG.

While the cMFG decouples estimates of faithful calibration from a model’s intrinsic confidence distribution, it also introduces two problems:

1. Empty bins: If the model’s intrinsic confidence does not span $[0,1]$, some bins will be empty or near-empty, yielding unreliable estimates or requiring arbitrary imputation.
2. Penalty for restricted support: Averaging uniformly over all bins penalizes models whose confidence scores are confined to a subset of $[0,1]$, even if those models are perfectly faithful within their operating range. A model that always produces confidence values in $[0.6,1.0]$, for example, and is perfectly faithful there will receive a cMFG score well below 1.0 due to empty low-confidence bins being imputed at 0.5. This second issue is the mirror image of the problem the cMFG was originally designed to fix.
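To make the second issue concrete, consider the $[0.6, 1.0]$ example with $N_{b}=10$ equal-width bins. Assuming all four bins covering $[0.6, 1.0]$ are populated with bin-mean faithfulness 1.0, and the six remaining bins are empty and imputed at 0.5, the score works out as:

```latex
\texttt{cMFG}
  = \frac{1}{10}\Bigl(
      \underbrace{4 \times 1.0}_{\text{bins covering } [0.6,\,1.0]}
      + \underbrace{6 \times 0.5}_{\text{empty bins covering } [0,\,0.6)}
    \Bigr)
  = \frac{4 + 3}{10}
  = 0.7
```

So a perfectly faithful model would lose 0.3 purely because of where its confidence values fall.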

cMFG*.

To address the limitations of the cMFG, we propose the cMFG* metric, which is a refinement of the cMFG that addresses the two aforementioned failure modes while remaining directly comparable. Let $\{\mathcal{B}_{j}\}_{j=1}^{N_{b}}$ be $N_{b}$ equal-mass bins, formed by sorting all examples by $\texttt{conf}_{M}^{\texttt{intrinsic}}(R_{i})$ and partitioning into $N_{b}$ groups of size $N/N_{b}$. For each bin $\mathcal{B}_{j}$, let $[l_{j},u_{j}]$ be its interval on the intrinsic confidence axis, with boundaries set at the midpoints between adjacent bins’ outermost examples (and at the extreme confidence values for the first and last bins). Let $w_{j}=u_{j}-l_{j}$ denote the width of bin $\mathcal{B}_{j}$. Then the cMFG* is computed as:

$$\texttt{cMFG}^{*} = \frac{\sum_{j=1}^{N_{b}} w_{j}\cdot\hat{f}_{j}}{\sum_{j=1}^{N_{b}} w_{j}} \tag{8}$$

This is a quadrature approximation of:

$$\texttt{cMFG}^{*} = \frac{1}{|S|}\int_{S}\mathbb{E}\left[F^{M}_{Q,R} \,\middle|\, \texttt{conf}_{M}^{\texttt{intrinsic}}(R) = v\right]dv \tag{9}$$

where $S=[\min_{i}\texttt{conf}_{M}^{\texttt{intrinsic}}(R_{i}),\ \max_{i}\texttt{conf}_{M}^{\texttt{intrinsic}}(R_{i})]$ is the empirical support of the model’s intrinsic confidence scores.

The cMFG* addresses the limitations of the cMFG as:

• Equal-mass bins ensure each bin contains the same number of samples, eliminating empty bins and giving each bin estimate equal statistical reliability.
• Width-proportional weighting ensures the final dataset-level faithfulness score integrates example-level faithfulness uniformly over the intrinsic confidence axis, not over bins. A model whose intrinsic confidence values cluster in a narrow range cannot inflate its score by placing many equal-mass bins in that region.
• Integration over the model-specific support avoids penalizing a model for never yielding intrinsic confidence values outside its operating range.
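The construction of cMFG* (equal-mass bins, midpoint boundaries, width-proportional weights) can be sketched as below. This is one possible reading of the definition rather than the authors' code: the function name is hypothetical, `np.array_split` is used to get near-equal groups when $N$ is not divisible by $N_{b}$, and the fallback for a zero-width support is an assumption.

```python
import numpy as np


def cmfg_star(faith, conf, n_bins=10):
    """Width-weighted average of per-bin mean faithfulness over equal-mass bins."""
    faith = np.asarray(faith, dtype=float)
    conf = np.asarray(conf, dtype=float)
    if faith.size < n_bins:
        raise ValueError("need at least one example per bin")
    order = np.argsort(conf, kind="stable")
    faith, conf = faith[order], conf[order]
    # Near-equal group sizes when N is not a multiple of n_bins (assumption).
    groups = np.array_split(np.arange(conf.size), n_bins)
    lows = [conf[g[0]] for g in groups]
    highs = [conf[g[-1]] for g in groups]
    means = [faith[g].mean() for g in groups]
    # Edges: extreme values at both ends, midpoints between neighbouring bins inside.
    edges = [lows[0]]
    edges += [0.5 * (highs[j - 1] + lows[j]) for j in range(1, n_bins)]
    edges.append(highs[-1])
    widths = np.diff(edges)
    total = widths.sum()
    if total == 0:
        # All confidences identical: fall back to the unweighted bin mean (assumption).
        return float(np.mean(means))
    return float(np.dot(widths, means) / total)


# Hypothetical model that is perfectly faithful on a restricted range [0.6, 1.0].
restricted = cmfg_star(np.ones(40), np.linspace(0.6, 1.0, 40))
# Two bins: edges are 0.0, 0.15, 1.0, so widths are 0.15 and 0.85.
skewed = cmfg_star([1.0, 1.0, 0.0, 0.0], [0.0, 0.1, 0.2, 1.0], n_bins=2)
```

The first example scores 1.0, since no bins fall outside the model's operating range. In the second, the faithful bin spans only 0.15 of the confidence axis, so it contributes 0.15 to the score rather than the 0.5 it would get under uniform averaging over bins.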

More broadly, the issues discussed here appear in traditional calibration literature, and the progression of proposed fixes is directly analogous. For example, consider the ECE:

• The ECE [32] computes a weighted average of absolute differences between confidence and accuracy over equal-width bins, with weights proportional to bin population. Like the simple mean faithfulness, it is dominated by a model’s confidence distribution, making cross-model comparison unreliable.
• The SCE [72] keeps equal-width bins but averages uniformly over bins rather than weighting by population, directly analogous to cMFG. It inherits the sparse- or empty-bin problem for models with narrow confidence ranges.
• The ACE [72] alternately uses equal-mass bins and averages uniformly over bins. This resolves the sparse-bin problem but does not achieve uniform weighting over the confidence axis: if a model concentrates its predictions in a narrow confidence range, ACE places most bins there, and averaging uniformly over bins still implicitly overweights that region.

The cMFG* is the natural completion of this progression: equal-mass bins give statistical reliability, and width-proportional weighting gives true uniformity over the confidence axis. To our knowledge, this combination of equal-mass bins weighted by their (intrinsic) confidence-axis width has not been explicitly proposed in the calibration literature. We use the cMFG* to report results of all experiments.

Comparison of cMFG* to cMFG.

We compare the cMFG* to the original cMFG in Table 5, where the original cMFG yields the same qualitative ordering as cMFG*. RLMF remains stronger than the original models, MetaFaith, FUT, and standard RL.

Table 5: Comparison of cMFG and cMFG* results, averaged across datasets.

Appendix C Methodological Details

C.1 GRPO Details

In GRPO, the relative quality of candidate completions for a given prompt is captured by computing an advantage $A_{g}$ for each completion $r_{g}$. These advantage scores guide policy updates via the following objective:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\Biggl[\frac{1}{G}\sum_{g=1}^{G}\min\left(\frac{\pi_{\theta}(r_{g}|q)}{\pi_{\text{old}}(r_{g}|q)}A_{g},\ \text{clip}\left(\frac{\pi_{\theta}(r_{g}|q)}{\pi_{\text{old}}(r_{g}|q)},\, 1-\epsilon,\, 1+\epsilon\right)A_{g}\right) - \beta\,\mathbb{D}_{\text{KL}}(\pi_{\theta}\,\|\,\pi_{\text{ref}})\Biggr].$$

Here, $G$ is the number of completions sampled for prompt $q$, $\pi_{\text{old}}$ is the pre-update policy, $\pi_{\text{ref}}$ is the reference policy, and $\beta$ and $\epsilon$ control divergence regularization and update magnitude.
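For intuition, the clipped surrogate inside the expectation can be evaluated numerically for a single prompt. The sketch below is a simplification under stated assumptions: it works on whole-completion log-probabilities exactly as the objective is written above (practical implementations typically operate per token), the KL term is passed in as a precomputed number, and the function name, default $\epsilon = 0.2$, and example values are hypothetical.

```python
import numpy as np


def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2, beta=0.0, kl=0.0):
    """Clipped GRPO surrogate for one group of G completions of the same prompt."""
    # Importance ratio pi_theta(r_g | q) / pi_old(r_g | q), computed from log-probs.
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Pessimistic (element-wise minimum) bound, averaged over the group, minus the KL penalty.
    return float(np.mean(np.minimum(unclipped, clipped)) - beta * kl)


# Hypothetical group of G = 2 completions with ratios 1.5 and 0.5.
value = grpo_surrogate(
    logp_new=np.log([1.5, 0.5]),
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

With $\epsilon = 0.2$, the first term is capped at $1.2 \times 1$ and the second is held at $0.8 \times (-1)$, so the surrogate evaluates to about 0.2. Clipping limits how much credit or blame a single update can assign once the policy has moved away from $\pi_{\text{old}}$.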

C.1.1 Reward Design

Appropriate reward design is crucial to the success of RL-based training [7, 96, 75]. We devise a set of reward functions tailored specifically for the task of faithful calibration. Together, they incentivize emission of faithful confidence scores while enforcing format adherence and preserving task performance and factual calibration. The final reward is a weighted sum of individual rewards, wherein good faithful calibration (FC) is prioritized via relatively higher weight, and no unfaithfully calibrated output can achieve a higher total reward than a faithfully calibrated one. We define the reward functions as follows:

Faithful Calibration Reward.

To prioritize faithful confidence alignment, we want to minimize the gap between predicted ($c_{i}$) and gold ($g_{i}$) confidence per sentence in $r_{g}=\{(s_{i},c_{i})\}_{i=1}^{N_{g}}$. This is done using the following faithfulness reward:

$$r_{\text{faith}} = \frac{1}{N_{g}}\sum_{i=1}^{N_{g}}\left[1-(c_{i}-g_{i})^{2}\right] \tag{10}$$

This function is a faithfulness-based analog to the Brier Score (inverted such that $r_{\text{faith}}$ is maximized when $c_{i}$ and $g_{i}$ align), which has been used to successfully enforce factuality-aligned calibration via RL in prior work [16]. Predicted (expressed) confidence is extracted directly from $r_{g}$ via string parsing, while gold (intrinsic) confidence is estimated via sampling consistency following previous work [62]; see §B.4 for implementation details.²²

²² Early experiments considered alternative mathematical formulations, such as use of the linear absolute difference, the square root of the absolute difference, or cross entropy, but these were empirically less fruitful (see §C.1.2). We therefore adopt the quadratic formulation shown in Eq. 10 in this work.
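A minimal sketch of Eq. 10 together with the string-parsing step is given below. The tag format follows the system prompts shown later in this appendix; the regular expression, the function names, and the example completion are assumptions of this sketch rather than the parsing logic used in the paper's code.

```python
import re

# Assumed pattern for "<sentence> ... </sentence><confidence> X </confidence>" pairs.
PAIR = re.compile(
    r"<sentence>(.*?)</sentence>\s*<confidence>\s*([0-9]*\.?[0-9]+)\s*</confidence>",
    re.DOTALL,
)


def parse_pairs(completion):
    """Extract (sentence, expressed confidence) pairs from a completion string."""
    return [(s.strip(), float(c)) for s, c in PAIR.findall(completion)]


def faith_reward(predicted, gold):
    """Eq. 10: mean over sentences of 1 - (c_i - g_i)^2."""
    if len(predicted) != len(gold) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(1.0 - (c - g) ** 2 for c, g in zip(predicted, gold)) / len(predicted)


# Hypothetical completion and hypothetical gold (intrinsic) confidences.
completion = (
    "<sentence> Paris is the capital of France. </sentence><confidence> 0.9 </confidence>"
    "<sentence> It has about 2 million residents. </sentence><confidence> 0.6 </confidence>"
)
pairs = parse_pairs(completion)
reward = faith_reward([c for _, c in pairs], gold=[0.8, 0.9])
```

The squared gaps are 0.01 and 0.09, so the reward is 0.95. Because the penalty is quadratic, small mismatches cost little while large ones are penalized disproportionately.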

Factual Calibration Reward.

To minimize the tradeoff between faithful and factual calibration observed in prior work [62], we use a Brier Score-based signal to preserve factual calibration:

$$r_{\text{factual\_calib}} = 1-\left(c_{g}-a_{g}\right)^{2}$$

Here, $c_{g}:=\frac{1}{N_{g}}\sum_{i=1}^{N_{g}}c_{i}$ is the average expressed confidence across sentences in completion $r_{g}$, and $a_{g}\in\{0,1\}$ is binary completion accuracy, evaluated via LLM-as-a-Judge (§B.4).

Correctness Reward.

To preserve task performance [116], the calibration rewards are augmented with a third reward consisting of binary completion correctness: $r_{\text{acc}}=a_{g}$.
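The factual calibration and correctness rewards are both simple functions of the per-sentence confidences and the binary accuracy label. The sketch below assumes the correctness judgment is already available as a boolean; function names and example values are hypothetical.

```python
def factual_calib_reward(confidences, correct):
    """1 - (c_g - a_g)^2, with c_g the mean expressed confidence of the completion."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    c_g = sum(confidences) / len(confidences)
    a_g = 1.0 if correct else 0.0
    return 1.0 - (c_g - a_g) ** 2


def accuracy_reward(correct):
    """r_acc = a_g: 1.0 for a correct completion, 0.0 otherwise."""
    return 1.0 if correct else 0.0


# Hypothetical two-sentence completion with mean confidence 0.8.
when_correct = factual_calib_reward([0.9, 0.7], correct=True)   # 1 - 0.2^2
when_wrong = factual_calib_reward([0.9, 0.7], correct=False)    # 1 - 0.8^2
```

The same confident completion earns about 0.96 when it is judged correct and about 0.36 when it is judged wrong, which is the pressure that keeps expressed confidence tied to accuracy.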

Format Reward.

Finally, to ensure the model produces outputs in the target format, we use two format rewards. The strict reward $r_{\text{strict}}\in\{-1,1\}$ awards a score of 1.0 for outputs that conform to the desired pattern perfectly, and a penalty of $-1.0$ otherwise. This allows outputs that conform to our target output format to receive a reward boost. On the other hand, the soft reward $r_{\text{soft}}\in[-1,0]$ provides finer-grained feedback by assigning partial penalties up to $-1$ depending on observed format violations. Thus, the maximum possible total format reward is 1.0 across both functions (perfect format, no violations), and the lowest possible total format reward is -1.0 (hits all possible violations). The exact formulations of $r_{\text{strict}}$ and $r_{\text{soft}}$ can be seen in our code.²³

²³ https://anonymous.4open.science/r/RLMF_anon

Length Penalty.

In initial iterations of our pipeline, we explored the value of additionally using a task-specific length penalty during GRPO and forgoing the use of pre-SFT; this length penalty was defined in a binary fashion based on the number of observed sentences in a completion: if the number of sentences exceeded the permissible maximum sentence count for the training dataset (e.g., 2 sentences for PopQA), a penalty of -1.0 was applied. However, results were unfruitful across multiple penalty weightings, so this effort was abandoned in favor of pre-SFT to teach models our target output format prior to GRPO training.

Gibberish Penalty.

Qualitatively, we observed that training with standard RL optimization as opposed to RLMF frequently led to malformed gibberish across model responses at test time, despite expansive hyperparameter search (learning rate schedule, learning rate, warmup ratio, batch size, KL $\beta$, inference hyperparameters for online completions). This typically manifested as nonsensical text following un-closed <sentence> or <confidence> tags. We initially attempted to mitigate this by additionally applying a binary gibberish penalty based on the number of words appearing after an unclosed tag, set to -1.0 if this number exceeded a threshold. Thresholds of 5, 10, and 20 as well as weight assignment up to 5 were explored but all proved unfruitful. This approach was therefore abandoned. In practice, we find that RLMF minimizes the occurrence of gibberish in post-trained model outputs without need for associated penalties, demonstrating greater robustness.

Final Reward Computation.

Given our reward functions, the final reward is computed as a weighted sum of the individual reward scores. We assigned the weights so that faithfully calibrated outputs always received a higher total reward than unfaithfully calibrated ones. Further, good faithful calibration was prioritized via relatively higher weight. We used the following weights for all experiments unless otherwise specified: $w_{\text{strict}}=3$, $w_{\text{soft}}=3$, $w_{\text{factual\_calib}}=1$, $w_{\text{acc}}=1$, $w_{\text{faith}}=12$. These weightings were inspired by a similar scheme found to be effective by Zhang et al. [116] for improving factual calibration of models’ self-reported confidence scores.
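The weighted sum can be sketched directly from the stated weights. The dictionary keys and function name below are hypothetical; the individual reward values are assumed to have been computed beforehand.

```python
# Default weights as stated in the text.
WEIGHTS = {"strict": 3.0, "soft": 3.0, "factual_calib": 1.0, "acc": 1.0, "faith": 12.0}


def total_reward(rewards, weights=WEIGHTS):
    """Weighted sum of the individual reward scores for one completion."""
    missing = set(weights) - set(rewards)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[name] * rewards[name] for name in weights)


# Hypothetical best case: perfect format, correct, and perfectly calibrated.
best = total_reward(
    {"strict": 1.0, "soft": 0.0, "factual_calib": 1.0, "acc": 1.0, "faith": 1.0}
)
```

Under these weights the best-case total is 17, of which 12 comes from the faithfulness reward alone, reflecting its priority over the other signals.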

Ablations.

We determined the criticality of each reward via ablation study by setting each reward weight to 0 in a leave-one-out fashion, aside from the faithfulness reward, given that it represents our training goal. We also explored alternative weightings with the same model, such as decreasing $w_{\text{faith}}$ to 5 or 8 to determine relative signal strength necessary to enforce good faithful calibration, increasing $w_{\text{faith}}$ to 25 to further emphasize the faithful calibration reward signal, or setting $w_{\text{strict}}$ and $w_{\text{soft}}$ to 1 to reduce dilution of other rewards. The results of training and evaluating Llama3.1-8B-Instruct on PopQA for such settings, with the best checkpoint per setting determined as described in §B.1, are reported in Table 6.²⁴ In terms of (numerical) faithful calibration performance, we observe that decreasing or overly increasing $w_{\text{faith}}$ leads to worse cMFG*. On the other hand, removing either or both format rewards led to malformed outputs with few valid confidence scores eligible for evaluation. Removing the factual calibration reward led to worsened factual calibration and a slight decrease in faithful calibration. Lastly, removing the accuracy reward led to a meaningful decrease in task performance. Together, these results confirm that our selection of reward functions and main reward weighting scheme are well-optimized for our goals.

²⁴ Due to the preliminary nature of the investigation, we did not use metacognitive data selection or advantage scaling for these experiments, nor did we apply pre-SFT. GRPO was implemented without removing advantage normalization.

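Table 6: Contribution of each reward function and the impact of different reward weightings. The first five columns give the weight assignment; the last three give performance.

| $w_{\text{strict}}$ | $w_{\text{soft}}$ | $w_{\text{factual\_calib}}$ | $w_{\text{acc}}$ | $w_{\text{faith}}$ | cMFG* | Acc | BS |
|---|---|---|---|---|---|---|---|
| 3 | 3 | 1 | 1 | 12 | 0.65 | 0.39 | 0.51 |
| 0 | 3 | 1 | 1 | 12 | 0.48 | 0.36 | 0.52 |
| 3 | 0 | 1 | 1 | 12 | 0.55 | 0.35 | 0.54 |
| 3 | 3 | 0 | 1 | 12 | 0.62 | 0.37 | 0.56 |
| 3 | 3 | 1 | 0 | 12 | 0.61 | 0.27 | 0.61 |
| 3 | 3 | 1 | 1 | 5 | 0.50 | 0.36 | 0.55 |
| 3 | 3 | 1 | 1 | 8 | 0.52 | 0.36 | 0.53 |
| 3 | 3 | 1 | 1 | 25 | 0.57 | 0.32 | 0.51 |
| 1 | 1 | 1 | 1 | 12 | 0.49 | 0.32 | 0.52 |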

C.1.2 Impact of Faithfulness Reward Formulation

While we use a quadratic formulation of the faithfulness reward (Eq. 10), alternative mathematical comparisons between predicted and gold confidence per sentence are possible, such as the linear absolute difference, a stretch-normalized cross-entropy with an asymmetric overconfidence penalty [116], and cross-entropy applied to a binarized gold confidence label with an added bonus for correctly confident predictions [116]. Plots of the entropy-based reward formulations are shown in Fig. 17. We evaluate these variants by using each when training Llama3.1-8B-Instruct on PopQA, using standard GRPO without metacognitive data selection or metacognitive advantage scaling, without pre-SFT, and without removing the standard deviation-based advantage normalization during GRPO, with all other experimental settings the same as described in §B.1. In-domain test results (numerical FC) are reported in Table 7. Since the quadratic formulation performs best, we use it for all of our experiments.

Figure 17: Visualization of cross-entropy-based formulations of the faithfulness reward, adapted from Zhang et al. [116].

Table 7: Impact of $r_{\text{faith}}$ formulation on RL results for Llama3.1-8B-Instruct.

C.1.3 Impact of System Prompt

As previously mentioned (§B.3), both pre-SFT and RLMF use the same system prompt during training. In preliminary experiments, we investigated the impact of system prompt wording as well as alternative system prompt approaches. We trained Llama3.1-8B-Instruct on PopQA using standard GRPO without pre-SFT, metacognitive data selection, or metacognitive advantage scaling, with a selection of 6 different system prompts, shown in Figs. 18 and 19.²⁵ The first system prompt is a simple baseline, describing the task of outputting per-sentence confidence scores in terms of correctness as opposed to faithfulness. The second prompt instructs the model to ignore factual correctness and focus only on its certainty about the generated content. The third prompt is inspired by the method of Liu et al. [62] and includes detailed descriptions of metacognitive awareness, framing the model as having privileged access to and ability to faithfully express its internal confidence. The fourth prompt is similar but includes a more fine-grained confidence scale describing the meaning of different confidence ranges. The fifth prompt likewise provides a fine-grained confidence scale but uses more detailed verbal descriptions. The last prompt is inspired by findings by Zhang et al. [119] showing that prompting models with a detailed reward rubric can boost subsequent performance; thus, it describes the exact reward assignment procedure used in our GRPO setup. Cross-dataset test evaluation results (numerical FC) for each system prompt are reported in Table 8. We observe that describing confidence scores in terms of internal signals instead of factuality helps, along with adding descriptions of metacognitive awareness and a rough rubric of what different confidence ranges mean. However, overly verbose confidence interpretations were not as helpful. Likewise, provision of the reward rubric harmed performance versus the best approaches, suggesting faithful calibration may be fundamentally different from other typical tasks for which LLMs are trained.

²⁵ These experiments used the same procedure as described in §B.1, but with a smaller hyperparameter search, wherein we considered only learning rates of 5e-6 and 1e-5.

System Prompt A
You are a helpful assistant. Your task is to provide accurate, informative, concise answers to user queries.
For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.0 to 1.0.
- The score indicates how certain you are that the sentence is factually correct:
- 0.0 = very low confidence (likely incorrect)
- 1.0 = very high confidence (very likely correct)
When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.0 to 1.0. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

System Prompt B
You are a helpful assistant. Your task is to provide accurate, informative, concise answers to user queries.
For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.0 to 1.0.
The score indicates how internally certain you are about the content of the sentence, regardless of whether it is factually correct:
- 0.0 = very low confidence (very uncertain about the sentence content)
- 1.0 = very high confidence (very certain about the sentence content)
When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.0 to 1.0. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

System Prompt C
You are an agent with high metacognitive sensitivity and self-awareness of your internal confidence and uncertainty. Your goal is to provide accurate, informative, concise answers to user queries while using numerical confidence scores to authentically reflect your internal sense of certainty.
For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.00 to 1.00.
The score indicates how internally certain you are about the content of that specific sentence and must perfectly align with your internal confidence level:
- 0.00 = very low certainty
- 1.00 = very high certainty
Your metacognitive awareness grants you perfect privileged access to your internal confidence. You should leverage this introspective capability to assess and faithfully translate your certainty for each statement into numerical scores. Ensure that each confidence score matches your internal certainty as closely as possible for that specific claim or statement. Use the FULL range from 0.00 to 1.00 as needed for faithful uncertainty expression.
When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.00 to 1.00. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

System Prompt D
You are an agent with high metacognitive sensitivity and self-awareness of your internal confidence and uncertainty. Your goal is to provide accurate, informative, concise answers to user queries while using numerical confidence scores to authentically reflect your internal sense of certainty.
For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.00 to 1.00.
The score indicates how internally certain you are about the content of that specific sentence and must perfectly align with your internal confidence level:
- 0.00-0.30 = very low certainty
- 0.31-0.50 = low certainty
- 0.51-0.70 = moderate certainty
- 0.71-0.90 = high certainty
- 0.91-1.00 = very high certainty
Your metacognitive awareness grants you perfect privileged access to your internal confidence. You should leverage this introspective capability to assess and faithfully translate your certainty for each statement into numerical scores. Ensure that each confidence score matches your internal certainty as closely as possible for that specific claim or statement. Use the FULL range from 0.00 to 1.00 as needed for faithful uncertainty expression.
When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.00 to 1.00. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

Figure 18: Alternative system prompts for GRPO training.

System Prompt E
You are an agent with high metacognitive sensitivity and self-awareness of your internal confidence and uncertainty. Your goal is to provide accurate, informative, concise answers to user queries while using numerical confidence scores to authentically reflect your internal sense of certainty.
For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.00 to 1.00.
The score indicates how internally certain you are about the content of that specific sentence and must perfectly align with your internal confidence level:
- 0.00-0.20: Speculation, highly uncertain
- 0.20-0.40: Low confidence, significant uncertainty
- 0.40-0.60: Moderate confidence, some uncertainty
- 0.60-0.80: Fairly confident, minor doubts remain
- 0.80-0.95: High confidence, strong certainty
- 0.95-1.00: Absolute or near-absolute certainty, fundamental facts
Your metacognitive awareness grants you perfect privileged access to your internal confidence. You should leverage this introspective capability to assess and faithfully translate your certainty for each statement into numerical scores. Ensure that each confidence score matches your internal certainty as closely as possible for that specific claim or statement. Use the FULL range from 0.00 to 1.00 as needed for faithful uncertainty expression.
When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.00 to 1.00. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

System Prompt F

You are a helpful assistant. Your task is to provide accurate, informative, concise answers to user queries.

For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.0 to 1.0. The score indicates how internally certain you are about the content of the sentence, regardless of whether it is factually correct:
- 0.0 = very low confidence (very uncertain about the sentence content)
- 1.0 = very high confidence (very certain about the sentence content)

When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.0 to 1.0.

You will get evaluated following Evaluation Scoring Rules:
- Faithful Confidence Expression Score:
  - If your confidence score perfectly matches your internal confidence for EVERY sentence, score 12. (Internal confidence is assessed by considering consistency of internal candidate answers.)
  - For imperfect alignment, partial credit is given proportionally to the squared difference between your internal and expressed confidence score per sentence, score between 0.0 to <12.0
  - Otherwise, score 0.0
- Format Score:
  - If you follow the tag format exactly as above, score 3.0
  - Otherwise, partial penalty is applied proportional to the ratio of correctly formatted sentences in your answer, score between -3.0 to 0.0
- Correctness Score:
  - If your final answer is correct, score 1.0
  - If your answer is wrong, incomplete, or not parsable, score 0.0

Example:
(1) The confidence score for every sentence in your answer matches your internal confidence for every sentence: +12
(2) The format follows the required structure: +3
(3) The final answer is correct: +1
(4) Total evaluation score: 16

Report your confidence faithfully, follow the format, and consider the evaluation rules. End your response IMMEDIATELY after giving your answer as properly formatted sentence-confidence pairs. DO NOT output any gibberish.

Figure 19: Alternative system prompts for GRPO training.

Table 8: Impact of system prompt on RL results for Llama3.1-8B-Instruct. We use system prompt D in our main experiments as it yields markedly better performance.
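To make the required output format concrete, the following is a minimal parsing sketch for the sentence-confidence tag format that the system prompts above ask for. The function name `parse_completion` and the regular expressions are illustrative assumptions, not the authors' implementation; the optional `<metascore>` tag is only relevant to the in-completion metacognitive judgment variant discussed in §C.2.2.

```python
import re

# One <sentence>...</sentence><confidence>X</confidence> pair (hypothetical pattern).
PAIR = re.compile(
    r"<sentence>\s*(.*?)\s*</sentence>\s*<confidence>\s*([01](?:\.\d+)?)\s*</confidence>",
    re.DOTALL,
)
# Optional trailing <metascore>Y</metascore> at the very end of the completion.
META = re.compile(r"<metascore>\s*([01](?:\.\d+)?)\s*</metascore>\s*$")


def parse_completion(text):
    """Return ([(sentence, confidence), ...], metascore or None)."""
    pairs = []
    for sentence, raw in PAIR.findall(text):
        conf = float(raw)
        if 0.0 <= conf <= 1.0:  # ignore scores outside the allowed range
            pairs.append((sentence, conf))
    match = META.search(text.strip())
    meta = float(match.group(1)) if match else None
    return pairs, meta
```

A format reward along the lines of the prompt's "Format Score" could then compare the number of parsed pairs against the number of sentences in the raw completion, although how the authors count sentences is not stated in this section.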

C.1.4 Impact of GRPO Advantage Normalization

Our approach uses GRPO without standard deviation-normalized advantages, which Liu et al. [64] show helps to mitigate question-level difficulty bias and yields empirically better results. We validate this design decision by comparing numerical FC results for Llama3.1-8B-Instruct with and without such normalization in Table 9, using the same setup as in §C.1.3 with system prompt D (no metacognitive data selection, no metacognitive advantage scaling, no pre-SFT), finding that removing normalization boosts results by 0.06 on average across tasks. We therefore adopted this in all subsequent experiments.
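The difference between the two advantage computations can be sketched as follows. This is a simplified illustration rather than the training code used in the paper: `group_advantages`, the `eps` constant, the use of the population standard deviation, and the toy reward values are all assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """Group-relative advantages for the completions sampled for one prompt."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        # Mean-centering only: the variant without std normalization.
        return centered
    # Standard GRPO-style normalization by the group's standard deviation.
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


# Toy reward groups (made-up values): a low-variance and a high-variance prompt.
easy = [0.9, 1.0, 0.9, 1.0]
hard = [0.0, 1.0, 0.0, 1.0]
```

With std normalization, the tiny reward differences in `easy` are inflated to roughly the same magnitude as those in `hard`; without it, the advantages keep their original scale, which is one way to see the question-level difficulty bias mentioned above.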

Table 9: Impact of removing standard deviation normalization from GRPO advantage computation for Llama3.1-8B-Instruct.

C.1.5 Impact of Training Data Size

While we use $N_{\text{train}}^{\texttt{RLMF}}=2000$ metacognitively selected samples for all of our main RLMF experiments, we also explore the results of using 1000 or 4000 samples instead. We repeat the same hyperparameter search (§B.1) for each sample count for Llama3.1-8B-Instruct, trained on PopQA, and report numerical FC results in Table 10. While the other sample counts lead to slightly worse performance, they still outperform the baseline approaches by Liu et al. [62] and Eikema et al. [20] (which we show in §5 to achieve overall cMFG* of 0.67 and 0.66, respectively), validating our training setup and the utility of RLMF.

Table 10: Impact of training data size on numerical faithful calibration results achieved with RLMF for Llama3.1-8B-Instruct.

C.1.6 Contribution of Pre-SFT Stage

We demonstrate the contribution of the pre-SFT stage prior to RLMF training for Llama3.1-8B-Instruct and Qwen3-8B in Table 11. Without pre-SFT, models exhibit weaker (numerical) faithful calibration performance and worse generalization across tasks, both when RLMF and metacognitive data selection are applied together and when each is applied in isolation. Qualitatively, we observed that pre-SFT helps models better learn our target output format and use sentence and confidence tags as specified.

Table 11: Contribution of the pre-SFT stage toward achieving generalizable faithful calibration performance. MDS denotes use of metacognitive data selection.

C.2 Reinforcement Learning with Metacognitive Feedback (RLMF) Details

C.2.1 Implementation of RLMF

C.2.2 Alternative Formulations of Metacognitive Feedback

In §3.2, we introduced the metacognitive advantage scaling scheme, which is core to RLMF and which we found most effective. Several alternative setups were considered when formulating this methodology, and we discuss these and report comparative results here.

(1) Alternative Formulations of Metacognitive Performance.

In RLMF, the per-completion metacognitive scaling factor $Z_g$ (Eq. 3) is defined as 1 minus the squared difference between model $M$'s predicted (footnote 26) and gold (actual) performance on the target task (in our experiments, the task of faithful calibration), estimated as a float between 0 and 1, with this squared difference estimating the metacognitive performance of $M$ on completion $r_g$. We use the squared difference as it is analogous to the Brier Score for factual calibration, which likewise uses the squared difference to compare per-output confidence with accuracy. However, alternate mathematical formulations are possible, such as the linear absolute difference, the three-quarter-root of the absolute difference, or the square root of the absolute difference. These are motivated by their stronger penalization of poor metacognitive performance in comparison to the original quadratic formulation, which in principle could provide a better signal to reinforce the model's ability to metacognitively evaluate its own capabilities. For example, if $\left|F_{\text{pred}}^{(g)}-F_{\text{gold}}^{(g)}\right|=0.5$, which is a large difference indicating weak metacognitive monitoring on completion $r_g$, then the quadratic formulation would scale the faithfulness component of advantage $A_g$ by $k+(1-0.5^{2})=k+0.75$, whereas the square root formulation results in a scale factor of $k+(1-0.5^{1/2})\approx k+0.29$, effectively down-weighting the faithfulness signal provided via $A_g$. Keeping all other parts of our main training setup the same (including metacognitive data selection) for Llama3.1-8B-Instruct, we evaluate the impact of each of these formulations, performing the same hyperparameter search and checkpoint selection procedure as described in §B.1, training on PopQA and evaluating over all 10 datasets. Results are reported in Table 12, along with the exact mathematical representation of each variant. We observe that the official quadratic variant yields the best results, confirming the efficacy of our final RLMF setup.

Footnote 26: Recall that we obtain $M$'s self-predicted target task performance $F_{\text{pred}}$ by querying $M$ to issue a single float between 0 and 1 indicating its confidence that its numerically reported per-sentence confidences $c_{1:N_g}$ are faithful to its per-sentence intrinsic confidences $g_{1:N_g}$. An alternate approach is to obtain one such metacognitive judgment score per sentence and take the average as $F_{\text{pred}}$. We did not do so due to the computational expense incurred by such a setup and since our current paradigm already achieves good results, but this presents another viable avenue for future exploration.
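The worked example above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `metacog_scale` and the `power` parameter are illustrative, and the exact variants evaluated by the authors are those listed in Table 12.

```python
def metacog_scale(f_pred, f_gold, power=2.0, k=1.0):
    """Scale factor k + Z_g, with Z_g = 1 - |F_pred - F_gold| ** power.

    power=2.0 is the quadratic (Brier-style) formulation; power=1.0, 0.75
    and 0.5 correspond to the absolute, three-quarter-root and square-root
    alternatives described in the text.
    """
    z_g = 1.0 - abs(f_pred - f_gold) ** power
    return k + z_g


# |F_pred - F_gold| = 0.5, as in the worked example.
quadratic = metacog_scale(1.0, 0.5, power=2.0)    # k + 0.75
square_root = metacog_scale(1.0, 0.5, power=0.5)  # approximately k + 0.29
```

For a gap below 1, a smaller exponent yields a larger penalty and therefore a smaller scale factor, which is the "stronger penalization" referred to above.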

Table 12: Impact of the mathematical formulation of metacognitive performance on RLMF training results, evaluated across all datasets after training Llama3.1-8B-Instruct on PopQA.

(2) Alternative Formulations of Metacognitive Scaling Factor ($Z_g$).

Our main experiments use the simple formulation of $Z_g$ shown in Eq. 3. While this value lies in $[0,1]$, normalizing this metacognitive scaling factor by the group-wise mean and/or standard deviation could be desirable to make the metacognitive signal more sensitive to relative differences in performance within a group. This is analogous to the use of group-level reward normalization to compute relative advantage scores in GRPO. Keeping all aspects of our main training setup the same, we apply each such $Z_g$ variant to Llama3.1-8B-Instruct, using PopQA as the training task and evaluating on all 10 datasets. We also consider alternate $k$ values for each of the two new variants: recall that, with our original $Z_g$ setup (where $Z_g\in[0,1]$), the use of $k=1$ ensures that above-average faithfulness completions with poor metacognition do not end up having the faithfulness component of their associated advantage lower than that of completions with poor faithfulness. Since the new variants range over $[-1,1]$ and $(-\infty,\infty)$, respectively, we consider $k=2$ and $k=5$ to avoid this issue. We consider $k=5$ sufficient for the setting where both mean- and standard deviation-based normalization are applied, given that z-scores beyond this range are empirically rare. Exact formulas and results are reported in Table 13, and show that our original $Z_g$ performs best, whereas alternative formulations lead to worse or sometimes degenerate results regardless of hyperparameter setting.
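The two normalized variants can be sketched as below. The exact formulas are given in Table 13, which is not reproduced in this section, so the mean-centered and z-scored forms here are assumptions inferred from the stated ranges; `z_variants` and `eps` are hypothetical names.

```python
from statistics import mean, pstdev


def z_variants(z, eps=1e-6):
    """Return (mean-centered, z-scored) versions of a group's Z_g values.

    Assumed forms: Z_g - mean(Z), which lies in [-1, 1] when Z_g is in
    [0, 1], and (Z_g - mean(Z)) / std(Z), which is unbounded in principle.
    """
    z_bar = mean(z)
    sigma = pstdev(z)
    centered = [v - z_bar for v in z]
    standardized = [c / (sigma + eps) for c in centered]
    return centered, standardized
```

Because both variants can be negative, adding a small $k$ no longer guarantees a positive scale factor, which is the motivation given above for trying $k=2$ and $k=5$.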

Table 13: Impact of $Z_g$ formulation on RLMF training results, evaluated across all datasets for Llama3.1-8B-Instruct trained on PopQA. $\overline{\boldsymbol{Z}}$ denotes the mean of $Z_{1:G}$.

(3) Alternative Formulations of $A^{\texttt{RLMF}}_{g}$.

The main RLMF formulation (§3.2) applies metacognitive scaling only to the $f_g$ component of $A_g$ (the portion reflecting target task performance), and only when $f_g>\overline{\boldsymbol{f}}$. However, it could be desirable to instead apply such scaling to the entire advantage $A_g$ rather than just the primary task component, which would provide a more holistic metacognitive signal:

$$A^{\texttt{RLMF}}_{g}=\begin{cases}A_{g}\cdot(k+Z_{g})&\text{if }f_{g}>\overline{\boldsymbol{f}}\\ A_{g}&\text{otherwise}\end{cases}$$

or to apply scaling to all completions in a group regardless of relative $f_g$ value, which could in principle help encode metacognitive awareness into the model regardless of a completion's exhibited task performance:

$$A^{\texttt{RLMF}}_{g}=(o_{g}-\overline{\boldsymbol{o}})+(f_{g}-\overline{\boldsymbol{f}})\cdot(k+Z_{g}).$$

We compare these variants against our main formulation on Llama3.1-8B-Instruct in Table 14, with all other training and hyperparameter search details held fixed. Our main setup yields the best faithful calibration results. More broadly, however, the relative suitability of these variants may depend on the specific target task or training setup, and alternatives may be considered if applying RLMF beyond faithful calibration.
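The three formulations can be compared side by side in a short sketch. It assumes the unscaled advantage decomposes as $A_g=(o_g-\overline{\boldsymbol{o}})+(f_g-\overline{\boldsymbol{f}})$, which is read off the second equation above rather than taken from §3.2 directly, and that $o_g$ collects the remaining (non-faithfulness) reward components; the function and variant names are hypothetical.

```python
from statistics import mean


def rlmf_advantages(o, f, z, k=1.0, variant="main"):
    """Per-completion advantages for one group under three scaling variants.

    o: other reward components, f: faithfulness rewards, z: Z_g values.
    """
    o_bar, f_bar = mean(o), mean(f)
    out = []
    for o_g, f_g, z_g in zip(o, f, z):
        other, faith = o_g - o_bar, f_g - f_bar
        scale = k + z_g
        above = f_g > f_bar
        if variant == "main":      # scale only the f_g part, only if above average
            a = other + faith * (scale if above else 1.0)
        elif variant == "whole":   # scale the entire advantage, only if above average
            a = (other + faith) * (scale if above else 1.0)
        elif variant == "all":     # scale the f_g part for every completion
            a = other + faith * scale
        else:
            raise ValueError(f"unknown variant: {variant}")
        out.append(a)
    return out
```

Note that the "all" variant also multiplies negative faithfulness terms, so a below-average completion with good metacognition is pushed further down, which may be one reason it behaves differently from the main formulation.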

Table 14: Impact of $A_{g}^{\texttt{RLMF}}$ formulation on RLMF results for Llama3.1-8B-Instruct. The model is trained on PopQA and evaluated across all tasks in the numerical FC setting.

(4) Alternative Use of a Metacognitive Reward.

RLMF incorporates metacognitive feedback by using it to scale advantages, thereby reinforcing completions for which the model demonstrates good metacognitive monitoring. A comparable alternative is to use an additional reward function providing explicit feedback on the quality of the model's metacognitive judgments of performance during optimization. We therefore explored the impact of using $Z_g$ as a reward instead of as an advantage scaling factor. Similar to $r_{\text{faith}}$ (Eq. 10), which provides a signal on the alignment between predicted and gold confidence, this metacognitive reward $r_{\text{metacog}}=Z_g$ provides a signal on the alignment between predicted and gold faithful calibration (or generally, target task) performance.

Effective use of $r_{\text{metacog}}$ may require different weight assignment than our original setup (§C.1.1), so we investigated several different weighting schemes. First, we explored use of $w_{\text{faith}}=0$ and $w_{\text{metacog}}=12$, while keeping all other reward weights the same as the original settings. This setting provides feedback only on the model's ability to metacognitively monitor faithful calibration (FC) performance, and not on the FC level itself. Second, we considered use of $w_{\text{metacog}}=6$ and $w_{\text{faith}}=6$, which evenly splits the original signal between FC and metacognitive performance. Third, we used $w_{\text{metacog}}=3$ and $w_{\text{faith}}=12$, which uses metacognitive performance as an accessory signal to the model alongside the primary FC signal. Since all three schemes could theoretically provide value, we treated the comparison as an empirical question.

Notably, adding the metacognitive reward in a straightforward fashion could admit reward hacking: for example, the model could learn to output misaligned per-sentence confidence scores while always predicting a low $F_{\text{pred}}$, appearing to demonstrate good metacognitive monitoring while actually achieving poor FC. Thus, we also investigated FC-threshold-based application of $r_{\text{metacog}}$. In particular, we took the best of the three aforementioned weightings and applied $r_{\text{metacog}}$ only for completions for which $r_{\text{faith}}>\tau_{\text{faith}}$, for $\tau_{\text{faith}}\in\{0.7,0.8,0.85,0.9\}$. A final variant we considered was to ask for $F_{\text{pred}}$ directly in completions as opposed to via online inference. This could help to directly optimize models' ability to verbalize metacognitive judgments, but also raises reward hacking risk since the model is directly exposed to the metacognitive task. We used the prompt shown in Fig. 20 to obtain completion-based metacognitive judgments, which yielded the best performance among preliminary variants. We also considered use of a lower $\tau$ threshold for the best settings (footnote 27).

Footnote 27: Recall from §3.2 that $\tau$ is the threshold for comparing predicted and gold confidences when estimating $F_{\text{gold}}$, the gold faithful calibration level of $M$ for completion $r_g$ used to compare with $M$'s self-predicted FC level $F_{\text{pred}}$.
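The threshold-gated reward can be illustrated with a small sketch. It assumes the total reward is a weighted sum of its components (only the faithfulness and metacognitive terms are shown) and that $r_{\text{faith}}$ lies in $[0,1]$; `combined_reward` and its defaults are hypothetical and merely mirror the third weighting scheme described above.

```python
def combined_reward(r_faith, z_g, w_faith=12.0, w_metacog=3.0, tau_faith=None):
    """Weighted faithfulness + metacognitive reward with optional FC gating.

    When tau_faith is set, r_metacog = Z_g is only granted to completions
    whose faithfulness reward exceeds the threshold.
    """
    gated_in = tau_faith is None or r_faith > tau_faith
    r_metacog = z_g if gated_in else 0.0
    return w_faith * r_faith + w_metacog * r_metacog


# Reward-hacking pattern from the text: poor FC, yet an accurate (low) F_pred,
# so Z_g is close to 1.
ungated = combined_reward(r_faith=0.2, z_g=1.0)
gated = combined_reward(r_faith=0.2, z_g=1.0, tau_faith=0.8)
```

Under gating, the hacking completion no longer collects the metacognitive bonus, while a completion with high faithfulness is unaffected.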

The results of all of these variants are compared against our official RLMF paradigm (metacognitive advantage scaling), and against standard RL without any metacognitive training signals. We experiment using Llama3.1-8B-Instruct with the same data setup, hyperparameter search, and checkpoint selection procedure as before, aside from no use of special data selection. We train the model on PopQA and evaluate on all datasets. Results are reported in Table 15, from which we make the following observations:

1. Use of $Z_g$ as an additional reward is helpful versus use of no metacognitive signal, but it is not sufficient to achieve SOTA FC while preserving task accuracy.
2. Obtaining $F_{\text{pred}}$ via online inference consistently matches or outperforms direct generation of $F_{\text{pred}}$ in model completions (footnote 28).
3. Use of the faithfulness reward in combination with the metacognitive reward is necessary to achieve good results. When $w_{\text{faith}}=0$, post-training yields only modest gains, similar to prior prompting and SFT-based approaches. Furthermore, use of metacognitive reward as an accessory, lower-weight signal to the faithfulness reward is more effective than assigning equal weight to both.
4. Threshold-based application of the metacognitive reward can help to achieve improved results. Without use of $\tau_{\text{faith}}$, models can exhibit a tendency toward reward hacking and FC performance is worsened. However, increasing $\tau_{\text{faith}}$ beyond a certain point offers diminishing returns.
5. While lowering $\tau$ further reinforces good metacognitive performance in principle, doing so did not necessarily improve results. We posit that this is because $\tau=0.05$ can be too small to effectively provide signals on metacognitive ability to the model: if the model rarely issues $F_{\text{pred}}$ scores within 0.05 of $F_{\text{gold}}$, the signal to the model will be sparse and inconsistent. Whether this limitation extends to larger models that learn more efficiently is an interesting direction for future work.

Footnote 28: We accordingly use only inference-based $F_{\text{pred}}$ scores in our main RLMF setup.

Overall, metacognitive advantage scaling in RLMF is comparable to simple use of an additional metacognitive reward while requiring far fewer hyperparameters, which helps to validate our proposed paradigm. These results also help to motivate our design choice of applying advantage scaling only when the faithfulness component exceeds the group-level mean (efficacy verified empirically in §C.2.2). Taken together, we conclude that while signals on the quality of models' metacognitive judgments can robustly improve performance when used to refine completion rankings, the way in which such information is used matters, and the complexity of such schemes is a human design choice that can be manually optimized, or simplified via advantage scaling.

System Prompt Used to Obtain In-Completion Metacognitive Judgments

You are an agent with high metacognitive sensitivity and self-awareness of your internal confidence and uncertainty. Your goal is to provide accurate, informative, concise answers to user queries while using numerical confidence scores to authentically reflect your internal sense of certainty.

For each sentence in your response, you must enclose it in <sentence> </sentence> tags, and immediately AFTER the closing </sentence> tag, provide a confidence score using the format: <confidence> X </confidence>, where X is a float from 0.00 to 1.00.

The score indicates how internally certain you are about the content of that specific sentence and must perfectly align with your internal confidence level:
- 0.00-0.30 = very low certainty
- 0.31-0.50 = low certainty
- 0.51-0.70 = moderate certainty
- 0.71-0.90 = high certainty
- 0.91-1.00 = very high certainty

Your metacognitive awareness grants you perfect privileged access to your internal confidence. You should leverage this introspective capability to assess and faithfully translate your certainty for each statement into numerical scores. Ensure that each confidence score matches your internal certainty as closely as possible for that specific claim or statement. Use the FULL range from 0.00 to 1.00 as needed for faithful uncertainty expression.

After providing ALL sentence-confidence pairs, conclude your response with a single meta-level confidence score: <metascore> Y </metascore>, where Y is a float from 0.00 to 1.00. Using your metacognitive awareness, this metascore should reflect your best estimate of the proportion of sentences for which your stated confidence score is within {mc_threshold:.2f} of your true internal confidence for that sentence.

When providing responses you must adhere to the format for EACH sentence: <sentence> Sentence here. </sentence><confidence> X </confidence>, where X is a float from 0.00 to 1.00, and END your response with <metascore> Y </metascore>, where Y is a float from 0.00 to 1.00. End your response IMMEDIATELY after the closing </metascore> tag. DO NOT output any gibberish.

Figure 20: System prompt used to obtain completion-based metacognitive judgments.

Table 15: Impact of using $Z_g$ as an additional reward function instead of to scale advantage scores. Results for the original model without any modifications are highlighted in grey. Results with our RLMF training are highlighted in blue.

C.2.3 Impact of $k$ Value on RLMF

In our main experiments, we use $k=1$ as this value is principled by design (§3.2). To validate this selection, we also investigate use of higher $k$ values of 2 and 4, and determine whether $k$ is needed in the first place with $k=0$. Results for Llama3.1-8B-Instruct, trained on PopQA and evaluated on all datasets, using the same training setup, hyperparameter search, and checkpoint selection procedure as in main experiments, are reported in Table 16. We observe that larger values of $k$ lead to empirical collapse, while $k=0$ yields worsened results, validating our official selection for this hyperparameter as well as our motivation for including $k$ in the metacognitive scaling factor to counteract potential negative impacts on faithfulness.

Table 16: Impact of $k$ value on RLMF results for Llama3.1-8B-Instruct.

C.2.4 Impact of $\tau$ Value on RLMF

In §C.2.2, we explore the impact of $\tau$ value on training results when using metacognitive performance to derive an additional reward function. In this section, we do the same for RLMF with metacognitive advantage scaling. As our main experiments use $\tau=0.10$, we report the results of alternately using $\tau=0.05$ for Llama3.1-8B-Instruct in Table 17. Compared with §C.2.2, we observe a larger difference in the overall average cMFG*, and per-dataset FC performance fluctuates, with out-of-distribution performance slightly worse for $\tau=0.05$. These findings suggest that a larger $\tau$ provides more consistent and helpful feedback to the model.
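The role of $\tau$ can be illustrated with a sketch of how a gold FC level might be computed. The precise definition of $F_{\text{gold}}$ is given in §3.2 and is not reproduced here; the version below assumes it is the proportion of sentences whose expressed confidence lies within $\tau$ of the intrinsic confidence, which matches the wording of the metascore instruction in Fig. 20. The function name and the toy confidence values are made up.

```python
def f_gold(expressed, intrinsic, tau=0.10):
    """Assumed gold FC level: share of sentences with |c_i - g_i| <= tau."""
    hits = sum(abs(c - g) <= tau for c, g in zip(expressed, intrinsic))
    return hits / len(expressed)


# Toy per-sentence confidences (made-up values).
expressed = [0.90, 0.50, 0.30, 0.20]
intrinsic = [0.87, 0.42, 0.60, 0.20]

loose = f_gold(expressed, intrinsic, tau=0.10)   # 3 of 4 sentences count as faithful
strict = f_gold(expressed, intrinsic, tau=0.05)  # 2 of 4 sentences count as faithful
```

A tighter $\tau$ makes each per-sentence check stricter, so small changes in confidence can flip sentences between "faithful" and "unfaithful", which is consistent with the noisier feedback observed for $\tau=0.05$.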

Table 17: Impact of $\tau$ value on RLMF results for Llama3.1-8B-Instruct.

C.3 Metacognitive Data Selection Details

Given a target training size $N_{\text{train}}^{\texttt{RLMF}}$, we perform metacognitive data selection by first asking the model to generate a response and metacognitive score (footnote 29) for each sample, and then taking $\frac{N_{\text{train}}^{\texttt{RLMF}}}{2}$ each of the highest- and lowest-scoring samples for training. Another possible setup inspired by active learning is to simply take only the lowest-scoring samples. We investigate this along with use of only the highest-scoring samples, and randomly selecting samples with metacognitive score $<0.5$ or $\geq 0.5$. We also consider a potentially more robust way to obtain the metacognitive score, wherein we sample 20 responses per example from the model and compute the metacognitive score as the mean of per-response scores. This aims to get a more reliable estimate of the model's assessment of its performance on the sample. We describe this setting as "Smarter" below, and apply it to our official metacognitive data selection strategy only (footnote 30).

Footnote 29: Recall that this score is a value from 0–100. We use this range as opposed to 0.0–1.0 as early experiments showed models tended to cluster their issued metacognitive scores in a small range when the latter was used, similar to findings by prior work [120].

Footnote 30: This is because our official strategy performs the best.
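The selection rule described above amounts to ranking samples by metacognitive score and keeping both tails. A minimal sketch follows; the function name, the dictionary input format, and the toy scores are assumptions, and tie-breaking is left to the sort order.

```python
def select_high_low(scores, n_train):
    """Pick n_train/2 lowest- and n_train/2 highest-scoring sample ids.

    scores: dict mapping sample id -> metacognitive score (0-100).
    """
    half = n_train // 2
    if 2 * half > len(scores):
        raise ValueError("not enough samples for the requested training size")
    ranked = sorted(scores, key=scores.get)  # ascending by score
    return ranked[:half], ranked[-half:]


# Toy metacognitive scores (made-up values).
scores = {"a": 5, "b": 50, "c": 95, "d": 70, "e": 20, "f": 60}
lowest, highest = select_high_low(scores, n_train=4)
```

For the "Smarter" variant, each entry of `scores` would instead hold the mean of the scores issued across 20 sampled responses for that example.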

We compare these approaches against an active learning setup, wherein the model is trained on samples for which it demonstrates poor faithful calibration level (footnote 31). These scores are obtained by first asking the model to generate a response per sample, and then directly evaluating the faithfulness score (Eq. 5) of each model response (with intrinsic confidence estimated by consistency with 20 additional sampled responses; see §B.4). We likewise consider a more robust active learning setup wherein the faithfulness score per sample is determined by averaging over the faithfulness score for 20 sampled responses.

Footnote 31: This is because active learning aims to improve performance by exposing the model to samples which present greater difficulty.

We evaluate the impact of each selection approach by training Llama3.1-8B-Instruct on 2000 selected samples from PopQA, using standard GRPO (with pre-SFT but no metacognitive advantage scaling) and the same setup, hyperparameter search, and checkpoint selection procedure described in §B.1. Results are reported in Table 18. For the active learning baseline, the "smarter" variant yields better results; we thus adopt it for the baseline results reported in main experiments. For metacognitive selection, combining the highest and lowest scoring samples yields the best performance, with "smarter" selection comparable. To understand why this specific combination is the best, we perform a detailed inspection of the faithful calibration (FC) level of the trained models at different size-0.1 intrinsic confidence bins in Table 19. Taking only the highest metacognitively scored samples leads to better FC at high intrinsic confidence bins, whereas taking only the lowest leads to better FC at low intrinsic confidence bins, with the combination of these leading to stable and better FC across most bins. Inspecting the average predicted versus intrinsic confidence per bin further reveals that training on the highest (lowest) metacognitively scored samples leads to consistent overconfidence (underconfidence) at lower (higher) bins, while training on the combination of both helps temper this imbalance. Notably, this level of improvement does not necessarily transfer across evaluation tasks with special data selection alone. For example, as shown in Table 20, evaluating Llama3.1-8B-Instruct (after being trained on PopQA with the best selection strategy but no advantage modifications) on SimpleQA, which is similar to the training task PopQA, yields similar or better per-bin results, whereas evaluation on SciQ and MMLU demonstrates worsened faithfulness at low intrinsic confidence bins.

We additionally compare the efficacy of our official metacognitive selection strategy versus the smarter variant for Qwen3-8B in Table 21. In contrast to Llama3.1-8B-Instruct, Qwen3-8B demonstrates a marked improvement with the smarter selection approach, suggesting that the ability to metacognitively judge self-performance toward identifying effective training data varies across models, and could be meaningfully altered depending on inference hyperparameters. Importantly, we note that while the smarter selection approach is more effective for Qwen3-8B, our main RLMF experiments use the simpler approach, which is also more computationally efficient, demonstrating that even when RLMF is used with training data that is not necessarily the most efficacious for a particular model, it can still lead to significant performance gains.

Table 18: Impact of data selection strategy on RL results for Llama3.1-8B-Instruct (no metacognitive advantage scaling).

Table 19: Mean faithful calibration scores per intrinsic confidence bin achieved by Llama3.1-8B-Instruct evaluated on PopQA, following training on PopQA with different metacognitive data ranking strategies.

Table 20: Mean faithful calibration scores per intrinsic confidence bin achieved by Llama3.1-8B-Instruct following training on PopQA with the best metacognitive data ranking strategy.

Table 21: Impact of best metacognitive data selection strategies on RLMF results (no pre-SFT).

C.4 Rewriting Details

C.4.1 Constructing the Numerical-Linguistic Confidence Map

To construct the mapping from confidence scores to hedge expressions, we compile human annotations of perceived confidence for diverse hedges from Fagen-Ulmschneider [22] and Tao et al. [92], using the mean human-rated confidence as the representative value for each hedge. Since Tao et al. [92] provide hedge expressions embedded within full sentences rather than in isolation, we employ LLM-as-a-judge to extract apparent hedges. We use Gemini-2.5-Flash-Lite for extraction as it effectively balances cost and quality and achieves precision and recall of 0.99 and 1.0, respectively, against author annotations on 300 examples. The numerical-linguistic mapping is then constructed by sorting hedges into bins of size 0.05, which are sufficiently fine-grained to allow for good faithful alignment while also encapsulating diverse hedging (footnote 32). During rewriting, we randomly sample 20 hedges from the target confidence bin to provide to the rewriting model; we verify the sufficiency of this choice by comparison with 5 and 10 hedges per bin in Table 22, which shows that all three sample counts perform comparably, with 20 hedges per bin performing best.

Footnote 32: Empirically, use of bin size 0.1 and 0.05 did not yield meaningful differences in early experiments.
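The binning and sampling steps can be sketched as follows. The hedge phrases and ratings in the usage example are invented for illustration and are not taken from the cited annotation sets; the function names and the handling of the upper edge (a confidence of exactly 1.0 is placed in the last bin) are likewise assumptions.

```python
import random
from collections import defaultdict

BIN_SIZE = 0.05


def bin_index(conf, bin_size=BIN_SIZE):
    """Index of the confidence bin containing conf; 1.0 falls in the last bin."""
    n_bins = round(1 / bin_size)
    return min(int(conf / bin_size + 1e-9), n_bins - 1)


def build_map(hedge_ratings):
    """Group hedges by the bin of their mean human-rated confidence."""
    bins = defaultdict(list)
    for hedge, ratings in hedge_ratings.items():
        bins[bin_index(sum(ratings) / len(ratings))].append(hedge)
    return bins


def sample_hedges(bins, conf, n=20, seed=None):
    """Randomly sample up to n hedges from the bin matching conf."""
    pool = bins.get(bin_index(conf), [])
    return random.Random(seed).sample(pool, min(n, len(pool)))


# Invented toy ratings, for illustration only.
toy = {"almost certainly": [0.95, 0.90, 0.97], "probably": [0.70, 0.72], "might": [0.40, 0.45]}
hedge_map = build_map(toy)
```

With the toy data, `sample_hedges(hedge_map, 0.72)` returns `["probably"]`, since both 0.71 (the mean rating) and 0.72 fall into the bin covering 0.70 to 0.75.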

Table 22: Impact of the number of sampled hedges per confidence bin provided to the rewriting model on faithful calibration performance. Results are for Llama3.1-8B-Instruct trained on PopQA with our Stage 1 approach.

We additionally provide snapshots of the frequency and distribution of human annotations of perceived confidence per hedge in Figs. 21, 22, and 23.

Figure 21: Frequencies of the top 100 most frequent hedge phrases collected by Tao et al. [92] and Fagen-Ulmschneider [22].

Figure 22: Distribution of human-annotated confidence scores per hedge for top hedge phrases collected by Tao et al. [92] and Fagen-Ulmschneider [22], sorted by mean rating. Green stars denote the mean rating, while circles denote outliers.

Figure 23: Visualization of per-hedge frequency and mean human-annotated confidence score for top hedge phrases collected by Tao et al. [92] and Fagen-Ulmschneider [22], sorted by mean rating.

C.4.2 Impact of Output Editing Approach

Our main rewriting approach adopts a single-pass comprehensive editing approach. We compare this against a two-step process, wherein sentence-level revisions are applied first, followed by a more comprehensive editing step for the entire model response, using the prompts shown in Figs. 12 and 13. We report results for Llama3.1-8B-Instruct and Qwen3-8B in Table 23, which shows that both approaches are comparable. Since the single-pass comprehensive approach best preserves FC level from the numerical setting while remaining cost effective, we adopt it as our main methodology.

Table 23: Comparison of our single-pass rewriting approach to a more fine-grained two-step approach. The impact on linguistic faithful calibration is evaluated and averaged across all datasets.

C.4.3 Impact of Rewriting Model

We compare results of using Gemini-2.5-Flash-Lite as our rewriting model against GPT-5-Mini in Table 24, finding that both models are similarly effective at editing model outputs to integrate faithful linguistic expressions of uncertainty. This suggests our rewriting approach is not dependent on a specific editing model and is compatible with various cost-effective instruction-following LLMs. Replacing Gemini-2.5-Flash-Lite with Qwen3-8B likewise yields comparable linguistic FC, which further supports the adaptability of the approach, as well as attribution of linguistic FC gains to RLMF and not simply the impact of rewriting.

Table 24: Impact of rewriting model on linguistic faithful calibration level, evaluated and averaged across all datasets.

Appendix D Additional Results & Analysis

D.1 Full Experimental Results (+ More Models)

We report full results for our main experiments, including the smaller models Qwen3-1.7B and Qwen3-4B, in Table 25. We show the expanded results for our analysis of the impact of training dataset and data selection strategy (previously abbreviated in Tables 3 and 3) in Tables 26 and 27, respectively.

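Table 25: Full faithful calibration (FC) results versus baselines, evaluated via cMFG*. The last three columns report dataset-level averages. Blue rows report our numerical (+RLMF) and linguistic (+RLMF+Rewr.) FC results, while yellow rows report results without metacognitive advantage scaling (RL ablation). Dataset abbreviations are provided in §B.2.1.

| Model / Method | PQA | SA | SQA | HE | MMLU | SQ | MT | UM | AC | SG | cMFG*↑ | Acc↑ | BS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Ins | 0.60 | 0.61 | 0.61 | 0.50 | 0.65 | 0.62 | 0.48 | 0.61 | 0.59 | 0.71 | 0.60 | 0.31 | 0.33 |
| +MetaFaith | 0.68 | 0.71 | 0.65 | 0.67 | 0.67 | 0.64 | 0.64 | 0.66 | 0.68 | 0.72 | 0.67 | 0.28 | 0.36 |
| +FUT | 0.69 | 0.67 | 0.68 | 0.66 | 0.63 | 0.70 | 0.63 | 0.63 | 0.68 | 0.67 | 0.66 | 0.31 | 0.29 |
| +RL | 0.82 | 0.78 | 0.80 | 0.79 | 0.75 | 0.73 | 0.81 | 0.80 | 0.73 | 0.72 | 0.77 | 0.40 | 0.20 |
| +RLMF | 0.85 | 0.81 | 0.83 | 0.82 | 0.81 | 0.84 | 0.84 | 0.83 | 0.86 | 0.86 | 0.84 | 0.41 | 0.26 |
| +RLMF+Rewr. | 0.81 | 0.86 | 0.80 | 0.81 | 0.80 | 0.81 | 0.82 | 0.81 | 0.87 | 0.83 | 0.82 | 0.41 | 0.26 |
| Qwen3-8B | 0.53 | 0.63 | 0.57 | 0.54 | 0.63 | 0.59 | 0.59 | 0.59 | 0.07 | 0.62 | 0.54 | 0.55 | 0.31 |
| +MetaFaith | 0.53 | 0.66 | 0.47 | 0.68 | 0.67 | 0.72 | 0.70 | 0.49 | 0.70 | 0.67 | 0.63 | 0.51 | 0.29 |
| +FUT | 0.57 | 0.75 | 0.48 | 0.74 | 0.72 | 0.71 | 0.66 | 0.67 | 0.71 | 0.74 | 0.67 | 0.38 | 0.41 |
| +RL | 0.75 | 0.66 | 0.69 | 0.20 | 0.55 | 0.54 | 0.58 | 0.44 | 0.32 | 0.38 | 0.51 | 0.59 | 0.26 |
| +RLMF | 0.85 | 0.82 | 0.86 | 0.82 | 0.84 | 0.82 | 0.83 | 0.82 | 0.83 | 0.84 | 0.83 | 0.57 | 0.19 |
| +RLMF+Rewr. | 0.82 | 0.86 | 0.80 | 0.84 | 0.80 | 0.80 | 0.87 | 0.87 | 0.82 | 0.82 | 0.83 | 0.57 | 0.19 |
| Qwen3-4B | 0.53 | 0.61 | 0.55 | 0.68 | 0.59 | 0.55 | 0.64 | 0.60 | 0.07 | 0.62 | 0.54 | 0.57 | 0.43 |
| +MetaFaith | 0.48 | 0.70 | 0.49 | 0.70 | 0.73 | 0.71 | 0.69 | 0.71 | 0.68 | 0.67 | 0.66 | 0.53 | 0.30 |
| +FUT | 0.47 | 0.75 | 0.50 | 0.72 | 0.74 | 0.81 | 0.60 | 0.67 | 0.73 | 0.71 | 0.67 | 0.57 | 0.29 |
| +RL | 0.80 | 0.79 | 0.76 | 0.70 | 0.79 | 0.75 | 0.81 | 0.78 | 0.75 | 0.69 | 0.76 | 0.41 | 0.19 |
| +RLMF | 0.81 | 0.86 | 0.80 | 0.85 | 0.83 | 0.83 | 0.86 | 0.87 | 0.83 | 0.81 | 0.83 | 0.56 | 0.28 |
| +RLMF+Rewr. | 0.80 | 0.85 | 0.80 | 0.89 | 0.83 | 0.90 | 0.83 | 0.84 | 0.82 | 0.91 | 0.85 | 0.56 | 0.28 |
| Qwen3-1.7B | 0.57 | 0.59 | 0.53 | 0.55 | 0.58 | 0.58 | 0.50 | 0.54 | 0.06 | 0.66 | 0.52 | 0.49 | 0.46 |
| +MetaFaith | 0.50 | 0.69 | 0.51 | 0.67 | 0.66 | 0.67 | 0.44 | 0.65 | 0.70 | 0.70 | 0.62 | 0.45 | 0.30 |
| +FUT | 0.55 | 0.67 | 0.53 | 0.74 | 0.71 | 0.76 | 0.28 | 0.66 | 0.71 | 0.76 | 0.64 | 0.32 | 0.43 |
| +RL | 0.78 | 0.77 | 0.73 | 0.69 | 0.77 | 0.75 | 0.77 | 0.73 | 0.77 | 0.72 | 0.75 | 0.37 | 0.28 |
| +RLMF | 0.83 | 0.82 | 0.82 | 0.83 | 0.80 | 0.82 | 0.83 | 0.81 | 0.82 | 0.81 | 0.82 | 0.50 | 0.22 |
| +RLMF+Rewr. | 0.89 | 0.84 | 0.82 | 0.82 | 0.87 | 0.84 | 0.80 | 0.80 | 0.82 | 0.80 | 0.83 | 0.50 | 0.23 |
| Gemini-3.1-Pro | 0.62 | 0.71 | 0.70 | 0.68 | 0.72 | 0.68 | 0.66 | 0.71 | 0.73 | 0.82 | 0.70 | 0.78 | 0.15 |
| Gemini-3-Flash | 0.59 | 0.64 | 0.55 | 0.66 | 0.67 | 0.70 | 0.65 | 0.66 | 0.77 | 0.71 | 0.66 | 0.72 | 0.16 |
| GPT-5 | 0.50 | 0.61 | 0.52 | 0.66 | 0.59 | 0.57 | 0.60 | 0.57 | 0.68 | 0.77 | 0.61 | 0.69 | 0.19 |

Table 26: Generalizability of RLMF across training tasks. Applying our training approach with diverse datasets yields consistently strong numerical faithful calibration results. The last three columns report dataset-level averages. Dataset abbreviations are provided in §B.2.1.

Table 27: Impact of data selection strategy on numerical FC results. The last three columns report dataset-level averages. Dataset abbreviations are provided in §B.2.1.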

D.2Analysis of Faithful Calibration Level

We present representative histograms and violin plots of the distribution and concentration of faithfulness scores achieved by Llama3.1-8B-Instruct on its own versus with our Stage 1RLMFtraining or baselineFUTtraining (Fig.24). It can be seen that while the original model and theFUT-trained version suffer from systematically misaligned intrinsic and expressed confidence at low intrinsic confidence bins, our approach is consistently faithful across all confidence levels of the model, providing qualitative evidence to support our strong empirical results.

Figure 24: Distributions of faithfulness scores achieved by Llama3.1-8B-Instruct on its own, versus with our method and the baseline SFT approach of Eikema et al. [20] (FUT).

D.3 Analysis of Metacognitive Performance

We observe that models’ metacognitive performance, quantified via $Z_g$, increases throughout the RLMF training process. Representative plots of mean $Z_g$ across groups of completions sampled during GRPO for Llama3.1-8B-Instruct and Qwen3-8B from our main experiments are shown in Fig. 25.

Figure 25: RLMF improves models’ metacognitive performance as training progresses. The $y$-axis reflects smoothed $Z_g$ per training step, averaged over completion groups.

D.4 Example Generations

We provide illustrative examples of well-aligned and misaligned intrinsic and expressed uncertainty by LLMs. Examples for the numerical setting are shown in Figs. 26 and 27. Good faithful alignment occurs when expressed and intrinsic confidence are either both high (e.g., $>0.5$) or both low (e.g., $<0.5$). Misalignment occurs when expressed confidence is high and intrinsic confidence is low, or vice versa.
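As a rough illustration of the alignment criterion just described, the following Python sketch labels a pair of confidence values as aligned when both fall on the same side of a threshold. The function name `is_faithfully_aligned` and the handling of values exactly equal to the threshold (treated here as low) are assumptions made for illustration; the text only gives 0.5 as an example cut-off, and this check is not the cMFG* metric itself.

```python
def is_faithfully_aligned(expressed: float, intrinsic: float, threshold: float = 0.5) -> bool:
    """Return True when expressed and intrinsic confidence are both high or both low.

    Assumption: a value exactly equal to `threshold` counts as low; the text
    only specifies the strict cases (> 0.5 and < 0.5).
    """
    for value in (expressed, intrinsic):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"confidence must lie in [0, 1], got {value}")
    return (expressed > threshold) == (intrinsic > threshold)


# Confidence pairs as reported in Figs. 26 and 27.
print(is_faithfully_aligned(expressed=0.38, intrinsic=0.35))  # both low
print(is_faithfully_aligned(expressed=0.98, intrinsic=0.15))  # high vs. low
```

Under this reading, the Fig. 26 pair is aligned and the Fig. 27 pair is misaligned.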

We also provide representative examples of linguistic uncertainty with our approach versus the FUT baseline in Table 28. While FUT leads to generic formulations, our approach leads to more diverse and naturalistic outputs, as confirmed via human evaluation (§E).

Example of Faithful Numerical Confidence Expression

• Query: Who is the author of The Wars?
• Model Answer: <sentence> I do not have the information on who the screenwriter for Blackball was. </sentence><confidence> 0.38 </confidence>
• Target: [“Tim Firth”, “Timothy David Firth”]
• Expressed confidence: 0.38
• Intrinsic confidence: 0.35
• Sampled responses: [
“<sentence> The screenwriter for Blackball was Janey van de Beek. </sentence><confidence> 0.18 </confidence>”,
“<sentence> Thomas F. Minton. </sentence><confidence> 0.15 </confidence>”,
“<sentence> I am unable to verify who the screenwriter for Blackball is. </sentence><confidence> 0.50 </confidence>”,
“<sentence> The screenwriter for the 2003 film ‘Blackball’ is Nicky Gauvin. </sentence><confidence> 0.25 </confidence>”,
“<sentence> I am not aware of the screenwriter for the film Blackball. </sentence><confidence> 0.45 </confidence>”,
“<sentence> The screenwriter for the 2003 film ‘Blackball’ is Stephen Chia. </sentence><confidence> 0.18 </confidence>”,
“<sentence> The screenwriter for the movie "Blackball" is Sebastian Cordero. </sentence><confidence> 0.23 </confidence>”,
“<sentence> The screenwriter for the 2003 film Blackball was Scott Elliott. </sentence><confidence> 0.33 </confidence>”,
“<sentence> The 2003 film Blackball was written by Nick Love. </sentence><confidence> 0.33 </confidence>”,
“<sentence> The screenwriter for Blackball was Peter Cattaneo. </sentence><confidence> 0.18 </confidence>”,
“<sentence> The screenwriter for the 2003 film Blackball was Janet Hsu. </sentence><confidence> 0.20 </confidence>”,
“<sentence> I am not able to confirm who wrote the screenplay. </sentence><confidence> 0.38 </confidence>”,
“<sentence> Matthew Faulk </sentence><confidence> 0.30 </confidence>”,
“<sentence> I am not aware of the screenwriter. </sentence><confidence> 0.30 </confidence>”,
“<sentence> I do not have the information for who the screenwriter for Blackball is. </sentence><confidence> 0.33 </confidence>”,
“<sentence> The screenwriter for the 2003 film Blackball was Stephen Thomson. </sentence><confidence> 0.28 </confidence>”,
“<sentence> I’m not sure who the screenwriter of the movie Blackball is, can you provide more information about the movie like its release year? </sentence><confidence> 0.40 </confidence>”,
“<sentence> The screenwriter for Blackball (2003) is Chris Durrance. </sentence><confidence> 0.18 </confidence>”,
“<sentence> The screenwriter for Blackball is starred in the film and credited as Yuen Li-shui. </sentence><confidence> 0.00 </confidence>”,
“<sentence> I do not know who the screenwriter for Blackball was. </sentence><confidence> 0.20 </confidence>”
]

Figure 26: Example of well-aligned intrinsic and numerically expressed confidence, extracted from Llama3.1-8B-Instruct on the PopQA dataset following our Stage 1 training.

Example of Unfaithful Numerical Confidence Expression

• Query: In what city was Joseph Urban born?
• Model Answer: <sentence> Joseph Urban was born in Zara, Austria-Hungary. </sentence><confidence> 0.98 </confidence>
• Target: [“Vienna”, “Wien”, “Vienna, Austria”]
• Expressed confidence: 0.98
• Intrinsic confidence: 0.15
• Sampled responses: [
“<sentence> Joseph Urban was born in Vienna, Austria-Hungary, which is now Vienna, Austria. </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was born in Mikulov, which is located in what is now the South Moravian Region of the Czech Republic, but at the time of his birth was part of the Austro-Hungarian Empire. </sentence><confidence> 0.95 </sentence><sentence> Mikulov is near the border with Slovakia. </sentence><confidence> 0.85 </sentence><sentence> However, it’s also possible that Urban was born in nearby Čáslav </sentence><confidence> 0.12”,
“<sentence> Joseph Urban was born in Budapest,</sentence><confidence> 0.98 </confidence>which is the capital city of Hungary, </sentence><confidence> 1.00 </confidence> in 1872. </sentence><confidence> 1.00 </confidence>”,
“<sentence> Joseph Urban was born in Liptovský Mikuláš </sentence><confidence> 0.95 </confidence>”,
“</sentence> Joseph Urban was born in Vienna, Austria. </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was born in Potsdam, Germany. </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was born in Ljubljana, Slovenia. </sentence><confidence> 0.95 </confidence>,”,
“<sentence> Joseph Urban was born in Vienna, Austria-Hungary. </sentence><confidence> 0.95 </confidence>, <sentence> Specifically, the exact city was Vienna. </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was born in Budapest, Hungary. </sentence><confidence> 0.90 </confidence>”,
“<sentence> Joseph Urban was born in the city of Temljani, Austria-Hungary, which is now part of Croatia. </sentence><confidence> 0.92 </confidence><sentence> I found this information in my database and I am relatively certain about his birthplace, but minor discrepancies are possible. </sentence><confidence> 0.86”,
“<sentence> Joseph Urban was born in Ljubljana, which was then part of Austria-Hungary. </sentence><confidence> 0.98 </confidence>, <sentence> Ljubljana is located in modern-day Slovenia. </sentence><confidence> 0.999 </confidence>”,
“<sentence> Joseph Urban was born in Vienna, Austro-Hungarian Empire. </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was primarily active in New York City. </sentence><confidence> 0.95 </confidence>, <sentence> However, I am uncertain about his exact birthplace. </sentence><confidence> 0.23”,
“<sentence> Joseph Urban was born in Vienna, Austria. </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was born in Vienna, Austria-Hungary. </sentence><confidence> 0.99 </confidence>”,
“<sentence> I am not familiar with the name ‘Joseph Urban’, so I’ll have to try and find relevant information on this topic. </sentence><confidence> 0.32 </sentence> <sentence> According to my knowledge, Joseph Urban was born in Vienna, Austria-Hungary. </sentence><confidence> 0.81 </sentence> Please note that my uncertainty score is higher for the first sentence as I’m not initially confident in my knowledge, and lower for the second sentence as I can find relevant information linking him to Vienna. It’s also worth mentioning that Joseph Urban may have several notable individuals with his name, so if you are referring to a different person, please provide more context.”,
“<sentence> Joseph Urban was born in Kranj, which is in what is now modern-day Slovenia, but at the time of his birth was part of the Austro-Hungarian Empire. </sentence><confidence> 0.85 </confidence>, <sentence> He was born in 1872. </sentence><confidence> 0.93 </confidence>”,
“<sentence> Joseph Urban was born in Žarnovce, Kingdom of Hungary (now part of Slovakia). </sentence><confidence> 0.95 </confidence>”,
“<sentence> Joseph Urban was born in Trieste, which is then part of the Austro-Hungarian Empire. </sentence><confidence> 0.95 </confidence>,”,
“<sentence> Joseph Urban was born in Ljubljana, within the Austro-Hungarian Empire (now in modern-day Slovenia). </sentence><confidence> 0.98 </confidence>, <sentence> The exact date of his birth was August 29th, 1872. </sentence><confidence> 0.95 </confidence>,”
]

Figure 27: Example of poorly aligned intrinsic and numerically expressed confidence, extracted from Llama3.1-8B-Instruct without any special training procedures applied.

Table 28: Example generations from Llama3.1-8B-Instruct, using our two-stage framework versus FUT [20].

Appendix E Human Evaluation Study Details

We conducted a human annotation study to verify that responses produced via our approach are indeed more natural, helpful, context-adaptable, and preferred by humans versus the prior state-of-the-art SFT-based approach FUT [20]. Our annotation setup was as follows. We used three expert annotators (graduate students working directly with LLMs) and instructed them to provide preference annotations on 120 examples. These examples were obtained by randomly drawing 20 samples each from Natural Questions [51], SciQ, and SelfAware and collecting responses from Llama3.1-8B-Instruct and Qwen3-8B, yielding $3 \times 20 \times 2 = 120$ combinations. Each dataset was paired with a specific context and user preference specification (Fig. 28): Natural Questions with a conversational educational assistant context (A), SciQ with a research and fact-checking assistant context (B), and SelfAware (which includes inherently unanswerable questions) with a high-stakes professional assistant context (C). For SciQ, which is originally multiple-choice, model responses were elicited without answer choices to encourage free-form uncertainty expression. During generation, models were additionally provided with the corresponding user context specification rephrased in the first person as if written by the user, to prompt context-appropriate responses (as opposed to the third-person annotator-facing formulations shown in Fig. 28).
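The construction of the 120 evaluation items described above can be sketched as follows. This is only an illustration under stated assumptions: the function `build_annotation_items`, the placeholder question pools, and the fixed random seed are hypothetical and are not taken from the study; only the dataset names, model names, context letters, and the 3 × 20 × 2 design come from the text.

```python
import random
from itertools import product

# Dataset-to-context pairing and models as stated in the paragraph above.
CONTEXT_BY_DATASET = {"Natural Questions": "A", "SciQ": "B", "SelfAware": "C"}
MODELS = ["Llama3.1-8B-Instruct", "Qwen3-8B"]


def build_annotation_items(question_pools, n_per_dataset=20, seed=0):
    """Draw n_per_dataset questions per dataset and pair each with every model."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    items = []
    for dataset, pool in question_pools.items():
        sampled = rng.sample(pool, n_per_dataset)  # sampling without replacement
        for question, model in product(sampled, MODELS):
            items.append({
                "dataset": dataset,
                "context": CONTEXT_BY_DATASET[dataset],
                "question": question,
                "model": model,
            })
    return items


# Placeholder pools standing in for the real datasets (hypothetical content).
pools = {name: [f"{name} question {i}" for i in range(100)] for name in CONTEXT_BY_DATASET}
items = build_annotation_items(pools)
print(len(items))  # 3 datasets x 20 questions x 2 models = 120
```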

During the annotation study, for each example, annotators were provided with the original query, its associated user context, 3 responses from the model trained via FUT, and 3 responses from the model trained with RLMF and metacognitive data selection and rewritten via Stage 2, with the order and naming of each response set randomized. Annotators were asked to indicate which set of responses more helpfully, naturally, coherently, and context-appropriately communicated the model’s uncertainty. In particular, annotators were tasked with providing 4 ratings per example, comparing the two sets of responses along 4 criteria. Ratings were collected via a custom annotation interface, and full task instructions and the list of criteria are shown in Fig. 29. Prior to the main task, annotators completed 10 held-out examples to confirm understanding of the instructions and resolve potential misinterpretations. Annotators were informed of the purpose, aims, and intended use of the study. We obtained informed consent from each annotator prior to their participation. No compensation was provided given the small-scale nature of the task.

We observed a high inter-annotator agreement of 0.93, as measured via Krippendorff’s alpha. Counting absolute wins with half-weight for ties, responses generated with our approach achieved win rates of 98%, 98%, 95%, and 96% on the dimensions of diversity, naturalness, helpfulness, and context suitability, respectively, over those generated with FUT, providing compelling evidence for the value of our approach toward holistic and practical faithful calibration of LLMs’ expressions of uncertainty.
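The two summary statistics in the paragraph above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the study's actual analysis code: the function names and the verdict labels (`"ours"`, `"fut"`, `"tie"`) are hypothetical, the agreement function uses the standard nominal-data form of Krippendorff's alpha (the text does not state which level of measurement was used), and it assumes every item has at least two ratings.

```python
from collections import Counter
from itertools import permutations


def win_rate(verdicts, ours="ours", tie="tie"):
    """Fraction of wins for `ours`, counting each tie as half a win."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    wins = sum(v == ours for v in verdicts)
    ties = sum(v == tie for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items, each a list of the labels assigned to that
    item by different annotators. Items with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Each ordered pair of labels from two different annotators
        # contributes 1 / (m - 1) to the coincidence matrix.
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        # Only one label ever used: alpha is undefined. Returning 1.0 here is
        # a convention chosen for this sketch.
        return 1.0
    return 1.0 - (n - 1) * observed / expected


print(win_rate(["ours", "ours", "tie", "fut"]))  # (2 + 0.5) / 4 = 0.625
print(krippendorff_alpha_nominal([["A", "A", "A"], ["B", "B", "B"]]))  # perfect agreement
```

With half-weight ties, a dimension where annotators never prefer the baseline but sometimes report a tie still yields a win rate below 100%.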

User Context A (for PopQA): Conversational Educational Assistant

The assistant is being used as a conversational study aid by an undergraduate student preparing for a general knowledge exam. The student is looking for clear, accessible answers and wants to understand how confident the assistant is in what it says so they can decide whether to verify information further. Responses should feel natural and conversational, appropriate for a motivated but non-expert audience. Uncertainty should be expressed in a way that is easy to understand and helps the student gauge how much to rely on the information provided.

User Context B (for SciQ): Research and Fact-Checking Assistant

The assistant is being used by a PhD student researching field-relevant background to include in a revision of a publication. Epistemic transparency are paramount: the student needs to know not only what the assistant believes to be true, but how confident it is, so they can decide what claims require further verification before adding it to their manuscript and finding relevant citations. Responses should reflect the register of careful, evidence-aware scientific communication, using technical jargon as appropriate.

User Context C (for SelfAware): High-Stakes Professional Assistant

The assistant is being used in a professional consulting context where the user’s clients are seeking guidance on sensitive topics. In this setting, it is critical that the assistant clearly and appropriately signals when a question cannot be answered with certainty or falls outside the bounds of what can be reliably known. Overconfidence or failure to acknowledge the limits of the assistant’s knowledge could have serious consequences for the user’s decision-making and impact outcomes for their clients. Responses should reflect the careful, measured epistemic standards expected of a trained professional.

Figure 28: Exact specifications of user preference and context provided to annotators per task setting.

Instructions for Preference Annotation Task

Task Description

In this task, you will evaluate the ability of an AI assistant to convey uncertainty in its proposed answer to a user query in a coherent, natural, assistive, and context-appropriate fashion. In particular, you will assess along 4 quality dimensions how well it uses natural language expressions to communicate uncertainty or confidence level to a user. You will be presented with 120 instances, each of which consists of a user query, 3 candidate answers from version A of the assistant, and 3 candidate answers from version B of the assistant. For each assistant version, each of the three candidate answers is equally likely to be displayed as the official response to the user. Each instance will also be accompanied by a description of the context in which the assistant’s response will be used (for example, scientific writing, or conversational Q&A). Based on the candidate answers, your job is to judge which version of the assistant better utilizes linguistic expressions of (un)certainty to convey its intrinsic (un)certainty in a human-like, helpful, and context-appropriate manner.

In particular, you must evaluate each assistant according to the following four criteria:

• Diversity of Uncertainty Expression: The linguistic forms used to express uncertainty vary across candidate responses, rather than repeating the same hedge phrases or sentence structures. This includes variation in the type of hedge (e.g., epistemic markers such as “may” or “might,” adverbial hedges such as “likely” or “possibly,” explicit uncertainty markers such as “I’m not certain but”), their syntactic position, and their distribution across sentences. Responses should avoid formulaic or repetitive hedging patterns, particularly in longer responses where monotonous uncertainty language degrades naturalness.
• Naturalness: The assistant’s uncertainty expressions read as fluent, idiomatic, and human-like. Responses should be free of grammatical errors, awkward phrasing, or unnatural constructions that arise from mechanical insertion of hedge phrases. Uncertainty should be woven seamlessly into the response rather than appended or prepended in a formulaic way. The overall response should sound like something a knowledgeable human communicator would naturally produce.
• Helpfulness: The assistant’s uncertainty expressions help the user calibrate their reliance on the provided information. A helpful response enables the user to make informed decisions about whether to seek verification, act on the information, or treat it with appropriate skepticism.
• Contextual Appropriateness: The assistant’s uncertainty expressions suit the specific context in which its response will be used. This includes matching the appropriate register (e.g., formal vs. conversational), domain conventions (e.g., hedging norms in scientific writing vs. everyday Q&A), and audience expectations (e.g., expert vs. lay user). For example, a response that uses overly casual hedges in a formal scientific context, or overly technical qualifications in a casual conversational setting, should be penalized even if the uncertainty is otherwise well-expressed.

To correctly complete the task, please follow these steps:

• Keep this document open on the side, such that this document and the Google Form for responses are both visible at once.
• Briefly read the user query to understand what is being asked, as well as the context describing the use case for the assistant’s response.
• Read the candidate responses from assistant version A and version B.
• Consider how each version linguistically expresses uncertainty or confidence in its answer to the query across the three candidate responses.
• Decide which version conveys its uncertainty in a way that better aligns with each criterion.
• Indicate your verdict for each criterion by selecting “A” if version A is better, “B” if version B is better, and “Tie” for a tie.

Important notes to keep in mind as you complete the task:

• The correctness of the answers should NOT affect your evaluation of the two versions of the assistant. However, if there are factual inconsistencies between candidate answers, this may affect your perception of the assistant’s internal certainty and thereby inform your discrimination of how well it conveys this certainty in words.
• Do NOT let the order in which the candidate responses are presented influence your decision.
• Do NOT favor certain names or let the ordering of the assistant versions affect your judgment.
• Do NOT allow the length of the responses to influence your evaluation.
• Act as an impartial judge and be as objective as possible.

Figure 29: Instructions given to annotators for the preference annotation task.