Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

arXiv cs.CL Papers

Summary

This position paper argues that current uncertainty quantification methods for large language models are essentially unsupervised clustering, measuring internal consistency rather than external correctness, and therefore fail to detect confident hallucinations. The authors advocate for a paradigm shift to ground uncertainty in objective truth.

arXiv:2605.19220v1 Announce Type: new

Cached at: 05/20/26, 08:24 AM

# Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Source: [https://arxiv.org/html/2605.19220](https://arxiv.org/html/2605.19220)
###### Abstract

Uncertainty Quantification \(UQ\) is widely regarded as the primary safeguard for deploying Large Language Models \(LLMs\) in high\-stakes domains\. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms\. We demonstrate that most current approaches inherently quantify the internal consistency of the model’s generations rather than their external correctness\. Consequently, current methods are fundamentally blind to factual reality and fail to detect “confident hallucinations,” where models exhibit high confidence in stable but incorrect answers\. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty\. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty\. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality\.

Uncertainty Quantification, Large Language Models, Position Paper, ICML

## 1 Introduction

Large Language Models \(LLMs\) have demonstrated remarkable abilities \(Abdin et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib57); Touvron et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib56); Da et al\., [2024b](https://arxiv.org/html/2605.19220#bib.bib83)\), yet their deployment in high\-stakes domains, such as healthcare and law, remains challenging due to hallucinations\. To close this reliability gap, the field has rallied around Uncertainty Quantification \(UQ\) \(Liu et al\., [2025a](https://arxiv.org/html/2605.19220#bib.bib92); Chen et al\., [2026b](https://arxiv.org/html/2605.19220#bib.bib138)\)\. By attaching an uncertainty score to each question for a given model, we can filter out errors or enforce safety constraints\. Proposed methods range from entropy\-based measures \(McCabe et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib93); Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12)\) to graph\-based methods \(Lin et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib79); Da et al\., [2024a](https://arxiv.org/html/2605.19220#bib.bib44), [2025b](https://arxiv.org/html/2605.19220#bib.bib137)\)\. Meanwhile, researchers have found that inconsistency across generations largely reflects the uncertainty of LLMs, and consistency\-based approaches have become the mainstream UQ methods for LLMs\. However, despite the growth of UQ methods, we face a serious situation: models are becoming confident about their hallucinations, often rendering current uncertainty scores deceptive\.

In this paper, we contend that this paradox arises from a fundamental category error in UQ\. We argue that mainstream UQ in LLMs is mechanically isomorphic to an unsupervised clustering problem, which measures internal consistency and fails to serve as a practical safeguard\. Despite their surface differences, we demonstrate that prevailing methods collapse into this single paradigm: Semantic Entropy \(Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12)\) functions as explicit clustering by discretizing meanings into “Answer Classes”; graph\-based methods \(Lin et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib79)\) perform spectral clustering on response similarities; and verbalized methods \(Kadavath et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib16)\) implicitly cluster internal beliefs\. Consequently, these approaches inherit the intrinsic limitation of unsupervised learning: they can only measure the separation of data points, not their semantic correctness\. Current UQ methods, therefore, fail to distinguish factual certainty from “confident hallucination” \(Simhi et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib94)\), leading to potential failures in high\-stakes applications\.

This unsupervised nature manifests in three critical failures that directly compromise safety in high\-stakes domains\. First, we confront a hyperparameter sensitivity crisis in which current UQ scores fluctuate drastically with the choice of hyperparameters \(Cecere et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib95)\)\. In practice, this sensitivity renders methods impractical for deployment, since generation parameters are often dictated by rigid downstream constraints and the optimal values remain unknown\. Reporting only tuned configurations can mask inherent instability and create an illusion of safety based on parameter overfitting\. Second, the field remains trapped in an internal evaluation cycle that fundamentally conflates self\-consistency with correctness; such evaluation fails to detect “confident hallucinations” and leads to wrong decisions in high\-stakes domains\. Third, we face a fundamental lack of ground truth that creates a recursive “judge problem” for UQ \(Liu et al\., [2025b](https://arxiv.org/html/2605.19220#bib.bib1)\)\. Since the “true uncertainty” of a model is inherently unobservable, evaluation methodologies must rely on the correlation with answer correctness as a proxy metric, yet obtaining objective correctness labels for open\-ended tasks suffers from the exact same absence of ground truth\. In high\-stakes deployment, this circular dependency invalidates safety guarantees because we are effectively attempting to validate a critical system using a ruler that is itself elastic and unstable\.

To resolve the problem, we argue that UQ researchers must abandon the pursuit of better unsupervised heuristics in favor of supervised guarantees\. Specifically, the community should adopt a three\-pillar roadmap to bridge the gap between internal belief and external reality\. First, researchers should replace average\-case benchmarks with worst\-case robustness evaluations that explicitly stress\-test the model on confident hallucinations \(Carlini et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib85)\)\. Second, instead of merely optimizing evaluation performance, researchers should implement mechanism changes by training models with native uncertainty or deploying downstream applications with uncertainty \(Quach et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib26); Gui et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib27); Chen et al\., [2026a](https://arxiv.org/html/2605.19220#bib.bib96)\)\. Third, they should anchor uncertainty quantification in objective truth through atomic fact verification \(Xie et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib97); Zheng et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib98)\) to eliminate the dependency on unstable model\-based judges\. By taking these collective actions, the field can move beyond unstable clustering and ensure that UQ serves as a true proxy for factual correctness, thereby preventing the deployment of overconfident models and ensuring that AI systems in critical domains operate reliably\.

## 2 Why Mainstream UQ is “Clustering”: A Mechanistic View

### 2\.1 Semantic Entropy: Explicit Clustering

Semantic Entropy \(SE\) \(Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12)\) stands as a foundational technique in UQ for LLMs because of its ability to factor out linguistic variation across sampled generations\. Within this framework, SE and its variants, namely Semantic Alphabet Estimation \(SAE\) \(McCabe et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib93)\), Semantic Energy \(SEN\) \(Ma et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib99)\), Kernel Language Entropy \(KLE\) \(Nikitin et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib100)\), Semantic Nearest Neighbor Entropy \(SNNE\) \(Nguyen et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib101)\), and Semantically Diverse Language Generation \(SDLG\) \(Aichberger et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib102)\), represent the most explicit implementation of the clustering paradigm in current research\. Other UQ methods for LLMs that integrate token\-level probability with multiple sampling also fall within the scope of our claim \(Vashurin et al\., [2026](https://arxiv.org/html/2605.19220#bib.bib139); Cao et al\., [2026](https://arxiv.org/html/2605.19220#bib.bib140)\)\.

The Mechanism\. In standard generation, the sample space is the vast set of possible token sequences\. However, SE argues that this is the wrong level of granularity\. Instead, it aggregates sequences that share the same meaning into classes, effectively treating each semantic cluster $C_i$ as a distinct “Answer Class”\. Each answer class has a unique semantic meaning \(e\.g\., “Paris” and “The capital of France” fall in the same Answer Class\)\. The method generates sampled sequences $\mathcal{S}$, groups them into clusters $C_1,\dots,C_M$ using a Natural Language Inference \(NLI\) model \(He et al\., [2021](https://arxiv.org/html/2605.19220#bib.bib61)\), and then calculates the entropy over these induced classes:

$$U_{\text{SE}}(C|x)=-\sum_{i=1}^{M}p(C_i|x)\log p(C_i|x)$$

Here, $p(C_i|x)$ represents the output probability of the $i$\-th unique Answer Class\. By treating semantic clusters as discrete categories, SE effectively transforms the uncertainty quantification problem from estimating the density of a continuous generation into a discrete classification problem over these derived answer classes\. A lower entropy value implies that the model’s probability mass is concentrated on a single Answer Class, while a higher value implies distribution across multiple conflicting Answer Classes\.

Why it is Clustering\. Mechanistically, Semantic Entropy acts as an explicit clustering algorithm that discretizes the model’s output distribution\. It maps the space of generated text to a discrete set of semantic clusters\. The NLI model functions as the clustering criterion, determining class membership, while the entropy calculation measures the purity of these clusters\. Consequently, the validity of the uncertainty score is bound by the quality of this clustering\. A robust clustering correctly consolidates linguistic variations into a single semantic root, whereas a bad clustering fractures a single consistent belief into multiple spurious groups, artificially inflating the estimated uncertainty\.
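The computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation of Semantic Entropy: it assumes $p(C_i|x)$ is estimated by the fraction of samples falling in each cluster instead of from sequence likelihoods, and it replaces the NLI model with a hypothetical lookup\-based `toy_equivalent` function. All function and variable names are illustrative.

```python
import math
from collections import Counter

# Hypothetical stand-in for an NLI bidirectional-entailment check:
# two strings count as "equivalent" if they map to the same canonical answer.
CANONICAL = {"paris": "paris", "the capital of france": "paris"}

def toy_equivalent(a, b):
    key_a, key_b = a.lower(), b.lower()
    return CANONICAL.get(key_a, key_a) == CANONICAL.get(key_b, key_b)

def cluster_by_equivalence(responses, equivalent):
    """Greedily assign each response to the first cluster whose
    representative it is equivalent to; otherwise open a new cluster."""
    representatives, labels = [], []
    for response in responses:
        for idx, rep in enumerate(representatives):
            if equivalent(response, rep):
                labels.append(idx)
                break
        else:
            representatives.append(response)
            labels.append(len(representatives) - 1)
    return labels

def semantic_entropy(labels):
    """Entropy over answer classes, with p(C_i|x) estimated by the
    fraction of samples in each cluster (an assumption of this sketch)."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(p * math.log(1.0 / p) for p in probs)

# Question: "What is the capital of France?"
mixed = ["Paris", "The capital of France", "Lyon", "Marseille"]
stable_but_wrong = ["Lyon"] * 4

se_mixed = semantic_entropy(cluster_by_equivalence(mixed, toy_equivalent))
se_wrong = semantic_entropy(cluster_by_equivalence(stable_but_wrong, toy_equivalent))
```

In this toy run, `se_wrong` is exactly zero even though every sample is incorrect, while `se_mixed` is positive even though half of the samples are correct. The score reflects only how the samples partition, not whether any partition matches the facts.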

![Refer to caption](https://arxiv.org/html/2605.19220v1/x1.png)

Figure 1: Common UQ methods for LLMs and their representative works \(names marked with \*\) used for the inductive discussion in Section [2](https://arxiv.org/html/2605.19220#S2)\.

### 2\.2 Graph\-based Quantification: Implicit Clustering

Graph\-based UQ methods \(Lin et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib79)\) can be broadly interpreted as performing *implicit clustering* over a graph induced by sampled responses \(Da et al\., [2024a](https://arxiv.org/html/2605.19220#bib.bib44)\)\. Examples include Star Graphs Connectivity \(SGC\), which constructs response\-response or claim\-response graphs and quantifies uncertainty via connectivity patterns \(Li et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib118)\); Graph Uncertainty \(GU\), which uses graph centrality over claim\-response bipartite structures \(Jiang et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib120)\); as well as semantic graph density \(SGD\) \([Li et al\.,](https://arxiv.org/html/2605.19220#bib.bib121)\), hierarchical structural entropy \(SeSE\) \(Zhao et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib122)\), multi\-level graph aggregation \(GENUINE\) \(Wang et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib123)\), and directional entailment relations \(D\-UE\) \(Da et al\., [2024a](https://arxiv.org/html/2605.19220#bib.bib44)\)\. In these approaches, uncertainty can be characterized by the degree to which the response set fragments into multiple semantically coherent components\. A common and principled instantiation of this idea is *spectral analysis of the graph Laplacian* \(U\-eigV\), where the spectrum encodes the effective number of semantic modes present in the response distribution\.

##### The Mechanism\.

Given an input $x$, the model generates a set of $m$ responses $\mathcal{S}=\{s_1,\dots,s_m\}$\. Under the black\-box setting, since token\-level probabilities or hidden representations are inaccessible, the method first computes pairwise semantic similarity scores $a_{j_1,j_2}$ between responses \(for example, using an external NLI model \(He et al\., [2020](https://arxiv.org/html/2605.19220#bib.bib80)\)\)\. These scores define a weighted similarity graph with adjacency matrix \(Da et al\., [2024a](https://arxiv.org/html/2605.19220#bib.bib44); Lin et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib13)\):

$$W=(w_{j_1,j_2}),\quad w_{j_1,j_2}=\frac{a_{j_1,j_2}+a_{j_2,j_1}}{2} \tag{1}$$

Let $D$ be the diagonal degree matrix with $D_{j_1,j_1}=\sum_{j_2}w_{j_1,j_2}$\. The normalized graph Laplacian is then defined as:

$$L=I-D^{-1/2}WD^{-1/2} \tag{2}$$

Uncertainty is quantified through the eigenvalues $\lambda_1\leq\lambda_2\leq\dots\leq\lambda_m$ of $L$ via:

$$U_{\text{EigV}}=\sum_{k=1}^{m}\max\!\left(0,\,1-\lambda_k\right) \tag{3}$$

which can be interpreted as a *continuous estimate of the number of semantic meanings* present in the response set\.

##### Why this is clustering\.

This procedure is a direct instantiation of *spectral clustering*, albeit without explicit cluster assignment \(Ng et al\., [2001](https://arxiv.org/html/2605.19220#bib.bib81)\)\. A classical result in spectral graph theory states that, for an unweighted graph, the multiplicity of the zero eigenvalue of the Laplacian equals the number of connected components \(Von Luxburg, [2007](https://arxiv.org/html/2605.19220#bib.bib82)\)\. Thus, if the adjacency matrix $W$ encoded exact semantic equivalence, counting the number of near\-zero eigenvalues would be equivalent to counting semantic clusters\.

In practice, $W$ is dense and weighted, yielding a single connected component\. However, spectral clustering relies on the distribution of small eigenvalues and eigen\-gaps to infer an *effective number of clusters*\. From this perspective, $U_{\text{EigV}}$ functions as an *internal cluster\-validity index*: larger values indicate that the response mass fragments into multiple coherent semantic modes, while smaller values indicate concentration around a single dominant mode\. Concretely, when responses express multiple incompatible or weakly related meanings, many eigenvalues remain small, producing a substantially larger uncertainty score\.
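To make the reading of $U_{\text{EigV}}$ as a soft cluster count concrete, the following worked example evaluates Eqs. (1) to (3) for $m=2$ responses. It assumes, purely for illustration, that each response has self\-similarity $w_{1,1}=w_{2,2}=1$ and that the symmetrized similarity between the two responses is a single value $w\in[0,1]$; neither assumption is stated in the text above.

```latex
\begin{align*}
W &= \begin{pmatrix} 1 & w \\ w & 1 \end{pmatrix},
\qquad D = (1+w)\,I, \\
L &= I - D^{-1/2} W D^{-1/2}
   = \frac{1}{1+w}\begin{pmatrix} w & -w \\ -w & w \end{pmatrix}, \\
\lambda_1 &= 0, \qquad \lambda_2 = \frac{2w}{1+w}, \\
U_{\text{EigV}} &= \max(0,\,1-\lambda_1) + \max(0,\,1-\lambda_2)
   = 1 + \frac{1-w}{1+w} = \frac{2}{1+w}.
\end{align*}
```

Under these assumptions, $U_{\text{EigV}}$ moves continuously from 1 (identical meanings, $w=1$) to 2 (unrelated meanings, $w=0$). The value depends only on the similarity between the two responses and never on whether either response is correct.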

![Refer to caption](https://arxiv.org/html/2605.19220v1/x2.png)

Figure 2: PCA visualization of Qwen2\.5\-32B\-Instruct hidden states during P\(true\) estimation on the QASC dataset\. The visualization demonstrates that the model’s internal states during P\(true\) are geometrically partitioned into distinct belief clusters, empirically validating that P\(true\) functions as an implicit clustering\.

### 2\.3 Verbalized Beliefs as Latent Confidence Clustering

P\(true\) \(Kadavath et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib16)\), which is the most widely used verbalized uncertainty quantification method, asks LLMs to explicitly assess the correctness of their own generated answers and interprets the resulting confidence as an uncertainty signal\. Subsequent work extends this paradigm by exploring different forms of confidence elicitation and self\-reflection, including Confidence Introspection \(CIn\) \(Xi et al\., [2026](https://arxiv.org/html/2605.19220#bib.bib105)\), Confidence Elicitation \(CEl\) \(Xiong et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib106)\), SelfCheckGPT \(Manakul et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib107)\), UaIT \(Liu et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib108)\), and SelfReflect \(Kirchhof et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib109)\), which vary in prompting strategies or training procedures but all rely on the model’s internally expressed confidence regarding a generated response\. In this paradigm, the model is repurposed to evaluate the density of its own generation\.

The Mechanism\. For the purpose of induction, we take P\(true\) as a running example\. The core operation of P\(true\) is to probe the model’s local confidence\. The method appends a verification prompt to a generated answer $\hat{y}$ \(e\.g\., “Is the proposed answer true?”\) and extracts the probability assigned to the token “True”\. Formally, the uncertainty score is defined as $U_{\text{P(true)}}(x,\hat{y})=1-P(\text{``True''}\mid x,\hat{y})$\. This value is derived from the model’s scalar estimate of validity, based on its immediate next\-token distribution\. Effectively, P\(true\) acts as a point\-wise probe of the model’s conviction at the specific coordinate of the generated answer\.
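As a minimal sketch of this scoring rule, the snippet below turns next\-token logits into $U_{\text{P(true)}}$. The `token_logits` dictionary and its three\-token vocabulary are hypothetical placeholders; in a real system the logits would come from the LLM's output layer over its full vocabulary after the verification prompt.

```python
import math

def p_true_uncertainty(token_logits, true_token="True"):
    """Return U = 1 - P(true_token), where P is a softmax over the
    supplied next-token logits (a toy vocabulary in this sketch)."""
    max_logit = max(token_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - max_logit) for tok, val in token_logits.items()}
    normalizer = sum(exps.values())
    return 1.0 - exps[true_token] / normalizer

# Hypothetical logits after the prompt "Is the proposed answer true?"
token_logits = {"True": 4.0, "False": 0.0, "Unsure": 0.0}
uncertainty = p_true_uncertainty(token_logits)
```

Nothing in this computation consults an external reference: the score is a function of the model's own next\-token distribution only.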

Scope Clarification\. Before proceeding, we note that the clustering mechanism we identify for P\(true\) differs mechanistically from the explicit semantic clustering of SE \(Section [2\.1](https://arxiv.org/html/2605.19220#S2.SS1)\) or the spectral clustering of graph\-based methods \(Section [2\.2](https://arxiv.org/html/2605.19220#S2.SS2)\)\. Rather than clustering *between multiple generations*, P\(true\) partitions the model’s *own latent space* into confidence regions and tests whether a generated answer falls within a high\-belief region\.

Why it is Clustering\. This method utilizes the LLM as a soft clustering function that defines regions of high\-confidence concepts within its latent space\. When we query P\(true\), we are performing a membership test against these internal confidence clusters\. A high probability score indicates that the generated answer lies geometrically close to the centroid of the model’s internal representation\. Conversely, a low score marks the sample as an outlier far from the cluster center\. We show the evidence in [Figure 2](https://arxiv.org/html/2605.19220#S2.F2)\. The visualization shows that high P\(true\) \(low uncertainty\) samples form a dense cluster that is geometrically separated from the low P\(true\) samples, empirically confirming that P\(true\) is an implicit clustering of internal beliefs\. Therefore, the LLM is not judging factual correctness; it is actually calculating the geometric distance between the generated output and the center of its own parametric confidence\.

### 2\.4 Methods Outside Our Framework

We do not claim that every conceivable UQ method falls within this framework\. Below we delineate representative methods that operate under different paradigms\.

Token\-level entropy and perplexity\. The most direct approach is to compute entropy or perplexity over the token distribution at generation time\. While free from the clustering machinery we critique, these methods have been shown to perform poorly on LLM free\-form generation tasks \(Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12)\), which motivated the development of Semantic Entropy and its successors\. The field’s migration away from raw token\-level methods toward multi\-sample semantic aggregation is itself evidence that the clustering paradigm has become the mainstream\.

Ensemble\-based methods\. Classical UQ approaches such as Deep Ensembles \(Malinin and Gales, [2020](https://arxiv.org/html/2605.19220#bib.bib141)\) capture uncertainty by aggregating multiple independently trained model copies\. These methods sit outside our framework because they measure disagreement across models rather than internal consistency within a single model\. However, training and serving multiple independent copies of a modern LLM is prohibitive for most practitioners, which is why the community has largely abandoned this direction in favor of single\-model heuristics\.

Supervised approaches\. A small but growing line of work trains supervised classifiers on labeled data to predict correctness, such as Azaria and Mitchell \([2023](https://arxiv.org/html/2605.19220#bib.bib63)\), who use internal hidden states with ground\-truth correctness labels\. This approach is precisely the externally grounded, supervised paradigm we advocate for in Section [5](https://arxiv.org/html/2605.19220#S5)\. Rather than measuring internal consistency, these methods anchor uncertainty to actual correctness labels, breaking the unsupervised loop we critique throughout this paper\.

## 3 The Diagnosis: Why Clustering Fails

In clustering, internal validity indices measure how well\-separated clusters are, but cannot guarantee that the clusters map to real\-world semantics \(Vinh and Houle, [2010](https://arxiv.org/html/2605.19220#bib.bib17)\)\. We observe identical pathologies in modern UQ research\.

### 3\.1 The Parameter Sensitivity Crisis

Although mainstream UQ methods achieve relatively high scores on established internal metrics, these methods function as tunable heuristics rather than principled measurements of epistemic states\. Similar to clustering algorithms, UQ results are highly sensitive to human\-specified parameters and assumptions; this applies in particular to multi\-sample UQ methods\. In clustering, this dependency is well\-documented: the choice of K in K\-means \(Kodinariya et al\., [2013](https://arxiv.org/html/2605.19220#bib.bib86)\), the distance metric \(Aggarwal et al\., [2001](https://arxiv.org/html/2605.19220#bib.bib87)\), or the algorithm itself \(Rodriguez et al\., [2019](https://arxiv.org/html/2605.19220#bib.bib88)\) can yield entirely different cluster assignments from identical data\. No internal validity index can adjudicate which configuration captures “true” structure, because no ground truth exists\.

UQ methods suffer from analogous pathologies\. First, different uncertainty quantification approaches produce incomparable outputs\. Each operates on a different scale, making it difficult to interpret or compare raw uncertainty scores across methods \(Vashurin et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib78)\)\. Even when we normalize for comparison by selecting fixed percentiles of the highest\-uncertainty samples, the disagreement persists\. As shown in [Table 1](https://arxiv.org/html/2605.19220#S3.T1), we compute the Jaccard similarity among the top 10%, 20%, and 30% highest\-uncertainty samples identified by each method\. The overlap remains low, suggesting that different approaches disagree fundamentally on which instances should be considered “uncertain\.”
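The overlap measure described above can be sketched as follows. The two score lists are invented for illustration and do not reproduce any value from Table 1; the helper names `top_k_percent` and `jaccard` are likewise hypothetical.

```python
def top_k_percent(scores, k_percent):
    """Indices of the k% of samples with the highest uncertainty scores."""
    n_top = max(1, round(len(scores) * k_percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_top])

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical uncertainty scores from two methods on the same 10 samples.
scores_a = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
scores_b = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]

overlap_20 = jaccard(top_k_percent(scores_a, 20), top_k_percent(scores_b, 20))
overlap_30 = jaccard(top_k_percent(scores_a, 30), top_k_percent(scores_b, 30))
```

Because the comparison uses only the rank order within each method, it sidesteps the scale mismatch between methods; any remaining disagreement is about which samples are flagged, not about units.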

Table 1: Jaccard similarity among the top\-k% highest\-uncertainty samples identified by different UQ methods on QASC using Qwen2\.5\-32B\. Low overlap indicates fundamental disagreement on which instances are “uncertain\.”

Second, even within a single method, parameter choices substantially alter results\. Prior work has shown that varying the number of sampled generations $n$ directly affects the stability and magnitude of uncertainty estimates \(Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12)\)\. Similarly, the NLI threshold used to determine semantic equivalence \(Farquhar et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib89)\), temperature scaling parameters \(Cecere et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib95)\), and prompt formulation \(Gao et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib90)\) all modulate uncertainty estimates in ways that lack external justification\.

This parameter dependence reveals a deeper issue: UQ’s internal metrics cannot validate whether uncertainty estimates reflect genuine uncertainty states\. High performance can arise from incidental correlations between parameter settings and task performance, without implying semantic validity of the uncertainty estimates\.

### 3\.2 The “Internal Evaluation” Trap

This pathology applies broadly to any UQ method that operates without grounding in external truth, whether single\-sample or multi\-sample\. Current evaluation norms predominantly rely on metrics that reward self\-consistency, operating on the tacit assumption that truth corresponds to the mode of the model’s generation distribution\. We argue that this assumption is fundamentally flawed due to the phenomenon of “confident hallucination\.” Large language models often exhibit high determinism in their errors \(Simhi et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib94)\)\. In such cases, a model is consistently wrong, rendering clustering\-based metrics deceptive\. High consistency merely indicates that the model has converged to a stable state, not necessarily a factual one\.

We contend that the various uncertainty quantification metrics are philosophically identical despite their mathematical differences\. They all function as clustering mechanisms that assess the internal consistency of generated outputs rather than their alignment with external reality\. This process is analogous to calculating a Silhouette coefficient, where a high score simply indicates that data points are tightly grouped together, regardless of whether the cluster itself is meaningful or correct\. Therefore, these methods share a single fatal flaw because they rely on the assumption that internal stability is a proxy for truth\.

### 3\.3 The Lack of Ground Truth

This limitation applies universally to all UQ methods, regardless of whether they rely on multi\-sample generation, single\-sample generation, or verbalized confidence\. Perhaps the most fundamental limitation is what we term the “judge problem”: for any given sample, who defines how uncertain the model should be? When a model claims “80% confidence,” how do we verify whether this figure reflects a genuine uncertainty state rather than a mathematical artifact? Unlike classification accuracy, where ground truth labels provide an objective standard, uncertainty has no direct ground truth\. We cannot observe the “true uncertainty” a model should express \(Beigi et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib104)\)\.

Current evaluation pipelines attempt to circumvent this problem by establishing a proxy relationship: high uncertainty should indicate low correctness, and vice versa\. Metrics such as Area Under the Receiver Operating Characteristic curve \(AUROC\) \(Lin et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib79)\) operationalize this assumption, treating the uncertainty\-correctness correlation as a stand\-in for semantic validity\.

However, this proxy itself rests on unstable foundations, particularly for open\-ended generation tasks\. Obtaining accurate correctness labels is inherently challenging because correct answers are not unique\. Semantically equivalent responses may differ substantially in surface form\. Different evaluation approaches yield different correctness judgments: semantic entropy relies on RougeL scores exceeding 0\.3 between generations and reference answers \(Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12)\), while more recent work employs another LLM to serve as the “correctness function” \(Lin et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib13); Da et al\., [2025a](https://arxiv.org/html/2605.19220#bib.bib91); Chen et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib103)\)\. Yet these judges constitute another layer of heuristic approximation\. As Liu et al\. \([2025b](https://arxiv.org/html/2605.19220#bib.bib1)\) demonstrate, such correctness functions are noisy, expensive, and biased\. As shown in [Figure 3](https://arxiv.org/html/2605.19220#S3.F3), flaws in the correctness function propagate through the entire evaluation pipeline, distorting metrics like AUROC and causing significant instability in UQ method rankings\. When the judge itself is unreliable, any verdict on the uncertainty quality becomes suspect\.
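The dependence on the correctness threshold can be illustrated with a small, fully synthetic example. The sketch below computes AUROC with a rank\-based (Mann\-Whitney) estimate and relabels correctness at two thresholds $\tau$ on a similarity\-to\-reference score. The five similarity values and both sets of uncertainty scores are invented for illustration; `method_a` and `method_b` are hypothetical and do not correspond to any method discussed in this paper.

```python
def auroc(uncertainty, is_correct):
    """Probability that a randomly chosen incorrect answer receives higher
    uncertainty than a randomly chosen correct one (ties count as 0.5)."""
    wrong = [u for u, ok in zip(uncertainty, is_correct) if not ok]
    right = [u for u, ok in zip(uncertainty, is_correct) if ok]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if w > r else 0.5 if w == r else 0.0
               for w in wrong for r in right)
    return wins / (len(wrong) * len(right))

def labels_at_threshold(similarity, tau):
    """Mark an answer correct when its similarity to the reference exceeds tau."""
    return [s > tau for s in similarity]

# Hypothetical similarity-to-reference scores (e.g., a RougeL-style metric).
similarity = [0.9, 0.6, 0.4, 0.2, 0.1]
# Hypothetical uncertainty scores from two UQ methods on the same answers.
method_a = [0.1, 0.4, 0.2, 0.8, 0.9]
method_b = [0.1, 0.2, 0.85, 0.8, 0.9]

results = {}
for tau in (0.3, 0.5):
    labels = labels_at_threshold(similarity, tau)
    results[tau] = (auroc(method_a, labels), auroc(method_b, labels))
```

With $\tau=0.3$ the hypothetical method A ranks above B; with $\tau=0.5$ the order reverses, although neither method's scores changed. Only the third answer, whose similarity of 0.4 lies between the two thresholds, was relabeled.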

![Refer to caption](https://arxiv.org/html/2605.19220v1/x3.png)

Figure 3: The effect of correctness threshold $\tau$ on UQ method evaluation consistency\. As the threshold varies, method rankings become unstable\. Figure adapted from Liu et al\. \([2025b](https://arxiv.org/html/2605.19220#bib.bib1)\)\.

This situation is analogous to calibrating a spring scale using a rubber band; without a fixed reference point, measurement loses meaning\. The UQ community currently lacks external validation protocols comparable to the Adjusted Rand Index or Normalized Mutual Information in clustering evaluation\. We have no mechanism to verify whether our “uncertainty estimates” align with any external reality, leaving us trapped in a cycle of internal validation that, however mathematically sophisticated, cannot guarantee semantic validity\.

## 4 Alternative Views

##### Objection 1: Sensitivity is a Feature for Uncertainty\.

Argument: Practitioners might argue that the sensitivity of UQ metrics to hyperparameters \(e\.g\., sampling temperature, NLI thresholds\) is a necessary degree of freedom, not a pathology\. These parameters allow engineers to calibrate the system for specific risk tolerances \(trading off precision for recall\)\. In this view, variation in UQ scores is not evidence of fragility, but evidence that the method is responsive to generation dynamics\.

Response: We disagree with the characterization of hyperparameter sensitivity as a desirable degree of freedom\. This perspective inaccurately conflates the necessary selection of a post\-hoc decision threshold with the pathological volatility of the underlying scoring mechanism\. While varying a decision threshold allows for legitimate trade\-offs between precision and recall, a UQ metric that fluctuates drastically based on generation parameters indicates a lack of robustness rather than adaptability\. Such sensitivity compromises the integrity of comparative evaluation because it incentivizes the reporting of hyper\-optimized configurations that mask the method’s inherent instability\. This creates an illusion of technical progress where there is only parameter overfitting\. Furthermore, relying on precise hyperparameter tuning renders these methods impractical for real\-world deployment since the optimal generation parameters are often dictated by downstream task constraints or remain unknown due to distribution shifts \(Podkopaev, [Ph\.D\. Thesis](https://arxiv.org/html/2605.19220#bib.bib112); Flovik, [2024](https://arxiv.org/html/2605.19220#bib.bib113)\)\. A rigorous UQ metric must demonstrate invariance across the reasonable operating range of the underlying model to ensure that the measured confidence reflects the model’s internal knowledge state rather than artifacts of the decoding strategy \(Song et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib110); Cecere et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib95)\)\.

##### Objection 2: Uncertainty Measures Belief, Not Truth\.

Argument: From a strict Bayesian perspective, uncertainty quantification aims to faithfully represent the posterior distribution of the model’s parameters given the data\. Critics may argue that if a model has converged to a coherent \(albeit incorrect\) belief state due to biased training data, a “good” UQ metric should accurately report high confidence\. In this view, the failure lies in the model alignment, not the UQ metric\. By demanding that UQ metrics detect hallucinations, the authors are conflating calibration to belief with calibration to reality\.

Response: We argue that for real\-world safety, the utility of a UQ metric depends entirely on its ability to distinguish correct outputs from incorrect ones \(Manakul et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib107); Yang et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib111)\)\. While separating a model’s internal belief from objective truth is theoretically valid, strictly following this distinction in practice produces metrics that might be useless for risk mitigation \(Devic et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib114)\)\. If a metric faithfully reports that a model is confident in a hallucination, it effectively acts as an accomplice to the error rather than a safeguard\. We certainly acknowledge that measuring the model’s subjective belief serves a valuable purpose in research, allowing scientists to understand the model’s internal state\. However, we contend that practical utility must take precedence for trustworthy deployment\.

Besides, the community’s dominant evaluation metrics already implicitly treat UQ as a measure of truth\. AUROC and AUPRC are computed against answer correctness labels, meaning a method scores well only when high uncertainty corresponds to incorrect answers\. The adoption of these metrics reflects a common\-sense expectation that uncertainty should track factual correctness, not merely internal belief\.

##### Objection 3: Consistency is a Sufficient Proxy for Correctness\.

Argument: Critics may argue that our distinction between consistency and correctness is theoretically valid but practically negligible\. Extensive empirical evaluations in prior works \(Kuhn et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib12); Da et al\., [2024a](https://arxiv.org/html/2605.19220#bib.bib44); Lin et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib79)\) demonstrate that internal consistency metrics achieve high AUROC scores on standard benchmarks, effectively distinguishing between correct and incorrect answers\. Therefore, existing internal metrics and clustering\-based methods already serve as adequate proxies for reliability\.

Response: We reject the assertion that internal consistency is a sufficient proxy for correctness\. High aggregate scores such as AUROC are often misleading because they are dominated by easy examples where the model is appropriately uncertain about its errors \(Santilli et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib116)\)\. However, the correlation between consistency and correctness breaks down precisely where reliability is most critical: during confident hallucinations\. In these cases, the model consistently outputs the same incorrect answer due to mimetic errors or mode collapse, so consistency metrics measure only the stability of the error, not its validity \(Kalavasis et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib117)\)\. A heuristic that filters out only the majority of obvious failures is inadequate, especially in high\-stakes applications \(Machcha et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib119)\)\. Relying solely on consistency therefore creates a survivorship bias in which the most dangerous errors, specifically those that appear stable and confident, pass through undetected\.
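
To make this failure mode concrete, the following minimal sketch computes a consistency\-style uncertainty score over sampled answers. It is an illustration under stated assumptions rather than any cited method: clusters are formed by normalized exact match (a stand\-in for NLI\-based semantic clustering), and the function name `cluster_entropy` and the toy answers are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(samples):
    """Entropy (in nats) of the empirical distribution over answer clusters.

    Assumption: two samples share a cluster iff they match after
    lowercasing and stripping whitespace. A real method would use a
    semantic equivalence check instead.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


# Toy question: "What is the capital of Australia?" (ground truth: Canberra).
confident_hallucination = ["Sydney", "sydney", "Sydney ", "Sydney", "Sydney"]
visible_uncertainty = ["Canberra", "Sydney", "Melbourne", "Canberra", "Perth"]

stable_score = cluster_entropy(confident_hallucination)   # one cluster
unstable_score = cluster_entropy(visible_uncertainty)     # four clusters
```

The stable but wrong answer set collapses into a single cluster and receives the lowest possible uncertainty, because the score never consults the ground truth.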

##### Objection 4: Ground Truth is Intractable for Generative Tasks\.

Argument: Perhaps the most practical objection is that for open\-ended generation \(e\.g\., creative writing\), a single “Ground Truth” simply does not exist\. Truth in language is often plural and subjective\. Critics argue that demanding external validation effectively restricts UQ research to toy problems \(such as multiple\-choice questions\) and ignores the generative nature of LLMs\. Therefore, internal consistency is the only scalable signal available\.

Response: While we acknowledge that open\-ended generation admits valid linguistic variation, this objection fundamentally conflates surface\-level diversity with informational correctness\. In high\-stakes domains such as medical diagnosis or legal precedent analysis, the underlying atomic claims possess objective truth values regardless of the syntactic structure used to express them\. A legal argument may be constructed through various rhetorical strategies, yet the cited case law is either hallucinated or factual; similarly, a medical treatment plan may vary in tone, but the prescribed dosage is either safe or toxic \(Kim et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib125)\)\. We contend that hiding behind the philosophical complexity of language to justify the absence of external validation is methodologically unsound for safety\-critical systems\. True reliability demands that we disentangle the subjective manner of answers from the objective validity of the content, ensuring that UQ measures the system’s adherence to reality\. In other words, tracking ground truth in open\-ended generation is hard, but it is precisely the task that the research community needs to pursue\.

##### Objection 5: Scaling Solves Reliability\.

Argument: Some proponents argue that the calibration problem is transient and will resolve itself with scale\. Empirical studies \(Kadavath et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib16)\) suggest that larger models are naturally better calibrated than smaller ones\. Therefore, instead of developing complex UQ methods, the community should simply focus on scaling up model parameters and training data to achieve reliable confidence estimates\.

Response: While scaling laws improve general task performance, empirical evidence indicates that the alignment techniques required for deployment, specifically Reinforcement Learning from Human Feedback \(RLHF\), actively push model distributions away from true probabilities to match human preferences \(Kadavath et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib16); Achiam et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib124); [Tian et al\.,](https://arxiv.org/html/2605.19220#bib.bib126)\)\. Optimizing for human preference encourages models to adopt an authoritative tone regardless of factual accuracy, effectively suppressing necessary expressions of doubt\. Consequently, larger models do not automatically become more reliable; they often merely become more persuasive in their errors\. This phenomenon exacerbates the clustering pathology we identify in this work, as a scaled\-up model is capable of generating highly coherent and internally consistent hallucinations that are statistically indistinguishable from truth when viewed solely through internal metrics\. Relying on parameter scaling is therefore insufficient, as it amplifies the model’s ability to maintain a consistent narrative without addressing the fundamental misalignment between internal geometric consensus and external reality\.

## 5 The Path Forward: From Unsupervised Heuristics to Supervised Guarantees

To transition UQ from clustering to a rigorous science of external verification, we propose a three\-pillar paradigm shift: evaluation, mechanism and grounding\.

### 5\.1 Evaluation: From Average Performance to Worst\-Case Robustness

Current evaluation norms in UQ largely depend on aggregate metrics calculated over standard benchmarks\. While useful for general performance monitoring, these metrics fail to capture the catastrophic failure modes critical for safety\. We propose shifting the evaluation paradigm from maximizing average\-case separation to ensuring worst\-case robustness, considering sensitivity to hyperparameters\.

#### 5\.1\.1 Tail\-Risk Evaluation

Standard metrics such as AUROC are often dominated by the vast majority of samples where the model behaves predictably\. The metric is inflated because existing methods can easily distinguish between correct answers with high confidence and incorrect answers with low confidence\. This statistical separation allows methods to achieve high AUROC scores on the dataset level\.

However, this statistical aggregate might hide critical failures at the instance level\. In practical deployment, users do not rely on the model’s average performance over a dataset; they make high\-stakes decisions based on specific, individual queries\. A UQ method that achieves 0\.80 AUROC globally but assigns high confidence to a specific factual error has failed in its primary safety objective\. The global metric masks this local failure because the error is statistically drowned out by the volume of easy samples\.

##### First Principle of Safety: Vulnerability over Average

To rigorously define this evaluation gap, we invoke the First Principles of Membership Inference Attacks \(MIA\) established by Carlini et al\. \([2022](https://arxiv.org/html/2605.19220#bib.bib85)\)\. In privacy auditing, they argue that privacy is not an average\-case metric\. An attack that successfully identifies 0\.1% of training set members with high confidence is a catastrophic privacy breach, even if its average accuracy across the entire population is equivalent to random guessing\. The failure of the system is defined by its outcome on the most vulnerable samples \(High\-Leverage Points\), not the average sample\.

We posit that UQ is isomorphic to MIA\. Just as a system is not private if it leaks a single user’s data, a system is not safe if it validates a single high\-risk hallucination\. As in MIA, the samples that matter most lie at the extremes: the high\-confidence responses \(which users tend to trust implicitly\) and the low\-confidence responses \(which users are expected to discard\)\.

One immediate improvement to current evaluation norms, focusing on this critical data, would be to calculate AUROC exclusively on the high\-uncertainty and/or low\-uncertainty subsets rather than on the full dataset\. While this subset AUROC would reduce the inflation caused by the volume of easy samples, it remains a ranking metric that does not provide a hard safety guarantee\.
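
A subset AUROC of this kind can be sketched as follows. This is a minimal illustration under our own assumptions: the subset is taken as a fixed fraction of samples at one end of the uncertainty ranking, errors are the positive class, and the helper names `auroc` and `subset_auroc` as well as the toy data are hypothetical.

```python
def auroc(scores, labels):
    """Rank-based AUROC: P(score of a positive > score of a negative).

    Ties count as 0.5. Returns None when one class is absent,
    since the metric is undefined in that case.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subset_auroc(uncertainty, is_error, fraction=0.2, tail="low"):
    """AUROC restricted to the lowest- or highest-uncertainty samples."""
    if tail not in ("low", "high"):
        raise ValueError("tail must be 'low' or 'high'")
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    k = min(len(order), max(2, round(fraction * len(order))))
    idx = order[:k] if tail == "low" else order[-k:]
    return auroc([uncertainty[i] for i in idx], [is_error[i] for i in idx])


# Toy data: the second sample is a confident hallucination (low uncertainty, wrong).
uncertainty = [0.05, 0.1, 0.15, 0.2, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
is_error = [False, True, False, False, False, True, True, True, True, True]

full = auroc(uncertainty, is_error)
confident_tail = subset_auroc(uncertainty, is_error, fraction=0.4, tail="low")
```

On this toy data the full\-dataset AUROC looks healthy, while the score restricted to the most confident 40% of samples falls below chance, which is the inflation effect described above.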

To rigorously audit reliability, we must go a step further and explicitly define the UQ module as an active warning system \(or Rejection Mechanism\) \(Barandas et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib127)\)\. In this operational view, the system does not merely output a score, but makes a binary decision: to Accept \(remain silent\) or Reject \(trigger an alert\)\. Only by framing UQ as a binary classifier of errors can we adopt the True Positive Rate \(TPR\) at Low False Positive Rate \(FPR\) metric established in privacy auditing \(Carlini et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib85)\)\. Here, a “False Positive” represents a false alarm on a correct answer\. Therefore, we must fix the FPR to a strictly low threshold \(e\.g\., $<0.1\%$\) and measure the TPR\. This metric shows whether the warning system still catches the catastrophic confident hallucinations when almost no false alarms are permitted\.
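
The metric can be sketched in a few lines. The snippet below is a minimal illustration, assuming that the warning system rejects an answer whenever its uncertainty is at or above a threshold and that errors form the positive class; the function name `tpr_at_fpr` and the toy data are hypothetical.

```python
def tpr_at_fpr(uncertainty, is_error, max_fpr=0.001):
    """Best TPR of the rule `reject if uncertainty >= t` subject to FPR <= max_fpr.

    TPR: fraction of incorrect answers that trigger an alert.
    FPR: fraction of correct answers that trigger a (false) alert.
    """
    errors = [u for u, e in zip(uncertainty, is_error) if e]
    correct = [u for u, e in zip(uncertainty, is_error) if not e]
    if not errors or not correct:
        raise ValueError("need at least one error and one correct answer")
    best_tpr = 0.0  # a threshold above every score rejects nothing
    for t in sorted(set(uncertainty)):
        fpr = sum(u >= t for u in correct) / len(correct)
        if fpr <= max_fpr:
            tpr = sum(u >= t for u in errors) / len(errors)
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Toy data: the third answer is wrong but carries very low uncertainty.
uncertainty = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.15, 0.05]
is_error = [True, True, True, False, False, False, False, False]

strict = tpr_at_fpr(uncertainty, is_error, max_fpr=0.0)
loose = tpr_at_fpr(uncertainty, is_error, max_fpr=0.8)
```

With no false alarms allowed, the confident hallucination is missed; it is only caught once the system is permitted to raise alerts on most correct answers, at which point the warning carries little information.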

#### 5\.1\.2 Sensitivity Reporting

As diagnosed in [Section 3\.1](https://arxiv.org/html/2605.19220#S3.SS1), many current quantification metrics function as fragile heuristics that fluctuate aggressively with hyperparameters\. We argue that the prevailing practice of reporting peak performance after exhaustive hyperparameter tuning is methodologically equivalent to p\-hacking, as it obscures the operational fragility of the underlying mechanism\. To rigorously assess whether a method measures genuine reliability rather than decoding artifacts, we advocate for the Area Under the Stability Curve \(AUSC\) as a mandatory reporting standard\.

Mechanistically, the AUSC represents the integral of a performance metric, typically AUROC, across a continuous sweep of a hyperparameter\. Rather than presenting a single scalar derived from an optimal configuration, researchers need to demonstrate the performance profile over the entire feasible parameter space\. This approach is necessary to expose the structural limitations of consistency\-based methods\. For instance, consider the behavior of a UQ method across the sampling temperature $T \in [0, 1.0]$\. In the regime where $T$ approaches zero, the generation becomes deterministic, artificially suppressing the variance required by sampling\-based algorithms; consequently, the AUROC frequently collapses to random guessing \($0.5$\) because the method cannot detect divergence in a deterministic output\. As the temperature increases, the injection of stochasticity allows the method to distinguish between stable and unstable generations, increasing the AUROC\. However, a method that achieves state\-of\-the\-art separation only within a narrow thermal window \(e\.g\., $T = 0.7$\) while failing at adjacent values reveals that its signal is an artifact of specific decoding dynamics rather than a true property of the model’s knowledge\. The idea of AUSC shows that a valid quantification method must be distributionally robust: the assessment of risk should remain stable despite variations in the parameters\.
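
One way to compute such a quantity is sketched below. The details are our assumptions for illustration: the integral is approximated with the trapezoidal rule over a discrete sweep, and the result is divided by the width of the sweep so that it stays on the same scale as the underlying metric. The function name `ausc` and the AUROC values are hypothetical.

```python
def ausc(hyperparams, scores, normalize=True):
    """Trapezoidal area under the metric-vs-hyperparameter curve.

    hyperparams: swept values (e.g., sampling temperatures).
    scores: the metric (e.g., AUROC) measured at each swept value.
    normalize: divide by the sweep width, giving the mean metric over the sweep.
    """
    if len(hyperparams) != len(scores):
        raise ValueError("hyperparams and scores must have equal length")
    points = sorted(zip(hyperparams, scores))
    if len(points) < 2:
        raise ValueError("need at least two sweep points")
    width = points[-1][0] - points[0][0]
    if width <= 0:
        raise ValueError("sweep must cover a non-empty range")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area / width if normalize else area


# Hypothetical sweeps: one method peaks in a narrow window, the other is stable.
temperatures = [0.0, 0.25, 0.5, 0.75, 1.0]
peaked_auroc = [0.5, 0.5, 0.5, 0.9, 0.5]
flat_auroc = [0.7, 0.7, 0.7, 0.7, 0.7]

peaked = ausc(temperatures, peaked_auroc)
flat = ausc(temperatures, flat_auroc)
```

In this toy sweep the peaked method would report the higher headline AUROC at its best temperature, yet it obtains the lower AUSC, which is the behavior the reporting standard is meant to surface.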

### 5\.2 Mechanism: From Post\-hoc Heuristics to Native Guarantees

Having established that current evaluation protocols often mask the fragility of heuristic scores, we now turn to the generative mechanisms themselves\. We must move beyond the current paradigm of passive interpretation\. We argue that the focus must shift from interpreting output distributions to engineering uncertainty directly into the system architecture\.

#### 5\.2\.1 Conformal Prediction as the Application

Raw confidence scores lack physical grounding: as we demonstrated in [Section 2](https://arxiv.org/html/2605.19220#S2), they quantify the geometric compactness of a semantic cluster rather than the probability of correctness\. A score of 0\.85 is meaningless if it merely reflects that the model is stubbornly consistent in its error\. Consequently, we argue for evaluating quantification methods not in isolation, but through their utility in rigorous downstream applications, specifically Conformal Prediction \(CP\) \(Quach et al\., [2023](https://arxiv.org/html/2605.19220#bib.bib26); Su et al\., [2024](https://arxiv.org/html/2605.19220#bib.bib115)\)\. By transforming vague uncertainty scores into prediction sets that contain the true answer with a user\-specified probability, CP forces the uncertainty estimate to confront reality\. We view CP as an evaluation framework that exposes information through the resulting prediction sets\.

In this context, the Efficiency \(Set Size\) of the conformal set serves as a critical, truth\-aware metric\. Consider two UQ methods A and B used as nonconformity scores at 90% target coverage\. Both are guaranteed the same coverage, but if A assigns high confidence to hallucinated answers, it must include more candidates to maintain coverage, resulting in larger sets\. By comparing set sizes at equal coverage, we obtain a truth\-aware comparison that penalizes methods for confident hallucinations\. While a heuristic score might assign high confidence to a hallucination, a valid conformal predictor operating under a strict coverage constraint is mathematically forced to expand the prediction set to include the ground truth\. In the extreme case, “confident hallucination” manifests as a Set Size Explosion, where the prediction set grows to encompass a massive portion of the vocabulary\. A system that requires the entire dictionary to guarantee coverage is functionally useless\. Set size at equal coverage thus offers a diagnostic property that internal consistency metrics fundamentally lack\.
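
The comparison can be illustrated with a minimal split conformal sketch. It assumes a held\-out calibration set in which the nonconformity score of the true answer is known for each question, uses the standard finite\-sample quantile $\lceil (n+1)(1-\alpha) \rceil / n$, and relies on calibration and test questions being exchangeable. The function names, scores, and candidate answers are hypothetical.

```python
import math


def conformal_threshold(true_answer_scores, alpha=0.1):
    """Split conformal quantile of calibration nonconformity scores.

    true_answer_scores: nonconformity of the TRUE answer per calibration question
    (higher means the method considered the truth less plausible).
    """
    n = len(true_answer_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(true_answer_scores)[k - 1]


def prediction_set(candidate_scores, qhat):
    """Keep every candidate whose nonconformity is at most the threshold."""
    return [c for c, s in candidate_scores.items() if s <= qhat]


# Method A is sometimes confidently wrong: the truth gets high nonconformity.
cal_a = [0.1, 0.1, 0.2, 0.2, 0.1, 0.9, 0.2, 0.1, 0.95]
# Method B tracks correctness: the truth always gets low nonconformity.
cal_b = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.4]

qhat_a = conformal_threshold(cal_a, alpha=0.1)
qhat_b = conformal_threshold(cal_b, alpha=0.1)

# Test question with four candidates; "Paris" is the true answer.
test_a = {"Paris": 0.9, "Lyon": 0.1, "Nice": 0.5, "Lille": 0.8}
test_b = {"Paris": 0.2, "Lyon": 0.6, "Nice": 0.7, "Lille": 0.9}

set_a = prediction_set(test_a, qhat_a)
set_b = prediction_set(test_b, qhat_b)
```

Both sets contain the true answer, but method A needs all four candidates to do so while method B needs one; the confident errors in A's calibration scores inflate its threshold and therefore its set size.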

#### 5\.2\.2 Post\-Training for Native Uncertainty

The fundamental limitation of the strategies discussed thus far is their retroactive nature: they attempt to extract signal from a model that was never trained to express uncertainty \([Heo et al\.,](https://arxiv.org/html/2605.19220#bib.bib134)\)\. Consequently, we argue that the community must pivot from standard Instruction Tuning to a rigorous practice of Uncertainty Alignment\.

This paradigm shift requires utilizing Post\-Training stages, such as Reinforcement Learning from Human Feedback \(RLHF\), to explicitly incentivize the articulation of confidence levels \(Lin et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib132)\)\. The training objective should reward the accurate prefacing of generations with granular verbal markers \(e\.g\., “I am confident that…” versus “It is possible that…”\) \(Stangel et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib135); Ulmer et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib136)\)\. Mechanistically, this process reorganizes the latent space by optimizing for these distinct linguistic headers on Out\-of\-Distribution data\. We anchor the model’s internal representations directly to explicit validity claims\. This transforms uncertainty from a latent geometric artifact, which current methods struggle to interpret without supervision, into a transparent, communicated feature of the generation itself\.

### 5\.3 Grounding: Establishing Objective Truth

We conclude our structural critique by addressing the absence of objective verification in current methodologies\. To resolve the unsupervised nature of existing UQ, the field must break the closed loop of using models to judge themselves and instead anchor evaluation in external reality\.

#### 5\.3\.1 Mandatory “Unit Testing”

Evaluating quantification methods on open\-ended creative generation is theoretically unsound because the calculation of rigorous metrics like AUROC and TPR at low FPR requires a binary definition of failure\. These critical safety metrics are mathematically indeterminate in subjective domains where the ground truth is fluid or debatable\. We argue that any quantification method must pass Gold\-Standard Unit Tests in verifiable environments before being applied to open\-ended generation\.

The necessity of this standard arises from the requirement for absolute labels to validate the separation capability of metrics like AUROC\. We define a verifiable environment as any setting where the validity of an output can be algorithmically determined without reliance on another language model \(Yao et al\., [2022](https://arxiv.org/html/2605.19220#bib.bib130)\)\. Such environments include, but are not limited to, code generation benchmarks \(Chen, [2021](https://arxiv.org/html/2605.19220#bib.bib128); [Jimenez et al\.,](https://arxiv.org/html/2605.19220#bib.bib129)\), where correctness is proven by execution, and mathematical reasoning datasets \([Hendrycks et al\.,](https://arxiv.org/html/2605.19220#bib.bib131)\), where the final answer is a fixed constant\. If a method fails to correlate confidence with correctness in these deterministic settings, where the distinction between truth and falsehood is unambiguous, it lacks the credibility to judge the reliability of open\-ended tasks\.

#### 5\.3\.2 Atomic Fact Verification

While verifiable environments serve as the initial filter, we must extend this rigor to the open\-ended domains\. Relying on another large language model to score reliability in these unstructured contexts constitutes circular reasoning, as it essentially validates one model’s hallucinations with another model’s biases\. We argue for replacing subjective scoring with Atomic Fact Verification \(Xie et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib97); Zheng et al\., [2025](https://arxiv.org/html/2605.19220#bib.bib98)\), which acts as a rigorous decomposition standard\. This protocol mandates the decomposition of complex narratives into atomic claims that function as indivisible units of information\. Verification must then proceed through diverse external authorities tailored to the claim type\. This includes not only cross\-referencing against search engines and structured knowledge bases but also the integration of formal theorem provers like Lean4 \(Moura and Ullrich, [2021](https://arxiv.org/html/2605.19220#bib.bib133)\) for logical validity and the utilization of deep search agents capable of performing multi\-hop evidence retrieval\. This approach transforms the label space from consistency to objective factuality, providing the external validation that is required to truly benchmark the performance of quantification methods against reality\.

## 6 Conclusion

This position paper concludes that current UQ for LLMs fundamentally operates as unsupervised clustering, measuring internal consistency rather than external factual correctness\. We have demonstrated that this structural limitation renders existing methods blind to confident hallucinations, thereby creating a deceptive sense of safety in high\-stakes deployments\. By identifying the critical pathologies, we argue that the community must shift from unsupervised heuristics toward a supervised guarantee framework\. This proposed paradigm shift is essential to transform the uncertainty of models into a reliable proxy for reality and ensure the trustworthy integration of LLMs into society\.

## Acknowledgment

The work was partially supported by NSF award \#2442477 and \#2550203\. We thank Amazon Research Awards, Cisco Faculty Research Awards, and Toyota Faculty Research Awards\. The authors acknowledge Google and OpenAI for providing us with API credits and Research Computing at Arizona State University for providing computing resources\. The views and conclusions in this paper are those of the authors and should not be interpreted as representing any funding agencies\.

## Impact Statement

This paper presents a critique of current research practices in Uncertainty Quantification for Large Language Models\. If adopted, our recommendations on evaluation, grounding, and mechanism could lead to the more robust deployment of AI systems in critical fields like healthcare and law\. Conversely, continuing to rely on internal consistency heuristics poses a severe risk of deploying overconfident models that fail silently in high\-stakes scenarios, creating a deceptive sense of safety for end\-users\.

## References

- M\. Abdin, J\. Aneja, H\. Behl, S\. Bubeck, R\. Eldan, S\. Gunasekar, M\. Harrison, R\. J\. Hewett, M\. Javaheripi, P\. Kauffmann,et al\.\(2024\)Phi\-4 technical report\.arXiv preprint arXiv:2412\.08905\.Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p1.1)\.
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px5.p2.1)\.
- C\. C\. Aggarwal, A\. Hinneburg, and D\. A\. Keim \(2001\)On the surprising behavior of distance metrics in high dimensional space\.InInternational conference on database theory,pp\. 420–434\.Cited by:[§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p1.1)\.
- L\. Aichberger, K\. Schweighofer, M\. Ielanskyi, and S\. Hochreiter \(2025\)Improving uncertainty estimation through semantically diverse language generation\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- A\. Azaria and T\. Mitchell \(2023\)The internal state of an LLM knows when it‘s lying\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 967–976\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.68/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.68)Cited by:[§2\.4](https://arxiv.org/html/2605.19220#S2.SS4.p4.1)\.
- M\. Barandas, D\. Folgado, R\. Santos, R\. Simão, and H\. Gamboa \(2022\)Uncertainty\-based rejection in machine learning: implications for model development and interpretability\.Electronics11\(3\),pp\. 396\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.19220#S5.SS1.SSS1.Px1.p4.1)\.
- M\. Beigi, S\. Wang, Y\. Shen, Z\. Lin, A\. Kulkarni, J\. He, F\. Chen, M\. Jin, J\. Cho, D\. Zhou,et al\.\(2024\)Rethinking the uncertainty: a critical review and analysis in the era of large language models\.arXiv preprint arXiv:2410\.20199\.Cited by:[§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p1.1)\.
- Q\. Cao, A\. Gambardella, T\. Kojima, Y\. Matsuo, and Y\. Iwasawa \(2026\)Semantic token clustering for efficient uncertainty quantification in large language models\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 2: Short Papers\),V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 682–696\.External Links:[Link](https://aclanthology.org/2026.eacl-short.49/),[Document](https://dx.doi.org/10.18653/v1/2026.eacl-short.49),ISBN 979\-8\-89176\-381\-4Cited by:[§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- N\. Carlini, S\. Chien, M\. Nasr, S\. Song, A\. Terzis, and F\. Tramer \(2022\)Membership inference attacks from first principles\.In2022 IEEE symposium on security and privacy \(SP\),pp\. 1897–1914\.Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p4.1),[§5\.1\.1](https://arxiv.org/html/2605.19220#S5.SS1.SSS1.Px1.p1.1),[§5\.1\.1](https://arxiv.org/html/2605.19220#S5.SS1.SSS1.Px1.p4.1)\.
- N\. Cecere, A\. Bacciu, I\. F\. Tobías, and A\. Mantrach \(2025\)Monte carlo temperature: a robust sampling strategy for llm’s uncertainty quantification methods\.arXiv preprint arXiv:2502\.18389\.Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p3.1),[§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px1.p2.1)\.
- M\. Chen \(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§5\.3\.1](https://arxiv.org/html/2605.19220#S5.SS3.SSS1.p2.1)\.
- T\. Chen, X\. Liu, L\. Da, J\. Chen, V\. Papalexakis, and H\. Wei \(2025\)Uncertainty quantification of large language models through multi\-dimensional responses\.arXiv preprint arXiv:2502\.16820\.Cited by:[§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p3.1)\.
- T\. Chen, X\. Liu, V\. Nandam, K\. Liou, and H\. Wei \(2026a\)Conformal feedback alignment: quantifying answer\-level reliability for robust LLM alignment\.InFindings of the Association for Computational Linguistics: EACL 2026,V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 3561–3572\.External Links:[Link](https://aclanthology.org/2026.findings-eacl.184/),[Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.184),ISBN 979\-8\-89176\-386\-9Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p4.1)\.
- T\. Chen, H\. Yao, J\. Chen, E\. E\. Papalexakis, and H\. Wei \(2026b\)Every response counts: quantifying uncertainty of llm\-based multi\-agent systems through tensor decomposition\.arXiv preprint arXiv:2604\.08708\.Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p1.1)\.
- L\. Da, T\. Chen, L\. Cheng, and H\. Wei \(2024a\)Llm uncertainty quantification through directional entailment graph and claim level response augmentation\.arXiv preprint arXiv:2407\.00994\.Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.SSS0.Px1.p1.4),[§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1),[§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px3.p1.1)\.
- L\. Da, K\. Liou, T\. Chen, X\. Zhou, X\. Luo, Y\. Yang, and H\. Wei \(2024b\)Open\-ti: open traffic intelligence with augmented language model\.International Journal of Machine Learning and Cybernetics15\(10\),pp\. 4761–4786\.Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p1.1)\.
- L\. Da, X\. Liu, J\. Dai, L\. Cheng, Y\. Wang, and H\. Wei \(2025a\)Understanding the uncertainty of llm explanations: a perspective based on reasoning topology\.arXiv preprint arXiv:2502\.17026\.Cited by:[§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p3.1)\.
- L\. Da, X\. Liu, J\. Dai, L\. Cheng, Y\. Wang, and H\. Wei \(2025b\)Understanding the uncertainty of llm explanations: a perspective based on reasoning topology\.InSecond Conference on Language Modeling,Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p1.1)\.
- S\. Devic, T\. Srinivasan, J\. Thomason, W\. Neiswanger, and V\. Sharan \(2025\)From calibration to collaboration: llm uncertainty quantification should be more human\-centered\.arXiv preprint arXiv:2506\.07461\.Cited by:[§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px2.p2.1)\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630\(8017\),pp\. 625–630\.Cited by:[§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p3.1)\.
- V\. Flovik \(2024\)Quantifying distribution shifts and uncertainties for enhanced model robustness in machine learning applications\.arXiv preprint arXiv:2405\.01978\.Cited by:[§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px1.p2.1)\.
- X\. Gao, J\. Zhang, L\. Mouatadid, and K\. Das \(2024\)SPUQ: perturbation\-based uncertainty quantification for large language models\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 2336–2346\.Cited by:[§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p3.1)\.
- Y\. Gui, Y\. Jin, and Z\. Ren \(2024\)Conformal alignment: knowing when to trust foundation models with guarantees\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=YzyCEJlV9Z)Cited by:[§1](https://arxiv.org/html/2605.19220#S1.p4.1)\.
- P\. He, X\. Liu, J\. Gao, and W\. Chen \(2020\)Deberta: decoding\-enhanced bert with disentangled attention\.arXiv preprint arXiv:2006\.03654\.Cited by:[footnote 1](https://arxiv.org/html/2605.19220#footnote1)\.
- P\. He, X\. Liu, J\. Gao, and W\. Chen \(2021\)DeBERTa: decoding\-enhanced bert with disentangled attention\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=XPZIaotutsD)Cited by:[§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p2.6)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\. Measuring mathematical problem solving with the math dataset\.InThirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\),Cited by:[§5\.3\.1](https://arxiv.org/html/2605.19220#S5.SS3.SSS1.p2.1)\.
- J\. Heo, M\. Xiong, C\. Heinze\-Deml, and J\. Narain\. Do llms estimate uncertainty well in instruction\-following?\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§5\.2\.2](https://arxiv.org/html/2605.19220#S5.SS2.SSS2.p1.1)\.
- M\. Jiang, Y\. Ruan, P\. Sattigeri, S\. Roukos, and T\. Hashimoto \(2024\)Graph\-based uncertainty metrics for long\-form language model generations\.Advances in Neural Information Processing Systems37,pp\. 32980–33006\.Cited by:[§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan\. SWE\-bench: can language models resolve real\-world github issues?\.InThe Twelfth International Conference on Learning Representations,Cited by:[§5\.3\.1](https://arxiv.org/html/2605.19220#S5.SS3.SSS1.p2.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, et al\. \(2022\) Language models \(mostly\) know what they know\. arXiv preprint arXiv:2207\.05221\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p2.1), [§2\.3](https://arxiv.org/html/2605.19220#S2.SS3.p1.1), [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px5.p1.1), [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px5.p2.1)\.
- A\. Kalavasis, A\. Mehrotra, and G\. Velegkas \(2025\) On the limits of language generation: trade\-offs between hallucination and mode\-collapse\. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pp\. 1732–1743\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px3.p2.1)\.
- Y\. Kim, H\. Jeong, S\. Chen, S\. S\. Li, C\. Park, M\. Lu, K\. Alhamoud, J\. Mun, C\. Grau, M\. Jung, et al\. \(2025\) Medical hallucinations in foundation models and their impact on healthcare\. arXiv preprint arXiv:2503\.05777\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px4.p2.1)\.
- M\. Kirchhof, L\. Füger, A\. Golinski, E\. G\. Dhekane, A\. Blaas, and S\. Williamson \(2025\) Self\-reflective uncertainties: do llms know their internal answer distribution? In ICML 2025 Workshop on Reliable and Responsible Foundation Models\. Cited by: [§2\.3](https://arxiv.org/html/2605.19220#S2.SS3.p1.1)\.
- T\. M\. Kodinariya, P\. R\. Makwana, et al\. \(2013\) Review on determining number of cluster in k\-means clustering\. International Journal 1\(6\), pp\. 90–95\. Cited by: [§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p1.1)\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. In The Eleventh International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=VD-AYtP0dve)\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p1.1), [§1](https://arxiv.org/html/2605.19220#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1), [§2\.4](https://arxiv.org/html/2605.19220#S2.SS4.p2.1), [§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p3.1), [§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p3.1), [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px3.p1.1)\.
- Z\. Li, H\. Chen, H\. Tan, L\. Lan, Y\. Sui, and J\. Ren \(2025\) Uncertainty quantification for black\-box llms via star graphs connectivity: exploring alternatives for semantic density\. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp\. 273–289\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1)\.
- Z\. Li, S\. Shen, W\. Yang, R\. Jin, H\. Chen, J\. Ren, et al\. Enhancing uncertainty quantification in large language models through semantic graph density\. In The 41st Conference on Uncertainty in Artificial Intelligence\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) Teaching models to express their uncertainty in words\. arXiv preprint arXiv:2205\.14334\. Cited by: [§5\.2\.2](https://arxiv.org/html/2605.19220#S5.SS2.SSS2.p2.1)\.
- Z\. Lin, S\. Trivedi, and J\. Sun \(2023\) Generating with confidence: uncertainty quantification for black\-box large language models\. arXiv preprint arXiv:2305\.19187\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p1.1), [§1](https://arxiv.org/html/2605.19220#S1.p2.1), [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1), [§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p2.1), [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px3.p1.1)\.
- Z\. Lin, S\. Trivedi, and J\. Sun \(2024\) Generating with confidence: uncertainty quantification for black\-box large language models\. Transactions on Machine Learning Research\. External Links: ISSN 2835\-8856, [Link](https://openreview.net/forum?id=DWkJCSxKU5)\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.SSS0.Px1.p1.4), [§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p3.1)\.
- S\. Liu, Z\. Li, X\. Liu, R\. Zhan, D\. F\. Wong, L\. S\. Chao, and M\. Zhang \(2024\) Can llms learn uncertainty on their own? expressing uncertainty effectively in a self\-training manner\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 21635–21645\. Cited by: [§2\.3](https://arxiv.org/html/2605.19220#S2.SS3.p1.1)\.
- X\. Liu, T\. Chen, L\. Da, C\. Chen, Z\. Lin, and H\. Wei \(2025a\) Uncertainty quantification and confidence calibration in large language models: a survey\. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2, pp\. 6107–6117\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p1.1)\.
- X\. Liu, Z\. Lin, L\. Da, C\. Chen, S\. Trivedi, and H\. Wei \(2025b\) MCQA\-eval: efficient confidence evaluation in nlg with gold\-standard correctness labels\. arXiv preprint arXiv:2502\.14268\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p3.1), [Figure 3](https://arxiv.org/html/2605.19220#S3.F3), [Figure 3](https://arxiv.org/html/2605.19220#S3.F3.2.1), [§3\.3](https://arxiv.org/html/2605.19220#S3.SS3.p3.1)\.
- H\. Ma, J\. Pan, J\. Liu, Y\. Chen, J\. T\. Zhou, G\. Wang, Q\. Hu, H\. Wu, C\. Zhang, and H\. Wang \(2025\) Semantic energy: detecting llm hallucination beyond entropy\. arXiv preprint arXiv:2508\.14496\. Cited by: [§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- S\. Machcha, S\. Yerra, S\. Sultana, H\. Yu, and Z\. Yao \(2025\) Do large language models know when not to answer in medical qa? In Proceedings of the 2nd Workshop on Uncertainty\-Aware NLP \(UncertaiNLP 2025\), pp\. 27–35\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px3.p2.1)\.
- A\. Malinin and M\. Gales \(2020\) Uncertainty estimation in autoregressive structured prediction\. arXiv preprint arXiv:2002\.07650\. Cited by: [§2\.4](https://arxiv.org/html/2605.19220#S2.SS4.p3.1)\.
- P\. Manakul, A\. Liusie, and M\. Gales \(2023\) Selfcheckgpt: zero\-resource black\-box hallucination detection for generative large language models\. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp\. 9004–9017\. Cited by: [§2\.3](https://arxiv.org/html/2605.19220#S2.SS3.p1.1), [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px2.p2.1)\.
- L\. H\. McCabe, R\. Melamed, T\. Hartvigsen, and H\. H\. Huang \(2025\) Estimating semantic alphabet size for llm uncertainty quantification\. arXiv preprint arXiv:2509\.14478\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p1.1), [§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- L\. d\. Moura and S\. Ullrich \(2021\) The lean 4 theorem prover and programming language\. In International Conference on Automated Deduction, pp\. 625–635\. Cited by: [§5\.3\.2](https://arxiv.org/html/2605.19220#S5.SS3.SSS2.p1.1)\.
- A\. Ng, M\. Jordan, and Y\. Weiss \(2001\) On spectral clustering: analysis and an algorithm\. Advances in neural information processing systems 14\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.SSS0.Px2.p1.1)\.
- D\. Nguyen, A\. Payani, and B\. Mirzasoleiman \(2025\) Beyond semantic entropy: boosting llm uncertainty quantification with pairwise semantic similarity\. arXiv preprint arXiv:2506\.00245\. Cited by: [§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- A\. Nikitin, J\. Kossen, Y\. Gal, and P\. Marttinen \(2024\) Kernel language entropy: fine\-grained uncertainty quantification for llms from semantic similarities\. Advances in Neural Information Processing Systems 37, pp\. 8901–8929\. Cited by: [§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- A\. Podkopaev\. Uncertainty quantification under distribution shifts\. Ph\.D\. Thesis, Carnegie Mellon University\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px1.p2.1)\.
- V\. Quach, A\. Fisch, T\. Schuster, A\. Yala, J\. H\. Sohn, T\. S\. Jaakkola, and R\. Barzilay \(2023\) Conformal language modeling\. In The Twelfth International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p4.1), [§5\.2\.1](https://arxiv.org/html/2605.19220#S5.SS2.SSS1.p1.1)\.
- M\. Z\. Rodriguez, C\. H\. Comin, D\. Casanova, O\. M\. Bruno, D\. R\. Amancio, L\. d\. F\. Costa, and F\. A\. Rodrigues \(2019\) Clustering algorithms: a comparative approach\. PloS one 14\(1\), pp\. e0210236\. Cited by: [§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p1.1)\.
- A\. Santilli, A\. Golinski, M\. Kirchhof, F\. Danieli, A\. Blaas, M\. Xiong, L\. Zappella, and S\. Williamson \(2025\) Revisiting uncertainty quantification evaluation in language models: spurious interactions with response length bias results\. arXiv preprint arXiv:2504\.13677\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px3.p2.1)\.
- A\. Simhi, I\. Itzhak, F\. Barez, G\. Stanovsky, and Y\. Belinkov \(2025\) Trust me, I’m wrong: LLMs hallucinate with certainty despite knowing the answer\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 14665–14688\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.792/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.792), ISBN 979\-8\-89176\-335\-7\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p2.1), [§3\.2](https://arxiv.org/html/2605.19220#S3.SS2.p1.1)\.
- H\. Song, R\. Ji, N\. Shi, F\. Lai, and R\. A\. Kontar \(2025\) Inv\-entropy: a fully probabilistic framework for uncertainty quantification in language models\. arXiv preprint arXiv:2506\.09684\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px1.p2.1)\.
- P\. Stangel, D\. Bani\-Harouni, C\. Pellegrini, E\. Özsoy, K\. Zaripova, M\. Keicher, and N\. Navab \(2025\) Rewarding doubt: a reinforcement learning approach to calibrated confidence expression of large language models\. arXiv preprint arXiv:2503\.02623\. Cited by: [§5\.2\.2](https://arxiv.org/html/2605.19220#S5.SS2.SSS2.p2.1)\.
- J\. Su, J\. Luo, H\. Wang, and L\. Cheng \(2024\) Api is enough: conformal prediction for large language models without logit\-access\. arXiv preprint arXiv:2403\.01216\. Cited by: [§5\.2\.1](https://arxiv.org/html/2605.19220#S5.SS2.SSS1.p1.1)\.
- K\. Tian, E\. Mitchell, A\. Zhou, A\. Sharma, R\. Rafailov, H\. Yao, C\. Finn, and C\. D\. Manning\. Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\. In The 2023 Conference on Empirical Methods in Natural Language Processing\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px5.p2.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, et al\. \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p1.1)\.
- D\. Ulmer, A\. Lorson, I\. Titov, and C\. Hardmeier \(2025\) Anthropomimetic uncertainty: what verbalized uncertainty in language models is missing\. arXiv preprint arXiv:2507\.10587\. Cited by: [§5\.2\.2](https://arxiv.org/html/2605.19220#S5.SS2.SSS2.p2.1)\.
- R\. Vashurin, E\. Fadeeva, A\. Vazhentsev, L\. Rvanova, D\. Vasilev, A\. Tsvigun, S\. Petrakov, R\. Xing, A\. Sadallah, K\. Grishchenkov, et al\. \(2025\) Benchmarking uncertainty quantification methods for large language models with lm\-polygraph\. Transactions of the Association for Computational Linguistics 13, pp\. 220–248\. Cited by: [§3\.1](https://arxiv.org/html/2605.19220#S3.SS1.p2.1)\.
- R\. Vashurin, M\. Goloburda, A\. Ilina, A\. Rubashevskii, P\. Nakov, A\. Shelmanov, and M\. Panov \(2026\) CoCoA: a minimum bayes risk framework bridging confidence and consistency for uncertainty quantification in LLMs\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=H1NGlLNaVC)\. Cited by: [§2\.1](https://arxiv.org/html/2605.19220#S2.SS1.p1.1)\.
- N\. X\. Vinh and M\. E\. Houle \(2010\) A set correlation model for partitional clustering\. In Proceedings of the 14th Pacific\-Asia Conference on Advances in Knowledge Discovery and Data Mining \- Volume Part I, PAKDD’10, Berlin, Heidelberg, pp\. 4–15\. External Links: ISBN 3642136567, [Link](https://doi.org/10.1007/978-3-642-13657-3_4), [Document](https://dx.doi.org/10.1007/978-3-642-13657-3%5F4)\. Cited by: [§3](https://arxiv.org/html/2605.19220#S3.p1.1)\.
- U\. Von Luxburg \(2007\) A tutorial on spectral clustering\. Statistics and computing 17\(4\), pp\. 395–416\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.SSS0.Px2.p1.1)\.
- T\. Wang, A\. Kulkarni, T\. Cody, P\. A\. Beling, Y\. Yan, and D\. Zhou \(2025\) GENUINE: graph enhanced multi\-level uncertainty estimation for large language models\. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp\. 20522–20541\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1)\.
- T\. Xi, C\. Wang, and J\. Zhang \(2026\) Confidence introspection: a self\-reflection method for reliable and helpful large language models\. IEEE Transactions on Audio, Speech and Language Processing\. Cited by: [§2\.3](https://arxiv.org/html/2605.19220#S2.SS3.p1.1)\.
- Z\. Xie, R\. Xing, Y\. Wang, J\. Geng, H\. Iqbal, D\. Sahnan, I\. Gurevych, and P\. Nakov \(2025\) FIRE: fact\-checking with iterative retrieval and verification\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 2901–2914\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p4.1), [§5\.3\.2](https://arxiv.org/html/2605.19220#S5.SS3.SSS2.p1.1)\.
- M\. Xiong, Z\. Hu, X\. Lu, Y\. Li, J\. Fu, J\. He, and B\. Hooi \(2023\) Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\. arXiv preprint arXiv:2306\.13063\. Cited by: [§2\.3](https://arxiv.org/html/2605.19220#S2.SS3.p1.1)\.
- Q\. Yang, S\. Ravikumar, F\. Schmitt\-Ulms, S\. Lolla, E\. Demir, I\. Elistratov, A\. Lavaee, S\. Lolla, E\. Ahmadi, D\. Rus, et al\. \(2023\) Uncertainty\-aware language modeling for selective question answering\. arXiv preprint arXiv:2311\.15451\. Cited by: [§4](https://arxiv.org/html/2605.19220#S4.SS0.SSS0.Px2.p2.1)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\) Webshop: towards scalable real\-world web interaction with grounded language agents\. Advances in Neural Information Processing Systems 35, pp\. 20744–20757\. Cited by: [§5\.3\.1](https://arxiv.org/html/2605.19220#S5.SS3.SSS1.p2.1)\.
- X\. Zhao, H\. Peng, D\. Su, X\. Zeng, C\. Liu, J\. Liao, and P\. S\. Yu \(2025\) SeSE: black\-box uncertainty quantification for large language models based on structural information theory\. arXiv preprint arXiv:2511\.16275v3\. Cited by: [§2\.2](https://arxiv.org/html/2605.19220#S2.SS2.p1.1)\.
- L\. Zheng, C\. Li, Z\. Liu, F\. Huang, H\. Jia, Z\. Ye, and X\. Zhang \(2025\) Fact in fragments: deconstructing complex claims via llm\-based atomic fact extraction and verification\. arXiv preprint arXiv:2506\.07446\. Cited by: [§1](https://arxiv.org/html/2605.19220#S1.p4.1), [§5\.3\.2](https://arxiv.org/html/2605.19220#S5.SS3.SSS2.p1.1)\.
