Integrating Local and Global Entropy for Uncertainty Quantification in LLMs
Summary
This paper proposes Global-Local Uncertainty (GLU), an unsupervised single-pass score that fuses token-level local entropy with hidden-state geometric global entropy for uncertainty quantification in LLMs, showing that the two are near-orthogonal and together capture confident-but-wrong failures.
# Integrating Local and Global Entropy for Uncertainty Quantification in LLMs
Source: [https://arxiv.org/html/2606.09875](https://arxiv.org/html/2606.09875)
Johanne Medina (Qatar Computing Research Institute, HBKU, Doha, Qatar; jomedina@hbku.edu.qa), Tianyi Zhou (KTH Royal Institute of Technology, Stockholm, Sweden; tzho@kth.se), Keivin Isufaj (Qatar Computing Research Institute, HBKU, Doha, Qatar; keisufaj@hbku.edu.qa), Aristides Gionis (KTH Royal Institute of Technology, Stockholm, Sweden; argioni@kth.se), Sanjay Chawla (Qatar Computing Research Institute, HBKU, Doha, Qatar; schawla@hbku.edu.qa)
###### Abstract
Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the *global* uncertainty of LLMs, while treating token-level uncertainty estimation as a *local* metric. We show that hidden-state geometric entropy (*global* uncertainty) and token-level entropy (*local* uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic. Code is available at [https://github.com/qcri/GLU.git](https://github.com/qcri/GLU.git).
## 1 Introduction
Hallucination remains one of the most persistent failure modes of large language models (LLMs). Despite rapid advances in capability, frontier systems continue to produce fluent, specific, and incorrect answers. A recent cross-domain benchmark spanning 6,000 questions across 42 topics found that fewer than 1% of evaluated models score above zero on a $[-100, 100]$ reliability index, with the best-performing frontier model reaching only 33 (Jackson et al., [2025](https://arxiv.org/html/2606.09875#bib.bib1)). Uncertainty quantification (UQ) offers a principled path forward, and a wide range of methods have been proposed, from token-level entropy (Zhang et al., [2024](https://arxiv.org/html/2606.09875#bib.bib2)) and sampling-based consistency (Farquhar et al., [2024](https://arxiv.org/html/2606.09875#bib.bib3); Yadkori et al., [2024a](https://arxiv.org/html/2606.09875#bib.bib4)) to evidential (Sensoy et al., [2018](https://arxiv.org/html/2606.09875#bib.bib5); Ma et al., [2025](https://arxiv.org/html/2606.09875#bib.bib6)) and attention-based approaches (Sriramanan et al., [2024](https://arxiv.org/html/2606.09875#bib.bib7); Skean et al., [2025](https://arxiv.org/html/2606.09875#bib.bib8)); yet few are deployed in practice. Existing methods impose at least one of the following barriers: multiple generation passes, task-specific supervision, sensitivity to response length, or architectural assumptions. We argue that a practical UQ score should instead be *unsupervised*, computed in a *single forward pass*, *length-normalized*, and *architecture-agnostic*.
The linear representation hypothesis (Park et al., [2024](https://arxiv.org/html/2606.09875#bib.bib9)) posits that high-level concepts are encoded as directions in the hidden-state space of LLMs. When a model retrieves a well-encoded fact, its hidden states move in a tight, consistent direction, implying that the geometry of retrieval is compact. When it lacks reliable factual grounding, the hidden states wander through loosely related directions as the model pieces together a plausible-sounding reply. Consider a model asked for the birth year of a figure it has not reliably encoded. It may produce a confident four-digit number while its representations drift through adjacent concepts rather than converging on a specific memory. We observe that this wandering is measurable. Motivated by Skean et al. ([2025](https://arxiv.org/html/2606.09875#bib.bib8)), who show that intermediate hidden layers encode richer representations than the final layer and that information-theoretic geometry metrics effectively characterize representation quality, we quantify this drift via the entropy of the eigenvalue distribution of the hidden-state trajectory's Gram matrix. This geometric signal is computed directly from the hidden states of standard greedy generation with no additional forward pass.
The distinction between the embedding (hidden-state vector space) and unembedding (next-token prediction) layers is not merely architectural. Park et al. ([2024](https://arxiv.org/html/2606.09875#bib.bib9)) show that the embedding space obeys a non-Euclidean geometry in which concepts are encoded as directions, and that this structure is fundamentally different from the output space where token probabilities are computed. This separation provides a principled basis for why token-level entropy and hidden-state geometric complexity measure different things: one operates in the space of next-token distributions, the other in the space of semantic directions, and the two need not agree. Since the two failure modes arise from structurally distinct layers, we term them *global* (hidden-state geometric complexity) and *local* (token-level entropy) uncertainty, and ask how best to combine them. Figure [1](https://arxiv.org/html/2606.09875#S4.F1) confirms that both signals are individually informative yet neither is sufficient. Correct responses cluster tightly in the low local-uncertainty region while incorrect responses concentrate at high global uncertainty, but the low-$x$ region, where token-level methods declare confidence, still contains a non-trivial fraction of incorrect responses that token entropy alone cannot recover. Jointly, the two signals span a far wider discriminative range than either achieves alone. We study additive and multiplicative fusion strategies across architectures and tasks, finding that multiplicative combination consistently dominates.
This paper makes the following contributions: (1) We extract complementary uncertainty signals from two distinct representational layers: a global geometric signal from the embedding layer via matrix Rényi entropy of hidden-state trajectories, and a local token-level signal from the unembedding layer via Shannon and evidential epistemic entropy. Together, they provide a principled, multi-view perspective on LLM uncertainty that goes beyond logit-only approaches. (2) We show empirically that the two signals are statistically near-orthogonal and cover distinct uncertainty regimes. In particular, the global signal recovers the *confident-but-wrong* failure mode that local signals systematically miss. (3) We propose Global-Local Uncertainty (GLU), a lightweight multiplicative fusion of the two signals that requires no labeled data, no additional forward passes, and is normalized for response length. (4) We validate GLU across three model families and six benchmarks, and conduct comprehensive ablation studies on the choice of local and global uncertainty estimators, demonstrating both the effectiveness of the framework and the individual contribution of each component.
## 2 Related Work
#### Local and output\-distributional uncertainty\.
A common approach estimates LLM reliability from generation-time signals such as sequence likelihood, token entropy, model-elicited confidence, or probabilistic views of next-token inference (Kadavath et al., [2022](https://arxiv.org/html/2606.09875#bib.bib10); Dalal and Misra, [2024](https://arxiv.org/html/2606.09875#bib.bib11); Ma et al., [2025](https://arxiv.org/html/2606.09875#bib.bib6)). LogTokU is especially relevant: it argues that probabilities alone can lose evidence-strength information and instead estimates token-level aleatoric and epistemic uncertainty directly from logits, building on the broader evidential-learning view that uncertainty can be represented through evidence over class probabilities (Sensoy et al., [2018](https://arxiv.org/html/2606.09875#bib.bib5); Ma et al., [2025](https://arxiv.org/html/2606.09875#bib.bib6)). Zhou et al. further show that token-level uncertainty is useful for confabulation detection, using it to select and aggregate hidden states for response-level reliability prediction (Zhou et al., [2026](https://arxiv.org/html/2606.09875#bib.bib12)). Other methods use local distributional features during decoding, such as layer-wise logit contrast, or generate multiple samples and measure agreement or semantic dispersion (Zhang et al., [2024](https://arxiv.org/html/2606.09875#bib.bib2); Manakul et al., [2023](https://arxiv.org/html/2606.09875#bib.bib13); Farquhar et al., [2024](https://arxiv.org/html/2606.09875#bib.bib3); Nikitin et al., [2024](https://arxiv.org/html/2606.09875#bib.bib14); Yadkori et al., [2024a](https://arxiv.org/html/2606.09875#bib.bib4)). These approaches are effective and often minimally intrusive, but they primarily measure uncertainty in tokens or sampled outputs rather than the global coherence of the model's internal trajectory.
#### Representation\-based and global uncertainty\.
A complementary line of work studies whether reliability is encoded in hidden states, layer activations, representation geometry, or attention behavior\. Layer\-by\-Layer shows that intermediate layers can contain especially informative representations and analyzes them using information\-theoretic, geometric, and invariance\-based measures\(Skean et al\.,[2025](https://arxiv.org/html/2606.09875#bib.bib8)\)\. RAUQ similarly moves beyond output probabilities by identifying uncertainty\-aware attention heads whose attention patterns correlate with incorrect generations, producing efficient sequence\-level uncertainty estimates without task\-specific labels\(Vazhentsev et al\.,[2026](https://arxiv.org/html/2606.09875#bib.bib15)\)\. More broadly, probing and representation\-based methods use hidden states or activations to predict factuality, confabulation, or response reliability\(Li et al\.,[2024](https://arxiv.org/html/2606.09875#bib.bib16); Orgad et al\.,[2024](https://arxiv.org/html/2606.09875#bib.bib17); Zhou et al\.,[2026](https://arxiv.org/html/2606.09875#bib.bib12); Sriramanan et al\.,[2024](https://arxiv.org/html/2606.09875#bib.bib7)\)\. These methods capture response\-level failures that token probabilities can miss, but representation\-only signals may overlook local knowledge gaps already visible in the output distribution\.
#### Supervised detectors and calibration dependence\.
Several hallucination detectors learn reliability decision boundaries from labeled or pseudo-labeled generations (Li et al., [2024](https://arxiv.org/html/2606.09875#bib.bib16); Orgad et al., [2024](https://arxiv.org/html/2606.09875#bib.bib17); Zhou et al., [2026](https://arxiv.org/html/2606.09875#bib.bib12); Du et al., [2024](https://arxiv.org/html/2606.09875#bib.bib18)). While effective in-distribution, these methods depend on calibration data and may be sensitive to shifts across datasets, model families, or generation settings. We include a supervised-detector comparison in Appendix [A](https://arxiv.org/html/2606.09875#A1).
GLU instead combines local and global uncertainty without learning a detector\. It uses hidden\-state geometry to amplify token\-level uncertainty when the response trajectory is diffuse, capturing both local uncertainty and response\-level drift without labeled calibration data, pseudo\-labels, or retraining\.
## 3 Preliminary
We formally introduce the notation of this paper. Let $\mathbf{x}$ denote a prompt and $\mathbf{y}=(y_{1},\dots,y_{T})$ denote a generated response of length $T$, where each token $y_{t}$ is drawn from a vocabulary $\mathcal{V}$ of size $V$. At generation step $t$ and transformer layer $\ell$, the language model produces a hidden-state vector $\mathbf{h}_{t}^{(\ell)}\in\mathbb{R}^{d}$. The model then autoregressively predicts the next token using the representation at the final layer $L$. Specifically, the final hidden state is mapped to the logit vector

$$\mathbf{z}_{t}=\mathbf{W}_{U}\mathbf{h}_{t}^{(L)}\in\mathbb{R}^{V},$$

where $\mathbf{W}_{U}$ is the unembedding matrix. The conditional distribution over the next token is given by

$$p(y_{t}\mid\mathbf{x},\mathbf{y}_{<t})=\mathrm{softmax}(\mathbf{z}_{t}).$$

Next, we collect hidden states across all $T$ generation steps at layer $\ell$ into the *representation matrix*

$$\mathbf{H}^{(\ell)}=\begin{bmatrix}\mathbf{h}_{1}^{(\ell)}\\ \vdots\\ \mathbf{h}_{T}^{(\ell)}\end{bmatrix}\in\mathbb{R}^{T\times d},\tag{1}$$

where each row encodes the model's internal state at one generation step. More concretely, $\mathbf{H}^{(\ell)}$ encodes the full *trajectory* of the response in representation space: the sequence of rows traces how the model's internal representation evolves as it generates each token. We simply write $\mathbf{H}$ when there is no ambiguity.

We define the *Gram matrix* of the response as

$$\mathbf{K}=\mathbf{H}\mathbf{H}^{\top}\in\mathbb{R}^{T\times T},\tag{2}$$

whose $(i,j)$-th entry $\mathbf{K}_{ij}=\langle\mathbf{h}_{i}^{(\ell)},\mathbf{h}_{j}^{(\ell)}\rangle$ measures the similarity between the representations at steps $i$ and $j$. Intuitively, $\mathbf{K}$ captures the geometric structure of the response trajectory: when the model generates a coherent response, the hidden states tend to concentrate along a few dominant directions, making $\mathbf{K}$ low-rank; when the response is uncertain or incoherent, the hidden states scatter across many directions, making $\mathbf{K}$ closer to a scaled identity.
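To make the low-rank versus scaled-identity intuition concrete, the following minimal sketch builds the Gram matrix of Eq. 2 for two toy trajectories. The helper `gram_matrix` and the matrices `coherent` and `diffuse` are illustrative assumptions with made-up numbers, not hidden states from any model, and plain Python lists stand in for the tensor library one would normally use.

```python
def gram_matrix(H):
    """Return K = H H^T for a T x d trajectory given as a list of rows."""
    return [[sum(a * b for a, b in zip(h_i, h_j)) for h_j in H] for h_i in H]


# Hypothetical toy trajectories with T = 3 steps and d = 3 dimensions.
coherent = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]  # one shared direction
diffuse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # mutually orthogonal steps

K_coherent = gram_matrix(coherent)  # every row is a multiple of [1, 2, 3]: rank one
K_diffuse = gram_matrix(diffuse)    # the 3 x 3 identity matrix

for name, K in [("coherent", K_coherent), ("diffuse", K_diffuse)]:
    print(name, K)
```

In this toy case the coherent trajectory yields a rank-one Gram matrix, while the orthogonal trajectory yields exactly the identity, the two extremes described above.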
## 4 Global and Local Uncertainty of LLMs
Uncertainty in LLM generation appears at two complementary levels. At the *token level*, the model may be uncertain over plausible next tokens, capturing local aleatoric or epistemic uncertainty. At the *response level*, the hidden-state trajectory $\mathbf{H}$ may become geometrically complex, spreading across multiple directions in latent space; this structure is captured by the eigenvalues and eigenvectors of its representation geometry (Skean et al., [2025](https://arxiv.org/html/2606.09875#bib.bib8)). These signals are not redundant: as Figure [1](https://arxiv.org/html/2606.09875#S4.F1) shows, a response can have confident token predictions while its global trajectory drifts, producing a plausible but hallucinated answer. We therefore measure both token-level uncertainty (local) and geometric entropy (global).
Figure 1: Local and global signals capture complementary uncertainty information. Each point is a response from Qwen 2.5-7B on TriviaQA (blue = correct, red = incorrect). $x$-axis: mean Shannon entropy over the most uncertain tokens (local). $y$-axis: geometric complexity of the hidden-state trajectory (global). Contours show the product of the two signals, used only for visualization. Left: token entropy provides the primary separation, while geometry further discriminates within each regime; quadrant error rates confirm the two views jointly span a wider uncertainty range than either alone. Right (zoom): among below-median-entropy responses, geometry still separates many incorrect ones. The starred example, low local but high global uncertainty, is the failure mode that motivates fusing both views.

### 4.1 Local Uncertainty: Token-level Entropy
We first introduce token-level uncertainty measures to characterize local uncertainty during generation. Following the evidential deep learning framework (Sensoy et al., [2018](https://arxiv.org/html/2606.09875#bib.bib5); Ma et al., [2025](https://arxiv.org/html/2606.09875#bib.bib6)), we decompose uncertainty into an aleatoric component (AU), capturing distributional ambiguity, and an epistemic component (EU), capturing lack of evidence.

For each generated token $y_{t}$, let $\tau_{(1)},\dots,\tau_{(k)}$ denote the indices of the top-$k$ vocabulary candidates ranked by logit value. We define a truncated predictive distribution over these candidates by renormalizing their logits:

$$p_{j}=\frac{e^{z_{\tau_{(j)}}}}{\sum_{j'=1}^{k}e^{z_{\tau_{(j')}}}}.\tag{3}$$

The aleatoric uncertainty is then defined as the Shannon entropy of this truncated distribution:

$$\mathrm{AU}(t)=-\sum_{j=1}^{k}p_{j}\log_{2}p_{j}.\tag{4}$$

To estimate epistemic uncertainty, we follow the evidential interpretation of logits as evidence supporting different candidate tokens. Specifically, we map each logit to a positive Dirichlet concentration parameter using the softplus transformation:

$$\alpha_{j}=\ln\left(1+e^{z_{\tau_{(j)}}}\right).\tag{5}$$

Compared with the ReLU mapping used in prior work (Ma et al., [2025](https://arxiv.org/html/2606.09875#bib.bib6)), the softplus function guarantees strictly positive concentration parameters, which is consistent with the Dirichlet parameterization. We therefore define epistemic uncertainty as the inverse Dirichlet strength:

$$\mathrm{EU}(t)=\frac{k}{\sum_{j=1}^{k}(\alpha_{j}+1)},\tag{6}$$

where the denominator denotes the total Dirichlet strength. Intuitively, larger total evidence implies lower epistemic uncertainty.

A token is considered locally uncertain when its predictive distribution is diffuse (high AU) and its supporting evidence is weak (high EU). We combine these two complementary factors using a multiplicative interaction score:

$$R(t)=-\mathrm{AU}(t)\cdot\mathrm{EU}(t).\tag{7}$$

Because of the negative sign, a more uncertain token receives a more negative score. Averaging uncertainty uniformly across all generated tokens can dilute salient uncertain positions due to low-information or highly predictable tokens (e.g., stop words or formatting tokens). We therefore focus on the $m$ most uncertain positions. Let $\mathcal{W}_{m}$ denote the set of the $m$ token indices with the largest magnitude $|R(t)|$, that is, the most negative values of $R(t)$. We define the aggregated local uncertainty score as

$$\bar{R}=\frac{1}{m}\sum_{t\in\mathcal{W}_{m}}R(t).\tag{8}$$
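The local score of Eqs. 3 to 8 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the example logits, and the values `k=3` and `m=2` are hypothetical, and "most uncertain" is read as "most negative $R(t)$", following the sign of Eq. 7.

```python
import math


def softplus(z):
    """Numerically stable ln(1 + e^z), the evidence mapping of Eq. 5."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def token_score(logits, k=3):
    """R(t) = -AU(t) * EU(t) for one token, computed from its top-k logits."""
    top = sorted(logits, reverse=True)[:k]
    k = len(top)
    # Eq. 3: softmax restricted to the top-k logits (shifted by the max for stability).
    exps = [math.exp(z - top[0]) for z in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Eq. 4: Shannon entropy in bits.
    au = -sum(p * math.log2(p) for p in probs if p > 0.0)
    # Eqs. 5-6: softplus evidence and inverse Dirichlet strength.
    eu = k / sum(softplus(z) + 1.0 for z in top)
    return -au * eu  # Eq. 7


def aggregate(scores, m=2):
    """Eq. 8: mean of the m most uncertain (assumed: most negative) token scores."""
    worst = sorted(scores)[:m]
    return sum(worst) / len(worst)


# Hypothetical logits for three generated tokens.
per_token_logits = [
    [10.0, 0.0, 0.0, -1.0],  # peaked: confident token
    [1.0, 1.0, 1.0, -1.0],   # flat over the top 3: uncertain token
    [4.0, 3.5, 0.0, -2.0],   # two close candidates
]
scores = [token_score(z) for z in per_token_logits]
r_bar = aggregate(scores, m=2)
print(scores, r_bar)
```

With these toy logits the flat token receives the most negative score and the peaked token a score close to zero, so the aggregate is driven by the two least confident positions.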
### 4.2 Global Uncertainty: Rényi Entropy of Hidden-State Gram Matrices
We define *geometric entropy* as the Rényi entropy of the normalized eigenspectrum of the hidden-state Gram matrix, quantifying the effective dispersion of representations across latent geometric directions.

We characterize global uncertainty through the geometric complexity of hidden-state representations. As established in Section [3](https://arxiv.org/html/2606.09875#S3), the Gram matrix $\mathbf{K}^{(\ell)}=\mathbf{H}^{(\ell)}{\mathbf{H}^{(\ell)}}^{\top}\in\mathbb{R}^{T\times T}$ encodes the pairwise similarity structure of the response trajectory at layer $\ell$. Its eigenvalues $\lambda_{1}\geq\cdots\geq\lambda_{T}\geq 0$ characterize how the trajectory's variance is distributed across orthogonal directions in representation space: a sharply peaked spectrum reflects a low-rank, coherent trajectory, while a flat spectrum reflects a diffuse, uncertain one. We now convert this spectral structure into a single scalar uncertainty score.

#### Entropy of the normalized spectrum.

Normalizing the eigenvalues as $\tilde{\lambda}_{i}=\lambda_{i}/\mathrm{tr}(\mathbf{K}^{(\ell)})$ yields a probability distribution over principal directions. We measure its dispersion via the order-$\alpha$ Rényi entropy of that distribution (Skean et al., [2025](https://arxiv.org/html/2606.09875#bib.bib8)):

$$S_{\alpha}(\mathbf{K})=\frac{1}{1-\alpha}\log\Bigl(\sum_{i=1}^{T}\tilde{\lambda}_{i}^{\,\alpha}\Bigr),\qquad\alpha>0,\;\alpha\neq 1.\tag{9}$$

We use $\alpha=2$ (collision entropy), which admits the closed form

$$S_{2}(\mathbf{K})=-\log\mathrm{tr}\left(\left(\frac{\mathbf{K}}{\mathrm{tr}(\mathbf{K})}\right)^{2}\right)=-\log\frac{\|\mathbf{K}\|_{F}^{2}}{\mathrm{tr}(\mathbf{K})^{2}},\tag{10}$$

and therefore requires only the squared Frobenius norm of $\mathbf{K}$ rather than its full eigendecomposition. Forming $\mathbf{K}=\mathbf{H}\mathbf{H}^{\top}$ costs $O(T^{2}d)$, after which $\|\mathbf{K}\|_{F}^{2}$ is computed in $O(T^{2})$; the eigendecomposition alternative would cost $O(T^{3})$. This advantage is substantial in practice since $d\gg T$ for large language models (hidden dimension $d$ in the thousands, response length $T$ in the tens to hundreds). Low $S_{2}$ indicates that the trajectory concentrates along a few dominant directions (coherent generation); high $S_{2}$ indicates a diffuse trajectory (geometrically incoherent generation). We give the detailed proof in Appendix [B](https://arxiv.org/html/2606.09875#A2).
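The closed form of Eq. 10 can be checked on two cases whose spectra are known analytically. The sketch below is illustrative only: the function name `renyi2_entropy` and the toy matrices are assumptions, the natural logarithm is assumed, and plain Python lists replace a tensor library.

```python
import math


def renyi2_entropy(H):
    """S_2 of the Gram matrix K = H H^T via Eq. 10, without eigendecomposition."""
    K = [[sum(a * b for a, b in zip(h_i, h_j)) for h_j in H] for h_i in H]
    trace = sum(K[i][i] for i in range(len(K)))
    frobenius_sq = sum(v * v for row in K for v in row)
    return -math.log(frobenius_sq / trace ** 2)


T = 4
# Orthonormal rows: K is the identity, so the normalized spectrum is uniform.
diffuse = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]
# Rows that are multiples of one vector: K has a single nonzero eigenvalue.
coherent = [[float(i + 1), 0.0, 0.0, 0.0] for i in range(T)]

print(renyi2_entropy(diffuse), math.log(T))  # uniform spectrum: S_2 matches log T
print(renyi2_entropy(coherent))              # single direction: S_2 is zero up to rounding
```

The uniform spectrum attains the maximum $\log T$ and the rank-one trajectory gives zero, matching the two regimes described in the text. An all-zero trajectory would make the trace zero and is not handled in this sketch.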
#### Normalization for cross\-response and cross\-model comparability\.
We adopt two normalizations to make $\tilde{S}$ comparable across responses and models. First, $S_{2}(\mathbf{K}^{(\ell)})$ grows with $T$ and attains its maximum of $\log T$ when the spectrum is uniform ($\tilde{\lambda}_{i}=1/T$); we therefore divide by $1+\log T$ so the normalized score lies in $[0,1)$ regardless of response length. Second, because representation quality varies with depth and no single layer is uniformly informative across architectures, we average the length-normalized entropy over all $L$ layers to obtain

$$\tilde{S}=\frac{1}{L}\sum_{\ell=1}^{L}\frac{S_{2}\left(\mathbf{K}^{(\ell)}\right)}{1+\log T}\in[0,1).\tag{11}$$

This layer-agnostic formulation avoids model-specific layer selection and yields a single, response-level uncertainty score computable in one forward pass.
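The two normalizations amount to a single averaging step once the per-layer entropies are available. The following sketch assumes the $1+\log T$ divisor stated in the text; the function name and the per-layer values are hypothetical.

```python
import math


def normalized_global_entropy(s2_per_layer, T):
    """Average per-layer S_2 values after dividing each by (1 + log T)."""
    return sum(s2_per_layer) / (len(s2_per_layer) * (1.0 + math.log(T)))


# Hypothetical per-layer entropies for a response of T = 20 tokens.
T = 20
s2_per_layer = [0.4, 1.1, 1.9, 2.3]
s_tilde = normalized_global_entropy(s2_per_layer, T)
print(s_tilde)

# Worst case: every layer has a uniform spectrum, so S_2 = log T in each layer.
upper = normalized_global_entropy([math.log(T)] * 4, T)
print(upper, upper < 1.0)
```

Even in the worst case the score equals $\log T/(1+\log T)$, which stays strictly below 1 for every response length, and the divisor remains nonzero for a single-token response.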
### 4.3 Global-Local Fusion
The two signals cover distinct regimes (Fig. [1](https://arxiv.org/html/2606.09875#S4.F1)): correct responses cluster at low local uncertainty but spread broadly along $\tilde{S}$, while incorrect responses include a *confident-but-wrong* subset, with low token entropy yet elevated geometric complexity, that token-level detectors miss entirely. We fuse the two via a multiplicative gate:

$$\boxed{\;\mathrm{GLU}=(1+\tilde{S})\,\bar{R}\;}\tag{12}$$

The factor $(1+\tilde{S})$ acts as a geometric amplifier: when the trajectory is diffuse, local uncertainty is up-weighted; when it is compact, GLU reduces to the local signal alone. This directly targets the confident-but-wrong failure mode, which additive fusion cannot isolate because it treats the global signal as a fixed offset rather than a modulator.
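A short numerical illustration of the gate in Eq. 12 follows. The function name `glu` and all input values are hypothetical; the local score is taken to be negative, following the sign of Eq. 7, so a more negative GLU value means higher uncertainty.

```python
def glu(s_tilde, r_bar):
    """Eq. 12: multiplicative gate (1 + S~) * R-bar."""
    return (1.0 + s_tilde) * r_bar


# Two hypothetical responses with the same local score but different geometry.
r_bar = -0.2                     # negative by the sign convention of Eq. 7
compact = glu(0.1, r_bar)        # compact trajectory: score stays close to r_bar
diffuse = glu(0.8, r_bar)        # diffuse trajectory: score is pushed further down

print(compact, diffuse)
print(glu(0.0, r_bar) == r_bar)  # with S~ = 0 the gate leaves the local score unchanged
```

Because the global term scales the local score instead of shifting it, two responses that token-level entropy cannot tell apart are ranked differently once their trajectories differ.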
## 5 Experiments and Ablation Studies
In this section, we evaluate GLU across six benchmarks and three model families. To empirically validate the design choices motivated in Section [4.3](https://arxiv.org/html/2606.09875#S4.SS3), we conduct comprehensive ablation studies over local and global uncertainty components, isolating the contribution of each and confirming that the multiplicative geometric–probabilistic combination outperforms alternatives.
#### Datasets and Models\.
We evaluate on six benchmarks spanning factuality, knowledge retrieval, mathematical reasoning, Arabic QA, long-form generation, and multi-turn reasoning: TruthfulQA (Lin et al., [2022](https://arxiv.org/html/2606.09875#bib.bib19)), TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.09875#bib.bib20)), MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.09875#bib.bib21)), ArabicaQA (Abdallah et al., [2024](https://arxiv.org/html/2606.09875#bib.bib22)), LongForm (Köksal et al., [2023](https://arxiv.org/html/2606.09875#bib.bib23)), and the GSM8K multi-turn setting from Laban et al. ([2025](https://arxiv.org/html/2606.09875#bib.bib24)). Further information on the composition of the datasets is provided in Appendix [C](https://arxiv.org/html/2606.09875#A3). We evaluate three instruction-tuned model families: Qwen2.5-7B (Qwen Team et al., [2025](https://arxiv.org/html/2606.09875#bib.bib25)), Gemma3-12B (Gemma Team et al., [2025](https://arxiv.org/html/2606.09875#bib.bib26)), and Fanar1-9B (Fanar Team et al., [2025](https://arxiv.org/html/2606.09875#bib.bib27)), an Arabic-centric Arabic–English model continually pretrained from Gemma-9B. This selection allows us to test whether GLU remains effective across different model families, languages, and latent-space configurations.
#### Evaluation protocol\.
We adopt greedy decoding to generate responses for each model–benchmark pair, and store the corresponding hidden states, logits, and probability distributions. All uncertainty estimation methods are applied post hoc to this shared set of model outputs, ensuring a fair and controlled comparison. We report two complementary evaluation metrics. AUROC measures how well an uncertainty score discriminates between correct and incorrect responses (higher is better; $0.5$ corresponds to random guessing). The prediction rejection ratio (PRR) is a more recent metric that quantifies how well an uncertainty score supports selective abstention; we use it because it is the primary metric adopted by RAUQ (Vazhentsev et al., [2026](https://arxiv.org/html/2606.09875#bib.bib15)), and we refer the reader to Malinin and Gales ([2021](https://arxiv.org/html/2606.09875#bib.bib28)) for its full definition. The two metrics capture complementary aspects of performance: AUROC reflects discriminative capability, whereas PRR assesses whether the uncertainty ranking is effective for selective prediction.
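For readers less familiar with AUROC in this setting, the sketch below computes it with the standard pairwise (Mann-Whitney) formulation. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function name, scores, and labels are hypothetical, and scores are negated first on the assumption that a more negative score means higher uncertainty.

```python
def auroc(uncertainty, is_incorrect):
    """Probability that a random incorrect response is scored as more uncertain
    than a random correct one, counting ties as one half."""
    wrong = [u for u, bad in zip(uncertainty, is_incorrect) if bad]
    right = [u for u, bad in zip(uncertainty, is_incorrect) if not bad]
    wins = sum(1.0 if w > r else 0.5 if w == r else 0.0 for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


# Hypothetical scores (more negative = more uncertain) and correctness labels.
scores = [-0.40, -0.05, -0.30, -0.10, -0.08]
is_incorrect = [True, False, True, False, True]

# Negate so that larger values mean higher uncertainty before ranking.
print(auroc([-s for s in scores], is_incorrect))
```

The quadratic pairwise loop is chosen for readability; a rank-based implementation gives the same value more efficiently on large evaluation sets.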
#### Baselines and Ablations\.
The baselines cover the major families of unsupervised uncertainty estimation methods: LogProb (Yadkori et al., [2024b](https://arxiv.org/html/2606.09875#bib.bib29)) and LogTokU (Ma et al., [2025](https://arxiv.org/html/2606.09875#bib.bib6)), both local token-level methods; P(True) (Kadavath et al., [2022](https://arxiv.org/html/2606.09875#bib.bib10)), a self-evaluation method; and RAUQ (Vazhentsev et al., [2026](https://arxiv.org/html/2606.09875#bib.bib15)), an attention-based uncertainty estimator. The GLU ablations isolate the contribution of individual components in Eq. [12](https://arxiv.org/html/2606.09875#S4.E12). A full breakdown of the ablations is summarized in Table [1](https://arxiv.org/html/2606.09875#S5.T1).
Table 1:Methods and ablations\. For GLU variants, the final score combines a global geometric uncertainty termSSwith a local token\-level uncertainty termu¯\\bar\{u\}asGLU\(x,y\)=\(1\+S\)u¯\\mathrm\{GLU\}\(x,y\)=\(1\+S\)\\bar\{u\}, where larger uncertainty corresponds to a more negative score\.MethodGlobalLocalExplanationBaselinesLogProb–1T∑t=1Tlogp\(yt∣y<t,x\)\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\\log p\(y\_\{t\}\\mid y\_\{<t\},x\)Average token log\-probability of the generated response\.LogTokU–−AUEDL\(t\)⋅EU\(t\)\-\\mathrm\{AU\}\_\{\\mathrm\{EDL\}\}\(t\)\\cdot\\mathrm\{EU\}\(t\)Token\-level uncertainty baseline using Dirichlet aleatoric uncertainty and epistemic uncertainty\(Ma et al\.,[2025](https://arxiv.org/html/2606.09875#bib.bib6)\)\.P\(True\)p\(“True”∣x,y\)p\(\\text\{\`\`True''\}\\mid x,y\)–The model self\-evaluates whether its generated answer is true\(Kadavath et al\.,[2022](https://arxiv.org/html/2606.09875#bib.bib10)\)\.RAUQRAUQ\(x,y\)\\mathrm\{RAUQ\}\(x,y\)–Recurrently fuses uncertainty\-aware attention\-head activations with token probabilities; peaks over a middle\-layer subset\(Vazhentsev et al\.,[2026](https://arxiv.org/html/2606.09875#bib.bib15)\)\.Ablations on Eq\.[12](https://arxiv.org/html/2606.09875#S4.E12)GLUS~=1L∑ℓ=1LSαℓ1\+logT\\tilde\{S\}=\\dfrac\{1\}\{L\}\\\!\\sum\_\{\\ell=1\}^\{L\}\\dfrac\{S\_\{\\alpha\}^\{\\ell\}\}\{1\+\\log T\}u¯ShE×EU=1k∑t∈𝒦−SE\(t\)EU\(t\)\\bar\{u\}\_\{\\mathrm\{ShE\}\\times\\mathrm\{EU\}\}=\\frac\{1\}\{k\}\\\!\\sum\_\{t\\in\\mathcal\{K\}\}\{\-\\mathrm\{SE\}\(t\)\\,\\mathrm\{EU\}\(t\)\}Length\-normalised matrix Rényi\-2 entropy averaged across allLLlayers, combined with the mean of thekkmost uncertain local scores\.GLU\-EDLS~\\tilde\{S\}u¯EDL=1k∑t∈𝒦−AUEDL\(t\)EU\(t\)\\bar\{u\}\_\{\\mathrm\{EDL\}\}=\\frac\{1\}\{k\}\\\!\\sum\_\{t\\in\\mathcal\{K\}\}\{\-\\mathrm\{AU\}\_\{\\mathrm\{EDL\}\}\(t\)\\,\\mathrm\{EU\}\(t\)\}Replaces the local ShETokU term with the LogTokU\-style Dirichlet token 
uncertainty.

| Variant | Global | Local | Description |
|---|---|---|---|
| GLU-$\bar{S}$ | $\bar{S}_{\alpha}=\dfrac{1}{L}\sum_{\ell=1}^{L}S_{\alpha}^{\ell}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$ | Removes the $1/(1+\log T)$ length normalisation; both $\tilde{S}$ and $\bar{S}_{\alpha}$ average Rényi entropy across all $L$ layers. |
| GLU-$\tilde{S}^{*}$ | $\max_{\ell}\,\tilde{S}^{\ell}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$ | Selects the single most uncertain length-normalised layer instead of averaging over all $L$ layers as in $\tilde{S}$. |
| GLU-SP | $\tilde{S}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}^{\mathrm{sp}}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{SE}(t)\,\mathrm{EU}^{\mathrm{sp}}(t)}$ | Uses softplus evidence $\alpha_{k}=\mathrm{softplus}(\ell_{k})+\varepsilon$ instead of $\mathrm{ReLU}(\ell_{k})+1$ in the Dirichlet epistemic uncertainty term. |
| GLU-DK | $\tilde{S}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}^{\,k^{\prime}}$, $k^{\prime}=\left\lfloor k/(1+\tilde{S})\right\rfloor$ | Uses an adaptive local window size, shrinking the number of selected uncertain tokens as global geometric uncertainty increases. |
| GLU-AU | $\tilde{S}$ | $\bar{u}_{\mathrm{ShE}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{SE}(t)}$ | Drops the epistemic factor entirely; retains only the local Shannon entropy term amplified by $\tilde{S}$. |
| GLU-EU | $\tilde{S}$ | $\bar{u}_{\mathrm{EU}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{EU}(t)}$ | Drops Shannon entropy entirely; retains only the Dirichlet epistemic uncertainty term amplified by $\tilde{S}$. |
| GLU-$S_{\alpha}$-AU | $\bar{S}_{\alpha}$ | $\bar{u}_{\mathrm{ShE}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{SE}(t)}$ | Jointly ablates length normalisation and the epistemic factor; tests whether raw multi-layer geometry pairs best with Shannon-only local uncertainty. |
| GLU-$S_{\alpha}$-SP | $\bar{S}_{\alpha}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}^{\mathrm{sp}}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{SE}(t)\,\mathrm{EU}^{\mathrm{sp}}(t)}$ | Tests whether softplus evidence pairs better with unnormalised depth-averaged entropy than with $\tilde{S}$. |
| GLU-$S^{*}$ | $S_{\alpha}^{*}=\max_{\ell}S_{\alpha}^{\ell}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$ | Uses the single most uncertain raw layer instead of the depth-averaged global entropy. |
| GLU-EU-SP | $\tilde{S}$ | $\bar{u}_{\mathrm{EU}^{\mathrm{sp}}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{EU}^{\mathrm{sp}}(t)}$ | Uses only the softplus-based evidential uncertainty term, dropping the Shannon entropy factor. |
| GLU-$S_{\alpha}$-SP-EU | $\bar{S}_{\alpha}$ | $\bar{u}_{\mathrm{EU}^{\mathrm{sp}}}=\frac{1}{k}\sum_{t\in\mathcal{K}}{-\mathrm{EU}^{\mathrm{sp}}(t)}$ | Combines raw depth-averaged matrix Rényi entropy with the softplus-based evidential-only local score. |
| *Additive fusion variants* | | | |
| Add-$S_{\alpha}$ | $\bar{S}_{\alpha}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$ | Uses additive fusion, $\bar{S}_{\alpha}+\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$, instead of the multiplicative GLU interaction. |
| Add-$\tilde{S}$ | $\tilde{S}$ | $\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$ | Uses additive fusion, $\tilde{S}+\bar{u}_{\mathrm{ShE}\times\mathrm{EU}}$, with length-normalized geometry. |

Table 2: Response reliability estimation across three models and six datasets, evaluated by AUROC and PRR. LogTokU captures a purely *local* signal from token-level output logits; RAUQ captures a *global* signal from intermediate hidden-state and attention representations; GLU forms their *multiplicative* fusion, achieving the best score in 13/18 cells on AUROC and PRR and ranking first or second in the remaining cells. Bold: best per row; underline: second best.

Table 3: Mean performance across all 18 model–dataset settings, ranked among 14 methods. The Space column denotes the signal each method uses: L (local, token-level) and G (global, hidden-state geometry); $\mathrm{L}\times\mathrm{G}$ marks multiplicative fusion and $\mathrm{L}+\mathrm{G}$ additive fusion. GLU ranks first on both mean AUROC (0.676) and mean PRR (0.369), above the global-signal baseline RAUQ (rank 12) and the local-signal baseline LogTokU (rank 14). The multiplicative ablations cluster just below it (ranks 2–9), while the additive variant Add-$\tilde{S}$ falls to rank 13, isolating the multiplicative fusion as the source of the gain. Full results and rankings are in Appendix [D](https://arxiv.org/html/2606.09875#A4).

| Group | Method | Space | AUROC | Rank | PRR | Rank |
|---|---|---|---|---|---|---|
| Proposed | GLU | $\mathrm{L}\times\mathrm{G}$ | 0.676 | 1 | 0.369 | 1 |
| Baseline | LogProb | L | 0.655 | 10 | 0.326 | 10 |
| Baseline | LogTokU | L | 0.574 | 14 | 0.144 | 14 |
| Baseline | P(true) | G | 0.654 | 11 | 0.311 | 11 |
| Baseline | RAUQ | G | 0.648 | 12 | 0.305 | 12 |
| Ablation | GLU-SP | $\mathrm{L}\times\mathrm{G}$ | 0.675 | 3 | 0.367 | 2 |
| Ablation | GLU-$\tilde{S}^{*}$ | $\mathrm{L}\times\mathrm{G}$ | 0.674 | 4 | 0.366 | 3 |
| Ablation | GLU-DK | $\mathrm{L}\times\mathrm{G}$ | 0.675 | 2 | 0.366 | 4 |
| Ablation | GLU-$\bar{S}$ | $\mathrm{L}\times\mathrm{G}$ | 0.674 | 5 | 0.365 | 5 |
| Ablation | GLU-$S_{\alpha}$-SP | $\mathrm{L}\times\mathrm{G}$ | 0.673 | 6 | 0.363 | 6 |
| Ablation | GLU-AU | $\mathrm{L}\times\mathrm{G}$ | 0.671 | 7 | 0.362 | 7 |
| Ablation | GLU-$S_{\alpha}$-AU | $\mathrm{L}\times\mathrm{G}$ | 0.667 | 8 | 0.357 | 8 |
| Ablation | GLU-$S^{*}$ | $\mathrm{L}\times\mathrm{G}$ | 0.665 | 9 | 0.351 | 9 |
| Ablation | Add-$\tilde{S}$ | $\mathrm{L}+\mathrm{G}$ | 0.621 | 13 | 0.259 | 13 |
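As a reading aid for the ablation variants listed above, the following minimal Python sketch spells out three of the ingredients they toggle: the ReLU-based and softplus-based evidence maps, the adaptive window size $k'$, and the $1/(1+\log T)$ length normalisation. The function names and the default `eps` are assumptions made for illustration and are not taken from the released code; the sketch also assumes the normalisation is applied as a plain multiplicative factor.

```python
import math


def relu_evidence(logit: float) -> float:
    """ReLU-based evidence: alpha_k = ReLU(l_k) + 1."""
    return max(logit, 0.0) + 1.0


def softplus_evidence(logit: float, eps: float = 1e-6) -> float:
    """Softplus-based evidence: alpha_k = softplus(l_k) + eps (eps value is assumed)."""
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    return max(logit, 0.0) + math.log1p(math.exp(-abs(logit))) + eps


def adaptive_window(k: int, s_tilde: float) -> int:
    """Adaptive window k' = floor(k / (1 + S~)); shrinks as global uncertainty grows."""
    return math.floor(k / (1.0 + s_tilde))


def length_normalise(entropy: float, num_tokens: int) -> float:
    """Apply the 1 / (1 + log T) factor to a raw entropy value (assumed usage)."""
    return entropy / (1.0 + math.log(num_tokens))
```

For a negative logit the ReLU map returns exactly 1, whereas the softplus map still varies with the logit. Note that `adaptive_window` returns 0 once $\tilde{S} > k-1$; whether the original implementation clamps $k'$ is not stated in this section.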
### 5.1 The Complementary Error Signals
We demonstrate that token-level and global uncertainty signals cover distinct failure regimes and together reduce the *confident-but-wrong* failure mode that neither can address alone. Figure [1](https://arxiv.org/html/2606.09875#S4.F1) illustrates this complementarity for Qwen 2.5-7B on TriviaQA.
#### Token\-level uncertainty provides strong but incomplete separation\.
The left panel plots each response along two axes: aggregated token-level entropy (local uncertainty, $x$-axis) and hidden-state geometric complexity (global uncertainty, $y$-axis). Correct responses (blue) cluster tightly in the low local-uncertainty region yet spread broadly along the $y$-axis; incorrect responses (red) concentrate at high global uncertainty yet spread comparably along the $x$-axis. Both signals are individually informative, but neither is sufficient. Crucially, the low-$x$ region, where token-level methods declare confidence, still contains a non-trivial fraction of incorrect responses that token entropy alone cannot separate from correct ones. The error rates across quadrants confirm that the two signals jointly span a far wider discriminative range than either achieves alone.
#### Global geometry recovers confident\-but\-wrong responses\.
The most consequential failure mode for uncertainty-based error detection is a *confident-but-wrong* response (Zhou et al., [2026](https://arxiv.org/html/2606.09875#bib.bib12)), where the model generates an incorrect answer while maintaining low token-level entropy throughout. These responses occupy the upper-left region of the red cluster, with low token uncertainty yet elevated geometric complexity. Here, the model commits to each token with strong logit evidence, but its hidden-state trajectory drifts across many representational directions, signaling latent indecision invisible to token-level probes. The starred example makes this concrete: its token entropy falls below the median and would go unflagged by any token-level detector, yet its elevated geometric complexity pushes the combined score above the decision threshold, correctly identifying it as uncertain.
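The quadrant reading of the scatter plot can be sketched in a few lines of Python. This is a minimal illustration, assuming each response comes with one local score, one global score, and a binary incorrectness label, and assuming the quadrants are split at the per-signal medians; the function name and the median thresholds are choices made here, not details taken from the paper.

```python
import numpy as np


def quadrant_error_rates(local_u, global_u, incorrect):
    """Error rate in each (local, global) quadrant, split at the per-signal medians."""
    l = np.asarray(local_u, dtype=float)
    g = np.asarray(global_u, dtype=float)
    y = np.asarray(incorrect, dtype=bool)
    hi_l = l > np.median(l)   # high local (token-level) uncertainty
    hi_g = g > np.median(g)   # high global (geometric) uncertainty
    quadrants = {
        "low local / low global": ~hi_l & ~hi_g,
        "low local / high global": ~hi_l & hi_g,   # where confident-but-wrong responses would sit
        "high local / low global": hi_l & ~hi_g,
        "high local / high global": hi_l & hi_g,
    }
    # Empty quadrants are reported as NaN rather than raising.
    return {name: float(y[mask].mean()) if mask.any() else float("nan")
            for name, mask in quadrants.items()}
```

A high error rate in the "low local / high global" cell is the pattern described above: token entropy alone would rank those responses as reliable, while the global signal separates them.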
### 5.2 Main Results
Table [2](https://arxiv.org/html/2606.09875#S5.T2) reports AUROC and PRR for all methods across three models and six datasets; we discuss each metric in turn.
GLU achieves the best AUROC in 13 of 18 settings and ranks second in the remainder, indicating that multiplicative fusion of complementary signals outperforms either component in isolation in most settings. The pattern supports the complementarity hypothesis from Figure [1](https://arxiv.org/html/2606.09875#S4.F1). In absolute terms, GLU improves over the local-only baseline LogTokU by up to 22.0 AUROC percentage points (Gemma-TriviaQA: 0.817 vs. 0.597) and over the global-only baseline RAUQ by up to 11.7 percentage points (Gemma-TriviaQA: 0.817 vs. 0.700).
RAUQ overtakes GLU mainly in settings where the local token\-level signal is weak: ArabicaQA for Qwen and Fanar, and multiturn MATH for Gemma\. In these cases, LogTokU is near chance in AUROC and often has negative PRR, suggesting that token confidence is poorly aligned with correctness\. Since GLU multiplicatively combines the local and global signals, an uninformative local component can attenuate an otherwise strong global signal, allowing the pure\-global RAUQ baseline to edge ahead\. However, GLU usually remains close to RAUQ in these cases, indicating that the fusion degrades gracefully rather than failing outright\.
The PRR results show that GLU’s advantage carries over to selective prediction\. GLU achieves the best PRR in 12 of 18 settings and the highest mean PRR overall, indicating that its uncertainty rankings place unreliable responses closer to the rejection tail more effectively than either LogTokU or RAUQ alone\. This is important because the practical goal is not only to distinguish correct from incorrect responses, but also to prioritize which outputs should be deferred, flagged, or verified first\.
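For readers who want to reproduce the two metrics, the sketch below gives one common formulation of each: AUROC as the probability that an incorrect response receives higher uncertainty than a correct one, and PRR as the area between the method's rejection curve and the random baseline, normalised by the oracle's area. The exact PRR variant used in the paper (for example, any cap on the rejection rate) is not specified in this section, so the details here are assumptions, as are the function names.

```python
import numpy as np


def auroc(uncertainty, incorrect):
    """P(a random incorrect response is more uncertain than a random correct one); ties count 0.5."""
    u = np.asarray(uncertainty, dtype=float)
    y = np.asarray(incorrect, dtype=bool)
    pos, neg = u[y], u[~y]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


def _rejection_area(correct_in_keep_order):
    """Mean accuracy of the retained set as the tail of the ordering is rejected item by item."""
    kept = np.asarray(correct_in_keep_order, dtype=float)
    running_accuracy = np.cumsum(kept) / np.arange(1, len(kept) + 1)
    return float(running_accuracy.mean())


def prr(uncertainty, correct):
    """Prediction rejection ratio: (method - random) / (oracle - random)."""
    u = np.asarray(uncertainty, dtype=float)
    c = np.asarray(correct, dtype=float)
    area_method = _rejection_area(c[np.argsort(u, kind="stable")])  # most uncertain rejected first
    area_oracle = _rejection_area(np.sort(c)[::-1])                 # all errors rejected first
    area_random = float(c.mean())                                   # expected accuracy under random rejection
    return (area_method - area_random) / (area_oracle - area_random)
```

Under this formulation a perfect ranking yields a PRR of 1 and a ranking worse than random yields a negative value, which is how a near-chance local signal can end up with negative PRR. The ratio is undefined when all responses are correct or all are incorrect.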
### 5.3 Ablation Study
Table [3](https://arxiv.org/html/2606.09875#S5.T3) aggregates performance across all 18 settings and reveals three findings. First, GLU ranks first on both mean AUROC (0.676) and mean PRR (0.369), confirming that multiplicative fusion of global and local signals consistently outperforms any individual component: RAUQ ranks 12th and LogTokU 14th, showing that neither signal alone is sufficient. Second, all multiplicative ablations cluster tightly at ranks 2–9, indicating that the gain is robust to the choice of local and global uncertainty estimator. The performance gain is driven by the fusion principle rather than the specific choice of estimator. Third, and most decisively, the additive variant Add-$\tilde{S}$ drops to rank 13 (0.621 AUROC, 0.259 PRR), falling below all baselines except LogTokU. This sharp degradation isolates multiplicative fusion as the source of the gain: treating the global signal as a fixed offset rather than a modulator of local uncertainty fails to capture the interaction between the two signals that GLU is designed to exploit.
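One way to see why a product and a sum can rank responses differently is scale sensitivity: rescaling one signal by a positive constant leaves the ranking induced by a product unchanged, but generally changes the ranking induced by a sum. The sketch below illustrates this with synthetic positive scores. The data, the helper `ranks`, and the rescaling factor are assumptions made for illustration; the sketch does not reproduce the paper's estimators or sign conventions, and it is offered as one plausible contributing factor rather than the paper's own explanation.

```python
import numpy as np


def ranks(scores):
    """Rank position of each score (0 = smallest)."""
    return np.argsort(np.argsort(scores))


rng = np.random.default_rng(0)
local_u = rng.uniform(0.1, 1.0, size=200)    # synthetic local uncertainty scores
global_u = rng.uniform(0.1, 1.0, size=200)   # synthetic global uncertainty scores
scale = 64.0                                 # hypothetical change of units for the global signal

# Multiplicative fusion: the ranking is unchanged by the rescaling.
mult_same = np.array_equal(ranks(local_u * global_u),
                           ranks(local_u * (scale * global_u)))

# Additive fusion: the rescaled global term dominates and the ranking changes.
add_same = np.array_equal(ranks(local_u + global_u),
                          ranks(local_u + scale * global_u))
```

Because AUROC and PRR depend only on the ordering of responses, the relative scale of the two signals matters under additive fusion but not under multiplicative fusion.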
Figure 2: Binned reliability of GLU on TriviaQA. Responses are sorted by $-\mathrm{GLU}$ and partitioned into deciles; markers report the empirical incorrect-rate per bin with $95\%$ bootstrap CIs ($1{,}000$ resamples). Dashed line: cell-level base incorrect-rate. GLU yields a monotonic increase in incorrect-rate across confidence deciles for all three models, indicating a graded reliability signal rather than only separating extreme cases.
### 5.4 GLU on TriviaQA
We further analyze whether GLU provides a calibrated reliability signal on TriviaQA, beyond its aggregate discrimination performance\. For each model, we sort responses by increasing GLU uncertainty and partition them into deciles\. We then measure the empirical incorrect rate within each bin\.
Figure [2](https://arxiv.org/html/2606.09875#S5.F2) shows a consistent monotonic trend across all three models: responses assigned higher GLU uncertainty are more likely to be incorrect. The gap between the lowest- and highest-uncertainty bins reaches approximately 84 percentage points for Qwen, 83 percentage points for Gemma, and 66 percentage points for Fanar. These ranges are well above the model-level base incorrect rates, indicating that GLU does not simply reflect dataset-level difficulty or a binary correct/incorrect separation.
This result suggests that GLU captures a graded notion of answer reliability\. Low\-GLU responses are mostly correct, while high\-GLU responses concentrate a large fraction of errors\. Therefore, GLU can support reliability\-aware generation settings in which answers are selectively trusted, flagged, or routed for additional verification\.
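The decile analysis in this subsection can be sketched as follows: sort responses by uncertainty, split them into ten bins, and report the incorrect rate in each bin with a percentile bootstrap interval. The function name, the seed, and the use of a percentile bootstrap are assumptions made here; the paper states only that 95% bootstrap CIs with 1,000 resamples were used.

```python
import numpy as np


def decile_incorrect_rates(uncertainty, incorrect, n_bins=10, n_boot=1000, seed=0):
    """Per-bin incorrect rate with a 95% percentile-bootstrap interval, bins ordered by uncertainty."""
    rng = np.random.default_rng(seed)
    u = np.asarray(uncertainty, dtype=float)
    y = np.asarray(incorrect, dtype=float)
    order = np.argsort(u, kind="stable")          # least to most uncertain
    rows = []
    for bin_labels in np.array_split(y[order], n_bins):
        # Resample the bin with replacement and recompute its incorrect rate.
        boots = rng.choice(bin_labels, size=(n_boot, len(bin_labels)), replace=True).mean(axis=1)
        lo, hi = np.percentile(boots, [2.5, 97.5])
        rows.append((float(bin_labels.mean()), float(lo), float(hi)))
    return rows
```

A graded reliability signal corresponds to the first element of each row increasing from the first bin to the last, rather than jumping only between the extreme bins.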
## 6 Conclusion
We presented a multi-view perspective on LLM uncertainty grounded in the structural distinction between the embedding and unembedding layers. From this perspective, we extracted two complementary signals: a global geometric uncertainty from hidden-state trajectories, quantified via matrix Rényi entropy, and a local token-level uncertainty from the unembedding layer, based on Shannon and evidential epistemic entropy. We showed empirically that the two signals are complementary and cover distinct failure regimes, with the global signal recovering the confident-but-wrong responses that token-level methods systematically miss. Building on this complementarity, we proposed GLU, a lightweight multiplicative fusion that requires no labeled data, no additional forward passes, and is normalized for response length. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines and remains competitive with supervised methods that lack cross-dataset generalization. Ablation studies further confirm that the performance gain is robust to the choice of local and global uncertainty estimators, suggesting that the complementarity between the two layers is a structural property of LLMs rather than an artifact of any particular design choice.
#### Limitations\.
GLU requires access to hidden states and token\-level output distributions, making it most directly applicable to open\-weight models\. While GLU achieves the best mean AUROC and PRR across our evaluation, the relative strength of global and local signals varies by task; in some settings, RAUQ remains highly competitive, suggesting that adaptive fusion could further improve robustness\. Future work should explore principled layer selection, adaptive weighting, and validation on larger models, additional languages, and domain\-specific deployments\.
## References
- Jackson et al. [2025] Declan Jackson, William Keating, George Cameron, and Micah Hill-Smith. AA-Omniscience: Evaluating cross-domain knowledge reliability in large language models. *arXiv preprint arXiv:2511.13029*, 2025.
- Zhang et al. [2024] Jianyi Zhang, Da-Cheng Juan, Cyrus Rashtchian, Chun-Sung Ferng, Heinrich Jiang, and Yiran Chen. Sled: self logits evolution decoding for improving factuality in large language models. In *Proceedings of the 38th International Conference on Neural Information Processing Systems*, NIPS ’24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9798331314385.
- Farquhar et al. [2024] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. *Nature*, 630(8017):625–630, 2024.
- Yadkori et al. [2024a] Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev. Mitigating llm hallucinations via conformal abstention. (arXiv:2405.01563), April 2024a. doi:10.48550/arXiv.2405.01563. URL [http://arxiv.org/abs/2405.01563](http://arxiv.org/abs/2405.01563). arXiv:2405.01563 [cs].
- Sensoy et al. [2018] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. In *Proceedings of the 32nd International Conference on Neural Information Processing Systems*, NIPS’18, page 3183–3193, Red Hook, NY, USA, 2018. Curran Associates Inc.
- Ma et al. [2025] Huan Ma, Jingdong Chen, Joey Tianyi Zhou, Guangyu Wang, and Changqing Zhang. Estimating llm uncertainty with evidence. *arXiv preprint arXiv:2502.00290*, 2025.
- Sriramanan et al. [2024] Gaurang Sriramanan, Siddhant Bharti, Vinu Sankar Sadasivan, Shoumik Saha, Priyatham Kattakinda, and Soheil Feizi. Llm-check: Investigating detection of hallucinations in large language models. *Advances in Neural Information Processing Systems*, 37:34188–34216, 2024.
- Skean et al. [2025] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In *Proceedings of the 42nd International Conference on Machine Learning*, PMLR 267, 2025.
- Park et al. [2024] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In *Proceedings of the 41st International Conference on Machine Learning*, ICML’24. JMLR.org, 2024.
- Kadavath et al. [2022] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. *arXiv preprint arXiv:2207.05221*, 2022.
- Dalal and Misra [2024] Siddhartha Dalal and Vishal Misra. Beyond the black box: A statistical model for llm reasoning and inference. *arXiv preprint arXiv:2402.03175*, 2024.
- Zhou et al. [2026] Tianyi Zhou, Johanne Medina, and Sanjay Chawla. Can llms detect their confabulations? estimating reliability in uncertainty-aware language models. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 38164–38172, 2026.
- Manakul et al. [2023] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023*, pages 9004–9017. Association for Computational Linguistics, 2023. doi:10.18653/V1/2023.EMNLP-MAIN.557. URL [https://doi.org/10.18653/v1/2023.emnlp-main.557](https://doi.org/10.18653/v1/2023.emnlp-main.557).
- Nikitin et al. [2024] Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities. *arXiv preprint arXiv:2405.20003*, 2024.
- Vazhentsev et al. [2026] Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, and Artem Shelmanov. Efficient hallucination detection for LLMs using uncertainty-aware attention heads, 2026. URL [https://openreview.net/forum?id=FSOoR1ZFtf](https://openreview.net/forum?id=FSOoR1ZFtf).
- Li et al. [2024] Loka Li, Zhenhao Chen, Guangyi Chen, Yixuan Zhang, Yusheng Su, Eric Xing, and Kun Zhang. Confidence matters: Revisiting intrinsic self-correction capabilities of large language models. 2024. URL [https://arxiv.org/abs/2402.12563](https://arxiv.org/abs/2402.12563).
- Orgad et al. [2024] Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of llm hallucinations. International Conference on Learning Representations (ICLR), 2024.
- Du et al. [2024] Xuefeng Du, Chaowei Xiao, and Yixuan Li. Haloscope: Harnessing unlabeled llm generations for hallucination detection. In *Advances in Neural Information Processing Systems*, 2024.
- Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In *Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers)*, pages 3214–3252, 2022.
- Joshi et al. [2017] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In *Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics*, pages 1601–1611. Association for Computational Linguistics, 2017.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. *NeurIPS*, 2021.
- Abdallah et al. [2024] Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, and Adam Jatowt. Arabicaqa: A comprehensive dataset for arabic question answering, 2024.
- Köksal et al. [2023] Abdullatif Köksal, Timo Schick, Anna Korhonen, and Hinrich Schütze. Longform: Effective instruction tuning with reverse instructions, 2023.
- Laban et al. [2025] Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation. *arXiv preprint arXiv:2505.06120*, 2025.
- Qwen Team et al. [2025] Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. *arXiv preprint arXiv:2412.15115*, 2025.
- Gemma Team et al. [2025] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, and Anton Tsitsulin. Gemma 3 technical report. *arXiv preprint arXiv:2503.19786*, 2025.
- Fanar Team et al. [2025] Fanar Team, Ummar Abbas, Mohammad Shahmeer Ahmad, Firoj Alam, Enes Altinisik, Ehsannedin Asgari, Yazan Boshmaf, Sabri Boughorbel, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Masoomali Fatehkia, Anastasios Fragkopoulos, Maram Hasanain, Majd Hawasly, Mus’ab Husaini, Soon-Gyo Jung, Ji Kim Lucas, Walid Magdy, Safa Messaoud, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Zan Naeem, Mourad Ouzzani, Dorde Popovic, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang, Ahmed Ali, Yassine El Kheir, Xiaosong Ma, and Chaoyi Ruan. Fanar: An arabic-centric multimodal generative ai platform. *arXiv preprint arXiv:2501.13944*, 2025.
- Malinin and Gales [2021] Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In *International Conference on Learning Representations*, 2021. URL [https://openreview.net/forum?id=jN5y-zb5Q7m](https://openreview.net/forum?id=jN5y-zb5Q7m).
- Yadkori et al. [2024b] Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm: Iterative prompting for estimating epistemic uncertainty. In *Proceedings of the 38th International Conference on Neural Information Processing Systems*, NIPS ’24, Red Hook, NY, USA, 2024b. Curran Associates Inc. ISBN 9798331314385.
Table 4: Classifier-based: AUROC for response reliability estimation across three benchmarks and three model families. Bold: best per column per model. Underline: second best. Italic: third best. Although classifier-based methods output better scores, they are not generalizable.

## Appendix A Supervised vs Unsupervised
GLU requires no labeled data at test time, which is its primary practical advantage. To give a sense of what that constraint costs, we include here results from classifier-based methods that *do* have access to ground-truth correctness labels during training. The comparison is deliberately asymmetric and we are not claiming parity, but it lets readers calibrate how much headroom remains above our unsupervised scores and confirms that GLU is already competitive with several supervised probes despite operating in a fundamentally harder setting.
## Appendix B Proof of the Collision-Entropy Closed Form
We prove that the order-$2$ Rényi entropy of the normalised eigenspectrum of a Gram matrix $\mathbf{K}\in\mathbb{R}^{T\times T}$ admits the two equivalent closed forms stated in ([10](https://arxiv.org/html/2606.09875#S4.E10)).

###### Proposition 1 (Collision-entropy closed form).

Let $\mathbf{K}=\mathbf{H}\mathbf{H}^{\top}$ be the Gram matrix of hidden states $\mathbf{H}\in\mathbb{R}^{T\times d}$, with eigenvalues $\lambda_{1}\geq\cdots\geq\lambda_{T}\geq 0$. Define normalised eigenvalues $\tilde{\lambda}_{i}=\lambda_{i}/\mathrm{tr}(\mathbf{K})$. Then

$$
S_{2}(\mathbf{K}) \;=\; -\log\Bigl(\sum_{i=1}^{T}\tilde{\lambda}_{i}^{2}\Bigr) \;=\; -\log\,\mathrm{tr}\left(\left(\frac{\mathbf{K}}{\mathrm{tr}(\mathbf{K})}\right)^{2}\right) \;=\; -\log\frac{\|\mathbf{K}\|_{F}^{2}}{\mathrm{tr}(\mathbf{K})^{2}}. \tag{13}
$$

###### Proof.

We establish the two equalities in turn. Setting $\alpha=2$ in the general order-$\alpha$ Rényi entropy gives

$$
S_{2}(\mathbf{K})=\frac{1}{1-2}\log\Bigl(\sum_{i}\tilde{\lambda}_{i}^{2}\Bigr)=-\log\Bigl(\sum_{i}\tilde{\lambda}_{i}^{2}\Bigr). \tag{14}
$$

Since $\tilde{\lambda}_{i}=\lambda_{i}/\mathrm{tr}(\mathbf{K})$ and $\mathrm{tr}(\mathbf{K})$ is a positive scalar,

$$
\sum_{i=1}^{T}\tilde{\lambda}_{i}^{2}=\sum_{i=1}^{T}\frac{\lambda_{i}^{2}}{\mathrm{tr}(\mathbf{K})^{2}}=\frac{\sum_{i=1}^{T}\lambda_{i}^{2}}{\mathrm{tr}(\mathbf{K})^{2}}. \tag{15}
$$

Because $\mathbf{K}$ is real symmetric and positive semidefinite, its eigenvalues are real and non-negative. The eigenvalues of $\mathbf{K}^{2}$ are $\lambda_{i}^{2}$, so by the trace–eigenvalue identity,

$$
\mathrm{tr}(\mathbf{K}^{2})=\sum_{i=1}^{T}\lambda_{i}^{2}. \tag{16}
$$

Moreover, the Frobenius norm satisfies $\|\mathbf{K}\|_{F}^{2}=\mathrm{tr}(\mathbf{K}^{\top}\mathbf{K})=\mathrm{tr}(\mathbf{K}^{2})$, where the last step uses symmetry $\mathbf{K}^{\top}=\mathbf{K}$. Hence

$$
\sum_{i=1}^{T}\lambda_{i}^{2}=\mathrm{tr}(\mathbf{K}^{2})=\|\mathbf{K}\|_{F}^{2}. \tag{17}
$$

Since the trace is linear,

$$
\frac{\mathrm{tr}(\mathbf{K}^{2})}{\mathrm{tr}(\mathbf{K})^{2}}=\mathrm{tr}\left(\frac{\mathbf{K}^{2}}{\mathrm{tr}(\mathbf{K})^{2}}\right)=\mathrm{tr}\left(\left(\frac{\mathbf{K}}{\mathrm{tr}(\mathbf{K})}\right)^{2}\right). \tag{18}
$$

Substituting ([15](https://arxiv.org/html/2606.09875#A2.E15))–([18](https://arxiv.org/html/2606.09875#A2.E18)) into ([14](https://arxiv.org/html/2606.09875#A2.E14)) yields the first equality in ([13](https://arxiv.org/html/2606.09875#A2.E13)).

Replacing $\mathrm{tr}(\mathbf{K}^{2})$ by $\|\mathbf{K}\|_{F}^{2}$ via ([17](https://arxiv.org/html/2606.09875#A2.E17)) gives

$$
\mathrm{tr}\left(\left(\frac{\mathbf{K}}{\mathrm{tr}(\mathbf{K})}\right)^{2}\right)=\frac{\|\mathbf{K}\|_{F}^{2}}{\mathrm{tr}(\mathbf{K})^{2}}, \tag{19}
$$

which is the second equality in ([13](https://arxiv.org/html/2606.09875#A2.E13)). ∎
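The proposition can be checked numerically. The sketch below computes the collision entropy three ways (from the normalised eigenvalues, from the trace form, and from the Frobenius form) for a random hidden-state matrix and compares the results. The matrix sizes, the random data, and the function name are assumptions chosen for illustration.

```python
import numpy as np


def collision_entropy_three_ways(H):
    """Return S_2 computed via eigenvalues, via the trace form, and via the Frobenius form."""
    K = H @ H.T                                   # Gram matrix, T x T
    tr = np.trace(K)
    lam_tilde = np.linalg.eigvalsh(K) / tr        # normalised eigenvalues (K is symmetric)
    via_eigs = -np.log(np.sum(lam_tilde ** 2))
    K_norm = K / tr
    via_trace = -np.log(np.trace(K_norm @ K_norm))
    via_frobenius = -np.log(np.linalg.norm(K, "fro") ** 2 / tr ** 2)
    return via_eigs, via_trace, via_frobenius


rng = np.random.default_rng(0)
H = rng.standard_normal((12, 64))                 # hypothetical T = 12 tokens, d = 64 dimensions
values = collision_entropy_three_ways(H)
```

The Frobenius form is the practical one: it needs only elementwise squares and a trace, with no eigendecomposition. The value lies between $0$ (all hidden states collinear, a rank-one Gram matrix) and $\log T$ (a flat eigenspectrum).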
## Appendix C Dataset Details
This appendix details the composition and preprocessing of the six benchmarks used in our evaluation. Where we subsample, we fix the random seed for reproducibility. Table [5](https://arxiv.org/html/2606.09875#A4.T5) summarizes the final sizes.
- TruthfulQA (Lin et al. [[2022](https://arxiv.org/html/2606.09875#bib.bib19)]) probes whether a model reproduces common human misconceptions. We use the full dataset of 817 questions without subsampling.
- TriviaQA (Joshi et al. [[2017](https://arxiv.org/html/2606.09875#bib.bib20)]) is a large-scale reading-comprehension benchmark for factual knowledge retrieval. We draw 2,000 samples from the reading-comprehension (rc) training split.
- MATH (Hendrycks et al. [[2021](https://arxiv.org/html/2606.09875#bib.bib21)]) consists of competition mathematics problems spanning a range of difficulty levels and subjects. We evaluate on a subset of 1,000 problems.
- ArabicaQA (Abdallah et al. [[2024](https://arxiv.org/html/2606.09875#bib.bib22)]) is a native Arabic question-answering benchmark, included to test reliability estimation outside English. We draw 2,000 samples from the machine-reading-comprehension (MRC) test split.
- LongForm (Köksal et al. [[2023](https://arxiv.org/html/2606.09875#bib.bib23)]) targets long-form generation. To keep responses within a tractable length and restrict to well-structured sources, we take the test split and retain only examples drawn from StackExchange, Wikipedia, and BigBench whose reference output is at most 512 words, yielding a subset of 731 examples.
- GSM8K multi-turn (Laban et al. [[2025](https://arxiv.org/html/2606.09875#bib.bib24)]) adapts grade-school math word problems into a sharded, multi-turn conversational setting in which the problem is revealed incrementally across turns. We use the `math` subset of the released `lost_in_conversation` data, comprising 103 problems decomposed into multiple turns each.
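The two preprocessing operations described above, seeded subsampling and the LongForm source and length filter, can be sketched as follows. The seed value, the function names, the record keys `source` and `output`, and whitespace-based word counting are all assumptions made for illustration; the actual field names and seed should be taken from the released code.

```python
import random


def subsample(items, n, seed=0):
    """Draw n items without replacement using a fixed seed; return everything if n >= len(items)."""
    items = list(items)
    if n >= len(items):
        return items
    return random.Random(seed).sample(items, n)


def keep_longform_example(example, max_words=512,
                          sources=("StackExchange", "Wikipedia", "BigBench")):
    """Keep an example if its source is allowed and its reference output has at most max_words words."""
    # Assumed record layout: a dict with hypothetical "source" and "output" keys.
    return example["source"] in sources and len(example["output"].split()) <= max_words
```

Using a dedicated `random.Random(seed)` instance, rather than the module-level generator, keeps the draw reproducible regardless of other random calls in the same process.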
## Appendix D Methods and Ablations
This appendix provides the complete picture behind the main-text results. We show the mean performance of all 19 methods ranked across the 18 model–dataset settings (Table [6](https://arxiv.org/html/2606.09875#A4.T6)), the full per-setting AUROC and PRR for the baselines and GLU including the strongest ablation per cell (Tables [7](https://arxiv.org/html/2606.09875#A4.T7) and [8](https://arxiv.org/html/2606.09875#A4.T8)), and the complete per-setting AUROC and PRR for all GLU ablations (Tables [9](https://arxiv.org/html/2606.09875#A4.T9)–[12](https://arxiv.org/html/2606.09875#A4.T12)). These tables let readers verify that the headline results are not artefacts of aggregation and identify which variants are most robust across model families and task types.
Table 5: Composition of the six evaluation benchmarks.

#### Summary of findings.
Three patterns hold across the full results. First, GLU and its multiplicative ablations occupy the top nine ranks on both mean AUROC and mean PRR (Table [6](https://arxiv.org/html/2606.09875#A4.T6)), well above every single-signal baseline, confirming that the gain comes from the multiplicative global–local fusion rather than any specific estimator. Second, the per-setting tables show the gain is broad rather than driven by a few cells: GLU is best or second on AUROC in 12 of 18 settings and on PRR in a comparable majority, with the clearest margins on TriviaQA, MATH, and Multiturn, and ties or close seconds where the local signal collapses (ArabicaQA, Gemma-Multiturn). Third, fusion variants that drop one signal entirely (GLU-EU, GLU-EU-SP) or replace the multiplicative interaction with an additive one (Add-$\tilde{S}$, Add-$S_{\alpha}$) fall to the bottom of Table [6](https://arxiv.org/html/2606.09875#A4.T6), isolating both the presence of two signals and their multiplicative combination as the sources of the improvement.
Table 6: Mean performance across all 18 model–dataset settings. Methods are grouped into baselines, the proposed GLU method, and GLU ablations.

| Group | Method | Mean AUROC | AUROC rank | Mean PRR | PRR rank |
|---|---|---|---|---|---|
| Proposed | GLU | 0.676 | 1 | 0.369 | 1 |
| Baseline | LogProb | 0.655 | 10 | 0.326 | 10 |
| Baseline | RAUQ | 0.648 | 12 | 0.305 | 12 |
| Baseline | $P(\mathrm{true})$ | 0.654 | 11 | 0.311 | 11 |
| Baseline | LogTokU | 0.574 | 14 | 0.144 | 14 |
| Ablation | GLU-SP | 0.675 | 3 | 0.367 | 2 |
| Ablation | GLU-$\tilde{S}^{*}$ | 0.674 | 4 | 0.366 | 3 |
| Ablation | GLU-DK | 0.675 | 2 | 0.366 | 4 |
| Ablation | GLU-$\bar{S}$ | 0.674 | 5 | 0.365 | 5 |
| Ablation | GLU-$S_{\alpha}$-SP | 0.673 | 6 | 0.363 | 6 |
| Ablation | GLU-AU | 0.671 | 7 | 0.362 | 7 |
| Ablation | GLU-$S_{\alpha}$-AU | 0.667 | 8 | 0.357 | 8 |
| Ablation | GLU-$S^{*}$ | 0.665 | 9 | 0.351 | 9 |
| Ablation | Add-$\tilde{S}$ | 0.621 | 13 | 0.259 | 13 |
| Ablation | GLU-EDL | 0.574 | 15 | 0.140 | 15 |
| Ablation | GLU-$S_{\alpha}$-SP-EU | 0.557 | 16 | 0.103 | 16 |
| Ablation | GLU-EU | 0.544 | 17 | 0.079 | 17 |
| Ablation | GLU-EU-SP | 0.543 | 18 | 0.077 | 18 |
| Ablation | Add-$S_{\alpha}$ | 0.454 | 19 | -0.118 | 19 |

Table 7: Full AUROC comparison. Ranking emphasis is applied only across the baselines and GLU. The best ablation is reported for context but is not included in the row-wise ranking.

Table 8: Full PRR comparison. Ranking emphasis is applied only across the baselines and GLU. The best ablation is reported for context but is not included in the row-wise ranking.

Table 9: AUROC for GLU ablations, part 1: EDL, global geometric variants, and local uncertainty variants.

Table 10: AUROC for GLU ablations, part 2: evidence-only, combined $S_{\alpha}$ variants, and additive fusion baselines.

Table 11: PRR for GLU ablations, part 1: EDL, global geometric variants, and local uncertainty variants.

Table 12: PRR for GLU ablations, part 2: evidence-only, combined $S_{\alpha}$ variants, and additive fusion baselines.