Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

Summary

Researchers introduce SHADE, a hybrid estimator that combines Good-Turing coverage with graph-spectral cues to quantify semantic uncertainty and detect LLM hallucinations when only a few black-box samples are available.

arXiv:2604.19162v1

Cached at: 04/22/26, 08:30 AM

# Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
Source: [https://arxiv.org/html/2604.19162](https://arxiv.org/html/2604.19162)
Hongxing Pan, Yingying Guo, Wenqing Kuang, Jiashi Lu

School of Data Science, The Chinese University of Hong Kong, Shenzhen

{124090486, 123090142, 123090247, 123090386}@link.cuhk.edu.cn

###### Abstract

This paper studies uncertainty quantification for large language models \(LLMs\) under black\-box access, where only a small number of responses can be sampled for each query\. In this setting, estimating the effective semantic alphabet size—that is, the number of distinct meanings expressed in the sampled responses—provides a useful proxy for downstream risk\. However, frequency\-based estimators tend to undercount rare semantic modes when the sample size is small, while graph\-spectral quantities alone are not designed to estimate semantic occupancy accurately\. To address this issue, we propose SHADE \(Soft\-Hybrid Alphabet Dynamic Estimator\), a simple and interpretable estimator that combines Generalized Good–Turing coverage with a heat\-kernel trace of the normalized Laplacian constructed from an entailment\-weighted graph over sampled responses\. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes\. A finite\-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage\-adjusted semantic entropy score\. Experiments on pooled semantic alphabet\-size estimation against large\-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample\-limited regime, while the performance gap narrows as the number of samples increases\. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black\-box uncertainty quantification must operate under tight sampling budgets\.

## 1 Introduction

Large language models can hallucinate or contradict reliable evidence [[14](https://arxiv.org/html/2604.19162#bib.bib26), [29](https://arxiv.org/html/2604.19162#bib.bib27), [9](https://arxiv.org/html/2604.19162#bib.bib22)]. Reliable uncertainty quantification (UQ) supports abstention and human oversight in risk-sensitive settings [[31](https://arxiv.org/html/2604.19162#bib.bib17), [22](https://arxiv.org/html/2604.19162#bib.bib18)]. In deployed systems, however, one often faces a *strict budget*: only a few independent generations per query are affordable for monitoring, and proprietary APIs hide logits, activations, and token probabilities [[4](https://arxiv.org/html/2604.19162#bib.bib5), [34](https://arxiv.org/html/2604.19162#bib.bib19), [32](https://arxiv.org/html/2604.19162#bib.bib11)]. This paper targets that small-$n$ black-box regime.

Semantic *alphabet size*, the number of semantic equivalence classes obtained by clustering multiple generations under bidirectional entailment, is an interpretable proxy for how "spread out" model meanings are for a query [[18](https://arxiv.org/html/2604.19162#bib.bib47), [9](https://arxiv.org/html/2604.19162#bib.bib22), [23](https://arxiv.org/html/2604.19162#bib.bib10)]. At very small $n$, both purely frequency-based and purely spectral estimates tend to underestimate effective support: empirical counts ignore geometric organization among draws, while eigenvalues of a semantic graph can be unstable unless anchored to occupancy statistics.

![Refer to caption](https://arxiv.org/html/2604.19162v1/1.png)

Figure 1: Black-box sampling observes only a few responses per query; clustering yields $k_{\text{obs}}$ semantic classes, while the *true* semantic alphabet can be larger due to missing mass. A larger effective alphabet implies higher epistemic uncertainty and a higher risk of inconsistent outputs under the same monitoring budget.

We propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator). SHADE combines (i) missing-mass extrapolation via Generalized Good–Turing (GGT) and (ii) a diffusion-based spectral summary, the heat-kernel trace $\mathrm{tr}(e^{-\beta L})$ of the normalized Laplacian of an entailment-weighted graph over the $n$ responses [[5](https://arxiv.org/html/2604.19162#bib.bib49)]. Estimated coverage $C_{\text{GGT}}$ determines *how* the two signals are fused: a convex combination when coverage is high, and a LogSumExp surrogate when coverage is low. A lightweight finite-sample correction stabilizes the hybrid cardinality before a Horvitz–Thompson-style entropy readout used as a risk score [[2](https://arxiv.org/html/2604.19162#bib.bib51)]. Compared with occupancy-only cardinality [[23](https://arxiv.org/html/2604.19162#bib.bib10)] and graph-density features for UQ [[21](https://arxiv.org/html/2604.19162#bib.bib21)], the graph here serves as a *second estimator of the same scalar*, fused through coverage rather than as an auxiliary feature vector.

#### Contributions.

1. A coverage-gated hybrid between GGT-based mass extrapolation and the heat-kernel trace of a semantic graph, avoiding a hard threshold on $n$ alone.
2. A single pipeline from raw generations to a bias-corrected cardinality and a visibility-adjusted entropy suitable for thresholding.
3. Empirical analysis of alphabet-size error and downstream incorrectness detection: gains concentrate at the smallest sampling budgets.

## 2 Preliminary

#### Entropy and semantic classes.

For a discrete variable over classes $s$ with probabilities $p(s)$, Shannon entropy is $\mathbb{H} = -\sum_{s} p(s) \log p(s)$. Semantic entropy (SE) partitions generations into equivalence classes using bidirectional entailment [[18](https://arxiv.org/html/2604.19162#bib.bib47), [9](https://arxiv.org/html/2604.19162#bib.bib22), [17](https://arxiv.org/html/2604.19162#bib.bib14)]. With white-box access, class probabilities can be integrated from token likelihoods; in the black-box setting, one uses empirical class frequencies $\hat{p}_i$, yielding discrete semantic entropy (DSE), a plugin estimate of entropy that *underestimates* diversity when heavy tails leave most classes unseen [[4](https://arxiv.org/html/2604.19162#bib.bib5), [9](https://arxiv.org/html/2604.19162#bib.bib22)].
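As a minimal sketch of the black-box plugin estimate described above, the following Python computes discrete semantic entropy from cluster labels. It assumes that bidirectional-entailment clustering has already been run and that each generation carries a cluster id; the function name `discrete_semantic_entropy` is illustrative and does not come from the paper.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Plugin entropy (in nats) of empirical semantic-class frequencies.

    cluster_ids: one hashable class label per sampled generation
    (assumed non-empty and produced by an upstream clustering step).
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # H = -sum_i p_hat_i * log(p_hat_i), with p_hat_i = count_i / n
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Because only observed classes enter the sum, the value can never exceed $\log k_{\text{obs}}$, which is one way to see why the plugin estimate is too small when many classes remain unseen.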

#### Coverage and graphs.

Let $f_m$ denote the number of semantic classes that appear exactly $m$ times in $n$ samples. The missing mass $M$ of unobserved classes and coverage $C = 1 - M$ are central to Good–Turing style reasoning [[11](https://arxiv.org/html/2604.19162#bib.bib48)]. Independently, $n$ responses induce a weighted undirected graph: nodes are responses, edges carry entailment-based weights, and the normalized Laplacian $L$ encodes global connectivity [[21](https://arxiv.org/html/2604.19162#bib.bib21), [27](https://arxiv.org/html/2604.19162#bib.bib13)]. Eigenvalues $\lambda_i$ of $L$ describe how strongly responses split into modes versus clump together [[5](https://arxiv.org/html/2604.19162#bib.bib49)].
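To make the frequency counts concrete, here is a small sketch that tabulates $f_m$ from cluster labels and evaluates the classical Good–Turing coverage $1 - f_1/n$. This is the textbook estimate, shown only for orientation; it is not the stabilized GGT variant used later in the paper, and both function names are hypothetical.

```python
from collections import Counter


def frequency_of_frequencies(cluster_ids):
    """Return a Counter mapping m -> f_m, the number of classes seen exactly m times."""
    class_counts = Counter(cluster_ids)
    return Counter(class_counts.values())


def good_turing_coverage(cluster_ids):
    """Classical Good-Turing coverage: 1 - f_1 / n (assumes a non-empty sample)."""
    n = len(cluster_ids)
    f = frequency_of_frequencies(cluster_ids)
    return 1.0 - f.get(1, 0) / n
```

For example, with labels `[0, 0, 1, 2, 2, 2, 3]` there are two singleton classes among seven samples, so the estimated coverage is $1 - 2/7$.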

## 3 Related work

#### Hallucination detection and UQ for LLMs.

Extracting robust signals from noisy or limited observations is a ubiquitous challenge spanning spatial modeling, medical diagnostics, and representation learning [[7](https://arxiv.org/html/2604.19162#bib.bib3), [33](https://arxiv.org/html/2604.19162#bib.bib4), [37](https://arxiv.org/html/2604.19162#bib.bib1), [6](https://arxiv.org/html/2604.19162#bib.bib2)]. In the context of generative AI, a large literature studies this problem through the lens of hallucinations and uncertainty in LLMs [[14](https://arxiv.org/html/2604.19162#bib.bib26), [29](https://arxiv.org/html/2604.19162#bib.bib27), [31](https://arxiv.org/html/2604.19162#bib.bib17), [22](https://arxiv.org/html/2604.19162#bib.bib18)]. Practical UQ methods include self-consistency, evidential models, internal probes, and semantic clustering approaches [[4](https://arxiv.org/html/2604.19162#bib.bib5), [3](https://arxiv.org/html/2604.19162#bib.bib12), [36](https://arxiv.org/html/2604.19162#bib.bib9), [17](https://arxiv.org/html/2604.19162#bib.bib14)]. Our focus is the intersection of *black-box* access and *small* $n$.

#### Semantic entropy and structure.

Semantic Uncertainty and Semantic Entropy establish clustering-by-meaning as a paradigm [[18](https://arxiv.org/html/2604.19162#bib.bib47), [9](https://arxiv.org/html/2604.19162#bib.bib22)]. Follow-up work uses pairwise similarity, kernelized structure, evidential objectives, and adaptive exploration [[26](https://arxiv.org/html/2604.19162#bib.bib7), [27](https://arxiv.org/html/2604.19162#bib.bib13), [19](https://arxiv.org/html/2604.19162#bib.bib8), [32](https://arxiv.org/html/2604.19162#bib.bib11)]. McCabe et al. [[23](https://arxiv.org/html/2604.19162#bib.bib10)] study semantic cardinality from occupancy; Li et al. [[21](https://arxiv.org/html/2604.19162#bib.bib21)] incorporate graph density as an auxiliary UQ signal. SHADE differs: the Laplacian spectrum contributes a *parallel* estimate of effective support that is fused with the GGT estimate through $C_{\text{GGT}}$.

#### Graph spectra and estimation.

Graph representations of multi-sample generations appear in several lines of work [[8](https://arxiv.org/html/2604.19162#bib.bib23), [13](https://arxiv.org/html/2604.19162#bib.bib24), [10](https://arxiv.org/html/2604.19162#bib.bib25), [1](https://arxiv.org/html/2604.19162#bib.bib30)]. Unlike training-heavy graph models, SHADE uses the graph only as a structural estimator combined with classical missing-mass statistics, keeping inference lightweight.

## 4 Methodology

Let $k_{\text{obs}}$ be the number of observed semantic classes after clustering $n$ generations. We estimate effective support by combining a mass-based estimate $\hat{|S|}_{\text{GGT}}$ and a spectral estimate $\hat{|S|}_{\text{Soft-EigV}}$.

#### Heat-kernel trace.

Build symmetric weights $w_{ij} = (a_{ij} + a_{ji})/2$ from NLI entailment probabilities $a_{ij}$ [[12](https://arxiv.org/html/2604.19162#bib.bib46)]. Let $L$ be the normalized Laplacian with eigenvalues $0 = \lambda_1 \leq \cdots \leq \lambda_n \leq 2$ [[5](https://arxiv.org/html/2604.19162#bib.bib49)]. The heat kernel $e^{-\beta L}$ diffuses mass on the graph; its trace

$$\hat{|S|}_{\text{Soft-EigV}} := \mathrm{tr}(e^{-\beta L}) = \sum_{i=1}^{n} e^{-\beta \lambda_i} \tag{1}$$

aggregates low-frequency (coherent) structure with exponential damping on high-frequency noise [[5](https://arxiv.org/html/2604.19162#bib.bib49)]. Thus $\mathrm{tr}(e^{-\beta L})$ acts as a *soft* multiscale count of semantic modes complementary to raw $k_{\text{obs}}$.
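The heat-kernel trace in Eq. (1) can be sketched in a few lines of NumPy. The version below makes several assumptions not stated in the text: the entailment matrix is already computed and passed in as `A`, self-loops are dropped, isolated nodes follow the usual convention of a zero diagonal entry in the normalized Laplacian, and `beta=1.0` is only a placeholder for the tuned value. The function name is illustrative.

```python
import numpy as np


def heat_kernel_trace(A, beta=1.0):
    """Soft mode count tr(exp(-beta * L)) for an n x n entailment matrix A.

    A[i][j] is assumed to be the NLI entailment probability a_ij.
    """
    A = np.asarray(A, dtype=float)
    W = 0.5 * (A + A.T)            # symmetric weights w_ij = (a_ij + a_ji) / 2
    np.fill_diagonal(W, 0.0)       # assumption: ignore self-entailment
    d = W.sum(axis=1)              # weighted degrees
    connected = d > 0
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[connected] = 1.0 / np.sqrt(d[connected])
    # Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2} (diagonal 0 for isolated nodes)
    L = np.diag(connected.astype(float)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)  # L is symmetric, so eigvalsh applies
    return float(np.exp(-beta * eigvals).sum())
```

As a sanity check, two pairs of mutually entailing responses with no cross-pair entailment give eigenvalues $\{0, 2, 0, 2\}$, so the trace is $2 + 2e^{-2\beta}$: close to two modes, with a small residual that shrinks as $\beta$ grows.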

#### GGT coverage.

Let $f_1, f_2$ be singleton and doubleton class counts. We estimate missing mass and coverage as in stabilized GGT formulations [[23](https://arxiv.org/html/2604.19162#bib.bib10), [11](https://arxiv.org/html/2604.19162#bib.bib48)]:

$$M_{\text{GGT}} = \frac{1}{n}\left(1 - \frac{2.08}{n^{0.7}}\right) f_1 + \frac{4.1}{n^{1.7}} f_2, \quad C_{\text{GGT}} = \max(1 - M_{\text{GGT}}, 10^{-12}), \quad \hat{|S|}_{\text{GGT}} = \frac{k_{\text{obs}}}{C_{\text{GGT}}}. \tag{2}$$
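Eq. (2) translates directly into code. The sketch below assumes cluster labels are already available and uses the constants exactly as printed in the equation; the function name `ggt_estimate` is hypothetical.

```python
from collections import Counter


def ggt_estimate(cluster_ids):
    """Return (C_GGT, |S|_GGT) following Eq. (2) for a non-empty list of labels."""
    n = len(cluster_ids)
    class_counts = Counter(cluster_ids)
    k_obs = len(class_counts)
    freq = Counter(class_counts.values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)   # singleton and doubleton classes
    missing = (1.0 / n) * (1.0 - 2.08 / n**0.7) * f1 + (4.1 / n**1.7) * f2
    coverage = max(1.0 - missing, 1e-12)      # floor keeps the ratio finite
    return coverage, k_obs / coverage
```

When every response falls in one class there are no singletons or doubletons (for $n \geq 3$), so the estimated coverage is 1 and the alphabet estimate equals $k_{\text{obs}} = 1$.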

#### Coverage-driven hybridization.

When $C_{\text{GGT}} \geq \tau$, we use a convex combination that down-weights the spectrum as coverage grows:

$$\hat{|S|}_{\text{Hybrid}} = C_{\text{GGT}}\,\hat{|S|}_{\text{GGT}} + (1 - C_{\text{GGT}})\,\hat{|S|}_{\text{Soft-EigV}}. \tag{3}$$

When $C_{\text{GGT}} < \tau$, we use a LogSumExp fusion that behaves like a smooth maximum between the two predictors:

$$\hat{|S|}_{\text{Hybrid}} = \frac{1}{\alpha} \ln\left(e^{\alpha \hat{|S|}_{\text{GGT}}} + e^{\alpha \hat{|S|}_{\text{Soft-EigV}}}\right). \tag{4}$$

Scalars $(\beta, \alpha, \tau)$ are fixed once on development data (Sec. [5](https://arxiv.org/html/2604.19162#S5)). The threshold $\tau$ is chosen so that typical queries avoid unstable switching near the boundary.
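A possible rendering of the coverage-gated fusion in Eqs. (3) and (4) is sketched below. The defaults `tau=0.5` and `alpha=1.0` are placeholders chosen for illustration only, since the tuned values are not given here, and the max-shifted form of LogSumExp is an implementation choice for numerical stability rather than something the text prescribes.

```python
import math


def fuse(s_ggt, s_spec, coverage, tau=0.5, alpha=1.0):
    """Coverage-gated fusion of the GGT and spectral alphabet-size estimates."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if coverage >= tau:
        # Eq. (3): convex combination, spectrum down-weighted at high coverage
        return coverage * s_ggt + (1.0 - coverage) * s_spec
    # Eq. (4): LogSumExp, shifted by the max to avoid overflow in exp()
    m = max(s_ggt, s_spec)
    return m + math.log(
        math.exp(alpha * (s_ggt - m)) + math.exp(alpha * (s_spec - m))
    ) / alpha
```

Two properties follow from the formulas themselves. The LogSumExp branch always lies between $\max(a, b)$ and $\max(a, b) + \ln 2 / \alpha$, so larger $\alpha$ makes it a tighter smooth maximum. And since $C_{\text{GGT}}\,\hat{|S|}_{\text{GGT}} = k_{\text{obs}}$ by Eq. (2), the convex branch reduces to $k_{\text{obs}} + (1 - C_{\text{GGT}})\,\hat{|S|}_{\text{Soft-EigV}}$.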

#### Finite-sample correction and entropy readout.

Plugin diversity functionals incur $\mathcal{O}(1/n)$ bias [[25](https://arxiv.org/html/2604.19162#bib.bib50)]; we add a correction term of the same order to the hybrid cardinality:

$$\hat{|S|}_{\text{Final}} = \hat{|S|}_{\text{Hybrid}} + \frac{k_{\text{obs}} - 1}{2n}, \qquad p_i^{*} = \frac{k_{\text{obs}}\,\hat{p}_i}{\hat{|S|}_{\text{Final}}}. \tag{5}$$

The score used for detection is

$$\hat{\mathbb{H}}_{\text{SHADE}} = -\sum_{i=1}^{k_{\text{obs}}} \frac{p_i^{*} \log p_i^{*}}{1 - (1 - p_i^{*})^{n}}, \tag{6}$$

where the visibility denominator $1 - (1 - p_i^{*})^{n}$ is the probability that class $i$ is observed at least once in $n$ independent draws, as in Horvitz–Thompson-style estimators [[2](https://arxiv.org/html/2604.19162#bib.bib51)]. Ablations $\widehat{|S|}_{\mathrm{Hybrid}}$ and $\hat{\mathbb{H}}_{\mathrm{Hybrid}}$ omit this correction before the entropy mapping.
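The readout in Eqs. (5) and (6) can be sketched as follows, assuming the hybrid cardinality has already been computed upstream and is passed in as `s_hybrid`. The function name and the guard against a corrected cardinality below $k_{\text{obs}}$ are additions for illustration; the guard simply keeps every $p_i^{*}$ in $(0, 1]$ so the visibility term stays positive.

```python
import math
from collections import Counter


def shade_entropy(cluster_ids, s_hybrid):
    """Return (|S|_Final, H_SHADE) from cluster labels and a hybrid cardinality."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    k_obs = len(counts)
    s_final = s_hybrid + (k_obs - 1) / (2.0 * n)    # Eq. (5), additive correction
    if s_final < k_obs:
        # Assumption of this sketch: the estimate is at least the observed count
        raise ValueError("corrected cardinality is below k_obs")
    score = 0.0
    for c in counts.values():
        p_star = k_obs * (c / n) / s_final          # coverage-adjusted probability
        visibility = 1.0 - (1.0 - p_star) ** n      # P(class seen in n draws)
        score -= p_star * math.log(p_star) / visibility
    return s_final, score
```

With a single observed class and `s_hybrid = 1.0`, the adjusted probability is 1 and the score is zero, matching the intuition that fully consistent answers carry no semantic uncertainty.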

## 5 Experiments

### 5.1 Setup

We evaluate on SQuAD, CoQA, NQ-Open, TriviaQA, and HotpotQA [[28](https://arxiv.org/html/2604.19162#bib.bib39), [30](https://arxiv.org/html/2604.19162#bib.bib36), [20](https://arxiv.org/html/2604.19162#bib.bib35), [16](https://arxiv.org/html/2604.19162#bib.bib37), [35](https://arxiv.org/html/2604.19162#bib.bib43)]. Generators include OPT-6.7B, Qwen3-8B-Instruct ([https://huggingface.co/Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)), Mistral-7B-Instruct, and Phi-3.5-mini [[38](https://arxiv.org/html/2604.19162#bib.bib40), [15](https://arxiv.org/html/2604.19162#bib.bib41), [24](https://arxiv.org/html/2604.19162#bib.bib42)]. DeBERTa-v3-large-mnli supplies entailment scores for graph construction [[12](https://arxiv.org/html/2604.19162#bib.bib46)]. For alphabet-size error, we draw $N = 100$ generations per query as a pseudo-oracle and subsample $n \in \{5, \dots, 50\}$; baselines include plugin $k_{\text{obs}}$, GT, GGT, Laplacian $U_{\text{EigV}}$, and hybrid variants. Binary incorrectness labels follow dataset protocols ($AUC_s$, $AUC_r$); CoQA participates in estimation pools but is omitted from the four-dataset AUC table for space. Reproducibility details are in Appendix [B](https://arxiv.org/html/2604.19162#A2).

### 5.2 Alphabet-size estimation

Table [1](https://arxiv.org/html/2604.19162#S5.T1) reports MAE and RMSE against the $N = 100$ reference. SHADE achieves the largest margin at $n = 5$ and remains best across listed $n$. MAE is not strictly monotone in $n$ for any method because pooling mixes heterogeneous prompts. Table [2](https://arxiv.org/html/2604.19162#S5.T2) summarizes pairwise win rates: SHADE beats prior hybrids and frequency baselines on a majority of queries.
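For reference, the two error metrics named above can be computed as in this short sketch. It assumes one estimate and one pseudo-oracle reference value per query, pooled into two equal-length lists; the function name is illustrative and the snippet does not reproduce any reported number.

```python
import math


def mae_rmse(estimates, references):
    """Mean absolute error and root mean squared error over pooled queries."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty lists of equal length")
    errors = [e - r for e, r in zip(estimates, references)]
    mae = sum(abs(x) for x in errors) / len(errors)
    rmse = math.sqrt(sum(x * x for x in errors) / len(errors))
    return mae, rmse
```

RMSE is never smaller than MAE and weights large misses more heavily, which is why the two are usually reported together.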

Table 1: MAE (RMSE) vs. $N = 100$ oracle across subsample sizes $n$.

Table 2: Pairwise win rates of SHADE vs. baselines ($n \in \{5, 10, 20\}$, pooled queries).

### 5.3 Incorrectness detection

We threshold $\hat{\mathbb{H}}_{\text{SHADE}}$ and report ROC AUC for sequence- and response-level labels on four benchmarks (Table [3](https://arxiv.org/html/2604.19162#S5.T3)). At $n = 5$, SHADE attains the highest mean AUC. At $n \in \{8, 10\}$, simpler scores such as plugin entropy or NumSets sometimes achieve higher mean AUC despite worse MAE: detection depends on label noise and separability, not only cardinality fidelity.

Table 3: Incorrectness detection (AUC). Mean column averages four datasets.
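ROC AUC for a scalar risk score can be computed without choosing a threshold, using its equivalence to the probability that a randomly chosen incorrect answer receives a higher score than a randomly chosen correct one. The sketch below assumes labels of 1 for incorrect and 0 for correct, counts ties as one half, and uses a hypothetical function name; it is a quadratic-time illustration, not an evaluation script from the paper.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

A score that ranks every incorrect answer above every correct one yields 1.0, and a score that carries no ranking information yields 0.5.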

## 6 Conclusion

SHADE fuses Generalized Good–Turing coverage with the heat-kernel trace of a semantic graph and a finite-sample correction, yielding a single entropy-based risk score. Gains are largest at the smallest sampling budgets. Appendix [B](https://arxiv.org/html/2604.19162#A2) lists supplementary protocol notes and societal considerations.

## References

- \[1\]\(2025\)On the effectiveness of random weights in graph neural networks\.arXiv preprint arXiv:2502\.00190\.Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px3.p1.1)\.
- \[2\]A\. Chao and T\. Shen\(2003\)Nonparametric estimation of Shannon’s index of diversity when there are unseen species in a sample\.Environmental and Ecological Statistics10,pp\. 429–443\.Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p3.3),[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px4.p1.3)\.
- \[3\]C\. Chen, K\. Liu, Z\. Chen, Y\. Gu, Y\. Wu, M\. Tao, Z\. Fu, and J\. Ye\(2024\)INSIDE: llms’ internal states retain the power of hallucination detection\.ArXivabs/2402\.03744\.External Links:[Link](https://api.semanticscholar.org/CorpusID:267499843)Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[4\]J\. Chen and J\. Mueller\(2024\)Quantifying uncertainty in answers from any language model and enhancing their trustworthiness\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 5186–5200\.Cited by:[Appendix B](https://arxiv.org/html/2604.19162#A2.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2604.19162#S1.p1.1),[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px1.p1.4),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[5\]F\. R\. K\. Chung\(1997\)Spectral graph theory\.CBMS Regional Conference Series in Mathematics,American Mathematical Society,Providence, RI\.Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p3.3),[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px2.p1.9),[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px1.p1.5),[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px1.p1.7)\.
- \[6\]K\. Cui, R\. Li, S\. L\. Polk, Y\. Lin, H\. Zhang, J\. M\. Murphy, R\. J\. Plemmons, and R\. H\. Chan\(2024\)Superpixel\-based and spatially regularized diffusion learning for unsupervised hyperspectral image clustering\.IEEE Transactions on Geoscience and Remote Sensing62,pp\. 1–18\.Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[7\]K\. Cui, W\. Tang, R\. Zhu, M\. Wang, G\. D\. Larsen, V\. P\. Pauca, S\. Alqahtani, F\. Yang, D\. Segurado, P\. Fine,et al\.\(2025\)Efficient localization and spatial distribution modeling of canopy palms using uav imagery\.IEEE Transactions on Geoscience and Remote Sensing\.Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[8\]T\. Derr, Y\. Ma, and J\. Tang\(2018\)Signed graph convolutional networks\.2018 IEEE International Conference on Data Mining \(ICDM\),pp\. 929–934\.External Links:[Link](https://api.semanticscholar.org/CorpusID:57362238)Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px3.p1.1)\.
- \[9\]S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal\(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630,pp\. 625 – 630\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270615909)Cited by:[Appendix A](https://arxiv.org/html/2604.19162#A1.p1.3),[Appendix B](https://arxiv.org/html/2604.19162#A2.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2604.19162#S1.p1.1),[§1](https://arxiv.org/html/2604.19162#S1.p2.1),[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px1.p1.4),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1)\.
- \[10\]W\. Feng, J\. Zhang, Y\. Dong, Y\. Han, H\. Luan, Q\. Xu, Q\. Yang, E\. Kharlamov, and J\. Tang\(2020\)Graph random neural networks for semi\-supervised learning on graphs\.arXiv: Learning\.External Links:[Link](https://api.semanticscholar.org/CorpusID:225086084)Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px3.p1.1)\.
- \[11\]I\. J\. Good\(1953\)The population frequencies of species and the estimation of population parameters\.Biometrika40\(3/4\),pp\. 237–264\.Cited by:[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px2.p1.9),[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px2.p1.1)\.
- \[12\]P\. He, J\. Gao, and W\. Chen\(2021\)DeBERTaV3: improving deberta using electra\-style pre\-training with gradient\-disentangled embedding sharing\.CoRRabs/2111\.09543\.External Links:[Link](https://arxiv.org/abs/2111.09543),2111\.09543Cited by:[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px1.p1.5),[§5\.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6)\.
- \[13\]C\. Huang, M\. Li, F\. Cao, H\. Fujita, Z\. Li, and X\. Wu\(2022\)Are graph convolutional networks with random weights feasible?\.IEEE Transactions on Pattern Analysis and Machine Intelligence45,pp\. 2751–2768\.External Links:[Link](https://api.semanticscholar.org/CorpusID:249677202)Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px3.p1.1)\.
- \[14\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung\(2023\)Survey of hallucination in natural language generation\.ACM computing surveys55\(12\),pp\. 1–38\.Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p1.1),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[15\]A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de Las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed\(2023\)Mistral 7b\.CoRRabs/2310\.06825\.External Links:[Link](https://doi.org/10.48550/arXiv.2310.06825),[Document](https://dx.doi.org/10.48550/ARXIV.2310.06825),2310\.06825Cited by:[§5\.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6)\.
- \[16\]M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer\(2025\-04\)TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension\.IEEE DataPort\.Note:[https://doi\.org/10\.21227/de50\-f985](https://doi.org/10.21227/de50-f985)Accessed on YYYY\-MM\-DD\.External Links:[Link](https://doi.org/10.21227/de50-f985),[Document](https://dx.doi.org/10.21227/DE50-F985)Cited by:[§5\.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6)\.
- \[17\]J\. Kossen, J\. Han, M\. Razzak, L\. Schut, S\. A\. Malik, and Y\. Gal\(2024\)Semantic entropy probes: robust and cheap hallucination detection in llms\.ArXivabs/2406\.15927\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270703114)Cited by:[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px1.p1.4),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[18\]L\. Kuhn, Y\. Gal, and S\. Farquhar\(2023\)Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=VD-AYtP0dve)Cited by:[Appendix A](https://arxiv.org/html/2604.19162#A1.p1.3),[§1](https://arxiv.org/html/2604.19162#S1.p2.1),[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px1.p1.4),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1)\.
- \[19\]L\. Kunitomo\-Jacquin, E\. Marrese\-Taylor, K\. Fukuda, and M\. Hamasaki\(2026\)Evidential semantic entropy for llm uncertainty quantification\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7107–7122\.Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1)\.
- \[20\]T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. P\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov\(2019\)Natural questions: a benchmark for question answering research\.Trans\. Assoc\. Comput\. Linguistics7,pp\. 452–466\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00276),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00276)Cited by:[§5\.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6)\.
- \[21\]Z\. Li, S\. Shen, W\. Yang, R\. Jin, H\. Chen, L\. Cao, and J\. Ren\(2025\)Enhancing uncertainty quantification in large language models through semantic graph density\.InConference on Uncertainty in Artificial Intelligence,External Links:[Link](https://api.semanticscholar.org/CorpusID:280676273)Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p3.3),[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px2.p1.9),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1)\.
- \[22\]X\. Liu, T\. Chen, L\. Da, C\. Chen, Z\. Lin, and H\. Wei\(2025\)Uncertainty quantification and confidence calibration in large language models: a survey\.Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2\.External Links:[Link](https://api.semanticscholar.org/CorpusID:277150701)Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p1.1),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[23\]L\. H\. McCabe, R\. Melamed, T\. Hartvigsen, and H\. H\. Huang\(2025\)Estimating semantic alphabet size for llm uncertainty quantification\.arXiv preprint arXiv:2509\.14478\.Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p2.1),[§1](https://arxiv.org/html/2604.19162#S1.p3.3),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px2.p1.1)\.
- \[24\]Microsoft\(2024\)Phi\-3 technical report: A highly capable language model locally on your phone\.CoRRabs/2404\.14219\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.14219),[Document](https://dx.doi.org/10.48550/ARXIV.2404.14219),2404\.14219Cited by:[§5\.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6)\.
- \[25\]G\. A\. Miller\(1955\)Note on the bias of information estimates\.Information Theory, IRE Transactions on2\(2\),pp\. 190–190\.Cited by:[§4](https://arxiv.org/html/2604.19162#S4.SS0.SSS0.Px4.p1.1)\.
- \[26\]D\. Nguyen, A\. Payani, and B\. Mirzasoleiman\(2025\)Beyond semantic entropy: boosting llm uncertainty quantification with pairwise semantic similarity\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 4530–4540\.Cited by:[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1)\.
- \[27\]A\. Nikitin, J\. Kossen, Y\. Gal, and P\. Marttinen\(2024\)Kernel language entropy: fine\-grained uncertainty quantification for llms from semantic similarities\.ArXivabs/2405\.20003\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270123445)Cited by:[§2](https://arxiv.org/html/2604.19162#S2.SS0.SSS0.Px2.p1.9),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1)\.
- \[28\]P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang\(2016\)SQuAD: 100, 000\+ questions for machine comprehension of text\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1\-4, 2016,J\. Su, X\. Carreras, and K\. Duh \(Eds\.\),pp\. 2383–2392\.External Links:[Link](https://doi.org/10.18653/v1/d16-1264),[Document](https://dx.doi.org/10.18653/V1/D16-1264)Cited by:[§5\.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6)\.
- \[29\]V\. Rawte, A\. Sheth, and A\. Das\(2023\)A survey of hallucination in large foundation models\.arXiv preprint arXiv:2309\.05922\.Cited by:[§1](https://arxiv.org/html/2604.19162#S1.p1.1),[§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1)\.
- \[30\] S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: A conversational question answering challenge. Trans. Assoc. Comput. Linguistics 7, pp. 249–266. External Links: [Link](https://doi.org/10.1162/tacl_a_00266), [Document](https://dx.doi.org/10.1162/TACL%5FA%5F00266) Cited by: [§5.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6).
- \[31\] O. Shorinwa, Z. Mei, J. Lidard, A. Z. Ren, and A. Majumdar (2024) A survey on uncertainty quantification of large language models: taxonomy, open research challenges, and future directions. ACM Computing Surveys 58, pp. 1–38. External Links: [Link](https://api.semanticscholar.org/CorpusID:274597654) Cited by: [§1](https://arxiv.org/html/2604.19162#S1.p1.1), [§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1).
- \[32\] Q. Sun, X. Li, X. He, A. Cheng, X. Ji, H. Lu, R. Huang, and Q. Hu (2026) Efficient hallucination detection: adaptive bayesian estimation of semantic entropy with guided semantic exploration. In Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2026, Singapore, January 20-27, 2026, S. Koenig, C. Jenkins, and M. E. Taylor (Eds.), pp. 33117–33125. External Links: [Link](https://doi.org/10.1609/aaai.v40i39.40595), [Document](https://dx.doi.org/10.1609/AAAI.V40I39.40595) Cited by: [§1](https://arxiv.org/html/2604.19162#S1.p1.1), [§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px2.p1.1).
- \[33\] W. Tang, K. Cui, R. H. Chan, and J. Morel (2025) Bilateral signal warping for left ventricular hypertrophy diagnosis. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), pp. 1–5. Cited by: [§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1).
- \[34\] Y. Xue, K. H. Greenewald, Y. Mroueh, and B. Mirzasoleiman (2025) Verify when uncertain: beyond self-consistency in black box hallucination detection. ArXiv abs/2502.15845. External Links: [Link](https://api.semanticscholar.org/CorpusID:276575667) Cited by: [§1](https://arxiv.org/html/2604.19162#S1.p1.1).
- \[35\] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 2369–2380. External Links: [Link](https://doi.org/10.18653/v1/d18-1259), [Document](https://dx.doi.org/10.18653/V1/D18-1259) Cited by: [§5.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6).
- \[36\] T. Yoon and H. Kim (2025) Uncertainty estimation by flexible evidential deep learning. arXiv preprint arXiv:2510.18322. Cited by: [§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1).
- \[37\] Y. Zeng, Z. Yu, D. Jiang, W. Zhang, Y. Hong, Z. Hu, J. Luo, and K. Cui (2026) Learning where to embed: noise-aware positional embedding for query retrieval in small-object detection. arXiv preprint arXiv:2604.15065. Cited by: [§3](https://arxiv.org/html/2604.19162#S3.SS0.SSS0.Px1.p1.1).
- \[38\] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer (2022) OPT: open pre-trained transformer language models. CoRR abs/2205.01068. External Links: [Link](https://doi.org/10.48550/arXiv.2205.01068), [Document](https://dx.doi.org/10.48550/ARXIV.2205.01068), 2205.01068 Cited by: [§5.1](https://arxiv.org/html/2604.19162#S5.SS1.p1.6).

## Appendix A Notation and white-box semantic entropy

With access to model probabilities $p(d \mid q, \theta)$, semantic entropy is $-\sum_{s} p(s \mid q, \theta) \log p(s \mid q, \theta)$, summed over latent semantic classes $s$ \[[18](https://arxiv.org/html/2604.19162#bib.bib47), [9](https://arxiv.org/html/2604.19162#bib.bib22)\]. Black-box DSE replaces those probabilities with empirical frequencies, as discussed in Sec. 2.
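As a minimal sketch of the black-box substitution described above, the snippet below computes the plug-in entropy of empirical semantic-class frequencies. It assumes the sampled responses have already been grouped into semantic classes (for example by an NLI-based clustering step); the function name `empirical_semantic_entropy` and the label format are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def empirical_semantic_entropy(cluster_ids):
    """Plug-in entropy (in nats) of empirical semantic-class frequencies.

    cluster_ids: one hashable class label per sampled response
    (hypothetical input format; the clustering itself is assumed done).
    """
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_ids)
    # Replace p(s | q, theta) with the empirical frequency count / n.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


if __name__ == "__main__":
    # Five sampled responses falling into three semantic classes.
    print(empirical_semantic_entropy(["a", "a", "b", "a", "c"]))
```

With few samples, rare classes may not appear at all, so this plug-in value tends to understate the entropy; that is the small-sample gap a coverage-adjusted estimator is meant to address.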

## Appendix B Experimental protocol

#### Supervision\.

Binary labels indicate response-level or sequence-level incorrectness relative to dataset references ($AUC_{r}$, $AUC_{s}$), following common practice \[[9](https://arxiv.org/html/2604.19162#bib.bib22), [4](https://arxiv.org/html/2604.19162#bib.bib5)\].
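For readers who want to see how an AUC of this kind can be computed from uncertainty scores and binary incorrectness labels, here is a small sketch using the pairwise-ranking definition of AUROC. It assumes that label 1 marks an incorrect answer and that a higher score means higher estimated uncertainty; the function name `auroc` and both conventions are assumptions made for illustration, since the text does not specify the computation.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: the probability that a randomly chosen
    positive (label 1) receives a higher score than a randomly chosen
    negative (label 0), with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical uncertainty scores; label 1 = incorrect answer.
    print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable for large evaluation sets.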

#### Pseudo\-oracle\.

The $N{=}100$ reference is an operational proxy for semantic cardinality; varying $N$ may shift MAE magnitudes while often preserving relative rankings.
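The MAE against the pseudo-oracle can be sketched as follows. The snippet assumes one small-sample alphabet-size estimate and one large-sample reference count per query; the function name `mean_absolute_error` and the example numbers are hypothetical and are not results from the paper.

```python
def mean_absolute_error(estimates, references):
    """Mean absolute error between per-query alphabet-size estimates and
    the corresponding large-sample reference counts."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - r) for e, r in zip(estimates, references))
    return total / len(estimates)


if __name__ == "__main__":
    # Hypothetical values: estimates from a few samples vs. reference counts.
    print(mean_absolute_error([2.0, 3.5], [3, 3]))
```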

#### Reproducibility\.

Hyperparameters $(\beta, \alpha, \tau)$ are selected once on development data and held fixed; NLI and decoding protocols are shared across benchmarks; random seeds are fixed where applicable. Code and configuration will be released.

## Appendix C Societal considerations

Uncertainty scores can support abstention or human review in sensitive domains; however, poorly calibrated thresholds or clustering failures may create false assurance. We encourage task-specific validation and oversight alongside SHADE.
