FLaG: Fine-Grained Latent Grouping for Hallucination Detection
Summary
FLaG is a lightweight framework for hallucination detection in LLMs that models correctness via latent evidence groups and energy-based routing, achieving SOTA performance across benchmarks.
# FLaG: Fine-Grained Latent Grouping for Hallucination Detection

Source: [https://arxiv.org/html/2606.00301](https://arxiv.org/html/2606.00301)

Liyao Li (Zhejiang University, Hangzhou, Zhejiang, China), Zhiqing Xiao (Zhejiang University, Hangzhou, Zhejiang, China), Muzhi Zhu (Zhejiang University, Hangzhou, Zhejiang, China), Jiaqi Hu (Zhejiang University, Hangzhou, Zhejiang, China), Zhanming Shen (Zhejiang University, Hangzhou, Zhejiang, China), Xiaomeng Hu (Zhejiang University, Hangzhou, Zhejiang, China), Sean Du (Nanyang Technological University, Singapore), and Haobo Wang (wanghaobo@zju.edu.cn; Zhejiang University, Hangzhou, Zhejiang, China) (2026)

###### Abstract

Hallucinations in large language models (LLMs) arise from heterogeneous failure mechanisms, making reliable detection difficult for any single global uncertainty score. In this work, we formulate hallucination detection as a mechanism-aware evidence aggregation problem, where diverse representation- and token-level signals must be interpreted under multiple latent explanations. We propose FLaG, a lightweight hallucination detection framework that models correctness through a set of latent evidence groups. Each instance is softly associated with multiple groups via an energy-based routing mechanism, and group-conditional reliability signals are combined through a principled log-marginal aggregation. This design enables FLaG to capture heterogeneous hallucination patterns while remaining invariant to decision thresholds and evaluation metrics. The framework operates as a frozen-model head, requires no modification to the underlying language model, and incurs minimal computational overhead. We further provide a theoretical perspective that connects FLaG to optimal evidence aggregation under heterogeneous error mechanisms, showing that the Bayes-optimal test statistic necessarily admits a log-marginal form and that FLaG constitutes a tractable approximation with a controllable error bound.
Extensive experiments across multiple benchmarks and LLM backbones demonstrate that FLaG consistently achieves SOTA performance, while exhibiting robust transfer across datasets and models and remaining effective under limited supervision.

Keywords: Large Language Models, Hallucination, Latent Grouping

Copyright: ACM licensed. Journal year: 2026. DOI: 10.1145/3770855.3818137. Conference: the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 09–13, 2026; Jeju, Korea. ISBN: 979-8-4007-2258-5/2026/08. CCS concepts: Security and privacy (Trust frameworks); Computing methodologies (Natural language processing).

## 1. Introduction

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation (Zhao et al., [2023](https://arxiv.org/html/2606.00301#bib.bib1)). Despite this progress, LLMs are prone to hallucinations, in which outputs are fluent and seemingly coherent yet factually incorrect or unsupported. Such hallucinated responses pose serious risks in high-stakes applications, including medicine, law, and scientific decision-making, where reliability is paramount (Zhang et al., [2023](https://arxiv.org/html/2606.00301#bib.bib2); Pal et al., [2023](https://arxiv.org/html/2606.00301#bib.bib3)). Consequently, enabling LLMs to reliably assess the truthfulness of their own generations has become a central challenge in building trustworthy LLM systems.

A central challenge in hallucination detection is that hallucinations do not arise from a single, homogeneous failure mode. Empirically, hallucinated outputs exhibit diverse patterns across representation-level signals, token-level probability traces, and generation dynamics. Some failures manifest as semantic drift from the prompt, others as locally inconsistent probability assignments, and still others as overconfident yet globally implausible continuations.
As a result, no single uncertainty signal or global scoring rule reliably captures all hallucination behaviors across datasets, models, and generation regimes.

However, most existing detectors (Burns et al., [2023](https://arxiv.org/html/2606.00301#bib.bib4); Azaria and Mitchell, [2023](https://arxiv.org/html/2606.00301#bib.bib5); Marks and Tegmark, [2024](https://arxiv.org/html/2606.00301#bib.bib6); Yin et al., [2024](https://arxiv.org/html/2606.00301#bib.bib7); Du et al., [2024](https://arxiv.org/html/2606.00301#bib.bib8); Chen et al., [2024](https://arxiv.org/html/2606.00301#bib.bib9); Li et al., [2024](https://arxiv.org/html/2606.00301#bib.bib10); Kossen et al., [2024](https://arxiv.org/html/2606.00301#bib.bib11)) implicitly assume a homogeneous notion of hallucination, collapsing all available evidence into a monolithic score. While such approaches can be effective in specific settings, they struggle to generalize (Fig. [1](https://arxiv.org/html/2606.00301#S1.F1)). In practice, detectors that rely on a single view of uncertainty or a fixed decision boundary often overfit to particular hallucination types, leading to brittle performance under setting shift.

In this work, we take a different perspective. Rather than treating hallucination as a single phenomenon, we view it as the outcome of heterogeneous latent failure mechanisms. Under this view, each instance may admit multiple competing explanations of how its evidence was generated, and reliable detection requires reasoning over these alternatives. Crucially, this perspective suggests that hallucination detection is not merely a classification problem, but an evidence aggregation problem under latent mechanism uncertainty. We formalize this intuition by framing hallucination detection as the task of learning a real-valued reliability score that aggregates heterogeneous evidence sources while remaining invariant to decision thresholds and evaluation metrics.
Our formulation naturally leads to a mechanism-aware scoring rule in which evidence is first interpreted under multiple latent mechanisms and then aggregated in a probabilistically coherent manner.

Based on this formulation, we propose FLaG (Fine-Grained Latent Grouping), a lightweight hallucination detection framework that explicitly models latent evidence groups. FLaG extracts complementary evidence from both representation geometry and probabilistic generation traces, projects them into a shared latent space, and softly associates each instance with multiple latent groups. Each group provides a group-conditional reliability signal, and these signals are combined through a principled log-evidence aggregation rule. Importantly, this aggregation does not commit to a single explanation, but marginalizes over competing latent mechanisms, allowing the detector to adaptively capture diverse hallucination patterns.

FLaG is designed as a frozen-model head that requires no modification to the underlying language model and incurs minimal computational overhead. The framework naturally supports supervised, weakly supervised, and semi-supervised learning through a ranking-based objective that directly optimizes relative correctness rather than absolute labels.

Figure 1. Score distributions of the vanilla method vs. FLaG: (a) Vanilla Classification; (b) FLaG. The vanilla method trains a classifier on last-token embeddings.

Beyond empirical performance, we provide a theoretical perspective that connects FLaG to optimal evidence aggregation under heterogeneous error mechanisms. We show that, under a general mixture-of-mechanisms model, the Bayes-optimal test statistic necessarily takes a log-marginal form over mechanism-conditioned evidence. FLaG can be viewed as a tractable, learnable approximation to this optimal statistic, with an explicit approximation error bound that improves as the number of latent mechanisms increases.

Finally, we evaluate FLaG across multiple hallucination benchmarks and language model backbones. The results demonstrate consistent SOTA performance over strong baselines under both full and semi-supervised settings, as well as robust transfer across datasets and models. Together, these findings support the central claim of this work: hallucination detection benefits from explicitly modeling heterogeneous latent failure mechanisms, rather than collapsing all evidence into a unified classifier.

- We formulate hallucination detection as mechanism-aware evidence aggregation under heterogeneous latent failure modes, unifying representation-level and token-level signals within a single truthfulness scoring framework.
- We propose FLaG, a lightweight frozen-model detection head that softly infers latent evidence groups and combines group-conditional signals via a principled log-marginal aggregation, yielding threshold-insensitive ranking.
- We provide both learning and theory: a ranking-based objective that naturally extends to weakly supervised and semi-supervised settings, and a theoretical analysis connecting FLaG to the Bayes-optimal log-marginal statistic with a controllable approximation error bound.

## 2. Related Work

### 2.1. Hallucination Detection

Hallucination detection has emerged as an important research topic due to its close connection to the potential risks of deploying LLMs in real-world applications (Huang et al., [2023](https://arxiv.org/html/2606.00301#bib.bib12)). A large body of work frames hallucination detection as an uncertainty estimation problem and designs various uncertainty scoring functions. Logit-based methods (Ren et al., [2022](https://arxiv.org/html/2606.00301#bib.bib13); Malinin and Gales, [2021](https://arxiv.org/html/2606.00301#bib.bib14); Kuhn et al., [2023](https://arxiv.org/html/2606.00301#bib.bib15)) directly leverage token-level probabilities to quantify uncertainty.
Verbalized methods (Lin et al., [2022a](https://arxiv.org/html/2606.00301#bib.bib16); Xiong et al., [2024](https://arxiv.org/html/2606.00301#bib.bib17)), in contrast, prompt LLMs to explicitly produce natural-language uncertainty signals. Consistency-based approaches (Manakul et al., [2023](https://arxiv.org/html/2606.00301#bib.bib18); Chen et al., [2024](https://arxiv.org/html/2606.00301#bib.bib9)) exploit response agreement across multiple parallel samplings of the same underlying LLM. More recent studies hypothesize that hallucination-related signals are implicitly encoded in the model's hidden states, and train classifiers to extract such signals (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.00301#bib.bib5); Marks and Tegmark, [2024](https://arxiv.org/html/2606.00301#bib.bib6)). Among them, HaloScope (Du et al., [2024](https://arxiv.org/html/2606.00301#bib.bib8)) identifies hallucination-related subspaces via embedding decomposition, while TSV (Park et al., [2025](https://arxiv.org/html/2606.00301#bib.bib19)) learns a steering vector to restructure internal representations. The most recent work (Su et al., [2025](https://arxiv.org/html/2606.00301#bib.bib20)) formulates hallucination detection as a reasoning task and applies reinforcement learning to train LLMs to localize specific hallucinated spans in long contexts. In contrast, FLaG attributes hallucinations to a mixture of heterogeneous failure mechanisms and automatically uncovers these latent mechanisms through group-aware learning.

### 2.2. Mixture-of-Experts

Mixture-of-Experts (MoE) models have been widely studied (Jacobs et al., [1991](https://arxiv.org/html/2606.00301#bib.bib21); Shazeer et al., [2017](https://arxiv.org/html/2606.00301#bib.bib22)) as a principled framework for modeling data heterogeneity by decomposing complex distributions into a set of specialized components.
In modern deep learning systems, MoE has been extensively adopted to improve both model capacity and efficiency, particularly in LLMs via sparse expert routing (Fedus et al., [2022](https://arxiv.org/html/2606.00301#bib.bib23)). Beyond serving as a scalable structure, MoE has also been explored as a mechanism-aware modeling paradigm. In this context, different experts specialize in distinct input patterns, latent factors, or failure modes (Jordan and Jacobs, [1994](https://arxiv.org/html/2606.00301#bib.bib24)). For instance, PNs allow inputs to be softly assigned to multiple experts and aggregate expert-specific predictions (Malinin and Gales, [2018](https://arxiv.org/html/2606.00301#bib.bib25)). In our paper, a specially designed MoE-style module is adopted as a lightweight detection framework, rather than as a generative architecture.

Figure 2. Overview of FLaG. For an instance $\mathbf{x}=(\mathbf{q},\mathbf{a})$, we extract geometry and probabilistic-trace evidence from a frozen LLM, fuse them into $\mathbf{r}(\mathbf{x})$, softly route $\mathbf{r}(\mathbf{x})$ to $K$ prototype-defined latent groups, and obtain the final truthfulness score $s(\mathbf{x})$ by log-marginal aggregation over group-wise scores.

## 3. Methodology

### 3.1. Problem Formulation

Let $\mathbf{q}=(q_{1},\ldots,q_{n})$ denote a user prompt, and $\mathbf{a}=(a_{1},\ldots,a_{m})$ denote a model-generated output. For simplicity, we define a unified detection instance as $\mathbf{x}=(\mathbf{q},\mathbf{a})\in\mathcal{X}$. The task of hallucination detection is to identify whether the output $\mathbf{a}$ is truthful under a specified evaluation protocol. This process can be formalized as learning a binary classifier $G:\mathcal{X}\rightarrow\{0,1\}$, where $G(\mathbf{x})=1$ indicates a truthful output, while $G(\mathbf{x})=0$ indicates a hallucinated one.
In this paper, we consider a general learning setup in which the training set consists of a mixture of labeled and unlabeled instances. Given a training set $\mathcal{D}$, it can be decomposed as $\mathcal{D}=\mathcal{D}_{\ell}\cup\mathcal{D}_{u}$, where $\mathcal{D}_{\ell}=\{(\mathbf{x},y)\}$ consists of instances with supervised labels $y\in\{0,1\}$, and $\mathcal{D}_{u}=\{\mathbf{x}\}$ consists of unlabeled instances. The fully supervised setting corresponds to $\mathcal{D}_{u}=\varnothing$.

### 3.2. Multi-View Evidence Representation

Given an instance $\mathbf{x}=(\mathbf{q},\mathbf{a})$, our goal is to extract representations that capture evidence relevant to truthfulness (or hallucination). In prior work, the input features to the classifier typically rely on a single global representation, e.g., the last-token embedding from the final layer (Park et al., [2025](https://arxiv.org/html/2606.00301#bib.bib19)). However, the evidence of hallucination is often heterogeneous and distributed across different telemetry signals of the generation process (Min et al., [2026](https://arxiv.org/html/2606.00301#bib.bib26)). We therefore consider aggregating evidence from the following signal sources:

#### 3.2.1. Latent Geometry as Evidential Signals

After the conditional generation $\mathbf{q}\rightarrow\mathbf{a}$, we feed the concatenated token sequence $\mathbf{q}\oplus\mathbf{a}$ back into the frozen LLM. We then compute the mean-pooled hidden states over the prompt and output spans, denoted as $\bar{\mathbf{h}}_{\mathbf{q}}$ and $\bar{\mathbf{h}}_{\mathbf{a}}$, respectively.
The composite semantic geometry evidence is:

$$\boldsymbol{\psi}(\mathbf{x})=\big[\mathbf{h}_{\mathrm{end}}\,;\,\bar{\mathbf{h}}_{\mathbf{a}}\,;\,\bar{\mathbf{h}}_{\mathbf{a}}-\bar{\mathbf{h}}_{\mathbf{q}}\big]\in\mathbb{R}^{3d}, \tag{1}$$

where $\mathbf{h}_{\mathrm{end}}$ denotes the hidden state of the last non-padding token. We interpret these hidden states from a latent geometry view. $\mathbf{h}_{\mathrm{end}}$ and the pooled $\bar{\mathbf{h}}_{\mathbf{a}}$ serve as absolute semantic coordinates of the output. The residual $\bar{\mathbf{h}}_{\mathbf{a}}-\bar{\mathbf{h}}_{\mathbf{q}}$ measures the relative semantic drift between the output and the prompt.

#### 3.2.2. Probabilistic Trace as Evidential Signals

The output $\mathbf{a}=(a_{1},\ldots,a_{m})$ is generated according to the conditional probability trace $\sum_{t=1}^{m}\log p(a_{t}\mid\mathbf{q},a_{<t})$. We introduce a small set of trace functionals to extract evidence from this trace. The token-level log-probability is defined as $\ell_{t}(\mathbf{x})=\log p(a_{t}\mid\mathbf{q},a_{<t})$, where $t\in[1,m]$ denotes the token position. First, we compute three sample statistics: the mean $\mu_{\ell}(\mathbf{x})=\frac{1}{m}\sum_{t=1}^{m}\ell_{t}(\mathbf{x})$, the minimum $\ell_{\mathrm{min}}(\mathbf{x})=\min_{t}\ell_{t}(\mathbf{x})$, and the standard deviation $\sigma_{\ell}(\mathbf{x})=\sqrt{\frac{1}{m}\sum_{t=1}^{m}\big(\ell_{t}(\mathbf{x})-\mu_{\ell}(\mathbf{x})\big)^{2}}$.
Second, we compute two distributional statistics: predictive entropy and logit margin (Ngnawé et al., [2024](https://arxiv.org/html/2606.00301#bib.bib27)), both of which are formally token-level averages. The former is computed as:

$$\mu_{H}(\mathbf{x})=-\frac{1}{m}\sum_{t=1}^{m}\sum_{v\in\mathcal{V}}p(v\mid\mathbf{q},a_{<t})\log p(v\mid\mathbf{q},a_{<t}), \tag{2}$$

where $\mathcal{V}$ denotes the vocabulary. The latter is computed as:

$$\mu_{\Delta}(\mathbf{x})=\frac{1}{m}\sum_{t=1}^{m}\big(z_{t}^{(1)}-z_{t}^{(2)}\big), \tag{3}$$

where $z_{t}^{(1)}$ and $z_{t}^{(2)}$ are the largest and second-largest logits at position $t$. Notably, $z$ refers to the pre-softmax logits corresponding to the probability distribution $p$. Third, we introduce a tail-frequency functional to count low-probability outliers:

$$\rho_{\mathrm{low}}(\mathbf{x})=\frac{1}{m}\sum_{t=1}^{m}\mathbb{I}\left[\ell_{t}(\mathbf{x})<\tau_{\ell}\right], \tag{4}$$

where $\tau_{\ell}$ is a fixed threshold. We also include the output length $m$ as a complementary signal. Collectively, the probabilistic trace evidence is constructed as:

$$\boldsymbol{\phi}(\mathbf{x})=\big[\mu_{\ell},\ \ell_{\mathrm{min}},\ \sigma_{\ell},\ \mu_{H},\ \mu_{\Delta},\ \rho_{\mathrm{low}},\ m\big]\in\mathbb{R}^{7}. \tag{5}$$

This evidence captures how consistently the model assigns probability mass to the generated tokens. Moreover, the statistics in $\boldsymbol{\phi}(\mathbf{x})$ act as computable proxies of the token-level likelihood ratio (App. [A](https://arxiv.org/html/2606.00301#A1)).
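The seven trace functionals of Eqs. (2)–(5) can be illustrated with a minimal sketch. It assumes the per-position pre-softmax logits and the generated token indices are already available as plain Python lists; the function names `log_softmax` and `trace_features` and the default threshold `tau_l=-5.0` are hypothetical choices for illustration, not values taken from the paper.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one logit vector."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in row))
    return [z - log_z for z in row]

def trace_features(logits, token_ids, tau_l=-5.0):
    """Return [mu_l, l_min, sigma_l, mu_H, mu_Delta, rho_low, m] as in Eq. (5).

    logits:    m rows of pre-softmax scores, one row per generated token.
    token_ids: the m generated token indices.
    tau_l:     low-probability threshold (assumed value, for illustration).
    """
    m = len(token_ids)
    ell, entropies, margins = [], [], []
    for row, tok in zip(logits, token_ids):
        lp = log_softmax(row)
        ell.append(lp[tok])                                   # token log-probability
        entropies.append(-sum(math.exp(v) * v for v in lp))   # predictive entropy
        top1, top2 = sorted(row, reverse=True)[:2]
        margins.append(top1 - top2)                           # logit margin
    mu = sum(ell) / m
    sigma = math.sqrt(sum((l - mu) ** 2 for l in ell) / m)    # population std, as in the text
    rho_low = sum(1 for l in ell if l < tau_l) / m            # tail frequency, Eq. (4)
    return [mu, min(ell), sigma, sum(entropies) / m,
            sum(margins) / m, rho_low, float(m)]

# Toy trace: 3 output tokens over a 4-word vocabulary with uniform logits.
phi = trace_features([[0.0] * 4 for _ in range(3)], [0, 1, 2], tau_l=-1.0)
```

With uniform logits every token has log-probability $\log\tfrac14$, so the spread and margin vanish and the entropy equals $\log 4$, which makes the toy case easy to verify by hand.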
Finally, we project the two types of evidence into a shared latent space to obtain a fused evidence representation:

$$\mathbf{r}(\mathbf{x})=f_{\mathrm{MLP}}\Big(\big[f_{\mathrm{proj}}(\boldsymbol{\psi}(\mathbf{x}))\,;\,\boldsymbol{\phi}(\mathbf{x})\big]\Big)\in\mathbb{R}^{d}, \tag{6}$$

where $f_{\mathrm{proj}}$ is a lightweight linear projection layer.

### 3.3. Group-Aware Evidential Reasoning

As depicted in Figure [2](https://arxiv.org/html/2606.00301#S2.F2), we introduce $K$ latent evidence groups to adaptively discriminate heterogeneous hallucination mechanisms.

#### 3.3.1. Prototype-based Evidence Allocation

For each instance $\mathbf{x}$, we allocate the extracted evidence representation $\mathbf{r}(\mathbf{x})$ to the corresponding latent group. We represent each latent group $g\in[1,K]$ by a learnable prototype $\mathbf{c}_{g}\in\mathbb{R}^{d}$. The prototype acts as an anchor for a region in the evidence space associated with a specific hallucination mechanism. We then define the negative energy function:

$$\alpha_{g}(\mathbf{x})=\frac{\mathbf{r}(\mathbf{x})^{\top}\mathbf{c}_{g}}{\|\mathbf{r}(\mathbf{x})\|_{2}\,\|\mathbf{c}_{g}\|_{2}}. \tag{7}$$

Intuitively, a larger $\alpha_{g}(\mathbf{x})$ indicates higher compatibility between the observed evidence and mechanism $g$. The routing distribution over groups can then be given in differentiable Boltzmann form:

$$\pi_{g}(\mathbf{x})=\frac{e^{\alpha_{g}(\mathbf{x})/\tau}}{\sum_{g^{\prime}=1}^{K}e^{\alpha_{g^{\prime}}(\mathbf{x})/\tau}}, \tag{8}$$

where $\tau>0$ controls the sharpness of the distribution.
The resulting $\pi_{g}(\mathbf{x})$ can be interpreted as a principled approximate posterior of $\mathbf{x}$ belonging to group $g$ in the evidence space (details deferred to App. [A](https://arxiv.org/html/2606.00301#A1)). The prototypes are learned jointly with the hallucination classifier. Thus, the group structure is adaptively induced from data without requiring additional annotations.

#### 3.3.2. Log-Marginal Evidence Aggregation

Next, we define how the evidence is translated into a final hallucination score. We associate each latent group $g$ with a linear scoring function:

$$s_{g}(\mathbf{x})=\mathbf{w}_{g}^{\top}\mathbf{r}(\mathbf{x})+b_{g}, \tag{9}$$

where $\mathbf{w}_{g}\in\mathbb{R}^{d}$ and $b_{g}\in\mathbb{R}$. Different groups are trained to emphasize different dimensions of the same evidence representation, reflecting heterogeneous hallucination mechanisms. The overall score is obtained by aggregating and marginalizing over all groups:

$$s(\mathbf{x})=\log\sum_{g=1}^{K}\pi_{g}(\mathbf{x})\,e^{s_{g}(\mathbf{x})}. \tag{10}$$

Rather than a mixture-of-experts (Jacobs et al., [1991](https://arxiv.org/html/2606.00301#bib.bib21)) heuristic, FLaG implements a learnable approximation to the Bayes-optimal log-evidence aggregation rule (§[4](https://arxiv.org/html/2606.00301#S4)) under heterogeneous hallucination mechanisms. The classifier is obtained by thresholding the score: $G(\mathbf{x})=\mathbb{I}\left[s(\mathbf{x})\geq 0\right]$, where a larger $s(\mathbf{x})$ indicates higher predicted truthfulness.
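The routing and aggregation steps of Eqs. (7)–(10) can be sketched as follows. This is a minimal illustration that assumes the fused evidence vector $\mathbf{r}(\mathbf{x})$ has already been computed and is nonzero; the function names and the default `tau=0.1` are hypothetical, and both softmax and aggregation are evaluated in log space for numerical stability, which is an implementation choice rather than something the paper specifies.

```python
import math

def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def cosine(u, v):
    """Cosine similarity, i.e. the negative energy alpha_g of Eq. (7).

    Assumes both vectors are nonzero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flag_score(r, prototypes, weights, biases, tau=0.1):
    """Return (s, pi, s_g) for one fused evidence vector r.

    tau=0.1 is an illustrative default, not a value from the paper.
    """
    scaled = [cosine(r, c) / tau for c in prototypes]
    log_norm = log_sum_exp(scaled)
    log_pi = [a - log_norm for a in scaled]                 # log of Eq. (8)
    s_g = [sum(wi * ri for wi, ri in zip(w, r)) + b         # Eq. (9)
           for w, b in zip(weights, biases)]
    s = log_sum_exp([lp + sg for lp, sg in zip(log_pi, s_g)])  # Eq. (10)
    pi = [math.exp(lp) for lp in log_pi]
    return s, pi, s_g

# Toy example with K = 2 groups in a 2-dimensional evidence space.
s, pi, s_g = flag_score(
    r=[1.0, 0.0],
    prototypes=[[1.0, 0.0], [0.0, 1.0]],
    weights=[[2.0, 0.0], [-1.0, 0.0]],
    biases=[0.0, 0.0],
    tau=1.0,
)
prediction = int(s >= 0.0)  # G(x) = I[s(x) >= 0]

# When every group assigns the same score, the aggregate equals that score.
s_equal, _, _ = flag_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                           [[3.0, 0.0], [3.0, 0.0]], [0.0, 0.0], tau=1.0)
```

Because the routing weights sum to one, $s(\mathbf{x})$ always lies between the smallest and largest group score, so the aggregation behaves like a soft maximum that is tempered by how strongly the instance is routed to each group.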
### 3.4. Learning Objective

The training pipeline follows a two-stage paradigm (Algorithm [1](https://arxiv.org/html/2606.00301#alg1)): we fit the labeled data using a supervised objective, and then optionally incorporate unlabeled data through a semi-supervised objective.

#### 3.4.1. Supervised Objective

For labeled instances $(\mathbf{x},y)\in\mathcal{D}_{\ell}$, we adopt a margin-based ranking objective. Given an instance pair $(\mathbf{x}^{+},\mathbf{x}^{-})$ with labels $y=1$ and $y=0$, respectively, we encourage the model to assign a higher score to the truthful instance:

$$\mathcal{L}_{\mathrm{sup}}=\mathbb{E}_{\mathbf{x}^{+},\mathbf{x}^{-}\in\mathcal{D}_{\ell}}\Big[\log\left(1+e^{-s(\mathbf{x}^{+})+s(\mathbf{x}^{-})}\right)\Big]. \tag{11}$$

This objective enforces consistent ordering between truthful and hallucinated outputs, without committing to a specific threshold.

#### 3.4.2. Semi-Supervised Objective

We further leverage unlabeled instances $\mathbf{x}\in\mathcal{D}_{u}$ to refine the preliminary latent group structure learned from supervised data. We posit that instances strongly associated with the same group should admit a consistent relative ordering under the group-aware scoring. Building on this intuition, we consider a subset of unlabeled instances with high routing confidence: $\mathcal{U}_{g}=\big\{\mathbf{x}\in\mathcal{D}_{u}\mid\pi_{g}(\mathbf{x})\geq\gamma_{g}\big\}$, where $\gamma_{g}$ is chosen such that $|\mathcal{U}_{g}|=k$. Within $\mathcal{U}_{g}$, the group-wise score $s_{g}(\mathbf{x})$ induces a one-dimensional geometry.
We form upper and lower partitions: $\mathcal{U}_{g}^{+}=\big\{\mathbf{x}\in\mathcal{U}_{g}\mid s_{g}(\mathbf{x})\geq\eta_{g}^{+}\big\}$ and $\mathcal{U}_{g}^{-}=\big\{\mathbf{x}\in\mathcal{U}_{g}\mid s_{g}(\mathbf{x})\leq\eta_{g}^{-}\big\}$, where $\eta_{g}^{+}$ and $\eta_{g}^{-}$ are chosen such that $|\mathcal{U}_{g}^{+}|=|\mathcal{U}_{g}^{-}|=\lfloor p|\mathcal{U}_{g}|\rfloor$ and $p$ is a quantile coefficient. We then encourage the classifier $G$ to preserve the relative geometry structure within $\mathcal{U}_{g}^{+}$ and $\mathcal{U}_{g}^{-}$. This process is enforced by a weighted group-consistent ranking loss:

$$\mathcal{L}_{\mathrm{gc}}=\lambda\sum_{g=1}^{K}\frac{1}{K}\,\mathbb{E}_{\mathbf{x}^{+}\sim\mathcal{U}_{g}^{+},\ \mathbf{x}^{-}\sim\mathcal{U}_{g}^{-}}\Big[\log\left(1+e^{-s(\mathbf{x}^{+})+s(\mathbf{x}^{-})}\right)\Big]. \tag{12}$$

Unlike explicit pseudo-labeling or full optimal transport formulations, this approach avoids constructing cost matrices or hard assignments. To prevent noisy constraints from corrupting the supervised decision boundary, we update only $s_{g}(\mathbf{x})$ instead of $s(\mathbf{x})$. Notably, both $\mathcal{L}_{\mathrm{sup}}$ and $\mathcal{L}_{\mathrm{gc}}$ operate on instance pairs sampled within each batch rather than over the entire dataset. The overall time complexity is bounded by $O(NB)$, where $N$ denotes the dataset size and $B$ denotes the batch size, thereby avoiding quadratic scaling.
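The two objectives share the same pairwise logistic form, and the partition step is a pair of sorts. The sketch below illustrates both on plain Python lists. The helper names are hypothetical, the loss is averaged over all positive/negative pairs in the batch as one plausible reading of the expectation, and feeding the group scores $s_g$ into the Stage II loss is an assumption: Eq. (12) is written in terms of $s(\mathbf{x})$, while the text states that only $s_g(\mathbf{x})$ is updated.

```python
import math

def pairwise_logistic_loss(pos_scores, neg_scores):
    """Mean of log(1 + exp(-(s+ - s-))) over all positive/negative pairs."""
    total = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            d = sn - sp
            # Stable softplus: log(1 + e^d) = max(d, 0) + log1p(e^{-|d|})
            total += max(d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / (len(pos_scores) * len(neg_scores))

def group_partitions(pi_g, s_g, k, p):
    """Return index lists (U_g^+, U_g^-) for a single group g.

    pi_g: routing probabilities of the unlabeled instances for group g.
    s_g:  group-wise scores of the same instances.
    k:    number of most confidently routed instances to keep.
    p:    quantile coefficient for the upper and lower partitions.
    """
    top_k = sorted(range(len(pi_g)), key=lambda i: pi_g[i], reverse=True)[:k]
    ranked = sorted(top_k, key=lambda i: s_g[i], reverse=True)
    n = math.floor(p * len(ranked))
    upper = ranked[:n]
    lower = ranked[len(ranked) - n:] if n > 0 else []
    return upper, lower

# Stage I style usage: truthful instances should outscore hallucinated ones.
loss_tie = pairwise_logistic_loss([0.0], [0.0])       # equals log 2
loss_good = pairwise_logistic_loss([5.0], [-5.0])     # correct ordering
loss_bad = pairwise_logistic_loss([-5.0], [5.0])      # reversed ordering

# Stage II style usage on five unlabeled instances for one group.
pi_g = [0.9, 0.8, 0.1, 0.7, 0.6]
s_g = [2.0, -1.0, 5.0, 0.5, -3.0]
upper, lower = group_partitions(pi_g, s_g, k=4, p=0.5)
loss_gc = pairwise_logistic_loss([s_g[i] for i in upper],
                                 [s_g[i] for i in lower])
```

Instance 2 has the highest group score but the lowest routing probability, so it is excluded before partitioning; this is the effect of selecting by routing confidence first and ranking by score second.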
Algorithm 1. FLaG Training and Inference (Fine-Grained Latent Grouping)

Input: frozen LLM backbone $\mathcal{M}$; labeled set $\mathcal{D}_{\ell}=\{(\mathbf{x},y)\}$; unlabeled set $\mathcal{D}_{u}=\{\mathbf{x}\}$ (optional); number of groups $K$; temperature $\tau$; semi-supervised weight $\lambda$; group top-$k$ size $k$; quantile coefficient $p$.

Trainable parameters: $\Theta=\{f_{\mathrm{proj}},f_{\mathrm{MLP}},\{\mathbf{c}_{g},\mathbf{w}_{g},b_{g}\}_{g=1}^{K}\}$.

- Function FuseEvidence($\mathbf{x}=(\mathbf{q},\mathbf{a})$):
  - Extract latent geometry evidence $\boldsymbol{\psi}(\mathbf{x})$ from $\mathcal{M}$.
  - Extract probabilistic trace evidence $\boldsymbol{\phi}(\mathbf{x})$.
  - Return $\mathbf{r}(\mathbf{x})=f_{\mathrm{MLP}}\left([\,f_{\mathrm{proj}}(\boldsymbol{\psi}(\mathbf{x}))\,;\,\boldsymbol{\phi}(\mathbf{x})\,]\right)$.
- Function Score($\mathbf{x}$):
  - $\mathbf{r}\leftarrow$ FuseEvidence($\mathbf{x}$).
  - For $g=1$ to $K$: $\alpha_{g}(\mathbf{x})\leftarrow\dfrac{\mathbf{r}^{\top}\mathbf{c}_{g}}{\|\mathbf{r}\|_{2}\|\mathbf{c}_{g}\|_{2}}$.
  - For $g=1$ to $K$: $\pi_{g}(\mathbf{x})\leftarrow\dfrac{\exp(\alpha_{g}(\mathbf{x})/\tau)}{\sum_{g^{\prime}=1}^{K}\exp(\alpha_{g^{\prime}}(\mathbf{x})/\tau)}$ and $s_{g}(\mathbf{x})\leftarrow\mathbf{w}_{g}^{\top}\mathbf{r}+b_{g}$.
  - $s(\mathbf{x})\leftarrow\log\sum_{g=1}^{K}\pi_{g}(\mathbf{x})\exp\big(s_{g}(\mathbf{x})\big)$.
  - Return $s(\mathbf{x})$.
- Stage I: Supervised training. For each minibatch of labeled pairs $(\mathbf{x}^{+},\mathbf{x}^{-})$ with $y=1$ and $y=0$:
  - $s^{+}\leftarrow$ Score($\mathbf{x}^{+}$); $s^{-}\leftarrow$ Score($\mathbf{x}^{-}$).
  - $\mathcal{L}_{\mathrm{sup}}\leftarrow\log\left(1+\exp\big(-s(\mathbf{x}^{+})+s(\mathbf{x}^{-})\big)\right)$.
  - Update all parameters $\Theta$ using $\nabla_{\Theta}\mathcal{L}_{\mathrm{sup}}$.
- Stage II: Semi-supervised group-consistent refinement (only if $\mathcal{D}_{u}\neq\varnothing$). For each minibatch of unlabeled instances $\mathbf{x}\in\mathcal{D}_{u}$:
  - Compute $\{\pi_{g}(\mathbf{x}),s_{g}(\mathbf{x}),s(\mathbf{x})\}_{g=1}^{K}$ via Score.
  - For $g=1$ to $K$:
    - $\mathcal{U}_{g}\leftarrow$ top-$k$ instances ranked by $\pi_{g}(\mathbf{x})$.
    - $\mathcal{U}_{g}^{+}\leftarrow$ top-$p$ fraction of $\mathcal{U}_{g}$ ranked by $s_{g}(\mathbf{x})$.
    - $\mathcal{U}_{g}^{-}\leftarrow$ bottom-$p$ fraction of $\mathcal{U}_{g}$ ranked by $s_{g}(\mathbf{x})$.
    - Sample $(\mathbf{x}^{+},\mathbf{x}^{-})$ with $\mathbf{x}^{+}\sim\mathcal{U}_{g}^{+},\ \mathbf{x}^{-}\sim\mathcal{U}_{g}^{-}$.
    - Accumulate $\mathcal{L}_{\mathrm{gc}}\leftarrow\mathcal{L}_{\mathrm{gc}}+\log\left(1+\exp\big(-s(\mathbf{x}^{+})+s(\mathbf{x}^{-})\big)\right)$.
  - $\mathcal{L}_{\mathrm{gc}}\leftarrow\dfrac{\lambda}{K}\mathcal{L}_{\mathrm{gc}}$.
  - Update only $\{\mathbf{w}_{g},b_{g}\}_{g=1}^{K}$ using $\nabla\mathcal{L}_{\mathrm{gc}}$.
- Inference: given a test instance $\mathbf{x}$, output the reliability score $s(\mathbf{x})$ (or the prediction $G(\mathbf{x})=\mathbb{I}[s(\mathbf{x})\geq 0]$).

## 4. Theoretical Analysis

We analyze FLaG from the perspective of composite hypothesis testing. The proofs and further analysis for this section can be found in App. [A](https://arxiv.org/html/2606.00301#A1). First, we assume that, conditioned on the label, the instance $\mathbf{x}$ is generated from a mixture model:

$$p(\mathbf{x}\mid y)=\sum_{g=1}^{K}\pi_{y}(g)\,p_{g}(\mathbf{x}\mid y), \tag{13}$$

where $\pi_{y}(g)$ denotes the label-dependent mixing weight, and $p_{g}(\cdot\mid y)$ denotes the group-conditional distribution. Meanwhile, for each group $g$, we define the group-aware log-likelihood ratio:

$$\ell_{g}(\mathbf{x})=\log\frac{p_{g}(\mathbf{x}\mid y=1)}{p_{g}(\mathbf{x}\mid y=0)}. \tag{14}$$

###### Theorem 4.1.

Under the mixture model, the Bayes-optimal log-likelihood ratio (LLR) for testing $y=1$ versus $y=0$ is given by

$$\Lambda^{\star}(\mathbf{x})=\log\sum_{g=1}^{K}p(g\mid\mathbf{x},y=0)\,e^{\ell_{g}(\mathbf{x})+\log\tfrac{\pi_{1}(g)}{\pi_{0}(g)}}, \tag{15}$$

where $p(g\mid\mathbf{x},y=0)=\frac{\pi_{0}(g)\,p_{g}(\mathbf{x}\mid y=0)}{\sum_{g^{\prime}}\pi_{0}(g^{\prime})\,p_{g^{\prime}}(\mathbf{x}\mid y=0)}$ denotes the posterior distribution over latent groups under the null hypothesis $y=0$.
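The identity in Theorem 4.1 can be checked directly from the mixture model. The following is a short derivation sketch, not the appendix proof; it assumes $\pi_{0}(g)>0$ and $p_{g}(\mathbf{x}\mid y=0)>0$ for every group so that the ratios are well defined.

```latex
\begin{aligned}
\Lambda^{\star}(\mathbf{x})
  &= \log\frac{p(\mathbf{x}\mid y=1)}{p(\mathbf{x}\mid y=0)}
   = \log\frac{\sum_{g=1}^{K}\pi_{1}(g)\,p_{g}(\mathbf{x}\mid y=1)}
              {\sum_{g'=1}^{K}\pi_{0}(g')\,p_{g'}(\mathbf{x}\mid y=0)} \\
  &= \log\sum_{g=1}^{K}
     \underbrace{\frac{\pi_{0}(g)\,p_{g}(\mathbf{x}\mid y=0)}
                      {\sum_{g'}\pi_{0}(g')\,p_{g'}(\mathbf{x}\mid y=0)}}_{p(g\mid\mathbf{x},\,y=0)}
     \cdot\frac{\pi_{1}(g)}{\pi_{0}(g)}
     \cdot\frac{p_{g}(\mathbf{x}\mid y=1)}{p_{g}(\mathbf{x}\mid y=0)} \\
  &= \log\sum_{g=1}^{K} p(g\mid\mathbf{x},y=0)\,
     \exp\!\Big(\ell_{g}(\mathbf{x})+\log\tfrac{\pi_{1}(g)}{\pi_{0}(g)}\Big).
\end{aligned}
```

The second line multiplies and divides each numerator term by $\pi_{0}(g)\,p_{g}(\mathbf{x}\mid y=0)$, which is what turns the ratio of two mixtures into a posterior-weighted sum of per-group likelihood ratios.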
Theorem 4.1 therefore shows that the Bayes-optimal test statistic necessarily takes the form of a log-marginal aggregation of group-aware log-likelihood ratios, weighted by a group posterior. Importantly, this posterior is taken with respect to the null hypothesis $y=0$, and both $p(g\mid\mathbf{x},y=0)$ and $\ell_{g}(\mathbf{x})$ depend on the unknown group-conditional distributions $\{p_{g}(\cdot\mid y)\}$.

Rather than attempting to estimate these distributions explicitly, we construct a learnable surrogate. Specifically, FLaG replaces the intractable posterior $p(g\mid\mathbf{x},y=0)$ with a data-driven routing distribution $\pi_{g}(\mathbf{x})$, and replaces the unknown group-wise log-likelihood ratio $\ell_{g}(\mathbf{x})$ with a parametric group score $s_{g}(\mathbf{x})$ computed from the fused evidence representation. The resulting score $s(\mathbf{x})=\log\sum_{g=1}^{K}\pi_{g}(\mathbf{x})\,e^{s_{g}(\mathbf{x})}$ constitutes a tractable approximation to the Bayes-optimal LLR in Eq. ([15](https://arxiv.org/html/2606.00301#S4.E15)). Since any strictly monotone transformation of $\Lambda^{\star}(\mathbf{x})$ preserves the optimal ordering, this approximation suffices for reliable hallucination detection.

### 4.1. Analysis of Approximation Error Bound

We next show that FLaG is not merely an architectural convenience: increasing the number of groups $K$ yields a provable reduction in the approximation error to the Bayes-optimal statistic. Assume the fused evidence representation $\mathbf{r}(\mathbf{x})$ lies in a compact set $\mathcal{S}\subset\mathbb{R}^{d}$.

###### Theorem 4.2.

Assume $\Lambda^{\star}(\mathbf{x})$ is Lipschitz as a function of $\mathbf{r}(\mathbf{x})$ on $\mathcal{S}$.
There exists a parameter setting of FLaG with $K$ groups such that

$$\sup_{\mathbf{x}}\big|s(\mathbf{x})-\Lambda^{\star}(\mathbf{x})\big|=O\!\left(K^{-1/d}\right).\tag{16}$$

This theorem formalizes the role of fine-grained grouping. By allocating different local linear experts to different regions in the evidence space, FLaG approximates the Bayes-optimal statistic with a controllable error that decays polynomially in $K$. Moreover, the decay exponent is determined by the evidence dimension $d$.

Table 1. Overall performance of FLaG and the baselines on LLaMA-3-8B-Instruct (LLaMA) and Qwen2.5-7B-Instruct (Qwen) across four datasets. All results are AUROC scores in percent. The best results are in bold and the second best are underlined.

| Method | TruthfulQA (LLaMA) | TruthfulQA (Qwen) | TriviaQA (LLaMA) | TriviaQA (Qwen) | CoQA (LLaMA) | CoQA (Qwen) | TyDiQA-GP (LLaMA) | TyDiQA-GP (Qwen) | Average (LLaMA) | Average (Qwen) |
|---|---|---|---|---|---|---|---|---|---|---|
| *Training-free Methods* | | | | | | | | | | |
| Perplexity | 61.13 | 56.40 | 74.61 | 52.68 | 62.76 | 59.86 | 52.25 | 49.13 | 62.69 | 54.52 |
| Semantic Entropy | 58.08 | 62.51 | 79.40 | 72.79 | 55.62 | 53.40 | 55.69 | 50.21 | 62.20 | 59.73 |
| Lexical Similarity | 51.54 | 58.33 | 77.59 | 66.32 | 79.03 | 70.12 | 62.66 | 61.98 | 67.71 | 64.19 |
| EigenScore | 57.12 | 53.24 | 70.16 | 69.80 | 72.59 | 73.15 | 74.63 | 62.48 | 68.63 | 64.67 |
| SelfCKGPT | 59.66 | 64.70 | 78.77 | 70.15 | 76.41 | 74.43 | 52.83 | 56.01 | 66.92 | 66.32 |
| Verbalize | 62.33 | 55.07 | 53.12 | 50.74 | 54.91 | 52.92 | 54.86 | 53.65 | 56.30 | 53.35 |
| Self-evaluation | 52.03 | 50.84 | 78.98 | 60.66 | 64.07 | 49.92 | 74.27 | 57.59 | 67.34 | 54.75 |
| SPUQ | 64.46 | 59.12 | 71.57 | 65.90 | 62.36 | 63.47 | 67.40 | 61.24 | 66.45 | 62.43 |
| *Training-based Methods* | | | | | | | | | | |
| CCS | 52.19 | 52.90 | 60.74 | 50.17 | 50.33 | 51.89 | 72.02 | 55.44 | 58.82 | 51.57 |
| HaloScope | 66.79 | 69.66 | 65.31 | 64.41 | 65.83 | 61.87 | 74.08 | 66.22 | 68.00 | 65.54 |
| Linear probe | 70.73 | 68.01 | 74.54 | 65.07 | 69.25 | 67.28 | 70.62 | 71.46 | 71.28 | 67.96 |
| SAPLMA | 71.68 | 71.92 | 78.88 | 68.25 | 73.33 | 71.24 | 69.03 | 66.74 | 73.23 | 70.04 |
| EarlyDetec | 66.01 | 66.15 | 69.46 | 75.07 | 65.58 | 66.33 | 72.88 | 69.95 | 68.48 | 69.38 |
| EGH | 62.89 | 61.58 | 66.30 | 70.53 | 67.91 | 72.06 | 71.24 | 64.17 | 67.09 | 67.09 |
| TTPD | 68.27 | 71.10 | 71.43 | 67.54 | 70.86 | 68.77 | 69.19 | 70.63 | 69.94 | 69.51 |
| Probe-LR | 65.41 | 70.36 | 69.21 | 69.88 | 74.95 | 65.62 | 67.08 | 66.20 | 69.16 | 68.77 |
| TSV | 63.94 | 57.96 | 63.54 | 66.32 | 68.45 | 64.78 | 70.15 | 65.83 | 66.52 | 63.72 |
| FLaG (Semi) | 72.49 | 70.59 | 74.63 | 74.06 | 70.20 | 69.25 | 72.27 | 67.02 | 72.40 | 70.23 |
| FLaG (Full) | 75.76 | 73.04 | 79.51 | 76.82 | 75.32 | 72.64 | 74.41 | 72.05 | 76.25 | 73.64 |

## 5. Experiments

### 5.1. Experimental Settings

#### 5.1.1. Evaluation

We conduct experiments on four benchmarks under two different settings: (1) CoQA (Reddy et al., [2019](https://arxiv.org/html/2606.00301#bib.bib28)) and TyDiQA-GP (English) (Clark et al., [2020](https://arxiv.org/html/2606.00301#bib.bib29)), where a supporting passage is provided; and (2) TruthfulQA (Lin et al., [2022b](https://arxiv.org/html/2606.00301#bib.bib30)) and TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.00301#bib.bib31)), where no external evidence is given. For each dataset, we reserve 25% of the instances as the test set, additionally sample 100 non-overlapping instances as the validation set, and use the remaining data for training. Moreover, we consider two widely used open-source LLMs that provide accessible internal representations, namely Qwen2.5-7B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.00301#bib.bib32)) and LLaMA-3-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2606.00301#bib.bib33)). By default, we use greedy decoding for generation. Experiments on the corresponding base models are deferred to §[5.2](https://arxiv.org/html/2606.00301#S5.SS2).
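The data split described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function name, the shuffling strategy, and the rounding of the 25% test share are assumptions.

```python
import random


def split_indices(num_instances, test_fraction=0.25, num_validation=100, seed=42):
    """Split instance indices into train/validation/test without overlap.

    Assumed protocol: shuffle once, take 25% as test, then 100 instances
    as validation, and keep the remainder for training.
    """
    indices = list(range(num_instances))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    num_test = int(round(test_fraction * num_instances))
    test = indices[:num_test]
    validation = indices[num_test:num_test + num_validation]
    train = indices[num_test + num_validation:]
    return train, validation, test


# Hypothetical dataset size, used only to show the resulting proportions.
train, validation, test = split_indices(1000)
print(len(train), len(validation), len(test))
```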
Following prior work (Du et al., [2024](https://arxiv.org/html/2606.00301#bib.bib8)), to avoid sensitivity to threshold selection, we report performance using the area under the receiver operating characteristic curve (AUROC). To obtain instance-level labels (hallucinated vs. truthful), we adopt an LLM-as-a-judge protocol by prompting Qwen3-235B-A22B. We do not follow prior studies that rely on BLEURT (Sellam et al., [2020](https://arxiv.org/html/2606.00301#bib.bib34)), as we observe that such metrics fail to capture fine-grained semantic discrepancies, leading to unreliable evaluation. A 100-sample human audit shows 99% agreement for the Qwen3 judge versus 65% for BLEURT.

#### 5.1.2. Baselines

For a comprehensive comparison, we evaluate FLaG against 17 baselines, including: (1) self-assessment methods: Perplexity ([Ren et al.,](https://arxiv.org/html/2606.00301#bib.bib35)), Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.00301#bib.bib15)), Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.00301#bib.bib36)), SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.00301#bib.bib18)), EigenScore (Chen et al., [2024](https://arxiv.org/html/2606.00301#bib.bib9)), Verbalize ([Lin et al.,](https://arxiv.org/html/2606.00301#bib.bib37)), Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.00301#bib.bib38)), and SPUQ (Gao et al., [2024](https://arxiv.org/html/2606.00301#bib.bib39)).
(2) internal state-based methods: CCS (Burns et al., [2023](https://arxiv.org/html/2606.00301#bib.bib4)), HaloScope (Du et al., [2024](https://arxiv.org/html/2606.00301#bib.bib8)), Linear probe (Pagh et al., [2007](https://arxiv.org/html/2606.00301#bib.bib40)), SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.00301#bib.bib5)), EarlyDetec (Snyder et al., [2024](https://arxiv.org/html/2606.00301#bib.bib41)), EGH (Hu et al., [2024](https://arxiv.org/html/2606.00301#bib.bib42)), TTPD (Bürger et al., [2024](https://arxiv.org/html/2606.00301#bib.bib43)), Probe-LR (Liu et al., [2024](https://arxiv.org/html/2606.00301#bib.bib44)), and TSV (Park et al., [2025](https://arxiv.org/html/2606.00301#bib.bib19)). The detailed configuration of each baseline is given in App. [C](https://arxiv.org/html/2606.00301#A3).

#### 5.1.3. Implementation Details

We use the frozen LLM as the backbone representation extractor. We set the fused evidence dimension to match the backbone hidden size, and use a two-layer MLP with a hidden width of 1024 for feature fusion. Unless otherwise specified, we use $K=64$ latent groups and a fixed routing temperature of 0.1. All trainable components are initialized with He uniform initialization and optimized using AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.00301#bib.bib45)) with a learning rate of $8\times 10^{-4}$ and a weight decay of 0.01. We fix the random seed to 42 and conduct all experiments on a single NVIDIA H200 GPU using PyTorch 2.7, CUDA 12.8, and BF16 precision. For the supervised stage, we train for 20 epochs with a batch size of 128. When unlabeled data are available, the semi-supervised objective is enabled after an initial warm-up of 20 epochs. For each group, we select the top 32 unlabeled instances with the highest routing confidence. Within this subset, the top and bottom 20% of instances ranked by the group-specific score are used to construct pseudo-ranked pairs.
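The pseudo-pair construction above can be sketched in plain Python. This is a hedged illustration, not the released implementation: the function names, the list-based inputs, the rounding of the 20% cutoff, and the default loss weight `lam` are assumptions, and the code only evaluates the group-consistency loss value without performing any gradient update.

```python
import math
import random


def pairwise_logistic_loss(s_pos, s_neg):
    """log(1 + exp(-(s_pos - s_neg))), evaluated in a numerically stable way."""
    margin = s_pos - s_neg
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pseudo_pair_for_group(routing, group_scores, top_k=32, frac=0.2, rng=None):
    """Pick one (positive, negative) index pair for a single group.

    routing[i] is the routing weight of instance i for this group and
    group_scores[i] is its group-specific score. With very few instances
    the two pools can overlap; a real pipeline would need to guard that.
    """
    rng = rng or random.Random(42)
    by_routing = sorted(range(len(routing)), key=lambda i: routing[i], reverse=True)
    confident = by_routing[:top_k]                      # top-k routing confidence
    ranked = sorted(confident, key=lambda i: group_scores[i], reverse=True)
    cutoff = max(1, int(frac * len(ranked)))            # assumed rounding: floor
    return rng.choice(ranked[:cutoff]), rng.choice(ranked[-cutoff:])


def group_consistency_loss(routing, group_scores, scores, lam=0.05, seed=42):
    """(lam / K) * sum over groups of the pairwise loss on aggregated scores.

    routing[i][g] and group_scores[i][g] are per-instance, per-group values;
    scores[i] is the aggregated score s(x_i).
    """
    rng = random.Random(seed)
    num_groups = len(routing[0])
    total = 0.0
    for g in range(num_groups):
        pos, neg = pseudo_pair_for_group(
            [row[g] for row in routing], [row[g] for row in group_scores], rng=rng
        )
        total += pairwise_logistic_loss(scores[pos], scores[neg])
    return lam * total / num_groups
```

With 32 selected instances and a 20% fraction, the floor rounding used here gives six candidates on each side; other rounding choices are equally plausible.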
The semi-supervised loss is weighted by 0.05 relative to the supervised objective, and this second stage runs for an additional 20 epochs. Unless otherwise specified, the semi-supervised setup is reported only in Table [1](https://arxiv.org/html/2606.00301#S4.T1). The classification threshold during inference is set to 0.5. All hyperparameters are determined via empirically guided grid search on the validation set.

### 5.2. Overall Performance

Table [1](https://arxiv.org/html/2606.00301#S4.T1) reports the overall performance across four benchmarks and two instruction-tuned LLMs. We report the results of FLaG under both the fully supervised setting (Full) and the semi-supervised setting (Semi, using only 20% labeled instances). In contrast, training-based baselines are evaluated exclusively in the fully supervised regime. Under full supervision, FLaG achieves the best mean performance by a clear margin, yielding absolute improvements of 3.03 and 3.60 percentage points on the two LLM backbones, respectively. Under semi-supervision, FLaG already outperforms most fully supervised competitors. Notably, these gains are consistent across datasets and model backbones. Overall, the results suggest that our approach does not rely on dataset-specific shortcuts or any single uncertainty proxy; instead, it provides generalized hallucination ranking across model families and evaluation settings.

Several notable patterns emerge from the comparison. First, strong training-free methods such as Semantic Entropy, EigenScore, or Lexical Similarity exhibit high performance on specific datasets but fail to dominate on average. This suggests that single-view uncertainty or consistency signals are tightly coupled to particular hallucination types and answer structures.
Second, although supervised probes and truthfulness classifiers (e.g., SAPLMA, Linear probe, TSV, HaloScope) substantially improve average performance, they remain limited by a global decision boundary that implicitly assumes a homogeneous hallucination mechanism across samples. Third, the performance gap between FLaG (Semi) and FLaG (Full) suggests that additional supervision primarily refines the routing of samples to different evidence groups rather than merely strengthening a monolithic classifier. Overall, these observations support the hypothesis that hallucinations arise from heterogeneous generation failures, and that explicitly modeling and marginalizing over multiple latent error mechanisms enables more reliable and dataset-agnostic hallucination detection.

Figure 3. Transferability (reported in AUROC) across datasets and models: (a) dataset transferability; (b) model transferability. The vertical axis denotes the source domain or model, while the horizontal axis indicates the target.

Figure 4. Word-cloud interpretation of the top groups: (a) Top Group-1; (b) Top Group-2; (c) Top Group-3; (d) Top Group-4. The summary words of each group are produced by Gemini3-flash.

#### 5.2.1. Transferability

Next, we analyze the transferability of FLaG from the perspective of test-time hallucination detection. Under this setting, the conditions at test time (target) may differ from those used during training (source). As shown in Figure [3(a)](https://arxiv.org/html/2606.00301#S5.F3.sf1), in dataset transfer, detectors trained on a single dataset generalize well to unseen target datasets: AUROC remains consistently high and typically stays within a narrow margin of the in-domain performance. Notably, cross-dataset transfer does not collapse even when the source and target datasets differ substantially in answer format or linguistic characteristics (e.g., TriviaQA $\leftrightarrow$ TyDiQA-GP).
This suggests that the learned detection signals are not tied to dataset-specific surface patterns. As expected, transfer is generally stronger between datasets with more similar distributions, such as TriviaQA $\leftrightarrow$ TruthfulQA, both of which are knowledge-seeking QA tasks that do not require explicit supporting passages.

Similarly, Figure [3(b)](https://arxiv.org/html/2606.00301#S5.F3.sf2) demonstrates strong cross-model transferability. Detectors trained on one backbone (e.g., LLaMA or Qwen, base or instruct variants) retain robust performance when applied to other backbones, with AUROC scores remaining above 72 in all cross-model settings and often exceeding 80 when transferring from base to instruct models. This asymmetry suggests that representations learned from base models capture more generalizable hallucination-related features, whereas instruction tuning primarily introduces stylistic variability without fundamentally altering the underlying error mechanisms. Overall, these results show that our approach learns largely model- and dataset-agnostic signals of hallucination, further supporting the claim that it captures intrinsic generation failures rather than overfitting to specific domains, prompts, or architectures.

#### 5.2.2. Robustness

We evaluate robustness by varying instance complexity and generation diversity on LLaMA-3-8B-Instruct using the TruthfulQA dataset. Instance complexity is approximated by response length, as longer generations typically involve multi-step reasoning, a larger number of entities or relations, and greater exposure to error accumulation and self-reinforcement. These factors are known to exacerbate hallucination behavior.
Figure [5(a)](https://arxiv.org/html/2606.00301#S5.F5.sf1) shows that detection performance improves or remains stable as instance length increases, indicating that our method exploits richer internal and trajectory-level signals that become more informative in complex generations, rather than being distracted by surface-level verbosity. Importantly, performance does not degrade for very long responses, suggesting resilience to compounding noise in extended outputs. Figure [5(b)](https://arxiv.org/html/2606.00301#S5.F5.sf2) examines robustness under increasing sampling temperature, which induces higher output diversity and weaker token-level confidence. While AUROC gradually decreases at high temperatures, the degradation is smooth and limited, and performance remains competitive even when diversity substantially increases. This trend suggests that although high-temperature sampling blurs local probability cues, the method continues to rely on complementary structural and representation-level evidence. FLaG thus maintains reliable hallucination ranking across a wide range of generation regimes.

Figure 5. Robustness to (a) instance complexity and (b) generation diversity, reported on LLaMA-3-8B-Instruct and TruthfulQA.

#### 5.2.3. Interpretability

To provide a systematic interpretation of the latent mechanisms discovered by FLaG, we analyze the semantic characteristics of the learned groups using an external large language model. Concretely, we use the TruthfulQA dataset as an illustrative example and assign each instance to its most activated group according to the group-wise scores produced by our model. For each group, we then collect the corresponding questions and responses and prompt Gemini3-flash to summarize their dominant error patterns in a descriptive manner. Figure [4](https://arxiv.org/html/2606.00301#S5.F4) visualizes the resulting summaries using word clouds for the top four groups.
The groups exhibit clear and semantically coherent distinctions: some groups are dominated by fact fabrication phenomena such as pseudo-history, stale knowledge, or entity mismatch, while others emphasize logical contradiction, constraint failure, or imitative falsehood driven by user sycophancy. Notably, these patterns align well with established taxonomies of hallucination types, even though no explicit hallucination categories are provided during training. This result suggests that the group-aware reasoning module learns to partition samples according to meaningful latent error mechanisms rather than superficial correlations, providing qualitative evidence that FLaG captures interpretable and structured representations of hallucination behaviors.

Figure 6. Ablation study on hyperparameter sensitivity at the (a) distribution level, (b) group level, and (c) supervision level. The backbone LLM is LLaMA-3-8B-Instruct.

### 5.3. Ablation Study

We conduct ablation studies from two perspectives: the architectural components of FLaG and its key hyperparameters.

#### 5.3.1. Component-Wise Ablation

We conduct an ablation study to examine the contribution of each component in our framework, including the semantic geometric evidence representation $\boldsymbol{\psi}(\mathbf{x})$ (SGE), the probabilistic trace evidence representation $\boldsymbol{\phi}(\mathbf{x})$ (PTE), group-aware reasoning (GAR), and log-marginal aggregation (LMA). When ablating both SGE and PTE, we follow prior work and instead use the last-token hidden embedding as the input representation for classification; for the GAR ablation, we reduce the number of groups to one, effectively removing mechanism-aware routing; for the LMA ablation, we replace log-sum-exp aggregation with a naive weighted average.
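To make the LMA ablation concrete, the sketch below contrasts the log-marginal aggregation $s=\log\sum_{g}\pi_{g}e^{s_{g}}$ with the naive weighted average $\sum_{g}\pi_{g}s_{g}$. The helper names and the two-group example are illustrative assumptions, not values taken from the experiments.

```python
import math


def log_marginal(weights, group_scores):
    """s = log sum_g pi_g * exp(s_g), using the max-shift trick for stability."""
    shift = max(group_scores)
    total = sum(w * math.exp(s - shift) for w, s in zip(weights, group_scores))
    return shift + math.log(total)


def weighted_average(weights, group_scores):
    """Naive alternative used in the ablation: sum_g pi_g * s_g."""
    return sum(w * s for w, s in zip(weights, group_scores))


# Hypothetical instance: one weakly routed group carries strong positive evidence.
routing_weights = [0.9, 0.1]
scores = [-1.0, 4.0]

print(log_marginal(routing_weights, scores))      # positive overall score
print(weighted_average(routing_weights, scores))  # negative overall score
```

By Jensen's inequality the log-marginal score is never smaller than the weighted average, and the gap grows when the group scores disagree. In this toy case the two rules even disagree in sign, which illustrates how a single confident group can dominate under log-marginal aggregation but is averaged away otherwise.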
As shown in Table [2](https://arxiv.org/html/2606.00301#S5.T2), removing either SGE or PTE leads to a consistent performance drop across all datasets, with the degradation being more pronounced when SGE is removed, highlighting the importance of semantic geometry signals for capturing representation-level inconsistencies associated with hallucinations. When both evidence representations are removed, performance collapses sharply, indicating that simple last-token embeddings are insufficient to support reliable hallucination detection. Ablating GAR also results in a noticeable drop, suggesting that modeling heterogeneous hallucination mechanisms via group-aware routing is critical beyond simply aggregating evidence in a monolithic manner. Similarly, replacing LMA with naive averaging degrades performance, demonstrating that principled log-marginal aggregation is necessary to properly combine heterogeneous evidence sources. Overall, the full model consistently achieves the best performance across all datasets, confirming that hallucination detection benefits from the complementary interplay between rich evidence representations, mechanism-aware reasoning, and theoretically grounded aggregation.

Table 2. Ablation study on varying components (reported in AUROC). The backbone LLM is LLaMA-3-8B-Instruct.

| SGE | PTE | GAR | LMA | TruthfulQA | TriviaQA | CoQA | TyDiQA-GP |
|---|---|---|---|---|---|---|---|
| ✗ | ✓ | ✓ | ✓ | 66.31 | 70.88 | 64.27 | 63.05 |
| ✓ | ✗ | ✓ | ✓ | 71.24 | 74.79 | 72.64 | 69.03 |
| ✗ | ✗ | ✓ | ✓ | 55.14 | 57.02 | 54.38 | 53.61 |
| ✓ | ✓ | ✗ | ✓ | 72.67 | 74.93 | 72.41 | 70.57 |
| ✓ | ✓ | ✓ | ✗ | 73.84 | 77.31 | 73.58 | 72.69 |
| ✓ | ✓ | ✓ | ✓ | 75.76 | 79.52 | 75.32 | 74.41 |

#### 5.3.2. Hyperparameter Sensitivity

We analyze the sensitivity of our method with respect to key hyperparameters controlling distribution modeling, mechanism granularity, and supervision strength.
Figure [6(a)](https://arxiv.org/html/2606.00301#S5.F6.sf1) varies the routing temperature $\tau$, which governs the sharpness of group assignment; performance remains stable over a wide range and peaks at moderate temperatures, indicating that overly sharp routing restricts evidence sharing while overly smooth routing weakens mechanism specialization. Figure [6(b)](https://arxiv.org/html/2606.00301#S5.F6.sf2) examines the number of groups $K$, where performance improves steadily as $K$ increases from small values, reflecting the benefit of modeling heterogeneous hallucination mechanisms. However, performance slightly degrades when $K$ is increased to 128, which we attribute to insufficient data diversity to reliably support such a fine-grained partition: with limited supervision, the model lacks enough distinct evidence patterns to populate and specialize all groups, leading to underfitting and unstable group assignments. Finally, Figure [6(c)](https://arxiv.org/html/2606.00301#S5.F6.sf3) shows that performance improves consistently with higher labeled ratios, suggesting that additional supervision primarily helps refine group-level specialization rather than altering the overall behavior of the model. Together, these results indicate that the proposed framework is robust to hyperparameter choices. Moreover, the performance of FLaG is maximized when the model capacity for latent mechanism modeling is well matched to the diversity and scale of the available data.

## 6. Conclusions and Limitations

We propose FLaG, a lightweight hallucination detector that treats hallucination detection as *mechanism-aware evidence aggregation* under heterogeneous latent failure modes.

Limitations. FLaG leverages internal generation telemetry (hidden states and token-level statistics) to construct multi-view evidence; its current form is therefore not applicable in strictly black-box settings where only the final text is observable.
Moreover, our theoretical connection to Bayes-optimal log-evidence aggregation is established under a mixture-of-mechanisms view of evidence distributions; extending the analysis to settings with stronger distribution shift, additional conditioning signals (e.g., retrieval/tool traces), or more structured dependencies between evidence views is an important direction.

## Acknowledgments

This work is supported by the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China. Haobo Wang is also supported by the NSFC under Grant No. 62402424.

## References

- A. Azaria and T. Mitchell (2023). The internal state of an LLM knows when it’s lying. In EMNLP Findings.
- L. Bürger, F. A. Hamprecht, and B. Nadler (2024). Truth is universal: robust detection of lies in LLMs. Advances in Neural Information Processing Systems 37, pp. 138393–138431.
- C. Burns, H. Ye, D. Klein, and J. Steinhardt (2023). Discovering latent knowledge in language models without supervision. In ICLR.
- C. Chen, K. Liu, Z. Chen, Y. Gu, Y. Wu, M. Tao, Z. Fu, and J. Ye (2024). INSIDE: LLMs’ internal states retain the power of hallucination detection. In ICLR.
- J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J.
Palomaki (2020). TyDi QA: a benchmark for information-seeking question answering in typologically diverse languages. In TACL.
- X. Du, C. Xiao, and Y. Li (2024). HaloScope: harnessing unlabeled LLM generations for hallucination detection. In NeurIPS.
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- W. Fedus, B. Zoph, and N. Shazeer (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23(1). ISSN 1532-4435.
- X. Gao, J. Zhang, L. Mouatadid, and K. Das (2024). SPUQ: perturbation-based uncertainty quantification for large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2336–2346.
- X. Hu, Y. Zhang, R. Peng, H. Zhang, C. Wu, G. Chen, and J. Zhao (2024). Embedding and gradient say wrong: a white-box method for hallucination detection. In EMNLP.
- L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B.
Qin, et al. (2023). A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems.
- R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991). Adaptive mixtures of local experts. Neural Computation 3(1), pp. 79–87. ISSN 0899-7667. [Link](https://doi.org/10.1162/neco.1991.3.1.79)
- M. I. Jordan and R. A. Jacobs (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6(2), pp. 181–214. ISSN 0899-7667. [Link](https://doi.org/10.1162/neco.1994.6.2.181)
- M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017). TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In ACL.
- S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- J. Kossen, J. Han, M. Razzak, L. Schut, S. Malik, and Y. Gal (2024). Semantic entropy probes: robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
- L. Kuhn, Y. Gal, and S.
Farquhar (2023). Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In ICLR.
- K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2024). Inference-time intervention: eliciting truthful answers from a language model. In NeurIPS.
- [19] S. Lin, J. Hilton, and O. Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research.
- S. Lin, J. Hilton, and O. Evans (2022a). Teaching models to express their uncertainty in words. In TMLR.
- S. Lin, J. Hilton, and O. Evans (2022b). TruthfulQA: measuring how models mimic human falsehoods. In ACL.
- Z. Lin, S. Trivedi, and J. Sun (2024). Generating with confidence: uncertainty quantification for black-box large language models. In TMLR.
- J. Liu, S. Chen, Y. Cheng, and J. He (2024). On the universal truthfulness hyperplane inside LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18199–18224.
- I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=Bkg6RiCqY7)
- A. Malinin and M.
Gales (2018). Predictive uncertainty estimation via prior networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 7047–7058.
- A. Malinin and M. Gales (2021). Uncertainty estimation in autoregressive structured prediction. In ICLR.
- P. Manakul, A. Liusie, and M. J. Gales (2023). SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In EMNLP.
- S. Marks and M. Tegmark (2024). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. In COLM.
- N. M. Min, L. H. Pham, H. Zhang, and J. Sun (2026). CORVUS: red-teaming hallucination detectors via internal signal camouflage in large language models. arXiv:2601.14310. [Link](https://arxiv.org/abs/2601.14310)
- J. Ngnawé, S. Sahoo, Y. Pequignot, F. Precioso, and C. Gagné (2024). Detecting brittle decisions for free: leveraging margin consistency in deep robust classifiers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C.
Zhang (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2024/hash/29753d93c5fc11167567e5df800308ae-Abstract-Conference.html)
- A. Pagh, R. Pagh, and M. Ruzic (2007). Linear probing with constant independence. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pp. 318–327.
- A. Pal, L. K. Umapathi, and M. Sankarasubbu (2023). Med-HALT: medical domain hallucination test for large language models. In CoNLL.
- S. Park, X. Du, M. Yeh, H. Wang, and Y. Li (2025). Steer LLM latents for hallucination detection. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. [Link](https://openreview.net/forum?id=UMqNQEPNT3)
- S. Reddy, D. Chen, and C. D. Manning (2019). CoQA: a conversational question answering challenge. In TACL.
- [35] J. Ren, J. Luo, Y. Zhao, K. Krishna, M. Saleh, B. Lakshminarayanan, and P. J. Liu. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations.
- J. Ren, J. Luo, Y. Zhao, K. Krishna, M. Saleh, B. Lakshminarayanan, and P. J. Liu (2022). Out-of-distribution detection and selective generation for conditional language models. In ICLR.
- T. Sellam, D.
Das, and A. P. Parikh (2020) BLEURT: learning robust metrics for text generation. In ACL. Cited by: [§5.1.1](https://arxiv.org/html/2606.00301#S5.SS1.SSS1.p1.1).
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. External Links: [Link](https://openreview.net/forum?id=B1ckMDqlg). Cited by: [§2.2](https://arxiv.org/html/2606.00301#S2.SS2.p1.1).
- B. Snyder, M. Moisescu, and M. B. Zafar (2024) On early detection of hallucinations in factual question answering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2721–2732. Cited by: [§5.1.2](https://arxiv.org/html/2606.00301#S5.SS1.SSS2.p1.1).
- H. Su, T. Hu, H. S. Koppula, K. Krishna, H. Pouransari, C. Hsieh, C. Koc, J. Y. Cheng, O. Tuzel, and R. Vemulapalli (2025) Learning to reason for hallucination span detection. CoRR abs/2510.02173. External Links: [Link](https://doi.org/10.48550/arXiv.2510.02173), [Document](https://dx.doi.org/10.48550/ARXIV.2510.02173), 2510.02173. Cited by: [§2.1](https://arxiv.org/html/2606.00301#S2.SS1.p1.1).
- M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi (2024) Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In ICLR. Cited by: [§2.1](https://arxiv.org/html/2606.00301#S2.SS1.p1.1).
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: [§5.1.1](https://arxiv.org/html/2606.00301#S5.SS1.SSS1.p1.1).
- F. Yin, J. Srinivasa, and K.
Chang (2024) Characterizing truthfulness in large language model generations with local intrinsic dimension. In ICML. Cited by: [§1](https://arxiv.org/html/2606.00301#S1.p2.1).
- Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al. (2023) Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219. Cited by: [§1](https://arxiv.org/html/2606.00301#S1.p1.1).
- W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. (2023) A survey of large language models. arXiv preprint arXiv:2303.18223. Cited by: [§1](https://arxiv.org/html/2606.00301#S1.p1.1).

## Appendix A In-depth Theoretical Analysis

This appendix provides detailed derivations and proof steps for the theoretical statements used in the main text. Throughout, we use $[K]=\{1,\ldots,K\}$ and denote by $\Delta_{K}=\{\bm{\pi}\in\mathbb{R}^{K}_{\geq 0}:\sum_{g=1}^{K}\pi_{g}=1\}$ the probability simplex.

### A.1. Detailed Proof of Theorem [4.1](https://arxiv.org/html/2606.00301#S4.Thmtheorem1)

Recall the mixture model (Eq. ([13](https://arxiv.org/html/2606.00301#S4.E13))):

$$p(\mathbf{x}\mid y)=\sum_{g=1}^{K}\pi_{y}(g)\,p_{g}(\mathbf{x}\mid y),\qquad y\in\{0,1\}. \tag{17}$$

Define the Bayes-optimal log-likelihood ratio (LLR)

$$\Lambda^{\star}(\mathbf{x})=\log\frac{p(\mathbf{x}\mid y=1)}{p(\mathbf{x}\mid y=0)}. \tag{18}$$

We further define the group-wise log-likelihood ratio

$$\ell_{g}(\mathbf{x})=\log\frac{p_{g}(\mathbf{x}\mid y=1)}{p_{g}(\mathbf{x}\mid y=0)},\qquad g\in[K]. \tag{19}$$

##### Step 1: Expand the numerator/denominator under the mixture.
By the mixture model,

$$\frac{p(\mathbf{x}\mid 1)}{p(\mathbf{x}\mid 0)}=\frac{\sum_{g=1}^{K}\pi_{1}(g)\,p_{g}(\mathbf{x}\mid 1)}{\sum_{g'=1}^{K}\pi_{0}(g')\,p_{g'}(\mathbf{x}\mid 0)}. \tag{20}$$

##### Step 2: Multiply and divide each term by $\pi_{0}(g)\,p_{g}(\mathbf{x}\mid 0)$.

For each $g$, write

$$\pi_{1}(g)\,p_{g}(\mathbf{x}\mid 1)=\pi_{0}(g)\,p_{g}(\mathbf{x}\mid 0)\cdot\frac{\pi_{1}(g)}{\pi_{0}(g)}\cdot\frac{p_{g}(\mathbf{x}\mid 1)}{p_{g}(\mathbf{x}\mid 0)}. \tag{21}$$

Plugging into the numerator of Eq. ([20](https://arxiv.org/html/2606.00301#A1.E20)) yields

$$\frac{p(\mathbf{x}\mid 1)}{p(\mathbf{x}\mid 0)}=\frac{\sum_{g=1}^{K}\pi_{0}(g)\,p_{g}(\mathbf{x}\mid 0)\cdot\frac{\pi_{1}(g)}{\pi_{0}(g)}\cdot\frac{p_{g}(\mathbf{x}\mid 1)}{p_{g}(\mathbf{x}\mid 0)}}{\sum_{g'=1}^{K}\pi_{0}(g')\,p_{g'}(\mathbf{x}\mid 0)}. \tag{22}$$

##### Step 3: Identify the posterior under the null.

Define

$$p(g\mid\mathbf{x},y=0):=\frac{\pi_{0}(g)\,p_{g}(\mathbf{x}\mid 0)}{\sum_{g'=1}^{K}\pi_{0}(g')\,p_{g'}(\mathbf{x}\mid 0)}. \tag{23}$$

Then Eq.
([22](https://arxiv.org/html/2606.00301#A1.E22)) becomes

$$\frac{p(\mathbf{x}\mid 1)}{p(\mathbf{x}\mid 0)}=\sum_{g=1}^{K}p(g\mid\mathbf{x},y=0)\cdot\frac{\pi_{1}(g)}{\pi_{0}(g)}\cdot\frac{p_{g}(\mathbf{x}\mid 1)}{p_{g}(\mathbf{x}\mid 0)}. \tag{24}$$

##### Step 4: Convert into log-sum-exp form.

Using $\ell_{g}(\mathbf{x})=\log\frac{p_{g}(\mathbf{x}\mid 1)}{p_{g}(\mathbf{x}\mid 0)}$, we rewrite each multiplicative factor as an exponential:

$$\frac{\pi_{1}(g)}{\pi_{0}(g)}\cdot\frac{p_{g}(\mathbf{x}\mid 1)}{p_{g}(\mathbf{x}\mid 0)}=\exp\!\Big(\ell_{g}(\mathbf{x})+\log\tfrac{\pi_{1}(g)}{\pi_{0}(g)}\Big). \tag{25}$$

Taking $\log$ on both sides of Eq. ([24](https://arxiv.org/html/2606.00301#A1.E24)) yields

$$\Lambda^{\star}(\mathbf{x})=\log\sum_{g=1}^{K}p(g\mid\mathbf{x},y=0)\,\exp\!\Big(\ell_{g}(\mathbf{x})+\log\tfrac{\pi_{1}(g)}{\pi_{0}(g)}\Big), \tag{26}$$

which is exactly Eq. ([15](https://arxiv.org/html/2606.00301#S4.E15)). $\square$

##### Remark (why the posterior is under $y=0$).

The decomposition above is obtained by factoring the denominator $\sum_{g'}\pi_{0}(g')\,p_{g'}(\mathbf{x}\mid 0)$. If instead one factors the numerator, the posterior would be taken under $y=1$ and the residual term changes accordingly. Both yield equivalent LLRs; we use the $y=0$ version because it aligns naturally with the “null-posterior-weighted” form in Eq. ([15](https://arxiv.org/html/2606.00301#S4.E15)).
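The identity in Eq. (26) can be checked numerically. The following is a minimal sketch, assuming a toy mixture of one-dimensional Gaussians; the mixing weights, component means, and helper names (`log_gauss`, `llr_direct`, `llr_log_marginal`) are illustrative assumptions and do not come from the paper. It compares the direct LLR of Eq. (18) with the null-posterior-weighted log-sum-exp form of Eq. (26).

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x (broadcasts over mu)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

# Hypothetical toy setup: K = 3 latent groups, scalar evidence x.
pi1 = np.array([0.5, 0.3, 0.2])   # pi_1(g), mixing weights under y = 1
pi0 = np.array([0.2, 0.3, 0.5])   # pi_0(g), mixing weights under y = 0
mu1 = np.array([1.0, -0.5, 2.0])  # component means of p_g(x | y = 1)
mu0 = np.array([0.0, 0.5, 1.0])   # component means of p_g(x | y = 0)
sigma = 1.0

def llr_direct(x):
    """Eq. (18): log p(x|1) - log p(x|0) under the mixture of Eq. (17)."""
    log_p1 = logsumexp(np.log(pi1) + log_gauss(x, mu1, sigma))
    log_p0 = logsumexp(np.log(pi0) + log_gauss(x, mu0, sigma))
    return log_p1 - log_p0

def llr_log_marginal(x):
    """Eq. (26): log sum_g p(g|x,y=0) exp(l_g(x) + log(pi_1(g)/pi_0(g)))."""
    log_joint0 = np.log(pi0) + log_gauss(x, mu0, sigma)
    log_post0 = log_joint0 - logsumexp(log_joint0)              # Eq. (23)
    ell = log_gauss(x, mu1, sigma) - log_gauss(x, mu0, sigma)   # Eq. (19)
    return logsumexp(log_post0 + ell + np.log(pi1) - np.log(pi0))

xs = np.linspace(-3.0, 3.0, 13)
max_gap = max(abs(llr_direct(x) - llr_log_marginal(x)) for x in xs)
print(f"max |direct - log-marginal| = {max_gap:.2e}")
```

The two functions should agree up to floating-point error, since Eq. (26) is an exact algebraic rewriting of Eq. (18) rather than an approximation.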
### A.2. Variational Interpretation of Prototype Routing (Detailed)

We prove the variational interpretation theorem from a variational perspective with full steps. Fix $\mathbf{x}$ and abbreviate $\mathbf{r}=\mathbf{r}(\mathbf{x})$. Let $\alpha_{g}=\cos(\mathbf{r},\mathbf{c}_{g})$ and define the energy $E_{g}=-\alpha_{g}$. Consider the optimization

$$\min_{\bm{\pi}\in\Delta_{K}}\;F(\bm{\pi}):=\sum_{g=1}^{K}\pi_{g}E_{g}+\tau\sum_{g=1}^{K}\pi_{g}\log\pi_{g},\qquad\tau>0. \tag{27}$$

##### Step 1: Form the Lagrangian.

Introduce a multiplier $\lambda\in\mathbb{R}$ for $\sum_{g}\pi_{g}=1$ and multipliers $\{\nu_{g}\geq 0\}$ for $\pi_{g}\geq 0$:

$$\mathcal{L}(\bm{\pi},\lambda,\bm{\nu})=\sum_{g=1}^{K}\pi_{g}E_{g}+\tau\sum_{g=1}^{K}\pi_{g}\log\pi_{g}+\lambda\Big(\sum_{g=1}^{K}\pi_{g}-1\Big)-\sum_{g=1}^{K}\nu_{g}\pi_{g}. \tag{28}$$

##### Step 2: KKT stationarity.

For any optimal $\bm{\pi}^{\star}$ with strictly positive entries (which will be implied by $\tau>0$), complementary slackness gives $\nu_{g}=0$. Thus stationarity $\partial\mathcal{L}/\partial\pi_{g}=0$ yields

$$E_{g}+\tau(1+\log\pi_{g})+\lambda=0. \tag{29}$$

Solving for $\pi_{g}$,

$$\log\pi_{g}=-\frac{E_{g}+\lambda}{\tau}-1\quad\Longrightarrow\quad\pi_{g}=\exp\!\Big(-\frac{E_{g}}{\tau}\Big)\cdot\exp\!\Big(-\frac{\lambda}{\tau}-1\Big). \tag{30}$$

##### Step 3: Enforce normalization.

Let $Z:=\sum_{j=1}^{K}\exp(-E_{j}/\tau)$. Summing Eq.
([30](https://arxiv.org/html/2606.00301#A1.E30)) over $g$ and using $\sum_{g}\pi_{g}=1$ gives

$$1=\exp\!\Big(-\frac{\lambda}{\tau}-1\Big)\cdot Z\quad\Longrightarrow\quad\exp\!\Big(-\frac{\lambda}{\tau}-1\Big)=\frac{1}{Z}. \tag{31}$$

Plugging back into Eq. ([30](https://arxiv.org/html/2606.00301#A1.E30)) yields

$$\pi_{g}=\frac{\exp(-E_{g}/\tau)}{\sum_{j=1}^{K}\exp(-E_{j}/\tau)}=\frac{\exp(\alpha_{g}/\tau)}{\sum_{j=1}^{K}\exp(\alpha_{j}/\tau)}. \tag{32}$$

##### Step 4: Uniqueness.

The function $\bm{\pi}\mapsto\sum_{g}\pi_{g}E_{g}$ is linear and the negative entropy term $\sum_{g}\pi_{g}\log\pi_{g}$ is *strictly convex* on $\Delta_{K}$. Hence $F(\bm{\pi})$ is strictly convex, implying a unique minimizer. $\square$

##### Interpretation.

Eq. ([27](https://arxiv.org/html/2606.00301#A1.E27)) can be viewed as a “free energy”: the first term encourages assigning mass to low-energy (high-affinity) prototypes, while the second term (entropy) prevents collapse. This provides a principled link between the prototype similarities and a posterior-like routing distribution.

### A.3. Approximation Rate: Proof Outline with Explicit Constants

We provide a more explicit route to the $O(K^{-1/d})$ bound in Theorem [4.2](https://arxiv.org/html/2606.00301#S4.Thmtheorem2). Let $\mathcal{S}\subset\mathbb{R}^{d}$ be compact and assume $\mathbf{r}(\mathbf{x})\in\mathcal{S}$ for all $\mathbf{x}$. Let $\lambda^{\star}(\mathbf{r})$ denote the Bayes statistic expressed in evidence space, i.e., $\lambda^{\star}(\mathbf{r}(\mathbf{x}))=\Lambda^{\star}(\mathbf{x})$.

###### Assumption 1 (Lipschitzness).
There exists $L_{\lambda}>0$ such that for all $\mathbf{r},\mathbf{r}'\in\mathcal{S}$,

$$|\lambda^{\star}(\mathbf{r})-\lambda^{\star}(\mathbf{r}')|\leq L_{\lambda}\|\mathbf{r}-\mathbf{r}'\|_{2}. \tag{33}$$

##### Step 1: Covering number and quantization radius.

For $\varepsilon>0$, let $N(\varepsilon,\mathcal{S},\|\cdot\|_{2})$ be the minimal number of Euclidean balls of radius $\varepsilon$ needed to cover $\mathcal{S}$. Define the quantization radius at budget $K$:

$$\varepsilon_{K}(\mathcal{S}):=\inf\Big\{\varepsilon>0:\;N(\varepsilon,\mathcal{S},\|\cdot\|_{2})\leq K\Big\}. \tag{34}$$

For compact subsets of $\mathbb{R}^{d}$ with finite $d$-dimensional volume, there exists a constant $C_{\mathcal{S}}>0$ such that

$$\varepsilon_{K}(\mathcal{S})\leq C_{\mathcal{S}}K^{-1/d}. \tag{35}$$

(One may take $C_{\mathcal{S}}$ proportional to $\mathrm{diam}(\mathcal{S})$ and the volume ratio; standard covering arguments apply.)

##### Step 2: Construct a Voronoi partition and a piecewise-constant approximant.

Choose centers $\{\mathbf{u}_{g}\}_{g=1}^{K}\subset\mathcal{S}$ such that $\mathcal{S}\subset\cup_{g=1}^{K}B(\mathbf{u}_{g},\varepsilon_{K})$. Define a Voronoi partition $\{\mathcal{S}_{g}\}_{g=1}^{K}$ by

$$\mathcal{S}_{g}=\Big\{\mathbf{r}\in\mathcal{S}:\;g\in\arg\min_{j\in[K]}\|\mathbf{r}-\mathbf{u}_{j}\|_{2}\Big\}, \tag{36}$$

breaking ties arbitrarily. For any $\mathbf{r}\in\mathcal{S}_{g}$, we have $\|\mathbf{r}-\mathbf{u}_{g}\|_{2}\leq\varepsilon_{K}$.
Define the approximant

$$\tilde{\lambda}(\mathbf{r})=\sum_{g=1}^{K}\mathbb{I}[\mathbf{r}\in\mathcal{S}_{g}]\;\lambda^{\star}(\mathbf{u}_{g}). \tag{37}$$

##### Step 3: Bound the approximation error using Lipschitzness.

Fix $\mathbf{r}\in\mathcal{S}$ and let $g(\mathbf{r})$ be its cell index. Then, using Eq. ([33](https://arxiv.org/html/2606.00301#A1.E33)),

$$|\tilde{\lambda}(\mathbf{r})-\lambda^{\star}(\mathbf{r})|=|\lambda^{\star}(\mathbf{u}_{g(\mathbf{r})})-\lambda^{\star}(\mathbf{r})|\leq L_{\lambda}\|\mathbf{u}_{g(\mathbf{r})}-\mathbf{r}\|_{2}\leq L_{\lambda}\varepsilon_{K}. \tag{38}$$

Taking supremum over $\mathbf{r}\in\mathcal{S}$ and using Eq. ([35](https://arxiv.org/html/2606.00301#A1.E35)) gives

$$\sup_{\mathbf{r}\in\mathcal{S}}|\tilde{\lambda}(\mathbf{r})-\lambda^{\star}(\mathbf{r})|\leq L_{\lambda}C_{\mathcal{S}}K^{-1/d}. \tag{39}$$

##### Step 4: Realize the partition by prototype routing (soft-to-hard).

We now connect Eq. ([37](https://arxiv.org/html/2606.00301#A1.E37)) to the FLaG score

$$s(\mathbf{x})=\log\sum_{g=1}^{K}\pi_{g}(\mathbf{x})\,\exp(s_{g}(\mathbf{x})). \tag{40}$$

We consider a realizable construction showing existence (as stated in Theorem [4.2](https://arxiv.org/html/2606.00301#S4.Thmtheorem2)).

*(a) Prototype placement.* Assume $\mathbf{r}$ is normalized (or we normalize within the routing). Place prototypes $\mathbf{c}_{g}$ aligned with $\mathbf{u}_{g}$ (e.g., $\mathbf{c}_{g}=\mathbf{u}_{g}/\|\mathbf{u}_{g}\|_{2}$ if nonzero).
Then $\alpha_{g}(\mathbf{x})=\cos(\mathbf{r}(\mathbf{x}),\mathbf{c}_{g})$ is maximized near $\mathbf{u}_{g}$.

*(b) Low-temperature gating.* Let $\pi_{g}(\mathbf{x})=\mathrm{softmax}(\alpha_{g}(\mathbf{x})/\tau)$ with $\tau$ small. For any $\mathbf{x}$, let $g^{\star}(\mathbf{x})\in\arg\max_{g}\alpha_{g}(\mathbf{x})$. Then for all $g$,

$$\pi_{g^{\star}}(\mathbf{x})=\frac{1}{1+\sum_{j\neq g^{\star}}\exp((\alpha_{j}-\alpha_{g^{\star}})/\tau)}\geq 1-\sum_{j\neq g^{\star}}\exp\!\Big(-\frac{\alpha_{g^{\star}}-\alpha_{j}}{\tau}\Big). \tag{41}$$

If the affinity gap $\Delta(\mathbf{x}):=\min_{j\neq g^{\star}}(\alpha_{g^{\star}}-\alpha_{j})>0$, then

$$1-\pi_{g^{\star}}(\mathbf{x})\leq(K-1)\exp(-\Delta(\mathbf{x})/\tau). \tag{42}$$

Thus $\pi_{g^{\star}}$ approaches $1$ exponentially fast as $\tau\to 0$ whenever the max is unique.

*(c) Constant experts for piecewise constants.* Set $s_{g}(\mathbf{x})\equiv b_{g}$ with $b_{g}=\lambda^{\star}(\mathbf{u}_{g})$ (a special case of Eq. ([9](https://arxiv.org/html/2606.00301#S3.E9)) by $\mathbf{w}_{g}=\mathbf{0}$).
Then

$$\begin{aligned}s(\mathbf{x})&=\log\sum_{g=1}^{K}\pi_{g}(\mathbf{x})e^{b_{g}}=\log\Big(\pi_{g^{\star}}e^{b_{g^{\star}}}+\sum_{j\neq g^{\star}}\pi_{j}e^{b_{j}}\Big)\\&=b_{g^{\star}}+\log\Big(\pi_{g^{\star}}+\sum_{j\neq g^{\star}}\pi_{j}e^{b_{j}-b_{g^{\star}}}\Big).\end{aligned} \tag{43}$$

Assume $|b_{j}-b_{g^{\star}}|\leq B$ for all $j$ (boundedness holds on compact $\mathcal{S}$ if $\lambda^{\star}$ is continuous). Then $e^{b_{j}-b_{g^{\star}}}\leq e^{B}$ and Eq. ([43](https://arxiv.org/html/2606.00301#A1.E43)) implies

$$|s(\mathbf{x})-b_{g^{\star}}|=\left|\log\Big(\pi_{g^{\star}}+\sum_{j\neq g^{\star}}\pi_{j}e^{b_{j}-b_{g^{\star}}}\Big)\right|\leq\left|\log\Big(\pi_{g^{\star}}+(1-\pi_{g^{\star}})e^{B}\Big)\right|. \tag{44}$$

Using $\log(1+u)\leq u$ and $\pi_{g^{\star}}\geq 1-(K-1)e^{-\Delta/\tau}$ from Eq. ([42](https://arxiv.org/html/2606.00301#A1.E42)), we obtain

$$|s(\mathbf{x})-b_{g^{\star}}|\leq(1-\pi_{g^{\star}})e^{B}\leq(K-1)e^{B}e^{-\Delta(\mathbf{x})/\tau}. \tag{45}$$

Hence $s(\mathbf{x})$ approximates the hard-assignment piecewise constant $b_{g^{\star}}$.

##### Step 5: Combine errors.

Let $\hat{\lambda}(\mathbf{r})$ be the hard-cell approximation induced by $g^{\star}(\mathbf{r})$ with values $b_{g}=\lambda^{\star}(\mathbf{u}_{g})$. Then Eq.
([39](https://arxiv.org/html/2606.00301#A1.E39)) gives

$$\sup_{\mathbf{r}\in\mathcal{S}}|\hat{\lambda}(\mathbf{r})-\lambda^{\star}(\mathbf{r})|\leq L_{\lambda}C_{\mathcal{S}}K^{-1/d}. \tag{46}$$

Eq. ([45](https://arxiv.org/html/2606.00301#A1.E45)) further yields, for any $\mathbf{x}$ with affinity gap $\Delta(\mathbf{x})$,

$$|s(\mathbf{x})-\hat{\lambda}(\mathbf{r}(\mathbf{x}))|\leq(K-1)e^{B}e^{-\Delta(\mathbf{x})/\tau}. \tag{47}$$

Thus one may state the combined bound

$$\sup_{\mathbf{x}}|s(\mathbf{x})-\Lambda^{\star}(\mathbf{x})|\leq L_{\lambda}C_{\mathcal{S}}K^{-1/d}+\sup_{\mathbf{x}}(K-1)e^{B}e^{-\Delta(\mathbf{x})/\tau}, \tag{48}$$

which recovers the $O(K^{-1/d})$ term and makes the soft-gating residual explicit. Allowing linear experts generally improves local approximation (smaller $B$ and tighter constants) but does not change the covering-rate exponent.

### A.4. Ranking Objective: A More Explicit Fisher-Consistency Argument

We expand the above theorem at the level of conditional risks. Let $\eta(\mathbf{x})=\mathbb{P}(y=1\mid\mathbf{x})$. Define a score function $s:\mathcal{X}\to\mathbb{R}$. Consider the pairwise logistic loss $\varphi(u)=\log(1+e^{-u})$ and the population risk

$$\mathcal{R}(s)=\mathbb{E}_{(\mathbf{x},y),(\mathbf{x}',y')}\Big[\,\mathbb{I}[y=1,y'=0]\;\varphi\big(s(\mathbf{x})-s(\mathbf{x}')\big)\,\Big], \tag{49}$$

where $(\mathbf{x},y)$ and $(\mathbf{x}',y')$ are i.i.d. from the data distribution.
(Up to a constant factor, this is equivalent to sampling $\mathbf{x}^{+}\sim p(\cdot\mid 1)$ and $\mathbf{x}^{-}\sim p(\cdot\mid 0)$.)

##### Step 1: Condition on a pair $(\mathbf{x},\mathbf{x}')$.

Let $u=s(\mathbf{x})-s(\mathbf{x}')$. Then the conditional expected contribution of the pair is

$$\begin{aligned}\mathcal{R}_{\mathbf{x},\mathbf{x}'}(u)&=\mathbb{E}\!\left[\mathbb{I}[y=1,y'=0]\mid\mathbf{x},\mathbf{x}'\right]\varphi(u)+\mathbb{E}\!\left[\mathbb{I}[y=0,y'=1]\mid\mathbf{x},\mathbf{x}'\right]\varphi(-u)\\&=\eta(\mathbf{x})(1-\eta(\mathbf{x}'))\varphi(u)+(1-\eta(\mathbf{x}))\eta(\mathbf{x}')\varphi(-u),\end{aligned} \tag{50}$$

since labels are conditionally independent given $\mathbf{x},\mathbf{x}'$.

##### Step 2: Differentiate the conditional risk.
Using $\varphi'(u)=-\sigma(-u)$ where $\sigma(u)=\frac{1}{1+e^{-u}}$,

$$\begin{aligned}\frac{d}{du}\mathcal{R}_{\mathbf{x},\mathbf{x}'}(u)&=\eta(\mathbf{x})(1-\eta(\mathbf{x}'))\varphi'(u)-(1-\eta(\mathbf{x}))\eta(\mathbf{x}')\varphi'(-u)\\&=-\eta(\mathbf{x})(1-\eta(\mathbf{x}'))\sigma(-u)+(1-\eta(\mathbf{x}))\eta(\mathbf{x}')\sigma(u).\end{aligned} \tag{51}$$

Set the derivative to zero:

$$(1-\eta(\mathbf{x}))\eta(\mathbf{x}')\sigma(u)=\eta(\mathbf{x})(1-\eta(\mathbf{x}'))\sigma(-u). \tag{52}$$

Using $\sigma(-u)=1-\sigma(u)$ and the identity $\frac{\sigma(u)}{\sigma(-u)}=e^{u}$, Eq. ([52](https://arxiv.org/html/2606.00301#A1.E52)) is equivalent to

$$e^{u}=\frac{\eta(\mathbf{x})(1-\eta(\mathbf{x}'))}{(1-\eta(\mathbf{x}))\eta(\mathbf{x}')}. \tag{53}$$

Therefore, the unique minimizer $u^{\star}(\mathbf{x},\mathbf{x}')$ of $\mathcal{R}_{\mathbf{x},\mathbf{x}'}(u)$ is

$$u^{\star}(\mathbf{x},\mathbf{x}')=\log\frac{\eta(\mathbf{x})}{1-\eta(\mathbf{x})}-\log\frac{\eta(\mathbf{x}')}{1-\eta(\mathbf{x}')}. \tag{54}$$

##### Step 3: Implication for ordering.

Eq.
([54](https://arxiv.org/html/2606.00301#A1.E54)) implies that

$$u^{\star}(\mathbf{x},\mathbf{x}')>0\quad\Longleftrightarrow\quad\eta(\mathbf{x})>\eta(\mathbf{x}'). \tag{55}$$

Thus, any globally optimal scoring function must preserve the ordering of $\eta(\mathbf{x})$ almost surely (ties allowed). In particular, taking $s^{\star}(\mathbf{x})=\log\frac{\eta(\mathbf{x})}{1-\eta(\mathbf{x})}$ achieves $u^{\star}$ for every pair. More generally, any strictly increasing transform of $\eta(\mathbf{x})$ yields the same ordering. $\square$

## Appendix B Dataset Specifications

We use two prompt templates according to whether the dataset provides supporting context. For context-free QA datasets, including TruthfulQA and TriviaQA, the prompt is:

> Answer the question concisely. Q: <question\> A:

For context-dependent datasets, including TyDiQA-GP and CoQA, the prompt is:

> Answer these questions concisely based on the context: \\n Context: <passage context\> Q: <question\> A:

## Appendix C Baseline Implementation Details

For Perplexity ([Ren et al.](https://arxiv.org/html/2606.00301#bib.bib35)), we use the official implementation and average perplexity over generated tokens. For sampling-based baselines, we follow the original configurations and generate 10 samples with temperature 0.5. Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.00301#bib.bib36)) uses ROUGE-L; SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.00301#bib.bib18)) uses the recommended NLI variant with a fine-tuned DeBERTa-v3-large model; HaloScope (Du et al., [2024](https://arxiv.org/html/2606.00301#bib.bib8)) and EGH (Hu et al., [2024](https://arxiv.org/html/2606.00301#bib.bib42)) use their official or released codebases.
For Verbalize ([Lin et al.](https://arxiv.org/html/2606.00301#bib.bib37)), we use the following confidence-elicitation prompt:

> \[Context: <context\>\] Q: <question\> A: <answer\>. \\n The proposed answer is true with a confidence value (0-100) of,

where the context field is omitted for context-free datasets. The generated confidence value is directly used as the uncertainty score. For Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.00301#bib.bib38)), we use:

> \[Context: <context\>\] Question: <question\> \\n Proposed Answer: <answer\> \\n Is the proposed answer: \\n (A) True \\n (B) False \\n The proposed answer is:

Again, the context field is omitted when unavailable. Following the original paper, we use the log probability of token “A” as the uncertainty score.

## Appendix D LLM Usage Statement

LLMs are used in this work solely for language polishing and presentation purposes. Specifically, LLMs are employed to improve clarity, grammar, and readability of the manuscript text written by the authors. They are *not* used for designing the proposed method, generating experimental results, selecting hyperparameters, analyzing outcomes, or drawing scientific conclusions. All technical content, experimental design, and empirical findings are entirely produced and verified by the authors.