From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models

arXiv cs.CL 06/29/26, 04:00 AM Papers
Summary
This paper presents a factorised study of probe-based uncertainty estimation in LLMs, showing that raw hidden states and attention features perform well in-domain but structured features are more robust under distribution shift, and provides pretrained probes as off-the-shelf baselines.
arXiv:2606.27679v1 Announce Type: new Abstract: Probe-based uncertainty estimation (UE) has emerged as a prominent approach to detect hallucinations in Large Language Models (LLMs) by learning uncertainty from internal model signals. Yet, recent methods vary simultaneously across feature design, training data construction, and evaluation setting, obscuring what actually drives performance. To address this issue, we propose a factorised study of probe-based UE under matched conditions. Our results show that raw hidden states and attention features are difficult to outperform in-domain. However, under distribution shift, structured and compressed features are more robust, suggesting that in-domain performance alone is insufficient to measure progress. Furthermore, prompting and label construction significantly affect probe behaviour. Building on these best-practice findings, we train benchmark-based pretrained probes that transfer reasonably well to open-ended factual generation, providing a stable off-the-shelf baseline. Our work encourages more deployment-oriented evaluation of probe-based uncertainty estimators. The code repository is available at https://github.com/ponhvoan/ProbeUE.
# From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models
Source: [https://arxiv.org/html/2606.27679](https://arxiv.org/html/2606.27679)
Ponhvoan Srey¹, Xiaobao Wu²\*, Cong\-Duy Nguyen³, Quang Minh Nguyen⁴, Duc Anh Vu¹, Anh Tuan Luu¹,³\*

¹Nanyang Technological University, ²Shanghai Jiao Tong University, ³VinUniversity, ⁴KAIST

\{ponhvoan002, vuducanh001, anhtuan\.luu\}@ntu\.edu\.sg, xiaobaowu@sjtu\.edu\.cn, duy\.ntc@vinuni\.edu\.vn, qm\.nguyen@kaist\.ac\.kr

\*Corresponding authors\.

###### Abstract

Probe\-based uncertainty estimation \(UE\) has emerged as a prominent approach to detect hallucinations in Large Language Models \(LLMs\) by learning uncertainty from internal model signals\. Yet, recent methods vary simultaneously across feature design, training data construction, and evaluation setting, obscuring what actually drives performance\. To address this issue, we propose a factorised study of probe\-based UE under matched conditions\. Our results show that raw hidden states and attention features are difficult to outperform in\-domain\. However, under distribution shift, structured and compressed features are more robust, suggesting that in\-domain performance alone is insufficient to measure progress\. Furthermore, prompting and label construction significantly affect probe behaviour\. Building on these best\-practice findings, we train benchmark\-based pretrained probes that transfer reasonably well to open\-ended factual generation, providing a stable off\-the\-shelf baseline\. Our work encourages more deployment\-oriented evaluation of probe\-based uncertainty estimators\. The code repository is available at[https://github\.com/ponhvoan/ProbeUE](https://github.com/ponhvoan/ProbeUE)\.


## 1 Introduction

Hallucination in Large Language Models \(LLMs\), the tendency to generate fictitious information, remains a persistent barrier to reliable deployment in real\-world applications \(Sahoo et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib9); Huang et al\., [2025a](https://arxiv.org/html/2606.27679#bib.bib10); Zhang et al\., [2025b](https://arxiv.org/html/2606.27679#bib.bib12)\)\. This necessitates robust uncertainty estimation \(UE\) that accurately flags potentially erroneous generations for users \(Vashurin et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib13)\)\. Recent work suggests that probe\-based UE, which leverages internal model states, provides some of the most effective signals for hallucination detection \(Mahaut et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib23); Tan et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib20)\)\. This has led to a growing body of work that engineers progressively more structured and informative internal features and integrates them into more sophisticated optimisation protocols \(Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18); He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17); Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14); Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15)\)\. However, despite this progress, two questions remain unresolved: what accounts for the reported gains, and whether these gains translate beyond matched benchmark settings\. First, current evaluations conflate multiple design choices, such as training data acquisition, feature representation, supervision, and probe architecture, making it unclear what actually accounts for the observed gains\. This motivates our first central research question: *What truly drives performance in probe\-based uncertainty estimation?*

At the same time, a critical bottleneck of probe\-based UE is limited generalisability\. Even though probes are highly effective under matched train\-test conditions, their performance often degrades when applied to new domains or generation settings \(CH\-Wang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib19); Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18)\)\. This limits their utility in real use cases, where uncertainty estimators must handle open\-ended generations rather than only the benchmark format on which they were trained\. Although some prior work evaluates transfer across datasets, such evaluations are restricted to benchmark\-to\-benchmark transfer \(Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18)\), or remain within comparable long\-form, claim\-level generation setups \(Han et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib29); Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15)\)\. In these settings, probes are tested under distribution shift, but the generation format, answer structure, supervision signal, and evaluation protocol remain relatively constrained\. This leaves open whether probes can generalise to less standardised deployment settings, where outputs are open\-ended, vary significantly in length and style, and contain more diverse factual errors\. This leads to our second research question: *Can probes trained under controlled benchmark settings generalise to open\-ended generation tasks?*

To answer these questions, we conduct a controlled study of probe\-based UE across three primary dimensions: feature representations, training data construction, and transfer settings\. Our study covers a wide range of recently proposed feature representations, spanning latent embeddings, output probabilities, attention patterns, and their combinations \(Azaria and Mitchell, [2023](https://arxiv.org/html/2606.27679#bib.bib1); Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18); He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17); Huang et al\., [2025b](https://arxiv.org/html/2606.27679#bib.bib46); Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15)\), evaluated with different probe architectures, supervision sizes, prompting strategies, and automated correctness labels\. We further study benchmark\-to\-benchmark transfer and a deployment\-oriented setting in which probes pretrained on benchmark data are applied to open\-ended long\-form factual generation\. Together, these allow us to identify which design choices drive in\-domain performance, which remain robust under shift, and which best practices support reusable pretrained factuality probes\. Our findings challenge several common assumptions\. First, simple linear probes over raw hidden states and attention features are surprisingly difficult to outperform, even with limited supervision\. Second, data construction choices strongly shape probe behaviour: reasoning\-based prompting and lexical\-matching\-based labels substantially degrade performance\. Finally, structured and compressed features offer better trade\-offs under distribution shift\. These findings yield a practical recipe: use simple probe architectures, concise generations with semantic correctness labels, and transfer\-robust feature representations\. Building on these best practices, we show that benchmark\-pretrained probes transfer to open\-ended factual generation, approaching task\-specific supervised probes without target\-task training data\.

Collectively, our work pushes probe\-based UE beyond in\-domain benchmark comparison toward deployment\-oriented practice\. Rather than pursuing increasingly complex internal state representation features in isolation, the field should prioritise simple and transferable probe configurations that maintain reliability beyond benchmarks\. In summary, our contributions are threefold:

- We propose a factorised evaluation framework to disentangle the design factors behind probe\-based UE performance\.
- We introduce practical best practices for training lightweight uncertainty/factuality probes under different constraints\.
- We demonstrate how these best practices support deployment of pretrained probes for open\-ended generation, providing a stable baseline for future work\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/x1.png) \(a\) Average AUROC
![Refer to caption](https://arxiv.org/html/2606.27679v1/x2.png) \(b\) Average ECE

Figure 1: Main results: in\-domain performance averaged across all benchmark datasets\.

## 2 Related Work

#### Probe\-based Uncertainty Estimation \(UE\)

Probe\-based UE trains lightweight probes on top of LLM internal signals to predict factuality or correctness, or conversely, hallucination risk\. This paradigm is attractive because it typically requires only a single LLM forward pass, unlike expensive sampling\-based methods, and often achieves strong in\-domain performance\. Early work showed that truth\-related information can be extracted from hidden activations, often from the final layer and final token, using simple classifiers \(Azaria and Mitchell, [2023](https://arxiv.org/html/2606.27679#bib.bib1); Burns et al\., [2022](https://arxiv.org/html/2606.27679#bib.bib8); Marks and Tegmark, [2023](https://arxiv.org/html/2606.27679#bib.bib7)\)\. Subsequent methods extend this paradigm by deriving more informative hidden state representations in several ways\. Some works steer the prompting or generation procedure to elicit responses, and hence internal states, that are more discriminative for factual verification \(Zhang et al\., [2025a](https://arxiv.org/html/2606.27679#bib.bib6); Srey et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib11)\)\. Other methods expose the probe to more information, for example by pooling hidden states across all layers \(CH\-Wang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib19)\), by modelling hidden states from all generated tokens as sequential inputs \(Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15); Srey et al\., [2026b](https://arxiv.org/html/2606.27679#bib.bib3); Zhu et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib2)\), or by integrating cross\-model hidden states \(Tan et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib20)\)\. Another approach transforms hidden states into structured features intended to capture uncertainty\- and hallucination\-relevant geometry, such as density\-based features across layers \(Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\) or cross\-layer dynamics \(Srey et al\., [2026b](https://arxiv.org/html/2606.27679#bib.bib3)\)\.

A parallel line of work explores internal signals beyond hidden states alone\. Attention\-based methods, for example, use patterns such as the lookback ratio \(Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18)\), i\.e\. the relative attention paid to source context compared to generated tokens\. Related works also incorporate probability\-space information, such as token probabilities \(Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\), entropy \(Srey et al\., [2026b](https://arxiv.org/html/2606.27679#bib.bib3)\), or logit\-derived features, such as top\-$k$ output indices \(He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17)\)\. Recent hybrid methods combine hidden states, attention maps, and probability\-based signals through direct concatenation \(Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15); Srey et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib11); Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\), or through specialised submodules \(He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17)\)\. Related calibration methods learn post\-hoc mappings from heuristic UE scores to correctness\-aligned estimates with a model\-specific corrector \(Li et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib4)\), further extending internal\-state probes\.
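
To make the lookback ratio concrete, the following is a minimal NumPy sketch for a single attention head at a single decoding step. It assumes the commonly used form, mean attention on context tokens divided by the sum of mean attention on context and on generated tokens; the function and argument names are hypothetical, and the exact variant used in this study may differ.

```python
import numpy as np

def lookback_ratio(attn_row, n_context):
    """Lookback ratio for one attention head at one decoding step.

    attn_row  : 1D attention weights from the current token to all previous
                positions (source context first, then generated tokens).
    n_context : number of source-context tokens (hypothetical argument name).
    """
    attn_row = np.asarray(attn_row, dtype=float)
    ctx, new = attn_row[:n_context], attn_row[n_context:]
    a_ctx = ctx.mean() if ctx.size else 0.0   # mean attention on the context
    a_new = new.mean() if new.size else 0.0   # mean attention on generated tokens
    total = a_ctx + a_new
    return float(a_ctx / total) if total > 0 else 0.0

# Toy step: 4 context tokens followed by 2 previously generated tokens.
ratio = lookback_ratio([0.2, 0.2, 0.2, 0.2, 0.1, 0.1], n_context=4)
print(round(ratio, 3))
```

Stacking this scalar over heads and layers gives a compact feature vector whose size depends on the model architecture rather than on the hidden dimension.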

Overall, probe\-based UE has increasingly been framed as a search for richer features, better calibrated signals, and higher\-capacity optimisation pipelines\. However, these methods often vary simultaneously in many design factors, making it unclear which choices actually account for the observed gains\. Our work clarifies when such engineering is necessary, when simple hidden state probes suffice, and which design choices remain robust under transfer\.

#### Toolkits, Benchmarks, and Evaluation\.

Recent work emphasises standardised evaluation for LLM uncertainty estimation\. LM\-Polygraph \(Fadeeva et al\., [2023](https://arxiv.org/html/2606.27679#bib.bib16)\) provides a unified toolkit for comparing UE methods, with follow\-up work benchmarking them under consistent protocols \(Vashurin et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib13)\)\. Similarly, UQLM \(Bouchard et al\., [2026](https://arxiv.org/html/2606.27679#bib.bib21)\) offers an off\-the\-shelf package for response\-level hallucination detection using black\-box, white\-box, LLM\-as\-a\-judge, and ensemble scorers\. Other evaluation works study various aspects, for example, by investigating long\-form factuality \(Han et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib29)\) and real\-time entity\-level hallucination detection with token\-level annotations \(Obeso et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib52)\), incorporating uncertainty into LLM benchmarking \(Ye et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib22)\), analysing robustness to semantically equivalent inputs \(Mahaut et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib23)\), comparing in\-domain and out\-of\-domain settings \(Wang et al\., [2025a](https://arxiv.org/html/2606.27679#bib.bib24)\), and re\-examining evaluation choices in hallucination detection \(Janiak et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib25)\)\. Furthermore, confidence estimates are sensitive to reasoning and prompting: reasoning models may more accurately express their verbalised confidence in some setups \(Yoon et al\., [2026](https://arxiv.org/html/2606.27679#bib.bib26)\), but not consistently \(Mei et al\., [2026](https://arxiv.org/html/2606.27679#bib.bib28)\), and reasoning may inflate probability\-based confidence \(Fu et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib27)\)\. These efforts improve evaluation practice for UE broadly, but leave the space of supervised internal\-state probes comparatively underexamined, motivating our factorised study of probe performance and transfer\.

## 3 What Drives Probe Performance?

In this section, we answer our first research question: *What truly drives performance in probe\-based uncertainty estimation?* To this end, we perform a factorised analysis, varying important design choices \(feature representation, data and supervision construction, and transfer setting\) while keeping other conditions fixed\. We find that raw hidden state and attention features are strong in\-domain \([Section 3\.2](https://arxiv.org/html/2606.27679#S3.SS2)\), that response elicitation and groundtruth annotation choices can substantially affect probe quality \([Section 3\.3](https://arxiv.org/html/2606.27679#S3.SS3)\), and that more structured features are more robust under transfer \([Section 3\.4](https://arxiv.org/html/2606.27679#S3.SS4)\)\. Note that [Section 3\.4](https://arxiv.org/html/2606.27679#S3.SS4) is a benchmark\-to\-benchmark transfer analysis: training and test data differ, but both come from the benchmark pool \([Section 3\.1](https://arxiv.org/html/2606.27679#S3.SS1)\)\. In [Section 4](https://arxiv.org/html/2606.27679#S4), we emulate a more open\-ended generation setting and evaluate our probes pretrained only on the benchmark datasets\.

### 3\.1 Experimental Setup

#### Datasets\.

We evaluate on seven datasets spanning three tasks: \(i\) Question Answering \(QA\): we use TriviaQA \(Joshi et al\., [2017](https://arxiv.org/html/2606.27679#bib.bib31)\), SciQ \(Welbl et al\., [2017](https://arxiv.org/html/2606.27679#bib.bib32)\), and PopQA \(Mallen et al\., [2023](https://arxiv.org/html/2606.27679#bib.bib33)\) as fact\-heavy short\-answer QA datasets, encompassing general trivia, science knowledge, and long\-tail entity knowledge, respectively; \(ii\) Verification: we include BoolQ \(Clark et al\., [2019](https://arxiv.org/html/2606.27679#bib.bib35)\) and StrategyQA \(Geva et al\., [2021](https://arxiv.org/html/2606.27679#bib.bib34)\) as factual and logical verification tasks that often require understanding and implicit reasoning; \(iii\) Multiple\-Choice Questions \(MCQ\): we use CommonsenseQA \(CSQA; Talmor et al\., [2019](https://arxiv.org/html/2606.27679#bib.bib30)\) and ARC \(Clark et al\., [2018](https://arxiv.org/html/2606.27679#bib.bib36)\)\. These datasets cover diverse topics, answer formats, and reasoning demands\. To obtain correctness labels of LLM generations for the QA datasets, which require comparison with reference answers, we utilise Gemini\-3\.1\-Flash\-Lite \(Google DeepMind, [2026](https://arxiv.org/html/2606.27679#bib.bib37)\) as the default LLM\-as\-a\-judge groundtruth annotator\. We discuss the effects of different annotation strategies in [Section 3\.3](https://arxiv.org/html/2606.27679#S3.SS3)\.

#### Language Models\.

For the main experiment, we evaluate with five popular LLMs across three model families: Llama\-3\.1\-8B \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib39)\), Qwen\-3\-4B, Qwen\-3\-8B \(Yang et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib40)\), Qwen\-3\.5\-9B \(Qwen Team, [2026](https://arxiv.org/html/2606.27679#bib.bib41)\), and Gemma\-3\-12B \(Kamath et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib42)\)\. We use the instruction\-tuned version for all models except for Qwen\-3\-8B\. For the more detailed analyses, we work with Qwen\-3\-8B\. Results for Llama\-3\.1\-8B and Qwen\-3\-4B are deferred to [Appendix B](https://arxiv.org/html/2606.27679#A2) for better presentation\.

#### Evaluation Metrics\.

To evaluate our results, we report two complementary metrics that capture key desiderata of uncertainty estimation: \(i\) Area Under the Receiver Operating Characteristic Curve \(AUROC; Davis and Goadrich, [2006](https://arxiv.org/html/2606.27679#bib.bib43)\), which measures discriminability, i\.e\. the probe’s ability to separate the positive \(correct\) class from the negative \(incorrect\) class, with 1\.0 indicating perfect discrimination and 0\.5 indicating performance no better than chance; and \(ii\) Expected Calibration Error \(ECE; Guo et al\., [2017](https://arxiv.org/html/2606.27679#bib.bib44)\), which measures calibration, i\.e\. how closely the predicted probabilities align with real\-world outcomes, with 0\.0 indicating perfect calibration\. We follow the standard implementation with 10 equal\-width bins\.
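
As a rough illustration of the two metrics, the sketch below computes AUROC from pairwise comparisons (ties count one half) and ECE with 10 equal-width bins. It assumes the binary reliability form of ECE, in which the mean predicted probability of correctness in each bin is compared with the observed fraction of correct responses; the function names are hypothetical and this is not taken from the paper's code.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error with equal-width bins over [0, 1]."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    # Bin index per sample; a probability of exactly 1.0 goes in the last bin.
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            total += mask.mean() * gap          # weight by bin occupancy
    return float(total)

# Toy probe outputs: predicted probability that each response is correct.
p = [0.9, 0.8, 0.2, 0.1]
y = [1, 1, 0, 0]
print(auroc(p, y), round(ece(p, y), 3))
```

The pairwise AUROC is quadratic in memory, which is acceptable for a sketch but would be replaced by a rank-based routine on large evaluation sets.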

#### Feature Representations\.

For each query, we run greedy decoding and extract 19 feature representations from the LLM’s internal generation trace, grouped into four information levels\. First, we use hidden state representation summaries: Embedding \(last\), Embedding \(mean\), and Embedding \(all\) \(Azaria and Mitchell, [2023](https://arxiv.org/html/2606.27679#bib.bib1); Su et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib48)\); trajectory\-based CoE\-R and CoE\-C \(Wang et al\., [2025b](https://arxiv.org/html/2606.27679#bib.bib45)\); dispersion\-based Circular Variance and Cov\. Determinant \(Srey et al\., [2026b](https://arxiv.org/html/2606.27679#bib.bib3)\); and density\-based SATMD \(Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\)\. Second, probability features capture output\-side uncertainty: Max\. Seq\. Prob\. \(MSP\), Entropy, Perplexity \(Huang et al\., [2025b](https://arxiv.org/html/2606.27679#bib.bib46)\), Energy \(Liu et al\., [2020](https://arxiv.org/html/2606.27679#bib.bib47)\), and Top\-$m$ Prob\. \(He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17)\), with $m=10$ as default\. Third, we include attention features that reflect attention allocation between input and generated tokens, namely recent\-token Attention \(Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\) and Lookback Ratio \(Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18)\)\. Finally, combined features concatenate complementary signals into Layer Top\-$m$ Prob\. \(He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17)\), Attention \+ MSP, Internal Variance, and SATMD \+ MSP\. We provide more details on feature representation in [Appendix A](https://arxiv.org/html/2606.27679#A1)\.
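
The probability-level features can be illustrated from the per-token logits of a greedy generation. The sketch below uses common textbook definitions (sequence log-probability for MSP, mean token entropy, perplexity as the exponentiated negative mean log-probability, energy as the negative log-sum-exp of the logits, and the mean of the $m$ largest token probabilities); these definitions and all names are assumptions, and the precise variants used in the study are those described in Appendix A.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def probability_features(logits, token_ids, m=10):
    """Output-side uncertainty features for one generated sequence.

    logits    : (T, V) array of next-token logits at each decoding step.
    token_ids : (T,) array with the ids of the greedily decoded tokens.
    """
    logits = np.asarray(logits, dtype=float)
    token_ids = np.asarray(token_ids, dtype=int)
    logp = log_softmax(logits)
    p = np.exp(logp)
    tok_logp = logp[np.arange(len(token_ids)), token_ids]
    # log-sum-exp recovered from the identity logp = logits - logsumexp.
    lse = logits[:, 0] - logp[:, 0]
    top_m = np.sort(p, axis=-1)[:, ::-1][:, :m]
    return {
        "log_msp": float(tok_logp.sum()),              # log sequence probability
        "perplexity": float(np.exp(-tok_logp.mean())),
        "entropy": float(-(p * logp).sum(axis=-1).mean()),
        "energy": float(-lse.mean()),
        "top_m_prob": top_m.mean(axis=0),              # (m,) averaged over tokens
    }

# Toy trace: 3 decoding steps over a vocabulary of 5 tokens.
rng = np.random.default_rng(0)
feats = probability_features(rng.normal(size=(3, 5)), [1, 4, 2], m=3)
print(sorted(feats))
```

Scalar features such as `entropy` yield one-dimensional probe inputs, whereas `top_m_prob` yields an $m$-dimensional vector.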

#### Probe Training\.

We train lightweight binary probes on top of each feature representation to predict whether a generated response is correct\. Our default is a linear probe, and we evaluate two non\-linear variants \(see [Section 3\.2](https://arxiv.org/html/2606.27679#S3.SS2)\): an MLP probe with one ReLU\-activated hidden layer, and a CNN that applies 1D convolutions over the flattened feature vector before pooling and classification\. We adopt the binary cross\-entropy loss, optimised using Adam, with feature normalisation and early stopping based on validation AUROC\.
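
The training recipe for the default linear probe can be sketched in plain NumPy as follows. This is an illustrative, full-batch version under assumed hyperparameters (learning rate, epoch budget, and patience are placeholders, not values reported here), with a hand-written Adam update; it is not the released implementation.

```python
import numpy as np

def auroc(scores, labels):
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def train_linear_probe(X_tr, y_tr, X_val, y_val, lr=1e-2, epochs=200, patience=10):
    """Logistic-regression probe: BCE loss, Adam, early stopping on val AUROC."""
    X_tr, X_val = np.asarray(X_tr, float), np.asarray(X_val, float)
    y_tr, y_val = np.asarray(y_tr, float), np.asarray(y_val, float)
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0) + 1e-8   # feature normalisation
    X_tr, X_val = (X_tr - mu) / sd, (X_val - mu) / sd
    theta = np.zeros(X_tr.shape[1] + 1)                   # weights plus bias
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    b1, b2, eps = 0.9, 0.999, 1e-8
    best_auc, best_theta, bad = -1.0, theta.copy(), 0
    for t in range(1, epochs + 1):
        p = 1.0 / (1.0 + np.exp(-(X_tr @ theta[:-1] + theta[-1])))
        err = p - y_tr                                    # dBCE/dlogit
        grad = np.append(X_tr.T @ err, err.sum()) / len(y_tr)
        m = b1 * m + (1 - b1) * grad                      # Adam moment updates
        v = b2 * v + (1 - b2) * grad ** 2
        theta = theta - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        val_auc = auroc(X_val @ theta[:-1] + theta[-1], y_val)
        if val_auc > best_auc:
            best_auc, best_theta, bad = val_auc, theta.copy(), 0
        else:
            bad += 1
            if bad >= patience:                           # early stopping
                break
    return {"theta": best_theta, "mu": mu, "sd": sd, "val_auroc": float(best_auc)}

def predict_proba(probe, X):
    """Predicted probability that each response is correct."""
    Z = (np.asarray(X, float) - probe["mu"]) / probe["sd"]
    return 1.0 / (1.0 + np.exp(-(Z @ probe["theta"][:-1] + probe["theta"][-1])))
```

The normalisation statistics are stored with the probe so that the same transform is applied at test time, which matters when a probe is later reused on a different dataset.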

### 3\.2 Simple Features Are Difficult to Outperform

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/probe_auroc.png) Figure 2: Effect of probe architecture on AUROC\.

#### Simple Signals Win\.

Under a fixed linear\-probe setting, more complex feature engineering does not consistently improve in\-domain uncertainty estimation\. As shown in [Figure 1](https://arxiv.org/html/2606.27679#S1.F1), simple hidden state and attention\-based features, namely Embedding \(mean\), Embedding \(last\), Lookback Ratio, and Attention, are consistently competitive across all models, with Lookback Ratio standing out for its smaller input dimension\. Additionally, in line with previous findings, concatenating embeddings from all layers does not always improve performance \(Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18); Su et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib48)\)\. In contrast, augmenting hidden state features with logit\-based signals, such as MSP and Entropy, or vice versa for Top\-$m$ Prob\., *does* reliably increase discriminability, suggesting that fusing complementary signal types is more useful than simply adding more of the same type of hidden state features\.

#### Linear Probes Are Stable\.

We further test whether stronger probe architectures would affect the feature ranking\. [Figure 2](https://arxiv.org/html/2606.27679#S3.F2) shows that MLP and CNN probes can improve weaker feature representations, especially low\-dimensional and scalar features, e\.g\. Internal Variance and Energy, but bring limited gains for strong hidden state and attention\-based signals\. We observe a similar trend in calibration \([Figure 9](https://arxiv.org/html/2606.27679#A2.F9)\)\. However, under the transfer setting \(see [Figure 8](https://arxiv.org/html/2606.27679#A2.F8)\), higher\-capacity probes generally worsen performance, suggesting that they may capture dataset\-specific patterns that hurt generalisation\. Thus, probe complexity is not uniformly beneficial: once the feature representation is sufficiently informative, linear probes are often competitive and more stable\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/trsize.png) Figure 3: Dependence on number of training examples\.

#### Limited Supervision Is Sufficient\.

Next, we vary the number of labelled training examples to determine how much supervision is required\. [Figure 3](https://arxiv.org/html/2606.27679#S3.F3) visualises AUROC for a set of strong features\. Performance steadily improves with more labelled data, but satisfactory performance is already reached with relatively few labels, with average AUROC plateauing at around 128–256 examples\. In particular, higher\-dimensional features such as last\-token embeddings can perform better with fewer training examples, with smaller gains beyond 128 instances\.

Overall, these results suggest that factuality information is largely linearly accessible in the hidden states even with limited supervision\. The main benefit of recent feature engineering may therefore not be stronger in\-domain discrimination, but better performance under other constraints, such as calibration and transfer\. For instance, ECE results in [Figure 1](https://arxiv.org/html/2606.27679#S1.F1) show that Internal Variance is relatively well\-calibrated despite lower AUROC\. This motivates our further examination in transfer tasks rather than treating in\-domain AUROC as the sole indicator of progress\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/prompt.png) Figure 4: Effects of inducing reasoning\.

### 3\.3 Data Construction Strongly Shapes Performance

To obtain data for training the uncertainty probes, there are two crucial design choices to consider: prompting, which affects response elicitation and thus the internal states, and groundtruth annotation strategy\.

#### Reasoning Hurts Performance\.

We test three different prompting options on ARC and CSQA to induce long, short, and no reasoning\. For long reasoning, we prompt with Chain\-of\-Thought \(CoT; Wei et al\., [2022](https://arxiv.org/html/2606.27679#bib.bib51)\), and for short reasoning, we adapt CoT and constrain the model to provide only a concise one\-sentence rationale before answering\. We find that prompt format has a substantial effect on probe performance\. In [Figure 4](https://arxiv.org/html/2606.27679#S3.F4), reasoning reduces AUROC across most features on both CSQA and ARC, despite LLM accuracy remaining comparable, with the degradation especially pronounced for lower\-dimensional probability\-based and internal\-variance features\. This suggests that, despite possible improvements in answer generation, reasoning traces can alter the feature representations used by probes, potentially diluting their factuality signals\. Therefore, for probe\-based UE, concise and direct generations are preferable\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/labels.png) Figure 5: Impact of various automated annotation options on probe performance\.

Table 1: Cohen’s kappa \(agreement %\) for automated labels and human judgement\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/x3.png) Figure 6: Benchmark\-transfer performance\. Average AUROC across In\-domain, Out\-of\-domain \(same task\), and Out\-of\-domain \(cross task\) configurations\.

#### Groundtruth Annotation Matters\.

We study how the choice of correctness annotation for generations with reference answers affects probe evaluation\. Since it is too costly to obtain gold human annotations, we resort to automated scorers, a common practice in the field \(Duan et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib53); Janiak et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib25); Kuhn et al\., [2023](https://arxiv.org/html/2606.27679#bib.bib54); Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\)\. Specifically, we examine three types of labelling strategies on TriviaQA, SciQ, and PopQA: ROUGE \(Lin, [2004](https://arxiv.org/html/2606.27679#bib.bib49)\), AlignScore \(Zha et al\., [2023](https://arxiv.org/html/2606.27679#bib.bib50)\), and LLM\-as\-a\-judge with Gemini\-3\.1\-Flash\-Lite\. We binarise ROUGE and AlignScore using a generic decision threshold of 0\.5, consistent with prior work \(Duan et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib53); Lei et al\., [2023](https://arxiv.org/html/2606.27679#bib.bib56)\)\. To assess the fidelity of the automatically generated labels, we randomly select 100 generations from each dataset and measure the agreement of correctness labels between the automated scorer and a human annotator, who was instructed to perform web search as needed\. [Table 1](https://arxiv.org/html/2606.27679#S3.T1) presents Cohen’s kappa and the percentage of agreed labels between the automated and human scorers\. Consistent with findings by Janiak et al\. \([2025](https://arxiv.org/html/2606.27679#bib.bib25)\), LLM\-as\-a\-judge aligns much more closely with human judgement; AlignScore is second\-best but still trails by a large margin\. Due to its high agreement, we utilise LLM\-as\-a\-judge labels as the groundtruth\. Then, in [Figure 5](https://arxiv.org/html/2606.27679#S3.F5), we investigate the effect of varying the labelling choice for the training data, and we find that LLM\-as\-a\-judge labels are more consistent with the chosen evaluation labels\. These results caution against relying on lexical metrics to obtain groundtruths, and instead advocate for LLM\-as\-a\-judge labelling, which better captures semantic correctness\.
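
The agreement check can be illustrated with a short sketch that binarises a continuous scorer at 0.5 and computes Cohen's kappa and raw agreement against human labels. The scores and labels below are made-up toy values, the function names are hypothetical, and treating a score of exactly 0.5 as correct is an assumption.

```python
import numpy as np

def binarise(scores, threshold=0.5):
    """Map continuous scorer outputs (e.g. ROUGE or AlignScore) to 0/1 labels."""
    return (np.asarray(scores, dtype=float) >= threshold).astype(int)

def cohens_kappa(a, b):
    """Cohen's kappa for two binary label vectors."""
    a, b = np.asarray(a, dtype=int), np.asarray(b, dtype=int)
    p_obs = (a == b).mean()                                   # observed agreement
    p_exp = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())  # chance agreement
    return 1.0 if p_exp == 1.0 else float((p_obs - p_exp) / (1 - p_exp))

# Toy example: four generations scored automatically and by a human annotator.
auto = binarise([0.9, 0.6, 0.4, 0.1])
human = [1, 0, 0, 0]
print(cohens_kappa(auto, human), (auto == np.asarray(human)).mean())
```

Kappa corrects raw agreement for the agreement expected by chance, which is why both numbers are worth reporting when the class balance is skewed.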

### 3\.4 Structured Features Are More Robust Under Transfer

For this benchmark\-to\-benchmark transfer setting, we report average performance within the three broader task groups: QA, Verification, and MCQ\. Further, we distinguish between in\-domain, out\-of\-domain \(OOD\) \(same task\), and OOD \(cross task\) performance\. By OOD \(same task\), we mean testing on a different dataset from the same task group, e\.g\. TriviaQA → SciQ, and OOD \(cross task\) refers to all remaining train\-test configurations\.
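
The three transfer settings follow mechanically from the task grouping in Section 3.1. A small sketch of the categorisation, with hypothetical names, is:

```python
from collections import Counter
from itertools import product

# Task groups as listed in the experimental setup.
TASK_GROUP = {
    "TriviaQA": "QA", "SciQ": "QA", "PopQA": "QA",
    "BoolQ": "Verification", "StrategyQA": "Verification",
    "CSQA": "MCQ", "ARC": "MCQ",
}

def transfer_setting(train, test):
    """Classify a train/test dataset pair into one of the three settings."""
    if train == test:
        return "in-domain"
    if TASK_GROUP[train] == TASK_GROUP[test]:
        return "OOD (same task)"
    return "OOD (cross task)"

# Enumerate all 49 train-test configurations and count each setting.
counts = Counter(transfer_setting(a, b) for a, b in product(TASK_GROUP, repeat=2))
print(dict(counts))
```

With seven datasets this gives 7 in-domain, 10 same-task, and 32 cross-task configurations, so cross-task averages aggregate over considerably more pairs than same-task ones.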

On average, [Figure 6](https://arxiv.org/html/2606.27679#S3.F6) shows a clear gap between in\-domain and transfer performance, especially across tasks\. This confirms that probes are brittle, and that strong in\-domain performance alone is not sufficient evidence that they will be useful in deployment\. Hidden state embeddings remain strong in\-domain, but their advantage narrows considerably OOD, as they may encode task\- or data\-specific biases that inhibit generalisation\. In comparison, more structured features, such as Internal Variance, Lookback Ratio, and Top\-$m$ Prob\., retain more of their performance under same\-task and cross\-task transfer\. As a practical takeaway, pretrained probes should be selected not solely by in\-domain performance, but also by their robustness to dataset shift\.

Table 2: AUROC with bootstrapped standard errors\. Baselines are indicated with probe architecture and number of training examples from Biographies \(in\-domain\)\.

## 4 Pretrained Probes on Open\-Ended Generation

In this section, we answer our second research question: *Can probes trained under controlled benchmark settings generalise to open\-ended generation tasks?* We find that benchmark\-pretrained probes based on structured features transfer effectively, in some cases outperforming supervised baselines trained with limited in\-domain labels\. This supports pretrained probes as a practical starting point for factuality estimation\.

### 4\.1 Experimental Setup

#### Evaluation Datasets\.

We sample 100 entities from the dataset used in \(Min et al\., [2023](https://arxiv.org/html/2606.27679#bib.bib55)\), and 50 from each domain in \(Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15)\)\. In total, we have 950 entities spanning nine domains: Biographies \(in\-domain\), Biographies \(OOD\), Artworks, Books, Cities, Events, Inventions, Landmarks, Movies\. Using Qwen\-3\-8B, we apply the same simple prompting as Shelmanov et al\. \([2025](https://arxiv.org/html/2606.27679#bib.bib15)\) to generate text for each entity, up to a maximum of 512 tokens\. Similar to Han et al\. \([2025](https://arxiv.org/html/2606.27679#bib.bib29)\), we utilise Gemini\-3\.1\-Flash\-Lite to automatically decompose each long\-form continuation into atomic factual claims, resulting in approximately 600–800 claims per domain\. From each claim, we obtain the features and labels in the same manner as in [Section 3](https://arxiv.org/html/2606.27679#S3), but we replace the judge LLM with GPT\-5\.4\-Mini \(OpenAI, [2026](https://arxiv.org/html/2606.27679#bib.bib38)\)\. According to [Table 4](https://arxiv.org/html/2606.27679#A2.T4), GPT\-5\.4\-Mini is considerably more reliable in atomic claim factuality verification\. As the two judges’ agreement with human labels is similar for the benchmark tasks, with Gemini marginally better aligned, we keep Gemini’s labels for the previous section\. As baselines, we train simple linear and MLP probes on Embedding \(last\) with 64, 128, and 256 training instances, randomly selected from claims generated with Biographies \(in\-domain\) entities\.

#### Pretrained Probes\.

We retain probes pretrained on a dataset pooled from all seven benchmarks\. Carrying forward our best practices from [Section 3](https://arxiv.org/html/2606.27679#S3), we employ the simple linear architecture, direct answering without reasoning, and LLM\-as\-a\-judge labelling\. In terms of feature representations, we select Lookback Ratio, Layer Top\-$m$ Prob\., and Internal Variance for their displayed robustness in benchmark\-to\-benchmark transfer, and we additionally bring forward Embedding \(last\) for comparison\. Furthermore, we consider two simple ensembles: Ensemble \(probe\), which averages predictions from three Internal Variance probes trained separately on QA, Verification, and MCQ; and Ensemble \(task\), which averages predictions from probes trained on pooled data using the three features\.
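Ensembling by averaging can be sketched as follows\. This is a minimal illustration, assuming that each linear probe outputs a sigmoid probability and that the ensemble takes the unweighted mean of those probabilities; whether the released probes average probabilities or logits is not specified here, and `linear_probe` and `ensemble_average` are hypothetical names\.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def linear_probe(weights, bias):
    """Return a scoring function: sigmoid(w . x + b)."""
    def predict(features):
        z = sum(w * x for w, x in zip(weights, features)) + bias
        return sigmoid(z)
    return predict


def ensemble_average(probes, feature_sets):
    """Unweighted mean of probe probabilities.

    feature_sets[i] is the input for probes[i]; members may share one
    feature type (probes trained on different tasks) or each use a
    different feature type (probes trained on pooled data).
    """
    probs = [probe(feats) for probe, feats in zip(probes, feature_sets)]
    return sum(probs) / len(probs)
```

The same helper covers both ensembles described above: only the choice of member probes and their input features changes\.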

### 4\.2 Results and Discussion

[Table 2](https://arxiv.org/html/2606.27679#S3.T2) presents AUROC with bootstrapped standard errors \(SE\)\. Pretrained probes transfer strongly to open\-ended factual generation, despite requiring no task\-specific target\-domain training data\. While the supervised Linear \(\#256\) baseline performs best in\-domain, its OOD average is markedly below that of the best pretrained probes\. These results reinforce our benchmark\-transfer findings\. Raw last\-token embeddings transfer poorly to open\-ended generation, suggesting that they encode task\-specific variation\. In contrast, structured features such as Layer Top\-$m$ Prob\., Internal Variance, and Lookback Ratio are more robust across domains\. Meanwhile, simple ensembling provides mixed benefits\. The bootstrapped SEs are relatively small given that each domain contains 600–800 claims, indicating that the estimates are sufficiently stable\. On balance, pretrained probes do not replace target\-domain supervision, but they offer a practical baseline when labels are scarce\. Thus, a deployment\-oriented recipe is to start with pretrained probes based on structured or hybrid features, then adapt them as target\-domain labels become available\.
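For readers who want to reproduce this style of reporting, the sketch below shows one common way to obtain a bootstrapped SE for AUROC: resample claims with replacement, recompute AUROC on each resample, and take the standard deviation of the resulting estimates\. The number of resamples, the seed, and the convention that label 1 marks the positive class are assumptions; the exact protocol used for Table 2 is not described in this section\.

```python
import random
import statistics


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as 0.5). Quadratic in the number of claims."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def bootstrap_se(labels, scores, n_boot=1000, seed=0):
    """Standard deviation of AUROC over bootstrap resamples."""
    if len(set(labels)) < 2:
        raise ValueError("AUROC needs at least one example of each class")
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip degenerate resamples containing a single class
        estimates.append(auroc(ys, [scores[i] for i in idx]))
    return statistics.stdev(estimates)
```

The pairwise AUROC here is written for clarity rather than speed; with several hundred claims per domain, a rank\-based implementation would be preferable in practice\.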

## 5 Conclusion

In this work, we examine the influence of feature representation, training data construction, and transfer on probe\-based UE\. Our results indicate that raw features are surprisingly strong in\-domain\. However, under distribution shift, compressed and structured features are more robust, suggesting that in\-domain discriminability alone is insufficient for assessing progress\. Then, we distill best practices to train probes on benchmark data\. Even without task\-specific supervision, these pretrained probes demonstrate reasonable performance when transferred to long form open\-ended factual generation, providing a stable off\-the\-shelf baseline\. We hope this work encourages the field to move beyond matched benchmark comparison and towards probe configurations that remain useful under realistic distribution shifts\. Future work can improve this direction by adapting pretrained probes with small amounts of target\-domain supervision, enabling stronger performance in low\-label deployment settings\.

## Limitations

Although our work has derived best practices to train uncertainty/factuality probes and has shown how these can support deployment\-oriented open\-ended generation, it has the following limitations:

#### Constrained Open\-Ended Generation\.

We simulate a more realistic setting by considering long\-form generation involving an entity\. However, there are many other use cases that we do not cover, such as unconstrained dialogue and multi\-turn interaction\. Thus, our conclusions about pretrained probe transfer should be interpreted as evidence for entity\-centric factual generation, rather than as a universal claim of robustness across all open\-ended settings\.

#### Model and Feature Coverage\.

Our study covers a broad set of LLMs, probe features, and design factors, but it is not exhaustive\. We focus on popular open\-weight LLMs and internal signals\. In particular, we conduct our detailed analysis with base Qwen\-3\-8B\. Future work can assess how broadly the observed best practices apply to larger instruction\-tuned LLMs and other feature representations\.

#### Hallucination Mitigation\.

We focus on hallucination detection using probes, and leave probe\-guided hallucination mitigation unexplored\. Future work can investigate whether the probe configurations identified here also support effective mitigation strategies\.

## References

- A\. Azaria and T\. Mitchell \(2023\) The internal state of an LLM knows when it’s lying\. In Findings of the Association for Computational Linguistics: EMNLP 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 967–976\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.68/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.68)
- D\. Bouchard, M\. S\. Chauhan, D\. Skarbrevik, H\. Ra, V\. Bajaj, and Z\. Ahmad \(2026\) Uqlm: a python package for uncertainty quantification in large language models\. Journal of Machine Learning Research 27 \(13\), pp\. 1–10\.
- C\. Burns, H\. Ye, D\. Klein, and J\. Steinhardt \(2022\) Discovering latent knowledge in language models without supervision\. arXiv preprint arXiv:2212\.03827\. External Links: [Link](https://arxiv.org/abs/2212.03827)
- S\. CH\-Wang, B\. Van Durme, J\. Eisner, and C\. Kedzie \(2024\) Do androids know they’re only dreaming of electric sheep? In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 4401–4420\. External Links: [Link](https://aclanthology.org/2024.findings-acl.260/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.260)
- Y\. Chuang, L\. Qiu, C\. Hsieh, R\. Krishna, Y\. Kim, and J\. R\. Glass \(2024\) Lookback lens: detecting and mitigating contextual hallucinations in large language models using only attention maps\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 1419–1436\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.84/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.84)
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\) BoolQ: exploring the surprising difficulty of natural yes/no questions\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\), Minneapolis, Minnesota, pp\. 2924–2936\. External Links: [Link](https://aclanthology.org/N19-1300/), [Document](https://dx.doi.org/10.18653/v1/N19-1300)
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try arc, the ai2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\. External Links: [Link](https://arxiv.org/abs/1803.05457)
- J\. Davis and M\. Goadrich \(2006\) The relationship between precision\-recall and roc curves\. In Proceedings of the 23rd international conference on Machine learning, pp\. 233–240\. External Links: [Link](https://ftp.cs.wisc.edu/machine-learning/shavlik-group/davis.icml06.pdf)
- J\. Duan, H\. Cheng, S\. Wang, A\. Zavalny, C\. Wang, R\. Xu, B\. Kailkhura, and K\. Xu \(2024\) Shifting attention to relevance: towards the predictive uncertainty quantification of free\-form large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 5050–5063\. External Links: [Link](https://aclanthology.org/2024.acl-long.276/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.276)
- E\. Fadeeva, R\. Vashurin, A\. Tsvigun, A\. Vazhentsev, S\. Petrakov, K\. Fedyanin, D\. Vasilev, E\. Goncharova, A\. Panchenko, M\. Panov, T\. Baldwin, and A\. Shelmanov \(2023\) LM\-polygraph: uncertainty estimation for language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Y\. Feng and E\. Lefever \(Eds\.\), Singapore, pp\. 446–461\. External Links: [Link](https://aclanthology.org/2023.emnlp-demo.41/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-demo.41)
- T\. Fu, J\. Conde, G\. Martínez, M\. Grandury, and P\. Reviriego \(2025\) Multiple choice questions: reasoning makes large language models \(llms\) more self\-confident even when they are wrong\. arXiv preprint arXiv:2501\.09775\.
- M\. Geva, D\. Khashabi, E\. Segal, T\. Khot, D\. Roth, and J\. Berant \(2021\) Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies\. Transactions of the Association for Computational Linguistics 9, pp\. 346–361\. External Links: [Link](https://aclanthology.org/2021.tacl-1.21/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00370)
- Google DeepMind \(2026\) Gemini 3\.1 flash\-lite\. Note: [https://ai\.google\.dev/gemini\-api/docs/models/gemini\-3\.1\-flash\-lite](https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite) Accessed: May 22, 2026
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. External Links: [Link](https://arxiv.org/abs/2407.21783)
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In International conference on machine learning, pp\. 1321–1330\. External Links: [Link](https://arxiv.org/abs/1706.04599)
- J\. Han, N\. Band, M\. Razzak, J\. Kossen, T\. G\. J\. Rudner, and Y\. Gal \(2025\) Simple factuality probes detect hallucinations in long\-form natural language generation\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 16209–16226\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.880/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.880), ISBN 979\-8\-89176\-335\-7
- J\. He, Y\. Gong, Z\. Lin, C\. Wei, Y\. Zhao, and K\. Chen \(2024\) LLM factoscope: uncovering LLMs’ factual discernment through measuring inner states\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 10218–10230\. External Links: [Link](https://aclanthology.org/2024.findings-acl.608/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.608)
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu \(2025a\) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\. ACM Trans\. Inf\. Syst\. 43 \(2\)\. External Links: ISSN 1046\-8188, [Link](https://doi.org/10.1145/3703155), [Document](https://dx.doi.org/10.1145/3703155)
- Y\. Huang, J\. Song, Z\. Wang, S\. Zhao, H\. Chen, F\. Juefei\-Xu, and L\. Ma \(2025b\) Look before you leap: an exploratory study of uncertainty analysis for large language models\. IEEE Transactions on Software Engineering 51 \(2\), pp\. 413–429\. External Links: [Link](https://arxiv.org/abs/2307.10236)
- D\. Janiak, J\. Binkowski, A\. Sawczyn, B\. Gabrys, R\. Shwartz\-Ziv, and T\. J\. Kajdanowicz \(2025\) The illusion of progress: re\-evaluating hallucination detection in LLMs\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 34728–34745\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1761/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1761), ISBN 979\-8\-89176\-332\-6
- M\. Joshi, E\. Choi, D\. Weld, and L\. Zettlemoyer \(2017\) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), R\. Barzilay and M\. Kan \(Eds\.\), Vancouver, Canada, pp\. 1601–1611\. External Links: [Link](https://aclanthology.org/P17-1147/), [Document](https://dx.doi.org/10.18653/v1/P17-1147)
- A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, et al\. \(2025\) Gemma 3 technical report\. arXiv preprint arXiv:2503\.19786\. External Links: [Link](https://arxiv.org/abs/2503.19786)
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. arXiv preprint arXiv:2302\.09664\. External Links: [Link](https://arxiv.org/abs/2302.09664)
- D\. Lei, Y\. Li, M\. Hu, M\. Wang, V\. Yun, E\. Ching, and E\. Kamal \(2023\) Chain of natural language inference for reducing large language model ungrounded hallucinations\. arXiv preprint arXiv:2310\.03951\. External Links: [Link](https://arxiv.org/abs/2310.03951)
- R\. Li, J\. Long, M\. Qi, H\. Xia, L\. Sha, P\. Wang, and Z\. Sui \(2025\) Towards harmonized uncertainty estimation for large language models\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 22938–22953\. External Links: [Link](https://aclanthology.org/2025.acl-long.1118/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1118), ISBN 979\-8\-89176\-251\-0
- C\. Lin \(2004\) ROUGE: a package for automatic evaluation of summaries\. In Text Summarization Branches Out, Barcelona, Spain, pp\. 74–81\. External Links: [Link](https://aclanthology.org/W04-1013/)
- W\. Liu, X\. Wang, J\. Owens, and Y\. Li \(2020\) Energy\-based out\-of\-distribution detection\. Advances in neural information processing systems 33, pp\. 21464–21475\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/f5496252609c43eb8a3d147ab9b9c006-Paper.pdf)
- M\. Mahaut, L\. Aina, P\. Czarnowska, M\. Hardalov, T\. Müller, and L\. Marquez \(2024\) Factual confidence of LLMs: on reliability and robustness of current estimators\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 4554–4570\. External Links: [Link](https://aclanthology.org/2024.acl-long.250/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.250)
- A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi \(2023\) When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 9802–9822\. External Links: [Link](https://aclanthology.org/2023.acl-long.546/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.546)
- S\. Marks and M\. Tegmark \(2023\) The geometry of truth: emergent linear structure in large language model representations of true/false datasets\. arXiv preprint arXiv:2310\.06824\. External Links: [Link](https://arxiv.org/abs/2310.06824)
- Z\. Mei, C\. Zhang, T\. Yin, J\. Lidard, O\. Sho, and A\. Majumdar \(2026\) Reasoning about uncertainty: do reasoning models know when they don’t know? In Findings of the Association for Computational Linguistics: EACL 2026, V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\), Rabat, Morocco, pp\. 3408–3458\. External Links: [Link](https://aclanthology.org/2026.findings-eacl.178/), [Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.178), ISBN 979\-8\-89176\-386\-9
- S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi \(2023\) FActScore: fine\-grained atomic evaluation of factual precision in long form text generation\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore\.
- O\. Obeso, A\. Arditi, J\. Ferrando, J\. Freeman, C\. Holmes, and N\. Nanda \(2025\) Real\-time detection of hallucinated entities in long\-form generation\. arXiv preprint arXiv:2509\.03531\. External Links: [Link](https://arxiv.org/abs/2509.03531)
- OpenAI \(2026\) Introducing gpt‑5\.4 mini and nano\. Note: [https://openai\.com/index/introducing\-gpt\-5\-4\-mini\-and\-nano//](https://openai.com/index/introducing-gpt-5-4-mini-and-nano//) Accessed: May 25, 2026
- Qwen Team \(2026\) Qwen3\.5: Towards Native Multimodal Agents\. Note: [https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5) Accessed: May 22, 2026
- P\. Sahoo, P\. Meharia, A\. Ghosh, S\. Saha, V\. Jain, and A\. Chadha \(2024\) A comprehensive survey of hallucination in large language, image, video and audio foundation models\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 11709–11724\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.685/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.685)
- A\. Shelmanov, E\. Fadeeva, A\. Tsvigun, I\. Tsvigun, Z\. Xie, I\. Kiselev, N\. Daheim, C\. Zhang, A\. Vazhentsev, M\. Sachan, P\. Nakov, and T\. Baldwin \(2025\) A head to predict and a head to question: pre\-trained uncertainty quantification heads for hallucination detection in LLM outputs\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 35712–35731\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1809/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1809), ISBN 979\-8\-89176\-332\-6
- P\. Srey, Q\. M\. Nguyen, X\. Wu, and A\. T\. Luu \(2026a\) Towards reliable truth\-aligned uncertainty estimation in large language models\. arXiv preprint arXiv:2604\.00445\. External Links: [Link](https://arxiv.org/abs/2604.00445)
- P\. Srey, X\. Wu, and A\. T\. Luu \(2025\) Unsupervised hallucination detection by inspecting reasoning processes\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 22117–22129\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1124/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1124), ISBN 979\-8\-89176\-332\-6
- P\. Srey, X\. Wu, C\. Nguyen, and A\. T\. Luu \(2026b\) Learning uncertainty from sequential internal dispersion in large language models\. arXiv preprint arXiv:2604\.15741\. External Links: [Link](https://arxiv.org/abs/2604.15741)
- W\. Su, C\. Wang, Q\. Ai, Y\. Hu, Z\. Wu, Y\. Zhou, and Y\. Liu \(2024\) Unsupervised real\-time hallucination detection based on the internal states of large language models\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 14379–14391\. External Links: [Link](https://aclanthology.org/2024.findings-acl.854/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.854)
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\) CommonsenseQA: a question answering challenge targeting commonsense knowledge\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\), Minneapolis, Minnesota, pp\. 4149–4158\. External Links: [Link](https://aclanthology.org/N19-1421/), [Document](https://dx.doi.org/10.18653/v1/N19-1421)
- H\. Tan, F\. Sun, S\. Liu, D\. Su, Q\. Cao, X\. Chen, J\. Wang, X\. Cai, Y\. Wang, H\. Shen, and X\. Cheng \(2025\) Too consistent to detect: a study of self\-consistent errors in LLMs\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 4755–4765\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.238/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.238), ISBN 979\-8\-89176\-332\-6
- R\. Vashurin, E\. Fadeeva, A\. Vazhentsev, L\. Rvanova, D\. Vasilev, A\. Tsvigun, S\. Petrakov, R\. Xing, A\. Sadallah, K\. Grishchenkov, A\. Panchenko, T\. Baldwin, P\. Nakov, M\. Panov, and A\. Shelmanov \(2025\) Benchmarking uncertainty quantification methods for large language models with LM\-polygraph\. Transactions of the Association for Computational Linguistics 13, pp\. 220–248\. External Links: [Link](https://aclanthology.org/2025.tacl-1.11/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00737)
- A\. Vazhentsev, L\. Rvanova, I\. Lazichny, A\. Panchenko, M\. Panov, T\. Baldwin, and A\. Shelmanov \(2025\) Token\-level density\-based uncertainty quantification methods for eliciting truthfulness of large language models\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 2246–2262\. External Links: [Link](https://aclanthology.org/2025.naacl-long.113/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.113), ISBN 979\-8\-89176\-189\-6
- K\. Wang, S\. A\. Moktar, J\. Li, K\. Li, and F\. Chen \(2025a\) Measuring aleatoric and epistemic uncertainty in llms: empirical evaluation on id and ood qa tasks\. arXiv preprint arXiv:2511\.03166\. External Links: [Link](https://arxiv.org/abs/2511.03166)
- Y\. Wang, P\. Zhang, B\. Yang, D\. Wong, and R\. Wang \(2025b\) Latent space chain\-of\-embedding enables output\-free llm self\-evaluation\. In International Conference on Learning Representations, Vol\. 2025, pp\. 70938–70970\. External Links: [Link](https://openreview.net/forum?id=jxo70B9fQo)
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in neural information processing systems 35, pp\. 24824–24837\. External Links: [Link](https://arxiv.org/abs/2201.11903)
- J\. Welbl, N\. F\. Liu, and M\. Gardner \(2017\) Crowdsourcing multiple choice science questions\. In Proceedings of the 3rd Workshop on Noisy User\-generated Text, L\. Derczynski, W\. Xu, A\. Ritter, and T\. Baldwin \(Eds\.\), Copenhagen, Denmark, pp\. 94–106\. External Links: [Link](https://aclanthology.org/W17-4413/), [Document](https://dx.doi.org/10.18653/v1/W17-4413)
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. External Links: [Link](https://arxiv.org/abs/2505.09388)
- F\. Ye, M\. Yang, J\. Pang, L\. Wang, D\. F\. Wong, E\. Yilmaz, S\. Shi, and Z\. Tu \(2024\) Benchmarking llms via uncertainty quantification\. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA\. External Links: ISBN 9798331314385
- D\. Yoon, S\. Kim, S\. Yang, S\. Kim, S\. Kim, Y\. Kim, E\. Choi, Y\. Kim, and M\. Seo \(2026\) Reasoning models better express their confidence\. Advances in Neural Information Processing Systems 38, pp\. 103869–103896\.
- Y\. Zha, Y\. Yang, R\. Li, and Z\. Hu \(2023\) AlignScore: evaluating factual consistency with a unified alignment function\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 11328–11348\. External Links: [Link](https://aclanthology.org/2023.acl-long.634/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.634)
- F\. Zhang, P\. Yu, B\. Yi, B\. Zhang, T\. Li, and Z\. Liu \(2025a\) Prompt\-guided internal states for hallucination detection of large language models\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 21806–21818\. External Links: [Link](https://aclanthology.org/2025.acl-long.1058/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1058), ISBN 979\-8\-89176\-251\-0
- Y\. Zhang, Y\. Li, L\. Cui, D\. Cai, L\. Liu, T\. Fu, X\. Huang, E\. Zhao, Y\. Zhang, Y\. Chen, L\. Wang, A\. T\. Luu, W\. Bi, F\. Shi, and S\. Shi \(2025b\) Siren’s song in the ai ocean: a survey on hallucination in large language models\. Computational Linguistics 51 \(4\), pp\. 1373–1418\. External Links: [Link](https://aclanthology.org/2025.cl-4.9/), [Document](https://dx.doi.org/10.1162/coli.a.16)
- D\. Zhu, D\. Chen, Q\. Li, Z\. Chen, L\. Ma, J\. Grossklags, and M\. Fritz \(2024\) PoLLMgraph: unraveling hallucinations in large language models via state transition dynamics\. In Findings of the Association for Computational Linguistics: NAACL 2024, K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\), Mexico City, Mexico, pp\. 4737–4751\. External Links: [Link](https://aclanthology.org/2024.findings-naacl.294/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-naacl.294)

## Appendix A Feature Representation Details

[Table 3](https://arxiv.org/html/2606.27679#A1.T3) summarises the feature representations used in our factorised study\. We follow the original formulations proposed in the respective papers and use sequence\-level aggregation where needed\. In particular, for embedding\- and logit\-based UE scores, we apply TAC \(Srey et al\., [2026a](https://arxiv.org/html/2606.27679#bib.bib5)\) to better align raw confidence with truthfulness, and we adapt Internal Variance into a sequence\-level internal\-dispersion feature\.

| Type | Name | Description | Citation |
| --- | --- | --- | --- |
| Hidden State | Embedding \(last\) | Last generated token, last layer embedding | \(Azaria and Mitchell, [2023](https://arxiv.org/html/2606.27679#bib.bib1)\) |
| | Embedding \(mean\) | Mean generated token, last layer embedding | \(Su et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib48)\) |
| | Embedding \(all\) | All\-layer hidden state of last generated token pooled together | \(Su et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib48)\) |
| | CoE\-R, CoE\-C | Geometry of hidden state trajectory across successive layers | \(Wang et al\., [2025b](https://arxiv.org/html/2606.27679#bib.bib45)\) |
| | Circular Variance, Cov\. Determinant | Cross\-layer internal dispersion | \(Srey et al\., [2026b](https://arxiv.org/html/2606.27679#bib.bib3)\) |
| | SATMD | Layer\-wise token deviations from the distribution of correct generations | \(Vazhentsev et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib14)\) |
| Logit | MSP, Entropy, Perplexity, Energy | Logit\-derived uncertainty score | \(Huang et al\., [2025b](https://arxiv.org/html/2606.27679#bib.bib46); Liu et al\., [2020](https://arxiv.org/html/2606.27679#bib.bib47)\) |
| | Top\-$m$ | $m$\-highest predicted probability values\. | \(He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17)\) |
| Attention | Attention | Attention maps to previous tokens\. | \(Shelmanov et al\., [2025](https://arxiv.org/html/2606.27679#bib.bib15)\) |
| | Lookback Ratio | Relative attention paid to input context versus generated tokens\. | \(Chuang et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib18)\) |
| Combined | Layer Top\-$m$ Prob\. | Concatenation of Top\-$m$ Prob\. from all layers\. | \(He et al\., [2024](https://arxiv.org/html/2606.27679#bib.bib17)\) |
| | Internal Variance, Attention \+ MSP, SATMD \+ MSP | Combination via concatenation\. | \- |

Table 3: Feature representations used in our study\.
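
To make the logit\-derived rows of the table concrete, the following is a minimal sketch of how MSP, entropy, perplexity, and energy could be computed from per\-token logits. It is an illustration, not the authors' implementation: the function names `log_softmax` and `logit_scores`, the mean aggregation over generated tokens, and the temperature of 1 in the energy score are all assumptions of this sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one token's vocabulary logits.

    Returns the log-probabilities and the log-sum-exp of the logits.
    """
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits], lse


def logit_scores(token_logits, token_ids):
    """Sequence-level MSP, entropy, perplexity and energy (hypothetical helper).

    token_logits: one logit vector per generated token.
    token_ids: index of the token actually generated at each step.
    Averaging over tokens is an assumption of this sketch.
    """
    msp, entropy, nll, energy = [], [], [], []
    for logits, tok in zip(token_logits, token_ids):
        logp, lse = log_softmax(logits)
        msp.append(math.exp(max(logp)))                      # maximum softmax probability
        entropy.append(-sum(math.exp(lp) * lp for lp in logp))  # token-level entropy
        nll.append(-logp[tok])                               # negative log-likelihood
        energy.append(-lse)                                  # energy score at temperature 1
    n = len(token_ids)
    return {
        "msp": sum(msp) / n,
        "entropy": sum(entropy) / n,
        "perplexity": math.exp(sum(nll) / n),
        "energy": sum(energy) / n,
    }
```

With uniform logits over a four\-token vocabulary, this sketch gives an MSP of 0\.25 and a perplexity of 4, which matches the intuition that the model is maximally uncertain at every step.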
## Appendix B Supplementary Experiments

Here, we provide supplementary experiments\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/x4.png)
![Refer to caption](https://arxiv.org/html/2606.27679v1/x5.png)

Figure 7: AUROC and ECE of Llama\-3\.1\-8B\-Instruct and Qwen\-3\-4B\-Instruct\.

#### Additional Model Results\.

[Figure 7](https://arxiv.org/html/2606.27679#A2.F7) reports in\-domain AUROC and ECE for Qwen\-3\-4B\-Instruct and Llama\-3\.1\-8B\-Instruct\. The overall trends are consistent with the main results: hidden state and attention\-based features remain strong in\-domain\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/probe_transfer_auroc.png)
![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/probe_transfer_ece.png)

Figure 8: Average OOD performance with different probe architectures\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/probe_ece.png)

Figure 9: Effect of probe architecture on ECE\.
#### Calibration under Transfer\.

[Figure 10](https://arxiv.org/html/2606.27679#A2.F10) reports ECE under the same in\-domain, OOD same\-task, and OOD cross\-task settings\. The results show that discriminability and calibration do not always move together: features with strong AUROC are not necessarily the best calibrated, and some lower\-dimensional structured features remain comparatively stable under shift\.
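
The distinction between discriminability and calibration can be illustrated with a small sketch of the two metrics. This is not the evaluation code used in the paper: the helper names `auroc` and `ece`, the equal\-width binning, and the default of 10 bins are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width bins (an assumed binning scheme).

    probs: predicted probability that each answer is correct.
    labels: 1 if the answer is correct, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)  # mean predicted probability
            acc = sum(y for _, y in b) / len(b)   # empirical accuracy in the bin
            err += len(b) / total * abs(conf - acc)
    return err
```

For example, the scores `[0.9, 0.92, 0.95, 0.97]` with labels `[0, 0, 1, 1]` rank every correct answer above every incorrect one, so AUROC is perfect, yet all four predictions are overconfident and the ECE under this sketch is about 0\.435.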

#### Per\-dataset Transfer\.

[Figure 11](https://arxiv.org/html/2606.27679#A2.F11) and [Figure 12](https://arxiv.org/html/2606.27679#A2.F12) provide per\-dataset transfer results for AUROC and ECE, respectively\. While the degree of transfer degradation varies across datasets, hidden state features generally remain strong in\-domain, whereas structured features tend to retain performance more consistently under OOD evaluation\.

#### Probe Architecture and Training Size\.

[Figure 8](https://arxiv.org/html/2606.27679#A2.F8) and [Figure 9](https://arxiv.org/html/2606.27679#A2.F9) show the average OOD performance of different probe architectures and the effect of probe architecture on ECE, respectively\. Higher\-capacity probes can improve some weak feature representations, but do not consistently improve transfer performance, supporting our use of linear probes as the default architecture\. [Figure 13](https://arxiv.org/html/2606.27679#A2.F13) presents per\-dataset performance with different training set sizes, showing that performance often improves rapidly with limited supervision and begins to saturate around 128–256 examples for many feature representations\.
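
As a rough illustration of a linear probe and a training\-size sweep, the sketch below fits a logistic\-regression probe with plain gradient descent on synthetic features. Everything here is assumed for illustration: the helper names, the learning rate and step count, and the toy data generator `make_toy_features`, which stands in for real model features and says nothing about the saturation point reported above.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, lr=0.5, steps=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(w, b, x):
    """Predicted probability that the generation is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def accuracy(w, b, X, y):
    return sum((predict(w, b, xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(y)


def make_toy_features(n, dim=8, seed=0):
    """Synthetic stand-in for probe features: only dimension 0 carries the label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0, 1) for _ in range(dim)]
        x[0] += 2.0 if label else -2.0
        X.append(x)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X_test, y_test = make_toy_features(500, seed=1)
    for n_train in (16, 32, 64, 128, 256):
        X_tr, y_tr = make_toy_features(n_train, seed=0)
        w, b = train_linear_probe(X_tr, y_tr)
        print(n_train, round(accuracy(w, b, X_test, y_test), 3))
```

Running the script prints held\-out accuracy for each training size on the toy data, which is one simple way to produce a curve of the kind plotted per dataset in Figure 13.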

Table 4: Cohen’s kappa \(agreement %\) between Gemini and GPT LLM judges\.
#### LLM\-as\-a\-judge Validation\.

[Table 4](https://arxiv.org/html/2606.27679#A2.T4) compares the agreement between automated LLM\-as\-a\-judge and human labels\. We select 100 examples from each of Trivia, SciQ, and PopQA, and 105 from Biographies \(in\-domain\)\. To obtain gold ground\-truth labels for Trivia, SciQ, and PopQA, we instruct the human annotator to compare the generated answer with the reference answer, and to use web search if a further check is required\. For Biographies \(in\-domain\), similar to Min et al\. \([2023](https://arxiv.org/html/2606.27679#bib.bib55)\), the human annotator uses the Wikipedia article of each entity as evidence to check the factuality of the generated claims\. GPT\-5\.4\-Mini shows stronger agreement on the biography claim\-level setting, while Gemini\-3\.1\-Flash\-Lite remains competitive on the short\-answer benchmark datasets\. We therefore use Gemini labels for the benchmark experiments and GPT\-5\.4\-Mini labels for the open\-ended factual generation experiments\.
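
For readers unfamiliar with the agreement statistic, the following minimal sketch computes Cohen's kappa together with raw percent agreement for two label sequences, such as judge labels and human labels. The function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the paper's data.

```python
def cohen_kappa(a, b):
    """Return (kappa, observed agreement) for two equal-length label lists.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_e == 1.0:
        # Both raters always use the same single label; treat as perfect agreement.
        return 1.0, p_o
    return (p_o - p_e) / (1.0 - p_e), p_o
```

As a usage example, `cohen_kappa([1, 1, 0, 0], [1, 1, 0, 1])` yields an observed agreement of 0\.75 but a kappa of 0\.5, since half of the agreement would be expected by chance given the two label distributions.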

![Refer to caption](https://arxiv.org/html/2606.27679v1/x6.png)

Figure 10: Benchmark\-transfer performance\. Average ECE across In\-domain, Out\-of\-domain \(same task\), and Out\-of\-domain \(cross task\) configurations\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/transfer_auroc.png)

Figure 11: In\-domain AUROC against OOD setting for individual benchmarks with Qwen\-3\-8B\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/transfer_ece.png)

Figure 12: In\-domain ECE against OOD setting for individual benchmarks with Qwen\-3\-8B\.

![Refer to caption](https://arxiv.org/html/2606.27679v1/figures/trsize_indiv.png)

Figure 13: AUROC for individual datasets with varying training set size\.