Automatic Layer Selection for Hallucination Detection
Summary
This paper proposes automatic layer selection for hallucination detection in LLMs and introduces First Effective Peak of Intrinsic Dimension (FEPoID), a training-free criterion that consistently identifies optimal intermediate layers, outperforming existing heuristics.
# Automatic Layer Selection for Hallucination Detection
Source: [https://arxiv.org/html/2605.26366](https://arxiv.org/html/2605.26366)
###### Abstract
Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension, which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. First Effective Peak of Intrinsic Dimension is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at [https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git](https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git).
Machine Learning, ICML
## 1 Introduction
Figure 1 panels: (a) LLaMA-3.1-8B-Instruct; (b) Mistral-7B-Instruct-v0.3; (c) Performance boost with our layer-selection based strategies.
Figure 1: Hallucination detection performance under a unified experimental setting. For all experiments, we extract last-token representations from each layer and train an MLP classifier for hallucination detection. (Top): Layer-wise AUROC under oracle training, where the best-performing layer (starred) consistently lies in the intermediate layers. (Bottom): Mean AUROC averaged across datasets under different layer-selection strategies. FEPoID consistently outperforms the last-layer heuristic, and its combination with the truncation strategy yields further gains across models.
Hallucination detection is a critical challenge for deploying large language models (LLMs) in real-world applications, as LLMs can produce fluent yet factually incorrect or internally inconsistent outputs. Detecting such hallucinations without modifying or fine-tuning the underlying model is therefore an important practical problem (Huang et al., [2025](https://arxiv.org/html/2605.26366#bib.bib17); Farquhar et al., [2024](https://arxiv.org/html/2605.26366#bib.bib10); Li et al., [2024](https://arxiv.org/html/2605.26366#bib.bib28)). Prior work has approached hallucination detection using uncertainty-based estimates (Farquhar et al., [2024](https://arxiv.org/html/2605.26366#bib.bib10); Malinin & Gales, [2021](https://arxiv.org/html/2605.26366#bib.bib31)) or verbalized uncertainty (Xiong et al., [2024](https://arxiv.org/html/2605.26366#bib.bib43); Zhou et al., [2023](https://arxiv.org/html/2605.26366#bib.bib48)).
In contrast, recent studies show that hallucination\-related signals are more strongly encoded in the internal representations of LLMs than in their final outputs, motivating the use of hidden states for hallucination detection\(Orgad et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib32); Azaria & Mitchell,[2023](https://arxiv.org/html/2605.26366#bib.bib2); Chen et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib5); Yin et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib45); Ji et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib19)\)\.
However, most existing approaches either select a predetermined intermediate layer in a data- and task-agnostic manner, or fully evaluate each candidate layer, which is impractical due to the computational cost. To demonstrate why layer selection is challenging, we visualize in [Figure 1](https://arxiv.org/html/2605.26366#S1.F1) the best-performing layer for hallucination detection: it consistently lies in the intermediate layers, but its exact location varies substantially across datasets and model architectures. This variability motivates the central question of this paper: *can we design a practical and principled criterion that automatically identifies the most informative intermediate layer for hallucination detection?*
Throughout this work, we study this problem within the hidden-state probing framework, where the pretrained LLM is kept frozen and a lightweight multi-layer perceptron (MLP) is trained on representations extracted from a selected layer for hallucination detection. We first propose several hypotheses for why hallucination-related signals emerge in intermediate layers. Guided by these hypotheses, we systematically evaluate a diverse set of candidate layer-selection criteria, including information-theoretic, gradient-based, and geometric criteria, across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. The empirical results show that none of these criteria can efficiently and reliably identify high-performing layers.
Instead, inspired by our empirical observation of how intrinsic dimension evolves across layers, we propose a new selection criterion: the First Effective Peak of Intrinsic Dimension (FEPoID). Across models and datasets, we observe a recurring pattern in which the intrinsic dimension first peaks in the intermediate layers and later reaches another, often higher, peak near the output layers. We hypothesize that these two peaks reflect different forms of representational complexity: the earlier peak captures abstract semantic information that is especially relevant to hallucination detection, whereas the later peak primarily captures surface-level complexity, making it less informative for this task. This is further supported by our empirical results: selecting the first effective ID peak consistently identifies optimal or near-optimal layers, enabling FEPoID to outperform the aforementioned criteria and achieve stronger hallucination detection performance than existing baselines.
While choosing an appropriate layer is crucial for effective hidden\-state probing, performance in hallucination detection also critically depends on the*token position*used for representation extraction\. A common heuristic is to extract representations at the last generated token, motivated by the autoregressive property that this token can attend to the entire context\. However, recent studies have shown that last\-token representations are sensitive to noise introduced near the end of the generated sequence\(Springer et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib42); Lee et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib23)\)and often underperform on downstream tasks\(Orgad et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib32)\)\. This raises the second research question of our work:*can we identify a simple, supervision\-free rule that yields informative representations?*
We explore this question by evaluating extracted representations at the last token of the first generated sentence, identified using a simple, rule\-based First\-Sentence Truncation \(FST\)\. Through extensive experiments, we find that representations extracted at the end of the first sentence consistently yield stronger detection performance as shown in[Figure1](https://arxiv.org/html/2605.26366#S1.F1)\. This is motivated by our observation that the representations at the last generated tokens are often degraded by end\-of\-sequence noise arising from degenerate repetition, inconsistent continuation, and semantic drift\. In addition, FST consistently improves the performance of various hallucination detection baselines, indicating that its effectiveness is not tied to any specific modeling assumptions, but instead comes from systematically reducing noise introduced during late\-stage generation\.
In summary, our main contributions are threefold:
1. i) We provide the first systematic evaluation of criteria that have been shown in prior work to correlate with downstream performance, as well as criteria used for layer-selective fine-tuning, both of which remain underexplored for practical layer selection.
2. ii) We introduce First Effective Peak of Intrinsic Dimension, a simple and efficient criterion that automatically selects near-optimal intermediate layers across various datasets and pretrained models.
3. iii) We revisit token-position choices for hidden-state probing and show that extracting representations at the last token of the first generated sentence consistently outperforms the common last-token heuristic. Moreover, applying FST improves all hallucination detection baselines considered in this work, indicating a method-agnostic benefit from mitigating noise introduced during late-stage generation.
## 2 Related Work
#### Criteria for Intermediate Layer Selection
Skean et al. ([2025](https://arxiv.org/html/2605.26366#bib.bib40)) show that intermediate layers can encode rich information across various architectures and domains. Lee et al. ([2023](https://arxiv.org/html/2605.26366#bib.bib24)) study selective fine-tuning under distribution shifts and use supervised signals such as the relative gradient norm (RGN) and signal-to-noise ratio (SNR) to choose which layers to update; we repurpose these criteria for selecting probe layers and show that they perform poorly in our setting. Hosseini & Fedorenko ([2023](https://arxiv.org/html/2605.26366#bib.bib16)) introduce curvature to quantify layer-wise flattening of sentence embeddings, which has been shown to correlate with downstream performance (Skean et al., [2025](https://arxiv.org/html/2605.26366#bib.bib40)). Rao et al. ([2025](https://arxiv.org/html/2605.26366#bib.bib34)) show that the intrinsic dimension (ID) of learned geographic representations is positively correlated with downstream task performance, and can capture meaningful structural properties of the data. Cheng et al. ([2025](https://arxiv.org/html/2605.26366#bib.bib6)) show that layers near the maximum ID tend to be the first to transfer effectively to downstream tasks. However, these criteria have not been systematically studied for automatic layer selection and for hallucination detection, which is the focus of this work.
Table 1: Hypotheses and corresponding criteria.
#### Intermediate Layer Trials
Existing works that leverage pretrained model representations for downstream tasks commonly select a fixed intermediate \(often the middle\) layer\(Chen et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib5)\), or restrict evaluation to a predefined subset of layers \(e\.g\., middle layer and final layer\)\(Liu et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib30); Ahdritz et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib1); Ji et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib19)\)\. Some prior works explore layer selection by evaluating a broader but still limited set of candidates\. For example,Azaria & Mitchell \([2023](https://arxiv.org/html/2605.26366#bib.bib2)\)probe a small grid of layers \(e\.g\., layer 16, 20, 24, 28 and the last layer in a 32\-layer model\) and observe that certain intermediate layers perform best\. Similarly,Orgad et al\. \([2025](https://arxiv.org/html/2605.26366#bib.bib32)\)evaluate hallucination detection across a sparse set of layers \(layer 1, 6, 11, …, 31\) and find that layers in the middle range tend to be more informative\. While these approaches can outperform the last layer alone, they also highlight the lack of a practical and principled way to reliably identify strong intermediate layers\.
#### Token Position for Extraction
A widely used heuristic is to extract the hidden state of the*last generated token*, motivated by the autoregressive property that it attends to all preceding context, but this representation is sensitive to end\-of\-sequence noise\(Springer et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib42); Lee et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib23)\)\. An alternative extracts features from the last prompt token; however, the unidirectional nature of decoder\-only LLMs prevents this representation from reflecting correctness differences across sampled outputs\(Slobodkin et al\.,[2023](https://arxiv.org/html/2605.26366#bib.bib41)\)\. Another option is to average representations across token positions, but prior studies\(Li et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib26); Zhang et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib47)\)show that under causal attention, mean pooling is less effective than last\-token features, since earlier tokens cannot attend to future tokens\.
Orgad et al. ([2025](https://arxiv.org/html/2605.26366#bib.bib32)) evaluate multiple token positions, including the last generated token, the last prompt token, and the “exact answer” token, and find that representations at the exact-answer position achieve the best downstream performance within the hidden-state probing framework. This suggests that truthfulness-related information is more concentrated at answer-aligned token positions. However, locating the “exact answer” token requires the ground-truth answer as a reference, which is impractical since it is a chicken-and-egg problem. Also, for open-ended scenarios, identifying the “exact answer” token typically relies on auxiliary LLMs, incurring additional computational overhead.
## 3 Layer-Selection Criteria
We seek criteria that capture the relationship between intermediate\-layer representations and downstream task performance\. Our choices are guided by several hypotheses about why intermediate\-layer representations can outperform final\-layer representations:
1. (i) Intermediate layers encode rich and diverse semantic information that is beneficial for probing tasks.
2. (ii) Certain intermediate layers capture task-relevant features that facilitate effective probe training.
3. (iii) Intermediate layers compress redundant information while preserving task-relevant structure.
4. (iv) Intermediate layers exhibit meaningful statistical structure with high effective information capacity.
Motivated by these hypotheses, we introduce a set of layer\-selection criteria that form the basis for efficient and automatic layer selection and are organized into three families described below\. Table[1](https://arxiv.org/html/2605.26366#S2.T1)summarizes the correspondence between these hypotheses and the layer\-selection criteria\.
### 3.1 Notations
For an LLM with layers indexed by $\ell\in\{1,\dots,L\}$, let $\mathbf{H}^{(\ell)}_{i}\in\mathbb{R}^{T\times d}$ denote the token-wise representations at layer $\ell$ for the $i$-th input sample $x_{i}$, where $T$ is the number of tokens in the input and $d$ is the representation dimensionality. We denote by $\mathbf{z}^{(\ell)}_{t,i}=\mathbf{H}^{(\ell)}_{t,i}\in\mathbb{R}^{d}$ the representation of the $i$-th sample extracted at layer $\ell$ and token position $t$. Collecting representations across the dataset, we form the matrix $\mathbf{Z}^{(\ell)}_{t}\in\mathbb{R}^{N\times d}$, where each row corresponds to one sample’s representation at layer $\ell$ and position $t$, with $N$ denoting the number of samples. For notational simplicity, we omit the token-position index $t$ and specify the chosen token position in context when needed. For representation extraction, the input to the LLM consists of the concatenation of the prompt and its corresponding generated answer.
### 3.2 Information-Theoretic Criteria
Inspired by Cover’s theorem\(Cover,[1965](https://arxiv.org/html/2605.26366#bib.bib8)\), which suggests that higher\-rank representations are more likely to be linearly separable, we adopt RankMe\(Roy & Vetterli,[2007](https://arxiv.org/html/2605.26366#bib.bib37)\), which measures the rank of embeddings and has been shown to correlate strongly with downstream linear\-probing performance\(Garrido et al\.,[2023](https://arxiv.org/html/2605.26366#bib.bib12)\)\.
Formally, given an embedding matrix $\mathbf{Z}^{(\ell)}$, RankMe considers the singular values $\boldsymbol{\sigma}(\mathbf{Z}^{(\ell)})$ of $\mathbf{Z}^{(\ell)}$. The normalized spectral distribution is defined as

$$p_{k}=\frac{\sigma_{k}(\mathbf{Z}^{(\ell)})}{\lVert\boldsymbol{\sigma}(\mathbf{Z}^{(\ell)})\rVert_{1}+\varepsilon},$$

where $\varepsilon$ is a small constant for numerical stability. The RankMe score is then defined as

$$\mathrm{RankMe}(\mathbf{Z}^{(\ell)})=\exp\!\left(-\sum_{k=1}^{\min(N,d)}p_{k}\log p_{k}\right).$$

We select the intermediate layers with the highest RankMe value under hypothesis (i).
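The RankMe score defined above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `rankme` and the default `eps` value are assumptions, and zero singular values are dropped so that $0\log 0$ is treated as $0$.

```python
import numpy as np


def rankme(Z: np.ndarray, eps: float = 1e-7) -> float:
    """Exponential of the Shannon entropy of the normalized singular values.

    Z is an (N, d) matrix whose rows are per-sample representations at one layer.
    The name `rankme` and the default `eps` are assumptions of this sketch.
    """
    sigma = np.linalg.svd(Z, compute_uv=False)  # min(N, d) singular values
    p = sigma / (sigma.sum() + eps)             # normalized spectral distribution
    p = p[p > 0]                                # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))
```

Under hypothesis (i), one would evaluate this score on the representation matrix of every candidate layer and keep the layer with the largest value. As a sanity check, an identity matrix of size 4 has four equal singular values and a score close to 4, while a rank-one matrix scores close to 1.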
### 3.3 Gradient-Based Criteria
Gradient\-based criteria are directly aligned with hypothesis \(ii\) as they measure how well the representations from a given layer facilitate learning for the downstream task\.
Validation loss of the trained probe over a single training run serves as an intuitive and lightweight criterion for predicting the performance of the full training.
Relative gradient norm (RGN) measures the magnitude of the optimization signal relative to the scale of the model parameters. Let $\boldsymbol{\theta}$ denote the flattened parameters of the downstream model and $\mathbf{g}=\nabla_{\boldsymbol{\theta}}\mathcal{L}$ denote the corresponding gradient of the validation loss. RGN is computed as $\mathrm{RGN}=\lVert\mathbf{g}\rVert_{2}/\lVert\boldsymbol{\theta}\rVert_{2}$. It has been explored for selective fine-tuning in prior work (Lee et al., [2023](https://arxiv.org/html/2605.26366#bib.bib24)), where it was found to yield generally good results. We choose intermediate layers whose representations lead to larger gradient norms under the hypothesis that they carry information that leads to a more efficient learning process.
#### Signal\-to\-noise ratio \(SNR\)
characterizes the consistency of gradients across training examples. Let $\mathbf{g}_{ij}\in\mathbb{R}$ denote the $j$-th component of the gradient computed from the $i$-th datapoint on the validation set. We estimate SNR as

$$\mathrm{SNR}=\mathbb{E}_{i}\!\left[\frac{\mathrm{Avg}_{j}\!\left(\mathbf{g}_{ij}\right)^{2}}{\mathrm{Var}_{j}\!\left(\mathbf{g}_{ij}\right)}\right]\;\propto\;\mathbb{E}_{i}\!\left[\frac{\left(\sum_{j}\mathbf{g}_{ij}\right)^{2}}{\sum_{j}\mathbf{g}_{ij}^{2}}\right].$$

Previous work (Lee et al., [2023](https://arxiv.org/html/2605.26366#bib.bib24)) has explored the use of SNR in selective fine-tuning, though the results were not as promising as RGN. Following their strategy, we select intermediate layers with higher SNR values.
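The two gradient statistics in this subsection, RGN and SNR, reduce to simple arithmetic once the gradients are available. The sketch below assumes the gradients have already been flattened into plain Python lists. The function names, the `eps` guard, the use of the population variance, and the reading of $\mathrm{Avg}_{j}(\mathbf{g}_{ij})^{2}$ as the squared mean are all assumptions of this illustration.

```python
import math


def relative_gradient_norm(grad, params):
    """RGN = ||g||_2 / ||theta||_2 for flattened gradient and parameter vectors."""
    g_norm = math.sqrt(sum(x * x for x in grad))
    theta_norm = math.sqrt(sum(x * x for x in params))
    return g_norm / theta_norm


def gradient_snr(per_example_grads, eps=1e-12):
    """Mean over examples i of (mean_j g_ij)^2 / var_j g_ij.

    per_example_grads[i][j] is the j-th gradient component for the i-th
    validation example. Population variance and the eps guard are assumptions.
    """
    ratios = []
    for g in per_example_grads:
        mean = sum(g) / len(g)
        var = sum((x - mean) ** 2 for x in g) / len(g)
        ratios.append(mean ** 2 / (var + eps))
    return sum(ratios) / len(ratios)
```

For instance, a gradient `[3.0, 4.0]` against parameters `[0.0, 10.0]` gives an RGN of 0.5. In practice the per-example gradients would come from the probe's autograd framework, which is outside the scope of this sketch.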
### 3.4 Geometric Criteria
Geometric criteria describe how representations are organized in latent space and provide low\-cost proxies for characterizing compression and structure\.
Curvature measures the geometric complexity of token-level representation trajectories across layers and was first proposed by Hosseini & Fedorenko ([2023](https://arxiv.org/html/2605.26366#bib.bib16)). In that work, the authors observed that token embeddings tend to become progressively “flattened” during training, with intermediate layers exhibiting stronger flattening effects than the final layer. Subsequent studies (Skean et al., [2025](https://arxiv.org/html/2605.26366#bib.bib40)) further demonstrated that curvature is strongly correlated with downstream performance, making it a useful geometric signal aligned with hypothesis (iii).
To compute curvature, the hidden states at layer $\ell$ are treated as a discrete trajectory across token positions, where the difference between two adjacent states along this trajectory is given by $\mathbf{v}_{t,i}=\mathbf{H}^{(\ell)}_{t,i}-\mathbf{H}^{(\ell)}_{t-1,i},\quad t=2,\dots,T$. The turning angle between consecutive velocity vectors is

$$\kappa_{t,i}=\arccos\!\left(\frac{\langle\mathbf{v}_{t-1,i},\mathbf{v}_{t,i}\rangle}{\lVert\mathbf{v}_{t-1,i}\rVert_{2}\,\lVert\mathbf{v}_{t,i}\rVert_{2}}\right),\quad t=3,\dots,T.$$

The per-sample curvature is defined as the mean turning angle along the trajectory, that is,

$$\mathrm{Curv}^{(\ell)}(x_{i})=\frac{1}{T-2}\sum_{t=3}^{T}\kappa_{t,i}.$$

During layer selection with curvature, we choose the layer with the smallest curvature value.
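The per-sample curvature above can be illustrated with a short pure-Python sketch. It takes one sample's token trajectory at a fixed layer as a list of vectors. The function name `curvature` is hypothetical, and two choices are assumptions not stated in the text: the cosine is clipped to $[-1, 1]$ for numerical safety, and zero-length steps are skipped (which shrinks the averaging denominator below $T-2$ when they occur).

```python
import math


def curvature(H):
    """Mean turning angle (radians) along one sample's token trajectory.

    H is a list of T token vectors at a fixed layer. Skipping zero-length
    steps and clipping the cosine are assumptions of this sketch.
    """
    # Velocity vectors v_t = H_t - H_{t-1}, for t = 2..T
    v = [[b - a for a, b in zip(H[t - 1], H[t])] for t in range(1, len(H))]
    angles = []
    for prev, cur in zip(v, v[1:]):
        norm = math.hypot(*prev) * math.hypot(*cur)
        if norm == 0.0:
            continue  # degenerate step: the angle is undefined
        cos = sum(x * y for x, y in zip(prev, cur)) / norm
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    if not angles:
        raise ValueError("need at least three tokens with non-zero steps")
    return sum(angles) / len(angles)
```

A straight trajectory such as `[[0, 0], [1, 0], [2, 0], [3, 0]]` has curvature 0, whereas a path that turns by a right angle at every step has curvature $\pi/2$.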
Intrinsic dimension (ID) measures the minimum number of features required to accurately represent the embeddings’ underlying structure without significant loss of information (Bennett, [1969](https://arxiv.org/html/2605.26366#bib.bib4)), which is closely aligned with hypothesis (iv). A variety of ID estimators have been proposed, including maximum-likelihood-based methods (Levina & Bickel, [2004](https://arxiv.org/html/2605.26366#bib.bib25)), GeoMLE (Gomtsyan et al., [2019](https://arxiv.org/html/2605.26366#bib.bib14)), and TwoNN (Facco et al., [2017](https://arxiv.org/html/2605.26366#bib.bib9)). In this work, we adopt the TwoNN estimator due to its simplicity and scalability: it relies on the distances to the two nearest neighbors of each point and remains computationally efficient even for large datasets and high-dimensional embeddings.
Considering the dataset-level embedding matrix $\mathbf{Z}^{(\ell)}\in\mathbb{R}^{N\times d}$, let $r_{i,1}$ and $r_{i,2}$ denote the Euclidean distances from $\mathbf{z}^{(\ell)}_{i}$ to its first and second nearest neighbors among the rows of $\mathbf{Z}^{(\ell)}$. Following the TwoNN estimator, the distance ratio is defined as $\mu_{i}=r_{i,2}/r_{i,1}$, which follows a Pareto distribution with parameter $d_{\mathrm{ID}}+1$ on $[1,+\infty)$, with density $f(\mu_{i})=d_{\mathrm{ID}}\,\mu_{i}^{-(d_{\mathrm{ID}}+1)}$. Following Facco et al. ([2017](https://arxiv.org/html/2605.26366#bib.bib9)), the ID estimation can be reduced to a linear regression task. In practice, we use the TwoNN implementation from scikit-dimension (Bac et al., [2021](https://arxiv.org/html/2605.26366#bib.bib3)) and perform the k-nearest-neighbor search with Faiss-GPU (Johnson et al., [2019](https://arxiv.org/html/2605.26366#bib.bib21)).
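To make the regression step concrete: the Pareto density above implies the cumulative distribution $F(\mu)=1-\mu^{-d_{\mathrm{ID}}}$, so $-\log(1-F(\mu))=d_{\mathrm{ID}}\log\mu$, and $d_{\mathrm{ID}}$ is the slope of a line through the origin. The sketch below is a brute-force pure-Python illustration of that idea, not the scikit-dimension or Faiss code path used in the paper. The function name `twonn_id` and the `discard_fraction` parameter (dropping the largest ratios before fitting) are assumptions of this example.

```python
import math


def twonn_id(points, discard_fraction=0.1):
    """TwoNN-style intrinsic-dimension estimate via regression through the origin.

    points: list of equal-length coordinate lists (rows of Z).
    discard_fraction: share of the largest ratios dropped before the fit
    (an assumption of this sketch). Uses an O(N^2) neighbor search.
    """
    mus = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = math.dist(p, q)
            if dist < r1:
                r1, r2 = dist, r1
            elif dist < r2:
                r2 = dist
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            mus.append(r2 / r1)
    mus.sort()
    m = len(mus)
    keep = min(int(m * (1 - discard_fraction)), m - 1)
    # Empirical CDF F(mu_(k)) = k / m; the Pareto model gives -log(1 - F) = d * log(mu)
    xs = [math.log(mu) for mu in mus[:keep]]
    ys = [-math.log(1 - (k + 1) / m) for k in range(keep)]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

For points sampled uniformly from a two-dimensional square embedded in a five-dimensional space, the estimate should land near 2 regardless of the ambient dimension. For realistic dataset sizes, an approximate or GPU-based neighbor search would replace the double loop.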
We select the layer with the highest ID motivated by prior evidence that ID positively correlates with downstream performance\(Rao et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib34)\)and layers near the maximum ID are among the earliest to transfer effectively to downstream tasks\(Cheng et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib6)\)\.
### 3.5 A New Criterion: FEPoID
Through a closer examination of the evolution of the ID curves as shown in Figure[2](https://arxiv.org/html/2605.26366#S4.F2), we observe a consistent multimodal pattern across models and benchmarks: one peak emerges in the intermediate layers, while another can appear closer to the final layers and typically attains a higher magnitude\. Although high IDs generally correlate with information\-rich representations, they can arise for different underlying reasons\. Prior work\(Skean et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib40); Cheng et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib6)\)suggests that the intermediate layers play distinct roles in information processing, balancing the trade\-off between information preservation and abstraction\.
Motivated by this perspective, we hypothesize that the layer corresponding to the first peak predominantly captures the abstract semantic information, which is particularly relevant to hallucination detection. In contrast, the second peak, despite its higher magnitude, is due to the reintroduction of lexical or surface-level information as the model gets closer to predicting the next token. Based on this insight, we propose selecting layers with the First Effective Peak of Intrinsic Dimension (FEPoID). Specifically, we use a forward horizon parameter $w$ to validate candidate peaks. Rather than naively selecting the first local maximum of the ID curve, which can be unstable when representational capacity continues to grow in deeper layers, we filter out spurious early peaks that are followed by higher ID values within a limited look-ahead window. Formally, let $d_{\mathrm{ID}}^{(\ell)}$ denote the TwoNN estimate at layer $\ell$, and define the forward horizon $\mathcal{N}^{+}(\ell,w)=\{\ell+1,\dots,\min(\ell+w,L)\}$. We identify all local maxima of $\{d_{\mathrm{ID}}^{(\ell)}\}_{\ell=1}^{L}$ and scan them from shallow to deep. A candidate peak at layer $\ell$ is discarded if $d_{\mathrm{ID}}^{(\ell)}<d_{\mathrm{ID}}^{(\min(\ell+w,L))}$ and $d_{\mathrm{ID}}^{(\ell+1)}<d_{\mathrm{ID}}^{(\ell+2)}<\cdots<d_{\mathrm{ID}}^{(\min(\ell+w,L))}$, indicating that representational capacity continues to increase beyond $\ell$ within the horizon. We select the earliest remaining peak, defaulting to the shallowest if none survive.
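The selection rule just described can be sketched as a short function over the per-layer ID estimates. This is an illustration under stated assumptions rather than the released implementation: the function name `fepoid` is hypothetical, layers are 0-indexed, a local maximum is taken to be an interior layer strictly higher than both neighbors, and the fallback to the global maximum when the curve has no interior peak is an assumption that the text does not cover.

```python
def fepoid(ids, w=7):
    """Return the 0-based index of the first effective peak of an ID curve.

    ids: per-layer intrinsic-dimension estimates, shallow to deep.
    w: forward horizon used to validate candidate peaks.
    """
    L = len(ids)
    # Interior strict local maxima (an assumption about "local maxima")
    peaks = [l for l in range(1, L - 1) if ids[l - 1] < ids[l] > ids[l + 1]]
    if not peaks:
        # Assumption: with no interior peak, fall back to the global maximum
        return max(range(L), key=ids.__getitem__)
    for l in peaks:
        end = min(l + w, L - 1)
        horizon = ids[l + 1:end + 1]
        rising = all(a < b for a, b in zip(horizon, horizon[1:]))
        if ids[l] < ids[end] and rising:
            continue  # spurious peak: the ID keeps climbing within the horizon
        return l
    return peaks[0]  # no peak survived: default to the shallowest one
```

With `ids = [1, 2, 5, 4, 3, 4, 6, 8, 7]` and `w=3`, the function returns index 2: the early peak is kept even though a higher peak appears later, because the curve does not rise monotonically past it within the horizon. With `ids = [1, 3, 2, 4, 5, 6, 7, 9, 8, 7]` and `w=3`, the bump at index 1 is discarded and index 7 is returned.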
## 4 Experiments
Table 2: Hallucination detection performance (AUROC) across QA datasets with $w=7$. For representation-based methods, we extract representations at the last generated token. Top-3 results are marked (1), (2), (3) in place of the original color highlighting, with (1) indicating the best performance.

LLaMA-3.1-8B-Instruct

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiLoQA | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Pred. Entropy | 0.5833 | 0.5703 | 0.7103 | 0.6859 | 0.3604 | 0.5820 |
| LN-Pred. Entropy | 0.5781 | 0.5671 | 0.7087 | 0.6774 | 0.3555 | 0.5774 |
| Semantic Entropy | 0.5003 | 0.5518 | 0.4454 | 0.5505 | 0.6076 | 0.5311 |
| Lexical Similarity | 0.6780 | 0.5988 | 0.7294 | 0.6838 | 0.4082 | 0.6196 |
| LID | 0.5059 | 0.5281 | 0.5171 | 0.4989 | 0.5994 | 0.5299 |
| EigenScore | 0.5247 | 0.5300 | 0.5987 | 0.5882 | 0.5080 | 0.5499 |
| *Hidden-State Probing* | | | | | | |
| RankMe | 0.6598 (2) | 0.6071 | 0.7040 | 0.7114 | 0.7478 | 0.6860 |
| Curvature | 0.6323 (3) | 0.6183 (2) | 0.7413 (3) | 0.7366 | 0.7565 (3) | 0.6970 (3) |
| Val Loss | 0.6705 (1) | 0.6164 (3) | 0.7682 (2) | 0.7861 (1) | 0.7836 (2) | 0.7250 (2) |
| RGN | 0.5993 | 0.6130 | 0.7040 | 0.7859 (2) | 0.7373 | 0.6879 |
| SNR | 0.5240 | 0.5699 | 0.7009 | 0.5567 | 0.6624 | 0.6028 |
| ID | 0.6705 (1) | 0.6130 | 0.6932 | 0.7073 | 0.7373 | 0.6843 |
| FEPoID | 0.6705 (1) | 0.6377 (1) | 0.7807 (1) | 0.7516 (3) | 0.7862 (1) | 0.7253 (1) |

Mistral-7B-Instruct-v0.3

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiLoQA | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Pred. Entropy | 0.7200 | 0.7316 | 0.6303 | 0.6763 | 0.6110 | 0.6738 |
| LN-Pred. Entropy | 0.6528 | 0.6390 | 0.6589 | 0.6828 | 0.4418 | 0.6151 |
| Semantic Entropy | 0.5769 | 0.6409 | 0.6418 | 0.7353 | 0.6853 | 0.6560 |
| Lexical Similarity | 0.7071 | 0.7169 | 0.6946 | 0.7547 | 0.5781 | 0.6903 |
| LID | 0.5274 | 0.5688 | 0.5518 | 0.4970 | 0.6447 | 0.5579 |
| EigenScore | 0.7092 | 0.7508 | 0.6587 | 0.7273 | 0.7246 | 0.7141 |
| *Hidden-State Probing* | | | | | | |
| RankMe | 0.7743 | 0.7341 | 0.6909 | 0.6699 | 0.8277 | 0.7394 |
| Curvature | 0.8492 (1) | 0.8553 (2) | 0.7940 (3) | 0.8368 (2) | 0.9005 (2) | 0.8472 (2) |
| Val Loss | 0.8283 (3) | 0.8679 (1) | 0.7968 (2) | 0.8496 (1) | 0.8861 (3) | 0.8457 (3) |
| RGN | 0.7090 | 0.7553 (3) | 0.7493 | 0.7868 (3) | 0.8562 | 0.7713 |
| SNR | 0.7475 | 0.6870 | 0.7620 | 0.7008 | 0.8207 | 0.7436 |
| ID | 0.8283 (3) | 0.8679 (1) | 0.6993 | 0.7868 (3) | 0.8533 | 0.8071 |
| FEPoID | 0.8466 (2) | 0.8679 (1) | 0.7982 (1) | 0.8496 (1) | 0.9031 (1) | 0.8531 (1) |


Figure 2 panels: (a) CoQA; (b) SQuAD; (c) HotpotQA; (d) TriviaQA; (e) PsiLoQA.
Figure 2: Layer-wise AUROC and intrinsic dimension across QA datasets. Diamond markers indicate the layers selected by FEPoID, and star markers denote the oracle best-performing layers in terms of AUROC. Across datasets and models, FEPoID consistently selects layers that are close to the oracle optima, highlighting its robustness and reliability for practical layer selection.
In this section, we investigate whether First Effective Peak of Intrinsic Dimension can consistently select near-optimal layers for hallucination detection across datasets and model architectures, and how it compares to widely used baselines. In addition, we provide empirical evidence in Appendix [C](https://arxiv.org/html/2605.26366#A3) to support the semantic-information hypothesis that motivates FEPoID.
### 4.1 Experimental Setup
#### Dataset
We evaluate hallucination detection across two task types: question answering (QA) and summarization.
For QA, we conduct experiments on five widely used datasets: CoQA\(Reddy et al\.,[2019](https://arxiv.org/html/2605.26366#bib.bib35)\), SQuAD\(Rajpurkar et al\.,[2016](https://arxiv.org/html/2605.26366#bib.bib33)\), HotpotQA\(Yang et al\.,[2018](https://arxiv.org/html/2605.26366#bib.bib44)\), TriviaQA\(Joshi et al\.,[2017](https://arxiv.org/html/2605.26366#bib.bib22)\), and PsiLoQA\(Rykov et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib38)\)\. CoQA, SQuAD, and PsiLoQA are evaluated in a*context\-aware*setting, where the input prompt includes the supporting passage with the question\. HotpotQA and TriviaQA are evaluated in a*question\-only*setting\. FollowingJaniak et al\. \([2025](https://arxiv.org/html/2605.26366#bib.bib18)\); Farquhar et al\. \([2024](https://arxiv.org/html/2605.26366#bib.bib10)\), we sample 10 candidate answers per question at a temperature of 1\.0 to quantify generation uncertainty; these samples are used by baselines that rely on multiple sampled outputs\. In addition, we generate a single*best answer*using a temperature of 0\.1 for each question, which serves as a deterministic estimate for downstream performance evaluation\. To assess the layer selection criteria beyond QA tasks, we further evaluate on two summarization benchmarks: HaluEval\(Li et al\.,[2023](https://arxiv.org/html/2605.26366#bib.bib27)\)and CNN/Daily Mail \(CNN/DM\)\(See et al\.,[2017](https://arxiv.org/html/2605.26366#bib.bib39)\), in which the goal is to classify whether the LLM summarizes correctly\.
For each QA dataset, answers are generated autoregressively with a maximum generation length of 30 tokens\. For summarization datasets, the maximum generation length is set to 130\. Details on prompt templates and dataset construction are provided in Appendix[A](https://arxiv.org/html/2605.26366#A1)\.
#### Model
We conduct experiments across a diverse set of models varying in size \(1B–8B\), tuning strategy \(base vs\. instruction\-tuned\), and architecture\. For instruction\-tuned models, we experiment with LLaMA\-3\.1\-8B\-Instruct \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.26366#bib.bib15)\) \(abbreviated as LLaMA\-Instruct\) and Mistral\-7B\-Instruct\-v0\.3 \(Jiang et al\., [2023](https://arxiv.org/html/2605.26366#bib.bib20)\) \(abbreviated as Mistral\-Instruct\)\. To assess robustness across tuning strategies, we additionally evaluate on LLaMA\-3\.1\-8B \(base\)\. To further assess scalability, we also include LLaMA\-3\.2\-1B and LLaMA\-3\.2\-3B\.
#### Evaluation Setup
Hallucination detection is a binary classification task\. In QA tasks, we first compare each generated answer against the reference answer using exact string matching\. If there is an exact match, the answer is labeled as correct\. If not, we use an LLM\-as\-Judge to assign the label, following the prompt and procedure of Orgad et al\. \([2025](https://arxiv.org/html/2605.26366#bib.bib32)\)\. For summarization tasks, the ground\-truth labels indicating whether the LLM summarizes correctly are constructed via TrueTeacher \(Gekhman et al\., [2023](https://arxiv.org/html/2605.26366#bib.bib13)\)\. We report AUROC on the test split for each dataset\.
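The two\-stage labeling rule and the AUROC metric described above can be sketched as follows\. This is a minimal illustration, not the authors' evaluation code: the `judge` callable is a hypothetical stand\-in for the LLM\-as\-Judge, whitespace stripping before the exact match is an assumption, and the pairwise AUROC is a simple quadratic\-time formulation chosen for readability\.

```python
from typing import Callable, Sequence


def label_answer(generated: str, reference: str,
                 judge: Callable[[str, str], bool]) -> int:
    """Return 1 if the generated answer is judged correct, else 0."""
    # Stage 1: exact string match (whitespace stripping is an assumption).
    if generated.strip() == reference.strip():
        return 1
    # Stage 2: defer to a hypothetical LLM-as-Judge callable.
    return int(judge(generated, reference))


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5. Which class counts as positive is up to the caller.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 0\.5 corresponds to a detector that ranks positives and negatives at chance level, which is the reference point for the AUROC values reported in this section\.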
#### Hallucination Detection Baseline
While our main focus is the layer\-selection problem, we also compare hallucination detection performance against widely used baselines to contextualize our results\.
We first consider several uncertainty\-based baselines\. Predictive Entropy \(Pred\. Entropy\) and Length\-Normalized Predictive Entropy \(LN\-Pred\. Entropy\)\(Malinin & Gales,[2021](https://arxiv.org/html/2605.26366#bib.bib31)\)quantify uncertainty by measuring the variability of the model’s likelihood across multiple sampled generations\. We additionally include Semantic Entropy\(Farquhar et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib10)\), which estimates uncertainty at the semantic level by clustering sampled answers into equivalence classes and measuring their consistency\. Complementary to these uncertainty\-based approaches, we include Lexical Similarity\(Lin et al\.,[2024](https://arxiv.org/html/2605.26366#bib.bib29)\)as a surface\-form baseline, which measures token\-level overlap between the generated answer and the reference using ROUGE\-L\.
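As a rough illustration of the two entropy baselines, the sketch below uses a common Monte Carlo estimator: the negative mean sequence log\-likelihood over the sampled answers, optionally divided by sequence length\. The estimator form and the input layout \(one list of token log\-probabilities per sampled answer\) are assumptions made for this example; the baselines' exact implementation is not specified in this section\.

```python
from typing import Sequence


def predictive_entropy(token_logprobs: Sequence[Sequence[float]],
                       length_normalize: bool = False) -> float:
    """Monte Carlo entropy estimate from sampled generations.

    token_logprobs[i] holds the per-token log-probabilities of the i-th
    sampled answer (hypothetical input layout).
    """
    if not token_logprobs:
        raise ValueError("need at least one sampled generation")
    total = 0.0
    for logprobs in token_logprobs:
        seq_logprob = sum(logprobs)          # log p(answer | prompt)
        if length_normalize:
            # Length normalization removes the bias against long answers.
            seq_logprob /= max(len(logprobs), 1)
        total += seq_logprob
    # Entropy estimate: negative average log-likelihood over samples.
    return -total / len(token_logprobs)
```

With 10 samples per question, as in the setup above, `token_logprobs` would contain 10 inner lists\. Higher values indicate less confident generations\.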
Finally, we evaluate representation\-based baselines, including EigenScore \(Chen et al\., [2024](https://arxiv.org/html/2605.26366#bib.bib5)\) and Local Intrinsic Dimension \(LID\) \(Yin et al\., [2024](https://arxiv.org/html/2605.26366#bib.bib45)\)\. EigenScore assesses representation quality via the spectral properties of hidden\-state covariance and is applied to middle\-layer representations\. LID estimates local intrinsic dimensionality under the hypothesis that truthful outputs exhibit more structured representations; following Yin et al\. \([2024](https://arxiv.org/html/2605.26366#bib.bib45)\), we probe the layer immediately after the maximum\-LID layer\.
#### Hidden\-State Probing Setup
We train a lightweight MLP at each layer using a fixed token position \(specified in later ablations\) and select the checkpoint with the lowest validation loss\. Curvature, RankMe, and ID are computed on both the training and validation sets, while validation loss, RGN, and SNR are computed on the validation set only\.
### 4\.2Empirical Results of First Effective Peak of Intrinsic Dimension
#### QA Task
[Table2](https://arxiv.org/html/2605.26366#S4.T2)reports AUROC across five QA datasets and two instruction\-tuned LLMs\. For both probe training and layer\-selection criteria, we use the hidden state at the last generated token\. Overall, combining hidden\-state probing with First Effective Peak of Intrinsic Dimension yields the strongest performance among all baselines and selection criteria\. This result indicates that First Effective Peak of Intrinsic Dimension more accurately and stably identifies layers whose representations support highly discriminative probes compared to alternative layer\-selection strategies\.
Figure 3: A comparison of generation behaviors in LLaMA\-Instruct and Mistral\-Instruct without FST\. Specifically, \(a\) shows an internally inconsistent continuation where LlaMA\-3\.1\-8B\-Instruct contradicts its initial answer, \(b\) demonstrates semantic drift in which the generation deviates from the question focus, and \(c\) highlights degenerate repetition with redundant restatement of the same information\. In contrast, Mistral\-Instruct consistently produces concise and well\-terminated responses\.

Table 3: Results \(AUROC\) on summarization tasks with $w=7$, without FST\. FEPoID outperforms all other criteria, demonstrating its effectiveness beyond QA tasks\. Notably, Val Loss fails to rank among the top\-2 criteria on all settings\. The LLaMA and Mistral columns refer to LlaMA\-3\.1\-8B\-Instruct and Mistral\-7B\-Instruct\-v0\.3, respectively; parenthesized numbers mark the first\-, second\-, and third\-ranked entries in each column\.

| Criterion | LLaMA: HaluEval | LLaMA: CNN/DM | LLaMA: Avg | Mistral: HaluEval | Mistral: CNN/DM | Mistral: Avg |
| --- | --- | --- | --- | --- | --- | --- |
| RankMe | 0\.6075 \(2\) | 0\.5774 | 0\.5924 \(3\) | 0\.7149 | 0\.6869 | 0\.7009 |
| Curvature | 0\.5494 | 0\.5922 \(2\) | 0\.5708 | 0\.7498 \(3\) | 0\.7319 \(2\) | 0\.7409 \(2\) |
| Val Loss | 0\.5961 \(3\) | 0\.5859 | 0\.5910 | 0\.7294 | 0\.6938 | 0\.7116 |
| RGN | 0\.5713 | 0\.5821 | 0\.5767 | 0\.7563 \(2\) | 0\.7031 | 0\.7297 |
| SNR | 0\.5528 | 0\.5474 | 0\.5501 | 0\.7385 | 0\.6811 | 0\.7098 |
| ID | 0\.6075 \(2\) | 0\.5918 \(3\) | 0\.5997 \(2\) | 0\.7498 \(3\) | 0\.7185 \(3\) | 0\.7342 \(3\) |
| FEPoID | 0\.6165 \(1\) | 0\.5995 \(1\) | 0\.6080 \(1\) | 0\.7808 \(1\) | 0\.7614 \(1\) | 0\.7711 \(1\) |
As shown in[Table2](https://arxiv.org/html/2605.26366#S4.T2), different selection rules applied to the hidden\-state probing framework can lead to markedly different outcomes across models and datasets\. In particular, suboptimal layer selection may significantly degrade performance: for example, when using SNR to select layers for LlaMA\-3\.1\-8B\-Instruct, the average resulting AUROC is lower than that of simple baselines such as lexical similarity\. In addition,[Figure4](https://arxiv.org/html/2605.26366#S4.F4)further quantifies the discrepancy between the layer selected by each method and the best\-performing layer for LID, EigenScore, and hidden\-state probing, all of which are representation\-based approaches\. We observe that LID and EigenScore incur relatively large AUROC gaps, indicating that their layer\-selection strategies fail to reliably identify informative layers\. In contrast, layers selected by First Effective Peak of Intrinsic Dimension in the hidden\-state probing framework yield substantially smaller gaps, remaining close to the best\-performing layers across models and datasets\. These results further demonstrate that effective selection criteria are crucial for fully realizing the benefits of representation\-based probing\.
Figure 4: AUROC gap between the layer selected by each method and the oracle best\-performing layer\. LLaMA\-Instruct and Mistral\-Instruct denote model\-specific averages over datasets, while Avg further averages across all models and datasets\. For hidden\-state probing, layers are selected by First Effective Peak of Intrinsic Dimension\.

Notably, selecting the layer with the maximum ID does not consistently yield optimal performance, as illustrated in [Figure 2](https://arxiv.org/html/2605.26366#S4.F2)\. In datasets such as HotpotQA, TriviaQA, and PsiLoQA, the maximal ID often appears in very late layers, where representations become overly complex or redundant for hallucination detection, whereas the first effective peak stays close to the best\-performing layer\.
#### Summarization Tasks
We further extend the hallucination detection experiments to summarization tasks\. As shown in [Table 3](https://arxiv.org/html/2605.26366#S4.T3), FEPoID achieves the best performance across all datasets on both models, demonstrating its robustness beyond QA tasks\. Notably, Val Loss consistently fails to rank among the top\-3 criteria in terms of average AUROC on both models, suggesting that validation\-based criteria are less reliable in summarization tasks\.
#### Sensitivity to Hyperparameters
[Figure 7](https://arxiv.org/html/2605.26366#A2.F7) presents an ablation study on the forward horizon size $w$ used in First Effective Peak of Intrinsic Dimension\. Performance remains stable across a wide range of $w$ for all datasets and models, indicating that the method is robust to the choice of $w$\.
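To make the role of the forward horizon concrete, the sketch below encodes one plausible reading of a "first effective peak" rule\. The formal definition of FEPoID is given elsewhere in the paper and is not reproduced in this section, so everything here is an assumption: a layer counts as an effective peak if its ID rises above the previous layer and is not exceeded by any of the next $w$ layers, the horizon is simply truncated near the last layer, and the global maximum is used as a fallback\. The function name is hypothetical\.

```python
from typing import Sequence


def first_effective_peak(ids: Sequence[float], w: int = 7) -> int:
    """Return the index of the first layer whose ID is a peak over w layers.

    Assumed rule (not the paper's verified definition): scanning from
    shallow to deep, pick the first layer l with ids[l] > ids[l - 1]
    and ids[l] >= every value in ids[l + 1 : l + 1 + w].
    """
    if not ids:
        raise ValueError("ids must be non-empty")
    n = len(ids)
    for layer in range(1, n):
        if ids[layer] <= ids[layer - 1]:
            continue  # ID is not rising into this layer
        horizon = ids[layer + 1 : layer + 1 + w]  # truncated near the end
        if all(ids[layer] >= value for value in horizon):
            return layer
    # Fallback (assumption): no rising peak found, use the global maximum.
    return max(range(n), key=lambda k: ids[k])
```

Under this reading, a small $w$ can stop at a minor bump, while a large $w$ approaches plain arg\-max selection, which is one way to interpret why the horizon is worth ablating\.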
#### Time Efficiency of FEPoID
We report the computation time \(in seconds\) of each criterion on LLaMA\-3\.1\-8B\-Instruct in [Table 4](https://arxiv.org/html/2605.26366#S4.T4)\. Note that the time for Val Loss includes both MLP training and validation loss computation\. FEPoID \(equivalently, ID\) achieves the lowest computation time across all benchmarks, demonstrating the computational *efficiency* of FEPoID\.
Table 4:Computation time \(in seconds\) of each criterion on LlaMA\-3\.1\-8B\-Instruct, measured as the total time across all 32 layers\. FEPoID and ID require significantly less computation time than all other criteria\.
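For intuition on why an ID\-based criterion is cheap, the sketch below implements one standard nearest\-neighbor ID estimator, TwoNN \(Facco et al\., 2017\), in its maximum\-likelihood form\. Which estimator the paper actually uses is not stated in this section, so the choice of TwoNN is an assumption for illustration only\. The brute\-force neighbor search is written for clarity; a practical implementation would use an approximate or GPU nearest\-neighbor library\.

```python
import math
from typing import Sequence


def twonn_id(points: Sequence[Sequence[float]]) -> float:
    """TwoNN intrinsic-dimension estimate (maximum-likelihood form).

    For each point, mu = r2 / r1 is the ratio of its second to first
    nearest-neighbor distance; the estimate is N / sum(log mu).
    """
    if len(points) < 3:
        raise ValueError("need at least three points")
    log_mu_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        # Brute-force distances to all other points (O(N^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 == 0.0:
            continue  # skip exact duplicates, where the ratio is undefined
        log_mu_sum += math.log(r2 / r1)
        used += 1
    if used == 0 or log_mu_sum == 0.0:
        raise ValueError("degenerate point set")
    return used / log_mu_sum
```

The estimator needs only pairwise distances between hidden states at a given layer, with no training loop, which is consistent with the low computation time reported for ID and FEPoID\.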
#### Generalization Across Scales and Tuning Strategies
We also evaluate FEPoID on a base \(non\-instruction\-tuned\) model and on smaller models\. As shown in [Tables 11](https://arxiv.org/html/2605.26366#A4.T11) and [12](https://arxiv.org/html/2605.26366#A4.T12), FEPoID achieves the highest AUROC on 4 out of 5 datasets and the highest average AUROC on LLaMA\-3\.1\-8B \(base\), and consistently ranks among the top\-performing criteria on LLaMA\-3\.2\-3B and LLaMA\-3\.2\-1B, demonstrating its generalization beyond instruction\-tuned settings and across varying model scales\.
## 5Which Token Position Should We Probe?
In this section, we study which token position should be probed for hidden\-state probing\. We show that representations at the last generated token are often degraded by end\-of\-sequence noise, and propose first\-sentence truncation \(FST\) as a simple, supervision\-free remedy\. Our experiments show that FST yields more discriminative class structure in the extracted representations, and consistently improves all hallucination detection methods considered\.
### 5\.1First\-Sentence Truncation
Beyond layer selection, the effectiveness of hidden\-state probing also critically depends on the *token position* $t$ at which the representation $\mathbf{z}^{(\ell)}_{t,i}$ is extracted\. In practice, a common heuristic extracts representations at the last generated token, i\.e\., $t=T$, motivated by the autoregressive property that this token can attend to the entire preceding context\. However, as illustrated in [Figure 3](https://arxiv.org/html/2605.26366#S4.F3), the representations extracted at the final token position are frequently corrupted by end\-of\-sequence noise, arising from inconsistent continuations, semantic drift, and degenerate repetition\.
Prior work\(Orgad et al\.,[2025](https://arxiv.org/html/2605.26366#bib.bib32)\)demonstrates that probing hidden states aligned with the “exact answer” tokens \(i\.e\., the token in the generated output that directly matches the ground\-truth answer\) yields substantially stronger performance\. However, identifying such tokens requires access to ground\-truth answers and often an auxiliary LLM, which is impractical in real\-world applications\.
Motivated by the empirical observation that LLMs often state the answer early in the generation \(typically within the first sentence\) and by prior findings that the truthfulness information is concentrated at answer\-aligned tokens \(Orgad et al\., [2025](https://arxiv.org/html/2605.26366#bib.bib32)\), we extract the representations at the last token of the first generated sentence as a lightweight approximation to answer\-aligned representations\. Specifically, we set $t$ to the index of the last token of the first generated sentence for each sample $x_i$, and extract $\mathbf{z}^{(\ell)}_{t,i}$ accordingly\. Compared to setting $t=T$, this choice is less susceptible to end\-of\-sequence noise\. In addition, unlike extracting representations at the exact\-answer positions, this strategy requires neither access to ground\-truth answers nor auxiliary LLMs, making it a practical and efficient alternative for real\-world applications\.
To identify sentence boundaries, we employ a lightweight, rule\-based scanner to perform first\-sentence truncation \(FST\)\. Implementation details are in Appendix [A](https://arxiv.org/html/2605.26366#A1.SS0.SSS0.Px4)\.
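A rule\-based scanner of this kind might look like the sketch below\. The actual rules are described in the appendix and are not reproduced here, so the terminator set, the abbreviation list, the newline rule, and both function names are illustrative assumptions\. The second helper maps the character cut point to a token index $t$, assuming per\-token character spans are available from the tokenizer\.

```python
from typing import Sequence, Tuple

SENTENCE_END = ".!?"
CLOSERS = "\"')]"
# Hypothetical abbreviation list; a real scanner would likely need more.
ABBREVIATIONS = ("e.g.", "i.e.", "Mr.", "Mrs.", "Dr.", "vs.")


def first_sentence_end(text: str,
                       abbreviations: Tuple[str, ...] = ABBREVIATIONS) -> int:
    """Return the character offset just past the first sentence."""
    n = len(text)
    for i, ch in enumerate(text):
        # A newline after some content is treated as a sentence boundary.
        if ch == "\n" and text[:i].strip():
            return len(text[:i].rstrip())
        if ch in SENTENCE_END:
            j = i + 1
            while j < n and text[j] in CLOSERS:  # absorb closing quotes/brackets
                j += 1
            at_boundary = j == n or text[j].isspace()  # rejects "3.5"
            is_abbrev = text[: i + 1].endswith(abbreviations)
            if at_boundary and not is_abbrev:
                return j
    return n  # no terminator found: keep the whole generation


def last_token_index(offsets: Sequence[Tuple[int, int]], cut: int) -> int:
    """Index of the last token whose character span ends at or before cut.

    Returns -1 if no token fits entirely before the cut.
    """
    index = -1
    for k, (_, end) in enumerate(offsets):
        if end <= cut:
            index = k
    return index
```

For example, for the generation "The answer is Paris\. However, it could be Lyon\.", the cut falls right after "Paris\.", and the hidden state would be read at the token that ends there rather than at the final token of the full generation\.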
### 5\.2Empirical Evaluation of FST
Figure 5: AUROC improvements obtained by applying FST relative to the “last generated token” heuristic for each method, averaged over datasets\. The layers for the hidden\-state probing framework are selected by First Effective Peak of Intrinsic Dimension with $w=7$\.

We compare two token positions for feature extraction: \(i\) the last token of the generated sequence and \(ii\) the last token of the first generated sentence, identified using FST\.
After applying FST to each candidate answer and each best answer, we extract the hidden states at the last token of the truncated answers for LID, EigenScore, and hidden\-state probing\. For uncertainty\-based baselines and lexical similarity, we evaluate them on the first\-sentence\-truncated candidate answers to ensure consistency across methods\.
[Figure5](https://arxiv.org/html/2605.26366#S5.F5)shows that FST delivers consistent AUROC gains across all methods\. These results show that FST benefits multiple levels of information used for hallucination detection, including representation\-level signals, likelihood\-based signals \(uncertainty\-based baselines\), and surface\-form signals \(Lexical Similarity\)\. By truncating generations at the end of the first sentence, FST removes interfering effects introduced by later\-stage generation, thereby stabilizing diverse signals in a method\-agnostic manner rather than benefiting any single assumption or criterion\.
Notably, as shown in [Figure 5](https://arxiv.org/html/2605.26366#S5.F5), FST yields substantially larger improvements in hallucination detection for LlaMA\-3\.1\-8B\-Instruct than for Mistral\-7B\-Instruct\-v0\.3\. This discrepancy is largely driven by differences in generation behavior: Mistral\-7B\-Instruct\-v0\.3 typically emits an <eos\> token shortly after completing the first sentence, whereas LlaMA\-3\.1\-8B\-Instruct often continues generating tokens until reaching the configured maximum length\. As a result, representations extracted at the final generated token in LlaMA\-3\.1\-8B\-Instruct are more likely to be contaminated by noisy continuations\. We show three specific examples in [Figure 3](https://arxiv.org/html/2605.26366#S4.F3), where the extra continuation in LlaMA\-3\.1\-8B\-Instruct introduces end\-of\-sequence noise in several recurring forms: \(a\) *inconsistent continuation*, where the model initially states an answer but later produces a conflicting one; \(b\) *semantic drift*, where the continuation shifts into content unrelated to the question \(e\.g\., adding irrelevant details after answering\); and \(c\) *degenerate repetition*, where the model repeats phrases without adding information\. These noisy continuations corroborate prior evidence that last\-token representations extracted from the entire generated sequence are degraded by end\-of\-sequence noise and therefore tend to perform poorly on downstream tasks \(Springer et al\., [2025](https://arxiv.org/html/2605.26366#bib.bib42); Lee et al\., [2025](https://arxiv.org/html/2605.26366#bib.bib23)\)\.
We report the detection results with FST in[Table9](https://arxiv.org/html/2605.26366#A2.T9)and the layer\-wise AUROC and ID in[Figure6](https://arxiv.org/html/2605.26366#A2.F6)\. The overall trends closely mirror those observed in[Table2](https://arxiv.org/html/2605.26366#S4.T2), with First Effective Peak of Intrinsic Dimension consistently achieving the strongest performance across models\.
#### Sensitivity to Hyperparameters
When FST is applied, First Effective Peak of Intrinsic Dimension remains robust to the choice of $w$ \([Figure 7](https://arxiv.org/html/2605.26366#A2.F7)\)\. The only exception is TriviaQA, where a larger $w$ leads to visibly stronger performance\.
#### Generalizability of FST across Models
We further apply FST to base and other\-scale models, and report the results in [Tables 12](https://arxiv.org/html/2605.26366#A4.T12) and [11](https://arxiv.org/html/2605.26366#A4.T11)\. Compared with the evaluation without FST in [Tables 8](https://arxiv.org/html/2605.26366#A1.T8) and [7](https://arxiv.org/html/2605.26366#A1.T7), FST consistently enhances hallucination detection performance, demonstrating its generalizability across model scales and tuning strategies\.
Table 5: Fisher Separation and Silhouette Score with and without FST on LLaMA\-3\.1\-8B\-Instruct\. FST consistently improves both metrics across datasets, indicating that FST yields more separable representations for hallucination detection \(↑: higher is better\)\.
### 5\.3Empirical Analysis of FST
Notably, after applying FST, the ID of representations extracted from the selected layers does not change substantially \(as shown in[Figures2](https://arxiv.org/html/2605.26366#S4.F2)and[6](https://arxiv.org/html/2605.26366#A2.F6)\), while hallucination detection performance improves markedly\.
To further investigate why FST helps, we measure class separability of the representations at the selected layers via Fisher Separation\(Fisher,[1936](https://arxiv.org/html/2605.26366#bib.bib11)\)\(between\- vs\. within\-class variance\) and Silhouette Score\(Rousseeuw,[1987](https://arxiv.org/html/2605.26366#bib.bib36)\)\(per\-sample cohesion/separation\), capturing both global and local aspects of class separability\. As shown in[Table5](https://arxiv.org/html/2605.26366#S5.T5), FST consistently improves both metrics across datasets\. These findings suggest that, despite residing in spaces of similar intrinsic dimensionality, representations extracted with FST exhibit cleaner and more discriminative class structure, making them more conducive to hallucination detection\.
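The two separability metrics can be illustrated with the following sketch\. It is an assumption\-laden example rather than the paper's implementation: Fisher Separation is scalarized here as the ratio of between\-class to within\-class scatter traces, which is one common choice, and the Silhouette Score uses Euclidean distance with class labels \(correct vs\. hallucinated\) as cluster assignments\.

```python
import math
from typing import List, Sequence


def _mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def _sqdist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fisher_separation(X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Between-class scatter divided by within-class scatter (trace form)."""
    overall = _mean(X)
    between = within = 0.0
    for c in set(y):
        members = [x for x, label in zip(X, y) if label == c]
        mu = _mean(members)
        between += len(members) * _sqdist(mu, overall)
        within += sum(_sqdist(x, mu) for x in members)
    if within == 0.0:
        raise ValueError("within-class scatter is zero")
    return between / within


def silhouette_score(X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Mean of (b - a) / max(a, b) over samples, with labels as clusters."""
    scores = []
    for i, (xi, ci) in enumerate(zip(X, y)):
        by_class = {}
        for j, (xj, cj) in enumerate(zip(X, y)):
            if i != j:
                by_class.setdefault(cj, []).append(math.dist(xi, xj))
        own = by_class.pop(ci, None)
        if own is None or not by_class:
            scores.append(0.0)  # singleton class or single-class data
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_class.values())    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Both metrics grow when same\-class representations tighten or different\-class representations move apart, so an increase after FST is consistent with cleaner class structure even when the intrinsic dimension stays roughly the same\.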
## 6Conclusion and Future Directions
This work studies the problem of automatic layer selection for hallucination detection under the hidden\-state probing framework\. We conduct a systematic evaluation of a diverse set of layer\-selection criteria, spanning information\-theoretic, geometric, and gradient perspectives\. Our results show that criteria previously shown to correlate with downstream performance or to support selective fine\-tuning do not reliably yield effective layer\-selection results in this setting\. This highlights the gap between representation\-level analysis and practical layer\-selection strategies\.
Based on empirical regularities in ID trajectories across layers, we propose FEPoID, a lightweight and training\-free criterion that selects the first effective peak of ID\. Evaluated on hallucination detection benchmarks spanning both QA and summarization tasks, across model architectures, scales, and tuning strategies, FEPoID consistently selects high\-performing layers and outperforms all baselines, demonstrating its robustness and broad applicability\.
We further revisit token\-position choices in decoder\-only LLMs and show that representations extracted at the last generated token are often degraded by end\-of\-sequence noise\. To address this, we propose to probe at the last token of the first generated sentence via a simple, rule\-based truncation, which yields consistent performance gains across all hallucination detection baselines considered\. We further analyze why FST works, showing that, despite residing in spaces of similar intrinsic dimensionality, representations extracted with FST exhibit cleaner and more discriminative class structure, as evidenced by consistent improvements in Fisher Separation and Silhouette Score across datasets\.
Together, FEPoID and FST offer a practical, supervision\-free solution for robust representation extraction, requiring neither model fine\-tuning nor exhaustive layer\-wise validation\. Future work may extend these findings to tasks that require abstract information processing and to different data modalities, and further investigate the theoretical relationship between ID dynamics and task\-relevant representations\.
## Impact Statement
This paper presents work whose goal is to increase understanding of deep learning, which may lead to advancements in the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.
## Acknowledgments
We would like to thank Daniel Bai and Fan Yin for their helpful discussions\. This work is funded in part by NSF CAREER IIS\-2145492 and DARPA AIQ HR00112590066\.
## References
- Ahdritz et al\. \(2024\)Ahdritz, G\., Qin, T\., Vyas, N\., Barak, B\., and Edelman, B\. L\.Distinguishing the knowable from the unknowable with language models\.In*Proceedings of the 41st International Conference on Machine Learning*, ICML’24\. JMLR\.org, 2024\.
- Azaria & Mitchell \(2023\)Azaria, A\. and Mitchell, T\.The internal state of an llm knows when it’s lying\.In*Findings of the Association for Computational Linguistics: EMNLP 2023*, pp\. 967–976, 2023\.
- Bac et al\. \(2021\)Bac, J\., Mirkes, E\. M\., Gorban, A\. N\., Tyukin, I\., and Zinovyev, A\.Scikit\-dimension: a python package for intrinsic dimension estimation\.*Entropy*, 23\(10\):1368, 2021\.
- Bennett \(1969\)Bennett, R\.The intrinsic dimensionality of signal collections\.*IEEE Transactions on Information Theory*, 15\(5\):517–525, 1969\.
- Chen et al\. \(2024\)Chen, C\., Liu, K\., Chen, Z\., Gu, Y\., Wu, Y\., Tao, M\., Fu, Z\., and Ye, J\.INSIDE: LLMs’ internal states retain the power of hallucination detection\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Cheng et al\. \(2025\)Cheng, E\., Doimo, D\., Kervadec, C\., Macocco, I\., Yu, L\., Laio, A\., and Baroni, M\.Emergence of a high\-dimensional abstraction phase in language transformers\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Conneau et al\. \(2018\)Conneau, A\., Kruszewski, G\., Lample, G\., Barrault, L\., and Baroni, M\.What you can cram into a single $&\!\#\* vector: Probing sentence embeddings for linguistic properties\.In*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 2126–2136, 2018\.
- Cover \(1965\)Cover, T\. M\.Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition\.*IEEE transactions on electronic computers*, \(3\):326–334, 1965\.
- Facco et al\. \(2017\)Facco, E\., d’Errico, M\., Rodriguez, A\., and Laio, A\.Estimating the intrinsic dimension of datasets by a minimal neighborhood information\.*Scientific reports*, 7\(1\):12140, 2017\.
- Farquhar et al\. \(2024\)Farquhar, S\., Kossen, J\., Kuhn, L\., and Gal, Y\.Detecting hallucinations in large language models using semantic entropy\.*Nature*, 630\(8017\):625–630, 2024\.
- Fisher \(1936\)Fisher, R\. A\.The use of multiple measurements in taxonomic problems\.*Annals of eugenics*, 7\(2\):179–188, 1936\.
- Garrido et al\. \(2023\)Garrido, Q\., Balestriero, R\., Najman, L\., and Lecun, Y\.Rankme: Assessing the downstream performance of pretrained self\-supervised representations by their rank\.In*International conference on machine learning*, pp\. 10929–10974\. PMLR, 2023\.
- Gekhman et al\. \(2023\)Gekhman, Z\., Herzig, J\., Aharoni, R\., Elkind, C\., and Szpektor, I\.Trueteacher: Learning factual consistency evaluation with large language models\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pp\. 2053–2070, 2023\.
- Gomtsyan et al\. \(2019\)Gomtsyan, M\., Mokrov, N\., Panov, M\., and Yanovich, Y\.Geometry\-aware maximum likelihood estimation of intrinsic dimension\.In*Asian Conference on Machine Learning*, pp\. 1126–1141\. PMLR, 2019\.
- Grattafiori et al\. \(2024\)Grattafiori, A\., Dubey, A\., Jauhri, A\., Pandey, A\., Kadian, A\., Al\-Dahle, A\., Letman, A\., Mathur, A\., Schelten, A\., Vaughan, A\., et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.
- Hosseini & Fedorenko \(2023\)Hosseini, E\. and Fedorenko, E\.Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language\.*Advances in Neural Information Processing Systems*, 36:43918–43930, 2023\.
- Huang et al\. \(2025\)Huang, L\., Yu, W\., Ma, W\., Zhong, W\., Feng, Z\., Wang, H\., Chen, Q\., Peng, W\., Feng, X\., Qin, B\., et al\.A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions\.*ACM Transactions on Information Systems*, 43\(2\):1–55, 2025\.
- Janiak et al\. \(2025\)Janiak, D\., Binkowski, J\., Sawczyn, A\., Gabrys, B\., Shwartz\-Ziv, R\., and Kajdanowicz, T\. J\.The illusion of progress: Re\-evaluating hallucination detection in LLMs\.In Christodoulopoulos, C\., Chakraborty, T\., Rose, C\., and Peng, V\. \(eds\.\),*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pp\. 34728–34745, Suzhou, China, November 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-332\-6\.doi:10\.18653/v1/2025\.emnlp\-main\.1761\.
- Ji et al\. \(2024\)Ji, Z\., Chen, D\., Ishii, E\., Cahyawijaya, S\., Bang, Y\., Wilie, B\., and Fung, P\.LLM internal states reveal hallucination risk faced with a query\.In Belinkov, Y\., Kim, N\., Jumelet, J\., Mohebbi, H\., Mueller, A\., and Chen, H\. \(eds\.\),*Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP*, pp\. 88–104, Miami, Florida, US, November 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.blackboxnlp\-1\.6\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Sablayrolles, A\., Mensch, A\., Bamford, C\., Chaplot, D\. S\., de las Casas, D\., Bressand, F\., Lengyel, G\., Lample, G\., Saulnier, L\., Lavaud, L\. R\., Lachaux, M\.\-A\., Stock, P\., Scao, T\. L\., Lavril, T\., Wang, T\., Lacroix, T\., and Sayed, W\. E\.Mistral 7b, 2023\.
- Johnson et al\. \(2019\)Johnson, J\., Douze, M\., and Jégou, H\.Billion\-scale similarity search with GPUs\.*IEEE Transactions on Big Data*, 7\(3\):535–547, 2019\.
- Joshi et al\. \(2017\)Joshi, M\., Choi, E\., Weld, D\., and Zettlemoyer, L\.TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension\.In Barzilay, R\. and Kan, M\.\-Y\. \(eds\.\),*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 1601–1611, Vancouver, Canada, July 2017\. Association for Computational Linguistics\.doi:10\.18653/v1/P17\-1147\.
- Lee et al\. \(2025\)Lee, C\., Roy, R\., Xu, M\., Raiman, J\., Shoeybi, M\., Catanzaro, B\., and Ping, W\.NV\-embed: Improved techniques for training LLMs as generalist embedding models\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Lee et al\. \(2023\)Lee, Y\., Chen, A\. S\., Tajwar, F\., Kumar, A\., Yao, H\., Liang, P\., and Finn, C\.Surgical fine\-tuning improves adaptation to distribution shifts\.In*The Eleventh International Conference on Learning Representations*, 2023\.
- Levina & Bickel \(2004\)Levina, E\. and Bickel, P\.Maximum likelihood estimation of intrinsic dimension\.*Advances in neural information processing systems*, 17, 2004\.
- Li et al\. \(2025\)Li, C\., Qin, M\., Xiao, S\., Chen, J\., Luo, K\., Lian, D\., Shao, Y\., and Liu, Z\.Making text embedders few\-shot learners\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Li et al\. \(2023\)Li, J\., Cheng, X\., Zhao, X\., Nie, J\.\-Y\., and Wen, J\.\-R\.Halueval: A large\-scale hallucination evaluation benchmark for large language models\.In*Proceedings of the 2023 conference on empirical methods in natural language processing*, pp\. 6449–6464, 2023\.
- Li et al\. \(2024\)Li, J\., Chen, J\., Ren, R\., Cheng, X\., Zhao, X\., Nie, J\.\-Y\., and Wen, J\.\-R\.The dawn after the dark: An empirical study on factuality hallucination in large language models\.In Ku, L\.\-W\., Martins, A\., and Srikumar, V\. \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 10879–10899, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.586\.
- Lin et al\. \(2024\)Lin, Z\., Trivedi, S\., and Sun, J\.Generating with confidence: Uncertainty quantification for black\-box large language models\.*Transactions on Machine Learning Research*, 2024\.ISSN 2835\-8856\.
- Liu et al\. \(2024\)Liu, L\., Pan, Y\., Li, X\., and Chen, G\.Uncertainty estimation and quantification for llms: A simple supervised approach, 2024\.
- Malinin & Gales \(2021\)Malinin, A\. and Gales, M\.Uncertainty estimation in autoregressive structured prediction\.In*International Conference on Learning Representations*, 2021\.
- Orgad et al\. \(2025\)Orgad, H\., Toker, M\., Gekhman, Z\., Reichart, R\., Szpektor, I\., Kotek, H\., and Belinkov, Y\.LLMs know more than they show: On the intrinsic representation of LLM hallucinations\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Rajpurkar et al\. \(2016\)Rajpurkar, P\., Zhang, J\., Lopyrev, K\., and Liang, P\.SQuAD: 100,000\+ questions for machine comprehension of text\.In Su, J\., Duh, K\., and Carreras, X\. \(eds\.\),*Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing*, pp\. 2383–2392, Austin, Texas, November 2016\. Association for Computational Linguistics\.doi:10\.18653/v1/D16\-1264\.
- Rao et al\. \(2025\)Rao, A\., Rußwurm, M\., Klemmer, K\., and Rolf, E\.Measuring the intrinsic dimension of earth representations\.*arXiv preprint arXiv:2511\.02101*, 2025\.
- Reddy et al\. \(2019\)Reddy, S\., Chen, D\., and Manning, C\. D\.CoQA: A conversational question answering challenge\.*Transactions of the Association for Computational Linguistics*, 7:249–266, 2019\.doi:10\.1162/tacl\_a\_00266\.
- Rousseeuw \(1987\)Rousseeuw, P\. J\.Silhouettes: a graphical aid to the interpretation and validation of cluster analysis\.*Journal of computational and applied mathematics*, 20:53–65, 1987\.
- Roy & Vetterli \(2007\)Roy, O\. and Vetterli, M\.The effective rank: A measure of effective dimensionality\.In*2007 15th European signal processing conference*, pp\. 606–610\. IEEE, 2007\.
- Rykov et al\. \(2025\)Rykov, E\., Petrushina, K\., Savkin, M\., Olisov, V\., Vazhentsev, A\., Titova, K\., Panchenko, A\., Konovalov, V\., and Belikova, J\.When models lie, we learn: Multilingual span\-level hallucination detection with PsiloQA\.In Christodoulopoulos, C\., Chakraborty, T\., Rose, C\., and Peng, V\. \(eds\.\),*Findings of the Association for Computational Linguistics: EMNLP 2025*, pp\. 11663–11682, Suzhou, China, November 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-335\-7\.doi:10\.18653/v1/2025\.findings\-emnlp\.626\.
- See et al\. \(2017\)See, A\., Liu, P\. J\., and Manning, C\. D\.Get to the point: Summarization with pointer\-generator networks\.In*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 1073–1083, 2017\.
- Skean et al\. \(2025\)Skean, O\., Arefin, M\. R\., Zhao, D\., Patel, N\., Naghiyev, J\., LeCun, Y\., and Shwartz\-Ziv, R\.Layer by layer: Uncovering hidden representations in language models\.*arXiv preprint arXiv:2502\.02013*, 2025\.
- Slobodkin et al\. \(2023\)Slobodkin, A\., Goldman, O\., Caciularu, A\., Dagan, I\., and Ravfogel, S\.The curious case of hallucinatory \(un\)answerability: Finding truths in the hidden states of over\-confident large language models\.In*The 2023 Conference on Empirical Methods in Natural Language Processing*, 2023\.
- Springer et al\. \(2025\)Springer, J\. M\., Kotha, S\., Fried, D\., Neubig, G\., and Raghunathan, A\.Repetition improves language model embeddings\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Xiong et al\. \(2024\)Xiong, M\., Hu, Z\., Lu, X\., Li, Y\., Fu, J\., He, J\., and Hooi, B\.Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Yang et al\. \(2018\)Yang, Z\., Qi, P\., Zhang, S\., Bengio, Y\., Cohen, W\., Salakhutdinov, R\., and Manning, C\. D\.HotpotQA: A dataset for diverse, explainable multi\-hop question answering\.In Riloff, E\., Chiang, D\., Hockenmaier, J\., and Tsujii, J\. \(eds\.\),*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pp\. 2369–2380, Brussels, Belgium, October\-November 2018\. Association for Computational Linguistics\.doi:10\.18653/v1/D18\-1259\.
- Yin et al\. \(2024\)Yin, F\., Srinivasa, J\., and Chang, K\.\-W\.Characterizing truthfulness in large language model generations with local intrinsic dimension\.In*Forty\-first International Conference on Machine Learning*, 2024\.
- Zhang et al\. \(2015\)Zhang, X\., Zhao, J\., and LeCun, Y\.Character\-level convolutional networks for text classification\.*Advances in neural information processing systems*, 28, 2015\.
- Zhang et al\. \(2025\)Zhang, X\., Li, Z\., Zhang, Y\., Long, D\., Xie, P\., Zhang, M\., and Zhang, M\.Language models are universal embedders\.In Fei, H\., Tu, K\., Zhang, Y\., Hu, X\., Han, W\., Jia, Z\., Zheng, Z\., Cao, Y\., Zhang, M\., Lu, W\., Siddharth, N\., Øvrelid, L\., Xue, N\., and Zhang, Y\. \(eds\.\),*Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling \(XLLM 2025\)*, pp\. 252–265, Vienna, Austria, August 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-286\-2\.doi:10\.18653/v1/2025\.xllm\-1\.21\.
- Zhou et al\. \(2023\)Zhou, K\., Jurafsky, D\., and Hashimoto, T\.Navigating the grey area: How expressions of uncertainty and overconfidence affect language models\.In Bouamor, H\., Pino, J\., and Bali, K\. \(eds\.\),*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pp\. 5506–5524, Singapore, December 2023\. Association for Computational Linguistics\.doi:10\.18653/v1/2023\.emnlp\-main\.335\.
## Appendix A Experimental Details
#### QA Task
We evaluate our methods on five question answering datasets: CoQA, SQuAD, HotpotQA, TriviaQA, and PsiLoQA\. For each dataset, we construct a training set of 9,000 examples and a validation set of 1,000 examples, which are used for probe training and hyperparameter selection\. The test sets are kept fixed and are used exclusively for evaluation\. Specifically, the test set sizes are 7,983 for CoQA, 10,000 for SQuAD, 7,405 for HotpotQA, 10,000 for TriviaQA, and 8,103 for PsiLoQA\.
#### Summarization Task
We evaluate on two summarization datasets: CNN/DailyMail and HaluEval\. For CNN/DailyMail, we construct a training set of 9,000 examples and a validation set of 1,000 examples, with a test set of 10,000 examples\. For HaluEval, we use 7,200 training examples, 800 validation examples, and 2,000 test examples\.
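The split sizes stated above for the QA and summarization tasks can be collected into a small configuration, sketched here in Python. The numbers are taken from the text; the helper `split_train_val`, the fixed seed, and the use of a single shuffle are assumptions for illustration, since the appendix does not describe how the training and validation examples are sampled.

```python
import random

# (train, validation, test) sizes as stated in Appendix A.
SPLIT_SIZES = {
    "CoQA": (9000, 1000, 7983),
    "SQuAD": (9000, 1000, 10000),
    "HotpotQA": (9000, 1000, 7405),
    "TriviaQA": (9000, 1000, 10000),
    "PsiLoQA": (9000, 1000, 8103),
    "CNN/DailyMail": (9000, 1000, 10000),
    "HaluEval": (7200, 800, 2000),
}


def split_train_val(pool, n_train, n_val, seed=0):
    """Hypothetical helper: shuffle a pool once and cut disjoint train/validation subsets."""
    pool = list(pool)
    if n_train + n_val > len(pool):
        raise ValueError("pool is smaller than the requested split")
    random.Random(seed).shuffle(pool)
    return pool[:n_train], pool[n_train:n_train + n_val]


# Example with integer ids standing in for real examples.
n_train, n_val, _ = SPLIT_SIZES["CoQA"]
train_ids, val_ids = split_train_val(range(12000), n_train, n_val)
```

Keeping the test set out of the pool passed to the helper mirrors the statement that test sets are fixed and used only for evaluation.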
#### Prompt Template
To generate answers for hallucination detection, we use two prompt settings depending on whether a supporting passage is available. [Table 6](https://arxiv.org/html/2605.26366#A1.T6) summarizes the templates used in the question-only setting (HotpotQA, TriviaQA) and the context-aware setting (CoQA, SQuAD, PsiLoQA, HaluEval, CNN/DailyMail), where \[Q\] and \[C\] denote the question and context, respectively.
#### Implementation of First\-Sentence Truncation
First\-sentence truncation is implemented via a rule\-based scanner\. Specifically, the scanner processes the generated text from left to right and identifies the first period that does not fall under common exception cases as the end of the first sentence\. These exceptions include ellipses \(…\), decimal numbers \(e\.g\.,3\.14\), multi\-dot abbreviations \(e\.g\.,U\.S\.,i\.e\.\), word\-level abbreviations \(e\.g\.,Dr\.,etc\.,No\. 3\), and name initials \(e\.g\.,G\. Smith\)\. The scanner is simply implemented with regular expression operations\.
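A minimal sketch of how such a scanner might look in Python, using only the standard `re` module. The function name `first_sentence`, the contents of `ABBREVIATIONS`, and the order of the checks are assumptions for illustration; the appendix does not list the exact abbreviations or patterns, and the released code may differ.

```python
import re

# Hypothetical abbreviation list; the paper does not enumerate the ones it uses.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "no", "vs", "st"}


def first_sentence(text: str) -> str:
    """Return text up to the first period that is not a known exception."""
    for match in re.finditer(r"\.", text):
        i = match.start()
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Ellipsis written as a run of dots ("...").
        if prev_ch == "." or next_ch == ".":
            continue
        # Decimal number such as 3.14: digits on both sides.
        if prev_ch.isdigit() and next_ch.isdigit():
            continue
        # Letters and dots immediately before this period, e.g. "Dr", "U.S", "G".
        token = re.search(r"[A-Za-z.]*$", text[:i]).group(0)
        # Multi-dot abbreviation: an inner dot ("U.S.") or a letter right after ("i.e").
        if "." in token or next_ch.isalpha():
            continue
        # Name initial such as "G. Smith".
        if len(token) == 1 and token.isupper():
            continue
        # Word-level abbreviation such as "Dr.", "etc.", "No. 3".
        if token.lower() in ABBREVIATIONS:
            continue
        return text[: i + 1]
    return text  # No sentence-ending period found; keep the full text.


example = "Dr. G. Smith measured 3.14 in the U.S. lab. Then he left."
truncated = first_sentence(example)
```

Rules of this kind are heuristic: for instance, a sentence that genuinely ends in "etc." or in a single capital letter would be skipped by this sketch and truncation would continue to the next period.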
Table 6: Prompt templates used for answer generation. Here, \[Q\] denotes the question text and \[C\] denotes the provided context passage.

Table 7: Results on the LLaMA-3.1-8B base model with $w=7$. The representations are extracted without FST. Top-3 results per column are marked with superscripts (1 = best). FEPoID maintains strong performance on a non-instruction-tuned model, demonstrating that our method *generalizes* across both base and instruction-tuned models.

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiloQA | Avg |
|---|---|---|---|---|---|---|
| RankME | 0.6972 | 0.7132 | 0.6458 | 0.6472 | 0.8032 | 0.7013 |
| Curvature | 0.7187³ | 0.7432³ | 0.7370¹ | 0.7846³ | 0.8639² | 0.7695³ |
| Val Loss | 0.7552¹ | 0.7811² | 0.7326² | 0.8141¹ | 0.8434³ | 0.7853² |
| RGN | 0.6859 | 0.7313 | 0.6955³ | 0.7257 | 0.7816 | 0.7240 |
| SNR | 0.5118 | 0.6387 | 0.6115 | 0.6304 | 0.7677 | 0.6320 |
| ID | 0.7468² | 0.7892¹ | 0.6672 | 0.6179 | 0.8655¹ | 0.7373 |
| FEPoID | 0.7468² | 0.7892¹ | 0.7326² | 0.8136² | 0.8655¹ | 0.7895¹ |

Table 8: Results on different model scales, where representations are extracted without FST. Forward horizon is set to $w=7$ for LLaMA-3.2-3B and $w=3$ for LLaMA-3.2-1B. Top-3 results per column are marked with superscripts, with 1 indicating the best performance. FEPoID achieves consistently strong performance across both model scales, demonstrating its *generalizability* to models of varying scales.

**LLaMA-3.2-3B**

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiloQA | Avg |
|---|---|---|---|---|---|---|
| RankME | 0.5981 | 0.6476 | 0.6914 | 0.6639³ | 0.7469 | 0.6696 |
| Curvature | 0.7123² | 0.6616³ | 0.7268³ | 0.6734² | 0.8280³ | 0.7204 |
| Val Loss | 0.7391¹ | 0.7153² | 0.7439¹ | 0.7198¹ | 0.8498¹ | 0.7536² |
| RGN | 0.6369³ | 0.6567 | 0.6914 | 0.7198¹ | 0.7990 | 0.7008 |
| SNR | 0.5116 | 0.5544 | 0.7439¹ | 0.5748 | 0.6878 | 0.6145 |
| ID | 0.7391¹ | 0.7275¹ | 0.6982 | 0.6554 | 0.7990 | 0.7238³ |
| FEPoID | 0.7391¹ | 0.7275¹ | 0.7364² | 0.7198¹ | 0.8483² | 0.7542¹ |

**LLaMA-3.2-1B**

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiloQA | Avg |
|---|---|---|---|---|---|---|
| RankME | 0.5621 | 0.5902 | 0.6583 | 0.5963 | 0.6441 | 0.6102 |
| Curvature | 0.6699³ | 0.6442² | 0.7587² | 0.7028¹ | 0.7355³ | 0.7022³ |
| Val Loss | 0.6860¹ | 0.6442² | 0.7641¹ | 0.6985² | 0.7597¹ | 0.7105¹ |
| RGN | 0.5777 | 0.5902 | 0.7370³ | 0.6050³ | 0.6597 | 0.6339 |
| SNR | 0.6226 | 0.6225³ | 0.7265 | 0.5963 | 0.6126 | 0.6361 |
| ID | 0.6709² | 0.5872 | 0.7151 | 0.6050³ | 0.6597 | 0.6476 |
| FEPoID | 0.6709² | 0.6514¹ | 0.7641¹ | 0.7028¹ | 0.7413² | 0.7061² |

## Appendix B Additional Hallucination Detection Results
#### Generalization Across Scales and Tuning Strategies
To evaluate the generalizability of FEPoID beyond instruction-tuned models and standard model scales, we conduct experiments on LLaMA-3.1-8B (base), LLaMA-3.2-3B, and LLaMA-3.2-1B. Results are reported in Tables [11](https://arxiv.org/html/2605.26366#A4.T11) and [12](https://arxiv.org/html/2605.26366#A4.T12), in which FEPoID consistently selects high-performing layers and outperforms all baselines.
#### First\-Sentence Truncation Analysis
To better understand the impact of token-position choice, we report detailed hallucination detection results with first-sentence truncation across datasets in [Table 9](https://arxiv.org/html/2605.26366#A2.T9). Overall, under the FST setting, First Effective Peak of Intrinsic Dimension continues to achieve the best average performance across different models. In addition, [Figure 6](https://arxiv.org/html/2605.26366#A2.F6) presents layer-wise AUROC curves together with intrinsic dimension estimates when representations are extracted at the last token of the first generated sentence. Compared to [Figure 2](https://arxiv.org/html/2605.26366#S4.F2), the AUROC curves of the two models become noticeably more aligned after applying FST. This observation suggests that different instruction-tuned models tend to capture more consistent and useful information at earlier stages of generation, leading to improved consistency in detection performance.
#### Sensitivity to Forward Horizon $w$
Finally, we examine the sensitivity of FEPoID to the forward horizon parameter $w$ in [Figure 7](https://arxiv.org/html/2605.26366#A2.F7). The results demonstrate that FEPoID is robust to a wide range of $w$ values across datasets.
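This appendix does not restate how FEPoID uses $w$, so the following is only a toy sketch under an assumed reading: a layer counts as an "effective" peak if its intrinsic dimension rose from the previous layer and is not exceeded within the next $w$ layers. The function name `first_peak_with_horizon`, the tie handling, the fallback, and the example curve are all hypothetical; the exact rule is the one defined in the main text and in the released code.

```python
def first_peak_with_horizon(id_curve, w):
    """Assumed rule (not the paper's definition): first layer whose ID value
    increased from the previous layer and is >= every value in the next w layers."""
    if w < 1:
        raise ValueError("forward horizon w must be at least 1")
    for i in range(1, len(id_curve)):
        rising = id_curve[i] > id_curve[i - 1]
        ahead = id_curve[i + 1 : i + 1 + w]  # shorter near the last layer
        if rising and all(id_curve[i] >= value for value in ahead):
            return i
    # Fallback when no layer qualifies (e.g. a non-increasing curve): global maximum.
    return max(range(len(id_curve)), key=lambda i: id_curve[i])


# Made-up intrinsic-dimension curve with a small early bump at layer 1.
toy_curve = [5.0, 7.0, 6.5, 9.0, 12.0, 11.0, 10.0, 8.0]
selected = {w: first_peak_with_horizon(toy_curve, w) for w in (1, 2, 3, 7)}
```

On this toy curve, $w=1$ stops at the small bump at layer 1, while $w=2$, $3$, and $7$ all return layer 4. Under the assumed rule, this is the kind of behavior that would make the selection insensitive to $w$ once the horizon is long enough to look past small fluctuations.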
Table 9: Hallucination detection performance (AUROC) across five QA datasets for LLaMA-3.1-8B-Instruct and Mistral-7B-Instruct. For each method, we apply first-sentence truncation for the generated answers. Forward horizon $w$ is set to 7. Superscripts mark the top-3 results per column (1 = best).

**LLaMA-3.1-8B-Instruct**

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiLoQA | Avg |
|---|---|---|---|---|---|---|
| Pred. Entropy | 0.7300 | 0.7850 | 0.7264 | 0.7897 | 0.7880 | 0.7638 |
| LN-Pred. Entropy | 0.6895 | 0.7425 | 0.7427 | 0.7962 | 0.5742 | 0.7090 |
| Semantic Entropy | 0.7015 | 0.7806 | 0.7108 | 0.8148 | 0.7392 | 0.7494 |
| Lexical Similarity | 0.7902 | 0.8433 | 0.7924 | 0.8614 | 0.8061 | 0.8187 |
| LID | 0.5535 | 0.5731 | 0.5714 | 0.5550 | 0.6648 | 0.5836 |
| EigenScore | 0.7362 | 0.8152 | 0.7271 | 0.8089 | 0.7941 | 0.7763 |
| *Hidden-State Probing* | | | | | | |
| RankME | 0.8159 | 0.8106 | 0.8013³ | 0.8228 | 0.8549 | 0.8211 |
| Curvature | 0.8557³ | 0.8625³ | 0.8119² | 0.8287³ | 0.9160³ | 0.8550³ |
| Val Loss | 0.8621¹ | 0.8865² | 0.8287¹ | 0.8731² | 0.9238¹ | 0.8748² |
| RGN | 0.7167 | 0.8104 | 0.7292 | 0.7772 | 0.8112 | 0.7689 |
| SNR | 0.7602 | 0.6693 | 0.7723 | 0.6408 | 0.8001 | 0.7285 |
| ID | 0.8581² | 0.8900¹ | 0.7723 | 0.8066 | 0.9208² | 0.8496 |
| FEPoID | 0.8581² | 0.8900¹ | 0.8287¹ | 0.8782¹ | 0.9208² | 0.8752¹ |

**Mistral-7B-Instruct-v0.3**

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiLoQA | Avg |
|---|---|---|---|---|---|---|
| Pred. Entropy | 0.7396 | 0.7633 | 0.7449 | 0.7734 | 0.7574 | 0.7557 |
| LN-Pred. Entropy | 0.6532 | 0.6564 | 0.6970 | 0.7305 | 0.5045 | 0.6483 |
| Semantic Entropy | 0.5772 | 0.6314 | 0.6518 | 0.7603 | 0.6685 | 0.6578 |
| Lexical Similarity | 0.7081 | 0.7270 | 0.7534 | 0.8171 | 0.6907 | 0.7393 |
| LID | 0.5286 | 0.5784 | 0.5590 | 0.4902 | 0.6600 | 0.5632 |
| EigenScore | 0.7087 | 0.7350 | 0.7357 | 0.7707 | 0.7517 | 0.7404 |
| *Hidden-State Probing* | | | | | | |
| RankME | 0.7746 | 0.7548 | 0.6931 | 0.6763 | 0.8617 | 0.7521 |
| Curvature | 0.8474² | 0.8704³ | 0.8088² | 0.8680² | 0.9264² | 0.8642³ |
| Val Loss | 0.8430³ | 0.8784¹ | 0.8087³ | 0.8805¹ | 0.9216¹ | 0.8664² |
| RGN | 0.8346 | 0.7929 | 0.7631 | 0.8176³ | 0.8797 | 0.8176 |
| SNR | 0.7453 | 0.7112 | 0.7259 | 0.7115 | 0.8274 | 0.7443 |
| ID | 0.8474² | 0.8769² | 0.7631 | 0.8176³ | 0.8942 | 0.8398 |
| FEPoID | 0.8501¹ | 0.8769² | 0.8153¹ | 0.8805¹ | 0.9207³ | 0.8687¹ |


Panels: (a) CoQA, (b) SQuAD, (c) HotpotQA, (d) TriviaQA, (e) PsiLoQA.
Figure 6: Layer-wise AUROC and Intrinsic Dimension across QA datasets with FST. Diamond markers indicate the layers selected by FEPoID, and star markers denote the oracle best-performing layers in terms of AUROC. The representations are extracted at the last token of the first generated sentence.

Panels: (a) LLaMA-Instruct (w/o FST), (b) Mistral-Instruct (w/o FST), (c) LLaMA-Instruct (w/ FST), (d) Mistral-Instruct (w/ FST).
Figure 7: AUROC versus forward horizon $w$ for First Effective Peak of Intrinsic Dimension across QA datasets, with and without FST. Results show strong robustness to $w$, with slightly improved performance under FST for larger horizons on some datasets.
## Appendix C Linear Probe
To support the claim that the FEPoID-selected layer encodes abstract semantic information, we additionally run two sets of linear probing experiments using logistic regression on the frozen hidden states, comparing the FEPoID-selected layer (index 0) against its neighboring layers (indices $\pm 1$, $\pm 2$, $\pm 3$).
#### Factual Correctness Probing
We train a logistic regression probe at each of these layers to predict whether the model's generated answer is factually correct, using the binary correctness labels already available in our experimental pipeline. We report AUROC across all five QA datasets for LLaMA-3.1-8B in [Table 10](https://arxiv.org/html/2605.26366#A3.T10).

To further assess whether the FEPoID-selected layer encodes abstract semantic categories beyond task-specific signals, we consider three external benchmarks:

1. Odd Man Out (Conneau et al., [2018](https://arxiv.org/html/2605.26366#bib.bib7)): A binary semantic coherence task where each sentence is labeled as either Original (well-formed) or Changed (a noun or verb has been replaced by a random word of the same part of speech). Correctly distinguishing Original from Changed sentences requires genuine semantic world knowledge rather than surface-level features.
2. AG's News (Zhang et al., [2015](https://arxiv.org/html/2605.26366#bib.bib46)): A 4-class topic classification task over news articles (World, Sports, Business, Science and Technology), probing whether the layer encodes semantic topic categories.
3. DBPedia (Zhang et al., [2015](https://arxiv.org/html/2605.26366#bib.bib46)): A 14-class ontology classification task over Wikipedia entity descriptions, requiring the probe to distinguish fine-grained semantic categories such as Artist, Athlete, Animal, Building, and Film, which share surface-level features but differ in semantic type.

For all three tasks, we feed each sentence directly into the frozen LLaMA\-3\.1\-8B and conduct probing using logistic regression\.
We report the AUROC results for both sets of experiments in [Table 10](https://arxiv.org/html/2605.26366#A3.T10). The FEPoID-selected layer (index 0) achieves the highest or near-highest AUROC across datasets, and performance degrades as we move to deeper or shallower neighboring layers. Both the factual correctness probing and the semantic category probing results indicate that the FEPoID-selected layer is not merely a coincidental choice for hallucination detection, but is the layer where abstract semantic information is encoded.
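The layer-comparison protocol described in this appendix can be illustrated with a small, dependency-free Python sketch: fit a logistic regression probe on features from each layer offset and score it with AUROC. Everything here other than that protocol is an assumption. The features are synthetic, and `synthetic_layer` is deliberately constructed so that the label signal is strongest at offset 0, so the output illustrates the procedure and is not evidence about any real model. The helper names, the gradient-descent settings, and the sample sizes are hypothetical.

```python
import math
import random


def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logreg(X, y, lr=0.5, epochs=100):
    """Plain batch gradient descent on the logistic loss (no regularization)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            for j in range(dim):
                grad_w[j] += (p - target) * x[j]
            grad_b += p - target
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def synthetic_layer(offset, n, rng, dim=4):
    """Toy stand-in for hidden states: the label signal fades with |offset|."""
    strength = 2.0 / (1 + abs(offset))
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += strength if label == 1 else -strength
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
results = {}
for offset in range(-3, 4):  # index 0 plays the role of the selected layer
    X_train, y_train = synthetic_layer(offset, 400, rng)
    X_test, y_test = synthetic_layer(offset, 200, rng)
    w, b = fit_logreg(X_train, y_train)
    test_scores = [b + sum(wj * xj for wj, xj in zip(w, x)) for x in X_test]
    results[offset] = auroc(y_test, test_scores)

for offset, score in sorted(results.items()):
    print(f"offset {offset:+d}: AUROC = {score:.3f}")
```

In practice the synthetic features would be replaced by hidden states extracted from the frozen model at each layer, and a library implementation of logistic regression and AUROC would normally be preferred over the hand-written versions used here to keep the sketch self-contained.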
Table 10:Linear probing accuracy across layers\. The FEPoID\-selected layer \(index 0\) is compared against neighboring layers\. Best results per column are in bold\.
## Appendix D Generalization to Vision Tasks
Table 11: Results on the LLaMA-3.1-8B base model with $w=7$. The representations are extracted with FST. FEPoID maintains strong performance on a non-instruction-tuned model, demonstrating that our method *generalizes* across both base and instruction-tuned models.

Table 12: Results on different model scales, where representations are extracted with FST. Forward horizon is set to $w=7$ for LLaMA-3.2-3B and $w=3$ for LLaMA-3.2-1B. Top-3 results per column are marked with superscripts, with 1 indicating the best performance. FEPoID achieves consistently strong performance across both model scales, demonstrating its *generalizability* to models of varying scales.

**LLaMA-3.2-3B**

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiloQA | Avg |
|---|---|---|---|---|---|---|
| RankME | 0.6265 | 0.7338 | 0.7083 | 0.6798 | 0.7950 | 0.7087 |
| Curvature | 0.7737² | 0.8163³ | 0.7779³ | 0.8200 | 0.8687³ | 0.8113³ |
| Val Loss | 0.7919¹ | 0.8534¹ | 0.7900² | 0.8441² | 0.8822² | 0.8323² |
| RGN | 0.6786 | 0.6333 | 0.7493 | 0.7027 | 0.8295 | 0.7187 |
| SNR | 0.7639³ | 0.6333 | 0.7779³ | 0.8337³ | 0.7665 | 0.7550 |
| ID | 0.7919¹ | 0.8461² | 0.7309 | 0.7027 | 0.8843¹ | 0.7912 |
| FEPoID | 0.7919¹ | 0.8461² | 0.7992¹ | 0.8479¹ | 0.8843¹ | 0.8339¹ |

**LLaMA-3.2-1B**

| Method | CoQA | SQuAD | HotpotQA | TriviaQA | PsiloQA | Avg |
|---|---|---|---|---|---|---|
| RankME | 0.5813 | 0.6288 | 0.7009 | 0.6547 | 0.7151 | 0.6561 |
| Curvature | 0.7190³ | 0.7340² | 0.7957² | 0.7876² | 0.8097² | 0.7692³ |
| Val Loss | 0.7329² | 0.7566¹ | 0.8030¹ | 0.7750³ | 0.8323¹ | 0.7800² |
| RGN | 0.6421 | 0.6499 | 0.7485 | 0.7199 | 0.7788 | 0.7078 |
| SNR | 0.6859 | 0.5973 | 0.7694³ | 0.6547 | 0.8097² | 0.7034 |
| ID | 0.7356¹ | 0.7074³ | 0.7485 | 0.7199 | 0.7897³ | 0.7402 |
| FEPoID | 0.7356¹ | 0.7566¹ | 0.8030¹ | 0.8025¹ | 0.8323¹ | 0.7860¹ |

We evaluate First Effective Peak of Intrinsic Dimension on CIFAR-10 using an ImageNet-pretrained ViT. For each layer, we use the \[CLS\] token representation to train and evaluate the downstream MLP.
As shown in [Figure 8](https://arxiv.org/html/2605.26366#A4.F8), test accuracy increases monotonically with network depth and reaches its maximum at the final transformer block. This behavior contrasts with hallucination detection, where intermediate layers often yield the strongest probe performance ([Figures 1](https://arxiv.org/html/2605.26366#S1.F1) and [2](https://arxiv.org/html/2605.26366#S4.F2)).
Consistent with this trend, the estimated intrinsic dimension increases steadily across layers and peaks at the penultimate layer. Importantly, with $w \in [2,5]$, First Effective Peak of Intrinsic Dimension accurately selects this layer, closely matching the oracle best-performing layer. These results demonstrate that First Effective Peak of Intrinsic Dimension reliably identifies informative layers for vision tasks, extending its applicability beyond language-based settings.
Figure 8: Layer-wise accuracy and intrinsic dimension for image classification. First Effective Peak of Intrinsic Dimension picks the second-to-last layer with $w \in \{2,3,4,5\}$.