CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations
Summary
Proposes CORTEX, a token-level hallucination detection method for RAG that compares LLM internal representations with and without retrieved documents to identify ungrounded spans. It improves fine-grained localization of hallucinations in long-form RAG outputs.
# CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations
Source: [https://arxiv.org/html/2606.31033](https://arxiv.org/html/2606.31033)
Kazuaki Furumai, Shuichiro Haruta, Kazunori Matsumoto, Daisuke Kamisaka\. KDDI Research, Inc\. \{ka\-furumai, sh\-haruta, da\-kamisaka\}@kddi\.com
###### Abstract
In this paper, we propose CORTEX, a token\-level hallucination detection method for Retrieval\-Augmented Generation \(RAG\)\. In long\-form RAG outputs, hallucinations often arise in localized spans rather than throughout an entire response\. CORTEX therefore identifies ungrounded content at the token level, enabling fine\-grained localization of hallucinations\. The key intuition behind CORTEX is that tokens grounded in retrieved documents should be more strongly influenced by those documents than hallucinated tokens\. To capture this document\-induced effect, CORTEX compares internal representations of a large language model \(LLM\) under two conditions: with and without the retrieved documents\. Instead of relying solely on each token’s immediate sensitivity to the retrieved documents, CORTEX also leverages the propagation of document\-grounded information through preceding tokens, reducing false positives for tokens whose evidence has already been absorbed into the context\. Finally, CORTEX applies a post\-processing smoothing step that models the tendency of hallucination labels to persist over contiguous spans, reducing local noise and encouraging span\-consistent predictions\. Experiments on two RAG benchmarks and three LLMs show that CORTEX substantially improves token\-level hallucination detection, with each component consistently contributing to performance gains\.
## 1 Introduction
Figure 1: Overview of CORTEX for token\-level hallucination detection in RAG, using reference\-conditioned internal representation comparisons, contextual residual features, and label\-persistence smoothing\.
Retrieval\-Augmented Generation \(RAG\) has been widely adopted to improve the factuality of large language models \(LLMs\) by grounding generation in external references \(Lewis et al\., [2020](https://arxiv.org/html/2606.31033#bib.bib35)\)\. Although RAG mitigates hallucinations to some extent, LLMs may still generate groundless or inconsistent content even when references are provided \(Fan et al\., [2026](https://arxiv.org/html/2606.31033#bib.bib20)\)\. Accurately detecting such hallucinations within generated outputs therefore remains a critical challenge \(Niu et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib26); Liu et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib37); Bang et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib38)\)\.
Various hallucination detection methods have been proposed to address this challenge\. Self\-consistency\-based approaches estimate reliability by measuring agreement across multiple outputs from the same prompt \(Manakul et al\., [2023](https://arxiv.org/html/2606.31033#bib.bib8)\)\. Prompt\-based methods use LLMs as fact\-checkers to explicitly verify generated content \(Zheng et al\., [2023](https://arxiv.org/html/2606.31033#bib.bib10); Es et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib11); Furumai et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib29)\)\. While these approaches leverage LLMs’ strong language understanding capabilities, they may incur high computational cost due to repeated sampling, and their reliability can depend on prompt design \(Ganesh et al\., [2026](https://arxiv.org/html/2606.31033#bib.bib33)\)\.
More recently, hallucination detection methods based on the internal representations of LLMs have been increasingly studied \(Sriramanan et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib12); Zhang et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib13); Xiong et al\., [2026](https://arxiv.org/html/2606.31033#bib.bib30)\)\. These methods seek hallucination\-related signals in model activations and can offer insight into the internal conditions under which hallucinations arise\. This line of work has analyzed attention and feedforward\-network states, quantified component contributions to generation, and extracted hallucination\-related features from internal representations\.
However, many existing approaches are not designed specifically for RAG and primarily assess the generated output as a whole, i\.e\., at the answer level\. This is limiting for long\-form RAG outputs, where faithful and hallucinated content often coexist\. For practical use, detecting hallucinations in such mixed outputs requires a finer\-grained formulation than answer\-level detection\. The appropriate granularity can vary across tasks and annotation protocols, from words and phrases to sentences or paragraphs\. Given this variability, token\-level detection can serve as a practical approach for localizing ungrounded content and later adapting predictions to the required granularity\.
In this paper, we propose CORTEX \(Comparative Observed Reference\-based Token\-level EXpressions\), a post\-hoc token\-level hallucination detection method for RAG\. Figure [1](https://arxiv.org/html/2606.31033#S1.F1) illustrates the overall framework\. CORTEX builds on the intuition that faithful tokens should exhibit coherent representational changes when references are provided, whereas hallucinated tokens should show weaker or less coherent changes\. To measure this effect, CORTEX constructs a paired counterfactual view of the same answer under reference\-conditioned and no\-reference inputs, rather than probing a single input as in conventional representation\-based detectors\. The resulting reference\-induced changes are then encoded as token\-level delta features for hallucination detection\.
CORTEX further introduces an attention\-based contextual residual feature\. In long\-form generation, reference influence may propagate indirectly through preceding answer tokens, as in reasoning\-style or self\-referential generation\. CORTEX captures this context\-mediated reference influence by combining token\-level delta features with attention patterns, helping the classifier distinguish tokens that are indirectly grounded in the reference from groundless tokens and thereby reducing false positives\.
Finally, CORTEX uses a post\-processing smoothing step to adjust the granularity of token\-level scores\. Raw token\-level predictions can be sensitive to local noise and may produce isolated high\-score tokens, whereas human hallucination annotations often appear as contiguous spans\. We therefore introduce label\-persistence smoothing, which treats raw scores as token\-level confidence and models the tendency of hallucination labels to persist across neighboring tokens\. This reduces scattered noise and adapts token\-level predictions to span\-level annotation structure while still retaining token\-level scores\.
Experiments on two RAG benchmarks and three LLMs demonstrate that CORTEX substantially improves token\-level hallucination detection, with all components contributing consistently to performance gains\.
Our contributions are summarized as follows:
- We propose CORTEX, which, to the best of our knowledge, is the first token\-level hallucination detection method specifically designed for RAG\.
- CORTEX is a practical post\-hoc framework that is easy to implement and can detect hallucinations in outputs from arbitrary LLMs, including API\-based closed\-weight models\.
## 2 Preliminaries
We consider a practical setting in which the answer is produced by a closed\-weight LLM whose parameters and internal representations are not accessible, as is the case for many API\-based models\. Such models are often attractive in deployed applications because of their strong generation quality\. We denote this closed\-weight LLM by $M_{\mathrm{close}}$\. Given a question $q^{\mathrm{text}}$, the closed\-weight LLM produces an answer as
$$a^{\mathrm{text}} = M_{\mathrm{close}}(q^{\mathrm{text}}). \tag{1}$$
Our goal is to identify hallucinated tokens in the generated answer $a^{\mathrm{text}}$\. Although a direct way to obtain hallucination\-related features would be to analyze the internal representations of the closed\-weight LLM, we cannot do so for the aforementioned reason\. We therefore use a separate open\-weight LLM, denoted by $M_{\mathrm{open}}$, as a post\-hoc analysis model\. Unlike $M_{\mathrm{close}}$, $M_{\mathrm{open}}$ exposes internal representations, which allows us to extract features from the answer in relation to the question\.
Let $\mathrm{Tok}(\cdot)$ denote the tokenizer of $M_{\mathrm{open}}$\. We tokenize the answer as
$$a = \mathrm{Tok}(a^{\mathrm{text}}) = (t_1, t_2, \ldots, t_i, \ldots, t_T), \tag{2}$$
where $T$ denotes the number of answer tokens and $t_i$ denotes the $i$\-th answer token under the tokenizer of $M_{\mathrm{open}}$\. We assume that $M_{\mathrm{open}}$ processes an input containing the question and the answer as $M_{\mathrm{open}}(q^{\mathrm{text}} \| a^{\mathrm{text}})$, where $\|$ denotes text concatenation with the appropriate prompt format\. $M_{\mathrm{open}}$ produces token\-level internal representations over the entire input sequence, which capture the relationships between $q^{\mathrm{text}}$ and $a^{\mathrm{text}}$\. Since hallucination detection is performed over the answer, we use only the internal representations corresponding to the answer tokens\. We denote the representation aligned with answer token $t_i$ by $h_i \in \mathbb{R}^d$\. The specific input construction and the extraction of answer\-token representations are described in Section [3](https://arxiv.org/html/2606.31033#S3)\.
The token\-level hallucination detection task is to estimate whether each answer token is hallucinated\. Let $y_i \in \{0,1\}$ denote the token\-level label, where $y_i = 1$ indicates that token $t_i$ is hallucinated and $y_i = 0$ indicates that it is faithful\.
In our setting, we focus on RAG applications, where references are provided as grounding evidence for answers\. This setting reflects common practical deployments in which retrieved documents are used to reduce hallucination and the answer is expected to be faithful to those references\.
## 3 CORTEX
We propose CORTEX \(Comparative Observed Reference\-based Token\-level EXpressions\), a post\-hoc hallucination detection framework for RAG\.
CORTEX is built on three key ideas, illustrated in Figure [1](https://arxiv.org/html/2606.31033#S1.F1) \(a\)–\(c\): \(a\) a reference\-induced delta representation for capturing how each answer token’s internal representation changes when the references are provided; \(b\) an attention\-based contextual residual for distinguishing direct reference sensitivity from reference influence mediated by preceding answer tokens; and \(c\) label\-persistence smoothing for reducing isolated token\-level noise and obtaining span\-consistent hallucination scores while preserving token\-level predictions\.
### 3\.1 Reference\-Induced Delta Representation
CORTEX constructs two conditioned inputs for the open\-weight LLM, differing only in whether the references are included\. Let $r^{\mathrm{text}}$ denote the references, i\.e\., the retrieved documents used as grounding evidence in the RAG setting\. The reference\-conditioned input is defined as $x_{\mathrm{ref}}^{\mathrm{text}} = q^{\mathrm{text}} \| r^{\mathrm{text}} \| a^{\mathrm{text}}$, whereas the no\-reference input is defined as $x_{\mathrm{no\text{-}ref}}^{\mathrm{text}} = q^{\mathrm{text}} \| a^{\mathrm{text}}$\. With this construction, the answer span corresponds to the same token sequence $(t_1, \ldots, t_T)$ in both conditions, allowing CORTEX to compare internal representations aligned with the same answer tokens\.
When $M_{\mathrm{open}}$ processes $x_{\mathrm{ref}}^{\mathrm{text}}$ and $x_{\mathrm{no\text{-}ref}}^{\mathrm{text}}$, it produces token\-level internal representations over the entire input sequence\. For an input $x = \mathrm{Tok}(x^{\mathrm{text}})$, let $\mathrm{Rep}_{M_{\mathrm{open}}}(x)$ denote the sequence of internal representations obtained from $M_{\mathrm{open}}$\. We define an extraction function $f$ that selects the representations aligned with the answer tokens:
$$\begin{aligned} (h^{\mathrm{ref}}_1, \ldots, h^{\mathrm{ref}}_T) &= f(\mathrm{Rep}_{M_{\mathrm{open}}}(x_{\mathrm{ref}})), \\ (h^{\mathrm{no\text{-}ref}}_1, \ldots, h^{\mathrm{no\text{-}ref}}_T) &= f(\mathrm{Rep}_{M_{\mathrm{open}}}(x_{\mathrm{no\text{-}ref}})). \end{aligned} \tag{3}$$
Here, $h^{\mathrm{ref}}_i, h^{\mathrm{no\text{-}ref}}_i \in \mathbb{R}^d$ denote the internal representations aligned with the same answer token $t_i$ under the reference\-conditioned and no\-reference conditions, respectively, and $d$ denotes the representation dimensionality\.
Footnote 1: Specifically, we use the final transformer layer output corresponding to each answer token as its token\-level representation $h_i$\.
CORTEX builds on the intuition that faithful tokens should exhibit coherent representational changes when references are provided, whereas hallucinated tokens should show weaker or less coherent changes\. Based on this intuition, CORTEX computes the reference\-induced delta representation for each answer token $t_i$ as
$$\Delta h_i = h_i^{\mathrm{ref}} - h_i^{\mathrm{no\text{-}ref}}. \tag{4}$$
This vector is the core representation in CORTEX\. Since the same answer is included in both inputs, $\Delta h_i$ captures how the representation of token $t_i$ changes when references are provided\. In other words, $\Delta h_i$ is intended to emphasize the reference\-induced change in how $M_{\mathrm{open}}$ contextualizes that token, rather than merely representing the token content itself\.
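The delta computation in Eqs. (3) and (4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: random arrays stand in for the final-layer hidden states of the open-weight model, the helper names are hypothetical, and the extraction function is assumed to take the last `T` positions of each input (which holds only if the prompt format places nothing after the answer).

```python
import numpy as np

def answer_reps(hidden: np.ndarray, num_answer_tokens: int) -> np.ndarray:
    """Stand-in for the extraction function f.

    Assumption: the answer is the final span of the input, so its tokens
    occupy the last `num_answer_tokens` rows of the hidden-state matrix.
    """
    if num_answer_tokens <= 0:
        raise ValueError("the answer must contain at least one token")
    return hidden[-num_answer_tokens:]

def delta_representation(hidden_ref, hidden_no_ref, num_answer_tokens):
    """Eq. (4): per-token difference between the two conditions."""
    h_ref = answer_reps(hidden_ref, num_answer_tokens)
    h_no_ref = answer_reps(hidden_no_ref, num_answer_tokens)
    return h_ref - h_no_ref

# Toy shapes: 5 question tokens, 20 reference tokens, T = 4 answer tokens.
rng = np.random.default_rng(0)
d, T = 8, 4
hidden_ref = rng.normal(size=(5 + 20 + T, d))   # question + references + answer
hidden_no_ref = rng.normal(size=(5 + T, d))     # question + answer
delta_h = delta_representation(hidden_ref, hidden_no_ref, T)  # shape (T, d)
```

The two inputs have different lengths, so the alignment is by answer position rather than by absolute index, which is why the extraction step is applied to each condition before subtracting.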
### 3\.2 Attention\-Based Contextual Residual
The delta representation $\Delta h_i$ captures how the representation of token $t_i$ changes when references are added\. However, in reasoning\-style outputs such as chain\-of\-thought, facts grounded in the references may first be stated in earlier parts of the answer, and later tokens may continue the reasoning based on those facts\. In this case, the influence of the references can reach $t_i$ not only through the direct path $r \rightarrow t_i$, but also through the context\-mediated path $r \rightarrow t_{<i} \rightarrow t_i$\. As a result, even tokens that are indirectly grounded by the references may appear to have a weak relationship with the references if we only use $\Delta h_i$\.
To address this issue, we introduce a feature that represents how much reference influence is contained in the preceding tokens that the current token $t_i$ relies on as
$$\bar{\Delta h}_i = \sum_{j<i} \alpha_{ij}^{\mathrm{ref}} \Delta h_j, \tag{5}$$
where $\alpha_{ij}^{\mathrm{ref}} \in [0,1]$ denotes the attention weight from $t_i$ to a preceding answer token $t_j$ under the reference\-conditioned input\. We further subtract this context\-mediated influence from the current token’s own change and define the contextual residual as
$$c_i = \Delta h_i - \bar{\Delta h}_i. \tag{6}$$
This residual represents the token\-specific change that remains after removing the reference influence explainable through preceding tokens\. We use both $\Delta h_i$ and $c_i$ for token\-level hallucination detection\.
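To make Eqs. (5) and (6) concrete, the sketch below computes the contextual residual from a matrix of delta vectors and an answer-to-answer attention matrix. It is illustrative only: the function name is hypothetical, and it assumes the attention weights have already been reduced to a single `(T, T)` matrix (for example by averaging over heads), a choice this section does not specify.

```python
import numpy as np

def contextual_residual(delta_h: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """Compute c_i = delta_h_i - sum_{j<i} attn[i, j] * delta_h_j.

    delta_h: (T, d) reference-induced delta vectors.
    attn:    (T, T) attention weights among answer tokens under the
             reference-conditioned input (assumed pre-aggregated).
    """
    T = delta_h.shape[0]
    # Keep only strictly preceding tokens (j < i); the diagonal and any
    # future positions are zeroed out.
    preceding = np.tril(np.ones((T, T)), k=-1)
    delta_bar = (attn * preceding) @ delta_h   # Eq. (5)
    return delta_h - delta_bar                 # Eq. (6)

delta_h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
attn = np.array([[0.3, 0.0, 0.0],
                 [0.5, 0.4, 0.0],
                 [0.2, 0.3, 0.5]])
c = contextual_residual(delta_h, attn)
```

Because part of each token's attention goes to question and reference tokens, the rows of the answer-to-answer matrix need not sum to one. The first answer token has no preceding answer tokens, so its residual equals its delta vector.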
### 3\.3 Token\-Level Hallucination Detection
For each answer token $t_i$, CORTEX predicts a raw token\-level hallucination score $s_i$ by feeding $[\Delta h_i; c_i]$ into a multilayer perceptron \(MLP\) classifier $g_\theta$:
$$s_i = \sigma\left(g_\theta\left([\Delta h_i; c_i]\right)\right), \qquad s_i \in [0,1], \tag{7}$$
where $\sigma$ denotes the sigmoid function\. The classifier is trained with binary cross\-entropy loss using token\-level hallucination labels\. To address class imbalance, the positive class is weighted according to the ratio of negative to positive tokens in the training set\.
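The class-weighted loss described above can be illustrated as follows. This NumPy sketch is an assumption about the exact form: it scales the positive-class term by the negative-to-positive token ratio, which is the same convention as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`. The function names and the clipping constant are hypothetical.

```python
import numpy as np

def positive_class_weight(labels: np.ndarray) -> float:
    """Ratio of negative to positive tokens in the training labels."""
    num_pos = int(labels.sum())
    num_neg = int(labels.size - num_pos)
    return num_neg / max(num_pos, 1)  # guard against a set with no positives

def weighted_bce(scores, labels, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term up-weighted."""
    s = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    per_token = -(pos_weight * y * np.log(s) + (1.0 - y) * np.log(1.0 - s))
    return float(per_token.mean())

labels = np.array([0, 0, 0, 1])          # one hallucinated token out of four
scores = np.array([0.5, 0.5, 0.5, 0.5])  # an uninformative classifier
w = positive_class_weight(labels)
loss = weighted_bce(scores, labels, w)
```

With this weighting, the single positive token contributes as much to the loss as the three negative tokens combined, so the classifier is not rewarded for predicting "faithful" everywhere.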
### 3\.4 Label\-Persistence Smoothing
We define a span as a contiguous sequence of tokens annotated as a single hallucination unit\. Human hallucination annotations are typically span\-based: once a token is marked as hallucinated, neighboring tokens in the same span are likely to receive the same label\. Based on this property, we apply label\-persistence smoothing as a post\-processing step, rather than using moving\-average smoothing\.
Given the raw token\-level scores $s_{1:T}$, we introduce an unobserved binary smoothing\-label variable $z_i \in \{0,1\}$ to represent the span\-consistent label underlying the post\-processed score, where $z_i = 0$ denotes a faithful token and $z_i = 1$ denotes a hallucinated token\. To obtain span\-consistent post\-processed scores, we model the smoothed label sequence by combining two components around positions $i$ and $i+1$: the tendency of neighboring labels to persist and information from the raw scores at each position\.
We first model span\-level label persistence by defining the pairwise persistence term between neighboring labels as
$$\rho(z_i, z_{i+1}) = \begin{cases} p_{\mathrm{stay}}, & z_{i+1} = z_i, \\ 1 - p_{\mathrm{stay}}, & z_{i+1} \neq z_i. \end{cases} \tag{8}$$
The parameter $p_{\mathrm{stay}} \in [0,1]$ controls the degree of label persistence: smaller values allow finer\-grained label changes, whereas larger values favor longer same\-label spans\. This provides a deliberately simple approximation to span\-level label continuity, rather than modeling the full variability of human annotation patterns\. We then define the token\-level confidence induced by the raw score $s_i$ at each position:
$$\phi_i(z_i = 1) = s_i, \qquad \phi_i(z_i = 0) = 1 - s_i. \tag{9}$$
This quantity represents how the raw score at position $i$ is converted into token\-level confidence for each value of the smoothing\-label variable\.
Footnote 2: In implementation, we apply a small amount of clipping to the raw scores so that the token\-level confidence does not become exactly 0 or 1\.
Combining Eq\. \([8](https://arxiv.org/html/2606.31033#S3.E8)\) and Eq\. \([9](https://arxiv.org/html/2606.31033#S3.E9)\), we define the local compatibility term for neighboring positions $i$ and $i+1$ as
$$\phi_i(z_i)\, \rho(z_i, z_{i+1})\, \phi_{i+1}(z_{i+1}). \tag{10}$$
This term measures the compatibility of the neighboring label assignment $(z_i, z_{i+1})$ with both the raw scores and the label\-persistence assumption\.
We therefore define the normalized distribution over sequence labels as
$$P(z_{1:T} \mid s_{1:T}) = \frac{1}{Z(s_{1:T})}\, \pi(z_1) \prod_{i=1}^{T} \phi_i(z_i) \prod_{i=1}^{T-1} \rho(z_i, z_{i+1}), \tag{11}$$
where $Z(s_{1:T})$ is the normalizing constant and $\pi(z_1)$ is the initial distribution over the first smoothing label\. We use a uniform initial distribution, $\pi(0) = \pi(1) = 1/2$\.
Although $z_{1:T}$ is unobserved, the desired token\-level smoothed score can be obtained by marginalizing over all smoothing\-label sequences:
$$\tilde{s}_i = P(z_i = 1 \mid s_{1:T}). \tag{12}$$
This posterior can be computed efficiently using the forward–backward algorithm \(Rabiner, [1990](https://arxiv.org/html/2606.31033#bib.bib32)\)\. We provide the full recursions and posterior marginal formula in Appendix [A](https://arxiv.org/html/2606.31033#A1)\.
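As an illustration of how the posterior in Eq. (12) might be computed, the following NumPy sketch runs a scaled forward-backward pass over the two-state chain defined by Eqs. (8), (9), and (11). It is not the authors' implementation (their recursions are given in Appendix A). The function name, the clipping value `eps`, and the per-step renormalization used for numerical stability are assumptions made for this example.

```python
import numpy as np

def label_persistence_smoothing(scores, p_stay=0.993, eps=1e-6):
    """Return P(z_i = 1 | s_{1:T}) for every token position."""
    s = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    T = s.shape[0]
    if T == 0:
        return s
    phi = np.stack([1.0 - s, s], axis=1)          # phi[i, z], Eq. (9)
    rho = np.array([[p_stay, 1.0 - p_stay],
                    [1.0 - p_stay, p_stay]])      # rho[z_i, z_{i+1}], Eq. (8)
    pi = np.array([0.5, 0.5])                     # uniform initial distribution

    # Forward pass, renormalized at each step to avoid underflow.
    alpha = np.zeros((T, 2))
    alpha[0] = pi * phi[0]
    alpha[0] /= alpha[0].sum()
    for i in range(1, T):
        alpha[i] = phi[i] * (alpha[i - 1] @ rho)
        alpha[i] /= alpha[i].sum()

    # Backward pass, also renormalized.
    beta = np.ones((T, 2))
    for i in range(T - 2, -1, -1):
        beta[i] = rho @ (phi[i + 1] * beta[i + 1])
        beta[i] /= beta[i].sum()

    # Per-position scaling cancels when each row is normalized.
    posterior = alpha * beta
    posterior /= posterior.sum(axis=1, keepdims=True)
    return posterior[:, 1]

raw = np.array([0.1, 0.1, 0.9, 0.1, 0.1])   # one isolated high-score token
smoothed = label_persistence_smoothing(raw, p_stay=0.9)
```

In this toy input the isolated spike at the third position is pulled down toward its neighbors, because a label sequence that switches twice is penalized by two factors of `1 - p_stay`. Setting `p_stay = 0.5` makes neighboring labels independent and returns the clipped raw scores unchanged.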
## 4 Experiments
We evaluate CORTEX on two publicly available RAG hallucination benchmarks, RAGTruth \(Niu et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib26)\) and HalluRAG \(Ridder and Schilling, [2024](https://arxiv.org/html/2606.31033#bib.bib28)\), using three LLMs: Llama\-3\.1\-8B\-Instruct \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib25)\), Qwen3\-8B \(Yang et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib23)\), and Mistral\-7B\-Instruct\-v0\.2 \(Mistral AI, [2023](https://arxiv.org/html/2606.31033#bib.bib24)\)\. We use these open\-weight LLMs to obtain internal representations for CORTEX and the relevant baselines\. Evaluation is performed at the token and answer levels\. Token\-level evaluation assesses whether a method can identify hallucinated tokens in a generated answer, while answer\-level evaluation assesses whether the answer contains any hallucinated content\. For CORTEX, answer\-level evaluation is performed by aggregating token\-level scores, as it is trained for token\-level detection and does not use answer\-level supervision\. Specifically, we use the maximum token\-level hallucination score as the answer\-level score\.
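The answer-level aggregation amounts to a single reduction over the token scores. The short sketch below shows it under the assumption that the scores arrive as a one-dimensional array; the function name is hypothetical.

```python
import numpy as np

def answer_level_score(token_scores) -> float:
    """Answer-level score: the maximum token-level hallucination score."""
    token_scores = np.asarray(token_scores, dtype=float)
    if token_scores.size == 0:
        raise ValueError("cannot aggregate an empty answer")
    return float(token_scores.max())

score = answer_level_score([0.1, 0.8, 0.3])
```

A maximum flags an answer as soon as any one token looks hallucinated, which matches the answer-level definition above but also makes the result sensitive to a single noisy token.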
For label\-persistence smoothing, we use $p_{\mathrm{stay}} = 0.993$ at the token level and $p_{\mathrm{stay}} = 0.930$ at the answer level\. We provide a sensitivity analysis of $p_{\mathrm{stay}}$ in Appendix [B](https://arxiv.org/html/2606.31033#A2)\. Additional details on the experimental setup are provided in Appendix [C](https://arxiv.org/html/2606.31033#A3)\. The following subsections describe the baselines and datasets used in our experiments\.
### 4\.1 Baselines
We compare CORTEX with five baselines\. NLL uses negative log\-likelihood as an uncertainty\-based signal, following prior work on uncertainty estimation from next\-token predictive distributions \(Malinin and Gales, [2021](https://arxiv.org/html/2606.31033#bib.bib18)\)\. SAPLMA trains a probe on intermediate transformer representations, based on the observation that hidden states encode hallucination\-related signals \(Azaria and Mitchell, [2023](https://arxiv.org/html/2606.31033#bib.bib19)\)\. LLM\-Check extracts features from internal states produced by attention mechanisms and feedforward neural networks, including spectral properties such as eigenvalues \(Sriramanan et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib12)\)\. ICR Probe uses internal component attribution to quantify the contribution of attention and feedforward networks and identify hallucinated content \(Zhang et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib13)\)\. RAGLens applies sparse autoencoders to token embeddings and selects hallucination\-related sparse features via token aggregation and mutual\-information\-based feature selection \(Xiong et al\., [2026](https://arxiv.org/html/2606.31033#bib.bib30)\)\.
For answer\-level evaluation, we follow the original implementations of the baselines\. However, these answer\-level implementations cannot be used directly for token\-level evaluation, so we modify parts of them to support token\-level detection while preserving their original mechanisms as much as possible\.
### 4\.2 Datasets
Datasets for hallucination detection in RAG settings remain limited, especially those that include both references and fine\-grained hallucination annotations\. To the best of our knowledge, RAGTruth is the only publicly available RAG benchmark with human\-annotated hallucination spans that are sufficiently fine\-grained to support token\-level evaluation\. Although RAGTruth contains multiple task types, we use only its QA subset in this work\.
HalluRAG consists of answers generated by multiple LLMs under two RAG prompt settings using either relevant or irrelevant Wikipedia chunks, with sentence\-level hallucination annotations by GPT\-4o \(OpenAI et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib40)\)\. Since token\-level labels are unavailable, we derive token\-level pseudo\-labels by marking all tokens in hallucinated sentences as hallucinated\.
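The pseudo-labeling step can be sketched as a mapping from character spans to token labels. The example below is a toy illustration, not the authors' preprocessing code: the function name is hypothetical, a whitespace split stands in for the real tokenizer, and a token is assumed to inherit a sentence's label whenever their character spans overlap.

```python
import re

def sentence_to_token_labels(token_spans, sentence_spans, sentence_labels):
    """Mark a token as hallucinated (1) if it overlaps a hallucinated sentence.

    token_spans / sentence_spans: lists of (start, end) character offsets.
    sentence_labels: 1 for a hallucinated sentence, 0 otherwise.
    """
    labels = []
    for tok_start, tok_end in token_spans:
        label = 0
        for (sent_start, sent_end), sent_label in zip(sentence_spans, sentence_labels):
            overlaps = tok_start < sent_end and sent_start < tok_end
            if overlaps and sent_label == 1:
                label = 1
                break
        labels.append(label)
    return labels

# Toy answer with two sentences; the second is annotated as hallucinated.
sentences = ["Paris is in France.", "It has 90 million people."]
sentence_labels = [0, 1]
answer = " ".join(sentences)

sentence_spans, cursor = [], 0
for sent in sentences:
    sentence_spans.append((cursor, cursor + len(sent)))
    cursor += len(sent) + 1  # skip the joining space

# Whitespace tokens stand in for the open-weight model's tokenizer.
token_spans = [m.span() for m in re.finditer(r"\S+", answer)]
token_labels = sentence_to_token_labels(token_spans, sentence_spans, sentence_labels)
```

Labels produced this way are coarser than human span annotations: every token in a flagged sentence is marked, including function words and any correct facts the sentence also contains.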
The two datasets provide complementary settings, differing in annotation granularity, labeling procedure, retrieval quality, and QA distribution\. RAGTruth provides fine\-grained human span annotations, whereas HalluRAG provides sentence\-level GPT\-4o annotations and includes cases where references may be irrelevant to the question\. They also differ in construction and topical coverage: RAGTruth’s QA subset is based on daily\-life questions, whereas HalluRAG derives questions from Wikipedia passages\. Together, these differences allow us to assess robustness across conditions\.
Both datasets contain answers generated by various LLMs, together with the references provided for generating those answers\. We treat these answers as outputs from closed\-weight LLMs, regardless of which model generated them\. To ensure a fair comparison, all methods receive the same answer and corresponding references as input\.
Table 1: Dataset statistics\.
### 4\.3 Results
### Token\-level Detection
Table 2: Token\-level hallucination detection results on RAGTruth and HalluRAG\. Bold and underline indicate the best and second\-best results, respectively\.
Table 3: Answer\-level hallucination detection results\. Bold and underline indicate the best and second\-best results, respectively\.
Table [2](https://arxiv.org/html/2606.31033#S4.T2) reports token\-level hallucination detection results in terms of average precision \(AP\) and the area under the receiver operating characteristic curve \(AUROC\)\. Across both RAGTruth and HalluRAG, CORTEX achieves the best performance in all settings\. The gains are particularly substantial on RAGTruth, where human fine\-grained annotations are available, suggesting that CORTEX is well aligned with token\-level hallucination detection\. Even on HalluRAG, where token\-level labels are derived from sentence\-level annotations, CORTEX remains the best\-performing method, indicating robustness to coarser and noisier supervision\.
All baselines receive inputs with references, but rely on a single reference\-conditioned view of model representations\. SAPLMA directly uses transformer layer outputs as features, while RAGLens extracts hallucination\-related sparse features from token embeddings using a trained sparse encoder\. In contrast, CORTEX derives its signal from the contrast between paired representations of the same tokens with and without references\. The consistent gains suggest that this comparative formulation captures reference\-induced hallucination signals that are not readily available from a single internal representation or features derived from it\.
Label\-persistence smoothing is a post\-processing module applicable to any token\-level predictions\. In Appendix [D](https://arxiv.org/html/2606.31033#A4), we further apply it to baselines to demonstrate its general effectiveness\.
### Answer\-level Detection
Table [3](https://arxiv.org/html/2606.31033#S4.T3) reports the answer\-level hallucination detection results in terms of AP and AUROC\. Unlike the baselines, which are trained directly for answer\-level detection, CORTEX does not use answer\-level supervision\. Instead, CORTEX reuses the token\-level hallucination scores and assigns each answer the maximum score among its tokens\. Thus, this setting evaluates whether the token\-level signals captured by CORTEX can also indicate hallucination at the answer level\.
Despite this simple aggregation strategy, CORTEX achieves competitive performance against the baselines\. This suggests that its token\-level predictions provide useful signals for both localized detection and answer\-level reliability assessment\. At the same time, the strong performance of some answer\-level baselines suggests that answer\-level detection may require global signals beyond localized token\-level signals\. Taken together, these results indicate that token\-level and answer\-level hallucination detection are both important and should be studied as complementary problems\.
### 4\.4 Ablation Study
Table 4: Ablation results on RAGTruth using Qwen\.
Figure 2: Detection example for an answer containing hallucinated content\. Label\-persistence smoothing suppresses noisy token\-level score patterns and highlights the hallucinated span more accurately and coherently\.
Figure 3: Detection example for an answer without hallucinated content\. The contextual residual $c$ reduces false positives by accounting for reference influence mediated through the preceding answer context\.
To analyze the contribution of each CORTEX component, we conduct an ablation study\. Table [4](https://arxiv.org/html/2606.31033#S4.T4) reports token\- and answer\-level performance for each configuration\. Both the contextual residual $c$ and label\-persistence smoothing \($\mathrm{Smooth}$\) improve performance\. Removing label\-persistence smoothing shows that raw token\-level scores are informative but locally unstable, while removing $c$ shows that context\-mediated reference influence provides information complementary to $\Delta h$\.
Figures [2](https://arxiv.org/html/2606.31033#S4.F2) and [3](https://arxiv.org/html/2606.31033#S4.F3) illustrate these effects qualitatively using cases with and without hallucinations\. In the heatmaps, color intensity reflects the predicted hallucination score, with redder tokens indicating higher hallucination likelihood\. In Figure [2](https://arxiv.org/html/2606.31033#S4.F2), the model without label\-persistence smoothing produces interleaved high\- and low\-score tokens across semantically coherent regions, yielding predictions less consistent with human span\-level annotations\. In contrast, full CORTEX produces smoother and more contiguous high\-score regions, better matching the ground\-truth hallucination spans\. Figure [3](https://arxiv.org/html/2606.31033#S4.F3) illustrates the role of the contextual residual $c$\. Without $c$, the model can overemphasize the absence of direct reference influence and assign high scores to tokens indirectly grounded through the preceding answer context\. With $c$, CORTEX accounts for reference influence propagated through previous tokens and reduces false positives caused by indirect grounding\.
Overall, these ablation results support the design of CORTEX\. The combination of $\Delta h$, the contextual residual $c$, and label\-persistence smoothing enables CORTEX to achieve robust localized hallucination detection\. Further ablation case studies are presented in Appendix [E](https://arxiv.org/html/2606.31033#A5)\.
## 5 Related Work
The problem of hallucination in LLMs has been extensively studied, yet remains unresolved \(Huang et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib14); Kalai et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib15)\)\. It undermines the reliability of LLM\-based AI agents and hinders real\-world deployment\. RAG mitigates hallucinations by incorporating references into prompts \(Gao et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib16); Rackauckas, [2024](https://arxiv.org/html/2606.31033#bib.bib17)\), but LLMs may still generate hallucinated content even with references, motivating the detection of groundless or inconsistent RAG outputs \(Fan et al\., [2026](https://arxiv.org/html/2606.31033#bib.bib20)\)\.
Many hallucination detection methods rely on external verification or repeated generation\. Self\-consistency methods estimate reliability by comparing multiple outputs from the same prompt \(Manakul et al\., [2023](https://arxiv.org/html/2606.31033#bib.bib8)\), while prompt\-based approaches use LLMs as judges or fact\-checkers \(Li et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib9); Es et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib11); Furumai et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib29)\)\. These methods exploit LLMs’ language understanding but depend on prompt design and verifier capability, and often require additional inference\.
Another line of work uses model\-internal signals, including uncertainty from next\-token distributions \(Malinin and Gales, [2021](https://arxiv.org/html/2606.31033#bib.bib18)\), hallucination\-related information in intermediate transformer representations \(Azaria and Mitchell, [2023](https://arxiv.org/html/2606.31033#bib.bib19)\), spectral features of attention and feedforward states \(Sriramanan et al\., [2024](https://arxiv.org/html/2606.31033#bib.bib12)\), component\-level attribution for tracing information sources \(Zhang et al\., [2025](https://arxiv.org/html/2606.31033#bib.bib13)\), and sparse autoencoder features for RAG hallucination analysis \(Xiong et al\., [2026](https://arxiv.org/html/2606.31033#bib.bib30)\)\. These studies show that hallucination\-related signals are present in model computations and representations, motivating representation\-based detection\.
CORTEX differs from prior representation\-based approaches by comparing internal representations obtained from reference\-conditioned and no\-reference inputs, rather than analyzing a single internal state in isolation\. It further uses label\-persistence smoothing to reduce isolated noise and better align predictions with span\-based hallucination annotations\.
## 6 Conclusion
We proposed CORTEX, a post\-hoc token\-level hallucination detection method for RAG\. CORTEX constructs a paired counterfactual view of the same answer by analyzing its internal representations with and without references, and encodes reference\-induced changes as token\-level delta features\. It combines these features with attention patterns to capture context\-mediated reference influence and uses label\-persistence smoothing to reduce local noise while preserving span\-consistent scores\.
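The paired view described above can be illustrated with a minimal sketch. It assumes that hidden states for both runs are already available as per-token vectors (plain Python lists here) and that the answer occupies a known contiguous span in each input. The function name, argument names, and toy vectors are hypothetical, and the paper's actual layer selection and feature construction are not reproduced.

```python
def token_delta_features(h_with_ref, h_without_ref, start_with, start_without, answer_len):
    """Subtract no-reference hidden vectors from reference-conditioned ones, token by token.

    h_with_ref / h_without_ref: lists of hidden vectors, one per input position.
    start_with / start_without: index of the first answer token in each input
    (they differ because only one of the two inputs contains the references).
    """
    deltas = []
    for t in range(answer_len):
        with_vec = h_with_ref[start_with + t]
        without_vec = h_without_ref[start_without + t]
        deltas.append([a - b for a, b in zip(with_vec, without_vec)])
    return deltas


# Toy example (assumed values): 2-dimensional hidden states, answer of 2 tokens.
h_with = [[0.0, 0.0], [9.0, 9.0], [1.0, 2.0], [3.0, 5.0]]  # [ref, ref, ans_1, ans_2]
h_without = [[0.0, 0.0], [1.0, 1.5], [3.0, 1.0]]           # [question, ans_1, ans_2]
deltas = token_delta_features(h_with, h_without, start_with=2, start_without=1, answer_len=2)
print(deltas)
```

The key step is aligning the same answer tokens across the two inputs, since their absolute positions shift when the references are removed. A small delta only indicates weak direct reference influence at that token, which is why the method additionally accounts for influence carried through preceding answer tokens.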
Experiments show that CORTEX substantially outperforms token\-level baselines and achieves competitive answer\-level performance through simple score aggregation\. Ablations confirm the contributions of both the attention\-based contextual residual and label\-persistence smoothing\. These results demonstrate the effectiveness of comparing internal representations for hallucination detection in RAG\. By using an open\-weight LLM to analyze outputs from closed\-weight LLMs without modifying generation, CORTEX provides a practical approach to reliability assessment\. Beyond detection, these comparative internal\-representation signals may provide a basis for future hallucination mitigation, including reward modeling for tuning generators and objectives that encourage stronger reference\-induced representations\.
## Limitations
CORTEX is designed for RAG settings and therefore assumes the presence of references against which generated outputs can be assessed\. It is not intended to verify claims that are generated solely from the parametric knowledge of an LLM\. This restriction may be acceptable, or even desirable, in controlled applications such as enterprise customer support, where answers should be grounded only in approved documents\. In more open\-ended assistant scenarios, however, this may be limiting because useful information that is not explicitly grounded in the provided references may be treated as groundless\.
Another limitation is that CORTEX requires access to an open\-weight LLM that exposes internal representations, including hidden states and attention weights\. Although the original generator can be API\-based or otherwise inaccessible, the post\-hoc analysis model must provide sufficient internal signals\. The quality of detection may therefore depend on the choice of the open\-weight LLM and its ability to interpret the generated answer and references\.
A further limitation is the need for fine\-grained supervision during training\. Although RAGTruth provides human span\-level annotations, such annotations remain scarce for RAG hallucination detection\. Improving supervision under limited annotation resources remains an important direction for future work\.
## Ethical Considerations
CORTEX is intended to support reliability assessment in RAG systems by identifying potentially groundless tokens in generated outputs\. Such detection can help reduce the risk of users relying on hallucinated information, especially in applications where answers are expected to be grounded in specific references\. However, CORTEX should not be interpreted as a guarantee of factual correctness\. Its predictions are probabilistic and may include both false positives and false negatives; therefore, human review or additional verification may still be necessary in high\-stakes domains\.
There is also a risk that hallucination detection tools could be over\-relied upon or used to present generated outputs as more reliable than they actually are\. We encourage practitioners to communicate the limitations of detection results clearly and to avoid using CORTEX as the sole safety mechanism for critical decision\-making\.
It is also important to distinguish the scope of CORTEX from broader safety evaluation\. CORTEX is designed to detect ungrounded content in reference\-grounded generation, not to determine broader notions of truth, fairness, or social harm\. Future work should examine how token\-level reliability signals can be combined with other safeguards, including checks for bias, toxicity, privacy risks, and domain\-specific safety issues\.
## References
- A\. Azaria and T\. Mitchell \(2023\) The internal state of an LLM knows when it's lying\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 967–976\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.68/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.68) Cited by: [§4\.1](https://arxiv.org/html/2606.31033#S4.SS1.p1.1), [§5](https://arxiv.org/html/2606.31033#S5.p3.1)\.
- Y\. Bang, Z\. Ji, A\. Schelten, A\. Hartshorn, T\. Fowler, C\. Zhang, N\. Cancedda, and P\. Fung \(2025\) HalluLens: LLM hallucination benchmark\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), pp\. 24128–24156\. External Links: [Link](https://aclanthology.org/2025.acl-long.1176/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1176), ISBN 979\-8\-89176\-251\-0 Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p1.1)\.
- S\. Es, J\. James, L\. Espinosa Anke, and S\. Schockaert \(2024\) RAGAs: automated evaluation of retrieval augmented generation\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, N\. Aletras and O\. De Clercq \(Eds\.\), pp\. 150–158\. External Links: [Link](https://aclanthology.org/2024.eacl-demo.16/), [Document](https://dx.doi.org/10.18653/v1/2024.eacl-demo.16) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p2.1), [§5](https://arxiv.org/html/2606.31033#S5.p2.1)\.
- D\. Fan, S\. Delsad, N\. Flammarion, and M\. Andriushchenko \(2026\) HalluHard: a hard multi\-turn hallucination benchmark\. Computing Research Repository arXiv:2602\.01031\. External Links: [Link](https://arxiv.org/abs/2602.01031) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p1.1), [§5](https://arxiv.org/html/2606.31033#S5.p1.1)\.
- K\. Furumai, R\. Legaspi, J\. C\. V\. Romero, Y\. Yamazaki, Y\. Nishimura, S\. Semnani, K\. Ikeda, W\. Shi, and M\. Lam \(2024\) Zero\-shot persuasive chatbots with LLM\-generated strategies and information retrieval\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 11224–11249\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.656/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.656) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p2.1), [§5](https://arxiv.org/html/2606.31033#S5.p2.1)\.
- P\. Ganesh, R\. Shokri, and G\. Farnadi \(2026\) Rethinking hallucinations: correctness, consistency, and prompt multiplicity\. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\), Rabat, Morocco, pp\. 6959–6978\. External Links: [Link](https://aclanthology.org/2026.eacl-long.327/), [Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.327), ISBN 979\-8\-89176\-380\-7 Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p2.1)\.
- Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, M\. Wang, and H\. Wang \(2024\) Retrieval\-augmented generation for large language models: a survey\. Computing Research Repository arXiv:2312\.10997\. External Links: [Link](https://arxiv.org/abs/2312.10997) Cited by: [§5](https://arxiv.org/html/2606.31033#S5.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\) The llama 3 herd of models\. Computing Research Repository arXiv:2407\.21783\. External Links: [Link](https://arxiv.org/abs/2407.21783) Cited by: [§4](https://arxiv.org/html/2606.31033#S4.p1.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu \(2025\) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\. ACM Transactions on Information Systems 43 \(2\)\. External Links: ISSN 1046\-8188, [Link](https://doi.org/10.1145/3703155), [Document](https://dx.doi.org/10.1145/3703155) Cited by: [§5](https://arxiv.org/html/2606.31033#S5.p1.1)\.
- A\. T\. Kalai, O\. Nachum, S\. S\. Vempala, and E\. Zhang \(2025\) Why language models hallucinate\. Computing Research Repository arXiv:2509\.04664\. External Links: [Link](https://arxiv.org/abs/2509.04664) Cited by: [§5](https://arxiv.org/html/2606.31033#S5.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. In Advances in Neural Information Processing Systems, H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\), Vol\. 33, pp\. 9459–9474\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p1.1)\.
- D\. Li, B\. Jiang, L\. Huang, A\. Beigi, C\. Zhao, Z\. Tan, A\. Bhattacharjee, Y\. Jiang, C\. Chen, T\. Wu, K\. Shu, L\. Cheng, and H\. Liu \(2025\) From generation to judgment: opportunities and challenges of LLM\-as\-a\-judge\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), pp\. 2757–2791\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.138/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.138), ISBN 979\-8\-89176\-332\-6 Cited by: [§5](https://arxiv.org/html/2606.31033#S5.p2.1)\.
- Q\. Liu, X\. Chen, Y\. Ding, B\. Song, W\. Wang, S\. Wu, and L\. Wang \(2025\) Attention\-guided self\-reflection for zero\-shot hallucination detection in large language models\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), pp\. 21005–21021\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1063/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1063), ISBN 979\-8\-89176\-332\-6 Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p1.1)\.
- A\. Malinin and M\. J\. F\. Gales \(2021\) Uncertainty estimation in autoregressive structured prediction\. In Proceedings of the 2021 International Conference on Learning Representations, External Links: [Link](https://api.semanticscholar.org/CorpusID:231895728) Cited by: [§4\.1](https://arxiv.org/html/2606.31033#S4.SS1.p1.1), [§5](https://arxiv.org/html/2606.31033#S5.p3.1)\.
- P\. Manakul, A\. Liusie, and M\. Gales \(2023\) SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 9004–9017\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.557/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.557) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p2.1), [§5](https://arxiv.org/html/2606.31033#S5.p2.1)\.
- Mistral AI \(2023\) Mistral\-7B\-Instruct\-v0\.2\. Note: Hugging Face model card\. External Links: [Link](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) Cited by: [§4](https://arxiv.org/html/2606.31033#S4.p1.1)\.
- C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang \(2024\) RAGTruth: a hallucination corpus for developing trustworthy retrieval\-augmented language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 10862–10878\. External Links: [Link](https://aclanthology.org/2024.acl-long.585/) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p1.1), [§4](https://arxiv.org/html/2606.31033#S4.p1.1)\.
- OpenAI, :, A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford, A\. Mądry, A\. Baker\-Whitcomb, A\. Beutel, A\. Borzunov, A\. Carney, A\. Chow, A\. Kirillov, A\. Nichol, A\. Paino, A\. Renzin, A\. T\. Passos, A\. Kirillov, A\. Christakis, A\. Conneau, A\. Kamali, A\. Jabri, A\. Moyer, A\. Tam, A\. Crookes, A\. Tootoochian, A\. Tootoonchian, A\. Kumar, A\. Vallone, A\. Karpathy, A\. Braunstein, A\. Cann, A\. Codispoti, A\. Galu, A\. Kondrich, A\. Tulloch, A\. Mishchenko, A\. Baek, A\. Jiang, A\. Pelisse, A\. Woodford, A\. Gosalia, A\. Dhar, A\. Pantuliano, A\. Nayak, A\. Oliver, B\. Zoph, B\. Ghorbani, B\. Leimberger, B\. Rossen, B\. Sokolowsky, B\. Wang, B\. Zweig, B\. Hoover, B\. Samic, B\. McGrew, B\. Spero, B\. Giertler, B\. Cheng, B\. Lightcap, B\. Walkin, B\. Quinn, B\. Guarraci, B\. Hsu, B\. Kellogg, B\. Eastman, C\. Lugaresi, C\. Wainwright, C\. Bassin, C\. Hudson, C\. Chu, C\. Nelson, C\. Li, C\. J\. Shern, C\. Conger, C\. Barette, C\. Voss, C\. Ding, C\. Lu, C\. Zhang, C\. Beaumont, C\. Hallacy, C\. Koch, C\. Gibson, C\. Kim, C\. Choi, C\. McLeavey, C\. Hesse, C\. Fischer, C\. Winter, C\. Czarnecki, C\. Jarvis, C\. Wei, C\. Koumouzelis, D\. Sherburn, D\. Kappler, D\. Levin, D\. Levy, D\. Carr, D\. Farhi, D\. Mely, D\. Robinson, D\. Sasaki, D\. Jin, D\. Valladares, D\. Tsipras, D\. Li, D\. P\. Nguyen, D\. Findlay, E\. Oiwoh, E\. Wong, E\. Asdar, E\. Proehl, E\. Yang, E\. Antonow, E\. Kramer, E\. Peterson, E\. Sigler, E\. Wallace, E\. Brevdo, E\. Mays, F\. Khorasani, F\. P\. Such, F\. Raso, F\. Zhang, F\. von Lohmann, F\. Sulit, G\. Goh, G\. Oden, G\. Salmon, G\. Starace, G\. Brockman, H\. Salman, H\. Bao, H\. Hu, H\. Wong, H\. Wang, H\. Schmidt, H\. Whitney, H\. Jun, H\. Kirchner, H\. P\. de Oliveira Pinto, H\. Ren, H\. Chang, H\. W\. Chung, I\. Kivlichan, I\. O’Connell, I\. O’Connell, I\. Osband, I\. Silber, I\. Sohl, I\. Okuyucu, I\. Lan, I\. Kostrikov, I\. Sutskever, I\. 
Kanitscheider, I\. Gulrajani, J\. Coxon, J\. Menick, J\. Pachocki, J\. Aung, J\. Betker, J\. Crooks, J\. Lennon, J\. Kiros, J\. Leike, J\. Park, J\. Kwon, J\. Phang, J\. Teplitz, J\. Wei, J\. Wolfe, J\. Chen, J\. Harris, J\. Varavva, J\. G\. Lee, J\. Shieh, J\. Lin, J\. Yu, J\. Weng, J\. Tang, J\. Yu, J\. Jang, J\. Q\. Candela, J\. Beutler, J\. Landers, J\. Parish, J\. Heidecke, J\. Schulman, J\. Lachman, J\. McKay, J\. Uesato, J\. Ward, J\. W\. Kim, J\. Huizinga, J\. Sitkin, J\. Kraaijeveld, J\. Gross, J\. Kaplan, J\. Snyder, J\. Achiam, J\. Jiao, J\. Lee, J\. Zhuang, J\. Harriman, K\. Fricke, K\. Hayashi, K\. Singhal, K\. Shi, K\. Karthik, K\. Wood, K\. Rimbach, K\. Hsu, K\. Nguyen, K\. Gu\-Lemberg, K\. Button, K\. Liu, K\. Howe, K\. Muthukumar, K\. Luther, L\. Ahmad, L\. Kai, L\. Itow, L\. Workman, L\. Pathak, L\. Chen, L\. Jing, L\. Guy, L\. Fedus, L\. Zhou, L\. Mamitsuka, L\. Weng, L\. McCallum, L\. Held, L\. Ouyang, L\. Feuvrier, L\. Zhang, L\. Kondraciuk, L\. Kaiser, L\. Hewitt, L\. Metz, L\. Doshi, M\. Aflak, M\. Simens, M\. Boyd, M\. Thompson, M\. Dukhan, M\. Chen, M\. Gray, M\. Hudnall, M\. Zhang, M\. Aljubeh, M\. Litwin, M\. Zeng, M\. Johnson, M\. Shetty, M\. Gupta, M\. Shah, M\. Yatbaz, M\. J\. Yang, M\. Zhong, M\. Glaese, M\. Chen, M\. Janner, M\. Lampe, M\. Petrov, M\. Wu, M\. Wang, M\. Fradin, M\. Pokrass, M\. Castro, M\. O\. T\. de Castro, M\. Pavlov, M\. Brundage, M\. Wang, M\. Khan, M\. Murati, M\. Bavarian, M\. Lin, M\. Yesildal, N\. Soto, N\. Gimelshein, N\. Cone, N\. Staudacher, N\. Summers, N\. LaFontaine, N\. Chowdhury, N\. Ryder, N\. Stathas, N\. Turley, N\. Tezak, N\. Felix, N\. Kudige, N\. Keskar, N\. Deutsch, N\. Bundick, N\. Puckett, O\. Nachum, O\. Okelola, O\. Boiko, O\. Murk, O\. Jaffe, O\. Watkins, O\. Godement, O\. Campbell\-Moore, P\. Chao, P\. McMillan, P\. Belov, P\. Su, P\. Bak, P\. Bakkum, P\. Deng, P\. Dolan, P\. Hoeschele, P\. Welinder, P\. Tillet, P\. Pronin, P\. Tillet, P\. Dhariwal, Q\. Yuan, R\. Dias, R\. Lim, R\. 
Arora, R\. Troll, R\. Lin, R\. G\. Lopes, R\. Puri, R\. Miyara, R\. Leike, R\. Gaubert, R\. Zamani, R\. Wang, R\. Donnelly, R\. Honsby, R\. Smith, R\. Sahai, R\. Ramchandani, R\. Huet, R\. Carmichael, R\. Zellers, R\. Chen, R\. Chen, R\. Nigmatullin, R\. Cheu, S\. Jain, S\. Altman, S\. Schoenholz, S\. Toizer, S\. Miserendino, S\. Agarwal, S\. Culver, S\. Ethersmith, S\. Gray, S\. Grove, S\. Metzger, S\. Hermani, S\. Jain, S\. Zhao, S\. Wu, S\. Jomoto, S\. Wu, Shuaiqi, Xia, S\. Phene, S\. Papay, S\. Narayanan, S\. Coffey, S\. Lee, S\. Hall, S\. Balaji, T\. Broda, T\. Stramer, T\. Xu, T\. Gogineni, T\. Christianson, T\. Sanders, T\. Patwardhan, T\. Cunninghman, T\. Degry, T\. Dimson, T\. Raoux, T\. Shadwell, T\. Zheng, T\. Underwood, T\. Markov, T\. Sherbakov, T\. Rubin, T\. Stasi, T\. Kaftan, T\. Heywood, T\. Peterson, T\. Walters, T\. Eloundou, V\. Qi, V\. Moeller, V\. Monaco, V\. Kuo, V\. Fomenko, W\. Chang, W\. Zheng, W\. Zhou, W\. Manassra, W\. Sheu, W\. Zaremba, Y\. Patil, Y\. Qian, Y\. Kim, Y\. Cheng, Y\. Zhang, Y\. He, Y\. Zhang, Y\. Jin, Y\. Dai, and Y\. Malkov \(2024\) GPT\-4o system card\. arXiv:2410\.21276\. External Links: [Link](https://arxiv.org/abs/2410.21276) Cited by: [§4\.2](https://arxiv.org/html/2606.31033#S4.SS2.p2.1)\.
- L\. R\. Rabiner \(1990\) A tutorial on hidden markov models and selected applications in speech recognition\. pp\. 267–296\. External Links: ISBN 1558601244 Cited by: [§3\.4](https://arxiv.org/html/2606.31033#S3.SS4.p7.1)\.
- Z\. Rackauckas \(2024\) Rag\-fusion: a new take on retrieval augmented generation\. International Journal on Natural Language Computing 13 \(1\), pp\. 37–47\. External Links: ISSN 2319\-4111, [Link](http://dx.doi.org/10.5121/ijnlc.2024.13103), [Document](https://dx.doi.org/10.5121/ijnlc.2024.13103) Cited by: [§5](https://arxiv.org/html/2606.31033#S5.p1.1)\.
- F\. Ridder and M\. Schilling \(2024\) The hallurag dataset: detecting closed\-domain hallucinations in rag applications using an llm’s internal states\. Computing Research Repository arXiv:2412\.17056\. External Links: [Link](https://arxiv.org/abs/2412.17056) Cited by: [§4](https://arxiv.org/html/2606.31033#S4.p1.1)\.
- G\. Sriramanan, S\. Bharti, V\. S\. Sadasivan, S\. Saha, P\. Kattakinda, and S\. Feizi \(2024\) LLM\-check: investigating detection of hallucinations in large language models\. In Proceedings of the 38th International Conference on Neural Information Processing Systems, External Links: ISBN 9798331314385, [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/3c1e1fdf305195cd620c118aaa9717ad-Paper-Conference.pdf) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p3.1), [§4\.1](https://arxiv.org/html/2606.31033#S4.SS1.p1.1), [§5](https://arxiv.org/html/2606.31033#S5.p3.1)\.
- G\. Xiong, Z\. He, B\. Liu, S\. Sinha, and A\. Zhang \(2026\) Toward faithful retrieval\-augmented generation with sparse autoencoders\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=hgBZP67BkP) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p3.1), [§4\.1](https://arxiv.org/html/2606.31033#S4.SS1.p1.1), [§5](https://arxiv.org/html/2606.31033#S5.p3.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\) Qwen3 technical report\. Computing Research Repository arXiv:2505\.09388\. External Links: [Link](https://arxiv.org/abs/2505.09388) Cited by: [§4](https://arxiv.org/html/2606.31033#S4.p1.1)\.
- Z\. Zhang, X\. Hu, H\. Zhang, J\. Zhang, and X\. Wan \(2025\) ICR probe: tracking hidden state dynamics for reliable hallucination detection in LLMs\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp\. 17986–18002\. External Links: [Link](https://aclanthology.org/2025.acl-long.880/) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p3.1), [§4\.1](https://arxiv.org/html/2606.31033#S4.SS1.p1.1), [§5](https://arxiv.org/html/2606.31033#S5.p3.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\. In Proceedings of the 37th International Conference on Neural Information Processing Systems, External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf) Cited by: [§1](https://arxiv.org/html/2606.31033#S1.p2.1)\.
## Appendix A Details of Label\-Persistence Smoothing
This appendix gives the forward–backward recursions and posterior marginal derivation for label\-persistence smoothing\. The method defines non\-negative weights over smoothing\-label sequences by combining token\-level confidence with label persistence between neighboring labels, so that posterior marginals can be computed efficiently without enumerating all $2^{T}$ possible label sequences\.
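To make concrete the cost that the recursions avoid, the following minimal sketch computes posterior marginals by direct enumeration of all $2^{T}$ label sequences. It assumes the binary weights that can be read off Eqs. (17)–(19): $\phi_i(1)=s_i$, $\phi_i(0)=1-s_i$, a persistence weight $p_{\mathrm{stay}}$ when neighboring labels agree and $1-p_{\mathrm{stay}}$ otherwise, and a uniform initial weight of $1/2$. The function name `brute_force_marginals` and the toy scores are illustrative and not taken from the paper.

```python
from itertools import product


def brute_force_marginals(s, p_stay):
    """Posterior P(z_i = 1) obtained by enumerating all 2^T binary label sequences."""
    T = len(s)
    totals = [0.0] * T  # accumulated weight of sequences with z_i = 1
    Z = 0.0             # total weight over all sequences
    for z in product((0, 1), repeat=T):
        w = 0.5  # uniform initial weight pi(z_1) = 1/2 (assumed from Eq. 19)
        for i, label in enumerate(z):
            w *= s[i] if label == 1 else 1.0 - s[i]  # phi_i(z_i)
            if i > 0:
                # rho(z_{i-1}, z_i): p_stay if the label persists, else 1 - p_stay
                w *= p_stay if z[i - 1] == label else 1.0 - p_stay
        Z += w
        for i, label in enumerate(z):
            if label == 1:
                totals[i] += w
    return [t / Z for t in totals]


scores = [0.9, 0.2, 0.8]  # toy token-level scores s_i
print(brute_force_marginals(scores, p_stay=0.9))
```

With these toy scores and $p_{\mathrm{stay}}=0.9$, the middle token's posterior rises from its raw score of 0.2 to roughly 0.76 because both neighbors score high, which is the span-consistency effect the smoothing is meant to produce. The loop visits $2^{T}$ sequences, so this version is only usable for very short answers.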
### A\.1 Forward Recursion
The forward message $F_{i}(\ell)$ is the unnormalized total weight of all partial smoothing\-label sequences ending with $z_{i}=\ell$:
$$
\begin{aligned}
F_{i}(\ell) &= \sum_{z_{1:i}:\, z_{i}=\ell} \pi(z_{1}) \prod_{j=1}^{i} \phi_{j}(z_{j}) \\
&\quad \times \prod_{j=2}^{i} \rho(z_{j-1}, z_{j}).
\end{aligned}
\tag{13}
$$
The initialization is
$$
F_{1}(\ell) = \pi(\ell)\,\phi_{1}(\ell). \tag{14}
$$
For $i=2,\ldots,T$, the recursion is obtained by grouping the partial sequences according to the previous label $\ell'=z_{i-1}$\. Starting from the definition,
$$
\begin{aligned}
F_{i}(\ell) &= \sum_{z_{1:i}:\, z_{i}=\ell} \pi(z_{1}) \prod_{j=1}^{i} \phi_{j}(z_{j}) \prod_{j=2}^{i} \rho(z_{j-1}, z_{j}) \\
&= \sum_{\ell' \in \{0,1\}} \sum_{z_{1:i-1}:\, z_{i-1}=\ell'} \pi(z_{1}) \prod_{j=1}^{i-1} \phi_{j}(z_{j}) \\
&\quad \times \prod_{j=2}^{i-1} \rho(z_{j-1}, z_{j})\, \rho(\ell', \ell)\, \phi_{i}(\ell).
\end{aligned}
\tag{15}
$$
The inner sum is exactly $F_{i-1}(\ell')$\. Therefore,
$$
F_{i}(\ell) = \phi_{i}(\ell) \sum_{\ell' \in \{0,1\}} F_{i-1}(\ell')\, \rho(\ell', \ell). \tag{16}
$$
Since the smoothing label is binary, the recursion can be written explicitly\. Let $p = p_{\mathrm{stay}}$\. Using $\phi_{i}(1)=s_{i}$ and $\phi_{i}(0)=1-s_{i}$, we obtain
$$
F_{i}(1) = s_{i}\left[p\,F_{i-1}(1) + (1-p)\,F_{i-1}(0)\right], \tag{17}
$$
and
$$
F_{i}(0) = (1-s_{i})\left[p\,F_{i-1}(0) + (1-p)\,F_{i-1}(1)\right]. \tag{18}
$$
The corresponding initial values are
$$
F_{1}(1) = \frac{1}{2}s_{1}, \qquad F_{1}(0) = \frac{1}{2}(1-s_{1}). \tag{19}
$$
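The binary forward pass of Eqs. (17)–(19) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation; the function name `forward_messages` and the toy inputs are assumptions.

```python
def forward_messages(s, p_stay):
    """Return a list of (F_i(0), F_i(1)) pairs following Eqs. (17)-(19)."""
    p = p_stay
    # Eq. (19): initial values with the uniform weight 1/2.
    F = [(0.5 * (1.0 - s[0]), 0.5 * s[0])]
    for s_i in s[1:]:
        prev0, prev1 = F[-1]
        f1 = s_i * (p * prev1 + (1.0 - p) * prev0)          # Eq. (17)
        f0 = (1.0 - s_i) * (p * prev0 + (1.0 - p) * prev1)  # Eq. (18)
        F.append((f0, f1))
    return F


F = forward_messages([0.8, 0.6], p_stay=0.9)  # toy scores, assumed values
print(F)
```

For $s=(0.8, 0.6)$ and $p=0.9$, working the recursion by hand gives $F_1(1)=0.4$, $F_1(0)=0.1$, then $F_2(1)=0.6\,(0.9\cdot 0.4+0.1\cdot 0.1)=0.222$ and $F_2(0)=0.4\,(0.9\cdot 0.1+0.1\cdot 0.4)=0.052$, which the code reproduces up to floating-point rounding. Each step costs constant time, so the pass is linear in $T$. Because the messages shrink at every step, very long sequences may need per-step rescaling or log-space arithmetic, a standard remedy that is an addition here and not something stated in the text above.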
### A.2 Backward Recursion
The backward message $B_i(\ell)$ is the unnormalized total weight of all suffix smoothing-label sequences from positions $i+1$ to $T$, conditioned on $z_i=\ell$:

$$
B_i(\ell) = \sum_{z_{i+1:T}} \prod_{j=i+1}^{T} \phi_j(z_j) \prod_{j=i+1}^{T} \rho(z_{j-1}, z_j), \qquad z_i=\ell. \tag{20}
$$

The initialization is

$$
B_T(\ell) = 1, \tag{21}
$$

which corresponds to the empty product beyond the final token.

For $i=T-1,\ldots,1$, the recursion is obtained by grouping the suffix sequences according to the next label $\ell'=z_{i+1}$. Starting from the definition,

$$
\begin{aligned}
B_i(\ell) &= \sum_{z_{i+1:T}} \prod_{j=i+1}^{T} \phi_j(z_j) \prod_{j=i+1}^{T} \rho(z_{j-1}, z_j) \\
&= \sum_{\ell'\in\{0,1\}} \sum_{z_{i+2:T}} \phi_{i+1}(\ell')\, \rho(\ell,\ell') \prod_{j=i+2}^{T} \phi_j(z_j) \prod_{j=i+2}^{T} \rho(z_{j-1}, z_j).
\end{aligned} \tag{22}
$$

The inner sum is exactly $B_{i+1}(\ell')$. Therefore,

$$
B_i(\ell) = \sum_{\ell'\in\{0,1\}} \rho(\ell,\ell')\, \phi_{i+1}(\ell')\, B_{i+1}(\ell'). \tag{23}
$$

For the binary case, this recursion is equivalently

$$
B_i(1) = p\, s_{i+1} B_{i+1}(1) + (1-p)(1-s_{i+1}) B_{i+1}(0), \tag{24}
$$

and

$$
B_i(0) = p\,(1-s_{i+1}) B_{i+1}(0) + (1-p)\, s_{i+1} B_{i+1}(1). \tag{25}
$$

The terminal values are

$$
B_T(1) = 1, \qquad B_T(0) = 1. \tag{26}
$$
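The backward recursion in Eqs. (24)-(26) can be sketched in the same way. The Python snippet below is an illustrative sketch: the function names and the example inputs are assumptions. It fills the messages from the last position to the first and checks them against a direct enumeration of suffixes as in Eq. (20). Positions are 0-based in the code.

```python
from itertools import product


def backward_messages(s, p_stay):
    """Return [(B_i(0), B_i(1))] for every position using Eqs. (24)-(26)."""
    T = len(s)
    messages = [(1.0, 1.0)] * T  # terminal values B_T(0) = B_T(1) = 1
    b0, b1 = 1.0, 1.0
    for i in range(T - 2, -1, -1):
        s_next = s[i + 1]
        # Tuple assignment so both updates read the messages of position i + 1.
        b0, b1 = (
            p_stay * (1.0 - s_next) * b0 + (1.0 - p_stay) * s_next * b1,  # Eq. (25)
            p_stay * s_next * b1 + (1.0 - p_stay) * (1.0 - s_next) * b0,  # Eq. (24)
        )
        messages[i] = (b0, b1)
    return messages


def backward_brute_force(s, p_stay, i, label):
    """Sum the weight of every suffix after 0-based position i, given z_i = label."""
    total = 0.0
    for suffix in product((0, 1), repeat=len(s) - 1 - i):
        w, prev = 1.0, label
        for offset, z_j in enumerate(suffix):
            s_j = s[i + 1 + offset]
            w *= s_j if z_j == 1 else 1.0 - s_j            # phi_j(z_j)
            w *= p_stay if prev == z_j else 1.0 - p_stay   # rho(z_{j-1}, z_j)
            prev = z_j
        total += w
    return total


scores = [0.9, 0.2, 0.8, 0.7]  # hypothetical raw classifier outputs s_i
B = backward_messages(scores, p_stay=0.9)
```

At the last position the enumeration visits only the empty suffix, so the brute-force value is the empty product, matching Eq. (21).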
### A.3 Posterior Marginal
The posterior marginal for smoothing label $\ell$ at position $i$ is

$$
q_i(\ell) = P(z_i=\ell \mid s_{1:T}) = \sum_{z_{1:T}:\, z_i=\ell} P(z_{1:T} \mid s_{1:T}). \tag{27}
$$

Substituting the normalized sequence weight gives

$$
q_i(\ell) = \frac{1}{Z(s_{1:T})} \sum_{z_{1:T}:\, z_i=\ell} \pi(z_1) \prod_{j=1}^{T} \phi_j(z_j) \prod_{j=2}^{T} \rho(z_{j-1}, z_j). \tag{28}
$$

Because the sequence weight has a first-order linear-chain factorization, fixing $z_i=\ell$ separates the unnormalized weight into prefix and suffix terms:

$$
\sum_{z_{1:T}:\, z_i=\ell} \pi(z_1) \prod_{j=1}^{T} \phi_j(z_j) \prod_{j=2}^{T} \rho(z_{j-1}, z_j) = F_i(\ell)\, B_i(\ell). \tag{29}
$$

Thus,

$$
q_i(\ell) = \frac{F_i(\ell)\, B_i(\ell)}{Z(s_{1:T})}. \tag{30}
$$

The normalizing constant can be decomposed at position $i$:

$$
Z(s_{1:T}) = \sum_{\ell'\in\{0,1\}} F_i(\ell')\, B_i(\ell'). \tag{31}
$$

Therefore,

$$
q_i(\ell) = \frac{F_i(\ell)\, B_i(\ell)}{\sum_{\ell'\in\{0,1\}} F_i(\ell')\, B_i(\ell')}. \tag{32}
$$

The final smoothed hallucination score is

$$
\tilde{s}_i = q_i(1) = \frac{F_i(1)\, B_i(1)}{F_i(0)\, B_i(0) + F_i(1)\, B_i(1)}. \tag{33}
$$
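Putting the pieces together, a minimal Python sketch of the smoothed score in Eq. (33) is given below. One practical detail is an assumption on our side rather than something stated in the derivation: the unnormalized messages shrink geometrically with $T$, so the sketch rescales each forward and backward message to sum to one. Because both labels at a position share the same scale factor, the factor cancels in the ratio of Eq. (32). The function name `smooth_scores` and the example inputs are hypothetical, and the sketch assumes $0 < p_{\mathrm{stay}} < 1$.

```python
def smooth_scores(s, p_stay):
    """Return the posterior q_i(1) of Eq. (33) for every token position."""
    T = len(s)

    # Forward pass (Eqs. 17-19), rescaled so that F_i(0) + F_i(1) = 1.
    fwd = []
    f0, f1 = 0.5 * (1.0 - s[0]), 0.5 * s[0]
    for i in range(T):
        if i > 0:
            f0, f1 = (
                (1.0 - s[i]) * (p_stay * f0 + (1.0 - p_stay) * f1),
                s[i] * (p_stay * f1 + (1.0 - p_stay) * f0),
            )
        norm = f0 + f1
        f0, f1 = f0 / norm, f1 / norm
        fwd.append((f0, f1))

    # Backward pass (Eqs. 24-26), rescaled in the same way.
    bwd = [(1.0, 1.0)] * T
    b0, b1 = 1.0, 1.0
    for i in range(T - 2, -1, -1):
        s_next = s[i + 1]
        b0, b1 = (
            p_stay * (1.0 - s_next) * b0 + (1.0 - p_stay) * s_next * b1,
            p_stay * s_next * b1 + (1.0 - p_stay) * (1.0 - s_next) * b0,
        )
        norm = b0 + b1
        b0, b1 = b0 / norm, b1 / norm
        bwd[i] = (b0, b1)

    # Posterior marginal for label 1 (Eq. 33); the scale factors cancel.
    return [
        (f1 * b1) / (f0 * b0 + f1 * b1)
        for (f0, f1), (b0, b1) in zip(fwd, bwd)
    ]


# Hypothetical raw scores with one isolated high value in the middle.
raw = [0.1] * 5 + [0.9] + [0.1] * 5
smoothed = smooth_scores(raw, p_stay=0.99)
```

With $p_{\mathrm{stay}}=0.5$ the transition factor is uniform and the smoothed scores reduce to the raw scores, which gives a quick correctness check. With a value close to one, the isolated high score in the example is pulled toward its neighbors.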
## Appendix B Sensitivity Analysis of $p_{\mathrm{stay}}$
Figure 4: Sensitivity of token-level hallucination detection performance to $p_{\mathrm{stay}}$. Dashed horizontal lines indicate raw scores before label-persistence smoothing.

Figure 5: Sensitivity of answer-level hallucination detection performance to $p_{\mathrm{stay}}$. Answer-level scores are obtained by taking the maximum over token-level hallucination scores. Dashed horizontal lines indicate raw scores before label-persistence smoothing.

We analyze the sensitivity of CORTEX to the self-loop probability $p_{\mathrm{stay}}$ used in label-persistence smoothing. This parameter controls the strength of span-level smoothing: larger values encourage neighboring tokens to share the same smoothing label, whereas smaller values, approaching $0.5$, keep the smoothed scores closer to the raw token-level classifier outputs.
Figures [4](https://arxiv.org/html/2606.31033#A2.F4) and [5](https://arxiv.org/html/2606.31033#A2.F5) show the results for token-level and answer-level evaluation, respectively. The dashed horizontal lines indicate the raw scores before label-persistence smoothing. Across both datasets and all open-weight LLMs, label-persistence smoothing improves over the raw scores for a wide range of $p_{\mathrm{stay}}$, indicating that the gains of CORTEX are not tied to a single finely tuned value.
For token-level detection, performance is generally stable when $p_{\mathrm{stay}}$ is large. On both RAGTruth and HalluRAG, AP increases as $p_{\mathrm{stay}}$ becomes larger and then forms a broad plateau around high values such as $0.99$, $0.993$, and $0.995$. AUROC shows a similar trend, with improvements over the raw scores maintained across a wide range of high $p_{\mathrm{stay}}$ values. This behavior is consistent with the role of label-persistence smoothing in token-level hallucination detection: hallucinated content typically appears as contiguous spans, and therefore stronger self-transition probabilities help suppress isolated noisy predictions while preserving coherent hallucinated regions.
For answer-level detection, the optimal $p_{\mathrm{stay}}$ tends to be smaller than in token-level evaluation. This is because answer-level scores are obtained by aggregating token-level scores with the maximum operator. If $p_{\mathrm{stay}}$ is too large, smoothing may overly spread high scores across neighboring tokens or make span boundaries too persistent, which can affect the maximum score used for answer-level classification. Nevertheless, label-persistence smoothing remains beneficial across a broad range of values, and the selected setting $p_{\mathrm{stay}}=0.93$ provides strong and stable answer-level performance.
Based on these results, we use $p_{\mathrm{stay}}=0.993$ for token-level evaluation and $p_{\mathrm{stay}}=0.93$ for answer-level evaluation in the main experiments. We use a single value for each evaluation granularity rather than tuning $p_{\mathrm{stay}}$ separately for each dataset and open-weight LLM. This setting keeps the evaluation protocol simple while still capturing the main benefit of label-persistence smoothing: converting locally noisy token-level scores into more span-consistent hallucination estimates.
## Appendix C Experimental Details
#### MLP classifier\.
For all methods that require a supervised classifier, including CORTEX and MLP-based baselines, we use the same three-layer MLP architecture to isolate the effect of the input features. The classifier consists of two hidden linear layers with 256 and 128 hidden units, respectively, followed by a final linear layer that produces a scalar output logit. Each hidden layer is followed by a ReLU activation and dropout with a rate of 0.1. We train the classifier using AdamW with a learning rate of $1\times 10^{-3}$ and a weight decay of $1\times 10^{-4}$. The batch size is set to 4096, and the model is trained for 10 epochs. The same hyperparameters are used across all datasets, models, and methods unless otherwise stated.
#### Representation extraction\.
To obtain internal representations from the open-weight analysis model $M_{\mathrm{open}}$, we do not perform autoregressive generation. Instead, we feed the constructed input sequence to $M_{\mathrm{open}}$ and extract the internal representations computed in a single forward pass. This implementation matches the post-hoc setting considered in this work: the answer text has already been generated by the closed-weight model, and $M_{\mathrm{open}}$ is used only to analyze the given answer.
#### Computational cost\.
All experiments are conducted on a single NVIDIA A100 GPU\. CORTEX is lightweight in practice: on RAGTruth, feature extraction, classifier training, and evaluation take approximately 20 minutes in total, while on HalluRAG they take approximately 5 minutes in total\. This indicates that CORTEX incurs only modest computational overhead while providing token\-level hallucination scores\.
#### Artifact Licenses and Use\.
We use RAGTruth and HalluRAG as existing public evaluation benchmarks and do not redistribute the datasets\. RAGTruth is released under the MIT License\. HalluRAG is made publicly available by its authors through their repository and dataset DOI, but we did not find an explicit dataset license in the available documentation\. All datasets are used solely for research evaluation, and we cite their original sources\.
#### Dataset Documentation\.
We use only publicly available English\-language RAG hallucination benchmarks and do not collect any new data\. Our experiments focus on hallucination detection in generated RAG outputs; we do not use, infer, or analyze demographic attributes\.
## Appendix D Effect of Label-persistence Smoothing on Token-Level Baselines
Table 5: Effect of label-persistence smoothing on token-level baselines on RAGTruth. Improvements for both SAPLMA and RAGLens indicate that span-level post-processing is broadly useful.

Table 6: Effect of applying smoothing to token-level baselines on HalluRAG. Label-persistence smoothing improves all methods substantially, including SAPLMA and RAGLens.

Label-persistence smoothing is a post-processing module that can be applied to any method that produces token-level hallucination scores. To examine whether the gains of CORTEX are due only to label-persistence smoothing, we apply the same smoothing procedure to two strong token-level baselines, SAPLMA and RAGLens, and compare them with CORTEX under the same settings. Tables [5](https://arxiv.org/html/2606.31033#A4.T5) and [6](https://arxiv.org/html/2606.31033#A4.T6) report the results on RAGTruth and HalluRAG, respectively.
Before applying label-persistence smoothing, CORTEX achieves the best performance in all settings across both datasets and all open-weight LLMs. This result indicates that the comparative representation features of CORTEX already provide a stronger token-level signal than the single-view representation features used by the baselines. After label-persistence smoothing is applied, the performance of SAPLMA and RAGLens improves substantially, confirming that span-level post-processing is broadly useful for reducing local prediction noise. Nevertheless, CORTEX remains strongest in AP across all RAGTruth settings and remains competitive or superior in most HalluRAG settings. These results suggest that label-persistence smoothing is beneficial as a generic post-processing step, but that it does not by itself account for the advantage of CORTEX.
In particular, the comparison before smoothing isolates the effect of the underlying token\-level scoring function: CORTEX outperforms the baselines without relying on span\-level post\-processing\. The comparison after smoothing further shows that CORTEX can benefit from the same generic smoothing procedure while retaining the advantage of its paired reference\-conditioned and no\-reference representation comparison\. Thus, label\-persistence smoothing and the comparative representation features play complementary roles: the former improves span consistency, whereas the latter provides a stronger reference\-grounded hallucination signal\.
## Appendix E Additional Heatmap-Based Ablation Case Studies
Figure 6: Ablation case study with hallucination spans of different granularities. Without label-persistence smoothing, hallucination scores are more fragmented and localized. With label-persistence smoothing, nearby high-score regions are connected into broader span-consistent predictions, although this can also merge distinct hallucinated fragments into a wider continuous region.

Figure 7: Ablation case study involving groundless advice. The answer includes additional advice that may be informative but is not grounded in the provided references. CORTEX assigns high hallucination scores to this region, reflecting its role as a detector of reference support rather than a general factuality or usefulness judge.

We present additional heatmap-based case studies to qualitatively analyze the behavior of each component in CORTEX. The heatmaps compare the ground-truth hallucination spans with the predictions of the full CORTEX model, CORTEX without label-persistence smoothing, and CORTEX without the contextual residual $c$. Darker colors indicate higher hallucination scores.
Figure [6](https://arxiv.org/html/2606.31033#A5.F6) shows an example in which the annotated hallucination spans vary in granularity. The ground-truth annotation contains both a broad hallucinated region spanning multiple sentences and more compact hallucinated fragments. Without label-persistence smoothing, CORTEX assigns high scores at relatively fine-grained units. This behavior suggests that the raw token-level classifier can capture localized hallucination signals, but its predictions are fragmented and locally unstable. After label-persistence smoothing, these fragmented high-score regions are connected into a more coherent span, producing predictions that better reflect the span-level nature of hallucination annotations. At the same time, this example also illustrates a trade-off introduced by smoothing: when hallucinated evidence appears in several nearby but distinct regions, label-persistence smoothing may merge them into a broader continuous span. Thus, label-persistence smoothing improves span consistency, but may reduce boundary precision in cases where hallucinated content is interleaved with faithful tokens.
The same example also illustrates the role of the contextual residual $c$. When $c$ is removed, the model tends to assign high scores more broadly in regions where the current token is influenced by preceding generated context. This behavior is consistent with the motivation for the residual feature: $\Delta h$ alone captures reference-induced changes at the current token, but it does not explicitly account for reference influence that has already been expressed through earlier answer tokens. By incorporating $c$, CORTEX can distinguish direct token-level reference sensitivity from deviations relative to the preceding context, leading to more controlled localization.
Figure [7](https://arxiv.org/html/2606.31033#A5.F7) shows a different type of error case in which the answer includes additional advice that is not supported by the provided references. The advice is informative and may be reasonable from the model's parametric knowledge, but it is not grounded in the supplied passages. CORTEX assigns high hallucination scores to this region because its objective is to detect whether the answer is supported by the references, not whether the content is generally plausible or useful.
This case highlights an important ambiguity in reference\-grounded hallucination detection\. In controlled RAG applications, such as enterprise customer support, advice that is not grounded in references can be undesirable or risky even when it is factually plausible, because the system is expected to answer only from those documents\. In such settings, flagging unsupported advice is a desirable behavior\. In contrast, in open\-ended assistant scenarios, suppressing all useful but reference\-unsupported content may overly restrict the capabilities of the LLM\. Therefore, the appropriate treatment of such content depends on the application: CORTEX should be interpreted as a detector of reference support, rather than a general judge of factual correctness or utility\.
This example also clarifies the scope of CORTEX\. Because the method compares internal representations under inputs with and without references, it is sensitive to whether a token is grounded in the provided evidence\. Consequently, content generated from parametric knowledge alone can receive high hallucination scores if it is not supported by the references\. This behavior is aligned with the intended design of CORTEX for reference\-grounded generation, but it should be considered when applying the method to settings where answers are allowed to go beyond the retrieved references\.