Sanity Checks for Long-Form Hallucination Detection

arXiv cs.CL Papers

Summary

This paper introduces a controlled-invariance methodology and two oracle tests (Force and Remove) to determine if LLM hallucination detectors rely on reasoning traces or final answer artifacts. It proposes TRACT, a lightweight scorer using lexical features, which demonstrates robust performance independent of answer-level cues.

arXiv:2605.08346v1 Announce Type: new Abstract: Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: \textsc{Force}, which replaces each response's final answer with the ground truth while preserving the reasoning trace, and \textsc{Remove}, which strips answer-announcement steps while leaving the trajectory intact. This reveals if their predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces. These findings suggest that the current central challenge in reasoning-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues.

Cached at: 05/12/26, 06:41 AM

# Sanity Checks for Long-Form Hallucination Detection
Source: [https://arxiv.org/html/2605.08346](https://arxiv.org/html/2605.08346)
Geigh Zollicoffer \(Los Alamos National Laboratory\), Minh Vu \(Los Alamos National Laboratory\), Hongli Zhan \(The University of Texas at Austin\), Raymond Li \(University of British Columbia\), Manish Bhattarai \(Los Alamos National Laboratory\)

###### Abstract

Hallucination detection methods for large language models increasingly operate on chain\-of\-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer\. We introduce a controlled\-invariance methodology that exposes this distinction through two oracle tests: Force, which replaces each response’s final answer with the ground truth while preserving the reasoning trace, and Remove, which strips answer\-announcement steps while leaving the trajectory intact\. These tests reveal whether a detector’s predictive power derives from answer\-level artifacts rather than from the structure or validity of intermediate reasoning\. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features \(hedging trends, step\-length dynamics, and cross\-response vocabulary convergence\), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces\. These findings suggest that the current central challenge in reasoning\-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues\.

## 1 Introduction

As large language models \(LLMs\) are increasingly used for reasoning and decision support, their reliability depends on detecting when generated outputs are unsupported, inconsistent, or false\. Hallucination detection addresses this problem, but a central question remains unresolved: do current detectors assess the reasoning process itself, or do they mainly exploit surface correlates of the final answer?

This question is especially important for long\-form reasoning\. Recent hallucination and uncertainty detectors increasingly operate on chain\-of\-thought traces, comparing sampled responses, measuring semantic agreement, or scoring reasoning\-path consistency \(Wang et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib5)\)\. Yet strong performance on unmodified traces does not by itself show that a detector is reasoning\-faithful\. A method may appear to evaluate intermediate reasoning while actually relying on endpoint cues, answer formatting, response length, or coarse agreement among final answers\. In that case, reported gains may overstate progress toward genuine reasoning assessment\.

We introduce a controlled\-invariance framework for exposing this failure mode\. The key idea is simple: if a detector claims to evaluate the reasoning trajectory, then transformations that preserve the reasoning body should not destroy its ability to distinguish correct from incorrect reasoning\. We instantiate this idea with two oracle sanity checks, illustrated in Figure [1](https://arxiv.org/html/2605.08346#S1.F1)\. In Force, we replace the final answer with the ground truth and canonicalize its presentation, while leaving the intermediate reasoning unchanged\. In Remove, we delete explicit answer\-announcement steps, again preserving the reasoning body\. Neither intervention repairs a flawed derivation or corrupts a valid one\. Thus, a trace\-faithful detector should remain informative under both conditions; large shifts indicate dependence on answer\-level artifacts rather than reasoning evidence\.

![Refer to caption](https://arxiv.org/html/2605.08346v1/Figures/teaser.png)

Figure 1: Two sanity\-check operations\. Force replaces only the final answer with the ground truth; Remove deletes explicit answer\-announcement steps\. Both preserve the reasoning body, so a trace\-faithful detector should remain informative\.

Applying these tests across four benchmarks and five models reveals that many existing detectors are less trace\-faithful than standard evaluations suggest\. As shown in Figure [2](https://arxiv.org/html/2605.08346#S1.F2), several methods move far from the diagonal under Force or Remove, meaning their discriminative behavior changes substantially even though the intermediate reasoning trajectory is preserved\. This is not merely a calibration issue: it indicates that some detectors obtain much of their signal from endpoint availability, answer standardization, or other artifacts that are orthogonal to reasoning quality\.

![Refer to caption](https://arxiv.org/html/2605.08346v1/Figures/real_normal_vs_both_paired.png)

Figure 2: Sanity\-check results across four benchmarks and five models\. Each point is one scorer–model–benchmark experiment; $x$ is AUC on original traces and $y$ is AUC after Force or Remove\. Trace\-faithful scorers should remain near the diagonal because the reasoning body is preserved\. TRACT has the highest number of faithful settings under both interventions\.

We then ask whether robust trace\-level detection requires complex learned representations\. Surprisingly, it does not\. We propose TRACT, a lightweight black\-box scorer built from lexical trajectory features: local coherence cues, structural dynamics such as hedge and step\-length trends, and cross\-sample content convergence\. TRACT is not a proof checker and does not verify each intermediate step\. Instead, it targets observable symptoms of unsettled reasoning: traces that wander, hedge increasingly, become structurally irregular, or fail to converge across independent samples\. Because these features are computed from the reasoning body rather than the endpoint string, TRACT is naturally suited to the Force/Remove setting\.

Our results support two conclusions\. First, oracle robustness testing is a necessary sanity check for reasoning\-aware hallucination detection: without it, a detector may appear successful while relying on answer\-level artifacts\. Second, useful trace\-level signal exists in simple, interpretable trajectory statistics\. The challenge is therefore not only to design stronger detectors, but to evaluate whether their strength comes from the reasoning process they are meant to assess\.

## 2 Background and Related Work

#### Sampling\-based uncertainty\.

A common black\-box approach to hallucination detection is to sample multiple responses and measure their disagreement: models that know the answer should produce mutually consistent outputs, while hallucinated or uncertain generations tend to diverge\. Semantic entropy \(Farquhar et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib12); Kuhn et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib24)\) formalizes this idea by clustering meaning\-equivalent responses and computing entropy over the resulting semantic classes\. Subsequent work refines either the representation or the disagreement measure: Kernel Language Entropy replaces hard clustering with a continuous similarity kernel \(Nikitin et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib8)\), while embedding\-based methods such as SINdex \(Abdaljalil et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib9)\) and Semantic Embedding Uncertainty \(SEU\) \(Grewal et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib7)\) use dense sentence representations to estimate inconsistency more efficiently\. Perturbation\-based variants further sample over input transformations rather than only model randomness \(Gao et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib15)\)\. These methods provide strong black\-box uncertainty signals, but their scores are usually computed over complete responses or final\-answer semantics, making it difficult to tell whether they measure reasoning quality or endpoint agreement\.

#### Reasoning\-trace\-aware detection\.

Chain\-of\-thought prompting \(Wei et al\., [2022](https://arxiv.org/html/2605.08346#bib.bib47)\) and long\-form reasoning models make the intermediate trajectory observable, motivating detectors that evaluate not only what answer is produced but how the answer is reached\. RACE \(Wang et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib5)\) is representative of this direction, combining inter\-sample reasoning\-path consistency, answer uncertainty, reasoning–answer alignment, and intra\-trace coherence\. Such methods are closer to reasoning\-aware hallucination detection than answer\-only uncertainty estimators, but they also introduce a new evaluation problem: high performance may still come from endpoint cues, answer alignment, or coarse consistency rather than from trace\-faithful assessment of the reasoning body\. Our Force and Remove tests are designed to expose this distinction\.

#### Unified UQ frameworks and baselines\.

Recent unified frameworks collect many of these signals into calibrated uncertainty pipelines\. For example, uqlm \(Bouchard et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib6)\) includes exact\-match repetition and diversity scores \(Cole et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib19)\), $n$\-gram and BERTScore self\-consistency \(Manakul et al\., [2023a](https://arxiv.org/html/2605.08346#bib.bib20); Zhang et al\., [2020](https://arxiv.org/html/2605.08346#bib.bib21)\), NLI\-based non\-contradiction probability \(Chen and Mueller, [2023](https://arxiv.org/html/2605.08346#bib.bib22)\), sentence\-embedding similarity \(Reimers and Gurevych, [2019](https://arxiv.org/html/2605.08346#bib.bib23)\), and semantic entropy variants \(Farquhar et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib12); Kuhn et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib24)\)\. Surveys similarly emphasize uncertainty quantification as a central route to improving LLM reliability, while highlighting trade\-offs among accuracy, cost, access requirements, and interpretability \(Shorinwa et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib16); Kang et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib18)\)\. We use these families as black\-box baselines because they cover the main operational signals in current hallucination detection: answer repetition, lexical overlap, embedding similarity, NLI agreement, semantic entropy, and reasoning\-path consistency\.

#### White\-box versus black\-box access\.

White\-box detectors use token probabilities or hidden\-state activations to estimate reliability in a single pass \(Duan et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib1); Zollicoffer et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib10); Phukan et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib11); Binkowski et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib17); Fadeeva et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib25)\)\. These methods can be effective, but require model internals that are unavailable for many closed\-source systems\. We therefore focus on the black\-box setting, where the detector observes only sampled text traces\. Within this setting, our goal is not merely to improve AUC, but to test whether a detector’s signal remains valid when answer\-level artifacts are controlled\.

## 3 TRACT: Trace Rhetorical and Coherence Trajectory

A correct reasoning trace tends to*settle*\. As the model approaches a solution, its steps usually become more directed: the vocabulary stabilises, intermediate claims become more consistent, and independently sampled traces begin to resemble one another in how they progress and where they end\. By contrast, an incorrect trace often*wanders*: it asks itself unnecessary questions, restarts or hedges, expands when it should compress, and disagrees with parallel samples about how many steps the problem requires\.

TRACT operationalises this observation in a fully black\-box sampling setting\. Given a prompt $x$, we sample $K$ independent reasoning traces from the model, $\{\mathbf{r}^{(k)}\}_{k=1}^{K}$\. Each trace is written as a sequence of textual steps, $\mathbf{r}^{(k)}=(s^{(k)}_{1},\ldots,s^{(k)}_{T_{k}})$\. A textual step is the smallest reasoning unit exposed in the response, such as a numbered line, bullet point, sentence\-level inference, or explicitly separated intermediate statement\. For example, in a chain\-of\-thought response, “First, compute the total cost” and “Therefore, the remaining amount is 12” would be treated as two separate steps\. Answer\-announcement steps such as “Final Answer:” are excluded before TRACT features are computed, so the scorer operates on the reasoning body rather than the endpoint string\.
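As a rough illustration of the segmentation described above, the sketch below splits a trace into one step per non-empty line and drops answer-announcement steps. The function names, the list-marker regex, and the announcement pattern are assumptions made for this example; the actual step-extraction rules are those of Appendix C, which are not reproduced here.

```python
import re

# Hypothetical announcement pattern; the paper's own rules may differ.
ANSWER_PATTERN = re.compile(r"^\s*(?:final answer|the answer is)\b", re.IGNORECASE)
# Leading list markers such as "1.", "2)", "-", "*" or a bullet character.
LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*\u2022])\s+")


def split_steps(trace: str) -> list[str]:
    """Treat every non-empty line as one textual step, stripping list markers."""
    steps = []
    for line in trace.splitlines():
        cleaned = LIST_MARKER.sub("", line).strip()
        if cleaned:
            steps.append(cleaned)
    return steps


def reasoning_body(steps: list[str]) -> list[str]:
    """Drop answer-announcement steps so only the reasoning body remains."""
    return [s for s in steps if not ANSWER_PATTERN.match(s)]


example = (
    "1. First, compute the total cost.\n"
    "2. Therefore, the remaining amount is 12.\n"
    "Final Answer: 12"
)
body = reasoning_body(split_steps(example))
```

Line-based splitting is the simplest possible choice; a sentence-level splitter would be needed for prose-heavy traces that are not already broken into lines.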

TRACT does not require access to logits, hidden states, answer labels, embedding models, or auxiliary entailment models\. Instead, it reads the sampled traces themselves and extracts lightweight trajectory\-level features that describe how the reasoning behaves\. The features are grouped into three blocks, each corresponding to a different diagnostic question: i\) Coherence: what does each trace look like on average? ii\) Structure: how does the trace evolve as reasoning progresses? iii\) Content: do independent traces converge on the same intermediate and final vocabulary?

Table [1](https://arxiv.org/html/2605.08346#S3.T1) summarises the full TRACT feature set\. The table is organised around the observable signature of each feature: what pattern it detects, whether that pattern increases the hallucination score, and how the statistic is computed\. The prose below explains why these signatures are useful\.

Table 1: TRACT feature inventory\. TRACT maps sampled reasoning traces into three interpretable feature blocks\. The intuition column gives the diagnostic signature; the definition column gives the implementation\-level statistic\. Here $T_{k}$ is the number of steps in trace $k$, $s^{(k)}_{i}$ is step $i$, $w^{(k)}_{i}=|s^{(k)}_{i}|$ is its word count, $q^{(k)}_{i}$ is its question\-mark count, $h^{(k)}_{i}$ is the number of hedge words from lexicon $\mathcal{H}$ in the step, $\mathcal{U}^{(k)}_{i}$ is the lowercased unigram set, $\mathcal{E}^{(k)}_{i}$ is the set of capitalised tokens, $m_{k}=\lfloor T_{k}/2\rfloor$, and $J(A,B)=|A\cap B|/|A\cup B|$ is Jaccard similarity\. Sign indicates whether larger feature values increase $(+)$ or decrease $(-)$ the TRACT incorrectness score\.

#### Coherence features: does each trace sound locally resolved?

The coherence block captures whether a trace has the local rhetorical profile of a settled solution\. When a model knows how to proceed, its steps tend to state intermediate claims directly and move on\. When it is uncertain, the trace often becomes interrogative, over\-qualified, or repetitive\. QuestionRate captures explicit self\-questioning, WordsPerStep captures verbosity and over\-explanation, and PlateauFrac captures steps that fail to develop\.

This block helps because many reasoning failures are visible before the final answer: the model circles around the problem, asks itself what to do, or spends words compensating for a missing solution path\. Coherence features are therefore weak but useful symptoms of local unresolvedness\.
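To make two of these statistics concrete, the sketch below computes per-trace versions of QuestionRate and WordsPerStep from the quantities named in the Table 1 caption (question-mark count and word count per step). The exact normalisation is an assumption: here both are simple averages over the steps of one trace. PlateauFrac is left out because its definition is not recoverable from the surviving text.

```python
def question_rate(steps: list[str]) -> float:
    """Assumed definition: mean number of question marks per step."""
    if not steps:
        return 0.0
    return sum(step.count("?") for step in steps) / len(steps)


def words_per_step(steps: list[str]) -> float:
    """Assumed definition: mean whitespace-delimited word count per step."""
    if not steps:
        return 0.0
    return sum(len(step.split()) for step in steps) / len(steps)


# Toy trace: two of the three steps contain a self-question.
toy_steps = [
    "What should I do first?",
    "Compute 3 + 4 = 7.",
    "Is that right? Maybe.",
]
rate = question_rate(toy_steps)      # 2 question marks over 3 steps
verbosity = words_per_step(toy_steps)  # 15 words over 3 steps
```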

#### Structure features: does the trace keep its trajectory?

The structure block captures whether reasoning becomes more organised or more unstable over time\. A correct trace can be long or short, but it usually maintains a coherent trajectory: decomposition, intermediate work, and convergence\. An incorrect trace more often loses this trajectory\. Hedging may increase, step lengths may become irregular, or one sampled trace may become much longer than the others because the model cannot settle on a path\.

TRACT measures this with five structural signatures\. HedgeSlope tracks whether uncertainty language grows over time \(Lakoff, [1973](https://arxiv.org/html/2605.08346#bib.bib3); Katerenchuk and Levitan, [2024](https://arxiv.org/html/2605.08346#bib.bib42)\)\. ColonFrac captures explicit organisation through cases, lists, or subclaims\. MaxStepWc captures whether the trace contains a concentrated reasoning step\. ScMax flags an outlier\-length trace among the $K$ samples\. WcVarSlope tracks whether the rhythm of step lengths becomes increasingly unstable \(Jin et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib45)\)\.

This block helps because reasoning quality is often dynamic: the important signal is not only what the trace contains, but how it changes\(Vanhoyweghenet al\.,[2025](https://arxiv.org/html/2605.08346#bib.bib43)\)\.
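A trend feature such as HedgeSlope can be pictured as the slope of a per-step count against the step index. The sketch below fits an ordinary least-squares line to hedge counts; the small `HEDGES` lexicon and the tokenisation are placeholders invented for this example, and the paper's own slope computation (Appendix C) may differ.

```python
# Placeholder hedge lexicon, not the paper's lexicon.
HEDGES = {"maybe", "perhaps", "possibly", "might", "probably", "seems"}


def hedge_counts(steps: list[str]) -> list[int]:
    """Count hedge words in each step after stripping trailing punctuation."""
    return [
        sum(tok.strip(".,;:!?").lower() in HEDGES for tok in step.split())
        for step in steps
    ]


def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


toy_steps = [
    "The total is 12.",
    "Maybe the total is 13.",
    "Perhaps it might be 14.",
]
hedge_slope = slope(hedge_counts(toy_steps))  # counts rise 0, 1, 2
```

A positive slope means hedging increases as the trace proceeds, which is the pattern the text associates with unsettled reasoning.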

#### Content features: do independent traces converge?

The content block captures whether independent samples appear to reason toward the same state\. Correct traces need not be identical, but they often share key concepts at the midpoint and converge to similar terminal vocabulary\. Incorrect traces are more likely to scatter: one sample pursues one interpretation, another pursues a different one, and their intermediate or final steps share little lexical overlap\.

TRACT measures this with lightweight lexical agreement\. MidUnigramDiv asks whether traces agree around the middle of the reasoning process\. FinalUnigramDiv asks whether they agree near the end\. EntityRepeat captures a different content failure: a single trace repeatedly revisiting the same named entity instead of advancing the reasoning state \(Yao et al\., [2025](https://arxiv.org/html/2605.08346#bib.bib41); Duan et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib46)\)\. This block helps because uncertainty is often expressed across samples\. Even when each individual trace looks fluent, disagreement among independently sampled trajectories can reveal that the model has not identified a stable solution\.
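One plausible reading of MidUnigramDiv, using the Jaccard similarity and midpoint index from the Table 1 caption, is one minus the mean pairwise Jaccard similarity between the unigram sets of each trace's midpoint step. The sketch below follows that reading; the aggregation over pairs, the zero-based midpoint index, and the tokeniser are assumptions rather than the paper's stated definition.

```python
import re
from itertools import combinations


def unigrams(step: str) -> set[str]:
    """Lowercased alphanumeric unigram set of a step."""
    return set(re.findall(r"[a-z0-9]+", step.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """J(A, B) = |A intersect B| / |A union B|, with J = 1 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mid_unigram_div(traces: list[list[str]]) -> float:
    """Assumed form: 1 - mean pairwise Jaccard of midpoint-step unigram sets."""
    mids = [unigrams(t[len(t) // 2]) for t in traces if t]
    pairs = list(combinations(mids, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


agreeing = [["x", "add the costs", "done"], ["y", "Add the costs", "z"]]
scattered = [["a", "alpha beta", "c"], ["a", "gamma delta", "c"]]
low = mid_unigram_div(agreeing)    # identical midpoints
high = mid_unigram_div(scattered)  # disjoint midpoints
```

The same helper applied to the last step of each trace would give an analogous end-of-trace divergence.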

#### Scoring

TRACT maps the three feature blocks into a single incorrectness score\. Each feature is robust\-scaled using median centring and IQR normalisation, and then clipped to $[-3,3]$ to limit the effect of extreme traces\. Within each block, features use fixed equal\-magnitude weights, with signs given in Table [1](https://arxiv.org/html/2605.08346#S3.T1)\. The structure block always contributes to the score\. This block captures trajectory dynamics \(hedge trends, step\-count outliers, and step\-length irregularity\) that remain meaningful across both terse and verbose reasoning styles\. The coherence and content blocks are more style\-dependent\. They are most reliable when responses are explicitly segmented into step\-wise reasoning; in prose\-heavy traces, the same surface cue can change meaning\. For example, self\-questioning may signal confusion in terse chain\-of\-thought, but careful exposition in a long explanatory paragraph\. TRACT therefore gates the coherence and content blocks using the raw WordsPerStep value\.
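The scaling step can be sketched as follows: subtract the median, divide by the interquartile range, and clip to $[-3,3]$. Which population the median and IQR are estimated on, and the quartile convention, are not specified in this section; the example fits them on the list it is given and uses the inclusive quartile method of Python's `statistics` module, both of which are assumptions.

```python
import statistics


def robust_scale(values: list[float], clip: float = 3.0) -> list[float]:
    """Median-centre, IQR-normalise, then clip each value to [-clip, clip]."""
    if len(values) < 2:
        return [0.0 for _ in values]
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate spread: treat every value as sitting at the median.
        return [0.0 for _ in values]
    return [max(-clip, min(clip, (v - median) / iqr)) for v in values]


# One extreme trace (100) is clipped instead of dominating the scale.
scaled = robust_scale([1, 2, 3, 4, 100])
```

Because the median and IQR ignore the tails, the single outlier does not distort the scaling of the other four values, and clipping bounds its own contribution.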

Formally, TRACT computes

$$\mathrm{TRACT}(x)=\mathbf{w}_{\mathrm{struct}}^{\top}\hat{\phi}_{\mathrm{struct}}+\bigl(1-\alpha(x)\bigr)\left(\mathbf{w}_{\mathrm{coh}}^{\top}\hat{\phi}_{\mathrm{coh}}+\mathbf{w}_{\mathrm{cont}}^{\top}\hat{\phi}_{\mathrm{cont}}\right),$$

where

$$\alpha(x)=\exp\!\left(-\frac{1}{2\sigma^{2}}\bigl(\bar{w}-\mu\bigr)^{2}\right).$$

Here $\hat{\phi}_{b}$ is the robust\-scaled feature vector for block $b$, and $\bar{w}$ is the raw WordsPerStep value before scaling\. When $\bar{w}$ lies near the prose\-heavy regime centred at $\mu$, $\alpha(x)$ approaches one, suppressing the coherence and content terms\. When $\bar{w}$ is far from this regime, $\alpha(x)$ approaches zero and all three blocks contribute\. Implementation details for step extraction, hedge counting, entity extraction, and slope computation are provided in Appendix [C](https://arxiv.org/html/2605.08346#A3)\.

Table [2](https://arxiv.org/html/2605.08346#S3.T2) summarises the external components required by each detector\. This comparison is not the source of TRACT’s accuracy; it clarifies its practical setting\. Once sampled traces are available, TRACT is a text\-only trajectory scorer and does not require an auxiliary NLI model, embedding model, answer parser, or access to model internals\.

Table 2: Method requirements\. A checkmark indicates that a detector requires the corresponding external component\. TRACT operates directly on sampled reasoning trajectories and does not require external NLI or embedding models\.

Unlike NLI\- and embedding\-based baselines, TRACT computes all features directly from surface trajectory statistics\. This makes the scorer inexpensive to run, transparent to inspect, and compatible with closed\-source models where only generated text is available\. Figure [3](https://arxiv.org/html/2605.08346#S4.F3) evaluates whether TRACT’s three feature blocks contribute complementary signal\. The full scorer is strongest or near\-strongest across benchmarks, indicating that coherence, structure, and content capture different failure signatures rather than redundant variants of the same cue\.

## 4 Experiments

#### Benchmarks

We evaluate on four diverse reasoning benchmarks spanning distinct reasoning modalities and difficulty regimes: BBH\-Tracking \(Suzgun et al\., [2022](https://arxiv.org/html/2605.08346#bib.bib37)\), which requires multi\-step state tracking under sequential object permutations; GPQA Diamond \(Rein et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib38)\), which tests graduate\-level scientific reasoning resistant to surface\-level retrieval; MATH\-500 \(Lightman et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib35)\), which covers competition\-level mathematics across seven subject areas where intermediate step quality is predictive of correctness; and CausalT5K \(Geng et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib36)\), which probes structural causal reasoning under adversarial narrative pressure across Pearl’s three rungs\. For evaluation on CausalT5K, we use the D6 setting, which covers environment and climate issues, under all three rungs\. Together, these benchmarks exercise TRACT’s trace\-level features across symbolic, scientific, mathematical, and causal reasoning domains \(for additional details see Appendix [A](https://arxiv.org/html/2605.08346#A1)\)\.

#### Models

For each benchmark, we sample responses from a mixture of open\-weight and proprietary large language models spanning several model families: Nemotron\-30B \(NVIDIA Corporation, [2025](https://arxiv.org/html/2605.08346#bib.bib30)\), GPT\-OSS\-120B \(OpenAI, [2025](https://arxiv.org/html/2605.08346#bib.bib34)\), LLaMA\-3\-70B \(Touvron and others, [2024](https://arxiv.org/html/2605.08346#bib.bib31)\), Amazon Nova Pro \(Amazon Web Services, [2025](https://arxiv.org/html/2605.08346#bib.bib33)\), and Gemma\-3\-27B \(Google DeepMind, [2025](https://arxiv.org/html/2605.08346#bib.bib32)\)\. Additional model details, prompting, sampling hyperparameters, and step\-extraction procedures are provided in Appendix [B](https://arxiv.org/html/2605.08346#A2)\.

#### Scorers

We compare against seven black\-box UQ scorers with recommended settings: Exact\-Match Repetition \(EMR\) \(Bouchard et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib6)\), NLI\-based Non\-Contradiction Probability \(NCP\) \(Manakul et al\., [2023b](https://arxiv.org/html/2605.08346#bib.bib40)\), BERTScore Consistency \(BSC\) \(Zhang et al\., [2020](https://arxiv.org/html/2605.08346#bib.bib21)\), Normalized Cosine Similarity \(NCS\) \(Reimers and Gurevych, [2019](https://arxiv.org/html/2605.08346#bib.bib23)\), Normalized Semantic Negentropy \(NSN\) \(Farquhar et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib12); Bouchard et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib6)\), RACE \(Wang et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib5)\), and Semantic Embedding Uncertainty \(SEU\) \(Grewal et al\., [2024](https://arxiv.org/html/2605.08346#bib.bib7)\)\. All methods receive the same $10$ sampled traces\. We evaluate TRACT with $\mu=28$ words per step and $\sigma^{2}=50$\. Following prior work \(Wang et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib5); Kuhn et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib24)\), we report AUC as the primary metric\. Details of TRACT features are given in Table [1](https://arxiv.org/html/2605.08346#S3.T1) and Appendix [C](https://arxiv.org/html/2605.08346#A3)\.
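For readers who want to reproduce the metric without extra dependencies, AUC can be computed as the probability that a randomly chosen incorrect response receives a higher score than a randomly chosen correct one, with ties counted as one half. The sketch below is a generic pairwise implementation, not the evaluation code used in the paper; the label convention (1 = incorrect, 0 = correct) is an assumption chosen to match an incorrectness score.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Pairwise AUC: P(score of a positive > score of a negative), ties = 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Perfect separation: both incorrect responses outscore both correct ones.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to chance and values below 0.5 indicate an inverted ranking, which is how the near-inversion reported for some detectors should be read.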

In contrast to prior methods that rely heavily on final answer signatures to assess correctness, we design controlled interventions to isolate trajectory\-level signals from answer\-level artifacts\. Specifically, we restrict evaluation to sets of sampled traces where the final answer is no longer a valid separator between incorrect and correct responses, ensuring that variation in scores arises from differences in reasoning structure, rhetorical coherence, and cross\-trace consistency rather than answer disagreement\.

We introduce two complementary conditions\. In Force, we replace each response’s final answer with the ground truth and canonicalize the announcement of the final answer, producing trajectories whose final answers are correct and uniform; this isolates whether a scorer detects inconsistencies independent of answer correctness\. In Remove, we strip explicit answer\-announcement steps, preserving natural reasoning trajectories while eliminating explicit answer signals\. Each response keeps its original label, so performance is measured against the correctness of the unmodified response\. Together, these settings test whether incorrectness is reflected in the internal dynamics of reasoning trajectories, rather than the terminal output, providing a stricter evaluation of trajectory\-level uncertainty signals\.
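The two interventions can be sketched on a trace that has already been split into steps. The announcement regex and the canonical `Final Answer:` template below are assumptions for illustration; the paper's exact matching and canonicalisation rules are not given in this section.

```python
import re

# Hypothetical matcher for answer-announcement steps.
ANNOUNCEMENT = re.compile(r"^\s*(?:final answer|the answer is)\b", re.IGNORECASE)


def remove(steps: list[str]) -> list[str]:
    """Remove: delete answer-announcement steps, keep the reasoning body."""
    return [s for s in steps if not ANNOUNCEMENT.match(s)]


def force(steps: list[str], ground_truth: str) -> list[str]:
    """Force: keep the reasoning body, append a canonical ground-truth answer."""
    return remove(steps) + [f"Final Answer: {ground_truth}"]


# A flawed derivation (5 + 6 is not 12) with its original wrong answer.
original = ["Add 5 and 6 to get 12.", "Final Answer: 12"]
removed = remove(original)
forced = force(original, "11")
# The label stays "incorrect" in both conditions: the flawed step is untouched.
```

In both outputs the erroneous step survives unchanged, which is why a detector that reads the reasoning body should score the trace the same way before and after the intervention.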

### 4\.1 Results

Table 3: AUC \(%\) across datasets\. Each cell is reported as Force/Remove\. For each dataset and model, the best Force value and the best Remove value across methods are bolded\.

#### Current detectors are susceptible to sanity checks

A reliable uncertainty detector should evaluate how the model reasons, not merely what answer it gives\. To test this, we apply two oracle patching modes: Force, which replaces the final answer with the ground truth, and Remove, which deletes explicit answer\-announcement steps\. A robust trace\-level detector should remain informative under both interventions; large shifts between the two conditions indicate sensitivity to answer presentation rather than reasoning trajectory\.

Table [3](https://arxiv.org/html/2605.08346#S4.T3) shows that several existing detectors are unstable under these sanity checks\. For example, RACE achieves 76\.17 AUC under Force on BBH\-Tracking with Gemma\-3\-27B, but drops to 54\.67 under Remove\. On GPQA Diamond with the same model, it nearly inverts, scoring 24\.28 under Force and 59\.01 under Remove\. NSN exhibits a different failure mode on MATH\-500: it remains at chance under Force for every model, yet rises sharply under Remove, reaching 77\.40 for GPT\-OSS\-120B and 79\.72 for Nemotron\-30B\. These shifts suggest that the detectors are responding strongly to endpoint formatting or answer availability rather than to stable reasoning evidence\.

TRACT is more stable across the same interventions\. In Table [3](https://arxiv.org/html/2605.08346#S4.T3), TRACT often obtains identical or near\-identical Force/Remove scores because its features are computed from the reasoning body after answer\-announcement removal\. It also remains competitive in absolute AUC, especially on settings where surface answer cues are weak\. On CausalT5K\-D6, for instance, TRACT obtains 75\.44/75\.44 with Nemotron\-30B and 70\.34/70\.34 with GPT\-OSS\-120B, while many baselines remain close to chance or vary substantially across intervention modes\. Together with Figure [2](https://arxiv.org/html/2605.08346#S1.F2), these results indicate that TRACT is less dependent on answer\-level artifacts and more sensitive to trajectory\-level signals\.

#### Step\-wise sensitivity analysis

Figure [4](https://arxiv.org/html/2605.08346#S4.F4) examines where each detector obtains its signal along the reasoning trajectory\. For each response, we reveal progressively larger prefixes of the trace and measure the marginal change in detector score after each additional step\. Scores are normalised to $[0,1]$ within each method so that the figure compares where sensitivity occurs rather than raw score magnitude\.
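The prefix-reveal procedure can be sketched for a single trace as follows. The `scorer` argument stands for any detector that maps a list of steps to a number and is a hypothetical interface; normalising per trace with min-max scaling is a simplification made here, whereas the text normalises within each method.

```python
from typing import Callable


def stepwise_sensitivity(
    steps: list[str], scorer: Callable[[list[str]], float]
) -> list[float]:
    """Score growing prefixes, min-max normalise, return per-step changes."""
    if not steps:
        return []
    raw = [scorer(steps[:i]) for i in range(1, len(steps) + 1)]
    lo, hi = min(raw), max(raw)
    span = hi - lo
    norm = [(r - lo) / span if span else 0.0 for r in raw]
    # Marginal change contributed by each additional step after the first.
    return [abs(b - a) for a, b in zip(norm, norm[1:])]


def toy_scorer(prefix: list[str]) -> float:
    """Stand-in detector: total word count of the revealed prefix."""
    return float(sum(len(s.split()) for s in prefix))


changes = stepwise_sensitivity(["a b", "c", "d e f"], toy_scorer)
```

A detector that reads mainly the endpoint would show near-zero changes for every step except the last, which is the spike pattern described for answer-anchored methods.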

The methods separate into three qualitative patterns\. First, answer\-anchored methods such as EMR and RACE remain relatively insensitive throughout the reasoning body and respond most strongly when the final answer is revealed\. This behaviour is undesirable for a reasoning\-aware detector: it indicates that the method is primarily reading the endpoint\.

Second, methods such as NCS, NCP, BSC, SEU, and NSN are prone to saturation: they react strongly to early steps and then plateau\. These methods extract useful information from the trace, but much of their signal is concentrated near the beginning rather than distributed across the full trajectory\.

TRACT behaves differently\. Its sensitivity remains more evenly distributed across the reasoning process and does not spike primarily at the answer\-reveal transition\. This reflects the design of Table [1](https://arxiv.org/html/2605.08346#S3.T1): TRACT combines static coherence, dynamic structural trends, and cross\-sample content agreement, so no single step or endpoint dominates the score\.

![Refer to caption](https://arxiv.org/html/2605.08346v1/x1.png)

Figure 3: Ablating TRACT feature blocks across four benchmarks\. S = Structure, Co = Coherence, and Ct = Content\. The full S\+Co\+Ct scorer is strongest or near\-strongest across benchmarks, indicating that the blocks capture complementary trace\-level signals\.

![Refer to caption](https://arxiv.org/html/2605.08346v1/x2.png)

Figure 4: Step\-wise sensitivity on GPQA Diamond\. Normalised score changes as progressively more of the trace is revealed\. Endpoint\-reliant methods spike when the final answer appears; TRACT remains sensitive across the reasoning body\.

#### Dataset and feature relationships

Figure [3](https://arxiv.org/html/2605.08346#S4.F3) decomposes TRACT by feature block and shows that different benchmarks expose different trace\-level failure modes\. On MATH\-500 and GPQA Diamond, content agreement is especially informative: the Content block alone reaches 80\.3 AUC on MATH\-500 and 67\.5 on GPQA Diamond\. This suggests that, for mathematical and scientific reasoning tasks, incorrect sampled traces often diverge in the concepts or final vocabulary on which they settle\.

CausalT5K\-D6 shows the opposite pattern\. There, the Structure block is strongest among individual blocks, reaching 66\.3 AUC\. This is consistent with the nature of causal reasoning under adversarial narrative pressure: correctness depends less on repeating the same lexical endpoint and more on maintaining a coherent reasoning trajectory\.

The full block\-combination heatmap in Appendix [C](https://arxiv.org/html/2605.08346#A3) \(Figure [5](https://arxiv.org/html/2605.08346#A3.F5)\) further shows that the blocks are complementary rather than redundant\. Structure plus Content is the strongest two\-block combination on GPQA Diamond and CausalT5K\-D6, while on MATH\-500 it nearly matches Content alone\. The full TRACT scorer, combining Structure, Coherence, and Content, achieves the highest or near\-highest AUC on every benchmark\. This supports the design choice in Table [1](https://arxiv.org/html/2605.08346#S3.T1): TRACT does not rely on a single universal cue, but combines several weak, interpretable signatures whose usefulness varies by task\.

#### Complementarity with existing detectors

Finally, we test whether TRACT contributes signal beyond existing black\-box scorers\. Appendix [D](https://arxiv.org/html/2605.08346#A4) fuses TRACT with each baseline using 4\-fold cross\-validated logistic regression on unmodified traces\. This is a diagnostic rather than our proposed deployment method: improvements indicate that TRACT contains signal not already captured by the partner scorer\.

Table 4: TRACT is complementary to existing detectors\. Average AUC across four benchmarks for standalone TRACT and the best TRACT\+baseline fusion\. Fusion improves over standalone TRACT for every model, with average gains from \+5\.42 to \+20\.00 AUC points\. EMR and NSN are the most consistent partners, showing that TRACT complements both answer\-repetition and semantic\-entropy signals\. RACE also helps for GPT\-OSS\-120B and Nemotron\-30B, suggesting that TRACT’s hedge, step\-length, and lexical\-convergence features are not fully captured by reasoning\-consistency or answer\-alignment scores\.

## 5 Conclusion

We have presented a controlled\-invariance framework for evaluating whether hallucination detection methods genuinely assess reasoning trajectories or rely on answer\-level artifacts\. Our Force and Remove oracle tests provide a simple, model\-agnostic diagnostic: when the reasoning body is preserved, a reasoning\-faithful detector should remain informative even if the final answer is forced to the ground truth or explicit answer\-announcement steps are removed\. Applying these tests across five models and four benchmarks, we find that many current methods are more sensitive to endpoint cues than standard evaluations reveal\. As long\-form reasoning becomes a defining capability of frontier models, and as hallucination detection matures from proof\-of\-concept into a relied\-upon safety mechanism, the standards by which we evaluate these methods must keep pace\. Reporting performance on unperturbed traces is insufficient if that performance collapses, or persists for the wrong reasons, under controlled perturbation\. We encourage future work in this area to adopt oracle robustness testing as a standard evaluation practice, ensuring that progress in hallucination detection reflects genuine advances in reasoning assessment rather than increasingly sophisticated exploitation of surface cues\.

#### Limitations

Force and Remove should be understood as necessary sanity checks, not sufficient guarantees of reasoning fidelity\. Passing these tests does not prove that a detector verifies intermediate logical steps; it only rules out certain forms of answer\-level dependence\. TRACT inherits this limitation by design: its lexical and structural features are interpretable and efficient, but they are not a semantic proof of correctness\. Our evaluation is limited to benchmark tasks with canonical answers; generalizing the framework to open\-ended long\-form generation, tool use, and domains where correctness is not reducible to final answers requires additional intervention designs\.

## References

- Cited by:[§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px1.p1.1)\.
- Amazon Web Services \(2025\) Amazon nova foundation models\. Note: [https://aws\.amazon\.com/ai/generative\-ai/nova/](https://aws.amazon.com/ai/generative-ai/nova/) Nova Pro multimodal foundation model\. Cited by: [Table 5](https://arxiv.org/html/2605.08346#A2.T5.4.5.4.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px2.p1.1)\.
- J\. Binkowski, D\. Janiak, A\. Sawczyn, B\. Gabrys, and T\. J\. Kajdanowicz \(2025\) Hallucination detection in LLMs using spectral features of attention maps\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 24354–24385\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1239/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1239), ISBN 979\-8\-89176\-332\-6\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Bouchard, M\. S\. Chauhan, D\. Skarbrevik, H\. Ra, V\. Bajaj, and Z\. Ahmad \(2026\) UQLM: a python package for uncertainty quantification in large language models\. Journal of Machine Learning Research 27\(13\), pp\. 1–10\. External Links: [Link](http://jmlr.org/papers/v27/25-1557.html) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- J\. Chen and J\. Mueller \(2023\) Quantifying uncertainty in answers from any language model and enhancing their trustworthiness\. arXiv preprint arXiv:2308\.16175\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1)\.
- J\. R\. Cole, M\. J\. Q\. Zhang, D\. Gillick, J\. M\. Eisenschlos, B\. Dhingra, and J\. Eisenstein \(2023\) Selectively answering ambiguous questions\. arXiv preprint arXiv:2305\.14613\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Duan, H\. Cheng, S\. Wang, A\. Zavalny, C\. Wang, R\. Xu, B\. Kailkhura, and K\. Xu \(2024\) Shifting attention to relevance: towards the predictive uncertainty quantification of free\-form large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 5050–5063\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Duan, L\. Pang, Z\. Wei, W\. Duan, Y\. Tian, S\. Xu, J\. Deng, Z\. Yin, and X\. Cheng \(2026\) Circular reasoning: understanding self\-reinforcing loops in large reasoning models\. arXiv preprint arXiv:2601\.05693\. Cited by: [§3](https://arxiv.org/html/2605.08346#S3.SS0.SSS0.Px3.p2.1)\.
- E\. Fadeeva, A\. Rubashevskii, A\. Shelmanov, S\. Petrakov, H\. Li, H\. Mubarak, E\. Tsymbalov, G\. Kuzmin, A\. Panchenko, T\. Baldwin, P\. Nakov, and M\. Panov \(2024\) Fact\-checking the output of large language models via token\-level uncertainty quantification\. arXiv preprint arXiv:2403\.04696\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\) Detecting hallucinations in large language models using semantic entropy\. Nature 630, pp\. 625–630\. External Links: [Document](https://dx.doi.org/10.1038/s41586-024-07421-0) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- X\. Gao, J\. Zhang, L\. Mouatadid, and K\. Das \(2024\) SPUQ: perturbation\-based uncertainty quantification for large language models\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(EACL\), pp\. 2336–2346\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.eacl-long.143) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Geng, A\. Ouyang, T\. Wu, et al\. \(2026\) CausalT5K: diagnosing and informing refusal for trustworthy causal reasoning of skepticism, sycophancy, detection\-correction, and rung collapse\. arXiv preprint arXiv:2602\.08939\. Cited by: [§A\.4](https://arxiv.org/html/2605.08346#A1.SS4.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px1.p1.1)\.
- Google DeepMind \(2025\) Gemma 3 technical report\. arXiv preprint\. Cited by: [Table 5](https://arxiv.org/html/2605.08346#A2.T5.4.6.5.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px2.p1.1)\.
- Y\. S\. Grewal, E\. V\. Bonilla, and T\. D\. Bui \(2024\) Improving uncertainty quantification in large language models via semantic embeddings\. External Links: 2410\.22685, [Link](https://arxiv.org/abs/2410.22685) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- M\. Jin, Q\. Yu, D\. Shu, H\. Zhao, W\. Hua, Y\. Meng, Y\. Zhang, and M\. Du \(2024\) The impact of reasoning step length on large language models\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 1830–1842\. External Links: [Link](https://aclanthology.org/2024.findings-acl.108/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.108) Cited by: [§3](https://arxiv.org/html/2605.08346#S3.SS0.SSS0.Px2.p2.1)\.
- S\. Kang, Y\. F\. Bakman, D\. N\. Yaldiz, B\. Buyukates, and S\. Avestimehr \(2025\) Uncertainty quantification for hallucination detection in large language models: foundations, methodology, and future directions\. arXiv preprint arXiv:2510\.12040\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Katerenchuk and R\. Levitan \(2024\) “You should probably read this”: hedge detection in text\. External Links: 2405\.13319, [Link](https://arxiv.org/abs/2405.13319) Cited by: [§3](https://arxiv.org/html/2605.08346#S3.SS0.SSS0.Px2.p2.1)\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. arXiv preprint arXiv:2302\.09664\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the 29th Symposium on Operating Systems Principles, pp\. 611–626\. Cited by: [§B\.5](https://arxiv.org/html/2605.08346#A2.SS5.p1.2)\.
- G\. Lakoff \(1973\) Hedges: a study in meaning criteria and the logic of fuzzy concepts\. Journal of Philosophical Logic 2\(4\), pp\. 458–508\. External Links: [Document](https://dx.doi.org/10.1007/BF00262952) Cited by: [§3](https://arxiv.org/html/2605.08346#S3.SS0.SSS0.Px2.p2.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Felix, J\. L\. Lim, J\. Schulman, I\. Sutskever, and W\. Zaremba \(2023\) Let’s verify step by step\. arXiv preprint arXiv:2305\.20050\. External Links: [Link](https://arxiv.org/abs/2305.20050) Cited by: [§A\.3](https://arxiv.org/html/2605.08346#A1.SS3.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px1.p1.1)\.
- P\. Manakul, A\. Liusie, and M\. J\. F\. Gales \(2023a\) SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\. arXiv preprint arXiv:2303\.08896\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1)\.
- P\. Manakul, A\. Liusie, and M\. J\. F\. Gales \(2023b\) SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 9004–9017\. External Links: [Document](https://dx.doi.org/10.48550/arxiv.2303.08896) Cited by: [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- A\. V\. Nikitin, J\. Kossen, Y\. Gal, and P\. Marttinen \(2024\) Kernel language entropy: fine\-grained uncertainty quantification for LLMs from semantic similarities\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=j2wCrWmgMX) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px1.p1.1)\.
- NVIDIA Corporation \(2025\) Nemotron\-3\-nano\-30b\-a3b model card\. Note: [https://build\.nvidia\.com/nvidia/nemotron\-3\-nano\-30b\-a3b/modelcard](https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard) Hybrid Mamba\-Transformer Mixture\-of\-Experts LLM\. Cited by: [Table 5](https://arxiv.org/html/2605.08346#A2.T5.4.2.1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px2.p1.1)\.
- OpenAI \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv preprint arXiv:2508\.10925\. Cited by: [Table 5](https://arxiv.org/html/2605.08346#A2.T5.4.3.2.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Phukan, Divyansh, H\. K\. Morj, Vaishnavi, A\. Saxena, and K\. Goswami \(2025\) Beyond logit lens: contextual embeddings for robust hallucination detection & grounding in VLMs\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 9661–9675\. External Links: [Link](https://aclanthology.org/2025.naacl-long.488/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.488), ISBN 979\-8\-89176\-189\-6\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px4.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-bert: sentence embeddings using siamese bert\-networks\. arXiv preprint arXiv:1908\.10084\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) GPQA: a graduate\-level google\-proof q&a benchmark\. In First Conference on Language Modeling, External Links: [Link](https://openreview.net/forum?id=Ti67584b98) Cited by: [§A\.2](https://arxiv.org/html/2605.08346#A1.SS2.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px1.p1.1)\.
- O\. Shorinwa, Z\. Mei, J\. Lidard, A\. Z\. Ren, and A\. Majumdar \(2025\) A survey on uncertainty quantification of large language models: taxonomy, open research challenges, and future directions\. ACM Computing Surveys 58\(3\), pp\. 63\. External Links: [Document](https://dx.doi.org/10.1145/3744238) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. V\. Le, E\. H\. Chi, D\. Zhou, and J\. Wei \(2022\) Challenging big\-bench tasks and whether chain\-of\-thought can solve them\. arXiv preprint arXiv:2210\.09261\. Cited by: [§A\.1](https://arxiv.org/html/2605.08346#A1.SS1.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px1.p1.1)\.
- H\. Touvron et al\. \(2024\) Llama 3: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2407\.21783\. Cited by: [Table 5](https://arxiv.org/html/2605.08346#A2.T5.4.4.3.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Vanhoyweghen, B\. Verbeken, A\. Algaba, and V\. Ginis \(2025\) Lexical hints of accuracy in llm reasoning chains\. External Links: 2508\.15842, [Link](https://arxiv.org/abs/2508.15842) Cited by: [§3](https://arxiv.org/html/2605.08346#S3.SS0.SSS0.Px2.p3.1)\.
- C\. Wang, W\. Su, Q\. Ai, and Y\. Liu \(2026\) Joint evaluation of answer and reasoning consistency for hallucination detection in large reasoning models\. Proceedings of the AAAI Conference on Artificial Intelligence 40\(39\), pp\. 33377–33385\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/40624), [Document](https://dx.doi.org/10.1609/aaai.v40i39.40624) Cited by: [§1](https://arxiv.org/html/2605.08346#S1.p2.1), [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px2.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 24824–24837\. External Links: [Link](https://neurips.cc/) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Yao, S\. Yang, J\. Xu, L\. Hu, M\. Li, and D\. Wang \(2025\) Understanding the repeat curse in large language models from a feature perspective\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 7787–7815\. External Links: [Link](https://aclanthology.org/2025.findings-acl.406/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.406), ISBN 979\-8\-89176\-256\-5\. Cited by: [§3](https://arxiv.org/html/2605.08346#S3.SS0.SSS0.Px3.p2.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\) BERTScore: evaluating text generation with bert\. arXiv preprint arXiv:1904\.09675\. Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px3.p1.1), [§4](https://arxiv.org/html/2605.08346#S4.SS0.SSS0.Px3.p1.3)\.
- G\. Zollicoffer, M\. Vu, and M\. Bhattarai \(2025\) MTRE: multi\-token reliability estimation for hallucination detection in vlms\. External Links: 2505\.11741, [Link](https://arxiv.org/abs/2505.11741) Cited by: [§2](https://arxiv.org/html/2605.08346#S2.SS0.SSS0.Px4.p1.1)\.

## Appendix A Datasets

We describe the four benchmarks used to evaluate TRACT and the other black\-box scorers\.

### A\.1 BBH\-Tracking \(BIG\-Bench Hard\)

BIG\-Bench Hard \(BBH; Suzgun et al\. [2022](https://arxiv.org/html/2605.08346#bib.bib37)\) is a curated subset of 23 tasks drawn from the broader BIG\-Bench evaluation suite, selected because prior language model evaluations failed to exceed average human\-rater performance on them under standard few\-shot prompting\. BBH spans a diverse range of reasoning skills \(multi\-step algorithmic reasoning, logical deduction, natural language understanding, and commonsense inference\), with up to 250 examples per task \(6,511 total\)\. Each task includes both answer\-only and chain\-of\-thought \(CoT\) few\-shot prompt sets\. A central finding of the original paper is that CoT prompting substantially closes the human\-model gap, enabling Codex to surpass human\-rater performance on 17 of the 23 tasks\.

We focus on the Tracking Shuffled Objects subtask\. Each problem specifies $N$ agents, each initially holding one of $N$ distinct objects, then narrates $N$ pairwise swaps; the model must identify which object a designated agent holds at the end\. Problems are constructed so that every agent participates in at least one swap and no two agents swap back\-to\-back, ruling out trivial surface shortcuts\. Correct solutions require maintaining an explicit, incrementally updated state across multiple reasoning steps, which makes the coherence and rhetorical structure of the chain of thought directly diagnostic of solution quality\.
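To make the state\-tracking requirement concrete, the following sketch simulates one such problem with $N = 3$\. The names, objects, and swap order are invented for illustration and are not drawn from BBH\.

```python
from typing import Dict, List, Tuple


def final_holdings(
    initial: Dict[str, str],
    swaps: List[Tuple[str, str]],
) -> Dict[str, str]:
    """Apply pairwise swaps in order and return who holds what at the end."""
    state = dict(initial)  # copy so the caller's mapping is not mutated
    for a, b in swaps:
        state[a], state[b] = state[b], state[a]
    return state


# Invented example: three agents, three objects, three swaps.
initial = {"Alice": "red ball", "Bob": "blue ball", "Carol": "green ball"}
swaps = [("Alice", "Bob"), ("Bob", "Carol"), ("Alice", "Carol")]
result = final_holdings(initial, swaps)
# result["Alice"] == "red ball": Alice gets her original object back,
# but only after it has passed through Bob and Carol.
```

Each swap changes the state that the next step depends on, so a trace that drops or misorders a single update cannot be repaired by looking at the final line alone\.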

### A\.2 GPQA Diamond

GPQA \(Graduate\-Level Google\-Proof Q&A; Rein et al\. [2024](https://arxiv.org/html/2605.08346#bib.bib38)\) is a benchmark of 448 four\-way multiple\-choice questions written by domain experts in biology, physics, and chemistry\. Questions are deliberately designed to be “Google\-proof”: highly skilled non\-expert validators reach only 34% accuracy despite an average of over 30 minutes of unrestricted web access\. Domain experts holding or pursuing PhDs in the relevant fields achieve 65% accuracy \(74% after discounting questions the experts themselves identified as flawed in retrospect\)\.

GPQA Diamond is the hardest curated subset of 198 questions, retaining only those for which both independent expert annotators answered correctly while the majority of non\-expert validators answered incorrectly\. The random\-chance baseline on Diamond is 25%\. This subset is now a standard frontier evaluation: at the time of the benchmark’s release, GPT\-4 achieved 39%; subsequent models have improved substantially, but the dataset remains a meaningful discriminator for state\-of\-the\-art systems\. For TRACT evaluation, GPQA Diamond provides a pool of problems demanding deep, multi\-step scientific reasoning where superficial lexical cues are insufficient and the internal structure of a reasoning trace is strongly predictive of correctness\.

### A\.3 MATH\-500

MATH\-500 \[Lightman et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib35)\] is a 500\-problem subset of the MATH dataset \[Hendrycks et al\., 2021\], curated by OpenAI to support evaluation of process supervision methods for mathematical reasoning\. The full MATH dataset comprises 12,500 competition\-style problems drawn from AMC 10, AMC 12, AIME, and related contests; MATH\-500 samples representatively from seven subject areas \(Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus\) across five difficulty levels \(1–5\)\. Each problem includes a complete step\-by\-step reference solution in LaTeX\.

MATH\-500 was introduced to compare outcome supervision \(feedback on final answers only\) against process supervision \(feedback on each intermediate reasoning step\), with process supervision found to significantly outperform outcome supervision\. This is precisely the regime where the quality of intermediate reasoning steps, rather than the final answer alone, determines success\. MATH\-500 therefore provides a natural evaluation bed for TRACT’s trace\-level coherence features: problems are hard enough that flawed intermediate steps frequently co\-occur with incorrect final answers, making correct step\-by\-step structure a reliable signal of correctness\.

### A\.4 CausalT5K

CausalT5K \[Geng et al\., [2026](https://arxiv.org/html/2605.08346#bib.bib36)\] is a diagnostic benchmark of over 5,000 cases \(5,147 total\) spanning 10 domains, designed to stress\-test three distinct failure modes in LLM causal reasoning: \(1\) *rung collapse*, where a model responds to an interventional or counterfactual query with associational evidence, conflating different levels of Pearl’s causal hierarchy; \(2\) *sycophantic drift*, where a model abandons a correct causal claim under adversarial rhetorical pressure; and \(3\) *miscalibrated refusal*, where a model either refuses valid causal claims \(the Skepticism Trap\) or endorses underdetermined ones\.

Unlike purely synthetic causal benchmarks, CausalT5K embeds causal traps in realistic narrative scenarios across domains including medicine, economics, history, sports, and daily life, developed through a rigorous human\-machine collaborative pipeline involving 40 domain experts\. Performance is decomposed into a two\-axis scheme of Utility \(sensitivity: correctly affirming valid causal claims\) and Safety \(specificity: correctly rejecting invalid ones\), revealing failure modes invisible to aggregate accuracy\. CausalT5K spans all three of Pearl’s rungs \(associational $\mathcal{L}_1$, interventional $\mathcal{L}_2$, counterfactual $\mathcal{L}_3$\) and is unique among existing causal benchmarks in annotating trap type, applying adversarial pressure, and reporting the two\-axis decomposition\. For TRACT, this dataset provides a qualitatively distinct reasoning challenge relative to the other benchmarks: correct responses require not only multi\-step deduction but also resistance to misleading narrative framing, making trace\-level rhetorical and coherence features especially salient\.

## Appendix B Model Configurations and Sampling Procedures

### B\.1 Models

We evaluate across five models spanning four LLM families and a range of parameter scales, mixing open\-weight and proprietary systems to reduce the risk of findings that are idiosyncratic to a single architecture or training pipeline\. Table [5](https://arxiv.org/html/2605.08346#A2.T5) summarises the key properties of each model\.

Table 5: Models used in our evaluation\. All models are instruction\-tuned variants\. Parameter counts are approximate and reflect the publicly reported size of each model\.

### B\.2 Prompting

All models are prompted with a standardised step\-by\-step instruction that elicits chain\-of\-thought reasoning before a final answer\. The system prompt is:

> Solve the following problem step by step\. Show your reasoning at each step, then provide your final answer on the last line in the format: "Final Answer: <answer\>"\.

We do not use few\-shot exemplars; all evaluations are zero\-shot\. The prompt is identical across all models for BBH\-Tracking, MATH\-500, and GPQA Diamond to ensure that variation in reasoning traces reflects model behaviour rather than prompt engineering\.

For CausalT5K, we use the following prompt:

> You are evaluating a causal claim\. Read the scenario and claim carefully, then reason through your answer step by step\. Scenario: \{scenario\} Claim: \{claim\} Variables: \- X \(exposure\): \{variables\.X\.name\} \- Y \(outcome\): \{variables\.Y\.name\} \- Z \(confounders/other factors\): \{variables\.Z\} Your task: Evaluate whether the claim is causally valid\. Think through your reasoning step by step, then provide your answer\. Step 1: Identify the causal relationship being claimed\. Step 2: Consider potential confounders or alternative explanations\. Step 3: Evaluate whether the evidence supports a causal relationship\. Step 4: Determine if additional information is needed to make a reliable judgment\. Step 5: State your conclusion\. Final Answer: \[YES, NO, or REFUSE\] Explanation:

### B\.3 Sampling

For each \(model, benchmark\) pair, we sample $K$ independent reasoning traces per prompt using nucleus sampling\. Table [6](https://arxiv.org/html/2605.08346#A2.T6) reports the sampling hyperparameters\.

Table 6: Sampling hyperparameters\. All models use the same configuration except where noted\.

### B\.4 Step Extraction

Reasoning traces are segmented into steps by splitting on double newlines\. When a model produces unstructured output \(e\.g\. a single block of prose with no paragraph breaks\), we fall back to splitting on numbered or bulleted list boundaries and, as a last resort, single newlines\. Steps shorter than 5 characters or consisting entirely of punctuation and markdown formatting are discarded\. Steps matching answer\-announcement patterns \(e\.g\. “Final Answer:”, “The answer is”\) are removed from the trace body before feature extraction to prevent answer\-level artifacts from leaking into trajectory features\.
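The segmentation rules above can be sketched as follows\. This is an illustrative reading, not the released implementation: the regular expressions, the choice to fall back whenever a split yields a single segment, and the rule that a step is dropped only when it begins with an answer\-announcement phrase are all assumptions, and only the two example patterns quoted in the text are covered\.

```python
import re
from typing import List

# Assumed patterns: the text names only "Final Answer:" and "The answer is".
ANSWER_START = re.compile(r"(final answer\s*:|the answer is\b)", re.IGNORECASE)
# A newline followed by a numbered ("1.", "2)") or bulleted ("-", "*") item.
LIST_BOUNDARY = re.compile(r"\n(?=\s*(?:\d+[.)]|[-*•])\s+)")
# Only punctuation, markdown symbols, underscores, or whitespace.
FORMATTING_ONLY = re.compile(r"^[\W_]*$")


def split_steps(trace: str) -> List[str]:
    """Segment a reasoning trace into steps and drop answer announcements."""
    text = trace.strip()
    parts = re.split(r"\n\s*\n", text)      # primary rule: double newlines
    if len(parts) <= 1:
        parts = LIST_BOUNDARY.split(text)   # fallback: list boundaries
    if len(parts) <= 1:
        parts = text.split("\n")            # last resort: single newlines

    steps: List[str] = []
    for part in parts:
        step = part.strip()
        if len(step) < 5 or FORMATTING_ONLY.match(step):
            continue                        # too short or formatting only
        if ANSWER_START.match(step):
            continue                        # answer-announcement step
        steps.append(step)
    return steps


example = (
    "First, track Alice's ball.\n\n"
    "Then Bob swaps with Carol.\n\n"
    "---\n\n"
    "Final Answer: red ball"
)
body = split_steps(example)
```

On this example the separator line and the answer line are both discarded, leaving only the two reasoning steps for feature extraction\.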

### B\.5 Infrastructure

Open\-weight models are served via vLLM \[Kwon et al\., [2023](https://arxiv.org/html/2605.08346#bib.bib48)\] on NVIDIA A100 GPUs with tensor parallelism as needed \(TP = 2 for LLaMA\-3\-70B, TP = 1 for all others\)\. Amazon Nova Pro is accessed through the Amazon Bedrock API\. All inference is run with TRANSFORMERS\_OFFLINE=1 using locally cached model weights\.

## Appendix C Feature Inventory

![Refer to caption](https://arxiv.org/html/2605.08346v1/x3.png)
Figure 5: Block\-wise contribution to TRACT AUC across four benchmarks, averaged over five models\. S = Structure \(cross\-response rhetorical and structural signals\), Co = Coherence, Ct = Content \(cross\-response lexical agreement\)\. White dividers separate single\-block ablations, pairwise combinations, and the full scorer\. Bold outlines mark the highest\-scoring subset per dataset\.

Table [1](https://arxiv.org/html/2605.08346#S3.T1) in the main paper gives the complete TRACT feature inventory, including the diagnostic intuition, sign, and implementation\-level definition for each feature\. Here we provide additional implementation details\.

For features defined independently within a trace, we compute the statistic for each of the $K$ sampled traces and then average across traces\. For cross\-trace divergence features, we compute the average pairwise Jaccard distance across all unordered trace pairs\.
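A minimal sketch of the cross\-trace aggregation, assuming each trace has already been reduced to a set of tokens\. The `vocabulary` helper and its lowercase alphanumeric tokenisation are hypothetical choices made for the example; the paper's exact token sets are defined in Table 1\.

```python
import re
from itertools import combinations
from typing import List, Set


def vocabulary(trace: str) -> Set[str]:
    """Hypothetical tokeniser: lowercase alphanumeric word types."""
    return set(re.findall(r"[a-z0-9]+", trace.lower()))


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_jaccard(token_sets: List[Set[str]]) -> float:
    """Average Jaccard distance over all unordered pairs of traces."""
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0  # fewer than two traces: no divergence to measure
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Three toy traces: the first and third agree, the second diverges.
sets = [{"a", "b"}, {"b", "c"}, {"a", "b"}]
divergence = mean_pairwise_jaccard(sets)  # (2/3 + 0 + 2/3) / 3 = 4/9
```

With $K$ traces there are $K(K-1)/2$ unordered pairs, so the cost grows quadratically in $K$ but stays small for typical sample counts\.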

The hedge count $h^{(k)}_i$ is computed using a fixed lexicon $\mathcal{H}$ containing uncertainty and contrast markers such as *however*, *although*, *maybe*, *perhaps*, *might*, *could*, *seems*, and *hmm*\. The entity set $\mathcal{E}^{(k)}_i$ is approximated by capitalised tokens, excluding sentence\-initial function words and answer\-formatting tokens\. Slopes are ordinary least\-squares slopes against the normalised step position $i/T_k$ within each trace\.
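The hedge\-trend computation can be illustrated as below\. This is a sketch under assumptions: the lexicon contains only the eight markers quoted above rather than the full $\mathcal{H}$, steps are indexed $i = 1, \dots, T_k$, and the word\-level tokenisation is a simplification\.

```python
import re
from typing import List, Sequence

# Only the markers quoted in the text; the full lexicon is not reproduced here.
HEDGES = {"however", "although", "maybe", "perhaps", "might", "could", "seems", "hmm"}


def hedge_count(step: str) -> int:
    """Number of hedge tokens in one reasoning step."""
    tokens = re.findall(r"[a-z']+", step.lower())
    return sum(1 for tok in tokens if tok in HEDGES)


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least-squares slope of y on x (0.0 if undefined)."""
    n = len(x)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def hedge_slope(steps: Sequence[str]) -> float:
    """Slope of hedge counts against normalised step position i / T."""
    total = len(steps)
    positions: List[float] = [i / total for i in range(1, total + 1)]
    counts = [float(hedge_count(s)) for s in steps]
    return ols_slope(positions, counts)


steps = ["We compute x.", "Maybe x is 3.", "Hmm, perhaps it could be 4."]
slope = hedge_slope(steps)  # counts 0, 1, 3 at positions 1/3, 2/3, 1 -> 4.5
```

A positive slope means hedging increases toward the end of the trace; averaging this value over the $K$ sampled traces gives the per\-prompt statistic described above\.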

Before scoring, each feature is robust\-scaled using median centring and IQR normalisation, then clipped to $[-3,3]$\. The structure block always contributes to the final score\. The coherence and content blocks are modulated by the verbosity gate described in Section [3](https://arxiv.org/html/2605.08346#S3)\.
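The scaling step can be sketched as follows, assuming the median and IQR are estimated from the same pool of feature values being scaled and that quartiles use the inclusive interpolation of Python's `statistics.quantiles` \(Python 3\.8 or later\)\. The quartile convention and the handling of a zero IQR are assumptions; the verbosity gate is not shown\.

```python
from statistics import median, quantiles
from typing import List, Sequence


def robust_scale(values: Sequence[float], clip: float = 3.0) -> List[float]:
    """Centre on the median, divide by the IQR, then clip to [-clip, clip]."""
    centre = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0] * len(values)  # degenerate feature: no spread to scale by
    return [max(-clip, min(clip, (v - centre) / iqr)) for v in values]


# One extreme value (100) is clipped instead of dominating the scale.
scaled = robust_scale([1, 2, 3, 4, 100])
# median = 3, Q1 = 2, Q3 = 4, IQR = 2 -> [-1.0, -0.5, 0.0, 0.5, 3.0]
```

Because the median and IQR ignore the tails, a single very long or very hedged trace shifts the scaled features far less than it would under mean and standard\-deviation scaling\.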

## Appendix D TRACT Pairwise Complementarity

The main paper evaluates TRACT as a standalone detector\. Here we ask a diagnostic question: does TRACT capture information that is complementary to existing black\-box uncertainty scorers? To test this, we fuse TRACT with each baseline scorer using a 4\-fold cross\-validated logistic regression classifier\. All fusion experiments use the unmodified trace setting, class\-weighted logistic regression, standardised features, and regularisation parameter $C = 1.0$\.

This analysis is not intended to propose an ensemble as the primary method\. Instead, it tests whether TRACT contains signal that is not already captured by each baseline\. Improvement over EMR indicates information beyond answer repetition; improvement over NCS, BSC, or SEU indicates information beyond embedding or semantic similarity; improvement over NCP or NSN indicates information beyond entailment or semantic\-entropy\-style uncertainty; and improvement over RACE indicates signal beyond reasoning\-consistency and answer\-alignment components\.
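One way to write the two\-feature fusion model is given below\. This is a sketch under assumptions not stated in the text: an $\ell_2$ penalty, with $C$ read as an inverse regularisation strength \(the convention used by scikit\-learn, which the paper does not name\)\. Here $z_{T,i}$ and $z_{B,i}$ denote the standardised TRACT and baseline scores for example $i$, $\tilde{y}_i \in \{-1, +1\}$ is the binary label, and $c_{y_i}$ is the class weight\.

```latex
\begin{align}
\hat{p}_i &= \sigma\!\left(b + w_T\, z_{T,i} + w_B\, z_{B,i}\right),
\qquad \sigma(u) = \frac{1}{1 + e^{-u}}, \\
(b^\star, w_T^\star, w_B^\star) &= \arg\min_{b,\, w_T,\, w_B}\;
\tfrac{1}{2}\left(w_T^2 + w_B^2\right)
+ C \sum_{i} c_{y_i}
\log\!\left(1 + \exp\!\left(-\tilde{y}_i \left(b + w_T\, z_{T,i} + w_B\, z_{B,i}\right)\right)\right).
\end{align}
```

Under this reading, fusion AUC is computed from the held\-out $\hat{p}_i$ of each of the four folds, and a gain over the standalone scores indicates that $w_T$ and $w_B$ both carry usable signal\.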

Table[7](https://arxiv.org/html/2605.08346#A4.T7)summarises the main result\. Across all five models, the best TRACT\+partner fusion improves over standalone TRACT, with average gains ranging from \+5\.42 to \+20\.00 AUC points across benchmarks\. The most consistently helpful partners are EMR and NSN, indicating that TRACT complements both simple answer\-repetition signals and semantic\-entropy\-style uncertainty\. For GPT\-OSS\-120B and Nemotron\-30B, RACE is also among the most useful partners, suggesting that TRACT’s explicit step\-level trajectory features provide signal not fully captured by aggregate reasoning\-consistency or answer\-alignment scores\.

Table 7:Summary of TRACT pairwise complementarity\.For each model, we report standalone TRACT AUC averaged over the four benchmarks, the best average TRACT\+partner fusion, and the average gain over standalone TRACT\. The final column lists the partners with the highest number of wins over standalone TRACT across benchmarks\.The detailed results below show that complementarity is not confined to a single model or benchmark\. Nova\-Pro benefits strongly from fusion, suggesting that TRACT supplies missing trajectory information when standalone performance is modest\. Gemma\-3\-27B and Nemotron\-30B already have strong standalone TRACT scores, yet fusion still improves performance, indicating that TRACT is compatible with other uncertainty signals rather than simply replacing them\. For GPT\-OSS\-120B and LLaMA\-3\-70B, the best partner varies by dataset, which further supports the view that no single baseline subsumes TRACT’s trajectory features\.

In Tables [8](https://arxiv.org/html/2605.08346#A4.T8)–[12](https://arxiv.org/html/2605.08346#A4.T12), bold entries indicate fusion scores that improve over both standalone TRACT and the partner scorer on the corresponding dataset\. The final row, Wins vs\. T, counts the number of benchmarks, out of four, where TRACT\+partner improves over standalone TRACT, regardless of whether it also beats the partner alone\.
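The two table conventions can be stated as a short sketch. The function name `summarise_fusion`, the benchmark labels, and the AUC values below are hypothetical placeholders rather than results from the paper, and "improves" is assumed to mean a strictly higher AUC.

```python
def summarise_fusion(
    tract: dict[str, float],
    partner: dict[str, float],
    fusion: dict[str, float],
) -> tuple[dict[str, bool], int]:
    """Return per-benchmark bold flags and the 'Wins vs. T' count."""
    # Bold: fusion beats BOTH standalone TRACT and the partner scorer.
    bold = {
        b: fusion[b] > tract[b] and fusion[b] > partner[b]
        for b in tract
    }
    # Wins vs. T: fusion beats standalone TRACT, whether or not it beats the partner.
    wins_vs_t = sum(fusion[b] > tract[b] for b in tract)
    return bold, wins_vs_t


# Made-up AUC values on four placeholder benchmarks.
tract = {"bench_A": 70.0, "bench_B": 80.0, "bench_C": 65.0, "bench_D": 90.0}
partner = {"bench_A": 75.0, "bench_B": 60.0, "bench_C": 68.0, "bench_D": 85.0}
fusion = {"bench_A": 78.0, "bench_B": 82.0, "bench_C": 66.0, "bench_D": 88.0}

bold, wins_vs_t = summarise_fusion(tract, partner, fusion)
print(bold)
print(f"Wins vs. T: {wins_vs_t}/4")
```

On `bench_C` the fusion counts as a win over TRACT but is not bold, because the partner alone scores higher. This is why the Wins vs\. T row can exceed the number of bold entries in a column.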

### D\.1 Nova\-Pro

Table 8: TRACT fusion results for Nova\-Pro\.

Nova\-Pro shows the largest complementarity gains overall\. EMR and NSN are the most consistent partners, indicating that TRACT contributes trajectory information that combines well with both endpoint repetition and semantic uncertainty\.

### D\.2 Gemma\-3\-27B

Table 9: TRACT fusion results for Gemma\-3\-27B\.

Standalone TRACT is already strong on Gemma\-3\-27B, but fusion still improves on most benchmarks\. The broad usefulness of EMR, NSN, and NCP suggests that trajectory features remain complementary even when answer\-level and entailment\-style signals are available\.

### D\.3 GPT\-OSS\-120B

Table 10: TRACT fusion results for GPT\-OSS\-120B\.

Fusion improves strongly on BBH\-Tracking and CausalT5K\-D4, where trajectory instability and endpoint uncertainty provide complementary evidence\. The useful partners vary by dataset, indicating that TRACT is not subsumed by any single baseline family\.

### D\.4 LLaMA\-3\-70B

Table 11: TRACT fusion results for LLaMA\-3\-70B\.

For LLaMA\-3\-70B, EMR, NCP, BSC, and NSN frequently improve over standalone TRACT\. This pattern suggests that TRACT is most complementary to lexical, entailment, and semantic\-entropy signals for this model, while SEU and RACE are less consistently helpful\.

### D\.5 Nemotron\-30B

Table 12: TRACT fusion results for Nemotron\-30B\.

TRACT remains complementary despite strong standalone performance\. EMR, NSN, and RACE improve on all four benchmarks, showing that TRACT’s trajectory\-level statistics combine well with both endpoint\-based and reasoning\-aware uncertainty signals\.

Overall, the fusion analysis supports the interpretation of TRACT as a complementary trajectory scorer rather than a reparameterisation of existing uncertainty baselines\. The strongest partners differ across models and datasets, but the consistent gains over standalone TRACT indicate that its coherence, structure, and content features capture trace\-level information that can be combined with answer repetition, semantic similarity, entailment, semantic entropy, and reasoning\-consistency signals\.
