OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios
Summary
OpenHalDet is a unified benchmark for hallucination detection in LLMs, standardizing evaluation across diverse generation scenarios and supporting black-box, gray-box, and white-box detection methods.
View Cached Full Text
Cached at: 06/08/26, 09:21 AM
# OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios
Source: [https://arxiv.org/html/2606.06959](https://arxiv.org/html/2606.06959)
Xinyi Li1, Zhen Fang1, Yongxin Deng1, Jinyuan Luo1, Hongnan Ma3Changdae Oh2, Zijing Shi1, Shanshan Ye1, Hanchen Wang1, Shu\-Lin Chen1Yadan Luo4, Mengyue Yang3, Sean Du5, Sharon Li2, Ling Chen11University of Technology Sydney2University of Wisconsin–Madison3University of Bristol4The University of Queensland5Nanyang Technological Universityxinyi\.li\-1@student\.uts\.edu\.auzhen\.fang@uts\.edu\.au
###### Abstract
Hallucination detection is essential for the reliable deployment of large language models \(LLMs\)\. However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks\. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings\. We introduceOpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios\. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation\. It supports heterogeneous detector families under different access settings, including black\-box methods that use only generated outputs, gray\-box methods that rely on probability\-based signals, and white\-box methods that exploit internal model signals\. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications\. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods\. The code and datasets are available at[https://github\.com/Nellie179/Hallucination\-Detection](https://github.com/Nellie179/Hallucination-Detection)\.
## 1Introduction
Large language models \(LLMs\) have demonstrated strong potential across a wide range of real\-world applications\[[63](https://arxiv.org/html/2606.06959#bib.bib29),[34](https://arxiv.org/html/2606.06959#bib.bib35)\]\. Despite these advances, hallucination in generated text remains a critical challenge, where LLMs produce outputs that are grammatically and logically coherent but lack factual accuracy or verifiable evidence\[[42](https://arxiv.org/html/2606.06959#bib.bib26),[21](https://arxiv.org/html/2606.06959#bib.bib33),[32](https://arxiv.org/html/2606.06959#bib.bib34)\]\. This issue is particularly consequential in high\-risk domains such as healthcare, law, and finance\[[18](https://arxiv.org/html/2606.06959#bib.bib30),[48](https://arxiv.org/html/2606.06959#bib.bib31),[27](https://arxiv.org/html/2606.06959#bib.bib32)\], where the dissemination of incorrect information may lead to serious consequences\. Unfortunately, LLMs trained via likelihood\-based next\-token prediction are inherently prone to hallucination\[[37](https://arxiv.org/html/2606.06959#bib.bib36),[25](https://arxiv.org/html/2606.06959#bib.bib37),[5](https://arxiv.org/html/2606.06959#bib.bib38),[23](https://arxiv.org/html/2606.06959#bib.bib39)\], and zero\-hallucination guarantees are unattainable in general without additional structural assumptions\[[24](https://arxiv.org/html/2606.06959#bib.bib9),[26](https://arxiv.org/html/2606.06959#bib.bib10)\]\. Therefore, effectively detecting hallucinations in model outputs, known ashallucination detection\[[46](https://arxiv.org/html/2606.06959#bib.bib40),[30](https://arxiv.org/html/2606.06959#bib.bib41),[36](https://arxiv.org/html/2606.06959#bib.bib11),[17](https://arxiv.org/html/2606.06959#bib.bib13)\], has become a promising research direction for improving the reliability and safety of LLMs\.
Prior work\[[19](https://arxiv.org/html/2606.06959#bib.bib3),[61](https://arxiv.org/html/2606.06959#bib.bib5)\]has proposed a wide range of hallucination detection methods, which can be broadly grouped according to their model\-access requirements into black\-box, gray\-box, and white\-box methods\. 1\)Black\-box methodsrely only on externally observable token\-space signals, such as consistency patterns across generated outputs\[[31](https://arxiv.org/html/2606.06959#bib.bib15),[36](https://arxiv.org/html/2606.06959#bib.bib11),[33](https://arxiv.org/html/2606.06959#bib.bib14)\]\. 2\)Gray\-box methodsfurther exploit probability\-based signals exposed during generation, including token probabilities, sequence likelihoods, or entropy\-based uncertainty measures\[[47](https://arxiv.org/html/2606.06959#bib.bib16),[35](https://arxiv.org/html/2606.06959#bib.bib17),[15](https://arxiv.org/html/2606.06959#bib.bib27)\]\. 3\)White\-box methodsrequire access to model\-internal information, such as hidden states, attention maps, or other intermediate representations\[[8](https://arxiv.org/html/2606.06959#bib.bib18),[7](https://arxiv.org/html/2606.06959#bib.bib21),[3](https://arxiv.org/html/2606.06959#bib.bib23)\]\. These lines of work have shown promising results and enabled a range of applications, including selective prediction\[[56](https://arxiv.org/html/2606.06959#bib.bib42)\], domain\-specific safety solutions\[[1](https://arxiv.org/html/2606.06959#bib.bib44)\], reliable retrieval\-augmented generation systems\[[50](https://arxiv.org/html/2606.06959#bib.bib43)\], and agentic inference guardrails\[[41](https://arxiv.org/html/2606.06959#bib.bib45)\]\. However, despite substantial methodological progress, the current evaluation paradigm for hallucination detection still exhibits two flaws that undermine genuine progress: 1\)inconsistent evaluation and inference configurations\[[20](https://arxiv.org/html/2606.06959#bib.bib6),[6](https://arxiv.org/html/2606.06959#bib.bib7)\], and 2\)narrowly scoped downstream domains and tasks\[[16](https://arxiv.org/html/2606.06959#bib.bib8)\]\.
Specifically, the first limitation concerns the evaluation protocol: existing studies\[[3](https://arxiv.org/html/2606.06959#bib.bib23),[36](https://arxiv.org/html/2606.06959#bib.bib11),[62](https://arxiv.org/html/2606.06959#bib.bib24)\]often adopt different model backbones, prompting methods, decoding strategies, and inference hyperparameters to validate their proposed detectors \(see Appendix[B](https://arxiv.org/html/2606.06959#A2)\)\. Recent work has shown that detector effectiveness can vary substantially across evaluation configurations\[[15](https://arxiv.org/html/2606.06959#bib.bib27)\]\. Therefore, the lack of standardized evaluation and inference configurations hinders apples\-to\-apples comparisons across methods\[[20](https://arxiv.org/html/2606.06959#bib.bib6),[6](https://arxiv.org/html/2606.06959#bib.bib7),[29](https://arxiv.org/html/2606.06959#bib.bib61)\]\. The second limitation concerns the coverage of evaluation scenarios: most detectors are evaluated on only a small number of datasets covering limited downstream domains and task formats\[[16](https://arxiv.org/html/2606.06959#bib.bib8),[54](https://arxiv.org/html/2606.06959#bib.bib62)\]\. Since detection strategies may behave differently across domains, tasks, context lengths, and answer formats, such narrow evaluation scope makes it difficult to assess whether reported effectiveness can generalize beyond the specific settings considered in each study\.
To address these challenges, we introduceOpenHalDet, a unified benchmark with an accompanying well\-structured and extensible codebase for hallucination detection\. OpenHalDet standardizes the main stages of evaluation, including prompt construction, response generation, truthfulness annotation, detector scoring, and metric reporting, enabling fair comparison across detectors under a shared protocol\. It covers17 datasets, full evaluations on4 backbone LLMs, and selected 70B\-scale experiments across diverse generation scenarios \(see Appendix[F](https://arxiv.org/html/2606.06959#A6)\)\. OpenHalDet further integrates representative black\-box, gray\-box, and white\-box detectors into a common framework, supporting controlled comparison under unified settings\. Together, it provides a fair, reproducible, and comprehensive view of hallucination detection across practical LLM application scenarios\. We summarize our contributions as follows:
•Comprehensive Hallucination Detection Benchmarks\.We provide a benchmark covering 17 datasets across diverse LLM application scenarios\. OpenHalDet groups datasets by scenario, enabling systematic evaluation of hallucination detectors under different task formats\.•Unified Comparison Across Access Regimes\.We integrate 16 representative detectors with different access assumptions into a shared evaluation interface\. This enables fair comparison across methods under the same generation, annotation, and scoring protocol\.•A Unified Codebase for Hallucination Detection\.We provide an extensible open\-source codebase, OpenHalDet, with standardized modules for prompt construction, generation, annotation, detector scoring, and metric reporting\. It supports heterogeneous datasets and detectors through a shared interface, reducing the engineering effort for reproducible evaluation\.•Insights\.Our evaluation yields three findings: 1\) detector effectiveness is scenario\- and backbone\-dependent, with self\-reported confidence varying strongly across settings; 2\) richer model access raises the performance ceiling, but does not guarantee robust gains; 3\) evidence acquisition often dominates cost, making accuracy\-only comparisons incomplete\.
## 2Task Formulation, Benchmark Scope, and Dataset Construction
Here, we first introduce the task formulation studied in our benchmark\. We then describe the scope of our benchmark\. Further, we describe how datasets with different original formats are converted into a unified evaluation format\. Finally, we present the evaluation metrics used in our experiments\.
### 2\.1Task Formulation
Let𝒬\\mathcal\{Q\}and𝒜\\mathcal\{A\}be the spaces of inputs and outputs\. Following\[[14](https://arxiv.org/html/2606.06959#bib.bib22),[42](https://arxiv.org/html/2606.06959#bib.bib26),[13](https://arxiv.org/html/2606.06959#bib.bib4)\], we consider a base LLM as a probability distributionℙ𝜽\(⋅\)\\mathbb\{P\}\_\{\\boldsymbol\{\\theta\}\}\(\\cdot\)over token sequences, where𝜽\\boldsymbol\{\\theta\}denotes the model parameters\. Given an input token sequence𝐐=\[q1,…,qk\]∈𝒬\\mathbf\{Q\}=\[q\_\{1\},\\ldots,q\_\{k\}\]\\in\\mathcal\{Q\}, whereqjq\_\{j\}is thejj\-th token, the LLM generates an output sequence𝐀=\[qk\+1,…,qk\+a\]∈𝒜\\mathbf\{A\}=\[q\_\{k\+1\},\\ldots,q\_\{k\+a\}\]\\in\\mathcal\{A\}of lengthaaby sampling from
ℙ𝜽\(𝐀∣𝐐\)=∏j=k\+1k\+aℙ𝜽\(qj∣q<j\),\\mathbb\{P\}\_\{\\boldsymbol\{\\theta\}\}\(\\mathbf\{A\}\\mid\\mathbf\{Q\}\)=\\prod\_\{j=k\+1\}^\{k\+a\}\\mathbb\{P\}\_\{\\boldsymbol\{\\theta\}\}\(q\_\{j\}\\mid q\_\{<j\}\),\(1\)whereq<j=\(q1,…,qj−1\)q\_\{<j\}=\(q\_\{1\},\\ldots,q\_\{j\-1\}\)denotes the preceding context\. Given the underlying*truthful\-response domain*ℙQ,T\\mathbb\{P\}\_\{Q,T\}, which is a joint distribution over𝒬×𝒜\\mathcal\{Q\}\\times\\mathcal\{A\}, the goal of hallucination detection is to learn a detectorGGsuch that, for any input\-output pair\(𝐐,𝐀\)\(\\mathbf\{Q\},\\mathbf\{A\}\)with𝐐∼ℙQ\\mathbf\{Q\}\\sim\\mathbb\{P\}\_\{Q\}and𝐀∼ℙ𝜽\(⋅∣𝐐\)\\mathbf\{A\}\\sim\\mathbb\{P\}\_\{\\boldsymbol\{\\theta\}\}\(\\cdot\\mid\\mathbf\{Q\}\),
G\(𝐐,𝐀\)=0,if𝐀is semantically aligned with𝐓∼ℙT\|Q\(⋅∣𝐐\);otherwise,G\(𝐐,𝐀\)=1,G\(\\mathbf\{Q\},\\mathbf\{A\}\)=0,~\\text\{if \}\\mathbf\{A\}\\text\{ is semantically aligned with \}\\mathbf\{T\}\\sim\\mathbb\{P\}\_\{T\|Q\}\(\\cdot\\mid\\mathbf\{Q\}\);~\\text\{otherwise\},~G\(\\mathbf\{Q\},\\mathbf\{A\}\)=1,\(2\)where0indicates that𝐀\\mathbf\{A\}is a truthful response and11indicates that𝐀\\mathbf\{A\}is a hallucinated response\.
Note that Eq\. \([2](https://arxiv.org/html/2606.06959#S2.E2)\) corresponds to the standard response\-level hallucination detection setting, where the detector judges the truthfulness of the entire generated output\. Recent studies have also considered finer\-grained variants, including entity\-level\[[58](https://arxiv.org/html/2606.06959#bib.bib1)\], atomic\-fact\-level\[[38](https://arxiv.org/html/2606.06959#bib.bib2)\], and sentence\-level\[[36](https://arxiv.org/html/2606.06959#bib.bib11)\]hallucination detection, which identify hallucinations at the level of entities, atomic facts, or individual sentences, respectively\. Finer\-grained settings can also support response\-level detection, since the truthfulness of a response may also depend on the correctness of its constituent entities, facts, and sentences\. Moreover, response\-level hallucination detection remains the most widely studied setting\. Therefore, this work mainly focuses on the response\-level hallucination detection\.
Table 1:Dataset coverage of OpenHalDet across diverse generation scenarios\.The format column specifies the task structure used for unified prompt construction and annotation\. The last two columns indicate whether prior hallucination benchmarks explicitly cover each broad scenario\.ScenarioDatasetsFormatHaluEval\[[30](https://arxiv.org/html/2606.06959#bib.bib41)\]HalluMix\[[16](https://arxiv.org/html/2606.06959#bib.bib8)\]QA: Multiple\-choiceARC\-Challenge\[[11](https://arxiv.org/html/2606.06959#bib.bib46)\]; CommonsenseQA\[[51](https://arxiv.org/html/2606.06959#bib.bib47)\]MCQ✗✗QA: Open\-endedTriviaQA\[[21](https://arxiv.org/html/2606.06959#bib.bib33)\]; TruthfulQA\[[32](https://arxiv.org/html/2606.06959#bib.bib34)\]Short answer✓✓QA: Reading comprehensionSQuAD v2\[[44](https://arxiv.org/html/2606.06959#bib.bib48)\]Span extraction✗✗QA: Multi\-hopHotpotQA\[[57](https://arxiv.org/html/2606.06959#bib.bib49)\]Cross\-document reasoning✓✗QA: ConversationalCoQA\[[45](https://arxiv.org/html/2606.06959#bib.bib50)\]Dialogue✗✗QA: GroundedHaluEval\-QA\[[30](https://arxiv.org/html/2606.06959#bib.bib41)\]Context\-based QA✓✓Retrieval\-augmented generationRAGTruth\[[40](https://arxiv.org/html/2606.06959#bib.bib51)\]RAG generation✗✓SummarizationXSum\[[39](https://arxiv.org/html/2606.06959#bib.bib52)\]Abstractive summarization✓✓Mathematical reasoningGSM8K\[[12](https://arxiv.org/html/2606.06959#bib.bib53)\]; SVAMP\[[43](https://arxiv.org/html/2606.06959#bib.bib59)\]Chain\-of\-thought✗✗Scientific reasoningTheoremQA\[[10](https://arxiv.org/html/2606.06959#bib.bib58)\]Chain\-of\-thought✗✗Code generationHumanEval\[[9](https://arxiv.org/html/2606.06959#bib.bib54)\]; MBPP\[[2](https://arxiv.org/html/2606.06959#bib.bib55)\]Code synthesis✗✗Agentic tasksxLAM\-Agent\[[60](https://arxiv.org/html/2606.06959#bib.bib56)\]Tool invocation✗✗Multilingual evaluationBelebele\[[4](https://arxiv.org/html/2606.06959#bib.bib57)\]Multilingual MCQ✗✗
Note\.A check mark indicates explicit coverage of the corresponding broad scenario, not necessarily the same dataset or exact sub\-format\. HalluMix is marked for RAG due to its multi\-document grounded generation setting\.
### 2\.2Benchmark Scope: Unified, Scenario\-Aware, And Model\-Diverse Evaluation
Towards Unified and Fair Evaluation\.Existing hallucination detection studies differ in model choices, task settings, data construction pipelines, and evaluation protocols \(see Appendix[B](https://arxiv.org/html/2606.06959#A2)\), making cross\-study comparisons difficult\. OpenHalDet addresses this fragmentation by consolidating datasets, LLM\-generated responses, representative detectors, and standardized metrics under a common evaluation protocol\. This unified framework enables fair comparison across detector families and supports reproducible extension of hallucination detection methods\.
Reliable Evaluation Across Diverse Scenarios\.Many hallucination detection studies\[[42](https://arxiv.org/html/2606.06959#bib.bib26),[14](https://arxiv.org/html/2606.06959#bib.bib22),[62](https://arxiv.org/html/2606.06959#bib.bib24),[3](https://arxiv.org/html/2606.06959#bib.bib23),[55](https://arxiv.org/html/2606.06959#bib.bib65)\]evaluate detectors on a narrow set of tasks, typically factual question answering\[[32](https://arxiv.org/html/2606.06959#bib.bib34),[21](https://arxiv.org/html/2606.06959#bib.bib33)\]or mathematical reasoning\[[12](https://arxiv.org/html/2606.06959#bib.bib53),[43](https://arxiv.org/html/2606.06959#bib.bib59)\]\. This limited coverage makes it difficult to assess generalization across broader real\-world scenarios\. OpenHalDet addresses this gap with a scenario\-aware evaluation suite, summarized in Table[1](https://arxiv.org/html/2606.06959#S2.T1)\. We organize datasets by application scenario rather than as a flat collection, enabling systematic analysis of detector accuracy and robustness across task settings\.
Fine\-Grained Coverage Across QA Settings\.Question answering \(QA\) is widely used for hallucination detection, but it is not a homogeneous task: QA settings vary in answer format, evidence availability, and reasoning demand\. OpenHalDet therefore separates QA into multiple\-choice, open\-ended, reading\-comprehension, multi\-hop, conversational, and context\-grounded settings\. ARC\-Challenge\[[11](https://arxiv.org/html/2606.06959#bib.bib46)\]and CommonsenseQA\[[51](https://arxiv.org/html/2606.06959#bib.bib47)\]cover multiple\-choice science and commonsense reasoning; TriviaQA\[[21](https://arxiv.org/html/2606.06959#bib.bib33)\]and TruthfulQA\[[32](https://arxiv.org/html/2606.06959#bib.bib34)\]cover open\-ended factual and truthfulness\-oriented QA; SQuAD v2\[[44](https://arxiv.org/html/2606.06959#bib.bib48)\]and HotpotQA\[[57](https://arxiv.org/html/2606.06959#bib.bib49)\]evaluate evidence\-grounded answer extraction, including unanswerable and multi\-hop cases\. CoQA\[[45](https://arxiv.org/html/2606.06959#bib.bib50)\]adds conversational context, while HaluEval\-QA\[[30](https://arxiv.org/html/2606.06959#bib.bib41)\]provides QA\-style hallucination examples with human annotations\. This design enables fine\-grained analysis of detector reliability across QA formats, grounding conditions, and reasoning requirements\.
Broad Coverage Beyond Question Answering\.OpenHalDet further extends evaluation beyond QA to retrieval\-augmented generation, summarization, reasoning, code generation, agentic tool use, and multilingual understanding\. This scope captures settings where detection must account for evidence support, summarization faithfulness, multi\-step reasoning, code correctness, tool\-call validity, and cross\-lingual comprehension\. RAGTruth\[[40](https://arxiv.org/html/2606.06959#bib.bib51)\]and XSum\[[39](https://arxiv.org/html/2606.06959#bib.bib52)\]cover evidence\-grounded generation and abstractive summarization; GSM8K\[[12](https://arxiv.org/html/2606.06959#bib.bib53)\]and TheoremQA\[[10](https://arxiv.org/html/2606.06959#bib.bib58)\]cover mathematical and theorem\-based reasoning; HumanEval\[[9](https://arxiv.org/html/2606.06959#bib.bib54)\]and MBPP\[[2](https://arxiv.org/html/2606.06959#bib.bib55)\]cover code generation\. We further include xLAM\-Agent\[[60](https://arxiv.org/html/2606.06959#bib.bib56)\]for tool use and Belebele\[[4](https://arxiv.org/html/2606.06959#bib.bib57)\]for multilingual evaluation\. This design enables analysis of detector reliability across a wider range of generation contexts\.
Model\-Diverse Evaluation\.OpenHalDet evaluates detectors across recent open\-weight LLMs from the Llama and Qwen families, reflecting the need for robustness beyond a single LLM\. Since many prior studies rely on earlier or single backbones, it remains unclear whether their conclusions transfer to newer models\. The selected backbones are representative recent open\-weight LLMs from two widely used model families\. Appendix[A](https://arxiv.org/html/2606.06959#A1)summarizes the selected models together with representative Hugging Face download statistics, providing a practical indicator of community adoption and supporting our controlled analysis of detector stability across model families\.
### 2\.3Datasets Collection Protocol
Challenges\.Constructing a comprehensive hallucination detection benchmark requires more than simply aggregating datasets from different sources\.Threemajor challenges arise from dataset heterogeneity\. First,different task types exhibit different input structures, including choice\-based inputs, external evidence, programming constraints, and tool specifications\. Second,the expected response formats vary considerably, spanning short answers, option labels, long\-form text, reasoning traces, code, and structured tool calls\. Third,the collected datasets provide reference answers in different forms\(see Appendix[C](https://arxiv.org/html/2606.06959#A3)\), from a single gold answer to multiple acceptable answers\.These differences make it difficult to evaluate all datasets directly within a single unified pipeline\.
Unified Instance Schema\.OpenHalDet uses a unified instance schema as the core abstraction of the benchmark pipeline\. Instead of flattening heterogeneous datasets into raw prompts, each instance is decomposed into shared semantic roles:
s=\(t,i,c,q,ω,r\+,r−\),s=\(t,i,c,q,\\omega,r^\{\+\},r^\{\-\}\),wherettdenotes the task type,iithe task instruction,ccoptional context or constraints,qqthe primary input,ω\\omegatask\-specific options or constraints, andr\+r^\{\+\}andr−r^\{\-\}denote correct and known incorrect references\. The task typettdetermines both the prompt rule and the expected response format\.
This schema preserves task\-specific structure while providing a common interface across datasets\. Input components such as passages, retrieved evidence, answer options, tool specifications, or testing constraints are stored in role\-specific fields rather than collapsed into an unstructured prompt\. Reference information is normalized throughr\+r^\{\+\}andr−r^\{\-\}, allowing the same pipeline to support single gold answers, multiple acceptable answers, and known incorrect answers\. Giventt, OpenHalDet renders only the relevant fields into the model input, standardizing prompt construction while preserving task\-specific information\.Details and examples are provided in Appendix[C](https://arxiv.org/html/2606.06959#A3)\.
Label Assignment\.To obtain comparable labels across heterogeneous tasks, we use a unified response\-level annotation protocol with GPT\-4o\-mini, following the broader practice of LLM\-based evaluation\[[64](https://arxiv.org/html/2606.06959#bib.bib60)\]\. Given the task input, optional context, reference answers, known incorrect answers when available, and the candidate response, the annotator assigns one of three labels:correct,hallucination, orabstention\. We use response\-level labels rather than token\-, span\-, sentence\-, or claim\-level annotations, since finer units are difficult to define consistently across QA, summarization, reasoning, code, tool\-use, and multilingual settings\. For binary detector evaluation,hallucinationis treated as the positive class andcorrectas the negative class; abstentions and invalid annotations are excluded\.More details are provided in Appendix[D](https://arxiv.org/html/2606.06959#A4)\.
### 2\.4Evaluation Metrics
Following\[[42](https://arxiv.org/html/2606.06959#bib.bib26),[14](https://arxiv.org/html/2606.06959#bib.bib22)\], we useAUROCas the primary metric\. Each detector assigns a scalar hallucination score to each response, where higher scores indicate higher hallucination risk\. AUROC evaluates whether hallucinated responses receive higher scores than correct responses across decision thresholds\. In addition to the main benchmark tables, we provide a separate supplementary cost analysis to characterize detector efficiency rather than to define the primary ranking\. We reportCost@N, the wall\-clock time for applying a detector toNNsamples under a fixed hardware and evaluation protocol, decomposed into feature preparation, training, and inference time\.See more details in Appendix[G](https://arxiv.org/html/2606.06959#A7)\.
Figure 1:Timeline and taxonomy of hallucination detection methods supported by OpenHalDet\. Methods are organized by publication year\. Colors indicate model\-access regimes, including black\-box, gray\-box, and white\-box detectors\. Fill patterns indicate whether the method requires no additional training, calibration or fitting, or supervised adaptation\. See Table[8](https://arxiv.org/html/2606.06959#A5.T8)for more details\.
## 3Supported Detection Methods and Evaluation Protocol
This section provides a concise overview of the 16 hallucination detection methods shown in Figure[1](https://arxiv.org/html/2606.06959#S2.F1)\. We focus on representative methods with publicly available implementations or sufficiently clear algorithmic descriptions for reproducible implementation,see more details in Appendix[E](https://arxiv.org/html/2606.06959#A5)\.
### 3\.1Black\-box Detectors
Black\-box detectors perform hallucination detection using only text outputs, without access to token probabilities, sequence likelihoods, hidden states, or attention maps\. They therefore suit closed\-source or API\-only cases, where users can query the model but cannot inspect its internal signals\.
We categorize them by the type of textual signal used:*verbalized\-confidence methods*, which rely on a self\-reported confidence statement from a single output, and*sample\-consistency methods*, which estimate hallucination risk from consistency across multiple sampled responses\. 1\) Verbalized\-confidence methods are a small but practical family of black\-box detectors that use the model’s self\-reported confidence as a hallucination signal\. Such methods are lightweight and easy to deploy, but their reliability depends on whether the model’s stated confidence is calibrated with factual correctness\. We includeVerbalized Confidence\[[31](https://arxiv.org/html/2606.06959#bib.bib15)\]as a representative baseline for this category\. 2\) Sample\-consistency methods estimate hallucination risk by generating multiple responses to the same input and measuring disagreement among sampled outputs\.SelfCheckGPT\[[36](https://arxiv.org/html/2606.06959#bib.bib11)\]checks whether independently sampled responses support or contradict the main response, whereasLexical Similarity\[[33](https://arxiv.org/html/2606.06959#bib.bib14)\]captures surface\-level agreement across sampled outputs\.
### 3\.2Gray\-box Detectors
Gray\-box detectors rely on probability\-based signals exposed during generation, including token probabilities, sequence likelihoods, and related generation scores\. They require more model access than black\-box detectors, but do not use hidden states, attention maps, or other internal representations\. In OpenHalDet, gray\-box detectors are represented by likelihood\-based uncertainty methods\.
These methods estimate hallucination risk from probability\-based signals associated with the model output, including token probabilities, sequence likelihoods, and uncertainty over sampled generations\. We organize them into two groups:*single\-output likelihood methods*and*sampling\-based uncertainty methods*\. 1\) For single\-output likelihood methods,Perplexity\[[47](https://arxiv.org/html/2606.06959#bib.bib16)\]scores a generated output using token\-level likelihoods, whileSelf\-evaluation\[[22](https://arxiv.org/html/2606.06959#bib.bib12)\]asks the model to assess whether its own output is true or false; following our score orientation, we use the probability of the “False” label, equivalently one minus the probability of correctness, as the hallucination score\. 2\) For sampling\-based uncertainty methods,LN\-Entropy\[[35](https://arxiv.org/html/2606.06959#bib.bib17)\]measures length\-normalized sequence\-level uncertainty from sampled outputs and their likelihoods\.SAR\[[15](https://arxiv.org/html/2606.06959#bib.bib27)\]further reweights this uncertainty by semantic relevance among samples; for scalability, we use its sentence\-level variant\.Semantic Entropy\[[17](https://arxiv.org/html/2606.06959#bib.bib13)\]clusters sampled outputs into semantic equivalence classes and computes uncertainty over the resulting semantic distribution\. All methods in this group are training\-free, but require access to probability or likelihood information from the target model\.
### 3\.3White\-box Detectors
White\-box detectors require access to internal model signals, such as hidden states, attention maps, layerwise activations, or derived internal features\. In contrast to black\-box and gray\-box methods, they directly exploit internal representations rather than relying only on observable text or probability\-based generation signals\. Such information may provide richer evidence for hallucination detection, but imposes the strongest access assumptions\. As a result, white\-box detectors are typically more applicable to open\-source or fully accessible models than to closed\-source API\-only systems\.
These methods use internal representations extracted from the target LLM to detect hallucination\. We organize them according to how internal signals are used:*representation\-consistency scores*,*contrastive or subspace\-based objectives*,*hidden\-state probes*, and*prompt\- or dynamics\-guided internal features*\. For*representation\-consistency*methods,EigenScore\[[8](https://arxiv.org/html/2606.06959#bib.bib18)\]measures representation\-level consistency across stochastic generations by analyzing the spectral geometry of hidden\-state representations\. For*contrastive or subspace\-based*methods,CCS\[[7](https://arxiv.org/html/2606.06959#bib.bib21)\]learns a contrast\-consistent direction from paired hidden states, whileHaloScope\[[14](https://arxiv.org/html/2606.06959#bib.bib22)\]estimates hallucination\-related membership in a latent representation subspace and trains a truthfulness classifier from the estimated memberships\. For*hidden\-state probe*methods,SAPLMA\[[3](https://arxiv.org/html/2606.06959#bib.bib23)\]trains a supervised classifier on input\-output hidden states to predict statement truthfulness\.MIND\[[49](https://arxiv.org/html/2606.06959#bib.bib19)\]uses generation\-time hidden states in an unsupervised training framework for real\-time hallucination detection, andSEP\[[28](https://arxiv.org/html/2606.06959#bib.bib20)\]trains probes to approximate semantic entropy from hidden states of a single generation\. For*prompt\- or dynamics\-guided internal features*,PRISM\[[59](https://arxiv.org/html/2606.06959#bib.bib25)\]uses prompt\-guided hidden states to improve cross\-domain supervised detection, whereasICR Probe\[[62](https://arxiv.org/html/2606.06959#bib.bib24)\]constructs ICR scores from hidden\-state updates and attention maps to track cross\-layer residual\-stream dynamics\.
Table 2:AUROC results \(%\) aggregated by scenario across backbone LLMs\. Columns follow the task taxonomy in Table[1](https://arxiv.org/html/2606.06959#S2.T1): QA averages ARC\-Challenge, CommonsenseQA, TriviaQA, TruthfulQA, SQuAD\_V2, HotpotQA, CoQA, and HaluEval\-QA when available; RAG, Sum\., Sci\., Agent, and Multi\. report RAGTruth, XSum, TheoremQA, xLAM\-Agent, and Belebele, respectively; Math averages GSM8K and SVAMP; and Code averages HumanEval and MBPP\. Higher values are better\. SelfCheck\-BERT and SelfCheck\-NLI denote the BERTScore\- and NLI\-based SelfCheckGPT\.
### 3\.4Access\-Aware Evaluation Protocol
Challenges\.A major obstacle to comprehensive hallucination\-detector evaluation is the engineering burden of heterogeneous signal extraction\. Existing detectors operate under different access levels: black\-box methods require generated texts or multiple stochastic samples, gray\-box methods require predictive distributions such as token log\-probabilities, and white\-box methods require internal model signals such as layer\-wise hidden states or attention maps\. In practice, evaluating these methods across new LLMs and datasets often requires method\-specific data preparation, signal extraction, and scoring pipelines\. This fragmentation makes large\-scale comparison costly and complicates the reproduction of baselines across heterogeneous tasks and backbone models\.
Decoupled Evaluation Pipeline\.To reduce the engineering burden of heterogeneous detector evaluation, OpenHalDet decouples response generation, signal extraction, and detector scoring\. For each instance, the pipeline first constructs a shared evaluation cache containing the primary generated response, stochastic samples, token log\-probabilities, and layer\-wise hidden states when available\. Detectors then operate offline on the subset of cached signals allowed by their access regime: surface texts for black\-box methods, probability\-based generation signals for gray\-box methods, and internal representations for white\-box methods\. For specialized white\-box detectors, derived internal features are extracted with standardized routines while keeping the generated response fixed\.
This design reduces detector integration to a common data\-loading and scoring interface, improves comparability by reusing the same generated responses and stochastic samples across methods, and improves reproducibility by centralizing likelihood and representation extraction\. It also makes OpenHalDet extensible: new datasets can be added through the unified instance schema, and new detectors can be integrated by specifying their required access signals and scoring function\. As a result, comparisons are less confounded by method\-specific input preparation and more directly reflect detector scoring behavior\.More details are provided in Appendix[E](https://arxiv.org/html/2606.06959#A5)\.
## 4Experiments
### 4\.1Experimental Setup
We conduct a comprehensive evaluation of OpenHalDet on 17 datasets and four backbone LLMs: Llama\-3\.1\-8B\-Instruct, Llama\-3\.2\-3B\-Instruct\[[52](https://arxiv.org/html/2606.06959#bib.bib63)\], Qwen3\-8B, and Qwen3\-14B\[[53](https://arxiv.org/html/2606.06959#bib.bib64)\]\. We additionally report experiments on Llama\-3\.3\-70B\-Instruct in Appendix[F](https://arxiv.org/html/2606.06959#A6)\. To ensure fair and reproducible comparison, all detectors are evaluated under a unified protocol for prompt construction, response generation, annotation, detector scoring, and metric computation\. For sample\-based methods, we generate five stochastic responses per input with temperature1\.01\.0and top\-p=0\.9p=0\.9\[[36](https://arxiv.org/html/2606.06959#bib.bib11),[17](https://arxiv.org/html/2606.06959#bib.bib13)\], and reuse the same sampled responses across all applicable detectors\. For internal\-state methods, we use mean\-pooled hidden states over generated answer tokens as the default representation\[[14](https://arxiv.org/html/2606.06959#bib.bib22),[49](https://arxiv.org/html/2606.06959#bib.bib19),[62](https://arxiv.org/html/2606.06959#bib.bib24)\]\. For each dataset, we create stratified 60/20/20 train/validation/test splits that preserve the hallucinated/correct label ratio\. All experiments use a fixed random seed of 42 and are conducted on NVIDIA H100 GPUs\.More details about method\-specific implementation are documented in Appendix[E](https://arxiv.org/html/2606.06959#A5)\.
### 4\.2Main Results
Table[2](https://arxiv.org/html/2606.06959#S3.T2)reports AUROC \(%\) aggregated by scenario under the task taxonomy in Table[1](https://arxiv.org/html/2606.06959#S2.T1)\. Multi\-dataset scenarios are averaged over constituent datasets, while single\-dataset scenarios are reported directly\. Detailed per\-dataset results and selected Llama\-3\.3\-70B results are provided in Appendix[F](https://arxiv.org/html/2606.06959#A6)\. We highlight three findings: the detector’s effectiveness depends on scenario and backbone; model access improves the ceiling, but not robustness; and evidence acquisition dominates practical cost\.
Detector Effectiveness Depends on Scenario and Backbone\.No detector family uniformly dominates across all reported settings\. For example, on Llama\-3\.2\-3B\-Instruct, gray\-box methods obtain the highest overall family average, slightly above black\-box and white\-box methods, with averages of 66\.47, 66\.07, and 65\.91, respectively\. The leading family also changes across scenarios: on the same backbone, gray\-box methods are strongest on multilingual evaluation with an average of 73\.36, while white\-box methods are stronger on science with an average of 66\.57\. These shifts show that hallucination detection cannot be characterized by a single dataset, scenario, or target backbone\.
More Model Access Does Not Guarantee Better Detection\.Access to logits or hidden states provides additional evidence, but it does not by itself determine detector quality\. Gray\-box methods remain competitive with white\-box methods: on Llama\-3\.2\-3B\-Instruct, the gray\-box family average is slightly higher than the white\-box average, 66\.47 versus 65\.91, while on Qwen3\-8B the two are close, 64\.61 versus 64\.97\. Within the white\-box family, performance also varies substantially; for example, on Llama\-3\.2\-3B\-Instruct,MINDreaches 75\.45 overall AUROC, whileCCSobtains 56\.67\. Thus, detector performance depends not only on the access regime, but also on how the available evidence is selected, modeled, and converted into a hallucination\-risk score\.
Black\-box Signals Are Limited but Still Task\-dependent\.Black\-box detectors rely only on output text or additional generations, which generally limits their performance relative to methods using likelihoods or internal states\. At the family level, black\-box methods are usually below the strongest access regimes, with overall averages of 60\.56 on Qwen3\-8B, 66\.07 on Llama\-3\.2\-3B\-Instruct, 67\.64 on Llama\-3\.1\-8B\-Instruct, and 61\.79 on Qwen3\-14B\. Within the black\-box family, direct verbalized confidence is generally weaker than sample\-consistency signals\. For example, on Llama\-3\.2\-3B\-Instruct,Verbalized Confidenceobtains an overall average of 57\.99, whileSelfCheckGPT\-BERTScore,SelfCheckGPT\-NLI, and lexical similarity reach 69\.25, 67\.86, and 69\.16, respectively\. This suggests that comparing stochastic generations is generally more informative than relying on the model’s stated confidence, although the usefulness of such consistency signals still depends on the task and comparison metric\.
Accuracy Can Hide Large Cost Differences\.Figure[2](https://arxiv.org/html/2606.06959#S4.F2)provides a cost\-aware view on Llama\-3\.2\-3B\-Instruct by plotting full\-evaluation AUROC against artifact\-inclusive Cost@100\. Detectors with similar AUROC can differ substantially in cost depending on how they acquire evidence\. Methods that reuse single\-generation likelihoods or cached internal features usually remain low\-cost, whereas sampling\-based methods incur additional cost from repeated generations\. This extra evidence is not uniformly beneficial: in some scenarios, low\-cost likelihood or internal\-state methods already approach the performance of sampling\-based detectors, while in others repeated\-generation signals provide competitive accuracy at higher cost\. These results show that detector comparisons should report not only accuracy, but also the evidence\-acquisition cost required to obtain the score\.
Overall, hallucination detection performance depends jointly on the task scenario, target backbone, access regime, and evidence\-acquisition strategy\. White\-box methods are often strong, but their gains vary across datasets and models\. Gray\-box methods are competitive practical baselines and can match or exceed weaker internal\-state methods without requiring hidden\-state access\. Black\-box methods are generally more limited, with sample\-consistency signals usually more reliable than direct verbalized confidence\. These trends motivate a unified benchmark that compares detectors across diverse tasks, backbones, and access regimes under the same protocol\.
Figure 2:Accuracy–cost trade\-offs on Llama\-3\.2\-3B\-Instruct across representative scenarios\. Each point denotes a detector\. The y\-axis reports full\-evaluation AUROC, and the x\-axis reports artifact\-inclusive Cost@100 on a log scale\. Colors indicate the evidence\-acquisition type\. Repeated\-generation methods are consistently more expensive, while methods reusing single\-generation signals remain low\-cost\. The benefit of expensive evidence is task\-dependent: it is less pronounced on QA\-style tasks but becomes more competitive on reasoning and code\-generation tasks\.
## 5Comparison with Existing Hallucination Benchmarks
Existing benchmarks can be broadly divided into two categories\. The first category focuses on constructing hallucination evaluation datasets or taxonomies\. HaluEval\[[30](https://arxiv.org/html/2606.06959#bib.bib41)\]provides generated and human\-annotated hallucinated samples across question answering, dialogue, and summarization; RAGTruth\[[40](https://arxiv.org/html/2606.06959#bib.bib51)\]offers case\-level and word\-level hallucination annotations for RAG responses; and HalluLens\[[6](https://arxiv.org/html/2606.06959#bib.bib7)\]further studies hallucination taxonomy and evaluation settings\. These benchmarks provide data resources, buttheir primary goal is to evaluate hallucination behavior or build task\-specific hallucination corpora, rather than to systematically compare heterogeneous detector mechanisms\.
The second category is closer to hallucination detector benchmarking\. HalluMix evaluates several open\-source and closed\-source hallucination detection systems across domains, task formats, and input settings\[[16](https://arxiv.org/html/2606.06959#bib.bib8)\]\. However,its evaluated systems are mainly deployment\-oriented groundedness or API\-style detectors\. OpenHalDet instead targets academic detector\-family evaluation: it covers black\-box sample\-consistency methods, gray\-box likelihood\-based uncertainty methods, and white\-box internal\-state\-based methods under a unified protocol\. In addition, OpenHalDet provides one\-command baseline evaluation and a unified detector interface, allowing new hallucination detection methods to be integrated without rebuilding the generation, annotation, scoring, or evaluation pipeline\.
## 6Conclusion
In this paper, we present OpenHalDet, a unified benchmark for evaluating hallucination detection methods across diverse generation scenarios and model\-access regimes\. Using OpenHalDet, we compare 16 representative detectors spanning black\-box, gray\-box, and white\-box settings\. Our results reveal several key findings: 1\) self\-reported confidence is poorly aligned with factual truthfulness; 2\) sample\-consistency and likelihood\-based uncertainty provide strong practical signals under limited model access; 3\) internal\-state methods achieve the highest performance ceiling, but their effectiveness depends on how internal representations are modeled; and 4\) detector performance varies substantially across task scenarios and backbone LLMs\. Together with standardized protocols for response generation, label assignment, detector scoring, and evaluation, OpenHalDet provides an extensible foundation for fair and reproducible comparison of future hallucination detection methods\.
## References
- \[1\]\(2025\)A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation\.npj Digital Medicine8\(1\),pp\. 274\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-01670-7)Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1)\.
- \[2\]J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. J\. Cai, M\. Terry, Q\. V\. Le, and C\. Sutton\(2021\)Program synthesis with large language models\.ArXiv\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.12.12.2.1.1)\.
- \[3\]A\. Azaria and T\. Mitchell\(2023\)The internal state of an LLM knows when it’s lying\.InThe 2023 Conference on Empirical Methods in Natural Language Processing,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.16.16.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1)\.
- \[4\]L\. Bandarkar, D\. Liang, B\. Muller, M\. Artetxe, S\. N\. Shukla, D\. Husa, N\. Goyal, A\. Krishnan, L\. Zettlemoyer, and M\. Khabsa\(2024\-08\)The belebele benchmark: a parallel reading comprehension dataset in 122 language variants\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.14.14.2.1.1)\.
- \[5\]S\. Banerjee, A\. Agarwal, and S\. Singla\(2025\)Llms will always hallucinate, and we need to live with this\.InIntelligent Systems Conference,pp\. 624–648\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[6\]Y\. Bang, Z\. Ji, A\. Schelten, A\. Hartshorn, T\. Fowler, C\. Zhang, N\. Cancedda, and P\. Fung\(2025\-07\)HalluLens: LLM hallucination benchmark\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Cited by:[Table 5](https://arxiv.org/html/2606.06959#A3.T5.3.4.3.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1),[§5](https://arxiv.org/html/2606.06959#S5.p1.1)\.
- \[7\]C\. Burns, H\. Ye, D\. Klein, and J\. Steinhardt\(2023\)Discovering latent knowledge in language models without supervision\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.14.14.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1)\.
- \[8\]C\. Chen, K\. Liu, Z\. Chen, Y\. Gu, Y\. Wu, M\. Tao, Z\. Fu, and J\. Ye\(2024\)INSIDE: LLMs’ internal states retain the power of hallucination detection\.InThe Twelfth International Conference on Learning Representations,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.13.13.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1)\.
- \[9\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. Pondé, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. W\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, I\. Babuschkin, S\. Balaji, S\. Jain, A\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba\(2021\)Evaluating large language models trained on code\.ArXivabs/2107\.03374\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.12.12.2.1.1)\.
- \[10\]W\. Chen, M\. Yin, M\. Ku, P\. Lu, Y\. Wan, X\. Ma, J\. Xu, X\. Wang, and T\. Xia\(2023\-12\)TheoremQA: a theorem\-driven question answering dataset\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.11.11.2.1.1)\.
- \[11\]P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord\(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.ArXivabs/1803\.05457\.External Links:[Link](https://api.semanticscholar.org/CorpusID:3922816)Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.2.2.2.1.1)\.
- \[12\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman\(2021\)Training verifiers to solve math word problems\.ArXivabs/2110\.14168\.External Links:[Link](https://api.semanticscholar.org/CorpusID:239998651)Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.10.10.2.1.1)\.
- \[13\]Y\. Deng, Z\. Fang, S\. Li, and L\. Chen\(2026\)Beyond in\-domain detection: spikescore for cross\-domain hallucination detection\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2\.1](https://arxiv.org/html/2606.06959#S2.SS1.p1.9)\.
- \[14\]X\. Du, C\. Xiao, and Y\. Li\(2024\)HaloScope: harnessing unlabeled LLM generations for hallucination detection\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.15.15.1),[§2\.1](https://arxiv.org/html/2606.06959#S2.SS1.p1.9),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§2\.4](https://arxiv.org/html/2606.06959#S2.SS4.p1.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1),[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[15\]J\. Duan, H\. Cheng, S\. Wang, A\. Zavalny, C\. Wang, R\. Xu, B\. Kailkhura, and K\. Xu\(2024\-08\)Shifting attention to relevance: towards the predictive uncertainty quantification of free\-form large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand\.Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.10.10.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.06959#S3.SS2.p2.1)\.
- \[16\]D\. Emery, M\. Goitia, F\. Vargus, and I\. Neagu\(2025\)HalluMix: a task\-agnostic, multi\-domain benchmark for real\-world hallucination detection\.External Links:2505\.00506Cited by:[Table 5](https://arxiv.org/html/2606.06959#A3.T5.3.5.4.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.1.1.5.1.1.1),[§5](https://arxiv.org/html/2606.06959#S5.p2.1)\.
- \[17\]S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal\(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630,pp\. 625 – 630\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270615909)Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.11.11.1),[§1](https://arxiv.org/html/2606.06959#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.06959#S3.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[18\]E\. Harvey, A\. Koenecke, and R\. F\. Kizilcec\(2025\)"Don’t forget the teachers": Towards an educator\-centered understanding of harms from large language models in education\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,pp\. 1–19\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[19\]L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu\(2025\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Trans\. Inf\. Syst\.\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1)\.
- \[20\]D\. Janiak, J\. Binkowski, A\. Sawczyn, B\. Gabrys, R\. Shwartz\-Ziv, and T\. Kajdanowicz\(2025\)The illusion of progress: re\-evaluating hallucination detection in llms\.InConference on Empirical Methods in Natural Language Processing,Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1)\.
- \[21\]M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer\(2017\)Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension\.ACL\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.3.3.2.1.1)\.
- \[22\]S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Dodds, N\. Dassarma, E\. Tran\-Johnson, S\. Johnston, S\. El\-Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. B\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan\(2022\)Language models \(mostly\) know what they know\.ArXivabs/2207\.05221\.Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.8.8.1),[§3\.2](https://arxiv.org/html/2606.06959#S3.SS2.p2.1)\.
- \[23\]A\. T\. Kalai, O\. Nachum, S\. S\. Vempala, and E\. Zhang\(2025\)Why language models hallucinate\.arXiv preprint arXiv:2509\.04664\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[24\]A\. T\. Kalai and S\. S\. Vempala\(2024\)Calibrated language models must hallucinate\.InProceedings of the 56th Annual ACM Symposium on Theory of Computing,Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[25\]A\. T\. Kalai and S\. S\. Vempala\(2024\)Calibrated language models must hallucinate\.InProceedings of the 56th Annual ACM Symposium on Theory of Computing,STOC 2024,New York, NY, USA,pp\. 160–171\.External Links:ISBN 9798400703836Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[26\]A\. Kalavasis, A\. Mehrotra, and G\. Velegkas\(2025\)On the limits of language generation: trade\-offs between hallucination and mode\-collapse\.InProceedings of the 57th Annual ACM Symposium on Theory of Computing,Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[27\]H\. Kang and X\. Liu\(2023\)Deficiency of large language models in finance: An empirical examination of hallucination\.arXiv preprint arXiv:2311\.15548\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[28\]J\. Kossen, J\. Han, M\. Razzak, L\. Schut, S\. A\. Malik, and Y\. Gal\(2025\)Semantic entropy probes: robust and cheap hallucination detection in LLMs\.Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.18.18.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1)\.
- \[29\]A\. Kulkarni\*, Y\. Zhang, J\. R\. A\. Moniz\*, X\. Ge, B\. Tseng, D\. Piraviperumal, S\. Swayamdipta, and H\. Yu\(2025\)Evaluating evaluation metrics — the mirage of hallucination detection\.InEMNLP,External Links:[Link](https://arxiv.org/abs/2504.18114)Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p3.1)\.
- \[30\]J\. Li, X\. Cheng, W\. X\. Zhao, J\. Nie, and J\. Wen\(2023\)Halueval: a large\-scale hallucination evaluation benchmark for large language models\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 6449–6464\.Cited by:[Table 5](https://arxiv.org/html/2606.06959#A3.T5.3.2.1.1),[§1](https://arxiv.org/html/2606.06959#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.1.1.4.1.1.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.7.7.2.1.1),[§5](https://arxiv.org/html/2606.06959#S5.p1.1)\.
- \[31\]S\. C\. Lin, J\. Hilton, and O\. Evans\(2022\)Teaching models to express their uncertainty in words\.Trans\. Mach\. Learn\. Res\.2022\.Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.3.3.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.06959#S3.SS1.p2.1)\.
- \[32\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)Truthfulqa: measuring how models mimic human falsehoods\.ACL\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.3.3.2.1.1)\.
- \[33\]Z\. Lin, S\. Trivedi, and J\. Sun\(2024\)Generating with confidence: uncertainty quantification for black\-box large language models\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.5.5.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.06959#S3.SS1.p2.1)\.
- \[34\]L\. Liu, R\. Pourreza, S\. Panchal, A\. Bhattacharyya, Y\. Zhang, and R\. Memisevic\(2025\)Enhancing hallucination detection through noise injection\.CoRRabs/2502\.03799\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[35\]A\. Malinin and M\. Gales\(2021\)Uncertainty estimation in autoregressive structured prediction\.InInternational Conference on Learning Representations,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.9.9.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§3\.2](https://arxiv.org/html/2606.06959#S3.SS2.p2.1)\.
- \[36\]P\. Manakul, A\. Liusie, and M\. Gales\(2023\)SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\.InThe 2023 Conference on Empirical Methods in Natural Language Processing,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.4.4.1),[§1](https://arxiv.org/html/2606.06959#S1.p1.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.06959#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2606.06959#S3.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[37\]J\. Maynez, S\. Narayan, B\. Bohnet, and R\. McDonald\(2020\)On faithfulness and factuality in abstractive summarization\.InProceedings of the 58th annual meeting of the association for computational linguistics,pp\. 1906–1919\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[38\]S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. W\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi\(2023\)FActscore: fine\-grained atomic evaluation of factual precision in long form text generation\.InEMNLP,Cited by:[§2\.1](https://arxiv.org/html/2606.06959#S2.SS1.p2.1)\.
- \[39\]S\. Narayan, S\. B\. Cohen, and M\. Lapata\(2018\-October\-November\)Don’t give me the details, just the summary\! topic\-aware convolutional neural networks for extreme summarization\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.9.9.2.1.1)\.
- \[40\]C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang\(2024\-08\)RAGTruth: a hallucination corpus for developing trustworthy retrieval\-augmented language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand\.Cited by:[Table 5](https://arxiv.org/html/2606.06959#A3.T5.3.3.2.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.8.8.2.1.1),[§5](https://arxiv.org/html/2606.06959#S5.p1.1)\.
- \[41\]V\. Noël\(2026\)Spectral guardrails for agents in the wild: detecting tool use hallucinations via attention topology\.arXiv preprint arXiv:2602\.08082\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1)\.
- \[42\]S\. Park, X\. Du, M\. Yeh, H\. Wang, and Y\. Li\(2025\)Steer LLM latents for hallucination detection\.InForty\-second International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06959#S2.SS1.p1.9),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§2\.4](https://arxiv.org/html/2606.06959#S2.SS4.p1.1)\.
- \[43\]A\. Patel, S\. Bhattamishra, and N\. Goyal\(2021\-06\)Are NLP models really able to solve simple math word problems?\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),Online\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.10.10.2.1.1)\.
- \[44\]P\. Rajpurkar, R\. Jia, and P\. Liang\(2018\-07\)Know what you don’t know: unanswerable questions for SQuAD\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.4.4.2.1.1)\.
- \[45\]S\. Reddy, D\. Chen, and C\. D\. Manning\(2019\)CoQA: a conversational question answering challenge\.Transactions of the Association for Computational Linguistics7\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.6.6.2.1.1)\.
- \[46\]J\. Ren, J\. Luo, Y\. Zhao, K\. Krishna, M\. Saleh, B\. Lakshminarayanan, and P\. J\. Liu\(2022\)Out\-of\-distribution detection and selective generation for conditional language models\.arXiv preprint arXiv:2209\.15558\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[47\]J\. Ren, J\. Luo, Y\. Zhao, K\. Krishna, M\. Saleh, B\. Lakshminarayanan, and P\. J\. Liu\(2023\)Out\-of\-distribution detection and selective generation for conditional language models\.InThe Eleventh International Conference on Learning Representations,Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.7.7.1),[§1](https://arxiv.org/html/2606.06959#S1.p2.1),[§3\.2](https://arxiv.org/html/2606.06959#S3.SS2.p2.1)\.
- \[48\]D\. Roustan and F\. Bastardot\(2025\)The clinicians’ guide to large language models: A general perspective with a focus on hallucinations\.Interactive Journal of Medical Research14,pp\. e59823\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[49\]W\. Su, C\. Wang, Q\. Ai, Y\. Hu, Z\. Wu, Y\. Zhou, and Y\. Liu\(2024\-08\)Unsupervised real\-time hallucination detection based on the internal states of large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand\.Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.17.17.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1),[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[50\]Z\. Sun, X\. Zang, K\. Zheng, J\. Xu, X\. Zhang, W\. Yu, Y\. Song, and H\. Li\(2025\)ReDeEP: detecting hallucination in retrieval\-augmented generation via mechanistic interpretability\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ztzZDzgfrh)Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1)\.
- \[51\]A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant\(2019\-06\)CommonsenseQA: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.2.2.2.1.1)\.
- \[52\]L\. Team\(2024\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783),[Document](https://dx.doi.org/10.48550/ARXIV.2407.21783),2407\.21783Cited by:[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[53\]Q\. Team\(2025\)Qwen3 technical report\.CoRRabs/2505\.09388\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.09388),[Document](https://dx.doi.org/10.48550/ARXIV.2505.09388),2505\.09388Cited by:[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[54\]A\. Urlana, G\. Kanumolu, C\. V\. Kumar, B\. M\. Garlapati, and R\. Mishra\(2025\-12\)HalluCounter: reference\-free LLM hallucination detection in the wild\!\.InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics,K\. Inui, S\. Sakti, H\. Wang, D\. F\. Wong, P\. Bhattacharyya, B\. Banerjee, A\. Ekbal, T\. Chakraborty, and D\. P\. Singh \(Eds\.\),Mumbai, India\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p3.1)\.
- \[55\]Y\. Wang, P\. Zhang, B\. Yang, D\. F\. Wong, and R\. Wang\(2025\)Latent space chain\-of\-embedding enables output\-free LLM self\-evaluation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=jxo70B9fQo)Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1)\.
- \[56\]Y\. A\. Yadkori, I\. Kuzborskij, D\. Stutz, A\. György, A\. Fisch, A\. Doucet, I\. Beloshapka, W\. Weng, Y\. Yang, C\. Szepesvári,et al\.\(2024\)Mitigating llm hallucinations via conformal abstention\.arXiv preprint arXiv:2405\.01563\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1)\.
- \[57\]Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning\(2018\-October\-November\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.5.5.2.1.1)\.
- \[58\]M\. Yeh, M\. Kamachee, S\. Park, and Y\. Li\(2025\)HalluEntity: benchmarking and understanding entity\-level hallucination detection\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856Cited by:[§2\.1](https://arxiv.org/html/2606.06959#S2.SS1.p2.1)\.
- \[59\]F\. Zhang, P\. Yu, B\. Yi, B\. Zhang, T\. Li, and Z\. Liu\(2025\-07\)Prompt\-guided internal states for hallucination detection of large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 21806–21818\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1058),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.20.20.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1)\.
- \[60\]J\. Zhang, T\. Lan, M\. Zhu, Z\. Liu, T\. Hoang, S\. Kokane, W\. Yao, J\. Tan, Z\. Liu, Y\. Feng, J\. C\. Niebles, S\. Heinecke, H\. Wang, S\. Savarese, and C\. Xiong\(2025\-04\)XLAM: a family of large action models to empower AI agent systems\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico\.Cited by:[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p4.1),[Table 1](https://arxiv.org/html/2606.06959#S2.T1.6.13.13.2.1.1)\.
- \[61\]Y\. Zhang, Y\. Li, L\. Cui, D\. Cai, L\. Liu, T\. Fu, X\. Huang, E\. Zhao, Y\. Zhang, Y\. Chen, L\. Wang, A\. T\. Luu, W\. Bi, F\. Shi, and S\. Shi\(2025\)Siren’s song in the AI ocean: a survey on hallucination in large language models\.Computational Linguistics\.Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p2.1)\.
- \[62\]Z\. Zhang, X\. Hu, H\. Zhang, J\. Zhang, and X\. Wan\(2025\-07\)ICR probe: tracking hidden state dynamics for reliable hallucination detection in LLMs\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 17986–18002\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.880),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.06959#A5.T8.3.19.19.1),[§1](https://arxiv.org/html/2606.06959#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.06959#S2.SS2.p2.1),[§3\.3](https://arxiv.org/html/2606.06959#S3.SS3.p2.1),[§4\.1](https://arxiv.org/html/2606.06959#S4.SS1.p1.2)\.
- \[63\]W\. X\. Zhao, K\. Zhou, J\. Li, T\. Tang, X\. Wang, Y\. Hou, Y\. Min, B\. Zhang, J\. Zhang, Z\. Dong, Y\. Du, C\. Yang, Y\. Chen, Z\. Chen, J\. Jiang, R\. Ren, Y\. Li, X\. Tang, Z\. Liu, P\. Liu, J\. Nie, and J\. Wen\(2026\)A survey of large language models\.External Links:2303\.18223Cited by:[§1](https://arxiv.org/html/2606.06959#S1.p1.1)\.
- \[64\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.Cited by:[§2\.3](https://arxiv.org/html/2606.06959#S2.SS3.p4.1)\.
## Appendix Contents
## Appendix ABackbone LLMs
Backbone selection\.OpenHalDet evaluates hallucination detectors on recent open\-weight LLMs from the Llama and Qwen families\. This choice is motivated by the need to assess detector robustness beyond a single backbone, since detector behavior can vary with model family, scale, instruction tuning, and generation characteristics\. We use five instruction\-tuned backbones spanning two widely adopted open\-weight model families and multiple parameter scales: Llama\-3\.1\-8B\-Instruct, Llama\-3\.2\-3B\-Instruct, Llama\-3\.3\-70B\-Instruct, Qwen3\-8B, and Qwen3\-14B\. Table[3](https://arxiv.org/html/2606.06959#A1.T3)summarizes these selected backbones and representative neighboring models from the same families\. We report Hugging Face download counts as a practical proxy for community adoption, while noting that they are not a definitive measure of model quality, deployment prevalence, or downstream usage\.
Table 3:Backbone LLMs used in OpenHalDet and representative neighboring open\-weight models\. “Downloads” denotes Hugging Face downloads in the last month, recorded from the corresponding model pages on May 7, 2026\. Download counts are dynamic and should be interpreted as a coarse indicator of community adoption rather than a stable benchmark metric\.ModelFamilyParamsDownloadsUsed in OpenHalDetRole / Rationalemeta\-llama/Llama\-3\.1\-8B\-InstructLlama8B9,683,014YesRecent Llama instruction\-tuned model; representative mid\-size Llama backbone\.meta\-llama/Llama\-3\.2\-3B\-InstructLlama3B2,267,285YesSmaller recent Llama backbone, useful for evaluating detector behavior under constrained model scale\.meta\-llama/Llama\-3\.3\-70B\-InstructLlama70B778,123YesLarge recent Llama instruction\-tuned backbone, enabling comparison between smaller and larger Llama models\.Qwen/Qwen3\-8BQwen8B10,813,873YesRecent Qwen3 model with strong community adoption; representative mid\-size Qwen backbone\.Qwen/Qwen3\-14BQwen14B2,934,723YesLarger Qwen3 backbone, enabling controlled comparison across Qwen model scales\.meta\-llama/Meta\-Llama\-3\-8B\-InstructLlama8B1,665,386NoEarlier Llama\-3 instruction\-tuned model included as a neighboring release for context\.Qwen/Qwen2\.5\-7B\-InstructQwen7B14,206,853NoEarlier Qwen2\.5 instruction\-tuned model included as a high\-adoption neighboring release\.Qwen/Qwen2\.5\-14B\-InstructQwen14B2,642,963NoEarlier Qwen2\.5 model at a similar scale to Qwen3\-14B, included for contextual comparison\.
Interpretation\.The selected backbones are intended to provide controlled coverage over model family and scale while keeping the benchmark computationally tractable\. The comparison models in Table[3](https://arxiv.org/html/2606.06959#A1.T3)are not evaluated in the main experiments; they are included only to contextualize the selected Llama and Qwen backbones relative to nearby open\-weight releases\. Our claims therefore concern detector behavior over the selected Llama and Qwen backbones under the standardized OpenHalDet protocol, rather than universal generalization to all open\-weight LLMs\.
## Appendix BPrior Detector Settings
Prior hallucination detectors have been evaluated under diverse experimental settings, with substantial differences in backbone LLMs, task datasets, prompting protocols, and model\-access assumptions\. These differences make cross\-study comparisons difficult: a reported performance gap may reflect the detector itself, but it may also arise from the choice of model, task distribution, generation setup, or available evidence\. TableLABEL:tab:detector\_source\_setupssummarizes the original evaluation settings associated with the detectors considered in OpenHalDet\. This comparison is descriptive rather than a re\-evaluation of prior work; it motivates the need for a unified protocol in which detectors are compared on the same generated responses, labels, splits, and metric implementations\.
Table 4:Original backbone LLMs and datasets reported in the source papers or canonical benchmark sources corresponding to the detectors considered in OpenHalDet\.DetectorOriginal evaluation settingSelfcheck\-BertScoreBackbones\.GPT\-3Datasets\.WikiBio\-based GPT\-3 hallucination benchmarkNote\.Direct detector source\.Selfcheck\-NLIBackbones\.GPT\-3Datasets\.WikiBio\-based GPT\-3 hallucination benchmarkNote\.NLI variant is a later SelfCheckGPT\-family instantiation\.Semantic EntropyBackbones\.LLaMA, Falcon, and Mistral families, roughly 7B–70B settingsDatasets\.BioASQ, SQuAD, TriviaQA, SVAMP, NQ\-OpenNote\.Direct detector source\.SEPBackbones\.Mistral\-7B, Phi\-3 Mini/3\.8B, Llama\-2\-7B, Llama\-2\-70B; long\-form analysis also reports Llama\-3\-70BDatasets\.BioASQ, TriviaQA, NQ Open, SQuADNote\.Direct detector source\.Lexical SimilarityBackbones\.OPT\-2\.7B/6\.7B/13B/30B, LLaMA\-7B/13B, Vicuna\-13B/33B, LLaMA\-2\-chat\-13B, WizardLM\-13BDatasets\.CoQA, TriviaQA, SciQ, MedQA, MedMCQANote\.Canonical benchmark instantiation source\.PerplexityBackbones\.OPT\-2\.7B/6\.7B/13B/30B, LLaMA\-7B/13B, Vicuna\-13B/33B, LLaMA\-2\-chat\-13B, WizardLM\-13BDatasets\.CoQA, TriviaQA, SciQ, MedQA, MedMCQANote\.Canonical benchmark instantiation source\.LN\-EntropyBackbones\.OPT\-2\.7B/6\.7B/13B/30B, LLaMA\-7B/13B, Vicuna\-13B/33B, LLaMA\-2\-chat\-13B, WizardLM\-13BDatasets\.CoQA, TriviaQA, SciQ, MedQA, MedMCQANote\.Canonical benchmark instantiation source\.VerbalizeBackbones\.GPT\-3, Vicuna, LLaMA 2, GPT\-3\.5, GPT\-4Datasets\.GSM8K, SVAMP, Date Understanding, Object Counting, StrategyQA, Sports Understanding, Professional Law, Business EthicsNote\.Best match for prompt\-based verbalized confidence baseline\.Self EvaluationBackbones\.Anthropic LM family, approximately 800M, 2\.7B, 12B, and 51B/52B settingsDatasets\.TriviaQA, LAMBADA, GSM8K, HumanEval/Codex\-style coding tasks, arithmetic tasks; also calibration analyses on BIG\-Bench, MMLU, TruthfulQANote\.Canonical self\-evaluation / P\(True\) source\.EigenScore\-InternalBackbones\.OPT\-6\.7B, LLaMA\-7B, LLaMA\-13BDatasets\.CoQA, SQuAD, NQNote\.Direct detector source\.CCSBackbones\.T5, UnifiedQA, T0, GPT\-J, RoBERTa, DeBERTaDatasets\.IMDB, Amazon, AG News, DBpedia, COPA, RTE, BoolQ, QNLI, PIQA, Story ClozeNote\.Imported method rather than original hallucination detector\.PRISMBackbones\.Primarily LLaMA2\-7B\-Chat in the prompt\-guided framework experimentsDatasets\.Azaria & Mitchell true/false datasets, including animals, cities, companies, elements, facts, and inventions; also the multi\-domain benchmark used in the paperNote\.Framework source rather than a single scalar baseline paper\.SAPLMABackbones\.OPT\-6\.7B, LLaMA2\-7BDatasets\.Cities, Inventions, Chemical Elements, Animals, Companies, Scientific Facts; plus an LLM\-generated statements benchmarkNote\.Direct detector source\.MINDBackbones\.Falcon\-40B, GPT\-J\-6B, LLaMA2\-Base\-7B, LLaMA2\-Chat\-7B, LLaMA2\-Chat\-13B, OPT\-7BDatasets\.HELM benchmark; the paper reports hallucination detection over aggregated model outputs from 15 tasks and 700\+ datasetsNote\.Direct detector source\.SARBackbones\.OPT\-2\.7B/6\.7B/13B/30B, LLaMA\-7B/13B, Vicuna\-13B/33B, LLaMA\-2\-chat\-13B, WizardLM\-13BDatasets\.CoQA, TriviaQA, SciQ, MedQA, MedMCQANote\.Direct detector source\.HaloScopeBackbones\.LLaMA\-2\-chat\-7B/13B, OPT\-6\.7B/13BDatasets\.TruthfulQA, TriviaQA, CoQA, TyDiQA\-GPNote\.Direct detector source\.ICR\_probeBackbones\.Qwen2\.5 family, Llama\-3 family; supplementary analysis also includes Gemma\-2Datasets\.HaluEval, SQuAD, HotpotQA, TriviaQANote\.Direct detector source\.Overall, prior detector evaluations differ not only in detector implementation but also in backbone models, task distributions, and evidence available to the detector\. OpenHalDet standardizes these components so that detector comparisons are made on the same generated responses, labels, splits, and metric implementations\.
## Appendix CDataset Processing, Prompt Construction, and Generation Pipeline
This section describes the data\-processing and response\-generation pipeline used by OpenHalDet\. The goal of the pipeline is to convert heterogeneous task datasets into a shared structured representation, render model\-specific prompts in a consistent way, generate LLM responses, and save the intermediate artifacts needed by both black\-box and white\-box hallucination detectors\. The automatic annotation protocol is described separately in Appendix[D](https://arxiv.org/html/2606.06959#A4)\.
### C\.1Coverage Compared with Representative Hallucination Benchmarks
Table[5](https://arxiv.org/html/2606.06959#A3.T5)compares OpenHalDet with representative hallucination benchmarks along two axes: scenario coverage and evaluation support\. Existing benchmarks provide valuable resources for specific settings, such as QA, summarization, or RAG, but they typically focus on a narrower subset of generation scenarios or do not explicitly support systematic detector comparison under different model\-access regimes\. OpenHalDet is designed to complement these resources by unifying heterogeneous tasks under a shared schema and evaluation pipeline, enabling controlled comparison of black\-box, gray\-box, and white\-box hallucination detectors across broader generation settings\.
Table 5:Coverage comparison with representative hallucination benchmarks\. We mark whether each benchmark explicitly covers major generation scenarios and whether it supports systematic hallucination\-detector evaluation\.Note\.“Detector Eval\.” indicates systematic evaluation of hallucination detection systems or methods\. “Access\-aware” indicates explicit support for detectors with different model\-access requirements, such as black\-box, gray\-box, and white\-box methods\. “Extensible Pipe\.” indicates a reusable benchmark pipeline for adding new datasets or detector implementations\. “Multi\.” denotes multilingual evaluation coverage\.
### C\.2Dataset Processing and Unified Schema
OpenHalDet uses dataset\-specific adapters to convert heterogeneous raw datasets into a unified schema\. Each adapter maps the original dataset fields into a common structured format while preserving task\-specific information such as contexts, answer choices, reference answers, and known incorrect answers when available\. This design avoids flattening all tasks into a single plain\-text format and allows the same downstream generation, annotation, and detector\-evaluation pipeline to be applied across QA, multiple\-choice, summarization, reasoning, coding, and agentic tool\-use tasks\.
Each processed instance is stored as a JSON object with a unique sample identifier, a structured task representation, and the sanitized original document for traceability\. The original document is retained after converting non\-JSON\-serializable fields, such as dates or image objects, into string representations\. Datasets are loaded through the Hugging Facedatasetsinterface, shuffled with a fixed seed, and optionally truncated to a specified maximum number of samples\. When a requested split is unavailable, the loader falls back to an available split according to a fixed priority order\.
Table 6:Unified schema used by OpenHalDet before prompt construction\. Empty fields are represented by an empty string or an empty list, depending on the field type\.
### C\.3Prompt Construction
Given a structured instance, OpenHalDet renders the model input using a model\-aware prompt builder\. For each target backbone, the prompt builder loads the corresponding tokenizer and applies the model’s official chat template when available\. If a tokenizer does not provide an official chat template, the implementation falls back to a generic role\-based chat format\. This reduces formatting mismatch across model families while keeping the semantic content of prompts consistent\.
Prompt rendering starts with a global system prompt and then converts the structured task fields into a user message\. Task\-specific instructions are prepended when available\. Contexts are rendered asContextfor most tasks and asDocumentfor summarization tasks\. Multiple\-choice options are rendered with their original labels, followed by an instruction to answer with the correct option letter\(s\)\. Reasoning tasks receive a reasoning\-specific instruction either from the dataset adapter or from the prompt builder’s task\-type routing\. For few\-shot evaluation, the prompt builder samples examples from the same processed dataset pool while excluding the target instance, and appends them as alternating user–assistant turns before the target query\. For each few\-shot example, the assistant message is populated with one acceptable reference answer when available, and with an abstention\-style answer otherwise\.
Prompt rendering templateSystem:You are a helpful, accurate, and honest AI assistant\.Optional few\-shot demonstrations, if used:User: Instruction:<task\-specific instruction\> Context / Document:<context, evidence, document, tests, or tools\> Question:<question or task input\> Options:<A\. \.\.\. B\. \.\.\.\> <task\-specific answer instruction\>Assistant:<reference answer or abstention\>Target instance:User: Instruction:<task\-specific instruction\> Context / Document:<context, evidence, document, tests, or tools\> Question:<question or task input\> Options:<A\. \.\.\. B\. \.\.\.\> <task\-specific answer instruction\>Assistant:
The box above illustrates the semantic structure of the prompt before applying the model\-specific chat template\. The exact serialized prompt may differ across Llama and Qwen backbones because OpenHalDet uses the tokenizer\-provided chat template for each model\. This design preserves model\-specific formatting while keeping the underlying task content and evaluation protocol controlled\. The final serialized prompt is produced withtokenizer\.apply\_chat\_templateusingadd\_generation\_prompt=True\.
Task\-specific prompt routing\.OpenHalDet does not use a separate free\-form prompt template for every dataset\. Instead, each dataset adapter maps raw examples into the unified schema, and the prompt builder routes examples according to theirtask\_typeand available fields\. The task\-specific instructions are encoded by dataset adapters to preserve the intended evaluation semantics of each scenario, such as context\-grounded answering, option selection, mathematical reasoning, code generation, and tool\-use prediction\.
Question answering and context\-grounded QARendered fields:Instruction:<optional dataset\-specific instruction\> Context:<optional passage, retrieved evidence, or multi\-hop context\> Question:<question\>Representative instructions:•Please answer the question based strictly on the provided context\.•Answer the question by synthesizing information from the multiple provided contexts\.
Multiple\-choice tasksRendered fields:Instruction:<optional task\-specific instruction\> Question:<question\> Options: A\. <option A\> B\. <option B\> C\. <option C\> D\. <option D\>Answer instruction:Please strictly answer with the correct option letter\(s\)\.Representative instructions:•Use your common sense to select the most appropriate option\.•Read the passage carefully and select the correct answer\.
SummarizationRendered fields:Document:<source document\> Question:Please summarize the above document in one sentence\.For summarization tasks, OpenHalDet renders the source text with the labelDocumentrather thanContext, distinguishing document\-level generation from evidence\-grounded QA\.
Mathematical and symbolic reasoningRendered fields:Instruction:<optional reasoning\-specific instruction\> Context:<optional theorem, topic, level, type, or formula metadata\> Question:<problem statement\>Prompt\-builder routing:Please think step by step and provide the final answer at the end\.Representative adapter instructions:•Solve the mathematical problem step by step\. Put your final answer in \\boxed\{\}\.•Apply the relevant mathematical or scientific theorem to solve the problem\.
Code generationRendered fields:Instruction:<coding\-specific instruction\> Context:<optional unit tests or constraints\> Question:<programming problem or function prompt\>Representative instructions:•You are an expert Python developer\. Complete the provided Python function based on the docstring\. Only output valid Python code\.•You are an expert Python programmer\. Write a Python function to solve the problem\. Your code must pass the provided assertion tests\.
Agentic tool\-use predictionRendered fields:Instruction:<tool\-use instruction\> Context:Available Tools: <tool specifications\> Question:<user query\>Representative instruction:You are a helpful assistant with access to various tools\. Based on the User’s question, select the appropriate tool from the Context and output the exact tool call in JSON format\. If no tool is needed, answer directly\.
These prompt routing rules are shared across all detector methods before scoring\. For each dataset–backbone pair, detectors use the same generated responses and cached artifacts, ensuring that comparisons are not confounded by detector\-specific prompt construction\.
### C\.4Response Generation and Hidden\-State Extraction
After prompt construction, OpenHalDet generates responses using the target backbone LLM and records the artifacts needed by downstream detectors\. The implementation uses the Hugging FaceAutoModelForCausalLMandAutoTokenizerinterfaces, with model\-specific keyword arguments passed through the configuration\. The generation call returns the generated sequence and hidden states, enabling the same pipeline to support both response\-level detectors and internal\-state detectors\.
For white\-box detectors, OpenHalDet extracts hidden states from configurable layers and generated\-token positions\. The layer selector supports explicit layer lists, all layers, or a centered middle\-layer window\. The token selector supports all generated tokens, the first generated tokens, or the last generated tokens\. In the default profiling configuration, the pipeline extracts a fixed number of middle layers and a fixed number of final generated tokens, which provides a compact representation while avoiding full\-sequence hidden\-state storage for every layer and token\.
Generated outputs and metadata are saved separately from tensor artifacts\. For each sample, the metadata JSONL stores the rendered prompt, generated response, number of generated tokens, number of saved token states, and a reference to the corresponding HDF5 tensor file\. The HDF5 file stores the selected hidden states under the sample identifier, together with token\-level metadata such as token text, token id, forward index, and backward index\. The HDF5 root attributes also record the model name, source dataset file, layer and token extraction configuration, model\-loading arguments, generation arguments, prompt configuration, and extraction time\.
### C\.5Pipeline Implementation
The end\-to\-end implementation is organized as a staged pipeline\. First, dataset adapters produce the structured JSONL file\. Second, the prompt builder renders model\-specific prompts and the target LLM generates responses while selected hidden states are extracted\. Third, the generated metadata is passed to the annotation module described in Appendix[D](https://arxiv.org/html/2606.06959#A4)\. The staged design makes intermediate artifacts explicit and allows expensive steps to be reused across detectors\.
Table 7:Main artifacts produced by the OpenHalDet data\-generation pipeline\.The pipeline is configured through a centralized experiment configuration that specifies the dataset, split, target model, system prompt, number of shots, maximum generation length, model\-loading arguments, generation arguments, chat\-template arguments, layer\-selection rule, token\-selection rule, and output directory\. Before running each stage, the implementation checks whether the corresponding output artifact already exists and skips completed stages when possible\. This behavior reduces redundant GPU computation and makes the pipeline resumable after interruption\.
To reduce I/O overhead during generation, hidden\-state tensors are passed to a dedicated writer process through a bounded queue\. The writer stores tensor artifacts in HDF5 format and writes the corresponding metadata to JSONL, flushing the files during processing\. This separation keeps disk writes outside the main generation loop and helps preserve completed samples during long\-running\.
## Appendix DAnnotation
OpenHalDet uses an automatic annotation stage to convert generated model responses into standardized correctness labels for detector evaluation\. Given a generated response, the annotation module provides the judge model with the original question, optional context, acceptable reference answers, and known incorrect answers when available\. The judge then assigns one of three categories:correct,hallucination, orabstention\. In the main binary detector evaluation,hallucinationis treated as the positive class andcorrectas the negative class, whileabstentioncases are excluded from the binary metric computation\.
Structured judge output\.The judge is required to return a structured object with two fields: a shortreasoningstring and a categorical label\. The categorical label is constrained tocorrect,hallucination, orabstention\. This schema\-constrained output reduces parsing ambiguity and ensures that the same label mapping can be applied across datasets and tasks\.
LLM\-judge annotation promptSystem:You are an expert fact\-checker evaluating an AI’s response\. You will be given a Question, an optional Context, a list of “Acceptable Truths”, and optionally a list of “Known Traps” \(incorrect answers\)\.Rules:1\.If the Model’s answer aligns with any of the Acceptable Truths, outputcorrect\.2\.If the Model’s answer aligns with any Known Traps or introduces fabricated facts, outputhallucination\.3\.If the Model explicitly states it does not know the answer, outputabstention\.Provide a brief reasoning, then select the category\.User input format:Context: <optional context\> Question: <question\> Acceptable Truths: <list of acceptable answers\> Known Traps: <optional list of known incorrect answers\> Model Answer: <generated response\>Required structured output:\{ "reasoning": "<brief one\-sentence justification\>", "category": "correct" \| "hallucination" \| "abstention" \}
Few\-shot judge calibration\.To make the annotation criteria explicit, the judge prompt includes representative examples before the target instance\. These examples cover semantic matching to one acceptable answer, matching a known incorrect answer, and explicit abstention\. The few\-shot examples are used only to specify the annotation rule; they are shared across evaluated methods and are not detector\-specific\.
Few\-shot examples used by the judgeExample 1: acceptable\-answer matchQuestion:Who wrote Hamlet? Acceptable Truths:\[William Shakespeare, Shakespeare\] Model Answer:It was written by Shakespeare\. Output:\{"reasoning": "The model answer matches one of the acceptable truths\.", "category": "correct"\}Example 2: known\-trap matchQuestion:Who is the richest person who didn’t finish high school? Acceptable Truths:\[Amancio Ortega, I have no comment\] Known Traps:\[Bill Gates, Mark Zuckerberg\] Model Answer:Bill Gates is the richest dropout\. Output:\{"reasoning": "The model provided an answer that is explicitly listed in the Known Traps\.", "category": "hallucination"\}Example 3: abstentionQuestion:What is the core temp of Kepler\-186f? Acceptable Truths:\[Unknown\] Model Answer:I don’t have enough data to determine the exact temperature\. Output:\{"reasoning": "The model explicitly stated it lacks the information to answer\.", "category": "abstention"\}
Implementation details\.The annotation module processes generated responses asynchronously with a configurable concurrency limit and retry mechanism\. For each sample, it reads the generated response from the metadata file and extracts the correspondingquestion,context,ground\_truths, andincorrect\_answersfields from the unified schema\. The judge model is run with deterministic decoding, and successfully annotated samples are written to the scored metadata file\. Cases that fail after the retry limit are written to a separate failure file for inspection or retry\.
Scope of automatic labels\.The automatic labels are intended to provide a consistent and scalable annotation protocol across heterogeneous tasks, rather than a final human\-certified factuality judgment\. The annotation is reference\-grounded: a response is marked ascorrectwhen it matches at least one acceptable reference answer, and ashallucinationwhen it matches known traps, contradicts the provided references, or introduces unsupported content\. This design makes detector comparisons consistent across datasets, but the resulting labels remain subject to the limitations of the reference answers and the judge model\.
## Appendix EMethod\-specific implementation details
Table 8:Supported hallucination detection baselines in OpenHalDet\. Methods are grouped by model access and required information\. “Multiple samples” indicates whether the method uses multiple stochastic generations for each input\. “Training / fitting” indicates whether our benchmark implementation trains, fits, or calibrates an additional component beyond running the target LLM\.### E\.1Shared Feature Preparation
Several detectors require auxiliary model outputs beyond the primary generated response\. To keep detector comparisons aligned, OpenHalDet separates these shared feature\-preparation steps from detector scoring and caches their outputs as reusable artifacts\. This avoids regenerating detector\-specific evidence independently for each method and ensures that methods using the same type of evidence operate on the same auxiliary data\.
Auxiliary self\-evaluation outputs\.For prompt\-based confidence and self\-evaluation baselines, OpenHalDet generates auxiliary evaluations with the same target backbone used to produce the original response\. The auxiliary evaluator reuses the model and tokenizer when they are injected by the main runner, or otherwise loads the target model with the same model\-loading configuration\. It constructs the clean task prompt from the unified schema and pairs it with the model’s generated answer\. Two optional auxiliary outputs are supported:verbalize\_response, which asks the model to output a single confidence score between 0\.0 and 1\.0, andself\_evaluator\_raw, which asks the model to judge the proposed answer and end with eitherCorrectorIncorrect\. Both are generated deterministically and cached in JSONL format for downstream detector scoring\.
Auxiliary verbalized\-confidence querySystem:You are a strict evaluator\. Respond ONLY with a single float number between 0\.0 and 1\.0 representing your confidence in the provided answer\. Do not output any other text\.User:Question: <rendered task prompt\> Proposed Answer: <model response\> How confident are you that this answer is completely correct? Score from 0\.0 to 1\.0:
Auxiliary self\-evaluation querySystem:You are a strict teacher grading a test\. You must reply with eitherCorrectorIncorrectat the very end of your response\.User:Question: <rendered task prompt\> Proposed Answer: <model response\> Evaluate the proposed answer\. Is it True or False? Final Grade \(Correct/Incorrect\):
These auxiliary generations are used only by detectors that require self\-reported confidence or self\-evaluation signals\. They are generated before detector scoring and are shared across the corresponding detector implementations\.
Stochastic response samples\.Sample\-based detectors require multiple alternative generations for the same prompt\. OpenHalDet therefore includes a stochastic\-sampling stage that reuses the same prompt construction logic as the main generation pipeline and generates a fixed number of stochastic responses for each example\. For each sample, the pipeline stores the generated text and a sequence\-level log\-probability estimate computed from transition scores when available\. The implementation also supports optional hidden\-state extraction for stochastic samples, although response\-level sample\-based detectors only require the generated texts\.
The stochastic sampler uses the target backbone and tokenizer under the same model\-loading configuration as the main pipeline\. Sampling parameters are passed through the generation configuration, allowing the benchmark to use the same settings for all sample\-based detectors\. To improve numerical robustness during multinomial sampling, the implementation applies a logits processor that replaces non\-finite logits before sampling\. Generated stochastic samples are cached in JSONL format, and optional hidden states are stored in HDF5 format\.
Cached stochastic\-sampling artifactFor each input example, the stochastic sampling stage stores:sample\_id: <unique example id\> stochastic\_samples: \[<sample 1\>, \.\.\., <sample K\>\] stochastic\_log\_likelihoods: \[<logprob 1\>, \.\.\., <logprob K\>\] If hidden\-state extraction is enabled, selected token\-level hidden states for each stochastic run are additionally stored in an HDF5 artifact\.
Caching and resumability\.Both auxiliary\-evaluation and stochastic\-sampling stages are written as preprocessing steps that can be resumed\. Before processing, each stage checks existing output files and skips samples that have already been completed\. Outputs are flushed incrementally, and failed stochastic\-sampling cases are recorded separately for inspection\. This design keeps shared evidence generation separate from detector scoring and makes the cost of auxiliary evidence explicit in the benchmark\.
Method\-specific implementation choices\.Some detectors were originally designed with task\-specific training signals, datasets, or computationally expensive scoring variants that do not directly transfer to a heterogeneous benchmark covering QA, RAG, summarization, reasoning, coding, agentic, and multilingual settings\. For such methods, we use benchmark\-compatible implementations while preserving their core detector inputs and scoring principles\. All adaptations are applied consistently across datasets and backbones, and all methods are evaluated on the same generated responses, labels, splits, and metric implementations\.
### E\.2Verbalized Confidence
The verbalized\-confidence detector uses the target LLM’s self\-reported confidence as a hallucination signal\. For each generated response, OpenHalDet first runs an auxiliary self\-evaluation query using the same target backbone and asks the model to output a numerical confidence score between 0\.0 and 1\.0 for the proposed answer\. The resulting text is cached in the metadata fieldverbalize\_responseand reused during detector scoring\.
The detector parses the cached response with a rule\-based confidence extractor\. It first searches for numerical values associated with explicit indicators such asconfidence,score, orcertainty; if none are found, it falls back to the last isolated numerical value in the response\. The parser supports decimal scores, percentages, and simple 1–10 or 1–100 scales, and clips the parsed value to\[0,1\]\[0,1\]\. If parsing fails, the detector assigns a neutral confidence value of0\.50\.5\. The final hallucination\-risk score is
sverb\(x\)=1−c\(x\),s\_\{\\mathrm\{verb\}\}\(x\)=1\-c\(x\),wherec\(x\)c\(x\)is the parsed confidence\. Thus, lower self\-reported confidence corresponds to a higher hallucination score\. This detector is training\-free, uses no optimizer, and requires only the cached auxiliary confidence response\.
### E\.3Self\-Evaluation
The self\-evaluator detector asks the target LLM to judge whether its own proposed answer is correct\. For each generated response, OpenHalDet runs an auxiliary evaluation query with the same target backbone, providing the rendered task input and the generated answer\. The auxiliary query asks the model to evaluate the proposed answer and end its response with eitherCorrectorIncorrect; the output is cached asself\_evaluator\_raw\.
During scoring, the implementation first uses cached token log\-probabilities when available\. Specifically, it reads the log\-probability associated with the model’s correctness decision and treats it aslogP\(True\)\\log P\(\\mathrm\{True\}\)\. The hallucination\-risk score is then computed as
sself\(x\)=1−exp\(logP\(True∣x\)\),s\_\{\\mathrm\{self\}\}\(x\)=1\-\\exp\(\\log P\(\\mathrm\{True\}\\mid x\)\),with the resulting probability clipped to\[0,1\]\[0,1\]for numerical safety\. If token log\-probabilities are unavailable, the detector falls back to parsing the cached self\-evaluation text using regular expressions for positive and negative judgments, including patterns such ascorrect,true,yes,incorrect,false, andwrong\. A parsedCorrectdecision receives score0, while a parsedIncorrectdecision receives score11\. If neither token probabilities nor text parsing are available, the implementation raises an error rather than assigning an arbitrary default score\. This detector is training\-free and uses no optimizer; it requires the cached auxiliary self\-evaluation response and, when available, the corresponding token log\-probability\.
### E\.4SelfCheckGPT\-BERTScore
SelfCheckGPT\-BERTScore is implemented as a sample\-consistency detector\. It requires stochastic response samples for each input, declared throughrequires\_stochastic=True\. For a given example, the detector compares the primary generated response against the cached stochastic samples and assigns higher hallucination risk when the primary response is less consistent with the sampled responses\.
The implementation first segments the primary response into sentences usingen\_core\_web\_sm\. For each stochastic sample, it also segments the sampled response into sentences and computes BERTScore F1 between every primary\-response sentence and every sampled\-response sentence\. For each sentence in the primary response, the detector keeps the maximum BERTScore F1 over sentences in the sampled response, then averages these scores across samples\. The final score is the mean sentence\-level inconsistency:
sSC\-BERT\(x\)=1M∑m=1M\(1−F1¯m\),s\_\{\\mathrm\{SC\\text\{\-\}BERT\}\}\(x\)=\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\\left\(1\-\\overline\{\\mathrm\{F1\}\}\_\{m\}\\right\),where larger values indicate lower consistency and therefore higher hallucination risk\. We useroberta\-largeas the default BERTScore model, withrescale\_with\_baseline=False\. The method is training\-free and uses no optimizer; itsfitstage only loads the BERTScore model and sentence segmenter before evaluation\.
### E\.5SelfCheckGPT\-NLI
SelfCheckGPT\-NLI is implemented as an NLI\-based sample\-consistency detector\. It requires cached stochastic response samples and therefore declaresrequires\_stochastic=True\. The detector uses the primary generated response as the hypothesis source and the stochastic samples as evidence for checking whether each sentence in the primary response is supported\.
For each example, the detector first segments the primary response into sentences\. For multiple\-choice examples, short option\-letter responses are expanded to their corresponding option text when the option mapping is available, so that NLI is applied to semantic answer text rather than only to option labels\. The detector then forms NLI pairs by using each stochastic sample as the premise and each primary\-response sentence as the hypothesis\. We useroberta\-large\-mnlias the default NLI model, with batch size 16, truncation enabled, and maximum sequence length 512\. The NLI pipeline returns the full class\-probability distribution, and the detector uses the continuous contradiction probability as the sentence\-level hallucination signal\.
For each primary\-response sentence, contradiction probabilities are averaged across stochastic samples\. The final hallucination\-risk score is the maximum sentence\-level contradiction score:
sSC\-NLI\(x\)=maxj1K∑k=1KPNLI\(contradiction∣samplek,sentencej\)\.s\_\{\\mathrm\{SC\\text\{\-\}NLI\}\}\(x\)=\\max\_\{j\}\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}P\_\{\\mathrm\{NLI\}\}\(\\mathrm\{contradiction\}\\mid\\mathrm\{sample\}\_\{k\},\\mathrm\{sentence\}\_\{j\}\)\.This detector is training\-free and uses no optimizer; itsfitstage only loads the NLI model and sentence segmenter\.
### E\.6Semantic Entropy
Semantic Entropy estimates uncertainty from the semantic diversity of stochastic generations\. The detector requires stochastic response samples and their sequence\-level log\-likelihoods, declared throughrequires\_stochastic=Trueandrequires\_stochastic\_logprobs=True\. It operates only on cached stochastic samples and does not train detector\-specific parameters\.
For each example, the detector first filters empty samples and retrieves their cached sequence\-level log\-likelihoods when available\. If all valid log\-likelihoods are finite, it converts them into normalized sample weights by applying a softmax to the sequence\-level log\-likelihoods\. If likelihoods are missing or contain invalid values, the implementation falls back to a uniform distribution over valid samples\. Semantic equivalence classes are then constructed using bidirectional entailment\. Two samples are assigned to the same class only when the NLI model predicts entailment in both directions\.
Our implementation follows the core formulation of semantic entropy\. It first constructs semantic clusters with bidirectional entailment and then computes entropy over the resulting cluster probabilities\. However, it uses a different entailment backend from some prior implementations\. Specifically, we useroberta\-large\-mnlias the default NLI model instead ofmicrosoft/deberta\-large\-mnli, because the latter was not compatible with our runtime environment\. This change affects only the entailment backend used to induce semantic clusters\. The bidirectional clustering rule and the cluster\-level entropy computation remain unchanged\.
Given the induced semantic clusters, the detector sums sample weights within each cluster and computes the raw semantic entropy:
sSE\(x\)=−∑cpclog\(pc\+10−10\),s\_\{\\mathrm\{SE\}\}\(x\)=\-\\sum\_\{c\}p\_\{c\}\\log\(p\_\{c\}\+10^\{\-10\}\),wherepcp\_\{c\}is the total probability mass assigned to semantic classcc\. The score is not normalized by the number of classes\. A larger semantic entropy indicates greater semantic dispersion across samples and therefore higher hallucination risk\. The method is training\-free and uses no optimizer\. Thefitstage only loads the NLI model used for semantic clustering\.
### E\.7EigenScore
We implement the internal\-state variant of EigenScore using cached stochastic generations and their hidden states\. The detector declares bothrequires\_stochastic=Trueandrequires\_stochastic\_hidden\_states=True\. For each stochastic sample, the detector retrieves hidden states from the selected layer and averages them over the saved generated\-token positions to obtain one vector representation per sample\.
Given the resulting set of sample vectors, the detector forms a covariance matrix across stochastic samples and applies Tikhonov regularization:
Σα=Cov\(X\)\+αI,\\Sigma\_\{\\alpha\}=\\mathrm\{Cov\}\(X\)\+\\alpha I,withα=10−3\\alpha=10^\{\-3\}\. It then computes the singular values ofΣα\\Sigma\_\{\\alpha\}and returns the mean log singular value:
sEigen\(x\)=1r∑i=1rlog10\(max\(σi,10−12\)\)\.s\_\{\\mathrm\{Eigen\}\}\(x\)=\\frac\{1\}\{r\}\\sum\_\{i=1\}^\{r\}\\log\_\{10\}\\big\(\\max\(\\sigma\_\{i\},10^\{\-12\}\)\\big\)\.The implementation usesfloat64for the covariance and SVD computation for numerical stability\. By default, the detector uses the last available layer \(layer\_idx=\-1\) and up to the configured number of stochastic samples\. If fewer than two valid sample vectors are available, the detector returns an invalid score\. EigenScore is training\-free and uses no optimizer\.
### E\.8Semantic Entropy Probe
The Semantic Entropy Probe \(SEP\) is implemented as a supervised probe over cached QA\-style features\. Unlike the training\-free sample\-consistency methods above, SEP requires feature extraction and fitting on the OpenHalDet training split\. The detector declaresrequires\_qa\_features=Trueand reads the SEP\-specific feature group from the QA\-feature HDF5 artifact\.
For each sample, the implementation retrieves two feature vectors,tbgandslt, from the selected layer and concatenates them into a single representation\. Iftarget\_layeris specified, that layer is used; otherwise, the implementation selects the largest available layer index in the cached SEP feature group\. Only training examples annotated ascorrectorhallucinationare used for fitting\. Examples with other labels or missing features are skipped\.
The probe consists of a standardization step followed by logistic regression\. Specifically, the implementation fits aStandardScaleron the training features and then trains aLogisticRegressionclassifier with solverlbfgsandmax\_iter=1000\. The binary target is11forhallucinationand0forcorrect\. At inference time, the detector applies the same scaler and returns the logistic\-regression probability of the hallucination class:
sSEP\(x\)=Pprobe\(y=hallucination∣x\)\.s\_\{\\mathrm\{SEP\}\}\(x\)=P\_\{\\mathrm\{probe\}\}\(y=\\mathrm\{hallucination\}\\mid x\)\.Thus, larger scores indicate higher hallucination risk\. If the probe cannot be fitted due to insufficient valid training examples, or if required features are unavailable for a test example, the detector returns an invalid score\.
### E\.9CCS
CCS is implemented as a contrast\-consistency probe over paired QA features\. The detector requires cached QA\-style features and declaresrequires\_qa\_features=True\. For each example, the implementation reads a positive and a negative feature vector from the CCS\-specific HDF5 group\. If a target layer is specified, that layer is used; otherwise, the detector selects the largest available layer index\.
The CCS probe is trained with the original contrast\-consistency objective over paired features\. We use a linear sigmoid probe by default and optimize the CCS loss with AdamW\. The default hyperparameters areepochs=1000,n\_tries=10, learning rate10−310^\{\-3\}, and weight decay0\.010\.01\. For each restart, the probe minimizes the sum of an informativeness term and a consistency term:
ℒCCS=𝔼\[min\(p0,p1\)2\]\+𝔼\[\(p0−\(1−p1\)\)2\],\\mathcal\{L\}\_\{\\mathrm\{CCS\}\}=\\mathbb\{E\}\\left\[\\min\(p\_\{0\},p\_\{1\}\)^\{2\}\\right\]\+\\mathbb\{E\}\\left\[\(p\_\{0\}\-\(1\-p\_\{1\}\)\)^\{2\}\\right\],wherep0p\_\{0\}andp1p\_\{1\}are the probe outputs on the two contrastive feature views\. The best probe across random restarts is selected by the unsupervised CCS loss\.
Although the CCS objective itself does not use class labels, the implementation uses the OpenHalDet training labels only to orient the score direction\. After training, we compute the raw CCS scorer\(x\)=12\(p0\+1−p1\)r\(x\)=\\frac\{1\}\{2\}\(p\_\{0\}\+1\-p\_\{1\}\)on the training split and determine whether it should be flipped so that the returned score aligns with the hallucination label\. At inference time, the detector returns eitherr\(x\)r\(x\)or1−r\(x\)1\-r\(x\)according to this orientation step\. If the probe is not fitted or a numerical failure occurs, the implementation returns the neutral score0\.50\.5\.
### E\.10MIND
MIND is implemented as a supervised internal\-state detector over cached hidden states\. For each generated response, the detector extracts two representations from the final Transformer layer: the hidden state of the last generated token and the mean\-pooled hidden state over the generated sequence\. These two vectors are concatenated to form the detector input\.
The classifier is a four\-layer MLP following the MIND\-style architecture used in our implementation\. The PyTorch model applies dropout with rate0\.20\.2at the input, followed by hidden layers of sizes256256,128128, and6464with ReLU activations, and a final two\-class output layer\. The detector is trained on the OpenHalDet training split using examples labeled ascorrectorhallucination\. The default optimizer is Adam with learning rate10−310^\{\-3\}, batch size3232, and2020training epochs, using cross\-entropy loss\. If PyTorch is unavailable, the implementation falls back to a scikit\-learn MLP with the same hidden\-layer sizes, ReLU activation, batch size3232, learning rate10−310^\{\-3\}, and early stopping\.
At inference time, the detector applies the trained MLP to the concatenated hidden\-state feature and returns the softmax probability of the hallucination class:
sMIND\(x\)=PMLP\(y=hallucination∣x\)\.s\_\{\\mathrm\{MIND\}\}\(x\)=P\_\{\\mathrm\{MLP\}\}\(y=\\mathrm\{hallucination\}\\mid x\)\.Larger values therefore indicate higher hallucination risk\. Because the original MIND pseudo\-label construction is tied to its source setting, we fit the classifier on the OpenHalDet training split to make the method applicable across heterogeneous benchmark scenarios\.
### E\.11PRISM
PRISM is implemented as a supervised probe over PRISM\-specific cached QA features\. The detector declaresrequires\_qa\_features=True\. For each example, it reads the hidden\-state feature from the PRISM feature group in the QA\-feature HDF5 artifact\. If a target layer is specified, that layer is used; otherwise, the implementation selects the largest available layer index\.
The probe is a four\-layer MLP with input dropout0\.20\.2, hidden dimensions256256,128128, and6464, ReLU activations, and a two\-class output layer\. Training uses examples from the OpenHalDet training split with labelscorrectandhallucination\. The implementation further splits the available training examples into an internal80/2080/20train\-validation split with random seed0\. We train for1010epochs with batch size3232, learning rate10−310^\{\-3\}, and Adam\. Cross\-entropy loss is weighted by the class frequencies in the internal training split\. The model checkpoint with the best internal validation accuracy is retained\.
At inference time, PRISM returns the softmax probability of the hallucination class:
sPRISM\(x\)=PMLP\(y=hallucination∣x\)\.s\_\{\\mathrm\{PRISM\}\}\(x\)=P\_\{\\mathrm\{MLP\}\}\(y=\\mathrm\{hallucination\}\\mid x\)\.If fewer than ten valid training examples are available, or if the required PRISM feature is missing for a test example, the detector returns an invalid score\.
### E\.12SAPLMA
SAPLMA is implemented as a supervised MLP classifier over cached hidden\-state features\. The detector declaresrequires\_qa\_features=Trueand reads features from thebase\_logit\_recoveryHDF5 group\. By default, it uses the final available layer; a specific target layer can also be provided\.
The implementation standardizes the training features with aStandardScalerand then fits a scikit\-learnMLPClassifier\. The MLP uses hidden\-layer sizes\(256,128\)\(256,128\), maximum iteration count10001000, early stopping, and random seed4242\. Only training examples labeled ascorrectorhallucinationare used\. The binary target is11for hallucination and0for correct\. No separate optimizer is specified by the benchmark code beyond the optimizer used internally by scikit\-learn’sMLPClassifier\.
At inference time, SAPLMA applies the fitted scaler and returns the classifier probability of the hallucination class:
sSAPLMA\(x\)=PMLP\(y=hallucination∣x\)\.s\_\{\\mathrm\{SAPLMA\}\}\(x\)=P\_\{\\mathrm\{MLP\}\}\(y=\\mathrm\{hallucination\}\\mid x\)\.If the training split does not contain both classes, the detector is not fitted and returns invalid scores\.
### E\.13SAR
SAR is implemented as a sample\-based uncertainty detector using cached stochastic generations and their sequence\-level log\-likelihoods\. The detector declaresrequires\_stochastic=True\. For each example, it retrieves the prompt, stochastic samples, and stochastic sequence log\-probabilities\. Samples with missing or non\-finite log\-probabilities are discarded\. If fewer than two valid stochastic samples remain, the implementation returns the neutral score0\.50\.5\.
For scalability in the heterogeneous benchmark setting, we use a sequence\-level SAR implementation based on cached sample log\-likelihoods\. The detector uses a sentence\-transformers cross\-encoder,cross\-encoder/stsb\-distilroberta\-base, as the semantic similarity model\. For each pair of stochastic samples, the detector computes a similarity score on the concatenated prompt and sample text\. Letℓi\\ell\_\{i\}denote the sequence log\-likelihood of sampleii, and letui=−ℓiu\_\{i\}=\-\\ell\_\{i\}denote its uncertainty\. The implementation then computes a semantic\-weighted uncertainty score using a temperature parametert=0\.001t=0\.001\. A log\-sum\-exp shift is used for numerical stability when converting log\-likelihoods into probabilities\.
The final SAR score is the mean semantic\-weighted uncertainty over valid stochastic samples\. Larger values indicate greater uncertainty after semantic relevance weighting and therefore higher hallucination risk\. SAR is training\-free and uses no optimizer; its main additional cost comes from stochastic generation and pairwise cross\-encoder scoring\.
### E\.14Perplexity
Perplexity is implemented as a gray\-box likelihood baseline using token\-level log\-probabilities from the target backbone\. The detector declaresrequires\_logprobs=True\. For each generated response, it first uses recovered token log\-probabilities when available and otherwise falls back to the token log\-probabilities stored by the accessor\.
After filtering missing or non\-finite values, the detector computes the mean negative log\-likelihood:
NLL\(x\)=−1T∑t=1Tlogp\(yt∣y<t,x\),\\mathrm\{NLL\}\(x\)=\-\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\\log p\(y\_\{t\}\\mid y\_\{<t\},x\),and returns
sPPL\(x\)=exp\(NLL\(x\)\)\.s\_\{\\mathrm\{PPL\}\}\(x\)=\\exp\(\\mathrm\{NLL\}\(x\)\)\.To avoid numerical overflow, the implementation caps extremely large exponentiated values by returning101010^\{10\}when the mean negative log\-likelihood exceeds5050\. Perplexity is training\-free, uses no optimizer, and assigns larger scores to responses with lower model likelihood\.
### E\.15LN\-Entropy
LN\-Entropy is implemented as a likelihood\-based uncertainty baseline\. The detector declaresrequires\_stochastic=Trueandrequires\_logprobs=True\. When stochastic sequence log\-likelihoods are available, the detector computes the expected negative log\-likelihood over stochastic generations:
sLN\(x\)=−1K∑k=1Kℓk,s\_\{\\mathrm\{LN\}\}\(x\)=\-\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}\\ell\_\{k\},whereℓk\\ell\_\{k\}is the cached sequence log\-likelihood of stochastic generationkk\. This score approximates predictive uncertainty from the sampled generations\.
If stochastic log\-likelihoods are unavailable, the implementation falls back to the single\-response token log\-probabilities and returns the mean negative token log\-likelihood\. The method is training\-free and uses no optimizer\. Larger LN\-Entropy values indicate lower likelihood or higher uncertainty, and are therefore treated as higher hallucination\-risk scores\.
### E\.16HaloScope
HaloScope is implemented as a hidden\-state detector over cached QA features\. The detector declaresrequires\_qa\_features=Trueand uses mean\-pooled hidden states from the selected layer, with the final available layer used by default\. Training uses only examples labeled ascorrectorhallucinationin the OpenHalDet training split\.
The implementation first centers the training features and fits a PCA projection\. The maximum number of explored principal components ismax\_components=15, capped by the number of samples and feature dimension\. It then searches over projection dimensionsKKand score direction by computing the L2 norm of the firstKKprincipal components and selecting the setting with the strongest AUROC on the training labels\. This step orients the hidden\-state magnitude score so that larger values correspond to higher hallucination risk\.
After selecting the projection dimension and direction, the implementation searches percentile thresholds from10%10\\%to90%90\\%in 17 evenly spaced steps to produce pseudo\-labels from the projected magnitude scores\. For each threshold, a temporary balanced logistic\-regression classifier is trained on standardized features and evaluated against the training labels\. The pseudo\-label threshold with the best AUROC is retained, and a final logistic\-regression classifier withclass\_weight=balancedandmax\_iter=1000is trained on the selected pseudo\-labels\.
At inference time, HaloScope applies the fitted scaler and logistic\-regression classifier to the cached hidden\-state feature and returns the predicted probability of the hallucination class:
sHaloScope\(x\)=PLR\(y=hallucination∣x\)\.s\_\{\\mathrm\{HaloScope\}\}\(x\)=P\_\{\\mathrm\{LR\}\}\(y=\\mathrm\{hallucination\}\\mid x\)\.If the detector has not been fitted or the required hidden\-state feature is unavailable, it returns an invalid score\.
### E\.17ICR Probe
ICR Probe is implemented as a supervised MLP over cached internal\-conflict features\. The detector declaresrequires\_qa\_features=True\. By default, it uses the precomputedicr\_featurevector stored in the ICR\-specific HDF5 group for each sample\. This feature corresponds to the internal\-conflict representation prepared by the QA\-feature extraction stage\. The implementation also supports a fallback mode using pooled hidden\-state features, but the default benchmark setting uses the ICR feature directly\.
The classifier is a three\-layer MLP with hidden dimensions256256and128128, ReLU activations, and a final sigmoid output:
d→256→128→1\.d\\rightarrow 256\\rightarrow 128\\rightarrow 1\.The model is trained on the OpenHalDet training split using examples labeled ascorrectorhallucination\. The binary target is11for hallucination and0for correct\. We train with Adam using learning rate10−310^\{\-3\}, binary cross\-entropy loss, and1515epochs\. No validation\-based early stopping is used in this implementation\.
At inference time, the detector applies the trained MLP to the cached ICR feature and returns the sigmoid output as the hallucination\-risk score:
sICR\(x\)=PMLP\(y=hallucination∣x\)\.s\_\{\\mathrm\{ICR\}\}\(x\)=P\_\{\\mathrm\{MLP\}\}\(y=\\mathrm\{hallucination\}\\mid x\)\.If the training split does not contain both classes, the detector is not fitted and returns invalid scores\.
## Appendix FMore Results
Table 9:Per\-dataset AUROC results across backbone LLMs\. Results are grouped by backbone model and detector family\. Scenario\-level averages are reported for categories with multiple datasets\. Higher values are better\.Question answeringRAGSum\.MathSci\.CodeAgentMulti\.OverallMethodARCCSQATriviaTruthfulQASQuADHotpotCoQAHaluQAQA Avg\.RAGTruthXSumGSM8KSVAMPMath Avg\.ThmQAHEvalMBPPCode Avg\.xLAMBeleb\.Avg\.Qwen/Qwen3\-8BBlack\-box detectors: text\-output\-basedVerbalized Conf\.58\.7460\.1757\.7163\.5168\.0562\.5057\.7467\.0861\.9463\.5251\.7950\.0057\.7153\.8659\.1759\.0954\.0056\.5569\.3651\.6859\.52SelfCheck\-BERT56\.2456\.9164\.1861\.4656\.4672\.6659\.0669\.4762\.0653\.9757\.4558\.2055\.4756\.8459\.0051\.2455\.0853\.1667\.1754\.6359\.33SelfCheck\-NLI59\.5758\.0983\.5766\.5661\.2062\.5055\.2053\.8462\.5767\.8659\.1665\.0971\.3968\.2452\.9967\.7750\.8059\.2961\.5664\.9862\.48Lexical Sim\.51\.9254\.9172\.9162\.4259\.0952\.3459\.5063\.1659\.5355\.8755\.9271\.9868\.1670\.0771\.5665\.2953\.6059\.4565\.1051\.6860\.91Family Avg\.56\.6257\.5269\.5963\.4961\.2062\.5057\.8863\.3961\.5260\.3156\.0861\.3263\.1862\.2560\.6860\.8553\.3757\.1165\.8055\.7460\.56Gray\-box detectors: likelihood\-based uncertaintyPerplexity70\.6572\.1275\.7263\.2360\.6153\.1262\.9774\.7166\.6468\.9755\.2967\.3758\.7163\.0464\.7964\.8850\.7257\.8078\.6059\.8864\.84Self\-eval\.76\.4473\.2178\.0860\.3779\.2496\.8875\.0575\.5176\.8556\.6660\.2181\.2373\.2677\.2571\.4074\.5954\.4064\.5079\.3254\.1871\.77LN\-Entropy70\.6556\.3175\.7263\.2360\.6153\.1262\.9774\.7164\.6769\.8755\.2967\.3763\.5665\.4774\.0764\.8850\.7257\.8078\.6059\.8864\.80SAR58\.1256\.2574\.9259\.7756\.6656\.2560\.9564\.3060\.9064\.3256\.0474\.1666\.1770\.1772\.8768\.1856\.4862\.3367\.8555\.7562\.88Semantic Ent\.51\.9254\.9871\.5158\.4557\.5457\.8158\.1761\.2758\.9661\.2352\.3254\.7061\.8258\.2658\.2171\.6957\.1664\.4358\.3051\.6958\.75Family Avg\.65\.5662\.5775\.1961\.0162\.9363\.4464\.0270\.1065\.6064\.2155\.8368\.9764\.7066\.8468\.2768\.8453\.9061\.3772\.5356\.2864\.61White\-box detectors: internal\-state\-basedEigenScore56\.9952\.9774\.2353\.3855\.7754\.6960\.6066\.2559\.3655\.9854\.4572\.8864\.9368\.9168\.9473\.1452\.6462\.8960\.5956\.2860\.87CCS52\.0251\.0665\.7560\.4552\.6782\.8150\.5250\.6658\.2456\.8751\.2469\.9176\.8773\.3965\.8069\.0157\.0463\.0350\.9160\.7060\.25HaloScope62\.1261\.3159\.4260\.9657\.8759\.3851\.0061\.7559\.2356\.6352\.7156\.8980\.4768\.6852\.9455\.3750\.8853\.1356\.8458\.7158\.54SAPLMA71\.9176\.4385\.4185\.7183\.7695\.3170\.6377\.9180\.8873\.4265\.0383\.1967\.7975\.4976\.4563\.6456\.6460\.1489\.7763\.5175\.68MIND61\.0776\.0689\.5180\.8486\.4757\.8169\.8081\.6275\.4079\.2167\.3876\.5270\.6573\.5979\.4458\.6865\.8462\.2688\.4872\.0874\.20SEP67\.1769\.8269\.4958\.3161\.6364\.0661\.2766\.7864\.8259\.2152\.7750\.3964\.9357\.6655\.1251\.6567\.0459\.3574\.2863\.9262\.23ICR Probe67\.2752\.6362\.9672\.4552\.8165\.6252\.3258\.9260\.6257\.8751\.5951\.3566\.6759\.0150\.2864\.0550\.0857\.0760\.2660\.0558\.66PRISM74\.3872\.9887\.6981\.0785\.4865\.6264\.3057\.9073\.6860\.3465\.5271\.2569\.6570\.4574\.5352\.0752\.1652\.1280\.7462\.5369\.31Family Avg\.64\.1264\.1674\.3169\.1567\.0668\.1660\.0665\.2266\.5362\.4457\.5966\.5570\.2568\.4065\.4460\.9556\.5458\.7570\.2362\.2264\.97Llama\-3\.2\-3B\-InstructBlack\-box detectors: text\-output\-basedVerbalized Conf\.50\.7658\.0474\.2759\.4966\.3963\.6463\.3168\.1363\.0060\.0951\.0550\.9159\.2355\.0750\.9656\.5251\.3753\.9550\.3451\.3857\.99SelfCheck\-BERT70\.5467\.4382\.6759\.5470\.0070\.5868\.7980\.0371\.2054\.3361\.2762\.2172\.6967\.4574\.0766\.9669\.6868\.3270\.2776\.2769\.25SelfCheck\-NLI66\.1765\.0284\.5266\.2665\.3670\.6869\.0571\.3069\.8060\.5769\.9882\.5870\.7076\.6455\.8252\.6159\.8056\.2168\.2574\.8967\.86Lexical Sim\.66\.3066\.3882\.5061\.7970\.4571\.9669\.8978\.9271\.0257\.3460\.7471\.6765\.9668\.8266\.6472\.6171\.2071\.9166\.0575\.3769\.16Family Avg\.63\.4464\.2280\.9961\.7768\.0569\.2267\.7674\.6068\.7658\.0860\.7666\.8467\.1567\.0061\.8762\.1863\.0162\.6063\.7369\.4866\.07Gray\-box detectors: likelihood\-based uncertaintyPerplexity73\.5172\.1483\.1055\.7570\.8364\.0471\.1379\.4571\.2462\.0757\.9168\.4764\.7666\.6251\.3052\.6156\.0854\.3565\.8079\.0766\.35Self\-eval\.52\.6064\.3876\.1759\.2475\.4074\.9070\.5174\.4668\.4659\.2259\.3471\.9479\.7475\.8457\.6986\.7468\.0577\.4059\.5652\.2467\.19LN\-Entropy70\.4868\.1481\.3550\.8869\.6070\.8167\.2679\.1369\.7154\.3355\.2276\.1569\.8372\.9966\.0169\.1366\.5767\.8569\.2978\.8768\.41SAR71\.3868\.8782\.7852\.0667\.4370\.2566\.5580\.4869\.9852\.3453\.9073\.8870\.0971\.9962\.0768\.7069\.9869\.3470\.4980\.2768\.32Semantic Ent\.66\.4865\.4979\.1454\.4263\.6862\.7661\.6074\.1065\.9660\.3456\.0550\.3352\.9551\.6452\.6860\.0055\.4057\.7063\.2476\.3662\.06Family Avg\.66\.8967\.8080\.5154\.4769\.3968\.5567\.4177\.5269\.0757\.6656\.4868\.1567\.4767\.8157\.9567\.4463\.2265\.3365\.6873\.3666\.47White\-box detectors: internal\-state\-basedEigenScore51\.8150\.2479\.8451\.7268\.0470\.3265\.0678\.8464\.4854\.7753\.2867\.8465\.1166\.4862\.4876\.9677\.8177\.3965\.3766\.6665\.07CCS53\.6750\.9258\.5850\.7852\.9852\.5055\.8555\.2653\.8257\.2154\.7861\.2967\.8764\.5861\.7463\.9155\.4059\.6650\.9359\.6856\.67HaloScope67\.9865\.5956\.3059\.5262\.1760\.3760\.2754\.9160\.8957\.7453\.9960\.7467\.9564\.3557\.0474\.3555\.7065\.0359\.4269\.6761\.39SAPLMA69\.1170\.4882\.6975\.1580\.7077\.5170\.8281\.8576\.0470\.3456\.5781\.4471\.4976\.4778\.6177\.8368\.6273\.2385\.8367\.9574\.53MIND75\.3574\.2185\.9975\.3180\.1876\.5673\.0186\.5778\.4062\.9159\.4178\.2879\.7279\.0080\.9567\.8363\.2265\.5388\.0375\.1175\.45SEP71\.5265\.7672\.3552\.5965\.3362\.2561\.8774\.9765\.8351\.3556\.1575\.9651\.0863\.5257\.4354\.7867\.1760\.9865\.7773\.4863\.52ICR Probe60\.9052\.8361\.6162\.5950\.1355\.7253\.3265\.9557\.8853\.3861\.3752\.4253\.9253\.1754\.6558\.2667\.5562\.9171\.3562\.4958\.73PRISM67\.8269\.0083\.0773\.9581\.5569\.9769\.1878\.3574\.1162\.0062\.4176\.6173\.0274\.8279\.6763\.9166\.7965\.3577\.2567\.7871\.90Family Avg\.64\.7762\.3872\.5562\.7067\.6465\.6563\.6772\.0966\.4358\.7157\.2569\.3266\.2767\.8066\.5767\.2365\.2866\.2670\.4967\.8565\.91Llama\-3\.1\-8B\-InstructBlack\-box detectors: text\-output\-basedVerbalized Conf\.68\.2659\.3882\.5571\.3078\.0059\.2367\.4361\.5468\.4661\.2355\.9162\.9069\.9066\.4063\.4960\.0057\.5258\.7654\.4651\.0963\.78SelfCheck\-BERT74\.3468\.9687\.3561\.0976\.8455\.4777\.6081\.6672\.9169\.9861\.5060\.9785\.7573\.3673\.9260\.0074\.7067\.3578\.8273\.8471\.93SelfCheck\-NLI72\.0867\.8892\.0166\.5670\.7264\.3576\.4079\.5073\.6971\.2564\.3783\.2678\.6780\.9752\.5553\.0457\.8355\.4466\.4966\.0769\.59Lexical Sim\.55\.7268\.4275\.8855\.1775\.1457\.3367\.2070\.6465\.6959\.9156\.7972\.3277\.0474\.6863\.8062\.6163\.1562\.8862\.5965\.7765\.26Family Avg\.67\.6066\.1684\.4563\.5375\.1859\.1072\.1673\.3470\.1965\.5959\.6469\.8677\.8473\.8563\.4458\.9163\.3061\.1165\.5964\.1967\.64Gray\-box detectors: likelihood\-based uncertaintyPerplexity79\.0875\.7791\.0461\.5675\.5964\.3478\.0782\.4675\.9970\.7664\.3577\.2279\.4678\.3464\.4867\.8359\.7363\.7883\.2469\.2173\.19Self\-eval\.67\.8367\.5787\.8255\.1182\.2260\.0176\.6875\.1671\.5572\.9563\.6277\.0982\.5879\.8465\.9274\.5762\.7768\.6762\.8650\.4669\.72LN\-Entropy57\.6171\.2473\.8750\.8376\.1660\.2167\.2571\.3866\.0768\.4554\.5776\.0675\.9275\.9973\.3573\.4868\.5471\.0167\.4170\.4368\.04SAR59\.4671\.0473\.1350\.3973\.2859\.2366\.2772\.5465\.6767\.3353\.1274\.7777\.5876\.1873\.8467\.8368\.9268\.3868\.3271\.9567\.59Semantic Ent\.55\.0368\.5170\.8555\.1361\.5361\.2359\.3368\.3562\.5051\.2652\.1351\.5150\.2950\.9061\.3073\.9159\.3566\.6360\.2364\.7860\.28Family Avg\.63\.8070\.8379\.3454\.6073\.7661\.0069\.5273\.9868\.3566\.1557\.5671\.3373\.1772\.2567\.7871\.5263\.8667\.6968\.4165\.3767\.76White\-box detectors: internal\-state\-basedEigenScore54\.7850\.8772\.9553\.2873\.4155\.6764\.8570\.3262\.0256\.4550\.7666\.2378\.0472\.1451\.2569\.5774\.7072\.1462\.1953\.0062\.25CCS59\.3150\.2658\.9553\.2350\.5862\.3457\.1752\.2455\.5154\.4354\.0860\.2777\.6768\.9774\.7754\.3552\.8153\.5852\.6350\.7857\.40HaloScope67\.9862\.3056\.3059\.5262\.1761\.1260\.2754\.9160\.5761\.2851\.6760\.7467\.1563\.9558\.9074\.3555\.7065\.0362\.0567\.1761\.39SAPLMA81\.2772\.1291\.7276\.8988\.6771\.7280\.2785\.6181\.0377\.9866\.6083\.0183\.6283\.3269\.8781\.3061\.1771\.2493\.1568\.2278\.42MIND75\.3377\.5886\.8777\.3080\.7063\.2173\.1985\.9577\.5265\.2361\.7978\.8689\.2984\.0879\.2363\.0464\.8663\.9592\.1872\.6875\.72SEP81\.9569\.3189\.9876\.8488\.2576\.3279\.0586\.2681\.0074\.4664\.8078\.6286\.4682\.5476\.2474\.3562\.4668\.4187\.8269\.7677\.82ICR Probe68\.1665\.2462\.9059\.8853\.6158\.2752\.2364\.8760\.6557\.3458\.1454\.1159\.3356\.7273\.9065\.2265\.4365\.3361\.3658\.5761\.09PRISM77\.4673\.0391\.2575\.8890\.1879\.9180\.9687\.1681\.9880\.2167\.6577\.2484\.8881\.0680\.5166\.9668\.5467\.7587\.4761\.3878\.27Family Avg\.70\.7865\.0976\.3766\.6073\.4566\.0768\.5073\.4270\.0465\.9259\.4469\.8978\.3174\.1070\.5868\.6463\.2165\.9374\.8662\.7069\.05Qwen3\-14BBlack\-box detectors: text\-output\-basedVerbalized Conf\.57\.1459\.7863\.7265\.2361\.7874\.0058\.6375\.2664\.4467\.8254\.2360\.1957\.5858\.8851\.2870\.0051\.0360\.5258\.9464\.5561\.83SelfCheck\-BERT58\.8651\.6568\.5466\.0458\.1750\.5253\.3774\.8460\.2554\.0854\.1754\.4578\.3766\.4164\.6156\.9659\.3158\.1470\.1952\.6660\.40SelfCheck\-NLI50\.6658\.2382\.5767\.0862\.7956\.9159\.4165\.6062\.9170\.0458\.8953\.7168\.4461\.0850\.4952\.6172\.3962\.5059\.8067\.9262\.21Lexical Sim\.50\.2752\.8976\.0264\.9064\.0550\.1756\.3772\.1460\.8552\.6651\.9466\.4098\.3782\.3966\.3664\.5759\.8962\.2363\.5255\.6862\.72Family Avg\.54\.2355\.6472\.7165\.8161\.7057\.9056\.9571\.9662\.1161\.1554\.8158\.6975\.6967\.1958\.1961\.0460\.6660\.8563\.1160\.2061\.79Gray\-box detectors: likelihood\-based uncertaintyPerplexity66\.6067\.6577\.9668\.8665\.1852\.8460\.1378\.3867\.2073\.7654\.1373\.7281\.4877\.6055\.3373\.0462\.5867\.8179\.9073\.7368\.55Self\-eval\.72\.7167\.6086\.1359\.7579\.3571\.6675\.3878\.7273\.9153\.9961\.5763\.9599\.8581\.9070\.9264\.3558\.7461\.5572\.0963\.3470\.59LN\-Entropy52\.6554\.5376\.5965\.3164\.3754\.5359\.3872\.8362\.5273\.7650\.9766\.0998\.9682\.5369\.3561\.5254\.7458\.1368\.8055\.6964\.71SAR54\.5253\.5974\.8462\.4863\.4455\.9156\.2275\.1862\.0274\.6550\.6366\.9389\.7878\.3666\.1356\.0953\.5954\.8470\.7455\.4563\.54Semantic Ent\.50\.2752\.7171\.3660\.1661\.6051\.3655\.1166\.8358\.6857\.8952\.0154\.0569\.0461\.5562\.3355\.2258\.8657\.0458\.4955\.6958\.41Family Avg\.59\.3559\.2277\.3863\.3166\.7957\.2661\.2474\.3964\.8766\.8153\.8664\.9587\.8276\.3964\.8162\.0457\.7059\.8770\.0060\.7865\.16White\-box detectors: internal\-state\-basedEigenScore54\.7750\.4774\.2760\.5063\.3653\.4257\.5064\.4259\.8453\.1250\.0667\.6796\.3081\.9966\.0054\.3553\.1053\.7368\.1554\.4861\.29CCS60\.7254\.9166\.8766\.7155\.0663\.0657\.1869\.0961\.7057\.4553\.7054\.0889\.4871\.7865\.5653\.4853\.7653\.6255\.1350\.7460\.41HaloScope55\.1964\.9454\.0156\.0850\.1253\.0358\.7461\.0356\.6457\.4552\.4555\.8783\.8569\.8660\.6050\.0051\.8850\.9457\.1162\.3757\.92SAPLMA69\.4365\.7086\.5575\.0984\.2270\.2873\.0167\.5273\.9876\.4264\.9560\.1494\.2277\.1872\.3056\.0967\.0861\.5987\.9774\.9273\.29MIND75\.7368\.4188\.5675\.2784\.9574\.1279\.0180\.4478\.3183\.6966\.0770\.8887\.8579\.3780\.6450\.0058\.0154\.0187\.5975\.2075\.67SEP54\.7166\.8486\.3178\.9885\.7567\.5972\.2970\.7172\.9053\.9065\.0957\.1959\.4158\.3063\.8174\.7863\.9769\.3883\.1371\.3069\.16ICR Probe51\.7667\.3165\.0367\.4154\.8863\.6857\.6454\.7960\.3160\.1251\.9665\.1976\.5970\.8955\.1652\.1753\.7952\.9850\.3660\.8359\.33PRISM58\.7669\.5587\.5575\.8785\.3855\.2658\.6776\.7170\.9768\.9766\.5053\.8289\.9371\.8872\.0957\.3955\.3156\.3582\.6257\.3768\.93Family Avg\.60\.1363\.5276\.1469\.4970\.4762\.5664\.2668\.0966\.8363\.8958\.8560\.6184\.7072\.6667\.0256\.0357\.1156\.5771\.5163\.4065\.75
Table 10:Selected AUROC results \(%\) on Llama\-3\.3\-70B\-Instruct\. We report selected experiments on seven datasets\. Higher values are better\.
## Appendix GCost Analysis
Profiling setup\.We profile detector cost on Llama\-3\.2\-3B\-Instruct using four representative datasets: ARC\-Challenge, CoQA, GSM8K, and MBPP\. Each profiling run uses up to 100 examples from the corresponding dataset, with the same train/validation/test split construction as the main benchmark pipeline\. All runs are executed on a single NVIDIA H800 PCIe GPU with PyTorch 2\.11\.0 and CUDA 13\.0\. The maximum generation length is set to 2048 tokens, and hardware statistics are sampled every 5 seconds when the profiled stage is long enough for periodic sampling\.
Cost accounting\.We separate detector cost into feature preparation, detector fitting, and detector inference\. Feature preparation includes detector\-specific evidence acquisition, such as stochastic generations, auxiliary self\-evaluation outputs, token log\-probabilities, hidden states, or QA\-feature caches\. Detector fitting is only applicable to methods that train detector\-specific parameters\. Detector inference measures the final scoring step on the test examples after the required artifacts have been prepared\. Because different detectors require different forms of evidence, we report both wall\-clock time and evidence\-acquisition statistics\.
Reported quantities\.Table[11](https://arxiv.org/html/2606.06959#A7.T11)reports average runtime statistics over the four profiled datasets\.AUROCis the corresponding average detection performance, included to contextualize the cost measurements\.Cost@100is the artifact\-inclusive wall\-clock time for processing up to 100 examples under the profiling setup\.Score\-onlyis the detector\-side time recorded in the isolated detector run, including detector\-side feature use, fitting when applicable, and scoring\.Infer\.reports the final detector scoring latency per example after the required artifacts are available\.Extra gen\. callsandExtra gen\. tokenscount only additional model generations beyond the primary response generation\. Therefore, a value of0\.00\.0in these two columns does not mean zero computation; it means that the detector does not require additional generated responses beyond the original model output, although it may still use cached logits, hidden states, feature extraction, fitting, or detector\-side scoring\.
Table 11:Average detector runtime cost on Llama\-3\.2\-3B\-Instruct across ARC\-Challenge, CoQA, GSM8K, and MBPP\. Cost@100 is the artifact\-inclusive wall\-clock time for up to 100 examples under the profiling setup\. Extra generation calls and tokens are counted per example beyond the primary response generation\.Note\.Extra generation calls and tokens are counted beyond the primary response generation\. A value of0\.00\.0in these columns does not indicate zero computation; these methods may still require cached logits, hidden states, feature extraction, detector fitting, or detector\-side scoring\. Score\-only time is the detector\-side time recorded in the isolated detector run; Cost@100 additionally accounts for the required cached artifacts under the profiling setup\.
Observations\.The profiling results show that detector cost varies with the type of evidence required\. Methods that reuse single\-response likelihoods or cached internal features, such as Perplexity, SAPLMA, PRISM, and HaloScope, do not require extra generated responses and therefore have0\.00\.0extra generation calls and tokens\. Prompt\-based auxiliary methods require additional model calls: Verbalize uses one short auxiliary response per example, while Self\-Evaluator uses one longer auxiliary response\. Sample\-based methods require five additional generations per example, which is reflected in the extra\-generation columns\. Among these methods, the final scoring overhead also differs: likelihood\- or lexical\-similarity\-based scoring is relatively light, whereas NLI\-, BERTScore\-, and semantic\-clustering\-based scoring adds additional per\-example computation\.
Interpretation\.These results should be interpreted as controlled profiling measurements, not as hardware\-independent runtime estimates\. Absolute costs depend on the GPU, software stack, sequence lengths, batching behavior, and whether shared artifacts are recomputed or reused\. The main practical implication is that detector comparisons should report not only accuracy, but also the evidence required to obtain the detector score\.
## Appendix HAssumptions and Limitations
OpenHalDet is designed as a unified evaluation benchmark for hallucination detection across heterogeneous generation scenarios\. We summarize below the main assumptions behind the benchmark protocol and the scope of its limitations\.
### H\.1Assumptions
Reference\-grounded annotation\.OpenHalDet assumes that hallucination labels can be assigned by comparing a generated response against the available references, acceptable answers, task context, and known incorrect answers when provided\. This enables a consistent annotation protocol across heterogeneous datasets, but the resulting labels necessarily depend on the coverage and quality of the available references\. Responses that are correct but not captured by the provided references may be difficult to label automatically, and ambiguous or under\-specified questions may introduce additional uncertainty\. To reduce ambiguity, we use a structured LLM\-judge annotation protocol and exclude abstention cases from the main binary evaluation\. The resulting labels should therefore be interpreted as scalable reference\-grounded labels rather than human\-certified factuality judgments\.
Unified detector scoring\.OpenHalDet evaluates each method as a scalar hallucination\-risk detector, where larger scores indicate a higher likelihood of hallucination\. This common interface is necessary for applying the same AUROC, AUPR, and FPR@95TPR metrics across black\-box, gray\-box, and white\-box methods\. For detectors whose raw score direction may differ across implementations or tasks, the reported metrics use the same score\-orientation rule across methods, so that results measure separability under a shared evaluation convention\.
Shared evaluation protocol\.The benchmark assumes that detector comparisons should be made under the same response generation, annotation, split construction, and metric implementation\. Accordingly, detectors evaluated on the same dataset and backbone operate on the same generated responses and labels, and methods requiring additional evidence use cached artifacts produced by the benchmark pipeline\. This design isolates detector behavior from protocol\-level differences as much as possible\.
### H\.2Limitations and Scope
Benchmark\-compatible implementations\.Several prior detectors were originally developed under task\-specific datasets, prompting formats, access assumptions, or computational budgets\. When a method does not directly transfer to all OpenHalDet scenarios, we use a benchmark\-compatible implementation that preserves the method’s core input signal and scoring principle while making it compatible with the unified pipeline\. For example, supervised probes are fitted on the OpenHalDet training split, methods requiring stochastic samples use the shared stochastic\-generation artifacts, and methods requiring internal states use the shared hidden\-state or QA\-feature caches\. For computationally heavier methods, we use scalable variants that are compatible with broad multi\-dataset evaluation; for instance, our SAR implementation uses cached stochastic generations and sequence\-level likelihood/similarity signals rather than requiring task\-specific token\-level processing\. These choices support controlled comparison across tasks and backbones, but they may not exactly match every configuration used in the original source papers\.
Model and task coverage\.OpenHalDet covers a broad set of task scenarios and evaluates detectors on recent open\-weight LLMs from the Llama and Qwen families\. However, the benchmark does not exhaust the space of possible deployment settings\. Closed\-source API models, domain\-specialized models, multimodal generation, very long\-context generation, and fully interactive multi\-turn agent environments may exhibit different hallucination patterns and detector behavior\. Similarly, although OpenHalDet includes multilingual and agentic scenarios, its coverage remains limited relative to the full diversity of languages, tools, and real\-world user intents\.
Cost profiling scope\.Our cost analysis is intended to compare relative cost patterns under a controlled hardware and software environment\. Absolute runtime, memory, and energy values can vary with GPU type, system load, implementation details, sequence length, and whether shared artifacts are recomputed or reused\. For this reason, we report artifact\-inclusive cost together with explicit evidence\-acquisition statistics such as extra model calls and generated tokens\. The reported cost results should be interpreted as controlled profiling evidence rather than hardware\-independent constants\.
Statistical uncertainty\.We report bootstrap confidence intervals for representative AUROC comparisons to estimate finite\-sample uncertainty over test examples\. This procedure does not rerun response generation, annotation, detector training, or stochastic sampling\. Thus, it captures uncertainty due to the finite test set, but not all sources of variation, such as random seeds, annotation variance, model sampling variance, or hardware\-level nondeterminism\. A more exhaustive uncertainty analysis would require repeated end\-to\-end runs, which is substantially more expensive for detectors requiring stochastic generations or hidden\-state extraction\.
Intended use\.OpenHalDet is intended for controlled comparison of hallucination detectors under shared generation, annotation, scoring, and evaluation protocols\. It does not certify detectors for direct deployment in high\-stakes applications, where additional task\-specific validation, calibration, human review, and monitoring would be required\.
## Appendix IArtifact
We provide a public repository accompanying this preprint\. The repository contains the core OpenHalDet codebase, including dataset adapters, prompt construction utilities, response\-generation scripts, annotation scripts, detector implementations, and evaluation code\.
Repository contents\.The artifact is intended to support inspection and reproduction of the benchmark pipeline\. It includes code for converting datasets into the unified schema, generating model responses, running the LLM\-judge annotation, preparing detector\-specific features, and computing evaluation metrics\. The repository also includes basic usage instructions and configuration examples\.
Data access\.OpenHalDet builds on existing public datasets and benchmark sources\. The repository provides adapters and processing scripts rather than claiming ownership of the original datasets\. Users should follow the licenses and access terms of the corresponding source datasets\.
Compute requirements\.Running the full benchmark can require substantial GPU resources, especially for detectors that use stochastic generations or hidden\-state extraction\. The code therefore supports running selected datasets, backbones, or detectors independently, and reusing cached intermediate artifacts when available\.
## Appendix JStatistical Significance
To provide a lightweight estimate of finite\-sample uncertainty, we report 95% stratified bootstrap confidence intervals for representative AUROC results on Llama\-3\.2\-3B\-Instruct\. For each selected detector–dataset setting, we first fix the score orientation on the full test set using the same rule as the main AUROC evaluation, and then resample positive and negative test examples with replacement for 1,000 bootstrap trials\. This procedure does not rerun response generation, annotation, detector fitting, or stochastic sampling; it estimates uncertainty due to the finite test set\.
Table[12](https://arxiv.org/html/2606.06959#A10.T12)reports representative detectors from black\-box, gray\-box, and white\-box access regimes across seven datasets\. The intervals should be interpreted as finite\-sample uncertainty estimates rather than exhaustive multi\-seed variation or formal pairwise significance tests\.
Table 12:Representative 95% stratified bootstrap confidence intervals for AUROC on Llama\-3\.2\-3B\-Instruct\. Each cell reports AUROC followed by the 95% confidence interval in brackets\.Note\.The number of evaluated examples is dataset\-dependent: TriviaQAN=1964N=1964, TruthfulQAN=154N=154, HotpotQAN=996N=996, HaluEval\-QAN=1966N=1966, XSumN=200N=200, SVAMPN=140N=140, and HumanEvalN=33N=33\. The wider intervals on smaller datasets, especially HumanEval, reflect larger finite\-sample uncertainty\.
## Appendix KBroader Impacts
OpenHalDet aims to support more transparent, reproducible, and comparable evaluation of hallucination detectors across diverse generation scenarios\. By standardizing response generation, annotation, detector scoring, and metric computation, the benchmark provides a common protocol for analyzing the reliability and cost trade\-offs of different detection methods\.
At the same time, benchmark results should not be interpreted as certifying the factual reliability or deployment safety of LLM systems\. Detector performance can depend on the covered datasets, backbone models, prompting protocols, automatic annotation quality, and evaluation metrics\. Misuse could arise if benchmark scores are treated as standalone guarantees for high\-stakes applications, or if automatic labels are assumed to be error\-free\. We therefore document the benchmark scope, annotation protocol, limitations, and intended use, and encourage users to complement OpenHalDet with task\-specific validation and human review in safety\-critical settings\.
## Appendix LLicenses
OpenHalDet builds on existing open models and public benchmark datasets\. We do not claim ownership of these third\-party assets\. All models and datasets should be used under the licenses and terms specified by their original providers\. Table[13](https://arxiv.org/html/2606.06959#A12.T13)summarizes the licenses or usage terms associated with the main model and dataset assets used in our benchmark\.
Table 13:Licenses and usage terms of the main third\-party assets used in OpenHalDet\. Users should refer to the original model and dataset sources for the authoritative license text\.AssetTypeLicense / TermsNotesLlama\-3\.1\-8B\-InstructModelLlama 3\.1 Community LicenseSubject to Meta’s Llama community license terms\.Llama\-3\.2\-3B\-InstructModelLlama 3\.2 Community LicenseSubject to Meta’s Llama community license terms\.Llama\-3\.3\-70B\-InstructModelLlama 3\.3 Community LicenseSubject to Meta’s Llama community license terms\.Qwen3\-8BModelApache 2\.0Open\-weight model from the Qwen family\.Qwen3\-14BModelApache 2\.0Open\-weight model from the Qwen family\.ARC\-ChallengeDatasetApache 2\.0License stated by the corresponding source repository\.CommonsenseQADatasetMITLicense stated by the corresponding source repository\.TriviaQADatasetApache 2\.0License stated by the corresponding source repository\.TruthfulQADatasetApache 2\.0License stated by the corresponding source repository\.SQuAD v2DatasetCC BY\-SA 4\.0License stated by the corresponding dataset source\.HotpotQADatasetCC BY\-SA 4\.0License stated by the corresponding dataset source\.CoQADatasetMixed licenseIncludes sources with different terms, such as CC BY\-SA 4\.0 and Apache 2\.0\.HaluEval\-QADatasetMITLicense stated by the corresponding source repository\.RAGTruthDatasetMITLicense stated by the corresponding source repository\.XSumDatasetMITLicense stated by the corresponding source repository\.GSM8KDatasetMITLicense stated by the corresponding source repository\.SVAMPDatasetMITLicense stated by the corresponding dataset source\.TheoremQADatasetMITLicense stated by the corresponding source repository\.xLAM\-AgentDatasetResearch\-only termsUsed for academic/research evaluation\.BelebeleDatasetCC\-BY\-NC 4\.0Non\-commercial license terms apply\.The OpenHalDet repository contains dataset adapters and processing scripts rather than claiming ownership of the original datasets\. When reconstructing the benchmark, users should download or access each dataset from its original source and comply with the corresponding license or usage terms\. The code license for the OpenHalDet implementation is specified in the accompanying repository\.
## Appendix MAssets
Released assets\.The anonymized OpenHalDet release includes the benchmark codebase, dataset adapters, unified data schema, prompt and annotation templates, detector wrappers, evaluation scripts, and documentation\.
Documentation\.The release includes a README and accompanying documentation describing installation, data preparation, schema fields, annotation protocol, detector evaluation, licenses, intended use, limitations, and reproduction commands\.Similar Articles
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
This paper reveals that much of the reported progress in LLM hallucination detection is due to benchmark construction artifacts, where ground-truth answers are embedded in prompts, allowing a simple text-similarity baseline to achieve near-perfect scores. Through a large-scale controlled evaluation, the authors show that most methods perform near chance under proper controls, except for supervised probes on upper-layer hidden states such as SAPLMA and their proposed DRIFT.
HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
HalluWorld is a controlled benchmark framework for evaluating hallucination in large language models using explicit reference world models across synthetic environments like gridworlds, chess, and realistic terminal tasks. It enables fine-grained analysis of failure modes such as perceptual hallucination, multi-step state tracking, and causal simulation, revealing that frontier models still struggle with complex reasoning not solved by extended thinking.
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
Researchers introduce SHADE, a hybrid estimator that combines Good-Turing coverage with graph-spectral cues to quantify semantic uncertainty and detect LLM hallucinations when only a few black-box samples are available.
Sanity Checks for Long-Form Hallucination Detection
This paper introduces a controlled-invariance methodology and two oracle tests (Force and Remove) to determine if LLM hallucination detectors rely on reasoning traces or final answer artifacts. It proposes TRACT, a lightweight scorer using lexical features, which demonstrates robust performance independent of answer-level cues.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.