Zero-source LLM Hallucination Detection with Human-like Criteria Probing

arXiv cs.AI Papers

Summary

Proposes HCPD, a zero-source hallucination detection method that uses a human-like criteria probing mechanism to decompose judgments into interpretable criteria, outperforming state-of-the-art baselines.

arXiv:2606.12900v1 Announce Type: new Abstract: Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

Cached at: 06/12/26, 08:54 AM

# Zero-source LLM Hallucination Detection with Human-like Criteria Probing
Source: [https://arxiv.org/html/2606.12900](https://arxiv.org/html/2606.12900)
###### Abstract

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query–answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at [https://github.com/TRISKEL10N/HCPD](https://github.com/TRISKEL10N/HCPD).

Machine Learning, ICML

## 1 Introduction

Large language models (LLMs) have rapidly advanced and are increasingly deployed across a broad range of applications, including information retrieval (Zhu et al., [2025](https://arxiv.org/html/2606.12900#bib.bib52)), decision support (Chiang et al., [2024](https://arxiv.org/html/2606.12900#bib.bib53); Chen et al., [2024b](https://arxiv.org/html/2606.12900#bib.bib70); Ma et al., [2025](https://arxiv.org/html/2606.12900#bib.bib54)), and domain-specific assistance in healthcare (Benary et al., [2023](https://arxiv.org/html/2606.12900#bib.bib55); Vrdoljak et al., [2025](https://arxiv.org/html/2606.12900#bib.bib56)), finance (Yu et al., [2024](https://arxiv.org/html/2606.12900#bib.bib57)), and education (Neumann et al., [2024](https://arxiv.org/html/2606.12900#bib.bib58)). Nevertheless, their practical use is constrained by hallucinations, where LLMs generate responses that are factually incorrect, ungrounded, or unfaithful to the user’s intent, posing significant risks in safety-critical settings. Consequently, reliable hallucination detection has become essential for the safe and trustworthy deployment of LLM-based assistants.

A critical challenge is that practical hallucination detection frequently operates under a strict zero-source constraint (Fang et al., [2025](https://arxiv.org/html/2606.12900#bib.bib63); Yang et al., [2025b](https://arxiv.org/html/2606.12900#bib.bib62)). In prevalent open-world scenarios, the auditing process is entirely decoupled from the generation. For instance, third-party auditors (e.g., social media platforms, news agencies) must evaluate massive volumes of user-uploaded text without knowing the underlying source LLMs. Similar situations also occur when the vast majority of end-users interact with LLMs through web interfaces (e.g., [ChatGPT](https://chatgpt.com/), [Gemini](https://gemini.google.com/app), and [Claude](https://claude.ai/new)) or browser extensions, where plain text is the sole accessible output. Consequently, commercial APIs, internal states, and auxiliary resources (e.g., external knowledge bases) are typically unavailable. Under such realistic restrictions, robust detection must rely solely on the observed query–answer pair.

Unfortunately, most existing approaches are not directly applicable under this constraint. While effective in traditional knowledge-based hallucination detection, retrieval-augmented or fact-verification methods (Semnani et al., [2023](https://arxiv.org/html/2606.12900#bib.bib59); Hu et al., [2024](https://arxiv.org/html/2606.12900#bib.bib60); Chen et al., [2024c](https://arxiv.org/html/2606.12900#bib.bib61)) require access to web or knowledge resources, whose availability and reliability are difficult to guarantee. Confidence-based and metric-based methods (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10); Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11); Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) avoid external references but mainly depend on model internals, which are unattainable for black-box or commercial systems. Self-supervised or consistency-based methods (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15); Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) typically employ static, task-agnostic heuristics, limiting their ability to capture precise, context-dependent judgments across diverse domains. Moreover, most detectors provide only binary labels or scalar scores, offering limited interpretability and diagnostic insight.

In contrast, human experts rarely judge a response using a single monolithic criterion. They instead decompose evaluation into multiple dimensions, adapt their relative weights to the context, and provide evidence-grounded judgment. This observation motivates a general zero-source detection paradigm that emulates human-style evaluative reasoning.

In this paper, we propose Human-like Criteria Probing for Zero-source Hallucination Detection (HCPD). Its core is a *Human-like Criteria Probing (HCP)* mechanism, which enables a pre-trained LLM agent to evaluate responses through a transparent, multi-step process (Section [4.3](https://arxiv.org/html/2606.12900#S4.SS3)). For each query–answer pair, the agent first adaptively generates a set of fine-grained criteria (e.g., factual accuracy, logical consistency) and their context-aware importance weights, then scores the text against each criterion, and finally aggregates these scores into an overall truthfulness measure, effectively emulating the nuanced, multi-perspective reasoning of a human expert. To enable this adaptive judgment capability, we introduce a *reward-based alignment training* scheme, which leverages weak supervision from semantic consistency to teach the agent how to decompose and weight criteria without needing ground-truth hallucination labels (Section [4.4](https://arxiv.org/html/2606.12900#S4.SS4)). At inference, we apply a *multi-sampling aggregation* strategy to reduce variance from the stochastic generation process, performing $K$ independent HCP evaluations per instance and averaging the results for a robust final decision (Section [4.5](https://arxiv.org/html/2606.12900#S4.SS5)). We further provide a statistical characterization of both training and inference behaviors, and derive a threshold-free performance characterization by bounding the ranking error probability (Section [4.6](https://arxiv.org/html/2606.12900#S4.SS6)). Extensive experiments demonstrate that HCPD outperforms existing state-of-the-art methods, validating its effectiveness for zero-source hallucination detection.

Our contributions are summarized as follows:

- An adaptive, multi-criteria probing framework for zero-source detection: We are the first to explicitly formalize hallucination detection under the zero-source constraint, and we propose the Human-like Criteria Probing mechanism, which reframes detection as a process of context-aware criteria generation, weighting, and aggregation. By enabling an agent to adaptively decompose its judgment into interpretable dimensions, our method emulates the nuanced, multi-faceted reasoning of human evaluators, moving beyond monolithic scoring paradigms.
- Weakly-supervised alignment training without ground-truth labels: To achieve reliable adaptive judgment in the scoring agent, we introduce a reward-based alignment training scheme using only weak supervision derived from semantic consistency. This method teaches the agent to identify, weight, and score relevant criteria without requiring any annotated hallucination data, making it well suited to the practical zero-source constraint.
- A stable and interpretable inference strategy with theoretical guarantees: We design a multi-sampling aggregation strategy during inference to mitigate generation variance, enhancing decision robustness while preserving full interpretability through the generated criteria and weights. Furthermore, we develop a theoretical analysis, providing formal insights into the feasibility and reliability of our probing-based approach.

## 2 Preliminaries

Group Relative Policy Optimization. Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed for the stable and efficient fine-tuning of LLMs. To obviate the need for an explicit value function commonly required in policy gradient methods (Schulman et al., [2017](https://arxiv.org/html/2606.12900#bib.bib25)), GRPO adopts the average reward of multiple sampled outputs conditioned on the same prompt as an implicit baseline. Concretely, for each input $x$, the current policy $f_\theta$ samples a group of outputs $\{Y_1, \ldots, Y_G\}$ and assigns each output a critic-based scalar reward $r(Y_g)$. Rather than optimizing the absolute reward, GRPO constructs a group-relative advantage $A_g = r(Y_g) - \frac{1}{G}\sum_{j=1}^{G} r(Y_j)$, which centers rewards within the group and reduces the variance of the gradient estimates. The policy model is then optimized to increase the likelihood of outputs with above-average rewards, while constraining updates to stay close to a reference policy $f_0$ (typically the initialization) to preserve generation quality. The objective for a single group is defined as:

$$
\mathcal{J}(\theta) = \frac{1}{G} \sum_{g=1}^{G} \left[ \frac{f_\theta(Y_g \mid x)}{f_0(Y_g \mid x)} A_g \right] - \beta \cdot D_{KL}\big(f_\theta(\cdot \mid x) \,\|\, f_0(\cdot \mid x)\big), \tag{1}
$$

where $\beta$ is a hyperparameter controlling the strength of the Kullback–Leibler divergence (Csiszár, [1975](https://arxiv.org/html/2606.12900#bib.bib28)) penalty. For simplicity, we reuse $\mathcal{J}(\theta)$ to denote the objective aggregated over the distribution when used in theoretical bounds.

By evaluating and comparing multiple outputs for the same prompt, GRPO obtains a robust, context-aware learning signal and achieves more stable convergence compared to per-sample absolute reward optimization. This property makes it particularly suitable for aligning our scoring agent, where rewards are dense yet require precise calibration across diverse evaluation properties.
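The group-relative advantage and the single-group objective can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' training code: the function names, the example rewards, and the assumption that the likelihood ratios and the KL estimate are already available as plain numbers are all illustrative.

```python
from statistics import fmean
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Return A_g = r(Y_g) - mean_j r(Y_j) for one group of sampled outputs."""
    if not rewards:
        raise ValueError("a group needs at least one reward")
    baseline = fmean(rewards)  # implicit baseline: the group's mean reward
    return [r - baseline for r in rewards]


def group_objective(ratios: List[float], advantages: List[float],
                    kl: float, beta: float) -> float:
    """Single-group objective: mean(ratio_g * A_g) - beta * KL.

    `ratios` stands for f_theta(Y_g|x) / f_0(Y_g|x) and `kl` for an estimate
    of the KL term; both are assumed to be computed elsewhere.
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    surrogate = fmean([rho * adv for rho, adv in zip(ratios, advantages)])
    return surrogate - beta * kl


# Hypothetical rewards for G = 4 outputs sampled from the same prompt.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, the advantages always sum to zero within a group: outputs are rewarded only for being better than their siblings, not for their absolute reward.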

![Refer to caption](https://arxiv.org/html/2606.12900v1/x1.png)

Figure 1: Overview of the proposed HCPD. Given a query–answer pair $(q,a)$, the agent instantiates a set of specific criteria $\{c_i\}_{i=1}^{m}$ and corresponding importance weights $\{w_i\}_{i=1}^{m}$. The criterion-level partial scores $\{s_i\}_{i=1}^{m}$ are subsequently produced and aggregated into an overall truthfulness measure $s_p$. During GRPO training, we fine-tune the agent by maximizing the score-alignment reward that encourages the predicted $s_p$ to match weak supervision labels derived from consistency-based similarity measures. At inference time, we invoke the agent $K$ times and aggregate the resulting scores to obtain a reliable decision $\bar{s}$.

## 3 Related Work

Hallucination Detection. Although truthfulness is a fundamental requirement of language generation, large language models (LLMs) still frequently produce outputs that are factually incorrect or contextually inconsistent, commonly referred to as hallucination. Consequently, hallucination detection (Lin et al., [2022](https://arxiv.org/html/2606.12900#bib.bib21); Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17); Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11); Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9); Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14); Zhang et al., [2023](https://arxiv.org/html/2606.12900#bib.bib22); Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12); Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13); Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18); Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) has emerged as a central research focus for safe and reliable deployment.

A prevailing research paradigm attributes LLM hallucinations to predictive uncertainty. Probability-based methods quantify such uncertainty via metrics including Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)), Length-Normalized Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)), and Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11)). In contrast, consistency-based methods evaluate agreement across multiple sampled responses either through similarity metrics such as BERTScore (Zhang et al., [2019](https://arxiv.org/html/2606.12900#bib.bib23)), ROUGE (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)), natural language inference, and prompt-based self-consistency verification (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)), or via spectral analysis of response covariance matrices such as eigenvalue decomposition (Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13)). Verbalization-based methods elicit confidence signals by prompting models to express uncertainty explicitly in natural language, through verbalized uncertainty (Lin et al., [2022](https://arxiv.org/html/2606.12900#bib.bib21)) or self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)) mechanisms. Nevertheless, these uncertainty-driven signals remain limited, as hallucinations frequently occur even in high-confidence generations. While multi-sample evaluation partially mitigates this issue, it incurs substantial computational overhead.

A complementary perspective infers textual truthfulness by exploiting internal model states. For instance, CCS (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)) extracts latent knowledge from activation patterns, SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) trains classifiers directly on hidden representations, HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) identifies hallucination-related subspaces via singular value decomposition, and TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) introduces learnable steering vectors to adapt latent features for improved separability. Despite their effectiveness, these methods necessitate access to the internal model representations, thereby limiting their applicability in real-world deployment scenarios where the underlying model architecture and training data provenance remain undisclosed.

## 4 Method

### 4.1 Problem Definition

LLM Hallucination. In the context of text generation, hallucination refers to content that is either factually inconsistent with established knowledge or contextually unfaithful to the given source or query intent (Huang et al., [2025](https://arxiv.org/html/2606.12900#bib.bib1)). Such errors typically arise from the model’s inherent uncertainty, limitations in its learned knowledge, or failures in reasoning. Characterizing and probing these failure modes is therefore essential for the reliable deployment of LLMs.

Hallucination Detection under Zero-source Constraint. The goal of hallucination detection is to assess whether a model-generated response exhibits factual inconsistencies or contextual unfaithfulness (Ji et al., [2023](https://arxiv.org/html/2606.12900#bib.bib2); Farquhar et al., [2024](https://arxiv.org/html/2606.12900#bib.bib3)). Formally, let $\mathcal{Q}$ denote the space of user queries and $\mathcal{A}$ the space of model outputs. For a query $q \in \mathcal{Q}$ and a generated answer $a \in \mathcal{A}$, the detection model $f$ produces a binary label $y \in \{0,1\}$ indicating whether the response is hallucinated. In this work, we focus on the zero-source constraint: the detector $f$ cannot access the LLM’s internal states (e.g., token probabilities) or output distribution during inference. Furthermore, no external knowledge bases or reference texts are available. The detector must operate solely on the textual pair $(q,a)$, making the task challenging yet aligned with practical black-box deployment.

Challenges in Hallucination Detection under Zero-source Constraint. Developing an effective detector under this constraint faces several challenges. First, without access to model-internal confidence signals (e.g., token logits), the detector must rely entirely on the semantic content, requiring a deep understanding to identify subtle inconsistencies. Second, the absence of reference sources or external knowledge eliminates straightforward fact-checking, forcing the detector to reason intrinsically about plausibility. Third, hallucinations are heterogeneous; they may manifest as factual errors, logical fallacies, or semantic misalignments, demanding a flexible, multi-perspective evaluation rather than a single static criterion. Finally, for trust and usability, decisions must be interpretable, clarifying why a text is deemed hallucinatory. These challenges motivate the need for an adaptive, interpretable, and purely text-based probing mechanism.

Table 1: The structured and interpretable output from the agent.

| Component | Example |
| --- | --- |
| Q & A | Ques: In which country were the 1948 Winter Olympics held?<br>Ans: Norway. |
| Specific Criteria | 1. Factual Grounding (Weight: 60%)<br>2. Temporal Consistency (Weight: 20%)<br>3. Semantic Precision (Weight: 20%) |
| Analysis | - Factual Grounding: The 1948 Winter Olympics were held in St. Moritz, Switzerland, not Norway. This is a clear factual error.<br>- Temporal Consistency: The response does not specify a year, so there are no temporal issues directly in the text.<br>- Semantic Precision: The response is vague and does not provide the correct information, leading to semantic distortion as the user is not given accurate details. |
| Score ($1 \sim 10$) | 1 |

### 4.2 Motivations and Method Overview

Motivations. When assessing the correctness of a text, human evaluators naturally engage in holistic and multifaceted reasoning. They do not apply a single, rigid rule but dynamically consider various dimensions, e.g., factual accuracy, logical soundness, temporal consistency, and contextual faithfulness, and intuitively weigh their importance based on the specific content. This context-dependent, multi-criteria weighting yields two practical advantages: i) *adaptivity*: the ability to focus on the most diagnostic checks for each instance, improving detection precision; and ii) *interpretability*: the ability to explain negative judgments by pointing to specific violated criteria. These observations motivate us to design a detection paradigm that emulates the human evaluative process: decomposing the holistic judgment into a set of interpretable criteria, dynamically determining their relevance (weights) based on the input, and synthesizing a final decision through a transparent aggregation.

Method Overview. Motivated by the above analysis, we propose Human-like Criteria Probing for Zero-source Hallucination Detection (HCPD), which employs a Human-like Criteria Probing (HCP) mechanism as its core (Figure [1](https://arxiv.org/html/2606.12900#S2.F1)). Specifically, given an input pair $(q,a)$, HCP uses an LLM agent to adaptively generate a set of interpretable evaluation criteria along with their relative importance weights and to output an aggregate truthfulness score (Section [4.3](https://arxiv.org/html/2606.12900#S4.SS3)). To instill this capability in the agent, we devise a Reward-Based Alignment Training scheme using weak supervision from semantic consistency, teaching the agent to decompose and weight criteria without ground-truth labels (Section [4.4](https://arxiv.org/html/2606.12900#S4.SS4)). To reduce generation randomness, we introduce a Multi-Sampling Aggregation strategy that performs $K$ independent probes per instance and averages the resulting scores as the final detection statistic; this stabilizes the decision while preserving interpretability (Section [4.5](https://arxiv.org/html/2606.12900#S4.SS5)). We also provide theoretical analysis justifying the feasibility of our probing approach (Section [4.6](https://arxiv.org/html/2606.12900#S4.SS6)).

### 4.3 Human-like Criteria Probing Mechanism

Inspired by the multi-faceted nature of human evaluation, we develop Human-like Criteria Probing (HCP), a structured reasoning framework that enables a pre-trained LLM to function as an *adaptive* scoring agent for hallucination detection, moving beyond conventional monolithic scoring.

Human-like Criteria Probing. Given the query–answer pair $(q,a)$ under evaluation, the agent $f_\theta$ produces interpretable, criterion-wise assessments. Specifically, it first adaptively generates a set of fine-grained specific criteria $\{c_i\}_{i=1}^{m}$ derived from a predefined General Evaluation Criteria set $\mathcal{C}$ that covers broader dimensions such as factual and logical coherence. The agent then autonomously assigns context-sensitive importance weights $\{w_i\}_{i=1}^{m}$ to these specific criteria, emphasizing, for instance, temporal accuracy for historical queries and logical soundness for scientific explanations. Finally, it predicts a partial score $s_i$ for each weighted criterion and aggregates them into an overall truthfulness measure $s_p$, thereby mimicking the contextual and multi-faceted nature of human judgment. The operation of the agent can be formalized as:

$$
s_p = \sum_{i=1}^{m} w_i \cdot s_i, \qquad \{(c_i, w_i, s_i)\}_{i=1}^{m} \leftarrow f_\theta(q, a; \mathcal{C}), \tag{2}
$$

where $m$ is the number of specific criteria, the weights satisfy $w_i \geq 0$ and $\sum_i w_i = 1$, the scores are integers in $\{1, \ldots, 10\}$, and the General Evaluation Criteria set is given by:

$$
\mathcal{C} = \{\text{Factual}, \text{Logical}, \text{Semantic}, \text{Temporal}, \text{Social}\}. \tag{3}
$$

Structured and Interpretable Output. To ensure transparency and facilitate downstream processing, the agent is constrained to generate outputs in a strictly structured format (Table [1](https://arxiv.org/html/2606.12900#S4.T1)). For each specific criterion, the output explicitly reports its assigned weight, the corresponding evidence that supports or contradicts the response, and the associated partial score. These components are subsequently aggregated into an overall score via weighted synthesis. This design makes the detector’s decisions fully interpretable and enables fine-grained attribution of hallucinated content.
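The weighted synthesis in Eq. (2) can be sketched as follows. This is an illustrative reading of the formula only: the `CriterionScore` record, the tolerance on the weight sum, and the example criteria with their weights and partial scores are assumptions, not the agent's actual output schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriterionScore:
    name: str      # specific criterion c_i (hypothetical label)
    weight: float  # importance weight w_i, with w_i >= 0 and sum_i w_i = 1
    score: int     # partial score s_i, an integer in {1, ..., 10}


def aggregate(criteria: List[CriterionScore], tol: float = 1e-6) -> float:
    """Compute s_p = sum_i w_i * s_i after checking the constraints of Eq. (2)."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    if any(c.weight < 0 for c in criteria):
        raise ValueError("weights must be non-negative")
    if abs(sum(c.weight for c in criteria) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    if any(not 1 <= c.score <= 10 for c in criteria):
        raise ValueError("partial scores must lie in 1..10")
    return sum(c.weight * c.score for c in criteria)


# Hypothetical evaluation of one query-answer pair.
evaluation = [
    CriterionScore("Factual Grounding", 0.5, 2),
    CriterionScore("Logical Consistency", 0.3, 8),
    CriterionScore("Semantic Precision", 0.2, 5),
]
overall = aggregate(evaluation)
```

The sketch returns the raw weighted sum; how that value is mapped onto the integer scale reported by the agent is not specified in this section, so no rounding is applied here.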

### 4.4 Reward-based Alignment Training

To equip the agent $f_\theta$ with precise evaluative capabilities, we train it with a reward-based alignment paradigm using Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.12900#bib.bib4)). This aligns the agent’s adaptive probing and scoring behavior with a weak supervision signal derived from semantic consistency, eliminating the reliance on expensive human-annotated hallucination labels.

Training Data Construction and Labeling. Our training data is constructed from standard question-answer (QA) benchmarks that pair queries with human-verified reference answers. Specifically, for each query $q$ sourced from established datasets (e.g., TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.12900#bib.bib5))), a corresponding ground-truth answer $\hat{a}$ is available. We further leverage an auxiliary LLM (e.g., the test target model) to generate a set of candidate responses $\{a^{(n)}\}_{n=1}^{N}$ for the same query, intentionally sampling outputs that span a spectrum from factually correct to clearly hallucinated.

Following common practice in hallucination detection benchmarks, we adopt widely used consistency metrics from natural language processing (NLP) (e.g., BLEURT (Sellam et al., [2020](https://arxiv.org/html/2606.12900#bib.bib20))) as a source of weak supervision labels. Concretely, for each generated answer $a^{(n)}$, we compute its semantic consistency with the reference $\hat{a}$, yielding $\operatorname{sim}(\hat{a}, a^{(n)}) \in [0,1]$. Consistent with prior work (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)), responses with a consistency score above $0.5$ are treated as relatively faithful, whereas those below are labeled as hallucinated. To obtain a graded supervision signal, we translate this continuous score to a discrete $1$–$10$ scale:

$$
s_l^{(n)} = \operatorname{clip}\big(\lfloor 10 \cdot \operatorname{sim}(\hat{a}, a^{(n)}) \rceil, 1, 10\big), \tag{4}
$$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer and the reference answer $\hat{a}$ is assigned the maximum score of $10$. This procedure yields a weakly supervised dataset $\mathcal{D} = \{(q, \{(\hat{a}, 10)\} \cup \{(a^{(n)}, s_l^{(n)})\}_{n=1}^{N})\}$ that captures gradual degrees of hallucination severity and enables the agent to learn fine-grained distinctions.
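As a concrete reading of Eq. (4), the mapping from a similarity value to a graded label might look like the sketch below. The function name is hypothetical, and the tie-breaking rule (round half up) is an assumption, since the text does not say how exact halves are rounded.

```python
import math


def weak_label(sim: float) -> int:
    """Map a similarity in [0, 1] to a graded label in {1, ..., 10} as in Eq. (4)."""
    if not 0.0 <= sim <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # Round to the nearest integer; ties go up (assumption, not stated in the text).
    rounded = math.floor(10 * sim + 0.5)
    # Clip so that very low similarities still receive the minimum label 1.
    return max(1, min(10, rounded))


# Hypothetical similarity values for three sampled answers.
labels = [weak_label(s) for s in (0.04, 0.52, 0.96)]
```

Python's built-in `round` is avoided here because it rounds exact halves to the nearest even integer, which would make the tie behavior less obvious.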

Score-alignment Reward Function. To encourage the agent to produce scores consistent with the weak supervision labels, we introduce a score-alignment reward function. Given an input pair $(q,a)$, the agent performs human-like criteria probing and outputs a structured evaluation containing a predicted score $s_p \in \{1, \ldots, 10\}$. The reward function assigns a scalar reward $r \in [0,1]$ by directly comparing the predicted score $s_p$ with the label $s_l$:

$$
r = \begin{cases} 1 - \dfrac{|s_p - s_l|}{9}, & \text{if the output is well-formed}, \\ 0, & \text{otherwise}. \end{cases} \tag{5}
$$

The agent is then optimized according to Eqn. ([1](https://arxiv.org/html/2606.12900#S2.E1)).

The reward attains its maximum value $r = 1$ when the prediction perfectly matches the label ($s_p = s_l$) and decreases linearly with the absolute deviation between them. Notably, the score-alignment reward implicitly enforces structural constraints on the agent’s output: any deviation from the prescribed format that prevents reliable extraction of a valid score results in $r = 0$; and the inherent KL (Csiszár, [1975](https://arxiv.org/html/2606.12900#bib.bib28)) regularization of GRPO in Eqn. ([1](https://arxiv.org/html/2606.12900#S2.E1)) limits deviation from the initial policy (Shao et al., [2024](https://arxiv.org/html/2606.12900#bib.bib4)).
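The reward in Eq. (5) is simple enough to state directly in code. In this sketch the parsing step is abstracted away: `predicted` is assumed to be `None` whenever no valid score could be extracted, and treating an out-of-range score as malformed is likewise an assumption rather than something the text specifies.

```python
from typing import Optional


def score_alignment_reward(predicted: Optional[int], label: int) -> float:
    """Return 1 - |s_p - s_l| / 9 for well-formed outputs, else 0."""
    if not 1 <= label <= 10:
        raise ValueError("label must lie in 1..10")
    # Malformed output: no parsable score, or a score outside the 1..10 scale
    # (the second condition is an assumption of this sketch).
    if predicted is None or not 1 <= predicted <= 10:
        return 0.0
    return 1.0 - abs(predicted - label) / 9.0


# Exact match, maximal deviation, and an unparsable output.
examples = [
    score_alignment_reward(10, 10),
    score_alignment_reward(1, 10),
    score_alignment_reward(None, 5),
]
```

The divisor 9 is the largest possible gap on a 1 to 10 scale, which is what keeps the reward inside $[0,1]$.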

Differentiable Scoring vs. Binary Classification. The adoption of a differentiable scoring scheme rather than binary “True/False” classification is a central design choice. 1) The graded formulation better reflects the inherently varying degrees of hallucination severity, which range from minor factual imprecision to fully fabricated content. By modeling hallucination on a fine-grained scale, the agent can capture subtle deviations that binary labels necessarily collapse. 2) The reward defined in Eq. ([5](https://arxiv.org/html/2606.12900#S4.E5)) induces an informative policy optimization signal. By penalizing prediction errors proportionally to their magnitude, it yields a denser reward landscape than binary rewards, which only indicate correctness without guiding the magnitude or direction of improvement. 3) Scalar scoring enables flexible decision-making at inference time. Unlike binary classifiers with fixed operating points, our agent’s scores can be thresholded at varying levels to accommodate different precision–recall trade-offs without retraining. Collectively, this score-alignment reward formulation not only facilitates effective learning under weak supervision but also endows the detector with calibrated, adaptable judgment capabilities essential for effective hallucination detection.

### 4.5 Robust Inference via Multi-sampling Aggregation

Due to the inherent stochasticity of language model generation, a single evaluation from the agent may exhibit non-negligible variance. To reduce this randomness and improve decision reliability, we employ a multi-sampling aggregation strategy during inference.

Formally, the trained agent $f_\theta$ is independently invoked $K$ times for the same input pair $(q,a)$, yielding a set of structured evaluations together with the corresponding synthesized final scores $\{s_p^{(k)}\}_{k=1}^{K}$. We obtain a robust estimate of response truthfulness by aggregating these scores via arithmetic averaging:

$$
\bar{s} = \frac{1}{K} \sum_{k=1}^{K} s_p^{(k)}. \tag{6}
$$

This aggregation suppresses stochastic fluctuations in individual evaluations, resulting in more reliable and stable detection outcomes. Experimental results indicate that our method achieves substantially stronger detection performance than SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) while incurring a similar time cost.
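The averaging in Eq. (6) amounts to calling the agent $K$ times and taking the mean. The sketch below assumes a hypothetical `probe` callable standing in for one stochastic HCP evaluation; the stub used in the usage lines simply replays fixed scores and is not a model of the trained agent.

```python
from statistics import fmean
from typing import Callable, List, Tuple


def multi_sample_score(probe: Callable[[str, str], float],
                       query: str, answer: str, k: int) -> Tuple[float, List[float]]:
    """Invoke `probe` k times on the same pair and return (mean score, all scores)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = [probe(query, answer) for _ in range(k)]
    return fmean(scores), scores


# Hypothetical stand-in for the trained agent: replays three fixed scores.
_replay = iter([2.0, 1.0, 3.0])


def stub_probe(query: str, answer: str) -> float:
    return next(_replay)


mean_score, samples = multi_sample_score(
    stub_probe, "In which country were the 1948 Winter Olympics held?", "Norway.", k=3
)
```

Returning the individual scores alongside the mean keeps each sampled evaluation available for inspection, in line with the interpretability goal stated above.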

### 4.6 Theoretical Analysis

To establish a theoretical foundation for HCPD, we provide guarantees for both its training and inference stages. Our analysis yields three core results: (i) Theorem [1](https://arxiv.org/html/2606.12900#Thmtheorem1) ensures that GRPO training anchors the learned scoring behavior to the weak supervision signal in expectation; (ii) Proposition [1](https://arxiv.org/html/2606.12900#Thmproposition1) justifies multi-sampling aggregation as a variance-reduction mechanism via a concentration bound; and (iii) Corollary [1](https://arxiv.org/html/2606.12900#Thmcorollary1) derives a threshold-free performance characterization by bounding the ranking error probability.

We first provide theoretical guarantees for the rationality and effectiveness of the training and inference framework\.

###### Theorem 1\.

(Expectation alignment under training) Let $x$ denote an input and $s_{l}(x)$ its corresponding weak label. Consider the $\ell_{1}$-risk of predicting score $s$: $R_{x}(s)\triangleq|s-s_{l}(x)|$, whose minimizer is $s=s_{l}(x)$. Let $Y\sim f_{\theta}(\cdot\mid x)$ be a stochastic well-formed generation of the agent, and let $S_{\theta}(x)=s_{p}(x,Y)$ denote the parsed score. Define the conditional mean score as $\mu_{\theta}(x)\triangleq\mathbb{E}\big[S_{\theta}(x)\mid x\big]$. Then we have

$$\mathbb{E}_{x}\big[|\mu_{\theta}(x)-s_{l}(x)|\big]\leq\mathcal{J}^{\prime}(\theta), \tag{7}$$

where $\mathcal{J}^{\prime}(\theta)$ is affine-equivalent to the GRPO objective $\mathcal{J}(\theta)$ defined in Eqn. ([1](https://arxiv.org/html/2606.12900#S2.E1)).

Theorem [1](https://arxiv.org/html/2606.12900#Thmtheorem1) formalizes a training-time alignment guarantee: optimizing the KL-regularized GRPO objective drives the expected parsed score $\mu_{\theta}(x)$ toward the weak supervision label $s_{l}(x)$ in expectation over the training distribution. This result links the GRPO objective to stable scoring behavior and constitutes the training-side foundation of our end-to-end detection analysis.
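One elementary step that relates the conditional mean score to a per-sample error is Jensen's inequality applied to the absolute value. The chain below sketches that step only. Identifying its right-hand side with $\mathcal{J}^{\prime}(\theta)$ depends on the reward in Eq. (5) and on the affine transformation of the GRPO objective, neither of which is reproduced here, so this should be read as an illustration under those assumptions rather than as the paper's proof.

$$
\begin{aligned}
\mathbb{E}_{x}\big[|\mu_{\theta}(x)-s_{l}(x)|\big]
&=\mathbb{E}_{x}\Big[\big|\mathbb{E}\big[S_{\theta}(x)-s_{l}(x)\mid x\big]\big|\Big]\\
&\leq\mathbb{E}_{x}\Big[\mathbb{E}\big[|S_{\theta}(x)-s_{l}(x)|\mid x\big]\Big]
=\mathbb{E}_{x,\,Y\sim f_{\theta}(\cdot\mid x)}\big[R_{x}\big(s_{p}(x,Y)\big)\big].
\end{aligned}
$$

In words, the deviation of the mean score from the weak label is never larger than the expected $\ell_{1}$-risk of individual sampled scores.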

Theorem [1](https://arxiv.org/html/2606.12900#Thmtheorem1) characterizes training-time alignment, but inference introduces an additional source of uncertainty: each evaluation produces a random score $S_{\theta}(x)$. Proposition [1](https://arxiv.org/html/2606.12900#Thmproposition1) provides a Hoeffding concentration bound for the average of $K$ parsed scores.

###### Proposition 1\.

(Multi-sampling concentration) Fix an inference input $x$. Suppose the $K$ well-formed generations satisfy $Y_{1},\ldots,Y_{K}\overset{\text{i.i.d.}}{\sim}f_{\theta}(\cdot\mid x)$. For each $k\in\{1,\ldots,K\}$, define the parsed score $S_{\theta}^{(k)}(x)\triangleq s_{p}(x,Y_{k})\in\{1,\ldots,10\}$, and let $\bar{s}(x)\triangleq\frac{1}{K}\sum_{k=1}^{K}S_{\theta}^{(k)}(x)$ denote the aggregated score in Eqn. ([6](https://arxiv.org/html/2606.12900#S4.E6)). Then for any bias threshold $u>0$,

$$\mathbb{P}\big(|\bar{s}(x)-\mathbb{E}[S_{\theta}(x)\mid x]|\geq u\mid x\big)\leq 2\exp\Big(-\frac{2Ku^{2}}{(10-1)^{2}}\Big). \tag{8}$$
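To get a feel for the bound in Eq. (8), the sketch below evaluates its right-hand side for a few sample sizes and inverts it to find the smallest $K$ that guarantees a target failure probability. The function names and the example values of $u$ and $\delta$ are illustrative assumptions.

```python
import math

SCORE_RANGE = 10 - 1  # parsed scores lie in {1, ..., 10}


def hoeffding_bound(k: int, u: float) -> float:
    """Right-hand side of Eq. (8), capped at 1 because it bounds a probability."""
    return min(1.0, 2.0 * math.exp(-2.0 * k * u**2 / SCORE_RANGE**2))


def min_samples(u: float, delta: float) -> int:
    """Smallest K for which the Eq. (8) bound is at most delta,
    obtained by solving 2 * exp(-2 K u^2 / 81) <= delta for K."""
    return math.ceil(SCORE_RANGE**2 * math.log(2.0 / delta) / (2.0 * u**2))


for k in (1, 5, 10, 50):
    print(f"K={k:>2}  bound for u=2: {hoeffding_bound(k, 2.0):.3f}")
print("K needed for u=1, delta=0.05:", min_samples(1.0, 0.05))
```

Hoeffding's inequality is a worst-case bound over all score distributions on $[1,10]$, so for small $K$ it can be vacuous (the uncapped value exceeds 1). What it guarantees is the exponential rate of decay in $K$, not a tight estimate of the actual deviation probability.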

Because it requires extensive domain knowledge or expert intervention, a rigorous human-level annotation of hallucination severity $s^{\star}$ is almost inaccessible. Accordingly, we treat the similarity-based label $s_{l}$ as a weak proxy and model it as $s_{l}(x)=g(s^{\star}(x))+\epsilon(x)$, where $g$ is a monotone non-decreasing mapping and $\epsilon$ captures conditional bias arising from stochasticity and systematic knowledge discrepancies. For the detection process built on this proxy, we have the following result.

###### Corollary 1\.

(Ranking error decomposition) Let $x$ denote an input and $s_{l}(x)$ its corresponding weak label. Assume the proxy bias is uniformly bounded: $|\epsilon(x)|\leq b_{\max},~\forall x$. Let $Y_{1},\ldots,Y_{K}\overset{\text{i.i.d.}}{\sim}f_{\theta}(\cdot\mid x)$ be well-formed generations, and define $S_{\theta}^{(k)}(x)$ and $\bar{s}(x)$ as in Proposition [1](https://arxiv.org/html/2606.12900#Thmproposition1). Let $x^{+}$ and $x^{-}$ denote independent draws from the distributions of true (non-hallucinated) and hallucinated inputs, respectively. Define the ranking error probability $\mathcal{E}_{\mathrm{rank}}\triangleq\mathbb{P}\big(\bar{s}(x^{+})\leq\bar{s}(x^{-})\big)$. Then for any $\Delta>b_{\max}$,

$$
\begin{aligned}
\mathcal{E}_{\mathrm{rank}}\leq{}&\underbrace{\mathbb{P}\big(g(s^{\star}(x^{+}))-g(s^{\star}(x^{-}))\leq 2\Delta\big)}_{\text{intrinsic separability}}+\underbrace{\frac{4\,\mathcal{J}^{\prime}(\theta)}{\Delta-b_{\max}}}_{\text{training alignment}}\\
&+\underbrace{4\exp\Big(-\frac{2K}{(10-1)^{2}}\Big(\frac{\Delta-b_{\max}}{2}\Big)^{2}\Big)}_{\text{multi-sampling concentration}},
\end{aligned}
$$

where $\mathcal{J}^{\prime}(\theta)$ is affine-equivalent to the GRPO objective $\mathcal{J}(\theta)$ defined in Eqn. ([1](https://arxiv.org/html/2606.12900#S2.E1)).

Corollary [1](https://arxiv.org/html/2606.12900#Thmcorollary1) establishes a decomposable upper bound on the detector’s ranking error probability. This bound makes explicit the principal factors that affect detection performance. Beyond the intrinsic separability of the data, it shows that a smaller training alignment loss $\mathcal{J}^{\prime}(\theta)$ and a larger inference sampling size $K$ both reduce the error bound. This provides theoretical support for our design choices: optimizing the GRPO objective aligns the agent’s scores with the proxy supervision, while the multi-sampling aggregation strategy suppresses inference variance, with the corresponding term decaying exponentially in $K$. Consequently, the proposed training and inference components jointly support reliable detection. We refer readers to Appendix [A](https://arxiv.org/html/2606.12900#A1) for more details.
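The two terms of the Corollary 1 bound that do not depend on the unknown data distribution can be evaluated directly. The sketch below does so for hypothetical values of $\mathcal{J}^{\prime}(\theta)$, $\Delta$, and $b_{\max}$, which are assumptions chosen only to show the trend in $K$. The intrinsic separability term is omitted because it depends on the unobserved severity $s^{\star}$.

```python
import math


def alignment_term(j_prime: float, delta: float, b_max: float) -> float:
    """Training-alignment term 4 * J'(theta) / (Delta - b_max)."""
    if delta <= b_max:
        raise ValueError("the bound requires Delta > b_max")
    return 4.0 * j_prime / (delta - b_max)


def concentration_term(k: int, delta: float, b_max: float) -> float:
    """Multi-sampling term 4 * exp(-2K / (10 - 1)^2 * ((Delta - b_max) / 2)^2)."""
    if delta <= b_max:
        raise ValueError("the bound requires Delta > b_max")
    half_gap = (delta - b_max) / 2.0
    return 4.0 * math.exp(-2.0 * k / (10 - 1) ** 2 * half_gap**2)


# Hypothetical values; none of them is reported in the paper.
j_prime, delta, b_max = 0.2, 4.0, 1.0
print(f"training alignment term: {alignment_term(j_prime, delta, b_max):.3f}")
for k in (1, 5, 10, 50, 200):
    print(f"K={k:>3}  concentration term: {concentration_term(k, delta, b_max):.4f}")
```

The free parameter $\Delta$ encodes a trade-off: a larger $\Delta$ shrinks both computable terms but enlarges the intrinsic separability term, since $\mathbb{P}(\cdot\leq 2\Delta)$ is non-decreasing in $\Delta$.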

Table 2: Comparisons with hallucination detection baselines on different datasets for LLaMA-3.1-8b and Qwen-3-8b in terms of AUROC (%), where ♣ denotes methods trained on fully labeled datasets.

| Model | Method | TriviaQA | SciQ | NQ Open | CoQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3.1-8b | LN-Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)) | 73.62 ± 2.20 | 62.69 ± 2.73 | 52.36 ± 1.53 | 74.52 ± 1.86 | 65.80 |
| | Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)) | 56.07 ± 1.92 | 54.12 ± 1.82 | 59.83 ± 3.68 | 62.51 ± 1.42 | 58.13 |
| | CCS (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)) | 78.20 ± 1.89 | 58.85 ± 3.03 | 55.50 ± 1.89 | 68.98 ± 2.04 | 65.38 |
| | SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) | 74.58 ± 1.90 | 59.68 ± 2.45 | 62.13 ± 2.60 | 70.61 ± 2.10 | 66.75 |
| | Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)) | 80.62 ± 2.62 | 66.12 ± 2.85 | 57.92 ± 2.60 | 81.41 ± 2.21 | 71.52 |
| | SAPLMA♣ (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 78.51 ± 3.16 | 85.63 ± 0.96 | 76.23 ± 0.82 | 71.58 ± 1.35 | 77.99 |
| | Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11)) | 78.71 ± 3.09 | 77.81 ± 3.17 | 61.04 ± 4.29 | 75.26 ± 4.63 | 73.21 |
| | Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)) | 77.96 ± 2.03 | 67.09 ± 2.05 | 62.85 ± 1.57 | 77.53 ± 0.90 | 71.36 |
| | EigenScore (Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13)) | 51.35 ± 1.23 | 51.52 ± 0.82 | 52.17 ± 2.13 | 52.00 ± 1.19 | 51.76 |
| | HaloScope♣ (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 58.19 ± 5.79 | 69.04 ± 6.36 | 63.38 ± 3.02 | 72.11 ± 4.96 | 65.68 |
| | TAD♣ (Vazhentsev et al., [2025](https://arxiv.org/html/2606.12900#bib.bib67)) | 72.01 ± 1.03 | 66.75 ± 1.96 | 68.88 ± 2.56 | 74.86 ± 2.10 | 70.63 |
| | TSV♣ (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 79.78 ± 3.36 | 80.01 ± 1.17 | 70.17 ± 1.41 | 69.31 ± 6.75 | 74.82 |
| | HCPD (Ours) | **86.25** ± 1.08 | **86.04** ± 2.25 | **90.38** ± 3.58 | **90.07** ± 2.58 | **88.19** |
| Qwen-3-8b | LN-Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)) | 64.66 ± 2.55 | 61.22 ± 4.44 | 68.70 ± 4.11 | 54.28 ± 3.51 | 62.22 |
| | Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)) | 54.74 ± 3.59 | 51.36 ± 0.36 | 60.02 ± 1.18 | 57.01 ± 1.44 | 55.78 |
| | CCS (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)) | 57.51 ± 1.97 | 57.96 ± 3.03 | 58.63 ± 5.53 | 55.10 ± 4.58 | 57.30 |
| | SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) | 77.12 ± 2.73 | 66.78 ± 2.40 | 80.51 ± 4.53 | 67.57 ± 2.30 | 73.00 |
| | Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)) | 59.65 ± 2.42 | 59.27 ± 4.59 | 67.81 ± 4.32 | 54.06 ± 3.63 | 60.20 |
| | SAPLMA♣ (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 78.11 ± 1.09 | 86.63 ± 1.53 | 72.86 ± 1.20 | 80.28 ± 1.40 | 79.47 |
| | Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11)) | 81.74 ± 2.78 | 70.84 ± 5.41 | 75.10 ± 1.97 | 70.59 ± 5.09 | 74.57 |
| | Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)) | 52.33 ± 1.44 | 53.61 ± 2.77 | 56.75 ± 4.30 | 59.13 ± 2.32 | 55.45 |
| | EigenScore (Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13)) | 52.60 ± 2.03 | 52.85 ± 2.57 | 56.39 ± 3.26 | 52.79 ± 0.37 | 53.66 |
| | HaloScope♣ (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 58.21 ± 3.99 | 74.98 ± 3.98 | 57.25 ± 1.50 | 62.18 ± 4.49 | 63.16 |
| | TAD♣ (Vazhentsev et al., [2025](https://arxiv.org/html/2606.12900#bib.bib67)) | 71.27 ± 1.90 | 55.98 ± 4.37 | 76.82 ± 3.77 | 66.94 ± 0.58 | 67.75 |
| | TSV♣ (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 73.42 ± 3.89 | 78.77 ± 0.94 | 61.38 ± 3.43 | 68.40 ± 4.92 | 70.49 |
| | HCPD (Ours) | **93.69** ± 0.89 | **92.63** ± 2.90 | **87.35** ± 6.22 | **84.80** ± 1.01 | **89.62** |

## 5 Experiments

### 5.1 Experimental Settings

Datasets. We conduct experiments on four standard QA datasets: TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.12900#bib.bib5)), SciQ (Welbl et al., [2017](https://arxiv.org/html/2606.12900#bib.bib6)), NQ Open (Kwiatkowski et al., [2019](https://arxiv.org/html/2606.12900#bib.bib7)), and CoQA (Reddy et al., [2019](https://arxiv.org/html/2606.12900#bib.bib8)). For NQ Open, we use all 3,610 samples from the validation set. From TriviaQA, we randomly sample 3,310 test pairs, while 3,000 training pairs are drawn from SciQ and CoQA. Each dataset is split into training and test sets with a 3:1 ratio, and all reported results are averaged over 5 independent random splits. All model responses are generated using greedy decoding. More details on the datasets are given in Appendix [C.1](https://arxiv.org/html/2606.12900#A3.SS1).
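A 3:1 split repeated over several seeds can be sketched as follows. This is a minimal illustration assuming a plain seeded shuffle; the paper does not specify the exact splitting procedure, and the helper name `split_3_to_1` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_3_to_1(items: Sequence[int], seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle with a fixed seed, then split into 3/4 train and 1/4 test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


# Example with 3,610 item indices, the NQ Open validation size quoted above.
indices = range(3610)
splits = [split_3_to_1(indices, seed) for seed in range(5)]
for seed, (train, test) in enumerate(splits):
    print(f"split {seed}: train={len(train)} test={len(test)}")
```

Reported metrics would then be computed once per split and averaged, with the spread across splits giving the ± values in the tables.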

Baselines. We compare our method against various baselines from multiple technical paradigms for hallucination detection. 1) Logit-based approaches: Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)), Length Normalized Entropy (LN-Entropy) (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)), and Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11)); 2) Consistency-based approaches: Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)), SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)), and EigenScore (Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13)); 3) Verbalized-confidence approach: Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)); 4) Internal state-based approaches: Contrast-Consistent Search (CCS) (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)), SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)), HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)), TAD (Vazhentsev et al., [2025](https://arxiv.org/html/2606.12900#bib.bib67)), and TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)).

Implementation Details. We compare our method with baselines on 7 LLMs, i.e., LLaMA-3.1-8b, LLaMA-3.1-70b (Dubey et al., [2024](https://arxiv.org/html/2606.12900#bib.bib29)), LLaMA-2-7b, LLaMA-2-13b (Touvron et al., [2023b](https://arxiv.org/html/2606.12900#bib.bib30)), Qwen-3-8b (Yang et al., [2025a](https://arxiv.org/html/2606.12900#bib.bib31)), Qwen-2.5-7b, and Qwen-2.5-14b (Yang et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib32)). For our method, we adopt Qwen-2.5-7b as the agent and train it with the Open-R1 implementation ([https://github.com/huggingface/open-r1](https://github.com/huggingface/open-r1)). For the optimization hyperparameters, we set the learning rate $lr=2\times 10^{-4}$ on LLaMA-3.1-8b and $lr=1\times 10^{-4}$ on Qwen-3-8b. For the coefficient $\beta$ that controls the strength of the $D_{KL}$ term, we set $\beta=0.05$ on TriviaQA and SciQ with Qwen-3-8b and on TriviaQA with LLaMA-3.1-8b, and $\beta=0.04$ for all others. More details are given in Appendix [C.2](https://arxiv.org/html/2606.12900#A3.SS2).

Evaluation. Following Farquhar et al. ([2024](https://arxiv.org/html/2606.12900#bib.bib3)), we evaluate detection performance using the Area Under the Receiver Operating Characteristic Curve (AUROC). As a proxy for ground-truth factuality, we employ BLEURT (Sellam et al., [2020](https://arxiv.org/html/2606.12900#bib.bib20)) to measure semantic consistency between a generated answer and its reference. Generated answers with a consistency score exceeding 0.5 are regarded as relatively faithful, whereas those below are labeled as hallucinated.
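The evaluation protocol can be sketched as two small steps: threshold a consistency score at 0.5 to obtain proxy labels, then compute AUROC as the fraction of (faithful, hallucinated) pairs that the detector ranks correctly. The numbers below are hypothetical stand-ins for BLEURT and detector scores, and the pairwise implementation is a plain-Python illustration rather than the evaluation code used in the paper.

```python
from typing import List


def label_faithful(consistency: List[float], threshold: float = 0.5) -> List[bool]:
    """Proxy labels: an answer counts as faithful if its consistency score exceeds the threshold."""
    return [c > threshold for c in consistency]


def auroc(scores: List[float], is_faithful: List[bool]) -> float:
    """Probability that a faithful answer outscores a hallucinated one (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, is_faithful) if y]
    neg = [s for s, y in zip(scores, is_faithful) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical consistency scores (standing in for BLEURT) and detector scores.
consistency = [0.82, 0.61, 0.47, 0.30, 0.55, 0.12]
detector_scores = [8.6, 7.0, 7.4, 3.2, 6.1, 2.5]
labels = label_faithful(consistency)
print(f"AUROC = {auroc(detector_scores, labels):.3f}")
```

When no two scores tie, this pairwise AUROC equals one minus the empirical counterpart of the ranking error probability defined in Corollary 1, which is why that bound is described as threshold-free.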

Table 3: Comparisons with training-based baselines across target models on TriviaQA in terms of AUROC (%). Columns from LLaMA-3.1-8b to Qwen-2.5-14b are target models.

| Source Model | Method | LLaMA-3.1-8b | LLaMA-3.1-70b | LLaMA-2-7b | LLaMA-2-13b | Qwen-3-8b | Qwen-2.5-7b | Qwen-2.5-14b |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3.1-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 78.51 ± 3.16 | 77.63 ± 2.63 | 78.13 ± 2.65 | 78.25 ± 1.83 | 71.63 ± 2.15 | 69.70 ± 2.13 | 63.77 ± 2.28 |
| | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 58.19 ± 6.10 | 86.50 ± 5.56 | 82.68 ± 2.27 | 90.88 ± 1.85 | 54.99 ± 1.89 | 68.37 ± 3.66 | 68.33 ± 3.77 |
| | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 79.78 ± 3.36 | 81.29 ± 10.23 | 82.85 ± 3.90 | 88.06 ± 4.23 | 59.89 ± 8.38 | 77.07 ± 5.82 | 61.17 ± 7.98 |
| | HCPD (Ours) | **86.25** ± 1.08 | **86.87** ± 0.98 | **90.74** ± 0.52 | **93.43** ± 1.33 | **78.89** ± 2.22 | **88.84** ± 1.02 | **83.34** ± 0.86 |
| Qwen-3-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 78.22 ± 2.64 | 78.14 ± 1.64 | 80.33 ± 1.22 | 79.50 ± 1.28 | 78.11 ± 1.09 | 71.22 ± 1.16 | 71.73 ± 2.18 |
| | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 53.89 ± 0.78 | 57.97 ± 1.21 | 55.47 ± 9.50 | 56.73 ± 9.89 | 58.21 ± 3.99 | 57.64 ± 1.53 | 78.65 ± 5.95 |
| | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 54.14 ± 3.26 | 58.85 ± 5.70 | 60.31 ± 6.89 | 64.04 ± 8.75 | 73.42 ± 3.89 | 57.08 ± 3.75 | 65.77 ± 4.82 |
| | HCPD (Ours) | **87.09** ± 1.32 | **89.73** ± 2.63 | **90.84** ± 0.76 | **93.95** ± 1.07 | **93.69** ± 0.89 | **89.74** ± 0.77 | **87.94** ± 1.19 |

### 5.2 Comparisons with Baselines

Results on LLaMA-3.1-8b. As reported in Table [2](https://arxiv.org/html/2606.12900#S4.T2), most training-free methods exhibit limited detection performance (about 51.76% to 73.21% on average). While some methods display moderate separability on simple benchmarks (e.g., TriviaQA, CoQA), their performance drops significantly on more challenging datasets. In contrast, the strongest training-based methods, such as SAPLMA (77.99%) and TSV (74.82%), maintain stronger and more consistent performance across datasets. Notably, our method attains the best overall results while relying solely on $(q,a)$ inputs, with an average AUROC of 88.19%, which exceeds the second-best method by 10.20%. Specifically, HCPD improves AUROC over the best baseline by 5.63% on TriviaQA, by 14.15% on NQ Open, and by 8.66% on CoQA. We attribute these gains to the agent's consideration of criteria from multiple perspectives, which enables fine-grained identification of hallucinated content.

Results on Qwen-3-8b. The results on Qwen-3-8b exhibit similar characteristics: the training-free and training-based methods achieve average AUROCs of 53.66% to 74.57% and 63.16% to 79.47%, respectively. In comparison, HCPD again achieves consistent improvements, outperforming the second-best method by 10.15% on average. Taken together with the results on LLaMA-3.1-8b, these findings underscore the effectiveness and robustness of HCPD across diverse backbone models and benchmarks.

![Refer to caption](https://arxiv.org/html/2606.12900v1/x2.png)

Figure 2: Cross-dataset AUROCs of HCPD on LLaMA-3.1-8b.

Table 4: Controlled empirical decomposition of HCPD.

| Method | AUROC |
| --- | --- |
| Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)) | 56.07 ± 1.92 |
| HCPD (Ours, Pre-RL) | 66.54 ± 0.59 |
| HCPD (Ours, Post-RL) | 86.25 ± 1.08 |

Transferability across target models. A key advantage of HCPD is its *model-agnostic* design: it operates in the natural language space, which is shared across LLMs, and therefore transfers well across diverse target models. To demonstrate this advantage, we compare HCPD with other training-based baselines under a transfer setting where the detector is evaluated on outputs generated by target models from different families and scales. Results in Table [3](https://arxiv.org/html/2606.12900#S5.T3) show that when applied to unseen target models, competing baselines such as HaloScope and TSV often suffer noticeable performance degradation due to distributional shifts in the features of the proxy model. In contrast, HCPD maintains stable and strong performance across heterogeneous target models, which better meets the needs of practical deployment in open, real-world settings. More results on the other 3 datasets are shown in Appendix [D.1](https://arxiv.org/html/2606.12900#A4.SS1).

![Refer to caption](https://arxiv.org/html/2606.12900v1/x3.png)

Figure 3: Impact of reward design, where “-D” denotes differentiable scoring reward and “-B” denotes binary scoring reward.

Table 5: Impact of sampling size $K$.

| Dataset | $K=1$ | $K=2$ | $K=5$ | $K=10$ |
| --- | --- | --- | --- | --- |
| TriviaQA | 85.21 ± 1.22 | 85.71 ± 0.86 | 86.25 ± 1.08 | 86.25 ± 0.98 |
| SciQ | 85.06 ± 2.33 | 85.47 ± 2.12 | 86.04 ± 2.25 | 86.14 ± 1.89 |
| NQ Open | 86.89 ± 1.60 | 89.26 ± 2.46 | 90.38 ± 3.58 | 90.41 ± 2.46 |
| CoQA | 88.60 ± 4.12 | 89.37 ± 3.34 | 90.07 ± 2.58 | 89.85 ± 2.66 |
| Inf. Time (s) | 0.2349 | 0.4675 | 1.1313 | 2.2807 |

Transferability across data distributions. Furthermore, the proposed Human-like Criteria Probing mechanism enables our method to assess queries with varying distributions and formats in a flexible manner. To evaluate distributional transfer, we test HCPD under cross-dataset settings. As shown in Figure [2](https://arxiv.org/html/2606.12900#S5.F2), HCPD generalizes well when transferred to TriviaQA, SciQ, and NQ Open, preserving high performance. We observe a modest degradation on CoQA, which we attribute to substantial differences in the underlying QA formulation and interaction mode. Additional results on Qwen-3-8b are shown in Appendix [D.2](https://arxiv.org/html/2606.12900#A4.SS2).

### 5.3 Ablation Studies

Controlled empirical decomposition. To isolate the contributions of our proposed human-like criteria probing mechanism and weakly-supervised alignment training, we conduct a controlled empirical decomposition using standard Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)) as a baseline. As detailed in Table [4](https://arxiv.org/html/2606.12900#S5.T4), Human-like Criteria Probing alone (Pre-RL) already outperforms this baseline, yielding an AUROC improvement of 10.47%. Performance is further improved by our GRPO alignment framework (Post-RL), which delivers a substantial additional AUROC gain of 19.71%.

Impact of reward design. We discussed the advantages of differentiable scoring over binary scoring in Section [4.4](https://arxiv.org/html/2606.12900#S4.SS4), and here we validate this claim empirically. As shown in Figure [3](https://arxiv.org/html/2606.12900#S5.F3), replacing the graded reward with a binary “0/1” signal not only reduces the evaluation to binary classification, but also yields a significant drop in performance (from 86.25% to 79.06% on TriviaQA and from 90.07% to 51.75% on CoQA). We attribute this degradation to the lack of severity information, which makes the agent more prone to misclassifying samples near the decision threshold.

Impact of sampling size. According to Corollary [1](https://arxiv.org/html/2606.12900#Thmcorollary1), the ranking error bound decreases as the number of samples $K$ increases. The results in Table [5](https://arxiv.org/html/2606.12900#S5.T5) provide quantitative support for this trend. Larger $K$ generally yields higher AUROC and more stable inference, with diminishing returns beyond $K=5$, but also incurs a linear increase in computational cost. To balance performance and efficiency, we set $K=5$ in experiments. Additional performance and cost comparisons with baselines are detailed in Appendix [D.4](https://arxiv.org/html/2606.12900#A4.SS4). Notably, HCPD matches the speed of lightweight metrics and outpaces consistency-based methods. Moreover, its model-agnostic design yields greater computational savings when evaluating outputs from large target models (e.g., LLaMA-2-13B, Qwen-2.5-14B, and LLaMA-3.1-70B).
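The cost and accuracy trend can be checked directly from the Table 5 numbers. The sketch below copies those values and computes, for each $K$, the inference time per sample and the mean AUROC across the four datasets; the helper names are illustrative.

```python
# Values copied from Table 5 (AUROC in %, inference time in seconds).
K_VALUES = [1, 2, 5, 10]
AUROC = {
    "TriviaQA": [85.21, 85.71, 86.25, 86.25],
    "SciQ": [85.06, 85.47, 86.04, 86.14],
    "NQ Open": [86.89, 89.26, 90.38, 90.41],
    "CoQA": [88.60, 89.37, 90.07, 89.85],
}
INF_TIME = [0.2349, 0.4675, 1.1313, 2.2807]


def mean_auroc(column: int) -> float:
    """Average AUROC over the four datasets for one column of Table 5."""
    return sum(values[column] for values in AUROC.values()) / len(AUROC)


def time_per_sample(column: int) -> float:
    """Inference time divided by the number of samples K."""
    return INF_TIME[column] / K_VALUES[column]


for i, k in enumerate(K_VALUES):
    gain = mean_auroc(i) - mean_auroc(0)
    print(f"K={k:>2}  mean AUROC={mean_auroc(i):.2f}  "
          f"gain over K=1={gain:+.2f}  time/sample={time_per_sample(i):.4f}s")
```

On these numbers the time per sample stays close to 0.23 s for every $K$, consistent with linear cost, while the mean AUROC rises from 86.44 at $K=1$ to about 88.19 at $K=5$ and does not improve further at $K=10$.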

Table 6: Comparisons with baselines under alternative metrics on TriviaQA for LLaMA-3.1-8b.

| Method | ROUGE | DeepSeek |
| --- | --- | --- |
| LN-Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)) | 77.70 ± 1.39 | 52.69 ± 1.19 |
| CCS (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)) | 82.92 ± 1.64 | 52.52 ± 2.12 |
| SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) | 73.92 ± 1.26 | 71.38 ± 0.90 |
| Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)) | 85.49 ± 1.13 | 53.82 ± 2.86 |
| SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 87.22 ± 1.13 | 79.99 ± 0.92 |
| Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)) | 82.40 ± 0.90 | 53.63 ± 1.69 |
| HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 72.99 ± 4.61 | 57.98 ± 4.96 |
| TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 78.46 ± 6.15 | 73.31 ± 5.63 |
| HCPD (Ours) | **89.17** ± 0.42 | **80.42** ± 0.39 |

Beyond the BLEURT metric. Notably, BLEURT is an option, not a dependency. We conduct GRPO training using alternative signals (e.g., ROUGE (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)) and DeepSeek-V3 as a judge (Liu et al., [2024](https://arxiv.org/html/2606.12900#bib.bib64))). Results in Table [6](https://arxiv.org/html/2606.12900#S5.T6) confirm that, whichever evaluation metric is provided, HCPD aligns its behavior to the reward signal and achieves strong performance. The detailed implementation of the DeepSeek-V3 judge is given in Appendix [C.2](https://arxiv.org/html/2606.12900#A3.SS2).

## 6 Conclusion

In this paper, we propose *Human-like Criteria Probing* (HCP), an interpretable paradigm that achieves hallucination detection under the zero-source setting. We instantiate HCP-Detector as an LLM-based scoring agent that decomposes evaluation into context-relevant criteria, assigns adaptive weights, and produces evidence-grounded analyses along with an overall score. The agent is further aligned via GRPO using a dense score-alignment reward derived from weak proxy supervision, which is based on semantic consistency signals on QA benchmarks. It employs multi-sampling aggregation at inference time to suppress stochastic variance and yield stable predictions. Both theoretical analysis and extensive experiments across diverse datasets and model architectures demonstrate the effectiveness of HCPD in identifying hallucinated content.

## Acknowledgments

This work was partially supported by the Joint Funds of the National Natural Science Foundation of China \(Grant No\.U24A20327\), Key\-Area Research and Development Program Guangdong Province 2018B010107001, and TCL Science and Technology Innovation Fund, China\. Jiahao Yang is supported by the China Scholarship Council \(CSC\) under Grant No\. 202506150018\.

## Impact Statement

This work proposes Human\-like Criteria Probing for Hallucination Detection \(HCPD\), a zero\-source framework for detecting factual and logical inconsistencies in LLM outputs\. By operating solely on the observed query–response pair, HCPD is intended to improve reliability and transparency in practical black\-box deployment settings, with particular relevance to safety\-critical domains such as healthcare and education\. Furthermore, HCPD provides inherent explainability by outputting the specific criteria and weights used for each decision, which can aid developers in model debugging and offer end\-users transparent justifications\. We believe this research contributes positively to the development of more accountable and transparent AI systems\.

## Author Contributions

Jiahao Yang, Shuhai Zhang, and Hailong Kang contributed equally to this work\. Feng Liu, Qi Chen, and Mingkui Tan are the corresponding authors\. Jiahao Yang and Shuhai Zhang conceived the main idea and designed the proposed method\. Jiahao Yang and Hailong Kang conducted the experiments and performed the main empirical analysis\. Jiahao Yang, Shuhai Zhang, and Hailong Kang contributed to manuscript writing, result organization, and paper revision\. Feng Liu, Qi Chen, and Mingkui Tan supervised the project, provided guidance on methodology and experiments, and revised the manuscript\.

## References

- J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) GQA: training generalized multi-query transformer models from multi-head checkpoints. In The 2023 Conference on Empirical Methods in Natural Language Processing.
- A. Azaria and T. Mitchell (2023) The internal state of an LLM knows when it’s lying. In The 2023 Conference on Empirical Methods in Natural Language Processing.
- J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. (2023) Qwen technical report. arXiv preprint arXiv:2309.16609.
- M. Benary, X. D. Wang, M. Schmidt, D. Soll, G. Hilfenhaus, M. Nassir, C. Sigler, M. Knödler, U. Keller, D. Beule, et al. (2023) Leveraging large language models for decision support in personalized oncology. JAMA Network Open 6(11), pp. e2343689–e2343689.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- C. Burns, H. Ye, D. Klein, and J. Steinhardt (2022) Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations.
- C. Chen, K. Liu, Z. Chen, Y. Gu, Y. Wu, M. Tao, Z. Fu, and J. Ye (2024a) INSIDE: LLMs’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations.
- Q\. Chen, B\. Zhang, G\. Wang, and Q\. Wu \(2024b\)Weak\-eval\-strong: evaluating and eliciting lateral thinking of llms with situation puzzles\.Advances in Neural Information Processing Systems37,pp\. 79642–79665\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.
- X\. Chen, D\. Song, H\. Gui, C\. Wang, N\. Zhang, Y\. Jiang, F\. Huang, C\. Lyu, D\. Zhang, and H\. Chen \(2024c\)FactCHD: benchmarking fact\-conflicting hallucination detection\.InProceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p3.1)\.
- Y\. Chen, Z\. You, S\. Zhang, H\. Li, Y\. Li, Y\. Wang, and M\. Tan \(2025\)Core context aware transformers for long context language modeling\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- C\. Chiang, Z\. Lu, Z\. Li, and M\. Yin \(2024\)Enhancing ai\-assisted group decision making through llm\-powered devil’s advocate\.InProceedings of the 29th International Conference on Intelligent User Interfaces,pp\. 103–119\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.
- I\. Csiszár \(1975\)I\-divergence geometry of probability distributions and minimization problems\.The annals of probability,pp\. 146–158\.Cited by:[§2](https://arxiv.org/html/2606.12900#S2.p2.2),[§4\.4](https://arxiv.org/html/2606.12900#S4.SS4.p5.3)\.
- X\. Du, C\. Xiao, and S\. Li \(2024\)Haloscope: harnessing unlabeled llm generations for hallucination detection\.Advances in Neural Information Processing Systems\.Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p5.1),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.30.30.30.4),[Table 12](https://arxiv.org/html/2606.12900#A4.T12.5.5.5.2),[Table 7](https://arxiv.org/html/2606.12900#A4.T7.12.12.12.7),[Table 7](https://arxiv.org/html/2606.12900#A4.T7.36.36.36.7),[Table 8](https://arxiv.org/html/2606.12900#A4.T8.12.12.12.7),[Table 8](https://arxiv.org/html/2606.12900#A4.T8.36.36.36.7),[Table 9](https://arxiv.org/html/2606.12900#A4.T9.12.12.12.7),[Table 9](https://arxiv.org/html/2606.12900#A4.T9.36.36.36.7),[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p3.1),[§4\.4](https://arxiv.org/html/2606.12900#S4.SS4.p3.5),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.118.116.116.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.49.47.47.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1),[Table 3](https://arxiv.org/html/2606.12900#S5.T3.14.14.14.8),[Table 3](https://arxiv.org/html/2606.12900#S5.T3.42.42.42.8),[Table 6](https://arxiv.org/html/2606.12900#S5.T6.14.14.14.3)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv e\-prints,pp\. arXiv–2407\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p3.7)\.
- X\. Fang, Z\. Huang, Z\. Tian, M\. Fang, Z\. Pan, Q\. Fang, Z\. Wen, H\. Pan, and D\. Li \(2025\)Zero\-resource hallucination detection for text generation via graph\-based contextual knowledge triples modeling\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 23868–23877\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p2.1)\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630\(8017\),pp\. 625–630\.Cited by:[§4\.1](https://arxiv.org/html/2606.12900#S4.SS1.p2.8),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p4.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3)\.
- Y\. Guo, J\. Liu, M\. Li, D\. Cheng, X\. Tang, D\. Sui, Q\. Liu, X\. Chen, and K\. Zhao \(2025\)Vtg\-llm: integrating timestamp knowledge into video llms for enhanced video temporal grounding\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 3302–3310\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- X\. Hu, D\. Ru, L\. Qiu, Q\. Guo, T\. Zhang, Y\. Xu, Y\. Luo, P\. Liu, Y\. Zhang, and Z\. Zhang \(2024\)Knowledge\-centric hallucination detection\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p3.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin,et al\.\(2025\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Transactions on Information Systems43\(2\),pp\. 1–55\.Cited by:[§4\.1](https://arxiv.org/html/2606.12900#S4.SS1.p1.1)\.
- D\. R\. Hunter \(2004\)MM algorithms for generalized bradley\-terry models\.The annals of statistics32\(1\),pp\. 384–406\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p5.1)\.
- J\. L\. W\. V\. Jensen \(1906\)Sur les fonctions convexes et les inégalités entre les valeurs moyennes\.Acta mathematica30\(1\),pp\. 175–193\.Cited by:[§A\.2](https://arxiv.org/html/2606.12900#A1.SS2.2.p2.1)\.
- J\. Ji, H\. Wang, C\. Wu, Y\. Ma, X\. Sun, and R\. Ji \(2024\)JM3D & jm3d\-llm: elevating 3d representation with joint multi\-modal cues\.IEEE Transactions on Pattern Analysis and Machine Intelligence\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung \(2023\)Survey of hallucination in natural language generation\.ACM computing surveys55\(12\),pp\. 1–38\.Cited by:[§4\.1](https://arxiv.org/html/2606.12900#S4.SS1.p2.8)\.
- M\. Joshi, E\. Choi, D\. Weld, and L\. Zettlemoyer \(2017\)TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\.InProceedings of the Annual Meeting of the Association for Computational Linguistics,pp\. 1601–1611\.Cited by:[§C\.1](https://arxiv.org/html/2606.12900#A3.SS1.p1.3),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p1.1),[§4\.4](https://arxiv.org/html/2606.12900#S4.SS4.p2.3),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p1.5)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson,et al\.\(2022\)Language models \(mostly\) know what they know\.arXiv preprint arXiv:2207\.05221\.Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p4.1),[§1](https://arxiv.org/html/2606.12900#S1.p3.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.12.10.10.6),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.81.79.79.6),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1.2),[§5\.3](https://arxiv.org/html/2606.12900#S5.SS3.p1.2.2.2),[Table 4](https://arxiv.org/html/2606.12900#S5.T4.1.1.1.2)\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\)Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\.InThe Eleventh International Conference on Learning Representations,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p2.1),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p1.1),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p3.6),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.21.21.21.4),[§1](https://arxiv.org/html/2606.12900#S1.p3.1),[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.107.105.105.6),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.38.36.36.6),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov \(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics7,pp\. 452–466\.Cited by:[§C\.1](https://arxiv.org/html/2606.12900#A3.SS1.p3.3),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p1.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p1.5)\.
- L\. Li, C\. Xu, W\. Wu, Y\. Zhao, X\. Zhao, and C\. Tao \(2020\)Zero\-resource knowledge\-grounded dialogue generation\.Advances in Neural Information Processing Systems33,pp\. 8475–8485\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)Teaching models to express their uncertainty in words\.arXiv preprint arXiv:2205\.14334\.Cited by:[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1)\.
- Z\. Lin, S\. Trivedi, and J\. Sun \(2024\)Generating with confidence: uncertainty quantification for black\-box large language models\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p3.1),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.24.24.24.4),[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.112.110.110.6),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.43.41.41.6),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1),[§5\.3](https://arxiv.org/html/2606.12900#S5.SS3.p4.1.1.1),[Table 6](https://arxiv.org/html/2606.12900#S5.T6.12.12.12.3)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p5.1.1),[§D\.5](https://arxiv.org/html/2606.12900#A4.SS5.p1.1.1),[§5\.3](https://arxiv.org/html/2606.12900#S5.SS3.p4.1.1.1)\.
- Z\. Liu, P\. Wang, R\. Xu, S\. Ma, C\. Ruan, P\. Li, Y\. Liu, and Y\. Wu \(2025\)Inference\-time scaling for generalist reward modeling\.arXiv preprint arXiv:2504\.02495\.Cited by:[§C\.2\.2](https://arxiv.org/html/2606.12900#A3.SS2.SSS2.p1.4.1)\.
- S\. Ma, Q\. Chen, X\. Wang, C\. Zheng, Z\. Peng, M\. Yin, and X\. Ma \(2025\)Towards human\-ai deliberation: design and evaluation of llm\-empowered deliberative ai for ai\-assisted decision\-making\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,pp\. 1–23\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.
- A\. Malinin and M\. Gales \(2021\)Uncertainty estimation in autoregressive structured prediction\.InInternational Conference on Learning Representations,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p2.1),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.6.6.6.4),[Table 12](https://arxiv.org/html/2606.12900#A4.T12.1.1.1.2),[§1](https://arxiv.org/html/2606.12900#S1.p3.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.7.5.5.7),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.76.74.74.7),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1),[Table 6](https://arxiv.org/html/2606.12900#S5.T6.2.2.2.3)\.
- P\. Manakul, A\. Liusie, and M\. J\. F\. Gales \(2023\)SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\.Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p3.1),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.12.12.12.4),[Table 12](https://arxiv.org/html/2606.12900#A4.T12.3.3.3.2),[§1](https://arxiv.org/html/2606.12900#S1.p3.1),[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1),[§4\.5](https://arxiv.org/html/2606.12900#S4.SS5.p3.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.22.20.20.6),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.91.89.89.6),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1),[Table 6](https://arxiv.org/html/2606.12900#S5.T6.6.6.6.3)\.
- A\. T\. Neumann, Y\. Yin, S\. Sowe, S\. Decker, and M\. Jarke \(2024\)An llm\-driven chatbot in higher education for databases and information systems\.IEEE Transactions on Education\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p4.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p5.1)\.
- S\. Park, X\. Du, M\. Yeh, H\. Wang, and Y\. Li \(2025\)Steer llm latents for hallucination detection\.InForty\-second International Conference on Machine Learning,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p5.1),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.33.33.33.4),[Table 12](https://arxiv.org/html/2606.12900#A4.T12.6.6.6.2),[Table 7](https://arxiv.org/html/2606.12900#A4.T7.18.18.18.7),[Table 7](https://arxiv.org/html/2606.12900#A4.T7.42.42.42.7),[Table 8](https://arxiv.org/html/2606.12900#A4.T8.18.18.18.7),[Table 8](https://arxiv.org/html/2606.12900#A4.T8.42.42.42.7),[Table 9](https://arxiv.org/html/2606.12900#A4.T9.18.18.18.7),[Table 9](https://arxiv.org/html/2606.12900#A4.T9.42.42.42.7),[§1](https://arxiv.org/html/2606.12900#S1.p3.1),[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p3.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.130.128.128.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.61.59.59.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1),[Table 3](https://arxiv.org/html/2606.12900#S5.T3.21.21.21.8),[Table 3](https://arxiv.org/html/2606.12900#S5.T3.49.49.49.8),[Table 6](https://arxiv.org/html/2606.12900#S5.T6.16.16.16.3)\.
- S\. Qian, W\. Chen, M\. Bai, X\. Zhou, Z\. Tu, and L\. E\. Li \(2024\)AffordanceLLM: grounding affordance from vision language models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\) Workshops,pp\. 7587–7597\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3),[Appendix B](https://arxiv.org/html/2606.12900#A2.p5.1)\.
- S\. Reddy, D\. Chen, and C\. D\. Manning \(2019\)CoQA: a conversational question answering challenge\.Transactions of the Association for Computational Linguistics7,pp\. 249–266\.Cited by:[§C\.1](https://arxiv.org/html/2606.12900#A3.SS1.p4.2),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p2.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p1.5)\.
- J\. Ren, J\. Luo, Y\. Zhao, K\. Krishna, M\. Saleh, B\. Lakshminarayanan, and P\. J\. Liu \(2023\)Out\-of\-distribution detection and selective generation for conditional language models\.InThe Eleventh International Conference on Learning Representations,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.12900#A3.SS2.SSS1.p2.1),[Table 11](https://arxiv.org/html/2606.12900#A4.T11.15.15.15.4),[Table 12](https://arxiv.org/html/2606.12900#A4.T12.4.4.4.2),[§3](https://arxiv.org/html/2606.12900#S3.p1.1),[§3](https://arxiv.org/html/2606.12900#S3.p2.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.27.25.25.6),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.96.94.94.6),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1),[Table 6](https://arxiv.org/html/2606.12900#S5.T6.8.8.8.3)\.
- S\. M\. Ross \(2020\)Introduction to probability and statistics for engineers and scientists\.Academic press\.Cited by:[§A\.3](https://arxiv.org/html/2606.12900#A1.SS3.12.p11.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p5.1),[§2](https://arxiv.org/html/2606.12900#S2.p1.6)\.
- T\. Sellam, D\. Das, and A\. Parikh \(2020\)BLEURT: learning robust metrics for text generation\.InProceedings of the Annual Meeting of the Association for Computational Linguistics,pp\. 7881–7892\.Cited by:[§A\.1](https://arxiv.org/html/2606.12900#A1.SS1.p1.11),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p4.6),[§4\.4](https://arxiv.org/html/2606.12900#S4.SS4.p3.5),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p4.1)\.
- S\. Semnani, V\. Yao, H\. Zhang, and M\. Lam \(2023\)WikiChat: stopping the hallucination of large language model chatbots by few\-shot grounding on wikipedia\.InFindings of the association for computational linguistics: EMNLP 2023,Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p3.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p5.1),[§C\.2\.2](https://arxiv.org/html/2606.12900#A3.SS2.SSS2.p4.9),[§4\.4](https://arxiv.org/html/2606.12900#S4.SS4.p1.1),[§4\.4](https://arxiv.org/html/2606.12900#S4.SS4.p5.3)\.
- N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.arXiv preprint arXiv:1701\.06538\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p3.1)\.
- N\. Shazeer \(2020\)Glu variants improve transformer\.arXiv preprint arXiv:2002\.05202\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3)\.
- Q\. Team \(2024\)Qwen2 technical report\.arXiv preprint arXiv:2407\.10671\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p3.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023a\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023b\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p3.7)\.
- A\. Vazhentsev, E\. Fadeeva, R\. Xing, G\. Kuzmin, I\. Lazichny, A\. Panchenko, P\. Nakov, T\. Baldwin, M\. Panov, and A\. Shelmanov \(2025\)Unconditional truthfulness: learning unconditional uncertainty of large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 35661–35682\.Cited by:[Table 2](https://arxiv.org/html/2606.12900#S4.T2.124.122.122.1),[Table 2](https://arxiv.org/html/2606.12900#S4.T2.55.53.53.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p2.1)\.
- J\. Vrdoljak, Z\. Boban, M\. Vilović, M\. Kumrić, and J\. Božić \(2025\)A review of large language models in medical education, clinical decision support, and healthcare administration\.InHealthcare,Vol\.13,pp\. 603\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.
- B\. Wang, X\. Yue, and H\. Sun \(2023\)Can chatgpt defend its belief in truth? evaluating llm reasoning via debate\.InThe 2023 Conference on Empirical Methods in Natural Language Processing,Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- Q\. Wang, Z\. Wang, Y\. Su, H\. Tong, and Y\. Song \(2024\)Rethinking the bounds of llm reasoning: are multi\-agent discussions the key?\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6106–6131\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- J\. Welbl, N\. F\. Liu, and M\. Gardner \(2017\)Crowdsourcing multiple choice science questions\.InProceedings of the Workshop on Noisy User\-generated Text,pp\. 94–106\.Cited by:[§C\.1](https://arxiv.org/html/2606.12900#A3.SS1.p2.5),[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p1.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p1.5)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2024a\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p3.1),[§C\.2\.2](https://arxiv.org/html/2606.12900#A3.SS2.SSS2.p1.4),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p3.7)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025a\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.12900#A2.p3.1),[§5\.1](https://arxiv.org/html/2606.12900#S5.SS1.p3.7)\.
- B\. Yang, M\. A\. Al Mamun, J\. M\. Zhang, and G\. Uddin \(2025b\)Hallucination detection in large language models with metamorphic relations\.Proceedings of the ACM on Software Engineering2\(FSE\),pp\. 425–445\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p2.1)\.
- Y\. Yang, T\. Zhou, K\. Li, D\. Tao, L\. Li, L\. Shen, X\. He, J\. Jiang, and Y\. Shi \(2024b\)Embodied multi\-modal agent trained by an llm from a parallel textworld\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 26275–26285\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- Y\. Yu, Z\. Yao, H\. Li, Z\. Deng, Y\. Jiang, Y\. Cao, Z\. Chen, J\. Suchow, Z\. Cui, R\. Liu,et al\.\(2024\)Fincon: a synthesized llm multi\-agent system with conceptual verbal reinforcement for enhanced financial decision making\.Advances in Neural Information Processing Systems37,pp\. 137010–137045\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.Advances in neural information processing systems32\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p2.3)\.
- S\. Zhang, Z\. You, Y\. Chen, Z\. Wen, Q\. Wang, Z\. Qiu, Y\. Li, and M\. Tan \(2025\)Curse of high dimensionality issue in transformer for long context modeling\.InForty\-second International Conference on Machine Learning,pp\. 76600–76624\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- T\. Zhang, L\. Qiu, Q\. Guo, C\. Deng, Y\. Zhang, Z\. Zhang, C\. Zhou, X\. Wang, and L\. Fu \(2023\)Enhancing uncertainty\-based hallucination detection with stronger focus\.arXiv preprint arXiv:2311\.13230\.Cited by:[§3](https://arxiv.org/html/2606.12900#S3.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2019\)BERTScore: evaluating text generation with bert\.InInternational Conference on Learning Representations,Cited by:[§3](https://arxiv.org/html/2606.12900#S3.p2.1)\.
- C\. Zhao, Y\. Song, J\. Chen, K\. Rong, H\. Feng, G\. Zhang, S\. Ji, J\. Wang, E\. Ding, and Y\. Sun \(2024\)Octopus: a multi\-modal llm with parallel recognition and sequential understanding\.Advances in Neural Information Processing Systems37,pp\. 90009–90029\.Cited by:[Appendix B](https://arxiv.org/html/2606.12900#A2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InProceedings of the 37th International Conference on Neural Information Processing Systems,Cited by:[§C\.2\.3](https://arxiv.org/html/2606.12900#A3.SS2.SSS3.p5.1.1)\.
- Y\. Zhu, H\. Yuan, S\. Wang, J\. Liu, W\. Liu, C\. Deng, H\. Chen, Z\. Liu, Z\. Dou, and J\. Wen \(2025\)Large language models for information retrieval: a survey\.ACM Transactions on Information Systems44\(1\),pp\. 1–54\.Cited by:[§1](https://arxiv.org/html/2606.12900#S1.p1.1)\.

Appendix

## Appendix A Theoretical Analysis

### A\.1 Setup and Notations

Let $x=(q,a)$ be an input for hallucination detection under the zero\-source constraint, where $a$ can be generated by any source LLM\. The scoring agent $f_{\theta}$ produces a structured and interpretable evaluation whose predicted score is an integer random variable $s_{p}(x)\in\{1,\ldots,10\}$ in the well\-formed regime\. During training, each query $q$ has a ground\-truth reference answer $\hat{a}$ in the dataset, and an auxiliary LLM generates candidate answers $\{a^{(n)}\}_{n=1}^{N}$\. Weak supervision labels $s^{(n)}_{l}\in\{1,\ldots,10\}$ are computed from a semantic consistency metric $\operatorname{sim}(\hat{a},a^{(n)})$ \(based on BLEURT \(Sellam et al\., [2020](https://arxiv.org/html/2606.12900#bib.bib20)\)\)\. The agent is optimized with GRPO using the score\-alignment reward given by Eqn\. \([5](https://arxiv.org/html/2606.12900#S4.E5)\)\. During inference, we invoke the agent $K$ times and aggregate the resulting scores into $\bar{s}$ according to Eqn\. \([6](https://arxiv.org/html/2606.12900#S4.E6)\)\.

### A\.2 Proof of Theorem [1](https://arxiv.org/html/2606.12900#Thmtheorem1)

Theorem [1](https://arxiv.org/html/2606.12900#Thmtheorem1) \(Expectation alignment under training\) *Let $x$ denote an input and $s_{l}(x)$ its corresponding weak label\. Consider the $\ell_{1}$\-risk of predicting score $s$:*

$$R_{x}(s)\triangleq|s-s_{l}(x)|,$$

*whose minimizer is $s=s_{l}(x)$\. Let $Y\sim f_{\theta}(\cdot\mid x)$ be a stochastic well\-formed generation of the agent, and let $S_{\theta}(x)=s_{p}(x,Y)$ denote the parsed score\. Define the conditional mean score:*

$$\mu_{\theta}(x)\triangleq\mathbb{E}\big[S_{\theta}(x)\mid x\big].$$

*Then we have:*

$$\mathbb{E}_{x}\big[|\mu_{\theta}(x)-s_{l}(x)|\big]\leq\mathcal{J}'(\theta),$$

*where the expectation is taken over the training distribution of $x$, and $\mathcal{J}'(\theta)$ is affine\-equivalent to the GRPO objective $\mathcal{J}(\theta)$ defined in Eqn\. \([1](https://arxiv.org/html/2606.12900#S2.E1)\)\.*

###### Proof\.

Given a fixed input $x$, since $s_{l}(x)$ is deterministic, we have

$$|\mu_{\theta}(x)-s_{l}(x)|=\big|\mathbb{E}\big[S_{\theta}(x)\mid x\big]-s_{l}(x)\big|=\big|\mathbb{E}\big[S_{\theta}(x)-s_{l}(x)\mid x\big]\big|.$$

By Jensen’s inequality \(Jensen, [1906](https://arxiv.org/html/2606.12900#bib.bib50)\),

$$\big|\mathbb{E}\big[S_{\theta}(x)-s_{l}(x)\mid x\big]\big|\leq\mathbb{E}\big[|S_{\theta}(x)-s_{l}(x)|\mid x\big].$$

Taking expectation over the training distribution of $x$ yields

$$\mathbb{E}_{x}\big[|\mu_{\theta}(x)-s_{l}(x)|\big]\leq\mathbb{E}_{x}\mathbb{E}\big[|S_{\theta}(x)-s_{l}(x)|\mid x\big]=\mathbb{E}_{x,Y}\big[|s_{p}(x,Y)-s_{l}(x)|\big].\tag{9}$$
By the score\-alignment reward in Eqn\. \([5](https://arxiv.org/html/2606.12900#S4.E5)\), minimizing the KL\-regularized GRPO objective in Eqn\. \([1](https://arxiv.org/html/2606.12900#S2.E1)\) is equivalent, up to affine constants, to minimizing the nonnegative objective

$$\begin{aligned}\mathcal{J}'(\theta)&=\mathbb{E}_{x}\mathbb{E}_{Y\sim f_{\theta}(\cdot\mid x)}\big[|s_{p}(x,Y)-s_{l}(x)|\big]+\lambda\cdot\mathbb{E}_{x}D_{KL}\big(f_{\theta}(\cdot\mid x)\,\|\,f_{0}(\cdot\mid x)\big)\\&=\mathbb{E}_{x,Y}\big[|s_{p}(x,Y)-s_{l}(x)|\big]+\lambda\cdot\mathbb{E}_{x}D_{KL}\big(f_{\theta}(\cdot\mid x)\,\|\,f_{0}(\cdot\mid x)\big),\end{aligned}\tag{10}$$

where $\lambda\propto\beta$\. By the non\-negativity of the KL regularizer, we have

$$\mathbb{E}_{x,Y}\big[|s_{p}(x,Y)-s_{l}(x)|\big]\leq\mathcal{J}'(\theta).$$

Combining Eqn\. \([9](https://arxiv.org/html/2606.12900#A1.E9)\) with this inequality, which follows from Eqn\. \([10](https://arxiv.org/html/2606.12900#A1.E10)\), completes the proof\. ∎
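The Jensen step in the proof can be checked numerically. The sketch below is only an illustration, not part of the released code: it draws arbitrary score distributions over the integers 1 to 10 and confirms that the gap between the mean score and the weak label never exceeds the expected absolute gap. The helper names `mean_abs_gap` and `random_distribution` are hypothetical.

```python
import random


def mean_abs_gap(probs, weak_label):
    """Return (|E[S] - s_l|, E[|S - s_l|]) for scores 1..10 with probabilities `probs`."""
    scores = range(1, 11)
    mean_score = sum(s * p for s, p in zip(scores, probs))
    gap_of_mean = abs(mean_score - weak_label)
    mean_of_gap = sum(abs(s - weak_label) * p for s, p in zip(scores, probs))
    return gap_of_mean, mean_of_gap


def random_distribution(rng):
    """Draw an arbitrary probability vector over the ten possible scores."""
    weights = [rng.random() for _ in range(10)]
    total = sum(weights)
    return [w / total for w in weights]


# Jensen's inequality: the left-hand side never exceeds the right-hand side.
rng = random.Random(0)
for _ in range(1000):
    probs = random_distribution(rng)
    label = rng.randint(1, 10)
    lhs, rhs = mean_abs_gap(probs, label)
    assert lhs <= rhs + 1e-12
```

The two quantities coincide when the agent always emits the same score, and the gap between them grows as the score distribution spreads around the weak label.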

### A\.3 Proof of Corollary [1](https://arxiv.org/html/2606.12900#Thmcorollary1)

Proposition [1](https://arxiv.org/html/2606.12900#Thmproposition1) \(Multi\-sampling concentration\) *Fix an inference input $x$\. Suppose the $K$ well\-formed generations satisfy $Y_{1},\ldots,Y_{K}\overset{\text{i.i.d.}}{\sim}f_{\theta}(\cdot\mid x)$\. For each $k\in\{1,\ldots,K\}$, define the parsed score*

$$S_{\theta}^{(k)}(x)\triangleq s_{p}(x,Y_{k})\in\{1,\ldots,10\},$$

*and let $\bar{s}(x)\triangleq\frac{1}{K}\sum_{k=1}^{K}S_{\theta}^{(k)}(x)$ denote the aggregated score in Eqn\. \([6](https://arxiv.org/html/2606.12900#S4.E6)\)\. Then for any bias threshold $u>0$,*

$$\mathbb{P}\big(|\bar{s}(x)-\mathbb{E}[S_{\theta}(x)\mid x]|\geq u\mid x\big)\leq 2\exp\Big(-\frac{2Ku^{2}}{(10-1)^{2}}\Big).$$
###### Proof\.

Conditional on the input $x$, the random variables $\{S_{\theta}^{(k)}(x)\}_{k=1}^{K}$ are *i.i.d.* and each is bounded in $[1,10]$. Let $\mu_{\theta}(x)\triangleq\mathbb{E}\big[S_{\theta}(x)\mid x\big]$. By Hoeffding's inequality for bounded *i.i.d.* variables, we have

$$\mathbb{P}\big(|\bar{s}(x)-\mu_{\theta}(x)|\geq u\mid x\big)\leq 2\exp\Big(-\frac{2Ku^{2}}{(10-1)^{2}}\Big),$$

which is exactly the claim. ∎
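To make the concentration bound concrete, the following is a minimal Python sketch that evaluates the right-hand side of Proposition 1 and inverts it to find how many samples $K$ suffice for a given deviation $u$ and failure probability. The function names `hoeffding_bound` and `samples_needed` are illustrative and do not come from the paper or its code release.

```python
import math

SCORE_MIN, SCORE_MAX = 1, 10  # parsed scores lie in {1, ..., 10}


def hoeffding_bound(num_samples: int, u: float) -> float:
    """Upper bound on P(|s_bar - mu| >= u) for K i.i.d. scores in [1, 10]."""
    if num_samples < 1 or u <= 0:
        raise ValueError("need num_samples >= 1 and u > 0")
    width = SCORE_MAX - SCORE_MIN
    # Probabilities never exceed 1, so clip the (possibly vacuous) bound.
    return min(1.0, 2.0 * math.exp(-2.0 * num_samples * u**2 / width**2))


def samples_needed(u: float, failure_prob: float) -> int:
    """Smallest K such that 2 * exp(-2 K u^2 / 81) <= failure_prob."""
    if u <= 0 or not 0.0 < failure_prob < 1.0:
        raise ValueError("need u > 0 and 0 < failure_prob < 1")
    width = SCORE_MAX - SCORE_MIN
    return math.ceil(width**2 * math.log(2.0 / failure_prob) / (2.0 * u**2))


if __name__ == "__main__":
    k = samples_needed(u=1.0, failure_prob=0.05)
    print(k, hoeffding_bound(k, 1.0))
```

Because the bound is distribution-free, it is conservative: it uses only the score range $[1,10]$, so an agent whose scores have low variance will concentrate much faster in practice than the sample count returned here suggests.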

Corollary [1](https://arxiv.org/html/2606.12900#Thmcorollary1) (Ranking error decomposition). *Let $x$ denote an input and $s_{l}(x)$ its corresponding weak label. Assume the proxy bias is uniformly bounded:*

$$|\epsilon(x)|\leq b_{\max},\quad\forall x.$$

*Let $Y_{1},\ldots,Y_{K}\overset{\text{i.i.d.}}{\sim}f_{\theta}(\cdot\mid x)$ be well-formed generations, and define $S_{\theta}^{(k)}(x)$ and $\bar{s}(x)$ as in Proposition [1](https://arxiv.org/html/2606.12900#Thmproposition1). Let $x^{+}$ and $x^{-}$ denote independent draws from the distributions of true (non-hallucinated) and hallucinated inputs, respectively. Define the ranking error probability:*

$$\mathcal{E}_{\mathrm{rank}}\triangleq\mathbb{P}\big(\bar{s}(x^{+})\leq\bar{s}(x^{-})\big).$$

*Then for any $\Delta>b_{\max}$,*

$$\mathcal{E}_{\mathrm{rank}}\leq\underbrace{\mathbb{P}\big(g(s^{\star}(x^{+}))-g(s^{\star}(x^{-}))\leq 2\Delta\big)}_{\text{intrinsic separability}}+\underbrace{\frac{4\,\mathcal{J}'(\theta)}{\Delta-b_{\max}}}_{\text{training alignment}}+\underbrace{4\exp\Big(-\frac{2K}{(10-1)^{2}}\Big(\frac{\Delta-b_{\max}}{2}\Big)^{2}\Big)}_{\text{multi-sampling concentration}},$$

*where $\mathcal{J}'(\theta)$ is affine-equivalent to the GRPO objective $\mathcal{J}(\theta)$ defined in Eqn. ([1](https://arxiv.org/html/2606.12900#S2.E1)).*

###### Proof\.

Let $\Delta>b_{\max}$ and define $\delta\triangleq\frac{\Delta-b_{\max}}{2}>0$. Consider the event

$$A\triangleq\{g(s^{\star}(x^{+}))-g(s^{\star}(x^{-}))\leq 2\Delta\}.$$

On the complement $A^{c}$, we have $g(s^{\star}(x^{+}))-g(s^{\star}(x^{-}))>2\Delta$.

If, in addition, the aggregated scores are close to their truthfulness targets, i.e., $|\bar{s}(x)-g(s^{\star}(x))|\leq\Delta$ for both $x=x^{+}$ and $x=x^{-}$, then

$$\bar{s}(x^{+})\geq g(s^{\star}(x^{+}))-\Delta>g(s^{\star}(x^{-}))+\Delta\geq\bar{s}(x^{-}),$$

so the order is preserved. Therefore, the ranking error event $\{\bar{s}(x^{+})\leq\bar{s}(x^{-})\}$ is contained in

$$A\cup\{|\bar{s}(x^{+})-g(s^{\star}(x^{+}))|>\Delta\}\cup\{|\bar{s}(x^{-})-g(s^{\star}(x^{-}))|>\Delta\}.$$
Taking probabilities and using the union bound gives

$$\mathcal{E}_{\mathrm{rank}}\leq\mathbb{P}(A)+\mathbb{P}(|\bar{s}(x^{+})-g(s^{\star}(x^{+}))|>\Delta)+\mathbb{P}(|\bar{s}(x^{-})-g(s^{\star}(x^{-}))|>\Delta).\tag{11}$$

For a fixed input $x$, the proxy bias satisfies $|\epsilon(x)|\leq b_{\max}$, so

$$|\bar{s}(x)-g(s^{\star}(x))|\leq|\bar{s}(x)-s_{l}(x)|+|s_{l}(x)-g(s^{\star}(x))|\leq|\bar{s}(x)-s_{l}(x)|+b_{\max}.$$

Thus $|\bar{s}(x)-g(s^{\star}(x))|>\Delta$ implies $|\bar{s}(x)-s_{l}(x)|>\Delta-b_{\max}=2\delta$.

Meanwhile, we have

$$|\bar{s}(x)-s_{l}(x)|\leq|\bar{s}(x)-\mu_{\theta}(x)|+|\mu_{\theta}(x)-s_{l}(x)|,$$

hence $|\bar{s}(x)-s_{l}(x)|>2\delta$ implies

$$|\bar{s}(x)-\mu_{\theta}(x)|>\delta\text{ or }|\mu_{\theta}(x)-s_{l}(x)|>\delta,$$

so again by the union bound, we have

$$\mathbb{P}(|\bar{s}(x)-g(s^{\star}(x))|>\Delta)\leq\mathbb{P}(|\bar{s}(x)-\mu_{\theta}(x)|>\delta)+\mathbb{P}(|\mu_{\theta}(x)-s_{l}(x)|>\delta).\tag{12}$$
Conditional on $x$, Proposition [1](https://arxiv.org/html/2606.12900#Thmproposition1) yields

$$\mathbb{P}\big(|\bar{s}(x)-\mu_{\theta}(x)|\geq\delta\mid x\big)\leq 2\exp\Big(-\frac{2K\delta^{2}}{(10-1)^{2}}\Big).$$

Taking the expectation over $x$ preserves the same upper bound:

$$\mathbb{P}\big(|\bar{s}(x)-\mu_{\theta}(x)|\geq\delta\big)\leq 2\exp\Big(-\frac{2K\delta^{2}}{(10-1)^{2}}\Big).\tag{13}$$
Moreover, by Markov's inequality (Ross, [2020](https://arxiv.org/html/2606.12900#bib.bib51)),

$$\mathbb{P}(|\mu_{\theta}(x)-s_{l}(x)|>\delta)\leq\frac{\mathbb{E}_{x}\big[|\mu_{\theta}(x)-s_{l}(x)|\big]}{\delta}.$$

Using Theorem [1](https://arxiv.org/html/2606.12900#Thmtheorem1), we have $\mathbb{E}_{x}\big[|\mu_{\theta}(x)-s_{l}(x)|\big]\leq\mathcal{J}'(\theta)$, hence

$$\mathbb{P}(|\mu_{\theta}(x)-s_{l}(x)|>\delta)\leq\frac{\mathcal{J}'(\theta)}{\delta}.\tag{14}$$
Combining Eqns. ([12](https://arxiv.org/html/2606.12900#A1.E12))-([14](https://arxiv.org/html/2606.12900#A1.E14)), we have for any $\Delta>b_{\max}$ and $\delta=\frac{\Delta-b_{\max}}{2}$,

$$\mathbb{P}(|\bar{s}(x)-g(s^{\star}(x))|>\Delta)\leq\frac{\mathcal{J}'(\theta)}{\delta}+2\exp\Big(-\frac{2K\delta^{2}}{(10-1)^{2}}\Big).\tag{15}$$

Applying Eqn. ([15](https://arxiv.org/html/2606.12900#A1.E15)) to $x^{+}$ and $x^{-}$ in Eqn. ([11](https://arxiv.org/html/2606.12900#A1.E11)), we have

$$\begin{aligned}\mathcal{E}_{\mathrm{rank}}&\leq\mathbb{P}(A)+\mathbb{P}(|\bar{s}(x^{+})-g(s^{\star}(x^{+}))|>\Delta)+\mathbb{P}(|\bar{s}(x^{-})-g(s^{\star}(x^{-}))|>\Delta)\\&\leq\mathbb{P}(A)+2\Big(\frac{\mathcal{J}'(\theta)}{\delta}+2\exp\big(-\frac{2K\delta^{2}}{(10-1)^{2}}\big)\Big)\\&=\mathbb{P}\big(g(s^{\star}(x^{+}))-g(s^{\star}(x^{-}))\leq 2\Delta\big)+\frac{4\,\mathcal{J}'(\theta)}{\Delta-b_{\max}}+4\exp\Big(-\frac{2K}{(10-1)^{2}}\Big(\frac{\Delta-b_{\max}}{2}\Big)^{2}\Big),\end{aligned}\tag{16}$$

where the last step substitutes $\delta=\frac{\Delta-b_{\max}}{2}$. This is exactly the desired decomposition. ∎
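As a numerical illustration of the decomposition, the sketch below evaluates the training-alignment and multi-sampling terms of Corollary 1 for given values of $\mathcal{J}'(\theta)$, $\Delta$, $b_{\max}$, and $K$. The intrinsic-separability probability depends on the unknown data distribution, so it is passed in as an assumed input; all function and parameter names are illustrative.

```python
import math

SCORE_RANGE = 10 - 1  # width of the score interval [1, 10]


def alignment_term(j_prime: float, delta: float, b_max: float) -> float:
    """Training-alignment term 4 J'(theta) / (Delta - b_max)."""
    if delta <= b_max:
        raise ValueError("the bound requires Delta > b_max")
    return 4.0 * j_prime / (delta - b_max)


def concentration_term(k: int, delta: float, b_max: float) -> float:
    """Multi-sampling term 4 exp(-2K/81 * ((Delta - b_max)/2)^2)."""
    if delta <= b_max:
        raise ValueError("the bound requires Delta > b_max")
    half_gap = (delta - b_max) / 2.0
    return 4.0 * math.exp(-2.0 * k * half_gap**2 / SCORE_RANGE**2)


def ranking_error_bound(separability_prob: float, j_prime: float,
                        delta: float, b_max: float, k: int) -> float:
    """Sum of the three terms, clipped to 1 since it bounds a probability."""
    total = (separability_prob
             + alignment_term(j_prime, delta, b_max)
             + concentration_term(k, delta, b_max))
    return min(1.0, total)


if __name__ == "__main__":
    # Assumed values, chosen only to show how the terms trade off.
    print(ranking_error_bound(0.05, j_prime=0.1, delta=3.0, b_max=1.0, k=200))
```

The two computable terms pull in opposite directions from the first: a larger $\Delta$ shrinks the alignment and concentration terms but enlarges the event $A$, so the bound is only informative when truthful and hallucinated inputs are well separated on the 1-10 scale.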

## Appendix B More Related Work

Large Language Models (LLMs). LLMs have become central to modern natural language processing, demonstrating strong capabilities in reasoning (Brown et al., [2020](https://arxiv.org/html/2606.12900#bib.bib33); Wang et al., [2023](https://arxiv.org/html/2606.12900#bib.bib34), [2024](https://arxiv.org/html/2606.12900#bib.bib35); Chen et al., [2025](https://arxiv.org/html/2606.12900#bib.bib68); Zhang et al., [2025](https://arxiv.org/html/2606.12900#bib.bib69)), knowledge grounding (Li et al., [2020](https://arxiv.org/html/2606.12900#bib.bib36); Qian et al., [2024](https://arxiv.org/html/2606.12900#bib.bib37); Guo et al., [2025](https://arxiv.org/html/2606.12900#bib.bib38)), and multi-modal understanding (Zhao et al., [2024](https://arxiv.org/html/2606.12900#bib.bib39); Yang et al., [2024b](https://arxiv.org/html/2606.12900#bib.bib40); Ji et al., [2024](https://arxiv.org/html/2606.12900#bib.bib41)). Among open-source ecosystems, two influential model families are LLaMA (Touvron et al., [2023a](https://arxiv.org/html/2606.12900#bib.bib42), [b](https://arxiv.org/html/2606.12900#bib.bib30); Dubey et al., [2024](https://arxiv.org/html/2606.12900#bib.bib29); Grattafiori et al., [2024](https://arxiv.org/html/2606.12900#bib.bib43)) and Qwen (Bai et al., [2023](https://arxiv.org/html/2606.12900#bib.bib44); Team, [2024](https://arxiv.org/html/2606.12900#bib.bib45); Yang et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib32), [2025a](https://arxiv.org/html/2606.12900#bib.bib31)).

The LLaMA series, developed by Meta, has been instrumental in advancing open-weight LLMs. LLaMA (Touvron et al., [2023a](https://arxiv.org/html/2606.12900#bib.bib42)) demonstrated that competitive models can be trained on publicly available corpora, incorporating architectural components such as RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2606.12900#bib.bib46)), SwiGLU activations (Shazeer, [2020](https://arxiv.org/html/2606.12900#bib.bib47)), and Rotary Positional Embeddings (RoPE). LLaMA 2 (Touvron et al., [2023b](https://arxiv.org/html/2606.12900#bib.bib30)) scaled pre-training data, introduced RLHF-tuned conversational variants, and adopted a more permissive license. LLaMA 3 (Dubey et al., [2024](https://arxiv.org/html/2606.12900#bib.bib29)) further scaled model capacity (up to 70B parameters) and multilingual data, together with an improved tokenizer and stronger instruction-following performance. The most recent iteration, LLaMA 3.1 (Grattafiori et al., [2024](https://arxiv.org/html/2606.12900#bib.bib43)), extends the context window to 128k tokens via enhanced RoPE scaling, enlarges the pre-training corpus to over 15 trillion tokens, and integrates advanced post-training techniques such as large-scale Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.12900#bib.bib26)) and emergent tool-use capabilities, achieving state-of-the-art results among open models and approaching parity with leading proprietary systems.

The Qwen series, first released in 2023 (Bai et al., [2023](https://arxiv.org/html/2606.12900#bib.bib44)), emphasizes scalability and practical deployment, offering models across a wide range of parameter sizes. Qwen-1.5 ([https://qwenlm.github.io/blog/qwen1.5/](https://qwenlm.github.io/blog/qwen1.5/)) and Qwen-2 (Team, [2024](https://arxiv.org/html/2606.12900#bib.bib45)) extend context length to 128k tokens across the family and incorporate Grouped Query Attention (GQA) (Ainslie et al., [2023](https://arxiv.org/html/2606.12900#bib.bib48)) to improve inference efficiency, leading to substantial performance gains. Qwen2.5 (Yang et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib32)) further introduces domain-specialized variants for coding and mathematics, forming a flexible generalist-specialist design. The most recent Qwen3 (Yang et al., [2025a](https://arxiv.org/html/2606.12900#bib.bib31)) proposes a Hybrid-Reasoning paradigm that dynamically switches between a deliberate *Thinking* mode for complex reasoning and a lightweight *Non-Thinking* mode for rapid responses, enabling explicit trade-offs between computational cost and accuracy. Its large-scale Mixture-of-Experts (MoE) (Shazeer et al., [2017](https://arxiv.org/html/2606.12900#bib.bib49)) variants achieve competitive state-of-the-art performance, while smaller dense models retain strong parameter efficiency.

Reinforcement Learning. Reinforcement learning (RL) has demonstrated effectiveness in improving the reasoning ability of LLMs by aligning their behaviors with human preferences or task-specific objectives (Ouyang et al., [2022](https://arxiv.org/html/2606.12900#bib.bib24)).

Similar to conventional machine learning, Supervised Fine-Tuning (SFT) trains LLMs by maximizing the log-likelihood on high-quality, task-specific demonstration data. While SFT effectively transfers knowledge, it is inherently constrained by the quality and coverage of the demonstrations and cannot directly optimize non-differentiable or long-horizon objectives. As the subsequent step, Proximal Policy Optimization (PPO) (Schulman et al., [2017](https://arxiv.org/html/2606.12900#bib.bib25)) is used to maximize the reward from a reward model trained on human preference comparisons, while constraining policy updates to maintain stability. This PPO-based Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., [2022](https://arxiv.org/html/2606.12900#bib.bib24)) underpins models like InstructGPT (Ouyang et al., [2022](https://arxiv.org/html/2606.12900#bib.bib24)) and ChatGPT ([https://openai.com/blog/chatgpt](https://openai.com/blog/chatgpt)), substantially improving instruction-following and helpfulness. However, it is computationally intensive and operationally complex, since it requires jointly training and balancing a policy model, a reward model, and a value function. Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.12900#bib.bib26)) emerged as a simplified and more stable alternative, bypassing explicit reward modeling by deriving a closed-form objective that directly optimizes the policy using pairwise preference data. By reformulating the RL objective as a supervised loss, DPO significantly reduces training complexity compared to PPO. Nonetheless, its reliance on the Bradley–Terry (Hunter, [2004](https://arxiv.org/html/2606.12900#bib.bib27)) preference model makes it less amenable to scenarios involving dense, scalar reward signals, such as the score-alignment supervision adopted in this work.
Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.12900#bib.bib4)) is a recent advancement designed to improve stability and efficiency in LLM alignment. It operates by sampling a group of candidate outputs for each prompt, assigning a reward to each, and optimizing the policy by increasing the relative likelihood of higher-reward outputs within the group. This group-relative ranking objective provides a stable and efficient learning signal, making GRPO particularly well-suited for weakly supervised, scalar reward settings.

## Appendix C More Details for Experiment Settings

### C.1 More Details on Datasets

TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.12900#bib.bib5)) is a large-scale reading comprehension benchmark comprising 95,956 question-answer pairs, each supplemented with an average of approximately 6 supporting evidence documents, yielding 662,659 associated documents in total sourced from Wikipedia and the Web. Since questions are authored independently of the subsequent evidence retrieval, the dataset exhibits substantial lexical and syntactic divergence between queries and supporting evidence, and frequently requires multi-sentence reasoning to answer accurately.

SciQ (Welbl et al., [2017](https://arxiv.org/html/2606.12900#bib.bib6)) is a science question answering benchmark designed to evaluate domain knowledge and reasoning. It comprises 13,679 manually collected scientific exam questions spanning physics, chemistry, biology, and related subjects, split into 11,679 training, 1,000 validation, and 1,000 test instances. Each question is multiple-choice with 4 answer options, and most are accompanied by an additional paragraph that supports the correct answer. SciQ also provides a direct-answer variant in which distractors are removed to enable reading-comprehension-style evaluation.

NQ Open (Kwiatkowski et al., [2019](https://arxiv.org/html/2606.12900#bib.bib7)) is an open-domain question answering benchmark derived from the Natural Questions corpus, comprising anonymized real-world Google queries paired with brief answers, all of which can be answered through content from the English Wikipedia. The dataset contains 91,535 question-answer pairs, including 87,925 training instances and 3,610 validation instances. To ensure concise and standardized responses, examples with answer lengths exceeding five words are discarded, yielding a high-quality benchmark for open-domain QA evaluation.

CoQA (Reddy et al., [2019](https://arxiv.org/html/2606.12900#bib.bib8)) is a large-scale benchmark for conversational question answering, which aims to measure a model's ability to understand a given passage and answer a series of interdependent questions that appear in a conversation. The dataset contains over 127,000 questions with answers collected from more than 8,000 conversations. Each conversation is collected by pairing two crowd workers to interact over a passage through multi-turn question and answer exchanges. CoQA captures challenging phenomena that are not present in standard reading comprehension benchmarks, including coreference and pragmatic reasoning.

Wikipedia ([https://dumps.wikimedia.org](https://dumps.wikimedia.org/)) is a large-scale multilingual dataset comprising cleaned articles from Wikipedia dumps across 320+ languages. It includes 61.6 million+ rows of text, with each entry containing an article's ID, URL, title, and markdown-stripped content. The dataset supports tasks like text generation and masked-language modeling, with one subset per language and a single training split, making it a foundational resource for multilingual NLP research.

### C.2 More Details on Implementation

#### C.2.1 Implementation Details on Baselines.

We implement baselines spanning 4 paradigms, largely following the official implementations when available.

(1) Logit-based methods. We implement Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)) using the public EigenScore codebase ([https://github.com/D2I-ai/eigenscore](https://github.com/D2I-ai/eigenscore)), which adopts sequence-level perplexity as the detection score; LN-Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)) based on the same codebase, which computes entropy with length normalization; and Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11)) using the official repository ([https://github.com/jlko/semantic_uncertainty](https://github.com/jlko/semantic_uncertainty)), which clusters semantically equivalent generations before entropy estimation.

(2) Consistency-based methods. We implement Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)) following the EigenScore codebase ([https://github.com/D2I-ai/eigenscore](https://github.com/D2I-ai/eigenscore)), using ROUGE-based similarity to measure agreement across multiple sampled responses; SelfCheckGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) using the same codebase, which aggregates multiple similarity metrics (e.g., BERTScore) for self-consistency evaluation; and EigenScore (Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13)) following the same codebase, which evaluates semantic consistency via eigenvalue statistics of the response covariance matrix in embedding space.

(3) Verbalized-confidence methods. We implement Self-evaluation (Kadavath et al., [2022](https://arxiv.org/html/2606.12900#bib.bib15)) following the Semantic Entropy repository ([https://github.com/jlko/semantic_uncertainty](https://github.com/jlko/semantic_uncertainty)), which prompts the model to estimate the probability that its answer is correct.

#### C.2.2 Implementation Details on Our Method.

In our main experiments, we instantiate the agent $f_{\theta}$ with Qwen-2.5-7b (Yang et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib32)) to assess the generated answer $a$ conditioned on the query $q$. To elicit expert-like behavior and implement Human-like Criteria Probing, we follow the Generalist Reward Model design in SPCT (Liu et al., [2025](https://arxiv.org/html/2606.12900#bib.bib65)) and format each input pair $(q,a)$ using the template below:

![[Uncaptioned image]](https://arxiv.org/html/2606.12900v1/x4.png)

The prompt specifies a General Evaluation Criteria set $\mathcal{C}$ that includes *"Factual Verification"*, *"Logical Consistency"*, *"Semantic Accuracy"*, *"Social Fairness"*, and *"Timeline Verification"*, together with their associated scoring guidelines. It further enforces a strictly structured and interpretable output schema to facilitate downstream aggregation.

Notably, we employ distinct sampling strategies (temperature $T$, nucleus sampling parameter top-$p$, and repetition penalty $RP$) for the training and test phases to serve different objectives. During training, we set $T=\text{top-}p=RP=1.0$ so that $f_{\theta}$ can explore the evaluation space more broadly and learn robust assessment behaviors. At inference, we adopt a lower-entropy (more deterministic) setting with $T=0.9$, $\text{top-}p=0.9$, $RP=1.1$ to obtain higher-confidence and more stable evaluations.

![[Uncaptioned image]](https://arxiv.org/html/2606.12900v1/x5.png)

For reward-based alignment with GRPO (Shao et al., [2024](https://arxiv.org/html/2606.12900#bib.bib4)), we adopt the Open-R1 implementation ([https://github.com/huggingface/open-r1](https://github.com/huggingface/open-r1)) and conduct training on 2 NVIDIA A800 GPUs. We use a per-device batch size of 10. For the RL optimization hyperparameters, the learning rate is set to $lr=2\times 10^{-4}$ on LLaMA-3.1-8b for all 4 datasets, and to $lr=1\times 10^{-4}$ on Qwen-3-8b. For the coefficient $\beta$ that controls the strength of $D_{KL}$, we set $\beta=0.05$ on TriviaQA with LLaMA-3.1-8b and Qwen-3-8b as well as on SciQ with Qwen-3-8b, and $\beta=0.04$ for all other settings. All reported results are averaged over 5 independent random splits.

#### C.2.3 Training Data Construction and Labeling.

Following Kuhn et al. ([2023](https://arxiv.org/html/2606.12900#bib.bib11)), we construct hallucination detection datasets from standard question-answer (QA) benchmarks. Specifically, for traditional QA datasets that do not require additional reference context (e.g., TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.12900#bib.bib5)), SciQ (Welbl et al., [2017](https://arxiv.org/html/2606.12900#bib.bib6)), and NQ Open (Kwiatkowski et al., [2019](https://arxiv.org/html/2606.12900#bib.bib7))), we generate candidate answers for each question $q$ using the following prompt template:

![[Uncaptioned image]](https://arxiv.org/html/2606.12900v1/x6.png)

For open-source QA datasets with auxiliary context (e.g., CoQA (Reddy et al., [2019](https://arxiv.org/html/2606.12900#bib.bib8))), we adopt the following prompt:

![[Uncaptioned image]](https://arxiv.org/html/2606.12900v1/x7.png)

We also employ different sampling strategies for the training and test splits. In the training split, we aim to generate a set of responses $\{a^{(n)}\}_{n=1}^{N}$ that spans a broad spectrum from factually correct to clearly hallucinated, providing fine-grained supervision for reinforcement learning. We therefore adopt a higher-entropy sampling strategy with $T=0.5$, $\text{top-}p=1.0$, $RP=1.0$, and set $N=9$, yielding 10 answers per query in total once the ground-truth reference $\hat{a}$ is included for training. In the test split, we generate a single response per query using beam search with 5 beams, matching the evaluation protocol in Kuhn et al. ([2023](https://arxiv.org/html/2606.12900#bib.bib11)). Notably, although this procedure may appear superficially similar to the agent setting, it differs fundamentally in both its target (the generator rather than the evaluation agent) and its purpose (producing diverse candidate answers rather than learning reliable scoring behavior).

For labeling, we assign the reference answer $\hat{a}$ the maximum score $s_{l}=10$ and adopt BLEURT (Sellam et al., [2020](https://arxiv.org/html/2606.12900#bib.bib20)) to score the generated answers. The resulting graded score is derived from Eqn. ([4](https://arxiv.org/html/2606.12900#S4.E4)). For binary evaluation on the test set, we further threshold the BLEURT-based similarity: answers with similarity above 0.5 are labeled as *True*, and those below as *Hallucinated*. It should be emphasized that the reference answer $\hat{a}$ is only used offline to construct labels during training and is never accessed at inference; at test time, the detector takes only $(q,a)$, where $a$ can be produced by any LLM.
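The binary test-set labeling rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BLEURT similarity is taken as a precomputed float (no BLEURT API is called here), the function names are hypothetical, and the text does not say how a similarity of exactly 0.5 is handled, so the sketch treats it as *Hallucinated*.

```python
from typing import Dict, List

BLEURT_THRESHOLD = 0.5  # threshold stated in the text


def binary_label(bleurt_similarity: float,
                 threshold: float = BLEURT_THRESHOLD) -> str:
    """Map a precomputed BLEURT similarity to a test-set label.

    Assumption: a similarity exactly equal to the threshold counts as
    hallucinated; the paper does not specify the tie case.
    """
    return "True" if bleurt_similarity > threshold else "Hallucinated"


def label_test_set(similarities: Dict[str, float]) -> Dict[str, str]:
    """Label every (query id -> similarity) entry of a test split."""
    return {qid: binary_label(sim) for qid, sim in similarities.items()}


def count_labels(labels: Dict[str, str]) -> List[int]:
    """Return [number of True, number of Hallucinated] for a sanity check."""
    n_true = sum(1 for v in labels.values() if v == "True")
    return [n_true, len(labels) - n_true]


if __name__ == "__main__":
    demo = {"q1": 0.73, "q2": 0.20, "q3": 0.50}  # made-up similarities
    print(label_test_set(demo))
```

The graded training labels from Eqn. (4) are deliberately not reproduced here, since that mapping is defined in the main text rather than in this appendix.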

In ablation experiments that use alternative signals, we adopt Deepseek-V3 (Liu et al., [2024](https://arxiv.org/html/2606.12900#bib.bib64)) to evaluate the truthfulness of generated content, following the LLM-as-a-judge (Zheng et al., [2023](https://arxiv.org/html/2606.12900#bib.bib66)) paradigm. Specifically, we assess the truthfulness of LLM-generated responses by verifying their semantic equivalence to the provided gold-standard answers. The input prompt for this evaluation on QA datasets is as follows:

![[Uncaptioned image]](https://arxiv.org/html/2606.12900v1/x8.png)

The input prompt for the Wikipedia dataset is as follows:

![[Uncaptioned image]](https://arxiv.org/html/2606.12900v1/x9.png)
#### C.2.4 Pseudo Code of HCPD

Algorithm 1: Reward-based Alignment Training of HCPD

Input: Dataset $\mathcal{D}=\{(q_{i},\{(a^{(n)}_{i},s^{(n)}_{i})\})\}$, group size $G$, general criteria set $\mathcal{C}$, initial agent $f_{\theta}\leftarrow f_{0}$, learning rate $\eta$.

1. Form detection inputs $x^{(n)}_{i}\leftarrow(q_{i},a^{(n)}_{i})$.
2. For each $x^{(n)}_{i}$ in $\mathcal{D}$ do:
   1. Sample a group of evaluations $\{Y_{g}\}_{g=1}^{G}\sim f_{\theta}(\cdot\mid x;\mathcal{C})$.
   2. For $g=1,2,\dots,G$ do:
      1. Parse the predicted score $s_{p}^{(g)}\leftarrow s_{p}(x,Y_{g})\in\{1,\ldots,10\}$.
      2. Compute the reward $r_{g}$ via Eqn. ([5](https://arxiv.org/html/2606.12900#S4.E5)).
      3. Compute the advantage $A_{g}\leftarrow r_{g}-\frac{1}{G}\sum_{j=1}^{G}r_{j}$.
   3. Compute the GRPO objective via Eqn. ([1](https://arxiv.org/html/2606.12900#S2.E1)).
   4. Update $\theta\leftarrow\theta+\eta\nabla_{\theta}\mathcal{J}_{x}(\theta)$.

Output: $f_{\theta}^{*}$
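The reward and advantage steps of Algorithm 1 can be sketched as follows. Eqn. (5) is defined in the main text and is not reproduced in this appendix, so the reward below is a stand-in: it assumes only what the proof of Theorem 1 states, namely that the reward is affine in the absolute score gap $|s_{p}-s_{l}|$. The function names are illustrative.

```python
from statistics import fmean
from typing import List, Sequence


def score_alignment_reward(predicted: int, weak_label: float) -> float:
    """Stand-in for Eqn. (5): ASSUMED to be the negative absolute gap.

    Any affine rescaling of this quantity yields the same ordering of
    group members, which is all the advantage computation relies on.
    """
    return -abs(predicted - weak_label)


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """A_g = r_g - mean(r_1..r_G), as written in Algorithm 1."""
    if not rewards:
        raise ValueError("a group must contain at least one reward")
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    weak_label = 8.0                  # s_l(x) for one detection input
    parsed_scores = [8, 10, 5]        # s_p(x, Y_g) for a group of G = 3
    rewards = [score_alignment_reward(s, weak_label) for s in parsed_scores]
    print(group_relative_advantages(rewards))
```

Advantages computed this way sum to zero within each group, so evaluations whose parsed score lies closer to the weak label than the group average are reinforced and the rest are suppressed. Note that Algorithm 1 only subtracts the group mean; some GRPO implementations additionally divide by the group standard deviation, which this sketch does not do.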

Algorithm 2: Detection via Multi-sampling Aggregation

Input: Scoring agent $f_{\theta}^{*}$, general criteria set $\mathcal{C}$, input $x=(q,a)$, number of samples $K$.

1. Sample $K$ evaluations $\{Y_{k}\}_{k=1}^{K}\sim f_{\theta}^{*}(\cdot\mid x;\mathcal{C})$.
2. For $k=1,2,\dots,K$ do:
   1. Parse the score $s_{p}^{(k)}\leftarrow s_{p}(x,Y_{k})\in\{1,\ldots,10\}$.
3. Aggregate the scores via Eqn. ([6](https://arxiv.org/html/2606.12900#S4.E6)).

Output: Truthfulness score $\bar{s}$
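A minimal Python sketch of Algorithm 2 is given below. Several details are assumptions made for illustration: the agent is abstracted as a callable `evaluate_once(query, answer)` that returns one sampled evaluation as text, the `Final Score: <n>` pattern is a hypothetical stand-in for the paper's structured output schema (which is shown only as an image), and malformed evaluations are skipped so that the mean is taken over well-formed generations only.

```python
import re
from typing import Callable, List, Optional

# Hypothetical output schema; the actual template is not reproduced here.
SCORE_PATTERN = re.compile(r"Final Score:\s*(\d+)")


def parse_score(evaluation: str) -> Optional[int]:
    """Extract a score in {1, ..., 10}, or None if the text is malformed."""
    match = SCORE_PATTERN.search(evaluation)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def aggregate_truthfulness(evaluate_once: Callable[[str, str], str],
                           query: str, answer: str, k: int) -> float:
    """Sample k evaluations and average the well-formed parsed scores."""
    if k < 1:
        raise ValueError("k must be at least 1")
    scores: List[int] = []
    for _ in range(k):
        score = parse_score(evaluate_once(query, answer))
        if score is not None:  # skip malformed generations
            scores.append(score)
    if not scores:
        raise ValueError("no well-formed evaluation was produced")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    canned = iter(["Final Score: 8", "unparseable", "Final Score: 6"])
    print(aggregate_truthfulness(lambda q, a: next(canned), "q", "a", 3))
```

Because each sampled evaluation is kept as text, the per-criterion rationale behind every score remains inspectable after aggregation. If many generations are malformed, the effective sample count drops below $K$ and the concentration guarantee of Proposition 1 weakens accordingly.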

## Appendix D More Experiment Results

### D.1 More Results on Transferability across Target Models

To further demonstrate the advantages of HCPD's model-agnostic design, we provide cross-target evaluation on the remaining 3 datasets. Results in Tables [7](https://arxiv.org/html/2606.12900#A4.T7)-[9](https://arxiv.org/html/2606.12900#A4.T9) show that our method maintains consistently strong performance across heterogeneous target models. Notably, when faced with earlier-generation models (e.g., LLaMA-2 and Qwen-2.5), detection performance further improves, likely because such models produce less fluent and less realistic outputs, making hallucinations easier to distinguish. Overall, these results indicate that HCPD is well suited to detecting hallucinations in answers from heterogeneous sources, a situation that is common in real-world deployment.
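Since the tables report AUROC, it may help to recall how it relates to the ranking error of Corollary 1: AUROC is the probability that a truthful answer receives a higher score than a hallucinated one, with ties counted as one half. The sketch below computes both quantities by brute force over all pairs; it is an illustration with made-up scores, not the evaluation code used for the reported results.

```python
from typing import Sequence


def pairwise_auroc(scores_true: Sequence[float],
                   scores_hallucinated: Sequence[float]) -> float:
    """AUROC as the fraction of (true, hallucinated) pairs ranked correctly."""
    if not scores_true or not scores_hallucinated:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for pos in scores_true:
        for neg in scores_hallucinated:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_true) * len(scores_hallucinated))


def empirical_ranking_error(scores_true: Sequence[float],
                            scores_hallucinated: Sequence[float]) -> float:
    """Empirical P(s(x+) <= s(x-)); ties count fully as errors."""
    if not scores_true or not scores_hallucinated:
        raise ValueError("both classes need at least one score")
    errors = sum(1 for pos in scores_true
                 for neg in scores_hallucinated if pos <= neg)
    return errors / (len(scores_true) * len(scores_hallucinated))


if __name__ == "__main__":
    true_scores = [9.0, 7.0, 5.0]   # made-up aggregated scores
    hall_scores = [6.0, 5.0]
    print(pairwise_auroc(true_scores, hall_scores),
          empirical_ranking_error(true_scores, hall_scores))
```

Because the ranking error counts ties as full errors while AUROC counts them as half, $1-\text{AUROC}$ never exceeds the empirical ranking error, so an upper bound on $\mathcal{E}_{\mathrm{rank}}$ translates into a lower bound on AUROC.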

Table 7: Comparisons with training-based baselines across target models on SciQ in terms of AUROC (%). The six rightmost columns are target models.

| Source Model | Method | LLaMA-3.1-8b | LLaMA-2-7b | LLaMA-2-13b | Qwen-3-8b | Qwen-2.5-7b | Qwen-2.5-14b |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3.1-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 85.63 ± 0.96 | 79.77 ± 2.45 | 70.96 ± 2.64 | 79.05 ± 2.61 | 74.65 ± 1.88 | 63.16 ± 1.83 |
| LLaMA-3.1-8b | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 69.04 ± 6.36 | 75.64 ± 9.33 | 78.95 ± 8.82 | 62.87 ± 10.34 | 66.15 ± 11.19 | 61.65 ± 8.32 |
| LLaMA-3.1-8b | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 80.01 ± 1.17 | 68.77 ± 10.12 | 61.86 ± 7.32 | 70.46 ± 9.86 | 64.56 ± 7.16 | 56.10 ± 2.27 |
| LLaMA-3.1-8b | HCPD (Ours) | **86.04** ± 2.25 | **92.47** ± 1.44 | **95.48** ± 1.26 | **91.89** ± 1.07 | **89.92** ± 2.08 | **78.20** ± 7.42 |
| Qwen-3-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 85.21 ± 1.71 | 84.11 ± 0.93 | 79.88 ± 1.65 | 86.63 ± 1.53 | 77.08 ± 1.66 | 68.39 ± 3.73 |
| Qwen-3-8b | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 53.60 ± 4.11 | 58.31 ± 8.59 | 57.42 ± 7.30 | 74.98 ± 4.19 | 65.18 ± 10.56 | 59.16 ± 7.85 |
| Qwen-3-8b | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 63.67 ± 3.91 | 57.88 ± 6.40 | 56.89 ± 5.85 | 78.77 ± 0.94 | 60.67 ± 2.00 | 59.06 ± 3.88 |
| Qwen-3-8b | HCPD (Ours) | **85.39** ± 2.70 | **90.51** ± 1.28 | **96.63** ± 0.86 | **92.63** ± 2.90 | **92.57** ± 3.14 | **84.21** ± 4.75 |

Table 8: Comparisons with training-based baselines across target models on NQ Open in terms of AUROC (%). Columns LLaMA-3.1-8b through Qwen-2.5-14b are target models.

| Source Model | Method | LLaMA-3.1-8b | LLaMA-2-7b | LLaMA-2-13b | Qwen-3-8b | Qwen-2.5-7b | Qwen-2.5-14b |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3.1-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 76.23 ± 0.82 | 66.77 ± 1.43 | 65.55 ± 2.04 | 68.27 ± 2.34 | 63.18 ± 1.53 | 63.82 ± 1.43 |
| LLaMA-3.1-8b | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 63.38 ± 3.02 | 74.35 ± 9.93 | 62.58 ± 7.66 | 63.68 ± 11.04 | 59.10 ± 7.32 | 60.66 ± 4.97 |
| LLaMA-3.1-8b | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 70.17 ± 1.47 | 58.42 ± 3.87 | 58.20 ± 3.31 | 61.00 ± 1.57 | 60.43 ± 3.68 | 57.24 ± 2.98 |
| LLaMA-3.1-8b | HCPD (Ours) | **90.38** ± 3.58 | **92.94** ± 0.75 | **90.42** ± 1.92 | **92.45** ± 1.77 | **85.88** ± 6.03 | **78.62** ± 6.69 |
| Qwen-3-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 70.55 ± 3.11 | 66.85 ± 0.67 | 61.85 ± 1.50 | 72.86 ± 1.20 | 68.52 ± 1.95 | **68.87** ± 2.20 |
| Qwen-3-8b | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 62.83 ± 9.95 | 64.84 ± 12.85 | 55.49 ± 4.94 | 57.25 ± 1.50 | 62.00 ± 7.79 | 58.60 ± 6.65 |
| Qwen-3-8b | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 53.94 ± 4.57 | 57.30 ± 3.92 | 56.10 ± 3.84 | 61.38 ± 3.43 | 57.91 ± 3.73 | 55.56 ± 2.53 |
| Qwen-3-8b | HCPD (Ours) | **79.42** ± 3.52 | **91.60** ± 1.41 | **93.04** ± 0.98 | **87.35** ± 6.22 | **85.70** ± 1.58 | 60.09 ± 5.91 |

Table 9: Comparisons with training-based baselines across target models on CoQA in terms of AUROC (%). Columns LLaMA-3.1-8b through Qwen-2.5-14b are target models.

| Source Model | Method | LLaMA-3.1-8b | LLaMA-2-7b | LLaMA-2-13b | Qwen-3-8b | Qwen-2.5-7b | Qwen-2.5-14b |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3.1-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 71.58 ± 1.35 | 76.81 ± 1.18 | 72.62 ± 0.62 | 69.07 ± 3.24 | 68.11 ± 1.32 | 59.84 ± 1.42 |
| LLaMA-3.1-8b | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 72.11 ± 4.96 | 58.10 ± 6.07 | 62.53 ± 7.26 | 57.75 ± 2.67 | 67.05 ± 6.23 | 53.53 ± 2.05 |
| LLaMA-3.1-8b | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 69.31 ± 6.75 | 53.59 ± 3.55 | 56.85 ± 3.20 | 56.82 ± 4.28 | 59.10 ± 3.59 | 52.91 ± 1.90 |
| LLaMA-3.1-8b | HCPD (Ours) | **90.07** ± 2.58 | **89.12** ± 1.24 | **89.38** ± 2.02 | **76.43** ± 3.03 | **84.19** ± 3.39 | **60.76** ± 6.24 |
| Qwen-3-8b | SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 69.63 ± 1.55 | 81.52 ± 1.13 | 77.99 ± 1.40 | 80.28 ± 1.40 | 72.91 ± 2.42 | 73.00 ± 1.40 |
| Qwen-3-8b | HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 74.46 ± 8.04 | 65.96 ± 12.78 | 69.45 ± 7.78 | 62.18 ± 4.49 | 64.07 ± 10.56 | 54.10 ± 2.21 |
| Qwen-3-8b | TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 61.22 ± 10.15 | 58.82 ± 7.59 | 59.81 ± 7.30 | 68.40 ± 4.92 | 58.63 ± 7.01 | 56.79 ± 4.45 |
| Qwen-3-8b | HCPD (Ours) | **88.43** ± 3.50 | **86.12** ± 0.83 | **88.09** ± 1.61 | **84.80** ± 1.01 | **83.14** ± 0.52 | **73.26** ± 1.25 |

### D.2 More Results on Transferability across Data Distributions

We also evaluate cross-dataset transfer on Qwen-3-8b. As shown in Figure [4](https://arxiv.org/html/2606.12900#A4.F4), HCPD exhibits strong generalization under distribution shift, similar to that observed on LLaMA-3.1-8b. This indicates that, by adaptively proposing criteria that fit the problem domain and format, HCPD is widely applicable across diverse hallucination detection scenarios. Notably, the agent trained on CoQA even outperforms its in-domain counterpart when transferred to the other three datasets. We hypothesize that CoQA's contextual setting provides richer supervision, which may improve knowledge coverage and contextual reasoning, thereby enhancing the agent's discriminative capability.

![Refer to caption](https://arxiv.org/html/2606.12900v1/x10.png)

Figure 4: Cross-dataset AUROCs of HCPD on Qwen-3-8b.

### D.3 Impact of Generation Strategy

As mentioned above, we use different sampling strategies for the training and test phases to serve different objectives. To find the most effective strategy for inference-time generation, we evaluate 5 sampling strategies with an increasing degree of freedom. Intuitively, a higher degree of freedom can broaden exploration across diverse perspectives, but it also reduces controllability and may yield unstable or less reliable evaluations (e.g., $T=1.3$, top-$p=0.98$, $RP=1.0$); conversely, overly deterministic sampling can constrain the assessment, potentially biasing it toward a single criterion or reasoning pattern (e.g., $T=0.5$, top-$p=0.8$, $RP=1.3$). Results in Table [10](https://arxiv.org/html/2606.12900#A4.T10) indicate that HCPD maintains excellent reasoning ability across strategies with diverse degrees of freedom, which further supports the effectiveness of our alignment procedure.
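To make the notion of "degree of freedom" concrete, the sketch below applies the five $(T, \text{top-}p)$ settings from Table 10 to a toy next-token distribution and measures how many tokens survive nucleus truncation and how much entropy remains. This is an illustration under stated assumptions, not the paper's decoding code: the logits are hypothetical, $RP$ is read here as a repetition penalty (an assumption), and it is recorded but not applied because it depends on the generation history.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    temperature: float
    top_p: float
    repetition_penalty: float  # assumed meaning of RP; not applied in this toy


# The five settings listed in Table 10, from most to least deterministic.
STRATEGIES = [
    Strategy(0.5, 0.80, 1.30),
    Strategy(0.7, 0.85, 1.20),
    Strategy(0.9, 0.90, 1.10),
    Strategy(1.1, 0.95, 1.05),
    Strategy(1.3, 0.98, 1.00),
]


def nucleus_distribution(logits, temperature, top_p):
    """Temperature-scaled softmax followed by top-p (nucleus) truncation."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    kept, cum = [], 0.0
    for p in probs:
        kept.append(p)
        cum += p
        if cum >= top_p:  # smallest prefix whose mass reaches top_p
            break
    norm = sum(kept)
    return [p / norm for p in kept]


def entropy(probs):
    """Shannon entropy in nats."""
    return sum(-p * math.log(p) for p in probs if p > 0)


TOY_LOGITS = [4.0, 3.0, 2.0, 1.0, 0.0]  # hypothetical next-token logits

kept = []
entropies = []
for s in STRATEGIES:
    dist = nucleus_distribution(TOY_LOGITS, s.temperature, s.top_p)
    kept.append(len(dist))
    entropies.append(entropy(dist))
    print(f"T={s.temperature} top-p={s.top_p}: "
          f"{len(dist)} tokens kept, entropy={entropies[-1]:.3f}")
```

On this toy input, the most deterministic setting keeps fewer candidate tokens than the freest one, which mirrors the controllability versus exploration trade-off described above.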

Table 10: Impact of generation strategy. Each strategy is given as $T$ / top-$p$ / $RP$.

| Dataset | 0.5 / 0.8 / 1.3 | 0.7 / 0.85 / 1.2 | 0.9 / 0.9 / 1.1 | 1.1 / 0.95 / 1.05 | 1.3 / 0.98 / 1.0 |
| --- | --- | --- | --- | --- | --- |
| TriviaQA | 85.51 ± 1.38 | 85.90 ± 1.44 | **86.25** ± 1.08 | 86.03 ± 1.36 | 82.80 ± 2.01 |
| SciQ | 85.59 ± 0.44 | **86.62** ± 2.03 | 86.04 ± 2.25 | 85.94 ± 1.95 | 85.36 ± 2.48 |
| NQ Open | 88.94 ± 2.23 | 90.09 ± 2.13 | **90.38** ± 3.58 | 90.08 ± 2.17 | 87.30 ± 3.00 |
| CoQA | 89.44 ± 2.98 | 89.83 ± 2.76 | **90.07** ± 2.58 | 89.85 ± 2.67 | 88.14 ± 4.04 |

### D.4 Performance and Consumption Comparison

In light of the practical deployment of HCPD, we analyze its computational overhead from 2 perspectives: i) Inference time: As shown in Table [5](https://arxiv.org/html/2606.12900#S5.T5), the multi-sampling aggregation yields substantial performance gains at the expense of additional computation. However, $K=5$ is not strictly necessary, since HCPD already outperforms all baselines at $K=1$, which reduces latency to 0.2349 s per sample; ii) Inference VRAM: Since most methods require the LLM itself to extract internal states or probabilities, HCPD does not consume additional VRAM during inference. Moreover, its fixed 7B footprint is more efficient when evaluating large-scale LLMs. A detailed performance–consumption comparison is provided in Table [11](https://arxiv.org/html/2606.12900#A4.T11), demonstrating that HCPD attains efficiency comparable to lightweight metrics while surpassing consistency-based methods.
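The trade-off between $K=1$ and $K=5$ can be read directly off Table 11. The short calculation below uses only the HCPD rows and the strongest baseline AUROC from that table (Perplexity); the variable names are illustrative.

```python
# Values copied from Table 11 (AUROC in %, latency in seconds per sample).
hcpd = {
    1: {"auroc": 85.21, "latency_s": 0.2349},
    5: {"auroc": 86.25, "latency_s": 1.1313},
}
best_baseline_auroc = 80.62  # Perplexity, the highest baseline AUROC in Table 11

# AUROC gained by going from one sample to five.
auroc_gain = hcpd[5]["auroc"] - hcpd[1]["auroc"]
# How many times slower K=5 is than K=1.
latency_ratio = hcpd[5]["latency_s"] / hcpd[1]["latency_s"]
# Margin over the strongest baseline already available at K=1.
margin_at_k1 = hcpd[1]["auroc"] - best_baseline_auroc

print(f"K=5 vs K=1: +{auroc_gain:.2f} AUROC points at {latency_ratio:.2f}x latency")
print(f"K=1 margin over best baseline: +{margin_at_k1:.2f} AUROC points")
```

In this table, most of the margin over the baselines is already present at $K=1$, so the additional samples mainly buy a smaller, second-order improvement at a roughly proportional latency cost.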

### D.5 Generalization to Long-form Generation Task

Beyond factual hallucinations in short-form QA settings, we further evaluate the detection of faithfulness hallucinations in a Wikipedia article continuation task. To assess the generalization capability of HCPD in long-form generation, we directly apply the agent trained on TriviaQA to a Wikipedia continuation dataset. Notably, we employ DeepSeek-V3 (Liu et al., [2024](https://arxiv.org/html/2606.12900#bib.bib64)) as the evaluation metric, since most continuation passages exceed the 512-token limit of BLEURT. Detailed descriptions of the dataset and prompting strategy are provided in Appendix [C.1](https://arxiv.org/html/2606.12900#A3.SS1) and [C.2](https://arxiv.org/html/2606.12900#A3.SS2), respectively. As shown in Table [12](https://arxiv.org/html/2606.12900#A4.T12), HCPD consistently outperforms existing baselines, demonstrating robust scalability to long-form generation tasks.

Table 11: Comparisons with baselines in terms of performance, inference time and GPU memory.

| Method | AUROC (%) ↑ | Inf. Time (s) ↓ | GPU Mem. (MiB) ↓ |
| --- | --- | --- | --- |
| LN-Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)) | 73.62 ± 2.20 | 0.1383 | 17185 |
| CCS (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)) | 78.20 ± 1.89 | 1.1600 | 31498 |
| SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) | 74.58 ± 1.90 | 6.5310 | 17540 |
| Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)) | 80.62 ± 2.62 | 0.1349 | 17185 |
| SAPLMA (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.12900#bib.bib17)) | 78.51 ± 3.16 | 0.5341 | 16019 |
| Semantic Entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.12900#bib.bib11)) | 78.71 ± 3.09 | 14.3026 | 17185 |
| Lexical Similarity (Lin et al., [2024](https://arxiv.org/html/2606.12900#bib.bib12)) | 77.96 ± 2.03 | 36.5646 | 17185 |
| EigenScore (Chen et al., [2024a](https://arxiv.org/html/2606.12900#bib.bib13)) | 51.35 ± 1.23 | 37.7550 | 17453 |
| HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 58.19 ± 5.79 | 0.6012 | 16757 |
| TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 79.78 ± 3.36 | 0.0934 | 23592 |
| HCPD (Ours, $K=1$) | 85.21 ± 1.22 | 0.2349 | 18513 |
| HCPD (Ours, $K=5$) | 86.25 ± 1.08 | 1.1313 | 19051 |

Table 12: Cross-dataset transfer performance on Wikipedia, using TriviaQA as the source dataset for training-based baselines.

| Method | AUROC |
| --- | --- |
| LN-Entropy (Malinin and Gales, [2021](https://arxiv.org/html/2606.12900#bib.bib10)) | 73.02 ± 1.76 |
| CCS (Burns et al., [2022](https://arxiv.org/html/2606.12900#bib.bib16)) | 70.20 ± 2.28 |
| SelfCKGPT (Manakul et al., [2023](https://arxiv.org/html/2606.12900#bib.bib14)) | 72.37 ± 2.03 |
| Perplexity (Ren et al., [2023](https://arxiv.org/html/2606.12900#bib.bib9)) | 74.19 ± 2.07 |
| HaloScope (Du et al., [2024](https://arxiv.org/html/2606.12900#bib.bib18)) | 71.74 ± 1.30 |
| TSV (Park et al., [2025](https://arxiv.org/html/2606.12900#bib.bib19)) | 65.07 ± 1.27 |
| HCPD (Ours) | **74.36** ± 1.37 |

## Appendix E Future Directions

While HCPD provides an effective zero-source hallucination detector, several promising extensions warrant further investigation. Future work includes leveraging HCPD as a reward model within LLM training to explicitly suppress hallucinations, extending the framework to multimodal generation (e.g., image-text or video-text) by incorporating modality-aware evaluation criteria, and generalizing the human-like criteria probing mechanism to broader evaluation tasks such as safety, helpfulness, or style compliance. Furthermore, scaling HCPD to long-form generation, multi-turn dialogue, and tool-augmented systems will likely require hierarchical or claim-level scoring, as well as dialogue-aware consistency modeling. Collectively, these directions illustrate a principled path toward establishing HCP as a general, interpretable evaluation paradigm for increasingly complex and multimodal generation settings.

## Appendix F Examples

![Refer to caption](https://arxiv.org/html/2606.12900v1/x11.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x12.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x13.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x14.png)

Figure 5: Visualizations of successful detections via HCPD on TriviaQA.

![Refer to caption](https://arxiv.org/html/2606.12900v1/x15.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x16.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x17.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x18.png)

Figure 6: Visualizations of successful detections via HCPD on SciQ.

![Refer to caption](https://arxiv.org/html/2606.12900v1/x19.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x20.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x21.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x22.png)

Figure 7: Visualizations of successful detections via HCPD on NQ Open.

![Refer to caption](https://arxiv.org/html/2606.12900v1/x23.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x24.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x25.png)

![Refer to caption](https://arxiv.org/html/2606.12900v1/x26.png)

Figure 8: Visualizations of successful detections via HCPD on CoQA.

![Refer to caption](https://arxiv.org/html/2606.12900v1/x27.png)
(a) TriviaQA

![Refer to caption](https://arxiv.org/html/2606.12900v1/x28.png)
(b) SciQ

![Refer to caption](https://arxiv.org/html/2606.12900v1/x29.png)
(c) NQ Open

![Refer to caption](https://arxiv.org/html/2606.12900v1/x30.png)
(d) CoQA

Figure 9: Visualizations of failed detections via HCPD.
