RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

arXiv cs.CL Papers

Summary

RAGognizer introduces a hallucination-aware fine-tuning approach that integrates a lightweight detection head into LLMs for joint optimization of language modeling and hallucination detection in RAG systems. The paper presents RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and demonstrates state-of-the-art hallucination detection while reducing hallucination rates without degrading language quality.

arXiv:2604.15945v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.

Cached at: 04/20/26, 08:29 AM

# RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

Source: https://arxiv.org/html/2604.15945

[Fabian Ridder](https://orcid.org/0009-0008-5574-5292), [Laurin Lessel](https://orcid.org/0009-0007-0936-5312), and [Malte Schilling](https://orcid.org/0000-0002-0849-483X)

Computer Science Department, University of Münster, Münster, Germany

{fridder, llessel, malte.schilling}@uni-muenster.de

###### Abstract

Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.

## 1 Introduction

Large Language Models (LLMs) have achieved impressive performance in natural language understanding and generation ([Brown et al., 2020](https://arxiv.org/html/2604.15945#bib.bib1)). Despite this progress, LLMs remain prone to *hallucinations*: the generation of content that is unsupported by, or contradicts, available evidence ([Huang et al., 2025](https://arxiv.org/html/2604.15945#bib.bib2)). This phenomenon fundamentally limits their reliability, particularly in high-stakes or knowledge-intensive applications.

Figure 1: Distinction of Contextual and Parametric Knowledge: The Venn diagram illustrates possible knowledge scenarios in LLM generation. Prompts may rely solely on contextual knowledge (left), solely on parametric knowledge (right), or on their intersection where the two sources may either align (Parametric-Aligned) or contradict (Counter-Parametric). The *No Knowledge* region corresponds to unanswerable prompts. Regions marked with stripes indicate scenarios not covered by the RAGognize dataset, which focuses exclusively on closed-domain settings where hallucinations are verifiable.

A central difficulty in defining and detecting hallucinations lies in the dual nature of knowledge in LLMs. During pre-training, models encode vast amounts of information as implicit *parametric knowledge* that is stored in their weights ([Petroni et al., 2019](https://arxiv.org/html/2604.15945#bib.bib3)), while at inference time, this may be complemented by explicit information added into the model's context window as *contextual knowledge*. These sources differ substantially in accessibility and verifiability, yet are often conflated when hallucinations are treated simply as factual errors ([Xu et al., 2024](https://arxiv.org/html/2604.15945#bib.bib4)).

Retrieval-Augmented Generation (RAG) aims at guiding generation by explicitly providing LLMs with access to external, dynamic information (such as company-specific data or breaking news) that the model was not exposed to during pre-training ([Petroni et al., 2019](https://arxiv.org/html/2604.15945#bib.bib3); [Lewis et al., 2021](https://arxiv.org/html/2604.15945#bib.bib5)). But RAG does not inherently solve the problem of reliability. Even when provided with correct context, models frequently exhibit closed-domain hallucinations: generating plausible but incorrect information that is not grounded in the retrieved context ([Agrawal et al., 2024](https://arxiv.org/html/2604.15945#bib.bib6); [Niu et al., 2024](https://arxiv.org/html/2604.15945#bib.bib7)). This disconnect between the provided evidence (contextual knowledge) and the generated output undermines the trust required for high-stakes applications.

We argue that hallucinations cannot be meaningfully defined or detected without distinguishing between the different knowledge sources. As illustrated in [Fig. 1](https://arxiv.org/html/2604.15945#S1.F1), contextual and parametric knowledge may appear in isolation or in combination. To obtain a decidable notion, we focus on closed-domain settings using exclusively recent information to prevent reliance on parametric knowledge. In this setting, where prompts fall within the *Contextual Knowledge* area if *answerable* or the *No Knowledge* area if *unanswerable*, hallucinations can be unambiguously identified as generations introducing unsupported content.

Focusing on this closed-domain setting, we make three contributions: First, we introduce RAGognize, a comprehensive dataset of naturally occurring closed-domain hallucinations with granular token-level annotations.
Second, we propose RAGognizer, a hallucination-aware model architecture that integrates a simple detection head into an LLM, enabling token-level hallucination prediction from internal representations and achieving state-of-the-art detection performance on closed-domain benchmarks. Third, we show that jointly optimizing language modeling and hallucination detection objectives using LoRA-based fine-tuning improves the separability of internal states with respect to hallucination, leading to both stronger detection performance and substantially reduced hallucination rates during generation, while preserving language quality.

Our experiments demonstrate that RAGognizer achieves state-of-the-art token-level hallucination detection with a compact Qwen3-4B generation model ([Yang et al., 2025](https://arxiv.org/html/2604.15945#bib.bib8)), while significantly improving generation faithfulness in closed-domain RAG settings. Further, we show that this generalizes to other settings when evaluated on other datasets. Together, these findings indicate that hallucination detection is closely tied to representation learning and that integrating detection signals during training can improve model reliability. The dataset, models, and code can be found online at https://github.com/F4biian/RAGognizer.

## 2 Related Work

Hallucinations in LLMs have been studied from different perspectives, including detection, mitigation, and dataset construction. In this section, we first review prior work on hallucination detection methods, focusing on how they differ in model access and granularity, and then discuss existing hallucination datasets.

### 2.1 Hallucination Detection

Detection methods are commonly categorized by their required access: white-box methods exploit internal activations or attention patterns, while black-box methods operate on outputs alone.
Further practical distinctions are the granularity at which hallucinations are identified and whether a method requires stochastic sampling (multiple generations) to estimate consistency, or can run in a single forward pass (see [Table E](https://arxiv.org/html/2604.15945#Ax1.T5)).

White-box approaches include uncertainty proxies such as perplexity ([Jelinek et al., 1977](https://arxiv.org/html/2604.15945#bib.bib9)) and entropy-based scores ([Farquhar et al., 2024](https://arxiv.org/html/2604.15945#bib.bib10)), representation-statistic methods such as INSIDE (EigenScore) ([Chen et al., 2024](https://arxiv.org/html/2604.15945#bib.bib11)), attention-based detectors such as Lookback Lens ([Chuang et al., 2024](https://arxiv.org/html/2604.15945#bib.bib12)), and probe/classifier approaches that train on hidden activations (e.g., SAPLMA ([Azaria and Mitchell, 2023](https://arxiv.org/html/2604.15945#bib.bib13))). HallucinationProbes trains a linear, token-level classifier on hidden states and further explores adapter training via Low-Rank Adaptation (LoRA) alongside the probe head to improve detection while minimally altering base model behavior ([Obeso et al., 2025](https://arxiv.org/html/2604.15945#bib.bib14); [Hu et al., 2021](https://arxiv.org/html/2604.15945#bib.bib15)), closely aligning with our approach. Other white-box methods include unsupervised internal-state detectors (MIND ([Su et al., 2024](https://arxiv.org/html/2604.15945#bib.bib16))), relevance-propagation applied to RAG (LRP4RAG ([Hu et al., 2025](https://arxiv.org/html/2604.15945#bib.bib17))), and cross-layer dynamics probes (ICR Probe ([Zhang et al., 2025](https://arxiv.org/html/2604.15945#bib.bib18))).

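
As a rough illustration of the probe-style detectors described above, the following sketch scores each token with a linear probe over its hidden state and adds a detection loss to a language-modeling loss. It is a minimal pure-Python sketch and not the implementation of any cited method: the function names, the toy two-dimensional hidden states, the choice of binary cross-entropy, and the weighting factor `det_weight` are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def probe_scores(hidden_states, weights, bias):
    """Linear token-level probe: one hallucination probability per token."""
    return [
        sigmoid(sum(w * h for w, h in zip(weights, state)) + bias)
        for state in hidden_states
    ]


def detection_loss(scores, labels, eps=1e-12):
    """Mean binary cross-entropy between probe scores and 0/1 token labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


def joint_loss(lm_loss, det_loss, det_weight=1.0):
    """Hypothetical joint objective: LM loss plus weighted detection loss."""
    return lm_loss + det_weight * det_loss


# Toy example: three tokens with made-up 2-dimensional hidden states.
hidden_states = [[0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]]
labels = [0, 1, 0]  # 1 marks a hallucinated token (illustrative labels)
weights, bias = [1.0, 0.0], 0.0  # assumed probe parameters

scores = probe_scores(hidden_states, weights, bias)
loss = joint_loss(lm_loss=2.0, det_loss=detection_loss(scores, labels), det_weight=0.5)
```

In a real setup the hidden states would come from a transformer layer and the probe parameters would be learned by gradient descent; the sketch only shows how the per-token scores and the combined loss fit together.
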
Black-box approaches include sampling-based consistency checks such as SelfCheckGPT ([Manakul et al., 2023](https://arxiv.org/html/2604.15945#bib.bib19)), and external evaluator or judge models fine-tuned for factuality (e.g., NLI/entailment models built on DeBERTa-style encoders ([He et al., 2023](https://arxiv.org/html/2604.15945#bib.bib20)) and specialized evaluators such as MiniCheck, Lynx, and Granite-Guardian ([Tang et al., 2024](https://arxiv.org/html/2604.15945#bib.bib21); [Ravi et al., 2024](https://arxiv.org/html/2604.15945#bib.bib22); [Padhi et al., 2024](https://arxiv.org/html/2604.15945#bib.bib23)))). Community and benchmark models (e.g., HHEM-2.1) provide readily usable open evaluators ([Mendelevitch et al., 2024](https://arxiv.org/html/2604.15945#bib.bib24)). Methods tailored to RAG include faithfulness scoring that combines entailment with retrieval evidence (RAGAS) ([Es et al., 2025](https://arxiv.org/html/2604.15945#bib.bib25)) and joint context/knowledge verification models such as HDM-2 ([Paudel et al., 2025](https://arxiv.org/html/2604.15945#bib.bib26)). Other work (e.g., LUMINA) examines the balance between reliance on retrieved context and internal parametric knowledge when detecting hallucinations in RAG outputs ([Yeh et al., 2025](https://arxiv.org/html/2604.15945#bib.bib27)).

### 2.2 Datasets

Existing hallucination datasets differ in their annotation granularity, underlying knowledge assumptions, and the nature of hallucinations. A primary distinction concerns the level at which hallucinations are labeled. While most benchmarks provide supervision only at the level of complete responses, a small number of recent datasets offer token-level annotations, which makes these particularly relevant for studying internal model representations and token-level detection (e.g., RAGTruth ([Niu et al., 2024](https://arxiv.org/html/2604.15945#bib.bib7))).

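
To make the difference between response-level and token-level supervision concrete, the sketch below derives both kinds of label from a list of annotated hallucinated substrings. It is only an illustration under stated assumptions: whitespace tokenization stands in for a real subword tokenizer, the example sentence and annotation are invented, and the helper names `token_spans`, `token_labels`, and `response_label` are hypothetical.

```python
def token_spans(text):
    """Split on whitespace and return (token, start, end) character spans."""
    spans, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append((token, start, end))
        pos = end
    return spans


def token_labels(response, hallucinated_substrings):
    """Label a token 1 if it overlaps any annotated hallucinated substring."""
    marked = []
    for sub in hallucinated_substrings:
        if not sub:
            continue  # ignore empty annotations
        start = response.find(sub)
        while start != -1:
            marked.append((start, start + len(sub)))
            start = response.find(sub, start + 1)
    return [
        (tok, int(any(s < m_end and m_start < e for m_start, m_end in marked)))
        for tok, s, e in token_spans(response)
    ]


def response_label(labels):
    """Response-level label: 1 if at least one token is hallucinated."""
    return int(any(flag for _, flag in labels))


# Invented example: the cost figure is assumed to be unsupported by the context.
response = "The bridge opened in 2025 and cost 40 million euros."
labels = token_labels(response, ["40 million euros"])
```

Collapsing the token labels with `response_label` recovers the coarser response-level signal, which shows why token-level annotations are strictly more informative for probing internal representations.
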
We believe it is important to take the assumed knowledge regime into account. A common issue in many RAG and context-based QA datasets is that they do not strictly ensure questions require the provided context to be answered. This blurs the line between contextual and parametric knowledge; for instance, HaluEval ([Li et al., 2023](https://arxiv.org/html/2604.15945#bib.bib28)) contains questions that LLMs can answer using pre-trained memory. This contrasts with strictly closed-domain settings where valid generations must be supported exclusively by the given context. Finally, datasets differ in how hallucinations are produced: while HaluEval relies on synthetically induced response-level hallucinations, others like HDM-Bench ([Paudel et al., 2025](https://arxiv.org/html/2604.15945#bib.bib26)) focus on natural response-level hallucinations that arise during standard model generation.

Figure 2: Automatic Data Generation and Annotation Pipeline for the RAGognize dataset: First, Wikipedia facts post-dating the training cut-off date (May 23, 2024) are extracted, which ensures that this information was not used for training of the considered LLMs. Second, we generate Q&A pairs using Gemini 2.5 Pro and randomly assemble two different RAG configurations: Answerable (containing the relevant chunk) and Unanswerable (containing irrelevant but similar chunks) queries. We collect natural responses from four target LLMs (Llama-2/3.1, Mistral-v0.1/v0.3). Finally, Gemini 2.5 Flash is used with a structured chain-of-thought prompt for substring verification to compare responses with the provided context, which returns granular, token-level hallucination annotations.

## 3 Methods

We first introduce the RAGognize dataset, then present the RAGognizer architectural approach for hallucination-aware LLM fine-tuning, followed by the joint training setup.

### 3.1 The RAGognize Dataset

Most existing hallucination benchmarks operate at the response level, rely on synthetic perturbations, or do not preclude open-domain settings, which limits fine-grained detection of hallucination or deviations from given evidence. To address this gap, we introduce the RAGognize dataset, designed for natural, token-level hallucination detection in closed-domain RAG scenarios. It is constructed in multiple steps and extends the HalluRAG approach ([Ridder and Schilling, 2025](https://arxiv.org/html/2604.15945#bib.bib29)) with increased prompt diversity and token-level annotations. As illustrated in [Fig. 2](https://arxiv.org/html/2604.15945#S2.F2), the pipeline consists of (i) sourcing of recent factual statements from Wikipedia, (ii) generation of diverse question–answer pairs, (iii) controlled assembly of answerable and unanswerable RAG prompts, (iv) response generation by multiple LLMs, and (v) automated token-level hallucination annotation.

As we want to keep relevant information restricted to the provided context, we adopt a strict recency constraint and extract factual statements from Wikipedia whose associated reference is time-stamped with a date later than May 23, 2024. This ensures that facts were not available for training and cannot be represented in the parametric knowledge of the evaluated models (we used Llama-2-7B-Chat (Llama2-7B) ([Touvron et al., 2023](https://arxiv.org/html/2604.15945#bib.bib30)), Llama-3.1-8B-Instruct (Llama3-8B) ([Grattafiori et al., 2024](https://arxiv.org/html/2604.15945#bib.bib31)), Mistral-7B-Instruct-v0.1 (Mistral-7B-v0.1), and Mistral-7B-Instruct-v0.3 (Mistral-7B-v0.3) ([Jiang et al., 2023](https://arxiv.org/html/2604.15945#bib.bib32))).

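
Steps (i) and (iii) of the pipeline can be sketched roughly as follows: statements are kept only if their reference is dated after the cutoff, and each question yields a pair of contexts that differ in whether the evidence chunk is present. This is a hedged illustration rather than the released pipeline code. The record layout, the placeholder texts, the helper names `filter_recent` and `assemble_prompt_pair`, and the choice to put the evidence chunk at a random position are assumptions.

```python
import random
from datetime import date

CUTOFF = date(2024, 5, 23)  # recency threshold stated in the text


def filter_recent(statements, cutoff=CUTOFF):
    """Keep statements whose reference date is strictly later than the cutoff."""
    return [s for s in statements if s["reference_date"] > cutoff]


def assemble_prompt_pair(question, evidence_chunk, distractors, rng):
    """Build an (answerable, unanswerable) pair sharing the same distractors.

    The answerable context swaps one distractor for the evidence chunk;
    picking the slot at random is an assumption of this sketch.
    """
    unanswerable_context = list(distractors)
    answerable_context = list(distractors)
    slot = rng.randrange(len(answerable_context))
    answerable_context[slot] = evidence_chunk
    return (
        {"question": question, "context": answerable_context, "answerable": True},
        {"question": question, "context": unanswerable_context, "answerable": False},
    )


# Placeholder records; real statements would come from Wikipedia references.
statements = [
    {"text": "placeholder statement A", "reference_date": date(2024, 5, 23)},
    {"text": "placeholder statement B", "reference_date": date(2024, 8, 1)},
]
recent = filter_recent(statements)

answerable, unanswerable = assemble_prompt_pair(
    question="placeholder question about statement B",
    evidence_chunk=recent[0]["text"],
    distractors=["distractor 1", "distractor 2", "distractor 3"],
    rng=random.Random(0),
)
```

Because both prompts are built from the same distractor list, they differ in exactly one context slot, which keeps the answerable and unanswerable variants comparable.
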
Therefore, RAGognize only deals with *Contextual Knowledge* or *No Knowledge* scenarios ([Fig. 1](https://arxiv.org/html/2604.15945#S1.F1)), which establishes a well-defined distinction between answerable and unanswerable queries.

For each factual statement, we use Gemini 2.5 Pro ([Comanici et al., 2025](https://arxiv.org/html/2604.15945#bib.bib33)) to generate diverse user queries and corresponding reference answers under stylistic variations (e.g., typographical errors, subjective framing, or adding misleading cues) that should encourage linguistic diversity. Answerable and unanswerable RAG prompts are then constructed with a modular template strategy that selectively inserts or withholds the context part containing the crucial evidence, while semantically similar distractor passages are retrieved using BGE-M3 ([Chen et al., 2023](https://arxiv.org/html/2604.15945#bib.bib34)). This procedure yields paired prompts that differ only in the availability of relevant contextual evidence. Answerable prompts are, in this way, formed by replacing one distractor with the ground-truth chunk containing the necessary evidence. All prompts in both the training and test splits are afterwards passed to Llama2-7B, Llama3-8B, Mistral-7B-v0.1, and Mistral-7B-v0.3 using greedy decoding (temperature0

Similar Articles

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

arXiv cs.CL

This paper introduces a proxy-analyzer framework that detects hallucinations in large language models by analyzing internal activations of small, open-weight models rather than the generator itself. The method achieves superior performance on benchmarks like RAGTruth compared to existing methods like ReDeEP, demonstrating that model size is less critical than the analysis approach.

TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

arXiv cs.CL

TPA proposes a novel method for detecting hallucinations in RAG systems by attributing next-token probabilities to seven distinct sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregating by Part-of-Speech tags. The approach achieves state-of-the-art performance across five LLMs including Llama2, Llama3, Mistral, and Qwen.