Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval-Augmented Generation


Summary

This paper introduces FRANQ, a method for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems that applies distinct uncertainty quantification techniques to estimate factuality depending on whether a statement is faithful to the retrieved context. The authors construct a new dataset annotated for both factuality and faithfulness, and demonstrate that FRANQ outperforms existing approaches in detecting factual errors across multiple datasets and LLMs.


# Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval-Augmented Generation

Source: https://arxiv.org/html/2505.21072

###### Abstract

Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce **franq**, a new method for hallucination detection in RAG outputs. **franq** applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate **franq** and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that **franq** achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches. Our implementation is available at https://github.com/stat-ml/rag_uncertainty.

## 1 Introduction

Large Language Models (LLMs) are increasingly employed across a wide range of tasks. However, LLMs are prone to producing plausible but factually incorrect outputs, a phenomenon known as hallucination, arising from factors such as insufficient training data coverage, input ambiguity, and architectural constraints (Huang et al., 2025). Retrieval-Augmented Generation (RAG; Lewis et al., 2020) addresses this issue by incorporating dynamically retrieved external knowledge into the generation process, which can partially reduce factual inaccuracies (Shuster et al., 2021). However, RAG systems still produce hallucinations (Shi et al., 2023). Moreover, the use of retrieved information makes it more challenging to detect hallucinations and to determine their original source. Models become more confident in generating statements that appear in the retrieved passages, regardless of their factual correctness (Kim et al., 2025). At the same time, the retrieved passages themselves may be erroneous, incomplete, or entirely irrelevant to the query (Shi et al., 2023; Ding et al., 2024). Conversely, even when retrieval is accurate, inconsistencies can emerge between the model's internal knowledge and the retrieved data (Wang et al., 2024a, 2025). Thus, an important question is how to define **hallucination** in RAG, given the interplay between the model's internal knowledge and the retrieved context.

One approach is to consider any content that is not directly supported by the retrieved context as a hallucination (Niu et al., 2024). However, we argue that hallucination should be defined based on factual inaccuracies rather than strict contextual alignment. Specifically, a generated statement that originates from the LLM's internal knowledge but lies outside the retrieved context should not be considered a hallucination if it is factually correct.

![Figure 1: franq illustration. Left: A user poses a question, and the RAG retrieves relevant documents and formulates an answer, potentially using information from the retrieved documents. Middle: The RAG output is decomposed into atomic claims. Right: The franq method assesses factuality by evaluating three components: (1) faithfulness, (2) factuality under faithful condition, and (3) factuality under unfaithful condition.](image-1)

To address this distinction, we differentiate between **factuality** and **faithfulness**. Faithfulness refers to whether the generated output is semantically entailed by the retrieved context, while factuality indicates whether the content is objectively correct (Maynez et al., 2020; Dziri et al., 2022; Yang et al., 2024). For RAG fact-checking, detecting non-factual claims is more critical than identifying unfaithful ones. This distinction disentangles two core RAG failure modes: (i) hallucinations caused by erroneous grounding in the retrieved context, and (ii) factual errors stemming from the model's internal knowledge (Zhou et al., 2024).

In this paper, we investigate the detection of non-factual statements produced by RAG using Uncertainty Quantification (UQ) techniques. We introduce **franq** (Faithfulness-aware Retrieval Augmented UNcertainty Quantification), a novel method that first evaluates the faithfulness of the generated response and subsequently applies different UQ methods based on the outcome. With this separation, **franq** tailors its strategy to the specific RAG failure mode: whether it originates from retrieval grounding or from the model's own knowledge.
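To make this high-level design concrete, the sketch below combines the three components from Figure 1 via the law of total probability. This is an illustrative reading of the method with placeholder scores and hypothetical function names, not the authors' implementation (the released repository contains the actual code).

```python
# Illustrative sketch of the faithfulness-conditioned combination behind franq.
# The argument names below are placeholders for the component estimators
# described in the paper (e.g. AlignScore for faithfulness, Claim Probability
# for the faithful case, Parametric Knowledge for the unfaithful case).

def franq_score(p_faithful: float,
                p_factual_if_faithful: float,
                p_factual_if_unfaithful: float) -> float:
    """Combine the three components via the law of total probability:
    P(factual) = P(faithful) * P(factual | faithful)
               + (1 - P(faithful)) * P(factual | unfaithful)."""
    return (p_faithful * p_factual_if_faithful
            + (1.0 - p_faithful) * p_factual_if_unfaithful)

# Example: a claim weakly supported by the retrieval but well known to the model.
score = franq_score(p_faithful=0.2,              # e.g. AlignScore vs. retrieved passages
                    p_factual_if_faithful=0.9,   # e.g. claim probability under the context
                    p_factual_if_unfaithful=0.85)  # e.g. parametric-knowledge estimate
print(f"Estimated factuality: {score:.3f}")
```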

We evaluate **franq** on both long- and short-form question answering (QA) tasks. For long-form QA, where answers include multiple claims, we assess factuality at the claim level and introduce a new dataset with factuality annotations, combining automated labeling with manual validation. For short-form QA, we test our method on four QA datasets and treat each response as a single claim.

Our key contributions are as follows:

- We develop a new UQ method for RAG, **franq**, that estimates uncertainty by first assessing faithfulness and then applying different uncertainty quantification methods to faithful and unfaithful outputs; see Section 2.
- We develop a long-form QA factuality dataset for RAG. The dataset incorporates both factuality and faithfulness labels, and was built by combining automatic annotation with manual validation for difficult cases; see Section 3.
- We conduct comprehensive experiments on both long- and short-form QA with several LLMs, demonstrating that **franq** improves the detection of factual errors in RAG outputs compared to other approaches; see Section 4.

## 2 Uncertainty Quantification for RAG

Let **x** be the user query submitted to the RAG system. The system retrieves k passages denoted by **r** = {r₁, ..., rₖ} from an external knowledge source using **x** as the query. The RAG system then uses an LLM to generate an output **y**, conditioned on both **x** and **r**. Autoregressive LLMs produce text sequentially, generating one token at a time. At each step t, the model samples a token yₜ ~ p(·|y₀, ..., yₜ₋₁, **x**, **r**).
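As an illustration of the generative model above, the following sketch shows how the per-token probabilities p(yₜ | y₀, ..., yₜ₋₁, **x**, **r**) can be read off an autoregressive LLM prompted with the query and retrieved passages. The model name and prompt template are illustrative choices, not the exact setup used in the paper.

```python
# Sketch: extracting per-token probabilities p(y_t | y_<t, x, r) from an
# autoregressive LLM prompted with the query x and retrieved passages r.
# Model choice and prompt format are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_probabilities(query: str, passages: list[str], answer: str) -> list[float]:
    """Return p(y_t | y_<t, x, r) for each token of a given answer."""
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)

    start = prompt_ids.shape[1]
    # Probability of each answer token given all preceding tokens.
    return [probs[0, t - 1, input_ids[0, t]].item()
            for t in range(start, input_ids.shape[1])]
```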

[Rest of mathematical and technical content continues with similar formatting...]

## Appendix D Additional Ablation Studies

**Table 14:** Results averaged across 4 QA datasets for Llama 3B Instruct considering only claims with high and low AlignScore.

### D.1 **franq** with Alternative Faithfulness Estimators

Table 15 compares the performance of three original **franq** versions (each employing a different calibration strategy) with three modified versions that use a thresholded AlignScore instead of raw AlignScore probabilities. In the thresholded versions, the faithfulness probability is defined as P(c is faithful to r) = 1(AlignScore(c) > T) with T = 0.5. These methods are denoted by the 'T=0.5' label. The results indicate that, overall, the continuous versions of **franq** outperform their thresholded counterparts.

Table 14 further compares the performance of three original **franq** versions with a condition-calibrated version of **franq** that also calibrates AlignScore for faithfulness estimation (this method is denoted 'franq condition-calibrated, faithfulness-calibrated'). In this version, the AlignScore is calibrated using a training set with binary gold faithfulness targets and then incorporated into the **franq** formula. The results suggest that calibrating AlignScore may reduce the PRR of **franq**, indicating that it might be more effective to use AlignScore without faithfulness calibration.

**Table 15:** Comparison of **franq** performance on Llama 3B Instruct benchmarks, when using AlignScore with and without threshold.
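To make the thresholded variant concrete, the sketch below contrasts the raw AlignScore used as a continuous faithfulness probability with the indicator 1(AlignScore(c) > T) at T = 0.5, plugged into the same illustrative combination formula as before. This is a hypothetical sketch with made-up numbers, not the authors' code.

```python
# Sketch of the two faithfulness estimators compared in Table 15:
# (a) continuous: use the raw AlignScore directly as P(claim faithful to retrieval);
# (b) thresholded: P(faithful) = 1 if AlignScore > T else 0, with T = 0.5.

def franq(p_faithful: float, p_fact_faithful: float, p_fact_unfaithful: float) -> float:
    return p_faithful * p_fact_faithful + (1.0 - p_faithful) * p_fact_unfaithful

def continuous_faithfulness(align_score: float) -> float:
    return align_score

def thresholded_faithfulness(align_score: float, threshold: float = 0.5) -> float:
    return 1.0 if align_score > threshold else 0.0

# A borderline claim (AlignScore = 0.55): the thresholded variant snaps entirely to
# the faithful branch, while the continuous variant still blends in the unfaithful one.
align, p_if_faithful, p_if_unfaithful = 0.55, 0.9, 0.3
print(franq(continuous_faithfulness(align), p_if_faithful, p_if_unfaithful))   # 0.63
print(franq(thresholded_faithfulness(align), p_if_faithful, p_if_unfaithful))  # 0.9
```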

### D.2 Analysis of XGBoost

We examine the first tree from an XGBoost model trained on **franq** features (AlignScore, Claim Probability, and Parametric Knowledge) for long-form QA with Llama 3B Instruct. While XGBoost uses multiple trees, the first tree often captures key decision patterns. Figure 14 presents the first several nodes of the first XGBoost tree. The root splits on AlignScore. If it is high, the model next considers Claim Probability; if low, it turns to Parametric Knowledge. This mirrors **franq**'s logic: first assessing faithfulness via AlignScore, then applying either Claim Probability or Parametric Knowledge. The tree thus exhibits structure similar to **franq**'s decision process.

![Figure 14: Top vertices of first XGBoost tree trained on franq components (ClaimProb) for long-form QA Llama 3B Instruct benchmark.](image-14)
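The following sketch reproduces this kind of inspection on synthetic data: it trains an XGBoost classifier on three per-claim features named after the **franq** components and prints the text dump of the first tree. The feature values and labels are randomly generated stand-ins, not the paper's annotations.

```python
# Sketch: train an XGBoost classifier on three per-claim features
# (AlignScore, Claim Probability, Parametric Knowledge) and inspect the
# first tree, as in the ablation above. Data here is synthetic.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame(rng.uniform(size=(n, 3)),
                 columns=["AlignScore", "ClaimProb", "ParamKnow"])
# Synthetic labels that loosely follow the franq intuition: factuality driven by
# Claim Probability when AlignScore is high, by Parametric Knowledge otherwise.
p_factual = np.where(X["AlignScore"] > 0.5, X["ClaimProb"], X["ParamKnow"])
y = (rng.uniform(size=n) < p_factual).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)

# Text dump of the first tree: the root split and its children show which
# feature the ensemble consults first.
print(model.get_booster().get_dump()[0])
```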

### D.3 Calibration Properties of UQ Methods

We evaluate the calibration properties of all our UQ methods using the Expected Calibration Error (ECE; Guo et al., 2017). ECE quantifies the alignment between predicted confidence scores and observed accuracy. Specifically, predictions are partitioned into 10 equally spaced confidence bins. Within each bin, we compute the average predicted confidence and compare it to the empirical accuracy. Lower ECE values indicate better-calibrated models. Table 16 reports ECE scores for both the long-form QA dataset and the short-form QA benchmark using the Llama 3B Instruct model. Only UQ methods that produce confidence values within the [0, 1] interval are included, as this is a prerequisite for ECE computation. Notably, the two calibrated variants of **franq** achieve the best calibration performance across datasets.

**(a) Long-form QA Llama 3B Instruct dataset.**

**(b) Short-form QA Llama 3B Instruct benchmark (ECE is averaged across 4 QA datasets).**

**Table 16:** Expected Calibration Error (ECE) for all tested UQ methods with Llama 3B Instruct.
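For reference, the sketch below is a minimal ECE implementation with 10 equal-width bins, following the description above. It is a generic sketch of the metric, not the authors' evaluation code.

```python
# Minimal Expected Calibration Error (ECE) with equal-width confidence bins,
# following Guo et al. (2017): ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)|.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence to a bin index in [0, n_bins - 1]; 1.0 falls in the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Example: confidence scores from a UQ method vs. binary factuality labels.
conf = np.array([0.95, 0.80, 0.62, 0.40, 0.15])
labels = np.array([1, 1, 0, 1, 0])
print(f"ECE = {expected_calibration_error(conf, labels):.3f}")
```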

## Appendix E Resource and Expenses

A full data-generation and UQ-baseline evaluation run required about 8 days of compute on an NVIDIA V100 32GB GPU for long-form QA, while short-form QA needed under one day. The OpenAI API was used for claim splitting, matching, and annotation, costing roughly $100 per model run (Llama 3B Instruct). Human annotation involved six student annotators, each contributing about three hours of work.

## Appendix F **franq** Examples

In Figure 15, we demonstrate the behavior of **franq** using three examples from the long-form QA dataset evaluated with Llama 3B Instruct. We selected three representative claims and present their **franq** scores for both the uncalibrated and condition-calibrated versions. The latter uses monotonic functions f and g, fitted via isotonic regression for the Claim Probability and Parametric Knowledge methods, respectively.

![Figure 15(a): Faithful–True. franq correctly identifies the claim as faithful and uses Claim Probability, which detects high entailment with the third retrieved passage. This results in an appropriately high franq score.](image-15a)

![Figure 15(b): Unfaithful–True. franq accurately detects the claim's low faithfulness and assigns its factuality score based on Parametric Knowledge, which is relatively high. In the uncalibrated version, the final score is underestimated due to the uncalibrated Parametric Knowledge score. The condition-calibrated version corrects this by assigning a calibrated score of 0.85, resulting in a correctly high factuality estimate.](image-15b)

![Figure 15(c): Unfaithful–False. franq correctly identifies the claim as unfaithful and assigns a low factuality score using Parametric Knowledge, consistent across both the uncalibrated and calibrated versions.](image-15c)

**Figure 15:** Example outputs from **franq**. Left: Each example includes the input question, retrieved passages, the LLM-generated answer, a selected claim from the answer, and corresponding factuality and faithfulness annotations. Claims and their spans in the answer are highlighted in yellow. If a claim is faithful, its corresponding span in the retrieved passages is also highlighted. Right: The **franq** component scores and final factuality estimations, shown for both the uncalibrated and condition-calibrated versions.
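As an illustration of how monotonic calibration functions such as f and g can be fitted, the sketch below applies isotonic regression from scikit-learn to synthetic calibration data. All scores, labels, and variable names here are illustrative assumptions rather than the paper's data or code.

```python
# Sketch: fitting monotonic calibration functions f and g with isotonic
# regression (scikit-learn). Scores and labels below are synthetic.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Raw component scores on a held-out calibration set, with binary factuality labels.
claim_prob_scores = rng.uniform(size=500)   # stand-in for Claim Probability scores
param_know_scores = rng.uniform(size=500)   # stand-in for Parametric Knowledge scores
labels_faithful = (rng.uniform(size=500) < claim_prob_scores).astype(int)
labels_unfaithful = (rng.uniform(size=500) < param_know_scores ** 2).astype(int)

# f calibrates Claim Probability (used for faithful claims),
# g calibrates Parametric Knowledge (used for unfaithful claims).
f = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
g = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
f.fit(claim_prob_scores, labels_faithful)
g.fit(param_know_scores, labels_unfaithful)

# Calibrated scores for a new claim's raw component values:
print(f.predict([0.7]), g.predict([0.7]))
```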

## Appendix G The Usage of LLMs

In this study, large language models are examined primarily as the object of analysis. For practical tasks such as programming and writing, we also make limited use of LLM-based assistants (e.g., ChatGPT) for grammar correction and code debugging, with all such use carefully supervised by human researchers.

Similar Articles

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

arXiv cs.CL

RAGognizer introduces a hallucination-aware fine-tuning approach that integrates a lightweight detection head into LLMs for joint optimization of language modeling and hallucination detection in RAG systems. The paper presents RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and demonstrates state-of-the-art hallucination detection while reducing hallucination rates without degrading language quality.

TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

arXiv cs.CL

TPA proposes a novel method for detecting hallucinations in RAG systems by attributing next-token probabilities to seven distinct sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregating by Part-of-Speech tags. The approach achieves state-of-the-art performance across five LLMs including Llama2, Llama3, Mistral, and Qwen.

Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

arXiv cs.CL

Disco-RAG proposes a discourse-aware retrieval-augmented generation framework that integrates discourse signals through intra-chunk discourse trees and inter-chunk rhetorical graphs to improve knowledge synthesis in LLMs. The method achieves state-of-the-art results on QA and summarization benchmarks without fine-tuning.

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

Google DeepMind Blog

DeepMind introduces FACTS Grounding, a comprehensive benchmark with 1,719 examples for evaluating how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a public dataset and an online Kaggle leaderboard tracking LLM performance on factual accuracy and grounding tasks.