Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle
Summary
This paper identifies a blind spot in reference-free faithfulness metrics: they measure only precision (whether stated claims are supported), not recall (coverage of the relevant facts). The authors introduce a complete-oracle evaluation built on Formula 1 telemetry and NOAA weather forecasts, show that high-precision models often have poor coverage, and propose a single score that pairs faithfulness with coverage.
Cached at: 06/10/26, 12:08 AM
Source: https://huggingface.co/papers/2606.09376
Abstract
Reference-free faithfulness metrics share a blind spot: they measure only precision, so they reward abstention. In deterministic domains where the ground truth is complete, both precision and recall can be measured, which reveals that high-precision models often have poor fact coverage.
Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
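To make the abstention effect concrete, here is a minimal sketch of precision, recall, and F1 computed over atomic claims against a complete oracle. It is not the authors' released metric: the function name `score_claims`, the fact identifiers, and the use of the plain harmonic mean as the combined score are assumptions for illustration, as is the convention that an empty set of claims has precision 1.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def score_claims(claims: set[str], oracle: set[str]) -> Scores:
    """Score a model's atomic claims against the complete set of relevant facts.

    Assumption: claims and oracle facts are already normalized to comparable
    identifiers, so set intersection stands in for claim verification.
    """
    supported = claims & oracle
    # Assumed convention: making no claims is vacuously precise.
    precision = len(supported) / len(claims) if claims else 1.0
    recall = len(supported) / len(oracle) if oracle else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical facts that mattered for one pit-stop decision.
oracle = {"tyre_deg_high", "undercut_threat", "rain_in_10_laps", "pit_window_open"}

# A terse answer: one supported claim, nothing else.
terse = score_claims({"tyre_deg_high"}, oracle)

# A thorough answer: three supported claims and one unsupported claim.
thorough = score_claims(
    {"tyre_deg_high", "undercut_threat", "rain_in_10_laps", "safety_car_likely"},
    oracle,
)

print(terse)
print(thorough)
```

In this toy case the terse answer has perfect precision but covers only a quarter of the oracle, so it ranks below the thorough answer once recall is counted, which mirrors the reordering the abstract reports.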
Similar Articles
Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
This paper introduces BonaFide, a benchmark of 3,066 labeled chain-of-thought examples across 13 tasks and 10 models, and systematically evaluates faithfulness metrics, showing that most perform near chance and have significant limitations in reliability and efficiency.
Measuring AI Faithfulness - For Better or For Worse
This article discusses the importance of faithfulness in LLM optimization, introducing a Structural Fidelity Score that measures drift across word overlap, constraint survival, and task-type match to ensure prompt optimization does not sacrifice intent.
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
This paper introduces FRANQ, a method for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems by applying distinct uncertainty quantification techniques to distinguish between factuality and faithfulness to retrieved context. The authors construct a new dataset annotated for both factuality and faithfulness, and demonstrate that FRANQ outperforms existing approaches in detecting factual errors across multiple datasets and LLMs.
Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models
A new framework for automated benchmark generation enables fine-grained, comprehensive evaluation of foundation models with lower error rates and richer metadata, as demonstrated on ML, Corporate Finance, and Personal Finance benchmarks.
Are feedback systems more important than models for making AI agents useful?
A discussion about how feedback systems (static analysis, coverage tools, profiling) are more critical than the choice of LLM for making AI agents useful, illustrated by Oracle's work generating tests for GraalVM Native Image reflection metadata.