Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
Summary
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets with an LLM-first, human-adjudicated assessment method. The study finds that incorporating the models' explicit reasoning into the adjudication process increases agreement between human and model judgments, and it suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
# Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

Source: [https://arxiv.org/abs/2605.08462](https://arxiv.org/abs/2605.08462) | [View PDF](https://arxiv.org/pdf/2605.08462)

> Abstract: Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as RAG and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason- and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving two cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Similarly, model accuracy improved, with GPT increasing by 4.25% on QAGS-C and 2.34% on SummEval, while Gemini showed gains of 8.51% and 3.80%, respectively. Notably, adjudicators frequently sided with the models' judgments over original human annotations when LLMs provided explicit reasoning. Overall human adjudicator agreement ranged between 83% and 87%. These findings suggest that for ambiguity-prone tasks, single-pass annotations may be insufficient, and model-assisted re-evaluation yields more reliable benchmarks.

## Submission history

From: İsmail Furkan Atasoy

**[v1]** Fri, 8 May 2026 20:27:44 UTC (530 KB)
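To make the evaluation protocol in the abstract concrete, the minimal Python sketch below computes triple agreement, per-model accuracy, and the set of conflicted samples that would be routed to human adjudication. The data structures, field names, and toy labels are assumptions for illustration only; the paper does not publish code, and the actual datasets (QAGS-C, SummEval) and label formats may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    human: bool   # original benchmark label: True = summary contains a hallucination (assumed encoding)
    gpt: bool     # GPT-5 Mini judgment for the same sample
    gemini: bool  # Gemini 2.5 Flash judgment for the same sample

def triple_agreement(samples: List[Sample]) -> float:
    """Fraction of samples where the human label, GPT, and Gemini all agree."""
    agree = sum(1 for s in samples if s.human == s.gpt == s.gemini)
    return agree / len(samples)

def model_accuracy(samples: List[Sample], model: str) -> float:
    """Accuracy of one model ('gpt' or 'gemini') against the current human label."""
    correct = sum(1 for s in samples if getattr(s, model) == s.human)
    return correct / len(samples)

def conflicted_indices(samples: List[Sample]) -> List[int]:
    """Indices where at least one judge disagrees; in the paper these samples
    are re-evaluated by human adjudicators, after which the human label may change."""
    return [i for i, s in enumerate(samples) if not (s.human == s.gpt == s.gemini)]

# Toy example (labels are invented, not taken from QAGS-C or SummEval):
data = [
    Sample(human=True,  gpt=True,  gemini=True),
    Sample(human=False, gpt=True,  gemini=True),
    Sample(human=False, gpt=False, gemini=False),
]
print(triple_agreement(data))           # 0.667
print(model_accuracy(data, "gpt"))      # 0.667
print(conflicted_indices(data))         # [1] -> would go to adjudication
```

Under this reading, the reported gains come from recomputing `triple_agreement` and `model_accuracy` after adjudicators revise the human labels on the conflicted subset.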
Similar Articles
Evaluating LLMs as Human Surrogates in Controlled Experiments
This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently reproduce the effect magnitudes observed in humans, clarifying when synthetic LLM data can serve as behavioral proxies.
Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
This paper challenges the assumption that LLMs can reliably distinguish between hallucinated and factual outputs through internal signals, arguing that internal states primarily reflect knowledge recall rather than truthfulness. The authors propose a taxonomy of hallucinations (associated vs. unassociated) and show that associated hallucinations exhibit hidden-state geometries overlapping with factual outputs, making standard detection methods ineffective.
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
Researchers propose PRISM, a diagnostic benchmark that breaks down LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, reasoning), evaluating 24 LLMs to reveal trade-offs in mitigation strategies.
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
HumanLLM presents a framework for benchmarking and improving LLM anthropomorphism by modeling psychological patterns as interacting causal forces, constructing 244 patterns from academic literature and 11,359 multi-pattern scenarios. The approach demonstrates that authentic human alignment requires cognitive modeling rather than shallow behavioral mimicry, with HumanLLM-8B outperforming larger models like Qwen3-32B on multi-pattern dynamics.
Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation
This paper investigates how fine-tuning LLMs on new knowledge induces factual hallucinations, showing that unfamiliarity within specific knowledge types drives hallucinations through weakened attention to key entities. The authors propose mitigating this by reintroducing known knowledge during later training stages.