Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
Summary
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets with an LLM-first, human-adjudicated assessment method. The study finds that incorporating the models' explicit reasoning into the adjudication process increases agreement between human and model judgments, and it suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
# Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

Source: [https://arxiv.org/abs/2605.08462](https://arxiv.org/abs/2605.08462) | [View PDF](https://arxiv.org/pdf/2605.08462)

> Abstract: Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as RAG and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason- and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving two cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Similarly, model accuracy improved, with GPT increasing by 4.25% on QAGS-C and 2.34% on SummEval, while Gemini showed gains of 8.51% and 3.80%, respectively. Notably, adjudicators frequently sided with the models' judgments over original human annotations when LLMs provided explicit reasoning. Overall human adjudicator agreement ranged between 83% and 87%. These findings suggest that for ambiguity-prone tasks, single-pass annotations may be insufficient, and model-assisted re-evaluation yields more reliable benchmarks.

## Submission history

From: İsmail Furkan Atasoy

**[v1]** Fri, 8 May 2026 20:27:44 UTC (530 KB)
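To make the evaluation protocol in the abstract concrete, the minimal Python sketch below computes triple agreement, per-model accuracy, and the set of conflicted samples that would be routed to human adjudication. The data structures, field names, and toy labels are assumptions for illustration only; the paper does not publish code, and the actual datasets (QAGS-C, SummEval) and label formats may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    human: bool   # original benchmark label: True = summary contains a hallucination (assumed encoding)
    gpt: bool     # GPT-5 Mini judgment for the same sample
    gemini: bool  # Gemini 2.5 Flash judgment for the same sample

def triple_agreement(samples: List[Sample]) -> float:
    """Fraction of samples where the human label, GPT, and Gemini all agree."""
    agree = sum(1 for s in samples if s.human == s.gpt == s.gemini)
    return agree / len(samples)

def model_accuracy(samples: List[Sample], model: str) -> float:
    """Accuracy of one model ('gpt' or 'gemini') against the current human label."""
    correct = sum(1 for s in samples if getattr(s, model) == s.human)
    return correct / len(samples)

def conflicted_indices(samples: List[Sample]) -> List[int]:
    """Indices where at least one judge disagrees; in the paper these samples
    are re-evaluated by human adjudicators, after which the human label may change."""
    return [i for i, s in enumerate(samples) if not (s.human == s.gpt == s.gemini)]

# Toy example (labels are invented, not taken from QAGS-C or SummEval):
data = [
    Sample(human=True,  gpt=True,  gemini=True),
    Sample(human=False, gpt=True,  gemini=True),
    Sample(human=False, gpt=False, gemini=False),
]
print(triple_agreement(data))           # 0.667
print(model_accuracy(data, "gpt"))      # 0.667
print(conflicted_indices(data))         # [1] -> would go to adjudication
```

Under this reading, the reported gains come from recomputing `triple_agreement` and `model_accuracy` after adjudicators revise the human labels on the conflicted subset.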
Similar Articles
Evaluating LLMs as Human Surrogates in Controlled Experiments
This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently reproduce the effect magnitudes observed in humans, clarifying when synthetic LLM data can serve as behavioral proxies.
Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
This paper challenges the assumption that LLMs can reliably distinguish between hallucinated and factual outputs through internal signals, arguing that internal states primarily reflect knowledge recall rather than truthfulness. The authors propose a taxonomy of hallucinations (associated vs. unassociated) and show that associated hallucinations exhibit hidden-state geometries overlapping with factual outputs, making standard detection methods ineffective.
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
Researchers propose PRISM, a diagnostic benchmark that breaks down LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, reasoning), evaluating 24 LLMs to reveal trade-offs in mitigation strategies.
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
HumanLLM presents a framework for benchmarking and improving LLM anthropomorphism by modeling psychological patterns as interacting causal forces, constructing 244 patterns from academic literature and 11,359 multi-pattern scenarios. The approach demonstrates that authentic human alignment requires cognitive modeling rather than shallow behavioral mimicry, with HumanLLM-8B outperforming larger models like Qwen3-32B on multi-pattern dynamics.
Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation
This paper investigates how fine-tuning LLMs on new knowledge induces factual hallucinations, showing that unfamiliarity within specific knowledge types drives hallucinations through weakened attention to key entities. The authors propose mitigating this by reintroducing known knowledge during later training stages.