llm-assessment

Tag

Cards List
#llm-assessment

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

arXiv cs.CL · yesterday Cached

This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.

0 favorites 0 likes
← Back to home

Submit Feedback