Tag
This paper shows that much of the reported progress in LLM hallucination detection stems from benchmark construction artifacts: ground-truth answers are embedded in the prompts, so a simple text-similarity baseline achieves near-perfect scores. In a large-scale controlled evaluation, the authors find that most methods perform near chance under proper controls; the exceptions are supervised probes on upper-layer hidden states, such as SAPLMA and the authors' proposed DRIFT.
This paper investigates how resilient AI-generated text detection methods (fine-tuned RoBERTa, Binoculars, text feature analysis, and ensembles) are to paraphrasing attacks. It finds that ensembles that include Binoculars are the most effective but also the most vulnerable to these attacks, highlighting a trade-off between performance and resilience.