Tag
The paper presents a taxonomy of factual errors in human-written text, derived from newspaper corrections, and evaluates LLMs' performance on detecting these errors, finding that even top models like GPT-5.4 achieve only 52% word-level F1 score, highlighting the task's difficulty.