An Empirical Analysis of Factual Errors in Human-Written Text and its Application

arXiv cs.CL Papers

Summary

The paper presents a taxonomy of factual errors in human-written text, derived from newspaper corrections, and evaluates how well LLMs detect these errors. Even high-performing models such as GPT-5.4 reach a word-level F1 score of only 52% on the synthetic evaluation data, which underscores the task's difficulty.

arXiv:2606.27959v1 (new submission). Abstract: Factual Error Detection (FED), the task of identifying factually incorrect spans in a given text, has long been recognized as an important research problem. However, with the rapid rise of large language models (LLMs), research attention has shifted toward factual errors specific to LLM-generated text (hallucinations) and their detection. As a result, the detection of factual errors in human-written text has been relatively neglected. To address this gap, we first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles, a representative source of text that is guaranteed to be human-written and contains few grammatical errors. Our analysis revealed characteristic categories, such as kanji misconversions and numeral classifier errors, that existing hallucination benchmarks do not focus on. Based on the taxonomy, we then evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and real corrections. Experimental results demonstrated that even high-performance LLMs such as GPT-5.4 achieved a word-level F1 score of only 52% on the synthetic evaluation data, highlighting the difficulty of the task. Furthermore, a detailed analysis by detection difficulty revealed the current state of FED.
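
The abstract reports a word-level F1 score but does not say how it is computed. As a rough illustration only, the sketch below shows one common way to score span detection at the word level: treat each flagged token position as a positive prediction and compare it with the gold error positions. The function name `word_level_f1`, the whitespace tokenization, and the example sentence are all assumptions made for illustration; the paper's own tokenization (Japanese text is not whitespace-delimited) and scoring rules may differ.

```python
def word_level_f1(gold: set[int], pred: set[int]) -> dict[str, float]:
    """Score predicted error positions against gold error positions.

    Both arguments are sets of token indices marked as factually wrong.
    This is an illustrative definition, not the paper's confirmed metric.
    """
    true_positives = len(gold & pred)
    # Assumption: precision and recall default to 0.0 when undefined.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the gold error is the city name at index 5.
tokens = "The summit was held in Osaka on 3 May 2021".split()
gold_positions = {5}
# A detector that flags the city and, wrongly, the year at index 9.
pred_positions = {5, 9}

scores = word_level_f1(gold_positions, pred_positions)
print(scores)
```

Under this definition, a detector that flags the correct word plus one spurious word gets full recall but only half precision, so over-flagging is penalized as much as missing errors.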

Cached at: 06/29/26, 05:24 AM

# An Empirical Analysis of Factual Errors in Human-Written Text and its Application
Source: [https://arxiv.org/abs/2606.27959](https://arxiv.org/abs/2606.27959)
[View PDF](https://arxiv.org/pdf/2606.27959)

## Submission history

From: Shotaro Ishihara

**[v1]** Fri, 26 Jun 2026 11:03:18 UTC (220 KB)
