An Empirical Analysis of Factual Errors in Human-Written Text and its Application

arXiv cs.CL 06/29/26, 04:00 AM Papers

factual-error-detection human-written-text llm-evaluation taxonomy nlp error-analysis

Summary

The paper presents a taxonomy of factual errors in human-written text, derived from newspaper corrections, and evaluates how well LLMs detect these errors. Even top models such as GPT-5.4 achieve a word-level F1 score of only 52% on the synthetic evaluation data, which highlights the task's difficulty.

arXiv:2606.27959v1 Announce Type: new Abstract: Factual Error Detection (FED), the task of identifying factually incorrect spans in a given text, has long been recognized as an important research problem. However, with the rapid rise of large language models (LLMs), research attention has shifted toward factual errors specific to LLM-generated text (hallucinations) and their detection. As a result, the detection of factual errors in human-written text has been relatively neglected. To address this gap, we first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles, a representative source of text that is guaranteed to be human-written and contains few grammatical errors. Our analysis revealed characteristic categories, such as kanji misconversions and numeral classifier errors, that existing hallucination benchmarks do not focus on. Based on the taxonomy, we then evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and real corrections. Experimental results demonstrated that even high-performance LLMs such as GPT-5.4 achieved a word-level F1 score of only 52% on the synthetic evaluation data, highlighting the difficulty of the task. Furthermore, a detailed analysis by detection difficulty revealed the current state of FED.

Cached at: 06/29/26, 05:24 AM

# An Empirical Analysis of Factual Errors in Human-Written Text and its Application
Source: [https://arxiv.org/abs/2606.27959](https://arxiv.org/abs/2606.27959)
[View PDF](https://arxiv.org/pdf/2606.27959)

> Abstract: Factual Error Detection (FED), the task of identifying factually incorrect spans in a given text, has long been recognized as an important research problem. However, with the rapid rise of large language models (LLMs), research attention has shifted toward factual errors specific to LLM-generated text (hallucinations) and their detection. As a result, the detection of factual errors in human-written text has been relatively neglected. To address this gap, we first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles, a representative source of text that is guaranteed to be human-written and contains few grammatical errors. Our analysis revealed characteristic categories, such as kanji misconversions and numeral classifier errors, that existing hallucination benchmarks do not focus on. Based on the taxonomy, we then evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and real corrections. Experimental results demonstrated that even high-performance LLMs such as GPT-5.4 achieved a word-level F1 score of only 52% on the synthetic evaluation data, highlighting the difficulty of the task. Furthermore, a detailed analysis by detection difficulty revealed the current state of FED.
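
The abstract reports a word-level F1 score but does not spell out how it is computed. As a rough illustration only, the sketch below shows one common convention: treat every word inside a flagged span as a positive, then compute precision, recall, and F1 over word indices. The function names, the half-open `(start, end)` span format, the whitespace tokenization, and the example sentence are all assumptions made for this sketch, not details taken from the paper. Text without whitespace word boundaries would need a proper word segmenter instead of `str.split`.

```python
def spans_to_word_indices(spans):
    """Expand half-open (start, end) word spans into a set of word indices.

    The span format is an assumption of this sketch.
    """
    return {i for start, end in spans for i in range(start, end)}


def word_level_f1(gold_indices, pred_indices):
    """F1 over word indices flagged as erroneous (one possible definition)."""
    gold, pred = set(gold_indices), set(pred_indices)
    if not gold and not pred:
        # Nothing to find and nothing flagged: count as a perfect match.
        return 1.0
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: whitespace tokens, indices 0..9.
tokens = "The summit was held in Osaka on 3 May 2021".split()
gold = spans_to_word_indices([(5, 6), (7, 9)])  # "Osaka", "3 May"
pred = spans_to_word_indices([(7, 10)])         # "3 May 2021"
score = word_level_f1(gold, pred)
print(f"word-level F1 = {score:.3f}")
```

Under this convention a detector is rewarded for partial overlap with a gold span, which is more lenient than exact span matching. Whether the paper averages per document or pools words over the whole dataset is not stated in the abstract.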

## Submission history

From: Shotaro Ishihara \[[view email](https://arxiv.org/show-email/238eaff6/2606.27959)\] **\[v1\]** Fri, 26 Jun 2026 11:03:18 UTC (220 KB)

Similar Articles

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

WebGPT: Improving the factual accuracy of language models through web browsing

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?
