An Empirical Analysis of Factual Errors in Human-Written Text and its Application
Summary
The paper presents a taxonomy of factual errors in human-written text, derived from newspaper corrections, and evaluates how well LLMs detect these errors. Even top models such as GPT-5.4 achieve a word-level F1 score of only 52% on the synthetic evaluation data, which highlights the difficulty of the task.
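To make the reported metric concrete, here is a minimal sketch of one common way to compute a word-level F1 score for span detection. It assumes the gold and predicted error spans have already been mapped to sets of word indices; the function name `word_level_f1` and the example sentence are hypothetical, and the paper's exact definition (including how Japanese text is segmented into words) may differ.

```python
def word_level_f1(gold: set[int], pred: set[int]) -> dict[str, float]:
    """Precision, recall, and F1 over word indices flagged as erroneous.

    Assumption: each word is identified by its position in the tokenized
    text, and a word counts as correct only if both sets contain it.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the true error is the day "5" (index 6),
# but the detector flags both "March" (index 5) and "5" (index 6).
tokens = "The meeting was held on March 5 in Osaka".split()
gold_error_words = {6}
predicted_error_words = {5, 6}

scores = word_level_f1(gold_error_words, predicted_error_words)
print(scores)
```

In this example the detector finds the erroneous word but also flags a correct one, so recall is perfect while precision is halved. Under this kind of scoring, over-flagging is penalized as much as missing an error.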
# An Empirical Analysis of Factual Errors in Human-Written Text and its Application

Source: [https://arxiv.org/abs/2606.27959](https://arxiv.org/abs/2606.27959) [View PDF](https://arxiv.org/pdf/2606.27959)

> Abstract: Factual Error Detection (FED), which is the task of identifying factually incorrect spans in a given text, has long been recognized as an important research problem. However, with the rapid rise of large language models (LLMs), research attention has shifted toward factual errors specific to LLM-generated text (hallucinations) and their detection. As a result, the detection of factual errors in human-written text has been relatively neglected. To address this gap, we first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles, a representative source of text that is guaranteed to be human-written and contains few grammatical errors. Our analysis revealed that there are characteristic categories, such as kanji misconversions and numeral classifier errors, which are not the focus of existing hallucination benchmarks. Based on the taxonomy, we then evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and real corrections. Experimental results demonstrated that even high-performance LLMs such as GPT-5.4 achieved a word-level F1 score of only 52% on the synthetic evaluation data, highlighting the task difficulty. Furthermore, a detailed analysis by detection difficulty revealed the current state of FED.

## Submission history

From: Shotaro Ishihara
**[v1]** Fri, 26 Jun 2026 11:03:18 UTC (220 KB)
Similar Articles
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
This paper presents a systematic human audit of NL-to-FOL datasets FOLIO and MALLS, finding 39% and 36% incorrect formalizations respectively. It releases corrected ground truths and an LLM-assisted framework to focus human relabeling, reducing the review workload to under 24% of instances for 90% accuracy.
FACTS Grounding: A new benchmark for evaluating the factuality of large language models
DeepMind introduces FACTS Grounding, a comprehensive benchmark with 1,719 examples for evaluating how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a public dataset and an online Kaggle leaderboard tracking LLM performance on factual accuracy and grounding tasks.
WebGPT: Improving the factual accuracy of language models through web browsing
OpenAI fine-tuned GPT-3 to answer open-ended questions more accurately by enabling it to use a text-based web browser to search, retrieve, and cite sources. The model outperforms human demonstrators 56% of the time on questions from ELI5 dataset but shows limitations on out-of-distribution tasks like TruthfulQA.
Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
This paper investigates LLM-based generative error correction (GER) for low-resource West Frisian ASR, using a contamination-aware evaluation with a private dataset to show that GPT-5.1 reduces errors beyond oracle levels.
Can Factual Opinions Be Edited (Manipulated) in Large Language Models?
This paper introduces the FactualOpinionEditing with Evidence (FOE) benchmark to assess the ability to edit factual opinions in LLMs, and proposes a Self-Generated Evidence-Aligned method to improve opinion-evidence alignment.