Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
Summary
This paper presents a large-scale audit of human annotation reporting in NLP from 2018 to 2025. It finds that critical annotation details are documented inconsistently, although reporting has improved over time, and it provides a framework and recommendations for better reporting.
Cached at: 06/02/26, 03:35 PM
Source: https://huggingface.co/papers/2606.02255
Abstract
Large-scale audit of human annotation reporting in NLP reveals inconsistent documentation of critical annotation details, with improvements over time but ongoing gaps in reproducibility and reliability.
Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff’s alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
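The abstract compares model-to-gold and human-to-human agreement using Krippendorff's alpha. As a rough illustration of how that statistic is computed, here is a minimal sketch for nominal labels. It is not the authors' code: the function name `krippendorff_alpha_nominal` and the toy labels are assumptions made for this example, and the paper may use a different distance metric or an existing library.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    units: one inner list per annotated item, holding the labels given by
    each annotator; None marks a missing label.
    """
    # Build the coincidence matrix: every ordered pair of labels within a
    # unit contributes 1 / (m - 1), where m is the number of labels there.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category, and the total number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    # Nominal distance: 0 if the labels match, 1 otherwise.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1.0 - observed / expected


# Hypothetical toy data: three items, two annotators each.
toy_units = [["yes", "yes"], ["no", "no"], ["yes", "no"]]
alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Under this reading, the reported 0.606 and 0.585 would each come from one such computation over the same items, once for model versus adjudicated labels and once for human versus human labels.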
Similar Articles
@vintcessun: Do you really know who the annotators are in the NLP papers you read? An audit of ACL papers from 2018-2025 reveals that key details such as annotator training, language proficiency, and compensation are often missing, especially in model evaluation studies. This directly threatens research reproducibility and reliability. This paper proposes a unified taxonomy + LLM-assisted automatic extraction pipeline, evaluated on 2,667 annotation tasks…
A large-scale audit of ACL papers from 2018-2025 reveals that key annotation details (training, language proficiency, compensation, etc.) are often missing, threatening reproducibility. The authors propose a unified taxonomy and an LLM-assisted extraction pipeline evaluated on 2,667 annotation tasks.
Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation
This paper presents a large-scale analysis of four harmful language detection datasets, examining how annotator characteristics and linguistic features interact to influence annotation variation. It highlights intersectional effects and warns against generalizing findings across different datasets.
The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints
This critical survey examines the Annotation Scarcity Paradox in low-resource NLP evaluation, where rapid model scaling outpaces the human infrastructure needed for authentic evaluation, and discusses emerging responses with equity and validity trade-offs.
The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction
The Ghost Annotator framework combines conformal prediction with collaborative filtering to model LLM behavior and human label variation in content moderation, revealing structural demographic biases in larger models.
Understanding Annotator Safety Policy with Interpretability
This paper introduces Annotator Policy Models (APMs) by Apple, which use interpretability techniques to infer annotators' internal safety policies from their labeling behavior without requiring additional annotation effort. The authors demonstrate that APMs can accurately model these policies and distinguish between sources of annotation disagreement, such as operational failures, policy ambiguity, and value pluralism.