Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Hugging Face Daily Papers 06/01/26, 12:00 AM Papers

Summary

This paper presents a large-scale audit of human annotation reporting in NLP from 2018-2025, showing inconsistent documentation of critical details but improvements over time, and provides a framework and recommendations for better reporting.

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

Cached at: 06/02/26, 03:35 PM

Source: https://huggingface.co/papers/2606.02255 Authors:

Abstract

Large-scale audit of human annotation reporting in NLP reveals inconsistent documentation of critical annotation details, with improvements over time but ongoing gaps in reproducibility and reliability.

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff’s alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
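The abstract compares model and human agreement using Krippendorff's alpha. As a rough illustration of what that statistic measures, the sketch below computes alpha for nominal (categorical) labels from a coincidence matrix. This page does not say which distance metric or data layout the authors used, so the nominal metric, the function name `krippendorff_alpha_nominal`, and the toy labels are all assumptions made for illustration, not the paper's pipeline.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    units: list of lists; each inner list holds the labels that different
    annotators gave to one item. Use None for a missing label.
    Returns NaN when alpha is undefined (e.g. only one label ever used).
    """
    # Coincidence matrix: every ordered pair of labels from different
    # annotators within a unit, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for a, b in permutations(range(m), 2):
            coincidences[(values[a], values[b])] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Nominal metric: any mismatch counts as distance 1.
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected


# Toy data (hypothetical): two annotators, ten items, binary labels.
annotator_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
toy_units = [list(pair) for pair in zip(annotator_a, annotator_b)]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))  # 0.095
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, which is the scale on which the reported 0.606 and 0.585 should be read. Ordinal or interval labels would need a different distance function than the mismatch count used here.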

Similar Articles

Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

Understanding Annotator Safety Policy with Interpretability

@vintcessun: Do you really know who the annotators are in the NLP papers you read? An audit of ACL papers from 2018-2025 reveals that key details such as annotator training, language proficiency, and compensation are often missing, especially in model evaluation studies. This directly threatens research reproducibility and reliability. This paper proposes a unified taxonomy + LLM-assisted automatic extraction pipeline, evaluated on 2,667 annotation tasks…
