Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Hugging Face Daily Papers

Summary

This paper presents a large-scale audit of human annotation reporting in NLP from 2018-2025, showing inconsistent documentation of critical details but improvements over time, and provides a framework and recommendations for better reporting.

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
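The agreement figures above (0.606 versus 0.585) are Krippendorff's alpha values. As a minimal sketch of how such a number is computed, the following implements alpha for nominal labels via the coincidence matrix. The abstract does not state which distance metric the authors used, so the nominal variant is an assumption here, and the function name and toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that
    different annotators gave to one item (missing ratings are simply omitted).
    """
    # Items with fewer than two ratings cannot form pairs and are ignored.
    pairable = [labels for labels in units if len(labels) >= 2]

    # Coincidence matrix: every ordered pair of ratings within an item
    # contributes 1 / (number of ratings for that item - 1).
    coincidences = Counter()
    for labels in pairable:
        weight = 1.0 / (len(labels) - 1)
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += weight

    # Marginal totals per label, and the total number of pairable ratings n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed disagreement: off-diagonal mass of the coincidence matrix.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    # Expected disagreement under chance pairing of the same label totals.
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1) if n > 1 else 0.0

    if expected == 0:
        raise ValueError("alpha is undefined: no pairable ratings or only one label used")
    return 1.0 - observed / expected


# Toy data (invented): three items, two annotators each.
toy = [["a", "a"], ["a", "b"], ["b", "b"]]
print(round(krippendorff_alpha_nominal(toy), 3))  # prints 0.444
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, which is why the model-versus-adjudicated value is read against the human-human value rather than in isolation.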

Cached at: 06/02/26, 03:35 PM


Source: https://huggingface.co/papers/2606.02255

Abstract

Large-scale audit of human annotation reporting in NLP reveals inconsistent documentation of critical annotation details, with improvements over time but ongoing gaps in reproducibility and reliability.

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff’s alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
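To make the idea of a task-level reporting audit concrete, here is a minimal sketch that records which details a paper reports for one annotation task and computes per-field reporting rates. The field names are hypothetical, chosen from the categories listed in the abstract; they are not the paper's actual taxonomy, and the sample records are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class AnnotationTaskReport:
    """Hypothetical record: True means the paper reports this detail for the task."""
    # Operational details the abstract says are frequently reported.
    recruitment: bool = False
    expertise: bool = False
    volume: bool = False
    # Validity-related details the abstract says are often omitted.
    training: bool = False
    language_proficiency: bool = False
    compensation: bool = False
    socio_demographics: bool = False
    adjudication: bool = False
    agreement_value: bool = False


def reporting_rates(tasks):
    """Fraction of tasks that report each field (empty dict for no tasks)."""
    if not tasks:
        return {}
    return {
        f.name: sum(getattr(t, f.name) for t in tasks) / len(tasks)
        for f in fields(AnnotationTaskReport)
    }


def missing_fields(task):
    """Names of the details a single task record leaves unreported."""
    return [f.name for f in fields(task) if not getattr(task, f.name)]


# Invented example records, not data from the paper.
tasks = [
    AnnotationTaskReport(recruitment=True, expertise=True, compensation=True),
    AnnotationTaskReport(recruitment=True, volume=True),
]
rates = reporting_rates(tasks)
print(rates["recruitment"], rates["compensation"])  # prints 1.0 0.5
print(missing_fields(tasks[1]))
```

Keeping one boolean per detail makes it straightforward to slice rates by year, venue, or intended use, which is the kind of comparison the audit describes.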



