@vintcessun: Do you really know who the annotators are in the NLP papers you read? An audit of ACL papers from 2018-2025 reveals that key details such as annotator training, language proficiency, and compensation are often missing, especially in model evaluation studies. This directly threatens research reproducibility and reliability. This paper proposes a unified taxonomy + LLM-assisted automatic extraction pipeline, evaluated on 2,667 annotation tasks…
Summary
A large-scale audit of ACL papers from 2018-2025 reveals that key annotation details (training, language proficiency, compensation, etc.) are often missing, threatening reproducibility. The authors propose a unified taxonomy and an LLM-assisted extraction pipeline evaluated on 2,667 annotation tasks.
Do you really know who the annotators are in the NLP papers you read? An audit of ACL papers from 2018 to 2025 reveals that key details such as annotator training, language proficiency, and compensation are often missing, especially in model evaluation studies. This directly threatens the reproducibility and reliability of research. The paper proposes a unified taxonomy and an LLM-based automatic extraction pipeline, evaluates it on 2,667 annotation tasks, shows that operational details are frequently reported while validity details are missing, and provides an improvement framework.
Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
Source: https://arxiv.org/html/2606.02255
Maria Kunilovskaya¹, Gagan Bhatia¹, Lisa Sophie Albertelli¹, Yanran Chen¹, Christian Greisinger¹, Lotta Kiefer¹, Christoph Leiter¹, Subhadeep Roy¹, Tewodros Achamaleh², Muhammad Arslan Manzoor², Sebastian Pohl², Yufang Hou², Steffen Eger¹
¹NLLG Lab, University of Technology Nuremberg; ²Interdisciplinary Transformation University, Austria
{mariia.kunilovskaia, gagan.bhatia, steffen.eger}@utn.de
Abstract
Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave it unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated_gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks. Our best LLM model reaches human-comparable agreement with adjudicated labels (Krippendorff’s α = 0.606 vs. 0.585 for human–human agreement). Using this pipeline, we construct Annotated_LLM, a dataset covering ACL-venue papers from 2018–2025, with 2,667 extracted annotation tasks from 1,603 papers. We find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven. Based on these findings, we propose a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
Footnote 1: We will release our code and dataset upon acceptance.
1 Introduction
Human annotation is one of the core foundations of empirical NLP. When evaluating a machine translation system, building an argument-mining dataset, or comparing the outputs of large language models (LLMs), we often rely on human judgments as ground truth. As a result, many NLP claims depend not only on models and metrics, but also on who the annotators are, what they are asked to do, how they are trained, and how disagreements are handled (Artstein and Poesio, 2008; van der Lee et al., 2019; Popović and Belz, 2021). Unqualified or biased human annotation may undermine the validity of findings in NLP research: for example, a study evaluating AI-generated poetry may critically depend on annotators’ language proficiency, and a study of political stance, sexism, or hate speech may yield different outcomes based on annotators’ social and ideological positions.
Worryingly, there is extremely little empirical evidence about the annotators in the NLP community: does the community use crowdworkers or experts (and in what proportions), are the annotators adequate for the task, and how well are they compensated? Further, there is a lack of quantitative evidence on whether initiatives such as the ACL Responsible NLP Checklist (Rolling Review, 2022), introduced to improve the consistency, transparency, and ethics of AI research, correspond to measurable improvements in annotation reporting quality, how these reporting practices differ across intended uses of human annotation, and which NLP areas systematically underreport information critical for interpreting annotation outcomes. We address this gap with the first large-scale, task-level audit of human annotation reporting across major NLP venues.
Footnote 2: Our task-level perspective is essential: a single paper may contain multiple annotation tasks with different annotators, instructions, purposes, and quality-control procedures. Treating papers as the unit of analysis would therefore obscure task-level variation in annotators, procedures, and quality control that is crucial for evaluating annotation setups.
Existing work has examined annotation quality, reproducibility, annotator effects, and selected aspects of reporting practices (Bayerl and Paul, 2011; Belz et al., 2023; Klie et al., 2024; Beck, 2023). However, prior studies typically focus on specific annotation problems or isolated reporting dimensions, such as inter-annotator agreement (Artstein and Poesio, 2008; Bayerl and Paul, 2011), reproducibility of human evaluation (Belz et al., 2023; Belz and Thomson, 2024), dataset quality management (Klie et al., 2024), or annotator demographics and data documentation (Bender and Friedman, 2018; Pei and Jurgens, 2023). In contrast, we provide the first large-scale, task-level, cross-topic analysis of annotation reporting across major NLP venues, using a unified taxonomy that covers not only agreement and workload, but also recruitment, qualifications, compensation, socio-demographics, and quality control. We also present the first large-scale empirical evidence on how reporting has changed across an almost 10-year period, starting from 2018, allowing us to examine whether the NLP community is undergoing shifts, perhaps on its way towards a more rigorous scientific field (Jurgens et al., 2018).
Overall we make the following contributions:
(1) A task-level reporting dataset. We introduce Annotated, a dataset for auditing human annotation reporting in NLP. It treats each annotation task as a separate unit, capturing within-paper differences in annotators, instructions, purposes, and quality-control procedures.
(2) A human-adjudicated gold standard. We manually annotate Annotated_gold, a gold-standard set of 41 papers comprising 72 human-annotation tasks, and finalize labels through adjudication. This compact set reflects the substantial cost of task-level expert annotation, estimated at approximately €6,300 in personnel time (Appendix C), while providing a human-validated benchmark for evaluating automated methods that extract annotation-reporting information.
(3) An LLM-assisted extraction pipeline. We evaluate six LLMs as structured extractors of annotation-reporting information. The strongest model, Gemini-3.1-Pro, reaches human-comparable agreement with the adjudicated labels, Krippendorff’s α = 0.606, enabling scalable meta-analysis of annotation practices.
(4) The first large-scale audit of annotation reporting in ACL-venue papers. We apply the validated pipeline to construct Annotated_LLM, a corpus of ACL-venue papers published between 2018 and 2025, and extract 2,667 annotation tasks from 1,603 papers. Using the Reportage Score defined in §5, we analyze reporting practices over time, venues, NLP topics, and uses of human judgment, including resource creation, model evaluation, and human-performance benchmarking.
(5) A scalable framework for more reliable annotation reporting. Motivated by our large-scale audit of annotation-reporting practices, we propose a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable (§7). Since NLP models, benchmarks, and datasets often rely on human annotation for validation, improving annotation reporting has foundational implications for the interpretation of empirical claims in NLP.
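As a rough illustration of the agreement statistic cited in contribution (3), the following minimal sketch computes Krippendorff’s α for nominal labels from a coincidence matrix. It is not the authors’ code: the nominal distance, the input layout (one list of labels per annotated unit), and the function name krippendorff_alpha_nominal are assumptions made here for illustration, and this section does not say how α was aggregated across taxonomy categories.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha with nominal distance.

    `units` holds one sequence of labels per annotated unit; a missing
    annotation is simply left out of that unit's sequence.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label value and overall number of pairable labels.
    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement use only the off-diagonal cells.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected_pairs = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected_pairs == 0:
        raise ValueError("alpha is undefined with fewer than two distinct labels")
    return 1.0 - (n - 1) * observed / expected_pairs


# Hypothetical labels for one taxonomy category on three annotation tasks.
example = [["expert", "expert"], ["crowd", "crowd"], ["expert", "crowd"]]
print(round(krippendorff_alpha_nominal(example), 3))  # 0.444
```

A value of 1 indicates perfect agreement and 0 indicates agreement at chance level, which is the scale on which the reported 0.606 (model vs. adjudicated labels) and 0.585 (human vs. human) should be read.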
2 Related Work
Annotator effects, annotation quality, and reporting.
Human annotation has long been recognized as central to NLP evaluation and dataset construction, with prior work emphasizing the importance of agreement measurement, evaluation design, and reproducibility (Artstein and Poesio, 2008; van der Lee et al., 2019; Popović and Belz, 2021). A growing body of research further shows that annotators are not interchangeable: their demographic backgrounds, language competence, cultural context, and social or political attitudes can systematically affect annotation outcomes. For instance, Pei and Jurgens (2023) show that demographic variables influence offensiveness judgments, while Jiang et al. (2024) find that annotators’ social and political attitudes affect sexism and misogyny annotations. Related work highlights the need to document demographic and linguistic context in NLP datasets and evaluations (Bender and Friedman, 2018; Joshi et al., 2020). At the same time, work on reproducibility has shown that many human evaluations cannot be fully reproduced because papers omit essential information about annotators, materials, instructions, and procedures (Belz et al., 2023; Belz and Thomson, 2024). Earlier meta-analytic work examined annotation and annotator characteristics across 96 studies (Bayerl and Paul, 2011), annotated-data quality dimensions (Beck, 2023), and annotation quality management in dataset papers (Klie et al., 2024).
AI4Science and meta-scientific analysis.
AI4Science studies how AI systems can support or transform scientific work, including literature analysis, discovery, experimentation, writing, figure generation, and evaluation (Pramanick et al., 2025; Chen et al., 2025b; Eger et al., 2026; Xie et al., 2025). Recent work has explored LLM-based support for peer review (Tyser et al., 2024), abstract generation (Kim et al., 2025), and scientific graphics generation (Belouadi et al., 2025; Greisinger and Eger, 2026). In NLP, LLMs have also increasingly been used as annotators, evaluators, or assistants for scientific and empirical analysis, enabled by the general capabilities of LLMs (Brown et al., 2020; Grattafiori et al., 2024; Singh et al., 2026). For example, LLMs have been used to support annotation in social-scientific analysis (Kostikova et al., 2024; Ziems et al., 2024), interpret latent semantic spaces learnt in pretrained models (Mousi et al., 2023), and identify ethical concerns in ACL papers (Karamolegkou et al., 2025). Our work follows this emerging use of LLMs as tools for analysis, but applies them to the meta-scientific study of research practice itself: we use LLMs to extract structured information on how human annotation is reported across NLP papers.
3 Human Annotation Setup
The annotation unit in this study is an individual human annotation task rather than a paper, since papers may contain multiple studies with different reporting levels. A task is defined as a shared annotation setup in which annotators follow the same instructions and annotate the same data. These tasks may serve different purposes, including dataset creation, establishing a human performance level, or evaluating model outputs. In our experience, identifying the number of human tasks within papers was a non-trivial preliminary step. In cases of disagreement, a third annotation was commissioned. Papers with unresolved disagreements after the third annotation, as well as structurally complex papers judged by at least two annotators as containing more than three annotation tasks, were excluded from the final manual dataset.
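The inclusion rule for the manual dataset described above can be sketched as a small decision function. This is only an illustrative reading of the paragraph, not the authors’ procedure: the function name resolve_task_count is hypothetical, and the criterion for a "resolved" disagreement (the third annotator’s count matching one of the first two) is an assumption, since the text does not spell it out.

```python
from typing import List, Optional

MAX_TASKS = 3  # papers judged to contain more than three tasks are excluded


def resolve_task_count(counts: List[int]) -> Optional[int]:
    """Return the agreed number of annotation tasks, or None if excluded.

    `counts` holds the task counts reported by the first two annotators,
    followed by a third count if a third annotation was commissioned.
    """
    if len(counts) not in (2, 3):
        raise ValueError("expected two or three task counts")

    # Structurally complex: at least two annotators see more than three tasks.
    if sum(c > MAX_TASKS for c in counts) >= 2:
        return None

    first, second = counts[0], counts[1]
    if first == second:
        return first

    if len(counts) == 2:
        raise ValueError("annotators disagree: a third annotation is needed")

    # Assumption: the disagreement is resolved when the third annotator
    # sides with one of the first two; otherwise the paper is excluded.
    third = counts[2]
    return third if third in (first, second) else None


print(resolve_task_count([2, 2]))     # 2
print(resolve_task_count([1, 2, 2]))  # 2
print(resolve_task_count([1, 2, 3]))  # None
```

Encoding the rule this way makes the two exclusion paths explicit: unresolved disagreement on the number of tasks, and papers that are too complex to annotate at the task level.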
Figure 1: Task-level taxonomy for annotation-reporting analysis. The taxonomy groups 25 reporting categories into seven aspects: general task description, agreement, workload, recruitment and qualifications, compensation, socio-demographics, and quality control.
Annotations were performed using a structured interface with integrated instructions that cover category rationale, definitions, examples, and coding rules. The process began with pilot annotations and extensive group discussions to refine the taxonomy and calibrate annotator interpretations. Closed-list selection was preferred over open-ended extraction to ensure consistent coding and easier aggregation of results. For some categories, annotators relied only on explicit statements in papers, while other tasks required informed judgment. The final taxonomy captures seven aspects of human annotation reporting across 25 categories. The full description of categories, permissible values, and instructions is provided in Appendix D (Table D). A top-level overview of the taxonomy is shown in Figure 1. This taxonomy provides a unified framework for mapping heterogeneous human annotation tasks in ACL papers.
Each paper was annotated by at least two independent annotators, including co-authors of this paper. In total, 12 annotators contributed to the annotation process: 2 professors, 2 postdoctoral researchers, 6 PhD students, and 2 master’s students. All annotators were proficient in reading academic English. Human annotations were finalized through a two-stage adjudication process involving three annotators from the same pool of annotators: two independent annotations of each identified task were compared, disagreements were discussed to reach conse