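@vintcessun: Do you really know who the annotators are in the NLP papers you read? An audit of ACL papers from 2018 to 2025 finds that key details such as annotator training, language proficiency, and compensation are often missing, especially in model-evaluation studies. This directly threatens the reproducibility and reliability of research. The paper proposes a unified taxonomy plus an automatic LLM extraction pipeline and, on 2,667 annotation tasks, …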


nlp annotation reproducibility audit llm acl-conference

Summary

A large-scale audit of ACL papers from 2018-2025 reveals that key annotation details (training, language proficiency, compensation, etc.) are often missing, threatening reproducibility. The authors propose a unified taxonomy and an LLM-assisted extraction pipeline evaluated on 2,667 annotation tasks.

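Do you really know who the annotators are in the NLP papers you read? An audit of ACL papers from 2018 to 2025 finds that key details such as annotator training, language proficiency, and compensation are often missing, especially in model-evaluation studies. This directly threatens the reproducibility and reliability of research. The paper proposes a unified taxonomy plus an automatic LLM extraction pipeline, evaluates it on 2,667 annotation tasks, shows that operational details are usually reported while validity-related details are missing, and offers a framework for improvement.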


Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Source: https://arxiv.org/html/2606.02255

Maria Kunilovskaya¹, Gagan Bhatia¹, Lisa Sophie Albertelli¹, Yanran Chen¹, Christian Greisinger¹, Lotta Kiefer¹, Christoph Leiter¹, Subhadeep Roy¹, Tewodros Achamaleh², Muhammad Arslan Manzoor², Sebastian Pohl², Yufang Hou², Steffen Eger¹

¹ NLLG Lab, University of Technology Nuremberg; ² Interdisciplinary Transformation University, Austria

{mariia.kunilovskaia, gagan.bhatia, steffen.eger}@utn.de

Abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave it unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated$_{\textsc{gold}}$, a human-adjudicated gold standard of 41 papers and 72 annotation tasks. Our best LLM model reaches human-comparable agreement with adjudicated labels (Krippendorff's $\alpha = 0.606$ vs. $0.585$ for human–human agreement). Using this pipeline, we construct Annotated$_{\textsc{llm}}$, a dataset covering ACL-venue papers from 2018–2025, with 2,667 extracted annotation tasks from 1,603 papers. We find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven. Based on these findings, we propose a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.¹

¹ We will release our code and dataset upon acceptance.


1 Introduction

Human annotation is one of the core foundations of empirical NLP. When evaluating a machine translation system, building an argument-mining dataset, or comparing the outputs of large language models (LLMs), we often rely on human judgments as ground truth. As a result, many NLP claims depend not only on models and metrics, but also on who the annotators are, what they are asked to do, how they are trained, and how disagreements are handled (Artstein and Poesio, 2008; van der Lee et al., 2019; Popović and Belz, 2021). Unqualified or biased human annotation may undermine the validity of findings in NLP research: for example, a study evaluating AI-generated poetry may critically depend on annotators' language proficiency, and a study of political stance, sexism, or hate speech may yield different outcomes depending on annotators' social and ideological positions.

Worryingly, there is extremely little empirical evidence about the annotators in the NLP community: does the community use crowdworkers or experts (and in what proportions), are the annotators adequate for the task, and how well are they compensated? Further, there is a lack of quantitative evidence on whether initiatives such as the ACL Responsible NLP Checklist (Rolling Review, 2022), introduced to improve the consistency, transparency, and ethics of AI research, correspond to measurable improvements in annotation reporting quality, how these reporting practices differ across intended uses of human annotation, and which NLP areas systematically underreport information critical for interpreting annotation outcomes. We address this gap with the first large-scale, task-level audit of human annotation reporting across major NLP venues.²

² Our task-level perspective is essential: a single paper may contain multiple annotation tasks with different annotators, instructions, purposes, and quality-control procedures. Treating papers as the unit of analysis would therefore obscure task-level variation in annotators, procedures, and quality control that is crucial for evaluating annotation setups.

Existing work has examined annotation quality, reproducibility, annotator effects, and selected aspects of reporting practices (Bayerl and Paul, 2011; Belz et al., 2023; Klie et al., 2024; Beck, 2023). However, prior studies typically focus on specific annotation problems or isolated reporting dimensions, such as inter-annotator agreement (Artstein and Poesio, 2008; Bayerl and Paul, 2011), reproducibility of human evaluation (Belz et al., 2023; Belz and Thomson, 2024), dataset quality management (Klie et al., 2024), or annotator demographics and data documentation (Bender and Friedman, 2018; Pei and Jurgens, 2023). In contrast, we provide the first large-scale, task-level, cross-topic analysis of annotation reporting across major NLP venues, using a unified taxonomy that covers not only agreement and workload, but also recruitment, qualifications, compensation, socio-demographics, and quality control. We also present the first large-scale empirical evidence on how reporting has changed over an almost 10-year period starting in 2018, allowing us to examine whether the NLP community is undergoing shifts, perhaps on its way toward a more rigorous scientific field (Jurgens et al., 2018).

Overall, we make the following contributions:

(1) A task-level reporting dataset. We introduce Annotated, a dataset for auditing human annotation reporting in NLP. It treats each annotation task as a separate unit, capturing within-paper differences in annotators, instructions, purposes, and quality-control procedures.

(2) A human-adjudicated gold standard. We manually annotate Annotated$_{\textsc{gold}}$, a gold-standard set of 41 papers comprising 72 human-annotation tasks, and finalize labels through adjudication. This compact set reflects the substantial cost of task-level expert annotation, estimated at approximately €6,300 in personnel time (Appendix C), while providing a human-validated benchmark for evaluating automated methods that extract annotation-reporting information.

(3) An LLM-assisted extraction pipeline. We evaluate six LLMs as structured extractors of annotation-reporting information. The strongest model, Gemini-3.1-Pro, reaches human-comparable agreement with the adjudicated labels (Krippendorff's $\alpha = 0.606$), enabling scalable meta-analysis of annotation practices.

(4) The first large-scale audit of annotation reporting in ACL-venue papers. We apply the validated pipeline to construct Annotated$_{\textsc{LLM}}$, a corpus of ACL-venue papers published between 2018 and 2025, and extract 2,667 annotation tasks from 1,603 papers. Using the Reportage Score defined in §5, we analyze reporting practices across time, venues, NLP topics, and uses of human judgment, including resource creation, model evaluation, and human-performance benchmarking.

(5) A scalable framework for more reliable annotation reporting. Motivated by our large-scale audit of annotation-reporting practices, we propose a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable (§7). Since NLP models, benchmarks, and datasets often rely on human annotation for validation, improving annotation reporting has foundational implications for the interpretation of empirical claims in NLP.

2 Related Work

Annotator effects, annotation quality, and reporting.

Human annotation has long been recognized as central to NLP evaluation and dataset construction, with prior work emphasizing the importance of agreement measurement, evaluation design, and reproducibility (Artstein and Poesio, 2008; van der Lee et al., 2019; Popović and Belz, 2021). A growing body of research further shows that annotators are not interchangeable: their demographic backgrounds, language competence, cultural context, and social or political attitudes can systematically affect annotation outcomes. For instance, Pei and Jurgens (2023) show that demographic variables influence offensiveness judgments, while Jiang et al. (2024) find that annotators' social and political attitudes affect sexism and misogyny annotations. Related work highlights the need to document demographic and linguistic context in NLP datasets and evaluations (Bender and Friedman, 2018; Joshi et al., 2020). At the same time, work on reproducibility has shown that many human evaluations cannot be fully reproduced because papers omit essential information about annotators, materials, instructions, and procedures (Belz et al., 2023; Belz and Thomson, 2024). Earlier meta-analytic work examined annotation and annotator characteristics across 96 studies (Bayerl and Paul, 2011), annotated-data quality dimensions (Beck, 2023), and annotation quality management in dataset papers (Klie et al., 2024).

AI4Science and meta-scientific analysis.

AI4Science studies how AI systems can support or transform scientific work, including literature analysis, discovery, experimentation, writing, figure generation, and evaluation (Pramanick et al., 2025; Chen et al., 2025b; Eger et al., 2026; Xie et al., 2025). Recent work has explored LLM-based support for peer review (Tyser et al., 2024), abstract generation (Kim et al., 2025), and scientific graphics generation (Belouadi et al., 2025; Greisinger and Eger, 2026). In NLP, LLMs have also increasingly been used as annotators, evaluators, or assistants for scientific and empirical analysis, enabled by the general capabilities of LLMs (Brown et al., 2020; Grattafiori et al., 2024; Singh et al., 2026). For example, LLMs have been used to support annotation in social-scientific analysis (Kostikova et al., 2024; Ziems et al., 2024), interpret latent semantic spaces learnt in pretrained models (Mousi et al., 2023), and identify ethical concerns in ACL papers (Karamolegkou et al., 2025). Our work follows this emerging use of LLMs as tools for analysis, but applies them to the meta-scientific study of research practice itself: we use LLMs to extract structured information on how human annotation is reported across NLP papers.

3 Human Annotation Setup

The annotation unit in this study is an individual human annotation task rather than a paper, since papers may contain multiple studies with different reporting levels. A task is defined as a shared annotation setup in which annotators follow the same instructions and annotate the same data. These tasks may serve different purposes, including dataset creation, establishing a human performance level, or evaluating model outputs. In our experience, identifying the number of human tasks within papers was a non-trivial preliminary step. In cases of disagreement, a third annotation was commissioned. Papers with unresolved disagreements after the third annotation, as well as structurally complex papers judged by at least two annotators as containing more than three annotation tasks, were excluded from the final manual dataset.

Figure 1: Task-level taxonomy for annotation-reporting analysis. The taxonomy groups 25 reporting categories into seven aspects: general task description, agreement, workload, recruitment and qualifications, compensation, socio-demographics, and quality control.

Annotations were performed using a structured interface with integrated instructions that cover category rationale, definitions, examples, and coding rules. The process began with pilot annotations and extensive group discussions to refine the taxonomy and calibrate annotator interpretations. Closed-list selection was preferred over open-ended extraction to ensure consistent coding and easier aggregation of results. For some categories, annotators relied only on explicit statements in papers, while other categories required informed judgment. The final taxonomy captures seven aspects of human annotation reporting across 25 categories. The full description of categories, permissible values, and instructions is provided in Appendix D (Table D). A top-level overview of the taxonomy is shown in Figure 1. This taxonomy provides a unified framework for mapping heterogeneous human annotation tasks in ACL papers.

Each paper was annotated by at least two independent annotators, including co-authors of this paper. In total, 12 annotators contributed to the annotation process: 2 professors, 2 postdoctoral researchers, 6 PhD students, and 2 master's students. All annotators were proficient in reading academic English. Human annotations were finalized through a two-stage adjudication process involving three annotators from the same pool: two independent annotations of each identified task were compared, disagreements were discussed to reach consensus, and a third annotator adjudicated unresolved cases when necessary. The resulting consensus labels form the gold standard used to evaluate LLM extraction quality; we refer to this subset as Annotated$_{\textsc{gold}}$.

Figures 17, 17, and 17 in Appendix F show an annotated paper excerpt and its extracted information.

4 Annotated Dataset

Table 1: Annotated subsets. Input papers are candidate papers before filtering; annotated papers are papers with annotatable human annotation tasks; annotation tasks are task-level annotation records.

We began by retrieving papers from ACL Anthology publications appearing between 2018 and 2025 (three years before and after the introduction of the ACL checklist) in the main and Findings proceedings of ACL, EMNLP, NAACL, TACL, EACL, and AACL. Candidate papers were identified by matching 34 curated annotation-related keywords against titles and abstracts, including terms such as "manual annotation" and "human evaluation". The keyword list (see Appendix A) was iteratively refined during pilot annotation sessions.

Annotated$_{\textsc{gold}}$ is the manually annotated and adjudicated gold-standard set used to evaluate LLM extraction quality. It consists of 41 papers with reliably annotatable content out of 61 initially sampled to construct a manual benchmark. Papers were excluded when the number of annotation tasks could not be resolved after a third annotation, or when at least two annotators identified the paper as containing more than three difficult-to-annotate human annotation tasks. The retained papers contain 72 distinct human annotation tasks. Each task corresponds to a shared annotation setup in which annotators follow the same instructions and annotate the same data. The resulting adjudicated dataset is used exclusively for evaluating LLM performance. The size of Annotated$_{\textsc{gold}}$ reflects the cost of expert annotation: the manual annotation and adjudication effort is estimated at approximately €6,300 in personnel time, with details provided in Appendix C.

Annotated$_{\textsc{LLM}}$ is the large-scale LLM-extracted set used for the main analysis. The keyword-based sampling yielded 1,995 candidate papers. After filtering out papers without annotatable human annotation content, Annotated$_{\textsc{LLM}}$ contains 1,603 papers and 2,667 extracted annotation tasks. Annotated$_{\textsc{LLM}}$ strictly excludes all 61 papers sampled for Annotated$_{\textsc{gold}}$, ensuring that the gold-standard evaluation set remains separate from the large-scale analysis. Table 1 summarizes the number of input papers, retained papers, and annotation tasks in each subset.

Sampling validation.

Annotated$_{\textsc{LLM}}$ is constructed through keyword-based retrieval, which may introduce sampling bias relative to all ACL-venue papers. To assess the effect of this sampling strategy, we compare it with a stratified random sample from the same venue-year scope. Although the comparison reveals statistically significant differences in frequency distributions for some category-value pairs, the effects are generally modest: the average absolute difference across 67 observations does not exceed 5.2 percentage points, with effects lacking directional consistency (see detailed analysis in Appendix A.1). We interpret these deviations as sufficiently limited for the purposes of LLM extraction, particularly given the computational, financial, and environmental costs associated with processing the full ACL venue-year population. Accordingly, we treat Annotated$_{\textsc{LLM}}$ as a high-recall, annotation-focused corpus suitable for studying reporting practices in papers likely to involve human annotation, without claiming it represents all ACL-venue papers.

5 Experiment Setup

Extraction.

We evaluate six LLMs under a unified extraction protocol: three proprietary (Gemini-3.1-Pro (Gemini Team, 2025), Gemini-3.1-Flash-Lite (Gemini Team, 2025), GPT-4.1 (OpenAI, 2025b)) and three open-weight (Qwen3.6-27B (Qwen Team, 2026), gemma-4-31B-it (Google, 2026), gpt-oss-120b (OpenAI, 2025a)). We include both proprietary and open-weight models because prior work shows that closed models often provide strong performance, while open models offer advantages in cost and transparency (Oketch et al., 2025). Each model receives the paper converted to plain text and is prompted to populate the same 25-category taxonomy used in human annotation, outputting one JSON record per distinct annotation experiment identified in the paper. The prompt (reproduced in full in Appendix E) encodes the complete taxonomy with exact allowed values, field-level decision rules, interdependency constraints, and a self-audit checklist that instructs the model to continue scanning beyond the first annotation section found; stopping early is a common failure mode that causes multi-experiment papers to be under-extracted. Model output is constrained to a JSON-defined schema via structured-output APIs where available, eliminating free-form generation and enforcing value types at decoding time. To handle long documents within the context constraints of open-weight models, we apply annotation-section chunking: sections likely to contain annotation metadata (e.g., Annotation Procedure, Human Evaluation, Data Collection) are identified via regex heuristics and passed to the model within an 8,000-token budget; papers with no detectable annotation sections are truncated to 20,000 tokens and processed in full. Proprietary models with million-token context windows receive the full paper text without truncation. Papers for which the model identifies no human annotation experiments are flagged and excluded.
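The annotation-section chunking described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the heading pattern, the `select_context` helper, the pre-split `(heading, body)` input, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions made for the example. Only the 8,000 and 20,000 budgets and the three example section names come from the text.

```python
import re

# Hypothetical heading pattern built from the three example section names in
# the text; the paper's actual regex heuristics are not reproduced here.
ANNOTATION_HEADING = re.compile(
    r"annotation|human evaluation|data collection", re.IGNORECASE
)


def select_context(sections, section_budget=8000, fallback_budget=20000):
    """Pick the text passed to a limited-context model.

    sections: list of (heading, body) tuples in document order.
    Whitespace-separated words approximate tokens (an assumption).
    """
    hits = [(h, b) for h, b in sections if ANNOTATION_HEADING.search(h)]
    if hits:
        # Keep only annotation-related sections, capped at the section budget.
        words = " ".join(f"{h}\n{b}" for h, b in hits).split()
        return " ".join(words[:section_budget])
    # No annotation section detected: fall back to a truncated full paper.
    words = " ".join(f"{h}\n{b}" for h, b in sections).split()
    return " ".join(words[:fallback_budget])


paper = [
    ("Introduction", "intro text"),
    ("Human Evaluation", "three raters scored outputs"),
    ("Results", "numbers"),
]
context = select_context(paper)
```

A real pipeline would count tokens with the target model's tokenizer and would need its own logic for splitting plain text into sections.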

Evaluation.

LLM-extracted values are compared to gold-standard labels at the level of identified tasks. For papers with multiple tasks, we first align each gold task with the corresponding LLM-extracted task using the short task description recorded during annotation. For example, a gold task described as "sentiment labeling of tweets" can be matched to a model-extracted task described as "tweet sentiment annotation". Before comparison, all values are normalized to a canonical form: IAA metric surface variants (e.g., Cohen kappa, $\kappa$, Cohen's K) are collapsed to a single canonical label per metric; numeric fields accept shared integer tokens as agreement (allowing, e.g., 27.9K and 27900 to match); list-valued fields (iaa_metric_name, papers_topic, intended_use) are compared via set intersection, with partial overlap counting as agreement. We report per-value and per-category exact-match agreement rates alongside Krippendorff's $\alpha$.
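As an illustration of the normalization rules above, the following sketch shows one way the three comparisons could be implemented. The alias table, helper names, and the handling of K/M suffixes are assumptions made for the example; the paper does not spell out its normalization code.

```python
import re

# Hypothetical alias table covering only the three variants named in the text.
IAA_ALIASES = {
    "cohen kappa": "cohen_kappa",
    "cohen's k": "cohen_kappa",
    "κ": "cohen_kappa",
}

# A number with optional thousands separators, decimals, and a K/M suffix.
NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)([kKmM]\b)?")
SCALE = {"": 1, "k": 1_000, "m": 1_000_000}


def canonical_metric(name):
    """Collapse surface variants of an IAA metric name to one label."""
    key = name.strip().lower().replace("’", "'")
    return IAA_ALIASES.get(key, key)


def integer_tokens(value):
    """Return the set of integers a numeric string denotes ('27.9K' -> {27900})."""
    tokens = set()
    for number, suffix in NUMBER.findall(value):
        scaled = float(number.replace(",", "")) * SCALE[suffix.lower()]
        tokens.add(round(scaled))
    return tokens


def numbers_agree(gold, pred):
    """Numeric fields agree when they share at least one integer token."""
    return bool(integer_tokens(gold) & integer_tokens(pred))


def lists_agree(gold, pred):
    """List-valued fields agree on any overlap (set intersection)."""
    return bool(set(gold) & set(pred))
```

Counting partial overlap as agreement is lenient by design: a model that extracts one of two gold `intended_use` values is scored as matching, which is worth keeping in mind when reading the agreement rates.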

Reportage Score.

To assess the level of reporting associated with human annotation in NLP papers, we compute a Reportage Score for each identified annotation task. The score is the proportion of attributes applicable to a given annotation task that are explicitly reported. Formally, for an annotation task $t$, let $A_t$ denote the set of applicable reporting attributes and let $R_t \subseteq A_t$ denote the subset of applicable attributes that are reported. The Reportage Score is defined as:

$$\textsc{Reportage Score}(t) = \frac{|R_t|}{|A_t|}.$$

The denominator $A_t$ is task-sensitive. It includes 10 universal attributes expected for all annotation tasks: guideline availability, recruitment strategy, annotator training, expertise, language proficiency, education, number of annotators, number of annotated items, compensation, and post-annotation filtering. It further includes conditional attributes only when they are relevant to the annotation setup. For example, inter-annotator agreement and adjudication are excluded for single-annotator tasks (including text production assignments), crowdworker screening is excluded when crowdsourcing is not used, and redundant workload fields are excluded when they encode information already captured by other categories. For tasks in "Subjective language & social meaning", five socio-demographic attributes are added to the denominator because annotator background may be relevant for interpreting subjective or socially grounded judgments. Missing applicable information lowers the score; inapplicable information does not. Full scoring rules are provided in Appendix B.
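A small sketch of how the task-sensitive denominator could be computed is given below. The attribute identifiers, the three boolean flags, and the placeholder names for the five socio-demographic attributes are hypothetical; the authoritative rules are those in Appendix B, and the redundant-workload exclusion is omitted here.

```python
# The 10 universal attributes listed in the text (identifiers are made up).
UNIVERSAL = {
    "guideline_availability", "recruitment_strategy", "annotator_training",
    "expertise", "language_proficiency", "education", "num_annotators",
    "num_items", "compensation", "post_filtering",
}

# Placeholders: this section does not list the five socio-demographic attributes.
SOCIO_DEMOGRAPHIC = {f"socio_demographic_{i}" for i in range(1, 6)}


def applicable_attributes(multi_annotator, crowdsourced, subjective_topic):
    """Build A_t: universal attributes plus the relevant conditional ones."""
    attrs = set(UNIVERSAL)
    if multi_annotator:
        attrs |= {"iaa", "adjudication"}
    if crowdsourced:
        attrs.add("crowdworker_screening")
    if subjective_topic:
        attrs |= SOCIO_DEMOGRAPHIC
    return attrs


def reportage_score(reported, applicable):
    """|R_t| / |A_t|, counting only reported attributes that are applicable."""
    return len(set(reported) & applicable) / len(applicable)


# Single-annotator, non-crowdsourced task: 'iaa' is inapplicable and ignored.
a_t = applicable_attributes(False, False, False)
score = reportage_score({"recruitment_strategy", "expertise", "num_items", "iaa"}, a_t)
```

Here three of the ten applicable attributes are reported, so the score is 0.3; the same reported set on a multi-annotator task would be scored against a larger denominator.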

6 Results

In §6.1, we first analyze annotation agreement, including both human–human and human–LLM agreement. We then examine broader reporting trends in §6.2, using Annotated$_{\textsc{LLM}}$ and addressing our research questions on annotation intensity, reporting variation across NLP areas, the impact of the ACL Responsible NLP Checklist, and the intended use of human judgment.

6.1 Agreement Analysis

Human–Human agreement

Overall, human–human exact-match agreement reaches 79.2%, with a macro-averaged Krippendorff's $\alpha$ of 0.585, which is commonly regarded as decent agreement. There is substantial variation across annotation dimensions. Fields corresponding to clearly defined and explicitly reported information, including demographic attributes, compensation, and inter-annotator agreement metrics, consistently achieve high reliability ($\alpha > 0.8$). In contrast, lower agreement is concentrated in categories that require interpretation or inference from less formalized descriptions (e.g., guideline availability, annotator expertise, recruitment, and quality control procedures). This pattern is expected given the nature of the task: disagreement arises primarily in cases where the annotated papers provide incomplete or ambiguous information. Importantly, all annotations were subsequently adjudicated through extensive discussion, yielding a consensus dataset that resolves individual disagreements. Such adjudication has been shown to produce labels that more closely approximate the underlying ground truth than independent annotations and provides a sufficiently robust benchmark for evaluating LLMs (Klie et al., 2024).

Human–LLM agreement

Table 2 presents the overall performance of our six tested LLMs (three open-weight and three proprietary). Gemini-3.1-Pro shows the strongest results.

Table 2: Evaluation results comparing various models against human baseline performance.

Overall, the best LLM achieves slightly higher macro-level agreement than human–human annotation (79.9% vs. 79.2%) and a marginally higher Krippendorff's $\alpha$ (0.606 vs. 0.585), indicating that model-assisted annotation is not only consistent with human judgments but, in aggregate, slightly more stable against the gold standard. Individual per-category human–human and human–LLM agreement values are reported in Table 4 (Appendix D).

Several categories exhibit a mismatch between high observed percentage agreement and moderate or low $\alpha$, driven by class imbalance and skewed label distributions. Two extreme cases illustrate this effect: for 'political orientation', the absence of positive instances yields perfect agreement but limited interpretability, whereas for 'nation of origin', scarcity of positive cases makes agreement estimates unstable. For rare categories, $\alpha$ can become negative when observed agreement is lower than the agreement expected by chance under the label distribution. Thus, the negative value for 'nation of origin' in human–human IAA should not be interpreted as substantive "negative agreement", but as an instability caused by very sparse positive cases.
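The effect of sparse positive cases on $\alpha$ can be reproduced with a toy example. The sketch below implements the standard nominal Krippendorff's $\alpha$ for the simplest setting (two coders, no missing labels) and applies it to invented labels; these are not the paper's data, and the function name is an assumption.

```python
from collections import Counter


def krippendorff_alpha_nominal(pairs):
    """Nominal Krippendorff's alpha for two coders with no missing labels.

    pairs: list of (label_from_coder_1, label_from_coder_2), one per unit.
    """
    # Coincidence matrix: each unit contributes both ordered pairs.
    coincidences = Counter()
    for a, b in pairs:
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    totals = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())  # total number of pairable values (2 per unit)
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1.0 - observed / expected


# Invented toy labels: 20 units, two isolated and non-overlapping "yes" votes.
toy = [("no", "no")] * 18 + [("yes", "no"), ("no", "yes")]
percent_agreement = sum(a == b for a, b in toy) / len(toy)
alpha = krippendorff_alpha_nominal(toy)
```

On these toy labels the coders agree on 90% of units, yet $\alpha$ is slightly below zero because they never agree on the rare "yes" label. When every label is identical, expected disagreement is zero and $\alpha$ is undefined, which mirrors the 'political orientation' case of perfect agreement with limited interpretability.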

6.2 Trends in Reporting

After evaluating all six models on Annotated$_{\textsc{gold}}$, we select the best-performing model, Gemini-3.1-Pro, for the large-scale extraction. This model is then applied to Annotated$_{\textsc{LLM}}$ to produce the final dataset of automatically extracted annotation tasks used in the downstream analysis; this extraction cost a total of €8,300. To analyze temporal trends in methodological reporting and its association with factors such as paper topic, intended use of human judgment, and venue, we use the Reportage Score (see §5).

RQ1: Which NLP areas are more human-annotation intensive, and which aspects of annotation practice are more consistently reported?

Figure 2: Reporting patterns across annotation attributes. Panel A shows the overall percentage of annotation tasks for which each attribute is reported. Panel B shows the corresponding reporting rates by NLP topic. Darker green cells indicate higher reporting rates within a topic; orange and red cells indicate lower reporting rates.

Figure 2 summarizes which annotation attributes are reported and how reporting varies across NLP topics. Panel A shows that papers most often report operational details: recruitment (90.4%), annotator expertise (86.5%), and total number of annotated items (86.0%). In contrast, reporting is much weaker for attributes that help readers assess the annotation process itself, including annotator training (18.7%), language proficiency (24.0%), and released annotation guidelines (34.1%).

Panel B shows that this imbalance is broadly consistent across NLP areas. Language proficiency is reported somewhat more frequently in papers involving multilingual tasks and linguistic annotation. As expected, annotator demographics are reported somewhat more often in papers dealing with subjective judgments, preferences, and socially constructed meaning ("Subjective language and social meaning" topics), i.e., tasks that investigate social and cultural phenomena through language, such as hate speech, bias, stance, sentiment, and humor annotation.

We examine the "Subjective language and social meaning" category separately because annotator background, language proficiency, and social positioning can directly affect judgments in these tasks. However, compared with the rest of the dataset, these papers do not show systematically stronger reporting overall (Figure 6). There are some distinctive tendencies: recruitment information and annotator language proficiency are reported slightly more often, crowdworkers and mixed annotator pools are used more frequently, and authors are less often used as annotators (Figures 12 and 7, Appendix F). These results suggest that socially oriented NLP papers are slightly more attentive to annotator sourcing and language background, but still do not consistently report the broader methodological details needed to assess annotation validity.

RQ2: What is the impact of introducing the ACL Responsible NLP Checklist (2022)?

The ACL Responsible NLP Checklist (Rolling Review, 2022) asks authors to reflect on data, annotation, and ethical considerations. Accordingly, we test whether its introduction around 2022 coincides with a measurable change in human-annotation reporting. We fit an interrupted time-series regression model that estimates both an immediate post-2022 level shift and a change in the post-2022 trend. Figure 3 shows the fitted interrupted time-series regression line with confidence intervals, alongside the observed yearly reportage scores (green dots with error bars) and the counterfactual continuation of the pre-2022 trend. Reportage scores increase steadily between 2018 and 2021, followed by a modest upward level shift in 2022 and a visibly flatter post-2022 trend relative to the counterfactual continuation of the pre-checklist trajectory.


Figure 3: Interrupted time-series analysis of Reportage Score before and after the ACL Responsible NLP Checklist. Points show observed yearly mean Reportage Score values, error bars indicate standard errors, the solid line shows the fitted trend, and the dashed line shows the counterfactual continuation of the pre-2022 trend. The vertical dashed line marks the checklist introduction.

Our analysis finds no evidence that the introduction of the ACL checklist in 2022 produced a substantial immediate improvement in reporting scores. Although reportage quality continues to improve after 2022, the rate of improvement appears lower than in the preceding years, suggesting a flattening of the earlier upward trend. We can offer two explanations for this finding. First, aggregate averages may obscure heterogeneous developments across different categories of annotation studies, potentially masking opposing trends in reporting practices. Second, the upward trend already visible before 2022 suggests that the checklist may have formalized practices that were already emerging in the field. Moreover, because the ACL Responsible NLP Checklist (introduced at NAACL 2022) was based on the NeurIPS 2021 checklist, awareness of annotation reporting standards may have begun increasing earlier, which aligns with the sharper rise in reporting levels observed around 2021 for some types of annotation studies. A venue-specific interrupted time-series analysis for ACL, EMNLP, and NAACL is provided in Appendix F. Table F in Appendix F offers a detailed per-category comparison for pre- and post-2022 periods.
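For readers unfamiliar with interrupted time-series analysis, the sketch below shows the common segmented-regression form with a level-shift term and a slope-change term, fitted by ordinary least squares. The paper does not give its exact specification, so this parameterization is an assumption, and the yearly values are synthetic numbers generated from made-up coefficients, not results from the paper.

```python
def design_row(year, start=2018, cutoff=2022):
    """Regressors: intercept, time, post-cutoff indicator, time since cutoff."""
    post = 1.0 if year >= cutoff else 0.0
    return [1.0, float(year - start), post, post * (year - cutoff)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_its(years, scores):
    """Least-squares fit via the normal equations X^T X beta = X^T y."""
    rows = [design_row(y) for y in years]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic, noise-free series from invented coefficients (NOT the paper's data):
# baseline 0.30, pre-trend +0.03/yr, level shift +0.02, slope change -0.02/yr.
true_beta = [0.30, 0.03, 0.02, -0.02]
years = list(range(2018, 2026))
synthetic = [sum(c * x for c, x in zip(true_beta, design_row(y))) for y in years]
fitted = fit_its(years, synthetic)

# Counterfactual for 2025: continue the pre-cutoff trend with no intervention.
counterfactual_2025 = fitted[0] + fitted[1] * (2025 - 2018)
```

In this parameterization, a positive third coefficient with a negative fourth coefficient corresponds to the qualitative pattern described above: a modest level shift at the cutoff followed by a flatter post-cutoff trend relative to the counterfactual line. A real analysis would also report standard errors, which this sketch omits.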

RQ3: Does the intended use of human judgment affect reportage quality?

We use the intended use labels from our taxonomy to compare annotation tasks whose primary purpose is resource creation, model output evaluation, or human-performance benchmarking. These labels are assigned during the same human and LLM annotation process evaluated in §6.1. Because human-performance benchmarking is comparatively rare, Figure 4 focuses on the two most frequent uses: resource creation and model output evaluation.


Figure 4: Mean Reportage Score by intended use of human annotation. Lines show yearly mean Reportage Score values for all annotation tasks, model-output evaluation tasks, and resource-creation tasks; error bars indicate standard errors.

While overall reportage quality improves over time, resource creation studies consistently report substantially more methodological detail than model evaluation studies, with the gap persisting throughout the observation period. In particular, evaluation-focused work more frequently omits information about annotator recruitment, compensation, training, and quality-control procedures. To further examine this pattern, we fit a logistic regression model predicting the reporting of quality-control-related categories in our annotation scheme, including adjudication, post-filtering, annotator training, IAA metrics, and recruitment source (full results in Appendix F). Resource creation studies are substantially more likely to report such measures ($p < 0.001$), while publication year shows only a weak positive association after controlling for annotation use type. The same overall pattern holds for recruitment strategy and compensation reporting, suggesting that model evaluation studies systematically under-report key aspects of annotation methodology.

7 Conclusion and Recommendations

Human annotation remains a foundation of NLP research, but our results show that the field still lacks consistent standards for documenting who annotators are, how annotation is organized, and how annotation quality is controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues. Our findings show clear progress, but also persistent blind spots: NLP papers often report who annotators are in broad terms, especially recruitment source, expertise, and annotation scale, but they provide a much less complete picture of whether annotators are trained, qualified, compensated, demographically located, or involved in systematic disagreement resolution. In particular, compensation, socio-demographics, training, adjudication, and agreement values remain substantially underreported. These omissions matter: without such information, readers cannot assess annotator suitability, potential background effects, disagreement handling, or whether the labels support the paper’s claims. Our findings may also help explain why human annotations are often difficult to reproduce in NLP (Belz et al., 2023).

We draw three recommendations.

First, papers should report a bare minimum set of annotation-task details whenever human judgments are used: annotator source, number of annotators, number of annotated items, annotators per item, training, language proficiency, expertise, compensation, quality control, and access to annotation guidelines. These details are not optional metadata; they are necessary for readers to assess whether the annotation procedure is reliable, interpretable, and reproducible.

Second, reporting standards should be task-sensitive. Demographic and positional information is especially important for socially grounded and subjective language tasks, while agreement, adjudication, and quality-control information are central for tasks that produce benchmark labels or evaluation judgments.

Third, model-evaluation studies require substantially more annotation detail than they currently provide. Our analyses show that when annotation is used for model evaluation, it is documented less thoroughly than when it is used for resource creation, even though such evaluations often support central empirical claims.
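The bare-minimum details listed in the first recommendation can be treated as a simple per-task record that flags what a paper leaves unreported. The sketch below is illustrative only: the class name `AnnotationTaskReport`, the field names, and the `completeness` fraction are hypothetical and are not the Reportage Score defined in this paper.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AnnotationTaskReport:
    """One record per annotation task; None means the paper does not report the detail."""
    annotator_source: Optional[str] = None
    num_annotators: Optional[int] = None
    num_items: Optional[int] = None
    annotators_per_item: Optional[int] = None
    training: Optional[str] = None
    language_proficiency: Optional[str] = None
    expertise: Optional[str] = None
    compensation: Optional[str] = None
    quality_control: Optional[str] = None
    guidelines_access: Optional[str] = None


def missing_details(report):
    """Return the names of all unreported details, in declaration order."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Invented example: a paper that reports only operational details.
report = AnnotationTaskReport(
    annotator_source="crowdsourcing platform",
    num_annotators=3,
    num_items=500,
    annotators_per_item=3,
    expertise="non-expert",
)
missing = missing_details(report)
completeness = 1 - len(missing) / len(fields(report))  # hypothetical fraction, not the Reportage Score
```

A record like this could accompany a paper's annotation section or a checklist submission, so that gaps such as missing compensation or training information are visible at a glance.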

Overall, our study reframes annotation reporting as a measurable methodological practice. The proposed taxonomy and LLM-assisted pipeline provide a scalable basis for monitoring whether NLP papers document human annotation in ways that support reliability and reproducibility.

Limitations

A first limitation concerns the design and reliability of the annotation taxonomy. Some categories require fine-grained interpretation of incomplete or inconsistently written paper descriptions, making perfect agreement difficult. We intentionally retained these categories because they capture information that matters for assessing annotation validity, such as recruitment, expertise, guideline availability, compensation, and quality control. The moderate agreement observed for some attributes should therefore be read not only as a limitation of our annotation scheme; it is also evidence that NLP papers often describe annotation procedures in ways that are difficult to interpret consistently. A second limitation is that the large-scale analysis relies on LLM-based extraction. Although we validate the extraction pipeline against a human-adjudicated gold standard, extraction errors remain possible and may affect individual category estimates. Moreover, Annotated$_{\textsc{gold}}$ is necessarily limited in size because task-level expert annotation is costly; the manual annotation and adjudication effort is estimated at approximately 6,300 euros in personnel time, as detailed in Appendix C. We therefore interpret our results as aggregate reporting patterns rather than definitive judgments about individual papers. Finally, Annotated$_{\textsc{LLM}}$ is an annotation-focused corpus constructed through keyword retrieval and should not be interpreted as a fully representative sample of all ACL-venue papers.

Ethics Statement

This work analyzes publicly available scientific papers and does not introduce new experiments involving external human subjects. The manual annotation was conducted by the authors and collaborators on published research articles. Nevertheless, the study raises ethical questions about annotation labor and research transparency. In particular, our findings show that compensation, annotator demographics, training, and quality-control procedures are often underreported in NLP papers. We do not interpret missing reporting as evidence that these practices were absent; rather, we treat it as a transparency gap that limits readers’ ability to assess annotation validity, fairness, and reproducibility. Our LLM-assisted extraction pipeline also introduces ethical considerations. Automated extraction may misclassify individual papers or categories, especially when papers describe annotation procedures ambiguously. For this reason, we validate the pipeline against a human-adjudicated gold standard and report aggregate trends rather than using the system to rank or criticize individual papers. If the dataset and code are released, they should be used to study community-level reporting practices and to support better documentation standards, not to assign blame to individual authors.

Broader Impact

This work aims to improve the transparency and reproducibility of human annotation in NLP. By providing a task-level taxonomy, a reporting-completeness metric, and an LLM-assisted extraction pipeline, the study offers tools for monitoring how the field documents annotation labor, quality control, and annotator characteristics. The most direct positive impact is methodological: clearer reporting can help reviewers, readers, and future researchers assess whether annotation-based claims are reliable, comparable, and reproducible. It can also support future checklist design by identifying which annotation details are most often missing. Importantly, reporting requirements may encourage researchers to reflect more carefully on annotation procedures before data collection begins. At the same time, reporting standards must be applied carefully. Calls for more demographic or positional information should not pressure authors to collect sensitive personal data when it is unnecessary, unsafe, or ethically inappropriate. Instead, task-sensitive reporting should mean documenting information that is relevant to interpreting annotation outcomes, while respecting annotator privacy and consent. More broadly, our goal is not to make annotation reporting more burdensome, but to establish a bare minimum level of transparency for human judgments that often form the empirical basis of NLP research.

Acknowledgments

The project behind this paper began during a joint retreat of two research groups: the NLLG Lab at UTN, Germany, and the NLP Lab at IT Interdisciplinary Transformation University Austria (see Figure 18). The retreat took place at Speinshart Monastery in Bavaria in February 2026. We are very grateful to the Speinshart Scientific Centre for AI and SuperTech for hosting us during this one-week Connect@Speinshart retreat. The NLLG Lab further gratefully acknowledges support from the German Research Foundation (DFG) through the Heisenberg Grant EG 375/5-1.

References

The mechanical bard: an interpretable machine learning approach to Shakespearean sonnet generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1627–1638. Cited by: Table 3.
M. Altemeyer, S. Eger, J. Daxenberger, Y. Chen, T. Altendorf, P. Cimiano, and B. Schiller (2025) Argument summarization and its evaluation in the era of large language models. arXiv:2503.00847. Cited by: Table 3.
R. Artstein and M. Poesio (2008) Survey article: inter-coder agreement for computational linguistics. Computational Linguistics 34(4), pp. 555–596. Cited by: §1, §2.
A. Asai and E. Choi (2021) Challenges in information-seeking QA: unanswerable questions and paragraph retrieval. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 1492–1504. Cited by: Table 3.
P. S. Bayerl and K. I. Paul (2011) What determines inter-coder agreement in manual annotations? a meta-analytic investigation. Computational Linguistics 37(4), pp. 699–725. Cited by: §1, §2.
J. Beck (2023) Quality aspects of annotated data. AStA Wirtschafts- und Sozialstatistisches Archiv 17(3), pp. 331–353. ISSN 1863-8163. Cited by: §1, §2.
J. Belouadi, E. Ilg, M. Keuper, H. Tanaka, M. Utiyama, R. Dabre, S. Eger, and S. P. Ponzetto (2025) TikZero: zero-shot text-guided graphics program synthesis. arXiv:2503.11509. Cited by: Table 3, §2.
J. Belouadi, S. P. Ponzetto, and S. Eger (2024) DeTikZify: synthesizing graphics programs for scientific figures and sketches with TikZ. arXiv:2405.15306. Cited by: Table 3.
A. Belz, C. Thomson, E. Reiter, G. Abercrombie, J. M. Alonso-Moral, M. Arvan, A. Braggaar, M. Cieliebak, E. Clark, K. Van Deemter, et al. (2023) Missing information, unresponsive authors, experimental flaws: the impossibility of assessing the reproducibility of previous human evaluations in NLP. In Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, pp. 1–10. Cited by: §1, §2, §7.
A. Belz and C. Thomson (2024) The 2024 ReproNLP shared task on reproducibility of evaluations in NLP: overview and results. In Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024, pp. 91–105. Cited by: §1, §2.
E. M. Bender and B. Friedman (2018) Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604. Cited by: §1, §2.
A. Breit, A. Revenko, K. Rezaee, M. T. Pilehvar, and J. Camacho-Collados (2021) WiC-TSV: An evaluation benchmark for target sense verification of words in context. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), Online, pp. 1635–1645. Cited by: Table 3.
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. arXiv:2005.14165. Cited by: §2.
T. Chakrabarty, A. Saakyan, and S. Muresan (2021) Don’t go far off: an empirical study on neural poetry translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), Online and Punta Cana, Dominican Republic, pp. 7253–7265. Cited by: Table 3.
A. Chen, L. Lou, K. Chen, X. Bai, Y. Xiang, M. Yang, T. Zhao, and M. Zhang (2025a) Benchmarking LLMs for translating classical Chinese poetry: evaluating adequacy, fluency, and elegance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 33019–33036. ISBN 979-8-89176-332-6. Cited by: Table 3.
Q. Chen, M. Yang, L. Qin, J. Liu, Z. Yan, J. Guan, D. Peng, Y. Ji, H. Li, M. Hu, Y. Zhang, Y. Liang, Y. Zhou, J. Wang, Z. Chen, and W. Che (2025b) AI4Research: a survey of artificial intelligence for scientific research. arXiv:2507.01903. Cited by: §2.
S. E. Doneva, T. Ellendorff, B. Sick, J. Goldman, A. E. Cannon, G. Schneider, and B. V. Ineichen (2024) NeuroTrialNER: an annotated corpus for neurological diseases and therapies in clinical trial registries. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 18868–18890. Cited by: Table 3.
S. Eger, Y. Cao, J. D’Souza, A. Geiger, C. Greisinger, S. Gross, Y. Hou, B. Krenn, A. Lauscher, Y. Li, C. Lin, N. S. Moosavi, W. Zhao, and T. Miller (2026) Transforming science with large language models: a survey on AI-assisted scientific discovery, experimentation, content generation, and evaluation. arXiv:2502.05151. Cited by: §2.
Y. El Kheir, H. Mubarak, A. Ali, and S. Chowdhury (2024) Beyond orthography: automatic recovery of short vowels and dialectal sounds in Arabic. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 13172–13184. Cited by: Table 3.
Gemini Team (2025) A new era of intelligence with Gemini 3. Cited by: §5.
Google (2026) Gemma 4 Model Card. Cited by: §5.
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Wyatt, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith, F. Radenovic, F. Guzmán, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Thattai, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. Kloumann, I. Misra, I. Evtimov, J. Zhang, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. van der Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Prasad, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla, K. Lakhotia, L. Rantala-Yeary, L. van der Maaten, L. Chen, L. Tan, L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher, L. Landzaat, L. de Oliveira, M. Muzzi, M. Pasupuleti, M. Singh, M. Paluri, M. Kardas, M. Tsimpoukelli, M. Oldham, M. Rita, M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K. Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov, N. Bogoychev, N. Chatterji, N. Zhang, O. Duchenne, O. Çelebi, P. Alrassy, P. Zhang, P. Li, P. Vasic, P. Weng, P. Bhargava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He, Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer, R. S. Cabral, R. Stojnic, R. Raileanu, R. Maheswari, R. Girdhar, R. Patel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor, R. Silva, R. Hou, R. Wang, S. Hosseini, S. 
Chennabasappa, S. Singh, S. Bell, S. S. Kim, S. Edunov, S. Nie, S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhosale, S. Zhang, S. Vandenhende, S. Batra, S. Whitman, S. Sootla, S. Collot, S. Gururangan, S. Borodinsky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou, T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao, U. Karn, V. Goswami, V. Gupta, V. Ramanathan, V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Albiero, V. Petrovic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Martinet, X. Wang, X. Wang, X. E. Tan, X. Xia, X. Xie, X. Jia, X. Wang, Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song, Y. Zhang, Y. Li, Y. Mao, Z. D. Coudert, Z. Yan, Z. Chen, Z. Papakipos, A. Singh, A. Srivastava, A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Victoria, A. Goldstand, A. Menon, A. Sharma, A. Boesenberg, A. Baevski, A. Feinstein, A. Kallet, A. Sangani, A. Teo, A. Yunus, A. Lupu, A. Alvarado, A. Caples, A. Gu, A. Ho, A. Poulton, A. Ryan, A. Ramchandani, A. Dong, A. Franco, A. Goyal, A. Saraf, A. Chowdhury, A. Gabriel, A. Bharambe, A. Eisenman, A. Yazdan, B. James, B. Maurer, B. Leonhardi, B. Huang, B. Loyd, B. D. Paola, B. Paranjape, B. Liu, B. Wu, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Stojkovic, B. Gamido, B. Montalvo, C. Parker, C. Burton, C. Mejia, C. Liu, C. Wang, C. Kim, C. Zhou, C. Hu, C. Chu, C. Cai, C. Tindal, C. Feichtenhofer, C. Gao, D. Civin, D. Beaty, D. Kreymer, D. Li, D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, E. Dowling, E. Jamil, E. Montgomery, E. Presani, E. Hahn, E. Wood, E. Le, E. Brinkman, E. Arcaute, E. Dunbar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Kokkinos, F. Ozgenel, F. Caggioni, F. Kanayet, F. Seide, G. M. Florez, G. Schwarz, G. Badeer, G. Swee, G. Halpern, G. Herman, G. Sizov, Guangyi, Zhang, G. Lakshminarayanan, H. Inan, H. Shojanazeri, H. Zou, H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, H. Aspegren, H. Goldman, H. Zhan, I. Damlaj, I. Molybog, I. Tufanov, I. Leontiadis, I. Veliche, I. 
Gat, J. Weissman, J. Geboski, J. Kohli, J. Lam, J. Asher, J. Gaya, J. Marcus, J. Tang, J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shepard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, K. Wu, K. H. U, K. Saxena, K. Khandelwal, K. Zand, K. Matosich, K. Veeraraghavan, K. Michelena, K. Li, K. Jagadeesh, K. Huang, K. Chawla, K. Huang, L. Chen, L. Garg, L. A, L. Silva, L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, M. Mankus, M. Hasson, M. Lennie, M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Keneally, M. Liu, M. L. Seltzer, M. Valko, M. Restrepo, M. Patel, M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, M. Wang, M. J. Hermoso, M. Metanat, M. Rastegari, M. Bansal, N. Santhanam, N. Parks, N. White, N. Bawa, N. Singhal, N. Egebo, N. Usunier, N. Mehta, N. P. Laptev, N. Dong, N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji, P. Rittner, P. Bontrager, P. Roux, P. Dollar, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Parthasarathy, R. Li, R. Hogan, R. Battey, R. Wang, R. Howes, R. Rinott, S. Mehta, S. Siby, S. J. Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, S. Sidorov, S. Pan, S. Mahajan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay, S. Lindsay, S. Feng, S. Lin, S. C. Zha, S. Patil, S. Shankar, S. Zhang, S. Zhang, S. Wang, S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Deng, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Koehler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. Ajayi, V. Montanez, V. Mohan, V. S. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wu, X. Wang, X. Wu, X. Gao, Y. Kleinman, Y. Chen, Y. Hu, Y. 
Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Yu, Wang, Y. Zhao, Y. Hao, Y. Qian, Y. Li, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, Z. Zhao, and Z. Ma (2024) The Llama 3 herd of models. arXiv:2407.21783. Cited by: §2.
C. Greisinger and S. Eger (2026) TikZilla: scaling text-to-tikz with high-quality data and reinforcement learning. arXiv:2603.03072. Cited by: Table 3, §2.
J. Haber, B. Vidgen, M. Chapman, V. Agarwal, R. K. Lee, Y. K. Yap, and P. Röttger (2023) Improving the detection of multilingual online attacks with rich social media data from Singapore. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 12705–12721. Cited by: Table 3, Figure 17, Appendix F.
Z. He, T. Liang, W. Jiao, Z. Zhang, Y. Yang, R. Wang, Z. Tu, S. Shi, and X. Wang (2024) Exploring human-like translation strategy with large language models. Transactions of the Association for Computational Linguistics 12, pp. 229–246. Cited by: Table 3.
A. Hengle, A. Padhi, S. Singh, A. Bandhakavi, M. S. Akhtar, and T. Chakraborty (2024) Intent-conditioned and non-toxic counterspeech generation using multi-task instruction tuning with RLAIF. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 6716–6733. Cited by: Table 3.
J. Huang, J. Qin, J. Zhang, Y. Yuan, W. Wang, and J. Zhao (2025) VisBias: measuring explicit and implicit social biases in vision language models. arXiv:2503.07575. Cited by: Table 3.
Y. Ide, J. Tanner, A. Nohejl, J. Hoffman, J. Vasselli, H. Kamigaito, and T. Watanabe (2025) CoAM: corpus of all-type multiword expressions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 27004–27021. ISBN 979-8-89176-251-0. Cited by: Table 3.
G. Ivetta, M. J. Gomez, S. Martinelli, P. Palombini, M. E. Echeveste, N. C. Mazzeo, B. Busaniche, and L. Benotti (2025) HESEIA: a community-based dataset for evaluating social biases in large language models, co-designed in real school settings in Latin America. arXiv:2505.24712. Cited by: Table 3.
A. Jiang, N. Vitsakis, T. Dinkar, G. Abercrombie, and I. Konstas (2024) Re-examining sexism and misogyny classification with annotator attitudes. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 15103–15125. Cited by: §2.
T. Jiang and E. Riloff (2021) Learning prototypical functions for physical artifacts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 6941–6951. Cited by: Table 3.
Y. Jiang, H. Zhu, J. K. Kummerfeld, Y. Li, and W. Lasecki (2020) A novel workflow for accurately and efficiently crowdsourcing predicate senses and argument labels. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 415–421. Cited by: Table 3.
D. Jin, Z. Jin, J. T. Zhou, L. Orii, and P. Szolovits (2020) Hooks in the headline: learning to generate headlines with controlled styles. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 5082–5093. Cited by: Table 3.
P. Joshi, S. Santy, A. Budhiraja, K. Bali, and M. Choudhury (2020) The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 6282–6293. Cited by: §2.
D. Jurgens, S. Kumar, R. Hoover, D. McFarland, and D. Jurafsky (2018) Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics 6, pp. 391–406. Cited by: §1.
H. J. Kang, F. Harel-Canada, M. A. Gulzar, V. Peng, and M. Kim (2024) Human-in-the-loop synthetic text data inspection with provenance tracking. arXiv:2404.18881. Cited by: Table 3.
A. Karamolegkou, S. S. Hansen, A. Christopoulou, F. Stamatiou, A. Lauscher, and A. Søgaard (2025) Ethical concern identification in NLP: a corpus of ACL Anthology ethics statements. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 11618–11635. ISBN 979-8-89176-189-6. Cited by: §2.
T. Khot, D. Khashabi, K. Richardson, P. Clark, and A. Sabharwal (2021) Text modular networks: learning to decompose tasks in the language of existing models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), Online, pp. 1264–1279. Cited by: Table 3.
Y. Kim, J. Lee, and S. Yang (2025) Exploring LLM AI in automatic generation of abstracts for research publications. Proceedings of the Association for Information Science and Technology 62(1), pp. 967–971. Cited by: §2.
J. Klie, R. Eckart de Castilho, and I. Gurevych (2024) Analyzing dataset annotation quality management in the wild. Computational Linguistics 50(3), pp. 817–866. Cited by: §1, §2, §6.1.
A. Kostikova, B. Paassen, D. Beese, O. Pütz, G. Wiedemann, and S. Eger (2024) Fine-grained detection of solidarity for women and migrants in 155 years of German parliamentary debates. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 5884–5907. Cited by: Table 3, §2.
J. Li, V. Raheja, and D. Kumar (2024) ContraDoc: understanding self-contradictions in documents with large language models. arXiv:2311.09182. Cited by: Table 3.
J. Lin, R. Ye, M. Han, Q. Zhang, R. Lai, X. Zhang, Z. Cao, X. Huang, and Z. Wei (2023) Argue with me tersely: towards sentence-level counter-argument generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 16705–16720. Cited by: Table 3.
O. Loginova and S. O. Loguinova (2025) Deep temporal reasoning in video language models: a cross-linguistic evaluation of action duration and completion through perfect times. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 20472–20502. ISBN 979-8-89176-251-0. Cited by: Table 3.
L. Mathur, M. Qian, P. P. Liang, and L. Morency (2025) Social genome: grounded social reasoning abilities of multimodal models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 24868–24891. ISBN 979-8-89176-332-6. Cited by: Table 3.
B. Mousi, N. Durrani, and F. Dalvi (2023) Can LLMs facilitate interpretation of pre-trained language models? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 3248–3268. Cited by: Table 3, §2.
K. Oketch, J. P. Lalor, Y. Yang, and A. Abbasi (2025) Bridging the LLM accessibility divide? performance, fairness, and cost of closed versus open LLMs for automated essay scoring. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), O. Arviv, M. Clinciu, K. Dhole, R. Dror, S. Gehrmann, E. Habba, I. Itzhak, S. Mille, Y. Perlitz, E. Santus, J. Sedoc, M. Shmueli Scheuer, G. Stanovsky, and O. Tafjord (Eds.), Vienna, Austria and virtual meeting, pp. 655–669. ISBN 979-8-89176-261-9. Cited by: §5.
OpenAI (2025a) Gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925. Cited by: §5.
OpenAI (2025b) Introducing GPT‑5. Cited by: §5.
B. N. Patro, S. Kumar, V. K. Kurmi, and V. Namboodiri (2018) Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), Brussels, Belgium, pp. 4002–4012. Cited by: Table 3.
J. Pei and D. Jurgens (2023) When do annotator demographics matter? measuring the influence of annotator demographics with the POPQUORN dataset. In Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII), J. Prange and A. Friedrich (Eds.), Toronto, Canada, pp. 252–265. Cited by: §1, §2.
M. Popović and A. Belz (2021) A reproduction study of an annotation-based human evaluation of MT outputs. In Proceedings of the 14th International Conference on Natural Language Generation, A. Belz, A. Fan, E. Reiter, and Y. Sripada (Eds.), Aberdeen, Scotland, UK, pp. 293–300. Cited by: §1, §2.
A. Pramanick, Y. Hou, S. M. Mohammad, and I. Gurevych (2025) The nature of NLP: analyzing contributions in NLP papers. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 25169–25191. ISBN 979-8-89176-251-0. Cited by: §2.
Qwen Team (2026) Qwen3.6-27B: flagship-level coding in a 27B dense model. Cited by: §5.
ACL Rolling Review (2022) ACL Rolling Review: a peer review platform for the Association for Computational Linguistics. Cited by: §1, §6.2.
A. Sai B, T. Dixit, V. Nagarajan, A. Kunchukuttan, P. Kumar, M. M. Khapra, and R. Dabre (2023) IndicMT eval: a dataset to meta-evaluate machine translation metrics for Indian languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 14210–14228. Cited by: Table 3.
A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, A. Nathan, A. Luo, A. Helyar, A. Madry, A. Efremov, A. Spyra, A. Baker-Whitcomb, A. Beutel, A. Karpenko, A. Makelov, A. Neitz, A. Wei, A. Barr, A. Kirchmeyer, A. Ivanov, A. Christakis, A. Gillespie, A. Tam, A. Bennett, A. Wan, A. Huang, A. M. Sandjideh, A. Yang, A. Kumar, A. Saraiva, A. Vallone, A. Gheorghe, A. G. Garcia, A. Braunstein, A. Liu, A. Schmidt, A. Mereskin, A. Mishchenko, A. Applebaum, A. Rogerson, A. Rajan, A. Wei, A. Kotha, A. Srivastava, A. Agrawal, A. Vijayvergiya, A. Tyra, A. Nair, A. Nayak, B. Eggers, B. Ji, B. Hoover, B. Chen, B. Chen, B. Barak, B. Minaiev, B. Hao, B. Baker, B. Lightcap, B. McKinzie, B. Wang, B. Quinn, B. Fioca, B. Hsu, B. Yang, B. Yu, B. Zhang, B. Brenner, C. R. Zetino, C. Raymond, C. Lugaresi, C. Paz, C. Hudson, C. Whitney, C. Li, C. Chen, C. Cole, C. Voss, C. Ding, C. Shen, C. Huang, C. Colby, C. Hallacy, C. Koch, C. Lu, C. Kaplan, C. Kim, C. Minott-Henriques, C. Frey, C. Yu, C. Czarnecki, C. Reid, C. Wei, C. Decareaux, C. Scheau, C. Zhang, C. Forbes, D. Tang, D. Goldberg, D. Roberts, D. Palmie, D. Kappler, D. Levine, D. Wright, D. Leo, D. Lin, D. Robinson, D. Grabb, D. Chen, D. Lim, D. Salama, D. Bhattacharjee, D. Tsipras, D. Li, D. Yu, D. Strouse, D. Williams, D. Hunn, E. Bayes, E. Arbus, E. Akyurek, E. Y. Le, E. Widmann, E. Yani, E. Proehl, E. Sert, E. Cheung, E. Schwartz, E. Han, E. Jiang, E. Mitchell, E. Sigler, E. Wallace, E. Ritter, E. Kavanaugh, E. Mays, E. Nikishin, F. Li, F. P. Such, F. de Avila Belbute Peres, F. Raso, F. Bekerman, F. Tsimpourlas, F. Chantzis, F. Song, F. Zhang, G. Raila, G. McGrath, G. Briggs, G. Yang, G. Parascandolo, G. Chabot, G. Kim, G. Zhao, G. Valiant, G. Leclerc, H. Salman, H. Wang, H. Sheng, H. Jiang, H. Wang, H. Jin, H. Sikchi, H. Schmidt, H. Aspegren, H. Chen, H. Qiu, H. Lightman, I. Covert, I. Kivlichan, I. Silber, I. Sohl, I. Hammoud, I. Clavera, I. Lan, I. Akkaya, I. 
Kostrikov, I. Kofman, I. Etinger, I. Singal, J. Hehir, J. Huh, J. Pan, J. Wilczynski, J. Pachocki, J. Lee, J. Quinn, J. Kiros, J. Kalra, J. Samaroo, J. Wang, J. Wolfe, J. Chen, J. Wang, J. Harb, J. Han, J. Wang, J. Zhao, J. Chen, J. Yang, J. Tworek, J. Chand, J. Landon, J. Liang, J. Lin, J. Liu, J. Wang, J. Tang, J. Yin, J. Jang, J. Morris, J. Flynn, J. Ferstad, J. Heidecke, J. Fishbein, J. Hallman, J. Grant, J. Chien, J. Gordon, J. Park, J. Liss, J. Kraaijeveld, J. Guay, J. Mo, J. Lawson, J. McGrath, J. Vendrow, J. Jiao, J. Lee, J. Steele, J. Wang, J. Mao, K. Chen, K. Hayashi, K. Xiao, K. Salahi, K. Wu, K. Sekhri, K. Sharma, K. Singhal, K. Li, K. Nguyen, K. Gu-Lemberg, K. King, K. Liu, K. Stone, K. Yu, K. Ying, K. Georgiev, K. Lim, K. Tirumala, K. Miller, L. Ahmad, L. Lv, L. Clare, L. Fauconnet, L. Itow, L. Yang, L. Romaniuk, L. Anise, L. Byron, L. Pathak, L. Maksin, L. Lo, L. Ho, L. Jing, L. Wu, L. Xiong, L. Mamitsuka, L. Yang, L. McCallum, L. Held, L. Bourgeois, L. Engstrom, L. Kuhn, L. Feuvrier, L. Zhang, L. Switzer, L. Kondraciuk, L. Kaiser, M. Joglekar, M. Singh, M. Shah, M. Stratta, M. Williams, M. Chen, M. Sun, M. Cayton, M. Li, M. Zhang, M. Aljubeh, M. Nichols, M. Haines, M. Schwarzer, M. Gupta, M. Shah, M. Y. Guan, M. Huang, M. Dong, M. Wang, M. Glaese, M. Carroll, M. Lampe, M. Malek, M. Sharman, M. Zhang, M. Wang, M. Pokrass, M. Florian, M. Pavlov, M. Wang, M. Chen, M. Wang, M. Feng, M. Bavarian, M. Lin, M. Abdool, M. Rohaninejad, N. Soto, N. Staudacher, N. LaFontaine, N. Marwell, N. Liu, N. Preston, N. Turley, N. Ansman, N. Blades, N. Pancha, N. Mikhaylin, N. Felix, N. Handa, N. Rai, N. Keskar, N. Brown, O. Nachum, O. Boiko, O. Murk, O. Watkins, O. Gleeson, P. Mishkin, P. Lesiewicz, P. Baltescu, P. Belov, P. Zhokhov, P. Pronin, P. Guo, P. Thacker, Q. Liu, Q. Yuan, Q. Liu, R. Dias, R. Puckett, R. Arora, R. T. Mullapudi, R. Gaon, R. Miyara, R. Song, R. Aggarwal, R. Marsan, R. Yemiru, R. Xiong, R. Kshirsagar, R. Nuttall, R. Tsiupa, R. Eldan, R. Wang, R. 
James, R. Ziv, R. Shu, R. Nigmatullin, S. Jain, S. Talaie, S. Altman, S. Arnesen, S. Toizer, S. Toyer, S. Miserendino, S. Agarwal, S. Yoo, S. Heon, S. Ethersmith, S. Grove, S. Taylor, S. Bubeck, S. Banesiu, S. Amdo, S. Zhao, S. Wu, S. Santurkar, S. Zhao, S. R. Chaudhuri, S. Krishnaswamy, Shuaiqi, Xia, S. Cheng, S. Anadkat, S. P. Fishman, S. Tobin, S. Fu, S. Jain, S. Mei, S. Egoian, S. Kim, S. Golden, S. Mah, S. Lin, S. Imm, S. Sharpe, S. Yadlowsky, S. Choudhry, S. Eum, S. Sanjeev, T. Khan, T. Stramer, T. Wang, T. Xin, T. Gogineni, T. Christianson, T. Sanders, T. Patwardhan, T. Degry, T. Shadwell, T. Fu, T. Gao, T. Garipov, T. Sriskandarajah, T. Sherbakov, T. Korbak, T. Kaftan, T. Hiratsuka, T. Wang, T. Song, T. Zhao, T. Peterson, V. Kharitonov, V. Chernova, V. Kosaraju, V. Kuo, V. Pong, V. Verma, V. Petrov, W. Jiang, W. Zhang, W. Zhou, W. Xie, W. Zhan, W. McCabe, W. DePue, W. Ellsworth, W. Bain, W. Thompson, X. Chen, X. Qi, X. Xiang, X. Shi, Y. Dubois, Y. Yu, Y. Khakbaz, Y. Wu, Y. Qian, Y. T. Lee, Y. Chen, Y. Zhang, Y. Xiong, Y. Tian, Y. Cha, Y. Bai, Y. Yang, Y. Yuan, Y. Li, Y. Zhang, Y. Yang, Y. Jin, Y. Jiang, Y. Wang, Y. Wang, Y. Liu, Z. Stubenvoll, Z. Dou, Z. Wu, and Z. Wang (2026)OpenAI gpt-5 system card.External Links:2601.03267,LinkCited by:§2.
K. Tyser, B. Segev, G. Longhitano, X. Zhang, Z. Meeks, J. Lee, U. Garg, N. Belsten, A. Shporer, M. Udell, D. Te’eni, and I. Drori (2024) AI-driven review systems: evaluating llms in scalable and bias-aware academic reviews. External Links: 2408.10365. Cited by: §2.
C. van der Lee, A. Gatt, E. van Miltenburg, S. Wubben, and E. Krahmer (2019) Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, K. van Deemter, C. Lin, and H. Takamura (Eds.), Tokyo, Japan, pp. 355–368. Cited by: §1, §2.
V. Vasilev, J. Agafonova, N. Gerasimenko, A. Kapitanov, P. Mikhailova, E. Mironova, and D. Dimitrov (2025) RusCode: Russian cultural code benchmark for text-to-image generation. In Findings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 7656–7672. ISBN 979-8-89176-195-7. Cited by: Table 3.
M. Walsh, A. Preus, and M. Antoniak (2024) Sonnet or not, bot? poetry evaluation for large models and datasets. External Links: 2406.18906. Cited by: Table 3.
H. Wang, B. Peng, and K. Wong (2020) Learning efficient dialogue policy from demonstrations through shaping. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 6355–6365. Cited by: Table 3.
O. Weller and K. Seppi (2019) Humor detection: a transformer gets the last laugh. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 3621–3625. Cited by: Table 3.
Y. Xie, Y. Pan, H. Xu, and Q. Mei (2025) Bridging ai and science: implications from a large-scale literature analysis of ai4science. External Links: 2412.09628. Cited by: §2.
L. Xiong, J. Zhou, Q. Zhu, X. Wang, Y. Wu, Q. Zhang, T. Gui, X. Huang, J. Ma, and Y. Shan (2023) A confidence-based partial label learning model for crowd-annotated named entity recognition. In Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1375–1386. Cited by: Table 3.
P. Zeinert, N. Inie, and L. Derczynski (2021) Annotating online misogyny. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 3181–3197. Cited by: Table 3.
E. Zelikman, W. Ma, J. Tran, D. Yang, J. Yeatman, and N. Haber (2023) Generating and evaluating tests for k-12 students with language model simulations: a case study on sentence reading efficiency. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 2190–2205. Cited by: Table 3.
G. Zeng, W. Yang, Z. Ju, Y. Yang, S. Wang, R. Zhang, M. Zhou, J. Zeng, X. Dong, R. Zhang, H. Fang, P. Zhu, S. Chen, and P. Xie (2020) MedDialog: large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 9241–9250. Cited by: Table 3.
S. Zhang, A. Celikyilmaz, J. Gao, and M. Bansal (2021) EmailSum: abstractive email thread summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 6895–6909. Cited by: Table 3.
X. Zhao, A. Niazi, and A. Rios (2024) A comprehensive study of gender bias in chemical named entity recognition models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 4360–4374. Cited by: Table 3.
C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, and D. Yang (2024) Can large language models transform computational social science?. Computational Linguistics 50(1), pp. 237–291. Cited by: §2.
X. Zou (2025) BIPro: zero-shot Chinese poem generation via block inverse prompting constrained generation framework. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 1116–1134. ISBN 979-8-89176-251-0. Cited by: Table 3.

Appendix A Datasets

The following 34 keywords were used to identify papers involving human annotation based on matches in titles and abstracts.

•Amazon Mechanical Turk
•annotator
•crowdworker
•crowdsourcer
•crowdsourcing
•expert annotation
•expert evaluation
•expert opinion
•HIL
•human annotation
•human annotator
•human benchmark
•human dataset
•human evaluation
•human expert
•human in the loop
•human judgment
•human judgement
•human performance
•human perception
•human rating
•human raters
•human respondent
•human study
•human-in-the-loop
•labeler
•manual annotation
•manual evaluation
•manual rating
•manually annotated
•manually evaluated
•MTurk
•Prolific
•student worker
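
The keyword filter above can be sketched in a few lines of Python. This is only an illustration of title-and-abstract matching, not the authors' retrieval code: the matching rules (case-insensitive, word-boundary matching with an optional plural "s") and the function names `matched_keywords` and `is_candidate` are assumptions, since the paper lists the keywords but does not describe the matching procedure.

```python
import re

# The 34 retrieval keywords listed in Appendix A.
KEYWORDS = [
    "Amazon Mechanical Turk", "annotator", "crowdworker", "crowdsourcer",
    "crowdsourcing", "expert annotation", "expert evaluation", "expert opinion",
    "HIL", "human annotation", "human annotator", "human benchmark",
    "human dataset", "human evaluation", "human expert", "human in the loop",
    "human judgment", "human judgement", "human performance", "human perception",
    "human rating", "human raters", "human respondent", "human study",
    "human-in-the-loop", "labeler", "manual annotation", "manual evaluation",
    "manual rating", "manually annotated", "manually evaluated", "MTurk",
    "Prolific", "student worker",
]

# ASSUMPTION: case-insensitive match on word boundaries, allowing a plural "s".
# Longer keywords come first so they win over their own prefixes.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
    + r")s?\b",
    re.IGNORECASE,
)


def matched_keywords(title: str, abstract: str) -> list[str]:
    """Return the distinct (lower-cased) keyword hits in title + abstract."""
    text = f"{title}\n{abstract}"
    return sorted({m.group(0).lower() for m in _PATTERN.finditer(text)})


def is_candidate(title: str, abstract: str) -> bool:
    """A paper is a retrieval candidate if at least one keyword matches."""
    return bool(matched_keywords(title, abstract))
```

A filter this loose will also match unrelated uses of words such as "prolific", so the retrieved candidates still need to be checked for actual annotation content.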

A.1 Random-sample validation of keyword retrieval

The large-scale extraction set, Annotated$_{\textsc{LLM}}$, is constructed using keyword-based retrieval over paper titles and abstracts. This strategy is designed to increase the concentration of papers likely to contain human annotation tasks, but it may also change the composition of the resulting sample. To assess this effect, we compare Annotated$_{\textsc{LLM}}$ with a stratified random sample drawn from the same venue-year scope.

The random sample contains 3,000 papers drawn across the same venues and publication years. Of these, 1,085 papers contain annotatable human annotation content, corresponding to a retention rate of 36%. In contrast, keyword-based retrieval yields 1,603 retained papers out of 1,995 candidates, corresponding to a retention rate of 82%. This confirms that keyword retrieval substantially improves the efficiency of identifying papers with human annotation content.

To check whether our sampling strategy introduced any bias, we compare the distributions of taxonomy-value pairs between the keyword-retrieved set and the random sample. For each category-value pair, we compute the normalized frequency in both samples and test the difference in proportions using a $\chi^{2}$ test. Figure 5 shows only category-value pairs with statistically significant differences at p < 0.05. To avoid duplicate visualizations for binary variables, the figure reports only the positive or reported value where applicable.
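
As a rough illustration of the per-pair comparison, the sketch below computes a Pearson chi-square statistic for a 2x2 table (value present or absent, in each of the two samples) and the difference in percentage points. It uses only the standard library; the function names and the example counts are hypothetical, and the absence of a continuity correction is an assumption, since the paper does not state which variant of the test was used.

```python
import math


def chi2_two_proportions(k_a: int, n_a: int, k_b: int, n_b: int) -> tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    k_a of n_a tasks in sample A and k_b of n_b tasks in sample B carry the
    category value of interest. Returns (chi2 statistic, p-value).
    """
    observed = [[k_a, n_a - k_a], [k_b, n_b - k_b]]
    row_totals = [n_a, n_b]
    col_totals = [k_a + k_b, (n_a - k_a) + (n_b - k_b)]
    total = n_a + n_b
    if 0 in row_totals or 0 in col_totals:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def delta_percentage_points(k_a: int, n_a: int, k_b: int, n_b: int) -> float:
    """Signed difference in proportions (A minus B), in percentage points."""
    return 100.0 * (k_a / n_a - k_b / n_b)


# Hypothetical counts, for illustration only (not taken from the paper).
chi2, p = chi2_two_proportions(300, 1000, 250, 1000)
delta = delta_percentage_points(300, 1000, 250, 1000)
```

Running one such test per category-value pair means many comparisons are made at once, which is worth keeping in mind when reading the set of pairs flagged at p < 0.05.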

The comparison shows that keyword retrieval introduces some statistically significant composition differences. However, these differences are generally modest: among statistically significant category-value pairs, the average absolute difference is 5.2 percentage points. Only a small number of category-value pairs exhibit substantial deviations, with extreme cases reaching −16 and +18 percentage points in opposite directions. In other words, the keyword-retrieved set is not simply higher or lower on all reporting dimensions. Instead, it over-represents some category values and under-represents others.

We therefore interpret Annotated$_{\textsc{LLM}}$ as an annotation-focused corpus rather than a fully representative sample of all ACL-venue papers. The random-sample comparison suggests that keyword retrieval changes sample composition, but does not reveal a single systematic directional bias across reporting categories. This supports our use of Annotated$_{\textsc{LLM}}$ for analyzing reporting patterns among papers likely to involve human annotation, while motivating caution in generalizing the exact proportions to all ACL Anthology papers containing human annotation tasks.

Figure 5: Significant proportional differences (Δ%) between filtered and random samples across category–value pairs (Chi-square test, p < 0.05). Bars are sorted by effect size and centered at zero to indicate direction (over- vs under-representation in the filtered sample). For binary and binarized string fields, only the significant proportional deltas for the positive value (e.g., “yes”, “reported”) are shown to avoid duplication.

Appendix B Reportage Score Calculation

The Reportage Score is computed as the proportion of reported annotation attributes relative to the set of attributes applicable to a given annotation task. The full annotation schema contains 21 reporting categories grouped into:

•10 universal attributes expected for all studies (guidelines release, recruitment strategy, training procedures, level of expertise, language proficiency, educational level, number of annotators, number of annotated items, compensation, post-annotation filtering)
•5 socio-demographic attributes describing annotator background (age, gender, nation of residence, nation of origin, political orientation), which are expected in tasks annotating subjective or socially-constructed language phenomena
•6 conditional attributes that can be excluded as inapplicable or redundant given other types of information reported for the task (inter-annotator agreement metric and value, annotators per item, items per annotator, adjudication, crowdworker screening)

The denominator is adjusted dynamically to avoid penalizing papers for omitting inapplicable or redundant information. Specifically, for papers belonging to Subjective language & social meaning, the denominator includes the five socio-demographic characteristics; for other papers, these are treated as optional. Inter-annotator agreement (IAA) and adjudication are excluded from the denominator for single-annotator studies (i.e., when total_annotators = 1 or annotators_per_item = 1, which is often the case for text production tasks); the IAA value is excluded if no IAA metric is reported; crowdworker screening is excluded when the recruitment strategy does not involve crowdsourcing; and adjudication is excluded for tasks without open-ended annotation.

To avoid double-counting, redundant fields may be excluded when they encode equivalent information. For example, if total_annotators and annotators_per_item have the same value, one of them is omitted. The items_per_annotator field may be excluded when it is recoverable from other reported fields. Missing information is treated as incomplete reporting; for instance, reporting an IAA metric without the corresponding value does not count as valid reportage. Fine-grained compensation extensions such as payment rate are treated as refinements rather than independent reporting dimensions and are excluded from scoring.

The final score is computed separately for each annotation task in a paper using the resulting set of applicable reporting categories.
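
The dynamic denominator can be made concrete with a small sketch. It is one possible reading of the rules in this appendix, not the authors' implementation: the field names are hypothetical, a field counts as reported when its value is not `None` (the caller is assumed to map "Not reported" and similar values to `None`), crowdworker screening is kept only when recruitment is exactly "Crowdsourcing", and an IAA metric reported without a value earns no credit. The adjudication rule for open-ended tasks and the recoverability rule for items per annotator are left out.

```python
UNIVERSAL = (
    "guidelines_released", "recruitment", "annotator_training", "expertise",
    "language_proficiency", "education", "total_annotators", "total_items",
    "compensation", "post_filtering",
)
SOCIO_DEMOGRAPHIC = (
    "age", "gender", "nation_residence", "nation_origin", "political_orientation",
)
CONDITIONAL = (
    "iaa_metric", "iaa_value", "annotators_per_item", "items_per_annotator",
    "adjudication", "crowd_screening",
)
SUBJECTIVE_TOPIC = "Subjective language & social meaning"


def applicable_fields(values: dict, topics: list[str]) -> set[str]:
    """Build the task-specific denominator from the exclusion rules."""
    fields = set(UNIVERSAL) | set(CONDITIONAL)
    if SUBJECTIVE_TOPIC in topics:
        fields |= set(SOCIO_DEMOGRAPHIC)  # expected only for subjective tasks
    total = values.get("total_annotators")
    per_item = values.get("annotators_per_item")
    if total == 1 or per_item == 1:
        # Single-annotator study: agreement and adjudication do not apply.
        fields -= {"iaa_metric", "iaa_value", "adjudication"}
    if values.get("iaa_metric") is None:
        fields.discard("iaa_value")  # no metric, so no value is expected
    if values.get("recruitment") != "Crowdsourcing":  # ASSUMPTION: exact match
        fields.discard("crowd_screening")
    if total is not None and total == per_item:
        fields.discard("annotators_per_item")  # redundant with total_annotators
    return fields


def reportage_score(values: dict, topics: list[str]) -> float:
    """Share of applicable fields that are reported for one annotation task."""
    fields = applicable_fields(values, topics)
    reported = {f for f in fields if values.get(f) is not None}
    if "iaa_value" in fields and values.get("iaa_value") is None:
        reported.discard("iaa_metric")  # metric without value earns no credit
    return len(reported) / len(fields)


# Hypothetical task record, for illustration only.
example = {
    "guidelines_released": "Yes", "recruitment": "Other", "expertise": "High",
    "total_annotators": 3, "total_items": 500, "compensation": "Paid",
    "iaa_metric": "Cohen's kappa", "iaa_value": "0.71",
    "annotators_per_item": 3, "adjudication": "Majority vote",
}
score = reportage_score(example, ["Linguistics"])  # 9 reported of 14 applicable
```

In this example the denominator shrinks from 16 to 14 fields because crowdworker screening does not apply and annotators per item duplicates the total number of annotators.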

Appendix C Annotator demographics

In total, 12 annotators contributed to the annotation process: 2 professors, 2 postdoctoral researchers, 6 PhD students, and 2 master’s students. Among the annotators, 5 identified as women and 7 as men. The annotator pool included individuals originating from Italy, Germany, India, China, and Russia. All annotators reported fluent English proficiency, as all annotated papers were written in English. At the time of annotation, most annotators were affiliated with institutions located in Germany and Austria. Annotation was carried out on a part-time basis alongside regular academic activities. Professors contributed approximately 10 hours in total, while postdoctoral researchers, PhD students, and master’s students each contributed approximately 50 hours overall. Based on the 2026 standardized academic personnel rates published by the Deutsche Forschungsgemeinschaft, the estimated total personnel cost of the annotation process was approximately €6,300.

Appendix D Details on Human Annotation

| Paper ID | Tasks | Domain | Year | Venue |
|---|---|---|---|---|
| Wang et al. (2020) | 3 | Task-Oriented Dialogue Learning | 2020 | ACL |
| Jin et al. (2020) | 3 | Headlines Generation with Controlled Styles | 2020 | ACL |
| Zeinert et al. (2021) | 1 | Misogyny detection | 2021 | ACL |
| Zhang et al. (2021) | 3 | Text Summarization | 2021 | ACL |
| Jiang and Riloff (2021) | 1 | Prototypical function labeling of physical artifacts | 2021 | ACL |
| Asai and Choi (2021) | 2 | Information-Seeking QA | 2021 | ACL |
| Haber et al. (2023) | 3 | Detection of Multilingual Online Attacks | 2023 | ACL |
| Sai B et al. (2023) | 1 | Meta-Evaluation of Machine Translation Metrics | 2023 | ACL |
| Agnew et al. (2023) | 2 | Poetry Generation | 2023 | ACL |
| Xiong et al. (2023) | 2 | Crowd-Annotated Named Entity Recognition | 2023 | ACL |
| El Kheir et al. (2024) | 2 | Dialectal Sound and Vowelization Recovery | 2024 | ACL |
| Ide et al. (2025) | 1 | MWE Corpus Creation | 2025 | ACL |
| Zou (2025) | 1 | Poetry Generation | 2025 | ACL |
| Loginova and Loguinova (2025) | 2 | Deep Temporal Reasoning in Video Language Models | 2025 | ACL |
| Breit et al. (2021) | 1 | Target Sense Verification | 2021 | EACL |
| Patro et al. (2018) | 1 | Visual Question Generation | 2018 | EMNLP |
| Weller and Seppi (2019) | 1 | Humor Detection | 2019 | EMNLP |
| Jiang et al. (2020) | 2 | Crowdsourced Semantic Role Labeling | 2020 | EMNLP |
| Zeng et al. (2020) | 1 | Large-scale Medical Dialogue Datasets | 2020 | EMNLP |
| Chakrabarty et al. (2021) | 1 | Neural Poetry Translation | 2021 | EMNLP |
| Mousi et al. (2023) | 3 | Concept-Based Model Interpretability | 2021 | EMNLP |
| Lin et al. (2023) | 3 | Counter-argument Generation | 2023 | EMNLP |
| Zelikman et al. (2023) | 3 | Generating and Evaluating Educational Tests | 2023 | EMNLP |
| Doneva et al. (2024) | 1 | Biomedical NER Corpus Creation | 2024 | EMNLP |
| Kostikova et al. (2024) | 1 | Solidarity Detection | 2024 | EMNLP |
| Walsh et al. (2024) | 1 | Poetry Evaluation | 2024 | EMNLP |
| Altemeyer et al. (2025) | 2 | Argument Summarization and Evaluation | 2025 | EMNLP |
| Mathur et al. (2025) | 3 | Social Reasoning | 2025 | EMNLP |
| Ivetta et al. (2025) | 1 | Community-Based Bias Evaluation | 2025 | EMNLP |
| Huang et al. (2025) | 1 | Social Biases in Vision Language Models | 2025 | EMNLP |
| Chen et al. (2025a) | 2 | Benchmarking for Translating Classical Chinese Poetry | 2025 | EMNLP |
| Belouadi et al. (2025) | 1 | Text-to-Image Generation | 2025 | ICCV |
| Greisinger and Eger (2026) | 3 | Evaluation of TikZ Generated Code | 2026 | ICLR |
| Khot et al. (2021) | 1 | Question Answering | 2021 | NAACL |
| Zhao et al. (2024) | 1 | Gender Bias in Chemical NER Models | 2024 | NAACL |
| Li et al. (2024) | 2 | Contradiction Detection | 2024 | NAACL |
| Kang et al. (2024) | 2 | Synthetic Text Data Inspection | 2024 | NAACL |
| Hengle et al. (2024) | 2 | Counterspeech Generation | 2024 | NAACL |
| Vasilev et al. (2025) | 2 | Text-to-Image Generation | 2025 | NAACL |
| Belouadi et al. (2024) | 1 | Text-to-Image Generation | 2024 | NIPS |
| He et al. (2024) | 2 | Human-Like Translation Strategy | 2024 | TACL |

Table 3: Double-annotated and adjudicated papers included in the gold dataset.

| Aspect | Category | Values | Instruction |
|---|---|---|---|
| General Description | Paper’s topic | | This category captures the core problem the paper introduces or solves, as stated in the abstract or introduction. A paper gets exactly one label, determined by the dominant topical focus of contribution, usually found in abstract. The 10 drop-down options are adopted from ACL categories. In case of hybrid papers, decide what is their primary focus and contribution. If in doubt, assign at most two categories. |
| | | Linguistics | Core linguistic representation, word to sentence levels (e.g. PoS, parsing, NER, semantic roles, word sense disambiguation, entity linking). |
| | | Discourse/Pragmatics | The study of cross-sentence meaning and contextual semantics, relations between sentences/clauses in a text and text-level linguistic phenomena (coreference resolution, discourse relations, implicature/ellipsis resolution, topic segmentation, dialogue act tagging without generation). |
| | | Information Extraction | Advances in text to knowledge bases or structured schema conversion (knowledge graph construction, slot filling, relation/event/temporal prediction). |
| | | QA/Information access | Answering questions from text or corpora (extractive, open-domain, multi-hop QA, both deterministic SQuAD and open-ended). |
| | | Semantics, inference & similarity | Judging meaning relations between texts (entailment, reasoning benchmarks, paraphrase-detection, STS, NLI). |
| | | Subjective language & social meaning | Topics that explore social and cultural phenomena through language (sentiment, emotion, stance, humor, sarcasm, toxicity, bias, propaganda, authorship attribution). |
| | | Open-ended generation | Open-ended output production (dialogue/story/poetry/jokes/argument generation from a given topic or context, instruction following). |
| | | Rewriting tasks | Unlike generic generation, this task transforms a given text and is required to remain faithful to the input in some aspect (summarisation, simplification, paraphrase-generation, style transfer, rewriting under constraints). |
| | | Multilingual/Translation | Cross-linguistic mapping (MT, cross-lingual transfer evaluation, code-switching, multilingual generation). |
| | | Multimodal tasks | Language from other modalities (image/graphics captioning, visual QA, video description, multimodal assistants). |
| | Task type | | Select the type of task performed by human annotators. |
| | | Closed-label classification | Tasks where annotators choose from predefined categories (binary, multi-class, multi-label, tagging with a fixed schema). |
| | | Ordinal/scalar rating | Tasks where annotators pass judgments on a scale (Likert scale, 1-5 ratings, graded acceptability, quality scores). |
| | | Span-level or structural annotation | Annotators mark parts of the input or define structure (NER spans, coreference links, constituency/dependency trees, alignment). |
| | | Comparative/preference judgments | Relative evaluation of two or more items (A/B choice, ranking, pairwise preference, best–worst scaling). |
| | | Open-ended generation or transformation | Annotators produce new text or modify input (summarization, simplification, paraphrasing, translation, explanation writing). |
| | | Other | None of the above |
| | Task identity | [User input] | Record the specific name of each human annotation subtask as reported in the paper. This allows matching the order of subtasks within a paper. Especially important for papers reporting up to three annotation studies, but apply to all papers. How to extract: Use the exact name or description of the subtask as given in the paper. Do not create new names or abbreviations. |
| | Intended use | | Categorizes types of human annotation studies by their purpose. Select Resource creation and Human performance when an annotated dataset is used as a human baseline, without collecting additional judgments. |
| | | Resource creation | Human judgments are used to build a dataset, a gold standard, a benchmark or a train/test set for potential further use, including for a subsequent evaluation of their own models in the same study. |
| | | Human performance | Humans solve the same task as the machine to provide the human upper-bound and to compare human-machine performances. |
| | | Model output evaluation | Collecting human judgment to evaluate automatically generated items. |
| | | None of the above | A fallback option |
| | Guidelines release | | Indicates whether the annotation guidelines released by the paper allow the experiment to be reproduced. |
| | | Yes | Guidelines are accessible (in the paper, appendix, via a working link or citation of another paper) and contain all information necessary to reproduce the annotation. Select Yes only if someone could realistically repeat the annotation using the provided materials. |
| | | No | Guidelines are not accessible, or do not contain enough information to reproduce the annotation. |
| Agreement Level | IAA metric | Fleiss’ κ, Cohen’s κ, Krippendorff’s α, F1 agreement, Pearson, Spearman, Majority agreement, Kendall’s τ, Total agreement, Other, Not reported | Multiple selection dropdown. If more than one metric is reported, select several items in the order of appearance in the paper. The corresponding values should be entered in the IAA Value column in the same order separated by a pipe (\|). Other: some other IAA measure/method is explicitly reported. Not reported: either not applicable (one annotator) or not given in the paper. |
| | IAA value | [User input] | A string field. Refrain from inferring values. Separate multiple values (aligned with IAA metric names) with a pipe (\|). Enter a string if ranges are reported, or any other form of report is made (not one value). NA: if Other metric is reported or if the value is not directly reported and needs to be inferred. Not reported: either not applicable (one annotator) or not given in the paper. |
| Workload | Annotators/item | [User input] | A number or “Not reported” string. Type “Not reported” if this information is not explicitly given. |
| | Total annotators | [User input] | |
| | Total items | [User input] | |
| | Items/annotator | [User input] | |
| Recruitment and Qualifications | Recruitment | | This category captures who performed the annotation, i.e., the source from which annotators were recruited. |
| | | Crowdsourcing | Annotators recruited via a crowdsourcing platform or described as crowdworkers. |
| | | Authors | One or more authors performed the annotation. |
| | | Other | Annotators are a specified non-crowd group (e.g., students, experts, hired assistants, colleagues). |
| | | Mixed | Annotators include people from different specified groups. |
| | | Not reported | The source of annotators is not specified as a specific group. |
| | Crowd qualifications | | Applies only to crowdsourced annotators. If the paper does not use crowdworkers, select NA. |
| | | Yes | The paper reports preliminary screening or filtering of crowdworkers to ensure annotation quality (beyond basic socio-demographics). |
| | | No | The paper does not provide any information on quality-related pre-selection or screening of crowdworkers beyond socio-demographics. Choose No if Recruitment is Not reported. |
| | | NA | There is an explicit mention of other-than-crowd sources of annotators (other and authors). |
| | Annotator training | | Is any preliminary annotator training reported (e.g. calibration sessions, trial studies, hands-on prior experience with the annotation tool). |
| | | Yes | Some form of annotator training is reported. |
| | | No | Either not discussed or explicit mention of no training provided. |
| | Language proficiency | | Capture the reported language proficiency of annotators with respect to the language of the annotation task. |
| | | Native | The annotators are directly described as native speakers. If annotators’ ethnicity is reported, and it aligns with the language of the annotation task, annotators should be considered native speakers. |
| | | Non-native | The authors report some level of proficiency in the language of the annotated data (fluent, B2, C1), but not native. |
| | | Mixed | Both native and non-native speakers were employed. |
| | | Not reported | Language proficiency is not directly reported. |
| | Level of expertise | | Indicates the actual expertise of the employed annotators for tasks where specialized knowledge is relevant. |
| | | High | Annotators have specialized domain knowledge relevant to the task (e.g., medical, legal, technical, linguistic experts). |
| | | Medium | Annotators have some relevant background, but are not deeply specialized (e.g., students in a related field). |
| | | Mixed | Reported levels of expertise vary (e.g. experts and students, or students and unfiltered crowd workers). |
| | | Not reported | The task could require specialized knowledge, but the paper does not specify the annotators’ expertise. |
| | | General language task | The task does not require specialized knowledge. |
| Compensation | Reported compensation | | Indicates whether annotator compensation is described. |
| | | Paid | Some compensation is mentioned, including “annotators were paid”, “full-time employees”. |
| | | Free | Cases where voluntary effort is reported (authors of the paper or colleagues). |
| | | Not reported | No mention of compensation. |
| | Payment rate | | Provide specifics on how compensation is reported for papers where compensation is Paid in Reported compensation. |
| | | Specific numeric rate | A concrete payment is given (e.g., hourly rate, per-task payment, total amount). |
| | | General mention | Compensation is mentioned, but no exact amount is provided (e.g., “annotators were paid”, “full-time employees”). |
| | | NA | Select NA for cases where the compensation is “Free” or “Not reported” in Reported compensation. |
| Socio-Demographics | Age | Yes; No | Is age reported? |
| | Gender | Yes; No | Is gender of annotators reported? |
| | Nation origin | Yes; No | Is nationality of annotators discussed? Nationality is interpreted as the country of birth. |
| | Nation residence | Yes; No | Is the nation of residence at the time of annotation discussed? It can be different from the nation of origin. |
| | Education | Yes; No | Is educational level or field of study reported? |
| | Political orientation | Yes; No | Is political orientation discussed? |
| Quality Control | Post-filtering | | Does the paper report any quality control or filtering of annotations after collection? |
| | | Yes | Some form of control and post-annotation filtering is reported or they explicitly report disqualifying experts due to lack of guidelines understanding or removing items due to issues revealed during annotation. |
| | | No | Nothing is reported about filtering the annotation results. |
| | Adjudication | | Captures how annotation disagreements are handled to produce final labels. This refers to explicit procedures for resolving conflicting judgments at the item level (not simply aggregating scores across all annotated items). |
| | | Majority vote | The most common label among annotators is selected. |
| | | Expert adjudication | A designated expert or third-party judge reviews disagreements and assigns the final label. |
| | | Third annotator | Another annotator from the annotator pool steps in for arbitration between diverging opinions. |
| | | Consensus discussion | Annotators discuss and agree on a final label. |
| | | Soft labels | Disagreement is preserved as a label distribution rather than resolved. |
| | | Weighted voting | Votes are weighted based on annotator reliability or expertise. |
| | | Other / mixed | Any combination or alternative method. |
| | | None | No explicit disagreement resolution; annotator outputs are used as-is (e.g., simple averaging or single annotator per item). |

Annotation categories, values and instructions by aspect.

Table 4: Inter-annotator agreement by category and aspect. Human–Human (N=71) vs Human–LLM (N=64).

Appendix E LLM Extraction Prompt

The following prompt was used to guide the Large Language Model in extracting structured data from the scientific papers according to our defined taxonomy.

You are extracting structured data about HUMAN ANNOTATION experiments from a scientific paper.

Your task is to fill in a rigid taxonomy. You MUST use ONLY the exact allowed values specified below.

Do NOT paraphrase, do NOT add explanations, do NOT use synonyms.

## Annotation Taxonomy -- EXACT Allowed Values

You MUST output values EXACTLY as listed below -- no paraphrasing, no synonyms, no extra words.

### 1. paper_experiment_id

Format: {filename}-1, {filename}-2, etc.

### 2. paper_topic -- LIST of one or more, each EXACTLY one of:

* "Linguistics"

* "Discourse & Pragmatics"

* "Information Extraction"

* "QA & information access"

* "Semantics, inference & similarity"

* "Subjective language & social meaning"

* "Open-ended generation"

* "Compression & reformulation tasks"

* "Multilingual & translation"

* "Multimodal tasks"

Classify the paper’s topic(s) based on its main contributions. Use a list with one or more values.

### 3. human_annotation_type -- PICK ONE exactly:

* "Closed-label classification"

* "Ordinal/scalar rating"

* "Span-level or structural annotation"

* "Comparative/preference judgments"

* "Open-ended generation or transformation"

* "Other"

Decision rules:

- Binary/multi-class/multi-label/tagging with fixed schema -> "Closed-label classification"

- Likert scales, 1-5 ratings, quality scores, graded judgments -> "Ordinal/scalar rating"

- NER spans, coreference, constituency/dependency trees, sequence labeling, alignment -> "Span-level or structural annotation"

- A/B choice, ranking, pairwise preference, best-worst scaling -> "Comparative/preference judgments"

- Summarization, paraphrasing, translation, simplification, free-text generation -> "Open-ended generation or transformation"

- Anything that does not fit the above five -> "Other"

### 4. subtask_name

Short descriptive name of the HUMAN ANNOTATION subtask -- what the annotators were asked to label, rate, or judge.

If one experiment covers multiple subtasks that share the same annotator pool and annotation round, combine them with "|" separator.

[!] Name the ANNOTATION task, NOT the downstream NLP modelling task:

* Write "hate speech annotation", NOT "hate speech classifier"

* Write "stance labeling", NOT "stance detection model"

* Write "named entity tagging", NOT "NER model training"

The name must reflect what humans were asked to DO, not the downstream prediction objective.

[!] Be SPECIFIC -- mirror the paper’s own wording, not a generic category:

CORRECT (specific): "binary online attack classification"

WRONG (too generic): "attack annotation"

CORRECT (specific): "clinical named entity recognition (NER)"

WRONG (too generic): "NER annotation"

CORRECT (specific): "Pairwise Comparison of Model-Generated Summaries"

WRONG (too generic): "evaluation"

CORRECT (specific): "rating arguments on a Likert scale across 5 dimensions"

WRONG (too generic): "argument rating"

CORRECT (specific): "evaluate generated headlines for relevance, attractiveness and language fluency"

WRONG (too generic): "headline evaluation"

[!] Use "|" ONLY for subtasks sharing the SAME annotator pool and annotation round.

If tasks differ in annotator group, annotation unit, instructions, or objective -> create SEPARATE experiments.

Example of correct "|" use: "Misogyny detection - Abusive language detection | Misogyny detection - Target identification | Misogyny detection - Hate speech categorization"

(all three done by the same annotators in the same study)

[!] Papers routinely contain 2-3 distinct annotation experiments. Common patterns:

* Different label spaces on the same data (e.g. hate speech -> then target type -> then severity)

* Same task repeated by two different annotator groups (e.g. expert gold-standard AND crowd annotation)

* A text creation task PLUS a separate rating/evaluation task on the model output

* Multiple rounds: pilot annotation + main annotation with different guidelines or annotators, if IAA is reported separately

Each of these MUST be a separate experiment with its own row.

### 5. intended_use -- LIST of one or more, each EXACTLY one of:

* “Resource creation” (the annotation is primarily for building a dataset/resource)

* “Human performance” (the annotation measures human performance on a task)

* “Model output evaluation” (the annotators evaluate model-generated outputs)

* “None of the above” (does not fit any of the above categories)

Use a list with one or more values. Multiple uses are common (e.g. [“Resource creation”, “Model output evaluation”]).

- DEFAULT: [“None of the above”]

### 6. guidelines_released -- “Yes” or “No”

- “Yes” IF any of the following are true:

* The paper includes a dedicated appendix section with detailed annotation instructions (not just a brief task description in the main text)

* A URL/hyperlink to guidelines, an annotation manual, or a codebook is provided (even if the link may now be broken)

* The paper cites another paper whose guidelines were re-used and those guidelines are thus accessible via that citation

* Supplementary material, a shared GitHub/OSF repository, or a dataset release specifically includes the annotation guidelines or schema

* The paper explicitly states that guidelines were released or made available

- “No” if guidelines are only described informally in the main text without a dedicated document, appendix, cited source, or link; OR if no guidelines information is given

- DEFAULT: “No”

[!] Common mistake: Do NOT mark “Yes” just because the paper describes the annotation task in a paragraph. There must be an accessible, standalone guidelines document or appendix section, an annotator interface screenshot, allowing reproduction.

### 7. iaa_metric_name -- LIST of one or more, each EXACTLY one of:

* “Fleiss’ kappa”

* “Cohen’s kappa”

* “Krippendorff’s alpha”

* “F1 agreement”

* “Pearson”

* “Spearman”

* “Majority agreement”

* “Kendall’s tau”

* “Total agreement” (also called percent agreement, percentage agreement, observed agreement, raw agreement, exact agreement)

* “Other” (any explicit metric not in the list above, e.g. ICC, Scott’s pi, Gwet’s AC1, BLEU, Kappa)

* “Not reported” (either not applicable because there is only one annotator, or no IAA metric is given in the paper)

IMPORTANT: If a metric is stated but you cannot identify which one, use “Other”.

If no metric is mentioned, use “Not reported”.

If multiple metrics are reported, list them in the order they appear in the paper.

[!] INTERDEPENDENCY: If annotators_per_item == “1” OR the task is open-ended free-text generation -> force [“Not reported”].

### 8. iaa_value -- LIST aligned 1-to-1 with iaa_metric_name

- Numeric values as strings (e.g. “0.58”, “85.3%”, “0.09 to 0.23”) for any real metric, or “NA” if the paper does not directly state the value. Refrain from inferring values.

- If ranges are reported or any non-single-value form is used, enter as a string (e.g. “0.09 to 0.23”).

- “NA” if iaa_metric_name[i] == “Other” (regardless of whether a number is given)

- “Not reported” if iaa_metric_name[i] == “Not reported”

- Keep the same number of elements as iaa_metric_name

[!] INTERDEPENDENCY: iaa_value[i] is FULLY DETERMINED by iaa_metric_name[i] -- see Field Interdependencies rule A above.

[!] IMPORTANT: If you identified an IAA metric, ACTIVELY search for its numeric value in: (a) the main text near where the metric is mentioned, (b) ALL tables (especially agreement/statistics tables), (c) the appendix. IAA values are very often reported in tables rather than prose. Only output “NA” after checking tables.

### 9-12. Workload fields -- integer string or “Not reported”

- **annotators_per_item**: annotators who labeled EACH item. Integer or “Not reported”. Ranges like “2-3” are acceptable.

- **total_annotators**: total number of distinct annotators. Integer or “Not reported”.

- **total_annotated_items**: total number of items/instances annotated IN THIS SPECIFIC EXPERIMENT. Integer or “Not reported”. If given with K suffix (e.g. “27.9K”), keep as-is. If given with commas keep as-is (e.g. “2,864”).

[!] Extract only the item count for the annotation experiment described. An annotated item is a sample under evaluation, not the number of annotated spans or the total tokens/lines across annotated items (e.g. the number of poems, sentences, documents), NOT the total corpus/dataset size that may be larger. If the paper provides multiple numbers (e.g. a larger pre-filtered pool and a smaller annotated subset), use the smaller annotated subset count.

- **items_per_annotator**: how many items each annotator processed. Integer or “Not reported”.

- DEFAULT: “Not reported” -- do NOT guess or calculate from other fields.

### 13. human_evaluated_sample -- free-text string

- Describe the size/scope of the human-evaluated sample (e.g. “100”, “all”, “200 randomly chosen tasks from the training data”, “500”, “random 50”)

- If the paper describes a specific number of items evaluated by humans, report that number or description

- report cross-annotated sample, if different to the total number of annotated items

- “Not reported” if no human-evaluated sample is described or the size is not specified

- “all” if all items were human-evaluated, especially if the intended use of the human labour is to create a resource

- DEFAULT: “Not reported”

### 14. recruitment -- PICK ONE exactly:

* “Crowdsourcing” (annotators recruited via a crowdsourcing platform or described as crowdworkers)

* “Authors” (one or more authors performed the annotation)

* “Other” (annotators are a specified non-crowd group, e.g. students, experts, hired assistants, colleagues)

* “Mixed” (annotators include people from different specified groups)

* “Not reported” (the source of annotators is not specified as a specific group)

- DEFAULT: “Not reported”

### 15. crowd_qualifications_screening -- “Yes”, “No”, or “NA”

- “Yes” if any pre-screening, qualification test, or performance record is described for crowdworkers, typically: 95% acceptance rate in previous jobs (not socio-demographic filter)

- “No” if crowdsourcing was used but no screening mentioned

- “NA” if NOT a crowdsourcing study

- DEFAULT: “NO” when not crowdsourcing; “No” when crowdsourcing

[!] INTERDEPENDENCY (set before checking paper content):

- recruitment == “Crowdsourcing” -> “Yes” or “No” only

- recruitment == “Authors” / “Other” / “Mixed” / “Not reported” -> MUST be “NO”

### 16. annotator_training -- “Yes” or “No”

- “Yes” if training, practice rounds, trial tasks, calibration sessions, or hands-on prior experience with the annotation tool are mentioned

- “No” if no training is described, or if there is an explicit mention of no training provided

- DEFAULT: “No”

### 17. language_proficiency -- PICK ONE exactly:

* “Native” (native speakers explicitly stated; OR annotators’ ethnicity is reported and it aligns with the language of the annotation task)

* “Non-Native” (authors report some level of proficiency, e.g. fluent, B2, C1, but not native)

* “Mixed” (both native and non-native speakers were employed)

* “Medium” (annotators have moderate language proficiency, e.g. advanced learners)

* “Not reported” (language proficiency not directly reported)

- Do NOT infer from dataset language or annotator country of residence alone.

- DEFAULT: “Not reported”

### 18. expertise_level -- PICK ONE exactly:

* “High” (annotators have specialised domain knowledge relevant to the task, e.g. medical professionals, legal experts, trained linguists, NLP researchers, annotating linguistic phenomena)

* “Medium” (annotators have some relevant background but are not deeply specialised, e.g. students in a related field, people with self-reported relevant experience)

* “Mixed” (reported levels of expertise vary, e.g. experts and students, or students and crowdworkers used together)

* “Not reported” (the task COULD require specialised knowledge, but the paper does not specify annotators’ expertise level)

* “General Language/Knowledge task” (the task is straightforward and DOES NOT REQUIRE any specialised knowledge -- e.g. rating text fluency/relevance/acceptability using only common language sense, evaluating whether a headline is attractive, binary preference judgments on everyday content)

- DEFAULT: “Not reported”

### 19. reported_compensation -- PICK ONE exactly:

* “Paid” (any compensation is mentioned: hourly rate, per-task payment, total amount, gift cards, course credit used as incentive, being described as a paid/professional annotator or employee)

* “Free” (annotation was done without payment: authors annotated the data themselves, annotators were volunteers or labelled as unpaid, students completed it as a course assignment with no monetary reward)

* “Not reported” (the paper says nothing about how or whether annotators were compensated)

- DEFAULT: “Not reported”

### 20. payment_rate -- PICK ONE exactly:

* “Specific numeric rate” (a concrete payment is given: hourly rate like “$12/hr”, per-task like “$0.05/HIT”, total amount like “paid $500 total”, or any explicit monetary figure)

* “General mention” (compensation is confirmed but no exact amount: “annotators were paid”, “workers received payment”, “full-time employees”, “compensated for their time”)

* “NA” (reported_compensation is “Free” or “Not reported” -- payment information not available)

- DEFAULT: “NA”

[!] Check tables and appendices for payment figures -- amounts are often reported there, not in the main text.

### 21-26. Socio-Demographics -- ALL are “Yes” or “No” ONLY

- **age_reported**: “Yes” ONLY if actual age data (mean age, age range, age distribution) is reported WITH values. “No” otherwise.

- **gender_reported**: “Yes” ONLY if actual gender data (counts, percentages, distribution) is reported WITH values. “No” otherwise.

- **nation_origin_reported**: “Yes” ONLY if annotators’ nationality/country of origin is reported WITH values. “No” otherwise.

- **nation_residence_reported**: “Yes” ONLY if annotators’ country of residence is reported WITH values. “No” otherwise. Indirect references to the residence of participants (e.g. poets from a local club, faculty employed at an American university department of X, English-speaking country) should be treated as Nation of residence = Yes.

- **education_reported**: “Yes” ONLY if annotators’ education level or field of study is reported WITH values. “No” otherwise.

[!] INTERDEPENDENCY: If expertise_level == “High”, set education_reported = “Yes” (explicit expert labelling implies education was assessed).

- **political_orientation_reported**: “Yes” ONLY if political orientation is reported WITH values. “No” otherwise.

***CRITICAL DEFAULT: “No” for ALL demographic fields.***

***“No” does NOT mean the paper says “we did not collect demographics”.***

***“No” simply means the paper does NOT report that specific demographic data.***

***NEVER use “Not reported” for these fields -- only “Yes” or “No”.***

### 27. quality_control_filtering -- “Yes” or “No” ONLY

- “Yes” if: quality control measures are described (related to discarding annotations or deboarding annotators: attention checks, gold or repeat questions, filtering bad annotators, removing low-quality annotations)

- “No” in ALL other cases -- including when experts were employed and their judgments trusted without filtering, or when no QC details are given

- ***NEVER output “NA” for this field -- “NA” is treated as “No”, use “No” directly***

- DEFAULT: “No”

### 28. disagreement_resolution_adjudication -- PICK ONE exactly:

* “Majority vote” (final label determined by majority voting among annotators)

* “Expert adjudication” (an expert resolves disagreements)

* “Third annotator” (a third annotator breaks ties)

* “Consensus discussion” (annotators discuss to reach consensus)

* “Soft labels/Distributional labels” (all annotations kept as a distribution, no single label chosen)

* “Weighted voting” (votes weighted by annotator reliability/expertise)

* “Other/mixed” (other resolution method or combination of methods)

* “None” (no disagreement resolution described: all opinions used as-is or averaged)

- DEFAULT: “None”

## Extraction Rules -- READ CAREFULLY

### Experiment Identification

[!] **ANNOTATION EXPERIMENTS ONLY** -- Only extract experiments where humans actively create, label, rate, rank, or judge data. Do NOT extract model training runs, automatic evaluation steps, or system-based experiments that do not involve human annotators as a core part of the process.

**SPLIT into separate experiments when ANY of the following differ between annotation tasks:**

1. **Different annotation objective** -- distinct label spaces or annotation goals (e.g. hate speech detection vs. stance detection)

2. **Different annotator pools** -- different groups used for different tasks (e.g. experts for one subtask, crowd for another)

3. **Different recruitment/compensation** -- one task paid, another unpaid; or recruited via different channels

4. **Different annotation units** -- sentence-level vs document-level; word-level vs sentence-level

5. **Separate dedicated annotation rounds** -- the paper describes multiple distinct rounds, each with its own guidelines, annotators, or IAA

Real-paper SPLIT examples:

* binary attack classification / language classification / attack target classification -> 3 experiments (different label spaces)

* summarization generation / summarization rating -> 2 experiments (generation task + evaluation task)

* Expert Gold-Standard NER / Crowd-Based NER -> 2 experiments (different annotator pools)

* poetry generation by crowd / poetry generation by experts -> 2 experiments (different recruitment)

* Is label Acceptable/Unacceptable / If Acceptable, is it Precise/Imprecise -> 2 experiments (different annotation objectives)

**MERGE into one experiment when:**

- The same annotation was applied to multiple datasets with the same label scheme and annotators

- Multiple IAA metrics are reported for the same annotation task

- Minor sub-tasks that share annotators, and annotated items, type of human judgment, and process differ only in topic or domain

[!] DEFAULT BIAS -- **ALWAYS err toward SPLITTING, not merging.** Under-splitting is the most common error. A paper that runs hate speech labeling and then target-type labeling on the same dataset has TWO experiments, not one. Different label spaces = different experiments.

[!] **SELF-AUDIT before writing JSON**: After identifying your first experiment, explicitly ask:

1. Does the paper describe any OTHER annotation task (different labels, different annotators, different purpose in a different paper section)?

2. Are there separate sections titled “Experiment 2”, “Human Evaluation”, “Quality Annotation”, “Pilot Study”?

3. Did any OTHER group of humans label, rate, or judge data in this paper?

4. Is there a text production task AND a separate evaluation/rating task performed by humans?

If the answer to ANY of these is yes -> add another experiment.
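Outside the prompt itself, the five SPLIT criteria can be read as a field-by-field comparison of two candidate annotation tasks. The following is a minimal sketch of that reading, not the paper's implementation: the descriptor keys (`objective`, `annotator_pool`, `recruitment`, `compensation`, `unit`, `round`) and both function names are hypothetical.

```python
# Hypothetical descriptor keys, one per SPLIT criterion in the prompt
# (recruitment and compensation share criterion 3).
SPLIT_KEYS = ("objective", "annotator_pool", "recruitment",
              "compensation", "unit", "round")


def differing_criteria(task_a: dict, task_b: dict) -> list:
    """List the SPLIT criteria on which two task descriptors differ."""
    return [k for k in SPLIT_KEYS if task_a.get(k) != task_b.get(k)]


def should_split(task_a: dict, task_b: dict) -> bool:
    """Split when ANY criterion differs; topic/domain alone never splits."""
    return bool(differing_criteria(task_a, task_b))


# Illustrative tasks: hate speech labeling followed by target-type labeling
# on the same data, by the same crowd, in the same round.
hate = {"objective": "hate speech labeling", "annotator_pool": "crowd",
        "recruitment": "Crowdsourcing", "compensation": "Paid",
        "unit": "post", "round": "main", "topic": "social media"}
target = dict(hate, objective="target-type labeling")
same_task_other_domain = dict(hate, topic="news comments")
```

Here `should_split(hate, target)` is true because the label space differs, while the same task applied to another domain stays merged, which mirrors the MERGE rule for sub-tasks that differ only in topic or domain.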

### CRITICAL: Avoid These Common Mistakes

1. **Do NOT use “Not reported” for demographic fields** -- use “No”. These fields ONLY accept “Yes” or “No”.

2. **disagreement_resolution_adjudication must be one of the specific methods** -- “Majority vote”, “Expert adjudication”, “Third annotator”, “Consensus discussion”, “Soft labels/Distributional labels”, “Weighted voting”, “Other/mixed”, or “None”.

3. **Do NOT infer language proficiency** from the dataset language or annotator country alone (exception: ethnicity matching task language -> “Native”).

4. **Do NOT infer compensation amounts** -- but DO infer “Free” when authors annotate their own data, or mention voluntary effort.

5. **Do NOT assume guidelines are released** just because the paper describes the task in prose -- there must be a dedicated document, appendix, URL, or cited source.

6. **Do NOT calculate workload fields** from other numbers -- only extract explicitly stated values.

7. **For recruitment**, use the broad categories: “Crowdsourcing”, “Authors”, “Other”, “Mixed”, “Not reported” -- do NOT use platform-specific names.

8. **For compensation, do NOT use “NA”** -- use “Not reported” if compensation is unmentioned.

9. **Do NOT confuse expertise_level “General Language/Knowledge task” and “Not reported”**: “General Language/Knowledge task” = task needs no specialization at all; “Not reported” = task could need specialization but paper doesn’t specify annotators’ expertise.

10. **For IAA values, check tables** -- numeric values are very often in tables, not the prose. Do not output “NA” for a real metric until you have checked all tables.

11. **For total_annotated_items, use the annotated subset size**, not the total corpus/dataset size -- these can differ substantially.

12. **subtask_name must name the ANNOTATION task**, not the downstream NLP modelling task -- “sentiment rating” not “sentiment classifier”; “NER tagging” not “NER model”.

13. **For every numerical value, record evidence** in the `numerical_evidence` object -- do NOT leave it empty when you extracted a non-“Not reported” value.

### Field Interdependencies -- APPLY THESE BEFORE FINALISING OUTPUT

These conditional rules take PRECEDENCE over all other guidance. Check each one after filling every field.

**A. iaa_value is fully determined by iaa_metric_name (apply per list position):**

- iaa_metric_name[i] == “Other” -> iaa_value[i] = “NA”

- iaa_metric_name[i] == “Not reported” -> iaa_value[i] = “Not reported”

- iaa_metric_name[i] is any real metric -> iaa_value[i] = the numeric value reported (or “NA” if value is absent)

**B. iaa_metric_name / iaa_value <- annotators_per_item or task type:**

- annotators_per_item == “1” (single annotator) -> iaa_metric_name = [“Not reported”], iaa_value = [“Not reported”]

- task is open-ended text generation (humans produce free text, no fixed label set) -> iaa_metric_name = [“Not reported”], iaa_value = [“Not reported”]

**C. crowd_qualifications_screening <- recruitment:**

- recruitment == “Crowdsourcing” -> crowd_qualifications_screening = “Yes” OR “No” (based on paper content)

- recruitment == “Authors” -> crowd_qualifications_screening = “NO”

- recruitment == “Other” -> crowd_qualifications_screening = “NO”

- recruitment == “Mixed” -> crowd_qualifications_screening = “NO” (unless paper explicitly mentions crowd screening for the crowd sub-group)

- recruitment == “Not reported” -> crowd_qualifications_screening = “NO”

**D. education_reported <- expertise_level:**

- expertise_level == “High” (experts) -> education_reported = “Yes”

(Rationale: explicitly labelling annotators as domain experts implies their educational/professional qualification was assessed and reported.)

- All other expertise_level values -> apply standard “Yes” / “No” rule
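Rules A-D are deterministic, so they could also be enforced on the model's JSON output in a post-processing pass. The sketch below is one possible way to do that and is not taken from the paper; the function name and the `open_ended_generation` flag are hypothetical. One assumption deserves emphasis: rule C is printed with the value “NO”, whereas the field definition and the defaults summary give “NA” for non-crowd studies, and this sketch assumes “NA” is what is meant.

```python
# Rule A: values forced by the metric name, whatever the model wrote.
FORCED_IAA_VALUE = {"Other": "NA", "Not reported": "Not reported"}


def apply_interdependencies(exp: dict, open_ended_generation: bool = False) -> dict:
    """Return a copy of one extracted experiment with rules A-D applied."""
    out = dict(exp)

    # Rule B: a single annotator or free-text generation means no IAA.
    if out.get("annotators_per_item") == "1" or open_ended_generation:
        out["iaa_metric_name"] = ["Not reported"]
        out["iaa_value"] = ["Not reported"]

    # Rule A: keep iaa_value aligned 1-to-1 with iaa_metric_name.
    metrics = list(out.get("iaa_metric_name") or ["Not reported"])
    values = list(out.get("iaa_value") or [])
    values = (values + ["NA"] * len(metrics))[:len(metrics)]  # pad or trim
    out["iaa_metric_name"] = metrics
    out["iaa_value"] = [FORCED_IAA_VALUE.get(m, v) for m, v in zip(metrics, values)]

    # Rule C: screening only applies to crowdsourcing (ASSUMPTION: "NA", see lead-in).
    recruitment = out.get("recruitment", "Not reported")
    screening = out.get("crowd_qualifications_screening")
    if recruitment == "Crowdsourcing":
        if screening not in ("Yes", "No"):
            out["crowd_qualifications_screening"] = "No"
    elif not (recruitment == "Mixed" and screening == "Yes"):
        # "Mixed" keeps an explicit "Yes" for the crowd sub-group.
        out["crowd_qualifications_screening"] = "NA"

    # Rule D: expert annotators imply education was reported.
    if out.get("expertise_level") == "High":
        out["education_reported"] = "Yes"
    return out


raw = {"annotators_per_item": "3",
       "iaa_metric_name": ["Fleiss' kappa", "Other"],
       "iaa_value": ["0.62", "0.7"],
       "recruitment": "Authors",
       "crowd_qualifications_screening": "No",
       "expertise_level": "High",
       "education_reported": "No"}
fixed = apply_interdependencies(raw)
```

In this example the second IAA value becomes “NA” (rule A), screening becomes “NA” because the annotators are authors (rule C), and `education_reported` flips to “Yes” (rule D).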

### Default values summary (when information is absent from the paper):

- guidelines_released -> “No”

- iaa_metric_name -> [“Not reported”]

- iaa_value -> [“Not reported”]

- workload fields -> “Not reported”

- recruitment -> “Not reported”

- crowd_qualifications_screening -> “NA” (non-crowd) or “No” (crowd)

- annotator_training -> “No”

- language_proficiency -> “Not reported”

- expertise_level -> “Not reported”

- reported_compensation -> “Not reported”

- payment_rate -> “NA”

- human_evaluated_sample -> “Not reported”

- ALL demographics -> “No”

- quality_control_filtering -> “No” (NEVER “NA”)

- disagreement_resolution_adjudication -> “None”
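The defaults summary maps directly onto a lookup table. As an illustration only (the paper does not describe such a step), a post-processing helper could fill any field the model left out; `default_record` and `fill_defaults` are hypothetical names, while the field names and default values are copied from the list above.

```python
DEMOGRAPHIC_FIELDS = ("age_reported", "gender_reported", "nation_origin_reported",
                      "nation_residence_reported", "education_reported",
                      "political_orientation_reported")
WORKLOAD_FIELDS = ("annotators_per_item", "total_annotators",
                   "total_annotated_items", "items_per_annotator")


def default_record(crowdsourced: bool) -> dict:
    """Build the all-defaults record from the prompt's summary list."""
    record = {
        "guidelines_released": "No",
        "iaa_metric_name": ["Not reported"],
        "iaa_value": ["Not reported"],
        "recruitment": "Not reported",
        "crowd_qualifications_screening": "No" if crowdsourced else "NA",
        "annotator_training": "No",
        "language_proficiency": "Not reported",
        "expertise_level": "Not reported",
        "reported_compensation": "Not reported",
        "payment_rate": "NA",
        "human_evaluated_sample": "Not reported",
        "quality_control_filtering": "No",
        "disagreement_resolution_adjudication": "None",
    }
    record.update({f: "Not reported" for f in WORKLOAD_FIELDS})
    record.update({f: "No" for f in DEMOGRAPHIC_FIELDS})
    return record


def fill_defaults(exp: dict) -> dict:
    """Overlay extracted values on the defaults; missing fields keep the default."""
    filled = default_record(exp.get("recruitment") == "Crowdsourcing")
    filled.update({k: v for k, v in exp.items() if v not in (None, "")})
    # The prompt treats "NA" as "No" for quality_control_filtering.
    if filled["quality_control_filtering"] == "NA":
        filled["quality_control_filtering"] = "No"
    return filled


partial = fill_defaults({"recruitment": "Crowdsourcing",
                         "quality_control_filtering": "NA",
                         "total_annotators": "50"})
```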

### Numerical Evidence

For every numerical value you extract, you MUST populate the `numerical_evidence` object with the verbatim text (sentence fragment, table cell, or figure caption) from the paper that supports that number.

- **Fields that require evidence**: `annotators_per_item`, `total_annotators`, `total_annotated_items`, `items_per_annotator`, each element of `iaa_value` (keyed as `iaa_value_0`, `iaa_value_1`, etc.), and `payment_rate` when it is “Specific numeric rate”.

- If a value was inferred rather than directly stated, write: `“field_name”: “inferred from: [brief reason]”`

- Omit a key from `numerical_evidence` only when the corresponding field value is “Not reported” or “NA”.
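The evidence requirement is mechanical enough to check automatically. The sketch below shows one way such a check might look; it is not part of the paper's pipeline, and `required_evidence_keys` and `missing_evidence` are hypothetical names. The record used in the example reuses the values of the prompt's worked example.

```python
NO_EVIDENCE_NEEDED = {"Not reported", "NA"}
WORKLOAD_FIELDS = ("annotators_per_item", "total_annotators",
                   "total_annotated_items", "items_per_annotator")


def required_evidence_keys(exp: dict) -> list:
    """Keys that the prompt says must appear in numerical_evidence."""
    keys = [f for f in WORKLOAD_FIELDS
            if exp.get(f, "Not reported") not in NO_EVIDENCE_NEEDED]
    keys += [f"iaa_value_{i}" for i, v in enumerate(exp.get("iaa_value", []))
             if v not in NO_EVIDENCE_NEEDED]
    if exp.get("payment_rate") == "Specific numeric rate":
        keys.append("payment_rate")
    return keys


def missing_evidence(exp: dict) -> list:
    """Required keys with no (or empty) verbatim evidence string."""
    evidence = exp.get("numerical_evidence", {})
    return [k for k in required_evidence_keys(exp) if not evidence.get(k)]


worked_example = {
    "annotators_per_item": "3",
    "total_annotators": "50",
    "total_annotated_items": "10000",
    "items_per_annotator": "Not reported",
    "iaa_value": ["0.62"],
    "payment_rate": "Specific numeric rate",
    "numerical_evidence": {
        "annotators_per_item": "3/item",
        "total_annotators": "50 annotators from MTurk",
        "total_annotated_items": "collected 10,000 posts",
        "iaa_value_0": "Fleiss' kappa = 0.62",
    },
}
gaps = missing_evidence(worked_example)
```

Run on these values, the check flags `payment_rate`: the worked example sets it to “Specific numeric rate” but its `numerical_evidence` object carries no matching key.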

## Approach -- Two-step extraction

**STEP 1 -- Locate annotation sections BEFORE filling any field.**

Annotation metadata is typically found in sections titled: “Annotation”, “Data Collection”, “Annotators”, “Human Evaluation”, “Experimental Setup”, “Crowdsourcing”, “Dataset Construction”, “Annotation Procedure”, “Annotation Scheme”, or appendix sections.

[!] Only extract HUMAN ANNOTATION experiments (humans labeling/rating/judging data). Ignore model training runs, automatic evaluation steps, or any experiment that does not involve human annotators.

Scan those sections and mentally note:

* How many DISTINCT annotation experiments? (different tasks or different annotation units = separate experiments; otherwise merge)

* Who are the annotators -- crowdworkers, authors, domain experts, students?

* How many annotators per item? Total annotators count?

* What IAA metric(s) and numeric value(s) are reported?

* How were annotators recruited and compensated?

* Any demographics WITH VALUES, training, guidelines, or quality control described?

* For each numerical value found, note the EXACT sentence or table entry it came from (for `numerical_evidence`).

**Multi-experiment scan**: Papers with 2-3 distinct annotation experiments are COMMON. After finding your first experiment, continue scanning:

- Other sections/appendices for additional annotation tasks

- Whether the paper reports a “pilot” or “validation” annotation with different setup

- Whether different tasks/label-spaces were used (e.g., first hate-speech, then target classification)

- Whether two different annotator groups (crowd + expert) independently annotated the same or related data

Do NOT stop at the first annotation section you find.

**STEP 2 -- Fill the experiments array** using only what you found in the paper.

When information is absent, apply the defaults in TAXONOMY_SCHEMA (typically “Not reported”, “No”, or “NA”).

## WORKED EXAMPLE

Hypothetical paper: MTurk hate speech annotation with Fleiss’ kappa, residence filter, majority-vote adjudication.

{{
  "paper_id": "example-paper",
  "skip": false,
  "experiments": [
    {{
      "paper_experiment_id": "example-paper-1",
      "papers_topic": ["Subjective language & social meaning"],
      "nlp_task_type": "Closed-label classification",
      "subtask_name": "hate speech detection",
      "intended_use": ["Resource creation"],
      "guidelines_released": "No",
      "iaa_metric_name": ["Fleiss’ kappa"],
      "iaa_value": ["0.62"],
      "annotators_per_item": "3",
      "total_annotators": "50",
      "total_annotated_items": "10000",
      "items_per_annotator": "Not reported",
      "human_evaluated_sample": "Not reported",
      "recruitment": "Crowdsourcing",
      "crowd_qualifications_screening": "Yes",
      "annotator_training": "No",
      "language_proficiency": "Native",
      "expertise_level": "General Language/Knowledge task",
      "reported_compensation": "Paid",
      "payment_rate": "Specific numeric rate",
      "age_reported": "No",
      "gender_reported": "No",
      "nation_origin_reported": "No",
      "nation_residence_reported": "Yes",
      "education_reported": "No",
      "political_orientation_reported": "No",
      "quality_control_filtering": "Yes",
      "disagreement_resolution_adjudication": "Majority vote",
      "numerical_evidence": {{
        "annotators_per_item": "3/item",
        "total_annotators": "50 annotators from MTurk",
        "total_annotated_items": "collected 10,000 posts",
        "iaa_value_0": "Fleiss’ kappa = 0.62"
      }}
    }}
  ]
}}

## Output Format

Output ONLY valid JSON matching this schema:

{json_schema}

## Paper Content

{paper_content}
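The doubled braces in the worked example (`{{`, `}}`) alongside the single-brace placeholders `{json_schema}` and `{paper_content}` are consistent with a Python `str.format` template, although the paper does not state how the prompt is filled. Under that assumption, a minimal sketch of the substitution looks as follows; `PROMPT_TEMPLATE` is a shortened stand-in for the full prompt and `build_prompt` is a hypothetical helper.

```python
import json

# Shortened stand-in for the full prompt (assumption: str.format templating).
# Literal JSON braces are doubled so that format() leaves them in place.
PROMPT_TEMPLATE = (
    "## WORKED EXAMPLE\n"
    '{{\n  "paper_id": "example-paper",\n  "skip": false\n}}\n'
    "## Output Format\n"
    "Output ONLY valid JSON matching this schema:\n"
    "{json_schema}\n"
    "## Paper Content\n"
    "{paper_content}\n"
)


def build_prompt(schema: dict, paper_text: str) -> str:
    """Fill the two placeholders; substituted values are not re-parsed for braces."""
    return PROMPT_TEMPLATE.format(
        json_schema=json.dumps(schema, indent=2),
        paper_content=paper_text,
    )


prompt = build_prompt({"type": "object"}, "Full text of the paper.")
```

After formatting, each doubled brace collapses to a single one, so the model sees ordinary JSON in the worked example.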

## REMINDER before you write your JSON:

- **analysis**: write this FIRST -- 1-2 sentences: N experiments, annotator type, IAA metric + value, compensation

- Demographics (age / gender / nation_origin / nation_residence / education / political_orientation): default is “No”, ONLY “Yes” if actual values WITH NUMBERS are reported

- disagreement_resolution_adjudication: use specific method name (“Majority vote”, “Expert adjudication”, “Third annotator”, “Consensus discussion”, “Soft labels/Distributional labels”, “Weighted voting”, “Other/mixed”, “None”)

- **IAA values**: check ALL tables -- values are very often in a table, not the prose

- recruitment: “Crowdsourcing” only if platform named or “crowdworkers” used; “Authors” if authors themselves annotated; “Other” for students/experts/assistants; else “Not reported”

- **compensation**: authors annotating = reported_compensation “Free”; any pay mentioned = “Paid”; nothing mentioned = “Not reported”; “NA” if not applicable

- **payment_rate**: “Specific numeric rate” / “General mention” / “No” (if Free) / “NA” (if Not reported or NA) -- check tables for exact amounts

- **expertise_level**: “General Language/Knowledge task” if task needs no specialization (rating fluency/attractiveness/preference); “Not reported” if specialization COULD be relevant but paper doesn’t specify

- **guidelines_released**: “Yes” if appendix with detailed instructions, URL, cited guidelines paper, or supplementary material -- NOT just because the paper describes the task

- **total_annotated_items**: use the annotated subset count, NOT the full corpus/dataset size

- **subtask_name**: use SPECIFIC wording mirroring the paper (e.g. “binary online attack classification”, NOT “attack annotation”; “Pairwise Comparison of Model-Generated Summaries”, NOT “evaluation”)

- **papers_topic**: LIST of one or more topic areas

- **intended_use**: LIST of one or more intended uses

- **human_evaluated_sample**: free-text describing sample size/scope that was selected for human inspection or cross-annotated, or “Not reported”

- **Multiple experiments**: papers with 2-3 distinct annotation experiments are common -- re-read the FULL paper before settling on 1 experiment. Different label spaces, different annotator groups, generation + rating, pilot + main -> each is a SEPARATE experiment row

Appendix F Additional Analysis Details

RQ1: Additional analysis

Papers on subjective and socio-linguistic phenomena do not exhibit systematically stronger reporting practices than the rest of the sample (see Figure 6). Regarding recruitment and qualification practices, socially related papers report annotator recruitment information slightly more often, suggesting more detailed documentation of annotator setup. These papers are also more likely to rely on crowdsourced and mixed annotator pools, and less likely to employ authors as annotators (see Figure 12).

As shown in Figure 7, papers on subjective language and social meaning report annotator nativeness more frequently than the rest of the dataset. A chi-square test indicates a statistically significant association between subset membership and native language reporting ($\chi^{2}=20.50$, $p<0.001$), although the effect size is small (Cramér’s $V=0.088$), suggesting a weak association. At the same time, native speaker status is explicitly reported in fewer than 35% of socially related papers overall.
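For readers who want to relate the two statistics, Cramér's V follows from the chi-square value as $V=\sqrt{\chi^{2}/(n\cdot\min(r-1,c-1))}$. The sketch below applies this under two assumptions that the text does not state explicitly: that the test was run over all 2,667 annotation tasks, and that the contingency table is 2x2 (subset membership by nativeness reported or not).

```python
import math


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V for an n_rows x n_cols contingency table with n observations."""
    k = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi2 / (n * k))


# ASSUMPTIONS: n = 2667 tasks and a 2x2 table; chi2 = 20.50 is from the text.
v = cramers_v(chi2=20.50, n=2667, n_rows=2, n_cols=2)
```

Under these assumptions `v` rounds to 0.088, which agrees with the reported effect size and shows how a highly significant test can still correspond to a weak association at this sample size.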

Figure 6: Trends in reporting quality for socio-linguistic vs. other topics. Lines show reporting improvement over time; stacked bars indicate the number of papers per category. Error bars represent standard errors of the mean (SEM) for the focus group only.

Figure 7: Language proficiency reporting rates, with higher reporting of nativeness for the socially related subset (green) compared to the rest of the dataset (blue).

As noted above, socially related papers are more likely to rely on crowdsourced and mixed annotator pools. Within crowd-annotated papers, the social subset also shows a higher rate of quality control measures than the rest of the dataset (see Figure 8).

Figure 8: Distribution of quality control measures in crowd-annotated papers, defined as post-annotation filtering or validation of annotations, showing a higher rate in the socially related subset compared to the rest of the dataset (72% vs. 62.5% respectively).

This indicates more frequent screening or filtering of crowdworkers in these studies. As shown in Figure 9, annotator training is also slightly more frequently reported in the social subset. This difference is small but consistent, particularly when annotators belong to explicitly defined non-crowd groups (e.g., students, experts, hired assistants, or colleagues), or when multiple annotator groups are involved (see Figure 10).

Figure 9: Annotator training reporting rates in socially related papers versus the full dataset, showing a slightly higher frequency in the socially related subset (20% vs. 18%), a difference of approximately 2 percentage points.

Figure 10: Distribution of annotator expertise in socially related papers versus the rest of the dataset, showing no higher use of expert annotators in the social subset, but a higher share of general language or knowledge-based tasks and more cases where annotator expertise is not reported.

For annotator expertise, we find no evidence that papers addressing socio-linguistic topics rely more on expert annotators. In fact, these studies include a higher proportion of tasks classified as general language- or knowledge-based, and more frequently do not report annotator expertise, as illustrated in Figure 10. This contrasts with the general pattern observed for other annotator-related dimensions, where papers on socio-linguistic topics tend to provide more detailed reporting.

A further difference emerges in how annotation disagreements are handled in ordinal or scalar rating tasks. As shown in Figure 11, within the socially related subset, adjudication is most commonly performed through majority voting, followed by approaches that preserve disagreement as label distributions. In contrast, methods such as expert adjudication or consensus-based discussion are rarely employed. This suggests that, in addition to more detailed reporting, socially grounded studies tend to adopt aggregation strategies that retain or simplify annotator variation rather than resolving it through expert intervention.

Figure 11: Distribution of adjudication strategies in ordinal and scalar rating tasks, showing that socially related papers more often use majority voting (50.0%) or label distributions (25.0%) that preserve disagreement, while rarely relying on expert adjudication or consensus-based discussion.

Figure 12: Distribution of recruitment categories in the socially related subset (green) versus the rest of the dataset (blue), showing higher use of crowdsourced and mixed annotator pools and slightly more frequent reporting of recruitment information in socially related papers.

Overall, socially oriented NLP studies do not exhibit substantially different overall reporting quality, but they do show some distinctive tendencies in annotator profiles and documentation practices, including lower reliance on authors as annotators, greater use of crowdworkers, and more frequent reporting of native-speaker status. More broadly, these studies tend to document annotator characteristics, particularly annotator sourcing and language background, more systematically and in greater detail. However, reporting levels remain relatively limited even within this subset, suggesting that socially oriented NLP studies are not consistently better at documenting the broader methodological details needed to assess annotation quality, despite the central role of annotator-related variables in socially grounded language phenomena.

RQ2: Venue-specific interrupted time series analysis

As a heterogeneity check, we decomposed the interrupted time series analysis by the three venues with the largest representation in our dataset: ACL, EMNLP, and NAACL. Figure 13 shows venue-specific fitted trends, observed yearly reportage scores, and counterfactual continuations of the pre-2022 trajectories.

Figure 13: Interrupted time series analysis of reportage across major NLP venues (ACL, EMNLP, NAACL). Solid colored lines show model-estimated trends for each venue, while dots with whiskers indicate observed yearly means with standard errors. Dashed lines represent counterfactual trajectories under continuation of the pre-2022 trend.

The three venues exhibit distinct pre-2022 reporting trajectories. EMNLP shows comparatively high reportage scores throughout the observation period, with no pronounced discontinuity around 2022 beyond the existing upward trend. In contrast, ACL and especially NAACL begin from substantially lower baseline levels and display stronger upward trends prior to 2022, gradually converging toward the EMNLP level.

Following 2022, the previously observed upward trajectories flatten across all three venues. Rather than accelerating post-intervention improvements, reportage scores stabilize at a more similar level across venues, reducing between-venue variation. This pattern suggests that the ACL Responsible NLP Checklist may have had a standardizing effect on reporting practices, even though we do not observe evidence for a strong immediate increase in reportage quality after its introduction.

Because the checklist was introduced during a broader transition period involving ARR adoption and venue-specific rollout schedules, the intervention year should be interpreted as an approximate temporal reference point rather than a perfectly synchronized policy change across all venues.
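The counterfactual lines in Figure 13 extend each venue's pre-2022 trend past the intervention year. The sketch below illustrates only that idea with an ordinary least-squares line; it is a simplification and not the interrupted time series model used in the analysis. The yearly scores are invented for illustration and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def counterfactual_gaps(years, scores, intervention_year=2022):
    """Observed minus extrapolated pre-intervention trend, per post-intervention year."""
    pre = [(y, s) for y, s in zip(years, scores) if y < intervention_year]
    slope, intercept = fit_line([y for y, _ in pre], [s for _, s in pre])
    return {y: s - (intercept + slope * y)
            for y, s in zip(years, scores) if y >= intervention_year}


# HYPOTHETICAL yearly reportage scores for one venue (not values from the paper).
years = list(range(2018, 2026))
scores = [0.30, 0.34, 0.38, 0.42, 0.44, 0.45, 0.45, 0.46]
gaps = counterfactual_gaps(years, scores)
```

With these made-up numbers every post-2022 gap is negative, which is the shape described in the text: scores keep rising slightly but fall below the continuation of the steeper pre-2022 trend.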

RQ3: Additional analysis

Figure 14 visualizes the logistic regression estimates discussed in the main text. In both subplots, annotation studies aimed at resource creation show the strongest association with reporting procedural details, substantially exceeding the baseline odds across all measures considered. Human performance studies occupy an intermediate position, whereas model evaluation studies consistently exhibit the weakest reporting patterns.

Panels: (a) QC measures combined; (b) recruitment + comp.

Figure 14: Logistic regression of QC usage by annotation purpose and year. Points show odds ratios (OR) with 95% CIs (log scale); OR = 1 indicates no effect.

The similarity between the two panels suggests that these differences are not limited to formal quality control procedures alone, but extend more broadly to annotator transparency and documentation practices. Notably, the comparatively weak reporting of recruitment source and compensation in model evaluation studies raises the possibility that authors acting as annotators may remain systematically underreported in this setting. Temporal effects are comparatively modest in both models, indicating that annotation purpose is a substantially stronger predictor of reporting behavior than publication year.
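The odds ratios and intervals plotted in Figure 14 are obtained from logistic regression coefficients in the standard way: $\mathrm{OR}=e^{\beta}$, with a 95% interval of $e^{\beta\pm1.96\,\mathrm{SE}}$. A minimal sketch of that conversion follows; the coefficient and standard error are placeholder values, not estimates from the paper.

```python
import math


def odds_ratio_ci(beta: float, se: float, z: float = 1.96):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# PLACEHOLDER coefficient for illustration (e.g. a purpose indicator vs. baseline).
or_, lower, upper = odds_ratio_ci(beta=0.9, se=0.2)
```

Because the interval is symmetric on the log scale, it appears symmetric around the point estimate only when plotted on a log axis, as in the figure. An interval that excludes 1 corresponds to a coefficient that differs from zero at the 5% level.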

Sample Annotated Data

The following figures present excerpts from Haber et al. (2023), included as an example of a manually annotated paper with comprehensive reporting practices. The excerpts illustrate reporting of annotator recruitment, demographic and linguistic characteristics, workload distribution, inter-annotator agreement (IAA), and disagreement resolution procedures.

Figure 15: Example excerpt from Haber et al. (2023) illustrating comprehensive reporting of annotator-related information, including the number of annotators, recruitment procedures, and annotators’ nation of origin.

Figure 16: Example excerpt from Haber et al. (2023) providing extensive reporting of annotation methodology and annotator characteristics, including annotators’ nation of origin, language proficiency, gender and age, as well as total number of annotated items and number of annotators per item.

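Figure 17: Example excerpt from Haber et al. (2023) providing detailed reporting of inter-annotator agreement (IAA) metrics, scores, and disagreement resolution procedures.

| category | value | 2018-2021 # | 2022-2025 # | 2018-2021 % | 2022-2025 % | p < 0.05 |
|---|---|---|---|---|---|---|
| paper_topic | open-ended generation | 349 | 805 | 28.4 | 25.1 | * |
| | semantics, inference & similarity | 232 | 453 | 18.9 | 14.1 | * |
| | subjective language & social meaning | 120 | 423 | 9.8 | 13.2 | * |
| | discourse & pragmatics | 116 | 242 | 9.4 | 7.6 | * |
| | qa & information access | 108 | 265 | 8.8 | 8.3 | |
| | information extraction | 103 | 302 | 8.4 | 9.4 | |
| | compression & reformulation tasks | 95 | 201 | 7.7 | 6.3 | |
| | multimodal tasks | 44 | 247 | 3.6 | 7.7 | * |
| | multilingual & translation | 39 | 174 | 3.2 | 5.4 | * |
| | linguistics | 24 | 93 | 2.0 | 2.9 | NA |
| task_type | closed-label clf | 253 | 632 | 33.0 | 33.2 | |
| | ordinal/scalar rating | 229 | 551 | 29.9 | 29.0 | |
| | open-ended generation | 103 | 200 | 13.4 | 10.5 | * |
| | comparative/preference | 91 | 262 | 11.9 | 13.8 | |
| | span-level/structural | 68 | 156 | 8.9 | 8.2 | |
| | other | 22 | 100 | 2.9 | 5.3 | NA |
| intended_use | model output evaluation | 380 | 1043 | 47.1 | 50.5 | |
| | resource creation | 329 | 773 | 40.8 | 37.5 | |
| | human performance | 86 | 219 | 10.7 | 10.6 | |
| | none of the above | 11 | 29 | 1.4 | 1.4 | NA |
| guidelines | no | 612 | 1145 | 79.9 | 60.2 | * |
| | yes | 154 | 756 | 20.1 | 39.8 | * |
| iaa_metric | not reported | 502 | 1103 | 55.7 | 48.3 | * |
| | fleiss’ $\kappa$ | 116 | 272 | 12.9 | 11.9 | |
| | cohen’s $\kappa$ | 79 | 268 | 8.8 | 11.7 | * |
| | krippendorff’s $\alpha$ | 68 | 289 | 7.5 | 12.7 | * |
| | other | 42 | 140 | 4.7 | 6.1 | |
| | total agreement | 37 | 124 | 4.1 | 5.4 | |
| | spearman | 21 | 23 | 2.3 | 1.0 | NA |
| | f1 agreement | 15 | 24 | 1.7 | 1.1 | NA |
| | pearson | 13 | 19 | 1.4 | 0.8 | NA |
| | kendall’s $\tau$ | 5 | 19 | 0.6 | 0.8 | NA |
| | majority agreement | 3 | 3 | 0.3 | 0.1 | NA |
| iaa_value | reported | 765 | 1894 | 99.9 | 99.6 | |
| | not_reported | 1 | 7 | 0.1 | 0.4 | NA |
| annotators/item | reported | 528 | 1201 | 68.9 | 63.2 | * |
| | not_reported | 238 | 700 | 31.1 | 36.8 | * |
| total_annotators | not_reported | 387 | 601 | 50.5 | 31.6 | * |
| | reported | 379 | 1300 | 49.5 | 68.4 | * |
| total_items | reported | 673 | 1621 | 87.9 | 85.3 | |
| | not_reported | 93 | 280 | 12.1 | 14.7 | |
| items/annotator | not_reported | 640 | 1421 | 83.6 | 74.8 | * |
| | reported | 126 | 480 | 16.4 | 25.2 | * |
| recruitment | crowdsourcing | 358 | 476 | 46.7 | 25.0 | * |
| | other | 247 | 1009 | 32.2 | 53.1 | * |
| | not reported | 100 | 157 | 13.1 | 8.3 | * |
| | authors | 51 | 200 | 6.7 | 10.5 | * |
| | mixed | 10 | 59 | 1.3 | 3.1 | NA |
| crowd_screening | no | 492 | 1244 | 64.2 | 65.4 | |
| | yes | 186 | 341 | 24.3 | 17.9 | * |
| | na | 88 | 316 | 11.5 | 16.6 | * |
| training | no | 667 | 1502 | 87.1 | 79.0 | * |
| | yes | 99 | 399 | 12.9 | 21.0 | * |
| lang._proficiency | not reported | 631 | 1403 | 82.4 | 73.8 | * |
| | native | 116 | 390 | 15.1 | 20.5 | * |
| | non-native | 11 | 54 | 1.4 | 2.8 | NA |
| | mixed | 5 | 48 | 0.7 | 2.5 | NA |
| | medium | 3 | 6 | 0.4 | 0.3 | NA |
| expertise_level | general task | 448 | 613 | 58.5 | 32.2 | * |
| | high | 147 | 708 | 19.2 | 37.2 | * |
| | not reported | 108 | 253 | 14.1 | 13.3 | |
| | medium | 56 | 280 | 7.3 | 14.7 | * |
| | mixed | 7 | 47 | 0.9 | 2.5 | NA |
| compensation | not reported | 479 | 697 | 62.5 | 36.7 | * |
| | paid | 218 | 913 | 28.5 | 48.0 | * |
| | free | 69 | 291 | 9.0 | 15.3 | * |
| payment_rate | na | 548 | 988 | 71.5 | 52.0 | * |
| | specific numeric rate | 139 | 605 | 18.1 | 31.8 | * |
| | general mention | 79 | 308 | 10.3 | 16.2 | * |
| age | no | 760 | 1763 | 99.2 | 92.7 | * |
| | yes | 6 | 138 | 0.8 | 7.3 | NA |
| gender | no | 758 | 1744 | 99.0 | 91.7 | * |
| | yes | 8 | 157 | 1.0 | 8.3 | NA |
| nation_origin | no | 762 | 1845 | 99.5 | 97.1 | * |
| | yes | 4 | 56 | 0.5 | 2.9 | NA |
| nation_residence | no | 675 | 1621 | 88.1 | 85.3 | |
| | yes | 91 | 280 | 11.9 | 14.7 | |
| education | no | 596 | 1007 | 77.8 | 53.0 | * |
| | yes | 170 | 894 | 22.2 | 47.0 | * |
| political_orient. | no | 765 | 1887 | 99.9 | 99.3 | |
| | yes | 1 | 14 | 0.1 | 0.7 | NA |
| post_filtering | no | 551 | 1447 | 71.9 | 76.1 | * |
| | yes | 215 | 454 | 28.1 | 23.9 | * |
| adjudication | none | 593 | 1437 | 77.4 | 75.6 | |
| | majority vote | 108 | 228 | 14.1 | 12.0 | |
| | consensus discussion | 23 | 105 | 3.0 | 5.5 | NA |
| | expert adjudication | 14 | 42 | 1.8 | 2.2 | NA |
| | other/mixed | 14 | 49 | 1.8 | 2.6 | NA |
| | third annotator | 8 | 27 | 1.0 | 1.4 | NA |
| | soft labels | 3 | 11 | 0.4 | 0.6 | NA |
| | weighted voting | 3 | 2 | 0.4 | 0.1 | NA |

Impact of the ACL Responsible NLP Checklist. Distributions are shown for 2018–2021 vs. 2022–2025. Multilabel fields are expanded into individual categories; free-text fields are binarized (reported vs. not reported). Statistical significance (*) reflects differences in proportions between periods, assessed using a $\chi^{2}$ test ($p<0.05$); results are omitted (NA) when counts fall below 30 in either period. Total observations (tasks): 2,669.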
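The significance column in the table above can be approximated with a two-period test of proportions. The sketch below is an assumption about the setup, not the authors' code: it uses a Pearson $\chi^{2}$ test on a 2×2 table (category present vs. absent, per period) with one degree of freedom and no continuity correction, and the function name `chi2_two_periods` is hypothetical. The period totals of 766 and 1,901 tasks are obtained by summing the guidelines rows; multilabel fields such as paper_topic would need their own totals.

```python
import math

MIN_COUNT = 30  # the table caption omits results when a count is below 30


def chi2_two_periods(count_a, total_a, count_b, total_b):
    """Pearson chi-square test (1 df, no continuity correction) comparing the
    share of one category between two periods.
    Returns (chi2, p_value), or None when either count is below MIN_COUNT."""
    if min(count_a, count_b) < MIN_COUNT:
        return None
    a, b = count_a, total_a - count_a  # period 1: category present / absent
    c, d = count_b, total_b - count_b  # period 2: category present / absent
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts taken from the table rows; totals derived from the guidelines rows.
guidelines_yes = chi2_two_periods(154, 766, 756, 1901)
total_items_reported = chi2_two_periods(673, 766, 1621, 1901)
age_yes = chi2_two_periods(6, 766, 138, 1901)
```

Under these assumptions the guidelines row comes out significant, the total_items row does not, and the age "yes" row is skipped, in line with the `*`, blank, and NA entries in the table.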

Figure 18: The NLLG lab and NLP@IT:U members in Speinshart in February 2026.
