The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints
Summary
This critical survey examines the Annotation Scarcity Paradox in low-resource NLP evaluation, where rapid model scaling outpaces the human infrastructure needed for authentic evaluation, and discusses emerging responses with equity and validity trade-offs.
# The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Source: [https://arxiv.org/html/2605.19066](https://arxiv.org/html/2605.19066)

###### Abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014–present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the *Annotation Scarcity Paradox*, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated “ghost work”, and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses (including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning) and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.
## 1 Introduction

The past decade has produced an extraordinary proliferation of work in low-resource Natural Language Processing (NLP) Joshi et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib42)); Alabi et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib11)); Belay et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib13)). Shared tasks, multilingual benchmarks, and community-driven resource creation efforts have dramatically expanded the set of languages for which computational tools exist. Projects such as MasakhaNER Adelani et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib6)), AfriSenti Muhammad et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib53)), the AmericasNLP Shared Tasks Mager et al. ([2021](https://arxiv.org/html/2605.19066#bib.bib50)), and SEACrowd Lovenia et al. ([2024](https://arxiv.org/html/2605.19066#bib.bib48)) exemplify the ambition and reach of this movement. However, this rapid accumulation of benchmarks and reported performance gains has masked a structural fragility: the evaluation pipelines underpinning this progress depend critically on human annotators, linguists, and community members whose capacity is finite, whose labour is often uncompensated, and whose involvement is frequently shallow. As the demand for evaluation data has scaled, the human infrastructure required to produce high-quality annotations has not kept pace. Compounding this, the purported gains of the Large Language Model (LLM) era still accrue disproportionately to high-resource languages Blasi et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib20)); Ahuja et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib9)); Adelani et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib7)); Ojo et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib55)), widening the gap between what is technically possible and what is practically evaluable for the majority of the world’s languages.
On AfroBench, one of the benchmarks with the broadest coverage of African languages, the strongest proprietary model (GPT-4o) achieves an average of only 59% across 64 African languages Ojo et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib55)), a figure that itself reflects a curated, relatively well-resourced subset of Africa’s 2,123 languages Adebara ([2025](https://arxiv.org/html/2605.19066#bib.bib5)). We term this the Annotation Scarcity Paradox, defined as follows:

###### Definition 1 (Annotation Scarcity Paradox).

The structural friction arising when the technical capacity to produce and scale NLP models outpaces the human infrastructure required to authentically evaluate them. That infrastructure encompasses annotator availability, deep linguistic expertise, community participation, and epistemic governance (whose knowledge matters). The friction is most directly evidenced in the African NLP context and, we argue, appears in analogous forms across other under-resourced language communities. The paradox is not merely logistical but structural, shaping what can be known and what counts as progress in low-resource NLP.

This survey makes the following contributions:

- We argue that the trajectory of low-resource NLP evaluation across three chronological phases (2014–2018, 2019–2022, 2023–present) has produced a *structural*, not merely logistical, bottleneck in the human capacity required for credible evaluation.
- We introduce and operationalise the *Annotation Scarcity Paradox* as a conceptual framework for understanding how extractive data pipelines, undercompensated annotation labour, and data sovereignty deficits jointly undermine the epistemic validity of reported progress.
- We identify concrete directions for transitioning from transactional data extraction to relational, community-embedded evaluation, including pluralistic evaluation frameworks, transparent annotation reporting norms, and community-owned data infrastructure.

Methodological note. This survey proceeds as a critical narrative review Snyder ([2019](https://arxiv.org/html/2605.19066#bib.bib75)); Grant and Booth ([2009](https://arxiv.org/html/2605.19066#bib.bib36)), drawing on a selection of representative works rather than an exhaustive corpus search. Papers were drawn primarily from proceedings at ACL, ACM venues, and associated community workshops (AfricaNLP, AmericasNLP, WiNLP), as well as shared task system description papers. Selection was guided by three thematic criteria: papers documenting evaluation methodology or benchmark construction for low-resource languages; papers examining annotation practices, community engagement, or data governance; and papers that illuminate the structural dynamics of the field rather than solely reporting model performance. Phase boundaries reflect recognised inflection points: 2019 marks the emergence of massively multilingual models (mBERT Devlin et al. ([2019](https://arxiv.org/html/2605.19066#bib.bib29)), XLM-R Conneau et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib24))) and large-scale cross-lingual benchmarks (XTREME) Hu et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib40)); 2023 marks the dominance of generative evaluation paradigms and the rise of LLM-as-a-judge methods. The author has been an active participant in the low-resource NLP ecosystem described here, particularly within African language NLP communities; this positionality is both the analytical warrant for the claims advanced and a transparency obligation to the reader.

## 2 The Early Boom (2014–2018): Initial Efforts and Optimism

The period from 2014 to 2018 was characterised by pioneering efforts to extend NLP tools beyond the handful of high-resource languages that had historically dominated the field.
Early work focused on establishing practical pipelines for resource creation and developing fundamental processing tools for previously underserved languages King ([2015](https://arxiv.org/html/2605.19066#bib.bib44)). During this time, researchers across the Global South actively began constructing localised benchmarks to ensure non-Western languages were represented in the growing data ecosystem. In the African context, localised institutional efforts laid crucial groundwork; for instance, Eiselen and Puttkammer ([2014](https://arxiv.org/html/2605.19066#bib.bib34)) developed foundational text corpora and core processing technologies for ten South African languages. Parallel momentum was visible in Southeast Asia through multinational collaborations like the Asian Language Treebank Thu et al. ([2016](https://arxiv.org/html/2605.19066#bib.bib79)), and in India, where researchers built massive, open-source datasets such as the IIT Bombay English-Hindi Parallel Corpus Kunchukuttan et al. ([2018](https://arxiv.org/html/2605.19066#bib.bib46)). Crucially, this period culminated in the realisation that technical algorithms alone could not solve the resource gap Nekoto et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib54)); Hershcovich et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib39)). The technical barriers encountered by researchers, combined with the friction of historically extractive data collection practices, highlighted an urgent need for data sovereignty and epistemic governance. This catalysed the rise of grassroots movements working towards increasing the participation of indigenous languages in the epistemic discourse of NLP. We saw the founding of the Masakhane Research Foundation, driven directly by the collaborative efforts of African researchers, and of Widening NLP (WiNLP) at ACL, building on traditions of grassroots movements within the wider AI space.

### 2.1 Resource Creation and Early Benchmarks

The 2014–2018 era saw researchers begin developing corpora and annotation frameworks for African, Asian, and Indigenous languages, frequently working with small teams of linguists and community volunteers. The excitement of this period was genuine: even modest datasets enabled meaningful downstream experiments, and cross-lingual transfer from high-resource languages offered a seemingly scalable path to broader coverage. A primary technical driver of progress during this era was cross-lingual transfer, which sought to leverage data-rich languages to bridge the data gap in low-resource settings by mapping representations across languages Adams et al. ([2017](https://arxiv.org/html/2605.19066#bib.bib4)). Concurrently, the transition to Neural Machine Translation (NMT) sparked new methodological approaches, with researchers exploring universal models capable of translating low-resource languages by sharing syntactic and lexical representations across multiple languages Gu et al. ([2018](https://arxiv.org/html/2605.19066#bib.bib37)). These technical advancements were soon paired with localized evaluations addressing specific linguistic families. Abbott and Martinus ([2018](https://arxiv.org/html/2605.19066#bib.bib1)) laid early groundwork for NMT applied to African languages, highlighting both the severe data sparsity challenges and the potential for neural architectures to overcome them. Similarly, Mager et al. ([2018](https://arxiv.org/html/2605.19066#bib.bib49)) mapped the unique morphological and infrastructural challenges facing indigenous languages of the Americas, signalling a growing need for tailored, region-specific methodologies.

### 2.2 Optimism and Its Limits

The optimism of this phase rested on an implicit assumption: that cross-lingual transfer and bootstrapped resources could substitute for deep, language-specific human investment Joshi et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib42)). This assumption was rarely examined critically. While zero-shot architectures and universal models demonstrated mathematical ingenuity, they frequently reduced complex morphological and syntactic phenomena into universalist frameworks that inherently favoured high-resource pivot languages, typically English Mager et al. ([2018](https://arxiv.org/html/2605.19066#bib.bib49)). Consequently, efforts to build corpora were often opportunistic rather than systematic. The field relied heavily on scraping readily available but narrowly focused texts (such as religious translations Agić and Vulić ([2019](https://arxiv.org/html/2605.19066#bib.bib8)) or government proceedings), resulting in severe domain mismatch and a lack of true sociolinguistic representation. This era was largely defined by a top-down, extractive approach to natural language processing. Languages were frequently treated as mere data points to solve a technical optimization challenge, disconnected from the communities who actually spoke them Bird ([2020](https://arxiv.org/html/2605.19066#bib.bib16)). Because of this disconnect, there was an absence of epistemic governance within the research lifecycle. The researchers building the models rarely possessed lived experience with the languages, and the communities generating the data had no sovereignty over how their linguistic heritage was utilised, licensed, or deployed. Annotation teams were small, sometimes consisting of a single annotator per language, and inter-annotator agreement was inconsistently reported Mager et al. ([2018](https://arxiv.org/html/2605.19066#bib.bib49)).
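
To make the point about inter-annotator agreement concrete, the following is a minimal sketch of Cohen's kappa, one widely used chance-corrected agreement statistic for two annotators. It is an illustration rather than a procedure taken from any of the works surveyed here; the function name `cohens_kappa` and the toy NER-style labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label if each
    # annotator labelled independently according to their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on eight tokens.
annotator_1 = ["PER", "LOC", "O", "O", "PER", "LOC", "O", "O"]
annotator_2 = ["PER", "O", "O", "O", "PER", "LOC", "O", "LOC"]

# Raw agreement is 6/8 = 0.75, but chance agreement is 0.375,
# so the chance-corrected value is lower.
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.6
```

Note that the statistic is undefined when only one annotator labels each language: with a single annotator there is no second label sequence to compare against, which is why single-annotator datasets cannot report agreement at all.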

The foundations for rigorous evaluation were present in principle but seldom fully realised in practice. Because native speakers were largely excluded from the development loop, evaluations relied heavily on automated metrics applied to noisy, out-of-domain test sets, creating a false sense of empirical progress Birhane et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib18)). Ultimately, the technical limitations of this phase made it increasingly evident that algorithmic advancements could not outpace the fundamental need for community-driven stewardship. The friction generated by these historically extractive practices highlighted the necessity for a radical realignment in how NLP research was conducted, creating the exact vacuum that community-led grassroots movements would soon rise to fill.

## 3 The Scaling Challenge (2019–2022): Benchmarking and the Illusion of Progress

Between 2019 and 2022, the field entered a phase of rapid benchmark proliferation, driven largely by the advent of Massively Multilingual Language Models (MMLMs) such as mBERT and XLM-R. Shared tasks multiplied, multilingual leaderboards emerged, and performance on standardised benchmarks became the primary currency of scientific contribution. However, this scaling often masked profound underlying deficiencies in how low-resource languages were processed.

### 3.1 Proliferation of Benchmarks and Shared Tasks

The number of shared tasks targeting low-resource languages grew substantially during this period. Workshops associated with major venues (ACL, EMNLP, COLING) hosted annual competitions that attracted large numbers of submissions and generated significant visibility for low-resource NLP research. This era saw a divergence in benchmarking approaches. On the one hand, massive, top-down evaluation suites like XTREME Hu et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib40)) attempted to create universal metrics for cross-lingual generalization.
On the other hand, grassroots initiatives focused on building authentic datasets from the ground up, evidenced by the AI4D Language Program challenges for dataset creation Siminyu et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib70), [2021](https://arxiv.org/html/2605.19066#bib.bib71)); Orlic ([2021](https://arxiv.org/html/2605.19066#bib.bib61)), which actively funded and structured community-driven resource generation.

### 3.2 Pressure for Performance and Comparability

The competitive dynamics of shared tasks and overarching leaderboards created intense incentives for optimising benchmark performance, frequently at the expense of deeper sociolinguistic understanding. Rodriguez et al. ([2021](https://arxiv.org/html/2605.19066#bib.bib67)) and Joshi et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib42)) raised crucial concerns about benchmark bias and the degree to which reported gains reflected genuine linguistic competence rather than dataset-specific artefacts or overfitting. The standardisation of evaluation formats, while enabling broad comparability, flattened important morphological and syntactic distinctions between language families. Furthermore, models scaled to accommodate over a hundred languages often suffered from capacity dilution; researchers found that state-of-the-art MMLMs systematically failed to represent the nuances of low-resource languages unless subjected to intensive, language-specific adaptive fine-tuning Alabi et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib10)).

### 3.3 Emerging Awareness of Limitations and the Participatory Shift

By the end of this phase, critical voices within the community had begun to rigorously question the assumptions underlying rapid benchmark scaling.
Concerns centred on the shallow involvement of speaker communities in benchmark design and the concentration of annotation labour among a small pool of multilingual academics. It became evident that separating the technical creation of benchmarks from the communities who speak the languages perpetuated the extractive practices of the previous decade. This realisation gave rise to the participatory research paradigm, championed by movements like the Masakhane Research Foundation ([https://www.masakhane.io/](https://www.masakhane.io/)) and others such as AmericasNLP ([https://www.americasnlp.org](https://www.americasnlp.org/)), which demonstrated that integrating native speakers into every node of the NLP development pipeline was not merely an ethical imperative, but a technical necessity for building robust models Nekoto et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib54)). These concerns and subsequent methodological shifts foreshadowed the severe data and compute bottlenecks that would become increasingly visible in the era of Large Language Models (LLMs).

## 4 The Annotation Scarcity Paradox (2023–present): Limits and New Directions

From 2023 onwards, the Annotation Scarcity Paradox has become an increasingly explicit concern in the literature and in community discussions. As the field transitioned from discriminative or highly constrained tasks to open-ended generative AI, the sheer volume of data required to align and evaluate models met the hard reality of limited human capital. Several converging pressures have brought this tension to the fore.

### 4.1 The Rising Cost of High-Quality Evaluation

As LLMs have become central to NLP research, the complexity and time required for meaningful human evaluation have increased substantially Bender et al. ([2021](https://arxiv.org/html/2605.19066#bib.bib15)).
Unlike earlier eras where automated metrics like BLEU or F1 could provide a rough proxy for performance, evaluating the outputs of generative models (for fluency, cultural appropriateness, safety, and factual accuracy) requires evaluators with deep linguistic and cultural competence, not simply literacy in the target language Manduchi et al. ([2024](https://arxiv.org/html/2605.19066#bib.bib51)). For many low-resource languages, such evaluators are exceptionally scarce. Consequently, there is a growing realisation that model capabilities are severely bottlenecked not just by training data, but by the availability of qualified humans to verify the outputs Kholodna et al. ([2024](https://arxiv.org/html/2605.19066#bib.bib43)); Chiu et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib23)).

### 4.2 Challenges of Community Engagement and Data Sovereignty

Meaningful participation from speaker communities has proven difficult to sustain at scale. The power dynamics between resource-rich research institutions (often located in the Global North) and language communities have received growing critical attention Nekoto et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib54)). Historically, data pipelines have been largely extractive: language is frequently treated as a raw resource to be harvested rather than a living system governed by its speakers. Consequently, communities are often asked to provide intensive annotation labour without commensurate recognition, financial benefit, or agency over the resulting models.
This friction has catalysed a strong push for data sovereignty Effoduh ([2026](https://arxiv.org/html/2605.19066#bib.bib33)); Birhane ([2020](https://arxiv.org/html/2605.19066#bib.bib19)) and equitable data licensing frameworks (such as Kaitiakitanga Taiuru ([2021](https://arxiv.org/html/2605.19066#bib.bib77)), the CARE Principles Carroll et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib21)), the Esethu Framework Rajab et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib66)), and the Nwulite Obodo Open Data License (NOODL) Okorie ([2025](https://arxiv.org/html/2605.19066#bib.bib59)); Okorie and Omino ([2025](https://arxiv.org/html/2605.19066#bib.bib58))), advocating that communities must retain ownership over their linguistic heritage and dictate how it is deployed. However, implementing these equitable governance structures requires significant time and trust-building, further constraining the speed at which “scale” can be achieved. Similarly, the pool of computational linguists with deep expertise in specific low-resource languages remains limited, creating structural constraints on the quality of evaluation that can be produced.

### 4.3 The Hidden Labour and Ethical Debt of Scale

The structural constraints of scaling generative models are not merely technical; they are deeply tied to extractive labour practices and poor resource management. As Birhane and Prabhu ([2021](https://arxiv.org/html/2605.19066#bib.bib17)) extensively critique, the prevailing “bigger is better” paradigm inherently relies on the indiscriminate scraping of massive datasets, which disproportionately encodes and amplifies structural harms against marginalised groups.
The sheer scale of these datasets creates severe barriers to responsible data filtering and ethical auditing, frequently shifting the burden of safety onto the very communities most likely to be harmed by the resulting models (see, on a smaller scale, the work in Abdulmumin et al. ([2024](https://arxiv.org/html/2605.19066#bib.bib2)), which corrects an African-focused MT dataset that is widely used in the field). This brings to light the precarious reality of data workers in the Global South. Okolo ([2024](https://arxiv.org/html/2605.19066#bib.bib57)) has documented how the safety and alignment of global AI systems, such as toxicity filtering and reinforcement learning from human feedback (RLHF), rely heavily on under-compensated “ghost workers” in regions like Africa. While resource-rich institutions reap the economic and technological benefits of these models, the psychological toll and manual labour of data annotation are outsourced. This dynamic exacerbates the global AI divide, prompting scholars to advocate for comprehensive, localised data governance frameworks that protect data workers and enforce equitable regulation across the continent. Compounding this exploitative paradigm is the phenomenon recently termed “language data flaring” by Adebara ([2025](https://arxiv.org/html/2605.19066#bib.bib5)). Paralleling the wasteful burning of natural gas during oil extraction, language data flaring captures the systemic neglect, poor digitisation practices, and extreme under-utilisation of existing African linguistic resources. While high-resource languages are aggressively harvested to push the boundaries of LLM capabilities, vast amounts of low-resource data remain siloed in physical archives or inaccessible digital formats.
This wasteful management means that local communities are simultaneously exploited for their annotation labour while being starved of the digital infrastructure necessary to bring their own native languages into the epistemic discourse of modern NLP. The coverage gap is stark: across Africa, at least 150 of 2,144 languages Eberhard et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib31)) have any ASR coverage team et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib78)), fewer than 80 appear in any NLP benchmark Adebara ([2025](https://arxiv.org/html/2605.19066#bib.bib5)); Ojo et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib55)); Adelani et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib7)); Osei and others ([2024](https://arxiv.org/html/2605.19066#bib.bib62)), and only 20 are served by a regional LLM (Figure [1](https://arxiv.org/html/2605.19066#S4.F1)); in Southeast Asia, SEACrowd covers 36 of an estimated 1,300 indigenous languages Lovenia et al. ([2024](https://arxiv.org/html/2605.19066#bib.bib48)); in South Asia, IndicGenBench spans 29 of roughly 800 languages Singh et al. ([2024a](https://arxiv.org/html/2605.19066#bib.bib72)); and across the Americas, indigenous NLP benchmarks collectively cover fewer than 25 of over 1,000 indigenous languages Mager et al. ([2021](https://arxiv.org/html/2605.19066#bib.bib50)); Ebrahimi et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib32)).

Figure 1: The African language AI pipeline bottleneck. Each bar shows independent coverage counts at successive stages of the AI pipeline: 2,144 total languages Eberhard et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib31)); at least 150 with any ASR system team et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib78)) (estimated lower bound); 64 covered by a major NLP benchmark Ojo et al. ([2025](https://arxiv.org/html/2605.19066#bib.bib55)); 20 served by a regional LLM Yu et al. ([2026](https://arxiv.org/html/2605.19066#bib.bib80)).
Languages at each stage do not necessarily overlap with adjacent stages; the figure illustrates the stark disparity in scale across the pipeline, not a strict attrition sequence.

### 4.4 Emerging Alternative Approaches

Facing this paradox, researchers have begun exploring approaches that reduce, redistribute, or democratise the demands on human evaluators. These include:

- Augmentation: A complementary strategy is to maximise the value of already-annotated data through augmentation. Widely applied in computer vision to improve model robustness, data augmentation artificially expands available training and evaluation sets. For low-resource NLP, this offers a partial workaround for annotation scarcity Şahin ([2022](https://arxiv.org/html/2605.19066#bib.bib68)), though the choice of technique matters: different augmentation methods affect morphological and syntactic properties differently Feng et al. ([2021](https://arxiv.org/html/2605.19066#bib.bib35)); Dhole et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib30)); Chen et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib22)), and naive application risks amplifying existing biases or obscuring the linguistic phenomena under study.
- Model-Based Evaluation: Using advanced LLMs as proxies for human raters (“LLM-as-a-judge”) to scale evaluation dynamically Zheng et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib81)). However, this approach carries significant risk in low-resource settings, as the “judge” models themselves suffer from severe pre-training data imbalances and frequently fail to capture localised cultural nuances.
- Global Participatory Curation: Massive, globally distributed annotation efforts designed specifically to bridge the instruction-tuning gap.
A canonical example is the Aya initiative Singh et al. ([2024b](https://arxiv.org/html/2605.19066#bib.bib73)), which involved collaborators from over 100 countries in a human-curated, participatory framework to build instruction-following datasets for 65 languages. For Automatic Speech Recognition (ASR) datasets, the Mozilla Common Voice project ([https://commonvoice.mozilla.org/en](https://commonvoice.mozilla.org/en)) Ardila et al. ([2020](https://arxiv.org/html/2605.19066#bib.bib12)) worked with volunteers to collect scripted speech. However, even participatory campaigns at scale carry inherent tensions: volunteer fatigue and dropout across long-running efforts can produce uneven language coverage Klie et al. ([2023](https://arxiv.org/html/2605.19066#bib.bib45)); quality control becomes more difficult as annotator linguistic backgrounds vary widely; and “global” participation risks over-sampling diaspora or digitally-connected speakers relative to the in-country communities most directly affected by the resulting models. The data used in such campaigns may itself raise concerns, such as the provenance of the source text that Mozilla Common Voice relies on de Wet et al. ([2022](https://arxiv.org/html/2605.19066#bib.bib27)).
- Annotation-Efficient Evaluation via IRT and Active Learning: Rather than collecting human judgements uniformly across all evaluation items, Item Response Theory (IRT) models the *difficulty* and *discriminative power* of individual test examples with respect to model capability Lalor et al. ([2019](https://arxiv.org/html/2605.19066#bib.bib47)). Applied to LM evaluation, IRT identifies the small subset of items that most sharply distinguish between models, concentrating scarce annotation effort where it yields the greatest evaluative signal, rather than distributing it thinly across thousands of items of unequal informativeness.
The recent *tinyBenchmarks* Polo et al. ([2024](https://arxiv.org/html/2605.19066#bib.bib64)) is an example of this approach in practice. Complementary active learning strategies Monarch ([2021](https://arxiv.org/html/2605.19066#bib.bib52)) extend this logic iteratively: by selecting the next examples to annotate based on model uncertainty or expected information gain, annotation budgets can be directed toward the cases that most change our understanding of model behaviour. Together, these approaches offer a principled path to doing more with less, a direct response to the annotation scarcity constraint. The key limitation is that IRT calibration requires a sufficient initial annotation pool, which remains a bootstrapping challenge for the most severely under-resourced languages.

Each of these approaches involves significant trade-offs between scalability, validity, and equity. The current era is defined by the search for a balance between the technical hunger for massive datasets and the ethical imperative of epistemic justice.

## 5 Discussion: Implications for the Future of Low-Resource NLP

Our survey reveals a field at a critical inflection point. The Annotation Scarcity Paradox is not merely a logistical inconvenience to be solved through cleverer algorithms; it is a structural feature of the evaluation ecosystem that shapes what can be known, and what counts as progress, in low-resource NLP. When the capacity to evaluate models trails so far behind the capacity to generate them, the field risks optimising for statistical illusions rather than genuine linguistic utility.

### 5.1 The Need for Sustainable and Participatory Evaluation

A sustainable evaluation ecosystem must shift from transactional, extractive data-gathering to relational, community-embedded capacity building DataDotOrg ([2026](https://arxiv.org/html/2605.19066#bib.bib26)).
We must distribute the labour of annotation and assessment more equitably, invest heavily in the development of local evaluation expertise, and create governance structures that give language communities meaningful agency over how their languages are represented and assessed Jo and Gebru \([2020](https://arxiv.org/html/2605.19066#bib.bib41)\)\. Evaluation must be reframed not as the final hurdle of model deployment, but as an ongoing dialogue with the language community Sloane et al\. \([2022](https://arxiv.org/html/2605.19066#bib.bib74)\)\. This requires institutional commitment from funding bodies, universities, and research consortia to finance long\-term digital infrastructure, rather than merely funding short\-term methodological innovation or raw compute\.

### 5\.2 Recognising the Limits of Human Capacity and Enforcing Transparency

The field can no longer afford to treat human annotation as an infinitely renewable, frictionless resource\. We must develop and enforce clearer norms for reporting the human resources involved in benchmark construction and evaluation\. Building on frameworks like Data Statements Bender and Friedman \([2018](https://arxiv.org/html/2605.19066#bib.bib14)\), researchers must explicitly document team size, annotator demographics, compensation arrangements, and power dynamics\. Furthermore, treating inter\-annotator disagreement simply as “noise” to be averaged out erases vital sociolinguistic variation\. Releasing annotator\-level labels and information, and acknowledging inherent subjectivity Prabhakaran et al\. \([2021](https://arxiv.org/html/2605.19066#bib.bib65)\), would enable better calibration of confidence in reported results, supporting more realistic, grounded assessments of what evaluation findings actually mean in a real\-world context\.

### 5\.3 Avenues for Future Research

To navigate this paradox, the NLP community must re\-evaluate its incentive structures\.
We identify several promising directions for future research:
- •Pluralistic Evaluation Frameworks: Developing methodologies that explicitly model annotator uncertainty, cultural subjectivity, and dialectal variation, rather than forcing a single, universal “ground truth” onto complex linguistic tasks\.
- •Re\-aligning Shared Tasks: Designing community competitions that explicitly reward data provenance, annotation transparency, and ethical governance, rather than solely ranking submissions by automated benchmark performance\.
- •Community\-Owned Infrastructure: Fostering long\-term, community\-embedded partnerships that move beyond the extractive dynamics of one\-off annotation campaigns\. NLP \(and AI research in general\) faces the same “helicopter research” problem that is well documented in health research Abimbola and Pai \([2020](https://arxiv.org/html/2605.19066#bib.bib3)\); Crane \([2011](https://arxiv.org/html/2605.19066#bib.bib25)\), and we need to move beyond WEIRD \(Western, Educated, Industrialised, Rich, Democratic\) contexts Henrich et al\. \([2010](https://arxiv.org/html/2605.19066#bib.bib38)\)\. Research should increasingly focus on building local data trusts Delacroix and Lawrence \([2019](https://arxiv.org/html/2605.19066#bib.bib28)\); Olorunju and Adams \([2024](https://arxiv.org/html/2605.19066#bib.bib60)\); Okorie \([2025](https://arxiv.org/html/2605.19066#bib.bib59)\); Okorie and Omino \([2025](https://arxiv.org/html/2605.19066#bib.bib58)\); Rajab et al\. \([2025](https://arxiv.org/html/2605.19066#bib.bib66)\) and governed\-access repositories that ensure data sovereignty outlives any single grant cycle or publication\. Sustaining such infrastructure beyond individual grant cycles will require alternative economic models: endowment structures analogous to open\-source software foundations, revenue\-sharing agreements with commercial users of community\-owned data, or explicit financing from multilateral development institutions\.
See, for example, Karya in India Perrigo \([2023](https://arxiv.org/html/2605.19066#bib.bib63)\); Okolo and Tano \([2024](https://arxiv.org/html/2605.19066#bib.bib56)\)\.

Ultimately, overcoming the Annotation Scarcity Paradox will not be achieved by asking how we can evaluate models faster, but by critically examining who has the power, resources, and mandate to evaluate them at all\.

### 5\.4 Limitations and Scope

We acknowledge that the African NLP context is more deeply represented in this survey than other low\-resource language communities, including the Southeast Asian and Indigenous Americas NLP ecosystems\. This reflects the author’s positionality and professional network, and is itself a manifestation of the inequitable concentration of expertise this paper critiques\. Accordingly, the claims of the paradox are most directly evidenced in the African NLP context; we treat this as a well\-documented instance of a broader structural pattern that we expect manifests in analogous, if locally distinct, forms in Southeast Asian and Indigenous Americas language ecosystems\. Work from communities such as AmericasNLP Mager et al\. \([2021](https://arxiv.org/html/2605.19066#bib.bib50)\) and SEACrowd Lovenia et al\. \([2024](https://arxiv.org/html/2605.19066#bib.bib48)\) shows analogous annotation scarcity dynamics arising from distinct institutional actors \(NGOs, universities, state bodies\) and data sovereignty frameworks\.
In Southeast Asia, SEA\-HELM further identifies cultural diagnostics as a distinct evaluation pillar largely absent from global benchmarks Susanto et al\. \([2025](https://arxiv.org/html/2605.19066#bib.bib76)\); in South Asia, IndicGenBench documents that even 29 relatively higher\-resourced Indic languages exhibit systematic generation quality gaps under multilingual LLMs Singh et al\. \([2024a](https://arxiv.org/html/2605.19066#bib.bib72)\); and AmericasNLI demonstrates near\-chance zero\-shot NLI performance for most Indigenous languages of the Americas Ebrahimi et al\. \([2022](https://arxiv.org/html/2605.19066#bib.bib32)\)\. We invite researchers embedded in other regional ecosystems to extend and contest the framework offered here\. As a critical narrative review, this survey makes no claim to exhaustive coverage; the works cited are representative rather than comprehensive\.

## 6 Navigating the Tension: A Pragmatic Call to Action

Drawing on the author’s experience as a practitioner and community member in this space, this section offers a pragmatic complement to the structural critique above\. The observations below are offered as practitioner reflections inviting empirical contestation, not as reviewed claims\.

Critiquing the extractive nature of modern scaling often inadvertently triggers a secondary risk: research paralysis\. Confronted with the immense difficulty of the Annotation Scarcity Paradox, the complexities of epistemic governance, and the historical debt of data colonialism, researchers may feel that working on low\-resource languages is too ethically fraught or logistically demanding to pursue\. There is a palpable anxiety among practitioners who find themselves caught in a precarious middle ground: tasked with the technical imperative of getting their languages represented in global, state\-of\-the\-art models, while bearing the profound, often exhausting responsibility of representing their communities with absolute sincerity and cultural integrity\.
It is crucial to emphasise that the messy, human side of computing is not an impediment to natural language processing; it is the fundamental core of the discipline\. The friction of community engagement should not drive researchers away from these problems, but rather redefine what a ‘successful’ project looks like\.

For researchers entering this space, pragmatic engagement must take precedence over the pursuit of perfect, frictionless scale\. Practically, this means embracing ‘slow AI’ Sambasivan et al\. \([2021](https://arxiv.org/html/2605.19066#bib.bib69)\): choosing to build a single, deeply verified, community\-owned dataset that serves a specific local need, rather than feeling pressured to scrape millions of unverified tokens just to appear on a global leaderboard\. It means accepting that building trust is non\-linear, and that navigating the human dynamics of data collection is just as scientifically rigorous and valuable as optimising a model’s loss function\.

Finally, a fundamental humility is required: we must acknowledge that the researchers advocating for and adopting these participatory frameworks are themselves imperfect\. Missteps in cultural translation, logistical oversights in compensation, and the inadvertent replication of historic power dynamics will still inevitably occur\. Yet committing to this difficult path, despite its friction and our own fallibility, is essential for the long\-term sustainability of our field\. Embracing the human complexity of NLP is ultimately the only viable mechanism to earn, and rightfully keep, the trust of the communities we claim to serve\.

## 7 Conclusion

The last decade of low\-resource NLP has produced genuine and important progress\. Yet the pace of benchmark creation and the ever\-increasing data hunger of modern generative scaling have vastly outrun the human infrastructure needed to support them\.
This imbalance has produced a profound gap between reported machine performance and our warranted confidence in that performance\. We have characterised this gap as the Annotation Scarcity Paradox and traced its emergence across three distinct phases of the field’s development, arguing that this bottleneck is not merely logistical, but deeply structural\. We call on the wider NLP community to critically examine the assumptions underlying current scaling practices and to transition away from transactional data extraction\. We must invest, collectively, in more sustainable, equitable, and epistemically honest approaches that prioritise shared ownership and governed access\. Embracing this complexity, and accepting the inherent friction, slow pace, and imperfection of the human work required to achieve it, is no longer optional\. Doing so is not only a fundamental requirement for scientific rigour; it is a matter of epistemic justice toward the language communities whose linguistic heritage and communicative needs this research ultimately serves\.

## Ethical Statement

This paper presents a survey and does not involve human subjects research, the collection of new data, or the deployment of systems\. The survey addresses questions of equity and community involvement in NLP research; we have endeavoured to engage with these questions with appropriate care and humility\.

## Acknowledgments

This work is supported by the ABSA Chair of Data Science at the University of Pretoria, and has benefited from gifts from NVIDIA, Google\.org, OpenAI, and Meta, as well as funding from the UK International Development programme and IDRC Ottawa \(AI4D Africa\)\.

## References

- Abbott and Martinus \[2018\] Jade Z Abbott and Laura Martinus\. Towards neural machine translation for African languages\. arXiv preprint arXiv:1811\.05467, 2018\.
- Abdulmumin et al\. \[2024\] Idris Abdulmumin, Sthembiso Mkhwanazi, Mahlatse Mbooi, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Neo Putini, Miehleketo Mathebula, Matimba Shingange, Tajuddeen Gwadabe, and Vukosi Marivate\. Correcting FLORES evaluation dataset for four African languages\. In Barry Haddow, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Ninth Conference on Machine Translation, pages 570–578, Miami, Florida, USA, November 2024\. Association for Computational Linguistics\.
- Abimbola and Pai \[2020\] Seye Abimbola and Madhukar Pai\. Will global health survive its decolonisation? The Lancet, 396\(10263\):1627–1628, 2020\.
- Adams et al\. \[2017\] Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn\. Cross\-lingual word embeddings for low\-resource language modeling\. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947, 2017\.
- Adebara \[2025\] Ife Adebara\. AI and language data flaring in Africa: Addressing the low\-resource challenge\. Policy Brief No\. 216, 2025\.
- Adelani et al\. \[2022\] David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen\-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen H Muhammad, Peter Nabende, et al\. MasakhaNER 2\.0: Africa\-centric transfer learning for named entity recognition\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, 2022\.
- Adelani et al\. \[2025\] David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En\-Shiun Annie Lee, et al\. IrokoBench: A new benchmark for African languages in the age of large language models\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pages 2732–2757, 2025\.
- Agić and Vulić \[2019\] Željko Agić and Ivan Vulić\. JW300: A wide\-coverage parallel corpus for low\-resource languages\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, 2019\.
- Ahuja et al\. \[2023\] Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, et al\. MEGA: Multilingual evaluation of generative AI\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, 2023\.
- Alabi et al\. \[2022\] Jesujoba Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow\. Adapting pre\-trained language models to African languages via multilingual adaptive fine\-tuning\. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, 2022\.
- Alabi et al\. \[2025\] Jesujoba Alabi, Michael A Hedderich, David Ifeoluwa Adelani, and Dietrich Klakow\. Charting the landscape of African NLP: Mapping progress and shaping the road ahead\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27795–27829, 2025\.
- Ardila et al\. \[2020\] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber\. Common Voice: A massively\-multilingual speech corpus\. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, 2020\.
- Belay et al\. \[2025\] Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa\-Dutse, Abrham Belete Haile, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, et al\. The rise of AfricaNLP: Contributions, contributors, and community impact \(2005\-2025\)\. arXiv preprint arXiv:2509\.25477, 2025\.
- Bender and Friedman \[2018\] Emily M\. Bender and Batya Friedman\. Data statements for natural language processing: Toward mitigating system bias and enabling better science\. Transactions of the Association for Computational Linguistics, 6:587–604, 2018\.
- Bender et al\. \[2021\] Emily M Bender, Timnit Gebru, Angelina McMillan\-Major, and Shmargaret Shmitchell\. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021\.
- Bird \[2020\] Steven Bird\. Decolonising speech and language technology\. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3504–3519, 2020\.
- Birhane and Prabhu \[2021\] Abeba Birhane and Vinay Uday Prabhu\. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision \(WACV\), pages 1536–1546\. IEEE, 2021\.
- Birhane et al\. \[2022\] Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao\. The values encoded in machine learning research\. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 173–184, 2022\.
- Birhane \[2020\] Abeba Birhane\. Algorithmic colonization of Africa\. SCRIPTed, 17:389, 2020\.
- Blasi et al\. \[2022\] Damian Blasi, Antonios Anastasopoulos, and Graham Neubig\. Systematic inequalities in language technology performance across the world’s languages\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 5486–5505, 2022\.
- Carroll et al\. \[2023\] Stephanie Russo Carroll, Ibrahim Garba, Oscar L Figueroa\-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, Kay Raseroka, Desi Rodriguez\-Lonebear, Robyn Rowe, et al\. The CARE principles for Indigenous data governance\. Open Scholarship Press Curated Volumes: Policy, 2023\.
- Chen et al\. \[2023\] Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang\. An empirical survey of data augmentation for limited data learning in NLP\. Transactions of the Association for Computational Linguistics, 11:191–211, 2023\.
- Chiu et al\. \[2025\] Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, et al\. CulturalBench: A robust, diverse and challenging benchmark for measuring LMs’ cultural knowledge through human\-AI red\-teaming\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 25663–25701, 2025\.
- Conneau et al\. \[2020\] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov\. Unsupervised cross\-lingual representation learning at scale\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, 2020\.
- Crane \[2011\] Johanna Crane\. Scrambling for Africa? Universities and global health\. The Lancet, 377\(9775\):1388–1390, 2011\.
- DataDotOrg \[2026\] DataDotOrg\. Digitisation of oral data for NLP of low\-resource languages: Practical methods and processes for scalable and sustainable ecosystem development\. Playbook, DataDotOrg, Washington, D\.C\., USA, 2026\. A playbook for building sustainable African language technology ecosystems\.
- de Wet et al\. \[2022\] Febe de Wet, Andiswa Bukula, Willem Karsten, Martin Puttkammer, Erwin Schillack, Rone Wierenga, and Roald Eiselen\. Localising the Mozilla Common Voice platform for South Africa’s official languages\. Journal of the Digital Humanities Association of Southern Africa \(DHASA\), 4\(01\), 2022\.
- Delacroix and Lawrence \[2019\] Sylvie Delacroix and Neil D Lawrence\. Bottom\-up data trusts: Disturbing the ‘one size fits all’ approach to data governance\. International Data Privacy Law, 9\(4\):236–252, 2019\.
- Devlin et al\. \[2019\] Jacob Devlin, Ming\-Wei Chang, Kenton Lee, and Kristina Toutanova\. BERT: Pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pages 4171–4186, 2019\.
- Dhole et al\. \[2023\] Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, et al\. NL\-Augmenter: A framework for task\-sensitive natural language augmentation\. Northern European Journal of Language Technology, 9, 2023\.
- Eberhard et al\. \[2025\] David M\. Eberhard, Gary F\. Simons, and Charles D\. Fennig\. Ethnologue: Languages of the world\. SIL International, 2025\.
- Ebrahimi et al\. \[2022\] Abteen Ebrahimi, Manuel Mager, Adam Wiemerslage, Pavel Denisov, Katharina Kann, et al\. AmericasNLI: Evaluating zero\-shot natural language understanding of pretrained multilingual models in truly low\-resource languages\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 6279–6299, 2022\.
- Effoduh \[2026\] Jake Okechukwu Effoduh\. Decolonizing the governance of artificial intelligence in Africa: From normative mimicry to epistemic sovereignty\. Science and Public Policy, 53\(2\):245–257, 2026\.
- Eiselen and Puttkammer \[2014\] Roald Eiselen and Martin J Puttkammer\. Developing text resources for ten South African languages\. In Proceedings of the Ninth International Conference on Language Resources and Evaluation \(LREC’14\), pages 3698–3703, 2014\.
- Feng et al\. \[2021\] Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy\. A survey of data augmentation approaches for NLP\. In Findings of the Association for Computational Linguistics: ACL\-IJCNLP 2021, pages 968–988, 2021\.
- Grant and Booth \[2009\] Maria J Grant and Andrew Booth\. A typology of reviews: An analysis of 14 review types and associated methodologies\. Health Information & Libraries Journal, 26\(2\):91–108, 2009\.
- Gu et al\. \[2018\] Jiatao Gu, Hany Hassan Awadalla, Jacob Devlin, and Victor OK Li\. Universal neural machine translation for extremely low resource languages\. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\), pages 344–354, 2018\.
- Henrich et al\. \[2010\] Joseph Henrich, Steven J Heine, and Ara Norenzayan\. The weirdest people in the world? Behavioral and Brain Sciences, 33\(2\-3\):61–83, 2010\.
- Hershcovich et al\. \[2022\] Daniel Hershcovich, Stella Frank, Heather Lent, Miryam De Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, et al\. Challenges and strategies in cross\-cultural NLP\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 6997–7013, 2022\.
- Hu et al\. \[2020\] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson\. XTREME: A massively multilingual multi\-task benchmark for evaluating cross\-lingual generalisation\. In International Conference on Machine Learning, pages 4411–4421\. PMLR, 2020\.
- Jo and Gebru \[2020\] Eun Seo Jo and Timnit Gebru\. Lessons from archives: Strategies for collecting sociocultural data in machine learning\. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306–316, 2020\.
- Joshi et al\. \[2020\] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury\. The state and fate of linguistic diversity and inclusion in the NLP world\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, 2020\.
- Kholodna et al\. \[2024\] Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, and Michael Granitzer\. LLMs in the loop: Leveraging large language model annotations for active learning in low\-resource languages\. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 397–412\. Springer, 2024\.
- King \[2015\] Benjamin Philip King\. Practical Natural Language Processing for Low\-Resource Languages\. PhD thesis, University of Michigan, 2015\.
- Klie et al\. \[2023\] Jan\-Christoph Klie, Ji\-Ung Lee, Kevin Stowe, Gözde Şahin, Nafise Sadat Moosavi, Luke Bates, Dominic Petrak, Richard Eckart De Castilho, and Iryna Gurevych\. Lessons learned from a citizen science project for natural language processing\. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3594–3608, Dubrovnik, Croatia, May 2023\. Association for Computational Linguistics\.
- Kunchukuttan et al\. \[2018\] Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya\. The IIT Bombay English\-Hindi parallel corpus\. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation \(LREC 2018\), 2018\.
- Lalor et al\. \[2019\] John P\. Lalor, Hao Wu, and Hong Yu\. Learning latent parameters without human response patterns: Item response theory with artificial crowds\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\-IJCNLP\), pages 4674–4684, Hong Kong, China, November 2019\. Association for Computational Linguistics\.
- Lovenia et al\. \[2024\] Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V\. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P\. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F\. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R\. Damanhuri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V\.
Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Tai Ngee Chia, Ayu Purwarianti, Sebastian Ruder, William Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng\-Xin Yong, and Samuel Cahyawijaya\. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages\. In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA, November 2024\. Association for Computational Linguistics\.
- Mager et al\. \[2018\] Manuel Mager, Ximena Gutierrez\-Vasques, Gerardo Sierra, and Ivan Meza\-Ruiz\. Challenges of language technologies for the indigenous languages of the Americas\. In Proceedings of the 27th International Conference on Computational Linguistics, pages 55–69, 2018\.
- Mager et al\. \[2021\] Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez\-Vasques, Luis Chiruzzo, Gustavo Giménez\-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto\-Solano, Alexis Palmer, Elisabeth Mager\-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, and Katharina Kann\. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas\. In Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, and Katharina Kann, editors, Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 202–217, Online, June 2021\. Association for Computational Linguistics\.
- Manduchi et al\. \[2024\] Laura Manduchi, Clara Meister, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, et al\. On the challenges and opportunities in generative AI\. arXiv preprint arXiv:2403\.00025, 2024\.
- Monarch \[2021\] Robert Munro Monarch\. Human\-in\-the\-Loop Machine Learning: Active learning and annotation for human\-centered AI\. Simon and Schuster, 2021\.
- Muhammad et al\. \[2023\] Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa’id Ahmad, Meriem Beloucif, Saif M\. Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Alipio Jorge, Felermino Dário Mário António Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, and Stephen Arthur\. AfriSenti: A Twitter sentiment analysis benchmark for African languages\. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13968–13981, Singapore, December 2023\. Association for Computational Linguistics\.
- Nekoto et al\. \[2020\] Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, et al\. Participatory research for low\-resourced machine translation: A case study in African languages\. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160, 2020\.
- Ojo et al\. \[2025\] Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani\. AfroBench: How good are large language models on African languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048–19095, 2025\.
- Okolo and Tano \[2024\] Chinasa Okolo and Marie Tano\. Moving toward truly responsible AI development in the global AI market, 2024\. Brookings Institution\.
- Okolo \[2024\] Chinasa Okolo\. Reforming data regulation to advance AI governance in Africa, 2024\.
- Okorie and Omino \[2025\] Chijioke Okorie and Melissa Omino\. Addressing inequitable openness in licences for sharing African data and datasets through the Nwulite Obodo open data licence\. Law, Tech\. & Hum\., 7:94, 2025\.
- Okorie \[2025\] Chijioke Okorie\. It’s the NOODL license–awesome and amazingly geeky\! Available at SSRN 5339254, 2025\.
- Olorunju and Adams \[2024\] Nokuthula Olorunju and Rachel Adams\. African data trusts: New tools towards collective data governance? Information & Communications Technology Law, 33\(1\):85–98, 2024\.
- Orlic \[2021\] Davor Orlic\. Outreach programme to strengthen the AI4D network: Final technical report\. Technical report, AI4D Africa, 2021\.
- Osei and others \[2024\] Salomey Osei et al\. PazaBench: A speech and language model benchmark for low\-resource African languages\. Microsoft Research, 2024\.
- Perrigo \[2023\] Billy Perrigo\. AI by the people, for the people, July 2023\.
- Polo et al\. \[2024\] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin\. tinyBenchmarks: Evaluating LLMs with fewer examples\. In Proceedings of the 41st International Conference on Machine Learning, pages 34303–34326, 2024\.
- Prabhakaran et al\. \[2021\] Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz\. On releasing annotator\-level labels and information in datasets\. In Claire Bonial and Nianwen Xue, editors, Proceedings of the Joint 15th Linguistic Annotation Workshop \(LAW\) and 3rd Designing Meaning Representations \(DMR\) Workshop, pages 133–138, Punta Cana, Dominican Republic, November 2021\. Association for Computational Linguistics\.
- Rajab et al\. \[2025\] Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessica Ojo, Atnafu Lambebo Tonja, Wilhelmina NdapewaOnyothi Nekoto, et al\. The Esethu framework: Reimagining sustainable dataset governance and curation for low\-resource languages\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 30763–30776, 2025\.
- Rodriguez et al\. \[2021\] Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P\. Lalor, Robin Jain, and Jordan Boyd\-Graber\. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 4489–4504, Online, August 2021\. Association for Computational Linguistics\.
- Şahin \[2022\] Gözde Gül Şahin\. To augment or not to augment? A comparative study on text augmentation techniques for low\-resource NLP\. Computational Linguistics, 48\(1\):5–42, 2022\.
- Sambasivan et al\. \[2021\] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo\. Everyone wants to do the model work, not the data work: Data cascades in high\-stakes AI\. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021\.
- Siminyu et al\. \[2020\] Kathleen Siminyu, Sackey Freshia, Jade Abbott, and Vukosi Marivate\. AI4D–African language dataset challenge\. arXiv preprint arXiv:2007\.11865, 2020\.
- Siminyu et al\. \[2021\] Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I Adelani, Amelia Taylor, et al\. AI4D–African language program\. arXiv preprint arXiv:2104\.02516, 2021\.
- Singh et al\. \[2024a\] Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, and Partha Talukdar\. Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic languages\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 11047–11073, 2024\.
- Singh et al\. \[2024b\] Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F Karlsson, Abinaya Mahendiran, Wei\-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, et al\. Aya dataset: An open\-access collection for multilingual instruction tuning\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 11521–11567, 2024\.
- Sloane et al\. \[2022\] Mona Sloane, Emanuel Moss, Olaitan Awomolo, and Laura Forlano\. Participation is not a design fix for machine learning\. In Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–6, 2022\.
- Snyder \[2019\] Hannah Snyder\. Literature review as a research methodology: An overview and guidelines\. Journal of business research, 104:333–339, 2019\.
- Susanto et al\. \[2025\] Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi\. Sea\-helm: Southeast asian holistic evaluation of language models\. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12308–12336, 2025\.
- Taiuru \[2021\] Karaitiana Taiuru\. Kaitiakitanga māori data sovereignty licences, 2021\.
- team et al\. \[2025\] Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul\-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Saleem, Arina Turkatenko, Albert Ventayol\-Boada, Zheng\-Xin Yong, Yu\-An Chung, Jean Maillard, Rashel Moritz, Alexandre Mourachko, Mary Williamson, and Shireen Yates\. Omnilingual asr: Open\-source multilingual speech recognition for 1600\+ languages\. arXiv preprint arXiv:2511\.09690, 2025\.
- Thu et al\. \[2016\] Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita\. Introducing the asian language treebank \(alt\)\. In Proceedings of the Tenth International Conference on Language Resources and Evaluation \(LREC’16\), pages 1574–1578, 2016\.
- Yu et al\. \[2026\] Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani\. Afriquellm: How data mixing and model architecture impact continued pre\-training for african languages\. arXiv preprint arXiv:2601\.06395, 2026\.
- Zheng et al\. \[2023\] Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\. Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\. In Advances in Neural Information Processing Systems, volume 36, 2023\.