Sample-Size Scaling of the African Languages NLI Evaluation

arXiv cs.CL 06/03/26, 04:00 AM Papers

african-languages natural-language-inference sample-size-scaling low-resource-nlp multilingual xlm-r afrixnli

Summary

This paper examines the effect of labeled data size on natural language inference performance for 16 African languages using the AfriXNLI benchmark. The results show that scaling behavior is language-sensitive and often non-monotonic, challenging the common assumption of monotonic improvement, and emphasizing the need for language-specific dataset creation and stronger multilingual strategies.

arXiv:2606.03219v1 Announce Type: new


# Sample-Size Scaling of the African Languages NLI Evaluation
Source: [https://arxiv.org/html/2606.03219](https://arxiv.org/html/2606.03219)
Anuj Tiwari (1), Oluwapelumi Ogunremu (2), Terry Oko-odion (3), Jesujuwon Egbewale (4), Hannah Nwokocha (5)

(1) Noida Institute of Engineering and Technology; (1, 2, 3, 4, 5) ML Collective

aj11anuj123@gmail.com, ogunremuoluwapelumi@gmail.com, terryokoodion@gmail.com, egbewalejesujuwon7@gmail.com, hannahsopuruchi@gmail.com

###### Abstract

Labeled data for African languages is scarce, and it is unclear whether increasing the amount of annotated data reliably improves downstream performance. We present a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages using the AfriXNLI benchmark. Under controlled conditions, we test two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, at sample sizes between 50 and 500 labeled examples, averaging results across random subsampling runs. Contrary to the common expectation that performance increases monotonically with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, and they motivate language-sensitive dataset creation and stronger multilingual modeling strategies.


## 1 Introduction

Recent advances in NLP have been fuelled by large-scale pretraining and access to large amounts of labeled data. These advances have, however, disproportionately benefited high-resource languages, and many African languages remain underrepresented in training and evaluation benchmarks. A key problem in multilingual and low-resource NLP is therefore to understand how model performance relates to the amount of labeled data available. One of the most widely held assumptions in machine learning is that more labeled data yields better downstream performance. Although this assumption usually holds in high-resource settings, it has not been carefully studied for low-resource languages, especially African languages with their diverse typological and morphological characteristics. In practice, annotation is expensive, and expanding a dataset without clear evidence of benefit can be inefficient or even counterproductive.

This paper analyzes how natural language inference (NLI) performance behaves with respect to the amount of labeled data for African languages, using the AfriXNLI benchmark. Rather than proposing new models or datasets, our objective is to empirically characterize scaling behavior, performance stability, and variance across languages and models under controlled experimental conditions. In particular, we address the following research questions:

- RQ1: Does a larger amount of labeled data improve NLI performance on the African languages in AfriXNLI?
- RQ2: How does scaling behavior vary across languages and models?
- RQ3: To what extent are observed trends stable under random subsampling?

By answering these questions, we aim to offer empirical recommendations for dataset construction and evaluation practice in African NLP, and to calibrate expectations about data scaling in low-resource semantic reasoning tasks.

## 2 Related Work

Multilingual and African NLP Benchmarks. Recent work has advanced African-language NLP by extending existing benchmarks and developing new evaluation resources. AfriXNLI is a human-translated version of the XNLI benchmark covering a range of African languages, allowing natural language inference to be evaluated in low-resource conditions in a unified way (Community, [2024](https://arxiv.org/html/2606.03219#bib.bib1)).

MasakhaNER provides a large-scale named entity recognition dataset for ten African languages and shows that community-driven dataset construction is effective for African NLP (Adelani et al., [2021](https://arxiv.org/html/2606.03219#bib.bib2)). AfroLID presents a neural language identification toolkit spanning 517 African languages, greatly increasing language coverage compared to previous multilingual systems (Adebara et al., [2022](https://arxiv.org/html/2606.03219#bib.bib3)). Together, these efforts reflect sustained progress towards establishing evaluation resources for African languages in multilingual NLP.

Scaling Data Laws and Efficiency. In high-resource settings, language models exhibit predictable scaling behavior. Kaplan et al. ([2020](https://arxiv.org/html/2606.03219#bib.bib4)) demonstrate that language modeling loss decreases according to power-law relationships with both model size and dataset size. Hoffmann et al. ([2022](https://arxiv.org/html/2606.03219#bib.bib7)) further show that compute-optimal training requires proportionally more data, as the Chinchilla model outperforms much larger models trained on less data. Muennighoff et al. ([2023](https://arxiv.org/html/2606.03219#bib.bib8)), however, find that performance improvements diminish quickly in data-constrained settings, where additional compute or repeated data yields only small gains. These results cast doubt on whether classical scaling laws can be directly applied to low-resource and multilingual settings.

Data Scaling in Low-Resource NLP. Eiselen and Gaustad ([2023](https://arxiv.org/html/2606.03219#bib.bib9)) investigate the impact of training data size on performance for African languages, focusing on the morphologically diverse languages of South Africa. They show that although small datasets can yield useful models, languages with complex, conjunctive morphology need considerably more data to reach similar performance. This work highlights the importance of linguistic structure for data efficiency. However, they only tested embedding-based models on tasks such as part-of-speech tagging, leaving open the question of how data scaling behaves for semantic reasoning tasks and contemporary fine-tuned pretrained language models.

African Languages: Scarcity of Data and Benchmarking. Hussen et al. ([2025](https://arxiv.org/html/2606.03219#bib.bib10)) report that only a tiny share of Africa's 2000+ languages are covered by modern large language models, and that African languages are far under-represented relative to their share of the world's languages. Adebara and Abdul-Mageed ([2022](https://arxiv.org/html/2606.03219#bib.bib11)) attribute this scarcity to the fact that African languages are structurally unsupported by current large language model development and are significantly underrepresented relative to their global distribution. Recent benchmark projects such as AfroBench (Ojo et al., [2025](https://arxiv.org/html/2606.03219#bib.bib12)) and IrokoBench (Adelani et al., [2025](https://arxiv.org/html/2606.03219#bib.bib13)) extend evaluation across African languages and task categories, including reasoning and natural language understanding. Even with this extended coverage, these benchmarks consistently show significant performance gaps between African and high-resource languages, pointing to persistent challenges in modeling and evaluation.

Multilingual Representation Models. Multilingual encoders such as mBERT (Devlin et al., [2019](https://arxiv.org/html/2606.03219#bib.bib14)) and XLM-R (Conneau et al., [2020](https://arxiv.org/html/2606.03219#bib.bib15)), which are pretrained on many languages, are commonly used as multilingual NLP baselines. Conneau et al. ([2020](https://arxiv.org/html/2606.03219#bib.bib15)) demonstrate that multilingual pretraining significantly improves the cross-lingual performance of XLM-R, especially on low-resource languages. Although these models are effective in zero-shot transfer, their response to incrementally scaling the data of a single African language has not been thoroughly explored. Our study complements this literature with an empirical examination of sample-size scaling behavior for African-language NLI.

## 3 Experimental Setup

### 3.1 Dataset

We use the AfriXNLI benchmark, which consists of NLI sentence pairs translated into various African languages. Our experiments cover 16 languages from AfriXNLI, spanning a variety of language families, scripts, and typological properties. All evaluations are performed on the test splits. To simulate different labeled-data regimes, we vary the number of test examples evaluated by random subsampling; the model parameters are never updated.

### 3.2 Models

To compare pretraining strategies, we assess two multilingual transformer models with similar architectures and about 0.6 billion parameters each. The first model, XLM-R Large fine-tuned on XNLI (Davison, [2020](https://arxiv.org/html/2606.03219#bib.bib5)), is a strong task-aligned multilingual baseline built on XLM-R (Conneau et al., [2020](https://arxiv.org/html/2606.03219#bib.bib15)). The second model, AfroXLM-R Large (Alabi et al., [2022](https://arxiv.org/html/2606.03219#bib.bib6)), is an Africa-centric adaptation of XLM-R trained with a stronger focus on African-language data.

By choosing models of similar scale and architecture, we isolate the influence of pretraining data composition and language coverage while reducing the impact of model size.

### 3.3 Evaluation

For each language–model pair, we evaluate performance at sample sizes ranging from 50 to 500 examples. To control for variance due to data selection, we perform several random subsampling runs at each sample size and compute the mean and standard deviation across runs.

Accuracy is our primary metric; we also report precision and F1-score. This evaluation design allows us to distinguish systematic scaling effects from those caused by sampling.
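The subsampling protocol described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `subsample_accuracy`, the default of 10 runs, the seed, and the toy labels and predictions are all assumptions, since the paper only states that "several" random subsampling runs were used.

```python
import random
import statistics


def subsample_accuracy(labels, preds, sample_size, n_runs=10, seed=0):
    """Mean and standard deviation of accuracy over random evaluation subsamples.

    n_runs and seed are illustrative assumptions, not values from the paper.
    """
    if sample_size > len(labels):
        raise ValueError("sample_size exceeds the number of test examples")
    rng = random.Random(seed)
    indices = range(len(labels))
    accuracies = []
    for _ in range(n_runs):
        # Draw an evaluation subset without replacement; the model is never updated.
        chosen = rng.sample(indices, sample_size)
        correct = sum(labels[i] == preds[i] for i in chosen)
        accuracies.append(correct / sample_size)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Toy three-class data (hypothetical): every fifth prediction is wrong.
labels = [i % 3 for i in range(600)]
preds = [lab if i % 5 else (lab + 1) % 3 for i, lab in enumerate(labels)]

for n in (50, 100, 300, 500):
    mean_acc, std_acc = subsample_accuracy(labels, preds, n)
    print(f"n={n}: accuracy {mean_acc:.3f} ± {std_acc:.3f}")
```

In a real run, `preds` would hold the fixed predictions of one model on one language's test split, and the same loop would be repeated for each language–model pair.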

## 4 Results

### 4.1 Evaluation Variance across Sample Sizes

We first analyze how evaluation variance changes with sample size. Figure [1](https://arxiv.org/html/2606.03219#S4.F1) reports the standard deviation of accuracy, aggregated over all languages and models. Variance is highest for the smallest samples (50-100 examples) and decreases sharply as the sample size grows, levelling off at around 300 samples.

This trend shows that small evaluation sets produce very unstable performance estimates that depend heavily on the specific examples sampled. As sample size increases, variance decreases, meaning that larger evaluation sets give more reliable estimates of true model performance.
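One way to see why the spread shrinks with sample size is the textbook standard error of a proportion, sqrt(p(1-p)/n). The sketch below is only an idealized reference point: it assumes independent draws and a fixed underlying accuracy `p`, and the value 0.6 is a hypothetical accuracy, not a number from the paper.

```python
import math


def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate from n independent examples.

    Assumes each example is correct with the same probability p (an idealization).
    """
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical underlying accuracy of 0.6 at the sample sizes studied.
for n in (50, 100, 300, 500):
    print(f"n={n}: standard error ≈ {accuracy_standard_error(0.6, n):.4f}")
```

Under these assumptions the standard error at 50 examples is roughly three times that at 500, which is in line with the qualitative pattern reported in Figure 1.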

![Refer to caption](https://arxiv.org/html/2606.03219v1/Fig1.png)

Figure 1: Evaluation variance (standard deviation of accuracy) as a function of sample size, aggregated across all languages and models. Variance decreases rapidly with sample size, indicating that low-resource evaluation regimes are unstable.

Figure [2](https://arxiv.org/html/2606.03219#S4.F2) breaks this effect down by model. Both XLM-R Large and AfroXLM-R Large follow the same qualitative trend: high variance at small sample sizes, followed by a sharp decline as the sample size increases. Although the absolute variance levels differ slightly, the overall trend holds for both models, which implies that evaluation instability is not specific to a particular pretraining strategy.

![Refer to caption](https://arxiv.org/html/2606.03219v1/Fig2.png)

Figure 2: Comparison of evaluation variance with sample size, shown separately for XLM-R Large and AfroXLM-R Large. Both models show high variance in low-resource regimes and stabilize with larger evaluation sets.

### 4.2 Global Scaling Trends across Languages and Models

Although variance decreases as sample size increases, accuracy does not necessarily increase monotonically. Figure [3](https://arxiv.org/html/2606.03219#S4.F3) provides a heatmap of scaling slopes for every language-model pair, indicating whether performance improves, stays constant, or declines with increasing sample size.

The heatmap shows significant heterogeneity across languages. Some slopes are near zero or slightly positive, indicating weak gains or early saturation, while others are negative, indicating systematic degradation in performance as evaluation sets become larger and more representative. These patterns appear in both models, indicating that scaling behavior is highly language-specific rather than model-driven.
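The paper does not state how the scaling slopes in the heatmap are computed. A plausible reading, sketched below as an assumption, is an ordinary least-squares fit of mean accuracy against sample size for each language-model pair. The function name `scaling_slope` and both toy curves are hypothetical.

```python
def scaling_slope(sample_sizes, mean_accuracies):
    """Ordinary least-squares slope of mean accuracy against sample size."""
    n = len(sample_sizes)
    if n < 2 or n != len(mean_accuracies):
        raise ValueError("need two or more points and equal-length inputs")
    mean_x = sum(sample_sizes) / n
    mean_y = sum(mean_accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in sample_sizes)
    if sxx == 0:
        raise ValueError("sample sizes must not all be identical")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(sample_sizes, mean_accuracies)
    )
    return sxy / sxx


# Hypothetical accuracy curves, not taken from the paper's tables.
sizes = [50, 100, 200, 300, 400, 500]
declining = [0.62, 0.60, 0.57, 0.56, 0.55, 0.55]
rising = [0.50, 0.51, 0.52, 0.52, 0.53, 0.53]

print("declining curve slope:", scaling_slope(sizes, declining))
print("rising curve slope:", scaling_slope(sizes, rising))
```

A negative slope corresponds to the declining pattern described above, while a slope near zero corresponds to early saturation. Because accuracy changes are small relative to the range of sample sizes, the raw slopes are tiny and may be easier to read when rescaled, for example per 100 examples.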

![Refer to caption](https://arxiv.org/html/2606.03219v1/Fig3.png)

Figure 3: Heatmap of accuracy scaling slopes for each language-model pair. Positive slopes mean that performance improves as sample size increases; negative slopes mean that performance deteriorates.

### 4.3 Language-Dependent Scaling Behavior: Yoruba vs Kinyarwanda

To illustrate these trends concretely, we analyze scaling behavior for Yoruba and Kinyarwanda under each model\. Figure[4](https://arxiv.org/html/2606.03219#S4.F4)shows results for XLM\-R Large\. Yoruba exhibits pronounced small\-sample optimism, with relatively high accuracy at 50 examples followed by a consistent decline as sample size increases\. This monotonic degradation suggests that small evaluation subsets overestimate performance, masking systematic errors that emerge with broader coverage\.

Conversely, Kinyarwanda shows a slight rise in performance up to around 150 examples, after which accuracy decreases slightly and stabilizes. At larger sample sizes, variance collapses, which suggests that the actual performance level cannot be reliably measured from the smaller subsets.

![Refer to caption](https://arxiv.org/html/2606.03219v1/Fig4.png)

Figure 4: Evaluation scaling behavior for Yoruba and Kinyarwanda with XLM-R Large. Yoruba deteriorates monotonically as sample size increases, whereas Kinyarwanda shows initial improvement followed by saturation.

Figure [5](https://arxiv.org/html/2606.03219#S4.F5) makes the same comparison for AfroXLM-R Large. The qualitative trends are similar: Yoruba again shows decreasing accuracy with growing sample size, while Kinyarwanda shows initial gains followed by stabilization. That these trends persist across models supports the conclusion that scaling behavior is more language-specific than model-specific.

![Refer to caption](https://arxiv.org/html/2606.03219v1/Fig5.png)

Figure 5: Evaluation scaling behavior for Yoruba and Kinyarwanda with AfroXLM-R Large. The language-specific non-monotonic trends persist across models.

## 5 Discussion

Our findings dispute the widely held belief that more data is uniformly better for African-language natural language inference. Across languages, we observe non-monotonic scaling of evaluation accuracy, with factors other than data quantity, such as label distribution, translation ambiguity, and the representativeness of evaluation subsets, contributing significantly.

Notably, our conclusions are drawn from AfriXNLI only and should not be assumed to generalize to all African languages. However, the consistency of trends across random subsampling runs and across two different multilingual models suggests that the behavior is systematic rather than incidental.

#### Evaluation bias and small-sample optimism

One of the main conclusions of this study is that small-sample evaluation is systematically biased towards overestimating model performance. Figures 1 and 2 indicate that at sample sizes below 200, evaluation variance is high, resulting in unstable and optimistically biased accuracy estimates. As sample size increases, both variance and accuracy decline; this is not due to decreasing model quality but to harder instances, neutral cases, and translation ambiguities that larger evaluation sets expose and that are underrepresented in small samples. This point is crucial: we are examining the reliability of evaluation, not learning curves, and the apparent decline in performance reflects reduced estimation bias rather than changes in model behavior.

#### Language-specific non-monotonic scaling

The scaling slope heatmap in Figure 3 shows substantial heterogeneity between languages. In some languages the slopes are weakly positive or near zero, while in others they are negative, meaning that measured performance deteriorates as evaluation sets grow. These patterns are present in both models, which indicates that scaling behavior is more language-dependent than model-dependent. The Yoruba and Kinyarwanda case studies in Figures 4 and 5 illustrate this contrast well: Yoruba shows strong small-sample optimism and declines monotonically, whereas Kinyarwanda shows small initial gains and then plateaus. These differences indicate that a single fixed evaluation set size, applied uniformly across languages, may give misleading results.

#### Pretraining strategies and model effects

Even though AfroXLM-R Large achieves significantly better accuracy and lower variance at very small sample sizes than XLM-R Large fine-tuned on XNLI, both models show similar qualitative scaling behavior across languages. Africa-centric pretraining improves initial stability but does not eliminate non-monotonic scaling or language-specific evaluation bias. This implies that pretraining data composition alone does not account for the heterogeneity of African-language NLI evaluation, and that the choice of model can influence evaluation stability as well as absolute performance.

#### Benchmarking and evaluation implications

These findings have direct methodological implications. Single accuracy scores on small test sets can significantly exaggerate model ability under low-resource conditions. Larger evaluation sets reduce variance and bias but can reveal lower real performance, which makes comparisons across studies more difficult. We therefore suggest that African-language benchmarks report variance across subsamples, avoid over-reliance on small held-out sets, and consider language-specific evaluation sizes instead of fixed-size test sets.

Overall, we find that data quantity alone is not sufficient to ensure reliable evaluation for African NLI. Rather, meaningful benchmarking in low-resource multilingual settings requires representative sampling, careful dataset construction, and stability analysis.

#### Evaluation stability and saturation

To measure evaluation stability, we estimate a saturation point for each language-model combination: the smallest evaluation size beyond which the mean accuracy varies by at most $\pm 0.5\%$. We denote this sample size, at which additional evaluation data no longer produces significant changes in measured performance, by $n^{*}$:

$$n^{*}=\min\left\{n \;\middle|\; \max_{m>n}\left|A(m)-A(n)\right|\leq\epsilon\right\},$$

where $A(n)$ is the mean accuracy at sample size $n$ and $\epsilon$ is the tolerance (here 0.5%). If no such $n\leq 500$ exists, we report the saturation point as $>500$. Note that we are only evaluating, not training: saturation here refers to the amount of evaluation data needed for the estimated performance to stabilize.
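The saturation criterion above can be sketched in a few lines. Two details are assumptions rather than statements from the paper: the tolerance is interpreted as 0.005 on a 0-1 accuracy scale, and the largest evaluated size is not treated as a candidate, because it has no larger size to compare against and would otherwise satisfy the condition trivially. The function name and both toy curves are hypothetical.

```python
def saturation_point(sample_sizes, mean_accuracies, epsilon=0.005):
    """Smallest n whose accuracy stays within epsilon of every larger size.

    Returns None when no size qualifies, which corresponds to ">500" in the text.
    The largest size is skipped as a candidate (an assumption of this sketch).
    """
    pairs = sorted(zip(sample_sizes, mean_accuracies))
    for i, (n, acc_n) in enumerate(pairs[:-1]):
        later = pairs[i + 1:]
        if all(abs(acc_m - acc_n) <= epsilon for _, acc_m in later):
            return n
    return None


# Hypothetical mean-accuracy curves at sizes 50, 100, ..., 500.
sizes = list(range(50, 501, 50))
curve = [0.62, 0.58, 0.56, 0.553, 0.551, 0.55, 0.552, 0.549, 0.55, 0.551]
drifting = [0.60 - 0.01 * i for i in range(10)]

print(saturation_point(sizes, curve))     # stabilizes partway through
print(saturation_point(sizes, drifting))  # keeps drifting, so no saturation
```

For the first toy curve the function returns 200, since every later mean stays within 0.005 of the value at 200 examples; for the steadily declining curve it returns `None`.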

Table [1](https://arxiv.org/html/2606.03219#S5.T1) summarizes these saturation points for all languages and models. We find large disparities between languages. For some languages (e.g., Shona, Yoruba, Xhosa), as few as 200 or 250 evaluation samples give stable performance estimates, whereas others (e.g., Swahili, Kinyarwanda, Oromo) require 400 or 450 samples. Notably, Wolof does not reach saturation even at 500 samples with XLM-R Large, which implies that its performance estimates remain unstable.

These variations are largely consistent across models, indicating that language- and dataset-specific factors, rather than model architecture, are the primary drivers of saturation behavior. Overall, these findings indicate that fixed-size evaluation benchmarks can significantly misestimate model performance for African languages, and that the volume of evaluation data required for reliable estimates can differ substantially across languages.

Table 1: Estimated saturation points, i.e. the sample size at which the mean score levels off within ±0.5%, for each language and model. An entry of ">500" means that no saturation was reached within 500 samples.

## 6 Limitations

The limitations of our study are as follows:

- Dataset scope: All experiments are conducted on AfriXNLI; thus, observed trends may reflect dataset-specific properties such as translation artifacts or label distribution biases.
- Evaluation vs. learning: We measure evaluation behavior rather than learning dynamics. Models are not fine-tuned on successively larger training sets, so results reflect the stability and bias of performance estimates, not improvement in performance with more training data.
- Model scale: Experiments are restricted to two multilingual models with about 0.6B parameters. Interactions between data scaling and model scale are not investigated.

Despite these limitations, the consistency of trends across languages, models, and random subsampling runs suggests that we have captured systematic aspects of evaluation reliability for African NLI.

## 7 Future Work

This analysis can be extended in several ways in future work:

- Broader tasks and datasets: The generalizability of the observed evaluation scaling behavior should be tested by extending the study to other African NLP benchmark tasks, such as sentiment analysis or named entity recognition.
- Linguistic and dataset effects: Adding linguistic metadata, label distributions, and tokenization statistics could help explain language-specific saturation and non-monotonic scaling.
- Learning dynamics: Studying scaling under fine-tuning, rather than evaluation only, would clarify how additional labeled data affects actual model learning for African languages.

## 8 Conclusion

We provide a detailed study of sample-size scaling behavior on the AfriXNLI benchmark for African languages. Our results, based on controlled evaluation in 16 languages with two multilingual models and repeated random subsampling runs, demonstrate that increasing evaluation data does not bring uniform or monotonic benefits. Rather, scaling behavior is highly language-specific, often non-monotonic, and strongly influenced by evaluation variance in low-resource regimes.

We show that performance estimates from small evaluation subsets are often optimistically biased, whereas estimates from larger subsets reveal latent difficulty and are more stable. This highlights the distinction between evaluation reliability and model learning as a key issue in African NLP.

Based on our results, we recommend: (i) avoiding single-point evaluation on very small test sets and instead reporting the mean ± standard deviation over multiple subsamples, (ii) supplementing aggregate measures with per-class measures, and (iii) using at least 300 evaluation samples by default and treating results below this scale as high-noise measurements. More broadly, our paper cautions against the naive assumption that more data alone guarantees dependable evaluation, and calls for evaluation-aware practices in benchmarking African languages.
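Recommendation (ii) can be illustrated with a small per-class report. This is a sketch only: the label encoding (0 = entailment, 1 = neutral, 2 = contradiction) follows the common XNLI convention and is an assumption here, and the function name and toy data are hypothetical.

```python
def per_class_scores(labels, preds, classes=(0, 1, 2)):
    """Precision, recall and F1 for each class, computed one-vs-rest."""
    scores = {}
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example; assumed encoding 0 = entailment, 1 = neutral, 2 = contradiction.
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]

for cls, s in per_class_scores(toy_labels, toy_preds).items():
    print(cls, {k: round(v, 3) for k, v in s.items()})
```

Reporting these values per subsample, alongside the mean ± standard deviation of accuracy, would show whether a drop in aggregate accuracy is concentrated in one class, such as neutral cases.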

## References

- I. Adebara and M. Abdul-Mageed (2022). Towards afrocentric NLP for African languages: where we are and where we can go. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3814–3841. [Link](https://aclanthology.org/2022.acl-long.265/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.265).
- I. Adebara, A. Elmadany, M. Abdul-Mageed, and A. Inciarte (2022). AfroLID: a neural language identification tool for African languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1958–1981. [Link](https://aclanthology.org/2022.emnlp-main.128/), [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.128).
- D. I. Adelani, J. Abbott, G. Neubig, et al. (2021). MasakhaNER: named entity recognition for African languages. Transactions of the Association for Computational Linguistics, pp. 1116–1131. [Link](https://aclanthology.org/2021.tacl-1.66/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00416).
- D. I. Adelani, J. Ojo, I. A. Azime, et al. (2025). IrokoBench: a new benchmark for African languages in the age of large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2732–2757. [Link](https://aclanthology.org/2025.naacl-long.139/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.139).
- J. O. Alabi, D. I. Adelani, M. Mosbach, et al. (2022). Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics. [Link](https://aclanthology.org/2022.coling-1.382).
- M. N. Community (2024). AfriXNLI: dataset. [Link](https://huggingface.co/datasets/masakhane/afrixnli).
- A. Conneau, K. Khandelwal, N. Goyal, et al. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451.
- J. Davison (2020). XLM-roberta large fine-tuned on xnli. Hugging Face model. [Link](https://huggingface.co/joeddav/xlm-roberta-large-xnli).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. [Link](https://aclanthology.org/N19-1423/), [Document](https://dx.doi.org/10.18653/v1/N19-1423).
- R. Eiselen and T. Gaustad (2023). Deep learning and low-resource languages: how much data is enough? A case study of three linguistically distinct South African languages. In Proceedings of the Fourth Workshop on Resources for African Indigenous Languages (RAIL 2023), pp. 42–53. [Link](https://aclanthology.org/2023.rail-1.6/), [Document](https://dx.doi.org/10.18653/v1/2023.rail-1.6).
- J. Hoffmann, S. Borgeaud, A. Mensch, et al. (2022). Training compute-optimal large language models. In Proceedings of NeurIPS. [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf).
- K. Y. Hussen, W. T. Sewunetie, A. A. Ayele, S. H. Imam, S. H. Muhammad, and S. M. Yimam (2025). The state of large language models for African languages: progress and challenges. ArXiv abs/2506.02280. [Link](https://api.semanticscholar.org/CorpusID:279119848).
- J. Kaplan, S. McCandlish, T. J. Henighan, et al. (2020). Scaling laws for neural language models. ArXiv abs/2001.08361. [Link](https://api.semanticscholar.org/CorpusID:210861095).
- N. Muennighoff, S. Candido, J. Vamalké, et al. (2023). Scaling data-constrained language models. https://sl1nk.com/wyS64.
- J. Ojo, O. Ogundepo, A. Oladipo, et al. (2025). AfroBench: how good are large language models on African languages? In Findings of the Association for Computational Linguistics: ACL 2025, pp. 19048–19095. [Link](https://aclanthology.org/2025.findings-acl.976/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.976).

## Appendix A: Full Results for Models

Tables 2 to 33 report values as mean ± standard deviation\.

- Table 2: Evaluation of Swahili with xlm\-roberta\-large\-xnli\.
- Table 3: Evaluation of Lingala with xlm\-roberta\-large\-xnli\.
- Table 4: Evaluation of Igbo with xlm\-roberta\-large\-xnli\.
- Table 5: Evaluation of Hausa with xlm\-roberta\-large\-xnli\.
- Table 6: Evaluation of Yoruba with xlm\-roberta\-large\-xnli\.
- Table 7: Evaluation of Kinyarwanda with xlm\-roberta\-large\-xnli\.
- Table 8: Evaluation of Zulu with xlm\-roberta\-large\-xnli\.
- Table 9: Evaluation of Amharic with xlm\-roberta\-large\-xnli\.
- Table 10: Evaluation of Southern Sotho with xlm\-roberta\-large\-xnli\.
- Table 11: Evaluation of Oromo with xlm\-roberta\-large\-xnli\.
- Table 12: Evaluation of Twi with xlm\-roberta\-large\-xnli\.
- Table 13: Evaluation of Shona with xlm\-roberta\-large\-xnli\.
- Table 14: Evaluation of Xhosa with xlm\-roberta\-large\-xnli\.
- Table 15: Evaluation of Wolof with xlm\-roberta\-large\-xnli\.
- Table 16: Evaluation of Luganda with xlm\-roberta\-large\-xnli\.
- Table 17: Evaluation of Ewe with xlm\-roberta\-large\-xnli\.
- Table 18: Evaluation of Swahili with afro\-xlmr\-large\.
- Table 19: Evaluation of Lingala with afro\-xlmr\-large\.
- Table 20: Evaluation of Igbo with afro\-xlmr\-large\.
- Table 21: Evaluation of Hausa with afro\-xlmr\-large\.
- Table 22: Evaluation of Zulu with afro\-xlmr\-large\.
- Table 23: Evaluation of Yoruba with afro\-xlmr\-large\.
- Table 24: Evaluation of Kinyarwanda with afro\-xlmr\-large\.
- Table 25: Evaluation of Amharic with afro\-xlmr\-large\.
- Table 26: Evaluation of Southern Sotho with afro\-xlmr\-large\.
- Table 27: Evaluation of Oromo with afro\-xlmr\-large\.
- Table 28: Evaluation of Twi with afro\-xlmr\-large\.
- Table 29: Evaluation of Shona with afro\-xlmr\-large\.
- Table 30: Evaluation of Xhosa with afro\-xlmr\-large\.
- Table 31: Evaluation of Wolof with afro\-xlmr\-large\.
- Table 32: Evaluation of Luganda with afro\-xlmr\-large\.
- Table 33: Evaluation of Ewe with afro\-xlmr\-large\.
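The tables in this appendix report each measure as a mean ± standard deviation over random subsampling runs. As a rough illustration of how such a figure can be computed, the sketch below draws several random subsamples of a fixed size from a labeled evaluation set and summarizes the per-run accuracy. The function names, the number of runs, the seed, the use of accuracy as the measure, and the choice of the sample (rather than population) standard deviation are all assumptions made for this example; the paper does not specify them here, and the data is a toy stand-in.

```python
import random
import statistics


def subsample_scores(labels, predictions, sample_size, n_runs=5, seed=0):
    """Return accuracy on `n_runs` random subsamples of `sample_size` items.

    Hypothetical helper: n_runs, seed, and accuracy are assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    scores = []
    for _ in range(n_runs):
        # Sample without replacement within a single run.
        chosen = rng.sample(indices, sample_size)
        correct = sum(labels[i] == predictions[i] for i in chosen)
        scores.append(correct / sample_size)
    return scores


def mean_pm_std(scores):
    """Return (mean, sample standard deviation) of the per-run scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two values; report 0.0 for a single run.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Toy three-class data: every fourth prediction is wrong.
labels = [i % 3 for i in range(600)]
predictions = [(i % 3) if i % 4 else (i + 1) % 3 for i in range(600)]

for n in (50, 100, 500):
    mean, std = mean_pm_std(subsample_scores(labels, predictions, n))
    print(f"n={n}: {mean:.3f} ± {std:.3f}")
```

With real model outputs, `labels` and `predictions` would hold the gold and predicted NLI classes for one language, and the loop would cover the sample sizes of interest.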
