No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs
Summary
Researchers from National Taiwan University propose replacing fixed translation-based prompting strategies in multilingual LLMs with lightweight learned classifiers that route each instance to either native or translation-based prompting. Their analysis across 10 languages and 4 benchmarks shows no single strategy is universally optimal, with translation benefiting low-resource languages most, and the learned routing achieving statistically significant improvements over fixed strategies.
# No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs
Source: [https://arxiv.org/html/2604.16937](https://arxiv.org/html/2604.16937)
Wei-Chi Wu^α, Sheng-Lun Wei^α, Hen-Hsen Huang^β, Hsin-Hsi Chen^α,γ
^α Department of Computer Science and Information Engineering, National Taiwan University, Taiwan; ^β Institute of Information Science, Academia Sinica, Taiwan; ^γ AI Research Center (AINTU), National Taiwan University, Taiwan
wcwu@csie.ntu.edu.tw, weisl@nlg.csie.ntu.edu.tw, hhhuang@iis.sinica.edu.tw, hhchen@ntu.edu.tw
###### Abstract
Translation-based prompting is widely used in multilingual LLMs, yet its effectiveness varies across languages and tasks. We evaluate prompting strategies across ten languages of different resource levels and four benchmarks. Our analysis shows that no single strategy is universally optimal: translation strongly benefits low-resource languages even when translation quality is imperfect, high-resource languages gain little, and prompt-based self-routing underperforms explicit translation. Motivated by these findings, we formulate prompting strategy selection as a learned decision problem and introduce lightweight classifiers that predict whether native or translation-based prompting is optimal for each instance. The classifiers achieve statistically significant improvements over fixed strategies across four benchmarks and generalize to unseen task formats not observed during training. Further analysis reveals that language resource level, rather than translation quality alone, determines when translation is beneficial.
## 1 Introduction
Translation-based prompting, which translates inputs into English prior to inference, is a widely used strategy for multilingual large language models (LLMs) and often improves performance by leveraging stronger English-centric capabilities (Ghosh et al., [2025](https://arxiv.org/html/2604.16937#bib.bib21)). However, recent studies show that this advantage is not universal. Native-language prompting can outperform translation-based approaches on culturally grounded tasks (Tam et al., [2025](https://arxiv.org/html/2604.16937#bib.bib38); Nyandwi et al., [2025](https://arxiv.org/html/2604.16937#bib.bib22)) and for models with reduced English bias (Liu et al., [2025](https://arxiv.org/html/2604.16937#bib.bib20)). These findings challenge the assumption that translation into English is always beneficial, raising a fundamental question: when should translation be applied, and when is native-language prompting preferable?
Prior work has largely focused on improving individual prompting paradigms rather than understanding or selecting between them. Methods such as QAlign (Zhu et al., [2024](https://arxiv.org/html/2604.16937#bib.bib15)) and mCoT (Lai and Nissim, [2024](https://arxiv.org/html/2604.16937#bib.bib16)) enhance translation-based prompting, while Strategic CoT (Wang et al., [2024](https://arxiv.org/html/2604.16937#bib.bib32)) improves native-language reasoning. However, these approaches implicitly assume a fixed prompting strategy and do not treat prompting strategy selection as a decision problem conditioned on the language and task pair.
This gap motivates three research questions. (RQ1) Does one prompting strategy fit all languages and tasks? Through a systematic comparison across diverse languages and tasks, we find that no single strategy consistently dominates. Translation-based prompting benefits low-resource languages but often provides limited or no gains for high-resource languages, while prompt-based self-routing yields only marginal improvements and underperforms explicit translation. (RQ2) Can prompting strategy selection be learned? We formulate strategy selection as a learned decision problem and introduce a lightweight classifier that predicts whether native-language or translation-based prompting, the two most representative strategies, is more effective for a given language and task pair. The resulting classifier consistently outperforms fixed strategies across models and task formats. (RQ3) Why does translation primarily benefit low-resource languages? We show that translation effectiveness is driven more by language resource level than by translation quality alone, with the learned selector favoring translation for low-resource languages even when translation quality is imperfect. In summary, our contributions are threefold: 1) we present a systematic empirical study demonstrating that no single prompting strategy fits all languages and tasks; 2) we introduce a decision-oriented framework for learned prompting strategy selection; 3) we provide an analysis uncovering the central role of language resource level in determining when translation-based prompting is beneficial.
Table 1: Accuracy (%) of six prompting strategies across ten languages on Global-MMLU using Llama-3.3-70B. Languages are grouped by resource level into high (ZH, ES, DE, HI), mid (BN, ID, KO), and low (SI, SW, YO).
## 2 Related Work
#### Translation-Based Prompting.
English chain-of-thought reasoning often outperforms native-language approaches due to English dominance in pretraining (Li et al., [2024](https://arxiv.org/html/2604.16937#bib.bib39); Kowtal et al., [2024](https://arxiv.org/html/2604.16937#bib.bib14)). Recent methods mitigate the gap through question alignment (Zhu et al., [2024](https://arxiv.org/html/2604.16937#bib.bib15)), multilingual CoT reasoning (Lai and Nissim, [2024](https://arxiv.org/html/2604.16937#bib.bib16)), and instruction tuning with a small multilingual set (Shaham et al., [2024](https://arxiv.org/html/2604.16937#bib.bib18)). Translation effectiveness correlates positively with quality, as low-quality translation can harm performance (Liu et al., [2025](https://arxiv.org/html/2604.16937#bib.bib20)).
#### Limitations and Alternatives.
Translation fails for culturally grounded tasks (Tam et al., [2025](https://arxiv.org/html/2604.16937#bib.bib38); Nyandwi et al., [2025](https://arxiv.org/html/2604.16937#bib.bib22)), models with reduced English bias (Liu et al., [2025](https://arxiv.org/html/2604.16937#bib.bib20)), and certain task structures (Huang et al., [2023](https://arxiv.org/html/2604.16937#bib.bib24); Intrator et al., [2024](https://arxiv.org/html/2604.16937#bib.bib19)). Alternatives include Strategic CoT (Wang et al., [2024](https://arxiv.org/html/2604.16937#bib.bib32)) and Selective Translation (Kowtal et al., [2024](https://arxiv.org/html/2604.16937#bib.bib14); Mondshine et al., [2025](https://arxiv.org/html/2604.16937#bib.bib2); Paul et al., [2025](https://arxiv.org/html/2604.16937#bib.bib41)). We instead learn to select between strategies, revealing that language resource level and response features, not translation quality alone, determine optimality.
## 3 Experimental Setup
#### Datasets and Languages.
We primarily evaluate on Global-MMLU (Singh et al., [2025](https://arxiv.org/html/2604.16937#bib.bib1)), grouping languages by resource level into high (Chinese/ZH, Spanish/ES, German/DE, Hindi/HI), mid (Bengali/BN, Indonesian/ID, Korean/KO), and low (Sinhala/SI, Swahili/SW, Yoruba/YO). For strategy selection, we use a 10% training split with balanced language coverage and evaluate on the remaining 90%. Generalization is assessed on MMLU-ProX (Xuan et al., [2025](https://arxiv.org/html/2604.16937#bib.bib3)) and on out-of-domain benchmarks with different task formats: XQuAD (Artetxe et al., [2020](https://arxiv.org/html/2604.16937#bib.bib34)), mCSQA (Sakai et al., [2024](https://arxiv.org/html/2604.16937#bib.bib5)), and XCOPA (Ponti et al., [2020](https://arxiv.org/html/2604.16937#bib.bib4)).
#### Prompting Strategies.
We compare zero-shot native and translation-based prompting strategies, including Native, Translate, Sel-Trans (Mondshine et al., [2025](https://arxiv.org/html/2604.16937#bib.bib2)), Strategic CoT in native and English (Wang et al., [2024](https://arxiv.org/html/2604.16937#bib.bib32)), and Prompt-Routing. Prompt templates and details are provided in Appendix [A.2](https://arxiv.org/html/2604.16937#A1.SS2).
#### Models.
Experiments are conducted using DeepSeek-v3.1 (DeepSeek-AI, [2024](https://arxiv.org/html/2604.16937#bib.bib33)), with additional strategy selection experiments on Llama-3.3-70B-Instruct (AI@Meta, [2024](https://arxiv.org/html/2604.16937#bib.bib36)). All models are used in zero-shot inference.
#### Learned Strategy Selection.
We formulate strategy selection as a binary decision between Native and Translate. Training labels are assigned when exactly one strategy answers correctly; ambiguous cases are discarded. We train lightweight classifiers (XGBoost, Chen and Guestrin, [2016](https://arxiv.org/html/2604.16937#bib.bib35); MLP, Haykin, [1994](https://arxiv.org/html/2604.16937#bib.bib44)) using features capturing differences between native and translated inputs and responses. Details are in Appendix [B.1](https://arxiv.org/html/2604.16937#A2.SS1).
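The label-construction rule above (keep an instance only when exactly one strategy succeeds) can be sketched as follows; the field names (`native_correct`, `translate_correct`, `features`) are our own hypothetical schema, not the paper's code:

```python
def build_routing_labels(examples):
    """Keep only instances with an unambiguous routing signal.

    Each example is a dict with boolean fields `native_correct` and
    `translate_correct` plus a feature vector. Label 0 = Native,
    1 = Translate; cases where both strategies succeed or both fail
    carry no signal and are discarded, per the training setup.
    """
    features, labels = [], []
    for ex in examples:
        if ex["native_correct"] == ex["translate_correct"]:
            continue  # ambiguous: both right or both wrong
        features.append(ex["features"])
        labels.append(0 if ex["native_correct"] else 1)
    return features, labels
```

The retained (features, labels) pairs would then feed directly into the XGBoost or MLP training step.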
#### Feature Engineering.
For each instance, we run both Native and Translate to obtain responses $r_n$ and $r_t$, then extract features capturing their differences across four categories: (1) metadata, (2) question-level, (3) response-level, and (4) alignment. The same language-agnostic pipeline is applied uniformly to all instances. Complete feature definitions appear in Appendix [B.2](https://arxiv.org/html/2604.16937#A2.SS2).
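As an illustration, two simple response-level signals of this kind might be computed as below; these are hypothetical simplifications of our own, not the paper's exact feature definitions:

```python
def response_features(r_n, r_t):
    """Compare the Native (r_n) and Translate (r_t) responses.

    Returns two illustrative signals: token-level Jaccard overlap
    (a crude word-overlap/alignment proxy) and the relative length
    difference between the two responses.
    """
    tok_n = set(r_n.lower().split())
    tok_t = set(r_t.lower().split())
    union = tok_n | tok_t
    overlap = len(tok_n & tok_t) / len(union) if union else 0.0
    len_diff = abs(len(r_n) - len(r_t)) / max(len(r_n), len(r_t), 1)
    return {"word_overlap": overlap, "length_diff": len_diff}
```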
Table 2: Results on DeepSeek-v3.1 across in-domain (green) and out-of-domain (orange) benchmarks. Best results are bolded; Oracle marks the upper bound where at least one of Native or Translate succeeds. Empty cells indicate languages not covered by the respective benchmark, as detailed in Table [8](https://arxiv.org/html/2604.16937#A3.T8).
## 4 RQ1: Does One Strategy Fit All?
Building on prior findings that question the universality of translation-based prompting, we examine whether any single prompting strategy outperforms the others across languages and tasks, as a “one-strategy-fits-all” assumption would imply. Table [1](https://arxiv.org/html/2604.16937#S1.T1) reveals three key findings. First, no single strategy dominates: while SCoT-Trans achieves the highest average (83.97%), Sel-Trans wins for three languages (ZH, KO, YO) and Translate for five (HI, ID, KO, SI, SW). Second, resource level predicts strategy effectiveness: low-resource languages consistently favor translation (SI/SW/YO: +5.5 to +18.8% over native), while high-resource languages show the opposite trend (gaps under 1%). Korean presents an extreme case, with a 49.9% gap between strategies, suggesting severe underrepresentation in training. Third, prompt-based strategy selection fails: Prompt-Routing (82.8%) underperforms simple Translate (84.0%), demonstrating that effective strategy selection requires learning from patterns rather than model self-assessment.
## 5 RQ2: Can We Learn to Select?
### 5.1 Problem Formulation
For each question $q$ in language $\ell$, we generate responses using both the Native ($r_n$) and Translate ($r_t$) strategies. Our goal is to train a binary classifier $f(q, r_n, r_t) \rightarrow \{0, 1\}$ that predicts which strategy yields the correct answer, where 0 selects Native and 1 selects Translate.
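At inference time, this formulation amounts to a simple dispatch: run both strategies, then let the selector pick which response to return. A minimal sketch, where the two generator functions and the selector are placeholders for the LLM calls and the trained classifier:

```python
def route_and_answer(q, generate_native, generate_translate, f):
    """Run both strategies, then apply the selector f(q, r_n, r_t)
    in {0, 1} to decide which response to return:
    0 -> Native response, 1 -> Translate response."""
    r_n = generate_native(q)
    r_t = generate_translate(q)
    return r_t if f(q, r_n, r_t) == 1 else r_n
```

Note that this requires generating both responses before routing, which is the dual-generation overhead discussed in the paper's limitations.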
### 5.2 Experimental Results
Table [2](https://arxiv.org/html/2604.16937#S3.T2) reports the performance of the XGBoost classifier on DeepSeek-v3.1 across all benchmarks. Complete results for both DeepSeek-v3.1 and Llama-3.3-70B are provided in Appendix [B.4](https://arxiv.org/html/2604.16937#A2.SS4).
#### In-domain Performance.
On the Global-MMLU test set, the classifier achieves 82.3% accuracy, outperforming Translate (+0.6%) and substantially exceeding Native (+9.8%). It captures consistent gains across all languages, with improvements most pronounced for YO (+2.1% over the best baseline). On MMLU-ProX, gains persist (about +0.2% over the best baseline), demonstrating robustness to increased difficulty.
#### Out-of-domain Generalization.
Despite training only on multiple-choice questions, the classifier generalizes to different task formats. On XQuAD (extractive QA), it achieves 87.6% (+0.5% over the best baseline). On XCOPA (causal reasoning), performance reaches 95.7% (+0.4%). Even on mCSQA’s challenging examples, the classifier shows modest gains (33.8% vs. 33.4%).
Table 3: Top 3 most important feature groups (with importance scores, %) for XGBoost and MLP classifiers on DeepSeek-v3.1 and Llama-3.3-70B.
Figure 1: Translate selection rate (%) of the XGBoost classifier on DeepSeek-v3.1.
#### Statistical Significance.
We assess significance using the Wilcoxon signed-rank test (Wilcoxon, [1945](https://arxiv.org/html/2604.16937#bib.bib26)). Across all language-dataset pairs on both models, XGBoost significantly outperforms both baselines ($p < 0.001$), while MLP achieves $p < 0.05$. This demonstrates the statistical robustness of learned strategy selection. The test procedure and detailed results are presented in Appendix [C](https://arxiv.org/html/2604.16937#A3).
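A test of this kind can be run with SciPy's standard implementation. The sketch below uses made-up per-pair accuracy gains, not the paper's numbers:

```python
from scipy.stats import wilcoxon

# Hypothetical per (language, dataset) accuracy gains of the learned
# router over a fixed baseline; all positive in this toy example.
gains = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0]

# Wilcoxon signed-rank test on the paired differences (two-sided by
# default); small p rejects the hypothesis of zero median gain.
stat, p_value = wilcoxon(gains)
print(f"W = {stat}, p = {p_value:.4f}")
```

With only eight pairs SciPy uses the exact null distribution; since every difference here is positive, the resulting p-value falls below 0.01.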
### 5.3 Feature Importance Analysis
To understand what drives the routing decisions, we analyze the classifiers' feature importance. Table [3](https://arxiv.org/html/2604.16937#S5.T3) shows that word-overlap features consistently dominate across settings, suggesting that the classifiers primarily rely on semantic alignment differences between native and translated responses. These features are precisely what prompt-based self-routing cannot access: Prompt-Routing relies on the model’s self-assessment, which cannot quantify response-level differences. This explains why Prompt-Routing (82.8%) underperforms simple Translate (84.0%), while the learned classifier, equipped with these features, consistently outperforms both fixed strategies.
### 5.4 Strategy Selection Analysis
Figure [1](https://arxiv.org/html/2604.16937#S5.F1) shows that the classifier’s Translate selection rate strongly correlates with language resource level: relatively high-resource languages (ZH, ES, DE, HI, ID) exhibit balanced selection (40–70%), varying by task, while relatively low-resource languages (KO, SI, YO) heavily favor Translate. This contradicts expectations from prior work, which indicates that translating low-resource languages yields low translation quality (Koehn and Knowles, [2017](https://arxiv.org/html/2604.16937#bib.bib12); Team et al., [2022](https://arxiv.org/html/2604.16937#bib.bib37); Shu et al., [2024](https://arxiv.org/html/2604.16937#bib.bib43)) and harms performance (Liu et al., [2025](https://arxiv.org/html/2604.16937#bib.bib20)). We therefore investigate this relationship further in RQ3 (§[6](https://arxiv.org/html/2604.16937#S6)). Full translation-rate heatmaps appear in Appendix [D.1](https://arxiv.org/html/2604.16937#A4.SS1).
Table 4: Translation quality analysis on Global-MMLU using chrF scores of the XGBoost classifier. Low-quality bins (bottom 30%) show high Translate selection rates despite lower accuracy; high-quality bins (top 40%) show improved accuracy but a lower translation rate.
## 6 RQ3: Why Do Low-Resource Languages Favor Translation?
We analyze the relationship among language resource level, translation quality, and learned strategy selection.
#### Setup.
We evaluate translation quality using BLEURT (Sellam et al., [2020](https://arxiv.org/html/2604.16937#bib.bib9)), chrF (Popović, [2015](https://arxiv.org/html/2604.16937#bib.bib11)), and METEOR (Banerjee and Lavie, [2005](https://arxiv.org/html/2604.16937#bib.bib10)), comparing model-generated translations against the original English questions and options. We partition Global-MMLU and MMLU-ProX examples into quality deciles and measure: (1) accuracy for each method, (2) the performance gap (Translate − Native), and (3) the classifier’s Translate selection rate. Details appear in Appendix [D.2](https://arxiv.org/html/2604.16937#A4.SS2).
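The decile partition in this setup can be sketched as follows; the function names are ours, and the quality scores are assumed to be precomputed (e.g. chrF per instance):

```python
def quality_deciles(scores, k=10):
    """Sort instances by translation-quality score and split them
    into k roughly equal bins (deciles when k=10), returning index
    lists with the lowest-quality bin first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    base, extra = divmod(len(order), k)
    bins, start = [], 0
    for b in range(k):
        size = base + (1 if b < extra else 0)
        bins.append(order[start:start + size])
        start += size
    return bins

def translate_rate_per_bin(bins, chose_translate):
    """Fraction of instances routed to Translate within each bin,
    given a 0/1 routing decision per instance."""
    return [sum(chose_translate[i] for i in b) / len(b) for b in bins if b]
```

Per-bin accuracy and the Translate − Native gap can be computed the same way, by averaging per-instance correctness instead of routing decisions.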
#### Results.
Table [4](https://arxiv.org/html/2604.16937#S5.T4) shows results using chrF on Global-MMLU for DeepSeek-v3.1 and Llama-3.3-70B. Complete results across quality metrics and datasets are provided in Appendix [D.2](https://arxiv.org/html/2604.16937#A4.SS2.SSS0.Px1). Three consistent patterns emerge. As translation quality improves: (1) all methods achieve higher accuracy, (2) the Translate − Native gap narrows, and (3) the classifier’s Translate selection rate correspondingly decreases. Critically, the classifier selects Translate most aggressively where translation quality is lowest, not highest. This inverse correlation demonstrates that the classifier learns to exploit translation where native performance is weakest, independent of translation quality itself.
Figure 2: Distribution (%) of responses across translation quality bins by language resource level on Global-MMLU with DeepSeek-v3.1.
#### Discussion.
This pattern reflects the confounding among language resource level, translation quality, and the classifiers' learned strategy selection. As shown in Figure [2](https://arxiv.org/html/2604.16937#S6.F2), responses of low-resource languages concentrate in low-quality bins due to limited parallel corpora (Koehn and Knowles, [2017](https://arxiv.org/html/2604.16937#bib.bib12); Team et al., [2022](https://arxiv.org/html/2604.16937#bib.bib37)), yet these languages benefit the most from translation (§[4](https://arxiv.org/html/2604.16937#S4)). Meanwhile, high-resource languages dominate the high-quality bins, where the two strategies differ little, so the performance gap narrows. Our analysis thus reveals that language resource level, not translation quality alone, determines the optimal strategy. Full heatmaps over language resource levels and quality bins appear in Appendix [D.2](https://arxiv.org/html/2604.16937#A4.SS2.SSS0.Px1).
## 7 Conclusion
This work investigates prompting strategy selection for multilingual LLMs, showing that translation-based prompting is not universally beneficial and that no single strategy fits all language–task pairs, with low-resource languages favoring translation despite lower translation quality. To address this variability, we introduce lightweight classifiers that predict the optimal strategy for each instance, achieving statistically significant improvements over both native and translation baselines across four benchmarks and generalizing to unseen task formats. Through controlled analysis, we show that language resource level, rather than translation quality, is the primary factor determining when translation is beneficial. These findings reframe multilingual prompting from a fixed-strategy paradigm to a learned decision problem. Future work can build on this through stronger routing models, hybrid prompting strategies, retrieval-based selection methods, and ultimately integrating routing directly into model inference to eliminate the dual-generation overhead.
## Limitations
While our classifier demonstrates effectiveness across multiple benchmarks, several limitations warrant consideration. First, our experiments focus on ten languages spanning different resource levels; the generalizability of our findings to other, unseen languages and to additional model families, particularly non-English-centric models, remains to be validated. Second, although our evaluation covers a range of task types, it still does not fully represent the diversity of multilingual NLP applications, owing to the limitations of existing multilingual datasets. Moreover, the lack of culturally sensitive multilingual datasets makes it difficult to assess whether cultural factors play a role in prompting strategy choice. Third, the classifier relies on features extracted from model responses, so it requires generating both native and translated outputs for each inference, which increases computational cost compared to running a single strategy. This overhead may limit practical deployment in resource-constrained settings. We expect that the strategy decision classifier could be further utilized within LLMs to select the responding route without dual inference, potentially through integration as an internal routing mechanism or by training the model to predict optimal strategies from input features alone.
## Acknowledgments
This work was supported by the National Science and Technology Council, Taiwan, under grant NSTC 114-2221-E-002-070-MY3, and by the Ministry of Education (MOE), Taiwan, under grant NTU-114L900901.
## References
- AI@Meta (2024). Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019). Optuna: a next-generation hyperparameter optimization framework. arXiv:1907.10902.
- M. Artetxe, S. Ruder, and D. Yogatama (2020). On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- S. Banerjee and A. Lavie (2005). METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72.
- T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 785–794.
- DeepSeek-AI (2024). DeepSeek-v3 technical report. arXiv:2412.19437.
- F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang (2022). Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 878–891.
- A. Ghosh, D. Datta, S. Saha, and C. Agarwal (2025). A survey of multilingual reasoning in language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 8920–8936.
- S. Haykin (1994). Neural networks: a comprehensive foundation. Prentice Hall PTR.
- H. Huang, T. Tang, D. Zhang, X. Zhao, T. Song, Y. Xia, and F. Wei (2023). Not all languages are created equal in LLMs: improving multilingual capability by cross-lingual-thought prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 12365–12394.
- Y. Intrator, M. Halfon, R. Goldenberg, R. Tsarfaty, M. Eyal, E. Rivlin, Y. Matias, and N. Aizenberg (2024). Breaking the language barrier: can direct inference outperform pre-translation in multilingual LLM applications? In Proceedings of NAACL-HLT 2024 (Volume 2: Short Papers), pp. 829–844.
- P. Koehn and R. Knowles (2017). Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 28–39.
- N. Kowtal, T. Deshpande, and R. Joshi (2024). Chain-of-translation prompting (CoTR): a novel prompting technique for low resource languages. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, pp. 645–655.
- H. Lai and M. Nissim (2024). mCoT: multilingual instruction tuning for reasoning consistency in language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12012–12026.
- Z. Li, Y. Shi, Z. Liu, F. Yang, A. Payani, N. Liu, and M. Du (2024). Language ranker: a metric for quantifying LLM performance across high and low-resource languages. arXiv:2404.11553.
- C. Liu, W. Zhang, Y. Zhao, A. T. Luu, and L. Bing (2025). Is translation all you need? A study on solving multilingual tasks with large language models. In Proceedings of NAACL-HLT 2025 (Volume 1: Long Papers), pp. 9594–9614.
- I. Mondshine, T. Paz-Argaman, and R. Tsarfaty (2025). Beyond English: the impact of prompt translation strategies across languages and tasks in multilingual LLMs. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 1331–1354.
- J. D. D. Nyandwi, Y. Song, S. Khanuja, and G. Neubig (2025). Grounding multilingual multimodal LLMs with cultural knowledge. In Proceedings of EMNLP 2025, pp. 24198–24242.
- R. Paul, A. Kamath, K. Singla, R. Joshi, U. Vaidya, S. S. Chauhan, and N. Wartikar (2025). Aligning large language models to low-resource languages through LLM-based selective translation: a systematic study. arXiv:2507.14304.
- E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, and A. Korhonen (2020). XCOPA: a multilingual dataset for causal commonsense reasoning. In Proceedings of EMNLP 2020, pp. 2362–2376.
- M. Popović (2015). chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395.
- P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning (2020). Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of ACL 2020: System Demonstrations, pp. 101–108.
- Y. Sakai, H. Kamigaito, and T. Watanabe (2024). mCSQA: multilingual commonsense reasoning dataset with unified creation strategy by language models and humans. In Findings of ACL 2024, pp. 14182–14214.
- T. Sellam, D. Das, and A. Parikh (2020). BLEURT: learning robust metrics for text generation. In Proceedings of ACL 2020, pp. 7881–7892.
- U. Shaham, J. Herzig, R. Aharoni, I. Szpektor, R. Tsarfaty, and M. Eyal (2024). Multilingual instruction tuning with just a pinch of multilinguality. In Findings of ACL 2024, pp. 2304–2317.
- P. Shu, J. Chen, Z. Liu, H. Wang, Z. Wu, T. Zhong, Y. Li, H. Zhao, H. Jiang, Y. Pan, Y. Zhou, C. Owl, X. Zhai, N. Liu, C. Saunt, and T. Liu (2024). Transcending language boundaries: harnessing LLMs for low-resource language translation. arXiv:2411.11295.
- S. Singh, A. Romanou, C. Fourrier, D. I. Adelani, J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat, K. Marchisio, W. Q. Leong, Y. Susanto, R. Ng, S. Longpre, S. Ruder, W. Ko, A. Bosselut, A. Oh, A. Martins, L. Choshen, D. Ippolito, E. Ferrante, M. Fadaee, B. Ermis, and S. Hooker (2025). Global MMLU: understanding and addressing cultural and linguistic biases in multilingual evaluation. In Proceedings of ACL 2025 (Volume 1: Long Papers), pp. 18761–18799.
- Z\. R\. Tam, C\. Wu, Y\. Y\. Chiu, C\. Lin, Y\. Chen, and H\. Lee \(2025\)Language matters: how do multilingual input and reasoning paths affect large reasoning models?\.External Links:2505\.17407,[Link](https://arxiv.org/abs/2505.17407)Cited by:[§1](https://arxiv.org/html/2604.16937#S1.p1.1),[§2](https://arxiv.org/html/2604.16937#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Team, M\. R\. Costa\-jussà, J\. Cross, O\. Çelebi, M\. Elbayad, K\. Heafield, K\. Heffernan, E\. Kalbassi, J\. Lam, D\. Licht, J\. Maillard, A\. Sun, S\. Wang, G\. Wenzek, A\. Youngblood, B\. Akula, L\. Barrault, G\. M\. Gonzalez, P\. Hansanti, J\. Hoffman, S\. Jarrett, K\. R\. Sadagopan, D\. Rowe, S\. Spruit, C\. Tran, P\. Andrews, N\. F\. Ayan, S\. Bhosale, S\. Edunov, A\. Fan, C\. Gao, V\. Goswami, F\. Guzmán, P\. Koehn, A\. Mourachko, C\. Ropers, S\. Saleem, H\. Schwenk, and J\. Wang \(2022\)No language left behind: scaling human\-centered machine translation\.External Links:2207\.04672,[Link](https://arxiv.org/abs/2207.04672)Cited by:[§5\.4](https://arxiv.org/html/2604.16937#S5.SS4.p1.1),[§6](https://arxiv.org/html/2604.16937#S6.SS0.SSS0.Px3.p1.1)\.
- Y\. Wang, S\. Zhao, Z\. Wang, H\. Huang, M\. Fan, Y\. Zhang, Z\. Wang, H\. Wang, and T\. Liu \(2024\)Strategic chain\-of\-thought: guiding accurate reasoning in LLMs through strategy elicitation\.Computing Research RepositoryarXiv:2409\.03271\.External Links:[Link](https://arxiv.org/abs/2409.03271)Cited by:[§A\.2](https://arxiv.org/html/2604.16937#A1.SS2.SSS0.Px4.p1.1),[§A\.2](https://arxiv.org/html/2604.16937#A1.SS2.p1.1),[§1](https://arxiv.org/html/2604.16937#S1.p2.1),[§2](https://arxiv.org/html/2604.16937#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2604.16937#S3.SS0.SSS0.Px2.p1.1)\.
- F\. Wilcoxon \(1945\)Individual comparisons by ranking methods\.Biometrics Bulletin1\(6\),pp\. 80–83\.External Links:ISSN 00994987,[Link](http://www.jstor.org/stable/3001968)Cited by:[§5\.2](https://arxiv.org/html/2604.16937#S5.SS2.SSS0.Px3.p1.2)\.
- W\. Xuan, R\. Yang, H\. Qi, Q\. Zeng, Y\. Xiao, A\. Feng, D\. Liu, Y\. Xing, J\. Wang, F\. Gao, J\. Lu, Y\. Jiang, H\. Li, X\. Li, K\. Yu, R\. Dong, S\. Gu, Y\. Li, X\. Xie, F\. Juefei\-Xu, F\. Khomh, O\. Yoshie, Q\. Chen, D\. Teodoro, N\. Liu, R\. Goebel, L\. Ma, E\. Marrese\-Taylor, S\. Lu, Y\. Iwasawa, Y\. Matsuo, and I\. Li \(2025\)MMLU\-ProX: a multilingual benchmark for advanced large language model evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1513–1532\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.79/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.79),ISBN 979\-8\-89176\-332\-6Cited by:[§3](https://arxiv.org/html/2604.16937#S3.SS0.SSS0.Px1.p1.1)\.
- W\. Zhu, S\. Huang, F\. Yuan, S\. She, J\. Chen, and A\. Birch \(2024\)Question translation training for better multilingual reasoning\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 8411–8423\.External Links:[Link](https://aclanthology.org/2024.findings-acl.498/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.498)Cited by:[§1](https://arxiv.org/html/2604.16937#S1.p2.1),[§2](https://arxiv.org/html/2604.16937#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix A Preliminary Details
### A.1 LLM Endpoints
We use the NVIDIA NIM APIs ([https://build.nvidia.com](https://build.nvidia.com/)) to generate responses for each prompting strategy, using `deepseek-ai/deepseek-v3_1` with thinking mode enabled for DeepSeek-v3.1 and `meta/llama-3_3-70b-instruct` for Llama-3.3-70B.
### A.2 Prompt Strategies
We assess multiple strategies based on the language used for instructions and reasoning steps, using a zero-shot approach. The complete prompting templates are provided in Figures [3](https://arxiv.org/html/2604.16937#A1.F3), [4](https://arxiv.org/html/2604.16937#A1.F4), [5](https://arxiv.org/html/2604.16937#A1.F5), [6](https://arxiv.org/html/2604.16937#A1.F6), [7](https://arxiv.org/html/2604.16937#A1.F7), and [8](https://arxiv.org/html/2604.16937#A1.F8). The prompt templates refer to Liu et al. ([2025](https://arxiv.org/html/2604.16937#bib.bib20)) (Native, Translate), Mondshine et al. ([2025](https://arxiv.org/html/2604.16937#bib.bib2)) (Sel-Trans), and Wang et al. ([2024](https://arxiv.org/html/2604.16937#bib.bib32)) (SCoT-Native, SCoT-Trans).
```
Native Prompting

[Multiple-choice]
Answer the following multiple choice question. The last line of your response
should be exactly: 'Answer $LETTER' where LETTER is one of ABCD. Think step by
step before answering.
Question: {question}
Options: {options}

[QA]
Answer the following question based on the given context. Provide a concise and
accurate answer. The last line of your response should be exactly:
'Answer: [your answer]'.
Context: {context}
Question: {question}
```

Figure 3: Native prompting template for LLM response generation. We use Google Translate to translate the instruction into each native language when prompting.

```
Translate Prompting

[Multiple-choice]
First, translate the following question and options from {language} to English.
Then, answer the translated multiple choice question. The last line of your
response should be exactly: 'Answer $LETTER' where LETTER is one of ABCD. Think
step by step before answering.
Original Question ({language}): {question}
Original Options ({language}): {options}
Please provide your response in the following format:
Translated Question: [your English translation]
Translated Options: [your English translation]
Reasoning: [your step-by-step reasoning]
Answer [LETTER]

[QA]
First, translate the following context and question from {language} to English.
Then, answer the translated question based on the translated context. The last
line of your response should be exactly: 'Answer: [your answer]'.
Original Context ({language}): {context}
Original Question ({language}): {question}
Please provide your response in the following format:
Translated Context: [your English translation]
Translated Question: [your English translation]
Reasoning: [your step-by-step reasoning]
Answer: [your answer]
```

Figure 4: Translate prompting template for LLM response generation. All languages use the same English instruction for translation and response.

```
Selective-translate Prompting

[Multiple-choice]
Answer the following multiple choice question. The last line of your response
should be exactly: 'Answer $LETTER' where LETTER is one of ABCD. Think step by
step before answering.
Question: {question}
Options: {options}
```

Figure 5: Selective translate prompting template for LLM response generation. All languages use the same English instruction with native inputs.

```
Native Strategic CoT Prompting

[Multiple-choice]
**Role:** You are a strategic reasoning expert skilled in systematic
problem-solving.
**Workflow:**
1. First, analyze the problem and develop a strategic approach to solve it.
2. Then, apply your strategy step-by-step to reach the solution.
**Rule:** Ensure each step is logical, clear, and builds upon the previous one.
**Initialization:** Let's begin by understanding the problem and formulating a
strategy.
**Task Input:**
Question: {question}
Options: {options}
Please follow the SCoT methodology: first outline your strategic approach, then
apply it step-by-step. End your response with exactly: 'Answer $LETTER' where
LETTER is one of ABCD.
```

Figure 6: Native strategic CoT prompting template for LLM response generation. We use Google Translate to translate the instruction into each native language when prompting.

```
Translate Strategic CoT Prompting

[Multiple-choice]
**Role:** You are a strategic reasoning expert skilled in systematic
problem-solving.
**Workflow:**
1. First, analyze the problem and develop a strategic approach to solve it.
2. Then, apply your strategy step-by-step to reach the solution.
**Rule:** Ensure each step is logical, clear, and builds upon the previous one.
**Initialization:** Let's begin by understanding the problem and formulating a
strategy.
**Task Input:**
Question: {question}
Options: {options}
Please follow the SCoT methodology: first outline your strategic approach, then
apply it step-by-step. End your response with exactly: 'Answer $LETTER' where
LETTER is one of ABCD.
```

Figure 7: Translate strategic CoT prompting template for LLM response generation. All languages use the same English instruction with native inputs.

```
Routing Prompting

[Multiple-choice]
You are a multilingual AI assistant tasked with determining the best approach to
answer a multiple-choice question.
Question Language: {language_name}
Question: {question}
Options: {options}
Based on research in multilingual NLP, there are two approaches:
1. NATIVE: Answer directly in {language_name}
2. TRANSLATE: Translate the question to English first, then answer
Please assess your proficiency and confidence:
- How confident are you in understanding and reasoning in {language_name}?
  (Consider vocabulary, grammar, cultural context)
- Is this a complex question requiring nuanced reasoning, or is it
  straightforward?
- Would translating to English improve your accuracy?
Respond with EXACTLY ONE of the following on the last line:
ROUTE: NATIVE
or
ROUTE: TRANSLATE
Provide brief reasoning first (1-2 sentences), then your routing decision.

[QA]
You are a multilingual AI assistant tasked with determining the best approach to
answer a question based on context.
Question Language: {language_name}
Context: {context}
Question: {question}
Based on research in multilingual NLP, there are two approaches:
1. NATIVE: Answer directly in {language_name} based on the context
2. TRANSLATE: Translate the context and question to English first, then answer
Please assess your proficiency and confidence:
- How confident are you in understanding and reasoning in {language_name}?
  (Consider vocabulary, grammar, cultural context)
- Is this a complex question requiring nuanced reasoning, or is it
  straightforward?
- Would translating to English improve your accuracy?
Respond with EXACTLY ONE of the following on the last line:
ROUTE: NATIVE
or
ROUTE: TRANSLATE
Provide brief reasoning first (1-2 sentences), then your routing decision.
```

Figure 8: Prompt routing template for LLM response generation. All languages use the same English instruction to decide the strategy. After the decision, the model uses the same prompt as native/translate prompting to produce the final response and answer.

#### Native method (Native)
In Native, we provide the question with both the input and the Chain-of-Thought instructions in the native language.
#### Translate method \(Translate\)
In Translate, we provide the question with the input in the native language, then instruct the model to translate the question to English and solve it with English Chain-of-Thought instructions.
#### Selective Translate method \(Sel\-Trans\)
Selective Translation (Mondshine et al., [2025](https://arxiv.org/html/2604.16937#bib.bib2)) is a method that selectively translates only specific parts of the prompt. We provide the question with the input in the native language, then instruct the model using English Chain-of-Thought instructions without first translating the question.
#### Native Strategic Chain\-of\-Thought method \(SCoT\-Native\)
Native Strategic Chain-of-Thought (Wang et al., [2024](https://arxiv.org/html/2604.16937#bib.bib32)) is a method that integrates strategic knowledge before generating intermediate reasoning steps. We provide both the input and the Strategic Chain-of-Thought instructions in the native language.
#### Translate Strategic Chain\-of\-Thought method \(SCoT\-Trans\)
In Translate Strategic Chain-of-Thought, we provide the input in the native language, then instruct the model with English Strategic Chain-of-Thought instructions without first translating the complete question input.
#### Prompt Routing method \(Prompt\-Routing\)
In Prompt-Routing, we provide the input in the native language, then instruct the model to decide whether to translate the question into English, and accordingly to solve it with native or English Chain-of-Thought instructions.
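The routing prompt asks the model to end its response with `ROUTE: NATIVE` or `ROUTE: TRANSLATE`. A minimal parser for that output contract might look like the following sketch; the function name and the fallback to NATIVE are our own assumptions, not specified in the paper:

```python
import re

def parse_route(response: str, default: str = "NATIVE") -> str:
    """Extract the routing decision from an LLM response.

    The routing prompt instructs the model to put 'ROUTE: NATIVE' or
    'ROUTE: TRANSLATE' on the last line, so we scan lines from the end;
    earlier mentions inside the reasoning text therefore cannot cause a
    false match. The `default` fallback is an assumption of this sketch.
    """
    for line in reversed(response.strip().splitlines()):
        m = re.search(r"ROUTE:\s*(NATIVE|TRANSLATE)", line, re.IGNORECASE)
        if m:
            return m.group(1).upper()
    return default

reply = (
    "The question uses domain terms that are easier to reason about in English.\n"
    "ROUTE: TRANSLATE"
)
print(parse_route(reply))  # TRANSLATE
```

Scanning from the last line keeps the parser robust to responses that discuss both options in their brief reasoning before committing to one.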
### A.3 Complete Preliminary Analysis Results
We conduct our preliminary analysis on the Global\-MMLU subsets labeled as Culturally Sensitive \(CS\) and Culturally Agnostic \(CA\)\. The main results of the preliminary analysis separated by subsets are presented in Table[5](https://arxiv.org/html/2604.16937#A1.T5)\.
Table 5:Accuracy \(%\) of six prompting strategies on the Culturally Sensitive \(CS\) and Culturally Agnostic \(CA\) subsets of Global\-MMLU using Llama3\.3\-70B, across ten languages grouped by resource level\.
## Appendix B Classifier Details
### B.1 Classifier Settings and Details
We employ hyperparameter tuning to optimize the performance of our classifier models during training. We use Optuna (Akiba et al., [2019](https://arxiv.org/html/2604.16937#bib.bib42)) to perform automated hyperparameter optimization, with overall accuracy (problem-level correctness) as the primary objective. The final hyperparameter values selected for each model configuration are presented in Table [6](https://arxiv.org/html/2604.16937#A2.T6).
#### XGBoost
We tune the number of estimators \(100–600\), maximum tree depth \(3–12\), learning rate \(0\.01–0\.3, log scale\), subsample ratio \(0\.6–1\.0\), column subsample ratio \(0\.6–1\.0\), and minimum child weight \(1\.0–10\.0\)\.
#### MLP
We tune the hidden layer architecture (selected from predefined configurations), the L2 regularization parameter α (1e-5–1e-2, log scale), and the initial learning rate (1e-4–1e-2, log scale).
| Hyperparameter | MLP (DeepSeek-v3.1) | MLP (Llama-3.3-70B) | XGBoost (DeepSeek-v3.1) | XGBoost (Llama-3.3-70B) |
|---|---|---|---|---|
| Hidden layer sizes | (100, 50) | (100) | – | – |
| α (L2 regularization) | 8.94×10⁻⁵ | 4.19×10⁻⁵ | – | – |
| Learning rate (initial) | 3.27×10⁻³ | 5.44×10⁻³ | – | – |
| Number of estimators | – | – | 424 | 101 |
| Max depth | – | – | 10 | 3 |
| Learning rate | – | – | 2.87×10⁻² | 1.88×10⁻² |
| Subsample | – | – | 0.951 | 0.700 |
| Column sample by tree | – | – | 0.615 | 0.996 |
| Min child weight | – | – | 9.51 | 4.71 |

Table 6: Final hyperparameter values for the MLP and XGBoost classifiers.
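The XGBoost search space described above can be illustrated with a dependency-free random-search stand-in. The paper uses Optuna for this step; the sampling helper, trial count, and toy objective below are our own assumptions, kept stdlib-only so the sketch runs anywhere:

```python
import math
import random

# XGBoost search space as stated in Appendix B.1.
XGB_SPACE = {
    "n_estimators":     ("int", 100, 600),
    "max_depth":        ("int", 3, 12),
    "learning_rate":    ("log", 0.01, 0.3),
    "subsample":        ("lin", 0.6, 1.0),
    "colsample_bytree": ("lin", 0.6, 1.0),
    "min_child_weight": ("lin", 1.0, 10.0),
}

def sample(space, rng):
    """Draw one hyperparameter configuration from the search space."""
    params = {}
    for name, (kind, lo, hi) in space.items():
        if kind == "int":
            params[name] = rng.randint(lo, hi)
        elif kind == "log":  # uniform in log space, as for learning rates
            params[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:                # plain uniform
            params[name] = rng.uniform(lo, hi)
    return params

def random_search(space, objective, n_trials=50, seed=0):
    """Keep the configuration with the highest objective value."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_trials):
        params = sample(space, rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for classifier validation accuracy.
best, score = random_search(XGB_SPACE, lambda p: -abs(p["max_depth"] - 10))
```

In the paper's actual setup, the objective would train a classifier on the extracted features and return its problem-level accuracy; Optuna's `suggest_int`/`suggest_float` calls play the role of `sample` here.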
### B.2 Features
Our approach relies on features capturing differences between $r_n$ and $r_t$ across linguistic quality, complexity, and alignment dimensions. Complete feature descriptions and examples appear in Table [7](https://arxiv.org/html/2604.16937#A3).
#### Metadata Features\.
Language identifier and subject category provide coarse\-grained context about resource availability and domain\-specific requirements\.
#### Question\-Level Features\.
Punctuation mark count and numeric character count capture structural properties that may interact differently with translation\.
#### Response\-Level Features\.
We compute linguistic quality metrics for both $r_n$ and $r_t$: named entity count using the spaCy package, rare word ratio (words outside the top 10k by frequency), grammar fluency score, lexical diversity (type-token ratio), language confidence (probability assigned to the detected language) using the langdetect package, syntactic complexity (average dependency tree depth) using Stanza (Qi et al., [2020](https://arxiv.org/html/2604.16937#bib.bib6)) models, and part-of-speech diversity, also using Stanza models.
#### Question\-Response Alignment Features\.
Word overlap metrics and embedding similarity (cosine similarity between combinations of question, answer, and response embeddings, using LaBSE (Feng et al., [2022](https://arxiv.org/html/2604.16937#bib.bib7))) measure how well each response addresses the question, potentially revealing translation-induced semantic drift.
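As a concrete illustration, a few of the cheaper features can be computed with the standard library alone; the model-based features (spaCy entities, Stanza parses, LaBSE similarity) require their respective packages. The helper names and tokenization below are our own simplifications, so exact values may differ from the paper's feature extractor:

```python
import string
from collections import Counter

def _tokens(text: str) -> list:
    """Lowercased tokens with surrounding punctuation stripped."""
    return [w for w in (t.strip(string.punctuation).lower() for t in text.split()) if w]

def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique words / total words."""
    tokens = _tokens(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def punct_density(text: str) -> float:
    """Punctuation characters divided by total characters."""
    if not text:
        return 0.0
    return sum(c in string.punctuation for c in text) / len(text)

def word_overlap_f1(a: str, b: str) -> float:
    """Token-level F1 overlap between two texts, with clipped counts."""
    ca, cb = Counter(_tokens(a)), Counter(_tokens(b))
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ca.values()), overlap / sum(cb.values())
    return 2 * p * r / (p + r)

print(round(lexical_diversity("The cat sat. The cat ran."), 2))  # 0.67
```

The type-token ratio on the example sentence matches the worked value in Table 7 (4 unique words over 6 tokens); the overlap and density helpers may normalize differently than the paper's implementation.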
### B.3 Training and Evaluation Dataset Details
The complete statistics of the datasets are presented in Table [8](https://arxiv.org/html/2604.16937#A3.T8).
### B.4 Complete Classifier Results
The complete accuracy results of the XGBoost and MLP classifiers for DeepSeek-v3.1 and Llama-3.3-70B are presented in Tables [9](https://arxiv.org/html/2604.16937#A3.T9) and [10](https://arxiv.org/html/2604.16937#A3.T10).
## Appendix C Statistical Significance
The Wilcoxon signed-rank test evaluates whether the median difference between paired observations is zero. For each language–dataset pair $i$, we compute the difference $d_i = s_i^{\text{proposed}} - s_i^{\text{baseline}}$, where $s_i^{\text{proposed}}$ and $s_i^{\text{baseline}}$ are the accuracy scores of the proposed method and the baseline, respectively. We then rank the absolute differences $|d_i|$ from smallest to largest, assigning rank $R_i$ to each pair. The test statistic $W$ is computed as:

$$W = \min\left( \sum_{d_i > 0} R_i, \; \sum_{d_i < 0} R_i \right)$$

where the first sum is over pairs where the proposed method outperforms the baseline, and the second sum is over pairs where the baseline performs better. Under the null hypothesis of no difference, $W$ follows a known distribution, from which we derive the $p$-value. The full results are presented in Table [11](https://arxiv.org/html/2604.16937#A3.T11).
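Under the definitions above, the statistic can be computed directly. This is a minimal pure-Python sketch with average ranks for ties and zero differences dropped, as in the standard test; in practice a library routine such as `scipy.stats.wilcoxon` would also supply the p-value:

```python
def wilcoxon_w(proposed, baseline):
    """Wilcoxon signed-rank statistic W = min(R+, R-).

    Ranks |d_i| in ascending order, averaging ranks over ties and
    dropping zero differences before ranking.
    """
    diffs = [p - b for p, b in zip(proposed, baseline) if p != b]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(ordered) and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    r_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(r_pos, r_neg)

# Hypothetical accuracy scores per language-dataset pair for two methods.
proposed = [71.0, 64.0, 83.0, 56.0, 90.0]
baseline = [69.0, 65.0, 80.0, 60.0, 85.0]
print(wilcoxon_w(proposed, baseline))  # 5.0
```

In the example the differences are (2, −1, 3, −4, 5); ranking their absolute values gives positive-rank sum 10 and negative-rank sum 5, so $W = 5$.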
| Feature | Description | Example |
|---|---|---|
| **Metadata features** | | |
| `language` | Language code of the response (e.g., de, zh, es) | "de", "zh" |
| `dataset` | Name of the dataset (e.g., mmlu_prox, xquad) | "mmlu_prox" |
| `subject` | Combined subject and category information | "STEM:mathematics" |
| **Question-level features** | | |
| `question_punct_density` | Density of punctuation marks in the question text (punctuation count / text length) | Q: "What is 2+2?" (1 punct / 13 chars = 0.08) |
| `question_num_density` | Density of numeric characters in the question text (digit count / text length) | Q: "What is 2+2?" (2 digits / 13 chars = 0.15) |
| **Response-level features** | | |
| `rare_word_ratio` | Proportion of rare words based on corpus frequency (words below median frequency) | "The method uses sophisticated techniques" (rare words: sophisticated, techniques; 2 / 5 = 0.40) |
| `named_entity_count` | Number of named entities (persons, organizations, locations) detected via spaCy | "Einstein worked at Princeton in Germany" (Einstein, Princeton, Germany = 3) |
| `grammar_fluency_score` | Overall fluency score accounting for punctuation errors and formatting issues (0.0–1.0, higher is better) | "The answer is correct." (1.0, no errors) vs. "The answer is correct.." (0.82, malformed punctuation) |
| `grammar_malformed_punct` | Count of malformed punctuation patterns (e.g., consecutive marks like "..") | "Is this right??" (1 instance of "??") |
| `grammar_missing_final_period` | Binary indicator of missing sentence-ending punctuation (1.0 = missing, 0.0 = present) | "The answer is correct" (1.0) vs. "The answer is correct." (0.0) |
| `lexical_diversity` | Type-token ratio measuring vocabulary diversity (unique words / total words) | "The cat sat. The cat ran." (4 unique: the, cat, sat, ran / 6 total = 0.67) |
| `language_detection_confidence` | Confidence score via langdetect (0.0–1.0, higher = more confident) | "The quick brown fox jumps" (detected as English with 0.95 confidence) |
| `language_mismatch` | Binary indicator of language mismatch via langdetect (1.0 = mismatch, 0.0 = match) | Expected: English, Detected: Spanish (1.0) |
| `syntactic_depth_max` | Maximum depth of the dependency parse tree (deeper = more complex) | "The cat that the dog chased ran" (deep nesting = depth 6) |
| `syntactic_complexity_score` | Normalized syntactic complexity (depth / log2(word_count + 1)) | "The book that the student who the teacher praised read" (depth 7, normalized) |
| `pos_noun_verb_ratio` | Ratio of nouns to verbs (higher = more nominal/informative style) | "The analysis of the data shows results" (4 nouns / 1 verb = 4.0) |
| `pos_diversity_unique_tags` | Number of unique part-of-speech tags in the response | "The cat sat on the mat" (DET, NOUN, VERB, ADP = 4 unique tags) |
| `pos_diversity_score` | POS diversity score (unique tags / total tags, 0.0–1.0) | "The cat sat" (3 unique / 3 total = 1.0) vs. "cat cat cat" (1 unique / 3 total = 0.33) |
| **Alignment features** | | |
| `word_overlap_*_*` | Token-level overlap metrics (F1, precision, recall) between pairs: answer–response, question–answer, and question–response | Response: "The answer is Paris"; Reference: "Paris" (overlap: {paris}, F1 = 0.33) |
| `labse_*_*_similarity` | Cosine similarity of LaBSE embeddings (cross-lingual semantic similarity) between pairs: answer–response, question–answer, and question–response | "The capital is Paris" (EN) vs. "La capital es París" (ES) (similarity = 0.91) |

Table 7: Complete list of features used by the strategy selection classifier, grouped into four categories: metadata, question-level, response-level, and question–response alignment features.

Table 8: Statistics of the multilingual benchmark datasets used in our experiments, including the number of examples, covered languages, and task types.
Table 9: Complete results on DeepSeek-v3.1 across in-domain (green) and out-of-domain (orange) benchmarks. Best results are bolded; Oracle marks the upper bound where at least one of Native or Translate succeeds. Empty cells indicate languages not covered by the respective benchmark dataset, as detailed in Table [8](https://arxiv.org/html/2604.16937#A3.T8).
Table 10: Complete results on Llama-3.3-70B across in-domain (green) and out-of-domain (orange) benchmarks. Best results are bolded; Oracle marks the upper bound where at least one of Native or Translate succeeds. Empty cells indicate languages not covered by the respective benchmark dataset, as detailed in Table [8](https://arxiv.org/html/2604.16937#A3.T8).
Table 11:Wilcoxon signed\-rank test p\-values combining all datasets\.
## Appendix D Translation Rate Results
### D.1 Complete Translation Rate Heatmaps
The translation-rate heatmap of the DeepSeek-v3.1 XGBoost classifier is presented in Figure [1](https://arxiv.org/html/2604.16937#S5.F1) in the main body; the heatmaps for the remaining model and classifier combinations are presented in Figures [9](https://arxiv.org/html/2604.16937#A5.F9), [10](https://arxiv.org/html/2604.16937#A5.F10), and [11](https://arxiv.org/html/2604.16937#A5.F11).
### D.2 Translation Rate Analysis Experiment
We calculate all translation quality scores by comparing the translated question and options parsed from LLM responses to gold\-standard English reference texts\. Higher scores indicate translations that more faithfully preserve the meaning and structure of the original English content\.
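One of the quality metrics used in these tables is chrF (Popović, 2015), a character n-gram F-score. The following is a simplified, dependency-free sketch (whitespace stripped, n = 1–6, β = 2); for production scoring the `sacrebleu` implementation would be the usual choice, and its defaults may differ slightly:

```python
from collections import Counter

def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean character n-gram F_beta score on a 0-100 scale.

    Whitespace is removed first; each n-gram order contributes an F-score
    computed from clipped n-gram counts, and orders with no n-grams on
    either side are skipped.
    """
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue
        matches = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        if matches == 0:
            f_scores.append(0.0)
            continue
        p = matches / sum(hyp_ngrams.values())
        r = matches / sum(ref_ngrams.values())
        f_scores.append((1 + beta**2) * p * r / (beta**2 * p + r))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(simple_chrf("the capital is Paris", "the capital is Paris"))  # 100.0
```

An exact match scores 100, a fully disjoint pair scores 0, and partial character overlap falls in between, which is what makes the metric tolerant of minor wording differences in otherwise faithful translations.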
#### Results\.
The Global-MMLU translation quality analysis results are given in Table [4](https://arxiv.org/html/2604.16937#S5.T4) in the main body; additional translation-rate tables across three different quality scores, two datasets, and two models appear in Tables [12](https://arxiv.org/html/2604.16937#A5.T12), [13](https://arxiv.org/html/2604.16937#A5.T13), [14](https://arxiv.org/html/2604.16937#A5.T14), [15](https://arxiv.org/html/2604.16937#A5.T15), and [16](https://arxiv.org/html/2604.16937#A5.T16). Complete distributions of responses across translation quality bins are provided in Figures [12](https://arxiv.org/html/2604.16937#A5.F12) and [13](https://arxiv.org/html/2604.16937#A5.F13).
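The bottom-30% / top-40% quality bins referenced in these analyses can be reproduced as a rank-based split over per-instance quality scores. The helper below and its tie-breaking by input order are our own assumptions, not the authors' exact binning procedure:

```python
def assign_bins(scores):
    """Map each score to 'low' (bottom 30%), 'high' (top 40%), else 'mid'.

    Rank-based split: instances are ordered by score, the lowest 30% are
    labeled 'low' and the highest 40% 'high'; ties keep input order.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    bins = ["mid"] * n
    n_low = int(0.30 * n)
    n_high = int(0.40 * n)
    for i in order[:n_low]:
        bins[i] = "low"
    for i in order[n - n_high:]:
        bins[i] = "high"
    return bins

scores = [10, 25, 37, 48, 52, 63, 71, 80, 88, 95]
print(assign_bins(scores))
# ['low', 'low', 'low', 'mid', 'mid', 'mid', 'high', 'high', 'high', 'high']
```

Per-bin statistics such as Translate selection rate and accuracy can then be computed by grouping instances on the returned labels.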
## Appendix E Information About Use of AI Assistants
We use AI assistants only for minimal tasks such as refining text and basic code snippets\. All core research, experimental design, and critical examination of the results are performed and verified by humans to ensure the integrity of the process\.
Figure 9: Translate selection rate (%) of the XGBoost classifier on DeepSeek-v3.1.
Figure 10: Translate selection rate (%) of the MLP classifier on DeepSeek-v3.1.
Figure 11: Translate selection rate (%) of the MLP classifier on Llama-3.3-70B.
Table 12: Translation quality analysis on Global-MMLU using BLEURT scores. Low-quality bins (bottom 30%) show high Translate selection rates despite lower accuracy. High-quality bins (top 40%) show improved accuracy but lower translation rates.
Table 13: Translation quality analysis on Global-MMLU using METEOR scores of the XGBoost classifier. Low-quality bins (bottom 30%) show high Translate selection rates despite lower accuracy. High-quality bins (top 40%) show improved accuracy but lower translation rates.
Table 14: Translation quality analysis on MMLU-ProX using BLEURT scores of the XGBoost classifier. Low-quality bins (bottom 30%) show high Translate selection rates despite lower accuracy. High-quality bins (top 40%) show improved accuracy but lower translation rates.
Table 15: Translation quality analysis on MMLU-ProX using chrF scores of the XGBoost classifier. Low-quality bins (bottom 30%) show high Translate selection rates on Llama-3.3-70B despite lower accuracy. High-quality bins (top 40%) show improved accuracy but lower translation rates.
Table 16: Translation quality analysis on MMLU-ProX using METEOR scores of the XGBoost classifier. Low-quality bins (bottom 30%) show different trends in Translate selection rates across the two models, but both have lower accuracy. High-quality bins (top 40%) show improved accuracy, while the DeepSeek-v3.1 translation rate increases and the Llama-3.3-70B translation rate drops.
Figure 12: Combined distribution (%) of responses across translation quality bins on Global-MMLU.
Figure 13: Combined distribution (%) of responses across translation quality bins on MMLU-ProX.