Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning
Summary
Translate-R1 introduces a reinforcement learning approach for cost-aware translation tool use in LLMs, where the model learns to decide when to translate inputs based on its own comprehension and a cost-sensitivity parameter, achieving Pareto-optimal trade-offs across multiple languages.
# Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning
Source: [https://arxiv.org/html/2606.06835](https://arxiv.org/html/2606.06835)
Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj, Meet Vadera, Priyanka Nigam, Bing Yin
Amazon Stores Foundation AI
###### Abstract
The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model’s dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input (Wang et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib37)). Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers (Kang et al., [2025](https://arxiv.org/html/2606.06835#bib.bib15); Son et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib34)), each requiring manual engineering or auxiliary components. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively.
Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates (the reward upper bound), it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range, reaching 95–100% for Low and XLow. Additionally, to simulate how the model would behave on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline policy that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages absent from training, and we close with an analysis of how tool use emerges over training, per language and per domain.
Figure 1: Pareto dominance: which model achieves the highest cost-adjusted score ($R - \alpha C$) at each cost-sensitivity $\alpha$. Green = gated model wins. For low-resource languages (Low, XLow, Synth), the gated model dominates 95–100% of the range. The unconstrained free-tool model (blue) is Pareto-optimal only for high-resource conditions where cost is irrelevant.
## 1 Introduction
The performance gap across languages in LLMs is well documented (Shi et al., [2023](https://arxiv.org/html/2606.06835#bib.bib31); Son et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib33); Singh et al., [2024](https://arxiv.org/html/2606.06835#bib.bib32)), and closing it natively requires pretraining or fine-tuning on corpora that are scarce for most languages. Translation offers a different path: unlike tools that augment a single capability (search for knowledge, code for computation), translation converts an incomprehensible input into one the model already reasons well in, unlocking all of its capabilities at once.
Translation brings its own dilemma. Applied to every input, it is wasteful for languages the model already handles and can even hurt when the translator introduces errors. Left to the model, the choice fails in the opposite direction: Wang et al. ([2025b](https://arxiv.org/html/2606.06835#bib.bib37)) find that LLMs never invoke translation tools across 14 languages, forfeiting large gains. The right choice depends on the language, the domain, and the specific input, a per-sample decision that cannot be specified by hand.
Prior work makes this choice with language-specific rules, domain heuristics, explicit language identifiers, or external routers (Kang et al., [2025](https://arxiv.org/html/2606.06835#bib.bib15); Son et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib34)), all of which require manual engineering or auxiliary components. We ask whether the choice can instead emerge from reward alone.
We learn a single policy across languages spanning high\-resource to extremely scarce, across diverse domains, and under varying cost budgets\. Trained only on task reward, it develops language\- and domain\-adaptive introspection: it senses its own competence and translates only when doing so pays off\. There is no language identification, no routing rule, and no supervised demonstration of when to translate\.
Our contributions:
1. A learned introspective policy. From task reward alone, the model learns language- and domain-adaptive tool use: it translates math and QA in low-resource languages but solves instruction following natively in those same languages. Both behaviors emerge without explicit engineering.
2. Cost-sensitive tool use via Gated GSPO. Existing approaches to cost-sensitive tool use either apply a fixed penalty that cannot adapt to language difficulty (flat), or use group-relative signals that are corrupted by lucky guesses on low-resource languages (OTC). Both over-suppress tool use for exactly the languages that need it most. We introduce a confidence gate that applies cost pressure only when the model demonstrates strong native competence, adapting automatically to language difficulty without tier labels. The resulting policy is Pareto-optimal across 87% of the cost-sensitivity range and retains full unconstrained reward at 63% of the cost.
3. An answer-preserving translation pipeline. Multilingual RLVR requires ground-truth answers to stay valid after translation, but naive translation corrupts answers and LLM judges cannot reliably verify in low-resource languages. Our pipeline sidesteps this by verifying entirely in the model’s dominant language via back-translation, reaching 98.4% fidelity across 22 languages.
4. Tool use on a completely new language. How does a model behave on a language it has never seen? The correct behavior is to always translate, yet the baseline policy is overconfident and underutilizes the tool. To study this cleanly, we construct 2 synthetic languages with no possibility of prior exposure, where partial understanding is impossible. Our gated policy learns to recognize that it cannot comprehend the input and translates, gaining +18.7 points over the baseline.
Together these show that a single model, trained only on task reward, can learn a calibrated sense of its own multilingual limits and call for help precisely when it needs it\. We further verify that this behavior is general, transferring zero\-shot to 9 held\-out languages absent from training\.
## 2 Related Work
#### Multilingual Reasoning\.
The cross-lingual gap is well established (Shi et al., [2023](https://arxiv.org/html/2606.06835#bib.bib31); Son et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib33); Qiu et al., [2025](https://arxiv.org/html/2606.06835#bib.bib29)); benchmarks such as MGSM (Shi et al., [2023](https://arxiv.org/html/2606.06835#bib.bib31)), MMLU-ProX (Yue et al., [2024](https://arxiv.org/html/2606.06835#bib.bib43)), and mAceReason-Math (Dobler et al., [2025](https://arxiv.org/html/2606.06835#bib.bib5)) report drops of up to 24% for low-resource languages. Recent work pins the bottleneck on comprehension rather than reasoning (Li et al., [2025c](https://arxiv.org/html/2606.06835#bib.bib21); Kim et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib16); Kang et al., [2025](https://arxiv.org/html/2606.06835#bib.bib15); Huo et al., [2025](https://arxiv.org/html/2606.06835#bib.bib13)). Translation-based approaches range from fixed translate-then-solve pipelines (Qin et al., [2023](https://arxiv.org/html/2606.06835#bib.bib28); Huang et al., [2023](https://arxiv.org/html/2606.06835#bib.bib11); Chen et al., [2024](https://arxiv.org/html/2606.06835#bib.bib2)) to RL-enforced English pivoting (TAPO; Son et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib34)). Closest to our decision problem is Selective Translation (Kang et al., [2025](https://arxiv.org/html/2606.06835#bib.bib15)), which translates only 20% of inputs using an explicit failure detector. We instead learn the decision end-to-end from reward, jointly over languages and domains, with no detection module.
#### Multilingual RL\.
GRPO (Shao et al., [2024](https://arxiv.org/html/2606.06835#bib.bib30)) and its extensions (DeepSeek-AI, [2025](https://arxiv.org/html/2606.06835#bib.bib4); Yu et al., [2025](https://arxiv.org/html/2606.06835#bib.bib41); Zhang et al., [2025c](https://arxiv.org/html/2606.06835#bib.bib46)) are now standard for RL with verifiable rewards. RL generalizes cross-lingually better than SFT (Huang et al., [2025](https://arxiv.org/html/2606.06835#bib.bib12)), though GRPO on translated data can collapse chain-of-thought toward the dominant language (Kim et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib17)); subsequent work adds language-consistency rewards (Park et al., [2025](https://arxiv.org/html/2606.06835#bib.bib27); Liu et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib22); Wang et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib36)) to force native reasoning. Zhang et al. ([2025b](https://arxiv.org/html/2606.06835#bib.bib45)) study seen vs. unseen languages in multilingual RAG, and Wu et al. ([2025](https://arxiv.org/html/2606.06835#bib.bib40)) treat language as a latent variable in GRPO, but neither learns tool-use decisions. We ask a different question: when should the model stop pretending it can reason natively and ask for help?
#### Multilingual Agents\.
LLMs never invoke translation tools even when available (Wang et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib37)), and multilingual agent performance degrades sharply with language resource level (Hofman et al., [2025](https://arxiv.org/html/2606.06835#bib.bib10); Kulkarni et al., [2025](https://arxiv.org/html/2606.06835#bib.bib18)). None learns a translation policy through RL.
#### RL for Tool Use\.
Search-R1 (Jin et al., [2025](https://arxiv.org/html/2606.06835#bib.bib14)), ToRL (Li et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib19)), ReTool (Feng et al., [2025](https://arxiv.org/html/2606.06835#bib.bib6)), ToRA (Gou et al., [2024](https://arxiv.org/html/2606.06835#bib.bib7)), Tool-R1 (Zhang et al., [2025a](https://arxiv.org/html/2606.06835#bib.bib44)), and OTC (Wang et al., [2025c](https://arxiv.org/html/2606.06835#bib.bib38)) train models to call search or code tools via GRPO. Wang et al. ([2025d](https://arxiv.org/html/2606.06835#bib.bib39)) learn when to use code vs. reason directly, and StepTool (Yu et al., [2024](https://arxiv.org/html/2606.06835#bib.bib42)) shapes rewards at each tool step. Our setting differs in kind: the model may not understand its input at all and must recognize its own incomprehension, a metacognitive judgment rather than a difficulty estimate. Tool utility is also language-conditional, as the same problem needs translation in Hausa but not French, which prior work never faces.
#### Cost\-Aware Routing\.
RouteLLM (Ong et al., [2024](https://arxiv.org/html/2606.06835#bib.bib26)), xRouter (Chen et al., [2025](https://arxiv.org/html/2606.06835#bib.bib3)), Router-R1 (Zhang et al., [2025d](https://arxiv.org/html/2606.06835#bib.bib47)), and Think When Needed (Guo et al., [2025](https://arxiv.org/html/2606.06835#bib.bib8)) route between models or inference modes under cost constraints. We route within a single model to a translation tool for comprehension failures. Our cost mechanism extends the group-relative optimization of Nemotron Nano (NVIDIA, [2025](https://arxiv.org/html/2606.06835#bib.bib25)) and Dr. GRPO (Liu et al., [2025b](https://arxiv.org/html/2606.06835#bib.bib23)) from response length to tool cost, and adds a confidence gate that prevents over-suppression on languages that genuinely need translation.
## 3 Data
### 3.1 Languages and Domains
We select 22 natural languages across 3 resource tiers (Table [1](https://arxiv.org/html/2606.06835#S3.T1)), from High to XLow, set by their approximate number of digital speakers, a proxy for how well represented each language is in web-scale pretraining data. We define tiers by speaker count rather than model performance because the two are not monotonic: on some tasks the model scores higher on XLow than on Low languages, so resource level and native competence are related but distinct. We also build 2 synthetic languages, Kivari and Toqal, with zero pretraining exposure (Section [3.4](https://arxiv.org/html/2606.06835#S3.SS4)).
Table 1: Language tiers based on approximate digital-speaker count.
We train across 5 domains: 3 verifiable (math, QA, instruction following) with deterministic rewards, and 2 non-verifiable (summarization, translation) scored by an LLM judge. This mix forces domain-specific policies: translation helps math in Hausa but is unnecessary for instruction following in the same language.
### 3.2 Scalable Translation Pipeline for Verifiable Domains
RLVR depends on a simple invariant: the ground\-truth answer must stay valid after translation\. If translating a math problem alters a coefficient, the \\boxed\{\} answer changes and the model is rewarded incorrectly\. In supervised fine\-tuning such errors only degrade input quality; in RLVR they corrupt the learning signal itself\.
Naive verification fails for low-resource languages: surface metrics (BLEU/chrF) miss semantic corruption, direct LLM judges lack target-language competence, and solve-based checks are too expensive at scale (Figure [2](https://arxiv.org/html/2606.06835#S3.F2), left). Our insight is to back-translate to English and ask the judge to compare rather than solve, working in its strongest language and needing only a short binary verdict.
The pipeline has five stages, each targeting a distinct failure mode; every training sample passes all five:
#### Source Filtering\.
We filter problems before translation, dropping those with little natural-language content (mostly LaTeX/code) or excessive length, where translation is most error-prone. This avoids spending compute on samples that carry little multilingual signal.
#### Forward Translation\.
Domain\-aware prompts list what must be kept verbatim \(mathematical notation, JSON structure, option keys\) and restrict translation to the natural\-language parts\. Output is wrapped in XML tags for reliable extraction\.
#### Heuristic Filtering\.
Cheap checks catch degenerate translations before the expensive judge call: repetition detection \(syllable loops, common in low\-resource output\), length\-ratio bounds \(information loss or hallucination\), and source\-copy detection \(the translator echoing English unchanged\)\. These cost almost nothing yet remove 15–30% of low\-resource translations\.
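The three heuristic checks can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the word n-gram stand-in for syllable-loop detection, and every threshold are assumptions (the actual thresholds are in Appendix L).

```python
from collections import Counter


def has_repetition(text: str, n: int = 3, max_share: float = 0.3) -> bool:
    """Flag text where one word n-gram makes up a large share of all n-grams.

    Assumption: a word-level proxy for the syllable loops described above.
    """
    words = text.split()
    if len(words) < 2 * n:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_count = Counter(ngrams).most_common(1)[0][1]
    return top_count / len(ngrams) > max_share


def length_ratio_ok(source: str, translation: str,
                    low: float = 0.5, high: float = 2.5) -> bool:
    """Reject outputs far shorter (information loss) or longer (hallucination)."""
    if not source or not translation:
        return False
    ratio = len(translation) / len(source)
    return low <= ratio <= high


def is_source_copy(source: str, translation: str, max_overlap: float = 0.9) -> bool:
    """Flag translations whose words are almost all copied from the source."""
    src_words = set(source.lower().split())
    out_words = translation.lower().split()
    if not out_words:
        return True
    copied = sum(word in src_words for word in out_words)
    return copied / len(out_words) > max_overlap


def passes_heuristics(source: str, translation: str) -> bool:
    """A sample moves on to back-translation only if all three checks pass."""
    return (
        not has_repetition(translation)
        and length_ratio_ok(source, translation)
        and not is_source_copy(source, translation)
    )
```

Character-length ratios differ across scripts, so in practice the bounds would likely need to be set per language.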
#### Back\-Translation \+ chrF\.
We render the translation back to English and measure chrF against the original, capturing round\-trip information loss before we commit to the judge\.
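To make the round-trip score concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under assumptions: `simple_chrf` is a hypothetical name, whitespace handling is simplified, and an established implementation such as sacreBLEU would be the natural choice in practice.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """F-beta over character n-gram precision and recall, averaged over n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A back-translation that differs from the original only in a single shifted digit still scores close to 1 under this metric, which is why the score serves as a pre-filter and the answer-preservation judge remains the final gate.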
#### LLM Judge\.
The final gate compares the original English with the back\-translation and returns a binary SAME/DIFFERENT verdict on answer preservation\. Its criteria are deliberately narrow: it ignores surface changes \(rephrasing, renamed entities\) and flags only corruptions that change the answer \(shifted numbers, dropped constraints, leaked solutions\)\. Judging in English sidesteps the core problem that judges cannot reliably evaluate low\-resource text\.
#### Recycling\.
Failed samples are replaced by a different source problem in the same language, so per\-language quotas are met without lowering the bar\. Every training sample clears the full pipeline; harder languages simply draw from a larger source pool\.
Full pipeline details, thresholds, and domain-specific prompts are in Appendix [L](https://arxiv.org/html/2606.06835#A12).
Figure 2:Overview of the answer\-preserving translation pipeline\. Left: four common validation approaches and their failure modes for low\-resource RLVR data\. Right: our five\-stage pipeline that verifies answer integrity by comparing the original with a back\-translation, operating entirely in English where the judge is reliable\.
### 3.3 Non-Verifiable Domains
Summarization and translation have no single correct answer to verify. Summarization uses native-language XL-Sum (Hasan et al., [2021](https://arxiv.org/html/2606.06835#bib.bib9)) articles where available; translation data runs through the same five-stage pipeline with a meaning-preservation judge. Details are in Appendix [J](https://arxiv.org/html/2606.06835#A10).
### 3.4 Synthetic Languages: A Completely Unseen Language
Even our lowest\-resource natural languages \(Bambara, Ewe, Lingala\) appear at some frequency in web\-scale pretraining data, so the model retains partial familiarity with them\. To simulate how it would behave on a completely unseen language, we build new languages from scratch in which partial understanding is impossible and the only route to a correct answer is the translation tool\.
We construct two synthetic languages, Kivari and Toqal, using deterministic word-level substitution over a 962,531-word vocabulary built from all our English sources. Each English word maps to a unique sequence of CVC nonsense syllables whose total length matches the source word (bijective, two independent seeds). Numbers, LaTeX commands, and mathematical symbols pass through unchanged, preserving answer extraction structure. Substitution is applied to prompts only; labels remain in their original form. The names are chosen to sound like plausible natural languages so the model cannot learn a shortcut from the language name itself.
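A word-level substitution of this kind might look like the sketch below. The helper names, the consonant and vowel inventories, and the collision-retry scheme are assumptions made for illustration; only the properties named above (seeded, length-preserving, one-to-one, with numbers and LaTeX commands untouched) come from the text.

```python
import random
import re

CONSONANTS = "bcdfghjklmnpqrstvwxyz"
VOWELS = "aeiou"


def nonsense_word(length: int, rng: random.Random) -> str:
    """Build a string of the given length on a repeating consonant-vowel-consonant pattern."""
    return "".join(
        rng.choice(VOWELS if i % 3 == 1 else CONSONANTS) for i in range(length)
    )


def build_cipher(vocab: list[str], seed: int, max_tries: int = 1000) -> dict[str, str]:
    """Map each word to a unique nonsense form of equal length; the seed fixes the language."""
    rng = random.Random(seed)
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for word in sorted(set(vocab)):  # sorted so the result depends only on the seed
        for _ in range(max_tries):
            candidate = nonsense_word(len(word), rng)
            if candidate not in used:
                break
        else:
            raise ValueError(f"no unused nonsense form for {word!r}")
        mapping[word] = candidate
        used.add(candidate)
    return mapping


# Matches either a LaTeX command (backslash + letters) or a plain word.
WORD = re.compile(r"\\[A-Za-z]+|[A-Za-z]+")


def encode(text: str, mapping: dict[str, str]) -> str:
    """Substitute known words; digits, symbols, and LaTeX commands pass through."""
    def swap(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("\\"):
            return token  # leave LaTeX commands untouched
        return mapping.get(token.lower(), token)
    return WORD.sub(swap, text)
```

Two different seeds would give two unrelated synthetic languages over the same vocabulary. This sketch drops capitalization and would also rewrite single-letter math variables if they appear in the vocabulary, both of which a real pipeline would need to handle.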
For natural languages the tool calls an LLM; for synthetic languages it is a deterministic lookup server that returns perfect, noise-free translations. The model sees the same <tool\_call\> interface either way and cannot tell them apart. This isolates the question to whether the model learns to invoke the tool, independent of translation quality. We create 2,000 training and 50 evaluation samples per synthetic language for math and translation, drawn from the same English source pool as the natural languages.
### 3.5 Data Distribution
Within each domain, low-resource languages get 4–8× more samples per language than high-resource ones, concentrating data where the performance headroom is largest. The full training set has 273K samples across all domains and tiers (Tables [9](https://arxiv.org/html/2606.06835#A8.T9) and [10](https://arxiv.org/html/2606.06835#A8.T10) in Appendix [H](https://arxiv.org/html/2606.06835#A8)). Source datasets are listed in Appendix [I](https://arxiv.org/html/2606.06835#A9).
### 3.6 Pipeline Validation
An independent audit of the full evaluation set, with Claude Opus as annotator, confirms 98.4% fidelity across all Low and XLow languages (Appendix [E](https://arxiv.org/html/2606.06835#A5)).
## 4 Training Setup
### 4.1 Model
We use Qwen3-4B (Team, [2025](https://arxiv.org/html/2606.06835#bib.bib35)), whose extensive post-training gives it reasoning, multilingual understanding across many languages, and tool calling through structured XML output. We train in reasoning mode.
### 4.2 Continued RL from a Post-Trained Checkpoint
All our training is continued RL on top of the released Qwen3-4B post-trained checkpoint. The checkpoint already has strong reasoning, multilingual, and tool-calling abilities, and our goal is not to instill these but to teach the model when to invoke translation. Our Baseline throughout is this same checkpoint evaluated zero-shot (no additional RL), and every RL variant (no-tool, free, gated, flat, OTC) is continued RL from it under identical data and hyperparameters.
### 4.3 Algorithm
GSPO (and others, [2025](https://arxiv.org/html/2606.06835#bib.bib1)) is a recent refinement of GRPO (Shao et al., [2024](https://arxiv.org/html/2606.06835#bib.bib30); DeepSeek-AI, [2025](https://arxiv.org/html/2606.06835#bib.bib4)) for reinforcement learning with verifiable rewards. We sample 8 responses per prompt and normalize advantages by subtracting the group mean reward. We train without a KL penalty, allowing the policy to fully explore tool-use strategies. Loss is computed only over model-generated tokens; tool response tokens receive a loss mask. Full hyperparameters are in Appendix [C](https://arxiv.org/html/2606.06835#A3).
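Two of the bookkeeping steps named in this paragraph, mean-centering rewards within a group and masking tool-response tokens out of the loss, can be sketched as below. This is not the GSPO objective itself, and both function names are hypothetical.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled response: its reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def masked_token_mean(token_losses: list[float], is_tool_response: list[bool]) -> float:
    """Average per-token loss over model-generated tokens only.

    Tokens injected by the tool are skipped, so the policy is never
    trained to reproduce text it did not generate.
    """
    kept = [loss for loss, masked in zip(token_losses, is_tool_response) if not masked]
    return sum(kept) / len(kept) if kept else 0.0
```

With 8 samples of which 3 are correct, for example, the correct samples get an advantage of 0.625 and the incorrect ones get -0.375, so the advantages sum to zero within the group.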
#### Group\-Relative Tool Efficiency\.
The unconstrained model over-translates high-resource languages where translation is unnecessary. To curb this, we add a lightweight reward post-processing step inspired by Nemotron Nano (NVIDIA, [2025](https://arxiv.org/html/2606.06835#bib.bib25)) and OTC (Wang et al., [2025c](https://arxiv.org/html/2606.06835#bib.bib38)). The idea is simple: within each group of $N$ samples for one prompt, reward the most efficient correct solution. Among correct samples only, those with fewer tool calls get a zero-sum bonus and those with more calls take a penalty:

$$r_i' = r_i - \lambda \cdot \tilde{w}_i \quad \forall\, i \in \mathcal{C} \tag{1}$$

where $\tilde{w}_i$ is the zero-mean centered cost weight within the correct set $\mathcal{C}$ and $\lambda$ controls the strength of cost pressure.
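A minimal sketch of Eq. (1) follows. It assumes the cost weight of a sample is simply its tool-call count, centered on the mean over correct samples; the paper's exact definition of $\tilde{w}_i$ is in its appendix and may be scaled differently. The function name is hypothetical.

```python
def apply_tool_efficiency(rewards: list[float], tool_calls: list[int],
                          lam: float = 0.1) -> list[float]:
    """Shift rewards of correct samples by their centered tool cost (Eq. 1).

    Assumption: the cost weight is the raw call count minus the mean call
    count over correct samples. Incorrect samples are left unchanged.
    """
    correct = [i for i, r in enumerate(rewards) if r > 0]
    adjusted = list(rewards)
    if not correct:
        return adjusted
    mean_calls = sum(tool_calls[i] for i in correct) / len(correct)
    for i in correct:
        adjusted[i] = rewards[i] - lam * (tool_calls[i] - mean_calls)
    return adjusted
```

For rewards `[1, 1, 0, 1]` with call counts `[0, 2, 1, 1]`, the three correct samples average one call, so the no-tool sample gains 0.1, the two-call sample loses 0.1, and the total reward in the group is unchanged.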
#### The Cascade Problem\.
This mechanism, however, cascades into over-suppression for low-resource languages. A lucky guess (e.g., the 25% chance on a four-way MCQ without comprehension) triggers the penalty on genuinely necessary tool use, forming a feedback loop that collapses tool adoption. We analyze this in Section [6.3](https://arxiv.org/html/2606.06835#S6.SS3).
Figure 3: Overview of the confidence-gated mechanism. The same group-relative formulation produces opposite outcomes depending on native competence: penalizing unnecessary translation for high-resource (top) while protecting necessary translation for low-resource (bottom). For MCQ, the model without comprehension guesses A/B/C/D randomly; one lucky correct does not meet the $K=6$ threshold. The gate adapts automatically via binomial statistics without language labels.
#### Confidence Gate\.
We resolve this with a gate requiring strong evidence of native competence before any cost pressure is applied (Figure [3](https://arxiv.org/html/2606.06835#S4.F3)):

$$r_i' = \begin{cases} r_i - \lambda \cdot \tilde{w}_i & \text{if } S_N \geq K \text{ and } |\mathcal{C}_T| > 0 \\ r_i & \text{otherwise} \end{cases} \tag{2}$$

where $S_N = \sum_{j: c_j = 0} \mathbf{1}[r_j > 0]$ counts no-tool correct samples in the group and $\mathcal{C}_T = \{i \in \mathcal{C} : c_i > 0\}$ is the subset of correct samples that used a tool. The gate fires only when at least $K$ of the $N$ samples in the group are correct without the tool and at least one correct sample used it, so one or two lucky guesses no longer trigger suppression. Its firing probability follows a binomial that adapts to language difficulty on its own, without tier labels. We use $N=8$, $\lambda=0.1$, and $K=6$; full derivation in Appendix [G](https://arxiv.org/html/2606.06835#A7).
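The gate in Eq. (2) and its binomial firing probability can be sketched as below. Two assumptions are made for illustration: the cost weight is the tool-call count centered over correct samples, and the firing probability treats all $N$ samples as independent no-tool attempts with native success rate `p_native`. The paper's full derivation is in its appendix; both function names are hypothetical.

```python
from math import comb


def gated_rewards(rewards: list[float], tool_calls: list[int],
                  lam: float = 0.1, k: int = 6) -> list[float]:
    """Apply the centered tool-cost shift only when the confidence gate fires."""
    correct = [i for i, r in enumerate(rewards) if r > 0]
    s_n = sum(1 for i in correct if tool_calls[i] == 0)       # no-tool correct count
    correct_with_tool = [i for i in correct if tool_calls[i] > 0]
    if s_n < k or not correct_with_tool:
        return list(rewards)  # gate closed: rewards pass through untouched
    mean_calls = sum(tool_calls[i] for i in correct) / len(correct)
    adjusted = list(rewards)
    for i in correct:
        adjusted[i] = rewards[i] - lam * (tool_calls[i] - mean_calls)
    return adjusted


def gate_fire_probability(p_native: float, n: int = 8, k: int = 6) -> float:
    """P(at least k of n no-tool attempts are correct) under a binomial model."""
    return sum(
        comb(n, s) * p_native ** s * (1 - p_native) ** (n - s)
        for s in range(k, n + 1)
    )
```

Under this simplified model, pure guessing on a four-way MCQ (`p_native = 0.25`) opens the gate with probability 277/65536, about 0.4%, while a native success rate of 0.9 opens it about 96% of the time. This is the sense in which the gate adapts to language difficulty without tier labels.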
### 4.4 Infrastructure and Tool Integration
We train on multi-node GPU clusters, with a co-located server on each node acting as both reward judge and translator. When the model emits a <tool\_call\> block, generation pauses, the call runs, and the result is injected as <tool\_response\>; for synthetic languages a deterministic lookup server replaces the LLM translator behind the same interface. Tool-response tokens are masked from the loss, and all domains are mixed within each rollout under inverse resource weighting. System prompts never reveal the input language, so the model must assess its own comprehension (Appendix [A](https://arxiv.org/html/2606.06835#A1)). We discuss the translator choice in Appendix [K](https://arxiv.org/html/2606.06835#A11).
### 4.5 Rewards
Verifiable domains (math, QA, IF) use deterministic extraction: exact match on \\boxed\{\} for math, letter match for QA, and keyword/constraint satisfaction times an LLM language-fluency gate for IF. Non-verifiable domains (summarization, translation) use an LLM judge with structured scoring. Full reward formulas and judge prompts are in Appendix [B](https://arxiv.org/html/2606.06835#A2).
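For illustration, deterministic extraction for the math and QA rewards might look like the sketch below. The function names and the "last standalone letter A–D" rule for QA are assumptions; the paper's actual reward formulas are in Appendix B.

```python
import re


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are counted so nested arguments such as \\frac{1}{2} survive.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces


def math_reward(response: str, gold: str) -> float:
    """Exact string match between the boxed answer and the gold answer."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0


def qa_reward(response: str, gold_letter: str) -> float:
    """Letter match, assuming the final standalone A-D in the response is the answer."""
    letters = re.findall(r"\b([A-D])\b", response)
    return 1.0 if letters and letters[-1] == gold_letter else 0.0
```

Both checks are independent of the language of the surrounding reasoning, which is what makes these domains verifiable after translation.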
## 5 Experiments
We run three experiments that progressively add tool access and cost sensitivity, each building on the last. All share the same data, hyperparameters (Appendix [C](https://arxiv.org/html/2606.06835#A3)), and GSPO with 8 samples per prompt. As defined in Section [4.2](https://arxiv.org/html/2606.06835#S4.SS2), every run is continued RL from the post-trained Qwen3-4B, and the Baseline is that checkpoint evaluated zero-shot.
### 5.1 No Tool (Data Validation)
We first train without tool access to establish what RL alone can do\. The model receives multilingual problems across all 22 languages and 5 domains and must solve them natively\. Over 100 steps, reward improves steadily\.
Table [2](https://arxiv.org/html/2606.06835#S5.T2) shows the gains, which are notable for an already heavily post-trained model: instruction following improves by +17.7 on Low and +15.6 on XLow, QA Low by +10.3, and Translation High by +14.0. Steady improvement across many domains and tiers at once is itself evidence that the pipeline is clean; corrupted reward would produce stagnation, not consistent gains.
Synthetic languages, our proxy for truly unseen input with zero pretraining exposure, stay essentially flat (6→7% math, 3→7% translation). After 100 steps the model still cannot process them, showing that comprehension is hard to acquire through RL alone and motivating the tool experiments that follow.
### 5.2 Free Tool (Unconstrained Translation)
We now add the translation tool and ask whether the model can learn when to use it. The tool is described in the system prompt and may be called up to twice per response, or not at all. There is no cost penalty: every correct answer earns full reward regardless of tool use, and tool-response tokens are masked from the loss. The model gets no signal about language difficulty or tool utility; task reward is the only supervision.
Adoption is fast and the reward gain is large: overall reward rises to 0.67 from 0.50 in the no-tool setting (Table [3](https://arxiv.org/html/2606.06835#S5.T3)). It is also uneven. Figure [6](https://arxiv.org/html/2606.06835#S6.F6) (left) shows tool use rising first for synthetic and Low-resource languages, where the reward signal is strongest, while High-resource lags. Instruction following never adopts the tool at any tier (0–2.4%), an early sign of domain-level discrimination.
As training continues, this tier separation collapses. High-resource tool use climbs until it meets Low-resource at ∼60–70%. By step 40 the model translates high-resource math 87.7% of the time despite no reward gain over the baseline (64.3% with or without the tool). The early language discrimination washes out into a blanket “always translate” policy. The over-translation does not hurt reward, but it is pure wasted cost, which motivates the cost-sensitive experiments next.
### 5.3 Cost-Sensitive Tool Use
The over-translation in Section [5.2](https://arxiv.org/html/2606.06835#S5.SS2) wastes cost without helping reward. Can the model instead be selective, translating only where the benefit justifies the cost?
We compare four cost mechanisms spanning the published approaches (Figure [4](https://arxiv.org/html/2606.06835#S5.F4)): a flat per-call penalty, OTC-GSPO (Wang et al., [2025c](https://arxiv.org/html/2606.06835#bib.bib38)) (the strongest published baseline for cost-sensitive tool RL, adapted to our setup), our group-relative adjustment without the gate, and our full gated method ($K=6$, $\lambda=0.1$).
Figure 4: Cost mechanism comparison. Left: tool cost over training. Right: reward over training. All three ungated mechanisms (flat, OTC, ungated group-relative) suppress tool use to 15–27% with 10–13 points lower reward. The gated model (green) maintains higher tool use and matches the free model’s reward.
The result is striking: all three ungated mechanisms converge to 15–30% tool use with reward 10–13 points below the free model (flat 27%, OTC-GSPO 23%, ungated group-relative 30%), regardless of formulation. Figure [5](https://arxiv.org/html/2606.06835#S5.F5) confirms it: varying $\lambda$ from 0.05 to 0.3 without the gate gives the same plateau every time. The over-suppression is structural, not a tuning issue.
Figure 5: Penalty strength does not matter: all ungated $\lambda$ values (0.05–0.3) converge to the same cascade plateau.
The gated model ($K=6$, $\lambda=0.1$) breaks the pattern. It settles at 56% tool use while matching free-tool reward (0.67), a 37% cut in overall translation cost with no loss in performance. The threshold matters: $K=4$ is too lenient and still suppresses to 36–49%, whereas $K=6$ demands strong evidence of native competence before any cost pressure applies.

| Dom | Tier | Baseline R% | +RL R% |
|---|---|---|---|
| Math | High | 63.7 | 66.3 |
| Math | Low | 26.2 | 28.0 |
| Math | XLow | 40.0 | 42.0 |
| Math | Synth | 6.0 | 7.0 |
| QA | High | 81.0 | 78.8 |
| QA | Low | 43.7 | 54.0 |
| QA | XLow | 54.0 | 55.0 |
| IF | High | 72.5 | 78.8 |
| IF | Low | 47.3 | 65.0 |
| IF | XLow | 55.4 | 71.0 |
| Summ | High | 43.9 | 52.9 |
| Summ | Low | 3.4 | 19.0 |
| Trans | High | 60.0 | 74.0 |
| Trans | Low | 10.5 | 15.0 |
| Trans | XLow | 14.7 | 21.0 |
| Trans | Synth | 3.0 | 7.0 |

Table 2: No-tool RL (100 steps). R% = reward. Baseline R% = Qwen3-4B zero-shot (no additional RL, no tool prompt). +RL R% = after 100 steps of RL on our multilingual data without tool access. Consistent gains validate pipeline quality. Bold = best.

| Dom | Tier | Baseline R% | Baseline C% | +RL Free R% | +RL Free C% | +RL Gate R% | +RL Gate C% |
|---|---|---|---|---|---|---|---|
| Math | High | 64.3 | 0.7 | 64.3 | 87.7 | 65.3 | 26.8 |
| Math | Low | 30.1 | 7.3 | 62.8 | 92.7 | 61.2 | 55.6 |
| Math | XLow | 41.2 | 6.2 | 61.7 | 91.0 | 62.3 | 51.2 |
| Math | Synth | 28.0 | 43.0 | 56.0 | 95.0 | 54.0 | 63.5 |
| QA | High | 82.3 | 0.0 | 82.3 | 77.2 | 80.7 | 19.7 |
| QA | Low | 49.3 | 1.4 | 76.0 | 66.2 | 74.5 | 47.5 |
| QA | XLow | 50.0 | 1.0 | 66.0 | 63.2 | 64.5 | 47.0 |
| IF | High | 69.7 | 0.0 | 70.1 | 0.0 | 74.5 | 0.0 |
| IF | Low | 48.0 | 0.9 | 60.4 | 2.4 | 61.2 | 0.7 |
| IF | XLow | 52.4 | 0.2 | 62.9 | 0.5 | 64.4 | 0.0 |
| Summ | High | 44.5 | 0.0 | 48.7 | 65.3 | 51.8 | 23.7 |
| Summ | Low | 2.5 | 31.8 | 13.9 | 94.6 | 12.3 | 89.6 |
| Trans | High | 78.2 | 11.1 | 89.6 | 52.5 | 89.5 | 50.0 |
| Trans | Low | 32.1 | 21.2 | 68.1 | 52.2 | 70.2 | 51.1 |
| Trans | XLow | 31.1 | 19.1 | 52.4 | 55.2 | 53.5 | 51.3 |
| Trans | Synth | 44.5 | 66.0 | 53.0 | 91.5 | 55.8 | 75.5 |

Table 3: Tool results. R% = reward, C% = tool usage normalized to max 2 calls (gray columns). Baseline = step 0 of free-tool run (tool available but policy untrained; non-zero C% reflects sporadic untrained tool calls). +RL Free/Gate = after RL training. Bold = best R% (within 0.6%).
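The cost-adjusted score $R - \alpha C$ from Figure 1 can be worked through on rows of Table 3. The sketch below takes R% and C% directly as percentage points, which is an assumption: the paper's actual scaling of reward and cost in this score is not given in this section, so the α values are illustrative only. Both function names are hypothetical.

```python
def cost_adjusted(reward: float, cost: float, alpha: float) -> float:
    """Cost-adjusted score R - alpha * C for one policy on one slice."""
    return reward - alpha * cost


def breakeven_alpha(free: tuple, gate: tuple) -> float:
    """Smallest alpha at which the cheaper gated policy ties the free policy.

    Each argument is an (R%, C%) pair. Returns 0.0 when the gated policy
    already has reward at least as high, so it wins for every alpha >= 0.
    """
    (r_free, c_free), (r_gate, c_gate) = free, gate
    if c_free <= c_gate:
        raise ValueError("expects the gated policy to be the cheaper one")
    return max(0.0, (r_free - r_gate) / (c_free - c_gate))


# Rows from Table 3 as (R%, C%).
math_high = {"free": (64.3, 87.7), "gate": (65.3, 26.8)}
math_low = {"free": (62.8, 92.7), "gate": (61.2, 55.6)}
```

On Math High the gated policy has both higher reward and lower cost, so it wins at every α. On Math Low it gives up 1.6 reward points for 37.1 fewer cost points, so under this scaling it wins once α exceeds roughly 0.043.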
## 6 Results and Analysis
### 6\.1 What RL Alone Reveals
The no\-tool results \(Table [2](https://arxiv.org/html/2606.06835#S5.T2)\) show two regimes\. For natural languages, RL extracts real gains even from a heavily post\-trained model: instruction following transfers format compliance across languages \(\+17\.7 Low, \+15\.6 XLow\), and QA improves cross\-lingually \(\+10\.3 Low\), suggesting some reasoning patterns carry across scripts without explicit translation\. The model reasons better in languages it partially understands\.
Synthetic languages behave differently\. The model does not partially understand them; it does not understand them at all\. Near\-zero improvement \(6→7% math, 3→7% translation\) over 100 steps marks a hard boundary: RL can sharpen existing comprehension but cannot create it from nothing\. This line between “partial understanding RL can improve” and “zero understanding RL cannot fix” is exactly the boundary the translation tool must bridge\.
### 6\.2 Unconstrained Tool Use
With the tool available and no cost penalty, adoption is rapid: reward improves sharply within the first 20 steps as the model discovers that translation helps\.
The adoption is uneven\. Tool use picks up first for XLow and Low\-resource languages, where the signal is strongest because the model cannot solve these problems without translation, while High\-resource stays low: clear language\-level discrimination\. Domain\-level discrimination is present from the start too: instruction following never adopts the tool at any tier \(0–2\.4%\), because format compliance \(keywords, constraints, target language\) needs no deep comprehension, so the tool never helps and reward never reinforces it\.
As training continues, the language discrimination bleeds away\. With no cost pressure, translation is never punished, so it is always safe, even when pointless\. High\-resource math reaches 87\.7% tool use at convergence despite no reward gain over the baseline \(64\.3% either way\)\. The early language awareness collapses into a default “always translate\.” Domain discrimination, by contrast, holds: IF stays near zero throughout, since the reward never reinforces tool calls there\.
The tool\-call counts show a related pattern\. For summarization and translation, two calls are sensible: translate the input in, then translate the answer back, both serving a purpose\. Early on, 1\-call and 2\-call usage are balanced; by step 20, 2\-call dominates everywhere, 70% for High\-resource and 87% for synthetic\. This even includes math and QA, where the second call is pointless because the answer is `\boxed{42}` or a letter in any language\. The model adopts “translate both ways” as a blanket habit, ignoring that some output formats are language\-agnostic\. Under the gate, 2\-call usage falls to 1–3% for natural languages but stays high for synthetic \(57%\), showing the model learns that one call suffices when the answer needs no target\-language output\.
So the model is capable of both language\- and domain\-adaptive behavior\. Domain discrimination persists on its own; language discrimination appears early but is fragile without cost pressure\. The gate’s job is to preserve it\.
### 6\.3 Gated: Discrimination Emerges
Figure 6: Per\-tier cost over training\. Left: free\-tool model converges to uniformly high cost across all tiers\. Right: gated model maintains clear tier separation, trimming High\-resource cost while preserving Low/Synth tool adoption\.
The confidence gate pulls back the free model’s indiscriminate translation, and in doing so sharpens the fragile language\- and domain\-level discrimination into a stable policy \(Figure [6](https://arxiv.org/html/2606.06835#S6.F6)\)\.
#### Language discrimination\.
High\-resource tool use drops from 87\.7% to 26\.8% \(math\) and 77\.2% to 19\.7% \(QA\): the model recognizes it can solve these natively\. Low\-resource and XLow stay high \(51–56% math, 47–48% QA\) and synthetic stays at 64–76%\. The tier separation that washed out in the free model becomes the defining feature of the gated policy\.
#### Domain discrimination\.
Instruction following stays at 0% tool use across all tiers, unchanged from the free model, confirming intrinsic domain behavior rather than an effect of cost pressure\. Translation keeps 50–75% tool use even for High\-resource languages, which is not a failure of selectivity: the task *is* translation, and the 122B translator beats what the 4B model writes natively\. Here the tool improves output quality regardless of comprehension, a different reason to call it than in math or QA, where it unlocks comprehension\. Summarization High drops from 65% to 24% while Low stays at 90%: the model learns that high\-resource summaries can be written natively but low\-resource ones cannot\.
#### Pareto dominance\.
The gated model matches or beats unconstrained reward on 10 of 16 tier\-domain conditions while translating far less \(Table [3](https://arxiv.org/html/2606.06835#S5.T3)\)\. The Pareto analysis \(Figure [1](https://arxiv.org/html/2606.06835#S0.F1)\) makes this precise: the gated model dominates 100% of the cost\-sensitivity range for Low, 95% for XLow, and 100% for Synth\. The unconstrained model wins only on high\-resource conditions where cost is irrelevant \(Math High, Trans High\); everywhere else, the gated model dominates\.
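One way to read "dominates X% of the cost\-sensitivity range" is as a break\-even calculation\. The sketch below assumes a linear utility $U = R - \alpha C$ with cost sensitivity $\alpha \in [0, 1]$; this utility form, the range of $\alpha$, and the function name `gated_share` are assumptions made for illustration, and the definition behind Figure 1 may differ\. The inputs in the usage lines are the Math rows of Table 3\.

```python
def gated_share(r_free, c_free, r_gate, c_gate, alpha_max=1.0):
    """Share of alpha in [0, alpha_max] where the gated policy's utility is at
    least the free policy's, under the ASSUMED utility U = R - alpha * C.

    R and C are in percentage points; alpha prices one point of tool cost
    in points of reward.
    """
    d_cost = c_free - c_gate      # cost saved by the gated policy
    d_reward = r_free - r_gate    # reward given up by the gated policy
    if d_cost == 0:
        # Equal cost: the comparison reduces to reward alone.
        return 1.0 if d_reward <= 0 else 0.0
    breakeven = d_reward / d_cost
    clipped = min(max(breakeven, 0.0), alpha_max)
    if d_cost > 0:
        # Gated is cheaper: it wins for every alpha at or above break-even.
        return (alpha_max - clipped) / alpha_max
    # Gated is more expensive: it wins only for alpha at or below break-even.
    return clipped / alpha_max


# Math Low (Table 3): Free R=62.8, C=92.7; Gate R=61.2, C=55.6.
print(round(gated_share(62.8, 92.7, 61.2, 55.6), 3))
# Math High (Table 3): the gate has higher reward and lower cost.
print(gated_share(64.3, 87.7, 65.3, 26.8))
```

Under this assumed utility, a small reward deficit combined with a large cost saving pushes the break\-even $\alpha$ close to zero, which is why a gated policy can cover most of the range even on rows where its raw reward is slightly lower\.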
#### Why the gate works\.
Section [5\.3](https://arxiv.org/html/2606.06835#S5.SS3) showed all ungated mechanisms settle at the same suppressed equilibrium\. The cause is the MCQ guess rate: a model that cannot read the input still guesses correctly 25% of the time, so in a group of $N=8$ the chance of at least one lucky no\-tool success is $1-0.75^{N}\approx 90\%$\. The gate \($K=6$\) changes the picture\. For high\-resource \($p\approx 0.75$\), the binomial probability of 6\+ successes is ${\sim}0.68$, so cost pressure fires often; for low\-resource \($p\approx 0.26$\) it falls to ${\sim}0.005$, leaving tool use protected\. The gate adapts to language difficulty from group statistics alone, with no tier labels or curriculum\.
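The arithmetic in this argument can be checked directly\. The sketch below is illustrative rather than the authors' code: the helper names `p_any_success` and `p_gate_fires` are hypothetical, and it assumes the 8 samples in a group are independent no\-tool attempts sharing one success probability\.

```python
from math import comb


def p_any_success(p: float, n: int = 8) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def p_gate_fires(p: float, k: int = 6, n: int = 8) -> float:
    """Binomial tail: probability of k or more successes in n attempts."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


# Ungated trigger at the 25% multiple-choice guess rate.
print(f"ungated, p=0.25: {p_any_success(0.25):.3f}")
# K=6 gate at the per-tier success rates quoted in the text.
for tier, p in [("high-resource", 0.75), ("low-resource", 0.26)]:
    print(f"gate K=6, {tier} (p={p}): {p_gate_fires(p):.4f}")
```

The contrast is the point: a trigger that needs one success fires on chance alone, while a trigger that needs six of eight is almost unreachable unless the model is genuinely competent without the tool\.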
#### Per\-problem selectivity\.
The gated model does not just apply a per\-language probability; it decides per problem\. Figure [7](https://arxiv.org/html/2606.06835#S6.F7) plots the distribution of per\-prompt tool\-use rates \(8 samples each, equal tier weighting\)\. The baseline model clusters at 0 \(no tool use\), the free model collapses to 1 \(always translates\), and the gated model is bimodal: a peak at 0 for problems it solves natively \(mostly High\-resource\) and a peak at 1 for problems needing translation \(mostly Low/XLow\)\. The two peaks are the signature of learned selectivity, the model committing to a strategy per problem rather than a flat rate\. The pattern holds for both Math and QA\.
Figure 7: Per\-problem tool use distribution \(8 samples/prompt, equal tier weighting\)\. Baseline \(orange\) clusters at 0, Free \(blue\) at 1, Gated \(green\) shows bimodal selectivity with peaks at both extremes\.
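A distribution of this kind can be computed from rollout logs in a few lines\. The sketch below is one plausible reading of "8 samples each, equal tier weighting", not the authors' plotting code: the function name, the input layout, and the toy rollouts are all made up for illustration\.

```python
from collections import defaultdict


def tool_use_histogram(prompts):
    """Weighted histogram of per-prompt tool-use rates.

    prompts: list of (tier, flags) pairs, where flags holds one boolean per
    sampled rollout (True = the rollout called the tool).
    Each tier contributes the same total weight, split evenly over its prompts.
    """
    rates_by_tier = defaultdict(list)
    for tier, flags in prompts:
        rates_by_tier[tier].append(sum(flags) / len(flags))

    hist = defaultdict(float)
    for rates in rates_by_tier.values():
        weight = 1.0 / (len(rates_by_tier) * len(rates))
        for rate in rates:
            hist[rate] += weight
    return dict(hist)


# Toy input (hypothetical): three High prompts, one Low prompt, 8 rollouts each.
toy = [
    ("High", [False] * 8),
    ("High", [False] * 8),
    ("High", [True] + [False] * 7),
    ("Low", [True] * 8),
]
print(tool_use_histogram(toy))
```

With 8 rollouts per prompt the rate can only take nine values, so mass concentrated at 0 and 1 means the policy makes the same choice on nearly every rollout of a given prompt\.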
### 6\.4 RL Teaches Not Just When, But How
Low\-resource summarization needs a multi\-step workflow the model must discover entirely from reward: \(1\) recognize it cannot read the article, \(2\) translate it to English, \(3\) summarize in English, \(4\) translate the summary back, \(5\) format the output\. None of this is demonstrated; the model sees only a binary reward on summary quality\.
Across 200 Low\-resource summarization samples, the strategy emerges over training\. Two\-call usage, the pattern this workflow needs, rises from 29% at step 0 to 87\.5% \(free\) and 81% \(gated\) by step 39\. Among samples that use the correct translate\-in, translate\-out pattern, success rises from 23% to 39% \(free\) and 42% \(gated\)\. The model learns not only *when* to use the tool but *how* to run the multi\-step workflow\.
The main remaining failure is the language gate: 41–43% of samples still return the summary in English, so the back\-translation step is not yet reliable\. The trajectory is still improving at our training horizon and should continue with more steps\. The absolute scores \(32–35% success\) reflect the difficulty of a 5\-step emergent workflow, not weak data or judging; math and QA, which need a single call, reach 60–76%\.
### 6\.5 Synthetic Languages: A Completely Unseen Language
To simulate how the model would behave on a completely unseen language, we evaluate on the synthetic languages Kivari and Toqal \(Li et al\., [2025b](https://arxiv.org/html/2606.06835#bib.bib20)\)\. The model has zero pretraining exposure to them, so the only route to a correct answer is the tool, and the correct behavior is to always translate\.
The gated policy does exactly this\. The unconstrained model translates almost always \(91–95% math, 75–92% translation\), and the gated model keeps high usage under cost pressure \(63\.5% math, 75\.5% translation\), against near\-zero improvement without the tool \(Section [6\.1](https://arxiv.org/html/2606.06835#S6.SS1)\)\. The model produces the same “I cannot parse this” response it uses for extreme low\-resource natural languages, but here on inputs it has no way to comprehend\.
The difference between synthetic and XLow natural languages is quantitative, not qualitative: synthetic tool usage is slightly higher \(consistent with truly zero comprehension\), and the reward gain is correspondingly larger \(\+18\.7 points averaged over math and translation\)\. The policy treats synthetic languages as the extreme point of the same competence continuum it learned for natural languages, not as a separate case\.
### 6\.6 Generalization to Held\-Out Languages
Does the policy transfer to languages entirely absent from training? We evaluate all models on 9 held\-out languages in two tiers: High \(Hindi, Turkish, Korean\) and Low \(Igbo, Shona, Sesotho, Oromo, Tswana, Maori\), with 50 samples per language across 4 domains \(1,800 total\)\. None appeared in any training data\.
Table 4: Zero\-shot generalization to 9 held\-out languages\. High = Hindi, Turkish, Korean\. Low = Igbo, Shona, Sesotho, Oromo, Tswana, Maori\. No held\-out language appeared in training\. Bold = best R%\.
Table [4](https://arxiv.org/html/2606.06835#S6.T4) reveals three key findings:
#### Tool\-use discrimination transfers\.
The gated model cuts Math High cost from 87% \(Free\) to 32% while holding 56% on Math Low, correctly judging that held\-out high\-resource languages need less translation than held\-out low\-resource ones, despite never training on any of them\.
#### Domain discrimination transfers perfectly\.
IF stays at 0% tool use for all held\-out languages and all methods, repeating the training\-language pattern\. The learned rule that format compliance never needs translation is fully language\-independent\.
#### The cascade problem generalizes\.
Flat and OTC collapse QA Low tool use to ${\sim}1\%$ on held\-out languages \(vs\. 50% for Gated\), with reward drops of 29 and 27 points\. The over\-suppression seen on training languages carries over to held\-out ones, confirming it is a structural property of ungated cost mechanisms, not a language\-specific artifact\.
### 6\.7 External Benchmarks
To check that the policy transfers beyond our own evaluation setup, we test on two established native multilingual benchmarks: MGSM \(Shi et al\., [2023](https://arxiv.org/html/2606.06835#bib.bib31)\) \(math, 11 languages, 250 samples each\) and Global MMLU \(Singh et al\., [2024](https://arxiv.org/html/2606.06835#bib.bib32)\) \(QA, 21 languages, 250 samples each\)\. Both use native, untranslated questions, testing whether the tool\-use decisions hold on a different data distribution\.
Table 5: External benchmark results\. R% = reward, C% = tool cost\. MGSM High = 9 languages with baseline \>80%; MGSM Low = Swahili, Telugu\. MMLU High = 15 languages with baseline \>60%; MMLU Low = Hausa, Igbo, Shona, Somali, Swahili, Yoruba\.
Table [5](https://arxiv.org/html/2606.06835#S6.T5) confirms the policy transfers to native benchmarks\. The gated model keeps clear tier discrimination: MGSM High C=21% vs\. Low C=57%, and MMLU High C=23% vs\. Low C=51%\. On MGSM it matches free\-tool reward for high\-resource languages \(88\.6 vs\. 88\.9\) at 71 fewer points of cost\. For low\-resource MGSM \(Swahili, Telugu\), the tool lifts reward from 51\.4 to 86\.2 \(Free\), and the gated model captures most of that gain \(83\.2, \+31\.8 over baseline\) at 57% cost against Free’s 93%\. Flat and OTC again collapse MMLU Low tool use to 1–2%, leaving reward at the baseline level \(41–42%\), reproducing the cascade on external data\.
## 7 Conclusion
We presented a single policy that decides when to translate from reward alone, developing language\- and domain\-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively\. Using data built by our answer\-preserving translation pipeline, we continue RL on the post\-trained Qwen3\-4B across 22 languages and 5 domains, and introduce confidence\-gated GSPO for cost\-sensitive tool use\. The gated policy lifts reward over the baseline by \+4\.6 on High, \+23\.5 on Low, and \+17\.5 on XLow, and against an unconstrained policy that almost always translates it preserves full reward at 63% of the cost while remaining Pareto\-optimal across 87% of the cost\-sensitivity range\.
To simulate how the model would behave on a completely unseen language, we create 2 synthetic languages, where the gated policy improves \+18\.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs\. The policy transfers zero\-shot to 9 held\-out languages absent from training, learning a general competence\-assessment rule rather than memorizing per\-language behavior, and our answer\-preserving pipeline supplies clean multilingual reward at 98\.4% fidelity, the core requirement for RLVR on translated data\.
## Limitations
#### Translation model quality\.
Our pipeline relies on Qwen3\.5\-122B as translator, which exhibits language proximity confusion for some low\-resource languages \(defaulting to a related higher\-resource neighbor\)\. Our final evaluation covers 22 natural languages where translation accuracy exceeds 98%\.
#### Single model\.
We evaluate only Qwen3\-4B; generalization of the learned tool policy to other model families is not tested\.
#### Single\-seed runs\.
Our results are reported from single training runs\. We mitigate run\-to\-run noise by evaluating over large per\-tier sample pools and by emphasizing effects \(e\.g\., the cascade collapse, tier\-level cost separation\) that are large relative to plausible seed variance, but we do not report standard deviations across seeds\.
## Future Work
Key directions include multi\-tier tool selection \(multiple translators of varying quality and cost\), scaling to larger models where the competence boundary shifts, and inference\-time cost control mechanisms that allow a single checkpoint to serve strict budget constraints\.
## Broader Impact
Selective translation can broaden access to LLM capabilities for speakers of low\-resource languages by routing comprehension through a stronger language only when needed, reducing unnecessary compute for languages the model already handles\. However, reliance on a translation model inherits its biases: translation errors or cultural flattening may propagate into downstream answers, and quality remains uneven across languages\. The translation pipeline also carries a compute footprint \(a large translator served alongside the policy\), which deployments should weigh against the accessibility benefits\.
## References
- and others \[2025\] and others\. Gspo: Group sampling policy optimization\. *arXiv preprint arXiv:2507\.18071*, 2025\.
- Chen et al\. \[2024\] Jiaqi Chen et al\. Mathoctopus: Building math generalist models through hybrid instruction tuning\. *arXiv preprint arXiv:2306\.02670*, 2024\.
- Chen et al\. \[2025\] Yifan Chen et al\. xrouter: Cost\-aware multi\-llm routing via reinforcement learning\. *arXiv preprint*, 2025\.
- DeepSeek\-AI \[2025\] DeepSeek\-AI\. Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\. *arXiv preprint arXiv:2501\.12948*, 2025\.
- Dobler et al\. \[2025\] Alexander Dobler et al\. macereason\-math: A dataset of high\-quality multilingual math problems ready for rlvr\. *arXiv preprint arXiv:2603\.10767*, 2025\.
- Feng et al\. \[2025\] Jiazhan Feng et al\. Retool: Reinforcement learning for strategic tool use in llms\. *arXiv preprint*, 2025\.
- Gou et al\. \[2024\] Zhibin Gou et al\. Tora: A tool\-integrated reasoning agent for mathematical problem solving\. *ICLR*, 2024\.
- Guo et al\. \[2025\] Yifan Guo et al\. Think when needed: Model\-aware reasoning routing for llm\-based ranking\. *arXiv preprint arXiv:2601\.18146*, 2025\.
- Hasan et al\. \[2021\] Tahmid Hasan et al\. Xl\-sum: Large\-scale multilingual abstractive summarization for 44 languages\. *Findings of ACL*, 2021\.
- Hofman et al\. \[2025\] Yael Hofman et al\. Maps: Multilingual agent performance and security benchmark\. *arXiv preprint arXiv:2504\.04830*, 2025\.
- Huang et al\. \[2023\] Haoyang Huang et al\. Not all languages are created equal in llms: Improving multilingual capability by cross\-lingual\-thought prompting\. *Findings of EMNLP*, 2023\.
- Huang et al\. \[2025\] Yihong Huang et al\. Beyond english\-centric training: Multilingual rl generalizes better\. *arXiv preprint arXiv:2504\.15855*, 2025\.
- Huo et al\. \[2025\] Jiaheng Huo et al\. Enhancing non\-english capabilities of english\-centric llms through deep supervision fine\-tuning\. *arXiv preprint arXiv:2503\.01275*, 2025\.
- Jin et al\. \[2025\] Bowen Jin et al\. Search\-r1: Training llms to reason and leverage search engines with reinforcement learning\. *arXiv preprint*, 2025\.
- Kang et al\. \[2025\] Minki Kang et al\. Selective translation for cross\-lingual reasoning in llms\. *arXiv preprint*, 2025\.
- Kim et al\. \[2025a\] Jiwoo Kim et al\. Ust: Understand, solve, and translate for multilingual mathematical reasoning\. *arXiv preprint arXiv:2503\.08548*, 2025a\.
- Kim et al\. \[2025b\] Seonghyeon Kim et al\. Cross\-lingual collapse: Grpo causes chain\-of\-thought language drift\. *arXiv preprint arXiv:2504\.09643*, 2025b\.
- Kulkarni et al\. \[2025\] Aniket Kulkarni et al\. Massive\-agents: Multilingual function calling evaluation across 52 languages\. *arXiv preprint*, 2025\.
- Li et al\. \[2025a\] Xufeng Li et al\. Torl: Scaling tool\-integrated rl\. *arXiv preprint*, 2025a\.
- Li et al\. \[2025b\] Yucheng Li et al\. Cipherbank: Exploring the boundary of llm reasoning via cryptography challenges\. *arXiv preprint arXiv:2504\.19093*, 2025b\.
- Li et al\. \[2025c\] Yucheng Li et al\. Qalign: Aligning multilingual reasoning via question translation\. *arXiv preprint arXiv:2504\.01277*, 2025c\.
- Liu et al\. \[2025a\] Zhenyu Liu et al\. M2a: Multilingual reasoning with monolingual anchoring\. *arXiv preprint*, 2025a\.
- Liu et al\. \[2025b\] Zhihao Liu et al\. Dr\. grpo: Removing biases from group relative policy optimization\. *arXiv preprint arXiv:2503\.20783*, 2025b\.
- NLLB Team \[2022\] NLLB Team\. No language left behind: Scaling human\-centered machine translation\. *arXiv preprint arXiv:2207\.04672*, 2022\.
- NVIDIA \[2025\] NVIDIA\. Nemotron\-4 nano: Small language models with group\-relative policy optimization\. *arXiv preprint arXiv:2505\.00949*, 2025\.
- Ong et al\. \[2024\] Isaac Ong et al\. Routellm: Learning to route llms with preference data\. *arXiv preprint*, 2024\.
- Park et al\. \[2025\] Jihyun Park et al\. Think natively: Language\-consistent reasoning for multilingual llms\. *arXiv preprint*, 2025\.
- Qin et al\. \[2023\] Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che\. Cross\-lingual prompting: Improving zero\-shot chain\-of\-thought reasoning across languages\. In *Proceedings of EMNLP*, 2023\.
- Qiu et al\. \[2025\] Yifan Qiu et al\. Multilingual mathematical reasoning: Bridging the gap\. *arXiv preprint*, 2025\.
- Shao et al\. \[2024\] Zhihong Shao et al\. Deepseekmath: Pushing the limits of mathematical reasoning in open language models\. *arXiv preprint arXiv:2402\.03300*, 2024\.
- Shi et al\. \[2023\] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei\. Language models are multilingual chain\-of\-thought reasoners\. In *International Conference on Learning Representations*, 2023\.
- Singh et al\. \[2024\] Shivalika Singh et al\. Global\-mmlu: Understanding and addressing cultural and linguistic challenges in multilingual evaluation\. *arXiv preprint arXiv:2412\.03304*, 2024\.
- Son et al\. \[2025a\] Yongho Son et al\. A survey on multilingual reasoning in language models\. *arXiv preprint arXiv:2502\.21370*, 2025a\.
- Son et al\. \[2025b\] Yongho Son et al\. Tapo: Task\-adaptive pivot optimization for multilingual mathematical reasoning\. *arXiv preprint arXiv:2505\.06789*, 2025b\.
- Team \[2025\] Qwen Team\. Qwen3 technical report\. *arXiv preprint arXiv:2505\.09388*, 2025\.
- Wang et al\. \[2025a\] Hao Wang et al\. Pb\-rlsvr: Policy\-based rl with semantic verification rewards for multilingual reasoning\. *arXiv preprint*, 2025a\.
- Wang et al\. \[2025b\] Junhong Wang et al\. X\-webagentbench: Benchmarking multilingual web agents across 14 languages\. *arXiv preprint*, 2025b\.
- Wang et al\. \[2025c\] Yifei Wang et al\. Optimizing tool calls in llms via reinforcement learning\. *arXiv preprint*, 2025c\.
- Wang et al\. \[2025d\] Zihao Wang et al\. To code or not to code? adaptive tool integration for mathematical reasoning\. *ACL*, 2025d\.
- Wu et al\. \[2025\] Zheng Wu et al\. Language as a latent variable for reasoning optimization\. *arXiv preprint arXiv:2604\.21593*, 2025\.
- Yu et al\. \[2025\] Qiying Yu et al\. Dapo: An open\-source llm reinforcement learning system at scale\. *arXiv preprint arXiv:2503\.14476*, 2025\.
- Yu et al\. \[2024\] Yuanqing Yu et al\. Steptool: Enhancing multi\-step tool usage via step\-grained reinforcement learning\. *arXiv preprint arXiv:2410\.07745*, 2024\.
- Yue et al\. \[2024\] Xiang Yue et al\. Mmlu\-prox: A multilingual extension of mmlu\-pro across 29 languages\. *arXiv preprint*, 2024\.
- Zhang et al\. \[2025a\] Hao Zhang et al\. Tool\-r1: Sample\-efficient rl for agentic tool use\. *arXiv preprint arXiv:2509\.12867*, 2025a\.
- Zhang et al\. \[2025b\] Wei Zhang et al\. Lcrl: Seen and unseen language generalization in multilingual rag via reinforcement learning\. *arXiv preprint*, 2025b\.
- Zhang et al\. \[2025c\] Yiping Zhang et al\. Simplerl: A simple framework for training language model reasoning with reinforcement learning\. *arXiv preprint arXiv:2501\.04519*, 2025c\.
- Zhang et al\. \[2025d\] Yuxuan Zhang et al\. Router\-r1: Learning multi\-round routing as sequential decision making\. *arXiv preprint*, 2025d\.
Appendix
## Appendix A Full System Prompts
Each domain uses a task\-specific system prompt prepended to the user’s input\. All prompts are language\-agnostic: they instruct the model to handle input “in any language” without specifying which one, letting the model decide whether to solve natively or invoke the translation tool\. When `TOOL_MODE=free`, the tool\-use prompt \(the last prompt in Figure 8\) is appended, giving the model access to a translate function it may call up to twice per turn\. Placeholder variables \(e\.g\., \{language\}, \{source\_lang\}\) are filled at data\-loading time from the sample metadata\.
**Math**

You are a math assistant\. You will receive a math problem that may be written in any language\.

Instructions:

1. Solve the problem step by step, showing your reasoning\.
2. Place your final answer inside \\boxed\{\}\. The answer inside \\boxed\{\} should be the mathematical answer only\.

Examples of correct output format:

- For a numeric answer: \\boxed\{42\}
- For a fraction: \\boxed\{\\frac\{1\}\{2\}\}
- For an expression: \\boxed\{2x \+ 3\}

**QA \(Multiple Choice\)**

You are a knowledgeable assistant\. You will receive a multiple\-choice question that may be written in any language\.

Instructions:

1. Think through the problem step by step\.
2. Place your final answer inside <answer\> tags\. The answer should be ONLY the option letter \(A, B, C, D, etc\.\)\.

Example output format:

I think the answer is B because\.\.\.

<answer\>B</answer\>

**Instruction Following**

You are a writing assistant\. You will receive a prompt in \{language\} containing:

- Keywords: A comma\-separated list of words you MUST include\.
- Constraints: One or more rules your response MUST satisfy\.

Instructions:

1. Read the keywords and constraints carefully\.
2. Write a coherent response that naturally incorporates ALL keywords\.
3. Your response MUST satisfy ALL constraints exactly\.
4. Write your entire response in \{language\}\.
5. Place your response inside <answer\> tags\.

**Summarization**

You are a summarization assistant\. You will receive a news article that may be written in any language\.

Instructions:

1. Read the article carefully\.
2. Write a concise summary in EXACTLY 1 sentence that captures the main topic and key details\.
3. The summary MUST be written in the same language as the article\.
4. Place your final summary inside <answer\> tags\.

**Translation**

You are a translation assistant\. Translate the given text from \{source\_lang\} to \{target\_lang\}\.

Instructions:

1. Translate the text accurately and fluently\.
2. Preserve the meaning, tone, and style of the original\.
3. Do not add, remove, or change any information\.
4. Place your final translation inside <answer\> tags\.

**Tool Use \(appended when available\)**

You have access to a translation tool\. To use it, output:

<tool\_call\>\{"name": "translate", "arguments": \{"text": "\.\.\.", "target\_lang": "\.\.\."\}\}</tool\_call\>

The tool will respond with the translation:

<tool\_response\><translated text\></tool\_response\>

You can translate to or from any language\.

You may use the tool up to 2 times, or not at all\.

For example, you can translate the input to English to understand it, then translate your answer back\.

Figure 8: System prompts for all five domains plus the tool\-use prompt appended when `TOOL_MODE=free`\.
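The tool\-call format in the tool\-use prompt can be parsed with a few lines of code\. The sketch below shows how a rollout harness might do it and is not the authors' implementation: `extract_tool_calls`, `normalized_cost`, and `MAX_CALLS` are hypothetical names, and silently skipping malformed calls is an assumption\. It pulls JSON payloads out of `<tool_call>` tags, enforces the two\-call budget, and reports usage normalized to that budget, following the C% convention of Table 3\.

```python
import json
import re

MAX_CALLS = 2  # the prompt allows at most two translate calls per turn
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return well-formed translate calls in order, capped at MAX_CALLS."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # assumption: malformed JSON is ignored, not repaired
        if not isinstance(call, dict) or call.get("name") != "translate":
            continue
        args = call.get("arguments")
        if isinstance(args, dict) and all(k in args for k in ("text", "target_lang")):
            calls.append(call)
    return calls[:MAX_CALLS]


def normalized_cost(text: str) -> float:
    """Tool usage as a fraction of the two-call budget (0.0, 0.5, or 1.0)."""
    return len(extract_tool_calls(text)) / MAX_CALLS


rollout = (
    "I cannot read this. "
    '<tool_call>{"name": "translate", "arguments": '
    '{"text": "habari", "target_lang": "English"}}</tool_call>'
)
print(normalized_cost(rollout))
```

Averaging `normalized_cost` over rollouts gives a cost figure on the same 0–100% scale as the C% columns, where one call counts as half the budget\.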
## Appendix B Full Reward Judge Prompts
Math and QA use deterministic extraction \(`\boxed{}` matching and option\-letter parsing\) and do not require an LLM judge\. The remaining three domains use an LLM judge \(Qwen3\.5\-122B\-A10B\) that evaluates the model’s response against the reference answer, with the judge prompts shown below\. Summarization applies heuristic pre\-filters \(verbatim copying, length ratio\) before the LLM call to avoid wasting judge compute on clearly incorrect outputs\. All judges output structured scores inside XML tags for reliable parsing\.
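Deterministic extraction for the two verifiable domains might look like the sketch below\. The function names are hypothetical, and two choices are assumptions rather than details from the paper: the last occurrence in the response is taken as the final answer, and nested braces are handled by depth counting instead of a regular expression\. Comparing the extracted answer against the label is left out\.

```python
import re

BOXED = "\\boxed{"


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...}, or None if absent/unbalanced."""
    start = text.rfind(BOXED)
    if start == -1:
        return None
    begin = start + len(BOXED)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
    return None  # closing brace never found


def extract_choice(text: str):
    """Return the option letter inside the last <answer>...</answer> tag."""
    matches = re.findall(r"<answer>\s*([A-Za-z])\s*</answer>", text)
    return matches[-1].upper() if matches else None


print(extract_boxed(r"Half of one is \boxed{\frac{1}{2}}."))
print(extract_choice("I think the answer is B because... <answer>B</answer>"))
```

Depth counting matters for answers such as fractions, where a non\-greedy regular expression would stop at the first closing brace\.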
**Summarization Judge**

Pre\-filters \(before LLM call\):

1. Verbatim gate: If \>50% of sentences copied from source → 0
2. Length penalty: $\text{mult}=e^{-0.5(r-2)}$ for ratio $r=\text{generated length}/\text{reference length}>2$

LLM judge prompt:

You are an expert summarization evaluator\.

Gates \(PASS/FAIL; if ANY fails, score is 0\):

1. Main Topic: Same topic as reference?
2. Factual Accuracy: No hallucinations?
3. Is a Summary: Condensed, not a copy?

Score \(only if all gates pass\):

4. Key Detail Coverage: \(0/1/2/3/4\)

Output: <output\>FAIL</output\> or: <output\>PASS,3</output\>

Final reward: $\text{score}/4\times\text{length\_mult}$

**Translation Judge**

You are an expert translation evaluator\.

Score on two dimensions \(0, 1, or 2 each\):

1. Accuracy: Does the translation preserve the meaning of the original?
2. Completeness: Is all content translated without omission?

Output ONLY inside <output\> tags: two scores separated by commas\.

Example: <output\>2,1</output\>

Final reward: $(\text{accuracy}+\text{completeness})/4$

**Instruction Following Judge**

Evaluate on two dimensions \(0, 1, or 2 each\):

1. Language: Written in \{language\}? \(0=English, 1=mixed, 2=target\)
2. Coherence: Fluent? \(0=gibberish, 1=partial, 2=fluent\)

Output: <output\>2,2</output\>

Final reward: $\frac{\text{lang}+\text{coh}}{4}\times(0.5\cdot\text{kw}+0.5\cdot\text{cstr})$

Language = 0 is a hard gate \(reward = 0\)\.

Figure 9: Reward judge prompts for summarization, translation, and instruction following\.
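The three final\-reward formulas in Figure 9 translate directly into code\. The sketch below is a reading of those formulas, not the authors' implementation, and it fills two gaps with assumptions: the length multiplier is taken to be 1 when the length ratio is at most 2, and `kw` and `cstr` are taken to be fractions in \[0, 1\] of keywords included and constraints satisfied\. All function and argument names are hypothetical\.

```python
import math


def summarization_reward(gates_passed: bool, coverage: int,
                         length_ratio: float, copied_fraction: float) -> float:
    """score/4 x length_mult, with the verbatim and judge gates applied first."""
    if copied_fraction > 0.5 or not gates_passed:
        return 0.0
    # Assumption: no penalty when the summary is at most twice the reference length.
    mult = math.exp(-0.5 * (length_ratio - 2)) if length_ratio > 2 else 1.0
    return coverage / 4 * mult


def translation_reward(accuracy: int, completeness: int) -> float:
    """(accuracy + completeness) / 4, each judged on a 0-2 scale."""
    return (accuracy + completeness) / 4


def instruction_following_reward(lang: int, coh: int, kw: float, cstr: float) -> float:
    """(lang + coh)/4 x (0.5*kw + 0.5*cstr), with lang == 0 as a hard gate."""
    if lang == 0:
        return 0.0
    return (lang + coh) / 4 * (0.5 * kw + 0.5 * cstr)


print(summarization_reward(True, 3, 1.0, 0.0))
print(translation_reward(2, 1))
print(instruction_following_reward(2, 2, 1.0, 0.5))
```

Each reward lands in \[0, 1\], so the judged domains share a scale with the binary Math and QA rewards\.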
## Appendix C Hyperparameters
Table 6: Training hyperparameters\.
## Appendix D Evaluation Data
Each language has up to 50 evaluation samples per domain\. Table [7](https://arxiv.org/html/2606.06835#A4.T7) shows the total evaluation samples per tier\.

| Tier | Math | QA | IF | Summ | Trans | Total |
|---|---|---|---|---|---|---|
| High (6) | 300 | 300 | 300 | 300 | 140 | 1,340 |
| Low (12) | 600 | 600 | 600 | 250 | 574 | 2,624 |
| XLow (4) | 200 | 200 | 200 | – | 115 | 715 |
| Synth (2) | 100 | – | – | – | 100 | 200 |
| Total | 1,200 | 1,100 | 1,100 | 550 | 929 | 4,879 |

Table 7: Evaluation samples per tier and domain\.
## Appendix E Data Quality Audit Methodology
We audit the full evaluation set \(up to 50 per language × 3 verifiable domains × 16 Low\+XLow languages = 2,289 samples\) using Claude Opus as annotator\. Samples are split into 4 batches and annotated in parallel\. Each sample is assessed for: \(1\) language correctness \(is the prompt in the claimed language?\), \(2\) label format \(does it match the domain?\), \(3\) content coherence \(does original\_en relate to the translation?\)\. The following prompt is used:

**Data Quality Audit Prompt \(Claude Opus\)**

Read the file\. It contains ~720 multilingual eval samples\. Each has: domain \(math/qa/translation\), language, prompt, label, original\_en\.

These are translated from English \(original\_en\) to the target language \(prompt\)\. The label is the correct answer\.

IMPORTANT:

- “kivari” and “toqal” are INTENTIONAL synthetic cipher languages\. They look like random syllables\. This is BY DESIGN\. Do NOT flag\.
- Prompts truncated at 500 chars, labels at 300\. Don’t flag truncation\.
- Guarani/Lingala may use Spanish/French due to regional bilingualism\. Note but not necessarily errors\.

Read EVERY sample\. For each, check:

1. Is the prompt in the claimed language? \(Check script: Amharic=Ethiopic, Uyghur=Arabic, etc\.\)
2. Does label format match domain? \(math=number/expression, qa=single letter A/B/C/D, translation=text\)
3. Does original\_en relate to the prompt content?
4. For QA: does the label letter seem plausible given the question?

Report ONLY: Total reviewed, number with definite issues, for each issue: language, domain, line number, and what’s wrong \(1 sentence\)\. Overall pass rate percentage\.

Be strict on wrong\-language but fair on bilingual regions\.

The audit confirms 98\.4% fidelity across all Low and XLow languages\.
## Appendix F Per\-Language Results
Table [8](https://arxiv.org/html/2606.06835#A6.T8) shows per\-language results averaged across all domains, comparing the zero\-shot baseline, no\-tool RL \(step 99\), free tool \(step 39\), and gated \(step 39\)\. Languages are grouped by tier\.

| Language | Tier | Baseline R% | +RL R% | Free R% | Free T% | Gate R% | Gate T% | Δ |
|---|---|---|---|---|---|---|---|---|
| English | High | 73.7 | 76.3 | 72.6 | 46.8 | 73.5 | 5.5 | −0.2 |
| Arabic | High | 53.3 | 59.5 | 67.4 | 59.9 | 64.9 | 33.2 | +11.6 |
| Chinese | High | 65.7 | 70.2 | 67.8 | 61.9 | 72.8 | 26.4 | +7.1 |
| French | High | 69.4 | 69.5 | 70.0 | 53.9 | 72.5 | 19.4 | +3.1 |
| Japanese | High | 65.0 | 71.1 | 73.8 | 54.4 | 75.5 | 24.2 | +10.5 |
| Russian | High | 60.7 | 69.1 | 71.0 | 61.2 | 71.6 | 26.6 | +10.9 |
| Amharic | Low | 27.9 | 36.2 | 56.0 | 61.4 | 56.0 | 47.4 | +28.1 |
| Aymara | Low | 36.4 | 43.5 | 63.0 | 56.6 | 64.6 | 38.2 | +28.2 |
| Guarani | Low | 48.2 | 56.6 | 64.4 | 54.5 | 64.0 | 37.6 | +15.8 |
| Hausa | Low | 24.7 | 31.9 | 61.1 | 60.2 | 57.1 | 48.0 | +32.4 |
| Kinyarwanda | Low | 24.1 | 36.8 | 65.5 | 54.0 | 63.2 | 38.5 | +39.1 |
| Luganda | Low | 33.7 | 42.4 | 61.5 | 53.8 | 61.6 | 39.3 | +27.9 |
| Quechua | Low | 32.0 | 40.9 | 68.6 | 52.2 | 67.4 | 39.8 | +35.4 |
| Somali | Low | 21.4 | 32.5 | 60.9 | 61.6 | 61.4 | 50.8 | +40.0 |
| Uyghur | Low | 44.2 | 53.4 | 72.4 | 52.8 | 71.4 | 35.5 | +27.2 |
| Wolof | Low | 31.2 | 40.6 | 64.7 | 53.2 | 64.2 | 38.5 | +33.0 |
| Yoruba | Low | 25.3 | 33.5 | 55.0 | 60.6 | 57.1 | 50.6 | +31.8 |
| Zulu | Low | 29.3 | 36.2 | 68.8 | 53.0 | 71.0 | 39.2 | +41.7 |
| Bambara | XLow | 50.1 | 59.9 | 58.0 | 54.0 | 56.5 | 37.0 | +6.4 |
| Ewe | XLow | 39.5 | 50.0 | 61.0 | 53.1 | 61.9 | 38.4 | +22.4 |
| Lingala | XLow | 38.3 | 46.2 | 57.5 | 49.8 | 60.9 | 36.5 | +22.6 |
| Twi | XLow | 45.8 | 54.2 | 62.7 | 52.2 | 62.9 | 36.4 | +17.1 |
| Kivari | Synth | 9.5 | 13.4 | 43.8 | 86.2 | 42.8 | 73.5 | +33.3 |
| Toqal | Synth | 6.4 | 8.8 | 44.9 | 86.5 | 40.3 | 72.8 | +33.9 |

Table 8: Per\-language results \(averaged across domains\)\. R% = reward, T% = tool usage\. Baseline = zero\-shot Qwen3\-4B \(no additional RL\)\. \+RL = no\-tool RL at step 99\. Free/Gate = tool experiments at step 39\. Δ = Gate R% minus Baseline R%\.
## Appendix G Confidence Gate Derivation
The ungated efficiency adjustment fires whenever any no\-tool sample is correct in the group\. For a language with per\-sample no\-tool success probability $p$, the firing probability within a group of $N=8$ is approximately:

$$P(\text{fire}\mid p)\approx 1-(1-p)^{N(1-q)} \tag{3}$$

where $q$ is the current tool\-usage rate\. This substitutes the expected number of no\-tool samples $N(1-q)$ into the exponent; the exact expression is $1-(1-p(1-q))^{N}$ \(the approximation error is small for our parameter range\)\. For low\-resource languages \($p\approx 0.26$, $q\approx 0.5$\): $P(\text{fire})\approx 0.70$\. This creates a positive feedback loop: the cost signal reduces tool usage $\to$ more no\-tool samples $\to$ more firing $\to$ further reduction\. Empirically, all $\lambda$ values converge to a ${\sim}30\%$ tool\-usage equilibrium regardless of strength\.
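The gap between Eq\. 3 and the exact expression can be checked numerically\. The following is a minimal sketch, not the authors' code; the function names `fire_prob_approx` and `fire_prob_exact` are illustrative assumptions, and only the two formulas themselves come from the text\.

```python
def fire_prob_approx(p: float, q: float, n: int = 8) -> float:
    """Eq. 3: plug the expected number of no-tool samples, n * (1 - q), into the exponent."""
    return 1.0 - (1.0 - p) ** (n * (1.0 - q))


def fire_prob_exact(p: float, q: float, n: int = 8) -> float:
    """Exact form: each of the n samples is a correct no-tool attempt with probability p * (1 - q)."""
    return 1.0 - (1.0 - p * (1.0 - q)) ** n


if __name__ == "__main__":
    # Low-resource setting quoted in the text: p ~ 0.26, q ~ 0.5, N = 8.
    p, q = 0.26, 0.5
    print(f"approx={fire_prob_approx(p, q):.3f}  exact={fire_prob_exact(p, q):.3f}")
```

For the low\-resource setting above, the approximation evaluates to about 0\.70 and the exact form to about 0\.67, so the two agree to within a few points\.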
The confidence gate transforms this by requiring $K$ successes\. At initialization \($q\approx 0$, all samples are no\-tool attempts\), the firing probability is:

$$P(\text{fire}\mid p,K)=\sum_{k=K}^{N}\binom{N}{k}p^{k}(1-p)^{N-k} \tag{4}$$

As tool adoption increases \($q>0$\), fewer no\-tool samples are available per group, further reducing the gate’s firing probability and strengthening the protection for languages that adopt tools\.

For $K=6$, writing $S_N$ for the number of correct no\-tool samples in a group of $N$:
- High\-resource \($p\approx 0.75$\): $P(S_N\geq 6)\approx 0.68$\. Gate fires frequently\.
- Low\-resource \($p\approx 0.26$\): $P(S_N\geq 6)\approx 0.005$\. Gate almost never fires\.
- Extreme\-low \($p\approx 0.10$\): $P(S_N\geq 6)\approx 2\times 10^{-5}$\. Tool use fully protected\.
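The three tail probabilities above follow directly from Eq\. 4\. A minimal sketch that reproduces them is shown below; the function name `gate_fire_prob` is an illustrative assumption, and the defaults $K=6$, $N=8$ are taken from the text\.

```python
from math import comb


def gate_fire_prob(p: float, k: int = 6, n: int = 8) -> float:
    """Eq. 4: probability of at least k successes among n no-tool samples (binomial tail)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    # Per-sample no-tool success rates quoted in the text for each tier.
    for tier, p in [("High-resource", 0.75), ("Low-resource", 0.26), ("Extreme-low", 0.10)]:
        print(f"{tier:14s} p={p:.2f}  P(S_N >= 6) = {gate_fire_prob(p):.2e}")
```

The sharp drop between $p=0.75$ and $p=0.26$ is what lets a single threshold $K$ separate languages the model handles natively from those where it should keep using the tool\.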
## Appendix H Data Distribution
| Tier | Math | QA | IF | Total |
|---|---|---|---|---|
| High \(6\) | 2\.9K | 2\.9K | 2\.9K | 8\.7K |
| Low \(12\) | 63\.0K | 58\.2K | 63\.8K | 185\.0K |
| XLow \(4\) | 16\.8K | 11\.3K | 17\.8K | 45\.9K |
| Synth \(2\) | 4\.0K | – | – | 4\.0K |
| Total | 87K | 72K | 85K | 244K |

Table 9: Verifiable training data by tier\.

| Tier | Summ | Trans | Total |
|---|---|---|---|
| High \(6\) | 1\.9K | 0\.6K | 2\.5K |
| Low \(12\) | 11\.3K | 11\.6K | 22\.9K |
| XLow \(4\) | – | 1\.7K | 1\.7K |
| Synth \(2\) | – | 2\.0K | 2\.0K |
| Total | 13K | 16K | 29K |

Table 10: Non\-verifiable training data by tier\. Combined with Table [9](https://arxiv.org/html/2606.06835#A8.T9): 273K total\.
## Appendix I Data Sources
| Domain | English Source | Notes |
|---|---|---|
| Math | Massive\-Math\-455K\-Verified | HuggingFace; word problems with numeric answers |
| QA | Nemotron\-CrossThink | 187K MCQ with rule\-based verification |
| IF | Custom constraint generation | Keywords \+ structural constraints; see Appendix [A](https://arxiv.org/html/2606.06835#A1) |
| Summarization | XL\-Sum \[Hasan et al\., [2021](https://arxiv.org/html/2606.06835#bib.bib9)\] | Native articles in 13 languages \(no translation\) |
| Translation | XL\-Sum English highlights | Bidirectional pairs via 5\-stage pipeline |

Table 11: English source datasets per domain\. All verifiable domains \(Math, QA, IF\) are translated to 22 target languages via our pipeline; Summarization uses native data where available\.
## Appendix J Non\-Verifiable Domain Data
For summarization, we use native\-language articles from XL\-Sum where available \(13 languages\), and translate only for languages lacking native data using a fail\-fast pipeline: translate the short label first, then the expensive article only if the label passes heuristic and chrF checks, followed by a cross\-coherence judge verifying that the back\-translated label still summarizes the back\-translated article\.
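The fail\-fast ordering for summarization can be sketched as follows\. This is an illustration under stated assumptions, not the authors' implementation: `translate`, `label_ok`, and `coherent` are hypothetical callables standing in for the translator, the combined heuristic and chrF checks on the label, and the cross\-coherence judge\.

```python
from typing import Callable, Optional, Tuple


def fail_fast_translate(
    label: str,
    article: str,
    translate: Callable[[str], str],       # hypothetical: source text -> translated text
    label_ok: Callable[[str, str], bool],  # hypothetical: heuristic + chrF checks on the label
    coherent: Callable[[str, str], bool],  # hypothetical: cross-coherence judge on the pair
) -> Optional[Tuple[str, str]]:
    """Translate the cheap label first; translate the article only if the label passes."""
    translated_label = translate(label)
    if not label_ok(label, translated_label):
        return None  # fail fast: the expensive article is never translated
    translated_article = translate(article)
    if not coherent(translated_label, translated_article):
        return None  # the judge found the label no longer summarizes the article
    return translated_label, translated_article
```

Ordering the checks this way means a rejected label costs only one short translation call rather than a full article translation\.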
Translation data is sourced from XL\-Sum English highlights, translated to target languages with the standard five\-stage pipeline \(heuristic, back\-translation, chrF $\geq$ 0\.35, LLM meaning\-preservation judge\), then formatted as bidirectional pairs \(English\-to\-target and target\-to\-English\)\.
## Appendix K Translation Model Choice
We use Qwen3\.5\-122B\-A10B\-FP8 as the translator because no single compact model covers all 22 target languages with sufficient quality\. Existing open\-source translation models \(NLLB \[NLLB Team, [2022](https://arxiv.org/html/2606.06835#bib.bib24)\], Gemma\-based translators\) perform well on high\-resource pairs but degrade severely on our lowest\-resource targets\. In our evaluation, Gemma\-based translation models produced near\-unusable output for XLow languages such as Bambara, Ewe, and Twi, where even basic meaning preservation failed\.
While deploying a 122B model alongside a 4B policy model may appear computationally disproportionate, the 122B serves as a proxy for the specialized, efficient translation service that would be used in deployment\. Integrating an external API introduced latency and cost constraints incompatible with our synchronous RL training loop, making self\-hosted inference the practical choice\. One might ask why not use the 122B directly to solve the problems; the answer is that our goal is to develop a policy that knows when it needs help\.
## Appendix L Translation Pipeline Details
#### Source Filtering Thresholds\.
For math, we retain only word problems where $\geq$ 80% of content is natural language and total length is under 2,000 characters\. This eliminates approximately 40% of raw source data\. For QA, we filter by option count \(must have A/B/C/D\) and question length bounds\.
#### Forward Translation Prompts\.
Each domain has a dedicated translation prompt\. For math: “Keep all mathematical notation \(LaTeX, equations, numbers, variable names\) exactly as\-is\. Only translate the natural language parts\.” For QA, the prompt takes structured JSON input that preserves the option keys\. Output is wrapped in `<output>` tags\.
#### Heuristic Thresholds\.
- Character n\-gram repetition: reject if repetition ratio $>$ 0\.3 \(n=10\)
- Length ratio: reject if translation/source ratio $<$ 0\.2 or $>$ 3\.0
- Source copy: reject if chrF\(translation, source\) $>$ 0\.9
- Translation domain: max tokens 2048 \(math/QA\), 512 \(translation\)
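The first three heuristics above can be sketched as a single rejection function\. Several details are assumptions because the text does not define them: the repetition ratio is taken to be the fraction of character 10\-grams that duplicate an earlier one, the length ratio is measured in characters, and `chrf` is a hypothetical callable returning a score in \[0, 1\]\.

```python
from typing import Callable, Optional


def repetition_ratio(text: str, n: int = 10) -> float:
    """Assumed definition: share of character n-grams that are duplicates."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def heuristic_reject_reason(
    source: str,
    translation: str,
    chrf: Callable[[str, str], float],  # hypothetical scorer, assumed to return a value in [0, 1]
) -> Optional[str]:
    """Return the name of the first failed heuristic, or None if the sample passes."""
    if repetition_ratio(translation, n=10) > 0.3:
        return "repetition"
    ratio = len(translation) / max(len(source), 1)  # assumed: character-length ratio
    if ratio < 0.2 or ratio > 3.0:
        return "length_ratio"
    if chrf(translation, source) > 0.9:
        return "source_copy"  # the output is near-identical to the untranslated source
    return None
```

Running the cheap string checks before the chrF comparison keeps the scorer off samples that would be rejected anyway\.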
#### Back\-Translation chrF Threshold\.
chrF $\geq$ 0\.4 for math/QA, $\geq$ 0\.35 for translation \(lower because short sentences have higher chrF variance\)\. This conservative threshold catches substantial content loss before the expensive judge call\.
#### LLM Judge Criteria\.
The judge compares original English with back\-translation\. It explicitly ignores: name/story context changes, rephrasing, format differences, word order\. It flags: number changes, operation changes, constraint changes, information loss, answer leakage, definition changes, question changes\. Binary verdict: SAME/DIFFERENT\. Three retries on bad format; samples failing all retries are rejected\.
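The verdict handling can be sketched as below\. This is an illustration, not the authors' code: `call_judge` is a hypothetical callable that returns the judge's raw reply, and "three retries" is read here as one initial call plus up to three further calls, which is an interpretation of the text\.

```python
import re
from typing import Callable

# A reply is well-formed only if it contains exactly one of the two verdict tokens.
VERDICT_RE = re.compile(r"\b(SAME|DIFFERENT)\b")


def judge_accepts(
    original: str,
    back_translation: str,
    call_judge: Callable[[str, str], str],  # hypothetical: returns the judge's raw text reply
    max_retries: int = 3,
) -> bool:
    """Return True only for a well-formed SAME verdict; reject if every attempt is malformed."""
    for _ in range(1 + max_retries):
        reply = call_judge(original, back_translation)
        verdicts = set(VERDICT_RE.findall(reply))
        if len(verdicts) == 1:
            return verdicts == {"SAME"}
    return False  # all attempts were badly formatted, so the sample is rejected
```

Treating a reply that contains both tokens as malformed avoids accepting a sample on an ambiguous verdict\.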
#### Concurrency and Quotas\.
Translation runs asynchronously with 128–256 concurrent requests per language\. Per\-language quotas ensure inverse resource weighting \(High: 610, Low: 4880, XLow: 3660 samples per domain\)\. Failed samples recycle with replacement from the source pool; 10 attempts maximum per slot before fallback\.
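The quota and recycling logic can be sketched sequentially as follows\. This is a simplified illustration under assumptions: the real pipeline is asynchronous, `try_translate` is a hypothetical callable that returns `None` when a sample fails any pipeline stage, and the fallback is left unspecified in the text, so unfilled slots are simply marked with `None`\.

```python
import random
from typing import Callable, List, Optional, Sequence

# Per-domain quotas quoted in the text, keyed by resource tier.
QUOTAS = {"High": 610, "Low": 4880, "XLow": 3660}


def fill_quota(
    pool: Sequence[str],
    quota: int,
    try_translate: Callable[[str], Optional[str]],  # hypothetical: None means the sample failed
    max_attempts: int = 10,
    seed: int = 0,
) -> List[Optional[str]]:
    """Fill each slot by drawing from the pool with replacement, up to max_attempts per slot."""
    rng = random.Random(seed)
    slots: List[Optional[str]] = []
    for _ in range(quota):
        result = None
        for _ in range(max_attempts):
            result = try_translate(rng.choice(pool))
            if result is not None:
                break
        slots.append(result)  # None marks a slot left to the (unspecified) fallback
    return slots
```

A call such as `fill_quota(pool, QUOTAS["Low"], try_translate)` would then target the Low\-tier quota for one domain\.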