AdaMame: A Training Recipe for Adaptive Multilingual Reasoning
Summary
This paper introduces AdaMame, a two-stage training recipe (SFT + GRPO) to adaptively align reasoning language with query language in multilingual mathematical reasoning, mitigating language collapse without sacrificing accuracy.
# AdaMame: A Training Recipe for Adaptive Multilingual Reasoning
Source: [https://arxiv.org/html/2606.15080](https://arxiv.org/html/2606.15080)
Dayeon Ki (University of Maryland), Kevin Duh (Johns Hopkins University), Marine Carpuat (University of Maryland)
dayeonki@umd.edu
###### Abstract
While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur a trade-off in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame (short for Adaptive multilingual model reasoning), a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages. Code, data, and model are available at [https://github.com/dayeonki/adamame](https://github.com/dayeonki/adamame).
## 1 Introduction
Large Reasoning Models (LRMs) deployed in multilingual settings must satisfy two objectives simultaneously: producing correct answers (reasoning accuracy) and generating reasoning traces (i.e., chains of thought, CoTs) in the same language as the query (language fidelity) (Shi et al., [2022](https://arxiv.org/html/2606.15080#bib.bib13); Muennighoff et al., [2023](https://arxiv.org/html/2606.15080#bib.bib7); Yong et al., [2025](https://arxiv.org/html/2606.15080#bib.bib6)). Language fidelity matters both practically (users interacting in their native language expect responses in kind) and technically: effective reasoning strategies vary across languages (Ki et al., [2026](https://arxiv.org/html/2606.15080#bib.bib2); Gurgurov et al., [2026](https://arxiv.org/html/2606.15080#bib.bib27)), models can sometimes reason more effectively in the original query language (Gao et al., [2025](https://arxiv.org/html/2606.15080#bib.bib5)), and multilingual thinking promotes output diversity (Blasi et al., [2022](https://arxiv.org/html/2606.15080#bib.bib3); Xu and Zhang, [2026](https://arxiv.org/html/2606.15080#bib.bib4)). Yet because the vast majority of LRM training data is in English (Ghosh et al., [2025](https://arxiv.org/html/2606.15080#bib.bib46)), these models suffer from so-called language collapse, where they default to reasoning in English regardless of the query language (Park et al., [2026](https://arxiv.org/html/2606.15080#bib.bib24)).
Figure 1: Accuracy (Pass@4) versus Language Fidelity (LCPR: Language Confusion Pass Rate). Backbone model: Distill-Qwen 1.5b. AdaMame-GRPO achieves Pareto-optimal performance on both objectives.

Figure 2: Effectiveness of AdaMame. We compare reasoning behaviors between a Vanilla Model and AdaMame for French and Telugu queries. While the Vanilla Model fails to answer correctly in the query language, code-switches mid-trace, and overthinks with excessive token usage, AdaMame adapts to the query language, producing both correct and token-efficient reasoning.

Recent efforts to remedy language collapse share a common design: appending a binary language fidelity reward to the accuracy objective through a manually tuned weighting ratio (Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26); Sutawika et al., [2026](https://arxiv.org/html/2606.15080#bib.bib25); Gao et al., [2026](https://arxiv.org/html/2606.15080#bib.bib39)). Despite its appeal, this approach has several persistent limitations ([Figure 2](https://arxiv.org/html/2606.15080#S1.F2)): (1) a trade-off in reasoning accuracy (Wang et al., [2025b](https://arxiv.org/html/2606.15080#bib.bib43)); (2) alternating languages (i.e., code-switching) within reasoning traces (Wang et al., [2025a](https://arxiv.org/html/2606.15080#bib.bib45)); and (3) overthinking, where models spend excessive tokens on reasoning without proportional gains (Chen et al., [2024b](https://arxiv.org/html/2606.15080#bib.bib41); Sui et al., [2025](https://arxiv.org/html/2606.15080#bib.bib42)). Most existing methods also require English reference reasoning traces (Sutawika et al., [2026](https://arxiv.org/html/2606.15080#bib.bib25); Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26)) and depend on a fixed, developer-specified weighting regime, which limits scalability (Gurgurov et al., [2026](https://arxiv.org/html/2606.15080#bib.bib27)).
In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that adaptively aligns the reasoning language to the query language, jointly optimizing for reasoning accuracy and language fidelity (§[3](https://arxiv.org/html/2606.15080#S3)). AdaMame builds on the well-established SFT-then-RL post-training recipe, introducing targeted modifications to shift model behavior toward the query language. In Stage 1, we apply supervised fine-tuning (SFT) on naturally occurring reasoning traces in five languages, equipping the model with foundational multilingual reasoning capability and sensitivity to language-specific reasoning patterns (Ki et al., [2026](https://arxiv.org/html/2606.15080#bib.bib2)). In Stage 2, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.15080#bib.bib47)) inspired by ARM (Wu et al., [2026b](https://arxiv.org/html/2606.15080#bib.bib48)), that incorporates a query-conditioned alignment factor which grows adaptively during training. This encourages the model to progressively align its reasoning language to the query while preserving reasoning accuracy as the main objective.
Across two multilingual mathematical reasoning benchmarks, two LRMs, and 12 in-domain and out-of-domain languages (§[4](https://arxiv.org/html/2606.15080#S4)), we show that AdaMame-GRPO achieves Pareto-optimal performance across accuracy and language fidelity while using fewer tokens than all baselines. While the SFT stage alone substantially reduces language collapse, the RL stage with AdaMame-GRPO further improves generalization to out-of-domain languages, with the largest gains on low-resource languages (§[5.1](https://arxiv.org/html/2606.15080#S5.SS1)). Further analysis confirms that AdaMame-GRPO induces the intended explore-then-exploit curriculum: the model initially explores diverse reasoning languages before progressively converging on the query language as the query alignment factor grows (§[5.2](https://arxiv.org/html/2606.15080#S5.SS2)), and increasing the weight of the alignment factor yields improved language fidelity with a controllable accuracy trade-off (§[5.3](https://arxiv.org/html/2606.15080#S5.SS3)).
In summary, our contributions are three\-fold:
- We introduce AdaMame, a two-stage post-training recipe (SFT+RL) for multilingual mathematical reasoning, with targeted modifications for query language alignment.
- We release a dataset of 30K naturally occurring reasoning traces across five languages, supporting multilingual reasoning research.
- AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all tested baselines, with the strongest generalization to out-of-domain and lower-resource languages.
## 2 Related Work
### 2.1 Improving Multilingual Reasoning
A growing body of work documents substantial performance gaps (Wang et al., [2025b](https://arxiv.org/html/2606.15080#bib.bib43); Luo et al., [2025](https://arxiv.org/html/2606.15080#bib.bib50)) and language collapse in LRMs when queries are posed in languages other than English (Park et al., [2026](https://arxiv.org/html/2606.15080#bib.bib24)). Prior methods to address these issues fall into two directions. The first is prompt-based: adding language-specific instructions (Yong et al., [2025](https://arxiv.org/html/2606.15080#bib.bib6); Qi et al., [2025](https://arxiv.org/html/2606.15080#bib.bib49)) or prefixes at the beginning of model generations (Tam et al., [2025](https://arxiv.org/html/2606.15080#bib.bib1)) to steer the output language. The second is fine-tuning models on multilingual CoTs (Lai and Nissim, [2024](https://arxiv.org/html/2606.15080#bib.bib35); She et al., [2024](https://arxiv.org/html/2606.15080#bib.bib51); Chai et al., [2025](https://arxiv.org/html/2606.15080#bib.bib52)). More recent Reinforcement Learning (RL)-based approaches augment the GRPO reward with an explicit language fidelity term (Liu et al., [2026](https://arxiv.org/html/2606.15080#bib.bib28); Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26); Sutawika et al., [2026](https://arxiv.org/html/2606.15080#bib.bib25)). For instance, M-Thinker adds two reward terms to GRPO: a binary language fidelity reward that fires when the reasoning language matches the query, and a cross-lingual thinking alignment reward in which an LLM judge scores how closely the reasoning trace aligns with a reference English trace on a continuous 0–1 scale (Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26)).
While effective, these methods exhibit three persistent failure modes ([Figure 2](https://arxiv.org/html/2606.15080#S1.F2)): (1) trading off answer correctness for matching the reasoning language to the query language, (2) code-switching within reasoning traces, and (3) overthinking with excessive use of tokens. As compared in [Table 1](https://arxiv.org/html/2606.15080#S2.T1), many prior approaches also require English reference reasoning traces for supervision and rely on a fixed, developer-specified weighting regime between reward components (Gao et al., [2026](https://arxiv.org/html/2606.15080#bib.bib39); Gurgurov et al., [2026](https://arxiv.org/html/2606.15080#bib.bib27)), limiting robustness to changes in that ratio. AdaMame-GRPO addresses this gap with an adaptive reward that progressively strengthens query language alignment over the course of GRPO training, requiring no English reference traces.
### 2.2 Adaptive Reasoning
Recent studies have explored making LRM reasoning adaptive, ranging from binary decisions about whether to engage in thinking (Tu et al., [2026](https://arxiv.org/html/2606.15080#bib.bib57)) to fine-grained adaptation of reasoning format or effort based on task difficulty (Yu et al., [2025](https://arxiv.org/html/2606.15080#bib.bib58); Wu et al., [2026b](https://arxiv.org/html/2606.15080#bib.bib48); Wang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib59); Wu et al., [2026a](https://arxiv.org/html/2606.15080#bib.bib54); Yang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib56)). In multilingual settings, adaptive reasoning has been studied as a way to select whichever reasoning language is most effective for a given query, through a language router (Guo et al., [2026](https://arxiv.org/html/2606.15080#bib.bib63)), LLM-as-judge (Zheng et al., [2025](https://arxiv.org/html/2606.15080#bib.bib60)), or comparison against English traces (Ye et al., [2026](https://arxiv.org/html/2606.15080#bib.bib61)). AdaMame-GRPO is distinctive among these approaches in that it does not treat reasoning language as a free variable to be selected per query. Instead, it fixes query language alignment as the objective and progressively adapts to this objective throughout GRPO training by increasing the query alignment factor.
| Method | Dataset | Objective |
| --- | --- | --- |
| M-Thinker (Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26)) | $q, c, o, c_{\mathrm{en}}$ | GRPO (Accuracy + Language Fidelity + Format + Cross-lingual Thinking Alignment) |
| SP3F (Sutawika et al., [2026](https://arxiv.org/html/2606.15080#bib.bib25)) | $q, c, o, c_{\mathrm{en}}$ | GRPO (Accuracy + Language Fidelity + Format + Judge Preference Feedback) |
| TRIT (Liu et al., [2026](https://arxiv.org/html/2606.15080#bib.bib28)) | $q, c, o$ | GRPO (Accuracy + Language Fidelity + Format + Repetition Penalty) |
| ExpLang (Gao et al., [2026](https://arxiv.org/html/2606.15080#bib.bib39)) | $q, c, o$ | GRPO (Accuracy + Pass@k + Language Fidelity + Format + Thinking Language Diversity) |
| ReasonXL (Gurgurov et al., [2026](https://arxiv.org/html/2606.15080#bib.bib27)) | $q, c, o$ | GRPO (Accuracy + Language Fidelity + Format + Repetition Penalty) |
| AdaMame (Ours) | $q, c, o$ | AdaMame-GRPO (§[3.3](https://arxiv.org/html/2606.15080#S3.SS3)) |

Table 1: Comparison of AdaMame to prior approaches. Dataset: required training data components; Objective: reward components used during RL training. $q$: query, $c$: reasoning trace, $o$: final output, $c_{\mathrm{en}}$: English reference trace. AdaMame requires no English reference traces and uses no manually tuned weighting between reward components.
## 3 AdaMame: A Training Recipe
AdaMame builds on the SFT-then-RL post-training recipe with targeted modifications that optimize reasoning accuracy, language fidelity, and token efficiency by adaptively aligning reasoning to the query language. We first construct a high-quality multilingual training dataset (§[3.1](https://arxiv.org/html/2606.15080#S3.SS1)), then train in two stages: SFT (§[3.2](https://arxiv.org/html/2606.15080#S3.SS2)) followed by RL (§[3.3](https://arxiv.org/html/2606.15080#S3.SS3)).
### 3.1 Preparing Ingredients
We consider multilingual mathematical reasoning, where a model receives a math query $q$ posed in some language $\ell$ and must produce a reasoning trace $c$ and a final output $o$. The goal is to produce a correct $o$ that matches the ground truth answer $g$, with $c$ written in $\ell$. Our training dataset therefore requires (1) queries $q$ across all $\ell \in \mathcal{L}$, and (2) naturally occurring reasoning traces $c$ in the respective query language $\ell$, not machine-translated counterparts of English traces. This enables models to reason across the languages $\mathcal{L}$ and learn language-specific reasoning patterns (see Appendix [B.1](https://arxiv.org/html/2606.15080#A2.SS1) for details).
#### Queries\.
We sample queries from DAPO-MATH-17K (Yu et al., [2026](https://arxiv.org/html/2606.15080#bib.bib67)) across five in-domain languages (French, Portuguese, Japanese, Korean, and Thai), sourced from Liu et al. ([2026](https://arxiv.org/html/2606.15080#bib.bib28)). We use DAPO-MATH-17K specifically for its range of difficulty levels and manually verified quality (Yu et al., [2026](https://arxiv.org/html/2606.15080#bib.bib67)). Each $q$ is machine-translated from English using DeepSeek-V3.2-Exp (DeepSeek-AI, [2025b](https://arxiv.org/html/2606.15080#bib.bib68)), with translation quality verified by Qwen3 32B (Yang et al., [2025](https://arxiv.org/html/2606.15080#bib.bib69)).
#### Reasoning Traces\.
For each query $q$, we generate a reasoning trace $c$ using GPT-5 nano (Singh et al., [2026](https://arxiv.org/html/2606.15080#bib.bib70)). We retain a triplet $(q, c, o)$ only if $c$ is in the same language as $q$, $c$ leads to a correct answer in $o$, and the response follows the required format (`<think></think>` for $c$, `\boxed{}` in $o$), leading to an average retention rate of 72.2%. Appendix [C.1](https://arxiv.org/html/2606.15080#A3.SS1) provides further details on the generation and filtering process.
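The three retention criteria can be sketched as a single filter. This is only an illustration under stated assumptions, not the authors' pipeline (which is described in Appendix C.1): the function name `keep_triplet`, the `detect_lang` callback, the regular expressions, and the exact string comparison against the gold answer are all hypothetical choices.

```python
import re

# Hypothetical patterns for the required format; nested braces inside
# \boxed{...} (e.g. \boxed{\frac{1}{2}}) are NOT handled by this simple regex.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def keep_triplet(query_lang, response, gold, detect_lang):
    """Return True if a generated response passes all three checks.

    query_lang:  language code of the query (e.g. "fr")
    response:    full model response containing the trace and final output
    gold:        ground truth answer as a string
    detect_lang: callable mapping text to a language code (assumed helper)
    """
    think = THINK_RE.search(response)
    if think is None:                       # format: trace must be in <think> tags
        return False
    trace = think.group(1)
    output = response[think.end():]
    boxed = BOXED_RE.findall(output)
    if not boxed:                           # format: output must contain \boxed{}
        return False
    if detect_lang(trace) != query_lang:    # language: trace matches the query
        return False
    # Correctness: exact string match is a simplification of answer checking.
    return boxed[-1].strip() == gold.strip()
```

Passing the language detector in as a callback keeps the sketch independent of any particular detection library.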
### 3.2 Stage 1: SFT for Learning
#### Dataset\.
We leverage SFT as a cold start to expose the model to diverse reasoning languages and language-specific reasoning patterns. For each of the five languages, we sample 6K triplets $(q, c, o)$ from §[3.1](https://arxiv.org/html/2606.15080#S3.SS1), yielding a 30K SFT training corpus $\mathcal{D}_{\mathrm{sft}}$.
#### Objective\.
We take an open-weight LRM $\mathcal{M}$ and fine-tune it on $\mathcal{D}_{\mathrm{sft}}$, resulting in $\mathcal{M}_{\mathrm{sft}}$. During fine-tuning, each query $q$ is prepended with a language-specific prompt instruction (Appendix [A](https://arxiv.org/html/2606.15080#A1)) that explicitly encourages reasoning in the query language. To protect $\mathcal{M}_{\mathrm{sft}}$ from catastrophic forgetting and to improve training efficiency, we fine-tune with low-rank adaptation (LoRA) (Hu et al., [2021](https://arxiv.org/html/2606.15080#bib.bib34)), which constrains the magnitude of parameter updates (Gao et al., [2026](https://arxiv.org/html/2606.15080#bib.bib39)). We show that LoRA outperforms full fine-tuning on both reasoning accuracy and language fidelity (details in Appendix [B.2](https://arxiv.org/html/2606.15080#A2.SS2)).
### 3.3 Stage 2: RL for Generalizing
#### Dataset\.
To construct the RL training corpus, we apply rejection sampling over $\mathcal{D}_{\mathrm{sft}}$ with 8 candidates per query (Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26)). We compare three different data sampling strategies for RL corpus construction in Appendix [B.5](https://arxiv.org/html/2606.15080#A2.SS5). A query $q$ is selected if the model generates both correct and incorrect rollouts (i.e., $0 < |\mathcal{O}_{\mathrm{correct}}| < 8$), which selects problems that are challenging yet solvable. We then randomly sample 1K instances for each of the five in-domain languages, yielding a 5K corpus $\mathcal{D}_{\mathrm{grpo}}$, with a held-out validation set of 1K instances.
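The selection rule $0 < |\mathcal{O}_{\mathrm{correct}}| < 8$ amounts to keeping queries whose rollouts are neither all correct nor all wrong. A minimal sketch, assuming per-rollout correctness flags have already been computed (the function names and the dictionary layout are illustrative, not from the paper):

```python
def is_challenging_yet_solvable(rollout_correct):
    """True if at least one rollout is correct and at least one is incorrect."""
    n_correct = sum(rollout_correct)
    return 0 < n_correct < len(rollout_correct)


def select_queries(results):
    """Filter a {query_id: [bool, ...]} mapping down to the retained query ids."""
    return [qid for qid, flags in results.items()
            if is_challenging_yet_solvable(flags)]


# Example with 8 candidates per query, as in the text.
results = {
    "too_easy": [True] * 8,           # all correct: dropped
    "too_hard": [False] * 8,          # all incorrect: dropped
    "mixed": [True, False] * 4,       # 4 of 8 correct: kept
}
selected = select_queries(results)
```

Queries with uniformly correct or uniformly incorrect rollouts also yield zero group advantage under mean-centered rewards, which is one reason such a filter is a natural fit for GRPO-style training.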
#### GRPO\.
While $\mathcal{M}_{\mathrm{sft}}$ learns to reason across the trained languages, it generalizes to out-of-domain settings less well than RL-trained models (§[5.1](https://arxiv.org/html/2606.15080#S5.SS1); Chu et al. ([2025](https://arxiv.org/html/2606.15080#bib.bib12))). We therefore further train $\mathcal{M}_{\mathrm{sft}}$ with standard GRPO (Shao et al., [2024](https://arxiv.org/html/2606.15080#bib.bib47)). Here, given a query $q$ with a ground truth answer $g$, the model samples a group of $G$ outputs (i.e., rollouts) $\mathcal{O} = \{o_1, o_2, \ldots, o_G\}$. For each $o_i$, a binary reward $r_i$ is computed using a rule-based function that checks whether $o_i$ matches the ground truth answer $g$:

$$r(q, o_i, g) = \mathbb{1}(o_i = g). \tag{1}$$

We find that an accuracy-only reward outperforms a combined accuracy and format reward; see Appendix [B.3](https://arxiv.org/html/2606.15080#A2.SS3). However, since this reward optimizes solely for correctness, the model defaults toward whichever language maximizes accuracy, typically English, without any incentive to explore alternative reasoning languages or align with the query language. Standard GRPO thus offers limited gains for language collapse (§[5.1](https://arxiv.org/html/2606.15080#S5.SS1)). To address this, we propose an adaptation of GRPO that enables the model to progressively align to the query language through a query-conditioned alignment reward mechanism.
#### AdaMame\-GRPO\.
Inspired by ARM (Wu et al., [2026b](https://arxiv.org/html/2606.15080#bib.bib48)), we adapt GRPO by introducing a query alignment scaling factor $\alpha_i(t)$ that amplifies the base accuracy reward $r(q, o_i, g)$. Formally:

$$r'(q, o_i, g) = \alpha_i(t) \cdot r(q, o_i, g), \tag{2}$$

$$\alpha_i(t) = 1 + \beta \cdot \mathbb{1}\left[\mathrm{lang}(o_i) = \mathrm{lang}(q)\right] \cdot \phi(t), \tag{3}$$

$$\phi(t) = \max\left(\frac{1 - \cos\left(\pi \cdot \dfrac{t}{T}\right)}{2},\ 0.1\right), \tag{4}$$

where $\mathrm{lang}(\cdot)$ denotes the detected language of a rollout or query, measured using the lingua language detector ([lingua-language-detector](https://pypi.org/project/lingua-language-detector)), $\phi(t)$ is a cosine growth schedule that gradually increases from 0.1 at the beginning of training ($t = 0$) to 1.0 at the end ($t = T$), and $\beta$ is the query alignment factor (default $\beta = 2.0$, with ablations in §[5.3](https://arxiv.org/html/2606.15080#S5.SS3)). The key effect of $\alpha_i(t)$ is a training curriculum: early in training, when $\phi(t)$ is small, the query alignment factor is weak and the model is free to explore diverse reasoning languages. As training progresses and $\phi(t)$ grows, correctly aligned rollouts receive increasingly amplified rewards, gradually guiding the model to converge on the query language.
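Equations (2) to (4) can be sketched directly in a few lines. The sketch below assumes that $t$ is the current training step and $T$ the total number of steps, and it takes the language match as a precomputed boolean rather than calling a language detector; the function names are illustrative.

```python
import math


def phi(t, T, floor=0.1):
    """Cosine growth schedule of Eq. (4): 0.1 at t = 0, rising to 1.0 at t = T."""
    return max((1.0 - math.cos(math.pi * t / T)) / 2.0, floor)


def alpha(lang_match, t, T, beta=2.0):
    """Query alignment scaling factor of Eq. (3)."""
    return 1.0 + beta * float(lang_match) * phi(t, T)


def adamame_reward(correct, lang_match, t, T, beta=2.0):
    """Adjusted reward of Eq. (2): alpha scales the binary accuracy reward."""
    return alpha(lang_match, t, T, beta) * float(correct)


T = 100
early = adamame_reward(True, True, 0, T)       # 1 + 2.0 * 0.1 = 1.2
late = adamame_reward(True, True, T, T)        # 1 + 2.0 * 1.0 = 3.0
unaligned = adamame_reward(True, False, T, T)  # 1.0, no amplification
wrong = adamame_reward(False, True, T, T)      # 0.0, accuracy gates the bonus
```

Because $\alpha_i(t)$ multiplies the accuracy reward rather than adding to it, an incorrect rollout earns nothing for being in the right language, which is how accuracy stays the primary objective. With the default $\beta = 2.0$, a correct and aligned rollout is worth 1.2 times a correct but unaligned one at the start of training and 3 times as much at the end.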
We specifically adopt the Dr. GRPO variant (Liu et al., [2025](https://arxiv.org/html/2606.15080#bib.bib32)) of GRPO, which removes the length-normalization and standard deviation terms; we find this yields higher reasoning accuracy than standard GRPO (Appendix [B.4](https://arxiv.org/html/2606.15080#A2.SS4)). For each query $q$, we compute the group advantage $\hat{A}_{i,t}$ over the adjusted rewards $r' = \{r'_1, r'_2, \ldots, r'_{\mathcal{O}}\}$ as:

$$\hat{A}_{i,t} = r'_i - \mathrm{mean}(\{r'_1, r'_2, \ldots, r'_{\mathcal{O}}\}). \tag{5}$$

Finally, we optimize the policy model by maximizing the following objective:

$$
\begin{split}
\mathcal{J}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{\mathcal{O}} \sim \pi_{\theta_{\mathrm{old}}}(\mathcal{O} \mid q)} \\
\frac{1}{\mathcal{O}} \sum_{i=1}^{\mathcal{O}} \sum_{t=1}^{|o_i|} \bigg\{ \min\bigg[ \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t}, \\
\mathrm{clip}\bigg( \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},\ 1 - \epsilon,\ 1 + \epsilon \bigg) \hat{A}_{i,t} \bigg] \\
- \gamma \cdot \mathrm{KL}\left[\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right] \bigg\},
\end{split} \tag{6}
$$

where $\pi_{\theta_{\mathrm{old}}}$ is the frozen old policy, $\pi_{\mathrm{ref}}$ is the reference model, $\epsilon$ is the clipping range, and $\gamma$ is the KL coefficient.
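As a worked illustration of Eq. (5) and the clipped term inside Eq. (6), the sketch below computes mean-centered advantages for one group and evaluates the clipped surrogate for a single token. It is a scalar toy version, not the training code; the function names are hypothetical, and the clipping range `eps=0.2` is an assumed value that the text does not specify.

```python
def group_advantages(rewards):
    """Eq. (5): subtract the group mean, with no standard-deviation scaling."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_term(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate from Eq. (6); eps is an assumed value."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# One group of four rollouts late in training (phi = 1, beta = 2):
# correct + aligned, correct + unaligned, and two incorrect rollouts.
rewards = [3.0, 1.0, 0.0, 0.0]
advantages = group_advantages(rewards)   # mean is 1.0

# A token whose probability rose by 50% under the new policy, in the
# best rollout: the gain is capped at (1 + eps) * advantage.
capped = clipped_term(1.5, advantages[0])
```

In this group the aligned correct rollout receives advantage 2.0, the unaligned correct one 0.0, and the incorrect ones -1.0, so the update favors correct reasoning in the query language over correct reasoning in another language. The same advantage is shared by every token of a rollout.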
## 4 Experiment Setup
#### Model\.
To assess the effectiveness of AdaMame across models and sizes, we use Distill-Qwen 1.5b (DeepSeek-AI, [2025a](https://arxiv.org/html/2606.15080#bib.bib30)) and Qwen3 4b (Yang et al., [2025](https://arxiv.org/html/2606.15080#bib.bib69)) as backbones. Distill-Qwen 1.5b is built on Qwen2.5 and post-trained on multilingual reasoning traces distilled from DeepSeek-R1. Further details are in Appendix [Table 12](https://arxiv.org/html/2606.15080#A3.T12).
#### Evaluation Dataset\.
We use two multilingual mathematical reasoning benchmarks: MGSM-Rev2 (Peter et al., [2025](https://arxiv.org/html/2606.15080#bib.bib40)) and MSVAMP (Chen et al., [2024a](https://arxiv.org/html/2606.15080#bib.bib22)). MGSM-Rev2 is a revised version of MGSM (Shi et al., [2022](https://arxiv.org/html/2606.15080#bib.bib13)) with human-translated math problems, which corrects translation errors or ambiguities in 15.8% of queries on average (see Appendix [C.2](https://arxiv.org/html/2606.15080#A3.SS2)). MSVAMP extends SVAMP (Patel et al., [2021](https://arxiv.org/html/2606.15080#bib.bib21)) with ChatGPT-translated problems, with translation quality verified by native speakers. Detailed dataset statistics are in Appendix [Table 13](https://arxiv.org/html/2606.15080#A3.T13).
#### Languages\.
We study 12 languages representing diverse resource levels, language families, writing scripts, and linguistic typologies\. We select French \(fr\), Portuguese \(pt\), Japanese \(ja\), Korean \(ko\), and Thai \(th\) as the in\-domain \(i\.e\., training\) languages and Bengali \(bn\), English \(en\), Spanish \(es\), Russian \(ru\), Swahili \(sw\), Telugu \(te\), Chinese \(zh\), and German \(de\) as out\-of\-domain languages to measure generalizability\. Per\-language characteristics are detailed in Appendix[Table 14](https://arxiv.org/html/2606.15080#A3.T14)\.
#### Baselines\.
For each backbone model, we study five baselines of increasing training effort:
- Vanilla: The unmodified backbone model $\mathcal{M}$.
- Prompt: $\mathcal{M}$ with a language-specific prompt instruction prepended to each query (Appendix [A](https://arxiv.org/html/2606.15080#A1)).
- SFT: $\mathcal{M}_{\mathrm{sft}}$, cold-start fine-tuned on $\mathcal{D}_{\mathrm{sft}}$.
- GRPO: $\mathcal{M}_{\mathrm{grpo}}$, initialized from $\mathcal{M}_{\mathrm{sft}}$ and trained on $\mathcal{D}_{\mathrm{grpo}}$ with standard GRPO.
- AdaMame: $\mathcal{M}_{\mathrm{grpo}}$, trained identically but with AdaMame-GRPO instead of standard GRPO.

For the Distill-Qwen 1.5b backbone, we additionally compare against M-Thinker Iter1 and Iter2 (Zhang et al., [2026](https://arxiv.org/html/2606.15080#bib.bib26)), using the checkpoints corresponding to one and two rounds of iterative GRPO training, respectively (see [Table 1](https://arxiv.org/html/2606.15080#S2.T1) for details).
#### Evaluation Metrics\.
We report three metrics:
- Accuracy (↑): Pass@4 over four sampled outputs, which reduces the bias and variance associated with single-generation evaluation (Zhang et al., [2025](https://arxiv.org/html/2606.15080#bib.bib16)). Answers are extracted from `\boxed{}` using math-verify ([huggingface/Math-Verify](https://github.com/huggingface/Math-Verify)).
- LCPR (↑): Language Confusion Pass Rate (Marchisio et al., [2024](https://arxiv.org/html/2606.15080#bib.bib19)), computed as the harmonic mean of line- and word-level language fidelity over the reasoning trace $c$. Unlike prior work, which measures language fidelity by the top-1 detected language, LCPR explicitly captures undesirable code-switching behavior within traces; see Appendix [C.4](https://arxiv.org/html/2606.15080#A3.SS4).
- TTC (↓): Test-Time Compute, measured as the fraction of the maximum context length (8,192 tokens) consumed by the model's response, reflecting reasoning token efficiency (Snell et al., [2024](https://arxiv.org/html/2606.15080#bib.bib18); Muennighoff et al., [2025](https://arxiv.org/html/2606.15080#bib.bib17)).

All inference uses vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.15080#bib.bib20)) with sampling temperature 0.6 and top-p 0.95. Further implementation details are in Appendix [C.3](https://arxiv.org/html/2606.15080#A3.SS3).
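The three metrics reduce to short formulas once per-sample correctness, the two language pass rates, and the response length are known. The sketch below is a simplified reading, not the evaluation code: it assumes Pass@4 means at least one of the four samples is correct, and it takes the line- and word-level pass rates as given inputs rather than computing them (Appendix C.4 describes the actual procedure). Function names are illustrative.

```python
def pass_at_k(sample_correct):
    """Pass@k under the usual reading: any of the k sampled outputs is correct."""
    return any(sample_correct)


def lcpr(line_pass_rate, word_pass_rate):
    """Harmonic mean of line-level and word-level language fidelity."""
    total = line_pass_rate + word_pass_rate
    if total == 0:
        return 0.0
    return 2.0 * line_pass_rate * word_pass_rate / total


def ttc(n_response_tokens, max_context=8192):
    """Fraction of the maximum context length consumed by the response."""
    return n_response_tokens / max_context


solved = pass_at_k([False, False, True, False])
fidelity = lcpr(0.8, 0.6)
compute = ttc(2048)   # a 2,048-token response uses a quarter of the context
```

The harmonic mean penalizes imbalance: a trace that stays in the query language line by line but slips into another language at the word level scores below the arithmetic average of the two rates.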
Column groups: Acc = Accuracy (%, ↑); LF = Language Fidelity (%, ↑); TTC = Test-Time Compute (%, ↓).

| Backbone | Model | Size | Acc $\text{G}_{\text{in}}$ | Acc $\text{G}_{\text{out}}$ | Acc $\text{G}_{\text{all}}$ | Acc $\text{V}_{\text{in}}$ | Acc $\text{V}_{\text{out}}$ | Acc $\text{V}_{\text{all}}$ | LF $\text{G}_{\text{in}}$ | LF $\text{G}_{\text{out}}$ | LF $\text{G}_{\text{all}}$ | LF $\text{V}_{\text{in}}$ | LF $\text{V}_{\text{out}}$ | LF $\text{V}_{\text{all}}$ | TTC $\text{G}_{\text{in}}$ | TTC $\text{G}_{\text{out}}$ | TTC $\text{G}_{\text{all}}$ | TTC $\text{V}_{\text{in}}$ | TTC $\text{V}_{\text{out}}$ | TTC $\text{V}_{\text{all}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Distill-Qwen 1.5b | Vanilla | - | 46.9 | 52.8 | 50.8 | 66.5 | 72.0 | 70.4 | 51.9 | 60.3 | 57.5 | 64.1 | 50.2 | 54.4 | 24.7 | 18.4 | 20.5 | 9.7 | 6.3 | 7.3 |
| Distill-Qwen 1.5b | Prompt | - | 51.6 | 57.8 | 55.7 | 66.9 | 73.4 | 71.4 | 57.8 | 58.6 | 58.3 | 66.2 | 52.1 | 56.3 | 29.4 | 34.7 | 33.0 | 10.7 | 6.1 | 7.5 |
| Distill-Qwen 1.5b | M-Thinker Iter1 | 35K | 62.0 | 64.2 | 63.5 | 78.9 | 72.7 | 74.6 | 0.3 | 19.0 | 12.8 | 3.12 | 29.6 | 21.7 | 25.0 | 23.3 | 23.9 | 17.3 | 14.2 | 15.2 |
| Distill-Qwen 1.5b | M-Thinker Iter2 | 50K | 69.5 | 65.0 | 66.5 | 81.1 | 74.6 | 76.6 | 0.1 | 17.7 | 11.8 | 1.6 | 27.5 | 19.7 | 20.9 | 21.1 | 21.0 | 14.7 | 14.0 | 15.2 |
| Distill-Qwen 1.5b | SFT | 30K | 65.3 | 58.3 | 60.6 | 77.5 | 75.4 | 76.0 | 77.6 | 62.8 | 67.7 | 84.0 | 50.1 | 60.3 | 1.5 | 4.2 | 3.3 | 1.2 | 1.7 | 1.5 |
| Distill-Qwen 1.5b | +GRPO | 35K | 72.6 | 63.6 | 66.6 | 77.3 | 76.2 | 76.6 | 80.1 | 61.6 | 67.8 | 84.8 | 47.4 | 58.7 | 1.0 | 2.6 | 1.9 | 1.0 | 1.6 | 1.4 |
| Distill-Qwen 1.5b | +AdaMame-GRPO | 35K | 72.9 | 65.5 | 67.9 | 77.5 | 76.8 | 77.0 | 80.6 | 64.8 | 70.1 | 85.2 | 50.1 | 60.7 | 1.2 | 2.3 | 1.9 | 1.0 | 1.6 | 1.4 |
| Distill-Qwen 1.5b | $\Delta$ | | +0.3 | +1.9 | +1.3 | +0.8 | +0.4 | +0.4 | +0.5 | +3.2 | +2.3 | +0.4 | +2.7 | +2.0 | +0.2 | -0.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| Qwen3 4b | Vanilla | - | 81.3 | 77.4 | 78.7 | 70.3 | 64.7 | 67.0 | 13.0 | 34.5 | 23.4 | 0.4 | 46.3 | 27.9 | 14.3 | 17.8 | 16.6 | 12.1 | 13.8 | 13.1 |
| Qwen3 4b | Prompt | - | 88.0 | 80.4 | 82.9 | 77.9 | 82.1 | 80.4 | 14.0 | 36.0 | 24.5 | 0.4 | 46.6 | 28.1 | 14.0 | 17.6 | 16.4 | 13.0 | 14.1 | 13.6 |
| Qwen3 4b | SFT | 30K | 89.8 | 83.6 | 85.6 | 86.9 | 88.7 | 88.0 | 87.9 | 88.9 | 88.6 | 93.2 | 78.4 | 84.3 | 2.6 | 1.8 | 1.8 | 1.1 | 1.3 | 1.2 |
| Qwen3 4b | +GRPO | 35K | 88.9 | 82.9 | 84.9 | 88.0 | 89.0 | 88.6 | 88.3 | 90.3 | 89.6 | 93.7 | 79.2 | 85.0 | 2.1 | 1.6 | 1.6 | 0.8 | 1.1 | 1.0 |
| Qwen3 4b | +AdaMame-GRPO | 35K | 90.4 | 83.6 | 85.9 | 88.4 | 89.4 | 89.0 | 88.9 | 90.5 | 89.9 | 94.2 | 80.4 | 86.0 | 1.9 | 1.4 | 1.5 | 0.7 | 1.0 | 0.9 |
| Qwen3 4b | $\Delta$ | | +1.5 | +0.7 | +1.0 | +0.4 | +0.4 | +0.4 | +0.6 | +0.2 | +0.3 | +0.5 | +1.2 | +1.0 | +0.2 | -0.2 | -0.1 | -0.1 | -0.1 | -0.1 |

Table 2: Performance across model variants and evaluation datasets. G: MGSM-Rev2; V: MSVAMP. in: in-domain; out: out-of-domain languages; all: overall average. Bold and underlined values denote the best and second-best scores per column, respectively. $\Delta$ denotes the difference between AdaMame-GRPO and GRPO (AdaMame-GRPO $-$ GRPO), with darker shading indicating larger magnitude within each model. AdaMame-GRPO achieves the highest overall reasoning accuracy and LCPR (Language Fidelity) and the lowest TTC (Test-Time Compute) across all variants, using a dataset of 35K instances, which is smaller than or comparable to the M-Thinker baselines. Per-language results are detailed in Appendix [D.1](https://arxiv.org/html/2606.15080#A4.SS1).
#### Training Strategies\.
For the SFT stage, we adopt LoRA (Hu et al., [2021](https://arxiv.org/html/2606.15080#bib.bib34)) for parameter-efficient fine-tuning targeting all linear modules, implemented using LLaMA-Factory (Zheng et al., [2024](https://arxiv.org/html/2606.15080#bib.bib66)). We set the LoRA rank to 8, $\alpha$ to 16, the cutoff length to 8,192 tokens, and the learning rate to $2 \times 10^{-4}$ with a cosine scheduler and a warm-up ratio of 0.1. One SFT run requires 1.5 hours on one NVIDIA A5000.
For the RL stage, we implement all experiments in verl (Sheng et al., [2025](https://arxiv.org/html/2606.15080#bib.bib65)). We set the batch size to 512, mini-batch size to 256 (2 gradient updates per batch), rollout count to 8, max prompt length to 512, max response length to 4,096, training epochs to 10, temperature to 0.8, learning rate to $1 \times 10^{-5}$, and KL loss coefficient to 0.001. The last three hyperparameters are selected via wandb's sweep search ([wandb.ai/functions/sweep](https://docs.wandb.ai/models/ref/python/functions/sweep)). A single run takes approximately 4 hours on one NVIDIA A100 for Distill-Qwen 1.5b, and 10 hours on two NVIDIA A100s for Qwen3 4b.
## 5 Results
We first compare accuracy, LCPR, and TTC across models, in\- and out\-of\-domain languages, and language resource levels \(§[5\.1](https://arxiv.org/html/2606.15080#S5.SS1)\)\. We then examine the adaptivity of AdaMame\-GRPO \(§[5\.2](https://arxiv.org/html/2606.15080#S5.SS2)\), and ablate the effect of the query alignment factor $\beta$ \(§[5\.3](https://arxiv.org/html/2606.15080#S5.SS3)\)\.
### 5\.1 Main Results
We present results in [Table 2](https://arxiv.org/html/2606.15080#S4.T2) and summarize several interesting findings below:
#### Current LRMs struggle with language collapse\.
We observe that both backbone models have low LCPR, indicating that they either frequently default to producing reasoning traces in specific languages regardless of the query language, or heavily code\-switch mid\-trace\. For example, Qwen3 4b achieves only 23\.4% and 27\.9% LCPR across languages on MGSM\-Rev2 and MSVAMP, meaning that in roughly three fourths of all cases, the model fails to deliver traces fully in the query language\. This is less pronounced in Distill\-Qwen 1\.5b, which achieves 54–57% LCPR across both datasets, likely explained by its explicit post\-training on multilingual reasoning traces \(DeepSeek\-AI, [2025a](https://arxiv.org/html/2606.15080#bib.bib30)\)\.
Even the Prompt baseline, which prepends language\-specific instructions to encourage reasoning in the query language, yields only marginal improvements in LCPR \(\+0\.8%, \+1\.9% for Distill\-Qwen 1\.5b; \+1\.1%, \+0\.2% for Qwen3 4b, on MGSM\-Rev2 and MSVAMP respectively\)\. This aligns with prior findings that prompt\-level interventions are insufficient to fully shift models’ reasoning language \(Qi et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib49); Gao et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib39)\)\.
Acc: Accuracy \(%, ↑\); LF: Language Fidelity \(%, ↑\); TTC: Test\-Time Compute \(%, ↓\)\.

| Backbone | Model | Size | Acc G\_low | Acc G\_mid | Acc G\_high | Acc V\_low | Acc V\_mid | Acc V\_high | LF G\_low | LF G\_mid | LF G\_high | LF V\_low | LF V\_mid | LF V\_high | TTC G\_low | TTC G\_mid | TTC G\_high | TTC V\_low | TTC V\_mid | TTC V\_high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Distill-Qwen 1.5b | Vanilla | - | 15.7 | 56.0 | 80.8 | 36.1 | 80.5 | 88.4 | 39.1 | 40.9 | 92.5 | 33.0 | 21.9 | 94.8 | 40.7 | 12.9 | 7.83 | 12.7 | 6.1 | 4.1 |
| Distill-Qwen 1.5b | Prompt | - | 21.5 | 61.9 | 83.7 | 38.8 | 81.2 | 88.6 | 39.7 | 49.3 | 85.9 | 38.5 | 21.9 | 95.5 | 62.5 | 22.9 | 13.5 | 14.8 | 6.0 | 3.1 |
| Distill-Qwen 1.5b | M-Thinker Iter1 | 35K | 30.6 | 69.5 | 90.3 | 51.0 | 81.1 | 87.3 | 2.8 | 1.5 | 34.0 | 6.6 | 5.5 | 45.0 | 16.9 | 26.2 | 28.5 | 14.2 | 16.0 | 15.2 |
| Distill-Qwen 1.5b | M-Thinker Iter2 | 50K | 34.4 | 74.7 | 90.4 | 51.9 | 83.3 | 90.0 | 2.3 | 0.2 | 33.0 | 4.3 | 3.0 | 43.9 | 17.2 | 23.2 | 22.7 | 13.9 | 15.1 | 13.7 |
| Distill-Qwen 1.5b | SFT | 30K | 31.4 | 66.9 | 83.5 | 52.4 | 81.5 | 89.6 | 55.4 | 59.8 | 88.1 | 49.1 | 31.9 | 89.9 | 6.4 | 1.9 | 1.7 | 2.6 | 1.1 | 1.0 |
| Distill-Qwen 1.5b | +GRPO | 35K | 38.5 | 73.8 | 87.4 | 53.2 | 82.0 | 90.0 | 55.0 | 61.2 | 87.1 | 40.8 | 34.6 | 90.0 | 3.5 | 1.1 | 1.2 | 2.6 | 1.1 | 0.8 |
| Distill-Qwen 1.5b | +AdaMame-GRPO | 35K | 39.5 | 74.9 | 89.4 | 53.6 | 83.8 | 89.5 | 61.4 | 61.0 | 87.8 | 49.8 | 31.7 | 90.5 | 3.2 | 1.4 | 1.2 | 2.5 | 0.9 | 0.9 |
| Qwen3 4b | Vanilla | - | 74.1 | 79.5 | 82.4 | 64.9 | 64.5 | 70.3 | 4.1 | 23.5 | 42.7 | 0.6 | 30.7 | 46.4 | 21.6 | 13.9 | 14.3 | 17.2 | 11.4 | 11.2 |
| Qwen3 4b | Prompt | - | 69.0 | 88.4 | 91.3 | 70.6 | 78.5 | 89.2 | 6.0 | 23.7 | 43.6 | 1.2 | 30.8 | 46.2 | 23.0 | 12.7 | 13.6 | 19.1 | 10.6 | 11.8 |
| Qwen3 4b | SFT | 35K | 70.8 | 91.1 | 95.0 | 77.6 | 91.5 | 93.1 | 88.7 | 84.9 | 92.2 | 64.1 | 91.9 | 93.8 | 3.4 | 1.0 | 1.0 | 1.7 | 0.9 | 1.0 |
| Qwen3 4b | +GRPO | 35K | 69.4 | 90.1 | 95.1 | 79.5 | 92.0 | 92.9 | 89.0 | 87.3 | 92.5 | 63.7 | 93.2 | 94.8 | 2.8 | 1.2 | 1.1 | 1.4 | 0.9 | 0.8 |
| Qwen3 4b | +AdaMame-GRPO | 35K | 71.1 | 90.9 | 95.6 | 80.0 | 91.9 | 93.6 | 89.4 | 87.8 | 92.5 | 64.7 | 93.8 | 96.0 | 2.6 | 1.0 | 1.0 | 1.2 | 0.8 | 0.8 |
Table 3: Performance by language resource level\. G: MGSM\-Rev2; V: MSVAMP\. low: low\-resource \(*th*, bn, sw, te\); mid: mid\-resource \(*ja*, *ko*, *pt*, ru, de\); high: high\-resource \(*fr*, en, es, zh\)\. Resource level is determined by number of speakers and Wikipedia article count per language \(Appendix [Table 14](https://arxiv.org/html/2606.15080#A3.T14)\)\. Note that there is at least one in\-domain language \(italic in the list\) in each resource level group\. AdaMame\-GRPO shows the greatest gains over baselines on lower\-resource languages\.
#### SFT substantially reduces language collapse but generalizes poorly to out\-of\-domain languages\.
Across both backbone models and datasets, we show that SFT yields large LCPR gains over the Prompt baseline\. This demonstrates that fine\-tuning on language\-specific reasoning traces is far more effective than lightweight prompt\-level intervention at matching the reasoning language to the query \(Lai and Nissim, [2024](https://arxiv.org/html/2606.15080#bib.bib35)\)\. The gains are considerably larger for Qwen3 4b than for Distill\-Qwen 1\.5b \(MGSM\-Rev2 / MSVAMP: \+9\.4% / \+4\.0% for Distill\-Qwen 1\.5b; \+64\.1% / \+56\.2% for Qwen3 4b\)\. Notably, gains in LCPR also translate to improvements in both reasoning accuracy and TTC\. While prior work similarly incorporates SFT as a cold\-start step \(Liu et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib28)\), its standalone contribution is unclear, and it has been shown not to necessarily improve reasoning accuracy \(Hwang et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib29)\)\. Our SFT stage avoids this trade\-off, which we attribute to the use of naturally occurring reasoning traces in $\mathcal{D}_{\text{sft}}$: unlike machine\-translated traces, these are both fluent in the query language and reflect language\-specific reasoning patterns, providing a higher\-quality training signal for multilingual reasoning \(details in Appendix [B\.1](https://arxiv.org/html/2606.15080#A2.SS1)\)\.
However, SFT alone generalizes poorly beyond its five training languages: on out\-of\-domain languages, LCPR gains are substantially smaller and sometimes fall below the Prompt baseline \(e\.g\., \-2\.0% for Distill\-Qwen 1\.5b on MSVAMP\)\.
#### GRPO improves out\-of\-domain reasoning accuracy; AdaMame\-GRPO further improves LCPR generalization\.
We show that GRPO yields stronger accuracy gains on out\-of\-domain languages than SFT\. For instance, Distill\-Qwen 1\.5b gains only \+0\.5% accuracy on out\-of\-domain languages from SFT, but an additional \+5\.3% from GRPO\. This is consistent with the established roles of the two training paradigms: SFT encourages memorization of patterns seen during training \(Allen\-Zhu and Li, [2023](https://arxiv.org/html/2606.15080#bib.bib10); Kang et al\., [2024](https://arxiv.org/html/2606.15080#bib.bib11)\), while RL encourages learning generalizable rules that transfer to new settings \(Chu et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib12)\)\. However, GRPO yields much smaller LCPR gains than accuracy gains \(\+0\.1% vs\. \+6\.0% on MGSM\-Rev2 for Distill\-Qwen 1\.5b\), as the accuracy\-only reward of GRPO provides no incentive to align the reasoning language with the query\.
AdaMame\-GRPO addresses this by introducing the query alignment factor as a scaling term on the accuracy reward \(§[3\.3](https://arxiv.org/html/2606.15080#S3.SS3)\), achieving higher LCPR than GRPO while further improving, rather than sacrificing, reasoning accuracy\. This stands in contrast to the accuracy\-language fidelity trade\-off observed in prior work \(Wang et al\., [2025b](https://arxiv.org/html/2606.15080#bib.bib43)\)\. AdaMame\-GRPO with Distill\-Qwen 1\.5b also outperforms the M\-Thinker Iter2 baseline, which uses 15K more training instances \(50K vs\. 35K\), showing greater data efficiency\. In particular, AdaMame\-GRPO shows a substantial LCPR advantage over the M\-Thinker baselines, which we attribute to heavy code\-switching in M\-Thinker’s reasoning traces\. Such code\-switching is a failure mode that the top\-1 language detection metric used as language fidelity in prior work overlooks but LCPR explicitly penalizes\.
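To make the difference between the two kinds of metric concrete, the toy sketch below contrasts a top\-1 check with a stricter whole\-trace check on a code\-switched trace\. Everything here is hypothetical: the per\-sentence language labels are made up \(in practice they would come from a language detector\), and the strict check is only a stand\-in for a code\-switching\-sensitive metric, not the paper's definition of LCPR\.

```python
from collections import Counter
from typing import List


def top1_language(segment_langs: List[str]) -> str:
    """Most frequent language label across the segments of one trace."""
    return Counter(segment_langs).most_common(1)[0][0]


def top1_match(segment_langs: List[str], query_lang: str) -> bool:
    """Top-1 style fidelity: only the dominant language has to match."""
    return top1_language(segment_langs) == query_lang


def in_language_share(segment_langs: List[str], query_lang: str) -> float:
    """Fraction of segments written in the query language."""
    return sum(lang == query_lang for lang in segment_langs) / len(segment_langs)


def fully_in_language(segment_langs: List[str], query_lang: str) -> bool:
    """Strict fidelity: every segment must be in the query language."""
    return all(lang == query_lang for lang in segment_langs)


if __name__ == "__main__":
    # Hypothetical Korean query whose trace switches to English twice.
    trace = ["ko", "ko", "en", "ko", "en", "ko"]
    print("top-1 match:", top1_match(trace, "ko"))
    print("in-language share:", round(in_language_share(trace, "ko"), 2))
    print("fully in language:", fully_in_language(trace, "ko"))
```

The trace passes the top\-1 check because Korean is still the majority language, while the strict check flags the two English segments\.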
Taken together, we show that AdaMame\-GRPO resolves the key limitations identified across prior approaches \([Figure 1](https://arxiv.org/html/2606.15080#S1.F1)\): poor out\-of\-domain generalization from SFT, limited LCPR gains from standard GRPO, and the overthinking that persists in reward\-concatenation methods \(Zhang et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib26)\), which it addresses by achieving the lowest TTC among all baselines\.


Figure 3: Reasoning language adaptation rate and validation accuracy throughout AdaMame\-GRPO training\. Left: Distill\-Qwen 1\.5b; Right: Qwen3 4b\. The $x$\-axis denotes the training batch; the left $y$\-axis shows the proportion of rollouts whose reasoning trace matches the query language \(language adaptation rate; range: 0–1\) and the right $y$\-axis shows the validation accuracy \(%, ↑\)\. Both are evaluated on a held\-out validation set of 1K instances \(250 per language\) every two training batches\.
#### AdaMame\-GRPO yields the largest gains on lower\-resource languages\.
[Table 3](https://arxiv.org/html/2606.15080#S5.T3) reports performance broken down by language resource level \(low, mid, high\) to examine whether AdaMame\-GRPO’s gains are consistent across different languages\. For Distill\-Qwen 1\.5b, while the M\-Thinker baselines tend to lead on high\-resource languages, AdaMame\-GRPO shows a clear advantage on low\-resource languages \(Thai, Bengali, Swahili, and Telugu\), which is notable given that only Thai appears in the SFT and RL training corpora\. For example, on MSVAMP, it improves accuracy for Bengali by \+10\.8% on Distill\-Qwen 1\.5b and \+28\.4% on Qwen3 4b over the Vanilla baseline\.
The same pattern holds for LCPR: AdaMame\-GRPO achieves the highest low\-resource LCPR on both datasets, while Vanilla and Prompt baselines retain an advantage only at the high\-resource tier\. This positions our approach as a step toward more equitable multilingual reasoning \(Tran et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib8); Zhao et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib9)\), where accuracy and LCPR gains are not confined to well\-resourced languages at the expense of underrepresented ones\.
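The resource\-level trend can be read directly off Table 3\. The short sketch below recomputes the accuracy gain of AdaMame\-GRPO over the Vanilla baseline for Distill\-Qwen 1\.5b on MGSM\-Rev2, using values copied from the table; the helper name `gain_over` is ours and purely illustrative\.

```python
from typing import Dict

# Accuracy (%) on MGSM-Rev2 for Distill-Qwen 1.5b, copied from Table 3.
ACCURACY: Dict[str, Dict[str, float]] = {
    "Vanilla": {"low": 15.7, "mid": 56.0, "high": 80.8},
    "AdaMame-GRPO": {"low": 39.5, "mid": 74.9, "high": 89.4},
}


def gain_over(baseline: str, system: str,
              table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Absolute gain (percentage points) of system over baseline per tier."""
    return {
        tier: round(table[system][tier] - table[baseline][tier], 1)
        for tier in table[baseline]
    }


if __name__ == "__main__":
    gains = gain_over("Vanilla", "AdaMame-GRPO", ACCURACY)
    for tier, delta in gains.items():
        print(f"{tier}: +{delta}")
```

For this slice of the table, the gain shrinks monotonically from the low\-resource to the high\-resource tier\.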
### 5\.2 Adaptivity of AdaMame\-GRPO
We verify that the cosine growth schedule in AdaMame\-GRPO induces the intended explore\-then\-exploit curriculum: as shown in [Figure 3](https://arxiv.org/html/2606.15080#S5.F3), models initially generate rollouts across diverse reasoning languages, then progressively shift toward the query language as the alignment factor grows\.
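The explore\-then\-exploit behaviour can be illustrated with a minimal sketch\. The exact formulation is given in §3\.3 and is not reproduced here, so the functional form below is an assumption: a weight that follows a half\-cosine from 0 to $\beta$ over training and multiplies the accuracy reward only when the trace is in the query language\. The group\-normalized advantage follows the usual GRPO recipe \(reward minus group mean, divided by group standard deviation\); all function names are hypothetical\.

```python
import math
from statistics import mean, pstdev
from typing import List


def alignment_weight(step: int, total_steps: int, beta: float = 2.0) -> float:
    """Half-cosine growth from 0 at step 0 to beta at the last step (assumed form)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return beta * 0.5 * (1.0 - math.cos(math.pi * progress))


def scaled_reward(correct: bool, matches_query_lang: bool,
                  step: int, total_steps: int, beta: float = 2.0) -> float:
    """Accuracy reward, scaled up when the trace is in the query language."""
    accuracy = 1.0 if correct else 0.0
    scale = 1.0 + alignment_weight(step, total_steps, beta) * float(matches_query_lang)
    return accuracy * scale


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of rollouts: (correct, trace in query language).
    group = [(True, True), (True, False), (False, True), (False, False)]
    for step in (0, 50, 100):
        rewards = [scaled_reward(c, m, step, 100) for c, m in group]
        advantages = [round(a, 2) for a in group_advantages(rewards)]
        print(step, rewards, advantages)
```

Under these assumptions, correct rollouts receive the same advantage early in training regardless of language, which leaves room for exploration; later, correct rollouts in the query language are preferred, while incorrect rollouts gain nothing from language alignment alone\.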
Interestingly, we find distinct per\-language trajectories, which suggest that AdaMame\-GRPO adaptively rebalances the accuracy and language alignment objectives at the language level rather than uniformly\. While French shows a monotonically increasing trend that plateaus after roughly the first quarter of training, the remaining four languages each exhibit a mid\-training dip around the halfway point before recovering to a higher endpoint\. This dip likely reflects an exploration vs\. exploitation trade\-off: as the accuracy reward still dominates early in training, the model temporarily shifts toward higher\-resource languages that yield more correct rollouts, sacrificing language alignment before the growing alignment factor rebalances the two objectives\. Consistent with this interpretation, validation accuracy rises relatively steadily through the dip, suggesting the model is exploiting accurate but language\-misaligned reasoning before converging on both\.
Distill\-Qwen 1\.5b also shows larger final gains on lower\-resource languages \(Japanese, Korean, and Thai\), while Qwen3 4b shows a relative advantage on higher\-resource languages \(French and Portuguese\), consistent with the resource\-level breakdown results in [Table 3](https://arxiv.org/html/2606.15080#S5.T3)\.
Figure 4: Accuracy and LCPR across values of $\beta$\. $\beta$ controls the weight of the query alignment factor in AdaMame\-GRPO\. Increasing $\beta$ trades off accuracy for higher LCPR\. Full numerical results are in Appendix [D\.2](https://arxiv.org/html/2606.15080#A4.SS2)\.
### 5\.3 Ablation on Query Alignment Factor
While we set the query alignment factor $\beta$ to 2\.0 as the default in AdaMame\-GRPO \(§[3\.3](https://arxiv.org/html/2606.15080#S3.SS3)\), we analyze its sensitivity by varying $\beta$ across a range of values\. [Figure 4](https://arxiv.org/html/2606.15080#S5.F4) reports reasoning accuracy and LCPR across both backbone models and datasets\. We observe that increasing $\beta$ consistently improves LCPR but at the cost of reasoning accuracy, showing a controllable accuracy\-language fidelity trade\-off\. Between the two models, Qwen3 4b is more sensitive to changes in $\beta$ than Distill\-Qwen 1\.5b on both metrics, likely because it was not explicitly post\-trained on multilingual reasoning traces\. The trade\-off observed here confirms two properties of the query alignment factor: it reliably improves language fidelity to the query language as intended, but at the same time, it introduces a signal that conflicts with the accuracy objective, which can be controlled through $\beta$\.
## 6 Conclusion
We present AdaMame, a two\-stage training recipe featuring AdaMame\-GRPO with a query alignment scaling factor that guides models to explore diverse reasoning languages before exploiting reasoning in the query language\. AdaMame\-GRPO achieves Pareto\-optimal performance across reasoning accuracy, language fidelity, and token efficiency, with the strongest gains on out\-of\-domain and lower\-resource languages\. We hope this work encourages moving beyond English\-centric reward designs toward training recipes that serve the full diversity of languages users bring to these systems\.
## Limitations
#### Computational Scope\.
Our experiments are constrained by computational budget, which limits both the backbone model sizes and training strategies explored\. AdaMame is further evaluated only on mathematical reasoning, where multilingual benchmarks are readily available\. This leaves open questions on how AdaMame generalizes to other settings, including larger LRMs, non\-mathematical tasks, and a broader set of languages, which we leave for future work\.
#### Language Detection Reliability\.
The query alignment reward in AdaMame\-GRPO relies on the lingua language detector, whose signals may be less reliable for lower\-resource languages or heavily code\-switched traces\. To partially mitigate this, we validate detector reliability on a held\-out test set in Appendix [B\.6](https://arxiv.org/html/2606.15080#A2.SS6)\.
#### Dependence on GPT\-5 nano\.
Our training corpora $\mathcal{D}_{\mathrm{sft}}$ and $\mathcal{D}_{\mathrm{grpo}}$ consist of reasoning traces generated by GPT\-5 nano\. We select this model based on its highest average retain rate across the five in\-domain languages in our preliminary experiments, compared to Qwen3 32b and Distill\-Qwen 32b\. Nevertheless, the resulting corpora may inherit biases from GPT\-5 nano’s language\-specific reasoning behaviors\.
## Acknowledgments
We thank the members of the CLIP lab at the University of Maryland for their valuable feedback and support, with special thanks to Yekyung Kim and Dang Nguyen for their helpful advice on GRPO training\. We also thank Siye Wu for graciously answering our questions about ARM and Sanchit Ahuja for insightful discussions on the earlier version of the work\.
## References
- Z\. Allen\-Zhu and Y\. Li \(2023\) Physics of language models: part 3\.1, knowledge storage and extraction\. arXiv preprint arXiv:2309\.14316\. Cited by: [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px3.p1.1)\.
- D\. E\. Blasi, J\. Henrich, E\. Adamou, D\. Kemmerer, and A\. Majid \(2022\) Over\-reliance on english hinders cognitive science\. Trends in cognitive sciences 26\(12\), pp\. 1153–1170\. Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p1.1)\.
- B\. Bohnet, R\. Dangovski, K\. Swersky, S\. Moore, A\. Chaudhry, K\. Kenealy, and N\. Fiedel \(2025\) A comparative analysis of llm adaptation: sft, lora, and icl in data\-scarce scenarios\. arXiv preprint arXiv:2511\.00130\. Cited by: [§B\.2](https://arxiv.org/html/2606.15080#A2.SS2.p1.3)\.
- L\. Chai, J\. Yang, T\. Sun, H\. Guo, J\. Liu, B\. Wang, X\. Liang, J\. Bai, T\. Li, Q\. Peng, and Z\. Li \(2025\) XCOT: cross\-lingual instruction tuning for cross\-lingual chain\-of\-thought reasoning\. Proceedings of the AAAI Conference on Artificial Intelligence 39\(22\), pp\. 23550–23558\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v39i22.34524), [Link](https://ojs.aaai.org/index.php/AAAI/article/view/34524) Cited by: [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1)\.
- N\. Chen, Z\. Zheng, N\. Wu, M\. Gong, D\. Zhang, and J\. Li \(2024a\) Breaking language barriers in multilingual mathematical reasoning: insights and observations\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 7001–7016\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.411/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.411) Cited by: [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px2.p1.1)\.
- X\. Chen, J\. Xu, T\. Liang, Z\. He, J\. Pang, D\. Yu, L\. Song, Q\. Liu, M\. Zhou, Z\. Zhang, et al\. \(2024b\) Do not think that much for 2\+ 3=? on the overthinking of o1\-like llms\. arXiv preprint arXiv:2412\.21187\. Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p2.1)\.
- T\. Chu, Y\. Zhai, J\. Yang, S\. Tong, S\. Xie, D\. Schuurmans, Q\. V\. Le, S\. Levine, and Y\. Ma \(2025\) SFT memorizes, RL generalizes: a comparative study of foundation model post\-training\. In Forty\-second International Conference on Machine Learning, External Links: [Link](https://openreview.net/forum?id=dYur3yabMj) Cited by: [§3\.3](https://arxiv.org/html/2606.15080#S3.SS3.SSS0.Px2.p1.10), [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px3.p1.1)\.
- DeepSeek\-AI \(2025a\) DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. External Links: 2501\.12948, [Link](https://arxiv.org/abs/2501.12948) Cited by: [Appendix A](https://arxiv.org/html/2606.15080#A1.p1.1), [§B\.3](https://arxiv.org/html/2606.15080#A2.SS3.p1.1), [Table 12](https://arxiv.org/html/2606.15080#A3.T12.1.1.2.5.1.1), [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px1.p1.1), [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px1.p1.1)\.
- DeepSeek\-AI \(2025b\) DeepSeek\-v3\.2\-exp: boosting long\-context efficiency with deepseek sparse attention\. Cited by: [§3\.1](https://arxiv.org/html/2606.15080#S3.SS1.SSS0.Px1.p1.1)\.
- C\. Gao, X\. Huang, W\. Zhu, S\. Huang, L\. Li, and F\. Yuan \(2025\) Could thinking multilingually empower llm reasoning?\. External Links: 2504\.11833, [Link](https://arxiv.org/abs/2504.11833) Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p1.1)\.
- C\. Gao, Z\. Huang, K\. Yang, J\. Chen, J\. Li, and S\. Huang \(2026\) ExpLang: improved exploration and exploitation in llm reasoning with on\-policy thinking language selection\. External Links: 2602\.21887, [Link](https://arxiv.org/abs/2602.21887) Cited by: [§C\.4](https://arxiv.org/html/2606.15080#A3.SS4.p1.1), [§1](https://arxiv.org/html/2606.15080#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.15080#S2.T1.4.4.4.2), [§3\.2](https://arxiv.org/html/2606.15080#S3.SS2.SSS0.Px2.p1.5), [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px1.p2.1)\.
- A\. Ghosh, D\. Datta, S\. Saha, and C\. Agarwal \(2025\) A survey of multilingual reasoning in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 8920–8936\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.474/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.474), ISBN 979\-8\-89176\-335\-7 Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p1.1)\.
- G\. Guo, H\. Wakaki, Y\. Mitsufuji, A\. Ritter, and W\. Xu \(2026\) Learning to route languages for multilingual preference optimization\. In Proceedings of the 43rd International Conference on Machine Learning, ICML ’26\. External Links: [Link](https://icml.cc/virtual/2026/poster/66718) Cited by: [§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- D\. Gurgurov, T\. Röhr, S\. von Rohrscheidt, J\. van Genabith, A\. Löser, and S\. Ostermann \(2026\) ReasonXL: shifting llm reasoning language without sacrificing performance\. External Links: 2604\.12378, [Link](https://arxiv.org/abs/2604.12378) Cited by: [§B\.1](https://arxiv.org/html/2606.15080#A2.SS1.p1.1), [§B\.4](https://arxiv.org/html/2606.15080#A2.SS4.p1.1), [§1](https://arxiv.org/html/2606.15080#S1.p1.1), [§1](https://arxiv.org/html/2606.15080#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.15080#S2.T1.5.5.5.2)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\) LoRA: low\-rank adaptation of large language models\. External Links: 2106\.09685, [Link](https://arxiv.org/abs/2106.09685) Cited by: [§3\.2](https://arxiv.org/html/2606.15080#S3.SS2.SSS0.Px2.p1.5), [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px6.p1.2)\.
- J\. Hwang, K\. Tanmay, S\. Lee, A\. Agrawal, H\. Palangi, K\. Ayush, I\. Fiete, and P\. P\. Liang \(2025\) Learn globally, speak locally: bridging the gaps in multilingual reasoning\. External Links: 2507\.05418, [Link](https://arxiv.org/abs/2507.05418) Cited by: [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px2.p1.1)\.
- K\. Kang, A\. Setlur, D\. Ghosh, J\. Steinhardt, C\. Tomlin, S\. Levine, and A\. Kumar \(2024\) What do learning dynamics reveal about generalization in llm reasoning?\. arXiv preprint arXiv:2411\.07681\. Cited by: [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px3.p1.1)\.
- D\. Ki, K\. Duh, and M\. Carpuat \(2026\) What makes good multilingual reasoning? disentangling reasoning traces with measurable features\. External Links: 2604\.04720, [Link](https://arxiv.org/abs/2604.04720) Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p1.1), [§1](https://arxiv.org/html/2606.15080#S1.p3.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, New York, NY, USA, pp\. 611–626\. External Links: ISBN 9798400702297, [Link](https://doi.org/10.1145/3600006.3613165), [Document](https://dx.doi.org/10.1145/3600006.3613165) Cited by: [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px5.p1.2)\.
- H\. Lai and M\. Nissim \(2024\) MCoT: multilingual instruction tuning for reasoning consistency in language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 12012–12026\. External Links: [Link](https://aclanthology.org/2024.acl-long.649/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.649) Cited by: [§B\.6](https://arxiv.org/html/2606.15080#A2.SS6.p1.2), [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1), [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px2.p1.1)\.
- J\. Liu, Z\. Wang, Y\. Li, Z\. Lai, L\. Huang, X\. Huang, X\. Han, J\. Feng, and S\. Huang \(2026\) Self\-improving multilingual long reasoning via translation\-reasoning integrated training\. External Links: 2602\.05940, [Link](https://arxiv.org/abs/2602.05940) Cited by: [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.15080#S2.T1.3.3.3.2), [§3\.1](https://arxiv.org/html/2606.15080#S3.SS1.SSS0.Px1.p1.1), [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px2.p1.1)\.
- Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025\) Understanding r1\-zero\-like training: a critical perspective\. External Links: 2503\.20783, [Link](https://arxiv.org/abs/2503.20783) Cited by: [§B\.4](https://arxiv.org/html/2606.15080#A2.SS4.p1.1), [§3\.3](https://arxiv.org/html/2606.15080#S3.SS3.SSS0.Px3.p2.3)\.
- W\. Luo, W\. X\. Zhao, J\. Sha, S\. Wang, and J\. Wen \(2025\) MMATH: a multilingual benchmark for mathematical reasoning\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 11187–11202\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.598/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.598), ISBN 979\-8\-89176\-335\-7 Cited by: [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1)\.
- K\. Marchisio, W\. Ko, A\. Berard, T\. Dehaze, and S\. Ruder \(2024\) Understanding and mitigating language confusion in LLMs\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 6653–6677\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.380/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.380) Cited by: [§C\.3](https://arxiv.org/html/2606.15080#A3.SS3.SSS0.Px2.p1.2), [§C\.3](https://arxiv.org/html/2606.15080#A3.SS3.p1.1), [2nd item](https://arxiv.org/html/2606.15080#S4.I2.i2.p1.1)\.
- N\. Muennighoff, T\. Wang, L\. Sutawika, A\. Roberts, S\. Biderman, T\. Le Scao, M\. S\. Bari, S\. Shen, Z\. Yong, H\. Schoelkopf, et al\. \(2023\) Crosslingual generalization through multitask finetuning\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 15991–16111\. Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p1.1)\.
- N\. Muennighoff, Z\. Yang, W\. Shi, X\. L\. Li, L\. Fei\-Fei, H\. Hajishirzi, L\. Zettlemoyer, P\. Liang, E\. Candès, and T\. Hashimoto \(2025\) S1: simple test\-time scaling\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 20275–20321\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1025/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1025), ISBN 979\-8\-89176\-332\-6 Cited by: [3rd item](https://arxiv.org/html/2606.15080#S4.I2.i3.p1.1)\.
- C\. Park, J\. Kim, J\. Lee, S\. Bae, J\. Choo, and K\. M\. Yoo \(2026\) Cross\-lingual collapse: how language\-centric foundation models shape reasoning in large language models\. External Links: 2506\.05850, [Link](https://arxiv.org/abs/2506.05850) Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1)\.
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\) Are nlp models really able to solve simple math word problems?\. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pp\. 2080–2094\. Cited by: [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px2.p1.1)\.
- J\. Peter, D\. Vilar, T\. Domhan, D\. Malkin, and M\. Freitag \(2025\) Mind the gap… or not? how translation errors and evaluation details skew multilingual results\. External Links: 2511\.05162, [Link](https://arxiv.org/abs/2511.05162) Cited by: [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px2.p1.1)\.
- J\. Qi, S\. Chen, Z\. Xiong, R\. Fernández, D\. Bitterman, and A\. Bisazza \(2025\) When models reason in your language: controlling thinking language comes at the cost of accuracy\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 20279–20296\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.1103/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1103), ISBN 979\-8\-89176\-335\-7 Cited by: [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1), [§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px1.p2.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\) Qwen2\.5 technical report\. External Links: 2412\.15115, [Link](https://arxiv.org/abs/2412.15115) Cited by: [Table 12](https://arxiv.org/html/2606.15080#A3.T12.1.1.2.5.1.1)\.
- A\. Rastogi, A\. Q\. Jiang, A\. Lo, G\. Berrada, G\. Lample, J\. Rute, J\. Barmentlo, K\. Yadav, K\. Khandelwal, K\. R\. Chandu, et al\. \(2025\) Magistral\. arXiv preprint arXiv:2506\.10910\. Cited by: [§B\.3](https://arxiv.org/html/2606.15080#A2.SS3.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\. Cited by: [§1](https://arxiv.org/html/2606.15080#S1.p3.1), [§3\.3](https://arxiv.org/html/2606.15080#S3.SS3.SSS0.Px2.p1.10)\.
- S\. She, W\. Zou, S\. Huang, W\. Zhu, X\. Liu, X\. Geng, and J\. Chen \(2024\) MAPO: advancing multilingual reasoning through multilingual\-alignment\-as\-preference optimization\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 10015–10027\. Cited by: [§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\) HybridFlow: a flexible and efficient rlhf framework\. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, New York, NY, USA, pp\. 1279–1297\. External Links: ISBN 9798400711961, [Link](https://doi.org/10.1145/3689031.3696075), [Document](https://dx.doi.org/10.1145/3689031.3696075) Cited by: [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px6.p2.1)\.
- F\. Shi, M\. Suzgun, M\. Freitag, X\. Wang, S\. Srivats, S\. Vosoughi, H\. W\. Chung, Y\. Tay, S\. Ruder, D\. Zhou, et al\. \(2022\) Language models are multilingual chain\-of\-thought reasoners\. arXiv preprint arXiv:2210\.03057\. Cited by: [Table 13](https://arxiv.org/html/2606.15080#A3.T13.1.1.2.3), [§1](https://arxiv.org/html/2606.15080#S1.p1.1), [§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan, A\. Luo, A\. Helyar, A\. Madry, A\. Efremov, A\. Spyra, A\. Baker\-Whitcomb, A\. Beutel, A\. Karpenko, A\. Makelov, A\. Neitz, A\. Wei, A\. Barr, A\. Kirchmeyer, A\. Ivanov, A\. Christakis, A\. Gillespie, A\. Tam, A\. Bennett, A\. Wan, A\. Huang, A\. M\. Sandjideh, A\. Yang, A\. Kumar, A\. Saraiva, A\. Vallone, A\. Gheorghe, A\. G\. Garcia, A\. Braunstein, A\. Liu, A\. Schmidt, A\. Mereskin, A\. Mishchenko, A\. Applebaum, A\. Rogerson, A\. Rajan, A\. Wei, A\. Kotha, A\. Srivastava, A\. Agrawal, A\. Vijayvergiya, A\. Tyra, A\. Nair, A\. Nayak, B\. Eggers, B\. Ji, B\. Hoover, B\. Chen, B\. Chen, B\. Barak, B\. Minaiev, B\. Hao, B\. Baker, B\. Lightcap, B\. McKinzie, B\. Wang, B\. Quinn, B\. Fioca, B\. Hsu, B\. Yang, B\. Yu, B\. Zhang, B\. Brenner, C\. R\. Zetino, C\. Raymond, C\. Lugaresi, C\. Paz, C\. Hudson, C\. Whitney, C\. Li, C\. Chen, C\. Cole, C\. Voss, C\. Ding, C\. Shen, C\. Huang, C\. Colby, C\. Hallacy, C\. Koch, C\. Lu, C\. Kaplan, C\. Kim, C\. Minott\-Henriques, C\. Frey, C\. Yu, C\. Czarnecki, C\. Reid, C\. Wei, C\. Decareaux, C\. Scheau, C\. Zhang, C\. Forbes, D\. Tang, D\. Goldberg, D\. Roberts, D\. Palmie, D\. Kappler, D\. Levine, D\. Wright, D\. Leo, D\. Lin, D\. Robinson, D\. Grabb, D\. Chen, D\. Lim, D\. Salama, D\. Bhattacharjee, D\. Tsipras, D\. Li, D\. Yu, D\. Strouse, D\. Williams, D\. Hunn, E\. Bayes, E\. Arbus, E\. Akyurek, E\. Y\. Le, E\. Widmann, E\. Yani, E\. Proehl, E\. Sert, E\. Cheung, E\. Schwartz, E\. Han, E\. Jiang, E\. Mitchell, E\. Sigler, E\. Wallace, E\. Ritter, E\. Kavanaugh, E\. Mays, E\. Nikishin, F\. Li, F\. P\. Such, F\. de Avila Belbute Peres, F\. Raso, F\. Bekerman, F\. Tsimpourlas, F\. Chantzis, F\. Song, F\. Zhang, G\. Raila, G\. McGrath, G\. Briggs, G\. Yang, G\. Parascandolo, G\. Chabot, G\. Kim, G\. Zhao, G\. Valiant, G\. Leclerc, H\. Salman, H\. Wang, H\. Sheng, H\. Jiang, H\. 
Wang, H\. Jin, H\. Sikchi, H\. Schmidt, H\. Aspegren, H\. Chen, H\. Qiu, H\. Lightman, I\. Covert, I\. Kivlichan, I\. Silber, I\. Sohl, I\. Hammoud, I\. Clavera, I\. Lan, I\. Akkaya, I\. Kostrikov, I\. Kofman, I\. Etinger, I\. Singal, J\. Hehir, J\. Huh, J\. Pan, J\. Wilczynski, J\. Pachocki, J\. Lee, J\. Quinn, J\. Kiros, J\. Kalra, J\. Samaroo, J\. Wang, J\. Wolfe, J\. Chen, J\. Wang, J\. Harb, J\. Han, J\. Wang, J\. Zhao, J\. Chen, J\. Yang, J\. Tworek, J\. Chand, J\. Landon, J\. Liang, J\. Lin, J\. Liu, J\. Wang, J\. Tang, J\. Yin, J\. Jang, J\. Morris, J\. Flynn, J\. Ferstad, J\. Heidecke, J\. Fishbein, J\. Hallman, J\. Grant, J\. Chien, J\. Gordon, J\. Park, J\. Liss, J\. Kraaijeveld, J\. Guay, J\. Mo, J\. Lawson, J\. McGrath, J\. Vendrow, J\. Jiao, J\. Lee, J\. Steele, J\. Wang, J\. Mao, K\. Chen, K\. Hayashi, K\. Xiao, K\. Salahi, K\. Wu, K\. Sekhri, K\. Sharma, K\. Singhal, K\. Li, K\. Nguyen, K\. Gu\-Lemberg, K\. King, K\. Liu, K\. Stone, K\. Yu, K\. Ying, K\. Georgiev, K\. Lim, K\. Tirumala, K\. Miller, L\. Ahmad, L\. Lv, L\. Clare, L\. Fauconnet, L\. Itow, L\. Yang, L\. Romaniuk, L\. Anise, L\. Byron, L\. Pathak, L\. Maksin, L\. Lo, L\. Ho, L\. Jing, L\. Wu, L\. Xiong, L\. Mamitsuka, L\. Yang, L\. McCallum, L\. Held, L\. Bourgeois, L\. Engstrom, L\. Kuhn, L\. Feuvrier, L\. Zhang, L\. Switzer, L\. Kondraciuk, L\. Kaiser, M\. Joglekar, M\. Singh, M\. Shah, M\. Stratta, M\. Williams, M\. Chen, M\. Sun, M\. Cayton, M\. Li, M\. Zhang, M\. Aljubeh, M\. Nichols, M\. Haines, M\. Schwarzer, M\. Gupta, M\. Shah, M\. Y\. Guan, M\. Huang, M\. Dong, M\. Wang, M\. Glaese, M\. Carroll, M\. Lampe, M\. Malek, M\. Sharman, M\. Zhang, M\. Wang, M\. Pokrass, M\. Florian, M\. Pavlov, M\. Wang, M\. Chen, M\. Wang, M\. Feng, M\. Bavarian, M\. Lin, M\. Abdool, M\. Rohaninejad, N\. Soto, N\. Staudacher, N\. LaFontaine, N\. Marwell, N\. Liu, N\. Preston, N\. Turley, N\. Ansman, N\. Blades, N\. Pancha, N\. Mikhaylin, N\. Felix, N\. Handa, N\. Rai, N\. Keskar, N\. Brown, O\. 
Nachum, O\. Boiko, O\. Murk, O\. Watkins, O\. Gleeson, P\. Mishkin, P\. Lesiewicz, P\. Baltescu, P\. Belov, P\. Zhokhov, P\. Pronin, P\. Guo, P\. Thacker, Q\. Liu, Q\. Yuan, Q\. Liu, R\. Dias, R\. Puckett, R\. Arora, R\. T\. Mullapudi, R\. Gaon, R\. Miyara, R\. Song, R\. Aggarwal, R\. Marsan, R\. Yemiru, R\. Xiong, R\. Kshirsagar, R\. Nuttall, R\. Tsiupa, R\. Eldan, R\. Wang, R\. James, R\. Ziv, R\. Shu, R\. Nigmatullin, S\. Jain, S\. Talaie, S\. Altman, S\. Arnesen, S\. Toizer, S\. Toyer, S\. Miserendino, S\. Agarwal, S\. Yoo, S\. Heon, S\. Ethersmith, S\. Grove, S\. Taylor, S\. Bubeck, S\. Banesiu, S\. Amdo, S\. Zhao, S\. Wu, S\. Santurkar, S\. Zhao, S\. R\. Chaudhuri, S\. Krishnaswamy, Shuaiqi, Xia, S\. Cheng, S\. Anadkat, S\. P\. Fishman, S\. Tobin, S\. Fu, S\. Jain, S\. Mei, S\. Egoian, S\. Kim, S\. Golden, S\. Mah, S\. Lin, S\. Imm, S\. Sharpe, S\. Yadlowsky, S\. Choudhry, S\. Eum, S\. Sanjeev, T\. Khan, T\. Stramer, T\. Wang, T\. Xin, T\. Gogineni, T\. Christianson, T\. Sanders, T\. Patwardhan, T\. Degry, T\. Shadwell, T\. Fu, T\. Gao, T\. Garipov, T\. Sriskandarajah, T\. Sherbakov, T\. Korbak, T\. Kaftan, T\. Hiratsuka, T\. Wang, T\. Song, T\. Zhao, T\. Peterson, V\. Kharitonov, V\. Chernova, V\. Kosaraju, V\. Kuo, V\. Pong, V\. Verma, V\. Petrov, W\. Jiang, W\. Zhang, W\. Zhou, W\. Xie, W\. Zhan, W\. McCabe, W\. DePue, W\. Ellsworth, W\. Bain, W\. Thompson, X\. Chen, X\. Qi, X\. Xiang, X\. Shi, Y\. Dubois, Y\. Yu, Y\. Khakbaz, Y\. Wu, Y\. Qian, Y\. T\. Lee, Y\. Chen, Y\. Zhang, Y\. Xiong, Y\. Tian, Y\. Cha, Y\. Bai, Y\. Yang, Y\. Yuan, Y\. Li, Y\. Zhang, Y\. Yang, Y\. Jin, Y\. Jiang, Y\. Wang, Y\. Wang, Y\. Liu, Z\. Stubenvoll, Z\. Dou, Z\. Wu, and Z\. Wang \(2026\)OpenAI gpt\-5 system card\.External Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[§3\.1](https://arxiv.org/html/2606.15080#S3.SS1.SSS0.Px2.p1.8)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\)Scaling llm test\-time compute optimally can be more effective than scaling model parameters\.External Links:2408\.03314,[Link](https://arxiv.org/abs/2408.03314)Cited by:[3rd item](https://arxiv.org/html/2606.15080#S4.I2.i3.p1.1)\.
- Y\. Sui, Y\. Chuang, G\. Wang, J\. Zhang, T\. Zhang, J\. Yuan, H\. Liu, A\. Wen, S\. Zhong, N\. Zou,et al\.\(2025\)Stop overthinking: a survey on efficient reasoning for large language models\.arXiv preprint arXiv:2503\.16419\.Cited by:[§1](https://arxiv.org/html/2606.15080#S1.p2.1)\.
- L\. Sutawika, G\. Swamy, Z\. S\. Wu, and G\. Neubig \(2026\)Gained in translation: privileged pairwise judges enhance multilingual reasoning\.External Links:2601\.18722,[Link](https://arxiv.org/abs/2601.18722)Cited by:[§B\.1](https://arxiv.org/html/2606.15080#A2.SS1.p1.1),[§B\.4](https://arxiv.org/html/2606.15080#A2.SS4.p1.1),[§1](https://arxiv.org/html/2606.15080#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.15080#S2.T1.2.2.2.2)\.
- Z\. R\. Tam, C\. Wu, Y\. Y\. Chiu, C\. Lin, Y\. Chen, and H\. Lee \(2025\)Language matters: how do multilingual input and reasoning paths affect large reasoning models?\.External Links:2505\.17407,[Link](https://arxiv.org/abs/2505.17407)Cited by:[§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1)\.
- S\. Tan, M\. Luo, J\. Wong, C\. Cai, X\. Shi, W\. Y\. Tang, M\. Roongta, T\. Zhang, L\. E\. Li, R\. A\. Popa, and I\. Stoica \(2026\)DeepScaleR: effective RL scaling of reasoning models via iterative context lengthening\.External Links:[Link](https://openreview.net/forum?id=I6GzDCne7U)Cited by:[§B\.1](https://arxiv.org/html/2606.15080#A2.SS1.p1.1)\.
- K\. Tran, B\. O’Sullivan, and H\. D\. Nguyen \(2025\)Reasoning transfer for an extremely low\-resource and endangered language: bridging languages through sample\-efficient language understanding\.External Links:2504\.02890,[Link](https://arxiv.org/abs/2504.02890)Cited by:[§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px4.p2.1)\.
- S\. Tu, J\. Lin, Q\. Zhang, X\. Tian, L\. Li, X\. Lan, and D\. Zhao \(2026\)Learning when to think: shaping adaptive reasoning in r1\-style models via multi\-stage RL\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Hs3FrjwyVZ)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- M\. Wang, L\. Lange, H\. Adel, Y\. Ma, J\. Strötgen, and H\. Schuetze \(2025a\)Language mixing in reasoning language models: patterns, impact, and internal causes\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 2637–2665\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.132/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.132),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.15080#S1.p2.1)\.
- X\. Wang, Y\. Huang, Y\. Wang, X\. Luo, K\. Guo, Y\. Zhou, and X\. Zhang \(2026\)AdaReasoner: adaptive reasoning enables more flexible thinking\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=VjjJlJ5qik)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- Y\. Wang, P\. Zhang, J\. Tang, H\. Wei, B\. Yang, R\. Wang, C\. Sun, F\. Sun, J\. Zhang, J\. Wu, Q\. Cang, Y\. Zhang, F\. Huang, J\. Lin, F\. Huang, and J\. Zhou \(2025b\)PolyMath: evaluating mathematical reasoning in multilingual contexts\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=B1vCImy6yI)Cited by:[§1](https://arxiv.org/html/2606.15080#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px3.p2.1)\.
- C\. Wu, B\. Li, M\. Gao, Y\. Tian, and Z\. Wang \(2026a\)From efficiency to adaptivity: a deeper look at adaptive reasoning in large language models\.External Links:2511\.10788,[Link](https://arxiv.org/abs/2511.10788)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- S\. Wu, J\. Xie, Y\. Zhang, A\. Chen, K\. Zhang, Y\. Su, and Y\. Xiao \(2026b\)ARM: adaptive reasoning model\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=z9oeQrcNh9)Cited by:[§1](https://arxiv.org/html/2606.15080#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1),[§3\.3](https://arxiv.org/html/2606.15080#S3.SS3.SSS0.Px3.p1.2)\.
- S\. Xu and W\. Zhang \(2026\)Language of thought shapes output diversity in large language models\.External Links:2601\.11227,[Link](https://arxiv.org/abs/2601.11227)Cited by:[§1](https://arxiv.org/html/2606.15080#S1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[Appendix A](https://arxiv.org/html/2606.15080#A1.p1.1),[Table 12](https://arxiv.org/html/2606.15080#A3.T12.1.1.3.5.1.1),[§3\.1](https://arxiv.org/html/2606.15080#S3.SS1.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px1.p1.1)\.
- J\. Yang, B\. Hou, W\. Wei, Y\. Bao, and S\. Chang \(2026\)Ares: adaptive reasoning effort selection for efficient llm agents\.External Links:2603\.07915,[Link](https://arxiv.org/abs/2603.07915)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- Y\. Ye, X\. Feng, X\. Feng, Y\. Huang, Z\. Yuan, L\. Huang, W\. Ma, Q\. Hong, Y\. Lu, D\. Tu, and B\. Qin \(2026\)X1: learning to think adaptively across languages and cultures\.External Links:2604\.16917,[Link](https://arxiv.org/abs/2604.16917)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- Z\. Yong, M\. F\. Adilazuarda, J\. Mansurov, R\. Zhang, N\. Muennighoff, C\. Eickhoff, G\. I\. Winata, J\. Kreutzer, S\. H\. Bach, and A\. F\. Aji \(2025\)Crosslingual reasoning through test\-time scaling\.arXiv preprint arXiv:2505\.05408\.Cited by:[§1](https://arxiv.org/html/2606.15080#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2026\)Dapo: an open\-source llm reinforcement learning system at scale\.Advances in Neural Information Processing Systems38,pp\. 113222–113244\.Cited by:[§3\.1](https://arxiv.org/html/2606.15080#S3.SS1.SSS0.Px1.p1.1),[footnote 3](https://arxiv.org/html/2606.15080#footnote3)\.
- Z\. Yu, T\. Xu, D\. Jin, K\. A\. Sankararaman, Y\. He, W\. Zhou, Z\. Zeng, E\. Helenowski, C\. Zhu, S\. Wang, H\. Ma, and H\. Fang \(2025\)Think smarter not harder: adaptive reasoning with inference aware optimization\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=0ERw2196o1)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- Q\. Zhang, F\. Lyu, Z\. Sun, L\. Wang, W\. Zhang, W\. Hua, H\. Wu, Z\. Guo, Y\. Wang, N\. Muennighoff,et al\.\(2025\)A survey on test\-time scaling in large language models: what, how, where, and how well?\.arXiv preprint arXiv:2503\.24235\.Cited by:[1st item](https://arxiv.org/html/2606.15080#S4.I2.i1.p1.1)\.
- X\. Zhang, N\. Thakur, O\. Ogundepo, E\. Kamalloo, D\. Alfonso\-Hermelo, X\. Li, Q\. Liu, M\. Rezagholizadeh, and J\. Lin \(2023\)MIRACL: a multilingual retrieval dataset covering 18 diverse languages\.Transactions of the Association for Computational Linguistics11,pp\. 1114–1131\.External Links:[Link](https://aclanthology.org/2023.tacl-1.63/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00595)Cited by:[Table 14](https://arxiv.org/html/2606.15080#A3.T14)\.
- X\. Zhang, Y\. Liang, F\. Meng, S\. Zhang, K\. Huang, Y\. Chen, J\. Xu, and J\. Zhou \(2026\)Think natively: unlocking multilingual reasoning with consistency\-enhanced reinforcement learning\.External Links:2510\.07300,[Link](https://arxiv.org/abs/2510.07300)Cited by:[§C\.4](https://arxiv.org/html/2606.15080#A3.SS4.p1.1),[§1](https://arxiv.org/html/2606.15080#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.15080#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.15080#S2.T1.1.1.1.2),[§3\.3](https://arxiv.org/html/2606.15080#S3.SS3.SSS0.Px1.p1.4),[§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px4.p1.2),[§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px3.p3.1)\.
- W\. Zhao, J\. Guo, Y\. Deng, T\. Wu, W\. Zhang, Y\. Hu, X\. Sui, Y\. Zhao, W\. Che, B\. Qin, T\. Chua, and T\. Liu \(2026\)When less language is more: language\-reasoning disentanglement makes LLMs better multilingual reasoners\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=fleQlZ2VTx)Cited by:[§5\.1](https://arxiv.org/html/2606.15080#S5.SS1.SSS0.Px4.p2.1)\.
- W\. Zheng, X\. Huang, Z\. Liu, T\. K\. Vangani, B\. Zou, X\. Tao, Y\. Wu, A\. T\. Aw, N\. F\. Chen, and R\. K\. Lee \(2025\)AdaMCoT: rethinking cross\-lingual factual reasoning through adaptive multilingual chain\-of\-thought\.External Links:2501\.16154,[Link](https://arxiv.org/abs/2501.16154)Cited by:[§2\.2](https://arxiv.org/html/2606.15080#S2.SS2.p1.1)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, and Z\. Luo \(2024\)LlamaFactory: unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),Y\. Cao, Y\. Feng, and D\. Xiong \(Eds\.\),Bangkok, Thailand,pp\. 400–410\.External Links:[Link](https://aclanthology.org/2024.acl-demos.38/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-demos.38)Cited by:[§4](https://arxiv.org/html/2606.15080#S4.SS0.SSS0.Px6.p1.2)\.
## Appendix
## Appendix A Prompts
Figure [5](https://arxiv.org/html/2606.15080#A1.F5) shows the language\-specific prompt instructions used for the Prompt and SFT baselines\. We follow provider\-recommended prompting practices to standardize the output format \(DeepSeek\-AI, [2025a](https://arxiv.org/html/2606.15080#bib.bib30); Yang et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib69)\)\.
Figure 5: Prompt templates used for sampling generations for each language\.
## Appendix B Components of AdaMame
### B\.1 Naturally Occurring Reasoning Traces
We compare SFT on naturally occurring reasoning traces against the conventional choice of machine\-translated English traces \(Sutawika et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib25); Gurgurov et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib27)\)\. Specifically, we evaluate two variants: \(1\) naturally occurring reasoning traces only \($\mathcal{D}_{\mathrm{sft}}$, 30K\), and \(2\) naturally occurring traces augmented with 65K machine\-translated traces from Sutawika et al\. \([2026](https://arxiv.org/html/2606.15080#bib.bib25)\), which are sourced from DeepScaleR \(Tan et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib53)\) and translated using GPT\-5 nano\. Despite training on more than three times as much data \(95K vs\. 30K\), augmenting with machine\-translated traces degrades both reasoning accuracy \(60\.6% to 56\.4%\) and LCPR \(67\.7% to 63\.0%\), confirming our choice to use naturally occurring traces exclusively in all main experiments\.
| Distill\-Qwen 1\.5b | Accuracy \(%, ↑\) | LCPR \(%, ↑\) |
| --- | --- | --- |
| w/ Naturally occurring | 60\.6 | 67\.7 |
| w/ Naturally occurring \+ MT | 56\.4 | 63\.0 |

Table 4: Reasoning accuracy and LCPR for different dataset variants\. Backbone model: Distill\-Qwen 1\.5b\.
### B\.2 LoRA vs\. Full SFT
We compare SFT with LoRA and full fine\-tuning for the Distill\-Qwen 1\.5b backbone, with LoRA applied to all linear modules\. The training corpus $\mathcal{D}_{\mathrm{sft}}$ and all hyperparameters are held constant, except the learning rate, which is reduced from $2\times10^{-4}$ to $2\times10^{-5}$ for full fine\-tuning \(Bohnet et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib38)\)\. As shown in [Table 5](https://arxiv.org/html/2606.15080#A2.T5), LoRA yields higher reasoning accuracy \(60\.6% vs\. 59\.2%\) and LCPR \(67\.7% vs\. 65\.4%\) on MGSM\-Rev2, and we therefore adopt it in all SFT experiments\.
| Distill\-Qwen 1\.5b | Accuracy \(%, ↑\) | LCPR \(%, ↑\) |
| --- | --- | --- |
| w/ LoRA | 60\.6 | 67\.7 |
| w/ Full fine\-tuning | 59\.2 | 65\.4 |

Table 5: Reasoning accuracy and LCPR for LoRA and full fine\-tuning\. Backbone model: Distill\-Qwen 1\.5b\.
### B\.3 Accuracy vs\. Accuracy\+Format Reward
While standard GRPO uses a binary accuracy reward alone, several prior works augment it with a format reward that penalizes malformed rollouts \(DeepSeek\-AI, [2025a](https://arxiv.org/html/2606.15080#bib.bib30); Rastogi et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib37)\)\. We define the format reward as:

$$r_{\mathrm{format}}=\begin{cases}1,&\text{if format is correct}\\ 0,&\text{if format is incorrect}\end{cases}\tag{7}$$

and compare the combined accuracy\+format reward against the accuracy\-only reward in Table [6](https://arxiv.org/html/2606.15080#A2.T6)\. The accuracy\-only reward yields higher reasoning accuracy \(67\.9% vs\. 66\.2%\) and LCPR \(70\.1% vs\. 67\.1%\) than the combined reward, and we therefore adopt it in all main experiments\.
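The two reward designs compared here can be sketched as follows\. The source does not specify what counts as a correct format or how the two rewards are combined, so the `<think>...</think>` plus `\boxed{...}` template, the exact\-match answer check, the unweighted sum, and all function names below are illustrative assumptions rather than the authors' implementation\.

```python
import re

# Assumed template (not given in the source): a <think>...</think> block
# followed by a final answer wrapped in \boxed{...}.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>.*?\\boxed\{[^{}]+\}", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]+)\}")


def format_reward(rollout: str) -> float:
    """Binary format reward, Eq. (7): 1 if the rollout is well formed, else 0."""
    return 1.0 if FORMAT_PATTERN.search(rollout) else 0.0


def accuracy_reward(rollout: str, gold: str) -> float:
    """Binary accuracy reward: 1 if the last boxed answer equals the gold answer."""
    answers = BOXED_PATTERN.findall(rollout)
    return 1.0 if answers and answers[-1].strip() == gold.strip() else 0.0


def combined_reward(rollout: str, gold: str) -> float:
    """Accuracy + format reward; the unweighted sum is an assumption."""
    return accuracy_reward(rollout, gold) + format_reward(rollout)


good = "<think>19.50 / 0.75 = 26</think>\n\\boxed{26}"
bad = "The answer is \\boxed{26}"  # correct answer, but no <think> block
print(combined_reward(good, "26"), combined_reward(bad, "26"))
```

Under these assumptions, a correct but malformed rollout earns less than a correct and well\-formed one, which is the extra signal the combined reward adds over accuracy alone\.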
| Distill\-Qwen 1\.5b | Accuracy \(%, ↑\) | LCPR \(%, ↑\) |
| --- | --- | --- |
| Accuracy | 67\.9 | 70\.1 |
| Accuracy\+Format | 66\.2 | 67\.1 |

Table 6: Reasoning accuracy and LCPR for different reward designs\. Backbone model: Distill\-Qwen 1\.5b\.
### B\.4 GRPO vs\. Dr\.GRPO
Dr\.GRPO is a variant of GRPO that removes the length\-normalization and standard deviation terms from the advantage computation, mitigating pre\-training and optimization biases that otherwise inflate response length during training \(Liu et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib32)\)\. Given its adoption in recent multilingual reasoning work \(Sutawika et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib25); Gurgurov et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib27)\), we evaluate AdaMame\-GRPO on top of both standard GRPO and Dr\.GRPO \([Table 7](https://arxiv.org/html/2606.15080#A2.T7)\)\. Dr\.GRPO yields higher reasoning accuracy \(67\.9% vs\. 63\.4%\) with comparable LCPR \(70\.1% vs\. 69\.9%\), and we therefore adopt it in all main RL experiments\.
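The difference between the two variants can be sketched for a single group of rollouts\. This is a simplified illustration, not the training code: the population standard deviation, the `eps` constant, and modeling Dr\.GRPO's length handling as a constant weight of 1 are assumptions\.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: center by the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr.GRPO: center by the group mean only (no std normalization)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def sequence_loss_weight(num_tokens, length_normalize):
    """Weight applied to a rollout's summed token loss.

    GRPO averages over the rollout's own length (1 / |o_i|); Dr.GRPO drops
    this per-rollout term, modeled here as a constant weight of 1.
    """
    return 1.0 / num_tokens if length_normalize else 1.0


rewards = [1.0, 0.0, 0.0, 1.0]  # binary accuracy rewards for one group
print(grpo_advantages(rewards))     # roughly [1, -1, -1, 1]
print(dr_grpo_advantages(rewards))  # [0.5, -0.5, -0.5, 0.5]
```

With the per\-rollout `1 / |o_i|` weight, a long incorrect rollout contributes a smaller per\-token penalty than a short one, which is the length bias that Dr\.GRPO is designed to remove\.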
| Distill\-Qwen 1\.5b | Accuracy \(%, ↑\) | LCPR \(%, ↑\) |
| --- | --- | --- |
| w/ GRPO | 63\.4 | 69\.9 |
| w/ Dr\.GRPO | 67\.9 | 70\.1 |

Table 7: Reasoning accuracy and LCPR for GRPO and Dr\.GRPO\. Backbone model: Distill\-Qwen 1\.5b\.
### B\.5 Sampling Strategies
We compare three data sampling strategies for constructing the 5K corpus $\mathcal{D}_{\mathrm{grpo}}$:
- Random sampling: queries are randomly sampled from $\mathcal{D}_{\mathrm{sft}}$ without filtering\.
- Conditional sampling: one candidate is sampled per query; a query is retained if the rollout is in the query language but leads to an incorrect answer \(accuracy of 0, language fidelity of 1\.0\)\.
- Rejection sampling: 8 candidates are generated per query; a query is retained if the model produces both correct and incorrect rollouts \(i\.e\., $0<|\mathcal{O}_{\mathrm{correct}}|<8$\)\.

As shown in Table [8](https://arxiv.org/html/2606.15080#A2.T8), rejection sampling yields the highest reasoning accuracy \(67\.9%\) and LCPR \(70\.1%\), and we therefore adopt it in all main experiments\.
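The conditional and rejection filters described above can be sketched as simple predicates over per\-query rollout outcomes\. The function names and the toy query identifiers are hypothetical; only the retention rules come from the text\.

```python
GROUP_SIZE = 8  # number of candidates generated per query for rejection sampling


def keep_by_rejection(correct_flags):
    """Keep a query if its rollouts include both correct and incorrect answers."""
    n_correct = sum(correct_flags)
    return 0 < n_correct < len(correct_flags)


def keep_by_conditional(is_correct, language_fidelity):
    """Keep a query if its single rollout is in the query language but wrong."""
    return (not is_correct) and language_fidelity == 1.0


# Hypothetical per-query correctness flags for 8 rollouts each.
rollouts = {
    "q_easy": [True] * GROUP_SIZE,   # always solved
    "q_hard": [False] * GROUP_SIZE,  # never solved
    "q_mixed": [True, False, True, False, False, True, True, False],
}
kept = [q for q, flags in rollouts.items() if keep_by_rejection(flags)]
print(kept)
```

One plausible reading of why rejection sampling helps: with a binary accuracy reward, a group whose rollouts are all correct or all incorrect has zero group\-relative advantage, so only mixed groups contribute a gradient signal\.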
| Distill\-Qwen 1\.5b | Accuracy \(%, ↑\) | LCPR \(%, ↑\) |
| --- | --- | --- |
| Random | 59\.8 | 62\.1 |
| Conditional | 59\.7 | 64\.6 |
| Rejection | 67\.9 | 70\.1 |

Table 8: Reasoning accuracy and LCPR for different sampling strategies\. Backbone model: Distill\-Qwen 1\.5b\.
### B\.6 Language Detection Reliability
As noted in the Limitations section, AdaMame\-GRPO depends on the reliability of the language detector used to compute the query alignment reward\. We validate the lingua detector in two settings that mirror its roles in our pipeline: \(1\) short\-context detection on MGSM\-Rev2 queries with gold language labels, corresponding to $\mathrm{lang}(q)$, and \(2\) long\-context detection on 200 randomly sampled reasoning traces from mCoT\-MATH \(Lai and Nissim, [2024](https://arxiv.org/html/2606.15080#bib.bib35)\), which contain mathematical expressions, corresponding to $\mathrm{lang}(o_i)$\. As shown in [Table 9](https://arxiv.org/html/2606.15080#A2.T9), the lingua detector achieves 100% accuracy on short\-context detection and 99\.2% on average on long\-context detection, demonstrating strong reliability across both settings\.
| Language | Short \(%\) | Long \(%\) |
| --- | --- | --- |
| French | 100\.0 | 100\.0 |
| Japanese | 100\.0 | 99\.5 |
| Korean | 100\.0 | 100\.0 |
| Thai | 100\.0 | 99\.0 |
| Bengali | 100\.0 | 98\.5 |
| English | 100\.0 | 100\.0 |
| Spanish | 100\.0 | 99\.5 |
| Russian | 100\.0 | 98\.0 |
| Swahili | 100\.0 | 100\.0 |
| Telugu | 100\.0 | 98\.5 |
| Chinese | 100\.0 | 98\.5 |

Table 9: Language detection reliability by language\. Short: short\-context evaluation on MGSM\-Rev2 queries; Long: long\-context evaluation on mCoT\-MATH reasoning traces\.
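As a quick arithmetic check, the 99\.2% long\-context figure is reproduced by an unweighted mean of the per\-language values in Table 9\. The source does not state how the figure is aggregated, so the unweighted mean is an assumption\.

```python
# Long-context detection accuracy (%) per language, copied from Table 9.
long_context_acc = {
    "French": 100.0, "Japanese": 99.5, "Korean": 100.0, "Thai": 99.0,
    "Bengali": 98.5, "English": 100.0, "Spanish": 99.5, "Russian": 98.0,
    "Swahili": 100.0, "Telugu": 98.5, "Chinese": 98.5,
}

# Unweighted (macro) mean across the 11 languages.
macro_avg = sum(long_context_acc.values()) / len(long_context_acc)
lowest = min(long_context_acc, key=long_context_acc.get)
print(round(macro_avg, 1), lowest)
```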
## Appendix C Experiment Setup Details
### C\.1 Dataset Construction
| Language | Retain rate |
| --- | --- |
| French | 6,330 / 8,716 = 72\.6% |
| Japanese | 6,226 / 8,495 = 73\.3% |
| Korean | 5,763 / 8,330 = 69\.2% |
| Portuguese | 6,277 / 8,648 = 72\.6% |
| Thai | 6,366 / 8,667 = 73\.5% |

Table 10: Retain rate after the filtering process\.

We prepend the language\-specific instructions from Appendix [A](https://arxiv.org/html/2606.15080#A1) when generating reasoning traces with GPT\-5 nano, using a sampling temperature of 1\.0 for diversity\. [Table 10](https://arxiv.org/html/2606.15080#A3.T10) reports per\-language retain rates after filtering: the average is 72\.2%, with Thai the highest \(73\.5%\) and Korean the lowest \(69\.2%\)\.
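The retain rates in Table 10 can be recomputed from the raw counts\. The sketch below uses only those counts; whether the reported 72\.2% is a macro or a micro average is not stated in the source, and both are computed here\.

```python
# (retained, generated) counts per language, copied from Table 10.
counts = {
    "French": (6330, 8716),
    "Japanese": (6226, 8495),
    "Korean": (5763, 8330),
    "Portuguese": (6277, 8648),
    "Thai": (6366, 8667),
}

# Per-language retain rate in percent.
rates = {lang: 100.0 * kept / total for lang, (kept, total) in counts.items()}

# Macro average: mean of the per-language rates.
macro_avg = sum(rates.values()) / len(rates)

# Micro average: pooled retained traces over pooled generated traces.
total_kept = sum(kept for kept, _ in counts.values())
total_generated = sum(total for _, total in counts.values())
micro_avg = 100.0 * total_kept / total_generated

print({lang: round(rate, 1) for lang, rate in rates.items()})
print(round(macro_avg, 1), round(micro_avg, 1), total_kept)
```

Both averages round to the same value, and the pooled number of retained traces is consistent with the roughly 30K size given for $\mathcal{D}_{\mathrm{sft}}$ in Appendix B\.1\.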
### C\.2 MGSM\-Rev2 Details
In [Table 11](https://arxiv.org/html/2606.15080#A3.T11), we report the number and proportion of MGSM queries that were revised in MGSM\-Rev2 to correct translation quality and ambiguity issues\.

| Language | \# Updated \(%\) |
| --- | --- |
| Bengali | 46 \(18\.4%\) |
| English | 22 \(8\.8%\) |
| German | 35 \(14\.0%\) |
| Spanish | 38 \(15\.2%\) |
| French | 45 \(18\.0%\) |
| Russian | 32 \(12\.8%\) |
| Swahili | 43 \(17\.2%\) |
| Telugu | 52 \(20\.8%\) |
| Thai | 40 \(16\.0%\) |
| Chinese | 43 \(17\.2%\) |

Table 11: Number of updated MGSM queries per language\. Total count per language: 250\.
### C\.3 Evaluation Metric Details
We provide implementation details of the Language Confusion Pass Rate \(LCPR\) metric\. We follow the procedure introduced in Marchisio et al\. \([2024](https://arxiv.org/html/2606.15080#bib.bib19)\) and use the lingua detector to detect the language\(s\) of a text\. Given a reasoning trace $c$, we compute the following quantities\.
#### Line\-level Pass Rate \(LPR\)\.
We split $c$ into lines \(by newline character\) and check each line against the query language $q_{\ell}$\. LPR is the percentage of reasoning traces in which all lines match $q_{\ell}$:

$$\mathrm{LPR}=\frac{|\mathcal{C}\setminus\mathcal{C}_{\neg\ell}|}{|\mathcal{C}|},\tag{8}$$

where $\mathcal{C}$ is the set of all reasoning traces and $\mathcal{C}_{\neg\ell}$ is the set of traces that contain line\-level errors\.
#### Word\-level Pass Rate \(WPR\)\.
We first exclude traces with line\-level errors \($\mathcal{C}_{\neg\ell}$\), as most line\-level errors would also be counted toward word\-level errors, making it difficult to distinguish between the two error types \(Marchisio et al\., [2024](https://arxiv.org/html/2606.15080#bib.bib19)\)\. For languages that use Latin script, we identify characters outside of the script’s Unicode range\. For languages that do not use Latin script, we detect erroneous English words \(since languages mostly code\-switch with English\) that do not typically occur in target\-language text\. WPR is the percentage of the remaining reasoning traces in which all words are in $q_{\ell}$:

$$\mathrm{WPR}=\frac{|(\mathcal{C}\setminus\mathcal{C}_{\neg\ell})\setminus\mathcal{C}_{\neg w}|}{|\mathcal{C}\setminus\mathcal{C}_{\neg\ell}|},\tag{9}$$

where $\mathcal{C}_{\neg w}$ is the set of traces that contain word\-level errors\.
#### Language Confusion Pass Rate \(LCPR\)\.
LCPR is the harmonic mean of the line\-level and word\-level pass rates:

$$\mathrm{LCPR}=2\times\frac{\mathrm{LPR}\times\mathrm{WPR}}{\mathrm{LPR}+\mathrm{WPR}}\tag{10}$$
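Equations \(8\) to \(10\) can be sketched as follows, assuming each trace has already been checked for line\-level and word\-level errors\. The `TraceCheck` container and the zero\-denominator handling are assumptions; the detector itself is out of scope here\.

```python
from dataclasses import dataclass


@dataclass
class TraceCheck:
    """Outcome of the language checks for one reasoning trace (hypothetical)."""
    has_line_error: bool  # some line is not in the query language
    has_word_error: bool  # some word is not in the query language


def lcpr(traces):
    """Return (LPR, WPR, LCPR) as fractions in [0, 1], following Eqs. (8)-(10)."""
    # C \ C_{not l}: traces without line-level errors.
    line_ok = [t for t in traces if not t.has_line_error]
    # (C \ C_{not l}) \ C_{not w}: of those, traces without word-level errors.
    word_ok = [t for t in line_ok if not t.has_word_error]

    lpr = len(line_ok) / len(traces) if traces else 0.0
    wpr = len(word_ok) / len(line_ok) if line_ok else 0.0
    # Harmonic mean; defined as 0 when both rates are 0 (an assumption).
    score = 2 * lpr * wpr / (lpr + wpr) if (lpr + wpr) > 0 else 0.0
    return lpr, wpr, score


# Toy example: 10 traces, 2 with line errors, 2 of the remaining 8 with word errors.
traces = (
    [TraceCheck(True, False)] * 2
    + [TraceCheck(False, True)] * 2
    + [TraceCheck(False, False)] * 6
)
lpr, wpr, score = lcpr(traces)
print(lpr, wpr, round(score, 3))
```

In the toy example, LPR is 8/10 and WPR is 6/8, so the harmonic mean sits between the two rates and closer to the lower one\.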
### C\.4 Language Consistency vs\. LCPR
The language consistency \(i\.e\., fidelity\) metric used in prior work \(Zhang et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib26); Gao et al\., [2026](https://arxiv.org/html/2606.15080#bib.bib39)\) measures fidelity by the top\-1 detected language alone, treating a trace as fully language\-faithful as long as its dominant language matches the query\. LCPR, by contrast, computes the harmonic mean of line\- and word\-level language pass rates \(LPR and WPR\), capturing code\-switching that top\-1 detection overlooks\. As shown in [Table 15](https://arxiv.org/html/2606.15080#A3.T15), a reasoning trace can receive a perfect language consistency score of 1\.0 while exhibiting substantial code\-switching, which LCPR correctly penalizes\.
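The contrast can be illustrated with a toy trace whose lines have already been labeled with a language\. Using the most frequent line label as a stand\-in for top\-1 detection over the whole trace is a simplifying assumption, and the labels are hypothetical\.

```python
from collections import Counter


def top1_consistency(line_langs, query_lang):
    """Proxy for top-1 detection: 1.0 if the most frequent language matches."""
    dominant, _ = Counter(line_langs).most_common(1)[0]
    return 1.0 if dominant == query_lang else 0.0


def line_level_pass(line_langs, query_lang):
    """Line-level check used by LPR: every line must be in the query language."""
    return all(lang == query_lang for lang in line_langs)


# Hypothetical per-line labels for a German query with English code-switching.
line_langs = ["de", "en", "de", "de", "en"]
print(top1_consistency(line_langs, "de"), line_level_pass(line_langs, "de"))
```

The dominant language is German, so the consistency proxy returns 1\.0, while the line\-level check fails because two lines are in English\.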
| Model | Context Length | Vocab\. Size | HuggingFace Model Identifier | Train Set |
| --- | --- | --- | --- | --- |
| Distill\-Qwen 1\.5b | 128K | 152K | deepseek\-ai/DeepSeek\-R1\-Distill\-Qwen\-1\.5B | Pre\-trained on mathematical reasoning queries; post\-trained on 70K logical reasoning queries \(Qwen et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib55)\), further post\-trained on reasoning traces distilled from DeepSeek\-R1 \(DeepSeek\-AI, [2025a](https://arxiv.org/html/2606.15080#bib.bib30)\) |
| Qwen\-3 4b | 33K | 152K | Qwen/Qwen3\-4B | Pre\-trained on reasoning queries; post\-trained on a wide range of reasoning tasks \(Yang et al\., [2025](https://arxiv.org/html/2606.15080#bib.bib69)\) |

Table 12: List of evaluated models\. We report the context length, vocabulary size, and HuggingFace model identifiers\. We use Qwen\-3 4b with `enable_think=True` mode\.

| Dataset | \# Queries | Translation |
| --- | --- | --- |
| MGSM\-Rev2 | 250 | Human\-translated by professional translators \(Shi et al\., [2022](https://arxiv.org/html/2606.15080#bib.bib13)\) |
| MSVAMP | 250 | Machine translated with ChatGPT \([https://openai\.com/index/chatgpt/](https://openai.com/index/chatgpt/)\); translation quality verified by native speakers on a random subset |
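Table 13: Detailed statistics of evaluation datasets\.

| Language Family | Language | Script | Synthesis | Word Order | Resource Level | \# Speakers | Wikipedia Size \(\# articles\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Indo\-European | English | Latin | analytic | SVO | high | 1,130M | 5,758,285 |
| Indo\-European | French | Latin | fusional | SVO | high | 398M | 2,325,608 |
| Indo\-European | Spanish | Latin | fusional | SVO | high | 592M | 1,669,181 |
| Indo\-European | Portuguese | Latin | fusional | SVO | mid | 269M | 1,171,437 |
| Indo\-European | German | Latin | fusional | SVO, SOV | mid | 178M | 2,651,352 |
| Indo\-European | Russian | Cyrillic | fusional | SVO | mid | 260M | 1,476,045 |
| Indo\-European | Bengali | Bengali | fusional | SOV | low | 337M | 63,762 |
| Sino\-Tibetan | Chinese | Chinese | analytic | SVO | high | 1,350M | 1,246,389 |
| Koreanic | Korean | Hangul | agglutinative | SOV | mid | 80M | 437,373 |
| Japonic | Japanese | Japanese | agglutinative | SOV | mid | 128M | 1,133,444 |
| Niger\-Congo | Swahili | Latin | agglutinative | SVO | low | 83M | 47,793 |
| Dravidian | Telugu | Telugu | agglutinative | SOV | low | 96M | 66,353 |
| Kra\-Dai | Thai | Thai | analytic | SVO | low | 72M | 128,179 |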
Table 14: Characteristics of tested languages\. For each language, we show the language family, script, linguistic typology \(synthesis and word order\), and resource level measured by the number of speakers and Wikipedia articles \(Zhang et al\., [2023](https://arxiv.org/html/2606.15080#bib.bib64)\)\.

| Language | Reasoning trace | Language Consistency \(↑\) | LCPR \(%, ↑\) |
| --- | --- | --- | --- |
| German | Let’s solve the problem: Um denursprünglichen Preis des Buches zu finden, Divisionstypen mit einem Rabatt von 25% auf den ursprünglichen Preisize\. First, wirten wir die Gleichung auf, um denursprünglichen Preis \(p\) zu finden: p \- 0\.25p = 19\.50\. This simplifies to: 0\.75p = 19\.50\. Solving for p, wir p durch Division beider Seiten durch 0\.75: p = \\\\\{19\.50\}\{0\.75\}\. Calculating this gives: p = 26\. \\\\boxed\{26\} | 1\.0 | 0\.0 |
| Korean | 1\. 월요일에 먹은 쿠키 수: 5개 2\. 화요일에 먹은 쿠키 수: 13개 3\. 기대ly, 13개 \- 5개 = 8개의 쿠키를 더 did 4\. 하 Alto에 4배 더 information: 8개 × 4 = 32개 5\. 하 Alto에 2배 더 information: 32개 × 2 = 64개 6\. 하 Alto에 1개의 쿠키는 200칼로리: 64개 × 200칼로리 = 12,800칼로리 7\. summaries: \\\\boxed\{12,800\}칼로리 difference | 1\.0 | 0\.0 |
| Japanese | まず、問題を理解します。問題は、青いネクタイの2倍の数の赤いネクタイを买います。 赤いネクタイは青いネクタイよりも50%高いです。 彼はそれぞれ\$40の青いネクタイに\$200使いました。 彼はネクタイに多少钱使いましたか? まず、青いネクタイの数を设定します。 Let’s denote青いネクタイの数をCとします。 紧接着、赤いネクタイの数は青いネクタイの2倍です。 Therefore, 赤いネクタイの数は2Cです。 青いネクタイの価値はPとします。根据题目、R = 0\.5Pです。 P = \$40です。R = \$200です。 换句话说、合計はC × \$40 \+ 2C × \$200 = \$440Cです。 我们可以得出结论。 \\\\boxed\{440C\} | 1\.0 | 0\.0 |

Table 15: Example of reasoning traces with language consistency and LCPR metric scores\. A reasoning trace can receive a language consistency score of 1\.0 despite substantial code\-switching, which LCPR correctly penalizes\.
## Appendix D Detailed Results
### D\.1 Per\-Language Results
We report per\-language results for all models and evaluation datasets in [Table 16](https://arxiv.org/html/2606.15080#A4.T16) and [Table 17](https://arxiv.org/html/2606.15080#A4.T17)\.
### D\.2 Change in $\beta$ Results
We report numerical results for varying the query alignment factor $\beta$ in AdaMame\-GRPO in [Table 18](https://arxiv.org/html/2606.15080#A4.T18)\.
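| Backbone | Model | Size | Metric | fr | ja | ko | th | bn | en | es | ru | sw | te | zh | de | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Distill\-Qwen 1\.5b | Vanilla | \- | Acc\. | 70\.4 | 48\.0 | 50\.4 | 18\.8 | 32\.0 | 91\.2 | 72\.0 | 70\.0 | 3\.2 | 8\.8 | 89\.6 | 55\.6 | 50\.8 |
| | Prompt | \- | Acc\. | 74\.8 | 53\.2 | 52\.4 | 26\.0 | 44\.0 | 96\.8 | 73\.6 | 77\.2 | 8\.4 | 7\.6 | 89\.6 | 64\.8 | 55\.7 |
| | M\-Thinker Iter1 | 35K | Acc\. | 84\.0 | 67\.2 | 48\.8 | 48\.0 | 47\.6 | 96\.8 | 87\.2 | 82\.8 | 9\.6 | 17\.2 | 93\.2 | 79\.2 | 63\.5 |
| | M\-Thinker Iter2 | 50K | Acc\. | 86\.8 | 72\.4 | 62\.8 | 56\.0 | 50\.8 | 95\.2 | 85\.6 | 83\.6 | 11\.2 | 19\.6 | 94\.0 | 80\.0 | 66\.5 |
| | SFT | 30K | Acc\. | 77\.6 | 60\.4 | 61\.6 | 61\.6 | 42\.4 | 90\.0 | 80\.4 | 73\.2 | 6\.8 | 14\.8 | 86\.0 | 72\.4 | 60\.6 |
| | \+GRPO | 35K | Acc\. | 83\.2 | 68\.0 | 70\.0 | 69\.2 | 46\.8 | 94\.4 | 84\.0 | 80\.4 | 15\.6 | 22\.4 | 88\.0 | 76\.8 | 66\.6 |
| | \+AdaMame\-GRPO | 35K | Acc\. | 85\.2 | 68\.4 | 70\.0 | 68\.0 | 49\.2 | 96\.4 | 84\.8 | 80\.8 | 15\.6 | 25\.2 | 91\.2 | 80\.4 | 67\.9 |
| | Vanilla | \- | LCPR | 92\.5 | 33\.8 | 17\.3 | 64\.0 | 16\.0 | 90\.0 | 94\.8 | 24\.0 | 28\.8 | 47\.5 | 92\.5 | 88\.5 | 57\.5 |
| | Prompt | \- | LCPR | 91\.5 | 37\.6 | 37\.1 | 65\.0 | 14\.6 | 68\.9 | 92\.5 | 34\.3 | 18\.8 | 60\.5 | 90\.7 | 88\.3 | 58\.3 |
| | M\-Thinker Iter1 | 35K | LCPR | 0\.0 | 0\.0 | 1\.0 | 0\.2 | 3\.8 | 57\.4 | 0\.8 | 3\.2 | 6\.0 | 1\.2 | 78\.0 | 1\.8 | 12\.8 |
| | M\-Thinker Iter2 | 50K | LCPR | 0\.2 | 0\.0 | 0\.0 | 0\.0 | 4\.5 | 54\.4 | 0\.0 | 0\.0 | 0\.8 | 4\.0 | 77\.3 | 0\.8 | 11\.8 |
| | SFT | 30K | LCPR | 90\.3 | 70\.3 | 74\.8 | 75\.0 | 49\.6 | 81\.6 | 87\.4 | 16\.9 | 49\.0 | 47\.8 | 93\.2 | 77\.0 | 67\.7 |
| | \+GRPO | 35K | LCPR | 88\.6 | 68\.4 | 78\.6 | 84\.8 | 29\.2 | 81\.2 | 85\.2 | 18\.0 | 54\.0 | 52\.0 | 93\.4 | 79\.7 | 67\.8 |
| | \+AdaMame\-GRPO | 35K | LCPR | 89\.3 | 69\.1 | 78\.4 | 85\.5 | 52\.4 | 81\.5 | 87\.8 | 18\.3 | 57\.4 | 50\.2 | 92\.8 | 78\.0 | 70\.1 |
| | Vanilla | \- | TTC | 6\.8 | 19\.0 | 12\.2 | 60\.7 | 26\.3 | 9\.4 | 4\.7 | 14\.2 | 37\.5 | 38\.3 | 10\.4 | 6\.3 | 20\.5 |
| | Prompt | \- | TTC | 5\.2 | 28\.7 | 11\.3 | 72\.5 | 53\.2 | 24\.0 | 7\.6 | 42\.9 | 59\.7 | 64\.5 | 17\.2 | 8\.6 | 33\.0 |
| | M\-Thinker Iter1 | 35K | TTC | 39\.5 | 19\.2 | 17\.2 | 24\.0 | 15\.9 | 24\.8 | 33\.6 | 36\.1 | 16\.4 | 11\.4 | 16\.0 | 32\.5 | 23\.9 |
| | M\-Thinker Iter2 | 50K | TTC | 26\.7 | 19\.3 | 17\.3 | 20\.4 | 14\.5 | 18\.4 | 32\.5 | 23\.2 | 19\.9 | 13\.9 | 13\.3 | 33\.0 | 21\.0 |
| | SFT | 30K | TTC | 1\.6 | 0\.8 | 1\.5 | 2\.1 | 3\.8 | 1\.6 | 2\.8 | 3\.0 | 13\.4 | 6\.3 | 0\.7 | 2\.4 | 3\.3 |
| | \+GRPO | 35K | TTC | 1\.3 | 0\.1 | 1\.0 | 1\.7 | 2\.6 | 1\.2 | 1\.5 | 1\.5 | 5\.9 | 3\.6 | 0\.7 | 1\.8 | 1\.9 |
| | \+AdaMame\-GRPO | 35K | TTC | 1\.3 | 1\.0 | 0\.9 | 1\.6 | 2\.3 | 1\.3 | 1\.5 | 1\.7 | 5\.6 | 3\.1 | 0\.7 | 1\.9 | 1\.9 |
| Qwen3 4b | Vanilla | \- | Acc\. | 98\.0 | 70\.0 | 64\.8 | 92\.4 | 88\.4 | 99\.6 | 99\.2 | 94\.8 | 37\.6 | 78\.0 | 32\.8 | 88\.4 | 78\.7 |
| | Prompt | \- | Acc\. | 98\.4 | 86\.8 | 83\.2 | 83\.6 | 84\.8 | 98\.8 | 99\.6 | 94\.4 | 30\.0 | 77\.6 | 68\.4 | 89\.2 | 82\.9 |
| | SFT | 30K | Acc\. | 94\.0 | 91\.2 | 84\.4 | 89\.6 | 83\.6 | 96\.0 | 96\.4 | 93\.6 | 36\.8 | 73\.2 | 93\.6 | 95\.2 | 85\.6 |
| | \+GRPO | 35K | Acc\. | 94\.8 | 90\.0 | 82\.0 | 88\.8 | 82\.4 | 98\.0 | 94\.4 | 94\.0 | 34\.8 | 71\.6 | 93\.2 | 94\.4 | 84\.9 |
| | \+AdaMame\-GRPO | 35K | Acc\. | 96\.0 | 90\.0 | 84\.8 | 90\.8 | 81\.6 | 96\.4 | 95\.6 | 94\.4 | 38\.8 | 73\.2 | 94\.4 | 94\.4 | 85\.9 |
| | Vanilla | \- | LCPR | 0\.6 | 1\.6 | 1\.6 | 1\.4 | 2\.2 | 76\.2 | 0\.8 | 90\.7 | 7\.7 | 4\.9 | 93\.2 | 0\.2 | 23\.4 |
| | Prompt | \- | LCPR | 0\.8 | 1\.2 | 1\.8 | 1\.8 | 4\.1 | 79\.8 | 1\.0 | 91\.7 | 13\.8 | 4\.3 | 92\.9 | 0\.2 | 24\.5 |
| | SFT | 30K | LCPR | 93\.9 | 78\.1 | 89\.0 | 90\.4 | 91\.8 | 87\.3 | 93\.0 | 92\.6 | 79\.3 | 93\.3 | 94\.7 | 79\.8 | 88\.6 |
| | \+GRPO | 35K | LCPR | 94\.9 | 76\.4 | 89\.7 | 92\.2 | 91\.4 | 86\.8 | 93\.5 | 92\.7 | 79\.8 | 92\.5 | 94\.8 | 90\.5 | 89\.6 |
| | \+AdaMame\-GRPO | 35K | LCPR | 96\.1 | 78\.3 | 89\.3 | 91\.8 | 91\.8 | 86\.5 | 93\.6 | 92\.8 | 80\.2 | 93\.7 | 94\.3 | 90\.7 | 89\.9 |
| | Vanilla | \- | TTC | 13\.4 | 14\.7 | 13\.9 | 15\.1 | 17\.8 | 22\.4 | 13\.6 | 13\.2 | 34\.3 | 19\.2 | 7\.8 | 13\.9 | 16\.6 |
| | Prompt | \- | TTC | 12\.5 | 14\.4 | 13\.2 | 15\.9 | 18\.4 | 20\.0 | 13\.3 | 12\.2 | 38\.4 | 19\.2 | 8\.4 | 10\.9 | 16\.4 |
| | SFT | 30K | TTC | 1\.1 | 0\.7 | 0\.8 | 7\.6 | 1\.2 | 1\.2 | 1\.2 | 1\.2 | 2\.7 | 2\.1 | 0\.6 | 1\.2 | 1\.8 |
| | \+GRPO | 35K | TTC | 1\.1 | 0\.7 | 0\.8 | 5\.6 | 1\.3 | 1\.2 | 1\.3 | 1\.2 | 2\.6 | 1\.6 | 0\.6 | 1\.3 | 1\.6 |
| | \+AdaMame\-GRPO | 35K | TTC | 1\.1 | 0\.7 | 0\.8 | 4\.9 | 1\.2 | 1\.2 | 1\.1 | 1\.1 | 2\.6 | 1\.7 | 0\.6 | 1\.3 | 1\.5 |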
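Table 16: Per-language results for the MGSM-Rev2 dataset.

| Model | Size | Metric | fr | ja | th | bn | en | es | sw | ru | zh | de | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Distill-Qwen 1.5b | | | | | | | | | | | | | |
| Vanilla | - | Acc. | 82.4 | 71.6 | 45.6 | 48.0 | 93.2 | 88.0 | 14.8 | 85.2 | 90.0 | 84.8 | 70.4 |
| Prompt | - | Acc. | 84.4 | 73.6 | 42.8 | 56.8 | 92.0 | 88.4 | 16.8 | 86.8 | 89.6 | 83.2 | 71.4 |
| M-Thinker Iter1 | 35K | Acc. | 86.2 | 74.4 | 76.0 | 56.0 | 94.8 | 86.4 | 21.0 | 88.8 | 82.0 | 80.0 | 74.6 |
| M-Thinker Iter2 | 50K | Acc. | 86.4 | 78.8 | 78.0 | 57.4 | 96.4 | 90.3 | 20.4 | 90.8 | 86.8 | 80.4 | 76.6 |
| SFT | 30K | Acc. | 83.2 | 76.4 | 72.8 | 57.6 | 94.8 | 91.6 | 26.8 | 82.6 | 88.8 | 85.6 | 76.0 |
| +GRPO | 35K | Acc. | 84.8 | 74.8 | 72.4 | 55.6 | 94.8 | 92.4 | 31.6 | 83.6 | 88.0 | 87.6 | 76.6 |
| +AdaMame-GRPO | 35K | Acc. | 84.0 | 76.0 | 72.4 | 58.8 | 94.4 | 90.8 | 29.6 | 87.8 | 88.8 | 87.6 | 77.0 |
| Vanilla | - | LCPR | 95.0 | 35.8 | 61.5 | 16.5 | 99.1 | 94.3 | 20.9 | 28.2 | 90.7 | 1.6 | 54.4 |
| Prompt | - | LCPR | 96.2 | 36.1 | 66.3 | 9.9 | 98.8 | 94.1 | 39.4 | 28.1 | 92.8 | 1.6 | 56.3 |
| M-Thinker Iter1 | 35K | LCPR | 8.1 | 0.6 | 0.6 | 19.2 | 81.7 | 2.7 | 0.0 | 14.9 | 87.7 | 1.0 | 21.7 |
| M-Thinker Iter2 | 50K | LCPR | 3.1 | 1.0 | 0.4 | 11.5 | 81.6 | 4.3 | 1.0 | 6.3 | 86.4 | 1.6 | 19.7 |
| SFT | 30K | LCPR | 89.4 | 76.3 | 86.4 | 52.1 | 93.3 | 84.3 | 8.8 | 17.9 | 92.6 | 1.4 | 60.3 |
| +GRPO | 35K | LCPR | 91.2 | 78.7 | 84.6 | 29.3 | 94.0 | 81.7 | 8.6 | 23.9 | 93.3 | 1.2 | 58.7 |
| +AdaMame-GRPO | 35K | LCPR | 91.8 | 77.4 | 86.3 | 54.2 | 94.3 | 83.2 | 9.0 | 16.6 | 92.5 | 1.2 | 60.7 |
| Vanilla | - | TTC | 3.6 | 8.4 | 17.0 | 7.7 | 3.1 | 3.3 | 13.5 | 6.3 | 6.3 | 3.6 | 7.3 |
| Prompt | - | TTC | 3.8 | 8.7 | 19.6 | 8.0 | 3.0 | 3.3 | 16.8 | 5.6 | 2.3 | 3.8 | 7.5 |
| M-Thinker Iter1 | 35K | TTC | 24.3 | 11.6 | 16.1 | 13.7 | 10.2 | 19.6 | 12.8 | 20.0 | 6.9 | 16.4 | 15.2 |
| M-Thinker Iter2 | 50K | TTC | 18.9 | 11.1 | 14.1 | 10.8 | 9.8 | 19.4 | 16.8 | 15.5 | 6.7 | 18.8 | 15.2 |
| SFT | 30K | TTC | 1.2 | 0.7 | 1.1 | 1.9 | 0.9 | 0.8 | 4.8 | 1.2 | 1.0 | 1.3 | 1.5 |
| +GRPO | 35K | TTC | 0.8 | 0.8 | 1.3 | 2.0 | 0.8 | 0.9 | 4.4 | 1.3 | 0.4 | 1.1 | 1.4 |
| +AdaMame-GRPO | 35K | TTC | 0.8 | 0.7 | 1.4 | 1.6 | 0.9 | 1.2 | 4.6 | 0.9 | 0.5 | 1.1 | 1.4 |
| Qwen3 4b | | | | | | | | | | | | | |
| Vanilla | - | Acc. | 93.2 | 46.4 | 88.4 | 53.2 | 86.4 | 83.6 | 75.6 | 53.2 | 18.0 | 71.6 | 67.0 |
| Prompt | - | Acc. | 91.8 | 70.2 | 89.0 | 60.5 | 86.4 | 86.4 | 81.2 | 62.4 | 92.4 | 84.0 | 80.4 |
| SFT | 30K | Acc. | 89.6 | 92.0 | 85.6 | 80.4 | 94.4 | 95.2 | 90.4 | 66.8 | 93.2 | 92.0 | 88.0 |
| +GRPO | 35K | Acc. | 90.0 | 92.8 | 87.6 | 81.6 | 94.0 | 94.8 | 90.4 | 69.2 | 92.8 | 92.8 | 88.6 |
| +AdaMame-GRPO | 35K | Acc. | 92.4 | 91.6 | 88.0 | 81.6 | 94.0 | 95.2 | 91.6 | 70.4 | 92.8 | 92.4 | 89.0 |
| Vanilla | - | LCPR | 0.0 | 0.2 | 0.6 | 0.6 | 91.9 | 0.2 | 91.0 | 0.6 | 93.3 | 0.8 | 27.9 |
| Prompt | - | LCPR | 0.0 | 0.0 | 1.0 | 0.6 | 91.1 | 0.4 | 91.5 | 2.0 | 93.5 | 0.8 | 28.1 |
| SFT | 30K | LCPR | 94.2 | 88.8 | 92.6 | 97.3 | 97.0 | 92.0 | 95.4 | 2.3 | 91.8 | 91.6 | 84.3 |
| +GRPO | 35K | LCPR | 96.2 | 89.4 | 92.3 | 96.7 | 97.2 | 93.0 | 96.0 | 2.0 | 92.9 | 94.2 | 85.0 |
| +AdaMame-GRPO | 35K | LCPR | 96.5 | 90.2 | 93.3 | 96.9 | 97.9 | 94.7 | 96.6 | 3.9 | 94.8 | 94.7 | 86.0 |
| Vanilla | - | TTC | 10.7 | 10.5 | 12.7 | 14.4 | 16.8 | 10.0 | 12.0 | 24.6 | 7.5 | 11.7 | 13.1 |
| Prompt | - | TTC | 10.9 | 10.4 | 13.6 | 16.9 | 17.8 | 10.5 | 11.6 | 26.7 | 8.0 | 9.7 | 13.6 |
| SFT | 30K | TTC | 1.1 | 0.7 | 1.6 | 0.9 | 1.0 | 1.4 | 0.9 | 2.6 | 0.5 | 1.2 | 1.2 |
| +GRPO | 35K | TTC | 0.8 | 0.5 | 1.1 | 1.0 | 0.8 | 0.9 | 1.0 | 2.2 | 0.4 | 1.0 | 1.0 |
| +AdaMame-GRPO | 35K | TTC | 0.8 | 0.5 | 0.9 | 0.9 | 0.8 | 0.9 | 0.9 | 1.8 | 0.3 | 0.9 | 0.9 |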
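Table 17: Per-language results for the MSVAMP dataset.

| Model | Dataset | β | Accuracy (%, ↑) | LCPR (%, ↑) |
|---|---|---|---|---|
| Distill-Qwen 1.5b | MGSM-Rev2 | 1 | 68.0 | 69.8 |
| Distill-Qwen 1.5b | MGSM-Rev2 | 2 | 67.9 | 70.1 |
| Distill-Qwen 1.5b | MGSM-Rev2 | 3 | 66.2 | 70.5 |
| Distill-Qwen 1.5b | MGSM-Rev2 | 4 | 59.8 | 71.1 |
| Distill-Qwen 1.5b | MGSM-Rev2 | 5 | 59.3 | 71.6 |
| Distill-Qwen 1.5b | MSVAMP | 1 | 77.7 | 59.6 |
| Distill-Qwen 1.5b | MSVAMP | 2 | 77.0 | 60.7 |
| Distill-Qwen 1.5b | MSVAMP | 3 | 76.6 | 60.8 |
| Distill-Qwen 1.5b | MSVAMP | 4 | 76.3 | 61.2 |
| Distill-Qwen 1.5b | MSVAMP | 5 | 75.7 | 61.6 |
| Qwen3 4b | MGSM-Rev2 | 1 | 86.6 | 87.4 |
| Qwen3 4b | MGSM-Rev2 | 2 | 85.9 | 89.9 |
| Qwen3 4b | MGSM-Rev2 | 3 | 85.4 | 90.1 |
| Qwen3 4b | MGSM-Rev2 | 4 | 80.5 | 90.4 |
| Qwen3 4b | MGSM-Rev2 | 5 | 79.0 | 90.7 |
| Qwen3 4b | MSVAMP | 1 | 90.1 | 78.7 |
| Qwen3 4b | MSVAMP | 2 | 89.0 | 86.0 |
| Qwen3 4b | MSVAMP | 3 | 88.8 | 87.4 |
| Qwen3 4b | MSVAMP | 4 | 88.4 | 87.9 |
| Qwen3 4b | MSVAMP | 5 | 86.9 | 88.0 |

Table 18: Numerical results for varying the query alignment factor β.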
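To make the trade-off in Table 18 easier to read, the following is a minimal sketch that computes the change in accuracy and LCPR between β = 1 and β = 5 for each model and dataset. The numbers are transcribed from Table 18. The names `TABLE_18`, `beta_tradeoff`, and `TRADEOFFS` are illustrative and not part of the released code.

```python
# Values transcribed from Table 18: (accuracy %, LCPR %) for beta = 1..5.
# The dictionary layout and helper names are illustrative assumptions.
TABLE_18 = {
    ("Distill-Qwen 1.5b", "MGSM-Rev2"): [
        (68.0, 69.8), (67.9, 70.1), (66.2, 70.5), (59.8, 71.1), (59.3, 71.6)],
    ("Distill-Qwen 1.5b", "MSVAMP"): [
        (77.7, 59.6), (77.0, 60.7), (76.6, 60.8), (76.3, 61.2), (75.7, 61.6)],
    ("Qwen3 4b", "MGSM-Rev2"): [
        (86.6, 87.4), (85.9, 89.9), (85.4, 90.1), (80.5, 90.4), (79.0, 90.7)],
    ("Qwen3 4b", "MSVAMP"): [
        (90.1, 78.7), (89.0, 86.0), (88.8, 87.4), (88.4, 87.9), (86.9, 88.0)],
}


def beta_tradeoff(rows):
    """Return (accuracy change, LCPR change) from beta=1 to beta=5, in points."""
    (acc_first, lcpr_first), (acc_last, lcpr_last) = rows[0], rows[-1]
    return round(acc_last - acc_first, 1), round(lcpr_last - lcpr_first, 1)


TRADEOFFS = {key: beta_tradeoff(rows) for key, rows in TABLE_18.items()}

if __name__ == "__main__":
    for (model, dataset), (d_acc, d_lcpr) in TRADEOFFS.items():
        print(f"{model:18s} {dataset:10s} accuracy {d_acc:+.1f}  LCPR {d_lcpr:+.1f}")
```

In all four settings reported in Table 18, accuracy decreases and LCPR increases as β grows from 1 to 5.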
## Appendix E Usage of Large Language Models
We used LLMs to support and refine the writing of this work, specifically for stylistic adjustments such as improving readability and removing layout issues (e.g., widows and orphans).