Enhancing Multilingual Reasoning via Steerable Model Merging

arXiv cs.CL Papers

Summary

This paper proposes ST-Merge, a steerable model merging framework that uses a gated cross-attention mechanism to adaptively modulate contributions of a multilingual model and a reasoning model, outperforming fixed merging approaches on multilingual reasoning benchmarks across 21 languages.

arXiv:2606.19002v1 Announce Type: new

# Enhancing Multilingual Reasoning via Steerable Model Merging
Source: [https://arxiv.org/html/2606.19002](https://arxiv.org/html/2606.19002)

Zhuoran Li (1), Rui Xu (2), Jian Yang (3), Junnan Liu (4), Zhijun Chen (3), Qianren Mao (5), Hongcheng Guo (2), Jiaheng Liu (6), Likang Xiao (3), Ming Li (7), Xiaojie Wang (1)

(1) Beijing University of Posts and Telecommunications, (2) Fudan University, (3) Beihang University, (4) Monash University, (5) Zhongguancun Laboratory, (6) Nanjing University, (7) Tsinghua University

###### Abstract

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs, which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

Corresponding author: Xiaojie Wang.

## 1 Introduction

Multilingual reasoning aims to empower Large Language Models (LLMs) to perform complex reasoning tasks across diverse languages. This capability is valuable in circumstances where limited or no annotations are available for low-resource languages. In recent years, reasoning large language models, such as MetaMath (Yu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib5)) and Orca (Mitra et al., [2023](https://arxiv.org/html/2606.19002#bib.bib17)), have achieved significant performance improvements through parameter-efficient fine-tuning on source language data and direct application to target language data (as shown in Figure [1](https://arxiv.org/html/2606.19002#S1.F1)(a)).

Furthermore, additional multilingual representation alignment has been found to improve low-resource language reasoning by composing an external multilingual encoder that replaces or augments the original LLM query embedding. This strategy, named Model Merging, has demonstrated improvements in many multilingual reasoning tasks (Yoon et al., [2024](https://arxiv.org/html/2606.19002#bib.bib4); Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3)), as shown in Figure [1](https://arxiv.org/html/2606.19002#S1.F1)(b).

![Refer to caption](https://arxiv.org/html/2606.19002v1/x1.png)

Figure 1: Illustration of our ST-Merge idea. (a) Direct application of LLM to all languages. (b) One-size-fits-all model merging method. (c) The proposed steerable model merging method learns to modulate the contribution of each source model for different inputs. LLM/$\text{Model}_B$: the reasoning LLM. $\text{Model}_A$: the external multilingual encoder.

However, as a one-size-fits-all strategy, current fixed model merging approaches struggle to strike an optimal balance for inputs across diverse languages. On one hand, over-reliance on the external multilingual encoder can dilute the core reasoning capabilities inherent to the original LLM, potentially leading to catastrophic forgetting. On the other hand, insufficient reliance on the external encoder hampers the understanding of low-resource languages, thereby limiting reasoning performance. Existing studies have observed that fixed model merging often causes a degradation in reasoning capabilities for languages in which the LLM is already proficient (Yoon et al., [2024](https://arxiv.org/html/2606.19002#bib.bib4)). Therefore, it is imperative to devise an adaptive scheme to modulate the models based on the characteristics of inputs.

In this paper, we first conduct an exploratory analysis by introducing manual scalar weights, $\omega_A$ and $\omega_B$, to modulate the representation intensity of the multilingual encoder and the reasoning LLM, respectively. As illustrated in Figure [2](https://arxiv.org/html/2606.19002#S1.F2), the heatmap of accuracy exhibits divergent collaboration patterns across languages. For English, optimal performance is achieved when the weight for the multilingual encoder ($\omega_A$) is relatively low. This suggests that for languages where the base LLM is already proficient, over-reliance on external multilingual signals may act as noise interfering with the inherent reasoning pathways of the LLM. Conversely, for Swahili, performance peaks only when the multilingual encoder contributes significantly (high $\omega_A$). Here, the base LLM lacks the necessary linguistic grounding, and amplifying the external representation is crucial to bridge the semantic gap and activate the reasoning capabilities.

![Refer to caption](https://arxiv.org/html/2606.19002v1/x2.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x3.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x4.png)

Figure 2: Accuracy on MGSM with different manual weight combinations for the two source models. Darker blue grids indicate higher reasoning accuracy.

To pursue both effective utilization of low-resource language understanding and the preservation of inherent reasoning abilities, we propose a Steerable Model Merging framework (ST-Merge) with gated cross-attention. Instead of relying on a fixed concatenation of representations, our method enables the model to dynamically modulate the contribution of each source model (i.e., the multilingual encoder and the reasoning LLM), allowing for more flexible and adaptive coordination. This design facilitates input-aware modulation, enabling the merged model to shift its inductive bias toward the source most aligned with the current input. As a result, it yields more accurate and targeted reasoning across diverse linguistic contexts. The main contributions of this paper are as follows:

- We propose a steerable model merging (ST-Merge) framework that modulates multilingual understanding and reasoning preservation for multilingual reasoning.
- We devise a gated cross-attention mechanism to dynamically modulate the contributions of source models.
- Extensive experiments on four multilingual reasoning benchmarks across 21 languages demonstrate that ST-Merge consistently outperforms strong baselines.

## 2 Related Work

### 2.1 Multilingual Reasoning

Enhancing multilingual reasoning in English-centric LLMs remains a critical challenge. Existing approaches can be broadly categorized into translation-based methods and model merging paradigms. Translation-based strategies, which involve fine-tuning on translated datasets (Zhu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib6)) or employing external translators (Shi et al., [2023](https://arxiv.org/html/2606.19002#bib.bib1)), have demonstrated significant performance gains. However, these methods incur substantial overhead because they rely heavily on high-quality parallel corpora and add the latency of autoregressive decoding. Conversely, model merging has recently emerged as a popular alternative (Yoon et al., [2024](https://arxiv.org/html/2606.19002#bib.bib4); Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3)), aiming to combine the strengths of different experts. Despite their promise, current merging techniques typically employ static fusion strategies, often overlooking the inherent feature conflicts and interference between the multilingual encoder and the reasoning LLM. To address this, we propose ST-Merge, which introduces a dynamic gated network. This approach allows for flexible, context-aware merging of models, resolving inter-model conflicts while maximizing the collaboration between multilingual understanding and logical reasoning.

### 2.2 Model Merging

Model merging aims to combine the strengths of multiple models into a unified architecture and has been widely used to enhance capabilities such as modality integration (Sung et al., [2023](https://arxiv.org/html/2606.19002#bib.bib20); Chen et al., [2024a](https://arxiv.org/html/2606.19002#bib.bib19)) and task generalization (Bandarkar et al., [2025](https://arxiv.org/html/2606.19002#bib.bib16); Du et al., [2025](https://arxiv.org/html/2606.19002#bib.bib18)). Existing works can be broadly categorized into two types: homogeneous merging, which combines models with the same architecture, and heterogeneous merging, which merges models across architectural or modality boundaries. Recent studies have explored model merging for cross-lingual transfer learning (Yoon et al., [2024](https://arxiv.org/html/2606.19002#bib.bib4); Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3)), but often suffer from limited controllability and alignment issues in multilingual reasoning settings. In contrast, our work introduces a steerable model merging approach that dynamically modulates the representations, enabling better coordination between the multilingual encoder and the reasoning LLM for reasoning across both low-resource and high-resource languages.

### 2.3 Gated Attention Mechanism

Gated cross-attention mechanisms have been developed to selectively fuse heterogeneous representations by leveraging learnable weights (Chaplot et al., [2018](https://arxiv.org/html/2606.19002#bib.bib26); Lee et al., [2022](https://arxiv.org/html/2606.19002#bib.bib31)). These approaches employ multiplicative or residual gating strategies to dynamically weight features, filtering noise and enhancing interpretability in fusion tasks (Kim and Shin, [2021](https://arxiv.org/html/2606.19002#bib.bib30); Ortiz-Perez et al., [2025](https://arxiv.org/html/2606.19002#bib.bib27)). Recent studies have extended this paradigm to various architectures, including router-based gating for audio-visual recognition and sparse gating for large language models (Jeong et al., [2025](https://arxiv.org/html/2606.19002#bib.bib29); Qiu et al., [2025](https://arxiv.org/html/2606.19002#bib.bib28)). However, while these mechanisms have proven effective in multimodal settings, their potential for steering cross-lingual alignment within a model merging framework remains unexplored.

## 3 Steerable Model Merging

In this section, we introduce our Steerable Model Merging (ST-Merge) method, which is designed to weight and filter attended models conditioned on the specific input question. Figure [3](https://arxiv.org/html/2606.19002#S3.F3) depicts an overview of the ST-Merge framework. We describe the approach in two stages: feature space alignment and gated cross-attention learning.

#### Problem Formulation

Multilingual reasoning can be formulated as a text generation task. Given an input sequence $\mathbf{x}$ (e.g., a math problem), the model aims to generate the target output sequence $\mathbf{y}$ (e.g., a chain-of-thought and the answer). Formally, the language modeling likelihood of the target output is denoted as:

$$p(\mathbf{y}|\mathbf{x})=\prod_{i=1}^{L}p(y_{i}|\mathbf{x},y_{<i}) \tag{1}$$

Under the paradigm of model merging, we assume there are a multilingual encoder $\mathbf{m}_A$ and a reasoning LLM $\mathbf{m}_B$. Our goal is to learn a merger $\mathbf{m}_{A\oplus B}$ that optimizes the generation probability $p(\mathbf{y}|\mathbf{x})$, ensuring the reasoning accuracy is maintained across different languages.
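As a small numeric illustration of Eq. (1), the sketch below accumulates per-token conditional probabilities in log space, which is how such products are usually computed to avoid underflow. The probability values are hypothetical placeholders, not outputs of any model in this paper.

```python
import math

def sequence_log_likelihood(token_probs):
    """Return log p(y|x) as the sum of log p(y_i | x, y_<i) over output tokens."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical conditional probabilities for a 3-token output sequence.
probs = [0.9, 0.5, 0.8]
log_p = sequence_log_likelihood(probs)
likelihood = math.exp(log_p)  # same value as the direct product 0.9 * 0.5 * 0.8
```

Maximizing this log-likelihood over the trainable merger parameters is one standard way to optimize the generation probability described above.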

### 3.1 Feature Space Alignment

First, we extract distinct features for each input sequence. Specifically, we utilize a multilingual encoder to capture linguistic understanding features and a Large Language Model (LLM) to extract reasoning features.

#### Multilingual Feature

Multilingual feature extraction is performed by the mT5 encoder (Xue et al., [2021](https://arxiv.org/html/2606.19002#bib.bib12)). Given an input sequence $\mathbf{x}$, we employ the multilingual model to encode it into a generalized representation $\mathbf{H}_A$, thereby mitigating the complexity of cross-lingual understanding:

$$\mathbf{H}_{A}=\text{Encoder}(\mathbf{x}) \tag{2}$$

where $\mathbf{H}_{A}\in\mathbb{R}^{l_{A}\times d_{A}}$ is the hidden state output of the last layer in the multilingual encoder, with $l_A$ denoting the sequence length of the original input and $d_A$ the hidden dimension.

![Refer to caption](https://arxiv.org/html/2606.19002v1/x5.png)

Figure 3: Framework of the proposed steerable model merging (ST-Merge) method for multilingual reasoning.

#### Reasoning Feature

We utilize a LLaMA-based reasoning LLM obtained by parameter-efficient fine-tuning (e.g., MetaMath (Yu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib5))) to extract reasoning features. To fully activate the intrinsic reasoning capabilities of the LLM, we append a chain-of-thought prompt $\mathbf{p}$ (e.g., “Let’s think step by step”) to the original input $\mathbf{x}$, forming a prompted sequence $\mathbf{x}^{\prime}=[\mathbf{x};\mathbf{p}]$. We then process this sequence directly through the embedding layer:

$$\mathbf{H}_{B}=\text{Embedding}(\mathbf{x}^{\prime}) \tag{3}$$

where $\mathbf{H}_{B}\in\mathbb{R}^{l_{B}\times d_{B}}$ is the representation within the semantic space of the reasoning LLM, with $l_B$ denoting the length of the prompted sequence (the original input plus the prompt) and $d_B$ the LLM embedding dimension.

#### Feature Alignment

Since the representation $\mathbf{H}_A$ resides in the multilingual representation space, which is separate from the reasoning LLM space, the extracted features cannot be used for reasoning directly. Therefore, we project the multilingual features via a mapping layer:

$$\mathbf{\hat{H}}_{A}=\text{Mapping}(\mathbf{H}_{A}) \tag{4}$$

where $\mathbf{\hat{H}}_{A}\in\mathbb{R}^{l_{A}\times d_{B}}$ is the projection of $\mathbf{H}_A$ onto the reasoning feature space. Unless otherwise stated, $\text{Mapping}(\cdot)$ is implemented as a two-layer Multi-Layer Perceptron (MLP). This transformation aligns the semantic spaces of the multilingual encoder and the reasoning LLM, enabling effective feature merging despite the frozen weights of the base models.
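To make the shapes in Eq. (4) concrete, here is a minimal pure-Python sketch of a two-layer MLP that maps features from width $d_A$ to width $d_B$ while keeping the sequence length $l_A$. The ReLU activation, the hidden width, the omission of bias terms, and the random initialization are all assumptions made for illustration; the text above only states that the mapping is a two-layer MLP.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU (an assumed activation, not specified in the text)."""
    return [[max(0.0, v) for v in row] for row in m]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def mapping(h_a, w1, w2):
    """Two-layer MLP without biases: (l_A x d_A) -> (l_A x d_B)."""
    return matmul(relu(matmul(h_a, w1)), w2)

rng = random.Random(0)
l_a, d_a, d_hidden, d_b = 5, 8, 16, 12   # toy sizes chosen for illustration
h_a = rand_matrix(l_a, d_a, rng)          # stands in for the encoder output H_A
w1 = rand_matrix(d_a, d_hidden, rng)
w2 = rand_matrix(d_hidden, d_b, rng)
h_a_hat = mapping(h_a, w1, w2)            # aligned features, l_A x d_B
```

Only `w1` and `w2` would be trained in such a setup, which matches the description of the base models staying frozen.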

### 3.2 Gated Cross-Attention Learning

To overcome the limitations of fixed model merging, we introduce a gated cross-attention network to dynamically estimate the optimal weights for the multilingual features $\mathbf{\hat{H}}_A$ and the reasoning features $\mathbf{H}_B$ conditioned on the specific input.

#### Cross-Attention

We first facilitate a comprehensive information interaction between the two types of features to construct a holistic context for gating estimation. Formally, the reasoning feature $\mathbf{H}_B$ acts as the query ($\mathbf{Q}_B$) to attend to the aligned multilingual representation $\mathbf{\hat{H}}_A$, which serves as the key ($\mathbf{K}_A$) and value ($\mathbf{V}_A$). We employ $\mathbf{H}_B$ as the query to ensure the attention mechanism is anchored in the semantic space of the reasoning task.

$$\mathbf{K}_{A},\mathbf{V}_{A}=\mathbf{\hat{H}}_{A}\mathbf{W}_{k}^{K},~\mathbf{\hat{H}}_{A}\mathbf{W}_{k}^{V} \tag{5}$$

$$\mathbf{Q}_{B}=\mathbf{H}_{B}\mathbf{W}_{k}^{Q} \tag{6}$$

$$\text{head}_{k}=\text{Attn}(\mathbf{Q}_{B},\mathbf{K}_{A},\mathbf{V}_{A}) \tag{7}$$

$$\mathbf{G}_{A\oplus B}=\text{Concat}_{k}(\text{head}_{k})\mathbf{W}^{O} \tag{8}$$

where $\mathbf{W}_{k}^{Q},\mathbf{W}_{k}^{K},\mathbf{W}_{k}^{V}\in\mathbb{R}^{d_{B}\times d_{k}}$ denote the projection matrices for the $k$-th head, and $d_k$ is the dimension of each attention head. $\mathbf{W}^{O}\in\mathbb{R}^{d_{B}\times d_{B}}$ is the output projection matrix used to aggregate information from all heads.
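The following pure-Python sketch walks through Eqs. (5)-(8) on toy matrices. It assumes that Attn is standard scaled dot-product attention with a softmax over the key positions, which the text does not spell out; all sizes and the random weights are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def attention(q, k, v):
    """Scaled dot-product attention (assumed form of Attn in Eq. 7)."""
    scale = math.sqrt(len(q[0]))
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
               for qi in q]
    return matmul(weights, v), weights

def cross_attention(h_b, h_a_hat, heads, w_o):
    """Queries come from H_B; keys and values come from the aligned H_A."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        q = matmul(h_b, w_q)          # Eq. (6)
        k = matmul(h_a_hat, w_k)      # Eq. (5)
        v = matmul(h_a_hat, w_v)      # Eq. (5)
        out, _ = attention(q, k, v)   # Eq. (7)
        head_outputs.append(out)
    # Concatenate heads along the feature axis, then project with W^O (Eq. 8).
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(h_b))]
    return matmul(concat, w_o)

rng = random.Random(0)
l_a, l_b, d_b, n_heads, d_k = 4, 6, 8, 2, 4   # toy sizes with n_heads * d_k == d_b
h_a_hat = rand_matrix(l_a, d_b, rng)
h_b = rand_matrix(l_b, d_b, rng)
heads = [tuple(rand_matrix(d_b, d_k, rng) for _ in range(3)) for _ in range(n_heads)]
w_o = rand_matrix(n_heads * d_k, d_b, rng)
g = cross_attention(h_b, h_a_hat, heads, w_o)  # one d_B-dim row per query position
```

Because the queries come from $\mathbf{H}_B$, the output has one row per position of the prompted sequence, each a mixture of multilingual value vectors.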

#### Language Embedding

We introduce a learnable lightweight language embedding to explicitly inject language identity, which facilitates language differentiation. Given the language ID, we retrieve the corresponding embedding vector $\mathbf{E}_{Lang}\in\mathbb{R}^{d_{L}}$ and concatenate it with the global context $\mathbf{G}_{A\oplus B}\in\mathbb{R}^{d_{B}}$, yielding a composite representation $\mathbf{Z}\in\mathbb{R}^{d_{B}+d_{L}}$.

#### Feature Fusion

Finally, we employ an MLP layer to project $\mathbf{Z}$ into the two weights that modulate $\mathbf{\hat{H}}_A$ and $\mathbf{H}_B$. We specifically use a $1+\tanh$ activation function to center the weights around $1$: a value of $1$ for both weights recovers fixed model merging with the original features, which provides a stable initialization. The final input to the LLM decoder is constructed as follows:

$$[\omega_{A},\omega_{B}]=1+\tanh(\text{MLP}(\mathbf{Z})) \tag{9}$$

$$\mathbf{H}_{A\oplus B}=[\langle\text{bos}\rangle;\omega_{A}\cdot\mathbf{\hat{H}}_{A};\langle\text{sep}\rangle;\omega_{B}\cdot\mathbf{H}_{B}] \tag{10}$$

where $\omega_A,\omega_B$ are two input-dependent scalar weights for the multilingual and reasoning representations, respectively; $\langle\text{bos}\rangle$ and $\langle\text{sep}\rangle$ are learnable boundary tokens. The resulting fused embedding $\mathbf{H}_{A\oplus B}$ serves as the steered input to the frozen LLM, guiding the generation of the chain-of-thought reasoning path and the final response.
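A minimal sketch of the gate in Eqs. (9)-(10) follows. Several details are assumptions made only for illustration: the global context is obtained by mean pooling over positions, a single linear layer stands in for the MLP, and the boundary tokens are zero vectors. The example also shows why $1+\tanh$ gives a stable start: with zero gate parameters both weights equal 1, which is exactly fixed merging.

```python
import math

def mean_pool(rows):
    """Average a sequence of vectors over positions (an assumed pooling choice)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def gate(z, w, b):
    """Eq. (9) with one linear layer standing in for the MLP: [omega_A, omega_B]."""
    logits = [sum(zi * wi for zi, wi in zip(z, col)) + bj
              for col, bj in zip(zip(*w), b)]
    return [1.0 + math.tanh(v) for v in logits]

def fuse(h_a_hat, h_b, omega_a, omega_b, bos, sep):
    """Eq. (10): scale each source, then concatenate with boundary tokens."""
    scaled_a = [[omega_a * v for v in row] for row in h_a_hat]
    scaled_b = [[omega_b * v for v in row] for row in h_b]
    return [bos] + scaled_a + [sep] + scaled_b

d_b, d_l = 4, 2
g = [[0.2, -0.1, 0.4, 0.0], [0.0, 0.3, -0.2, 0.1]]  # toy cross-attention output
e_lang = [0.5, -0.5]                                 # toy language embedding
z = mean_pool(g) + e_lang                            # Z has d_B + d_L entries

# Zero gate parameters give tanh(0) = 0, so both weights are exactly 1.
w_zero = [[0.0, 0.0] for _ in range(d_b + d_l)]
omega_init = gate(z, w_zero, [0.0, 0.0])

# Non-zero parameters move each weight inside the open interval (0, 2).
w = [[0.3, -0.2] for _ in range(d_b + d_l)]
omega_a, omega_b = gate(z, w, [0.1, 0.0])

h_a_hat = [[1.0] * d_b for _ in range(3)]
h_b = [[1.0] * d_b for _ in range(5)]
fused = fuse(h_a_hat, h_b, omega_a, omega_b, bos=[0.0] * d_b, sep=[0.0] * d_b)
```

The fused sequence has length $l_A + l_B + 2$, and because the weights are bounded in $(0, 2)$ the gate can attenuate or amplify a source but never flip its sign.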

## 4 Experiments

| MGSM | Bn | Th | Sw | Ja | Zh | De | Fr | Ru | Es | En | Lrl. | Hrl. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Translate-En [[2023](https://arxiv.org/html/2606.19002#bib.bib1)] | 48.4 | 37.6 | 37.6 | 49.2 | 46.8 | 60.4 | 56.4 | 47.6 | 59.6 | 65.5 | 41.2 | 55.1 | 50.6 |
| MetaMath [[2024](https://arxiv.org/html/2606.19002#bib.bib5)] | 6.8 | 7.2 | 6.8 | 36.4 | 38.4 | 55.2 | 54.4 | 52.0 | 57.2 | 68.8 | 6.9 | 51.8 | 38.3 |
| MultiReason [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 33.2 | 40.0 | 42.0 | 42.0 | 42.0 | 45.2 | 44.8 | 45.2 | 48.0 | 52.0 | 38.4 | 45.6 | 43.4 |
| QAlign [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 32.4 | 39.6 | 40.4 | 44.0 | 48.4 | 54.8 | 56.8 | 52.4 | 59.6 | 68.0 | 37.5 | 54.9 | 49.6 |
| LangBridge [[2024](https://arxiv.org/html/2606.19002#bib.bib4)] | 42.8 | 50.4 | 43.2 | 40.0 | 45.2 | 56.4 | 50.8 | 52.4 | 58.0 | 63.2 | 45.5 | 52.3 | 50.2 |
| MindMerger [[2024](https://arxiv.org/html/2606.19002#bib.bib3)] | 50.4 | 52.8 | 57.2 | 54.4 | 53.6 | 61.2 | 57.6 | 60.8 | 58.4 | 66.8 | 53.5 | 59.0 | 57.3 |
| LayAlign [[2025](https://arxiv.org/html/2606.19002#bib.bib23)] | 51.6 | 59.2 | 58.4 | 52.0 | 56.0 | 62.0 | 61.6 | 61.6 | 61.6 | 66.4 | 56.4 | 60.2 | 59.0 |
| ST-Merge (Ours) | 54.0 | 56.8 | 58.8 | 53.5 | 57.2 | 62.4 | 61.2 | 62.8 | 65.2 | 68.0 | 56.5 | 61.5 | 60.0 |

| MSVAMP | Bn | Th | Sw | Ja | Zh | De | Fr | Ru | Es | En | Lrl. | Hrl. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Translate-En [[2023](https://arxiv.org/html/2606.19002#bib.bib1)] | 47.9 | 51.3 | 43.1 | 50.4 | 55.8 | 43.9 | 50.9 | 53.4 | 51.4 | 60.6 | 47.4 | 52.3 | 50.9 |
| MetaMath [[2024](https://arxiv.org/html/2606.19002#bib.bib5)] | 14.4 | 19.5 | 16.8 | 53.4 | 55.0 | 63.5 | 64.1 | 60.3 | 64.9 | 66.3 | 16.9 | 61.1 | 47.8 |
| MultiReason [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 34.8 | 38.1 | 39.8 | 43.4 | 42.9 | 45.6 | 45.8 | 45.0 | 46.1 | 46.8 | 37.6 | 45.1 | 42.8 |
| QAlign [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 41.7 | 47.7 | 54.8 | 58.0 | 55.7 | 62.8 | 63.2 | 61.1 | 63.3 | 65.3 | 48.1 | 61.3 | 57.2 |
| LangBridge [[2024](https://arxiv.org/html/2606.19002#bib.bib4)] | 46.8 | 46.3 | 42.1 | 45.5 | 50.4 | 58.1 | 57.0 | 55.8 | 56.9 | 60.6 | 45.1 | 54.9 | 52.0 |
| MindMerger [[2024](https://arxiv.org/html/2606.19002#bib.bib3)] | 52.0 | 53.4 | 54.0 | 59.0 | 61.7 | 64.1 | 64.0 | 63.3 | 65.0 | 67.7 | 53.1 | 63.5 | 60.4 |
| LayAlign [[2025](https://arxiv.org/html/2606.19002#bib.bib23)] | 51.8 | 55.1 | 56.9 | 59.3 | 58.7 | 62.5 | 62.1 | 58.8 | 62.0 | 64.0 | 54.6 | 61.1 | 59.1 |
| ST-Merge (Ours) | 52.8 | 56.3 | 57.6 | 59.7 | 61.7 | 63.8 | 65.7 | 62.2 | 66.1 | 67.6 | 55.6 | 63.9 | 61.4 |

Table 1: Accuracy (%) results on MGSM and MSVAMP. We regard Bn, Th, and Sw as low-resource languages, and regard the remaining languages as high-resource languages. Lrl., Hrl., and Avg. represent the average accuracy across low-resource languages, high-resource languages, and all languages, respectively. The best performance is in bold (same for Table [2](https://arxiv.org/html/2606.19002#S4.T2) and Table [3](https://arxiv.org/html/2606.19002#S4.T3)).

### 4.1 Evaluation Datasets

We evaluate models on four multilingual reasoning datasets across 21 different languages:

#### Mathematical Reasoning

We evaluate on the multilingual math problem datasets MGSM and MSVAMP for this task. MGSM (Shi et al., [2023](https://arxiv.org/html/2606.19002#bib.bib1)) consists of grade-school level math questions translated by humans into 11 typologically diverse languages. MSVAMP (Chen et al., [2024b](https://arxiv.org/html/2606.19002#bib.bib8)) extends the SVAMP dataset (Patel et al., [2021](https://arxiv.org/html/2606.19002#bib.bib11)) to 10 languages, offering linguistically diverse paraphrases of math problems with varying reasoning structures.

#### Commonsense Reasoning

We evaluate commonsense reasoning using X-CSQA (Lin et al., [2021](https://arxiv.org/html/2606.19002#bib.bib9)), a multilingual extension of the CommonsenseQA dataset. X-CSQA provides translated versions of CSQA across multiple languages, along with a new data split to support cross-lingual evaluation. The dataset includes 8,888 English training examples, 1,000 development examples per language, and 1,074 test examples per language.

| X-CSQA | Sw | Fr | En | Avg. |
|---|---|---|---|---|
| Translate-En [[2023](https://arxiv.org/html/2606.19002#bib.bib1)] | 36.5 | 57.2 | 71.3 | 52.3 |
| MetaMath [[2024](https://arxiv.org/html/2606.19002#bib.bib5)] | 24.2 | 63.5 | 76.3 | 51.3 |
| MultiReason [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 27.6 | 52.1 | 67.2 | 43.8 |
| QAlign [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 35.1 | 60.3 | 75.7 | 52.3 |
| LangBridge [[2024](https://arxiv.org/html/2606.19002#bib.bib4)] | 31.8 | 38.2 | 44.4 | 36.1 |
| MindMerger [[2024](https://arxiv.org/html/2606.19002#bib.bib3)] | 45.5 | 68.1 | 78.1 | 61.0 |
| LayAlign [[2025](https://arxiv.org/html/2606.19002#bib.bib23)] | 53.3 | 66.5 | 76.7 | 62.3 |
| ST-Merge (Ours) | 53.6 | 67.2 | 77.3 | 62.5 |

Table 2: Accuracy (%) on X-CSQA. Avg. represents the average accuracy across all languages.

#### Natural Language Inference

We evaluate natural language inference using XNLI (Conneau et al., [2018](https://arxiv.org/html/2606.19002#bib.bib10)), a widely used multilingual benchmark spanning 15 languages. The task involves determining whether a given hypothesis logically follows from a premise, categorized as entailment, contradiction, or neutral. The dataset covers languages both typologically close to English (e.g., French, German, Spanish) and more distant (e.g., Arabic, Thai, Swahili), making it well-suited for evaluating cross-lingual generalization.

| XNLI | Sw | Fr | En | Avg. |
|---|---|---|---|---|
| Translate-En [[2023](https://arxiv.org/html/2606.19002#bib.bib1)] | 65.3 | 80.4 | 81.4 | 75.1 |
| MetaMath [[2024](https://arxiv.org/html/2606.19002#bib.bib5)] | 45.9 | 82.2 | 90.0 | 68.7 |
| MultiReason [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 56.3 | 82.9 | 88.8 | 71.9 |
| QAlign [[2024](https://arxiv.org/html/2606.19002#bib.bib6)] | 65.2 | 83.1 | 89.1 | 73.5 |
| LangBridge [[2024](https://arxiv.org/html/2606.19002#bib.bib4)] | 71.7 | 79.9 | 83.4 | 76.5 |
| MindMerger [[2024](https://arxiv.org/html/2606.19002#bib.bib3)] | 66.6 | 83.9 | 88.7 | 78.4 |
| LayAlign [[2025](https://arxiv.org/html/2606.19002#bib.bib23)] | 73.0 | 84.7 | 88.9 | 79.7 |
| ST-Merge (Ours) | 73.7 | 84.8 | 89.1 | 79.9 |

Table 3: Accuracy (%) on XNLI. Avg. represents the average accuracy across all languages.

| MGSM | Bn | Th | Sw | Ja | Zh | De | Fr | Ru | Es | En | Lrl. | Hrl. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ST-Merge (Ours) | 54.0 | 56.8 | 58.8 | 53.5 | 57.2 | 62.4 | 61.2 | 62.8 | 65.2 | 68.0 | 56.5 | 61.5 | 60.0 |
| w/o Lang. Embed | 51.4 | 54.0 | 56.4 | 53.2 | 56.6 | 61.4 | 60.3 | 62.1 | 63.9 | 67.6 | 53.9 | 60.7 | 58.7 |
| w/o Cross-Attention | 52.7 | 54.7 | 57.8 | 52.8 | 55.8 | 60.8 | 58.5 | 62.7 | 63.4 | 67.1 | 55.1 | 60.2 | 58.6 |
| w/o Gate Network | 50.7 | 52.4 | 55.8 | 51.0 | 53.6 | 59.8 | 56.4 | 60.7 | 62.0 | 66.5 | 53.0 | 58.6 | 56.9 |
| Fix-Merge (Baseline) | 50.5 | 52.9 | 55.7 | 50.8 | 54.8 | 59.1 | 56.8 | 60.7 | 61.7 | 66.7 | 53.0 | 58.7 | 57.0 |

Table 4: Ablation study on MGSM.

### 4.2 Implementation Details

Following the prior setup (Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3); Ruan et al., [2025](https://arxiv.org/html/2606.19002#bib.bib23)), we train the mapping layer on the Lego-MT corpus (Yuan et al., [2023](https://arxiv.org/html/2606.19002#bib.bib25)) via translation tasks. Subsequently, we use the MultilingualMath dataset (Yu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib5); Chen et al., [2024b](https://arxiv.org/html/2606.19002#bib.bib8)) to train the gated cross-attention network. We adopt the encoder of mT5-xl (Xue et al., [2021](https://arxiv.org/html/2606.19002#bib.bib12)) as the multilingual backbone, and employ MetaMath (Yu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib5)) as the reasoning LLM across all experiments, ensuring a fair comparison with prior work (Yoon et al., [2024](https://arxiv.org/html/2606.19002#bib.bib4); Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3); Ruan et al., [2025](https://arxiv.org/html/2606.19002#bib.bib23)). The final model is selected based on the averaged performance of all languages on the dev set. For training, we use 4 NVIDIA A100 GPUs with a learning rate of 2e-5, a batch size of 128, a maximum sequence length of 512, and a total of 3 epochs. We conduct experiments with three different random seeds and report the average results.

### 4.3 Baselines

We compare our method against several state-of-the-art baselines for multilingual reasoning:

- Translate-En (Shi et al., [2023](https://arxiv.org/html/2606.19002#bib.bib1)) translates non-English inputs to English and uses an English reasoning model.
- MetaMath (Yu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib5)) is fine-tuned from LLaMA2-7B on the additional mathematical dataset MetaMathQA and serves as the backbone for the baseline methods.
- MultiReason (Zhu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib6)) enhances reasoning consistency across languages via question alignment and rationale generation.
- QAlign (Zhu et al., [2024](https://arxiv.org/html/2606.19002#bib.bib6)) aligns questions across languages through fine-tuned translation-based contrastive learning.
- LangBridge (Yoon et al., [2024](https://arxiv.org/html/2606.19002#bib.bib4)) introduces an alignment layer to bridge non-English inputs to an English-centric reasoning space.
- MindMerger (Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3)) merges task representations across languages to promote cross-lingual reasoning alignment.
- LayAlign (Ruan et al., [2025](https://arxiv.org/html/2606.19002#bib.bib23)) integrates representations from all encoder layers to enable layer-wise interaction between the two models.

### 4.4 Main Results

As shown in Table [1](https://arxiv.org/html/2606.19002#S4.T1), Table [2](https://arxiv.org/html/2606.19002#S4.T2), and Table [3](https://arxiv.org/html/2606.19002#S4.T3), ST-Merge achieves consistent performance gains across all tasks, demonstrating strong generalization. (Please refer to Appendix Table [6](https://arxiv.org/html/2606.19002#A1.T6) and Table [7](https://arxiv.org/html/2606.19002#A1.T7) for the complete results on X-CSQA and XNLI.) The performance gains on X-CSQA are relatively limited. We conjecture that this stems from its discrete choice-format outputs, which provide limited signals and hinder effective optimization of the gate network.

Compared to existing strong multilingual reasoning baselines such as MindMerger (Huang et al., [2024](https://arxiv.org/html/2606.19002#bib.bib3)), our method achieves superior performance under identical prompts and training data. Our approach can act as a plug-and-play enhancement over fixed model merging strategies, yielding average gains of +1.7%, +1.3%, +1.5%, and +1.5% over the MindMerger baseline on the four benchmarks, respectively.

Notably, our method consistently improves average performance on high-resource languages. As shown in Table [1](https://arxiv.org/html/2606.19002#S4.T1), compared to the strongest baseline on each benchmark, ST-Merge achieves an average high-resource gain of +1.3% on MGSM and +0.4% on MSVAMP. ST-Merge also maintains highly competitive accuracy in English (68.0% on MGSM), outperforming other multilingual baselines while narrowing the gap with the best English result. This demonstrates that ST-Merge can preserve reasoning proficiency in high-resource languages.
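The aggregate columns can be re-derived from the per-language MGSM numbers in Table 1. The short script below is only an arithmetic check under the table's stated convention (Bn, Th, Sw are low-resource; the remaining seven languages are high-resource); the helper names are ours.

```python
def mean(values):
    return sum(values) / len(values)

def summarize(acc):
    """Return (Lrl., Hrl., Avg.) rounded to one decimal, as reported in Table 1."""
    return round(mean(acc[:3]), 1), round(mean(acc[3:]), 1), round(mean(acc), 1)

# MGSM accuracy from Table 1, ordered Bn, Th, Sw, Ja, Zh, De, Fr, Ru, Es, En.
mgsm = {
    "LayAlign": [51.6, 59.2, 58.4, 52.0, 56.0, 62.0, 61.6, 61.6, 61.6, 66.4],
    "ST-Merge": [54.0, 56.8, 58.8, 53.5, 57.2, 62.4, 61.2, 62.8, 65.2, 68.0],
}

st_merge = summarize(mgsm["ST-Merge"])   # (56.5, 61.5, 60.0)
lay_align = summarize(mgsm["LayAlign"])  # (56.4, 60.2, 59.0)

# High-resource gain of ST-Merge over LayAlign on MGSM.
hrl_gain = round(st_merge[1] - lay_align[1], 1)  # 1.3
```

The recomputed values match the Lrl., Hrl., and Avg. columns for both rows, and the high-resource difference reproduces the +1.3 figure quoted for MGSM.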

Question [Swahili]: John hukimbia maili 60 kwa wiki. Huwa anakimbia siku 3 kwa wiki. Huwa anakimbia saa 3 siku ya kwanza na nusu ya kiasi hicho kwa siku nyingine mbili anazokimbia. Huwa anakimbia kwa kasi gani? (Answer: 10)

Output [Fix-Merge]: John runs 3 miles on the first day. He runs half that distance on the other two days, so he runs 3/2 = 1.5 miles on each of those days. In total, he runs 3 + 1.5 + 1.5 = 6 miles in a week. Since he runs 60 miles in a week, his speed is 6[✗, 0.5523] miles / 7 days = 0.857 miles per day. The answer is: 0.857.

Output [ST-Merge (Ours)]: John runs 3 miles on the first day. He runs half that distance on the other two days, so he runs 3/2 = 1.5 miles on each of those days. In total, he runs 3 + 1.5 + 1.5 = 6 miles in a week. Since he runs 60 miles in a week, his speed is 60[✓, 0.9045]/6 = 10 miles per hour. The answer is: 10. ($\omega_A$: 1.1172, $\omega_B$: 1.0547)

Question [French]: Kylar se rend au magasin afin d’acheter des verres pour son nouvel appartement. Un verre coûte 5 \$, mais chaque deuxième verre ne coûte que 60% du prix. Kylar veut acheter 16 verres. Combien devra-t-il payer pour les acheter? (Answer: 64)

Output [Fix-Merge]: The first glass costs \$5. The second glass costs 60% of \$5, which is 0.6 * \$5 = \$3. So for every two glasses, Kylar pays \$5 + \$3 = \$8. Kylar wants to buy 16 glasses, so he will pay 16/2 * \$8 = \$8[✗, 0.4948]0. The answer is: 80.

Output [ST-Merge (Ours)]: The first glass costs \$5. The second glass costs 60% of \$5, which is 0.6 * \$5 = \$3. For every two glasses, the price is \$5 + \$3 = \$8. Kylar wants to buy 16 glasses, so he will need to pay 16/2 * \$8 = \$6[✓, 0.9614]4. The answer is: 64. ($\omega_A$: 1.0156, $\omega_B$: 1.0503)

Question [Chinese]: 帕姆和弗莱德去参加狂欢节。帕姆坐了 2 次云霄飞车,而弗莱德坐了 4 次。在此之后,他们各自决定坐雪橇 2 次。如果每坐一次需要 6 张门票,他们那天用掉了多少张门票? (Answer: 60)

Output [Fix-Merge]: Pam rides the roller coaster 2 times, so she uses 2 x 6 = 12 tickets. Fred rides the roller coaster 4 times, so he uses 4 x 6 = 24 tickets. After riding the roller coaster, they each decide to ride the ice cream, so they each use 2 x 6 = 12 tickets. In total, they use 12 + 1[✗, 0.7773]2 + 24 + 24 = 72 tickets. The answer is: 72.

Output [ST-Merge (Ours)]: Pam rides the roller coaster 2 times, so she uses 2 x 6 = 12 tickets. Fred rides the roller coaster 4 times, so he uses 4 x 6 = 24 tickets. After riding the roller coaster, they each decide to ride the snow sled 2 times, so they use 2 x 6 = 12 tickets each. In total, they use 12 + 12 = 24 tickets for the snow sled. Therefore, they use a total of 12 + 2[✓, 0.9993]4 + 24 = 60 tickets that day. The answer is: 60. ($\omega_A$: 0.9766, $\omega_B$: 0.9314)

Table 5: Case study on MGSM\. The GREEN \(RED\) highlight indicates a correct \(incorrect\) reasoning step\. The real\-valued numbers indicate the next token generation probability\. \($\omega_A$, $\omega_B$\) represent the learned weights by our method for each language\.

### 4\.5 Ablation Study

Table [4](https://arxiv.org/html/2606.19002#S4.T4) presents the ablation results on MGSM for three variants, each removing one component of ST\-Merge\.

1. w/o Lang\. Embed removes the language identity embeddings from the gating network\. The average accuracy drops from 60\.0% to 58\.7%, with a notable degradation of 2\.6% on low\-resource languages\. This suggests that without explicit language cues, the gating network lacks the guidance to differentiate between languages\. Consequently, the optimization process becomes biased towards dominant high\-resource languages, hindering the low\-resource reasoning capacity of the merged model\.
2. w/o Cross\-Attention replaces the fine\-grained cross\-attention mechanism with a simpler concatenation of the representations\. Removing this module leads to a significant performance decline across all languages, reducing the average accuracy to 58\.6%\. This suggests that the token\-level interaction provided by cross\-attention is essential for integrating the context from the two models\.
3. w/o Gate Network completely eliminates the gating network\. In this case, we increase the number of training steps to match the computational budget of the proposed ST\-Merge\. Despite the extended training, this variant shows no improvement over the static baseline \(Fix\-Merge, 57\.0%\) and lags significantly behind the full ST\-Merge model \(60\.0%\)\. This supports the importance of the gated cross\-attention and indicates that the performance boost is driven by the steerable merging strategy rather than by simply extending the optimization process\.
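
The role of the ablated components can be pictured with a toy example\. The sketch below is not the authors' implementation: it assumes a hypothetical scalar gate that maps pooled question features plus a language embedding to two weights through `2 * sigmoid`, so that zero logits recover an equal \(fixed\) merge, and it fuses two feature vectors by a weighted sum instead of cross\-attention\. All names, shapes, and parameter values are illustrative assumptions\.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate(features, lang_embed, weight, bias, use_lang_embed=True):
    """Hypothetical gate returning (omega_a, omega_b), each in (0, 2).

    Setting use_lang_embed=False zeroes the language cue, loosely mirroring
    the "w/o Lang. Embed" ablation.
    """
    lang = list(lang_embed) if use_lang_embed else [0.0] * len(lang_embed)
    x = list(features) + lang
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weight, bias)]
    omega_a, omega_b = (2.0 * sigmoid(z) for z in logits)
    return omega_a, omega_b


def fuse(feat_a, feat_b, omega_a, omega_b):
    """Weighted sum of two source representations (assumed fusion rule)."""
    return [omega_a * a + omega_b * b for a, b in zip(feat_a, feat_b)]


# Illustrative parameters only: the gate reads nothing but the language embedding.
WEIGHT = [[0.0, 0.0, 0.3, 0.0],
          [0.0, 0.0, -0.1, 0.0]]
BIAS = [0.0, 0.0]

features = [0.5, -0.2]      # pooled question features (made up)
lang_embed = [1.0, 0.0]     # one-hot stand-in for a language identity

omega_a, omega_b = gate(features, lang_embed, WEIGHT, BIAS)
fused = fuse([1.0, 2.0], [3.0, 4.0], omega_a, omega_b)
print(omega_a, omega_b, fused)
```

With the language cue removed, this toy gate collapses to equal weights for every input, which is the behaviour a fixed merge would have\.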

### 4\.6 Case Study

We present a case study showing that cases where Fix\-Merge \(the fixed model merging strategy\) fails can be rectified by ST\-Merge, and we use it to give insight into why the proposed steerable model merger is effective\.

The proposed ST\-Merge framework adaptively modulates feature amplification and suppression to extract the most effective representations for multilingual reasoning\. Specifically, ST\-Merge leverages the complementary strengths of the backbone LLM, which possesses strong intrinsic reasoning capabilities, and the external multilingual model, which excels in cross\-lingual semantic understanding\. By using learned weights to selectively enhance these respective strengths while suppressing irrelevant noise, the model is steered toward more accurate reasoning outcomes\.

As shown in Table [5](https://arxiv.org/html/2606.19002#S4.T5), in the Swahili case the static baseline \(Fix\-Merge\) fails to retain the critical entity “60” and hallucinates an incorrect number \(“6 miles”\) with a low confidence probability of 0\.5523\. This indicates that the model was trapped in a state of uncertainty due to the symmetric merging\. In contrast, ST\-Merge identifies that the input requires prioritizing one source model over the other and assigns a higher weight to the multilingual encoder to boost understanding of the question\. Consequently, this asymmetric merging enables the model to generate the correct number \(“60”\) with a high confidence of 0\.9045, which suggests that breaking the symmetry of feature fusion is important for robust low\-resource reasoning\. Similar results can be observed in the Chinese and French examples\.
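
As a sanity check on the reference answers in Table 5, the arithmetic of the three questions can be reproduced directly\. The snippet below only restates the numbers given in the question texts\. The variable names are ours, and it says nothing about how either model computes its answer\.

```python
# Swahili item: 60 miles per week; 3 hours on the first day and half that
# amount on each of the other two running days.
hours_per_week = 3 + 3 / 2 + 3 / 2
speed_mph = 60 / hours_per_week

# French item: 16 glasses at $5 each, where every second glass costs 60%.
full_price = 5
discounted_price = full_price * 60 // 100
total_cost = (16 // 2) * (full_price + discounted_price)

# Chinese item: Pam takes 2 roller-coaster rides, Fred takes 4, and each of
# them takes 2 sled rides; every ride costs 6 tickets.
rides = 2 + 4 + 2 + 2
tickets = rides * 6

print(speed_mph, total_cost, tickets)
```

The results match the gold answers listed in the table \(10, 64, and 60\)\. The Fix\-Merge errors occur at the final combination step, and that is where the low token probabilities appear\.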

### 4\.7 Analysis of Steerable Weights

Figure [4](https://arxiv.org/html/2606.19002#S4.F4) visualizes the reasoning accuracy on four representative languages from the MGSM dataset \(English, Chinese, French, and Swahili\) under varying fusion weights $\omega_A$ and $\omega_B$\. The heatmaps reveal that the optimal weight configuration is highly sensitive to the specific language\. For high\-resource languages like English and Chinese, the learned weights \(marked by gold stars\) and optimal regions favor a balanced or reasoning\-dominant configuration, leveraging the strong mathematical reasoning capabilities inherent in the base model\. In contrast, for a language underrepresented in math reasoning corpora, e\.g\., Swahili, the model assigns a higher value to $\omega_A$\. This suggests that for low\-resource languages, the model prioritizes the multilingual module to align representations before performing reasoning\. Crucially, our steerable gating mechanism consistently converges to these optimal regions across all languages, demonstrating its ability to adaptively regulate the trade\-off between multilingual alignment and mathematical reasoning without manual tuning\.
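
A heatmap of this kind can be produced by a plain grid sweep over the two weights\. The sketch below is a minimal illustration under stated assumptions: `sweep` and `toy_accuracy` are hypothetical names, the grid values are arbitrary, and `toy_accuracy` is a synthetic stand\-in for running the merged model on a language's test set, not data from the paper\.

```python
from itertools import product


def sweep(evaluate, grid_a, grid_b):
    """Score every (omega_a, omega_b) pair and return the table and best cell."""
    table = {(a, b): evaluate(a, b) for a, b in product(grid_a, grid_b)}
    best = max(table, key=table.get)
    return table, best


def toy_accuracy(omega_a: float, omega_b: float) -> float:
    # Synthetic surface peaking at (1.1, 1.0); a real run would evaluate
    # the merged model with these fixed weights instead.
    return 1.0 - (omega_a - 1.1) ** 2 - (omega_b - 1.0) ** 2


grid = [0.8, 0.9, 1.0, 1.1, 1.2]
table, best = sweep(toy_accuracy, grid, grid)
print(best)
```

Comparing the best cell of such a sweep with the weights produced by the gate is one way to check whether the learned weights fall in a high\-accuracy region for a given language\.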

![Refer to caption](https://arxiv.org/html/2606.19002v1/x6.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x7.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x8.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x9.png)

Figure 4: Learned weights analysis of ST\-Merge\. Darker blue grids indicate higher accuracy\. The gold stars represent the learned weights \($\omega_A$, $\omega_B$\) by our method for each language\.

![Refer to caption](https://arxiv.org/html/2606.19002v1/x10.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x11.png)

\(a\) MetaMath \(b\) Our ST\-Merge

Figure 5: t\-SNE visualization of multilingual alignment on MGSM\.

### 4\.8 Multilingual Representation Visualization

To examine cross\-lingual alignment, we compare the question representations of vanilla fine\-tuned MetaMath with those of our ST\-Merge\. We select questions spanning eleven languages from the MGSM dataset and visualize the embedding space\. As shown in Figure [5](https://arxiv.org/html/2606.19002#S4.F5)\(a\), different languages form distinct clusters in the embedding space, which indicates that MetaMath representations remain highly language\-dependent\. In contrast, Figure [5](https://arxiv.org/html/2606.19002#S4.F5)\(b\) shows that the distributions of different languages are mixed and overlapping, which suggests that ST\-Merge aligns representations across languages more effectively than MetaMath\. This alignment contributes to the superior multilingual reasoning performance of ST\-Merge\.

## 5 Conclusion

This work addresses a fundamental challenge of how to effectively coordinate the multilingual encoder and reasoning LLM for multilingual reasoning tasks\. Our analysis reveals that fixed “one\-size\-fits\-all” merging strategies potentially introduce a conflict: while improving reasoning performance on low\-resource languages with an external multilingual encoder, they often degrade reasoning in high\-resource languages where the LLM is already proficient\. To address this, we propose a steerable model merging \(ST\-Merge\) framework that optimizes the merged model toward balanced multilingual reasoning via dynamic adjustment of weights\. Experiments on four multilingual reasoning benchmarks across 21 languages demonstrate consistent gains across both high\-resource and low\-resource languages\. Beyond performance, we further uncover a correlation between reasoning correctness and gating patterns, providing empirical insight into the mechanisms underlying multilingual reasoning generalization\. Additionally, our findings suggest that steerable merging strategies represent a promising direction for enhancing the multilingual capabilities of large language models\.

## Limitations

Our work has several limitations worth noting\. First, to ensure a fair comparison with baseline models, our experiments primarily use the Llama 2 series models\. Future work will extend the experiments to additional model families to evaluate the generalizability of our method across diverse backbone architectures more comprehensively\. Second, while our method effectively generates weights to improve model collaboration, it lacks fine\-grained guidance during the generation process\. We hypothesize that a more granular control mechanism during the decoding phase could further enhance performance\. In the future, we will explore incorporating token\-level or step\-aware guidance to address this issue\.

## Acknowledgments

This work is supported by the National Key R&D Program of China \(2024YFF0907003\)\.

## References

- L\. Bandarkar, B\. Muller, P\. Yuvraj, R\. Hou, N\. Singhal, H\. Lv, and B\. Liu \(2025\)\. Layer swapping for zero\-shot cross\-lingual transfer in large language models\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. [Link](https://openreview.net/forum?id=vQhn4wrQ6j)
- D\. S\. Chaplot, K\. M\. Sathyendra, R\. K\. Pasumarthi, D\. Rajagopal, and R\. Salakhutdinov \(2018\)\. Gated\-attention architectures for task\-oriented language grounding\. In Proceedings of the Thirty\-Second AAAI Conference on Artificial Intelligence \(AAAI\-18\), the 30th Innovative Applications of Artificial Intelligence \(IAAI\-18\), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence \(EAAI\-18\), New Orleans, Louisiana, USA, February 2\-7, 2018, S\. A\. McIlraith and K\. Q\. Weinberger \(Eds\.\), pp\. 2819–2826\. [Link](https://doi.org/10.1609/aaai.v32i1.11832), [Document](https://dx.doi.org/10.1609/AAAI.V32I1.11832)
- C\. Chen, Y\. Du, Z\. Fang, Z\. Wang, F\. Luo, P\. Li, M\. Yan, J\. Zhang, F\. Huang, M\. Sun, and Y\. Liu \(2024a\)\. Model composition for multimodal large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), pp\. 11246–11262\. [Link](https://doi.org/10.18653/v1/2024.acl-long.606), [Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.606)
- N\. Chen, Z\. Zheng, N\. Wu, M\. Gong, D\. Zhang, and J\. Li \(2024b\)\. Breaking language barriers in multilingual mathematical reasoning: insights and observations\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12\-16, 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), pp\. 7001–7016\. [Link](https://doi.org/10.18653/v1/2024.findings-emnlp.411), [Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-EMNLP.411)
- A\. Conneau, R\. Rinott, G\. Lample, A\. Williams, S\. Bowman, H\. Schwenk, and V\. Stoyanov \(2018\)\. XNLI: evaluating cross\-lingual sentence representations\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\), Brussels, Belgium, pp\. 2475–2485\. [Link](https://aclanthology.org/D18-1269/), [Document](https://dx.doi.org/10.18653/v1/D18-1269)
- Y\. Du, X\. Wang, C\. Chen, J\. Ye, Y\. Wang, P\. Li, M\. Yan, J\. Zhang, F\. Huang, Z\. Sui, M\. Sun, and Y\. Liu \(2025\)\. AdaMMS: model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11\-15, 2025, pp\. 9413–9422\. [Link](https://openaccess.thecvf.com/content/CVPR2025/html/Du%5C_AdaMMS%5C_Model%5C_Merging%5C_for%5C_Heterogeneous%5C_Multimodal%5C_Large%5C_Language%5C_Models%5C_with%5C_CVPR%5C_2025%5C_paper.html), [Document](https://dx.doi.org/10.1109/CVPR52734.2025.00879)
- Z\. Huang, W\. Zhu, G\. Cheng, L\. Li, and F\. Yuan \(2024\)\. MindMerger: efficiently boosting LLM reasoning in non\-English languages\. In Advances in Neural Information Processing Systems, A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\), Vol\. 37, pp\. 34161–34187\. [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/3bf80b34f731313b8292f4578e820c90-Paper-Conference.pdf)
- B\. Jeong, J\. Park, S\. Kim, and S\. Kwak \(2025\)\. Learning audio\-guided video representation with gated attention for video\-text retrieval\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11\-15, 2025, pp\. 26202–26211\. [Link](https://openaccess.thecvf.com/content/CVPR2025/html/Jeong%5C_Learning%5C_Audio-guided%5C_Video%5C_Representation%5C_with%5C_Gated%5C_Attention%5C_for%5C_Video-Text%5C_Retrieval%5C_CVPR%5C_2025%5C_paper.html), [Document](https://dx.doi.org/10.1109/CVPR52734.2025.02440)
- Y\. Kim and B\. Shin \(2021\)\. An interpretable framework for drug\-target interaction with gated cross attention\. In Proceedings of the Machine Learning for Healthcare Conference, MLHC 2021, 6\-7 August 2021, Virtual Event, K\. Jung, S\. Yeung, M\. P\. Sendak, M\. W\. Sjoding, and R\. Ranganath \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 149, pp\. 337–353\. [Link](https://proceedings.mlr.press/v149/kim21b.html)
- J\. Lee, S\. Yun, and M\. Jain \(2022\)\. Leaky gated cross\-attention for weakly supervised multi\-modal temporal action localization\. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3\-8, 2022, pp\. 817–826\. [Link](https://doi.org/10.1109/WACV51458.2022.00089), [Document](https://dx.doi.org/10.1109/WACV51458.2022.00089)
- B\. Y\. Lin, S\. Lee, X\. Qiao, and X\. Ren \(2021\)\. Common sense beyond English: evaluating and improving multilingual language models for commonsense reasoning\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\), Online, pp\. 1274–1287\. [Link](https://aclanthology.org/2021.acl-long.102/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.102)
- A\. Mitra, L\. D\. Corro, S\. Mahajan, A\. Codas, C\. Simões, S\. Agrawal, X\. Chen, A\. Razdaibiedina, E\. Jones, K\. Aggarwal, H\. Palangi, G\. Zheng, C\. Rosset, H\. Khanpour, and A\. Awadallah \(2023\)\. Orca 2: teaching small language models how to reason\. CoRR abs/2311\.11045\. [Link](https://doi.org/10.48550/arXiv.2311.11045), [Document](https://dx.doi.org/10.48550/ARXIV.2311.11045)
- D\. Ortiz\-Perez, M\. Benavent\-Lledó, J\. Rodríguez\-Juan, J\. G\. Rodríguez, and D\. Tomás \(2025\)\. CogniAlign: word\-level multimodal speech alignment with gated cross\-attention for Alzheimer’s detection\. Knowl\. Based Syst\. 329, pp\. 114264\. [Link](https://doi.org/10.1016/j.knosys.2025.114264), [Document](https://dx.doi.org/10.1016/J.KNOSYS.2025.114264)
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\)\. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\), Online, pp\. 2080–2094\. [Link](https://aclanthology.org/2021.naacl-main.168/), [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.168)
- Z\. Qiu, Z\. Wang, B\. Zheng, Z\. Huang, K\. Wen, S\. Yang, R\. Men, L\. Yu, F\. Huang, S\. Huang, D\. Liu, J\. Zhou, and J\. Lin \(2025\)\. Gated attention for large language models: non\-linearity, sparsity, and attention\-sink\-free\. CoRR abs/2505\.06708\. [Link](https://doi.org/10.48550/arXiv.2505.06708), [Document](https://dx.doi.org/10.48550/ARXIV.2505.06708)
- Z\. Ruan, Y\. Li, H\. Zhu, L\. Wang, W\. Luo, K\. Zhang, Y\. Chen, and G\. Chen \(2025\)\. LayAlign: enhancing multilingual reasoning in large language models via layer\-wise adaptive fusion and alignment strategy\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 1481–1495\. ISBN 979\-8\-89176\-195\-7\. [Link](https://aclanthology.org/2025.findings-naacl.81/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.81)
- F\. Shi, M\. Suzgun, M\. Freitag, X\. Wang, S\. Srivats, S\. Vosoughi, H\. W\. Chung, Y\. Tay, S\. Ruder, D\. Zhou, D\. Das, and J\. Wei \(2023\)\. Language models are multilingual chain\-of\-thought reasoners\. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023\. [Link](https://openreview.net/forum?id=fR3wGCk-IXp)
- Y\. Sung, L\. Li, K\. Lin, Z\. Gan, M\. Bansal, and L\. Wang \(2023\)\. An empirical study of multimodal model merging\. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6\-10, 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), pp\. 1563–1575\. [Link](https://doi.org/10.18653/v1/2023.findings-emnlp.105), [Document](https://dx.doi.org/10.18653/V1/2023.FINDINGS-EMNLP.105)
- L\. Xue, N\. Constant, A\. Roberts, M\. Kale, R\. Al\-Rfou, A\. Siddhant, A\. Barua, and C\. Raffel \(2021\)\. mT5: a massively multilingual pre\-trained text\-to\-text transformer\. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\), Online, pp\. 483–498\. [Link](https://aclanthology.org/2021.naacl-main.41/), [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.41)
- D\. Yoon, J\. Jang, S\. Kim, S\. Kim, S\. Shafayat, and M\. Seo \(2024\)\. LangBridge: multilingual reasoning without multilingual supervision\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 7502–7522\. [Link](https://aclanthology.org/2024.acl-long.405/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.405)
- L\. Yu, W\. Jiang, H\. Shi, J\. Yu, Z\. Liu, Y\. Zhang, J\. T\. Kwok, Z\. Li, A\. Weller, and W\. Liu \(2024\)\. MetaMath: bootstrap your own mathematical questions for large language models\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. [Link](https://openreview.net/forum?id=N8N0hgNDRt)
- F\. Yuan, Y\. Lu, W\. Zhu, L\. Kong, L\. Li, Y\. Qiao, and J\. Xu \(2023\)\. Lego\-MT: learning detachable models for massively multilingual machine translation\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 11518–11533\. [Link](https://aclanthology.org/2023.findings-acl.731/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.731)
- W\. Zhu, S\. Huang, F\. Yuan, S\. She, J\. Chen, and A\. Birch \(2024\)\. Question translation training for better multilingual reasoning\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 8411–8423\. [Link](https://aclanthology.org/2024.findings-acl.498/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.498)

## Appendix A Example Appendix

### A\.1 Complete Experimental Results

| X\-CSQA | Sw | Ur | Hi | Ar | Vi | Ja | Pl | Zh | Nl | Ru | It | De | Pt | Fr | Es | En | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MetaMath \[[2024](https://arxiv.org/html/2606.19002#bib.bib5)\] | 24\.2 | 25\.1 | 32\.9 | 32\.3 | 50\.9 | 49\.1 | 50\.6 | 56\.5 | 57\.5 | 56\.0 | 56\.0 | 61\.2 | 61\.7 | 63\.5 | 64\.0 | 76\.3 | 51\.3 |
| MultiReason \[[2024](https://arxiv.org/html/2606.19002#bib.bib6)\] | 27\.6 | 29\.2 | 32\.0 | 28\.7 | 38\.8 | 38\.7 | 45\.5 | 43\.8 | 45\.9 | 46\.5 | 50\.2 | 49\.1 | 51\.2 | 52\.1 | 54\.3 | 67\.2 | 43\.8 |
| QAlign \[[2024](https://arxiv.org/html/2606.19002#bib.bib6)\] | 35\.1 | 32\.6 | 37\.8 | 36\.3 | 50\.5 | 49\.2 | 51\.3 | 54\.8 | 56\.3 | 56\.3 | 58\.3 | 58\.8 | 59\.8 | 60\.3 | 63\.1 | 75\.7 | 52\.3 |
| LangBridge \[[2024](https://arxiv.org/html/2606.19002#bib.bib4)\] | 31\.8 | 30\.5 | 30\.6 | 30\.6 | 33\.3 | 33\.9 | 39\.8 | 39\.8 | 38\.4 | 35\.1 | 39\.1 | 37\.4 | 36\.3 | 38\.2 | 38\.4 | 44\.4 | 36\.1 |
| Translate\-En \[[2023](https://arxiv.org/html/2606.19002#bib.bib1)\] | 36\.5 | 41\.3 | 48\.4 | 44\.6 | 51\.8 | 47\.1 | 53\.3 | 51\.5 | 55\.0 | 54\.4 | 56\.3 | 57\.3 | 54\.7 | 57\.2 | 55\.5 | 71\.3 | 52\.3 |
| MindMerger \[[2024](https://arxiv.org/html/2606.19002#bib.bib3)\] | 45\.5 | 46\.2 | 48\.4 | 51\.4 | 60\.6 | 53\.9 | 63\.3 | 62\.9 | 63\.8 | 63\.7 | 66\.8 | 67\.0 | 67\.1 | 68\.1 | 69\.1 | 78\.1 | 61\.0 |
| LayAlign \[[2025](https://arxiv.org/html/2606.19002#bib.bib23)\] | 53\.3 | 51\.7 | 53\.7 | 55\.9 | 62\.0 | 56\.4 | 64\.8 | 64\.6 | 66\.2 | 62\.0 | 66\.2 | 65\.2 | 64\.3 | 66\.5 | 67\.3 | 76\.7 | 62\.3 |
| ST\-Merge \(Ours\) | 53\.6 | 51\.9 | 51\.6 | 56\.5 | 61\.7 | 57\.9 | 65\.1 | 64\.1 | 64\.4 | 63\.1 | 66\.6 | 65\.2 | 66\.4 | 67\.2 | 67\.3 | 77\.3 | 62\.5 |

Table 6: Accuracy \(%\) on X\-CSQA\. Avg\. represents the average accuracy across all languages\.

| XNLI | Sw | Ur | Hi | Th | Ar | Tr | El | Vi | Zh | Ru | Bg | De | Fr | Es | En | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MetaMath \[[2024](https://arxiv.org/html/2606.19002#bib.bib5)\] | 45\.9 | 49\.2 | 55\.7 | 55\.4 | 60\.9 | 61\.9 | 63\.7 | 73\.7 | 74\.7 | 77\.6 | 76\.7 | 80\.6 | 82\.2 | 82\.8 | 90\.0 | 68\.7 |
| MultiReason \[[2024](https://arxiv.org/html/2606.19002#bib.bib6)\] | 56\.3 | 57\.5 | 61\.7 | 60\.1 | 61\.7 | 65\.6 | 67\.0 | 73\.7 | 79\.1 | 79\.7 | 78\.7 | 82\.3 | 82\.9 | 83\.9 | 88\.8 | 71\.9 |
| QAlign \[[2024](https://arxiv.org/html/2606.19002#bib.bib6)\] | 65\.2 | 62\.2 | 63\.3 | 65\.2 | 67\.0 | 67\.9 | 66\.5 | 73\.7 | 76\.6 | 79\.2 | 79\.4 | 80\.9 | 83\.1 | 83\.8 | 89\.1 | 73\.5 |
| LangBridge \[[2024](https://arxiv.org/html/2606.19002#bib.bib4)\] | 71\.7 | 66\.9 | 71\.1 | 72\.4 | 75\.2 | 74\.8 | 79\.1 | 78\.5 | 77\.4 | 77\.4 | 79\.6 | 78\.8 | 79\.9 | 80\.5 | 83\.4 | 76\.5 |
| Translate\-En \[[2023](https://arxiv.org/html/2606.19002#bib.bib1)\] | 65\.3 | 61\.6 | 68\.7 | 69\.5 | 68\.9 | 74\.5 | 79\.3 | 76\.7 | 74\.8 | 76\.0 | 80\.8 | 80\.6 | 80\.4 | 81\.4 | 87\.4 | 75\.1 |
| MindMerger \[[2024](https://arxiv.org/html/2606.19002#bib.bib3)\] | 66\.6 | 69\.4 | 74\.7 | 71\.8 | 76\.2 | 75\.7 | 78\.5 | 80\.3 | 80\.0 | 80\.7 | 82\.4 | 83\.5 | 83\.9 | 84\.4 | 88\.7 | 78\.4 |
| LayAlign \[[2025](https://arxiv.org/html/2606.19002#bib.bib23)\] | 73\.0 | 71\.0 | 74\.7 | 74\.1 | 77\.6 | 76\.0 | 79\.6 | 80\.8 | 80\.8 | 81\.8 | 83\.4 | 83\.9 | 84\.7 | 84\.8 | 88\.9 | 79\.7 |
| ST\-Merge \(Ours\) | 73\.7 | 71\.8 | 75\.1 | 74\.2 | 77\.6 | 77\.2 | 80\.0 | 80\.1 | 81\.0 | 82\.2 | 83\.3 | 83\.5 | 84\.7 | 84\.8 | 89\.1 | 79\.9 |

Table 7: Accuracy \(%\) on XNLI\. Avg\. represents the average accuracy across all languages\.

| Method | Params \(M\) | FLOPs \(G\) | Train \(h\) | Infer \(m\) | Avg\. Acc |
| --- | --- | --- | --- | --- | --- |
| ST\-Merge \(Ours\) | 10282\.5 | 17700\.78 | 5\.57 | 26\.3 | 60\.0 |
| w/o Gate Network | 10265\.2 | 17700\.50 | 4\.98 | 26\.3 | 57\.3 |
| Relative Overhead | \+0\.17% | \+0\.0016% | \+11\.85% | 0\.00% | \+4\.71% |

Table 8: Comparison of computational overhead and performance\.

![Refer to caption](https://arxiv.org/html/2606.19002v1/x12.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x13.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x14.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x15.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x16.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x17.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x18.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x19.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x20.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x21.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x22.png)

Figure 6: Accuracy on MGSM with different manual weights combinations for the two source models\. Darker blue grids indicate higher reasoning accuracy\.

To facilitate reference, the languages utilized in this work are abbreviated as follows: Bengali \(Bn\), Thai \(Th\), Swahili \(Sw\), Japanese \(Ja\), Chinese \(Zh\), German \(De\), French \(Fr\), Russian \(Ru\), Spanish \(Es\), English \(En\), Urdu \(Ur\), Hindi \(Hi\), Arabic \(Ar\), Vietnamese \(Vi\), Polish \(Pl\), Dutch \(Nl\), Italian \(It\), Portuguese \(Pt\), Turkish \(Tr\), Greek \(El\), and Bulgarian \(Bg\)\. Due to page limitations, the complete breakdown of results is included here\. We report the extensive experimental data on X\-CSQA in Table [6](https://arxiv.org/html/2606.19002#A1.T6) and on XNLI in Table [7](https://arxiv.org/html/2606.19002#A1.T7)\. Additionally, comparative results on MGSM are visualized in Figure [6](https://arxiv.org/html/2606.19002#A1.F6) \(manual weights combinations\) and Figure [7](https://arxiv.org/html/2606.19002#A1.F7) \(learned weights combinations\)\.

### A\.2 Computational Overhead Analysis

To verify that the performance gains of ST\-Merge stem from our proposed steerable merging design rather than a simple increase in parameter capacity, we evaluate its computational overhead against the w/o Gate Network variant\. As summarized in Table [8](https://arxiv.org/html/2606.19002#A1.T8), ST\-Merge introduces only a marginal parameter increase of 17\.3M \(\+0\.17%\) and a negligible overhead in FLOPs \(\+0\.0016%\)\. Although the training time increases by 11\.85%, the inference latency remains identical to that of the variant without the gate\. Considering the absolute improvement of 2\.7 points \(\+4\.71% relative\) in average accuracy, these results indicate that the effectiveness of ST\-Merge is driven by its architectural contribution rather than parameter expansion, which supports its practicality for resource\-constrained applications\.
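
The relative overheads quoted above follow directly from the two rows of Table 8\. The short check below recomputes them\. The dictionary keys and the helper name `relative_change` are ours, and the numbers are copied from the table\.

```python
def relative_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Values copied from Table 8 as (ST-Merge, w/o Gate Network).
TABLE_8 = {
    "params_m": (10282.5, 10265.2),
    "flops_g": (17700.78, 17700.50),
    "train_h": (5.57, 4.98),
    "infer_m": (26.3, 26.3),
    "avg_acc": (60.0, 57.3),
}

overhead = {name: relative_change(new, base)
            for name, (new, base) in TABLE_8.items()}

for name, pct in overhead.items():
    print(f"{name}: {pct:+.4f}%")
```

Note that the \+4\.71% figure for accuracy is a relative change \(2\.7 points over a 57\.3% base\), not an absolute difference\.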

![Refer to caption](https://arxiv.org/html/2606.19002v1/x23.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x24.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x25.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x26.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x27.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x28.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x29.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x30.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x31.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x32.png)![Refer to caption](https://arxiv.org/html/2606.19002v1/x33.png)

Figure 7: Learned weights analysis of ST\-Merge\. Darker blue grids indicate higher accuracy\. The gold stars represent the learned weights \($\omega_A$, $\omega_B$\) by our method for each language\.
