Learning to Select, Not Relearn: Hard-Routed Mixtures of Reasoning LoRAs

arXiv cs.AI Papers

Summary

Proposes Hard-Routed MoR-LoRA, a two-stage framework that composes frozen reasoning LoRA experts via hard top-1 routing, preserving expert behavior with fewer trainable parameters than soft-routing baselines.

arXiv:2606.31413v1 Announce Type: new Abstract: Composing independently trained LoRA adapters into a single large language model is useful for multi-domain adaptation, especially when the original training data cannot be shared. A common approach is to use MoE-style routing over LoRA experts, but for frozen pretrained adapters, soft weighted combinations can change the unit-scale additive update under which each LoRA module was originally trained. We propose **Hard-Routed MoR-LoRA**, a two-stage framework for composing frozen reasoning LoRA experts through unit-scale hard selection. First, domain-specific LoRA adapters are trained independently using reinforcement learning from verifiable feedback to obtain reasoning experts. Then, all experts are frozen, reasoning traces are distilled from them, and only a lightweight shared router together with a small attention LoRA is trained for integration. The router selects exactly one expert per token using hard top-1 routing, while a straight-through estimator enables gradient-based training. Experiments across five benchmarks, multiple model scales, and additional model families show that Hard-Routed MoR-LoRA preserves expert behavior while requiring substantially fewer trainable parameters than soft-routing mixture baselines. Our analysis further shows that normalized soft mixtures often concentrate most routing mass on a single expert, suggesting that hard unit-scale routing provides a simple and efficient abstraction for frozen LoRA expert composition.

Cached at: 07/01/26, 05:38 AM

# Learning to Select, Not Relearn: Hard-Routed Mixtures of Reasoning LoRAs
Source: [https://arxiv.org/html/2606.31413](https://arxiv.org/html/2606.31413)
Seyed Alireza Molavi¹, Zhan Su¹, Yan Hu², Peyman Sheikholharam Mashhadi¹, Stefan Byttner¹, Prayag Tiwari¹

¹Halmstad University, Halmstad, Sweden; ²The Chinese University of Hong Kong, Shenzhen, China

seyed.alireza.molavi@hh.se, zhan.su@hh.se, huyan@cuhk.edu.cn, peyman.mashhadi@hh.se, stefan.byttner@hh.se, prayag.tiwari@hh.se

###### Abstract

Composing independently trained LoRA adapters into a single large language model is useful for multi-domain adaptation, especially when the original training data cannot be shared. A common approach is to use MoE-style routing over LoRA experts, but for frozen pretrained adapters, soft weighted combinations can change the unit-scale additive update under which each LoRA module was originally trained. We propose Hard-Routed MoR-LoRA, a two-stage framework for composing frozen reasoning LoRA experts through unit-scale hard selection. First, domain-specific LoRA adapters are trained independently using reinforcement learning from verifiable feedback to obtain reasoning experts. Then, all experts are frozen, reasoning traces are distilled from them, and only a lightweight shared router together with a small attention LoRA is trained for integration. The router selects exactly one expert per token using hard top-1 routing, while a straight-through estimator enables gradient-based training. Experiments across five benchmarks, multiple model scales, and additional model families show that Hard-Routed MoR-LoRA preserves expert behavior while requiring substantially fewer trainable parameters than soft-routing mixture baselines. Our analysis further shows that normalized soft mixtures often concentrate most routing mass on a single expert, suggesting that hard unit-scale routing provides a simple and efficient abstraction for frozen LoRA expert composition. Our code is available at [github.com/sar-molavi/hard-routed-mor-lora](https://github.com/sar-molavi/hard-routed-mor-lora).


## 1 Introduction

Large language models (LLMs) have demonstrated strong performance across diverse language understanding and reasoning tasks Zhang et al. ([2025a](https://arxiv.org/html/2606.31413#bib.bib7), [b](https://arxiv.org/html/2606.31413#bib.bib6)); Brown et al. ([2020](https://arxiv.org/html/2606.31413#bib.bib34)). However, adapting these models to specialized domains remains expensive because full fine-tuning requires substantial computation and access to training data. Parameter-efficient fine-tuning (PEFT) methods address this issue by modifying only a small subset of parameters Houlsby et al. ([2019](https://arxiv.org/html/2606.31413#bib.bib1)). Among them, Low-Rank Adaptation (LoRA) Hu et al. ([2022](https://arxiv.org/html/2606.31413#bib.bib3)) is widely used because it adds lightweight trainable matrices while keeping the backbone frozen.

In practice, domain adaptation is often done independently: different groups train LoRA adapters on their own datasets but cannot share the original data due to privacy or regulatory constraints. Instead, they may release only the trained LoRA modules. A natural goal is therefore to compose multiple independently trained LoRA experts into a single model that supports heterogeneous tasks. Mixture-of-Experts (MoE) architectures Shazeer et al. ([2017](https://arxiv.org/html/2606.31413#bib.bib13)); Fedus et al. ([2022](https://arxiv.org/html/2606.31413#bib.bib14)) provide a natural mechanism for such modular reuse by routing each token to selected experts. However, when the goal is to reuse frozen independently trained LoRA adapters, directly applying standard soft MoE routing introduces a scale mismatch. LoRA modules are trained assuming a unit additive update, whereas MoE combines experts using routing weights. Soft routing therefore blends expert updates and changes the effective function applied by each adapter. For example, LoRA-Mixer Li et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib18)) addresses this by retraining the adapters with auxiliary losses, but this increases trainable parameters, requires more data, and can alter the knowledge encoded in the original experts.

This challenge becomes more pronounced for instruction-tuned models with chain-of-thought (CoT) reasoning ability. Conventional fine-tuning encourages imitation of short answers and may weaken internal reasoning behavior in large models, while reinforcement learning from verifiable feedback (RLVF) can produce stronger reasoning experts by optimizing correctness rather than matching reference outputs Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)); Guo et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib19)). However, existing LoRA composition methods primarily focus on supervised multi-task fine-tuning and do not study the training and integration of experts in instruction-tuned models that exhibit CoT reasoning.

In this work, we propose Hard-Routed MoR-LoRA, a framework that separates reasoning acquisition from expert selection. First, domain-specific LoRA adapters are trained independently using RLVF to obtain reasoning experts. Then, the experts are frozen and integrated using a hard-routed mixture that selects exactly one expert per token through a straight-through estimator (STE). We distill reasoning traces from the experts and train only a shared lightweight router and a small attention adaptation using a small amount of data (1000 samples per dataset). Hard routing preserves the original LoRA assumption by applying each expert with unit scale, enabling modular composition without retraining the experts.

Experiments across five benchmarks and multiple model scales show that RLVF produces stronger reasoning experts for capable models, while conventional fine-tuning can degrade reasoning behavior in large instruction-tuned models. We further demonstrate that hard routing achieves comparable or better performance than soft-routing baselines in our setting while using substantially fewer trainable parameters and keeping the pretrained experts frozen. Analysis of routing probabilities reveals that normalized soft top-$k$ mixtures implicitly behave like near-hard top-1 selection, indicating that hard routing captures the intended behavior more directly.

Our contributions are summarized as follows:

- We identify a scaling mismatch that arises when soft MoE routing is applied to independently trained frozen LoRA adapters, and we use hard unit-scale routing to preserve standalone expert application without retraining experts.
- We introduce a two-stage framework that trains reasoning LoRA experts using RLVF and integrates them through distillation while keeping experts frozen.
- We provide empirical evaluation across model scales and tasks showing improved performance and parameter efficiency.

## 2 Related Work

##### Modular LoRA adaptation\.

LoRA enables efficient fine-tuning by learning small low-rank updates while keeping the backbone model frozen. Although effective for adapting a model to a single task, independently trained LoRA modules are usually task-specific and difficult to reuse across different domains, especially when the original training data cannot be shared. Several works have therefore studied how to combine multiple pretrained adapters into one model Page-Caccia et al. ([2023](https://arxiv.org/html/2606.31413#bib.bib10)); Muqeeth et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib9)); Ostapenko et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib8)); Wen et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib2)). LoraHub Huang et al. ([2023](https://arxiv.org/html/2606.31413#bib.bib15)) learns weights that merge a set of LoRA modules into a single composed adapter, while LoRA-LEGO Zhao et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib24)) decomposes adapters into smaller rank components and reconstructs a merged adapter for cross-task reuse. These methods demonstrate that pretrained adapters can be combined without retraining, but the resulting model applies the same merged adapter to every input and cannot dynamically select different experts at the token level.

##### Mixture\-of\-Experts with LoRA\.

MoE architectures Shazeer et al. ([2017](https://arxiv.org/html/2606.31413#bib.bib13)); Fedus et al. ([2022](https://arxiv.org/html/2606.31413#bib.bib14)); Ostapenko et al. ([2023](https://arxiv.org/html/2606.31413#bib.bib5)); Su et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib4)) increase model capacity by routing tokens to a sparse subset of experts. Instead of merging adapters into one module, MoE-based approaches keep multiple experts and select among them during inference. Recent work applies this idea to parameter-efficient tuning by treating LoRA modules as experts. MixLoRA Li et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib11)) inserts multiple LoRA modules into feed-forward network (FFN) layers with top-$k$ routing. MoLE Wu et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib12)) learns hierarchical gating across layer-wise adapters, while SiRA Zhu et al. ([2023](https://arxiv.org/html/2606.31413#bib.bib16)) and GOAT Li and others ([2025](https://arxiv.org/html/2606.31413#bib.bib17)) study alignment and sparse routing strategies. The closest to our work, LoRA-Mixer, integrates LoRA modules into attention projections and trains a lightweight router. However, soft routing produces a weighted sum of LoRA updates, whereas standalone LoRA training assumes unit-scale application. In LoRA-Mixer, this issue is addressed by further fine-tuning the experts with regularization losses, which increases the number of trainable parameters and introduces additional integration-stage computation.

##### Instruction\-tuned Reasoning and Verifiable Feedback\.

Instruction tuning and alignment stages shape how language models produce and structure reasoning Ouyang et al. ([2022](https://arxiv.org/html/2606.31413#bib.bib33)). RLVF improves reasoning by optimizing correctness using automatically checkable rewards instead of matching reference outputs Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)); Guo et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib19)). DeepSeek-R1 Guo et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib19)) shows that strong reasoning ability can emerge from this training paradigm, and Group Relative Policy Optimization (GRPO) Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)) stabilizes training without requiring a learned value critic. In our setting, adapters are added to instruction-tuned base models that already exhibit CoT behavior. Preserving this behavior during training is important, and RLVF provides a natural way to train reasoning experts without overriding the aligned reasoning patterns of the base model.

##### Hard routing and Straight\-through Estimators\.

Hard routing selects a single expert through a discrete decision, which is not directly differentiable. The STE enables gradient-based training by using the hard decision in the forward pass and a surrogate gradient in the backward pass Bengio et al. ([2013](https://arxiv.org/html/2606.31413#bib.bib35)); Liu et al. ([2023](https://arxiv.org/html/2606.31413#bib.bib31)). In our approach, STE allows top-1 expert selection with unit scaling, preserving the original LoRA formulation while still allowing end-to-end training of the router.

## 3 Preliminaries

This section briefly introduces the components utilized in our method: Low\-Rank Adaptation, Mixture\-of\-Experts, and Reinforcement Learning from Verifiable Feedback\.

### 3.1 Low-Rank Adaptation

LoRA parameterizes weight updates as a low\-rank decomposition:

$$W_{\text{updated}} = W_{\text{pretrained}} + AB \tag{1}$$

where $A \in \mathbb{R}^{n \times r}$ and $B \in \mathbb{R}^{r \times n}$ with $r \ll n$. During fine-tuning, only $A$ and $B$ are optimized while $W_{\text{pretrained}}$ remains frozen, substantially reducing memory and computational cost.
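As a minimal sketch of Eq. (1), the snippet below applies a rank-1 update to a small frozen weight matrix and counts the trainable parameters in $A$ and $B$. The matrix sizes, values, and helper names (`matmul`, `lora_update`, `trainable_params`) are illustrative assumptions, not settings from the paper.

```python
# Minimal sketch of Eq. (1): W_updated = W_pretrained + A B.
# All sizes and values are illustrative assumptions.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_update(W, A, B):
    """Return W + A B as a new matrix, leaving the frozen W untouched."""
    AB = matmul(A, B)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, AB)]

def trainable_params(n, r):
    """Number of entries in A (n x r) plus B (r x n)."""
    return 2 * n * r

n, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # frozen weight
A = [[0.5], [0.0], [0.0], [0.0]]   # n x r
B = [[0.0, 2.0, 0.0, 0.0]]         # r x n
W_updated = lora_update(W, A, B)

print(W_updated[0])                     # first row after the low-rank update
print(trainable_params(n, r), n * n)    # LoRA parameters vs. full matrix size
```

With $r \ll n$, the count $2nr$ stays far below the $n^2$ entries of the full weight matrix, which is the source of the memory savings described above.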

### 3.2 Mixture-of-Experts Routing

Given a token representation $x$, a router produces logits $\mathbf{G}(x) \in \mathbb{R}^{K}$, and the routing probabilities are formulated as:

$$\mathbb{P}_i(x) = \operatorname{softmax}(\mathbf{G}(x))_i \tag{2}$$

Standard MoE combines multiple experts using routing weights Fedus et al. ([2022](https://arxiv.org/html/2606.31413#bib.bib14)), typically activating the top-$k$ experts and scaling each expert output by its routing probability. The final output is computed as

$$\text{MoE}(x) = \sum_{i=1}^{K} \mathbb{I}\!\left(i \in \text{top-}k(\mathbb{P}(x))\right) \cdot \mathbb{P}_i(x) \cdot E_i(x) \tag{3}$$

where $\mathbb{I}(\cdot)$ denotes the indicator function and $E_i(x)$ is the $i$-th expert’s output.
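The soft top-$k$ combination in Eqs. (2) and (3) can be sketched as follows. The logits and expert outputs are made-up toy values, and `soft_topk_moe` is a hypothetical helper name used only for illustration.

```python
import math

def softmax(logits):
    """Eq. (2): numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_topk_moe(logits, expert_outputs, k):
    """Eq. (3): sum over the top-k experts of P_i(x) * E_i(x)."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for i in top:
        for d in range(dim):
            out[d] += probs[i] * expert_outputs[i][d]
    return out, probs, top

# Toy example with K = 3 experts and 2-dimensional expert outputs (assumed values).
logits = [2.0, 0.0, -1.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, probs, top = soft_topk_moe(logits, expert_outputs, k=2)
print(top, out)
```

Note that every selected expert output is multiplied by a probability below one, which is the scaling that Section 4.1 identifies as a mismatch for frozen LoRA experts.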

### 3.3 Reinforcement Learning from Verifiable Feedback

RLVF trains models using rewards derived from automatically verifiable outputs rather than human preference annotations Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)); Guo et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib19)). Unlike supervised fine-tuning, the model is not constrained to follow a reference reasoning trace; only the final answer is evaluated by the verifier. This allows the model to freely generate intermediate CoT reasoning steps, encouraging the emergence of structured reasoning behavior.

We adopt GRPO Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)). For a prompt $x$, a group of reasoning trajectories $\{\tau_i\}_{i=1}^{N}$ is sampled with rewards $\{r_i\}_{i=1}^{N}$:

$$\bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i \tag{4}$$

$$A_i = r_i - \bar{r} \tag{5}$$

where $A_i$ is the advantage for the $i$-th reasoning trace. The reward function is detailed in Section [F.3](https://arxiv.org/html/2606.31413#A6.SS3) and Table [15](https://arxiv.org/html/2606.31413#A6.T15).
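A small worked example of Eqs. (4) and (5), following them as written (mean-centered rewards only). The binary rewards below are assumed toy values for a group of four sampled traces; the actual reward function is the one described in the appendix.

```python
def group_advantages(rewards):
    """Eqs. (4)-(5): subtract the group mean reward from each reward."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]

# Assumed toy rewards: two correct traces (1.0) and two incorrect ones (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)
```

Traces that beat the group average receive a positive advantage and the others a negative one, so the advantages within a group sum to zero.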

To improve sample efficiency, we use traces collected under the behavior policy $\pi_{\text{old}}$, which is a few checkpoints behind the target policy $\pi_{\theta}$. For each trajectory $\tau_i$, we compute a token-level importance ratio:

$$w_{i,t} = \frac{\pi_{\theta}(y_{i,t} \mid x, y_{i,<t})}{\pi_{\text{old}}(y_{i,t} \mid x, y_{i,<t})} \tag{6}$$

The off-policy token-level GRPO objective becomes:

$$\mathcal{L}_{\text{off}}(\theta) = -\mathbb{E}_{x}\Bigg[\frac{1}{N}\sum_{i=1}^{N} A_i \sum_{t=1}^{T_i} w_{i,t} \log \pi_{\theta}\!\left(y_{i,t} \mid x, y_{i,<t}\right)\Bigg] \tag{7}$$
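The sketch below evaluates the bracketed term of Eq. (7) for a single prompt, using the importance ratio of Eq. (6). It computes the loss value only; how gradients flow through $w_{i,t}$ in an actual training loop is not specified in this section, so that part is left out. The log-probabilities are assumed toy values, and `off_policy_grpo_loss` is a hypothetical name.

```python
import math

def off_policy_grpo_loss(advantages, logp_new, logp_old):
    """Eq. (7) for one prompt: -(1/N) * sum_i A_i * sum_t w_{i,t} * log pi_theta."""
    n = len(advantages)
    total = 0.0
    for a_i, new_i, old_i in zip(advantages, logp_new, logp_old):
        inner = 0.0
        for lp_new, lp_old in zip(new_i, old_i):
            w = math.exp(lp_new - lp_old)  # Eq. (6): ratio of token probabilities
            inner += w * lp_new
        total += a_i * inner
    return -total / n

# Assumed toy data: two traces of different lengths, per-token log-probabilities.
advantages = [0.5, -0.5]
logp_new = [[-1.0, -2.0], [-0.5]]
logp_old = [[-1.0, -2.0], [-0.5]]   # identical policies, so every ratio is 1
loss = off_policy_grpo_loss(advantages, logp_new, logp_old)
print(loss)
```

When the behavior and target policies coincide, every ratio equals one and the expression reduces to the on-policy advantage-weighted log-likelihood.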

## 4 Methodology

We propose Hard-Routed MoR-LoRA (Mixture of Reasoning LoRA), a plug-in framework that integrates multiple reasoning LoRA experts into a unified model while preserving the original LoRA assumption and maintaining sparse computation. The key idea is to decouple *reasoning acquisition* from *expert selection*. First, each LoRA adapter independently learns domain-specific reasoning behavior using RLVF Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)); Guo et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib19)). Then, the pretrained experts are combined through a hard-routed mixture in which only a lightweight router and a small attention adaptation are trained using supervised fine-tuning (SFT) on distilled reasoning traces.

The overall two-stage pipeline is illustrated in Figure [1](https://arxiv.org/html/2606.31413#S4.F1). Stage I trains reasoning experts independently, while Stage II integrates them into a shared model by learning which expert should be selected for each token.

![Refer to caption](https://arxiv.org/html/2606.31413v1/pics/pipeline.png)

Figure 1: Overview of the proposed Hard-Routed MoR-LoRA framework. The method follows a two-stage training pipeline. Stage I (left): Each LoRA adapter is independently trained on a specific domain using RLVF, producing a set of reasoning experts. Stage II (right): All pretrained experts are integrated into a shared model and kept frozen. A shared lightweight router and a small attention LoRA are trained using supervised fine-tuning on distilled reasoning traces collected from the experts. During inference, the router selects a single expert per token, enforcing unit scaling and preserving the original LoRA assumption. STE enables gradients to propagate through the discrete routing decision by following the soft probabilities during the backward pass (dashed arrow).

### 4.1 Hard Mixture-of-LoRAs Architecture

#### Scale Mismatch in Soft LoRA Mixtures

From Eq. [1](https://arxiv.org/html/2606.31413#S3.E1), a pretrained LoRA expert is applied with unit contribution. However, the standard MoE applies weighted combinations of multiple experts:

$$W_{\text{updated}} = W_{\text{pretrained}} + \sum_{i=1}^{k} \omega_i A_i B_i \tag{8}$$

where $\omega_i$ are routing probabilities, which change the magnitude of the update. Since each LoRA expert was trained assuming unit contribution, scaling by $\omega_i$ changes the effective adapter update used at integration time, creating a mismatch between standalone expert training and mixture usage. LoRA-Mixer compensates for this mismatch by further fine-tuning the adapters with preservation losses, aiming to limit deviation from the pretrained LoRA parameters.

#### Hard Routing with Straight\-Through Estimator

To preserve the original LoRA assumption, the selected expert must be applied with unit scale. However, a softmax router produces continuous weights. In a top-1 soft-routing variant, one may first select the largest routing probability $p_j$ and apply the selected expert as $p_j E_j(x)$. This remains differentiable with respect to the router logits, but it scales the selected LoRA update by $p_j < 1$, violating the unit-scale application under which the expert was trained. Alternatively, one can normalize over the selected top-1 set to enforce unit contribution, giving $p_j / p_j = 1$. However, this normalization is constant for any fixed selected expert and therefore provides zero gradient through the routing weight. The remaining dependence on the router is only through the discrete $\arg\max$ selection, which is non-differentiable. Therefore, a purely top-1 soft-routing formulation cannot simultaneously guarantee unit-scale expert application and provide a useful gradient for learning the router.
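The dilemma described above can be checked numerically. The sketch below compares the two top-1 soft-routing variants on assumed toy logits, using a central finite difference as a stand-in for the gradient with respect to the winning logit. The helper names `top1_weight` and `finite_diff` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_weight(logits, normalize):
    """Weight applied to the selected expert: p_j, or p_j / p_j if normalized."""
    probs = softmax(logits)
    j = max(range(len(probs)), key=lambda i: probs[i])
    p_j = probs[j]
    return p_j / p_j if normalize else p_j

def finite_diff(logits, idx, normalize, eps=1e-4):
    """Central finite-difference estimate of d(weight) / d(logit[idx])."""
    plus, minus = list(logits), list(logits)
    plus[idx] += eps
    minus[idx] -= eps
    return (top1_weight(plus, normalize) - top1_weight(minus, normalize)) / (2 * eps)

logits = [2.0, 0.5, -1.0]                              # assumed toy router logits
scaled_weight = top1_weight(logits, normalize=False)   # below 1: violates unit scale
scaled_grad = finite_diff(logits, 0, normalize=False)  # nonzero: router can learn
unit_weight = top1_weight(logits, normalize=True)      # exactly 1: unit scale
unit_grad = finite_diff(logits, 0, normalize=True)     # zero: no learning signal
print(scaled_weight, scaled_grad, unit_weight, unit_grad)
```

Under these assumptions, the unnormalized variant yields a usable gradient but a weight below one, while the normalized variant yields a unit weight but no gradient, matching the argument in the paragraph above.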

Instead, we enforce *top-1 routing* via discrete expert selection. For a token $x$:

$$\mathbb{P} = \operatorname{softmax}(\mathbf{G}(x)) \tag{9}$$

$$j = \arg\max_{i}(\mathbb{P}_i) \tag{10}$$

This produces a one-hot routing decision. However, the $\arg\max$ operation is non-differentiable, preventing backpropagation through the router. To enable training, we use the STE Liu et al. ([2023](https://arxiv.org/html/2606.31413#bib.bib31)), which keeps the hard decision in the forward pass but approximates gradients using the soft probabilities in the backward pass:

$$Y_{\text{hard}} = \text{one-hot}(j) \tag{11}$$

$$Y = \mathrm{SG}(Y_{\text{hard}} - \mathbb{P}) + \mathbb{P} \tag{12}$$

where $\mathrm{SG}(\cdot)$ stops gradients. The forward pass uses the discrete routing vector $Y_{\text{hard}}$, while gradients propagate through $\mathbb{P}$. The final output becomes

$$\mathrm{STE}(x) = \sum_{i=1}^{K} Y_i E_i(x) = E_j(x) \tag{13}$$

which guarantees unit scaling of the selected LoRA expert while still allowing the router to be optimized via gradient descent.
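The forward and backward behavior of Eqs. (9) through (13) can be sketched without an autograd library by writing the surrogate gradient by hand. This is an illustrative sketch, not the authors' implementation: the forward pass returns the one-hot vector (the value of $Y$ in Eq. (12)), and the backward pass treats $\partial Y / \partial \mathbb{P}$ as the identity and applies the softmax Jacobian. Function names and toy values are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ste_forward(logits):
    """Eqs. (9)-(11): soft probabilities, argmax index, one-hot routing vector."""
    probs = softmax(logits)
    j = max(range(len(probs)), key=lambda i: probs[i])
    y_hard = [1.0 if i == j else 0.0 for i in range(len(probs))]
    return y_hard, probs, j   # in value, Y of Eq. (12) equals y_hard

def ste_backward(probs, grad_y):
    """Surrogate gradient w.r.t. logits: softmax Jacobian applied to dL/dY."""
    dot = sum(g * p for g, p in zip(grad_y, probs))
    return [p * (g - dot) for p, g in zip(probs, grad_y)]

def hard_routed_output(logits, expert_outputs):
    """Eq. (13): sum_i Y_i E_i(x), which reduces to the selected expert E_j(x)."""
    y, probs, j = ste_forward(logits)
    dim = len(expert_outputs[0])
    out = [sum(y[i] * expert_outputs[i][d] for i in range(len(y))) for d in range(dim)]
    return out, probs, j

logits = [0.2, 1.5, -0.3]                                # assumed toy router logits
expert_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # assumed expert outputs
out, probs, j = hard_routed_output(logits, expert_outputs)
grads = ste_backward(probs, grad_y=[0.0, 1.0, 0.0])      # assumed upstream gradient
print(j, out, grads)
```

In a framework with automatic differentiation, the same effect is usually obtained by detaching the term inside $\mathrm{SG}(\cdot)$, so that the forward value is one-hot while the backward pass sees only $\mathbb{P}$.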

##### Parameter Efficiency\.

All pretrained experts remain frozen\. A single linear router is shared across layers, and only a small LoRA module on attention layers is trainable during integration\. The number of trainable parameters, therefore, does not grow with the number of experts while maintaining sparse computation\.

### 4.2 Training

Training proceeds in two stages, as shown in Figure [1](https://arxiv.org/html/2606.31413#S4.F1). In Stage I, we train domain-specific LoRA experts using RLVF Shao and others ([2024](https://arxiv.org/html/2606.31413#bib.bib20)); Guo et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib19)). In Stage II, the LoRA experts are frozen and we train only the router and a small attention LoRA using SFT on distilled traces.

#### Stage I: Training Reasoning LoRA Experts

Each dataset trains one LoRA adapter independently. The adapters are optimized using RLVF with GRPO so that each expert specializes in domain-specific reasoning and produces CoT traces. After training, we obtain a set of reasoning experts: $\{E_1, E_2, \dots, E_K\}$.

#### Stage II: Mixture Integration via Distillation

##### Distillation Data Generation\.

We query each trained expert with prompts from its domain and collect the generated reasoning traces. A limited number of samples is sufficient to construct the SFT dataset, since the objective of this stage is to learn the routing policy rather than to relearn reasoning ability. The reasoning knowledge is already encoded in the frozen experts, and therefore the training data only needs to cover representative input patterns for correct expert selection.
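The data collection step can be sketched as a simple loop. Everything here is a hypothetical stand-in: `experts` maps a domain name to any callable that turns a prompt into a reasoning trace, the lambdas replace real frozen LoRA experts, and the cap of 1000 samples per domain follows the budget stated in the experiments. Whether traces are filtered in any way is not covered by this sketch.

```python
def build_distillation_set(prompts_by_domain, experts, max_per_domain=1000):
    """Collect (prompt, trace) pairs from each domain expert, capped per domain."""
    dataset = []
    for domain, prompts in prompts_by_domain.items():
        generate = experts[domain]           # expert for this domain (assumed callable)
        for prompt in prompts[:max_per_domain]:
            dataset.append({"domain": domain, "prompt": prompt, "trace": generate(prompt)})
    return dataset

# Stand-in experts and prompts, for illustration only.
toy_experts = {
    "math": lambda p: "math trace for: " + p,
    "medical": lambda p: "medical trace for: " + p,
}
toy_prompts = {
    "math": ["q%d" % i for i in range(1500)],
    "medical": ["q%d" % i for i in range(10)],
}
sft_data = build_distillation_set(toy_prompts, toy_experts)
print(len(sft_data))
```

The resulting records would then serve as next-token prediction targets for training the router and the attention LoRA.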

##### Router and Attention Adaptation\.

All experts remain frozen. We train a shared linear router and LoRA adapters on attention layers using standard next-token likelihood on the distilled traces. The objective is to learn expert selection rather than relearn reasoning ability.

## 5 Experiments

We evaluate Hard-Routed MoR-LoRA along two independent dimensions: (1) the quality of reasoning experts obtained via RLVF, and (2) the effectiveness of hard routing compared to soft mixture routing during expert integration.

#### Experimental Setup

We conduct experiments on the instruction-following language model Meta-LLaMA-3 Dubey et al. ([2024](https://arxiv.org/html/2606.31413#bib.bib30)) at three model scales: 1B, 3B, and 8B parameters (see [https://huggingface.co/meta-llama/models](https://huggingface.co/meta-llama/models)). We also evaluate additional model families in Appendix [D](https://arxiv.org/html/2606.31413#A4).

Datasets. To evaluate performance across heterogeneous domains, we consider five representative tasks: (a) mathematical reasoning, (b) commonsense reasoning, (c) medical multiple-choice question answering, (d) reading comprehension, and (e) grammatical understanding.

These tasks are evaluated using the GSM8K Cobbe et al. ([2021](https://arxiv.org/html/2606.31413#bib.bib25)), ARC-Challenge (ARC-C) Clark et al. ([2018](https://arxiv.org/html/2606.31413#bib.bib26)), Medical Question Answering (MedQA) Jin et al. ([2021](https://arxiv.org/html/2606.31413#bib.bib27)), BoolQ Clark et al. ([2019](https://arxiv.org/html/2606.31413#bib.bib32)), and Corpus of Linguistic Acceptability (CoLA) Warstadt et al. ([2019](https://arxiv.org/html/2606.31413#bib.bib28)) datasets, respectively. Each dataset covers a distinct domain with little to no overlap.

##### Compared Models\.

We use "Independent" to denote standalone dataset-specific LoRA adapters before mixture integration. These models are evaluated to measure the quality of individual experts, while the mixture methods evaluate how pretrained experts are integrated into one unified model.

- Independent SFT: one dataset-specific LoRA adapter trained with supervised fine-tuning on the FFN layers.
- Independent RLVF: one dataset-specific LoRA adapter trained with reinforcement learning from verifiable feedback on the FFN layers.
- LoRAHub: static integration by learning fixed merge weights over pretrained LoRA adapters.
- LoRAMixer TopK=1: soft top-1 routing with continued adapter training using an L2 preservation loss.
- LoRAMixer TopK=2 Normalized: normalized soft top-2 routing with continued adapter training using an L2 preservation loss.
- MoLE: MoE-style LoRA integration with soft routing and load-balancing regularization.
- Ours (Hard-Routed MoR-LoRA): hard top-1 routing via STE, with frozen experts and trainable router plus attention LoRA.

Unless otherwise stated, Stage II integration uses at most 1000 samples per dataset\. The independent adapters are trained separately for each dataset and then serve as the pretrained experts used by the integration methods\.

### 5.1 RLVF Produces Stronger Reasoning Experts

We first evaluate the quality of individual experts trained in Stage I before any mixture integration. This isolates the effect of the training paradigm. Figure [2](https://arxiv.org/html/2606.31413#S5.F2) summarizes the average accuracy across model scales, while full per-task results are provided in Appendix [A.1](https://arxiv.org/html/2606.31413#A1.SS1), Table [5](https://arxiv.org/html/2606.31413#A1.T5).

For medium and large models \(3B and 8B\), RLVF substantially improves performance on reasoning\-intensive tasks\. For example, the 3B model improves from 68\.60 \(SFT\) to 72\.69 with RLVF, and the 8B model improves from 72\.72 to 79\.72\. Unlike fine\-tuning, which encourages imitation of reference outputs, RLVF allows the model to explore intermediate reasoning steps and optimize correctness, leading to stronger multi\-step reasoning behavior\.

The results also show that ordinary fine-tuning can degrade performance in large instruction-tuned models. We attribute this to fine-tuning overriding the model’s inherent CoT behavior by encouraging short-form answer imitation. In contrast, RLVF preserves the aligned reasoning behavior because optimization depends only on verifiable correctness rather than matching a specific reasoning trace. For the 1B model, on the other hand, RLVF provides smaller gains than fine-tuning (32.61 vs. 49.18). This is likely due to insufficient reasoning capacity: the 1B model struggles to produce diverse and correct reasoning trajectories, which weakens the reward signal. Thus, the benefit of RLVF increases with model capability.

As detailed in Table [5](https://arxiv.org/html/2606.31413#A1.T5), the improvement primarily appears in reasoning-heavy datasets such as GSM8K and ARC-C, while simpler linguistic tasks such as CoLA show limited advantage from RLVF. This supports the view that RLVF primarily strengthens reasoning behavior rather than general language modeling.

![Refer to caption](https://arxiv.org/html/2606.31413v1/pics/rlvf-fn.png)

Figure 2: Performance of standalone SFT and RLVF. RLVF improves performance for 3B and 8B models but not for 1B, suggesting that reasoning-oriented training benefits models with sufficient capacity while preserving CoT behavior.

### 5.2 Hard Routing Preserves Frozen Expert Behavior During Integration

We now evaluate Stage II, where pretrained experts are integrated using only 1000 samples per dataset. Figure [3](https://arxiv.org/html/2606.31413#S5.F3) summarizes the integration results for both SFT experts and RLVF experts, and Appendix [A.2](https://arxiv.org/html/2606.31413#A1.SS2), Tables [6](https://arxiv.org/html/2606.31413#A1.T6) and [7](https://arxiv.org/html/2606.31413#A1.T7) report the full numbers across five tasks and trainable parameter counts.

In our frozen-expert integration setting, Hard-Routed MoR-LoRA achieves comparable or better performance than soft-routing baselines while requiring substantially fewer trainable parameters. Unlike LoRAMixer, which retrains the already-trained experts to compensate for the scaling mismatch introduced by soft routing, our architecture preserves the original LoRA formulation and therefore only learns expert selection while keeping the experts frozen. This suggests that much of the integration benefit comes from selecting appropriate frozen experts rather than extensively modifying their parameters. For example, on the 3B model with RLVF experts, our method reaches an average of 72.07 while training ≈73M parameters, whereas LoRAMixer Top-1 reaches 65.67 while training ≈606M parameters.
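As a quick arithmetic check on the 3B comparison above, using only the approximate figures quoted in the text:

$$\frac{606\text{M}}{73\text{M}} \approx 8.3, \qquad 72.07 - 65.67 = 6.40$$

That is, in this configuration the hard-routed model trains roughly 8.3 times fewer parameters than LoRAMixer Top-1 while scoring 6.40 points higher on average.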

As shown in Figure [3](https://arxiv.org/html/2606.31413#S5.F3), the improvement from hard routing over soft routing is larger for the 1B and 3B models than for the 8B model. From the Stage I results, these smaller models benefit more from trained experts relative to the base model, so preserving the standalone behavior of each expert through hard unit-scale routing appears especially beneficial for them. For the 8B model, the relative improvement is smaller but still consistent, with a particularly large gain on GSM8K as shown in Appendix Table [7](https://arxiv.org/html/2606.31413#A1.T7), where the reasoning expert provides a strong improvement over the base model after RLVF training and hard routing better maintains its effectiveness.

![Refer to caption](https://arxiv.org/html/2606.31413v1/pics/mixer.png)

Figure 3: Hard routing vs. soft routing for LoRA composition. Fully saturated bars denote RLVF experts, whereas lighter bars denote SFT experts. Across model scales and expert types, Hard-Routed MoR-LoRA achieves comparable or better performance than soft routing despite not retraining experts, indicating that routing quality dominates expert adaptation.
### 5\.3Comparison with Additional LoRA Composition Baselines

The previous section compares Hard\-Routed MoR\-LoRA with LoRAMixer, which is the closest baseline to our setting\. To provide a broader comparison, we also evaluate against two additional adapter\-composition baselines: LoRAHub, a static adapter merging method, and MoLE, a representative MoE\-LoRA method with load\-balancing regularization\. All methods are evaluated in the same Stage II setting using RLVF\-trained experts and at most 1000 integration samples per dataset\.

Table [1](https://arxiv.org/html/2606.31413#S5.T1) summarizes the results. LoRAHub performs static composition and therefore cannot dynamically select experts at inference time. LoRAMixer and MoLE use soft expert routing, whereas our method applies exactly one frozen LoRA expert with unit scale. Across LLaMA-3B and LLaMA-8B, Hard-Routed MoR-LoRA achieves the best average performance among the adapter-composition baselines (LoRAHub, LoRAMixer, and MoLE); only the prompt-level classification baseline discussed in Section 5.5 scores marginally higher. The improvement over MoLE is larger on LLaMA-3B (72.07 vs. 68.66) and smaller but still positive on LLaMA-8B (79.80 vs. 78.50). These results suggest that hard unit-scale routing is a simple and parameter-efficient integration strategy for frozen pretrained LoRA experts, avoiding expert retraining and additional load-balancing hyperparameters. The full per-task results are reported in Appendix [A.3](https://arxiv.org/html/2606.31413#A1.SS3), Table [8](https://arxiv.org/html/2606.31413#A1.T8).

| Method | Composition Type | Routing Granularity | Extra Objective | LLaMA-3B Avg | LLaMA-8B Avg |
| --- | --- | --- | --- | --- | --- |
| LoRAHub | Static adapter merging | Prompt-level/static | Black-box weight search | 67.75 | 75.76 |
| LoRAMixer TopK=1 | Soft mixture routing | Token-level | Preservation loss | 65.67 | 77.74 |
| LoRAMixer TopK=2 Norm. | Normalized soft routing | Token-level | Preservation loss | 70.81 | 78.33 |
| MoLE (α=0.5) | Soft MoE-LoRA routing | Token-level | Load balancing | 68.50 | 78.50 |
| MoLE (α=0.1) | Soft MoE-LoRA routing | Token-level | Load balancing | 68.66 | 78.02 |
| Prompt Classification | None | Prompt-level/dynamic | BERT Classification | 72.79 | 79.87 |
| Hard-Routed MoR-LoRA | Hard unit-scale routing | Token-level | None | 72.07 | 79.80 |

Table 1: Comparison with additional LoRA composition baselines using RLVF-trained experts. All methods use the same Stage II supervision budget of at most 1000 samples per dataset. Hard-Routed MoR-LoRA achieves the best average performance among the adapter-composition methods, with only the prompt-level classification baseline marginally higher, while preserving frozen expert behavior through hard unit-scale selection. The results also show that normalized soft routing remains competitive, but hard routing provides a simpler and more parameter-efficient abstraction for frozen LoRA expert composition.
### 5\.4Soft Routing Collapses Toward Single\-Expert Selection

Hard routing resolves the LoRA integration mismatch by ensuring that exactly one expert with unit scaling is applied\. In contrast, soft routing combines multiple experts using routing weights\. While top\-1 soft routing cannot maintain unit scaling, top\-2 routing can be normalized so that the total update magnitude remains constant\.
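The scaling difference between the three routing variants can be illustrated with a minimal NumPy sketch. The dimensions, the random experts, and the router scores below are all hypothetical placeholders rather than the paper's configuration, and only the forward pass is shown (the straight-through estimator used for training is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 3            # hypothetical sizes, not the paper's

# Frozen LoRA experts: expert k contributes the update B[k] @ A[k] @ x.
A = rng.normal(size=(n_experts, r, d))
B = rng.normal(size=(n_experts, d, r))
x = rng.normal(size=d)               # hidden state of a single token
logits = rng.normal(size=n_experts)  # router scores for this token


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


probs = softmax(logits)
updates = np.stack([B[k] @ (A[k] @ x) for k in range(n_experts)])  # (n_experts, d)

# Hard top-1: exactly one expert, applied with weight 1 (unit scale).
top1 = int(np.argmax(probs))
hard = updates[top1]

# Soft top-1: the same expert, but scaled by its router probability (< 1).
soft_top1 = probs[top1] * updates[top1]

# Normalized top-2: two experts whose weights are renormalized to sum to 1.
top2 = np.argsort(probs)[-2:]
w = probs[top2] / probs[top2].sum()
soft_top2 = (w[:, None] * updates[top2]).sum(axis=0)

print("soft top-1 scale:", probs[top1])
print("normalized top-2 weights:", w, "sum:", w.sum())
```

In this sketch the hard update equals the standalone expert update exactly, the soft top-1 update is a shrunken copy of it, and the normalized top-2 update has weights summing to one but blends two different experts.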

As shown in Table [6](https://arxiv.org/html/2606.31413#A1.T6) and Table [7](https://arxiv.org/html/2606.31413#A1.T7), top-2 normalized routing performs substantially better than top-1 soft routing and approaches the performance of hard routing. However, examining the routing behavior reveals that the two experts are not used equally.

Figure [4](https://arxiv.org/html/2606.31413#S5.F4) shows the distribution of the dominant routing weight, which is the larger weight among the two selected experts. The router usually assigns most of the probability mass to one expert, with an average dominant weight of about 0.71 and an average routing entropy of 0.6245. This shows that normalized top-2 routing often behaves like near top-1 routing. However, the model still evaluates and combines two LoRA experts, so the final update is a weighted mixture rather than the original unit-scale update of a standalone LoRA expert. Hard routing removes this extra step by directly selecting one expert, preserving sparsity and matching the original LoRA formulation more closely.
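As an illustration of how such statistics could be computed, the sketch below derives the dominant weight and an entropy value from per-token router scores. It assumes the entropy is taken over the two renormalized weights and measured in nats; the paper does not state either choice here, and the random logits are placeholders rather than real router outputs.

```python
import numpy as np


def top2_routing_stats(logits):
    """Dominant weight and entropy of normalized top-2 routing.

    logits: array of shape (num_tokens, num_experts) with router scores.
    Returns two arrays of shape (num_tokens,).
    """
    z = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)

    top2 = np.sort(probs, axis=1)[:, -2:]            # two largest probabilities per token
    w = top2 / top2.sum(axis=1, keepdims=True)       # renormalize so the pair sums to 1

    dominant = w.max(axis=1)                         # larger of the two weights
    entropy = -(w * np.log(w)).sum(axis=1)           # assumption: natural log (nats)
    return dominant, entropy


rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(1000, 5))              # placeholder scores, 5 experts
dominant, entropy = top2_routing_stats(toy_logits)
print("mean dominant weight:", dominant.mean())
print("mean entropy (nats):", entropy.mean())
```

By construction the dominant weight lies in [0.5, 1]: a value near 0.5 means the two experts are blended evenly, while a value near 1 means the mixture is effectively a single-expert selection.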

![Refer to caption](https://arxiv.org/html/2606.31413v1/x1.png)

Figure 4: Dominant routing weight under top-2 normalized routing. The histogram is computed on GSM8K using Llama-3.2-3B-Instruct with RLVF-trained experts. The dominant weight is the larger of the two selected expert weights after normalization. The average dominant weight is approximately 0.71, and the average routing entropy is 0.6245, indicating that soft top-2 routing often behaves like near single-expert selection.
### 5\.5Ablation Study

We perform ablations to determine whether Stage II improves performance through additional learning or mainly through better expert selection. We also analyze the effect of attention LoRA rank in Appendix [B](https://arxiv.org/html/2606.31413#A2) and compare alternative hard-routing approximations in Appendix [E](https://arxiv.org/html/2606.31413#A5).

#### Is Reinforcement Learning Necessary for Integration?

Stage II of our method aims to learn expert selection rather than relearning reasoning behavior\. During this stage, all experts remain frozen, and only the shared router and attention LoRA are optimized\. We compare two training strategies: \(1\) SFT on distilled reasoning traces generated by the experts, and \(2\) RLVF training for the routing selection\.

We present the results for the 3B model in Table [2](https://arxiv.org/html/2606.31413#S5.T2). RLVF achieves slightly higher performance (+0.93 average) but requires substantially more computation due to trajectory sampling and advantage evaluation. The small improvement indicates that Stage II primarily learns which expert to select, while the reasoning capability is already encoded in the pretrained experts. Therefore, SFT on distilled traces is sufficient for integration, and reinforcement learning is needed mainly during expert training.

| Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| RLVF | 75.01 | 75.40 | 52.00 | 85.99 | 76.58 | 73.00 |
| SFT | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |

Table 2: Comparison of training strategies for Stage II integration on the 3B model. Both methods train only the router and attention LoRA while keeping experts frozen and using the same prompts.
#### Sample Training Budget

We analyze how many supervision samples are required for Stage II\. For each dataset, we randomly sample prompts and generate distilled reasoning traces from the pretrained experts\. The mixture model is then trained for one epoch on the combined distilled data from all five domains\. We compare using 750 and 1000 prompts per dataset against using all available samples \(30,573\)\.
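The per-dataset budget described above can be sketched as follows. The function name, the record layout, and the toy dataset sizes are hypothetical, and the step that generates distilled reasoning traces from the frozen experts is deliberately left out.

```python
import random


def sample_stage2_prompts(datasets, per_dataset=1000, seed=0):
    """Draw at most `per_dataset` prompts from each domain and shuffle them together."""
    rng = random.Random(seed)
    mixed = []
    for domain, prompts in datasets.items():
        k = min(per_dataset, len(prompts))   # small datasets contribute everything they have
        mixed.extend({"domain": domain, "prompt": p} for p in rng.sample(prompts, k))
    rng.shuffle(mixed)                       # interleave domains for single-epoch training
    return mixed


# Toy stand-ins for the five domains; the sizes are invented for illustration.
toy = {
    name: [f"{name} prompt {i}" for i in range(n)]
    for name, n in [("GSM8K", 5000), ("ARC-C", 900), ("MedQA", 4000),
                    ("BoolQ", 6000), ("CoLA", 3000)]
}
budget = sample_stage2_prompts(toy, per_dataset=1000)
print(len(budget))  # 4900
```

Capping with `min` reflects the "at most 1000 samples per dataset" wording used for the integration budget: a domain with fewer prompts than the cap simply contributes all of them.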

Table [3](https://arxiv.org/html/2606.31413#S5.T3) shows that using only 1000 prompts per dataset achieves performance close to training on the full dataset, with less than 1% difference in average accuracy (72.07 vs. 72.98). Increasing the amount of data yields only marginal gains, indicating that Stage II mainly learns routing patterns rather than domain knowledge. Since reasoning ability is already encoded in the experts trained in Stage I, a small number of examples is sufficient to identify the appropriate expert.

| # Samples | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| 5 × 750 | 70.36 | 73.72 | 52.57 | 79.88 | 74.45 | 70.20 |
| 5 × 1000 | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |
| ALL (30,573) | 75.32 | 75.02 | 52.32 | 85.12 | 77.10 | 72.98 |

Table 3: Effect of the Stage II training budget on the 3B model. The first two rows use the specified number of prompts per dataset (five datasets total).
#### Prompt\-Level and Mixed\-Domain Routing

The original evaluation uses datasets with clearly separated domains. This raises a natural question: whether token-level routing is necessary, or whether the problem can mostly be solved by selecting one expert for the whole prompt. To test this, we add a prompt-level routing baseline. This baseline trains a bert-base-uncased classifier (Devlin et al., [2019](https://arxiv.org/html/2606.31413#bib.bib41)) using the same Stage II supervision budget, with at most 1000 examples per dataset. The classifier predicts the input domain and then applies the corresponding frozen LoRA expert to the entire prompt.

The prompt-level classifier reaches 99.70% domain classification accuracy. On LLaMA-3B with RLVF-trained experts, as shown in Table [1](https://arxiv.org/html/2606.31413#S5.T1), prompt-level routing marginally outperforms Hard-Routed MoR-LoRA (72.79 vs. 72.07 average). This shows that prompt-level routing is a strong and simple baseline when each prompt belongs clearly to one domain. Thus, token-level routing should not be interpreted as necessary for clean single-domain inputs. Full per-task results and classifier details are provided in Appendix [G](https://arxiv.org/html/2606.31413#A7).

However, prompt\-level routing has a structural limitation: it must select a single expert for the whole input\. To test this limitation, we construct a mixed\-domain evaluation where each prompt contains one GSM8K math problem and one BoolQ question, and the model is asked to produce two structured answers\. No method is trained on mixed\-domain prompts\.

| Method | Math part | BoolQ part | Avg |
| --- | --- | --- | --- |
| Prompt-level routing | 61.61 | 63.88 | 62.75 |
| Hard-Routed MoR-LoRA | 59.04 | 73.00 | 66.02 |

Table 4: Mixed-domain evaluation with GSM8K and BoolQ. Each input contains one math problem and one BoolQ question. No method is trained on mixed-domain prompts.

As shown in Table [4](https://arxiv.org/html/2606.31413#S5.T4), prompt-level routing performs better on the math part but considerably worse on the BoolQ part. This is expected because it must choose either the GSM8K expert or the BoolQ expert for the entire input. In contrast, our method is not restricted to one expert for the whole prompt and achieves better average performance. This result does not show that token-level routing fully solves mixed-domain reasoning, but it demonstrates why token-level expert selection is more flexible when a single input requires multiple expert behaviors. We additionally report unseen-dataset results on SVAMP and SST-2 in Appendix [G.3](https://arxiv.org/html/2606.31413#A7.SS3).
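The structural difference between the two routing granularities can be shown with a toy example. The router scores below are invented, and the prompt-level rule (arg-max of the mean score) is only a stand-in for the BERT domain classifier used in the paper, chosen so that the sketch stays self-contained.

```python
import numpy as np


def token_level_route(logits):
    """Pick one expert per token (hard top-1 over each row)."""
    return logits.argmax(axis=1)


def prompt_level_route(logits):
    """Pick a single expert for the whole prompt.

    Assumption: the mean router score stands in for a prompt classifier.
    """
    expert = int(logits.mean(axis=0).argmax())
    return np.full(logits.shape[0], expert)


# Invented scores for a 6-token mixed prompt: expert 0 = math, expert 1 = BoolQ.
# The first three tokens look like math, the last three like a yes/no question.
logits = np.array([
    [2.0, 0.0],
    [1.5, 0.2],
    [1.8, 0.1],
    [0.1, 1.0],
    [0.0, 1.2],
    [0.3, 0.9],
])

print("token-level :", token_level_route(logits))   # [0 0 0 1 1 1]
print("prompt-level:", prompt_level_route(logits))  # [0 0 0 0 0 0]
```

In this toy prompt the prompt-level rule commits to the math expert everywhere, so the question tokens never see the BoolQ expert, whereas the token-level rule can switch experts partway through the input.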

## 6Conclusion

We study how independently trained LoRA adapters can be combined in instruction\-tuned language models without changing their behavior\. We find that an important challenge in frozen pretrained LoRA composition is the mismatch between unit\-scale LoRA updates and weighted mixture routing\. Prior LoRA\-mixing methods often address this by adapting the experts during integration, which increases the number of trainable parameters\. Hard\-Routed MoR\-LoRA instead treats integration mainly as a routing problem: reasoning experts are first trained independently using RLVF, and the mixture is then trained only to select among them while keeping all experts frozen\. Across multiple model scales and tasks, we observe three consistent results\. First, verifiable\-feedback training produces stronger reasoning experts for capable instruction\-tuned models\. Second, hard routing integrates these experts more reliably and with far fewer trainable parameters than soft routing approaches\. Third, normalized soft mixtures often concentrate most routing mass on one expert, helping explain why hard unit\-scale selection can be effective without retraining the experts\. Overall, our results show that modular adaptation can be achieved by learning when to use each expert rather than modifying what each expert knows, enabling practical reuse of domain experts under limited data or data\-sharing constraints\.

## 7Limitations

Although our approach avoids retraining experts, integration is not completely training-free. Stage II still requires some labeled data to learn the routing behavior. As a result, adding new experts or domains requires additional routing data and training, and the framework does not yet support true zero-shot integration. This limitation is more significant in settings where only very few or no samples are available.

Our method is primarily evaluated in settings where inputs are associated with a small number of expert domains\. In the clean single\-domain setting, prompt\-level routing is already highly competitive, showing that selecting one expert for the whole prompt can be sufficient when the domain is clear\. At the same time, our mixed\-domain GSM8K\+BoolQ evaluation suggests a promising advantage of token\-level routing: it is not restricted to a single expert decision for the entire input\. With explicit mixed\-domain training, the router could further learn when to switch experts within a prompt, making this formulation naturally suited to inputs that combine multiple types of reasoning\. Similarly, our unseen\-dataset results in Appendix[G\.3](https://arxiv.org/html/2606.31413#A7.SS3)suggest that the router can transfer useful expert behavior to related task distributions\. A natural extension is to add an abstention or backbone route, allowing the model to bypass all LoRA experts and follow the original computation graph when none of the available experts is appropriate\. These directions point toward more flexible expert composition with larger expert pools, mixed\-domain supervision, and adaptive routing between specialized experts and the base model\.

## Acknowledgment

This work was supported by Carl Bennet AB\.

## References

- Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or Propagating Gradients through Stochastic Neurons for Conditional Computation. arXiv preprint arXiv:1308.3432. Cited by: [§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px4.p1.1).
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language Models are Few\-shot Learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p1.1)\.
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. arXiv preprint arXiv:1905.10044. Cited by: [§F.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1), [§5](https://arxiv.org/html/2606.31413#S5.SS0.SSSx1.p3.1).
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think You have Solved Question Answering? Try Arc, the AI2 Reasoning Challenge\.arXiv preprint arXiv:1803\.05457\.Cited by:[§F\.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1),[§5](https://arxiv.org/html/2606.31413#S5.SS0.SSSx1.p3.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training Verifiers to Solve Math Word Problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§F\.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1),[§5](https://arxiv.org/html/2606.31413#S5.SS0.SSSx1.p3.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: Pre\-training of Deep Bidirectional Transformers for Language Understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§5\.5](https://arxiv.org/html/2606.31413#S5.SS5.SSSx3.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The Llama 3 Herd of Models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§5](https://arxiv.org/html/2606.31413#S5.SS0.SSSx1.p1.1)\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity\.Journal of Machine Learning Research23\(120\),pp\. 1–39\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p2.1),[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.31413#S3.SS2.p1.3)\.
- Gemma Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, and R\. Merhej \(2025\)Gemma 3 Technical Report\.arXiv preprint arXiv:2503\.19786\.External Links:[Link](https://arxiv.org/abs/2503.19786),[Document](https://dx.doi.org/10.48550/arXiv.2503.19786)Cited by:[Appendix D](https://arxiv.org/html/2606.31413#A4.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang,et al\.\(2025\)DeepSeek\-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p3.1),[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2606.31413#S3.SS3.p1.1),[§4\.2](https://arxiv.org/html/2606.31413#S4.SS2.p1.1),[§4](https://arxiv.org/html/2606.31413#S4.p1.1)\.
- N\. Houlsby, S\. Jastrzebski, A\. Brooks, R\. de Vries, A\. Guedj, and G\. Nematzadeh \(2019\)Parameter\-Efficient Transfer Learning for NLP\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)LoRA: Low\-Rank Adaptation of Large Language Models\.ICLR1\(2\),pp\. 3\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p1.1)\.
- C\. Huang, Q\. Liu, B\. Y\. Lin, T\. Pang, C\. Du, and M\. Lin \(2023\)LoRAHub: Efficient Cross\-Task Generalization via Dynamic LoRA Composition\.arXiv preprint arXiv:2307\.13269\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Jang, S\. Gu, and B\. Poole \(2016\)Categorical Reparameterization with Gumbel\-Softmax\.arXiv preprint arXiv:1611\.01144\.Cited by:[item 2](https://arxiv.org/html/2606.31413#A5.I1.i2.p1.3)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\)What Disease Does This Patient Have? A Large\-scale Open Domain Question Answering Dataset From Medical Exams\.Applied Sciences11\(14\),pp\. 6421\.Cited by:[§F\.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1),[§5](https://arxiv.org/html/2606.31413#S5.SS0.SSSx1.p3.1)\.
- D\. Li, Y\. Ma, N\. Wang, Z\. Ye, Z\. Cheng, Y\. Tang, Y\. Zhang, L\. Duan, J\. Zuo, C\. Yang,et al\.\(2024\)MixLoRA: Enhancing Large Language Models Fine\-tuning with LoRA\-based Mixture of Experts\.arXiv preprint arXiv:2404\.15159\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Liet al\.\(2025\)Make LoRA Great Again: Aligning LoRA Mixtures for Efficient Adaptation\.arXiv preprint arXiv:2502\.16894\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Li, Z\. Song, H\. Zhou, Y\. Zhang, J\. Yu, and W\. Yang \(2025\)LoRA\-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing\.arXiv preprint arXiv:2507\.00029\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p2.1)\.
- L\. Liu, C\. Dong, X\. Liu, B\. Yu, and J\. Gao \(2023\)Bridging Discrete and Backpropagation: Straight\-Through and Beyond\.Advances in Neural Information Processing Systems36,pp\. 12291–12311\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px4.p1.1),[§4\.1](https://arxiv.org/html/2606.31413#S4.SS1.SSSx2.p3.1)\.
- M\. Muqeeth, H\. Liu, Y\. Liu, and C\. Raffel \(2024\)Learning to Route Among Specialized Experts For Zero\-shot Generalization\.arXiv preprint arXiv:2402\.05859\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px1.p1.1)\.
- O\. Ostapenko, L\. Caccia, Z\. Su, N\. Le Roux, L\. Charlin, and A\. Sordoni \(2023\)A Case Study of Instruction Tuning With Mixture of Parameter\-efficient Experts\.InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.
- O\. Ostapenko, Z\. Su, E\. M\. Ponti, L\. Charlin, N\. L\. Roux, M\. Pereira, L\. Caccia, and A\. Sordoni \(2024\)Towards Modular LLMs by Building and Reusing a Library of LoRAs\.arXiv preprint arXiv:2405\.11157\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training Language Models to follow Instructions with Human Feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Page\-Caccia, E\. M\. Ponti, Z\. Su, M\. Pereira, N\. Le Roux, and A\. Sordoni \(2023\)Multi\-head Adapter Routing for Cross\-task Generalization\.Advances in Neural Information Processing Systems36,pp\. 56916–56931\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\)Are NLP Models really able to Solve Simple Math Word Problems?\.InProceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies,pp\. 2080–2094\.Cited by:[§F\.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1)\.
- Z\. Shaoet al\.\(2024\)DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p3.1),[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2606.31413#S3.SS3.p1.1),[§3\.3](https://arxiv.org/html/2606.31413#S3.SS3.p2.3),[§4\.2](https://arxiv.org/html/2606.31413#S4.SS2.p1.1),[§4](https://arxiv.org/html/2606.31413#S4.p1.1)\.
- N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\)Outrageously Large Neural Networks: The Sparsely\-gated Mixture\-of\-Experts Layer\.arXiv preprint arXiv:1701\.06538\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p2.1),[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Socher, A\. Perelygin, J\. Wu, J\. Chuang, C\. D\. Manning, A\. Y\. Ng, and C\. Potts \(2013\)Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank\.InProceedings of the 2013 conference on empirical methods in natural language processing,pp\. 1631–1642\.Cited by:[§F\.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1)\.
- Z\. Su, F\. Mo, P\. Tiwari, B\. Wang, J\. Nie, and J\. G\. Simonsen \(2024\)Mixture of Latent Experts Using Tensor Products\.arXiv preprint arXiv:2405\.16671\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Warstadt, A\. Singh, and S\. R\. Bowman \(2019\)Neural Network Acceptability Judgments\.Transactions of the Association for Computational Linguistics7,pp\. 625–641\.Cited by:[§F\.1](https://arxiv.org/html/2606.31413#A6.SS1.p1.1),[§5](https://arxiv.org/html/2606.31413#S5.SS0.SSSx1.p3.1)\.
- B\. Wen, F\. Brahman, Z\. Su, S\. Feng, Y\. Tsvetkov, L\. L\. Wang, and B\. Howe \(2025\)MARVEL: Modular Abstention for Reliable and Versatile Expert LLMs\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Wu, S\. Huang, and F\. Wei \(2024\)Mixture of LoRA Experts\.arXiv preprint arXiv:2404\.13628\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2025\)Qwen2\.5 Technical Report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[Appendix D](https://arxiv.org/html/2606.31413#A4.p1.1)\.
- J\. Zhang, X\. Wang, F\. Mo, Y\. Zhou, W\. Gao, and K\. Liu \(2025a\)Entropy\-based Exploration Conduction for Multi\-step Reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 3895–3906\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p1.1)\.
- J\. Zhang, X\. Wang, W\. Ren, L\. Jiang, D\. Wang, and K\. Liu \(2025b\)RATT: A Thought Structure for Coherent and Correct LLM Reasoning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 26733–26741\.Cited by:[§1](https://arxiv.org/html/2606.31413#S1.p1.1)\.
- Z\. Zhao, T\. Shen, D\. Zhu, Z\. Li, J\. Su, X\. Wang, K\. Kuang, and F\. Wu \(2024\)Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank\-wise Clustering\.arXiv preprint arXiv:2409\.16167\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhu, N\. Wichers, C\. Lin, X\. Wang, T\. Chen, L\. Shu, H\. Lu, C\. Liu, L\. Luo, J\. Chen,et al\.\(2023\)SiRA: Sparse Mixture of Low\-Rank Adaptation\.arXiv preprint arXiv:2311\.09179\.Cited by:[§2](https://arxiv.org/html/2606.31413#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix ADetailed Experimental Results

This section provides the full per-task results corresponding to the averaged metrics reported in the paper. All experiments follow the two-stage pipeline: training standalone experts (Stage I) and integrating them through routing (Stage II).

### A\.1Stage I: Standalone Expert Quality

Table [5](https://arxiv.org/html/2606.31413#A1.T5) shows the performance of experts before mixture integration. We compare the original instruction-tuned model (Baseline), conventional fine-tuning (SFT), and RLVF.

The detailed numbers match the observations in Section[5\.1](https://arxiv.org/html/2606.31413#S5.SS1)\. RLVF improves reasoning\-heavy tasks for 3B and 8B models, while fine\-tuning can reduce performance in larger instruction\-tuned models\. By contrast, for the 1B model, RLVF gives smaller gains due to limited reasoning capacity\.

### A\.2Stage II: Expert Integration

Tables [6](https://arxiv.org/html/2606.31413#A1.T6) and [7](https://arxiv.org/html/2606.31413#A1.T7) report mixture integration results for both fine-tuned experts and RLVF-trained experts.

We compare LoRAMixer top\-1 routing, LoRAMixer normalized top\-2 routing, and ours \(Hard\-Routed MoR\-LoRA\)\. Across both expert types, hard routing achieves better performance while training far fewer parameters\. This indicates that integration mainly depends on selecting the correct expert rather than modifying expert weights\.

### Summary

Across all tables, three patterns are consistent:

- RLVF produces stronger experts for capable models.
- Soft routing requires retraining experts and more parameters.
- Hard routing preserves expert behavior with fewer trainable parameters.

These tables confirm that the improvements in the paper are consistent across datasets and model sizes\.

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-1B | Baseline | 03.49 | 22.78 | 26.71 | 30.31 | 28.86 | 22.43 |
| Llama-1B | SFT | **30.10** | **46.07** | **38.17** | 61.93 | **69.61** | **49.18** |
| Llama-1B | RLVF | 00.05 | 30.12 | 34.33 | **68.23** | 30.30 | 32.61 |
| Llama-3B | Baseline | 01.82 | 22.18 | 33.70 | 48.93 | 60.79 | 33.48 |
| Llama-3B | SFT | 60.96 | 69.97 | 50.14 | 83.43 | **79.00** | 68.60 |
| Llama-3B | RLVF | **75.36** | **76.88** | **51.13** | **84.43** | 75.65 | **72.69** |
| Llama-8B | Baseline | 63.84 | 81.38 | 64.65 | 82.53 | 76.99 | 73.89 |
| Llama-8B | SFT | 70.74 | 79.52 | 60.10 | 73.15 | **80.09** | 72.72 |
| Llama-8B | RLVF | **83.02** | **82.94** | **66.06** | **87.68** | 78.91 | **79.72** |

Table 5: Performance of SFT and RLVF. The Baseline row corresponds to the original model. RLVF improves reasoning performance for capable models (3B and 8B) while preserving chain-of-thought behavior, whereas fine-tuning can degrade CoT ability in large instruction-tuned models by overriding existing reasoning patterns. Smaller models (1B) benefit less from RLVF due to limited reasoning trajectory diversity.

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Average | # Trainable Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-1B | LoRAMixer TopK=1 | 30.93 | 56.76 | 25.37 | 39.72 | 02.68 | 31.09 | ≈315M |
| Llama-1B | LoRAMixer TopK=2 Normalized | 33.89 | 45.82 | 30.95 | 43.55 | 65.10 | 43.86 | ≈315M |
| Llama-1B | Ours | 34.72 | 45.05 | 33.31 | 53.18 | 68.65 | **46.98** | ≈27M |
| Llama-3B | LoRAMixer TopK=1 | 22.74 | 46.50 | 46.50 | 75.75 | 70.57 | 52.41 | ≈606M |
| Llama-3B | LoRAMixer TopK=2 Normalized | 63.00 | 66.13 | 50.16 | 80.65 | 74.78 | 66.94 | ≈606M |
| Llama-3B | Ours | 62.85 | 66.98 | 49.96 | 82.00 | 76.22 | **67.60** | ≈73M |

Table 6: Stage II. Integration using fine-tuning experts. LoRAMixer retrains experts and requires substantially more trainable parameters, while Hard-Routed MoR-LoRA freezes experts and learns only routing and attention adaptation. Despite using fewer parameters, hard routing achieves comparable or better performance, showing that correct expert selection is more important than modifying expert weights.

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Average | # Trainable Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3B | LoRAMixer TopK=1 | 63.53 | 68.34 | 52.40 | 76.67 | 67.40 | 65.67 | ≈606M |
| Llama-3B | LoRAMixer TopK=2 Normalized | 69.67 | 73.98 | 52.08 | 82.87 | 75.46 | 70.81 | ≈606M |
| Llama-3B | Ours | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | **72.07** | ≈73M |
| Llama-8B | LoRAMixer TopK=1 | 79.51 | 83.30 | 65.22 | 83.70 | 76.99 | 77.74 | ≈1.133B |
| Llama-8B | LoRAMixer TopK=2 Normalized | 82.15 | 83.96 | 66.06 | 82.40 | 77.08 | 78.33 | ≈1.133B |
| Llama-8B | Ours | 84.69 | 84.61 | 67.17 | 84.43 | 78.09 | **79.80** | ≈109M |

Table 7: Stage II. Integration using RLVF-trained reasoning experts. Hard-Routed MoR-LoRA preserves expert behavior while requiring far fewer trainable parameters, outperforming soft routing methods that retrain experts and modify their learned reasoning patterns.
### A\.3Full Per\-Task Results for Additional Baselines

In this section, we provide the full per\-task results for the additional LoRA composition baselines\. The paper reports average performance to keep the comparison compact, while the tables below show how each method behaves across individual datasets\. We include three types of baselines: LoRAHub as a static adapter\-merging method, LoRAMixer as a preservation\-based soft\-routing method, and MoLE as a MoE\-LoRA method with load\-balancing regularization\. This allows us to compare static composition, soft expert mixing, and hard unit\-scale routing under the same frozen\-expert integration setting\.

All results in this section use RLVF-trained experts from Stage I. During Stage II, each method is trained with the same supervision budget of at most 1000 samples per dataset. For LoRAMixer, the pretrained adapters are further optimized with a preservation loss. For MoLE, we evaluate two load-balancing coefficients, α = 0.5 and α = 0.1, to account for sensitivity to the balancing objective. In contrast, Hard-Routed MoR-LoRA keeps all pretrained experts frozen and trains only the shared router and lightweight attention LoRA.

Table [8](https://arxiv.org/html/2606.31413#A1.T8) provides the full per-task results. These results correspond to the averaged scores reported in Table [1](https://arxiv.org/html/2606.31413#S5.T1).

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3B | LoRAHub | 72.55 | 68.86 | 51.06 | 74.20 | 72.10 | 67.75 |
| LLaMA-3B | LoRAMixer TopK=1 | 63.53 | 68.34 | 52.40 | 76.67 | 67.40 | 65.67 |
| LLaMA-3B | LoRAMixer TopK=2 Normalized | 69.67 | 73.98 | 52.08 | 82.87 | 75.46 | 70.81 |
| LLaMA-3B | MoLE (α=0.5) | 69.22 | 66.04 | 54.44 | 79.85 | 72.96 | 68.50 |
| LLaMA-3B | MoLE (α=0.1) | 69.37 | 65.70 | 54.67 | 79.76 | 73.83 | 68.66 |
| LLaMA-3B | Hard-Routed MoR-LoRA | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |
| LLaMA-8B | LoRAHub | 82.41 | 84.64 | 65.51 | 76.21 | 70.04 | 75.76 |
| LLaMA-8B | LoRAMixer TopK=1 | 79.51 | 83.30 | 65.22 | 83.70 | 76.99 | 77.74 |
| LLaMA-8B | LoRAMixer TopK=2 Normalized | 82.15 | 83.96 | 66.06 | 82.40 | 77.08 | 78.33 |
| LLaMA-8B | MoLE (α=0.5) | 84.00 | 85.07 | 66.30 | 80.52 | 76.61 | 78.50 |
| LLaMA-8B | MoLE (α=0.1) | 83.93 | 84.90 | 64.96 | 79.91 | 76.41 | 78.02 |
| LLaMA-8B | Hard-Routed MoR-LoRA | 84.69 | 84.61 | 67.17 | 84.43 | 78.09 | 79.80 |

Table 8: Full per-task baseline comparison with RLVF-trained experts. Results are reported for LLaMA-3B and LLaMA-8B. Hard-Routed MoR-LoRA achieves the best average performance for both model sizes and outperforms the other composition baselines on most tasks.

## Appendix B Impact of LoRA Rank in Integration

As described in Section [4.2](https://arxiv.org/html/2606.31413#S4.SS2), Stage II integrates pretrained experts by training a shared linear router and applying LoRA adaptation to the attention layers. The router operates at the token level and selects one expert per token, while in standalone experts, the same adapter processes all tokens of a prompt. This change in routing granularity can alter token representations, so a small attention adaptation is introduced to improve the compatibility between token embeddings and routing decisions.

In this section, we analyze whether Stage II requires substantial parameter adaptation to learn expert selection. We vary the LoRA rank applied to the attention layers. The *No LoRA* setting trains only the shared router while keeping the model representations unchanged.

Table [9](https://arxiv.org/html/2606.31413#A2.T9) reports the results when integrating the RLVF-trained experts from Stage I. For both model sizes, allowing a small amount of attention adaptation improves performance. However, the gains quickly saturate. For LLaMA-8B, $R=32$ gives the highest score (79.93), while the *No LoRA* setting already achieves 78.65, which is close to the maximum.

These results indicate that Stage II primarily learns routing behavior rather than relearning domain knowledge or reasoning ability\. The attention LoRA stabilizes token\-level routing, but only a small rank is sufficient, and larger ranks provide little additional benefit\.

| Model | LoRA Rank | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Average |
|---|---|---|---|---|---|---|---|
| LLaMA-3B | No Attention LoRA | 63.46 | 70.82 | 53.62 | 81.44 | 68.19 | 67.51 |
| LLaMA-3B | R=16 | 65.81 | 75.26 | 51.61 | 82.17 | 72.29 | 69.43 |
| LLaMA-3B | R=32 | 70.74 | 76.79 | 52.63 | 83.30 | 75.36 | 71.76 |
| LLaMA-3B | R=64 | 72.10 | 76.79 | 52.32 | 83.55 | 75.55 | 72.06 |
| LLaMA-3B | R=128 | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |
| LLaMA-8B | No Attention LoRA | 79.76 | 84.81 | 65.51 | 86.09 | 77.06 | 78.65 |
| LLaMA-8B | R=16 | 83.85 | 85.24 | 65.52 | 85.57 | 78.04 | 79.64 |
| LLaMA-8B | R=32 | 83.09 | 85.67 | 66.93 | 86.67 | 77.28 | 79.93 |
| LLaMA-8B | R=64 | 85.06 | 85.67 | 66.38 | 84.59 | 77.28 | 79.80 |
| LLaMA-8B | R=128 | 84.69 | 84.61 | 67.17 | 84.43 | 78.09 | 79.80 |

Table 9: Effect of attention LoRA rank during Stage II integration using RLVF-trained experts. The *No LoRA* configuration trains only the shared router while keeping the backbone unchanged. Small ranks improve performance, but gains quickly saturate, indicating that integration mainly depends on routing rather than relearning expert knowledge.

## Appendix C Why Normalized Top-1 Soft Routing Has Zero Routing Gradient

This section explains why a purely soft top-1 formulation cannot both preserve unit-scale LoRA application and provide a useful gradient for the router training.

Let $p_i = \operatorname{softmax}(g(x))_i$ be the routing probability for expert $i$ while $g(x)$ is the logits produced by the router, and let $j = \arg\max_i p_i$ be the selected top-1 expert.

A simple top-1 soft router can apply the selected expert as

$$y_{\text{soft-top1}}(x) = p_j E_j(x) \tag{14}$$

This gives gradients to the router. For a fixed selected expert $j$,

$$\frac{\partial y_{\text{soft-top1}}}{\partial g_\ell} = E_j(x)\frac{\partial p_j}{\partial g_\ell} = E_j(x)\, p_j \left(\mathbb{I}[\ell=j] - p_\ell\right) \tag{15}$$

However, this applies the selected LoRA expert with weight $p_j$ instead of weight $1$. Since $p_j < 1$ in general, the LoRA update is scaled down. This changes the unit-scale update that the expert used during standalone training.

To keep the selected expert at unit scale, one may normalize the probability over the selected top-1 set $S=\{j\}$:

$$\tilde{p}_j = \frac{p_j}{\sum_{m\in S} p_m} = \frac{p_j}{p_j} = 1 \tag{16}$$

The output then becomes

$$y_{\text{norm-top1}}(x) = \tilde{p}_j E_j(x) = E_j(x) \tag{17}$$

This preserves the original LoRA scale. However, for a fixed selected expert $j$, the normalized weight is the constant function:

$$\frac{\partial \tilde{p}_j}{\partial g_\ell} = \frac{\partial}{\partial g_\ell}\left(\frac{p_j}{p_j}\right) = 0. \tag{18}$$

Thus, the normalized routing weight gives no useful gradient to the router.

The only remaining dependence on the router is through the selection $j = \arg\max_i p_i$. This selection is discrete. It is piecewise constant and is not differentiable. Therefore, normalized top-1 routing preserves unit-scale expert application, but it cannot train the router with a useful task-loss gradient.
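The contrast between Eq. (15) and Eq. (18) can be checked numerically. The following is a minimal sketch in plain Python, assuming hypothetical router logits and a fixed selected expert; the helper names (`soft_top1_weight`, `norm_top1_weight`, `finite_diff`) are illustrative and are not taken from the paper's code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_top1_weight(logits, j):
    """Routing weight p_j applied to the selected expert, as in Eq. (14)."""
    return softmax(logits)[j]

def norm_top1_weight(logits, j):
    """Routing weight p_j / sum_{m in S} p_m with S = {j}, as in Eq. (16)."""
    p = softmax(logits)
    return p[j] / p[j]

def finite_diff(weight_fn, logits, j, ell, eps=1e-6):
    """Central finite difference w.r.t. logit g_ell, holding the selection j fixed."""
    up = list(logits)
    up[ell] += eps
    down = list(logits)
    down[ell] -= eps
    return (weight_fn(up, j) - weight_fn(down, j)) / (2 * eps)

logits = [1.2, 0.3, -0.5]                    # hypothetical router logits g(x)
p = softmax(logits)
j = max(range(len(p)), key=p.__getitem__)    # selected top-1 expert

# Soft top-1: numerical gradient vs. the closed form p_j (1[l = j] - p_l) of Eq. (15).
soft_grads = [finite_diff(soft_top1_weight, logits, j, ell) for ell in range(len(logits))]
analytic = [p[j] * ((1.0 if ell == j else 0.0) - p[ell]) for ell in range(len(logits))]

# Normalized top-1: the weight is identically 1, so every entry is zero (Eq. 18).
norm_grads = [finite_diff(norm_top1_weight, logits, j, ell) for ell in range(len(logits))]

print("soft top-1 gradient:      ", soft_grads)
print("closed form:              ", analytic)
print("normalized top-1 gradient:", norm_grads)
```

The soft weight has a nonzero gradient but scales the expert by $p_j$, whereas the normalized weight keeps unit scale and has no gradient, which is the trade-off this appendix describes.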

Hard routing with a straight-through estimator (STE) avoids this problem. In the forward pass, it selects one expert with unit scale:

$$y_{\text{forward}}(x) = E_j(x). \tag{19}$$

In the backward pass, it uses the soft probabilities as a surrogate gradient:

$$Y = \operatorname{StopGradient}(Y_{\text{hard}} - p) + p. \tag{20}$$

Here $Y_{\text{hard}}$ is the one-hot selection vector and $p$ is the vector of soft routing probabilities. This gives unit-scale expert selection in the forward pass while still allowing the router to be trained by gradient descent.
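To make the forward and backward behavior of Eq. (20) concrete, here is a hand-written sketch that does not rely on an autodiff library. It assumes scalar expert outputs and a squared-error loss purely for illustration; `ste_forward` and `ste_backward` are hypothetical names, and an actual implementation would normally express the same rule with a framework's stop-gradient (detach) primitive.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ste_forward(logits):
    """Forward value of Eq. (20): a one-hot selection of the top-1 expert."""
    p = softmax(logits)
    j = max(range(len(p)), key=p.__getitem__)
    y = [1.0 if i == j else 0.0 for i in range(len(p))]
    return y, p

def ste_backward(p, grad_y):
    """Backward rule of Eq. (20): dY/dg is replaced by the softmax Jacobian dp/dg."""
    n = len(p)
    return [
        sum(grad_y[i] * p[i] * ((1.0 if i == ell else 0.0) - p[ell]) for i in range(n))
        for ell in range(n)
    ]

# Hypothetical toy setup: scalar expert outputs E_i(x) and loss L = 0.5 * (output - target)^2.
expert_outputs = [2.0, -1.0, 0.5]
logits = [0.2, 1.0, -0.3]
target = 0.0

y, p = ste_forward(logits)
output = sum(w * e for w, e in zip(y, expert_outputs))     # equals E_j(x), unit scale
grad_y = [(output - target) * e for e in expert_outputs]   # dL/dY_i
router_grad = ste_backward(p, grad_y)                      # dL/dg under the STE

print("selection:", y)
print("router gradient:", router_grad)
```

The forward output uses exactly one expert with weight 1, while the router still receives a nonzero gradient through the soft probabilities.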

## Appendix D Results on Additional Model Families

In addition to the LLaMA models used in the main experiments, we also evaluate our method on two other model families: Gemma-3 Gemma Team et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib36)) and Qwen-2.5 Yang et al. ([2025](https://arxiv.org/html/2606.31413#bib.bib37)). The goal of this experiment is to verify that the observations reported in the paper are not limited to the LLaMA architecture.

Tables [10](https://arxiv.org/html/2606.31413#A4.T10) and [11](https://arxiv.org/html/2606.31413#A4.T11) report the results for Gemma-3-4B and Qwen-2.5-7B. Table [10](https://arxiv.org/html/2606.31413#A4.T10) shows the standalone expert performance, while Table [11](https://arxiv.org/html/2606.31413#A4.T11) reports the integration results. As in the LLaMA experiments, RLVF produces stronger experts, and our method achieves the best integration performance while training far fewer parameters than LoRAMixer.

Overall, these results show that the main findings of the paper remain consistent across different model families. This suggests that the proposed routing framework is general and can be applied to other instruction-tuned architectures, not only the LLaMA family.

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
|---|---|---|---|---|---|---|---|
| Gemma-3-4B | Baseline | 85.43 | 82.62 | 44.79 | 80.98 | 80.63 | 74.89 |
| Gemma-3-4B | SFT | 64.97 | 72.70 | 42.97 | 87.07 | 82.84 | 70.11 |
| Gemma-3-4B | RLVF | 88.78 | 84.99 | 47.65 | 85.14 | 82.07 | 77.73 |
| Qwen-2.5-7B | Baseline | 54.13 | 89.07 | 57.42 | 81.41 | 83.03 | 73.01 |
| Qwen-2.5-7B | SFT | 77.48 | 88.14 | 57.89 | 87.83 | 83.99 | 79.07 |
| Qwen-2.5-7B | RLVF | 91.66 | 89.48 | 57.82 | 88.87 | 83.13 | 82.19 |

Table 10: Performance of standalone experts on additional model families. We report results for Gemma-3-4B and Qwen-2.5-7B across five tasks. As in the LLaMA experiments, RLVF improves reasoning performance compared to the baseline models.

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Average | # Trainable Parameters |
|---|---|---|---|---|---|---|---|---|
| Gemma-3-4B | LoRAMixer TopK=1 | 86.05 | 83.70 | 46.50 | 80.94 | 80.44 | 75.53 | ≈836M |
| Gemma-3-4B | LoRAMixer TopK=2 Normalized | 86.88 | 83.10 | 47.68 | 83.43 | 81.30 | 76.48 | ≈836M |
| Gemma-3-4B | Ours | 87.57 | 83.61 | 48.78 | 84.80 | 81.59 | 77.27 | ≈72M |
| Qwen-2.5-7B | LoRAMixer TopK=1 | 74.91 | 88.99 | 57.50 | 83.00 | 83.13 | 77.51 | ≈1.211B |
| Qwen-2.5-7B | LoRAMixer TopK=2 Normalized | 90.60 | 88.65 | 56.20 | 85.54 | 81.78 | 80.55 | ≈1.211B |
| Qwen-2.5-7B | Ours | 90.67 | 89.25 | 58.13 | 88.10 | 83.41 | 81.95 | ≈41M |

Table 11: Stage II integration results on additional model families. We compare LoRAMixer and Hard-Routed MoR-LoRA when integrating pretrained experts. For Qwen-2.5-7B, the mixer is trained with LoRA rank $r=64$.

## Appendix E Comparison with Alternative Hard-Routing Approximations

Our main method uses deterministic hard top-1 routing with a straight-through estimator (STE). This section compares it with alternative approximations for training a hard router while keeping all pretrained LoRA experts frozen. The goal is to determine whether the benefit comes from deterministic STE itself or from the broader property that the selected frozen LoRA expert is applied with unit scale during both training and inference.

We compare three variants\.

1. Hard STE is our default method: the forward pass selects the top-1 expert deterministically, while gradients are propagated through the soft router probabilities.
2. Gumbel-Softmax Jang et al. ([2016](https://arxiv.org/html/2606.31413#bib.bib38)) uses hard Gumbel-Softmax sampling with a continuous relaxation in the backward pass. We anneal the temperature from $\tau=1.5$ to $\tau=0.1$ exponentially and keep the final temperature fixed for the last 10% of training.
3. Soft-train/hard-inference trains the model with a probability-scaled top-1 expert output, but removes the softmax scaling at inference time and applies the selected expert with unit weight. This variant tests whether unit-scale routing is sufficient only at inference time, or whether the same behavior is also necessary during training.
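The temperature schedule for the Gumbel-Softmax variant can be sketched as follows. The endpoints ($\tau=1.5$, $\tau=0.1$) and the final 10% hold come from the description above; the exact functional form of the exponential decay is not specified there, so the geometric interpolation below is an assumption, and `gumbel_temperature` is a hypothetical helper name.

```python
def gumbel_temperature(step, total_steps, tau_start=1.5, tau_end=0.1, hold_fraction=0.1):
    """Exponentially anneal tau_start -> tau_end, then hold tau_end for the final steps.

    Assumption: geometric interpolation over the first (1 - hold_fraction) of training.
    """
    anneal_steps = max(1, round(total_steps * (1.0 - hold_fraction)))
    if step >= anneal_steps:
        return tau_end
    fraction = step / anneal_steps
    return tau_start * (tau_end / tau_start) ** fraction

total = 1000  # hypothetical number of Stage II optimizer steps
for step in (0, 300, 600, 899, 900, 999):
    print(step, gumbel_temperature(step, total))
```

With this form, the temperature shrinks by a constant factor per step during annealing and stays at its final value for the last tenth of training.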

All methods use the same Stage II setting: RLVF\-trained experts are frozen, the router is shared across layers, attention LoRA is enabled, and the model is trained on distilled traces with at most 1000 examples per dataset\.

Table [12](https://arxiv.org/html/2606.31413#A5.T12) shows that Gumbel-Softmax performs similarly to deterministic STE but does not improve over it. On LLaMA-3B, Gumbel-Softmax reaches an average score of 71.82, compared with 72.07 for Hard STE. On LLaMA-8B, the gap is larger, with Gumbel-Softmax reaching 78.71 compared with 79.80 for Hard STE. This suggests that the exact surrogate-gradient estimator is not the main factor; both methods preserve the important forward-pass behavior of applying exactly one expert with unit scale.

In contrast, soft-train/hard-inference performs substantially worse. The drop is especially large on LLaMA-3B, where the average score decreases from 72.07 to 66.49. This indicates that simply removing the softmax scaling at inference time is not sufficient. If the router and attention adaptation are trained under probability-scaled expert outputs, the model learns under one effective adapter function but is evaluated under another. Therefore, the unit-scale hard-routing behavior should be used consistently during both training and inference.
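The train/inference mismatch in the soft-train/hard-inference variant can be illustrated with a toy example. This is only a sketch with hypothetical numbers: `lora_deltas` stands in for the per-expert LoRA updates at one token, and `applied_update` is an illustrative helper, not part of the released code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def applied_update(lora_deltas, logits, unit_scale):
    """Return the update added to the hidden state and the top-1 probability."""
    p = softmax(logits)
    j = max(range(len(p)), key=p.__getitem__)
    weight = 1.0 if unit_scale else p[j]     # hard routing vs. probability scaling
    return [weight * d for d in lora_deltas[j]], p[j]

lora_deltas = [[0.4, -0.2], [0.1, 0.3]]      # hypothetical frozen expert updates
logits = [0.8, 0.1]                          # hypothetical router logits

train_update, p_top = applied_update(lora_deltas, logits, unit_scale=False)
infer_update, _ = applied_update(lora_deltas, logits, unit_scale=True)

print("top-1 probability:", p_top)
print("update seen in training: ", train_update)
print("update seen at inference:", infer_update)
```

The same expert is selected in both cases, but the update the rest of the network sees differs by the factor $p_j$, so the router and attention LoRA are tuned for a different effective adapter than the one used at test time.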

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
|---|---|---|---|---|---|---|---|
| LLaMA-3B | Hard STE | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |
| LLaMA-3B | Gumbel-Softmax | 71.04 | 75.85 | 52.63 | 84.34 | 75.26 | 71.82 |
| LLaMA-3B | Soft-train/hard-inference | 53.68 | 72.27 | 50.82 | 82.05 | 73.63 | 66.49 |
| LLaMA-8B | Hard STE | 84.69 | 84.61 | 67.17 | 84.43 | 78.09 | 79.80 |
| LLaMA-8B | Gumbel-Softmax | 83.55 | 84.47 | 65.04 | 83.88 | 76.61 | 78.71 |
| LLaMA-8B | Soft-train/hard-inference | 78.70 | 83.11 | 63.55 | 83.13 | 78.62 | 77.42 |

Table 12: Comparison with alternative hard-routing approximations. All methods use frozen RLVF-trained experts and train only the shared router and attention LoRA during Stage II. Gumbel-Softmax is competitive but does not improve over deterministic STE. Soft-train/hard-inference performs worse, showing that unit-scale hard routing should be used during training as well as inference.

We further compare STE and Gumbel-Softmax under different Stage II data budgets on LLaMA-3B. This analysis tests whether the routing approximation remains stable when the amount of routing supervision is reduced.

As shown in Table [13](https://arxiv.org/html/2606.31413#A5.T13), Gumbel-Softmax is competitive when 1000 samples per dataset are available, trailing deterministic STE by only 0.25 average points. However, when the budget is reduced to 750 samples per dataset, the gap increases to 2.47 average points. Because the pretrained experts are frozen, stochastic expert sampling cannot improve the experts themselves; it only changes which fixed expert is selected during router learning. This additional routing variance can be harmful when Stage II supervision is limited.
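The two gaps quoted above can be recomputed from the per-task scores in Table 13. The snippet below is a small arithmetic check; the scores are copied from that table, and the dictionary layout and helper name are illustrative only.

```python
# Per-task scores (GSM8K, ARC-C, MedQA, BoolQ, CoLA) from Table 13, LLaMA-3B.
scores = {
    (1000, "Hard STE"): [73.62, 75.43, 51.53, 83.09, 76.70],
    (1000, "Gumbel-Softmax"): [71.04, 75.85, 52.63, 84.34, 75.26],
    (750, "Hard STE"): [70.36, 73.72, 52.57, 79.88, 74.45],
    (750, "Gumbel-Softmax"): [64.14, 72.35, 52.63, 79.33, 70.18],
}

def average(values):
    """Mean over the five tasks, rounded to two decimals as in the table."""
    return round(sum(values) / len(values), 2)

gaps = {}
for budget in (1000, 750):
    ste = average(scores[(budget, "Hard STE")])
    gumbel = average(scores[(budget, "Gumbel-Softmax")])
    gaps[budget] = round(ste - gumbel, 2)
    print(f"{budget} samples: Hard STE {ste:.2f}, Gumbel-Softmax {gumbel:.2f}, gap {gaps[budget]:.2f}")
```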

| Budget | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
|---|---|---|---|---|---|---|---|
| 1000 Samples | Hard STE | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |
| 1000 Samples | Gumbel-Softmax | 71.04 | 75.85 | 52.63 | 84.34 | 75.26 | 71.82 |
| 750 Samples | Hard STE | 70.36 | 73.72 | 52.57 | 79.88 | 74.45 | 70.20 |
| 750 Samples | Gumbel-Softmax | 64.14 | 72.35 | 52.63 | 79.33 | 70.18 | 67.73 |

Table 13: Effect of Stage II data budget on deterministic STE and Gumbel-Softmax routing for LLaMA-3B. With 1000 samples per dataset, Gumbel-Softmax is close to deterministic STE. With 750 samples per dataset, the gap increases, suggesting that stochastic routing is less stable when routing supervision is limited.

## Appendix F Reproducibility Details

This section provides additional implementation and evaluation details for reproducing the experiments. All datasets used in this work are public benchmarks. Unless otherwise stated, the same training configuration is used across all model families, including LLaMA, Gemma, and Qwen; only the underlying instruction-tuned backbone model changes. Additionally, the code is available at [github.com/sar-molavi/hard-routed-mor-lora](https://github.com/sar-molavi/hard-routed-mor-lora).

### F.1 Datasets and Evaluation Protocol

The main experiments use five datasets, corresponding to five domains: mathematical reasoning, commonsense reasoning, medical question answering, reading comprehension, and grammatical acceptability. Specifically, we use GSM8K Cobbe et al. ([2021](https://arxiv.org/html/2606.31413#bib.bib25)), ARC-Challenge Clark et al. ([2018](https://arxiv.org/html/2606.31413#bib.bib26)), MedQA Jin et al. ([2021](https://arxiv.org/html/2606.31413#bib.bib27)), BoolQ Clark et al. ([2019](https://arxiv.org/html/2606.31413#bib.bib32)), and CoLA Warstadt et al. ([2019](https://arxiv.org/html/2606.31413#bib.bib28)). Additional unseen-dataset evaluation uses SVAMP Patel et al. ([2021](https://arxiv.org/html/2606.31413#bib.bib39)) and SST-2 Socher et al. ([2013](https://arxiv.org/html/2606.31413#bib.bib40)). SVAMP and SST-2 are not used for Stage I expert training or Stage II mixer training.

All reported results are evaluated with greedy decoding. The evaluation metric is accuracy. Table [14](https://arxiv.org/html/2606.31413#A6.T14) reports the public dataset sources and the number of evaluation examples used for each benchmark.

| Dataset | Use | Source | # Eval. Samples |
|---|---|---|---|
| GSM8K | Main evaluation | [https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | 1,319 |
| ARC-Challenge | Main evaluation | [https://huggingface.co/datasets/allenai/ai2_arc](https://huggingface.co/datasets/allenai/ai2_arc) | 1,172 |
| MedQA | Main evaluation | [https://huggingface.co/datasets/bigbio/med_qa](https://huggingface.co/datasets/bigbio/med_qa) | 1,273 |
| BoolQ | Main evaluation | [https://huggingface.co/datasets/google/boolq](https://huggingface.co/datasets/google/boolq) | 3,270 |
| CoLA | Main evaluation | [https://huggingface.co/datasets/nyu-mll/glue](https://huggingface.co/datasets/nyu-mll/glue) | 1,043 |
| SVAMP | Unseen-dataset evaluation | [https://huggingface.co/datasets/ChilleD/SVAMP](https://huggingface.co/datasets/ChilleD/SVAMP) | 300 |
| SST-2 | Unseen-dataset evaluation | [https://huggingface.co/datasets/nyu-mll/glue](https://huggingface.co/datasets/nyu-mll/glue) | 872 |

Table 14: Datasets and evaluation split sizes. The main five-domain experiments use GSM8K, ARC-Challenge, MedQA, BoolQ, and CoLA. SVAMP and SST-2 are used only for the unseen-dataset evaluation.

### F.2 Stage I: Standalone Expert Training

In Stage I, we train one LoRA expert independently for each dataset\. The base model is frozen, and only the LoRA parameters are updated\. We train two types of standalone experts: supervised fine\-tuning \(SFT\) experts and reinforcement learning from verifiable feedback \(RLVF\) experts\.

For both SFT and RLVF experts, LoRA is applied to the feed-forward network projections: `gate_proj`, `up_proj`, and `down_proj`. The SFT experts are trained with supervised next-token likelihood. The RLVF experts are trained with the off-policy GRPO-style objective described in Section [3.3](https://arxiv.org/html/2606.31413#S3.SS3), using automatically verifiable rewards computed from final answers. The full Stage I training configuration, including LoRA target modules, rank, learning rate, batch size, gradient accumulation, number of epochs, precision, and maximum sequence length, is reported in Table [16](https://arxiv.org/html/2606.31413#A6.T16).

### F.3 RLVF Reward and Verifier Construction

RLVF uses automatically verifiable rewards computed from final answers\. We do not use human preference labels\. For each prompt, multiple completions are sampled from a behavior policy, which is a few checkpoints behind the target policy\. The target policy is then optimized using a token\-level off\-policy GRPO\-style objective with importance ratios between the target policy and the behavior policy as detailed in Section[3\.3](https://arxiv.org/html/2606.31413#S3.SS3)\.
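As a rough illustration of the ingredients named above, the sketch below computes group advantages with a mean-reward baseline and a clipped token-level importance-ratio term. It is a generic PPO/GRPO-style surrogate written under stated assumptions: the paper's exact objective is defined in its Section 3.3 and may differ, "mean-reward" is interpreted here as subtracting the group mean without further normalization, and all function names and example values are hypothetical.

```python
import math

def mean_reward_advantages(rewards):
    """Advantage of each sampled completion relative to the group's mean reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def clipped_token_objective(logp_target, logp_behavior, advantage, clip=0.2):
    """Clipped surrogate for one token, using the target/behavior importance ratio."""
    ratio = math.exp(logp_target - logp_behavior)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical rewards for the 4 completions sampled for one prompt.
rewards = [1.2, -1.1, 1.0, -1.0]
advantages = mean_reward_advantages(rewards)

# Hypothetical per-token log-probabilities of the first completion
# under the target policy and the (older) behavior policy.
logp_target = [-0.9, -1.4, -0.3]
logp_behavior = [-1.0, -1.2, -0.8]
objective = sum(
    clipped_token_objective(t, b, advantages[0])
    for t, b in zip(logp_target, logp_behavior)
) / len(logp_target)

print("advantages:", advantages)
print("mean clipped objective for completion 0:", objective)
```

The clip value of 0.2 and the group size of 4 match the RLVF configuration reported in this appendix; everything else in the example is illustrative.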

The reward is $1.0$ for a correct answer and $-1.1$ for an incorrect answer. We also use two small formatting bonuses: a think-format bonus of $0.2$ when the response follows the expected explicit thinking format, and a JSON-on-wrong bonus of $0.1$ when an incorrect response still produces a valid structured JSON answer. These bonuses are used to stabilize response formatting and answer extraction during RLVF training. The expected response format consists of an explicit reasoning block followed by a structured JSON answer:

`<think>...<\think>` `{"answer": <...>}`

Table [15](https://arxiv.org/html/2606.31413#A6.T15) summarizes the RLVF configuration.

| RLVF Setting | Value |
|---|---|
| Objective | Offline Token-level GRPO |
| Advantage mode | mean-reward |
| Ratio clipping | 0.2 |
| Generations per prompt | 4 |
| Sampling temperature | 1.1 |
| Max new tokens | 3072 |
| Checkpoint refresh interval | 20 training steps |
| Correct / incorrect reward | 1.0 / −1.1 |
| Think-format bonus | 0.2 |
| JSON-on-wrong bonus | 0.1 |

Table 15: RLVF configuration. Rewards are computed using automatically verifiable final answers. The formatting bonuses are used only to stabilize reasoning and answer extraction format.
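A verifier with these reward values might look like the following sketch. The reward constants come from the configuration above; the regular expression, the string-based answer comparison, and the additive combination of the bonuses are assumptions made for illustration, and `rlvf_reward` is a hypothetical name rather than the paper's function.

```python
import json
import re

# Assumption: the response must open with a think block; both "</think>" and the
# "<\think>" spelling shown in the format description are accepted as the closing tag.
THINK_BLOCK = re.compile(r"^\s*<think>.*?<[\\/]think>", re.DOTALL)

def parse_json_answer(text):
    """Return the "answer" field of a JSON object, or None if parsing fails."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and "answer" in payload:
        return payload["answer"]
    return None

def rlvf_reward(response, gold_answer):
    think = THINK_BLOCK.match(response)
    tail = response[think.end():] if think else response
    answer = parse_json_answer(tail)
    has_json = answer is not None
    # Assumption: exact match after string normalization decides correctness.
    correct = has_json and str(answer).strip() == str(gold_answer).strip()

    reward = 1.0 if correct else -1.1      # correct / incorrect reward
    if think:
        reward += 0.2                      # think-format bonus
    if has_json and not correct:
        reward += 0.1                      # JSON-on-wrong bonus
    return reward

print(rlvf_reward('<think>2 + 2 = 4</think>{"answer": 4}', 4))
print(rlvf_reward('<think>guessing</think>{"answer": 5}', 4))
print(rlvf_reward("the answer is four", 4))
```

A real verifier would likely need task-specific answer normalization (for example, numeric parsing for GSM8K or option letters for multiple-choice tasks), which this sketch leaves out.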
### F.4 Stage II: Distillation and Mixer Training

Stage II integrates the pretrained experts into one shared model\. All Stage I experts remain frozen\. We train only a shared linear router and a lightweight LoRA adaptation on the attention projections\.

To construct the Stage II training data, shown in Figure [1](https://arxiv.org/html/2606.31413#S4.F1), we query each frozen expert using prompts from its own domain and collect the generated reasoning traces. These traces form the supervised distillation dataset. The mixer is then trained with standard next-token likelihood on the distilled traces. The objective of Stage II is to learn expert selection rather than to relearn the domain knowledge or reasoning ability encoded in the frozen experts.
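The data-construction step can be sketched as a simple loop over domains. In this minimal example each frozen expert is represented by a hypothetical callable that maps a prompt to a generated trace; the function name, record fields, and stub experts are illustrative, and any filtering of traces (for example, by correctness) is not described in this section and is therefore omitted.

```python
from typing import Callable, Dict, List

def build_distillation_set(
    domain_prompts: Dict[str, List[str]],
    experts: Dict[str, Callable[[str], str]],
    max_per_dataset: int = 1000,
) -> List[dict]:
    """Query each frozen expert on prompts from its own domain and keep the traces."""
    examples = []
    for domain, prompts in domain_prompts.items():
        generate = experts[domain]                   # frozen expert for this domain
        for prompt in prompts[:max_per_dataset]:     # at most 1000 prompts per dataset
            examples.append({"domain": domain, "prompt": prompt, "trace": generate(prompt)})
    return examples

# Stub experts standing in for frozen LoRA experts decoding a reasoning trace.
experts = {
    "gsm8k": lambda prompt: f"[math trace for: {prompt}]",
    "boolq": lambda prompt: f"[reading trace for: {prompt}]",
}
domain_prompts = {
    "gsm8k": ["q1", "q2", "q3"],
    "boolq": ["p1", "p2"],
}

dataset = build_distillation_set(domain_prompts, experts, max_per_dataset=2)
for example in dataset:
    print(example)
```

The resulting prompt/trace pairs would then serve as ordinary supervised targets for next-token likelihood training of the router and attention LoRA.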

Unless otherwise stated, Stage II uses at most 1000 prompts per dataset. The attention LoRA is applied to `q_proj`, `k_proj`, `v_proj`, and `o_proj` with a low rank. The router is shared across layers, and routing is performed at the token level. During training and inference, the selected LoRA expert is applied with unit scale.
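The token-level, unit-scale selection can be sketched for a single linear projection as follows. This is a toy illustration in plain Python with hypothetical two-dimensional weights: the real model applies the selected expert's LoRA to the FFN projections of every layer, and any LoRA-internal scaling such as alpha/rank is assumed here to be folded into the `B` matrices.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def route_token(hidden, router_weight):
    """Shared linear router: pick the expert with the largest logit for this token."""
    logits = matvec(router_weight, hidden)
    return max(range(len(logits)), key=logits.__getitem__)

def routed_lora_update(hidden, router_weight, experts):
    """Apply exactly one frozen expert's update B_j (A_j h) with routing weight 1."""
    j = route_token(hidden, router_weight)
    lora_a, lora_b = experts[j]
    return j, matvec(lora_b, matvec(lora_a, hidden))

# Hypothetical rank-1 experts over a 2-dimensional hidden state.
experts = [
    ([[1.0, 1.0]], [[0.5], [0.0]]),   # expert 0: (A_0, B_0)
    ([[0.0, 2.0]], [[0.0], [1.0]]),   # expert 1: (A_1, B_1)
]
router_weight = [[1.0, 0.0], [0.0, 1.0]]   # one row of router weights per expert
tokens = [[1.0, 0.0], [0.0, 1.0]]          # two token representations

results = [routed_lora_update(h, router_weight, experts) for h in tokens]
for h, (j, update) in zip(tokens, results):
    print(f"token {h} -> expert {j}, update {update}")
```

Because different tokens in the same sequence can pick different experts, this differs from standalone use, where one adapter processes every token of a prompt.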

### F.5 Training Hyperparameters

Table [16](https://arxiv.org/html/2606.31413#A6.T16) summarizes the main training configuration for Stage I and Stage II. The same configuration is used across LLaMA, Gemma, and Qwen experiments unless explicitly stated otherwise. The global batch size is 32 in all stages, and we use gradient accumulation and gradient checkpointing.

Model and Trainable Components

| Setting | Stage I: SFT Experts | Stage I: RLVF Experts | Stage II: Mixer |
|---|---|---|---|
| Base model | Experiment-specific backbone | Experiment-specific backbone | Experiment-specific backbone |
| Trainable modules | FFN LoRA | FFN LoRA | Shared router + attention LoRA |
| Frozen modules | Backbone | Backbone | Backbone + expert LoRAs |
| LoRA target modules | `gate_proj`, `up_proj`, `down_proj` | `gate_proj`, `up_proj`, `down_proj` | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| LoRA rank / alpha / dropout | 128 / 256 / 0.1 | 128 / 256 / 0.1 | 128 / 256 / 0.1 |
| Objective | Next-token likelihood | Offline GRPO-style RLVF | Next-token likelihood on distilled traces |

Optimization and Sequence Settings

| Setting | Stage I: SFT Experts | Stage I: RLVF Experts | Stage II: Mixer |
|---|---|---|---|
| Learning rate | 1e-6 | 5e-6 | 1e-6 |
| Epochs | 10 | 2 | 1 |
| Batch size | 32 | 32 | 32 |
| LR scheduler / warmup | Cosine / 0.05 | Cosine / 0.05 | Cosine / 0.05 |
| Weight decay | 0.1 | 0.0 | 0.0 |
| Max grad norm | 1.0 | 0.75 | 1.0 |
| Precision | bf16 | bf16 | bf16 |
| Max sequence length | No explicit limit | 5120 | 5120 |
| Stage II samples per dataset | – | – | 1000 |

Table 16: Training configuration for Stage I and Stage II. Stage I trains standalone dataset-specific experts while keeping the backbone frozen. Stage II freezes all pretrained experts and trains only the shared router and attention LoRA. The same configuration is used across model families unless otherwise stated.

### F.6 RLVF Training Dynamics

To make the behavior of Stage I RLVF training more transparent, we report training dynamics for the standalone RLVF experts. We track three quantities during training: completion length, verifier reward, and stop rate (whether a generation reached the end-of-sentence token). These statistics are collected from sampled trajectories during RLVF optimization and therefore differ from the final benchmark results, which are evaluated separately using greedy decoding.

For each training step, we compute the per-step mean over the few sampled completions. We also report a bias-corrected exponential moving average (EMA) with decay $0.95$ to show the overall trend more clearly. The curves cover the five main domains, ARC-C, BoolQ, CoLA, GSM8K, and MedQA, for both LLaMA-3B and LLaMA-8B experts.
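A bias-corrected EMA of this kind can be computed as in the sketch below. The decay of 0.95 comes from the text; the specific correction (dividing by $1-\beta^t$, as in Adam-style moment estimates) is the standard form and is assumed here, since the exact formula is not spelled out. The reward values are hypothetical.

```python
def bias_corrected_ema(values, decay=0.95):
    """Exponential moving average with the standard 1 - decay**t bias correction."""
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = decay * running + (1.0 - decay) * value
        smoothed.append(running / (1.0 - decay ** step))
    return smoothed

# Hypothetical noisy per-step mean rewards.
per_step_reward = [-0.6, -0.2, -0.4, 0.1, 0.3, 0.0, 0.5, 0.4]
trend = bias_corrected_ema(per_step_reward)
for raw, smooth in zip(per_step_reward, trend):
    print(f"{raw:+.2f} -> {smooth:+.3f}")
```

Without the correction, the early part of the curve would be pulled toward the zero initialization; with it, the first smoothed value equals the first observation.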

The reward is computed using the same automatic verifier used during RLVF training\. Since trajectories are sampled with high\-temperature exploration on a few training samples using offline reinforcement learning, the reward curves are expected to fluctuate and should not be interpreted as final task accuracy\.

Figure [5](https://arxiv.org/html/2606.31413#A6.F5) reports the mean verifier reward. Across most datasets and model sizes, the smoothed reward increases during training. This indicates that RLVF improves the sampled trajectories with respect to automatically verifiable correctness. The trend is especially clear for datasets such as GSM8K, BoolQ, and CoLA, where the reward rises steadily after the early training phase. Some non-monotonic behavior remains, particularly for the larger model on certain tasks, which is expected because the policy continues to sample diverse trajectories during training. Overall, the reward dynamics support the main Stage I observation that RLVF improves standalone experts by optimizing verifiable correctness rather than imitating fixed reference traces.

Figure [6](https://arxiv.org/html/2606.31413#A6.F6) shows the generated completion length during training. The absolute completion length differs substantially across datasets because the tasks require different output formats and reasoning depths. For several experts, completion length decreases or stabilizes as training progresses, suggesting that the model learns to produce more concise outputs that satisfy the verifier. However, completion length itself is not optimized as the primary objective, and longer or shorter generations should not be interpreted as better performance without considering the corresponding reward and stopping behavior.

Finally, Figure [7](https://arxiv.org/html/2606.31413#A6.F7) shows the stop rate, defined as the fraction of sampled completions that satisfy the generation stopping criterion before reaching the maximum generation budget. This statistic helps distinguish useful changes in completion length from unstable generation behavior. For most experts, the stop rate either remains high or increases toward a stable value during training, indicating that the models generally learn to terminate their responses correctly under the required output format. Some datasets exhibit temporary drops or more variable stopping behavior, especially under continued high-temperature exploration. These fluctuations are consistent with the reward dynamics and do not necessarily indicate degradation in final greedy-decoding performance.

![Refer to caption](https://arxiv.org/html/2606.31413v1/pics/training_curve_plots/reward_mean.png)

Figure 5: Reward mean during RLVF training. We report the mean verifiable reward obtained by the sampled trajectories during RLVF training. Rows correspond to model scales and columns correspond to datasets. The light curves show the per-step mean, while the darker curves show a bias-corrected exponential moving average with decay $0.95$. Across most datasets, the smoothed reward increases during training, indicating that the experts progressively improve with respect to the automatic verifier. The fluctuations are expected because trajectories are sampled with high-temperature exploration on a few samples. These are training reward statistics, not final benchmark accuracies.

![Refer to caption](https://arxiv.org/html/2606.31413v1/pics/training_curve_plots/completion_length.png)

Figure 6: Completion length during RLVF training. We report the mean generated completion length for each RLVF-trained expert across training steps. Rows correspond to model scales and columns correspond to datasets. The light curves show the per-step mean, while the darker curves show a bias-corrected exponential moving average with decay $0.95$. Completion length varies substantially across datasets and model sizes, reflecting differences in task format, reasoning behavior, and exploration under high-temperature sampling. These curves are used only to characterize training dynamics and should not be interpreted as the final benchmark performance.

![Refer to caption](https://arxiv.org/html/2606.31413v1/pics/training_curve_plots/stop-rate.png)

Figure 7: Stop rate during RLVF training. We report the fraction of sampled completions that satisfy the stopping criterion during RLVF training. Rows correspond to model scales and columns correspond to datasets. The light curves show the per-step stop rate, while the darker curves show a bias-corrected exponential moving average with decay $0.95$. For most experts, the stop rate increases or remains high as training progresses, suggesting that the models learn to produce complete outputs under the required generation format. Temporary drops and fluctuations reflect continued exploration during sampling and should be interpreted together with the reward and completion-length curves.

## Appendix G Prompt-Level Routing and Additional Evaluation

This appendix provides additional details for the prompt-level routing baseline, the mixed-domain evaluation, and the unseen-dataset evaluation discussed in the paper.

### G.1 Prompt-Level Routing Baseline

The prompt-level routing baseline selects one expert for the entire input. We train a `bert-base-uncased` classifier to predict the domain of the input prompt. The predicted domain is then used to select the corresponding frozen LoRA expert for the whole generation. This baseline uses the same Stage II supervision budget as our mixer: at most 1000 examples per dataset.

The classifier achieves near-perfect domain classification accuracy, as shown in Table [18](https://arxiv.org/html/2606.31413#A7.T18). This confirms that the original single-domain evaluation contains a strong domain-classification component.

Table [19](https://arxiv.org/html/2606.31413#A7.T19) reports the full per-task results for prompt-level routing on LLaMA-3B and LLaMA-8B with RLVF-trained experts. Prompt-level routing slightly outperforms Hard-Routed MoR-LoRA on average in the clean single-domain setting. This supports the conclusion in the main paper: when each prompt clearly belongs to one domain, selecting one expert for the whole prompt is a strong and simple solution.

| Component | Value |
|---|---|
| Prompt-level classifier | `bert-base-uncased` |
| Classifier parameters | ≈110M |
| Max train samples | 1000 |
| Number of domains | 5 |
| Training epochs | 3 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
| Batch size | 32 |
| Max sequence length | 256 |

Table 17: Prompt-level classifier setup. The classifier predicts the input domain and selects one frozen LoRA expert for the whole prompt.

| Dataset | Accuracy |
|---|---|
| GSM8K | 100.00 |
| ARC-C | 98.89 |
| MedQA | 99.61 |
| BoolQ | 100.00 |
| CoLA | 99.42 |
| Overall | 99.70 |

Table 18: Domain classification accuracy of the prompt-level classifier. The high accuracy explains why prompt-level routing is a strong baseline on clean single-domain inputs.

| Model | Method | GSM8K | ARC-C | MedQA | BoolQ | CoLA | Avg |
|---|---|---|---|---|---|---|---|
| LLaMA-3B | Prompt-level routing | 75.36 | 76.02 | 51.13 | 85.33 | 76.13 | 72.79 |
| LLaMA-3B | Hard-Routed MoR-LoRA | 73.62 | 75.43 | 51.53 | 83.09 | 76.70 | 72.07 |
| LLaMA-8B | Prompt-level routing | 83.05 | 82.84 | 66.71 | 87.67 | 79.09 | 79.87 |
| LLaMA-8B | Hard-Routed MoR-LoRA | 84.69 | 84.61 | 67.17 | 84.43 | 78.09 | 79.80 |

Table 19: Full prompt-level routing results on clean single-domain inputs. Prompt-level routing selects one frozen expert for the whole input, while Hard-Routed MoR-LoRA routes at the token level.

### G.2 Mixed-Domain Evaluation Details

The prompt-level baseline is strong when each input belongs to a single domain, but it must select one expert for the entire prompt. To test this limitation, we construct a mixed-domain evaluation using GSM8K and BoolQ. Each input contains one GSM8K math problem and one BoolQ question, and the model is asked to produce two structured answers: `math_answer` and `boolq_answer`. No method is trained on mixed-domain prompts. Table [20](https://arxiv.org/html/2606.31413#A7.T20) details the process.

We choose GSM8K and BoolQ because the prompt\-level classifier reaches 100\.00% accuracy on both datasets in the original single\-domain setting\. Thus, in the clean setting, the prompt\-level classifier can reliably identify both domains\. The mixed\-domain setting tests a different issue: whether selecting only one expert for the whole prompt is sufficient when the input contains two different types of questions\.
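A mixed-domain input of this kind could be assembled as in the sketch below. The field names `math_answer` and `boolq_answer` and the two orderings `math_first` and `boolq_first` come from the evaluation setup; the prompt wording, the section labels, and the helper name are hypothetical, and a real BoolQ item would also include its passage.

```python
import random

def build_mixed_prompt(math_problem, boolq_question, rng):
    """Combine one GSM8K problem and one BoolQ question in a randomly sampled order."""
    order = rng.choice(["math_first", "boolq_first"])
    parts = [("Math problem", math_problem), ("Yes/no question", boolq_question)]
    if order == "boolq_first":
        parts.reverse()
    body = "\n\n".join(f"{label}: {text}" for label, text in parts)
    # Hypothetical instruction text asking for the two structured answers.
    instruction = 'Answer both and reply with JSON: {"math_answer": <...>, "boolq_answer": <...>}'
    return {"order": order, "prompt": body + "\n\n" + instruction}

rng = random.Random(0)   # fixed seed so the sampled order is reproducible
sample = build_mixed_prompt(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Is the sky blue on a clear day?",
    rng,
)
print(sample["order"])
print(sample["prompt"])
```

Under prompt-level routing, the whole input above would be handled by a single expert, whereas token-level routing may switch experts between the two parts.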

| Field | Description |
|---|---|
| Mixed domains | GSM8K + BoolQ |
| Prompt content | One math problem and one BoolQ question |
| Expected output | Two structured answers: `math_answer` and `boolq_answer` |
| Prompt order | Randomly sampled from `math_first` and `boolq_first` |
| Training exposure | No method was trained on mixed-domain prompts |
| Reason for choosing these domains | Both have 100.00% prompt-level classification accuracy in the original single-domain setting |

Table 20: Mixed-domain evaluation setup. The setting tests whether one expert is sufficient when a single input contains questions from two different domains.

Prompt-level routing performs better on the math part, while Hard-Routed MoR-LoRA performs substantially better on the BoolQ part and obtains higher average performance. This shows the structural limitation of prompt-level routing: it can only choose either the GSM8K expert or the BoolQ expert for the whole input. In contrast, token-level routing is not restricted to one expert for the entire prompt. Furthermore, the token-level routing can be optimized for this situation, contrary to the prompt-level setting.

### G.3 Unseen-Dataset Evaluation

We further evaluate whether the learned router and frozen experts transfer to related datasets that are not used during Stage I expert training or Stage II mixer training\. We use SVAMP for mathematical reasoning and SST\-2 for sentiment classification\. Here, Baseline denotes the original instruction\-tuned model without LoRA experts or routing\. This experiment tests transfer to unseen but related task distributions\.

As shown in Table [21](https://arxiv.org/html/2606.31413#A7.T21), Hard-Routed MoR-LoRA improves over the original instruction-tuned model on both unseen datasets and both model scales. On SVAMP, the routed model transfers arithmetic reasoning behavior to a math dataset that was not used during training. On SST-2, which is also not one of the original five domains, the routed model improves over the baseline as well. These results suggest that the router is not limited to memorizing the original training datasets, and that frozen expert behavior can transfer to related unseen inputs. However, this experiment should be interpreted as transfer to related task distributions rather than general open-domain robustness.

| Model | Method | SST-2 | SVAMP | Avg |
|---|---|---|---|---|
| LLaMA-3B | Baseline | 75.11 | 1.67 | 38.39 |
| LLaMA-3B | Ours | 88.76 | 81.33 | 85.05 |
| LLaMA-8B | Baseline | 82.68 | 82.00 | 82.34 |
| LLaMA-8B | Ours | 90.71 | 85.67 | 88.19 |

Table 21: Evaluation on unseen datasets. SVAMP and SST-2 are not used during expert training or mixer training. The results evaluate transfer to related unseen task distributions.
