SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
Summary
SAMoRA introduces a semantic-aware router and task-adaptive scaling to improve expert specialization and dynamic weighting in MoE-LoRA fine-tuning, outperforming prior methods on multi-task benchmarks.
# SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

Source: [https://arxiv.org/html/2604.19048](https://arxiv.org/html/2604.19048)

Boyan Shi (1,3), Wei Chen (2,\*), Shuyuan Zhao (1,3), Junfeng Shen (1,3), Shengnan Guo (1,3), Shaojiang Wang (4,5,\*), Huaiyu Wan (1,3)

1. School of Computer Science and Technology, Beijing Jiaotong University, China
2. Guangxi Key Lab of Trusted Software, Guilin University of Electronic Technology, China
3. Beijing Key Lab of Traffic Data Mining and Embodied Intelligence, China
4. Institute of AI for Industries, Chinese Academy of Sciences, China
5. Nanjing Institute of Software Technology, China

boyan118@bjtu.edu.cn

\*Correspondence: w_chen@guet.edu.cn, wangshaojiang@iaii.ac.cn

###### Abstract

The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing in current MoE-LoRA methods fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization; (2) uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling.
Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms state-of-the-art methods and exhibits strong task generalization capabilities. Code is available at [https://github.com/boyan-code/SAMoRA](https://github.com/boyan-code/SAMoRA).

## 1 Introduction

Large Language Models (LLMs) have achieved impressive performance across a wide range of domains, particularly in natural language processing (NLP) tasks such as content generation and question answering Hong et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib61)); Xu et al. ([2023](https://arxiv.org/html/2604.19048#bib.bib47)); Chen et al. ([2025a](https://arxiv.org/html/2604.19048#bib.bib58)). This success largely stems from their massive parameter counts and pre-training on large-scale, diverse corpora, which endow LLMs with strong generalization capabilities and robust performance across diverse and complex tasks Qin et al. ([2023](https://arxiv.org/html/2604.19048#bib.bib28)); Raffel et al. ([2020](https://arxiv.org/html/2604.19048#bib.bib26)); Chen et al. ([2025b](https://arxiv.org/html/2604.19048#bib.bib59)). The same scale, however, inevitably imposes a substantial parameter burden during fine-tuning.
Figure 1: Illustration of limitations in existing mechanisms. (a) MLP-based Routing: fails to explicitly match tasks with expert capabilities, resulting in expert homogenization. (b) Uniform Weight Fusion: applies a uniform update strength across diverse tasks, ignoring specific requirements and limiting multi-task generalization.

To mitigate the computational burden of full fine-tuning, Low-Rank Adaptation (LoRA) has emerged as a leading Parameter-Efficient Fine-Tuning (PEFT) strategy Hu et al. ([2022](https://arxiv.org/html/2604.19048#bib.bib31)). LoRA injects trainable low-rank matrices into the frozen backbone and merges the updates via a uniform scaling factor. However, while effective for single tasks, this fixed structure limits performance in complex multi-task scenarios, as a single set of parameters cannot adequately handle diverse task requirements. To address this, recent studies have integrated Mixture-of-Experts (MoE) architectures with LoRA (MoE-LoRA) Liu et al. ([2024a](https://arxiv.org/html/2604.19048#bib.bib22)). These methods treat multiple LoRA modules as experts and employ a Multi-Layer Perceptron (MLP) based router to selectively activate them. While these approaches have demonstrated notable success in enhancing model capacity, they still face two critical challenges:

(1) Current routing mechanisms fail to explicitly associate tasks with expert capabilities, leading to imprecise routing. Existing MoE-LoRA methods rely on MLP routers that prioritize learned data distributions over actual expert proficiencies Tian et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib23)). As illustrated in Figure [1](https://arxiv.org/html/2604.19048#S1.F1)(a), these strategies fail to explicitly match input semantics with expert expertise, often resulting in homogenized experts that lack distinct roles.
Consequently, this inability to specialize prevents the model from handling diverse requirements effectively, leading to suboptimal capabilities in multi-task scenarios.

(2) Uniform weight fusion strategies fail to provide adaptive adjustments for diverse tasks, limiting multi-task generalization. As shown in Figure [1](https://arxiv.org/html/2604.19048#S1.F1)(b), standard approaches employ a globally fixed scale factor that applies a uniform update strength across all inputs. However, multi-task scenarios involve tasks with varying complexity, where some require significant parameter shifts while others need only minor adjustments. Applying a uniform strategy ignores these distinct requirements, forcing a rigid "one-size-fits-all" adaptation. This lack of flexibility prevents the model from effectively adapting to specific task needs, thereby constraining its overall generalization capability in complex multi-task environments.

To address these challenges, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel framework tailored for task-adaptive learning. Specifically, SAMoRA consists of a Semantic-Aware Router to explicitly align input semantics with expert capabilities, a Task-Adaptive Scaling mechanism to dynamically regulate expert contributions based on specific task demands, and specialized loss constraints to enforce expert distinctiveness and ensure robust multi-task performance. The contributions of this work are as follows:

- We propose SAMoRA, a novel MoE-LoRA framework enabling precise semantic-aware expert routing and significantly enhancing multi-task generalization capabilities.
- We introduce a Semantic-Aware Router to enforce explicit alignment between input semantics and expert capabilities, coupled with a Task-Adaptive Scaling mechanism that dynamically regulates parameter updates to effectively adapt to diverse task requirements.
- We design specialized loss constraints to enforce expert distinctiveness and regularize scaling factors, ensuring specialized expert roles and robust performance.
- Extensive experiments across diverse multi-task benchmarks demonstrate that SAMoRA consistently outperforms existing baselines, achieving state-of-the-art performance.

## 2 Related Work

### 2.1 Mixture of Experts

MoE was initially proposed to decompose complex tasks into simpler subtasks, where a router dynamically assigns different inputs to specialized expert subnetworks Jacobs et al. ([1991](https://arxiv.org/html/2604.19048#bib.bib13)). A key later advancement was the sparsely-gated MoE, which activates only a small subset of experts per forward pass to significantly improve computational efficiency Shazeer et al. ([2017](https://arxiv.org/html/2604.19048#bib.bib14)). This sparse-gating mechanism was subsequently extended to Transformer architectures, further enhancing training efficiency and model scalability Lepikhin et al. ([2021](https://arxiv.org/html/2604.19048#bib.bib15)). Subsequent strategies have further optimized routing mechanisms, such as simplified top-1 routing for stability Fedus et al. ([2022](https://arxiv.org/html/2604.19048#bib.bib16)) and differentiable soft routing for effective expert combination Muqeeth et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib17)).

Despite these architectural improvements, current methods share a fundamental limitation: they rely on implicit routing strategies that lack explicit semantic guidance. These approaches typically map inputs to experts based on learned statistical distributions rather than establishing an explicit association between input semantics and expert capabilities. Consequently, the routing decision remains decoupled from actual expert specialization, hindering the model's ability to precisely match diverse inputs to the most suitable experts based on their semantic features.
### 2.2 LoRA for Multi-Task Learning

LoRA has attracted widespread attention due to its ability to achieve performance comparable to full fine-tuning under limited computational resources. However, its performance in complex multi-task scenarios remains suboptimal. To address this, several extensions have been proposed to enhance adaptability. MultiLoRA introduces a parallelized design with learnable scaling factors to decouple task-specific features Wang et al. ([2023](https://arxiv.org/html/2604.19048#bib.bib51)), while MTL-LoRA employs task-specific transformation matrices to capture distinct information Yang et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib24)). Furthermore, methods like MoELoRA and HydraLoRA integrate MoE architectures, treating LoRA modules as experts to improve generalization and parameter efficiency Liao et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib63)); Tian et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib23)).

Despite these architectural advancements, these methods share a fundamental limitation in their weight fusion mechanism. Most approaches rely on uniform scaling strategies to merge LoRA updates with the pre-trained model. This fixed approach ignores the varying complexity of different tasks, where some require significant parameter shifts while others need only minor adjustments. Consequently, applying the same update strength to all tasks fails to meet specific requirements, thereby limiting the model's overall multi-task adaptation performance.

## 3 Preliminary

### 3.1 PEFT for LLMs

PEFT for LLMs involves adapting pretrained models to downstream tasks by introducing a small set of trainable parameters $\Delta W$, while keeping the original model weights $W$ frozen. In multi-task scenarios, the model is jointly trained on multiple tasks to learn shared and task-specific representations Wei et al. ([2022](https://arxiv.org/html/2604.19048#bib.bib30)).
The training objective is to fine-tune $\Delta W$ such that the conditional probability $P$ of autoregressively generating target sequences across all tasks is maximized. Formally, the training loss can be written as:

$$\mathcal{L}_{\text{task}}(\Delta W) = \sum_{(s_{\text{in}}, s_{\text{out}}) \in \mathcal{D}} \sum_{i=1}^{|s_{\text{out}}|} \log P_{W+\Delta W}\left(s_{\text{out}}^{(i)} \mid s_{\text{in}}, s_{\text{out}}^{(<i)}\right), \tag{1}$$

where $\mathcal{D}$ denotes the training dataset containing input-output sentence pairs ($s_{\text{in}}, s_{\text{out}}$) from multiple tasks. This objective formalizes the autoregressive training process, where the model predicts the target sentence incrementally by adapting only the incremental parameters $\Delta W$.

### 3.2 Mixture of LoRA Experts

LoRA implements PEFT by freezing the original pretrained weights $W$ and introducing two trainable low-rank matrices. For a weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, these matrices are specifically defined as $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$, where the rank $r$ is significantly smaller than the original dimensions. The resulting product $BA$ provides a low-rank update $\Delta W$ to $W$, enabling effective adaptation with minimal additional parameters Hu et al. ([2022](https://arxiv.org/html/2604.19048#bib.bib31)). The LoRA update process is illustrated in Figure [1](https://arxiv.org/html/2604.19048#S1.F1)(b).
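To make the shapes in the LoRA definition concrete, here is a minimal NumPy sketch of the low-rank update $\Delta W = BA$. The dimensions, the random data, and the zero initialization of $B$ are illustrative assumptions rather than settings taken from the paper, and the uniform scaling factor mentioned in the introduction is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2             # illustrative sizes, with r << d_in, d_out

W = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection (zero init is assumed here)

def lora_forward(X, W, A, B):
    """Compute W X + B (A X) without materializing the full update B A."""
    return W @ X + B @ (A @ X)

X = rng.normal(size=(d_in, 3))       # three column-vector inputs
Y = lora_forward(X, W, A, B)

delta_W = B @ A                      # low-rank update, rank at most r
full_params = d_out * d_in           # parameters touched by full fine-tuning
lora_params = r * (d_in + d_out)     # parameters trained by LoRA
```

With $B$ at zero the adapted layer reproduces the frozen layer exactly, and the trainable parameter count grows with $r(d_{\text{in}} + d_{\text{out}})$ rather than $d_{\text{out}} d_{\text{in}}$.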
To leverage the parameter-efficiency of LoRA for complex multi-task scenarios, a promising direction in recent work has been to integrate it with MoE Yang et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib24)); Liu et al. ([2024b](https://arxiv.org/html/2604.19048#bib.bib38)); Feng et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib21)). By structuring multiple LoRAs as lightweight experts within the attention and feedforward layers of an LLM, the forward pass in such a layer is formalized as:

$$Y = WX + \sum_{i=1}^{N} g_i B_i A_i X, \tag{2}$$

where $X \in \text{Emb}(s_{\text{in}})$ is a hidden representation derived from the input sentence $s_{\text{in}}$, and $Y$ is the corresponding output. The set $\{A_i, B_i\}_{i=1}^{N}$ represents $N$ distinct LoRA experts. The gating weights $g_i$ are dynamically generated by a router conditioned on input $X$, determining which expert to activate.

Figure 2: Overview of our SAMoRA. We design a Semantic-Aware Router and a Task-Adaptive Scaling mechanism, integrated within an asymmetric MoE-LoRA architecture consisting of a shared Expert A and multiple Semantic Experts B.

## 4 Methodology

As illustrated in Figure [2](https://arxiv.org/html/2604.19048#S3.F2), SAMoRA integrates two core components: a Semantic-Aware Router designed to explicitly match input semantics with expert expertise, and a Task-Adaptive Scaling mechanism that dynamically regulates update strengths to meet specific task requirements. In the following, we introduce these components in detail.

### 4.1 Semantic-Aware Router

Most existing MoE approaches rely on MLP-based routing strategies that often fail to associate input contents with expert capabilities. To address this, we introduce a Semantic-Aware Router designed to explicitly match input semantics with expert expertise.

#### Semantic Extraction via Shared Expert.
To realize explicit routing, the model must first effectively grasp the semantic intent of the input. Inspired by HydraLoRA Tian et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib23)), we establish an asymmetric architecture by utilizing a single shared expert $A \in \mathbb{R}^{r \times d_{\text{in}}}$, while maintaining multiple experts $\{B_i\}_{i=1}^{N}$ to capture distinct semantic capabilities. This shared component naturally functions as a semantic encoder, eliminating the need for a separate, decoupled routing network. It extracts a compact, unified semantic representation $\mathbf{h} = AX$ from the input $X$. By using the shared expert $A$, we ensure that the routing decision is grounded in the same feature space used for expert computation, facilitating consistent semantics aggregation. Building upon Eq. ([2](https://arxiv.org/html/2604.19048#S3.E2)), the core forward process is reformulated as:

$$Y = WX + \sum_{i=1}^{N} g_i B_i \mathbf{h} = WX + \sum_{i=1}^{N} g_i B_i (AX). \tag{3}$$

#### Explicit Matching with Expert Keys.

With the extracted semantic features $\mathbf{h}$, the next step is to align them with the specific capabilities of the semantic experts $\{B_i\}_{i=1}^{N}$. To this end, we assign a trainable Expert Key $k_i \in \mathbb{R}^{r}$ to each expert $B_i$. These keys function as semantic anchors, explicitly representing the unique specialization learned by each expert. During training, the keys are optimized alongside the experts, ensuring that $k_i$ moves closer to the semantic clusters that expert $B_i$ is best at handling.
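The asymmetric forward pass of Eq. (3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the sizes and random weights are hypothetical, and the gating weights `g` are fixed placeholders because the routing rule itself is only introduced in the text that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, N = 6, 8, 2, 3             # illustrative sizes

W = rng.normal(size=(d_out, d_in))         # frozen backbone weight
A = rng.normal(size=(r, d_in))             # single shared expert A
Bs = rng.normal(size=(N, d_out, r))        # N semantic experts B_i
g = np.array([0.7, 0.2, 0.1])              # placeholder gating weights

def shared_a_forward(x, W, A, Bs, g):
    """Eq. (3): Y = W x + sum_i g_i B_i h, with h = A x computed once."""
    h = A @ x                              # shared semantic representation, shape (r,)
    mixed_B = np.tensordot(g, Bs, axes=1)  # sum_i g_i B_i, shape (d_out, r)
    return W @ x + mixed_B @ h

x = rng.normal(size=d_in)
y = shared_a_forward(x, W, A, Bs, g)

# Reference: the per-expert form, evaluating B_i (A x) separately for each expert.
y_ref = W @ x + sum(g[i] * (Bs[i] @ (A @ x)) for i in range(N))
```

Because $A$ is shared, the projection $\mathbf{h} = AX$ is computed once and reused both by every expert and, later, by the router.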
The routing score $g_i$ is then derived by measuring the cosine similarity between the input's semantic representation $\mathbf{h}$ and each expert key $k_i$:

$$g_i = \frac{\exp\left(\mathrm{cos}\left(\mathbf{h}, k_i\right)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{cos}\left(\mathbf{h}, k_j\right)/\tau\right)}, \tag{4}$$

where $\tau$ is a temperature coefficient that regulates the strictness of the matching. A smaller $\tau$ sharpens the distribution, forcing the router to strictly select only the expert with the highest semantic alignment, while a larger $\tau$ softens this constraint, allowing for broader expert collaboration. This mechanism ensures that inputs are routed based on explicit semantic similarity rather than implicit statistical bias.

### 4.2 Task-Adaptive Scaling

As illustrated in Figure [1](https://arxiv.org/html/2604.19048#S1.F1)(b), standard LoRA employs a uniform scaling factor to merge the updates. However, this fixed approach is problematic in multi-task scenarios as it ignores the varying complexity of different tasks. Some tasks require significant parameter shifts while others need only minor adjustments. Consequently, applying the same update strength to all tasks fails to meet specific requirements, limiting the model's adaptability. To address this, we propose a Task-Adaptive Scaling mechanism that dynamically regulates the update magnitude based on specific task demands.

#### Spectral Initialization via SVD.

First, inspired by recent work Yuan et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib55)); Zhao et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib60)), we aim to ensure our asymmetric structure starts with a theoretically grounded scale alignment. We introduce a trainable Diagonal Scaling Matrix $S \in \mathbb{R}^{r \times r}$ positioned between the shared expert $A$ and the semantic experts $B_i$.
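Returning briefly to the routing rule of Eq. (4), the sketch below shows how the temperature $\tau$ changes the routing distribution. The keys and the representation `h` are hand-picked toy values, and the `eps` guard and max-subtraction are numerical conveniences added here; none of this is taken from the released implementation.

```python
import numpy as np

def semantic_router(h, keys, tau):
    """Eq. (4): softmax over cosine similarities between h and each expert key."""
    eps = 1e-12                             # guard against zero-norm vectors (added assumption)
    h_unit = h / (np.linalg.norm(h) + eps)
    k_unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    logits = (k_unit @ h_unit) / tau        # cos(h, k_i) / tau for every expert
    logits = logits - logits.max()          # stabilize the softmax
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy setup: three expert keys in a rank-4 latent space.
keys = np.eye(3, 4)
h = np.array([0.1, 1.0, 0.2, 0.0])          # representation lying closest to key 1

g_sharp = semantic_router(h, keys, tau=0.05)  # small tau: nearly one-hot on expert 1
g_soft = semantic_router(h, keys, tau=5.0)    # large tau: close to uniform
```

Since cosine similarity ignores vector norms, rescaling `h` leaves the routing weights unchanged; only the direction of the representation matters.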
By performing Singular Value Decomposition (SVD) on the pre-trained weight $W = U \Sigma V^{\top}$, we initialize $S$ using the top-$r$ singular values:

$$S = \Sigma_{1:r,1:r} = \mathrm{diag}(\sigma_1, \ldots, \sigma_r). \tag{5}$$

This design aligns our components with the dominant semantic directions of the original weights, providing a stable structural basis for subsequent adaptation.

#### Task-Dependent Dynamic Regulation.

Building upon this aligned basis, we introduce a task-driven mechanism to dynamically control the fusion ratio. We assign a learnable Task Embedding $e_{\text{task}} \in \mathbb{R}^{d_g}$ to each task, which captures latent task characteristics such as complexity and domain divergence. To determine the optimal update strength for a given task, we project this embedding into a scalar gating factor $g_{\text{task}} \in (0, 1)$ via a non-linear mapping:

$$g_{\text{task}} = \sigma(W_{\text{gate}} e_{\text{task}} + b_{\text{gate}}), \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function. This mechanism allows the model to dynamically adjust the update strength based on the task embedding. It assigns larger scales for tasks needing significant adaptation and smaller scales for those requiring only minor adjustments, effectively meeting diverse task requirements. By integrating the SVD-based alignment and task-dependent regulation into the formulations of Eq. ([2](https://arxiv.org/html/2604.19048#S3.E2)) and Eq. ([3](https://arxiv.org/html/2604.19048#S4.E3)), the final output $Y$ is derived as:

$$Y = WX + g_{\text{task}} \sum_{i=1}^{N} g_i B_i (SAX). \tag{7}$$

### 4.3 Training Objective

To ensure the effective implementation of our proposed mechanisms, we incorporate two specialized regularization terms alongside the standard LLM generation loss.
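Before the loss terms are detailed, the scaling path of Eqs. (5) to (7) can be summarized in a short NumPy sketch. All sizes and random weights are illustrative assumptions, the routing weights `g` are placeholders, and `W_gate` is taken to be a vector because the gate output is a scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, N, d_g = 6, 8, 2, 3, 4      # illustrative sizes

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight

# Eq. (5): initialize S with the top-r singular values of W.
sigma = np.linalg.svd(W, compute_uv=False)  # singular values in descending order
S = np.diag(sigma[:r])

def task_gate(e_task, W_gate, b_gate):
    """Eq. (6): sigmoid of an affine map of the task embedding, a scalar in (0, 1)."""
    z = float(W_gate @ e_task + b_gate)
    return 1.0 / (1.0 + np.exp(-z))

def samora_forward(x, W, A, S, Bs, g, g_task):
    """Eq. (7): Y = W x + g_task * sum_i g_i B_i (S A x)."""
    h = S @ (A @ x)                          # scaled shared representation
    return W @ x + g_task * (np.tensordot(g, Bs, axes=1) @ h)

A = rng.normal(size=(r, d_in))              # shared expert A
Bs = rng.normal(size=(N, d_out, r))         # semantic experts B_i
g = np.array([0.6, 0.3, 0.1])               # placeholder routing weights
e_task = rng.normal(size=d_g)               # learnable task embedding (random stand-in)
W_gate = rng.normal(size=d_g)               # gate projection, a vector since the output is scalar
b_gate = 0.0

g_task = task_gate(e_task, W_gate, b_gate)
x = rng.normal(size=d_in)
y = samora_forward(x, W, A, S, Bs, g, g_task)
```

A gate value near zero leaves the frozen layer almost untouched, while a value near one applies the full mixture update, which is the per-task control the text describes.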
Specifically, these terms are designed to align Expert Keys with their corresponding experts and impose the necessary SVD constraints for the Task-Adaptive Scaling mechanism. The total training objective is formulated as:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \lambda_{\mathrm{orth}} \cdot \mathcal{L}_{\mathrm{orth}} + \lambda_{\mathrm{match}} \cdot \mathcal{L}_{\mathrm{match}}, \tag{8}$$

where $\mathcal{L}_{\mathrm{task}}$ denotes the multi-task language modeling loss of Eq. ([1](https://arxiv.org/html/2604.19048#S3.E1)), and $\lambda_{\mathrm{orth}}$, $\lambda_{\mathrm{match}}$ are scalar hyperparameters weighting the auxiliary constraints.

Table 1: Results of comparison experiments across Commonsense Reasoning benchmarks. TP indicates Trainable Parameters (%). † denotes results taken from MoORE Yuan et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib55)). Bold: best results; Underline: second-best results.

Table 2: Results of comparison experiments across the GLUE benchmark. The upper block presents the baselines, while the lower block reports the performance of SAMoRA and its ablation variants. Bold: best results; Underline: second-best results.

#### Orthogonality Regularization for Scale Decoupling.

We introduce an orthogonality regularization term $\mathcal{L}_{\mathrm{orth}}$ to strictly decouple directional semantics from magnitude scaling. In our SVD-based design, the diagonal matrix $S$ and the gating factor $g_{\text{task}}$ are intended to handle all "scaling" effects.
Specifically, we force the rows of the shared encoder $A$ and the columns of each semantic expert $B_i$ to be orthonormal:

$$\mathcal{L}_{\mathrm{orth}} = \|AA^{\top} - I\|_F^2 + \sum_{i=1}^{N} \|B_i^{\top} B_i - I\|_F^2, \tag{9}$$

where $I \in \mathbb{R}^{r \times r}$ is the identity matrix. By enforcing this constraint, $A$ and $B_i$ focus purely on learning distinct semantic directions, ensuring that the control of adaptation strength remains exclusively within the purview of our Task-Adaptive Scaling mechanism.

#### Semantic Match Regularization via KL Divergence.

The effectiveness of our Semantic-Aware Router hinges on the semantic consistency between the learnable key $k_i$ and the functional specialization of the expert $B_i$. Any misalignment between $k_i$ and $B_i$ inevitably leads to erroneous expert selection. To mitigate this, we introduce a regularization loss that explicitly minimizes the divergence between $k_i$ and the semantic representation derived from $B_i$. We detail the specific implementation steps as follows.

(1) Extracting Representative Vectors. Since the expert $B_i \in \mathbb{R}^{d_{\text{out}} \times r}$ is a matrix while the key $k_i \in \mathbb{R}^{r}$ is a vector, we obtain a representative vector $b_i$ from each expert. This is achieved by mean-pooling the row vectors of $B_i$, which aggregates the features learned by that expert:

$$b_i = \frac{1}{d_{\text{out}}} \sum_{j=1}^{d_{\text{out}}} B_i^{(j)} \in \mathbb{R}^{r}. \tag{10}$$

(2) Alignment via Distribution Matching. To align the routing key with the expert's actual capability, we map both the key $k_i$ and the semantic centroid $b_i$ into probability distributions ($P_k^{(i)}$ and $P_b^{(i)}$) via Softmax.
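The two regularizers can be sketched together in NumPy: the orthogonality penalty of Eq. (9), and the match term built from the pooled vector of Eq. (10), the Softmax distributions just described, and the KL divergence of Eq. (11) in the direction expert-to-key. The test matrices are constructed via QR purely for illustration, and the helper names are hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def orth_loss(A, Bs):
    """Eq. (9): ||A A^T - I||_F^2 + sum_i ||B_i^T B_i - I||_F^2."""
    I = np.eye(A.shape[0])
    loss = np.sum((A @ A.T - I) ** 2)
    for B in Bs:
        loss += np.sum((B.T @ B - I) ** 2)
    return loss

def match_loss(Bs, keys):
    """Mean over experts of KL(P_b || P_k), with b_i the row-mean of B_i (Eq. (10))."""
    total = 0.0
    for B, k in zip(Bs, keys):
        b = B.mean(axis=0)                   # average the d_out rows, a vector in R^r
        p_b, p_k = softmax(b), softmax(k)
        total += np.sum(p_b * (np.log(p_b) - np.log(p_k)))
    return total / len(Bs)

rng = np.random.default_rng(0)
d_out, d_in, r, N = 6, 8, 2, 3

# Orthonormal rows for A and orthonormal columns for each B_i, built via QR.
A = np.linalg.qr(rng.normal(size=(d_in, r)))[0].T
Bs = [np.linalg.qr(rng.normal(size=(d_out, r)))[0] for _ in range(N)]
keys = [B.mean(axis=0) for B in Bs]          # keys that already match the pooled experts

l_orth = orth_loss(A, Bs)                    # numerically zero for orthonormal factors
l_match = match_loss(Bs, keys)               # numerically zero when keys equal pooled experts
```

Scaling `A` by 2 turns $AA^{\top}$ into $4I$, so the first term of the penalty becomes $\|3I\|_F^2 = 9r$; this is the kind of magnitude drift the constraint is meant to push into $S$ and $g_{\text{task}}$ instead.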
We then minimize the Kullback-Leibler (KL) divergence between them:

$$\mathcal{L}_{\mathrm{match}} = \frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL}}\left(P_b^{(i)} \parallel P_k^{(i)}\right). \tag{11}$$

Crucially, we employ the direction $D_{\mathrm{KL}}(P_{\text{Expert}} \parallel P_{\text{Key}})$. This effectively treats the expert's functional distribution $P_b^{(i)}$ as the target, compelling the key $P_k^{(i)}$ to shift towards and accurately represent the expert's specialization. This ensures consistency between the routing keys and the actual expert characteristics.

### 4.4 Complexity Analysis

To demonstrate the computational efficiency and parameter economy of our framework, we compare the complexity of SAMoRA with the standard MoE-LoRA paradigm. For a comprehensive breakdown of all baselines and the detailed analysis process, please refer to Appendix [A](https://arxiv.org/html/2604.19048#A1).

Standard MoE-LoRA architectures typically assign independent down-projection and up-projection matrices to each expert. This results in a parameter complexity of $\mathcal{O}(N(d_{\text{in}} + d_{\text{out}})r)$ and necessitates high-dimensional computations for routing, incurring a cost of $\mathcal{O}(N d_{\text{in}})$. In contrast, SAMoRA optimizes both storage and inference efficiency through its asymmetric design and low-rank routing mechanism. Specifically, by using a shared expert $A$, SAMoRA eliminates the redundancy of learning separate input projections, reducing the parameter complexity to $\mathcal{O}((d_{\text{in}} + N d_{\text{out}})r)$. Furthermore, unlike standard methods that calculate routing scores in the high-dimensional input space ($d_{\text{in}}$), SAMoRA performs routing in the low-rank latent space ($r$).
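The parameter and routing costs quoted above can be checked with simple arithmetic. The layer shape below is hypothetical, and the extra terms for the expert keys and the diagonal of $S$ are an accounting choice made for this sketch; the task embeddings and gate parameters are left out, and the asymptotic expressions in the text omit all of these small terms.

```python
def moe_lora_params(n_experts, d_in, d_out, r):
    """Standard MoE-LoRA: every expert owns both an A (r x d_in) and a B (d_out x r)."""
    return n_experts * (d_in + d_out) * r

def samora_params(n_experts, d_in, d_out, r):
    """Shared A, per-expert B, plus the r-dimensional keys and the diagonal of S."""
    adapters = (d_in + n_experts * d_out) * r
    keys = n_experts * r
    scale = r
    return adapters + keys + scale

# Hypothetical layer shape, chosen only for illustration.
N, d_in, d_out, r = 4, 4096, 4096, 8

standard = moe_lora_params(N, d_in, d_out, r)
shared = samora_params(N, d_in, d_out, r)

# Per-input routing cost: N scores in the d_in-dimensional space versus the rank-r space.
routing_standard = N * d_in
routing_lowrank = N * r
```

For this shape the shared-$A$ design needs roughly five eighths of the adapter parameters of the fully independent layout, and the saving grows with the number of experts.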
Given that $r \ll d_{\text{in}}$, this design significantly reduces the routing FLOPs from $\mathcal{O}(N d_{\text{in}})$ to $\mathcal{O}(Nr)$, ensuring minimal latency overhead during inference. Overall, SAMoRA achieves a substantial reduction in both parameter count and computational cost compared to other MoE-LoRA baselines, offering a superior trade-off between model capacity and efficiency.

## 5 Experiments

### 5.1 Experiment Setting

#### Dataset

We evaluate SAMoRA on two challenging multi-task benchmarks that target different capabilities of LLMs:

(1) Commonsense Reasoning: A curated benchmark comprising nine representative commonsense reasoning tasks: ARC-Challenge (ARC-C), ARC-Easy (ARC-E) Clark et al. ([2018](https://arxiv.org/html/2604.19048#bib.bib32)), OpenBookQA (OBQA) Mihaylov et al. ([2018](https://arxiv.org/html/2604.19048#bib.bib33)), PIQA Bisk et al. ([2020](https://arxiv.org/html/2604.19048#bib.bib34)), SocialIQA (SIQA) Sap et al. ([2019](https://arxiv.org/html/2604.19048#bib.bib35)), BoolQ Wang et al. ([2019a](https://arxiv.org/html/2604.19048#bib.bib41)), HellaSwag (HellaS) Zellers et al. ([2019](https://arxiv.org/html/2604.19048#bib.bib42)), Winogrande (WinoG) Sakaguchi et al. ([2021](https://arxiv.org/html/2604.19048#bib.bib43)), and CommonsenseQA (CSQA) Talmor et al. ([2019](https://arxiv.org/html/2604.19048#bib.bib57)). These datasets cover diverse commonsense challenges, including science QA, physical and social reasoning, and everyday inference, and are widely used to evaluate the multi-task capabilities of LLMs.

(2) Natural Language Understanding: We use a widely used subset of seven tasks from the GLUE benchmark Wang et al. ([2019b](https://arxiv.org/html/2604.19048#bib.bib36)), including CoLA, SST-2, MRPC, QQP, MNLI, QNLI, and RTE.
These tasks assess linguistic phenomena such as grammaticality, sentiment analysis, paraphrase detection, and textual entailment, thus comprehensively evaluating general language understanding capabilities. Our evaluation follows the same train-test split protocol and instruction prompts as prior works Yang et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib24)); Liu et al. ([2024b](https://arxiv.org/html/2604.19048#bib.bib38)). Detailed descriptions of the data splits and prompt formats are provided in Appendix [B.1](https://arxiv.org/html/2604.19048#A2.SS1).

#### Implementation Details.

We conduct experiments using Qwen3-8B Team ([2025](https://arxiv.org/html/2604.19048#bib.bib56)) and LLaMA3.1-8B Team ([2024](https://arxiv.org/html/2604.19048#bib.bib62)) as the backbone architectures. We compare SAMoRA against a comprehensive set of competitive baselines, including LoRA Hu et al. ([2022](https://arxiv.org/html/2604.19048#bib.bib31)), MultiLoRA Wang et al. ([2023](https://arxiv.org/html/2604.19048#bib.bib51)), MoELoRA Liu et al. ([2024a](https://arxiv.org/html/2604.19048#bib.bib22)), HydraLoRA Tian et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib23)), MTL-LoRA Yang et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib24)), and MoORE Yuan et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib55)). To ensure a fair comparison, we modify the hyperparameters of the baselines to make the number of trainable parameters comparable for each method. We report detailed training settings for all baselines in Appendix [B.2](https://arxiv.org/html/2604.19048#A2.SS2).

### 5.2 Overall Performance

As shown in Table [1](https://arxiv.org/html/2604.19048#S4.T1) and Table [2](https://arxiv.org/html/2604.19048#S4.T2), SAMoRA consistently outperforms existing baselines on both LLaMA3.1-8B and Qwen3-8B across the Commonsense Reasoning and GLUE benchmarks, while maintaining strong parameter efficiency.
Compared to the single-adapter method LoRA, SAMoRA demonstrates clear advantages in handling diverse tasks, underscoring the importance of multi-expert architectures in multi-task adaptation. Compared with MTL-LoRA and HydraLoRA, which rely on conventional MLP-based routers, SAMoRA enables more accurate and flexible expert selection through its semantic-aware routing mechanism. Regarding MoORE, it attempts to leverage the original LLM weights by exclusively training the router. However, this approach performs poorly due to the limited number of trainable parameters, which proves insufficient for effective adaptation on downstream tasks. Furthermore, while MoELoRA introduces task-specific experts and MultiLoRA assigns a separate trainable scale factor to each LoRA module, they fail to account for task-specific characteristics and varying task complexity simultaneously. In contrast, SAMoRA introduces a task-adaptive scaling mechanism that dynamically modulates this balance, enabling more precise and efficient adaptation across diverse tasks with fewer trainable parameters.

### 5.3 Ablation Study

To better understand the effectiveness of each component in SAMoRA, we conduct a comprehensive ablation study. We evaluate the impact of the proposed semantic-aware router by comparing it against a conventional MLP-based router (w/o Router). We assess the contribution of the task-adaptive scaling mechanism by removing it across all tasks (w/o Scaling). In addition, we examine the influence of the auxiliary losses by removing the orthogonality loss (w/o $\mathcal{L}_{\text{orth}}$) and the semantic match loss (w/o $\mathcal{L}_{\text{match}}$), respectively. The results are summarized in Table [2](https://arxiv.org/html/2604.19048#S4.T2), and further implementation details are provided in Appendix [C.1](https://arxiv.org/html/2604.19048#A3.SS1).
As presented in Table [2](https://arxiv.org/html/2604.19048#S4.T2), SAMoRA consistently achieves the best performance across all tasks, validating the synergy of its components. Notably, removing the task-adaptive scaling mechanism (w/o Scaling) leads to the most significant performance degradation (a sharp drop from 69.75% to 66.43% on CoLA), underscoring its critical role in resolving task conflicts and mitigating negative transfer. Similarly, replacing the semantic-aware router with a standard MLP (w/o Router) results in a clear decline, confirming the necessity of explicit semantic alignment for precise expert allocation. Furthermore, excluding the auxiliary regularization terms (w/o $\mathcal{L}_{\text{orth}}$ and w/o $\mathcal{L}_{\text{match}}$) also impairs overall results, demonstrating their importance in maintaining expert distinctiveness and stabilizing training.

#### Analysis of Semantic-Aware Router.

To investigate expert specialization, we visualize the PCA projections of the latent representations derived from the Semantic Expert $B$ matrices. As illustrated in Figure [3](https://arxiv.org/html/2604.19048#S5.F3), the standard MLP router results in entangled clusters with blurred boundaries. In contrast, our Semantic-Aware Router yields distinct and well-separated clusters for the Semantic Expert $B$ modules, explicitly confirming that each expert has specialized in a specific semantic subspace. Detailed experimental settings are provided in Appendix [C.2](https://arxiv.org/html/2604.19048#A3.SS2).

#### Analysis of Task-Adaptive Scaling.

To validate the effectiveness of our mechanism, we visualize the learned scaling factors of SAMoRA trained on Qwen3-8B. Figure [4](https://arxiv.org/html/2604.19048#S5.F4) displays the factors for the query ($q$), key ($k$), value ($v$), and output ($o$) projections within the final attention layer.
The learned factors vary across tasks, which supports the claim that the mechanism adapts to task-specific requirements.

Figure 3: PCA visualization of expert features extracted from the final gate layer trained on the Commonsense Reasoning dataset.

Figure 4: Visualization of task scaling factors across tasks trained on the Commonsense Reasoning dataset.

### 5.4 Sensitivity Analysis

We conduct a comprehensive sensitivity analysis on key hyperparameters, including model architecture ($N, r, d_g$) and training objectives ($\lambda_{\mathrm{orth}}, \lambda_{\mathrm{KL}}, \tau$), with detailed results provided in Appendix [D](https://arxiv.org/html/2604.19048#A4). Overall, the model exhibits strong robustness across varying configurations. Notably, regarding the task embedding dimension $d_g$, we observe that compact embeddings are sufficient for effective routing; increasing $d_g$ to excessive levels introduces unnecessary complexity that hinders convergence.

## 6 Conclusion

In this paper, we propose SAMoRA, a novel PEFT framework that significantly enhances multi-task generalization. By ensuring precise expert routing and dynamic task adaptation, our approach secures robust and superior performance across diverse multi-task scenarios. Extensive experiments demonstrate that SAMoRA consistently outperforms existing baselines, achieving a favorable trade-off between performance and parameter efficiency.

## Limitations

In this paper, we conduct experiments on the Commonsense Reasoning and GLUE benchmarks by fine-tuning models at the 8B parameter scale. Due to limited computational resources, the scalability of our framework to significantly larger foundation models (e.g., 70B scale or above) has not yet been empirically verified. Furthermore, a broader range of application scenarios remains unexplored, particularly in the multimodal domain, such as visual instruction tuning and visual question answering tasks.
We plan to extend our method to these large-scale and multimodal settings in future work to further explore its generalization capabilities.

## Acknowledgments

This work was supported by the Frontier Technologies R&D Program of Jiangsu (Grant No. BF2024052), the Nanjing Municipal Science and Technology Bureau (Grant No. 202512136), and the Chengdu Science and Technology Program (Grant No. 2025-YF08-00097-GX).

## References

- PIQA: reasoning about physical commonsense in natural language\.InThe Thirty\-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty\-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7\-12, 2020,pp\. 7432–7439\.External Links:[Link](https://doi.org/10.1609/aaai.v34i05.6239),[Document](https://dx.doi.org/10.1609/AAAI.V34I05.6239)Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - J\. Chen, X\. Hei, Y\. Xue, Z\. Wu, J\. Xie, and Y\. Cai \(2025a\)Classic4Children: adapting chinese literary classics for children with large language model\.InFindings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Findings of ACL,pp\. 2473–2488\.External Links:[Link](https://doi.org/10.18653/v1/2025.findings-naacl.133),[Document](https://dx.doi.org/10.18653/V1/2025.FINDINGS-NAACL.133)Cited by:[§1](https://arxiv.org/html/2604.19048#S1.p1.1)\. - J\. Chen, Y\. Jia, Z\. Wu, J\. Yang, J\. Chen, X\. Hei, J\. Xie, Y\. Cai, and Q\. Li \(2025b\)ExpStar: towards automatic commentary generation for multi\-discipline scientific experiments\.InProceedings of the 33rd ACM International Conference on Multimedia, MM 2025, Dublin, Ireland, October 27\-31, 2025,C\. Gurrin, K\. Schoeffmann, M\. Zhang, L\. Rossetto, S\. Rudinac, D\. Dang\-Nguyen, W\. Cheng, P\. Chen, and J\.
Benois\-Pineau \(Eds\.\),pp\. 6576–6585\.External Links:[Link](https://doi.org/10.1145/3746027.3755756),[Document](https://dx.doi.org/10.1145/3746027.3755756)Cited by:[§1](https://arxiv.org/html/2604.19048#S1.p1.1)\. - P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the AI2 reasoning challenge\.CoRRabs/1803\.05457\.External Links:[Link](http://arxiv.org/abs/1803.05457),1803\.05457Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.J\. Mach\. Learn\. Res\.23,pp\. 120:1–120:39\.External Links:[Link](https://jmlr.org/papers/v23/21-0998.html)Cited by:[§2\.1](https://arxiv.org/html/2604.19048#S2.SS1.p1.1)\. - W\. Feng, C\. Hao, Y\. Zhang, Y\. Han, and H\. Wang \(2024\)Mixture\-of\-loras: an efficient multitask tuning method for large language models\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20\-25 May, 2024, Torino, Italy,N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),pp\. 11371–11380\.External Links:[Link](https://aclanthology.org/2024.lrec-main.994)Cited by:[§3\.2](https://arxiv.org/html/2604.19048#S3.SS2.p2.8)\. - W\. Hong, W\. Yu, X\. Gu, G\. Wang, G\. Gan, H\. Tang, J\. Cheng, J\. Qi, J\. Ji, L\. Pan, S\. Duan, W\. Wang, Y\. Wang, Y\. Cheng, Z\. He, Z\. Su, Z\. Yang, Z\. Pan, A\. Zeng, B\. Wang, B\. Shi, C\. Pang, C\. Zhang, D\. Yin, F\. Yang, G\. Chen, J\. Xu, J\. Chen, J\. Chen, J\. Chen, J\. Lin, J\. Wang, J\. Chen, L\. Lei, L\. Gong, L\. Pan, M\. Zhang, Q\. Zheng, S\. Yang, S\. Zhong, S\. Huang, S\. Zhao, S\. Xue, S\. Tu, S\. Meng, T\. Zhang, T\. Luo, T\. Hao, W\. Li, W\. Jia, X\. Lyu, X\. Huang, Y\. Wang, Y\. Xue, Y\. Wang, Y\. An, Y\. Du, Y\. Shi, Y\. Huang, Y\. 
Niu, Y\. Wang, Y\. Yue, Y\. Li, Y\. Zhang, Y\. Zhang, Z\. Du, Z\. Hou, Z\. Xue, Z\. Du, Z\. Wang, P\. Zhang, D\. Liu, B\. Xu, J\. Li, M\. Huang, Y\. Dong, and J\. Tang \(2025\)GLM\-4\.1v\-thinking: towards versatile multimodal reasoning with scalable reinforcement learning\.CoRRabs/2507\.01006\.External Links:[Link](https://doi.org/10.48550/arXiv.2507.01006),[Document](https://dx.doi.org/10.48550/ARXIV.2507.01006),2507\.01006Cited by:[§1](https://arxiv.org/html/2604.19048#S1.p1.1)\. - E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§A\.1](https://arxiv.org/html/2604.19048#A1.SS1.p1.1),[§1](https://arxiv.org/html/2604.19048#S1.p2.1),[§3\.2](https://arxiv.org/html/2604.19048#S3.SS2.p1.8),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - R\. A\. Jacobs, M\. I\. Jordan, S\. J\. Nowlan, and G\. E\. Hinton \(1991\)Adaptive mixtures of local experts\.Neural Comput\.3\(1\),pp\. 79–87\.External Links:[Link](https://doi.org/10.1162/neco.1991.3.1.79),[Document](https://dx.doi.org/10.1162/NECO.1991.3.1.79)Cited by:[§2\.1](https://arxiv.org/html/2604.19048#S2.SS1.p1.1)\. - D\. Lepikhin, H\. Lee, Y\. Xu, D\. Chen, O\. Firat, Y\. Huang, M\. Krikun, N\. Shazeer, and Z\. Chen \(2021\)GShard: scaling giant models with conditional computation and automatic sharding\.In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021,External Links:[Link](https://openreview.net/forum?id=qrwe7XHTmYb)Cited by:[§2\.1](https://arxiv.org/html/2604.19048#S2.SS1.p1.1)\. - M\. Liao, W\. Chen, J\. Shen, S\. Guo, and H\. 
Wan \(2025\)HMoRA: making llms more effective with hierarchical mixture of lora experts\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=lTkHiXeuDl)Cited by:[§2\.2](https://arxiv.org/html/2604.19048#S2.SS2.p1.1)\. - Q\. Liu, X\. Wu, X\. Zhao, Y\. Zhu, D\. Xu, F\. Tian, and Y\. Zheng \(2024a\)When MOE meets llms: parameter efficient fine\-tuning for multi\-task medical applications\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14\-18, 2024,G\. H\. Yang, H\. Wang, S\. Han, C\. Hauff, G\. Zuccon, and Y\. Zhang \(Eds\.\),pp\. 1104–1114\.External Links:[Link](https://doi.org/10.1145/3626772.3657722),[Document](https://dx.doi.org/10.1145/3626772.3657722)Cited by:[§A\.1](https://arxiv.org/html/2604.19048#A1.SS1.p1.1),[§1](https://arxiv.org/html/2604.19048#S1.p2.1),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024b\)DoRA: weight\-decomposed low\-rank adaptation\.InForty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024,External Links:[Link](https://openreview.net/forum?id=3d5CIRG1n2)Cited by:[§3\.2](https://arxiv.org/html/2604.19048#S3.SS2.p2.8),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p2.1)\. - T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? A new dataset for open book question answering\.CoRRabs/1809\.02789\.External Links:[Link](http://arxiv.org/abs/1809.02789),1809\.02789Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - M\. Muqeeth, H\. Liu, and C\. Raffel \(2024\)Soft merging of experts with adaptive routing\.Trans\. Mach\. Learn\. 
Res\.2024\.External Links:[Link](https://openreview.net/forum?id=7I199lc54z)Cited by:[§2\.1](https://arxiv.org/html/2604.19048#S2.SS1.p1.1)\. - C\. Qin, A\. Zhang, Z\. Zhang, J\. Chen, M\. Yasunaga, and D\. Yang \(2023\)Is chatgpt a general\-purpose natural language processing task solver?\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),pp\. 1339–1384\.External Links:[Link](https://doi.org/10.18653/v1/2023.emnlp-main.85),[Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.85)Cited by:[§1](https://arxiv.org/html/2604.19048#S1.p1.1)\. - C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.J\. Mach\. Learn\. Res\.21,pp\. 140:1–140:67\.External Links:[Link](https://jmlr.org/papers/v21/20-074.html)Cited by:[§1](https://arxiv.org/html/2604.19048#S1.p1.1)\. - K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\)WinoGrande: an adversarial winograd schema challenge at scale\.Commun\. ACM64\(9\),pp\. 99–106\.External Links:[Link](https://doi.org/10.1145/3474381),[Document](https://dx.doi.org/10.1145/3474381)Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - M\. Sap, H\. Rashkin, D\. Chen, R\. L\. Bras, and Y\. Choi \(2019\)SocialIQA: commonsense reasoning about social interactions\.CoRRabs/1904\.09728\.External Links:[Link](http://arxiv.org/abs/1904.09728),1904\.09728Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. V\. Le, G\. E\. Hinton, and J\. 
Dean \(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings,External Links:[Link](https://openreview.net/forum?id=B1ckMDqlg)Cited by:[§2\.1](https://arxiv.org/html/2604.19048#S2.SS1.p1.1)\. - A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)Commonsenseqa: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 4149–4158\.Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - L\. Team \(2024\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783),[Document](https://dx.doi.org/10.48550/ARXIV.2407.21783),2407\.21783Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - Q\. Team \(2025\)Qwen3 technical report\.CoRRabs/2505\.09388\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.09388),[Document](https://dx.doi.org/10.48550/ARXIV.2505.09388),2505\.09388Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - C\. Tian, Z\. Shi, Z\. Guo, L\. Li, and C\. Xu \(2024\)HydraLoRA: an asymmetric lora architecture for efficient fine\-tuning\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. 
Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/123fd8a56501194823c8e0dca00733df-Abstract-Conference.html)Cited by:[§A\.1](https://arxiv.org/html/2604.19048#A1.SS1.p1.1),[§1](https://arxiv.org/html/2604.19048#S1.p3.1),[§2\.2](https://arxiv.org/html/2604.19048#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2604.19048#S4.SS1.SSS0.Px1.p1.5),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - A\. Wang, Y\. Pruksachatkun, N\. Nangia, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2019a\)SuperGLUE: A stickier benchmark for general\-purpose language understanding systems\.InAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada,H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\),pp\. 3261–3275\.External Links:[Link](https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html)Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2019b\)GLUE: A multi\-task benchmark and analysis platform for natural language understanding\.In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\-9, 2019,External Links:[Link](https://openreview.net/forum?id=rJ4km2R5t7)Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - Y\. Wang, Y\. Lin, X\. Zeng, and G\. Zhang \(2023\)MultiLoRA: democratizing lora for better multi\-task learning\.CoRRabs/2311\.11501\.External Links:[Link](https://doi.org/10.48550/arXiv.2311.11501),[Document](https://dx.doi.org/10.48550/ARXIV.2311.11501),2311\.11501Cited by:[§2\.2](https://arxiv.org/html/2604.19048#S2.SS2.p1.1),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - Y\. Wang, X\. Ma, G\. Zhang, Y\. 
Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024\)MMLU\-pro: A more robust and challenging multi\-task language understanding benchmark\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/ad236edc564f3e3156e1b2feafb99a24-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[§C\.2](https://arxiv.org/html/2604.19048#A3.SS2.SSS0.Px2.p1.1)\. - J\. Wei, M\. Bosma, V\. Y\. Zhao, K\. Guu, A\. W\. Yu, B\. Lester, N\. Du, A\. M\. Dai, and Q\. V\. Le \(2022\)Finetuned language models are zero\-shot learners\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,External Links:[Link](https://openreview.net/forum?id=gEZrGCozdqR)Cited by:[§3\.1](https://arxiv.org/html/2604.19048#S3.SS1.p1.4)\. - L\. Xu, H\. Xie, S\. J\. Qin, X\. Tao, and F\. L\. Wang \(2023\)Parameter\-efficient fine\-tuning methods for pretrained language models: a critical review and assessment\.arXiv preprint arXiv:2312\.12148\.Cited by:[§1](https://arxiv.org/html/2604.19048#S1.p1.1)\. - Y\. Yang, D\. Muhtar, Y\. Shen, Y\. Zhan, J\. Liu, Y\. Wang, H\. Sun, W\. Deng, F\. Sun, Q\. Zhang, W\. Chen, and Y\. Tong \(2025\)MTL\-lora: low\-rank adaptation for multi\-task learning\.InAAAI\-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 \- March 4, 2025, Philadelphia, PA, USA,T\. Walsh, J\. Shah, and Z\. Kolter \(Eds\.\),pp\. 
22010–22018\.External Links:[Link](https://doi.org/10.1609/aaai.v39i20.35509),[Document](https://dx.doi.org/10.1609/AAAI.V39I20.35509)Cited by:[§A\.1](https://arxiv.org/html/2604.19048#A1.SS1.p1.1),[§B\.1](https://arxiv.org/html/2604.19048#A2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2604.19048#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2604.19048#S3.SS2.p2.8),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p2.1),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - S\. Yuan, Y\. Zheng, T\. Wang, B\. Liu, and H\. Xu \(2025\)MoORE: svd\-based model moe\-ization for conflict\-and oblivion\-resistant multi\-task adaptation\.arXiv preprint arXiv:2506\.14436\.Cited by:[§4\.2](https://arxiv.org/html/2604.19048#S4.SS2.SSS0.Px1.p1.6),[Table 1](https://arxiv.org/html/2604.19048#S4.T1),[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px2.p1.1)\. - R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers,A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\),pp\. 4791–4800\.External Links:[Link](https://doi.org/10.18653/v1/p19-1472),[Document](https://dx.doi.org/10.18653/V1/P19-1472)Cited by:[§5\.1](https://arxiv.org/html/2604.19048#S5.SS1.SSS0.Px1.p1.1)\. - S\. Zhao, W\. Chen, B\. Shi, L\. Zhou, S\. Lin, and H\. Wan \(2025\)Spatial\-temporal knowledge distillation for takeaway recommendation\.InThirty\-Ninth AAAI Conference on Artificial Intelligence, Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2025, Philadelphia, PA, USA, February 25 \- March 4, 2025,T\. Walsh, J\. Shah, and Z\. Kolter \(Eds\.\),pp\. 
13365–13373\.External Links:[Link](https://doi.org/10.1609/aaai.v39i12.33459),[Document](https://dx.doi.org/10.1609/AAAI.V39I12.33459)Cited by:[§4\.2](https://arxiv.org/html/2604.19048#S4.SS2.SSS0.Px1.p1.6)\.

## Appendix A Complexity Analysis

Table 3: Comparison of learnable parameters and computational complexity. Notations: $d_{\text{in}}/d_{\text{out}}$ are the input/output dimensions, $r$ is the rank, $N$ is the number of experts, and $K$ is the number of tasks. $d_g$ denotes the task embedding size for MoELoRA and SAMoRA. SAMoRA achieves a superior trade-off by combining asymmetric experts with efficient routing.

### A.1 Theoretical Analysis

In this section, we analyze the theoretical complexity of SAMoRA in terms of trainable parameters and computational overhead. We compare our method against standard LoRA Hu et al. ([2022](https://arxiv.org/html/2604.19048#bib.bib31)) and representative MoE-based PEFT frameworks, including MoELoRA Liu et al. ([2024a](https://arxiv.org/html/2604.19048#bib.bib22)), HydraLoRA Tian et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib23)) and MTL-LoRA Yang et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib24)). For clarity, we define the following notations: $d_{\text{in}}$ and $d_{\text{out}}$ denote the input and output dimensions of the adapter layer, respectively. $r$ represents the low-rank dimension, $N$ is the number of experts, and $K$ is the number of tasks.

#### Parameter Efficiency.

The comparison of learnable parameters is summarized in Table [3](https://arxiv.org/html/2604.19048#A1.T3).

- Standard LoRA employs a single pair of low-rank matrices per layer, resulting in $(d_{\text{in}}+d_{\text{out}})r$ parameters. It serves as the most parameter-efficient baseline but lacks multi-task flexibility.
- MoELoRA adopts a task-conditioned routing mechanism.
It introduces a task embedding layer ($K d_g$) and a router projection matrix ($d_g N$) to generate routing probabilities based on task IDs. Unlike the asymmetric design in SAMoRA, MoELoRA maintains fully independent low-rank experts. Consequently, its parameter complexity for the adapters is $N(d_{\text{in}}+d_{\text{out}})r$, which is significantly higher than shared-weight approaches. The total parameter count is given by $N(d_{\text{in}}+d_{\text{out}})r + K d_g + N d_g$, where $d_g$ is the task embedding dimension.
- HydraLoRA, MTL-LoRA and SAMoRA adopt an asymmetric expert architecture. To optimize parameter efficiency, we share the projection matrix on the input side ($A\in\mathbb{R}^{d_{\text{in}}\times r}$), while maintaining $N$ expert-specific matrices on the output side ($B\in\mathbb{R}^{r\times d_{\text{out}}}$). This design reduces the complexity from the standard MoE's $N(d_{\text{in}}+d_{\text{out}})r$ to $(d_{\text{in}}+N d_{\text{out}})r$.
- MTL-LoRA creates task-specific experts, scaling the number of parameters linearly with the number of tasks $K$. This results in a significantly higher parameter count of approximately $KN(d_{\text{in}}+d_{\text{out}})r$, making it less scalable for scenarios with many tasks.

#### Computational Overhead.

Our SAMoRA framework introduces minimal computational overhead. The Semantic-Aware Router requires a lightweight projection from $d_{\text{in}}$ to the rank space $r$ (where $r \ll \min(d_{\text{in}}, d_{\text{out}})$), adding only $\mathcal{O}(Nr)$ operations. The Task-Adaptive Scaling mechanism introduces a lightweight parameter set of size $K d_g$ to capture task-specific characteristics.
Since the scaling process primarily involves element-wise multiplications, the resulting computational overhead is negligible compared to the matrix multiplications in the backbone model.

## Appendix B Experimental Setup

### B.1 Datasets and Prompts

Following the experimental setup in Yang et al. ([2025](https://arxiv.org/html/2604.19048#bib.bib24)), we summarize the statistics for the Commonsense Reasoning and GLUE benchmarks in Table [4](https://arxiv.org/html/2604.19048#A2.T4) and [5](https://arxiv.org/html/2604.19048#A2.T5), respectively. The corresponding prompt templates used are detailed in Table [7](https://arxiv.org/html/2604.19048#A4.T7).

Table 4: The basic information of the Commonsense Reasoning dataset.

Table 5: The basic information of the GLUE benchmark.

### B.2 Implementation Details

We implement all methods using the PyTorch framework. Detailed hyperparameter configurations for our proposed SAMoRA and all baseline methods are summarized in Table [8](https://arxiv.org/html/2604.19048#A4.T8).

## Appendix C Extended Analyses and Ablation Studies

### C.1 Setup of Ablation Variants

To rigorously evaluate the contribution of each component in SAMoRA, we conduct ablation studies using the Qwen3-8B model on the GLUE benchmark. The specific configurations of the ablated variants are defined as follows:

- w/o Router: We replace our proposed Semantic-Aware Router with a standard MLP-based gating network. As analyzed in Section [A](https://arxiv.org/html/2604.19048#A1), this substitution leads to an increase in trainable parameters due to the dense connections in the MLP layers.
- w/o Scaling: We disable the dynamic scaling mechanisms to verify their impact on task adaptation. Specifically, we fix all elements of the Diagonal Scaling Matrix $S$ to 1.0 and set the task-dependent scalar $g_{\text{task}}$ to 1.0 throughout the training process. Under this setting, the scaling strategy effectively reverts to the standard LoRA formulation.
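
The effect of the w/o Scaling variant can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the exact position of $S$ and $g_{\text{task}}$ in the forward pass is defined in the main text and is not reproduced here, so the update form used below, $g_{\text{task}} \sum_i g_i \,((xA) \odot S)\, B_i$, is an assumption made only for illustration, and the names `matvec`, `rand_mat`, and `moe_lora_delta` are hypothetical. Under that assumption, fixing $S$ to ones and $g_{\text{task}}$ to 1.0 leaves exactly the unscaled gated update.

```python
import random

def matvec(x, M):
    """Row vector x (length m) times matrix M (m x n) -> list of length n."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def rand_mat(m, n):
    """Random m x n matrix with standard normal entries."""
    return [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

def moe_lora_delta(x, A, Bs, gates, S=None, g_task=1.0):
    """ASSUMED update form: g_task * sum_i gates[i] * ((x A) * S) B_i."""
    z = matvec(x, A)                              # shared down-projection to rank r
    if S is not None:
        z = [s * v for s, v in zip(S, z)]         # diagonal scaling in rank space
    out = [0.0] * len(Bs[0][0])
    for g, B in zip(gates, Bs):                   # gate-weighted expert up-projections
        out = [o + g * v for o, v in zip(out, matvec(z, B))]
    return [g_task * o for o in out]              # task-dependent scalar

random.seed(0)
d_in, d_out, r, N = 6, 5, 3, 2                    # toy sizes, chosen arbitrarily
x = [random.gauss(0, 1) for _ in range(d_in)]
A = rand_mat(d_in, r)
Bs = [rand_mat(r, d_out) for _ in range(N)]
gates = [0.6, 0.4]                                # hypothetical routing weights

full = moe_lora_delta(x, A, Bs, gates, S=[0.5, 1.5, 2.0], g_task=0.7)
ablated = moe_lora_delta(x, A, Bs, gates, S=[1.0] * r, g_task=1.0)  # w/o Scaling
plain = moe_lora_delta(x, A, Bs, gates)           # no scaling terms at all
```

In this toy setting `ablated` coincides with `plain`, while `full` differs from both, which is the sense in which the ablation removes the scaling mechanism without touching the router or the experts.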
### C.2 Analysis of Semantic-Aware Router

To strictly isolate the efficacy of our routing mechanism and eliminate interference from other components, we conduct a controlled experiment based on the asymmetric MoE-LoRA architecture (featuring one shared matrix $A$ and multiple semantic experts $B$). In this setup, we vary only the routing module (comparing our Semantic-Aware Router against a standard MLP router) while keeping all other structures identical. To make the expert specialization patterns more observable, we scale the number of experts to $N=8$ and employ Llama-3.1-8B as the backbone, training on the Commonsense Reasoning benchmark.

#### Expert Representation Analysis.

As illustrated in Figure [3](https://arxiv.org/html/2604.19048#S5.F3), we visualize the Principal Component Analysis (PCA) projection of the learned expert features. The visualization reveals a stark contrast in the latent structure of the experts. With the MLP-based router, the expert representations tend to cluster closely together with ambiguous boundaries, indicating a high degree of functional overlap. In contrast, our SAMoRA framework produces highly distinct and separated expert clusters. This explicitly demonstrates that our approach successfully enforces expert distinctiveness, allowing each expert to specialize in different semantic subspaces.

Figure 5: Visualization of expert activation patterns on the unseen MMLU benchmark. The top row (MLP Router) exhibits severe mode collapse, while our SAMoRA (bottom row) maintains diverse and adaptive routing across different subjects.

#### Routing Behavior on Unseen Tasks.

To further evaluate the generalization capability of the router, we extend our analysis to the MMLU benchmark Wang et al. ([2024](https://arxiv.org/html/2604.19048#bib.bib37)), which serves as an unseen task during training. We visualize the proportion of activations for each expert in Figure [5](https://arxiv.org/html/2604.19048#A3.F5).
Here, we display a subset of six randomly selected subjects characterized by diverse semantic distributions. The figure is organized by subject columns, with the top row representing the MLP router and the bottom row representing ours. A critical observation from the top row is that the MLP router suffers from severe representation collapse: regardless of the input subject, it predominantly selects Expert 5, with other experts being rarely activated. This behavior suggests that the MLP router fails to align expert specialization with input semantics, causing the dynamic MoE architecture to effectively degrade into a static, non-MoE model. Conversely, our method (bottom row) exhibits diverse and balanced activation patterns adaptive to different subjects, validating its ability to maintain precise routing even on out-of-distribution data.

#### Theoretical Rationale for Semantic Match Regularization.

In our architecture, learnable Expert Keys represent the intended specialties. However, during unconstrained joint optimization, these keys risk diverging from the actual parameters the experts learn. We employ KL Divergence to penalize this structural misalignment. KL Divergence is theoretically optimal here because it rigorously measures the relative entropy between two probability distributions. By forcing the expert key's assignment distribution to closely track the intrinsic capability distribution (derived from expert weights), we steer the optimization trajectory toward a state where routing decisions are strictly anchored in the experts' genuine functional capabilities, rather than arbitrary local minima.

#### Theoretical Basis for Matrix $\mathbf{B}$ Row Averaging.
- Isolating Expert Knowledge: In our asymmetric LoRA structure ($\Delta W=\sum_{i=1}^{N} g_i B_i A$), the shared down-projection $A$ acts as a universal feature extractor, leaving matrix $B_i$ strictly responsible for the specialized mapping back to the output space. Consequently, $B_i$ inherently encodes the unique capabilities of that specific expert.
- Geometric Centroid as Capability Anchor: The rows of $B_i$ are transformation vectors residing in the $r$-dimensional latent space, the exact space where our routing occurs. By computing the average of these rows, we calculate the geometric centroid of the expert's parameter subspace. Theoretically, this centroid represents the dominant, macro-level semantic direction of the expert's transformations.
- Robust and Dimensional Alignment: This aggregation smooths out localized parameter noise, yielding a highly stable global representation in $\mathbb{R}^{r}$. This perfectly aligns with the dimensionality of the Expert Keys, allowing for a mathematically sound distance computation and ensuring the alignment loss is both meaningful and computationally efficient.

Figure 6: Sensitivity analysis on hyperparameters evaluated on the Commonsense Reasoning dataset. Subfigures (a) and (b) illustrate different ablation settings.

## Appendix D Hyperparameter Sensitivity

We conduct a comprehensive sensitivity analysis to evaluate the robustness of our proposed framework under various hyperparameter configurations. All experiments in this section are performed on the Commonsense Reasoning benchmark using Llama-3.1-8B as the backbone model, trained for 1 epoch.

### D.1 Hyperparameter Sensitivity Analysis

We conduct a comprehensive sensitivity analysis to investigate how different hyperparameter configurations affect SAMoRA's performance.

#### Impact of Model Architecture ($N, r, d_g$).
We first evaluate the impact of model capacity ($N, r$) and the task embedding dimension ($d_{g}$) in Figure [6](https://arxiv.org/html/2604.19048#A3.F6)(b).

- Robustness to Capacity ($N, r$): The performance remains relatively stable across a broad range of expert counts $N$ and LoRA ranks $r$. Specifically, increasing $r$ from 8 to 64 yields only marginal gains, indicating that our method is parameter-efficient and does not rely on high-rank adapters. Similarly, the robustness against $N$ indicates that our routing mechanism effectively utilizes available experts without suffering from redundancy.
- Task Embedding Dimension ($d_{g}$): We observe a distinct behavior regarding $d_{g}$. While the model performs well with compact dimensions, there is a sharp accuracy drop when $d_{g}$ is increased to 64. This suggests that overly large task embeddings may introduce excessive parameters relative to the supervision signal, hindering convergence. Thus, a compact $d_{g}$ is sufficient for effective semantic encoding.

#### Impact of Optimization Hyperparameters ($\tau, \lambda_{\text{orth}}, \lambda_{\text{KL}}$).

We further analyze the regularization terms and routing temperature. Figure [6](https://arxiv.org/html/2604.19048#A3.F6)(a) illustrates the individual sensitivity trends for the temperature $\tau$, orthogonality loss weight $\lambda_{\text{orth}}$, and KL divergence weight $\lambda_{\text{KL}}$. We observe that moderate values generally facilitate better convergence, preventing the router from collapsing or becoming too uniform. To identify the optimal interaction between these terms, we report the joint ablation results in Table [6](https://arxiv.org/html/2604.19048#A4.T6).

- Temperature ($\tau$): The temperature controls the sharpness of the routing distribution. We find that $\tau=0.8$ achieves the best performance (84.01%).
Lower temperatures (e.g., $\tau=0.4$) lead to premature expert collapse (83.45%), while higher temperatures (e.g., $\tau=1.0$) result in an overly smooth distribution (83.95%).
- Regularization Weights: Combined with $\tau=0.8$, appropriate regularization weights ($\lambda_{\text{orth}}$ and $\lambda_{\text{KL}}$) are essential to balance expert specialization and load distribution, securing the best trade-off between plasticity and stability.

### D.2 Impact of Loss Weights

Finally, we analyze the sensitivity of the regularization hyperparameters: the orthogonality weight $\lambda_{\text{orth}}$ and the semantic match divergence weight $\lambda_{\text{KL}}$.

#### Orthogonality Weight ($\lambda_{\text{orth}}$).

This term encourages diversity among experts. Comparing the rows in Table [6](https://arxiv.org/html/2604.19048#A4.T6):

- Removing the regularization ($\lambda_{\text{orth}}=0$) results in a performance drop to 83.20%, confirming the necessity of promoting expert diversity.
- However, setting $\lambda_{\text{orth}}$ too high (1E-2) causes a significant performance degradation to 79.35%. This suggests that excessive constraints on orthogonality might hinder the optimization of the primary task loss.
- A moderate value of 1E-3 proves to be the most effective, striking a balance between expert diversity and task adaptation.

#### Semantic Match Weight ($\lambda_{\text{KL}}$).

This term aligns the routing decisions with semantic information. The results show a positive correlation between $\lambda_{\text{KL}}$ and model performance within the tested range. Increasing $\lambda_{\text{KL}}$ from 0 to 1E-2 consistently improves accuracy (from 83.26% to 84.01%), highlighting the benefit of guiding the router with semantic knowledge derived from task embeddings.
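As an illustration only, the following Python sketch shows one way the temperature $\tau$ and the two loss weights analyzed in this appendix could enter the training objective. This section does not give the exact loss definitions, so everything here is an assumption: the function names, the dot-product scoring, the direction of the KL term, and the toy values are all hypothetical. The orthogonality loss is treated as an already-computed scalar because its form is not specified here.

```python
import math

def softmax(scores, tau=1.0):
    """Temperature-scaled softmax; a smaller tau yields a sharper distribution."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def row_centroid(B):
    """Average the rows of a (d_out x r) matrix, giving one vector in R^r."""
    return [sum(col) / len(B) for col in zip(*B)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def semantic_match_loss(h, keys, expert_Bs, tau=0.8):
    """Assumed form: KL between key-based and centroid-based routing distributions."""
    key_dist = softmax([dot(h, k) for k in keys], tau)
    cap_dist = softmax([dot(h, row_centroid(B)) for B in expert_Bs], tau)
    return kl_divergence(key_dist, cap_dist)

def total_loss(task_loss, orth_loss, kl_loss, lambda_orth=1e-3, lambda_kl=1e-2):
    """Weighted sum of the task loss and the two regularizers (assumed additive)."""
    return task_loss + lambda_orth * orth_loss + lambda_kl * kl_loss

# Toy example with N = 2 experts and rank r = 2 (all values are made up).
h = [1.0, 0.0]                       # routing representation of one input
keys = [[1.0, 0.0], [0.0, 1.0]]      # one hypothetical Expert Key per expert
expert_Bs = [
    [[2.0, 0.0], [0.0, 0.0]],        # row centroid [1.0, 0.0], agrees with key 0
    [[0.0, 2.0], [0.0, 0.0]],        # row centroid [0.0, 1.0], agrees with key 1
]
aligned = semantic_match_loss(h, keys, expert_Bs)        # keys match centroids
swapped = semantic_match_loss(h, keys, expert_Bs[::-1])  # keys mismatch centroids
```

In this toy setting the penalty vanishes when each key agrees with the row centroid of its expert's $B_{i}$ and becomes positive when the experts are swapped, which is the misalignment the semantic match term is meant to discourage. The defaults `lambda_orth=1e-3`, `lambda_kl=1e-2`, and `tau=0.8` mirror the best-performing values reported above.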
Table 6: Sensitivity Analysis (%) of regularization weights and temperature on Commonsense Reasoning dataset (Backbone: Llama-3.1-8B).

Table 7: Prompt templates used for the Natural Language Understanding benchmark (GLUE). The placeholders (e.g., {sentence}) represent the input fields from the dataset.

Table 8: Detailed hyperparameter settings for all baseline methods on Commonsense Reasoning and GLUE benchmark. Common settings are listed in the top section, while method-specific parameters are detailed below. "-" indicates the parameter is not applicable.