# Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
Source: [https://arxiv.org/html/2605.07111](https://arxiv.org/html/2605.07111)
Haozhan Tang^{1,2,✉}, Xiuqi Zhu^{1,∗}, Xinyin Zhang^{1,∗}, Boxun Li^{3}, Virginia Smith^{1}, Kevin Kuo^{1,✉}

^{1}Carnegie Mellon University, ^{2}Tsinghua University, ^{3}Infinigence AI
###### Abstract
Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within 1.5% of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to 20% on Fact and 9% on Med and SQL.
∗ These authors contributed equally.
✉ Corresponding authors: hztang801@gmail.com, kkuo2@andrew.cmu.edu.

## 1 Introduction
Fine-tuning pre-trained Large Language Models (LLMs) is a standard paradigm that yields strong performance on downstream NLP tasks [Brown et al., [2020](https://arxiv.org/html/2605.07111#bib.bib9); Touvron et al., [2023a](https://arxiv.org/html/2605.07111#bib.bib8), [b](https://arxiv.org/html/2605.07111#bib.bib17); Chung et al., [2024](https://arxiv.org/html/2605.07111#bib.bib10)]. However, effective fine-tuning is challenging because the parameter capacity of LLMs far exceeds the limited number of examples in fine-tuning datasets. This creates a tension between representational plasticity and generalization, where aggressive optimization can cause overfitting or degrade pretrained representations [Jiang et al., [2020](https://arxiv.org/html/2605.07111#bib.bib37); Aghajanyan et al., [2020](https://arxiv.org/html/2605.07111#bib.bib36)]. One natural axis along which such structure can be controlled is the parameterization of the fine-tuning update itself [Ding et al., [2022](https://arxiv.org/html/2605.07111#bib.bib35); Xu et al., [2026](https://arxiv.org/html/2605.07111#bib.bib34)].
Figure 1: Our empirical evaluations reveal a structural trade-off in fine-tuning: FFT excels on high-entropy factual domains while LoRA supplies the regularization needed to preserve pre-trained reasoning during adaptation. The Mixture of LoRA and Full (MoLF) framework dynamically routes updates between full-parameter and low-rank pathways at the optimizer level: shifting sparsity to the optimization step ensures every expert receives full-batch gradient signals throughout training.

In this space, a fundamental yet unresolved question is whether Full Fine-Tuning (FFT) or Low-Rank Adaptation (LoRA) [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50)] is more effective. It is commonly assumed that FFT, with its higher capacity, should achieve superior accuracy over LoRA. Extensions of LoRA therefore typically aim to increase its effective rank, either by mixing multiple LoRA modules of a common rank [Wang et al., [2022](https://arxiv.org/html/2605.07111#bib.bib61); Albert et al., [2025](https://arxiv.org/html/2605.07111#bib.bib38)] or by adapting the ranks of modules throughout training [Zhang et al., [2023b](https://arxiv.org/html/2605.07111#bib.bib52), [a](https://arxiv.org/html/2605.07111#bib.bib53); Liu et al., [2024](https://arxiv.org/html/2605.07111#bib.bib56)]. However, empirical evidence suggests that raw capacity is not the sole determinant of performance: the low-rank constraint can act as a regularizer that enables LoRA to outperform FFT [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50); Biderman et al., [2024](https://arxiv.org/html/2605.07111#bib.bib68)]. Together, these lines of work suggest that relying on a single static architecture is structurally limited, motivating a solution that can leverage the benefits of both.
To this end, we propose the Mixture of LoRA and Full fine-tuning (MoLF), which simultaneously trains an FFT and a LoRA expert. Unlike prior mixture-of-PEFT or adaptive LoRA methods [Wang et al., [2022](https://arxiv.org/html/2605.07111#bib.bib61); Zhang et al., [2023b](https://arxiv.org/html/2605.07111#bib.bib52)], MoLF leaves the expert parameters intact and sparsifies parameter *updates* at the expert level; all experts participate in every forward and backward pass. The parameter space and optimizer state thus stay fixed throughout training, avoiding the cold-start AdamW moments that adaptive-rank methods incur when ranks are promoted, and every expert accumulates gradient statistics from the full batch, yielding stable training dynamics as the importance of each expert shifts.
For memory-constrained settings, we additionally propose MoLF-Efficient (MoLF-E), which forgoes the FFT expert and routes updates among a pair of LoRA experts. MoLF-E inherits the training-consistency benefits of MoLF while trading full-parameter expressiveness for a reduced memory footprint. In summary, our contributions are:
1. We extensively tune FFT and LoRA in 9 settings, fine-tuning 3 LLMs (Gemma-3-1B, Qwen2.5-1.5B, Qwen2.5-3B) on 3 datasets (CounterFact, MedMCQA, and Text-to-SQL). Our results show that the optimal choice of method and rank varies across settings, suggesting that methods should not simply seek to maximize the effective rank of the architecture.
2. We propose MoLF, which unifies FFT and LoRA within a mixture-of-experts framework. MoLF fine-tunes both an FFT and a LoRA expert while systematically constraining updates based on a momentum-based and capacity-aware expert scoring function. Across 3 benchmark datasets and 3 LLM architectures, MoLF consistently performs better than or within 1.5% of the best baseline (FFT or LoRA).
3. We propose MoLF-E, a memory-efficient variant which freezes the base model and routes updates among a pair of LoRA experts. At comparable parameter budgets, MoLF-E consistently outperforms existing adaptive-rank methods, with over 20% improvement on Fact over the lowest-performing baseline method.
## 2 Related Work
**FFT versus LoRA.** Prior work has shown that pre-trained LLM fine-tuning occurs in a low-dimensional subspace, serving as an explanation of why LoRA is highly effective [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50); Aghajanyan et al., [2021](https://arxiv.org/html/2605.07111#bib.bib70); Schulman and Thinking Machines, [2025](https://arxiv.org/html/2605.07111#bib.bib69)]. Follow-up works have tried to further improve LoRA by mimicking FFT or by increasing the effective rank of LoRA [Albert et al., [2025](https://arxiv.org/html/2605.07111#bib.bib38); Hao et al., [2024](https://arxiv.org/html/2605.07111#bib.bib33); Wang et al., [2024](https://arxiv.org/html/2605.07111#bib.bib39); Lialin et al., [2024](https://arxiv.org/html/2605.07111#bib.bib42)]. Empirical comparisons between the two methods yield mixed conclusions: some work finds that LoRA matches or exceeds FFT, with the low-rank constraint acting as an implicit regularizer that mitigates forgetting and reduces reliance on explicit KL penalties during RLHF [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50); Biderman et al., [2024](https://arxiv.org/html/2605.07111#bib.bib68); Sun et al., [2023](https://arxiv.org/html/2605.07111#bib.bib3); Du et al., [2024](https://arxiv.org/html/2605.07111#bib.bib4)]. Conversely, other work finds that FFT outperforms LoRA, particularly in instruction tuning and knowledge-intensive settings [Ivison et al., [2023](https://arxiv.org/html/2605.07111#bib.bib5); Pletenev et al., [2025](https://arxiv.org/html/2605.07111#bib.bib51)]. Beyond raw accuracy, FFT and LoRA also differ in the structure of their learned solutions and their robustness to distribution shift [Biderman et al., [2024](https://arxiv.org/html/2605.07111#bib.bib68); Shuttleworth et al., [2025](https://arxiv.org/html/2605.07111#bib.bib41)].
**Adaptive LoRA.** LoRA is a parameter-efficient fine-tuning (PEFT) method which injects trainable low-rank matrices into a frozen base model [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50)]. Despite its efficiency, LoRA is sensitive to the choice of rank, motivating a line of work on methods that use importance scores (e.g., parameter or gradient norms) to promote or prune rank components dynamically across layers [Zhang et al., [2023b](https://arxiv.org/html/2605.07111#bib.bib52), [a](https://arxiv.org/html/2605.07111#bib.bib53); Liu et al., [2024](https://arxiv.org/html/2605.07111#bib.bib56); Chang et al., [2025](https://arxiv.org/html/2605.07111#bib.bib60)]. A similar family of methods takes a finer-grained approach by decomposing LoRA updates into rank-1 components and selectively gating or routing over them, either via sparse regularization, meta-learning, or importance-based pruning [Ding et al., [2023](https://arxiv.org/html/2605.07111#bib.bib54); Zhang et al., [2024](https://arxiv.org/html/2605.07111#bib.bib57); Mao et al., [2024](https://arxiv.org/html/2605.07111#bib.bib55)]. Finally, a related line of work aims to produce LoRA modules that are robust to rank truncation at inference time [Valipour et al., [2023](https://arxiv.org/html/2605.07111#bib.bib58); Rajabzadeh et al., [2024](https://arxiv.org/html/2605.07111#bib.bib59)].
**Mixture-of-PEFT.** Mixture-of-Experts (MoE) models maintain multiple parallel sub-networks (experts) and route each input to a subset of them [Jacobs et al., [1991](https://arxiv.org/html/2605.07111#bib.bib71); Shazeer et al., [2017](https://arxiv.org/html/2605.07111#bib.bib72)]. Several works combine PEFT with MoE-style routing, treating each LoRA adapter as an expert. These works are motivated by two related but distinct goals. First, a single fixed-rank adapter has limited capacity, and routing over a pool of adapters increases this capacity at low additional compute cost [Wang et al., [2022](https://arxiv.org/html/2605.07111#bib.bib61); Zhu et al., [2023](https://arxiv.org/html/2605.07111#bib.bib66); Liu and Luo, [2024](https://arxiv.org/html/2605.07111#bib.bib48)]. Second, when attempting to specialize to multiple domains, a shared LoRA suffers from gradient conflicts and negative transfer; routing allows individual experts to specialize per domain or task [Zadouri et al., [2024](https://arxiv.org/html/2605.07111#bib.bib62); Wu et al., [2024](https://arxiv.org/html/2605.07111#bib.bib65); Li et al., [2024](https://arxiv.org/html/2605.07111#bib.bib63); Dou et al., [2024](https://arxiv.org/html/2605.07111#bib.bib64)]. However, all of these methods take the low-rank constraint as given and focus on how to best allocate or route among equally constrained experts. In contrast, MoLF assumes a full-rank search space and lets the data determine how updates within this space should be constrained to maximize performance.
Overall, prior work stops short of leveraging the FFT/LoRA tension itself. MoLF resolves this by mixing FFT and LoRA experts and sparsifying parameter *updates* rather than expert parameters; all experts contribute to every forward pass and accumulate gradient statistics continuously, yielding more stable dynamics than methods that truncate or gate experts.
## 3 Understanding Fine-Tuning Dynamics: An Empirical Analysis
Recent literature presents two conflicting perspectives on the fine-tuning dynamics of LLMs. One line [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50); Aghajanyan et al., [2021](https://arxiv.org/html/2605.07111#bib.bib70); Schulman and Thinking Machines, [2025](https://arxiv.org/html/2605.07111#bib.bib69)] argues that meaningful weight updates reside in a low-rank subspace, making Low-Rank Adaptation (LoRA) not merely an efficient approximation but a theoretically optimal one that avoids over-parameterization. A competing line [Biderman et al., [2024](https://arxiv.org/html/2605.07111#bib.bib68)] argues that unconstrained Full Fine-Tuning (FFT) is strictly more powerful for complex tasks, concluding that "LoRA Learns Less and Forgets Less": LoRA hits capacity bottlenecks when injecting high-entropy knowledge, but the same low-rank constraint also acts as a protective regularizer against the destructive high-rank updates of FFT.
To systematically resolve this dispute, we empirically evaluate FFT and LoRA across varying tasks and model scales. Our setup includes Google Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B, evaluated on datasets chosen for their diverse intrinsic dimensionalities: Factual Knowledge (CounterFact [Meng et al., [2022](https://arxiv.org/html/2605.07111#bib.bib76)]), Medical QA (MedMCQA [Pal et al., [2022](https://arxiv.org/html/2605.07111#bib.bib77)]), and Text-to-SQL (Gretel synthetic Text-to-SQL [Meyer et al., [2024](https://arxiv.org/html/2605.07111#bib.bib78)]). Following a rigorous hyperparameter sweep over learning rates, schedulers, and LoRA ranks ($r \in \{8, 16, 32, 64, 128\}$), we report the optimal result for each setup in Table [1](https://arxiv.org/html/2605.07111#S3.T1). We include the details of the hyperparameter sweep in Appendix [A.2](https://arxiv.org/html/2605.07111#A1.SS2).
Table 1: Fine-Tuning Performance Benchmark. Cells are Efficacy Score (%) on Fact and accuracy (%) on Med and SQL, rounded to two decimal places. The best score in each row is in bold, and any score within a 1.5% margin of the best is highlighted in blue.

Our results isolate three distinct fine-tuning regimes, mathematically rationalizing both perspectives in the literature. Let $W_{\text{base}} \in \mathbb{R}^{d \times k}$ be the pre-trained weights and $\Delta W^{*} = \sum_{i=1}^{\min(d,k)} \sigma_i u_i v_i^{T}$ be the optimal weight update via Singular Value Decomposition.
- **Fact (Capacity Bottleneck: "LoRA Learns Less"):** FFT strictly dominates LoRA across all models. Factual injection requires memorization of high-entropy, nearly orthogonal entity associations, yielding a heavy-tailed intrinsic dimension for $\Delta W^{*}$. By the Eckart-Young-Mirsky Theorem [Eckart and Young, [1936](https://arxiv.org/html/2605.07111#bib.bib6); Mirsky, [1960](https://arxiv.org/html/2605.07111#bib.bib7)], any rank-$r$ approximation $BA$ incurs an error lower-bounded by the truncated tail energy $\|\Delta W^{*} - BA\|_F^2 \geq \sum_{i>r} \sigma_i^2$, which is large under a heavy tail, whereas unconstrained FFT can express $\Delta W^{*}$ exactly. This bound is purely representational and does not imply monotone improvement in $r$, since higher rank simultaneously raises the representational ceiling and weakens implicit spectral regularization on a finite training set. Table [1](https://arxiv.org/html/2605.07111#S3.T1) confirms the qualitative claim: every swept LoRA rank trails FFT by at least 4.17% on Fact across all three models, though the rank that minimizes this gap is non-monotonic.
- **Med (Spectral Regularization: "LoRA Forgets Less"):** High-rank LoRA systematically outperforms FFT. Medical QA requires adapting to complex formats while strictly preserving pre-trained reasoning. We model the empirical gradient as $G = G_{\text{task}} + G_{\text{noise}}$, with $G_{\text{noise}}$ a high-rank, approximately isotropic fluctuation. FFT applies $-\eta G$ directly, aggressively altering orthogonal dimensions and causing catastrophic forgetting. Under vanilla SGD on the LoRA factors, a first-order expansion of $B_{\text{new}} A_{\text{new}} - BA$ yields $\Delta W_{\text{eff}} \approx -\eta (BB^{T} G + G A^{T} A) + \mathcal{O}(\eta^2)$, a rank-bounded linear map that confines the effective update to $\mathrm{col}(B)$ from the left and $\mathrm{row}(A)$ from the right (reducing to an orthogonal projection when $B$ and $A^{\top}$ have orthonormal columns, otherwise rescaling by their singular values). This subspace confinement preserves signal directions captured by the LoRA factors and attenuates components outside them, supplying implicit spectral regularization that protects pre-trained logic; the same intuition extends approximately to the AdamW-preconditioned update.
- **SQL (Low Intrinsic Dimension: "LoRA Without Regret"):** FFT and LoRA perform similarly, and the performance of LoRA is stable across diverse ranks; with a low rank, LoRA can even potentially outperform FFT. Text-to-SQL primarily requires structural and syntactic alignment rather than novel reasoning. Consequently, the optimal update has a concentrated singular value spectrum ($\sigma_i \approx 0$ for $i > r$). Applying Eckart-Young-Mirsky in this favorable regime, the truncated tail energy $\sum_{i>r} \sigma_i^2$ is negligible, so LoRA captures the optimal update with minimal representation loss.
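The Eckart-Young-Mirsky bound that separates the Fact and SQL regimes can be checked numerically. Below is an illustrative NumPy sketch with a synthetic update matrix; the heavy-tailed spectrum `sigma` is an assumption chosen for demonstration, not fitted to any model. The squared Frobenius error of the best rank-$r$ approximation equals exactly the truncated tail energy $\sum_{i>r} \sigma_i^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "optimal" update dW with a slowly decaying (heavy-tailed) spectrum.
d, k, r = 64, 48, 8
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal left factors
V, _ = np.linalg.qr(rng.standard_normal((k, k)))   # orthonormal right factors
sigma = 1.0 / np.arange(1, min(d, k) + 1) ** 0.5   # slow decay = heavy tail
dW = (U[:, :k] * sigma) @ V.T

# Best rank-r approximation via truncated SVD (the Eckart-Young-Mirsky optimum).
u, s, vt = np.linalg.svd(dW, full_matrices=False)
BA = (u[:, :r] * s[:r]) @ vt[:r]

# Squared Frobenius error equals the truncated tail energy sum_{i>r} sigma_i^2.
err = np.linalg.norm(dW - BA, "fro") ** 2
tail = np.sum(s[r:] ** 2)
print(err, tail)
```

With the heavy tail above, `tail` stays large relative to the total energy, mirroring the Fact regime; replacing `sigma` with a rapidly decaying spectrum makes the tail negligible, mirroring the SQL regime.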
This empirical study demonstrates that static fine-tuning architectures are structurally limited. Relying solely on FFT suffers from catastrophic forgetting on logical reasoning tasks (Med), while relying on LoRA enforces an irreducible capacity bottleneck on high-entropy factual tasks (Fact). Real-world applications require continuous navigation of both regimes.
## 4 Methodology: Mixture of LoRA and Full Fine-Tuning
To enable continuous navigation between the representational plasticity of FFT and the parameter-efficient regularization of LoRA, we propose the Mixture of LoRA and Full (MoLF) Fine-Tuning framework, alongside its memory-constrained variant, MoLF-Efficient. MoLF enables the model to execute each update step within the most gradient-saturated rank, dynamically routing between full-parameter and low-rank updates.
### 4.1 Background: LoRA and Mixture-of-Experts
Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA) [Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50)], mitigates the memory cost of full fine-tuning (FFT) by freezing the pre-trained weights $W_{\text{base}} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ and injecting trainable low-rank matrices $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $B \in \mathbb{R}^{d_{\text{out}} \times r}$. The fixed bottleneck $r$ rigidly caps global structural capacity, inducing the bottlenecks observed on high-entropy factual learning.
Mixture-of-Experts (MoE) architectures [Jacobs et al., [1991](https://arxiv.org/html/2605.07111#bib.bib71); Shazeer et al., [2017](https://arxiv.org/html/2605.07111#bib.bib72)] scale capacity by conditionally routing tokens through a sparse subset of independent "experts", but this fractures the batch and induces noisy gradient statistics, load-balancing collapse, and training instability. MoLF (Figure [2](https://arxiv.org/html/2605.07111#S4.F2)) bridges these paradigms by shifting sparsity from the forward pass to the backward optimization step: every expert unconditionally receives full-batch gradient signals, yielding stable, high-fidelity gradient statistics.
Figure 2: Overview of the MoLF Framework.
### 4.2 MoLF Architecture and Inference
Structurally, MoLF unifies FFT and LoRA by formulating each linear projection as an unconditional superposition of expert pathways. For a given input activation $x$, the ungated forward pass evaluates:

$$y = W_{\text{base}} x + \sum_{i=1}^{N} \frac{\alpha_i}{\sqrt{r_i}} B_i \big( A_i (\text{Dropout}(x)) \big). \qquad (1)$$

Here, the dense matrix $W_{\text{base}}$ serves as the FFT expert, while each pair $(A_i, B_i)$ acts as an independent LoRA expert with rank $r_i$. To stabilize learning dynamics across varying capacities, we apply Rank-Stabilized LoRA (RS-LoRA) scaling $\alpha_i / \sqrt{r_i}$ [Kalajdzievski, [2023](https://arxiv.org/html/2605.07111#bib.bib73)]. Because conditional token gating is eliminated, all structural pathways evaluate every token. Sparsity is thus strictly deferred to the optimizer, which dynamically allocates updates based on these dense, globally informed gradient signals.
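The ungated forward pass of Equation (1) can be sketched in plain NumPy. The shapes, ranks, scaling constants, and zero-initialized $B$ factors below are illustrative assumptions, not the paper's training configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 32, 16

# Dense weight: the FFT expert (first term of Eq. 1).
W_base = rng.standard_normal((d_out, d_in)) * 0.02

# Two LoRA experts of differing rank; B starts at zero, as in standard LoRA.
experts = []
for r, alpha in [(4, 8.0), (16, 8.0)]:
    A = rng.standard_normal((r, d_in)) * 0.02
    B = np.zeros((d_out, r))
    experts.append((A, B, r, alpha))

def molf_forward(x, dropout_p=0.0):
    """Ungated superposition: every expert evaluates every token (Eq. 1)."""
    y = W_base @ x
    for A, B, r, alpha in experts:
        h = x.copy()
        if dropout_p > 0.0:  # inverted dropout applied to the LoRA input only
            mask = rng.random(h.shape) >= dropout_p
            h = h * mask / (1.0 - dropout_p)
        y = y + (alpha / np.sqrt(r)) * (B @ (A @ h))  # RS-LoRA scaling
    return y

x = rng.standard_normal(d_in)
y = molf_forward(x)
# With all B factors at zero, the superposition reduces to the base projection.
print(np.allclose(y, W_base @ x))
```

Note that no gate appears anywhere: the sum runs over all experts unconditionally, which is what guarantees every expert a full-batch gradient in the backward pass.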
### 4.3 Dynamic Gradient Routing via Sparse AdamW
The mathematical core of the MoLF framework is a custom sparse optimization algorithm built upon the decoupled weight decay principles of the AdamW optimizer [Kingma and Ba, [2014](https://arxiv.org/html/2605.07111#bib.bib74); Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.07111#bib.bib75)]. Operating independently on each network layer, the routing mechanism executes a split-phase update that strictly decouples information flow (moment tracking) from action (weight modification).
### Phase 1: Universal Momentum Tracking
Let $i \in \{0, 1, \dots, N\}$ denote the index of an expert within a specific module, where $i = 0$ corresponds to the FFT matrix ($W_{\text{base}}$) and $i > 0$ denotes the LoRA adapters. For every expert $i$ receiving a batch-averaged gradient $g_t^{(i)}$ at step $t$, the optimizer updates the Exponential Moving Averages (EMA) for the first moment $m_t^{(i)}$ following Equation [2](https://arxiv.org/html/2605.07111#S4.E2) and the uncentered second moment $v_t^{(i)}$ following Equation [3](https://arxiv.org/html/2605.07111#S4.E3):

$$m_t^{(i)} = \beta_1 m_{t-1}^{(i)} + (1 - \beta_1) g_t^{(i)} \qquad (2)$$

$$v_t^{(i)} = \beta_2 v_{t-1}^{(i)} + (1 - \beta_2) \big( g_t^{(i)} \big)^2 \qquad (3)$$

The Adam step counter $t$ and moments are updated universally for all experts, regardless of selection for physical updates. Because every expert sees the full batch, the bias-correction factors ($1 - \beta_1^t$, $1 - \beta_2^t$) remain synchronized, and dormant experts maintain a mature, debiased momentum state, avoiding the cold-start failure mode on later activation.
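Phase 1 can be sketched in a few lines; the expert count, parameter shapes, and random gradients below are placeholder assumptions. The point of the sketch is that every expert's `(m, v, t)` state advances on every step, so no expert ever activates with cold moments:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999

# One (m, v, t) state per expert; all experts are tracked every step,
# whether or not they are later selected for a physical weight update.
states = {i: {"m": np.zeros(4), "v": np.zeros(4), "t": 0} for i in range(3)}

def track(i, g):
    """Phase 1: EMA updates of Eqs. (2)-(3) for expert i."""
    s = states[i]
    s["t"] += 1
    s["m"] = beta1 * s["m"] + (1 - beta1) * g
    s["v"] = beta2 * s["v"] + (1 - beta2) * g**2
    return s

rng = np.random.default_rng(2)
for step in range(10):
    for i in states:           # every expert, every step: no cold start
        track(i, rng.standard_normal(4))

# All step counters agree, so the bias corrections stay synchronized.
print({i: states[i]["t"] for i in states})
```

Because the counters agree, the debiasing factors $1 - \beta_1^t$ and $1 - \beta_2^t$ are identical across experts at any given step, which is what the text above relies on.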
### Phase 2: Expert Scoring by Expected Preconditioned Descent
The relevant quantity for deciding which expert should receive an update is not the raw gradient magnitude but the loss reduction expected from the AdamW step the optimizer would actually take. Because AdamW does not descend along $g_t^{(i)}$ but along the preconditioned direction $m_t^{(i)} / (\sqrt{v_t^{(i)}} + \epsilon)$, raw gradient norms only loosely track useful descent and discard the per-coordinate adaptive scaling that AdamW relies on.

We therefore use the Expected Preconditioned Descent (EPD) score $\mathcal{S}_t^{(i)}$, which estimates the first-order expected loss reduction of expert $i$'s AdamW step and then normalizes by $N_{\text{params}}^{(i)}$ to obtain a per-parameter quantity comparable across experts of very different sizes:

$$\mathcal{S}_t^{(i)} = \frac{\eta_t^{(i)}}{N_{\text{params}}^{(i)}} \sum_{\theta \in \Theta_i} \frac{\big( m_t^{(i)} \big)^2}{\sqrt{v_t^{(i)}} + \epsilon} \qquad (4)$$

Here $\eta_t^{(i)}$ is the per-expert learning rate and $m_t^{(i)}, v_t^{(i)}$ are the AdamW moving averages tracked in Phase 1. A first-order Taylor derivation of this proxy is given in Appendix [A.1](https://arxiv.org/html/2605.07111#A1.SS1).
Two scaling properties of Equation [4](https://arxiv.org/html/2605.07111#S4.E4) are worth making explicit, because they deliberately differ from a scale-invariant alternative such as the Preconditioned Frobenius Norm (PFN; Appendix [A.3](https://arxiv.org/html/2605.07111#A1.SS3)). First, the score is not rank-invariant: under RS-LoRA scaling $\alpha_i / \sqrt{r_i}$, the per-element gradient on a LoRA expert scales like $1/\sqrt{r_i}$, so $m_t^2 / \sqrt{v_t}$ scales like $1/\sqrt{r_i}$. Summing over $N_{\text{params}}^{(i)} \propto r_i$ trainable parameters and dividing by $N_{\text{params}}^{(i)}$ leaves an aggregate score that scales as $\eta_t^{(i)} / \sqrt{r_i}$. The score thus rewards LoRA experts whose effective step size is large relative to their parameter count, which is the correct economic quantity when the optimizer must commit a Top-$K$ update to a strict subset of experts. Second, the score retains the per-expert learning rate $\eta_t^{(i)}$, so two experts with otherwise identical preconditioned magnitudes are correctly differentiated by the actual step the optimizer would take. PFN, by contrast, cancels both of these factors and ranks experts purely on directional gradient consistency; the ablation in Section [5.4.2](https://arxiv.org/html/2605.07111#S5.SS4.SSS2) shows that the additional information in EPD is what produces the gains observed on Med.
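Equation (4) reduces to a few lines of code. The toy moment values below are assumptions chosen to contrast a consistent gradient signal ($m^2$ comparable to $v$) with pure noise ($m$ near zero while $v$ stays large):

```python
import numpy as np

def epd_score(m, v, lr, eps=1e-8):
    """Expected Preconditioned Descent (Eq. 4): first-order expected loss
    reduction of the AdamW step, normalized per parameter."""
    n_params = m.size
    return (lr / n_params) * np.sum(m**2 / (np.sqrt(v) + eps))

# A consistent gradient direction keeps m^2 on the order of v (high score);
# pure gradient noise drives m toward zero while v stays large (low score).
m_hi, v_hi = np.full(100, 0.1), np.full(100, 0.01)
m_lo, v_lo = np.full(100, 1e-3), np.full(100, 0.01)
print(epd_score(m_hi, v_hi, lr=1e-4) > epd_score(m_lo, v_lo, lr=1e-4))
```

The division by `n_params` makes the score a per-parameter average, so an expert cannot win merely by being large; it must offer more expected descent per committed parameter.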
### Phase 3: Top-K Sparse AdamW Update
The experts within each linear module are sorted by descending EPD score𝒮t\(i\)\\mathcal\{S\}\_\{t\}^\{\(i\)\}\. For the Top\-KKwinning experts, we apply the standard AdamW update\[Loshchilov and Hutter,[2017](https://arxiv.org/html/2605.07111#bib.bib75)\]using the momentsmt\(i\),vt\(i\)m\_\{t\}^\{\(i\)\},v\_\{t\}^\{\(i\)\}tracked in Phase 1, the per\-expert learning rateηt\(i\)\\eta\_\{t\}^\{\(i\)\}, and decoupled weight decayλi\\lambda\_\{i\}:
θt\(i\)←θt−1\(i\)\(1−ηt\(i\)λi\)−ηt\(i\)1−β1t\(mt\(i\)vt\(i\)/\(1−β2t\)\+ϵ\)fori∈Winners\.\\theta\_\{t\}^\{\(i\)\}\\leftarrow\\theta\_\{t\-1\}^\{\(i\)\}\\left\(1\-\\eta\_\{t\}^\{\(i\)\}\\lambda\_\{i\}\\right\)\-\\frac\{\\eta\_\{t\}^\{\(i\)\}\}\{1\-\\beta\_\{1\}^\{t\}\}\\left\(\\frac\{m\_\{t\}^\{\(i\)\}\}\{\\sqrt\{v\_\{t\}^\{\(i\)\}/\(1\-\\beta\_\{2\}^\{t\}\)\}\+\\epsilon\}\\right\)\\quad\\text\{for \}i\\in\\text\{Winners\}\.\(5\)
The losing experts strictly retain their previous physical weights ($\theta_t^{(i)}=\theta_{t-1}^{(i)}$). In MoLF, we execute this routing at the local module level, meaning that different projection matrices within the same transformer layer can independently route updates to entirely different representational capacities.
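The three phases can be sketched end-to-end for a single module. The snippet below is a minimal NumPy illustration, not the released implementation; the `molf_step` helper, its dict-based expert layout, and all argument names are assumptions made for exposition:

```python
import numpy as np

def molf_step(experts, grads, t, K=1, b1=0.9, b2=0.999, eps=1e-8):
    """One MoLF routing step for a single linear module (sketch).

    `experts` is a list of dicts with keys 'theta', 'm', 'v', 'lr', 'wd'
    (a hypothetical layout). Phase 1 tracks Adam moments for every expert;
    Phase 2 scores them with EPD (Eq. 4); Phase 3 applies the AdamW
    update (Eq. 5) to the Top-K winners only.
    """
    scores = []
    for ex, g in zip(experts, grads):
        # Phase 1: moments are always updated, even for losing experts.
        ex['m'] = b1 * ex['m'] + (1 - b1) * g
        ex['v'] = b2 * ex['v'] + (1 - b2) * g ** 2
        # Phase 2: density-normalized expected preconditioned descent.
        scores.append(ex['lr'] / ex['m'].size
                      * np.sum(ex['m'] ** 2 / (np.sqrt(ex['v']) + eps)))
    winners = np.argsort(scores)[::-1][:K]
    for i in winners:
        ex = experts[i]
        m_hat = ex['m'] / (1 - b1 ** t)      # bias correction
        v_hat = ex['v'] / (1 - b2 ** t)
        # Phase 3: decoupled weight decay plus preconditioned step (Eq. 5).
        ex['theta'] = (ex['theta'] * (1 - ex['lr'] * ex['wd'])
                       - ex['lr'] * m_hat / (np.sqrt(v_hat) + eps))
    return winners  # losers keep theta physically unchanged
```

Because the moments of losing experts keep accumulating, an expert that loses the routing competition at step $t$ can still win at a later step without a cold start.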
### 4.4 MoLF-Efficient: Adaptive LoRA-Only Mixtures
Standard MoLF incurs high memory costs by tracking dense Adam states for $W_{\text{base}}$. For memory-constrained hardware, we introduce the MoLF-Efficient (MoLF-E) variant. Here, $W_{\text{base}}$ and all non-linear parameters are permanently frozen and excluded from optimizer state tracking. The architecture instead combines only LoRA experts of varying ranks.
Without the FFT path, routing becomes a gradient\-aware subspace search\. Because parallel adapters are independently initialized, they map to divergent optimization trajectories\. The optimizer continuously evaluates and directs updates to the specific low\-rank subspace offering the steepest expected loss reduction at each step\.
### 4.5 Post-Training Fusion and Zero-Overhead Inference
Unlike traditional MoE, MoLF restricts sparsity entirely to the optimizer update pass\. Because the forward pass is a superposition of all experts, the multi\-expert graph is perfectly collapsible prior to inference\.
After fine\-tuning, all trained LoRA experts are mathematically projected directly into their corresponding dense base weights:
$$W_{\text{final}}=W_{\text{base}}+\sum_{i=1}^{N}\frac{\alpha_i}{\sqrt{r_i}}\left(B_iA_i\right).\tag{6}$$
This algebraic projection permanently collapses the multi-expert components into the native pathway of $W_{\text{base}}$, making the final exported model structurally identical to the base LLM. In addition to providing compatibility with standard downstream inference engines, this eliminates the latency penalty of LoRA as well as architectural drift during downstream fine-tuning.
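The fusion in Equation 6 amounts to a few dense matrix additions. The helper below is a hypothetical NumPy sketch (the function name and tuple layout are illustrative); it assumes each expert stores $A_i\in\mathbb{R}^{r_i\times d_{\text{in}}}$ and $B_i\in\mathbb{R}^{d_{\text{out}}\times r_i}$ with RS-LoRA scaling $\alpha_i/\sqrt{r_i}$:

```python
import numpy as np

def fuse_lora_experts(W_base, experts):
    """Collapse trained LoRA experts into the dense base weight (Eq. 6).

    `experts` is a list of (A, B, alpha, r) tuples, where A has shape
    (r, d_in) and B has shape (d_out, r). Hypothetical helper for
    illustration, not the authors' export code.
    """
    W = W_base.copy()
    for A, B, alpha, r in experts:
        W += (alpha / np.sqrt(r)) * (B @ A)
    return W
```

After fusion the checkpoint is a plain dense model: a forward pass through $W_{\text{final}}$ equals the base forward plus all adapter forwards, with no extra matmuls at inference time.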
## 5 Results
### 5.1 Experimental Setup
We evaluate on the three benchmarks introduced in Section [3](https://arxiv.org/html/2605.07111#S3) (CounterFact \[Meng et al., [2022](https://arxiv.org/html/2605.07111#bib.bib76)\], MedMCQA \[Pal et al., [2022](https://arxiv.org/html/2605.07111#bib.bib77)\], Gretel synthetic Text-to-SQL \[Meyer et al., [2024](https://arxiv.org/html/2605.07111#bib.bib78)\]) across three open-source language models: Gemma-3-1B \[Kamath et al., [2025](https://arxiv.org/html/2605.07111#bib.bib20)\], Qwen2.5-1.5B, and Qwen2.5-3B \[Yang et al., [2024](https://arxiv.org/html/2605.07111#bib.bib21)\]. We compare MoLF and its memory-efficient variant MoLF-Efficient (MoLF-E) against FFT, LoRA \[Hu et al., [2022](https://arxiv.org/html/2605.07111#bib.bib50)\] at ranks $r\in\{8,16,32,64,128\}$, and two adaptive PEFT baselines, AdaLoRA \[Zhang et al., [2023b](https://arxiv.org/html/2605.07111#bib.bib52)\] and AdaMix \[Wang et al., [2022](https://arxiv.org/html/2605.07111#bib.bib61)\]. All methods use AdamW \[Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.07111#bib.bib75)\] under a matched compute budget. The static FFT and LoRA hyperparameter sweep, the per-expert MoLF and MoLF-E settings ($\eta_t^{(i)}$, $\lambda_i$, $\epsilon$, $K$), and the configurations used for AdaLoRA and AdaMix are documented in Appendix [A.2](https://arxiv.org/html/2605.07111#A1.SS2). We report Efficacy Score (ES, defined in Appendix [A.4](https://arxiv.org/html/2605.07111#A1.SS4)) on Fact and accuracy on Med and SQL; both quantities are reported as percentages, so all three datasets share a common visual scale in Tables [1](https://arxiv.org/html/2605.07111#S3.T1) and [2](https://arxiv.org/html/2605.07111#S5.T2) and Figures [3](https://arxiv.org/html/2605.07111#S5.F3) and [7](https://arxiv.org/html/2605.07111#A2.F7). All fine-tuning experiments were conducted on a single NVIDIA H100 Tensor Core or NVIDIA RTX PRO 6000 Blackwell GPU.
### 5.2 MoLF vs. Full Fine-Tuning and Tuned LoRA
Table 2: MoLF vs. Full Fine-Tuning and LoRA (best rank). Cells are Efficacy Score (%) on Fact and accuracy (%) on Med and SQL. The best score in each column is in bold; any score within a 1.5% margin of the best baseline (FFT or LoRA) is highlighted in blue. Checkmarks indicate performance recovery within this 1.5% margin.
Table [2](https://arxiv.org/html/2605.07111#S5.T2) extends the static benchmarks from Section [3](https://arxiv.org/html/2605.07111#S3) by introducing MoLF. Crucially, MoLF consistently matches or exceeds the optimal static fine-tuning strategy, eliminating the need to manually choose between LoRA and full fine-tuning (FFT) or conduct exhaustive rank searches. Across all nine configurations, MoLF recovers the best baseline performance within a 1.5% margin, achieving the outright best score in three. In contrast, single static methods falter across domains: FFT trails the optimal baseline by up to 4.04% on Med (Gemma-1B), while the best-tuned LoRA degrades by 5.69% on Fact (Qwen-3B).
MoLF adapts to each domain’s intrinsic dimensionality without manual choice. On Fact, it matches FFT-level capacity, even surpassing FFT on Qwen-1.5B by 0.20%; on SQL, which favors concentrated low-rank subspaces, it ties or sets the best score on two of three models; and on Med, it outperforms FFT by up to 3.25% and tracks the optimal static LoRA within 1.38%. MoLF thereby balances parameter capacity with spectral regularization without per-task hyperparameter sweeps.
### 5.3 MoLF-E vs. Adaptive PEFT Baselines
We next evaluate MoLF-E, which removes the FFT expert to accommodate memory-constrained hardware, against two widely adopted adaptive PEFT baselines: AdaLoRA \[Zhang et al., [2023b](https://arxiv.org/html/2605.07111#bib.bib52)\] and AdaMix \[Wang et al., [2022](https://arxiv.org/html/2605.07111#bib.bib61)\]. We use MoLF-E with a rank-64 LoRA expert and a rank-128 LoRA expert; an ablation of the rank choice for the smaller-rank LoRA expert is reported in Appendix [B.2](https://arxiv.org/html/2605.07111#A2.SS2). Figure [3](https://arxiv.org/html/2605.07111#S5.F3) presents the comparison. MoLF-E outperforms both adaptive baselines in 8 of 9 (task, model) settings, with margins of up to +11.70% over AdaLoRA and up to +20.01% over AdaMix.
Figure 3: MoLF-E vs. adaptive PEFT baselines across three tasks and three models. MoLF-E (blue) outperforms both adaptive baselines on 8 of 9 (task, model) settings.
The gap between MoLF-E and the adaptive baselines is most pronounced on Fact, where the heavy-tailed singular spectrum of $\Delta W^{*}$ characterized in Section [3](https://arxiv.org/html/2605.07111#S3) stresses each method’s rank-allocation strategy. AdaLoRA’s importance-based rank pruning and AdaMix’s stochastic routing both restrict the optimizer to a single low-rank subspace at any given step, leaving them unable to absorb the high-rank tail of the optimal update; MoLF-E sidesteps this bottleneck by maintaining multiple LoRA experts of varying rank in parallel and allocating updates via the EPD score, recovering the missing capacity without an FFT pathway. On Med and SQL, where rank sensitivity is mild, MoLF-E still exceeds both adaptive baselines, though by smaller margins.
### 5.4 Ablation of MoLF
We ablate two components of MoLF: the sparse-routing decision (whether selection is needed at all) and the routing heuristic itself (EPD vs. PFN). Table [3](https://arxiv.org/html/2605.07111#S5.T3) reports SQL and Med, the two regimes that expose the nuanced tradeoffs routing must resolve (low intrinsic dimension and spectral regularization, respectively); Fact, where the FFT pathway is structurally necessary, is examined separately via the router-behavior analysis in Figure [4](https://arxiv.org/html/2605.07111#S5.F4) and Appendix [B.1](https://arxiv.org/html/2605.07111#A2.SS1).
Table 3: Ablation measured in accuracy (%), showing the importance of score-based update routing (PFN, EPD) over uniform updates, and the gain from EPD’s learning-rate term over PFN.
#### 5.4.1 Sparse Update of Experts
To validate sparse routing, we ablate the selection mechanism by updating all FFT and LoRA experts simultaneously. Table [3](https://arxiv.org/html/2605.07111#S5.T3) shows that dense updating severely degrades Med accuracy on Qwen2.5-3B (60.60% → 31.08%) and Gemma-3-1B (45.40% → 39.73%): the unrestricted FFT pathway and the bottlenecked LoRA adapter compete to represent identical features, inducing optimization oscillations. Sparse updating wins on all setups except Qwen2.5-1.5B SQL, where the intrinsically low-rank task is captured equally well under either regime.
#### 5.4.2 Expert Selection Heuristics
To demonstrate the necessity of Expected Preconditioned Descent \(EPD\), we compare it against a Preconditioned Frobenius Norm \(PFN\) baseline\. PFN provides an intuitive, scale\-invariant metric by calculating the root\-mean\-square of the preconditioned Adam update\. The full formulation of the Preconditioned Frobenius Norm is presented in Appendix[A\.3](https://arxiv.org/html/2605.07111#A1.SS3)\. This naturally eliminates LoRA scaling biases and isolates routing decisions based purely on gradient directional consistency\.
As Table [3](https://arxiv.org/html/2605.07111#S5.T3) shows, routing by the EPD score matches or improves over routing by the PFN score on five of six (model, task) cells, with the largest gains on Med (e.g., +2.58 on Gemma-3-1B and +1.46 on Qwen2.5-3B); the two score functions tie on Gemma-3-1B SQL (73.09), and EPD trails PFN by 0.48 on Qwen2.5-1.5B SQL. The pattern indicates that scale invariance alone (PFN) suffices in regimes where the task is intrinsically low-rank and any reasonable scoring function works (SQL), but is insufficient on Med, where balancing massive and lightweight experts requires the additional information that the EPD score obtains from incorporating the learning rate ($\eta_t$). By accounting for both the optimizer’s intended step size and the local loss topology, the EPD score yields a stable mixture that varies with task.
Furthermore, we find that EPD achieves a highly stable, task-conditional fine-tuning mixture. As shown in Figure [4](https://arxiv.org/html/2605.07111#S5.F4), the router performs persistent structural assignment: individual modules definitively specialize in either FFT or LoRA early in training rather than continuously alternating across optimizer steps. The full router selection behavior of EPD is presented in Appendix [B.1](https://arxiv.org/html/2605.07111#A2.SS1).
Figure 4:MoLF routing dynamics over training\. Each heatmap row tracks one module’s structural preference \(FFT vs\. LoRA\) across optimizer steps\. The distinct horizontal striping shows routing stabilizes rapidly: modules commit early to either dense or low\-rank pathways with minimal oscillation\.
## 6 Conclusion
Fine\-tuning pre\-trained LLMs requires navigating the tension between FFT’s high capacity and LoRA’s implicit regularization\. We have shown that the optimal choice varies across tasks and models and that prior work leaves this insight unexploited\. MoLF addresses it by mixing FFT and LoRA experts and sparsifying only their parameter updates: across a broad set of tasks and models, MoLF reliably tracks whichever method is optimal, while MoLF\-E matches or outperforms adaptive LoRA methods at comparable parameter budgets\. These results suggest that maintaining an expressive trainable architecture while sparsifying only expert\-level updates is an effective way to capture the benefits of each component expert\.
We identify three natural directions for future work\. First, MoLF currently uses a single FFT and a single LoRA expert; scaling to multiple LoRA experts may enable finer\-grained adaptation\. Second, the EPD score is motivated by a first\-order Taylor approximation, and higher\-order information could improve scoring under highly non\-stationary loss landscapes\. Finally, MoLF\-E reduces memory at the cost of representational capacity, leaving open whether the MoLF optimization framework can be combined with more expressive efficient\-LoRA methods\.
## References
- A. Aghajanyan, S. Gupta, and L. Zettlemoyer (2021) Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7319–7328. Cited by: [§2](https://arxiv.org/html/2605.07111#S2.p1.1), [§3](https://arxiv.org/html/2605.07111#S3.p1.1).
- A\. Aghajanyan, A\. Shrivastava, A\. Gupta, N\. Goyal, L\. Zettlemoyer, and S\. Gupta \(2020\)Better fine\-tuning by reducing representational collapse\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- P\. Albert, F\. Z\. Zhang, H\. Saratchandran, C\. Rodriguez\-Opazo, A\. van den Hengel, and E\. Abbasnejad \(2025\)RandLoRA: full rank parameter\-efficient fine\-tuning of large models\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p2.1),[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- D\. Biderman, J\. Portes, J\. J\. G\. Ortiz, M\. Paul, P\. Greengard, C\. Jennings, D\. King, S\. Havens, V\. Chiley, J\. Frankle,et al\.\(2024\)LoRA learns less and forgets less\.arXiv preprint arXiv:2405\.09673\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p2.1),[§2](https://arxiv.org/html/2605.07111#S2.p1.1),[§3](https://arxiv.org/html/2605.07111#S3.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- H\. Chang, Z\. Ma, M\. Ma, Z\. Qi, A\. Sabot, H\. Jiang, and H\. Kung \(2025\)ElaLoRA: elastic & learnable low\-rank adaptation for efficient model fine\-tuning\.arXiv preprint arXiv:2504\.00254\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma,et al\.\(2024\)Scaling instruction\-finetuned language models\.Journal of Machine Learning Research25\(70\),pp\. 1–53\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- N\. Ding, X\. Lv, Q\. Wang, Y\. Chen, B\. Zhou, Z\. Liu, and M\. Sun \(2023\)Sparse low\-rank adaptation of pre\-trained language models\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 4133–4145\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- N\. Ding, Y\. Qin, G\. Yang, F\. Wei, Z\. Yang, Y\. Su, S\. Hu, Y\. Chen, C\. Chan, W\. Chen,et al\.\(2022\)Delta tuning: a comprehensive study of parameter efficient methods for pre\-trained language models\.arXiv preprint arXiv:2203\.06904\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- S\. Dou, E\. Zhou, Y\. Liu, S\. Gao, W\. Shen, L\. Xiong, Y\. Zhou, X\. Wang, Z\. Xi, X\. Fan,et al\.\(2024\)LoRAMoE: alleviating world knowledge forgetting in large language models via MoE\-style plugin\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1932–1945\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1)\.
- Y\. Du, A\. Havrilla, S\. Sukhbaatar, P\. Abbeel, and R\. Raileanu \(2024\)A study on improving reasoning in language models\.InI Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models,Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- C\. Eckart and G\. Young \(1936\)The approximation of one matrix by another of lower rank\.Psychometrika1\(3\),pp\. 211–218\.Cited by:[1st item](https://arxiv.org/html/2605.07111#S3.I1.i1.p1.7)\.
- Y\. Hao, Y\. Cao, and L\. Mou \(2024\)FLoRA: low\-rank adapters are secretly gradient compressors\.InInternational Conference on Machine Learning,pp\. 17554–17571\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.07111#S1.p2.1), [§2](https://arxiv.org/html/2605.07111#S2.p1.1), [§2](https://arxiv.org/html/2605.07111#S2.p2.1), [§3](https://arxiv.org/html/2605.07111#S3.p1.1), [§4.1](https://arxiv.org/html/2605.07111#S4.SS1.p1.3), [§5.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5).
- H\. Ivison, Y\. Wang, V\. Pyatkin, N\. Lambert, M\. Peters, P\. Dasigi, J\. Jang, D\. Wadden, N\. A\. Smith, I\. Beltagy,et al\.\(2023\)Camels in a changing climate: enhancing LM adaptation with Tulu 2\.arXiv preprint arXiv:2311\.10702\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- R\. A\. Jacobs, M\. I\. Jordan, S\. J\. Nowlan, and G\. E\. Hinton \(1991\)Adaptive mixtures of local experts\.Neural computation3\(1\),pp\. 79–87\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1),[§4\.1](https://arxiv.org/html/2605.07111#S4.SS1.p2.1)\.
- H\. Jiang, P\. He, W\. Chen, X\. Liu, J\. Gao, and T\. Zhao \(2020\)Smart: robust and efficient fine\-tuning for pre\-trained natural language models through principled regularized optimization\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 2177–2190\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- D\. Kalajdzievski \(2023\)A rank stabilization scaling factor for fine\-tuning with LoRA\.arXiv preprint arXiv:2312\.03732\.Cited by:[§4\.2](https://arxiv.org/html/2605.07111#S4.SS2.p1.5)\.
- A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière,et al\.\(2025\)Gemma 3 technical report\.arXiv preprint arXiv:2503\.19786\.Cited by:[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5)\.
- D\. P\. Kingma and J\. Ba \(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§4\.3](https://arxiv.org/html/2605.07111#S4.SS3.p1.1)\.
- D\. Li, Y\. Ma, N\. Wang, Z\. Ye, Z\. Cheng, Y\. Tang, Y\. Zhang, L\. Duan, J\. Zuo, C\. Yang,et al\.\(2024\)MixLoRA: enhancing large language models fine\-tuning with LoRA\-based mixture of experts\.arXiv preprint arXiv:2404\.15159\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1)\.
- V\. Lialin, S\. Muckatira, N\. Shivagunde, and A\. Rumshisky \(2024\)ReLoRA: high\-rank training through low\-rank updates\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- Z\. Liu and J\. Luo \(2024\)AdaMoLE: adaptive mixture of LoRA experts\.arXiv preprint arXiv:2405\.00361\.External Links:[Link](https://arxiv.org/abs/2405.00361)Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1)\.
- Z\. Liu, J\. Lyn, W\. Zhu, X\. Tian, and Y\. Graham \(2024\)ALoRA: allocating low\-rank adaptation for fine\-tuning large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 622–641\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p2.1),[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§4\.3](https://arxiv.org/html/2605.07111#S4.SS3.p1.1),[§4](https://arxiv.org/html/2605.07111#S4.SSx3.p1.5),[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5)\.
- Y\. Mao, K\. Huang, C\. Guan, G\. Bao, F\. Mo, and J\. Xu \(2024\)DoRA: enhancing parameter\-efficient fine\-tuning with dynamic rank distribution\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 11662–11675\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[§A\.4](https://arxiv.org/html/2605.07111#A1.SS4.p1.8),[§3](https://arxiv.org/html/2605.07111#S3.p2.1),[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5)\.
- Y\. Meyer, M\. Emadi, D\. Nathawani, L\. Ramaswamy, K\. Boyd, M\. Van Segbroeck, M\. Grossman, P\. Mlocek, and D\. Newberry \(2024\)Synthetic\-Text\-To\-SQL: a synthetic dataset for training language models to generate SQL queries from natural language prompts\.Hugging Face\.Note:[https://huggingface\.co/datasets/gretelai/synthetic\_text\_to\_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)Cited by:[§3](https://arxiv.org/html/2605.07111#S3.p2.1),[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5)\.
- L\. Mirsky \(1960\)Symmetric gauge functions and unitarily invariant norms\.The quarterly journal of mathematics11\(1\),pp\. 50–59\.Cited by:[1st item](https://arxiv.org/html/2605.07111#S3.I1.i1.p1.7)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\)MedMCQA: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\.InProceedings of the Conference on Health, Inference, and Learning \(CHIL\),Proceedings of Machine Learning Research, Vol\.174,pp\. 248–260\.Cited by:[§3](https://arxiv.org/html/2605.07111#S3.p2.1),[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5)\.
- S\. Pletenev, M\. Marina, D\. Moskovskiy, V\. Konovalov, P\. Braslavski, A\. Panchenko, and M\. Salnikov \(2025\)How much knowledge can you pack into a LoRA adapter without harming LLM?\.arXiv preprint arXiv:2502\.14502\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- H\. Rajabzadeh, M\. Valipour, T\. Zhu, M\. S\. Tahaei, H\. J\. Kwon, A\. Ghodsi, B\. Chen, and M\. Rezagholizadeh \(2024\)QDyLoRA: quantized dynamic low\-rank adaptation for efficient large language model tuning\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track,pp\. 712–718\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- J\. Schulman and Thinking Machines \(2025\)LoRA without regret\.Note:Accessed: 2026\-05\-06External Links:[Link](https://thinkingmachines.ai/blog/lora/)Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1),[§3](https://arxiv.org/html/2605.07111#S3.p1.1)\.
- N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.arXiv preprint arXiv:1701\.06538\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1),[§4\.1](https://arxiv.org/html/2605.07111#S4.SS1.p2.1)\.
- R\. S\. Shuttleworth, J\. Andreas, A\. Torralba, and P\. Sharma \(2025\)LoRA vs full fine\-tuning: an illusion of equivalence\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- S\. Sun, D\. Gupta, and M\. Iyyer \(2023\)Exploring the impact of low\-rank adaptation on the performance, efficiency, and regularization of RLHF\.arXiv preprint arXiv:2309\.09055\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023a\)LLaMA: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023b\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- M\. Valipour, M\. Rezagholizadeh, I\. Kobyzev, and A\. Ghodsi \(2023\)DyLoRA: parameter\-efficient tuning of pre\-trained models using dynamic search\-free low\-rank adaptation\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,pp\. 3274–3287\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- S\. Wang, L\. Yu, and J\. Li \(2024\)LoRA\-GA: low\-rank adaptation with gradient approximation\.Advances in Neural Information Processing Systems37,pp\. 54905–54931\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p1.1)\.
- Y\. Wang, S\. Agarwal, S\. Mukherjee, X\. Liu, J\. Gao, A\. Hassan, and J\. Gao \(2022\)AdaMix: mixture\-of\-adaptations for parameter\-efficient model tuning\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 5744–5760\.Cited by:[§A\.2](https://arxiv.org/html/2605.07111#A1.SS2.SSS0.Px2.p1.5),[§1](https://arxiv.org/html/2605.07111#S1.p2.1),[§1](https://arxiv.org/html/2605.07111#S1.p3.1),[§2](https://arxiv.org/html/2605.07111#S2.p3.1),[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5),[§5\.3](https://arxiv.org/html/2605.07111#S5.SS3.p1.2)\.
- X\. Wu, S\. Huang, and F\. Wei \(2024\)Mixture of LoRA experts\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1)\.
- L\. Xu, H\. Xie, S\. J\. Qin, X\. Tao, and F\. L\. Wang \(2026\)Parameter\-efficient fine\-tuning methods for pretrained language models: a critical review and assessment\.IEEE Transactions on Pattern Analysis and Machine Intelligence\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p1.1)\.
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, et al. (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: [§5.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5).
- T\. Zadouri, A\. Üstün, A\. Ahmadian, B\. Ermis, A\. Locatelli, and S\. Hooker \(2024\)Pushing mixture of experts to the limit: extremely parameter efficient MoE for instruction tuning\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1)\.
- F\. Zhang, L\. Li, J\. Chen, Z\. Jiang, B\. Wang, and Y\. Qian \(2023a\)IncreLoRA: incremental parameter allocation method for parameter\-efficient fine\-tuning\.arXiv preprint arXiv:2308\.12043\.Cited by:[§1](https://arxiv.org/html/2605.07111#S1.p2.1),[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- Q\. Zhang, M\. Chen, A\. Bukharin, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023b\)Adaptive budget allocation for parameter\-efficient fine\-tuning\.InThe Eleventh International Conference on Learning Representations,Cited by:[§A\.2](https://arxiv.org/html/2605.07111#A1.SS2.SSS0.Px2.p1.5),[§1](https://arxiv.org/html/2605.07111#S1.p2.1),[§1](https://arxiv.org/html/2605.07111#S1.p3.1),[§2](https://arxiv.org/html/2605.07111#S2.p2.1),[§5\.1](https://arxiv.org/html/2605.07111#S5.SS1.p1.5),[§5\.3](https://arxiv.org/html/2605.07111#S5.SS3.p1.2)\.
- R\. Zhang, R\. Qiang, S\. A\. Somayajula, and P\. Xie \(2024\)AutoLoRA: automatically tuning matrix ranks in low\-rank adaptation based on meta learning\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 5048–5060\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p2.1)\.
- Y\. Zhu, N\. Wichers, C\. Lin, X\. Wang, T\. Chen, L\. Shu, H\. Lu, C\. Liu, L\. Luo, J\. Chen,et al\.\(2023\)Sira: sparse mixture of low rank adaptation\.arXiv preprint arXiv:2311\.09179\.Cited by:[§2](https://arxiv.org/html/2605.07111#S2.p3.1)\.
## Appendix A Experimental Details
### A.1 Derivation of the Expected Preconditioned Descent (EPD) Score
The Expected Preconditioned Descent (EPD) score $\mathcal{S}_t^{(i)}$ introduced in Equation [4](https://arxiv.org/html/2605.07111#S4.E4) is derived as a first-order proxy for the expected loss reduction per parameter.
By first-order Taylor expansion, the expected change in the loss function $\mathcal{L}$ after an optimization step $\Delta\theta$ is approximated by the inner product of the gradient and the step direction:
$$\Delta\mathcal{L}\approx(\nabla_{\theta}\mathcal{L})^{\top}\Delta\theta.\tag{7}$$
For the AdamW optimizer \(omitting decoupled weight decay for the heuristic\), the parameter update direction is preconditioned by its moving averages:
$$\Delta\theta_t\approx-\eta_t\frac{m_t}{\sqrt{v_t}+\epsilon}.\tag{8}$$
Assuming the gradient landscape is sufficiently smooth such that the current gradient is well-approximated by its first moment ($g_t\approx m_t$), substituting the preconditioned AdamW step from Equation [8](https://arxiv.org/html/2605.07111#A1.E8) into the Taylor expansion in Equation [7](https://arxiv.org/html/2605.07111#A1.E7) yields the expected loss reduction:
$$-\Delta\mathcal{L}\approx g_t^{\top}\left(\eta_t\frac{m_t}{\sqrt{v_t}+\epsilon}\right)\approx\eta_t\sum_{\theta}\frac{m_t^{2}}{\sqrt{v_t}+\epsilon}\tag{9}$$
To ensure a statistically fair, density-based competition between the massive FFT backbone and the lightweight PEFT pathways, we normalize this expected loss reduction by the total parameter count $N_{\text{params}}^{(i)}$ of the respective expert $i$. This yields the final EPD score used for dynamic routing:
$$\mathcal{S}_t^{(i)}=\frac{\eta_t^{(i)}}{N_{\text{params}}^{(i)}}\sum_{\theta\in\Theta_i}\frac{\left(m_t^{(i)}\right)^{2}}{\sqrt{v_t^{(i)}}+\epsilon}\tag{10}$$
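The derivation can be checked numerically on a toy linear loss $\mathcal{L}(\theta)=c^{\top}\theta$ (a hypothetical example chosen because its first-order Taylor expansion is exact): after Adam's moment warmup $m_t\approx g_t$, and the proxy in Equation 9 matches the true loss drop almost exactly.

```python
import numpy as np

# Toy check of Eq. (9): for L(theta) = c . theta the gradient is the
# constant vector c, so Eq. (7) holds with equality and m_t -> c.
rng = np.random.default_rng(0)
c = rng.normal(size=100)                      # constant gradient
m = np.zeros_like(c)
v = np.zeros_like(c)
eta, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

for _ in range(50):                           # Adam moment warmup
    m = b1 * m + (1 - b1) * c
    v = b2 * v + (1 - b2) * c ** 2

step = eta * m / (np.sqrt(v) + eps)           # preconditioned step (Eq. 8)
predicted = eta * np.sum(m ** 2 / (np.sqrt(v) + eps))   # Eq. (9) proxy
actual = c @ step                             # exact loss drop for linear L
assert abs(predicted - actual) / actual < 0.01  # m ≈ g after warmup
```

The residual gap is exactly the bias $1-\beta_1^{t}$ remaining in $m_t$, which vanishes as warmup proceeds.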
### A.2 Baseline Hyperparameter Sweep
To establish rigorous baselines, we conduct a comprehensive hyperparameter sweep for each model architecture \(Qwen2\.5\-1\.5B, Qwen2\.5\-3B, and Gemma\-3\-1B\) across all three benchmark datasets \(SQL, Fact, and Med\)\. We optimize the configurations for both Low\-Rank Adaptation \(LoRA\) and Full Fine\-Tuning \(FFT\) to ensure the best possible performance for our baselines\.
The search spaces for both tuning methods are detailed in Table [4](https://arxiv.org/html/2605.07111#A1.T4). For all configurations, we maintain a fixed warmup ratio of 0.05.
Table 4: Hyperparameter search space for LoRA and Full Fine-Tuning (FFT) baselines.
##### MoLF and MoLF-E hyperparameters.
MoLF and MoLF-E’s per-expert learning rate $\eta_t^{(i)}$ in Equations [4](https://arxiv.org/html/2605.07111#S4.E4) and [5](https://arxiv.org/html/2605.07111#S4.E5) is set to the best learning rate as found in the sweep above. The per-expert decoupled weight decay $\lambda_i$ in Equation [5](https://arxiv.org/html/2605.07111#S4.E5) is $\lambda_{\text{FFT}}=0.1$ for the FFT pathway and $\lambda_{\text{LoRA}}=0.01$ for every LoRA expert. The stability constant $\epsilon$ is set to the HuggingFace AdamW default ($\epsilon=10^{-8}$). For both MoLF and MoLF-E we use $K=1$ for the Top-$K$ routing in Phase 3, so exactly one expert per module receives a physical weight update at each step. The cosine scheduler, linear warmup ratio of 0.05, and AdamW $(\beta_1,\beta_2)=(0.9,0.999)$ are inherited from the baseline sweep.
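For concreteness, these settings can be collected in a single configuration object. The dict below is a hypothetical sketch; the key names are illustrative and not taken from the released code:

```python
# Hypothetical configuration collecting the per-expert optimizer settings
# described above (key names are illustrative, not the authors' code).
molf_optimizer_config = {
    "top_k": 1,                        # one winning expert per module/step
    "adam_betas": (0.9, 0.999),
    "eps": 1e-8,                       # HuggingFace AdamW default
    "scheduler": "cosine",
    "warmup_ratio": 0.05,
    "experts": [
        {"type": "fft",  "weight_decay": 0.10},   # lambda_FFT
        {"type": "lora", "weight_decay": 0.01},   # lambda_LoRA
    ],
}
```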
##### AdaLoRA and AdaMix configuration\.
For AdaLoRA [Zhang et al., [2023b](https://arxiv.org/html/2605.07111#bib.bib52)] we follow the rank-pruning schedule of the original paper and reuse the best LoRA learning rate found in the sweep above ($5\times 10^{-4}$) across all (model, task) configurations, with a single exception: on Gemma-3-1B Fact the learning rate is increased to $5\times 10^{-3}$ to obtain reasonable performance. We additionally tried a learning rate of $5\times 10^{-5}$ at a higher initial rank ($r=128$), but this configuration consistently degraded accuracy and is not reported. For AdaMix [Wang et al., [2022](https://arxiv.org/html/2605.07111#bib.bib61)] we evaluated both the standard hyperparameters recommended by the original paper and a learning rate of $5\times 10^{-5}$; the latter outperformed the former in our setup and is the configuration reported in Section [5.3](https://arxiv.org/html/2605.07111#S5.SS3).
### A.3 Preconditioned Frobenius Norm (PFN) Formulation
In Section [5.4.2](https://arxiv.org/html/2605.07111#S5.SS4.SSS2), we utilize the Preconditioned Frobenius Norm (PFN) as a scale-invariant baseline to evaluate expert selection heuristics. PFN avoids the artificial scaling bias introduced by LoRA by calculating the root-mean-square of the preconditioned Adam update direction:
$$\mathcal{S}_{\text{PFN}}^{(i)}=\frac{1}{\sqrt{N_{\text{params}}^{(i)}}}\sqrt{\sum_{\theta\in\Theta_{i}}\left(\frac{m_{t}^{(i)}}{\sqrt{v_{t}^{(i)}}+\epsilon}\right)^{2}}\qquad(11)$$
Because the first moment ($m_{t}$) and the square root of the second moment ($\sqrt{v_{t}}$) scale identically with the gradient, dividing them naturally cancels out the gradient magnitude and isolates a per-parameter signal-to-noise ratio (close to $\pm 1$ when the gradient direction is consistent across mini-batches and close to $0$ when it is noisy). Consequently, PFN ranks experts based solely on gradient directional consistency, making it a fair baseline metric for routing between heterogeneous experts without unfairly penalizing modules based on parameter count.
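The scale-invariance claim is easy to check numerically. Below is a minimal sketch of Eq. (11), assuming access to an expert's Adam moment buffers; scaling the gradient stream by a constant $c$ scales $m_t$ by $c$ and $v_t$ by $c^2$, so the score is essentially unchanged:

```python
import torch

def pfn_score(exp_avg, exp_avg_sq, eps=1e-8):
    # Root-mean-square of the preconditioned Adam direction (Eq. 11):
    # each entry is a per-parameter signal-to-noise ratio, roughly in [-1, 1].
    snr = exp_avg / (exp_avg_sq.sqrt() + eps)
    return snr.pow(2).sum().sqrt() / exp_avg.numel() ** 0.5

# Illustrative moments; the +0.1 keeps denominators away from zero so the
# epsilon term stays negligible.
m, v = torch.randn(1000), torch.rand(1000) + 0.1
s1 = pfn_score(m, v)
s2 = pfn_score(10 * m, 100 * v)  # gradients scaled by 10 -> v scaled by 100
```

Up to the (tiny) effect of `eps`, `s1` and `s2` agree, which is exactly why PFN does not penalize experts whose raw gradient magnitudes differ.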
### A.4 Evaluation Metrics
We report two evaluation metrics across the three benchmarks. On Med (MedMCQA) and SQL (Gretel synthetic Text-to-SQL) we report standard accuracy: 4-way multiple-choice accuracy on the held-out MedMCQA validation split, and exact-match accuracy on the held-out Text-to-SQL queries. On Fact (CounterFact) we report the Efficacy Score (ES) introduced by Meng et al. [[2022](https://arxiv.org/html/2605.07111#bib.bib76)], which measures the fraction of edits for which the post-edit model assigns higher probability to the new counterfactual target than to the original true object:
$$\text{ES}=\frac{1}{|\mathcal{E}|}\sum_{(\pi,o^{*},o^{c})\in\mathcal{E}}\mathbb{1}\!\left[\,\mathbb{P}_{G^{\prime}}(o^{*}\mid\pi)>\mathbb{P}_{G^{\prime}}(o^{c}\mid\pi)\,\right],\qquad(12)$$

where $G^{\prime}$ is the fine-tuned (edited) model, $\pi$ is the counterfactual prompt, $o^{*}$ is the counterfactual target object, $o^{c}$ is the original true object, $\mathcal{E}$ is the set of edits, and $\mathbb{1}[\cdot]$ is the indicator function. We report ES as a percentage so that all three benchmarks share a common $[0,100]$ scale in our tables and figures.
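Given per-edit probabilities, Eq. (12) reduces to a win rate. A minimal sketch, assuming the edits are already summarized as $(\mathbb{P}_{G'}(o^{*}\mid\pi), \mathbb{P}_{G'}(o^{c}\mid\pi))$ pairs; this input format is our illustration, not the paper's evaluation harness:

```python
def efficacy_score(edits):
    # Fraction of edits where the edited model prefers the counterfactual
    # target o* over the original object o^c (Eq. 12), as a percentage.
    wins = sum(1 for p_new, p_true in edits if p_new > p_true)
    return 100.0 * wins / len(edits)

# Three of four hypothetical edits succeed -> ES = 75.0
es = efficacy_score([(0.6, 0.1), (0.3, 0.5), (0.8, 0.2), (0.4, 0.1)])
```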
## Appendix B Additional Results
### B.1 Behavior of the Router
For each fine-tuning setup, we examine the behavior of the router. The router acts as the score-based selector inside the MoLF optimizer: for each MoLF-wrapped linear module, it scores candidate experts using the EPD score derived from the module's Adam moments. Only the expert with the highest EPD score receives a parameter update at each step. In these experiments, the two candidates per module are the original base weight (FFT) and a rank-128 LoRA expert.
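With $K=1$, the per-module decision is simply an argmax over expert scores. A minimal sketch; the function and the score values are illustrative:

```python
def route_module(scores):
    # Top-1 routing: the expert with the highest EPD score wins the update;
    # all other experts' parameters are left untouched this step.
    return max(scores, key=scores.get)

# Two candidates per MoLF-wrapped module: base weight (FFT) vs. rank-128 LoRA.
chosen = route_module({"fft": 3.2e-6, "lora_r128": 5.1e-6})
```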
Figure 5: Aggregate router decisions over training. Bars represent the percentage of modules selecting FFT (lower/dark) versus LoRA (upper/light) at each optimizer step. The y-axis for each panel is scaled to its operational range to highlight boundary fluctuations. The selected mixture varies with both the task and the model. Med is strongly LoRA-dominated across all scales, but the FFT preference on Fact and SQL inverts between architectures: on Gemma-3-1B, Fact leans heavily toward FFT while SQL is roughly balanced; on the Qwen2.5 models, SQL leans heavily toward FFT while Fact is roughly balanced. Despite this heterogeneity, the router converges to a stable global mixture early in training (typically within 500 steps) and maintains it. We also observe that larger models (Qwen2.5-3B) exhibit a tighter, more compressed envelope of task-specific routing compared to smaller models (Gemma-3-1B).

Figure 6: Per-module router decisions over training. Rows index the modules in parameter order (bottom = earliest, top = latest), columns represent optimizer steps, and pixels encode the winning expert. The pronounced horizontal banding demonstrates that the aggregate mixtures in Figure [5](https://arxiv.org/html/2605.07111#A2.F5) result from persistent per-module specialization, not from uniform step-wise alternation. The empirical row distribution is strongly bimodal: modules typically commit definitively to either FFT or LoRA early in the run. The minor step-wise fluctuations observed in the aggregate view are driven by a small population of swing modules, confirming that the router performs persistent structural assignments rather than continuous resampling.
### B.2 MoLF-E Rank Configuration
We also experimented with varying ranks of the LoRA experts in MoLF-E to isolate the effect of rank capacity on the three fine-tuning regimes. All configurations use two parallel LoRA experts with ranks $r$ and $128$, Top-$K{=}1$ routing, and MoLF-E's frozen base-weight setting. To match the parameter scope of standard PEFT baselines, we additionally freeze the non-linear parameters (embeddings, layer norms, and `lm_head`), leaving only the LoRA experts trainable. Figure [7](https://arxiv.org/html/2605.07111#A2.F7) reports the rank sweep.
The sweep exhibits a regime-dependent rank sensitivity that reinforces the analysis of Section [3](https://arxiv.org/html/2605.07111#S3). On Fact, the Efficacy Score grows substantially with the smaller expert's rank $r$. Specifically, it increases by 14.85% on Gemma-3-1B ($r=16\to 128$), 7.73% on Qwen2.5-1.5B, and 6.88% on Qwen2.5-3B. We note that two effects are entangled in this sweep, beyond raw representational capacity. First, the configuration with $r=16$ underperforms a single static rank-16 LoRA on Gemma-3-1B Fact (Table [1](https://arxiv.org/html/2605.07111#S3.T1)), so the degradation at low $r$ cannot be explained by capacity alone. Second, under the EPD score's $\eta_{t}^{(i)}/\sqrt{r_{i}}$ scaling discussed below Equation [4](https://arxiv.org/html/2605.07111#S4.E4), a smaller LoRA expert paired with a rank-128 partner at the same learning rate receives a score advantage of $\sqrt{128/r}$, so Top-1 routing tends to over-select the smaller expert on capacity-bound domains while the rank-128 partner is rarely updated. The Fact sweep should therefore be read as a joint test of capacity and of the routing's ability to commit to the larger expert when capacity is the binding constraint; the two factors will likely need to be disentangled in future work via rank-aware learning rates or a routing prior. On Med, accuracy increases mildly with rank and saturates near $r{=}64$ (Gemma-3-1B peaks at 45.04% and then slightly declines), consistent with the spectral-regularization interpretation: once the principal components of the task gradient are captured, additional rank contributes no further benefit. On SQL, the rank sweep is essentially flat on both Qwen2.5 models ($<1.0\%$ spread) and only mildly increasing on Gemma-3-1B (3.28% from $r{=}16$ to $r{=}128$), reflecting the concentrated singular spectrum of text-to-SQL.
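The routing bias discussed above is a pure function of the rank ratio. A minimal sketch under the assumption of equal learning rates for both experts; the function name is our own:

```python
import math

def rank_score_advantage(r_small, r_large=128):
    # Relative EPD-score inflation of the smaller LoRA expert under the
    # eta / sqrt(r) scaling, assuming both experts share a learning rate.
    return math.sqrt(r_large / r_small)

# A rank-16 expert paired with a rank-128 partner gets a
# sqrt(128/16) = sqrt(8) ~ 2.83x score advantage, so Top-1 routing
# tends to over-select it on capacity-bound tasks.
adv = rank_score_advantage(16)
```

This is why the low-$r$ Fact results conflate capacity with routing commitment: the larger expert is structurally disadvantaged before any gradient signal is considered.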
Figure 7: Rank ablation of MoLF-E: performance versus the first LoRA expert's rank $r$, with the second expert fixed at rank 128 and Top-$K{=}1$ routing. Each panel plots one task across all three models; the y-axis is Efficacy Score (%) on Fact and accuracy (%) on Med and SQL.