The Hidden Power of Scaling Factor in LoRA Optimization

arXiv cs.AI 06/12/26, 04:00 AM Papers
lora scaling-factor optimization fine-tuning parameter-efficient low-rank-adaptation deep-learning
Summary
This paper reveals that the scaling factor α in LoRA optimization is more influential than the learning rate, and proposes LoRA-α, a framework that improves performance and simplifies hyperparameter search by restoring α to its principled regime.
arXiv:2606.12883v1 Announce Type: new
# The Hidden Power of Scaling Factor in LoRA Optimization
Source: [https://arxiv.org/html/2606.12883](https://arxiv.org/html/2606.12883)
Zicheng Zhang¹, Haoran Li², Jiaxing Wang¹, Guoqiang Gong¹, Anqi Li³, Yudong Hu¹, Ting Xiong¹, Yurong Gao⁴, Junxing Hu¹, Zhida Jiang¹, Yifeng Zhang¹, Pengzhang Liu¹, Qixia Jiang¹

¹JD; ²School of Mathematical Sciences, UCAS; ³School of Mathematical Sciences, NKU; ⁴School of Advanced Interdisciplinary Sciences, UCAS

###### Abstract

In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $\alpha$ and the learning rate function differently, with $\alpha$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings on LoRA's scaling mechanism: First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $\alpha$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$\alpha$, a minimalist framework that restores $\alpha$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$\alpha$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.

## 1 Introduction

The rapid growth of large pretrained models OpenAI Team \[[2020](https://arxiv.org/html/2606.12883#bib.bib3), [2023](https://arxiv.org/html/2606.12883#bib.bib4)\]; Meta Team \[[2023](https://arxiv.org/html/2606.12883#bib.bib54)\]; Qwen Team \[[2023](https://arxiv.org/html/2606.12883#bib.bib5)\]; DeepSeek Team \[[2025](https://arxiv.org/html/2606.12883#bib.bib8)\] has made efficient adaptation a central challenge, driving the development of Parameter-Efficient Fine-Tuning (PEFT) methods Mangrulkar et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib50)\]; Lester et al. \[[2021](https://arxiv.org/html/2606.12883#bib.bib11)\]; He et al. \[[2022](https://arxiv.org/html/2606.12883#bib.bib12)\]; Edalati et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib13)\]; Zhang et al. \[[2025a](https://arxiv.org/html/2606.12883#bib.bib14)\]. Among these, Low-Rank Adaptation (LoRA) Hu et al. \[[2022](https://arxiv.org/html/2606.12883#bib.bib10)\] has emerged as a dominant approach due to its efficiency and stability. LoRA parameterizes weight updates as $\Delta W=\frac{\alpha}{r}BA$, where low-rank factors $B$ and $A$ approximate the update with rank $r$ and scaling factor $\alpha$. This simple formulation, reinforced by framework support Mangrulkar et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib50)\], has enabled widespread adoption from Natural Language Processing (NLP) Liu et al. \[[2022](https://arxiv.org/html/2606.12883#bib.bib18)\]; Ding et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib19)\]; Zhao et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib22)\] to multimodal generation Guo et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib24)\]; Blattmann et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib25)\]; Ruiz et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib26)\].

Despite its conceptual simplicity, the optimization behavior of LoRA remains poorly understood due to its inherent bilinear architecture and the complex interplay of hyperparameters. The scaling factor $\alpha$, originating from feature learning principles Yang and Hu \[[2021](https://arxiv.org/html/2606.12883#bib.bib99)\], aims to decouple the optimal hyperparameters from the rank selection, thereby streamlining hyperparameter search. While research has explored initialization Wang et al. \[[2024b](https://arxiv.org/html/2606.12883#bib.bib47)\]; Meng et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib43)\]; Zhang et al. \[[2025b](https://arxiv.org/html/2606.12883#bib.bib92)\] and learning rates Biderman et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib17)\]; Schulman and Lab \[[2025](https://arxiv.org/html/2606.12883#bib.bib91)\]; Chen et al. \[[2026](https://arxiv.org/html/2606.12883#bib.bib125)\], the scaling factor $\alpha$ remains systematically underexplored, typically tethered to simplistic rank-based heuristics, such as $\alpha=r$ Hu et al. \[[2022](https://arxiv.org/html/2606.12883#bib.bib10)\] or $2r$ Biderman et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib17)\]. Consequently, $\alpha$ is often regarded as a secondary alternative to the learning rate for scaling updates, obscuring its pivotal role in reducing hyperparameter search and, more fundamentally, in shaping the underlying optimization regime.

In this work, we analyze the scaling mechanism of LoRA through a joint empirical and theoretical study. Across extensive hyperparameter sweeps, we consistently observe that effective optimization hinges more on a sufficiently large scaling factor $\alpha$ than on elevating the learning rate $\eta$. To understand this phenomenon, we develop a Signal-Drift framework that delineates LoRA's optimization characteristics. In this view, the task-relevant signal and the bilinear-induced drift respond differently to $\alpha$ and $\eta$, providing a principled explanation for our empirical observations. Together, this empirical and theoretical evidence leads to the following three key findings:

- Spectral Suppression Causes a Scaling Misalignment. We find that LoRA's low-rank parameterization induces a spectral suppression of the task Hessian, effectively smoothing the optimization landscape. While this improves stability, it also renders standard hyperparameters overly conservative, leading to a significant optimization gap and motivating the aggressive scaling practices observed in prior work Hayou et al. \[[2024a](https://arxiv.org/html/2606.12883#bib.bib39)\]; Schulman and Lab \[[2025](https://arxiv.org/html/2606.12883#bib.bib91)\]; Zhang et al. \[[2025b](https://arxiv.org/html/2606.12883#bib.bib92)\].
- $\alpha$ and $\eta$ Play Fundamentally Different Roles. Empirically, we observe that increasing $\alpha$ consistently yields better convergence than increasing $\eta$. Our framework explains this by showing that $\alpha$ amplifies the task-aligned signal, whereas $\eta$ amplifies both the signal and bilinear drift. As a result, $\alpha$ acts as a purity-preserving accelerator, enabling faster and more stable optimization.
- Optimal Scaling Follows a Sublinear Law. We identify a sublinear relationship between the optimal $\alpha$ and rank $r$, concisely characterized by a square-root law with a large scaling coefficient. This reveals that commonly used rank-tied heuristics Hu et al. \[[2022](https://arxiv.org/html/2606.12883#bib.bib10)\]; Biderman et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib17)\] operate in a severely under-scaled regime. With proper scaling, LoRA can directly adopt the standard small learning rates used in Full Fine-Tuning (FFT), while achieving superior performance.

Based on both empirical and theoretical findings, we identify a key limitation in prevailing LoRA practices: commonly used heuristics restrict $\alpha$ to insufficient magnitudes, limiting LoRA's optimization capacity. To address this, we propose LoRA-$\alpha$, which introduces a large base value with a square-root scaling law to restore $\alpha$ to a principled regime. This formulation enables practitioners to bypass costly hyperparameter tuning and directly adopt standard FFT learning rates. Extensive experiments across model scales (184M–12B), task domains (natural language, reasoning, multimodal), and training paradigms (supervised, contrastive, and reinforcement learning) show that larger $\alpha$ with standard small $\eta$ consistently yields superior performance. In summary, our contributions include:

- Empirically, we establish three key findings regarding LoRA's optimization through extensive hyperparameter sweeps. We uncover that performance hinges critically on a large scaling factor, exposing a pitfall in current rank-tied heuristics that leave LoRA's fitting potential underutilized.
- Theoretically, we develop a Signal-Drift framework to provide a principled explanation of these findings. By characterizing Hessian spectral suppression, we identify $\alpha$ as a better accelerator and derive a square-root law that aligns LoRA's optimization regime with that of FFT.
- We propose LoRA-$\alpha$, a minimalist framework that elevates the scaling factor to its principled regime while adopting standard FFT learning rates. Across diverse models, tasks, and training paradigms, it significantly improves over LoRA, often reaching performance comparable to FFT.

![Refer to caption](https://arxiv.org/html/2606.12883v1/x1.png)

Figure 1: Hyperparameter analysis of Llama 3-1B on the Tulu 3 dataset. (a) Evaluation loss as a function of the learning rate $\eta$ across different ranks $r$ and scaling factors $\alpha$. Gray lines denote linear fits of the minimum loss for each $(r,\alpha)$. Increasing $\alpha$ lowers the optimal loss and shifts $\eta^{*}$ downward. (b) Scaling paths defined in Eq. ([2](https://arxiv.org/html/2606.12883#S2.E2)) relative to the baseline ($\eta_{\text{FFT}}=2\times 10^{-5}$, $\alpha_{0}=16$), where $\eta_{\text{FFT}}$ denotes the optimal FFT learning rate. Varying $\alpha$ reaches lower-loss regimes that are inaccessible via $\eta$-tuning alone. (c) Optimal scaling factor $\alpha^{*}$ as a function of rank $r$, evaluated at $\eta_{\text{FFT}}$. The observed sublinear trend and large magnitude of $\alpha$ challenge conventional scaling heuristics.

## 2 An Empirical Study of LoRA Scaling

To study the scaling behavior of LoRA, we perform systematic hyperparameter sweeps to characterize the interplay between the scaling factor $\alpha$, learning rate $\eta$, and rank $r$. We focus on fitting behavior, isolating optimization dynamics from generalization effects such as implicit low-rank regularization Jang et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib81)\] and task diversity. This enables a clean analysis of the optimization landscape, while generalization is evaluated separately in Section [5](https://arxiv.org/html/2606.12883#S5).

Optimization Setup and Metrics. Following Schulman and Lab \[[2025](https://arxiv.org/html/2606.12883#bib.bib91)\], we conduct Supervised Fine-Tuning (SFT) on two model scales (Llama 3-1B and 8B Meta Team \[[2024](https://arxiv.org/html/2606.12883#bib.bib100)\]) and datasets (Tulu 3 Lambert et al. \[[2025](https://arxiv.org/html/2606.12883#bib.bib101)\] and OpenThoughts Guha et al. \[[2026](https://arxiv.org/html/2606.12883#bib.bib128)\]), training each configuration for one epoch with a batch size of 32 using AdamW. For clarity, we report results on Llama 3-1B with Tulu 3, deferring the rest to Appendix [D](https://arxiv.org/html/2606.12883#A4). To manage the large search space over $\alpha$, $\eta$, and $r$, we use a 100k-example subset of Tulu 3 with a maximum token length $T=1024$, and reserve 10k examples as a proxy evaluation set $\mathcal{D}_{prox}$. Let $\theta(r,\eta,\alpha)$ denote the trained parameters under a given configuration. We evaluate fitting via the Expected Negative Log-Likelihood (NLL):

$$\mathcal{L}(r,\eta,\alpha)=\mathbb{E}_{x\sim\mathcal{D}_{prox}}\left[-\sum_{t=1}^{T}\log P(x_{t}\mid x_{<t};\theta(r,\eta,\alpha))\right]. \tag{1}$$

Interplay between $\alpha$ and $\eta$. Let $\eta^{*}(r,\alpha)=\arg\min_{\eta}\mathcal{L}(r,\eta,\alpha)$ denote the optimal learning rate and $\mathcal{L}^{*}(r,\alpha)=\mathcal{L}(r,\eta^{*},\alpha)$ the corresponding minimum loss. As illustrated in Fig. [1](https://arxiv.org/html/2606.12883#S1.F1)(a), we confirm the observation of Schulman and Lab \[[2025](https://arxiv.org/html/2606.12883#bib.bib91)\] that LoRA requires roughly $10\times$ larger learning rates than FFT at small $\alpha$ (e.g., $\alpha=16$). Nonetheless, our extended sweeps reveal a clear coupling: as $\alpha$ increases, $\eta^{*}$ consistently decreases, while the optimal loss $\mathcal{L}^{*}$ improves monotonically. This gain becomes more pronounced at higher ranks (gray lines in Fig. [1](https://arxiv.org/html/2606.12883#S1.F1)(a)).

Finding I: The need for large learning rates in LoRA is a symptom of inadequate $\alpha$-scaling, rather than an intrinsic property.

Superiority of $\alpha$-Scaling over $\eta$-Scaling. To disentangle the roles of the learning rate and scaling factor, we trace two independent scaling paths from the same baseline:

$$\mathcal{L}_{\eta}(m;r,\eta_{\text{FFT}},\alpha_{0})=\mathcal{L}(r,m\eta_{\text{FFT}},\alpha_{0}),\quad\mathcal{L}_{\alpha}(m;r,\eta_{\text{FFT}},\alpha_{0})=\mathcal{L}(r,\eta_{\text{FFT}},m\alpha_{0}), \tag{2}$$

where $\eta_{\text{FFT}}=2\times 10^{-5}$ is the optimal FFT learning rate, and $\alpha_{0}=16$. As illustrated in Fig. [1](https://arxiv.org/html/2606.12883#S1.F1)(b), although both trajectories exhibit U-shaped loss surfaces, they reveal a clear asymmetry. First, $\alpha$-scaling accommodates a substantially larger optimal multiplier, indicating a more stable optimization regime. Second, while $\eta$-scaling prematurely plateaus, $\alpha$-scaling consistently converges to deeper loss minima, corroborating the broad trends in Fig. [1](https://arxiv.org/html/2606.12883#S1.F1)(a).

Finding II: $\alpha$ is not merely an alternative to $\eta$ for scaling step sizes; instead, it reshapes the optimization landscape to facilitate deeper fitting.
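The two scaling paths of Eq. (2) can be enumerated as sweep configurations. The following is a minimal sketch, not the authors' sweep code: the `LoraRun` container, the `scaling_paths` helper, and the multiplier grid are hypothetical names and values chosen for illustration, while the baseline $\eta_{\text{FFT}}=2\times 10^{-5}$ and $\alpha_{0}=16$ come from the text.

```python
from typing import List, NamedTuple, Sequence, Tuple


class LoraRun(NamedTuple):
    """One training configuration (hypothetical container for this sketch)."""
    rank: int
    lr: float
    alpha: float


def scaling_paths(
    rank: int,
    multipliers: Sequence[float],
    eta_fft: float = 2e-5,   # baseline FFT learning rate from the text
    alpha0: float = 16.0,    # baseline scaling factor from the text
) -> Tuple[List[LoraRun], List[LoraRun]]:
    """Return the eta-path and alpha-path of Eq. (2) for multipliers m."""
    # eta-path: scale the learning rate, keep alpha fixed at alpha0.
    eta_path = [LoraRun(rank, m * eta_fft, alpha0) for m in multipliers]
    # alpha-path: scale alpha, keep the learning rate fixed at eta_fft.
    alpha_path = [LoraRun(rank, eta_fft, m * alpha0) for m in multipliers]
    return eta_path, alpha_path


# Illustrative multiplier grid (an assumption, not the paper's grid).
eta_path, alpha_path = scaling_paths(rank=16, multipliers=[1, 4, 16, 64])
for run in alpha_path:
    print(run)
```

Both paths start from the same configuration at $m=1$, so any difference in loss at larger $m$ is attributable to which knob was scaled.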

Scaling Law between $\alpha$ and $r$. We examine the relationship between the rank and the optimal scaling factor, defined as $\alpha^{*}(r)=\arg\min_{\alpha}\mathcal{L}(r,\eta_{\text{FFT}},\alpha)$. Fig. [1](https://arxiv.org/html/2606.12883#S1.F1)(c) shows that $\alpha^{*}$ follows a sublinear trend:

$$\alpha^{*}(r)\approx C\sqrt{r},\quad C\gg 1, \tag{3}$$

with $C$ obtained and rounded via log-scale fitting (estimated via linear regression in log space, i.e., $\log\alpha=0.5\log r+\log C$). Across all four diverse optimization settings, the optimal $\alpha$ consistently lies in a high-magnitude regime ($\alpha\geq 256$) under typical FFT learning rates. This indicates that conventional choices are not only misaligned in functional form but also severely under-scaled in magnitude.

Finding III: Conventional scaling is mis-scaled, both in its linear dependence on $r$ and its insufficient magnitude, thereby limiting LoRA's fitting effectiveness.
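The log-space fit described above can be sketched in a few lines. With the slope fixed at $0.5$, the least-squares estimate of $\log C$ is simply the mean of $\log\alpha - 0.5\log r$. The helper name `fit_sqrt_law` is hypothetical, and the data points below are synthetic values constructed to lie exactly on $\alpha = 256\sqrt{r}$; they are not measurements from the paper.

```python
import math
from typing import Sequence


def fit_sqrt_law(ranks: Sequence[float], alphas: Sequence[float],
                 slope: float = 0.5) -> float:
    """Fit log(alpha) = slope * log(r) + log(C) with the slope held fixed.

    With a fixed slope, the least-squares intercept is the mean residual
    log(alpha_i) - slope * log(r_i); the function returns C = exp(intercept).
    """
    residuals = [math.log(a) - slope * math.log(r)
                 for r, a in zip(ranks, alphas)]
    return math.exp(sum(residuals) / len(residuals))


# Synthetic optima placed exactly on alpha = 256 * sqrt(r) (illustrative only).
ranks = [4, 16, 64, 256]
alphas = [512.0, 1024.0, 2048.0, 4096.0]

C = fit_sqrt_law(ranks, alphas)
print(f"fitted C = {C:.1f}")
```

On real sweep results the points would scatter around the line, and one might also fit the slope freely to check how close it is to $0.5$.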

Relationship to Prior Works. Our findings unify and extend prior works on LoRA scaling. The original formulation of Hu et al. \[[2022](https://arxiv.org/html/2606.12883#bib.bib10)\] adopts $\alpha=r$ as a simplifying default, while Biderman et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib17)\] observe that increasing $\alpha$ improves performance, albeit only exploring up to $\alpha=2r$. Meanwhile, Kalajdzievski \[[2023](https://arxiv.org/html/2606.12883#bib.bib38)\] derives the mild scaling ($1/\sqrt{r}$) for stability, without identifying the appropriate magnitude. In addition, Biderman et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib17)\]; Schulman and Lab \[[2025](https://arxiv.org/html/2606.12883#bib.bib91)\] report that LoRA requires learning rates $10\times$ larger than FFT. Our results both confirm this observation and attribute it to insufficient scaling of $\alpha$. By determining the concrete scaling law, our work consolidates these perspectives into a unified principle, enabling superior optimization with standard learning rates and substantially reducing the need for hyperparameter search.

## 3 A Theoretical Analysis from the Signal-Drift Perspective

In this section, we formally characterize how LoRA reshapes the optimization regime to ground our empirical observations. Our primary analytical approach is to introduce the Signal-Drift decomposition, which isolates the structural drift inherently induced by the adapter's bilinear parameterization.

Notations. Consider a pre-trained weight $W_{0}\in\mathbb{R}^{d_{out}\times d_{in}}$ and trainable adapters $\theta=\{A,B\}$, where $B\in\mathbb{R}^{d_{out}\times r}$ and $A\in\mathbb{R}^{r\times d_{in}}$. The LoRA mapping is defined as $W(\bm{\theta})=W_{0}+\frac{\alpha}{r}BA$. Let $\bm{w}(\bm{\theta})=\operatorname{vec}(W(\bm{\theta}))\in\mathbb{R}^{D}$ denote the vectorized weight, and $\bm{\theta}=[\operatorname{vec}(A);\operatorname{vec}(B)]\in\mathbb{R}^{p}$ the concatenated parameters, where $D=d_{out}d_{in}$ and $p=r(d_{in}+d_{out})$. For the objective $\ell(\bm{w})$, we define the task gradient $\bm{g}=\nabla_{\bm{w}}\ell\in\mathbb{R}^{D}$ and the task Hessian $\mathbf{H}_{\ell}=\nabla^{2}_{\bm{w}}\ell\in\mathbb{R}^{D\times D}$, with $g_{k}$ and $w_{k}$ denoting the $k$-th scalar entries of $\bm{g}$ and $\bm{w}$. The mapping geometry is governed by the Jacobian $J(\bm{\theta})=\nabla_{\bm{\theta}}\bm{w}\in\mathbb{R}^{D\times p}$ and the structural Hessian $\nabla^{2}_{\bm{\theta}}\bm{w}\in\mathbb{R}^{D\times p\times p}$, interpreted as a bilinear operator: for any $\bm{v}\in\mathbb{R}^{p}$, $\nabla^{2}_{\bm{\theta}}\bm{w}[\bm{v},\bm{v}]\in\mathbb{R}^{D}$ has entries $\bm{v}^{\top}(\nabla^{2}_{\bm{\theta}}w_{k})\bm{v}$.

### 3.1 Decomposing the Optimization Dynamics

By parameterizing the objective as $\ell(\bm{w}(\bm{\theta}))$, LoRA optimization inevitably inherits artifacts from its bilinear architecture. Specifically, the dynamics couple with the mapping's intrinsic structure $\nabla^{2}_{\bm{\theta}}\bm{w}$, prompting us to decouple the task-aligned projection signal from this task-agnostic structure drift.

![Refer to caption](https://arxiv.org/html/2606.12883v1/x2.png)

Figure 2: The asymmetric optimization dynamics of $\alpha$-scaling versus $\eta$-scaling from a base configuration of rank $r=16$, $\alpha_{0}=1$, and $\eta_{0}=10^{-4}$. We compare increasing $\eta$ (warm colors) versus increasing $\alpha$ (cool colors). (a) $\eta$-scaling leads to early saturation and instability, while $\alpha$-scaling enables smooth acceleration and deeper fitting. (b) Increasing $\eta$ amplifies the pronounced stochasticity of structural drift, whereas $\alpha$-scaling does not further amplify these fluctuations, maintaining the smoother optimization profile seen in (a). (c) $\alpha$-scaling consistently maintains a higher Signal-to-Drift ratio, preserving update purity and accelerating convergence.

###### Proposition 1 (Signal-Drift Decomposition).

Let $\Delta\bm{w}_{\text{LoRA}}=\bm{w}(\bm{\theta}+\Delta\bm{\theta})-\bm{w}(\bm{\theta})$ denote the effective step in the weight space, and $\mathcal{H}_{\text{LoRA}}=\nabla^{2}_{\bm{\theta}}\ell$ denote the parameter-space Hessian. Under the LoRA parameterization, both admit an exact decomposition into a task-aligned signal and a structural drift:

$$\Delta\bm{w}_{\text{LoRA}}=\underbrace{J(\bm{\theta})\Delta\bm{\theta}}_{\Delta\bm{w}_{\text{Signal}}}+\underbrace{\tfrac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]}_{\Delta\bm{w}_{\text{Drift}}},\quad\mathcal{H}_{\text{LoRA}}=\underbrace{J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta})}_{\mathcal{H}_{\text{Signal}}}+\underbrace{\sum_{k=1}^{D}g_{k}\nabla^{2}_{\bm{\theta}}w_{k}}_{\mathcal{H}_{\text{Drift}}}. \tag{4}$$

Algebraically, the components take the form $\Delta\bm{w}_{\text{Signal}}=\operatorname{vec}\big(\frac{\alpha}{r}(B\Delta A+\Delta BA)\big)$ and the bilinear drift $\Delta\bm{w}_{\text{Drift}}=\operatorname{vec}\big(\frac{\alpha}{r}\Delta B\Delta A\big)$, where $\Delta A$ and $\Delta B$ denote the updates within $\Delta\bm{\theta}$.
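The exactness of the weight-space decomposition can be checked numerically in matrix form. The sketch below assumes NumPy and uses arbitrary small dimensions and random factors chosen for illustration; the frozen weight $W_0$ cancels in the difference and is therefore omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 5, 3, 32.0   # illustrative sizes, not from the paper
scale = alpha / r

# Current adapter factors and an arbitrary parameter update.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
dB = 1e-2 * rng.normal(size=(d_out, r))
dA = 1e-2 * rng.normal(size=(r, d_in))

# Effective weight-space step: W(theta + dtheta) - W(theta); W0 cancels.
delta_w = scale * ((B + dB) @ (A + dA) - B @ A)

# First-order (signal) and second-order (drift) parts from Proposition 1.
signal = scale * (B @ dA + dB @ A)
drift = scale * (dB @ dA)

print("max reconstruction error:", np.abs(delta_w - (signal + drift)).max())
```

Because the mapping is bilinear in $(A, B)$, the expansion terminates at second order, which is why the decomposition is exact rather than a Taylor approximation.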

###### Proposition 2 (Geometric Properties of Signal and Drift).

For any gradient-based update step $\Delta\bm{\theta}=-\eta\nabla_{\bm{\theta}}\ell$ (this conceptually extends to adaptive optimizers, e.g., Adam and Muon, operating via preconditioned gradients), the decomposed components in Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1) satisfy:

(i) Constructive Signal: The signal component aligns with the descent direction, $\langle\Delta\bm{w}_{\text{Signal}},-\bm{g}\rangle\geq 0$, and preserves the local convexity of the loss landscape, i.e., $\mathcal{H}_{\text{Signal}}\succeq 0$ given $\mathbf{H}_{\ell}\succeq 0$.

(ii) Adversarial Drift: The structural Hessian $\mathcal{H}_{\text{Drift}}$ is strictly indefinite, and its induced update $\Delta\bm{w}_{\text{Drift}}$ exerts an uncontrolled force with no guaranteed alignment with $-\bm{g}$.

The signal guarantees descent aligned with the task objective, whereas the drift introduces task-agnostic instability from the bilinear parameterization. Detailed proofs are deferred to Appendix [G](https://arxiv.org/html/2606.12883#A7).
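Part (i) of Proposition 2 can be illustrated for a plain gradient step. By the chain rule, $\nabla_{B}\ell=\frac{\alpha}{r}GA^{\top}$ and $\nabla_{A}\ell=\frac{\alpha}{r}B^{\top}G$, where $G$ is the task gradient reshaped to the weight's shape, so the signal's inner product with $-G$ equals $\eta\,(\|\nabla_{A}\ell\|^{2}+\|\nabla_{B}\ell\|^{2})\geq 0$. The sketch below assumes NumPy and uses a random matrix as a stand-in for $G$; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha, eta = 6, 5, 3, 32.0, 1e-3   # illustrative values
scale = alpha / r

B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
G = rng.normal(size=(d_out, d_in))   # random stand-in for the task gradient dl/dW

# Chain rule through W = W0 + (alpha / r) * B @ A.
grad_B = scale * G @ A.T
grad_A = scale * B.T @ G

# Plain gradient step on the adapter parameters.
dB, dA = -eta * grad_B, -eta * grad_A

signal = scale * (B @ dA + dB @ A)
drift = scale * (dB @ dA)

# Alignment with the descent direction -G (Frobenius inner product).
signal_alignment = float(np.sum(signal * (-G)))
drift_alignment = float(np.sum(drift * (-G)))   # sign is not guaranteed
closed_form = eta * float(np.sum(grad_A**2) + np.sum(grad_B**2))

print("signal alignment:", signal_alignment, "closed form:", closed_form)
print("drift alignment :", drift_alignment)
```

The drift alignment is printed only for inspection: unlike the signal term, nothing constrains its sign.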

### 3.2 Theoretical Justifications for Empirical Findings

Based on the Signal-Drift decomposition, we theoretically justify the empirical phenomena observed in Section [2](https://arxiv.org/html/2606.12883#S2). For intuition, we utilize a simple MLP for visualization, circumventing the intractable $\mathcal{O}(D^{2})$ complexity of tracking $\mathbf{H}_{\ell}$ and $\nabla^{2}_{\bm{\theta}}\bm{w}$ in large models. See Appendix [E](https://arxiv.org/html/2606.12883#A5) for details.

Justification for Finding I: Spectral Suppression Enables Aggressive Hyperparameters. Empirically, LoRA tolerates much larger learning rates than FFT without diverging. This enhanced stability stems from the inherent spectral suppression of the LoRA optimization landscape.

###### Proposition 3 (Spectral Suppression).

Given the initial state of LoRA where $B=0$ and $A_{i,j}\sim\mathcal{N}(0,\sigma^{2}_{A})$, the expected signal curvature satisfies $\mathbb{E}[\operatorname{Tr}(\mathcal{H}_{\text{Signal}})]=\alpha^{2}\rho\operatorname{Tr}(\mathbf{H}_{\ell})$, where $\rho=\sigma^{2}_{A}/r$.

Standard initialization such as $\sigma_{A}^{2}=1/d_{in}$ yields $\rho\ll 1$, substantially compressing the maximum eigenvalue $\lambda_{\max}[\mathcal{H}_{\text{Signal}}]$ relative to that of $\mathbf{H}_{\ell}$. According to optimization theory Nesterov \[[2013](https://arxiv.org/html/2606.12883#bib.bib103)\], this reduced maximum eigenvalue lowers the gradient Lipschitz constant, thereby expanding the stable learning-rate bound $\eta\leq 2/\lambda_{\max}[\mathcal{H}_{\text{Signal}}]$. Consequently, spectral suppression smooths the landscape, rendering hyperparameters from FFT overly conservative and suboptimal for LoRA. This behavior is derived in Appendix Lemma [1](https://arxiv.org/html/2606.12883#Thmlemma1) and further evidenced by spectral analysis in Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(a-b).
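The trace identity in Proposition 3 can be sanity-checked by Monte Carlo on a toy problem. The sketch below assumes NumPy, a row-major `vec` convention (the identity does not depend on this choice as long as it is used consistently), and a random positive semidefinite matrix as a stand-in for the task Hessian; sizes, $\sigma_A$, and the sample count are illustrative. At $B=0$ only the $B$-block of the Jacobian is nonzero, and for row-major `vec` it equals $\frac{\alpha}{r}(I_{d_{out}}\otimes A^{\top})$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 3, 4, 2, 8.0   # toy sizes, illustrative only
sigma_A = 0.5
D = d_out * d_in

# Random PSD stand-in for the task Hessian over row-major vec(W).
M = rng.normal(size=(D, D))
H = M @ M.T


def signal_trace(A: np.ndarray) -> float:
    """Tr(J^T H J) at B = 0, where only dW/dB is nonzero."""
    # Row-major vec: d vec(W) / d vec(B) = (alpha / r) * kron(I, A^T).
    J_B = (alpha / r) * np.kron(np.eye(d_out), A.T)
    return float(np.trace(J_B.T @ H @ J_B))


samples = [signal_trace(sigma_A * rng.normal(size=(r, d_in)))
           for _ in range(5000)]
estimate = float(np.mean(samples))

rho = sigma_A**2 / r
predicted = alpha**2 * rho * float(np.trace(H))

print(f"Monte Carlo: {estimate:.2f}   alpha^2 * rho * Tr(H): {predicted:.2f}")
```

The estimate agrees with the prediction only up to sampling noise, so the comparison is a consistency check rather than a proof.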

Justification for Finding II: The Asymmetric Roles of Scaling Factor and Learning Rate. To explain why $\alpha$-scaling achieves superior fitting capacity compared to $\eta$-scaling, our decomposition reveals a fundamental asymmetry between the two hyperparameters.

###### Proposition 4 (Asymmetric Scaling).

The components of the landscape and the weight update scale asymmetrically with respect to $\alpha$ and $\eta$. For the landscape: $\mathcal{H}_{\text{Signal}}=\Theta(\alpha^{2})$ while $\mathcal{H}_{\text{Drift}}=\Theta(\alpha)$. For the update under adaptive optimizers like Adam: $\Delta\bm{w}_{\text{Signal}}=\Theta(\alpha\eta)$ while $\Delta\bm{w}_{\text{Drift}}=\Theta(\alpha\eta^{2})$.

This asymmetry dictates their distinct optimization roles. Specifically, increasing $\eta$ strictly degrades the Signal-to-Drift Ratio, defined as $\|\Delta\bm{w}_{\text{Signal}}\|/\|\Delta\bm{w}_{\text{Drift}}\|=\Theta(1/\eta)$. This degradation manifests empirically as the drift-induced volatility and premature saturation observed in Fig. [2](https://arxiv.org/html/2606.12883#S3.F2). Conversely, $\alpha$-scaling enables spectral purification. As shown in Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(c-d), increasing $\alpha$ reshapes the LoRA landscape $\mathcal{H}_{\text{LoRA}}$ to recover the spectral profile of the signal term.
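The $\Theta(1/\eta)$ behavior of the Signal-to-Drift Ratio can be illustrated on a single step. The sketch below assumes NumPy and uses sign descent as a crude stand-in for Adam's normalized step (an assumption of this example, not the paper's optimizer), so the parameter update has magnitude $\eta$ regardless of $\alpha$. The function name, sizes, and random stand-in gradient are hypothetical.

```python
import numpy as np


def signal_drift_ratio(alpha: float, eta: float, seed: int = 0) -> float:
    """One-step ||signal|| / ||drift|| under a sign-descent update."""
    rng = np.random.default_rng(seed)
    d_out, d_in, r = 6, 5, 3           # illustrative sizes
    scale = alpha / r
    B = rng.normal(size=(d_out, r))    # a mid-training state with B != 0
    A = rng.normal(size=(r, d_in))
    G = rng.normal(size=(d_out, d_in)) # random stand-in for the task gradient

    # Sign descent: each parameter moves by exactly eta, as a rough proxy
    # for the scale-invariant step of adaptive optimizers.
    dB = -eta * np.sign(scale * G @ A.T)
    dA = -eta * np.sign(scale * B.T @ G)

    signal = scale * (B @ dA + dB @ A)
    drift = scale * (dB @ dA)
    return float(np.linalg.norm(signal) / np.linalg.norm(drift))


base = signal_drift_ratio(alpha=16.0, eta=1e-4)
more_alpha = signal_drift_ratio(alpha=160.0, eta=1e-4)   # 10x alpha
more_eta = signal_drift_ratio(alpha=16.0, eta=1e-3)      # 10x eta

print(f"base={base:.3e}  10x alpha={more_alpha:.3e}  10x eta={more_eta:.3e}")
```

In this toy setting, multiplying $\alpha$ by 10 leaves the ratio unchanged, while multiplying $\eta$ by 10 reduces it tenfold, matching the scalings stated in Proposition 4.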

Justification for Finding III: Conventional Scaling is Fundamentally Mis-scaled. Empirically, the optimal $\alpha$ is exceptionally large and scales sublinearly with rank $r$. This follows from Proposition [3](https://arxiv.org/html/2606.12883#Thmproposition3): to counteract spectral suppression and restore expressiveness, the adapter's expected signal curvature must match the FFT curvature scale.

###### Corollary 1 (Initial Curvature Alignment).

To match the signal curvature to the full task curvature at initialization, i.e., $\mathbb{E}[\operatorname{Tr}(\mathcal{H}_{\text{Signal}})]=\Theta\big(\operatorname{Tr}(\mathbf{H}_{\ell})\big)$, the scaling factor follows $\alpha=\Theta\big(\sqrt{r/\sigma_{A}^{2}}\big)$.

This derivation shows that conventional linear scaling is mis-specified. Instead, $\alpha$ should scale sublinearly with rank and grow substantially with model width under standard initialization.
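For concreteness, the step from Proposition 3 to Corollary 1 can be written out as follows. The numerical instance takes the hidden constant in $\Theta(\cdot)$ to be $1$ and assumes $\sigma_{A}^{2}=1/d_{in}$ with $d_{in}=4096$; both are illustrative assumptions rather than values fixed by the corollary.

$$\alpha^{2}\rho\operatorname{Tr}(\mathbf{H}_{\ell})=\Theta\big(\operatorname{Tr}(\mathbf{H}_{\ell})\big)\;\Longrightarrow\;\alpha^{2}\frac{\sigma_{A}^{2}}{r}=\Theta(1)\;\Longrightarrow\;\alpha=\Theta\Big(\frac{\sqrt{r}}{\sigma_{A}}\Big).$$

$$\sigma_{A}^{2}=\frac{1}{d_{in}}\;\Longrightarrow\;\alpha=\Theta\big(\sqrt{r\,d_{in}}\big),\qquad d_{in}=4096,\ r=16:\quad\sqrt{16\cdot 4096}=256.$$

By comparison, the heuristic $\alpha=r$ gives $16$ at this rank, and doubling the rank raises the curvature-matched value by a factor of $\sqrt{2}$ rather than $2$.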

## 4 The LoRA-$\alpha$ Protocol

Our empirical and theoretical analyses converge on a central principle: effective LoRA tuning requires a properly calibrated scaling factor. By aligning the signal curvature of LoRA with that of FFT, such calibration provides two key benefits: (i) it enables the direct use of the standard FFT learning rate ($\eta_{\text{FFT}}$) without costly hyperparameter sweeps, and (ii) it unlocks stronger fitting capacity than learning-rate scaling alone. Based on this insight, we propose LoRA-$\alpha$ to redefine LoRA as:

$$\Delta W=\frac{\alpha\cdot\alpha_{\text{base}}}{r}BA,\quad\text{with}\ \eta=\Theta(\eta_{\text{FFT}}). \tag{5}$$

This decouples the roles of scaling and optimization: $\alpha_{\text{base}}$ restores the suppressed curvature induced by LoRA, while $\alpha$ serves as a lightweight tuning knob that can be safely fixed to $1$ by default. If further adjustment is needed, $\alpha$ can be searched within a narrow and intuitive range (e.g., $\alpha\in[0.1,10]$), eliminating the need for expensive joint tuning over scaling factors and learning rates. We provide two complementary strategies for determining $\alpha_{\text{base}}$:

Variant I: Empirical Scaling for LLMs. Modern LLMs exhibit strong architectural homogeneity, with hidden dimensions typically on the order of thousands (e.g., $d_{in}=4096$), allowing for a universal hyperparameter regime. Leveraging the scaling law from Section [2](https://arxiv.org/html/2606.12883#S2), we adopt $C=256$ as a robust default, yielding: $\alpha_{\text{base}}=256\sqrt{r}$.

Variant II: Analytic Scaling. For general architectures, we derive a principled scaling from Corollary [1](https://arxiv.org/html/2606.12883#Thmcorollary1) by matching the expected task curvature at initialization: $\alpha_{\text{base}}=\frac{1}{\sigma_{A}}\sqrt{r}$. This closed-form solution provides a theoretical baseline that aligns LoRA with the hyperparameter regime of FFT.

Importantly, Variant I serves as a global empirical heuristic, while Variant II provides a layer-wise theoretical initialization. In practice, the empirical scaling in Variant I yields marginally larger magnitudes than the analytic prediction of Variant II for modern LLMs. A detailed magnitude comparison is provided in Appendix [F](https://arxiv.org/html/2606.12883#A6). We identify this gap as an important empirical finding. A plausible explanation is that the analytic scaling matches the average curvature via the trace of the Hessian, whereas optimization stability is governed by the largest eigenvalue. As shown in Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(a-b), these quantities can differ within a moderate range, making such discrepancies reasonable. Ultimately, this principled choice resolves the $16\times$–$256\times$ under-scaling inherent in prevailing heuristics.
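The two variants and the multiplier in Eq. (5) reduce to a few lines of arithmetic. The following is a minimal sketch with hypothetical helper names, not the authors' implementation; the example values $r=16$, $d_{in}=4096$, and the initialization $\sigma_{A}^{2}=1/d_{in}$ are assumptions made for illustration.

```python
import math


def alpha_base_empirical(rank: int, C: float = 256.0) -> float:
    """Variant I: alpha_base = C * sqrt(r), with C = 256 as the stated default."""
    return C * math.sqrt(rank)


def alpha_base_analytic(rank: int, sigma_A: float) -> float:
    """Variant II: alpha_base = sqrt(r) / sigma_A, per layer."""
    return math.sqrt(rank) / sigma_A


def lora_alpha_multiplier(rank: int, alpha_base: float, alpha: float = 1.0) -> float:
    """Factor applied to B @ A in Eq. (5): alpha * alpha_base / r."""
    return alpha * alpha_base / rank


rank, d_in = 16, 4096
sigma_A = 1.0 / math.sqrt(d_in)   # assumes sigma_A^2 = 1 / d_in

empirical = alpha_base_empirical(rank)          # 256 * 4 = 1024
analytic = alpha_base_analytic(rank, sigma_A)   # 4 * 64 = 256

print("Variant I  alpha_base:", empirical,
      "multiplier:", lora_alpha_multiplier(rank, empirical))
print("Variant II alpha_base:", analytic,
      "multiplier:", lora_alpha_multiplier(rank, analytic))
```

In libraries that apply a multiplier of the form `lora_alpha / r`, the product $\alpha\cdot\alpha_{\text{base}}$ would presumably be supplied as the `lora_alpha` value. Under the initialization assumed here the two variants differ by a factor of 4; the paper's own magnitude comparison is in Appendix F.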

## 5 Empirical Validation of LoRA-$\alpha$

To rigorously evaluate the effectiveness of LoRA-$\alpha$, we conduct extensive experiments along three dimensions: (i) Model diversity, covering encoder, decoder, diffusion, and vision-language architectures; (ii) Task heterogeneity, including natural language and multimodal tasks, encompassing cross entropy, reward-based, flow matching and contrastive learning objectives; (iii) Training horizon, ranging from few-shot adaptation to long-horizon optimization. This comprehensive design validates LoRA-$\alpha$ across both adaptation benchmarks and post-training scenarios. Specifically, we verify the effectiveness of Analytic Scaling on diverse architectures in Section [5.1](https://arxiv.org/html/2606.12883#S5.SS1), while adopting Empirical Scaling as the default configuration for the remaining LLM-centric experiments. All experiments are implemented using the PEFT library Mangrulkar et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib50)] and executed on eight NVIDIA H800 GPUs. Due to space limitations, we briefly present the core settings and main results in the following subsections. Comprehensive implementation details are deferred to Appendix [H](https://arxiv.org/html/2606.12883#A8).

Table 1: Comparison on the GLUE benchmark. Experiments are conducted using the DeBERTa-v3-base-184M model with a base learning rate of $1\times 10^{-4}$ and a rank of $8$. Asterisk ($*$) denotes optimal results from a $0.1\times$–$10\times$ sweep over $\eta$ or $\alpha$. Best and second-best values are bold and underlined.

Table 2: Comparison on NLG tasks using Llama 2-7B tuned with a base learning rate of $2\times 10^{-5}$.

### 5.1 Performance on Fundamental Adaptation Tasks

We evaluate short-horizon adaptation, where each run completes within a few hours, across both small- and large-scale models. We benchmark our approach against vanilla LoRA Hu et al. [[2022](https://arxiv.org/html/2606.12883#bib.bib10)] ($\alpha=r$) and representative methods, including scaling-based RsLoRA Kalajdzievski [[2023](https://arxiv.org/html/2606.12883#bib.bib38)], learning-rate-modified LoRA+ Hayou et al. [[2024a](https://arxiv.org/html/2606.12883#bib.bib39)], spectral-initialized PiSSA Meng et al. [[2024](https://arxiv.org/html/2606.12883#bib.bib43)], and magnitude-initialized LoRAM Zhang et al. [[2025b](https://arxiv.org/html/2606.12883#bib.bib92)]. To ensure a fair comparison, we adopt the baseline results from Zhang et al. [[2025b](https://arxiv.org/html/2606.12883#bib.bib92)] and independently train our method under identical settings.

Natural Language Understanding (NLU). First, we fine-tune and evaluate DeBERTa-v3-base-184M He et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib58)] on eight datasets from the GLUE benchmark Wang et al. [[2018](https://arxiv.org/html/2606.12883#bib.bib65)]. All methods are trained with a batch size of 32, a learning rate of $1\times 10^{-4}$, and LoRA rank $r=8$ for 3–5 epochs. As shown in Table [1](https://arxiv.org/html/2606.12883#S5.T1), LoRA-$\alpha$ achieves the strongest or near-strongest performance across tasks, outperforming both vanilla LoRA and alternative scaling strategies in average performance.

Natural Language Generation (NLG). We then fine-tune Llama 2-7B Meta Team [[2023](https://arxiv.org/html/2606.12883#bib.bib54)] on three datasets covering mathematical reasoning (MetaMathQA), code generation (CodeFeedback), and commonsense reasoning (Commonsense170K), respectively. Training is conducted with a batch size of 128, a learning rate of $2\times 10^{-5}$, and LoRA ranks $r\in\{16,128\}$ for one epoch over 100k samples. As shown in Table [2](https://arxiv.org/html/2606.12883#S5.T2), LoRA-$\alpha$ matches or surpasses existing LoRA variants across all tasks and rank settings. The advantage becomes more pronounced at the higher rank of 128, indicating that proper $\alpha$ scaling is increasingly critical as model capacity grows.

Text-to-Image Synthesis. To validate scalability to multimodal generative settings, we fine-tune Flux.1-12B Labs [[2024](https://arxiv.org/html/2606.12883#bib.bib57)] using the DreamBooth protocol Ruiz et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib26)]. All methods are trained with a batch size of 1, a learning rate of $1\times 10^{-4}$, and LoRA rank $r=8$ for 1,000 training steps, supervised by flow matching loss. As illustrated in Fig. [3](https://arxiv.org/html/2606.12883#S5.F3), LoRA-$\alpha$ produces images with higher fidelity to reference concepts compared to PiSSA and LoRAM, demonstrating that improved scaling generalizes beyond language tasks to complex generative domains.

Comparison of $\alpha$-scaling and $\eta$-scaling. We perform hyperparameter sweeps over both $\alpha$ and $\eta$ on NLU and NLG tasks, with full results provided in Appendix [H](https://arxiv.org/html/2606.12883#A8). As reported in Tables [1](https://arxiv.org/html/2606.12883#S5.T1) and [2](https://arxiv.org/html/2606.12883#S5.T2), $\alpha$-scaling consistently yields superior performance compared to $\eta$-scaling. Moreover, the performance gap widens significantly at larger ranks, corroborating the initial scaling findings from Fig. [1](https://arxiv.org/html/2606.12883#S1.F1).
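A $0.1\times$–$10\times$ sweep of this kind can be sketched as a log-spaced grid around a base value, applied to either $\eta$ or the $\alpha$ multiplier while the other is held fixed. The helper name `log_sweep` and the choice of five grid points are assumptions for illustration; the actual grid used in the experiments is not specified in this section.

```python
import math


def log_sweep(base: float, low: float = 0.1, high: float = 10.0,
              points: int = 5) -> list[float]:
    """Return `points` values log-spaced between low * base and high * base."""
    if points < 2:
        raise ValueError("points must be at least 2")
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (points - 1)
    return [base * 10 ** (lo + i * step) for i in range(points)]


if __name__ == "__main__":
    # Sweep the learning rate around 1e-4 with the alpha multiplier fixed at 1 ...
    print([f"{v:.2e}" for v in log_sweep(1e-4)])
    # ... or sweep the alpha multiplier around 1 with the learning rate fixed.
    print([round(v, 3) for v in log_sweep(1.0)])
```

Sweeping one axis at a time keeps the search cost linear in the number of grid points, in contrast to a joint grid over both hyperparameters.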

![Refer to caption](https://arxiv.org/html/2606.12883v1/x3.png)

Figure 3: Qualitative comparison of image customization on Flux.1-12B with a base learning rate of $1\times 10^{-4}$ and a rank of 8. LoRA-$\alpha$ consistently achieves higher fidelity to reference objects.

Table 3: Comparison on the MMEB benchmark via post-tuning Qwen 2-VL ($r=16$, base $\eta=2\times 10^{-5}$). “VLM2VEC” denotes officially released results. IND and OOD indicate in- and out-of-distribution splits, respectively, with scores reported to one decimal place per the VLM2Vec protocol. Bold and underlined values denote the best and second-best results within each model group.

### 5.2 Performance on Multimodal Representation Learning

Moving beyond short-horizon tasks, we evaluate LoRA-$\alpha$ in a long-horizon representation learning setting by transforming a multimodal LLM into a dense retriever on the MMEB benchmark Jiang et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib114)]. Following VLM2Vec Jiang et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib114)], we train Qwen 2-VL (2B and 7B) using an InfoNCE objective, with a batch size of 1024, a learning rate of $2\times 10^{-5}$, a maximum sequence length of 4096 tokens, and a LoRA rank of $r=16$, taking $\sim$3 days per run. We compare LoRA-$\alpha$ against standard LoRA, LoRA with $10\times$ learning rate Schulman and Lab [[2025](https://arxiv.org/html/2606.12883#bib.bib91)], DoRA Liu et al. [[2024](https://arxiv.org/html/2606.12883#bib.bib77)], PiSSA, and FFT, where FFT is only feasible for the 2B model due to memory constraints.
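For readers unfamiliar with the InfoNCE objective mentioned above, the following is a minimal pure-Python sketch with in-batch negatives. The cosine-similarity scoring, the temperature of 0.05, and the function name `info_nce` are assumptions for illustration and are not taken from the VLM2Vec training code.

```python
import math


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives; queries[i] pairs with targets[i]."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every target in the batch.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching target.
        total += log_denom - logits[i]
    return total / len(q)


if __name__ == "__main__":
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

The loss is small when each query is closest to its own target and large when the pairing is wrong, which is what drives the embedding model toward a retrieval-friendly geometry.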

Results. As shown in Table [3](https://arxiv.org/html/2606.12883#S5.T3), LoRA-$\alpha$ consistently outperforms all PEFT baselines and FFT across both model scales. Notably, it surpasses the best checkpoint from official VLM2Vec, indicating that improved scaling alone can unlock additional performance without architectural modifications. The largest gains are observed on OOD evaluation, where LoRA-$\alpha$ significantly improves over FFT and VLM2Vec, demonstrating stronger robustness under distribution shifts.

Table 4: Comparison of Pass@1 accuracy across reasoning benchmarks. We train Qwen 2.5-Math-7B with a base learning rate of $4\times 10^{-5}$. $n$ denotes the number of samples used for Pass@1 estimation.

Table 5: Comparison of Pass@1 accuracy on reasoning benchmarks after GRPO learning. We train DeepSeek-R1-Distill-Qwen with a base learning rate of $1\times 10^{-6}$ and a fixed LoRA rank of $64$.

### 5.3 Performance on Reasoning-based Supervised Fine-Tuning

We evaluate LoRA-$\alpha$ in a long-context reasoning setting. Following HuggingFace Team [[2025](https://arxiv.org/html/2606.12883#bib.bib105)], we post-tune Qwen 2.5-Math-7B Qwen Team [[2024](https://arxiv.org/html/2606.12883#bib.bib120)] on the Mixture-of-Thoughts dataset, which includes 350k reasoning traces across math, programming, and science, with a 32k token context. Training uses a batch size of 128, a learning rate of $4\times 10^{-5}$, and LoRA ranks $r\in\{64,256\}$ for one epoch, taking $\sim$2 days per run. We compare FFT, standard LoRA, LoRA with $10\times$ learning rate, and LoRA-$\alpha$ on AIME 2024/2025, MATH-500, GPQA Diamond, and LiveCodeBench v4 using Pass@1.
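Pass@1 estimated from $n$ samples per problem can be sketched with the commonly used unbiased pass@k estimator, which reduces to the fraction of correct samples when $k=1$. This section does not state which estimator the evaluation harness uses, so the code below is an assumption for illustration, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(correct_counts, n: int) -> float:
    """Average per-problem Pass@1, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Three problems, n = 8 samples each, with 8, 0, and 4 correct samples.
    print(benchmark_pass_at_1([8, 0, 4], n=8))
```

Averaging over several samples per problem reduces the variance of the reported score, which matters on small benchmarks such as AIME.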

Results. As shown in Table [4](https://arxiv.org/html/2606.12883#S5.T4), LoRA-$\alpha$ performs comparably to the $10\times$ learning rate variant at smaller ranks, but achieves stronger gains as the rank increases. In contrast, learning-rate scaling becomes unstable at higher ranks. Notably, due to the intrinsic difficulty of this task, LoRA-based methods do not surpass FFT, reflecting the substantial learning capacity required. Nevertheless, LoRA-$\alpha$ consistently emerges as the closest approximation to FFT.

### 5.4 Performance on Reinforcement Learning Paradigm

We evaluate LoRA-$\alpha$ under reinforcement learning using Group Relative Policy Optimization (GRPO) DeepSeek Team [[2025](https://arxiv.org/html/2606.12883#bib.bib8)]. Following Yin et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib124)], experiments are conducted on DeepSeek-R1-Distill-Qwen (1.5B and 7B) models, trained on the DAPO-Math-17k dataset Yu et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib121)] with a LoRA rank of $64$, a learning rate of $1\times 10^{-6}$, and a batch size of 128. We use a maximum completion length of 16k tokens with $G=8$ rollouts per prompt, taking $\sim$2 days per run. Evaluation is conducted on a suite of challenging mathematical benchmarks, using Pass@1 as the primary metric.
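The group-relative part of GRPO can be sketched as follows: the $G$ rollouts sampled for one prompt are scored, and each reward is standardized against the group mean and standard deviation to form its advantage. The use of the population standard deviation, the `eps` stabilizer, and the function name are assumptions for illustration; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Standardize each rollout's reward within its prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # G = 8 rollouts for one prompt, with binary correctness rewards.
    rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print([round(a, 3) for a in group_relative_advantages(rewards)])
```

When every rollout in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one source of the high variance noted for policy-gradient training.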

Results. As shown in Table [5](https://arxiv.org/html/2606.12883#S5.T5), LoRA-$\alpha$ achieves the best overall performance across both model scales. In contrast, increasing the learning rate fails to yield reliable gains and often degrades performance, highlighting the instability induced by high-variance policy gradients. FFT also exhibits inconsistent behavior, with noticeable degradation on several benchmarks, suggesting that updating all parameters can disrupt pretrained reasoning structures. Combined with the results in Table [4](https://arxiv.org/html/2606.12883#S5.T4), these findings echo the observations of Schulman and Lab [[2025](https://arxiv.org/html/2606.12883#bib.bib91)] and suggest a broader implication: compared to SFT, reinforcement learning typically requires lower learning capacity while being more susceptible to optimization instability. In this regime, LoRA is better suited than FFT, and properly scaling $\alpha$ preserves training stability and improves learning effectiveness.

## 6 Conclusion

This paper investigates the scaling mechanisms of LoRA, combining extensive empirical evidence with a novel Signal-Drift framework. We show that LoRA’s stability stems from spectral suppression of the task Hessian, which creates a significant optimization gap. Crucially, we demonstrate that scaling $\alpha$ is more effective than increasing the learning rate, as it amplifies task-aligned signal without introducing additional drift. Building on this insight, we establish a sublinear scaling law and propose LoRA-$\alpha$, which restores $\alpha$ to its principled regime and enables direct reuse of FFT learning rates to substantially improve LoRA performance. Overall, our work provides both theoretical insight and practical guidance for unlocking LoRA’s potential through a simple yet effective scaling mechanism.

Looking ahead, developing novel adapter architectures that structurally mitigate this drift without compromising task-signal integrity remains an exciting frontier for future research.

## References

- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev (2025) Matharena: evaluating llms on uncontaminated math competitions. In NeurIPS.
- D. Biderman, J. Portes, J. J. G. Ortiz, M. Paul, P. Greengard, C. Jennings, D. King, S. Havens, V. Chiley, J. Frankle, et al. (2024) Lora learns less and forgets less. TMLR.
- A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. (2023) Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
- N. Chen, S. Villar, and S. Hayou (2026) Learning rate scaling across lora ranks and transfer to full finetuning. arXiv preprint arXiv:2602.06204.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- DeepSeek Team (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C. Chan, W. Chen, et al. (2023) Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence.
- A. Edalati, M. S. Tahaei, I. Kobyzev, V. Nia, J. J. Clark, and M. Rezagholizadeh (2023) KronA: parameter efficient tuning with kronecker adapter. In NeurIPS Workshop.
- E. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal, M. Nezhurina, J. Mercat, T. Vu, Z. Sprague, et al. (2026) Openthoughts: data recipes for reasoning models. In ICLR.
- Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai (2024) Animatediff: animate your personalized text-to-image diffusion models without specific tuning. In ICLR.
- N. Habib, C. Fourrier, H. Kydlíček, T. Wolf, and L. Tunstall (2023) LightEval: a lightweight framework for llm evaluation. External Links: [Link](https://github.com/huggingface/lighteval).
- S. Hayou, N. Ghosh, and B. Yu (2024a) LoRA+: efficient low rank adaptation of large models. In ICML.
- S. Hayou, N. Ghosh, and B. Yu (2024b) The impact of initialization on lora finetuning dynamics. In NeurIPS.
- H. He, J. Ye, M. Li, Z. Wang, T. Chen, L. Bai, and P. Ye (2026) A unified study of lora variants: taxonomy, review, codebase, and empirical evaluation. arXiv preprint arXiv:2601.22708.
- J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig (2022) Towards a unified view of parameter-efficient transfer learning. In ICLR.
- P. He, J. Gao, and W. Chen (2023) DeBERTaV3: improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. In ICLR.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. In NeurIPS.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In ICLR.
- Z. Hu, L. Wang, Y. Lan, W. Xu, E. Lim, L. Bing, X. Xu, S. Poria, and R. Lee (2023) LLM-adapters: an adapter family for parameter-efficient fine-tuning of large language models. In EMNLP.
- HuggingFace Team (2025) Open r1: a fully open reproduction of deepseek-r1. External Links: [Link](https://github.com/huggingface/open-r1).
- N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2025) LiveCodeBench: holistic and contamination free evaluation of large language models for code. In ICLR.
- U. Jang, J. D. Lee, and E. K. Ryu (2024) LoRA training in the NTK regime has no spurious local minima. In ICML.
- T. Jiang, S. Huang, S. Luo, Z. Zhang, H. Huang, F. Wei, W. Deng, F. Sun, Q. Zhang, D. Wang, et al. (2024) MoRA: high-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130.
- Z. Jiang, R. Meng, X. Yang, S. Yavuz, Y. Zhou, and W. Chen (2025) VLM2vec: training vision-language models for massive multimodal embedding tasks. In ICLR.
- D. Kalajdzievski (2023) A rank stabilization scaling factor for fine-tuning with lora. arXiv preprint arXiv:2312.03732.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In ACM SOSP.
- B. F. Labs (2024) FLUX. External Links: [Link](https://github.com/black-forest-labs/flux).
- N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. (2025) Tulu 3: pushing frontiers in open language model post-training. In COLM.
- S. Lee and J. Lee (2026) Beware of the batch size: hyperparameter bias in evaluating lora. arXiv preprint arXiv:2602.09492.
- Y. Lee, C. Ko, P. Chen, and M. Yeh (2026) Learning rate matters: vanilla lora may suffice for llm fine-tuning. arXiv preprint arXiv:2602.04998.
- B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. In EMNLP.
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. (2022) Solving quantitative reasoning problems with language models. In NeurIPS.
- B. Li, L. Zhang, A. Mokhtari, and N. He (2025) On the crucial role of initialization for matrix factorization. In ICLR.
- V. Lialin, S. Muckatira, N. Shivagunde, and A. Rumshisky (2024) ReLoRA: high-rank training through low-rank updates. In ICLR.
- H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel (2022) Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In NeurIPS.
- J. Liu, Z. Kong, P. Dong, X. Shen, P. Zhao, H. Tang, G. Yuan, W. Niu, W. Zhang, X. Lin, D. Huang, and Y. Wang (2025) RoRA: efficient fine-tuning of llm with reliability optimization for rank adaptation. In ICASSP.
- S. Liu, C. Wang, H. Yin, P. Molchanov, Y. F. Wang, K. Cheng, and M. Chen (2024) Dora: weight-decomposed low-rank adaptation. In ICML.
- MAA (2023) AMC 2023 problems. Mathematical Association of America. External Links: [Link](https://artofproblemsolving.com/wiki/index.php/2023_AMC_12A_Problems).
- MAA (2024) AIME 2024 problems. Mathematical Association of America. External Links: [Link](https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Problems).
- MAA (2025) AIME 2025 problems. Mathematical Association of America. External Links: [Link](https://artofproblemsolving.com/wiki/index.php/2025_AIME_I_Problems).
- S. Malladi, A. Wettig, D. Yu, D. Chen, and S. Arora (2023) A kernel-based view of language model fine-tuning. In ICML.
- S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan (2023).
- Y. Mao, Y. Ge, Y. Fan, W. Xu, Y. Mi, Z. Hu, and Y. Gao (2025) A survey on lora of large language models. Frontiers of Computer Science.
- F. Meng, Z. Wang, and M. Zhang (2024) Pissa: principal singular values and singular vectors adaptation of large language models. In NeurIPS.
- Meta Team (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Meta Team (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- S. Mu and D. Klabjan (2025) On the convergence rate of lora gradient descent. arXiv preprint arXiv:2512.18248.
- Y. Nesterov (2013) Introductory lectures on convex optimization: a basic course. Springer Science & Business Media.
- OpenAI Team (2020) Language models are few-shot learners. In NeurIPS.
- OpenAI Team (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Qwen Team (2023) Qwen technical report. arXiv preprint arXiv:2309.16609.
- Qwen Team (2024) Qwen2.5-math technical report: toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In COLM.
- N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman (2023) Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In CVPR.
- J. Schulman and T. M. Lab (2025) LoRA without regret. Note: Thinking Machines Lab: Connectionism. External Links: [Link](https://thinkingmachines.ai/blog/lora).
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In EMNLP Workshop.
- P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin (2024a) Qwen2-vl: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- S. Wang, L. Yu, and J. Li (2024b) Lora-ga: low-rank adaptation with gradient approximation. In NeurIPS.
- W. Wang, Q. Liu, and Mind Lab (2026) 0.03 parameters, 100 potential: reflexivity of compute optimal rank scaling. Note: Mind Lab: A Lab for Experiential Intelligence. External Links: [Link](https://macaron.im/mindlab/research/003-parameters-100-potential-reflexivity-of-compute-optimal-rank-scaling).
- W. Xia, C. Qin, and E. Hazan (2024) Chain of lora: efficient fine-tuning of language models via residual learning. In ICML.
- Z. Xu, H. Min, L. E. MacDonald, J. Luo, S. Tarmoun, E. Mallada, and R. Vidal (2025) Understanding the learning dynamics of lora: a gradient flow perspective on low-rank adaptation in matrix factorization. In AISTATS.
- G. Yang and E. J. Hu (2021) Feature learning in infinite-width neural networks. In ICML.
- M. Yang, J. Chen, Y. Zhang, J. Liu, J. Zhang, Q. Ma, H. Verma, Q. Zhang, M. Zhou, I. King, et al. (2024) Low-rank adaptation for foundation models: a comprehensive review. arXiv preprint arXiv:2501.00365.
- Q. Yin, Y. Wu, Z. Shen, S. Li, Z. Wang, Y. Li, C. T. Leong, J. Kang, and J. Gu (2025) Evaluating parameter efficient methods for rlvr. arXiv preprint arXiv:2512.23165.
- L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu (2024) MetaMath: bootstrap your own mathematical questions for large language models. In ICLR.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) Dapo: an open-source llm reinforcement learning system at scale. In NeurIPS.
- D. Zhang, T. Feng, L. Xue, Y. Wang, Y. Dong, and J. Tang (2025a) Parameter-efficient fine-tuning for foundation models. arXiv preprint arXiv:2501.13787.
- Z. Zhang, H. Li, Y. Zhang, G. Gong, J. Wang, J. Hu, P. Liu, and Q. Jiang (2025b) The primacy of magnitude in low-rank adaptation. In NeurIPS.
- J. Zhao, T. Wang, W. Abid, G. Angus, A. Garg, J. Kinnison, A. Sherstinsky, P. Molino, T. Addair, and D. Rishi (2024) Lora land: 310 fine-tuned llms that rival gpt-4, a technical report. arXiv preprint arXiv:2405.00732.
- Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, et al. (2023) Codegeex: a pre-trained model for code generation with multilingual evaluations on humaneval-x. In KDD.
- T. Zheng, G. Zhang, T. Shen, X. Liu, B. Y. Lin, J. Fu, W. Chen, and X. Yue (2024) OpenCodeInterpreter: integrating code generation with execution and refinement. In ACL Findings.
- Y. Zhong, H. Jiang, L. Li, R. Nakada, T. Liu, L. Zhang, H. Yao, and H. Wang (2024) NEAT: nonlinear parameter-efficient adaptation of pre-trained models. arXiv preprint arXiv:2410.01870.
- J. Zhu, K. Greenewald, K. Nadjahi, H. S. d. O. Borde, R. B. Gabrielsson, L. Choshen, M. Ghassemi, M. Yurochkin, and J. Solomon (2024) Asymmetry in low-rank adapters of foundation models. In ICML.

## Appendix A Limitations

Our analysis of the asymmetric roles of the scaling factor $\alpha$ and the learning rate $\eta$ is grounded in the behavior of adaptive optimizers. While the Signal-Drift decomposition itself is optimizer-agnostic, the observed asymmetry, where $\alpha$ amplifies task-aligned signal while $\eta$ introduces additional drift, may not directly extend to vanilla SGD. We adopt this perspective to reflect practical training regimes, where adaptive optimizers are ubiquitous in deep learning and large model training.

## Appendix B Broader Impact

This work studies the scaling mechanism of LoRA and proposes a principled scaling strategy that improves training effectiveness\. By enabling more effective fine\-tuning and reduced hyperparameter search, our method can lower the computational cost associated with adapting large models, potentially making such techniques more accessible\. At the same time, improved training effectiveness may facilitate the broader deployment of large language models, including in settings where misuse is possible\. Nevertheless, our work does not introduce new capabilities beyond improving optimization, and thus does not fundamentally alter the risk profile of existing models\. Overall, we view this work as a step toward more efficient and principled model adaptation, while emphasizing that downstream applications should be developed and deployed with appropriate safeguards\.

## Appendix C Related Work

LoRA \[Hu et al., [2022](https://arxiv.org/html/2606.12883#bib.bib10)\] leverages low-rank decomposition to efficiently adapt large models, and has become a key technique for scalable deployment in modern large-scale systems, significantly reducing training costs \[Mangrulkar et al., [2023](https://arxiv.org/html/2606.12883#bib.bib50); Schulman and Lab, [2025](https://arxiv.org/html/2606.12883#bib.bib91); Wang et al., [2026](https://arxiv.org/html/2606.12883#bib.bib130)\] and serving costs \[Kwon et al., [2023](https://arxiv.org/html/2606.12883#bib.bib132)\]. Comprehensive overviews of LoRA and its extensions can be found in \[Yang et al., [2024](https://arxiv.org/html/2606.12883#bib.bib69); Mao et al., [2025](https://arxiv.org/html/2606.12883#bib.bib52); He et al., [2026](https://arxiv.org/html/2606.12883#bib.bib126)\]. To improve the expressiveness of low-rank updates, prior works explore architectural enhancements such as stacking strategies \[Lialin et al., [2024](https://arxiv.org/html/2606.12883#bib.bib31); Xia et al., [2024](https://arxiv.org/html/2606.12883#bib.bib32)\], higher-order parameterizations \[Edalati et al., [2023](https://arxiv.org/html/2606.12883#bib.bib13)\], and customized adapter designs \[Jiang et al., [2024](https://arxiv.org/html/2606.12883#bib.bib30); Liu et al., [2024](https://arxiv.org/html/2606.12883#bib.bib77); Zhong et al., [2024](https://arxiv.org/html/2606.12883#bib.bib72)\]. More relevant to our work is research on improving optimization and training efficiency. Existing approaches investigate scaling strategies \[Kalajdzievski, [2023](https://arxiv.org/html/2606.12883#bib.bib38); Biderman et al., [2024](https://arxiv.org/html/2606.12883#bib.bib17)\], learning rate design \[Hayou et al., [2024a](https://arxiv.org/html/2606.12883#bib.bib39)\], initialization schemes \[Meng et al., [2024](https://arxiv.org/html/2606.12883#bib.bib43); Wang et al., [2024b](https://arxiv.org/html/2606.12883#bib.bib47); Zhang et al., [2025b](https://arxiv.org/html/2606.12883#bib.bib92)\], optimizer choices \[Li et al., [2025](https://arxiv.org/html/2606.12883#bib.bib34)\], and batch size configurations \[Lee and Lee, [2026](https://arxiv.org/html/2606.12883#bib.bib129)\].

Optimization Dynamics of LoRA. The optimization behavior of LoRA remains poorly understood due to its bilinear parameterization and non-convex landscape \[Li et al., [2025](https://arxiv.org/html/2606.12883#bib.bib34); Xu et al., [2025](https://arxiv.org/html/2606.12883#bib.bib20)\]. Existing theoretical analyses are often restricted to simplified regimes, such as lazy training or infinite-width limits \[Malladi et al., [2023](https://arxiv.org/html/2606.12883#bib.bib82); Hayou et al., [2024b](https://arxiv.org/html/2606.12883#bib.bib41)\]. Kalajdzievski \[[2023](https://arxiv.org/html/2606.12883#bib.bib38)\] and Liu et al. \[[2025](https://arxiv.org/html/2606.12883#bib.bib97)\] study stability conditions, showing that improper scaling can lead to gradient collapse or inefficient training. Hayou et al. \[[2024a](https://arxiv.org/html/2606.12883#bib.bib39)\] and Zhu et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib40)\] further highlight structural asymmetries between the two low-rank factors, motivating asymmetric optimization strategies such as distinct learning rates. Mu and Klabjan \[[2025](https://arxiv.org/html/2606.12883#bib.bib93)\] provide a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm. While these efforts provide valuable insights, they do not fully explain how scaling choices affect optimization efficiency in practical settings.

The $\eta$-$\alpha$ Relationship. In practice, LoRA is typically trained with a fixed scaling factor (often $\alpha \propto r$) while tuning the learning rate $\eta$ \[Hu et al., [2022](https://arxiv.org/html/2606.12883#bib.bib10)\]. Empirically, LoRA requires significantly larger learning rates, often exceeding $10\times$ those of full fine-tuning, to achieve competitive performance \[Biderman et al., [2024](https://arxiv.org/html/2606.12883#bib.bib17); Schulman and Lab, [2025](https://arxiv.org/html/2606.12883#bib.bib91)\]. Recent theoretical results reveal a coupling between $\eta$ and $\alpha$ \[Zhang et al., [2025b](https://arxiv.org/html/2606.12883#bib.bib92); Schulman and Lab, [2025](https://arxiv.org/html/2606.12883#bib.bib91)\], showing that increasing $\alpha$ can mimic the effect of simultaneously scaling the learning rate and initialization under adaptive optimizers. We are inspired by these theoretical analyses, which we interpret as a hint that $\alpha$ and $\eta$ are not interchangeable, with $\alpha$ exerting a stronger influence on optimization than the learning rate. More recently, Chen et al. \[[2026](https://arxiv.org/html/2606.12883#bib.bib125)\] propose the Maximal-Update Adaptation framework to study rank-invariant learning rates, while Lee et al. \[[2026](https://arxiv.org/html/2606.12883#bib.bib131)\] show that different LoRA variants favor distinct learning rate ranges despite achieving similar peak performance. Nevertheless, these studies are largely confined to the regime $\alpha = r$. Our empirical results show that this constraint limits LoRA’s fitting capacity and leaves a significant optimization gap.

#### Discussion\.

To the best of our knowledge, this is the first work to systematically characterize the role of the scaling factor $\alpha$ in LoRA optimization. While existing analyses emphasize learning rate or initialization, we demonstrate that $\alpha$ plays a fundamentally different role. From a Hessian perspective, we explain why LoRA requires aggressive hyperparameters, and reveal a key asymmetry: $\alpha$ primarily amplifies task-aligned curvature, whereas $\eta$ amplifies the magnitude of bilinear-induced drift. This decomposition suggests an optimal regime that increases $\alpha$ while keeping $\eta$ conservative, enabling stronger signal amplification without destabilizing optimization.

## Appendix D Details for Empirical Study

In this section, we provide comprehensive details of the experimental setup and extended optimization results across different model scales and datasets\.

### D.1 Experimental Setup and Search Strategy

Hyperparameter Search Strategy. To characterize the optimization landscape of LoRA, we tailored our hyperparameter search strategies based on the dataset and our evolving empirical understanding:

- Tulu 3: Given the lack of established heuristics for the optimal scaling regime at the onset of our study, we adopted a two-stage approach for the Tulu 3 dataset. We first defined a broad, coarse-grained grid across the learning rate $\eta$ and the scaling factor $\alpha$. Upon identifying the general region of convergence, we conducted a finer-grained localized search around the minimum validation loss to accurately pinpoint the optimal configurations $\eta^{*}$ and $\alpha^{*}$.
- OpenThoughts: For the OpenThoughts dataset, rather than repeating an exhaustive grid search, we leveraged the empirical law derived from the Tulu 3 experiments, where the base scaling coefficient was identified near $C = 256$. We anchored our search space around this empirical baseline, systematically sweeping $\alpha$ within a $10\times$ range (both scaling up and down) to validate the transferability of the sublinear scaling law.
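The anchored sweep described for OpenThoughts can be sketched as follows. This is a minimal illustration, not the authors' search script: the helper name `alpha_sweep`, the log-spaced grid, and the number of points per side are assumptions; only the anchor $C\sqrt{r}$ with $C = 256$ and the $10\times$ range in each direction come from the text.

```python
import math

def alpha_sweep(rank, coeff=256.0, span=10.0, points_per_side=3):
    """Return log-spaced alpha candidates centred on coeff * sqrt(rank).

    The grid runs from center / span to center * span (assumed spacing).
    """
    center = coeff * math.sqrt(rank)
    grid = []
    for i in range(-points_per_side, points_per_side + 1):
        factor = span ** (i / points_per_side)
        grid.append(center * factor)
    return grid

# Example: rank 16 gives an anchor of 256 * 4 = 1024.
grid = alpha_sweep(rank=16)
print([round(a, 1) for a in grid])
```

A log-spaced grid is used here because the sweep covers two orders of magnitude; a linear grid would leave the low end of the range almost unsampled.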

OpenThoughts Dataset Details. To evaluate whether our findings generalize to reasoning-heavy tasks, which typically exhibit different optimization dynamics than general instruction tuning, we utilized the OpenThoughts dataset \[Guha et al., [2026](https://arxiv.org/html/2606.12883#bib.bib128)\]. We sampled a 10k training subset and a 1k validation set. Because reasoning traces inherently require substantially more context, we extended the maximum sequence length to $T = 8192$ (compared to $T = 1024$ for Tulu 3). Note that FFT experiments were omitted for OpenThoughts, as the primary objective here is to validate the relative scaling dynamics between $\alpha$ and $\eta$ established in the main text.

![Refer to caption](https://arxiv.org/html/2606.12883v1/x4.png)

Figure 4: Hyperparameter analysis of Llama 3-8B on the Tulu 3 dataset. (a) Evaluation loss as a function of the learning rate $\eta$ across different ranks $r$ and scaling factors $\alpha$. Gray lines denote linear fits of the minimum loss for each $(r, \alpha)$. Increasing $\alpha$ lowers the optimal loss and shifts $\eta^{*}$ downward. (b) Scaling paths defined in Eq. ([2](https://arxiv.org/html/2606.12883#S2.E2)) relative to the baseline ($\eta_{\text{FFT}} = 1\times 10^{-5}$, $\alpha_0 = 16$), where $\eta_0$ denotes the optimal FFT learning rate. Varying $\alpha$ reaches lower-loss regimes that are inaccessible via $\eta$-tuning alone. (c) Optimal scaling factor $\alpha^{*}$ as a function of rank $r$, evaluated at $\eta_0$. The observed sublinear trend and large magnitude of $\alpha$ challenge conventional scaling heuristics.

![Refer to caption](https://arxiv.org/html/2606.12883v1/x5.png)

Figure 5: Optimization dynamics and $\alpha$ sensitivity for Llama 3-8B on the Tulu 3 dataset over training steps. (a) Training loss and (b) evaluation loss curves for various combinations of rank $r$, scaling factor $\alpha$, and learning rate $\eta$, compared alongside FFT. Properly scaled configurations (e.g., larger $\alpha$ with standard $\eta$) tightly approximate the FFT trajectory. (c) Sensitivity analysis of the scaling factor $\alpha$, illustrating how increasing $\alpha$ smoothly accelerates convergence.

### D.2 Scaling Behavior of Llama 3-8B on Tulu 3

Fig. [4](https://arxiv.org/html/2606.12883#A4.F4) illustrates the hyperparameter analysis for the Llama 3-8B model on the Tulu 3 dataset. The observations corroborate our findings from the 1B model. Notably, $\alpha$-scaling continues to demonstrate superior optimization capacity over $\eta$-scaling. The optimal scaling factor $\alpha^{*}$ fits a sublinear relationship $\alpha = 512\sqrt{r}$, confirming that larger models also operate optimally at an $\alpha$ magnitude substantially larger than conventional rank-tied heuristics.

Furthermore, Fig. [5](https://arxiv.org/html/2606.12883#A4.F5) provides a granular view of the optimization dynamics over training steps. The loss trajectories demonstrate that properly scaled configurations, i.e., operating with a large $\alpha$ and a standard small $\eta$, tightly approximate the descent trajectory of FFT.

### D.3 Validation on OpenThoughts Reasoning Tasks

Figs. [6](https://arxiv.org/html/2606.12883#A4.F6) and [7](https://arxiv.org/html/2606.12883#A4.F7) present the optimization behaviors on the OpenThoughts dataset for Llama 3-1B and 8B, respectively. Despite the increased sequence length and task complexity, the fundamental asymmetry between $\alpha$ and $\eta$ persists. Both models consistently avoid early saturation and reach deeper loss minima when $\alpha$ is scaled. Interestingly, the fitted optimal scaling laws yield a large coefficient of $C = 1024$, indicating that complex reasoning tasks may demand an even more aggressive restoration of the task signal curvature.

![Refer to caption](https://arxiv.org/html/2606.12883v1/x6.png)

Figure 6: Hyperparameter analysis of Llama 3-1B on the OpenThoughts dataset with $\eta_{\text{FFT}} = 1\times 10^{-5}$ and $\alpha_0 = 16$. (a) Scaling paths comparing $\alpha$-scaling and $\eta$-scaling. Consistent with the main findings, increasing $\alpha$ enables the model to reach deeper loss minima that are inaccessible via $\eta$-tuning alone. (b) Optimal scaling factor $\alpha^{*}$ as a function of rank $r$. The fitted sublinear relationship confirms the scaling law established in the main text, demonstrating its robustness across different datasets.

![Refer to caption](https://arxiv.org/html/2606.12883v1/x7.png)

Figure 7: Hyperparameter analysis of Llama 3-8B on the OpenThoughts dataset with $\eta_{\text{FFT}} = 1\times 10^{-5}$ and $\alpha_0 = 16$. (a) Scaling paths comparing $\alpha$-scaling and $\eta$-scaling. Consistent with the main findings, increasing $\alpha$ enables the model to reach deeper loss minima that are inaccessible via $\eta$-tuning alone. (b) Optimal scaling factor $\alpha^{*}$ as a function of rank $r$. The fitted sublinear relationship confirms the scaling law established in the main text, demonstrating its robustness across different datasets.

## Appendix E Empirical Validation of Signal-Drift Dynamics

To empirically validate the Signal\-Drift dynamics and visualize the intractable Hessian matrices without the memory bottleneck, we construct a controlled environment using a multi\-layer perceptron \(MLP\)\. This setup allows us to precisely track the optimization trajectories and perform exact spectral analysis on the optimization landscape\.

#### Setups\.

We employ a 5-layer MLP with the SiLU activation function, trained to fit synthetic Gaussian data. We generate a target projection matrix $W_{\text{true}}$ constructed via Singular Value Decomposition (SVD), where the singular values decay as $\sigma_i = 1/i$. The training set consists of $n = 100$ samples drawn from a standard normal distribution, with targets generated as $Y = X W_{\text{true}}^{\top}$. LoRA adapters are applied to all linear layers with a fixed rank of $r = 16$. Pre-trained weights $W_0$ and adapter matrices $A$ are initialized from $\mathcal{N}(0, 1/d)$, while $B$ is initialized to zero. For general optimization tracking, we use a hidden dimension of $d = 256$. For exact Hessian spectral analysis, which scales quadratically with parameter count, we reduce the dimension to $d = 32$.

#### Optimization Dynamics ($\eta$-scaling vs. $\alpha$-scaling).

The model is optimized using Adam for 100 steps with an MSE loss. The base learning rate and base scaling factor are set to $\eta_0 = 10^{-4}$ and $\alpha_0 = 1$, respectively. To compare their behavioral differences, we conduct experiments across two distinct regimes:

- $\eta$-scaling: We fix $\alpha = \alpha_0$ and set the learning rate to $\{16, 32, 64, 128\} \times \eta_0$.
- $\alpha$-scaling: We anchor the learning rate at $\eta = \eta_0$ and set the scaling factor to $\{32, 64, 128, 256\} \times \alpha_0$.

These two regimes yield fundamentally different optimization dynamics. While configurations $\eta = 64\eta_0$ and $\alpha = 128\alpha_0$ initially exhibit comparable convergence trends, the $\eta$-scaled trajectory already manifests noticeable oscillations. Pushing the scaling further exposes the structural bottleneck: $\eta = 128\eta_0$ causes the training to diverge catastrophically, whereas $\alpha = 256\alpha_0$ remains stable and continues to unlock deeper fitting (as evidenced in Fig. [2](https://arxiv.org/html/2606.12883#S3.F2)).

![Refer to caption](https://arxiv.org/html/2606.12883v1/x8.png)

Figure 8: Spectral analysis of Hessian matrices. (a) Comparison between the full-parameter Hessian $\mathbf{H}_{\ell}$ and LoRA’s signal Hessian $\mathcal{H}_{\text{Signal}}$, illustrating the compression of total curvature energy. (b) The predicted sum of eigenvalues from our Signal-Drift framework closely matches the empirical values across various ranks, validating our theoretical derivation. (c) The exact LoRA Hessian $\mathcal{H}_{\text{LoRA}}$ is inherently indefinite due to the structural drift $\mathcal{H}_{\text{Drift}}$. Increasing $\alpha$ triggers spectral purification, where the spectrum converges toward the positive semi-definite $\mathcal{H}_{\text{Signal}}$. (d) Empirical verification of asymmetric scaling: the signal curvature magnitude grows as $\Theta(\alpha^{2})$ while drift noise grows only as $\Theta(\alpha)$. Consequently, as $\alpha$ increases, $\lambda_{\max}[\mathcal{H}_{\text{LoRA}}]$ aligns with $\lambda_{\max}[\mathcal{H}_{\text{Signal}}]$.

#### Spectral Compression under Low-Rank Parameterization.

To understand why LoRA scales differently from FFT, we explicitly compute the full-parameter Hessian $\mathbf{H}_{\ell}$, the LoRA Hessian $\mathcal{H}_{\text{LoRA}}$, and its components $\mathcal{H}_{\text{Signal}}$ and $\mathcal{H}_{\text{Drift}}$ via exact automatic differentiation. As shown in Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(a), the spectrum of $\mathcal{H}_{\text{Signal}}$ is a compressed version of $\mathbf{H}_{\ell}$. The dominant eigenvalues are significantly reduced, and the trace follows $\operatorname{Tr}(\mathcal{H}_{\text{Signal}}) \approx \frac{\alpha^{2}\sigma_{A}^{2}}{r}\operatorname{Tr}(\mathbf{H}_{\ell})$. Notably, Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(a) illustrates that the maximum eigenvalue $\lambda_{\max}$ empirically aligns with the predicted order of magnitude, although it is less strictly constrained than the trace. Since the trace equals the sum of all eigenvalues ($\operatorname{Tr}(\mathbf{H}) = \sum_{i}\lambda_{i}$), this trace-level compression inherently necessitates a suppression of the dominant spectral components. This confirms that low-rank parameterization severely restricts curvature energy, providing a quantitative explanation for why a larger $\alpha$ is fundamentally required to compensate for rank-induced suppression.
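The trace relation can be checked in a simplified special case. The sketch below is illustrative only and makes assumptions that are not part of the paper's MLP experiment: a single linear layer at initialization ($B = 0$), an isotropic task Hessian $\mathbf{H}_{\ell} = I$, and Gaussian $A$ with variance $\sigma_A^2 = 1/d_{in}$. Under these assumptions only the $B$-block of the Jacobian is non-zero and $\operatorname{Tr}(\mathcal{H}_{\text{Signal}}) = (\alpha/r)^2\, d_{out}\, \|A\|_F^2$.

```python
import math
import random

def signal_trace_isotropic(A, alpha, d_out):
    """Tr(J^T J) at B = 0 for W = W0 + (alpha / r) * B @ A and H_l = I.

    With B = 0 the derivative w.r.t. A vanishes, and each of the d_out rows
    of B contributes (alpha / r)^2 * ||A||_F^2 to the trace.
    """
    r = len(A)
    frob_sq = sum(a * a for row in A for a in row)
    return (alpha / r) ** 2 * d_out * frob_sq

random.seed(0)
d_in, d_out, r, alpha = 256, 256, 16, 64.0   # illustrative sizes
sigma_a = math.sqrt(1.0 / d_in)
A = [[random.gauss(0.0, sigma_a) for _ in range(d_in)] for _ in range(r)]

measured = signal_trace_isotropic(A, alpha, d_out)
full_trace = d_out * d_in                    # Tr(H_l) for H_l = I, D = d_out * d_in
predicted = alpha ** 2 * sigma_a ** 2 / r * full_trace
ratio = measured / predicted
print(f"measured={measured:.1f} predicted={predicted:.1f} ratio={ratio:.3f}")
```

The ratio fluctuates around 1 because $\|A\|_F^2$ is a random quantity whose mean is $r\, d_{in}\, \sigma_A^2$; the relation is therefore an expectation-level statement in this toy setting.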

#### Indefiniteness and Spectral Purification\.

Unlike the full Hessian, $\mathcal{H}_{\text{LoRA}}$ is inherently indefinite due to the presence of the structural drift $\mathcal{H}_{\text{Drift}}$. As shown in Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(c), this drift component introduces negative eigenvalues, which destabilize optimization. However, increasing $\alpha$ progressively amplifies the signal term relative to the drift, leading to a *spectral purification* effect: the overall indefinite spectrum becomes increasingly aligned with the positive semi-definite $\mathcal{H}_{\text{Signal}}$.

#### Asymmetric Scaling of Signal and Drift\.

Fig. [8](https://arxiv.org/html/2606.12883#A5.F8)(d) empirically verifies the asymmetric scaling behavior. We observe a strict divergence in growth rates:

$$\lambda_{\max}[\mathcal{H}_{\text{Signal}}] = \Theta(\alpha^{2}), \tag{6}$$

$$|\lambda_{\min}[\mathcal{H}_{\text{Drift}}]| = \Theta(\alpha). \tag{7}$$

This mismatch guarantees that increasing $\alpha$ disproportionately strengthens the task-aligned curvature while only linearly amplifying the destabilizing drift. As a result, the dominant eigenvalue of $\mathcal{H}_{\text{LoRA}}$ increasingly aligns with that of $\mathcal{H}_{\text{Signal}}$.

Together, these controlled experiments provide a validated spectral explanation for the empirical superiority of $\alpha$-scaling: (i) low-rank parameterization suppresses curvature magnitude; (ii) $\alpha$ restores signal strength in a controlled manner; and (iii) the asymmetric scaling between signal and drift naturally dictates that increasing $\alpha$, rather than the learning rate, is the principled path to stable and accelerated optimization.
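The growth rates in Eqs. (6) and (7) can be reproduced by hand in a scalar toy problem. The following sketch is an assumption-laden illustration rather than the MLP experiment above: it uses rank $r = 1$, scalar factors $a$ and $b$, the map $w = w_0 + \alpha b a$, a quadratic loss, and $b = 0$ (mirroring the zero initialization of $B$) so that the task gradient does not depend on $\alpha$. All function names are hypothetical.

```python
import math

def sym2_eigs(m):
    """Eigenvalues (min, max) of a symmetric 2x2 matrix [[p, q], [q, s]]."""
    p, q, s = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (p + s)
    rad = math.hypot(0.5 * (p - s), q)
    return mean - rad, mean + rad

def toy_hessians(alpha, a=0.5, b=0.0, w0=0.0, target=1.0, h=1.0):
    """Signal and drift Hessians for w = w0 + alpha*b*a, loss = h/2*(w-target)^2."""
    w = w0 + alpha * b * a
    g = h * (w - target)                  # task gradient dl/dw
    jac = (alpha * b, alpha * a)          # (dw/da, dw/db)
    signal = [[h * jac[i] * jac[j] for j in range(2)] for i in range(2)]
    drift = [[0.0, g * alpha], [g * alpha, 0.0]]   # g * d2w/(da db)
    return signal, drift

rows = []
for alpha in (1.0, 2.0, 4.0, 8.0):
    signal, drift = toy_hessians(alpha)
    lam_signal = sym2_eigs(signal)[1]         # largest signal eigenvalue
    lam_drift = abs(sym2_eigs(drift)[0])      # magnitude of most negative drift eigenvalue
    rows.append((alpha, lam_signal, lam_drift))
    print(f"alpha={alpha:4.1f}  signal={lam_signal:6.2f}  drift={lam_drift:5.2f}")
```

In this toy case the drift block has eigenvalues $\pm|g|\alpha$, so it is indefinite whenever the task gradient is non-zero, while the signal block is positive semi-definite with top eigenvalue $h\alpha^2 a^2$. Doubling $\alpha$ therefore quadruples the signal curvature and only doubles the drift.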

## Appendix F Comparison of Scaling Factor Magnitudes

Table 6: Comparison of the standard heuristic ($\alpha = r$) with our proposed empirical value ($256\sqrt{r}$) and analytic value ($\frac{1}{\sigma_A}\sqrt{r} = \sqrt{3 d_{in} r}$ with $d_{in} = 4096$).

#### Quantitative Comparison of Scaling Magnitudes.

To further illustrate the optimization gap, we provide a quantitative comparison between our proposed scaling law and prevailing heuristics in Table [6](https://arxiv.org/html/2606.12883#A6.T6). Conventional practices typically set $\alpha = r$, which implicitly confines the scaling factor to a severely under-scaled regime across commonly used ranks. According to our Signal-Drift framework, restoring the curvature energy (Hessian trace) to a level comparable to FFT requires $\alpha$ to scale with $\frac{1}{\sigma_A}\sqrt{r}$. For a typical model dimension $d_{in} = 4096$ and $\sigma_A^{2} = 1/(3 d_{in})$, the theoretical coefficient $\frac{1}{\sigma_A}$ is approximately $111$. Our empirical default $C = 256$ is roughly $2.3\times$ larger than this analytic coefficient, an even more aggressive choice that favors deep fitting. As shown in the table, for a standard rank $r = 8$, LoRA-$\alpha$ provides a scaling magnitude that is $90.5\times$ larger than the common heuristic. Even at $r = 256$, the gap remains as high as $16\times$. This comparison clarifies why standard FFT learning rates have historically been inadequate for driving LoRA to its full potential under legacy scaling rules.
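The quantities named in the Table 6 caption can be recomputed directly from the formulas in the text. The snippet below is a small sketch for checking the arithmetic; the helper name `scaling_row` and the intermediate ranks 16 and 64 are illustrative choices, and the printed values are recomputed here rather than copied from the paper's table.

```python
import math

D_IN = 4096
C_EMPIRICAL = 256.0
C_ANALYTIC = math.sqrt(3 * D_IN)   # 1 / sigma_A with sigma_A^2 = 1 / (3 * d_in)

def scaling_row(rank):
    """Return (alpha = r, 256*sqrt(r), sqrt(3*d_in*r), empirical / heuristic)."""
    heuristic = float(rank)
    empirical = C_EMPIRICAL * math.sqrt(rank)
    analytic = C_ANALYTIC * math.sqrt(rank)
    return heuristic, empirical, analytic, empirical / heuristic

for rank in (8, 16, 64, 256):
    heuristic, empirical, analytic, gap = scaling_row(rank)
    print(f"r={rank:3d}  alpha=r: {heuristic:5.0f}  256*sqrt(r): {empirical:7.1f}  "
          f"sqrt(3*d_in*r): {analytic:7.1f}  gap: {gap:5.1f}x")
```

Since the gap equals $256/\sqrt{r}$, it shrinks as the rank grows and reaches $1\times$ only at $r = 256^2 = 65536$, far beyond ranks used in practice.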

#### Note on Initialization Variance\.

The parameter $\sigma_A^{2} = 1/(3 d_{in})$ is chosen to align with the default initialization protocol in standard deep learning libraries (e.g., PyTorch and PEFT). Specifically, LoRA adapters are typically initialized using a Kaiming uniform distribution $\mathcal{U}(-\sqrt{1/d_{in}}, \sqrt{1/d_{in}})$. For a uniform distribution $\mathcal{U}(-k, k)$, the variance is given by $k^{2}/3$. By substituting $k^{2} = 1/d_{in}$, we obtain $\sigma_A^{2} = 1/(3 d_{in})$. This ensures that the magnitudes predicted by our theoretical framework are consistent in order of magnitude with those used in practice.
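The bound $k = \sqrt{1/d_{in}}$ can be traced back to the Kaiming-uniform rule. The sketch below re-derives it in plain Python under the assumption that the rule is `bound = gain * sqrt(3 / fan_in)` with `gain = sqrt(2 / (1 + a^2))` and that the adapter is initialized with `a = sqrt(5)`; this reflects our reading of the PyTorch `kaiming_uniform_` documentation and should be checked against the library version in use. The function name is hypothetical and no library call is made.

```python
import math

def kaiming_uniform_bound(fan_in, a=math.sqrt(5)):
    """Bound k of U(-k, k) under the Kaiming-uniform rule with leaky-ReLU slope a."""
    gain = math.sqrt(2.0 / (1.0 + a * a))   # equals sqrt(1/3) when a = sqrt(5)
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std             # uniform bound is sqrt(3) * std

d_in = 4096
k = kaiming_uniform_bound(d_in)
variance = k * k / 3.0                      # Var[U(-k, k)] = k^2 / 3
print(k, math.sqrt(1.0 / d_in))             # bound vs. sqrt(1 / d_in)
print(variance, 1.0 / (3 * d_in))           # variance vs. 1 / (3 * d_in)
```

With `a = sqrt(5)` the gain and the $\sqrt{3}$ factor cancel, which is why the bound reduces to $\sqrt{1/d_{in}}$ and the variance to $1/(3 d_{in})$.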

## Appendix G Theoretical Proofs

Notations. Consider a pre-trained weight $W_0 \in \mathbb{R}^{d_{out}\times d_{in}}$ and trainable adapters $\bm{\theta} = \{A, B\}$, where $B \in \mathbb{R}^{d_{out}\times r}$ and $A \in \mathbb{R}^{r\times d_{in}}$. The LoRA mapping is defined as $W(\bm{\theta}) = W_0 + \frac{\alpha}{r}BA$. Let $\bm{w}(\bm{\theta}) = \operatorname{vec}(W(\bm{\theta})) \in \mathbb{R}^{D}$ denote the vectorized weight, and $\bm{\theta} = [\operatorname{vec}(A); \operatorname{vec}(B)] \in \mathbb{R}^{p}$ the concatenated parameters, where $D = d_{out}d_{in}$ and $p = r(d_{in}+d_{out})$. For the objective $\ell(\bm{w})$, we define the task gradient $\bm{g} = \nabla_{\bm{w}}\ell \in \mathbb{R}^{D}$ and the task Hessian $\mathbf{H}_{\ell} = \nabla^{2}_{\bm{w}}\ell \in \mathbb{R}^{D\times D}$, with $g_k$ and $w_k$ denoting the $k$-th scalar entries of $\bm{g}$ and $\bm{w}$. The mapping geometry is governed by the Jacobian $J(\bm{\theta}) = \nabla_{\bm{\theta}}\bm{w} \in \mathbb{R}^{D\times p}$ and the structural Hessian $\nabla^{2}_{\bm{\theta}}\bm{w} \in \mathbb{R}^{D\times p\times p}$, interpreted as a bilinear operator: for any $\bm{v} \in \mathbb{R}^{p}$, $\nabla^{2}_{\bm{\theta}}\bm{w}[\bm{v},\bm{v}] \in \mathbb{R}^{D}$ has entries $\bm{v}^{\top}(\nabla^{2}_{\bm{\theta}}w_k)\bm{v}$.

### G.1 Proof for Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1)

###### Proposition (Signal-Drift Decomposition).

Let $\Delta\bm{w}_{\text{LoRA}} = \bm{w}(\bm{\theta}+\Delta\bm{\theta}) - \bm{w}(\bm{\theta})$ denote the effective step in the weight space, and $\mathcal{H}_{\text{LoRA}} = \nabla^{2}_{\bm{\theta}}\ell$ denote the parameter-space Hessian. Under the LoRA parameterization, both admit an exact decomposition into a task-aligned signal and a structural drift:

$$\Delta\bm{w}_{\text{LoRA}} = \underbrace{J(\bm{\theta})\Delta\bm{\theta}}_{\Delta\bm{w}_{\text{Signal}}} + \underbrace{\tfrac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]}_{\Delta\bm{w}_{\text{Drift}}}, \qquad \mathcal{H}_{\text{LoRA}} = \underbrace{J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta})}_{\mathcal{H}_{\text{Signal}}} + \underbrace{\sum_{k=1}^{D} g_k \nabla^{2}_{\bm{\theta}} w_k}_{\mathcal{H}_{\text{Drift}}}. \tag{8}$$

###### Proof.

Part I: Decomposition of the Weight Update $\Delta\bm{w}$. Recall that the general multivariate Taylor series expansion of the mapping $\bm{w}$ around $\bm{\theta}$ for a parameter update $\Delta\bm{\theta}$ is given by:

$$\bm{w}(\bm{\theta}+\Delta\bm{\theta}) = \bm{w}(\bm{\theta}) + \nabla_{\bm{\theta}}\bm{w}(\bm{\theta})\Delta\bm{\theta} + \frac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}] + \mathcal{O}(\|\Delta\bm{\theta}\|^{3}). \tag{9}$$

Under the LoRA framework, the mapping from the parameters $\bm{\theta} \coloneqq [\operatorname{vec}(A); \operatorname{vec}(B)]$ to the vectorized weights $\bm{w}$ is strictly bilinear. Because each element of $\bm{w}(\bm{\theta})$ contains products of at most two parameter elements, all third-order and higher-order derivatives of $\bm{w}$ with respect to $\bm{\theta}$ vanish identically. Consequently, the higher-order residual term $\mathcal{O}(\|\Delta\bm{\theta}\|^{3})$ is exactly zero, and the expansion terminates at the second-order term. Thus, we have the exact equality:

$$\bm{w}(\bm{\theta}+\Delta\bm{\theta}) = \bm{w}(\bm{\theta}) + \nabla_{\bm{\theta}}\bm{w}(\bm{\theta})\Delta\bm{\theta} + \frac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]. \tag{10}$$

By definition, the weight displacement is $\Delta\bm{w}_{\text{LoRA}} \coloneqq \bm{w}(\bm{\theta}+\Delta\bm{\theta}) - \bm{w}(\bm{\theta})$ and the Jacobian is $J(\bm{\theta}) \coloneqq \nabla_{\bm{\theta}}\bm{w}(\bm{\theta})$. Substituting these definitions directly yields the first decomposition:

$$\Delta\bm{w}_{\text{LoRA}} = \underbrace{J(\bm{\theta})\Delta\bm{\theta}}_{\Delta\bm{w}_{\text{Signal}}} + \underbrace{\frac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]}_{\Delta\bm{w}_{\text{Drift}}}. \tag{11}$$

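As a numerical sanity check of the exactness claim in Part I, the sketch below compares the true weight displacement with the sum of the first-order (signal) and second-order (drift) terms for a deliberately large perturbation. It is a minimal illustration in plain Python: the matrix sizes, the value of $\alpha$, the random seed, and the helper functions are all arbitrary choices, and $W_0$ is omitted because it cancels in the difference.

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(*mats):
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def scale(c, X):
    return [[c * x for x in row] for row in X]

def rand(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_out, d_in, r, alpha = 3, 4, 2, 8.0
c = alpha / r
B, A = rand(d_out, r), rand(r, d_in)
dB, dA = rand(d_out, r), rand(r, d_in)    # perturbations are NOT small

# Exact displacement W(theta + dtheta) - W(theta); W0 cancels.
exact = add(scale(c, matmul(add(B, dB), add(A, dA))), scale(-c, matmul(B, A)))
signal = scale(c, add(matmul(B, dA), matmul(dB, A)))   # J(theta) dtheta
drift = scale(c, matmul(dB, dA))                       # 1/2 * second-order term
residual = max(abs(e - s - d)
               for re, rs, rd in zip(exact, signal, drift)
               for e, s, d in zip(re, rs, rd))
print(f"max |exact - (signal + drift)| = {residual:.2e}")
```

The residual stays at floating-point rounding level even though the perturbation is of the same size as the parameters, which is what distinguishes an exact second-order expansion from a local approximation.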
Part II: Decomposition of the HessianℋLoRA\\mathcal\{H\}\_\{\\text\{LoRA\}\}\. First, we apply the multivariate chain rule to find the gradient ofℓ\\ellwith respect to theii\-th parameterθi\\theta\_\{i\}:

∂ℓ∂θi=∑k=1D∂ℓ∂wk∂wk∂θi=∑k=1Dgk∂wk∂θi,\\frac\{\\partial\\ell\}\{\\partial\\theta\_\{i\}\}=\\sum\_\{k=1\}^\{D\}\\frac\{\\partial\\ell\}\{\\partial w\_\{k\}\}\\frac\{\\partial w\_\{k\}\}\{\\partial\\theta\_\{i\}\}=\\sum\_\{k=1\}^\{D\}g\_\{k\}\\frac\{\\partial w\_\{k\}\}\{\\partial\\theta\_\{i\}\},\(12\)wheregk=∂ℓ∂wkg\_\{k\}=\\frac\{\\partial\\ell\}\{\\partial w\_\{k\}\}is thekk\-th element of the task gradient𝒈\\bm\{g\}\.

Next, we differentiate this expression with respect to another parameterθj\\theta\_\{j\}to obtain the\(i,j\)\(i,j\)\-th entry of the parameter Hessian:

\[ℋLoRA\]i,j=∂2ℓ∂θi∂θj=∂∂θj\(∑k=1Dgk∂wk∂θi\)\.\[\\mathcal\{H\}\_\{\\text\{LoRA\}\}\]\_\{i,j\}=\\frac\{\\partial^\{2\}\\ell\}\{\\partial\\theta\_\{i\}\\partial\\theta\_\{j\}\}=\\frac\{\\partial\}\{\\partial\\theta\_\{j\}\}\\left\(\\sum\_\{k=1\}^\{D\}g\_\{k\}\\frac\{\\partial w\_\{k\}\}\{\\partial\\theta\_\{i\}\}\\right\)\.\(13\)Applying the product rule, we obtain two separate terms:

\[ℋLoRA\]i,j=∑k=1D\(∂gk∂θj∂wk∂θi\+gk∂2wk∂θi∂θj\)\.\[\\mathcal\{H\}\_\{\\text\{LoRA\}\}\]\_\{i,j\}=\\sum\_\{k=1\}^\{D\}\\left\(\\frac\{\\partial g\_\{k\}\}\{\\partial\\theta\_\{j\}\}\\frac\{\\partial w\_\{k\}\}\{\\partial\\theta\_\{i\}\}\+g\_\{k\}\\frac\{\\partial^\{2\}w\_\{k\}\}\{\\partial\\theta\_\{i\}\\partial\\theta\_\{j\}\}\\right\)\.\(14\)We now expand the term∂gk∂θj\\frac\{\\partial g\_\{k\}\}\{\\partial\\theta\_\{j\}\}using the chain rule once more:

∂gk∂θj=∑m=1D∂gk∂wm∂wm∂θj=∑m=1D\[𝐇ℓ\]k,mJm,j,\\frac\{\\partial g\_\{k\}\}\{\\partial\\theta\_\{j\}\}=\\sum\_\{m=1\}^\{D\}\\frac\{\\partial g\_\{k\}\}\{\\partial w\_\{m\}\}\\frac\{\\partial w\_\{m\}\}\{\\partial\\theta\_\{j\}\}=\\sum\_\{m=1\}^\{D\}\[\\mathbf\{H\}\_\{\\ell\}\]\_\{k,m\}J\_\{m,j\},\(15\)where\[𝐇ℓ\]k,m=∂2ℓ∂wk∂wm\[\\mathbf\{H\}\_\{\\ell\}\]\_\{k,m\}=\\frac\{\\partial^\{2\}\\ell\}\{\\partial w\_\{k\}\\partial w\_\{m\}\}is the\(k,m\)\(k,m\)\-th entry of the task Hessian𝐇ℓ\\mathbf\{H\}\_\{\\ell\}, andJm,j=∂wm∂θjJ\_\{m,j\}=\\frac\{\\partial w\_\{m\}\}\{\\partial\\theta\_\{j\}\}is the\(m,j\)\(m,j\)\-th entry of the JacobianJ\(𝜽\)J\(\\bm\{\\theta\}\)\. Substituting this expansion back into the expression for\[ℋLoRA\]i,j\[\\mathcal\{H\}\_\{\\text\{LoRA\}\}\]\_\{i,j\}:

$$[\mathcal{H}_{\text{LoRA}}]_{i,j}=\sum_{k=1}^{D}\sum_{m=1}^{D}\underbrace{\frac{\partial w_{k}}{\partial\theta_{i}}}_{J_{k,i}}[\mathbf{H}_{\ell}]_{k,m}\underbrace{\frac{\partial w_{m}}{\partial\theta_{j}}}_{J_{m,j}}+\sum_{k=1}^{D}g_{k}\frac{\partial^{2}w_{k}}{\partial\theta_{i}\partial\theta_{j}}.\tag{16}$$

Observe that the double summation on the left corresponds exactly to the matrix product entry $(J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta}))_{i,j}$. The summation on the right corresponds exactly to the $(i,j)$-th entry of the matrix $\sum_{k=1}^{D}g_{k}\nabla^{2}_{\bm{\theta}}w_{k}$.

Converting this element-wise equality back into matrix notation, we arrive at the final decomposition:

$$\mathcal{H}_{\text{LoRA}}=\underbrace{J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta})}_{\mathcal{H}_{\text{Signal}}}+\underbrace{\sum_{k=1}^{D}g_{k}\nabla^{2}_{\bm{\theta}}w_{k}}_{\mathcal{H}_{\text{Drift}}}.\tag{17}$$

This establishes both partitions and completes the proof. ∎
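The decomposition in Eq. (17) can be sanity-checked numerically. The following is a minimal sketch, assuming a toy quadratic task loss and small hypothetical dimensions (`d_out`, `d_in`, `r`, `alpha`, `H_task`, and `c` are illustrative choices, not taken from the paper). Because the LoRA map is quadratic in the parameters, finite differences of $\bm{w}$ are exact up to round-off, while the reference Hessian is computed independently from loss values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 3, 2, 2, 4.0     # hypothetical toy sizes
scale = alpha / r
D, p = d_out * d_in, r * d_in + d_out * r

W0 = rng.normal(size=(d_out, d_in))      # frozen base weight
M = rng.normal(size=(D, D))
H_task = M @ M.T                         # task Hessian H_ell (symmetric PSD)
c = rng.normal(size=D)

def weights(theta):
    """Map theta = [vec(A), vec(B)] to the flattened LoRA weight w."""
    A = theta[: r * d_in].reshape(r, d_in)
    B = theta[r * d_in:].reshape(d_out, r)
    return (W0 + scale * B @ A).ravel()

def loss(theta):
    """Toy quadratic task loss: 0.5 w^T H_ell w + c^T w."""
    w = weights(theta)
    return 0.5 * w @ H_task @ w + c @ w

theta = rng.normal(size=p)
E = np.eye(p)
w0 = weights(theta)
g = H_task @ w0 + c                      # task gradient dl/dw

# w is quadratic in theta, so these finite differences are exact up to round-off.
h = 1e-3
J = np.stack([(weights(theta + h * E[i]) - weights(theta - h * E[i])) / (2 * h)
              for i in range(p)], axis=1)             # shape (D, p)
T = np.zeros((D, p, p))                               # T[k] = Hessian of w_k in theta
for i in range(p):
    for j in range(p):
        T[:, i, j] = (weights(theta + h * E[i] + h * E[j]) - weights(theta + h * E[i])
                      - weights(theta + h * E[j]) + w0) / h**2

H_signal = J.T @ H_task @ J
H_drift = np.einsum("k,kij->ij", g, T)

# Independent reference: central differences applied directly to the loss.
e = 1e-4
H_ref = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        H_ref[i, j] = (loss(theta + e * E[i] + e * E[j]) - loss(theta + e * E[i] - e * E[j])
                       - loss(theta - e * E[i] + e * E[j]) + loss(theta - e * E[i] - e * E[j])) / (4 * e**2)

rel_err = np.max(np.abs(H_ref - (H_signal + H_drift))) / np.max(np.abs(H_ref))
print(f"relative error of the decomposition: {rel_err:.2e}")
```

The reference Hessian never uses $J$ or the structural tensor, so a small `rel_err` supports the decomposition rather than restating it.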

By the definition of the LoRA forward pass, the weight matrix is $W(\theta)=W+\frac{\alpha}{r}BA$. For a parameter perturbation $\Delta\theta=\{\Delta A,\Delta B\}$, the perturbed weight matrix is:

$$W(\theta+\Delta\theta)=W+\frac{\alpha}{r}(B+\Delta B)(A+\Delta A)\tag{18}$$

$$\phantom{W(\theta+\Delta\theta)}=W+\frac{\alpha}{r}BA+\frac{\alpha}{r}(B\Delta A+\Delta BA)+\frac{\alpha}{r}\Delta B\Delta A.\tag{19}$$

The first-order variation with respect to the parameters corresponds strictly to the linear terms in $\Delta A$ and $\Delta B$. Vectorizing this linear component directly yields the exact algebraic form of the task signal:

$$\Delta\bm{w}_{\text{Signal}}=J(\bm{\theta})\Delta\bm{\theta}=\operatorname{vec}\left(\frac{\alpha}{r}(B\Delta A+\Delta BA)\right).\tag{20}$$

The remaining term in Eq. (19), $\frac{\alpha}{r}\Delta B\Delta A$, is bilinear in the perturbation. Vectorizing this cross-term gives the structural drift:

$$\Delta\bm{w}_{\text{Drift}}=\frac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]=\operatorname{vec}\left(\frac{\alpha}{r}\Delta B\Delta A\right).\tag{21}$$
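Since the LoRA map is exactly quadratic in $(A,B)$, the signal and drift terms of Eqs. (20) and (21) should account for the entire weight change, with no higher-order remainder. A minimal sketch of this check, assuming hypothetical matrix sizes and random perturbations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 5, 4, 2, 8.0      # hypothetical toy sizes
scale = alpha / r

W0 = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
dA = 1e-2 * rng.normal(size=(r, d_in))    # arbitrary parameter perturbation
dB = 1e-2 * rng.normal(size=(d_out, r))

def lora_weight(A, B):
    return W0 + scale * B @ A

signal = scale * (B @ dA + dB @ A)        # linear term, Eq. (20)
drift = scale * dB @ dA                   # bilinear cross-term, Eq. (21)

total = lora_weight(A + dA, B + dB) - lora_weight(A, B)
residual = np.max(np.abs(total - (signal + drift)))
print(f"max residual: {residual:.2e}")
```

The residual is at the level of floating-point round-off, which reflects that Eq. (19) is an identity rather than a truncated Taylor expansion.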

### G.2 Proof for Proposition [2](https://arxiv.org/html/2606.12883#Thmproposition2)

###### Proposition (Geometric Properties of Signal and Drift).

For any gradient-based update step $\Delta\bm{\theta}=-\eta\nabla_{\bm{\theta}}\ell$, the decomposed components in Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1) satisfy:

(i) Constructive Signal: The signal component aligns with the descent direction, $\langle\Delta\bm{w}_{\text{Signal}},-\bm{g}\rangle\geq 0$, and preserves the local convexity of the loss landscape, i.e., $\mathcal{H}_{\text{Signal}}\succeq 0$ given $\mathbf{H}_{\ell}\succeq 0$.

(ii) Adversarial Drift: The structural Hessian $\mathcal{H}_{\text{Drift}}$ is strictly indefinite, and its induced update $\Delta\bm{w}_{\text{Drift}}$ exerts an uncontrolled force with no guaranteed alignment with $-\bm{g}$.

###### Proof.

Part I: Properties of the Signal Terms.

1\. Gradient alignment ($\langle\Delta\bm{w}_{\text{Signal}},-\bm{g}\rangle\geq 0$): We evaluate the inner product between the signal component and the negative task gradient $-\bm{g}$ (the ideal steepest descent direction in the full-weight space). By the definition of the signal term and the chain rule ($\nabla_{\bm{\theta}}\ell=J(\bm{\theta})^{\top}\bm{g}$), this inner product reduces to the parameter space:

$$\langle\Delta\bm{w}_{\text{Signal}},-\bm{g}\rangle=\langle J(\bm{\theta})\Delta\bm{\theta},-\bm{g}\rangle=\langle\Delta\bm{\theta},-J(\bm{\theta})^{\top}\bm{g}\rangle=\langle\Delta\bm{\theta},-\nabla_{\bm{\theta}}\ell\rangle.\tag{22}$$

This establishes that the alignment of the projected signal in the full-weight space is mathematically equivalent to the alignment of the parameter update with the negative parameter gradient. We analyze this property under three distinct optimization regimes with learning rate $\eta>0$:

- Case 1: Standard Gradient Descent (GD). Under GD, $\Delta\bm{\theta}=-\eta\nabla_{\bm{\theta}}\ell$. The alignment yields the squared $\ell_{2}$-norm:

  $$\langle\Delta\bm{\theta},-\nabla_{\bm{\theta}}\ell\rangle=\langle-\eta\nabla_{\bm{\theta}}\ell,-\nabla_{\bm{\theta}}\ell\rangle=\eta\|\nabla_{\bm{\theta}}\ell\|_{2}^{2}\geq 0.\tag{23}$$

- Case 2: Adam (Sign Gradient). To isolate the geometric effect of Adam's adaptive denominator, we abstract its update rule as scaled sign gradient descent, $\Delta\bm{\theta}=-\eta\operatorname{sign}(\nabla_{\bm{\theta}}\ell)$, applied element-wise. Since $x\cdot\operatorname{sign}(x)=|x|$, the alignment resolves to the $\ell_{1}$-norm:

  $$\langle\Delta\bm{\theta},-\nabla_{\bm{\theta}}\ell\rangle=\eta\sum_{i}\operatorname{sign}([\nabla_{\bm{\theta}}\ell]_{i})[\nabla_{\bm{\theta}}\ell]_{i}=\eta\|\nabla_{\bm{\theta}}\ell\|_{1}\geq 0.\tag{24}$$

- Case 3: Muon (Orthogonalized Gradient). Muon operates on the matrix parameters directly. Let $X\in\{A,B\}$ denote a parameter block with gradient $G_{X}=\nabla_{X}\ell$. Using the compact SVD $G_{X}=U_{X}\Sigma_{X}V_{X}^{\top}$, Muon drops the singular values to yield the orthogonalized update $\Delta X=-\eta U_{X}V_{X}^{\top}$. The total parameter alignment is the sum of alignments over the blocks. Using the trace inner product $\langle A,B\rangle=\operatorname{Tr}(A^{\top}B)$ and its cyclic permutation property:

  $$\langle\Delta X,-G_{X}\rangle=\operatorname{Tr}\big((-\eta U_{X}V_{X}^{\top})^{\top}(-U_{X}\Sigma_{X}V_{X}^{\top})\big)\tag{25}$$

  $$\phantom{\langle\Delta X,-G_{X}\rangle}=\eta\operatorname{Tr}\big(V_{X}U_{X}^{\top}U_{X}\Sigma_{X}V_{X}^{\top}\big)\tag{26}$$

  $$\phantom{\langle\Delta X,-G_{X}\rangle}=\eta\operatorname{Tr}\big(\Sigma_{X}V_{X}^{\top}V_{X}\big)=\eta\operatorname{Tr}(\Sigma_{X}).\tag{27}$$

  Since $\operatorname{Tr}(\Sigma_{X})$ is the sum of singular values, it exactly equals the nuclear norm $\|G_{X}\|_{*}$. Summing over the adapter blocks yields:

  $$\langle\Delta\bm{\theta},-\nabla_{\bm{\theta}}\ell\rangle=\eta(\|G_{A}\|_{*}+\|G_{B}\|_{*})\geq 0.\tag{28}$$

In all cases, the alignment with the negative gradient is non-negative, and strictly positive whenever the gradient is non-zero. This guarantees that the signal term mimics a valid FFT descent direction, with the alignment measured in a different norm for each optimizer.
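The three alignment identities in Eqs. (23), (24), and (28) can be checked on random gradients. This is a minimal sketch under the same abstractions as the proof: Adam is modeled as sign descent, and Muon as an exact SVD-based orthogonalization (practical Muon implementations typically approximate this step iteratively and add momentum). The gradient shapes and `eta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1
# Stand-in gradients for the two adapter blocks (shapes are hypothetical).
G_A = rng.normal(size=(2, 6))   # r x d_in
G_B = rng.normal(size=(5, 2))   # d_out x r
blocks = (G_A, G_B)

def gd_step(G):
    return -eta * G

def sign_step(G):
    return -eta * np.sign(G)

def muon_step(G):
    # Compact SVD; dropping the singular values gives U V^T.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -eta * U @ Vt

def alignment(step):
    """Sum over blocks of <Delta X, -G_X> under the trace inner product."""
    return sum(float(np.sum(step(G) * (-G))) for G in blocks)

align_gd = alignment(gd_step)
align_sign = alignment(sign_step)
align_muon = alignment(muon_step)

l2_squared = sum(float(np.sum(G**2)) for G in blocks)
l1 = sum(float(np.sum(np.abs(G))) for G in blocks)
nuclear = sum(float(np.linalg.norm(G, ord="nuc")) for G in blocks)

print(align_gd, eta * l2_squared)
print(align_sign, eta * l1)
print(align_muon, eta * nuclear)
```

Each printed pair should agree up to round-off, matching the squared $\ell_2$, $\ell_1$, and nuclear norms respectively.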

2\. Positive semi-definiteness of $\mathcal{H}_{\text{Signal}}$: Recall from the decomposition that $\mathcal{H}_{\text{Signal}}=J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta})$. Assume the task landscape is locally convex, meaning the task Hessian is positive semi-definite ($\mathbf{H}_{\ell}\succeq 0$). For any non-zero vector $\bm{v}\in\mathbb{R}^{p}$, we have:

$$\bm{v}^{\top}\mathcal{H}_{\text{Signal}}\bm{v}=\bm{v}^{\top}(J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta}))\bm{v}=(J(\bm{\theta})\bm{v})^{\top}\mathbf{H}_{\ell}(J(\bm{\theta})\bm{v}).\tag{29}$$

Let $\bm{u}\coloneqq J(\bm{\theta})\bm{v}\in\mathbb{R}^{D}$. Since $\mathbf{H}_{\ell}\succeq 0$, it follows that $\bm{u}^{\top}\mathbf{H}_{\ell}\bm{u}\geq 0$ for all $\bm{u}$. Therefore, $\bm{v}^{\top}\mathcal{H}_{\text{Signal}}\bm{v}\geq 0$, which proves $\mathcal{H}_{\text{Signal}}\succeq 0$.
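As a quick numerical illustration of the congruence argument, the sketch below builds a random positive semi-definite task Hessian and an arbitrary Jacobian, then inspects the spectrum of $J^{\top}\mathbf{H}_{\ell}J$. The sizes are hypothetical, and the zeroed columns of `J` are an assumption meant to loosely mimic the $B=0$ initialization, where part of the Jacobian vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, p = 12, 7                      # hypothetical weight and parameter dimensions
M = rng.normal(size=(D, D))
H_task = M @ M.T                  # symmetric PSD task Hessian

J = rng.normal(size=(D, p))       # arbitrary Jacobian
J[:, :3] = 0.0                    # some vanishing columns, as when B = 0

H_signal = J.T @ H_task @ J
eigs = np.linalg.eigvalsh(H_signal)
min_eig = eigs.min()
print(f"smallest eigenvalue of H_signal: {min_eig:.2e}")
```

The smallest eigenvalue is zero up to round-off, which shows why the guarantee is semi-definiteness rather than strict positive definiteness.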

Part II: Properties of the Drift Terms.

1\. Indefinite saddle structure of $\mathcal{H}_{\text{Drift}}$: To prove that the structural drift introduces a saddle point, we analyze the block structure of $\mathcal{H}_{\text{Drift}}$. Since the mapping $W(\theta)=W_{0}+\frac{\alpha}{r}BA$ is bilinear in $(A,B)$ up to the constant offset $W_{0}$, all intra-variable second derivatives (e.g., with respect to $A$ or $B$ alone) vanish identically. The parameter Hessian of any scalar entry $w_{k}$ thus consists solely of cross-derivatives, yielding a strictly block off-diagonal form. The aggregated drift Hessian naturally inherits this structure:

$$\nabla^{2}_{\bm{\theta}}w_{k}=\frac{\alpha}{r}\begin{bmatrix}\mathbf{0}&\mathbf{C}_{k}\\ \mathbf{C}_{k}^{\top}&\mathbf{0}\end{bmatrix}\quad\implies\quad\mathcal{H}_{\text{Drift}}=\sum_{k=1}^{D}g_{k}\nabla^{2}_{\bm{\theta}}w_{k}=\frac{\alpha}{r}\begin{bmatrix}\mathbf{0}&\mathbf{C}\\ \mathbf{C}^{\top}&\mathbf{0}\end{bmatrix},\tag{30}$$

where $\mathbf{C}=\sum_{k=1}^{D}g_{k}\mathbf{C}_{k}$ aggregates the gradient-weighted cross-derivatives.

By the spectral properties of symmetric block off-diagonal matrices, their non-zero eigenvalues appear in symmetric pairs ($\pm\lambda$). Whenever $\mathbf{C}\neq\mathbf{0}$, this characterizes $\mathcal{H}_{\text{Drift}}$ as an indefinite matrix, establishing that the bilinear parameterization inherently injects a disruptive saddle-point geometry into the landscape, independent of the task curvature $\mathbf{H}_{\ell}$.
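The $\pm\lambda$ pairing can be seen directly on a random block off-diagonal matrix: its non-zero eigenvalues are plus and minus the singular values of the off-diagonal block. A minimal sketch, assuming hypothetical block sizes `n_A` and `n_B` and a random stand-in for $\mathbf{C}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_A, n_B = 4, 6                            # hypothetical sizes of the A and B blocks
C = rng.normal(size=(n_A, n_B))            # stand-in for the aggregated cross-derivatives

H_drift = np.block([
    [np.zeros((n_A, n_A)), C],
    [C.T, np.zeros((n_B, n_B))],
])

eigs = np.sort(np.linalg.eigvalsh(H_drift))
singular_values = np.sort(np.linalg.svd(C, compute_uv=False))
positive_eigs = eigs[eigs > 1e-9]

print("eigenvalues:    ", np.round(eigs, 4))
print("singular values:", np.round(singular_values, 4))
```

The spectrum is symmetric about zero, with the remaining `abs(n_A - n_B)` eigenvalues equal to zero, so the matrix has both positive and negative curvature directions.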

2\. Uncontrolled force of $\Delta\bm{w}_{\text{Drift}}$: To demonstrate that the structural drift can actively degrade optimization, we examine its alignment with the negative task gradient ($-\bm{g}$) under standard gradient descent. Recall from Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1) that the weight-space drift is defined via the structural Hessian tensor as $\Delta\bm{w}_{\text{Drift}}=\frac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]$. Under GD with learning rate $\eta>0$, the parameter update is $\Delta\bm{\theta}=-\eta\nabla_{\bm{\theta}}\ell=-\eta J(\theta)^{\top}\bm{g}$. Evaluating the inner product between the drift step and the ideal descent direction yields:

$$\langle\Delta\bm{w}_{\text{Drift}},-\bm{g}\rangle=-\frac{1}{2}\bm{g}^{\top}\Big(\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]\Big)=-\frac{1}{2}\Delta\bm{\theta}^{\top}\left(\sum_{k=1}^{D}g_{k}\nabla^{2}_{\bm{\theta}}w_{k}\right)\Delta\bm{\theta}.\tag{31}$$

The term inside the parentheses is exactly the parameter-space drift Hessian defined in Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1). Thus, the inner product simplifies to a quadratic form:

$$\langle\Delta\bm{w}_{\text{Drift}},-\bm{g}\rangle=-\frac{1}{2}\Delta\bm{\theta}^{\top}\mathcal{H}_{\text{Drift}}\Delta\bm{\theta}.\tag{32}$$

Because $\mathcal{H}_{\text{Drift}}$ is a block off-diagonal indefinite matrix, this quadratic form is not guaranteed to be non-negative, meaning the drift can oppose the descent direction.

Counterexample: Consider a scalar network ($D=1$, rank $r=1$) with $\alpha=1$. The effective weight is $w=ab$, parameterized by $\bm{\theta}=[a,b]^{\top}$. Assume the current parameters are $\bm{\theta}=[1,1]^{\top}$ and the task gradient is $g=1$. The Jacobian is $J(\theta)=\nabla_{\bm{\theta}}w=[b,a]=[1,1]$, and the parameter update is $\Delta\bm{\theta}=-\eta J(\theta)^{\top}g=[-\eta,-\eta]^{\top}$. The drift Hessian evaluates to $\mathcal{H}_{\text{Drift}}=g\nabla^{2}_{\bm{\theta}}w=1\cdot\begin{bmatrix}0&1\\ 1&0\end{bmatrix}$. Substituting these into Eq. ([31](https://arxiv.org/html/2606.12883#A7.E31)), the inner product is:

$$\langle\Delta\bm{w}_{\text{Drift}},-g\rangle=-\frac{1}{2}\begin{bmatrix}-\eta&-\eta\end{bmatrix}\begin{bmatrix}0&1\\ 1&0\end{bmatrix}\begin{bmatrix}-\eta\\ -\eta\end{bmatrix}=-\frac{1}{2}(2\eta^{2})=-\eta^{2}<0.\tag{33}$$

∎
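The scalar counterexample can be reproduced in a few lines. This sketch uses the same setting as above ($w=ab$, $a=b=1$, $g=1$, $\alpha=r=1$); only the value `eta = 0.1` is an arbitrary choice for illustration.

```python
import math

eta = 0.1                      # arbitrary learning rate for illustration
a, b, g = 1.0, 1.0, 1.0        # parameters and task gradient from the counterexample

# Gradient descent step: dl/da = g * b and dl/db = g * a.
da = -eta * g * b
db = -eta * g * a

signal = b * da + a * db       # first-order change in w
drift = da * db                # second-order (bilinear) change in w
total = (a + da) * (b + db) - a * b

align_signal = signal * (-g)   # alignment of the signal with -g
align_drift = drift * (-g)     # alignment of the drift with -g

print(f"signal alignment: {align_signal:+.4f}")
print(f"drift alignment:  {align_drift:+.4f}")
```

The signal alignment is $2\eta$ and the drift alignment is $-\eta^{2}$, so the drift opposes the descent direction while remaining small relative to the signal for small $\eta$.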

### G.3 Proof for Proposition [3](https://arxiv.org/html/2606.12883#Thmproposition3)

###### Proposition (Spectral Suppression).

Under $B=0$ and $A_{i,j}\sim\mathcal{N}(0,\sigma^{2}_{A})$, the expected signal curvature captured by LoRA satisfies $\mathbb{E}[\operatorname{Tr}(\mathcal{H}_{\text{Signal}})]=\alpha^{2}\rho\operatorname{Tr}(\mathbf{H}_{\ell})$, where $\rho=\sigma^{2}_{A}/r$.

###### Proof.

Using the vectorization identity $\operatorname{vec}(XYZ)=(Z^{\top}\otimes X)\operatorname{vec}(Y)$, we can express the LoRA mapping in the vectorized space as:

$$\bm{w}=\operatorname{vec}(W)+\frac{\alpha}{r}(A^{\top}\otimes I_{d_{out}})\operatorname{vec}(B),\tag{34}$$

where $I_{d_{out}}$ is the identity matrix of dimension $d_{out}$.

Let $J=\nabla_{\bm{\theta}}\bm{w}\in\mathbb{R}^{D\times p}$ denote the Jacobian matrix of this mapping, which can be partitioned into blocks corresponding to $A$ and $B$, such that $J=[J_{A},J_{B}]$. At initialization, since $B=0$, the partial derivative with respect to $A$ vanishes completely, yielding $J_{A}=0$. The partial derivative with respect to $B$ is given by:

$$J_{B}=\frac{\alpha}{r}(A^{\top}\otimes I_{d_{out}}).\tag{35}$$

The signal curvature projected into the parameter space is defined by $\mathcal{H}_{\text{Signal}}=J^{\top}\mathbf{H}_{\ell}J$. Taking the trace and applying its cyclic property $\operatorname{Tr}(XYZ)=\operatorname{Tr}(ZXY)$, we have:

$$\operatorname{Tr}(\mathcal{H}_{\text{Signal}})=\operatorname{Tr}(J^{\top}\mathbf{H}_{\ell}J)=\operatorname{Tr}(\mathbf{H}_{\ell}JJ^{\top})=\operatorname{Tr}(\mathbf{H}_{\ell}J_{B}J_{B}^{\top}).\tag{36}$$

We now evaluate the expectation of the product $J_{B}J_{B}^{\top}$ over the random Gaussian initialization of $A$. Utilizing the mixed-product property of the Kronecker product, $(X\otimes Y)(U\otimes V)=(XU\otimes YV)$, we obtain:

$$J_{B}J_{B}^{\top}=\left(\frac{\alpha}{r}\right)^{2}(A^{\top}A\otimes I_{d_{out}}).\tag{37}$$

By assumption, the entries of $A\in\mathbb{R}^{r\times d_{in}}$ are drawn i.i.d. from $\mathcal{N}(0,\sigma^{2}_{A})$. Thus, the expectation of its Gram matrix is $\mathbb{E}[A^{\top}A]=r\sigma^{2}_{A}I_{d_{in}}$. Substituting this into the expectation yields:

$$\mathbb{E}[J_{B}J_{B}^{\top}]=\frac{\alpha^{2}}{r^{2}}(\mathbb{E}[A^{\top}A]\otimes I_{d_{out}})=\frac{\alpha^{2}}{r^{2}}(r\sigma^{2}_{A}I_{d_{in}}\otimes I_{d_{out}})=\frac{\alpha^{2}\sigma^{2}_{A}}{r}I_{D},\tag{38}$$

where $I_{D}$ is the identity matrix of dimension $D=d_{in}d_{out}$. Finally, by the linearity of the trace and expectation operators, we conclude:

$$\mathbb{E}[\operatorname{Tr}(\mathcal{H}_{\text{Signal}})]=\operatorname{Tr}\left(\mathbf{H}_{\ell}\mathbb{E}[J_{B}J_{B}^{\top}]\right)=\operatorname{Tr}\left(\mathbf{H}_{\ell}\frac{\alpha^{2}\sigma^{2}_{A}}{r}I_{D}\right)=\frac{\alpha^{2}\sigma^{2}_{A}}{r}\operatorname{Tr}(\mathbf{H}_{\ell}).\tag{39}$$

Defining the variance-to-rank ratio as $\rho=\sigma^{2}_{A}/r$, we arrive at the exact analytical expression:

$$\mathbb{E}[\operatorname{Tr}(\mathcal{H}_{\text{Signal}})]=\alpha^{2}\rho\operatorname{Tr}(\mathbf{H}_{\ell}).\tag{40}$$

∎
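Eq. (40) lends itself to a Monte Carlo check. The sketch below samples $A$ from the stated Gaussian initialization, builds $J_{B}$ as in Eq. (35), and compares the empirical mean of $\operatorname{Tr}(J_{B}^{\top}\mathbf{H}_{\ell}J_{B})$ with $\alpha^{2}\rho\operatorname{Tr}(\mathbf{H}_{\ell})$. The dimensions, `alpha`, `sigma_A`, the random task Hessian, and the sample count are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 3, 4, 2                 # hypothetical toy sizes
alpha, sigma_A = 16.0, 0.5
D = d_out * d_in

M = rng.normal(size=(D, D))
H_task = M @ M.T                         # fixed PSD task Hessian

def trace_signal(A):
    """Tr(J_B^T H_ell J_B) with J_B = (alpha / r) (A^T kron I), valid when B = 0."""
    J_B = (alpha / r) * np.kron(A.T, np.eye(d_out))   # shape (D, r * d_out)
    return np.trace(J_B.T @ H_task @ J_B)

n_samples = 20000
samples = [trace_signal(sigma_A * rng.normal(size=(r, d_in))) for _ in range(n_samples)]
empirical = float(np.mean(samples))

rho = sigma_A**2 / r
predicted = alpha**2 * rho * np.trace(H_task)

print(f"empirical: {empirical:.2f}")
print(f"predicted: {predicted:.2f}")
```

The two values should agree up to sampling noise. The result holds in expectation only, so an individual draw of $A$ can deviate noticeably, especially at small rank.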

###### Lemma 1 (Hessian Spectral Bound and Stability).

Consider a second-order differentiable loss function $\ell(\theta)$ with a local Hessian $\mathcal{H}=\nabla^{2}\ell(\theta)$. If the gradient $\nabla\ell(\theta)$ is $L$-Lipschitz continuous, the stability of gradient descent with a constant learning rate $\eta$ is governed by the maximum eigenvalue of the Hessian, denoted as $\lambda_{\max}(\mathcal{H})$. Specifically, the optimization remains stable if and only if:

$$\eta\leq\frac{2}{L}=\frac{2}{\lambda_{\max}(\mathcal{H})}.\tag{41}$$

Furthermore, a reduction in $\lambda_{\max}(\mathcal{H})$ via spectral suppression increases the upper bound of the stable learning rate, effectively expanding the permissible hyperparameter space.

###### Proof.

By the descent lemma for $L$-smooth functions, Nesterov \[[2013](https://arxiv.org/html/2606.12883#bib.bib103)\], for any $\theta_{t}$ and $\theta_{t+1}=\theta_{t}-\eta\nabla\ell(\theta_{t})$, we have:

$$\ell(\theta_{t+1})-\ell(\theta_{t})\leq-\eta\|\nabla\ell(\theta_{t})\|^{2}+\frac{L\eta^{2}}{2}\|\nabla\ell(\theta_{t})\|^{2}=-\eta\left(1-\frac{L\eta}{2}\right)\|\nabla\ell(\theta_{t})\|^{2}.\tag{42}$$

To ensure the loss is non-increasing ($\ell(\theta_{t+1})\leq\ell(\theta_{t})$), we require $1-\frac{L\eta}{2}\geq 0$, which yields $\eta\leq 2/L$. In the quadratic approximation of the local optimization landscape, the Lipschitz constant $L$ is equivalent to the spectral norm of the Hessian $\|\mathcal{H}\|_{2}$, which corresponds to its maximum eigenvalue $\lambda_{\max}(\mathcal{H})$. Thus, $\eta\leq 2/\lambda_{\max}(\mathcal{H})$. ∎
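The threshold in Eq. (41) can be illustrated on a quadratic loss, where the Hessian is constant. This is a minimal sketch with a hypothetical diagonal Hessian. It runs gradient descent just below and just above $2/\lambda_{\max}$, and then shows how uniformly shrinking the spectrum raises the critical learning rate.

```python
import numpy as np

H = np.diag([1.0, 4.0, 10.0])            # hypothetical Hessian with lambda_max = 10
lam_max = np.linalg.eigvalsh(H).max()
eta_crit = 2.0 / lam_max                 # stability threshold from Eq. (41)

def run_gd(eta, hessian, steps=200):
    """Run gradient descent on l(theta) = 0.5 theta^T H theta and return the final loss."""
    theta = np.ones(hessian.shape[0])
    for _ in range(steps):
        theta = theta - eta * (hessian @ theta)
    return 0.5 * theta @ hessian @ theta

initial_loss = 0.5 * np.ones(3) @ H @ np.ones(3)
loss_stable = run_gd(0.9 * eta_crit, H)      # slightly below the threshold
loss_unstable = run_gd(1.1 * eta_crit, H)    # slightly above the threshold

# Suppressing the spectrum by a factor of 4 raises the threshold by the same factor.
H_suppressed = 0.25 * H
eta_crit_suppressed = 2.0 / np.linalg.eigvalsh(H_suppressed).max()

print(f"eta_crit = {eta_crit:.3f}, after suppression = {eta_crit_suppressed:.3f}")
print(f"loss below threshold: {loss_stable:.3e}")
print(f"loss above threshold: {loss_unstable:.3e}")
```

The run below the threshold converges, while the run above it diverges along the direction of the largest eigenvalue.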

### G.4 Proof for Proposition [4](https://arxiv.org/html/2606.12883#Thmproposition4)

###### Proposition (Asymmetric Scaling).

The components of the landscape and the weight update scale asymmetrically with respect to $\alpha$ and $\eta$. For the landscape: $\mathcal{H}_{\text{Signal}}=\Theta(\alpha^{2})$ while $\mathcal{H}_{\text{Drift}}=\Theta(\alpha)$. For the update under adaptive optimizers like Adam: $\Delta\bm{w}_{\text{Signal}}=\Theta(\alpha\eta)$ while $\Delta\bm{w}_{\text{Drift}}=\Theta(\alpha\eta^{2})$.

###### Proof.

The proof proceeds by first establishing the scaling properties of the mapping derivatives with respect to the hyperparameter $\alpha$, and subsequently analyzing their propagation into the landscape and the update dynamics.

#### Step 1: Scaling of the Jacobian and Structural Tensor.

Based on Eq. ([34](https://arxiv.org/html/2606.12883#A7.E34)), the Jacobian $J(\bm{\theta})=\nabla_{\bm{\theta}}\bm{w}\in\mathbb{R}^{D\times p}$ can be partitioned into blocks corresponding to the partial derivatives with respect to $A$ and $B$:

$$J(\bm{\theta})=\left[\frac{\partial\bm{w}}{\partial A},\frac{\partial\bm{w}}{\partial B}\right]=\frac{\alpha}{r}\left[(I_{d_{in}}\otimes B),(A^{\top}\otimes I_{d_{out}})\right].\tag{43}$$

Because the coefficient $\frac{\alpha}{r}$ is factored out of the partial derivatives, the Jacobian is directly proportional to $\alpha$. Thus, its magnitude scales tightly as:

$$\|J(\bm{\theta})\|=\Theta(\alpha).\tag{44}$$

Similarly, the structural Hessian tensor $\nabla^{2}_{\bm{\theta}}\bm{w}$ captures the second-order derivatives of the mapping. Because the mapping is bilinear in $A$ and $B$, the only non-zero second derivatives are the cross-derivatives between $A$ and $B$. Differentiating $J(\bm{\theta})$ again yields a tensor that maintains the exact linear dependence on $\alpha$:

$$\|\nabla^{2}_{\bm{\theta}}\bm{w}\|=\Theta(\alpha).\tag{45}$$

#### Step 2: Scaling of the Landscape Components.

We now substitute the scaling rules of the derivatives into the decomposition defined in Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1). Assuming the full task gradient $\bm{g}=\nabla_{\bm{w}}\ell$ and full task Hessian $\mathbf{H}_{\ell}=\nabla^{2}_{\bm{w}}\ell$ are locally independent of the adapter parameterization, the Signal Hessian scales quadratically because the Jacobian enters twice:

$$\|\mathcal{H}_{\text{Signal}}\|=\|J(\bm{\theta})^{\top}\mathbf{H}_{\ell}J(\bm{\theta})\|=\Theta(\alpha)\times\Theta(\alpha)=\Theta(\alpha^{2}).\tag{46}$$

In contrast, the Drift Hessian is a linear combination of the structural tensor slices weighted by the gradient entries $g_{k}$. Thus, it inherits the linear scaling of the structural tensor:

$$\|\mathcal{H}_{\text{Drift}}\|=\Big\|\sum_{k=1}^{D}g_{k}\nabla^{2}_{\bm{\theta}}w_{k}\Big\|=\Theta(\alpha).\tag{47}$$
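The quadratic versus linear dependence on $\alpha$ in Eqs. (46) and (47) can be seen by doubling $\alpha$ while holding the adapter weights, task gradient, and task Hessian fixed (the independence assumption stated above). A minimal sketch with hypothetical sizes and random stand-ins for $\mathbf{H}_{\ell}$ and $\bm{g}$, using the column-major vectorization of Eq. (43):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 3, 4, 2                  # hypothetical toy sizes
D = d_out * d_in
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
M = rng.normal(size=(D, D))
H_task = M @ M.T                          # fixed task Hessian (assumed independent of alpha)
G = rng.normal(size=(d_out, d_in))        # fixed task gradient, in the shape of the weight

def signal_hessian(alpha):
    # Jacobian of Eq. (43): (alpha / r) [I kron B, A^T kron I].
    J = (alpha / r) * np.hstack([np.kron(np.eye(d_in), B), np.kron(A.T, np.eye(d_out))])
    return J.T @ H_task @ J

def drift_hessian(alpha):
    # Hessian of phi(A, B) = <G, (alpha / r) B A>; only the A-B cross block is non-zero.
    n_A, n_B = r * d_in, d_out * r
    C = np.zeros((n_A, n_B))
    for k in range(r):
        for j in range(d_in):
            for i in range(d_out):
                # d^2 phi / dA[k, j] dB[i, k] = (alpha / r) * G[i, j]
                C[k + r * j, i + d_out * k] = (alpha / r) * G[i, j]
    return np.block([[np.zeros((n_A, n_A)), C], [C.T, np.zeros((n_B, n_B))]])

signal_ratio = np.linalg.norm(signal_hessian(16.0)) / np.linalg.norm(signal_hessian(8.0))
drift_ratio = np.linalg.norm(drift_hessian(16.0)) / np.linalg.norm(drift_hessian(8.0))
print(f"signal Hessian ratio: {signal_ratio:.2f}, drift Hessian ratio: {drift_ratio:.2f}")
```

Doubling $\alpha$ quadruples the signal curvature but only doubles the drift curvature, so the signal-to-drift curvature ratio grows linearly in $\alpha$ under these assumptions.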

#### Step 3: Update Scaling under Adaptive Optimizers.

A key characteristic of modern adaptive optimizers, such as Adam, is their mechanism to decouple the update magnitude from the raw gradient scale. This process effectively normalizes the gradient magnitude, ensuring that the parameter update $\Delta\bm{\theta}$ is primarily governed by the learning rate $\eta$, rather than the local curvature or gradient variance. Taking Adam as a representative case, the update step is formulated as:

$$\Delta\bm{\theta}=-\eta\frac{\hat{\bm{m}}}{\sqrt{\hat{\bm{v}}}+\epsilon},\tag{48}$$

where $\hat{\bm{m}}$ and $\hat{\bm{v}}$ denote the bias-corrected first and second moment estimates of the gradients, respectively. The denominator term $\sqrt{\hat{\bm{v}}}+\epsilon$ serves as the normalization factor that stabilizes the update step size. Consequently, the magnitude of the update remains consistent at $\|\Delta\bm{\theta}\|=\Theta(\eta)$, irrespective of the absolute norm of the gradient $\nabla_{\bm{\theta}}\ell$.

By applying this gradient-scale-invariant step size of order $\eta$ back to the weight update decomposition in Proposition [1](https://arxiv.org/html/2606.12883#Thmproposition1), we observe an asymmetric scaling behavior between the signal and drift components:

- Signal Update: As a first-order linear transformation of the parameter update, the task-aligned signal scales linearly with both $\alpha$ and $\eta$:

  $$\|\Delta\bm{w}_{\text{Signal}}\|=\|J(\bm{\theta})\Delta\bm{\theta}\|=\Theta(\alpha)\times\Theta(\eta)=\Theta(\alpha\eta).\tag{49}$$

- Drift Update: Conversely, the structural drift is defined by the bilinear quadratic form of $\Delta\bm{\theta}$. When the structural tensor is applied to the normalized updates, the $\eta$ scaling is squared while $\alpha$ remains linear:

  $$\|\Delta\bm{w}_{\text{Drift}}\|=\Big\|\frac{1}{2}\nabla^{2}_{\bm{\theta}}\bm{w}[\Delta\bm{\theta},\Delta\bm{\theta}]\Big\|=\Theta(\alpha)\times\Theta(\eta^{2})=\Theta(\alpha\eta^{2}).\tag{50}$$

∎
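The asymmetry between Eqs. (49) and (50) can be made concrete by comparing what happens when $\eta$ or $\alpha$ is doubled. The sketch below assumes a sign-like normalized update direction as a stand-in for Adam's $\hat{\bm{m}}/(\sqrt{\hat{\bm{v}}}+\epsilon)$, held fixed across settings. All sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 4                      # hypothetical toy sizes
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
# Fixed unit-magnitude directions standing in for Adam's normalized update.
U_A = np.sign(rng.normal(size=A.shape))
U_B = np.sign(rng.normal(size=B.shape))

def update_norms(alpha, eta):
    """Return (||signal||, ||drift||) for a normalized step of size eta."""
    scale = alpha / r
    dA, dB = -eta * U_A, -eta * U_B
    signal = scale * (B @ dA + dB @ A)        # Eq. (20)
    drift = scale * dB @ dA                   # Eq. (21)
    return np.linalg.norm(signal), np.linalg.norm(drift)

s0, d0 = update_norms(alpha=8.0, eta=1e-3)
s_eta, d_eta = update_norms(alpha=8.0, eta=2e-3)       # double the learning rate
s_alpha, d_alpha = update_norms(alpha=16.0, eta=1e-3)  # double the scaling factor

print(f"double eta:   signal x{s_eta / s0:.1f}, drift x{d_eta / d0:.1f}")
print(f"double alpha: signal x{s_alpha / s0:.1f}, drift x{d_alpha / d0:.1f}")
print(f"drift/signal ratio: base {d0 / s0:.2e}, "
      f"eta-scaled {d_eta / s_eta:.2e}, alpha-scaled {d_alpha / s_alpha:.2e}")
```

Both choices double the signal, but doubling $\eta$ also doubles the drift-to-signal ratio, whereas doubling $\alpha$ leaves that ratio unchanged. This is the sense in which $\alpha$ amplifies the task signal without increasing the drift ratio.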

## Appendix H Experimental Details

We summarize the core configurations and hyperparameters across all six evaluation domains in Table [7](https://arxiv.org/html/2606.12883#A8.T7). Detailed learning rate schedules and environmental configurations strictly follow the referenced official codebases. Subsequent subsections provide granular details, including datasets and evaluation metrics, for each specific task.

Table 7: Comprehensive summary of experimental settings across all evaluation domains. The estimated wall-clock time (Est. Time) highlights our evaluation across diverse training horizons, ranging from rapid adaptation to multi-day post-training scenarios.

Table 8: Comparison on NLG tasks using Llama 2-7B. Performance is reported as mean $\pm$ std over three independent runs. Bold and underlined values denote the best and second-best results within each experimental group.

Table 9: Detailed hyperparameter search results on the GLUE benchmark. The table includes a grid search over learning rate ($\eta$) and scaling factor ($\alpha$). The average is computed across all 8 tasks.

Table 10: Detailed hyperparameter search results on NLG tasks (Llama 2-7B). The table includes a grid search over learning rate ($\eta$) and scaling factor ($\alpha$). The average is computed across the 5 tasks.

### H.1 Natural Language Understanding

#### Models and Datasets.

We evaluate NLU performance using the DeBERTa-v3-base model (184M), He et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib58)\], on eight tasks from the GLUE benchmark, Wang et al. \[[2018](https://arxiv.org/html/2606.12883#bib.bib65)\]. Following standard protocols, we include MNLI, SST-2, MRPC, CoLA, QNLI, QQP, RTE, and STS-B.

#### LoRA Configuration.

All parameter-efficient methods are implemented with a fixed rank $r=8$. For the proposed LoRA-$\alpha$, we apply the analytic scaling (Variant II) to determine $\alpha_{\text{base}}$. For other baselines, the scaling factor $\alpha$ follows their respective original designs. Adapters are uniformly applied to all linear layers.

#### Optimization.

All models are trained using the AdamW optimizer with a constant learning rate of $1\times 10^{-4}$, aligned with standard FFT settings. Training is conducted for 3 epochs across most tasks, with the exception of MRPC, which is trained for 5 epochs due to its limited sample size. We use a batch size of 32 and follow the default weight decay settings from the Transformers library.

#### Evaluation.

We report task-specific metrics as per GLUE conventions: matched/mismatched accuracy for MNLI, Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the remaining tasks. Following Meng et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib43)\]; Zhang et al. \[[2025b](https://arxiv.org/html/2606.12883#bib.bib92)\], results are averaged over three independent runs.

### H.2 Natural Language Generation

#### Models and Datasets.

We evaluate generative performance via supervised fine-tuning of Llama 2-7B, Meta Team \[[2023](https://arxiv.org/html/2606.12883#bib.bib54)\], across three distinct domains. Specifically, the model is trained on MetaMathQA, Yu et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib60)\], for mathematical reasoning (evaluated on GSM8K, Cobbe et al. \[[2021](https://arxiv.org/html/2606.12883#bib.bib61)\], and MATH, Yu et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib60)\]); CodeFeedback, Zheng et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib62)\], for code generation (evaluated on HumanEval, Zheng et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib56)\], and MBPP, Austin et al. \[[2021](https://arxiv.org/html/2606.12883#bib.bib63)\]); and Commonsense170K, Hu et al. \[[2023](https://arxiv.org/html/2606.12883#bib.bib64)\], for commonsense reasoning (reporting averaged accuracy across eight sub-datasets). Following the PiSSA training framework, Meng et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib43)\], each task strictly utilizes a sampled subset of 100k examples to ensure a balanced and computationally equivalent comparison.

#### LoRA Configuration.

We evaluate performance across two distinct rank regimes: $r\in\{16,128\}$. To explicitly validate the effectiveness of our theoretically derived scaling rule, LoRA-$\alpha$ strictly utilizes the analytic scaling (Variant II) to determine the scaling factor $\alpha$. This setup verifies whether the analytic formulation can scale to LLMs. Adapters are uniformly applied to all linear layers.

#### Optimization.

Training is performed using the AdamW optimizer with a learning rate of $2\times 10^{-5}$ and a total batch size of 128. All models are optimized for a single epoch, providing a stringent testbed to assess the rapid adaptation efficiency of different scaling strategies. Following Meng et al. \[[2024](https://arxiv.org/html/2606.12883#bib.bib43)\]; Zhang et al. \[[2025b](https://arxiv.org/html/2606.12883#bib.bib92)\], results are averaged over three independent runs. Table [8](https://arxiv.org/html/2606.12883#A8.T8) augments the results in Table [2](https://arxiv.org/html/2606.12883#S5.T2) by including standard deviations (std) for reference.

### H\.3Detailed Hyperparameter Search Results

To validate the superiority ofα\\alpha\-scaling over conventionalη\\eta\-scaling in generalization, we conduct an exhaustive hyperparameter grid search across both NLU and NLG tasks\. The detailed, per\-task performance breakdowns are provided in Table[9](https://arxiv.org/html/2606.12883#A8.T9)and Table[10](https://arxiv.org/html/2606.12883#A8.T10)\.Given the massive scale of this hyperparameter search space, all results reported in these tables are based on a single experimental run per configuration\.

- NLU Grid Search (Table [9](https://arxiv.org/html/2606.12883#A8.T9)): We observe that strictly scaling the learning rate ($\eta$) often leads to suboptimal plateaus or local instability (e.g., initial failure to converge on CoLA). In contrast, scaling the structural multiplier $\alpha$ unlocks higher performance ceilings across almost all datasets, raising the aggregated peak performance (LoRA-$\alpha^{\star}$) to 89.03, compared to 88.53 for $\eta$-scaling.
- NLG Grid Search (Table [10](https://arxiv.org/html/2606.12883#A8.T10)): The structural bottleneck of $\eta$-scaling becomes critically evident, especially at larger ranks ($r = 128$). Pushing $\eta$ to extremes triggers catastrophic divergence (e.g., performance collapsing to 0.00 on MATH and GSM8K). Conversely, $\alpha$-scaling preserves optimization stability under the same aggressive scaling regimes, successfully driving the model to deeper local minima and extending the frontier significantly (averaging 44.87 vs. 42.69).

Ultimately, aggregating the optimal configurations demonstrates that amplifying the signal via $\alpha$ maintains an optimization advantage over $\eta$-scaling.

### H.4 Text-to-Image Synthesis

#### Models and Datasets.

We evaluate multimodal generation using the Flux.1-12B model Labs [[2024](https://arxiv.org/html/2606.12883#bib.bib57)] for image customization via the DreamBooth protocol Ruiz et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib26)]. The implementation is based on the official PEFT DreamBooth examples. The model learns novel concepts from small sets of instance images (typically 4–6) to generate high-fidelity, prompt-conditioned images.

#### LoRA Configuration.

For LoRA-$\alpha$, we apply analytic scaling (Variant II) to calibrate $\alpha_{\text{base}}$ based on the Flux architecture, with a fixed rank of $r = 8$. Adapters are applied to all attention layers.

#### Optimization.

Models are optimized using AdamW with a constant learning rate of $1 \times 10^{-4}$ and a batch size of 1. Training is conducted for 1,000 steps. This setting provides a challenging environment to test whether $\alpha$-scaling can accelerate concept acquisition without overfitting.

#### Evaluation.

Following DreamBooth, we use the unique identifier "sks" for concept binding. To test the model’s ability to preserve the learned concept under novel contexts, evaluation prompts include category and scene descriptions: "A sks <object\_name\> in a sunlit forest, vibrant foliage, ultra-realistic photography" and "A sks <object\_name\>, centered on a pure white background, ultra-realistic photography".

### H.5 Multimodal Representation Learning

#### Models and Datasets.

We evaluate LoRA-$\alpha$ on the MMEB benchmark Jiang et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib114)], which converts generative models into dense retrievers. We use Qwen 2-VL (2B and 7B) Wang et al. [[2024a](https://arxiv.org/html/2606.12883#bib.bib6)] as backbones, implementing the contrastive training pipeline based on the VLM2Vec codebase. The dataset includes 36 subsets; 20 in-distribution (ID) subsets are used for training, while 16 out-of-distribution (OOD) subsets are reserved for zero-shot evaluation.

#### LoRA Configuration.

For LoRA-$\alpha$, we adopt Variant I with $\alpha_{\text{base}} = 256\sqrt{r}$. Adapters are applied to all linear layers with a fixed rank of $r = 16$.
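To make the magnitude of this setting concrete, the following minimal sketch evaluates the square-root rule for a given rank. It assumes the common LoRA convention in which the low-rank update is multiplied by $\alpha / r$, and it compares against a rank-tied heuristic $\alpha = 2r$ chosen purely for illustration; the function names are hypothetical and do not come from the paper's code.

```python
import math


def sqrt_law_alpha(rank: int, coeff: float = 256.0) -> float:
    """alpha_base = coeff * sqrt(rank); coeff = 256 is the value stated for this experiment."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return coeff * math.sqrt(rank)


def lora_multiplier(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update, ASSUMING the common alpha / rank convention."""
    return alpha / rank


if __name__ == "__main__":
    rank = 16  # rank used in this experiment
    alpha = sqrt_law_alpha(rank)
    # Illustrative baseline only: a rank-tied heuristic alpha = 2 * rank.
    baseline = lora_multiplier(2 * rank, rank)
    print(f"r={rank}: alpha_base={alpha}, multiplier={lora_multiplier(alpha, rank)}, "
          f"rank-tied multiplier={baseline}")
```

Under these assumptions, $r = 16$ gives $\alpha_{\text{base}} = 1024$ and a multiplier of 64, whereas a rank-tied choice keeps the multiplier constant regardless of rank.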

#### Optimization.

Training employs the InfoNCE contrastive loss (temperature $0.02$) and the AdamW optimizer. We use a global batch size of 1,024 and a learning rate of $2 \times 10^{-5}$. Training lasts for 2,000 steps with gradient caching enabled.
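For reference, a minimal sketch of an InfoNCE loss with in-batch negatives is given below. It assumes cosine similarity between non-zero query and target embeddings, with the $i$-th target serving as the positive for the $i$-th query; the actual VLM2Vec pipeline may differ in details such as negative sampling, and the function name is hypothetical.

```python
import math


def info_nce(queries, targets, temperature=0.02):
    """Mean InfoNCE loss over a batch; target i is the positive for query i.

    Assumes cosine similarity and non-zero embedding vectors.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]

    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled similarities between query i and every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_z - logits[i]
    return total / len(q)
```

With a temperature as small as 0.02, similarity differences are multiplied by 50 before the softmax, so a mismatched pair incurs a large penalty while a well-separated positive contributes almost nothing.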

#### Evaluation.

Representations are extracted from the final-token hidden state. Performance is measured via Precision@1 averaged across ID and OOD splits. This setup rigorously tests the model’s ability to maintain pretrained multimodal alignment while adapting to discriminative objectives.

### H.6 Experimental Details for Reasoning-based SFT

#### Models and Datasets.

We employ Qwen 2.5-Math-7B Qwen Team [[2024](https://arxiv.org/html/2606.12883#bib.bib120)] as the backbone for long-context reasoning SFT. Training is conducted on the Mixture-of-Thoughts dataset (350k samples) HuggingFace Team [[2025](https://arxiv.org/html/2606.12883#bib.bib105)] utilizing the training recipes provided by Open-R1.

#### LoRA Configuration.

We apply adapters to all linear layers with ranks $r \in \{64, 256\}$. Furthermore, to support the generation of special tokens, we additionally unfreeze and train the embedding and output layers. LoRA-$\alpha$ uses Variant I scaling.

#### Optimization.

Models are trained for 1 epoch using the Adam optimizer. We use a peak learning rate of $4.0 \times 10^{-5}$. The effective batch size is 128. We enable gradient checkpointing and set the maximum sequence length to 32,768 tokens.

#### Evaluation.

Following the established protocols of Open-R1, we evaluate reasoning performance using the Lighteval framework Habib et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib109)]. Our evaluation spans a comprehensive suite of challenging benchmarks, including AIME 24/25 MAA [[2024](https://arxiv.org/html/2606.12883#bib.bib110),[2025](https://arxiv.org/html/2606.12883#bib.bib112)], MATH-500 Hendrycks et al. [[2021](https://arxiv.org/html/2606.12883#bib.bib113)], GPQA Diamond Rein et al. [[2024](https://arxiv.org/html/2606.12883#bib.bib118)], and LiveCodeBench v4 Jain et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib119)]. Performance is reported as Pass@1 accuracy, where the number of generation samples per prompt strictly adheres to the Open-R1 configuration to ensure a standardized and fair comparison across different methods.

### H.7 Experimental Details for Reasoning-based RL

#### Models and Datasets.

We transition to policy optimization using Group Relative Policy Optimization (GRPO) DeepSeek Team [[2025](https://arxiv.org/html/2606.12883#bib.bib8)]. We use DeepSeek-R1-Distill-Qwen (1.5B and 7B) as base models, trained on the DAPO-Math-17k dataset Yu et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib121)]. We implement the RLVR pipeline based on the PeRL framework Yin et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib124)]. This paradigm optimizes reasoning paths using outcome-based rewards.

#### LoRA Configuration.

Adapters are applied to all linear layers with a rank of $r = 64$. We utilize Variant I for LoRA-$\alpha$. To accelerate rollouts, we use a co-located vLLM generation engine Kwon et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib132)] with 0.4 memory utilization.

#### Optimization.

Training is performed with a constant learning rate of $1 \times 10^{-6}$, a KL penalty $\beta = 0$, and a total batch size of 128. For each prompt, we generate $G = 8$ rollouts. The maximum completion length is set to 16,384 tokens to test long-form stability.
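As an illustration of how the $G = 8$ rollouts per prompt enter the objective, the sketch below computes group-relative advantages in the usual GRPO style: each rollout's reward is standardized against the other rollouts of the same prompt. The use of the population standard deviation and the small `eps` term are assumptions made for this example, and the function name is hypothetical; the PeRL implementation may differ.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts.

    ASSUMPTIONS: population standard deviation; eps guards against a zero
    denominator when all rollouts receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical outcome rewards for G = 8 rollouts of a single prompt
    # (1.0 = verified correct answer, 0.0 = incorrect).
    rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

When every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal. With $\beta = 0$ there is no KL term to supplement it.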

#### Evaluation.

We assess reasoning performance on MATH-500 Hendrycks et al. [[2021](https://arxiv.org/html/2606.12883#bib.bib113)], AMC 23 MAA [[2023](https://arxiv.org/html/2606.12883#bib.bib115)], AIME 24/25 MAA [[2024](https://arxiv.org/html/2606.12883#bib.bib110),[2025](https://arxiv.org/html/2606.12883#bib.bib112)], Minerva Lewkowycz et al. [[2022](https://arxiv.org/html/2606.12883#bib.bib116)], and HMMT Balunović et al. [[2025](https://arxiv.org/html/2606.12883#bib.bib117)], using the Lighteval framework Habib et al. [[2023](https://arxiv.org/html/2606.12883#bib.bib109)]. Performance is reported as Pass@1 accuracy estimated over multiple samples.
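A minimal sketch of how Pass@1 can be estimated from multiple samples is shown below. It assumes the standard estimator in which each prompt's Pass@1 is the fraction of its sampled completions judged correct, averaged uniformly over prompts; the per-benchmark sample counts are set by the evaluation configuration and are not reproduced here, and the function name is hypothetical.

```python
def pass_at_1(correct_per_prompt):
    """Estimate Pass@1 from per-prompt lists of correctness flags.

    correct_per_prompt: one inner list per prompt, one boolean per sample.
    """
    if not correct_per_prompt or any(len(s) == 0 for s in correct_per_prompt):
        raise ValueError("every prompt needs at least one sample")
    # Fraction of correct samples for each prompt, then a uniform mean.
    per_prompt = [sum(samples) / len(samples) for samples in correct_per_prompt]
    return sum(per_prompt) / len(per_prompt)


if __name__ == "__main__":
    # Hypothetical results: two prompts, four samples each.
    results = [
        [True, False, False, False],
        [True, True, True, True],
    ]
    print(pass_at_1(results))
```

Averaging over several samples per prompt reduces the variance of the estimate relative to a single greedy or sampled completion, which matters on small benchmarks such as AIME.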