Beyond LoRA: Is Sparsity-Induced Adaptation Better?

arXiv cs.LG Papers

Summary

This paper proposes sparsity-induced adaptations to LoRA, including Cheap LoRA (cLA) and a chained circulant variant (c³LA), and provides theoretical generalization bounds along with empirical evaluations showing up to 10% training time reduction and 15% peak GPU memory savings while maintaining competitive performance.

arXiv:2606.13767v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) and its variants provide a memory- and compute-efficient alternative to full fine-tuning of pre-trained models. However, questions remain about the comparative generalizability of these approaches and how the structural restrictions on low-rank updates preserve effective adaptation performance. We present a historical framing, covering the past (full fine-tuning and original LoRA), the present (different variants of LoRA), and propose simpler, cheaper, parameter-efficient extensions by inducing sparsity within existing LoRA variants: Cheap LoRA (cLA), training a single low-rank factor with the other fixed (deterministically or, in its randomized variant, stochastically), and the chained circulant variant, ${c}^3$LA. We frame cLA as a structured instance of asymmetric LoRA, serving as a controlled column-subspace restriction of full fine-tuning. We derive information-theoretic generalization error bounds for these variants, marking one of the first endeavors in this area. Empirically, we evaluate 11 fine-tuning methods across 10 pre-trained models and 14 datasets, analyzing the fine-tuned models' performance and generalization using tools such as loss landscapes and spectral analysis. Despite the sensitivity of fine-tuned models to the pre-trained model, datasets, and other factors, our study suggests that restricting LoRA-based PEFT methods' adaptation to a sparse, structured column space remains competitive across tasks with their parameter-matched baselines while reducing up to 10% training time and peak GPU memory up to 15%, even with a na\"ive, non-optimized, sparse implementation. Our theoretical and empirical generalization measures provide a more consistent and principled approach to their cost-effective adaptation than commonly used analytical tools. Overview and code are available at: https://elicaden.github.io/Beyond_LoRA/.

Cached at: 06/15/26, 09:07 AM

# Beyond LoRA: Is Sparsity-Induced Adaptation Better?
Source: [https://arxiv.org/html/2606.13767](https://arxiv.org/html/2606.13767)
Elijah Cadenhead¹, Cristian McGee¹,³, Xin Li¹, El Houcine Bergou², Aritra Dutta¹,³

¹ School of Data, Mathematical and Statistical Sciences, University of Central Florida, United States

² College of Computing, Mohammed VI Polytechnic University (UM6P), Morocco

³ Department of Computer Science, University of Central Florida, United States

###### Abstract

Low-rank adaptation (LoRA) and its variants provide a memory- and compute-efficient alternative to full fine-tuning of pre-trained models. However, questions remain about the comparative generalizability of these approaches and how the structural restrictions on low-rank updates preserve effective adaptation performance. We present a historical framing, covering the past (full fine-tuning and original LoRA), the present (different variants of LoRA), and propose simpler, cheaper, parameter-efficient extensions by inducing sparsity within existing LoRA variants: Cheap LoRA (cLA), training a single low-rank factor with the other fixed (deterministically or, in its randomized variant, stochastically), and the chained circulant variant, $c^{3}$LA. We frame cLA as a structured instance of asymmetric LoRA, serving as a controlled column-subspace restriction of full fine-tuning. We derive information-theoretic generalization error bounds for these variants, marking one of the first endeavors in this area. Empirically, *we evaluate 11 fine-tuning methods across 10 pre-trained models and 14 datasets, analyzing the fine-tuned models' performance and generalization using tools such as loss landscapes and spectral analysis.* Despite the sensitivity of fine-tuned models to the pre-trained model, datasets, and other factors, our study suggests that restricting LoRA-based PEFT methods' adaptation to a sparse, structured column space remains competitive across tasks with their parameter-matched baselines while reducing *up to 10% training time* and *peak GPU memory up to 15%*, even with a naïve, non-optimized, sparse implementation. Our theoretical and empirical generalization measures provide a more consistent and principled approach to their cost-effective adaptation than commonly used analytical tools. GitHub repo available at: [github.com/EliCaden/Beyond\_LoRA](https://github.com/EliCaden/Beyond_LoRA).

Keywords: Parameter-Efficient Fine-Tuning · Low-Rank Adaptation · LoRA Variants · Sparse LoRA · Column-Subspace Adaptation · Generalization Bounds · Pre-trained Models

###### Contents

1. [1 Introduction](https://arxiv.org/html/2606.13767#S1)
2. [2 Fine-Tuning: The Past, Present, and Future](https://arxiv.org/html/2606.13767#S2)
    1. [2.1 The Past: Full fine-tuning (FFT) and LoRA](https://arxiv.org/html/2606.13767#S2.SS1)
    2. [2.2 The Present: Evolution of LoRA](https://arxiv.org/html/2606.13767#S2.SS2)
    3. [2.3 The Future: How Can We Achieve More Efficiency?](https://arxiv.org/html/2606.13767#S2.SS3)
3. [3 Theoretical Insights](https://arxiv.org/html/2606.13767#S3)
    1. [3.1 On the Generalization of Different LoRA Variants](https://arxiv.org/html/2606.13767#S3.SS1)
4. [4 Quantitative Evaluation](https://arxiv.org/html/2606.13767#S4)
    1. [4.1 Quality of the Fine-Tuned Models](https://arxiv.org/html/2606.13767#S4.SS1)
    2. [4.2 Generalizability of the Fine-Tuned Models](https://arxiv.org/html/2606.13767#S4.SS2)
    3. [4.3 Performance Analysis](https://arxiv.org/html/2606.13767#S4.SS3)
        1. [4.3.1 Discussion](https://arxiv.org/html/2606.13767#S4.SS3.SSS1)
5. [5 Conclusion](https://arxiv.org/html/2606.13767#S5)
6. [References](https://arxiv.org/html/2606.13767#bib)
7. [A The Present: Evolution of LoRA (Continued)](https://arxiv.org/html/2606.13767#A1)
8. [B Relationship between PaCA and cLA](https://arxiv.org/html/2606.13767#A2)
    1. [B.1 Introducing New Artifacts to PaCA](https://arxiv.org/html/2606.13767#A2.SS1)
    2. [B.2 Applying PaCA's Convergence Result to cLA](https://arxiv.org/html/2606.13767#A2.SS2)
9. [C Pseudo Code of sparsity-induced LoRA variants](https://arxiv.org/html/2606.13767#A3)
10. [D Theoretical Results](https://arxiv.org/html/2606.13767#A4)
    1. [D.1 Generalization](https://arxiv.org/html/2606.13767#A4.SS1)
        1. [D.1.1 Inequalities used](https://arxiv.org/html/2606.13767#A4.SS1.SSS1)
        2. [D.1.2 Proof of Theorem 1](https://arxiv.org/html/2606.13767#A4.SS1.SSS2)
        3. [D.1.3 Neural Network with No Activation Function: Special Case of Theorem 1](https://arxiv.org/html/2606.13767#A4.SS1.SSS3)
        4. [D.1.4 Tightness of the bounds in Theorem 1](https://arxiv.org/html/2606.13767#A4.SS1.SSS4)
        5. [D.1.5 Adapting Theorem 1 to Attention Mechanism](https://arxiv.org/html/2606.13767#A4.SS1.SSS5)
        6. [D.1.6 Adapting Theorem 1 under special cases](https://arxiv.org/html/2606.13767#A4.SS1.SSS6)
11. [E Addendum to Benchmarking and Evaluation](https://arxiv.org/html/2606.13767#A5)
    1. [E.1 Implementation Details](https://arxiv.org/html/2606.13767#A5.SS1)
    2. [E.2 The Effects of Learning Rate, Scaling Factor, and Chain Reset Frequency on Quality Metric Over Various Ranks](https://arxiv.org/html/2606.13767#A5.SS2)
        1. [E.2.1 DeepseekCoder Performance Analysis](https://arxiv.org/html/2606.13767#A5.SS2.SSS1)
    3. [E.3 Computational Cost, Memory, and Efficiency](https://arxiv.org/html/2606.13767#A5.SS3)
        1. [E.3.1 Naïve sparse implementation](https://arxiv.org/html/2606.13767#A5.SS3.SSS1)
        2. [E.3.2 Experiments](https://arxiv.org/html/2606.13767#A5.SS3.SSS2)
    4. [E.4 Performance Analysis (Continued)](https://arxiv.org/html/2606.13767#A5.SS4)
        1. [E.4.1 Loss Landscape (Continued)](https://arxiv.org/html/2606.13767#A5.SS4.SSS1)
        2. [E.4.2 Intruder Dimension implementation](https://arxiv.org/html/2606.13767#A5.SS4.SSS2)
    5. [E.5 Generalization Error (Continued)](https://arxiv.org/html/2606.13767#A5.SS5)
        1. [E.5.1 Normalized Generalization Results](https://arxiv.org/html/2606.13767#A5.SS5.SSS1)
12. [F Limitations and Discussion](https://arxiv.org/html/2606.13767#A6)
13. [G Table of Notations](https://arxiv.org/html/2606.13767#A7)

## 1 Introduction

Full fine-tuning (FFT) \[[6](https://arxiv.org/html/2606.13767#bib.bib47)\] modifies a pre-trained neural network's parameters on new datasets and adapts the network to new downstream tasks. As model sizes and datasets grow, FFT is often computationally infeasible or prohibitively expensive. Additionally, the growth of these complex models is not matched by the growth of hardware compute capacity \[[10](https://arxiv.org/html/2606.13767#bib.bib5),[67](https://arxiv.org/html/2606.13767#bib.bib11)\]. For example, the smallest variant of Llama-3 \[[1](https://arxiv.org/html/2606.13767#bib.bib6),[14](https://arxiv.org/html/2606.13767#bib.bib7)\] has 8B parameters; it requires 32 GB of GPU memory for inference and 64 GB for training with modern protocols. In contrast, the half-precision performance of the NVIDIA H100 is only about $2.4\times$ that of the NVIDIA A100, while their memory capacity remains unchanged \[[48](https://arxiv.org/html/2606.13767#bib.bib74)\].

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/teaser_images/Updated_Teaser_Good.png)

Figure 1: 3D loss landscapes of ViT-Base pretrained on ImageNet-21K and fine-tuned on ImageNet-1K. We fine-tuned this model on CIFAR-10 using different strategies, including FFT. FFT has the narrowest local minima among all the methods and yields the worst test accuracy. However, it has the least generalization error, $\mathcal{G}(\mathbf{W})$, among all the methods; see Definition [1](https://arxiv.org/html/2606.13767#Thmdefinition1) and Table [16](https://arxiv.org/html/2606.13767#A5.T16). In (d), when we superimpose the loss landscapes, FFT shows the spikiest landscape; RAC has the smoothest landscape with the highest $\mathcal{G}(\mathbf{W})$. This is counterintuitive: according to \[[34](https://arxiv.org/html/2606.13767#bib.bib17)\], a model with a spiky landscape and small-volume local minima does not generalize well.

Alternatively, parameter-efficient fine-tuning (PEFT) saves space and time, circumvents overfitting, and is widely used. Low-rank adaptation (LoRA) \[[27](https://arxiv.org/html/2606.13767#bib.bib2)\] is a PEFT method that achieves performance on par with FFT while training far fewer parameters. To mitigate LoRA's flaws, researchers proposed numerous variants, including the chain of LoRA (CoLA) \[[65](https://arxiv.org/html/2606.13767#bib.bib12)\], asymmetric LoRA \[[75](https://arxiv.org/html/2606.13767#bib.bib18)\], randomized asymmetric chain of LoRA \[[42](https://arxiv.org/html/2606.13767#bib.bib19)\], LoRA+ \[[18](https://arxiv.org/html/2606.13767#bib.bib20)\], and adaptive LoRA \[[72](https://arxiv.org/html/2606.13767#bib.bib38)\], among many others; see \[[69](https://arxiv.org/html/2606.13767#bib.bib4),[17](https://arxiv.org/html/2606.13767#bib.bib97)\].

Despite the many existing LoRA variants, it remains unclear which structured restrictions on low-rank updates preserve effective adaptation under similar parameter counts. Recent works \[[69](https://arxiv.org/html/2606.13767#bib.bib4),[55](https://arxiv.org/html/2606.13767#bib.bib22),[42](https://arxiv.org/html/2606.13767#bib.bib19)\] analyzed and compared these PEFT methods with full fine-tuning, but these benchmarks are inconclusive. Figure [1](https://arxiv.org/html/2606.13767#S1.F1) shows one such example, where generalization and loss landscape sharpness contradict our prior understanding: FFT's resultant model, despite having the spikiest landscape and narrowest valley, has the smallest generalization error, conflicting with the well-known heuristic that models with sharper minima should generalize worse \[[34](https://arxiv.org/html/2606.13767#bib.bib17),[28](https://arxiv.org/html/2606.13767#bib.bib3)\]. Current literature has a limited theoretical grasp of how these methods behave in parameter-matched comparisons, that is, which extreme sparsity and structured low-rank constraints preserve effective adaptation across tasks and models and offer better generalization, and how far these restrictions can be pushed before adaptation degrades.

In the era of resource-constrained IoT and edge deployments \[[26](https://arxiv.org/html/2606.13767#bib.bib30),[11](https://arxiv.org/html/2606.13767#bib.bib31)\], pushing parameter efficiency for sparse or structured libraries \[[53](https://arxiv.org/html/2606.13767#bib.bib32),[12](https://arxiv.org/html/2606.13767#bib.bib35)\] has become a practical imperative. For example, OpenAI's recent LLM, GPT-4.5, requires a $10\times$ increase in compute over GPT-4, yet obtains only a marginal performance improvement, which may indicate that effective parameter reduction could benefit these models \[[44](https://arxiv.org/html/2606.13767#bib.bib78)\]. Moreover, to reduce activation memory and improve sequential processing of the adapter and pretrained LoRA layers, \[[64](https://arxiv.org/html/2606.13767#bib.bib84)\] introduced partial connection adaptation (PaCA).

These ideas motivate us to explore different structured instances of LoRA that explicitly restrict learning to an established column subspace, allowing for a clearer examination of how far restricted subspace updates can be taken while maintaining competitive performance. To this end, *we propose 4 simpler, cheaper, and parameter-efficient extensions of the existing SOTA LoRA variants*: Cheap LoRA (cLA), which trains only one low-rank factor and sets the other low-rank factor deterministically; its randomized variant, random-cLA (r-cLA); its chain circulant variant, $c^{3}$LA; and its randomized chain variant, random-$c^{3}$LA. cLA and r-cLA can be interpreted as structured instances of Asymmetric LoRA that confine learning to an $r$-column subspace, enabling a direct comparison between partial column-space adaptation and alternative low-rank updates. Alternatively, they can be seen as the LoRA adaptation of PaCA, where the restricted fine-tuned columns are set to $r$ columns of the pretrained model; see Figure [2](https://arxiv.org/html/2606.13767#S3.F2). Therefore, *our proposed sparsity-induced SOTA LoRA variants act as a bridge between the two families of adapters, LoRA and PaCA*; see §[2.3](https://arxiv.org/html/2606.13767#S2.SS3) and §[B](https://arxiv.org/html/2606.13767#A2).

But which structured restrictions of low\-rank updates remain sufficient for competitive adaptation? Can confining learning to a small, structured fraction of the column space provide performance comparable to fine\-tuning all columns? Or, are there significant performance differences among these sparsified LoRA variants? If so, how do these differences vary across PEFT methods, hyperparameter configurations, and models? To answer these questions, we make the following contributions:

Theoretical insights through generalization (§[3](https://arxiv.org/html/2606.13767#S3)). *Generalizability* measures how well a model's loss on its training dataset represents its loss over the entire feature space, reflecting the model's capacity to avoid overfitting. Since our questions concern when parameter-reduced fine-tuning subspaces remain competitive, we use generalization bounds to connect structural restrictions (such as adapter rank, chain length, if any, layerwise input-output dimensions, training bitwidth, fine-tuning dataset size, etc.) to overfitting risk. To this end, we use an *information-theoretic approach* to measure the generalization error bounds of the PEFT methods discussed in this paper, including PaCA. See the summary of results in Table [1](https://arxiv.org/html/2606.13767#S3.T1).

Quantitative evaluation (§[4](https://arxiv.org/html/2606.13767#S4)). We evaluate FFT, 9 LoRA-based PEFTs, and PaCA, encompassing 10 different pretrained models, on 4 fine-tuning tasks: natural language processing, image recognition, code generation, and logical reasoning. We report a rich set of metrics, including accuracy, spectral behavior, 3D loss landscape, throughput, runtime, and empirical generalization error. While it is infeasible to be exhaustive, our comprehensive benchmarking offers broadly applicable insights.

## 2 Fine-Tuning: The Past, Present, and Future

FFT updates all parameters of deep networks, an approach that becomes increasingly impractical as model size and deployment multiplicity grow. This led to the advent of LoRA and its variants. Based on their evolutionary timeline, we divide this section into three phases. The *past* covers FFT and the original LoRA, while different LoRA variants dominate the *present*. Finally, extreme compute efficiency characterizes the *future*, where we induce sparsity in SOTA LoRA variants.

### 2.1 The Past: Full fine-tuning (FFT) and LoRA

Pre-training. Without loss of generality, consider an $L$-layer, fully-connected neural network whose layers are $\{W^{i}\}_{i=1}^{L}$, where $W^{i}\in\mathbb{R}^{n_{i}\times m_{i}}$ are trainable weights. Let $x\in\mathbb{R}^{m_{1}}$ be the input and $\mathbf{W}=(W^{1},\ldots,W^{L})$. The network $f_{\mathbf{W}}(\cdot):\mathbb{R}^{d_{\rm in}}\rightarrow\mathbb{R}^{d_{\rm out}}$ is of the form:

$$f_{\mathbf{W}}(x)=\sigma_{L}\bigl(W^{L}\cdots\sigma_{2}\bigl(W^{2}\sigma_{1}(W^{1}x)\bigr)\cdots\bigr), \tag{1}$$

where $\sigma_{i}(\cdot):\mathbb{R}^{n_{i}}\to\mathbb{R}^{n_{i}}$ is a nonlinear activation function for the $i^{\rm th}$ layer. Given a pre-training set, $N_{\rm pre}:=\{(x_{i},y_{i})\}\subset\mathbb{R}^{m_{1}}\times\mathbb{R}^{d_{\rm out}}$, and the loss function, $\ell_{\rm pre}(\cdot):\mathbb{R}^{d_{\rm out}}\times\mathbb{R}^{d_{\rm out}}\rightarrow\mathbb{R}$, we train the network by solving:

$$\mathbf{W}_{0}\approx\operatorname{argmin}_{\mathbf{W}}\frac{1}{|N_{\rm pre}|}\sum_{i=1}^{|N_{\rm pre}|}\ell_{\rm pre}(f_{\mathbf{W}}(x_{i}),y_{i}), \tag{2}$$

obtaining the trained weights $\mathbf{W}_{0}=[W_{0}^{1},\cdots,W_{0}^{L}]$. Sophisticated DNNs, such as CNNs, RNNs, Transformers, etc., can be adapted with some modification to ([1](https://arxiv.org/html/2606.13767#S2.E1)).
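As a minimal sketch of the composition in (1), assuming `tanh` for every $\sigma_i$ and toy layer sizes (both are illustrative choices, not taken from the paper), the forward pass can be written as:

```python
import numpy as np

def forward(weights, x, activation=np.tanh):
    """Evaluate f_W(x) = sigma_L(W^L ... sigma_1(W^1 x)) for a list of layer matrices."""
    h = x
    for W in weights:              # W^1 first, W^L last
        h = activation(W @ h)      # sigma_i applied to the layer output in R^{n_i}
    return h

rng = np.random.default_rng(0)
layer_shapes = [(5, 4), (3, 5), (2, 3)]   # (n_i, m_i) with m_{i+1} = n_i; hypothetical sizes
weights = [rng.normal(size=s) for s in layer_shapes]
x = rng.normal(size=4)                    # x in R^{m_1}
y = forward(weights, x)                   # y in R^{d_out}, here d_out = 2
```

The list `weights` plays the role of $\mathbf{W}=(W^{1},\ldots,W^{L})$; fine-tuning methods in the rest of this section differ only in how they modify these matrices.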

FFT \[[6](https://arxiv.org/html/2606.13767#bib.bib47),[24](https://arxiv.org/html/2606.13767#bib.bib13),[27](https://arxiv.org/html/2606.13767#bib.bib2),[69](https://arxiv.org/html/2606.13767#bib.bib4)\]. Given pre-trained weights, $\mathbf{W}_{0}$, FFT updates each DNN layer with a corresponding $\Delta W^{i}$ to adapt the model to a downstream task on a domain-specific training dataset, $N:=\{(x_{i}^{\prime},y_{i}^{\prime})\}$. Denote $\Delta\mathbf{W}$ as the update, and define $\mathbf{W}_{0}+\Delta\mathbf{W}:=[W_{0}^{1}+\Delta W^{1},\cdots,W_{0}^{L}+\Delta W^{L}]$. Given a loss function, $\ell(\cdot):\mathbb{R}^{d_{\rm out}}\times\mathbb{R}^{d_{\rm out}}\rightarrow\mathbb{R}$, FFT updates the model weights via:

$$\Delta\hat{\mathbf{W}}\approx\operatorname{argmin}_{\Delta\mathbf{W}}\frac{1}{|N|}\sum_{i=1}^{|N|}\ell(f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(x_{i}^{\prime}),y_{i}^{\prime}), \tag{3}$$

and obtains the fine-tuned model, $f_{\mathbf{W}_{0}+\Delta\hat{\mathbf{W}}}$, adapted to the downstream task. The computational overhead of FFT can be prohibitive, e.g., when fine-tuning LLMs for specific tasks. In contrast, parameter-efficient fine-tuning (PEFT) trains orders of magnitude fewer parameters while often attaining performance comparable to FFT \[[24](https://arxiv.org/html/2606.13767#bib.bib13),[69](https://arxiv.org/html/2606.13767#bib.bib4)\].

LoRA \[[27](https://arxiv.org/html/2606.13767#bib.bib2)\] is a popular PEFT method that replaces the layer-wise updates $\Delta W^{i}$ with a low-rank representation $B^{i}A^{i}$, such that $B^{i}\in\mathbb{R}^{n_{i}\times r}$, $A^{i}\in\mathbb{R}^{r\times m_{i}}$, and $r\ll\min(m_{i},n_{i})$ for all $i\in[L]$. Denote $\mathbf{W}_{0}+\frac{\alpha}{r}\mathbf{BA}:=[W_{0}^{1}+\frac{\alpha}{r}B^{1}A^{1},\cdots,W_{0}^{L}+\frac{\alpha}{r}B^{L}A^{L}]$, where $\alpha>0$ is a scaling factor. LoRA initializes $B^{i}=0$, $A^{i}\sim\mathcal{N}(0,0.02^{2})$, and solves:

$$(\hat{\mathbf{B}},\hat{\mathbf{A}})\approx\operatorname{argmin}_{\mathbf{B},\mathbf{A}}\frac{1}{|N|}\sum_{i=1}^{|N|}\ell(f_{\mathbf{W}_{0}+\frac{\alpha}{r}\mathbf{BA}}(x_{i}^{\prime}),y_{i}^{\prime}), \tag{4}$$

to obtain $B^{i},A^{i}$ for each tuned layer. LoRA need not be applied to all layers; some layers can remain frozen. LoRA substantially reduces trainable parameters and saves training time, and the update $\mathbf{BA}$ can be merged into the base weights to avoid additional inference latency. LoRA is compute- and storage-efficient, but generalizes worse than FFT \[[55](https://arxiv.org/html/2606.13767#bib.bib22)\]; LoRA may also fail \[[30](https://arxiv.org/html/2606.13767#bib.bib29)\].
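The single-layer mechanics described above can be sketched in NumPy. This is an illustration under assumed toy sizes, with a random matrix standing in for a trained $B$; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, alpha = 6, 8, 2, 4.0            # toy sizes; in practice r << min(n, m)

W0 = rng.normal(size=(n, m))             # frozen pre-trained weight
A = rng.normal(scale=0.02, size=(r, m))  # A ~ N(0, 0.02^2)
B = np.zeros((n, r))                     # B = 0, so the initial update is zero
x = rng.normal(size=m)

def lora_forward(W0, B, A, x, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x) without forming the product B A."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

# At initialization the adapted layer equals the pre-trained layer.
assert np.allclose(lora_forward(W0, B, A, x, alpha, r), W0 @ x)

# Random values stand in for a trained B (assumption for illustration only).
B = rng.normal(size=(n, r))
W_merged = W0 + (alpha / r) * (B @ A)    # merge the update into the base weights
assert np.allclose(lora_forward(W0, B, A, x, alpha, r), W_merged @ x)

trainable_lora = r * (n + m)             # entries of B and A
trainable_fft = n * m                    # entries of Delta W
```

After merging, inference uses a single dense product with `W_merged`, which is why the adapter adds no latency. The trainable-parameter count drops from $n_i m_i$ to $r(n_i+m_i)$ per layer.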

### 2.2 The Present: Evolution of LoRA

Many variants of LoRA exist to enhance efficiency while addressing its weaknesses. They excel in certain tasks but are less optimal in others. Empirical evidence suggests that no single fine-tuning method, FFT included, is the best fit for all cases, and that different variations succeed in different circumstances \[[69](https://arxiv.org/html/2606.13767#bib.bib4)\]. This explains why new variants of LoRA continue to emerge. Below, we discuss a few popular LoRA variants.

Chain of LoRA (CoLA) \[[65](https://arxiv.org/html/2606.13767#bib.bib12)\] improves LoRA's performance without substantially increasing compute or memory costs. After fine-tuning $\mathbf{B}^{1}\mathbf{A}^{1}$ for the downstream task to obtain $\hat{\mathbf{B}}^{1}\hat{\mathbf{A}}^{1}$, CoLA merges $\hat{\mathbf{B}}^{1}\hat{\mathbf{A}}^{1}$ into the base weights and continues training with a new $\mathbf{B}^{2}\mathbf{A}^{2}$ on the same task, treating $\mathbf{W}_{0}+\frac{\alpha}{r}\hat{\mathbf{B}}^{1}\hat{\mathbf{A}}^{1}$ as the base weights. Denote $\mathbf{W}^{(k,BA)}:=\mathbf{W}_{0}+\sum_{j=1}^{k}\frac{\alpha}{r}\hat{\mathbf{B}}^{j}\hat{\mathbf{A}}^{j}$ and $\mathbf{W}^{(0,BA)}=\mathbf{W}_{0}$ for convenience. CoLA of chain length $k$ solves:

$$\text{For } j\in[k],\quad \hat{\mathbf{B}}^{j}\hat{\mathbf{A}}^{j}\approx\operatorname{argmin}_{\mathbf{B}^{j}\mathbf{A}^{j}}\left[\mathcal{L}\left(\mathbf{W}_{0}^{(j-1,BA)}+\frac{\alpha}{r}\hat{\mathbf{B}}^{j}\hat{\mathbf{A}}^{j}\right)\right] \tag{5}$$

to obtain the fine-tuned model, $f_{\mathbf{W}^{(k,BA)}}$. CoLA simulates a higher-rank approximation of a single LoRA update \[[38](https://arxiv.org/html/2606.13767#bib.bib43)\] and claims to reduce LoRA's failure \[[30](https://arxiv.org/html/2606.13767#bib.bib29)\].
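To illustrate how merging and restarting lets a chain of rank-$r$ links express a higher-rank update, the sketch below replaces gradient training with a truncated SVD of the remaining residual. The target `W_star` and the SVD-based `fit_link` are hypothetical stand-ins chosen so the example is deterministic; CoLA itself trains each link on the task loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, k, alpha = 10, 12, 2, 3, 4.0

W0 = rng.normal(size=(n, m))
# Hypothetical ideal update of rank k*r, which one rank-r adapter cannot express.
target_delta = rng.normal(size=(n, k * r)) @ rng.normal(size=(k * r, m))
W_star = W0 + target_delta

def fit_link(W_base, W_star, r, alpha):
    """Stand-in for training one link: best rank-r fit of the residual via SVD."""
    U, s, Vt = np.linalg.svd(W_star - W_base, full_matrices=False)
    B = (r / alpha) * U[:, :r] * s[:r]   # fold singular values and the scaling into B
    A = Vt[:r, :]
    return B, A

W = W0.copy()
residuals = [np.linalg.norm(W_star - W)]
for j in range(k):
    B_j, A_j = fit_link(W, W_star, r, alpha)
    W = W + (alpha / r) * (B_j @ A_j)    # merge link j into the base weights
    residuals.append(np.linalg.norm(W_star - W))
```

Each link reduces the residual, and after $k$ links the accumulated update has rank up to $kr$ while only $r(n+m)$ parameters are trainable at any time.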

Asymmetric LoRA \[[75](https://arxiv.org/html/2606.13767#bib.bib18)\] modifies LoRA adaptation for each layer by freezing one of the low-rank matrices (conventionally, $A$ is frozen at $A_{0}$), initializing the frozen matrix via a Normal distribution and the trainable matrix to 0. It solves:

$$\hat{\mathbf{B}}\approx\operatorname{argmin}_{\mathbf{B}}\left[\mathcal{L}\left(\mathbf{W}_{0}+\frac{\alpha}{r}\mathbf{B}\mathbf{A}_{0}\right)=\frac{1}{|N|}\sum_{i=1}^{|N|}\ell(f_{\mathbf{W}_{0}+\frac{\alpha}{r}\mathbf{B}\mathbf{A}_{0}}(x_{i}),y_{i})\right], \tag{6}$$

to obtain the fine-tuned model $f_{\mathbf{W}_{0}+\hat{\mathbf{B}}\mathbf{A}_{0}}$. Under trainable-parameter constraints, Asymmetric LoRA is competitive with LoRA \[[75](https://arxiv.org/html/2606.13767#bib.bib18)\] and retains the Lipschitz smoothness of the loss function, which LoRA does not \[[57](https://arxiv.org/html/2606.13767#bib.bib28)\].
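A small numerical sketch of (6) for a single linear layer with squared loss follows. The data, the loss, and the step-size rule are assumptions for illustration: with $A_0$ frozen, this toy loss is quadratic in $B$, so a Lipschitz constant of its gradient can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, alpha, N = 5, 8, 2, 4.0, 64
s = alpha / r

W0 = rng.normal(size=(n, m))
A0 = rng.normal(size=(r, m))             # frozen factor, drawn once from a Normal distribution
B = np.zeros((n, r))                     # trainable factor starts at zero

X = rng.normal(size=(m, N))              # toy inputs, one column per sample
Y = (W0 + 0.1 * rng.normal(size=(n, m))) @ X   # toy targets from a perturbed layer

def loss(B):
    R = (W0 + s * B @ A0) @ X - Y
    return 0.5 * np.sum(R * R) / N

def grad_B(B):
    R = (W0 + s * B @ A0) @ X - Y
    return s * (R @ X.T @ A0.T) / N      # only B receives a gradient; A0 is never updated

# Gradient Lipschitz constant of the quadratic toy loss in B.
Z = A0 @ X
lipschitz = s**2 * np.linalg.eigvalsh(Z @ Z.T / N).max()
step = 1.0 / lipschitz

losses = [loss(B)]
for _ in range(50):
    B = B - step * grad_B(B)
    losses.append(loss(B))
```

With step size $1/\text{lipschitz}$, gradient descent on this toy problem cannot increase the loss, which is the practical benefit of smoothness in the trainable factor.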

Randomized Asymmetric Chain of LoRA (RAC-LoRA) \[[42](https://arxiv.org/html/2606.13767#bib.bib19)\] combines Asymmetric LoRA and CoLA. RAC-LoRA fixes one of the low-rank matrices (conventionally $A$), initializing it via some fixed distribution of matrices $\mathcal{D}$, and sets the trainable one to 0. Like CoLA, the trained $\hat{\mathbf{B}}^{1}\mathbf{A}^{1}_{0}$ is then merged into the base weights, and a new $\mathbf{BA}_{0}$ is trained on the same task. Denote $\mathbf{W}^{(k,B)}:=\mathbf{W}_{0}+\sum_{j=1}^{k}\frac{\alpha}{r}\hat{\mathbf{B}}^{j}\mathbf{A}_{0}^{j}$ and $\mathbf{W}^{(0,B)}=\mathbf{W}_{0}$. RAC-LoRA of chain length $k$ solves:

$$\text{For } j\in[k],\quad \hat{\mathbf{B}}^{j}\approx\operatorname{argmin}_{\mathbf{B}^{j}}\left[\mathcal{L}\left(\mathbf{W}_{0}^{(j-1,B)}+\frac{\alpha}{r}\hat{\mathbf{B}}^{j}\mathbf{A}_{0}^{j}\right)\right] \tag{7}$$

to obtain the fine-tuned model $f_{\mathbf{W}^{(k,B)}}$.
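The chain in (7) can be sketched on a toy matrix-fitting problem. Here each link's training is replaced by a closed-form least-squares solve for $B^j$ given a freshly sampled frozen $A_0^j$; the target `W_star`, the standard Normal choice for $\mathcal{D}$, and the least-squares stand-in are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, k, alpha = 6, 10, 2, 4, 4.0
s = alpha / r

W0 = rng.normal(size=(n, m))
W_star = W0 + rng.normal(size=(n, m))      # hypothetical fine-tuning target

W = W0.copy()
residuals = [np.linalg.norm(W_star - W)]
for j in range(k):
    A0_j = rng.normal(size=(r, m))          # fresh frozen factor, D = standard Normal here
    # Stand-in for training B^j from zero: argmin_B ||(W_star - W) - s * B @ A0_j||_F.
    B_j = (W_star - W) @ np.linalg.pinv(A0_j) / s
    W = W + s * (B_j @ A0_j)                # merge link j into the base weights
    residuals.append(np.linalg.norm(W_star - W))
```

Each link removes the part of the residual lying in the row space of its random $A_0^j$, so the residual shrinks link by link even though only $B^j$ is ever trained.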

LoRA+ \[[18](https://arxiv.org/html/2606.13767#bib.bib20)\] applies separate learning rates $\{\gamma_{B}^{i},\gamma_{A}^{i}\}$ to the adapter matrices $\{B^{i},A^{i}\}$ of each layer, respectively, and keeps the same structure as LoRA. LoRA+ prioritizes a substantially higher learning rate ($2$ to $16\times$) for $B$. We discuss some other LoRA variants in §[A](https://arxiv.org/html/2606.13767#A1).
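The two-learning-rate idea can be sketched with a plain gradient step. The helper `lora_plus_step`, the placeholder gradients, and the use of vanilla gradient descent are illustrative assumptions; the optimizer used in practice may differ.

```python
import numpy as np

def lora_plus_step(B, A, grad_B, grad_A, lr_A, ratio):
    """One plain gradient step with learning rate lr_A for A and ratio * lr_A for B."""
    lr_B = ratio * lr_A
    return B - lr_B * grad_B, A - lr_A * grad_A

B, A = np.ones((3, 2)), np.ones((2, 4))
grad_B, grad_A = np.ones_like(B), np.ones_like(A)   # placeholder gradients
B_new, A_new = lora_plus_step(B, A, grad_B, grad_A, lr_A=0.01, ratio=16.0)
```

With `ratio=16.0`, the upper end of the range quoted above, $B$ moves sixteen times further than $A$ for gradients of equal size.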

### 2.3 The Future: How Can We Achieve More Efficiency?

With rapidly increasing model dimensionality, memory, and adaptation costs, we characterize this phase as a key next evolutionary step for LoRA: maximizing efficiency while maintaining parity with current LoRA variants. The observation that training $B$ generally performs better \[[75](https://arxiv.org/html/2606.13767#bib.bib18)\], together with insights from structured chaining methods \[[65](https://arxiv.org/html/2606.13767#bib.bib12),[42](https://arxiv.org/html/2606.13767#bib.bib19)\], leads us to two simple variants that are easy to analyze and implement, where we postulate that the update of the pre-trained parameter can be restricted to $r$ columns of $B$.

(*i*) Cheap LoRA (cLA). Stemming from Asymmetric LoRA \[[75](https://arxiv.org/html/2606.13767#bib.bib18)\], in cLA the fixed matrix, $A^{i}$, for each layer $i$, is set to be an $r\times r$ identity matrix concatenated with $\mathbf{0}_{r\times(m_{i}-r)}$, that is, $A^{i}=\left[I_{r}\mid\mathbf{0}_{r\times(m_{i}-r)}\right]\in\mathbb{R}^{r\times m_{i}}$. For each layer, with $W^{i}\in\mathbb{R}^{n_{i}\times m_{i}}$ and $B^{i}\in\mathbb{R}^{n_{i}\times r}$, we have $\Delta W^{i}=B^{i}\left[I_{r}\mid\mathbf{0}_{r\times(m_{i}-r)}\right]=\left[B^{i}\mid\mathbf{0}_{n_{i}\times(m_{i}-r)}\right]$. Denote $\mathbf{B}^{c}$ as the layer-wise update with $B^{i},A^{i}$ chosen above, and $\mathbf{W}_{0}+\frac{\alpha}{r}\mathbf{B}^{c}:=\left[W_{0}^{1}+\frac{\alpha}{r}B^{1}\left[I_{r}\mid\mathbf{0}_{r\times(m_{1}-r)}\right],\cdots,W_{0}^{L}+\frac{\alpha}{r}B^{L}\left[I_{r}\mid\mathbf{0}_{r\times(m_{L}-r)}\right]\right]$. Then cLA solves:

$$\hat{\mathbf{B}}^{c}\approx\operatorname{argmin}_{\mathbf{B}^{c}}\frac{1}{|N|}\sum_{i=1}^{|N|}\ell(f_{\mathbf{W}_{0}+\frac{\alpha}{r}\mathbf{B}^{c}}(x_{i}^{\prime}),y_{i}^{\prime}). \tag{8}$$

We consider two instantiations of the fixed factor, deterministic (*cLA*) and random (*random-cLA*); in the latter, we randomly permute the columns of $A^{i}$ at initialization. Empirical results show that the deterministic choice suffices; the randomized variant does not yield better performance.
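The identity $\Delta W^{i}=[B^{i}\mid\mathbf{0}]$ and its column-permuted counterpart can be checked numerically. The helper name `cla_fixed_factor` and the toy sizes are hypothetical, and a random matrix stands in for a trained $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, alpha = 6, 8, 2, 4.0

def cla_fixed_factor(r, m, perm=None):
    """Return A = [I_r | 0]; if perm is given, permute its columns (random-cLA)."""
    A = np.hstack([np.eye(r), np.zeros((r, m - r))])
    return A if perm is None else A[:, perm]

B = rng.normal(size=(n, r))      # stand-in for a trained factor
x = rng.normal(size=m)

# Deterministic cLA: the update is B padded with zero columns.
A_det = cla_fixed_factor(r, m)
delta_det = B @ A_det
assert np.allclose(delta_det, np.hstack([B, np.zeros((n, m - r))]))
# The adapter output only reads the first r input coordinates.
assert np.allclose(B @ (A_det @ x), B @ x[:r])

# random-cLA: permuting the columns of A selects r other input coordinates.
perm = rng.permutation(m)
A_rand = cla_fixed_factor(r, m, perm)
idx = np.argsort(perm)[:r]       # idx[i] is the column of A_rand holding the 1 of row i
assert np.allclose(B @ (A_rand @ x), B @ x[idx])
```

Because $A^{i}$ is a selector, the product $A^{i}x$ reduces to indexing (`x[:r]` or `x[idx]`), which is one way a sparse implementation could skip the dense $r\times m_i$ multiplication.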

(*ii*) Circulant Chain of Cheap LoRA ($c^{3}$LA). As noted in CoLA \[[65](https://arxiv.org/html/2606.13767#bib.bib12)\] and RAC-LoRA \[[42](https://arxiv.org/html/2606.13767#bib.bib19)\], chaining LoRA modules leverages repeated initializations to avoid poor minima. We extend this principle to cLA with a structured chaining, $c^{3}$LA. This method shifts the identity $I_{r}$ in each matrix $\left[I_{r}\mid\mathbf{0}_{r\times(m_{i}-r)}\right]$ by $r$ columns to the right with each new chain. That is, starting with $\left[I_{r}\mid\mathbf{0}_{r\times(m_{i}-r)}\right]$, the next chain is $\left[\mathbf{0}_{r\times r}\mid I_{r}\mid\mathbf{0}_{r\times(m_{i}-2r)}\right]$, and so on. For $j\in[k]$, let $\mathbf{B}^{c^{3}}$ denote $c^{3}$LA's update and denote $\mathbf{W}^{(k,B^{c^{3}})}:=\mathbf{W}_{0}+\sum_{j=1}^{k}\frac{\alpha}{r}\hat{\mathbf{B}}^{c^{3},j}$, and $\mathbf{W}^{(0,B^{c^{3}})}=\mathbf{W}_{0}$. Then $c^{3}$LA of chain length $k$ solves:

$$\hat{\mathbf{B}}^{c^{3},j}\approx\operatorname{argmin}_{\mathbf{B}^{c^{3},j}}\mathcal{L}\left(\mathbf{W}_{0}^{(j-1,B^{c^{3}})}+\frac{\alpha}{r}\hat{\mathbf{B}}^{c^{3},j}\right), \tag{9}$$

to obtain the fine-tuned model $f_{\mathbf{W}^{(k,B^{c^{3}})}}$ for a chain of length $k$; see ([6](https://arxiv.org/html/2606.13767#S2.E6)) for the definition of $\mathcal{L}$. Given sufficient epochs and chain length, this ensures we can update all elements in each $W_{0}$. We formalize this in the following proposition.

###### Proposition 1\.

Let $k\in\mathbb{N}$ be such that $d_{in}=kr$. Let $E$ be the total number of epochs used in $c^{3}$LA fine-tuning. Then, by creating a new chain every $\left\lfloor\frac{E}{k}\right\rfloor$ epochs, $c^{3}$LA updates each element in $W_{0}$.

The intuition behind $c^{3}$LA goes beyond chaining cheap LoRA modules; its structured shifts expand the representational capacity of the learned $B$ matrices. We provide pseudocode of our variants in §[C](https://arxiv.org/html/2606.13767#A3).
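The column coverage behind Proposition 1 can be illustrated with a small NumPy sketch. It is not the authors' implementation (their pseudocode is in §C): the helper names `fixed_factor` and `chained_delta`, the toy sizes, the modular wrap-around, and the random matrices standing in for trained $B$ factors are all assumptions made for illustration.

```python
import numpy as np

def fixed_factor(r, m, chain=0):
    """Fixed r x m factor [0 | I_r | 0] whose identity block starts at column chain*r."""
    A = np.zeros((r, m))
    # Wrap-around modulo m is an assumption; it is never triggered when m = k*r and chain < k.
    cols = (chain * r + np.arange(r)) % m
    A[np.arange(r), cols] = 1.0
    return A

def chained_delta(B_list, r, m, alpha=1.0):
    """Accumulate (alpha / r) * B_j @ A_j over chains j = 0, ..., len(B_list) - 1."""
    n = B_list[0].shape[0]
    delta = np.zeros((n, m))
    for j, B in enumerate(B_list):
        delta += (alpha / r) * (B @ fixed_factor(r, m, chain=j))
    return delta

r, m, n = 2, 6, 3            # toy sizes with m = k * r
k = m // r                   # chain length needed to cover every column
rng = np.random.default_rng(0)
B_list = [rng.standard_normal((n, r)) for _ in range(k)]  # stand-ins for trained factors

single = chained_delta(B_list[:1], r, m)  # one chain, as in plain cLA
full = chained_delta(B_list, r, m)        # k chains, as in c3LA

touched_single = np.flatnonzero(np.abs(single).sum(axis=0))
touched_full = np.flatnonzero(np.abs(full).sum(axis=0))
print("columns touched by one chain:", touched_single)
print("columns touched by k chains:", touched_full)
```

With a single chain only the first $r$ columns of the update are nonzero, while $k = m/r$ shifted chains reach every column, which is the situation the proposition describes.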

Connection with partial connection adaptation (PaCA) [[64](https://arxiv.org/html/2606.13767#bib.bib84)]. Although stemming from different research questions, our sparsity-induced LoRA variants and PaCA are closely related, as both study restricted column updates of a pre-trained model. PaCA directly updates a randomly chosen subset of the pretrained model to address training-time and activation-memory costs, whereas cLA, which is sparsity-induced Asymmetric LoRA, remains within the LoRA PEFT family and uses a deterministic, structured column-subspace restriction. The two therefore differ in where the restriction is imposed (directly inside the pre-trained backbone vs. via a LoRA parameterization) and in the motivation behind their construction. Their similarities suggest that many results for cLA and its variants can be lifted and applied to PaCA; see a detailed discussion in §[B](https://arxiv.org/html/2606.13767#A2).

Empirically, our sparsity-induced LoRA variants and PaCA behave similarly; see Tables [4](https://arxiv.org/html/2606.13767#A2.T4)-[5](https://arxiv.org/html/2606.13767#A2.T5). The main benefit of PaCA is that it avoids the additional cost of running the forward pass through the LoRA module entirely; in comparison to LoRA, per layer, PaCA computes $Wx$, not $Wx+B(Ax)$. However, from a theoretical perspective, our sparsity-induced variants serve as a bridge between LoRA and PaCA (see Figure [2](https://arxiv.org/html/2606.13767#S3.F2)), facilitating bidirectional knowledge transfer between these two families. In §[3.1](https://arxiv.org/html/2606.13767#S3.SS1), we show that our LoRA-based generalization results can be applied directly to PaCA. LoRA has a growing literature on different variants and their nonconvex convergence; see [[42](https://arxiv.org/html/2606.13767#bib.bib19),[57](https://arxiv.org/html/2606.13767#bib.bib28),[45](https://arxiv.org/html/2606.13767#bib.bib76)]. These results can also be directly adapted to PaCA if one considers it as a sparsity-induced extension of the LoRA family. Interestingly, we can add a well-known performance enhancer of the LoRA PEFT family, chain construction, to PaCA; see §[B.1](https://arxiv.org/html/2606.13767#A2.SS1). Similarly, PaCA's loss convergence result in Theorem [2](https://arxiv.org/html/2606.13767#Thmtheorem2) can be adapted for cLA, and so on; see Theorem [3](https://arxiv.org/html/2606.13767#Thmtheorem3). We note that LoRI [[70](https://arxiv.org/html/2606.13767#bib.bib77)] shares some similarities with cLA, as it keeps the projection matrices $A$ fixed as random projections while sparsifying the $B$ matrices with task-specific masks. However, the construction of cLA is general and task-agnostic.
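The relation between the two forward passes can be checked numerically. The sketch below assumes the deterministic factor $A=[I_r \mid \mathbf{0}]$ from the cLA construction and uses hypothetical toy dimensions; it is not PaCA's implementation, which selects columns at random.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r, alpha = 4, 8, 2, 4.0   # hypothetical layer sizes, rank, and scaling

W0 = rng.standard_normal((n, m))                   # frozen pretrained weight
B = rng.standard_normal((n, r))                    # stand-in for the trained factor
A = np.hstack([np.eye(r), np.zeros((r, m - r))])   # fixed factor [I_r | 0]
x = rng.standard_normal(m)

# LoRA-style forward pass: W0 x + (alpha / r) * B (A x)
lora_style = W0 @ x + (alpha / r) * (B @ (A @ x))

# Because A only selects the first r inputs, A x equals x[:r]
sliced = W0 @ x + (alpha / r) * (B @ x[:r])

# Merged view: only r columns of the backbone change, so a single matmul suffices
W_merged = W0.copy()
W_merged[:, :r] += (alpha / r) * B
merged = W_merged @ x

print(np.allclose(lora_style, sliced), np.allclose(lora_style, merged))
```

All three computations agree, which is one way to see why a column-restricted LoRA update and a direct partial update of the backbone describe the same set of models.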

## 3 Theoretical Insights

| Variant | Chaining? | $\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\Phi_{\mathbf{W}_{0}}+$ |
|---|---|---|
| LoRA [[27](https://arxiv.org/html/2606.13767#bib.bib2)] | × | $\sqrt{\tfrac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}(m_{i}+n_{i})}{\lvert N\rvert}}$ |
| LoRA+ [[18](https://arxiv.org/html/2606.13767#bib.bib20)] | × | $\sqrt{\tfrac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}(m_{i}+n_{i})}{\lvert N\rvert}}$ |
| Asym-LoRA [[75](https://arxiv.org/html/2606.13767#bib.bib18)] | × | $\sqrt{\tfrac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}n_{i}}{\lvert N\rvert}}$ |
| CoLA [[65](https://arxiv.org/html/2606.13767#bib.bib12)] | ✓ | $\sqrt{\tfrac{2rq\sigma^{2}k\ln 2\sum_{i=1}^{L}(m_{i}+n_{i})}{\lvert N\rvert}}$ |
| RAC [[42](https://arxiv.org/html/2606.13767#bib.bib19)] | ✓ | $\sqrt{\tfrac{2rq\sigma^{2}k\ln 2\sum_{i=1}^{L}n_{i}}{\lvert N\rvert}}$ |
| *cLA & PaCA* (This paper, [[64](https://arxiv.org/html/2606.13767#bib.bib84)]) | × | $\sqrt{\tfrac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}n_{i}}{\lvert N\rvert}}$ |
| $c^{3}$LA | ✓ | $\sqrt{\tfrac{2rq\sigma^{2}k\ln 2\sum_{i=1}^{L}n_{i}}{\lvert N\rvert}}$ |

Table 1: Generalization error upper bounds of LoRA variants. The expression $\Phi_{\mathbf{W}_{0}}$ is defined in Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1). Here, $r$ is the adapter rank, $k$ the chain length, $|N|$ the size of the fine-tuning dataset, $q$ the bitwidth, and $(m_{i},n_{i})$ the (input, output) dimensions of layer $i$. The loss $\mathcal{L}$ is $\sigma$-sub-Gaussian (Assumption [6](https://arxiv.org/html/2606.13767#Thmassumption6)).

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/new_fig.png)

Figure 2: Evolution of fine-tuning methods. LoRA falls under the PEFT family (Family A). PaCA fine-tunes randomly selected columns within pretrained weights to improve training speed and reduce overhead (Family B). LoRA has different variants (e.g., Asym, CoLA); RAC is a combination of various techniques applied to LoRA. Our sparsity-induced variants (cLA, $c^{3}$LA, and their randomized forms) create a bridge between reparameterization (Family A) and partial fine-tuning (Family B).

In this section, we derive generalization error upper bounds for the PEFT methods of this paper under an information-theoretic framework [[66](https://arxiv.org/html/2606.13767#bib.bib72),[54](https://arxiv.org/html/2606.13767#bib.bib15)] with our layerwise setup, where each layer's adapters are updated using gradient descent (GD).

### 3.1 On the Generalization of Different LoRA Variants

Let $\mathcal{X}\times\mathcal{Y}$ be the product of an input space and a label space, and let $\nu$ be the distribution of pairs $(x,y)\in\mathcal{X}\times\mathcal{Y}$. Let $N=\{(x_{i},y_{i})\}_{i=1}^{|N|}$ represent the training dataset, where each $(x_{i},y_{i})$ is drawn i.i.d. from $\nu$. Given a *hypothesis*, $f_{\mathbf{W}}(\cdot):\mathcal{X}\to\mathcal{Y}$, and a nonnegative *loss function*, $\ell(\cdot):\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$, the *empirical risk* of a hypothesis on the dataset is defined as $\mathcal{L}(\mathbf{W}):=\tfrac{1}{|N|}\sum_{i=1}^{|N|}\ell(f_{\mathbf{W}}(x_{i}),y_{i})$. The *true risk* of the hypothesis $f_{\mathbf{W}}(\cdot)$ is defined as $\hat{\mathcal{L}}_{\rm global}(\mathbf{W}):=\mathbb{E}_{(X,Y)\sim\nu}[\ell(f_{\mathbf{W}}(X),Y)]$. With the above setup, we define the *generalization error*, which tells us how well the hypothesis $f_{\mathbf{W}}$ generalizes from the training sample to the underlying population distribution.

###### Definition 1\.

(Generalization Error [[66](https://arxiv.org/html/2606.13767#bib.bib72)]) The generalization error, $\mathcal{G}(\mathbf{W})$, is the difference between a hypothesis's true risk and its empirical risk on the training dataset, i.e., $\mathcal{G}(\mathbf{W}):=\hat{\mathcal{L}}_{\rm global}(\mathbf{W})-\mathcal{L}(\mathbf{W})$.

For our analysis, we make the following general assumptions\.

###### Assumption 1\.

(Boundedness of input vectors) The input vectors are bounded, i.e., there exists a constant $C\geq 0$ such that $\|x\|\leq C$ for all $x\in\mathcal{X}$.

###### Assumption 2\.

(Lipschitz continuity of the loss) The loss function, $\ell(\cdot):\mathbb{R}^{d}\to\mathbb{R}$, is $L_{\mathcal{L}}$-Lipschitz continuous, i.e., $|\ell(f_{\mathbf{W}}(x),y)-\ell(f_{\mathbf{W}^{\prime}}(x),y)|\leq L_{\mathcal{L}}\|f_{\mathbf{W}}(x)-f_{\mathbf{W}^{\prime}}(x)\|$ for all $\mathbf{W},\mathbf{W}^{\prime}\in\mathbb{R}^{d}$ and $(x,y)\in\mathcal{X}\times\mathcal{Y}$.

###### Assumption 3\.

(Lipschitz continuity of activation) The vector-valued activation function, $\sigma_{i}(\cdot):\mathbb{R}^{n_{i}}\to\mathbb{R}^{n_{i}}$, for each layer $i$ is $L_{\sigma_{i}}$-Lipschitz continuous, i.e., $\|\sigma_{i}(u)-\sigma_{i}(v)\|\leq L_{\sigma_{i}}\|u-v\|$ for all $u,v\in\mathbb{R}^{n_{i}}$.

Based on these assumptions, Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) upper bounds the generalization error of a fine-tuned, $L$-layer fully connected DNN, parameterized by $\mathbf{W}_{0}+\Delta\mathbf{W}$, by the better of two alternatives: the generalization error of $\mathbf{W}_{0}$ and a correction term, or the generalization error of $\Delta\mathbf{W}$ and a different correction term.

###### Theorem 1\.

(Generalization bounds) Let $f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(x)=\sigma_{L}\left([W_{0}^{L}+\Delta W^{L}]\cdots\sigma_{2}\left([W_{0}^{2}+\Delta W^{2}]\,\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\right)\cdots\right)$ be an $L$-layer fine-tuned DNN, where $\mathbf{W}_{0}+\Delta\mathbf{W}$ is a fine-tuned update. Let the loss function $\mathcal{L}$ for fine-tuning follow Assumptions [1](https://arxiv.org/html/2606.13767#Thmassumption1)–[3](https://arxiv.org/html/2606.13767#Thmassumption3). Then

$$\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\min\left(\mathcal{G}(\mathbf{W}_{0})+\Phi_{\Delta\mathbf{W}},\ \mathcal{G}(\Delta\mathbf{W})+\Phi_{\mathbf{W}_{0}}\right),$$

where

$$\Phi_{\Delta\mathbf{W}}:=2L_{\mathcal{L}}\left[C\prod_{i=1}^{L}L_{\sigma_{i}}\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P(i,j)+\sum_{i\neq 2^{a}-1:a\in[L]}^{2^{L}-2}F(i)\right]\text{ and }$$

$$\Phi_{\mathbf{W}_{0}}:=2L_{\mathcal{L}}\left[C\prod_{i=1}^{L}L_{\sigma_{i}}\sum_{i=2}^{2^{L}}\prod_{j=1}^{L}P(i,j)+\sum_{i\neq 2^{a}:a\in[L]}^{2^{L}-1}F(i)\right],$$

are the correction terms, $F(i):=\|\sigma_{L-\psi(i)}(0)\|\prod_{j=1}^{\psi(i)}\left[L_{\sigma_{L-j+1}}\,H(i,j)\right]$, $\psi(i):=\lfloor\log_{2}(i)\rfloor$, and

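$$P(i,j):=\begin{cases}\|W_{0}^{L-j+1}\|_{2} & \text{if }\left\lfloor\frac{i-1}{2^{L-1}}\right\rfloor\text{ is odd},\\ \|\Delta W^{L-j+1}\|_{2} & \text{if }\left\lfloor\frac{i-1}{2^{L-1}}\right\rfloor\text{ is even},\end{cases}\qquad H(i,j):=\begin{cases}\|\Delta W^{L-j+1}\|_{2} & \text{if }\left\lfloor\frac{i}{2^{\psi(i)-j}}\right\rfloor\text{ is odd},\\ \|W_{0}^{L-j+1}\|_{2} & \text{if }\left\lfloor\frac{i}{2^{\psi(i)-j}}\right\rfloor\text{ is even}.\end{cases}$$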

Intuition behind Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1). Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) provides a general framework to find the generalization error of a fine-tuned model using only the generalization properties of either the pretrained backbone or the parameter update. The standalone terms, $\Phi_{\Delta W}$ and $\Phi_{W_{0}}$, consist of Lipschitz constants of the loss and layerwise activation, spectral norms $\{\|W_{0}^{i}\|_{2},\|\Delta W^{i}\|_{2}\}_{i\in[L]}$, and offset terms $\|\sigma_{i^{\prime}}(0)\|$, based on the recursive collapse of the difference $\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|$; see Figure [5](https://arxiv.org/html/2606.13767#A4.F5).
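The single-layer case of this recursive collapse can be checked directly. The sketch below is only an illustration under assumed values: it uses ReLU (which is 1-Lipschitz) as the activation and random toy matrices, and verifies $\|\sigma((W_0+\Delta W)x)-\sigma(W_0x)\|\leq L_{\sigma}\|\Delta W\|_{2}\|x\|$ for one layer rather than the full $L$-layer bound.

```python
import numpy as np

def relu(z):
    """ReLU activation; it is 1-Lipschitz, so L_sigma = 1."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(2)
n, m = 5, 7                                # hypothetical layer dimensions
W0 = rng.standard_normal((n, m))           # stand-in for a pretrained weight
dW = 0.1 * rng.standard_normal((n, m))     # stand-in for a fine-tuning update
x = rng.standard_normal(m)

# Left-hand side: change in the layer output caused by the update
lhs = np.linalg.norm(relu((W0 + dW) @ x) - relu(W0 @ x))

# Right-hand side: L_sigma * spectral norm of the update * input norm
L_sigma = 1.0
rhs = L_sigma * np.linalg.norm(dW, 2) * np.linalg.norm(x)

print(lhs <= rhs)
```

Stacking this inequality layer by layer is what introduces the products of spectral norms and Lipschitz constants that appear in the correction terms.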

Tightness of the bounds. In Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1), the combinatorial form may appear loose, given that it has $2^{L}$ components. However, this is simply the expansion of a product-of-sums; each layer contributes either its base spectral magnitude or its update spectral magnitude. This frames Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) within a spectral control perspective, where the generalization behavior is upper-bounded by the largest singular values among layers, offering insight by making spectral control a design handle when fine-tuning models. Additionally, it provides a comparison framework across variants, where spectral control can be combined with PAC-Bayes, information-theoretic, or other matrix-based generalization approaches; see §[D.1.3](https://arxiv.org/html/2606.13767#A4.SS1.SSS3). In §[D.1.4](https://arxiv.org/html/2606.13767#A4.SS1.SSS4), we show that the bounds provided in Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) are tight.
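The product-of-sums remark can be made concrete with a few lines of Python. The per-layer norms below are hypothetical numbers, and the sketch only demonstrates the algebraic identity $\prod_{i}(a_{i}+b_{i})=\sum_{\text{choices}}\prod_{i}(\text{one of }a_{i},b_{i})$; it does not reproduce the exact index sets used in $\Phi_{\Delta\mathbf{W}}$ or $\Phi_{\mathbf{W}_{0}}$.

```python
import itertools
import math

def expand_product_of_sums(base, update):
    """Sum over all 2^L ways of picking base[i] or update[i] for each layer i."""
    total = 0.0
    for choice in itertools.product((0, 1), repeat=len(base)):
        term = 1.0
        for pick, b, u in zip(choice, base, update):
            term *= u if pick else b
        total += term
    return total

base = [2.0, 1.5, 3.0]       # hypothetical spectral norms of the pretrained layers
update = [0.1, 0.2, 0.05]    # hypothetical spectral norms of the layer updates

closed_form = math.prod(b + u for b, u in zip(base, update))
expanded = expand_product_of_sums(base, update)

# Terms containing at least one update factor: everything except the all-base product
update_terms = expanded - math.prod(base)
print(closed_form, expanded, update_terms)
```

The $2^{L}$ terms collapse into a product of $L$ factors, so the number of components by itself does not make the bound loose.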

Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) applied to attention mechanism. Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) can handle advanced architectures, such as Transformers, whose main working component is attention, by considering inputs $x\in\mathcal{X}$ after embedding. The key step is showing that multi-head attention (MHA) blocks can be expressed as compositions of linear maps and Lipschitz operators. See Theorem [4](https://arxiv.org/html/2606.13767#Thmtheorem4) in §[D.1.5](https://arxiv.org/html/2606.13767#A4.SS1.SSS5) for the full derivation.

Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) under special conditions. The upper bound on $\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})$ in Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) contains two terms: (*i*) $\mathcal{G}(\mathbf{W}_{0})+\Phi_{\Delta\mathbf{W}}$ and (*ii*) $\mathcal{G}(\Delta\mathbf{W})+\Phi_{\mathbf{W}_{0}}$. We can adopt some additional assumptions on the loss, quantization bit-width, size of the fine-tuning datasets, and layer dimensions (see §[D.1.6](https://arxiv.org/html/2606.13767#A4.SS1.SSS6)) to bound $\mathcal{G}(\mathbf{W}_{0})$ and $\mathcal{G}(\Delta\mathbf{W})$.

(*i*) Bounding $\mathcal{G}(\mathbf{W}_{0})$. We use the PAC-Bayes generalization bound for fine-tuning by [[33](https://arxiv.org/html/2606.13767#bib.bib25)]; see Theorem [5](https://arxiv.org/html/2606.13767#Thmtheorem5) in §[D.1.6](https://arxiv.org/html/2606.13767#A4.SS1.SSS6). The loss function, $\mathcal{L}$, is bounded by $C_{2}$. Since $\|W_{0}^{(i)}-W_{0}^{(i)}\|=0$ for all $i\in[L]$ in Theorem [5](https://arxiv.org/html/2606.13767#Thmtheorem5), we obtain $Q_{i}:=0$. Hence, $\mathcal{G}(\mathbf{W}_{0})\leq\epsilon+C_{2}\sqrt{|N|^{-1}(3\ln|N|\delta^{-1}+8)}$ holds with probability at least $1-2\delta$, where $\epsilon,\delta>0$ are arbitrarily small numbers. Together with Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1), we arrive at $\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\epsilon+C_{2}\sqrt{|N|^{-1}(3\ln|N|\delta^{-1}+8)}+\Phi_{\Delta\mathbf{W}}$; we state this result formally in Theorem [6](https://arxiv.org/html/2606.13767#Thmtheorem6) in §[D.1.6](https://arxiv.org/html/2606.13767#A4.SS1.SSS6).

(*ii*) Bounding $\mathcal{G}(\Delta\mathbf{W})$. For a DNN, let $q$ be the training bitwidth; we use $q=32$ in this work. We assume $\mathcal{L}$ is $\sigma$-sub-Gaussian for all $\mathbf{W}$ and use the generalization upper bound on $\mathcal{G}(\Delta\mathbf{W})$ from Lemma 4.5 of [[75](https://arxiv.org/html/2606.13767#bib.bib18)] for each PEFT method. Lemma 4.5 in [[75](https://arxiv.org/html/2606.13767#bib.bib18)] only bounds the generalization error of the fine-tuned update, $\Delta\mathbf{W}$, that is, $\mathcal{G}(\Delta\mathbf{W})$, while keeping the pretrained weights $\mathbf{W}_{0}$ fixed. In contrast, Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) is a standalone general result, not an extension. It gives an explicit general bound for the full model $\mathbf{W}_{0}+\Delta\mathbf{W}$ and describes how the pretrained backbone and the fine-tuned update interact, which is characterized by $\Phi_{\mathbf{W}_{0}}$. Together with Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1), we arrive at $\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\Phi_{\mathbf{W}_{0}}+\mathcal{G}(\mathbf{BA})$, where $\mathcal{G}(\mathbf{BA})$ represents the generalization error of different LoRA variants; see Table [1](https://arxiv.org/html/2606.13767#S3.T1) and §[D.1.6](https://arxiv.org/html/2606.13767#A4.SS1.SSS6). Table [1](https://arxiv.org/html/2606.13767#S3.T1) lists the generalization error upper bounds of different PEFT methods. In practice, some DNN models may deviate from them; see Tables [3](https://arxiv.org/html/2606.13767#S4.T3) and [16](https://arxiv.org/html/2606.13767#A5.T16).
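To get a feel for how the Table 1 expressions compare, the following sketch evaluates the square-root term for a hypothetical model. The layer dimensions, $\sigma$, dataset size, and chain length are made-up values, and the helper name `complexity_term` is introduced here for illustration only; $r=16$ and $q=32$ follow the values mentioned in the text.

```python
import math

def complexity_term(r, q, sigma, n_samples, dims, k=1):
    """sqrt(2 r q sigma^2 k ln(2) * sum(dims) / |N|), the additive term in Table 1."""
    return math.sqrt(2 * r * q * sigma**2 * k * math.log(2) * sum(dims) / n_samples)

layers = [(768, 768)] * 12       # hypothetical (m_i, n_i) per layer
r, q = 16, 32                    # adapter rank and bitwidth
sigma, n_samples, k = 1.0, 10_000, 4   # assumed sub-Gaussian parameter, |N|, chain length

# LoRA and LoRA+ count both factors (m_i + n_i); cLA, PaCA, and Asym-LoRA count only n_i
lora = complexity_term(r, q, sigma, n_samples, [m + n for m, n in layers])
cla = complexity_term(r, q, sigma, n_samples, [n for _, n in layers])
# Chained variants (RAC, c3LA) pick up an extra factor k under the square root
c3la = complexity_term(r, q, sigma, n_samples, [n for _, n in layers], k=k)

print(f"LoRA: {lora:.3f}  cLA: {cla:.3f}  c3LA: {c3la:.3f}")
```

With square layers the single-factor term is smaller than the LoRA term by a factor of $\sqrt{2}$, and chaining multiplies it by $\sqrt{k}$, matching the relative ordering discussed later for Table 3.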

Table 2: Performance of fine-tuned models with adapter rank $r=16$. We use green, red, and blue to indicate the best, second best, and third best result. For the sparse variants, ↓ indicates the accuracy drop percentage compared to the best. Some results are deferred to the Appendix; see Table [18](https://arxiv.org/html/2606.13767#A5.T18). Column groups: The Past (FFT, LoRA), The Present (CoLA, Asym, RAC, LoRA+), The Future (cLA, $c^{3}$LA, r-cLA, r-$c^{3}$LA).

| Model | Dataset | FFT | LoRA | CoLA | Asym | RAC | LoRA+ | cLA | $c^{3}$LA | r-cLA | r-$c^{3}$LA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny [[8](https://arxiv.org/html/2606.13767#bib.bib26)] | OfficeHome [[61](https://arxiv.org/html/2606.13767#bib.bib80)] | 79.68 | 80.13 | 79.54 | 78.02 | 78.55 | 77.87 | 78.01 (↓2.65%) | 78.69 (↓1.80%) | 78.01 (↓2.65%) | 79.32 (↓1.01%) |
| | CIFAR10 [[32](https://arxiv.org/html/2606.13767#bib.bib81)] | 96.59 | 96.17 | 95.85 | 94.80 | 95.36 | 95.29 | 94.94 (↓1.71%) | 95.23 (↓1.41%) | 95.12 (↓1.52%) | 95.22 (↓1.42%) |
| ViT-Base [[8](https://arxiv.org/html/2606.13767#bib.bib26)] | OfficeHome | 86.42 | 88.96 | 89.01 | 89.00 | 89.33 | 87.87 | 89.21 | 89.18 | 88.83 | 89.17 |
| | CIFAR10 | 98.06 | 98.71 | 98.48 | 98.68 | 98.73 | 98.36 | 98.63 | 98.54 | 98.78 | 98.72 |
| DeBERTa v2 XXL [[23](https://arxiv.org/html/2606.13767#bib.bib51)] | MRPC [[62](https://arxiv.org/html/2606.13767#bib.bib57)] | 87.49 | 88.28 | 87.47 | 87.03 | 86.97 | 87.53 | 86.13 (↓2.44%) | 85.11 (↓3.59%) | 85.55 (↓3.09%) | 85.15 (↓3.55%) |
| | TREC-50 [[37](https://arxiv.org/html/2606.13767#bib.bib58)] | 91.99 | 91.47 | 85.65 | 92.26 | 92.02 | 84.92 | 91.73 (↓0.57%) | 90.87 (↓1.51%) | 91.67 (↓0.64%) | 91.07 (↓1.29%) |
| | PAWS [[74](https://arxiv.org/html/2606.13767#bib.bib56)] | 94.69 | 94.97 | 95.22 | 94.95 | 94.66 | 95.20 | 94.77 (↓0.47%) | 94.90 (↓0.34%) | 94.38 (↓0.88%) | 94.71 (↓0.54%) |
| DeBERTa v3 Base [[22](https://arxiv.org/html/2606.13767#bib.bib49)] | MRPC | 85.80 | 88.33 | 87.91 | 86.40 | 86.34 | 84.51 | 84.43 (↓4.42%) | 80.22 (↓9.18%) | 85.42 (↓3.29%) | 84.17 (↓4.71%) |
| | STS-B [[62](https://arxiv.org/html/2606.13767#bib.bib57)] | 89.52 | 89.09 | 89.34 | 89.04 | 88.71 | 89.15 | 87.56 (↓2.19%) | 87.90 (↓1.81%) | 88.05 (↓1.64%) | 87.90 (↓1.81%) |
| | TREC-50 | 90.15 | 89.29 | 89.88 | 90.67 | 89.22 | 85.52 | 86.04 (↓5.11%) | 87.96 (↓2.99%) | 86.04 (↓5.11%) | 87.70 (↓3.28%) |
| | PAWS | 94.76 | 94.62 | 94.40 | 94.48 | 94.45 | 94.44 | 94.23 | 94.60 | 94.36 | 94.42 |
| RoBERTa-Base [[41](https://arxiv.org/html/2606.13767#bib.bib50)] | MRPC | 87.40 | 86.34 | 86.76 | 86.40 | 86.67 | 84.29 | 84.83 (↓2.94%) | 84.39 (↓3.44%) | 85.08 (↓2.65%) | 85.33 (↓2.37%) |
| RoBERTa-Large [[41](https://arxiv.org/html/2606.13767#bib.bib50)] | MRPC | 87.57 | 88.46 | 88.43 | 87.56 | 87.69 | 72.91 | 87.81 | 86.36 | 86.24 | 86.59 |
| | CoLA | 64.58 | 62.42 | 60.03 | 63.42 | 59.84 | 28.80 | 59.47 (↓7.91%) | 59.60 (↓7.71%) | 58.60 (↓9.26%) | 60.24 (↓6.72%) |
| TinyLlama [[71](https://arxiv.org/html/2606.13767#bib.bib59)] | FOLIO [[16](https://arxiv.org/html/2606.13767#bib.bib62)] | 60.71 | 57.59 | 59.40 | 58.33 | 55.45 | 54.17 | 58.97 | 58.01 | 54.81 | 59.82 |
| | LogiQA [[39](https://arxiv.org/html/2606.13767#bib.bib63)] | 47.54 | 41.54 | 43.70 | 41.50 | 40.86 | 45.83 | 39.09 (↓17.77%) | 39.30 (↓17.33%) | 39.09 (↓17.77%) | 39.31 (↓17.31%) |
| | CLUTRR [[56](https://arxiv.org/html/2606.13767#bib.bib64)] | 42.01 | 37.44 | 39.38 | 37.98 | 37.98 | 38.10 | 39.12 | 37.79 | 36.23 | 37.03 |
| Llama3-8B [[14](https://arxiv.org/html/2606.13767#bib.bib7)] | OpenBookQA | 88.80 | 87.53 | 86.47 | 88.47 | 87.33 | 86.87 | 87.33 (↓1.65%) | 85.07 (↓4.20%) | 86.07 (↓3.07%) | 53.69 (↓39.54%) |
| | CLUTRR | 50.29 | 48.7 | 47.65 | 51.69 | 49.65 | 52.89 | 55.53 | 52.04 | 54.9 | 49.94 |
| GPT2-Small [[52](https://arxiv.org/html/2606.13767#bib.bib48)] | E2E [[47](https://arxiv.org/html/2606.13767#bib.bib93)] | 2.98 | 3.18 | 3.29 | 3.36 | 3.34 | 3.23 | 3.34 (↑12.08%) | 3.29 (↑10.4%) | 3.30 (↑10.7%) | 3.29 (↑10.4%) |

## 4 Quantitative Evaluation

Our extensive experimental study of 11 fine-tuning methods confirms that a given fine-tuning method may or may not be optimal, depending on the pre-trained model, the datasets used, and a multitude of other factors [[69](https://arxiv.org/html/2606.13767#bib.bib4)]. Hence, it is often better to use the cheaper LoRA variants, both for cost reduction and for better generalizability.

Implementation details and models used. Our empirical evaluation encompasses 10 pretrained models: (*i*) DeBERTa v3 Base, (*ii*) DeBERTa v2 XXL, (*iii*) GPT2-small, (*iv*) RoBERTa Base, (*v*) RoBERTa Large, (*vi*) DeepseekCoder-1.3B-base, (*vii*) TinyLlama-1.1B, (*viii*) Llama 3-8B, (*ix*) ViT Base, and (*x*) ViT-Tiny. See Table [6](https://arxiv.org/html/2606.13767#A4.T6) in §[E.1](https://arxiv.org/html/2606.13767#A5.SS1) for a detailed summary of the models and Table [7](https://arxiv.org/html/2606.13767#A5.T7) for implementation details and reproducibility. For each model, we report results at the epoch with the lowest validation loss. We report additional ablation studies to justify the choices of hyperparameters in Table [2](https://arxiv.org/html/2606.13767#S3.T2), such as learning rate, rank, scaling factor, and chain-reset, in §[E.2](https://arxiv.org/html/2606.13767#A5.SS2), spanning Tables [8](https://arxiv.org/html/2606.13767#A5.T8)–[13](https://arxiv.org/html/2606.13767#A5.T13).

Fine-tuning tasks and datasets.

(*i*) Natural Language Processing (NLP). We use PAWS, TREC-50, and various GLUE benchmarks, including MRPC, CoLA, STS-B, and RTE for NLP tasks.

(*ii*) Image Classification. We fine-tuned on OfficeHome and CIFAR-10.

(*iii*) Code Generation. Code generation presents unique challenges; minor deviations can lead to runtime errors or semantic mismatches. There is relatively limited LoRA-focused literature on programming tasks; we evaluate how different LoRA variants adapt to these tasks on DJANGO, and report results using Exact Match (EM).

(*iv*) Logical Reasoning. We use OpenBookQA for elementary science multiple-choice reasoning, FOLIO for natural language reasoning with first-order logic, LogiQA for logical comprehension, and CLUTRR for compositional relational reasoning from text.

Table 3: Empirical generalization error, $\mathcal{G}(\mathbf{W})$, of the fine-tuning methods over various models and datasets. Column groups: The Past (FFT, LoRA), The Present (CoLA, Asym, RAC, LoRA+), The Future (cLA, $c^{3}$LA, r-cLA, r-$c^{3}$LA).

| Model | Dataset | FFT | LoRA | CoLA | Asym | RAC | LoRA+ | cLA | $c^{3}$LA | r-cLA | r-$c^{3}$LA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny [[8](https://arxiv.org/html/2606.13767#bib.bib26)] | OfficeHome | 4.85e-1 | 6.96e-2 | 9.55e-3 | 7.22e-2 | 6.17e-2 | 7.39e-2 | 1.98e-2 | 3.40e-2 | 2.16e-2 | 3.51e-2 |
| DeBERTa v2 XXL [[23](https://arxiv.org/html/2606.13767#bib.bib51)] | PAWS | 6.07e-2 | 1.99e-2 | 3.63e-2 | 3.26e-2 | 3.95e-2 | 5.41e-2 | 6.68e-2 | 5.11e-2 | 1.98e-2 | 6.99e-2 |
| DeBERTa v3 Base [[22](https://arxiv.org/html/2606.13767#bib.bib49)] | MRPC | 1.06e-1 | 8.90e-2 | 2.59e-2 | 7.28e-2 | 9.86e-2 | 1.52e-2 | 2.58e-2 | 8.52e-3 | 1.16e-1 | 2.57e-2 |
| | TREC50 | 4.56e-1 | 2.73e-1 | 3.99e-1 | 2.16e-1 | 2.67e-1 | 2.61e-2 | 2.25e-1 | 3.70e-1 | 3.36e-1 | 2.63e-2 |
| TinyLlama [[71](https://arxiv.org/html/2606.13767#bib.bib59)] | OpenBookQA | 1.78e-1 | 2.82e-1 | 3.41e-1 | 2.15e-1 | 1.86e-1 | 2.07e-1 | 1.51e-1 | 2.20e-1 | 3.16e-1 | 7.59e-2 |
| | FOLIO | 1.82e-1 | 2.37e-1 | 2.17e-1 | 1.75e-1 | 1.93e-1 | 5.11e-2 | 2.35e-1 | 1.91e-1 | 1.05e-1 | 2.49e-1 |
| | CLUTRR | 4.29 | 2.25 | 1.55 | 2.34 | 2.27 | 5.48 | 2.16 | 2.19 | 2.59 | 4.23 |
| Llama3-8B [[14](https://arxiv.org/html/2606.13767#bib.bib7)] | CLUTRR | 2.53 | 2.66 | 2.97 | 2.9 | 5.49 | 2.65 | 2.69 | 5.02 | 2.51 | 4.33 |
| DeepseekCoder [[15](https://arxiv.org/html/2606.13767#bib.bib60)] | DJANGO | 3.48e-2 | 4.65e-2 | 3.4e-2 | 5.16e-2 | 4.64e-2 | 3.87e-2 | 4.19e-2 | 3.89e-2 | 3.64e-2 | 3.62e-2 |
| GPT2-Small [[52](https://arxiv.org/html/2606.13767#bib.bib48)] | E2E | 1.65e-1 | 1.93e-1 | 1.85e-1 | 1.83e-1 | 1.85e-1 | 1.87e-1 | 1.77e-1 | 1.82e-1 | 1.88e-1 | 1.82e-1 |

### 4.1 Quality of the Fine-Tuned Models

In Table [2](https://arxiv.org/html/2606.13767#S3.T2), we present the fine-tuning performance of various models with FFT and LoRA-based PEFTs. For the CoLA dataset, we report the Matthews Correlation Coefficient (higher is better) [[3](https://arxiv.org/html/2606.13767#bib.bib82)]. For GPT2-small, we report perplexity (lower is better); for the rest of the models and datasets, we report test accuracies (higher is better). Each model is trained over 3 seeds, and we average the results. We find that no single method, including FFT, substantially outperforms the others when adapting models to their downstream tasks, which confirms the previous findings in [[69](https://arxiv.org/html/2606.13767#bib.bib4)]. In many cases, FFT performs rather poorly (e.g., ViT-Base on OfficeHome, DeBERTa v3 on RTE, DeepseekCoder on DJANGO). The sparsity-induced SOTA LoRA variants outperform FFT and LoRA in some tasks by a large margin (e.g., ViT-Base on OfficeHome, DeBERTa v3 on MRPC); in many cases, their performance drop is modest. By reducing the memory footprint, the sparse PEFT methods perform well for large models (e.g., Llama 3), even with low batch sizes and short sequence lengths. The sparse variants cannot always produce the best accuracy in low-epoch fine-tuning, but they still generalize well; see Table [16](https://arxiv.org/html/2606.13767#A5.T16). Even when the sparse PEFT methods underperform, their performance improves significantly with increasing rank; see cLA's substantial EM improvement when fine-tuning DeepseekCoder on DJANGO with a higher rank. cLA's parent method, Asym LoRA, performs well under a lower rank budget; this trend switches at a higher rank. This is a surprising case of its own, which requires a deeper understanding of the interplay between rank and column space; we have a dedicated discussion on DeepSeek's performance in §[E.2.1](https://arxiv.org/html/2606.13767#A5.SS2.SSS1).
This suggests that when fine\-tuning a model for a downstream task, it may be optimal to select a fine\-tuning method based on its other characteristics and user\-specific needs, rather than just the generated accuracy\. To highlight this, in §[4\.3](https://arxiv.org/html/2606.13767#S4.SS3), we analyze the performance of each method based on training time, generalizability, and robustness for adapting to further downstream tasks\. Although the sparse variants do not reduce the number of trainable parameters compared to their non\-sparse LoRA counterparts, they*reduce training time by 5\-10% and peak GPU memory by 5\-15%, with a naïve, non\-optimized, sparse implementation*; see throughput, peak memory, and runtime in §[E\.3](https://arxiv.org/html/2606.13767#A5.SS3)\.

### 4.2 Generalizability of the Fine-Tuned Models

The generalization error, $\mathcal{G}(\mathbf{W})$ (Definition [1](https://arxiv.org/html/2606.13767#Thmdefinition1)), is hard to realize in practice, as the true distribution over the feature space and label space, $\mathcal{X}\times\mathcal{Y}$, cannot be obtained. Therefore, we cannot use the theoretical bounds on $\mathcal{G}(\mathbf{W})$ in Table [1](https://arxiv.org/html/2606.13767#S3.T1) without modification. Since test samples are i.i.d. from $\mathcal{X}\times\mathcal{Y}$, as an alternative, the difference between the loss of a model on a collection of unseen test samples and the loss on its training set approximates how well the model generalizes to the true distribution of the instance space it was trained on. Therefore, we approximate $\mathcal{G}(\mathbf{W})\approx\mathbb{E}(\mathcal{L}_{\rm test})-\mathcal{L}_{\rm train}$. As the size of the test set increases, the difference approaches the actual $\mathcal{G}(\mathbf{W})$ of the model; see discussion in §[E.5](https://arxiv.org/html/2606.13767#A5.SS5). We report the approximate generalizability of all fine-tuned models in Table [3](https://arxiv.org/html/2606.13767#S4.T3), using the epoch with the lowest validation loss, averaged over three seeds, to avoid overfitting; see additional results in §[E.5](https://arxiv.org/html/2606.13767#A5.SS5), Table [16](https://arxiv.org/html/2606.13767#A5.T16). Since loss scales differently across various tasks, we interpret the loss gaps within the same task and model setting; any cross-task comparisons are presented for trend inspection rather than validation of the bounds. We report the normalized scores and their observed trends in §[E.5.1](https://arxiv.org/html/2606.13767#A5.SS5.SSS1).
Drawing a connection from our theoretical upper bounds in Table [1](https://arxiv.org/html/2606.13767#S3.T1), we find that PEFT methods with the same upper bounds perform similarly in practice. More precisely, cLA has a smaller upper bound on $\mathcal{G}(\mathbf{W})$ than r-$c^{3}$LA, and this ordering also holds in practice, supporting the theoretical upper bounds. This observation also holds for the cLA and RAC, and the $c^{3}$LA and Asymmetric LoRA pairs. On the other hand, cLA and r-cLA have the same upper bound on $\mathcal{G}(\mathbf{W})$, and they also perform almost identically in practice. Nevertheless, there are some discrepancies, and we attribute them to the fact that Table [1](https://arxiv.org/html/2606.13767#S3.T1) gives us only an upper bound on $\mathcal{G}(\mathbf{W})$. E.g., (*i*) although the upper bound on $\mathcal{G}(\mathbf{W})$ of Asymmetric LoRA is smaller than that of RAC by a factor of $\sqrt{k}$, they behave similarly in practice. (*ii*) Similarly, r-cLA performs marginally worse than RAC, although RAC has a higher theoretical bound on $\mathcal{G}(\mathbf{W})$. (*iii*) $c^{3}$LA and RAC-LoRA have similar theoretical bounds; in practice, we notice stronger generalization trends for $c^{3}$LA in comparison to all other variants. (*iv*) In an extreme case, r-$c^{3}$LA empirically outperforms r-cLA while having a higher theoretical bound on $\mathcal{G}(\mathbf{W})$.
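The approximation $\mathcal{G}(\mathbf{W})\approx\mathbb{E}(\mathcal{L}_{\rm test})-\mathcal{L}_{\rm train}$, evaluated at the epoch with the lowest validation loss, can be sketched as follows. The function name `gap_at_best_epoch` and the per-epoch loss values are hypothetical; in the paper's setting the result would additionally be averaged over three seeds.

```python
import numpy as np

def gap_at_best_epoch(train_loss, val_loss, test_loss):
    """Return (epoch, test - train) at the epoch with the lowest validation loss.

    Each argument is a sequence of mean losses, one entry per epoch.
    """
    best = int(np.argmin(val_loss))
    return best, float(test_loss[best] - train_loss[best])

# Hypothetical per-epoch mean losses for one run
train_loss = [0.90, 0.50, 0.30]
val_loss = [1.00, 0.70, 0.80]
test_loss = [1.10, 0.75, 0.85]

epoch, gap = gap_at_best_epoch(train_loss, val_loss, test_loss)
print(epoch, gap)
```

Selecting the epoch by validation loss, rather than by the smallest training loss, keeps the reported gap from being dominated by late-epoch overfitting.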

### 4.3 Performance Analysis

With trained model quality and empirical $\mathcal{G}(\mathbf{W})$ in hand, we dissect the performance of different LoRA variants using two popular and practically useful tools that assess a model's quality: (*i*) loss landscapes [[34](https://arxiv.org/html/2606.13767#bib.bib17)] and (*ii*) intruder dimensions [[55](https://arxiv.org/html/2606.13767#bib.bib22)]. We examine whether these tools can help us decide which fine-tuning method to use for a given model and task, given that we measured performance and empirical $\mathcal{G}(\mathbf{W})$ beforehand and thus have a basis for comparison. Instead of limiting these tools to a single task, we apply them across tasks, models, and modalities.

(*i*) 3D loss landscape visualizes how a model's empirical loss changes under small parameter perturbations; see details in §[E.4.1](https://arxiv.org/html/2606.13767#A5.SS4.SSS1). A sharper loss landscape indicates worse generalization, whereas a smoother landscape indicates that the PEFT method is more robust to initialization [[34](https://arxiv.org/html/2606.13767#bib.bib17)]. In Figure [3](https://arxiv.org/html/2606.13767#S4.F3), the top row shows the loss landscapes of ViT-Base, pretrained on ImageNet-21K and fine-tuned on OfficeHome, while the bottom row shows the loss landscapes of RoBERTa-Base, pretrained on a large corpus of English data and fine-tuned on CoLA. For ViT-Base, we used PCA directions, whereas for RoBERTa-Base, we used random directions; see §[E.4.1](https://arxiv.org/html/2606.13767#A5.SS4.SSS1) for their comparison. We present a direct comparison of non-chain LoRA methods (LoRA, Asymmetric LoRA, cLA) with their chain counterparts (CoLA, RAC-LoRA, $c^3$LA) in Figure [9](https://arxiv.org/html/2606.13767#A5.F9). In §[E.4.1](https://arxiv.org/html/2606.13767#A5.SS4.SSS1), we plot the optimizer path in 2D contour plots.
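The random-direction variant of this visualization can be sketched as follows, in the spirit of [[34](https://arxiv.org/html/2606.13767#bib.bib17)]: the loss is evaluated on a grid $f(\alpha, \beta) = \mathcal{L}(\theta^{*} + \alpha\delta + \beta\eta)$ around the trained weights $\theta^{*}$. This is a toy sketch under stated assumptions, not the paper's pipeline: the model is a small least-squares problem, the row-wise rescaling is a simplified stand-in for filter normalization, and all names (`normalized_direction`, `loss_surface`, `mse`) are hypothetical.

```python
import numpy as np


def normalized_direction(theta, rng):
    """Draw a Gaussian direction and rescale each row to the norm of the matching row of theta."""
    direction = rng.standard_normal(theta.shape)
    theta_norms = np.linalg.norm(theta, axis=1, keepdims=True)
    dir_norms = np.linalg.norm(direction, axis=1, keepdims=True)
    return direction * theta_norms / (dir_norms + 1e-12)


def loss_surface(loss_fn, theta, span=1.0, steps=11, seed=0):
    """Evaluate loss_fn on a (steps x steps) grid spanned by two random directions."""
    rng = np.random.default_rng(seed)
    delta = normalized_direction(theta, rng)
    eta = normalized_direction(theta, rng)
    alphas = np.linspace(-span, span, steps)
    surface = np.empty((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            surface[i, j] = loss_fn(theta + a * delta + b * eta)
    return alphas, surface


# Toy regression problem whose minimizer is known exactly (illustration only).
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((64, 5))
W_star = data_rng.standard_normal((5, 3))
Y = X @ W_star


def mse(W):
    """Mean squared error of the linear model X @ W against Y."""
    return float(np.mean((X @ W - Y) ** 2))


alphas, surface = loss_surface(mse, W_star)
```

The resulting `surface` array can be passed to any 3D or contour plotting routine; how quickly its values rise away from the center of the grid is what the text refers to as sharpness.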

Based on the loss-landscape characteristics in [[34](https://arxiv.org/html/2606.13767#bib.bib17)], for ViT-Base fine-tuned on OfficeHome, FFT should generalize worse and have the worst test accuracy; Figure [3](https://arxiv.org/html/2606.13767#S4.F3) confirms this. But recall from Figure [1](https://arxiv.org/html/2606.13767#S1.F1) that, for ViT-Base fine-tuned on CIFAR-10, FFT produced the lowest $\mathcal{G}(\mathbf{W})$ and yet yielded the worst test accuracy. In both cases, FFT produced the spikiest loss and small-volume minima. We observed in Figures [1](https://arxiv.org/html/2606.13767#S1.F1) and [3](https://arxiv.org/html/2606.13767#S4.F3) that chain methods (e.g., RAC, $c^3$LA) sharpen the minima of the fine-tuned DNN models, and these sharper valleys indicate that these models should generalize worse. However, in practice, this is not the case. For example, for ViT-Base in Figure [3](https://arxiv.org/html/2606.13767#S4.F3), RAC and $c^3$LA have the lowest $\mathcal{G}(\mathbf{W})$; cLA has a wider valley, yet its $\mathcal{G}(\mathbf{W})$ is higher, indicating that it should generalize worse. In contrast, all three PEFT methods produced similarly high-quality fine-tuned models on OfficeHome and CIFAR-10, albeit with slight differences. This discrepancy between practice and theory is consistent across video and text model modalities. For RoBERTa-Base in Figure [3](https://arxiv.org/html/2606.13767#S4.F3), the chain methods, CoLA and r-$c^3$LA, produce sharper landscapes than the non-chain methods. Still, CoLA has a lower $\mathcal{G}(\mathbf{W})$.

Key Takeaway. Generalizability and model performance are two sides of one coin, as is generally agreed [[34](https://arxiv.org/html/2606.13767#bib.bib17)]. However, we observed that this perspective can break down for fine-tuned models, as the narrative surrounding loss-landscape sharpness mostly contradicts the empirically measured $\mathcal{G}(\mathbf{W})$ of these models.

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/draw_io_images/landscape_main_paper_final_fixed.drawio.png)

Figure 3: Loss landscapes of ViT-Base fine-tuned on OfficeHome (top row) with PCA directions, and RoBERTa-Base fine-tuned on CoLA (bottom row) with random directions. In both cases, we observe the worst generalization error, $\mathcal{G}(\mathbf{W})$, in (a) and (f), respectively, which are the spikiest landscapes in their class of models. Additionally, chain methods consistently produce spikier landscapes.

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/intruder_dimension_plots/RoBERTa-CoLA_best_replot_biggerlabels.png) (a) RoBERTa-Base (CoLA)
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/intruder_dimension_plots/ViT-Officehome_last_replot_biggerlabels.png) (b) ViT-Base (OfficeHome)
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/intruder_dimension_plots/ViT-Cifar10_best_replot_biggerlabels.png) (c) ViT-Base (CIFAR-10)

Figure 4: The average number of intruder dimensions present in different fine-tuned models at the end of training. Each method with a corresponding chain variant (LoRA to CoLA, Asymmetric LoRA to RAC-LoRA) shares its color with that variant, and the chain method is drawn as a dashed line.

(*ii*) Intruder dimension [[55](https://arxiv.org/html/2606.13767#bib.bib22)] compares the performance between the fine-tuned models of LoRA and FFT. Given the pretrained and fine-tuned models, $\mathbf{W}_0$ and $\mathbf{W}_0+\Delta\mathbf{W}$, the number of *intruder dimensions* correlates with their performance on the pretraining task; *a higher number of intruder dimensions correlates with worse performance.* We ask: *Will forgetting less of the more diverse dataset indicate better generalizability?* We examine LoRA variants and FFT from this perspective, and study how the intruder count aligns with our empirical $\mathcal{G}(\mathbf{W})$. In Figure [4](https://arxiv.org/html/2606.13767#S4.F4), we present the number of intruder dimensions present in FFT and various LoRA-based PEFT methods for RoBERTa-Base fine-tuned on CoLA and ViT-Base fine-tuned on OfficeHome and CIFAR-10 over a range of thresholds, $\varepsilon \in (0,1]$.
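Counting intruder dimensions for a single weight matrix can be sketched as below. This follows our reading of [[55](https://arxiv.org/html/2606.13767#bib.bib22)] and is an assumption rather than the exact procedure used here: a top singular vector of the fine-tuned matrix is counted as an intruder when its largest absolute cosine similarity to every left singular vector of the pretrained matrix falls below $\varepsilon$. The function name, the `top_k` parameter, the use of absolute values, and the toy matrices are all hypothetical.

```python
import numpy as np


def count_intruders(W0, W_ft, eps=0.5, top_k=10):
    """Count top-k left singular vectors of W_ft that are dissimilar to all of W0's."""
    U0, _, _ = np.linalg.svd(W0, full_matrices=False)
    U_ft, _, _ = np.linalg.svd(W_ft, full_matrices=False)
    # Columns are unit vectors, so inner products are cosine similarities.
    # The absolute value removes the sign ambiguity of singular vectors.
    cos = np.abs(U0.T @ U_ft[:, :top_k])
    max_similarity = cos.max(axis=0)
    return int(np.sum(max_similarity < eps))


# Toy example: the pretrained matrix has the standard basis as singular vectors.
W0 = np.diag(np.arange(16, 0, -1).astype(float))
# A large rank-1 update along the all-ones direction, which has cosine 0.25
# with every standard basis vector, so it is far from all pretrained directions.
u = np.ones(16) / 4.0
W_ft = W0 + 100.0 * np.outer(u, u)

unchanged = count_intruders(W0, W0, eps=0.5, top_k=16)
dominant = count_intruders(W0, W_ft, eps=0.5, top_k=1)
# Sweep over thresholds, mirroring the x-axis of the intruder-dimension plots.
sweep = [count_intruders(W0, W_ft, eps=e, top_k=4) for e in (0.25, 0.5, 0.75, 1.0)]
```

In this toy setting the unmodified matrix has no intruders, while the dominant direction of the updated matrix counts as one because it aligns with none of the pretrained singular vectors.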

In Figure [4](https://arxiv.org/html/2606.13767#S4.F4)(a), RoBERTa-Base fine-tuned via FFT has some of the fewest intruders yet has the worst generalizability; see Table [3](https://arxiv.org/html/2606.13767#S4.T3). The methods that produce the lowest generalization error for RoBERTa-Base, Asymmetric LoRA and RAC, both produce relatively few intruders. Conversely, ViT-Base fine-tuned on both OfficeHome and CIFAR-10 via FFT produces some of the most intruders while having the best generalization for CIFAR-10 and relatively poor generalization for OfficeHome; see Table [16](https://arxiv.org/html/2606.13767#A5.T16). For all three experiments, the chain variant of any LoRA PEFT method produces more intruders than its non-chain counterpart; compare LoRA to CoLA, Asymmetric LoRA to RAC, and cLA to $c^3$LA in Figure [4](https://arxiv.org/html/2606.13767#S4.F4). This correlates with our loss landscapes, where chain variants produce sharper landscapes. However, the expected worse generalizability of these chain methods is not observed empirically.

Key Takeaway. Although it is a more promising tool than the loss landscape, and an excellent addition for analyzing FFT and PEFT, the intruder count does not always match our empirically evaluated $\mathcal{G}(\mathbf{W})$ either. This mismatch is more prominent in the vision task. The original paper reports experiments involving only language models; we encourage the community to explore diverse modalities and models.

#### 4.3.1 Discussion

Using alternative diagnostics such as loss landscapes and intruder dimensions to analyze fine-tuned model quality, we observe that these tools do not consistently align with the empirical trends in test accuracy (Table [2](https://arxiv.org/html/2606.13767#S3.T2)) and generalizability (Table [3](https://arxiv.org/html/2606.13767#S4.T3)). While they provide informative post hoc rationalization in some settings, their results are often ambiguous and produce false positives, suggesting good generalization even when $\mathcal{G}(\mathbf{W})$ and test accuracy do not support it. These discrepancies motivate us to present an alternative analysis that better aligns with the empirically observed $\mathcal{G}(\mathbf{W})$ and the obtained accuracy. This perspective produces fewer false positives. We observe that the theoretical results in Table [1](https://arxiv.org/html/2606.13767#S3.T1) and the empirical results in Table [3](https://arxiv.org/html/2606.13767#S4.T3) are relatively comparable, and the models that generalize strongly in Table [3](https://arxiv.org/html/2606.13767#S4.T3) typically also achieve higher accuracy in Table [2](https://arxiv.org/html/2606.13767#S3.T2). Consequently, jointly measuring test accuracy and experimentally evaluating the resulting model's $\mathcal{G}(\mathbf{W})$ provides a more consistent predictive criterion for deciding whether a fine-tuning method is suitable for a particular model, task, and dataset. In the era of artificial general intelligence, when we want a model to behave in a human-like way across many tasks, we conclude that it is better to choose PEFT methods that generalize well and are computationally efficient.

## 5 Conclusion

Through extensive benchmarking, we show the complex, task-dependent nature of PEFT performance, including FFT. This aligns with prior findings. Our proposed sparse extensions of SOTA LoRA variants perform well across multiple modalities and models while substantially reducing training time and memory requirements. From a theoretical perspective, our sparsity-induced variants serve as a bridge between LoRA and PaCA, two different families of PEFT methods. While these sparse variants may require larger budgets to maintain robustness in certain settings (e.g., code generation with DeepSeekCoder), they remain effective overall, highlighting the importance of selecting fine-tuning methods based on task characteristics and user constraints. To support this, we analyzed various common LoRA PEFT variants through the lens of generalizability. We show that, in theory, the sparse methods have the same generalization error upper bounds as their non-sparse counterparts, and that these bounds closely track the empirical generalization trend across most models and modalities. This insight provides a more consistent and guided pathway for selecting PEFT methods, complementing existing diagnostic tools such as loss-landscape and intruder-dimension analyses.

Acknowledgment. Aritra Dutta is partially supported by the Florida Department of Health Grant, AWD00007072, and the National Science Foundation Grant, 2321986.

## References

- [1] Introducing Meta Llama 3: The most capable openly available LLM to date. External Links: [Link](https://ai.meta.com/blog/meta-llama-3/).
- [2] D. Biderman, J. Portes, J. J. G. Ortiz, M. Paul, P. Greengard, C. Jennings, D. King, S. Havens, V. Chiley, J. Frankle, et al. (2024) LoRA Learns Less and Forgets Less. Transactions on Machine Learning Research.
- [3] D. Chicco and G. Jurman (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1), pp. 6.
- [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- [5] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023) QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems, Vol. 36, pp. 10088–10115.
- [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- [7] N. Ding, X. Lv, Q. Wang, Y. Chen, B. Zhou, Z. Liu, and M. Sun (2023) Sparse low-rank adaptation of pre-trained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4133–4145.
- [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
- [9] C. Doyle (2025) BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters. arXiv preprint arXiv:2506.22809.
- [10] J. Fei, C. Ho, A. N. Sahu, M. Canini, and A. Sapio (2021) Efficient sparse collective communication and its application to accelerate distributed deep learning. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference, pp. 676–691.
- [11] J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations.
- [12] T. Gale, E. Elsen, and S. Hooker (2020) Sparse GPU kernels for deep learning. In The International Conference for High Performance Computing, Networking, Storage and Analysis.
- [13] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. External Links: [Link](http://www.deeplearningbook.org/).
- [14] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [15] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang (2024) DeepSeek-Coder: when the large language model meets programming – the rise of code intelligence. arXiv:2401.14196.
- [16] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y. Qiao, L. Benson, L. Sun, A. Wardle-Solano, H. Szabo, E. Zubova, M. Burtell, J. Fan, Y. Liu, B. Wong, M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang, A. R. Fabbri, W. Kryściński, S. Yavuz, Y. Liu, X. V. Lin, S. Joty, Y. Zhou, C. Xiong, R. Ying, A. Cohan, and D. Radev (2024) FOLIO: Natural Language Reasoning with First-Order Logic. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22017–22031.
- [17] Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang (2024) Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. Transactions on Machine Learning Research. ISSN 2835-8856. External Links: [Link](https://openreview.net/forum?id=lIsCS8b6zj).
- [18] S. Hayou, N. Ghosh, and B. Yu (2024) LoRA+: Efficient Low Rank Adaptation of Large Models. In International Conference on Machine Learning, pp. 17783–17806.
- [19] B. He, N. K. Govindaraju, Q. Luo, and B. Smith (2007) Efficient gather and scatter operations on graphics processors. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12.
- [20] H. He, P. Ye, Y. Ren, Y. Yuan, L. Zhou, S. Ju, and L. Chen (2025) GoRA: gradient-driven adaptive low rank adaptation. arXiv preprint arXiv:2502.12171.
- [21] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [22] P. He, J. Gao, and W. Chen (2021) DeBERTaV3: improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.
- [23] P. He, X. Liu, J. Gao, and W. Chen (2020) DeBERTa: decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
- [24] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 2790–2799.
- [25] I. Hounie, C. Kanatsoulis, A. Tandon, and A. Ribeiro (2024) LoRTA: Low Rank Tensor Adaptation of Large Language Models. arXiv preprint arXiv:2410.04060.
- [26] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- [27] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
- [28] R. Huang, Z. Emam, M. Goldblum, L. Fowl, J. K. Terry, F. Huang, and T. Goldstein (2020) Understanding Generalization Through Visualizations. In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, Vol. 137, pp. 87–97.
- [29] D. Kalajdzievski (2023) A rank stabilization scaling factor for fine-tuning with LoRA. arXiv preprint arXiv:2312.03732.
- [30] J. Kim, J. Kim, and E. K. Ryu (2025) LoRA training provably converges to a low-rank global minimum or it fails loudly (but it probably won't fail). arXiv preprint arXiv:2502.09376.
- [31] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
- [32] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
- [33] D. Li and H. Zhang (2021) Improved regularization and robustness for fine-tuning in neural networks. In Advances in Neural Information Processing Systems, Vol. 34, pp. 27249–27262.
- [34] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the Loss Landscape of Neural Nets. In NeurIPS, Vol. 31.
- [35] H. Li, W. Zheng, Q. Wang, H. Zhang, Z. Wang, S. Xuyang, Y. Fan, Z. Ding, H. Wang, N. Ding, S. Zhou, X. Zhang, and D. Jiang (2025) Predictable Scale: Part I, Step Law — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. arXiv:2503.04715.
- [36] T. Li, Z. He, Y. Li, Y. Wang, L. Shang, and X. Huang (2024) Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape. arXiv preprint arXiv:2409.14396.
- [37] X. Li and D. Roth (2006) Learning question classifiers: the role of semantic information. Natural Language Engineering 12(3), pp. 229–249. External Links: [Document](https://dx.doi.org/10.1017/S1351324905003955).
- [38] V. Lialin, N. Shivagunde, S. Muckatira, and A. Rumshisky (2024) ReLoRA: high-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations.
- [39] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang (2020) LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. arXiv:2007.08124.
- [40] S. Liu, C. Wang, H. Yin, P. Molchanov, Y. F. Wang, K. Cheng, and M. Chen (2024) DoRA: weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning.
- [41] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- [42] G. Malinovsky, U. Michieli, H. A. A. K. Hammoud, T. Ceritli, H. Elesedy, M. Ozay, and P. Richtárik (2024) Randomized asymmetric chain of LoRA: The first meaningful theoretical framework for low-rank adaptation. arXiv preprint arXiv:2410.08305.
- [43] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391.
- [44] G. Mori (March 13, 2025) GPT-4.5 vs GPT-4o: Comparing OpenAI's Latest AI Models. External Links: [Link](https://giancarlomori.substack.com/p/gpt-45-vs-gpt-4o-comparing-openais).
- [45] S. Mu and D. Klabjan (2025) On the Convergence Rate of LoRA Gradient Descent. arXiv preprint arXiv:2512.18248.
- [46] P. Nair (2025) Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms. arXiv preprint arXiv:2510.23012.
- [47] J. Novikova, O. Dušek, and V. Rieser (2017) The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 201–206.
- [48] (January 6, 2025) NVIDIA GPUs: H100 vs. A100—A detailed comparison. External Links: [Link](https://gcore.com/blog/nvidia-h100-a100).
- [49] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura (2015) Learning to generate pseudo-code from source code using statistical machine translation. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 574–584.
- [50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, Vol. 32.
- [51] K. Ponkshe, R. Singhal, E. Gorbunov, A. Tumanov, S. Horvath, and P. Vepakomma (2024) Initialization using update approximation is a silver bullet for extremely efficient low-rank fine-tuning. arXiv preprint arXiv:2411.19557.
- [52] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9.
- [53] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) DeepSpeed: system optimizations for large-scale deep learning. arXiv preprint arXiv:2007.00399.
- [54] D. Russo and J. Zou (2019) How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory 66(1), pp. 302–323.
- [55] R. S. Shuttleworth, J. Andreas, A. Torralba, and P. Sharma (2025) LoRA vs. full fine-tuning: an illusion of equivalence. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [56] K. Sinha, S. Sodhani, J. Dong, J. Pineau, and W. L. Hamilton (2019) CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4506–4515.
- [57] Y. Sun, Z. Li, Y. Li, and B. Ding (2024) Improving LoRA in Privacy-preserving Federated Learning. In International Conference on Learning Representations.
- [58] C. Tian, Z. Shi, Z. Guo, L. Li, and C. Xu (2024) HydraLoRA: An asymmetric LoRA architecture for efficient fine-tuning. In Advances in Neural Information Processing Systems, Vol. 37, pp. 9565–9584.
- [59] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi (2022) DyLoRA: parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558.
- [60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
- [61] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027.
- [62] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. External Links: [Document](https://dx.doi.org/10.18653/v1/W18-5446).
- \[63\]S\. Wang, L\. Yu, and J\. Li\(2024\)LoRA\-GA: Low\-rank Adaptation with Gradient Approximation\.Advances in Neural Information Processing Systems37,pp\. 54905–54931\.Cited by:[Appendix A](https://arxiv.org/html/2606.13767#A1.p1.8)\.
- \[64\]S\. Woo, S\. Namkung, S\. Lee, I\. Jeong, B\. Kim, and D\. Jeon\(2025\)PaCA: Partial Connection Adaptation for Efficient Fine\-Tuning\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.13767#A1.p1.8),[§B\.1](https://arxiv.org/html/2606.13767#A2.SS1.p2.1),[Appendix B](https://arxiv.org/html/2606.13767#A2.p1.9),[§E\.3\.2](https://arxiv.org/html/2606.13767#A5.SS3.SSS2.p1.2),[Table 14](https://arxiv.org/html/2606.13767#A5.T14.85.2),[Table 14](https://arxiv.org/html/2606.13767#A5.T14.87.2),[§1](https://arxiv.org/html/2606.13767#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.13767#S2.SS3.p7.1.1),[Table 1](https://arxiv.org/html/2606.13767#S3.13.13.13.13.14.1),[Theorem 2](https://arxiv.org/html/2606.13767#Thmtheorem2.p1.2.2)\.
- \[65\]W\. Xia, C\. Qin, and E\. Hazan\(2024\)Chain of LoRA: Efficient Fine\-tuning of Language Models via Residual Learning\.InICML 2024 Workshop on LLMs and Cognition,Cited by:[§E\.4\.1](https://arxiv.org/html/2606.13767#A5.SS4.SSS1.p3.3),[§1](https://arxiv.org/html/2606.13767#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.13767#S2.SS2.p2.8.1),[§2\.3](https://arxiv.org/html/2606.13767#S2.SS3.p1.3),[§2\.3](https://arxiv.org/html/2606.13767#S2.SS3.p5.14),[Table 1](https://arxiv.org/html/2606.13767#S3.8.8.8.8.8.2)\.
- \[66\]A\. Xu and M\. Raginsky\(2017\)Information\-theoretic analysis of generalization capability of learning algorithms\.Advances in neural information processing systems30\.Cited by:[item 5](https://arxiv.org/html/2606.13767#A4.I1.i5.p1.9),[§D\.1\.6](https://arxiv.org/html/2606.13767#A4.SS1.SSS6.p6.1),[§3](https://arxiv.org/html/2606.13767#S3.p1.1),[Definition 1](https://arxiv.org/html/2606.13767#Thmdefinition1.p1.2),[Lemma 1](https://arxiv.org/html/2606.13767#Thmlemma1.p1.5.5),[Theorem 7](https://arxiv.org/html/2606.13767#Thmtheorem7.p1.6.6.6)\.
- \[67\]H\. Xu, C\. Ho, A\. M\. Abdelmoniem, A\. Dutta, E\. H\. Bergou, K\. Karatsenidis, M\. Canini, and P\. Kalnis\(2021\)Grace: A compressed communication framework for distributed machine learning\.In2021 IEEE 41st International Conference on Distributed Computing Systems \(ICDCS\),pp\. 561–572\.Cited by:[§1](https://arxiv.org/html/2606.13767#S1.p1.1)\.
- \[68\]J\. Xu, X\. Sun, Z\. Zhang, G\. Zhao, and J\. Lin\(2019\)Understanding and improving layer normalization\.InAdvances in neural information processing systems,Vol\.32\.Cited by:[§D\.1\.5](https://arxiv.org/html/2606.13767#A4.SS1.SSS5.p2.8)\.
- \[69\]L\. Xu, H\. Xie, S\. J\. Qin, X\. Tao, and F\. L\. Wang\(2023\)Parameter\-efficient fine\-tuning methods for pretrained language models: A critical review and assessment\.arXiv preprint arXiv:2312\.12148\.Cited by:[§1](https://arxiv.org/html/2606.13767#S1.p2.1),[§1](https://arxiv.org/html/2606.13767#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.13767#S2.SS1.p2.6.1),[§2\.1](https://arxiv.org/html/2606.13767#S2.SS1.p2.7),[§2\.2](https://arxiv.org/html/2606.13767#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.13767#S4.SS1.p1.1),[§4](https://arxiv.org/html/2606.13767#S4.p1.1)\.
- \[70\]J\. Zhang, J\. You, A\. Panda, and T\. Goldstein\(2025\)LoRI: reducing cross\-task interference in multi\-task low\-rank adaptation\.InSecond Conference on Language Modeling,Cited by:[§2\.3](https://arxiv.org/html/2606.13767#S2.SS3.p8.4)\.
- \[71\]P\. Zhang, G\. Zeng, T\. Wang, and W\. Lu\(2024\)TinyLlama: an open\-source small language model\.External Links:2401\.02385Cited by:[Table 16](https://arxiv.org/html/2606.13767#A5.T16.157.151.11),[Table 17](https://arxiv.org/html/2606.13767#A5.T17.8.2.18.1),[Table 18](https://arxiv.org/html/2606.13767#A5.T18.58.54.5),[Table 2](https://arxiv.org/html/2606.13767#S3.T2.58.60.1),[Table 3](https://arxiv.org/html/2606.13767#S4.T3.54.52.11)\.
- \[72\]Q\. Zhang, M\. Chen, A\. Bukharin, N\. Karampatziakis, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao\(2023\)Adalora: adaptive budget allocation for parameter\-efficient fine\-tuning\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.13767#A1.p1.8),[§1](https://arxiv.org/html/2606.13767#S1.p2.1)\.
- \[73\]R\. Zhang, R\. Qiang, S\. A\. Somayajula, and P\. Xie\(2024\)AutoLoRA: Automatically Tuning Matrix Ranks in Low\-Rank Adaptation Based on Meta Learning\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 5048–5060\.Cited by:[Appendix A](https://arxiv.org/html/2606.13767#A1.p1.8)\.
- \[74\]Y\. Zhang, J\. Baldridge, and L\. He\(2019\)PAWS: paraphrase adversaries from word scrambling\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 1298–1308\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1131)Cited by:[Table 18](https://arxiv.org/html/2606.13767#A5.T18.26.22.6),[Table 2](https://arxiv.org/html/2606.13767#S3.T2.26.22.6)\.
- \[75\]J\. Zhu, K\. Greenewald, K\. Nadjahi, H\. S\. De Ocáriz Borde, R\. B\. Gabrielsson, L\. Choshen, M\. Ghassemi, M\. Yurochkin, and J\. Solomon\(2024\)Asymmetry in low\-rank adapters of foundation models\.InProceedings of the 41st International Conference on Machine Learning,pp\. 62369–62385\.Cited by:[Appendix B](https://arxiv.org/html/2606.13767#A2.p1.9),[item 5](https://arxiv.org/html/2606.13767#A4.I1.i5.p1.9),[§D\.1\.6](https://arxiv.org/html/2606.13767#A4.SS1.SSS6.p12.4),[§1](https://arxiv.org/html/2606.13767#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.13767#S2.SS2.p3.2.1),[§2\.2](https://arxiv.org/html/2606.13767#S2.SS2.p3.3),[§2\.3](https://arxiv.org/html/2606.13767#S2.SS3.p1.3),[§2\.3](https://arxiv.org/html/2606.13767#S2.SS3.p3.11),[Table 1](https://arxiv.org/html/2606.13767#S3.7.7.7.7.7.3),[§3\.1](https://arxiv.org/html/2606.13767#S3.SS1.p11.14)\.

Organization of Appendix\. We organize the Appendix with the following structure: In §[A](https://arxiv.org/html/2606.13767#A1), we discuss the popular contemporary LoRA variants; this is a continuation of §[2\.2](https://arxiv.org/html/2606.13767#S2.SS2) of the main paper\. In §[C](https://arxiv.org/html/2606.13767#A3), we give the pseudocode of our proposed LoRA variants, cLA, random\-cLA, and $c^{3}$LA\. §[D](https://arxiv.org/html/2606.13767#A4) contains the proofs of the theorems in §[3](https://arxiv.org/html/2606.13767#S3)\. In particular, it contains the proofs of Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) and Theorem [6](https://arxiv.org/html/2606.13767#Thmtheorem6)\. Additionally, this section contains the statement and proof of Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) adapted to the attention mechanism; see Theorem [4](https://arxiv.org/html/2606.13767#Thmtheorem4) in §[D\.1\.5](https://arxiv.org/html/2606.13767#A4.SS1.SSS5)\. In §[E](https://arxiv.org/html/2606.13767#A5), we discuss the implementation details and extend our empirical study with various ablation studies and further discussion topics\. This section acts as an addendum to §[4](https://arxiv.org/html/2606.13767#S4) of the main paper\. For notations used in this paper, we refer to Table [19](https://arxiv.org/html/2606.13767#A7.T19) in §[G](https://arxiv.org/html/2606.13767#A7)\.

## Appendix A The Present: Evolution of LoRA \(Continued\)

Other LoRA variants\. There are other popular LoRA variants, such as HydraLoRA \[[58](https://arxiv.org/html/2606.13767#bib.bib21)\], designed for fine\-tuning on datasets with high heterogeneity\. LoRA\-SB \[[51](https://arxiv.org/html/2606.13767#bib.bib27)\] simulates the FFT process within low\-rank subspaces by adding a trainable $r\times r$ matrix $R$, initializing $BRA$ based on the SVD of the first step of FFT, and freezing $B, A$\. QLoRA \[[5](https://arxiv.org/html/2606.13767#bib.bib37)\] fine\-tunes quantized LLMs\. AdaLoRA \[[72](https://arxiv.org/html/2606.13767#bib.bib38)\] uses varying rank by layer and uses an SVD initialization\. SoRA \[[7](https://arxiv.org/html/2606.13767#bib.bib39)\] introduces sparsity in the low\-rank updates\. DoRA \[[40](https://arxiv.org/html/2606.13767#bib.bib40)\] separates fine\-tuning the direction and magnitude components of the model\. AutoLoRA \[[73](https://arxiv.org/html/2606.13767#bib.bib41)\] trains each LoRA update as a sum of rank\-one matrices and learns which to discard during training\. DyLoRA \[[59](https://arxiv.org/html/2606.13767#bib.bib42)\] concentrates the more important features in the first columns and rows of $B$ and $A$, respectively\. LoRA\-GA \[[63](https://arxiv.org/html/2606.13767#bib.bib90)\] improves LoRA convergence by initializing low\-rank adapters using a gradient approximation of full fine\-tuned updates\. LoRTA \[[25](https://arxiv.org/html/2606.13767#bib.bib91)\] replaces matrix\-based adapters with a low\-rank tensor factorization that unifies model updates across layers and attention heads\. Flat\-LoRA \[[36](https://arxiv.org/html/2606.13767#bib.bib92)\] seeks flat minima in the full\-parameter space to improve model generalization\. BayesLoRA \[[9](https://arxiv.org/html/2606.13767#bib.bib94)\] learns the effective adapter rank through Bayesian low\-rank variational dropout\. rsLoRA \[[29](https://arxiv.org/html/2606.13767#bib.bib95)\] stabilizes LoRA training across ranks by correcting the scaling factor from $1/r$ to $1/\sqrt{r}$, preventing gradient collapse\. GoRA \[[20](https://arxiv.org/html/2606.13767#bib.bib96)\] is a gradient\-based LoRA variant that performs adaptive rank allocation and initializes adapter weights using compressed gradients\. PaCA \[[64](https://arxiv.org/html/2606.13767#bib.bib84)\] proposes fine\-tuning randomly selected connections within pretrained weights to improve training speed and reduce overhead\.

The above\-mentioned PEFT methods are important extensions of the LoRA family of PEFTs\. Although we do not claim any technical or algorithmic novelty in designing another PEFT method in this paper, our research question is still a well\-timed and important one\. We ask: instead of task\-specific tailoring, can we induce simple, easy\-to\-implement sparsity in the present SOTA LoRA variants and still witness their superior performance across diverse tasks? See Tables [2](https://arxiv.org/html/2606.13767#S3.T2), [1](https://arxiv.org/html/2606.13767#S3.T1), and Figure [6](https://arxiv.org/html/2606.13767#A5.F6)\. At the same time, through a unified information\-theoretic generalization error bound analysis framework, we demonstrate that the sparsity\-induced variants share similar upper bounds with their parent PEFT methods\. In practice, however, they may differ and show substantially better performance than their parent PEFT methods in many cases\. In summary, except for PaCA, we share little to no conceptual proximity to other LoRA variants\. In the next section, we explore this connection further, and based on our understanding of the SOTA LoRA variants, we introduce a few new artifacts to PaCA\.

## Appendix B Relationship between PaCA and cLA

During fine\-tuning, PEFT adapters such as LoRA must be processed alongside the backbone and cannot be merged, which limits their hardware utilization\. PaCA \[[64](https://arxiv.org/html/2606.13767#bib.bib84)\] is motivated by these training\-time inefficiencies of adapter\-based PEFT\. PaCA explicitly fine\-tunes a random subset of the pretrained model’s columns and uses partial activations to form gradients, improving throughput and lowering activation\-memory cost \[[64](https://arxiv.org/html/2606.13767#bib.bib84)\]\. In contrast, cLA stays within the LoRA family to analyze extremely sparse LoRA structures and to allow simple LoRA\-based adapter deployment\. cLA fixes its $A$ matrix to $[I_r \,|\, \mathbf{0}]$, forcing a column update through $B$’s projection of $A$: $\Delta W = B[I_r \,|\, \mathbf{0}] = [B \,|\, \mathbf{0}]$\. Additionally, an alternate variant of cLA updates the $A$ matrix and freezes the $B$ matrix, forcing cLA to fine\-tune a restricted subset of the row space of the pretrained model\. As fine\-tuning $B$ is inherently more effective than fine\-tuning $A$ \[[75](https://arxiv.org/html/2606.13767#bib.bib18)\], this suggests that applying PaCA to a fixed subset of the rows of the pretrained matrix instead of the columns would degrade performance\.
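
The column-update identity $\Delta W = B[I_r \,|\, \mathbf{0}] = [B \,|\, \mathbf{0}]$ can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the helper name `cla_delta_w` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def cla_delta_w(B: np.ndarray, d_in: int) -> np.ndarray:
    """Return Delta W = B @ [I_r | 0] for the frozen factor A = [I_r | 0]."""
    r = B.shape[1]
    # Frozen factor of shape (r, d_in): identity block followed by zeros.
    A = np.hstack([np.eye(r), np.zeros((r, d_in - r))])
    return B @ A

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 10, 3            # toy sizes (hypothetical)
B = rng.standard_normal((d_out, r))  # the only trainable factor
delta_w = cla_delta_w(B, d_in)

# The update equals [B | 0]: the first r columns are B, the rest stay zero.
expected = np.hstack([B, np.zeros((d_out, d_in - r))])
assert np.allclose(delta_w, expected)
```

Because only the first $r$ columns of the update are nonzero, training $B$ alone touches exactly those columns of the pretrained weight.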

### B\.1 Introducing New Artifacts to PaCA

Having established a connection between the LoRA PEFT family and PaCA via the sparsity\-induced LoRA variants, we examine how some performance\-enhancing artifacts of LoRA PEFT methods transfer to PaCA\. One is chain construction, which yields a variant we call C\-PaCA, where we periodically resample the columns PaCA is updating\. For a fairer comparison to our sparsity\-induced method, cLA, we introduce a deterministic PaCA variant, D\-PaCA, which updates the same columns as cLA\. PaCA and r\-cLA both update a random subset of $r$ columns of each adapted layer, and C\-PaCA and r\-$c^{3}$LA effectively add chaining behavior to PaCA and r\-cLA, respectively\.

Experiments\. To showcase the performance of PaCA and the sparsity\-induced LoRA variants compared to baseline SOTA LoRA variants and full fine\-tuning, we fine\-tune ViT\-Tiny and ViT\-Base on the OfficeHome and CIFAR\-10 datasets, and RoBERTa\-Base and RoBERTa\-Large on the MRPC and CoLA datasets using the aforementioned methods with rank $r=8$ and 30 epochs, averaged over three seeds\. To align our experiments with those in the paper that introduced PaCA \[[64](https://arxiv.org/html/2606.13767#bib.bib84)\], we adapt all layers of the models and fully fine\-tune the classification heads, then report the accuracy in Table [4](https://arxiv.org/html/2606.13767#A2.T4)\. To align with the other experiments in our paper as well as the paper that introduced LoRA \[[27](https://arxiv.org/html/2606.13767#bib.bib2)\], we adapt only the query and value matrices of each attention head, fully fine\-tune the classification heads of the models, and report those results in Table [5](https://arxiv.org/html/2606.13767#A2.T5)\.

Discussion of results\. In both Tables [4](https://arxiv.org/html/2606.13767#A2.T4) and [5](https://arxiv.org/html/2606.13767#A2.T5), the difference in accuracy between each sparsity\-induced LoRA method and its PaCA counterpart \(cLA and D\-PaCA, r\-cLA and PaCA, r\-$c^{3}$LA and C\-PaCA\) is negligible, empirically validating how these methods connect LoRA and PaCA\. In Table [5](https://arxiv.org/html/2606.13767#A2.T5), C\-PaCA’s performance relative to PaCA was consistent with applying chain construction to the LoRA PEFT family\. For example, if CoLA outperformed LoRA for a particular row, then C\-PaCA often outperformed PaCA\. This effect was less consistent in Table [4](https://arxiv.org/html/2606.13767#A2.T4)\. We attribute this inconsistency to adapting more layers: the increased capability of updating more columns throughout training was unnecessary, as we were already updating far more columns by adapting more layers\. For efficiency details of PaCA and the sparsity\-induced LoRA variants compared to SOTA variants, see §[E\.3](https://arxiv.org/html/2606.13767#A5.SS3)\.

Table 4: Test accuracy \(%\) of ViT\-Tiny and ViT\-Base models fine\-tuned on OfficeHome and CIFAR\-10, and RoBERTa\-Base and RoBERTa\-Large models fine\-tuned on MRPC and CoLA, averaged over three seeds \(0, 1, 2\) for various LoRA and PaCA methods\. The value in parentheses is normalized to LoRA’s test accuracy, for $r=16$, adapting the token and map embeddings, query, key, value, and output matrices of the attention layers, and both fully connected layers\. We underline sparsity\-induced variants that remain very competitive with the best performing methods for each model and dataset combination\. Column groups: LoRA variants \(LoRA, CoLA, Asym, RAC\), sparsity\-induced variants \(cLA, $c^3$LA, r\-cLA, r\-$c^3$LA\), and PaCA variants \(D\-PaCA, PaCA, C\-PaCA\)\.

| Model | Dataset | FFT | LoRA | CoLA | Asym | RAC | cLA | $c^3$LA | r-cLA | r-$c^3$LA | D-PaCA | PaCA | C-PaCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny | OfficeHome | 54.2 (0.810) | 66.9 (1.000) | 66.9 (1.000) | 67.4 (1.008) | 67.4 (1.008) | 64.7 (0.968) | 64.7 (0.968) | 64.6 (0.965) | 64.4 (0.963) | 63.7 (0.952) | 64.0 (0.956) | 63.9 (0.956) |
| ViT-Tiny | CIFAR-10 | 92.7 (0.990) | 93.6 (1.000) | 93.5 (0.998) | 94.8 (1.013) | 94.8 (1.013) | 92.5 (0.988) | 92.4 (0.987) | 93.2 (0.996) | 92.1 (0.984) | 93.5 (0.999) | 92.9 (0.992) | 92.4 (0.987) |
| ViT-Base | OfficeHome | 68.4 (0.862) | 79.3 (1.000) | 79.3 (1.000) | 80.4 (1.013) | 80.4 (1.013) | 79.9 (1.007) | 79.9 (1.007) | 80.3 (1.011) | 79.9 (1.007) | 80.0 (1.008) | 79.7 (1.004) | 80.0 (1.009) |
| ViT-Base | CIFAR-10 | 95.0 (0.972) | 97.8 (1.000) | 98.0 (1.002) | 98.8 (1.011) | 98.8 (1.011) | 98.5 (1.007) | 98.5 (1.007) | 98.6 (1.008) | 98.7 (1.009) | 98.2 (1.005) | 98.4 (1.006) | 98.5 (1.007) |
| RoBERTa-Base | MRPC | 90.8 (0.998) | 91.0 (1.000) | 90.3 (0.992) | 90.6 (0.995) | 90.2 (0.990) | 89.3 (0.981) | 89.3 (0.981) | 90.1 (0.989) | 89.8 (0.987) | 89.4 (0.982) | 90.2 (0.991) | 89.7 (0.985) |
| RoBERTa-Base | CoLA | 62.7 (1.031) | 60.9 (1.000) | 65.1 (1.069) | 60.7 (0.997) | 62.1 (1.021) | 59.3 (0.974) | 59.2 (0.973) | 62.3 (1.024) | 64.3 (1.056) | 60.5 (0.994) | 60.3 (0.990) | 62.1 (1.020) |
| RoBERTa-Large | MRPC | 91.3 (1.000) | 91.3 (1.000) | 90.9 (0.996) | 91.1 (0.997) | 90.8 (0.995) | 91.0 (0.997) | 90.9 (0.996) | 90.9 (0.995) | 90.9 (0.996) | 91.3 (1.000) | 91.1 (0.998) | 91.2 (0.999) |
| RoBERTa-Large | CoLA | 69.7 (1.032) | 67.6 (1.000) | 64.2 (0.950) | 67.0 (0.992) | 67.2 (0.994) | 64.3 (0.952) | 65.8 (0.974) | 64.0 (0.947) | 66.8 (0.988) | 66.3 (0.982) | 69.0 (1.021) | 70.1 (1.037) |

Table 5: Test accuracy \(%\) of ViT\-Tiny and ViT\-Base models fine\-tuned on OfficeHome and CIFAR\-10, and RoBERTa\-Base and RoBERTa\-Large models fine\-tuned on MRPC and CoLA, averaged over three seeds \(0, 1, 2\) for various LoRA and PaCA methods\. The value in parentheses is normalized to LoRA’s test accuracy, for $r=16$, adapting the query and value matrices of the attention layers as well as the classification head\. We underline sparsity\-induced variants that remain very competitive with the best performing methods for each model and dataset combination\. Column groups: LoRA variants \(LoRA, CoLA, Asym, RAC\), sparsity\-induced variants \(cLA, $c^3$LA, r\-cLA, r\-$c^3$LA\), and PaCA variants \(D\-PaCA, PaCA, C\-PaCA\)\.

| Model | Dataset | FFT | LoRA | CoLA | Asym | RAC | cLA | $c^3$LA | r-cLA | r-$c^3$LA | D-PaCA | PaCA | C-PaCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny | OfficeHome | 54.7 (0.803) | 68.2 (1.000) | 68.2 (1.000) | 68.4 (1.003) | 68.4 (1.003) | 67.4 (0.989) | 67.4 (0.989) | 67.8 (0.995) | 67.9 (0.995) | 66.6 (0.976) | 67.4 (0.989) | 67.2 (0.985) |
| ViT-Tiny | CIFAR-10 | 93.5 (0.997) | 93.7 (1.000) | 93.8 (1.001) | 94.2 (1.005) | 94.2 (1.005) | 94.0 (1.003) | 93.9 (1.002) | 93.9 (1.001) | 93.8 (1.001) | 94.1 (1.004) | 94.0 (1.003) | 94.0 (1.003) |
| ViT-Base | OfficeHome | 68.4 (0.850) | 80.5 (1.000) | 80.5 (1.000) | 80.2 (0.995) | 80.4 (0.998) | 79.9 (0.992) | 79.9 (0.992) | 79.8 (0.991) | 79.8 (0.991) | 80.3 (0.997) | 80.1 (0.994) | 80.0 (0.994) |
| ViT-Base | CIFAR-10 | 95.0 (0.963) | 98.6 (1.000) | 98.6 (1.000) | 99.0 (1.004) | 99.0 (1.004) | 98.6 (0.999) | 98.6 (0.999) | 98.7 (1.001) | 98.8 (1.001) | 98.8 (1.002) | 98.6 (1.000) | 98.5 (0.999) |
| RoBERTa-Base | MRPC | 90.8 (1.010) | 89.9 (1.000) | 90.1 (1.003) | 90.1 (1.002) | 89.7 (0.997) | 88.7 (0.986) | 88.7 (0.987) | 89.5 (0.996) | 89.5 (0.996) | 88.9 (0.988) | 89.3 (0.993) | 88.8 (0.987) |
| RoBERTa-Base | CoLA | 62.7 (0.953) | 65.8 (1.000) | 61.2 (0.931) | 59.3 (0.901) | 60.0 (0.911) | 58.1 (0.883) | 58.4 (0.887) | 59.5 (0.904) | 57.6 (0.876) | 57.1 (0.868) | 57.9 (0.880) | 57.6 (0.875) |
| RoBERTa-Large | MRPC | 91.3 (1.019) | 89.6 (1.000) | 90.8 (1.014) | 91.0 (1.016) | 91.2 (1.019) | 90.1 (1.006) | 90.5 (1.010) | 89.5 (1.000) | 90.0 (1.005) | 90.4 (1.010) | 89.6 (1.000) | 90.3 (1.008) |
| RoBERTa-Large | CoLA | 69.7 (1.047) | 66.6 (1.000) | 68.4 (1.027) | 69.3 (1.042) | 68.1 (1.023) | 65.0 (0.976) | 66.4 (0.997) | 67.1 (1.008) | 66.4 (0.997) | 67.1 (1.009) | 67.3 (1.012) | 65.4 (0.982) |

### B\.2 Applying PaCA’s Convergence Result to cLA

PaCA updates only a subset of columns in each layer\. Writing the $l$\-th layer weight as a list of columns $W_l = [w^l_1, w^l_2, \ldots, w^l_{d_{\rm out}}]$ and letting $P_l = [i_1^l, i_2^l, \cdots, i_r^l]$ denote the selected column indices, PaCA’s masked column update is

$$W_l^{t+1} = W_l^{t} - \eta\,\Delta W_l^{t} = W_l^{t} - \eta\,[0, \nabla_{{i_1^l}w_l^{t}}, \ldots, \nabla_{{i_r^l}w_l^{t}}, \ldots, 0].$$
###### Theorem 2\.

\(Loss Convergence of Partial Connections \[Theorem 1 \[[64](https://arxiv.org/html/2606.13767#bib.bib84)\]\]\) If the gradient of the loss $f(W,X)$ is Lipschitz continuous with Lipschitz constant $L_{\cal L}$, and only the partial connections are updated, then

$$f(W^{t+1},X^{t+1}) \leq f(W^{t},X^{t}) - \eta\left(1-\frac{\eta L_{\cal L}}{2}\right)\|\nabla_{P^{t}}\|^{2},$$

where $\eta$ is the learning rate and $\|\nabla_{P^{t}}\|^{2}$ is the squared norm of the masked/restricted gradient, consisting only of the gradient for the columns in $P_l$ for each layer $l\in[L]$\.
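
As a sanity check, the inequality can be verified numerically on a toy problem. The sketch below is illustrative only and rests on assumptions not in the paper: a single linear layer, a quadratic loss $\tfrac{1}{2}\|WX-Y\|_F^2$ (whose gradient is Lipschitz with constant $\lambda_{\max}(XX^\top)$), a fixed batch $X$ across the step, and hypothetical toy sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N, r = 5, 8, 20, 3                 # toy sizes (hypothetical)
X = rng.standard_normal((m, N))          # fixed batch (assumption)
Y = rng.standard_normal((n, N))
W = rng.standard_normal((n, m))

def loss(W):
    # Quadratic loss chosen so the Lipschitz constant is known exactly.
    return 0.5 * np.sum((W @ X - Y) ** 2)

def grad(W):
    return (W @ X - Y) @ X.T

# Lipschitz constant of the gradient: largest eigenvalue of X X^T.
L_smooth = np.linalg.eigvalsh(X @ X.T).max()

# Select r columns at random and build a 0/1 column mask.
P = rng.choice(m, size=r, replace=False)
mask = np.zeros(m)
mask[P] = 1.0

eta = 1.0 / L_smooth                     # any 0 < eta < 2 / L_smooth works
G_P = grad(W) * mask                     # gradient restricted to columns in P
W_next = W - eta * G_P                   # masked column update

lhs = loss(W_next)
rhs = loss(W) - eta * (1 - eta * L_smooth / 2) * np.sum(G_P ** 2)
assert lhs <= rhs + 1e-9                 # the bound of Theorem 2 on this toy case
```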

In particular, the significance of Theorem [2](https://arxiv.org/html/2606.13767#Thmtheorem2) lies in the fact that a learning rate of $0<\eta<2/L_{\cal L}$ ensures that the loss decreases after each iteration\. We use PaCA’s Theorem [2](https://arxiv.org/html/2606.13767#Thmtheorem2) to demonstrate similar convergence behavior for cLA and formalize the result in Theorem [3](https://arxiv.org/html/2606.13767#Thmtheorem3)\.

###### Theorem 3\.

\(Loss Convergence of cLA \[Theorem [2](https://arxiv.org/html/2606.13767#Thmtheorem2) applied to cLA\]\) If the gradient of the loss $f(W,X)$ is Lipschitz continuous with Lipschitz constant $L_{\cal L}$, and only the columns from cLA are updated, then

$$f(W^{t+1},X^{t+1}) \leq f(W^{t},X^{t}) - \eta\left(1-\frac{\eta L_{\cal L}}{2}\right)\|\nabla_{W}{\cal L}(W^{t})A_{0}^{\top}A_{0}\|^{2}.$$

###### Proof\.

In cLA, we consider the frozen factor $A_0 = [I_r \,|\, 0]$, so the effective weight can be represented by

$$W^{t} = W_{0} + \frac{\alpha}{r}B^{t}A_{0}.$$

Observe that only the first $r$ columns of $W$ can change \($\Delta W = B[I_r \,|\, 0] = [B \,|\, 0]$\)\. Since $W^{t} = W_{0} + \frac{\alpha}{r}B^{t}A_{0}$ depends linearly on $B$, by the chain rule we have:

$$\nabla_{B}{\cal L}(W^{t}) = \nabla_{B}{\cal L}\Big(W_{0} + \frac{\alpha}{r}B^{t}A_{0}\Big) = \frac{\alpha}{r}\nabla_{W}{\cal L}(W^{t})A_{0}^{\top}.$$

Through gradient descent with step size $\gamma$, we have:

$$B^{t+1} = B^{t} - \gamma\nabla_{B}{\cal L}(W^{t}) = B^{t} - \gamma\frac{\alpha}{r}\nabla_{W}{\cal L}(W^{t})A_{0}^{\top}.$$

Denote ${\cal K} = A_{0}^{\top}A_{0} = \operatorname{Diag}(\overbrace{1,\ldots,1}^{1\text{ to }r},0,\ldots,0)$\. Observe the following, with $\eta = \gamma\frac{\alpha^{2}}{r^{2}}$:

$$W^{t+1} - W^{t} = \frac{\alpha}{r}(B^{t+1}-B^{t})A_{0} = -\gamma\frac{\alpha^{2}}{r^{2}}\nabla_{W}{\cal L}(W^{t})A_{0}^{\top}A_{0} = -\gamma\frac{\alpha^{2}}{r^{2}}\nabla_{W}{\cal L}(W^{t}){\cal K} = -\eta\nabla_{W}{\cal L}(W^{t}){\cal K}.$$

In particular, for cLA, ${\cal K}$ acts as a mask that selects only the first $r$ columns in each layer\. Thus, if we define $P_{l}^{\text{cLA}} = \{1,2,\ldots,r\}$ for every layer $l$, then the masked gradient $\nabla_{W}{\cal L}(W^{t}){\cal K}$ is exactly the restricted gradient $\nabla_{P^{t}}$ in Theorem [2](https://arxiv.org/html/2606.13767#Thmtheorem2) \(with a deterministic choice of $P_{l}$ rather than a random subset\)\. Therefore, under the same Lipschitz\-gradient assumption on ${\cal L}$ with constant $L_{\cal L}$, Theorem [2](https://arxiv.org/html/2606.13767#Thmtheorem2) applies directly to the effective weights $W^{t}$ induced by cLA, resulting in

$${\cal L}(W^{t+1},X^{t+1}) \leq {\cal L}(W^{t},X^{t}) - \eta\Big(1-\frac{\eta L_{\cal L}}{2}\Big)\big\|\nabla_{W}{\cal L}(W^{t}){\cal K}\big\|^{2}.$$

Hence the result\. ∎

The same step\-size condition $0<\eta<2/L_{\cal L}$ ensures a decrease in the loss for cLA updates\. Equivalently, in terms of the $B$ update step size, using $\gamma = \eta\frac{r^{2}}{\alpha^{2}}$ implies the condition

$$0<\gamma<\frac{2r^{2}}{L_{\cal L}\alpha^{2}},$$

which guarantees that each iteration decreases the loss\.
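
The step-size conversion can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the authors' code: a random matrix `G` stands in for $\nabla_W{\cal L}(W^t)$, and the helper `max_gamma` and all numeric values are hypothetical. It checks that one gradient step on $B$ with step size $\gamma$ moves the effective weight exactly like a masked step on $W$ with $\eta = \gamma\alpha^2/r^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 4, 7, 2                  # toy sizes (hypothetical)
alpha, gamma = 16.0, 0.01          # scaling factor and B step size (hypothetical)

G = rng.standard_normal((n, m))    # stand-in for grad_W L(W^t)
A0 = np.hstack([np.eye(r), np.zeros((r, m - r))])   # frozen factor [I_r | 0]
K = A0.T @ A0                      # Diag(1,...,1,0,...,0): column mask

# One gradient step on B, mapped back to the effective weight W.
delta_B = -gamma * (alpha / r) * G @ A0.T
delta_W_from_B = (alpha / r) * delta_B @ A0

# The same step written as masked gradient descent on W.
eta = gamma * alpha**2 / r**2
delta_W_masked = -eta * G @ K
assert np.allclose(delta_W_from_B, delta_W_masked)

def max_gamma(L_smooth: float, r: int, alpha: float) -> float:
    """Upper end of the admissible range 0 < gamma < 2 r^2 / (L alpha^2)."""
    return 2 * r**2 / (L_smooth * alpha**2)
```

For instance, with the hypothetical values $L_{\cal L}=2$, $r=2$, and $\alpha=16$, the bound is $2\cdot 4/(2\cdot 256) = 0.015625$, which shows how a large $\alpha/r$ ratio shrinks the admissible step size for $B$.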

## Appendix C Pseudocode of Sparsity\-Induced LoRA Variants

In this section, we present the pseudocode of our proposed LoRA variants: cLA \(Algorithm [1](https://arxiv.org/html/2606.13767#alg1)\), random\-cLA \(Algorithm [2](https://arxiv.org/html/2606.13767#alg2)\), $c^{3}$LA \(Algorithm [3](https://arxiv.org/html/2606.13767#alg3)\), and r\-$c^{3}$LA \(Algorithm [4](https://arxiv.org/html/2606.13767#alg4)\)\.

Algorithm 1: Cheap LoRA \(cLA\)

1. Parameters: Loss function $\mathcal{L}$ and model $f_{\mathbf{W}}(\cdot)$\. Pretrained weights $\mathbf{W}_{0} = (W^{1}_{0}, \ldots, W^{L}_{0})$, where $W^{i}_{0}\in\mathbb{R}^{n_{i}\times m_{i}}$\. Rank $r \ll \min\{m_{i}, n_{i}\}_{i\in[L]}$, learning rate $\gamma>0$, scaling factor $\alpha>0$, total training iterations $T$\.
2. Initialize $A_{0}^{j} = [I_{r}\,|\,\mathbf{0}_{r\times(m_{j}-r)}]$; $B^{0,j} = \mathbf{0}$ for $j\in[L]$
3. for $t = 1, \ldots, T$ do
4. forward pass with LoRA modules
5. backward pass then update $\mathbf{B}^{t}$
6. for $j = 1, \ldots, L$ do
7. $B^{t,j} = B^{t-1,j} - \gamma\frac{\alpha}{r}\nabla_{j}\mathcal{L}(\mathbf{W}_{0} + \frac{\alpha}{r}\mathbf{B}^{t-1}\mathbf{A}_{0})\operatorname{Diag}(\overbrace{1,\ldots,1}^{1\text{ to }r},0,\ldots,0)$
8. end for
9. end for
10. $\hat{j} = \operatorname{argmin}_{j\in[T]}\mathcal{L}(\mathbf{W}_{0} + \frac{\alpha}{r}\mathbf{B}^{j}\mathbf{A}_{0})$ or task\-based metric\.
11. Return fine\-tuned weights $\mathbf{W}_{0} + \frac{\alpha}{r}\mathbf{B}^{\hat{j}}\mathbf{A}_{0}$
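
To make the loop concrete, here is a minimal NumPy sketch of Algorithm 1 for a single linear layer. It is not the authors' implementation: the quadratic loss, the toy data, and the name `train_cla` are assumptions, and the $B$ update uses the dimension-matched form $\frac{\alpha}{r}\nabla_W\mathcal{L}\,A_0^{\top}$ from the proof of Theorem 3.

```python
import numpy as np

def train_cla(W0, X, Y, r, alpha, gamma, T):
    """Toy cLA loop for one layer with loss 0.5 * ||W X - Y||_F^2 (assumed)."""
    n, m = W0.shape
    A0 = np.hstack([np.eye(r), np.zeros((r, m - r))])  # frozen [I_r | 0]
    B = np.zeros((n, r))                               # trainable, starts at 0
    scale = alpha / r
    loss = lambda W: 0.5 * np.sum((W @ X - Y) ** 2)
    best_W, best_loss = W0.copy(), np.inf
    for _ in range(T):
        W = W0 + scale * B @ A0                 # forward pass: effective weight
        grad_W = (W @ X - Y) @ X.T              # backward pass: gradient w.r.t. W
        B = B - gamma * scale * grad_W @ A0.T   # update B only
        W_new = W0 + scale * B @ A0
        cur = loss(W_new)
        if cur < best_loss:                     # keep the best iterate (argmin step)
            best_W, best_loss = W_new, cur
    return best_W, best_loss

rng = np.random.default_rng(0)
n, m, N, r, alpha = 4, 8, 32, 2, 4.0            # toy sizes (hypothetical)
W0 = rng.standard_normal((n, m))
X = rng.standard_normal((m, N))
Y = rng.standard_normal((n, N))
L_smooth = np.linalg.eigvalsh(X @ X.T).max()
gamma = r**2 / (L_smooth * alpha**2)            # gives eta = gamma*alpha^2/r^2 = 1/L
W_ft, final_loss = train_cla(W0, X, Y, r, alpha, gamma, T=50)
```

In this sketch, only the first $r$ columns of the returned weight differ from `W0`, matching the column restriction described in Appendix B.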

Algorithm 2: Random Cheap LoRA \(r\-cLA\)

1. Parameters: Loss function $\mathcal{L}$ and model $f_{\mathbf{W}}(\cdot)$\. Pretrained weights $\mathbf{W}_{0} = (W^{1}_{0}, \ldots, W^{L}_{0})$, where $W^{i}_{0}\in\mathbb{R}^{m_{i}\times n_{i}}$\. Rank $r \ll \min\{m_{i}, n_{i}\}_{i\in[L]}$, learning rate $\gamma>0$, scaling factor $\alpha>0$, total training iterations $T$\.
2. Initialize
3. $\xi_{j} = \operatorname{randint}(0, \lfloor\frac{n_{j}}{r}\rfloor - 1)\text{ for }j\in[L]$
4. $A_{0}^{j} = \left[\mathbf{0}_{r\times\xi_{j}}\mid I_{r}\mid\mathbf{0}_{r\times(n_{j}-\xi_{j}-r)}\right]$; $B^{0,j} = \mathbf{0}$ for $j\in[L]$
5. for $t = 1, \ldots, T$ do
6. forward pass with LoRA modules
7. backward pass then update $\mathbf{B}^{t}$
8. for $j = 1, \ldots, L$ do
9. $B^{t,j} = B^{t-1,j} - \gamma\frac{\alpha}{r}\nabla_{j}\mathcal{L}(\mathbf{W}_{0} + \frac{\alpha}{r}\mathbf{B}^{t-1}\mathbf{A}_{0})\operatorname{Diag}(0,\ldots,0,\overbrace{1,\ldots,1}^{\xi_{j}+1\text{ to }\xi_{j}+r},0,\ldots,0)$
10. end for
11. end for
12. $\hat{j} = \operatorname{argmin}_{j\in[T]}\mathcal{L}(\mathbf{W}_{0} + \frac{\alpha}{r}\mathbf{B}^{j}\mathbf{A}_{0})$ or task\-based metric\.
13. Return fine\-tuned weights $\mathbf{W}_{0} + \frac{\alpha}{r}\mathbf{B}^{\hat{j}}\mathbf{A}_{0}$
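
The initialization in steps 3 and 4 of Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions: `randint` is taken to be inclusive on both ends, the function name `random_a0` is hypothetical, and the sizes are toy values.

```python
import numpy as np

def random_a0(n_cols: int, r: int, rng: np.random.Generator):
    """Build A_0 = [0_{r x xi} | I_r | 0_{r x (n_cols - xi - r)}] with a random offset xi."""
    # xi drawn from {0, ..., floor(n_cols / r) - 1}; inclusive bounds are an assumption.
    xi = int(rng.integers(0, n_cols // r - 1, endpoint=True))
    A0 = np.zeros((r, n_cols))
    A0[:, xi:xi + r] = np.eye(r)     # identity block starting at column xi
    return A0, xi

rng = np.random.default_rng(0)
n_cols, r = 12, 3                    # toy sizes (hypothetical)
A0, xi = random_a0(n_cols, r, rng)

# Any update B @ A0 is nonzero only in columns xi, ..., xi + r - 1.
B = rng.standard_normal((5, r))
delta_w = B @ A0
assert np.allclose(delta_w[:, xi:xi + r], B)
```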

Algorithm 3 Circulant Chain of Cheap LoRA ($c^{3}$LA)

1: Parameters: Loss function $\mathcal{L}$ and model $f_{\mathbf{W}}(\cdot)$. Pretrained weights $\mathbf{W}_{0}^{(0)}=(W^{1},\ldots,W^{L})$, where $W^{i}\in\mathbb{R}^{m_{i}\times n_{i}}$. Rank $r\ll\min\{m_{i},n_{i}\}_{i\in[L]}$, learning rate $\gamma>0$, scaling factor $\alpha>0$, total training iterations $T$, chain length $k\leq T$.

2: Initialize $A_{0}^{j}=[I_{r}\mid\mathbf{0}_{r\times(n_{j}-r)}]$; $B^{0,j}=\mathbf{0}$ for $j\in[L]$, current chain $c=0$.

3: for $t=1,\ldots,T$ do

4: if $t\equiv 0\pmod{\lfloor\frac{T}{k}\rfloor}$ then

5: $c=c+1$

6: Merge LoRA to backbone weights: $\mathbf{W}_{0}^{(c)}=\mathbf{W}_{0}^{(c-1)}+\frac{\alpha}{r}\mathbf{B}^{t-1}\mathbf{A}_{0}$

7: Re-initialize with $\mathbf{A}_{0}$ shifted by $r$:

8: $A_{0}^{j}=\left[\mathbf{0}_{r\times cr}\mid I_{r}\mid\mathbf{0}_{r\times(n_{j}-r-cr)}\right]$; $B^{t-1,j}=\mathbf{0}$ for $j\in[L]$

9: end if

10: forward pass with LoRA modules

11: backward pass, then update $\mathbf{B}^{t}$

12: for $j=1,\ldots,L$ do

13: $B^{t,j}=B^{t-1,j}-\gamma\frac{\alpha}{r}\nabla_{j}\mathcal{L}\left(\mathbf{W}_{0}^{(c)}+\frac{\alpha}{r}\mathbf{B}^{t-1}\mathbf{A}_{0}\right)\operatorname{Diag}(0,\ldots,0,\overbrace{1,\ldots,1}^{cr+1\text{ to }(c+1)r},0,\ldots,0)$

14: end for

15: end for

16: $\hat{c},\hat{j}=\operatorname{argmin}_{j\in[\lfloor\frac{T}{k}\rfloor],\,c\in[k]}\mathcal{L}\left(\mathbf{W}_{0}^{(c)}+\frac{\alpha}{r}\mathbf{B}^{cj}\mathbf{A}_{0}\right)$ or task-based metric.

17: Return fine-tuned weights $\mathbf{W}_{0}^{(\hat{c})}+\frac{\alpha}{r}\mathbf{B}^{\hat{c}\hat{j}}\mathbf{A}_{0}$
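The merge-and-shift loop of Algorithm 3 can be sketched for a single layer as follows. This is a toy illustration, not the authors' implementation: the quadratic loss, the function names `c3la_toy` and `toy_loss`, the wrap-around of the window offset modulo $n$, and the requirement that $r$ divides $n$ are all assumptions made here. For brevity the sketch returns the last iterate instead of performing the checkpoint selection of step 16.

```python
import numpy as np

def shifted_identity(r, n, offset):
    """A_0 = [0_{r x offset} | I_r | 0_{r x (n - offset - r)}] (hypothetical helper)."""
    A0 = np.zeros((r, n))
    A0[:, offset:offset + r] = np.eye(r)
    return A0

def toy_loss(W, X, Y):
    """Assumed stand-in loss: 0.5 * ||W X - Y||_F^2."""
    return 0.5 * np.sum((W @ X - Y) ** 2)

def c3la_toy(W0, X, Y, r, alpha, gamma, T, k):
    m, n = W0.shape
    assert n % r == 0, "toy assumption: r divides n so windows tile the columns"
    scale = alpha / r
    period = T // k
    c, offset = 0, 0
    W = W0.copy()                               # current backbone W_0^{(c)}
    A0 = shifted_identity(r, n, offset)
    B = np.zeros((m, r))
    offsets = [offset]
    for t in range(1, T + 1):
        if t % period == 0:                     # step 4: chain boundary
            c += 1
            W = W + scale * B @ A0              # step 6: merge into backbone
            offset = (c * r) % n                # step 8; the wrap-around is assumed
            A0 = shifted_identity(r, n, offset)
            B = np.zeros((m, r))
            offsets.append(offset)
        # steps 10-13: gradient of the toy loss, restricted to the active window
        G = ((W + scale * B @ A0) @ X - Y) @ X.T
        B = B - gamma * scale * G[:, offset:offset + r]
    return W + scale * B @ A0, offsets

rng = np.random.default_rng(0)
m, n, p, r = 4, 10, 16, 2
W0 = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))
Y = rng.standard_normal((m, p))

W_final, offsets = c3la_toy(W0, X, Y, r=r, alpha=2.0, gamma=0.005, T=60, k=3)
loss_before, loss_after = toy_loss(W0, X, Y), toy_loss(W_final, X, Y)
```

With these settings the window visits offsets 0, 2, 4, and 6, so the last two columns of the weight matrix are never updated, while each chain trains only $m\times r$ parameters at a time.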

Algorithm 4 Random Circulant Chain of Cheap LoRA (r-$c^{3}$LA)

1: Parameters: Loss function $\mathcal{L}$ and model $f_{\mathbf{W}}(\cdot)$. Pretrained weights $\mathbf{W}_{0}^{(0)}=(W^{1},\ldots,W^{L})$, where $W^{i}\in\mathbb{R}^{m_{i}\times n_{i}}$. Rank $r\ll\min\{m_{i},n_{i}\}_{i\in[L]}$, learning rate $\gamma>0$, scaling factor $\alpha>0$, total training iterations $T$, chain length $k\leq T$.

2: Initialize

3: $\xi_{j}=\operatorname{randint}(0,\lfloor\frac{n_{j}}{r}\rfloor-1)$ for $j\in[L]$.

4: $A_{0}^{j}=\left[\mathbf{0}_{r\times\xi_{j}}\mid I_{r}\mid\mathbf{0}_{r\times(n_{j}-\xi_{j}-r)}\right]$; $B^{0,j}=\mathbf{0}$ for $j\in[L]$, current chain $c=0$.

5: for $t=1,\ldots,T$ do

6: if $t\equiv 0\pmod{\lfloor\frac{T}{k}\rfloor}$ then

7: $c=c+1$

8: Merge LoRA to backbone weights: $\mathbf{W}_{0}^{(c)}=\mathbf{W}_{0}^{(c-1)}+\frac{\alpha}{r}\mathbf{B}^{t-1}\mathbf{A}_{0}$

9: Re-initialize with $\mathbf{A}_{0}$ shifted by a new random variable $\xi^{\prime}_{j}$:

10: $A_{0}^{j}=\left[\mathbf{0}_{r\times\xi^{\prime}_{j}}\mid I_{r}\mid\mathbf{0}_{r\times(n_{j}-\xi^{\prime}_{j}-r)}\right]$; $B^{t-1,j}=\mathbf{0}$ for $j\in[L]$

11: end if

12: forward pass with LoRA modules

13: backward pass, then update $\mathbf{B}^{t}$

14: for $j=1,\ldots,L$ do

15: $B^{t,j}=B^{t-1,j}-\gamma\frac{\alpha}{r}\nabla_{j}\mathcal{L}\left(\mathbf{W}_{0}^{(c)}+\frac{\alpha}{r}\mathbf{B}^{t-1}\mathbf{A}_{0}\right)\operatorname{Diag}(0,\ldots,0,\overbrace{1,\ldots,1}^{\xi_{j}+1\text{ to }\xi_{j}+r},0,\ldots,0)$

16: end for

17: end for

18: $\hat{c},\hat{j}=\operatorname{argmin}_{j\in[\lfloor\frac{T}{k}\rfloor],\,c\in[k]}\mathcal{L}\left(\mathbf{W}_{0}^{(c)}+\frac{\alpha}{r}\mathbf{B}^{cj}\mathbf{A}_{0}\right)$ or task-based metric.

19: Return fine-tuned weights $\mathbf{W}_{0}^{(\hat{c})}+\frac{\alpha}{r}\mathbf{B}^{\hat{c}\hat{j}}\mathbf{A}_{0}$
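The only change relative to Algorithm 3 is how the window offset is chosen. A short standard-library sketch of steps 3, 4, and 10 is given below, reading $\xi_j$ literally as the number of zero columns placed before $I_r$. The helper names `sample_offset` and `shifted_identity`, the seed, and the sizes are illustrative assumptions.

```python
import random

def sample_offset(n, r, rng):
    """Step 3: xi = randint(0, floor(n / r) - 1), with both endpoints inclusive."""
    return rng.randint(0, n // r - 1)

def shifted_identity(r, n, xi):
    """Steps 4 and 10: A_0 = [0_{r x xi} | I_r | 0_{r x (n - xi - r)}] as nested lists."""
    return [[1.0 if col == xi + row else 0.0 for col in range(n)]
            for row in range(r)]

rng = random.Random(0)      # fixed seed only to make the sketch reproducible
n, r = 12, 3                # toy layer width and rank

# One offset at initialization, then a fresh one at each chain boundary.
offsets = [sample_offset(n, r, rng) for _ in range(5)]
A0 = shifted_identity(r, n, offsets[0])
```

Since each row of $A_0$ holds a single one, multiplying by $A_0$ only selects columns, so resampling the offset changes which column block of the layer is trained without changing the number of trainable parameters.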

## Appendix D Theoretical Results

This section complements Section [3](https://arxiv.org/html/2606.13767#S3) in the main paper.

### D.1 Generalization

Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) provides a point-wise deterministic bound that relates the generalization error of the pretrained model $\mathbf{W}_{0}$ to that of the fine-tuned model $\mathbf{W}_{0}+\Delta\mathbf{W}$, rather than a uniform probabilistic PAC bound over a hypothesis class. This construction matches our intended use, so that practitioners can be theoretically guided by the structural characteristics of a fine-tuning procedure.

In this section, we give a detailed proof of the generalization error bound. We start by listing the inequalities used in this section.

#### D.1.1 Inequalities used

1. If $A,B\in\mathbb{R}^{m\times n}$ and $x\in\mathbb{R}^{n}$, then the triangle inequality gives:
   $$\|(A+B)x\|\leq\|Ax\|+\|Bx\|.\tag{10}$$
2. For $A\in\mathbb{R}^{m\times n}$ and $x\in\mathbb{R}^{n}$, we have:
   $$\|Ax\|\leq\|A\|_{2}\|x\|.\tag{11}$$
3. By Assumption [3](https://arxiv.org/html/2606.13767#Thmassumption3), we have:
   $$\|\sigma(Ax)\|\leq\|\sigma(Ax)-\sigma(0)\|+\|\sigma(0)\|\leq L_{\sigma}\|Ax\|+\|\sigma(0)\|.\tag{12}$$
4. For a finite collection of matrices $\{A_{1},\cdots,A_{k}\}$ with $A_{i}\in\mathbb{R}^{m\times n}$, we have:
   $$\operatorname{rank}\Big(\sum_{i=1}^{k}A_{i}\Big)\leq\sum_{i=1}^{k}\operatorname{rank}(A_{i}).\tag{13}$$
5. Let $\mathbf{I}(X;Y)$ denote the mutual information between random variables $X$ and $Y$. It measures how much knowing one random variable reveals about the other, i.e.,
   $$\mathbf{I}(X;Y)=D(P_{(X,Y)}\,\|\,P_{X}\otimes P_{Y})=\sup_{F}\Big\{\int F\,dP_{XY}-\log\int e^{F}\,d(P_{X}\otimes P_{Y})\Big\},$$
   where $F$ is a bounded, measurable function [[66](https://arxiv.org/html/2606.13767#bib.bib72)]. Let $T$ be a deterministic map applied to $A\in\mathbb{R}^{m\times n}$. Then the Data Processing Inequality (DPI) gives $\mathbf{I}(T(A);N)\leq\mathbf{I}(A;N)$, where $N$ denotes the training dataset. If $T$ is bijective, applying the DPI to both $T$ and $T^{-1}$ gives [[75](https://arxiv.org/html/2606.13767#bib.bib18)]:
   $$\mathbf{I}(A;N)=\mathbf{I}(T(A);N).\tag{14}$$
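As a small illustration of item 5, the sketch below computes mutual information for a discrete joint distribution and compares a non-injective map with a bijective one. The joint probability table, the two maps `merge` and `permute`, and the helper names are made up for this example; the paper's setting involves matrix-valued random variables, for which this discrete toy is only an analogy.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) in nats for a joint pmf given as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def push_forward(joint, T):
    """Joint pmf of (T(A), N) obtained from the joint pmf of (A, N)."""
    out = defaultdict(float)
    for (a, n), p in joint.items():
        out[(T(a), n)] += p
    return dict(out)

# Hypothetical joint pmf of A in {0, 1, 2, 3} and N in {0, 1}; entries sum to 1.
joint = {
    (0, 0): 0.30, (0, 1): 0.05,
    (1, 0): 0.10, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.20,
    (3, 0): 0.05, (3, 1): 0.10,
}

def merge(a):
    """Non-injective deterministic map: collapses {0, 1} and {2, 3}."""
    return a // 2

def permute(a):
    """Bijective deterministic map: cyclic relabeling of the four values."""
    return (a + 1) % 4

i_full = mutual_information(joint)
i_merge = mutual_information(push_forward(joint, merge))
i_perm = mutual_information(push_forward(joint, permute))
```

Here `i_merge` cannot exceed `i_full`, in line with the DPI, while `i_perm` matches `i_full` because a relabeling loses no information, which is the content of (14).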

Now we are ready to prove Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1).

#### D.1.2 Proof of Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1)

###### Theorem 1.

(Generalization bounds) Let

$$f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(x)=\sigma_{L}\big([W_{0}^{L}+\Delta W^{L}](\cdots\sigma_{2}([W_{0}^{2}+\Delta W^{2}]\,\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x))\cdots)\big)$$

be an $L$-layer fine-tuned DNN, where $\mathbf{W}_{0}+\Delta\mathbf{W}$ denotes the fine-tuned weights. Let the loss function $\mathcal{L}$ for fine-tuning satisfy Assumptions [1](https://arxiv.org/html/2606.13767#Thmassumption1)–[3](https://arxiv.org/html/2606.13767#Thmassumption3). Then

$$\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\min\left(\mathcal{G}(\mathbf{W}_{0})+\Phi_{\Delta\mathbf{W}},\;\mathcal{G}(\Delta\mathbf{W})+\Phi_{\mathbf{W}_{0}}\right),$$

where

$$\Phi_{\Delta\mathbf{W}}:=2L_{\mathcal{L}}\left[C\prod_{k=1}^{L}L_{\sigma_{k}}\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P(i,j)+\sum_{i\neq 2^{a}-1:\,a\in[L]}^{2^{L}-2}F(i)\right]$$

and

$$\Phi_{\mathbf{W}_{0}}:=2L_{\mathcal{L}}\left[C\prod_{k=1}^{L}L_{\sigma_{k}}\sum_{i=2}^{2^{L}}\prod_{j=1}^{L}P(i,j)+\sum_{i\neq 2^{a}:\,a\in[L]}^{2^{L}-1}F(i)\right]$$

are the correction terms,

$$F(i):=\|\sigma_{L-\psi(i)}(0)\|\prod_{j=1}^{\psi(i)}\left[L_{\sigma_{L-j+1}}\,H(i,j)\right],\qquad\psi(i):=\lfloor\log_{2}(i)\rfloor,$$

and

$$P(i,j):=\begin{cases}\|W_{0}^{L-j+1}\|_{2}&\text{if }\left\lfloor\frac{i-1}{2^{L-j}}\right\rfloor\text{ is odd},\\[4pt]\|\Delta W^{L-j+1}\|_{2}&\text{if }\left\lfloor\frac{i-1}{2^{L-j}}\right\rfloor\text{ is even},\end{cases}\qquad H(i,j):=\begin{cases}\|\Delta W^{L-j+1}\|_{2}&\text{if }\left\lfloor\frac{i}{2^{\psi(i)-j}}\right\rfloor\text{ is odd},\\[4pt]\|W_{0}^{L-j+1}\|_{2}&\text{if }\left\lfloor\frac{i}{2^{\psi(i)-j}}\right\rfloor\text{ is even}.\end{cases}$$

###### Proof\.

Let

$$f_{\mathbf{W}_{0}+\Delta\mathbf{W}}:=\sigma_{L}\big([W_{0}^{L}+\Delta W^{L}]\,\sigma_{L-1}(\ldots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\ldots)\big)$$

represent our fine-tuned model and

$$f_{\mathbf{W}_{0}}:=\sigma_{L}\big(W_{0}^{L}\,\sigma_{L-1}(\ldots\sigma_{1}(W_{0}^{1}x)\ldots)\big)$$

represent our pretrained model. First, we upper bound the quantity $\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|$. We have

$$\begin{aligned}
\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|
&=\big\|\sigma_{L}\big([W_{0}^{L}+\Delta W^{L}]\,\sigma_{L-1}(\cdots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\cdots)\big)\\
&\quad-\sigma_{L}\big(W_{0}^{L}\,\sigma_{L-1}(\cdots\sigma_{1}(W_{0}^{1}x)\cdots)\big)\big\|\\
&\overset{\text{Assumption 3}}{\leq}L_{\sigma_{L}}\big\|[W_{0}^{L}+\Delta W^{L}]\,\sigma_{L-1}(\cdots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\cdots)\\
&\quad-W_{0}^{L}\,\sigma_{L-1}(\cdots\sigma_{1}(W_{0}^{1}x)\cdots)\big\|\\
&=L_{\sigma_{L}}\big\|\Delta W^{L}\,\sigma_{L-1}(\cdots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\cdots)\\
&\quad+W_{0}^{L}\big[\sigma_{L-1}(\cdots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\cdots)-\sigma_{L-1}(\cdots\sigma_{1}(W_{0}^{1}x)\cdots)\big]\big\|\\
&\overset{\text{Triangle inequality and (11)}}{\leq}L_{\sigma_{L}}\Big[\|\Delta W^{L}\|_{2}\,\|\sigma_{L-1}(\cdots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\cdots)\|\\
&\quad+\|W_{0}^{L}\|_{2}\,\|\sigma_{L-1}(\cdots\sigma_{1}([W_{0}^{1}+\Delta W^{1}]x)\cdots)-\sigma_{L-1}(\cdots\sigma_{1}(W_{0}^{1}x)\cdots)\|\Big].
\end{aligned}$$

If $f_{\mathbf{W}_{0}}$ and $f_{\mathbf{W}_{0}+\Delta\mathbf{W}}$ are both 1-layer, their difference is bounded by:

$$\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|\leq CL_{\sigma_{1}}\|\Delta W^{1}\|_{2}.$$

If $f_{\mathbf{W}_{0}}$ and $f_{\mathbf{W}_{0}+\Delta\mathbf{W}}$ are both 2-layer, their difference is bounded by:

$$\begin{aligned}
\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|
&\leq C\,L_{\sigma_{2}}L_{\sigma_{1}}\,\|W_{0}^{2}\|_{2}\,\|\Delta W^{1}\|_{2}+C\,L_{\sigma_{2}}L_{\sigma_{1}}\,\|\Delta W^{2}\|_{2}\,\|W_{0}^{1}\|_{2}\\
&\quad+C\,L_{\sigma_{2}}L_{\sigma_{1}}\,\|\Delta W^{2}\|_{2}\,\|\Delta W^{1}\|_{2}+L_{\sigma_{2}}\,\|\Delta W^{2}\|_{2}\,\|\sigma_{1}(0)\|.
\end{aligned}$$

If $f_{\mathbf{W}_{0}}$ and $f_{\mathbf{W}_{0}+\Delta\mathbf{W}}$ are both 3-layer, their difference is bounded by:

$$\begin{aligned}
\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|
&\leq C\,L_{\sigma_{1}}L_{\sigma_{2}}L_{\sigma_{3}}\Big(\|W_{0}^{3}\|_{2}\,\|W_{0}^{2}\|_{2}\,\|\Delta W^{1}\|_{2}\\
&\quad+\|W_{0}^{3}\|_{2}\,\|\Delta W^{2}\|_{2}\,\|W_{0}^{1}\|_{2}+\|W_{0}^{3}\|_{2}\,\|\Delta W^{2}\|_{2}\,\|\Delta W^{1}\|_{2}\\
&\quad+\|\Delta W^{3}\|_{2}\,\|W_{0}^{2}\|_{2}\,\|W_{0}^{1}\|_{2}+\|\Delta W^{3}\|_{2}\,\|W_{0}^{2}\|_{2}\,\|\Delta W^{1}\|_{2}\\
&\quad+\|\Delta W^{3}\|_{2}\,\|\Delta W^{2}\|_{2}\,\|W_{0}^{1}\|_{2}+\|\Delta W^{3}\|_{2}\,\|\Delta W^{2}\|_{2}\,\|\Delta W^{1}\|_{2}\Big)\\
&\quad+L_{\sigma_{3}}L_{\sigma_{2}}\,\|\sigma_{1}(0)\|\Big(\|W_{0}^{3}\|_{2}\,\|\Delta W^{2}\|_{2}+\|\Delta W^{3}\|_{2}\,\|W_{0}^{2}\|_{2}+\|\Delta W^{3}\|_{2}\,\|\Delta W^{2}\|_{2}\Big)\\
&\quad+L_{\sigma_{3}}\,\|\Delta W^{3}\|_{2}\,\|\sigma_{2}(0)\|,
\end{aligned}$$

and so on. Thus, by induction, the difference between $f_{\mathbf{W}_{0}}$ and $f_{\mathbf{W}_{0}+\Delta\mathbf{W}}$ for an $L$-layer model can be upper bounded by:

$$\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|\leq C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i=2;\,i\neq 2^{a}-1,\,a\in[L]}^{2^{L}-2}F(i).$$

If we treat $\Delta W^{i}$ and $W_{0}^{i}$ as binary classes, labeled 0 and 1 respectively, then $W_{0}^{3}W_{0}^{2}W_{0}^{1}$ corresponds to $111_{2}$, or 7, and $\Delta W^{3}W_{0}^{2}\Delta W^{1}$ corresponds to $010_{2}$, or 2. Using this pattern, we can expand the summation with the following expressions:

$$P_{L}(i,j)=\begin{cases}\|W_{0}^{L-j+1}\|_{2},&\text{if }\left\lfloor\dfrac{i-1}{2^{\,L-j}}\right\rfloor\bmod 2=1,\\[6pt]\|\Delta W^{L-j+1}\|_{2},&\text{if }\left\lfloor\dfrac{i-1}{2^{\,L-j}}\right\rfloor\bmod 2=0,\end{cases}$$

$$F(i)=\|\sigma_{L-\lfloor\log_{2}(i)\rfloor}(0)\|\prod_{j=1}^{\lfloor\log_{2}(i)\rfloor}\left[L_{\sigma_{L-j+1}}\,H(i,j)\right],$$

and

$$H(i,j)=\begin{cases}\|\Delta W^{L-j+1}\|_{2}&\text{if }\left\lfloor\dfrac{i}{2^{\lfloor\log_{2}(i)\rfloor-j}}\right\rfloor\bmod 2=1,\\[6pt]\|W_{0}^{L-j+1}\|_{2}&\text{if }\left\lfloor\dfrac{i}{2^{\lfloor\log_{2}(i)\rfloor-j}}\right\rfloor\bmod 2=0,\end{cases}$$

where $F(i)$ and $H(i,j)$ are index functions that can be visualized in Figure [5](https://arxiv.org/html/2606.13767#A4.F5). For representational purposes, every vertex that has three red edges adds the $\ell_{2}$ norm of the layer below its activation function on the zero vector. When a vertex has two different colored edges strictly below it, it collapses into an $A$ and $B$ sub-component. When this occurs, no additional offset term is added to our summation. A total of $2^{L}-(L+1)$ of these offset terms will be added. Both $P(i,j)$ and $H(i,j)$ can also be written by cases on even and odd inputs, since their indexing relies on modular arithmetic over the binary classification ($\|W_{0}^{i}\|$ and $\|\Delta W^{i}\|$).
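The binary indexing behind $P_{L}(i,j)$ can be made concrete with a short enumeration. The sketch below lists the product terms for $i=1,\ldots,2^{L}-1$ by reading the bits of $i-1$ (bit 1 selects $\|W_{0}\|_{2}$, bit 0 selects $\|\Delta W\|_{2}$) and checks numerically that their sum equals $\prod_{l}(\|W_{0}^{l}\|_{2}+\|\Delta W^{l}\|_{2})-\prod_{l}\|W_{0}^{l}\|_{2}$. The function name `product_terms` and the norm values are placeholders chosen for illustration; only $P_{L}$ is covered here, not the offset terms $F(i)$.

```python
import math

def product_terms(w0_norms, dw_norms):
    """Return [(label, value)] for i = 1 .. 2^L - 1, following P_L(i, j).

    w0_norms[l - 1] and dw_norms[l - 1] hold the spectral norms of layer l.
    """
    L = len(w0_norms)
    terms = []
    for i in range(1, 2 ** L):
        labels, value = [], 1.0
        for j in range(1, L + 1):
            layer = L - j + 1
            bit = ((i - 1) // 2 ** (L - j)) % 2
            if bit == 1:                      # odd case: pretrained factor
                labels.append(f"W0^{layer}")
                value *= w0_norms[layer - 1]
            else:                             # even case: update factor
                labels.append(f"dW^{layer}")
                value *= dw_norms[layer - 1]
        terms.append((" ".join(labels), value))
    return terms

# Placeholder norms for a 3-layer example.
w0_norms = [2.0, 3.0, 1.5]
dw_norms = [0.1, 0.2, 0.05]

terms = product_terms(w0_norms, dw_norms)
total = sum(value for _, value in terms)
closed_form = (math.prod(w + d for w, d in zip(w0_norms, dw_norms))
               - math.prod(w0_norms))
```

For $L=3$ this yields seven terms. The pattern $111_{2}$ (all pretrained factors) is never reached because $i-1\leq 2^{L}-2$, so every term contains at least one $\|\Delta W\|_{2}$ factor.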

Now that we have an upper bound for the difference of our hypotheses, we estimate the difference in terms of true loss and empirical loss:

$$\begin{aligned}
\mathcal{L}_{\rm global}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{L}_{\rm global}(\mathbf{W}_{0})
&=\mathbb{E}_{\mathcal{X},\mathcal{Y}\sim\nu}[\ell(f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(X),Y)]-\mathbb{E}_{\mathcal{X},\mathcal{Y}\sim\nu}[\ell(f_{\mathbf{W}_{0}}(X),Y)]\\
&=\mathbb{E}_{\mathcal{X},\mathcal{Y}\sim\nu}[\ell(f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(X),Y)-\ell(f_{\mathbf{W}_{0}}(X),Y)]\\
&\leq\mathbb{E}_{\mathcal{X},\mathcal{Y}\sim\nu}\big[L_{\mathcal{L}}\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(X)-f_{\mathbf{W}_{0}}(X)\|\big]\\
&\leq\mathbb{E}_{\mathcal{X},\mathcal{Y}\sim\nu}\left[L_{\mathcal{L}}\left(C\prod_{k=1}^{L}L_{\sigma_{k}}\left[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\right]+\sum_{i\neq 2^{a}-1}^{2^{L}-2}F(i)\right)\right]\\
&=L_{\mathcal{L}}\left[C\prod_{k=1}^{L}L_{\sigma_{k}}\left[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\right]+\sum_{i\neq 2^{a}-1}^{2^{L}-2}F(i)\right].
\end{aligned}$$

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/AB_Component_Collapse.png)

Figure 5: $\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}-f_{\mathbf{W}_{0}}\|$. Visual representation of the recursive collapse of differences.

Similarly,

$$\begin{aligned}
\mathcal{L}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{L}(\mathbf{W}_{0})
&=\frac{1}{L}\sum_{i'=1}^{L}\ell\big(f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(x_{i'}),\,y_{i'}\big)-\frac{1}{L}\sum_{i'=1}^{L}\ell\big(f_{\mathbf{W}_{0}}(x_{i'}),\,y_{i'}\big)\\
&=\frac{1}{L}\sum_{i'=1}^{L}\Big[\ell\big(f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(x_{i'}),\,y_{i'}\big)-\ell\big(f_{\mathbf{W}_{0}}(x_{i'}),\,y_{i'}\big)\Big]\\
&\leq\frac{1}{L}\sum_{i'=1}^{L}L_{\mathcal{L}}\,\big\|f_{\mathbf{W}_{0}+\Delta\mathbf{W}}(x_{i'})-f_{\mathbf{W}_{0}}(x_{i'})\big\|\\
&\leq\frac{1}{L}\sum_{i'=1}^{L}L_{\mathcal{L}}\,\Big(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i\neq 2^{a}-1}^{2^{L}-2}F(i)\Big)\\
&=L_{\mathcal{L}}\,\Big(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i\neq 2^{a}-1}^{2^{L}-2}F(i)\Big).
\end{aligned}$$

Using the triangle inequality, we reach:

$$\begin{aligned}
\big|\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{G}(\mathbf{W}_{0})\big|
&=\big|\mathcal{L}_{\rm global}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{L}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{L}_{\rm global}(\mathbf{W}_{0})+\mathcal{L}(\mathbf{W}_{0})\big|\\
&\leq\big|\mathcal{L}_{\rm global}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{L}_{\rm global}(\mathbf{W}_{0})\big|+\big|\mathcal{L}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{L}(\mathbf{W}_{0})\big|\\
&\leq 2\,L_{\mathcal{L}}\,\Big(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i\neq 2^{a}-1}^{2^{L}-2}F(i)\Big).
\end{aligned}$$

Finally, we obtain the inequality:

$$\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\mathcal{G}(\mathbf{W}_{0})+2\,L_{\mathcal{L}}\left(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i=1;\,i\neq 2^{a}-1;\,a\in[L]}^{2^{L}-2}F(i)\right).$$

Bound around $f_{\Delta\mathbf{W}}$. We can also perturb around $\mathcal{G}(\Delta\mathbf{W})$ by swapping the roles of $W_{0}^{(i)}$ and $\Delta W^{(i)}$ in the zero-activation bookkeeping function $H(i,j)$. This requires us to skip the indices $2^{a}$, $a\in[L]$, instead of $2^{a}-1$, $a\in[L]$, as can be seen in Figure [5](https://arxiv.org/html/2606.13767#A4.F5). Similarly, the function $P_{L}(\cdot,\cdot)$ can be kept unchanged by shifting the summation index range from $1\!:\!2^{L}\!-\!1$ to $2\!:\!2^{L}$. Thus

$$\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\mathcal{G}(\Delta\mathbf{W})+2\,L_{\mathcal{L}}\left(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=2}^{2^{L}}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i=3;\,i\neq 2^{a};\,a\in[L]}^{2^{L}-1}F(i)\right).$$

Consequently, we can conclude with:

$$\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\min\left(\mathcal{G}(\mathbf{W}_{0})+\Phi_{\Delta\mathbf{W}},\;\mathcal{G}(\Delta\mathbf{W})+\Phi_{\mathbf{W}_{0}}\right),$$

$$\Phi_{\mathbf{W}}=\begin{cases}2\,L_{\mathcal{L}}\left(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i\neq 2^{a}-1}^{2^{L}-2}F(i)\right),&\text{for }\mathbf{W}=\Delta\mathbf{W},\\[6pt]2\,L_{\mathcal{L}}\left(C\prod_{k=1}^{L}L_{\sigma_{k}}\Big[\sum_{i=2}^{2^{L}}\prod_{j=1}^{L}P_{L}(i,j)\Big]+\sum_{i\neq 2^{a}}^{2^{L}-1}F(i)\right),&\text{for }\mathbf{W}=\mathbf{W}_{0}.\end{cases}$$

Hence, the result. ∎

#### D.1.3 Neural Network with No Activation Function: Special Case of Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1)

We can upper bound the generalization error of a neural network with no nonlinear activation functions, i.e., $\sigma_{i}=I_{n_{i}}$ for all $i\in[L]$. We additionally include the simplest case of a one-layer linear network.

###### Corollary 1.

Let Assumption [1](https://arxiv.org/html/2606.13767#Thmassumption1) hold, and let $\mathcal{L}$ follow Assumption [2](https://arxiv.org/html/2606.13767#Thmassumption2). Let $\sigma_{i}=I_{n_{i}}$ for all $i\in[L]$, and

$$f_{\mathbf{W}_{0}+BA}(x)=(W_{0}^{L}+B^{L}A^{L})\big(\cdots(W_{0}^{2}+B^{2}A^{2})\big((W_{0}^{1}+B^{1}A^{1})x\big)\cdots\big).$$

Then we have:

$$\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})\leq\min\left(\mathcal{G}(\mathbf{W}_{0})+2CL_{\mathcal{L}}\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j),\;\mathcal{G}(\Delta\mathbf{W})+2CL_{\mathcal{L}}\sum_{i=2}^{2^{L}}\prod_{j=1}^{L}P_{L}(i,j)\right).$$

Table 6: Summary of the benchmarks, quality metrics, and trainable parameters. For LoRA and Asymmetric LoRA methods, we report their percentage of trainable parameters relative to FFT.

| Task | Model | Pretrained On | Fine-Tuned On | Trainable Parameters (FFT) | LoRA (%) | Asymmetric LoRA (%) | Quality Metric |
|---|---|---|---|---|---|---|---|
| Natural Language Processing | RoBERTa-Base | English language corpora | MRPC | 124.6M | 0.944 | 0.708 | Accuracy |
| | | | CoLA | 124.6M | 0.944 | 0.708 | MCC |
| | RoBERTa-Large | English language corpora | MRPC | 355.4M | 0.735 | 0.515 | Accuracy |
| | | | CoLA | 355.4M | 0.735 | 0.515 | MCC |
| | DeBERTa v2 XXL | English language corpora | MRPC | 1.56B | 0.301 | 0.151 | Accuracy |
| | | | TREC-50 | 1.56B | 0.301 | 0.151 | Accuracy |
| | | | PAWS | 1.56B | 0.301 | 0.151 | Accuracy |
| | DeBERTa v3 Base | English language corpora | MRPC | 184.4M | 0.641 | 0.481 | Accuracy |
| | | | RTE | 184.4M | 0.641 | 0.481 | Accuracy |
| | | | STS-B | 184.4M | 0.640 | 0.481 | Accuracy |
| | | | TREC-50 | 184.4M | 0.661 | 0.501 | Accuracy |
| | | | PAWS | 184.4M | 0.641 | 0.481 | Accuracy |
| | GPT2-Small | WebText | E2E | 124.4M | 6.140 | 5.904 | Accuracy |
| Image Classification | ViT-Tiny | ImageNet-1K | OfficeHome | 5.54M | 2.815 | 1.518 | Accuracy |
| | | | Cifar10 | 5.53M | 2.633 | 1.333 | Accuracy |
| | ViT-Base | ImageNet-21K then ImageNet-1K | OfficeHome | 85.8M | 0.740 | 0.399 | Accuracy |
| | | | Cifar10 | 85.8M | 0.692 | 0.350 | Accuracy |
| Coding Generation | DeepSeek-Coder-Base | Repo-Level Code Corpus | DJANGO | 1.35B | 0.233 | 0.117 | Exact Match |
| Logical Reasoning | TinyLlama | SlimPajama | OpenBookQA | 1.03B | 0.218 | 0.079 | Accuracy |
| | | | FOLIO | 1.03B | 0.218 | 0.079 | Accuracy |
| | | | LogiQA | 1.03B | 0.218 | 0.079 | Accuracy |
| | | | CLUTRR | 1.03B | 0.221 | 0.082 | Accuracy |
| | LlaMA3-8B | Large-scale multilingual corpora | OpenBookQA | 8.03B | 0.091 | 0.035 | Accuracy |
| | | | CLUTRR | 8.03B | 0.092 | 0.036 | Accuracy |

#### D.1.4 Tightness of the bounds in Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1)

We show that Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) is an appropriate upper bound on the generalization error by considering the case $f_{\mathbf{W}_{0}+\Delta\mathbf{W}}=f_{\mathbf{W}_{0}}$, for which the bound guarantees $\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})=\mathcal{G}(\mathbf{W}_{0})$.

Assume $\Delta\mathbf{W}$ was never trained, i.e., $\|\Delta W^{i}\|=0$ for all $i\in[L]$. Denote $\hat{F}:=\sum_{i=1;\,i\neq 2^{a}-1;\,a\in[L]}^{2^{L}-2}F(i)$. Then we have:

$$\begin{aligned}
|\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{G}(\mathbf{W}_{0})|
&\overset{\text{Theorem 1}}{\leq}2L_{\mathcal{L}}\Big(C\prod_{k=1}^{L}L_{\sigma_{k}}\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P_{L}(i,j)+\sum_{i\neq 2^{a}-1:\,a\in[L]}^{2^{L}-2}F(i)\Big)\\
&=2L_{\mathcal{L}}\Big(C\prod_{i=1}^{L}L_{\sigma_{i}}\big(\|W_{0}^{i}\|_{2}+\|\Delta W^{i}\|_{2}\big)-C\prod_{i=1}^{L}L_{\sigma_{i}}\|W_{0}^{i}\|_{2}+\hat{F}\Big)\\
&\overset{\|\Delta W^{i}\|_{2}=0}{=}2L_{\mathcal{L}}\Big(C\prod_{i=1}^{L}L_{\sigma_{i}}\|W_{0}^{i}\|_{2}-C\prod_{i=1}^{L}L_{\sigma_{i}}\|W_{0}^{i}\|_{2}+\hat{F}\Big)\\
&=2L_{\mathcal{L}}\hat{F}.
\end{aligned}$$

Since the index of each $F(i)$ avoids the values $2^{a}-1$, where $a\in[L]$, at least one $H(i,j)$ returns the spectral norm of one of the $\Delta\mathbf{W}$ layers, which is 0 by construction. Hence, each $F(i)$ returns 0 and we obtain $|\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})-\mathcal{G}(\mathbf{W}_{0})|\leq 0$, confirming that $\mathcal{G}(\mathbf{W}_{0}+\Delta\mathbf{W})=\mathcal{G}(\mathbf{W}_{0})$ if $\Delta\mathbf{W}$ was never trained. In this case the generalization measure is unchanged and the bound includes no unnecessary terms.
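The collapse at $\Delta\mathbf{W}=0$ can be checked numerically on the 2-layer instance of the function-difference bound from the proof. The sketch below is illustrative only: the choice of sigmoid for $\sigma_{1}$ (Lipschitz constant $1/4$, $\sigma_{1}(0)=0.5$ per coordinate) and tanh for $\sigma_{2}$ (Lipschitz constant 1), the layer sizes, the rank-2 updates, and the helper names are assumptions made here, with $C$ taken as $\|x\|$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer(W1, W2, x):
    """f(x) = sigma_2(W2 sigma_1(W1 x)) with sigma_1 = sigmoid, sigma_2 = tanh."""
    return np.tanh(W2 @ sigmoid(W1 @ x))

def two_layer_bound(W01, W02, dW1, dW2, C, L1=0.25, L2=1.0):
    """Right-hand side of the 2-layer bound on ||f_{W0 + dW} - f_{W0}||."""
    def spec(M):
        return np.linalg.norm(M, 2)                # spectral norm
    sigma1_zero = 0.5 * np.sqrt(W01.shape[0])      # ||sigmoid(0)||_2 on the hidden layer
    products = (spec(W02) * spec(dW1)
                + spec(dW2) * spec(W01)
                + spec(dW2) * spec(dW1))
    return C * L2 * L1 * products + L2 * spec(dW2) * sigma1_zero

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, r = 4, 6, 3, 2
W01 = rng.standard_normal((d_hidden, d_in))
W02 = rng.standard_normal((d_out, d_hidden))
# Rank-r updates dW = B A, mimicking a LoRA-style perturbation.
dW1 = 0.1 * rng.standard_normal((d_hidden, r)) @ rng.standard_normal((r, d_in))
dW2 = 0.1 * rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_hidden))
x = rng.standard_normal(d_in)
C = np.linalg.norm(x)

gap = np.linalg.norm(two_layer(W01 + dW1, W02 + dW2, x) - two_layer(W01, W02, x))
bound = two_layer_bound(W01, W02, dW1, dW2, C)

# Untrained update: every term of the bound carries a ||dW|| factor and vanishes.
zero1, zero2 = np.zeros_like(dW1), np.zeros_like(dW2)
gap_zero = np.linalg.norm(two_layer(W01 + zero1, W02 + zero2, x) - two_layer(W01, W02, x))
bound_zero = two_layer_bound(W01, W02, zero1, zero2, C)
```

The first pair (`gap`, `bound`) illustrates that the bound holds for a generic update, and the second pair shows both sides equal to zero when the update is identically zero.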

#### D\.1\.5Adapting Theorem[1](https://arxiv.org/html/2606.13767#Thmtheorem1)to Attention Mechanism

Theorem[1](https://arxiv.org/html/2606.13767#Thmtheorem1)applies to any architecture that can be written as a composition of linear maps and Lipschitz maps, under bounded input\. We therefore view transformer blocks as fitting the theorem\. Let embedded inputs be bounded, and letXXdenote the current input sequence\.

The MLP sub\-blocks follow the same structure as standard DNNs, so it suffices if their activation functions are Lipschitz\[[13](https://arxiv.org/html/2606.13767#bib.bib89)\]\. Transformers also include residual connections and normalization\. Residual connections take the formX\+F​\(X\)X\+F\(X\), which preserves the same decomposition up to a constant\[[21](https://arxiv.org/html/2606.13767#bib.bib86)\]\. LayerNorm is applied elementwise across features; under bounded activations, LayerNorm is Lipschitz on the bounded set, and therefore it can be treated as another Lipschitz map in the composition\[[68](https://arxiv.org/html/2606.13767#bib.bib87)\]\. For the multi\-head attention \(MHA\) sub\-blocks, the sequenceXXis projected into queries, keys, and values such thatQ=X​WQQ=XW\_\{Q\},K=X​WKK=XW\_\{K\},V=X​WVV=XW\_\{V\}, and the attention output is projected byWOW\_\{O\}\. Each head forms attention weights fromQ​K⊤QK^\{\\top\}using softmax and masks and then directly applies them toVV\[[60](https://arxiv.org/html/2606.13767#bib.bib88)\]\.

Define the linear map $\mathcal{T}(X):=[XW_Q\mid XW_K\mid XW_V]$ and define $\sigma(\mathcal{T}(X))=\sigma_{(Q,K,V)}(\cdot)=\mathrm{Softmax}\big(\frac{QK^{\top}}{\sqrt{d_k}}+M\big)V$, where $M$ is an attention mask. Then one attention head can be written as $\mathcal{V}(X)=(\sigma_{(Q,K,V)}\circ\mathcal{T}(X))\,W_O$. We treat $\sigma_{(Q,K,V)}$ as a Lipschitz operator and denote the Lipschitz constant by $L_\sigma$. In particular, Softmax is $\frac{1}{2}$-Lipschitz uniformly across all $\ell_p$ norms [[46](https://arxiv.org/html/2606.13767#bib.bib85)].

Consider $\|\sigma(\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X))\,W_O^{\mathbf{W}_0+\Delta\mathbf{W}}-\sigma(\mathcal{T}_{\mathbf{W}_0}(X))\,W_O^{\mathbf{W}_0}\|$. We expand it in the same way as in the proof of Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1). Denote $Z_\Delta=\sigma_{(Q,K,V)}\circ(\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X))$ and $Z_0=\sigma_{(Q,K,V)}\circ(\mathcal{T}_{\mathbf{W}_0}(X))$. Let $W_O^{\mathbf{W}_0+\Delta\mathbf{W}}=W_{O,0}+\Delta W_O$. Then,

$$
\begin{aligned}
\|\sigma(\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X))\,W_O^{\mathbf{W}_0+\Delta\mathbf{W}}-\sigma(\mathcal{T}_{\mathbf{W}_0}(X))\,W_O^{\mathbf{W}_0}\|
&\overset{\text{by construction}}{=}\|Z_\Delta W_O^{\mathbf{W}_0+\Delta\mathbf{W}}-Z_0W_O^{\mathbf{W}_0}\|\\
&\overset{\text{by definition of }W_O}{=}\|Z_\Delta(W_{O,0}+\Delta W_O)-Z_0W_{O,0}\|\\
&\overset{\text{triangle inequality}}{\leq}\|(Z_\Delta-Z_0)W_{O,0}\|+\|Z_\Delta\Delta W_O\|\\
&\overset{\text{Inequality (11)}}{\leq}\|Z_\Delta-Z_0\|\,\|W_{O,0}\|_2+\|Z_\Delta\|\,\|\Delta W_O\|_2.\qquad(\star)
\end{aligned}
$$

Now, we consider bounding the term $\|Z_\Delta-Z_0\|$ as follows:

$$
\begin{aligned}
\|Z_\Delta-Z_0\|
&\overset{\text{by construction}}{=}\|\sigma(\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X))-\sigma(\mathcal{T}_{\mathbf{W}_0}(X))\|\\
&\overset{\text{Assumption 3}}{\leq}L_\sigma\|\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{T}_{\mathbf{W}_0}(X)\|.
\end{aligned}
$$

Expanding $\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{T}_{\mathbf{W}_0}(X)$, we obtain:

$$
\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{T}_{\mathbf{W}_0}(X)\overset{\text{by construction}}{=}X(W_{QKV,0}+\Delta W_{QKV})-XW_{QKV,0}=X\Delta W_{QKV},
$$

which implies

$$
\|\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{T}_{\mathbf{W}_0}(X)\|\overset{\text{Inequality (11)}}{\leq}\|X\|\,\|\Delta W_{QKV}\|_2.\qquad(\triangle)
$$

Finally, we obtain the result:

$$
\|\sigma(\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X))\,W_O^{\mathbf{W}_0+\Delta\mathbf{W}}-\sigma(\mathcal{T}_{\mathbf{W}_0}(X))\,W_O^{\mathbf{W}_0}\|\overset{(\star)\text{ and }(\triangle)}{\leq}L_\sigma\|X\|\,\|\Delta W_{QKV}\|_2\,\|W_{O,0}\|_2+\|Z_\Delta\|\,\|\Delta W_O\|_2.
$$
Hence, the main difference for Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) between transformers and deep neural networks appears in the MHA block, where subtracting the fine-tuned and pretrained models produces an additional term $\|Z_\Delta\|\,\|\Delta W_O\|_2$ from the output projection update. This term is controlled under our setup: since $X$ is bounded and $V=X(W_V+\Delta W_V)$ is linear in $X$, we have $\|V\|\leq\|X\|\,\|W_V+\Delta W_V\|_2$. Thus $\|Z_\Delta\|=\|\mathrm{Softmax}(\cdot)V\|\leq C_Z$ for some constant $C_Z$ depending on the input bound and $\|W_V+\Delta W_V\|_2$.

If $W_O$ is frozen ($\Delta W_O=0$), the additional term vanishes, and the bound simplifies to $L_\sigma\|X\|\,\|\Delta W_{QKV}\|_2\,\|W_{O,0}\|_2$. This has the same proof structure as in Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1), with $\|W_{O,0}\|_2$ contributing as an additional spectral factor. The attention mask $M$ is fixed, so it does not contribute to the parameter difference. Also, with $W_{QKV}=[W_Q\,W_K\,W_V]$, where we concatenate the projection matrices, we have $\|\Delta W_{QKV}\|_2\leq\sqrt{\|\Delta W_Q\|_2^2+\|\Delta W_K\|_2^2+\|\Delta W_V\|_2^2}$ by properties of block matrices. This leads to the development of Theorem [4](https://arxiv.org/html/2606.13767#Thmtheorem4).
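The block-matrix inequality for $\Delta W_{QKV}$ can be justified in a few lines. The following is a short sketch, assuming the concatenation is horizontal, $\Delta W_{QKV}=[\Delta W_Q\mid\Delta W_K\mid\Delta W_V]$, and using only that the largest eigenvalue $\lambda_{\max}$ is subadditive on symmetric matrices:

```latex
\begin{aligned}
\|\Delta W_{QKV}\|_2^2
&= \lambda_{\max}\big(\Delta W_{QKV}\,\Delta W_{QKV}^{\top}\big)
 = \lambda_{\max}\big(\Delta W_Q\Delta W_Q^{\top}+\Delta W_K\Delta W_K^{\top}+\Delta W_V\Delta W_V^{\top}\big)\\
&\leq \lambda_{\max}\big(\Delta W_Q\Delta W_Q^{\top}\big)+\lambda_{\max}\big(\Delta W_K\Delta W_K^{\top}\big)+\lambda_{\max}\big(\Delta W_V\Delta W_V^{\top}\big)\\
&= \|\Delta W_Q\|_2^2+\|\Delta W_K\|_2^2+\|\Delta W_V\|_2^2 .
\end{aligned}
```

Taking square roots gives the stated bound. Under this reading, a projection that is not fine-tuned simply contributes a zero term to the sum.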

###### Theorem 4\.

(Upper bound on self-attention fine-tuned update) Let an input sequence $X\in\mathbb{R}^{\hat{n}\times\hat{m}}$ be bounded, such that $\|X\|\leq C$. Consider one attention head as

$$
\mathcal{V}_{\mathbf{W}}(X)=(\sigma_{(QKV)}\circ\mathcal{T}_{\mathbf{W}}(X))\,W_O,\qquad\mathcal{T}_{\mathbf{W}}(X):=[XW_Q\mid XW_K\mid XW_V],
$$

where $\sigma_{(QKV)}(\cdot)=\mathrm{Softmax}\big(\frac{QK^{\top}}{\sqrt{d_k}}+M\big)V$ and $M$ is a fixed mask. Then the difference in attention-head outputs satisfies

$$
\|\mathcal{V}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{V}_{\mathbf{W}_0}(X)\|\leq C\Big(L_\sigma\|\Delta W_{QKV}\|_2\,\|W_{O,0}\|_2+\sqrt{\hat{n}}\,\|W_{V,0}+\Delta W_V\|_2\,\|\Delta W_O\|_2\Big),
$$

where $\Delta W_{QKV}$ denotes the concatenated update $\Delta W_{QKV}=[\Delta W_Q\mid\Delta W_K\mid\Delta W_V]$, and $L_\sigma$ represents the Lipschitz constant of $\mathrm{Softmax}\big(\frac{QK^{\top}}{\sqrt{d_k}}+M\big)V$.

###### Proof\.

We have already shown that

$$
\|\mathcal{V}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{V}_{\mathbf{W}_0}(X)\|\leq L_\sigma\|X\|\,\|\Delta W_{QKV}\|_2\,\|W_{O,0}\|_2+\|Z_\Delta\|\,\|\Delta W_O\|_2.
$$

We assume $\|X\|\leq C$, hence

$$
\|\mathcal{V}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{V}_{\mathbf{W}_0}(X)\|\leq CL_\sigma\,\|\Delta W_{QKV}\|_2\,\|W_{O,0}\|_2+\|Z_\Delta\|\,\|\Delta W_O\|_2.
$$

We aim to upper bound $\|Z_\Delta\|$. Since $Z_\Delta=\sigma_{(QKV)}\circ(\mathcal{T}_{\mathbf{W}_0+\Delta\mathbf{W}}(X))=\mathrm{Softmax}\big(\frac{QK^{\top}}{\sqrt{d_k}}+M\big)V$, we have the relationship

$$
\|Z_\Delta\|=\Big\|\mathrm{Softmax}\Big(\frac{QK^{\top}}{\sqrt{d_k}}+M\Big)V\Big\|\leq\Big\|\mathrm{Softmax}\Big(\frac{QK^{\top}}{\sqrt{d_k}}+M\Big)\Big\|_2\|V\|\leq\sqrt{\hat{n}}\,\|V\|,
$$

$$
\sqrt{\hat{n}}\,\|V\|=\sqrt{\hat{n}}\,\|X(W_{V,0}+\Delta W_V)\|\leq\sqrt{\hat{n}}\,\|X\|\,\|W_{V,0}+\Delta W_V\|_2\leq C\sqrt{\hat{n}}\,\|W_{V,0}+\Delta W_V\|_2.
$$

To justify the $\sqrt{\hat{n}}$ factor, define the attention weight matrix

$$
S:=\mathrm{Softmax}\Big(\frac{QK^{\top}}{\sqrt{d_k}}+M\Big)\in\mathbb{R}^{\hat{n}\times\hat{n}},
$$

where Softmax is applied row-wise. Each row of $S$ is a probability vector, hence $S_{ij}\geq 0$ and $\sum_{j=1}^{\hat{n}}S_{ij}=1$ for all $i$. Therefore $\|S\|_\infty=1$. Moreover, since each column sum satisfies $\sum_{i=1}^{\hat{n}}S_{ij}\leq\hat{n}$, we have $\|S\|_1\leq\hat{n}$. Using the inequality $\|A\|_2\leq\sqrt{\|A\|_1\|A\|_\infty}$ gives

$$
\|S\|_2\leq\sqrt{\|S\|_1\|S\|_\infty}\leq\sqrt{\hat{n}}.
$$

Consequently,

$$
\|Z_\Delta\|=\|SV\|\leq\|S\|_2\|V\|\leq\sqrt{\hat{n}}\,\|V\|.
$$

Hence, we conclude with

$$
\|\mathcal{V}_{\mathbf{W}_0+\Delta\mathbf{W}}(X)-\mathcal{V}_{\mathbf{W}_0}(X)\|\leq C\Big(L_\sigma\|\Delta W_{QKV}\|_2\,\|W_{O,0}\|_2+\sqrt{\hat{n}}\,\|W_{V,0}+\Delta W_V\|_2\,\|\Delta W_O\|_2\Big).
$$

∎
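The norm facts about the attention weight matrix $S$ used in the proof can be checked numerically. The sketch below is illustrative only: it assumes no mask ($M=0$), uses random scores as a stand-in for $QK^{\top}/\sqrt{d_k}$, and relies on hypothetical helpers (`softmax_rows`, `spectral_norm`, and so on) written in plain Python. The power iteration returns an estimate that approaches $\|S\|_2$ from below.

```python
import math
import random

def softmax_rows(scores):
    """Row-wise softmax of a list-of-lists matrix (numerically stabilised)."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def induced_1_norm(a):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))

def induced_inf_norm(a):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)

def spectral_norm(a, iters=500):
    """Power iteration on A^T A; the estimate approaches ||A||_2 from below."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        atav = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in atav]
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in av))

random.seed(0)
n_hat = 6  # hypothetical sequence length
scores = [[random.gauss(0.0, 3.0) for _ in range(n_hat)] for _ in range(n_hat)]
S = softmax_rows(scores)

norm_inf = induced_inf_norm(S)            # equals 1: every row is a probability vector
norm_1 = induced_1_norm(S)                # at most n_hat
norm_2 = spectral_norm(S)                 # estimate of ||S||_2
holder_bound = math.sqrt(norm_1 * norm_inf)

print(norm_inf, norm_1, norm_2, holder_bound, math.sqrt(n_hat))
```

For any row-stochastic $S$ the printed values should be ordered as $\|S\|_2\leq\sqrt{\|S\|_1\|S\|_\infty}\leq\sqrt{\hat{n}}$, which is the chain used to obtain the $\sqrt{\hat{n}}$ factor.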

#### D.1.6 Adapting Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) under special cases

To adapt Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) under special cases, we need the following general assumptions.

###### Assumption 4\.

The loss function, $\ell(\cdot):\mathbb{R}^{d}\to\mathbb{R}$, is 1-Lipschitz, i.e., $|\ell(f_{\mathbf{W}}(x),y)-\ell(f_{\mathbf{W}'}(x),y)|\leq\|f_{\mathbf{W}}(x)-f_{\mathbf{W}'}(x)\|$ for all $\mathbf{W},\mathbf{W}'\in\mathbb{R}^{d}$ and $(x,y)\in\mathcal{X}\times\mathcal{Y}$.

###### Assumption 5\.

The loss function, $\ell(\cdot):\mathbb{R}^{d}\to\mathbb{R}$, is bounded, i.e., there exists a constant $C_2\geq 0$ such that $\|\ell(f_{\mathbf{W}}(x),y)\|\leq C_2$ for all $\mathbf{W}\in\mathbb{R}^{d}$ and $(x,y)\in\mathcal{X}\times\mathcal{Y}$.

\(*i*\)

Perturbing around $\mathcal{G}(\mathbf{W}_0)$. First, we adapt Theorem 4.1 in [[33](https://arxiv.org/html/2606.13767#bib.bib25)] into our notation and quote it below.

###### Theorem 5\.

(PAC-Bayes generalization bound for fine-tuning) [[33](https://arxiv.org/html/2606.13767#bib.bib25), Theorem 4.1] Let Assumption [1](https://arxiv.org/html/2606.13767#Thmassumption1) hold with the requirement that $C\geq 1$. Let the loss function, $\mathcal{L}$, follow Assumptions [4](https://arxiv.org/html/2606.13767#Thmassumption4) and [5](https://arxiv.org/html/2606.13767#Thmassumption5). Let $\|W_0^{(i)}\|_2\leq\mathcal{A}_i$ with fixed $\mathcal{A}_i>1$, $\|\Delta W^{(i)}\|\leq Q_i$, for all $i\in[L]$, and $V=\max_{i\in[L]}\{m_i,n_i\}$. Let $\epsilon$ and $\delta$ be arbitrary small values. Then with probability $1-2\delta$, the following inequality holds:

$$
\mathcal{G}(\mathbf{W}_0+\Delta\mathbf{W})\leq\epsilon+C_2\sqrt{\frac{\frac{36}{\epsilon^{2}}C^{2}V\log(4LVC_2)\Big(\sum_{i=1}^{L}\frac{\prod_{j=1}^{L}(\mathcal{A}_j+Q_j)}{\mathcal{A}_i+Q_i}\Big)^{2}\Big(\sum_{i=1}^{L}Q_i^{2}\Big)+3\ln\frac{|N|}{\delta}+8}{|N|}}.
$$

We now use Theorem [5](https://arxiv.org/html/2606.13767#Thmtheorem5) to obtain a bound for $\mathcal{G}(\mathbf{W}_0)$. The following theorem gives that.

###### Theorem 6\.

Using the assumptions made for Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) and Theorem [5](https://arxiv.org/html/2606.13767#Thmtheorem5), the following inequality holds with probability at least $1-2\delta$:

$$
\mathcal{G}(\mathbf{W}_0+\Delta\mathbf{W})\leq\epsilon+C_2\sqrt{\frac{3\ln\frac{|N|}{\delta}+8}{|N|}}+2L_{\mathcal{L}}\Big(C\prod_{i=1}^{L}L_{\sigma_i}\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P(i,j)+\sum_{i\neq 2^{a}-1:\,a\in[L]}^{2^{L}-2}F(i)\Big).
$$

###### Proof\.

We wish to find $\mathcal{G}(\mathbf{W}_0)$, and note that if we never train the model, we obtain the expression $\|W_0^{i}-W_0^{i}\|_2=0$. Thus, we can use $Q_i=0$ for all $i\in[L]$ and obtain:

$$
\begin{aligned}
\mathcal{G}(\mathbf{W}_0)
&\overset{\text{Theorem 5}}{\leq}\epsilon+C_2\sqrt{\frac{\frac{36}{\epsilon^{2}}C^{2}V\log(4LVC_2)\Big(\sum_{i=1}^{L}\frac{\prod_{j=1}^{L}(\mathcal{A}_j+Q_j)}{\mathcal{A}_i+Q_i}\Big)^{2}\Big(\sum_{i=1}^{L}Q_i^{2}\Big)+3\ln\frac{|N|}{\delta}+8}{|N|}}\\
&\overset{Q_i=0;\ i\in[L]}{=}\epsilon+C_2\sqrt{\frac{\frac{36}{\epsilon^{2}}C^{2}V\log(4LVC_2)\Big(\sum_{i=1}^{L}\frac{\prod_{j=1}^{L}(\mathcal{A}_j+0)}{\mathcal{A}_i+0}\Big)^{2}\Big(\sum_{i=1}^{L}0^{2}\Big)+3\ln\frac{|N|}{\delta}+8}{|N|}}\\
&=\epsilon+C_2\sqrt{\frac{3\ln\frac{|N|}{\delta}+8}{|N|}}.
\end{aligned}
$$

Now that we have an upper bound for $\mathcal{G}(\mathbf{W}_0)$, we can apply Theorem [1](https://arxiv.org/html/2606.13767#Thmtheorem1) and obtain the following:

$$
\begin{aligned}
\mathcal{G}(\mathbf{W}_0+\Delta\mathbf{W})
&\overset{\text{Theorem 1}}{\leq}\mathcal{G}(\mathbf{W}_0)+\Phi_{\Delta\mathbf{W}}\\
&\leq\epsilon+C_2\sqrt{\frac{3\ln\frac{|N|}{\delta}+8}{|N|}}+\Phi_{\Delta\mathbf{W}}.
\end{aligned}
$$

By substituting the expression for $\Phi_{\Delta\mathbf{W}}$ in the above expression, we have:

$$
\mathcal{G}(\mathbf{W}_0+\Delta\mathbf{W})\leq\epsilon+C_2\sqrt{\frac{3\ln\frac{|N|}{\delta}+8}{|N|}}+2L_{\mathcal{L}}\Big(C\prod_{i=1}^{L}L_{\sigma_i}\sum_{i=1}^{2^{L}-1}\prod_{j=1}^{L}P(i,j)+\sum_{i\neq 2^{a}-1:\,a\in[L]}^{2^{L}-2}F(i)\Big).
$$

This concludes the proof. ∎
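The collapse of the quoted PAC-Bayes bound at $Q_i=0$ can be seen by evaluating its right-hand side directly. The sketch below is a plain transcription of the formula in Theorem 5; the function names and all numeric values ($\epsilon$, $\delta$, $C$, $C_2$, $V$, $\mathcal{A}_i$, $|N|$) are hypothetical and chosen only for illustration.

```python
import math

def pac_bayes_bound(eps, delta, C, C2, V, A, Q, n_samples):
    """Right-hand side of the quoted PAC-Bayes bound (Theorem 5).

    A[i] bounds the spectral norm of pretrained layer i; Q[i] bounds its update.
    """
    L = len(A)
    full_prod = math.prod(a + q for a, q in zip(A, Q))
    # sum_i prod_j (A_j + Q_j) / (A_i + Q_i)
    leave_one_out = sum(full_prod / (a + q) for a, q in zip(A, Q))
    update_energy = sum(q * q for q in Q)  # sum_i Q_i^2
    complexity = ((36.0 / eps ** 2) * C ** 2 * V * math.log(4 * L * V * C2)
                  * leave_one_out ** 2 * update_energy)
    return eps + C2 * math.sqrt(
        (complexity + 3.0 * math.log(n_samples / delta) + 8.0) / n_samples
    )

def untrained_bound(eps, delta, C2, n_samples):
    """The reduced expression eps + C2 * sqrt((3 ln(|N|/delta) + 8) / |N|)."""
    return eps + C2 * math.sqrt((3.0 * math.log(n_samples / delta) + 8.0) / n_samples)

# Hypothetical constants for a three-layer model.
eps, delta, C, C2, V, n_samples = 0.1, 0.05, 1.0, 1.0, 768, 10_000
A = [2.0, 3.0, 1.5]

base = untrained_bound(eps, delta, C2, n_samples)
frozen = pac_bayes_bound(eps, delta, C, C2, V, A, [0.0, 0.0, 0.0], n_samples)
trained = pac_bayes_bound(eps, delta, C, C2, V, A, [0.1, 0.1, 0.1], n_samples)

print(base, frozen, trained)
```

With all $Q_i=0$ the factor $\sum_i Q_i^2$ removes the complexity term, so `frozen` coincides with `base`; any nonzero update budget makes the bound strictly larger.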

(*ii*)

Perturbing around $\mathcal{G}(\mathcal{A})$. First, we make another assumption on the loss function and then adapt Theorem 1 in [[66](https://arxiv.org/html/2606.13767#bib.bib72)] to our notation.

###### Assumption 6\.

The loss function, $\ell(\cdot):\mathbb{R}^{d}\to\mathbb{R}$, is $\sigma$-sub-Gaussian, i.e., $\mathbb{E}\big(e^{\lambda[\ell(f_{\mathbf{W}}(X),Y)-\mathbb{E}(\ell(f_{\mathbf{W}}(X),Y))]}\big)\leq e^{\frac{\lambda^{2}\sigma^{2}}{2}}$ for all $\lambda\in\mathbb{R}$, $\mathbf{W}\in\mathbb{R}^{d}$.

###### Theorem 7\.

(Upper bound on generalization error using mutual information) [Theorem 1 [[66](https://arxiv.org/html/2606.13767#bib.bib72)]] Let $\mathcal{A}$ denote a LoRA-based algorithm that outputs $\{\Delta\mathbf{W}_i\}_{i\in[L]}$ on a fine-tuning dataset, $N$. By $\nu$ we denote the underlying distribution of the input space, $\mathcal{X}$, from which the elements of the fine-tuning dataset $N$ are drawn i.i.d. Let Assumption [6](https://arxiv.org/html/2606.13767#Thmassumption6) hold. Then we have the following:

$$
\mathcal{G}(\mathcal{A})_{\nu}\leq\sqrt{\frac{2\sigma^{2}\,\mathbf{I}\big(\{\Delta\mathbf{W}_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)}{|N|}}.
$$

Let the loss function $\mathcal{L}$ follow Assumption [6](https://arxiv.org/html/2606.13767#Thmassumption6). We present the generalization error upper bounds of the LoRA variants in Table [1](https://arxiv.org/html/2606.13767#S3.T1). For this, we use the inequality $\mathcal{G}(\mathbf{W}_0+\mathcal{A})\leq\mathcal{G}(\mathcal{A})+\Phi_{\mathbf{W}_0}$, where $\mathcal{G}(\mathcal{A})$ is upper bounded by the use of Lemma [1](https://arxiv.org/html/2606.13767#Thmlemma1) quoted below.

###### Lemma 1\.

(Upper bound on mutual information) [[66](https://arxiv.org/html/2606.13767#bib.bib72)] Let $\{\Delta\mathbf{W}_i\}_{i\in[L]}$ be an update to a learning algorithm. Then the mutual information is upper bounded by the uniform distribution over an updated support set, i.e., $\mathbf{I}\big(\{\Delta\mathbf{W}_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)\leq\ln 2^{qp}=qp\ln 2$, where $q$ represents the number of bits the learning algorithm is designed on, and $p$ is the number of trainable parameters. Thus, with the use of Theorem [7](https://arxiv.org/html/2606.13767#Thmtheorem7), if Assumption [6](https://arxiv.org/html/2606.13767#Thmassumption6) holds, then $\mathcal{G}(\mathcal{A})\leq\sqrt{\frac{2\sigma^{2}qp\ln 2}{|N|}}$.

How do we arrive at the bounds of different LoRA variants?

\(*a*\)

LoRA+ has $\mathcal{G}(\mathcal{A})$ upper bounded by $\sqrt{\frac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}(m_i+n_i)}{|N|}}$. The learning rate does not alter the number of trainable parameters, so LoRA+ has the same upper bound as LoRA. One observation regarding this claim: as $\gamma_A\to 0$, LoRA+ attains the lower generalization error bound of Asymmetric LoRA, since the adapter matrix $A$ is no longer trainable.

\(*b*\)

cLA has the fine-tuned update $B[I_r\mid\mathbf{0}_{m_i-r}]$, where $[I_r\mid\mathbf{0}_{m_i-r}]$ is a fixed constant orthogonal matrix. Thus, by using the data processing inequality ([14](https://arxiv.org/html/2606.13767#A4.E14)), the mutual information between the two is preserved, i.e.,

$$
\mathbf{I}\big(\{B_i[I_r\mid\mathbf{0}_{m_i-r}]_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)=\mathbf{I}\big(\{B_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big).
$$

Similar to [[75](https://arxiv.org/html/2606.13767#bib.bib18)], we upper bound mutual information by the uniform distribution of a model's support; particularly, $\mathbf{I}\big(\{B_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)\leq qr\ln 2\sum_{i=1}^{L}n_i$, by Lemma [1](https://arxiv.org/html/2606.13767#Thmlemma1). Finally, by Theorem [7](https://arxiv.org/html/2606.13767#Thmtheorem7), we obtain the result $\mathcal{G}(\mathcal{A})\leq\sqrt{\frac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}n_i}{|N|}}$.

\(*c*\)

$c^3$LA has the fine-tuned update $B^{1}[I_r\mid\mathbf{0}_{m_i-r}]+B^{2}[\mathbf{0}_{r}\mid I_r\mid\mathbf{0}_{m_i-2r}]+\cdots+B^{k}[\mathbf{0}_{r(k-1)}\mid I_r\mid\mathbf{0}_{m_i-kr}]$. This expansion can be simplified by $\sum_{j=1}^{k}B^{j}[0_{r\times r(j-1)}\mid I_r\mid 0_{r\times(m_i-rj)}]=[B^{1}\mid B^{2}\mid\cdots\mid B^{k}\mid\mathbf{0}_{n_i\times(m_i-kr)}]$. Using ([13](https://arxiv.org/html/2606.13767#A4.E13)), we can upper bound the rank of $\sum_{j=1}^{k}B^{j}[0_{r\times r(j-1)}\mid I_r\mid 0_{r\times(m_i-rj)}]$ by $kr$. Moreover, the mapping $[B^{1}\mid\cdots\mid B^{k}]\to\Delta\mathbf{W}$ is injective and can be inverted by slicing off the last $m_i-kr$ columns. Using DPI, this leads to the expression

$$
\mathbf{I}\Big(\Big\{\sum_{j=1}^{k}B_i^{j}[0_{r\times r(j-1)}\mid I_r\mid 0_{r\times(m_i-rj)}]\Big\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\Big)=\mathbf{I}\big(\{[B^{1}\mid\cdots\mid B^{k}]_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big).
$$

We upper bound $\mathbf{I}\big(\{[B^{1}\mid\cdots\mid B^{k}]_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)$ by $qrk\ln 2\sum_{i=1}^{L}n_i$, using Lemma [1](https://arxiv.org/html/2606.13767#Thmlemma1). Hence, by Theorem [7](https://arxiv.org/html/2606.13767#Thmtheorem7), we obtain $\mathcal{G}(\mathcal{A})\leq\sqrt{\frac{2rq\sigma^{2}k\ln 2\sum_{i=1}^{L}n_i}{|N|}}$.

\(*d*\)

CoLA has the update structure $\Delta\mathbf{W}=\sum_{j=1}^{k}B^{j}A^{j}$. Using inequality ([13](https://arxiv.org/html/2606.13767#A4.E13)), we upper bound the rank of each layer's update by $kr$. By Lemma [1](https://arxiv.org/html/2606.13767#Thmlemma1), we upper bound $\mathbf{I}\big(\{\sum_{j=1}^{k}B_i^{j}A_i^{j}\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)$ by $qrk\ln 2\sum_{i=1}^{L}(m_i+n_i)$. Hence, we obtain $\mathcal{G}(\mathcal{A})\leq\sqrt{\frac{2rq\sigma^{2}k\ln 2\sum_{i=1}^{L}(m_i+n_i)}{|N|}}$, by Theorem [7](https://arxiv.org/html/2606.13767#Thmtheorem7).

\(*e*\)

RAC-LoRA has the fine-tuned update $\sum_{j=1}^{k}B^{j}Q^{j}$, where we consider each $Q^{j}$ to be a frozen orthogonal matrix. This update can be represented by $\sum_{j=1}^{k}B^{j}Q^{j}=[B^{1}\mid B^{2}\mid\cdots\mid B^{k}]\,[Q^{1}\mid Q^{2}\mid\cdots\mid Q^{k}]^{T}$, where we can invert $[B^{1}\mid B^{2}\mid\cdots\mid B^{k}]\,[Q^{1}\mid Q^{2}\mid\cdots\mid Q^{k}]^{T}$ to $[B^{1}\mid B^{2}\mid\cdots\mid B^{k}]$. Thus, by using inequality ([13](https://arxiv.org/html/2606.13767#A4.E13)), DPI, and Lemma [1](https://arxiv.org/html/2606.13767#Thmlemma1), we have

$$
\mathbf{I}\big(\{[B^{1}\mid B^{2}\mid\cdots\mid B^{k}]_i\,[Q^{1}\mid Q^{2}\mid\cdots\mid Q^{k}]_i^{T}\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)=\mathbf{I}\big(\{[B^{1}\mid B^{2}\mid\cdots\mid B^{k}]_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big),
$$

which satisfies

$$
\mathbf{I}\big(\{[B^{1}\mid B^{2}\mid\cdots\mid B^{k}]_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)\leq qrk\ln 2\sum_{i=1}^{L}n_i.
$$

Hence, by Theorem [7](https://arxiv.org/html/2606.13767#Thmtheorem7), we have the result $\mathcal{G}(\mathcal{A})\leq\sqrt{\frac{2rq\sigma^{2}k\ln 2\sum_{i=1}^{L}n_i}{|N|}}$.

Generalization Upperbound on PaCA

\(*f*\)

PaCA has an identical update to cLA if the fine-tuned columns are the first $r$ columns of the pretrained backbone. This suggests that any generalization guarantee driven by the update motivates the same generalization behavior for PaCA as for cLA. Particularly, since PaCA updates $r\sum_{i=1}^{L}n_i$ trainable entries (the same degrees of freedom as cLA's $B$ matrices), Lemma [1](https://arxiv.org/html/2606.13767#Thmlemma1) yields $\mathbf{I}\big(\{\Delta W_i\}_{i\in[L]};N\mid\mathcal{A};\mathbf{W}\big)\leq qr\ln 2\sum_{i=1}^{L}n_i$. Plugging into Theorem [7](https://arxiv.org/html/2606.13767#Thmtheorem7), we get $\mathcal{G}(\mathcal{A})\leq\sqrt{\dfrac{2rq\sigma^{2}\ln 2\sum_{i=1}^{L}n_i}{|N|}}$, matching the generalization upper bound of cLA from Table [1](https://arxiv.org/html/2606.13767#S3.T1).
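The bounds in items (a) through (f) differ only in the trainable-parameter count $p$ that enters Lemma 1. The sketch below makes this concrete under stated assumptions: it tabulates $p$ per method, evaluates $\sqrt{2\sigma^{2}qp\ln 2/|N|}$, and checks the $c^3$LA concatenation identity on a toy layer with explicit matrix products. All helper names and numeric settings (layer shapes, $r$, $k$, $q$, $\sigma$, $|N|$) are hypothetical; with such values the absolute bounds are not meaningful, only their ratios.

```python
import math

def mi_generalization_bound(p, q_bits, sigma, n_samples):
    """sqrt(2 * sigma^2 * q * p * ln 2 / |N|): Theorem 7 combined with Lemma 1."""
    return math.sqrt(2.0 * sigma ** 2 * q_bits * p * math.log(2.0) / n_samples)

def trainable_params(shapes, r, k=1):
    """Trainable-parameter count p per method; shapes is a list of (n_i, m_i)."""
    both = sum(n + m for n, m in shapes)   # both factors A and B are trained
    b_only = sum(n for n, _ in shapes)     # only the B factor is trained
    return {
        "LoRA / LoRA+": r * both,
        "cLA / PaCA": r * b_only,
        "c3LA / RAC-LoRA": r * k * b_only,
        "CoLA": r * k * both,
    }

def selector(r, m, j):
    """r x m block matrix [0 | I_r | 0] with the identity at columns j*r .. j*r + r - 1."""
    return [[1.0 if col == j * r + row else 0.0 for col in range(m)] for row in range(r)]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][t] * b[t][c] for t in range(len(b))) for c in range(len(b[0]))]
            for i in range(len(a))]

def c3la_update(blocks, m):
    """sum_j B^j [0 | I_r | 0], computed with explicit matrix products."""
    n, r = len(blocks[0]), len(blocks[0][0])
    total = [[0.0] * m for _ in range(n)]
    for j, b in enumerate(blocks):
        term = matmul(b, selector(r, m, j))
        total = [[x + y for x, y in zip(t_row, s_row)] for t_row, s_row in zip(total, term)]
    return total

# Toy layer: n = 3 rows, m = 7 columns, rank r = 2, chain length k = 3 (k*r = 6 <= m).
blocks = [[[float(10 * j + 2 * i + c) for c in range(2)] for i in range(3)] for j in range(3)]
delta_w = c3la_update(blocks, m=7)
concatenated = [[v for b in blocks for v in b[i]] + [0.0] for i in range(3)]
recovered = [row[:6] for row in delta_w]  # drop the trailing m - k*r zero columns

# Hypothetical setting: four 768 x 768 layers, r = 16, k = 3, q = 16 bits, sigma = 1.
counts = trainable_params([(768, 768)] * 4, r=16, k=3)
bounds = {name: mi_generalization_bound(p, 16, 1.0, 10_000) for name, p in counts.items()}
for name, value in bounds.items():
    print(f"{name:16s} p = {counts[name]:7d}  bound = {value:.2f}")
```

For square layers ($m_i=n_i$), training only $B$ halves $p$, so in this sketch the cLA bound is smaller than the LoRA bound by a factor of $\sqrt{2}$; the chained variants scale both by $\sqrt{k}$.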

## Appendix EAddendum to Benchmarking and Evaluation

In §[E.1](https://arxiv.org/html/2606.13767#A5.SS1), we summarize the quality metrics and trainable parameters used for training the models in Table [2](https://arxiv.org/html/2606.13767#S3.T2) and provide the specific hyperparameters for fine-tuning each model for each dataset in Table [7](https://arxiv.org/html/2606.13767#A5.T7). In §[E.2](https://arxiv.org/html/2606.13767#A5.SS2), we present ablation studies on the effects of learning rate ($\gamma$), scaling factor ($\alpha$), and chain reset indices on the resulting test accuracy and test loss for varying ranks. In §[E.3](https://arxiv.org/html/2606.13767#A5.SS3), we comment on the potential of our methods by naively leveraging the sparsity of our $A$ matrices. In §[E.4.2](https://arxiv.org/html/2606.13767#A5.SS4.SSS2) and §[E.4.1](https://arxiv.org/html/2606.13767#A5.SS4.SSS1), we extend §[4.3](https://arxiv.org/html/2606.13767#S4.SS3) with the implementation details of the loss landscapes and provide additional loss landscapes and intruder dimension results. In §[E.5](https://arxiv.org/html/2606.13767#A5.SS5), we extend §[3.1](https://arxiv.org/html/2606.13767#S3.SS1) with empirical results on generalization.

### E\.1Implementation Details

We implement the framework in Python using PyTorch [[50](https://arxiv.org/html/2606.13767#bib.bib34)]. We train all models with the ADAM optimizer [[31](https://arxiv.org/html/2606.13767#bib.bib33)]. Training (of most models) was performed on one 80GB NVIDIA H100 GPU. The ablation studies on ViT-Tiny in Tables [9](https://arxiv.org/html/2606.13767#A5.T9), [11](https://arxiv.org/html/2606.13767#A5.T11), and [13](https://arxiv.org/html/2606.13767#A5.T13) were trained using one NVIDIA V100 GPU. We provide the hyperparameter settings, i.e., learning rates, learning rate scheduler, chain reset frequency, weight decay, batch size, training epochs, maximum token length or image resolution, and random seeds for the experiments in Table [7](https://arxiv.org/html/2606.13767#A5.T7).

Table 7: Summary of the hyperparameters. We used the same learning rate for LoRA methods that train $B,A$, and Asymmetric LoRA methods that only train $B$; we write (FFT, LoRA, Asym) to indicate those three sets. We selected the best model out of all epochs based on the lowest validation loss, except for the CoLA dataset, where we used the lowest Matthews Correlation Coefficient. We used rank $r=16$ and scaling factor $\alpha=2r$ for all LoRA PEFT methods. For all models, we used the ADAM optimizer [[31](https://arxiv.org/html/2606.13767#bib.bib33)] with $(\beta_1,\beta_2,\epsilon)=(0.9,0.999,1e^{-8})$. For ViT, RoBERTa, and GPT2, we used gradient clipping on global $L_2$ norm with a max of 1, and did not otherwise. For LoRA+, the learning rate for our $B$ matrix is 16 times that of $A$.

| Model | Dataset | Scheduler (Warmup LR, Ratio) | Learning Rates (FFT, LoRA, Asym) | Chain reset frequency | Weight decay (FFT, LoRA) | Batch size | Epochs | Max length or Image size | Seeds |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-Base | MRPC | Linear $(1e^{-6},0.1)$ | $(1e^{-5},3e^{-4},3e^{-4})$ | 3 | $(0.01,0)$ | 32 | 20 | 128 | (12,22,32) |
| RoBERTa-Base | CoLA | Linear $(1e^{-6},0.1)$ | $(1e^{-5},3e^{-4},3e^{-4})$ | 3 | $(0.01,0)$ | 32 | 20 | 128 | (12,22,32) |
| RoBERTa-Large | MRPC | Linear $(1e^{-6},0.1)$ | $(1e^{-5},3e^{-4},3e^{-4})$ | 3 | $(0.01,0)$ | 32 | 20 | 128 | (12,22,32) |
| RoBERTa-Large | CoLA | Linear $(1e^{-6},0.1)$ | $(1e^{-5},3e^{-4},3e^{-4})$ | 3 | $(0.01,0)$ | 32 | 20 | 128 | (12,22,32) |
| DeBERTa v2 XXL | MRPC | Constant | $(1e^{-5.5},1e^{-4.5},1e^{-4})$ | 5 | 0 | 8 | 25 | 512 | (100,101,102) |
| DeBERTa v2 XXL | TREC-50 | Constant | $(1e^{-5.5},1e^{-4.5},1e^{-4})$ | 5 | 0 | 8 | 25 | 512 | (100,101,102) |
| DeBERTa v2 XXL | PAWS | Constant | $(1e^{-6.5},1e^{-4.5},1e^{-4})$ | 5 | 0 | 8 | 10 | 512 | (100,101,102) |
| DeBERTa v3 Base | MRPC | Constant | $(1e^{-5},1e^{-3.5},1e^{-3})$ | 5 | 0 | 8 | 40 | 512 | (100,101,102) |
| DeBERTa v3 Base | RTE | Constant | $(1e^{-4.75},1e^{-3.5},1e^{-3})$ | 5 | 0 | 8 | 40 | 512 | (100,101,102) |
| DeBERTa v3 Base | STS-B | Constant | $(1e^{-4.75},1e^{-3.5},1e^{-3})$ | 5 | 0 | 8 | 40 | 512 | (100,101,102) |
| DeBERTa v3 Base | TREC-50 | Constant | $(1e^{-4.75},1e^{-3.25},1e^{-3})$ | 5 | 0 | 8 | 40 | 512 | (100,101,102) |
| DeBERTa v3 Base | PAWS | Constant | $(1e^{-5},1e^{-3.5},1e^{-3})$ | 5 | 0 | 8 | 20 | 512 | (100,101,102) |
| GPT2-Small | E2E | Linear $(1e^{-6},0.1)$ | $(5e^{-5},3e^{-4},3e^{-4})$ | 1 | $(0.01,0)$ | 16 | 30 | 64 | (12,22,32) |
| ViT-Tiny | OfficeHome | Cosine $(1e^{-6},0.05)$ | $(3e^{-4},1e^{-3},1e^{-3})$ | 5 | $(0.05,0)$ | 64 | 30 | 224 | (12,22,32) |
| ViT-Tiny | CIFAR-10 | Cosine $(1e^{-6},0.05)$ | $(3e^{-4},1e^{-3},1e^{-3})$ | 5 | $(0.05,0)$ | 64 | 30 | 224 | (12,22,32) |
| ViT-Base | OfficeHome | Cosine $(1e^{-6},0.05)$ | $(3e^{-4},1e^{-3},1e^{-3})$ | 5 | $(0.05,0)$ | 64 | 30 | 224 | (12,22,32) |
| ViT-Base | CIFAR-10 | Cosine $(1e^{-6},0.05)$ | $(3e^{-4},1e^{-3},1e^{-3})$ | 5 | $(0.05,0)$ | 64 | 30 | 224 | (12,22,32) |
| DeepSeek-Coder Base | DJANGO | Constant | $(1e^{-5.5},1e^{-4.5},1e^{-4})$ | 1 | 0 | 8 | 5 | 512 | (100,101,102) |
| TinyLlama | OpenBookQA | Constant | $(1e^{-6.25},1e^{-3.75},1e^{-3.25})$ | 2 | 0 | 8 | 10 | 512 | (100,101,102) |
| TinyLlama | FOLIO | Constant | $(1e^{-5},1e^{-3.75},1e^{-3.5})$ | 2 | 0 | 8 | 10 | 512 | (100,101,102) |
| TinyLlama | LogiQA | Constant | $(1e^{-5.75},1e^{-4},1e^{-3.25})$ | 2 | 0 | 8 | 10 | 512 | (100,101,102) |
| TinyLlama | CLUTRR | Constant | $(1e^{-6.25},1e^{-5.25},1e^{-4.75})$ | 2 | 0 | 8 | 10 | 512 | (100,101,102) |
| Llama 3 | OpenBookQA | Constant | $(1e^{-5.25},1e^{-3.75},1e^{-3.5})$ | 2 | 0 | 4 | 5 | 384 | (100,101,102) |
| Llama 3 | CLUTRR | Constant | $(1e^{-5.25},1e^{-4.25},1e^{-3.25})$ | 2 | 0 | 4 | 5 | 384 | (100,101,102) |

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/django__best_by_valem.jpg)

Figure 6: Performance of cLA, $c^3$LA, r-cLA, r-$c^3$LA, CoLA, and Asymmetric LoRA on DeepSeekCoder fine-tuned on the DJANGO dataset, measured by Exact Match (EM). We vary the target rank. The *epoch selection* is based on the highest validation accuracy instead of the lowest validation loss. The figure demonstrates that the difference between the sparsity-induced LoRA variants and their non-sparse counterparts tends to decrease with increasing rank.

### E.2 The Effects of Learning Rate, Scaling Factor, and Chain Reset Frequency on Quality Metric Over Various Ranks

The ideal learning rate of an LLM tends to scale inversely with its size [[35](https://arxiv.org/html/2606.13767#bib.bib44)]. Many papers suggest a default scaling factor of $2r$ [[2](https://arxiv.org/html/2606.13767#bib.bib83), [55](https://arxiv.org/html/2606.13767#bib.bib22)]. [[42](https://arxiv.org/html/2606.13767#bib.bib19)] suggests that, for sufficiently low learning rates, performing a chain reset every epoch is optimal. We validate the first claim under LoRA fine-tuning methods via ablation studies over learning rates, presented in Tables [8](https://arxiv.org/html/2606.13767#A5.T8) and [9](https://arxiv.org/html/2606.13767#A5.T9). Similarly, we assess the baseline choice of scaling factor in Tables [10](https://arxiv.org/html/2606.13767#A5.T10) and [11](https://arxiv.org/html/2606.13767#A5.T11), and the optimal chain reset frequency in Table [12](https://arxiv.org/html/2606.13767#A5.T12). For the ablation studies, we fine-tuned DeBERTaV3-Base over various ranks on MRPC, TREC-50, and PAWS for learning rate; on MRPC and TREC-50 for scaling factor; and on MRPC, CoLA, RTE, and TREC-50 for chain reset. We then re-ran the same experiments on ViT-Tiny fine-tuned on the OfficeHome and CIFAR-10 datasets. Each run used 30 epochs.
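The learning-rate columns of Tables 8 and 9 are spaced by half-decades around a center value. The following is a minimal sketch of how such a grid can be generated; the function name `half_decade_grid` and its arguments are illustrative assumptions and are not taken from the paper's code.

```python
def half_decade_grid(center_exp, steps_each_side=4):
    """Learning rates 10**e for exponents spaced 0.5 apart around center_exp."""
    exponents = [center_exp + 0.5 * k
                 for k in range(-steps_each_side, steps_each_side + 1)]
    return [10.0 ** e for e in exponents]


# Grid centered at 1e-4 (column range of Table 8) and at 1e-3 (Table 9).
deberta_grid = half_decade_grid(-4.0)   # 1e-6, 1e-5.5, ..., 1e-2
vit_grid = half_decade_grid(-3.0)       # 1e-5, 1e-4.5, ..., 1e-1

for lr in deberta_grid:
    print(f"{lr:.3e}")
```

Each grid has nine entries, matching the nine learning-rate columns in the tables.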

As shown in Tables [8](https://arxiv.org/html/2606.13767#A5.T8) and [9](https://arxiv.org/html/2606.13767#A5.T9), Asymmetric LoRA methods are more sensitive to varying learning rates than methods that train both matrices $B$ and $A$. We notice that cLA has a wide range of acceptable learning rates. Furthermore, across varying ranks, cLA and $c^3$LA often underperform compared to other LoRA variants. As rank increases, this gap tends to narrow. This is a byproduct of their structure, which limits how much of the pretrained weights they can update at any one time.

For our ablation study on scaling factor, shown in Tables [10](https://arxiv.org/html/2606.13767#A5.T10) and [11](https://arxiv.org/html/2606.13767#A5.T11), $\alpha = 2r$ works as a baseline given how often it was the best choice. With Asymmetric methods, the ideal scaling factor tends to be larger. This follows from the reduced number of trainable parameters, which requires a larger effective learning rate; the scaling factor can be interpreted as a scale on the learning rate.
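To make the scaling-factor grid concrete, the sketch below lists the candidate values $\frac{r}{4}, \frac{r}{2}, r, 2r, 4r$ used as columns in Table 10 and the resulting multiplier on the low-rank update. It assumes the common LoRA convention in which the update $BA$ is multiplied by $\alpha / r$; this section does not state which convention the paper uses, so the helper `update_multiplier` is an assumption for illustration only.

```python
def scaling_grid(rank):
    """Candidate alpha values r/4, r/2, r, 2r, 4r (the columns of Table 10)."""
    return [rank / 4, rank / 2, rank, 2 * rank, 4 * rank]


def update_multiplier(alpha, rank):
    """Multiplier applied to B @ A, ASSUMING the common alpha / r convention."""
    return alpha / rank


for rank in (4, 8, 16):
    multipliers = [update_multiplier(a, rank) for a in scaling_grid(rank)]
    print(rank, multipliers)
```

Under this assumed convention, the baseline $\alpha = 2r$ corresponds to the same multiplier at every rank, so moving right along a row of Table 10 doubles the multiplier at each step.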

Our ablation study on chain reset frequency, shown in Table [12](https://arxiv.org/html/2606.13767#A5.T12), revealed no clear correlation between the optimal chain reset frequency and rank.

#### E.2.1 DeepSeekCoder Performance Analysis

Figure [6](https://arxiv.org/html/2606.13767#A5.F6) displays the Exact Match (EM) performance of DeepSeekCoder fine-tuned on the DJANGO dataset using Asymmetric LoRA, CoLA, and our sparsity-induced LoRA variants. We vary rank over $r \in \{8, 16, 32, 64, 128\}$; otherwise, we use the same hyperparameters as in Table [7](https://arxiv.org/html/2606.13767#A5.T7). While cLA at $r = 16$ performs substantially worse than other variants, its EM accuracy increases sharply until $r = 64$, outperforming Asymmetric LoRA, its non-sparse counterpart. A similar trend of EM increasing with rank is observed in $c^3$LA, but it peaks at $r = 32$. These trends are far weaker in r-cLA and r-$c^3$LA, suggesting that this is a consequence of applying contiguous column updates to each layer rather than a random permutation.

As the rank increases, the largest difference between any sparsity-induced LoRA variant and Asymmetric LoRA or CoLA decreases. Combined with Table [18](https://arxiv.org/html/2606.13767#A5.T18), these results indicate that restricting adaptation to a structured $r$-column subspace (as in cLA) can still be effective for code generation, but the loss of column space expressivity is most noticeable at low ranks. In our DJANGO setup, larger ranks substantially improve performance and can even surpass alternative PEFT scores, suggesting that restricted column-space fine-tuning remains a viable strategy for code tasks, provided that enough columns are made available for adaptation. These trends suggest that code generation may tolerate less aggressive restriction than other tasks.

Table 8: Test accuracies obtained by fine-tuning DeBERTa v3 on MRPC, TREC-50, and PAWS over varying learning rates (columns), ranks (rows), and LoRA PEFT methods. We center our search at $1e^{-4}$. The best-performing learning rate for all methods decreases with increasing rank; the relationship between learning rate and model size observed in LLMs [[35](https://arxiv.org/html/2606.13767#bib.bib44)] persists when fine-tuning via LoRA methods. Chain methods and their non-chain counterparts produce the best results in similar learning rate ranges; therefore, chain resets do not influence the optimal learning rate. We repeated the experiment with ViT-Tiny in Table [9](https://arxiv.org/html/2606.13767#A5.T9).

DeBERTa v3 LoRA MRPC

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 66.4 | 66.4 | 79.9 | 84.2 | 85.5 | 87.3 | 88.1 | 66.4 | 66.4 |
| 4 | 66.4 | 74.4 | 81.8 | 84.6 | 85.6 | 87.9 | 87.7 | 66.4 | 66.4 |
| 8 | 66.4 | 76.1 | 83.0 | 85.1 | 86.9 | 87.3 | 87.1 | 66.4 | 66.4 |
| 16 | 66.4 | 78.1 | 83.1 | 85.0 | 86.8 | 87.9 | 72.9 | 66.4 | 66.4 |
| 32 | 66.4 | 80.1 | 83.9 | 84.9 | 87.1 | 87.9 | 66.4 | 66.4 | 66.4 |
| 64 | 76.1 | 81.7 | 84.6 | 85.1 | 88.0 | 81.5 | 66.4 | 66.4 | 66.4 |
| 128 | 77.5 | 81.9 | 84.9 | 86.2 | 88.5 | 87.6 | 66.4 | 66.4 | 66.4 |

DeBERTa v3 CoLA MRPC

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 66.5 | 66.5 | 75.9 | 82.2 | 85.7 | 87.7 | 88.3 | 66.5 | 66.5 |
| 4 | 66.5 | 66.5 | 76.7 | 82.7 | 85.1 | 87.8 | 87.2 | 66.5 | 66.5 |
| 8 | 66.5 | 66.5 | 79.1 | 84.0 | 86.8 | 88.3 | 86.9 | 66.5 | 66.5 |
| 16 | 66.5 | 75.3 | 80.8 | 84.0 | 86.4 | 88.9 | 79.6 | 66.5 | 66.5 |
| 32 | 66.5 | 76.8 | 82.3 | 83.9 | 86.5 | 88.0 | 66.5 | 66.5 | 66.5 |
| 64 | 69.3 | 79.6 | 83.3 | 85.7 | 87.1 | 88.5 | 66.5 | 66.5 | 66.5 |
| 128 | 75.8 | 80.7 | 83.1 | 85.9 | 88.2 | 88.0 | 66.5 | 66.5 | 66.5 |

DeBERTa v3 Asym MRPC

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 66.4 | 66.4 | 66.4 | 68.1 | 81.4 | 85.3 | 86.1 | 86.9 | 86.0 |
| 4 | 66.4 | 66.4 | 66.4 | 76.0 | 82.9 | 84.2 | 85.7 | 86.8 | 85.6 |
| 8 | 66.4 | 66.4 | 66.5 | 80.6 | 84.2 | 85.2 | 86.5 | 86.3 | 72.3 |
| 16 | 66.4 | 66.4 | 76.7 | 82.6 | 84.3 | 86.2 | 86.8 | 87.3 | 66.4 |
| 32 | 66.4 | 66.6 | 79.6 | 83.1 | 84.6 | 86.1 | 87.3 | 75.0 | 66.4 |
| 64 | 66.4 | 77.7 | 81.7 | 82.5 | 84.7 | 86.1 | 87.4 | 66.4 | 66.4 |
| 128 | 69.0 | 79.6 | 82.2 | 84.3 | 86.1 | 80.1 | 66.4 | 66.4 | 66.4 |

DeBERTa v3 RAC MRPC

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 66.5 | 66.5 | 66.5 | 66.5 | 66.5 | 76.5 | 82.0 | 87.0 | 86.2 |
| 4 | 66.5 | 66.5 | 66.5 | 66.5 | 74.0 | 79.4 | 84.6 | 85.4 | 85.5 |
| 8 | 66.5 | 66.5 | 66.5 | 66.5 | 76.9 | 82.8 | 86.3 | 87.5 | 76.7 |
| 16 | 66.5 | 66.5 | 66.5 | 74.1 | 78.8 | 83.0 | 86.9 | 87.2 | 66.5 |
| 32 | 66.5 | 66.5 | 66.5 | 76.7 | 83.9 | 86.3 | 87.7 | 72.4 | 66.5 |
| 64 | 66.5 | 66.5 | 74.2 | 80.4 | 84.1 | 87.9 | 87.3 | 66.5 | 66.5 |
| 128 | 66.5 | 66.5 | 76.1 | 82.3 | 86.4 | 87.6 | 72.0 | 66.5 | 66.5 |

DeBERTa v3 cLA MRPC

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 66.4 | 66.4 | 66.4 | 71.0 | 82.4 | 84.0 | 85.0 | 83.1 | 80.4 |
| 4 | 66.4 | 66.4 | 70.2 | 79.0 | 84.9 | 85.5 | 85.9 | 85.5 | 66.4 |
| 8 | 66.4 | 66.4 | 73.7 | 81.5 | 84.6 | 84.5 | 85.2 | 85.5 | 66.4 |
| 16 | 66.4 | 67.0 | 76.1 | 81.9 | 83.3 | 84.4 | 85.3 | 78.8 | 66.4 |
| 32 | 66.4 | 74.4 | 79.6 | 80.8 | 83.2 | 85.0 | 85.2 | 66.4 | 66.4 |
| 64 | 66.4 | 77.0 | 80.8 | 81.8 | 84.5 | 85.1 | 86.7 | 66.4 | 66.4 |
| 128 | 71.3 | 78.6 | 79.5 | 82.0 | 85.5 | 87.1 | 84.8 | 66.4 | 66.4 |

DeBERTa v3 $c^3$LA MRPC

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 66.5 | 66.5 | 66.5 | 66.5 | 73.0 | 80.1 | 85.6 | 86.4 | 72.3 |
| 4 | 66.5 | 66.5 | 66.5 | 66.8 | 74.6 | 83.1 | 85.7 | 87.2 | 66.5 |
| 8 | 66.5 | 66.5 | 66.5 | 72.6 | 79.5 | 86.2 | 86.8 | 85.5 | 66.5 |
| 16 | 66.5 | 66.5 | 66.5 | 75.6 | 82.8 | 86.0 | 86.5 | 78.6 | 66.5 |
| 32 | 66.5 | 66.5 | 71.9 | 79.1 | 84.5 | 86.8 | 86.8 | 66.5 | 66.5 |
| 64 | 66.5 | 66.5 | 73.9 | 82.4 | 86.3 | 87.2 | 86.4 | 66.5 | 66.5 |
| 128 | 66.5 | 66.5 | 81.9 | 86.3 | 87.9 | 86.7 | 72.3 | 66.5 | 66.5 |

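DeBERTa v3 LoRA TREC-50

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 3.2 | 10.9 | 10.9 | 39.1 | 59.5 | 76.6 | 86.9 | 10.9 | 10.9 |
| 4 | 10.1 | 10.9 | 10.9 | 42.3 | 70.6 | 82.3 | 87.7 | 10.9 | 10.9 |
| 8 | 10.9 | 10.9 | 10.9 | 50.0 | 70.6 | 84.7 | 90.1 | 10.9 | 10.9 |
| 16 | 1.4 | 10.9 | 10.9 | 50.0 | 73.0 | 89.3 | 88.3 | 10.9 | 10.9 |
| 32 | 1.4 | 10.9 | 42.9 | 59.5 | 76.4 | 89.1 | 10.9 | 10.9 | 10.9 |
| 64 | 10.9 | 10.9 | 48.2 | 66.1 | 82.9 | 87.1 | 10.9 | 10.9 | 10.9 |
| 128 | 10.9 | 10.9 | 58.1 | 71.6 | 86.1 | 10.9 | 10.9 | 10.9 | 10.9 |

DeBERTa v3 CoLA TREC-50

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 10.9 | 10.9 | 42.1 | 54.4 | 71.8 | 88.1 | 89.1 | 10.9 | 10.9 |
| 4 | 10.9 | 10.9 | 42.7 | 58.3 | 81.7 | 84.5 | 88.3 | 10.9 | 10.9 |
| 8 | 10.9 | 10.9 | 42.9 | 65.5 | 82.3 | 87.1 | 90.9 | 10.9 | 10.9 |
| 16 | 10.9 | 10.9 | 39.9 | 66.7 | 84.9 | 87.1 | 68.1 | 10.9 | 10.9 |
| 32 | 10.9 | 26.8 | 53.2 | 71.4 | 85.3 | 86.7 | 10.9 | 10.9 | 10.9 |
| 64 | 10.9 | 37.5 | 58.5 | 75.6 | 86.9 | 10.9 | 10.9 | 10.9 | 10.9 |
| 128 | 10.9 | 43.5 | 66.1 | 82.7 | 86.7 | 10.9 | 10.9 | 10.9 | 10.9 |

DeBERTa v3 Asym TREC-50

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 10.9 | 10.9 | 10.9 | 32.5 | 46.2 | 80.6 | 86.5 | 82.5 | 26.6 |
| 4 | 10.9 | 10.9 | 10.9 | 33.3 | 58.9 | 85.1 | 88.1 | 85.7 | 10.9 |
| 8 | 10.9 | 10.9 | 10.9 | 40.1 | 73.6 | 87.3 | 86.7 | 84.5 | 10.9 |
| 16 | 10.9 | 10.9 | 10.9 | 42.9 | 78.8 | 89.1 | 86.9 | 82.7 | 10.9 |
| 32 | 10.9 | 10.9 | 10.9 | 57.5 | 83.1 | 90.9 | 91.7 | 56.5 | 10.9 |
| 64 | 10.9 | 10.9 | 41.9 | 73.0 | 88.5 | 90.7 | 87.5 | 10.9 | 10.9 |
| 128 | 10.9 | 10.9 | 52.2 | 78.8 | 89.9 | 90.9 | 10.9 | 10.9 | 10.9 |

DeBERTa v3 RAC TREC-50

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 10.9 | 10.9 | 10.9 | 10.9 | 34.9 | 46.6 | 72.8 | 85.9 | 87.5 |
| 4 | 1.2 | 10.9 | 10.9 | 10.9 | 38.5 | 59.7 | 82.9 | 88.3 | 89.5 |
| 8 | 10.9 | 10.9 | 10.9 | 10.9 | 43.8 | 68.1 | 82.9 | 88.1 | 72.4 |
| 16 | 10.1 | 10.9 | 10.9 | 13.3 | 59.5 | 78.0 | 87.1 | 90.3 | 10.9 |
| 32 | 10.9 | 10.9 | 10.9 | 45.6 | 70.6 | 84.9 | 88.5 | 88.1 | 10.9 |
| 64 | 10.9 | 10.9 | 34.9 | 50.6 | 74.2 | 86.7 | 87.7 | 10.9 | 10.9 |
| 128 | 2.0 | 10.9 | 42.9 | 62.3 | 84.5 | 88.9 | 87.3 | 10.9 | 10.9 |

DeBERTa v3 cLA TREC-50

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 10.9 | 10.9 | 10.9 | 10.9 | 10.9 | 40.9 | 71.2 | 81.3 | 34.5 |
| 4 | 9.5 | 10.9 | 10.9 | 10.9 | 35.3 | 61.9 | 79.6 | 82.7 | 10.9 |
| 8 | 0.4 | 10.9 | 10.9 | 10.9 | 53.0 | 71.8 | 83.1 | 86.3 | 10.9 |
| 16 | 10.9 | 10.9 | 10.9 | 40.7 | 60.7 | 84.3 | 85.5 | 87.5 | 10.9 |
| 32 | 10.1 | 10.1 | 10.9 | 47.0 | 70.0 | 86.1 | 88.9 | 10.9 | 10.9 |
| 64 | 10.9 | 10.9 | 42.7 | 62.3 | 76.2 | 86.9 | 89.1 | 52.4 | 10.9 |
| 128 | 3.6 | 10.9 | 50.2 | 67.1 | 85.1 | 86.7 | 66.7 | 10.9 | 10.9 |

DeBERTa v3 $c^3$LA TREC-50

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 10.9 | 10.9 | 10.9 | 34.7 | 42.3 | 79.2 | 66.7 | 62.1 | 10.9 |
| 4 | 10.9 | 10.9 | 19.4 | 34.7 | 56.9 | 86.5 | 87.7 | 73.8 | 10.9 |
| 8 | 10.9 | 10.9 | 20.6 | 36.9 | 68.7 | 87.3 | 78.8 | 66.7 | 10.9 |
| 16 | 10.9 | 10.9 | 10.9 | 43.8 | 76.8 | 88.5 | 83.9 | 71.0 | 10.9 |
| 32 | 10.9 | 10.9 | 38.7 | 58.1 | 84.3 | 88.1 | 80.4 | 34.3 | 10.9 |
| 64 | 10.9 | 10.9 | 45.8 | 74.2 | 85.9 | 89.1 | 80.4 | 10.9 | 10.9 |
| 128 | 10.9 | 35.1 | 56.5 | 79.2 | 85.9 | 90.9 | 10.9 | 10.9 | 10.9 |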

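DeBERTa v3 LoRA PAWS

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 92.1 | 93.7 | 94.0 | 94.2 | 94.7 | 94.5 | 94.0 | 55.8 | 55.8 |
| 4 | 92.3 | 94.1 | 94.0 | 94.4 | 94.2 | 94.8 | 94.0 | 55.8 | 55.8 |
| 8 | 92.4 | 93.5 | 94.3 | 94.3 | 94.7 | 94.5 | 93.5 | 55.8 | 55.8 |
| 16 | 93.1 | 94.0 | 94.4 | 94.6 | 94.6 | 93.6 | 55.8 | 55.8 | 55.8 |
| 32 | 93.8 | 93.9 | 94.7 | 94.5 | 94.7 | 55.8 | 55.8 | 55.8 | 55.8 |
| 64 | 93.6 | 94.1 | 94.6 | 94.6 | 94.6 | 93.5 | 55.8 | 55.8 | 50.0 |
| 128 | 94.0 | 94.2 | 94.4 | 94.7 | 94.7 | 55.8 | 55.8 | 55.8 | 50.0 |

DeBERTa v3 CoLA PAWS

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 55.8 | 89.8 | 92.5 | 93.9 | 94.6 | 94.7 | 94.2 | 55.8 | 55.8 |
| 4 | 55.8 | 90.6 | 92.9 | 94.0 | 94.8 | 94.8 | 94.0 | 55.8 | 55.8 |
| 8 | 55.8 | 89.3 | 93.2 | 94.1 | 94.3 | 94.1 | 93.0 | 55.8 | 55.8 |
| 16 | 55.8 | 91.8 | 93.7 | 94.3 | 94.5 | 94.7 | 55.8 | 55.8 | 55.8 |
| 32 | 89.9 | 92.9 | 94.3 | 94.5 | 94.3 | 94.8 | 55.8 | 55.8 | 55.8 |
| 64 | 90.5 | 93.3 | 94.5 | 94.5 | 94.8 | 93.2 | 55.8 | 55.8 | 55.8 |
| 128 | 91.8 | 93.8 | 94.7 | 95.1 | 95.0 | 92.7 | 55.8 | 55.8 | 44.2 |

DeBERTa v3 Asym PAWS

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 55.8 | 55.8 | 86.9 | 92.3 | 93.7 | 93.9 | 93.9 | 94.4 | 93.4 |
| 4 | 55.8 | 55.8 | 90.3 | 92.5 | 93.2 | 94.0 | 94.4 | 94.2 | 92.7 |
| 8 | 55.8 | 55.8 | 92.3 | 93.1 | 94.1 | 94.7 | 94.7 | 94.2 | 55.8 |
| 16 | 55.8 | 55.8 | 92.4 | 94.1 | 94.0 | 94.5 | 94.6 | 94.0 | 55.8 |
| 32 | 55.8 | 92.9 | 93.6 | 94.4 | 94.4 | 94.6 | 94.1 | 92.9 | 55.8 |
| 64 | 55.8 | 92.5 | 94.1 | 94.1 | 94.8 | 94.8 | 93.4 | 55.8 | 55.8 |
| 128 | 91.7 | 92.9 | 94.1 | 94.4 | 94.6 | 94.7 | 91.8 | 55.0 | 44.2 |

DeBERTa v3 RAC PAWS

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 55.8 | 55.8 | 55.8 | 89.0 | 93.3 | 94.1 | 93.8 | 94.4 | 93.4 |
| 4 | 55.8 | 55.8 | 91.0 | 93.0 | 93.5 | 93.9 | 94.4 | 93.8 | 90.6 |
| 8 | 55.8 | 55.8 | 89.4 | 93.4 | 93.8 | 94.5 | 94.2 | 94.3 | 88.9 |
| 16 | 55.8 | 55.8 | 92.6 | 92.8 | 94.2 | 95.1 | 94.7 | 93.5 | 55.8 |
| 32 | 55.8 | 91.0 | 92.6 | 93.8 | 94.2 | 94.0 | 94.7 | 55.8 | 55.8 |
| 64 | 55.8 | 92.7 | 93.5 | 94.3 | 94.8 | 94.5 | 93.8 | 55.8 | 55.8 |
| 128 | 91.8 | 93.1 | 94.2 | 94.3 | 94.6 | 94.6 | 55.8 | 55.8 | 55.8 |

DeBERTa v3 cLA PAWS

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 55.8 | 55.8 | 55.8 | 90.5 | 92.5 | 94.0 | 93.8 | 93.7 | 55.8 |
| 4 | 55.8 | 55.8 | 89.3 | 90.9 | 93.1 | 93.8 | 94.0 | 93.7 | 55.8 |
| 8 | 55.8 | 55.8 | 89.7 | 92.6 | 92.9 | 94.3 | 93.9 | 93.7 | 55.8 |
| 16 | 55.8 | 55.8 | 91.5 | 93.0 | 93.7 | 94.5 | 94.1 | 55.8 | 55.8 |
| 32 | 55.8 | 89.5 | 91.6 | 93.4 | 93.8 | 94.2 | 94.1 | 55.8 | 55.8 |
| 64 | 55.8 | 90.0 | 93.2 | 93.8 | 93.9 | 94.3 | 93.5 | 55.8 | 55.8 |
| 128 | 87.2 | 92.5 | 93.6 | 93.9 | 94.7 | 94.2 | 55.8 | 55.8 | 55.8 |

DeBERTa v3 $c^3$LA PAWS

| Rank/LR | 1e-6 | 1e-5.5 | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 55.8 | 55.8 | 55.8 | 91.4 | 93.3 | 93.8 | 93.9 | 93.9 | 55.8 |
| 4 | 55.8 | 55.8 | 89.8 | 91.2 | 93.7 | 94.0 | 94.0 | 93.3 | 55.8 |
| 8 | 55.8 | 55.8 | 91.8 | 93.7 | 93.9 | 94.7 | 93.5 | 93.3 | 55.8 |
| 16 | 55.8 | 55.8 | 92.6 | 93.5 | 94.1 | 94.1 | 93.7 | 55.8 | 55.8 |
| 32 | 55.8 | 92.0 | 93.2 | 94.0 | 94.1 | 94.4 | 93.9 | 55.8 | 55.8 |
| 64 | 91.3 | 92.2 | 93.8 | 94.1 | 93.9 | 94.2 | 93.7 | 55.8 | 55.8 |
| 128 | 92.5 | 93.3 | 94.0 | 94.3 | 94.8 | 55.8 | 55.8 | 55.8 | 55.8 |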

Table 9: Test accuracies obtained by fine-tuning ViT-Tiny on CIFAR-10 and OfficeHome over varying learning rates (columns), ranks (rows), and LoRA PEFT methods. We center our search at $1e^{-3}$. Consistent with the results of Table [8](https://arxiv.org/html/2606.13767#A5.T8), the best-performing learning rate for all methods decreases with increasing rank. Chain methods and their non-chain counterparts produce the best results in similar learning rate ranges.

ViT-Tiny LoRA CIFAR-10

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 89.08 | 90.91 | 92.85 | 92.92 | 93.56 | 91.76 | 87.13 | 49.75 | 10.70 |
| 4 | 89.38 | 91.55 | 93.49 | 94.50 | 94.18 | 95.11 | 81.16 | 17.82 | 11.15 |
| 8 | 89.56 | 92.04 | 94.00 | 95.20 | 95.68 | 95.65 | 77.10 | 11.85 | 10.00 |
| 16 | 90.03 | 92.69 | 94.36 | 95.69 | 95.91 | 91.27 | 59.09 | 13.52 | 10.08 |
| 32 | 90.85 | 93.05 | 95.14 | 96.14 | 96.26 | 87.87 | 17.79 | 18.07 | 10.67 |
| 64 | 91.79 | 94.00 | 95.30 | 96.44 | 96.43 | 81.73 | 14.33 | 11.30 | 13.21 |
| 128 | 92.47 | 94.71 | 96.03 | 96.50 | 96.17 | 64.03 | 11.16 | 12.00 | 11.41 |

ViT-Tiny CoLA CIFAR-10

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 87.90 | 90.43 | 92.13 | 93.79 | 93.77 | 93.57 | 87.13 | 20.52 | 12.61 |
| 4 | 88.43 | 90.77 | 92.73 | 94.30 | 95.21 | 94.49 | 83.74 | 17.82 | 11.15 |
| 8 | 88.96 | 91.26 | 93.37 | 94.95 | 95.39 | 94.74 | 77.09 | 11.85 | 10.00 |
| 16 | 89.50 | 92.00 | 93.86 | 95.26 | 95.72 | 91.27 | 62.12 | 15.42 | 11.25 |
| 32 | 90.02 | 92.63 | 94.54 | 95.72 | 95.96 | 87.87 | 19.10 | 18.07 | 12.41 |
| 64 | 91.18 | 93.51 | 95.13 | 96.07 | 96.01 | 81.73 | 14.33 | 17.70 | 13.23 |
| 128 | 92.06 | 94.20 | 95.56 | 96.17 | 95.54 | 65.78 | 17.45 | 10.30 | 11.03 |

ViT-Tiny Asym CIFAR-10

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 85.34 | 89.63 | 90.86 | 91.64 | 92.03 | 91.88 | 90.85 | 90.18 | 80.81 |
| 4 | 86.82 | 90.29 | 91.66 | 92.58 | 93.32 | 92.73 | 91.63 | 87.88 | 78.91 |
| 8 | 88.34 | 90.87 | 92.35 | 93.34 | 93.71 | 93.81 | 93.74 | 86.88 | 64.43 |
| 16 | 89.23 | 91.61 | 93.18 | 94.23 | 94.65 | 95.08 | 90.39 | 82.52 | 50.44 |
| 32 | 90.12 | 92.17 | 93.81 | 95.20 | 95.15 | 95.36 | 93.94 | 73.55 | 34.13 |
| 64 | 91.20 | 93.11 | 94.66 | 95.77 | 96.08 | 92.99 | 92.18 | 53.07 | 24.52 |
| 128 | 92.27 | 94.03 | 95.36 | 96.09 | 94.56 | 94.97 | 69.81 | 27.77 | 16.58 |

ViT-Tiny RAC CIFAR-10

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 85.61 | 89.68 | 90.98 | 91.87 | 92.72 | 91.94 | 91.47 | 89.51 | 80.23 |
| 4 | 86.64 | 90.35 | 91.89 | 92.96 | 93.71 | 93.46 | 93.49 | 87.88 | 77.56 |
| 8 | 88.26 | 90.87 | 92.40 | 93.65 | 94.09 | 94.33 | 90.69 | 86.88 | 64.43 |
| 16 | 89.26 | 91.75 | 93.26 | 94.40 | 95.31 | 94.24 | 89.56 | 82.52 | 50.44 |
| 32 | 90.27 | 92.28 | 94.05 | 95.43 | 95.54 | 95.68 | 93.14 | 73.55 | 24.92 |
| 64 | 91.12 | 93.20 | 94.86 | 95.91 | 94.78 | 95.35 | 82.86 | 53.07 | 24.52 |
| 128 | 92.19 | 94.03 | 95.56 | 96.11 | 96.10 | 94.21 | 69.81 | 27.77 | 16.58 |

ViT-Tiny cLA CIFAR-10

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 87.30 | 89.65 | 91.13 | 92.12 | 92.49 | 91.70 | 88.92 | 82.43 | 51.06 |
| 4 | 88.06 | 90.47 | 91.91 | 92.00 | 93.45 | 92.05 | 88.78 | 73.27 | 11.47 |
| 8 | 89.14 | 91.38 | 93.13 | 93.36 | 93.22 | 90.72 | 85.78 | 68.40 | 13.39 |
| 16 | 90.43 | 92.27 | 93.91 | 94.86 | 94.53 | 94.11 | 78.83 | 44.03 | 17.11 |
| 32 | 91.32 | 93.37 | 94.91 | 95.63 | 95.33 | 89.55 | 71.95 | 30.90 | 19.23 |
| 64 | 92.50 | 94.17 | 95.82 | 96.26 | 95.54 | 83.63 | 50.91 | 21.61 | 15.50 |
| 128 | 93.86 | 95.39 | 96.50 | 96.45 | 94.85 | 75.52 | 28.65 | 19.95 | 31.06 |

ViT-Tiny $c^3$LA CIFAR-10

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 87.29 | 89.83 | 91.49 | 92.48 | 93.04 | 91.35 | 88.65 | 80.39 | 53.28 |
| 4 | 88.57 | 90.76 | 92.50 | 93.36 | 93.98 | 91.55 | 87.25 | 73.27 | 11.47 |
| 8 | 89.69 | 91.84 | 93.52 | 94.68 | 94.80 | 90.72 | 85.78 | 68.40 | 27.84 |
| 16 | 90.63 | 92.69 | 94.33 | 95.35 | 95.30 | 90.04 | 81.13 | 44.03 | 17.11 |
| 32 | 91.39 | 93.68 | 95.12 | 95.57 | 95.30 | 89.55 | 71.95 | 21.94 | 17.60 |
| 64 | 92.86 | 94.71 | 95.92 | 96.41 | 95.07 | 83.63 | 41.60 | 21.61 | 14.03 |
| 128 | 93.83 | 95.38 | 96.43 | 96.22 | 94.89 | 75.52 | 27.84 | 19.95 | 14.48 |

ViT-Tiny LoRA OfficeHome

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 47.33 | 65.03 | 70.80 | 75.55 | 77.81 | 75.93 | 77.08 | 40.57 | 1.80 |
| 4 | 48.57 | 64.90 | 71.36 | 75.33 | 77.85 | 78.50 | 77.55 | 26.51 | 2.01 |
| 8 | 49.42 | 64.81 | 72.08 | 75.93 | 79.39 | 78.88 | 59.09 | 4.28 | 2.61 |
| 16 | 50.41 | 65.33 | 72.30 | 77.21 | 79.26 | 79.69 | 49.17 | 1.97 | 4.53 |
| 32 | 50.53 | 65.50 | 73.54 | 79.09 | 79.56 | 66.01 | 36.43 | 2.44 | 2.05 |
| 64 | 51.86 | 66.35 | 74.99 | 78.97 | 79.86 | 61.48 | 10.77 | 1.75 | 2.09 |
| 128 | 54.68 | 66.82 | 76.70 | 79.91 | 80.33 | 53.53 | 2.78 | 1.75 | 2.91 |

ViT-Tiny CoLA OfficeHome

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 45.28 | 64.64 | 70.50 | 74.48 | 76.96 | 76.70 | 74.05 | 40.57 | 2.01 |
| 4 | 46.43 | 64.94 | 70.63 | 74.39 | 76.87 | 76.61 | 72.60 | 26.51 | 2.01 |
| 8 | 47.71 | 65.11 | 70.97 | 75.12 | 77.73 | 76.96 | 59.09 | 3.04 | 5.77 |
| 16 | 49.47 | 64.51 | 71.40 | 75.63 | 78.71 | 77.68 | 49.17 | 1.97 | 2.22 |
| 32 | 50.32 | 65.20 | 72.12 | 77.38 | 80.38 | 66.01 | 36.43 | 2.44 | 2.05 |
| 64 | 51.05 | 65.63 | 73.79 | 78.32 | 79.35 | 61.48 | 10.77 | 1.75 | 2.69 |
| 128 | 52.29 | 66.35 | 75.29 | 79.48 | 79.52 | 52.29 | 2.78 | 1.75 | 4.10 |

ViT-Tiny Asym OfficeHome

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 43.91 | 63.06 | 70.20 | 73.45 | 74.65 | 74.48 | 74.22 | 75.29 | 74.09 |
| 4 | 44.72 | 63.83 | 70.59 | 73.71 | 75.76 | 73.92 | 76.14 | 75.37 | 51.69 |
| 8 | 45.79 | 64.47 | 71.31 | 74.65 | 75.93 | 75.50 | 75.84 | 75.72 | 40.49 |
| 16 | 47.20 | 65.80 | 72.21 | 75.12 | 77.13 | 75.67 | 76.87 | 54.77 | 21.42 |
| 32 | 49.68 | 66.27 | 72.47 | 76.66 | 78.50 | 77.94 | 77.04 | 45.28 | 13.85 |
| 64 | 51.52 | 67.04 | 74.13 | 77.81 | 79.31 | 78.37 | 57.46 | 23.56 | 8.85 |
| 128 | 53.53 | 68.06 | 75.25 | 79.35 | 80.72 | 77.85 | 46.81 | 10.65 | 2.05 |

ViT-Tiny RAC OfficeHome

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 44.38 | 63.75 | 70.54 | 73.54 | 75.29 | 76.14 | 75.16 | 70.97 | 54.85 |
| 4 | 44.76 | 64.04 | 70.71 | 74.01 | 76.36 | 74.22 | 72.85 | 66.44 | 51.69 |
| 8 | 45.83 | 65.07 | 71.65 | 74.78 | 76.31 | 77.04 | 75.97 | 64.73 | 40.49 |
| 16 | 47.29 | 65.71 | 72.55 | 75.29 | 77.94 | 77.47 | 75.46 | 67.38 | 21.42 |
| 32 | 49.64 | 66.10 | 72.60 | 77.47 | 79.26 | 78.79 | 75.07 | 45.28 | 13.85 |
| 64 | 51.60 | 67.12 | 74.39 | 78.32 | 79.91 | 75.33 | 57.46 | 23.56 | 9.49 |
| 128 | 53.53 | 67.76 | 75.72 | 79.26 | 80.63 | 76.74 | 46.81 | 10.65 | 2.05 |

ViT-Tiny cLA OfficeHome

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 44.04 | 64.04 | 70.12 | 73.45 | 75.89 | 75.72 | 74.35 | 73.19 | 30.14 |
| 4 | 46.09 | 64.86 | 70.84 | 74.90 | 76.53 | 75.37 | 74.99 | 54.81 | 2.01 |
| 8 | 47.50 | 65.16 | 72.12 | 75.16 | 76.57 | 76.87 | 75.25 | 45.10 | 2.86 |
| 16 | 50.75 | 65.63 | 72.98 | 77.00 | 78.37 | 77.17 | 59.30 | 36.55 | 3.42 |
| 32 | 52.93 | 66.99 | 74.39 | 77.85 | 76.83 | 76.83 | 51.13 | 12.61 | 2.35 |
| 64 | 55.96 | 68.53 | 75.67 | 79.14 | 79.26 | 63.06 | 35.49 | 4.02 | 2.69 |
| 128 | 61.22 | 71.95 | 77.34 | 79.78 | 77.77 | 53.91 | 17.44 | 3.42 | 3.42 |

ViT-Tiny $c^3$LA OfficeHome

| Rank/LR | 1e-5 | 1e-4.5 | 1e-4 | 1e-3.5 | 1e-3 | 1e-2.5 | 1e-2 | 1e-1.5 | 1e-1 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 45.02 | 64.51 | 70.29 | 73.88 | 76.36 | 75.93 | 73.07 | 57.46 | 30.14 |
| 4 | 46.30 | 64.94 | 70.80 | 74.31 | 77.17 | 76.19 | 66.65 | 54.81 | 2.01 |
| 8 | 49.21 | 65.37 | 72.68 | 75.93 | 77.51 | 72.60 | 64.73 | 45.10 | 3.04 |
| 16 | 51.13 | 66.44 | 73.15 | 77.55 | 78.71 | 71.23 | 59.30 | 36.55 | 2.48 |
| 32 | 53.31 | 67.55 | 74.22 | 79.26 | 79.22 | 67.38 | 51.13 | 12.01 | 2.35 |
| 64 | 56.18 | 69.22 | 76.06 | 79.52 | 78.24 | 63.06 | 35.49 | 6.50 | 3.59 |
| 128 | 61.18 | 72.00 | 77.38 | 79.91 | 78.45 | 53.91 | 17.44 | 8.38 | 3.42 |

Table 10: Test accuracies of fine-tuning DeBERTa v3 on MRPC and TREC-50 over varying scaling factors (columns), ranks (rows), and LoRA PEFT methods. The standard baseline $2r$ was often the best, and asymmetric methods preferred higher scaling factors.

DeBERTa v3 LoRA MRPC

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 87.2 | 88.3 | 88.5 | 88.1 | 87.4 |
| 8 | 86.9 | 86.1 | 89.2 | 87.0 | 66.5 |
| 16 | 87.8 | 88.9 | 89.1 | 66.5 | 66.5 |

DeBERTa v3 CoLA MRPC

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 87.8 | 88.9 | 88.5 | 89.2 | 87.1 |
| 8 | 89.6 | 87.4 | 88.7 | 87.2 | 86.3 |
| 16 | 89.2 | 87.6 | 86.9 | 87.2 | 66.5 |

DeBERTa v3 Asym MRPC

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 75.5 | 79.9 | 80.4 | 85.0 | 84.2 |
| 8 | 76.7 | 82.1 | 83.6 | 86.1 | 86.9 |
| 16 | 79.2 | 81.4 | 84.8 | 84.8 | 86.1 |

DeBERTa v3 RAC MRPC

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 75.6 | 79.4 | 82.2 | 85.0 | 85.7 |
| 8 | 77.8 | 81.6 | 84.6 | 85.7 | 87.2 |
| 16 | 80.4 | 84.5 | 85.0 | 85.6 | 87.0 |

DeBERTa v3 cLA MRPC

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 86.2 | 86.0 | 86.3 | 86.4 | 86.4 |
| 8 | 86.6 | 84.8 | 85.4 | 85.5 | 85.9 |
| 16 | 86.8 | 86.9 | 86.2 | 86.2 | 86.4 |

DeBERTa v3 $c^3$LA MRPC

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 79.3 | 83.3 | 86.5 | 88.5 | 86.1 |
| 8 | 78.3 | 84.9 | 86.9 | 87.6 | 86.9 |
| 16 | 85.0 | 85.7 | 87.3 | 85.8 | 66.5 |

DeBERTa v3 LoRA TREC-50

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 88.9 | 89.7 | 90.7 | 83.1 | 90.3 |
| 8 | 88.7 | 90.7 | 91.3 | 85.3 | 75.4 |
| 16 | 91.1 | 91.5 | 90.7 | 88.5 | 10.9 |

DeBERTa v3 CoLA TREC-50

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 92.1 | 91.9 | 92.5 | 90.7 | 91.1 |
| 8 | 91.7 | 89.7 | 90.9 | 90.3 | 85.5 |
| 16 | 91.9 | 92.3 | 86.1 | 87.3 | 10.9 |

DeBERTa v3 Asym TREC-50

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 79.8 | 84.7 | 87.7 | 90.5 | 89.9 |
| 8 | 82.9 | 87.7 | 84.5 | 89.3 | 90.7 |
| 16 | 89.3 | 86.7 | 90.7 | 91.3 | 89.7 |

DeBERTa v3 RAC TREC-50

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 60.1 | 72.6 | 81.0 | 85.5 | 88.7 |
| 8 | 75.2 | 81.5 | 85.7 | 88.1 | 89.7 |
| 16 | 83.1 | 86.3 | 87.7 | 90.3 | 78.0 |

DeBERTa v3 cLA TREC-50

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 57.9 | 74.8 | 80.0 | 82.9 | 86.3 |
| 8 | 73.6 | 76.6 | 83.7 | 82.5 | 87.3 |
| 16 | 79.6 | 80.2 | 87.9 | 88.1 | 86.1 |

DeBERTa v3 $c^3$LA TREC-50

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ | $4r$ |
|---|---|---|---|---|---|
| 4 | 73.8 | 81.5 | 83.1 | 89.3 | 88.7 |
| 8 | 78.2 | 82.3 | 83.9 | 84.7 | 81.7 |
| 16 | 83.1 | 85.5 | 85.3 | 87.5 | 86.3 |

Table 11: Test accuracies obtained by fine-tuning ViT-Tiny on OfficeHome and CIFAR-10 over varying scaling factors (columns), ranks (rows), and LoRA PEFT methods. The standard baseline $2r$ was often the best, and asymmetric methods preferred higher scaling factors.

ViT-Tiny LoRA CIFAR-10

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 93.9 | 94.1 | 94.0 | 94.0 |
| 8 | 94.8 | 95.7 | 95.7 | 95.8 |
| 16 | 95.8 | 96.1 | 96.1 | 95.2 |

ViT-Tiny CoLA CIFAR-10

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 94.3 | 94.5 | 94.5 | 95.3 |
| 8 | 94.7 | 94.9 | 95.3 | 95.1 |
| 16 | 95.0 | 95.5 | 95.7 | 96.2 |

ViT-Tiny Asym CIFAR-10

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 91.7 | 92.2 | 92.8 | 92.4 |
| 8 | 92.6 | 93.1 | 93.7 | 94. |
| 16 | 93.1 | 94.0 | 94.4 | 94.4 |

ViT-Tiny RAC CIFAR-10

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 91.8 | 92.6 | 93.2 | 94.2 |
| 8 | 92.8 | 93.4 | 94.0 | 94.8 |
| 16 | 93.6 | 94.3 | 94.8 | 95.6 |

ViT-Tiny cLA CIFAR-10

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 92.0 | 92.6 | 93.1 | 93.5 |
| 8 | 93.4 | 93.4 | 93.5 | 93.4 |
| 16 | 94.3 | 94.5 | 94.5 | 94.5 |

ViT-Tiny $c^3$LA CIFAR-10

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 92.7 | 93.4 | 93.4 | 94.4 |
| 8 | 93.8 | 94.4 | 94.6 | 94.8 |
| 16 | 94.8 | 95.3 | 95.2 | 95.3 |

ViT-Tiny LoRA OfficeHome

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 76.8 | 77.1 | 77.8 | 77.9 |
| 8 | 76.9 | 77.9 | 78.5 | 79.4 |
| 16 | 77.9 | 78.4 | 79.2 | 79.4 |

ViT-Tiny CoLA OfficeHome

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 75.5 | 75.9 | 76.6 | 77.8 |
| 8 | 76.0 | 76.4 | 77.4 | 79.6 |
| 16 | 76.3 | 77.2 | 78.2 | 79.4 |

ViT-Tiny Asym OfficeHome

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 74.0 | 74.5 | 75.2 | 75.6 |
| 8 | 74.5 | 75.1 | 75.6 | 76.2 |
| 16 | 75.2 | 75.9 | 76.4 | 76.9 |

ViT-Tiny RAC OfficeHome

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 74.3 | 74.6 | 75.6 | 76.0 |
| 8 | 74.8 | 75.2 | 75.9 | 77.1 |
| 16 | 75.2 | 75.5 | 76.4 | 77.7 |

ViT-Tiny cLA OfficeHome

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 75.3 | 75.9 | 76.5 | 76.5 |
| 8 | 76.3 | 76.5 | 76.6 | 76.5 |
| 16 | 76.4 | 76.9 | 77.3 | 78.4 |

ViT-Tiny $c^3$LA OfficeHome

| Rank/$\alpha$ | $\frac{r}{4}$ | $\frac{r}{2}$ | $r$ | $2r$ |
|---|---|---|---|---|
| 4 | 75.6 | 75.9 | 76.3 | 77.3 |
| 8 | 76.4 | 77.0 | 76.9 | 77.5 |
| 16 | 76.6 | 77.7 | 78.4 | 78.5 |

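Table 12: Test accuracies obtained by fine-tuning DeBERTa v3 on MRPC, CoLA, TREC-50, and RTE using chain LoRA methods CoLA, RAC, and $c^3$LA over varying ranks and chain reset frequencies. We do not observe a clear correlation between the optimal chain reset frequency and rank.

DeBERTa v3 MRPC (columns: chain reset frequency)

| Variant | Rank | 1 | 2 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| CoLA | 4 | 88.0 | 86.8 | 89.2 | 88.1 | 86.7 | 86.7 |
| CoLA | 8 | 87.8 | 88.0 | 87.2 | 87.2 | 86.7 | 87.2 |
| CoLA | 16 | 66.5 | 87.2 | 87.2 | 87.2 | 87.2 | 87.2 |
| RAC | 4 | 68.3 | 77.4 | 85.0 | 85.7 | 85.7 | 86.6 |
| RAC | 8 | 68.1 | 82.0 | 85.7 | 86.4 | 85.7 | 85.6 |
| RAC | 16 | 69.1 | 84.2 | 85.6 | 86.1 | 86.5 | 86.3 |
| $c^3$LA | 4 | 84.8 | 86.7 | 87.2 | 85.2 | 85.8 | 85.2 |
| $c^3$LA | 8 | 85.2 | 87.7 | 86.6 | 86.7 | 85.3 | 86.9 |
| $c^3$LA | 16 | 87.6 | 87.0 | 86.7 | 86.6 | 86.6 | 87.6 |

DeBERTa v3 TREC-50 (columns: chain reset frequency)

| Variant | Rank | 1 | 2 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| CoLA | 4 | 91.3 | 91.1 | 89.9 | 88.5 | 90.5 | 91.3 |
| CoLA | 8 | 92.7 | 91.1 | 85.3 | 10.9 | 92.7 | 90.5 |
| CoLA | 16 | 10.9 | 10.9 | 93.1 | 91.7 | 92.1 | 65.1 |
| RAC | 4 | 84.3 | 84.1 | 85.5 | 84.1 | 86.3 | 86.5 |
| RAC | 8 | 88.3 | 88.5 | 88.1 | 88.7 | 87.7 | 88.9 |
| RAC | 16 | 87.7 | 91.5 | 90.3 | 89.9 | 89.9 | 88.9 |
| $c^3$LA | 4 | 86.3 | 88.1 | 89.3 | 85.9 | 88.9 | 88.9 |
| $c^3$LA | 8 | 86.1 | 89.3 | 84.7 | 83.7 | 86.1 | 90.7 |
| $c^3$LA | 16 | 89.7 | 90.5 | 87.5 | 91.1 | 87.3 | 88.1 |

DeBERTa v3 CoLA (columns: chain reset frequency)

| Variant | Rank | 1 | 2 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| CoLA | 4 | 86.9 | 86.5 | 86.2 | 86.6 | 87.1 | 86.7 |
| CoLA | 8 | 85.5 | 85.1 | 85.1 | 85.1 | 85.1 | 85.1 |
| CoLA | 16 | 84.2 | 69.1 | 69.1 | 69.1 | 69.1 | 69.1 |
| RAC | 4 | 87.0 | 86.7 | 87.7 | 87.4 | 88.0 | 87.7 |
| RAC | 8 | 87.5 | 87.8 | 87.8 | 87.5 | 86.6 | 86.6 |
| RAC | 16 | 86.7 | 86.9 | 87.3 | 87.0 | 87.0 | 87.6 |
| $c^3$LA | 4 | 86.4 | 86.6 | 86.1 | 85.8 | 86.0 | 86.3 |
| $c^3$LA | 8 | 86.0 | 86.1 | 86.1 | 86.2 | 86.3 | 86.3 |
| $c^3$LA | 16 | 86.2 | 85.7 | 86.3 | 85.6 | 85.4 | 86.7 |

DeBERTa v3 RTE (columns: chain reset frequency)

| Variant | Rank | 1 | 2 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| CoLA | 4 | 82.9 | 84.4 | 85.1 | 83.7 | 85.4 | 86.2 |
| CoLA | 8 | 88.2 | 84.6 | 84.8 | 87.1 | 87.1 | 86.7 |
| CoLA | 16 | 85.1 | 52.6 | 81.4 | 84.8 | 84.3 | 73.5 |
| RAC | 4 | 82.3 | 83.0 | 81.6 | 82.1 | 82.4 | 82.4 |
| RAC | 8 | 85.5 | 86.8 | 86.4 | 87.5 | 87.5 | 87.5 |
| RAC | 16 | 84.2 | 84.4 | 84.1 | 83.5 | 83.7 | 84.3 |
| $c^3$LA | 4 | 79.0 | 77.9 | 72.6 | 71.9 | 74.0 | 72.4 |
| $c^3$LA | 8 | 80.0 | 80.3 | 76.6 | 73.9 | 76.2 | 75.4 |
| $c^3$LA | 16 | 85.0 | 82.5 | 83.6 | 83.0 | 82.9 | 82.4 |

Table 13: Test accuracies obtained by fine-tuning ViT-Tiny on OfficeHome and CIFAR-10 using chain LoRA methods CoLA, RAC, and $c^3$LA over varying ranks and chain reset frequencies.

ViT-Tiny OfficeHome (columns: chain reset frequency)

| Variant | Rank | 1 | 2 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| CoLA | 4 | 76.4 | 76.5 | 77.2 | 77.2 | 77.6 | 77.8 |
| CoLA | 8 | 77.6 | 77.1 | 78.3 | 77.3 | 78.7 | 79.6 |
| CoLA | 16 | 77.9 | 77.9 | 78.8 | 78.6 | 79.4 | 79.1 |
| RAC | 4 | 75.5 | 75.8 | 76.0 | 75.7 | 75.7 | 75.7 |
| RAC | 8 | 77.1 | 76.1 | 76.3 | 76.6 | 76.1 | 76.4 |
| RAC | 16 | 77.4 | 77.7 | 77.7 | 77.0 | 77.3 | 77.0 |
| $c^3$LA | 4 | 76.7 | 77.3 | 77.3 | 76.3 | 76.5 | 76.6 |
| $c^3$LA | 8 | 77.5 | 76.9 | 77.2 | 76.7 | 76.8 | 76.8 |
| $c^3$LA | 16 | 77.5 | 78.1 | 78.4 | 78.5 | 78.1 | 78.4 |

ViT-Tiny CIFAR-10 (columns: chain reset frequency)

| Variant | Rank | 1 | 2 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| CoLA | 4 | 94.5 | 94.8 | 94.7 | 95.3 | 94.0 | 94.0 |
| CoLA | 8 | 95.1 | 95.1 | 95.3 | 94.9 | 94.7 | 94.5 |
| CoLA | 16 | 95.5 | 95.5 | 95.7 | 96.0 | 96.0 | 96.2 |
| RAC | 4 | 94.2 | 94.0 | 94.0 | 93.4 | 92.4 | 92.5 |
| RAC | 8 | 94.5 | 94.8 | 94.5 | 94.2 | 94.1 | 94.0 |
| RAC | 16 | 95.6 | 95.2 | 95.3 | 95.1 | 95.0 | 94.3 |
| $c^3$LA | 4 | 94.4 | 94.2 | 94.0 | 93.9 | 92.7 | 92.7 |
| $c^3$LA | 8 | 93.6 | 94.2 | 94.8 | 94.5 | 93.4 | 93.4 |
| $c^3$LA | 16 | 94.0 | 93.6 | 95.3 | 95.1 | 94.8 | 95.0 |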

### E.3 Computational Cost, Memory, and Efficiency

In this section, we discuss the computational cost, peak memory, and throughput of our sparsity-induced LoRA variants. We also contrast these metrics with PaCA and its variants. We start by describing our naïve sparse implementation.

#### E.3.1 Naïve sparse implementation

To highlight the computational benefit of the inherently sparse structure of cLA, $c^3$LA, and their random variants, we introduce the PEFT methods s-cLA and s-r-$c^3$LA. The only difference is in how we compute the forward pass for each adapted layer. Consider a single layer $W^i \in \mathbb{R}^{n_i \times m_i}$ adapted via r-cLA to $W^i + BA$, where $B \in \mathbb{R}^{n_i \times r}$ and $A$ is constructed as follows: sample $r$ random numbers without replacement from $[m_i]$, $\{c_1, c_2, \cdots, c_r\} \subset [m_i]$. For row $j$ in $A$, the $c_j^{\rm th}$ element is one, and all other elements are zero; thus $Ax = [x_{c_1}, \cdots, x_{c_r}]^\top$. For cLA, choose $\{c_1, \cdots, c_r\} = \{1, \cdots, r\}$.
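The construction of $A$ can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names `build_selection_matrix` and `matvec`, the seed handling, and the toy sizes are hypothetical, and indices are 0-based, so the cLA choice $\{1, \cdots, r\}$ becomes `0..r-1`.

```python
import random


def build_selection_matrix(m, r, randomized, seed=0):
    """Return (A, cols): A is r x m with a single 1 in row j at column cols[j]."""
    if randomized:
        # r-cLA: sample r distinct column indices without replacement.
        cols = random.Random(seed).sample(range(m), r)
    else:
        # cLA: the first r coordinates.
        cols = list(range(r))
    A = [[1.0 if k == c else 0.0 for k in range(m)] for c in cols]
    return A, cols


def matvec(A, x):
    """Dense matrix-vector product A @ x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


m, r = 8, 3                                   # hypothetical toy sizes
x = [float(10 + k) for k in range(m)]
A, cols = build_selection_matrix(m, r, randomized=True)
dense = matvec(A, x)                          # A x computed as a dense product
gathered = [x[c] for c in cols]               # [x_{c_1}, ..., x_{c_r}]
print(cols, dense, gathered)
```

The dense product and the directly indexed entries agree, which is the identity $Ax = [x_{c_1}, \cdots, x_{c_r}]^\top$ stated above.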

In each forward pass, we compute $(W^i)(x) + B(A(x))$, where $x \in \mathbb{R}^{m_i}$. For cLA, we directly calculate $A(x)$ as a general matrix-matrix multiplication (GEMM), resulting in $r(2m_i - 1)$ floating point operations (FLOPs). For s-cLA, we store the columns $\{c_1, \cdots, c_r\}$ in memory when $A$ is constructed, then directly construct $[x_{c_1}, \cdots, x_{c_r}]^\top$. See Figure [7](https://arxiv.org/html/2606.13767#A5.F7) for a comparative visualization of the two methods.
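The $r(2m_i - 1)$ count follows from each of the $r$ output entries needing $m_i$ multiplications and $m_i - 1$ additions. The sketch below counts these operations explicitly for a toy selection matrix and compares the result with an index-based gather; all names and sizes are hypothetical and chosen only for illustration.

```python
def dense_selection_flops(r, m):
    """FLOPs of a dense (r x m) @ (m,) product: m multiplies + (m - 1) adds per row."""
    return r * (2 * m - 1)


def counted_matvec(A, x):
    """Dense matrix-vector product that also counts multiplies and adds."""
    out, flops = [], 0
    for row in A:
        acc = row[0] * x[0]
        flops += 1                      # first multiply
        for a, xi in zip(row[1:], x[1:]):
            acc += a * xi
            flops += 2                  # one multiply, one add
        out.append(acc)
    return out, flops


def gather(x, cols):
    """Index-based alternative: no arithmetic, only the r stored indices."""
    return [x[c] for c in cols]


m, cols = 5, [3, 1]                     # hypothetical toy sizes
A = [[1.0 if k == c else 0.0 for k in range(m)] for c in cols]
x = [0.5, -1.0, 2.0, 4.0, 8.0]
dense_out, dense_flops = counted_matvec(A, x)
print(dense_out, gather(x, cols), dense_flops)
```

For a hypothetical input width of $m_i = 768$ and $r = 16$, `dense_selection_flops(16, 768)` gives 24560 FLOPs per input per adapted layer, all of which the gather avoids at the cost of storing $r$ indices.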

#### E\.3\.2Experiments

To showcase the benefit of our sparse implementations, we fine-tune ViT-Tiny and ViT-Base on the OfficeHome and CIFAR-10 datasets, and RoBERTa-Base and RoBERTa-Large on the MRPC and CoLA datasets using full fine-tuning, some base LoRA variants, the sparsity-induced variants and their more optimized counterparts, and PaCA with rank $r = 16$ for $30$ epochs, averaged over three seeds on a single NVIDIA H100 PCIe GPU. To align our experiments with those done in PaCA [[64](https://arxiv.org/html/2606.13767#bib.bib84)], we adapt all layers of the models and fully fine-tune the classification heads, then report the throughput, runtime, and trainable parameters in Table [14](https://arxiv.org/html/2606.13767#A5.T14). To align with the other experiments in our paper, as well as LoRA [[27](https://arxiv.org/html/2606.13767#bib.bib2)], we adapt only the query and value matrices of each attention head, fully fine-tune the classification heads of the models, and report those results in Table [15](https://arxiv.org/html/2606.13767#A5.T15).

Discussion of experiments. When adapting all layers of the models, PaCA is substantially faster than LoRA. In Table [14](https://arxiv.org/html/2606.13767#A5.T14), fine-tuning via PaCA is often 24 to 25% faster than LoRA in runtime normalized to LoRA's, and it reduces peak GPU memory by 15-38%, while the sparsity-optimized LoRA variants (columns s-cLA and s-r-$c^3$LA) are 10 to 15% faster and reduce peak GPU memory by 15-40% compared to LoRA. However, due to the overhead of adapting this many layers, FFT was consistently faster than all other methods. In contrast, when adapting far fewer of the model's layers, full fine-tuning was far less competitive. In Table [15](https://arxiv.org/html/2606.13767#A5.T15), PaCA is often the fastest method, at around 10-11% faster than LoRA, while often reducing peak GPU memory the most, by 5-15%; the sparsity-optimized LoRA variants are generally 8-10% faster than LoRA and reduce peak GPU memory almost as much as PaCA, also by 5-15%.
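The parenthesized entries in Table 14 are values normalized to the LoRA column. A small sketch of that arithmetic, using the ViT-Base CIFAR-10 throughput and runtime entries for LoRA and PaCA as printed in Table 14; the helper names are hypothetical.

```python
def normalize_to_baseline(value, baseline):
    """Ratio of a method's metric to the LoRA baseline (LoRA itself is 1.000)."""
    return value / baseline


def percent_runtime_reduction(runtime, baseline_runtime):
    """How much shorter a run is than the baseline, in percent."""
    return 100.0 * (1.0 - runtime / baseline_runtime)


# ViT-Base on CIFAR-10, values as printed in Table 14.
lora_throughput, paca_throughput = 690.3, 930.4    # samples/s
lora_runtime, paca_runtime = 35.1, 26.3            # minutes

throughput_ratio = normalize_to_baseline(paca_throughput, lora_throughput)
runtime_reduction = percent_runtime_reduction(paca_runtime, lora_runtime)
print(f"{throughput_ratio:.3f}", f"{runtime_reduction:.1f}%")
```

Because the table reports rounded values, a ratio recomputed from the printed entries can differ slightly in the last digit from the parenthesized number.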

Key Takeaway. Given the minimal accuracy gaps between PEFT methods, both when adapting all of the layers (Table [4](https://arxiv.org/html/2606.13767#A2.T4)) and when adapting only the query, value, and classification head (Table [5](https://arxiv.org/html/2606.13767#A2.T5)), it is optimal from a speed perspective to use sparsity-induced LoRA variants (and thus their PaCA counterparts) for fine-tuning models, as shown by Table [15](https://arxiv.org/html/2606.13767#A5.T15). However, this holds only if the chosen LoRA rank is sufficiently high to ensure the model can adapt to the dataset. For example, when adapting DeepSeekCoder to the DJANGO dataset, we observe in Table [2](https://arxiv.org/html/2606.13767#S3.T2) that $r = 16$ was insufficient for the sparsity-induced methods (the future) to perform comparably to their non-sparse counterparts (the present). This difference disappears as the rank is increased from $r = 16$ to $r = 64$; see Figure [6](https://arxiv.org/html/2606.13767#A5.F6).

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/Sparse_Figure_Final.png)

Figure 7: A comparative visualization of the general matrix-matrix multiplication in r-cLA (left) and the gather (indexing) operation used in s-r-cLA (right). The gather operation [[19](https://arxiv.org/html/2606.13767#bib.bib98)] saves $r(2m_i - 1)$ FLOPs for each input $x$ in each adapted layer of the forward pass, at the cost of a minor memory overhead.

Table 14: Throughput, runtime, trainable parameter count, and peak allocated GPU memory for fine-tuning various text and vision models on various datasets using various PEFT methods, including LoRA, PaCA, and their connecting variants, averaged over three seeds (0, 1, 2). We adapt the token and map embeddings, query, key, value, and output matrices of the attention layers, and both fully connected layers. PaCA reduces peak GPU memory when training ViT-Base on CIFAR-10 by 31% compared to training via LoRA. This is similar to the results in PaCA [[64](https://arxiv.org/html/2606.13767#bib.bib84)]. While not specifically designed for reducing peak GPU memory, our sparsity-induced LoRA variants produce similar results, reducing the peak GPU memory by 33% compared to LoRA.

DatatypeModelDatasetLoRA VariantsSparsity\-Induced VariantsPaCA VariantsFFTLoRACoLAAsymr\-cLAr\-c3LAs\-cLAs\-r\-c3LAPaCAC\-PaCAThroughput\(Samples\\s\)ViT\-TinyOfficeHome344\.4 \(1\.068\)322\.3 \(1\.000\)343\.7 \(1\.066\)345\.3 \(1\.071\)346\.8 \(1\.076\)344\.5 \(1\.069\)348\.6 \(1\.081\)336\.7 \(1\.044\)351\.2 \(1\.090\)353\.0 \(1\.095\)CIFAR\-101064\.2 \(1\.502\)708\.3 \(1\.000\)713\.5 \(1\.007\)713\.7 \(1\.008\)737\.3 \(1\.041\)747\.2 \(1\.055\)782\.2 \(1\.104\)812\.2 \(1\.147\)939\.2 \(1\.326\)944\.0 \(1\.333\)ViT\-BaseOfficeHome362\.2 \(1\.081\)335\.0 \(1\.000\)335\.5 \(1\.001\)343\.5 \(1\.025\)341\.6 \(1\.020\)345\.9 \(1\.032\)349\.0 \(1\.042\)346\.7 \(1\.035\)358\.0 \(1\.069\)354\.2 \(1\.057\)CIFAR\-101056\.3 \(1\.530\)690\.3 \(1\.000\)689\.0 \(0\.998\)703\.9 \(1\.020\)750\.1 \(1\.087\)744\.2 
\(1\.078\)788\.4 \(1\.142\)769\.7 \(1\.115\)930\.4 \(1\.348\)859\.2 \(1\.245\)RoBERTa\-BaseMRPC808\.4 \(1\.636\)494\.2 \(1\.000\)493\.1 \(0\.998\)546\.7 \(1\.106\)554\.3 \(1\.121\)556\.3 \(1\.126\)588\.0 \(1\.190\)585\.5 \(1\.185\)648\.1 \(1\.311\)643\.7 \(1\.302\)CoLA817\.6 \(1\.710\)478\.0 \(1\.000\)498\.8 \(1\.043\)554\.3 \(1\.160\)559\.1 \(1\.169\)556\.1 \(1\.163\)587\.9 \(1\.230\)595\.2 \(1\.245\)632\.4 \(1\.323\)648\.5 \(1\.357\)RoBERTa\-LargeMRPC391\.3 \(1\.536\)254\.8 \(1\.000\)254\.1 \(0\.997\)270\.0 \(1\.060\)271\.7 \(1\.066\)283\.6 \(1\.113\)305\.6 \(1\.199\)305\.4 \(1\.199\)336\.7 \(1\.322\)332\.0 \(1\.303\)CoLA394\.3 \(1\.533\)257\.2 \(1\.000\)255\.4 \(0\.993\)287\.5 \(1\.118\)288\.3 \(1\.121\)288\.2 \(1\.121\)306\.4 \(1\.192\)309\.3 \(1\.203\)342\.1 \(1\.330\)343\.8 \(1\.337\)Runtime\(Minutes\)ViT\-TinyOfficeHome53\.2 \(0\.933\)57\.1 \(1\.000\)53\.3 \(0\.935\)52\.9 \(0\.927\)52\.9 \(0\.927\)53\.0 \(0\.928\)52\.4 \(0\.919\)54\.5 \(0\.954\)52\.7 \(0\.924\)51\.9 \(0\.910\)CIFAR\-1023\.1 \(0\.669\)34\.5 \(1\.000\)34\.5 \(0\.999\)34\.6 \(1\.003\)32\.8 \(0\.951\)32\.9 \(0\.952\)31\.5 \(0\.913\)30\.4 \(0\.880\)26\.1 \(0\.757\)26\.0 \(0\.753\)ViT\-BaseOfficeHome50\.6 \(0\.932\)54\.3 \(1\.000\)54\.2 \(0\.999\)53\.1 \(0\.978\)53\.2 \(0\.980\)52\.8 \(0\.972\)52\.3 \(0\.963\)52\.6 \(0\.970\)50\.9 \(0\.937\)51\.4 \(0\.947\)CIFAR\-1023\.3 \(0\.664\)35\.1 \(1\.000\)35\.4 \(1\.008\)34\.5 \(0\.983\)32\.3 \(0\.921\)32\.7 \(0\.932\)31\.1 \(0\.886\)31\.7 \(0\.903\)26\.3 \(0\.751\)28\.7 \(0\.817\)RoBERTa\-BaseMRPC2\.5 \(0\.619\)4\.0 \(1\.000\)4\.0 \(1\.000\)3\.6 \(0\.907\)3\.6 \(0\.905\)3\.6 \(0\.901\)3\.4 \(0\.849\)3\.4 \(0\.854\)3\.1 \(0\.768\)3\.1 \(0\.769\)CoLA5\.5 \(0\.588\)9\.3 \(1\.000\)8\.9 \(0\.964\)8\.0 \(0\.867\)8\.1 \(0\.873\)8\.0 \(0\.864\)7\.7 \(0\.826\)7\.5 \(0\.812\)7\.0 \(0\.760\)6\.9 \(0\.749\)RoBERTa\-LargeMRPC5\.1 \(0\.652\)7\.8 \(1\.000\)7\.8 \(1\.003\)7\.4 \(0\.954\)7\.4 \(0\.949\)7\.1 \(0\.906\)6\.6 \(0\.842\)6\.6 \(0\.844\)5\.9 \(0\.759\)6\.0 
\(0\.766\)CoLA11\.3 \(0\.653\)17\.3 \(1\.000\)17\.4 \(1\.004\)15\.6 \(0\.898\)15\.5 \(0\.895\)15\.6 \(0\.898\)14\.5 \(0\.839\)14\.5 \(0\.837\)13\.0 \(0\.752\)13\.0 \(0\.750\)TrainableParametersViT\-TinyOfficeHome5\.7​e65\.7e^\{6\}\(6\.675\)8\.6​e58\.6e^\{5\}\(1\.000\)8\.6​e58\.6e^\{5\}\(1\.000\)5\.2​e55\.2e^\{5\}\(0\.613\)5\.2​e55\.2e^\{5\}\(0\.613\)5\.2​e55\.2e^\{5\}\(0\.613\)5\.2​e55\.2e^\{5\}\(0\.613\)5\.2​e55\.2e^\{5\}\(0\.613\)5\.2​e55\.2e^\{5\}\(0\.613\)5\.2​e55\.2e^\{5\}\(0\.613\)CIFAR\-105\.5​e65\.5e^\{6\}\(8\.304\)6\.7​e56\.7e^\{5\}\(1\.000\)6\.7​e56\.7e^\{5\}\(1\.000\)3\.3​e53\.3e^\{5\}\(0\.501\)3\.3​e53\.3e^\{5\}\(0\.501\)3\.3​e53\.3e^\{5\}\(0\.501\)3\.3​e53\.3e^\{5\}\(0\.501\)3\.3​e53\.3e^\{5\}\(0\.501\)3\.3​e53\.3e^\{5\}\(0\.501\)3\.3​e53\.3e^\{5\}\(0\.501\)ViT\-BaseOfficeHome8\.7​e78\.7e^\{7\}\(25\.288\)3\.4​e63\.4e^\{6\}\(1\.000\)3\.4​e63\.4e^\{6\}\(1\.000\)2\.1​e62\.1e^\{6\}\(0\.612\)2\.1​e62\.1e^\{6\}\(0\.612\)2\.1​e62\.1e^\{6\}\(0\.612\)2\.1​e62\.1e^\{6\}\(0\.612\)2\.1​e62\.1e^\{6\}\(0\.612\)2\.1​e62\.1e^\{6\}\(0\.612\)2\.1​e62\.1e^\{6\}\(0\.612\)CIFAR\-108\.6​e78\.6e^\{7\}\(32\.235\)2\.7​e62\.7e^\{6\}\(1\.000\)2\.7​e62\.7e^\{6\}\(1\.000\)1\.3​e61\.3e^\{6\}\(0\.501\)1\.3​e61\.3e^\{6\}\(0\.501\)1\.3​e61\.3e^\{6\}\(0\.501\)1\.3​e61\.3e^\{6\}\(0\.501\)1\.3​e61\.3e^\{6\}\(0\.501\)1\.3​e61\.3e^\{6\}\(0\.501\)1\.3​e61\.3e^\{6\}\(0\.501\)RoBERTa\-BaseMRPC1\.2​e81\.2e^\{8\}\(38\.396\)3\.2​e63\.2e^\{6\}\(1\.000\)3\.2​e63\.2e^\{6\}\(1\.000\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)CoLA1\.2​e81\.2e^\{8\}\(38\.396\)3\.2​e63\.2e^\{6\}\(1\.000\)3\.2​e63\.2e^\{6\}\(1\.000\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)1\.9​e61\.9e^\{6\}\(0\.591\)RoBERTa\-LargeMRPC3\.6​e83\.6e^\{8\}\(43\.712\
)8\.1​e68\.1e^\{6\}\(1\.000\)8\.1​e68\.1e^\{6\}\(1\.000\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)CoLA3\.6​e83\.6e^\{8\}\(43\.712\)8\.1​e68\.1e^\{6\}\(1\.000\)8\.1​e68\.1e^\{6\}\(1\.000\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)4\.6​e64\.6e^\{6\}\(0\.565\)Peak GPUMemory \(GB\)ViT\-TinyOfficeHome2\.22 \(1\.015\)2\.19 \(1\.000\)2\.19 \(1\.000\)1\.72 \(0\.786\)1\.72 \(0\.786\)1\.72 \(0\.786\)1\.68 \(0\.769\)1\.68 \(0\.769\)1\.68 \(0\.770\)1\.68 \(0\.770\)CIFAR\-102\.29 \(1\.060\)2\.16 \(1\.000\)2\.16 \(1\.000\)1\.59 \(0\.738\)1\.59 \(0\.738\)1\.59 \(0\.738\)1\.57 \(0\.728\)1\.57 \(0\.729\)1\.60 \(0\.744\)1\.60 \(0\.744\)ViT\-BaseOfficeHome7\.29 \(0\.887\)8\.22 \(1\.000\)8\.22 \(1\.000\)5\.02 \(0\.611\)5\.02 \(0\.611\)5\.02 \(0\.611\)4\.88 \(0\.594\)4\.88 \(0\.594\)5\.15 \(0\.627\)5\.15 \(0\.627\)CIFAR\-107\.05 \(1\.061\)6\.65 \(1\.000\)6\.65 \(1\.000\)4\.49 \(0\.676\)4\.49 \(0\.676\)4\.49 \(0\.676\)4\.48 \(0\.674\)4\.48 \(0\.674\)4\.61 \(0\.694\)4\.62 \(0\.695\)RoBERTa\-BaseMRPC3\.82 \(1\.433\)2\.66 \(1\.000\)2\.76 \(1\.036\)2\.22 \(0\.835\)2\.22 \(0\.835\)2\.22 \(0\.835\)2\.20 \(0\.827\)2\.20 \(0\.827\)2\.08 \(0\.783\)2\.08 \(0\.783\)CoLA3\.34 \(1\.632\)2\.05 \(1\.000\)2\.05 \(1\.000\)1\.89 \(0\.924\)1\.80 \(0\.877\)1\.80 \(0\.877\)1\.77 \(0\.866\)1\.82 \(0\.891\)1\.73 \(0\.847\)1\.64 \(0\.800\)RoBERTa\-LargeMRPC9\.34 \(1\.508\)6\.19 \(1\.000\)6\.19 \(1\.000\)4\.98 \(0\.804\)4\.85 \(0\.784\)4\.91 \(0\.794\)4\.84 \(0\.782\)4\.84 \(0\.782\)4\.30 \(0\.694\)4\.30 \(0\.694\)CoLA7\.93 \(1\.782\)4\.45 \(1\.000\)4\.45 \(1\.000\)3\.76 \(0\.846\)3\.76 \(0\.846\)3\.76 \(0\.846\)3\.81 \(0\.857\)3\.85 \(0\.865\)3\.19 \(0\.718\)3\.34 \(0\.751\)
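To make the gather operation of Figure 7 concrete, the following is a minimal NumPy sketch. It assumes, purely for illustration, that the fixed low-rank factor is a one-hot column-selection matrix `S` and that $m_i$ is the input width of the adapted layer; the names `S`, `idx`, and the toy sizes are hypothetical and not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, batch = 8, 3, 4            # input width, adapter rank, batch size (toy values)

# Hypothetical fixed factor: r distinct input coordinates chosen at random.
idx = rng.choice(m, size=r, replace=False)
S = np.zeros((m, r))
S[idx, np.arange(r)] = 1.0       # one-hot column per selected coordinate

x = rng.standard_normal((batch, m))

dense = x @ S                    # general matrix-matrix multiplication
gathered = x[:, idx]             # gather (indexing): no arithmetic at all

# A dense product costs r dot products of length m per input row,
# i.e. m multiplications and m - 1 additions each.
flops_saved_per_input = r * (2 * m - 1)
```

Under this assumption both paths return the same values, so indexing avoids the $r(2m-1)$ floating-point operations of the dense product for each input row.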

Table 15: Throughput, runtime, trainable parameter count, and peak allocated GPU memory for fine-tuning various text and vision models on various datasets using various PEFT methods, including LoRA, PaCA, and their connecting variants, averaged over three seeds (0, 1, 2). We adapt the query and value matrices of the attention layers as well as the classification head. When adapting a small enough number of layers such that the PEFT methods are faster than FFT, PaCA reduces peak GPU memory by around 5–15% compared to LoRA, which is only a 1.4% greater reduction on average than the sparsity-induced LoRA variants. Method columns are grouped under LoRA Variants, Sparsity-Induced Variants, and PaCA Variants; values in parentheses are ratios relative to LoRA.

| Metric | Model | Dataset | FFT | LoRA | CoLA | Asym | r-cLA | r-c3LA | s-cLA | s-r-c3LA | PaCA | C-PaCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Throughput (samples/s) | ViT-Tiny | OfficeHome | 360.4 (1.004) | 358.8 (1.000) | 361.6 (1.008) | 361.5 (1.007) | 359.9 (1.003) | 359.4 (1.002) | 358.3 (0.998) | 362.1 (1.009) | 344.3 (0.959) | 364.1 (1.015) |
| | | CIFAR-10 | 1068.6 (1.006) | 1062.4 (1.000) | 1004.0 (0.945) | 1094.2 (1.030) | 1083.4 (1.020) | 1081.9 (1.018) | 1111.8 (1.046) | 1109.5 (1.044) | 1159.5 (1.091) | 1139.6 (1.073) |
| | ViT-Base | OfficeHome | 362.6 (1.004) | 361.3 (1.000) | 357.5 (0.990) | 359.3 (0.994) | 361.2 (1.000) | 360.4 (0.997) | 360.8 (0.999) | 362.5 (1.003) | 351.4 (0.973) | 342.2 (0.947) |
| | | CIFAR-10 | 1060.4 (1.013) | 1047.2 (1.000) | 1031.7 (0.985) | 1077.7 (1.029) | 1070.3 (1.022) | 1061.7 (1.014) | 1038.9 (0.992) | 1104.3 (1.054) | 1132.6 (1.082) | 1139.5 (1.088) |
| | RoBERTa-Base | MRPC | 806.5 (0.974) | 828.3 (1.000) | 725.5 (0.876) | 875.6 (1.057) | 829.3 (1.001) | 880.3 (1.063) | 920.9 (1.112) | 910.6 (1.099) | 942.0 (1.137) | 930.7 (1.124) |
| | | CoLA | 816.1 (1.125) | 725.5 (1.000) | 834.8 (1.151) | 894.2 (1.233) | 829.1 (1.143) | 834.3 (1.150) | 929.0 (1.281) | 929.8 (1.282) | 765.3 (1.055) | 875.3 (1.206) |
| | RoBERTa-Large | MRPC | 387.7 (0.888) | 436.6 (1.000) | 437.8 (1.003) | 466.9 (1.069) | 468.9 (1.074) | 439.3 (1.006) | 483.6 (1.108) | 453.9 (1.040) | 499.7 (1.144) | 480.8 (1.101) |
| | | CoLA | 394.7 (0.893) | 441.9 (1.000) | 417.9 (0.946) | 468.1 (1.059) | 468.1 (1.059) | 471.2 (1.066) | 463.1 (1.048) | 454.3 (1.028) | 480.3 (1.087) | 506.0 (1.145) |
| Runtime (minutes) | ViT-Tiny | OfficeHome | 50.6 (0.989) | 51.2 (1.000) | 50.5 (0.986) | 50.4 (0.984) | 51.0 (0.996) | 50.6 (0.989) | 50.8 (0.992) | 50.3 (0.983) | 53.8 (1.050) | 50.3 (0.983) |
| | | CIFAR-10 | 23.0 (0.992) | 23.2 (1.000) | 24.7 (1.067) | 22.4 (0.969) | 22.7 (0.979) | 22.7 (0.980) | 22.2 (0.956) | 22.2 (0.959) | 21.3 (0.919) | 21.5 (0.928) |
| | ViT-Base | OfficeHome | 50.3 (0.999) | 50.4 (1.000) | 51.0 (1.012) | 50.8 (1.008) | 50.6 (1.004) | 50.6 (1.005) | 50.4 (1.000) | 50.4 (1.000) | 52.7 (1.047) | 53.3 (1.058) |
| | | CIFAR-10 | 23.2 (0.987) | 23.5 (1.000) | 23.5 (0.999) | 22.8 (0.969) | 22.9 (0.976) | 22.9 (0.974) | 23.8 (1.011) | 22.2 (0.946) | 21.7 (0.922) | 21.6 (0.918) |
| | RoBERTa-Base | MRPC | 2.5 (1.016) | 2.5 (1.000) | 2.9 (1.155) | 2.3 (0.941) | 2.5 (1.016) | 2.3 (0.950) | 2.3 (0.926) | 2.3 (0.919) | 2.2 (0.890) | 2.2 (0.885) |
| | | CoLA | 5.5 (0.878) | 6.2 (1.000) | 5.4 (0.864) | 5.1 (0.813) | 5.5 (0.885) | 5.5 (0.878) | 4.9 (0.791) | 4.9 (0.786) | 5.9 (0.951) | 5.2 (0.833) |
| | RoBERTa-Large | MRPC | 5.1 (1.100) | 4.7 (1.000) | 4.7 (1.004) | 4.5 (0.961) | 4.4 (0.952) | 4.8 (1.037) | 4.2 (0.909) | 4.7 (1.001) | 4.2 (0.902) | 4.4 (0.939) |
| | | CoLA | 11.3 (1.110) | 10.1 (1.000) | 10.8 (1.068) | 9.7 (0.951) | 9.7 (0.953) | 9.6 (0.949) | 9.9 (0.972) | 9.9 (0.977) | 9.5 (0.938) | 9.0 (0.888) |
| Trainable parameters | ViT-Tiny | OfficeHome | 5.7e6 (16.793) | 3.4e5 (1.000) | 3.4e5 (1.000) | 2.7e5 (0.783) | 2.7e5 (0.783) | 2.7e5 (0.783) | 2.7e5 (0.783) | 2.7e5 (0.783) | 2.7e5 (0.783) | 2.7e5 (0.783) |
| | | CIFAR-10 | 5.5e6 (36.994) | 1.5e5 (1.000) | 1.5e5 (1.000) | 7.6e4 (0.506) | 7.6e4 (0.506) | 7.6e4 (0.506) | 7.6e4 (0.506) | 7.6e4 (0.506) | 7.6e4 (0.506) | 7.6e4 (0.506) |
| | ViT-Base | OfficeHome | 8.7e7 (63.708) | 1.4e6 (1.000) | 1.4e6 (1.000) | 1.1e6 (0.783) | 1.1e6 (0.783) | 1.1e6 (0.783) | 1.1e6 (0.783) | 1.1e6 (0.783) | 1.1e6 (0.783) | 1.1e6 (0.783) |
| | | CIFAR-10 | 8.6e7 (143.606) | 6e5 (1.000) | 6e5 (1.000) | 3e5 (0.506) | 3e5 (0.506) | 3e5 (0.506) | 3e5 (0.506) | 3e5 (0.506) | 3e5 (0.506) | 3e5 (0.506) |
| | RoBERTa-Base | MRPC | 1.2e8 (105.459) | 1.2e6 (1.000) | 1.2e6 (1.000) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) |
| | | CoLA | 1.2e8 (105.459) | 1.2e6 (1.000) | 1.2e6 (1.000) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) | 8.9e5 (0.750) |
| | RoBERTa-Large | MRPC | 3.6e8 (135.401) | 2.6e6 (1.000) | 2.6e6 (1.000) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) |
| | | CoLA | 3.6e8 (135.401) | 2.6e6 (1.000) | 2.6e6 (1.000) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) | 1.8e6 (0.700) |
| Peak GPU memory (GB) | ViT-Tiny | OfficeHome | 2.22 (1.280) | 1.74 (1.000) | 1.74 (1.000) | 1.58 (0.911) | 1.58 (0.911) | 1.58 (0.911) | 1.58 (0.912) | 1.58 (0.912) | 1.61 (0.927) | 1.61 (0.927) |
| | | CIFAR-10 | 2.29 (1.340) | 1.71 (1.000) | 1.71 (1.000) | 1.57 (0.918) | 1.57 (0.918) | 1.57 (0.918) | 1.57 (0.918) | 1.57 (0.918) | 1.57 (0.918) | 1.57 (0.918) |
| | ViT-Base | OfficeHome | 7.29 (1.221) | 5.97 (1.000) | 5.97 (1.000) | 5.12 (0.857) | 5.12 (0.857) | 5.12 (0.857) | 5.12 (0.857) | 5.12 (0.857) | 5.11 (0.856) | 5.11 (0.856) |
| | | CIFAR-10 | 7.05 (1.437) | 4.91 (1.000) | 4.91 (1.000) | 4.44 (0.904) | 4.44 (0.904) | 4.44 (0.904) | 4.44 (0.904) | 4.44 (0.905) | 4.40 (0.896) | 4.40 (0.896) |
| | RoBERTa-Base | MRPC | 3.82 (1.658) | 2.30 (1.000) | 2.30 (1.000) | 2.18 (0.948) | 2.18 (0.948) | 2.18 (0.948) | 2.28 (0.990) | 2.28 (0.990) | 2.16 (0.940) | 2.16 (0.940) |
| | | CoLA | 3.34 (1.812) | 1.84 (1.000) | 1.84 (1.000) | 1.76 (0.953) | 1.76 (0.953) | 1.76 (0.954) | 1.76 (0.953) | 1.76 (0.953) | 1.74 (0.943) | 1.74 (0.943) |
| | RoBERTa-Large | MRPC | 9.34 (1.783) | 5.24 (1.000) | 5.17 (0.988) | 4.81 (0.919) | 4.81 (0.919) | 4.94 (0.944) | 4.81 (0.919) | 4.81 (0.919) | 4.69 (0.895) | 4.77 (0.910) |
| | | CoLA | 7.93 (2.026) | 3.91 (1.000) | 3.91 (1.000) | 3.71 (0.949) | 3.71 (0.949) | 3.71 (0.949) | 3.71 (0.948) | 3.71 (0.948) | 3.59 (0.917) | 3.66 (0.934) |

### E\.4Performance Analysis—Continued

We extend §[4\.3](https://arxiv.org/html/2606.13767#S4.SS3) by reporting additional empirical results regarding PEFT methods, including prediction capacity and model behaviors.

#### E\.4\.1Loss Landscape—Continued

3D landscapes. We obtained the top two principal directions of the model's update path via PCA of the update matrix $[\mathbf{W}^{0}-\mathbf{W}^{T}; \ldots; \mathbf{W}^{T-1}-\mathbf{W}^{T}]$, where $\{\mathbf{W}^{t}\}_{t=0}^{T}$ are the model weights at each update step. Let $\delta, \eta$ be those two directions. For random directions, we generate them via a Gaussian distribution. For LoRA methods, we merged the adapters into the base weights before computing the directions. We normalize the directions similar to the methods of \[[34](https://arxiv.org/html/2606.13767#bib.bib17)\]. We plot the function $f(\alpha,\beta) := \mathcal{L}(\mathbf{W}+\alpha\delta+\beta\eta)$ over a $51^{2}$ grid of $\alpha, \beta$ values uniformly distributed over $[-2,2]\times[-2,2]$. We use mini-batches of size $12$ when evaluating $\mathcal{L}$.
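The procedure above can be sketched on a toy problem as follows. This is a hedged illustration, not the authors' implementation: the quadratic `loss`, the gradient-descent trajectory, the centering step before the SVD, and the choice of the final weights as the plot center are all assumptions, and the direction normalization of \[[34](https://arxiv.org/html/2606.13767#bib.bib17)\] is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 20, 30                               # toy parameter count and number of steps

# Toy quadratic loss standing in for the mini-batch loss L (assumption).
H = np.diag(np.linspace(0.5, 3.0, d))
def loss(w):
    return 0.5 * w @ H @ w

# Toy gradient-descent trajectory W^0, ..., W^T of flattened weights.
traj = [rng.standard_normal(d)]
for _ in range(T):
    traj.append(traj[-1] - 0.1 * (H @ traj[-1]))
traj = np.stack(traj)                       # shape (T + 1, d)

# Update matrix [W^0 - W^T; ...; W^{T-1} - W^T] and its top two PCA directions.
D = traj[:-1] - traj[-1]
D_centered = D - D.mean(axis=0)             # PCA centers the rows before the SVD
_, _, Vt = np.linalg.svd(D_centered, full_matrices=False)
delta, eta = Vt[0], Vt[1]

# Evaluate f(alpha, beta) = L(W + alpha * delta + beta * eta) on a 51 x 51 grid.
alphas = betas = np.linspace(-2.0, 2.0, 51)
W = traj[-1]                                # plot is centered on the final weights
surface = np.array([[loss(W + a * delta + b * eta) for b in betas] for a in alphas])
```

The resulting `surface` array can be passed to any 3D surface or contour plotting routine; swapping `delta` and `eta` for Gaussian random vectors gives the random-direction variant.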

Comparison between using random or PCA directions. To understand how the loss landscapes of the models in the PCA directions differ from those in random directions, we plotted the loss landscape of ViT-Base fine-tuned on CIFAR-10 in both PCA directions (top) and random directions (bottom) in Figure [8](https://arxiv.org/html/2606.13767#A5.F8). For random directions, the FFT landscape is substantially smoother, which is consistent with \[[34](https://arxiv.org/html/2606.13767#bib.bib17)\]. This is, however, inconsistent with the loss landscapes of RoBERTa-Base with random directions in Figure [9](https://arxiv.org/html/2606.13767#A5.F9), where chain methods produce spikier landscapes with no substantial change in generalizability.

2D landscapes. The initial setup is identical to the 3D landscape. We obtain the same principal directions and plot the same function. For 2D landscapes, when generating our grid of $\alpha, \beta$ values, we distribute them uniformly over $[-m,m]\times[-m,m]$, where $m$ is chosen to ensure the optimizer trajectory (blue arrows) is entirely contained in the image. As shown in Figure [10](https://arxiv.org/html/2606.13767#A5.F10), chain methods have more diverse loss landscapes than their non-chain counterparts because their overall update to the pre-trained weights has a higher effective rank \[[65](https://arxiv.org/html/2606.13767#bib.bib12)\].
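One way to choose $m$ is to project the trajectory onto the two directions and take the largest absolute coordinate. The sketch below assumes orthonormal directions and a plot centered on the final weights; the helper `grid_half_width` and its `margin` parameter are hypothetical names introduced for illustration only.

```python
import numpy as np

def grid_half_width(traj, delta, eta, margin=1.1):
    """Return m such that [-m, m]^2 covers the projected path, plus the path itself."""
    offsets = traj - traj[-1]                 # W^t - W^T for every step t
    coords = np.stack([offsets @ delta, offsets @ eta], axis=1)
    return margin * np.abs(coords).max(), coords

# Toy path in a 3-parameter model, with orthonormal directions along two axes.
traj = np.array([[3.0, -1.0, 0.5], [1.5, -0.4, 0.2], [0.0, 0.0, 0.0]])
delta = np.array([1.0, 0.0, 0.0])
eta = np.array([0.0, 1.0, 0.0])
m, coords = grid_half_width(traj, delta, eta)
```

The rows of `coords` are the $(\alpha, \beta)$ positions of the optimizer steps, which is what the trajectory arrows trace on the 2D plot.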

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/draw_io_images/Compare_Cifar10_Appendix.png)

Figure 8: 3D loss landscapes of ViT-Base \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] pretrained on ImageNet-1K \[[4](https://arxiv.org/html/2606.13767#bib.bib16)\] and fine-tuned on CIFAR-10 \[[32](https://arxiv.org/html/2606.13767#bib.bib81)\] using the PCA directions of the model's weight updates (top) and random directions (bottom).

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/draw_io_images/chain_variant_sharpening.drawio_updated.png)

Figure 9: 3D loss landscapes of ViT-Base \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] pretrained on ImageNet-1K \[[4](https://arxiv.org/html/2606.13767#bib.bib16)\] and fine-tuned on OfficeHome \[[61](https://arxiv.org/html/2606.13767#bib.bib80)\] (top) and RoBERTa-Base \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] pretrained on a corpus of English text and fine-tuned on CoLA \[[62](https://arxiv.org/html/2606.13767#bib.bib57)\] (bottom) using the non-chain then chain variants of each LoRA method. The chain variants consistently produce sharper landscapes than the non-chain variants. In asymmetric LoRA methods, this often correlates with worse generalizability, but not in symmetric methods where $B$, $A$ are both trained, as shown in Table [16](https://arxiv.org/html/2606.13767#A5.T16).

![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_lora.png) (a) LoRA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_asym_a.png) (b) Asym LoRA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_cheap.png) (c) cLA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_random_cheap.png) (d) r-cLA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_cola.png) (e) CoLA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_c3la.png) (f) $c^{3}$LA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_shuffle.png) (g) r-$c^{3}$LA
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_rac_a.png) (h) RAC
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_lora_plus.png) (i) LoRA+
![Refer to caption](https://arxiv.org/html/2606.13767v1/Images/roberta_landscapes_rdm_2d/roberta-base_cola_full.png) (j) Full Fine-Tuning

Figure 10: 2D loss landscapes of RoBERTa-Base fine-tuned on CoLA for FFT and other PEFT methods. The axes dirX and dirY are the constants by which we scale the top two PCA components of the weight displacement matrix. The range was chosen to contain the entire gradient path. The top row is the non-chain variant of the bottom row, save for the last column. The center is marked with a cross for visibility; the arrows indicate the direction of the model's updates.

#### E\.4\.2Intruder Dimension implementation

Table 16: Generalization error approximations ($\mathcal{G}(\mathbf{W}) \approx \mathbb{E}(\mathcal{L}_{\rm test}) - \mathcal{L}_{\rm train}$) on the past (FFT, LoRA), the present (CoLA, Asymmetric LoRA, RAC, LoRA+), and the future (cLA, $c^{3}$LA, r-cLA, r-$c^{3}$LA) fine-tuning methods over various models and datasets. The color green indicates the best result for each particular model and dataset combination, red is the second-best result, and blue the third.

| Model | Dataset | FFT | LoRA | CoLA | Asymm | RAC | LoRA+ | cLA | $c^{3}$LA | r-cLA | r-$c^{3}$LA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] | OfficeHome | 4.85e-1 | 6.96e-2 | 9.55e-3 | 7.22e-2 | 6.17e-2 | 7.39e-2 | 1.98e-2 | 3.40e-2 | 2.16e-2 | 3.51e-2 |
| | CIFAR-10 | 1.42e-1 | 2.64e-1 | 2.87e-1 | 3.36e-1 | 3.18e-1 | 2.80e-1 | 3.13e-1 | 3.03e-1 | 3.12e-1 | 2.92e-1 |
| ViT-Base \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] | OfficeHome | 3.66e-1 | 1.07e-1 | 1.43e-2 | 8.52e-3 | 1.02e-2 | 1.41e-1 | 3.16e-2 | 3.62e-2 | 5.53e-2 | 3.00e-2 |
| | CIFAR-10 | 9.98e-2 | 1.92e-1 | 2.21e-1 | 2.38e-1 | 2.30e-1 | 1.84e-1 | 2.33e-1 | 2.34e-1 | 2.26e-1 | 2.15e-1 |
| DeBERTa v2 XXL \[[23](https://arxiv.org/html/2606.13767#bib.bib51)\] | MRPC | 8.15e-2 | 6.89e-2 | 6.53e-2 | 8.09e-2 | 8.02e-2 | 9.08e-2 | 9.31e-2 | 1.10e-1 | 9.47e-2 | 1.22e-1 |
| | TREC50 | 3.38e-1 | 2.36e-1 | 7.04e-2 | 1.53e-1 | 2.24e-1 | 1.36e-1 | 1.85e-1 | 2.22e-1 | 1.93e-1 | 1.92e-1 |
| | PAWS | 6.07e-2 | 1.99e-2 | 3.63e-2 | 3.26e-2 | 3.95e-2 | 5.41e-2 | 6.68e-2 | 5.11e-2 | 1.98e-2 | 6.99e-2 |
| DeBERTa v3 Base \[[22](https://arxiv.org/html/2606.13767#bib.bib49)\] | MRPC | 1.06e-1 | 8.90e-2 | 2.59e-2 | 7.28e-2 | 9.86e-2 | 1.52e-2 | 2.58e-2 | 8.52e-3 | 1.16e-1 | 2.57e-2 |
| | TREC50 | 4.56e-1 | 2.73e-1 | 3.99e-1 | 2.16e-1 | 2.67e-1 | 2.61e-2 | 2.25e-1 | 3.70e-1 | 3.36e-1 | 2.63e-2 |
| | PAWS | 2.62e-2 | 6.43e-2 | 2.40e-2 | 6.27e-2 | 8.17e-2 | 5.55e-2 | 7.39e-2 | 5.77e-2 | 1.01e-1 | 5.82e-2 |
| RoBERTa-Base \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] | MRPC | 9.48e-1 | 6.01e-1 | 2.05e-1 | 1.64e-1 | 2.20e-1 | 5.33e-1 | 4.37e-1 | 3.78e-1 | 3.35e-1 | 3.21e-1 |
| | CoLA | 1.39 | 7.74e-1 | 4.04e-1 | 2.22e-1 | 1.96e-1 | 8.10e-1 | 4.70e-1 | 4.43e-1 | 4.38e-1 | 4.01e-1 |
| RoBERTa-Large \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] | MRPC | 7.29e-1 | 4.64e-1 | 4.71e-1 | 2.77e-1 | 2.68e-1 | 2.64e-1 | 6.54e-1 | 5.57e-1 | 5.27e-1 | 3.84e-1 |
| | CoLA | 8.06e-1 | 4.25e-1 | 4.18e-1 | 2.36e-1 | 1.75e-1 | 2.28e-1 | 4.96e-1 | 4.56e-1 | 6.14e-1 | 4.05e-1 |
| TinyLlama \[[71](https://arxiv.org/html/2606.13767#bib.bib59)\] | OpenBookQA | 1.78e-1 | 2.82e-1 | 3.41e-1 | 2.15e-1 | 1.86e-1 | 2.07e-1 | 1.51e-1 | 2.20e-1 | 3.16e-1 | 7.59e-2 |
| | FOLIO | 1.82e-1 | 2.37e-1 | 2.17e-1 | 1.75e-1 | 1.93e-1 | 5.11e-2 | 2.35e-1 | 1.91e-1 | 1.05e-1 | 2.49e-1 |
| | LogiQA | 3.61e-1 | 6.12e-3 | 1.45e-1 | 1.16e-2 | 1.75e-1 | 2.37e-1 | 8.60e-2 | 1.1e-1 | 6.64e-2 | 6.25e-2 |
| | CLUTRR | 4.29 | 2.25 | 1.55 | 2.34 | 2.27 | 5.48 | 2.16 | 2.19 | 2.59 | 4.23 |
| Llama 3 \[[14](https://arxiv.org/html/2606.13767#bib.bib7)\] | OpenBookQA | 2.65e-1 | 2.54e-1 | 2.32e-1 | 2.63e-1 | 1.67e-1 | 1.92e-1 | 3.55e-1 | 2.69e-1 | 2.79e-1 | 3.61 |
| | CLUTRR | 2.53 | 2.66 | 2.97 | 2.9 | 5.49 | 2.65 | 2.69 | 5.02 | 2.51 | 4.33 |
| DeepseekCoder \[[15](https://arxiv.org/html/2606.13767#bib.bib60)\] | DJANGO | 3.48e-2 | 4.65e-2 | 3.4e-2 | 5.16e-2 | 4.64e-2 | 3.87e-2 | 4.19e-2 | 3.89e-2 | 3.64e-2 | 3.62e-2 |
| GPT2-Small \[[52](https://arxiv.org/html/2606.13767#bib.bib48)\] | E2E | 1.65e-1 | 1.93e-1 | 1.85e-1 | 1.83e-1 | 1.85e-1 | 1.87e-1 | 1.77e-1 | 1.82e-1 | 1.88e-1 | 1.82e-1 |

Given the pretrained and fine-tuned models, $\mathbf{W}_{0}$ and $\mathbf{W}_{0}+\Delta\mathbf{W}$, we find intruder dimensions as follows. First, we decompose each layer of $\mathbf{W}_{0}$ and $\mathbf{W}_{0}+\Delta\mathbf{W}$ into their corresponding SVDs, $U^{i}\Sigma^{i}{V^{i}}^{T}_{(\mathbf{W}_{0})^{i}}$ and $U^{i}\Sigma^{i}{V^{i}}^{T}_{(\mathbf{W}_{0}+\Delta\mathbf{W})^{i}}$, $i\in[L]$, respectively. Then, given a threshold $\varepsilon\in(0,1)$, a singular vector $u^{j,i}_{(\mathbf{W}_{0}+\Delta\mathbf{W})}$ in $U^{i}_{(\mathbf{W}_{0}+\Delta\mathbf{W})}$ is an intruder dimension if, for all $u^{k,i}_{(\mathbf{W}_{0})}$ in $U^{i}_{(\mathbf{W}_{0})}$, the expression $\tfrac{|\langle u^{j,i}_{(\mathbf{W}_{0}+\Delta\mathbf{W})},\, u^{k,i}_{(\mathbf{W}_{0})}\rangle|}{\|u^{j,i}_{(\mathbf{W}_{0}+\Delta\mathbf{W})}\|\,\|u^{k,i}_{(\mathbf{W}_{0})}\|}<\varepsilon$ holds. For $\varepsilon$ small enough, this indicates that the vector $u^{j,i}_{(\mathbf{W}_{0}+\Delta\mathbf{W})}$ is almost orthogonal to all vectors in $U^{i}_{(\mathbf{W}_{0})}$. We denote these vectors as *intruder dimensions*.
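The definition above can be sketched for a single layer in NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `intruder_dimensions`, the default threshold `eps=0.5`, and the toy matrices are hypothetical, and only left singular vectors are compared.

```python
import numpy as np

def intruder_dimensions(W0, W_ft, eps=0.5):
    """Indices j of left singular vectors of W_ft whose |cosine| with every
    left singular vector of W0 is below eps."""
    U0, _, _ = np.linalg.svd(W0, full_matrices=False)
    U1, _, _ = np.linalg.svd(W_ft, full_matrices=False)
    # Singular vectors have unit norm, so the inner product is already the cosine.
    cos = np.abs(U1.T @ U0)                   # cos[j, k] = |<u_ft^j, u_0^k>|
    return np.flatnonzero(cos.max(axis=1) < eps)

# Toy layer: W0 has left singular vectors e0, e1, e3 (singular values 3, 2, 1).
e, f = np.eye(4), np.eye(3)
W0 = 3.0 * np.outer(e[0], f[0]) + 2.0 * np.outer(e[1], f[1]) + 1.0 * np.outer(e[3], f[2])
# The update injects a large component along e2, a direction absent from W0.
dW = 5.0 * np.outer(e[2], f[2])
idx = intruder_dimensions(W0, W0 + dW)
```

In this toy case the leading singular vector of the fine-tuned matrix is $(5e_2 + e_3)/\sqrt{26}$, whose largest cosine with the pretrained singular vectors is $1/\sqrt{26}\approx 0.196$, so it is flagged, while the two remaining vectors coincide with pretrained ones and are not.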

### E\.5Generalization Error—Continued

Let $\mathcal{X}\times\mathcal{Y}$ be the product of our input space and label space, with $\nu$ the distribution of pairs $(x,y)\in\mathcal{X}\times\mathcal{Y}$. Our dataset is $N=\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}$, where each $(x_{i},y_{i})$ is drawn i.i.d. from $\nu$; because the sample is finite, the distribution over our dataset does not exactly represent the true distribution of input-output pairs from our instance space. Let $\mathcal{H}$ be our hypothesis space, where $w\in\mathcal{H}$ and $w(x_{i})=\hat{y}_{i}$. We are therefore concerned with how accurately $w$ can adapt to the true distribution $\nu$ of $\mathcal{X}\times\mathcal{Y}$. This can be addressed by the generalization error of our hypothesis $w\in\mathcal{H}$ given our loss function $\ell$. The true risk of $w$ over $\mathcal{X}\times\mathcal{Y}$ given $\ell$ is $\mathcal{L}_{\rm global}(w):=\mathbb{E}_{\mathcal{X},\mathcal{Y}}[\ell(w(x),y)]=\int_{\mathcal{X}\times\mathcal{Y}}\ell(w(x),y)\,d\nu$, while the empirical risk is $\mathcal{L}:=\frac{1}{n}\sum_{i=1}^{n}\ell(w(x_{i}),y_{i})$, $(x_{i},y_{i})\in N$. Let $M$ denote the full dataset, where $M=N\cup T$, $N$ being the train dataset and $T$ being the test dataset. In practice, the empirical risk can be computed based on $N$, and the test dataset, $T$, can be used to show how well the model has generalized. $N$ and $T$ are independent samples from $\nu$; their distributions approximate $\nu$ but differ due to random and finite sampling. Although $\mathcal{L}_{\rm test}-\mathcal{L}_{\rm train}$ is not an exact measure of the generalization error of a model, it can be used as a heuristic for determining generalization. Understanding how stable these models are to small weight perturbations provides insight into their reliability and reputability for practical use.
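As a small illustration of the heuristic $\mathcal{L}_{\rm test}-\mathcal{L}_{\rm train}$, the sketch below fits a least-squares model on a training split $N$ and evaluates the empirical risk on both $N$ and a held-out split $T$. The linear model, the squared loss, the split sizes, and the helper name `empirical_risk` are assumptions chosen for brevity, not the paper's setup.

```python
import numpy as np

def empirical_risk(predict, loss_fn, X, y):
    """Mean per-example loss of the hypothesis `predict` on (X, y)."""
    return float(np.mean([loss_fn(predict(x), t) for x, t in zip(X, y)]))

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X_train, X_test = rng.standard_normal((20, 2)), rng.standard_normal((200, 2))
y_train = X_train @ w_true + 0.1 * rng.standard_normal(20)
y_test = X_test @ w_true + 0.1 * rng.standard_normal(200)

# Fit the hypothesis w on the training split N only.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
predict = lambda x: x @ w_hat
sq_loss = lambda y_hat, t: (y_hat - t) ** 2

L_train = empirical_risk(predict, sq_loss, X_train, y_train)
L_test = empirical_risk(predict, sq_loss, X_test, y_test)
gap = L_test - L_train        # heuristic estimate of the generalization gap
```

Because both splits are finite samples from the same distribution, `gap` fluctuates from seed to seed, which is why it serves as a heuristic rather than the true generalization error.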

#### E\.5\.1Normalized Generalization Results

We normalize the generalization gaps with respect to LoRA, reporting the ratio $\frac{\mathcal{G}(\cdot)}{\mathcal{G}(\mathrm{LoRA})}$ in Table [17](https://arxiv.org/html/2606.13767#A5.T17). Under this normalization, many PEFT methods exhibit similar generalization behavior relative to LoRA. Across all models, the effects of chaining tended to coincide: if CoLA generalized better than LoRA, then RAC often generalized better than Asymmetric LoRA, and cLA better than $c^{3}$LA, r-cLA, and r-$c^{3}$LA.
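As a worked example of this normalization, the snippet below divides a few of the ViT-Tiny/OfficeHome gaps reported in Table 16 by the LoRA gap. The dictionary layout and variable names are illustrative choices only.

```python
# Generalization gaps for ViT-Tiny on OfficeHome, copied from Table 16.
gaps = {"FFT": 4.85e-1, "LoRA": 6.96e-2, "CoLA": 9.55e-3, "Asymm": 7.22e-2, "RAC": 6.17e-2}

# Ratio G(method) / G(LoRA), rounded to two decimals as in Table 17.
normalized = {name: round(g / gaps["LoRA"], 2) for name, g in gaps.items()}
# normalized == {"FFT": 6.97, "LoRA": 1.0, "CoLA": 0.14, "Asymm": 1.04, "RAC": 0.89}
```

A ratio below 1 means a smaller generalization gap than LoRA on that model and dataset, and a ratio above 1 means a larger one.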

Table 17: Normalized generalization error approximations with respect to LoRA ($\frac{\mathcal{G}(\cdot)}{\mathcal{G}(\mathrm{LoRA})}$), on the past (FFT, LoRA), the present (CoLA, Asymmetric LoRA, RAC, LoRA+), and the future (cLA, $c^{3}$LA, r-cLA, r-$c^{3}$LA) fine-tuning methods over various models and datasets. The color green indicates the best result for each particular model and dataset combination, red is the second-best result, and blue the third.

| Model | Dataset | FFT | LoRA | CoLA | Asymm | RAC | LoRA+ | cLA | $c^{3}$LA | r-cLA | r-$c^{3}$LA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] | OfficeHome | 6.97 | 1.00 | 0.14 | 1.04 | 0.89 | 1.06 | 0.28 | 0.49 | 0.31 | 0.50 |
| | CIFAR-10 | 0.54 | 1.00 | 1.09 | 1.27 | 1.20 | 1.06 | 1.19 | 1.15 | 1.18 | 1.11 |
| ViT-Base \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] | OfficeHome | 3.42 | 1.00 | 0.13 | 0.08 | 0.10 | 1.32 | 0.30 | 0.34 | 0.52 | 0.28 |
| | CIFAR-10 | 0.52 | 1.00 | 1.15 | 1.24 | 1.20 | 0.96 | 1.21 | 1.22 | 1.18 | 1.12 |
| DeBERTa v2 XXL \[[23](https://arxiv.org/html/2606.13767#bib.bib51)\] | MRPC | 1.18 | 1.00 | 0.95 | 1.17 | 1.16 | 1.32 | 1.35 | 1.60 | 1.37 | 1.77 |
| | TREC50 | 1.43 | 1.00 | 0.30 | 0.65 | 0.95 | 0.58 | 0.78 | 0.94 | 0.82 | 0.81 |
| | PAWS | 3.05 | 1.00 | 1.82 | 1.64 | 1.98 | 2.72 | 3.36 | 2.57 | 0.99 | 3.51 |
| DeBERTa v3 Base \[[22](https://arxiv.org/html/2606.13767#bib.bib49)\] | MRPC | 1.19 | 1.00 | 0.29 | 0.82 | 1.11 | 0.17 | 0.29 | 0.10 | 1.30 | 0.29 |
| | TREC50 | 1.67 | 1.00 | 1.46 | 0.79 | 0.98 | 0.10 | 0.82 | 1.36 | 1.23 | 0.10 |
| | PAWS | 0.41 | 1.00 | 0.37 | 0.98 | 1.27 | 0.86 | 1.15 | 0.90 | 1.57 | 0.91 |
| RoBERTa-Base \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] | MRPC | 1.58 | 1.00 | 0.34 | 0.27 | 0.37 | 0.89 | 0.73 | 0.63 | 0.56 | 0.53 |
| | CoLA | 1.80 | 1.00 | 0.52 | 0.29 | 0.25 | 1.05 | 0.61 | 0.57 | 0.57 | 0.52 |
| RoBERTa-Large \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] | MRPC | 1.57 | 1.00 | 1.02 | 0.60 | 0.58 | 0.57 | 1.41 | 1.20 | 1.14 | 0.83 |
| | CoLA | 1.90 | 1.00 | 0.98 | 0.56 | 0.41 | 0.54 | 1.17 | 1.07 | 1.44 | 0.95 |
| TinyLlama \[[71](https://arxiv.org/html/2606.13767#bib.bib59)\] | OpenBookQA | 0.63 | 1.00 | 1.21 | 0.76 | 0.66 | 0.73 | 0.54 | 0.78 | 1.12 | 0.27 |
| | FOLIO | 0.77 | 1.00 | 0.92 | 0.74 | 0.81 | 0.22 | 0.99 | 0.81 | 0.44 | 1.05 |
| | LogiQA | 58.99 | 1.00 | 23.69 | 1.90 | 28.59 | 38.73 | 14.05 | 17.97 | 10.85 | 10.21 |
| | CLUTRR | 1.91 | 1.00 | 0.69 | 1.04 | 1.01 | 2.44 | 0.96 | 0.97 | 1.15 | 1.88 |
| Llama 3 \[[14](https://arxiv.org/html/2606.13767#bib.bib7)\] | OpenBookQA | 1.04 | 1.00 | 0.91 | 1.04 | 0.66 | 0.76 | 1.40 | 1.06 | 1.10 | 14.21 |
| | CLUTRR | 0.95 | 1.00 | 1.12 | 1.09 | 2.06 | 1.00 | 1.01 | 1.89 | 0.94 | 1.63 |
| DeepseekCoder \[[15](https://arxiv.org/html/2606.13767#bib.bib60)\] | DJANGO | 0.75 | 1.00 | 0.73 | 1.11 | 1.00 | 0.83 | 0.90 | 0.84 | 0.78 | 0.78 |
| GPT2-Small \[[52](https://arxiv.org/html/2606.13767#bib.bib48)\] | E2E | 0.85 | 1.00 | 0.96 | 0.95 | 0.96 | 0.97 | 0.92 | 0.94 | 0.97 | 0.94 |

Table 18: Extended Table 2, performance of fine-tuned models with adapter rank $r=16$. We use green, red, and blue to indicate the best, second best, and third best result. For the sparse variants, ↓ indicates the accuracy drop percentage compared to the best. Columns are grouped as The Past (FFT, LoRA), The Present (CoLA, Asym, RAC, LoRA+), and The Future (cLA, $c^{3}$LA, r-cLA, r-$c^{3}$LA).

| Model | Dataset | FFT | LoRA | CoLA | Asym | RAC | LoRA+ | cLA | $c^{3}$LA | r-cLA | r-$c^{3}$LA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Tiny \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] | OfficeHome \[[61](https://arxiv.org/html/2606.13767#bib.bib80)\] | 79.68 | 80.13 | 79.54 | 78.02 | 78.55 | 77.87 | 78.01 (↓2.65%) | 78.69 (↓1.80%) | 78.01 (↓2.65%) | 79.32 (↓1.01%) |
| | CIFAR10 \[[32](https://arxiv.org/html/2606.13767#bib.bib81)\] | 96.59 | 96.17 | 95.85 | 94.80 | 95.36 | 95.29 | 94.94 (↓1.71%) | 95.23 (↓1.41%) | 95.12 (↓1.52%) | 95.22 (↓1.42%) |
| ViT-Base \[[8](https://arxiv.org/html/2606.13767#bib.bib26)\] | OfficeHome | 86.42 | 88.96 | 89.01 | 89.00 | 89.33 | 87.87 | 89.21 | 89.18 | 88.83 | 89.17 |
| | CIFAR10 | 98.06 | 98.71 | 98.48 | 98.68 | 98.73 | 98.36 | 98.63 | 98.54 | 98.78 | 98.72 |
| DeBERTa v2 XXL \[[23](https://arxiv.org/html/2606.13767#bib.bib51)\] | MRPC \[[62](https://arxiv.org/html/2606.13767#bib.bib57)\] | 87.49 | 88.28 | 87.47 | 87.03 | 86.97 | 87.53 | 86.13 (↓2.44%) | 85.11 (↓3.59%) | 85.55 (↓3.09%) | 85.15 (↓3.55%) |
| | TREC-50 \[[37](https://arxiv.org/html/2606.13767#bib.bib58)\] | 91.99 | 91.47 | 85.65 | 92.26 | 92.02 | 84.92 | 91.73 (↓0.57%) | 90.87 (↓1.51%) | 91.67 (↓0.64%) | 91.07 (↓1.29%) |
| | PAWS \[[74](https://arxiv.org/html/2606.13767#bib.bib56)\] | 94.69 | 94.97 | 95.22 | 94.95 | 94.66 | 95.20 | 94.77 (↓0.47%) | 94.90 (↓0.34%) | 94.38 (↓0.88%) | 94.71 (↓0.54%) |
| DeBERTa v3 Base \[[22](https://arxiv.org/html/2606.13767#bib.bib49)\] | MRPC | 85.80 | 88.33 | 87.91 | 86.40 | 86.34 | 84.51 | 84.43 (↓4.42%) | 80.22 (↓9.18%) | 85.42 (↓3.29%) | 84.17 (↓4.71%) |
| | RTE \[[62](https://arxiv.org/html/2606.13767#bib.bib57)\] | 82.47 | 86.34 | 83.80 | 78.94 | 79.40 | 84.72 | 76.00 (↓11.98%) | 75.08 (↓13.04%) | 79.40 (↓8.04%) | 79.40 (↓8.04%) |
| | STS-B \[[62](https://arxiv.org/html/2606.13767#bib.bib57)\] | 89.52 | 89.09 | 89.34 | 89.04 | 88.71 | 89.15 | 87.56 (↓2.19%) | 87.90 (↓1.81%) | 88.05 (↓1.64%) | 87.90 (↓1.81%) |
| | TREC-50 | 90.15 | 89.29 | 89.88 | 90.67 | 89.22 | 85.52 | 86.04 (↓5.11%) | 87.96 (↓2.99%) | 86.04 (↓5.11%) | 87.70 (↓3.28%) |
| | PAWS | 94.76 | 94.62 | 94.40 | 94.48 | 94.45 | 94.44 | 94.23 | 94.60 | 94.36 | 94.42 |
| RoBERTa-Base \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] | MRPC | 87.40 | 86.34 | 86.76 | 86.40 | 86.67 | 84.29 | 84.83 (↓2.94%) | 84.39 (↓3.44%) | 85.08 (↓2.65%) | 85.33 (↓2.37%) |
| | CoLA \[[62](https://arxiv.org/html/2606.13767#bib.bib57)\] | 56.08 | 57.33 | 58.39 | 52.35 | 53.76 | 50.40 | 51.86 (↓11.18%) | 53.29 (↓8.73%) | 52.56 (↓9.98%) | 53.10 (↓9.06%) |
| RoBERTa-Large \[[41](https://arxiv.org/html/2606.13767#bib.bib50)\] | MRPC | 87.57 | 88.46 | 88.43 | 87.56 | 87.69 | 72.91 | 87.81 | 86.36 | 86.24 | 86.59 |
| | CoLA | 64.58 | 62.42 | 60.03 | 63.42 | 59.84 | 28.80 | 59.47 (↓7.91%) | 59.60 (↓7.71%) | 58.60 (↓9.26%) | 60.24 (↓6.72%) |
| TinyLlama \[[71](https://arxiv.org/html/2606.13767#bib.bib59)\] | OpenBookQA \[[43](https://arxiv.org/html/2606.13767#bib.bib79)\] | 55.47 | 52.41 | 52.47 | 45.96 | 47.59 | 53.26 | 44.92 (↓19.02%) | 45.12 (↓18.66%) | 47.07 (↓15.14%) | 27.34 (↓50.71%) |
| | FOLIO \[[16](https://arxiv.org/html/2606.13767#bib.bib62)\] | 60.71 | 57.59 | 59.40 | 58.33 | 55.45 | 54.17 | 58.97 | 58.01 | 54.81 | 59.82 |
| | LogiQA \[[39](https://arxiv.org/html/2606.13767#bib.bib63)\] | 47.54 | 41.54 | 43.70 | 41.50 | 40.86 | 45.83 | 39.09 (↓17.77%) | 39.30 (↓17.33%) | 39.09 (↓17.77%) | 39.31 (↓17.31%) |
| | CLUTRR \[[56](https://arxiv.org/html/2606.13767#bib.bib64)\] | 42.01 | 37.44 | 39.38 | 37.98 | 37.98 | 38.10 | 39.12 | 37.79 | 36.23 | 37.03 |
| Llama3-8B \[[14](https://arxiv.org/html/2606.13767#bib.bib7)\] | OpenBookQA | 88.80 | 87.53 | 86.47 | 88.47 | 87.33 | 86.87 | 87.33 (↓1.65%) | 85.07 (↓4.20%) | 86.07 (↓3.07%) | 53.69 (↓39.54%) |
| | CLUTRR | 50.29 | 48.7 | 47.65 | 51.69 | 49.65 | 52.89 | 55.53 | 52.04 | 54.9 | 49.94 |
| DeepseekCoder \[[15](https://arxiv.org/html/2606.13767#bib.bib60)\] | DJANGO \[[49](https://arxiv.org/html/2606.13767#bib.bib67)\] | 22.73 | 23.60 | 19.79 | 35.12 | 30.27 | 27.27 | 7.83 (↓77.71%) | 19.48 (↓44.53%) | 19.36 (↓44.87%) | 15.34 (↓56.32%) |
| GPT2-Small \[[52](https://arxiv.org/html/2606.13767#bib.bib48)\] | E2E \[[47](https://arxiv.org/html/2606.13767#bib.bib93)\] | 2.98 | 3.18 | 3.29 | 3.36 | 3.34 | 3.23 | 3.34 (↑12.08%) | 3.29 (↑10.4%) | 3.30 (↑10.7%) | 3.29 (↑10.4%) |
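The ↓ annotations in Table 18 appear consistent with a relative drop against the best score in the same row, taken over all ten methods. The sketch below checks this reading on the ViT-Tiny / OfficeHome row; the helper name `drop_pct` and the dictionary layout are illustrative choices, not part of the paper's code, and the E2E row (which uses ↑) is not covered.

```python
def drop_pct(score: float, best: float) -> float:
    """Relative drop of `score` below `best`, in percent."""
    return 100.0 * (best - score) / best


# ViT-Tiny / OfficeHome row of Table 18 (accuracy, adapter rank r = 16).
row = {
    "FFT": 79.68, "LoRA": 80.13, "CoLA": 79.54, "Asym": 78.02, "RAC": 78.55,
    "LoRA+": 77.87, "cLA": 78.01, "c3LA": 78.69, "r-cLA": 78.01, "r-c3LA": 79.32,
}
sparse = ("cLA", "c3LA", "r-cLA", "r-c3LA")

# Assumption: "the best" means the best score across all methods in the row.
best = max(row.values())
drops = {name: round(drop_pct(row[name], best), 2) for name in sparse}
print(drops)
```

Under this assumption the computed drops are 2.65, 1.80, 2.65, and 1.01 percent, matching the values reported for that row.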

## Appendix F Limitations and Discussion

cLA and $c^{3}$LA train only a small portion of the pre-trained model at a time, which leads to underperformance at lower ranks compared with other LoRA variants. In many settings, however, we observed that cLA and $c^{3}$LA performed nearly as well as their non-sparse counterparts, Asymmetric LoRA and RAC, while being less expensive. The methods that inspired them already include a frozen matrix component; we leave the study of further identity-based LoRA variants that save computational resources to future work.

## Appendix G Table of Notations

Table 19: Table of notations.

| Notation | Definition |
|---|---|
| $\lVert x \rVert$ | The $\ell_{2}$ norm of a vector $x$ |
| $\lVert A \rVert$ | The Frobenius norm of a matrix $A$ |
| $\lVert A \rVert_{2}$ | The spectral norm of a matrix $A$ |
| $L$ | Number of layers in a deep neural network |
| $W^{i}$ | $i^{\rm th}$ layer of the network |
| $\mathbf{W}$ | $(W^{1},\ldots,W^{L})$ |
| $x$ | Input to the network |
| $f_{\mathbf{W}}(x)$ | $\sigma_{L}(W^{L}\cdots\sigma_{3}(W^{3}\sigma_{2}(W^{2}\sigma_{1}(W^{1}(x)))\ldots)))$ |
| $\sigma_{i}(\cdot)$ | $i^{\rm th}$ layer non-linear activation function |
| $N_{\rm pre}$ | Pre-training dataset $(x_{i},y_{i})_{i=1}^{\lvert N_{\rm pre}\rvert}$ |
| $\ell_{\rm pre}(\cdot)$ | Pre-training loss function |
| $\mathbf{W}_{0}$ | Pre-training weights |
| $\Delta\mathbf{W}$ | FFT weight update |
| $\Delta\hat{\mathbf{W}}$ | FFT argmin update |
| $\ell(\cdot)$ | Fine-tuning loss function |
| $\mathbf{BA}$ | LoRA weight update |
| $\hat{\mathbf{B}}\hat{\mathbf{A}}$ | LoRA argmin weight update |
| $k$ | Chain length of chain methods (CoLA, RAC, $c^{3}$LA) |
| $\mathbf{B}^{j}\mathbf{A}^{j}$ | CoLA $j^{\rm th}$ chain weight update |
| $\hat{\mathbf{B}}^{j}\hat{\mathbf{A}}^{j}$ | CoLA $j^{\rm th}$ chain argmin weight update |
| $\mathbf{W}_{0}^{(k,BA)}$ | $k$ chains of CoLA updates, where $\mathbf{W}_{0}^{(k,BA)}:=\mathbf{W}_{0}+\sum_{j=1}^{k}\hat{\mathbf{B}}^{j}\hat{\mathbf{A}}^{j}$ |
| $\mathbf{A}_{0}$ | Frozen $A$ layers |
| $\mathbf{BA}_{0}$ | Asymmetric LoRA weight update |
| $\hat{\mathbf{B}}\mathbf{A}_{0}$ | Asymmetric LoRA argmin weight update |
| $\mathbf{B}^{j}\mathbf{A}_{0}^{j}$ | RAC-LoRA $j^{\rm th}$ chain weight update |
| $\hat{\mathbf{B}}^{j}\mathbf{A}_{0}^{j}$ | RAC-LoRA $j^{\rm th}$ chain argmin weight update |
| $\mathbf{W}_{0}^{(k,B)}$ | $k$ chains of RAC-LoRA updates, where $\mathbf{W}_{0}^{(k,B)}:=\mathbf{W}_{0}+\sum_{j=1}^{k}\hat{\mathbf{B}}^{j}\mathbf{A}_{0}^{j}$ |
| $\mathbf{B}^{c}$ | Cheap LoRA (cLA) weight update |
| $\hat{\mathbf{B}}^{c}$ | cLA argmin weight update |
| $\mathbf{B}^{c^{3},j}$ | Circulant chain of cheap LoRAs ($c^{3}$LA) $j^{\rm th}$ chain weight update |
| $\hat{\mathbf{B}}^{c^{3},j}$ | $c^{3}$LA $j^{\rm th}$ chain argmin weight update |
| $\mathbf{W}_{0}^{(k,B^{c^{3}})}$ | $k$ chains of $c^{3}$LA updates, where $\mathbf{W}_{0}^{(k,B^{c^{3}})}:=\mathbf{W}_{0}+\sum_{j=1}^{k}\hat{\mathbf{B}}^{c^{3},j}$ |
| $L_{G}$ | Lipschitz constant for the gradient of the loss function |
| $\mathcal{X}$ | Feature space of the network |
| $\mathcal{Y}$ | Label space of the network |
| $\hat{\mathcal{L}}_{\rm global}(\cdot)$ | True risk of an input network |
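To make the chained-update notation concrete, the following is a minimal NumPy sketch of the RAC-LoRA style accumulation $\mathbf{W}_{0}^{(k,B)}=\mathbf{W}_{0}+\sum_{j=1}^{k}\hat{\mathbf{B}}^{j}\mathbf{A}_{0}^{j}$. The function name `merge_chain`, the shapes (`d_out`, `d_in`, `r`), and the random stand-ins for the trained factors are assumptions made for illustration; no training is performed and this is not the paper's implementation.

```python
import numpy as np


def merge_chain(W0, Bs, A0s):
    """Return W0 + sum_j Bs[j] @ A0s[j] for a chain of frozen-A low-rank updates."""
    W = W0.copy()
    for B, A0 in zip(Bs, A0s):
        W += B @ A0  # each chain link contributes one update of rank at most r
    return W


rng = np.random.default_rng(0)
d_out, d_in, r, k = 6, 8, 2, 3  # hypothetical layer size, adapter rank, chain length

W0 = rng.standard_normal((d_out, d_in))                    # pre-trained weights
A0s = [rng.standard_normal((r, d_in)) for _ in range(k)]   # frozen A_0^j factors
Bs = [rng.standard_normal((d_out, r)) for _ in range(k)]   # stand-ins for trained B^j

W = merge_chain(W0, Bs, A0s)
delta = W - W0
print(delta.shape, np.linalg.matrix_rank(delta) <= k * r)
```

Because each link adds a product of a `d_out x r` and an `r x d_in` matrix, the accumulated update has rank at most `k * r`, which is why chaining can reach higher-rank updates than a single adapter of rank `r`.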
