CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning


Summary

The paper introduces CERSA, a novel parameter-efficient fine-tuning method that uses singular value decomposition to retain principal components, significantly reducing memory usage while outperforming existing methods like LoRA.


# Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning
Source: [https://arxiv.org/html/2605.08174](https://arxiv.org/html/2605.08174)
Jingze Ge¹, Xue Geng³, Yun Liu², Wanqi Dong¹, Wang Zhe Mark³, Min Wu³, Ngai-Man Cheung⁴, Bharadwaj Veeravalli¹, Xulei Yang³

¹National University of Singapore  ²Nankai University  ³Institute for Infocomm Research (I2R), A*STAR  ⁴Singapore University of Technology and Design

jingze.ge@u.nus.edu, geng_xue@i2r.a-star.edu.sg, liuyun@mail.nankai.edu.cn, wanqi.dong@u.nus.edu, wumin@a-star.edu.sg, ngaiman_cheung@sutd.edu.sg, elebv@nus.edu.sg, YANG_XULEI@I2R.A-STAR.EDU.SG

###### Abstract

To mitigate the memory constraints associated with fine-tuning large pre-trained models, existing parameter-efficient fine-tuning (PEFT) methods, such as LoRA, rely on low-rank updates. However, such updates fail to fully capture the rank characteristics of the weight modifications observed in full-parameter fine-tuning, resulting in a performance gap. Furthermore, LoRA and other existing PEFT methods still require substantial memory to store the full set of frozen weights, limiting their efficiency in resource-constrained settings. To address these limitations, we introduce Cumulative Energy-Retaining Subspace Adaptation (CERSA), a novel fine-tuning paradigm that leverages singular value decomposition (SVD) to retain only the principal components responsible for 90% to 95% of the spectral energy. By fine-tuning low-rank representations derived from this principal subspace, CERSA significantly reduces memory consumption. We conduct extensive evaluations of CERSA across models of varying scales and domains, including image recognition, text-to-image generation, and natural language understanding. Empirical results demonstrate that CERSA consistently outperforms state-of-the-art PEFT methods while achieving substantially lower memory requirements. The code will be released.

## 1 Introduction

Fine-tuning pre-trained large models for specific tasks has become a common practice to achieve superior performance in both natural language processing and computer vision domains (Hu et al., 2022; Sun et al., 2024; Meng et al., 2024). Pre-trained models, which have been trained on extensive and diverse datasets (Deng et al., 2009; Lin et al., 2014), accumulate rich and general knowledge, enabling them to outperform models trained from scratch. However, fine-tuning an entire pre-trained model typically demands substantial computational resources, particularly memory, for large-scale transformer architectures such as ViT-Large (Dosovitskiy, 2021) and DeBERTaV3 (He et al., 2023). Unlike pre-training, which is often carried out on massive clusters equipped with thousands of GPUs, fine-tuning is more likely to occur on consumer-grade GPUs to support diverse downstream applications. Consequently, reducing the number of tunable parameters and the memory footprint has become a focal point of parameter-efficient fine-tuning (PEFT) research (Hu et al., 2022; Zi et al., 2023; Zhang et al., 2023a; Kopiczko et al., 2024; Gu et al., 2022; Ren et al., 2024; Valipour et al., 2023).

Existing PEFT methods aim to fine-tune only a small subset of parameters within pre-trained models (Rebuffi et al., 2017; Li and Liang, 2021; Lester et al., 2021), which significantly reduces memory requirements. Since fewer parameters are updated during backpropagation, the demand for memory to store gradients and optimizer states decreases. Among the most popular methods are LoRA (Hu et al., 2022) and its variants (Zi et al., 2023; Zhang et al., 2023a; Kopiczko et al., 2024; Ren et al., 2024; Sun et al., 2024; Meng et al., 2024), which introduce two low-rank matrices, $\bm{B}\in\mathbb{R}^{m\times r}$ and $\bm{A}\in\mathbb{R}^{r\times n}$ ($r\ll m$, $r\ll n$), to reparameterize fine-tuning as $\bm{B}\times\bm{A}$. Here, the pre-trained weight matrix $\bm{W}\in\mathbb{R}^{m\times n}$ is frozen, and only the newly added low-rank matrices are trained.
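For concreteness, below is a minimal sketch of the LoRA-style reparameterization described above (written in PyTorch; the class name `LoRALinear` and the dimensions are our own illustrative choices, not from the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen weight W plus a trainable low-rank update B @ A (LoRA-style)."""
    def __init__(self, W: torch.Tensor, r: int = 32, alpha: float = 32.0):
        super().__init__()
        m, n = W.shape
        self.W = nn.Parameter(W, requires_grad=False)     # frozen pre-trained weight, R^{m x n}
        self.B = nn.Parameter(torch.zeros(m, r))          # low-rank factor B in R^{m x r}, zero init
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)   # low-rank factor A in R^{r x n}
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W + scale * B A; only B and A receive gradients.
        return x @ (self.W + self.scale * self.B @ self.A).T

layer = LoRALinear(torch.randn(1024, 1024), r=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 1024 * 32 = 65,536 trainable parameters vs. 1,048,576 frozen
```

Note that the full frozen matrix $\bm{W}$ still has to be kept in memory, which is exactly the limitation CERSA targets.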

Despite these advances, most existing methods focus on reducing memory usage by exploiting the low-rank nature of gradients during training (Hu et al., 2022; Zi et al., 2023; Kopiczko et al., 2024; Gu et al., 2022; Ren et al., 2024; Valipour et al., 2023). However, the full weight matrices must still be stored in memory, and few approaches directly compress the pre-trained weights. As a result, the total memory consumption for weights, gradients, and optimizer states often remains tied to the size of the pre-trained weights. In addition, SVFit (Sun et al., 2024) and SVFT (Lingam et al., 2024) use singular value decomposition (SVD) to compress pre-trained weights but require storing two full singular vector matrices of size $\mathbb{R}^{n\times n}$, limiting the memory savings despite their low trainable parameter counts (see Section 2.2). Furthermore, recent studies (Shuttleworth et al., 2024) reveal a key limitation of LoRA: it introduces intruder dimensions that degrade the model's performance on previously learned tasks. These findings motivate us to directly preserve the major components of the pre-trained weights, enabling memory-efficient fine-tuning while maintaining the prior knowledge encoded during pre-training.

To this end, we propose Cumulative Energy-Retaining Subspace Adaptation (CERSA), a memory-efficient method for fine-tuning pre-trained weights. The key idea is to apply SVD to each weight matrix and truncate it to retain only the components that preserve most of the cumulative energy (typically 90%–95%). Since the singular values of weight matrices follow a heavy-tailed distribution, a small set of dominant singular vectors suffices for adapting the model to downstream tasks. As illustrated in Fig. 3, depending on the matrix position, retaining only 10%–50% of the original dimensions is often enough to capture the principal energy. This enables substantial memory savings during fine-tuning with minimal performance loss. For example, in ViT-Large (Dosovitskiy, 2021), keeping 95% of the cumulative energy yields a memory footprint comparable to LoRA with rank 32 (Hu et al., 2022), as shown in Fig. 1. Reducing the threshold to 90% further lowers memory usage below that of the original pre-trained weights, while causing only a negligible drop of about 0.3% on average across three image classification datasets (Tab. 4). As illustrated in Fig. 2, CERSA achieves a clearly superior accuracy-memory trade-off compared to baseline methods, making it especially effective under strict memory constraints.

The primary contributions of this paper are as follows:

- We propose CERSA, a memory-efficient PEFT method that uses SVD to retain the primary cumulative energy of pre-trained model weights and fine-tunes within the principal subspace. This reduces memory usage below the size of the pre-trained weights, improves fine-tuning efficiency compared to LoRA (Hu et al., 2022), and minimizes the forgetting of prior knowledge.
- We provide a theoretical analysis of CERSA, showing that fine-tuning within the principal cumulative-energy subspace is sufficient for adapting the model to downstream tasks. This subspace overlaps significantly with those required for most tasks, helping retain pre-trained knowledge during fine-tuning.
- We comprehensively evaluate CERSA on image classification and natural language understanding tasks. Results demonstrate that CERSA consistently outperforms state-of-the-art PEFT baselines while achieving the best accuracy-memory trade-off, highlighting its effectiveness under constrained memory budgets.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x1.png)
Figure 1: Memory footprint comparison for fine-tuning ViT-Large (Dosovitskiy, 2021).

![Refer to caption](https://arxiv.org/html/2605.08174v1/x2.png)
Figure 2: Average accuracy (see Tab. 7) versus total memory usage on ViT-Large (Dosovitskiy, 2021).

## 2 Related Work

### 2.1 Low-Rank Adaptation

LoRA (Hu et al., 2022) is a key method in PEFT, reducing memory usage by decomposing weight updates into low-rank matrices while keeping the pre-trained weights frozen. This enables efficient fine-tuning of large models. Enhancements to LoRA can be categorized into three types: weight-driven, data-driven, and adaptive methods.

Weight-driven methods add adapters derived from weight decomposition on top of frozen pre-trained weights, directly manipulating the weight space via matrix decompositions and orthonormal constraints. Representative approaches, including PiSSA (Meng et al., 2024), OLoRA (Wang et al., 2023), MiLoRA (Wang et al., 2024a), LoRA-XS (Bałazy et al., 2024), and DoRA (Liu et al., 2024), introduce techniques such as SVD-based initialization, QR-based orthonormal initialization, and minor singular component adaptation to enhance representation learning and convergence speed.

Data-driven methods leverage model activations, gradients, or data distributions to guide adapter updates. Techniques such as LoRA-GA (Wang et al., 2024b), LoRA-Pro (Wang et al., 2024c), LaMDA (Azizi et al., 2024), and EVA (Paischer et al., 2024) employ strategies such as aligning low-rank gradients with full fine-tuning gradients and performing SVD on mini-batch activations for variance-aware initialization, thereby improving adaptation efficiency through data-informed adjustments.

Adaptive methods dynamically configure adapters according to task characteristics or layer importance to optimize parameter utilization. Approaches such as AdaLoRA (Zhang et al., 2023b) and EVA (Paischer et al., 2024) employ rank allocation by layer importance and variance-aware adjustments, effectively balancing model capacity with computational cost to achieve efficient fine-tuning.

Despite these advancements, most LoRA-based methods store the entire frozen weight matrix along with multiple adapters, offering limited memory savings over the original LoRA (Hu et al., 2022). This underscores the need for more efficient methods that further reduce memory and computational costs.

### 2.2 Weight-Decomposition-Based Methods

To further minimize the number of parameters required for fine-tuning and reduce computational costs, weight-decomposition-based methods have been developed that operate directly on the pre-trained weights. Generally, the basic step of these methods (Han et al., 2023) is to decompose the original weight matrix $\bm{W}$ into $\bm{U}$, $\bm{\Sigma}$, and $\bm{V}$. SVFit (Sun et al., 2024) fine-tunes only the top-$k$ singular values, freezing $\bm{U}$ and $\bm{V}$ to retain the principal components. SVFT (Lingam et al., 2024) freezes $\bm{\Sigma}$ and introduces a sparse adapter for task-specific adaptation. SVDiff (Han et al., 2023) applies singular value fine-tuning to diffusion models, reducing storage while mitigating overfitting. WeLore (Jaiswal et al., 2024) optimizes rank reduction across layers by identifying low-rank components for selective fine-tuning, enhancing efficiency with minimal performance loss.

Although these methods reduce trainable parameters, they require storing $\bm{U}$ and $\bm{V}$, doubling the original weight size (Lingam et al., 2024). When accounting for gradients and optimizer states, their memory footprint exceeds twice that of the pre-trained weights, making them more memory-intensive than LoRA (Hu et al., 2022) and other PEFT methods.

## 3 Methodology

Fine-tuning pre-trained models using singular value decomposition (SVD) has proven to be an effective approach for adapting large-scale models while minimizing parameter updates (Han et al., 2023; Sun et al., 2024; Lingam et al., 2024). However, traditional SVD-based fine-tuning incurs substantial computational and memory overhead by necessitating the storage of two full decomposed matrices, effectively doubling memory consumption compared to standard weight storage. Moreover, freezing the left and right singular matrices restricts the model's expressiveness, making it suboptimal relative to full-parameter fine-tuning. To address these limitations, we propose a constrained optimization framework that selectively updates the principal components using trainable matrices while discarding the components associated with minor singular vectors. By fine-tuning within the principal subspace of the weight matrix, our method retains the core representational capacity of the pre-trained model while significantly reducing memory requirements, thereby enabling efficient and stable adaptation to downstream tasks.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x3.png)
Figure 3: Preserved singular value indices in ViT-Large (Dosovitskiy, 2021) (pre-trained on ImageNet-21K (Deng et al., 2009)) across layers and weight matrices under different cumulative energy retention rates. The query (Q), key (K), value (V), and projection (P) matrices correspond to weight matrices in self-attention, while the up (UP) and down (DN) matrices represent the weight matrices of the first and second linear operations of the multilayer perceptron (MLP), respectively.

### 3.1 Layer-wise Rank Selection

In existing methods, a persistent challenge is that, regardless of the attached adapter, the original pre-trained weights $\bm{W}\in\mathbb{R}^{m\times n}$ impose a memory cost of $\mathcal{O}(mn)$ and incur a computational overhead of $\mathcal{O}(mn)$ during forward propagation, even when fully frozen. As a result, no matter how parameter-efficient the fine-tuning method is, this storage and computation burden remains unavoidable. Inspired by PiSSA (Meng et al., 2024), we propose to retain the most significant cumulative energy (Jolliffe, 2002) of the weight matrix in terms of the $\bm{U}$ and $\bm{V}$ matrices of its SVD, assuming that the subspace defined by these matrices is sufficient for most fine-tuning scenarios. To further reduce the dimensionality, we use truncated SVD, as it provides the optimal low-rank approximation in terms of the Frobenius norm (Eckart and Young, 1936).
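As a quick numerical illustration of the Eckart–Young property invoked here (our own example, not from the paper), the rank-$k$ truncated SVD is the best Frobenius-norm approximation, and its error equals the energy in the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))

# Full SVD: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 16
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation
err_svd = np.linalg.norm(W - W_k, "fro")

# Eckart-Young: the error equals sqrt(sum of the discarded squared singular values)
err_theory = np.sqrt(np.sum(s[k:] ** 2))
print(err_svd, err_theory)                     # the two values coincide
```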

Moreover, the singular value distributions of pre-trained weights vary significantly across layers, influenced by both the layer depth and the type of weight matrix. To effectively extract the principal components across layers, we propose to retain the cumulative energy of the truncated SVD using the cumulative energy retention rate (Eckart and Young, 1936). This rate measures the proportion of total cumulative energy retained in the selected components after truncation and is calculated as:

$$\alpha=\frac{\sum_{i=1}^{k}s_{i}^{2}}{\sum_{j=1}^{N}s_{j}^{2}},\qquad(1)$$

where $s_{i}$ is the singular value corresponding to the $i$-th principal component, $k$ denotes the number of selected singular values after truncation, and $N$ is the total number of singular values. The numerator, $\sum_{i=1}^{k}s_{i}^{2}$, represents the energy retained in the first $k$ singular values, corresponding to the $k$ most significant components of the matrix. The denominator, $\sum_{j=1}^{N}s_{j}^{2}$, represents the total energy of the original matrix. Consequently, $\alpha$ quantifies the ratio of retained to total energy, reflecting the proportion of the matrix's variance preserved in the truncated representation.

Eq. 1 implies that larger singular values contribute more to the total cumulative energy of the matrix. By setting a specific cumulative energy retention rate (e.g., $\alpha=0.95$) across different layers, one can determine the minimum rank $k$ required to retain the desired proportion of the cumulative energy of each weight matrix. Hence, the cumulative energy retention rate enables us to balance dimensionality reduction and information preservation.
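A minimal sketch of this layer-wise rank selection (our own illustration; the helper name `rank_for_energy` is not from the paper): for each weight matrix, choose the smallest rank whose cumulative squared singular values reach the threshold $\alpha$.

```python
import numpy as np

def rank_for_energy(W: np.ndarray, alpha: float = 0.95) -> int:
    """Smallest k such that the top-k singular values retain >= alpha of the spectral energy."""
    s = np.linalg.svd(W, compute_uv=False)          # singular values, descending
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative energy retention per rank
    return int(np.searchsorted(energy, alpha)) + 1

# Layer-wise selection: every weight matrix gets its own rank k
weights = {"layer0.q": np.random.randn(1024, 1024), "layer0.v": np.random.randn(1024, 1024)}
ranks = {name: rank_for_energy(W, alpha=0.95) for name, W in weights.items()}
print(ranks)
```

Because each layer's singular-value spectrum differs, the selected $k$ varies by layer and by matrix type, which is exactly what Fig. 3 visualizes.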

Fig. 3 illustrates the preserved singular value indices in ViT-Large (Dosovitskiy, 2021), pre-trained on ImageNet-21K (Deng et al., 2009), after SVD. The rank values are computed across different layer types at various cumulative energy retention rates {0.8, 0.85, 0.9, 0.92, 0.95}. The indices of the preserved singular values increase from the lower to the upper layers, indicating that lower layers allow for greater compression. Additionally, compared to the MLP layers, the query, key, value, and projection matrices in the self-attention module have lower cutoff indices at the same cumulative energy retention rate. These observations underscore the importance of layer-wise rank selection in optimizing model efficiency.

### 3.2 Trainable Matrix in the Principal Subspace

By computing the cumulative energy, we establish the criterion for top-$k$ truncation of the weight matrices. Using the truncation indices in Fig. 3, we remove as many residual ranks as possible layer-wise, thereby optimizing memory usage. The ranks retained under the chosen cumulative energy threshold represent the trade-off between performance and memory budget.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x4.png)
Figure 4: Comparison among LoRA (Hu et al., 2022), SVFit (Sun et al., 2024), SVFT (Lingam et al., 2024), and CERSA. (a) LoRA uses two low-rank matrices to approximate weight updates during fine-tuning. (b) SVFit initializes low-rank matrices through SVD of $\bm{W}$ and trains only the most significant singular values as a vector. (c) SVFT freezes the singular vectors while sparsely fine-tuning the singular values. (d) CERSA discards redundant SVD components and only trains a core matrix initialized with the most significant singular values.

For a descending sequence of $N$ singular values $\sigma_{1}^{2}\geq\sigma_{2}^{2}\geq\dots\geq\sigma_{N}^{2}$, we define two hyperparameters, $\alpha$ and $\beta$, to determine the preserved and trainable subspaces. The threshold $\alpha$ defines the retained subspace, while $\beta$ specifies the trainable portion within that subspace. $k_{\alpha}$ and $k_{\beta}$ are the smallest indices such that the cumulative sum reaches the proportions $\alpha$ and $\beta$, respectively:

$$k_{\alpha}=\min\left\{k\,\middle|\,\frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sum_{i=1}^{N}\sigma_{i}^{2}}\geq\alpha\right\},\qquad k_{\beta}=\min\left\{k\,\middle|\,\frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sum_{i=1}^{N}\sigma_{i}^{2}}\geq\beta\right\}.\qquad(2)$$

These two thresholds, $\alpha$ and $\beta$, divide the matrix into three distinct regions, as illustrated in Fig. 4(d).

1. **Discarded component.** We perform SVD on the pre-trained weights $\bm{W}$, i.e., $\bm{W}=\bm{U}\bm{S}\bm{V}^{T}$. By setting the cumulative energy retention rate $\alpha$, the least important components are discarded by truncating the $r_{3}=N-k_{\alpha}$ least significant singular vectors in $\bm{U}$ and $\bm{V}$. Unlike PiSSA (Meng et al., 2024) and SVFit (Sun et al., 2024), which keep the residual part frozen, CERSA eliminates redundant high-rank components that contribute only 5%–10% of the cumulative energy while occupying 50%–90% of the embedding dimensions, as determined by the SVD cumulative energy truncation index (Fig. 3). Most of the singular values in this discarded portion are near zero, indicating feature dimensions that are insignificant in the pre-trained weights. Although this is still a lossy compression, these high-rank components may not be well aligned or optimally parameterized for downstream fine-tuning tasks.
2. **Frozen component.** Next, we introduce another hyperparameter, $\beta$, as a threshold to compute the rank $r_{2}=k_{\alpha}-k_{\beta}$, which determines which of the remaining principal components are frozen. The value of $\beta$ reflects the trade-off between preserving more pre-trained knowledge and using additional dimensions to learn the feature distribution of downstream tasks for stronger fitting capability. In the image classification and text sequence classification tasks, we set $\beta=\alpha$, allowing all principal dimensions to participate in fine-tuning for maximum fitting performance. However, if fine-tuning prioritizes retaining pre-trained knowledge, a smaller $\beta$ can be chosen to freeze more dimensions.
3. **Trainable component.** The remaining portion, with rank $r_{1}=k_{\beta}$, is designated as trainable. Unlike SVFit (Sun et al., 2024), which focuses solely on learning the distribution of singular values, we initialize the diagonal of the $\bm{S}_{p}$ matrix with the top-$r_{1}$ singular values, set the remaining elements of the $r_{1}\times r_{1}$ matrix to zero, and make the whole matrix trainable. This approach retains the critical singular values while allowing complex linear combinations of the left and right singular vectors, enhancing the model's expressive power.

By decomposing the matrix into three components, we enable fine-tuning within a much smaller matrix, significantly reducing the number of trainable parameters. This approach substantially lowers memory consumption and computational cost while preserving model performance, as illustrated in the sketch below.
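A minimal sketch of this construction (our own PyTorch illustration; the class name `CERSALayer` and the implementation details are assumptions that may differ from the released code): truncate the SVD at $k_{\alpha}$, keep the first $k_{\beta}$ singular directions as the trainable subspace with a dense core $\bm{S}_{p}$ initialized from the top singular values, freeze the components between $k_{\beta}$ and $k_{\alpha}$, and discard the rest.

```python
import torch
import torch.nn as nn

class CERSALayer(nn.Module):
    """Sketch of a CERSA-style layer: W ~= U_p S_p V_p + frozen middle band, rest discarded."""
    def __init__(self, W: torch.Tensor, alpha: float = 0.95, beta: float = 0.95):
        super().__init__()
        U, s, Vh = torch.linalg.svd(W, full_matrices=False)
        energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
        k_alpha = int(torch.searchsorted(energy, torch.tensor(alpha))) + 1
        k_beta = int(torch.searchsorted(energy, torch.tensor(beta))) + 1

        # Trainable core S_p (r1 x r1), diagonal initialized with the top-r1 singular values
        self.S_p = nn.Parameter(torch.diag(s[:k_beta]))
        # Principal singular vectors are stored but frozen
        self.U_p = nn.Parameter(U[:, :k_beta], requires_grad=False)
        self.V_p = nn.Parameter(Vh[:k_beta, :], requires_grad=False)
        # Frozen component between k_beta and k_alpha (empty when beta == alpha)
        self.U_f = nn.Parameter(U[:, k_beta:k_alpha], requires_grad=False)
        self.s_f = nn.Parameter(s[k_beta:k_alpha], requires_grad=False)
        self.V_f = nn.Parameter(Vh[k_beta:k_alpha, :], requires_grad=False)
        # Components beyond k_alpha (rank r3 = N - k_alpha) are simply not stored

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W_eff = self.U_p @ self.S_p @ self.V_p + self.U_f @ torch.diag(self.s_f) @ self.V_f
        return x @ W_eff.T

layer = CERSALayer(torch.randn(1024, 1024), alpha=0.95, beta=0.95)
```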

### 3.3 Theoretical Analysis

CERSA fine-tunes pre-trained weights with fewer parameters by decomposing them via SVD and freezing the $\bm{U}$ and $\bm{V}$ matrices for parameter efficiency. While methods like SVFit (Sun et al., 2024) and SVFT (Lingam et al., 2024) follow a similar approach, freezing $\bm{U}$ and $\bm{V}$ limits the model's expressiveness, making them suboptimal compared to full-parameter fine-tuning. CERSA overcomes this by introducing a trainable matrix $\bm{S}_{p}$ initialized with the singular values on its diagonal, relaxing these constraints while maintaining memory efficiency. In the following, we establish the theoretical foundation of CERSA and show that its performance closely matches full-parameter fine-tuning.

For any weight matrix $\bm{W}\in\mathbb{R}^{m\times n}$ in a pre-trained model, we define its full-parameter fine-tuned counterpart on a downstream task as $\bm{W}^{\prime}$. In the previous section, we removed the bottom 5%–10% of the cumulative energy of $\bm{W}$ by performing truncated SVD, as it corresponds to insignificant features and noise in the neural network. This allows us to approximate the weight matrix as $\bm{W}\approx\bm{U}_{p}\bm{\Sigma}_{p}\bm{V}_{p}^{T}$. Similarly, the fine-tuned weight matrix can be approximated as $\bm{W}^{\prime}\approx\bm{U}_{p}^{\prime}\bm{\Sigma}_{p}^{\prime}\bm{V}_{p}^{\prime T}$.

###### Theorem 3.1.

Given a matrix $\bm{M}$ with SVD $\bm{M}=\bm{U}\bm{\Sigma}\bm{V}^{T}$, if there exists a pair of orthonormal bases $\bm{Q}=\{\bm{e}_{1},\bm{e}_{2},\dots,\bm{e}_{k}\}$ and $\bm{Q}^{\prime}=\{\bm{e}^{\prime}_{1},\bm{e}^{\prime}_{2},\dots,\bm{e}^{\prime}_{k}\}$ such that $\text{Span}(\bm{U})=\text{Span}(\bm{Q})$ and $\text{Span}(\bm{V})=\text{Span}(\bm{Q}^{\prime})$, then there exists a matrix $\bm{S}\in\mathbb{R}^{k\times k}$ such that $\bm{M}=\bm{Q}\bm{S}\bm{Q}^{\prime T}$.

Unlike SVFit (Sun et al., 2024) and SVFT (Lingam et al., 2024), which assume that $\bm{U}$ and $\bm{V}$ remain unchanged during fine-tuning, in practice $\bm{U}$ and $\bm{V}$ are likely to be updated to adapt to downstream tasks. Therefore, we propose an alternative hypothesis: rather than $\bm{U}$ and $\bm{V}$ being strictly invariant, what is truly preserved are the principal subspaces of $\bm{U}_{p}^{\prime}$ and $\bm{V}_{p}^{\prime}$. That is, the span of these sets of singular vectors remains unchanged: $\text{Span}(\bm{U}_{p}^{\prime})=\text{Span}(\bm{U}_{p})$ and $\text{Span}(\bm{V}_{p}^{\prime})=\text{Span}(\bm{V}_{p})$. According to Theorem 3.1, since the principal subspaces remain the same before and after fine-tuning, there exists a transformation matrix $\bm{S}_{p}\in\mathbb{R}^{k\times k}$ such that $\bm{W}^{\prime}\approx\bm{U}^{\prime}_{p}\bm{\Sigma}^{\prime}_{p}\bm{V}^{\prime T}_{p}=\bm{U}_{p}\bm{S}_{p}\bm{V}_{p}^{T}$. This suggests that rather than explicitly updating $\bm{U}_{p}$ and $\bm{V}_{p}$, we can freeze them and update only the intermediate matrix $\bm{S}_{p}$, which is mathematically equivalent to updating all three components $\bm{U}_{p}$, $\bm{\Sigma}_{p}$, and $\bm{V}_{p}^{T}$, effectively removing the expressiveness constraints imposed by freezing $\bm{U}$ and $\bm{V}$. The proof of this theorem is provided in the Appendix (Sec. F.2).
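A small numerical check of Theorem 3.1 (our own illustration): if a matrix shares the column and row spaces spanned by a pair of orthonormal bases, a $k\times k$ core matrix obtained by projection reconstructs it exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 32, 24, 8

# Orthonormal bases Q (column space) and Q' (row space)
U, _, Vt = np.linalg.svd(rng.standard_normal((m, n)), full_matrices=False)
Q, Qp = U[:, :k], Vt[:k, :].T

# Any matrix M whose column/row spaces lie in Span(Q) / Span(Q')
M = Q @ rng.standard_normal((k, k)) @ Qp.T

# Theorem 3.1: there exists S in R^{k x k} with M = Q S Q'^T; here S = Q^T M Q'
S = Q.T @ M @ Qp
print(np.allclose(M, Q @ S @ Qp.T))   # True
```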

To further illustrate the effect of a fully trainable matrix $\bm{S}_{p}\in\mathbb{R}^{k\times k}$, as shown in Fig. 5, we apply an additional SVD to the updated $\bm{S}_{p}$ after fine-tuning. The resulting $\bm{V}_{\bm{s}_{p}}^{T}$, an $r_{1}\times r_{1}$ rotation matrix, rotates $\bm{V}_{p}^{T}$ to adjust the spatial distribution of input features while preserving the integrity of the subspace. Similarly, $\bm{U}_{\bm{s}_{p}}$ adjusts the rotation of $\bm{U}_{p}$, concentrating key features within the intermediate space along the output directions essential for downstream tasks.

Additionally, our experiments (detailed in Appendix Sec. F.1) confirm that in most full-parameter fine-tuned downstream tasks, the principal subspaces of $\bm{W}$ and $\bm{W}^{\prime}$ exhibit a Grassmann subspace similarity (Hu et al., 2022) of 99%–99.99%. This provides strong empirical evidence supporting the assumption that the principal subspaces of $\bm{W}$ remain nearly unchanged after fine-tuning.
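One simple way to compute such a subspace overlap (following the Grassmann-style similarity used in the LoRA paper; the exact formula used in the appendix is not reproduced here, so treat this as an assumed variant):

```python
import numpy as np

def subspace_similarity(U1: np.ndarray, U2: np.ndarray) -> float:
    """Overlap between the column spaces of two orthonormal bases, in [0, 1]."""
    k1, k2 = U1.shape[1], U2.shape[1]
    return float(np.linalg.norm(U1.T @ U2, "fro") ** 2 / min(k1, k2))

# Compare the top-k left singular subspaces of W before and after a (hypothetical) update
W = np.random.randn(256, 256)
W_prime = W + 0.01 * np.random.randn(256, 256)
k = 64
U_p = np.linalg.svd(W)[0][:, :k]
U_p_prime = np.linalg.svd(W_prime)[0][:, :k]
print(subspace_similarity(U_p, U_p_prime))   # close to 1 for a small perturbation
```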

Although this approach increases the number of trainable parameters from $k$ to $k^{2}$ (compared to SVFit (Sun et al., 2024)), the space saved by compressing the pre-trained model and the resulting performance gains make this increase a justifiable cost. The larger parameterization also enables stronger adaptability, as evidenced by a faster loss decrease during fine-tuning (see Fig. 6). At the same time, the benefits of fixing $\bm{U}_{p}$ and $\bm{V}_{p}$ are preserved, since the rotation matrix preserves the input and output subspaces, changing only their basis representation. This ensures that features learned during pre-training remain intact, with only their distribution adjusted.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x5.png)
Figure 5: Training process of CERSA. The trainable core matrix $\bm{S}_{p}$ can be decomposed into $\bm{U}_{\bm{s}_{p}}$ and $\bm{V}_{\bm{s}_{p}}$, enabling fine-tuning by rotating the input and output bases without altering the subspace itself.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x6.png)
Figure 6: Loss curves when fine-tuning ViT-Large (Dosovitskiy, 2021) on CIFAR-100 (Krizhevsky and Hinton, 2009) with various methods.

### 3.4 Memory Efficiency

The primary goal of applying SVD to the pre-trained weight matrix is to reduce memory consumption during fine-tuning. Consequently, there exists a compression rank threshold $b$ such that memory savings are achieved only when $r<b$. Given a pre-trained weight matrix $\bm{W}\in\mathbb{R}^{m\times n}$, its truncated SVD produces $\bm{U}\in\mathbb{R}^{m\times r}$, $\bm{\Sigma}\in\mathbb{R}^{r\times r}$, and $\bm{V}\in\mathbb{R}^{r\times n}$, such that $\bm{W}=\bm{U}\bm{\Sigma}\bm{V}$.

| | FT | CERSA (Ours) | SVFit | SVFT | LoRA |
|---|---|---|---|---|---|
| Weights | $mn$ | $mr+nr+r^{2}$ | $2mn+m$ | $2mn+e$ | $mn+mr+nr$ |
| Gradients | $mn$ | $r^{2}$ | $r$ | $e$ | $mr+nr$ |
| Opt. states | $2mn$ | $2r^{2}$ | $2r$ | $2e$ | $2mr+2nr$ |
| Total | $4mn$ | $mr+nr+4r^{2}$ | $2mn+m+3r$ | $2mn+4e$ | $mn+4mr+4nr$ |

Table 1: Memory requirements. In SVFT, $e$ is the number of sparsified trainable parameters taken from the diagonal after SVD, where $e\ll mn$ but $e>m$.

During training, memory usage consists of the frozen matrices $\bm{U}$ and $\bm{V}$, along with the trainable matrix $\bm{S}$, requiring $\mathcal{O}(mr+nr+r^{2})$ storage. Additionally, since $\bm{S}$ is trainable, its gradient and optimizer states contribute $\mathcal{O}(r^{2})$ and $\mathcal{O}(2r^{2})$, respectively, leading to a total memory requirement of $\mathcal{O}(mr+nr+4r^{2})$. Tab. 1 compares memory costs across the different methods.

In contrast, the original weight matrix $\bm{W}$ requires $\mathcal{O}(mn)$ memory, excluding the additional $\mathcal{O}(3mn)$ for gradients and optimizer states, since we focus on reducing memory relative to the pre-trained parameters. The compression rate $c$ depends on the model's input-output dimensions, with a lower value indicating better memory efficiency: $c=\frac{mr+nr+4r^{2}}{mn}$. Based on this, we compute the compression curves for three variants of the ViT model (Base, Large, and Huge) across cumulative energy retention rates ranging from 0.8 to 0.95, as shown in Fig. 7. The calculations are performed for two target module configurations: one with all three Q, K, and V matrices and another with only the Q and V matrices. The dashed horizontal line in Fig. 7 represents the compression rate achieved by LoRA when fine-tuning only the Q and V matrices with a rank of 32. The results indicate that for smaller models, such as ViT-Base, the compression rate is less favorable, possibly due to the limited embedding dimension. However, for larger models like ViT-Large and ViT-Huge (Dosovitskiy, 2021), there is considerably more room for compression, enabling greater memory efficiency.
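A quick sketch of this compression-rate calculation (our own illustration; the hidden sizes are the standard ViT dimensions, and the retained rank is an assumed example value rather than one measured from the cumulative energy):

```python
def compression_rate(m: int, n: int, r: int) -> float:
    """c = (mr + nr + 4r^2) / (mn): total CERSA memory relative to the original weight size."""
    return (m * r + n * r + 4 * r ** 2) / (m * n)

def lora_rate(m: int, n: int, r: int = 32) -> float:
    """LoRA total from Tab. 1: (mn + 4mr + 4nr) / (mn)."""
    return (m * n + 4 * m * r + 4 * n * r) / (m * n)

# Square attention weight matrices (m = n = hidden size), assuming ~25% of ranks are kept
for name, dim in [("ViT-Base", 768), ("ViT-Large", 1024), ("ViT-Huge", 1280)]:
    r = dim // 4
    print(f"{name}: CERSA c = {compression_rate(dim, dim, r):.2f}, "
          f"LoRA (r=32) c = {lora_rate(dim, dim):.2f}")
```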

## 4 Experiments

We conduct extensive evaluations on both image classification and natural language understanding (NLU) tasks. Further experimental results, including subject-driven text-to-image generation and out-of-distribution evaluations, are provided in the appendix (see Sec. A.1 and Sec. E).

### 4.1 Experimental Setup

**Baseline selection.** For baseline comparisons, we include full-parameter fine-tuning (FT), popular PEFT methods such as LoRA (Hu et al., 2022) and PiSSA (Meng et al., 2024), and weight-decomposition-based approaches such as SVFit (Sun et al., 2024) and SVFT (Lingam et al., 2024).

**Model selection.** For image classification, we evaluate ViT-Base and ViT-Large (Dosovitskiy, 2021), pre-trained on ImageNet-21K (Deng et al., 2009). For the NLU experiments, we fine-tune DeBERTaV3-Base (He et al., 2023) to assess the fundamental capabilities of our method.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x7.png)
Figure 7: Comparison of ViT compression rates across various cumulative energy retention rates.

| Method | CIFAR-100 | RESISC45 | DTD | Average | Total Memory |
|---|---|---|---|---|---|
| CERSA (Q, V) | 94.0 | 95.8 | 82.1 | 90.6 | 1194.5 MB |
| CERSA (Q, K, V) | 94.4 | 96.1 | 82.5 | 91.0 | 1232.9 MB |
| CERSA (Q, K, V, P) | 94.5 | 96.0 | 82.6 | 91.0 | 1279.5 MB |
| CERSA (Q, K, V, P, UP, DN) | 93.8 | 94.9 | 81.6 | 90.1 | 1433.1 MB |

Table 2: Results of CERSA with various matrix-type combinations for the ViT-Large model (Dosovitskiy, 2021). For the definitions of (Q, K, V, P, UP, DN), please refer to Fig. 3.

| Method | CIFAR-100 | RESISC45 | DTD | Average | Total Memory |
|---|---|---|---|---|---|
| Top-$r_{1}$ | 93.4 | 95.6 | 81.3 | 90.1 | 1112.7 MB |
| Bottom-$r_{2}$ | 92.6 | 94.7 | 79.9 | 89.1 | 1112.7 MB |

Table 3: Results of fine-tuning the top-$r_{1}$ versus bottom-$r_{2}$ ranks for the ViT-Large model (Dosovitskiy, 2021).

| Method | CIFAR-100 | RESISC45 | DTD | Average | Total Memory |
|---|---|---|---|---|---|
| CERSA$_{\alpha=1,\beta=1}$ | 93.8 | 96.3 | 81.8 | 90.6 | 2519.6 MB |
| CERSA$_{\alpha=0.95,\beta=0.95}$ | 94.3 | 96.1 | 82.5 | 91.0 | 1232.9 MB |
| CERSA$_{\alpha=0.9,\beta=0.9}$ | 93.9 | 96.1 | 82.1 | 90.7 | 1122.5 MB |
| CERSA$_{\alpha=0.8,\beta=0.8}$ | 93.5 | 95.9 | 81.8 | 90.4 | 1020.6 MB |
| CERSA$_{\alpha=0.5,\beta=0.5}$ | 90.0 | 95.1 | 79.5 | 88.2 | 914.9 MB |
| CERSA$_{\alpha=0.95,\beta=0.9}$ | 94.0 | 96.0 | 82.5 | 90.8 | 1169.4 MB |
| CERSA$_{\alpha=0.95,\beta=0.8}$ | 93.8 | 96.1 | 82.2 | 90.7 | 1118.2 MB |
| CERSA$_{\alpha=0.95,\beta=0.5}$ | 92.9 | 95.2 | 80.3 | 89.5 | 1079.3 MB |

Table 4: Results for various cumulative energy retention rates with the ViT-Large model (Dosovitskiy, 2021) on the CIFAR-100 (Krizhevsky and Hinton, 2009), RESISC45 (Cheng et al., 2017), and DTD (Cimpoi et al., 2014) datasets.

| Method | CIFAR-100 | RESISC45 | DTD | Average | Total Memory |
|---|---|---|---|---|---|
| Layer-wise ($\alpha=\beta=0.9$) | 93.9 | 96.1 | 82.1 | 90.7 | 1122.5 MB |
| Uniform ($r=287$) | 93.7 | 95.6 | 81.4 | 90.2 | 1122.5 MB |

Table 5: Results of layer-wise versus uniform rank selection for the ViT-Large model (Dosovitskiy, 2021).

| Method | CIFAR-100 | RESISC45 | DTD | Average | Total Memory |
|---|---|---|---|---|---|
| CERSA w/ Matrix | 93.9 | 96.1 | 82.1 | 90.7 | 1122.5 MB |
| CERSA w/ Array | 93.5 | 95.2 | 81.5 | 90.0 | 1045.5 MB |

Table 6: Results of CERSA with a trainable matrix versus a trainable array for the ViT-Large model (Dosovitskiy, 2021).

**Datasets.** For image classification, we assess our method on eight diverse datasets: CIFAR-100 (Krizhevsky and Hinton, 2009), EuroSAT (Helber et al., 2019), RESISC45 (Cheng et al., 2017), StanfordCars (Krause et al., 2013), FGVC Aircraft (Maji et al., 2013), DTD (Cimpoi et al., 2014), CIFAR-10 (Krizhevsky and Hinton, 2009), and OxfordPets (Parkhi et al., 2012). These datasets span a variety of classification tasks, including general object classification, fine-grained classification, remote sensing image classification, and texture classification.

For NLU, we evaluate our method on eight datasets from the GLUE benchmark (Wang et al., 2019): MNLI, MRPC, RTE, CoLA, SST-2, QNLI, QQP, and STS-B. These datasets cover a broad spectrum of NLU tasks, including textual entailment, paraphrase detection, sentiment analysis, question-answer matching, and semantic textual similarity.

**Metrics.** For image classification, we report accuracy on all datasets. For NLU, we report overall matched and mismatched accuracy on MNLI, Matthews correlation on CoLA, Pearson correlation on STS-B, and accuracy on the remaining datasets. Higher values indicate better performance.

### 4.2 Ablation Study

**Impact of the matrix type.** We study the trade-off between performance and memory when fine-tuning different matrix types. As shown in Tab. 2, adapting Q, K, and V achieves the best balance of accuracy and efficiency. In contrast, adding P, UP, or DN increases memory cost and can even reduce performance, mainly because (i) these matrices require higher ranks to preserve the cumulative energy (Fig. 3), making them less memory-efficient, and (ii) modifying them disrupts pre-trained feature representations, leading to overfitting or weaker generalization. Thus, restricting fine-tuning to Q, K, and V provides the optimal trade-off.

**Impact of top-$r_{1}$ versus bottom-$r_{2}$ ranks.** To assess the effect of fine-tuning the major versus the residual components, we compare $\bm{S}_{p}$ trained on the top-$r_{1}$ and bottom-$r_{2}$ components (with $r_{1}=r_{2}$, $\alpha=0.95$). As shown in Tab. 3, adapting the top components consistently yields higher accuracy, validating our design. Moreover, the OOD results in the appendix (see Sec. E) show that CERSA surpasses both LoRA and FT, confirming that it preserves rather than distorts the original knowledge.

**Impact of the cumulative energy retention rate.** We investigate how different retention rates affect fine-tuning by training ViT-Large (Dosovitskiy, 2021) under various $\alpha$ and $\beta$ configurations (Tab. 4). The first five settings use $\alpha=\beta$ decreasing from 1 to 0.5, while the last three fix $\alpha=0.95$ and vary $\beta$ to adjust the trainable subspace size. Among them, CERSA$_{\alpha=0.95,\beta=0.95}$ achieves the best overall accuracy. Performance remains stable even at $\alpha=\beta\approx 0.8$ (90% of the pre-trained memory), but drops sharply when the rates are reduced to 0.5 or lower.

**Impact of layer-wise versus uniform rank selection.** We compare layer-wise CERSA, which selects singular values according to each layer's cumulative energy retention rate, with uniform CERSA, which fixes the rank at 287 for all layers. As shown in Tab. 5, despite identical memory consumption, layer-wise CERSA consistently outperforms the uniform variant, demonstrating the effectiveness of exploiting layer-specific retention rates.

**Impact of tuning a matrix versus an array.** We analyze the impact of defining $\bm{S}$ as either a matrix or an array (as in SVFit (Sun et al., 2024)) on fine-tuning performance under the same CERSA configuration ($\alpha=0.9$, $\beta=0.9$) for Q, K, and V. As shown in Tab. 6, although arrays significantly reduce memory usage, they cause a substantial drop in performance, highlighting that matrix-based fine-tuning has much greater expressiveness.

| Method | Memory | CIFAR-100 | EuroSAT | RESISC45 | StanfordCars | FGVC-Aircraft | DTD | CIFAR-10 | OxfordPets | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| FT | 4629.8 MB | 93.6 | 99.0 | 96.4 | 88.9 | 68.3 | 81.8 | 99.2 | 94.4 | 90.2 |
| LoRA | 1229.9 MB | 94.9 | 99.0 | 94.7 | 80.3 | 54.5 | 81.5 | 99.1 | 94.8 | 87.4 |
| PiSSA | 1229.9 MB | 93.6 | 98.6 | 95.7 | 86.7 | 62.6 | 81.8 | 98.8 | 95.7 | 89.2 |
| SVFit | 1351.5 MB | 93.9 | 98.7 | 95.2 | 83.3 | 57.8 | 81.5 | 99.3 | 93.4 | 88.7 |
| SVFT | 1355.8 MB | 93.7 | 98.8 | 95.4 | 84.9 | 63.7 | 82.3 | 99.3 | 93.3 | 88.9 |
| CERSA | 1232.3 MB | 94.3 | 99.1 | 96.1 | 87.6 | 71.1 | 82.5 | 99.3 | 94.9 | 90.6 |

Table 7: Comparison of various fine-tuning methods on eight image classification datasets using ViT-Large (Dosovitskiy, 2021). Methods include LoRA (Hu et al., 2022), PiSSA (Meng et al., 2024), SVFit (Sun et al., 2024), and SVFT (Lingam et al., 2024). Bold scores indicate the highest accuracy among PEFT methods, while underlined scores indicate that full-parameter fine-tuning (FT) achieves the best performance.

| Method | Memory | MNLI | MRPC | STS-B | RTE | SST-2 | QNLI | QQP | CoLA | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| FT | 2814.0 MB | 89.9 | 89.5 | 91.6 | 83.8 | 95.6 | 94.0 | 92.4 | 69.2 | 88.3 |
| LoRA | 730.5 MB | 90.7 | 90.0 | 91.6 | 85.2 | 95.0 | 93.9 | 92.0 | 69.8 | 88.5 |
| PiSSA | 730.5 MB | 90.4 | 91.7 | 91.9 | 87.0 | 95.9 | 94.3 | 92.3 | 72.6 | 89.5 |
| SVFit | 1096.3 MB | 89.7 | 88.8 | 91.8 | 87.4 | 95.4 | 94.3 | 90.2 | 71.0 | 88.6 |
| SVFT | 1108.6 MB | 90.0 | 89.0 | 91.8 | 87.2 | 95.4 | 94.3 | 91.5 | 72.6 | 89.0 |
| CERSA | 728.6 MB | 90.3 | 92.0 | 91.7 | 87.0 | 96.0 | 94.4 | 92.4 | 72.3 | 89.5 |

Table 8: Comparison of different methods on the GLUE benchmark using the DeBERTaV3-Base model (He et al., 2023). Methods include LoRA (Hu et al., 2022), PiSSA (Meng et al., 2024), SVFit (Sun et al., 2024), and SVFT (Lingam et al., 2024).

### 4.3 Comparison on Image Classification Tasks

Experimental results for the ViT-Large model (Dosovitskiy, 2021) are presented in Tab. 7. With ViT-Large, CERSA achieves an average accuracy of 90.6%, outperforming full-parameter fine-tuning (90.2%) and significantly surpassing other PEFT methods such as SVFT (Lingam et al., 2024) (88.9%) and SVFit (Sun et al., 2024) (88.7%). Notably, CERSA excels on fine-grained classification tasks, particularly on datasets like StanfordCars (Krause et al., 2013) and FGVC Aircraft (Maji et al., 2013), highlighting its capability to capture intricate details. In addition, CERSA matches or exceeds full-parameter fine-tuning on general datasets such as CIFAR-100 (Krizhevsky and Hinton, 2009), EuroSAT (Helber et al., 2019), and RESISC45 (Cheng et al., 2017), demonstrating its strong generalization and adaptability across diverse tasks.

### 4.4 Comparison on NLU Tasks

Tab. 8 compares fine-tuning strategies on eight GLUE datasets using DeBERTaV3-Base (He et al., 2023). CERSA achieves the highest average score (89.5%), outperforming both full-parameter fine-tuning and other PEFT methods, and sets new state-of-the-art results on multiple datasets, including MRPC, SST-2, QNLI, and QQP. On the remaining tasks, it attains competitive performance relative to the state-of-the-art methods PiSSA (Meng et al., 2024), SVFit (Sun et al., 2024), and SVFT (Lingam et al., 2024). This highlights the effectiveness of CERSA in leveraging pre-trained representations with minimal computational overhead, making it a strong choice for NLU tasks.

## 5 Conclusion

We propose CERSA, a memory- and parameter-efficient fine-tuning method that performs layer-wise rank selection based on the cumulative energy retention of pre-trained weights, enabling adaptation within the principal subspace. We prove that CERSA achieves performance comparable to full fine-tuning, and extensive experiments on image classification and language understanding show that it outperforms or matches state-of-the-art PEFT methods while reducing memory.

**Limitation and Future Work.** Since CERSA constrains fine-tuning to the principal subspace of the pre-trained weights, its performance may degrade when the downstream task deviates significantly from the knowledge captured during pre-training. In future work, we plan to extend CERSA by dynamically adjusting its learned subspace during fine-tuning, thereby enhancing its adaptability and performance across a broader range of downstream tasks.

## References

- LaMDA: large model fine\-tuning via spectrally decomposed low\-dimensional adaptation\.InEMNLP,pp\. 9635–9646\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p3.1)\.
- K\. Bałazy, M\. Banaei, K\. Aberer, and J\. Tabor \(2024\)LoRA\-XS: low\-rank adaptation with extremely small number of parameters\.arXiv preprint arXiv:2405\.17604\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p2.1)\.
- G\. Cheng, J\. Han, and X\. Lu \(2017\)Remote sensing image scene classification: benchmark and state of the art\.Proceedings of the IEEE105\(10\),pp\. 1865–1883\.Cited by:[Table 15](https://arxiv.org/html/2605.08174#A5.T15),[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1),[Table 4](https://arxiv.org/html/2605.08174#S4.T4)\.
- M\. Cimpoi, S\. Maji, I\. Kokkinos, S\. Mohamed, and A\. Vedaldi \(2014\)Describing textures in the wild\.InCVPR,pp\. 3606–3613\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[Figure 9](https://arxiv.org/html/2605.08174#A4.F9),[Table 15](https://arxiv.org/html/2605.08174#A5.T15),[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1),[Appendix E](https://arxiv.org/html/2605.08174#A5.p3.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1),[Table 4](https://arxiv.org/html/2605.08174#S4.T4)\.
- J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei \(2009\)ImageNet: a large\-scale hierarchical image database\.InCVPR,pp\. 248–255\.Cited by:[Table 16](https://arxiv.org/html/2605.08174#A5.T16),[§F\.1](https://arxiv.org/html/2605.08174#A6.SS1.p1.3),[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[Figure 3](https://arxiv.org/html/2605.08174#S3.F3),[§3\.1](https://arxiv.org/html/2605.08174#S3.SS1.p4.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p2.1)\.
- A\. Dosovitskiy \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InICLR,Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p2.3),[Table 10](https://arxiv.org/html/2605.08174#A1.T10),[Table 9](https://arxiv.org/html/2605.08174#A1.T9),[Table 11](https://arxiv.org/html/2605.08174#A2.T11),[Table 12](https://arxiv.org/html/2605.08174#A2.T12),[Appendix B](https://arxiv.org/html/2605.08174#A2.p2.1),[Figure 9](https://arxiv.org/html/2605.08174#A4.F9),[Appendix D](https://arxiv.org/html/2605.08174#A4.p2.1),[Table 16](https://arxiv.org/html/2605.08174#A5.T16),[Figure 11](https://arxiv.org/html/2605.08174#A6.F11),[Figure 12](https://arxiv.org/html/2605.08174#A6.F12),[§F\.1](https://arxiv.org/html/2605.08174#A6.SS1.p1.3),[§F\.1](https://arxiv.org/html/2605.08174#A6.SS1.p6.5),[Figure 1](https://arxiv.org/html/2605.08174#S1.F1),[Figure 2](https://arxiv.org/html/2605.08174#S1.F2),[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p4.1),[Figure 3](https://arxiv.org/html/2605.08174#S3.F3),[Figure 6](https://arxiv.org/html/2605.08174#S3.F6),[§3\.1](https://arxiv.org/html/2605.08174#S3.SS1.p4.1),[§3\.4](https://arxiv.org/html/2605.08174#S3.SS4.p3.5),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p2.1),[§4\.2](https://arxiv.org/html/2605.08174#S4.SS2.p3.7),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1),[Table 2](https://arxiv.org/html/2605.08174#S4.T2),[Table 3](https://arxiv.org/html/2605.08174#S4.T3),[Table 4](https://arxiv.org/html/2605.08174#S4.T4),[Table 5](https://arxiv.org/html/2605.08174#S4.T5),[Table 6](https://arxiv.org/html/2605.08174#S4.T6),[Table 7](https://arxiv.org/html/2605.08174#S4.T7)\.
- C\. Eckart and G\. Young \(1936\)The approximation of one matrix by another of lower rank\.Psychometrika1\(3\),pp\. 211–218\.Cited by:[§3\.1](https://arxiv.org/html/2605.08174#S3.SS1.p1.7),[§3\.1](https://arxiv.org/html/2605.08174#S3.SS1.p2.10)\.
- Y\. Gu, X\. Han, Z\. Liu, and M\. Huang \(2022\)PPT: pre\-trained prompt tuning for few\-shot learning\.InACL,pp\. 8410–8423\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p3.1)\.
- L\. Han, Y\. Li, H\. Zhang, P\. Milanfar, D\. Metaxas, and F\. Yang \(2023\)SVDiff: compact parameter space for diffusion fine\-tuning\.InICCV,pp\. 7323–7334\.Cited by:[§2\.2](https://arxiv.org/html/2605.08174#S2.SS2.p1.8),[§3](https://arxiv.org/html/2605.08174#S3.p1.1)\.
- P\. He, J\. Gao, and W\. Chen \(2023\)DeBERTaV3: improving DeBERTa using ELECTRA\-style pre\-training with gradient\-disentangled embedding sharing\.InICLR,Cited by:[Table 13](https://arxiv.org/html/2605.08174#A2.T13),[Appendix B](https://arxiv.org/html/2605.08174#A2.p6.1),[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p2.1),[§4\.4](https://arxiv.org/html/2605.08174#S4.SS4.p1.1),[Table 8](https://arxiv.org/html/2605.08174#S4.T8)\.
- P\. Helber, B\. Bischke, A\. Dengel, and D\. Borth \(2019\)EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification\.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing12\(7\),pp\. 2217–2226\.Cited by:[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1)\.
- D\. Hendrycks and K\. Gimpel \(2017\)A baseline for detecting misclassified and out\-of\-distribution examples in neural networks\.InICLR,Cited by:[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1)\.
- J\. Hessel, A\. Holtzman, M\. Forbes, R\. L\. Bras, and Y\. Choi \(2022\)CLIPScore: a reference\-free evaluation metric for image captioning\.External Links:2104\.08718,[Link](https://arxiv.org/abs/2104.08718)Cited by:[§A\.1](https://arxiv.org/html/2605.08174#A1.SS1.SSS0.Px1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InICLR,Cited by:[Figure 8](https://arxiv.org/html/2605.08174#A1.F8),[§A\.1](https://arxiv.org/html/2605.08174#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[Table 9](https://arxiv.org/html/2605.08174#A1.T9),[Table 14](https://arxiv.org/html/2605.08174#A2.T14.1.1.1.1),[Table 14](https://arxiv.org/html/2605.08174#A2.T14.2.2.2.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p2.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p4.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p5.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p6.1),[Appendix C](https://arxiv.org/html/2605.08174#A3.p3.1),[Appendix C](https://arxiv.org/html/2605.08174#A3.p4.3),[Appendix D](https://arxiv.org/html/2605.08174#A4.p2.1),[Table 15](https://arxiv.org/html/2605.08174#A5.T15),[Table 15](https://arxiv.org/html/2605.08174#A5.T15.1.1.1.3),[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1),[Appendix E](https://arxiv.org/html/2605.08174#A5.p3.1),[§F\.1](https://arxiv.org/html/2605.08174#A6.SS1.p2.1),[1st item](https://arxiv.org/html/2605.08174#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6),[§1](https://arxiv.org/html/2605.08174#S1.p3.1),[§1](https://arxiv.org/html/2605.08174#S1.p4.1),[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p5.1),[§2\.2](https://arxiv.org/html/2605.08174#S2.SS2.p2.2),[Figure 4](https://arxiv.org/html/2605.08174#S3.F4),[§3\.3](https://arxiv.org/html/2605.08174#S3.SS3.p5.4),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p1.1),[Table 7](https://arxiv.org/html/2605.08174#S4.T7),[Table 8](https://arxiv.org/html/2605.08174#S4.T8)\.
- A\. Jaiswal, L\. Yin, Z\. Zhang, S\. Liu, J\. Zhao, Y\. Tian, and Z\. Wang \(2024\)From GaLore to WeLore: how low\-rank weights non\-uniformly emerge from low\-rank gradients\.arXiv preprint arXiv:2310\.01382\.Cited by:[§2\.2](https://arxiv.org/html/2605.08174#S2.SS2.p1.8)\.
- I\. T\. Jolliffe \(2002\)Principal component analysis for special types of data\.Springer\.Cited by:[§3\.1](https://arxiv.org/html/2605.08174#S3.SS1.p1.7)\.
- D\. J\. Kopiczko, T\. Blankevoort, and Y\. M\. Asano \(2024\)VeRA: vector\-based random matrix adaptation\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6),[§1](https://arxiv.org/html/2605.08174#S1.p3.1)\.
- J\. Krause, M\. Stark, J\. Deng, and L\. Fei\-Fei \(2013\)3D object representations for fine\-grained categorization\.InICCV Workshops,pp\. 554–561\.Cited by:[Table 15](https://arxiv.org/html/2605.08174#A5.T15),[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1),[Appendix E](https://arxiv.org/html/2605.08174#A5.p3.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1)\.
- A\. Krizhevsky and G\. Hinton \(2009\)Learning multiple layers of features from tiny images\.Master’s thesis, University of Toronto\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[Table 15](https://arxiv.org/html/2605.08174#A5.T15),[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1),[Appendix E](https://arxiv.org/html/2605.08174#A5.p3.1),[§F\.1](https://arxiv.org/html/2605.08174#A6.SS1.p6.5),[Figure 6](https://arxiv.org/html/2605.08174#S3.F6),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1),[Table 4](https://arxiv.org/html/2605.08174#S4.T4)\.
- A\. Kumar, A\. Raghunathan, R\. M\. Jones, T\. Ma, and P\. Liang \(2022\)Fine\-tuning can distort pretrained features and underperform out\-of\-distribution\.InICLR,Cited by:[Appendix E](https://arxiv.org/html/2605.08174#A5.p2.1)\.
- N\. Kumari, B\. Zhang, R\. Zhang, E\. Shechtman, and J\. Zhu \(2022\)Multi\-concept customization of text\-to\-image diffusion\.InCVPR,pp\. 1931–1941\.Cited by:[§A\.1](https://arxiv.org/html/2605.08174#A1.SS1.p1.1)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.InEMNLP,pp\. 3045–3059\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p2.6)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-Tuning: optimizing continuous prompts for generation\.InACL and IJCNLP,pp\. 4582–4597\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p2.6)\.
- T\. Lin, M\. Maire, S\. Belongie, J\. Hays, P\. Perona, D\. Ramanan, P\. Dollár, and C\. L\. Zitnick \(2014\)Microsoft COCO: common objects in context\.InECCV,pp\. 740–755\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1)\.
- V\. C\. Lingam, A\. Neerkaje, A\. Vavre, A\. Shetty, G\. K\. Gudur, J\. Ghosh, E\. Choi, A\. Dimakis, A\. Bojchevski, and S\. Sanghavi \(2024\)SVFT: parameter\-efficient fine\-tuning with singular vectors\.InNeurIPS,pp\. 41425–41446\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[Table 9](https://arxiv.org/html/2605.08174#A1.T9),[Table 14](https://arxiv.org/html/2605.08174#A2.T14.7.7.11.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p3.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p6.1),[Appendix C](https://arxiv.org/html/2605.08174#A3.p5.1),[§1](https://arxiv.org/html/2605.08174#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.08174#S2.SS2.p1.8),[§2\.2](https://arxiv.org/html/2605.08174#S2.SS2.p2.2),[Figure 4](https://arxiv.org/html/2605.08174#S3.F4),[§3\.3](https://arxiv.org/html/2605.08174#S3.SS3.p1.5),[§3\.3](https://arxiv.org/html/2605.08174#S3.SS3.p3.19),[§3](https://arxiv.org/html/2605.08174#S3.p1.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p1.1),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1),[§4\.4](https://arxiv.org/html/2605.08174#S4.SS4.p1.1),[Table 7](https://arxiv.org/html/2605.08174#S4.T7),[Table 8](https://arxiv.org/html/2605.08174#S4.T8)\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\)DoRA: weight\-decomposed low\-rank adaptation\.InICML,pp\. 32100–32121\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p2.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.InICLR,Cited by:[Appendix B](https://arxiv.org/html/2605.08174#A2.p3.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p5.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p6.1)\.
- S\. Maji, E\. Rahtu, J\. Kannala, M\. Blaschko, and A\. Vedaldi \(2013\)Fine\-grained visual classification of aircraft\.arXiv preprint arXiv:1306\.5151\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1)\.
- F\. Meng, Z\. Wang, and M\. Zhang \(2024\)PiSSA: principal singular values and singular vectors adaptation of large language models\.InNeurIPS,pp\. 121038–121072\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[Table 9](https://arxiv.org/html/2605.08174#A1.T9),[Appendix B](https://arxiv.org/html/2605.08174#A2.p2.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p6.1),[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6),[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.08174#S3.SS1.p1.7),[§3\.2](https://arxiv.org/html/2605.08174#S3.SS2.p2.30),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p1.1),[§4\.4](https://arxiv.org/html/2605.08174#S4.SS4.p1.1),[Table 7](https://arxiv.org/html/2605.08174#S4.T7),[Table 8](https://arxiv.org/html/2605.08174#S4.T8)\.
- F\. Paischer, L\. Hauzenberger, T\. Schmied, B\. Alkin, M\. P\. Deisenroth, and S\. Hochreiter \(2024\)One initialization to rule them all: fine\-tuning via explained variance adaptation\.arXiv preprint arXiv:2410\.07170\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p3.1),[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p4.1)\.
- O\. M\. Parkhi, A\. Vedaldi, A\. Zisserman, and C\. Jawahar \(2012\)Cats and dogs\.InCVPR,pp\. 3498–3505\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p3.1)\.
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga,et al\.\(2019\)PyTorch: an imperative style, high\-performance deep learning library\.InNeurIPS,pp\. 8026–8037\.Cited by:[Appendix B](https://arxiv.org/html/2605.08174#A2.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InICML,pp\. 8748–8763\.Cited by:[Appendix B](https://arxiv.org/html/2605.08174#A2.p5.1)\.
- S\. Rebuffi, H\. Bilen, and A\. Vedaldi \(2017\)Learning multiple visual domains with residual adapters\.InNeurIPS,pp\. 506–516\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p2.6)\.
- P\. Ren, C\. Shi, S\. Wu, M\. Zhang, Z\. Ren, M\. de Rijke, Z\. Chen, and J\. Pei \(2024\)Mini\-ensemble low\-rank adapters for parameter\-efficient fine\-tuning\.arXiv preprint arXiv:2402\.17263\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6),[§1](https://arxiv.org/html/2605.08174#S1.p3.1)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.InCVPR,pp\. 10684–10695\.Cited by:[Appendix B](https://arxiv.org/html/2605.08174#A2.p4.1)\.
- O\. Ronneberger, P\. Fischer, and T\. Brox \(2015\)U\-net: convolutional networks for biomedical image segmentation\.InMedical image computing and computer\-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5\-9, 2015, proceedings, part III 18,pp\. 234–241\.Cited by:[Appendix B](https://arxiv.org/html/2605.08174#A2.p5.1)\.
- N\. Ruiz, Y\. Li, V\. Jampani, Y\. Pritch, M\. Rubinstein, and K\. Aberman \(2023\)DreamBooth: fine tuning text\-to\-image diffusion models for subject\-driven generation\.InCVPR,pp\. 22500–22510\.Cited by:[Figure 8](https://arxiv.org/html/2605.08174#A1.F8),[§A\.1](https://arxiv.org/html/2605.08174#A1.SS1.p1.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p4.1)\.
- R\. Shuttleworth, J\. Andreas, A\. Torralba, and P\. Sharma \(2024\)LoRA vs full fine\-tuning: an illusion of equivalence\.arXiv preprint arXiv:2410\.21228\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p3.1)\.
- C\. Sun, J\. Wei, Y\. Wu, Y\. Shi, S\. He, Z\. Ma, N\. Xie, and Y\. Yang \(2024\)SVFit: parameter\-efficient fine\-tuning of large pre\-trained models using singular values\.arXiv preprint arXiv:2409\.05926\.Cited by:[§A\.2](https://arxiv.org/html/2605.08174#A1.SS2.p1.1),[Table 9](https://arxiv.org/html/2605.08174#A1.T9),[Table 14](https://arxiv.org/html/2605.08174#A2.T14.7.7.10.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p3.1),[Appendix B](https://arxiv.org/html/2605.08174#A2.p6.1),[Appendix C](https://arxiv.org/html/2605.08174#A3.p3.1),[Appendix C](https://arxiv.org/html/2605.08174#A3.p5.1),[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6),[§1](https://arxiv.org/html/2605.08174#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.08174#S2.SS2.p1.8),[Figure 4](https://arxiv.org/html/2605.08174#S3.F4),[§3\.2](https://arxiv.org/html/2605.08174#S3.SS2.p2.30),[§3\.3](https://arxiv.org/html/2605.08174#S3.SS3.p1.5),[§3\.3](https://arxiv.org/html/2605.08174#S3.SS3.p3.19),[§3\.3](https://arxiv.org/html/2605.08174#S3.SS3.p6.4),[§3](https://arxiv.org/html/2605.08174#S3.p1.1),[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.08174#S4.SS2.p5.2),[§4\.3](https://arxiv.org/html/2605.08174#S4.SS3.p1.1),[§4\.4](https://arxiv.org/html/2605.08174#S4.SS4.p1.1),[Table 7](https://arxiv.org/html/2605.08174#S4.T7),[Table 8](https://arxiv.org/html/2605.08174#S4.T8)\.
- M\. Valipour, M\. Rezagholizadeh, I\. Kobyzev, and A\. Ghodsi \(2023\)DyLoRA: parameter efficient tuning of pre\-trained models using dynamic search\-free low\-rank adaptation\.InConference of the European Chapter of the Association for Computational Linguistics,pp\. 3266–3279\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p3.1)\.
- A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2019\)GLUE: a multi\-task benchmark and analysis platform for natural language understanding\.InICLR,Cited by:[§4\.1](https://arxiv.org/html/2605.08174#S4.SS1.p4.1)\.
- H\. Wang, Y\. Li, S\. Wang, G\. Chen, and Y\. Chen \(2024a\)MiLoRA: harnessing minor singular components for parameter\-efficient LLM finetuning\.arXiv preprint arXiv:2406\.09044\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p2.1)\.
- S\. Wang, L\. Yu, and J\. Li \(2024b\)LoRA\-GA: low\-rank adaptation with gradient approximation\.InNeurIPS,pp\. 54905–54931\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p3.1)\.
- X\. Wang, T\. Chen, Q\. Ge, H\. Xia, R\. Bao, R\. Zheng, Q\. Zhang, T\. Gui, and X\. Huang \(2023\)Orthogonal subspace learning for language model continual learning\.InEMNLP,pp\. 10658–10671\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p2.1)\.
- Z\. Wang, J\. Liang, R\. He, Z\. Wang, and T\. Tan \(2024c\)LoRA\-Pro: are low\-rank adapters properly optimized?\.arXiv preprint arXiv:2407\.18242\.Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p3.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. L\. Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. M\. Rush \(2020\)Transformers: state\-of\-the\-art natural language processing\.InEMNLP,pp\. 38–45\.Cited by:[Appendix B](https://arxiv.org/html/2605.08174#A2.p1.1)\.
- L\. Zhang, L\. Zhang, S\. Shi, X\. Chu, and B\. Li \(2023a\)LoRA\-FA: memory\-efficient low\-rank adaptation for large language models fine\-tuning\.arXiv preprint arXiv:2308\.03303\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6)\.
- Q\. Zhang, M\. Chen, A\. Bukharin, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023b\)Adaptive budget allocation for parameter\-efficient fine\-tuning\.InICLR,Cited by:[§2\.1](https://arxiv.org/html/2605.08174#S2.SS1.p4.1)\.
- B\. Zi, X\. Qi, L\. Wang, J\. Wang, K\. Wong, and L\. Zhang \(2023\)Delta\-LoRA: fine\-tuning high\-rank parameters with the delta of low\-rank matrices\.arXiv preprint arXiv:2309\.02411\.Cited by:[§1](https://arxiv.org/html/2605.08174#S1.p1.1),[§1](https://arxiv.org/html/2605.08174#S1.p2.6),[§1](https://arxiv.org/html/2605.08174#S1.p3.1)\.

## Appendix

## Appendix A Additional Experimental Results

### A\.1 Subject\-driven Text\-to\-Image Generation

![Refer to caption](https://arxiv.org/html/2605.08174v1/x8.png)

Figure 8: Visual comparison of images generated by the subject\-driven fine\-tuned diffusion model using the proposed CERSA, LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), and DreamBooth \(Ruiz et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib15)\)\.

For subject\-driven text\-to\-image generation, we fine\-tune models on selected samples from the DreamBooth \(Ruiz et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib15)\) and CustomConcept101 \(Kumari et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib16)\) datasets\. Each subject sample contains 5 to 6 images captured from different angles and in different contexts\. We compare full\-parameter fine\-tuning \(FT\), LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), and our proposed CERSA on this task\.

As shown in [Fig\. 8](https://arxiv.org/html/2605.08174#A1.F8), we evaluate subject\-driven generation across multiple domains, including scene composition, material modification, and artistic style transfer:

- **Scene composition\.** When placing a sports car in front of the Eiffel Tower or on a New York street, CERSA captures both background details and subject fidelity more accurately than LoRA, producing results that more closely resemble FT\.
- **Material modification\.** Applying glass and silver textures to a duck toy highlights CERSA’s strength: it preserves the subject’s original shape and features while achieving consistent material transfer\. In contrast, LoRA and FT often distort shapes or fail to maintain color and material consistency\.
- **Style transfer\.** When adapting a dog’s image to the styles of Vincent van Gogh and Leonardo da Vinci, all three methods demonstrate recognizable style transfer, but CERSA produces visuals that align more closely with FT while avoiding artifacts\.

#### Quantitative comparison\.

We use CLIPScore \(Hessel et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib60)\) to assess prompt\-image alignment\. CERSA achieves the highest average CLIPScore \(32\.75\), outperforming LoRA \(31\.88\) and FT \(32\.35\), indicating better generation quality\.

Overall, these results demonstrate that CERSA achieves high\-quality subject\-driven image generation, consistently surpassing LoRA and closely matching or even exceeding the performance of full\-parameter fine\-tuning, while being significantly more memory\- and parameter\-efficient\.

### A\.2 Image Classification

| Method | CIFAR-100 | EuroSAT | RESISC45 | StanfordCars | FGVC-Aircraft | DTD | CIFAR-10 | OxfordPets | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FT | 92.4 | 99.1 | 96.1 | 79.8 | 54.8 | 77.7 | 98.9 | 93.1 | 86.5 |
| LoRA | 92.0 | 98.4 | 92.7 | 45.5 | 25.2 | 75.0 | 98.8 | 93.1 | 77.6 |
| PiSSA | 91.2 | 98.7 | 95.5 | 67.1 | 47.6 | 78.7 | 98.6 | 95.9 | 84.2 |
| SVFit | 91.6 | 98.6 | 93.0 | 67.2 | 47.9 | 80.5 | 98.8 | 92.3 | 83.7 |
| SVFT | 91.2 | 98.5 | 92.4 | 67.5 | 56.2 | 79.8 | 98.7 | 92.5 | 84.6 |
| CERSA | 92.1 | 98.9 | 95.6 | 83.9 | 68.2 | 81.2 | 98.8 | 93.2 | 89.0 |

Table 9: Comparison of various fine\-tuning methods on eight image classification datasets using ViT\-Base \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\)\. Methods include LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), PiSSA \(Meng et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib9)\), SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\), and SVFT \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\)\. Bold scores \(in the original table\) indicate the highest accuracy among PEFT methods, while underlined scores indicate that full\-parameter fine\-tuning \(FT\) achieves the best performance\.

| Setting | CIFAR-100 | EuroSAT | RESISC45 | StanfordCars | FGVC-Aircraft | DTD | CIFAR-10 | OxfordPets | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CERSA ($\alpha=1$, $\beta=1$) | 91.3 | 97.6 | 85.5 | 72.8 | 64.6 | 78.9 | 98.6 | 92.3 | 85.2 |
| CERSA ($\alpha=0.95$, $\beta=0.95$) | 92.1 | 98.9 | 95.6 | 83.9 | 68.2 | 81.2 | 98.8 | 93.2 | 89.0 |
| CERSA ($\alpha=0.9$, $\beta=0.9$) | 92.1 | 98.6 | 95.3 | 83.5 | 68.2 | 80.3 | 98.6 | 93.2 | 88.7 |
| CERSA ($\alpha=0.8$, $\beta=0.8$) | 91.1 | 98.1 | 94.9 | 80.2 | 67.4 | 78.1 | 98.5 | 92.6 | 87.6 |
| CERSA ($\alpha=0.5$, $\beta=0.5$) | 83.3 | 95.4 | 91.7 | 62.9 | 50.7 | 67.4 | 96.3 | 87.7 | 79.4 |
| CERSA ($\alpha=0.95$, $\beta=0.95$) | 92.1 | 98.9 | 95.6 | 83.9 | 68.2 | 81.2 | 98.8 | 93.2 | 89.0 |
| CERSA ($\alpha=0.95$, $\beta=0.9$) | 92.1 | 98.6 | 95.3 | 83.7 | 69.8 | 80.5 | 98.7 | 93.1 | 88.9 |
| CERSA ($\alpha=0.95$, $\beta=0.8$) | 92.3 | 98.1 | 95.1 | 81.5 | 70.4 | 79.0 | 98.6 | 93.2 | 88.5 |
| CERSA ($\alpha=0.95$, $\beta=0.5$) | 91.5 | 95.4 | 94.3 | 75.8 | 60.2 | 77.6 | 98.4 | 93.2 | 85.8 |

Table 10: Evaluation results of CERSA on eight image classification datasets under different $\alpha$ and $\beta$ settings using ViT\-Base \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\)\.

In addition to evaluating CERSA on ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\), we also test it on ViT\-Base \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\)\. With ViT\-Base \([Tab\. 9](https://arxiv.org/html/2605.08174#A1.T9)\), CERSA achieves an average accuracy of 89\.0% across the eight datasets, outperforming full\-parameter fine\-tuning \(FT\) \(86\.5%\) and significantly surpassing other PEFT methods such as LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\) \(77\.6%\), PiSSA \(Meng et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib9)\) \(84\.2%\), SVFT \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\) \(84\.6%\), and SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\) \(83\.7%\)\. It also excels on fine\-grained classification tasks, particularly Stanford Cars and FGVC\-Aircraft \(Maji et al\.,[2013](https://arxiv.org/html/2605.08174#bib.bib34)\), and matches or exceeds FT on general datasets such as CIFAR\-10 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\), Oxford Pets \(Parkhi et al\.,[2012](https://arxiv.org/html/2605.08174#bib.bib29)\), and DTD \(Cimpoi et al\.,[2014](https://arxiv.org/html/2605.08174#bib.bib30)\), demonstrating strong generalization\. Compared with ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\), the performance advantage of our method is more pronounced on ViT\-Base, although the compression rate is lower than on the larger model\.

Additionally, in [Tab\. 10](https://arxiv.org/html/2605.08174#A1.T10), we evaluate the performance of all image classification tasks on ViT\-Base \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) under different settings of the cumulative energy retention ratio\. In the first set of experiments, we set $\alpha=\beta$, which means that the entire principal subspace corresponding to the retained cumulative energy is fine\-tuned\. We test performance under different cumulative energy retention ratios \{0\.95, 0\.9, 0\.8, 0\.5\}\. In the second set of experiments, we fix $\alpha=0.95$ and examine the performance of fine\-tuning only a portion of the principal subspace, with $\beta$ set to 0\.9, 0\.8, and 0\.5, respectively\.

In the first set of experiments, we observe that the average performance drop is minimal \(only 1\.4%\) when the cumulative energy retention ratio ranges from 0\.95 down to 0\.8; only when the ratio is reduced to 0\.5 does a significant decline occur\. This indicates that we have ample room to trade a slight performance loss for a substantial reduction in overall memory consumption\. In the second set of experiments, we find that reducing $\beta$ from 0\.95 to 0\.8 results in a performance drop of only 0\.5%, and even at $\beta=0.5$ the decrease is limited to 3\.2%\. This suggests that fine\-tuning only the most principal part of the preserved subspace yields a more parameter\-efficient approach while incurring only a minor performance loss\.
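To make the role of the retention ratios concrete, the following sketch \(ours, not the released code; the helper name `energy_rank` and the example weight shape are illustrative\) shows how the rank implied by a cumulative energy ratio can be computed from the singular values:

```python
import torch

def energy_rank(singular_values: torch.Tensor, ratio: float) -> int:
    """Smallest k such that the top-k singular values retain at least
    `ratio` of the spectral energy (sum of squared singular values)."""
    energy = singular_values ** 2
    cum = torch.cumsum(energy, dim=0) / energy.sum()
    k = int((cum < ratio).sum().item()) + 1
    return min(k, singular_values.numel())

# Illustrative example with a stand-in weight matrix (not the actual ViT weights).
W = torch.randn(1024, 1024)
_, S, _ = torch.linalg.svd(W, full_matrices=False)
k_alpha = energy_rank(S, 0.95)   # size of the retained principal subspace
k_beta = energy_rank(S, 0.90)    # smaller trainable portion of that subspace
print(f"retain top-{k_alpha} components, fine-tune top-{k_beta}")
```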

## Appendix B More Implementation Details

**Experimental Environment\.** All experiments were conducted on an NVIDIA L40 GPU using the PyTorch framework \(Paszke et al\.,[2019](https://arxiv.org/html/2605.08174#bib.bib37)\) and Hugging Face’s Transformers library \(Wolf et al\.,[2020](https://arxiv.org/html/2605.08174#bib.bib38)\) for fine\-tuning\.

**Settings for Image Classification\.** For image classification, we fine\-tune ViT\-Base and ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) on the Query \(Q\), Key \(K\), and Value \(V\) matrices within the attention module\. In our method, CERSA, we set a cumulative energy retention rate of $\alpha=\beta=0.95$ across all fine\-tuning tasks\. For comparison, we configure LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\) and PiSSA \(Meng et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib9)\) with a rank of 32, a commonly chosen value that balances performance and the number of trainable parameters\.

For SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\), we adhere to the configuration recommended in the original paper, using a rank of 768 for all models\. Similarly, for SVFT \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\), we adopt its best\-performing settings\. We use the AdamW optimizer \(Loshchilov and Hutter,[2017](https://arxiv.org/html/2605.08174#bib.bib39)\) with a fixed batch size of 32 and a linear scheduler incorporating a warm\-up ratio of 0\.08\. For further details on hyperparameter settings, see [Tab\. 11](https://arxiv.org/html/2605.08174#A2.T11) and [Tab\. 12](https://arxiv.org/html/2605.08174#A2.T12)\.

| Dataset | CIFAR-100 | EuroSAT | RESISC45 | StanfordCars | FGVC-Aircraft | DTD | CIFAR-10 | OxfordPets |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Attention Dropout | 0.1 | 0.1 | 0.1 | 0 | 0.1 | 0.1 | 0.1 | 0 |
| Weight Decay | 1e-3 | 1e-3 | 1e-3 | 0.01 | 1e-3 | 1e-3 | 1e-3 | 0.01 |
| LR | 1e-4 | 8e-5 | 1e-3 | 1e-3 | 2e-3 | 3e-4 | 1e-4 | 1e-4 |
| LR (Classifier) | 1e-3 | 5e-4 | 3e-3 | 3e-3 | 6e-3 | 1e-3 | 1e-3 | 1e-3 |

Table 11: Hyperparameter settings for ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) across different datasets for the image classification experiments\. LR: Learning Rate\.

| Dataset | CIFAR-100 | EuroSAT | RESISC45 | StanfordCars | FGVC-Aircraft | DTD | CIFAR-10 | OxfordPets |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Attention Dropout | 0.1 | 0.1 | 0.1 | 0 | 0.1 | 0.1 | 0.1 | 0 |
| Weight Decay | 1e-3 | 1e-3 | 1e-3 | 0.01 | 1e-3 | 1e-3 | 1e-3 | 0.01 |
| LR | 2e-4 | 1e-4 | 2e-3 | 2e-3 | 1e-3 | 2e-4 | 2e-4 | 2e-4 |
| LR (Classifier) | 1e-3 | 5e-4 | 5e-3 | 5e-3 | 5e-3 | 1e-3 | 1e-3 | 1e-3 |

Table 12: Hyperparameter settings for ViT\-Base \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) across different datasets for the image classification experiments\. LR: Learning Rate\.

| Dataset | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Max Seq. Len. | 256 | 128 | 320 | 64 | 512 | 320 | 320 | 128 |
| Epochs | 8 | 16 | 30 | 15 | 10 | 8 | 20 | 15 |
| Batch Size | 16 | 32 | 16 | 16 | 32 | 16 | 16 | 32 |
| Classifier Dropout | 0.15 | 0 | 0 | 0.1 | 0.1 | 0.2 | 0.2 | 0.2 |
| Weight Decay | 0 | 0.01 | 0.01 | 0 | 0.01 | 0.01 | 0.01 | 0.1 |
| LR | 1e-4 | 1e-4 | 2e-4 | 1e-4 | 1e-4 | 1e-4 | 2e-4 | 2e-4 |
| LR (Classifier) | 3e-4 | 3e-4 | 4e-4 | 3e-4 | 3e-4 | 3e-4 | 4e-4 | 4e-4 |

Table 13: Hyperparameter settings for DeBERTa\-V3\-Base \(He et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib36)\) across different datasets for the NLU experiments\. LR: Learning Rate\.

**Settings for Text\-to\-Image Generation\.** For the subject\-driven text\-to\-image generation task, we use Stable Diffusion v2\-1\-base \(Rombach et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib40)\) as the pre\-trained model and apply DreamBooth \(Ruiz et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib15)\) for subject\-driven fine\-tuning\. We follow the setup of DreamBooth \(Ruiz et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib15)\) to evaluate CERSA’s fine\-tuning, ensuring that the method captures subject\-specific details while preserving pre\-trained knowledge\. We compare CERSA with full\-parameter DreamBooth \(Ruiz et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib15)\) and LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), evaluating image quality and textual alignment\.

In our implementation, we replace all linear layers in the UNet \(Ronneberger et al\.,[2015](https://arxiv.org/html/2605.08174#bib.bib58)\) and in the attention modules of the CLIP \(Radford et al\.,[2021](https://arxiv.org/html/2605.08174#bib.bib59)\) text encoder with CERSA\. The cumulative energy retention rate is set to $\alpha=\beta=0.95$\. For LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), we insert adapters into the same layers with a rank of 32\. For full\-parameter fine\-tuning, we make all of these layers trainable\. The VAE \(variational autoencoder\) module remains frozen in all methods\. We use the AdamW \(Loshchilov and Hutter,[2017](https://arxiv.org/html/2605.08174#bib.bib39)\) optimizer and a constant scheduler\. To ensure fairness in the inference stage, we use identical random seeds, inference steps, and guidance scales across all methods, preventing variations due to different parameter settings\.
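For illustration only, a CERSA\-style replacement of a linear layer might look like the sketch below; the class name `CERSALinear`, the `energy_ratio` argument, and the choice to train only the core matrix are our assumptions based on the description above, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class CERSALinear(nn.Module):
    """Sketch of a linear layer whose weight is replaced by its principal
    subspace, W ~= U_p @ S_p @ V_p, with only the core S_p trainable."""

    def __init__(self, linear: nn.Linear, energy_ratio: float = 0.95):
        super().__init__()
        W = linear.weight.data                           # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        cum = torch.cumsum(S ** 2, dim=0) / (S ** 2).sum()
        k = min(int((cum < energy_ratio).sum().item()) + 1, S.numel())
        # Frozen bases of the principal subspace (registered as buffers).
        self.register_buffer("U_p", U[:, :k].clone())    # (out, k)
        self.register_buffer("V_p", Vh[:k, :].clone())   # (k, in)
        # Trainable core, initialized from the retained singular values.
        self.S_p = nn.Parameter(torch.diag(S[:k]))       # (k, k)
        # Keep the original bias frozen as well.
        self.register_buffer("bias", None if linear.bias is None
                             else linear.bias.data.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.V_p.t()      # project input into the k-dim principal subspace
        h = h @ self.S_p.t()      # trainable mixing inside that subspace
        out = h @ self.U_p.t()    # map back to the output space
        return out if self.bias is None else out + self.bias

# Hypothetical usage on a single projection layer.
layer = nn.Linear(768, 768)
cersa_layer = CERSALinear(layer, energy_ratio=0.95)
out = cersa_layer(torch.randn(4, 768))
```

In a full model, one could iterate over `named_modules()` and swap each targeted `nn.Linear` for such a module; this reflects the spirit of the replacement described above, though the released code may differ in detail.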

**Settings for NLU Experiments\.** For the NLU experiments, we fine\-tune the Q, K, and V matrices in DeBERTa\-v3\-base \(He et al\.,[2023](https://arxiv.org/html/2605.08174#bib.bib36)\)\. The adapter rank for LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\) and PiSSA \(Meng et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib9)\) is set to 32\. SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\) and SVFT \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\) use the same settings as in the image classification experiments\. To ensure fairness, we follow SVFT’s \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\) maximum sequence length settings\. We use the AdamW \(Loshchilov and Hutter,[2017](https://arxiv.org/html/2605.08174#bib.bib39)\) optimizer and employ a linear scheduler with a warm\-up ratio of 0\.08\. For detailed hyperparameters, see [Tab\. 13](https://arxiv.org/html/2605.08174#A2.T13)\.

| Method | Trainable Parameters (M) | Trainable Ratio (%) | Weight Memory (MB) | Optimizer State Memory (MB) | Gradient Memory (MB) | Total Memory (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| FT | 303.3 | 100 | 1157.7 | 2314.4 | 1157.7 | 4629.8 |
| LoRA ($r=8$) | 0.8 | 0.3 | 1161.9 | 9.4 | 4.7 | 1175.9 |
| LoRA ($r=32$) | 3.2 | 1.0 | 1175.4 | 36.4 | 18.2 | 1229.9 |
| SVFit | 0.04 | 0.02 | 1349.8 | 1.1 | 0.6 | 1351.5 |
| SVFT | 0.12 | 0.06 | 1350.9 | 3.3 | 1.7 | 1355.8 |
| CERSA ($\alpha=\beta=0.95$) | 10.5 | 3.6 | 1111.4 | 80.5 | 40.3 | 1232.2 |
| CERSA ($\alpha=\beta=0.92$) | 8.1 | 2.8 | 1069.0 | 59.0 | 29.5 | 1157.5 |
| CERSA ($\alpha=\beta=0.9$) | 6.3 | 2.3 | 1048.6 | 49.3 | 24.6 | 1122.5 |
| CERSA ($\alpha=\beta=0.85$) | 4.3 | 1.6 | 1011.4 | 33.3 | 16.7 | 1061.4 |
| CERSA ($\alpha=\beta=0.8$) | 3.0 | 1.2 | 985.2 | 23.6 | 11.8 | 1020.6 |

Table 14: Memory consumption comparison across various methods with different settings\. LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\), and SVFT \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\) are listed with their respective configurations\.
## Appendix C Performance on Memory Consumption

[Tab\. 14](https://arxiv.org/html/2605.08174#A2.T14) compares the parameter and memory efficiency of various fine\-tuning methods\. We exclude activation and dataset\-related memory usage, as it remains largely independent of the fine\-tuning approach\. Total memory therefore refers to the sum of the weight memory, the gradient memory, and the optimizer state memory\. In addition, [Tab\. 14](https://arxiv.org/html/2605.08174#A2.T14) reports the number of trainable parameters in millions \(M\)\.
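As a back\-of\-the\-envelope check of this accounting \(our own sketch, assuming fp32 storage and AdamW’s two moment buffers per trainable parameter\), the FT row of Tab\. 14 can be reproduced as follows:

```python
# Rough memory estimate in MB, assuming fp32 weights, fp32 gradients for the
# trainable parameters, and AdamW's two moment buffers per trainable parameter.
BYTES_PER_PARAM = 4
MB = 1024 ** 2

def estimate_memory(total_params: float, trainable_params: float) -> dict:
    weights = total_params * BYTES_PER_PARAM / MB
    grads = trainable_params * BYTES_PER_PARAM / MB
    optimizer = 2 * trainable_params * BYTES_PER_PARAM / MB   # AdamW: m and v
    return {"weights": weights, "grads": grads,
            "optimizer": optimizer, "total": weights + grads + optimizer}

# Full fine-tuning of ViT-Large: 303.3M parameters, all of them trainable.
print(estimate_memory(303.3e6, 303.3e6))
# ~= 1157 (weights) + 1157 (grads) + 2314 (optimizer) = 4628 MB,
# matching the FT row of Tab. 14 up to rounding.
```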

Full\-parameter fine\-tuning \(FT\) updates all model parameters \(303\.3 M trainable parameters\), resulting in a substantial total memory consumption of 4629\.8 MB\. This high memory demand makes FT impractical for resource\-constrained environments\.

LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), with ranks of 8 and 32, significantly reduces the number of trainable parameters to 0\.8 M and 3\.2 M, respectively\. However, its total memory consumption remains considerable \(1175\.9 MB for rank 8 and 1229\.9 MB for rank 32\), exceeding the memory footprint of the pre\-trained weights due to the additional optimizer state and gradient storage\. Similarly, SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\) achieves high parameter efficiency with only 0\.04 M trainable parameters, yet still requires 1351\.5 MB of total memory, primarily due to the storage overhead of the full singular vector matrices\.

The proposed CERSA method provides a flexible solution for memory\- and parameter\-efficient fine\-tuning by adjusting the cumulative energy retention rate, enabling different levels of efficiency based on memory constraints\. For example, with a relatively ample memory budget, setting the retention rate to $\alpha=\beta=0.95$ yields better performance\. At $\alpha=\beta=0.92$, CERSA maintains a memory footprint equivalent to the pre\-trained weights during fine\-tuning\. When reduced to $\alpha=\beta=0.8$, it retains 3\.0 M trainable parameters, comparable to LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\) with rank 32, while significantly lowering total memory consumption to 1020\.6 MB \(versus up to 1229\.9 MB for LoRA\)\.

Although not as parameter\-efficient as SVFit \(Sun et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib8)\) and SVFT \(Lingam et al\.,[2024](https://arxiv.org/html/2605.08174#bib.bib10)\), CERSA excels in overall memory efficiency, even with more trainable parameters\. This makes it particularly advantageous for fine\-tuning large\-scale models in memory\-constrained environments\. Additionally, its adjustable cumulative energy retention rate allows customized trade\-offs, making CERSA a versatile solution that outperforms other PEFT methods in total memory consumption while maintaining competitive performance\.

## Appendix D Performance on Speed

CERSA decomposes the pre\-trained weight matrix into three components: $\bm{U}_p$, $\bm{S}_p$, and $\bm{V}_p$\. For simplicity, we assume that CERSA is configured with $\alpha=\beta=0.95$\. Compared to the original weight matrix $\bm{W}$, this decomposition introduces more granular matrix computations\. However, since the matrices involved are significantly smaller, the overall computational cost is also reduced\. To evaluate the actual impact on fine\-tuning, we design experiments to measure training throughput and training time\.
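As a rough illustration of this cost argument \(ours; the layer width and retained rank below are assumed, not measured from the model\), the multiply\-accumulate count of the factored path can be compared with that of the dense projection:

```python
# Multiply-accumulate counts for one token through a d x d projection.
def dense_macs(d: int) -> int:
    return d * d

def cersa_macs(d: int, k: int) -> int:
    # x @ V_p^T (d*k), then @ S_p^T (k*k), then @ U_p^T (k*d)
    return 2 * d * k + k * k

d, k = 1024, 400              # assumed layer width and retained rank
print(dense_macs(d))          # 1,048,576
print(cersa_macs(d, k))       # 979,200: cheaper whenever 2*d*k + k**2 < d**2,
                              # i.e. roughly whenever k < 0.414 * d
```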

To eliminate the impact of dataset pre\-processing and batch size on computation time and throughput, we fix the batch size at 32 and the number of epochs at 15\. Fine\-tuning is performed on ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) with full\-parameter fine\-tuning, LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), and CERSA\.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x9.png)

Figure 9: Training throughput and training time of fine\-tuning ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) on the DTD \(Cimpoi et al\.,[2014](https://arxiv.org/html/2605.08174#bib.bib30)\) dataset under various configurations\.

Experimental results show that our method achieves training efficiency comparable or superior to LoRA while significantly outperforming FT in terms of speed\. As shown in [Fig\. 9](https://arxiv.org/html/2605.08174#A4.F9), LoRA ($r=32$, $r=8$) improves throughput by about 30% over FT\. CERSA, across all cumulative energy retention rates \{0\.95, 0\.9, 0\.85, 0\.8\}, slightly exceeds LoRA’s efficiency, demonstrating that the cumulative energy\-retaining decomposition of the weight matrix effectively reduces computational complexity while preserving model capacity\. Despite introducing more granular matrix multiplications, the significantly reduced dimensionality lowers the overall computational cost\. As a result, CERSA matches or even surpasses LoRA in fine\-tuning speed\.
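For reference, a minimal sketch of how training throughput can be measured in a standard PyTorch loop \(our own illustration, not the benchmarking code used for Fig\. 9; it assumes a CUDA device and a loader with more than a few batches\):

```python
import time
import torch

def measure_throughput(model, loader, optimizer, criterion, device="cuda", warmup=3):
    """Return training samples processed per second over one pass of `loader`,
    excluding the first `warmup` steps (assumes len(loader) > warmup)."""
    model.train().to(device)
    n_samples, start = 0, None
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if step == warmup - 1:            # timing starts after the warm-up steps
            torch.cuda.synchronize()
            start = time.perf_counter()
        elif step >= warmup:
            n_samples += x.size(0)
    torch.cuda.synchronize()
    return n_samples / (time.perf_counter() - start)
```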

![Refer to caption](https://arxiv.org/html/2605.08174v1/x10.png)

Figure 10: Out\-of\-distribution evaluation on various tasks\.
## Appendix E Performance on Out\-of\-Distribution Tasks

During full\-parameter fine\-tuning, the model gradually forgets core features from the pre\-training data as its parameter space shifts significantly\. In contrast, CERSA restricts updates to $\bm{S}_p$, adjusting only the most critical feature subspace while ensuring that the principal subspace remains unaffected by less important dimensions\. This preserves essential pre\-trained knowledge\.

Out\-of\-distribution \(OOD\) performance is a crucial indicator of knowledge retention, as previously studied in \(Hendrycks and Gimpel,[2017](https://arxiv.org/html/2605.08174#bib.bib46)\) and \(Kumar et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib47)\)\. [Fig\. 10](https://arxiv.org/html/2605.08174#A4.F10) shows the OOD performance of models fine\-tuned on each of the four datasets \(CIFAR\-100 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\), DTD \(Cimpoi et al\.,[2014](https://arxiv.org/html/2605.08174#bib.bib30)\), StanfordCars \(Krause et al\.,[2013](https://arxiv.org/html/2605.08174#bib.bib32)\), and RESISC45 \(Cheng et al\.,[2017](https://arxiv.org/html/2605.08174#bib.bib33)\)\), with accuracy evaluated on the remaining three datasets\. We compare FT, LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), and our proposed CERSA method\. The gray bars indicate the model’s original performance before fine\-tuning, serving as a reference for relative performance degradation\.

Across all fine\-tuning settings, CERSA consistently achieves superior OOD performance compared to FT and LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\)\. Specifically, when fine\-tuned on CIFAR\-100 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\) \(leftmost subplot\), CERSA maintains a higher average OOD accuracy than LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\) and FT, suggesting that it better preserves pre\-trained knowledge for handling novel tasks such as DTD \(Cimpoi et al\.,[2014](https://arxiv.org/html/2605.08174#bib.bib30)\) and StanfordCars \(Krause et al\.,[2013](https://arxiv.org/html/2605.08174#bib.bib32)\), or even leverages knowledge from CIFAR\-100 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\)\. A similar trend is observed in the DTD \(Cimpoi et al\.,[2014](https://arxiv.org/html/2605.08174#bib.bib30)\) fine\-tuning scenario \(second subplot\), where CERSA demonstrates stronger retention of pre\-trained features, particularly on CIFAR\-100 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\)\.

| Method | FT | LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\) | CERSA |
| --- | --- | --- | --- |
| Average Forgetting Rate | 17.8% | 2.3% | -1.5% |

Table 15: Average forgetting rate of FT, LoRA \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), and CERSA on the four datasets \(CIFAR\-100 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\), DTD \(Cimpoi et al\.,[2014](https://arxiv.org/html/2605.08174#bib.bib30)\), StanfordCars \(Krause et al\.,[2013](https://arxiv.org/html/2605.08174#bib.bib32)\), and RESISC45 \(Cheng et al\.,[2017](https://arxiv.org/html/2605.08174#bib.bib33)\)\)\.

In our experiments, fine\-tuning is performed on one dataset while accuracy is evaluated on the remaining three\. The average forgetting rate is defined as the average accuracy drop on the three out\-of\-distribution tasks after fine\-tuning on a specific task, expressed relative to the baseline accuracy of the pre\-trained model\. As shown in [Tab\. 15](https://arxiv.org/html/2605.08174#A5.T15), these results highlight CERSA’s ability to mitigate catastrophic forgetting by retaining key representations learned during pre\-training, thereby preserving higher accuracy on tasks not directly involved in fine\-tuning\.
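Under our reading of this definition \(notation ours\), the forgetting rate after fine\-tuning on a dataset $d$ is

$$\text{Forget}(d) = \frac{1}{3} \sum_{d' \neq d} \frac{\operatorname{Acc}_{\mathrm{pre}}(d') - \operatorname{Acc}_{d}(d')}{\operatorname{Acc}_{\mathrm{pre}}(d')},$$

where $\operatorname{Acc}_{\mathrm{pre}}(d')$ is the pre\-trained model’s accuracy on dataset $d'$ \(the gray bars in Fig\. 10\) and $\operatorname{Acc}_{d}(d')$ is its accuracy on $d'$ after fine\-tuning on $d$; Tab\. 15 reports the mean over the four choices of $d$\. A negative value, as obtained for CERSA, means that out\-of\-distribution accuracy improved on average\.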

| Dataset | CIFAR-100 | EuroSAT | RESISC45 | StanfordCars |
| --- | --- | --- | --- | --- |
| Q | 99.69% / 99.65% | 99.94% / 99.94% | 99.81% / 99.79% | 99.95% / 99.94% |
| K | 99.76% / 99.74% | 99.96% / 99.96% | 99.86% / 99.85% | 99.94% / 99.94% |
| V | 99.58% / 99.58% | 99.91% / 99.91% | 99.79% / 99.79% | 99.92% / 99.92% |

| Dataset | FGVC-Aircraft | DTD | CIFAR-10 | OxfordPets |
| --- | --- | --- | --- | --- |
| Q | 99.76% / 99.73% | 99.91% / 99.90% | 99.68% / 99.63% | 99.96% / 99.94% |
| K | 99.78% / 99.76% | 99.94% / 99.94% | 99.74% / 99.72% | 99.94% / 99.93% |
| V | 99.69% / 99.70% | 99.85% / 99.86% | 99.55% / 99.55% | 99.89% / 99.90% |

Table 16: Principal subspace similarity between the Q, K, and V matrices of ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) pre\-trained on ImageNet\-21K \(Deng et al\.,[2009](https://arxiv.org/html/2605.08174#bib.bib26)\) and the fine\-tuned weights on various downstream image classification tasks\. Each cell reports the output\-subspace similarity ($\bm{U}$) followed by the input\-subspace similarity ($\bm{V}$)\.
## Appendix F Subspace Similarity Analysis

![Refer to caption](https://arxiv.org/html/2605.08174v1/x11.png)

Figure 11: Similarity between the principal output subspace $\bm{U}_p$ of the pre\-trained and fine\-tuned weights for the Q, K, and V matrices in layers 0, 11, and 23 of ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\)\. The x\-axis represents the subspace spanned by the top\-$i$ singular vectors of the pre\-trained weights, while the y\-axis represents the subspace spanned by the top\-$j$ singular vectors of the fine\-tuned weights\.

![Refer to caption](https://arxiv.org/html/2605.08174v1/x12.png)

Figure 12: Similarity between the principal input subspace $\bm{V}_p$ of the pre\-trained and fine\-tuned weights for the Q, K, and V matrices in layers 0, 11, and 23 of ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\)\. The x\-axis represents the subspace spanned by the top\-$i$ singular vectors of the pre\-trained weights, while the y\-axis represents the subspace spanned by the top\-$j$ singular vectors of the fine\-tuned weights\.

### F\.1 Subspace Similarity Between Pre\-trained and Fine\-tuned Models

Our theoretical analysis assumes that CERSA can approximate full\-parameter fine\-tuning based on the premise that the principal subspace of $\bm{W}'$ after full fine\-tuning on a downstream task remains highly similar to that of the pre\-trained weights\. This assumption suggests that fine\-tuning primarily refines the existing subspace rather than significantly altering its structure\. To empirically validate this assumption, we measure the subspace similarity between $\bm{W}'$ and the pre\-trained weights $\bm{W}$ of the ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) model, initially trained on ImageNet\-21K \(Deng et al\.,[2009](https://arxiv.org/html/2605.08174#bib.bib26)\), across eight downstream image classification datasets\.

To quantitatively assess this similarity, we employ the Grassmann subspace similarity \(Hu et al\.,[2022](https://arxiv.org/html/2605.08174#bib.bib7)\), a metric that captures the alignment between the principal output subspaces of the pre\-trained and fine\-tuned weights\. Formally, the Grassmann similarity is defined as follows:

$$\psi(\bm{U}_A^{i}, \bm{U}_B^{j}) = \frac{\|\bm{U}_A^{i\top} \bm{U}_B^{j}\|_F^2}{\min\{i, j\}}, \qquad \psi \in [0, 1], \tag{3}$$

where $\bm{U}_A^{i}$ and $\bm{U}_B^{j}$ denote the top\-$i$ columns of the $\bm{U}$ matrix from the SVD of matrix $\bm{A}$ and the top\-$j$ columns of the $\bm{U}$ matrix from the SVD of matrix $\bm{B}$, respectively\. $\psi$ ranges from 0 \(completely disjoint subspaces\) to 1 \(identical subspaces\)\.
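A direct implementation of Eq\. \(3\) for the left\-singular \(output\) subspaces might look like the following sketch \(ours; the matrix shapes in the example are placeholders\):

```python
import torch

def grassmann_similarity(A: torch.Tensor, B: torch.Tensor, i: int, j: int) -> float:
    """Eq. (3): overlap between the top-i left-singular subspace of A and the
    top-j left-singular subspace of B, normalized to lie in [0, 1]."""
    U_A = torch.linalg.svd(A, full_matrices=False).U[:, :i]
    U_B = torch.linalg.svd(B, full_matrices=False).U[:, :j]
    overlap = torch.linalg.matrix_norm(U_A.T @ U_B, ord="fro") ** 2
    return (overlap / min(i, j)).item()

# Example: a pre-trained weight and a slightly perturbed stand-in for its
# fine-tuned counterpart (not the actual ViT weights).
W_pre = torch.randn(1024, 1024)
W_ft = W_pre + 0.01 * torch.randn(1024, 1024)
print(grassmann_similarity(W_pre, W_ft, i=100, j=100))
```

The same computation applied to the right\-singular vectors gives the input\-subspace similarity used for $\bm{V}$ below\.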

Similarly, we extend this analysis to the input subspace represented by $\bm{V}$, applying the same similarity computation\. The results presented in [Tab\. 16](https://arxiv.org/html/2605.08174#A5.T16) are computed using a subspace that retains 95% of the cumulative energy\. The first value in each cell is the similarity of the output subspace $\bm{U}$, while the second corresponds to the input subspace $\bm{V}$\.

Analyzing [Tab\. 16](https://arxiv.org/html/2605.08174#A5.T16), we observe that across all downstream tasks, the Grassmann subspace similarity between the fine\-tuned and pre\-trained subspaces consistently exceeds 99\.5% for both $\bm{U}$ and $\bm{V}$ across all three attention matrices \(Q, K, and V\)\. This strongly suggests that fine\-tuning minimally affects the principal subspace of the pre\-trained weights, thereby validating our assumption\.

To further examine the stability of the Grassmann subspace similarity under varying top\-$k$ selections, we conduct experiments on the 0th, 11th, and 23rd layers of the ViT\-Large \(Dosovitskiy,[2021](https://arxiv.org/html/2605.08174#bib.bib35)\) model before and after fine\-tuning on CIFAR\-10 \(Krizhevsky and Hinton,[2009](https://arxiv.org/html/2605.08174#bib.bib28)\)\. Specifically, we extract the Q, K, and V matrices from these layers, perform SVD to obtain the principal subspaces $\bm{U}$ and $\bm{V}$ for both the pre\-trained and fine\-tuned models, and measure subspace similarity for different choices of the top\-$i$ and top\-$j$ singular vectors\. The results are visualized as heat maps\.

Since all similarity values fall within the range 0\.9 to 0\.9999, we apply a logarithmic transformation to the color scale for better visualization\. As depicted in [Fig\. 11](https://arxiv.org/html/2605.08174#A6.F11) and [Fig\. 12](https://arxiv.org/html/2605.08174#A6.F12), the subspaces of the fine\-tuned and pre\-trained weight matrices exhibit consistently high similarity across all choices of top\-$i$ and top\-$j$\. This confirms that the observed subspace similarity is not confined to specific top\-$k$ selections but persists across all choices of singular vectors\.

Regardless of the truncation level, the fine\-tuned and pre\-trained weight matrices maintain exceptionally high Grassmann subspace similarity\. This finding further substantiates our hypothesis that fine\-tuning does not significantly alter the principal subspace of the pre\-trained model, reinforcing the fundamental assumption underlying our method\.

### F\.2 Proofs

###### Proof\.

Let $\bm{M} \in \mathbb{R}^{m \times n}$ be a matrix of rank $k$\. By the singular value decomposition \(SVD\), we can write

$$\bm{M} = \bm{U} \bm{\Sigma} \bm{V}^{T},$$

where $\bm{U} \in \mathbb{R}^{m \times k}$ and $\bm{V} \in \mathbb{R}^{n \times k}$ are matrices with orthonormal columns, and $\bm{\Sigma} \in \mathbb{R}^{k \times k}$ is a diagonal matrix with positive diagonal entries \(the singular values\)\.

Since there exists an orthonormal basis $\bm{Q} = \{\bm{e}_1, \bm{e}_2, \dots, \bm{e}_k\}$ such that $\operatorname{Span}(\bm{U}) = \operatorname{Span}(\bm{Q})$, both $\bm{U}$ and $\bm{Q}$ form orthonormal bases for the same $k$\-dimensional subspace\. Therefore, there exists an orthogonal matrix $\bm{R} \in \mathbb{R}^{k \times k}$ such that

$$\bm{U} = \bm{Q} \bm{R}.$$

Similarly, because $\operatorname{Span}(\bm{V}) = \operatorname{Span}(\bm{Q}')$, there exists an orthogonal matrix $\bm{R}' \in \mathbb{R}^{k \times k}$ satisfying

$$\bm{V} = \bm{Q}' \bm{R}'.$$

Substituting these expressions into the singular value decomposition of $\bm{M}$, we obtain

$$\bm{M} = \bm{U} \bm{\Sigma} \bm{V}^{T} = (\bm{Q} \bm{R}) \bm{\Sigma} (\bm{Q}' \bm{R}')^{T} = \bm{Q} \bm{R} \bm{\Sigma} \bm{R}'^{T} \bm{Q}'^{T}.$$

Defining $\bm{S} = \bm{R} \bm{\Sigma} \bm{R}'^{T}$, we obtain

$$\bm{M} = \bm{Q} \bm{S} \bm{Q}'^{T},$$

where $\bm{S} \in \mathbb{R}^{k \times k}$\. This completes the proof\. ∎
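As a numerical sanity check of this identity \(our own sketch, with small random matrices\), one can verify that rotating the SVD factors into arbitrary orthonormal bases of the same spans leaves the product unchanged:

```python
import torch

torch.manual_seed(0)
m, n, k = 6, 5, 3
# Build a rank-k matrix M and take its thin SVD, truncated to rank k.
M = torch.randn(m, k) @ torch.randn(k, n)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
U, V = U[:, :k], Vh[:k, :].T

# Arbitrary orthonormal bases Q and Q' spanning Span(U) and Span(V):
# rotate U and V by random orthogonal k x k matrices.
Q = U @ torch.linalg.qr(torch.randn(k, k)).Q
Qp = V @ torch.linalg.qr(torch.randn(k, k)).Q

# Recover the k x k core S = Q^T M Q' and check that M = Q S Q'^T.
S_core = Q.T @ M @ Qp
print(torch.allclose(M, Q @ S_core @ Qp.T, atol=1e-5))  # True
```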
