Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning

arXiv cs.LG Papers

Summary

The paper proposes Slice, a gradient-surgery-based initialization for LoRA adapters in continual learning that reconciles conflicting gradients from current and past tasks to reduce catastrophic forgetting, achieving better stability-plasticity trade-offs.

arXiv:2605.12752v1 Announce Type: new Abstract: LoRA is widely adopted for continual fine-tuning of Large Language Models due to its parameter efficiency, modularity across tasks, and compatibility with replay strategies. However, LoRA-based continual learning remains vulnerable to catastrophic forgetting, whose severity depends on how successive task gradients interact: when consecutive task gradients conflict, standard adapter initializations channel updates into subspaces that overwrite previously learned directions. We propose SLICE, a gradient-surgery-based initialization for LoRA adapters in continual learning. SLICE accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and decomposes the result via truncated SVD to initialize the adapter weights. We evaluate SLICE on the TRACE benchmark and sequences of Super-NI tasks, including a set of adversarial Super-NI sequences that we construct by mining task pairs with maximally opposing gradients. Compared to vanilla LoRA, LoRA-GA, and LoRAM, SLICE consistently achieves a better stability-plasticity trade-off, improving Average Performance, Final Performance and Forgetting metrics while preserving General Performance and In Context Performance across both standard and adversarial continual learning sequences.

Cached at: 05/14/26, 06:18 AM

# Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
Source: [https://arxiv.org/html/2605.12752](https://arxiv.org/html/2605.12752)
Joana Pasquali (equal contribution), Ramiro N. Barros (equal contribution), Arthur S. Bianchessi, Vinícius Conte Turani, João Vitor Boer Abitante, Rafaela Cappelari Ravazio, Christian Mattjie, Otávio Parraga, Lucas S. Kupssinskü, Rodrigo C. Barros

Affiliations: MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil; Kunumi Institute, Brazil (listed additionally for Rodrigo C. Barros)

###### Abstract

LoRA is widely adopted for continual fine-tuning of Large Language Models due to its parameter efficiency, modularity across tasks, and compatibility with replay strategies. However, LoRA-based continual learning remains vulnerable to catastrophic forgetting, whose severity depends on how successive task gradients interact: when consecutive task gradients conflict, standard adapter initializations channel updates into subspaces that overwrite previously learned directions. We propose Slice, a gradient-surgery-based initialization for LoRA adapters in continual learning. Slice accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and decomposes the result via truncated SVD to initialize the adapter weights. We evaluate Slice on the TRACE benchmark and sequences of Super-NI tasks, including a set of adversarial Super-NI sequences that we construct by mining task pairs with maximally opposing gradients. Compared to vanilla LoRA, LoRA-GA, and LoRAM, Slice consistently achieves a better stability-plasticity trade-off, improving Average Performance, Final Performance and Forgetting metrics while preserving General Performance and In Context Performance across both standard and adversarial continual learning sequences.

## 1 Introduction

Large Language Models (LLMs) are increasingly deployed in settings that demand sequential adaptation to non-stationary task distributions: new domains, evolving instructions, and shifting user requirements arrive over time, and retraining from scratch at each stage is prohibitively expensive. Continual learning (CL) provides the natural framework for this regime [shi2025continualsurvey], but its central pathology, catastrophic forgetting [kirkpatrick2017overcoming], where optimization on a new task degrades performance on previously learned ones, remains a fundamental obstacle.

Low-Rank Adaptation (LoRA) [hu2022lora] is the dominant parameter-efficient fine-tuning paradigm, and it offers several advantages for CL: its low parameter count limits the degrees of freedom available for destructive interference, its modularity enables task-specific adapter storage and composition, and its compatibility with replay and regularization strategies makes it suitable for CL [liang2024inflora, liu2024learningattentionalmixtureloras, wang2023olora, xiong2026oplora].

In CL, when the gradient of a new task conflicts with the gradients of previously learned tasks (that is, when their Frobenius inner product is negative), any descent step that improves current-task performance degrades prior-task performance to first order. The adapter initialization determines the subspace within which all subsequent optimization occurs, and standard initialization schemes are blind to this conflict structure [meng2024pissa]. Vanilla LoRA initializes $A$ with random Gaussian entries and $B=0$, placing the adapter in a direction uncorrelated with any task objective. Spectral methods such as PiSSA [meng2024pissa] and MiLoRA [wang2024milora] derive initializations from the singular structure of the pretrained weights, capturing general representational directions but encoding no task-specific signal. LoRA-GA [wang2024loraga] takes a step toward task awareness by initializing adapters via SVD of the fine-tuning-task gradient. However, none of these methods account for previously learned tasks: LoRA-GA's initialization is optimal for single-task adaptation but may push the adapter into a subspace that overwrites prior knowledge whenever the current and previous task gradients conflict. The initialization stage thus represents a consequential but underexploited intervention point for CL.

We propose Slice (Gradient-Surgery-based Low-rank Initialization for Continual lEarning), a method that initializes LoRA adapters in a subspace that is simultaneously aligned with the current-task objective and minimally destructive to previously acquired task knowledge.

To better evaluate Slice, we introduce NI-Seq-Opposite, three adversarial 5-task sequences whose tasks were mined by exhaustive combinatorial search over the $\binom{46}{5}$ candidate subsets of the Super-NI task pool [wang-etal-2022-super] to minimize mean pairwise gradient cosine similarity. Existing Super-NI task sequences [jiang2025unlocking] are constructed by grouping tasks according to output type (classification, generation, or mixed) without any explicit criterion linking sequence composition to gradient interference.

We evaluate Slice on the TRACE benchmark [wang2023tracecomprehensivebenchmarkcontinual] and on both standard (G1, G2) and adversarial (NI-Seq-Opposite) Super-NI sequences. Slice is compared to three baseline initializations: vanilla LoRA [hu2022lora], LoRAM [zhang2025primacy], and LoRA-GA [wang2024loraga]. All comparisons employ variance-matched magnitude rescaling to control for the confound identified by Zhang et al. [zhang2025primacy], whereby apparent initialization gains are attributable to implicit magnitude amplification rather than subspace quality. Our results show that Slice consistently improves Final Performance and reduces Forgetting while incurring only marginal reductions in General Performance. Figure [1](https://arxiv.org/html/2605.12752#S1.F1) shows that all Slice variants ($c\in\{0.50,0.75,1.00\}$) consistently achieve higher stability than all baselines at similar plasticity on NI-Seq-Opp1.

Code available at: [https://github.com/RamiroNB/slice](https://github.com/RamiroNB/slice)

[Figure 1 plot, NI-Seq-Opposite-1. Axes: AP (plasticity) and FP (stability). Points: Slice with $c=0.50$, $c=0.75$, and $c=1.00$; baselines Vanilla LoRA, LoRA-GA, and LoRAM. The FP = AP line is drawn, with regions labeled "backward transfer" and "forgetting".]

Figure 1: Stability-plasticity trade-off on NI-Seq-Opposite-1. FP (stability) vs. AP (plasticity) for Slice and baselines under variance-matched rescaling [zhang2025primacy]. The dashed line denotes FP = AP; the shaded region indicates positive backward transfer.

## 2 Problem Formulation

##### Continual learning setup.

Let $\{\mathcal{T}_1,\dots,\mathcal{T}_T\}$ denote a sequence of tasks with data distributions $\mathcal{D}_1,\dots,\mathcal{D}_T$, and let $\theta$ parametrize a model trained sequentially over them. At task $\mathcal{T}_t$, the learner is updated on $\mathcal{D}_t$ alone but is evaluated on all $\mathcal{D}_i$ with $i\leq t$. Two opposing population losses characterize this regime: the current-task (plasticity) loss

$$\mathcal{L}_{\text{cur}}(\theta)\;:=\;\mathbb{E}_{(x,y)\sim\mathcal{D}_t}\!\left[\mathcal{L}_t(\theta;x,y)\right], \tag{1}$$

and the previous-tasks (stability) loss

$$\mathcal{L}_{\text{prev}}(\theta)\;:=\;\sum_{i=1}^{t-1}\mathbb{E}_{(x,y)\sim\mathcal{D}_i}\!\left[\mathcal{L}_i(\theta;x,y)\right]. \tag{2}$$

Sequential fine-tuning only optimizes $\mathcal{L}_{\text{cur}}$, while control over $\mathcal{L}_{\text{prev}}$ is implicit. Catastrophic forgetting occurs when the latter grows uncontrollably.

##### Data access.

We assume $\bigcup_{i<t}\mathcal{D}_i$ remains accessible for sampling on demand. This is strictly weaker than the rehearsal regime [buzzega2020der, chaudhry2018agem, chaudhry2019er]: Slice draws a fresh sample $\mathcal{D}_{\text{prev}}\subseteq\bigcup_{i<t}\mathcal{D}_i$ at initialization, uses it once, and discards it; no persistent memory is replayed during training.

##### Per-layer gradients.

For each target weight matrix $W_l$, the two losses induce gradients

$$G_{\text{cur}}\;:=\;\nabla_W\,\mathcal{L}_{\text{cur}}(\theta),\qquad G_{\text{prev}}\;:=\;\nabla_W\,\mathcal{L}_{\text{prev}}(\theta), \tag{3}$$

estimated from finite samples of $\mathcal{D}_t$ and $\mathcal{D}_{\text{prev}}$. The directions of the two gradients can carry conflicting information. When $\langle G_{\text{cur}},G_{\text{prev}}\rangle_F<0$, a descent step on $\mathcal{L}_{\text{cur}}$ also increases $\mathcal{L}_{\text{prev}}$ to first order: the two objectives are incompatible, and any improvement on the current task is paid for in stability. Slice operates directly on this geometry, reconciling $G_{\text{cur}}$ and $G_{\text{prev}}$ into a conflict-free direction prior to adapter construction.
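The conflict condition can be checked numerically. The following is a minimal sketch, assuming NumPy and toy $2\times 2$ gradients; the function names are illustrative and not taken from the paper. It uses the first-order Taylor expansion $\mathcal{L}_{\text{prev}}(W-\eta\,G_{\text{cur}})\approx\mathcal{L}_{\text{prev}}(W)-\eta\,\langle G_{\text{cur}},G_{\text{prev}}\rangle_F$, so a negative inner product implies a positive change in $\mathcal{L}_{\text{prev}}$.

```python
import numpy as np


def frobenius_inner(g_a: np.ndarray, g_b: np.ndarray) -> float:
    """Frobenius inner product <A, B>_F = sum_ij A_ij * B_ij."""
    return float(np.sum(g_a * g_b))


def gradients_conflict(g_cur: np.ndarray, g_prev: np.ndarray) -> bool:
    """True when the two per-layer gradients have a negative inner product."""
    return frobenius_inner(g_cur, g_prev) < 0.0


def first_order_prev_change(g_cur: np.ndarray, g_prev: np.ndarray, lr: float) -> float:
    """First-order change in L_prev after the step W <- W - lr * G_cur."""
    return -lr * frobenius_inner(g_cur, g_prev)


# Toy gradients (illustrative values only, not from the paper).
g_cur = np.array([[1.0, 0.0], [0.0, 1.0]])
g_prev = np.array([[-1.0, 0.0], [0.0, 0.5]])

print(gradients_conflict(g_cur, g_prev))            # inner product is -0.5
print(first_order_prev_change(g_cur, g_prev, 0.1))  # positive: L_prev increases
```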

## 3 Gradient-Surgery Low-Rank Initialization

We introduce Slice, a method for initializing low-rank adapters in a subspace aligned with the current-task objective while minimizing interference with previously learned tasks.

In fine-tuning with LoRA [hu2022lora], each target weight matrix $W^{(l)}$ is modified as $W'=W_0+BA$, with $B\in\mathbb{R}^{d_{\text{out}}\times r}$ and $A\in\mathbb{R}^{r\times d_{\text{in}}}$ for rank $r\ll\min(d_{\text{out}},d_{\text{in}})$.

Standard practice initializes $A$ randomly and $B=0_{d_{\text{out}}\times r}$, making the initial adapter direction agnostic to both the current task $\mathcal{T}_t$ and the previously learned tasks $\{\mathcal{T}_i\}_{i<t}$. Slice initializes weights in a *conflict-free* update direction by performing a four-stage procedure detailed in Algorithm [1](https://arxiv.org/html/2605.12752#alg1).

[Figure 2 diagram: current-task and previous-tasks data, gradient estimation, PCGrad-based projection $\psi$, SVD, MagScale (match variance to $W_l$), and the resulting adapter factors $B_l=\beta\,\Phi_B^{(l)}$ and $A_l=\beta\,\Phi_A^{(l)}$.]

Figure 2: Slice initialization pipeline. $\mathcal{D}_{cur}$ and $\mathcal{D}_{prev}$ denote the current-task data and prior-task replay buffer; $G_{cur}$, $G_{prev}$ their gradients; $\psi$ is the projection; $\tilde{G}_{cur}$ the reconciled gradient; $\Phi_B^{(l)}$, $\Phi_A^{(l)}$ the low-rank subspaces; and $\beta=(\eta_r\,\eta_{\mathrm{var}})^{1/4}$ the magnitude-scaling coefficient.

##### Stage 1: Gradient Estimation.

We estimate the gradient matrices for the current task and for previously seen task data by accumulating over mini-batches:

$$G_{\text{cur}}=\frac{1}{S_{\text{cur}}}\sum_{s=1}^{S_{\text{cur}}}\nabla_W\mathcal{L}_{\text{cur}}(\theta;\,\mathcal{B}_s^{\text{cur}}),\qquad G_{\text{prev}}=\frac{1}{S_{\text{prev}}}\sum_{s=1}^{S_{\text{prev}}}\nabla_W\mathcal{L}_{\text{prev}}(\theta;\,\mathcal{B}_s^{\text{prev}}) \tag{4}$$

where $\mathcal{B}_s^{\text{cur}}\sim\mathcal{D}_t$ and $\mathcal{B}_s^{\text{prev}}\sim\mathcal{D}_{\text{prev}}$ are mini-batches sampled from the current-task data and from the previous-tasks sample, respectively, and $S_{\text{cur}}$, $S_{\text{prev}}$ are the numbers of accumulation steps. Notably, $S_{\text{cur}}$ and $S_{\text{prev}}$ are small (typically tens of steps), requiring only a lightweight forward-backward pass over a small sample of data from the current task and from $\mathcal{D}_{\text{prev}}$.
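The accumulation in Eq. 4 is a plain average of per-batch gradients. The sketch below illustrates it with NumPy on a toy linear least-squares loss, which stands in for the language-modeling loss purely as an assumption for illustration; `batch_gradient`, `accumulate_gradient`, and `make_batches` are hypothetical helper names.

```python
import numpy as np


def batch_gradient(W: np.ndarray, X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Gradient w.r.t. W of the toy loss 0.5 * mean_i ||W x_i - y_i||^2."""
    residual = X @ W.T - Y                 # shape (n, d_out)
    return residual.T @ X / X.shape[0]     # shape (d_out, d_in)


def accumulate_gradient(W: np.ndarray, batches) -> np.ndarray:
    """Average the per-batch gradients over S accumulation steps (Eq. 4)."""
    total = np.zeros_like(W)
    for X, Y in batches:
        total += batch_gradient(W, X, Y)
    return total / len(batches)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                # d_out = 3, d_in = 4


def make_batches(num_steps: int, batch_size: int = 8):
    """Random (inputs, targets) pairs standing in for sampled mini-batches."""
    return [
        (rng.normal(size=(batch_size, 4)), rng.normal(size=(batch_size, 3)))
        for _ in range(num_steps)
    ]


G_cur = accumulate_gradient(W, make_batches(num_steps=4))   # current task
G_prev = accumulate_gradient(W, make_batches(num_steps=4))  # previous-tasks sample
print(G_cur.shape, G_prev.shape)
```

In a real model the per-batch gradient would come from automatic differentiation of the task loss with respect to each target weight matrix; only the averaging step carries over unchanged.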

##### Stage 2: Gradient Reconciliation.

We reconcile the current-task gradient with the previous-tasks gradient through a projection operator $\psi$:

$$\tilde{G}_{\text{cur}}=\psi\!\left(G_{\text{cur}},\;G_{\text{prev}}\right) \tag{5}$$

The role of $\psi$ is to return an update direction that pursues the current-task (plasticity) objective orthogonal to the previous-tasks (stability) objective.

We employ a generalization of PCGrad parameterized by $c\in[0,1]$. At $c=1$ the formula recovers standard PCGrad; at $c=0$ no correction is applied, and $\psi$ returns $G_{\text{cur}}$ unchanged. The $\min(\cdot,0)$ term is the non-linearity: the correction is active only when the two gradients are conflicting ($\langle G_{\text{cur}},G_{\text{prev}}\rangle_F<0$), leaving the gradient unchanged otherwise ($\tilde{G}_{\text{cur}}=G_{\text{cur}}$).

$$\psi_{\text{PCGrad}}\!\left(G_{\text{cur}},G_{\text{prev}},c\right)=G_{\text{cur}}-c\,\frac{\min\bigl(\langle G_{\text{cur}},\;G_{\text{prev}}\rangle_F,\,0\bigr)}{\|G_{\text{prev}}\|_F^2}\;G_{\text{prev}}. \tag{6}$$
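Eq. 6 translates almost directly into code. The following is a minimal NumPy sketch; the function name `psi_pcgrad` and the zero-norm guard are assumptions added for illustration. Substituting Eq. 6 into the inner product gives $\langle\tilde{G}_{\text{cur}},G_{\text{prev}}\rangle_F=(1-c)\,\langle G_{\text{cur}},G_{\text{prev}}\rangle_F$ in the conflicting case, so $c=1$ removes the conflict entirely and $c=0.5$ halves it.

```python
import numpy as np


def psi_pcgrad(g_cur: np.ndarray, g_prev: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Generalized PCGrad projection of Eq. 6 with strength c in [0, 1]."""
    inner = float(np.sum(g_cur * g_prev))       # <G_cur, G_prev>_F
    sq_norm = float(np.sum(g_prev * g_prev))    # ||G_prev||_F^2
    if sq_norm == 0.0:
        # Guard against a zero previous-tasks gradient (assumption, not in the paper).
        return g_cur.copy()
    # min(inner, 0) switches the correction off when the gradients do not conflict.
    return g_cur - c * min(inner, 0.0) / sq_norm * g_prev


g_cur = np.array([[1.0, 0.0], [0.0, 1.0]])
g_prev = np.array([[-1.0, 0.0], [0.0, 0.5]])    # conflicts with g_cur

for c in (0.0, 0.5, 1.0):
    g_tilde = psi_pcgrad(g_cur, g_prev, c)
    print(c, float(np.sum(g_tilde * g_prev)))   # residual inner product with G_prev
```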

##### Stage 3: Low-Rank Decomposition.

We decompose the projected gradient via truncated SVD into matrices that will form the adapter initialization. For each target module $l$, let $\tilde{G}_{\text{cur}}^{(l)}=U\Sigma V^{\top}$ be the singular value decomposition. We set

$$\Phi_B^{(l)}=U_{:,:r}^{(l)},\qquad\Phi_A^{(l)}=\bigl(V_{:,r:2r}^{(l)}\bigr)^{\top} \tag{7}$$

where $U_{:,:r}^{(l)}\in\mathbb{R}^{d_{\text{out}}^{(l)}\times r}$ denotes the first $r$ left singular vectors and $V_{:,r:2r}^{(l)}\in\mathbb{R}^{d_{\text{in}}^{(l)}\times r}$ denotes the $(r+1)$-th through $2r$-th right singular vectors. We acknowledge that this particular factorization is not immediately intuitive: one might naturally expect both $\Phi_B^{(l)}$ and $\Phi_A^{(l)}$ to be initialized from the leading singular components. The choice to draw $\Phi_A^{(l)}$ from the $(r+1)$-th through $2r$-th right singular vectors is a deliberate design decision that, while theoretically non-obvious, has been empirically validated in prior work. LoRA-GA [wang2024loraga] adopts this factorization and reports strong results, demonstrating that it yields effective adapter initializations. Furthermore, LoRAM [zhang2025primacy] reproduces the LoRA-GA initialization as one of its baselines and similarly obtains strong performance, further corroborating its effectiveness. A complete theoretical explanation for this behavior remains elusive, though it suggests that factors beyond pure approximation quality, such as the optimization landscape induced by the initialization, may play crucial roles in determining optimization behavior. Following these empirical findings, we adopt the same factorization.
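The index choice in Eq. 7 is easy to get wrong in code, because `numpy.linalg.svd` returns $V^{\top}$ rather than $V$. The sketch below assumes NumPy and computes a thin SVD that is then sliced (rather than a dedicated truncated-SVD routine); the name `lowrank_factors` is hypothetical.

```python
import numpy as np


def lowrank_factors(g_tilde: np.ndarray, r: int):
    """Return (Phi_B, Phi_A) as in Eq. 7 from the reconciled gradient."""
    if 2 * r > min(g_tilde.shape):
        raise ValueError("need 2r <= min(d_out, d_in) to take singular vectors r..2r-1")
    # Thin SVD: U is (d_out, k), Vh = V^T is (k, d_in), with k = min(d_out, d_in).
    U, _, Vh = np.linalg.svd(g_tilde, full_matrices=False)
    phi_b = U[:, :r]           # first r left singular vectors, shape (d_out, r)
    phi_a = Vh[r:2 * r, :]     # (V[:, r:2r])^T, i.e. right vectors r+1..2r, shape (r, d_in)
    return phi_b, phi_a


rng = np.random.default_rng(0)
g_tilde = rng.normal(size=(16, 12))      # toy reconciled gradient
phi_b, phi_a = lowrank_factors(g_tilde, r=4)
print(phi_b.shape, phi_a.shape)          # factor shapes match B and A in W' = W0 + BA
```

Because the columns of $U$ and $V$ are orthonormal, both factors have orthonormal columns or rows before rescaling, which is why a separate magnitude step is needed afterwards.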

##### Stage 4: Magnitude Rescaling.

Following the magnitude gain initialization [zhang2025primacy], we rescale the variance of the effective low-rank reconstruction $B_0^{(l)}A_0^{(l)}$ to match the scale of the pretrained weights. Let

$$\sigma_W^2\;=\;\mathrm{Var}(W_0^{(l)}),\qquad\sigma_{BA}^2\;=\;\mathrm{Var}(\Phi_B^{(l)}\Phi_A^{(l)}) \tag{8}$$

define the variance ratio $\eta_{\text{var}}=\sigma_W^2/\sigma_{BA}^2$ and the rank-dependent factor $\eta_r=\log_m(r)$, where $m=\min(d_{\text{out}},d_{\text{in}})$. Then

$$B_0^{(l)}\;=\;\beta\cdot\Phi_B^{(l)},\quad A_0^{(l)}\;=\;\beta\cdot\Phi_A^{(l)},\quad\text{where}\quad\beta\;=\;\bigl(\eta_r\,\eta_{\text{var}}\bigr)^{1/4}. \tag{9}$$

Consequently, $\mathrm{Var}(B_0^{(l)}A_0^{(l)})$ is aligned with $\mathrm{Var}(W_0^{(l)})$ up to the prescribed rank correction.
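The exponent $1/4$ follows from how variance scales: $B_0A_0=\beta^2\,\Phi_B\Phi_A$, so $\mathrm{Var}(B_0A_0)=\beta^4\sigma_{BA}^2=\eta_r\,\sigma_W^2$. The sketch below checks this numerically, assuming that $\mathrm{Var}(\cdot)$ means the variance over all matrix entries (the paper does not spell this out in this section) and that $r\ge 2$ so that $\eta_r>0$; `magnitude_scale` mirrors the `MagnitudeScale` call in Algorithm 1 but is an illustrative reimplementation.

```python
import math

import numpy as np


def magnitude_scale(phi_a: np.ndarray, phi_b: np.ndarray, w0: np.ndarray) -> float:
    """Return beta = (eta_r * eta_var) ** (1/4) from Eqs. 8-9."""
    r = phi_a.shape[0]
    m = min(w0.shape)
    if r < 2:
        raise ValueError("eta_r = log_m(r) is zero for r = 1 (guard is an assumption)")
    eta_r = math.log(r) / math.log(m)                     # log base m of r
    eta_var = float(np.var(w0) / np.var(phi_b @ phi_a))   # entrywise variance ratio
    return (eta_r * eta_var) ** 0.25


rng = np.random.default_rng(0)
w0 = 0.02 * rng.normal(size=(16, 12))                     # toy pretrained weight
U, _, Vh = np.linalg.svd(rng.normal(size=(16, 12)), full_matrices=False)
phi_b, phi_a = U[:, :4], Vh[4:8, :]                       # factors as in Eq. 7, r = 4

beta = magnitude_scale(phi_a, phi_b, w0)
b0, a0 = beta * phi_b, beta * phi_a
eta_r = math.log(4) / math.log(12)
print(np.var(b0 @ a0), eta_r * np.var(w0))                # the two values should agree
```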

##### Integration.

The resulting $(A_0^{(l)},B_0^{(l)})$ pairs replace the default LoRA initialization. Like other non-zero LoRA initialization methods, such as LoRA-GA and LoRAM, Slice requires weight absorption, as described in Appendix [A.1](https://arxiv.org/html/2605.12752#A1.SS1). Slice is agnostic to the downstream training algorithm: the initialized adapters can be fine-tuned with standard objectives, replay-augmented losses, or regularization-based methods. The computational overhead is limited to a gradient accumulation pass over small samples of $\mathcal{D}_t$ and $\mathcal{D}_{\text{prev}}$, followed by one SVD per target module. Figure [2](https://arxiv.org/html/2605.12752#S3.F2) depicts Slice initialization.
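Appendix A.1 is not reproduced in this section, so the exact absorption rule used by Slice is not shown here. A common approach for non-zero LoRA initializations is to subtract the initial adapter product from the frozen weight, so that the effective weight at step 0 equals the pretrained one. The sketch below illustrates that idea only; the function `absorb_adapter` and the `scale` parameter (standing in for whatever adapter scaling the implementation applies) are assumptions.

```python
import numpy as np


def absorb_adapter(w0: np.ndarray, b0: np.ndarray, a0: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Return a frozen weight W_frozen with W_frozen + scale * B0 @ A0 == W0.

    Hypothetical helper: it illustrates the general absorption idea, not the
    paper's exact procedure.
    """
    return w0 - scale * (b0 @ a0)


rng = np.random.default_rng(0)
w0 = rng.normal(size=(6, 5))             # toy pretrained weight
b0 = rng.normal(size=(6, 2))             # toy non-zero adapter factors
a0 = rng.normal(size=(2, 5))

w_frozen = absorb_adapter(w0, b0, a0, scale=2.0)
effective = w_frozen + 2.0 * (b0 @ a0)   # what the adapted layer computes at step 0
print(np.allclose(effective, w0))
```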

Algorithm 1: Slice (Gradient-Surgery Low-Rank Initialization)

Input: model parameters $\theta=\{W_l\}_{l=1}^{L}$, current-task data $\mathcal{D}_{cur}$, previous-tasks data sample $\mathcal{D}_{\text{prev}}$, target modules $\mathcal{T}$, rank $r$, accumulation steps $S_{\text{cur}}$, $S_{\text{prev}}$, reconciliation operator $\psi\in\{\psi_{\text{PCGrad}}\}$

Output: initialized adapter pairs $\{(B_0^{(l)},A_0^{(l)})\}_{l\in\mathcal{T}}$

1. Stage 1: Gradient Estimation.
2. $G_{\text{cur}}\leftarrow\frac{1}{S_{\text{cur}}}\sum_{s=1}^{S_{\text{cur}}}\nabla_W\mathcal{L}_{\text{cur}}(\theta;\,\mathcal{B}_s^{\text{cur}})$, with $\mathcal{B}_s^{\text{cur}}\sim\mathcal{D}_{cur}$
3. $G_{\text{prev}}\leftarrow\frac{1}{S_{\text{prev}}}\sum_{s=1}^{S_{\text{prev}}}\nabla_W\mathcal{L}_{\text{prev}}(\theta;\,\mathcal{B}_s^{\text{prev}})$, with $\mathcal{B}_s^{\text{prev}}\sim\mathcal{D}_{\text{prev}}$
4. [Stage 2: Gradient Reconciliation.](https://arxiv.org/html/2605.12752#S3.SS0.SSS0.Px2) (Eq. [6](https://arxiv.org/html/2605.12752#S3.E6))
5. $\tilde{G}_{\text{cur}}\leftarrow\psi\!\left(G_{\text{cur}},\;G_{\text{prev}},c\right)$
6. for each target module $l\in\mathcal{T}$ do
    1. Stage 3: Low-Rank Decomposition.
    2. $U,\Sigma,V\leftarrow\mathrm{SVD}(\tilde{G}_{\text{cur}}^{(l)})$
    3. $\Phi_B^{(l)}\leftarrow U_{:,:r}$
    4. $\Phi_A^{(l)}\leftarrow(V_{:,r:2r})^{\top}$
    5. [Stage 4: Magnitude Rescaling.](https://arxiv.org/html/2605.12752#S3.SS0.SSS0.Px4) (Eq. [9](https://arxiv.org/html/2605.12752#S3.E9))
    6. $\beta\leftarrow\texttt{MagnitudeScale}\left(\Phi_A^{(l)},\Phi_B^{(l)},W^{(l)}\right)$
    7. $A_0^{(l)}\leftarrow\beta\,\Phi_A^{(l)}$
    8. $B_0^{(l)}\leftarrow\beta\,\Phi_B^{(l)}$
7. end for
8. return $(B_0,\ A_0)$

## 4 Empirical Analysis

We evaluate Slice against vanilla initialization and two strong initialization baselines, LoRAM and LoRA-GA, and compare the reconciliation operator $\psi_{\text{PCGrad}}$ (Eq. [6](https://arxiv.org/html/2605.12752#S3.E6)) across $c\in\{0.5,0.75,1.0\}$. Further details on the experimental setup are described in Appendix [B](https://arxiv.org/html/2605.12752#A2).

##### Datasets.

We evaluate Slice in two settings. First, following previous work [jiang2025unlocking, wang2023olora], we construct task sequences from Super-NaturalInstructions (Super-NI) [wang-etal-2022-super], restricting our selection to purely generative tasks (i.e., free-form text generation, excluding classification). Second, we evaluate on TRACE [wang2023tracecomprehensivebenchmarkcontinual], a benchmark explicitly designed to stress-test CL methods, comprising diverse task families with documented interference patterns between sequential fine-tuning stages. Both settings satisfy the core desiderata for CL evaluation: they expose models to non-stationary task distributions and jointly measure forward transfer and catastrophic forgetting. The Super-NI sequences used in prior work [jiang2025unlocking, wang2023olora] are constructed by grouping tasks according to surface-level criteria (output type or NLP task category) without any explicit criterion linking sequence composition to gradient interference. However, sequences whose consecutive tasks exhibit opposing gradients pose a strictly harder continual learning problem, as every descent step on the current task degrades prior-task performance to first order. To provide a controlled stress test, we introduce the NI-Seq-Opposite sequences: 5-task sequences mined by exhaustive combinatorial search over the $\binom{46}{5}$ candidate subsets to minimize mean pairwise gradient cosine similarity. Combined with evaluation on TRACE and the standard G1/G2 sequences, the resulting protocol spans the full spectrum of inter-task interference regimes, exposing differences between methods that gradient-aligned sequences leave latent. We detail the construction procedure and per-task gradient alignment in Appendix [C](https://arxiv.org/html/2605.12752#A3).
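The mining criterion can be sketched as a brute-force search. With 46 candidate tasks and subsets of size 5 there are $\binom{46}{5}=1{,}370{,}754$ subsets, which is small enough to enumerate once the pairwise cosine matrix is precomputed. The code below is a toy illustration under stated assumptions: one flattened gradient per task, NumPy for the cosine matrix, and hypothetical function names. How the paper obtains per-task gradients and orders tasks within a sequence is described in its Appendix C, not here.

```python
from itertools import combinations

import numpy as np


def cosine_matrix(grads) -> np.ndarray:
    """Pairwise cosine similarity between flattened per-task gradients."""
    flat = np.stack([np.asarray(g, dtype=float).ravel() for g in grads])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T


def mine_opposing_subset(grads, k: int):
    """Exhaustively pick the k-task subset with minimal mean pairwise cosine."""
    cos = cosine_matrix(grads)

    def score(subset):
        pairs = list(combinations(subset, 2))
        return sum(cos[i, j] for i, j in pairs) / len(pairs)

    best = min(combinations(range(len(grads)), k), key=score)
    return best, float(score(best))


# Four toy "task gradients" in 2-D (illustrative values only).
toy_grads = [
    np.array([1.0, 0.0]),
    np.array([-1.0, 0.0]),   # exactly opposite to task 0
    np.array([0.0, 1.0]),
    np.array([1.0, 1.0]),
]
best_subset, best_score = mine_opposing_subset(toy_grads, k=2)
print(best_subset, best_score)
```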

##### Evaluation.

Following the standard CL evaluation framework, we construct a results matrix $\mathbf{R}\in\mathbb{R}^{T\times T}$ where entry $R_{i,j}$ denotes the performance of the model on task $j$ after completing training on task $i$, evaluated on each task's held-out split. From this matrix, we derive three CL metrics. Average Performance (*AP*) is the mean of the diagonal entries $R_{i,i}$, capturing task-specific performance measured immediately after each task is trained, thereby reflecting the model's peak retention before any subsequent forgetting. Final Performance (*FP*) is the mean of the last row, $R_{T,:}$, measuring performance across all tasks after the full sequence has been trained and thus capturing how much task knowledge is retained at the end of training. Forgetting (*Fgt*) is the difference *AP* $-$ *FP*, quantifying the average performance degradation induced by subsequent task training. All three metrics (*AP*, *FP*, and *Fgt*) use the evaluation splits of the CL task sequence itself (e.g., Super-NaturalInstructions or TRACE), with generation-based evaluation scoring each task by exact match or ROUGE-L depending on the task type. In addition, we evaluate general language model capabilities on a fixed set of held-out benchmarks never encountered during training. General Performance (*GP*) is the mean zero-shot accuracy across HellaSwag, CommonsenseQA, and Alpaca (scored via ROUGE-L), evaluated after the final training stage, measuring the preservation of broad language understanding. In-context Performance (*IP*) evaluates the same three benchmarks plus BBH Object Counting, using few-shot prompting (5-shot for most tasks, 3-shot for BBH), and measures the model's ability to leverage in-context demonstrations after continual training. Detailed computation of the metrics *AP*, *FP*, *GP*, *IP*, and *Fgt* is presented in Appendix [D](https://arxiv.org/html/2605.12752#A4).
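The three matrix-derived metrics reduce to a few lines of arithmetic. The sketch below uses a made-up $3\times 3$ results matrix (values are illustrative, not from the paper) and follows the definitions stated above: *AP* is the diagonal mean, *FP* is the last-row mean, and *Fgt* is their difference. The function name `cl_metrics` is hypothetical, and Appendix D may specify details not captured here.

```python
def cl_metrics(R):
    """Compute AP, FP, and Fgt from a T x T results matrix.

    R[i][j] is the score on task j after training on task i (0-indexed).
    """
    T = len(R)
    ap = sum(R[i][i] for i in range(T)) / T   # mean of the diagonal
    fp = sum(R[T - 1]) / T                    # mean of the last row
    return {"AP": ap, "FP": fp, "Fgt": ap - fp}


# Illustrative scores only; entries above the diagonal are unused by these metrics.
R = [
    [0.8, 0.0, 0.0],
    [0.6, 0.7, 0.0],
    [0.5, 0.6, 0.9],
]
metrics = cl_metrics(R)
print(metrics)
```

In this toy matrix task 1 drops from 0.8 to 0.5 and task 2 from 0.7 to 0.6 by the end of the sequence, which is what the positive *Fgt* value summarizes.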

Table 1: CL metrics across all sequences: baseline values with $\Delta$ for each Slice variant (rank 64).

Table 2: *GP* and *IP* preservation across all sequences (rank 64). The first row reports absolute scores for Vanilla LoRA; all remaining rows show $\Delta$ relative to Vanilla LoRA. Slice variants remain close to the reference on *GP* across all sequences, indicating that the substantial *FP* and forgetting gains in Table [1](https://arxiv.org/html/2605.12752#S4.T1) incur negligible cost to general capability.

### 4.1 Results

Table [1](https://arxiv.org/html/2605.12752#S4.T1) shows that Slice consistently improves *FP* and reduces *Fgt* across task sequences in comparison with baselines, with the most pronounced gains on sequences characterized by severe catastrophic forgetting. When compared to Vanilla LoRA, Slice ($c=1.0$) achieves *FP* improvements of up to +22.55 on G2 and +16.97 on Opp1, while simultaneously reducing *Fgt* by up to 18.70 and 20.75 points on Opp1 and Opp3, respectively, the sequences where the baseline exhibits the highest inter-task interference. Similar trends hold for LoRAM, where Slice ($c=0.50$) recovers up to +26.06 *FP* points on G2 and reduces *Fgt* by 11.93. Gains on *AP* are more modest, consistent with the fact that Slice targets the retention of previously acquired knowledge rather than peak task-specific performance during training. The comparatively smaller gains observed with LoRA-GA indicate that its initialization already exhibits favorable properties that partially address catastrophic forgetting, yet our approach further enhances these gains beyond what LoRA-GA alone achieves. These trends persist at rank $r=128$, as reported in Appendix [E](https://arxiv.org/html/2605.12752#A5).

![Refer to caption](https://arxiv.org/html/2605.12752v1/x1.png)

Figure 3: Performance heatmaps comparing Vanilla LoRA vs. Slice ($c=1.0$) during CL on NI-Seq-Opposite-1. Numbers above each heatmap indicate baseline performance on the trained task, *GP*, and *IP*, at left, center, and right, respectively. Heatmap values show percentage change relative to baseline.

Table [2](https://arxiv.org/html/2605.12752#S4.T2) shows that Slice variants incur only modest reductions in *GP* and *IP* relative to Vanilla LoRA on most sequences; the exception is *IP* on NI-Seq-G2, where larger drops are observed. Crucially, however, these reductions in general capability occur alongside substantially larger gains in task retention, indicating that the gradient surgery applied by Slice primarily reallocates optimization pressure toward mitigating catastrophic forgetting rather than compressing the model's underlying representational capacity. This asymmetry in the trade-off (small *GP* cost against large *FP* and *Fgt* gains) suggests that Slice reduces destructive gradient interference between sequential tasks without significantly degrading broad linguistic knowledge. Moreover, the $\alpha$ sweep in Appendix [A](https://arxiv.org/html/2605.12752#A1) demonstrates that the *GP* and *IP* cost is largely recoverable: reducing $\alpha$ from 2 to 1 closes much of the *IP* gap on G2 while leaving *AP*, *FP*, and *Fgt* nearly unchanged, indicating that $\alpha$ serves as a reliable control knob for the generalization-retention trade-off without requiring a different initialization strategy.

Figure [3](https://arxiv.org/html/2605.12752#S4.F3) shows the stability-plasticity dynamics on NI-Seq-Opposite-1, comparing Vanilla LoRA (the highest-*AP* baseline) with Slice ($c=1.0$). Vanilla LoRA exhibits strong performance across all tasks at each step, as indicated by the top values in the left heatmap. However, its off-diagonal entries decay rapidly as subsequent tasks are learned: by the final stage M5, tasks 1 and 2 have collapsed to 19% and 21% of the baseline score, revealing severe catastrophic forgetting. In contrast, Slice ($c=1.0$) maintains substantially higher off-diagonal retention throughout the sequence: at M5, task 1 not only avoids collapse but reaches 243% of baseline (26.5 points, higher than the vanilla baseline), while tasks 3-4 remain similar. This pattern confirms that Slice achieves a competitive *AP* while delivering markedly higher *FP* and lower *Fgt*, precisely because the conflict-aware initialization steers the adapter subspace away from directions that would overwrite earlier task knowledge.

Slice shows a mild reduction relative to Vanilla LoRA on *GP* and *IP*. Nevertheless, these marginal losses in general capability are small in comparison to the CL gains, as shown in Table [1](https://arxiv.org/html/2605.12752#S4.T1). Opp1 shows an *FP* improvement of 16.97 and a *Fgt* reduction of 18.70 points, indicating that the gradient surgery reallocates representational capacity toward task retention without meaningfully eroding the model's broad linguistic competence. Appendix [F](https://arxiv.org/html/2605.12752#A6) extends this comparison to LoRA-GA and LoRAM. Slice remains the only approach whose *FP* does not collapse on any task (including Algebra, where all other methods perform poorly) while remaining competitive on *GP* and *IP*.

Our results are robust to a known confounder in LoRA initialization comparisons. Zhang et al. [zhang2025primacy] showed that spectral initialization methods can appear beneficial due to implicit magnitude amplification rather than directional quality. Our experimental protocol applies variance-matched magnitude rescaling ([Stage 4](https://arxiv.org/html/2605.12752#S3.SS0.SSS0.Px4)) across all initialization baselines (Slice, LoRA-GA, and LoRAM), so the gains reported reflect genuine subspace selection rather than scale artifacts.

## 5 Related Work

##### Initialization Strategies for LoRA\.

The standard LoRA initialization scheme [hu2022lora], with Gaussian noise for $A$ and zeros for $B$, can limit convergence speed and downstream performance, motivating alternative initialization strategies. Existing methods can be broadly grouped into spectral approaches, which derive adapter parameters from pre-trained weights, and gradient-based approaches, which use task-specific gradient information. Spectral methods such as PiSSA [meng2024pissa] and MiLoRA [wang2024milora] initialize adapters from principal or minor components of the pre-trained weights. Zhang et al. [zhang2025primacy] argued that much of the benefit of these methods may arise from increased update magnitude rather than from the specific spectral directions, and proposed LoRAM as a simpler alternative based on deterministic orthogonal bases constructed with the Discrete Sine Transform. In contrast, LoRA-GA [wang2024loraga] initializes adapters by computing gradients on a small calibration dataset and decomposing the resulting gradient matrices to obtain a task-relevant subspace.

Based on these findings, we selected LoRAM and LoRA-GA as initialization baselines representing magnitude-oriented and gradient-based initialization strategies, respectively. Slice extends LoRA-GA: while LoRA-GA initializes adapters using only the adaptation task gradient $G_{\text{cur}}$, Slice also incorporates the preserve task gradient $G_{\text{prev}}$ and projects out conflicting components before decomposition.

##### LoRA\-Based Continual Learning\.

A growing body of work leverages low-rank adapters to mitigate catastrophic forgetting in CL, predominantly through orthogonality constraints imposed *during training*. O-LoRA [wang2023olora] learns each task within a low-rank subspace kept orthogonal to the subspaces of all previous tasks. InfLoRA [liang2024inflora] designs the dimensionality reduction matrix $B_t$ such that the update subspace is orthogonal to the gradient subspace of previously learned tasks. OPLoRA [xiong2026oplora] constrains updates to the orthogonal complement of the dominant singular subspace of the pre-trained weights. Beyond orthogonality, AM-LoRA [liu2024learningattentionalmixtureloras] introduces an attentional mixture of task-specific LoRA modules with learned routing. Crucially, all of these methods address the stability-plasticity trade-off through constraints or architectural mechanisms applied *throughout the training process*, while remaining agnostic to how the adapter parameters are initialized. Slice is not a competitor to these methods but rather a complementary component that occupies a different stage of the pipeline: it determines the initial values of $A$ and $B$ *before any training begins*, and the resulting adapters can then be trained with any of the above strategies.

##### Gradient surgery in multi\-task optimization\.

Multi-task optimization (MTO) replaces the naive average of per-task gradients with a corrected update direction that mitigates destructive interference. PCGrad [yu2020gradient_surgery] projects each task gradient onto the normal plane of any conflicting partner whenever their cosine similarity is negative. GradVac [wang2021gradvac] generalizes this surgery by aligning gradients to a target cosine similarity tracked online with an exponential moving average rather than to strict orthogonality, recovering PCGrad as the special case of a zero target. MGDA [sener2018mgda] returns the minimum-norm convex combination of task gradients and converges to a Pareto-stationary point. GradDrop [chen2020graddrop] stochastically masks coordinates whose task gradients disagree in sign, biasing the optimizer toward joint minima rather than task-specific ones. Aligned-MTL [senushkin2023alignedmtl] stabilizes training by enforcing a unit condition number on the gradient matrix via a Procrustes-style SVD, simultaneously eliminating directional conflict and magnitude dominance. Nash-MTL [navon2022nashmtl] casts the gradient-combination step as a cooperative bargaining game and solves for the weights realizing the Nash equilibrium over directional utilities, yielding a scale-invariant aggregator. RotoGrad [javaloy2022rotograd] pairs magnitude rebalancing with learned task-specific rotations of the shared feature space to homogenize both the size and the direction of gradient updates. FAMO [liu2023famo] amortizes the cost of gradient surgery to $\mathcal{O}(1)$ per step by maintaining softmax-parameterized task weights updated from temporal differences in scalar log-losses.
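To make the projection rule concrete, the following is a minimal NumPy sketch of PCGrad-style surgery on flattened task gradients. It illustrates the published PCGrad rule only, not the projection operator used by Slice; the function name `pcgrad` and the toy vectors are assumptions made for illustration.

```python
import numpy as np


def pcgrad(grads, seed=0):
    """PCGrad-style surgery on a list of flattened task gradients.

    For each task gradient g_i, visit the other tasks in random order and,
    whenever the dot product with g_j is negative (a conflict), remove the
    component of g_i along g_j. Returns the summed update and the list of
    per-task adjusted gradients.
    """
    rng = np.random.default_rng(seed)
    adjusted = []
    for i, g in enumerate(grads):
        g_i = np.array(g, dtype=float)  # work on a copy
        others = [j for j in range(len(grads)) if j != i]
        for j in rng.permutation(others):
            g_j = np.asarray(grads[j], dtype=float)
            dot = g_i @ g_j
            if dot < 0.0:  # negative cosine similarity
                g_i = g_i - (dot / (g_j @ g_j)) * g_j
        adjusted.append(g_i)
    return np.sum(adjusted, axis=0), adjusted


# Toy example (hypothetical values): two conflicting 2-D gradients.
g1 = np.array([1.0, 0.0])
g2 = np.array([-1.0, 1.0])
update, (p1, p2) = pcgrad([g1, g2])
# p1 == [0.5, 0.5] and p2 == [0.0, 1.0]; each is orthogonal to the other task's
# original gradient, and update == [0.5, 1.5].
```

After surgery, neither adjusted gradient has a negative component along the other task's original gradient, which is the property the projection is meant to guarantee.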

## 6 Conclusion

We introduced Slice, a LoRA initialization method based on gradient surgery that reconciles the trade-off between learning new tasks and maintaining the capabilities acquired on previously learned tasks. Slice substantially improves stability without a proportionate decrease in plasticity.

In sequences exhibiting severe inter-task gradient conflict, particularly the NI-Seq-Opposite sequences, Slice yields *FP* gains of up to +22.55 points and *Fgt* reductions exceeding 20 points over vanilla LoRA, while *AP* remains competitive. Slice is composable by design: since it operates exclusively at initialization, the resulting adapter pairs serve as drop-in replacements compatible with any downstream training-time continual learning strategy.

Slice has a potentially negative impact on *GP* and *IP*, with mild *GP* reductions. While this effect is small relative to the CL improvements, it indicates that the projected gradient subspace, although conflict-free, may sacrifice some alignment with directions beneficial to broad linguistic generalization. An appropriate choice of the $\alpha$ hyperparameter can diminish this effect.

Our evaluation is designed to isolate the effect of subspace selection from confounding factors, such as implicit magnitude amplification, by controlling for them via variance-matched rescaling across all baselines. We further introduce NI-Seq-Opposite, a set of adversarial, maximally gradient-conflicting task sequences, complementing the standard TRACE, G1, and G2 evaluations to provide a complete picture of each method's behavior across the full spectrum of inter-task interference.

We show that the initialization of low-rank adapters is a consequential and underexplored intervention point for CL: a single, lightweight gradient-surgery step prior to training yields stability gains superior to other initialization methods. Future work can expand on conflict-aware subspace selection at initialization beyond the projection operators studied here and use our conflicting task sequences, NI-Seq-Opposite, to evaluate the continual adaptation of LLMs.

### 6.1 Limitations

##### Computational cost considerations\.

While LoRA adapters reduce the number of parameters requiring gradient updates and thus improve training efficiency relative to full fine-tuning, Slice's gradient-based initialization incurs non-trivial computational overhead during setup. Slice requires backward passes over batches of $\mathcal{D}_{cur}$ and $\mathcal{D}_{prev}$. Additionally, the cost of the randomized SVD scales as $\mathcal{O}(\min(m^{2}n, mn^{2}))$ per target matrix of dimensions $m \times n$. For practitioners operating under strict compute constraints, this overhead may still be prohibitive relative to LoRAM's essentially free deterministic initialization.

### 6.2 Broader Impact

Continual fine-tuning of large language models is increasingly common in deployed systems, where models must adapt sequentially to new domains, instructions, and user requirements. Methods that reduce catastrophic forgetting in this regime lower the practical barrier to maintaining capable models over time, without the cost of full retraining. By providing a lightweight, composable initialization step that improves task retention, Slice helps make sequential adaptation more resource-efficient. This is relevant to both research labs and practitioners operating under resource constraints.

We also introduce NI-Seq-Opposite, a set of sequences for evaluating continual learning methods under adversarial gradient conflict. This resource can help future work measure robustness to inter-task interference in a controlled, reproducible way, independent of the proposed method.

#### Acknowledgments

This study was financed in part by the Coordination for the Improvement of Higher Education Personnel \(CAPES\) — Finance Code 001; by Conselho Nacional de Desenvolvimento Científico e Tecnológico \(CNPq\)— Grant Number: 443072/2024\-8; and by Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul \(FAPERGS\) — Grant Number: 25/2551\-0000891\-3\.

This work was supported by Kunumi Institute\. The authors thank the institution for its financial support and commitment to advancing scientific research\.

## References

## Appendix A Alpha Variation

We sweep rs-LoRA's $\alpha$ for Slice ($c=0.50$) across three sequences, NI-Seq-Opp3, NI-Seq-G2, and TRACE, to quantify how scaling the adapter update trades off generalization (*GP* and *IP*) against task metrics (*AP*, *FP*, and *Fgt*). For each sequence, we report results at $\alpha \in \{1, 2, 4\}$ and highlight $\Delta(2-1)$ to isolate the effect of a modest reduction in $\alpha$.

Table [3](https://arxiv.org/html/2605.12752#A1.T3) shows that decreasing $\alpha$ from 2 to 1 improves *GP* and *IP* for NI-Seq-G2 ($\Delta(2-1)$ of $-4.2$ for *GP* and $-19.5$ for *IP*), while *AP* and *FP* change modestly and *Fgt* stays small. The same trend appears in the other sequences, suggesting that $\alpha$ is a reliable control knob for *GP* and *IP* with little effect on *AP*, *FP*, and *Fgt*.

Figure [4](https://arxiv.org/html/2605.12752#A1.F4) shows the per-metric effect of $\alpha$ on TRACE: *GP* and *IP* degradation shrinks as $\alpha$ is reduced, while *AP*, *FP*, and *Fgt* remain comparatively stable. We match the initial adapter variance to the layer weights for LoRA-GA, LoRAM, and Slice, but not for vanilla LoRA, which uses a lower initial variance as discussed in [zhang2025primacy].

Table 3: Slice ($c=0.50$): effect of $\alpha$ across three sequences. Deltas highlight that $\alpha$ can be used to control *GP* and *IP*.

![Refer to caption](https://arxiv.org/html/2605.12752v1/x2.png) (a) AP
![Refer to caption](https://arxiv.org/html/2605.12752v1/x3.png) (b) FP
![Refer to caption](https://arxiv.org/html/2605.12752v1/x4.png) (c) GP
![Refer to caption](https://arxiv.org/html/2605.12752v1/x5.png) (d) IP

Figure 4: Metrics for TRACE across $\alpha \in \{1, 2, 4\}$ for Slice ($c=0.50$).

### A.1 Weight absorption

After computing $(A, B)$ for each target linear layer, we write the tensors into the corresponding LoRA adapter parameters *before* training. To preserve the function of the pretrained model at initialization (i.e., to ensure that the *effective* weight remains $W_0$ even after attaching LoRA), we perform *weight absorption*. With the effective weight given by

$$W_{\text{eff}} = W_{\text{base}} + \frac{\alpha}{r} BA, \tag{10}$$

we set the frozen base weight to

$$W_{\text{base}} \leftarrow W_0 - \frac{\alpha}{r} BA, \tag{11}$$

so that immediately after initialization,

$$W_{\text{eff}} = \Big(W_0 - \frac{\alpha}{r} BA\Big) + \frac{\alpha}{r} BA = W_0. \tag{12}$$

This absorption step ensures that any observed effect at step 0 is attributable to training dynamics rather than an unintended initial perturbation of the forward pass.
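The identity in Eqs. (10)–(12) can be checked numerically. The snippet below is a minimal NumPy sketch with random toy matrices; the helper names `absorb_lora_init` and `effective_weight` and the toy shapes are assumptions for illustration, and the $\alpha / r$ scaling is taken directly from the equations above.

```python
import numpy as np


def absorb_lora_init(w0, a, b, alpha, r):
    """Frozen base weight after absorbing the initial adapter, Eq. (11)."""
    return w0 - (alpha / r) * (b @ a)


def effective_weight(w_base, a, b, alpha, r):
    """Effective weight of the adapted layer, Eq. (10)."""
    return w_base + (alpha / r) * (b @ a)


rng = np.random.default_rng(0)
m, n, r, alpha = 6, 4, 2, 2.0       # toy sizes, not the paper's
w0 = rng.normal(size=(m, n))        # stands in for the pretrained weight W_0
b = rng.normal(size=(m, r))         # non-zero B, as a gradient-based init would produce
a = rng.normal(size=(r, n))

w_base = absorb_lora_init(w0, a, b, alpha, r)
w_eff = effective_weight(w_base, a, b, alpha, r)
```

Here `w_eff` matches `w0` up to floating-point error even though `w_base` differs from `w0`, which is the point of the absorption step: the non-zero initial adapter does not change the forward pass at step 0.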

## Appendix B Experimental Configuration

##### Base model and tokenizer\.

All experiments use Llama-3.2-3B-Instruct, trained using PEFT and `bfloat16` quantization.

##### Data and task sequences\.

Continual learning sequences come from SuperNI and TRACE\. SuperNI inputs are formatted as a chat prompt with the task definition and input; TRACE prompts include the task name and instruction\. For each task, we shuffle with seed 42 and hold out up to 200 examples for validation; the remaining examples are used for training\.

##### LoRA adapters\.

We use rs-LoRA with rank $r \in \{64, 128\}$ (main tables use $r=64$), scaling factor $\alpha=2$, dropout 0.0, and no bias adaptation. Adapters are inserted into all attention and MLP projection modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.

##### Training hyperparameters\.

We train for 3 epochs with the AdamW optimizer, learning rate $1 \times 10^{-4}$, warmup ratio 0.01, and no weight decay. Per-device batch sizes are 16 (train) and 8 (eval) with gradient accumulation of 2, giving an effective train batch size of 32. Sequence length is capped at 256 tokens.

##### Gradient estimation and projection\.

For Slice, we compute current and previous gradients with at most 8 accumulation steps (script default). Previous gradients include all previous tasks; we build a dataloader per retain task, each using the training batch size. Projection uses a single global coefficient shared across modules.

##### SVD\.

For SVD-based initializations, we use randomized low-rank SVD with $q = 4r$ and 4 power iterations.
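For readers unfamiliar with the technique, the following is a minimal NumPy sketch of a randomized truncated SVD with power iterations. It is a generic textbook-style implementation, not the authors' script. It assumes that $q$ denotes the width of the random sketch, which the text does not state explicitly; the function name `randomized_svd` and the toy matrix are also illustrative.

```python
import numpy as np


def randomized_svd(mat, rank, q=None, n_iter=4, seed=0):
    """Randomized truncated SVD via a random range finder.

    q is the sketch width (assumed here to default to 4 * rank, capped by the
    matrix dimensions); n_iter is the number of power iterations.
    """
    m, n = mat.shape
    q = min(4 * rank if q is None else q, m, n)
    rng = np.random.default_rng(seed)
    # Sketch the column space with a Gaussian test matrix.
    basis, _ = np.linalg.qr(mat @ rng.normal(size=(n, q)))
    # Power iterations sharpen the basis toward the leading singular directions;
    # re-orthonormalizing at each half-step keeps the iteration stable.
    for _ in range(n_iter):
        basis, _ = np.linalg.qr(mat.T @ basis)
        basis, _ = np.linalg.qr(mat @ basis)
    # Solve the small problem inside the q-dimensional subspace, then lift back.
    u_small, s, vt = np.linalg.svd(basis.T @ mat, full_matrices=False)
    u = basis @ u_small
    return u[:, :rank], s[:rank], vt[:rank]


# Toy check on an exactly rank-3 matrix (hypothetical data).
rng = np.random.default_rng(1)
low_rank = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
u, s, vt = randomized_svd(low_rank, rank=3)
approx = (u * s) @ vt
```

On this toy input the rank-3 reconstruction `approx` recovers `low_rank` up to floating-point error. On real gradient matrices the approximation quality depends on how quickly the singular values decay.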

##### Evaluation\.

We evaluate seen tasks with max\. input length 512, using up to 64 evaluation samples per task\.

##### Hardware\.

Experiments were run on two NVIDIA RTX A6000 GPUs with 49 GB of VRAM each.

## Appendix C NI-Seq-Opposite

Existing Super-NI sequences used in prior work [jiang2025unlocking, wang2023olora] do not control for inter-task gradient interference, leaving sequences with maximal inter-task gradient interference underrepresented in standard evaluations. NI-Seq-Opposite fills this gap: we mine 5-task sequences by exhaustive combinatorial search over $\binom{46}{5}$ candidate subsets to minimize mean pairwise gradient cosine similarity, defined as:

$$\bar{\phi}(\mathcal{S}) = \binom{N}{2}^{-1} \sum_{1 \leq i < j \leq N} \frac{\langle g_i, g_j \rangle}{\|g_i\| \, \|g_j\|}, \tag{13}$$

where $\mathcal{S} = (T_1, \dots, T_N)$ is a task sequence and $g_i = \mathrm{vec}(G^{(i)}) \in \mathbb{R}^{D}$ is the gradient of the base model on task $T_i$, averaged over a small calibration set. Sequences with $\bar{\phi} \ll 0$ contain many conflicting task pairs.

##### Mining procedure\.

We compute $g_i$ for each task $T_i$ in a candidate pool $\mathcal{P}$ of $M$ tasks by accumulating gradients on the frozen base model for a small number of steps. We then score all $\binom{M}{2}$ pairs once and evaluate every candidate $N$-task subset by summing precomputed pairwise scores:

$$\mathcal{S}^{*} = \arg\min_{\mathcal{S} \subseteq \mathcal{P},\, |\mathcal{S}| = N} \bar{\phi}(\mathcal{S}). \tag{14}$$

We searched a pool of $M=46$ tasks with $N=5$, evaluating all $\binom{46}{5} = 1{,}370{,}754$ candidate subsets from $\binom{46}{2} = 1{,}035$ precomputed pair scores.
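The search in Eqs. (13) and (14) can be sketched in a few lines. The snippet below is an illustrative NumPy version on random toy gradients; the function names, pool size, and gradient dimension are assumptions, and the real procedure operates on flattened model gradients rather than 8-dimensional vectors.

```python
from itertools import combinations
from math import comb

import numpy as np


def pairwise_cosine(grads):
    """Cosine-similarity matrix for flattened task gradients (rows of `grads`)."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    return unit @ unit.T


def mine_opposing_subset(grads, n):
    """Exhaustively pick the n-task subset minimizing mean pairwise cosine, Eq. (14)."""
    cos = pairwise_cosine(np.asarray(grads, dtype=float))  # pair scores computed once
    best_subset, best_score = None, np.inf
    for subset in combinations(range(len(grads)), n):
        score = np.mean([cos[i, j] for i, j in combinations(subset, 2)])  # Eq. (13)
        if score < best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy pool: 6 hypothetical tasks with 8-dimensional gradients.
rng = np.random.default_rng(0)
toy_grads = rng.normal(size=(6, 8))
subset, score = mine_opposing_subset(toy_grads, n=3)

# Size of the search reported in the text for M = 46 and N = 5.
n_subsets, n_pairs = comb(46, 5), comb(46, 2)
```

Precomputing the cosine matrix keeps the per-subset cost at ten lookups for $N=5$, which is what makes the exhaustive search over roughly 1.37 million subsets practical.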

##### Selected sequences\.

Table [4](https://arxiv.org/html/2605.12752#A3.T4) lists the five adversarial sequences produced by this search. The experiments reported in the main manuscript use NI-Seq-Opposite-1, NI-Seq-Opposite-2, and NI-Seq-Opposite-3.

Table 4: Task composition of the five NI-Seq-Opposite sequences. Sequences 1–3 are used in the main manuscript experiments; 4 and 5 are reported here for completeness.

##### Standard Super-NI sequences.

For completeness, Table [5](https://arxiv.org/html/2605.12752#A3.T5) lists the task compositions of the two standard Super-NI sequences used in the main manuscript, G1 and G2. Both are pure-generation sequences drawn from the same Super-NI pool as the adversarial sequences; they are constructed by grouping tasks by output type without any constraint on gradient alignment.

Table 5: Task composition of the two standard Super-NI sequences. Both are pure-generation sequences following the construction of Jiang et al. [jiang2025unlocking].

##### TRACE sequence.

Table [6](https://arxiv.org/html/2605.12752#A3.T6) lists the six tasks comprising the TRACE benchmark sequence [wang2023tracecomprehensivebenchmarkcontinual], which spans diverse domains and task types explicitly chosen to exhibit interference between sequential fine-tuning stages.

Table 6: Task composition of the TRACE benchmark sequence. Tasks span Chinese and English, covering stance detection, finance, summarization, code, science QA, and math reasoning.

## Appendix D Metrics

We consider a sequence of $T$ tasks $\{t_1, \dots, t_T\}$ learned in order. Let $R_{i,j}$ denote the score on task $t_j$ after the model has been sequentially trained on tasks $t_1, \dots, t_i$. These scores form a lower-triangular matrix where entry $R_{i,j}$ is defined for $j \leq i$.

##### Average Performance \(*AP*\)\.

The mean diagonal score, capturing how well the model performs on each task immediately after learning it:

$$\mathrm{AP} = \frac{1}{T} \sum_{i=1}^{T} R_{i,i}. \tag{15}$$

##### Final Performance \(*FP*\)\.

The mean score over all learned tasks evaluated at the final stage:

$$\mathrm{FP} = \frac{1}{T} \sum_{j=1}^{T} R_{T,j}. \tag{16}$$

##### Forgetting \(*Fgt*\)\.

The average drop in per\-task performance between the time a task was first learned and the end of training:

$$\mathrm{Forget} = \mathrm{AP} - \mathrm{FP} = \frac{1}{T} \sum_{i=1}^{T} \bigl(R_{i,i} - R_{T,i}\bigr). \tag{17}$$

A positive value indicates catastrophic forgetting; a negative value indicates that subsequent training improved performance on earlier tasks (backward transfer).
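As a worked illustration of Eqs. (15)–(17), the sketch below computes the three metrics from a small score matrix. The scores are invented toy values, not results from the paper, and `cl_metrics` is a hypothetical helper name.

```python
import numpy as np


def cl_metrics(r):
    """Return (AP, FP, Fgt) from a T x T score matrix.

    r[i, j] is the score on task j after training stage i (0-indexed);
    entries above the diagonal are undefined and may be NaN.
    """
    r = np.asarray(r, dtype=float)
    ap = np.mean(np.diag(r))   # Eq. (15): score right after learning each task
    fp = np.mean(r[-1])        # Eq. (16): scores at the final stage
    return ap, fp, ap - fp     # Eq. (17): forgetting = AP - FP


# Toy 3-task run (made-up numbers): earlier tasks lose accuracy over time.
scores = np.array([
    [80.0, np.nan, np.nan],
    [60.0, 70.0, np.nan],
    [50.0, 65.0, 90.0],
])
ap, fp, fgt = cl_metrics(scores)
# ap = 80.0, fp = 205/3 (about 68.33), fgt = 35/3 (about 11.67)
```

The positive `fgt` reflects the drop on tasks 1 and 2 between the stage at which they were learned and the final stage.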

##### General Performance \(*GP*\)\.

Let $\mathcal{B} = \{b_1, \dots, b_K\}$ be a set of general-purpose benchmarks evaluated via `lm-eval-harness` [eval-harness]. Each benchmark $b_k$ is evaluated in a *zero-shot* setting and scored by its primary metric (accuracy or normalized accuracy when available, falling back to exact match, F1, or ROUGE-L). *GP* is the mean score across all benchmarks after the final training stage:

$$\mathrm{GP} = \frac{1}{K} \sum_{k=1}^{K} s_{b_k}^{(0)}, \tag{18}$$

where $s_{b_k}^{(0)}$ denotes the zero-shot score on benchmark $b_k$.

##### In Context Performance \(*IP*\)\.

*IP* mirrors *GP* but evaluates each benchmark in a *few-shot* setting to measure the model's ability to follow in-context demonstrations after continual learning. Concretely, each benchmark is re-evaluated with $n$ in-context examples ($n=5$ for most tasks; $n=3$ for BBH tasks):

$$\mathrm{IP} = \frac{1}{K} \sum_{k=1}^{K} s_{b_k}^{(n_k)}, \tag{19}$$

where $s_{b_k}^{(n_k)}$ is the $n_k$-shot score on benchmark $b_k$. The gap $\mathrm{IP} - \mathrm{GP}$ reflects how effectively the model leverages in-context examples; a shrinking gap signals that the model derives progressively less benefit from in-context demonstrations.

## Appendix E Rank-128 Experiments

All main manuscript results use LoRA rank $r=64$. To assess whether the gains from Slice persist at higher rank, we re-run the full comparison at $r=128$ across all six evaluation sequences. The experimental setup is otherwise identical to the rank-64 protocol.

Table [7](https://arxiv.org/html/2605.12752#A5.T7) reports *AP*, *FP*, and *Fgt* in the same baseline-plus-delta format as Table [1](https://arxiv.org/html/2605.12752#S4.T1) in the main text. Table [8](https://arxiv.org/html/2605.12752#A5.T8) reports *GP* and *IP* preservation.

##### Summary of findings\.

The pattern of results at rank 128 closely mirrors rank 64. On adversarial sequences (Opp1 and Opp3), Slice consistently recovers large *FP* gaps relative to Vanilla LoRA: on Opp1, Slice ($c=1.0$) improves *FP* by $+20.31$ points and reduces *Fgt* by $19.34$ points, while on Opp3, Slice ($c=0.75$) improves *FP* by $+24.10$ points and reduces *Fgt* by $19.68$ points. On G2, all three Slice variants improve *FP* by 12–15 points over Vanilla LoRA, consistent with rank-64 results. On standard sequences (G1, TRACE), differences remain small. Finally, *GP* and *IP* degradation at rank 128 is comparable to rank 64, remaining close to the Vanilla LoRA reference across most sequences (Table [8](https://arxiv.org/html/2605.12752#A5.T8)).

Table 7: CL metrics across all sequences at rank 128. Baseline absolute scores appear in bold rows; indented rows show $\Delta$ for each Slice variant relative to its base initializer.

Table 8: *GP* and *IP* preservation at rank 128. The first row reports absolute scores for Vanilla LoRA; all remaining rows show $\Delta$ relative to Vanilla LoRA. Slice variants remain close to the reference on *GP* across all sequences.

## Appendix F Additional Performance Heatmaps on NI-Seq-Opposite-1

![Refer to caption](https://arxiv.org/html/2605.12752v1/x6.png)

Figure 5: Performance heatmaps comparing LoRA-GA vs. LoRAM vs. Slice ($c=1.0$) during continual learning on NI-Seq-Opposite-1. Numbers above each heatmap indicate baseline performance on the trained task, *GP*, and *IP*, at left, center, and right, respectively. Heatmap values show percentage change relative to baseline.

Figure [3](https://arxiv.org/html/2605.12752#S4.F3) in the main manuscript compares Vanilla LoRA against Slice ($c=1.0$) on NI-Seq-Opposite-1. Vanilla LoRA is chosen as the reference there because it achieves the highest *AP* among the baselines on this sequence, making the forgetting contrast the most visually striking.

Figure [5](https://arxiv.org/html/2605.12752#A6.F5) shows the analogous three-panel heatmaps for the two remaining baselines, LoRA-GA and LoRAM, paired against Slice ($c=1.0$) in each case. The layout is identical to Figure [3](https://arxiv.org/html/2605.12752#S4.F3): the left panel shows per-task trained-task evaluation (values are percentage of each task's baseline score, so 100 = parity with the model before any continual training), the center panel shows zero-shot *GP* on the four held-out benchmarks, and the right panel shows few-shot *IP* on the same benchmarks plus BBH Object Counting.

##### LoRA-GA vs. Slice.

LoRA-GA initializes adapters using only the current-task gradient, achieving higher *AP* than Vanilla LoRA on NI-Seq-Opposite-1 due to its task-aligned initialization. However, its off-diagonal entries still decay substantially across stages: tasks learned early in the sequence suffer significant forgetting by the final stage. Slice ($c=1.0$) recovers or improves upon LoRA-GA's off-diagonal retention while matching its diagonal scores, confirming that the gradient-surgery projection adds value beyond the task-aware subspace that LoRA-GA already captures. *GP* and *IP* degradation are comparable between the two methods.

##### LoRAM vs. Slice.

LoRAM initialization relies on a deterministic orthogonal basis constructed via the Discrete Sine Transform. LoRAM achieves a strong *AP* on this sequence. However, like LoRA-GA, its task-agnostic subspace selection leaves it vulnerable to inter-task interference: the below-diagonal entries decay noticeably across stages. Slice ($c=1.0$) achieves better off-diagonal retention throughout the sequence, reducing forgetting while keeping *AP* competitive. The modest *GP* and *IP* reductions are consistent with the aggregate results in Table [2](https://arxiv.org/html/2605.12752#S4.T2) of the main manuscript.
