JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

arXiv cs.CL Papers

Summary

JumpLoRA introduces a novel sparse adapter framework for continual learning in LLMs using JumpReLU gating to dynamically isolate task parameters and prevent catastrophic forgetting. The method enhances LoRA-based approaches and outperforms state-of-the-art continual learning methods like ELLA.

arXiv:2604.16171v1 Announce Type: cross Abstract: Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.

Cached at: 04/20/26, 08:30 AM

# Sparse Adapters for Continual Learning in Large Language Models
Source: https://arxiv.org/html/2604.16171
Alexandra Dragomir1 (Bitdefender), Ioana Pintilie1 (Bitdefender), Antonio Barbalau1 (Bitdefender), Marius Dragoi1 (Bitdefender), Florin Brad1 (Bitdefender), Cristian Daniel Paduraru1 (Bitdefender), Alexandru Tifrea1 (Bitdefender), Elena Burceanu1 (Bitdefender), Radu Tudor Ionescu2 (University of Bucharest)

###### Abstract

Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.

## 1 Introduction

The rapid advancement of Large Language Models (LLMs) has revolutionized natural language processing (Vaswani et al., 2017; Brown et al., 2020), enabling unprecedented performance across diverse generative and reasoning tasks (Zhao et al., 2023; Minaee et al., 2024). However, LLMs are typically static once trained and lack a built-in ability to integrate new data without costly retraining. Continual Learning (CL) (Wu et al., 2024; Shi et al., 2026; Chen et al., 2026) aims to address this by allowing models to acquire new knowledge from a sequential stream of tasks.

One fundamental challenge in CL is mitigating catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999; Kirkpatrick et al., 2016), where the sequential acquisition of new information results in the abrupt loss of previously learned knowledge. This phenomenon is inherent to the stability-plasticity trade-off (Grossberg, 1987; Abraham and Robins, 2005; Dohare et al., 2024; Lange et al., 2022), as the model must be flexible enough to learn the new task while maintaining the stability required to preserve existing knowledge.

Parameter-Efficient Fine-Tuning (PEFT) methods (Houlsby et al., 2019) reduce the computational cost of fully fine-tuning LLMs for every new task. Based on the observation that model updates have a low intrinsic dimension (Li et al., 2018; Aghajanyan et al., 2021), Low-Rank Adaptation (LoRA) (Hu et al., 2021) and its variants have emerged as a standard for CL. LoRA approximates the weight update through the product of two trainable low-rank matrices while freezing the original weight matrix, significantly reducing the memory footprint.
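As a rough illustration of the memory argument, the snippet below compares the number of trainable parameters in a dense update with that of a rank-r factorization. The layer sizes (4096 × 4096, r = 8) and the helper name `lora_param_counts` are illustrative assumptions, not values or code taken from the paper.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (dense, low_rank) trainable-parameter counts for one weight matrix."""
    dense = d_in * d_out           # every entry of the update is trainable
    low_rank = r * (d_in + d_out)  # A is d_in x r, B is r x d_out
    return dense, low_rank


# Hypothetical layer sizes, chosen only for illustration.
dense, low_rank = lora_param_counts(d_in=4096, d_out=4096, r=8)
ratio = low_rank / dense
print(f"dense={dense}, low_rank={low_rank}, ratio={ratio:.4%}")
```

Under these assumed sizes the factorized update trains 1/256 of the parameters of the dense one, which is where the reduced optimizer and gradient memory comes from.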

Despite their parameter efficiency, naively training low-rank adapters for each new task often leads to task interference, causing notable forgetting on earlier tasks (Liang and Li, 2024; Wang et al., 2023a). Current state-of-the-art CL methods address this by imposing constraints on the adapter updates. Subspace-partitioning methods restrict updates to lie orthogonal to previous tasks. For instance, O-LoRA (Wang et al., 2023a) enforces orthogonality between successive low-rank matrices, while InfLoRA (Liang and Li, 2024) projects gradients onto the orthogonal complement of previous task subspaces to eliminate interference. Alternatively, coordinate-wise methods such as ELLA (Biswas et al., 2026) restrict specific sets of parameters that can be modified when learning a new task. While methods like ELLA effectively mitigate forgetting, they operate in a dense parameter update regime, where all low-rank coordinates remain active for every task. A fundamental limitation of this dense approach is that it cannot achieve full parameter isolation. As every weight remains subject to optimization at each step, the update must navigate an increasingly constrained landscape to satisfy all previous alignment constraints simultaneously. As the task stream grows, this lack of structural separation can lead to capacity saturation or the accumulation of residual gradient interference, ultimately diminishing the model's ability to learn new information without forgetting the old.

In this paper, we introduce JumpLoRA, a method designed to achieve adaptive parameter isolation. Our approach leverages JumpReLU (Rajamanoharan et al., 2024) as an activation function to adaptively induce sparsity on the LoRA blocks. Unlike traditional regularization methods, JumpLoRA dynamically cancels redundant or interfering weights, creating sparse adapters that minimize overlap with previous knowledge. We illustrate our approach in Figure 1.

Figure 1: Continual Learning with JumpLoRA. We construct LoRA updates that are able to perform fine-grained interventions by repurposing the JumpReLU activation function such that it can be applied to weight updates during training. For each task we train a learnable threshold τ alongside the LoRA weights, meant to cut off low-magnitude updates, enabling adapters to specifically target only the most relevant parameters. This change effectively reduces the impact of the adapter upon the base weights while reducing the overlap between different task adapters.

The primary contributions of our work are as follows:

- We introduce JumpLoRA, which, to the best of our knowledge, is the *first framework to integrate learnable JumpReLU gating* for low-rank adapters. This mechanism enables precise, coordinate-wise sparsity in weight updates by optimizing a threshold directly alongside the adapter parameters.
- We adapt the JumpLoRA framework to a CL setup, inducing adaptive sparsity per task. This allows different adapters to occupy disjoint parameter coordinates and helps mitigate task interference.
- We demonstrate the modularity and efficacy of JumpLoRA by applying it on top of existing CL frameworks. Through extensive benchmarking, we show that JumpLoRA improves upon existing approaches like IncLoRA and ELLA on the Standard CL Benchmark (Zhang et al., 2015) and the Long Sequence Benchmark (Razdaibidin et al., 2023).

## 2 Related Work

Training neural networks on sequential tasks results in catastrophic forgetting (McCloskey and Cohen, 1989), where learning new tasks degrades the model's performance on previously learned tasks. Continual Learning (CL) is a research area that covers methods aiming to learn new tasks over time, while mitigating catastrophic forgetting.

CL approaches belong to three major categories: rehearsal-based, regularization-based, and architecture-based. Rehearsal-based approaches (Lopez-Paz and Ranzato, 2017; de Masson d'Autume et al., 2019; Riemer et al., 2019) store past examples in a replay buffer which is leveraged during training on the current task.

Regularization-based methods impose constraints on the optimization process to protect previous task knowledge. This is typically achieved with penalty terms or gradient constraints. A foundational approach is Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2016), which computes the Fisher Information Matrix to establish parameter importance. Similarly, replay-free distillation is used to preserve prior tasks' outputs when dealing with new tasks (Li and Hoiem, 2018). Farajtabar et al. (2020) prevent interference by projecting the current updates onto a subspace orthogonal to the gradient of the previous tasks. Orthogonal isolation has then been extended to a PEFT CL setup, where O-LoRA (Wang et al., 2023a) adds a loss term to penalize new LoRA adapters if they are not orthogonal to past adapters. InfLoRA (Liang and Li, 2024) alternatively achieves interference-free adaptation by projecting gradients onto the orthogonal complement of previous task subspaces, preventing new updates from interfering with earlier tasks without requiring an explicit orthogonality loss.
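The projection idea shared by these subspace methods can be sketched with plain linear algebra. The example below removes from a gradient vector its component inside a subspace spanned by earlier-task directions. It is a generic illustration under assumed toy shapes, not the exact procedure of Farajtabar et al. (2020), O-LoRA, or InfLoRA, and the names `project_out` and `prev_dirs` are hypothetical.

```python
import numpy as np


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project grad onto the orthogonal complement of span(basis).

    basis has shape (d, k) and is assumed to have orthonormal columns.
    """
    return grad - basis @ (basis.T @ grad)


rng = np.random.default_rng(0)
prev_dirs = rng.normal(size=(8, 3))  # stand-in for directions used by earlier tasks
Q, _ = np.linalg.qr(prev_dirs)       # orthonormal basis of the same subspace, shape (8, 3)

g = rng.normal(size=8)               # stand-in for the current task's gradient
g_proj = project_out(g, Q)

# The projected gradient has no component along any earlier direction.
print(np.abs(prev_dirs.T @ g_proj).max())
```

Each new task shrinks the subspace left for updates, which is the capacity concern that motivates coordinate-wise alternatives.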

In addition to subspace-aware methods, other approaches perform coordinate-wise updates to achieve more granular isolation within shared parameters. ELLA (Biswas et al., 2026) prevents interference with weight coordinates that had high magnitude in previous tasks. Piggyback (Mallya et al., 2018) builds on ideas from network quantization and pruning, adapting frozen backbone networks to new tasks by learning end-to-end binary masks that selectively activate existing weights, enabling continual learning without modifying shared parameters or suffering from catastrophic forgetting. Similarly, MIGU (Du et al., 2024) performs coordinate-wise isolation by masking gradients corresponding to weights connected to high-magnitude neurons, while HAT (Serra et al., 2018) learns task-specific, nearly binary attention masks via backpropagation to gate network units, conditioning gradients to freeze weights important for previous tasks while dynamically allocating network capacity. Recently, Zhang et al. (2025) prevent parameter interference by freezing the LoRA matrices A as random projections and applying calibrated sparse masks M to the expansion matrices B. The sparse masks M are obtained by first fixing a sparsity ratio s. A task-specific threshold τ_t is then derived as the s-quantile of the parameter magnitudes in the B matrices, which are learned on an initial calibration set. Finally, the masks for B are generated by keeping the indices whose values are larger than the threshold τ_t. In contrast, our approach, JumpLoRA, provides a more adaptive framework by using *learnable JumpReLU gating* to obtain the sparse masks M. Thus, instead of relying on a fixed sparsity ratio s, our method adjusts the threshold τ_t to dynamically induce different levels of sparsity in the LoRA adapters, based on task-specific complexity.
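The fixed-ratio masking attributed above to Zhang et al. (2025) can be sketched as follows. This is a minimal reading of the description, assuming that "values larger than the threshold" refers to magnitudes and using a random matrix as a stand-in for the calibrated B. The function name `quantile_mask` and the toy sizes are hypothetical.

```python
import numpy as np


def quantile_mask(b: np.ndarray, s: float) -> tuple[np.ndarray, float]:
    """Keep entries of b whose magnitude exceeds the s-quantile of all magnitudes."""
    magnitudes = np.abs(b)
    tau = float(np.quantile(magnitudes, s))  # task-specific threshold tau_t
    mask = magnitudes > tau                  # boolean mask M
    return mask, tau


rng = np.random.default_rng(0)
B_calib = rng.normal(size=(4, 5))  # stand-in for B learned on a calibration set
mask, tau = quantile_mask(B_calib, s=0.75)
B_sparse = B_calib * mask          # masked expansion matrix

print(f"tau={tau:.3f}, kept={int(mask.sum())} of {mask.size}")
```

Here the fraction of surviving entries is pinned to roughly 1 − s by construction, whereas a learnable threshold lets that fraction vary per task and per layer.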

Architecture-based methods prevent interference by isolating task knowledge within dedicated parameters or sub-networks. This can be achieved by dynamically reusing or expanding the underlying model's layers during training (Yoon et al., 2018; Li et al., 2019), or by freezing previous task weights and progressively adding new columns with lateral connections for new tasks (Rusu et al., 2016). More recent approaches use Parameter-Efficient Fine-Tuning (PEFT) to avoid modifying the base model. This is achieved by learning task-specific soft prompts (Razdaibidin et al., 2023) or learning a new adapter for each task and selecting the relevant adapter at test time (Wang et al., 2023b). While these approaches isolate task-specific parameters, they require either knowing the Task ID at test time or defining a procedure to select the appropriate adapter for an unknown input. Our method merges existing task adapters into a single adapter, which allows tackling unseen tasks at test time without relying on Task IDs.

## 3 Method

**Continual Learning setup.** We consider the challenging rehearsal-free continual learning setting with task-agnostic inference for pretrained large language models. In this setting, the model has no access to data from previous tasks and must produce predictions without knowing the task identity of the input. Given a set of tasks 𝒯, for each supervised task T ∈ 𝒯, defined as T = {(x_i^T, y_i^T)}_{i=1}^{n_T}, we train a low-rank adapter ΔW = A · B, with A ∈ ℝ^{d_{in} × r}, B ∈ ℝ^{r × d_{out}} and r ≪ min(d_{in}, d_{out}) for each base weight matrix W_{base} ∈ ℝ^{d_{in} × d_{out}}. The output of the base layer h = x · W_{base} thus becomes h = x · (W_{base} + ΔW). After training on each task, the adapter is merged into the base weights: W_{base} ← W_{base} + ΔW.
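The shapes and the merge step in this setup can be checked with a small NumPy sketch. The sizes are toy values chosen for illustration, and the random A and B are stand-ins for trained adapter factors. Nothing here reproduces the authors' training code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2        # toy sizes (assumed); in practice r << min(d_in, d_out)

W_base = rng.normal(size=(d_in, d_out))
x = rng.normal(size=(3, d_in))  # a batch of 3 inputs

# Stand-ins for the trained adapter factors of one task.
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))
delta_W = A @ B                 # low-rank update with rank at most r

h_adapted = x @ (W_base + delta_W)     # h = x · (W_base + ΔW)
h_two_path = x @ W_base + (x @ A) @ B  # same output without materializing ΔW

# After training on the task, fold the adapter into the base weights.
W_base = W_base + delta_W
h_merged = x @ W_base

print(np.allclose(h_adapted, h_merged), np.linalg.matrix_rank(delta_W))
```

Because the merge is a plain addition, inference after any number of tasks uses a single weight matrix and needs no task identifier.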

**Proposed methodology.** Because ΔW is the product of two low-rank matrices, performing fine-grained interventions that target specific parameters of W_{base} is by design difficult or even intractable. We therefore introduce a mechanism that enables the low-rank adapter to perform targeted updates on a task-relevant subset of parameters. Concretely, we operationalize task-relevance through gradient magnitude: parameters accumulating larger cumulative gradients are considered more important for the current task. Following Wang et al. (2023a), we interpret ΔW as a proxy for the cumulative gradient of W_{base} over the current task. We thus design a mechanism able to identify, retain and train only the tentative top-magnitude elements of ΔW, effectively enabling sparse, targeted updates. We additionally impose the constraint that the sparsity level of each layer should not be treated as a tunable hyperparameter. Doing so would render the method prohibitively expensive, as it would require repeated training runs across all tasks to evaluate each candidate sparsity level. We design our method such that each layer can learn to select the level of sparsity that most appropriately delimits the relevant updates.

**JumpReLU.** We implement the proposed approach by repurposing the JumpReLU function, enabling its application on weight updates rather than activations. The JumpReLU function was introduced by Rajamanoharan et al. (2024)
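For reference, JumpReLU as defined by Rajamanoharan et al. (2024) is JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step function and θ is a learnable threshold. The sketch below implements that forward pass and, as an assumption based on the description of cutting off low-magnitude updates, a magnitude-based variant for signed entries of ΔW. The function `magnitude_gate` and the toy values are hypothetical and may differ from the paper's exact formulation. Training τ would also require a gradient estimator for the step function, which this forward-only sketch omits.

```python
import numpy as np


def jump_relu(z: np.ndarray, theta: float) -> np.ndarray:
    """JumpReLU_theta(z) = z * H(z - theta): pass z through only where z > theta."""
    return z * (z > theta)


def magnitude_gate(delta_w: np.ndarray, tau: float) -> tuple[np.ndarray, np.ndarray]:
    """Assumed variant for signed weight updates: keep entries with |dW| > tau."""
    mask = np.abs(delta_w) > tau
    return delta_w * mask, mask


z = np.array([-1.0, 0.2, 0.5, 2.0])
gated_z = jump_relu(z, theta=0.5)  # only 2.0 strictly exceeds the threshold

delta_w = np.array([[0.05, -0.80],
                    [0.30, -0.01]])
sparse_dw, mask = magnitude_gate(delta_w, tau=0.1)
sparsity = 1.0 - mask.mean()       # fraction of entries set to zero

print(gated_z, sparsity)
```

In this toy case half of the entries of ΔW are zeroed, so the merged update touches only the two coordinates with the largest magnitudes.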
