ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

arXiv cs.LG 06/03/26, 04:00 AM Papers
Summary
ReLoRA is a knowledge-reusing adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services, reducing time-to-readiness by up to 8.9× and improving accuracy by up to 4.6% through adaptive initialization and scheduled regularization.
arXiv:2606.02606v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service providers managing numerous downstream model services, retraining each LoRA adapter from scratch for every updated base model is computationally prohibitive and delays service rollout. Meanwhile, the simpler alternative, i.e., naively applying the original LoRA adapter to the updated base model, often leads to degraded service quality due to adapter-backbone incompatibility. To address this problem, we propose ReLoRA, a knowledge-reusing re-adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services while preserving or improving task performance. Specifically, ReLoRA comprises two key optimization steps: 1) Adaptive LoRA initialization leverages Bayesian optimization to construct a compatibility-aware starting point by fusing information from both the previously deployed task adapter and the base model's evolution; 2) Fine-tuning with scheduled regularization first rapidly steers the adapter to a high-quality region via strong regularization, followed by relaxed regularization for task-specific refinement. This design enables rapid service-quality recovery with reduced re-adaptation overhead. Extensive experiments demonstrate that ReLoRA reduces time-to-readiness by up to 8.9$\times$ and improves accuracy by up to 4.6\% compared to baselines.
Cached at: 06/03/26, 09:39 AM
# ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services
Source: [https://arxiv.org/html/2606.02606](https://arxiv.org/html/2606.02606)
Yang Xu, Zihuai Xu, \*Hongli Xu, Yunming Liao, Zhiwei Yao, Xitong Fu

This article is supported by the National Science Foundation of China (NSFC) under Grants XXXXXXXX, XXXXXXX, and XXXXXXX. Y. Xu, Z. Xu, H. Xu, Y. Liao, Z. Yao, and X. Fu are with the School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, China, 230027, and also with Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu, China, 215123. E-mails: xuyangcs@ustc.edu.cn; zihuaixu@mail.ustc.edu.cn; xuhongli@ustc.edu.cn; ymliao98@mail.ustc.edu.cn; zhiweiyao@mail.ustc.edu.cn; xt1310632375@mail.ustc.edu.cn. H. Xu is the corresponding author.

###### Abstract

Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service providers managing numerous downstream model services, retraining each LoRA adapter from scratch for every updated base model is computationally prohibitive and delays service rollout. Meanwhile, the simpler alternative, i.e., naively applying the original LoRA adapter to the updated base model, often leads to degraded service quality due to adapter-backbone incompatibility. To address this problem, we propose ReLoRA, a knowledge-reusing re-adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services while preserving or improving task performance. Specifically, ReLoRA comprises two key optimization steps: 1) *Adaptive LoRA initialization* leverages Bayesian optimization to construct a compatibility-aware starting point by fusing information from both the previously deployed task adapter and the base model's evolution; 2) *Fine-tuning with scheduled regularization* first rapidly steers the adapter to a high-quality region via strong regularization, followed by relaxed regularization for task-specific refinement. This design enables rapid service-quality recovery with reduced re-adaptation overhead. Extensive experiments demonstrate that ReLoRA reduces time-to-readiness by up to 8.9$\times$ and improves accuracy by up to 4.6% compared to baselines.

###### Index Terms:

*LLM Services, Service Rollout, Model Service Maintenance, Adapter Backward Compatibility, Parameter-Efficient Fine-Tuning, Low-Rank Adaptation*.

## 1 Introduction

Large Language Models (LLMs), such as ChatGPT \[[1](https://arxiv.org/html/2606.02606#bib.bib1)\] and LLaMA \[[2](https://arxiv.org/html/2606.02606#bib.bib2)\], have become a fundamental component of modern AI services \[[3](https://arxiv.org/html/2606.02606#bib.bib3)\]. Through extensive pretraining on diverse and massive corpora, these models exhibit remarkable generalization capabilities across a wide range of tasks. Increasingly, LLMs are not deployed as static models, but as continuously evolving services whose base models are periodically updated to incorporate new knowledge, improve alignment, and enhance general capabilities. In such service-oriented ecosystems, a provider-side base-model update is not merely a model replacement; it also triggers a lifecycle maintenance problem for numerous downstream model services built upon the previous model version.

To support specialized downstream applications, LLM services are commonly customized through fine-tuning. However, conventional full-parameter fine-tuning requires substantial computation and memory overhead. For instance, fine-tuning a LLaMA model with 13 billion parameters requires approximately 100GB of memory. To reduce this cost, Parameter-Efficient Fine-Tuning (PEFT) methods have been developed, among which Low-Rank Adaptation (LoRA) \[[4](https://arxiv.org/html/2606.02606#bib.bib4)\] has become a predominant solution. LoRA freezes the base pretrained model weights and represents task-specific updates through low-rank matrix decomposition, significantly reducing resource demands while maintaining competitive performance. As a result, many downstream LLM services are deployed as a shared base model equipped with task-specific LoRA adapters.

However, the continuous evolution of LLM backbones, such as Gemini \[[5](https://arxiv.org/html/2606.02606#bib.bib5)\] and LLaMA \[[2](https://arxiv.org/html/2606.02606#bib.bib2)\], poses a critical challenge to such LoRA-based service deployment. When the base model evolves from an old version to a new one, previously deployed LoRA adapters may become obsolete or suboptimal because they were optimized for the old backbone. Consequently, service providers who have invested in fine-tuning task-specific adapters may have to update these adapters before the corresponding services can be safely rolled out on the evolved base model. For service providers managing thousands of LoRA-based task services in production \[[6](https://arxiv.org/html/2606.02606#bib.bib6),[7](https://arxiv.org/html/2606.02606#bib.bib7)\], even a moderate per-adapter retraining cost can accumulate into substantial GPU-hours for each base-model update cycle. For instance, if fine-tuning a single LoRA instance on a moderately sized dataset takes approximately 4 hours on an NVIDIA A100 GPU \[[8](https://arxiv.org/html/2606.02606#bib.bib8)\], updating 2,000 LoRA instances would require roughly 8,000 GPU hours, resulting in a significant operational burden \[[9](https://arxiv.org/html/2606.02606#bib.bib9),[10](https://arxiv.org/html/2606.02606#bib.bib10)\]. This cost directly delays the rollout of updated LLM services and increases the maintenance overhead.

To mitigate this problem, existing approaches have largely followed two distinct paths. The most straightforward approach, PortLLM \[[11](https://arxiv.org/html/2606.02606#bib.bib11)\], directly applies the original LoRA weights to the newly evolved base model. From a service-management perspective, this approach is appealing because it requires almost no additional computation and can immediately reuse existing task-specific knowledge. However, as we demonstrate in Section [2.4](https://arxiv.org/html/2606.02606#S2.SS4), such naive adapter transfer fails to account for the parameter shifts introduced by base-model evolution. This causes a mismatch between the original task adaptation and the evolved backbone, leading to degraded service quality. Another line of research explores generative methods to synthesize LoRA weights for evolving LLMs. For instance, ORAL \[[12](https://arxiv.org/html/2606.02606#bib.bib12)\] employs conditional recurrent diffusion models to generate LoRA weights from scratch based on task specifications and model architecture. Although such generative approaches provide flexibility in creating adapters for new tasks, they typically require substantial upfront investment in training the generative model itself, and their effectiveness depends heavily on the quality of the learned generator and the conditioning information.

Given that direct adapter transfer may lead to unacceptable service-quality degradation, some form of re-adaptation is necessary. Meanwhile, to enable fast service rollout, re-adaptation should avoid retraining each adapter from scratch. Our key insight is that two forms of existing knowledge can be jointly reused to accelerate this process: the task-specific knowledge encoded in the previously deployed LoRA adapter, and the evolution knowledge reflected by the parameter shift between the old and updated base models. By properly integrating these two sources of knowledge, the updated adapter can start from a more compatible parameter region, thereby reducing the time required to restore service-ready performance.

Building on this insight, we propose ReLoRA, a knowledge-reusing re-adaptation framework for fast rollout of evolving LLM services. ReLoRA orchestrates the adapter update process through two key steps, i.e., *adaptive LoRA initialization* and *fine-tuning with scheduled regularization*. First, adaptive LoRA initialization constructs a compatibility-aware starting point by fusing information from the original task adapter with the delta of the base model's evolution, i.e., the difference between the evolved and original base models. This fusion is achieved by searching for an optimal weighted combination via Bayesian optimization, which provides a superior initialization for subsequent re-adaptation. Second, fine-tuning with scheduled regularization efficiently steers the adapter toward a new task-specific optimum while reducing re-adaptation overhead. This step begins with a *Guided Rapid Adaptation* stage, where strong L2 regularization anchors the adapter to the initialized state and rapidly moves it into a high-quality service-ready region. It then proceeds to a *Refinement and Exploration* stage, where the regularization is relaxed to allow task-loss-driven refinement for improved final performance. Overall, the two-step design enables fast service-quality recovery while preserving downstream task performance.

Our main contributions can be summarized as follows.

- We formulate adapter re-adaptation for evolving LLMs as an adapter backward-compatibility and service rollout problem, where the objective is to rapidly restore service-ready task adapters after base-model evolution.
- We propose ReLoRA, a knowledge-reusing framework that reduces the maintenance overhead of LoRA-based LLM services by leveraging both the previously deployed task adapter and the base model's evolution.
- We introduce an adaptive LoRA initialization strategy that employs Bayesian optimization to fuse knowledge from the existing LoRA adapter and the base-model evolution delta, thereby constructing a compatibility-aware starting point for fast re-adaptation.
- We design a fine-tuning strategy with scheduled regularization, which first anchors the optimization process to the fused initialization for rapid service-quality recovery and then relaxes the constraint for refinement.
- We conduct extensive experiments across six downstream service tasks, three model families, and three update sources. The results show that ReLoRA reduces time-to-readiness by up to 8.9$\times$ and improves task accuracy by up to 4.6 percentage points over baselines.

The rest of the paper is organized as follows. Section [2](https://arxiv.org/html/2606.02606#S2) introduces the background of LoRA and the motivation for this work. Section [3](https://arxiv.org/html/2606.02606#S3) details the proposed ReLoRA framework. Section [4](https://arxiv.org/html/2606.02606#S4) presents the experimental results and analysis. Section [5](https://arxiv.org/html/2606.02606#S5) reviews related work. Section [6](https://arxiv.org/html/2606.02606#S6) discusses limitations of our work, and Section [7](https://arxiv.org/html/2606.02606#S7) concludes the paper.

![Refer to caption](https://arxiv.org/html/2606.02606v1/x1.png)

Figure 1: Workflow of ReLoRA for fast rollout of evolving LLM services.

## 2 Preliminaries and Motivations

### 2.1 Low-Rank Adaptation (LoRA)

Parameter-Efficient Fine-Tuning (PEFT) methods substantially reduce the number of trainable parameters required to adapt LLMs to downstream tasks. Among them, Low-Rank Adaptation (LoRA) has become a widely adopted technique for customizing LLM services, as it keeps the pretrained backbone frozen and updates only a small set of additional adapter parameters. This design allows downstream model services to share the same base model while maintaining separate lightweight task-specific adapters.

Specifically, for a weight matrix $\Theta_0^h \in \mathbb{R}^{d \times k}$ at the $h$-th layer of an LLM, LoRA represents the incremental update for the $i$-th downstream task as a low-rank decomposition:

$$\Delta\Theta_i^h = B_i^h A_i^h, \tag{1}$$

where $B_i^h \in \mathbb{R}^{d \times r}$, $A_i^h \in \mathbb{R}^{r \times k}$, and the rank satisfies $r \ll \min\{d, k\}$. Given an input representation $x$, the layer output is computed as

$$\Theta_0^h x + B_i^h A_i^h x. \tag{2}$$

During fine-tuning, the original weight matrix $\Theta_0^h$ remains fixed, while only $A_i^h$ and $B_i^h$ are trainable. For simplicity, we use $\Delta\Theta_i$ to denote the collection of LoRA updates across all adapted layers for the $i$-th downstream service task.
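As a rough illustration of Eqs. (1) and (2), the following NumPy sketch computes a single LoRA-adapted layer output on toy dimensions. The sizes, the random values, and the helper name `lora_forward` are assumptions made for the example; initializing $B$ to zero follows the common LoRA convention and is not something this section specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                      # toy sizes; in practice r << min(d, k)

theta0 = rng.normal(size=(d, k))       # frozen base weight (Theta_0^h)
A = rng.normal(size=(r, k))            # trainable factor A_i^h
B = np.zeros((d, r))                   # trainable factor B_i^h, zero at start (common convention)

def lora_forward(x, theta0, B, A):
    """Eq. (2): Theta_0 x + B (A x), without forming the dense d x k update."""
    return theta0 @ x + B @ (A @ x)

x = rng.normal(size=k)
y_init = lora_forward(x, theta0, B, A)     # with B = 0 the adapter contributes nothing

B = rng.normal(size=(d, r))                # stand-in for a trained B
y = lora_forward(x, theta0, B, A)
y_merged = (theta0 + B @ A) @ x            # same output after merging Eq. (1) into the weight

full_params = d * k                        # trainable parameters for full fine-tuning of this layer
lora_params = r * (d + k)                  # trainable parameters for the LoRA factors
```

Evaluating `B @ (A @ x)` rather than `(B @ A) @ x` keeps the per-input cost proportional to $r(d + k)$, which is why adapters can stay separate from the shared backbone at serving time.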

### 2.2 LLM Service Evolution and Adapter Backward Compatibility

We next formalize the adapter backward-compatibility problem for evolving LLM services. Let $\Theta$ denote the old service backbone and $\Theta' = \Theta + \Delta\Theta$ denote the updated backbone after base-model evolution, where $\Delta\Theta$ represents the parameter delta introduced by the update. For the $i$-th downstream service task, let $\Delta\Theta_i$ denote the previously deployed LoRA adapter on $\Theta$, and let $\Delta\Theta'_i$ denote the updated adapter to be constructed for $\Theta'$. The service quality of the updated task service is denoted by $Q_i(\Theta', \Delta\Theta'_i)$.

The objective is to minimize the *time-to-readiness* $\tau_i$, i.e., the wall-clock time from the completion of the base-model update to the point where the updated adapter reaches the required service-quality threshold $q_i$:

$$\min_{\Delta\Theta'_i} \tau_i \quad \text{s.t.} \quad Q_i(\Theta', \Delta\Theta'_i) \geq q_i. \tag{3}$$

This formulation differs from conventional fine-tuning. Instead of only maximizing final task accuracy, the key objective is to rapidly restore a service-ready adapter after base-model evolution while preserving service quality.
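One way to read Eq. (3) operationally is as the first moment at which a logged quality curve crosses the threshold $q_i$. The sketch below assumes quality is measured at discrete checkpoints; the function name `time_to_readiness` and both curves are hypothetical illustrations, not measurements from the paper.

```python
from typing import Optional, Sequence, Tuple

def time_to_readiness(
    curve: Sequence[Tuple[float, float]], q_threshold: float
) -> Optional[float]:
    """Return the first elapsed time whose quality meets the threshold, else None.

    `curve` holds (elapsed_seconds, quality) pairs in chronological order.
    """
    for elapsed, quality in curve:
        if quality >= q_threshold:
            return elapsed
    return None

# Hypothetical checkpoint logs (seconds since the base-model update, accuracy).
from_scratch = [(0, 0.25), (600, 0.48), (1200, 0.66), (1800, 0.74), (2400, 0.77)]
reused_init = [(0, 0.58), (300, 0.72), (600, 0.76), (900, 0.77)]

q_i = 0.75                                   # illustrative service-quality threshold
tau_scratch = time_to_readiness(from_scratch, q_i)
tau_reused = time_to_readiness(reused_init, q_i)
speedup = tau_scratch / tau_reused           # ratio of the two readiness times
```

Under this reading, two runs with the same final accuracy can still differ widely in $\tau_i$, which is the quantity the formulation targets.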

### 2.3 Challenges in Adapting LoRA for Evolving LLM Services: The Pitfall of Naive Transfer

Although LoRA significantly lowers the cost of downstream adaptation, updating a large number of LoRA-based services after every base-model evolution remains expensive. This issue is especially important in LLM service ecosystems, where multiple task-specific adapters are commonly deployed on top of a shared backbone. When the backbone evolves, retraining every adapter from scratch can reliably recover task performance, but it also delays the rollout of downstream services. A more attractive strategy is to directly reuse the previously deployed adapter on the updated backbone, thereby avoiding re-training.

PortLLM \[[11](https://arxiv.org/html/2606.02606#bib.bib11)\] follows this direction by treating the task-specific adapter $\Delta\Theta_i$ learned on the old backbone as a transferable "patch". As illustrated in Figure [1](https://arxiv.org/html/2606.02606#S1.F1), when the base model evolves from $\Theta$ to $\Theta'$, where

$$\Theta' = \Theta + \Delta\Theta, \tag{4}$$

PortLLM directly applies the original adapter $\Delta\Theta_i$ to the updated backbone. The resulting adapted model can be written as

$$\Theta'_i = \Theta' + \Delta\Theta_i = \Theta + \Delta\Theta + \Delta\Theta_i. \tag{5}$$

From a service-management perspective, this strategy enables an almost immediate rollout path after model evolution because it requires no adapter re-training. However, this convenience comes with a major risk: the old adapter was optimized for the previous backbone and may no longer be compatible with the evolved one.

The limitation of naive transfer stems from its neglect of the interaction between the base-model evolution $\Delta\Theta$ and the task-specific adaptation $\Delta\Theta_i$. These two components may be complementary or conflicting. If the base-model update changes internal representations that are important for the downstream task, simply attaching the old adapter to the updated backbone may move the model to a suboptimal parameter region, leading to degraded service quality.

![Refer to caption](https://arxiv.org/html/2606.02606v1/x2.png)
![Refer to caption](https://arxiv.org/html/2606.02606v1/x3.png)
![Refer to caption](https://arxiv.org/html/2606.02606v1/x4.png)
![Refer to caption](https://arxiv.org/html/2606.02606v1/x5.png)

Figure 2: The results of preliminary experiments. (a) Performance variation when applying the previously fine-tuned LoRA adapter $\Delta\Theta_i$ to $\{\Theta_1, \cdots, \Theta_4\}$; (b) Training performance with different $\beta$ values when $\alpha = 1$; (c) Training performance with different $\alpha$ values; (d) Training performance under strong regularization/no regularization on MMLU and 20News.

To evaluate the impact of model evolution on naive transfer, we conduct preliminary experiments using LLaMA3.1-8B \[[13](https://arxiv.org/html/2606.02606#bib.bib13)\] as the initial base model $\Theta_0$. We simulate model evolution through continual pretraining \[[14](https://arxiv.org/html/2606.02606#bib.bib14),[15](https://arxiv.org/html/2606.02606#bib.bib15)\] on four distinct domains sequentially, including biomedical papers ($\mathbf{D}_1$) \[[16](https://arxiv.org/html/2606.02606#bib.bib16)\], Amazon reviews ($\mathbf{D}_2$) \[[17](https://arxiv.org/html/2606.02606#bib.bib17)\], computer science papers ($\mathbf{D}_3$) \[[16](https://arxiv.org/html/2606.02606#bib.bib16)\], and news articles ($\mathbf{D}_4$) \[[18](https://arxiv.org/html/2606.02606#bib.bib18)\]. Each domain is used for 6k pretraining steps with a batch size of 8. Let $\Theta_u$ denote the evolved model after continual pretraining on $\mathbf{D}_u$. For downstream evaluation, we use the MMLU benchmark \[[19](https://arxiv.org/html/2606.02606#bib.bib19)\]. We directly apply a LoRA adapter $\Delta\Theta_i$, originally fine-tuned on $\Theta_0$, to each evolved model $\Theta_u$ and measure the resulting performance.

As shown in Figure [2](https://arxiv.org/html/2606.02606#S2.F2)(a), we obtain three observations. First, each evolved model equipped with the old adapter performs worse than the original adapted model, indicating that $\Delta\Theta_i$ becomes less compatible as the backbone evolves. Second, despite this degradation, reusing the adapter still consistently outperforms zero-shot inference, suggesting that the previously deployed adapter retains useful task-specific knowledge. Third, among all evolved models, $\Theta_4 + \Delta\Theta_i$ achieves the best performance, although $\Theta_4$ has undergone the longest evolution path. This may be because the news domain $\mathbf{D}_4$ is more aligned with the original pretraining distribution of $\Theta_0$ \[[20](https://arxiv.org/html/2606.02606#bib.bib20)\]. These results show that the effectiveness of a transferred adapter is sensitive to the nature of the base-model evolution. Therefore, simply reusing $\Delta\Theta_i$ is insufficient for reliable service rollout.

### 2.4 Motivation

The above observations suggest that the previously deployed adapter remains valuable, but it should not be reused blindly after base-model evolution. Naive transfer implicitly assumes that the base-model evolution $\Delta\Theta$ and the task-specific adaptation $\Delta\Theta_i$ can be combined with fixed unit weights. In practice, however, their relative importance may vary across downstream service tasks and update sources. The evolution delta may introduce useful general capabilities, irrelevant changes, or even conflicts with the existing adapter. Therefore, a more flexible strategy is needed to adaptively balance these two sources of knowledge.

To this end, we consider constructing an initialization by fusing the base-evolution knowledge and the old task-adaptation knowledge:

$$\Theta_{\mathrm{init}} = \alpha \cdot \widetilde{\Delta\Theta} + \beta \cdot \Delta\Theta_i, \tag{6}$$

where $\alpha$ and $\beta$ control the contributions of the base-model evolution and the previously deployed task adapter, respectively. Here, $\widetilde{\Delta\Theta}$ denotes the representation of the base-evolution delta in a parameter space compatible with the LoRA adapter. For notational simplicity, we use $\Delta\Theta$ to denote this compatible representation when there is no ambiguity. The naive summation strategy can be regarded as a special case with $\alpha = 1$ and $\beta = 1$.

To validate the necessity of adaptive fusion, we conduct preliminary experiments on 20News \[[21](https://arxiv.org/html/2606.02606#bib.bib21)\] and MMLU using LLaMA3.1-8B as the initial backbone $\Theta$. After further pretraining on OpenOrca \[[22](https://arxiv.org/html/2606.02606#bib.bib22)\], we obtain the evolved backbone $\Theta'$. We first search for the optimal $\beta$ while fixing $\alpha = 1$, and then search for the optimal $\alpha$ with the previously selected $\beta$. As shown in Figure [2](https://arxiv.org/html/2606.02606#S2.F2)(b) and Figure [2](https://arxiv.org/html/2606.02606#S2.F2)(c), the choice of fusion coefficients has a substantial impact on downstream service quality. On 20News, for example, increasing $\beta$ from 0.7 to 1.9 improves accuracy by 46.9%. Given the selected $\beta$, varying $\alpha$ also leads to different performance, with optimal values smaller than 1 on both datasets. This indicates that the base-evolution delta should be incorporated selectively rather than naively added with full strength.

The comparison with naive fusion further confirms this point. The setting $(\alpha = 1, \beta = 1)$ achieves only 35.6% accuracy on 20News and 52.4% accuracy on MMLU. In contrast, by optimizing the scaling coefficients to $(\alpha = 0.6, \beta = 1.9)$ for 20News and $(\alpha = 0.3, \beta = 1.8)$ for MMLU, the accuracy increases to 77.2% and 73.4%, respectively. These findings demonstrate that a static fusion rule fails to fully exploit the knowledge contained in the old adapter and the base-model evolution. Therefore, an intelligent and adaptive strategy is needed to identify suitable fusion weights and construct a better starting point for re-adaptation.

Furthermore, even with a high-quality initialization, subsequent fine-tuning remains necessary to fully adapt the adapter to the downstream service task on the evolved backbone. However, conventional fine-tuning guided solely by the task loss may quickly drift away from this favorable initialization, slowing convergence and increasing time-to-readiness. A natural solution is to use regularization to constrain the optimization trajectory around the initialized adapter. Strong regularization can keep the model in a promising parameter region and accelerate early service-quality recovery. Nevertheless, if the regularization remains too strong throughout training, it may restrict task-specific refinement and harm the final performance.

This trade-off motivates a scheduled regularization strategy. As shown in Figure [2](https://arxiv.org/html/2606.02606#S2.F2)(d), strong regularization accelerates early convergence, but may constrain the final achievable accuracy because it limits exploration. In contrast, no regularization allows the model to explore more freely and may eventually reach a better local optimum, but it requires more steps to reach a high-quality region. Therefore, an effective re-adaptation process should first apply strong regularization to rapidly recover service-ready performance, and then relax the constraint to enable task-specific refinement. This observation motivates the two-stage design of ReLoRA, which combines adaptive LoRA initialization with scheduled regularization to achieve both fast time-to-readiness and high final service quality.
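The mechanics of such a schedule can be sketched on a toy problem. The example below assumes an L2 anchor of the form $\tfrac{\lambda}{2}\lVert\theta - \theta_{\mathrm{init}}\rVert^2$ added to a quadratic stand-in for the task loss, with a hard switch from a strong to a relaxed weight. The switch step, both weights, the learning rate, and all function names are hypothetical choices for illustration; the excerpt does not give the paper's actual schedule.

```python
import numpy as np

def reg_weight(step, switch_step, lam_strong, lam_relaxed):
    """Two-stage schedule: strong anchoring first, relaxed afterwards."""
    return lam_strong if step < switch_step else lam_relaxed

def task_grad(theta, theta_star):
    """Gradient of the toy task loss 0.5 * ||theta - theta_star||^2."""
    return theta - theta_star

def regularized_grad(theta, theta_init, theta_star, lam):
    """Gradient of task loss + (lam / 2) * ||theta - theta_init||^2."""
    return task_grad(theta, theta_star) + lam * (theta - theta_init)

theta_star = np.array([1.0, -2.0])   # toy task optimum (unknown in practice)
theta_init = np.array([0.8, -1.5])   # stand-in for the fused initialization
theta = theta_init.copy()
lr, switch_step = 0.1, 50

for step in range(200):
    lam = reg_weight(step, switch_step, lam_strong=1.0, lam_relaxed=0.0)
    theta = theta - lr * regularized_grad(theta, theta_init, theta_star, lam)
    if step == switch_step - 1:
        theta_stage1 = theta.copy()  # where the anchored stage ends up
```

In this toy setting the anchored stage settles between `theta_init` and `theta_star`, and the relaxed stage then moves on to `theta_star`. The sketch only shows how the penalty shapes the trajectory; it does not reproduce the convergence-speed effect reported in Figure 2(d).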

## 3 Method

### 3.1 Overview

Our proposed ReLoRA is designed to rapidly restore service-ready adapters after base-service evolution. When the LLM backbone evolves from $\Theta$ to $\Theta'$, where $\Theta' = \Theta + \Delta\Theta$, previously deployed task adapters may no longer be fully compatible with the updated backbone. Instead of retraining each adapter from scratch, ReLoRA reuses two forms of existing knowledge: the task-specific knowledge encoded in the previously deployed adapter $\Delta\Theta_i$, and the evolution knowledge reflected by the base-model parameter delta $\Delta\Theta$. The goal is to efficiently construct an updated adapter for the evolved backbone such that the downstream service can quickly reach its required quality threshold.

As illustrated in Figure [1](https://arxiv.org/html/2606.02606#S1.F1), ReLoRA consists of two key steps. ① *Adaptive LoRA initialization.* This step constructs a compatibility-aware starting point, denoted by $\Theta_{\mathrm{init}}$, for re-adapting the $i$-th downstream service adapter on the evolved backbone. Here, $\Theta_{\mathrm{init}}$ denotes the initialized LoRA update rather than the full backbone. It is obtained by adaptively fusing the previously deployed task adapter $\Delta\Theta_i$ with the base-evolution delta $\Delta\Theta$, after mapping the latter into a compatible LoRA parameter space. The fusion coefficients are selected through Bayesian optimization to reduce the number of pre-rollout validation probes. ② *Fine-tuning with scheduled regularization.* Starting from $\Theta_{\mathrm{init}}$, this step performs re-adaptation under a two-stage regularization schedule. The first stage strongly anchors the adapter to the initialized state for fast service-readiness recovery, while the second stage relaxes the constraint to enable task-specific refinement. Together, these two steps reduce time-to-readiness while preserving downstream service quality.

### 3.2 Adaptive LoRA Initialization

The purpose of adaptive LoRA initialization is to construct a high-quality starting point for re-adapting the old task adapter to the evolved backbone. A naive strategy directly reuses $\Delta\Theta_i$ on $\Theta'$, while training from scratch ignores the knowledge already encoded in the deployed adapter. Both strategies are suboptimal for fast service rollout. In contrast, ReLoRA initializes the updated adapter by jointly exploiting the old task adaptation and the base-model evolution.

Before fusion, the base-evolution delta is mapped to the same LoRA parameter space as the task adapter. This mapping ensures that both components are defined over the same target modules and have compatible ranks and parameterization. We denote this mapping by $\Pi_r(\cdot)$, where $r$ is the rank used for downstream LoRA adaptation. For each target module, $\Pi_r(\Delta\Theta)$ represents the rank-compatible LoRA update induced by the base-model evolution. The initialized adapter is then formulated as:

$$\Theta_{\mathrm{init}}(\alpha, \beta) = \alpha \cdot \Pi_r(\Delta\Theta) + \beta \cdot \Delta\Theta_i, \tag{7}$$

where $\alpha$ and $\beta$ are scalar coefficients that control the contribution of the base-evolution knowledge and the previously deployed task-specific knowledge, respectively.
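To make Eq. (7) concrete, the sketch below fuses the two updates for a single target module. It assumes that $\Pi_r$ is a truncated SVD (the best rank-$r$ approximation of the dense delta); this is only one plausible choice, since the excerpt does not say how the mapping is computed. The helper names, matrix sizes, and random inputs are likewise hypothetical, and the coefficients are the 20News values quoted in Section 2.4, reused here purely as example numbers.

```python
import numpy as np

def project_rank_r(delta, r):
    """Assumed Pi_r: best rank-r approximation of a dense delta via truncated SVD."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :r] * S[:r]          # d x r, singular values folded into the left factor
    A = Vt[:r, :]                 # r x k
    return B, A

def fused_init(delta_base, B_i, A_i, alpha, beta, r):
    """Eq. (7): alpha * Pi_r(delta_base) + beta * (B_i A_i), as a dense d x k update."""
    B_e, A_e = project_rank_r(delta_base, r)
    return alpha * (B_e @ A_e) + beta * (B_i @ A_i)

rng = np.random.default_rng(0)
d, k, r = 10, 8, 2
delta_base = rng.normal(size=(d, k))       # stand-in for Theta' - Theta on one module
B_i = rng.normal(size=(d, r))              # stand-in for the old adapter factors
A_i = rng.normal(size=(r, k))

theta_init = fused_init(delta_base, B_i, A_i, alpha=0.6, beta=1.9, r=r)
naive = fused_init(delta_base, B_i, A_i, alpha=1.0, beta=1.0, r=r)   # fixed unit weights
old_only = fused_init(delta_base, B_i, A_i, alpha=0.0, beta=1.0, r=r)
```

Note that the sum of two rank-$r$ updates can have rank up to $2r$. How the fused update is re-expressed as trainable rank-$r$ factors is not covered in this section, so the sketch stops at the dense update.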

The central challenge is to efficiently determine the optimal coefficients $(\alpha^{*}, \beta^{*})$. As shown in Section [2.4](https://arxiv.org/html/2606.02606#S2.SS4), the quality of the initialization is sensitive to the choice of $(\alpha, \beta)$. Although one-dimensional coefficient sweeps reveal informative trends, the joint optimum remains task-dependent and difficult to predict because it depends on the interaction between base-model evolution and task-specific adaptation. A brute-force grid search is unattractive in service deployment. A coarse grid may miss a high-quality initialization, while a fine-grained grid introduces excessive pre-rollout overhead, especially when many downstream service adapters need to be updated.

Therefore, ReLoRA adopts Bayesian Optimization (BayesOpt) to search for $(\alpha, \beta)$ in a sample-efficient manner. BayesOpt is suitable here because directly evaluating the quality of a coefficient pair is expensive, and the relationship between $(\alpha, \beta)$ and adapter quality is a black-box function without analytical gradients \[[23](https://arxiv.org/html/2606.02606#bib.bib23)\]. Since each validation probe delays service rollout, the search process should minimize the number of evaluations. Following prior practice \[[24](https://arxiv.org/html/2606.02606#bib.bib24)\], BayesOpt iteratively builds a probabilistic surrogate model for the objective function and selects the next coefficient pair using an acquisition function.

For a given coefficient pair $(\alpha, \beta)$, we define the objective function $f(\alpha, \beta)$ as the negative validation loss of the evolved backbone equipped with the initialized adapter:

$$f(\alpha, \beta) = -\mathcal{L}_{\mathrm{val}}\bigl(\Theta', \Theta_{\mathrm{init}}(\alpha, \beta); D_{\mathrm{val}}\bigr), \tag{8}$$

where $D_{\mathrm{val}}$ is a small held-out validation set sampled from the corresponding downstream service task. Maximizing $f(\alpha, \beta)$ therefore favors an initialization that gives lower validation loss before full re-adaptation. The optimal coefficients are obtained by:

$$(\alpha^{*}, \beta^{*}) = \arg\max_{\alpha, \beta} f(\alpha, \beta). \tag{9}$$
We use a Gaussian Process (GP) [[25](https://arxiv.org/html/2606.02606#bib.bib25)] as the surrogate model. Let $\mathbf{x}=(\alpha,\beta)$ denote a coefficient pair. Given a set of observed coefficient pairs $\mathbf{X}_{1:N}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{N}]$ and their objective values $f(\mathbf{X}_{1:N})=[f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{N})]$, the GP prior is defined as:

$$f(\mathbf{X}_{1:N}) \sim \mathcal{N}\bigl(\mu_{0}(\mathbf{X}_{1:N}), \Sigma_{0}(\mathbf{X}_{1:N},\mathbf{X}_{1:N})\bigr), \tag{10}$$

where $\mu_{0}(\cdot)$ and $\Sigma_{0}(\cdot,\cdot)$ are the prior mean and covariance functions, respectively. For a new candidate $\mathbf{x}_{N+1}$, the posterior distribution of $f(\mathbf{x}_{N+1})$ is:

$$f(\mathbf{x}_{N+1}) \mid f(\mathbf{X}_{1:N}) \sim \mathcal{N}\bigl(\mu_{N}(\mathbf{x}_{N+1}), \sigma_{N}^{2}(\mathbf{x}_{N+1})\bigr), \tag{11}$$

where the posterior mean and variance are computed as:

$$\mu_{N}(\mathbf{x}_{N+1}) = \Sigma_{0}(\mathbf{x}_{N+1},\mathbf{X}_{1:N})\,\Sigma_{0}^{-1}(\mathbf{X}_{1:N},\mathbf{X}_{1:N})\bigl(f(\mathbf{X}_{1:N})-\mu_{0}(\mathbf{X}_{1:N})\bigr)+\mu_{0}(\mathbf{x}_{N+1}), \tag{12}$$

$$\sigma_{N}^{2}(\mathbf{x}_{N+1}) = \Sigma_{0}(\mathbf{x}_{N+1},\mathbf{x}_{N+1})-\Sigma_{0}(\mathbf{x}_{N+1},\mathbf{X}_{1:N})\,\Sigma_{0}^{-1}(\mathbf{X}_{1:N},\mathbf{X}_{1:N})\,\Sigma_{0}(\mathbf{X}_{1:N},\mathbf{x}_{N+1}). \tag{13}$$

We adopt the Squared Exponential, i.e., the RBF kernel:

$$\Sigma_{0}(\mathbf{x},\mathbf{x}') = \sigma_{f}^{2}\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^{2}}{2l^{2}}\right), \tag{14}$$

where $\sigma_{f}^{2}$ is the signal variance and $l$ is the length-scale. These hyperparameters are estimated from the observed data during the search.
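The posterior in Eqs. (12) to (14) can be illustrated with a minimal, dependency-free sketch. It makes several assumptions that are not from the paper: the kernel hyperparameters are fixed rather than estimated from data, the prior mean is a constant, a small `jitter` term is added to the diagonal for numerical stability, and the helper names (`rbf`, `solve`, `gp_posterior`) are hypothetical.

```python
import math

def rbf(x, y, sigma_f=1.0, length=0.5):
    """Squared-exponential kernel of Eq. (14); hyperparameters fixed for illustration."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sigma_f ** 2 * math.exp(-sq_dist / (2.0 * length ** 2))

def solve(K, b):
    """Solve K z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(K)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def gp_posterior(X, y, x_new, mean0=0.0, jitter=1e-8):
    """Posterior mean and variance at x_new, following Eqs. (12) and (13)."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(x_new, a) for a in X]
    residual = [v - mean0 for v in y]
    mu = mean0 + sum(k * w for k, w in zip(k_star, solve(K, residual)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mu, max(var, 0.0)  # clip tiny negative values caused by round-off

# Illustrative observations: (alpha, beta) pairs with made-up objective values.
X_obs = [(0.2, 1.0), (0.5, 1.5), (0.7, 1.8)]
f_obs = [-1.2, -0.8, -0.9]
mu_new, var_new = gp_posterior(X_obs, f_obs, (0.6, 1.6))
```

At an already-observed pair the posterior variance collapses toward zero, while far from all observations it returns to the prior variance $\sigma_f^2$; this is the uncertainty signal an acquisition function can exploit.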

To select the next coefficient pair, we use Expected Improvement (EI) as the acquisition function. EI balances exploitation, i.e., selecting regions with high predicted objective values, and exploration, i.e., selecting regions with high uncertainty. It is defined as:

$$EI(\mathbf{x}) = \bigl(\mu_{N}(\mathbf{x})-f^{+}-\xi\bigr)\Phi(Z)+\sigma_{N}(\mathbf{x})\phi(Z), \tag{15}$$

where $f^{+}$ is the best observed objective value, $\xi$ is a trade-off parameter, and

$$Z = \frac{\mu_{N}(\mathbf{x})-f^{+}-\xi}{\sigma_{N}(\mathbf{x})}. \tag{16}$$

Here, $\Phi(\cdot)$ and $\phi(\cdot)$ denote the Cumulative Distribution Function (CDF) and Probability Density Function (PDF) of the standard normal distribution, respectively. The next coefficient pair is selected by maximizing EI:

$$(\alpha_{N+1},\beta_{N+1}) = \arg\max_{\alpha,\beta} EI(\alpha,\beta). \tag{17}$$

After $T$ search iterations, the best observed coefficient pair is used to construct $\Theta_{\mathrm{init}}^{*}$ according to Eq. ([7](https://arxiv.org/html/2606.02606#S3.E7)).
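A minimal sketch of Eqs. (15) to (17), using only the Python standard library. The function names, the default $\xi = 0.01$, the zero-variance fallback, and the idea of maximizing EI over a finite candidate set are assumptions for illustration; the paper does not specify how the maximization in Eq. (17) is carried out.

```python
import math

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected Improvement of Eq. (15) for one candidate.

    mu, sigma: GP posterior mean and standard deviation at the candidate.
    f_best:    best objective value observed so far (f^+).
    xi:        exploration trade-off parameter (default value is an assumption).
    """
    improvement = mu - f_best - xi
    if sigma <= 0.0:
        # Degenerate case with no posterior uncertainty.
        return max(improvement, 0.0)
    z = improvement / sigma                                   # Eq. (16)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return improvement * cdf + sigma * pdf

def pick_next(candidates, f_best, xi=0.01):
    """Return the (alpha, beta) key with the largest EI, as in Eq. (17).

    candidates maps (alpha, beta) -> (posterior mean, posterior std).
    """
    return max(candidates,
               key=lambda c: expected_improvement(*candidates[c], f_best, xi))

# Illustrative posterior summaries (made-up numbers, not from the paper).
posterior = {(0.3, 1.5): (-0.80, 0.05), (0.7, 1.8): (-0.90, 0.40)}
next_pair = pick_next(posterior, f_best=-0.80)
```

In this toy example the second candidate has a lower predicted mean but much higher uncertainty, so EI prefers it; this is the exploration behavior described above.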

Algorithm 1: Adaptive LoRA initialization and fine-tuning with scheduled regularization

Input: old backbone $\Theta$, evolved backbone $\Theta'$, old adapter $\Delta\Theta_{i}$, evolution delta $\Delta\Theta$, training set $D_{\mathrm{train}}$, validation set $D_{\mathrm{val}}$, search budget $T$, stage lengths $E_{1}$ and $E_{\mathrm{total}}$.

Output: updated adapter $\Delta\Theta'_{i}$ for the evolved backbone.

1. Initialize the observation history $\mathcal{H}_{0}$ with a small set of evaluated coefficient pairs.
2. For $k=0$ to $T-1$:
   1. Fit the GP surrogate using the current observation history $\mathcal{H}_{k}$.
   2. Compute the acquisition function $EI(\alpha,\beta)$ according to Eq. ([15](https://arxiv.org/html/2606.02606#S3.E15)).
   3. Select $(\alpha_{k+1},\beta_{k+1})$ by maximizing $EI(\alpha,\beta)$ according to Eq. ([17](https://arxiv.org/html/2606.02606#S3.E17)).
   4. Construct the candidate initialized adapter $\Theta_{\mathrm{init}}(\alpha_{k+1},\beta_{k+1})=\alpha_{k+1}\Pi_{r}(\Delta\Theta)+\beta_{k+1}\Delta\Theta_{i}$.
   5. Evaluate $f(\alpha_{k+1},\beta_{k+1})$ on $D_{\mathrm{val}}$ using $\Theta'$ with $\Theta_{\mathrm{init}}(\alpha_{k+1},\beta_{k+1})$.
   6. Update the observation history $\mathcal{H}_{k+1}\leftarrow\mathcal{H}_{k}\cup\{((\alpha_{k+1},\beta_{k+1}),f(\alpha_{k+1},\beta_{k+1}))\}$.
3. Select the best coefficients $(\alpha^{*},\beta^{*})$ according to Eq. ([9](https://arxiv.org/html/2606.02606#S3.E9)).
4. Obtain $\Theta_{\mathrm{init}}^{*}$ by Eq. ([7](https://arxiv.org/html/2606.02606#S3.E7)), and initialize $(A^{(0)},B^{(0)})\leftarrow(A_{\mathrm{init}},B_{\mathrm{init}})$.
5. For $t=0$ to $E_{1}-1$:
   1. Sample a mini-batch $D^{(t)}$ from $D_{\mathrm{train}}$.
   2. Compute the task loss $\mathcal{L}_{\mathrm{task}}^{(t)}$ using $(A^{(t)},B^{(t)})$ on $D^{(t)}$.
   3. Compute the stage-1 loss $\mathcal{L}_{S1}^{(t)}$ according to Eq. ([18](https://arxiv.org/html/2606.02606#S3.E18)).
   4. Update $A^{(t+1)}$ and $B^{(t+1)}$ using learning rate $\eta_{1}$.
6. For $t=E_{1}$ to $E_{\mathrm{total}}-1$:
   1. Sample a mini-batch $D^{(t)}$ from $D_{\mathrm{train}}$.
   2. Compute the task loss $\mathcal{L}_{\mathrm{task}}^{(t)}$ using $(A^{(t)},B^{(t)})$ on $D^{(t)}$.
   3. Compute the stage-2 loss $\mathcal{L}_{S2}^{(t)}$ according to Eq. ([21](https://arxiv.org/html/2606.02606#S3.E21)).
   4. Compute the annealed learning rate $\eta(t)$ according to Eq. ([22](https://arxiv.org/html/2606.02606#S3.E22)).
   5. Update $A^{(t+1)}$ and $B^{(t+1)}$ using learning rate $\eta(t)$.
7. Return the final adapter $\Delta\Theta'_{i}=(A^{(E_{\mathrm{total}})},B^{(E_{\mathrm{total}})})$.

### 3.3 Fine-tuning with Scheduled Regularization

After obtaining the optimized initialization $\Theta_{\mathrm{init}}^{*}$, ReLoRA further re-adapts the adapter on the downstream task training set $D_{\mathrm{train}}$. Although a compatibility-aware initialization provides a favorable starting point, directly fine-tuning with only the task loss may cause the adapter to drift away from this initialization too quickly, which can slow convergence and increase time-to-readiness. As observed in Figure [2](https://arxiv.org/html/2606.02606#S2.F2), strong regularization accelerates early convergence but may restrict final refinement, whereas no regularization provides greater flexibility but requires more steps to reach a high-quality region. To balance these two effects, ReLoRA adopts a scheduled regularization strategy with two stages: fast service-readiness recovery and quality refinement.

For clarity, the following equations are written for one adapted module, and the same procedure is applied to all LoRA-adapted modules. Let $\Theta_{\mathrm{init}}^{*}=(A_{\mathrm{init}},B_{\mathrm{init}})$ denote the optimized initialized adapter, and let $\Delta\Theta_{i}^{(t)}=(A^{(t)},B^{(t)})$ denote the trainable LoRA factors at step $t$.

#### 3.3.1 Stage 1: Guided Rapid Adaptation for Service-Readiness Recovery

The objective of the first stage is to rapidly move the adapter into a service-ready quality region. To this end, ReLoRA anchors the optimization trajectory around the initialized adapter by adding strong L2 regularization to the task loss. At training step $t$, the stage-1 objective is:

$$\mathcal{L}_{S1}^{(t)} = \mathcal{L}_{\mathrm{task}}^{(t)}+\frac{\lambda_{1}}{2}\left(\|A^{(t)}-A_{\mathrm{init}}\|_{F}^{2}+\|B^{(t)}-B_{\mathrm{init}}\|_{F}^{2}\right), \tag{18}$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm, and $\lambda_{1}$ controls the strength of the anchor. The gradient with respect to $A^{(t)}$ is:

$$\nabla_{A^{(t)}}\mathcal{L}_{S1}^{(t)} = \nabla_{A^{(t)}}\mathcal{L}_{\mathrm{task}}^{(t)}+\lambda_{1}(A^{(t)}-A_{\mathrm{init}}), \tag{19}$$

with the gradient for $B^{(t)}$ computed analogously. Given learning rate $\eta_{1}$, the update for $A^{(t)}$ is:

$$A^{(t+1)} = A^{(t)}-\eta_{1}\nabla_{A^{(t)}}\mathcal{L}_{S1}^{(t)}, \tag{20}$$

and $B^{(t)}$ is updated in the same manner.

This strong anchoring encourages the adapter to stay close to $\Theta_{\mathrm{init}}^{*}$, which already integrates knowledge from the base-model evolution and the previously deployed task adapter. By using a relatively large $\lambda_{1}$ during the first $E_{1}$ steps, ReLoRA reduces unnecessary exploration in the early phase and accelerates service-quality recovery.
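The anchored update of Eqs. (19) and (20) can be sketched for a single flattened parameter vector. This is an illustration under stated assumptions, not the authors' implementation: plain gradient descent is assumed (the source does not state which optimizer is used in practice), the task gradient is supplied by the caller, and `anchored_step` is a hypothetical helper name.

```python
def anchored_step(param, anchor, task_grad, lr, lam):
    """One gradient step on L_task + (lam / 2) * ||param - anchor||^2.

    param, anchor, task_grad: equal-length lists of floats (a flattened A or B).
    lr:  learning rate (eta_1 in Stage 1).
    lam: anchor strength (lambda_1 in Stage 1, lambda_2 in Stage 2).
    """
    return [p - lr * (g + lam * (p - a))      # Eq. (19) plugged into Eq. (20)
            for p, a, g in zip(param, anchor, task_grad)]

# With a zero task gradient the parameter is pulled back toward the anchor:
# each step shrinks the deviation by a factor of (1 - lr * lam).
pulled = anchored_step([1.0, 2.0], [0.0, 2.0], [0.0, 0.0], lr=0.1, lam=1.0)

# With lam = 0 the update reduces to an ordinary gradient step on the task loss.
plain = anchored_step([1.0], [0.0], [0.5], lr=0.1, lam=0.0)
```

The same function covers Stage 2 by passing the smaller $\lambda_2$, since the two stage objectives differ only in the anchor strength.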

#### 3.3.2 Stage 2: Refinement and Exploration for Quality Improvement

After the adapter reaches a promising parameter region, the second stage relaxes the anchor to enable more task-specific refinement. The regularization strength is reduced to $\lambda_{2}$, where $\lambda_{2}\ll\lambda_{1}$. The stage-2 objective becomes:

$$\mathcal{L}_{S2}^{(t)} = \mathcal{L}_{\mathrm{task}}^{(t)}+\frac{\lambda_{2}}{2}\left(\|A^{(t)}-A_{\mathrm{init}}\|_{F}^{2}+\|B^{(t)}-B_{\mathrm{init}}\|_{F}^{2}\right). \tag{21}$$

With a weaker anchor, the optimization is primarily driven by the task loss while still retaining mild protection against catastrophic drift from the initialization. This allows the adapter to refine its task-specific behavior and approach a better local optimum.

Meanwhile, a cosine annealing learning-rate schedule [[26](https://arxiv.org/html/2606.02606#bib.bib26)] is applied during the second stage:

$$\eta(t) = \eta_{\min}+\frac{1}{2}(\eta_{1}-\eta_{\min})\left(1+\cos\left(\frac{(t-E_{1})\pi}{E_{\mathrm{total}}-E_{1}}\right)\right), \tag{22}$$

where $\eta_{\min}$ is the minimum learning rate, $\eta_{1}$ is the initial learning rate of the second stage, and $E_{\mathrm{total}}$ is the total number of re-adaptation steps across both stages. This schedule gradually reduces the step size, enabling stable refinement after the rapid recovery phase. The final adapter after $E_{\mathrm{total}}$ steps is used as the updated task adapter $\Delta\Theta'_{i}$.
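The two-stage schedule can be summarized as a pair of step-indexed functions. This is a minimal sketch: the function names are hypothetical, the default values are the ones reported in the Training Details of Section 4.1 ($\lambda_1 = 10^{-3}$, $\lambda_2 = 10^{-4}$, $\eta_1 = 5\times10^{-4}$, $\eta_{\min} = 10^{-5}$), and holding the learning rate constant at $\eta_1$ during Stage 1 follows Algorithm 1.

```python
import math

def reg_strength(t, e1, lam1=1e-3, lam2=1e-4):
    """Anchor strength at step t: lambda_1 in Stage 1, lambda_2 in Stage 2."""
    return lam1 if t < e1 else lam2

def learning_rate(t, e1, e_total, eta1=5e-4, eta_min=1e-5):
    """Learning rate at step t: constant in Stage 1, cosine-annealed (Eq. 22) in Stage 2."""
    if t < e1:
        return eta1
    progress = (t - e1) / (e_total - e1)   # 0 at the start of Stage 2, 1 at E_total
    return eta_min + 0.5 * (eta1 - eta_min) * (1.0 + math.cos(math.pi * progress))

# Example with assumed stage lengths E_1 = 100 and E_total = 500.
schedule = [(t, reg_strength(t, 100), learning_rate(t, 100, 500))
            for t in (0, 99, 100, 300, 499)]
```

Because the Stage 2 loop in Algorithm 1 ends at $t = E_{\mathrm{total}}-1$, the learning rate approaches but does not exactly reach $\eta_{\min}$ during training.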

## 4 Evaluation

### 4.1 Experimental Setup

All experiments described in this paper are conducted on a single server equipped with an Intel\(R\) Xeon\(R\) Platinum 8358P CPU, 8 NVIDIA GeForce RTX A6000 GPUs \(each with 48 GB of memory\), and 512 GB of RAM\.

Models. We evaluate ReLoRA across multiple LLM service backbones to demonstrate its generality, including LLaMA2-7B [[2](https://arxiv.org/html/2606.02606#bib.bib2)], LLaMA3.1-8B [[13](https://arxiv.org/html/2606.02606#bib.bib13)], and Mistral-7B [[27](https://arxiv.org/html/2606.02606#bib.bib27)]. These models cover different architectures and model families, allowing us to examine whether ReLoRA remains effective under heterogeneous service backbones.

Datasets and Service Tasks. We evaluate ReLoRA on six datasets covering representative downstream service tasks. The statistics of these datasets are summarized in Table [I](https://arxiv.org/html/2606.02606#S4.T1). Specifically, we consider the following service categories.

- Knowledge and Reasoning Service: We use the MMLU benchmark [[19](https://arxiv.org/html/2606.02606#bib.bib19)], which contains questions across 57 subjects, to evaluate general reasoning capability.
- Sentiment Analysis Service: We use SST-2 [[28](https://arxiv.org/html/2606.02606#bib.bib28)], which consists of movie reviews annotated with binary labels.
- News and Topic Classification Services: We use AGNews [[29](https://arxiv.org/html/2606.02606#bib.bib29)], a four-class news classification dataset, and 20News [[21](https://arxiv.org/html/2606.02606#bib.bib21)], which contains documents from 20 categories.
- Natural Language Inference Service: We use MNLI [[28](https://arxiv.org/html/2606.02606#bib.bib28)] and SNLI [[30](https://arxiv.org/html/2606.02606#bib.bib30)] to evaluate the ability to determine logical relationships between sentence pairs.

TABLE I: Statistics of downstream service tasks in experiments.

Training Details. To simulate base-service evolution, we employ continual pretraining [[14](https://arxiv.org/html/2606.02606#bib.bib14),[15](https://arxiv.org/html/2606.02606#bib.bib15)]. Specifically, we start from a pretrained LLM backbone $\Theta$ and update it on a specific pretraining corpus to obtain an evolved backbone $\Theta'$. The evolution process is implemented using LoRA [[4](https://arxiv.org/html/2606.02606#bib.bib4),[11](https://arxiv.org/html/2606.02606#bib.bib11)] with rank $r=64$, learning rate $2\times10^{-4}$, and 4 training epochs, since full-parameter updating is infeasible under our platform resource constraints. The continued pretraining datasets include OpenOrca [[22](https://arxiv.org/html/2606.02606#bib.bib22)], AlpacaGPT4 [[31](https://arxiv.org/html/2606.02606#bib.bib31)], and OpenPlatypus [[32](https://arxiv.org/html/2606.02606#bib.bib32)], which represent different update sources for LLM services. For downstream service adaptation, we use LoRA with rank $r=16$ and a uniform batch size of 32 for all methods. For the adaptive initialization search in ReLoRA, we use a small validation set of 512 examples sampled from the corresponding downstream service task. We train for 5 epochs on MMLU and 2 epochs for all other datasets. ReLoRA uses a cosine annealing learning-rate schedule with $\eta_{1}=5\times10^{-4}$ and $\eta_{\min}=1\times10^{-5}$ [[33](https://arxiv.org/html/2606.02606#bib.bib33),[34](https://arxiv.org/html/2606.02606#bib.bib34)]. Based on the hyperparameter sensitivity analysis in Section [4.5](https://arxiv.org/html/2606.02606#S4.SS5), we set the regularization strengths to $\lambda_{1}=1\times10^{-3}$ in Stage 1 and $\lambda_{2}=1\times10^{-4}$ in Stage 2, which provides a balance between convergence speed and final service quality.

Metrics. We evaluate service quality using accuracy on the test set of each downstream service task. For the original backbone $\Theta$ and the evolved backbone $\Theta'$, we report zero-shot accuracy. For task-adaptation methods, we report the accuracy after adapter adaptation. To assess rollout efficiency, we use two time-related metrics. First, time-to-readiness measures the wall-clock time required for an updated adapter to reach a predefined service-quality threshold after the base model evolves. This metric is reported in seconds. Second, end-to-end service rollout time measures the total time required by the complete update process, including search and re-adaptation when applicable. This metric is reported in minutes. For fairness, the target service-quality threshold is set to a level that all methods can reach.
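One way to read the time-to-readiness definition is as the first crossing of the threshold in an evaluation log. The sketch below assumes a hypothetical log format of `(elapsed_seconds, accuracy)` checkpoints recorded in chronological order; the paper does not describe how or how often accuracy is measured during re-adaptation, and the numbers shown are made up.

```python
def time_to_readiness(checkpoints, threshold):
    """Return the elapsed seconds at the first checkpoint meeting the threshold.

    checkpoints: iterable of (elapsed_seconds, accuracy) in chronological order.
    threshold:   target service-quality level on the same scale as accuracy.
    Returns None if the threshold is never reached.
    """
    for elapsed_seconds, accuracy in checkpoints:
        if accuracy >= threshold:
            return elapsed_seconds
    return None

# Illustrative log only; not measurements from the paper.
log = [(600, 0.850), (1200, 0.905), (1800, 0.912), (2400, 0.920)]
ready_at = time_to_readiness(log, threshold=0.91)
```

Returning `None` for an unreached threshold makes the fairness condition explicit: a comparison is only meaningful when every method yields a finite value.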

Baselines. We compare ReLoRA with the following baselines. Under zero-shot settings, we include: (1) Pretrained LLM $\Theta$, the original backbone before service evolution; and (2) Evolved LLM $\Theta'$, the updated backbone after continual pretraining. These two baselines help quantify the effect of base-model evolution without task-specific adaptation.

For task-adaptation methods, we include: (3) LoRA-Scratch, which reconstructs a new LoRA adapter from scratch on $\Theta'$ for each downstream task; (4) PortLLM+FT [[11](https://arxiv.org/html/2606.02606#bib.bib11)], which performs naive adapter transfer followed by fine-tuning; and (5) ORAL [[12](https://arxiv.org/html/2606.02606#bib.bib12)], a conditional recurrent diffusion framework that generates LoRA weights from scratch for the evolved backbone. Unless otherwise specified, the training cost of ORAL's generative model is not included. Including ORAL enables a comparison between re-adapting an existing deployed adapter and generating a new adapter from task specifications.

TABLE II: Service-quality comparison under different base-service update sources.

| Update Source | Setting | Method | MMLU | SST-2 | AGNews | 20News | MNLI | SNLI | Avg. (↑) |
|---|---|---|---|---|---|---|---|---|---|
| – | Zero-shot | Pretrained LLM $\Theta$ | 26.2 | 56.9 | 28.4 | 4.6 | 31.4 | 30.2 | 29.6 |
| OpenOrca | Zero-shot | Evolved LLM $\Theta'$ | 24.5 | 56.0 | 29.2 | 3.7 | 32.3 | 33.0 | 29.8 |
| OpenOrca | Task-adaptation | LoRA-Scratch | 81.4 | 96.3 | 95.2 | 76.2 | 91.6 | 91.8 | 88.8 |
| OpenOrca | Task-adaptation | ORAL | 81.2 | 96.4 | 95.0 | 76.1 | 91.9 | 92.0 | 88.8 |
| OpenOrca | Task-adaptation | PortLLM+FT | 82.5 | 96.6 | 95.2 | 76.4 | 92.6 | 92.1 | 89.2 |
| OpenOrca | Task-adaptation | ReLoRA (Ours) | 83.9 | 97.0 | 95.3 | 76.6 | 93.5 | 93.1 | 89.9 |
| AlpacaGPT4 | Zero-shot | Evolved LLM $\Theta'$ | 24.0 | 51.6 | 31.0 | 4.5 | 31.6 | 31.4 | 29.0 |
| AlpacaGPT4 | Task-adaptation | LoRA-Scratch | 79.5 | 96.1 | 95.0 | 75.6 | 91.5 | 92.1 | 88.3 |
| AlpacaGPT4 | Task-adaptation | ORAL | 79.7 | 96.0 | 95.1 | 75.5 | 91.8 | 92.3 | 88.4 |
| AlpacaGPT4 | Task-adaptation | PortLLM+FT | 81.7 | 96.4 | 95.2 | 76.1 | 92.4 | 92.7 | 89.1 |
| AlpacaGPT4 | Task-adaptation | ReLoRA (Ours) | 82.9 | 96.9 | 95.4 | 76.3 | 93.0 | 93.1 | 89.6 |
| OpenPlatypus | Zero-shot | Evolved LLM $\Theta'$ | 24.0 | 52.2 | 29.9 | 3.8 | 31.9 | 30.5 | 28.7 |
| OpenPlatypus | Task-adaptation | LoRA-Scratch | 79.7 | 96.1 | 95.2 | 74.3 | 91.6 | 92.2 | 88.2 |
| OpenPlatypus | Task-adaptation | ORAL | 79.4 | 96.2 | 95.0 | 74.6 | 91.4 | 92.4 | 88.2 |
| OpenPlatypus | Task-adaptation | PortLLM+FT | 82.8 | 96.4 | 95.2 | 75.8 | 92.7 | 93.0 | 89.3 |
| OpenPlatypus | Task-adaptation | ReLoRA (Ours) | 84.3 | 97.0 | 95.3 | 76.4 | 93.3 | 93.1 | 89.9 |

TABLE III: Service-quality recovery across different LLM service backbones.

![Refer to caption](https://arxiv.org/html/2606.02606v1/x6.png)

Figure 3: Time-to-readiness of different methods to reach the target service-quality threshold across downstream tasks.

### 4.2 Overall Performance

We first evaluate the overall effectiveness of ReLoRA against the baselines. The service-quality results are summarized in Tables [II](https://arxiv.org/html/2606.02606#S4.T2) and [III](https://arxiv.org/html/2606.02606#S4.T3). We make four key observations.

First, compared with the evolved backbone $\Theta'$, ReLoRA consistently improves service quality across all downstream tasks. For example, when LLaMA3.1-8B is further pretrained on OpenOrca, ReLoRA improves the service quality on {MMLU, SST-2, AGNews, 20News, MNLI, SNLI} by {59.4, 41.0, 66.1, 72.9, 61.2, 60.1} percentage points, respectively. These results show that task-specific adaptation remains necessary even after the base model has been updated through continued pretraining.

Second, ReLoRA improves not only final service quality but also rollout efficiency. As shown in Figure [3](https://arxiv.org/html/2606.02606#S4.F3), we report the time-to-readiness required by each method to reach a predefined target service-quality threshold. The target thresholds for {MMLU, SST-2, AGNews, 20News, MNLI, SNLI} are set to {81%, 96%, 95%, 75%, 91%, 92%}, respectively. All experiments in Figure [3](https://arxiv.org/html/2606.02606#S4.F3) are conducted using LLaMA3.1-8B with OpenOrca as the update source. On MNLI, for instance, ReLoRA reaches 91% service quality in 28,802 seconds, whereas LoRA-Scratch requires 255,328 seconds. This corresponds to an 8.9$\times$ speedup and an 88.7% reduction in time-to-readiness. Meanwhile, ReLoRA also improves final accuracy over LoRA-Scratch by 1.9 and 2.5 percentage points on MNLI and MMLU, respectively. These results demonstrate that ReLoRA can accelerate service readiness while maintaining the final quality of the updated adapter.
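As a quick check of the MNLI figures, using only the two timings quoted above, the speedup and the relative reduction follow from the same ratio:

$$\text{speedup} = \frac{255{,}328}{28{,}802} \approx 8.86 \approx 8.9\times, \qquad \text{reduction} = 1-\frac{28{,}802}{255{,}328} \approx 0.887 = 88.7\%.$$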

Third, ReLoRA achieves clear advantages over PortLLM+FT in both convergence speed and final service quality. For example, on MNLI, ReLoRA achieves a 4.0$\times$ speedup and improves service quality by 1.1% compared with PortLLM+FT. PortLLM+FT initializes fine-tuning by directly transferring the old adapter, whereas ReLoRA adaptively fuses the old task adapter with the base-model evolution delta. This compatibility-aware initialization provides a better starting point for re-adaptation, resulting in faster convergence and higher service quality.

Fourth, ReLoRA outperforms ORAL, a representative generative adapter baseline. As shown in Tables [II](https://arxiv.org/html/2606.02606#S4.T2) and [III](https://arxiv.org/html/2606.02606#S4.T3), ORAL generally achieves service quality comparable to LoRA-Scratch. For example, on LLaMA3.1-8B with OpenOrca, ORAL and LoRA-Scratch both achieve an average service quality of 88.8%. This confirms that ORAL can generate task-specific adapters for evolved backbones. However, ReLoRA consistently obtains better performance by reusing the fine-grained task knowledge encoded in the previously deployed adapter. In the same setting, ReLoRA achieves an average service quality of 89.9%, outperforming ORAL by 1.1 percentage points. The advantage is more pronounced on MMLU, where ReLoRA reaches 83.9%, exceeding ORAL by 2.7 percentage points. These results indicate that re-adapting an existing specialized adapter can be more effective than generating a new adapter from general task descriptions.

### 4.3 Optimization Analysis and System Efficiency

Efficiency of Optimization Strategy. A potential concern of adaptive initialization is the additional cost of searching for the fusion coefficients $\alpha$ and $\beta$. To evaluate this cost, we compare the pre-rollout compatibility search overhead of the GP-based Bayesian optimization strategy with a standard grid search baseline. A grid search with 10 values per coefficient requires 100 evaluations, whereas our GP-based search uses only $T=20$ evaluations. Thus, the GP-based strategy reduces the number of evaluations by 5$\times$.

Table [IV](https://arxiv.org/html/2606.02606#S4.T4) empirically confirms this advantage. For example, on MMLU, grid search consumes 20,786 seconds, which could offset the benefit of fast re-adaptation. In contrast, the GP-based search used by ReLoRA takes only 4,573 seconds. On smaller datasets such as SST-2, the search overhead is only 634 seconds. These results show that the proposed search strategy provides a sample-efficient way to identify compatibility-aware initialization coefficients.

TABLE IV: Pre-rollout compatibility search overhead.

Analysis of Fusion Coefficients. To understand how ReLoRA balances base-model evolution and task-specific adaptation, we analyze the optimal fusion coefficients $(\alpha,\beta)$ identified under different downstream service tasks and update sources. The results are summarized in Table [V](https://arxiv.org/html/2606.02606#S4.T5).

We observe that $\beta$ is consistently larger than 1.0, typically ranging from 1.4 to 1.9. This suggests that the task-specific knowledge encoded in the previously deployed adapter remains highly valuable and often needs to be amplified to compensate for the representation shift induced by base-model evolution. At the same time, $\alpha$ remains nonzero across all settings, ranging from 0.1 to 0.7. This indicates that the base-evolution delta also provides useful information and should be incorporated into the initialization. For complex reasoning tasks such as MMLU, $\alpha$ tends to be higher, reaching 0.7 under OpenOrca and OpenPlatypus. This implies that general capability changes introduced by base-model evolution are especially important for knowledge-intensive tasks. The variation of $(\alpha,\beta)$ across tasks and update sources further confirms that a static fusion rule cannot reliably handle heterogeneous service updates, motivating the adaptive search mechanism in ReLoRA.

TABLE V: Optimal service-specific fusion coefficients under different update sources.

End-to-End Service Rollout Time. Beyond the time-to-readiness required to reach a target threshold, practical deployment also requires evaluating the total cost of the complete update process. Table [VI](https://arxiv.org/html/2606.02606#S4.T6) reports the end-to-end service rollout time on LLaMA3.1-8B with OpenOrca as the update source. For ReLoRA, this time includes both the Bayesian optimization search and the re-adaptation.

Although ReLoRA introduces an additional search step, its total rollout time remains substantially lower than the baselines. For example, on SNLI, ReLoRA completes the rollout process in 1,264 minutes, while LoRA-Scratch requires 2,813 minutes. Aggregated over all six tasks, ReLoRA reduces the total rollout time from 12,628 minutes to 5,437 minutes, achieving a 56.9% reduction and a 2.32$\times$ speedup over LoRA-Scratch. These results show that the search overhead is well compensated by the reduction in re-adaptation time, making ReLoRA suitable for service maintenance.
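The reduction and speedup quoted for rollout time are two views of the same ratio, since reduction equals $1 - 1/\text{speedup}$. The small helper below (hypothetical names, not part of the paper) recomputes both from the baseline and ReLoRA totals stated in the text.

```python
def speedup(baseline_time, new_time):
    """How many times faster the new method is than the baseline."""
    return baseline_time / new_time

def reduction_pct(baseline_time, new_time):
    """Relative time saved versus the baseline, in percent."""
    return 100.0 * (1.0 - new_time / baseline_time)

# Totals over all six tasks (minutes), as quoted in the text.
total_speedup = speedup(12628, 5437)
total_reduction = reduction_pct(12628, 5437)

# SNLI alone (minutes), as quoted in the text.
snli_speedup = speedup(2813, 1264)
```

Both functions are unit-agnostic, so the same check applies to the time-to-readiness figures reported in seconds.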

TABLE VI: End-to-end service rollout time including search and re-adaptation (minutes).

TABLE VII: End-to-end service rollout efficiency derived from Table [VI](https://arxiv.org/html/2606.02606#S4.T6).

Table [VII](https://arxiv.org/html/2606.02606#S4.T7) further breaks down the rollout efficiency by task. The results highlight the practical service-maintenance benefit of ReLoRA: it not only accelerates training convergence, but also shortens the time required to bring updated downstream LLM services back online.

### 4.4 Ablation Study

We further evaluate the contribution of the two core components of ReLoRA: adaptive LoRA initialization, denoted as AdaInit, and scheduled regularization, denoted as SReg. The results are reported in Table [VIII](https://arxiv.org/html/2606.02606#S4.T8).

To assess the effect of AdaInit, we compare ReLoRA without SReg against PortLLM\+FT\. The former uses the adaptive initialization produced by Bayesian optimization but performs standard fine\-tuning afterward, while the latter starts from naive adapter transfer\. The average service quality improves from 89\.2% to 89\.5%, indicating that compatibility\-aware fusion provides a better starting point than direct adapter transfer\.

To assess the effect of SReg, we compare full ReLoRA with ReLoRA without SReg\. Full ReLoRA improves the service quality from 89\.5% to 89\.9%\. This improvement, together with the faster convergence observed in previous experiments, shows that scheduled regularization helps the adapter move from the fused initialization to a better task\-specific optimum\. Overall, both AdaInit and SReg are important for achieving fast and high\-quality service recovery\.

TABLE VIII: Component contribution to service-quality recovery.

### 4.5 Robustness Across Update Sources and Service Backbones

Effect of Different Update Sources. We first evaluate whether ReLoRA remains robust under different base-service update sources. Specifically, we use OpenOrca, AlpacaGPT4, and OpenPlatypus as continued pretraining datasets for LLaMA3.1-8B. As shown in Table [II](https://arxiv.org/html/2606.02606#S4.T2) and Figure [4](https://arxiv.org/html/2606.02606#S4.F4), ReLoRA consistently improves service quality and time-to-readiness across different update sources.

For example, on MNLI, ReLoRA improves accuracy over LoRA-Scratch by 1.9%, 1.5%, and 1.7% under OpenOrca, AlpacaGPT4, and OpenPlatypus, respectively. We also observe that different update sources affect the zero-shot performance of the evolved backbone. For instance, continued pretraining on AlpacaGPT4 decreases zero-shot performance on MMLU and SST-2 compared with the original backbone $\Theta$. This suggests that base-model evolution can introduce task-dependent shifts. By incorporating the evolution delta into adapter initialization, ReLoRA mitigates these shifts and achieves more reliable service-quality recovery.

Effect of Different Service Backbones. We next evaluate ReLoRA across different LLM backbones. As shown in Table [III](https://arxiv.org/html/2606.02606#S4.T3) and Figure [4](https://arxiv.org/html/2606.02606#S4.F4), we compare ReLoRA with the baselines on three different LLMs using OpenOrca as the update source. Across all tested backbones, ReLoRA consistently achieves competitive or superior service quality and rollout efficiency. For example, on Mistral-7B, ReLoRA achieves a service quality of 88.1%, compared with 87.4% for LoRA-Scratch, 87.6% for PortLLM+FT, and 87.1% for ORAL. The efficiency improvement is also substantial. On LLaMA2-7B, ReLoRA provides 5.0$\times$ and 2.2$\times$ speedups over LoRA-Scratch and PortLLM+FT, respectively. These results demonstrate that ReLoRA is not tied to a specific backbone and can generalize across different backbones.

![Refer to caption](https://arxiv.org/html/2606.02606v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.02606v1/x8.png)

Figure 4: Time-to-readiness under different update sources and LLM service backbones.

Hyperparameter Sensitivity Analysis. The scheduled regularization strategy introduces two key hyperparameters: $\lambda_{1}$ for the Guided Rapid Adaptation stage and $\lambda_{2}$ for the Refinement and Exploration stage. To evaluate the robustness of ReLoRA and verify that it does not require extensive per-task tuning, we vary $\lambda_{1}\in\{5\times10^{-4},1\times10^{-3},5\times10^{-3},1\times10^{-2}\}$ and $\lambda_{2}\in\{1\times10^{-4},5\times10^{-4},1\times10^{-5}\}$ on 20News and MMLU. The results are summarized in Table [IX](https://arxiv.org/html/2606.02606#S4.T9). We observe that the final service quality remains stable under different regularization strengths. For example, on 20News, the service quality stays within a narrow range from 76.48% to 76.65%. However, the choice of $(\lambda_{1},\lambda_{2})$ affects convergence efficiency. The configuration $\lambda_{1}=1\times10^{-3}$ and $\lambda_{2}=1\times10^{-4}$ consistently achieves a trade-off between service quality and convergence speed, reaching the target quality in 275 steps on 20News and 625 steps on MMLU. Therefore, we use this configuration as the default setting in all experiments. These results suggest that ReLoRA is robust to hyperparameter variations and avoids the operational burden of sensitive per-service tuning.

TABLE IX: Sensitivity of scheduled regularization for service readiness\.

Preliminary Extension to Advanced PEFT Variants\. To examine whether ReLoRA can extend beyond vanilla LoRA, we apply it to DoRA \[[35](https://arxiv.org/html/2606.02606#bib.bib35)\], a representative LoRA variant that decomposes weight updates into magnitude and direction components\. Following the default setting, we use LLaMA3\.1\-8B evolved on OpenOrca and evaluate on SST\-2\. The results are shown in Table [X](https://arxiv.org/html/2606.02606#S4.T10)\. ReLoRA achieves a service quality of 97\.1%, outperforming DoRA\-Scratch and PortLLM\+FT, which achieve 96\.7% and 96\.9%, respectively\. In terms of efficiency, ReLoRA reduces the end\-to\-end service rollout time to 207 minutes, compared with 517 minutes for DoRA\-Scratch and 242 minutes for PortLLM\+FT\. These results suggest that the knowledge\-reusing design of ReLoRA can potentially extend to other PEFT variants\.
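
For reference, the rollout times quoted above correspond to speedups of roughly 2.5× over DoRA-Scratch and 1.2× over PortLLM+FT. The short calculation below uses only the three reported numbers; the dictionary and function names are illustrative.

```python
# End-to-end rollout times in minutes, as reported for the DoRA experiment.
rollout_minutes = {"ReLoRA": 207, "DoRA-Scratch": 517, "PortLLM+FT": 242}


def speedup(baseline, method="ReLoRA"):
    """Ratio of the baseline's rollout time to the method's rollout time."""
    return rollout_minutes[baseline] / rollout_minutes[method]


for name in ("DoRA-Scratch", "PortLLM+FT"):
    print(f"{name}: {speedup(name):.2f}x")
# prints:
# DoRA-Scratch: 2.50x
# PortLLM+FT: 1.17x
```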

TABLE X: Preliminary extension to DoRA adapters on SST\-2\.

## 5 Related Works

### 5\.1 LLM Services and Model Service Maintenance

LLM\-as\-a\-Service has become an important paradigm for deploying large language models as shared service backbones, where downstream users access model capabilities through APIs, hosted instances, or task\-specific adapters\. In such service ecosystems, base models are periodically updated to incorporate new knowledge, improve safety, or enhance general capabilities, while downstream model services built on previous versions must be maintained accordingly\. This creates a service lifecycle challenge: updating the base model may invalidate existing task adapters, and retraining numerous adapters from scratch can introduce substantial computation cost and rollout latency\. Therefore, efficient service maintenance requires mechanisms that can rapidly restore downstream service quality after base\-model evolution without incurring full retraining overhead\.

### 5\.2 Parameter\-Efficient Fine\-Tuning for LLM Services

Parameter\-Efficient Fine\-Tuning \(PEFT\) techniques \[[36](https://arxiv.org/html/2606.02606#bib.bib36),[4](https://arxiv.org/html/2606.02606#bib.bib4),[37](https://arxiv.org/html/2606.02606#bib.bib37)\] have become key enablers for lightweight customization of LLM services\. Instead of updating all parameters of a large backbone, PEFT freezes the pretrained model and introduces only a small number of trainable parameters, thereby reducing adaptation cost, memory usage, and storage overhead\. Among existing PEFT methods, LoRA \[[4](https://arxiv.org/html/2606.02606#bib.bib4)\] is widely adopted because it injects trainable low\-rank matrices into selected layers of the frozen backbone, allowing multiple task\-specific adapters to share the same base model\. This property makes LoRA particularly suitable for multi\-tenant LLM service environments, where many downstream services can be deployed as lightweight adapters rather than independent full models\. However, when the shared backbone evolves, these deployed adapters may no longer remain fully compatible with the updated model, which motivates efficient adapter re\-adaptation\.
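
To make the adapter-sharing property concrete, the toy sketch below applies two rank-1 adapters to one frozen weight matrix using the standard LoRA form y = Wx + s·B(Ax). All matrices, the `scale` argument, and the task names are made-up illustrative values; a real service would rely on a tensor library rather than nested lists.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# One shared frozen backbone weight (2x2) and two rank-1 task adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "task_a": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},
    "task_b": {"A": [[1.0, -1.0]], "B": [[0.0], [0.5]]},
}

x = [2.0, 1.0]
outputs = {
    name: lora_forward(W, ad["A"], ad["B"], x) for name, ad in adapters.items()
}
print(outputs)  # {'task_a': [3.5, 1.0], 'task_b': [2.0, 1.5]}
```

Each task stores only its small `A` and `B` factors, while `W` is held once for all tenants. This is also why a change to `W` affects every deployed adapter at the same time.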

### 5\.3 Adapting LoRA for Evolving LLMs

As base LLMs continuously evolve, maintaining the compatibility of previously deployed adapters with updated backbones becomes a critical challenge\. Retraining each adapter from scratch can recover task performance, but it is expensive and delays service rollout when many downstream services are maintained in production\. PortLLM \[[11](https://arxiv.org/html/2606.02606#bib.bib11)\] addresses this issue by directly reusing original LoRA weights as transferable “patches” for evolved models, which reduces computation cost but may fail to capture the interaction between base\-model evolution and task\-specific adaptation\. Another line of work focuses on generating LoRA parameters from task or model descriptions\. For example, Cond P\-Diff \[[38](https://arxiv.org/html/2606.02606#bib.bib38)\] and ORAL \[[12](https://arxiv.org/html/2606.02606#bib.bib12)\] use conditional diffusion\-based models to synthesize LoRA weights for target tasks and architectures\. Although these approaches provide flexibility in creating adapters on demand, they typically require substantial task\-adapter training data, and their effectiveness depends on the quality of the learned generator and conditioning signals\. In contrast, ReLoRA focuses on re\-adapting existing deployed adapters by explicitly reusing both the old task\-specific adapter and the parameter delta introduced by base\-model evolution, enabling efficient service\-quality recovery after model updates\.

### 5\.4 Efficient Service Rollout and Model Update

Efficient model update and service rollout are also related to model merging and version\-aware deployment\. Model merging aims to combine useful capabilities from different models or adapters without full retraining \[[39](https://arxiv.org/html/2606.02606#bib.bib39)\]\. Representative methods include arithmetic strategies such as Task Arithmetic \[[40](https://arxiv.org/html/2606.02606#bib.bib40)\], statistical approaches such as Fisher\-Merging \[[41](https://arxiv.org/html/2606.02606#bib.bib41)\] and RegMean \[[42](https://arxiv.org/html/2606.02606#bib.bib42)\], and search\-based methods such as AdaMerging \[[43](https://arxiv.org/html/2606.02606#bib.bib43)\] and Evolutionary Model Merge \[[44](https://arxiv.org/html/2606.02606#bib.bib44)\]\. These methods are mainly designed to merge task vectors or adapters for multi\-task learning on a relatively static backbone, where the goal is to balance capabilities from multiple tasks\. By contrast, ReLoRA addresses a different service maintenance problem: the base model changes over time, and the objective is to realign an existing task adapter with the evolved backbone\. Rather than merging multiple task adapters, ReLoRA uses Bayesian optimization to identify task\-specific fusion coefficients between the old adapter and the base\-evolution delta, thereby reducing rollout latency while preserving service quality\.
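
The coefficient search can be pictured with the loop below. It is only a sketch under loudly stated assumptions: a seeded uniform random search stands in for Bayesian optimization (no Gaussian-process surrogate or acquisition function is implemented), `proxy_loss` is a hypothetical toy function replacing the validation loss of the fused adapter, and the search bounds are arbitrary.

```python
import random


def proxy_loss(alpha, beta):
    """Hypothetical stand-in for the validation loss of the fused adapter.

    A real pipeline would build the fused adapter for (alpha, beta) and
    evaluate it on held-out task data; this toy bowl has its minimum
    (0.1) at alpha=0.6, beta=0.9.
    """
    return (alpha - 0.6) ** 2 + (beta - 0.9) ** 2 + 0.1


def search_coefficients(loss_fn, n_trials=200, low=0.0, high=1.5, seed=0):
    """Random search over (alpha, beta); a simple stand-in for Bayesian optimization."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = rng.uniform(low, high)
        beta = rng.uniform(low, high)
        loss = loss_fn(alpha, beta)
        if best is None or loss < best[0]:
            best = (loss, alpha, beta)
    return best


best_loss, best_alpha, best_beta = search_coefficients(proxy_loss)
print(f"alpha={best_alpha:.3f} beta={best_beta:.3f} loss={best_loss:.4f}")
```

A Bayesian optimizer would replace the uniform sampling with proposals chosen from a surrogate model of the loss, which matters when each evaluation requires a forward pass over validation data.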

## 6 Discussion and Limitation

### 6\.1 Implications for LLM Service Ecosystems

Analogous to software engineering where operating system updates ideally maintain compatibility with existing applications, the evolution of base LLMs poses a significant backward compatibility challenge\. Currently, downstream consumers bear the full cost of retraining adapters for every upstream update\. ReLoRA addresses this by serving as a bridge between the parameter spaces of the old and new models\. We envision a future ecosystem where providers release compatibility metadata alongside updates, allowing ReLoRA to transform adaptation from an expensive retraining task into a near\-instantaneous patching process\.

### 6\.2 Broader Application Scenarios

While our current evaluation focuses on continual pre\-training, practical LLM evolution encompasses diverse stages, such as Supervised Fine\-Tuning \(SFT\) and Reinforcement Learning from Human Feedback \(RLHF\)\. Although these alignment processes induce parameter shifts that are distinct from knowledge injection, the core principle of ReLoRA remains applicable\. By conceptualizing the parameter difference as a generic evolution vector, our framework can readily adapt to these scenarios or even support quantization\-aware updates\. Consequently, ReLoRA holds the potential to serve as a unified protocol for preserving task\-specific capabilities across the entire model lifecycle\.

### 6\.3 Limitations of Offline Evaluation

Online serving metrics\. Our evaluation focuses on offline adapter re\-adaptation time and service\-quality recovery\. It does not directly measure online SLA violations, request\-level serving latency, or multi\-tenant serving interference during deployment\. These system\-level serving metrics are orthogonal to ReLoRA’s adapter update mechanism and will be explored in future work\.

### 6\.4 Technical Limitations and Future Work

While ReLoRA achieves significant improvements in adapting LoRA adapters for evolving LLMs, we acknowledge several limitations that motivate our future research\.

1\) Coarse\-grained fusion strategy\. Our core fusion mechanism, $\alpha\cdot\Delta\Theta+\beta\cdot\Delta\Theta_{i}$, currently employs global scalar coefficients $\alpha$ and $\beta$\. This design relies on a simplified assumption of uniform interplay between model evolution and task adaptation across all layers\. However, this overlooks the hierarchical nature of LLMs, where different layers or module types \(e\.g\., Attention vs\. FFNs\) may benefit from distinct fusion strategies\. For instance, lower layers processing syntax may require different fusion ratios compared to upper layers handling semantics\. Future work could explore more fine\-grained methods, such as layer\-wise fusion strategies\.

2\) Reliability of the proxy objective\. ReLoRA currently utilizes the initial validation loss as a proxy to guide the Bayesian search\. While computationally efficient, the initial loss is not always a reliable predictor of the final model performance\. For instance, a configuration with slightly higher initial loss might reside in a wider and flatter basin, potentially leading to superior generalization after fine\-tuning\. To address this limitation, we plan to incorporate geometric metrics, such as Hessian\-based approximations, to identify more robust initialization candidates in future work\.
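
The difference between the current global fusion rule and the layer-wise alternative suggested as future work can be sketched as follows. The mapping of symbols is our reading of the notation and should be treated as an assumption: `base_deltas` plays the role of ΔΘ (the base-model evolution delta) and `task_deltas` the role of ΔΘ_i (the deployed adapter's update for task i). Layer names and numbers are made up, and each layer's parameters are flattened into a short list for brevity.

```python
def fuse(delta_base, delta_task, alpha, beta):
    """Elementwise alpha * delta_base + beta * delta_task for one layer."""
    return [alpha * b + beta * t for b, t in zip(delta_base, delta_task)]


def fuse_global(base_deltas, task_deltas, alpha, beta):
    """Current design: one (alpha, beta) pair shared by every layer."""
    return {
        name: fuse(base_deltas[name], task_deltas[name], alpha, beta)
        for name in base_deltas
    }


def fuse_layerwise(base_deltas, task_deltas, coeffs):
    """Hypothetical extension: a separate (alpha, beta) pair per layer."""
    return {
        name: fuse(base_deltas[name], task_deltas[name], *coeffs[name])
        for name in base_deltas
    }


# Toy per-layer deltas (illustrative values only).
base_deltas = {"layer0.attn": [0.5, -1.0], "layer1.ffn": [2.0, 0.0]}
task_deltas = {"layer0.attn": [1.0, 2.0], "layer1.ffn": [-2.0, 4.0]}

global_init = fuse_global(base_deltas, task_deltas, alpha=0.5, beta=1.0)
layer_init = fuse_layerwise(
    base_deltas,
    task_deltas,
    coeffs={"layer0.attn": (0.5, 1.0), "layer1.ffn": (1.0, 0.5)},
)
print(global_init)  # {'layer0.attn': [1.25, 1.5], 'layer1.ffn': [-1.0, 4.0]}
print(layer_init)   # {'layer0.attn': [1.25, 1.5], 'layer1.ffn': [1.0, 2.0]}
```

The layer-wise variant multiplies the number of coefficients to search by the number of layers, so the cost of the coefficient search would grow accordingly.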

## 7 Conclusion

In this paper, we propose ReLoRA, a service maintenance framework for fast rollout of evolving LLM services\. We formulate adapter re\-adaptation as a service rollout and adapter backward\-compatibility problem, where deployed task\-specific adapters must be efficiently updated after base\-model evolution\. To this end, ReLoRA reuses two forms of knowledge: the task\-specific knowledge in the deployed adapter and the evolution knowledge captured by the base\-model parameter delta\. Specifically, ReLoRA constructs a compatibility\-aware starting point through adaptive LoRA initialization, and then applies scheduled regularization to accelerate service\-readiness recovery while preserving final task quality\. Extensive experiments across six downstream service tasks, three model families, and three update sources demonstrate that ReLoRA reduces time\-to\-readiness by up to 8\.9$\times$ and improves service quality by up to 4\.6 percentage points over baselines\. These results show that ReLoRA effectively reduces adapter maintenance overhead and accelerates the rollout of updated LLM services\.

## References

- \[1\] T\. Wu, S\. He, J\. Liu, S\. Sun, K\. Liu, Q\.\-L\. Han, and Y\. Tang, “A brief overview of chatgpt: The history, status quo and potential future development,” *IEEE/CAA Journal of Automatica Sinica*, vol\. 10, no\. 5, pp\. 1122–1136, 2023\.
- \[2\] H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale *et al\.*, “Llama 2: Open foundation and fine\-tuned chat models,” *arXiv preprint arXiv:2307\.09288*, 2023\.
- \[3\] Y\. Cao, S\. Li, Y\. Liu, Z\. Yan, Y\. Dai, P\. S\. Yu, and L\. Sun, “A comprehensive survey of ai\-generated content \(aigc\): A history of generative ai from gan to chatgpt,” *arXiv preprint arXiv:2303\.04226*, 2023\.
- \[4\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen, “Lora: Low\-rank adaptation of large language models,” *arXiv preprint arXiv:2106\.09685*, 2021\.
- \[5\] G\. Team, R\. Anil, S\. Borgeaud, J\.\-B\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican *et al\.*, “Gemini: a family of highly capable multimodal models,” *arXiv preprint arXiv:2312\.11805*, 2023\.
- \[6\] AWS, “Amazon sagemaker,” https://aws\.amazon\.com/cn/sagemaker/\.
- \[7\] “Together AI,” https://www\.together\.ai, 2023, accessed: 2023\-10\-15\.
- \[8\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” *Advances in neural information processing systems*, vol\. 36, pp\. 10 088–10 115, 2023\.
- \[9\] B\. Gao, Z\. He, P\. Sharma, Q\. Kang, D\. Jevdjic, J\. Deng, X\. Yang, Z\. Yu, and P\. Zuo, “Cost\-Efficient large language model serving for multi\-turn conversations with CachedAttention,” in *2024 USENIX Annual Technical Conference \(USENIX ATC 24\)*, 2024, pp\. 111–126\.
- \[10\] AWS, “Amazon ec2 pricing,” https://aws\.amazon\.com/ec2/pricing/\.
- \[11\] R\. M\. S\. Khan, P\. Li, S\. Yun, Z\. Wang, S\. Nirjon, C\.\-W\. Wong, and T\. Chen, “Portllm: Personalizing evolving large language models with training\-free and portable model patches,” *arXiv preprint arXiv:2410\.10870*, 2024\.
- \[12\] R\. M\. S\. Khan, D\. Tang, P\. Li, K\. Wang, and T\. Chen, “Oral: Prompting your large\-scale loras via conditional recurrent diffusion,” *arXiv preprint arXiv:2503\.24354*, 2025\.
- \[13\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan *et al\.*, “The llama 3 herd of models,” *arXiv preprint arXiv:2407\.21783*, 2024\.
- \[14\] Y\. Qin, C\. Qian, X\. Han, Y\. Lin, H\. Wang, R\. Xie, Z\. Liu, M\. Sun, and J\. Zhou, “Recyclable tuning for continual pre\-training,” *arXiv preprint arXiv:2305\.08702*, 2023\.
- \[15\] Y\. Qin, J\. Zhang, Y\. Lin, Z\. Liu, P\. Li, M\. Sun, and J\. Zhou, “Elle: Efficient lifelong pre\-training for emerging data,” *arXiv preprint arXiv:2203\.06311*, 2022\.
- \[16\] K\. Lo, L\. L\. Wang, M\. Neumann, R\. Kinney, and D\. S\. Weld, “S2orc: The semantic scholar open research corpus,” *arXiv preprint arXiv:1911\.02782*, 2019\.
- \[17\] R\. He and J\. McAuley, “Ups and downs: Modeling the visual evolution of fashion trends with one\-class collaborative filtering,” in *proceedings of the 25th international conference on world wide web*, 2016, pp\. 507–517\.
- \[18\] R\. Zellers, A\. Holtzman, H\. Rashkin, Y\. Bisk, A\. Farhadi, F\. Roesner, and Y\. Choi, “Defending against neural fake news,” *Advances in neural information processing systems*, vol\. 32, 2019\.
- \[19\] D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt, “Measuring massive multitask language understanding,” *arXiv preprint arXiv:2009\.03300*, 2020\.
- \[20\] S\. Gururangan, A\. Marasović, S\. Swayamdipta, K\. Lo, I\. Beltagy, D\. Downey, and N\. A\. Smith, “Don’t stop pretraining: Adapt language models to domains and tasks,” *arXiv preprint arXiv:2004\.10964*, 2020\.
- \[21\] T\. Joachims *et al\.*, “A probabilistic analysis of the rocchio algorithm with tfidf for text categorization,” in *ICML*, vol\. 97\. Citeseer, 1997, pp\. 143–151\.
- \[22\] W\. Lian, B\. Goodson, E\. Pentland, A\. Cook, C\. Vong, and “Teknium”, “Openorca: An open dataset of gpt augmented flan reasoning traces,” 2023\.
- \[23\] P\. I\. Frazier, “A tutorial on bayesian optimization,” *arXiv preprint arXiv:1807\.02811*, 2018\.
- \[24\] D\. Liu, Z\. Wang, B\. Wang, W\. Chen, C\. Li, Z\. Tu, D\. Chu, B\. Li, and D\. Sui, “Checkpoint merging via bayesian optimization in llm pretraining,” *arXiv preprint arXiv:2403\.19390*, 2024\.
- \[25\] M\. Seeger, “Gaussian processes for machine learning,” *International journal of neural systems*, vol\. 14, no\. 02, pp\. 69–106, 2004\.
- \[26\] I\. Loshchilov and F\. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” *arXiv preprint arXiv:1608\.03983*, 2016\.
- \[27\] A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier *et al\.*, “Mistral 7b,” *arXiv preprint arXiv:2310\.06825*, 2023\.
- \[28\] A\. Wang, “Glue: A multi\-task benchmark and analysis platform for natural language understanding,” *arXiv preprint arXiv:1804\.07461*, 2018\.
- \[29\] X\. Zhang, J\. Zhao, and Y\. LeCun, “Character\-level convolutional networks for text classification,” *Advances in neural information processing systems*, vol\. 28, 2015\.
- \[30\] S\. R\. Bowman, G\. Angeli, C\. Potts, and C\. D\. Manning, “A large annotated corpus for learning natural language inference,” *arXiv preprint arXiv:1508\.05326*, 2015\.
- \[31\] B\. Peng, C\. Li, P\. He, M\. Galley, and J\. Gao, “Instruction tuning with gpt\-4,” *arXiv preprint arXiv:2304\.03277*, 2023\.
- \[32\] A\. N\. Lee, C\. J\. Hunter, and N\. Ruiz, “Platypus: Quick, cheap, and powerful refinement of llms,” *arXiv preprint arXiv:2308\.07317*, 2023\.
- \[33\] Z\. Zhang, Y\. Yang, Y\. Dai, Q\. Wang, Y\. Yu, L\. Qu, and Z\. Xu, “Fedpetuning: When federated learning meets the parameter\-efficient tuning methods of pre\-trained language models,” in *Annual Meeting of the Association of Computational Linguistics 2023*\. Association for Computational Linguistics \(ACL\), 2023, pp\. 9963–9977\.
- \[34\] Z\. Yao, J\. Liu, H\. Xu, L\. Wang, C\. Qian, and Y\. Liao, “Ferrari: A personalized federated learning framework for heterogeneous edge clients,” *IEEE Transactions on Mobile Computing*, 2024\.
- \[35\] S\.\-Y\. Liu, C\.\-Y\. Wang, H\. Yin, P\. Molchanov, Y\.\-C\. F\. Wang, K\.\-T\. Cheng, and M\.\-H\. Chen, “Dora: Weight\-decomposed low\-rank adaptation,” in *Forty\-first International Conference on Machine Learning*, 2024\.
- \[36\] V\. Lialin, V\. Deshpande, and A\. Rumshisky, “Scaling down to scale up: A guide to parameter\-efficient fine\-tuning,” *arXiv preprint arXiv:2303\.15647*, 2023\.
- \[37\] N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly, “Parameter\-efficient transfer learning for nlp,” in *International conference on machine learning*\. PMLR, 2019, pp\. 2790–2799\.
- \[38\] X\. Jin, K\. Wang, D\. Tang, W\. Zhao, Y\. Zhou, J\. Tang, and Y\. You, “Conditional lora parameter generation,” *arXiv preprint arXiv:2408\.01415*, 2024\.
- \[39\] E\. Yang, L\. Shen, G\. Guo, X\. Wang, X\. Cao, J\. Zhang, and D\. Tao, “Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities,” *arXiv preprint arXiv:2408\.07666*, 2024\.
- \[40\] G\. Ilharco, M\. T\. Ribeiro, M\. Wortsman, S\. Gururangan, L\. Schmidt, H\. Hajishirzi, and A\. Farhadi, “Editing models with task arithmetic,” *arXiv preprint arXiv:2212\.04089*, 2022\.
- \[41\] M\. S\. Matena and C\. A\. Raffel, “Merging models with fisher\-weighted averaging,” *Advances in Neural Information Processing Systems*, vol\. 35, pp\. 17 703–17 716, 2022\.
- \[42\] X\. Jin, X\. Ren, D\. Preotiuc\-Pietro, and P\. Cheng, “Dataless knowledge fusion by merging weights of language models,” *arXiv preprint arXiv:2212\.09849*, 2022\.
- \[43\] E\. Yang, Z\. Wang, L\. Shen, S\. Liu, G\. Guo, X\. Wang, and D\. Tao, “Adamerging: Adaptive model merging for multi\-task learning,” *arXiv preprint arXiv:2310\.02575*, 2023\.
- \[44\] T\. Akiba, M\. Shing, Y\. Tang, Q\. Sun, and D\. Ha, “Evolutionary optimization of model merging recipes,” *Nature Machine Intelligence*, vol\. 7, no\. 2, pp\. 195–204, 2025\.