
# ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
Source: [https://arxiv.org/html/2604.19254](https://arxiv.org/html/2604.19254)
###### Abstract

Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly into individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly to produce progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.

![Refer to caption](https://arxiv.org/html/2604.19254v1/x2.png)

Figure 1: (Left) Conventional LoRA configuration. (Right) Our proposed ShadowPEFT.

## 1 Introduction

Parameter-efficient fine-tuning (PEFT) alleviates the high training cost of full-parameter fine-tuning and provides a practical way to adapt large language models (LLMs) to various downstream applications (Han et al., [2024](https://arxiv.org/html/2604.19254#bib.bib35)). Representative PEFT approaches include prompt- and prefix-based methods (Li and Liang, [2021](https://arxiv.org/html/2604.19254#bib.bib2)), adapter modules inserted into Transformer blocks (Houlsby et al., [2019](https://arxiv.org/html/2604.19254#bib.bib1)), and low-rank weight adaptation methods such as LoRA (Hu et al., [2022](https://arxiv.org/html/2604.19254#bib.bib3)) and its variants (e.g., QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2604.19254#bib.bib5)), DoRA (Liu et al., [2024](https://arxiv.org/html/2604.19254#bib.bib4)), inter alia).

Among these methods, LoRA-style PEFT has become the dominant practical choice due to its simplicity, effectiveness, and compatibility with existing LLM training pipelines. LoRA injects trainable low-rank updates into selected linear projections while keeping the pretrained weights frozen. Despite its empirical success, LoRA adopts a fundamentally linear-local parameterization: each selected linear layer receives its own trainable update, and task adaptation emerges from the aggregate effect of many independent perturbations distributed across depth. Although these modules are optimized jointly, the adaptation mechanism itself remains fragmented, since each linear layer learns a separate transformation without explicitly sharing an adaptation state or function across the network. Moreover, this fragmented adaptation is tied closely to the backbone's internal weight structure and cannot be decoupled from the backbone model.

In this work, we explore an alternative PEFT design in which adaptation is centralized in a shared functional module that operates on the hidden representations of transformer layers. We propose ShadowPEFT, which augments a frozen backbone with a lightweight centralized shadow network (architecturally similar to the base model but scaled down) that is reused across transformer layers. The shadow network maintains a parallel hidden state, iteratively updates it across depth, and produces additive corrections to the backbone activations. In contrast to learning a collection of linear weight perturbations, ShadowPEFT performs transformer layer-level refinement with cross-layer parameter sharing. In this way, ShadowPEFT shifts the locus of adaptation from decentralized linear-level perturbations to centralized layer-level refinement, treating the PEFT process as learning a portable functional overlay rather than altering the backbone parameters.

Since the shadow module is architecturally decoupled from the backbone, it can be trained, stored, and deployed as a standalone component, benefiting edge computing. This enables two appealing properties that are difficult to obtain with standard LoRA-style PEFT. First, the shadow can be attached or detached without modifying the frozen backbone weights, enabling modular deployment and independent versioning of adaptation modules. Second, the shadow model can be initialized from a smaller pretrained model, allowing a compact model to serve as a reusable adaptation module for a larger backbone. For example, a smaller model such as Qwen-0.5B can serve as the shadow model for a larger backbone like Qwen-8B. In this configuration, the shadow model's adaptation capacity can be reused across model scales. This perspective expands PEFT beyond lightweight parameter injection toward reusable, cross-scale adaptation dynamics. We study both randomly initialized and pretrained shadows, and show that pretraining substantially improves both attached and detached performance.

We evaluate ShadowPEFT on multiple standard benchmarks spanning both generation and understanding tasks, including MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2604.19254#bib.bib17)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2604.19254#bib.bib8)), and SQuAD v2 (Rajpurkar et al., [2018](https://arxiv.org/html/2604.19254#bib.bib9)). ShadowPEFT achieves competitive or improved performance relative to LoRA and DoRA under comparable, and in some cases slightly smaller, trainable-parameter budgets. We further study shadow pretraining, detached shadow-only inference, out-of-distribution transfer, parameter scaling, efficiency, and system-level robot intent evaluation. These results suggest that centralized ShadowPEFT is a viable and flexible alternative to conventional low-rank adaptation. Our contributions are summarized as follows:

1. We present ShadowPEFT, a PEFT framework that leverages a shared shadow module integrated at the transformer layer level rather than performing independent adaptation at individual linear layers. We also introduce a stateful shadow mechanism that maintains and updates a parallel hidden representation across transformer layers (depth) and uses it to refine the frozen backbone hidden states.

2. We show that the shadow module is modular and portable: it can be attached to or detached from the base backbone and can be initialized from a smaller pretrained model for cross-scale adaptation.

3. Across multiple benchmarks and backbone scales, ShadowPEFT achieves competitive or improved performance relative to LoRA and DoRA under comparable parameter budgets, while also enabling additional deployment modes unavailable to standard low-rank PEFT.

## 2 Related Work

The rapid scaling of LLMs to hundreds of billions of parameters has rendered full fine-tuning increasingly impractical for multiple downstream tasks. An alternative to weight updates is prompt engineering, including few-shot prompting (Liu et al., [2022a](https://arxiv.org/html/2604.19254#bib.bib25)) and Chain-of-Thought reasoning (Wei et al., [2022](https://arxiv.org/html/2604.19254#bib.bib36)), which injects task instructions directly into the input without modifying model parameters. While highly parameter-efficient, these approaches are limited by context length and lack persistent task-specific adaptation. These constraints have motivated extensive research into parameter-efficient adaptation, which aims to fine-tune pretrained models by updating only a small subset of parameters while keeping the backbone frozen (Xu et al., [2026](https://arxiv.org/html/2604.19254#bib.bib19)).

Early parametric adaptation attempts focus on soft prompt-based PEFT methods. Such methods, like Prompt Tuning (Lester et al., [2021](https://arxiv.org/html/2604.19254#bib.bib20)), Prefix Tuning (Li and Liang, [2021](https://arxiv.org/html/2604.19254#bib.bib2)), and P-Tuning (Liu et al., [2022b](https://arxiv.org/html/2604.19254#bib.bib21)), optimize a small set of continuous prompt parameters while leaving the backbone unchanged. However, the expressive capacity of prompt-based PEFT is fundamentally constrained by the limited dimensionality of the prompt vectors. Later, adapter-based methods gain attention because of their compatibility with LLMs' stacked-layer architecture. Adapter-based approaches insert lightweight bottleneck modules into the Transformer backbone while freezing the backbone weights (Houlsby et al., [2019](https://arxiv.org/html/2604.19254#bib.bib1)). Subsequent extensions improve modularity and transferability, including AdapterSoup (Chronopoulou et al., [2023](https://arxiv.org/html/2604.19254#bib.bib22)), Tiny-attention adapters (Zhao et al., [2022](https://arxiv.org/html/2604.19254#bib.bib23)), and Compacter (Karimi Mahabadi et al., [2021](https://arxiv.org/html/2604.19254#bib.bib24)). Although effective and modular, adapters remain independently optimized: each layer learns task-specific transformations without explicit cross-layer coordination, potentially introducing redundancy and inconsistent depth-wise adaptation. While superficially similar to shared adapters, ShadowPEFT differs fundamentally in that it maintains a persistent state that evolves across layers, enabling global coordination and iterative refinement.

Low-rank adaptation has emerged as one of the most influential PEFT paradigms. Instead of updating the full weight matrix, LoRA (Hu et al., [2022](https://arxiv.org/html/2604.19254#bib.bib3)) learns low-rank matrices and injects the update in parallel to the frozen pretrained weight. This formulation replaces full weight updates with a constrained low-dimensional subspace update, significantly reducing trainable parameters while preserving performance. Subsequent extensions explore rank adaptivity (AdaLoRA (Zhang et al., [2023](https://arxiv.org/html/2604.19254#bib.bib26)), DyLoRA (Valipour et al., [2023](https://arxiv.org/html/2604.19254#bib.bib27)), LoRA-GA (Wang et al., [2024](https://arxiv.org/html/2604.19254#bib.bib29))), quantization (QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2604.19254#bib.bib5)), QA-LoRA (Xu et al., [2023](https://arxiv.org/html/2604.19254#bib.bib28))), and multi-task composition (visual tuning (Che et al., [2026](https://arxiv.org/html/2604.19254#bib.bib31)), MoELoRA (Li et al., [2024](https://arxiv.org/html/2604.19254#bib.bib30)), MTL-LoRA (Yang et al., [2025](https://arxiv.org/html/2604.19254#bib.bib33)), and LLM safety (Hsu et al., [2024](https://arxiv.org/html/2604.19254#bib.bib32))). Despite their effectiveness, LoRA-based approaches still exhibit a decentralized parameterization: low-rank modules are inserted independently into selected linear weights across layers, and each layer learns its own update without explicit cross-layer coordination. Consequently, adaptation remains structurally fragmented, potentially leading to inconsistent representation shifts across depth. Different from them, the centralized parameter design of ShadowPEFT can alleviate this issue.
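
To make the contrast with ShadowPEFT concrete, the LoRA parameterization described above can be sketched in a few lines of NumPy; the dimensions and initialization scale here are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                            # hidden size and LoRA rank (toy values)
W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.normal(0.0, 0.01, size=(d, r))  # trainable down-projection
B = np.zeros((r, d))                    # trainable up-projection, zero-initialized

x = rng.standard_normal((1, d))
h = x @ W + x @ A @ B                   # frozen path plus parallel low-rank update

# With B = 0, the adapted forward pass equals the frozen one at initialization.
assert np.allclose(h, x @ W)
# Trainable parameters per adapted d x d matrix: r*(d + d) = 1024 vs d*d = 4096.
```

Each adapted projection carries its own independent `(A, B)` pair, which is exactly the decentralized parameterization that ShadowPEFT replaces with a single shared module.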

## 3 Our ShadowPEFT Framework

We introduce *ShadowPEFT*, a PEFT framework that adapts a frozen pretrained LLM (the *base model*) with a centralized *shadow model* operating on Transformer decoder layers rather than on individual linear layers. Unlike LoRA-style approaches, which distribute independent trainable perturbations across linear weights (Figure [1](https://arxiv.org/html/2604.19254#S0.F1) (left)), ShadowPEFT centralizes adaptation in a shadow model that is reused across depth. The key idea is to maintain a parallel *shadow state* that evolves alongside the frozen backbone hidden states and provides task-adaptive refinement signals at each LLM layer (Figure [1](https://arxiv.org/html/2604.19254#S0.F1) (right)).

### 3.1 ShadowPEFT Overview

Let the frozen base consist of $L$ Transformer decoder layers. Given an input sequence $\mathbf{x}$, let $\mathbf{h}_{out}^{(\ell)} \in \mathbb{R}^{T \times d}$ denote the hidden state of the $\ell$-th LLM decoder layer (base layer), where $T$ is the sequence length and $d$ is the hidden dimension. The *shadow state* $\mathbf{s}^{(\ell)} \in \mathbb{R}^{T \times d}$ serves as a depth-shared, task-adaptive reference trajectory. The initial shadow state $\mathbf{s}^{(0)}$ is produced by a *shadow backbone* $f_{\text{shadow}}$:

$$\mathbf{s}^{(0)} = f_{\text{shadow}}\left(\mathbf{x};\, \theta_{\text{shadow}}\right). \tag{1}$$

For each base layer $\ell \geq 1$, ShadowPEFT performs three steps: (1) *Shadow Injection.* The shadow pathway refines the previous base hidden state $\mathbf{h}_{out}^{(\ell-1)}$ using the current shadow state $\mathbf{s}^{(\ell-1)}$. (2) *Base Encoding.* The frozen base layer $f_{\text{base}}^{(\ell)}$ processes the refined hidden state to produce $\mathbf{h}_{out}^{(\ell)}$. (3) *Shadow Update.* The shadow model advances the shadow state $\mathbf{s}^{(\ell-1)} \to \mathbf{s}^{(\ell)}$ using the newly obtained base hidden representation $\mathbf{h}_{out}^{(\ell)}$.

The base layer at $\ell = 0$ remains unchanged; injection and update begin at $\ell = 1$, resulting in $L - 1$ refinement steps in total. Under this view, ShadowPEFT can be interpreted as a depth-wise state-space adaptation process: instead of injecting separate local weight perturbations into each layer, the model learns a single portable adaptation pathway that is reused across depth. Additional architectural designs of the centralized shadow model are provided in Appendix [A](https://arxiv.org/html/2604.19254#A1).
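
The per-layer control flow above can be sketched as a simple loop. The base layers and shadow functions below are hypothetical stand-ins (a tanh layer and fixed-coefficient placeholders); the real modules are frozen transformer blocks and the learned injection and gated-update networks of Sections 3.2 and 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 4, 16, 3   # sequence length, hidden size, number of base layers (toy values)

def make_layer(W):
    # stand-in for a frozen transformer decoder layer
    return lambda h: np.tanh(h @ W)

base_layers = [make_layer(rng.standard_normal((d, d)) * 0.1) for _ in range(L)]
f_shadow = np.tanh                               # stand-in shadow backbone for s^(0)
inject = lambda h, s: h + 0.1 * (h - s)          # placeholder for Eqs. (2)-(4)
update = lambda s, h_out: 0.5 * s + 0.5 * h_out  # placeholder for Eqs. (5)-(7)

x = rng.standard_normal((T, d))
s = f_shadow(x)                # initial shadow state s^(0)
h = base_layers[0](x)          # layer 0 runs unchanged
for layer in base_layers[1:]:  # L - 1 refinement steps
    h = inject(h, s)           # (1) shadow injection
    h = layer(h)               # (2) frozen base encoding
    s = update(s, h)           # (3) shadow update
```

Note that the single `inject`/`update` pair is reused at every depth, which is the centralized, depth-shared aspect of the design.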

![Refer to caption](https://arxiv.org/html/2604.19254v1/x3.png)

Figure 2: Architecture of ShadowPEFT. (a) *Shadow Injection Module.* The discrepancy $\boldsymbol{\delta}^{(\ell)}$ is projected through a low-rank bottleneck ($W_{\mathrm{down}} \sim \mathcal{N}(0, \sigma^2)$, $W_{\mathrm{up}} = 0$) and added back to the base hidden state. (b) *Base Encoding Module.* The frozen base layer encodes the refined representation. (c) *Shadow Update Module.* The base output $\mathbf{h}_{out}^{(\ell)}$ is LayerNorm-normalized, then split into a transform $W_t$ and a sigmoid gate $\sigma(W_g)$; their product updates the shadow state via a gated residual.

### 3.2 Shadow Injection Module

Before base layer $\ell \geq 1$ processes its input, the Shadow Injection Module modulates the incoming hidden state $\mathbf{h}_{out}^{(\ell-1)}$ using the current shadow state $\mathbf{s}^{(\ell-1)}$, as illustrated in Figure [2](https://arxiv.org/html/2604.19254#S3.F2) (a). This step is the point where the centralized shadow pathway directly influences the frozen backbone.

Since the shadow has been initialized from the input and progressively updated across previous layers, it carries task-relevant information that is shared across depth. Rather than learning an independent weight perturbation inside each backbone block, ShadowPEFT derives adaptation from the discrepancy between the current backbone representation and this evolving shadow reference.

Formally, at base layer $\ell$, we measure the representational difference between $\mathbf{h}_{out}^{(\ell-1)}$ and $\mathbf{s}^{(\ell-1)}$ (Eq. ([2](https://arxiv.org/html/2604.19254#S3.E2))). This discrepancy measures how far the current backbone state deviates from the shadow-guided reference at the same depth. We then map this discrepancy through a lightweight trainable bottleneck to obtain a task-specific correction $\tilde{\boldsymbol{\delta}}^{(\ell)}$ (Eq. ([3](https://arxiv.org/html/2604.19254#S3.E3))):

$$\boldsymbol{\delta}^{(\ell)} = \mathbf{h}_{out}^{(\ell-1)} - \mathbf{s}^{(\ell-1)}, \tag{2}$$

$$\tilde{\boldsymbol{\delta}}^{(\ell)} = \operatorname{Dropout}\left(\boldsymbol{\delta}^{(\ell)}\, \mathbf{W}_{\mathrm{down}}^{(\ell)}\right) \mathbf{W}_{\mathrm{up}}^{(\ell)}, \tag{3}$$

where $\mathbf{W}_{\mathrm{down}}^{(\ell)} \in \mathbb{R}^{d \times r}$ and $\mathbf{W}_{\mathrm{up}}^{(\ell)} \in \mathbb{R}^{r \times d}$ are layer-dependent, trainable projection matrices with rank $r \ll d$.¹ This low-rank bottleneck serves two purposes. First, it preserves parameter efficiency, making the injection module comparable in cost to existing low-rank PEFT methods. Second, it prevents the model from simply copying the raw discrepancy back into the backbone. Instead, the module learns which components of the base-shadow difference are useful for adaptation, so the injected signal functions as a filtered refinement rather than a direct overwrite.

¹ Inspired by LoRA (Hu et al., [2022](https://arxiv.org/html/2604.19254#bib.bib3)), $\mathbf{W}_{\mathrm{down}}^{(\ell)}$ is initialized from $\mathcal{N}(0, \sigma^2)$ and $\mathbf{W}_{\mathrm{up}}^{(\ell)}$ is zero-initialized. This guarantees that $\tilde{\boldsymbol{\delta}}^{(\ell)} = \mathbf{0}$ at initialization, so the frozen base model remains unchanged at the start of training.

The corrected representation $\mathbf{h}^{(\ell)}$ is then obtained via a residual connection based on $\tilde{\boldsymbol{\delta}}^{(\ell)}$:

$$\mathbf{h}^{(\ell)} \leftarrow \mathbf{h}_{out}^{(\ell-1)} + \alpha\, \tilde{\boldsymbol{\delta}}^{(\ell)}, \tag{4}$$

where $\alpha > 0$ controls the overall injection strength in the residual connection. $\mathbf{h}^{(\ell)}$ is then encoded by base layer $\ell$ to produce $\mathbf{h}_{out}^{(\ell)}$, as shown in Figure [2](https://arxiv.org/html/2604.19254#S3.F2) (b).
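
Eqs. (2)-(4) can be sketched directly in NumPy. The shapes, $\alpha$, and $\sigma$ below are assumed toy values; dropout is omitted as it would be at inference:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, r = 4, 32, 4        # sequence length, hidden size, bottleneck rank (toy values)
alpha, sigma = 1.0, 0.02  # injection strength and init scale (assumed values)

# LoRA-style initialization from the footnote: W_down ~ N(0, sigma^2), W_up = 0.
W_down = rng.normal(0.0, sigma, size=(d, r))
W_up = np.zeros((r, d))

def shadow_inject(h_prev, s_prev):
    delta = h_prev - s_prev                # Eq. (2): base-shadow discrepancy
    delta_tilde = (delta @ W_down) @ W_up  # Eq. (3); dropout omitted at inference
    return h_prev + alpha * delta_tilde    # Eq. (4): residual correction

h_prev = rng.standard_normal((T, d))
s_prev = rng.standard_normal((T, d))
h = shadow_inject(h_prev, s_prev)

# Zero-initialized W_up makes the correction vanish, leaving the base state intact.
assert np.allclose(h, h_prev)
```

The final assertion checks the footnote's identity-at-initialization property: until training moves `W_up` away from zero, the frozen backbone behaves exactly as it did before attachment.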

### 3.3 Shadow Update Module

After base layer $\ell$ outputs hidden states $\mathbf{h}_{out}^{(\ell)}$, the Shadow Update Module, as illustrated in Figure [2](https://arxiv.org/html/2604.19254#S3.F2) (c), advances the shadow state $\mathbf{s}^{(\ell-1)} \to \mathbf{s}^{(\ell)}$. In the injection step, the model uses the base-shadow discrepancy to compute a correction. For this comparison to remain meaningful across depth, the shadow state should evolve together with the base model rather than remain fixed. The purpose of this update step is to allow the shadow to absorb new information from the base model while preserving its own accumulated task-specific context, so that it can serve as a stable and task-adaptive reference for subsequent layers.

To achieve this, we use a gated residual update:

$$\mathbf{t}^{(\ell)} = T^{(\ell)}\left(\mathbf{h}_{out}^{(\ell)}\right), \tag{5}$$

$$\mathbf{g}^{(\ell)} = \sigma\left(G^{(\ell)}\left(\mathbf{h}_{out}^{(\ell)}\right)\right), \tag{6}$$

$$\mathbf{s}^{(\ell)} = \left(1 - \mathbf{g}^{(\ell)}\right) \odot \mathbf{s}^{(\ell-1)} + \mathbf{g}^{(\ell)} \odot \mathbf{t}^{(\ell)}, \tag{7}$$

where $\odot$ denotes element-wise multiplication and $\sigma$ is the sigmoid function. Here $T^{(\ell)}(\cdot)$ maps the current base output into a candidate shadow representation, while $G^{(\ell)}(\cdot)$ predicts an element-wise gate controlling how strongly the shadow should move toward this candidate. The new shadow state is an interpolation between the previous shadow state and this candidate. The implementations of $T^{(\ell)}(\cdot)$ and $G^{(\ell)}(\cdot)$ are detailed in Appendix [D](https://arxiv.org/html/2604.19254#A4).

The gated update is a natural mechanism for controlling how much information should be retained from earlier layers versus incorporated from the current layer. This GRU-style (Cho et al., [2014](https://arxiv.org/html/2604.19254#bib.bib37)) design helps prevent shadow collapse and improves optimization stability.
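
A minimal sketch of Eqs. (5)-(7), using plain linear maps as stand-ins for $T^{(\ell)}$ and $G^{(\ell)}$ (the paper's actual implementations are in Appendix D, and the LayerNorm on $\mathbf{h}_{out}^{(\ell)}$ from Figure 2 is omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
seq_len, d = 4, 16  # toy dimensions
W_t = rng.standard_normal((d, d)) * 0.1  # stand-in for the transform T^(l)
W_g = rng.standard_normal((d, d)) * 0.1  # stand-in for the gate network G^(l)

def shadow_update(s_prev, h_out):
    t = h_out @ W_t                   # Eq. (5): candidate shadow representation
    g = sigmoid(h_out @ W_g)          # Eq. (6): element-wise gate in (0, 1)
    return (1 - g) * s_prev + g * t   # Eq. (7): gated residual interpolation

s_prev = rng.standard_normal((seq_len, d))
h_out = rng.standard_normal((seq_len, d))
s_new = shadow_update(s_prev, h_out)
assert s_new.shape == s_prev.shape
```

Because the gate is sigmoid-bounded, each coordinate of the new shadow state lies between the corresponding coordinates of the previous state and the candidate, which is what makes the update a controlled interpolation rather than an overwrite.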

### 3.4 Base-Shadow Joint Training

ShadowPEFT is trained end-to-end with a joint loss while keeping the base model frozen. The trainable components include the shadow backbone, the injection projections, the update networks, and the shadow prediction head.

#### Causal language modeling objective.

For causal language modeling, given shifted target labels $\mathbf{y}$, the training loss is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\left(\mathbf{h}_{\mathrm{base}}\, \mathbf{W}_{\mathrm{lm}},\; \mathbf{y}\right) + \lambda\, \mathcal{L}_{\mathrm{CE}}\left(\mathbf{s}^{(0)}\, \mathbf{W}_{\mathrm{shadow}},\; \mathbf{y}\right), \tag{8}$$

where $\mathbf{h}_{\mathrm{base}}$ is the final hidden state of the injected base model, $\mathbf{W}_{\mathrm{lm}}$ and $\mathbf{W}_{\mathrm{shadow}}$ are the base and shadow language-model heads, $\mathcal{L}_{\mathrm{CE}}$ is the standard next-token cross-entropy loss, and $\lambda$ is a scalar hyperparameter (default $\lambda = 0.05$).
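
Eq. (8) is an ordinary weighted sum of two cross-entropy terms; a self-contained NumPy sketch with toy dimensions (the random "hidden states" here merely stand in for real model outputs):

```python
import numpy as np

def cross_entropy(logits, y):
    # mean next-token cross-entropy via a numerically stable log-softmax
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
T, d, V = 5, 8, 11   # sequence length, hidden size, vocab size (toy values)
lam = 0.05           # the paper's default lambda

h_base = rng.standard_normal((T, d))    # final hidden state of the injected base
s0 = rng.standard_normal((T, d))        # initial shadow state s^(0)
W_lm = rng.standard_normal((d, V))      # base language-model head
W_shadow = rng.standard_normal((d, V))  # shadow language-model head
y = rng.integers(0, V, size=T)          # shifted target labels

# Eq. (8): base CE loss plus lambda-weighted shadow CE loss.
loss = cross_entropy(h_base @ W_lm, y) + lam * cross_entropy(s0 @ W_shadow, y)
assert np.isfinite(loss) and loss > 0
```

Only the shadow-side parameters receive gradients in practice, since the base backbone and its LM head stay frozen.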

#### Sequence classification objective.

For sequence classification with $C$ classes, following Li et al. ([2026](https://arxiv.org/html/2604.19254#bib.bib34)), the shadow hidden state $\mathbf{s}^{(L)} \in \mathbb{R}^{B \times T \times d}$ is pooled into a single vector $\bar{\mathbf{s}}^{(L)}$. This pooled representation is then scored by a shadow classifier head $\mathbf{W}_{\mathrm{cls}}^{\mathrm{shadow}} \in \mathbb{R}^{d \times C}$, initialized as a copy of the base model's classifier head. The joint classification loss is:

$$\mathcal{L}_{\mathrm{CLS}} = \mathcal{L}_{\mathrm{CE}}\left(\mathbf{W}_{\mathrm{cls}}\, \mathbf{h}_{\mathrm{base}},\; y\right) + \lambda\, \mathcal{L}_{\mathrm{CE}}\left(\mathbf{W}_{\mathrm{cls}}^{\mathrm{shadow}}\, \bar{\mathbf{s}}^{(L)},\; y\right), \tag{9}$$

where $y \in \{1, \ldots, C\}$ is the ground-truth class label.

The shadow loss can be interpreted as a regularizer, as detailed in Appendix [E](https://arxiv.org/html/2604.19254#A5). At inference time, ShadowPEFT supports two deployment configurations:

1) *Shadow-attached inference.* The full ShadowPEFT pipeline is executed. The primary prediction is produced by the shadow-attached base model, while the shadow head may optionally provide an auxiliary output.

2) *Detached shadow-only inference.* Only the shadow backbone and its prediction head are used. This mode bypasses the large base model and enables lightweight deployment, while still benefiting from the task-adaptive representations learned during fine-tuning.

These two inference modes highlight a distinctive property of ShadowPEFT: the adaptation module is not merely a set of attached perturbation weights, but a standalone functional component that can be trained jointly with the base model and deployed either together with or separately from it.

## 4 Experiment

### 4.1 Experimental Setup

#### Models and baselines.

We evaluate ShadowPEFT on three backbone scales from the Qwen3 family: 0.6B, 4B, and 8B parameters. We compare against two widely used low-rank PEFT baselines: LoRA (Hu et al., [2022](https://arxiv.org/html/2604.19254#bib.bib3)) and DoRA (Liu et al., [2024](https://arxiv.org/html/2604.19254#bib.bib4)). All methods are evaluated under a comparable trainable-parameter budget. Notably, ShadowPEFT uses slightly *fewer* trainable parameters than both LoRA and DoRA.

#### Benchmarks.

We evaluate on five benchmarks spanning two categories. *Generation tasks:* MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2604.19254#bib.bib17)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2604.19254#bib.bib8)), and SQuAD v2 (Rajpurkar et al., [2018](https://arxiv.org/html/2604.19254#bib.bib9)). *Understanding tasks:* Amazon review sentiment (Keung et al., [2020](https://arxiv.org/html/2604.19254#bib.bib10)) (Amazon) and 20 Newsgroups (Mitchell, [1997](https://arxiv.org/html/2604.19254#bib.bib16)) (20News) classification. For fair comparison, we use the same prompt template for each dataset across all methods. The detailed prompts are provided in Appendix [C](https://arxiv.org/html/2604.19254#A3). We report per-task performance as well as the average score across all five benchmarks.

#### Implementation details.

We run all experiments on two NVIDIA A800 GPUs. For LoRA and DoRA, we follow common practice and set the rank $r = 32$, scaling factor $\alpha = 32$, and dropout rate $0.05$. The learning rate is selected via grid search over the range $1\mathrm{e}{-5}$ to $1\mathrm{e}{-3}$. To maintain similar numbers of trainable parameters across different methods, we modify the number of attention heads and attention layers in the centralized shadow model according to each backbone size. In addition, we adjust the injection size for the shadow injection module and the projection size for the shadow update module. Full configuration details are provided in the released code repository.
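
For intuition about how such trainable-parameter budgets are matched, a back-of-the-envelope LoRA accounting can be sketched as follows; the hidden size, number of adapted projections, and depth are assumptions for illustration only, not the paper's exact configuration:

```python
# Back-of-the-envelope trainable-parameter count for a LoRA baseline with r = 32.
d = 4096               # hidden size (assumed)
r = 32                 # LoRA rank from the paper's baseline setting
matrices_per_layer = 7 # e.g., q/k/v/o plus MLP projections (assumed)
num_layers = 36        # assumed backbone depth

# Each adapted d x d matrix contributes r * (d_in + d_out) trainable parameters.
lora_params = num_layers * matrices_per_layer * r * (d + d)
print(f"~{lora_params / 1e6:.1f}M trainable parameters")
```

ShadowPEFT's budget is instead dominated by the shared shadow backbone plus the per-layer injection and update projections, which is why the paper tunes the shadow's head/layer counts and projection sizes to land at a comparable total.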

### 4.2 Main Results

Table [1](https://arxiv.org/html/2604.19254#S4.T1) summarizes the performance of PEFT methods across the three model sizes. We highlight the following findings. ShadowPEFT achieves the best average performance at all three backbone scales and is competitive on most individual benchmarks, despite using slightly fewer trainable parameters than LoRA and DoRA. Specifically, on Qwen3 0.6B, it scores 62.27 vs. 61.81 (LoRA) and 62.08 (DoRA); the gap widens as models grow, reaching 75.43 vs. 74.55/74.85 on Qwen3 4B and 76.92 vs. 76.51/75.99 on Qwen3 8B. The gains on individual datasets are generally modest but consistent, indicating that centralized layer-level adaptation can match or improve upon standard low-rank PEFT without increasing parameter count. These results are notable because ShadowPEFT additionally supports detachable deployment and shadow pretraining, which are not native to conventional LoRA-style adaptation.

We investigate this mechanism by training ShadowPEFT with a randomly initialized 0.5B shadow model. In the default shadow-attached inference setting, this model already performs competitively, though it marginally underperforms ShadowPEFT on Qwen3 8B with around 0.03B trainable parameters; we analyze this counter-intuitive behavior in Section [4.5](https://arxiv.org/html/2604.19254#S4.SS5). When we detach the centralized shadow from the base model and evaluate it independently, its performance collapses (∼41 average), indicating that random initialization does not equip the shadow model with sufficient standalone capability.

We then pretrain the 0.5B centralized shadow model with Moore–Penrose pseudo-inverse (Barata and Hussein, [2012](https://arxiv.org/html/2604.19254#bib.bib15)) initialization and a causal language modeling objective on a small portion of the FineWeb-Edu (Lozhkov et al., [2024](https://arxiv.org/html/2604.19254#bib.bib12)) (English) and Wudao (Xue et al., [2022](https://arxiv.org/html/2604.19254#bib.bib13)) (Chinese) corpora. We detail the pretraining in Appendix [B](https://arxiv.org/html/2604.19254#A2). In the default attached deployment setting of ShadowPEFT, it yields the best overall result, with particularly strong gains on reasoning-intensive tasks: GSM8K rises from 80.21 (random init.) to 82.18, and SQuAD v2 from 87.39 to 87.78. Beyond the attached deployment, we also evaluate the performance of the detached centralized shadow model. Notably, it retains a reasonable average of 62.11, which outperforms the fine-tuned Qwen3 0.6B with both LoRA and DoRA, indicating that pretraining equips the centralized shadow model with sufficient general and domain knowledge to serve as a standalone lightweight model. This also validates the pretrainability of ShadowPEFT.

Table 1: Comparison of different PEFT methods across model sizes on generation and understanding benchmarks. Reported results are the average of 5 runs. *ShadowPEFT* indicates the shadow-attached inference mode; *Detached Shadow Only* reports the results of the detached centralized shadow model. **Bold** indicates the best result per dataset within each parameter group, and underline denotes the overall best result per dataset.
### 4\.3 Ablation Study

We compared pretraining against random initialization of the centralized shadow model in the main experiment\. Here, we ablate the key component of ShadowPEFT, the shadow update module, on Qwen3 4B\. Specifically, we compare ShadowPEFT’s performance with and without the update module on the GSM8K \(generation\) and Amazon \(understanding\) tasks\. We observe that removing the update module causes a drop of 2\.43 points on GSM8K \(79\.00 → 76\.57\) while leaving Amazon nearly unchanged \(62\.66 → 62\.64\)\. This evidence suggests that the update module, which continuously refreshes the shadow model’s internal state based on the base model’s representations, is particularly critical for generation tasks, where accurate intermediate states are essential for multi\-step reasoning\. For simpler classification tasks, the shadow model’s initial representations are sufficient without continuous updates, making the update module’s contribution negligible\.

### 4\.4 Discussion on Generalization Performance

To assess whether ShadowPEFT improves out\-of\-distribution \(OOD\) generalization, we fine\-tune Qwen3 4B on a single dataset and evaluate on the remaining two held\-out generation benchmarks using 2\-shot demonstrations, as shown in Table[2](https://arxiv.org/html/2604.19254#S4.T2)\.

ShadowPEFT achieves the highest OOD performance in every training condition\. When trained on GSM8K \(train split\), ShadowPEFT achieves an OOD average of 50\.61 vs\. 50\.40 \(LoRA\) and 48\.57 \(DoRA\)\. DoRA notably degrades relative to LoRA in this condition, suggesting that its additional weight\-magnitude decomposition may hurt generalization\. When trained on SQuAD v2 \(train split\), ShadowPEFT again leads with an OOD average of 53\.23 vs\. 52\.41 \(LoRA\) and 52\.92 \(DoRA\)\. When trained on MMLU \(train split\), all three methods show strong OOD transfer to the other reasoning benchmarks; ShadowPEFT still performs better than LoRA and DoRA\.

Overall, ShadowPEFT not only improves in\-distribution performance but also preserves the generalization performance, making it suitable for practical applications\.

Table 2: Out\-of\-distribution generalization of Qwen3 4B fine\-tuned on different datasets, evaluated with 2\-shot demonstrations\. Blue cells indicate in\-distribution evaluation\. Bold indicates the best result per training group\. Trainable parameter counts are matched across all three methods, consistent with the setup in Table [1](https://arxiv.org/html/2604.19254#S4.T1)\.
### 4\.5 Discussion on Trainable Parameter Scaling

The main results showed that the randomly initialized 0\.5B shadow marginally underperforms the compact shadow \(around 0\.03B\) used in the standard 8B configuration, raising the question of how shadow model size affects fine\-tuning performance\. To investigate this, we vary the trainable parameter scales from 0\.1B to 0\.5B within ShadowPEFT on Qwen3\-8B, and compare against LoRA and DoRA across the same trainable parameter range, as shown in Figure[3](https://arxiv.org/html/2604.19254#S4.F3)\(a\)\.

ShadowPEFT consistently benefits from a larger shadow model: GSM8K accuracy rises from 81\.35 \(0\.1B\) to a peak of 82\.12 \(0\.4B\), followed by a marginal decrease at 0\.5B \(81\.80\), suggesting mild saturation near this scale\. By contrast, LoRA’s performance is nearly flat across all parameter scales \(80\.52–81\.28\), as LoRA has no mechanism to exploit an enlarged companion model\. Most strikingly, DoRA’s performance *degrades* monotonically as the parameter scale grows \(81\.12 at 0\.1B to 77\.79 at 0\.5B\)\. This pattern is consistent with a known limitation of low\-rank PEFT methods: increasing the rank beyond a certain threshold tends to hurt generalization and accelerate forgetting \(Rathore et al\., [2025](https://arxiv.org/html/2604.19254#bib.bib14)\)\. ShadowPEFT alleviates this issue by expanding capacity through the centralized shadow model rather than through the rank, allowing it to absorb additional parameters effectively\.

The saturation of ShadowPEFT at 0\.5B also explains why the randomly initialized 0\.5B variant in the main experiment only marginally underperforms the compact configuration: the shadow model has approached the capacity ceiling for the given base model scale, and further gains require a stronger initialization, as evidenced by the pretrained 0\.5B variant’s superior performance\.

Figure 3: \(a\) Parameter scaling: GSM8K accuracy of LoRA, DoRA, and ShadowPEFT with parameter scales from 0\.1B to 0\.5B \(base model: Qwen3 8B\)\. \(b\) Inference latency \(mean ± std over 10 attempts\) across three base model sizes \(Qwen3 0\.6B, 4B, and 8B\)\.
### 4\.6 Discussion on Efficiency

Figure[3](https://arxiv.org/html/2604.19254#S4.F3)\(b\) reports the inference latency \(mean±\\pmstd over 10 attempts\) of LoRA, DoRA, and ShadowPEFT across three model sizes\. ShadowPEFT introduces minimal overhead over LoRA: the latency increases by only3\.03\.0ms \(3\.7%3\.7\\%\),6\.06\.0ms \(5\.9%5\.9\\%\), and5\.95\.9ms \(5\.7%5\.7\\%\) on Qwen3 0\.6B, 4B, and 8B, respectively, amounting to an average overhead of4−6%4\-6\\%\. DoRA, by comparison, incurs substantially higher cost\. The low latency overhead of ShadowPEFT stems from its design: the shadow forward pass runs in parallel with the base model computation, and the injection and update modules add only a lightweight residual connection\. These results demonstrate that ShadowPEFT achieves consistent accuracy improvements at a latency cost that is effectively equivalent to LoRA, making it practical for real\-world applications\.

### 4\.7 Discussion on System\-Level Evaluation

To evaluate the practical benefit of ShadowPEFT, we further conduct a system\-level experiment on intent understanding for the Unitree Go2 robot dog\. The detailed experimental setup is provided in Appendix [F](https://arxiv.org/html/2604.19254#A6)\. As illustrated in Figure [4](https://arxiv.org/html/2604.19254#S4.F4), ShadowPEFT achieves both the lowest latency and the highest accuracy compared to the baselines\. The low latency comes from its detachable design: routine robot skills can be resolved locally with the detached shadow without accessing the cloud\. The cloud model \(full ShadowPEFT\) is used only for complex or open\-domain requests\. This design improves real\-time interaction and reduces cloud usage\. The higher accuracy further demonstrates the effectiveness of ShadowPEFT\.

\(Plot data: ShadowPEFT reaches 99\.35% accuracy, vs\. 97\.7% for both LoRA and DoRA; the x\-axis reports test time in seconds\.\)

Figure 4: Accuracy vs\. test time on the test set for ShadowPEFT, LoRA, and DoRA\.
We also present several case studies in Table [3](https://arxiv.org/html/2604.19254#S4.T3)\. The results show that the detached shadow model understands routine robot skills reliably\. For more complex or open\-domain queries, instead of generating hallucinations, it returns the \[REMOTE\] tag, indicating that the input should be forwarded to the full ShadowPEFT model \(deployed on a cloud server\)\. This behavior suggests that the detached mode can serve as an effective lightweight front\-end for simple instructions while safely forwarding harder cases to the cloud model, benefiting edge computing scenarios\.

The full ShadowPEFT model performs well on both routine robot skills and more complex user queries\. In contrast, although LoRA and DoRA can generate fluent and high\-quality responses, they still occasionally produce incorrect actions or hallucinations\. This evidence further demonstrates the effectiveness of ShadowPEFT, as well as the practicality of its detached and attached modes\. Overall, ShadowPEFT provides more accurate and precise responses than LoRA and DoRA while maintaining a scalable design for real\-world deployment\.

Table 3: Case study comparison of generated responses across Detached Shadow\-Only, ShadowPEFT, LoRA, and DoRA\. Red marks wrong answers; orange annotates hallucinations\.

## 5 Conclusion

We presented ShadowPEFT, a PEFT framework that replaces independent per\-weight low\-rank perturbations with a shared transformer layer\-level shadow network\. By maintaining and updating a parallel shadow state across transformer depth, ShadowPEFT centralizes adaptation in a reusable functional component\. Empirically, ShadowPEFT achieves competitive or improved performance relative to strong low\-rank baselines under comparable parameter budgets, while also enabling detachable deployment and shadow pretraining\. These findings suggest that PEFT can be designed not only as lightweight parameter injection, but also as modular, stateful function\-level adaptation\.

## Limitation

Due to computational resource constraints, we were unable to evaluate ShadowPEFT on larger\-scale LLMs or across a more diverse set of architectures\. We leave these directions for future work\.

## References

- J\. C\. A\. Barata and M\. S\. Hussein \(2012\)The moore–penrose pseudoinverse: a tutorial review of the theory\.Brazilian Journal of Physics42\(1\),pp\. 146–165\.Cited by:[Appendix B](https://arxiv.org/html/2604.19254#A2.p2.14),[§4\.2](https://arxiv.org/html/2604.19254#S4.SS2.p3.5)\.
- C\. Che, Z\. Wang, P\. Yang, C\. Wang, H\. Ma, and Z\. Shi \(2026\)LoRA in lora: towards parameter\-efficient architecture expansion for continual visual instruction tuning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 19978–19986\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- K\. Cho, B\. van Merrienboer, Ç\. Gülçehre, D\. Bahdanau, F\. Bougares, H\. Schwenk, and Y\. Bengio \(2014\)Learning phrase representations using rnn encoder–decoder for statistical machine translation\.InConference on Empirical Methods in Natural Language Processing,Cited by:[§3\.3](https://arxiv.org/html/2604.19254#S3.SS3.p3.1)\.
- A\. Chronopoulou, M\. E\. Peters, A\. Fraser, and J\. Dodge \(2023\)Adaptersoup: weight averaging to improve generalization of pretrained language models\.InFindings of the Association for Computational Linguistics: EACL 2023,pp\. 2054–2063\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p5.1),[§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px2.p1.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)Qlora: efficient finetuning of quantized llms\.Advances in neural information processing systems36,pp\. 10088–10115\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p1.1),[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- Z\. Han, C\. Gao, J\. Liu, J\. Zhang, and S\. Q\. Zhang \(2024\)Parameter\-efficient fine\-tuning for large models: a comprehensive survey\.Trans\. Mach\. Learn\. Res\.2024\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.Proceedings of the International Conference on Learning Representations \(ICLR\)\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p5.1),[§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px2.p1.1)\.
- N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly \(2019\)Parameter\-efficient transfer learning for nlp\.InInternational conference on machine learning,pp\. 2790–2799\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p1.1),[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.
- C\. Hsu, Y\. Tsai, C\. Lin, P\. Chen, C\. Yu, and C\. Huang \(2024\)Safe lora: the silver lining of reducing safety risks when finetuning large language models\.Advances in Neural Information Processing Systems37,pp\. 65072–65094\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen, et al\. \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2604.19254#S1.p1.1), [§2](https://arxiv.org/html/2604.19254#S2.p3.1), [§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px1.p1.1), [footnote 1](https://arxiv.org/html/2604.19254#footnote1)\.
- R\. Karimi Mahabadi, J\. Henderson, and S\. Ruder \(2021\)Compacter: efficient low\-rank hypercomplex adapter layers\.Advances in neural information processing systems34,pp\. 1022–1035\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.
- P\. Keung, Y\. Lu, G\. Szarvas, and N\. A\. Smith \(2020\)The multilingual amazon reviews corpus\.InProceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\),pp\. 4563–4568\.Cited by:[§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px2.p1.1)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.InProceedings of the 2021 conference on empirical methods in natural language processing,pp\. 3045–3059\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.
- D\. Li, Y\. Ma, N\. Wang, Z\. Ye, Z\. Cheng, Y\. Tang, Y\. Zhang, L\. Duan, J\. Zuo, C\. Yang,et al\.\(2024\)Mixlora: enhancing large language models fine\-tuning with lora\-based mixture of experts\.arXiv preprint arXiv:2404\.15159\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-tuning: optimizing continuous prompts for generation\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 4582–4597\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p1.1),[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.
- Z\. Li, X\. Li, J\. Li, H\. Xie, F\. L\. Wang, and Q\. Li \(2026\)LS\-billms: label supervised bi\-directional large language models for token\-and sequence\-level information extraction\.Information Processing & Management63\(4\),pp\. 104568\.Cited by:[§3\.4](https://arxiv.org/html/2604.19254#S3.SS4.SSS0.Px2.p1.3)\.
- H\. Liu, D\. Tam, M\. Muqeeth, J\. Mohta, T\. Huang, M\. Bansal, and C\. A\. Raffel \(2022a\)Few\-shot parameter\-efficient fine\-tuning is better and cheaper than in\-context learning\.Advances in Neural Information Processing Systems35,pp\. 1950–1965\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p1.1)\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\)Dora: weight\-decomposed low\-rank adaptation\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p1.1),[§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px1.p1.1)\.
- X\. Liu, K\. Ji, Y\. Fu, W\. Tam, Z\. Du, Z\. Yang, and J\. Tang \(2022b\)P\-tuning: prompt tuning can be comparable to fine\-tuning across scales and tasks\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 61–68\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.
- A\. Lozhkov, L\. Ben Allal, L\. von Werra, and T\. Wolf \(2024\)FineWeb\-edu: the finest collection of educational content\.Hugging Face\.External Links:[Link](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu),[Document](https://dx.doi.org/10.57967/hf/2497)Cited by:[Appendix B](https://arxiv.org/html/2604.19254#A2.p2.14),[§4\.2](https://arxiv.org/html/2604.19254#S4.SS2.p3.5)\.
- T\. Mitchell \(1997\) Twenty Newsgroups\. UCI Machine Learning Repository\. DOI: https://doi\.org/10\.24432/C5C323\. Cited by: [§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px2.p1.1)\.
- P\. Rajpurkar, R\. Jia, and P\. Liang \(2018\)Know what you don’t know: unanswerable questions for squad\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 784–789\.Cited by:[§1](https://arxiv.org/html/2604.19254#S1.p5.1),[§4\.1](https://arxiv.org/html/2604.19254#S4.SS1.SSS0.Px2.p1.1)\.
- D\. Rathore, V\. Kumar, C\. Bansal, and A\. Moitra \(2025\)How much is too much? exploring lora rank trade\-offs for retaining knowledge and domain robustness\.InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics,pp\. 1003–1013\.Cited by:[§4\.5](https://arxiv.org/html/2604.19254#S4.SS5.p2.7)\.
- M\. Valipour, M\. Rezagholizadeh, I\. Kobyzev, and A\. Ghodsi \(2023\)DyLoRA: parameter\-efficient tuning of pre\-trained models using dynamic search\-free low\-rank adaptation\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,pp\. 3274–3287\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- S\. Wang, L\. Yu, and J\. Li \(2024\)Lora\-ga: low\-rank adaptation with gradient approximation\.Advances in Neural Information Processing Systems37,pp\. 54905–54931\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, E\. H\. Chi, F\. Xia, Q\. Le, and D\. Zhou \(2022\)Chain of thought prompting elicits reasoning in large language models\.ArXivabs/2201\.11903\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p1.1)\.
- L\. Xu, H\. Xie, S\. J\. Qin, X\. Tao, and F\. L\. Wang \(2026\)Parameter\-efficient fine\-tuning methods for pretrained language models: a critical review and assessment\.IEEE Transactions on Pattern Analysis and Machine Intelligence\(\),pp\. 1–20\.External Links:[Document](https://dx.doi.org/10.1109/TPAMI.2026.3657354)Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p1.1)\.
- Y\. Xu, L\. Xie, X\. Gu, X\. Chen, H\. Chang, H\. Zhang, Z\. Chen, X\. Zhang, and Q\. Tian \(2023\)QA\-lora: quantization\-aware low\-rank adaptation of large language models\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- Z\. Xue, H\. Zhao, S\. Yuan, and Y\. Wang \(2022\)WuDaoCorpora Text\.Science Data Bank\.External Links:[Document](https://dx.doi.org/10.57760/sciencedb.o00126.00004)Cited by:[Appendix B](https://arxiv.org/html/2604.19254#A2.p2.14),[§4\.2](https://arxiv.org/html/2604.19254#S4.SS2.p3.5)\.
- Y\. Yang, D\. Muhtar, Y\. Shen, Y\. Zhan, J\. Liu, Y\. Wang, H\. Sun, W\. Deng, F\. Sun, Q\. Zhang,et al\.\(2025\)Mtl\-lora: low\-rank adaptation for multi\-task learning\.InProceedings of the AAAI Conference on Artificial intelligence,Vol\.39,pp\. 22010–22018\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- Q\. Zhang, M\. Chen, A\. Bukharin, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023\)Adaptive budget allocation for parameter\-efficient fine\-tuning\.InThe Eleventh International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p3.1)\.
- H\. Zhao, H\. Tan, and H\. Mei \(2022\)Tiny\-attention adapter: contexts are more important than the number of parameters\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 6626–6638\.Cited by:[§2](https://arxiv.org/html/2604.19254#S2.p2.1)\.

## Appendix A More Designs of Centralized Shadow Model

#### Implicit vs\. explicit shadow models\.

ShadowPEFT supports two centralized shadow initialization strategies\. In the *implicit* setting, the centralized shadow model is derived automatically from the base model’s configuration by reducing the number of layers to $L_s \ll L$ and optionally narrowing the intermediate width and attention heads\. In the *explicit* setting, the practitioner provides an independent centralized shadow model of any compatible architecture, enabling cross\-model knowledge transfer\.

#### Embedding sharing\.

To avoid maintaining a duplicate token embedding lookup table inside the centralized shadow model, ShadowPEFT shares the base model’s frozen embedding layer\. Input token embeddings are computed once as $\mathbf{E} = \texttt{Embed}_{\text{base}}(\mathbf{x}) \in \mathbb{R}^{B \times T \times d}$ and fed to the centralized shadow model as `inputs_embeds`, so the shadow `embed_tokens` matrix is removed from the module entirely\.
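A minimal sketch of this embedding-sharing pattern, using a toy stand-in for the shadow module (the class and variable names here are illustrative, not from the ShadowPEFT release):

```python
import torch
import torch.nn as nn

class TinyShadow(nn.Module):
    """Stand-in for the centralized shadow model: it accepts precomputed
    embeddings (`inputs_embeds`) and holds no embedding table of its own."""
    def __init__(self, d: int):
        super().__init__()
        self.layer = nn.Linear(d, d)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        return self.layer(inputs_embeds)

# Frozen base embedding table, shared with the shadow.
vocab, d = 100, 16
base_embed = nn.Embedding(vocab, d)
base_embed.weight.requires_grad_(False)

shadow = TinyShadow(d)
input_ids = torch.randint(0, vocab, (2, 5))  # (B, T)
E = base_embed(input_ids)                    # computed once: (B, T, d)
out = shadow(inputs_embeds=E)                # shadow reuses the base embeddings
print(out.shape)
```

In a real Hugging Face-style setup the same effect is obtained by passing `inputs_embeds` to the shadow model's forward call instead of `input_ids`.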

#### Hidden\-size projection\.

When the centralized shadow model has a hidden dimension $d_s \neq d$, a learned linear projection $\mathbf{W}_{\text{proj}} \in \mathbb{R}^{d_s \times d}$ \(no bias\) is applied to align the shadow output with the representation space of the base model:

$$\mathbf{s}^{(0)} \leftarrow \mathbf{s}^{(0)}\,\mathbf{W}_{\text{proj}}. \tag{10}$$

When $d_s = d$, the projection is an identity and does not introduce additional parameters\.
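A short sketch of Eq. 10, assuming illustrative dimensions (`nn.Linear(d_s, d)` applies the same map as right-multiplying by $\mathbf{W}_{\text{proj}}$, since it stores the transposed weight):

```python
import torch
import torch.nn as nn

d_s, d = 1024, 4096  # shadow and base hidden sizes (illustrative)

# Bias-free projection aligning the shadow state with the base space;
# falls back to identity (zero extra parameters) when d_s == d.
proj = nn.Linear(d_s, d, bias=False) if d_s != d else nn.Identity()

s0 = torch.randn(2, 5, d_s)  # shadow state (B, T, d_s)
s0 = proj(s0)                # now lives in the base model's d-dim space
print(s0.shape)
```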

## Appendix B Centralized Shadow Model Pretraining

Here, we delineate the process to pretrain a centralized shadow model that can be attached to Qwen3 8B as the base model\.

Due to resource constraints, we reuse Qwen3 0\.6B as the centralized shadow model $f_\theta$ rather than pretraining from scratch\. To attach it to Qwen3 8B, we must bridge their hidden dimensions $d_s \neq d_t$ via a linear projection $\mathbf{P} \in \mathbb{R}^{d_t \times d_s}$, giving the forward pass:

$$\hat{\mathbf{y}} = \mathbf{W}_{\mathrm{lm}}\,\mathbf{P}\,\mathbf{h}, \tag{11}$$

where $\mathbf{h} \in \mathbb{R}^{d_s}$ are the shadow hidden states and $\mathbf{W}_{\mathrm{lm}} \in \mathbb{R}^{V \times d_t}$ is the frozen 8B LM head\. A randomly initialized $\mathbf{P}$ destroys the model’s generation ability\. To recover a useful starting point, we initialize $\mathbf{P}$ by minimizing the Frobenius distance to Qwen3 0\.6B’s original head $\mathbf{W}_{\mathrm{lm}}^{\mathrm{ref}}$:

$$\mathbf{P}^{*} = \mathbf{W}_{\mathrm{lm}}^{+}\,\mathbf{W}_{\mathrm{lm}}^{\mathrm{ref}}, \tag{12}$$

where $\mathbf{W}_{\mathrm{lm}}^{+}$ is the Moore–Penrose pseudo\-inverse \[Barata and Hussein, [2012](https://arxiv.org/html/2604.19254#bib.bib15)\]\. This warm start ensures the composed head $\mathbf{W}_{\mathrm{lm}}\mathbf{P}^{*}$ approximates the original output distribution from the first step, reducing subsequent alignment training\. We then continue pretraining $f_\theta$ and $\mathbf{P}$ on the FineWeb\-Edu \(English subset, sampled 100K\) \[Lozhkov et al\., [2024](https://arxiv.org/html/2604.19254#bib.bib12)\] and Wudao \(Chinese only, sampled 100K\) \[Xue et al\., [2022](https://arxiv.org/html/2604.19254#bib.bib13)\] corpora with a causal language modeling objective\.
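The warm start in Eq. 12 can be sketched in a few lines; the dimensions below are illustrative, not the real Qwen3 sizes:

```python
import torch

# Pseudo-inverse warm start for the bridge projection P (Eq. 12).
V, d_t, d_s = 500, 64, 32
W_lm = torch.randn(V, d_t)        # frozen large-model LM head
W_lm_ref = torch.randn(V, d_s)    # shadow model's original LM head

# Least-squares minimizer of ||W_lm @ P - W_lm_ref||_F over P.
P = torch.linalg.pinv(W_lm) @ W_lm_ref   # shape (d_t, d_s)

# The composed head W_lm @ P approximates the reference head, so the
# shadow starts from (roughly) its original output distribution.
err = torch.linalg.norm(W_lm @ P - W_lm_ref)
print(P.shape, err.item())
```

Because the pseudo-inverse yields the Frobenius-norm minimizer, no other choice of `P` can achieve a smaller residual against the reference head.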

## Appendix C Dataset Prompt Templates

We describe the prompt templates used to format each dataset for supervised fine\-tuning\. All prompts are wrapped in the model’s chat template via `apply_chat_template` \(with `add_generation_prompt=True` and `enable_thinking=False`\), and the assistant response is appended directly after the generated prompt prefix\. Prompt tokens are masked with $-100$ in the label sequence so that loss is computed only over the answer tokens\.
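The label-masking step can be sketched as follows ($-100$ is the default ignore index of PyTorch's cross-entropy loss; the helper name is ours):

```python
def build_labels(prompt_ids: list[int], answer_ids: list[int]) -> list[int]:
    # Prompt tokens are masked with -100 so the loss ignores them;
    # only the answer tokens contribute to the training objective.
    return [-100] * len(prompt_ids) + answer_ids

prompt_ids = [1, 2, 3, 4]   # tokenized chat-template prefix
answer_ids = [5, 6]         # tokenized assistant response
input_ids = prompt_ids + answer_ids
labels = build_labels(prompt_ids, answer_ids)
print(labels)  # [-100, -100, -100, -100, 5, 6]
```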

#### MMLU\.

Each MMLU example is formatted as a four\-way multiple\-choice question\. The model is instructed to respond with a single letter\. Two templates are supported:

1\) Zero\-shot template: it is applied in the main experiment\.


```
Question: <question>

Options:
A: <choice_0>
B: <choice_1>
C: <choice_2>
D: <choice_3>

Instructions: Answer with ONLY the letter (A, B, C, or D).
Do not include any explanation, reasoning, or additional text.
Answer:
```

Target \(gold label\): a single letter, e\.g\. `B`\.

2\) Few\-shot template: it is used in generalization testing\. When few\-shot prompting is enabled, the following prefix is prepended to the zero\-shot template above:


```
Examples of correct response format:

Question: What is 2+2?
Options:
A: 3
B: 4
C: 5
D: 6
Answer: B

Question: What color is the sky?
Options:
A: Green
B: Blue
C: Red
D: Yellow
Answer: B

Now answer the following question:
```

#### GSM8K\.

GSM8K examples are formatted for arithmetic reasoning\.


```
Question: <question>
Answer:
```

Target: the full gold solution string, including intermediate reasoning steps and the final answer line `#### <number>`\.

#### SQuAD v2\.

SQuAD v2 examples are formatted as extractive reading\-comprehension tasks\. Unanswerable questions are supervised with the literal token `unanswerable`\.


```
Answer with the exact span from the context.
If the question is unanswerable from the context,
respond with exactly: unanswerable

Context:
<context>

Question:
<question>

Answer (span or ‘unanswerable’ only):
```

Target: the first gold answer span, or `unanswerable`\.

## Appendix D Gated Residual Update

In practice, both $T^{(\ell)}(\cdot)$ and $G^{(\ell)}(\cdot)$ are implemented as lightweight two\-layer MLPs\. The transform network is defined as

$$T^{(\ell)}(\mathbf{z}) = \mathbf{W}^{(\ell)}_{T,2}\,\operatorname{Dropout}\!\left(\operatorname{SiLU}\!\left(\mathbf{W}^{(\ell)}_{T,1}\,\mathbf{z}\right)\right), \quad \mathbf{W}^{(\ell)}_{T,1} \in \mathbb{R}^{d \times h_g},\ \mathbf{W}^{(\ell)}_{T,2} \in \mathbb{R}^{h_g \times d}, \tag{13}$$

and the gate network is

$$G^{(\ell)}(\mathbf{z}) = \mathbf{W}^{(\ell)}_{G,2}\,\operatorname{SiLU}\!\left(\mathbf{W}^{(\ell)}_{G,1}\,\mathbf{z}\right), \quad \mathbf{W}^{(\ell)}_{G,1} \in \mathbb{R}^{d \times h_g},\ \mathbf{W}^{(\ell)}_{G,2} \in \mathbb{R}^{h_g \times d}, \tag{14}$$

where $h_g$ is the gate hidden size hyperparameter\. All projection matrices in both $T^{(\ell)}$ and $G^{(\ell)}$ are bias\-free\.
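The two MLPs above can be sketched in PyTorch as follows (hyperparameter values are illustrative; both networks are bias-free, matching Eqs. 13–14):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformNet(nn.Module):
    """Two-layer bias-free MLP with SiLU and dropout (Eq. 13)."""
    def __init__(self, d: int, h_g: int, p_drop: float = 0.1):
        super().__init__()
        self.w1 = nn.Linear(d, h_g, bias=False)
        self.w2 = nn.Linear(h_g, d, bias=False)
        self.drop = nn.Dropout(p_drop)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.w2(self.drop(F.silu(self.w1(z))))

class GateNet(nn.Module):
    """Two-layer bias-free MLP with SiLU, no dropout (Eq. 14)."""
    def __init__(self, d: int, h_g: int):
        super().__init__()
        self.w1 = nn.Linear(d, h_g, bias=False)
        self.w2 = nn.Linear(h_g, d, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(z)))

d, h_g = 64, 16
T, G = TransformNet(d, h_g), GateNet(d, h_g)
z = torch.randn(2, 5, d)
print(T(z).shape, G(z).shape)  # both map back to dimension d
```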

## Appendix E Auxiliary Shadow Loss as a Regularizer

The auxiliary shadow loss in Eqs\.[8](https://arxiv.org/html/2604.19254#S3.E8)and[9](https://arxiv.org/html/2604.19254#S3.E9)serves as a regularizer for the centralized shadow model\. Since the base model is frozen, training otherwise relies entirely on how effectively the shadow can steer the backbone through injection\. By directly supervising the shadow output, the auxiliary loss stabilizes optimization and encourages the shadow state to encode task\-relevant information on its own\. This property is especially important for detached deployment, where only the shadow model is used at inference time\.

## Appendix F System\-Level Evaluation Setup

#### Dataset\.

We release a robot\-dog instruction dataset \(adapted for Unitree Go2\)\. It contains 4,771 examples covering 34 predefined robot skills plus one \[REMOTE\] category\. Overall, the corpus is bilingual, with 2,816 English examples and 1,955 Chinese examples\. Table [4](https://arxiv.org/html/2604.19254#A6.T4) lists the data statistics\.

Table 4: Summary statistics of the robot\-dog instruction dataset\. The predefined robot skills are: Damp\(\), BalanceStand\(\), StopMove\(\), StandUp\(\), StandDown\(\), RecoveryStand\(\), Sit\(\), RiseSit\(\), Stretch\(\), Wallow\(\), Scrape\(\), FrontFlip\(\), FrontJump\(\), FrontPounce\(\), WiggleHips\(\), TurnLeft\(\), TurnRight\(\), SayHello\(\), Dance\(\), DrawHeart\(\), BalanceAttitude\(\), PlayKungFu\(\), HappyBirthday\(\), HappyNewYear\(\), CheerLeading\(\), ShakeBody\(\), TurnAround\(\), LionDance\(\), Welcome\(\), WaltzDance\(\), ChaChaDance\(\), Tango\(\), HipHopDance\(\), and Bark\(\)\.

#### Evaluation\.

The latency and accuracy of ShadowPEFT are computed as follows: we first perform intent understanding with the detached shadow\-only model\. If the predicted intent corresponds to one of the predefined robot skills, the command is executed directly on\-device\. Otherwise, if the request is out of scope or the prediction contains the \[REMOTE\] tag, the query is forwarded to the cloud model for further processing\. By contrast, LoRA and DoRA do not support this detached execution mode, so they always rely on the full model for intent understanding\.
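The edge/cloud routing described above can be sketched as follows (the function name and the skill subset are illustrative, not from the released system):

```python
# Subset of the 34 predefined skills, for illustration only.
ROBOT_SKILLS = {"Sit()", "StandUp()", "Dance()", "SayHello()"}

def route(predicted_intent: str) -> str:
    # Routine skills execute on-device via the detached shadow;
    # [REMOTE] or out-of-scope predictions go to the cloud model.
    if predicted_intent in ROBOT_SKILLS:
        return "local"
    return "cloud"

print(route("Sit()"), route("[REMOTE]"))  # local cloud
```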
