# Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection
Source: [https://arxiv.org/html/2605.08177](https://arxiv.org/html/2605.08177)
Yihang Peng¹, Peng Jin², Jie Gong¹, Xingyuan Chen², Lingjiao Xu², Ning Su², Yan Ran¹
¹School of Computer Science and Software Engineering, Southwest Petroleum University
²School of Electronic Information and Artificial Intelligence, Leshan Normal University
###### Abstract
Parameter-efficient fine-tuning (PEFT) has become a practical route for adapting large language models to downstream tasks, with LoRA-style methods being particularly attractive because they are inexpensive to train and easy to deploy. Most LoRA variants, however, revise the update rule within the weight space of each layer and leave the intermediate representations formed by deeper layers largely unused. We propose Echo-LoRA, a cross-layer representation injection method for parameter-efficient fine-tuning. During training, Echo-LoRA collects boundary hidden states from deeper source layers, aggregates them into a sample-level echo representation, and uses lightweight projection and gating networks to inject the resulting signal into shallow LoRA or DoRA modules. Answer-only masking, masked distillation, and stochastic routing keep this auxiliary path stable and reduce the gap between training and inference. On eight commonsense reasoning benchmarks, Echo-LoRA exceeds the reported LoRA baselines by 5.7 percentage points on average across LLaMA-7B, LLaMA2-7B, and LLaMA3-8B. Under reproduced LoRA baselines in our unified implementation, the average gain is 3.0 points; when combined with DoRA, the gain is 2.7 points. The Echo path is discarded after training, so the deployed model keeps the original low-rank LoRA/DoRA form and adds neither inference-time parameters nor inference computation.
## 1 Introduction

Large language models (LLMs) now serve as a common foundation for natural language understanding, text generation, and complex reasoning [[15](https://arxiv.org/html/2605.08177#bib.bib15), [16](https://arxiv.org/html/2605.08177#bib.bib16), [17](https://arxiv.org/html/2605.08177#bib.bib17)]. A pretrained model is rarely used unchanged in downstream applications; it usually needs to be adapted to a new task distribution or instruction format [[2](https://arxiv.org/html/2605.08177#bib.bib2)]. Full fine-tuning remains costly for contemporary LLMs, both in training memory and in storage for task-specific checkpoints, which keeps efficient adaptation a practical bottleneck [[8](https://arxiv.org/html/2605.08177#bib.bib8), [11](https://arxiv.org/html/2605.08177#bib.bib11)].

PEFT methods reduce this burden by training only a small set of additional or reparameterized variables. Adapter modules, prompt or prefix tuning, and low-rank updates represent the main families of such methods [[11](https://arxiv.org/html/2605.08177#bib.bib11), [19](https://arxiv.org/html/2605.08177#bib.bib19), [21](https://arxiv.org/html/2605.08177#bib.bib21), [20](https://arxiv.org/html/2605.08177#bib.bib20)]. LoRA is often used as a default low-rank baseline: it represents the weight update with two low-rank matrices, trains efficiently, and can be merged into the frozen weights before deployment [[18](https://arxiv.org/html/2605.08177#bib.bib18)]. Later variants have mostly refined this weight-update view, for instance through adaptive rank allocation, vector-based random-matrix adaptation, or magnitude-direction decomposition [[24](https://arxiv.org/html/2605.08177#bib.bib24), [23](https://arxiv.org/html/2605.08177#bib.bib23), [8](https://arxiv.org/html/2605.08177#bib.bib8)].

This view leaves one question underexplored: during adaptation, what information should a trainable layer receive? Probing studies suggest that Transformer layers do not play identical roles. Shallow layers are more closely tied to lexical, syntactic, and local patterns, while deeper layers tend to encode more abstract semantic and task-relevant information [[1](https://arxiv.org/html/2605.08177#bib.bib1), [22](https://arxiv.org/html/2605.08177#bib.bib22)]. A shallow LoRA module, by design, updates its layer from the representations available at that point in the forward pass; it has no direct access to semantic states that appear later in the network. We argue that this separation can be limiting for tasks that depend on global judgment, commonsense integration, or structured generation.

Echo-LoRA is built around this observation. During training, we extract answer-boundary representations from deeper source layers, aggregate them into a sample-level echo representation, and inject that representation into shallow LoRA/DoRA modules using small projection and gating networks. The design gives shallow adaptation modules access to a compact signal derived from deeper semantic states. Since such a cross-layer path can also introduce spurious dependencies, we use answer-only masking, masked distillation, and stochastic routing to keep the auxiliary signal controlled.

Our contributions are as follows. We introduce Echo-LoRA, a training-time cross-layer injection mechanism that feeds answer-boundary representations from deeper layers into shallow LoRA/DoRA adaptation modules. We pair this mechanism with answer-only masking, masked distillation, and stochastic routing, so that the auxiliary path helps optimization without being required at inference time. We evaluate Echo-LoRA on LLaMA-7B, LLaMA2-7B, and LLaMA3-8B. On eight commonsense reasoning datasets, it improves the average score by 5.7 points over reported LoRA baselines and by 3.0 points over reproduced LoRA baselines in our unified implementation. Echo-DoRA improves the corresponding DoRA baseline by 2.7 points. Additional experiments on mathematical reasoning, code generation, and multitask knowledge understanding show similar positive trends.
## 2 Related Work

### 2.1 Parameter-Efficient Fine-Tuning for LLMs

Scaling pretrained language models makes full fine-tuning increasingly expensive and harder to maintain in deployment. PEFT methods respond by keeping the backbone mostly frozen and updating a small subset of trainable parameters [[11](https://arxiv.org/html/2605.08177#bib.bib11)]. Adapter-based methods insert compact trainable modules into the original network [[11](https://arxiv.org/html/2605.08177#bib.bib11), [6](https://arxiv.org/html/2605.08177#bib.bib6)], whereas prompt- and prefix-tuning optimize continuous vectors that steer generation [[19](https://arxiv.org/html/2605.08177#bib.bib19), [21](https://arxiv.org/html/2605.08177#bib.bib21), [20](https://arxiv.org/html/2605.08177#bib.bib20)].

Low-rank updates form another widely used PEFT family. LoRA approximates the full weight update by two low-rank matrices and obtains strong performance at modest training cost [[18](https://arxiv.org/html/2605.08177#bib.bib18)]. AdaLoRA adjusts rank budgets according to parameter importance [[24](https://arxiv.org/html/2605.08177#bib.bib24)]; VeRA fixes random matrices and trains scaling vectors [[23](https://arxiv.org/html/2605.08177#bib.bib23)]; DoRA decomposes pretrained weights into magnitude and direction components to better mimic full fine-tuning [[8](https://arxiv.org/html/2605.08177#bib.bib8)]. These methods mainly refine how local weights are updated. Echo-LoRA instead asks whether hidden states from deeper layers can serve as useful conditioning signals for shallow adaptation modules.
### 2.2 Intermediate Representations and Cross-Layer Information

Transformer hidden states vary substantially with depth. Probing and interpretability studies have found that shallow layers tend to encode local lexical and syntactic patterns, while deeper layers capture more abstract semantic information [[1](https://arxiv.org/html/2605.08177#bib.bib1), [22](https://arxiv.org/html/2605.08177#bib.bib22)]. This layered structure suggests that depth itself can provide a useful source of training signal.

Layer differences have also been used at inference time. DoLa, for example, contrasts output distributions from different layers during decoding to improve factuality and accuracy [[4](https://arxiv.org/html/2605.08177#bib.bib4)]. These results motivate treating intermediate states as reusable signals rather than disposable computation traces. Echo-LoRA follows this line of thought in the fine-tuning stage, where we use deeper representations as a training-time auxiliary condition for shallow PEFT modules.
### 2.3 Training Stability Strategies

Randomized computation paths and selective supervision are standard tools for improving robustness. Stochastic Depth, for instance, drops layers during training to reduce dependence on a fixed path [[12](https://arxiv.org/html/2605.08177#bib.bib12)]. Instruction tuning typically ignores prompt tokens in the loss and backpropagates only through target-answer positions [[2](https://arxiv.org/html/2605.08177#bib.bib2)]. Echo-LoRA uses a related principle: because the Echo branch adds a second, training-only path, we restrict where the injected signal appears and randomize when the path is active.
## 3 Method

### 3.1 Problem Definition and LoRA Preliminaries

Let the input sequence be $X = (x_1, \ldots, x_T)$. In instruction tuning, the sequence is usually the concatenation of a prompt and an answer. We denote the answer-token positions by $\mathcal{A}$. The model predicts these answer tokens autoregressively conditioned on the prompt, while prompt positions are typically excluded from the loss.

Consider a target linear transformation in a Transformer block with input $\mathbf{u} \in \mathbb{R}^{d_{\text{in}}}$ and output $\mathbf{o} \in \mathbb{R}^{d_{\text{out}}}$. LoRA keeps the pretrained weight $\mathbf{W}$ frozen and learns a low-rank update:

$$\mathbf{o} = \mathbf{W}\mathbf{u} + \Delta\mathbf{W}\mathbf{u}, \qquad \Delta\mathbf{W} = \frac{\alpha}{r}\mathbf{B}\mathbf{A}, \tag{1}$$

where $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is the frozen pretrained weight, $\mathbf{A} \in \mathbb{R}^{r \times d_{\text{in}}}$ and $\mathbf{B} \in \mathbb{R}^{d_{\text{out}} \times r}$ are trainable low-rank matrices, $r$ is the rank, and $\alpha$ is a scaling coefficient.

This update remains local to the target layer. It does not condition shallow trainable modules on representations that appear later in the network. If deeper hidden states contain task-relevant semantic information, using them as a training-time signal may improve the adaptation of shallow modules.
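The low-rank update in Eq. (1) can be sketched in a few lines of NumPy; the dimensions and seed below are illustrative, not the paper's configuration. Initializing $\mathbf{B}$ to zero, as is common for LoRA, makes the update inactive at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16    # illustrative sizes, not the paper's

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable, rank r
B = np.zeros((d_out, r))                # trainable, commonly zero-initialized
u = rng.standard_normal(d_in)

delta_W = (alpha / r) * B @ A           # low-rank update, Eq. (1)
o = W @ u + delta_W @ u

# With B = 0 the update vanishes, so the adapted output equals the frozen output.
assert np.allclose(o, W @ u)
```

Because only $\mathbf{A}$ and $\mathbf{B}$ are trained, the adapter adds $r(d_{\text{in}} + d_{\text{out}})$ parameters per target matrix and can be folded into $\mathbf{W}$ after training.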
### 3.2 Overall Framework

Echo-LoRA uses deeper representations as auxiliary conditions for shallow PEFT modules. At routed training steps, we collect hidden states at answer-boundary positions from deeper source layers, aggregate them into a sample-level echo representation, and inject the resulting signal into shallow target LoRA/DoRA modules through projection and gating networks.

Let $\mathcal{S}$ be the source-layer set and $\mathcal{T}$ the target-layer set, with source layers placed deeper than target layers. For sample $b$, $t_b^{\star}$ denotes the boundary position immediately before the answer region. We extract source-layer hidden states at this position and average them:

$$\mathbf{z}_b = \frac{1}{|\mathcal{S}|} \sum_{l \in \mathcal{S}} \mathbf{h}^{(l)}_{b, t_b^{\star}}. \tag{2}$$

Here $\mathbf{h}^{(l)}_{b, t_b^{\star}}$ is the hidden state of layer $l$ at position $t_b^{\star}$. We use this boundary position because it is expected to summarize the prompt context before answer generation begins. In implementation, the first forward pass produces the source representation, and the second pass uses it as a stop-gradient condition; the injection branch does not backpropagate through the source hidden states obtained in the first pass.
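The aggregation in Eq. (2) can be sketched as follows; the layer stack, source-layer indices, and per-sample boundary positions are hypothetical stand-ins for the states an Echo-off pass would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, batch, seq_len, d = 12, 2, 10, 16   # illustrative sizes
# Hypothetical per-layer hidden states collected from the Echo-off pass.
hidden = rng.standard_normal((num_layers, batch, seq_len, d))

source_layers = [-4, -3, -2, -1]   # deeper source layers (illustrative choice)
t_star = np.array([3, 6])          # answer-boundary position for each sample

# Eq. (2): average the source-layer states at each sample's boundary position.
# In training this tensor would be treated as a stop-gradient condition.
z = np.stack([
    hidden[source_layers, b, t_star[b], :].mean(axis=0) for b in range(batch)
])
assert z.shape == (batch, d)
```

Only one $d$-dimensional vector per sample survives this step, which is what keeps the echo signal compact.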
Given $\mathbf{z}_b$, Echo-LoRA first normalizes it and then computes an injection vector through projection and gating networks. For a target layer $l \in \mathcal{T}$ and target module $m$, the computation is:

$$\bar{\mathbf{z}}_b = \mathrm{Norm}(\mathbf{z}_b), \tag{3}$$

$$\mathbf{e}_b^{(l,m)} = \mathbf{W}_2^{(l,m)} \tanh\left(\mathbf{W}_1^{(l,m)} \bar{\mathbf{z}}_b\right), \tag{4}$$

$$\mathbf{g}_b^{(l,m)} = \sigma\left(\mathbf{U}_2^{(l,m)} \tanh\left(\mathbf{U}_1^{(l,m)} \bar{\mathbf{z}}_b\right) + \mathbf{b}^{(l,m)}\right), \tag{5}$$

$$\boldsymbol{\delta}_b^{(l,m)} = \lambda^{(l,m)} \left(\mathbf{e}_b^{(l,m)} \odot \mathbf{g}_b^{(l,m)}\right). \tag{6}$$

The projection parameters $\mathbf{W}_1^{(l,m)}$ and $\mathbf{W}_2^{(l,m)}$ map the deep representation into the target-module output space through a small bottleneck. The gating parameters $\mathbf{U}_1^{(l,m)}$, $\mathbf{U}_2^{(l,m)}$, and $\mathbf{b}^{(l,m)}$ filter the injected signal in a sample- and module-dependent manner. The scalar $\lambda^{(l,m)}$ is a learnable scale, $\odot$ denotes element-wise multiplication, and $\sigma(\cdot)$ is the sigmoid function. We initialize the gate bias negatively, so the Echo branch begins with a weak activation and is less likely to dominate early updates.
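A minimal sketch of Eqs. (3)-(6) for a single sample and target module, assuming L2 normalization for $\mathrm{Norm}$ and a gate bias of $-4$ (both illustrative choices; the paper does not pin these values down here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, d_out, bottleneck = 16, 16, 4   # illustrative sizes

z = rng.standard_normal(d)
z_bar = z / (np.linalg.norm(z) + 1e-6)   # Eq. (3), assuming L2 normalization

W1 = rng.standard_normal((bottleneck, d)) * 0.1
W2 = rng.standard_normal((d_out, bottleneck)) * 0.1
U1 = rng.standard_normal((bottleneck, d)) * 0.1
U2 = rng.standard_normal((d_out, bottleneck)) * 0.1
b_gate = -4.0 * np.ones(d_out)   # negative bias: gate starts near zero
lam = 1.0                        # learnable scale lambda, here fixed

e = W2 @ np.tanh(W1 @ z_bar)                    # Eq. (4): bottleneck projection
g = sigmoid(U2 @ np.tanh(U1 @ z_bar) + b_gate)  # Eq. (5): per-dimension gate
delta = lam * (e * g)                           # Eq. (6): gated injection vector

# The negative gate bias keeps the Echo branch weak at initialization.
assert g.max() < 0.1
```

With the gate nearly closed at initialization, early optimization is dominated by the ordinary LoRA path, matching the stability argument in the text.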
The final injected target-module output is

$$\widetilde{\mathbf{o}}_{b,t}^{(l,m)} = \mathbf{o}_{b,t}^{(l,m)} + r_k M_{b,t} \boldsymbol{\delta}_b^{(l,m)}, \tag{7}$$

where $\mathbf{o}_{b,t}^{(l,m)}$ is the original module output, $\widetilde{\mathbf{o}}_{b,t}^{(l,m)}$ is the injected output, $M_{b,t} \in \{0,1\}$ is the answer-region mask, and $r_k \in \{0,1\}$ is the stochastic routing variable at step $k$. Since the source layers come after the target layers in the forward computation, training uses two passes: an Echo-off pass that extracts the source boundary representation, followed by an Echo-on pass that injects this representation into shallow target modules and computes the losses.
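Eq. (7) is a broadcast add that touches only masked positions. A sketch with hypothetical shapes and mask boundaries:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d = 2, 6, 8                # illustrative sizes
o = rng.standard_normal((batch, seq_len, d))  # original target-module outputs
delta = rng.standard_normal((batch, d))       # per-sample injection, Eq. (6)

M = np.zeros((batch, seq_len))             # answer-region mask, Eq. (8)
M[0, 3:] = 1.0                             # answers start at position 3 (illustrative)
M[1, 4:] = 1.0
r_k = 1.0                                  # routing variable: Echo path active this step

# Eq. (7): add the broadcast injection only at masked (answer) positions.
o_tilde = o + r_k * M[:, :, None] * delta[:, None, :]

# Prompt positions are untouched; answer positions receive the per-sample shift.
assert np.allclose(o_tilde[0, :3], o[0, :3])
assert np.allclose(o_tilde[0, 3], o[0, 3] + delta[0])
```

Setting `r_k = 0.0` recovers the plain module output, which is exactly the deployed Echo-off path.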
Figure 1: Overall framework of Echo-LoRA. Boundary hidden states are extracted from deeper layers, aggregated into an echo representation, and injected into shallow LoRA modules during training.
### 3.3 Answer-Only Selective Injection

Applying the echo signal to every token would also perturb prompt positions, which normally do not contribute to the language-modeling loss. We restrict Echo injection to supervised answer positions for this reason.

For the $b$-th sample, the binary mask is constructed from the supervised language-modeling labels:

$$M_{b,t} = \begin{cases} 1, & y_{b,t} \neq -100, \\ 0, & \text{otherwise}. \end{cases} \tag{8}$$

Here $y_{b,t}$ is the supervised label at position $t$. Positions ignored by the language-modeling loss are typically marked as $-100$, so the mask retains Echo injection only where answer tokens are predicted.
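Since $-100$ is the conventional ignore index in standard LM training pipelines, Eq. (8) reduces to a single comparison; the label values below are hypothetical:

```python
import numpy as np

IGNORE_INDEX = -100  # positions excluded from the language-modeling loss

# Hypothetical labels for two samples: prompt tokens carry -100, answer tokens carry ids.
labels = np.array([[-100, -100, -100, 42, 7, 9],
                   [-100, -100, 11, 3, -100, -100]])

M = (labels != IGNORE_INDEX).astype(np.float32)  # Eq. (8)
assert M.tolist() == [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 0, 0]]
```

The mask is therefore free: it reuses the labels the supervised loss already needs, with no extra annotation.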
### 3.4 Stochastic Routing

Keeping the Echo path active throughout training can make the model rely too heavily on a branch that will be absent at inference time. Echo-LoRA uses stochastic routing to address this mismatch.

At training step $k$, a Bernoulli variable decides whether the Echo path is enabled:

$$r_k \sim \mathrm{Bernoulli}(p_k). \tag{9}$$

All target modules share the same routing state within a training step. The routing probability follows a linear decay:

$$p_k = p_{\text{start}} + \frac{k}{K-1}\left(p_{\text{end}} - p_{\text{start}}\right), \tag{10}$$

where $K$ is the total number of training steps and $p_{\text{start}} > p_{\text{end}}$. Early training exposes the target layers more often to the source-layer signal, while later training gradually shifts the model back toward the deployable Echo-off path.
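The schedule in Eqs. (9)-(10) can be sketched directly; `K = 11` and the defaults $p_{\text{start}} = 1.0$, $p_{\text{end}} = 0.2$ follow the experimental setup, while the seed is arbitrary:

```python
import numpy as np

def routing_prob(k, K, p_start=1.0, p_end=0.2):
    """Eq. (10): linearly decay the Echo routing probability over K steps."""
    return p_start + (k / (K - 1)) * (p_end - p_start)

K = 11
probs = [routing_prob(k, K) for k in range(K)]
assert abs(probs[0] - 1.0) < 1e-9 and abs(probs[-1] - 0.2) < 1e-9

rng = np.random.default_rng(0)
# Eq. (9): one Bernoulli draw per step, shared by all target modules.
r_k = rng.random() < routing_prob(0, K)
assert r_k  # p_0 = 1.0, so the Echo path is always on at the first step
```

One draw per step (rather than per module) keeps the Echo-on pass an all-or-nothing event, so a step either pays for the second forward pass or skips it entirely.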
### 3.5 Training Objective and Deployment

Training starts from an Echo-off forward pass, which gives the supervised loss $\mathcal{L}_{\text{off}}$. When routing activates the Echo branch, a second Echo-on pass produces $\mathcal{L}_{\text{on}}$ and a masked distillation loss $\mathcal{L}_{\text{kd}}$ on answer positions. The objective is
$$\mathcal{L}_k = \mathcal{L}_{\text{off}} + r_k\left(\mathcal{L}_{\text{on}} + \lambda_{\text{kd}} \mathcal{L}_{\text{kd}}\right), \tag{11}$$

where $r_k$ is the routing variable from Eq. (9) and $\lambda_{\text{kd}}$ is the distillation weight. If $r_k = 0$, only the base path is optimized; if $r_k = 1$, both paths are optimized and their answer-region predictions are constrained to remain close.
Let $\mathbf{q}_{b,t}^{\mathrm{on}} = \operatorname{softmax}(\mathbf{p}_{b,t}^{\mathrm{on}}/\tau)$ and $\mathbf{q}_{b,t}^{\mathrm{off}} = \operatorname{softmax}(\mathbf{p}_{b,t}^{\mathrm{off}}/\tau)$, where $\mathbf{p}_{b,t}$ denotes the output logits of each branch at position $t$. The distillation loss is

$$\mathcal{L}_{\mathrm{kd}} = \frac{\tau^{2}}{|\mathcal{A}|} \sum_{(b,t) \in \mathcal{A}} \operatorname{KL}\left(\mathbf{q}_{b,t}^{\mathrm{on}} \,\middle\|\, \mathbf{q}_{b,t}^{\mathrm{off}}\right). \tag{12}$$

The Echo-on branch serves as the teacher and is detached in the distillation term. Following standard knowledge distillation, the factor $\tau^2$ keeps the gradient scale comparable across temperatures. Unless otherwise stated, we set $\lambda_{\text{kd}} = 1.0$ and $\tau = 2.0$.
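A sketch of Eq. (12) over a batch of logits; the shapes are illustrative, and the teacher detachment is implicit here because plain NumPy carries no gradients:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_kd_loss(logits_on, logits_off, mask, tau=2.0):
    """Eq. (12): temperature-scaled KL(q_on || q_off) averaged over answer positions."""
    q_on = softmax(logits_on / tau)    # teacher (Echo-on); detached during training
    q_off = softmax(logits_off / tau)  # student (Echo-off)
    kl = (q_on * (np.log(q_on) - np.log(q_off))).sum(-1)   # per-position KL
    return tau ** 2 * (kl * mask).sum() / mask.sum()       # average over |A|

rng = np.random.default_rng(0)
batch, seq_len, vocab = 2, 5, 7       # illustrative sizes
logits = rng.standard_normal((batch, seq_len, vocab))
mask = np.ones((batch, seq_len))      # answer-region mask from Eq. (8)

# Identical branch logits give zero distillation loss.
assert abs(masked_kd_loss(logits, logits, mask)) < 1e-9
```

During training the loss pulls the Echo-off student toward the Echo-on teacher only at answer positions, which is what shrinks the train/inference gap.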
Echo modules are training-only components. At deployment, we disable the Echo path, so no echo extraction, projection, or injection is executed. The deployable model has the same inference structure as standard LoRA/DoRA; auxiliary Echo parameters may remain in checkpoints for analysis, but they are not used during generation.
## 4 Experiments

### 4.1 Experimental Setup

We use LLaMA-7B, LLaMA2-7B, and LLaMA3-8B as the main backbones, with LLaMA2-13B included in extended experiments. The commonsense reasoning setup follows the benchmark suite used by DoRA [[8](https://arxiv.org/html/2605.08177#bib.bib8)]. To make the source of each comparison explicit, we separate main-table comparisons against published baselines from reproduced LoRA comparisons under our unified implementation, which are reported in Appendix [A.1](https://arxiv.org/html/2605.08177#A1.SS1).

Table [1](https://arxiv.org/html/2605.08177#S4.T1) summarizes the datasets. The first eight datasets form the commonsense reasoning suite: their training sets are mixed during training, and their evaluation sets are tested separately. GSM8K, HumanEval, and MMLU are used for extended evaluation on mathematical reasoning, code generation, and multitask knowledge understanding.
Table 1: Datasets used in the experiments.

| Dataset | Task Type | Train/Dev Size | Test Size |
| --- | --- | --- | --- |
| BoolQ | Yes/no reasoning | 9,427 | 3,270 |
| PIQA | Physical commonsense | 16,113 | 1,838 |
| SIQA | Social commonsense | 33,410 | 1,954 |
| HellaSwag | Situation continuation | 39,905 | 10,042 |
| WinoGrande | Coreference reasoning | 40,398 | 1,267 |
| ARC-e | Science QA (Easy) | 2,251 | 2,376 |
| ARC-c | Science QA (Challenge) | 1,119 | 1,172 |
| OBQA | Open-book commonsense QA | 4,957 | 500 |
| GSM8K | Mathematical reasoning | 7,473 | 1,319 |
| HumanEval | Code generation | – | 164 |
| MMLU | Multitask knowledge | 285 | 14,042 |

Unless otherwise stated, our experiments use rank $r = 16$, scaling coefficient $\alpha = 32$, and apply LoRA/DoRA to the attention projections `q_proj`, `k_proj`, `v_proj`, and `o_proj`. The learning rate is $2 \times 10^{-4}$, LoRA dropout is 0.05, training lasts for 3 epochs, the maximum sequence length is 256, and bf16 training is used. For four-GPU training, each GPU uses batch size 2, resulting in total batch size 16.
For the Echo mechanism, the default source layers are $[-8, -7, -6, -5]$, the default target layers are $[4, 5, 6, 7]$, and injection is applied to `q_proj` and `v_proj`. The bottleneck dimension is 64. Stochastic routing uses $p_{\text{start}} = 1.0$ and $p_{\text{end}} = 0.2$. Answer-only masking and masked distillation are enabled by default, with $\lambda_{\text{kd}} = 1.0$ and $\tau = 2.0$.
### 4.2 Main Commonsense Reasoning Results

Table [2](https://arxiv.org/html/2605.08177#S4.T2) reports the LoRA-line results on eight commonsense reasoning datasets. Prefix, Series, Parallel, and LoRA values are taken from the DoRA paper [[8](https://arxiv.org/html/2605.08177#bib.bib8)]; Echo-LoRA values are obtained by our method. All average scores are computed from the eight displayed task scores.
Table 2: Accuracy comparison on eight commonsense reasoning datasets (%). Prefix, Series, Parallel, and LoRA are reported by DoRA [[8](https://arxiv.org/html/2605.08177#bib.bib8)]; Echo-LoRA is our method.

| Model | PEFT Method | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | Prefix | 64.3 | 76.8 | 73.9 | 42.1 | 72.1 | 72.9 | 54.0 | 60.6 | 64.6 |
| LLaMA-7B | Series | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
| LLaMA-7B | Parallel | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.2 |
| LLaMA-7B | LoRA | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| LLaMA-7B | Echo-LoRA | 63.6 | 82.9 | 77.6 | 93.7 | 84.5 | 85.0 | 70.1 | 81.2 | 79.8 |
| LLaMA2-7B | LoRA | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| LLaMA2-7B | Echo-LoRA | 72.7 | 83.7 | 80.5 | 94.0 | 85.9 | 87.6 | 74.7 | 84.6 | 83.0 |
| LLaMA3-8B | LoRA | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| LLaMA3-8B | Echo-LoRA | 75.6 | 90.2 | 82.5 | 96.6 | 89.8 | 93.6 | 82.5 | 89.2 | 87.5 |
Echo-LoRA improves the reported LoRA average on all three backbones. The gains are 5.1, 5.4, and 6.7 points for LLaMA-7B, LLaMA2-7B, and LLaMA3-8B, respectively, giving an average gain of 5.7 points. The largest gain is observed on LLaMA3-8B, suggesting that the Echo signal is not merely compensating for a weak backbone but can also complement a stronger one.
### 4.3 Combining Echo with DoRA

We next apply the same Echo mechanism to DoRA, yielding Echo-DoRA. Table [3](https://arxiv.org/html/2605.08177#S4.T3) compares DoRA and Echo-DoRA on the same eight commonsense reasoning tasks. DoRA values are from the DoRA paper [[8](https://arxiv.org/html/2605.08177#bib.bib8)], while Echo-DoRA values are our results. For numerical consistency, the average scores are recomputed from the displayed task scores.
Table 3: Accuracy comparison on eight commonsense reasoning datasets (%). DoRA values are from DoRA [[8](https://arxiv.org/html/2605.08177#bib.bib8)]; Echo-DoRA is our method.

| Model | PEFT Method | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | DoRA | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
| LLaMA-7B | Echo-DoRA | 69.6 | 82.6 | 81.2 | 93.4 | 83.6 | 86.1 | 70.7 | 83.2 | 81.3 |
| LLaMA2-7B | DoRA | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
| LLaMA2-7B | Echo-DoRA | 73.5 | 84.7 | 81.6 | 94.4 | 85.6 | 88.1 | 74.2 | 87.4 | 83.7 |
| LLaMA3-8B | DoRA | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
| LLaMA3-8B | Echo-DoRA | 75.1 | 89.0 | 81.9 | 97.1 | 88.5 | 91.3 | 81.6 | 86.8 | 86.4 |
Echo-DoRA improves the DoRA average by 2.9, 4.0, and 1.2 points on the three backbones, respectively, for an average gain of 2.7 points. These gains are smaller than those on the LoRA line, which is consistent with the view that Echo provides a complementary signal when the underlying adaptation method is already stronger.
### 4.4 Ablation Study

Ablations are conducted on LLaMA3-8B with the same training configuration. Table [4](https://arxiv.org/html/2605.08177#S4.T4) reports the average score over the eight commonsense reasoning tasks. The baseline in this table is our reproduced LLaMA3-8B LoRA baseline, which differs from the published LoRA row in Table [2](https://arxiv.org/html/2605.08177#S4.T2). Appendix [A.2](https://arxiv.org/html/2605.08177#A1.SS2) provides the task-level results.
Table 4: Ablation results on LLaMA3-8B (average accuracy over eight commonsense reasoning datasets).

| Setting | Avg. |
| --- | --- |
| A-0 Reproduced LoRA baseline | 84.7 |
| A-1 w/o Stochastic Routing | 78.7 |
| A-2 Deep → Deep | 84.8 |
| A-3 Shallow → Shallow | 86.2 |
| A-4 w/o Answer-Only Masking | 87.2 |
| A-5 `v_proj` only | 87.2 |
| A-6 w/o Answer-Only Masking + all attention projections | 87.3 |
| A-7 `q_proj` only | 87.4 |
| A-8 All attention projections | 87.5 |
| A-9 Full Echo-LoRA | 87.5 |

The score drops most sharply when stochastic routing is removed, so we regard routing as the most important stabilizing component in this set of ablations. Deep-to-deep and shallow-to-shallow variants underperform the default deep-to-shallow configuration, supporting our hypothesis that deeper semantic states are most useful when they guide shallower adaptation modules. Removing answer-only masking or using a single injected projection remains competitive but is slightly weaker than the default. Injecting all attention projections matches the default average score, yet it touches more modules and increases training cost; we keep the simpler `q_proj`/`v_proj` configuration.
### 4.5 Extended Task Results

We also test whether the same mechanism transfers beyond commonsense reasoning. GSM8K and MMLU are evaluated with accuracy, and HumanEval with pass@1. The baseline values in Table [5](https://arxiv.org/html/2605.08177#S4.T5) are from Flat-LoRA [[25](https://arxiv.org/html/2605.08177#bib.bib25)]; Echo-LoRA values are our results.
Table 5: Extended task results (%). Baselines are from Flat-LoRA [[25](https://arxiv.org/html/2605.08177#bib.bib25)]; Echo-LoRA is our method.

| Model | Dataset | Baseline | Echo-LoRA |
| --- | --- | --- | --- |
| LLaMA2-7B | GSM8K | 56.25 | 58.61 |
| LLaMA2-7B | HumanEval | 24.56 | 25.78 |
| LLaMA2-13B | MMLU | 52.27 | 53.90 |
| LLaMA2-13B | HumanEval | 13.78 | 15.85 |

Echo-LoRA improves over the corresponding baselines in all four extended settings. The gains are modest but consistent, suggesting that Echo is not tied to a single benchmark family and may be useful when tasks require reasoning, code synthesis, or broad knowledge integration.
### 4.6 Discussion

The Echo gains are larger on LoRA than on DoRA. We hypothesize that standard LoRA leaves more room for auxiliary semantic guidance, while DoRA has already improved the adaptation dynamics through weight decomposition. From the task side, commonsense reasoning, mathematical reasoning, and code generation all require contextual integration and conditional constraints. Our observations suggest that Echo-LoRA helps by giving shallow adaptation modules access to semantic information that is less local than their native layer states.

At deployment, we disable the Echo path. Inference uses the same low-rank update form as LoRA/DoRA and requires no Echo extraction, projection, or injection. The cost is paid during training: whenever routing activates Echo, a second forward pass is needed. With the routing probability decaying from $p_{\text{start}} = 1.0$ to $p_{\text{end}} = 0.2$, Echo-LoRA uses more training computation than standard LoRA while preserving the inference-time efficiency that makes low-rank adaptation attractive.
## 5 Conclusion

We introduced Echo-LoRA, a parameter-efficient fine-tuning method that uses cross-layer representation injection during training. The method extracts answer-boundary hidden states from deeper source layers, aggregates them into a sample-level echo representation, and injects this representation into shallow LoRA/DoRA modules. Answer-only masking, masked distillation, and stochastic routing are used to keep the auxiliary path stable and compatible with Echo-off inference.

Across three LLaMA backbones and eight commonsense reasoning benchmarks, Echo-LoRA improves the reported LoRA baselines by 5.7 points on average; under our reproduced LoRA implementation, the average gain is 3.0 points. Echo-DoRA improves DoRA by 2.7 points on average, and extended evaluations on GSM8K, HumanEval, and MMLU show additional positive gains. Because the Echo path is removed at inference time, the deployed model retains the standard LoRA/DoRA form and adds no inference parameters or computation.

Taken together, the results indicate that deeper representations can be useful not only for inference-time decoding strategies but also as training-time auxiliary signals for shallow PEFT modules. We view cross-layer information use as a promising direction for future PEFT research, especially when the auxiliary path can be removed before deployment.
## Appendix A Additional Results and Discussion

### A.1 Full Echo-LoRA Results on Commonsense Reasoning

Table [6](https://arxiv.org/html/2605.08177#A1.T6) gives the task-level LoRA-line results. Alongside the published LoRA values, we report reproduced LoRA baselines from our unified implementation. The gain $\Delta$ is computed as Echo-LoRA minus the reproduced LoRA baseline. Under this protocol, Echo-LoRA improves the average score by 6.2, 0.8, and 1.9 points on LLaMA-7B, LLaMA2-7B, and LLaMA3-8B, respectively, giving an average gain of 3.0 points.
Table 6: Full LoRA-line results on commonsense reasoning tasks (%). $\Delta$ denotes the absolute gain of Echo-LoRA over the reproduced LoRA baseline under our unified implementation.

| Model | PEFT Method | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | LoRA (reported) | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| LLaMA-7B | LoRA (reproduced) | 57.1 | 80.5 | 78.6 | 64.4 | 81.8 | 82.9 | 66.1 | 77.4 | 73.6 |
| LLaMA-7B | Echo-LoRA | 63.6 | 82.9 | 77.6 | 93.7 | 84.5 | 85.0 | 70.1 | 81.2 | 79.8 |
| LLaMA-7B | $\Delta$ | +6.5 | +2.4 | -1.0 | +29.3 | +2.7 | +2.1 | +4.0 | +3.8 | +6.2 |
| LLaMA2-7B | LoRA (reported) | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| LLaMA2-7B | LoRA (reproduced) | 71.3 | 84.4 | 80.5 | 93.7 | 85.3 | 87.5 | 73.1 | 82.0 | 82.2 |
| LLaMA2-7B | Echo-LoRA | 72.7 | 83.7 | 80.5 | 94.0 | 85.9 | 87.6 | 74.7 | 84.6 | 83.0 |
| LLaMA2-7B | $\Delta$ | +1.4 | -0.7 | +0.0 | +0.3 | +0.6 | +0.1 | +1.6 | +2.6 | +0.8 |
| LLaMA3-8B | LoRA (reported) | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| LLaMA3-8B | LoRA (reproduced) | 67.6 | 90.3 | 82.1 | 96.5 | 88.0 | 92.3 | 82.7 | 85.4 | 85.6 |
| LLaMA3-8B | Echo-LoRA | 75.6 | 90.2 | 82.5 | 96.6 | 89.8 | 93.6 | 82.5 | 89.2 | 87.5 |
| LLaMA3-8B | $\Delta$ | +8.0 | -0.1 | +0.4 | +0.1 | +1.8 | +1.3 | -0.2 | +3.8 | +1.9 |
### A.2 Full Ablation Results on LLaMA3-8B

Table [7](https://arxiv.org/html/2605.08177#A1.T7) reports the full task-level ablation results on LLaMA3-8B. Average scores are computed from the eight displayed task scores. The results point to stochastic routing as the dominant stabilizing factor in this configuration. Answer-only masking contributes more modestly, and deep-to-shallow injection is stronger than the alternative layer-direction choices.
Table 7: Full task-level ablation results on LLaMA3-8B (%).

| Setting | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A-0 Reproduced LoRA baseline | 66.6 | 89.3 | 81.8 | 95.5 | 87.0 | 91.3 | 81.7 | 84.4 | 84.7 |
| A-1 Full Echo-LoRA | 75.6 | 90.2 | 82.5 | 96.6 | 89.8 | 93.6 | 82.4 | 89.2 | 87.5 |
| A-2 w/o Answer-Only Masking | 75.0 | 90.0 | 81.8 | 96.7 | 89.7 | 93.1 | 83.7 | 87.8 | 87.2 |
| A-3 w/o Stochastic Routing | 72.9 | 81.8 | 82.7 | 92.0 | 34.2 | 93.2 | 84.1 | 88.8 | 78.7 |
| A-4 Deep → Deep | 70.6 | 89.6 | 82.8 | 96.5 | 75.4 | 93.0 | 82.6 | 88.2 | 84.8 |
| A-5 Shallow → Shallow | 73.4 | 88.5 | 83.0 | 94.5 | 86.8 | 93.7 | 82.3 | 87.0 | 86.2 |
| A-6 `q_proj` only | 75.7 | 90.3 | 83.0 | 96.5 | 89.2 | 92.3 | 82.9 | 89.4 | 87.4 |
| A-7 `v_proj` only | 75.3 | 90.1 | 82.3 | 96.5 | 88.7 | 93.1 | 82.4 | 88.4 | 87.2 |
| A-8 All attention projections | 75.7 | 90.3 | 82.0 | 96.5 | 89.5 | 93.1 | 83.0 | 90.2 | 87.5 |
| A-9 w/o Answer-Only Masking + all attention projections | 76.0 | 90.7 | 82.1 | 96.4 | 88.1 | 93.1 | 83.6 | 88.0 | 87.3 |
### A.3 Additional Observations
Echo gains vary across tasks. Datasets such as HellaSwag, WinoGrande, and OBQA require context integration and candidate discrimination, so the injected semantic signal may be more useful there. When the baseline is already strong, the gains are usually smaller. We interpret this pattern as evidence that Echo supplies task-relevant semantic cues beyond the original low-rank path, with the realized benefit depending on task difficulty, backbone capacity, and baseline strength.
Echo also yields larger gains on LoRA than on DoRA. When the underlying PEFT method is already strengthened by weight decomposition, the cross-layer signal appears to act more as a robust supplement than as a large shift in the adaptation ceiling. This pattern may help guide future combinations of Echo-style training paths with other PEFT frameworks.