# Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
Source: https://arxiv.org/html/2604.15351
Abdulmalek Saket, Royal Fenice Kft / ALETHEIA PROTOCOL research, Budapest, Hungary. abdulmalek@fenicebrand.com
(March 2026)
###### Abstract
Low-Rank Adaptation (LoRA) has become the dominant parameter-efficient fine-tuning method for large language models, yet standard practice applies LoRA adapters uniformly to all transformer layers regardless of their relevance to the downstream task. We introduce Aletheia, a gradient-guided layer selection method that identifies the most task-relevant layers via a lightweight gradient probe and applies LoRA adapters only to those layers, with asymmetric rank allocation. Across 81 experiment rows covering 14 successful models from 8 architecture families (0.5B–72B parameters, including dense and Mixture-of-Experts architectures), with one additional documented failed Pythia/GPT-NeoX attempt in Campaign 2, Aletheia achieves a 15–28% training speedup (mean 23.1%, p < 0.001) with bounded extra forgetting and broadly matched downstream behavior on the evaluated MMLU, GSM8K, and HumanEval benchmark pack. Across the tested families and scales, Campaign 1 shows a 100% per-model speed win rate and Campaign 2 shows broadly preserved downstream behavior within a bounded-degradation framing. Together these results support a practical model-economics claim: intelligent layer selection can make LoRA fine-tuning materially more efficient without introducing major downstream damage on the evaluated set.
## 1 Introduction
Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (Hu et al., 2022), have become essential for adapting large language models (LLMs) to downstream tasks without the prohibitive cost of full fine-tuning. Standard LoRA applies low-rank adapters uniformly across all attention and MLP layers, treating every transformer block as equally important for the target task.
This uniform approach is suboptimal: not all layers contribute equally to task-specific learning. Prior work on structured layer dropping and selective adaptation (Fan et al., 2020; Sharma et al., 2023; Zhang et al., 2023) suggests that transformer layers exhibit varying sensitivity to fine-tuning data, with some layers acting primarily as "pass-through" blocks that add minimal task-relevant transformation.
We propose Aletheia, a simple yet effective method that:
1. Performs a lightweight gradient probe (5 forward-backward passes) to measure per-layer gradient norms as a proxy for task relevance;
2. Selects the top-50% of layers by gradient magnitude;
3. Applies LoRA adapters with asymmetric rank allocation only to selected layers.
The key insight is that by skipping low-gradient layers, we eliminate unnecessary adapter computation and memory overhead while preserving, and sometimes improving, the quality achieved by standard full-layer LoRA.
Our contributions are:
- A gradient-guided layer selection algorithm that requires only 5 probe batches and adds negligible overhead (<2% of total training time);
- A broad cross-architecture evaluation of selective LoRA: 14 successful models, 8 families, 0.5B–72B parameters, including MoE (Mixtral 8×7B);
- Evidence of consistent speedup across the full Campaign 1 model set (100% win rate, p < 0.001) with bounded forgetting (≤0.50pp extra MMLU degradation on the core evaluated set);
- Full reproducibility: 3 seeds per model, paired statistical tests, and a frozen evidence bundle covering the reported experiments.
## 2 Related Work
#### Parameter-Efficient Fine-Tuning.
LoRA (Hu et al., 2022) injects trainable low-rank matrices into frozen transformer weights, reducing trainable parameters by 10–100× compared to full fine-tuning. Subsequent work includes QLoRA (Dettmers et al., 2023) (4-bit quantized base weights), DoRA (Liu et al., 2024) (weight-decomposed adaptation), and AdaLoRA (Zhang et al., 2023) (adaptive rank allocation). Most methods apply adapters to all layers uniformly.
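As a back-of-the-envelope illustration of that reduction (our numbers, not drawn from the paper): a single 4096×4096 projection matrix carries about 16.8M weights, while a rank-16 LoRA adapter for the same matrix adds only two thin factors.

```latex
% Worked example (illustrative figures, not from the paper): trainable parameters
% for one 4096 x 4096 projection under full fine-tuning vs. rank-16 LoRA.
\[
\underbrace{4096 \times 4096}_{\text{full fine-tuning}} \approx 1.68 \times 10^{7},
\qquad
\underbrace{4096 \times 16 + 16 \times 4096}_{\text{LoRA},\; r = 16} \approx 1.31 \times 10^{5},
\]
\[
\frac{1.68 \times 10^{7}}{1.31 \times 10^{5}} \approx 128\times
\text{ fewer trainable parameters for that matrix.}
\]
```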
#### Layer Importance and Selection.
LayerDrop (Fan et al., 2020) applies structured dropout at the layer level during training. LASER (Sharma et al., 2023) identifies that removing specific low-rank components from certain layers can improve model truthfulness. These findings motivate our gradient-based approach to selective adaptation.
#### Adaptive LoRA.
AdaLoRA (Zhang et al., 2023) dynamically adjusts rank during training via importance scoring. Our approach differs by making a binary layer selection decision *before* training begins, based on a fast gradient probe, which is simpler and incurs no training-time overhead.
## 3 Method
### 3.1 Overview
Given a pretrained model $\mathcal{M}$ with $L$ transformer layers and a fine-tuning dataset $\mathcal{D}$, Aletheia proceeds in three stages:
1. Gradient Probe (§3.2): Compute per-layer gradient norms on a small sample of $\mathcal{D}$.
2. Layer Selection (§3.3): Select the top-$k\%$ layers by gradient magnitude.
3. Selective LoRA Training (§3.4): Apply LoRA adapters only to selected layers with asymmetric rank allocation, then train for the same number of steps as standard LoRA.
### 3.2 Gradient Probe
For each layer $\ell \in \{0, \ldots, L-1\}$, we compute the accumulated gradient norm:

$$g_{\ell} = \sum_{b=1}^{B} \left\| \nabla_{\theta_{\ell}} \mathcal{L}(x_{b}; \theta) \right\|_{2} \qquad (1)$$

where $B = 5$ probe batches, $\theta_{\ell}$ denotes the parameters of layer $\ell$, and $\mathcal{L}$ is the causal language modeling loss.
To maintain bounded GPU memory, we process layers in chunks of 8: for each chunk, only the parameters in layers $[\ell_{\text{start}}, \ell_{\text{end}})$ have requires_grad=True, while all other parameters are frozen. After processing all chunks, gradient norms are normalized and ranked.
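A minimal PyTorch sketch of this probe, written from Equation (1) and the chunking description above; the function name, the Hugging Face Llama-style `model.model.layers` layout, and the loss call are our assumptions rather than the authors' code.

```python
import torch

def run_gradient_probe(model, probe_batches, num_layers, chunk_size=8):
    """Accumulate per-layer gradient norms g_ell (Eq. 1) over B probe batches.

    Layers are processed in chunks of `chunk_size` so that only one chunk's
    parameters require gradients at a time, keeping peak GPU memory bounded.
    Illustrative sketch; module layout assumes an HF Llama-style decoder.
    """
    grad_norms = torch.zeros(num_layers)

    for start in range(0, num_layers, chunk_size):
        chunk = range(start, min(start + chunk_size, num_layers))

        # Freeze everything, then unfreeze only the current chunk of layers.
        for p in model.parameters():
            p.requires_grad_(False)
        for idx in chunk:
            for p in model.model.layers[idx].parameters():
                p.requires_grad_(True)

        for batch in probe_batches:  # B = 5 probe batches in the paper
            model.zero_grad(set_to_none=True)
            out = model(**batch, labels=batch["input_ids"])  # causal LM loss
            out.loss.backward()

            # Per-batch L2 norm of each layer's gradient, summed over batches (Eq. 1).
            with torch.no_grad():
                for idx in chunk:
                    sq = torch.stack([
                        p.grad.norm(2) ** 2
                        for p in model.model.layers[idx].parameters()
                        if p.grad is not None
                    ]).sum()
                    grad_norms[idx] += sq.sqrt().item()

        model.zero_grad(set_to_none=True)

    # Normalize before ranking, as described in Section 3.2.
    return grad_norms / grad_norms.sum()
```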
### 3.3 Layer Selection
Layers are ranked by $g_{\ell}$ in descending order. The top-$k\%$ (default $k = 50$) are selected:

$$S = \text{top-}k\%\,\{(\ell, g_{\ell}) : \ell \in [0, L)\} \qquad (2)$$

The selected set $S$ identifies the "task-relevant" layers that show the highest sensitivity to the fine-tuning data.
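A correspondingly small selection step for Equation (2); the function name and the rounding choice at $k = 50\%$ are our assumptions.

```python
def select_layers(grad_norms, keep_fraction=0.5):
    """Return the indices of the top-k% layers by probe gradient norm (Eq. 2)."""
    num_layers = len(grad_norms)
    k = max(1, round(keep_fraction * num_layers))
    # Rank layers by descending gradient norm and keep the first k indices.
    ranked = sorted(range(num_layers), key=lambda i: float(grad_norms[i]), reverse=True)
    return sorted(ranked[:k])

# Example usage: selected = select_layers(run_gradient_probe(model, probe_batches, L))
```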
### 3.4 Selective LoRA Training
LoRA adapters (rank $r = 16$, $\alpha = 32$) are applied only to the attention and MLP modules in layers $\ell \in S$. Both Standard LoRA (all layers) and Aletheia (selected layers) use the same optimization hyperparameters:
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-7}$, weight decay 0.01)
- Learning rate: $5 \times 10^{-4}$ (scaled per model), cosine schedule with 20-step warmup
- Training steps: 200 fixed for the matched Campaign 1 / Campaign 2 comparisons; 250 for the compute-matched Campaign 2 runs
- Gradient accumulation: 2 steps
- Precision: bf16 (Qwen, Phi) or fp16 (Llama, Mistral, others); QLoRA 4-bit for ≥7B on 16GB
By adapting 50% of layers, Aletheia reduces the number of trainable LoRA parameters by ∼4–16% and, more importantly, eliminates the forward/backward computation for adapter modules in skipped layers, yielding a 15–28% wall-clock speedup.
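For the training stage, a plausible configuration using the Hugging Face peft library is sketched below; the paper does not state that peft was used, so the `layers_to_transform` / `rank_pattern` route, the Llama-style module names, and the example layer indices are all our assumptions.

```python
from peft import LoraConfig, get_peft_model

# Hypothetical output of the probe/selection step above (16 of 32 layers).
selected = [2, 3, 5, 7, 9, 11, 12, 14, 17, 19, 21, 23, 25, 27, 29, 30]

lora_cfg = LoraConfig(
    r=16,                                   # base rank from Section 3.4
    lora_alpha=32,                          # alpha = 32
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention
                    "gate_proj", "up_proj", "down_proj"],     # MLP (Llama-style names)
    layers_to_transform=selected,           # adapters only in gradient-selected layers
    # Asymmetric rank allocation could be approximated with a per-module rank
    # override, e.g. rank_pattern={"down_proj": 64}, if the installed peft
    # version supports it; this is an assumption, not the paper's exact recipe.
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_cfg)  # `base_model` loaded elsewhere
model.print_trainable_parameters()
```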
### 3.5 AutoResearch Recipe Discovery (Supporting Evidence)
In addition to the cross-family "Aletheia Matched" protocol used throughout this paper, we ran a separate automated recipe-search pipeline ("AutoResearch for LoRA") on Qwen2.5-3B. This pipeline runs a gradient probe, executes an 8-arm quick scan (150 steps), advances top candidates to full runs (500 steps), performs push experiments, and then validates the winner with a 12-run, 3-seed factorial ablation. The search-stage winner was ffn_lr_high (12 gradient-selected layers, MLP rank 64, attention rank 16), which established 12 layers as the best quick-scan tradeoff. A later 18-layer higher-rank push matched the baseline quality frontier before causal ablation revised the final best to Attn16 @ lr=2e-4 (mean eval loss 0.3444 ± 0.0012), with FFN-only at the same LR remaining a valid efficiency trade (0.3451 ± 0.0011). Taken together, these search stages show that layer count, learning rate, and module/rank allocation materially affect LoRA quality even before the broader cross-family validation pass. This pipeline achieved a 3.8× wall-clock speedup relative to the full LoRA baseline on Qwen2.5-3B while matching or slightly exceeding baseline quality, but it is a *single-model* result and is therefore presented as supporting evidence rather than as a cross-family headline. We keep the cross-family claims anchored to the "Aletheia Matched" protocol (fixed steps, paired baselines), and treat AutoResearch as evidence that a systematic pipeline can discover and refine strong recipes without manual tuning.
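For readability, the pipeline stages can be summarized in a small schematic; the structure below is our paraphrase of the paragraph above, and every field name is illustrative rather than the authors' configuration format.

```python
# Schematic of the AutoResearch stages on Qwen2.5-3B (our paraphrase).
# Arm names other than ffn_lr_high are not reported in the paper.
AUTORESEARCH_STAGES = [
    {"stage": "gradient_probe"},
    {"stage": "quick_scan", "arms": 8, "steps": 150},         # winner: ffn_lr_high
    {"stage": "full_runs", "candidates": "top arms", "steps": 500},
    {"stage": "push_experiments", "example": "18 layers, higher rank"},
    {"stage": "factorial_ablation", "runs": 12, "seeds": 3},  # final best: Attn16 @ lr=2e-4
]
```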
## 4 Experimental Setup
### 4.1 Hardware
All experiments were conducted on the CINECA Leonardo HPC system using NVIDIA A100-SXM4-64GB GPUs. Each experiment used a single GPU node (120GB system memory, 16 CPUs), except Mixtral 8×7B, which required 4×A100 with QLoRA 4-bit quantization.
### 4.2 Models
We evaluate across 14 successful models from 8 architecture families spanning 4 weight tiers (Table 1).
Table 1: Models evaluated across two experimental campaigns. Pythia-1.4B is omitted from Table 1 because all Campaign 2 seeds failed under both recipes with fp16 NaN losses. Those failed runs remain part of the 81-row campaign ledger and are discussed in Section 6.
### 4.3 Training Data
We use the Aletheia Bootstrap dataset, a curated Alpaca-style instruction-following dataset designed for efficient adapter training. The paired cross-family comparisons in Campaign 1 and Campaign 2 use 200 fixed training steps; the compute-matched variant in Campaign 2 extends Aletheia to 250 steps (+25%) to spend the saved wall-clock budget. Batch size varies by model and GPU memory, with gradient accumulation of 2.
### 4.4 Evaluation Benchmarks
- MMLU (Hendrycks et al., 2021): 200-question subset in both campaigns for broad knowledge assessment.
- GSM8K (Cobbe et al., 2021): 200-question subset for mathematical reasoning (Campaign 2 only).
- HumanEval (Chen et al., 2021): 164 coding problems for code generation (Campaign 2 only).
- Eval Loss: Held-out validation cross-entropy loss.
### 4.5 Statistical Protocol
Each model is trained with 3 seeds (42, 123, 999). We report per-model means and standard deviations. Overall significance is assessed via a paired t-test across all 30 Campaign 1 speed comparisons (t = 9.518, p < 0.001, Cohen's d = 1.74). All tables report mean ± SD computed from 3-seed runs.
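A minimal sketch of how such a paired comparison can be computed with SciPy; the variable names and the exact effect-size convention (Cohen's d on the paired differences) are our choices, not necessarily the authors'.

```python
import numpy as np
from scipy import stats

def paired_speed_test(standard_times, aletheia_times):
    """Paired t-test over matched (model, seed) wall-clock training times."""
    standard = np.asarray(standard_times, dtype=float)
    aletheia = np.asarray(aletheia_times, dtype=float)
    diffs = standard - aletheia                      # positive = Aletheia is faster

    t_stat, p_value = stats.ttest_rel(standard, aletheia)
    cohens_d = diffs.mean() / diffs.std(ddof=1)      # effect size on the paired differences
    mean_speedup_pct = 100.0 * (diffs / standard).mean()

    return t_stat, p_value, cohens_d, mean_speedup_pct
```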
### 4.6 Protocol Naming
To avoid confusion, we use the following names consistently: Aletheia Matched refers to the main cross-family protocol in this paper (fixed step count, paired baseline). Compute-matched refers to the variant that trains Aletheia for additional steps to match Standard LoRA wall-clock time. AutoResearch refers to the automated recipe-discovery pipeline on Qwen2.5-3B (Section 3.5).
## 5 Results
### 5.1 Training Speedup
Campaign 1 provides direct wall-clock timing comparisons across 10 models (Table 2).
Table 2: Training speedup of Aletheia vs. Standard LoRA (Campaign 1, 3-seed mean ± SD). Figure 1 visualizes the per-model speedups with 95% confidence intervals.
Figure 1: Training speedup of Aletheia vs. Standard LoRA across 10 models (3-seed means with 95% CI error bars). All models show positive speedup with tight confidence intervals, confirming reproducibility.
Key findings:
- 100% win rate: All 30 experiments (10 models × 3 seeds) show positive speedup.
- Overall significance: Paired t-test yields t = 9.518, p < 0.001, Cohen's d = 1.74 (large effect).
- Scale-independent: Speedups range from 15.8% (72B) to 27.8% (14B), with no degradation at scale.
- Architecture-independent: Both GQA (Qwen, Llama) and MHA (Mistral, Phi) architectures benefit (Figure 3).
### 5.2 Benchmark Quality: MMLU
Table 3 shows the MMLU forgetting analysis for Campaign 1. "Extra forgetting" is defined as the difference between Aletheia's MMLU delta and Standard LoRA's MMLU delta.
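In symbols (our notation), with $\text{MMLU}^{\text{pre}}$ the pre-fine-tuning score of the base model:

```latex
\[
\Delta_{\text{Aletheia}} = \text{MMLU}^{\text{post}}_{\text{Aletheia}} - \text{MMLU}^{\text{pre}},
\qquad
\Delta_{\text{Std}} = \text{MMLU}^{\text{post}}_{\text{Std}} - \text{MMLU}^{\text{pre}},
\]
\[
\text{extra forgetting} = \Delta_{\text{Aletheia}} - \Delta_{\text{Std}}.
\]
```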
Table 3: MMLU forgetting analysis (Campaign 1, 3-seed means). Forgetting ≤ 2pp for all models. MMLU degradation is negligible: maximum extra forgetting is 1.8pp (TinyLlama, where Aletheia actually *recovers* from Standard LoRA's forgetting). Models ≥ 14B show no material negative forgetting: Qwen-14B slightly improves under both recipes, while 70B and 72B are flat.
### 5.3 Multi-Benchmark Quality: GSM8K and HumanEval
Campaign 2 evaluates downstream task quality beyond MMLU (Table 4).
Table 4: Downstream benchmark deltas (Aletheia − Standard, 3-seed means, Campaign 2). Values near zero indicate matched performance. Across the core models used for the bounded-quality claim (Qwen 3B/7B, Llama 8B, Mixtral), MMLU remains within 1pp. GSM8K and HumanEval deltas are mixed but remain bounded in the core set, while weaker models (StableLM, GPT-J) show more variable downstream behavior.
Figure 2: Campaign 2 benchmark deltas (Standard LoRA vs. Aletheia) with 95% CI error bars across 6 models and 3 benchmarks. Taken together with the per-model means, the intervals support a bounded-delta interpretation on the evaluated set rather than a quality-collapse story.
Figure 3: Speedup by architecture family (Campaign 1, 95% CI). All 5 Campaign 1 families show consistent, statistically significant speedup from gradient-guided layer selection.
### 5.4 Mixture-of-Experts: Mixtral 8×7B
Aletheia's first evaluation on an MoE architecture confirms that gradient-guided layer selection generalizes beyond dense transformers. For Mixtral (46B total parameters, QLoRA 4-bit), Aletheia adapts 16/32 layers (heuristic top-50% selection) and achieves:
- MMLU forgetting: Δ = 0.000 across all 3 seeds
- Reliable completion: all 6 runs (3 seeds × 2 recipes) finished successfully
- 50% fewer adapted layers with matched downstream quality
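A plausible 4-bit loading setup for this configuration, assuming the transformers + bitsandbytes QLoRA path; the checkpoint name, NF4 settings, and device mapping are our assumptions rather than details reported in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Illustrative QLoRA 4-bit loading for Mixtral 8x7B; exact flags are not given in the paper.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",      # assumed checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",                  # shards across the 4 A100s mentioned in Section 4.1
)
model = prepare_model_for_kbit_training(model)
# LoRA adapters are then attached to the 16 gradient-selected layers as in Section 3.4.
```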
### 5.5 Compute-Matched Analysis
In the compute-matched setting, Aletheia trains the same selected layers for additional steps to match Standard LoRA's total wall-clock time (Table 5, Figure 4).
Table 5: Compute-matched Aletheia vs. Standard LoRA.