Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs


Summary

This paper reveals that zeroth-order fine-tuning of LLMs is dominated by a single decoding layer, which can be identified by activation outliers, and fine-tuning only that layer matches or exceeds full-model fine-tuning with up to 4.52x speedup.

arXiv:2606.05516v1 Announce Type: new Abstract: Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through remaining subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to 4.52$\times$ training speedup.

Cached at: 06/05/26, 08:11 AM

# Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs
Source: [https://arxiv.org/html/2606.05516](https://arxiv.org/html/2606.05516)
Wanhao Yu¹, Ziyan Wang¹, Zheng Wang², Abeer Matar Almalky³, Yihang Zuo⁴, Shuteng Niu⁵, Sen Lin², Adnan Siraj Rakin³, Deliang Fan⁴, Li Yang¹†

¹University of North Carolina at Charlotte, ²University of Houston, ³State University of New York at Binghamton, ⁴Arizona State University, ⁵Department of Artificial Intelligence and Informatics, Mayo Clinic

###### Abstract

Zeroth\-order \(ZO\) optimization enables memory\-efficient fine\-tuning of large language models \(LLMs\) using only forward passes, but it remains unclear how useful adaptation is distributed across layers\. In this work, we reveal a surprising phenomenon: ZO fine\-tuning is sharply dominated by a single decoding layer\. Across multiple LLM families and downstream tasks, fine\-tuning this dominant layer alone consistently matches or even exceeds full\-model ZO fine\-tuning\. We further show that the dominant layer is task\-agnostic but model\-specific, and can be identified before training through a simple inference\-only analysis of activation outliers\. Specifically, the dominant layer consistently aligns with the first activation\-outlier layer in the pre\-trained model\. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization\. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation\-induced effects to propagate and accumulate through remaining subsequent decoding layers\. As a result, this layer produces disproportionately strong and stable optimization signals under forward\-only updates\. Extensive experiments on LLaMA2\-7B and Qwen3\-8B across nine benchmarks show that dominant\-layer ZO fine\-tuning improves average performance over full\-model MeZO and LoRA\-based ZO fine\-tuning while achieving up to 4\.52× training speedup\.

†Corresponding author\.

## 1 Introduction

Zeroth\-order \(ZO\) optimization has recently emerged as a promising approach for memory\-efficient fine\-tuning of large language models \(LLMs\) \[[39](https://arxiv.org/html/2606.05516#bib.bib24)\]\. Instead of computing first\-order \(FO\) gradients through backpropagation, ZO methods estimate update directions using only forward evaluations, typically by measuring loss differences under random parameter perturbations \[[29](https://arxiv.org/html/2606.05516#bib.bib17)\]\. Building on this idea, MeZO shows that pre\-trained LLMs can be fine\-tuned with near inference\-level memory \[[22](https://arxiv.org/html/2606.05516#bib.bib1)\]\. Subsequent studies further improve convergence and accuracy by reducing the variance of ZO gradient estimates through sparse parameter perturbation \[[21](https://arxiv.org/html/2606.05516#bib.bib2),[14](https://arxiv.org/html/2606.05516#bib.bib25)\], low\-rank or structured perturbation spaces \[[6](https://arxiv.org/html/2606.05516#bib.bib27),[20](https://arxiv.org/html/2606.05516#bib.bib29)\], and more stable or faster optimizer designs \[[5](https://arxiv.org/html/2606.05516#bib.bib51),[9](https://arxiv.org/html/2606.05516#bib.bib5)\]\. Despite these advances, existing methods largely treat ZO fine\-tuning as a full\-model process, without explaining how useful adaptation differs across layers\. This leaves a fundamental question unanswered: under forward\-only updates, where does useful optimization actually occur inside LLM architectures?

In this work, we reveal an unexpected phenomenon: useful ZO adaptation is not spread broadly across layers, but is dominated by a single layer\.

To study this phenomenon, we first conduct a systematic layer\-wise analysis across multiple LLMs and downstream tasks, where we fine\-tune one layer at a time under identical ZO updates while freezing all other layers\. The results show a highly uneven layer\-wise pattern: most layers provide little or no improvement over the no\-fine\-tuning baseline, while a single layer consistently achieves performance comparable to, or even exceeding, full\-model ZO fine\-tuning\. We refer to this layer as the dominant layer\. Moreover, the dominant layer is task\-agnostic but model\-specific: for a given LLM, the same layer consistently dominates across tasks, while different model families may have different dominant\-layer indices\. In contrast, under matched first\-order gradient fine\-tuning, improvements are more evenly spread across layers, and no single layer consistently dominates\. This contrast shows that the dominant\-layer phenomenon is specific to ZO optimization, which follows a different layer\-wise adaptation pattern from first\-order \(FO\) fine\-tuning\.

We further study how to efficiently identify this dominant layer without expensive layer\-wise ZO fine\-tuning\. Inspired by the known activation\-outlier phenomenon in LLMs \[[11](https://arxiv.org/html/2606.05516#bib.bib34),[34](https://arxiv.org/html/2606.05516#bib.bib35)\], where a small number of activations exhibit extremely large magnitudes at specific dimension indices across layers in an input\-independent manner \[[30](https://arxiv.org/html/2606.05516#bib.bib13),[2](https://arxiv.org/html/2606.05516#bib.bib16)\], we find that the dominant layer aligns with the first layer where activation outliers emerge\. Based on this observation, we design a simple inference\-only selection method: given a small calibration set, we run the pre\-trained LLM forward, measure layer\-wise activation statistics, and select the first layer that shows a clear outlier pattern\. This method avoids exhaustive layer\-wise ZO fine\-tuning and identifies the dominant layer before training begins\.

Finally, we explain why this dominant layer emerges under ZO fine\-tuning\. Unlike first\-order optimization, ZO estimates updates only from final\-loss differences caused by random perturbations\. Therefore, a layer can contribute more to ZO fine\-tuning when its perturbation has a stronger effect on the forward computation\. We find that the dominant layer satisfies this condition because it appears early in the model and aligns with the first activation\-outlier layer\. Perturbations at this layer enter the residual stream and affect the activations of all remaining layers\. This propagation allows the perturbation effect to be repeatedly transformed and accumulated before reaching the final loss, leading to larger final\-loss changes and a more stable forward signal for ZO updates\.

This finding of a dominant layer in ZO fine\-tuning has both practical and conceptual implications\. Practically, because most useful ZO adaptation comes from the dominant layer, ZO fine\-tuning can significantly reduce training cost while preserving full\-model performance\. More importantly, we hope this finding provides insight for future ZO method design, such as explicitly considering where useful updates arise across layers or making updates to non\-dominant layers more effective\.

Our contributions can be summarized as follows:

- We discover a dominant\-layer phenomenon in ZO fine\-tuning: tuning a single layer can recover, and sometimes exceed, the performance of full\-model ZO fine\-tuning\.
- We show that the dominant layer is task\-agnostic but model\-specific, and can be efficiently identified before training using the first activation\-outlier layer\.
- We explain why the dominant layer learns well under ZO fine\-tuning: residual connection propagation amplifies its perturbation effect, leading to larger final\-loss changes and stronger ZO update signals\.
- We validate dominant\-layer ZO fine\-tuning through extensive experiments on two LLMs, LLaMA2\-7B and Qwen3\-8B, across nine downstream tasks\. Compared to MeZO, dominant\-layer ZO fine\-tuning improves the average score by 0\.86% over full\-model and 0\.61% over LoRA\-based \[[16](https://arxiv.org/html/2606.05516#bib.bib31)\] ZO fine\-tuning\. In addition, it achieves a 1\.12× to 4\.52× speedup in ZO fine\-tuning relative to full\-model MeZO\.

## 2 Related Work

#### Zeroth\-order LLM Fine\-tuning\.

Zeroth\-order \(ZO\) optimization estimates update directions from function values rather than explicit backpropagated gradients; classical methods such as SPSA form its foundation \[[29](https://arxiv.org/html/2606.05516#bib.bib17),[23](https://arxiv.org/html/2606.05516#bib.bib19)\]\. Recently, MeZO \[[22](https://arxiv.org/html/2606.05516#bib.bib1)\] first showed that LLMs can be fine\-tuned for downstream tasks with inference\-level memory, making ZO a memory\-efficient alternative to backpropagation for large models\. In practice, MeZO estimates gradients by applying random perturbations to model parameters and measuring the loss difference between two forward passes, without storing intermediate activations for backward propagation\.

To reduce gradient\-estimation variance and accelerate convergence, follow\-up work improves ZO fine\-tuning mainly along three directions\. First, one line of work reduces the trainable or perturbed parameter scope through sparse parameter selection \[[21](https://arxiv.org/html/2606.05516#bib.bib2)\], transferable static sparsity \[[14](https://arxiv.org/html/2606.05516#bib.bib25)\], or random layer\-wise sparse updates \[[33](https://arxiv.org/html/2606.05516#bib.bib4)\]\. Second, another line of work reduces gradient\-estimation variance by designing more informative perturbation directions, including low\-rank directions in LOZO \[[6](https://arxiv.org/html/2606.05516#bib.bib27)\], random subspaces in SubZero \[[37](https://arxiv.org/html/2606.05516#bib.bib28)\], activation\-derived directions in AGZO \[[20](https://arxiv.org/html/2606.05516#bib.bib29)\], and curvature\-aware directions in HiZOO \[[41](https://arxiv.org/html/2606.05516#bib.bib26)\]\. Third, optimizer\-level methods modify the update rule to improve optimization speed and stability, including layer\-wise calibration in DiZO \[[31](https://arxiv.org/html/2606.05516#bib.bib8)\], clipping and annealing in HELENE \[[40](https://arxiv.org/html/2606.05516#bib.bib6)\], faster estimators in FZOO \[[9](https://arxiv.org/html/2606.05516#bib.bib5)\], and learned update rules in ZO Fine\-tuner \[[38](https://arxiv.org/html/2606.05516#bib.bib7)\]\. Unlike these works, which improve how ZO updates are estimated or applied, our work studies where useful ZO adaptation occurs inside the model and shows that it is dominated by a single layer\.

#### Selective Layer\-wise Fine\-tuning\.

Layer\-wise fine\-tuning and layer\-importance analysis have been widely studied in first\-order LLM adaptation\. Parameter\-efficient tuning methods such as adapters and LoRA \[[15](https://arxiv.org/html/2606.05516#bib.bib30),[16](https://arxiv.org/html/2606.05516#bib.bib31)\], together with intrinsic\-dimensionality analyses \[[1](https://arxiv.org/html/2606.05516#bib.bib32)\], suggest that effective adaptation often lies in a smaller update space than full\-model fine\-tuning implies\. Recent methods further exploit layer\-wise importance: LISA selectively freezes middle layers \[[24](https://arxiv.org/html/2606.05516#bib.bib33)\], ILA identifies layers critical for alignment \[[27](https://arxiv.org/html/2606.05516#bib.bib10)\], and IST/OwLore update selected layers based on layer importance or outlier\-weighted sampling \[[36](https://arxiv.org/html/2606.05516#bib.bib12),[19](https://arxiv.org/html/2606.05516#bib.bib11)\]\. In contrast, to the best of our knowledge, our work is the first to systematically analyze how layer\-wise fine\-tuning behaves under ZO optimization\.

#### Outlier Activations in LLMs\.

Outlier activation is a common phenomenon in LLMs, first highlighted by LLM\.int8\(\) \[[11](https://arxiv.org/html/2606.05516#bib.bib34)\] as a unique challenge for model compression: a small number of activation dimensions exhibit extremely large magnitudes compared to the average of the activation distribution\. One important property is that these outliers consistently appear in the same activation dimensions across different inputs, suggesting that they come from the model structure rather than from any specific input sample \[[30](https://arxiv.org/html/2606.05516#bib.bib13),[2](https://arxiv.org/html/2606.05516#bib.bib16)\]\. Following this observation, a series of works study how to address activation outliers for efficient model compression, especially for quantization and pruning\. For example, LLM\.int8\(\) isolates outlier features for mixed\-precision inference \[[11](https://arxiv.org/html/2606.05516#bib.bib34)\], while SmoothQuant migrates activation outlier difficulty into weights to enable accurate low\-bit quantization \[[34](https://arxiv.org/html/2606.05516#bib.bib35)\]\.

## 3 Dominant Layer in ZO Fine\-Tuning: Discovery and Identification

### 3\.1 Preliminary: Zeroth\-Order Optimization

Following the classical two\-point SPSA estimator \[[29](https://arxiv.org/html/2606.05516#bib.bib17)\] adopted by MeZO for ZO fine\-tuning of LLMs \[[22](https://arxiv.org/html/2606.05516#bib.bib1)\], we estimate gradients from two perturbed loss evaluations\. At iteration $t$, with parameters $\theta_t$ and minibatch $\mathcal{B}_t$, we sample a random perturbation vector $z_t$ and compute the ZO gradient estimate as

$$\widehat{g}_t = \frac{\mathcal{L}(\theta_t + \epsilon z_t; \mathcal{B}_t) - \mathcal{L}(\theta_t - \epsilon z_t; \mathcal{B}_t)}{2\epsilon}\, z_t, \tag{1}$$

and the parameter update is

$$\theta_{t+1} = \theta_t - \eta_t \widehat{g}_t, \tag{2}$$

where $\epsilon$ is the perturbation scale and $\eta_t$ is the learning rate\.
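To make Eqs\. \(1\) and \(2\) concrete, the following is a minimal sketch of one two\-point ZO\-SGD step on a toy quadratic loss\. The function names `zo_sgd_step` and `toy_loss`, the Gaussian perturbation, and all hyperparameter values are illustrative assumptions, not the authors' implementation\.

```python
import random


def zo_sgd_step(theta, loss_fn, eps, lr, rng):
    """One two-point SPSA-style step: Eq. (1) for the estimate, Eq. (2) for the update."""
    # Sample a random perturbation direction z_t (Gaussian is an assumption here).
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    theta_plus = [t + eps * zi for t, zi in zip(theta, z)]
    theta_minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Scalar finite-difference coefficient; the gradient estimate is this scalar times z.
    coeff = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2.0 * eps)
    return [t - lr * coeff * zi for t, zi in zip(theta, z)]


def toy_loss(theta):
    """Hypothetical stand-in for the minibatch loss: a quadratic minimized at all ones."""
    return sum((t - 1.0) ** 2 for t in theta)


rng = random.Random(0)
theta = [0.0] * 4
initial_loss = toy_loss(theta)
for _ in range(500):
    theta = zo_sgd_step(theta, toy_loss, eps=1e-3, lr=0.02, rng=rng)
final_loss = toy_loss(theta)
```

Only two loss evaluations are needed per step, and no gradients are stored, which is where the memory savings of this family of methods come from\.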

### 3\.2 Empirical Discovery: A Dominant Layer Exists in ZO Fine\-Tuning

We begin by analyzing how ZO fine\-tuning behaves across layers by isolating each layer’s contribution\. Specifically, based on MeZO, we fine\-tune one layer at a time while freezing all other layers, using the same ZO update configuration for every layer\. We conduct this study on LLaMA2\-7B \[[32](https://arxiv.org/html/2606.05516#bib.bib49)\] across multiple tasks, including WSC \[[18](https://arxiv.org/html/2606.05516#bib.bib44)\], COPA \[[26](https://arxiv.org/html/2606.05516#bib.bib46)\], and DROP \[[12](https://arxiv.org/html/2606.05516#bib.bib48)\], which cover classification, multiple\-choice, and generation settings\.

As shown in Figure [1](https://arxiv.org/html/2606.05516#S3.F1), we make two key observations:

\(1\) ZO fine\-tuning is highly uneven across layers\. Performance varies significantly across layers, and most layers provide little or no improvement over the no\-fine\-tuning baseline\. For example, on the COPA dataset, only a small subset of layers \(4 out of 32 in LLaMA2\-7B\) improve accuracy after fine\-tuning, while the majority remain close to the baseline\.

\(2\) A dominant layer clearly emerges in ZO fine\-tuning\. One specific layer achieves substantially higher performance than all other layers and matches, or even exceeds, full\-model ZO fine\-tuning\. We refer to this layer as the dominant layer\. Moreover, we find that this layer has two important properties\. First, it is task\-agnostic: for a given LLM, the same layer consistently dominates across different tasks\. For example, in LLaMA2\-7B, layer 1 achieves the best performance on all three tasks\. Second, it appears to be model\-specific: different model families may have different dominant\-layer indices\. For example, the dominant layer is layer 1 in LLaMA2\-7B and layer 6 in Qwen3\-8B, with the Qwen3\-8B layer\-wise analysis provided in the Appendix\.

![Refer to caption](https://arxiv.org/html/2606.05516v1/figs/mezo_layerwise_three_datasets_vertical_compact.png) \(a\) MeZO layer\-wise results\.
![Refer to caption](https://arxiv.org/html/2606.05516v1/figs/fo_layerwise_three_datasets_vertical_compact.png) \(b\) FO layer\-wise results\.

Figure 1: Layer\-wise fine\-tuning results on LLaMA2\-7B across three representative datasets\.

To further examine whether this behavior is specific to ZO, we repeat the same layer\-wise analysis using first\-order fine\-tuning\. As shown in Figure [1\(b\)](https://arxiv.org/html/2606.05516#S3.F1.sf2), first\-order fine\-tuning exhibits a different pattern: most layers achieve clear improvements over the baseline, and no single layer consistently dominates\. This contrast indicates that strong layer\-wise dominance is specific to ZO optimization\.

![Refer to caption](https://arxiv.org/html/2606.05516v1/figs/llama2_7b_copa_activation_magnitude.png) \(a\) LLaMA2\-7B on COPA\.
![Refer to caption](https://arxiv.org/html/2606.05516v1/figs/llama2_7b_wsc_activation_magnitude.png) \(b\) LLaMA2\-7B on WSC\.
![Refer to caption](https://arxiv.org/html/2606.05516v1/figs/qwen3_8b_copa_activation_magnitude.png) \(c\) Qwen3\-8B on COPA\.
![Refer to caption](https://arxiv.org/html/2606.05516v1/figs/qwen3_8b_wsc_activation_magnitude.png) \(d\) Qwen3\-8B on WSC\.

Figure 2: Mean and maximum of layer\-wise output activation magnitudes for LLaMA2\-7B and Qwen3\-8B on COPA and WSC training samples\. The highlighted point indicates the first activation\-outlier layer, where the maximum output activation magnitude shows a clear jump\.

### 3\.3 Identifying the Dominant Layer via Activation Outliers

The layer\-wise analysis above identifies the dominant layer, but it requires fine\-tuning each layer separately, which is computationally expensive\. We therefore ask whether the dominant layer can be identified before fine\-tuning, using only lightweight signals from the pre\-trained model\.

Our key observation is that the dominant layer aligns with the first layer where activation outliers emerge\. This is motivated by the activation\-outlier phenomenon: activation outliers appear consistently at specific activation dimensions and are agnostic to input data, a property similar to the task\-agnostic behavior of the dominant layer observed in Section [3\.2](https://arxiv.org/html/2606.05516#S3.SS2)\. As shown in Figure [2](https://arxiv.org/html/2606.05516#S3.F2), the first activation\-outlier layer coincides with the dominant layer identified by layer\-wise ZO fine\-tuning\. For example, the first outlier layer appears at layer 1 in LLaMA2\-7B and layer 6 in Qwen3\-8B, matching the dominant layers identified by layer\-wise analysis\.

Based on this observation, we propose a simple inference\-only selection method\. Given a small calibration set, we run the pre\-trained model forward, compute layer\-wise activation statistics, and select the first layer whose maximum activation magnitude is abnormally large compared with the typical activation scale\. This method identifies the dominant layer before training and avoids expensive layer\-wise ZO fine\-tuning\.
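The selection rule can be sketched as follows, assuming the per\-layer maximum and mean absolute output activations have already been collected from forward passes over a calibration set\. The text does not give an exact outlier criterion, so the max\-to\-mean ratio test, the `ratio_threshold` value, and the function name `first_outlier_layer` are assumptions made for illustration, and the statistics below are made\-up numbers rather than measurements\.

```python
def first_outlier_layer(max_abs, mean_abs, ratio_threshold=50.0):
    """Return the 0-based index of the first layer whose maximum |activation|
    is at least ratio_threshold times its mean |activation|, or None."""
    for layer_idx, (peak, typical) in enumerate(zip(max_abs, mean_abs)):
        if typical > 0.0 and peak / typical >= ratio_threshold:
            return layer_idx
    return None


# Made-up per-layer statistics for illustration only (not values from the paper).
max_abs = [3.1, 412.0, 455.0, 460.0]
mean_abs = [0.8, 1.1, 1.2, 1.2]
selected = first_outlier_layer(max_abs, mean_abs)
no_outlier = first_outlier_layer([1.0, 2.0], [1.0, 1.0])
```

Because later layers typically inherit the large activations through the residual stream, the function stops at the first layer that crosses the threshold instead of taking the layer with the largest ratio\.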

## 4 Why Does the Dominant Layer Emerge in ZO Fine\-Tuning?

![Refer to caption](https://arxiv.org/html/2606.05516v1/x1.png)

Figure 3: Comparison of optimization signal flow in ZO and FO\. In ZO, gradients are estimated only from forward passes\. In FO, backpropagation provides exact gradients throughout the network\.

![Refer to caption](https://arxiv.org/html/2606.05516v1/x2.png)

Figure 4: Layer\-wise loss change under one\-step perturbation on LLaMA2\-7B across three tasks\.

ZO fine\-tuning relies entirely on changes in the final loss under parameter perturbations\. As a result, layers differ in effectiveness based on how strongly their perturbations influence the final loss, as illustrated in Figure [3](https://arxiv.org/html/2606.05516#S4.F3)\(a\)\. In contrast, FO fine\-tuning distributes gradients across layers through backpropagation, preventing such concentration, as shown in Figure [3](https://arxiv.org/html/2606.05516#S4.F3)\(b\)\. To evaluate how each layer contributes to the final loss change, we perform a perturbation\-only sensitivity analysis before ZO fine\-tuning, where we apply random perturbations to one layer at a time and measure the resulting change in final loss\. As shown in Figure [4](https://arxiv.org/html/2606.05516#S4.F4), the dominant layer induces the largest loss changes, making it the most effective for ZO updates\. In contrast, most other layers induce only minor loss changes, which explains their limited contribution to fine\-tuning performance\.

However, the magnitude of the change in loss alone does not fully explain the dominant\-layer phenomenon\. For example, Figure [4](https://arxiv.org/html/2606.05516#S4.F4) also shows that some later layers, such as layer 30, can produce noticeable loss changes under perturbation\. Nevertheless, these layers do not achieve comparable ZO fine\-tuning performance, as shown in Figure [1\(a\)](https://arxiv.org/html/2606.05516#S3.F1.sf1)\. This suggests that a large loss change can be sufficient to update the parameters, but it does not necessarily form a stable or useful update for improving task performance\.

What distinguishes the dominant layer is its early position\. Due to residual connections between layers in LLMs, the output activation of an earlier layer can affect all remaining layers and accumulate into the final hidden activation, as illustrated in Figure [5](https://arxiv.org/html/2606.05516#S4.F5)\. This effect is especially important for the dominant layer because it aligns with the first activation\-outlier layer, whose output contains extremely large\-magnitude activations\. Once the perturbation effect from the dominant layer enters the following layers, it can affect their hidden activations and continue to accumulate toward the final activation\. Therefore, the resulting loss change is not only large in magnitude, but also comes from a propagated effect across all remaining layers\. This makes the corresponding ZO update more stable and more useful for fine\-tuning\. Figure [6](https://arxiv.org/html/2606.05516#S4.F6) further supports this explanation from the training trajectory\. Across both LLaMA2\-7B and Qwen3\-8B, the dominant layer reduces training loss faster than later outlier layers and follows a trajectory closer to full\-model fine\-tuning\. Although later layers also contain activation outliers, their loss decreases much more slowly, suggesting that activation outliers alone are insufficient; early residual\-stream propagation is also important for effective ZO fine\-tuning\.
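The perturbation\-only sensitivity analysis described in this section can be sketched as a simple measurement loop: perturb one layer at a time in both directions and record the absolute change in the final loss\. Everything below is a hypothetical toy: the scalar "residual stream", the `GAINS` values \(with layer 1 deliberately given a large gain\), and the function names are assumptions chosen only to show the procedure, and the toy does not reproduce the paper's measurements\.

```python
import random


def layerwise_loss_change(layers, loss_fn, eps, rng, num_trials=8):
    """Mean |L(theta + eps*z) - L(theta - eps*z)| with z applied to one layer at a time."""
    changes = []
    for idx, weights in enumerate(layers):
        total = 0.0
        for _ in range(num_trials):
            z = [rng.gauss(0.0, 1.0) for _ in weights]
            plus = [list(w) for w in layers]
            minus = [list(w) for w in layers]
            plus[idx] = [w + eps * zi for w, zi in zip(weights, z)]
            minus[idx] = [w - eps * zi for w, zi in zip(weights, z)]
            total += abs(loss_fn(plus) - loss_fn(minus))
        changes.append(total / num_trials)
    return changes


# Hypothetical toy model: each layer adds gain * sum(weights) to a scalar stream.
GAINS = [0.1, 5.0, 0.1, 0.1]


def toy_loss(layers):
    h = 1.0
    for gain, weights in zip(GAINS, layers):
        h = h + gain * sum(weights)  # residual-style additive update
    return (h - 2.0) ** 2


rng = random.Random(0)
layers = [[0.1, 0.1] for _ in GAINS]
changes = layerwise_loss_change(layers, toy_loss, eps=1e-3, rng=rng)
most_sensitive = changes.index(max(changes))
```

As the section argues, a large value in `changes` is only part of the picture: a late layer can also produce a noticeable loss change without yielding useful ZO updates, so this measurement is a diagnostic and not a complete selection criterion\.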

![Refer to caption](https://arxiv.org/html/2606.05516v1/x3.png)

Figure 5: Schematic view of how ZO perturbations are injected at the first outlier\-activation layer and propagate through later decoding layers via the residual stream\.

![Refer to caption](https://arxiv.org/html/2606.05516v1/x4.png)

Figure 6: Training loss curve comparison\. In both LLaMA2\-7B and Qwen3\-8B, the dominant layer reduces training loss more effectively than later activation\-outlier layers\.

## 5 Experimental Validation of Dominant Layer in ZO Fine\-tuning

### 5\.1 Experimental Setting

#### Models and Datasets\.

To evaluate performance, we conduct experiments on LLaMA2\-7B \[[32](https://arxiv.org/html/2606.05516#bib.bib49)\] and Qwen3\-8B \[[35](https://arxiv.org/html/2606.05516#bib.bib50)\] over classification, multiple\-choice, and generation tasks used in MeZO \[[22](https://arxiv.org/html/2606.05516#bib.bib1)\], including SST\-2 \[[28](https://arxiv.org/html/2606.05516#bib.bib37)\], RTE \[[8](https://arxiv.org/html/2606.05516#bib.bib38),[3](https://arxiv.org/html/2606.05516#bib.bib39),[13](https://arxiv.org/html/2606.05516#bib.bib40),[4](https://arxiv.org/html/2606.05516#bib.bib41)\], CB \[[10](https://arxiv.org/html/2606.05516#bib.bib42)\], BoolQ \[[7](https://arxiv.org/html/2606.05516#bib.bib43)\], WSC \[[18](https://arxiv.org/html/2606.05516#bib.bib44)\], MultiRC \[[17](https://arxiv.org/html/2606.05516#bib.bib45)\], COPA \[[26](https://arxiv.org/html/2606.05516#bib.bib46)\], SQuAD \[[25](https://arxiv.org/html/2606.05516#bib.bib47)\], and DROP \[[12](https://arxiv.org/html/2606.05516#bib.bib48)\]\.

#### Comparison Setup\.

To assess whether a single dominant layer can capture the useful adaptation achieved by full\-model ZO fine\-tuning, we first compare dominant\-layer ZO against full\-model MeZO \[[22](https://arxiv.org/html/2606.05516#bib.bib1)\] and MeZO LoRA, the most common PEFT variant of MeZO\. We also include zero\-shot inference without fine\-tuning and full\-model first\-order fine\-tuning with AdamW as references, allowing us to understand the gap between ZO and FO fine\-tuning\. In addition, we compare with Sparse\-MeZO \[[21](https://arxiv.org/html/2606.05516#bib.bib2)\], which reduces the number of perturbed parameters through sparse masking\. Dominant\-layer ZO uses the same ZO\-SGD update rule as MeZO, but restricts both perturbations and updates to the identified dominant layer\. To ensure a controlled comparison, all methods follow the MeZO training and evaluation protocol\. We tune the learning rate for all methods and the perturbation scale for ZO methods\. Full hyperparameter ranges, training steps, and other implementation details are provided in the Appendix\.
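Restricting both perturbation and update to one layer can be illustrated with a small sketch in which parameters are stored per layer and only the selected entry is touched\. The dictionary layout, the layer names, the function name `dominant_layer_zo_step`, and the toy loss are hypothetical and chosen for illustration; they are not taken from the authors' code\.

```python
import random


def dominant_layer_zo_step(params, layer_name, loss_fn, eps, lr, rng):
    """Two-point ZO-SGD step that perturbs and updates only params[layer_name]."""
    z = [rng.gauss(0.0, 1.0) for _ in params[layer_name]]

    def shifted(scale):
        # Copy the mapping and move only the selected layer along z.
        out = dict(params)
        out[layer_name] = [w + scale * zi for w, zi in zip(params[layer_name], z)]
        return out

    coeff = (loss_fn(shifted(eps)) - loss_fn(shifted(-eps))) / (2.0 * eps)
    return shifted(-lr * coeff)


def toy_loss(params):
    """Hypothetical loss: squared distance of every weight from 1.0."""
    return sum((w - 1.0) ** 2 for weights in params.values() for w in weights)


rng = random.Random(0)
params = {"layer_0": [0.0, 0.0], "layer_1": [0.0, 0.0], "layer_2": [0.0, 0.0]}
for _ in range(500):
    params = dominant_layer_zo_step(params, "layer_1", toy_loss, eps=1e-3, lr=0.02, rng=rng)
```

The loss is still evaluated on the full set of parameters, mirroring the fact that the forward pass runs through the whole model; only the sampling of `z` and the update are confined to the chosen layer\.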

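Table 1: Performance of fine\-tuning LLaMA2\-7B \(with 1000 examples\)\. FT: full fine\-tuning\.

| Task | SST-2 | RTE | CB | BoolQ | WSC | MultiRC | COPA | SQuAD | DROP | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| Task type | classification | classification | classification | classification | classification | classification | multiple choice | generation | generation | |
| Zero-shot w/o fine-tuning | 58.02 | 61.73 | 32.14 | 66.7 | 36.54 | 45.3 | 81 | 60.43 | 19.73 | 51.29 |
| First-order AdamW FT | 95.87 | 84.84 | 85.71 | 86.7 | 71.15 | 82.6 | 86 | 90.68 | 48.74 | 81.37 |
| MeZO FT | 92.32 | 65.34 | 71.43 | 76.7 | 63.46 | 64.4 | 86 | 87.32 | 39.85 | 71.87 |
| MeZO LoRA | 92.66 | 65.89 | 69.64 | 77.6 | 63.46 | 65.9 | 88 | 86.26 | 40.57 | 72.22 |
| Dominant-layer ZO FT | 90.79 | 67.51 | 69.64 | 76.5 | 64.42 | 65.86 | 87 | 89.2 | 41.05 | 72.44 |
| Dominant-layer ZO LoRA | 91.02 | 67.44 | 67.86 | 77.8 | 62.5 | 66.52 | 87 | 88.58 | 42.10 | 72.31 |

### 5\.2 Main Results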

Dominant\-layer ZO matches or even exceeds ZO fine\-tuning on all layers across tasks\. Across both model families, restricting ZO updates to a single dominant layer achieves performance comparable to, and sometimes better than, full\-model MeZO\. On LLaMA2\-7B, dominant\-layer ZO FT improves over full\-model MeZO FT by an average gain of 0\.57%, with larger task\-level gains on RTE \(\+2\.17%\)\. Dominant\-layer ZO LoRA also improves over MeZO LoRA by 0\.09% on average, with notable gains on SQuAD \(\+2\.32%\)\. The gains are higher on Qwen3\-8B: dominant\-layer MeZO improves over full\-model MeZO by 1\.15% on average, with the largest improvements on CB \(\+3\.57%\) and SST\-2 \(\+2\.04%\)\. Dominant\-layer MeZO LoRA similarly improves over MeZO LoRA by 1\.12% on average, with strong gains on CB \(\+3\.57%\) and COPA \(\+2%\)\. These results indicate that useful ZO adaptation is not uniformly distributed across all layers, but can be effectively captured by updating one structurally important layer\.

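Table 2: Performance of fine\-tuning Qwen3\-8B \(with 1000 examples\)\. FT: full fine\-tuning\.

| Task | SST-2 | RTE | CB | BoolQ | WSC | MultiRC | COPA | SQuAD | DROP | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| Task type | classification | classification | classification | classification | classification | classification | multiple choice | generation | generation | |
| Zero-shot w/o fine-tuning | 58.03 | 87 | 82.14 | 78.3 | 70.19 | 76.4 | 82 | 82.97 | 63.55 | 75.62 |
| First-order AdamW FT | 95.41 | 92.06 | 94.64 | 89.9 | 78.85 | 90.1 | 89 | 93.65 | 71.63 | 88.36 |
| MeZO FT | 92.11 | 90.25 | 92.86 | 85.0 | 70.19 | 87.2 | 89 | 89.98 | 64.12 | 84.52 |
| MeZO LoRA | 91.74 | 90.75 | 91.07 | 85.5 | 71.15 | 86.8 | 87 | 90.41 | 64.59 | 84.33 |
| Dominant-layer MeZO | 94.15 | 91.34 | 96.43 | 84.7 | 72.12 | 85.6 | 90 | 90.76 | 65.94 | 85.67 |
| Dominant-layer MeZO LoRA | 92.89 | 91.53 | 94.64 | 86.9 | 73.08 | 85.49 | 89 | 90.74 | 64.85 | 85.45 |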

Comparison with Sparse\-MeZO\. Sparse\-MeZO \[[21](https://arxiv.org/html/2606.05516#bib.bib2)\] selects low\-magnitude weights and restricts the ZO perturbation to this selected weight subset to make ZO fine\-tuning more stable\. It constructs a binary mask and replaces the dense MeZO perturbation with a masked perturbation, so only the selected weights are updated\. Following the default settings, the smallest\-magnitude selection threshold is set at 25% before training, while the mask can be regenerated during training by comparing the current parameters with the fixed threshold\. Table [3](https://arxiv.org/html/2606.05516#S5.T3) compares Sparse\-MeZO with our dominant\-layer ZO on representative tasks\. Sparse\-MeZO improves the selected\-task average performance by 0\.83% over MeZO FT\. Dominant\-layer MeZO improves by 0\.96% over MeZO FT and by 0\.13% over Sparse\-MeZO\. This indicates that choosing where to apply ZO at the layer level can be as important as choosing parameters within every tensor\.

Table 3: Sparse\-MeZO results on LLaMA2\-7B with 1000 training examples\. We report representative tasks from classification, multiple\-choice, and generation settings\.

| Method | BoolQ | WSC | COPA | SQuAD | DROP | AVG |
|---|---|---|---|---|---|---|
| MeZO FT | 76.70 | 63.46 | 86.00 | 87.32 | 39.85 | 70.67 |
| Sparse-MeZO FT | 77.50 | 63.46 | 86.00 | 88.83 | 41.72 | 71.50 |
| Dominant-layer MeZO | 76.50 | 64.42 | 87.00 | 89.2 | 41.05 | 71.63 |

### 5\.3 Training Efficiency

Dominant\-layer ZO fine\-tuning also improves training efficiency\. Each MeZO step consists of full\-model forward evaluations, parameter perturbation, and parameter update\. Since the perturbed losses still require full\-model forward passes, restricting the optimized parameters does not substantially reduce the forward cost\. Therefore, the expected efficiency gain comes mainly from the parameter\-side operations when using dominant\-layer ZO\. For LLaMA2\-7B, which has 32 decoding layers, updating only one dominant layer gives an ideal parameter\-side reduction of about 32× compared with full\-model MeZO\. The measured results in Table [4](https://arxiv.org/html/2606.05516#S5.T4) closely match this expectation: single\-layer ZO reduces perturbation time by 27\.34× to 31\.28× and update time by 31\.56× to 32\.71× across tasks\. Thus, dominant\-layer ZO realizes nearly the full theoretical saving for parameter perturbation and update\. The end\-to\-end speedup depends on the fraction of each step spent in the forward pass\. On short\-input tasks, where perturbation and update account for a larger portion of each step, dominant\-layer ZO achieves larger improvements, such as 4\.52× on COPA and 2\.45× on SST\-2\. On long\-input tasks, the full\-model forward pass dominates the runtime, so the total speedup is smaller, such as 1\.24× on CB and 1\.12× on DROP\. This confirms that dominant\-layer ZO is most beneficial when parameter\-side operations are a significant bottleneck\.

We also compare with Sparse-MeZO as an elementwise sparse perturbation baseline. Although Sparse-MeZO keeps fewer than 25% of parameters active in our setting, its perturbation and update speedups are much smaller ($1.27$–$1.39\times$ and $1.51$–$1.68\times$ in Table 4, respectively). This is expected because Sparse-MeZO applies scattered masks within tensors, while dominant-layer ZO preserves dense contiguous tensor operations on a single decoding layer. Overall, dominant-layer ZO provides stronger practical speedups by combining a layer-level reduction in updated parameters with efficient dense computation.

Table 4: Per-step runtime breakdown for SST2, CB, WSC, COPA, and DROP under different ZO settings on Llama2-7B. Speedups (in parentheses) are relative to full-model MeZO for the same task and runtime component. For Sparse-MeZO, we exclude dynamic mask construction time to isolate the cost of the core ZO perturbation and update operations.

| Task | Param Range | Forward/step | Perturb/step | Update/step | Total/step |
| --- | --- | --- | --- | --- | --- |
| SST2 | Full model | 0.4431s (1.00x) | 0.4849s (1.00x) | 0.2061s (1.00x) | 1.1341s (1.00x) |
| SST2 | Sparse-MeZO | 0.4454s (0.99x) | 0.3714s (1.31x) | 0.1317s (1.57x) | 0.9485s (1.20x) |
| SST2 | Dominant layer | 0.4411s (1.00x) | 0.0155s (31.28x) | 0.0063s (32.71x) | 0.4629s (2.45x) |
| CB | Full model | 2.9190s (1.00x) | 0.4867s (1.00x) | 0.2061s (1.00x) | 3.6118s (1.00x) |
| CB | Sparse-MeZO | 2.8469s (1.03x) | 0.3842s (1.27x) | 0.1365s (1.51x) | 3.3676s (1.07x) |
| CB | Dominant layer | 2.8815s (1.01x) | 0.0178s (27.34x) | 0.0065s (31.71x) | 2.9057s (1.24x) |
| WSC | Full model | 1.0724s (1.00x) | 0.4858s (1.00x) | 0.2083s (1.00x) | 1.7665s (1.00x) |
| WSC | Sparse-MeZO | 1.0623s (1.01x) | 0.3775s (1.29x) | 0.1301s (1.60x) | 1.5699s (1.13x) |
| WSC | Dominant layer | 1.0766s (1.00x) | 0.0161s (30.17x) | 0.0066s (31.56x) | 1.0992s (1.61x) |
| COPA | Full model | 0.1764s (1.00x) | 0.4856s (1.00x) | 0.2081s (1.00x) | 0.8701s (1.00x) |
| COPA | Sparse-MeZO | 0.1655s (1.07x) | 0.3490s (1.39x) | 0.1242s (1.68x) | 0.6387s (1.36x) |
| COPA | Dominant layer | 0.1696s (1.04x) | 0.0163s (29.79x) | 0.0064s (32.52x) | 0.1923s (4.52x) |
| DROP | Full model | 5.8392s (1.00x) | 0.4852s (1.00x) | 0.2084s (1.00x) | 6.5328s (1.00x) |
| DROP | Sparse-MeZO | 5.9472s (0.98x) | 0.3707s (1.31x) | 0.1334s (1.56x) | 6.4512s (1.01x) |
| DROP | Dominant layer | 5.8273s (1.00x) | 0.0158s (30.71x) | 0.0066s (31.58x) | 5.8496s (1.12x) |

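As a worked check of how the per-component times in Table 4 combine into the end-to-end speedup, the short sketch below recomputes the total speedup and the forward-pass share of a full-model step for COPA and DROP. The timings are copied from Table 4; the helper names (`end_to_end_speedup`, `forward_fraction`) are illustrative only.

```python
# Per-step times in seconds (forward, perturb, update), copied from Table 4.
table4 = {
    "COPA": {"full": (0.1764, 0.4856, 0.2081), "dominant": (0.1696, 0.0163, 0.0064)},
    "DROP": {"full": (5.8392, 0.4852, 0.2084), "dominant": (5.8273, 0.0158, 0.0066)},
}

def end_to_end_speedup(full, restricted):
    """Ratio of total per-step time: full-model MeZO over the restricted scope."""
    return sum(full) / sum(restricted)

def forward_fraction(times):
    """Share of one step that is spent in the forward passes."""
    forward, perturb, update = times
    return forward / (forward + perturb + update)

speedups = {}
for task, rows in table4.items():
    speedups[task] = end_to_end_speedup(rows["full"], rows["dominant"])
    share = forward_fraction(rows["full"])
    print(f"{task}: forward share of full-model step = {share:.2f}, "
          f"end-to-end speedup = {speedups[task]:.2f}x")
```

The smaller the forward share of a step, the closer the end-to-end speedup can get to the parameter-side reduction; when the forward pass dominates, as on DROP, the total speedup stays close to 1.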
### 5.4 Ablation study and analysis

#### Does combining layers improve over the dominant layer?

Table 5: Layer combination performance on LLaMA2-7B.

| Setting | SST-2 | COPA |
| --- | --- | --- |
| w/o Finetune | 58.02 | 81 |
| Full-model MeZO | 92.32 | 86 |
| Dominant-layer MeZO | 90.79 | 87 |
| Dominant Layer + Layer 30 | 91.52 | 86 |

We next study whether the dominant layer can be further improved by adding another layer for tuning. As an example, we select layer 30, which also shows relatively large loss changes under perturbation. Table [5](https://arxiv.org/html/2606.05516#S5.T5) shows that simply adding another layer does not reliably improve performance. On SST-2, tuning layer 1 together with layer 30 slightly improves over tuning the dominant layer alone (91.52 vs. 90.79), but the result only approaches full-model MeZO (92.32) and does not exceed it. On COPA, however, the two-layer setting does not improve over the dominant layer (86 vs. 87). These results suggest that the dominant-layer bottleneck cannot be resolved by simply tuning more layers. Consistent with Section 5, ZO performance depends more on how a layer's perturbation propagates through the forward computation than on the number of updated layers. Thus, adding another sensitive layer does not necessarily provide additional gains.

#### Impact of outlier channels within the dominant layer\.

Motivated by prior studies showing that activation outliers occur in a small number of fixed feature dimensions and are closely related to the MLP down\-projection layer\[[30](https://arxiv.org/html/2606.05516#bib.bib13),[2](https://arxiv.org/html/2606.05516#bib.bib16)\], we further examine the corresponding channels inside the dominant layer\. Specifically, since these outlier dimensions mainly connect to the MLP module within a decoder layer, we study whether the dominant\-layer advantage is driven by the associated MLP channels\. Table 6 shows that these activation\-outlier MLP channels play a critical role\. Fine\-tuning only the dominant\-layer MLP recovers most of the full dominant\-layer performance, and tuning only the top 1% activation\-outlier MLP channels still preserves much of the gain on WSC\. In contrast, removing these channels from MLP tuning causes performance to collapse close to the no\-fine\-tuning baseline, indicating that they provide a major part of the ZO update signal\. At the same time, the dominant\-layer effect cannot be fully reduced to this small channel subset\. Removing the top 1% outlier channels from full dominant\-layer tuning significantly reduces performance, especially on COPA, but does not completely remove the gain on WSC\. These results suggest that activation\-outlier MLP channels serve as high\-leverage components for ZO adaptation, while the rest of the dominant layer still provides additional adaptation capacity\.

Table 6: Channel-level MeZO ablations within the bottleneck layer on LLaMA2-7B. Tuning only the top 1% activation-outlier channels recovers most of the gain from bottleneck-layer tuning, while freezing those channels removes much of the advantage.

| Setting | WSC | COPA |
| --- | --- | --- |
| Base, no fine-tuning | 36.54 | 81 |
| Full Dominant-layer MeZO | 64.5 | 87 |
| Dominant-layer MLP MeZO | 62.5 | 86 |
| Dominant-layer MLP Top 1% outlier channels MeZO | 62.5 | 83 |
| Dominant-layer MeZO without top 1% MLP outlier channels | 56.73 | 81 |
| Dominant-layer MLP MeZO without top 1% MLP outlier channels | 37.5 | 81 |

## 6 Conclusion

We study where effective adaptation occurs in full\-model zeroth\-order fine\-tuning of LLMs and find that it concentrates in a single dominant layer\. Across two model families and nine downstream tasks, tuning this layer often matches or exceeds full\-model MeZO, while matched first\-order fine\-tuning shows much weaker layer concentration\. We further show that this layer aligns with the first activation\-outlier layer, enabling inference\-only identification before training\. Our analysis suggests that the dominant layer combines high perturbation sensitivity with an early position in the residual stream, allowing its perturbation to affect many subsequent blocks\. This produces a stronger forward\-loss signal under ZO, which relies on loss differences rather than backpropagated gradients\. Overall, our results show that full\-model ZO does not simply update too many parameters; it allocates optimization effort unevenly across layers\. This motivates future ZO methods that account for layer identity and within\-layer importance\.

Limitation. A performance gap remains between dominant-layer ZO and first-order fine-tuning methods. Dominant-layer ZO also still requires many steps to reach good performance, which remains a practical obstacle for applications. We did not explore additional models, nor did we combine dominant-layer ZO with other optimizers designed for ZO, such as ZO-AdaMM \[[5](https://arxiv.org/html/2606.05516#bib.bib51)\] and FZOO \[[9](https://arxiv.org/html/2606.05516#bib.bib5)\]. We plan to address these limitations and extend the study to more pre-trained LLMs in future research.

## References

- [1] (2021) Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pp. 7319–7328.
- [2] Y. An, X. Zhao, T. Yu, M. Tang, and J. Wang (2025) Systematic outliers in large language models. External Links: 2502.06415.
- [3] R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro, and D. Giampiccolo (2006-01) The second pascal recognising textual entailment challenge. Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
- [4] L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo (2009) The fifth pascal recognizing textual entailment challenge. TAC 7(8), pp. 1.
- [5] X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, and D. Cox (2019) ZO-adamm: zeroth-order adaptive momentum method for black-box optimization. External Links: 1910.06513.
- [6] Y. Chen, Y. Zhang, L. Cao, K. Yuan, and Z. Wen (2024) Enhancing zeroth-order fine-tuning for language models with low-rank structures. arXiv preprint arXiv:2410.07698.
- [7] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) Boolq: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pp. 2924–2936.
- [8] I. Dagan, O. Glickman, and B. Magnini (2005) The pascal recognising textual entailment challenge. In Machine learning challenges workshop, pp. 177–190.
- [9] S. Dang, Y. Guo, Y. Zhao, H. Ye, X. Zheng, G. Dai, and I. Tsang (2025) FZOO: fast zeroth-order optimizer for fine-tuning large language models towards adam-scale speed. External Links: 2506.09034.
- [10] M. De Marneffe, M. Simons, and J. Tonhauser (2019) The commitmentbank: investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, Vol. 23, pp. 107–124.
- [11] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer (2022) GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35, pp. 30318–30332.
- [12] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368–2378.
- [13] D. Giampiccolo, B. Magnini, I. Dagan, and W. B. Dolan (2007) The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp. 1–9.
- [14] W. Guo, J. Long, Y. Zeng, Z. Liu, X. Yang, Y. Ran, J. R. Gardner, O. Bastani, C. De Sa, X. Yu, et al. (2025) Zeroth-order fine-tuning of llms with transferable static sparsity. In The Thirteenth International Conference on Learning Representations.
- [15] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for nlp. In International conference on machine learning, pp. 2790–2799.
- [16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) Lora: low-rank adaptation of large language models. Iclr 1(2), pp. 3.
- [17] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018) Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262.
- [18] H. J. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. KR 2012(13th), pp. 3.
- [19] P. Li, L. Yin, X. Gao, and S. Liu (2025) Outlier-weighed layerwise sampling for llm fine-tuning. External Links: 2405.18380.
- [20] W. Lin, Y. Jiang, Q. Song, Q. Xiang, and H. Xu (2026) AGZO: activation-guided zeroth-order optimization for llm fine-tuning. arXiv preprint arXiv:2601.17261.
- [21] Y. Liu, Z. Zhu, C. Gong, M. Cheng, C. Hsieh, and Y. You (2024) Sparse mezo: less parameters for better performance in zeroth-order llm fine-tuning. arXiv preprint arXiv:2402.15751.
- [22] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora (2023) Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems 36, pp. 53038–53075.
- [23] Y. Nesterov and V. Spokoiny (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17(2), pp. 527–566.
- [24] R. Pan, X. Liu, S. Diao, R. Pi, J. Zhang, C. Han, and T. Zhang (2024) Lisa: layerwise importance sampling for memory-efficient large language model fine-tuning. Advances in Neural Information Processing Systems 37, pp. 57018–57049.
- [25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 2383–2392.
- [26] M. Roemmele, C. A. Bejan, and A. S. Gordon (2011) Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pp. 90–95.
- [27] G. Shi, Z. Lu, X. Dong, W. Zhang, X. Zhang, Y. Feng, and X. Wu (2025) Understanding layer significance in llm alignment. External Links: 2410.17875.
- [28] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013-10) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
- [29] J. C. Spall (2002) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE transactions on automatic control 37(3), pp. 332–341.
- [30] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu (2024) Massive activations in large language models. arXiv preprint arXiv:2402.17762.
- [31] Q. Tan, J. Liu, Z. Zhan, C. Ding, Y. Wang, X. Ma, J. Lee, J. Lu, and G. Yuan (2025) Harmony in divergence: towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning. External Links: 2502.03304.
- [32] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [33] F. Wang, L. Shen, L. Ding, C. Xue, Y. Liu, and C. Ding (2024) Simultaneous computation and memory efficient zeroth-order optimizer for fine-tuning large language models. External Links: 2410.09823.
- [34] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) Smoothquant: accurate and efficient post-training quantization for large language models. In International conference on machine learning, pp. 38087–38099.
- [35] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [36] K. Yao, P. Gao, L. Li, Y. Zhao, X. Wang, W. Wang, and J. Zhu (2024-11) Layer-wise importance matters: less memory for better performance in parameter-efficient fine-tuning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 1977–1992.
- [37] Z. Yu, P. Zhou, S. Wang, J. Li, M. Tian, and H. Huang (2025) Zeroth-order fine-tuning of llms in random subspaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4475–4485.
- [38] K. Zhang, H. Li, Y. Zhao, Y. Sun, and H. Zhang (2025) Learning a zeroth-order optimizer for fine-tuning llms. External Links: 2510.00419.
- [39] Y. Zhang, P. Li, J. Hong, J. Li, Y. Zhang, W. Zheng, P. Chen, J. D. Lee, W. Yin, M. Hong, Z. Wang, S. Liu, and T. Chen (2024) Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: a benchmark. External Links: 2402.11592.
- [40] H. Zhao, J. Li, Y. Pan, S. Liang, X. Yang, W. Liu, X. Li, F. Dou, T. Liu, and J. Lu (2024) HELENE: hessian layer-wise clipping and gradient annealing for accelerating fine-tuning llm with zeroth-order optimization. External Links: 2411.10696.
- [41] Y. Zhao, S. Dang, H. Ye, G. Dai, Y. Qian, and I. W. Tsang (2024) Second-order fine-tuning without pain for llms: a hessian informed zeroth-order optimizer. arXiv preprint arXiv:2402.15173.

## Appendix A Experimental Details

### A.1 Tasks, Models, and Metrics

Following MeZO\[[22](https://arxiv.org/html/2606.05516#bib.bib1)\], we construct each task split by sampling up to 1000 training examples, 500 validation examples, and 1000 evaluation examples when sufficient data are available\. For smaller datasets, we follow the same protocol but reduce the validation split accordingly; in particular, for WSC, CB, and COPA we use a validation set of 100 examples\. We evaluate two model families, LLaMA2\-7B and Qwen3\-8B, over classification, multiple\-choice, and generation tasks\. All ZO experiments are run in float16, while FO experiments are run in bfloat16\. LLaMA2\-7B experiments are conducted on A6000\-48GB GPUs, and Qwen3\-8B experiments are conducted on H200 141GB GPUs\.

Unless otherwise stated, we keep the task formatting, data budget, and evaluation protocol fixed across methods so that the comparisons isolate the effect of the optimization method or trainable scope\. In particular, the comparisons among full\-model ZO, dominant\-layer ZO, Sparse\-MeZO, FO fine\-tuning, and MeZO\-LoRA use the same prompt family and validation\-based model selection procedure\. This shared setup is important for interpreting the layerwise results: differences in performance should be attributed to optimization behavior rather than to prompt or data changes\.

### A.2 Hyperparameters

Tables[7](https://arxiv.org/html/2606.05516#A1.T7)and[8](https://arxiv.org/html/2606.05516#A1.T8)summarize the hyperparameter grids used in our experiments\. For MeZO\-style methods, we use constant learning rates, a fixed perturbation scale and 10k steps, while FO fine\-tuning with AdamW follows a separate learning\-rate grid over 5 epochs\. For all methods, we select the final checkpoint based on the lowest validation loss among checkpoints saved every 2k training steps\. Following the setting from Subzero\[[37](https://arxiv.org/html/2606.05516#bib.bib28)\], we use default sparse rate 0\.75 for Sparse\-MeZO across all datasets\.

Table 7: The hyperparameter grids used for LLaMA2-7B experiments. All weight decay is set to 0. FO FT uses 5 epochs and MeZO uses 10K steps and constant learning rates. We check validation performance and save the best checkpoint every 2k total training steps.

| Experiment | Hyperparameters | Values |
| --- | --- | --- |
| MeZO FT | Batch size | 16 |
| | Learning rate | $\{1\mathrm{e}{-7}, 5\mathrm{e}{-7}, 1\mathrm{e}{-6}, 5\mathrm{e}{-6}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| MeZO Single Layer | Batch size | 16 |
| | Learning rate | $\{1\mathrm{e}{-7}, 5\mathrm{e}{-7}, 1\mathrm{e}{-6}, 5\mathrm{e}{-6}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| MeZO (LoRA) | Batch size | 16 |
| | Learning rate | $\{5\mathrm{e}{-6}, 1\mathrm{e}{-5}, 2\mathrm{e}{-5}, 5\mathrm{e}{-5}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| | $(r, \alpha)$ | $(8, 16)$ |
| Sparse-MeZO | Batch size | 16 |
| | Learning rate | $\{5\mathrm{e}{-7}, 1\mathrm{e}{-6}, 2\mathrm{e}{-6}, 5\mathrm{e}{-6}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| | Sparse rate | 0.75 |
| FO FT with AdamW | Batch size | 8 |
| | Learning rates | $\{1\mathrm{e}{-5}, 5\mathrm{e}{-5}, 1\mathrm{e}{-4}\}$ |

Table 8: The hyperparameter grids used for Qwen3-8B experiments. All weight decay is set to 0. FO FT uses 5 epochs and MeZO uses 10K steps and constant learning rates. We check validation performance and save the best checkpoint every 2k total training steps.

| Experiment | Hyperparameters | Values |
| --- | --- | --- |
| MeZO FT | Batch size | 16 |
| | Learning rate | $\{1\mathrm{e}{-7}, 5\mathrm{e}{-7}, 1\mathrm{e}{-6}, 5\mathrm{e}{-6}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| MeZO Single Layer | Batch size | 16 |
| | Learning rate | $\{1\mathrm{e}{-7}, 5\mathrm{e}{-7}, 1\mathrm{e}{-6}, 5\mathrm{e}{-6}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| MeZO (LoRA) | Batch size | 16 |
| | Learning rate | $\{5\mathrm{e}{-6}, 1\mathrm{e}{-5}, 2\mathrm{e}{-5}, 5\mathrm{e}{-5}\}$ |
| | $\epsilon$ | $1\mathrm{e}{-3}$ |
| | $(r, \alpha)$ | $(8, 16)$ |
| FO FT with AdamW | Batch size | 8 |
| | Learning rates | $\{1\mathrm{e}{-5}, 5\mathrm{e}{-5}, 1\mathrm{e}{-4}\}$ |

The hyperparameter tables also clarify an important aspect of our comparisons: the dominant-layer advantage is not due to using a more favorable optimization budget for the restricted scope. Instead, dominant-layer ZO inherits essentially the same MeZO optimization protocol as the full-model baseline, differing only in which parameters are perturbed and updated. This makes the dominant-layer results in the main paper a structural finding rather than a consequence of hyperparameter tuning.

### A.3 Prompts

Table[9](https://arxiv.org/html/2606.05516#A1.T9)lists the prompt templates used in our experiments\. We follow MeZO\[[22](https://arxiv.org/html/2606.05516#bib.bib1)\]for templates and keep the prompt template fixed across zero\-shot evaluation, FO fine\-tuning, and ZO fine\-tuning for each task\. This consistency is especially important for the layerwise analyses, since it ensures that the observed differences across layers and optimization methods are not confounded by changes in verbalization or answer formatting\.

Table 9: Prompt templates used in our experiments. Task types are classification (cls.), multiple-choice (mch.), and question answering (QA). Prompts are the same as in MeZO \[[22](https://arxiv.org/html/2606.05516#bib.bib1)\].

| Dataset | Type | Prompt |
| --- | --- | --- |
| SST-2 | cls. | `<text>` It was terrible/great |
| RTE | cls. | `<premise>` Does this mean that “`<hypothesis>`” is true? Yes or No? Yes/No |
| CB | cls. | Suppose `<premise>` Can we infer that “`<hypothesis>`”? Yes, No, or Maybe? Yes/No/Maybe |
| BoolQ | cls. | `<passage>` `<question>`? Yes/No |
| WSC | cls. | `<text>` In the previous sentence, does the pronoun “`<span2>`” refer to `<span1>`? Yes or No? Yes/No |
| MultiRC | cls. | `<paragraph>` Question: `<question>` I found this answer “`<answer>`”. Is that correct? Yes or No? Yes/No |
| COPA | mch. | `<premise>` so/because `<candidate>` |
| ReCoRD | mch. | `<passage>` `<query>.replace("@placeholder", <candidate>)` |
| SQuAD | QA | Title: `<title>` Context: `<context>` Question: `<question>` Answer: |
| DROP | QA | Passage: `<context>` Question: `<question>` Answer: |
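To illustrate how templates of this kind are typically filled in, here is a small sketch for three of the rows above. It is not the MeZO prompt code: the function names are hypothetical, the single-space separator between fields is an assumption (the table does not show whitespace or line breaks), and the `question` argument for COPA assumes a dataset field that distinguishes "cause" from "effect" items.

```python
# Hypothetical helpers; separators between template fields are assumptions.

def rte_prompt(premise: str, hypothesis: str) -> str:
    """RTE row: premise followed by the yes/no entailment question."""
    return f'{premise} Does this mean that "{hypothesis}" is true? Yes or No?'

def copa_prompt(premise: str, candidate: str, question: str) -> str:
    """COPA row: join premise and candidate with 'so' (effect) or 'because' (cause)."""
    connective = "so" if question == "effect" else "because"
    return f"{premise} {connective} {candidate}"

def record_prompt(passage: str, query: str, candidate: str) -> str:
    """ReCoRD row: substitute the candidate entity for the @placeholder token."""
    return f'{passage} {query.replace("@placeholder", candidate)}'

print(rte_prompt("A dog is running in the park.", "An animal is outside"))
print(copa_prompt("It rained all night", "the street was wet", "effect"))
print(record_prompt("Ann entered the race.", "@placeholder won.", "Ann"))
```

For the classification rows, the trailing label words (for example Yes/No) would be scored as candidate continuations rather than included in the prompt itself; that reading of the table is also an assumption.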

## Appendix B Additional Empirical Results

This section reports additional numerical results that complement the figures and summaries in the main paper\. The main paper emphasizes representative plots and high\-level comparisons, whereas the appendix provides exact per\-task and per\-layer values to make the dominant\-layer phenomenon fully transparent\. Together, these tables show that the concentration of useful ZO adaptation is both numerically sharp and substantially stronger than the corresponding layerwise variation under FO fine\-tuning\.

### B.1 Comparison between Dominant layer and other layers in Qwen3-8B

We first report a compact cross-task summary for Qwen3-8B. This table complements the main-text claim that the dominant-layer phenomenon is model-specific but not unique to LLaMA2-7B. The dominant layer index differs across model families (layer 6 in Qwen3-8B versus layer 1 in LLaMA2-7B), yet in both cases tuning the dominant layer alone can recover and even exceed the gain of full-model ZO.

Table 10: Qwen3-8B best performance under full-model and selected single-layer ZO fine-tuning. DROP is reported by F1; other tasks are reported by accuracy. AVG is the simple average over the shown tasks.

| Method | SST2 | RTE | CB | BoolQ | WSC | MultiRC | COPA | DROP | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-shot w/o finetune | 58.03 | 87 | 82.14 | 78.3 | 70.19 | 76.4 | 82 | 63.55 | 74.70 |
| Full-model ZO | 92.11 | 90.25 | 92.86 | 85.0 | 70.19 | 87.2 | 89 | 64.12 | 83.84 |
| Dominant-layer ZO | 94.15 | 91.34 | 96.43 | 84.7 | 72.12 | 85.6 | 90 | 65.94 | 85.04 |
| Layer-34 ZO | 83.49 | 87.00 | 87.50 | 79.11 | 71.15 | 83.5 | 83 | 63.75 | 79.81 |
| Layer-35 ZO | 86.35 | 87.73 | 85.71 | 82.91 | 71.15 | 85.1 | 85 | 59.12 | 80.38 |

Table [10](https://arxiv.org/html/2606.05516#A2.T10) shows that the dominant-layer pattern generalizes to Qwen3-8B. Across the eight reported tasks, dominant-layer ZO remains highly competitive with full-model ZO and improves the average score by 1.2 points (85.04 vs. 83.84). At the same time, the much weaker performance of late alternative layers such as layers 34 and 35 indicates that the effect is not simply a generic preference for deeper layers. Instead, Qwen3-8B appears to have its own model-specific dominant layer, consistent with the main paper's claim that the dominant layer is stable within a model family but differs across architectures.
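The AVG column and the 1.2-point gap quoted above can be re-derived directly from the per-task scores. The sketch below copies the rows of Table 10 and recomputes the simple average; the dictionary layout and the helper name `simple_average` are illustrative choices only.

```python
# Per-task scores and reported AVG, copied from Table 10
# (task order: SST2, RTE, CB, BoolQ, WSC, MultiRC, COPA, DROP).
table10 = {
    "Zero-shot w/o finetune": ([58.03, 87, 82.14, 78.3, 70.19, 76.4, 82, 63.55], 74.70),
    "Full-model ZO": ([92.11, 90.25, 92.86, 85.0, 70.19, 87.2, 89, 64.12], 83.84),
    "Dominant-layer ZO": ([94.15, 91.34, 96.43, 84.7, 72.12, 85.6, 90, 65.94], 85.04),
    "Layer-34 ZO": ([83.49, 87.00, 87.50, 79.11, 71.15, 83.5, 83, 63.75], 79.81),
    "Layer-35 ZO": ([86.35, 87.73, 85.71, 82.91, 71.15, 85.1, 85, 59.12], 80.38),
}

def simple_average(scores):
    """Unweighted mean over the shown tasks."""
    return sum(scores) / len(scores)

recomputed = {method: simple_average(scores) for method, (scores, _) in table10.items()}
gain = recomputed["Dominant-layer ZO"] - recomputed["Full-model ZO"]

for method, (_, reported) in table10.items():
    print(f"{method}: recomputed {recomputed[method]:.3f}, reported {reported:.2f}")
print(f"dominant-layer gain over full-model ZO: {gain:.2f} points")
```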

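Table 11: Llama2-7B single-layer FO results across all 32 transformer layers on WSC, COPA, and DROP. WSC and COPA report accuracy (%), while DROP reports F1 (%). The left four columns cover layers 0–15 and the right four columns cover layers 16–31.

| Layer | WSC | COPA | DROP F1 | Layer | WSC | COPA | DROP F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 64.42 | 87 | 45.31 | 16 | 60.58 | 88 | 40.29 |
| 1 | 68.27 | 84 | 43.87 | 17 | 56.73 | 89 | 40.10 |
| 2 | 61.54 | 85 | 46.07 | 18 | 58.65 | 84 | 37.31 |
| 3 | 60.58 | 83 | 46.53 | 19 | 58.65 | 81 | 37.58 |
| 4 | 67.31 | 86 | 44.86 | 20 | 58.65 | 90 | 36.05 |
| 5 | 68.27 | 87 | 46.06 | 21 | 55.77 | 86 | 35.88 |
| 6 | 67.31 | 86 | 47.48 | 22 | 58.65 | 83 | 35.61 |
| 7 | 67.31 | 84 | 45.87 | 23 | 58.65 | 86 | 35.22 |
| 8 | 71.15 | 82 | 45.44 | 24 | 60.58 | 81 | 35.23 |
| 9 | 67.31 | 84 | 44.94 | 25 | 59.62 | 84 | 35.33 |
| 10 | 66.35 | 85 | 45.86 | 26 | 56.73 | 84 | 35.99 |
| 11 | 66.35 | 87 | 44.46 | 27 | 53.85 | 80 | 36.13 |
| 12 | 63.46 | 87 | 45.04 | 28 | 61.54 | 81 | 35.93 |
| 13 | 57.69 | 84 | 43.46 | 29 | 61.54 | 86 | 35.18 |
| 14 | 62.50 | 84 | 42.44 | 30 | 60.58 | 83 | 34.26 |
| 15 | 64.42 | 89 | 40.20 | 31 | 59.62 | 81 | 33.88 |

Full model w/o finetune: WSC = 36.54, COPA = 81, DROP F1 = 19.73. FO full model finetune: WSC = 71.15, COPA = 89, DROP F1 = 48.74.
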
### B.2 Detailed Results of FO and ZO Layerwise Fine-tuning on Llama2-7B

We next provide the full layerwise FO results for LLaMA2\-7B on three representative tasks\. These exact values complement the main\-text FO figure and make the contrast with ZO more explicit\.

Table[11](https://arxiv.org/html/2606.05516#A2.T11)confirms that FO fine\-tuning is heterogeneous across layers, but the pattern is relatively distributed\. Many early and middle layers provide substantial gains over the zero\-shot baseline, and no single layer consistently accounts for the full\-model FO performance across WSC, COPA, and DROP\. Although some layers stand out on individual tasks, the overall FO picture is broad rather than sharply concentrated\. This numerical pattern supports the main\-text claim that the dominant\-layer bottleneck is not a generic property of all fine\-tuning, but is much sharper under forward\-only ZO\.

Table 12: Llama2-7B single-layer MeZO results across all 32 transformer layers on WSC, COPA, and DROP. WSC and COPA report accuracy (%), while DROP reports F1 (%). The left four columns cover layers 0–15 and the right four columns cover layers 16–31.

| Layer | WSC | COPA | DROP F1 | Layer | WSC | COPA | DROP F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 37.5 | 82 | 36.95 | 16 | 48.08 | 81 | 21.24 |
| 1 | 64.42 | 87 | 41.05 | 17 | 45.19 | 81 | 20.46 |
| 2 | 37.5 | 81 | 33.08 | 18 | 45.19 | 81 | 20.38 |
| 3 | 41.35 | 81 | 32.47 | 19 | 39.42 | 81 | 19.97 |
| 4 | 47.12 | 81 | 34.16 | 20 | 36.54 | 81 | 20.30 |
| 5 | 43.27 | 81 | 33.91 | 21 | 36.54 | 81 | 19.67 |
| 6 | 43.27 | 81 | 33.71 | 22 | 37.5 | 81 | 19.75 |
| 7 | 45.19 | 82 | 33.00 | 23 | 36.54 | 81 | 19.86 |
| 8 | 43.27 | 82 | 33.41 | 24 | 37.5 | 81 | 19.70 |
| 9 | 44.23 | 81 | 33.61 | 25 | 37.5 | 81 | 19.60 |
| 10 | 43.27 | 81 | 31.85 | 26 | 36.54 | 81 | 19.73 |
| 11 | 43.27 | 81 | 30.69 | 27 | 37.5 | 81 | 19.74 |
| 12 | 43.27 | 81 | 30.28 | 28 | 39.42 | 81 | 19.51 |
| 13 | 43.27 | 81 | 29.44 | 29 | 45.19 | 81 | 19.71 |
| 14 | 45.19 | 81 | 27.45 | 30 | 53.85 | 81 | 21.45 |
| 15 | 45.19 | 81 | 27.00 | 31 | 59.62 | 81 | 25.88 |

Full model w/o finetune: WSC = 36.54, COPA = 81, DROP F1 = 19.73. MeZO full model finetune: WSC = 63.5, COPA = 86, DROP F1 = 39.85.

Table [12](https://arxiv.org/html/2606.05516#A2.T12) reports the corresponding single-layer MeZO ablations on LLaMA2-7B. These exact values provide the numerical counterpart to the main-text layerwise MeZO figure. The contrast with FO is sharp. In Table [12](https://arxiv.org/html/2606.05516#A2.T12), layer 1 clearly emerges as the dominant layer across WSC, COPA, and DROP: it recovers the gain of full-model MeZO and slightly exceeds it on all three tasks (64.42 vs. 63.5 on WSC, 87 vs. 86 on COPA, and 41.05 vs. 39.85 on DROP). By comparison, most other layers remain far closer to the zero-shot baseline, especially on COPA and in the later layers on DROP. Although a weaker late-layer rise appears on some tasks, it does not overturn the overall pattern: full-model ZO is effectively dominated by a single especially high-leverage layer.

### B\.3 Computation Efficiency Analysis

This subsection provides an analytical and empirical breakdown of the runtime consequences of restricting the trainable scope in ZO fine\-tuning\. The main text argues that dominant\-layer and channel\-restricted ZO can preserve much of full\-model performance while reducing optimization overhead\. Here we make that tradeoff explicit by separating the cost of forward computation from the cost of perturbing and updating the selected parameters\.

#### Setup\.

Consider a decoder\-only layer with hidden size $H$, MLP size $I$, batch size $B$, effective sequence length $n$, and $q$ SPSA directions\. Let $F(B,n)$ denote the cost of one full forward pass through the model\. For a trainable parameter subset $S$, let $|S|$ be the number of parameters that are actually perturbed and updated\.

#### Step cost\.

One MeZO SPSA step performs three parameter perturbations, two full forward passes, and one parameter update\. Thus its cost can be written as

$$T^{(1)}_{\mathrm{step}}(S) = 2F(B,n) + \gamma|S|, \tag{3}$$

where $\gamma > 0$ absorbs the per\-parameter perturbation and update cost\. With $q$ directions,

$$T_{\mathrm{step}}(S) = 2qF(B,n) + q\gamma|S|. \tag{4}$$
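The operation count behind Equation \(3\) can be illustrated with a toy SPSA step\. The sketch below is a minimal pure\-Python illustration and not the paper's implementation: `mezo_step`, the dictionary\-of\-lists parameter layout, and the quadratic `toy_loss` are hypothetical names chosen for this example, and regenerating the direction from a seed is assumed from the usual MeZO recipe\. It shows that the three perturbations and the update touch only the parameters in `subset`, while both loss evaluations still run over the whole model\.

```python
import random


def mezo_step(params, subset, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    """One SPSA step that perturbs and updates only the tensors named in `subset`."""

    def shift(scale):
        # Replay the same Gaussian direction z from the seed instead of storing it.
        rng = random.Random(seed)
        for name in subset:
            values = params[name]
            for i in range(len(values)):
                values[i] += scale * rng.gauss(0.0, 1.0)

    shift(+eps)                    # perturbation 1: theta + eps * z
    loss_plus = loss_fn(params)    # forward pass 1 (whole model)
    shift(-2.0 * eps)              # perturbation 2: theta - eps * z
    loss_minus = loss_fn(params)   # forward pass 2 (whole model)
    shift(+eps)                    # perturbation 3: restore theta
    grad_scale = (loss_plus - loss_minus) / (2.0 * eps)  # projected gradient
    shift(-lr * grad_scale)        # update: theta - lr * grad_scale * z
    return grad_scale


# Toy "model" (hypothetical): two parameter groups and a quadratic loss over both.
calls = {"forward": 0}


def toy_loss(params):
    calls["forward"] += 1
    return sum(x * x for values in params.values() for x in values)


params = {"layer_1": [1.0, -2.0, 0.5], "layer_2": [3.0, -1.0]}
frozen_before = list(params["layer_2"])
mezo_step(params, subset=["layer_1"], loss_fn=toy_loss)
print(calls["forward"], params["layer_2"] == frozen_before)
```

In this sketch the cost of `shift` scales with the size of `subset`, which plays the role of the $\gamma|S|$ term, whereas the two `loss_fn` calls are independent of `subset` and correspond to the $2F(B,n)$ term\.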

#### Key observation\.

Equation \([4](https://arxiv.org/html/2606.05516#A2.E4)\) shows that changing the trainable scope does *not* change the dominant forward computation: full\-model, single\-layer, and outlier\-only MeZO all still require two full forward passes per direction\. The only difference lies in the perturbation/update term $q\gamma|S|$\.

#### Parameter counts\.

For the three scopes considered in this paper,

$$|S_{\mathrm{full}}| = P_{\mathrm{full}}, \tag{5}$$

$$|S_{\mathrm{layer}}| = P_{\mathrm{layer}} = 4H^{2} + 3HI + 2H, \tag{6}$$

and for $k$ selected MLP outlier channels in one layer,

$$|S_{\mathrm{out}}| = P_{\mathrm{out}} = 3Hk. \tag{7}$$

Therefore,

$$T_{\mathrm{full}} = 2qF + q\gamma P_{\mathrm{full}}, \tag{8}$$

$$T_{\mathrm{layer}} = 2qF + q\gamma P_{\mathrm{layer}}, \tag{9}$$

$$T_{\mathrm{out}} = 2qF + q\gamma P_{\mathrm{out}}. \tag{10}$$
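To give a sense of scale for Equations \(6\) and \(7\), the sketch below plugs in concrete dimensions\. The values $H = 4096$ and $I = 11008$ are the publicly documented LLaMA2\-7B sizes and are an assumption here, since this appendix does not state them; $k = 110$ is a hypothetical choice of roughly 1% of the MLP channels\. The helper names `layer_params` and `outlier_params` are illustrative only, and the docstrings give one plausible reading of each term rather than a breakdown confirmed by the text\.

```python
def layer_params(hidden, mlp):
    """P_layer = 4H^2 + 3HI + 2H.

    Plausible reading (an assumption): four HxH attention projections,
    three HxI gated-MLP matrices, and two H-sized normalization vectors.
    """
    return 4 * hidden * hidden + 3 * hidden * mlp + 2 * hidden


def outlier_params(hidden, k):
    """P_out = 3Hk: k MLP channels, each contributing three H-sized weight slices."""
    return 3 * hidden * k


# Assumed dimensions (not stated in this appendix): LLaMA2-7B public config.
H, I = 4096, 11008
k = 110  # hypothetical: about 1% of the I = 11008 MLP channels

p_layer = layer_params(H, I)
p_out = outlier_params(H, k)
print(p_layer)          # 202383360
print(p_out)            # 1351680
print(p_layer / p_out)  # ratio of perturbed parameters, roughly 150
```

Under these assumed sizes, a single layer holds about 0\.2 billion parameters, so moving from the full model to one layer, and then to a small set of channels, shrinks the $q\gamma|S|$ term by large factors while leaving $2qF$ untouched\.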

#### Runtime implication\.

Assuming perturbation and update are implemented only on the selected parameters, the step\-time speedups satisfy

$$\frac{T_{\mathrm{full}}}{T_{\mathrm{layer}}} = \frac{2F + \gamma P_{\mathrm{full}}}{2F + \gamma P_{\mathrm{layer}}}, \qquad \frac{T_{\mathrm{layer}}}{T_{\mathrm{out}}} = \frac{2F + \gamma P_{\mathrm{layer}}}{2F + \gamma P_{\mathrm{out}}}. \tag{11}$$

Hence the achievable end\-to\-end speedup is always smaller than the raw parameter\-count ratio, and approaches that ratio only when perturbation/update dominates the forward cost\.

This explains why reducing trainable scope can greatly decrease the perturbation/update overhead, yet may yield only moderate wall\-clock speedup when full\-sequence forward passes already dominate total runtime\.
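The bound on the speedup follows from a short comparison\. The derivation below is a worked step added for clarity; it uses only the symbols already defined in this subsection and assumes $F > 0$, $\gamma > 0$, and $P_{\mathrm{full}} > P_{\mathrm{layer}} > 0$\. The same argument applies to the layer\-versus\-outlier ratio with $P_{\mathrm{layer}}$ and $P_{\mathrm{out}}$ in place of $P_{\mathrm{full}}$ and $P_{\mathrm{layer}}$\.

```latex
\begin{aligned}
\frac{2F + \gamma P_{\mathrm{full}}}{2F + \gamma P_{\mathrm{layer}}}
  < \frac{P_{\mathrm{full}}}{P_{\mathrm{layer}}}
&\iff (2F + \gamma P_{\mathrm{full}})\, P_{\mathrm{layer}}
      < (2F + \gamma P_{\mathrm{layer}})\, P_{\mathrm{full}} \\
&\iff 2F\, P_{\mathrm{layer}} < 2F\, P_{\mathrm{full}} \\
&\iff P_{\mathrm{layer}} < P_{\mathrm{full}}.
\end{aligned}
\qquad
\lim_{F \to 0} \frac{2F + \gamma P_{\mathrm{full}}}{2F + \gamma P_{\mathrm{layer}}}
  = \frac{P_{\mathrm{full}}}{P_{\mathrm{layer}}},
\qquad
\lim_{F \to \infty} \frac{2F + \gamma P_{\mathrm{full}}}{2F + \gamma P_{\mathrm{layer}}} = 1.
```

The two limits mark the extremes: the speedup tends to the parameter\-count ratio when forward cost is negligible, and tends to 1 when forward cost dominates\.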

Table 13: Per\-step runtime breakdown for SST2, CB, WSC, COPA, and DROP when fine\-tuning different parameter ranges\. Percentages give each component's share of the total step time\.

| Task | Param Range | Forward/step | Perturb/step | Update/step | Total/step |
|---|---|---|---|---|---|
| SST2 | full model | 0.4431s (39.07%) | 0.4849s (42.76%) | 0.2061s (18.18%) | 1.1341s |
| SST2 | single layer | 0.4411s (95.273%) | 0.0155s (3.355%) | 0.0063s (1.372%) | 0.4629s |
| SST2 | 1% mlp channel | 0.4421s (99.377%) | 0.0024s (0.537%) | 0.0004s (0.086%) | 0.4449s |
| SST2 | Sparse-MeZO | 0.4454s (46.96%) | 0.3714s (39.16%) | 0.1317s (13.89%) | 0.9485s |
| CB | full model | 2.9190s (80.82%) | 0.4867s (13.48%) | 0.2061s (5.7%) | 3.6118s |
| CB | single layer | 2.8815s (99.17%) | 0.0178s (0.61%) | 0.0065s (0.22%) | 2.9057s |
| CB | 1% mlp channel | 2.8811s (99.68%) | 0.0088s (0.3%) | 0.0004s (0.02%) | 2.8903s |
| CB | Sparse-MeZO | 2.8469s (84.54%) | 0.3842s (11.41%) | 0.1365s (4.05%) | 3.3676s |
| WSC | full model | 1.0724s (60.71%) | 0.4858s (27.50%) | 0.2083s (11.79%) | 1.7665s |
| WSC | single layer | 1.0766s (97.94%) | 0.0161s (1.46%) | 0.0066s (0.60%) | 1.0992s |
| WSC | 1% mlp channel | 1.0732s (99.55%) | 0.0044s (0.41%) | 0.0004s (0.04%) | 1.0781s |
| WSC | Sparse-MeZO | 1.0623s (67.67%) | 0.3775s (24.05%) | 0.1301s (8.29%) | 1.5699s |
| COPA | full model | 0.1764s (20.27%) | 0.4856s (55.81%) | 0.2081s (23.91%) | 0.8701s |
| COPA | single layer | 0.1696s (88.17%) | 0.0163s (8.49%) | 0.0064s (3.34%) | 0.1923s |
| COPA | 1% mlp channel | 0.1691s (96.70%) | 0.0053s (3.05%) | 0.0004s (0.25%) | 0.1749s |
| COPA | Sparse-MeZO | 0.1655s (25.91%) | 0.3490s (54.64%) | 0.1242s (19.45%) | 0.6387s |
| DROP | full model | 5.8392s (89.38%) | 0.4852s (7.43%) | 0.2084s (3.19%) | 6.5328s |
| DROP | single layer | 5.8273s (99.62%) | 0.0158s (0.27%) | 0.0066s (0.11%) | 5.8496s |
| DROP | 1% mlp channel | 5.8334s (99.94%) | 0.0029s (0.05%) | 0.0004s (0.01%) | 5.8368s |
| DROP | Sparse-MeZO | 5.9472s (92.19%) | 0.3707s (5.75%) | 0.1334s (2.07%) | 6.4512s |

Table [13](https://arxiv.org/html/2606.05516#A2.T13) grounds the analytical cost model in measured step\-time statistics\.
Across all tasks, restricting the trainable scope leaves the forward\-pass cost nearly unchanged, since the full model must still be executed to obtain the loss\. The major difference lies in perturbation and update overhead, which shrinks dramatically for single\-layer and 1% MLP\-channel settings\. Sparse\-MeZO falls between these extremes: it reduces perturbation overhead relative to full\-model MeZO, but still incurs substantially more optimization\-side cost than dominant\-layer or channel\-restricted tuning\. These measurements clarify why scope restriction can deliver substantial end\-to\-end savings while still falling short of the raw parameter\-count reduction when forward evaluation dominates runtime\.
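The relationship between forward share and achievable speedup can be read directly off the measurements\. The short sketch below copies the Total/step values and the full\-model forward shares from Table 13 and computes the full\-model to single\-layer step\-time ratio per task; the variable names are illustrative, and no numbers beyond those in the table are introduced\.

```python
# Total/step in seconds, copied from Table 13: (full model, single layer).
total_per_step = {
    "SST2": (1.1341, 0.4629),
    "CB": (3.6118, 2.9057),
    "WSC": (1.7665, 1.0992),
    "COPA": (0.8701, 0.1923),
    "DROP": (6.5328, 5.8496),
}

# Forward share of the full-model step in percent, also from Table 13.
forward_share = {"SST2": 39.07, "CB": 80.82, "WSC": 60.71, "COPA": 20.27, "DROP": 89.38}

# Step-time speedup of single-layer tuning over full-model tuning.
speedup = {task: full / layer for task, (full, layer) in total_per_step.items()}

# List tasks from largest to smallest speedup next to their forward share.
for task in sorted(speedup, key=speedup.get, reverse=True):
    print(f"{task}: {speedup[task]:.2f}x (forward share {forward_share[task]:.2f}%)")
```

On these numbers the ordering by speedup is the exact reverse of the ordering by forward share: COPA, where the forward pass takes only about 20% of the full\-model step, gains the most \(about 4\.52x, which appears to correspond to the headline speedup quoted in the abstract\), while DROP, where the forward pass takes about 89%, gains the least\.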
